Nelson Lab - Model Based Tracking

Supplementary material for

MacIver, M. A., Nelson, M. E. (2000). Body modeling and model-based tracking for neuroethology. Journal of Neuroscience Methods, 95(2): 133-143.

Links to equipment suppliers

Urethane Casting Material: BJB Enterprises
Silicone rubber mold material: Rhodia Silicones
3D Digitizer: MicroScribe 3DX
3D Modeling software: Rhinoceros
Numerical computation software: MATLAB
SMPTE Time Coder: Burst Electronics
Infrared Diodes: Everlight Electronics
CCD Camera #1: Sony
CCD Camera #2: Sanyo
SVHS VTR: Panasonic
Videodigitizing System: Avid
3D Virtual Reality System: EVL
... Commercially available from Fakespace Systems Inc.
Virtual Director, time-series visualization software for the CAVE
Stereolithography apparatus: 3D Systems

Additional Useful Resources

Casting and moldmaking supplies: Burman Foam
Video Resolution Chart, EPS
Video Resolution Chart With Point Resolution Additions, EPS
Video Resolution Chart Instructions

Methods for making a surface model of an animal

Obtaining a quantitative representation of a surface involves measuring coordinate values of points on the surface and constructing a best fit surface model that passes near those points.

One approach is to coat a cast of the object with a mold release agent and embed it in a rectangular block of rigid casting compound. The block containing the embedded cast is then sliced with a thin-kerf bandsaw, and the cast slices are pushed out. The resulting cross-section negatives are scanned on a flatbed scanner. The images are then imported into a drawing program that allows extraction of 2-D coordinates of points on the edge of the cross-sections. The outline of the embedding block is used for registration between cross-sections. Knowledge of the thickness of each slice allows reconstruction of the longitudinal dimension for a 3-D surface model.
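The slice-stacking step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the paper: the function name, the use of one block corner per slice as the registration reference, and the optional kerf allowance are all assumptions introduced here. It also assumes the scans differ only by a translation (no rotation between slices).

```python
def stack_cross_sections(outlines, block_corners, slice_thicknesses, kerf=0.0):
    """Combine per-slice 2-D outlines into a list of 3-D points.

    outlines          -- one list of (x, y) edge points per slice, in scanner units
    block_corners     -- one (x, y) reference corner of the embedding block per
                         slice (hypothetical registration landmark)
    slice_thicknesses -- thickness of each slice, same units as x and y
    kerf              -- material lost to the saw between slices (assumed
                         parameter; 0.0 ignores it)
    """
    points = []
    z = 0.0  # running position along the longitudinal axis
    for outline, (cx, cy), thickness in zip(outlines, block_corners, slice_thicknesses):
        z_mid = z + thickness / 2.0  # place the outline at the middle of its slice
        for x, y in outline:
            # Registration: express each point relative to the block corner.
            points.append((x - cx, y - cy, z_mid))
        z += thickness + kerf
    return points
```

The resulting point list could then be passed to whatever surface-fitting tool is in use; fitting itself is outside the scope of this sketch.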

Another option is to build a model of the organism from a set of photographs. 3-D modeling software, such as Rhinoceros, often allows you to place an image in the background as you construct the model. You build the surface by generating construction curves based on the photograph. By carefully controlling the position of the camera and photographing with scale bars, an adequate model can be made for simple body forms.

A more flexible and precise technique is to use 3-D digitizers. There are two common types of 3-D digitizer: optical and contact. Optical scanners, such as the Cyberware Model 15 (Cyberware Inc., Monterey CA USA), compute the (x,y,z) position of a dense grid (20 microns) of surface points as a laser beam is rapidly scanned and reflected from the target object. Optical scanners are significantly more expensive than contact digitizers but can be easier to use for objects with complex surfaces. Additionally, they do not require the surface of the object to be rigid. Since optical scanning requires little user interaction, laser digitizing can be outsourced to commercial scanning services. Contact digitizers, such as the MicroScribe 3DX (Immersion Corp., San Jose CA USA), consist of a stylus at the end of a multi-joint rigid arm that is touched to selected points on the surface of the object being digitized. Each joint of the digitizing arm contains sensors that measure the angle of the joint, allowing the software to compute the (x,y,z) location of the stylus. Contact digitizers of this kind have an accuracy of around 0.2 mm. Generating a model with a contact digitizer requires knowledge of the surface generation functions of the software it is connected to, and is aided by marking the rigid object with a lattice of transverse-sectional and cross-sectional lines to guide which points on the object are touched with the stylus.
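To illustrate how joint angles can be turned into a stylus position, here is a minimal forward-kinematics sketch in Python. The arm geometry is hypothetical (a rotating base plus two pitch joints and two rigid links) and is not the geometry of the MicroScribe 3DX or of any other real digitizer; the function name, parameters, and link lengths are all assumptions.

```python
import math

def stylus_position(base_yaw, shoulder_pitch, elbow_pitch, upper_len, fore_len):
    """Return the (x, y, z) stylus tip position for a simplified 3-joint arm.

    Angles are in radians. upper_len and fore_len are the lengths of the two
    links (hypothetical values, any consistent length unit).
    """
    # Work first in the vertical plane that contains the arm.
    reach = (upper_len * math.cos(shoulder_pitch)
             + fore_len * math.cos(shoulder_pitch + elbow_pitch))
    height = (upper_len * math.sin(shoulder_pitch)
              + fore_len * math.sin(shoulder_pitch + elbow_pitch))
    # Rotate that plane about the vertical axis by the base angle.
    x = reach * math.cos(base_yaw)
    y = reach * math.sin(base_yaw)
    return (x, y, height)
```

With all angles at zero the arm lies fully extended along the x axis, so the tip sits at a distance equal to the sum of the two link lengths.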

The temporal and spatial resolution of video

An understanding of the technical specifications for video resolution is needed to determine whether the spatial resolution of a video system will meet the needs of a particular application. In this appendix we provide a general technical background on video resolution and show how technical specifications are applied to estimate the spatial resolution of our infrared video system. We will restrict our discussion to the video format used in North America, often referred to as National Television System Committee (NTSC) video, but the discussion applies to other formats with minor variations. For additional technical information outside the scope of this discussion, see these books:

Poynton, C.A. (1996). A technical introduction to digital video. J. Wiley, N.Y.

Jack, K. (1993). Video Demystified: A handbook for the digital engineer. HighText Publications. Solana Beach, CA.

Young, L., Poynton, C., Schubin, M., Watkinson, J., Olson, T. (1995). Pixels, pictures, and perception: The differences and similarities between computer imagery, film, and video. SMPTE-Society of Motion Picture and Television Engineers, White Plains, N.Y.

The resolution of digitized video images is determined by contributions of each device or transformation interposed between the imaged scene and the final digitized image. This includes the CCD sensor and camera electronics, recording and playback device and media, digitizing resolution, and any post-digitization image processing such as deinterlacing.

The temporal resolution of video is nominally the frame rate, which is 29.97 frames/s for NTSC video. In the NTSC video format each video frame has 525 horizontal scan lines divided into two fields, consisting of 262.5 even and 262.5 odd scan lines. To reduce flicker the odd lines are drawn on the screen first, then the even lines are drawn. This creates an interval of 16.7 ms between an odd scan line and its adjacent even scan line. An image artifact termed motion interlace blur results from this interlacing. For example, a fish moving at 15 cm/s parallel to the scan line drawing direction will move 2.5 mm in the 16.7 ms inter-field interval. Given the scaling of our system, this results in a 3-4 pixel blurry fringe at the leading and trailing edges of the fish. Thus we deinterlace our digitized images, which eliminates interlace blur and doubles the effective frame rate to 59.94 frames/s but also reduces vertical resolution.
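The arithmetic in the example above, and the basic idea of deinterlacing, can be sketched as follows. The function names are introduced here for illustration only. The field split shown is the simplest possible form of deinterlacing (each field becomes its own half-height image) and is not necessarily the method applied to our images.

```python
NTSC_FRAME_RATE = 29.97                   # frames/s, as stated in the text
FIELD_RATE = 2 * NTSC_FRAME_RATE          # two fields per frame: 59.94 fields/s
FIELD_INTERVAL_S = 1.0 / FIELD_RATE       # about 16.7 ms between fields

def interlace_displacement_mm(speed_cm_per_s):
    """Distance an object moves during one inter-field interval, in mm."""
    return speed_cm_per_s * 10.0 * FIELD_INTERVAL_S

def split_fields(frame_rows):
    """Split a frame (a list of scan lines) into its two fields.

    Rows at indices 0, 2, 4, ... form one field and rows at 1, 3, 5, ...
    form the other. Each field has half the vertical resolution of the
    frame but represents a single instant in time.
    """
    return frame_rows[0::2], frame_rows[1::2]

if __name__ == "__main__":
    # The fish example from the text: 15 cm/s gives roughly 2.5 mm per field.
    print(round(interlace_displacement_mm(15.0), 2), "mm")
```

Treating the two fields as separate images is what doubles the effective frame rate to 59.94 images/s, at the cost of halving the number of scan lines in each image.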

Because of the scanning system used in video, vertical and horizontal spatial resolutions are determined by different factors. In general, video resolution is defined in terms of the number of black and white line pairs resolvable on the display, termed luminance resolution. It is most often specified in terms of the total number of lines (L), rather than the number of line pairs. The implied spatial scale is the height of the display. Therefore, when lines of resolution are quoted, the figure means lines per picture height (H). Thus, vertical resolution is specified as the total number of resolvable horizontal lines per picture height. For NTSC, the picture width (W) is 4/3 times the picture height. To maintain the same spatial scale for vertical and horizontal resolution, horizontal resolution is also specified as lines per picture height (L/H) rather than lines per picture width (L/W). Horizontal resolution in lines per picture height (L/H) is thus equivalent to the total number of resolvable vertical lines across the width of the display divided by the 4/3 aspect ratio.
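The conversion between the two ways of counting horizontal resolution is a single multiplication or division by the aspect ratio. A minimal sketch, with function names chosen here for illustration:

```python
NTSC_ASPECT_RATIO = 4.0 / 3.0  # picture width divided by picture height

def lines_per_width_to_lph(lines_across_width, aspect=NTSC_ASPECT_RATIO):
    """Convert resolvable vertical lines across the full width to L/H."""
    return lines_across_width / aspect

def lph_to_lines_per_width(lines_per_height, aspect=NTSC_ASPECT_RATIO):
    """Convert a horizontal resolution in L/H to lines across the full width."""
    return lines_per_height * aspect
```

For example, a horizontal resolution of 400 L/H corresponds to about 533 resolvable vertical lines across the full picture width.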

Maximum vertical resolution is limited by the number of scan lines in the video format. Although there are a total of 525 raster lines in NTSC, no more than 485 carry picture information. The subjective vertical resolution of a video image is consistently found to be less than the resolution predicted on the basis of the number of visible scan lines, in part because of the small gap between neighboring scan lines. This deviation is specified as the ratio of perceived vertical resolution (L/H) to visible scan lines (485), and is called the Kell factor. A commonly quoted value is 0.7, but this is based on non-interlaced displays. For the 2:1 interlace scanning system in NTSC video, the value is between 0.4 and 0.7, depending upon a number of factors including movement of the image. This article contains many details on the Kell factor and the difficulties of establishing resolution specifications:
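Since the Kell factor is defined as a ratio, the expected perceived vertical resolution follows directly from it. The short sketch below applies the range of values quoted above; the function name is an assumption made for illustration.

```python
NTSC_VISIBLE_SCAN_LINES = 485  # scan lines that carry picture information

def perceived_vertical_resolution(kell_factor, visible_lines=NTSC_VISIBLE_SCAN_LINES):
    """Perceived vertical resolution in L/H for a given Kell factor."""
    return kell_factor * visible_lines

if __name__ == "__main__":
    # Range quoted in the text for 2:1 interlaced NTSC video.
    for k in (0.4, 0.7):
        print(k, round(perceived_vertical_resolution(k)), "L/H")
```

Over the quoted range of 0.4 to 0.7, this gives roughly 194 to 340 L/H of perceived vertical resolution for interlaced NTSC video.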

Hsu, S. C., (1986). The Kell Factor: Past and Present. SMPTE Journal---Society of Motion Picture and Television Engineers, 95, 206-214.

Maximum horizontal resolution is limited by the total bandwidth of the video system. Typical horizontal resolutions (L/H) obtainable from commercial VCRs are 700 (Betacam), 400 (Super-VHS), and 220 (VHS).

In general, the S-VHS recording format is the best practical choice because of the high cost of Betacam recorders. In S-VHS, VHS, and some other recording formats, the luminance signal is kept separate from the hue and color saturation signal. However, standard video signals are composite, combining color and luminance signals together. This requires that the composite signal be decoded prior to recording or display using what is termed a comb filter. Comb filters are only activated when color information is detected. As comb filtering degrades the signal bandwidth to a degree that is noticeable with S-VHS (but not VHS), it is preferable to use S-video or component cabling with color video. These cabling systems have separate wires for luminance and chrominance.

When choosing a CCD camera, the higher the resolution the better the recorded signal will be, even if the resolution of the CCD exceeds that of the recording device. For example, when recording to S-VHS, better results are obtained with a camera that has higher horizontal resolution than 400 L/H. This is because the depth of modulation of the video signal is greater with a higher resolution camera. Studio cameras have horizontal resolutions of over 1,000 L/H, despite the 333 L/H limit of NTSC broadcast video.

Using standard resolution test patterns (see resolution chart PostScript files above) we measured the resolution of our system including digitization and deinterlacing to be approximately 355 L/H horizontal and 325 L/H vertical with optimal lighting. To calculate the vertical spatial resolution in L/mm, we take the vertical resolution in L/H and divide by the vertical field of view in mm. To determine the horizontal spatial resolution, we multiply the horizontal resolution (L/H) by 4/3 to obtain the L/W resolution, and divide this by the horizontal field of view in mm. Using this procedure we obtain a spatial resolution of approximately 1 L/mm in both dimensions. In our application, the 2-3 mm diameter Daphnia are representative of the minimum feature size of interest, and are just barely discriminable at this resolution under experimental lighting conditions.
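The procedure just described can be written out as a short calculation. In the sketch below, the 355 L/H and 325 L/H figures are the measured values given above, but the field-of-view dimensions are round placeholder numbers chosen only to illustrate the arithmetic; they are not the dimensions of our setup. The function names are likewise introduced here for illustration.

```python
ASPECT_RATIO = 4.0 / 3.0  # NTSC picture width divided by picture height

def vertical_lines_per_mm(vertical_lph, vertical_fov_mm):
    """Vertical spatial resolution in L/mm."""
    return vertical_lph / vertical_fov_mm

def horizontal_lines_per_mm(horizontal_lph, horizontal_fov_mm, aspect=ASPECT_RATIO):
    """Horizontal spatial resolution in L/mm.

    The L/H figure is first converted to lines across the picture width.
    """
    return horizontal_lph * aspect / horizontal_fov_mm

def lines_across_feature(lines_per_mm, feature_size_mm):
    """Number of resolvable lines spanning a feature of the given size."""
    return lines_per_mm * feature_size_mm

if __name__ == "__main__":
    # Placeholder field of view (assumed values, not from the source).
    v_fov_mm, h_fov_mm = 300.0, 400.0
    print(round(vertical_lines_per_mm(325, v_fov_mm), 2), "L/mm vertical")
    print(round(horizontal_lines_per_mm(355, h_fov_mm), 2), "L/mm horizontal")
```

At about 1 L/mm, a 2-3 mm feature spans only two or three resolvable lines, which is consistent with Daphnia being just barely discriminable.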