Figure 50: Tracking in Three Dimensions
The vision and sensor fusion techniques described in the previous chapters provide a measurement of target locations for each image frame. In its raw form, this information is of limited use for camera control for three reasons: it is imprecise due to measurement noise, it may include false-positive detections of people, and it provides no association between new measurements and previous target locations. This makes it difficult to develop smooth camera control motions from the raw measurements. In addition, the data is in polar coordinates, which complicates the task of associating the data with other sensors or devices in the room that may not share the same coordinate system. For these reasons, target measurements are converted to a global Cartesian coordinate system, associated with previously tracked targets, and used to update a filter/state estimator for each target track. Figure 50 illustrates the tracking of target positions in Cartesian coordinates. The crosshairs represent position estimates for the targets, and the ellipsoids represent the relative uncertainty of the estimated position in three dimensions.
The coordinates measured by the camera system must be transformed into Cartesian coordinates for tracking and data association. This is important for measuring the distance between targets and measurements, and for using state estimation techniques based on Newton's laws of motion. Each pixel location in a camera image represents a different azimuth and elevation with respect to the camera orientation. Adding these angles to the camera's pan and tilt position defines a line through the real world. One way to estimate the distance to the target along this line is to consider the target's size in the image and incorporate a priori knowledge about the actual size of the object. This was used for determining the range to faces, assuming that most heads are about the same size. Since this distance calculation is highly sensitive to sensor noise and variations in target size, the expected measurement error in depth is much larger than the expected error in directions orthogonal to depth. If one computes the error covariance and plots a locus of points of equal probability of being the actual target location around the measured target position, one will obtain an ellipsoid similar to those in Figure 50, with its major axis aligned along the vector pointing out from the camera. Other methods for calculating depth include the use of multiple orthogonally mounted cameras, or assuming that all of the targets lie in the same plane and measuring their position from an overhead camera.
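The conversion described above can be sketched as follows. This is a minimal illustration, not the system's actual code: it assumes a pinhole camera with a focal length expressed in pixels, a hypothetical average head width of 0.15 m, and an axis convention (azimuth measured from the x axis toward the y axis, elevation measured up from the x-y plane) that the text does not specify. All function and parameter names are hypothetical.

```python
import math

def range_from_size(pixel_width, focal_px, real_width_m=0.15):
    """Pinhole-model range estimate from apparent size.

    real_width_m is an assumed average head width, not a value from the text.
    """
    return focal_px * real_width_m / pixel_width

def pixel_to_angles(u, v, cx, cy, focal_px, pan, tilt):
    """Add the pixel's angular offset to the camera pan/tilt (radians).

    Assumes image rows grow downward, so a larger v lowers the elevation.
    """
    azimuth = pan + math.atan2(u - cx, focal_px)
    elevation = tilt - math.atan2(v - cy, focal_px)
    return azimuth, elevation

def polar_to_cartesian(azimuth, elevation, rng):
    """Convert (azimuth, elevation, range) to Cartesian (x, y, z)."""
    x = rng * math.cos(elevation) * math.cos(azimuth)
    y = rng * math.cos(elevation) * math.sin(azimuth)
    z = rng * math.sin(elevation)
    return x, y, z

# A face 30 pixels wide at the image center of a camera with f = 600 px.
az, el = pixel_to_angles(320, 240, 320, 240, 600.0, pan=0.0, tilt=0.0)
position = polar_to_cartesian(az, el, range_from_size(30, 600.0))
```

Because the range is inversely proportional to the measured pixel width, a one-pixel error in a small face region changes the range estimate far more than it changes either angle, which is the source of the elongated uncertainty ellipsoid.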
For the tracking system to perform properly, the most likely measured potential target location should be used to update the target's state estimator. This is generally known as the data association problem. The probability of the given measurement being correct is a distance function between the predicted state of the target and the measured state. Note that state is not limited to position; it may also consist of features such as color. This becomes especially important for targets that may come close to or cross one another, such as people. Popular association algorithms for single-target applications are based on the following schemes:
For multiple sensors and multiple targets, the problem becomes increasingly complex. Common association algorithms are:
Some comparisons of target tracking data association techniques are provided by Drummond, Bar-Shalom and Fortmann, and Deb et al.
For tracking the positions of people walking around in a room for applications such as surveillance, image differencing is used to segment the person's body from the background, and the measured position is determined from room and camera geometry. For data association, object position as well as color is exploited as a state variable. This allows objects to cross directly in front of one another without losing track of which is which after they separate. One may measure the color histogram difference, H(I,M), between each new measured object and the previously detected target data using Swain and Ballard's histogram intersection technique:
where Ij is the jth color histogram bin of an object in the current frame, and Mj is the jth color bin of a tracked object in the previous frame. Each color histogram currently uses 64 bins, which provides sufficient resolution to differentiate most colored objects, such as people's clothing, from one another.
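As an illustration, the normalized histogram intersection as defined by Swain and Ballard is the sum over bins of min(Ij, Mj) divided by the sum of Mj. The sketch below implements that definition; whether the system uses exactly this normalization is not confirmed by the text. The layout of the 64 bins as 4 x 4 x 4 quantized RGB levels is an assumption made only for this example, and the function names are hypothetical.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Build a histogram of 8-bit RGB pixels.

    Assumes bins_per_channel**3 bins (4**3 = 64); the text gives only the
    total bin count, not the color space or the bin layout.
    """
    hist = [0] * bins_per_channel ** 3
    step = 256 // bins_per_channel
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[index] += 1
    return hist

def histogram_intersection(current, model):
    """Swain-Ballard intersection: sum_j min(I_j, M_j) / sum_j M_j."""
    if len(current) != len(model):
        raise ValueError("histograms must have the same number of bins")
    total = sum(model)
    if total == 0:
        return 0.0
    return sum(min(i, m) for i, m in zip(current, model)) / total

red_shirt = color_histogram([(250, 10, 10)] * 50)
blue_shirt = color_histogram([(10, 10, 250)] * 50)
same = histogram_intersection(red_shirt, red_shirt)       # identical objects
different = histogram_intersection(blue_shirt, red_shirt)  # no shared bins
```

An intersection near one indicates similar color content and a value near zero indicates dissimilar content, so a difference measure can be taken as one minus the intersection.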
In order to obtain a distance metric for data association that incorporates both the histogram intersection and position difference, we calculate the joint probability of these two measurements. Let us define Xi,j as the event that a detected object i is actually the previous object j, Yi,j as the value of the histogram intersection between objects i and j, and Zi,j as the distance between the position of object i and the predicted position of object j. One may express the probability of a correct match conditioned on the statistically independent measures of color and position as
This probability may be incorporated into association/tracking algorithms such as nearest-neighbor, joint probabilistic data association, and multi-hypothesis track splitting. For person-tracking, the color/position metric has been found to be good enough for a simple winner-take-all nearest-neighbor data association scheme to suffice. If one assumes equal prior probabilities for all Xi,j, one may simplify the nearest neighbor decision process to one that seeks to maximize the value Fy(Yi,j)Fz(Zi,j), where
Fy(Yi,j) and Fz(Zi,j) are monotonically decreasing functions of the color histogram intersection and position differences, respectively. These functions may be generated by taking statistics on typical color differences and position distances between the same and different objects in successive frames. By fitting the data to a parameterized model, such as a decaying exponential curve, a general matching function with appropriate weights for color and position may be obtained.
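A minimal sketch of such a matching function is given below, assuming decaying exponentials for both terms. The color difference is taken here as one minus the histogram intersection, and the two scale constants are hypothetical placeholders for values that would be fitted to measured statistics; none of these choices is taken from the text.

```python
import math

def match_score(color_diff, pos_dist, color_scale=0.2, pos_scale=0.5):
    """Product Fy * Fz using decaying exponentials.

    color_scale and pos_scale are hypothetical weights; in practice they
    would be fitted to frame-to-frame statistics.
    """
    f_y = math.exp(-color_diff / color_scale)
    f_z = math.exp(-pos_dist / pos_scale)
    return f_y * f_z

def best_match(candidates):
    """Winner-take-all choice among (track_id, color_diff, pos_dist) tuples."""
    return max(candidates, key=lambda c: match_score(c[1], c[2]))[0]

# Track "a" matches well in color but is 2 m away; track "b" is close by.
winner = best_match([("a", 0.1, 2.0), ("b", 0.3, 0.2)])
```

With these placeholder scales, the nearby track wins even though its color match is somewhat worse, which shows how the relative weights trade color against position.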
When tracking faces, similar methods may be used for data association. However, since most faces are quite similar in color, and intensity may vary as the person moves through different lighting angles, position is the most reliable state variable. For this dissertation project, only position information was used for data association between detected faces and targets being tracked. Using the estimated measurement error covariance in polar coordinates, the covariance matrix was transformed to Cartesian coordinates (linear assumptions were made due to very small angle errors) and the Mahalanobis distance was used to calculate the distance between new measurements and previously tracked targets. For this project, a simple nearest-neighbor assignment policy was used for target measurement updates.
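The covariance transformation and distance calculation can be sketched as follows. This is an illustration under stated assumptions: the polar errors in range, azimuth, and elevation are treated as uncorrelated, the linearization uses the Jacobian of the polar-to-Cartesian mapping, the axis convention is the hypothetical one with x pointing out from the camera at zero pan and tilt, and only the measurement covariance enters the distance. The helper names are not from the original system.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def polar_cov_to_cartesian(rng, az, el, var_r, var_az, var_el):
    """Linearized transform J * diag(var_r, var_az, var_el) * J^T."""
    ca, sa, ce, se = math.cos(az), math.sin(az), math.cos(el), math.sin(el)
    jac = [[ce * ca, -rng * ce * sa, -rng * se * ca],
           [ce * sa, rng * ce * ca, -rng * se * sa],
           [se, 0.0, rng * ce]]
    polar_cov = [[var_r, 0.0, 0.0], [0.0, var_az, 0.0], [0.0, 0.0, var_el]]
    return matmul(matmul(jac, polar_cov), transpose(jac))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def mahalanobis(measured, predicted, cov):
    """sqrt(d^T cov^-1 d) for d = measured - predicted."""
    diff = [m - p for m, p in zip(measured, predicted)]
    return math.sqrt(sum(d * w for d, w in zip(diff, solve(cov, diff))))

# Target 3 m straight ahead: 0.5 m range sigma, 0.01 rad angular sigma.
cov = polar_cov_to_cartesian(3.0, 0.0, 0.0, 0.25, 1e-4, 1e-4)
depth_error = mahalanobis((3.5, 0.0, 0.0), (3.0, 0.0, 0.0), cov)
lateral_error = mahalanobis((3.0, 0.03, 0.0), (3.0, 0.0, 0.0), cov)
```

In this example a 0.5 m error in depth and a 0.03 m lateral error both lie one standard deviation from the prediction, so the Mahalanobis distance treats them as equally plausible, whereas a Euclidean distance would penalize the depth error far more heavily.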
In target tracking applications, the most popular methods for updating target positions incorporate variations of the Kalman filter/state estimator [117,118,119]. The Kalman filter assumes that the dynamics of the target can be modeled, and that noise affecting the target dynamics and sensor data is stationary and zero mean. In cases where the target is actively maneuvering, the plant disturbance is not zero mean, and the performance of the Kalman filter degrades. To compensate, it is important to minimize sensor noise, such that the sensor data gains will be higher and the reliance on the model dynamics will be reduced. This is of considerable importance when tracking people, whose erratic movements are poorly matched to any model of more than second order.
Since the face is attached to the rest of the human body, its dynamics are directly related to those of the body. Head movements may be thought of as zero-mean random fluctuations with respect to the body position, so the same state model used for tracking the entire body has also been used for tracking head position. A human being has a complex locomotion system, which poses serious challenges in modeling. Rather than analyzing properties of the human gait, one may assume a drastically simplified model of the target as a mass under the influence of two forces: average leg force and friction. Leg force is used to push the body into motion, and friction comes from the inefficiency of the human body maintaining this motion. Intra-stride dynamics and rotation of the body, assuming that the body may move in any direction, are constrained only by leg force and inertia. A block diagram of the target model is given in Figure 51.
Figure 51: Model of Target Dynamics
This model is generic for any target mass with linear friction and force response. Actual forces and movements of the system are in three dimensions, but for now only one dimension of motion, x, will be considered. A target mass of 72.57 kg (160 lbs) and a friction coefficient of 100 N/(m/s) were assumed. This gives the continuous state-space matrices
For the discrete computations of the Kalman filter, one needs a discrete state-space representation of the target, calculated for a sample rate of 5 Hz as:
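For a mass with linear friction driven by a force, the matrices can be computed as in the sketch below. It assumes the state is ordered as [position, velocity], that position is the measured output, and that the discretization uses a zero-order hold on the input over the 0.2 s sample period. These are assumptions of this example, so the resulting numbers may be arranged differently from the matrices in the original equations.

```python
import math

def continuous_model(mass=72.57, friction=100.0):
    """State [position, velocity]: mass * dv/dt = force - friction * v."""
    A = [[0.0, 1.0], [0.0, -friction / mass]]
    B = [[0.0], [1.0 / mass]]
    C = [[1.0, 0.0]]
    return A, B, C

def discretize_zoh(mass=72.57, friction=100.0, period=0.2):
    """Closed-form zero-order-hold discretization of the model above."""
    a = friction / mass
    decay = math.exp(-a * period)   # velocity retained after one sample
    reach = (1.0 - decay) / a       # position gained per unit velocity
    Ad = [[1.0, reach], [0.0, decay]]
    Bd = [[(period - reach) / friction], [(1.0 - decay) / friction]]
    return Ad, Bd

A, B, C = continuous_model()
Ad, Bd = discretize_zoh()   # 5 Hz sample rate, period = 0.2 s
```

A quick sanity check on the discrete model is that a constant force u leads to a steady-state velocity of u divided by the friction coefficient, the same as in the continuous model.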
The objective of using this model is to remove measurement noise with a Kalman filter/state estimator. The Kalman filter used in this system is identical to a current observer, except that the feedback gain vector G varies with time, and is calculated to provide optimally low error. This optimal solution incorporates the target model, state disturbances, and estimates of sensor noise variance. Figure 52 shows the model of the target including the state disturbance noise, W(k), and the sensor noise, V(k).
Figure 52: Noise Entering the Plant
Note that in this application, the input U(k), or leg force, cannot be measured by the system, and must be treated as part of the disturbance. Disturbance models for the Kalman filter assume stationary, zero mean, Gaussian noise distributions. While the human leg force may not be stationary, the Kalman filter may still compensate for its effects. For indoor applications, it is rare that a human being will accelerate faster than 0.3 m/s^2. For the 72.57 kg model, this would require a force of about 21.8 N (72.57 kg x 0.3 m/s^2 = 21.77 N). Thus the variance of the input force is estimated to be approximately (21.77 N)^2 = 474 N^2. For visual sensor noise, sensor covariance changes depending on the position and range of the speaker, and is recalculated for every target during each frame. The covariance of W(k), assumed to be constant, shall be assigned to the variable Rw, and the covariance of V(k) shall be assigned to the variable Rv(k).
This Kalman filter is based on a current observer state estimator that provides an estimate, q(k), of the current system state x(k), as well as a prediction, q~(k+1), of the state at sample k+1. From , the filter equations are
Kalman filter design develops the observer sensor feedback matrix G(k) such that the values of G(k) lead to an optimal estimator, where the expected values of the squared estimation errors are minimized. The determination of G(k) is recursive, and must be calculated at run-time for this application since the sensor covariance Rv(k) changes depending on target position. From , the following equations are used to find G(k):
Where M(k) is the covariance of the prediction errors, P(k) is the covariance of the estimation errors, and B1 = B. When a new target is detected and its tracked path is initialized, the values of q(k) and q~(k) are set equal to the current sensor measurement and M(k) is set equal to the identity matrix.
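One step of the recursion can be sketched for the single-axis model as follows. The sketch uses the standard current-observer form of the Kalman filter with B1 = B, which is assumed (not confirmed) to match the original equations: G(k) = M(k)C'[C M(k) C' + Rv(k)]^-1, P(k) = M(k) - G(k) C M(k), and M(k+1) = A P(k) A' + B Rw B'. It also assumes a [position, velocity] state with position measured, a zero-order-hold discretization, an initial velocity of zero, and an example sensor variance of 0.04 m^2.

```python
import math

def make_model(mass=72.57, friction=100.0, period=0.2):
    """Discrete single-axis model (assumed zero-order hold)."""
    decay = math.exp(-friction / mass * period)
    reach = (1.0 - decay) * mass / friction
    A = [[1.0, reach], [0.0, decay]]
    B = [(period - reach) / friction, (1.0 - decay) / friction]
    return A, B

def kalman_step(q_pred, M, y, Rv, A, B, Rw):
    """Update with measurement y, then predict the next sample.

    With C = [1, 0], the term C M C' reduces to M[0][0].
    """
    s = M[0][0] + Rv
    G = [M[0][0] / s, M[1][0] / s]
    innovation = y - q_pred[0]
    q = [q_pred[0] + G[0] * innovation, q_pred[1] + G[1] * innovation]
    P = [[M[i][j] - G[i] * M[0][j] for j in range(2)] for i in range(2)]
    AP = [[sum(A[i][k] * P[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    M_next = [[sum(AP[i][k] * A[j][k] for k in range(2)) + Rw * B[i] * B[j]
               for j in range(2)] for i in range(2)]
    # The leg force is unmeasured, so the prediction uses A q(k) only.
    q_next = [A[0][0] * q[0] + A[0][1] * q[1], A[1][0] * q[0] + A[1][1] * q[1]]
    return q, q_next, M_next, G

A, B = make_model()
q_pred, M = [2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]   # first measurement, M = I
estimates = []
for y in [2.0, 2.1, 1.9, 2.05]:
    q, q_pred, M, G = kalman_step(q_pred, M, y, Rv=0.04, A=A, B=B, Rw=474.0)
    estimates.append(q[0])
```

Because Rv(k) appears in the denominator of the gain, a noisier measurement (for example, a distant face) automatically receives less weight relative to the model prediction.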
If the measurement error for each dimension of movement (x, y, and z) were statistically independent, then a separate Kalman filter state estimator could be used for each dimension. Unfortunately, the errors are not statistically independent, primarily because errors in depth perception in polar coordinates cause a measurement error that typically lies along a diagonal when transformed to Cartesian coordinates. For this reason, the three dimensions must be combined into a single state vector, increasing the size of the vectors and matrices that make up the filter. The new state representation becomes:
With this increase in dimensionality, P(k) and M(k) become 6 x 6 matrices, and Rw and Rv(k) become 3 x 3 matrices.
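The dimensions stated above can be reproduced by stacking three copies of the single-axis model, as in the sketch below. The state ordering [x, vx, y, vy, z, vz], the zero-order-hold discretization, and the use of the same 474 N^2 force variance independently on each axis are assumptions of this example; the original state representation may order its elements differently.

```python
import math

def per_axis_model(mass=72.57, friction=100.0, period=0.2):
    """Single-axis discrete model (assumed zero-order hold)."""
    decay = math.exp(-friction / mass * period)
    reach = (1.0 - decay) * mass / friction
    Ad = [[1.0, reach], [0.0, decay]]
    Bd = [[(period - reach) / friction], [(1.0 - decay) / friction]]
    C = [[1.0, 0.0]]
    return Ad, Bd, C

def block_diag3(block):
    """Place three copies of a matrix along the diagonal."""
    rows, cols = len(block), len(block[0])
    out = [[0.0] * (3 * cols) for _ in range(3 * rows)]
    for axis in range(3):
        for i in range(rows):
            for j in range(cols):
                out[axis * rows + i][axis * cols + j] = block[i][j]
    return out

Ad, Bd, C = per_axis_model()
A6 = block_diag3(Ad)   # 6 x 6 state transition
B6 = block_diag3(Bd)   # 6 x 3 disturbance input
C6 = block_diag3(C)    # 3 x 6 position measurement
Rw = [[474.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
```

The model matrices remain block diagonal; the coupling between axes enters only through the full 3 x 3 measurement covariance Rv(k), which is why the three filters cannot be run separately.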
Figure 53: Tracking a Person Walking
Figure 54: Position Estimate while Tracking a Face
Figure 55: Target Tracking of a Face in 3D
Figure 53 shows the Kalman filter tracking a person walking. The rectangle shows the bounding box around the pixels extracted from the background, and the crosshairs show the projection of the estimated target state onto the current view. Here, color histogram data and position are used to associate regions measured by the vision system with active tracks. New tracks are created when new visual targets do not match any existing track with a probability above an acceptable threshold, and tracks that have not been matched to new target data within an arbitrary time limit are deleted. An overview of applicable track initiation and pruning schemes is described in . Figure 54 shows a screenshot of the computer program while tracking a face. The rectangle around the face indicates the largest region of "noisy face pixels," the red crosshairs mark the position estimate from the Kalman filter, and the green vertical line indicates the angle of peak sound intensity. Kalman filter output results for tracking a face in three dimensions are given in Figure 55. The Kalman filter suppresses sensor noise as well as occasional incorrect association of targets.
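The track creation and deletion policy described above can be sketched as a greedy nearest-neighbor loop. This is an illustration only: the score function, the acceptance threshold, and the limit on consecutive missed frames are hypothetical parameters, the time limit is counted in frames, and the matched position is simply overwritten where the real system would update a Kalman filter.

```python
import itertools
from dataclasses import dataclass

_track_ids = itertools.count()   # source of unique track identifiers

@dataclass
class Track:
    track_id: int
    position: tuple
    missed: int = 0   # consecutive frames without a matching measurement

def update_tracks(tracks, measurements, score_fn, min_score, max_missed):
    """Associate measurements greedily, then create and prune tracks."""
    matched = set()
    new_tracks = []
    for meas in measurements:
        free = [t for t in tracks if t.track_id not in matched]
        best = max(free, key=lambda t: score_fn(meas, t), default=None)
        if best is not None and score_fn(meas, best) >= min_score:
            best.position, best.missed = meas, 0
            matched.add(best.track_id)
        else:
            new_tracks.append(Track(next(_track_ids), meas))
    for t in tracks:
        if t.track_id not in matched:
            t.missed += 1
    return [t for t in tracks if t.missed <= max_missed] + new_tracks

def inverse_distance(meas, track):
    """Hypothetical score that falls off with city-block distance."""
    return 1.0 / (1.0 + sum(abs(m - p) for m, p in zip(meas, track.position)))

tracks = update_tracks([], [(0.0, 0.0)], inverse_distance, 0.5, 2)
tracks = update_tracks(tracks, [(0.1, 0.0), (5.0, 5.0)], inverse_distance, 0.5, 2)
```

After the second call, the first track has moved to the nearby measurement and a second track has been started for the distant one. Repeated calls with no measurements would eventually remove both.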
The target tracking methods described in this chapter allow noisy sensor measurements to be combined into more reliable estimates of target positions. The model of human motion dynamics in Cartesian coordinates provides the basis for filtering and smoothing the sensor data. Target position data may then be used for camera control or for intelligent room applications. However, the questions of which target to follow and how to look for targets still remain. In the next chapter, the foundations for behavior-based control will be presented.