When combined with acoustic speech information, visual speech information (lip movement) significantly improves Automatic Speech Recognition (ASR) in acoustically noisy environments. Previous research has demonstrated that the visual modality is a viable tool for identifying speech. However, visual information has yet to be utilized in mainstream ASR systems due to the difficulty of accurately tracking lips in real-world conditions. This paper presents our current progress in tracking the face and lips in visually challenging environments. Findings suggest the mean shift algorithm performs poorly for small regions, in this case the lips, but achieves nearly 80% accuracy for facial tracking.
An example of the scaling algorithm is provided in Fig. 2. Fig. 2(a) displays the frame from which the model ROI was selected, while (b) provides an example of an increase in scale and (c) a decrease in scale. Note the offset in (b); this was caused by the subject's face moving just past the edge of the frame, partially occluding the face.
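One common way to adapt a tracking window's scale, in the spirit of the behavior shown in Fig. 2, is the CAMShift-style heuristic: resize the window around its center in proportion to the square root of the likelihood mass it contains. This is an illustrative sketch only, not the paper's scaling algorithm; the function name, the gain constant `k`, and the square window shape are our own assumptions.

```python
import numpy as np

def adapt_scale(weights, window, k=2.5):
    """CAMShift-style scale update (an illustration, not the paper's
    algorithm): resize the (x, y, w, h) window around its center in
    proportion to the square root of the likelihood mass inside it."""
    x, y, w, h = window
    patch = weights[y:y + h, x:x + w]
    m00 = patch.sum()                 # zeroth moment: total likelihood mass
    if m00 <= 0:
        return window                 # nothing to adapt to; keep old window
    side = max(1, int(round(k * np.sqrt(m00 / np.pi))))   # heuristic size
    cx, cy = x + w / 2.0, y + h / 2.0
    return (int(round(cx - side / 2.0)), int(round(cy - side / 2.0)),
            side, side)
```

A larger target (e.g., a subject approaching the camera) produces more likelihood mass under the window, so the window grows; a receding subject shrinks it.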
Finally, an example of the lip tracking implementation can be seen in Fig. 3(a). This displays the tracking results for a subject with minimal spatial velocity and near the camera (increased lip ROI size); note the successful tracking regardless of whether the mouth is open or closed. Compare this with the example in Fig. 3(b), in which the reduced size of the lip candidate ROI and the subject's spatial velocity cause the algorithm to select a local mode that does not correspond to the lip model ROI; however, as can be seen in Fig. 3(c), the lip ROI has nearly recovered the target once the subject's velocity decreased.
Three examples of system success can be seen in Fig. 4. Note that the method implemented to overlay the ROIs can result in the lip ROI being copied into the face ROI; this can be seen in (c) and in no way affected the test results. The figure displays the first detected frame on the left and the last frame on the right. Of the three videos, both (a) and (b) lost no frames in detection and localization, whereas (c) required 5 frames to locate the face and lips. Of interest is the successful tracking of the lips in (a), where hair had partially occluded the lips for the preceding 6 seconds.
This paper presented a method of tracking both the face and lips using color-based features. With bounds provided by the localization algorithm, a custom model was generated to enable tracking by the mean shift algorithm. This model resulted in a dramatic decrease in storage and processing requirements compared with similar systems. Additional processing reductions were realized by utilizing the proposed scaling algorithm, which allows the model to adjust to a subject moving in 3D space.
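The color-based mean shift tracking step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the color model has already been reduced to a per-pixel likelihood ("back-projection") map, and the function name and parameters are our own. Each iteration moves the window by the mean shift vector, the offset from the window center to the weighted centroid of the likelihoods under the window, until the shift falls below a threshold.

```python
import numpy as np

def meanshift_track(weights, window, max_iters=20, eps=0.5):
    """Shift an (x, y, w, h) window toward the local mode of `weights`,
    an H x W likelihood map (e.g., a color-histogram back-projection
    of the model ROI)."""
    H, W = weights.shape
    x, y, w, h = window
    for _ in range(max_iters):
        patch = weights[y:y + h, x:x + w]
        total = patch.sum()
        if total == 0:                    # no model-colored pixels in window
            break
        ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
        dx = (xs * patch).sum() / total - (w - 1) / 2.0   # mean shift vector
        dy = (ys * patch).sum() / total - (h - 1) / 2.0
        if abs(dx) < eps and abs(dy) < eps:               # converged on mode
            break
        x = int(round(min(max(x + dx, 0), W - w)))        # step, stay in frame
        y = int(round(min(max(y + dy, 0), H - h)))
    return x, y, w, h
```

The small-region failure mode reported above follows directly from this formulation: a small lip ROI covers few pixels, so the centroid estimate is noisy and nearby local modes (e.g., similarly colored skin) can capture the window.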
Finally, a method of scaling the MS vector was proposed that reduced the required MS algorithm iterations by approximately 36%. Preliminary testing suggests that an increase in image resolution would greatly improve system performance by providing model and candidate ROIs that are much more informative. The addition of a PTZ camera would allow the scale algorithm to control the zoom, thereby increasing the resolution of the candidate ROIs.
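The effect of scaling the MS vector can be illustrated with a 1-D toy: multiplying each shift by a gain before moving the window lengthens the early steps and can cut the number of iterations needed to converge. The gain value and setup below are illustrative only and do not reproduce the paper's measured 36% reduction.

```python
import numpy as np

def meanshift_1d(weights, start, width, gain=1.0, eps=0.5, max_iters=50):
    """1-D mean shift over a likelihood profile `weights`; `gain` scales
    the shift vector each step. Returns (final position, iterations)."""
    n_total = len(weights)
    x, n = start, 0
    for n in range(1, max_iters + 1):
        seg = weights[x:x + width]
        total = seg.sum()
        if total == 0:
            break
        dx = (np.arange(width) * seg).sum() / total - (width - 1) / 2.0
        if abs(dx) < eps:                          # converged on mode
            break
        x = int(round(x + gain * dx))              # scaled mean shift step
        x = max(0, min(x, n_total - width))        # keep window in bounds
    return x, n
```

A gain modestly above 1 still converges here because each mean shift step corrects only a fraction of the remaining offset; too large a gain, however, would overshoot and oscillate around the mode.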
Source: California Polytechnic State University
Authors: Brandon Crow | Jane Xiaozheng Zhang