In videoconferencing applications, the perceived quality of the video signal is affected by the presence of an audio signal (speech). To achieve high compression rates, video coders must compromise image quality in terms of spatial resolution, grayscale resolution, and frame rate, and may introduce various kinds of artifacts. We consider tradeoffs between grayscale resolution and frame rate, and use subjective evaluations to assess the perceived quality of the video signal in the presence of speech. In particular, we explore the importance of lip synchronization. Our experiment used an original grayscale sequence at QCIF resolution, 30 frames/sec, and 256 gray levels. We compared the 256-level sequence at different frame rates with a two-level version of the sequence at 30 frames/sec. The viewing distance was 20 image heights, or roughly two feet from an SGI workstation. The speech was uncoded. To obtain the two-level sequence, we used an adaptive clustering algorithm for segmentation of video sequences. The binary sketches it creates move smoothly and preserve the main characteristics of the face, so that the face is easily recognizable. More importantly, they render lip and eye movements very accurately. The test results indicate that when the frame rate of the full grayscale sequence is low (less than 5 frames/sec), most observers prefer the two-level sequence.
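As a rough illustration of the grayscale-versus-frame-rate tradeoff, the raw (uncompressed) data rates of the compared conditions can be worked out directly. The sketch below assumes the standard QCIF luminance size of 176 x 144 pixels, 8 bits per pixel for 256 gray levels, and 1 bit per pixel for two levels. The function name `raw_bit_rate` is hypothetical, and the figures are not the coded bit rates of any particular coder.

```python
# Raw (uncompressed) data rates for the compared conditions.
# Illustrative arithmetic only; these are not coded bit rates.

QCIF_WIDTH, QCIF_HEIGHT = 176, 144  # assumed: standard QCIF luminance dimensions


def raw_bit_rate(bits_per_pixel: int, frames_per_second: float) -> float:
    """Return the uncompressed bit rate in bits/sec for a QCIF sequence."""
    return QCIF_WIDTH * QCIF_HEIGHT * bits_per_pixel * frames_per_second


conditions = [
    ("256 levels, 30 frames/sec", raw_bit_rate(8, 30)),
    ("256 levels, 5 frames/sec", raw_bit_rate(8, 5)),
    ("2 levels, 30 frames/sec", raw_bit_rate(1, 30)),
]

for label, rate in conditions:
    print(f"{label}: {rate / 1000:.1f} kbit/s")
```

Under these assumptions, the two-level sequence at 30 frames/sec has a lower raw rate than the 256-level sequence at 5 frames/sec. Rates after compression depend on the coder and the content.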