Most current audio-visual automatic speech recognition (AV-ASR) systems use static weights to balance audio and visual information during fusion. State-of-the-art research has used audio reliability metrics to adapt the fusion weights dynamically, successfully improving overall recognition results. So far, however, incorporating visual reliability metrics into these audio-reliability-based systems has not significantly improved performance. We introduce a new approach to this problem: we infer the "consistency" between the audio and visual information and leverage the existing audio reliability metrics to create a video reliability metric. Our approach is formulated in the extracted-feature space and thus does not rely on analyzing the actual video signal itself. The framework presented in this work is competitive with audio-only reliability-metric-based systems and shows promise to consistently outperform them.