The purpose of this study was to investigate the repeatability and bias of the output of two classifiers commonly used in computer-aided diagnosis for the task of distinguishing benign from malignant lesions. Classifier training and testing were performed within a bootstrap approach using a dataset of 125 sonographic breast lesions (54 malignant, 71 benign). The classifiers investigated were linear discriminant analysis (LDA) and a Bayesian neural net (BNN) with 5 hidden units. Both used the same 4 input lesion features. The .632+ bootstrap estimate of the area under the ROC curve (AUC) was used as a summary performance metric. On a by-case basis, the variability of the classifier output was used for a detailed evaluation of repeatability and bias. The LDA obtained an AUC value of 0.87 with 95% confidence interval [0.81; 0.92]. For the BNN, the corresponding values were 0.86 and [0.76; 0.93]. The classifier outputs for individual cases were more repeatable (less variable) for the LDA than for the BNN. For the LDA, repeatability was highest (variability lowest) in the middle of the range of possible outputs, whereas the BNN was least repeatable (most variable) in this region. There was a small but significant systematic bias in the LDA output, whereas the bias in the BNN output appeared to be weak. In summary, while ROC analysis suggested similar classifier performance, there were substantial differences in classifier behavior on a by-case basis. Knowledge of this behavior is crucial for the successful translation and implementation of computer-aided diagnosis in clinical decision making.
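The by-case repeatability analysis described above can be illustrated with a minimal sketch. This is not the study's code: the data are synthetic (only the case counts and the number of features match the abstract), the function names `fit_lda`, `lda_output`, and `per_case_repeatability` are hypothetical, and two choices are assumptions, namely scoring each case only when it is left out of the bootstrap sample and using the standard deviation of its outputs as the variability measure.

```python
import numpy as np


def fit_lda(X, y):
    """Fit a two-class LDA; return weight vector w and offset b."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance matrix.
    S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (len(X) - 2)
    w = np.linalg.solve(S, mu1 - mu0)
    # Offset includes the log prior odds of the training sample.
    b = -0.5 * w @ (mu0 + mu1) + np.log(len(X1) / len(X0))
    return w, b


def lda_output(X, w, b):
    """Map the discriminant score to an estimated probability of malignancy."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def per_case_repeatability(X, y, n_boot=200, seed=0):
    """Return per-case mean and standard deviation of out-of-bag outputs."""
    rng = np.random.default_rng(seed)
    n = len(y)
    outputs = np.full((n_boot, n), np.nan)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # cases not drawn this round
        w, b = fit_lda(X[idx], y[idx])
        outputs[i, oob] = lda_output(X[oob], w, b)
    mean = np.nanmean(outputs, axis=0)
    sd = np.nanstd(outputs, axis=0, ddof=1)       # larger sd = less repeatable
    return mean, sd


# Synthetic stand-in data (assumption): 54 "malignant", 71 "benign", 4 features.
data_rng = np.random.default_rng(42)
y = np.array([1] * 54 + [0] * 71)
X = data_rng.normal(size=(125, 4)) + 0.8 * y[:, None]

mean, sd = per_case_repeatability(X, y)
```

Plotting `sd` against `mean` for each case would show where in the output range a classifier is most and least repeatable, which is the kind of by-case comparison the abstract reports for the LDA and the BNN.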