Ideally, the outcome of any CAD performance assessment should predict how well the system would work if used clinically. In principle, if the selection process draws cases that are representative of the general patient population, the study design will be unbiased. In this study we explored the effect of stratified sampling on stand-alone and radiologists' performance using data from an observer study. Although our database was relatively small (50 cancer cases), no meaningful difference in performance was measured among different stratified sampling schemes or against the whole dataset, nor was there any difference in the variance of the measured performance metrics. These results cast doubt on the usefulness of requiring stratified sampling, whose added cost does not seem justifiable without empirical evidence. We believe it is more important to specify how cases should be collected than to try to define the range and frequency of the characteristics of patients and cancers to be included in the dataset, an approach we suspect is prone to producing unrealistic samples.
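For readers unfamiliar with the sampling scheme under discussion, the following is a minimal illustrative sketch (not part of the study) of proportional stratified sampling from a case list. The `stratum` key, lesion-type labels, and case structure are hypothetical placeholders, not the actual stratification variables used in the study:

```python
import random
from collections import defaultdict

def stratified_sample(cases, key, n, seed=0):
    """Draw n cases, allocating draws to each stratum in proportion
    to its size in the full case list (largest-remainder rounding)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[key(case)].append(case)
    total = len(cases)
    # Proportional allocation: real-valued quota per stratum, then
    # round down and hand leftover draws to the largest remainders.
    quotas = {s: n * len(v) / total for s, v in strata.items()}
    counts = {s: int(q) for s, q in quotas.items()}
    leftover = n - sum(counts.values())
    for s in sorted(quotas, key=lambda s: quotas[s] - counts[s],
                    reverse=True)[:leftover]:
        counts[s] += 1
    sample = []
    for s, members in strata.items():
        sample.extend(rng.sample(members, counts[s]))
    return sample

# Hypothetical example: 50 cancer cases split by lesion type.
cases = ([{"id": i, "type": "mass"} for i in range(30)]
         + [{"id": i, "type": "calcification"} for i in range(30, 50)])
subset = stratified_sample(cases, key=lambda c: c["type"], n=10)
# The 3:2 mass/calcification ratio of the full set is preserved.
```

Comparing performance metrics on such stratified subsets against the whole dataset is the kind of comparison the study performed; here the subset reproduces the strata proportions of the full case list by construction.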