A common method in the practice of large-scale auction design, e.g., in auctions placing advertisements on online media and Internet search engines, is A/B testing. In A/B testing, the auction house is running an incumbent mechanism A and would like to determine whether a novel mechanism B obtains higher revenue. This is done by splitting the traffic so that most of it goes to A and some of it, e.g., five to ten percent, goes to B. An issue with this approach is that if the bidders are unaware of which mechanism their bid will be considered in, the bid equilibrium is neither for A nor for B but for a mechanism C that is a convex combination of A and B. A miscalculation sometimes performed in practice is to compare the average revenue from A (resp. B) over the times when A (resp. B) is run. This miscalculation is equivalent to simulating A on the bids in C and can often give the opposite conclusion; e.g., if A and B are one- and two-unit highest-bids-win winner-pays-bid auctions, respectively, then B will always appear in this miscalculation to have higher revenue. For a fixed set of bids, a winner-pays-bid mechanism's revenue is monotone in its allocation probabilities. Of course, in equilibrium, increased allocation probabilities can cause reduced revenue as bidders may lower their bids.

We present an A/B testing method that applies generally to the position auction model popularized by the Varian and Edelman et al. analyses of auctions for sponsored search, now a fundamental model in the study of auction theory; e.g., see Hartline. A position auction is defined by a decreasing sequence of weights; bidders are assigned to positions in decreasing order of bids, and payments are charged. Typical payment rules are "generalized first price" and "generalized second price": the former requires bidders to pay their weighted bid, whereas the latter requires bidders to pay the weighted bid of the next highest bidder.
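As a concrete illustration of these definitions, the following sketch (the function name and interface are ours, not from the paper) computes the revenue of a position auction for a fixed bid profile under both payment rules:

```python
def position_auction_revenue(bids, weights, rule="gfp"):
    """Revenue of a position auction for a fixed bid profile.

    Bidders are assigned to positions in decreasing order of bids.
    Under "gfp" (generalized first price), the bidder in position k
    pays weights[k] times her own bid; under "gsp" (generalized
    second price), weights[k] times the next-highest bid.
    """
    ranked = sorted(bids, reverse=True)
    revenue = 0.0
    for k, w in enumerate(weights):
        if k >= len(ranked):
            break
        if rule == "gfp":
            revenue += w * ranked[k]
        else:  # gsp: pay the next-highest bid, weighted
            nxt = ranked[k + 1] if k + 1 < len(ranked) else 0.0
            revenue += w * nxt
    return revenue
```

For example, on the fixed bids (5, 3, 2), the one-unit auction (weights [1.0]) raises 5 under generalized first price while the two-unit auction (weights [1.0, 1.0]) raises 8, illustrating the monotonicity of winner-pays-bid revenue in allocation probabilities when bids are held fixed.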
(Unfortunately, our methods do not apply to the generalized second price position auction.) To revisit the example auction scenario above, notice that the mechanism C, namely the 50-50 convex combination of one- and two-unit auctions, is mathematically equivalent to a position auction with weights 1 and 0.5 for the top two positions, respectively, and zero for all remaining positions. Mechanisms A and B are also position auctions.

Our model of A/B testing is especially applicable to the following generalization of the classical problem with fixed position weights. Consider the problem Google and Bing face in selecting the layout of the sponsored results on the results page of a search query. The main variable in the layout is the number of ads to display in the mainline, i.e., above the organic search results, versus the sidebar, i.e., to the right of the organic search results. Typical layouts include between zero and three advertisements in the mainline. The weights, which correspond to click probabilities, depend on which of these four layout choices is selected. The optimal number of ads to show in the mainline is distinct for each search query, and our methods allow these layouts to be optimized individually. Notice that this setting generalizes the opening example of selecting whether to sell one or two units of an item.

Chawla et al. developed an estimator for the revenue of one position auction (A or B) from the bids of another (C). This estimator is a simple weighted order statistic of the distribution of equilibrium bids in C. To estimate the revenue of B (likewise, A): (1) for any number N, the evaluation of a formula based on the definitions of B and C gives N weights, and (2) given a sorted list of N i.i.d. samples from the bid distribution of C, the estimated revenue of B is the weighted average of these bids.
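A minimal sketch of these two ingredients, with hypothetical helper names. The step-(1) formula of Chawla et al. that produces the N per-sample weights is not reproduced here, so `estimate_revenue_weighted` shows only the averaging in step (2):

```python
def mix_weights(wa, wb, eps):
    """Position weights of C = (1 - eps)*A + eps*B (shorter list zero-padded)."""
    n = max(len(wa), len(wb))
    wa = wa + [0.0] * (n - len(wa))
    wb = wb + [0.0] * (n - len(wb))
    return [(1 - eps) * a + eps * b for a, b in zip(wa, wb)]


def estimate_revenue_weighted(bids, weights):
    """Step (2) of the estimator: a weighted average of the sorted bid samples.

    `weights` stands in for the N per-sample coefficients produced by the
    step-(1) formula, which is not reproduced here; normalizing by the total
    weight is our reading of "weighted average".
    """
    ranked = sorted(bids, reverse=True)
    return sum(w * b for w, b in zip(weights, ranked)) / sum(weights)
```

For instance, `mix_weights([1.0], [1.0, 1.0], 0.5)` returns `[1.0, 0.5]`, the weights of the mechanism C from the one-/two-unit example above.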
This method compares favorably to ideal A/B testing, in which the bids in A (resp. B) are in equilibrium for A (resp. B) and revenue can be estimated by simple averaging. The present paper improves the analysis of the estimator of Chawla et al. and applies this refined analysis to the application of A/B testing. In an A/B test where mechanism C puts ϵ probability on B (and 1 - ϵ probability on A), our error bounds have better dependence on ϵ than ideal A/B testing. In ideal A/B testing with a total of N bid samples, there are ϵN samples from B. Thus, the standard statistical error in estimating the revenue of B is Θ(1/√(ϵN)). Our error bounds depend on ϵ as log(1/ϵ), a significant improvement. In particular, we obtain good bounds even when ϵ < 1/N and the ideal A/B test is unlikely to have ever run the B mechanism. The reason for this improvement is that, while the ideal A/B test mechanism only sees ϵN bids that depend on B, all bids in our A/B test mechanism C depend a little bit on the presence of B.

Our main results are as follows:

- For arbitrary n-agent position auctions A and B, and C defined as (1 - ϵ)A + ϵB, the revenue of B can be estimated from N bids from the equilibrium bid distribution of C with absolute error bounded by O(poly(n) log(1/ϵ)/√N).
- For arbitrary n-agent position auctions A, B1, and B2, and C defined as (1 - 2ϵ)A + ϵB1 + ϵB2, N bids from the equilibrium bid distribution of C can be used to establish which of B1 and B2 has higher revenue, with error probability bounded by exp(-O(N/(poly(n) log(1/ϵ)))).
- The position auction B with position weights w1 = 1, wk = 1/2 for 1 < k < n, and wn = 0, i.e., that sells one or n - 1 items with equal probability, is a universal B-test in the following sense: for an arbitrary position auction A, and C = (1 - ϵ)A + ϵB, the revenue of any position auction can be determined from N bids of C with error bounded by O(poly(n) log(1/ϵ)/√N).
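The universal B-test of the last result is easy to write down explicitly. A short sketch (our own encoding, assuming a weight vector over n positions) verifying that it is the 50-50 mixture of the one-unit and (n - 1)-unit auctions:

```python
def universal_b_weights(n):
    """Universal B-test: w1 = 1, wk = 1/2 for 1 < k < n, and wn = 0 (n >= 2)."""
    return [1.0] + [0.5] * (n - 2) + [0.0]


def k_unit_weights(k, n):
    """Highest-bids-win k-unit auction encoded as a position auction on n slots."""
    return [1.0] * k + [0.0] * (n - k)


n = 5
# 50-50 mixture of the one-unit and (n - 1)-unit auctions, position by position.
mix = [0.5 * a + 0.5 * b
       for a, b in zip(k_unit_weights(1, n), k_unit_weights(n - 1, n))]
assert mix == universal_b_weights(n)  # [1.0, 0.5, 0.5, 0.5, 0.0]
```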