This paper presents efficient techniques for designing high-throughput, low-latency sorting units for FPGA implementation. Our sorting units use modular design techniques that hierarchically construct large sorting units from smaller building blocks. They are optimized for situations in which only the M largest numbers from N inputs are needed; this situation commonly occurs in high-energy physics experiments and other forms of digital signal processing. Based on these techniques, we design parameterized, pipelined sorting units. A detailed analysis indicates that their resource requirements scale linearly with the number of inputs, latencies scale logarithmically with the number of inputs, and frequencies remain fairly constant. Synthesis results indicate that a single pipelined 256-to-4 sorting unit with 19 stages can perform 200 million sorts per second with a latency of about 95 ns per sort on a Virtex-5 FPGA.
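As an illustration of the hierarchical idea, the following is a minimal software sketch, not the hardware design itself. It assumes a binary merge tree over small fully sorted blocks, with every node keeping only its M largest values; the function names (`merge_top_m`, `top_m`, `pipeline_latency_ns`) and the block size of 4 are hypothetical choices made for this example. The latency helper further assumes that the pipelined unit accepts one new sort per clock cycle, so that 200 million sorts per second corresponds to a 5 ns cycle.

```python
from typing import List


def merge_top_m(a: List[int], b: List[int], m: int) -> List[int]:
    """Merge two descending lists, keeping only the m largest values."""
    out: List[int] = []
    i = j = 0
    while len(out) < m and (i < len(a) or j < len(b)):
        # Take from a when b is exhausted or a's head is at least as large.
        if j >= len(b) or (i < len(a) and a[i] >= b[j]):
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out


def top_m(values: List[int], m: int, block: int = 4) -> List[int]:
    """Return the m largest values in descending order via a merge tree.

    Assumption: leaves are small blocks that are fully sorted, and each
    tree level merges neighbouring nodes while discarding all but m values.
    """
    # Leaf level: sort each small block and keep at most m values.
    level = [
        sorted(values[k:k + block], reverse=True)[:m]
        for k in range(0, len(values), block)
    ]
    # Each pass halves the number of nodes, so the depth is logarithmic in N.
    while len(level) > 1:
        merged = []
        for k in range(0, len(level), 2):
            if k + 1 < len(level):
                merged.append(merge_top_m(level[k], level[k + 1], m))
            else:
                merged.append(level[k])  # odd node passes through unchanged
        level = merged
    return level[0] if level else []


def pipeline_latency_ns(stages: int, sorts_per_second: float) -> float:
    """Latency in ns, assuming one sort enters the pipeline per clock cycle."""
    cycle_ns = 1e9 / sorts_per_second
    return stages * cycle_ns


if __name__ == "__main__":
    data = [(i * 37) % 101 for i in range(256)]
    print(top_m(data, 4))
    print(pipeline_latency_ns(19, 200e6))
```

Under these assumptions, the number of merge nodes grows linearly with N while the tree depth grows logarithmically, which mirrors the scaling behaviour stated above. With 19 stages and a 5 ns cycle, the latency helper gives approximately 95 ns, in agreement with the reported synthesis result.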