Joint optimization (JO) of source and mask together is known to produce better SMO solutions than sequential optimization of the source and the mask. However, large scale JO problems are very difficult to solve because the global impact of the source variables causes an enormous number of mask variables to be coupled together. This work presents innovation that minimize this runtime bottleneck. The proposed SMO parallelization algorithm allows separate mask regions to be processed efficiently across multiple CPUs in a high performance computing (HPC) environment, despite the fact that a truly joint optimization is being carried out with source variables that interact across the entire mask. Building on this engine a progressive deletion (PD) method was developed that can directly compute "binding constructs" for the optimization, i.e. our method can essentially determine the particular feature content which limits the process window attainable by the optimum source. This method allows us to minimize the uncertainty inherent to different clustering/ranking methods in seeking an overall optimum source that results from the use of heuristic metrics. An objective benchmarking of the effectiveness of different pattern sampling methods was performed during post-optimization analysis. The PD serves as a golden standard for us to develop optimum pattern clustering/ranking algorithms. With this work, it is shown that it is not necessary to exhaustively optimize the entire mask together with the source in order to identify these binding clips. If the number of clips to be optimized exceeds the practical limit of the parallel SMO engine one can starts with a pattern selection step to achieve high clip count compression before SMO. With this LSSO capability one can address the challenging problem of layout-specific design, or improve the technology source as cell layouts and sample layouts replace lithography test structures in the development cycle.