Intel Xeon Phi is a processor based on MIC architecture that contains a large number of compute cores with a high local memory bandwidth and 512-bit vector processing units. To achieve high performance on Xeon Phi, it is important for programmers to explore all the software features provided by the Intel compiler and libraries to fully utilize the new hardware resources. In this paper, we use the K-Means algorithm to study the performance of various Intel software settings available for Xeon Phi and their impacts to the performance of K-means. At first we examine different memory layouts for storing data points using Intel compiler-intrinsic functions. During distance calculation, the computational kernel of K-means, when the size of individual input data points is not vector-friendly, we pad the data points to align with the VPU width. At last, we implement a parallel reduction to increase memory access parallelism and cache hits. These techniques enable us to successfully take advantage of thread-level parallelism and data-level parallelism on Xeon Phi. Experimental results demonstrate large performance gains over the default auto-vectorization approach. The K-Means implemented with the proposed techniques achieves up to 68.65% and 56.14% performance improvements for aligned datasets and unaligned datasets, respectively. For high-dimensional aligned datasets, we achieved up to 53.49% performance improvement on a large-scale parallel computer.