Erratic memory access patterns make it hard to implement fast large-scale graph analysis. Although algorithms with fine-grained parallelism seem well suited to multithreading, it is unclear whether the long memory latencies of such workloads are fully masked on current systems, and if not, whether improving locality brings any performance benefit, especially when the caches are simple. We optimize several fundamental graph algorithms on a multi-threaded, multi-core platform with simple caches. Although the naive implementations scale, we show that the number of hardware threads is insufficient to fully mask memory latency for typical graph analysis workloads, leaving the processor underutilized. In optimizing for cache performance, we find that cache-friendly designs known to be effective on traditional architectures do not perform well on this platform. We therefore explore low-cost measures such as software prefetching and reorganizing the in-memory layout of the input. Our results show that, compared with the original implementations, our optimizations achieve speedups between 10% and 200% across different thread counts.