Hi, Intel support staff,
I am using Intel VTune Amplifier 2018 to diagnose nested do-loops in our Intel Fortran application. These nested do-loops take too much time, and Amplifier shows that the LLC Miss Count is very high. All Amplifier tests were run on a 2-socket Skylake Dell computer with L1 cache: 2.5MB, L2 cache: 40MB, L3 cache: 55MB, and DRAM: 191GB.
I found the following information on the Intel website:
Metric Description
The LLC (last-level cache) is the last, and longest-latency, level in the memory hierarchy before main memory (DRAM). Any memory requests missing here must be serviced by local or remote DRAM, with significant latency. The LLC Miss metric shows a ratio of cycles with outstanding LLC misses to all cycles.
Possible Issues
A high number of CPU cycles is being spent waiting for LLC load misses to be serviced. Possible optimizations are to reduce data working set size, improve data access locality, blocking and consuming data in chunks that fit in the LLC, or better exploit hardware prefetchers. Consider using software prefetchers but they can increase latency by interfering with normal loads, and can increase pressure on the memory system.
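To make the question concrete, here is a minimal sketch of what I understand "improve data access locality" and "blocking" to mean for Fortran do-loops. This is not our actual code: the program name, the array names a, b, c, the size n, and the block size bs are all hypothetical, and the block size would have to be tuned for the real cache sizes.

```fortran
program loop_locality_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n  = 512   ! hypothetical problem size
  integer, parameter :: bs = 64    ! hypothetical block size, needs tuning
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: i, j, ii, jj

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)

  ! Loop order: Fortran arrays are column-major, so the first index
  ! should vary fastest. j is the outer loop, i the inner loop, which
  ! gives unit-stride access to both a and b.
  do j = 1, n
     do i = 1, n
        b(i,j) = 2.0_dp * a(i,j)
     end do
  end do

  ! Blocking: in a transpose, one of the two arrays is always accessed
  ! with stride n. Working on bs-by-bs tiles keeps the lines touched by
  ! the strided access in cache until they are fully used.
  do jj = 1, n, bs
     do ii = 1, n, bs
        do j = jj, min(jj + bs - 1, n)
           do i = ii, min(ii + bs - 1, n)
              c(i,j) = a(j,i)
           end do
        end do
     end do
  end do

  ! Sanity checks: both differences should be exactly zero.
  print *, 'scale check     :', maxval(abs(b - 2.0_dp * a))
  print *, 'transpose check :', maxval(abs(c - transpose(a)))
end program loop_locality_sketch
```

If our real loops resemble the first pattern but with the loop order reversed, or the second pattern without tiling, I assume that would explain part of the LLC misses, but I would appreciate confirmation.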
To reduce the LLC Miss Count, or eliminate it completely, could you suggest how we should revise our nested do-loops, which are written in Intel Fortran? I look forward to your help. Thanks in advance.
If you need our test codes, I can upload them on request.
Best regards,
Dingjun