Hey all,
I'm trying a simple textbook example of false sharing in which four threads continuously write to distinct data on the same shared cache line. With the General Exploration analysis, I used to see the "Store Bound" value flagged, with high Store Latency and False Sharing child metrics.
Now, it looks like this:
Elapsed Time: 0.828s
  Clockticks: 10,659,000,000
  Instructions Retired: 5,616,800,000
  CPI Rate: 1.898
  MUX Reliability: 0.998
  Front-End Bound: 5.0%
  Bad Speculation: 6.1%
  Back-End Bound: 72.5%
    Memory Bound: 21.3%
      L1 Bound: 24.2%
      L2 Bound: N/A with HT on
      L3 Bound: N/A with HT on
      DRAM Bound: N/A with HT on
      Store Bound: 0.0%
    Core Bound: 51.3%
      Divider: 32.1%
      Port Utilization: 65.2%
        Cycles of 0 Ports Utilized: 51.3%
        Cycles of 1 Port Utilized: 41.0%
        Cycles of 2 Ports Utilized: 21.7%
        Cycles of 3+ Ports Utilized: 5.0%
  Retiring: 16.4%
  Total Thread Count: 5
  Paused Time: 0s
Note how Store Bound is at 0.0%. I can increase the workload, but that doesn't change anything. The child metrics also look OK, which makes this even more odd: Store Latency is at 76.6% but is supposedly unreliable due to MUX issues or a lack of PMU events, while the False Sharing metric is at 13.9% and is not marked as unreliable.
Can anyone explain what's going on here? The code for my test case is the following:
#include <cmath>
#include <cstddef>
#include <future>
#include <iostream>
#include <memory>
#include <thread>

struct Results
{
    virtual ~Results() = default;
    virtual unsigned int* data1() = 0;
    virtual unsigned int* data2() = 0;
    virtual unsigned int* data3() = 0;
    virtual unsigned int* data4() = 0;
};

// the four data elements are distinct but are meant to share a single cache line
struct Unaligned : Results
{
    unsigned int* data1() override { return &m_data1; }
    unsigned int* data2() override { return &m_data2; }
    unsigned int* data3() override { return &m_data3; }
    unsigned int* data4() override { return &m_data4; }

    unsigned int m_data1 = 0;
    unsigned int m_data2 = 0;
    unsigned int m_data3 = 0;
    unsigned int m_data4 = 0;
};

// the four data elements are distinct and live on separate cache lines
struct Aligned : Results
{
    unsigned int* data1() override { return &m_data1; }
    unsigned int* data2() override { return &m_data2; }
    unsigned int* data3() override { return &m_data3; }
    unsigned int* data4() override { return &m_data4; }

    unsigned int m_data1 = 0;
    alignas(64) unsigned int m_data2 = 0;
    alignas(64) unsigned int m_data3 = 0;
    alignas(64) unsigned int m_data4 = 0;
};

// volatile forces a real load and store of *result on every iteration
void do_something(volatile unsigned int* result)
{
    for (std::size_t i = 0; i < 100000000; ++i) {
        *result += std::sqrt(i);
    }
}

int main(int argc, char** /*argv*/)
{
    // when any command line argument is given, use the aligned data, otherwise
    // use the unaligned data to show the effect of false sharing
    std::unique_ptr<Results> results(argc > 1
        ? static_cast<Results*>(new Aligned)
        : static_cast<Results*>(new Unaligned));
    {
        // the futures block in their destructors, so all four threads have
        // finished when this scope ends
        const auto f1 = std::async(std::launch::async, do_something, results->data1());
        const auto f2 = std::async(std::launch::async, do_something, results->data2());
        const auto f3 = std::async(std::launch::async, do_something, results->data3());
        const auto f4 = std::async(std::launch::async, do_something, results->data4());
    }
    std::cout << "random sums: "
              << *results->data1() << ", "
              << *results->data2() << ", "
              << *results->data3() << ", "
              << *results->data4() << '\n';
    return 0;
}
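As a sanity check that is independent of the profiler, the layout assumption behind the test case can be verified directly. The following is only a rough sketch: the helper same_cache_line, the constant kAssumedLineSize and the structs Packed and Padded are hypothetical names that are not part of the test case above, and the 64-byte line size is an assumption that holds on typical x86 CPUs but is not queried from the hardware.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines (typical for x86, not queried at runtime).
constexpr std::size_t kAssumedLineSize = 64;

// Hypothetical helper: returns true when both addresses fall into the same
// line-sized, line-aligned block of memory.
inline bool same_cache_line(const volatile void* a, const volatile void* b,
                            std::size_t line = kAssumedLineSize)
{
    const auto ia = reinterpret_cast<std::uintptr_t>(a);
    const auto ib = reinterpret_cast<std::uintptr_t>(b);
    return ia / line == ib / line;
}

// Hypothetical stand-in for the "unaligned" layout: the struct itself is
// aligned to the assumed line size, so all four members share one line.
struct alignas(64) Packed
{
    unsigned int a = 0;
    unsigned int b = 0;
    unsigned int c = 0;
    unsigned int d = 0;
};

// Hypothetical stand-in for the "aligned" layout: every member starts on its
// own 64-byte boundary.
struct Padded
{
    alignas(64) unsigned int a = 0;
    alignas(64) unsigned int b = 0;
    alignas(64) unsigned int c = 0;
    alignas(64) unsigned int d = 0;
};
```

In the test case, calling same_cache_line(results->data1(), results->data4()) before starting the threads would show whether the four counters really sit on one line. The Unaligned object is heap-allocated without cache line alignment, so depending on where the allocator places it, its members could straddle a line boundary; aligning the whole struct, as Packed does here, rules that out.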