I tested the following code with the Google Benchmark framework to measure memory access latency for different array sizes:
int64_t MemoryAccessAllElements(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id++) {
        volatile int64_t ignored = data[id];
    }
    return 0;
}

int64_t MemoryAccessEvery4th(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id += 4) {
        volatile int64_t ignored = data[id];
    }
    return 0;
}
These are the results I get (the results are averaged by Google Benchmark; for big arrays there are about ~10 iterations, while much more work is performed for smaller ones):
There is a lot going on in this plot and, unfortunately, I can't explain all of the changes in the graph.
I tested this code on a single-core CPU with the following cache configuration:
CPU Caches:
L1 Data 32K (x1), 8 way associative
L1 Instruction 32K (x1), 8 way associative
L2 Unified 256K (x1), 8 way associative
L3 Unified 30720K (x1), 20 way associative
In this plot we can see many changes in the graph's behavior:
There is a spike after a 64-byte array size, which can be explained by the fact that a cache line is 64 bytes long, so with an array larger than 64 bytes we experience one more L1 cache miss (which can be categorized as a compulsory cache miss).
Also, the latency increases near the cache size boundaries, which is also easy to explain: at that point we experience capacity cache misses.
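As a side note (my own addition, not from the original post): on Linux/glibc, the cache-line size and cache sizes that these observations rely on can be queried at run time, for example:

#include <unistd.h>
#include <iostream>

int main() {
    // These _SC_* names are glibc extensions; they may return 0 or -1 on
    // platforms that do not expose the information.
    std::cout << "L1D line size:  " << sysconf(_SC_LEVEL1_DCACHE_LINESIZE) << " bytes\n";
    std::cout << "L1D cache size: " << sysconf(_SC_LEVEL1_DCACHE_SIZE)     << " bytes\n";
    std::cout << "L2 cache size:  " << sysconf(_SC_LEVEL2_CACHE_SIZE)      << " bytes\n";
}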
But there are a lot of things in the results that I can't explain:
Why does the latency for MemoryAccessEvery4th decrease after the array exceeds ~1024 bytes?
Why can we see another peak for MemoryAccessAllElements around 512 bytes? It is an interesting point, because at that size we start to access more than one set of cache lines (8 * 64 bytes in one set). But is it really caused by this, and if so, how can it be explained?
Why does the latency increase after passing the L2 cache size when benchmarking MemoryAccessEvery4th, while there is no such difference for MemoryAccessAllElements?
I've tried to compare my results with the Gallery of Processor Cache Effects and What Every Programmer Should Know About Memory, but I can't fully explain my results with the reasoning from those articles.
Can someone help me to understand the internal processes of CPU caching?
UPD:
I use the following code to measure performance of memory access:
#include <benchmark/benchmark.h>

using namespace benchmark;

// Random / NextLong are not part of Google Benchmark; they appear to be small
// helpers from the repository mentioned below (any 64-bit PRNG such as
// std::mt19937_64 would serve the same purpose).
void InitializeWithRandomNumbers(long long *array, size_t length) {
    auto random = Random(0);
    for (size_t id = 0; id < length; id++) {
        array[id] = static_cast<long long>(random.NextLong(0, 1LL << 60));
    }
}

static void MemoryAccessAllElements_Benchmark(State &state) {
    size_t size = static_cast<size_t>(state.range(0));
    auto array = new long long[size];
    InitializeWithRandomNumbers(array, size);
    for (auto _ : state) {
        DoNotOptimize(MemoryAccessAllElements(array, size));
    }
    delete[] array;
}

static void CustomizeBenchmark(benchmark::internal::Benchmark *benchmark) {
    for (int size = 2; size <= (1 << 24); size *= 2) {
        benchmark->Arg(size);
    }
}

BENCHMARK(MemoryAccessAllElements_Benchmark)->Apply(CustomizeBenchmark);
BENCHMARK_MAIN();
You can find slightly different examples in the repository, but the basic approach for the benchmark in the question is the same.
Related
I have been using https://github.com/google/benchmark and g++ 9.4.0 to check the performance of data access in different scenarios (compilation with "-O3"). The result has been surprising to me.
My baseline is accessing longs in an std::array ("reduced data"). I want to add an additional byte datum. In one case I create an additional container ("split data"), and in the other I store a struct in a single array ("combined data").
This is the code:
#include <benchmark/benchmark.h>
#include <array>
#include <cstdint>
#include <random>

constexpr int width = 640;
constexpr int height = 480;

std::array<std::uint64_t, width * height> containerWithReducedData;

std::array<std::uint64_t, width * height> container1WithSplitData;
std::array<std::uint8_t, width * height> container2WithSplitData;

struct CombinedData
{
    std::uint64_t first;
    std::uint8_t second;
};

std::array<CombinedData, width * height> containerWithCombinedData;

void fillReducedData(const benchmark::State& state)
{
    // Variable is intentionally unused
    static_cast<void>(state);

    // Generate pseudo-random numbers (no seed, therefore always the same numbers)
    // NOLINTNEXTLINE
    auto engine = std::mt19937{};
    auto longsDistribution = std::uniform_int_distribution<std::uint64_t>{};

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            const std::uint64_t number = longsDistribution(engine);
            containerWithReducedData.at(static_cast<unsigned int>(row * width + column)) = number;
        }
    }
}

std::uint64_t accessReducedData()
{
    std::uint64_t value = 0;

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            value += containerWithReducedData.at(static_cast<unsigned int>(row * width + column));
        }
    }

    return value;
}

static void BM_AccessReducedData(benchmark::State& state)
{
    // Perform setup here
    for (auto _ : state)
    {
        // Variable is intentionally unused
        static_cast<void>(_);

        // This code gets timed
        benchmark::DoNotOptimize(accessReducedData());
    }
}

BENCHMARK(BM_AccessReducedData)->Setup(fillReducedData);
void fillSplitData(const benchmark::State& state)
{
    // Variable is intentionally unused
    static_cast<void>(state);

    // Generate pseudo-random numbers (no seed, therefore always the same numbers)
    // NOLINTNEXTLINE
    auto engine = std::mt19937{};
    auto longsDistribution = std::uniform_int_distribution<std::uint64_t>{};
    auto bytesDistribution = std::uniform_int_distribution<std::uint8_t>{};

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            const std::uint64_t number = longsDistribution(engine);
            container1WithSplitData.at(static_cast<unsigned int>(row * width + column)) = number;

            const std::uint8_t additionalNumber = bytesDistribution(engine);
            container2WithSplitData.at(static_cast<unsigned int>(row * width + column)) = additionalNumber;
        }
    }
}

std::uint64_t accessSplitData()
{
    std::uint64_t value = 0;

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            value += container1WithSplitData.at(static_cast<unsigned int>(row * width + column));
            value += container2WithSplitData.at(static_cast<unsigned int>(row * width + column));
        }
    }

    return value;
}

static void BM_AccessSplitData(benchmark::State& state)
{
    // Perform setup here
    for (auto _ : state)
    {
        // Variable is intentionally unused
        static_cast<void>(_);

        // This code gets timed
        benchmark::DoNotOptimize(accessSplitData());
    }
}

BENCHMARK(BM_AccessSplitData)->Setup(fillSplitData);
void fillCombinedData(const benchmark::State& state)
{
    // Variable is intentionally unused
    static_cast<void>(state);

    // Generate pseudo-random numbers (no seed, therefore always the same numbers)
    // NOLINTNEXTLINE
    auto engine = std::mt19937{};
    auto longsDistribution = std::uniform_int_distribution<std::uint64_t>{};
    auto bytesDistribution = std::uniform_int_distribution<std::uint8_t>{};

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            const std::uint64_t number = longsDistribution(engine);
            containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).first = number;

            const std::uint8_t additionalNumber = bytesDistribution(engine);
            containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).second = additionalNumber;
        }
    }
}

std::uint64_t accessCombinedData()
{
    std::uint64_t value = 0;

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            value += containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).first;
            value += containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).second;
        }
    }

    return value;
}

static void BM_AccessCombinedData(benchmark::State& state)
{
    // Perform setup here
    for (auto _ : state)
    {
        // Variable is intentionally unused
        static_cast<void>(_);

        // This code gets timed
        benchmark::DoNotOptimize(accessCombinedData());
    }
}

BENCHMARK(BM_AccessCombinedData)->Setup(fillCombinedData);
Live demo
And this is the result:
Run on (12 X 4104.01 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.33, 1.82, 1.06
---------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
---------------------------------------------------------------------
BM_AccessReducedData           55133 ns        55133 ns        12309
BM_AccessSplitData             64089 ns        64089 ns        10439
BM_AccessCombinedData         170470 ns       170470 ns         3827
I am not surprised by the long running times of BM_AccessCombinedData. There is additional effort (compared to "reduced data") to add the bytes. My interpretation is that the added byte does not fit into the cache line anymore, which makes the access much more expensive. (Might there be even another effect?)
But why is it so fast to access different containers ("split data")? There the data is located at different positions in memory and there is alternating access to it. Shouldn't this be even slower? But it is almost three times faster than the access of the combined data! Isn't this surprising?
Preface: This answer was written only for the example/scenario you provided in your benchmark link: a summing reduction over interleaved vs non-interleaved collections of differently sized integers. Summing is an unsequenced operation. You can visit elements of the collections and add them to the accumulating result in any order. And whether you "combine" (via struct) or "split" (via separate arrays), the order of accumulation doesn't matter.
Note: It would help if you provided some information about what you already know about optimization techniques and what processors/memory are usually capable of. Your comments show you know about caching, but I have no idea what else you know, or what exactly you know about caching.
Terminology
This choice of "combined" vs "split" has other well known names:
parallel array (wikipedia article)
structure of arrays vs array of structures (wikipedia article)
This question fits into the area of concerns people have when talking about data-oriented design.
For the rest of this answer, I will stay consistent with your terminology.
Alignment, Padding, and Structs
Quoting from CppReference, the C++ language has this requirement:
Every complete object type has a property called alignment requirement, which is an integer value of type size_t representing the number of bytes between successive addresses at which objects of this type can be allocated. The valid alignment values are non-negative integral powers of two.
"Every complete object" includes instances of structs in memory. Reading on...
In order to satisfy alignment requirements of all members of a struct, padding may be inserted after some of its members.
One of its examples demonstrates:
// objects of struct X must be allocated at 4-byte boundaries
// because X.n must be allocated at 4-byte boundaries
// because int's alignment requirement is (usually) 4
struct X {
    int n;  // size: 4, alignment: 4
    char c; // size: 1, alignment: 1
    // three bytes padding
};          // size: 8, alignment: 4
This is what Peter Cordes mentioned in the comments. Because of this requirement/property/feature of the C++ language, there is inserted padding for your "combined" collection.
I am not sure if there is a significant detriment to cache performance resulting from the padding here, since the sum only visits each element of the arrays once. In a scenario where elements are frequently revisited, this is more likely to matter: the padding of the combined representation results in "wasted" bytes of the cache when compared with the split representation, and that wastage is more likely to have a significant impact on cache performance. But the degree to which this matters depends on the patterns of revisiting the data.
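To make the padding concrete, here is a small check (my own sketch, not part of the question or answer; the exact numbers depend on the target ABI, but on a typical x86-64 compiler the 9 bytes of payload are padded out to 16):

#include <cstdint>
#include <iostream>

struct CombinedData   // same layout as the struct in the question
{
    std::uint64_t first;  // size 8, alignment 8
    std::uint8_t second;  // size 1, alignment 1
    // typically 7 bytes of padding here so the struct size is a multiple of 8
};

int main()
{
    std::cout << sizeof(std::uint64_t) + sizeof(std::uint8_t) << '\n'; // 9 bytes of payload
    std::cout << sizeof(CombinedData) << '\n';                         // typically 16
    std::cout << alignof(CombinedData) << '\n';                        // typically 8
}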
SIMD
wikipedia article
SIMD instructions are specialized CPU machine instructions for performing an operation on multiple pieces of data in memory, such as summing a group of same-sized integers which are laid out next to each other in memory (which is exactly what can be done in the "split"-representation version of your scenario).
Compared to machine code that doesn't use SIMD, SIMD usage can provide a constant-factor improvement (the value of the constant factor depends on the SIMD instruction). For example, a SIMD instruction that adds 8 bytes at once should be roughly 8 times faster than a loop, rolled or unrolled, that does the same thing one byte at a time.
Other keywords: vectorization, parallelized code.
Peter Cordes mentioned relevant examples (psadbw, paddq). Here's a list of intel SSE instructions for arithmetic.
As Peter mentioned, a degree of SIMD usage is still possible in the "combined" representation, but not as much as is possible with the "split" representation. It comes down to what the target machine architecture's instruction set provides. I don't think there's a dedicated SIMD instruction for your example's "combined" representation.
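For illustration only (my own sketch, not code from the question): on x86-64 with SSE2, the byte array of the "split" representation can be summed 16 bytes at a time with psadbw (_mm_sad_epu8), one of the instructions Peter Cordes mentioned. The helper name is hypothetical and it assumes the length is a multiple of 16; a real implementation would need a scalar tail loop.

#include <cstdint>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Sums `length` bytes, 16 at a time. Assumes length % 16 == 0 for brevity.
std::uint64_t sumBytesSse2(const std::uint8_t* data, std::size_t length)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();
    for (std::size_t i = 0; i < length; i += 16)
    {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        // psadbw against zero yields two 64-bit lanes, each holding the sum of
        // 8 of the bytes; accumulate them with a 64-bit add (paddq).
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }
    // Add the two 64-bit lanes of the accumulator.
    return static_cast<std::uint64_t>(_mm_cvtsi128_si64(acc)) +
           static_cast<std::uint64_t>(_mm_cvtsi128_si64(_mm_unpackhi_epi64(acc, acc)));
}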
The Code
For the "split" representation, I would do something like:
// ...
#include <numeric>     // for `std::reduce`
#include <execution>   // for `std::execution`
#include <functional>  // for `std::plus`

std::uint64_t accessSplitData()
{
    return std::reduce(std::execution::unseq, container1WithSplitData.cbegin(), container1WithSplitData.cend(), std::uint64_t{0}, std::plus{})
         + std::reduce(std::execution::unseq, container2WithSplitData.cbegin(), container2WithSplitData.cend(), std::uint64_t{0}, std::plus{});
}
// ...
It's a much more direct way to communicate (to readers of the code and to a compiler) an unsequenced sum of collections of integers.
CppReference for std::reduce
CppReference for std::execution::<...>
Execution policies allow you to convey how an algorithm can and is desired to be performed (whether it is safe/still-correct and desirable to use SIMD or multiple threads). Many of the algorithms in the C++ standard library have a similar overload to accept an execution policy argument.
CppReference for std::plus
But What about the Different Positions?
There the data is located at different positions in memory and there is alternating access to it. Shouldn't this be even slower?
As I showed in the code above, for your specific scenario there doesn't need to be alternating access. But if the scenario were changed to require alternating access, I don't think there would usually be much cache impact on average.
There is the possible problem of conflict misses if the corresponding entries of the split arrays map to the same cache sets. I don't know how likely this is to be encountered, or whether there are techniques in C++ to prevent it; if anyone knows, please edit this answer. If a cache has N-way set associativity, and the access pattern to the "split" representation data only accesses N or fewer arrays in the hot loop (i.e. doesn't access any other memory), I believe it should be impossible to run into this.
Misc Notes
I would recommend that you keep your benchmark link in your question unchanged, and if you want to update it, add a new link, so people viewing the discussion will be able to see older versions being referred to.
Out of curiosity, is there a reason why you're not using newer compiler versions for the benchmark like gcc 11?
I highly recommend the usage I showed of std::reduce. It's a widely recommended practice to use a dedicated C++ standard algorithm, instead of a raw loop, where a suitable algorithm exists. See the reasons cited in the CppCoreGuidelines link. The code may be long (and in that sense, ugly), but it clearly conveys the intent to perform a sum where the reduction operator (plus) is unsequenced.
Your question is specifically about speed, but it's noteworthy that in C++, the choice of struct-of-array vs array-of-struct can be important where space costs matter, precisely because of alignment and padding.
There are more considerations in choosing struct-of-array vs array-of-struct that I haven't listed: memory access patterns are the main consideration for performance, but readability and simplicity are also important. You can alleviate problems by building good abstractions, but there's still a limit to that, and to the maintenance, readability, and simplicity costs of building the abstraction itself.
You may also find this video by Matt Godbolt on "Memory and Caches" useful.
I was trying to solve a coding problem in C++ which counts the number of prime numbers less than a non-negative number n.
So I first came up with some code:
int countPrimes(int n) {
    vector<bool> flag(n+1,1);
    for(int i =2;i<n;i++)
    {
        if(flag[i]==1)
            for(long j=i;i*j<n;j++)
                flag[i*j]=0;
    }
    int result=0;
    for(int i =2;i<n;i++)
        result+=flag[i];
    return result;
}
which takes 88 ms and uses 8.6 MB of memory. Then I changed my code into:
int countPrimes(int n) {
    // vector<bool> flag(n+1,1);
    bool flag[n+1];
    fill(flag,flag+n+1,true);
    for(int i =2;i<n;i++)
    {
        if(flag[i]==1)
            for(long j=i;i*j<n;j++)
                flag[i*j]=0;
    }
    int result=0;
    for(int i =2;i<n;i++)
        result+=flag[i];
    return result;
}
which takes 28 ms and 9.9 MB. I don't really understand why there is such a performance gap in both running time and memory consumption. I have read related questions like this one and that one, but I am still confused.
EDIT: I reduced the running time to 40 ms with 11.5 MB of memory after replacing vector<bool> with vector<char>.
std::vector<bool> isn't like any other vector. The documentation says:
std::vector<bool> is a possibly space-efficient specialization of
std::vector for the type bool.
That's why it may use up less memory than an array, because it might represent multiple boolean values with one byte, like a bitset. It also explains the performance difference, since accessing it isn't as simple anymore. According to the documentation, it doesn't even have to store it as a contiguous array.
std::vector<bool> is a special case. It is a specialized template: each value is stored in a single bit, so bit operations are needed. This is compact in memory but has a couple of drawbacks (like there being no way to have a pointer to a bool inside this container).
Now for bool flag[n+1]; the compiler will usually allocate the memory in the same manner as for char flag[n+1];, and it will do that on the stack, not on the heap.
Now, depending on page sizes, cache misses and i values, one can be faster than the other. It is hard to predict (for small n the array will be faster, but for larger n the result may change).
As an interesting experiment, you can change std::vector<bool> to std::vector<char>. In this case you will have a similar memory layout as with the array, but it will be located on the heap rather than the stack.
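A minimal illustration of the proxy point above (my own snippet, not from the answer): element access on std::vector<bool> yields a proxy object rather than a bool&, which is why you cannot form a pointer to an element.

#include <vector>
#include <type_traits>

int main()
{
    std::vector<bool> bits(8, true);
    std::vector<char> bytes(8, 1);

    // For vector<char>, operator[] returns a real reference...
    static_assert(std::is_same_v<decltype(bytes[0]), char&>);
    // ...but for vector<bool> it returns a proxy class that reads/writes a single bit.
    static_assert(!std::is_same_v<decltype(bits[0]), bool&>);

    // char* p = &bytes[0];   // fine
    // bool* q = &bits[0];    // would not compile: the proxy is not a bool lvalue
}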
I'd like to add some remarks to the good answers already posted.
The performance differences between std::vector<bool> and std::vector<char> may vary (a lot) between different library implementations and different sizes of the vectors.
See e.g. those quick benches: clang++ / libc++(LLVM) vs. g++ / libstdc++(GNU).
This: bool flag[n+1]; declares a variable-length array, which (despite some performance advantages due to its being allocated on the stack) has never been part of the C++ standard, even though it is provided as an extension by some (C99-compliant) compilers.
Another way to increase the performances could be to reduce the amount of calculations (and memory occupation) by considering only the odd numbers, given that all the primes except for 2 are odd.
If you can bear the less readable code, you could try to profile the following snippet.
int countPrimes(int n)
{
    if ( n < 2 )
        return 0;

    // Sieve starting from 3 up to n; the number of odd numbers between 3 and n is
    int sieve_size = n / 2 - 1;
    std::vector<char> sieve(sieve_size);
    int result = 1;   // 2 is a prime.

    for (int i = 0; i < sieve_size; ++i)
    {
        if ( sieve[i] == 0 )
        {
            // It's a prime, no need to scan the vector again
            ++result;
            // Some ugly transformations are needed, here
            int prime = i * 2 + 3;
            for ( int j = prime * 3, k = prime * 2; j <= n; j += k)
                sieve[j / 2 - 1] = 1;
        }
    }
    return result;
}
Edit
As Peter Cordes noted in the comments, using an unsigned type for the variable j lets the compiler implement j/2 as cheaply as possible. Signed division by a power of 2 has different rounding semantics (for negative dividends) than a right shift, and compilers don't always propagate value-range proofs well enough to prove that j will always be non-negative.
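A tiny illustration of that point (my own sketch, not code from the answer):

// Division by 2 for signed vs unsigned operands. For signed x the result must
// round toward zero, so the compiler emits a shift plus a sign-based fix-up;
// for unsigned x a single logical shift is enough.
int      halveSigned(int x)        { return x / 2; }  // shift + adjustment for negative x
unsigned halveUnsigned(unsigned x) { return x / 2; }  // single shift right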
It's also possible to reduce the number of candidates by exploiting the fact that all primes (past 2 and 3) are one below or one above a multiple of 6.
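For illustration (a hypothetical helper of my own, not part of the answer's sieve), the 6k ± 1 observation applied to trial division looks like this:

#include <cstdint>

// After handling 2 and 3, every remaining prime has the form 6k - 1 or 6k + 1,
// so only those candidates need to be tested as divisors.
bool isPrimeTrialDivision(std::int64_t n)
{
    if (n < 2) return false;
    if (n < 4) return true;                       // 2 and 3
    if (n % 2 == 0 || n % 3 == 0) return false;
    for (std::int64_t d = 5; d * d <= n; d += 6)  // d = 6k - 1
    {
        if (n % d == 0 || n % (d + 2) == 0)       // d + 2 = 6k + 1
            return false;
    }
    return true;
}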
I am getting different timings and memory usage than the ones mentioned in the question when compiling with g++-7.4.0 -g -march=native -O2 -Wall and running on a Ryzen 5 1600 CPU:
vector<bool>: 0.038 seconds, 3344 KiB memory, IPC 3.16
vector<char>: 0.048 seconds, 12004 KiB memory, IPC 1.52
bool[N]: 0.050 seconds, 12644 KiB memory, IPC 1.69
Conclusion: vector<bool> is the fastest option because of its higher IPC (instructions per clock).
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <vector>

size_t countPrimes(size_t n) {
    std::vector<bool> flag(n+1,1);
    //std::vector<char> flag(n+1,1);
    //bool flag[n+1]; std::fill(flag,flag+n+1,true);
    for(size_t i=2;i<n;i++) {
        if(flag[i]==1) {
            for(size_t j=i;i*j<n;j++) {
                flag[i*j]=0;
            }
        }
    }
    size_t result=0;
    for(size_t i=2;i<n;i++) {
        result+=flag[i];
    }
    return result;
}

int main() {
    {
        const rlim_t kStackSize = 16*1024*1024;
        struct rlimit rl;
        int result = getrlimit(RLIMIT_STACK, &rl);
        if(result != 0) abort();
        if(rl.rlim_cur < kStackSize) {
            rl.rlim_cur = kStackSize;
            result = setrlimit(RLIMIT_STACK, &rl);
            if(result != 0) abort();
        }
    }
    printf("%zu\n", countPrimes(10e6));
    return 0;
}
I am developing an application for which performance is a fundamental issue. In particular, I was willing to organize a tree-like structure that needs to be traversed really quickly in blocks of the same size as my memory page size so that it would reduce the number of cache misses needed to reach a leaf.
I am quite a novice in the art of memory optimization. As far as I understand, the process of accessing the main memory goes more or less as follows:
CPUs have several layer of caches of increasing size and decreasing speed.
Every time some data that I need is already in the cache, it is fetched from the cache (cache hit).
If it is not in the cache, it will be fetched from the main memory.
Anytime something is loaded from the main memory, the whole page (or pages) containing the data are loaded and stored in the cache. In this way, if I try to access locations in memory that are close to the ones I already fetched from the main memory, they will already be in my CPU cache.
However, if I organize my data in blocks of the same size as my memory page size, I thought it would also be necessary to align that data properly, so that whenever a new block of my data needs to be loaded, only one page of memory has to be fetched from main memory rather than the two pages containing the first and second halves of my data block. In principle, shouldn't a correctly aligned data block mean only one access to memory rather than two? Shouldn't that more or less double memory performance?
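As an aside (my own sketch, not part of the original question): if you do want each block to start exactly on a page boundary, you can request page-aligned memory directly instead of shifting a plain new[] buffer, e.g. with C++17 aligned operator new or posix_memalign. The block count here is hypothetical.

#include <cstddef>
#include <cstdlib>
#include <new>
#include <unistd.h>

int main()
{
    const std::size_t pagesize = static_cast<std::size_t>(sysconf(_SC_PAGE_SIZE));
    const std::size_t bytes = 64 * pagesize;   // hypothetical block count

    // C++17 over-aligned allocation: the returned pointer is pagesize-aligned.
    void* p = ::operator new(bytes, std::align_val_t(pagesize));
    ::operator delete(p, std::align_val_t(pagesize));

    // POSIX alternative:
    void* q = nullptr;
    if (posix_memalign(&q, pagesize, bytes) == 0)
        std::free(q);
}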
I tried the following:
#include <iostream>
#include <unistd.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>

using namespace std;

#define BLOCKS 262144
#define TESTS 131072

unsigned long int utime()
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return tp.tv_sec * 1000000 + tp.tv_usec;
}

unsigned long int pagesize = sysconf(_SC_PAGE_SIZE);
unsigned long int block_slots = pagesize / sizeof(unsigned int);
unsigned int t = 0;
unsigned int p = 0;

unsigned long int test(unsigned int * data)
{
    unsigned long int start = utime();
    for(unsigned int n=0; n<TESTS; n++)
    {
        for(unsigned int i=0; i<block_slots; i++)
            t += data[p * block_slots + i];
        p = t % BLOCKS;
    }
    unsigned long int end = utime();
    return end - start;
}

int main()
{
    srand((unsigned int) time(NULL));

    char * buffer = new char[(BLOCKS + 3) * pagesize];
    for(unsigned int i=0; i<(BLOCKS + 3) * pagesize; i++)
        buffer[i] = rand();

    for(unsigned int i=0; i<pagesize; i++)
        cout<<test((unsigned int *) (buffer + i))<<endl;

    cout<<"("<<t<<")"<<endl;

    delete [] buffer;
}
This code allocates roughly 1 GB of memory and fills it with random bytes. Then the function test is called with all possible shifts within a memory page (from a shift of 0 to a shift of 4096). The test function interprets the pointer provided as a group of blocks of data and carries out a simple operation (a sum) over those blocks. The order of access to the blocks is more or less random (it's determined by the partial sums), so that every time a new block is accessed it is nearly certain not to be in the cache already.
The function test is then timed. I expected to observe roughly the same timing in all shift configurations but one, and a big efficiency improvement for one particular shift (the zero shift, maybe?). This, however, does not happen at all: all the shift timings are perfectly compatible with each other.
Why does this happen and what does this mean? Can I just forget about memory alignment? Can I also forget about making my data blocks exactly as big as a memory page? (I was planning to use some padding in case they were smaller). Or maybe something in the cache management process is just unclear to me?
I'm testing how reading multiple streams of data affects a CPUs caching performance. I'm using the following code to benchmark this. The benchmark reads integers stored sequentially in memory and writes partial sums back sequentially. The number of sequential blocks that are read from is varied. Integers from the blocks are read in a round-robin manner.
#include <iostream>
#include <vector>
#include <chrono>
using std::vector;
void test_with_split(int num_arrays) {
    int num_values = 100000000;
    // Fix up the number of values. The effect of this should be insignificant.
    num_values -= (num_values % num_arrays);
    int num_values_per_array = num_values / num_arrays;

    // Initialize data to process
    auto results = vector<int>(num_values);
    auto arrays = vector<vector<int>>(num_arrays);
    for (int i = 0; i < num_arrays; ++i) {
        arrays.emplace_back(num_values_per_array);
    }
    for (int i = 0; i < num_values; ++i) {
        arrays[i%num_arrays].emplace_back(i);
        results.emplace_back(0);
    }

    // Try to clear the cache
    const int size = 20*1024*1024; // Allocate 20M. Set much larger than L2
    char *c = (char *)malloc(size);
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < size; j++)
            c[j] = i*j;
    free(c);

    auto start = std::chrono::high_resolution_clock::now();

    // Do the processing
    int sum = 0;
    for (int i = 0; i < num_values; ++i) {
        sum += arrays[i%num_arrays][i/num_arrays];
        results[i] = sum;
    }

    std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}

int main() {
    int num_arrays = 1;
    while (true) {
        test_with_split(num_arrays++);
    }
}
Here are the timings for splitting 1-80 ways on an Intel Core 2 Quad CPU Q9550 @ 2.83GHz:
The bump in the speed soon after 8 streams makes sense to me, as the processor has an 8-way associative L1 cache. The 24-way associative L2 cache in turn explains the bump at 24 streams. These especially hold if I'm getting the same effects as in Why is one loop so much slower than two loops?, where multiple big allocations always end up in the same associativity set. For comparison, I've included at the end the timings when the allocation is done in one big block.
However, I don't fully understand the bump from one stream to two streams. My own guess would be that it has something to do with prefetching to L1 cache. Reading the Intel 64 and IA-32 Architectures Optimization Reference Manual it seems that the L2 streaming prefetcher supports tracking up to 32 streams of data, but no such information is given for the L1 streaming prefetcher. Is the L1 prefetcher unable to keep track of multiple streams, or is there something else at play here?
Background
I'm investigating this because I want to understand how organizing entities in a game engine as components in the structure-of-arrays style affects performance. For now it seems that the data required by a transformation being in two components vs. it being in 8-10 components won't matter much with modern CPUs. However, the testing above suggests that sometimes it might make sense to avoid splitting some data into multiple components if that would allow a "bottlenecking" transformation to only use one component, even if this means that some other transformation would have to read data it is not interested in.
Allocating in one block
Here are the timings when, instead of multiple blocks of data, only one big block is allocated and accessed in a strided manner. This does not change the bump from one stream to two, but I've included it for the sake of completeness.
And here is the modified code for that:
void test_with_split(int num_arrays) {
    int num_values = 100000000;
    num_values -= (num_values % num_arrays);
    int num_values_per_array = num_values / num_arrays;

    // Initialize data to process
    auto results = vector<int>(num_values);
    auto array = vector<int>(num_values);
    for (int i = 0; i < num_values; ++i) {
        array.emplace_back(i);
        results.emplace_back(0);
    }

    // Try to clear the cache
    const int size = 20*1024*1024; // Allocate 20M. Set much larger than L2
    char *c = (char *)malloc(size);
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < size; j++)
            c[j] = i*j;
    free(c);

    auto start = std::chrono::high_resolution_clock::now();

    // Do the processing
    int sum = 0;
    for (int i = 0; i < num_values; ++i) {
        sum += array[(i%num_arrays)*num_values_per_array+i/num_arrays];
        results[i] = sum;
    }

    std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}
Edit 1
I made sure that the 1 vs 2 splits difference was not due to the compiler unrolling the loop and optimizing the first iteration differently. Using __attribute__((noinline)) I made sure the work function is not inlined into the main function, and I verified by looking at the generated assembly that it was not. The timings after these changes were the same.
To answer the main part of your question: Is the L1 prefetcher able to keep track of multiple streams?
No. This is actually because the L1 cache doesn't have a prefetcher at all. The L1 cache isn't big enough to risk speculatively fetching data that might not be used. It would cause too many evictions and adversely impact any software that isn't reading data in specific patterns suited to that particular L1 cache prediction scheme. Instead L1 caches data that has been explicitly read or written. L1 caches are only helpful when you are writing data and re-reading data that has recently been accessed.
The L1 cache implementation is not the reason for your profile bump from 1X to 2X array depth. On streaming reads like what you've set up, the L1 cache plays little or no factor in performance. Most of your reads are coming directly from the L2 cache. In your first example using nested vectors, some number of reads are probably pulled from L1 (see below).
My guess is your bump from 1X to 2X has a lot to do with the algo and how the compiler is optimizing it. If the compiler knows num_arrays is a constant equal to 1, then it will automatically eliminate a lot of per-iteration overhead for you.
Now for the second part: why is the second version faster?
The reason for the second version being faster is not so much in how the data is arranged in physical memory, but rather what under-the-hood logic change a nested std::vector<std::vector<int>> type implies.
In the nested (first) case, compiled code performs the following steps:
Accesses top-level std::vector class. This class contains a pointer to the start of the data array.
That pointer value must be loaded from memory.
Add current loop offset [i%num_arrays] to that pointer.
Access nested std::vector class data. (Likely L1 cache hit)
Load pointer to the vector's start of data array. (Likely L1 cache hit)
Add loop offset [i/num_arrays]
Read data. Finally!
(Note: the chances of getting L1 cache hits on steps #4 and #5 decrease drastically after 24 streams or so, due to the likelihood of evictions before the next iteration through the loop.)
The second version, by comparison:
Accesses top-level std::vector class.
Load pointer to the vector's start of data array.
Add offset [(i%num_arrays)*num_values_per_array+i/num_arrays]
Read data!
An entire set of under-the-hood steps are removed. The calculation for offset is slightly longer since it needs an extra multiply by num_values_per_array. But the other steps more than make up for it.
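A rough sketch of the difference (my own illustration; the function names and parameters are hypothetical, not from the question):

#include <vector>

// Nested vectors: two dependent pointer loads (outer data pointer, then inner
// data pointer) before the element itself can be loaded.
int loadNested(const std::vector<std::vector<int>>& arrays, int i, int num_arrays)
{
    const std::vector<int>& inner = arrays[i % num_arrays]; // load outer vector's data pointer, index it
    return inner[i / num_arrays];                           // load inner vector's data pointer, then the element
}

// Flat vector: one pointer load plus a slightly longer index computation.
int loadFlat(const std::vector<int>& array, int i, int num_arrays, int num_values_per_array)
{
    return array[(i % num_arrays) * num_values_per_array + i / num_arrays];
}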
I have a simple program as follows. When I compiled the code without any optimization, it took 5.986 s (user 3.677 s, sys 1.716 s) to run on a Mac with a 2.4 GHz i5 processor and 16 GB of DDR3-1600 (CAS 9) memory. I am trying to figure out how many L1 cache misses happen in this program. Any suggestions? Thanks!
int main()
{
    int size = 1024 * 1024 * 1024;
    int * a = new int[size];
    int i;
    for (i = 0; i < size; i++) a[i] = i;
    delete[] a;
}
You can measure the number of cache misses using the cachegrind tool of valgrind. This page provides a pretty detailed summary.
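For example (standard valgrind tooling; the exact output format may differ between versions), running valgrind --tool=cachegrind ./your_program prints totals for L1 instruction (I1), L1 data (D1) and last-level (LL) misses, and cg_annotate on the generated cachegrind.out.<pid> file breaks the counts down per function and per source line.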
note: If you're using C then you should be using malloc. Don't forget to call free: as it stands your program will leak memory. If you are using C++ (and this question is incorrectly tagged), you should be using new and delete.
If you want really fine-grained measurements of cache misses, you should use Intel's architectural counters, which can be accessed from userspace using the rdpmc instruction. The kernel module source I wrote in this answer will enable rdpmc in userspace for older CPUs.
Here is another kernel module to enable configuration of the counters for measuring last-level cache misses and last-level cache references. Note that I have hardcoded 8 cores, because that was what I had happened to use for my configuration.
#include <linux/module.h>   /* Needed by all modules */
#include <linux/kernel.h>   /* Needed for KERN_INFO */
#include <linux/smp.h>      /* on_each_cpu, smp_processor_id */

#define PERFEVTSELx_MSR_BASE 0x00000186
#define PMCx_MSR_BASE        0x000000c1 /* NB: write when evt disabled*/

#define PERFEVTSELx_USR      (1U << 16) /* count in rings 1, 2, or 3 */
#define PERFEVTSELx_OS       (1U << 17) /* count in ring 0 */
#define PERFEVTSELx_EN       (1U << 22) /* enable counter */

/* Module parameter (assumed; the original post used `clear` without showing its
 * declaration): non-zero means only clear/restore the counters, do not program them. */
static int clear = 0;
module_param(clear, int, 0);

static void
write_msr(uint32_t msr, uint64_t val)
{
    uint32_t lo = val & 0xffffffff;
    uint32_t hi = val >> 32;
    __asm __volatile("wrmsr" : : "c" (msr), "a" (lo), "d" (hi));
}

static uint64_t
read_msr(uint32_t msr)
{
    uint32_t hi, lo;
    __asm __volatile("rdmsr" : "=d" (hi), "=a" (lo) : "c" (msr));
    return ((uint64_t) lo) | (((uint64_t) hi) << 32);
}

static uint64_t old_value_perfsel0[8];
static uint64_t old_value_perfsel1[8];

static spinlock_t mr_lock = SPIN_LOCK_UNLOCKED;
static unsigned long flags;

static void wrapper(void* ptr) {
    int id;
    uint64_t value;

    spin_lock_irqsave(&mr_lock, flags);
    id = smp_processor_id();

    // Save the old values before we do something stupid.
    old_value_perfsel0[id] = read_msr(PERFEVTSELx_MSR_BASE);
    old_value_perfsel1[id] = read_msr(PERFEVTSELx_MSR_BASE+1);

    // Clear out the existing counters
    write_msr(PERFEVTSELx_MSR_BASE, 0);
    write_msr(PERFEVTSELx_MSR_BASE + 1, 0);
    write_msr(PMCx_MSR_BASE, 0);
    write_msr(PMCx_MSR_BASE + 1, 0);

    if (clear){
        spin_unlock_irqrestore(&mr_lock, flags);
        return;
    }

    // Table 19-1 in the most recent Intel Manual - Architectural
    // Last Level Cache References Event select 2EH, Umask 4FH
    value = 0x2E | (0x4F << 8) |PERFEVTSELx_EN |PERFEVTSELx_OS|PERFEVTSELx_USR;
    write_msr(PERFEVTSELx_MSR_BASE, value);

    // Table 19-1 in the most recent Intel Manual - Architectural
    // Last Level Cache Misses Event select 2EH, Umask 41H
    value = 0x2E | (0x41 << 8) |PERFEVTSELx_EN |PERFEVTSELx_OS|PERFEVTSELx_USR;
    write_msr(PERFEVTSELx_MSR_BASE + 1, value);

    spin_unlock_irqrestore(&mr_lock, flags);
}

static void restore_wrapper(void* ptr) {
    int id = smp_processor_id();
    if (clear) return;
    write_msr(PERFEVTSELx_MSR_BASE, old_value_perfsel0[id]);
    write_msr(PERFEVTSELx_MSR_BASE+1, old_value_perfsel1[id]);
}

int init_module(void)
{
    printk(KERN_INFO "Entering write-msr!\n");
    on_each_cpu(wrapper, NULL, 0);
    /*
     * A non 0 return means init_module failed; module can't be loaded.
     */
    return 0;
}

void cleanup_module(void)
{
    on_each_cpu(restore_wrapper, NULL, 0);
    printk(KERN_INFO "Exiting write-msr!\n");
}
Here is a user-space wrapper around rdpmc.
uint64_t
read_pmc(int ecx)
{
    unsigned int a, d;
    __asm __volatile("rdpmc" : "=a"(a), "=d"(d) : "c"(ecx));
    return ((uint64_t)a) | (((uint64_t)d) << 32);
}
You have to be running on a 64-bit system. You are writing 4 GB of data. The number of cache misses is 4 x 1024 x 1024 x 1024 bytes divided by the cache line size. However, since all the memory accesses are sequential, you will not have many TLB misses etc., and the processor most likely optimizes accesses to sequential cache lines.
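For concreteness (my own arithmetic, assuming 4-byte int and 64-byte cache lines): 2^30 ints x 4 bytes = 4 GiB, and 4 GiB / 64 bytes per line = 67,108,864, i.e. roughly 64M compulsory misses for one sequential pass over the array.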
Your performance here (or lack thereof) is totally dominated by paging, every new page (probably 4k in your case) would cause a page fault since it's newly allocated and was never used, and trigger an expensive OS flow. Cachegrind and performance monitors should show you the same behavior, so you may be confused if you expected just simple data accesses.
One way to avoid this is to allocate, store once to the entire array (or even once per page) to warm up the page tables, and then measure time internally in your application (using rdtsc or any C API you like) over the main loop.
Alternatively, if you want to use external time measurements, just loop multiple times (> 1000), and amortize, so the initial penalty would become less significant.
Once you do all that, the cache misses you measure should reflect the number of accesses for each new 64-byte line (i.e. ~64M), plus the page walks (~1M pages assuming they're 4k, multiplied by the page table levels, since each walk has to look up memory at each level).
Under a virtualized platform the paging cost would roughly square (e.g. 9 accesses instead of 3), since each level of the guest page table also needs paging on the host.
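A minimal sketch of the warm-up-then-measure suggestion above (my own code, assuming 4 KiB pages; adjust the stride if your page size differs):

#include <chrono>
#include <cstddef>
#include <cstdio>

int main()
{
    const std::size_t size = 1024u * 1024u * 1024u;   // number of ints, as in the question
    int* a = new int[size];

    // Warm-up pass: touch one byte per (assumed 4 KiB) page so page faults and
    // page-table population happen before the timed region.
    char* bytes = reinterpret_cast<char*>(a);
    for (std::size_t i = 0; i < size * sizeof(int); i += 4096)
        bytes[i] = 0;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < size; i++)            // the loop we actually want to measure
        a[i] = static_cast<int>(i);
    const auto stop = std::chrono::steady_clock::now();

    std::printf("timed loop: %lld ms\n",
                static_cast<long long>(
                    std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()));
    delete[] a;
}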