Why is processing multiple streams of data slower than processing one? - c++

I'm testing how reading multiple streams of data affects a CPU's caching performance. I'm using the following code to benchmark this. The benchmark reads integers stored sequentially in memory and writes partial sums back sequentially. The number of sequential blocks that are read from is varied. Integers from the blocks are read in a round-robin manner.
#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib>

using std::vector;

void test_with_split(int num_arrays) {
    int num_values = 100000000;
    // Fix up the number of values so it divides evenly. The effect of this
    // should be insignificant.
    num_values -= (num_values % num_arrays);
    int num_values_per_array = num_values / num_arrays;

    // Initialize data to process. results already holds num_values zeros;
    // the blocks only need their capacity reserved before being filled.
    auto results = vector<int>(num_values);
    auto arrays = vector<vector<int>>(num_arrays);
    for (int i = 0; i < num_arrays; ++i) {
        arrays[i].reserve(num_values_per_array);
    }
    for (int i = 0; i < num_values; ++i) {
        arrays[i % num_arrays].emplace_back(i);
    }

    // Try to clear the cache
    const int size = 20*1024*1024; // Allocate 20M. Set much larger than L2.
    char *c = (char *)malloc(size);
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < size; j++)
            c[j] = i*j;
    free(c);

    auto start = std::chrono::high_resolution_clock::now();

    // Do the processing: read the blocks round-robin, write partial sums
    // back sequentially.
    int sum = 0;
    for (int i = 0; i < num_values; ++i) {
        sum += arrays[i % num_arrays][i / num_arrays];
        results[i] = sum;
    }

    std::cout << "Time with " << num_arrays << " arrays: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(
                     std::chrono::high_resolution_clock::now() - start).count()
              << " ms\n";
}

int main() {
    int num_arrays = 1;
    while (true) {
        test_with_split(num_arrays++);
    }
}
Here are the timings for splitting 1-80 ways on an Intel Core 2 Quad CPU Q9550 @ 2.83GHz:
The bump in speed soon after 8 streams makes sense to me, as the processor has an 8-way associative L1 cache. The 24-way associative L2 cache in turn explains the bump at 24 streams. These explanations especially hold if I'm getting the same effects as in "Why is one loop so much slower than two loops?", where multiple big allocations always end up in the same associativity set. For comparison, I've included at the end timings when the allocation is done in one big block.
However, I don't fully understand the bump from one stream to two streams. My own guess would be that it has something to do with prefetching to L1 cache. Reading the Intel 64 and IA-32 Architectures Optimization Reference Manual it seems that the L2 streaming prefetcher supports tracking up to 32 streams of data, but no such information is given for the L1 streaming prefetcher. Is the L1 prefetcher unable to keep track of multiple streams, or is there something else at play here?
Background
I'm investigating this because I want to understand how organizing entities in a game engine as components in the structure-of-arrays style affects performance. For now it seems that the data required by a transformation being in two components vs. it being in 8-10 components won't matter much with modern CPUs. However, the testing above suggests that sometimes it might make sense to avoid splitting some data into multiple components if that would allow a "bottlenecking" transformation to only use one component, even if this means that some other transformation would have to read data it is not interested in.
Allocating in one block
Here are the timings if, instead of allocating multiple blocks of data, only one block is allocated and accessed in a strided manner. This does not change the bump from one stream to two, but I've included it for the sake of completeness.
And here is the modified code for that:
void test_with_split(int num_arrays) {
    int num_values = 100000000;
    num_values -= (num_values % num_arrays);
    int num_values_per_array = num_values / num_arrays;

    // Initialize data to process. Both vectors are constructed at their
    // final size, so elements are assigned rather than appended.
    auto results = vector<int>(num_values);
    auto array = vector<int>(num_values);
    for (int i = 0; i < num_values; ++i) {
        array[i] = i;
    }

    // Try to clear the cache
    const int size = 20*1024*1024; // Allocate 20M. Set much larger than L2.
    char *c = (char *)malloc(size);
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < size; j++)
            c[j] = i*j;
    free(c);

    auto start = std::chrono::high_resolution_clock::now();

    // Do the processing
    int sum = 0;
    for (int i = 0; i < num_values; ++i) {
        sum += array[(i % num_arrays) * num_values_per_array + i / num_arrays];
        results[i] = sum;
    }

    std::cout << "Time with " << num_arrays << " arrays: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(
                     std::chrono::high_resolution_clock::now() - start).count()
              << " ms\n";
}
Edit 1
I made sure that the 1 vs 2 splits difference was not due to the compiler unrolling the loop and optimizing the first iteration differently. Using __attribute__((noinline)) I made sure the work function is not inlined into the main function, and I verified that it did not happen by looking at the generated assembly. The timings after these changes were the same.
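For reference, the attribute was applied to the work function like this (standard GCC spelling):

// Prevent GCC from inlining the benchmark body into main, so the
// optimizer cannot specialize it for a known num_arrays.
__attribute__((noinline)) void test_with_split(int num_arrays);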

To answer the main part of your question: Is the L1 prefetcher able to keep track of multiple streams?
No. This is actually because the L1 cache doesn't have a prefetcher at all. The L1 cache isn't big enough to risk speculatively fetching data that might not be used. It would cause too many evictions and adversely impact any software that isn't reading data in specific patterns suited to that particular L1 cache prediction scheme. Instead, the L1 caches data that has been explicitly read or written. The L1 cache only helps when you are writing, and re-reading, data that has recently been accessed.
The L1 cache implementation is not the reason for your profile bump from 1X to 2X array depth. On streaming reads like what you've set up, the L1 cache plays little or no role in performance. Most of your reads are coming directly from the L2 cache. In your first example using nested vectors, some number of reads are probably pulled from L1 (see below).
My guess is your bump from 1X to 2X has a lot to do with the algorithm and how the compiler optimizes it. If the compiler knows num_arrays is a constant equal to 1, it can automatically eliminate a lot of per-iteration overhead for you.
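For illustration, if the compiler did specialize the num_arrays == 1 case (my sketch of that specialization, not actual compiler output), the per-iteration modulo and division would fold away entirely:

// Specialized single-stream loop: i % 1 is always 0 and i / 1 is i,
// so the round-robin index math disappears from the address computation.
int sum = 0;
for (int i = 0; i < num_values; ++i) {
    sum += arrays[0][i];
    results[i] = sum;
}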
Now for the second part: why is the second version faster?
The reason the second version is faster is not so much how the data is arranged in physical memory, but rather what under-the-hood logic a nested std::vector<std::vector<int>> type implies.
In the nested (first) case, the compiled code performs the following steps:

1. Access the top-level std::vector object. This object contains a pointer to the start of its data array.
2. Load that pointer value from memory.
3. Add the current loop offset [i % num_arrays] to that pointer.
4. Access the nested std::vector object's data. (Likely an L1 cache hit)
5. Load the pointer to that vector's data array. (Likely an L1 cache hit)
6. Add the loop offset [i / num_arrays].
7. Read the data. Finally!

(Note that the chances of getting L1 cache hits on steps #4 and #5 decrease drastically after 24 streams or so, due to the likelihood of evictions before the next iteration through the loop.)
The second version, by comparison:

1. Access the top-level std::vector object.
2. Load the pointer to the vector's data array.
3. Add the offset [(i % num_arrays) * num_values_per_array + i / num_arrays].
4. Read the data!

An entire set of under-the-hood steps is removed. The offset calculation is slightly longer, since it needs an extra multiply by num_values_per_array, but the other steps more than make up for it.
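To make the difference concrete, here is a sketch of both access patterns using the variables from the question's code (illustrative source, not generated assembly):

// Nested case: reaching the element goes through the inner vector object
// and its data pointer -- two dependent loads before the element itself.
const vector<int> &inner = arrays[i % num_arrays];  // locate the inner vector
sum += inner[i / num_arrays];                       // load data pointer, then element

// Flat case: one loop-invariant data pointer plus a computed offset.
sum += array[(i % num_arrays) * num_values_per_array + i / num_arrays];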

Related

Throughput of trivially concurrent code does not increase with the number of threads

I am trying to use OpenMP to benchmark the speed of a data structure that I implemented. However, I seem to have made a fundamental mistake: the throughput decreases instead of increasing with the number of threads, no matter what operation I try to benchmark.
Below you can see code that tries to benchmark the speed of a for loop; as such, I would expect it to scale (somewhat) linearly with the number of threads, but it doesn't (compiled on a dual-core laptop with and without the -O3 flag on g++ with C++11).
#include <omp.h>
#include <atomic>
#include <chrono>
#include <iostream>
#include <vector>

thread_local const int OPS = 10000;
thread_local const int TIMES = 200;

double get_tp(int THREADS)
{
    // A variable-length array (double threadtime[THREADS]) is not
    // standard C++; use a vector instead.
    std::vector<double> threadtime(THREADS, 0.0);

    // Repeat the test many times
    for (int iteration = 0; iteration < TIMES; iteration++)
    {
        #pragma omp parallel num_threads(THREADS)
        {
            double start, stop;
            int loc_ops = OPS / float(THREADS);
            int t = omp_get_thread_num();

            // Force all threads to start at the same time
            #pragma omp barrier
            start = omp_get_wtime();

            // Do a certain kind of operation loc_ops times
            for (int i = 0; i < loc_ops; i++)
            {
                // Here I would put the operations to benchmark;
                // in this case a boring for loop
                int x = 0;
                for (int j = 0; j < 1000; j++)
                    x++;
            }

            stop = omp_get_wtime();
            threadtime[t] += stop - start;
        }
    }

    double total_time = 0;
    std::cout << "\nThread times: ";
    for (int i = 0; i < THREADS; i++)
    {
        total_time += threadtime[i];
        std::cout << threadtime[i] << ", ";
    }
    std::cout << "\nTotal time: " << total_time << "\n";

    double mopss = float(OPS) * TIMES / total_time;
    return mopss;
}

int main()
{
    std::cout << "\n1 " << get_tp(1) << " ops/s\n";
    std::cout << "\n2 " << get_tp(2) << " ops/s\n";
    std::cout << "\n4 " << get_tp(4) << " ops/s\n";
    std::cout << "\n8 " << get_tp(8) << " ops/s\n";
}
Output with -O3 on a dual-core machine. We don't expect the throughput to increase past 2 threads, but it does not even increase when going from 1 to 2 threads; it decreases by 50%:
1 Thread
Thread times: 7.411e-06,
Total time: 7.411e-06
2.69869e+11 ops/s
2 Threads
Thread times: 7.36701e-06, 7.38301e-06,
Total time: 1.475e-05
1.35593e+11 ops/s
4 Threads
Thread times: 7.44301e-06, 8.31901e-06, 8.34001e-06, 7.498e-06,
Total time: 3.16e-05
6.32911e+10 ops/s
8 Threads
Thread times: 7.885e-06, 8.18899e-06, 9.001e-06, 7.838e-06, 7.75799e-06, 7.783e-06, 8.349e-06, 8.855e-06,
Total time: 6.5658e-05
3.04609e+10 ops/s
To make sure that the compiler does not remove the loop, I also tried outputting x after measuring the time, and to the best of my knowledge the problem persists. I also tried the code on a machine with more cores, and it behaved very similarly. Without -O3 the throughput also does not scale. So there is clearly something wrong with the way I benchmark. I hope you can help me.
I'm not sure why you are defining performance as the total number of operations per total CPU time and are then surprised to get a decreasing function of the number of threads. This will almost always and universally be the case, except when cache effects kick in. The true performance metric is the number of operations per wall-clock time.
It is easy to show with simple mathematical reasoning. Given a total amount of work W and the processing capability of each core P, the time on a single core is T_1 = W / P. Dividing the work evenly among n cores means each of them works for T_1,n = (W / n + H) / P, where H is the overhead per thread induced by the parallelisation itself. The sum of those is T_n = n * T_1,n = W / P + n * (H / P) = T_1 + n * (H / P).

The overhead is always a positive value, even in the trivial case of so-called embarrassing parallelism where no two threads need to communicate or synchronise; for example, launching the OpenMP threads takes time. You cannot get rid of the overhead, you can only amortise it over the lifetime of the threads by making sure that each one gets a lot to work on. Therefore T_n > T_1, and with a fixed number of operations in both cases the performance on n cores will always be lower than on a single core. The only exception to this rule is the case when the data for work of size W doesn't fit in the lower-level caches but the data for work of size W / n does. This results in a massive speed-up that exceeds the number of cores, known as superlinear speed-up.

You are measuring inside the thread function, so you ignore the value of H, and T_n should more or less equal T_1 within the timer precision, but...
With multiple threads running on multiple CPU cores, they all compete for limited shared CPU resources, namely the last-level cache (if any), memory bandwidth, and the thermal envelope.

The memory bandwidth is not a problem when you are simply incrementing a scalar variable, but it becomes the bottleneck once the code starts actually moving data in and out of the CPU. A canonical example from numerical computing is sparse matrix-vector multiplication (spMVM) -- a properly optimised spMVM routine working with double non-zero values and long indices eats so much memory bandwidth that one can completely saturate the memory bus with as few as two threads per CPU socket, making an expensive 64-core CPU a very poor choice in that case. This is true for all algorithms with low arithmetic intensity (operations per unit of data volume).

When it comes to the thermal envelope, most modern CPUs employ dynamic power management and will clock the cores up or down depending on how many of them are active. Therefore, while n clocked-down cores perform more work in total per unit of time than a single core, a single core outperforms n cores in terms of work per total CPU time, which is the metric you are using.

With all this in mind, there is one last (but not least) thing to consider -- timer resolution and measurement noise. Your run times are on the order of a few microseconds. Unless your code is running on some specialised hardware that does nothing else but run your code (i.e., no time sharing with daemons, kernel threads, and other processes, and no interrupt handling), you need benchmarks that run several orders of magnitude longer, preferably for at least a couple of seconds.
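A minimal sketch of measuring wall-clock throughput instead (my example, not the asker's code; note that with -O3 the trivial increment may still be optimized away, as the next answer points out, so substitute real work):

#include <omp.h>
#include <iostream>

int main()
{
    const long OPS = 1000000000L;        // enough work to run for ~a second
    long x = 0;
    double start = omp_get_wtime();      // one timestamp around the whole region
    #pragma omp parallel for reduction(+:x)
    for (long i = 0; i < OPS; ++i)
        x += 1;
    double elapsed = omp_get_wtime() - start;
    // Wall-clock throughput: total operations per elapsed second.
    std::cout << x << " ops in " << elapsed << " s, "
              << OPS / elapsed << " ops/s\n";
}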
The loop is almost certainly still being optimized away, even if you output the value of x after the outer loop. The compiler can trivially replace the entire loop with a single instruction, since the loop bounds are constant at compile time. Indeed, in this example:
#include <iostream>

int main()
{
    int x = 0;
    for (int i = 0; i < 10000; ++i) {
        for (int j = 0; j < 1000; ++j) {
            ++x;
        }
    }
    std::cout << x << '\n';
    return 0;
}
The loop is replaced with the single assembly instruction mov esi, 10000000.
Always inspect the assembly output when benchmarking to make sure that you're measuring what you think you are. In this case you are just measuring the overhead of creating threads, which of course grows with the number of threads you create.
Consider having the innermost loop do something that can't be optimized away. Random number generation is a good candidate: it should run in constant time, and it has the side effect of permuting the PRNG state, making it ineligible for removal entirely (unless the seed is known in advance and the compiler is able to unravel all of the mutation in the PRNG).
For example:
#include <iostream>
#include <random>

int main()
{
    std::mt19937 r;
    std::uniform_real_distribution<double> dist{0, 1};
    for (int i = 0; i < 10000; ++i) {
        for (int j = 0; j < 1000; ++j) {
            dist(r);
        }
    }
    return 0;
}
Both loops and the PRNG invocation are left intact here.

My C++ program gets slower as computation proceeds

I wrote a neural network program in C++ to test something, and I found that my program gets slower as the computation proceeds. Since this is a phenomenon I've never seen before, I checked possible causes: the memory used by the program did not change as it slowed down, and RAM and CPU status looked fine while it ran.
Fortunately, the previous version of the program did not have this problem, so I was finally able to track it down to a single statement. The program does not get slower when I use this statement:
dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi;
However, the program gets slower and slower as soon as I replace the above statement with:
dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi - lambda*w[k][i][j];
To solve this problem, I did my best to find and remove the cause, but I failed... Below is the simplified code structure. In case the problem is not local to this statement, I have uploaded my code to Google Drive; the URL is located at the end of this question.
MLP.h
class MLP
{
private:
    ...
    double lambda;
    double ***w;
    double ***dw;
    neuron **hidden;
    ...
MLP.cpp
...
for(k = n_depth - 1; k > 0; k--)
{
    if(k == n_depth - 1)
        ...
    else
    {
        ...
        for(j = 1; n_neuron > j; j++)
        {
            for(i = 0; n_neuron > i; i++)
            {
                //dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi;
                dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi - lambda*w[k][i][j];
            }
        }
    }
}
...
Full source code: https://drive.google.com/open?id=1A8Uw0hNDADp3-3VWAgO4eTtj4sVk_LZh
I'm not sure exactly why it gets slower and slower, but I do see where you can gain some performance.
Two and higher dimensional arrays are still stored in one dimensional memory. This means (for C/C++ arrays) array[i][j] and array[i][j+1] are adjacent to each other, whereas array[i][j] and array[i+1][j] may be arbitrarily far apart.

Accessing data in a more-or-less sequential fashion, as stored in physical memory, can dramatically speed up your code (sometimes by an order of magnitude, or more)!

When modern CPUs load data from main memory into processor cache, they fetch more than a single value. Instead they fetch a block of memory containing the requested data and adjacent data (a cache line). This means that after array[i][j] is in the CPU cache, array[i][j+1] has a good chance of already being in cache, whereas array[i+1][j] is likely to still be in main memory.

Source: https://people.cs.clemson.edu/~dhouse/courses/405/papers/optimize.pdf
With your current code, w[k][i][j] will be read, and on the next iteration, w[k][i+1][j] will be read. You should invert i and j so that w is read in sequential order:
for(j = 1; n_neuron > j; ++j)
{
    for(i = 0; n_neuron > i; ++i)
    {
        dw[k][j][i] = hidden[k-1][j].y * hidden[k][i].phi - lambda*w[k][j][i];
    }
}
Also note that ++x should be slightly faster than x++, since x++ has to produce a temporary holding the old value of x as the expression result. The compiler might optimize this away when the value is unused, but do not count on it.

How do you calculate memory access time?

I create a large boolean 2D array (5000x5000, for a total of 25 million elements at ~23 MB). Then I loop through and instantiate every element with a random true or false. Then I loop through and read every single element. All 25 million elements are read in ~100 ms.
23MB is too big to fit in the CPU's cache and I think my program is too simple to benefit from any type of compiler optimization so am I right to conclude that the program is reading 25 million elements from RAM in ~100ms?
#include "stdafx.h"
#include <iostream>
#include <chrono>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
bool **locs;
locs = new bool*[5000];
for(int i = 0; i < 5000; i++)
locs[i] = new bool[5000];
for(int i = 0; i < 5000; i++)
for(int i2 = 0; i2 < 5000; i2++)
locs[i][i2] = rand() % 2 == 0 ? true : false;
int *idx = new int [5000*5000];
for(int i = 0; i < 5000*5000; i++)
*(idx + i) = rand() % 4999;
bool val;
int memAccesses = 0;
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < 5000*5000; i++) {
val = locs[*(idx + i)][*(idx + ++i)];
memAccesses += 2;
}
auto finish = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(finish-start).count() << " ns\n";
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(finish-start).count() << " ms\n";
cout << "TOTAL MEMORY ACCESSES: " << memAccesses << endl;
cout << "The size of the array in memory is " << ((sizeof(bool)*5000*5000)/1048576) << "MB";
int exit; cin >> exit;
return 0;
}
/*
OUTPUT IS:
137013700 ns
137 ms
TOTAL MEMORY ACCESSES: 25000000
The size of the array in memory is 23MB
*/
As other answers have mentioned, the "speed" you are seeing (even if the CPU is executing your code and it is not stripped out by the compiler) is about 250 MB/s, which is a very, very low number for modern systems.
However, your methodology seems flawed to me (admittedly, I'm not an expert in benchmarking). Here are the problems I see:
For any benchmark such as this, even in the simplest form, you need to distinguish random access from sequential access. Memory is not a random-access device (despite its name) and performs very poorly here. Your code seems to be accessing memory randomly, so you should add that qualifier to your conclusion: that you are "reading 25 million elements from random locations in RAM in ~100 ms."
Another aspect of this sort of benchmark is the concept of latency vs. throughput. Again, if you want to conclude anything from your numbers and timings, you need to be aware of what exactly you are measuring.
You are counting memory accesses incorrectly. Depending on the exact code your compiler generates, this line:
val = locs[*(idx + i)][*(idx + ++i)];
might realistically access the memory system anywhere between 4 and 9 times.
At best, if i, idx, locs and val are all either in registers or access to them is eliminated, then you need to read *(idx + i), read locs[*(idx + i)] (remember that locs is an array of pointers to arrays, not a 2D array), read *(idx + ++i), and finally read locs[*(idx + i)][*(idx + ++i)]. A few of these might be cached, but it's unlikely, given the cache-thrashing that's going on.
At worst, in addition to the above, you need two accesses for ++i (read, then write back), one for idx, one for locs and one for val. You might even need another read for the single i and/or two reads for the two idx occurrences (due to pointer aliasing and whatnot).
You need to be aware that memory is never accessed in single bytes or even single words. Memory is always read and written in units of a cache line, and cache line size can differ from system to system, although the most common size these days is 64 bytes. So each time you read a memory location that is not in the cache, you are loading 64 bytes (or more) from RAM. If the memory location you are reading straddles a cache line boundary (some of the bytes in one cache line and some in the next), then you are loading two cache lines from RAM. Given a sane compiler and properly aligned variables in memory, this doesn't happen very often, but it can. So you have to at least multiply your calculated bandwidth by the size of your cache line.
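As a rough worked example (my arithmetic, assuming the common 64-byte line and that nearly every random read misses): 25 million misses times 64 bytes per line is on the order of 25e6 * 64 B, or about 1.6 GB actually moved from RAM, even though the bools you asked for add up to only ~25 MB and the array itself occupies ~23 MB.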
However, if you are accessing a memory location that is already in cache, then you don't load anything from RAM. You need to consider this in your conclusions too.
You also need to consider cache line eviction, your cache's associativity, number of levels, the fact that some cache levels are shared between instructions and data and some aren't, some are shared between cores and some aren't, and a lot of other things when evaluating the performance of caches and memory.
The DRAM chips also have a lot of weird and complex behaviors and characteristics. Some memory locations are faster to read after some others (due to the arrangements of rows and columns,) some accesses might get delayed a long time (at CPU speeds) because of the refresh cycle, other devices might be using the RAM or the bus that RAM is on, etc., etc. I'm far from familiar with the operations of modern memory chips, and even I know that it's a complete mess.
You have to consider the effects of compiler optimization on your code. This means you have to look at your code after the compiler is done with it, in assembly form. You need to look at the generated assembly to know what your code is actually doing: whether and which of your memory accesses are optimized out.
All in all, I don't think that you can conclude much useful information from your program. Sorry about that, but memory is very complex!
Portions (blocks) of memory are stored in the processor cache at a time, which allows the processor to quickly access those items. However, the speed you measured is perfectly reasonable for modern memory: even the slowest DDR3 RAM can transfer data at about 6 GB/s.
Cache usage is independent of a program's complexity. Whenever data is read from RAM it goes into the cache. Since the cache has a fixed size, there is always that much recently used data available in it. If you access a memory location next to a previous one, there is a good chance it will already be cached, in which case RAM is not accessed.
I would suggest reading the CPU cache Wikipedia entry to broaden your knowledge.
BTW: val = locs[*(idx + i)][*(idx + ++i)]; -- are you certain this is evaluated from left to right? I am not; this is undefined behavior, since i is both read and modified in the same expression without sequencing. I'd suggest moving the ++i out of the accessor line.
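A sequenced rewrite of that loop could look like this (my sketch, reusing the question's variables; it keeps the two reads per iteration but removes the side effect from the index expression):

for (int i = 0; i < 5000 * 5000; i += 2) {   // advance past both indices here
    val = locs[idx[i]][idx[i + 1]];          // i is no longer modified inside the expression
    memAccesses += 2;
}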
//EDIT:
There is nothing done with the value read from memory, so it is quite possible these instructions are not executed at all! Check the generated assembly, or add a (void) val; statement, which should force the read to be generated.
No. The reads won't always go all the way down to the RAM. Blocks of memory get pulled into the cache when a read (or write) is performed. As long as the block from which you are reading is already in the cache, the cache is used. If you request data from a block that is not in the cache, then the RAM is accessed to fetch the block of memory and place it in the cache. Reading from the cache is significantly cheaper than reading from RAM.
EDIT
Again, write operations cause blocks of memory to be pulled into the cache. Because you store the values in your program before reading them, the data you are reading is most likely already in the cache from when you stored it. Therefore, it is likely that your loop that reads the values never needs to access RAM.

For loop optimization - C++

This is my first time posting on this site and I hope I get some help/hints. I have an assignment where I need to optimize the performance of the inner for loop, but I have no idea how to do that. The code was given in the assignment. I was able to measure the time; now I need to improve the performance.
Here is the code:
// header files
#include <stdio.h>   // printf
#include <stdlib.h>  // calloc, rand, srand
#include <time.h>    // time

#define N_TIMES 200  // This is originally 200000 but changed it to test the program faster
#define ARRAY_SIZE 9973

int main (void) {
    int *array = (int*)calloc(ARRAY_SIZE, sizeof(int));
    int sum = 0;
    int checksum = 0;
    int i;
    int j;
    int x;

    // Initialize the array with random values 0 to 13.
    srand(time(NULL));
    for (j=0; j < ARRAY_SIZE; j++) {
        x = rand() / (int)(((unsigned)RAND_MAX + 1) / 14);
        array[j] = x;
        checksum += x;
    }
    //printf("Checksum is %d.\n",checksum);

    for (i = 0; i < N_TIMES; i++) {
        // Do not alter anything above this line.
        // Need to optimize this for loop----------------------------------------
        for (j=0; j < ARRAY_SIZE; j++) {
            sum += array[j];
            printf("Sum is now: %d\n",sum);
        }
        // Do not alter anything below this line.
        // ---------------------------------------------------------------
        // Check each iteration.
        if (sum != checksum) {
            printf("Checksum error!\n");
        }
        sum = 0;
    }
    return 0;
}
The code takes about 695 seconds to run. Any help on how to optimize it, please?
Thanks a lot.
The bottleneck in that loop is obviously the I/O done by printf; since you are probably writing the output to a console, the output is line-buffered, which means the stdio buffer is flushed on each iteration, and that slows things down a lot.
If you have to do all those prints, you can greatly improve performance by forcing the stream to use block buffering: before the for, add
setvbuf(stdout, NULL, _IOFBF, 0);
Alternatively, if that approach is not considered valid, you can do your own buffering: allocate a big buffer, write into it with sprintf, and periodically empty it into the output stream with fwrite, as sketched below.
You can also use the poor man's approach to buffering: just use a buffer big enough to hold all the output (you can calculate how big it must be quite easily), write into it without worrying about when it's full or when to empty it, and empty it once at the end of the loop. Edit: see paxdiablo's answer for an example of this.
Applying just the first optimization, here is what I get with time:
real 0m6.580s
user 0m0.236s
sys 0m2.400s
vs the original
real 0m8.451s
user 0m0.700s
sys 0m3.156s
So we shaved off about 2 seconds of real time, half a second of user time, and ~0.7 seconds of system time. What we can see here is the huge difference between user+sys and real, which means the time is not spent doing something inside the process, but waiting.
Thus, the real bottleneck here is not in our process, but in the process of the virtual terminal emulator: sending huge quantities of text to the console is going to be slow no matter what optimizations we do in our program. In other words, your task is not CPU-bound but IO-bound, so CPU-targeted optimizations won't be of much benefit, since in the end you have to wait anyway for your IO device to do its slow work.
The real way to speed up such a program would be much simpler: avoid the slow IO device (the console) and just write the data to file (which, by the way, is block-buffered by default).
matteo@teokubuntu:~/cpp/test$ time ./a.out > test
real 0m0.369s
user 0m0.240s
sys 0m0.068s
Since there's absolutely no variation in that loop based on i (the outer loop), you don't need to recalculate the sum each time.
In addition, the printing of the data should be outside the inner loop so as not to impose I/O costs on the calculation.
With those two things in mind, one possibility is:
static int sumCalculated = 0;

if (!sumCalculated) {
    for (j=0; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
    sumCalculated = 1;
}
printf("Sum is now: %d\n",sum);
although that has different output to the original which may be an issue (one line at the end rather than one line per addition).
If you do need to print the accumulating sum within the loop, I'd simply buffer that as well (since it doesn't vary on each pass through the i loop).
The string Sum is now: 999999999999\n (12 digits, though it may vary depending on your int size) takes up 25 bytes (excluding the terminating NUL). Multiply that by 9973 and you need a buffer of about 250K (including a terminating NUL). So something like this:
static char buff[250000];
static int sumCalculated = 0;

if (!sumCalculated) {
    int offset = 0;
    for (j=0; j < ARRAY_SIZE; j++) {
        sum += array[j];
        offset += sprintf (buff + offset, "Sum is now: %d\n", sum);
    }
    sumCalculated = 1;
}
printf ("%s", buff);
Now that sort of defeats the whole intent of the outer loop as a benchmark tool but loop-invariant removal is a valid approach to optimisation.
Move the printf outside the for loop.
// Do not alter anything above this line.
// Need to optimize this for loop----------------------------------------
for (j=0; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
printf("Sum is now: %d\n",sum);
// Do not alter anything below this line.
// ---------------------------------------------------------------
Getting the I/O out of the loop is a big help.
Depending on the compiler and machine, you might get a tiny increase in speed by using pointers rather than indexing (though on modern hardware, it generally doesn't make a difference).
Loop unrolling might help to increase the ratio of useful work to loop overhead.
You could use vector instructions (e.g., SIMD) to do a bunch of calculation in parallel.
Are you allowed to pack the array? Can you use an array of a smaller type than int (given that all the values are very small)? Making the array physically shorter improves locality.
Loop unrolling might look something like this:
for (int j = 0; j < ARRAY_SIZE; j += 2) {
    sum += array[j] + array[j+1];  // assumes an even ARRAY_SIZE; see the cleanup sketch below
}
You'd have to figure out what to do if the array isn't an exact multiple of the unrolling size (which is probably why the assignment uses a prime number).
You would have to experiment to see how much unrolling would be the right amount.
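One common pattern for the leftover elements (a sketch, assuming unrolling by 2) is a scalar cleanup loop after the unrolled one:

int j;
for (j = 0; j + 1 < ARRAY_SIZE; j += 2)   // unrolled-by-2 main loop
    sum += array[j] + array[j + 1];
for (; j < ARRAY_SIZE; j++)               // cleanup for any leftover element
    sum += array[j];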

Optimize indexed array summation

I have the following C++ code:
const int N = 1000000;
int id[N];      // value can range from 0 to 9
float value[N];

// load id and value from an external source...

int size[10] = { 0 };
float sum[10] = { 0 };

for (int i = 0; i < N; ++i)
{
    ++size[id[i]];
    sum[id[i]] += value[i];
}
How should I optimize the loop?
I considered using SSE to add every 4 floats to a sum, so that after N iterations the sum would just be the total of the 4 floats in the xmm register, but this doesn't work when the source is indexed like this and needs to write out to 10 different arrays.
This kind of loop is very hard to optimize using SIMD instructions. Not only is there no easy way in most SIMD instruction sets to do this kind of indexed read ("gather") or write ("scatter"); even if there were, this particular loop still has the problem that you might have two values that map to the same id in one SIMD register, e.g. when
id[0] == 0
id[1] == 1
id[2] == 2
id[3] == 0
in this case, the obvious approach (pseudocode here)
x = gather(size, id[i]);
y = gather(sum, id[i]);
x += 1; // componentwise
y += value[i];
scatter(x, size, id[i]);
scatter(y, sum, id[i]);
won't work either!
You can get by if there's a really small number of possible cases (e.g. assume that sum and size only had 3 elements each) by just doing brute-force compares, but that doesn't really scale.
One way to get this somewhat faster without using SIMD is by breaking up the dependencies between instructions a bit using unrolling:
int size[10] = { 0 }, size2[10] = { 0 };
float sum[10] = { 0 }, sum2[10] = { 0 };  // float, to match the original sum[]
for (int i = 0; i < N/2; i++) {
    int id0 = id[i*2+0], id1 = id[i*2+1];
    ++size[id0];
    ++size2[id1];
    sum[id0]  += value[i*2+0];
    sum2[id1] += value[i*2+1];
}
// if N was odd, process the last element
if (N & 1) {
    ++size[id[N-1]];
    sum[id[N-1]] += value[N-1];
}
// add the partial sums together
for (int i = 0; i < 10; i++) {
    size[i] += size2[i];
    sum[i]  += sum2[i];
}
Whether this helps or not depends on the target CPU though.
Well, you are reading id[i] twice in your loop. You could store it in a variable, or in a register int if you wanted to.
register int index;
for(int i = 0; i < N; ++i)
{
    index = id[i];
    ++size[index];
    sum[index] += value[i];
}
The MSDN docs state this about register:

The register keyword specifies that the variable is to be stored in a machine register. Microsoft Specific: The compiler does not accept user requests for register variables; instead, it makes its own register choices when global register-allocation optimization (the /Oe option) is on. However, all other semantics associated with the register keyword are honored.
Something you can do is compile with the -S flag (or the equivalent if you aren't using gcc) and compare the assembly output at the -O, -O2, and -O3 optimization levels. One common way to optimize a loop is to do some degree of unrolling; for (a very simple, naive) example:
int end = N/2;
int index = 0;
for (int i = 0; i < end; ++i)
{
    index = 2 * i;
    ++size[id[index]];
    sum[id[index]] += value[index];
    index++;
    ++size[id[index]];
    sum[id[index]] += value[index];
}
which will cut the number of cmp instructions in half. However, any half-decent optimizing compiler will do this for you.
Are you sure it will make much difference? The likelihood is that the loading of "id from an external source" will take significantly longer than adding up the values.
Do not optimise until you KNOW where the bottleneck is.
Edit in answer to the comment: You misunderstand me. If it takes 10 seconds to load the ids from a hard disk, then the fractions of a second spent processing the list are immaterial in the grander scheme of things. Let's say it takes 10 seconds to load and 1 second to process:
You optimise the processing loop so it takes 0 seconds (almost impossible, but it's to illustrate a point); then it is STILL taking 10 seconds. 11 seconds really isn't that bad a performance hit, and you would be better off focusing your optimisation time on the actual data load, as this is far more likely to be the slow part.
In fact it can be quite optimal to do double-buffered data loads: i.e., you load buffer 0, then start the load of buffer 1. While buffer 1 is loading you process buffer 0. When finished, you start the load of the next buffer while processing buffer 1, and so on. This way you can completely amortise the cost of processing, as sketched below.
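A sketch of that double-buffered pipeline (entirely my example: load_chunk stands in for the slow external read, and the chunk size and count are arbitrary):

#include <future>
#include <numeric>
#include <vector>
#include <iostream>

// Fabricates a chunk of data; in practice this would be the slow I/O.
static std::vector<int> load_chunk(int n) {
    return std::vector<int>(1000000, n);
}

int main() {
    const int num_chunks = 10;
    long long total = 0;
    // Kick off the first load, then overlap each load with processing.
    auto pending = std::async(std::launch::async, load_chunk, 0);
    for (int n = 0; n < num_chunks; ++n) {
        std::vector<int> current = pending.get();        // wait for this chunk
        if (n + 1 < num_chunks)                          // start the next load now
            pending = std::async(std::launch::async, load_chunk, n + 1);
        total = std::accumulate(current.begin(), current.end(), total);
    }
    std::cout << total << '\n';
}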
Further edit: In fact your best optimisation would probably come from loading things into a set of buckets that eliminates the "id[i]" part of the calculation. You could then simply offload to 3 threads where each uses SSE adds. This way you could have them all going simultaneously and, provided you had at least a triple-core machine, process the whole data in a tenth of the time. Organising data for optimal processing will always allow for the best optimisation, IMO.
Depending on your target machine and compiler, see if you have the _mm_prefetch intrinsic and give it a shot. Back in the Pentium D days, pre-fetching data using the asm instruction for that intrinsic was a real speed win as long as you were pre-fetching a few loop iterations before you needed the data.
See here (Page 95 in the PDF) for more info from Intel.
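Dropped into the question's loop, it could look like this (my sketch; the prefetch distance of 16 iterations is a guess you would tune, not a recommendation from the Intel document):

#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

for (int i = 0; i < N; ++i)
{
    // Hint the caches about data a few iterations ahead. Prefetch hints
    // past the end of the arrays are harmless; they never fault.
    _mm_prefetch((const char *)&id[i + 16], _MM_HINT_T0);
    _mm_prefetch((const char *)&value[i + 16], _MM_HINT_T0);
    ++size[id[i]];
    sum[id[i]] += value[i];
}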
This computation is trivially parallelizable; just add
#pragma omp parallel for reduction(+:size,sum) schedule(static)
immediately above the loop if you have OpenMP support (-fopenmp in GCC.) However, I would not expect much speedup on a typical multicore desktop machine; you're doing so little computation per item fetched that you're almost certainly going to be constrained by memory bandwidth.
If you need to perform the summation several times for a given id mapping (i.e. the value[] array changes more often than id[]), you can halve your memory bandwidth requirements by pre-sorting the value[] elements into id order and eliminating the per-element fetch from id[]:
// Assumes value[] has been pre-sorted into id order and size[] already
// holds the per-id counts; i walks the values, j the ids, k the group end.
int i, j, k;
float tmp;
for (i = 0, j = 0, k = 0; j < 10; sum[j] += tmp, j++)
    for (k += size[j], tmp = 0; i < k; i++)
        tmp += value[i];