I am trying to read a struct from a binary byte buffer using cast and pack.
I wanted to track the worst-case read time from an in-memory buffer, so I timed each read with the chrono high-resolution clock in nanoseconds. Whenever the maximum increased, I printed the value. That gave me a worst case of about 20 microseconds, which is huge considering the size of the struct.
When I measured the average time, it came out to ~20 nanoseconds. Then I counted how often a read exceeded 50 nanoseconds: out of ~20 million reads, only about 500 did.
My question is: what can possibly cause this performance fluctuation, with an average of 20 and a worst case of 20,000?
Secondly, how can I ensure constant-time performance? I am compiling with -O3 and C++11.
// new approach
#pragma pack(push, 1)
typedef struct {
char a;
long b, c;
char d, name[10];
int e , f;
char g, h;
int i, j; // renamed: a second member called h would not compile
} myStruct;
#pragma pack(pop)
//in function where i am using it
auto am1 = chrono::high_resolution_clock::now();
myStruct* tmp = (myStruct*)cTemp;
tmp->name[9] = 0; // name has 10 elements, so index 9 is the last valid one
auto am2 = chrono::high_resolution_clock::now();
chrono::duration<long, nano> arM = chrono::duration_cast<chrono::nanoseconds>(am2 - am1);
if(arM.count() > maxMPO.count())
{
cout << "myStruct read time increased: " << arM.count() << "\n";
maxMPO = arM;
}
I am using g++4.8 with C++11 and an ubuntu server.
what can possibly cause this performance fluctuation: avg of 20 and
worst of 20,000?
On a PC (or Mac, or any desktop), there are Ethernet interrupts, timers, mem-refresh, and dozens of other things going on over which you have no (or very little) control.
You might consider changing the target. If you use a single board computer (SBC) with only static ram, and a network connection which you can turn off and disconnect, and timers and clocks and every other kind of interrupt under your software control, you might achieve an acceptable result.
I once worked with a gal who wrote software for an 8085 SBC. When we hooked up a scope and saw the waveform stability of a software controlled bit, I thought she must have added logic chips. It was amazing.
You simply can not achieve 'jitter' free behaviour on a desktop.
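To see how much of the jitter comes from the platform rather than from the struct read, one can time nothing at all. The following is a minimal sketch, not the asker's code: the names GapStats and measure_clock_gaps are made up for illustration, and it assumes std::chrono::steady_clock is an acceptable stand-in for the high-resolution clock. It reads the clock back to back and records the smallest, largest and mean gap.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Summary of the gaps between consecutive clock reads, in nanoseconds.
// (Hypothetical helper type, introduced only for this sketch.)
struct GapStats {
    std::int64_t min_ns;
    std::int64_t max_ns;
    double mean_ns;
};

// Reads the clock samples + 1 times back to back and records the gaps.
// Nothing is timed except the clock calls themselves, so any large gap
// comes from the machine (interrupts, scheduling, etc.), not from user code.
GapStats measure_clock_gaps(int samples) {
    assert(samples > 0);
    using clock_type = std::chrono::steady_clock;
    std::int64_t lo = std::numeric_limits<std::int64_t>::max();
    std::int64_t hi = 0;
    long double total = 0;
    auto prev = clock_type::now();
    for (int i = 0; i < samples; ++i) {
        auto cur = clock_type::now();
        std::int64_t gap =
            std::chrono::duration_cast<std::chrono::nanoseconds>(cur - prev).count();
        lo = std::min(lo, gap);
        hi = std::max(hi, gap);
        total += gap;
        prev = cur;
    }
    return {lo, hi, static_cast<double>(total / samples)};
}
```

If the maximum gap from a call such as measure_clock_gaps(20000000) is in the same range as the 20 microsecond outliers, the outliers are probably not caused by the struct read itself.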
I have some extremely simple C++ code that I was certain would run 3x faster with multithreading but somehow only runs 3% faster (or less) on both GCC and MSVC on Windows 10.
There are no mutex locks and no shared resources. And I can't see how false sharing or cache thrashing could be at play since each thread only modifies a distinct segment of the array, which has over a billion int values. I realize there are many questions on SO like this but I haven't found any that seem to solve this particular mystery.
One hint might be that moving the array initialization into the loop of the add() function does make the function 3x faster when multithreaded vs single-threaded (~885ms vs ~2650ms).
Note that only the add() function is being timed and takes ~600ms on my machine. My machine has 4 hyperthreaded cores, so I'm running the code with threadCount set to 8 and then to 1.
Any idea what might be going on? Is there any way to turn off (when appropriate) the features in processors that cause things like false sharing (and possibly like what we're seeing here) to happen?
#include <chrono>
#include <iostream>
#include <thread>
void startTimer();
void stopTimer();
void add(int* x, int* y, int threadIdx);
namespace ch = std::chrono;
auto start = ch::steady_clock::now();
const int threadCount = 8;
int itemCount = 1u << 30u; // ~1B items
int itemsPerThread = itemCount / threadCount;
int main() {
int* x = new int[itemCount];
int* y = new int[itemCount];
// Initialize arrays
for (int i = 0; i < itemCount; i++) {
x[i] = 1;
y[i] = 2;
}
// Call add() on multiple threads
std::thread threads[threadCount];
startTimer();
for (int i = 0; i < threadCount; ++i) {
threads[i] = std::thread(add, x, y, i);
}
for (auto& thread : threads) {
thread.join();
}
stopTimer();
// Verify results
for (int i = 0; i < itemCount; ++i) {
if (y[i] != 3) {
std::cout << "Error!";
}
}
delete[] x;
delete[] y;
}
void add(int* x, int* y, int threadIdx) {
int firstIdx = threadIdx * itemsPerThread;
int lastIdx = firstIdx + itemsPerThread - 1;
for (int i = firstIdx; i <= lastIdx; ++i) {
y[i] = x[i] + y[i];
}
}
void startTimer() {
start = ch::steady_clock::now();
}
void stopTimer() {
auto end = ch::steady_clock::now();
auto duration = ch::duration_cast<ch::milliseconds>(end - start).count();
std::cout << duration << " ms\n";
}
You may be simply hitting the memory transfer rate of your machine, you are doing 8GB of reads and 4GB of writes.
On my machine your test completes in about 500ms which is 24GB/s (which is similar to the results given by a memory bandwidth tester).
As you hit each memory address with a single read and a single write the caches aren't much use as you aren't reusing memory.
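The arithmetic behind those figures can be written out as a small sketch. It assumes a 4-byte int (true on common desktop platforms, but not guaranteed by the standard); the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved by one pass of y[i] = x[i] + y[i] over `items` elements:
// two reads (x[i] and y[i]) and one write (y[i]) per element.
std::uint64_t bytes_moved(std::uint64_t items, std::uint64_t bytes_per_item) {
    return 3 * items * bytes_per_item;
}

// Effective bandwidth in GiB/s for a byte count and an elapsed time.
double bandwidth_gib_per_s(std::uint64_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0) / seconds;
}
```

With 2^30 items of 4 bytes each, bytes_moved gives 12 GiB (8 GiB read, 4 GiB written), and 12 GiB in 0.5 s is 24 GiB/s, which matches the figure quoted above.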
Your problem is not the processor; you have run into the limits of RAM read and write speed. Your cache can hold only a few megabytes of data, and you exceed that by far. Multithreading is only useful as long as you can shovel data into your processor fast enough. The cache in your processor is incredibly fast compared to your RAM, so once you exceed the cache capacity the benchmark effectively becomes a RAM test.
If you want to see the advantages of multi-threading, you have to choose data sizes in range of your cache size.
EDIT
Another thing to do, would be to create a higher workload for the cores, so the storage latency goes unrecognized.
Side note: keep in mind that your core has several execution units, one or more for each type of operation (integer, float, shift and so on). That means one core can execute more than one instruction per cycle, in particular one operation per execution unit. You can keep the size of the test data and do more work with it; be creative =) Filling the queue with integer operations only will give you an advantage in multithreading. If you can vary when and where your code performs different kinds of operations, do it; this will also affect the speedup. Or avoid it, if you want to see a nice speedup from multithreading.
To avoid any kind of optimization, you should use randomized test data, so that neither the compiler nor the processor itself can predict the outcome of your operation.
Also avoid doing branches like if and while. Each decision the processor has to predict and execute, will slow you down and alter the result. With branch-prediction, you will never get a deterministic result. Later in a "real" program, be my guest and do what you want. But when you want to explore the multi-threading world, this could lead you to wrong conclusions.
BTW
Please use a delete for every new to avoid memory leaks. Even better, avoid plain pointers, new and delete altogether and use RAII. I advise using std::array or std::vector, or simply any standard library container. This will save you tons of debugging time and headaches.
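As an illustration of the RAII advice, here is a minimal sketch of the same element-wise addition using std::vector instead of new/delete. It is not the asker's code: add_range and parallel_add are names made up for this example, and the partitioning (the last thread takes the remainder) is one possible choice among several.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Adds x into y over the half-open index range [first, last).
void add_range(const std::vector<int>& x, std::vector<int>& y,
               std::size_t first, std::size_t last) {
    for (std::size_t i = first; i < last; ++i) {
        y[i] += x[i];
    }
}

// Splits the index range across thread_count threads. Each thread writes
// to a distinct segment of y; the last thread also takes any remainder.
void parallel_add(const std::vector<int>& x, std::vector<int>& y,
                  unsigned thread_count) {
    assert(thread_count > 0 && x.size() == y.size());
    const std::size_t chunk = x.size() / thread_count;
    std::vector<std::thread> threads;
    threads.reserve(thread_count);
    for (unsigned t = 0; t < thread_count; ++t) {
        const std::size_t first = t * chunk;
        const std::size_t last = (t + 1 == thread_count) ? x.size() : first + chunk;
        threads.emplace_back(add_range, std::cref(x), std::ref(y), first, last);
    }
    for (auto& th : threads) {
        th.join();
    }
}
```

The vectors release their memory automatically when they go out of scope, so no delete[] is needed. This changes nothing about the memory-bound behaviour discussed above; it only removes the manual memory management.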
Speedup from parallelization is limited by the portion of the task that remains serial. This is called Amdahl's law. In your case, a decent amount of that serial time is spent initializing the array.
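Amdahl's law can be stated as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized and n is the number of workers. A small sketch (the function name is made up for illustration):

```cpp
#include <cassert>

// Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the
// parallelizable fraction of the run time and n the number of workers.
double amdahl_speedup(double parallel_fraction, unsigned workers) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0 && workers >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers);
}
```

For example, if only half of the total run time is parallelizable, amdahl_speedup(0.5, 8) is about 1.78, no matter how well the parallel half scales.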
Are you compiling the code with -O3? If so, the compiler might be able to unroll and/or vectorize some of the loops. The loop strides are predictable, so hardware prefetching might help as well.
You might also want to explore whether using all 8 hyperthreads is useful or whether it's better to run 1 thread per core (I am going to guess that since the problem is memory-bound, you'll likely benefit from all 8 hyperthreads).
Nevertheless, you'll still be limited by memory bandwidth. Take a look at the roofline model. It'll help you reason about the performance and what speedup you can theoretically expect. In your case, you're hitting the memory bandwidth wall that effectively limits the ops/sec achievable by your hardware.
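The basic roofline bound is attainable throughput = min(peak compute, memory bandwidth x operational intensity). A minimal sketch follows; the function name and the example numbers in the note below are assumptions for illustration, not measurements.

```cpp
#include <algorithm>
#include <cassert>

// Roofline bound: attainable throughput (in Gops/s) is the smaller of the
// compute peak and the memory bandwidth times the operational intensity
// (operations performed per byte moved to or from memory).
double roofline_gops(double peak_gops, double bandwidth_gb_per_s,
                     double ops_per_byte) {
    assert(peak_gops > 0.0 && bandwidth_gb_per_s > 0.0 && ops_per_byte > 0.0);
    return std::min(peak_gops, bandwidth_gb_per_s * ops_per_byte);
}
```

For y[i] = x[i] + y[i] on 4-byte ints there is one addition per 12 bytes moved, so with a bandwidth of about 24 GB/s the bound is roughly 2 Gops/s regardless of how many cores are adding, as long as the compute peak (a hypothetical 100 Gops/s, say) is higher than that.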
I have decided to compare the times of passing by value and by reference in C++ (g++ 5.4.0) with the following code:
#include <iostream>
#include <sys/time.h>
using namespace std;
int fooVal(int a) {
for (size_t i = 0; i < 1000; ++i) {
++a;
--a;
}
return a;
}
int fooRef(int & a) {
for (size_t i = 0; i < 1000; ++i) {
++a;
--a;
}
return a;
}
int main() {
int a = 0;
struct timeval stop, start;
gettimeofday(&start, NULL);
for (size_t i = 0; i < 10000; ++i) {
fooVal(a);
}
gettimeofday(&stop, NULL);
printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);
gettimeofday(&start, NULL);
for (size_t i = 0; i < 10000; ++i) {
fooRef(a);
}
gettimeofday(&stop, NULL);
printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);
return 0;
}
It was expected that the fooRef execution would take much more time in comparison with fooVal case because of "looking up" referenced value in memory while performing operations inside fooRef. But the result proved to be unexpected for me:
The loop has taken 18446744073708648210 microseconds
The loop has taken 99967 microseconds
And the next time I run the code it can produce something like
The loop has taken 97275 microseconds
The loop has taken 99873 microseconds
Most of the time produced values are close to each other (with fooRef being just a little bit slower), but sometimes outbursts like in the output from the first run can happen (both for fooRef and fooVal loops).
Could you please explain this strange result?
UPD: Optimizations were turned off, O0 level.
If the gettimeofday() function relies on the operating system clock, that clock is not really designed for dealing with microseconds accurately. The clock is typically updated periodically, and only frequently enough to give the appearance of showing seconds accurately for the purpose of working with date/time values. Sampling at the microsecond level may be unreliable for a benchmark such as the one you are performing.
You should be able to work around this limitation by making your test time much longer; for example, several seconds.
Again, as mentioned in other answers and comments, the effects of which type of memory is accessed (register, cache, main, etc.) and whether or not various optimizations are applied, could substantially impact results.
As with working around the time sampling limitation, you might be able to somewhat work around the memory type and optimization issues by making your test data set much larger such that memory optimizations aimed at smaller blocks of memory are effectively bypassed.
Firstly, you should look at the assembly language to see if there are any differences between passing by reference and passing by value.
Secondly, make the functions equivalent by passing by constant reference. Passing by value says that the original variable won't be changed. Passing by constant reference keeps the same principle.
My belief is that the two techniques should be equivalent in both assembly language and performance.
I'm no expert in this area, but I would tend to think that the reason why the two times are somewhat equivalent is due to cache memory.
When you need to access a memory location (say, address 0xaabbc125 on an IA-32 architecture), the CPU copies the surrounding block of memory (a cache line, typically 64 bytes, here addresses 0xaabbc100 to 0xaabbc13f) into your cache. Reading from and writing to main memory is very slow, but once the data has been copied into your cache, you can access it very quickly. This is useful because programs usually need the same range of addresses over and over.
Since you execute the same code over and over and your code doesn't require a lot of memory, the first time the function is executed, the memory block(s) is (are) copied to your cache once, which probably takes most of the 97000 time units. Any subsequent calls to your fooVal and fooRef functions will require addresses that are already in your cache, so they will require only a few nanoseconds (I'd figure roughly between 10ns and 1µs). Thus, dereferencing the pointer (since a reference is implemented as a pointer) takes about double the time compared to just accessing a value, but it's double of not much anyway.
Someone who is more of an expert may have a better or more complete explanation than mine, but I think this could help you understand what's going on here.
A little idea: try to run the fooVal and fooRef functions a few times (say, 10 times) before setting start and beginning the loop. That way (if my explanation is correct!) the memory block should already be in the cache when you begin looping, which means caching won't be included in your times.
About the super-high value you got, I can't explain that. But the value is obviously wrong.
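One likely explanation for the huge value, offered as a sketch rather than a diagnosis: the printed expression subtracts only the tv_usec fields, which wrap around every second. If the second rolls over during the loop, the difference is negative, and printing it with %lu shows it as a huge unsigned number. Indeed, 18446744073708648210 is 2^64 - 903406, i.e. a difference of -903406 microseconds, which corresponds to a real elapsed time of 1000000 - 903406 = 96594 microseconds. The helper below (elapsed_us is a made-up name) takes tv_sec into account as well.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/time.h>

// Elapsed microseconds between two gettimeofday() samples. Uses both the
// seconds and the microseconds fields, so it stays correct when the
// interval crosses a one-second boundary.
std::int64_t elapsed_us(const timeval& start, const timeval& stop) {
    // A valid timeval keeps tv_usec in the range [0, 1000000).
    assert(start.tv_usec >= 0 && start.tv_usec < 1000000);
    assert(stop.tv_usec >= 0 && stop.tv_usec < 1000000);
    return (static_cast<std::int64_t>(stop.tv_sec) - start.tv_sec) * 1000000
         + (static_cast<std::int64_t>(stop.tv_usec) - start.tv_usec);
}
```

Printing the result with a signed format (for example casting to long long and using %lld) also avoids the unsigned wrap-around in the output.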
It's not a bug, it's a feature! =)
Given a while loop and the function ordering as follows:
int k=0;
int total=100;
while(k<total){
doSomething();
if(approx. t milliseconds elapsed) { measure(); }
++k;
}
I want to perform 'measure' every t-th milliseconds. However, since 'doSomething' can be close to the t-th millisecond from the last execution, it is acceptable to perform the measure after approximately t milliseconds elapsed from the last measure.
My question is: how could this be achieved?
One solution would be to set a timer to zero and check it after every 'doSomething'. When it is within the acceptable range, I perform the measurement and reset the timer. However, I'm not sure which C++ function I should use for such a task. As far as I can see, there are several candidate functions, but the debate on which one is the most appropriate is beyond my understanding. Note that some of the functions actually take into account the time taken by other processes, but I want my timer to measure only the execution time of my C++ code (I hope that is clear). Another thing is the resolution of the measurements, as pointed out below. Suppose the medium option of those suggested.
High resolution timing is platform specific, and you have not specified in the question. The standard library clock() function returns a count that increments at CLOCKS_PER_SEC per second. On some platforms this may be fast enough to give you the resolution you need but you should check your system's tick rate since it is implementation defined. However if you find it is high enough then:
#define SAMPLE_PERIOD_MS 100
#define SAMPLE_PERIOD_TICKS ((CLOCKS_PER_SEC * SAMPLE_PERIOD_MS) / 1000)
int k=0;
int total=100;
clock_t measure_time = clock() + SAMPLE_PERIOD_TICKS ;
while(k<total)
{
doSomething();
if( clock() - measure_time > 0 )
{
measure();
measure_time += SAMPLE_PERIOD_TICKS ;
++k;
}
}
You might replace clock() with some other high-resolution clock source if necessary.
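As one possible replacement, here is a minimal sketch of the same deadline logic built on std::chrono::steady_clock from C++11. PeriodicTrigger is a made-up class name; the current time is passed in by the caller so the logic can be exercised without actually waiting.

```cpp
#include <cassert>
#include <chrono>

// Fires at most once per period. The caller supplies the current time,
// normally std::chrono::steady_clock::now().
class PeriodicTrigger {
public:
    using clock_type = std::chrono::steady_clock;

    PeriodicTrigger(clock_type::time_point start, clock_type::duration period)
        : next_(start + period), period_(period) {
        assert(period > clock_type::duration::zero());
    }

    // Returns true once the current deadline has been reached, and moves
    // the deadline forward by one period (relative to the previous
    // deadline, so late checks do not accumulate drift).
    bool due(clock_type::time_point now) {
        if (now < next_) {
            return false;
        }
        next_ += period_;
        return true;
    }

private:
    clock_type::time_point next_;
    clock_type::duration period_;
};
```

In the loop above this would be used as: construct PeriodicTrigger trigger(std::chrono::steady_clock::now(), std::chrono::milliseconds(100)) before the loop, then call measure() whenever trigger.due(std::chrono::steady_clock::now()) returns true after doSomething(). Note that steady_clock measures elapsed wall time, not the CPU time of the process.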
However, note a couple of issues. This method is a "busy-loop"; unless either doSomething() or measure() yields the CPU, the process will take all the CPU cycles it can. If this is the only code running on a target, that may not matter. On the other hand, if this is running on a general-purpose OS such as Windows or Linux, which are not real-time, the process may be pre-empted by other processes, and this may affect the accuracy of the sampling period. If you need accurate timing, using an RTOS and performing doSomething() and measure() in separate threads would be better. Even on a GPOS that would be better. For example, a general pattern (using a made-up API in the absence of any specification) would be:
int main()
{
StartThread( measure_thread, HIGH_PRIORITY ) ;
for(;;)
{
doSomething() ;
}
}
void measure_thread()
{
for(;;)
{
measure() ;
sleep( SAMPLE_PERIOD_MS ) ;
}
}
The code for measure_thread() is only accurate if measure() takes a negligible time to run. If it takes significant time, you may need to account for that. If it is non-deterministic, you may even have to measure its execution time in order to subtract it from the sleep period.
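With C++11, one way to account for the run time of measure() without timing it explicitly is to sleep until an absolute deadline instead of sleeping for a fixed duration. The following is a sketch under that assumption; run_periodic is a made-up name, and the repetition count exists only so the loop terminates in an example.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Calls measure() `repetitions` times, one period apart. Each deadline is
// computed from the previous deadline rather than from "now", so the time
// spent inside measure() is absorbed into the period instead of being
// added on top of it.
void run_periodic(const std::function<void()>& measure,
                  std::chrono::milliseconds period, int repetitions) {
    assert(repetitions >= 0 && period.count() > 0);
    auto next = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i) {
        measure();
        next += period;
        std::this_thread::sleep_until(next);
    }
}
```

This function could be run on its own std::thread in place of measure_thread(). On a general-purpose OS the wake-up time is still only approximate, and if measure() takes longer than one period the deadlines will already have passed and the calls run back to back.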
In the book Game Coding Complete, 3rd Edition, the author mentions a technique to both reduce data structure size and increase access performance. In essence it relies on the fact that you gain performance when member variables are memory aligned. This is an obvious potential optimization that compilers would take advantage of, but by making sure each variable is aligned they end up bloating the size of the data structure.
Or that was his claim at least.
The real performance increase, he states, comes from using your brain and ensuring that your structure is properly designed to take advantage of speed increases while preventing the compiler bloat. He provides the following code snippet:
#pragma pack( push, 1 )
struct SlowStruct
{
char c;
__int64 a;
int b;
char d;
};
struct FastStruct
{
__int64 a;
int b;
char c;
char d;
char unused[ 2 ]; // fill to 8-byte boundary for array use
};
#pragma pack( pop )
Using the above struct objects in an unspecified test he reports a performance increase of 15.6% (222ms compared to 192ms) and a smaller size for the FastStruct. This all makes sense on paper to me, but it fails to hold up under my testing:
Same time results and size (counting for the char unused[ 2 ])!
Now if the #pragma pack( push, 1 ) is isolated only to FastStruct (or removed completely) we do see a difference:
So, finally, here lies the question: Do modern compilers (VS2010 specifically) already optimize for alignment, hence the lack of performance increase (but increase the structure size as a side effect, as Mike McShaffry stated)? Or is my test not intensive enough, or too inconclusive, to return any significant results?
For the tests I did a variety of tasks from math operations, column-major multi-dimensional array traversing/checking, matrix operations, etc. on the unaligned __int64 member. None of which produced different results for either structure.
In the end, even if there was no performance increase, this is still a useful tidbit to keep in mind for keeping memory usage to a minimum. But I would love it if there was a performance boost (no matter how minor) that I am just not seeing.
It is highly dependent on the hardware.
Let me demonstrate:
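#include <cstring>   // memset
#include <ctime>     // clock, clock_t, CLOCKS_PER_SEC
#include <iostream>  // cout, endl
using namespace std;
#pragma pack( push, 1 )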
struct SlowStruct
{
char c;
__int64 a;
int b;
char d;
};
struct FastStruct
{
__int64 a;
int b;
char c;
char d;
char unused[ 2 ]; // fill to 8-byte boundary for array use
};
#pragma pack( pop )
int main (void){
int x = 1000;
int iterations = 10000000;
SlowStruct *slow = new SlowStruct[x];
FastStruct *fast = new FastStruct[x];
// Warm the cache.
memset(slow,0,x * sizeof(SlowStruct));
clock_t time0 = clock();
for (int c = 0; c < iterations; c++){
for (int i = 0; i < x; i++){
slow[i].a += c;
}
}
clock_t time1 = clock();
cout << "slow = " << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
// Warm the cache.
memset(fast,0,x * sizeof(FastStruct));
time1 = clock();
for (int c = 0; c < iterations; c++){
for (int i = 0; i < x; i++){
fast[i].a += c;
}
}
clock_t time2 = clock();
cout << "fast = " << (double)(time2 - time1) / CLOCKS_PER_SEC << endl;
// Print to avoid Dead Code Elimination
__int64 sum = 0;
for (int c = 0; c < x; c++){
sum += slow[c].a;
sum += fast[c].a;
}
cout << "sum = " << sum << endl;
return 0;
}
Core i7 920 @ 3.5 GHz
slow = 4.578
fast = 4.434
sum = 99999990000000000
Okay, not much difference. But it's still consistent over multiple runs. So the alignment makes a small difference on Nehalem Core i7.
Intel Xeon X5482 Harpertown @ 3.2 GHz (Core 2 - generation Xeon)
slow = 22.803
fast = 3.669
sum = 99999990000000000
Now take a look...
6.2x faster!!!
Conclusion:
You see the results. You decide whether or not it's worth your time to do these optimizations.
EDIT :
Same benchmarks but without the #pragma pack:
Core i7 920 @ 3.5 GHz
slow = 4.49
fast = 4.442
sum = 99999990000000000
Intel Xeon X5482 Harpertown @ 3.2 GHz
slow = 3.684
fast = 3.717
sum = 99999990000000000
The Core i7 numbers didn't change. Apparently it can handle misalignment without trouble for this benchmark.
The Core 2 Xeon now shows the same times for both versions. This confirms that misalignment is a problem on the Core 2 architecture.
Taken from my comment:
If you leave out the #pragma pack, the compiler will keep everything aligned so you don't see this issue. So this is actually an example of what could happen if you misuse #pragma pack.
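The misalignment can be made visible with offsetof. The sketch below uses std::int64_t in place of the MSVC-specific __int64; the struct names PackedSlow and DefaultSlow and the helper is_aligned are made up for illustration, and the sizes mentioned in the note assume a 4-byte int.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same member order as SlowStruct, with packing forced to 1 byte.
#pragma pack(push, 1)
struct PackedSlow {
    char c;
    std::int64_t a;  // lands at offset 1: misaligned
    int b;
    char d;
};
#pragma pack(pop)

// Same member order, default compiler alignment.
struct DefaultSlow {
    char c;
    std::int64_t a;  // compiler inserts padding so this is aligned
    int b;
    char d;
};

// True when an offset is a multiple of the given alignment.
inline bool is_aligned(std::size_t offset, std::size_t alignment) {
    assert(alignment > 0);
    return offset % alignment == 0;
}
```

Here offsetof(PackedSlow, a) is 1 and sizeof(PackedSlow) is 14, so every second element in an array also starts at an odd multiple of 2 bytes. With default alignment the 64-bit member sits at a properly aligned offset, at the cost of padding bytes.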
Such hand-optimizations are generally long dead. Alignment is only a serious consideration if you're packing for space, or if you have an enforced-alignment type like SSE types. The compiler's default alignment and packing rules are intentionally designed to maximize performance, obviously, and whilst hand-tuning them can be beneficial, it's not generally worth it.
Probably, in your test program, the compiler never stored any structure on the stack and just kept the members in registers, which do not have alignment, which means that it's fairly irrelevant what the structure size or alignment is.
Here's the thing: There can be aliasing and other nasties with sub-word accessing, and it's no slower to access a whole word than to access a sub-word. So in general, it's no more efficient, in time, to pack more tightly than word size if you're only accessing, say, one member.
Visual Studio is a great compiler when it comes to optimization. However, bear in mind that the current "Optimization War" in game development is not on the PC arena. While such optimizations may quite well be dead on the PC, on the console platforms it's a completely different pair of shoes.
That said, you might want to repost this question on the specialized gamedev stackexchange site, you might get some answers directly from "the field".
Finally, your results are exactly the same up to the microsecond which is dead impossible on a modern multithreaded system -- I'm pretty sure you either use a very low resolution timer, or your timing code is broken.
Modern compilers align members on different byte boundaries depending on the size of the member. See the bottom of this.
Normally you really shouldn't care about structure padding, but if you have an object that is going to have 1000000 instances or so, the rule of thumb is simply to order your members from biggest to smallest. I wouldn't recommend messing with the padding with #pragma directives.
The compiler is going to optimize for either size or speed, and unless you explicitly tell it which, you won't know what you get. But if you follow the advice of that book you will win on both counts on most compilers. Put the biggest, aligned things first in your struct, then the half-size things, then the single-byte things if any, and add some dummy variables to align. Using bytes for things that don't have to be bytes can be a performance hit anyway; as a compromise, use ints for everything (you have to know the pros and cons of doing that).
The x86 has made for a lot of bad programmers and compilers because it allows unaligned accesses, making it hard for many folks to move to other platforms (which are taking over). Although unaligned accesses work on an x86, you take a serious performance hit. That is why it is important to know how compilers work, both in general and the particular one you are using.
With caches, and with modern computer platforms relying on caches to get any kind of performance, you want to be both aligned and packed. The simple rule being taught gives you both, in general. It is very good advice. Adding compiler-specific pragmas is not nearly as good: it makes the code non-portable, and it doesn't take much searching through SO or googling to find out how often the compiler ignores the pragma or doesn't do what you really wanted.
On some platforms the compiler doesn't have an option: objects of types bigger than char often have strict requirements to be at a suitably aligned address. Typically the alignment requirements are identical to the size of the object up to the size of the biggest word supported by the CPU natively. That is short typically requires to be at an even address, long typically requires to be at an address divisible by 4, double at an address divisible by 8, and e.g. SIMD vectors at an address divisible by 16.
Since C and C++ require ordering of members in the order they are declared, the size of structures will differ quite a bit on the corresponding platforms. Since bigger structures effectively cause more cache misses, page misses, etc., there will be a substantial performance degradation when creating bigger structures.
Since I saw a claim that it doesn't matter: it matters on most (if not all) systems I'm using. Here is a simple example showing the different sizes. How much this affects performance obviously depends on how the structures are used.
#include <iostream>
struct A
{
char a;
double b;
char c;
double d;
};
struct B
{
double b;
double d;
char a;
char c;
};
int main()
{
std::cout << "sizeof(A) = " << sizeof(A) << "\n";
std::cout << "sizeof(B) = " << sizeof(B) << "\n";
}
./alignment.tsk
sizeof(A) = 32
sizeof(B) = 24
The C standard specifies that fields within a struct must be allocated at increasing addresses. A struct which has eight variables of type 'int8' and seven variables of type 'int64', stored in that order, will take 64 bytes (pretty much regardless of a machine's alignment requirements). If the fields were ordered 'int8', 'int64', 'int8', ... 'int64', 'int8', the struct would take 120 bytes on a platform where 'int64' fields are aligned on 8-byte boundaries. Reordering the fields yourself will allow them to be packed more tightly. Compilers, however, will not reorder fields within a struct absent explicit permission to do so, since doing so could change program semantics.
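The two layouts described above can be checked directly. This sketch assumes a platform where std::int64_t is aligned on an 8-byte boundary (as on typical 64-bit desktop systems); the struct and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Eight 1-byte fields first, then seven 8-byte fields: no padding needed.
struct Grouped {
    std::int8_t  s0, s1, s2, s3, s4, s5, s6, s7;
    std::int64_t w0, w1, w2, w3, w4, w5, w6;
};

// The same fifteen fields, alternating small and large. With 8-byte
// alignment for int64_t, each 1-byte field is followed by 7 padding bytes.
struct Interleaved {
    std::int8_t s0; std::int64_t w0;
    std::int8_t s1; std::int64_t w1;
    std::int8_t s2; std::int64_t w2;
    std::int8_t s3; std::int64_t w3;
    std::int8_t s4; std::int64_t w4;
    std::int8_t s5; std::int64_t w5;
    std::int8_t s6; std::int64_t w6;
    std::int8_t s7;
};

// Bytes of a struct that hold no member data.
inline std::size_t padding_bytes(std::size_t struct_size, std::size_t payload) {
    assert(struct_size >= payload);
    return struct_size - payload;
}
```

Both structs carry 8 x 1 + 7 x 8 = 64 bytes of data. Under the stated alignment assumption, sizeof(Grouped) is 64 and sizeof(Interleaved) is 120, so the interleaved order spends 56 bytes on padding.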
All 10 questions, worth 5 marks each, need to be answered within a time limit, so the time consumed for each question and the remaining time should be displayed. Can anybody help?
A portable C++ solution would be to use chrono::steady_clock to measure time. This is available in C++11 in the header <chrono>, but may well be available to older compilers in TR1 in <tr1/chrono> or boost.chrono.
The steady clock always advances at a rate "as uniform as possible", which is an important consideration on a multi-tasking multi-threaded platform. The steady clock is also independent of any sort of "wall clock", like the system clock (which may be arbitrarily manipulated at any time).
(Note: if steady_clock isn't in your implementation, look for monotonic_clock.)
The <chrono> types are a bit fiddly to use, so here is a sample piece of code that returns a steady timestamp (or rather, a timestamp from whichever clock you like, e.g. the high_resolution_clock):
template <typename Clock>
long long int clockTick(int multiple = 1000)
{
typedef typename Clock::period period;
return (Clock::now().time_since_epoch().count() * period::num * multiple) / period::den;
}
typedef std::chrono::monotonic_clock myclock; // old
typedef std::chrono::steady_clock yourclock; // C++11
Usage:
long long int timestamp_ms = clockTick<myclock>(); // milliseconds by default
long long int timestamp_s = clockTick<yourclock>(1); // seconds
long long int timestamp_us = clockTick<myclock>(1000000); // microseconds
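Applied to the quiz in the question, the bookkeeping could look roughly like the sketch below. QuizTimer is a made-up class, the time limit is whatever the quiz requires, and the current time is passed in by the caller (normally std::chrono::steady_clock::now()) so the logic is easy to check.

```cpp
#include <cassert>
#include <chrono>

// Tracks the time spent per question and the time left overall.
class QuizTimer {
public:
    using clock_type = std::chrono::steady_clock;

    QuizTimer(clock_type::time_point start, std::chrono::seconds limit)
        : start_(start), question_start_(start), limit_(limit) {
        assert(limit.count() > 0);
    }

    // Call when a question has been answered. Returns the whole seconds
    // spent on it and starts timing the next question.
    std::chrono::seconds finish_question(clock_type::time_point now) {
        auto spent =
            std::chrono::duration_cast<std::chrono::seconds>(now - question_start_);
        question_start_ = now;
        return spent;
    }

    // Whole seconds left before the overall limit, clamped at zero.
    std::chrono::seconds remaining(clock_type::time_point now) const {
        auto used = std::chrono::duration_cast<std::chrono::seconds>(now - start_);
        return used >= limit_ ? std::chrono::seconds(0) : limit_ - used;
    }

private:
    clock_type::time_point start_;
    clock_type::time_point question_start_;
    std::chrono::seconds limit_;
};
```

After each answer, the program would print finish_question(now).count() as the time consumed and remaining(now).count() as the time left, and stop the quiz once remaining reaches zero.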
Use time().
This has the limitation that Kerrek has pointed out in his answer. But it's also very simple to use.