Use a precomputed array or a function? [closed] - c++

What's faster when compiling with GCC in C++:
using a precomputed array of 2 columns and 300 rows,
or using a third-degree polynomial, such as "x^3 + x^2 + x + 40"?
(Sorry for my English)
Edit:
Is it faster to search in an array (input the value in the first column, output the value in the second column),
or to use a function (the input and output of the polynomial is obvious)?
Edit 2:
The array is accessed using an index.

I think he is trying to compare the speed of a polynomial computation with that of a lookup table.
It depends. The lookup table is stored in memory and a load instruction will be involved. If the table is not cached, expect a long delay from memory.
If you need to access the lookup table frequently and many times, and the table is of reasonable size, use the lookup table; very likely it will stay cached. If you are able to store the table on the stack, do so, since data on the stack is more likely to be cached than data on the heap.
On the other hand, if the calculation is not frequent, using the polynomial computation is fine. It saves some memory and makes your code more readable.

You should actually profile the code.
Polynomial Evaluation
The function is:
int Evaluate_Polynomial(int x)
{
    register const int term1 = x * x * x;
    register const int term2 = x * x;
    register const int result = term1 + term2 + x + 40;
    return result;
}
Note: In the above function, register is used to remind the compiler to use registers, even in the unoptimized version (a.k.a. debug).
Without optimization, the above function has 3 multiply operations and 3 addition operations, for a total of 6 data processing operations (not including load or store).
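For what it's worth, the multiplication count can be reduced by rewriting the polynomial in Horner form, which also maps naturally onto multiply-accumulate instructions; a minimal sketch (the function name is mine, not from the question):
int Evaluate_Polynomial_Horner(int x)
{
    // ((x + 1) * x + 1) * x + 40  ==  x^3 + x^2 + x + 40
    // Two multiplications and three additions instead of three of each.
    return ((x + 1) * x + 1) * x + 40;
}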
Table Lookup
The function is:
int Table_Lookup_Polynomial(int x)
{
    int result = 0;
    if ((x >= 0) && (x < 300))   // valid indices for a 300-row table
    {
        result = table[x];
    }
    else
    {
        // Handle array index out of bounds
    }
    return result;
}
In the above example, there are two comparisons (conditional branches) and a pointer dereference. More importantly, there is a need for error handling.
Summary
The polynomial version may contain more instructions, but they are data processing instructions and can be easily inlined. They do not cause a processor's instruction cache to be reloaded.
The table lookup needs to perform boundary checking. The boundary checking will cause the processor to pause and maybe reload the instruction pipeline, which takes time. The error checking and handling may cause issues during maintenance if the range is ever changed.
The functions should be profiled to verify which approach is faster. The time lost to altering the execution flow may exceed the time spent on the pure data-processing instructions of the polynomial evaluation.
The compiler may be able to use special processor functions to make the polynomial evaluation faster. For example, the ARM7 processor has instructions that can perform a multiply and add together.
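A minimal profiling sketch along those lines, assuming the two functions above (and the table they use) are defined; the volatile sink keeps the compiler from discarding the loops:
#include <chrono>
#include <cstdio>

volatile int sink; // prevents the compiler from optimizing the loops away

template <typename F>
long long time_ns(F f)
{
    auto start = std::chrono::steady_clock::now();
    for (int x = 0; x < 300; ++x)
        sink = f(x);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

int main()
{
    std::printf("polynomial: %lld ns\n", time_ns(Evaluate_Polynomial));
    std::printf("table:      %lld ns\n", time_ns(Table_Lookup_Polynomial));
}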

Computing a third-degree polynomial will nearly always be faster.
People seem to forget that they need to search for a value in this "look-up" table, which is O(log N).
Evaluating a third-degree polynomial is trivial enough that the table would have to be uselessly small to outperform it.
The table has a chance only if you store exactly the values for the arguments you will look for and you know where in the table they reside, so that you do not have to perform a search. That would make it a real look-up table, and it would probably be faster.
An example I know of where tables are indeed used is computing the sine function with high precision (more on wiki), although there the computation would be really expensive.

Related

Code speed difference when using a random array index

Given a real number X within [0,1], after a specific binning I have to identify in what bin X falls. Given the bin size dx, I am using i = std::size_t(X/dx) , which works very well. I then look for the respective value of a given array v and set a second variable Y using double Y=v[i]. The whole code looks as follows:
double X = func();
double dx = 0.01;
int i = std::size_t(X / dx);
double Y = v[i];
std::cout << Y << '\n';
This method correctly gives the expected value for the index i within the range [0, length(v)].
My main issue is not with finding the index, but using it: X is determined from an auxiliary function, and whenever I need to set Y=v[i] using the index determined above the code becomes extremely slow.
Without commenting out or removing any of the lines, the code becomes much faster when I set X to some random value between 0 and 1 right after its definition, or set i to some random value between 0 and the length of v after the third line.
Could anyone tell me why this occurs? The speed changes by a factor of 1000, if not more, and since the faster method only adds steps and func() is called anyway, I can't understand why it should become faster.
Since you have put no code in the question, one can only make wild guesses like these:
You didn't sort all the X results before accessing the lookup table; processing a sorted array is faster.
Some of the X values were denormalized, which takes a toll on computation time on certain CPU types, including yours (see the sketch after this list).
The dataset is too big for the L3 cache, so the code always went to RAM instead of getting the quick cache hits seen in the other test.
The compiler optimized all of the expensive function calls out, but in the real-world test scenario it cannot.
The time measurement has bugs.
The computer is not stable in performance (for example, it is a shared server, or an antivirus is eating RAM bandwidth).
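For the denormal guess in particular, it is easy to check whether the values feeding the index are subnormal; a minimal sketch using std::fpclassify:
#include <cmath>
#include <cstdio>

// Returns true if x is a subnormal (denormalized) double, which many CPUs
// handle through a much slower path.
bool is_denormal(double x)
{
    return std::fpclassify(x) == FP_SUBNORMAL;
}

int main()
{
    double tiny = 1e-310;                     // below DBL_MIN, stored as a subnormal
    std::printf("%d\n", is_denormal(tiny));   // prints 1
    std::printf("%d\n", is_denormal(0.5));    // prints 0
}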

Why does LLVM choose open-addressing hash table to implement llvm::StringMap?

Many sources say open-addressing, the hash collision handling approach used in llvm::StringMap, is not stable. Open-addressing is said to be inferior to chaining when the load factor is high (which is imaginable).
But if the load factor is low, there will be a huge memory waste for open-addressing, because I have to allocate Bucket_number * sizeof(Record) bytes in memory, even if the majority of buckets do not hold a record.
So my question is, what is the reason for LLVM to choose open-addressing over separate-chaining? Is it merely because of the advantage in speed gained by cache locality (records are stored in buckets themselves)?
Thanks:)
Edit: The C++11 standard's requirements on std::unordered_set and std::unordered_map imply the chaining approach, not open-addressing. Why does LLVM choose a hash collision handling method that can't even satisfy the C++ standard? Are there any special use cases for llvm::StringMap that warrant this deviation? (Edit: this slide deck compares the performance of several LLVM data structures with that of STL counterparts)
Another question, by the way:
How does llvm::StringMap guarantee that keys' hash values are not recomputed when growing?
The manual says:
hash table growth does not recompute the hash values for strings already in the table.
Let us look at the implementation. Here we see that the table is stored as parallel arrays: an array of indirect pointers to the records and an array of cached 32-bit hash codes, that is, separate arrays rather than an array of structures.
Effectively:
struct StringMap {
    uint32_t hashcode[CAPACITY];
    StringMapEntry *hashptr[CAPACITY];
};
Except that the capacity is dynamic and the load factor would appear to be maintained at between 37.5% and 75% of capacity.
For N records and a load factor F, this yields N/F pointers plus N/F integers for the open-addressed implementation, compared to N*(1+1/F) pointers plus N integers for the equivalent chained implementation. On a typical 64-bit system the open-address version is between ~4% larger and ~30% smaller.
However, as you rightly suspected, the main advantage here lies in the cache effects. Beyond reducing cache pressure by shrinking the data on average, filtering out collisions boils down to a linear reprobe over consecutive 32-bit hash codes, without examining any further information. Rejecting a collision is therefore much faster than in the chained case, where the link must be followed into what is likely uncached storage, and so a significantly higher load factor can be used. On the flip side, one additional likely cache miss is taken on the pointer lookup table, but this is a constant that does not degrade with load, equivalent to one chained collision.
Effectively:
StringMapEntry *StringMap::lookup(const char *text) {
    const uint32_t hash_value = hash_function(text);
    for (uint32_t *scan = &hashcode[hash_value % CAPACITY]; *scan != SENTINEL; ++scan) {
        if (hash_value == *scan) {
            StringMapEntry *entry = hashptr[scan - hashcode];
            if (!std::strcmp(entry->text, text))
                return entry;
        }
    }
    return nullptr;
}
Plus subtleties such as wrapping.
As for your second question, the optimization is to precalculate and store the hash codes. This wastes some storage but avoids the expensive operation of examining potentially long, variable-length strings unless a match is almost certain. In degenerate cases (complex mangled template names) keys may be hundreds of characters long.
A further optimization in RehashTable is to use a power-of-two table size instead of a prime. This ensures that growing the table effectively brings one additional hash-code bit into play and de-interleaves the doubled table into two consecutive target arrays, effectively rendering the operation a cache-friendly linear sweep.
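A sketch of why the cached hash codes can be reused across growth when the capacity is a power of two (illustrative only, not the actual LLVM code):
#include <cstdint>

// With a power-of-two capacity the bucket index is just a mask of the cached
// full hash code, so no string is ever rehashed when the table grows.
uint32_t bucket_index(uint32_t cached_hash, uint32_t capacity /* power of two */)
{
    return cached_hash & (capacity - 1);
}

// After doubling, each old bucket i splits into new buckets i and
// i + old_capacity, depending on the next hash bit that comes into play.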

Parallelizing a large dynamic program

I have a high-performance dynamic program in C++ whose results are placed in an M × N table, which is roughly on the order of 2000 rows × 30000 columns.
Each entry (r, c) depends on a few of the rows in a few other columns in the table.
The most obvious way to parallelize the computation of row r across P processors is to statically partition the data: i.e., have processor p compute the entries (r, p + k P) for all valid k.
However, the entries in different columns take somewhat different amounts of time to compute (e.g., one might take five times as long as another), so I would like to partition the data dynamically to achieve better performance, avoiding the stalling of CPUs that finish early by having them steal work from CPUs that are still catching up.
One way to approach this is to keep an atomic global counter that specifies the number of columns already computed, and to increase it each time a CPU needs more work.
However, this forces all CPUs to access the same global counter after computing every entry in the table -- i.e., it serializes the program to some extent. Since computing each entry is more or less a quick process, this is somewhat undesirable.
So, my question is:
Is there a way to perform this dynamic partitioning in a more scalable fashion (i.e. without having to access a single global counter after computing every entry)?
I assume you're using a second array for the new values. If so, it sounds like the looping constructs from either TBB or Cilk Plus would do. Both use work stealing to apportion the work among the available processors, and when one processor runs out of work it steals work from processors that still have work available. This evens out the "chunkiness" of the data.
To use Cilk, you'll need a compiler that supports Cilk Plus. Both GCC 4.9 and the Intel compiler support it. Typically you'd write something like:
cilk_for (int x = 0; x < XMAX; x++) {
    for (int y = 0; y < YMAX; y++) {
        perform-the-calculation;
    }
}
TBB's constructs are similar.
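For reference, a minimal sketch of the TBB equivalent (XMAX, YMAX and the loop body are placeholders):
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void compute_all_columns(int XMAX, int YMAX)
{
    // TBB splits [0, XMAX) into chunks and lets idle workers steal chunks.
    tbb::parallel_for(tbb::blocked_range<int>(0, XMAX),
        [&](const tbb::blocked_range<int>& range) {
            for (int x = range.begin(); x != range.end(); ++x)
                for (int y = 0; y < YMAX; ++y)
                    ; // perform-the-calculation
        });
}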
Another approach is to "tile" the calculation in a cache-oblivious way. That is, you implement your own divide-and-conquer algorithm that will break the calculation into chunks that will be cache-efficient. See http://www.1024cores.net/home/parallel-computing/cache-oblivious-algorithms/implementation for more information on cache-oblivious algorithms.
Barry

Efficiency in C and C++

So my teacher tells me that I should compute intermediate results as needed on the fly rather than storing them, because the speed of processors nowadays is much faster than the speed of memory.
So when we compute an intermediate result, we also need to use some memory, right? Can anyone please explain it to me?
Your teacher is right: the speed of processors nowadays is much faster than the speed of memory. Access to RAM is slower than access to internal memory: caches, registers, etc.
Suppose you want to compute a trigonometric function: sin(x). To do this you can either call a function (math library offers one, or implement your own) which is computing the value; or you can use a lookup table stored in memory to get the result which means storing the intermediate values (sort of).
Calling a function will result in executing a number of instructions, while using a lookup table results in fewer instructions (getting the address of the LUT, computing the offset of the desired element, reading from address + offset). In this case, storing the intermediate values is faster.
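A rough sketch of the sin(x) lookup-table idea, for illustration only (a real implementation would normally also interpolate between entries):
#include <cmath>

const int TABLE_SIZE = 1024;
const double TWO_PI = 2.0 * std::acos(-1.0);
double sin_table[TABLE_SIZE];              // filled once at startup

void init_sin_table()
{
    for (int i = 0; i < TABLE_SIZE; ++i)
        sin_table[i] = std::sin(TWO_PI * i / TABLE_SIZE);
}

// Nearest-entry lookup for x in [0, 2*pi): one multiply, one cast, one load.
double fast_sin(double x)
{
    int i = static_cast<int>(x / TWO_PI * TABLE_SIZE) % TABLE_SIZE;
    return sin_table[i];
}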
But if you were to do c = a+b, computing the value will be much faster than reading it from somewhere in RAM. Notice that in this case the number of instructions to be executed would be similar.
So while it is true that access to RAM is slower, whether it is worth accessing RAM instead of doing the computation is a sensible question, and several things need to be considered: the number of instructions to be executed, whether the computation happens in a loop, whether you can take advantage of the architecture's pipeline, cache memory, etc.
There is no one answer, you need to analyze each situation individually.
Your teacher's advice is oversimplifying advice on a complex topic.
If you think of "intermediate" as a single term (in the arithmetical sense of the word), then ask yourself, is your code re-using that term anywhere else ? I.e. if you have code like:
void calculate_sphere_parameters(double radius, double & area, double & volume)
{
    area = 4 * (4 * atan(1)) * radius * radius;
    volume = 4 * (4 * atan(1)) * radius * radius * radius / 3;
}
should you instead write:
void calculate_sphere_parameters(double radius, double & area, double & volume)
{
    double quarter_pi = atan(1);
    double pi = 4 * quarter_pi;
    double four_pi = 4 * pi;
    double four_thirds_pi = four_pi / 3;
    double radius_squared = radius * radius;
    double radius_cubed = radius_squared * radius;
    area = four_pi * radius_squared;
    volume = four_thirds_pi * radius_cubed; // maybe use "(area * radius) / 3" ?
}
It's not unlikely that a modern optimizing compiler will emit the same binary code for these two. I leave it to the reader to determine which they prefer to see in the source code ...
The same is true for a lot of simple arithmetics (at the very least, if no function calls are involved in the calculation). In addition to that, modern compilers and/or CPU instruction sets might have the ability to do "offset" calculations for free, i.e. something like:
for (int i = 0; i < N; i++) {
    do_something_with(i, i + 25, i + 314159);
}
will turn out the same as:
for (int i = 0; i < N; i++) {
    int j = i + 25;
    int k = i + 314159;
    do_something_with(i, j, k);
}
So the main rule should be, if your code's readability doesn't benefit from creating a new variable to hold the result of a "temporary" calculation, it's probably overkill to use one.
If, on the other hand, you're using i + 12345 a dozen times in ten lines of code ... name it, and comment why this strange hardcoded offset is so important.
Remember, just because your source code contains a variable doesn't mean the binary code emitted by the compiler will allocate memory for that variable. The compiler might conclude that the value isn't even used (and completely discard the calculation assigning it), or it might conclude that it's "only an intermediate" (never used later, where it would have to be retrieved from memory) and so keep it in a register, to be overwritten after its "last use". It is far more efficient to calculate a value like i + 1 each time you need it than to retrieve it from a memory location.
My advice would be:
keep your code readable first and foremost - too many variables rather obscure than help.
don't bother saving "simple" intermediates - addition/subtraction or scaling by powers of two is pretty much a "free" operation
if you reuse the same value ("arithmetic term") in multiple places, save it if it is expensive to calculate (for example involves function calls, a long sequence of arithmetics, or a lot of memory accesses like an array checksum).
So when we compute an intermediate result, we also need to use some memory, right? Can anyone please explain it to me?
There are several levels of memory in a computer. The layers look like this
registers – the CPU does all the calculations in these and access is instant
caches – memory tightly coupled to the CPU core; all accesses to main system memory actually go through the cache, and to the program it looks as if the data comes from and goes to system memory. If the data is present in the cache and the access is well aligned, the access is almost instant as well and hence very fast.
main system memory - connected to the CPU through a memory controller and shared by the CPU cores in a system. Accessing main memory introduces latencies through addressing and the limited bandwidth between memory and CPUs
When you work with in-situ calculated intermediary results those often never leave the registers or may go only as far as the cache and thus are not limited by the available system memory bandwidth or blocked by memory bus arbitration or address generation interlock.
This hurts me.
Ask your teacher (or better, don't, because with his level of competence in programming I wouldn't trust him), whether he has measured it, and what the difference was. The rule when you are programming for speed is: If you haven't measured it, and measured it before and after a change, then what you are doing is purely based on presumption and worthless.
In reality, an optimising compiler will take the code that you write and translate it to the fastest possible machine code. As a result, it is unlikely that there is any difference in code or speed.
On the other hand, using intermediate variables will make complex expressions easier to understand and easier to get right, and it makes debugging a lot easier. If your huge complex expression gives what looks like the wrong result, intermediate variables make it possible to check the calculation bit by bit and find where the error is.
Now even if he was right and removing intermediate variables made your code faster, and even if anyone cared about the speed difference, he would be wrong: Making your code readable and easier to debug gets you to a correctly working version of the code quicker (and if it doesn't work, nobody cares how fast it is). Now if it turns out that the code needs to be faster, the time you saved will allow you to make changes that make it really faster.

boost::flat_map and its performance compared to map and unordered_map

It is common knowledge in programming that memory locality improves performance a lot due to cache hits. I recently found out about boost::flat_map which is a vector based implementation of a map. It doesn't seem to be nearly as popular as your typical map/unordered_map so I haven't been able to find any performance comparisons. How does it compare and what are the best use cases for it?
Thanks!
I have run a benchmark on different data structures very recently at my company so I feel I need to drop a word. It is very complicated to benchmark something correctly.
Benchmarking
On the web, we rarely find (if ever) a well-engineered benchmark. Until today I only found benchmarks that were done the journalist way (pretty quickly and sweeping dozens of variables under the carpet).
1) You need to consider cache warming
Most people running benchmarks are afraid of timer discrepancy, so they run their stuff thousands of times and measure the whole time; they are just careful to use the same number of iterations for every operation, and then consider the results comparable.
The truth is, in the real world it makes little sense, because your cache will not be warm, and your operation will likely be called just once. Therefore you need to benchmark using RDTSC, and time stuff calling them once only.
Intel has made a paper describing how to use RDTSC (using a cpuid instruction to flush the pipeline, and calling it at least 3 times at the beginning of the program to stabilize it).
2) RDTSC accuracy measure
I also recommend doing this:
// Assumed supporting definitions (not in the original snippet):
#include <algorithm>   // std::min, std::max
#include <cstdio>      // printf
typedef unsigned long long u64;

u64 g_correctionFactor; // number of clocks to offset after each measurement to remove the overhead of the measurer itself.
u64 g_accuracy;

static u64 const errormeasure = ~((u64)0);

#ifdef _MSC_VER
#include <intrin.h>    // __cpuid, __rdtsc
#pragma intrinsic(__rdtsc)

inline u64 GetRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000); // flush OOO instruction pipeline
    return __rdtsc();
}

inline void WarmupRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000); // warmup cpuid.
    __cpuid(a, 0x80000000);
    __cpuid(a, 0x80000000);

    // measure the measurer overhead with the measurer (crazy he..)
    u64 minDiff = errormeasure; // start from the largest possible u64
    u64 maxDiff = 0; // this is going to help calculate our PRECISION ERROR MARGIN
    for (int i = 0; i < 80; ++i)
    {
        u64 tick1 = GetRDTSC();
        u64 tick2 = GetRDTSC();
        minDiff = std::min(minDiff, tick2 - tick1); // make many takes, take the smallest that ever come.
        maxDiff = std::max(maxDiff, tick2 - tick1);
    }
    g_correctionFactor = minDiff;
    printf("Correction factor %llu clocks\n", g_correctionFactor);

    g_accuracy = maxDiff - minDiff;
    printf("Measurement Accuracy (in clocks) : %llu\n", g_accuracy);
}
#endif
This is a discrepancy measurer, and it takes the minimum of all measured values, to avoid occasionally getting values around -10^18 (the first negative 64-bit values).
Notice the use of intrinsics rather than inline assembly. First, inline assembly is rarely supported by compilers nowadays; worse, the compiler creates a full ordering barrier around inline assembly because it cannot statically analyze the inside, which is a problem when benchmarking real-world code, especially when calling things only once. An intrinsic is suited here because it doesn't break the compiler's free reordering of instructions.
3) parameters
The last problem is people usually test for too few variations of the scenario.
A container's performance is affected by:
Allocator
size of the contained type
cost of implementation of the copy operation, assignment operation, move operation, construction operation, of the contained type.
number of elements in the container (size of the problem)
whether the type has trivial versions of the operations in point 3
whether the type is POD
Point 1 is important because containers do allocate from time to time, and it matters a lot if they allocate using the CRT "new" or some user-defined operation, like pool allocation or freelist or other...
(for people interested about pt 1, join the mystery thread on gamedev about system allocator performance impact)
Point 2 is because some containers (say A) will lose time copying stuff around, and the bigger the type the bigger the overhead. The problem is that when comparing to another container B, A may win over B for small types, and lose for larger types.
Point 3 is the same as point 2, except it multiplies the cost by some weighting factor.
Point 4 is a question of big O mixed with cache issues. Some containers with bad asymptotic complexity can largely outperform lower-complexity containers for a small number of elements (like vector vs. map: the vector's cache locality is good, while map fragments the memory). And then at some crossing point they lose, because the overall contained size starts to "leak" into main memory and cause cache misses, plus the asymptotic complexity starts to be felt.
Point 5 is about compilers being able to elide stuff that is empty or trivial at compile time. This can greatly optimize some operations, because the containers are templated and each type will have its own performance profile.
Point 6 same as point 5, PODs can benefit from the fact that copy construction is just a memcpy, and some containers can have a specific implementation for these cases, using partial template specializations, or SFINAE to select algorithms according to traits of T.
About the flat map
Apparently, the flat map is a sorted vector wrapper, like Loki AssocVector, but with some supplementary modernizations coming with C++11, exploiting move semantics to accelerate insert and delete of single elements.
This is still an ordered container. Most people usually don't need the ordering part, therefore the existence of unordered...
Have you considered that maybe you need a flat_unorderedmap? It would be something like google::sparse_map, that is, an open-address hash map.
The problem of open address hash maps is that at the time of rehash they have to copy everything around to the new extended flat land, whereas a standard unordered map just has to recreate the hash index, while the allocated data stays where it is. The disadvantage of course is that the memory is fragmented like hell.
The criterion for a rehash in an open-address hash map is when the number of stored elements exceeds the size of the bucket vector multiplied by the load factor.
A typical load factor is 0.8, so you need to care about that: if you can pre-size your hash map before filling it, always pre-size to intended_filling * (1/0.8) + epsilon; this guarantees never having to spuriously rehash and recopy everything during filling.
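The same pre-sizing idea expressed with the standard container (a sketch; std::unordered_map is chained rather than open-addressed, but reserve plus max_load_factor is the analogous knob):
#include <unordered_map>
#include <string>

void fill_without_rehashing(std::size_t intended_filling)
{
    std::unordered_map<int, std::string> m;
    m.max_load_factor(0.8f);
    m.reserve(intended_filling);  // enough buckets so no rehash happens while filling
    for (std::size_t i = 0; i < intended_filling; ++i)
        m.emplace(static_cast<int>(i), "value");
}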
The advantage of closed address maps (std::unordered..) is that you don't have to care about those parameters.
But the boost::flat_map is an ordered vector; therefore, it will always have a log(N) asymptotic complexity, which is less good than the open address hash map (amortized constant time). You should consider that as well.
Benchmark results
This is a test involving different maps (with int key and __int64/somestruct as value) and std::vector.
tested types information:
typeid=__int64 . sizeof=8 . ispod=yes
typeid=struct MediumTypePod . sizeof=184 . ispod=yes
Insertion
EDIT:
My previous results included a bug: they actually tested ordered insertion, which exhibited a very fast behavior for the flat maps.
I left those results later down this page because they are interesting.
This is the correct test:
I have checked the implementation, there is no such thing as a deferred sort implemented in the flat maps here. Each insertion sorts on the fly, therefore this benchmark exhibits the asymptotic tendencies:
map: O(N * log(N))
hashmaps: O(N)
vector and flatmaps: O(N * N)
Warning: hereafter, the two tests for std::map and both flat_maps are buggy and actually test ordered insertion (vs. random insertion for the other containers; yes, it's confusing, sorry):
We can see that ordered insertion results in back pushing and is extremely fast. However, from the non-charted results of my benchmark, I can also say that this is not near the absolute optimum for back-insertion. At 10k elements, perfect back-insertion optimality is obtained on a pre-reserved vector, which gives us 3 million cycles; we observe 4.8M here for the ordered insertion into the flat_map (therefore 160% of the optimum).
Analysis: remember this is 'random insert' for the vector, so the massive 1 billion cycles come from having to shift half the data (on average) upward, one element at a time, at each insertion.
Random search of 3 elements (clocks renormalized to 1)
in size = 100
in size = 10000
Iteration
over size 100 (only MediumPod type)
over size 10000 (only MediumPod type)
Final grain of salt
In the end, I wanted to come back to "Benchmarking §3 Pt1" (the system allocator). In a recent experiment I am doing on the performance of an open-address hash map I developed, I measured a performance gap of more than 3000% between Windows 7 and Windows 8 on some std::unordered_map use cases (discussed here).
This makes me want to warn the reader about the above results (they were made on Win7): your mileage may vary.
From the docs it seems this is analogous to Loki::AssocVector which I'm a fairly heavy user of. Since it's based on a vector it has the characteristics of a vector, that is to say:
Iterators get invalidated whenever the size grows beyond the capacity.
When it grows beyond capacity it needs to reallocate and move objects over, i.e. insertion isn't guaranteed constant time except for the special case of inserting at the end when capacity > size.
Lookup is faster than std::map due to cache locality; it is a binary search, which otherwise has the same performance characteristics as std::map.
Uses less memory because it isn't a linked binary tree.
It never shrinks unless you forcibly tell it to (since that triggers reallocation).
The best use is when you know the number of elements in advance ( so you can reserve upfront ), or when insertion / removal is rare but lookup is frequent. Iterator invalidation makes it a bit cumbersome in some use cases so they're not interchangeable in terms of program correctness.
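A minimal usage sketch along those lines (reserve up front, fill, then query):
#include <boost/container/flat_map.hpp>
#include <cstdio>

int main()
{
    boost::container::flat_map<int, double> m;
    m.reserve(1000);                  // avoid reallocations and element moves while filling
    for (int i = 0; i < 1000; ++i)
        m.emplace(i, i * 0.5);        // sorted-vector insert; cheapest when appending at the end
    auto it = m.find(42);             // binary search over contiguous storage
    if (it != m.end())
        std::printf("%f\n", it->second);
}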