Counting the length of the Collatz sequence - custom iterator produces slowdown? - c++

I've been solving UVA problem #100, "The 3n + 1 problem". It's their "sample" problem, with a very forgiving time limit (3 sec; their sample solution with no caching at all runs in 0.738 sec, and my best solution so far runs in 0.016 sec), so I thought that however I experimented with the code, I would always fit within the limit. Well, I was mistaken.
The problem specification is simple: each line of input has two numbers, i and j, and the output should print these numbers followed by the maximum length of the Collatz sequence whose starting value lies between i and j, inclusive.
I've prepared four solutions. They are quite similar and all produce correct answers.
The first solution caches up to 0x100000 sequence lengths in a vector. A length of 0 means the length of the sequence that starts at that particular number has not yet been calculated. It runs fast enough: 0.03 sec.
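A minimal sketch of that first approach (reconstructed for illustration; the names and exact I/O details are mine, not the original submission):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// cache[n] == 0 means the length starting at n has not been calculated yet.
static std::vector<std::uint16_t> cache(0x100000, 0);

static int collatz_length(std::uint64_t n) {
    if (n < cache.size() && cache[n] != 0)
        return cache[n];
    int len = (n == 1) ? 1
                       : 1 + collatz_length(n % 2 ? 3 * n + 1 : n / 2);
    if (n < cache.size())
        cache[n] = static_cast<std::uint16_t>(len);
    return len;
}

int main() {
    unsigned i, j;
    while (std::scanf("%u %u", &i, &j) == 2) {
        unsigned lo = std::min(i, j), hi = std::max(i, j);
        int best = 0;
        for (unsigned n = lo; n <= hi; ++n)
            best = std::max(best, collatz_length(n));
        std::printf("%u %u %d\n", i, j, best);
    }
}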
The second solution is quite similar, only it caches every single length in a sparse array implemented with an unordered_map. It runs noticeably slower than the previous solution, but still fits well within the limit: 0.28 sec.
As an exercise, I also wrote a third solution, based on the second one. The goal was to use the max_element function, which only accepts iterators. I couldn't use unordered_map::iterator, since incrementing such an iterator is, AFAIK, linear in the map size; therefore, I wrote a custom iterator operating on an abstract "container" that "holds" the sequence length of every possible number (but in fact calculates and caches lengths only as needed). At its core it is the same unordered_map solution, just with an extra layer of iterators added on top. This solution did not fit within the 3 sec limit.
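The iterator layer might be reconstructed roughly like this (a hypothetical sketch, not the submitted code; it is formally an input-style iterator, but max_element only needs to dereference and compare, and dereferencing recomputes through the cache):

#include <cstdint>
#include <iterator>
#include <unordered_map>

// Iterates over a virtual "container" indexed by the starting number;
// dereferencing computes (and caches) the sequence length on demand.
class length_iterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = int;

    length_iterator(std::uint64_t n, std::unordered_map<std::uint64_t, int>& c)
        : n_(n), cache_(&c) {}

    int operator*() const { return length(n_); }
    length_iterator& operator++() { ++n_; return *this; }
    bool operator==(const length_iterator& o) const { return n_ == o.n_; }
    bool operator!=(const length_iterator& o) const { return n_ != o.n_; }

private:
    int length(std::uint64_t n) const {
        auto it = cache_->find(n);
        if (it != cache_->end()) return it->second;
        int len = (n == 1) ? 1 : 1 + length(n % 2 ? 3 * n + 1 : n / 2);
        (*cache_)[n] = len;
        return len;
    }

    std::uint64_t n_;
    std::unordered_map<std::uint64_t, int>* cache_;
};

// Usage: int best = *std::max_element(length_iterator(lo, cache),
//                                     length_iterator(hi + 1, cache));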
And now comes the thing I can't understand. While the third solution is obviously overcomplicated on purpose, I could hardly believe that an extra layer of iterators could produce such a slowdown. To check this, I added the same iterator layer to the vector solution; this is my fourth solution. Judging from what the iterator idea did to my unordered_map solution, I expected a considerable slowdown here as well, but oddly enough, it did not happen at all. This solution runs almost as fast as the plain vector one, in 0.035 sec.
How is this possible? What exactly is responsible for the slowdown in the third solution? How can overcomplicating two similar solutions in exactly the same way slow one of them down dramatically while hardly harming the other at all? Why did adding the iterator layer to the unordered_map solution push it over the time limit, while doing the same to the vector solution hardly slowed it down?
EDIT:
I found that the problem is most visible when the input contains many repeated lines. I tested all four solutions on my machine against the input 1 1000000, repeated 200 times. The plain vector solution processed all of them in 1.531 sec. The vector solution with the additional iterator layer took 3.087 sec. The plain unordered_map solution took 33.821 sec. And the unordered_map solution with the additional iterator layer took more than half an hour: I halted it after 31 min 0.482 sec! I tested on Linux Mint 17.2 64-bit, g++ version Ubuntu 4.8.4-2ubuntu1~14.04 with flags -std=c++11 -O2, on a Celeron 2955U @ 1.4 GHz x 2.

It appears to be a problem in GCC 4.8; it doesn't occur in 4.9 and up. For some reason, subsequent outer loops (with the unordered_map cache already populated) run progressively slower, not faster. I'm not sure why, since the unordered_map isn't getting any bigger.
If you follow that link and switch GCC 4.8 to 4.9, you'll see the expected behavior, where subsequent runs over the same numeric range add little time because they leverage the cache.
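A harness along these lines makes the effect visible (a reconstruction for illustration, not the asker's code): each outer pass repeats the same range, so with a populated cache every pass after the first should be nearly free.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <unordered_map>

static std::unordered_map<unsigned long long, int> cache;

static int len(unsigned long long n) {
    auto it = cache.find(n);
    if (it != cache.end()) return it->second;
    int l = (n == 1) ? 1 : 1 + len(n % 2 ? 3 * n + 1 : n / 2);
    cache[n] = l;
    return l;
}

int main() {
    for (int pass = 0; pass < 10; ++pass) {
        auto t0 = std::chrono::steady_clock::now();
        int best = 0;
        for (unsigned n = 1; n <= 1000000; ++n)
            best = std::max(best, len(n));
        double s = std::chrono::duration<double>(
                       std::chrono::steady_clock::now() - t0).count();
        // Under the affected GCC 4.8, later passes report increasing times.
        std::printf("pass %d: max length %d in %.3f s\n", pass, best, s);
    }
}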
Altogether, the philosophy of being "conservative" with compiler updates has been outdated for a long time. Compilers today are rigorously tested, and you should use the latest (or at least a recent) release for everyday development.
For an online judge to subject you to long-fixed bugs is just cruel.

Related

How to find execution time independent of other processes in C++ on Windows?

I have an assignment to plot the running time of a quicksort algorithm (implemented myself) in both versions, i.e. randomised and deterministic (picking the last element as pivot).
I have very large lists of integers and also relatively short ones [range: 500 - 10^6]. Since the worst-case complexity of quicksort is O(n^2), I expect the code could take more than 30 min to execute, and I want to measure the CPU time taken by the process, which is what std::clock is supposed to do. However, I am on a Windows system, where std::clock returns the same time as the wall clock. This is an issue because I want the precise time the process takes to execute, unaffected by other processes on my computer.
So far I have searched through some questions on this website and on Code Review; nothing seems to do what I want.
I also have Ubuntu installed on WSL2 on my system, so if there is a way to do the same there, that would help me too.
So my question is: how do I find the exact CPU time taken by the process for my sorting algorithms?
(PS: This website seems to suggest that std::clock may wrap after approximately 36 min, so a way to record time in case the process runs longer than that would also be helpful.)
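For what it's worth, a sketch of per-process CPU timing on both platforms mentioned (GetProcessTimes on Windows; clock_gettime with CLOCK_PROCESS_CPUTIME_ID on Linux/WSL2); unlike clock_t, neither of these counters wraps after ~36 min:

#include <cstdio>

#ifdef _WIN32
#include <windows.h>
// User + kernel CPU time consumed by this process, in seconds.
double process_cpu_seconds() {
    FILETIME creation, exit_time, kernel, user;
    GetProcessTimes(GetCurrentProcess(), &creation, &exit_time, &kernel, &user);
    auto to_seconds = [](const FILETIME& ft) {
        ULARGE_INTEGER t;
        t.LowPart = ft.dwLowDateTime;
        t.HighPart = ft.dwHighDateTime;
        return t.QuadPart * 100e-9;  // FILETIME ticks are 100 ns
    };
    return to_seconds(kernel) + to_seconds(user);
}
#else
#include <ctime>
// Counts only this process's CPU time; unaffected by other processes.
double process_cpu_seconds() {
    timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}
#endif

int main() {
    double start = process_cpu_seconds();
    // ... run the sort being measured here ...
    std::printf("CPU time: %.3f s\n", process_cpu_seconds() - start);
}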

concurrency::parallel_sort overhead and performance hit (rule of thumb)?

Recently I stumbled across a very large performance improvement -- I'm talking about a 4x improvement -- with a one-line code change: I just changed a std::sort call to concurrency::parallel_buffered_sort.
// Get a contiguous vector copy of the pixels from the image.
std::vector<float> vals = image.copyPixels();
// New, fast way. Takes 7 seconds on a test image.
concurrency::parallel_buffered_sort(vals.begin(), vals.end());
// Old, slow way -- takes 30 seconds on a test image
// std::sort(vals.begin(), vals.end());
This was for a large image, and it dropped my processing time from 30 seconds to 7 seconds. However, some cases will involve small images. I don't know if I can or should just do this blindly.
I would like to make some judicious use of parallel_sort, parallel_for and the like but I'm wondering about what threshold needs to be crossed (in terms of number of elements to be sorted/iterated through) before it becomes a help and not a hindrance.
I will eventually do some lengthy performance testing, but at the moment I don't have a lot of time to do that. I would like to get this working better "most" of the time without hurting performance any of the time (or at least only rarely).
Can someone with experience in this area give me a reasonable rule of thumb that will help me in "most" cases? Does one exist?
The requirement of RandomIterator and the presence of overloads with a const size_t _Chunk_size = 2048 parameter, which controls the threshold below which the work is done serially, imply that the library authors are aware of this concern. Thus, probably, just using concurrency::parallel_* as drop-in replacements for std::* will do fine.
Here is how I think about it: the Windows thread-scheduling quantum is roughly 20-60 ms on a workstation and 120 ms on a server, so anything that can be done within that much time doesn't need concurrency.
So I'm guessing that up to 1k-10k elements you are fine with std::sort; the latency of launching multiple threads would be overkill. From 10k onwards there is a distinct advantage in using parallel sort or parallel buffered sort (if you can afford it), and parallel radix sort would probably be great for very, very large inputs. A sketch of a size-gated wrapper follows below.
Caveats apply. :o)
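To make that rule of thumb concrete, one way to encode it is a wrapper that dispatches on input size (a sketch; the 10k cutoff is the guess from above and should be tuned by measurement; parallel_buffered_sort lives in <ppl.h> with MSVC):

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <ppl.h>  // MSVC PPL: concurrency::parallel_buffered_sort

// Guessed cutoff from the rule of thumb above; tune with real measurements.
const std::size_t kParallelThreshold = 10000;

template <typename RandomIt>
void adaptive_sort(RandomIt first, RandomIt last) {
    if (static_cast<std::size_t>(std::distance(first, last)) < kParallelThreshold)
        std::sort(first, last);  // small input: thread start-up would dominate
    else
        concurrency::parallel_buffered_sort(first, last);
}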
I don’t know about that concurrency namespace, but any sane implementation of a parallel algorithm will adapt appropriately to the size of the input. You shouldn’t have to worry about the details of the underlying thread implementation. Just do it.

How to judge the time limit for constraint-bound coding?

I have seen many coding sites state time limits and source-code size constraints to be considered when submitting a solution to a problem. I can never work out how to check whether my code will pass: for example, if it's exponential the time limit is doubtful, while O(n^2) may take 2 sec depending on the size of the input. How can I get a rough idea of whether a test case of a given size will pass within the stated time?
Some good examples would be helpful.
There are some rules of thumb, but a lot depends on the hardware and programming language the judge system is using. The best way is to run some tests, like for loops or putting random numbers in a priority queue, just to get a feeling for it (see the sketch after the list below).
Mostly, if you need more than 10^7 steps (each of which can consist of several simple operations), you have to watch out for a timeout. That means:
If the running time is O(n!), then n > 11 is already critical: you have at least 10^7 operations, and that is a lot.
If the running time is O(2^n), then it is safe for n <= 20, but too risky for n > 25.
If the running time is O(n^3), then n should be around 300.
For O(n^2), n = 5000 could be OK, but 10000 would very probably fail.
For n <= 200000, algorithms in O(n log n) are mostly OK.
For n <= 10^7, linear running times are OK; beyond that you would need sublinear algorithms.
But, as already said, these numbers can vary depending on the judging system.
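A quick way to calibrate those numbers on your own machine is to time roughly 10^7 simple operations, as suggested above. A minimal sketch:

#include <chrono>
#include <cstdio>

int main() {
    const long long N = 10000000;  // the ~10^7-step budget from the rule of thumb
    auto start = std::chrono::steady_clock::now();
    volatile long long sum = 0;    // volatile keeps the loop from being optimized away
    for (long long i = 0; i < N; ++i)
        sum += i % 7;
    double s = std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - start).count();
    std::printf("%lld steps took %.3f s\n", N, s);
}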

SIMD Implementation of std::nth_element

I have an algorithm that runs on my dual-core, 3 GHz Intel processor in 250 ms on average, and I am trying to optimize it. Currently, I have a std::nth_element call that is invoked around 6,000 times on std::vectors of between 150 and 300 elements, taking about 50 ms in total on average. I've spent some time optimizing the comparator I use, which currently looks up two doubles from a vector and does a simple < comparison. The comparator takes a negligible fraction of the time spent in std::nth_element. The comparator's copy constructor is also simple.
Since this call currently takes 20% of my algorithm's running time, and since the time is mostly spent in the code for nth_element that I did not write (i.e. not in the comparator), I'm wondering if anyone knows of a way to optimize nth_element using SIMD or any other approach. I've seen some questions about parallelizing std::nth_element using OpenCL and multiple threads, but since the vectors are pretty short, I'm not sure how much benefit I would get from that approach, though I'm open to being told I'm wrong.
If there is an SSE approach, I can use any SSE instruction up to (the current, I think) SSE4.2.
Thanks!
Two thoughts:
Multithreading probably won't speed up processing of any single vector, but it might help as the number of vectors grows large.
Sorting is too powerful a tool for your problem: you're computing the entire ordering of the vector, but you don't care about anything except the top few elements. You know for each vector how many elements make up the top 5%, so instead of sorting the whole thing, make one pass through the array and find the k largest. You can do this in O(n) time with k extra space, so it's probably a win overall.
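A sketch of that one-pass selection using a bounded min-heap (strictly O(n log k) rather than O(n), but it uses only k extra space and is effectively linear for small k):

#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Keeps the k largest values seen in a single pass over v.
std::vector<double> top_k(const std::vector<double>& v, std::size_t k) {
    // Min-heap of the current top k; its top is the smallest kept value.
    std::priority_queue<double, std::vector<double>, std::greater<double> > heap;
    for (double x : v) {
        if (heap.size() < k) {
            heap.push(x);
        } else if (x > heap.top()) {  // x beats the weakest of the current top k
            heap.pop();
            heap.push(x);
        }
    }
    std::vector<double> out;
    while (!heap.empty()) {
        out.push_back(heap.top());    // ascending order of the k largest
        heap.pop();
    }
    return out;
}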

std::vector faster than plain array?

I have just tried to benchmark std::sort on both a std::vector<std::pair<float, unsigned int>> (filled with push_back) and a plain std::pair<float, unsigned int>* array (allocated using new and then filled one by one). The compare function just compared the float parts of the pairs.
Surprisingly, when used on 16M values, std::sort took only about 1940 ms on the std::vector but about 2190 ms on the array. Can anybody explain how the vector can be faster? Is it due to caching, or is the array version of std::sort just poorly implemented?
gcc (GCC) 4.4.5 20110214 (Red Hat 4.4.5-6)
Intel(R) Core(TM) i7 CPU 870 @ 2.93 GHz, cache size 8192 KB (the computer has two quad-core CPUs, but I assume the sort is single-threaded)
EDIT: Now you can call me a dumbass, but when I tried to reproduce the code I used for the measurements (I had already deleted the original), I could not reproduce the results: now the array version takes about 1915 ± 5 ms (measured over 32 runs). I can only swear that I ran the test with 10 measurements three times (manually) with similar results, but this is not rigorous proof.
There was probably some bug in the original code. A background process seems improbable, because I alternated measurements of the vector and array versions, the vector results held, and no other user was logged in.
Please, consider this question as closed. Thanks for your efforts.
std::vector<std::pair<float, unsigned int>> (filled with push_back operation)
This stores all the data contiguously, so memory locality is very good.
std::pair<float, unsigned int>* array (allocated using new and then filled one by one)
This scatters the data all over memory.
You've set up a very unfair comparison between the vector and a simple array. The extra indirection involved in the array case is going to hurt, and the lack of locality is going to kill cache performance. I'm surprised you don't see a bigger win in favor of contiguous storage.
They will use the same version of sort. It's quite likely due to random CPU effects, like caching or thread context switches.
Did you use -O3 to compile your code?
If not, do it. All other benchmark results are meaningless, especially for template code.
Did you run the test many times?
This is done to prevent things like interrupts and caching from having too much of an impact on your result.
Don't use floating-point comparison or arithmetic for benchmarks. The results depend heavily on the compiler, platform, compiler options, etc.
How was your test data created?
The running time of most sorting algorithms changes depending on the ordering of the input data.
Which method did you use for measuring time? Clock cycles? A timer?
Anyway, writing benchmarks that provide reliable results is not as easy as it might seem at first. Don't use a benchmark to determine what the proper code for your problem is.
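As a starting point, a small steady_clock harness is usually enough (a sketch; run it several times and look at the minimum or median rather than a single measurement):

#include <chrono>

// Times a single call to f and returns the elapsed wall time in seconds.
// steady_clock is monotonic, so it is not disturbed by system clock changes.
template <typename F>
double time_seconds(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - start).count();
}

// Usage (hypothetical): double s = time_seconds([&]{ std::sort(v.begin(), v.end()); });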