I am using Python 2.7.5 on Mac OS X 10.9.3 with 8GB memory and a 1.7GHz Core i5. I have tested time consumption as below.
d = {i:i*2 for i in xrange(10**7*3)} #WARNING: it takes time and consumes a lot of RAM
%time for k in d: k,d[k]
CPU times: user 6.22 s, sys: 10.1 ms, total: 6.23 s
Wall time: 6.23 s
%time for k,v in d.iteritems(): k, v
CPU times: user 7.67 s, sys: 27.1 ms, total: 7.7 s
Wall time: 7.69 s
It seems iteritems is slower.
I am wondering what is the advantage of iteritems over directly accessing the dict.
Update:
For a more accurate time profile:
In [23]: %timeit -n 5 for k in d: v=d[k]
5 loops, best of 3: 2.32 s per loop
In [24]: %timeit -n 5 for k,v in d.iteritems(): v
5 loops, best of 3: 2.33 s per loop
To answer your question we should first dig up some information about how and when iteritems() was added to the API.
The iteritems() method
was added in Python 2.2 following the introduction of iterators and generators in the language (see also:
What is the difference between dict.items() and dict.iteritems()?). In fact the method is explicitly mentioned in PEP 234. So it was introduced as a lazy alternative to the already present items().
This followed the same pattern as file.xreadlines() versus file.readlines(); xreadlines() was introduced in Python 2.1 (and already deprecated in Python 2.3, by the way).
In Python 2.3 the itertools module was added, which introduced lazy counterparts to map, filter, etc.
In other words, at the time there was (and still is) a strong trend towards laziness of operations. One of the reasons is to improve memory efficiency. Another one is to avoid unneeded computation.
I cannot find any reference that says it was introduced to improve the speed of looping over the dictionary. It was simply used to replace calls to items() that didn't actually have to return a list. Note that this includes more use-cases than just a simple for loop.
For example in the code:
function(dictionary.iteritems())
you cannot simply use a for loop to replace iteritems() as in your example. You'd have to write a function (or use a genexp, even though they weren't available when iteritems() was introduced, and they wouldn't be DRY...).
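For illustration, a lazy stand-in without iteritems() could look like this (a minimal sketch of my own, reusing the placeholder names function and dictionary from above):

dictionary = {'a': 1, 'b': 2}

def function(pairs):
    # any consumer that accepts an iterable of (key, value) pairs
    return dict(pairs)

# generator expression: yields pairs lazily, like iteritems(), with no list built
print function((k, dictionary[k]) for k in dictionary)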
Retrieving the items from a dict is done pretty often so it does make sense to provide a built-in method and, in fact, there was one: items(). The problem with items() is that:
it isn't lazy, meaning that calling it on a big dict can take quite some time
it takes a lot of memory. It can almost double the memory usage of a program if called on a very big dict that contains most objects being manipulated
Most of the time it is iterated only once
So, when introducing iterators and generators, it was obvious to just add a lazy counterpart. If you need a list of items because you want to index it or iterate more than once, use items(), otherwise you can just use iteritems() and avoid the problems cited above.
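A quick contrast of the two in Python 2 (a minimal example of my own):

d = {'a': 1, 'b': 2, 'c': 3}

pairs = d.items()            # a real list: can be indexed, sorted, iterated twice
print pairs[0]

for k, v in d.iteritems():   # lazy: one (key, value) pair at a time, no list built
    print k, v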
The advantages of using iteritems() are the same as using items() versus manually getting the value:
You write less code, which makes it more DRY and reduces the chances of errors
Code is more readable.
Plus the advantages of laziness.
As I already stated, I cannot reproduce your performance results. On my machine iteritems() is always faster than iterating + looking up by key. The difference is quite negligible anyway, and it's probably due to how the OS handles caching and memory in general. In other words, your argument about efficiency isn't a strong argument against (nor for) using one or the other alternative.
Given equal performances on average, use the most readable and concise alternative: iteritems(). This discussion would be similar to asking "why use a foreach when you can just loop by index with the same performance?". The importance of foreach isn't in the fact that you iterate faster but that you avoid writing boiler-plate code and improve readability.
I'd like to point out that iteritems() was in fact removed in Python 3. This was part of the "cleanup" of this version. Python 3's items() method is (mostly) equivalent to Python 2's viewitems() method (actually a backport, if I'm not mistaken...).
This version is lazy (and thus provides a replacement for iteritems()) and also has further functionality, such as providing "set-like" operations (finding common items between dicts in an efficient way, etc.). So in Python 3 the reasons to use items() instead of manually retrieving the values are even more compelling.
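A small Python 3 sketch of those set-like view operations (the values are illustrative):

d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'b': 2, 'c': 30, 'd': 4}

print(d1.items() & d2.items())   # (key, value) pairs equal in both dicts: {('b', 2)}
print(d1.keys() & d2.keys())     # keys common to both dicts: {'b', 'c'}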
Using for k,v in d.iteritems() with more descriptive names can make the code in the loop suite easier to read.
As opposed to using the system time command, running in IPython with timeit yields:
d = {i:i*2 for i in xrange(10**7*3)} #WARNING: it takes time and consumes a lot of RAM
timeit for k in d: k, d[k]
1 loops, best of 3: 2.46 s per loop
timeit for k, v in d.iteritems(): k, v
1 loops, best of 3: 1.92 s per loop
I ran this on Windows, Python 2.7.6. Have you run it multiple times to confirm it wasn't something going on with the system itself?
I know technically this is not an answer to the question, but the comments section is a poor place to put this sort of information. I hope that this helps people better understand the nature of the problem being discussed.
For thoroughness I've timed a bunch of different configurations. These are all timed using timeit with a repetition factor of 10. This is using CPython version 2.7.6 on Mac OS X 10.9.3 with 16GB memory and 2.3GHz Core i7.
The original configuration
python -m timeit -n 10 -s 'd={i:i*2 for i in xrange(10**7*3)}' 'for k in d: k, d[k]'
>> 10 loops, best of 3: 2.05 sec per loop
python -m timeit -n 10 -s 'd={i:i*2 for i in xrange(10**7*3)}' 'for k, v in d.iteritems(): k, v'
>> 10 loops, best of 3: 1.74 sec per loop
Bakuriu's suggestion
This suggestion involves using pass in the iteritems loop, and assigning a value to a variable v in the first loop by accessing the dictionary at k.
python -m timeit -n 10 -s 'd={i:i*2 for i in xrange(10**7*3)}' 'for k in d: v = d[k]'
>> 10 loops, best of 3: 1.29 sec per loop
python -m timeit -n 10 -s 'd={i:i*2 for i in xrange(10**7*3)}' 'for k, v in d.iteritems(): pass'
>> 10 loops, best of 3: 934 msec per loop
No assignment in the first
This one removes the assignment in the first loop but keeps the dictionary access. This is not a fair comparison because the second loop creates an additional variable and assigns it a value implicitly.
python -m timeit -n 10 -s 'd={i:i*2 for i in xrange(10**7*3)}' 'for k in d: d[k]'
>> 10 loops, best of 3: 1.27 sec per loop
Interestingly, the assignment is trivial compared to the access itself -- the difference being a mere 20 msec total. In every comparison (even the final, unfair one), iteritems wins out.
The times are closest, percentage-wise, in the original configuration. This is probably due to the bulk of the work being the creation of a tuple (which is not assigned anywhere). Once that is removed from the equation, the differences between the two methods become more pronounced.
Iterating with dict.items() wins out heavily in Python 3.5.
Here is a small performance stat:
d = {i:i*2 for i in range(10**3)}
timeit.timeit('for k in d: k,d[k]', globals=globals())
75.92739052970501
timeit.timeit('for k, v in d.items(): k,v', globals=globals())
57.31370617801076
Related
I'm using the Linux perf tools to profile one of the CRONO benchmarks. I'm specifically interested in L1 DCache misses, so I run the program like this:
perf record -e L1-dcache-read-misses -o perf/apsp.cycles apps/apsp/apsp 4 16384 16
It runs fine but generates those warnings:
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.
Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.
Samples in kernel modules won't be resolved at all.
If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.
Cannot read kernel map
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
Threads Returned!
Threads Joined!
Time: 2.932636 seconds
[ perf record: Woken up 5 times to write data ]
[ perf record: Captured and wrote 1.709 MB perf/apsp.cycles (44765 samples) ]
I then annotate the output file like this:
perf annotate --stdio -i perf/apsp.cycles --dsos=apsp
But in one of the code sections, I see some weird results:
Percent | Source code & Disassembly of apsp for L1-dcache-read-misses
---------------------------------------------------------------------------
: {
: if((D[W_index[v][i]] > (D[v] + W[v][i])))
19.36 : 401140: movslq (%r10,%rcx,4),%rsi
14.50 : 401144: lea (%rax,%rsi,4),%rdi
1.22 : 401148: mov (%r9,%rcx,4),%esi
5.82 : 40114c: add (%rax,%r8,4),%esi
20.02 : 401150: cmp %esi,(%rdi)
0.00 : 401152: jle 401156 <do_work(void*)+0x226>
: D[W_index[v][i]] = D[v] + W[v][i];
9.72 : 401154: mov %esi,(%rdi)
19.93 : 401156: add $0x1,%rcx
:
Now in those results, how come some arithmetic instructions have L1 read misses? Also, how come the instructions of the second statement cause so many cache misses even though they should've been brought into cache by the previous if statement?
Am I doing something wrong here? I tried the same on a different machine with root access and it gave me similar results, so I think the warnings I mentioned above are not causing this. But what exactly is going on?
So we have this code:
for(v=0;v<N;v++)
{
for(int i = 0; i < DEG; i++)
{
if((/* (V2) 1000000000 * */ D[W_index[v][i]] > (D[v] + W[v][i])))
D[W_index[v][i]] = D[v] + W[v][i];
Q[v]=0; //Current vertex checked
}
}
Note that I added (V2) as a comment in the code. We come back to this code below.
First approximation
Remember that W_index is initialized as W_index[i][j] = i + j (A).
Let's focus on one inner iteration, and first let's assume that DEG is large. Further we assume that the cache is large enough to hold all data for at least two iterations.
D[W_index[v][i]]
The lookup W_index[v] is loaded into a register. For W_index[v][i] we assume one cache miss (64-byte cache line, 4 bytes per int; we call the program with DIM=16). The lookup in D always starts at v, so most of the required part of the array is already in cache. With the assumption that DEG is large, this lookup is for free.
D[v] + W[v][i]
The lookup D[v] is for free as it depends on v. The second lookup is the same as above, one cache miss for the second dimension.
The whole inner statement has no influence.
Q[v]=0;
As this is indexed by v, it can be ignored.
When we sum up, we get two cache misses.
Second approximation
Now, we come back to the assumption that DEG is large. In fact this is wrong because DEG = 16. So there are fractions of cache misses we also need to consider.
D[W_index[v][i]]
The lookup W_index[v] costs 1/8 of a cache miss (it has a size of 8 bytes; a cache line is 64 bytes, so we get a cache miss every eighth iteration).
The same is true for D[W_index[v][i]], except that D holds integers. On average all but one integer are in cache, so this costs 1/16 of a cache miss.
D[v] + W[v][i]
D[v] is already in cache (this is W_index[v][0]). But we get another 1/8 of a cache miss for W[v] for the same reasoning as above.
Q[v]=0;
This is another 1/16 of a cache miss.
And surprise: if we now use the code (V2), where the if-clause never evaluates to true, I get 2.395 cache misses per iteration (note that you really need to configure your CPU well, i.e., no hyperthreading, no turbo boost, performance governor if possible). The calculation above (2 + 1/8 + 1/16 + 1/8 + 1/16) gives 2.375, so we are pretty close.
Third approximation
Now there is this unfortunate if-clause. How often does this comparison evaluate to true? We can't say: in the beginning it will be quite often, and in the end it will never evaluate to true.
So let's focus on the very first execution of the complete loop. In this case, D[v] is infinity and W[v][i] is a number between 1 and 101, so the condition evaluates to true in each iteration.
And then it gets hard: we get 2.9 cache misses in this iteration. Where are they coming from? All data should already be in cache.
But: this is the "mystery of compilers". You never know what they produce in the end. I compiled with GCC and Clang and got the same measurements. When I activate -funroll-loops, I suddenly get 2.5 cache misses. Of course this may be different on your system. When I inspected the assembly, I observed that it is really exactly the same, just the loop has been unrolled four times.
So what does this tell us? You never know what your compiler does unless you check it. And even then, you can't be sure.
I guess hardware prefetching or execution order could have an influence here. But this is a mystery.
Regarding perf and your problems with it
I think the measurements you did have two problems:
They are relative; the exact line is not that accurate.
Your program is multithreaded; this may be harder to track.
My experience is that when you want to get good measurements for a specific part of your code, you really need to check it manually. Sometimes - not always - it can explain things pretty well.
Problem description:
I'm optimizing a quite complex algorithm which unfortunately relies heavily on the set and frozenset datatypes (because of the faster in operator). This means I get a different execution time every time I run the test, even for exactly the same input data. And since I badly need to optimize the algorithm, I'd like to have a constant (as far as possible) execution time every time.
Demonstration
I made a simplified example, which hopefully demonstrates the problem:
import timeit

class A(object):
    def __init__(self, val):
        self.val = val

def run_test():
    result = []
    for i in xrange(100):
        a = {A(j) for j in xrange(100)}
        result.append(sorted(v.val for v in a))
    return result

N = 10
times = timeit.Timer(run_test).repeat(repeat=3, number=N)
print '%.6f s' % (min(times) / N,)
The core of the problem is the ordering of objects in sets - it depends (I think) on their position in memory, which of course is different each time. Then, when sorting the values, sorted's execution speed will be different each time. On my machine, it gives execution times with a tolerance of about 10%.
It's not the best demonstration, because my real code depends much more on set ordering and time differences are much higher, but I hope you get the picture.
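To illustrate the point (CPython 2 specific, my own example): the default hash of a user-defined object is derived from its memory address, so the set's iteration order can change from run to run.

class A(object):
    def __init__(self, val):
        self.val = val

a = A(1)
print hash(a), id(a)                               # closely related values in CPython 2
print [x.val for x in {A(j) for j in xrange(5)}]   # order may differ on each run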
What I tried:
sorting the sets in the algorithm - it gives constant execution time, but also makes the whole algorithm ten times slower
using very large number and repeat parameters - unless I want to wait an hour after each change, this won't help me
What I'd like to try:
I think that if I could somehow "reset" the Python interpreter to have "clean" memory, it would lead to predictable memory positions for objects and the time measurements would be constant. But I have no idea how to do something like this, except for creating a VM and restarting it every time I want to run a test.
Not an issue:
I profile a lot, and I know which functions are slowest by now - I just need to make them faster - those are the functions whose speeds I'm trying to measure.
I can't use anything other than set and frozenset for testing (like some ordered set), because it would be much slower and the measured speed wouldn't bear any relation to the production code
set and frozenset performance is not important here
Summary / Question:
my algorithm uses sets internally
I want to measure execution speed
the execution speed depends on the order in which the elements contained in the internal set are retrieved
the test I'm using has fixed input values
based on timeit measurements, I'm unable to measure the impact of any change I make
in the test above, the run_test function is a good example of my real problem
So I need some way to temporarily make sure all set elements are created at the same memory positions, which would make the test execution speed and the number of function calls (profiling) deterministic.
Additional example
This example perhaps demonstrates my problem better:
import timeit
import time

class A(object):
    def __init__(self, val):
        self.val = val

    def get_value_from_very_long_computation(self):
        time.sleep(0.1)
        return self.val

def run_test():
    input_data = {A(j) for j in xrange(20)}
    for i, x in enumerate(input_data):
        value = x.get_value_from_very_long_computation()
        if value > 16:
            break
    print '%d iterations' % (i + 1,)

times = timeit.Timer(run_test).repeat(repeat=1, number=1)
print '%.9f s' % (min(times),)
Which returns, for example:
$ ./timing-example2.py
4 iterations
0.400907993 s
$ ./timing-example2.py
3 iterations
0.300778866 s
$ ./timing-example2.py
8 iterations
0.801693201 s
(this will be, of course, different every time you run it and may or may not be completely different on another machine)
You can see that the execution speed is VERY different each time while the input data remains exactly the same. This is exactly the behaviour I see when I measure my algorithm speed.
I have a Matlab background, and when I bought a laptop a year ago I carefully selected one with a lot of compute power: the machine has 4 cores and offers me 8 threads at 2.4GHz. The machine proved itself to be very powerful, and using simple parfor loops I could utilize all the processor threads, with which I got a speedup near 8 for many problems and experiments.
This nice Sunday I was experimenting with numpy. People often tell me that the core business of numpy is implemented efficiently using libblas, possibly even using multiple cores and libraries like OpenMP (with OpenMP you can create parfor-like loops using C-style pragmas).
This is the general approach for many numerical and machine-learning algorithms: you express them using expensive high-level operations like matrix multiplications, but write them in an expensive, high-level language like Matlab or Python for comfort. Moreover, C(++) allows us to bypass the GIL.
So the cool part is that linear algebra stuff should run really fast in Python whenever you use numpy. You just have the overhead of some function calls, but if the calculation behind them is large, that's negligible.
So, without even touching the topic that not everything can be expressed in linear algebra or other numpy operations, I gave it a spin:
t = time.time(); numpy.dot(range(100000000), range(100000000)); print(time.time() - t)
40.37656021118164
During these 40 seconds I saw ONE of the 8 threads on my machine working at 100%, and the others were near 0%. I didn't like this, but even with one thread working I'd expect this to run in approximately 0.something seconds. The dot product does 100M +'es and *'es; if it ran in one second at 2.4GHz, that would be 2400M / 100M = 24 clock ticks for one +, one * and whatever overhead.
Nevertheless, the algorithm needs about 40 * 24 ≈ 1000 ticks (!!!!!) per element for the +, * and overhead. Let's do this in C++:
#include<iostream>
int main() {
unsigned long long result = 0;
for(unsigned long long i=0; i < 100000000; i++)
result += i * i;
std::cout << result << '\n';
}
BLITZ:
herbert@machine:~$ g++ -std=c++11 dot100M.cc
herbert@machine:~$ time ./a.out
662921401752298880
real 0m0.254s
user 0m0.254s
sys 0m0.000s
0.254 seconds, more than 100 times faster than numpy.dot.
I thought, maybe the python3 range generator is the slow part, so I handicapped my C++11 implementation by storing all 100M numbers in a std::vector first (using iterative push_backs) and then iterating over it. This was a lot slower: it took a little below 4 seconds, which is still 10 times faster.
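The handicapped version looked roughly like this (a reconstructed sketch, not the exact code I ran; note it allocates around 800MB and also needs -std=c++11):

#include <iostream>
#include <vector>

int main() {
    std::vector<unsigned long long> v;
    for (unsigned long long i = 0; i < 100000000; i++)
        v.push_back(i);                 // iterative push_back, no reserve()

    unsigned long long result = 0;
    for (unsigned long long x : v)      // then iterate over the stored numbers
        result += x * x;
    std::cout << result << '\n';
}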
I installed numpy using 'pip3 install numpy' on Ubuntu; it compiled for some time, using both gcc and gfortran, and moreover I saw mentions of BLAS header files passing through the compiler output.
For what reason is numpy.dot so extremely slow?
So your comparison is unfair. In your Python example, you first generate two range objects, convert them to numpy arrays and only then do the scalar product. The calculation takes the least part of the time. Here are the numbers for my computer:
>>> t=time.time();x=numpy.arange(100000000);numpy.dot(x,x);print time.time()-t
1.28280997276
And without the generation of the array:
>>> t=time.time();numpy.dot(x,x);print time.time()-t
0.124325990677
For completeness, the C version takes roughly the same time:
real 0m0.108s
user 0m0.100s
sys 0m0.007s
range generates a list based on your given parameters, whereas your for loop in C merely increments a number.
I agree that it seems fairly costly, computationally-wise, to spend so much time on generating one list -- then again, it is a big list, and you're requesting two of them ;-)
EDIT: As mentioned in the comments, range generates lists, not arrays.
Try substituting your range call with an incrementing while loop or similar and see if you get more tolerable results.
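A quick way to see the cost of that list (Python 2 semantics, where range() builds a full list; the sizes are illustrative):

import sys

r = range(10**6)
print type(r), sys.getsizeof(r)    # a list; the pointer array alone is ~8 MB
x = xrange(10**6)
print type(x), sys.getsizeof(x)    # a tiny lazy counter object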
I have a use case where a set of strings will be searched for a particular string, s. The percent of hits or positive matches for these searches will be very high. Let's say 99%+ of the time, s will be in the set.
I'm using boost::unordered_set right now, and even with its very fast hash algorithm, it takes about 40ms on good hardware (600ms on a VM) to search the set 500,000 times. Yeah, that's pretty good, but unacceptable for what I'm working on.
So, is there any sort of data structure optimized for a high percentage of hits? I cannot precompute the hashes for the strings coming in, so I think I'm looking at a complexity of O(average string length) for a hash set like boost::unordered_set. I looked at tries; these would probably perform well in the opposite case where hits are rare, but not really any better than hash sets.
edit: some other details with my particular use case:
the number of strings in the set is around 5,000. The longest string is probably no more than 200 chars. Search gets called again and again with the same strings, but they are coming in from an outside system and I cannot predict what the next string will be. The exact match rate is actually 99.975%.
edit2: I did some of my own benchmarking
I collected 5,000 of the strings that occur in the real system. I created two scenarios.
1) I loop over the list of known strings and do a search for them in the container. I do this for 500,000 searches("hits").
2) I loop through a set of strings known not to be in the container, for 500,000 searches ("misses").
(Note - I'm interested in hashing the data in reverse because eyeballing my data, I noticed that there are a lot of common prefixes and the suffixes differ - at least that is what it looks like.)
Tests done on a virtualbox CentOS 5.6 VM running on a macbook host.
hits (ms) misses (ms)
boost::unordered_set with default hash and no reserved size: 591.15 441.39
tr1::unordered_set with default hash 191.09 143.80
boost::unordered_set with a reserve size set: 579.31 431.54
boost::unordered_set w/custom hash (hash on the last 15 chars + str size): 357.34 812.13
boost::unordered_set w/custom hash (hash on the last 25 chars + str size): 362.60 795.33
trie: 1809.34 58.11
trie with reversed insertion/search: 2806.26 311.14
In my tests, where there are a lot of matches, the tr1 set is the best. Where there are a lot of misses, the Trie wins big.
my test loop looks like this, where function_set is the container being tested loaded with 5,000 strings, and functions is a vector of either all the strings in the container or a bunch of strings that are not in the container.
while (searched < kTotalSearches) {
for(std::vector<std::string>::const_iterator i = functions.begin(); i != functions.end(); ++i) {
function_set.count(*i);
searched++;
if (searched == kTotalSearches)
break;
}
}
std::cout << searched << " searches." << std::endl;
I'm pretty sure that a trie is what you are looking for. You are guaranteed not to go down a number of nodes greater than the length of your string. Once you've reached a leaf, there might be some linear search if there are collisions for this particular node. It depends on how you build it. Since you're using a set, I would assume that this is not a problem.
The unordered_set will have a complexity of at worst O(n), but n in this case is the number of nodes that you have (500k) and not the number of characters you are searching for (probably less than 500k).
After edit:
Maybe what you really need is a cache of the results after your search algo succeeded.
This question piqued my curiosity so I did a few tests to satisfy myself with the following results. A few general notes:
The usual caveats about benchmarking apply (don't trust my numbers, do your own benchmarks with your specific use case and data, etc...).
Tests were done using MSVS C++ 2010 (speed optimized, release build).
Benchmarks were run using 10 million loops to improve timing accuracy.
Names were generated by randomly concatenating 20 different string fragments into strings ranging from 4 to 65 characters in length.
Names included only letters and some tests (trie) were case-insensitive for simplicity, though there's no reason the methods can't be extended to include other characters.
Tests try to match the 99.975% hit rate given in the question.
Test Descriptions
Basic description of the tests run with the relevant details:
String Iteration -- Simply iterates through the function name for a baseline time comparison.
Map -- std::unordered_map<std::string, int>
Set -- std::unordered_set<std::string>
BoostSet -- boost::unordered_set<std::string>, v1.47.0
CharMap -- std::unordered_map<const char*, int>
CharSet -- std::unordered_set<const char*>
FastMap -- Simply a std::unordered_map<> using a custom FNV-1a hash algorithm.
FastSet -- Simply a std::unordered_set<> using a custom FNV-1a hash algorithm.
CustomMap -- A basic hash map I wrote myself years ago.
Trie -- A standard trie downloaded from Google code.
CustomTrie -- A bare-bones trie I wrote myself.
BinarySearch -- Using std::binary_search() on a sorted std::vector<std::string>.
SortArrayMap -- An attempt to use a size_t VectorIndex[26][26][26][26][26] array to index into a sorted array.
PerfectMap -- A std::unordered_map<> using a perfect hash from gperf.
PerfectWordSet -- Using the gperf is_word_set() function directly.
PerfectWordSetFunc -- Same as PerfectWordSet but called in a function instead of inline.
PerfectWordSetThread -- Same as PerfectWordSet but work is split into N threads (standard Window threads). No synchronization is used except for waiting for the threads to finish.
Test Results (Mostly Hits)
Results sorted from slowest to fastest (for the case of mostly hits, ~99.975%):
Trie -- 9100 ms
SortArrayMap -- 6600 ms
PerfectWordSetFunc -- 4050 ms
CustomTrie -- 3470 ms
BinarySearch -- 3420 ms
CustomMap -- 2700 ms
CharSet -- 1300 ms
CharMap -- 1300 ms
BoostSet -- 1200 ms
FastSet -- 970 ms
FastMap -- 930 ms
Original Poster -- 800 ms (estimated)
Set -- 730 ms
Map -- 690 ms
PerfectMap -- 650 ms
PerfectWordSet -- 500 ms
PerfectWordSetThread(1) -- 500 ms
StringIteration -- 350 ms
PerfectWordSetThread(2) -- 260 ms
PerfectWordSetThread(4) -- 150 ms
PerfectWordSetThread(32) -- 125 ms
PerfectWordSetThread(8) -- 120 ms
PerfectWordSetThread(16) -- 110 ms
Test Results (Mostly Misses)
Results sorted from slowest to fastest (for the case of mostly misses, ~0.1% hits):
BinarySearch -- ? (took too long)
SortArrayMap -- 8050 ms
Trie -- 3200 ms
CustomMap -- 1700 ms
BoostSet -- 920 ms
CustomTrie -- 850 ms
FastMap -- 590 ms
FastSet -- 580 ms
CharSet -- 550 ms
CharMap -- 550 ms
StringIteration -- 350 ms
Set -- 330 ms
Map -- 330 ms
PerfectMap -- 280 ms
PerfectWordSet -- 140 ms
PerfectWordSetThread(1) -- 130 ms
PerfectWordSetThread(2) -- 75 ms
PerfectWordSetThread(4) -- 45 ms
PerfectWordSetThread(32) -- 45 ms
PerfectWordSetThread(8) -- 40 ms
PerfectWordSetThread(16) -- 35 ms
Discussion
My first guess was that a trie would be a good fit for this sort of thing, but from the results the opposite actually appears to be true. Thinking about it some more, this makes sense and is along the same lines as the reasons not to use a linked list.
I assume you may be familiar with the table of latencies that every programmer should know. In your case you have 500k lookups executing in 40ms, or 80ns/lookup. At that scale you easily lose if you have to access anything not already in the L1/L2 cache. A trie is really bad for this as you have an indirect and probably non-local memory access for every character. Given the size of the trie in this case I couldn't figure any way of getting the entire trie to fit in cache to improve performance (though it may be possible). I still think that even if you did get the trie to fit entirely in L2 cache you would lose with all the indirection required.
The std::unordered_ containers actually do a very good job of things out of the box. In fact, in trying to speed them up I actually made them slower (in the poorly named FastMap and FastSet trials).
Same thing with trying to switch from std::string to const char * (about twice as slow).
The boost::unordered_set<> was twice as slow as the std::unordered_set<> and I don't know if that is because I just used the built-in hash function, was using a slightly old version of boost, or something else. Have you tried std::unordered_set<> yourself?
By using gperf you can easily create a perfect hash function if your set of strings is known at compile time. You could probably create a perfect hash at runtime as well, depending on how often new strings are added to the map. This gets you a 23% speed increase over the standard map implementation.
The PerfectWordSetThread tests simply use the perfect hash and split the work up into 1-32 threads. This problem is perfectly parallel (at least the benchmark is), so you get almost a 5x boost in performance in the 16-thread case. This works out to only 6.3 ms/500k lookups, or 13 ns/lookup... a mere 50 cycles on a 4GHz processor.
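For what it's worth, a rough sketch of that kind of split using std::thread rather than the Windows threads used in the tests (is_hit() is just a stand-in for the perfect-hash lookup; compile with -std=c++11 -pthread):

#include <cstddef>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

// Stand-in for the perfect-hash lookup (in_word_set in the text above).
static const std::unordered_set<std::string> word_set = {"alpha", "beta", "gamma"};
static bool is_hit(const std::string& s) { return word_set.count(s) != 0; }

std::size_t count_matches(const std::vector<std::string>& queries, unsigned nthreads)
{
    std::vector<std::size_t> counts(nthreads, 0);   // one counter per thread, no sharing
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            // each thread takes a strided slice of the queries
            for (std::size_t i = t; i < queries.size(); i += nthreads)
                if (is_hit(queries[i]))
                    ++counts[t];
        });
    }
    for (std::thread& w : workers)
        w.join();

    std::size_t total = 0;
    for (std::size_t c : counts)
        total += c;
    return total;
}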
The StringIteration case really points out how difficult it is going to be to get much faster. Just iterating the string being found takes 350 ms, or 70% of the time compared to the 500 ms map case. Even if you could perfectly guess each string you would still need this 350 ms (for 10 million lookups) to actually compare and verify the match.
Edit: Another thing that illustrates how tight things are is the difference between the PerfectWordSetFunc at 4050 ms and PerfectWordSet at 500 ms. The only difference between the two is that one is called in a function and one is called inline. Calling it as a function reduces the speed by a factor of 8. In basic pseudo-code this is just:
bool IsInPerfectWordSet (string Match)
{
return in_word_set(Match);
}
//Inline benchmark: PerfectWordSet
for i = 1 to 10,000,000
{
if (in_word_set(SomeString)) ++MatchCount;
}
//Function call benchmark: PerfectWordSetFunc
for i = 1 to 10,000,000
{
if (IsInPerfectWordSet(SomeString)) ++MatchCount;
}
This really highlights the difference in performance that inline code/functions can make. You also have to be careful in making sure what you are measuring in a benchmark. Sometimes you would want to include the function call overhead, and sometimes not.
Can You Get Faster?
I've learned to never say "no" to this question, but at some point the effort may not be worth it. If you can split the lookups into threads and use a perfect, or near-perfect, hash function you should be able to approach 100 million lookup matches per second (probably more on a machine with multiple physical processors).
A couple ideas I don't have the knowledge to attempt:
Assembly optimization using SSE
Use the GPU for additional throughput
Change your design so you don't need fast lookups
Take a moment to consider #3....the fastest code is that which never needs to run. If you can reduce the number of lookups, or reduce the need for an extremely high throughput, you won't need to spend time micro-optimizing the ultimate lookup function.
If the set of strings is fixed at compile time (e.g. it is a dictionary of known human words), you could perhaps use a perfect hash algorithm, and use the gperf generator.
Otherwise, you might perhaps use an array of 26 hash tables, indexed by the first letter of the word to hash.
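A rough sketch of that idea (illustrative only; it assumes non-empty, lowercase ASCII keys like the test data described above):

#include <string>
#include <unordered_set>

struct FirstLetterBuckets {
    std::unordered_set<std::string> buckets[26];   // one table per first letter

    void insert(const std::string& s)         { buckets[s[0] - 'a'].insert(s); }
    bool contains(const std::string& s) const { return buckets[s[0] - 'a'].count(s) != 0; }
};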
BTW, perhaps using a sorted array of these strings with dichotomic (binary search) access might be faster (since log2(5000) is about 13), or a std::map or a std::set.
Lastly, you might define your own hashing function: perhaps in your particular case, hashing only the first 16 bytes could be enough!
If the set of strings is fixed, you could consider generating a dichotomic search over it (e.g. code a script to generate a function with 5000 tests, of which only about log2(5000) ≈ 13 are executed), as sketched below.
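Illustrative only: over a tiny fixed set, such a generated function could look like this (a real generator would emit roughly 13 nested comparisons for 5000 strings):

#include <string>

// generated for the sorted set {"apple", "banana", "cherry", "mango", "peach"}
bool in_fixed_set(const std::string& s) {
    if (s < "cherry") {
        if (s < "banana") return s == "apple";
        return s == "banana";
    }
    if (s < "mango") return s == "cherry";
    return s == "mango" || s == "peach";
}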
Also, even if the set of strings is slightly variable (e.g. changes from one program run to the next, but stays constant during a single run), you might even consider generating the function (by emitting C++ code, then compiling it) on the fly and dlopen-ing it.
You really should benchmark and try several solutions! It probably is more an engineering issue than an algorithmic one.
I have two huge arrays (int source[1000], dest[1000] in the code below, but with millions of elements in reality). The source array contains a series of ints, of which I want to copy 3 out of every 4.
For example, if the source array is:
int source[1000] = {1,2,3,4,5,6,7,8....};
int dest[1000];
Here is my code:
for (int count_small = 0, count_large = 0; count_large < 1000; count_small += 3, count_large +=4)
{
dest[count_small] = source[count_large];
dest[count_small+1] = source[count_large+1];
dest[count_small+2] = source[count_large+2];
}
In the end, dest console output would be:
1 2 3 5 6 7 9 10 11...
But this algorithm is so slow! Is there an algorithm or an open source function that I can use / include?
Thank you :)
Edit: The actual length of my array would be about 1 million (640*480*3)
Edit 2: Processing this for loop takes about 0.98 to 2.28 seconds, while the rest of the code only takes 0.08 to 0.14 seconds, so the device spends at least 90% of its CPU time on this loop alone.
Well, the asymptotic complexity there is as good as it's going to get. You might be able to achieve slightly better performance by loading in the values as four 4-way SIMD integers, shuffling them into three 4-way SIMD integers, and writing them back out, but even that's not likely to be hugely faster.
With that said, though, the time to process 1000 elements (Edit: or one million elements) is going to be utterly trivial. If you think this is the bottleneck in your program, you are incorrect.
Before you do much more, try profiling your application and determine whether this is the best place to spend your time. Then, if it is a hot spot, determine how fast it is, how fast you need it to be, and what you might realistically achieve. Then test the alternatives; the overhead of threading or OpenMP might even slow it down (especially, as you have now noted, if you are on a single-core processor - in which case it won't help at all). For single threading, I would look to memcpy as per Sean's answer.
@Sneftel has also referenced other options involving SIMD integers.
One option would be to try parallel processing the loop and see if that helps. You could try using the OpenMP standard (see the Wikipedia article on OpenMP), but you would have to try it for your specific situation and see if it helps. I used this recently on an AI implementation and it helped us a lot.
#pragma omp parallel for
for (...)
{
... do work
}
Other than that, you are limited to the compiler's own optimisations.
You could also look at the threading support added in C++11, though you might be better off using pre-implemented framework tools like parallel_for (available in the new Windows Concurrency Runtime through the PPL in Visual Studio, if that's what you're using) than rolling your own.
parallel_for(0, max_iterations,
[...] (int i)
{
... do stuff
}
);
Inside the for loop, you still have other options. You could try a for loop that iterates over everything and skips every fourth element (just skip when (i+1) % 4 == 0) instead of doing 3 copies per iteration, or do block memcpy operations for groups of 3 integers as per Sean's answer. You might achieve slightly different compiler optimisations for some of these, but it is unlikely (memcpy is probably as fast as you'll get).
for (int i = 0, j = 0; i < 1000; i++)
{
    if ((i+1) % 4 != 0)
    {
        dest[j] = source[i];
        j++;
    }
}
You should then develop a test rig so you can quickly performance test and decide on the best one for you. Above all, decide how much time is worth spending on this before optimising elsewhere.
You could try memcpy instead of the individual assignments:
memcpy(&dest[count_small], &source[count_large], sizeof(int) * 3);
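Dropped into the loop from the question, that looks roughly like this (same index arithmetic as the original; needs <cstring>):

for (int count_small = 0, count_large = 0; count_large < 1000; count_small += 3, count_large += 4)
{
    // copy three consecutive ints at once, then skip the fourth
    memcpy(&dest[count_small], &source[count_large], sizeof(int) * 3);
}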
Is your array size only 1000? If so, how is it slow? It should be done in no time!
As long as you are creating a new array, and for a single-threaded application, this is the only way AFAIK.
However, if the datasets are huge, you could try a multi-threaded application.
Also, you could explore using a bigger data type to hold the values, such that the array size decreases... that is, if this is viable for your real-life application.
If you have an Nvidia card you can consider using CUDA. If that's not the case you can try other parallel programming methods/environments as well.