Related
I recently asked this question:
Why is iterating an std::array much faster than iterating an std::vector?
As people quickly pointed out, my benchmark had many flaws. So as I was trying to fix my benchmark, I noticed that std::vector wasn't slower than std::array and, in fact, it was quite the opposite.
#include <vector>
#include <array>
#include <stdio.h>
#include <chrono>
using namespace std;
constexpr int n = 100'000'000;
vector<int> v(n);
//array<int, n> v;
int main()
{
int res = 0;
auto start = chrono::steady_clock::now();
for(int x : v)
res += x;
auto end = chrono::steady_clock::now();
auto diff = end - start;
double elapsed =
std::chrono::duration_cast<
std::chrono::duration<double, std::milli>
>(end - start).count();
printf("result: %d\ntime: %f\n", res, elapsed);
}
Things I've tried to improve from my previous benchmark:
Made sure I'm using the result, so the whole loop is not optimized away
Using -O3 flag for speed
Use std::chrono instead of the time command. That's so we can isolate the part we want to measure (just the for loop). Static initialization of variables and things like that won't be measured.
The measured times:
array:
$ g++ arrVsVec.cpp -O3
$ ./a.out
result: 0
time: 99.554109
vector:
$ g++ arrVsVec.cpp -O3
$ ./a.out
result: 0
time: 30.734491
I'm just wondering what I'm doing wrong this time.
Watch the disassembly in godbolt
The difference is due to the array's memory pages not being resident in the process address space: a global-scope array lives in the .bss section of the executable, which is zero-initialized and has not been paged in yet. The vector, by contrast, has just been allocated and zero-filled, so its memory pages are already present.
If you add
std::fill_n(v.data(), n, 1); // included in <algorithm>
as the first line of main to bring the pages in (pre-fault), that makes array time the same as that of vector.
On Linux, instead of that, you can do mlock(v.data(), v.size() * sizeof(v[0])); to bring the pages into the address space. See man mlock for full details.
Memory mapping/allocating is lazy: the first access to a page will cause a page fault exception (#PF on x86). This includes the BSS, as well as file-backed mappings like the text segment of your executable. These page faults are "valid" so they don't result in a SIGSEGV being delivered; instead the kernel allocates a physical page if necessary and wires up the hardware page tables so the load or store can rerun and not fault the 2nd time.
This is expensive, especially if the kernel doesn't "fault-around" and prepare multiple pages during one page fault. (Especially with Spectre + Meltdown mitigation enabled making user<->kernel round trips more expensive on current x86-64 hardware.)
You're letting std::vector's constructor write zeros to the array after dynamic allocation (see footnote 1). std::vector does all the page-faulting outside your timed loop. This happens before main, while the implementation is running constructors for static objects.
But the array is zero-initialized so it gets placed in the BSS. The first thing to touch it is your loop. Your array<> loop pays for all the page faults inside the timed region.
If you used new int[n] to dynamically allocate but not initialize a block of memory, you'd see the same behaviour as from your static array<>. (Maybe slightly better if Linux is more willing to use transparent hugepages for a dynamic allocation instead of the BSS mapping.)
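One way to see the first-touch cost directly is to time two consecutive passes over a freshly allocated buffer that nothing has touched yet. This is a minimal sketch under stated assumptions: it uses calloc rather than new int[n], because reading uninitialized memory from new is not well-defined, and the names PassTimes, sum_ints and time_two_passes are hypothetical. Whether calloc returns lazily mapped pages depends on the allocator and the allocation size.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>

struct PassTimes {
    long long sum;     // sum over both passes (0 for a zeroed buffer)
    double first_ms;   // first pass: pays for any page faults
    double second_ms;  // second pass: pages are already mapped
};

// Sum `count` ints through a volatile pointer so the loads cannot be optimized away.
static long long sum_ints(const int* data, std::size_t count) {
    const volatile int* p = data;
    long long sum = 0;
    for (std::size_t i = 0; i < count; ++i) sum += p[i];
    return sum;
}

PassTimes time_two_passes(std::size_t count) {
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    // Zeroed allocation that this code never writes to before the first pass.
    int* buf = static_cast<int*>(std::calloc(count, sizeof(int)));
    if (buf == nullptr) return {0, 0.0, 0.0};

    const auto t0 = clock::now();
    long long sum = sum_ints(buf, count);
    const auto t1 = clock::now();
    sum += sum_ints(buf, count);
    const auto t2 = clock::now();

    std::free(buf);
    return {sum, ms(t1 - t0).count(), ms(t2 - t1).count()};
}
```

With a large count (for example 100'000'000), first_ms would be expected to exceed second_ms on a system that allocates lazily; the exact ratio depends on the kernel, hugepage settings and hardware.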
Footnote 1: std::vector in libstdc++ and libc++ is too stupid to take advantage of getting already-zeroed pages from the OS, like it could if it used calloc or equivalent. It would be possible if the library provided a new/delete-compatible allocator for zeroed memory.
C++ new/delete is crippled vs. malloc/free/calloc/realloc. I have no idea why ISO C++ left out calloc and realloc: both are very useful for large allocations, especially realloc for resizing a std::vector of trivially-copyable objects that might have room to grow its mapping without copying. But since new/delete aren't guaranteed to be compatible with malloc/free, and new is replaceable, libraries can't very easily use calloc and realloc even under the hood.
Another factor: read-only leaves pages CoW mapped to the same physical zero page
When lazy allocation is triggered by a read (instead of write), it reads as zero. (BSS pages read as zero, new pages from mmap(MAP_ANONYMOUS) read as all-zero.)
The (soft) page fault handler that wired up the HW page table didn't need to actually allocate a physical page aka page-frame to back that virtual page. Instead, Linux maps clean (unwritten) anonymous pages to a single physical zeroed page. (This applies across all tasks.)
If we make multiple passes over the array, this leads to the curious situation where we can get TLB misses but L1d or L3 hits (depending on hugepage or not) because we have multiple virtual pages pointing to the same physical location.
(Some CPUs, e.g. AMD Ryzen, use micro-tagging in the L1d cache to save power, at the cost of the cache only being able to hit for one virtual address even if the same memory is mapped to multiple virtual addresses. Intel CPUs use true VIPT L1d caches and really can get this effect.)
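The read-as-zero behaviour of fresh anonymous mappings described above can be checked with a few lines. This is a minimal Linux/POSIX sketch; the function name is hypothetical and the 4096-byte page size is an assumption. It only reads, so on Linux each access would be expected to take a read fault that maps the shared zero page rather than allocating a private page.

```cpp
#include <sys/mman.h>
#include <cstddef>

// Returns true if the first byte of every page in a fresh anonymous mapping reads as zero.
// page_bytes = 4096 is an assumption about the base page size.
bool fresh_mapping_reads_zero(std::size_t bytes, std::size_t page_bytes = 4096) {
    void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return false;

    // volatile so that each read really touches memory (and triggers the read fault)
    const volatile unsigned char* p = static_cast<const volatile unsigned char*>(mem);
    bool all_zero = true;
    for (std::size_t off = 0; off < bytes; off += page_bytes) {
        if (p[off] != 0) all_zero = false;
    }

    munmap(mem, bytes);
    return all_zero;
}
```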
I made a test program for Linux that will use madvise(MADV_HUGEPAGE) (to encourage the kernel to defrag memory for hugepages) or madvise(MADV_NOHUGEPAGE) (to disable hugepages even for the read-only case).
For some reason Linux BSS pages don't use hugepages when you write them. Only reading them does use 2M hugepages (too big for L1d or L2, but does fit in L3. But we do get all TLB hits). It's hard to see this in /proc/PID/smaps because unwritten memory doesn't show up as "resident" at all. (Remember it's physically backed by a system-wide shared region of zeroes).
I made some changes to your benchmark code to rerun the sum loop multiple times after an init pass that either reads or writes the array, according to command-line args. The repeat-loop makes it run longer so we can get more precise timing, and to amortize the init so we get useful results from perf.
#include <vector>
#include <array>
#include <stdio.h>
#include <stdlib.h>  // atoi
#include <chrono>
#include <sys/mman.h>
using namespace std;
constexpr int n = 100'000'000;
//vector<int> v(n);
alignas(4096) array<int, n> v;
//template<class T>
__attribute__((noinline))
int toucharray(volatile int *vv, int write_init) {
int res=vv[0];
for(int i=32 ; i<n ; i+=128)
if(write_init)
vv[i] = 0;
else
res += vv[i];
// volatile int sum = res; // noinline is fine, we don't need to stop multiple calls from CSEing
return res;
}
template <class T>
__attribute__((noinline,noclone))
int sum_container(T &vv) {
unsigned int res=0;
for(int x : vv)
res += x;
__attribute__((used)) static volatile int sink;
sink = res; // a side-effect stops IPA from deciding that this is a pure function
return res;
}
int main(int argc, char**argv)
{
int write_init = 0;
int hugepage = 0;
if (argc>1) {
hugepage = argv[1][0] & 1;
write_init = argv[1][0] & 2;
}
int repcount = 1000;
if (argc>2)
repcount = atoi(argv[2]);
// TODO: option for no madvise.
madvise(v.data(), n*sizeof(v[0]), MADV_SEQUENTIAL);
madvise(v.data(), n*sizeof(v[0]), hugepage ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);
madvise(v.data(), n*sizeof(v[0]), MADV_WILLNEED);
// SEQ and WILLNEED probably only matter for file-backed mappings to reduce hard page faults.
// Probably not encouraging faultahead / around for lazy-allocation soft page fault
toucharray(v.data(), write_init);
int res = 0;
auto start = chrono::steady_clock::now();
for(int i=0; i<repcount ; i++)
res = sum_container(v);
auto end = chrono::steady_clock::now();
double elapsed =
std::chrono::duration_cast<
std::chrono::duration<double, std::milli>
>(end - start).count();
printf("result: %d\ntime: %f\n", res, elapsed);
}
best case: clang++ -O3 -march=native (skylake) actually unrolls with multiple accumulators, unlike gcc -funroll-loops which does a silly job.
On my Skylake i7-6700k with DDR4-2666 DRAM, configured for 4.2GHz max turbo and governor=performance -
# using std::array<int,n>
# 0&1 = 0 -> MADV_NOHUGEPAGE. 0&2 = 0 -> read-only init
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 0 1000
result: 0
time: 1961.952394
Performance counter stats for './touchpage-array-madv-nohuge-argc.clang 0 1000':
2,017.34 msec task-clock:u # 1.000 CPUs utilized
50 context-switches # 0.025 K/sec
0 cpu-migrations # 0.000 K/sec
97,774 page-faults # 0.048 M/sec
8,287,680,837 cycles # 4.108 GHz
14,500,762,859 instructions # 1.75 insn per cycle
13,688 mem_load_retired.l2_hit:u # 0.007 M/sec
12,501,329,912 mem_load_retired.l1_hit:u # 6196.927 M/sec
144,559 mem_inst_retired.stlb_miss_loads:u # 0.072 M/sec
2.017765632 seconds time elapsed
1.979410000 seconds user
0.036659000 seconds sys
Notice considerable TLB misses (mem_inst_retired.stlb_miss_loads:u counts 2nd-level TLB misses in user-space). And 97k page faults. That's pretty much exactly as many 4k pages as it takes to cover the 100M * 4 = 400MB array, so we got 1 fault per page with no pre-fault / fault-around.
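The page-count arithmetic can be spelled out as follows. A small sketch: pages_needed is a hypothetical helper, and sizeof(int) == 4 is assumed, as on the x86-64 system used here.

```cpp
#include <cstddef>

// Number of pages needed to cover `bytes`, rounding up to whole pages.
constexpr std::size_t pages_needed(std::size_t bytes, std::size_t page_bytes) {
    return (bytes + page_bytes - 1) / page_bytes;
}

// 100M ints * 4 bytes = 400,000,000 bytes
constexpr std::size_t kArrayBytes = 100'000'000ull * 4;

// 4 KiB pages: 400,000,000 / 4096 = 97,656.25, so 97,657 pages
static_assert(pages_needed(kArrayBytes, 4096) == 97'657, "4k page count");

// 2 MiB hugepages: 400,000,000 / 2,097,152 is about 190.7, so 191 pages
static_assert(pages_needed(kArrayBytes, 2ull * 1024 * 1024) == 191, "2M page count");
```

The 97,774 page faults reported by perf are slightly above 97,657 because the process also faults on its code, stack and library pages.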
Fortunately Skylake has two page-walk units so it can be doing two speculative page-walks in parallel. Also, all the data access is hitting in L1d so page-tables will stay hot in at least L2, speeding up page walks.
# using array
# MADV_HUGEPAGE, read-only init
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 1 1000
result: 0
time: 5947.741408
Performance counter stats for './touchpage-array-argc.clang 1 1000':
5,951.40 msec task-clock:u # 1.000 CPUs utilized
9 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
687 page-faults # 0.115 K/sec
24,377,094,416 cycles # 4.096 GHz
14,397,054,228 instructions # 0.59 insn per cycle
2,183,878,846 mem_load_retired.l2_hit:u # 366.952 M/sec
313,684,419 mem_load_retired.l1_hit:u # 52.708 M/sec
13,218 mem_inst_retired.stlb_miss_loads:u # 0.002 M/sec
5.951530513 seconds time elapsed
5.944087000 seconds user
0.003284000 seconds sys
Notice ~1/10th the TLB misses, but that out of the same ~12G mem loads, only 2G of them hit in L2, probably thanks to successful HW prefetch. (The rest did hit in L3 though.) And that we only had 687 page faults; a combination of faultaround and hugepages made this much more efficient.
And note that the time taken is 3x higher because of the bottleneck on L3 bandwidth.
Write-init of the array gives us the worst of both worlds:
# using array
# MADV_HUGEPAGE (no apparent effect on BSS) and write-init
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 3 1000
result: 0
time: 16510.222762
Performance counter stats for './touchpage-array-argc.clang 3 1000':
17,143.35 msec task-clock:u # 1.000 CPUs utilized
341 context-switches # 0.020 K/sec
0 cpu-migrations # 0.000 K/sec
95,218 page-faults # 0.006 M/sec
70,475,978,274 cycles # 4.111 GHz
17,989,948,598 instructions # 0.26 insn per cycle
634,015,284 mem_load_retired.l2_hit:u # 36.983 M/sec
107,041,744 mem_load_retired.l1_hit:u # 6.244 M/sec
37,715,860 mem_inst_retired.stlb_miss_loads:u # 2.200 M/sec
17.147615898 seconds time elapsed
16.494211000 seconds user
0.625193000 seconds sys
Lots of page faults. Also far more TLB misses.
std::vector version is basically the same as array:
strace shows that madvise didn't work because I didn't align the pointer. glibc / libstdc++ new tends to return a pointer that is page-aligned + 16, with allocator bookkeeping in that first 16 bytes. For the array, I used alignas(4096) to make sure I could pass it to madvise.
madvise(0x7f760d133010, 400000000, MADV_HUGEPAGE) = -1 EINVAL (Invalid argument)
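A way to avoid the EINVAL for the vector case would be to pass madvise only the page-aligned part of the buffer. This is a sketch under stated assumptions, not what the test program above does: PageRange and page_aligned_subrange are hypothetical names, and the 4096-byte page size is assumed.

```cpp
#include <cstddef>
#include <cstdint>

struct PageRange {
    std::uintptr_t start;  // first page-aligned address inside the buffer
    std::size_t length;    // whole pages only; 0 if no full page fits
};

// Shrink [addr, addr + len) to the largest sub-range that starts and ends on a page boundary.
PageRange page_aligned_subrange(std::uintptr_t addr, std::size_t len,
                                std::size_t page = 4096) {
    const std::uintptr_t first = (addr + page - 1) / page * page;  // round start up
    const std::uintptr_t last = (addr + len) / page * page;        // round end down
    return {first, last > first ? static_cast<std::size_t>(last - first) : 0};
}
```

For the pointer in the strace output, page_aligned_subrange(0x7f760d133010, 400000000) gives start 0x7f760d134000 and length 399,994,880, which could then be passed as madvise(reinterpret_cast<void*>(r.start), r.length, MADV_HUGEPAGE).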
So anyway, with my kernel tuning settings, it only tries to defrag memory for hugepages on madvise, and memory is pretty fragmented ATM. So it didn't end up using any hugepages.
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-vector-argv.clang 3 1000
result: 0
time: 16020.821517
Performance counter stats for './touchpage-vector-argv.clang 3 1000':
16,159.19 msec task-clock:u # 1.000 CPUs utilized
17 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
97,771 page-faults # 0.006 M/sec
66,146,780,261 cycles # 4.093 GHz
15,294,999,994 instructions # 0.23 insn per cycle
217,426,277 mem_load_retired.l2_hit:u # 13.455 M/sec
842,878,166 mem_load_retired.l1_hit:u # 52.161 M/sec
1,788,935 mem_inst_retired.stlb_miss_loads:u # 0.111 M/sec
16.160982779 seconds time elapsed
16.017206000 seconds user
0.119618000 seconds sys
I'm not sure why TLB misses are so much higher than for the THP read-only test. Maybe contention for memory access and/or eviction of cached page tables by touching more memory ends up slowing down pagewalks so TLB-prefetch doesn't keep up.
Out of the ~12G loads, HW prefetching was able to make about 1G of them hit in L1d or L2 cache.
While working on a competitive programming problem I discovered an interesting issue that drastically reduced the performance of some of my code. After much experimentation I have managed to reduce the issue to the following minimal example:
module Main where
main = interact handle
handle :: String -> String
-- handle s = show $ sum l
-- handle s = show $ length l
-- handle s = show $ seq (length l) (sum l)
where
l = [0..10^8] :: [Int]
If you uncomment each commented line individually, compile with ghc -O2 test.hs and run with time ./test > /dev/null, you should get something like the following:
For sum l:
0.02user 0.00system 0:00.03elapsed 93%CPU (0avgtext+0avgdata 3380maxresident)k
0inputs+0outputs (0major+165minor)pagefaults 0swaps
For length l:
0.02user 0.00system 0:00.02elapsed 100%CPU (0avgtext+0avgdata 3256maxresident)k
0inputs+0outputs (0major+161minor)pagefaults 0swaps
For seq (length l) (sum l):
5.47user 1.15system 0:06.63elapsed 99%CPU (0avgtext+0avgdata 7949048maxresident)k
0inputs+0outputs (0major+1986697minor)pagefaults 0swaps
Look at that huge increase in peak memory usage. This makes some amount of sense, because of course both sum and length can lazily consume the list as a stream, while the seq will be triggering the evaluation of the whole list, which must then be stored. But the seq version of the code is using just shy of 8 GB of memory to handle a list that contains just 400 MB of actual data. The purely functional nature of Haskell lists could explain some small constant factor, but a 20 fold increase in memory seems unintended.
This behaviour can be triggered by a number of things. Perhaps the easiest way is using force from Control.DeepSeq, but the way in which I originally encountered this was while using Data.Array.IArray (I can only use the standard library) and trying to construct an array from a list. The implementation of Array is monadic, and so was forcing the evaluation of the list from which it was being constructed.
If anyone has any insight into the underlying cause of this behaviour, I would be very interested to learn why this happens. I would of course also appreciate any suggestions as to how to avoid this issue, bearing in mind that I have to use just the standard library in this case, and that every Array constructor takes and eventually forces a list.
I hope you find this issue as interesting as I did, but hopefully less baffling.
EDIT: user2407038's comment made me realize I had forgotten to post profiling results. I have tried profiling this code and the profiler simply states that 100% of allocations are performed in handle.l, so it seems that simply anything that forces the evaluation of the list uses huge amounts of memory. As I mentioned above, using the force function from Control.DeepSeq, constructing an Array, or anything else that forces the list causes this behaviour. I am confused as to why it would ever require 8 GB of memory to compute a list containing 400 MB of data. Even if every element in the list required two 64-bit pointers, that is still only a factor of 5, and I would think GHC would be able to do something more efficient than that. If not this is an obvious bottleneck for the Array package, as constructing any array inherently requires us to allocate far more memory than the array itself.
So, ultimately: Does anyone have any idea why forcing a list requires such huge amounts of memory, which has such a high cost on performance?
EDIT: user2407038 provided a link to the very helpful GHC Memory Footprint reference. This explains exactly the data sizes of everything, and almost entirely explains the huge overhead: An [Int] is specified as requiring 5N+1 words of memory, which at 8 bytes per word gives 40 bytes per element. In this example that would suggest 4 GB, which accounts for half the total peak usage. It is easy to then believe that the evaluation of sum would then add a similar factor, so this answers my question.
Thanks to all commenters for your help.
EDIT: As I mentioned above, I originally encountered this behaviour while trying to construct an Array. Having had a bit of a dig into GHC.Arr, I have found what I think is the root cause of this behaviour when constructing an array: the constructor folds over the list to compose a program in the ST monad that it then runs. Obviously the ST action can't be executed until it is completely composed, and in this case the ST construct will be large and linear in the size of the input. To avoid this behaviour we would have to somehow modify the constructor to stream elements from the list as it adds them in ST.
There are multiple factors that come into play here. The first one is that GHC will lazily lift l out of handle. This would enable handle to reuse l, so that you don't have to recalculate it every time, but in this case it creates a space leak. You can check this if you dump the simplified Core with -ddump-simpl:
Main.handle_l :: [Int]
[GblId,
Str=DmdType,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 40 0}]
Main.handle_l =
case Main.handle3 of _ [Occ=Dead] { GHC.Types.I# y_a1HY ->
GHC.Enum.eftInt 0 y_a1HY
}
The functionality to calculate the [0..10^7] (see footnote 1) is hidden away in other functions, but essentially handle_l = [0..10^7] at top level (TopLvl=True). It won't get reclaimed, since you may or may not use handle again. If we use handle s = show $ length l, l itself will be inlined, and you will not find any TopLvl=True binding that has type [Int].
So GHC detects that you use l twice and creates a top-level CAF. How big is that CAF? An Int takes two words:
data Int = I# Int#
One for I#, one for Int#. How much for [Int]?
data [a] = [] | (:) a ([a]) -- pseudo, but similar
That's one word for [], and three words for (:) a ([a]). A list of [Int] with size N will therefore have a total size of (3N + 1) + 2N words, in your case 5N+1 words. Given your memory usage, I assume a word is 8 bytes on your platform, so we end up with
5 * 10^8 * 8 bytes = 4 000 000 000 bytes
So how do we get rid of that list? The first option we have is to get rid of l:
handle _ = show $ seq (length [0..10^8]) (sum [0..10^8])
This will now run in constant memory due to the foldr/build fusion rules. While we have [0..10^8] there twice, they don't share the same name. If we check the runtime statistics (+RTS -s), we will see that it runs in constant memory:
> SO.exe +RTS -s
5000000050000000
4,800,066,848 bytes allocated in the heap
159,312 bytes copied during GC
43,832 bytes maximum residency (2 sample(s))
20,576 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 9154 colls, 0 par 0.031s 0.013s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 4.188s ( 4.232s elapsed)
GC time 0.031s ( 0.013s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 4.219s ( 4.247s elapsed)
%GC time 0.7% (0.3% elapsed)
Alloc rate 1,146,284,620 bytes per MUT second
Productivity 99.3% of total user, 98.6% of total elapsed
But that's not really nice, since we now have to track all the uses of [0..10^8]. What if we create a function instead?
handle :: String -> String
handle _ = show $ seq (length $ l ()) (sum $ l ())
where
{-# INLINE l #-}
l _ = [0..10^7] :: [Int]
This works, but we must inline l, otherwise we get the same problem as before if we use optimizations. -O1 (and -O2) enable -ffull-laziness, which—together with common subexpression elimination—would lift l () to the top. So we either need to inline it or use -O2 -fno-full-laziness to prevent that behaviour.
Footnote 1: I had to decrease the list size; otherwise I would have started swapping.
I have written a function that returns a vector A equal to the product of a sparse matrix Sparse by another vector F. The non-zero values of the matrix are in Sparse(nnz), rowind(nnz) and colind(nnz) each contain the row and column of each particular value of Sparse. It was relatively simple to vectorize the (now commented) inner loop by the two lines beneath do kx.... I cannot see how to vectorize the outer loop, since pos has different size for different kx.
The question is : can the outer loop (do kx=1,nxy) be vectorized, and if yes how?
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Vladimir F correctly surmises that I come from the Python/Octave world. I have moved (back) to Fortran to get more performance out of my hardware, as the PDEs I solve become larger. As of half an hour ago, vectorization meant getting rid of do loops, something that Fortran seems very good at: the time savings involved in replacing the "inner loop" (do ky=1,size(pos)..) by the two lines above are astonishing. I look at the info given by gfortran (really gcc?) when -fopt-info is invoked and see that loop modification is often used. I will immediately go and read about SIMD and array notation. Please, if there are good sources on this topic, let me know.
In reply to Holz, there are myriad ways to store sparse matrices, usually resulting in lowering the rank of the operator by 1. The example I cooked up involves forcing and solution vectors that are evaluated at each position in some field, and therefore have rank 1. The operator that relates them (S, as in A = S . F) is two-dimensional BUT sparse. It is stored in such a way that only nonzero values are kept. If there are nnz non-zero values in S, then Sp, the sparse equivalent to S, is Sp(1:nnz). If pos represents the location within that sequence of some number Sp(pos), then the column and row position in the original matrix S is given by colind(pos) and rowind(pos).
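To make the storage layout concrete, here is a minimal sketch of the product A = S . F over this format. It is written in C++ rather than Fortran purely for illustration, uses 0-based indices instead of Fortran's 1-based ones, and the function name coo_matvec is hypothetical. It walks the nnz stored values once and adds each contribution to the row given by rowind.

```cpp
#include <cstddef>
#include <vector>

// A = S . F, where S is stored as three parallel arrays of length nnz:
//   sp[k]     : the k-th non-zero value
//   rowind[k] : its row in the original matrix (0-based here)
//   colind[k] : its column in the original matrix (0-based here)
std::vector<double> coo_matvec(std::size_t nrows,
                               const std::vector<double>& sp,
                               const std::vector<std::size_t>& rowind,
                               const std::vector<std::size_t>& colind,
                               const std::vector<double>& f) {
    std::vector<double> a(nrows, 0.0);
    for (std::size_t k = 0; k < sp.size(); ++k) {
        a[rowind[k]] += sp[k] * f[colind[k]];  // one multiply-add per stored value
    }
    return a;
}
```

For the 3x3 matrix with rows (1 0 2), (0 3 0), (4 0 5) and F = (1, 2, 3), the stored arrays are sp = {1, 2, 3, 4, 5}, rowind = {0, 0, 1, 2, 2}, colind = {0, 2, 1, 0, 2}, and the result is A = (7, 6, 19).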
With that background, I might enlarge the question to: What is the very best (measured by execution time) that can be done to accomplish the multiplication?
pure function SparseMul(Sparse,F) result(A)
implicit none
integer (kind=4),allocatable :: pos(:)
integer (kind=4) :: kx,ky ! gp counters
real (kind=8),intent(in) :: Sparse(:),F(:)
real (kind=8),allocatable :: A(:)
allocate(A(nxy))
do kx=1,nxy !for each row
pos=pack([(ky,ky=1,nnz)],rowind==kx)
A(kx)=sum(Sparse(pos)*F(colind(pos)))
!!$ A(kx)=0
!!$ do ky=1,size(pos)
!!$ A(kx)=A(kx)+Sparse(pos(ky))*F(colind(pos(ky)))
!!$ end do
end do
end function SparseMul
I assume the question "as is", i.e.:
we do not want to change the matrix storage format
we do not want to use an external library to perform the task
Otherwise, I think that using an external library should
be the best way to approach the problem, e.g.
https://software.intel.com/en-us/node/520797.
It is not easy to predict the "best" Fortran way to write the
multiplication. It depends on several factors (compiler, architecture,
matrix size,...). I think that the best strategy is to propose
some (reasonable) attempts and test them in a realistic configuration.
If I correctly understand the matrix storage format, my attempts -- including those reported in the question -- are provided
below:
save non-zero positions using pack
do kx=1,nxy
pos=pack([(ky,ky=1,nnz)],rowind==kx)
A(kx)=0
do ky=1,size(pos)
A(kx)=A(kx)+Sparse(pos(ky))*F(colind(pos(ky)))
end do
end do
as the previous one but using Fortran array syntax
do kx=1,nxy
pos=pack([(ky,ky=1,nnz)],rowind==kx)
A(kx)=sum(Sparse(pos)*F(colind(pos)))
end do
use a conditional to determine the components to be used
do kx=1,nxy
A(kx)=0
do ky=1,nnz
if(rowind(ky)==kx) A(kx)=A(kx)+Sparse(ky)*F(colind(ky))
end do
end do
as the previous one but interchanging loops
A(:)=0
do ky=1,nnz
do kx=1,nxy
if(rowind(ky)==kx) A(kx)=A(kx)+Sparse(ky)*F(colind(ky))
end do
end do
use the intrinsic sum with the mask argument
do kx=1,nxy
A(kx)=sum(Sparse*F(colind), mask=(rowind==kx))
enddo
as the previous one but using an implied do-loop
A =[(sum(Sparse*F(colind), mask=(rowind==kx)), kx=1,nxy)]
These are the results using a 1000x1000 matrix with
33% non-zero values. The machine is an Intel Xeon
and my tests were performed using Intel v17 and GNU 6.1 compiler
using no optimization, high optimization but
without vectorization, and high optimization.
              V1    V2    V3    V4    V5    V6
-O0
  ifort     4.28  4.26  0.97  0.91  1.33  2.70
  gfortran  2.10  2.10  1.10  1.05  0.30  0.61
-O3 -no-vec
  ifort     0.94  0.91  0.23  0.22  0.23  0.52
  gfortran  1.73  1.80  0.16  0.15  0.16  0.32
-O3
  ifort     0.59  0.56  0.23  0.23  0.30  0.60
  gfortran  1.52  1.50  0.16  0.15  0.16  0.32
A few short comments on the results:
Versions 3-4-5 are usually the fastest ones
The role of compiler optimizations is crucial for any version
The vectorization seems to play an important role only for
the non-optimal versions
Version 4 is the best for both compilers
gfortran V4 is the "best" version
Elegance does not always mean good performance (V6 is not very good)
Additional comments can be made by analyzing the
compiler optimization reports.
If we have a multi-core machine, we can try to use
all the cores. This implies dealing with the code parallelization, which
is a wide issue, but just to give some hints let us test
two possible OpenMP parallelizations. We work on the
serial fastest version (even though there is no guarantee it
is also the best version to be parallelized).
OpenMP 1.
!$omp parallel
!$omp workshare
A(:)=0
!$omp end workshare
!$omp do
do ky=1,nnz
do kx=1,nxy !for each row
if(rowind(ky)==kx) A(kx)=A(kx)+Sparse(ky)*F(colind(ky))
end do
end do
!$omp end do
!$omp end parallel
OpenMP 2. add firstprivate to read-only vectors to improve memory access
!$omp parallel firstprivate(Sparse, colind, rowind)
...
!$omp end parallel
These are the results for up to 16 threads on 16 cores:
#threads    1      2      4      8      16
OpenMP v1
  ifort     0.22   0.14   0.088  0.050  0.027
  gfortran  0.155  0.11   0.064  0.035  0.020
OpenMP v2
  ifort     0.24   0.12   0.065  0.042  0.029
  gfortran  0.157  0.11   0.052  0.036  0.029
The speedup (around 8x at 16 threads) is reasonable considering
that it is a memory-bound computation. The firstprivate optimization
has advantages only for a small number of threads. gfortran using
16 threads with OpenMP v1 is the "best" OpenMP solution.
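The "around 8" figure follows from dividing the 1-thread time by the 16-thread time. A small sketch of that arithmetic, written in C++ for illustration; the helper names speedup and efficiency are hypothetical.

```cpp
// Speedup: how many times faster the parallel run is than the 1-thread run.
constexpr double speedup(double t_one_thread, double t_parallel) {
    return t_one_thread / t_parallel;
}

// Parallel efficiency: speedup divided by the number of threads (1.0 would be ideal).
constexpr double efficiency(double t_one_thread, double t_parallel, int threads) {
    return speedup(t_one_thread, t_parallel) / threads;
}

// OpenMP v1 from the table above:
//   ifort:    0.22  / 0.027 is about 8.1
//   gfortran: 0.155 / 0.020 =        7.75
constexpr double kIfortSpeedup = speedup(0.22, 0.027);
constexpr double kGfortranSpeedup = speedup(0.155, 0.020);
```

Both compilers therefore reach roughly half of the ideal 16x, i.e. a parallel efficiency of about 0.5 at 16 threads.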
I am having a hard time seeing where COLIND is and what it is doing... And also KX and KY. For the inner loop you want that vectorized, and that seems easiest for me using OpenMP SIMD REDUCTION. I am specifically looking here:
!!$ A(kx)=0
!!$ do ky=1,size(pos)
!!$ A(kx)=A(kx)+Sparse(pos(ky))*F(colind(pos(ky)))
!!$ end do
If you have to gather (PACK) then it may not help much. If there are more than 7/8 of zeros in F then F is likely better to PACK. Otherwise it may be better to vector multiply everything (including the zero-sums).
The main rule is that the data needs to be contiguous, so you cannot vectorize across the second dimension... It feels like Sparse and F are rank=2, but they are shown as being RANK=1. That works fine for going through as a vector, even if they are really a rank=2 array. UNION/MAP can also be used to implement a 2D array as also being a 1D vector.
Are Sparse and F really rank=1? and what are nax, nay, nxy and colind used for? And many of those are not defined (e.g. nay , nnz and colind )
I have a use case where a set of strings will be searched for a particular string, s. The percent of hits or positive matches for these searches will be very high. Let's say 99%+ of the time, s will be in the set.
I'm using boost::unordered_set right now, and even with its very fast hash algorithm, it takes about 600 ms on a VM to search the set 500,000 times. Yeah, that's pretty good, but unacceptable for what I'm working on.
So, is there any sort of data structure optimized for a high percentage of hits? I cannot precompute the hashes for the strings coming in, so I think I'm looking at a complexity of O(average string length) for a hash set like boost::unordered_set. I looked at tries; these would probably perform well in the opposite case, where there are rarely hits, but not really any better than hash sets.
edit: some other details with my particular use case:
the number of strings in the set is around 5,000. The longest string is probably no more than 200 chars. Search gets called again and again with the same strings, but they are coming in from an outside system and I cannot predict what the next string will be. The exact match rate is actually 99.975%.
edit2: I did some of my own benchmarking
I collected 5,000 of the strings that occur in the real system. I created two scenarios.
1) I loop over the list of known strings and do a search for them in the container. I do this for 500,000 searches("hits").
2) I loop through a set of strings known not to be in the container, for 500,000 searches ("misses").
(Note - I'm interested in hashing the data in reverse because eyeballing my data, I noticed that there are a lot of common prefixes and the suffixes differ - at least that is what it looks like.)
Tests done on a virtualbox CentOS 5.6 VM running on a macbook host.
                                                                           hits (ms)  misses (ms)
boost::unordered_set with default hash and no reserved size                   591.15       441.39
tr1::unordered_set with default hash                                          191.09       143.80
boost::unordered_set with a reserve size set                                  579.31       431.54
boost::unordered_set w/custom hash (hash on the last 15 chars + str size)     357.34       812.13
boost::unordered_set w/custom hash (hash on the last 25 chars + str size)     362.60       795.33
trie                                                                         1809.34        58.11
trie with reversed insertion/search                                          2806.26       311.14
In my tests, where there are a lot of matches, the tr1 set is the best. Where there are a lot of misses, the Trie wins big.
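For reference, a custom hash of the kind used in the "last 15 chars + str size" rows could look like the sketch below. This is a reconstruction under assumptions, not the exact code I benchmarked: the names SuffixHash and SuffixSet are made up for this example, it uses std::unordered_set rather than boost::unordered_set, and FNV-1a is an arbitrary choice of byte-mixing function.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>

// Hashes only the last kSuffixLen characters plus the string length.
struct SuffixHash {
    static constexpr std::size_t kSuffixLen = 15;

    std::size_t operator()(const std::string& s) const {
        std::uint64_t h = 14695981039346656037ull;  // FNV-1a 64-bit offset basis
        const std::size_t begin = s.size() > kSuffixLen ? s.size() - kSuffixLen : 0;
        for (std::size_t i = begin; i < s.size(); ++i) {
            h ^= static_cast<unsigned char>(s[i]);
            h *= 1099511628211ull;                  // FNV-1a 64-bit prime
        }
        h ^= s.size();                              // fold the length into the hash
        h *= 1099511628211ull;
        return static_cast<std::size_t>(h);
    }
};

using SuffixSet = std::unordered_set<std::string, SuffixHash>;
```

Two strings with the same length and the same last 15 characters always land in the same bucket with this hash and are only told apart by the full string comparison, so how well it works depends on how the test strings are distributed.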
my test loop looks like this, where function_set is the container being tested loaded with 5,000 strings, and functions is a vector of either all the strings in the container or a bunch of strings that are not in the container.
while (searched < kTotalSearches) {
for(std::vector<std::string>::const_iterator i = functions.begin(); i != functions.end(); ++i) {
function_set.count(*i);
searched++;
if (searched == kTotalSearches)
break;
}
}
std::cout << searched << " searches." << std::endl;
I'm pretty sure that Tries is what you are looking for. You are guaranteed not to go down a number of nodes greater than the length of your string. Once you've reached a leaf, then there might be some linear search if there are collisions for this particular node. It depends on how you build it. Since you're using a set I would assume that this is not a problem.
The unordered_set will have a complexity of at worst O(n), but n in this case is the number of nodes that you have (500k) and not the number of characters you are searching for (probably less than 500k).
After edit:
Maybe what you really need is a cache of the results after your search algo succeeded.
This question piqued my curiosity so I did a few tests to satisfy myself with the following results. A few general notes:
The usual caveats about benchmarking apply (don't trust my numbers, do your own benchmarks with your specific use case and data, etc...).
Tests were done using MSVS C++ 2010 (speed optimized, release build).
Benchmarks were run using 10 million loops to improve timing accuracy.
Names were generated by randomly concatenating 20 different string fragments into strings ranging from 4 to 65 characters in length.
Names included only letters and some tests (trie) were case-insensitive for simplicity, though there's no reason the methods can't be extended to include other characters.
Tests try to match the 99.975% hit rate given in the question.
Test Descriptions
Basic description of the tests run with the relevant details:
String Iteration -- Simply iterates through the function name for a baseline time comparison.
Map -- std::unordered_map<std::string, int>
Set -- std::unordered_set<std::string>
BoostSet -- boost::unordered_set<std::string>, v1.47.0
CharMap -- std::unordered_map<const char*, int>
CharSet -- std::unordered_set<const char*>
FastMap -- Simply a std::unordered_map<> using a custom FNV-1a hash algorithm.
FastSet -- Simply a std::unordered_set<> using a custom FNV-1a hash algorithm.
CustomMap -- A basic hash map I wrote myself years ago.
Trie -- A standard trie downloaded from Google code.
CustomTrie -- A bare-bones trie I wrote myself.
BinarySearch -- Using std::binary_search() on a sorted std::vector<std::string>.
SortArrayMap -- An attempt to use a size_t VectorIndex[26][26][26][26][26] array to index into a sorted array.
PerfectMap -- A std::unordered_map<> using a perfect hash from gperf.
PerfectWordSet -- Using the gperf in_word_set() function directly.
PerfectWordSetFunc -- Same as PerfectWordSet but called in a function instead of inline.
PerfectWordSetThread -- Same as PerfectWordSet but work is split into N threads (standard Windows threads). No synchronization is used except for waiting for the threads to finish.
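For readers who haven't seen FNV-1a (the hash used in the FastMap and FastSet tests above), this is a minimal 32-bit sketch in Python 3. It is not the code used in the benchmark, which was C++; the function name is my own. The offset basis and prime are the standard published FNV constants.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


print(hex(fnv1a_32(b"")))   # 0x811c9dc5 (empty input returns the offset basis)
print(hex(fnv1a_32(b"a")))  # 0xe40c292c
```

The loop touches every byte of the key once, so on long names it cannot be cheaper than simply iterating the string.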
Test Results (Mostly Hits)
Results sorted from slowest to fastest (for the case of mostly hits, ~99.975%):
Trie -- 9100 ms
SortArrayMap -- 6600 ms
PerfectWordSetFunc -- 4050 ms
CustomTrie -- 3470 ms
BinarySearch -- 3420 ms
CustomMap -- 2700 ms
CharSet -- 1300 ms
CharMap -- 1300 ms
BoostSet -- 1200 ms
FastSet -- 970 ms
FastMap -- 930 ms
Original Poster -- 800 ms (estimated)
Set -- 730 ms
Map -- 690 ms
PerfectMap -- 650 ms
PerfectWordSet -- 500 ms
PerfectWordSetThread(1) -- 500 ms
StringIteration -- 350 ms
PerfectWordSetThread(2) -- 260 ms
PerfectWordSetThread(4) -- 150 ms
PerfectWordSetThread(32) -- 125 ms
PerfectWordSetThread(8) -- 120 ms
PerfectWordSetThread(16) -- 110 ms
Test Results (Mostly Misses)
Results sorted from slowest to fastest (for the case of mostly misses, ~0.1% hits):
BinarySearch -- ? (took too long)
SortArrayMap -- 8050 ms
Trie -- 3200 ms
CustomMap -- 1700 ms
BoostSet -- 920 ms
CustomTrie -- 850 ms
FastMap -- 590 ms
FastSet -- 580 ms
CharSet -- 550 ms
CharMap -- 550 ms
StringIteration -- 350 ms
Set -- 330 ms
Map -- 330 ms
PerfectMap -- 280 ms
PerfectWordSet -- 140 ms
PerfectWordSetThread(1) -- 130 ms
PerfectWordSetThread(2) -- 75 ms
PerfectWordSetThread(4) -- 45 ms
PerfectWordSetThread(32) -- 45 ms
PerfectWordSetThread(8) -- 40 ms
PerfectWordSetThread(16) -- 35 ms
Discussion
My first guess was that a trie would be a good fit for this sort of thing, but from the results the opposite actually appears to be true. Thinking about it some more, this makes sense, and for the same reasons you would avoid a linked list.
I assume you may be familiar with the table of latencies that every programmer should know. In your case you have 500k lookups executing in 40ms, or 80ns/lookup. At that scale you easily lose if you have to access anything not already in the L1/L2 cache. A trie is really bad for this as you have an indirect and probably non-local memory access for every character. Given the size of the trie in this case I couldn't figure out any way of getting the entire trie to fit in cache to improve performance (though it may be possible). I still think that even if you did get the trie to fit entirely in L2 cache you would lose with all the indirection required.
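The budget arithmetic above can be written out as a small Python 3 sketch. The latency figures are the rough values from the commonly circulated latency table and are assumptions here; real numbers depend on the hardware.

```python
lookups = 500_000
total_ms = 40

# 40 ms spread over 500k lookups, expressed in nanoseconds per lookup.
budget_ns = total_ms * 1_000_000 / lookups
print(budget_ns)  # 80.0

# Approximate, hardware-dependent latencies (assumed values, in ns).
APPROX_LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "main memory reference": 100,
}

for name, latency in APPROX_LATENCY_NS.items():
    # How many accesses of this kind fit into one lookup's budget.
    print(f"{name}: {budget_ns / latency:.1f} per lookup")
```

Under these assumed figures a single trip to main memory already costs more than the whole per-lookup budget, which is why a pointer chase per character hurts so much.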
The std::unordered_ containers actually do a very good job of things out of the box. In fact, in trying to speed them up I actually made them slower (in the poorly named FastMap and FastSet trials).
Same thing with trying to switch from std::string to const char * (about twice as slow).
The boost::unordered_set<> was twice as slow as the std::unordered_set<> and I don't know if that is because I just used the built-in hash function, was using a slightly old version of boost, or something else. Have you tried std::unordered_set<> yourself?
By using gperf you can easily create a perfect hash function if your set of strings is known at compile time. You could probably create a perfect hash at runtime as well, depending on how often new strings are added to the map. This gets you a 23% speed increase over the standard map implementation.
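To illustrate what "perfect hash" means here, below is a toy sketch in Python 3. It is not gperf's algorithm: it simply brute-forces a seed until a fixed key set maps to distinct slots. All function names and the sample keys are hypothetical.

```python
def seeded_hash(key: str, seed: int) -> int:
    """FNV-1a style hash with a seed mixed into the starting value."""
    h = (0x811C9DC5 ^ seed) & 0xFFFFFFFF
    for byte in key.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def find_perfect_seed(keys, table_size, max_seed=100_000):
    """Try seeds until every key lands in its own slot."""
    for seed in range(max_seed):
        slots = {seeded_hash(k, seed) % table_size for k in keys}
        if len(slots) == len(keys):
            return seed
    raise ValueError("no collision-free seed found; try a larger table")


def build_table(keys, table_size):
    seed = find_perfect_seed(keys, table_size)
    table = [None] * table_size
    for k in keys:
        table[seeded_hash(k, seed) % table_size] = k
    return seed, table


def in_table(word, seed, table):
    """One hash, one slot read, one string compare; no collision chain."""
    return table[seeded_hash(word, seed) % len(table)] == word


KEYS = ["GetName", "SetName", "Resize", "Clear", "Draw"]  # hypothetical names
SEED, TABLE = build_table(KEYS, table_size=16)
print(all(in_table(k, SEED, TABLE) for k in KEYS))  # True
print(in_table("NotThere", SEED, TABLE))            # False
```

The point is that a lookup needs exactly one probe plus one comparison to verify the match, which is the property gperf gives you at compile time.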
The PerfectWordSetThread tests simply use the perfect hash and split the work up into 1-32 threads. This problem is perfectly parallel (at least the benchmark is) so you get almost a 5x boost of performance in the 16 threads case. This works out to only 6.3ms/500k lookups, or 13 ns/lookup... a mere 50 cycles on a 4GHz processor.
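The way the work is split can be sketched like this in Python 3. This is only the partitioning pattern, not the Windows threads code from the benchmark, and the helper names are made up. Be aware that CPython's GIL means this particular sketch will not show the speedup measured above; it just shows that each chunk is independent and the only coordination is waiting for the workers to finish.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(queries, known):
    """Count how many queries are present in the read-only set."""
    return sum(1 for q in queries if q in known)


def parallel_count(queries, known, workers):
    """Split queries into contiguous chunks, one per worker, and sum the counts."""
    chunk = max(1, (len(queries) + workers - 1) // workers)
    parts = [queries[i:i + chunk] for i in range(0, len(queries), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The set is only read, so no locking is needed between workers.
        return sum(pool.map(lambda part: count_matches(part, known), parts))


known = frozenset(["GetName", "SetName"])           # hypothetical names
queries = ["GetName", "Foo", "SetName", "Bar"] * 5  # 20 lookups, 10 hits
print(parallel_count(queries, known, workers=4))    # 10
```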
The StringIteration case really points out how difficult it is going to be to get much faster. Just iterating the string being found takes 350 ms, or 70% of the time compared to the 500 ms map case. Even if you could perfectly guess each string you would still need this 350 ms (for 10 million lookups) to actually compare and verify the match.
Edit: Another thing that illustrates how tight things are is the difference between the PerfectWordSetFunc at 4050 ms and PerfectWordSet at 500 ms. The only difference between the two is that one is called in a function and one is called inline. Calling it as a function reduces the speed by a factor of 8. In basic pseudo-code this is just:
bool IsInPerfectWordSet (string Match)
{
    return in_word_set(Match);
}
//Inline benchmark: PerfectWordSet
for i = 1 to 10,000,000
{
    if (in_word_set(SomeString)) ++MatchCount;
}
//Function call benchmark: PerfectWordSetFunc
for i = 1 to 10,000,000
{
    if (IsInPerfectWordSet(SomeString)) ++MatchCount;
}
This really highlights the difference in performance that inline code/functions can make. You also have to be careful in making sure what you are measuring in a benchmark. Sometimes you would want to include the function call overhead, and sometimes not.
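The same measurement pitfall is easy to reproduce in other languages. Here is a small Python 3 sketch (names and data are hypothetical) that times a membership test written inline against the same test wrapped in a function. It says nothing about the C++ numbers above; it only shows how the two variants of a benchmark loop can be set up side by side so you know which one you are measuring.

```python
import timeit

WORDS = frozenset(["alpha", "beta", "gamma"])         # hypothetical word set
QUERIES = ["alpha", "delta", "gamma", "omega"] * 250  # 1,000 lookups per pass


def is_in_word_set(word):
    """Thin wrapper, analogous to IsInPerfectWordSet in the pseudo-code."""
    return word in WORDS


def count_inline(queries):
    count = 0
    for q in queries:
        if q in WORDS:  # membership test written directly in the loop
            count += 1
    return count


def count_via_function(queries):
    count = 0
    for q in queries:
        if is_in_word_set(q):  # same test, but behind a function call
            count += 1
    return count


inline_s = min(timeit.repeat(lambda: count_inline(QUERIES), repeat=5, number=100))
wrapped_s = min(timeit.repeat(lambda: count_via_function(QUERIES), repeat=5, number=100))
print(f"inline:  {inline_s:.4f} s")
print(f"wrapped: {wrapped_s:.4f} s")
```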
Can You Get Faster?
I've learned to never say "no" to this question, but at some point the effort may not be worth it. If you can split the lookups into threads and use a perfect, or near-perfect, hash function you should be able to approach 100 million lookup matches per second (probably more on a machine with multiple physical processors).
A couple of ideas I don't have the knowledge to attempt:
1. Assembly optimization using SSE
2. Use the GPU for additional throughput
3. Change your design so you don't need fast lookups
Take a moment to consider #3... the fastest code is that which never needs to run. If you can reduce the number of lookups, or reduce the need for an extremely high throughput, you won't need to spend time micro-optimizing the ultimate lookup function.
If the set of strings is fixed at compile time (e.g. it is a dictionary of known human words), you could perhaps use a perfect hash algorithm, and use the gperf generator.
Otherwise, you might perhaps use an array of 26 hash tables, indexed by the first letter of the word to hash.
BTW, a sorted array of these strings searched with a binary search might be faster (since log2 of 5000 is about 13), as might a std::map or a std::set.
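As a rough illustration of the sorted-array idea, here is a Python 3 sketch using the standard bisect module (the C++ equivalent would be std::binary_search or std::lower_bound on a sorted std::vector). The word list is made up.

```python
import math
from bisect import bisect_left


def sorted_contains(sorted_words, word):
    """Binary search: find the insertion point, then verify the match."""
    i = bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word


words = sorted(["resize", "getname", "setname", "clear"])  # hypothetical data
print(sorted_contains(words, "clear"))  # True
print(sorted_contains(words, "zzz"))    # False

# Upper bound on string comparisons for 5000 entries.
print(math.ceil(math.log2(5000)))  # 13
```

Keep in mind that each of those ~13 steps is a full string comparison, so the cost also depends on how long the common prefixes are.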
Finally, you might define your own hashing function: perhaps in your particular case, hashing only the first 16 bytes could be enough!
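A sketch of the "hash only a prefix" idea in Python 3, with a hypothetical wrapper class. In C++ you would pass a custom hasher to the container instead. This only demonstrates the contract: the hash looks at the first 16 characters, while equality still compares the whole string, so lookups stay correct even when prefixes collide. In Python the slice itself costs time, so don't expect this sketch to be faster than the built-in str hash.

```python
class PrefixHashedName:
    """String wrapper whose hash depends only on the first 16 characters."""
    __slots__ = ("value",)
    PREFIX_LEN = 16

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value[:self.PREFIX_LEN])

    def __eq__(self, other):
        # Equality uses the full string, so colliding prefixes are still told apart.
        return isinstance(other, PrefixHashedName) and self.value == other.value


a = PrefixHashedName("x" * 16 + "_first")
b = PrefixHashedName("x" * 16 + "_second")
table = {a}

print(hash(a) == hash(b))  # True: same 16-character prefix, same bucket
print(b in table)          # False: full comparison rejects it
```

If many of your names share a long common prefix, every one of them lands in the same bucket, which is presumably what happened to the "last 15 chars" custom hash on misses in the table further up.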
If the set of strings is fixed, you could consider generating a binary search on it (e.g. write a script that generates a function with 5000 tests, of which only about log2 of 5000 are executed per lookup).
Also, even if the set of strings is slightly variable (e.g. it changes from one program run to the next, but stays constant during a single run), you might even consider generating the function (by emitting C++ code, then compiling it) on the fly and dlopen-ing it.
You really should benchmark and try several solutions! It probably is more an engineering issue than an algorithmic one.
I am using Python 2.7.5 on Mac OS X 10.9.3 with 8GB memory and a 1.7GHz Core i5. I have measured the time consumption as below.
d = {i:i*2 for i in xrange(10**7*3)} #WARNING: it takes time and consumes a lot of RAM
%time for k in d: k,d[k]
CPU times: user 6.22 s, sys: 10.1 ms, total: 6.23 s
Wall time: 6.23 s
%time for k,v in d.iteritems(): k, v
CPU times: user 7.67 s, sys: 27.1 ms, total: 7.7 s
Wall time: 7.69 s
It seems iteritems is slower.
I am wondering what is the advantage of iteritems over directly accessing the dict.
Update:
For a more accurate time profile:
In [23]: %timeit -n 5 for k in d: v=d[k]
5 loops, best of 3: 2.32 s per loop
In [24]: %timeit -n 5 for k,v in d.iteritems(): v
5 loops, best of 3: 2.33 s per loop
To answer your question we should first dig up some information about how and when iteritems() was added to the API.
The iteritems() method was added in Python 2.2 following the introduction of iterators and generators in the language (see also: What is the difference between dict.items() and dict.iteritems()?). In fact the method is explicitly mentioned in PEP 234. So it was introduced as a lazy alternative to the already present items().
This followed the same pattern as file.xreadlines() versus file.readlines(), which was introduced in Python 2.1 (and already deprecated in Python 2.3, by the way).
In Python 2.3 the itertools module was added, which introduced lazy counterparts to map, filter etc.
In other words, at the time there was (and there still is) a strong trend towards laziness of operations. One of the reasons is to improve memory efficiency. Another is to avoid unneeded computation.
I cannot find any reference that says that it was introduced to improve the speed of looping over the dictionary. It was simply used to replace calls to items() that didn't actually have to return a list. Note that this includes more use-cases than just a simple for loop.
For example in the code:
function(dictionary.iteritems())
you cannot simply use a for loop to replace iteritems() as in your example. You'd have to write a function (or use a genexp, even though they weren't available when iteritems() was introduced, and they wouldn't be DRY...).
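A small sketch of that use-case, written for Python 3 where items() plays the role that iteritems() had in Python 2. The function and data are made up for illustration.

```python
def most_expensive(pairs):
    """Accepts any iterable of (name, price) pairs, not just a dict."""
    return max(pairs, key=lambda pair: pair[1])


prices = {"apple": 3, "pear": 5, "plum": 2}  # hypothetical data

# The items iterable is handed straight to the function; there is no
# explicit "for k in d: d[k]" loop to write at the call site.
best = most_expensive(prices.items())
print(best)  # ('pear', 5)
```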
Retrieving the items from a dict is done pretty often so it does make sense to provide a built-in method and, in fact, there was one: items(). The problem with items() is that:
it isn't lazy, meaning that calling it on a big dict can take quite some time
it takes a lot of memory. It can almost double the memory usage of a program if called on a very big dict that contains most objects being manipulated
Most of the time it is iterated only once
So, when introducing iterators and generators, it was obvious to just add a lazy counterpart. If you need a list of items because you want to index it or iterate more than once, use items(), otherwise you can just use iteritems() and avoid the problems cited above.
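The memory point can be seen with a quick sketch. This is Python 3, so the comparison is between the lazy d.items() view and an explicit list(d.items()), which behaves like the Python 2 items() list. Note that sys.getsizeof only reports the size of the container object itself, not of the tuples it refers to, so the real gap is even larger than what it shows.

```python
import sys

d = {i: i * 2 for i in range(100_000)}

lazy = d.items()            # small view object, no copy of the data
snapshot = list(d.items())  # materializes 100,000 (key, value) tuples

print(sys.getsizeof(lazy), "bytes for the view")
print(sys.getsizeof(snapshot), "bytes for the list object alone")

# The list can be indexed and iterated repeatedly; the view cannot be indexed.
print(snapshot[0])  # (0, 0)

# The view tracks the dict, while the list is a frozen copy.
d[-1] = 0
print(len(lazy), len(snapshot))  # 100001 100000
```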
The advantages of using iteritems() are the same as using items() versus manually getting the value:
You write less code, which makes it more DRY and reduces the chances of errors
Code is more readable.
Plus the advantages of laziness.
As I already stated I cannot reproduce your performance results. On my machine iteritems() is always faster than iterating + looking up by key. The difference is quite negligible anyway, and it's probably due to how the OS is handling caching and memory in general. In other words your argument about efficiency isn't a strong argument against (nor for) using one or the other alternative.
Given equal performances on average, use the most readable and concise alternative: iteritems(). This discussion would be similar to asking "why use a foreach when you can just loop by index with the same performance?". The importance of foreach isn't in the fact that you iterate faster but that you avoid writing boiler-plate code and improve readability.
I'd like to point out that iteritems() was in fact removed in Python 3. This was part of the "cleanup" of this version. Python 3's items() method is (mostly) equivalent to Python 2's viewitems() method (actually a backport if I'm not mistaken...).
This version is lazy (and thus provides a replacement for iteritems()) and has also further functionality, such as providing "set-like" operations (such as finding common items between dicts in an efficient way etc.) So in python3 the reasons to use items() instead of manually retrieving the values are even more compelling.
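For example, the "set-like" operations on Python 3 views look like this (the two dicts are made up; the items() operations shown here require the values to be hashable):

```python
old = {"a": 1, "b": 2, "c": 3}
new = {"b": 2, "c": 30, "d": 4}

# (key, value) pairs present in both dicts.
unchanged = old.items() & new.items()

# Keys present in both, and keys that only exist in the new dict.
shared_keys = old.keys() & new.keys()
added_keys = new.keys() - old.keys()

print(unchanged)            # {('b', 2)}
print(sorted(shared_keys))  # ['b', 'c']
print(added_keys)           # {'d'}
```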
Using for k,v in d.iteritems() with more descriptive names can make the code in the loop suite easier to read.
As opposed to using the system time command, running in IPython with timeit yields:
d = {i:i*2 for i in xrange(10**7*3)} #WARNING: it takes time and consumes a lot of RAM
timeit for k in d: k, d[k]
1 loops, best of 3: 2.46 s per loop
timeit for k, v in d.iteritems(): k, v
1 loops, best of 3: 1.92 s per loop
I ran this on Windows, Python 2.7.6. Have you run it multiple times to confirm it wasn't something going on with the system itself?
I know technically this is not an answer to the question, but the comments section is a poor place to put this sort of information. I hope that this helps people better understand the nature of the problem being discussed.
For thoroughness I've timed a bunch of different configurations. These are all timed using timeit with a repetition factor of 10. This is using CPython version 2.7.6 on Mac OS X 10.9.3 with 16GB memory and 2.3GHz Core i7.
The original configuration
python -m timeit -n 10 -s 'd={i:i*2 for i in xrange(10**7*3)}' 'for k in d: k, d[k]'
>> 10 loops, best of 3: 2.05 sec per loop
python -m timeit -n 10 -s 'd={i:i*2 for i in xrange(10**7*3)}' 'for k, v in d.iteritems(): k, v'
>> 10 loops, best of 3: 1.74 sec per loop
Bakuriu's suggestion
This suggestion involves using pass as the body of the iteritems loop, and assigning a value to a variable v in the first loop by accessing the dictionary at k.
python -m timeit -n 10 -s 'd={i:i*2 for i in xrange(10**7*3)}' 'for k in d: v = d[k]'
>> 10 loops, best of 3: 1.29 sec per loop
python -m timeit -n 10 -s 'd={i:i*2 for i in xrange(10**7*3)}' 'for k, v in d.iteritems(): pass'
>> 10 loops, best of 3: 934 msec per loop
No assignment in the first
This one removes the assignment in the first loop but keeps the dictionary access. This is not a fair comparison because the second loop creates an additional variable and assigns it a value implicitly.
python -m timeit -n 10 -s 'd={i:i*2 for i in xrange(10**7*3)}' 'for k in d: d[k]'
>> 10 loops, best of 3: 1.27 sec per loop
Interestingly, the cost of the assignment is trivial compared to the access itself -- the difference being a mere 20 msec total. In every comparison (even the final, unfair one), iteritems wins out.
The times are closest, percentage wise, in the original configuration. This is probably due to the bulk of the work being creating a tuple (which is not assigned anywhere). Once that is removed from the equation, the differences between the two methods become more pronounced.
dict.items() wins out heavily in Python 3.5.
Here is a small performance stat:
d = {i:i*2 for i in range(10**3)}
timeit.timeit('for k in d: k,d[k]', globals=globals())
75.92739052970501
timeit.timeit('for k, v in d.items(): k,v', globals=globals())
57.31370617801076
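If you want to repeat this comparison yourself, here is a small self-contained Python 3 script. It is a sketch, not the exact commands used above: the helper name and the repeat/number values are my own choices, and the numbers you get will depend on your machine and Python version.

```python
import timeit

d = {i: i * 2 for i in range(10**3)}


def time_loop(stmt, repeat=5, number=200):
    """Best-of-`repeat` time for one execution of stmt, in seconds."""
    runs = timeit.repeat(stmt, globals={"d": d}, repeat=repeat, number=number)
    return min(runs) / number


by_key = time_loop("for k in d: k, d[k]")
by_items = time_loop("for k, v in d.items(): k, v")

print(f"key lookup: {by_key * 1e6:.1f} us per pass")
print(f"items():    {by_items * 1e6:.1f} us per pass")
```

Taking the minimum over several repeats is the usual way to reduce noise from whatever else the system is doing.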