Related
Are SIMD instructions built for vector numerical calculations only? Or do they lend themselves well to a class of string manipulation tasks, like writing rows of data to a text file where the order of the rows does not matter? If so, which APIs or libraries should I start with?
Yes! And this is actually done in high-performance parsing libraries. One example is simdjson, a parser that can parse gigabytes of JSON per second. There's an About simdjson section in the readme, which has a link to a talk that goes over some of the implementation details.
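For a taste of the API, the quick-start from the simdjson readme looks roughly like this (exact names vary between simdjson versions, so check the readme of the version you have):

    #include <iostream>
    #include "simdjson.h"
    using namespace simdjson;

    int main() {
        dom::parser parser;
        // load() reads and parses the file; the SIMD magic happens inside
        dom::element tweets = parser.load("twitter.json");
        std::cout << tweets["search_metadata"]["count"]
                  << " results." << std::endl;
    }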
SIMD instructions operate on numeric values, but once you're at that level, "text" is just numeric values too: UTF-8 code units are unsigned 8-bit integers, with plenty of SIMD support. Bitmap processing is full of operations on many 8-bit unsigned integers in parallel, and that is so common that SIMD instruction sets cover these operations well, so plenty of them are also usable for text processing.
I/O is so many orders of magnitude slower than the CPU
Not really. It is slower, but when the CPU has to do tasks that kill the streaming performance, such as branch mispredictions, cache misses, or wasting lots of speculative execution resources on dead-ends, the CPU can very easily not keep up with I/O. Modern network cards used for fast storage access or multi-machine communications can saturate the CPU's memory ports. All of them. And keep them that way. But that's state of the art and quite expensive at the moment (bonded 50 GBit links and such). Sequential, byte-at-a-time parser code is way slower than that.
Yes, especially for ASCII, e.g. Convert a String In C++ To Upper Case. Or checking for valid UTF-8 (https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/), or checking whether a string happens to be in the ASCII subset of UTF-8. (If so, you know you have fixed-width characters, which is very useful for other things.)
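To give a flavor of the uppercase trick with SSE2 intrinsics, here is a minimal sketch (ascii_toupper_sse2 is a made-up name; real library code would also deal with alignment and page boundaries):

    #include <emmintrin.h>  // SSE2
    #include <cstddef>

    // Uppercase ASCII in place, 16 bytes at a time: clear bit 0x20 only
    // for bytes in ['a','z']. Bytes >= 0x80 are negative in the signed
    // compares, so they fall outside the range and pass through untouched.
    void ascii_toupper_sse2(char* s, std::size_t n) {
        const __m128i lo = _mm_set1_epi8('a' - 1);
        const __m128i hi = _mm_set1_epi8('z' + 1);
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i v  = _mm_loadu_si128(reinterpret_cast<__m128i*>(s + i));
            __m128i gt = _mm_cmpgt_epi8(v, lo);          // v > 'a'-1
            __m128i lt = _mm_cmplt_epi8(v, hi);          // v < 'z'+1
            __m128i m  = _mm_and_si128(gt, lt);          // in-range mask
            v = _mm_xor_si128(v, _mm_and_si128(m, _mm_set1_epi8(0x20)));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(s + i), v);
        }
        for (; i < n; ++i)                               // scalar tail
            if (s[i] >= 'a' && s[i] <= 'z') s[i] = char(s[i] & ~0x20);
    }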
As Daniel Lemire reported, an early attempt at UTF-8 validation gave "a few CPU cycles per character." But with SIMD, he and collaborators were able to achieve ~1 instruction per byte, for net speeds of ~12 GB/s (vs. ~25 GB/s of DRAM bandwidth on a Haswell desktop, or 34 GB/s on Skylake with DDR4-2133).
Of course, most C libraries already have hand-written asm implementations of functions like strlen, strcpy, strcasecmp, strstr, etc. that use SIMD if it's a win (like on x86-64 where pmovmskb allows relatively efficient compare/branch on any/all SIMD compare results being true or false.) The first part of my answer on Why does glibc's strlen need to be so complicated to run quickly? has some links to hand-optimized asm that glibc actually uses on mainstream platforms, instead of the portable plain C fallback the question is asking about.
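To make the pmovmskb pattern concrete, here is a sketch of a strlen-style building block (not glibc's actual code; taking an explicit length sidesteps the page-boundary care real strlen needs, and __builtin_ctz is the GCC/Clang trailing-zero-count intrinsic):

    #include <emmintrin.h>  // SSE2
    #include <cstddef>

    // Find the first zero byte in a buffer of known length n.
    std::size_t find_nul_sse2(const char* p, std::size_t n) {
        const __m128i zero = _mm_setzero_si128();
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
            // pcmpeqb + pmovmskb: one bit per byte-compare result
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
            if (mask) return i + __builtin_ctz(mask);  // first zero byte
        }
        for (; i < n; ++i)                             // scalar tail
            if (!p[i]) return i;
        return n;
    }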
https://github.com/WojciechMula/sse4-strstr has a variety of strstr implementations. Substring searching is a much harder problem, with non-trivial algorithm choices as well as just brute-force. The SSE4.2 "string" instructions may help for that, but if not then SIMD vector compares definitely can for better brute-force building blocks.
(SSE4.2 "string" instructions like pcmpistri are definitely worse for memcmp / strcmp and strlen where plain SSE2 (or AVX2) is better. See How much faster are SSE4.2 string instructions than SSE2 for memcmp? and https://www.strchr.com/strcmp_and_strlen_using_sse_4.2)
You can even do cool tricks with looking up a shuffle control vector based on a vector compare bitmap, e.g. Fastest way to get IPv4 address from string or How to implement atoi using SIMD?.
Although I'm not sure the SIMD atoi is a win vs. scalar, especially for short numbers.
Naively I would say SIMD would not help since for long strings memory bandwidth would be the bottleneck. Why is this not the case?
DRAM bandwidth is really pretty good compared to modern CPU speeds, especially when scalar code only consumes the data in byte-sized chunks rather than 8-byte chunks. And data is often hot in L3 cache after copying (e.g. from a read system call).
Even if data has to come from DRAM, modern desktop / laptop CPUs can load about 8 bytes per core clock cycle, within a factor of 2 of that anyway, especially if this core isn't competing with other bandwidth-intensive code on other cores. Good luck keeping up with that with byte-at-a-time scalar loops.
Besides, if you just did a read() system call to get the kernel to memcpy some data from a network buffer or pagecache into your process's memory, the data might still be hot in L3 cache, or even L2. Xeon CPUs can even DMA into L3 cache (Intel's DDIO). Aiming for memory bandwidth is a pretty low / unambitious goal, and a poor excuse for not fully optimizing a function if it actually gets used a lot.
Fewer instructions to process the same data lets out-of-order exec "see" farther ahead, and start demand-loads for later pages / cache lines earlier in cases where HW prefetch wouldn't (e.g. across page boundaries). And also better overlap the string processing with earlier / later independent work.
It can also be more hyperthreading-friendly, leaving the HT sibling core with better throughput if anything's running on it. (Maybe nothing if there aren't a lot of threads active). Also, if SIMD is efficient enough, it may save energy: tracking instructions through the pipeline is a large part of the cost, not the integer execution units themselves. Higher power while running, but finishing sooner, is good: race to sleep. CPUs save much more power when fully idle than when just running "cheap" instructions.
SIMD instructions are used on a very low level. Writing data to a text file is a much higher level, involving buffered I/O etc.
You might use SIMD, e.g., to convert a string from lower case to upper case. Wrapping SIMD into a library would be moot. You write the instructions yourself. Which also means that they are processor-specific (e.g. SSE variants on x86/AMD64).
For processing several rows of text in parallel, you might instead use thread-level parallelism, e.g. as offered by OpenMP or TBB.
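For instance, a minimal OpenMP sketch (process_row is a hypothetical per-row function; compile with -fopenmp):

    #include <string>
    #include <vector>

    void process_row(std::string& row);  // hypothetical per-row work

    // Each row is independent, so output order doesn't matter.
    void process_rows(std::vector<std::string>& rows) {
        #pragma omp parallel for
        for (long i = 0; i < (long)rows.size(); ++i)
            process_row(rows[i]);
    }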
However, if you stick to the example of writing to the text file, we get into another territory of performance optimization (I/O instead of computation).
I have not performed any profile testing of this yet, but what would the general consensus be on the advantages/disadvantages of resource loading from the hard disk using multiple threads vs one thread? Note. I am not talking about the main thread.
I would have thought that using more than one "other" thread to do the loading would be pointless, because the HD cannot do two things at once and would therefore only cause disk contention.
Not sure which way to go architecturally, appreciate any advice.
EDIT: Apologies, I meant to mean an SSD drive not a magnetic drive. Both are HD's to me, but I am more interested in the case of a system with a single SSD drive.
As pointed out in the comments, one advantage of using multiple threads is that a large file load will not delay the delivery of a smaller file to the consumer of the loader thread. In my case this is a big advantage, so even if it costs a little perf, having multiple threads is desirable.
I know there are no simple answers, but the real question I am asking is: what kind of performance penalty (%) would there be for issuing parallel disk reads that get serialized in the OS layer, as opposed to allowing only one resource-loader thread? And what factors drive this? I don't mean platform, manufacturer, etc. I mean: technically, what aspects of the OS/HD interaction influence this penalty (in theory)?
FURTHER EDIT:
My exact use case is texture-loading threads which only exist to load from the HD and then "pass" the textures on to OpenGL, so there is minimal computation in the threads (maybe some type conversion, etc.). In this case the threads would spend most of their time waiting for the HD (I would have thought), so how the OS-HD interaction is managed is important to understand. My OS is Windows 10.
Note. I am not talking about the main thread.
Main vs non-main thread makes zero difference to the speed of reading a disk.
I would have thought that using more than one "other" thread to do the loading would be pointless, because the HD cannot do two things at once and would therefore only cause disk contention.
Indeed. Not only are the attempted parallel reads forced to wait for each other (and thus not actually parallel), but they also make the disk's access pattern random instead of sequential, which is much, much slower due to disk-head seek time.
Of course, if you were to deal with multiple hard disks, then one thread dedicated for each drive would probably be optimal.
Now, if you were using a solid state drive instead of a hard drive, the situation isn't quite so clear cut. Multiple threads may be faster, slower, or comparable. There are probably many factors involved, such as firmware, file system, operating system, speed of the drive relative to some other bottleneck, etc.
In either case, RAID might invalidate assumptions made here.
It depends on how much processing of the data you're going to do. This will determine whether the application is I/O bound or compute bound.
For example, if all you are going to do to the data is some simple arithmetic, e.g. add 1, then you will end up being I/O bound. The CPU can add 1 to data far quicker than any I/O system can deliver flows of data.
However, if you're going to do a large amount of work on each batch of data, e.g. a FFT, then a filter, then a convolution (I'm picking random DSP routine names here), then it's likely that you will end up being compute bound; the CPU cannot keep up with the data being delivered by the I/O subsystem which owns your SSD.
It is quite an art to judge just how an algorithm should be structured to match the capabilities of the underlying machine, and vice versa. There are profiling tools like ftrace/KernelShark and Intel's VTune, which are both useful in analysing exactly what is going on. Google does a lot to measure how many searches-per-Watt their hardware accomplishes, power being their biggest cost.
In general I/O of any sort, even a big array of SSDs, is painfully slow. Even the main memory in a PC (DDR4) is painfully slow in comparison to what the CPU can consume, and even the L3 and L2 caches are sluggards in comparison to the CPU cores. It's hard to design and multi-thread an algorithm just right, so that the right amount of work is done on each data item while it is in L1 cache and the L2 and L3 caches, DDR4, and I/O subsystems can deliver the next data item to the L1 caches just in time to keep the CPU cores busy. And the ideal software design for one machine is likely hopeless on another with a different CPU, SSD, or memory SIMMs. Intel designs for good general-purpose computer performance, and actually extracting peak performance from a single program is a real challenge. Libraries like Intel's MKL and IPP are a very big help in doing this.
General Guidance
In general, one should look at it in terms of the data bandwidth required by any particular arrangement of threads and the work those threads are doing.
This means benchmarking your program's inner processing loop and measuring how much data it processed and how quickly it did so, choosing a number of data items that makes sense but is much larger than the size of L3 cache. A single 'data item' is an amount of input data, the amount of corresponding output data, and any variables used in processing the input to the output, the total size of which fits in L1 cache (with some room to spare). And no cheating: use the CPU's SSE/AVX instructions where appropriate; don't forego them by writing plain C or by not using something like Intel's IPP/MKL. [Though if one is using IPP/MKL, it kind of does all this for you to the best of its ability.]
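A minimal harness for that measurement (process_item is a hypothetical stand-in for your per-item work; size the working set well beyond L3 so you measure DRAM rather than cache):

    #include <chrono>
    #include <vector>

    float process_item(float x);  // hypothetical: the per-item work under test

    // Returns GB/s achieved through the inner loop.
    double measure_gbytes_per_sec(std::vector<float>& data) {
        auto t0 = std::chrono::steady_clock::now();
        for (float& x : data) x = process_item(x);   // the loop under test
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        return data.size() * sizeof(float) / secs / 1e9;
    }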
These days DDR4 memory is going to be good for anything between 20 and 100 GByte/second (depending on the CPU, number of SIMMs, etc.), so long as you're not making random, scattered accesses to the data. By saturating the L3 you are forcing yourself to be bound by the DDR4 speed. Then you can start changing your code, increasing the work done by each thread on a single data item. Keep increasing the work per item and the speed will eventually start increasing; you've reached the point where you are no longer limited by the speed of DDR4, then L3, then L2.
If after this you can still see ways of increasing the work per data item, then keep going. You eventually get to a data bandwidth somewhere near that of the IO subsystems, and only then will you be getting the absolute most out of the machine.
It's an iterative process, and experience allows one to short cut it.
Of course, if one runs out of ideas for things to increase the work done per data item then that's the end of the design process. More performance can be achieved only by improving the bandwidth of whatever has ended up being the bottleneck (almost certainly the SSD).
For those of us who like doing this sort of thing, the PS3's Cell processor was a dream. No need to second-guess the cache; there was none. One had complete control over what data and code was where, and when it was there.
A lot of people will tell you that an HD can't do more than one thing at once. This isn't quite true, because modern IO systems have a lot of indirection. Saturating them is difficult to do with one thread.
Here are three scenarios that I have experienced where multi-threading the IO helps.
Sometimes the IO reading library has a non-trivial amount of computation to do; think of reading compressed video, or parity checking after the transfer has happened. One example is using robocopy with multiple threads; it's not unusual to launch robocopy with 128 threads!
Many operating systems are designed so that a single process can't saturate the IO, because this would lead to system unresponsiveness. In one case I got a 3% read-speed improvement because I came closer to saturating the IO. This is doubly true if some system policy stripes the data across different drives, as might be set on a Lustre filesystem in an HPC cluster. For my application, the optimal number of threads was two.
More complicated IO, like a RAID card, contains a substantial cache that keeps the HD head constantly reading and writing. To get optimal throughput you need to be sure that the head is constantly reading/writing and not just moving between locations. In practice, the only way to do this is to saturate the card's on-board RAM.
So, many times you can overlap some minor amount of computation by using multiple threads, and stuff starts getting tricky with larger disk arrays.
Not sure which way to go architecturally, appreciate any advice.
Determining the amount of work per thread is the most common architectural optimization. Write code so that it's easy to vary the IO worker count. You're going to need to benchmark.
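One way to structure that, with the worker count as the parameter you benchmark (handle_data is a hypothetical consumer of each file's contents):

    #include <atomic>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <thread>
    #include <vector>

    void handle_data(const std::string& path, const std::string& bytes); // hypothetical

    // Read a list of files with a configurable number of worker threads.
    void load_files(const std::vector<std::string>& paths, unsigned workers) {
        std::atomic<std::size_t> next{0};          // shared work queue index
        std::vector<std::thread> pool;
        for (unsigned w = 0; w < workers; ++w)
            pool.emplace_back([&] {
                for (std::size_t i; (i = next++) < paths.size(); ) {
                    std::ifstream f(paths[i], std::ios::binary);
                    std::string bytes((std::istreambuf_iterator<char>(f)),
                                      std::istreambuf_iterator<char>());
                    handle_data(paths[i], bytes);
                }
            });
        for (auto& t : pool) t.join();
    }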
How big is (approximately) the overhead of an I/O syscall on Linux from a C program? I mean, how bad is running e.g. many small read / write operations compared with read / write on large buffers (on regular files or network sockets)? The app is heavily multithreaded.
Syscalls take at least 1-2 microseconds on most modern machines just for the syscall overhead, and much more time if they're doing anything complex that could block or sleep. Expect at least 20 microseconds and up to the order of milliseconds for IO. Compare this with a tiny function call or macro that reads a byte from a userspace buffer, which is likely to complete in a matter of nanoseconds (maybe 200 ns on a bad day).
You can measure this yourself. Just open /dev/zero and do some reading and writing while measuring the time. Also vary the number of bytes you put into each call, e.g. 1 byte, 2 bytes, 128 bytes, ..., 4096 bytes. Also take care to use the read(2) and write(2) syscalls and not anything using internal buffers.
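A quick harness along those lines (Linux; uses the raw read(2) syscall so no stdio buffering gets in the way):

    #include <fcntl.h>
    #include <unistd.h>
    #include <chrono>
    #include <cstdio>

    // Time N read(2) calls of various sizes against /dev/zero.
    int main() {
        const int N = 100000;
        char buf[4096];
        int fd = open("/dev/zero", O_RDONLY);
        size_t sizes[] = {1, 2, 128, 4096};
        for (size_t sz : sizes) {
            auto t0 = std::chrono::steady_clock::now();
            for (int i = 0; i < N; ++i)
                read(fd, buf, sz);                 // one syscall per iteration
            auto t1 = std::chrono::steady_clock::now();
            double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / N;
            std::printf("%5zu bytes: %.0f ns/call\n", sz, ns);
        }
        close(fd);
    }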
So I have megabytes of data stored as doubles that need to be sent over a network... now I don't need the precision that a double offers, so I want to convert these to a float before sending them over the network. What is the overhead of simply doing:
float myFloat = (float)myDouble;
I'll be doing this operation several million times every few seconds and don't want to slow anything down. Thanks
EDIT: My platform is x64 with Visual Studio 2008.
EDIT 2: I have no control over how they are stored.
As Michael Burr said, while the overhead strongly depends on your platform, it is definitely less than the time needed to send the values over the wire.
A rough estimate:
800 Mbit/s of payload on an excellent Gigabit wire is 25M floats/second.
On a 2 GHz single core, that gives you a whopping 80 clock cycles for each value converted to break even - anything less, and you will save time. That should be more than enough on all architectures :)
A simple load-store cycle (barring all caching delays) should be below 5 cycles per value. With instruction interleaving, SIMD extensions and/or parallelizing on multiple cores, you are likely to do multiple conversions in a single cycle.
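As a sketch of the SIMD case on x86-64 (SSE2 converts two doubles per instruction; a compiler will often auto-vectorize the plain scalar loop just as well, so treat this as illustrative):

    #include <emmintrin.h>  // SSE2
    #include <cstddef>

    // Narrow a buffer of doubles to floats, two at a time.
    void doubles_to_floats(const double* src, float* dst, std::size_t n) {
        std::size_t i = 0;
        for (; i + 2 <= n; i += 2) {
            __m128d d = _mm_loadu_pd(src + i);        // load two doubles
            __m128  f = _mm_cvtpd_ps(d);              // -> two floats in the low lanes
            _mm_storel_pi(reinterpret_cast<__m64*>(dst + i), f); // store low 8 bytes
        }
        for (; i < n; ++i)                            // scalar tail
            dst[i] = static_cast<float>(src[i]);
    }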
Also, the receiver will be happy having to handle only half the data. Remember that memory access time is nonlinear.
The only thing arguing against the conversion would be if the transfer needed minimal CPU load: a modern architecture can transfer the data from disk/memory to the network without CPU intervention. However, with the above numbers I'd say that doesn't matter in practice.
[edit]
I checked some numbers: the 387 coprocessor would indeed have taken around 70 cycles for a load-store cycle. On the original Pentium, you are down to 3 cycles without any parallelization.
So, unless you run a gigabit network on a 386...
It's going to depend on your implementation of C++ libraries. Test it and see.
Even if it does take time, this will not be the slow point in your application.
Your FPU can do the conversion a lot quicker than the network can send the traffic (so the bottleneck here will more than likely be the write to the socket).
But as with all things like this, measure it and see.
Personally I don't think any time spent here will affect the real time spent sending the data.
Assuming that you're talking about a significant number of packets to ship the data (a reasonable assumption if you're sending millions of values), casting the doubles to float will likely reduce the number of network packets by about half (assuming sizeof(double)==8 and sizeof(float)==4).
Almost certainly the savings in network traffic will dominate whatever time is spent performing the conversion. But as everyone says, measuring some tests will be the proof of the pudding.
Bearing in mind that most compilers deal with doubles a lot more efficiently than floats -- many promote float to double before performing operations on them -- I'd consider taking the block of data, ZIPping/compressing it, then sending the compressed block across. Depending on what your data looks like, you could get 60-90% compression, vs. the 50% you'd get converting 8-byte values to four bytes.
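A sketch of that approach using zlib's one-shot compress() (link with -lz; how well doubles actually compress depends entirely on the data):

    #include <zlib.h>
    #include <vector>

    // Deflate a block of doubles before sending it over the network.
    std::vector<unsigned char> compress_block(const std::vector<double>& vals) {
        uLong  srcLen = vals.size() * sizeof(double);
        uLongf dstLen = compressBound(srcLen);       // worst-case output size
        std::vector<unsigned char> out(dstLen);
        compress(out.data(), &dstLen,
                 reinterpret_cast<const Bytef*>(vals.data()), srcLen);
        out.resize(dstLen);                          // shrink to actual size
        return out;
    }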
You don't have any choice but to measure it yourself and see. You could use timers to measure it. Looks like someone has already implemented a neat C++ timer class.
I think this cast is a lot cheaper than you think, since it doesn't really involve any general-purpose arithmetic: the hardware just narrows the exponent field and rounds the mantissa down to fewer bits.
It will also depend on the CPU and what floating point support it has. In the bad old days (the 1980s), processors supported integer operations only; floating point math had to be emulated in software, or a separate floating-point chip (a coprocessor) could be bought.
Modern CPUs now have SIMD instructions, so large amounts of floating point data can be processed at once. These instructions include MMX, SSE, 3DNow! and the like. Your compiler may know how to make use of these instructions, but you may need to write your code in a particular way, and turn on the right options.
Finally, the fastest way to process floating point data is in a video card. A fairly new language called OpenCL lets you send tasks to the video card to be processed there.
It all depends on how much performance you need.
I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started writing the results straight to files, one at a time, which was the slowest option. I figured it gets a lot faster if I collect a certain amount of output in a vector and then write it all at once, then go back to processing while the hard disk is occupied writing all the stuff I poured into it (at least that seems to be what happens).
My question is: can I somehow estimate a convergence value for the amount of data that I should buffer before writing, based on the hardware constraints? To me it seems to be a hard-disk-buffer thing. I have a 16 MB buffer on that hard disk and get these values (all for ~100,000 files):
Buffer size    Time (minutes)
-----------    --------------
no buffer      ~8:30
1 MB           ~6:15
10 MB          ~5:45
50 MB          ~7:00
Or is this just a coincidence ?
I would also be interested in experience / rules of thumb about how writing performance can be optimized in general; for example, are larger hard-disk blocks helpful, etc.?
Edit:
Hardware is a pretty standard consumer drive (I'm a student, not a data center): a WD 3.5" 1 TB / 7200 rpm / 16 MB cache on USB 2.0, HFS+ journalled; the OS is Mac OS X 10.5. (I'll soon give it a try on ext3/Linux and on an internal rather than external disk.)
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
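A miniature version of that measure-and-pick strategy (run_with_batch_size is a hypothetical wrapper around your writer, benchmarked at each candidate setting):

    #include <chrono>
    #include <cstddef>

    void run_with_batch_size(std::size_t batch);  // hypothetical: your writer, batched

    // Try a few batch sizes and keep the fastest: FFTW-style autotuning in miniature.
    std::size_t pick_batch_size() {
        const std::size_t candidates[] = {1u << 16, 1u << 20, 1u << 22, 1u << 24};
        std::size_t best = candidates[0];
        double best_secs = 1e300;
        for (std::size_t batch : candidates) {
            auto t0 = std::chrono::steady_clock::now();
            run_with_batch_size(batch);
            double secs = std::chrono::duration<double>(
                std::chrono::steady_clock::now() - t0).count();
            if (secs < best_secs) { best_secs = secs; best = batch; }
        }
        return best;  // persist this and re-run the tuner when the hardware changes
    }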
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should look at optimizing your read access. OSes (at least Windows) are already really good at helping write access via buffering "under the hood", but if you're reading serially there isn't much they can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased perf.
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. Writing 100,000s of files is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not 100,000. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and the write helper(s) consists of a two-std::vector double buffer per helper. (One buffer owned by the write thread and one by the read thread.) The read thread fills the buffer until a specified limit, then blocks. The write thread times the write speed with gettimeofday or whatever and adjusts the limit: if writing went faster than last time, increase the limit by X%; if it went slower, decrease it by X%. X can be small.
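A minimal sketch of that scheme (write_files is a hypothetical stand-in for the actual batch-writing code, and 10% stands in for X%; a real version would also need a shutdown path):

    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <string>
    #include <vector>

    void write_files(const std::vector<std::string>& batch);  // hypothetical

    // One read thread calls push(); one helper thread runs writer_loop().
    struct DoubleBuffer {
        std::vector<std::string> front, back;   // read side / write side
        std::size_t limit = 1000;               // adaptive batch size
        std::mutex m;
        std::condition_variable cv;
        bool full = false;

        void push(std::string item) {           // called by the read thread
            std::unique_lock<std::mutex> lk(m);
            front.push_back(std::move(item));
            if (front.size() >= limit) {
                full = true;
                cv.notify_one();
                cv.wait(lk, [&] { return !full; });  // block until the swap
            }
        }

        void writer_loop() {                    // the helper thread (runs forever here)
            double last_rate = 0.0;
            for (;;) {
                {
                    std::unique_lock<std::mutex> lk(m);
                    cv.wait(lk, [&] { return full; });
                    front.swap(back);
                    full = false;
                    cv.notify_one();            // unblock the read thread
                }
                auto t0 = std::chrono::steady_clock::now();
                write_files(back);              // do the slow part outside the lock
                double secs = std::chrono::duration<double>(
                    std::chrono::steady_clock::now() - t0).count();
                double rate = back.size() / secs;
                back.clear();
                std::lock_guard<std::mutex> lk(m);
                // grow the limit if throughput improved, shrink it otherwise (X = 10%)
                limit = (rate >= last_rate) ? limit + limit / 10
                                            : limit - limit / 10;
                last_rate = rate;
            }
        }
    };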