So I have megabytes of data stored as doubles that need to be sent over a network... now I don't need the precision that a double offers, so I want to convert these to a float before sending them over the network. What is the overhead of simply doing:
float myFloat = (float)myDouble;
I'll be doing this operation several million times every few seconds and don't want to slow anything down. Thanks
EDIT: My platform is x64 with Visual Studio 2008.
EDIT 2: I have no control over how they are stored.
As Michael Burr said, while the overhead strongly depends on your platform, the overhead is definitely less than the time needed to send them over the wire.
a rough estimate:
800MBit/s payload on a excellent Gigabit wire, 25M-floats/second.
On a 2GHz single core, that gives you a whopping 80 clock cycles for each value converted to break even - anythign less, and you will save time. That should be more than enough on all architectures :)
A simple load-store cycle (barring all caching delays) should be below 5 cycles per value. With instruction interleaving, SIMD extensions and/or parallelizing on multiple cores, you are likely to do multiple conversions in a single cycle.
Also, the receiver will be happy having to handle only half the data. Remember that memory access time is nonlinear.
The only thing arguing against the conversion would be is if the transfer should have minimal CPU load: a modern architecture could transfer the data from disk/memory to bus without CPU intervention. However, with above numbers I'd say that doesn't matter in practice.
[edit]
I checked some numbers, the 387 coprocessor would indeed have taken around 70 cycles for a load-store cycle. On the initial pentium, you are down to 3 cycles without any parallelization.
So, unless you run a gigabit network on a 386...
It's going to depend on your implementation of C++ libraries. Test it and see.
Even if it does take time this will not be the slow point in your application.
Your FPU can do the conversion a lot quicker than it can send network traffic (so the bottleneck here will more than likely be the write to the socket).
But as with al things like this measure it and see.
Personally I don't think any time spent here will affect the real time spent sending the data.
Assuming that you're talking about a significant number of packets to ship the data (a reasonable assumption if you're sending millions of values) casting the doubles to float will likely reduce the number of network packets by about half (assuming sizeof(double)==8 and sizeof(float)==4).
Almost certainly the savings in network traffic will dominate whatever time is spent performing the conversion. But as everyone says, measuring some tests will be the proof of the pudding.
Bearing in mind that most compilers deal with doubles a lot more efficiently than floats -- many promote float to double before performing operations on them -- I'd consider taking the block of data, ZIPping/compressing it, then sending the compressed block across. Depending on what your data looks like, you could get 60-90% compression, vs. the 50% you'd get converting 8-byte values to four bytes.
You don't have any choice but to measure them yourself and see. You could use timers to measure them. Looks like some has already implemented a neat C++ timer class
I think this cast is a lot cheaper than you think since it doesn't really involve any kind of calculation. In fact, it's just bitshifting to get rid of some of the digits of exponent and mantissa.
It will also depend on the CPU and what floating point support it has. In the bad old days (1980s), processors supported integer operations only. Floating point math had to be emulated in software. A separate chip for floating point (a coprocessor) could be bought separately.
Modern CPUs now have SIMD instructions, so large amounts of floating point data can be processed at once. These instructions include MMX, SSE, 3DNow! and the like. Your compiler may know how to make use of these instructions, but you may need to write your code in a particular way, and turn on the right options.
Finally, the fastest way to process floating point data is in a video card. A fairly new language called OpenCL lets you send tasks to the video card to be processed there.
It all depends on how much performance you need.
Related
Are SIMD instructions built for vector numerical calculations only? Or does it lend itself well to a class of string manipulation tasks like, writing rows of data to a text file where the order of the rows does not matter? If so which APIs or libraries should I start with?
Yes! And this is actually done in high performance parsing libraries. One example: simdjson- a parser than can parse gigabytes of JSON per second. There's an About simdjson section in the readme, which has a link to a talk that goes over some of the implementation details.
SIMD instructions operate on numeric values, but once you're at that level, "text" is just numeric values, e.g. UTF-8 codepoints are just unsigned 8-bit integers, with plenty of SIMD support. Processing bitmaps is full of operations on multiple 8-bit unsigned integers in parallel, and it just so conveniently happens that this is so common that SIMD instruction sets cover these operations, and plenty of them are thus also usable for text processing.
I/O is so many orders of magnitude slower than the CPU
Not really. It is slower, but when the CPU has to do tasks that kill the streaming performance, such as branch mispredictions, cache misses, or wasting lots of speculative execution resources on dead-ends, the CPU can very easily not keep up with I/O. Modern network cards used for fast storage access or multi-machine communications can saturate the CPU's memory ports. All of them. And keep them that way. But that's state of the art and quite expensive at the moment (bonded 50 GBit links and such). Sequential, byte-at-a-time parser code is way slower than that.
Yes, especially for ASCII e.g. Convert a String In C++ To Upper Case. Or checking for valid UTF-8 (https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/), or checking if a string happens to be the ASCII subset of UTF-8. (If so, you know you have fixed-width characters which is very useful for other things.)
As Daniel Lemire reported, an early attempt at UTF-8 validation gave "a few CPU cycles per character." But with SIMD, he and collaborators were able to achieve ~1 instruction per byte, for net speeds of ~12GB/s. (vs. DRAM bandwidth of a Haswell desktop being ~25GB/s, or Skylake at 34GB/s with DDR4-2133).
Of course, most C libraries already have hand-written asm implementations of functions like strlen, strcpy, strcasecmp, strstr, etc. that use SIMD if it's a win (like on x86-64 where pmovmskb allows relatively efficient compare/branch on any/all SIMD compare results being true or false.) The first part of my answer on Why does glibc's strlen need to be so complicated to run quickly? has some links to hand-optimized asm that glibc actually uses on mainstream platforms, instead of the portable plain C fallback the question is asking about.
https://github.com/WojciechMula/sse4-strstr has a variety of strstr implementations. Substring searching is a much harder problem, with non-trivial algorithm choices as well as just brute-force. The SSE4.2 "string" instructions may help for that, but if not then SIMD vector compares definitely can for better brute-force building blocks.
(SSE4.2 "string" instructions like pcmpistri are definitely worse for memcmp / strcmp and strlen where plain SSE2 (or AVX2) is better. See How much faster are SSE4.2 string instructions than SSE2 for memcmp? and https://www.strchr.com/strcmp_and_strlen_using_sse_4.2)
You can even do cool tricks with looking up a shuffle control vector based on a vector compare bitmap, e.g. Fastest way to get IPv4 address from string or How to implement atoi using SIMD?.
Although I'm not sure the SIMD atoi is a win vs. scalar, especially for short numbers.
Naively I would say SIMD would not help since for long strings memory bandwidth would be the bottleneck. Why is this not the case?
DRAM bandwidth is really pretty good compared to modern CPU speeds, especially when the data comes in byte chunks, not 8-byte double chunks. And data is often hot in L3 cache after copying (e.g. from a read system call).
Even if data has to come from DRAM, modern desktop / laptop CPUs can load about 8 bytes per core clock cycle, within a factor of 2 of that anyway, especially if this core isn't competing with other bandwidth-intensive code on other cores. Good luck keeping up with that with byte-at-a-time scalar loops.
Besides, if you just did a read() system call to get the kernel to memcpy some data from a network buffer or pagecache into your process's memory, the data might still be hot in L3 cache, or even L2. Xeon CPUs can even DMA into L3 cache, or something like that. Aiming for memory bandwidth is a pretty low / unambitious goal, and a poor excuse for not fully optimizing a function if it actually gets use a lot.
Fewer instructions to process the same data lets out-of-order exec "see" farther ahead, and start demand-loads for later pages / cache lines earlier in cases where HW prefetch wouldn't (e.g. across page boundaries). And also better overlap the string processing with earlier / later independent work.
It can also be more hyperthreading-friendly, leaving the HT sibling core with better throughput if anything's running on it. (Maybe nothing if there aren't a lot of threads active). Also, if SIMD is efficient enough, it may save energy: tracking instructions through the pipeline is a large part of the cost, not the integer execution units themselves. Higher power while running, but finishing sooner, is good: race to sleep. CPUs save much more power when fully idle than when just running "cheap" instructions.
SIMD instructions are used on a very low level. Writing data to a text file is a much higher level, involving buffered I/O etc.
You might use SIMD, e.g., to convert a string from lower case to upper case. Wrapping SIMD into a library would be moot. You write the instructions yourself. Which also means that they are processor-specific (e.g. SSE variants on x86/AMD64).
For processing several rows of text in parallel, you might use micro-parallelization instead, e.g. offered by OpenMP or TBB.
However, if you stick to the example of writing to a the text file, we get to another territory of performance optimizations (I/O instead of computation).
Under The most efficient types second here
...and when defining an object to store a floating point number, use the double type, ... The double type is two to three times less efficient than the float type...
Seems like it's contradicting itself?
And I read elsewhere (can't remember where) that computations involving ints are faster than shorts on many machines because they are converted to ints to perform the operations? Is this true? Any links on this?
One can always argue about the quality of the contents on the site you link to. But the two quotes you refer to:
...and when defining an object to store a floating point number, use the double type, ...
and
... The double type is two to three times less efficient than the float type...
Refer to two different things, the first hints that using doubles will give much less problems due to the increased precision, while the other talks about performance. But honestly I wouldn't pay too much attention to that, chance is that if your code performs suboptimal it is due to incorrect choice of algorithm rather than wrong choice of primitive data type.
Here is a quote about performance comparison of single and double precision floats from one of my old teachers: Agner Fog, who has a lot of interesting reads over at his website: http://www.agner.org about software optimizations, if you are really interested in micro optimizations go take a look at it:
In most cases, double precision calculations take no more time than single precision. When the floating point registers are used, there is simply no difference in speed between single and double precision. Long double precision takes only slightly more time. Single precision division, square root and mathematical functions are calculated faster than double precision when the XMM registers are used, while the speed of addition, subtraction, multiplication, etc. is still the same regardless of precision on most processors (when vector operations are not used).
source: http://agner.org/optimize/optimizing_cpp.pdf
While there might be different variations for different compilers, and different processors, the lesson one should learn from it, is that most likely you do not need to worry about optimizations at this level, look at choice of algorithm, even data container, not the primitive data type.
These optimizations are negligible unless you are writing software for space shuttle launches (which recently have not been doing too well). Correct code is far more important than fast code. If you require the precision, using doubles will barely affect the run time.
Things that affect execution time way more than type definitions:
Complexity - The more work there is to do, the more slowly the code will run. Reduce the amount of work needed, or break it up into smaller, faster tasks.
Repetition - Repetition can often be avoided and will inevitably ruin code performance. It comes in many guises-- for example, failing to cache the results of expensive calculations or of remote procedure calls. Every time you recompute, you waste efficiency. They also extend the executable size.
Bad Design - Self explanatory. Think before you code!
I/O - A program whose execution is blocked waiting for input or output (to and from the user, the disk, or a network connection) is bound to perform badly.
There are many more reasons, but these are the biggest. Personally, bad design is where I've seen most of it happen. State machines that could have been stateless, dynamic allocation where static would have been fine, etc. are the real problems.
Depending on the hardware, the actual CPU (or FPU if you like) performance of double is somewhere between half the speed and same speed on modern CPU's [for example add or subtract is probably same speed, multiply or divide may be different for larger type], when compared to float.
On top of that, there are "fewer per cache-line", so if when there is a large number of them, it gets slower still because memory speed is slower. Per cache-line, there are half as many double values -> about half the performance if the application is fully memory bound. It will be much less of a factor in a CPU-bound application.
Similarly, if you use SSE or similar SIMD technologies, the double will take up twice as much space, so the number of actual calculation with be half as many "per instruction", and typically, the CPU will allow the same number of instructions per cycle for both float and double - except for some operations that take longer for double. Again, leading to about half the performance.
So, yes, I think the page in the link is confusing and mixing up the ideal performance setup between double and float. That is, from a pure performance perspective. It is often much easier to get noticeable calculation errors when using float - which can be a pain to track down - so starting with double and switching to float if it's deemed necessary because you have identified it as a performance issue (either from experience or measurements).
And yes, there are several architectures where only one size integer exists - or only two sizes, such as 8-bit char and 32-bit int, and 16-bit short would be simulated by performing the 32-bit math, and then dropping the top part of the value. For example MIPS has only got 32-bit operations, but can store and load 16-bit values to memory. It doesn't necessarily make it slower, but it certainly means that it's "not faster".
I have a very complicated and sophisticated data fitting program which uses the Levenverg-Marquardt algorithm to do fitting in double precision (basically the fitting class is templatized, but I use instantiate it to doubles). The fitting process involves:
Calculating an error function (chi-square)
Solving a system of linear equations (I use lapack for that)
calculating the derivatives of a function with respect to the parameters, which I want to fit to the data (usually 20+ parameters)
calculating the function value continuously: the function is a complicated combination of a sinusoidal and exponential functions with a few harmonics.
A colleague of mine has suggested that I use integers for at least 10 times faster at least. My questions are:
Is that true that I will get that kind of improvement?
Is it safe to convert everything to integers? And what are the drawbacks to this?
What advice would you have for this whole issue? What would you do?
The program is developed to calculate some parameters from the signal online, which means that the program must be as fast as possible, but I'm wondering whether it's worth it to start the project of converting everything to integers.
The amount of improvement depends on your platform. For example, if your platform has a fast floating point coprocessor, performing arithmetic in floating point may be faster than integral arithmetic.
You may be able to get more performance gain by optimizing your algorithms rather than switching to integer arithmetic.
Another method for boosting performance is to reduce data cache hits and also reducing branches and loops.
I would measure performance of the program to find out where the bottlenecks are and then review the sections that where most of the performance takes place. For example, in my embedded system, micro-optimizations like what you are suggesting, saved 3 microseconds. This gain is not worth the effort to retest the entire system. If it works, don't fix it. Concentrate on correctness and robustness first.
The bottom line here is that you have to test it and decide for yourself. Profile a release build using real data.
1- Is that true that I will get that kind of improvement?
Maybe yes, maybe no. It depends on a number of factors, such as
How long it takes to convert from double to int
How big a word is on your machine
What platform/toolset you're using and what optimizations you have enabled
(Maybe) how big a cache line is on your platform
How fast your memory is
How fast your platform computes floating-point versus integer.
And who knows what else. In short, too many complex variables for anyone to be able to say for sure if you will or will not improve performance.
But I would be highly skeptical about your friend's claim, "at least 10 times faster at least."
2- Is it safe to convert everything to integers? And what are the
drawbacks to this?
It depends on what you're converting and how. Obviously converting a value like 123.456 to an integer is decidedly unsafe.
Drawbacks include loss of precision, loss of accuracy, and the expense in terms of space and time to actually do the conversions. Another significant drawback is the fact that you have to write a substantial amount of code, and every line of code you write is a probable source of new bugs.
3- What advice would you have for this whole issue? What would you do?
I would step back & take a deep breath. Profile your code under real-world conditions. Identify the sources of the bottlenecks. Find out what the real problems are, and if there even are any.
Identify inefficiencies in your algorithms, and fix them.
Throw hardware at the problem.
Then you can endeavor to start micro-optimizing. This would be my last resort, especially if the optimization technique you are considering would require writing a lot of code.
First, this reeks of attempting to optimize unnecessarily.
Second, doubles are a minimum of 64-bits. ints on most systems are 32-bits. So you have a couple of choices: truncate the double (which reduces your precision to a single), or store it in the space of 2 integers, or store it as an unsigned long long (which is at least 64-bits as well). For the first 2 options, you are facing a performance hit as you must convert the numbers back and forth between the doubles you are operating on and the integers you are storing it as. For the third option, you are not gaining any performance increase (in terms of memory usage) as they are basically the same size - so you'd just be converting them to integers for no reason.
So, to get to your questions:
1) Doubtful, but you can try it to see for yourself.
2) The problem isn't storage as the bits are just bits when they get into memory. The problem is the arithmetic. Since you stated you need double precision, attempting to do those operations on an integer type will not give you the results you are looking for.
3) Don't optimize until it has been proven something needs to have a performance improvement. And always remember Amdahl's Law: Make the common case fast and the rare case correct.
What I would do is:
First tune it in single-thread mode (by the random-pausing method) until you can't find any way to reduce cycles. The kinds of things I've found are:
a large fraction of time spent in library functions like sin, cos, exp, and log where the arguments were often unchanged, so the answers would be the same. The solution for that is called "memoizing", where you figure out a place to store old values of arguments and results, and check there first before calling the function.
In calling library functions like DGEMM (lapack matrix-multiply) that one would assume are optimized to the teeth, they are actually spending a large fraction of time calling a function to determine if the matrices are upper or lower triangle, square, symmetric, or whatever, rather than actually doing the multiplication. If so, the answer is obvious - write a special routine just for your situation.
Don't say "but I don't have those problems". Of course - you probably have different problems - but the process of finding them is the same.
Once you've made it as fast as possible in single-thread, then figure out how to parallelize it. Multi-threading can have high overhead, so it's best not to tightly-couple the threads.
Regarding your question about converting from doubles to integers, the other answers are right on the money. It only makes sense in very particular situations.
I'm wondering if taking care of memory sizes in C++ is a good or bad thing.
This question confused me ( Why does mode_t use 4 byte? ).
So it's not performant to just use a char if I don't need to store a bigger amount of data, because a modern CPU has to fill up the rest?
So thinking of performance and saving computing time the best would be to always use a size_t for every integer typed variable I need?
Does a CPU still need more instructions to deal with a short value than dealing with an size_t if I have a large array?
What about char arrays? Wouldn't they be supposed to be slower, too?
All in all: What's the best practice? I'd like to save as much memory as possible, because my server does not have a lot of memory. On the other hand I don't want to loose performance because of me thinking memory is more important.
Is there somewhere a great explaination about how all this works and what's faster under what circumstances?
There is no one answer to this question.
Reducing the size of the integer types that you use can increase locality and decrease the required memory bandwidth. So, that's a plus. (Note: the actual memory fetch does not cost less.)
Increasing the size of integer types that you use can decrease the number of conversions required. So, that's a plus.
So the questions are, how much memory do you save by choosing #1? How many conversions do you save by choosing #2?
The objective answer
In general, nothing less than whole-system profiling will tell you which is the better alternative. This is because answering questions about reduced memory pressure is incredibly difficult and system-specific. Reducing the memory usage of part of your program will typically increase the percentage of time your program spends in that part — and it may even increase the percentage of time that your program uses on the entire system — either due to the larger number of conversions necessary, or because the reduced memory pressure makes other parts of your system faster. Hence the need for whole system profiling.
This, unsurprisingly, is a real pain.
The subjective answer
However, my instinct tells me that it's almost never worth the effort to try and minimize memory usage of individual fields this way. How many copies of mode_t do you think your program will have in memory at a time? A handful, at most. So I have a rule of thumb for this:
If it goes in an array, then use the smallest type that has sufficient range. E.g., a string is char[] instead of int[].
If it goes anywhere else, use int or larger.
So my subjective answer is, spend your precious time elsewhere. Your time is valuable and you have better things to do than choose whether a field should be int or short.
This sounds like premature optimization. You are worried about running out of memory when it seems like it hasn't actually happened yet.
In general, accessing a small subsection of the native word size of your CPU generates more CODE. So the space you save putting data into only 8-bits is probably lost 50+ times over by the added CODE needed to only manipulate the specific 8-bits you care about. You could also end up in places where your "optimization" slows things down, too:
struct foo {
char a1, a2, a3;
short b1;
};
If the above structure is packed tightly, b1 crosses a 32bit boundary which on some architectures will throw exceptions and on other architectures will require two fetches to retrieve the data.
OR not. It depends on the CPU architecture, the computer's data architecture, the compiler, and your program's typical use patterns. I doubt there is a single "best practice" that is correct 99% of the time here.
If space is really important, tell the compiler to optimize for size rather than speed and see if that helps. But unless you are sharing the data across a slow binary pipe, you should not generally care how big it is as long as it is big enough to hold all valid values for your application.
tl;dr? Just use size_t until you can prove that reducing the size of that specific variable will significantly improve server performance.
Your answer is processor dependent: depends on the processor for the target platform. Read its data sheet to find out how it handles single 8-bit fetches.
The ARM7TDMI processor likes to fetch 32bit quantities. It is very efficient at that. It is labelled as an 8/32 processor and can handle 8-bit quantities as well.
The processor may be able to fetch 8-bit quantities directly depending on how it is wired up. Otherwise, it calculates the nearest 32-bit aligned address, reads 32 bits and discards the unused bits. This takes processing time.
So the trade-off is memory versus processing time:
Will compressing your application to use 8-bits significantly
increase processing time?
Does your development schedule gain any time by this task? (a.k.a. Return On Investment, ROI)
Do your clients complain about the size of the application?
Is your application correct and error free before worrying about
memory usage?
How do I measure FLOPS or IOPS? If I do measure time for ordinary floating point addition / multiplication , is it equivalent to FLOPS?
FLOPS is floating point operations per second. To measure FLOPS you first need code that performs such operations. If you have such code, what you can measure is its execution time. You also need to sum up or estimate (not measure!) all floating point operations and divide that over the measured wall time. You should count all ordinary operations like additions,subtractions,multiplications,divisions (yes, even though they are slower and better avoided, they are still FLOPs..). Be careful how you count! What you see in your source code is most likely not what the compiler produces after all the optimisations. To be sure you will likely have to look at the assembly..
FLOPS is not the same as Operations per second. So even though some architectures have a single MAD (multiply-and-add) instruction, those still count as two FLOPs. Similarly the SSE instructions. You count them as one instruction, though they perform more than one FLOP.
FLOPS are not entirely meaningless, but you need to be careful when comparing your FLOPS to sb. elses FLOPS, especially the hardware vendors. E.g. NVIDIA gives the peak FLOPS performance for their cards assuming MAD operations. So unless your code has those, you will not ever get this performance. Either rethink the algorithm, or modify the peak hardware FLOPS by a correct factor, which you need to figure out for your own algorithm! E.g., if your code only performs multiplication, you would divide it by 2. Counting right might get your code from suboptimal to quite efficient without changing a single line of code..
You can use the CPU performance counters to get the CPU to itself count the number of floating point operations it uses for your particular program. Then it is the simple matter of dividing this by the run time. On Linux the perf tools allow this to be done very easily, I have a writeup on the details of this on my blog here:
http://www.bnikolic.co.uk/blog/hpc-howto-measure-flops.html
FLOP's are not well defined. mul FLOPS are different than add FLOPS. You have to either come up with your own definition or take the definition from a well-known benchmark.
Usually you use some well-known benchmark. Things like MIPS and megaFLOPS don't mean much to start with, and if you don't restrict them to specific benchmarks, even that tiny bit of meaning is lost.
Typically, for example, integer speed will be quoted in "drystone MIPS" and floating point in "Linpack megaFLOPS". In these, "drystone" and "Linpack" are the names of the benchmarks used to do the measurements.
IOPS are I/O operations. They're much the same, though in this case, there's not quite as much agreement about which benchmark(s) to use (though SPC-1 seems fairly popular).
This is a highly architecture specific question, for a naive/basic/start start I would recommend to find out how many Operations 1 multiplication take's on your specific hardware then do a large matrix multiplication , and see how long it takes. Then you can eaisly estimate the FLOP of your particular hardware
the industry standard of measuring flops is the well known Linpack or HPL high performance linpack, try looking at the source or running those your self
I would also refer to this answer as an excellent reference