is there any faster way to parse than by walk each byte? - c++

is there any faster way to parse a text than by walk each byte of the text?
I wonder if there is any special CPU (x86/x64) instruction for string operation that is used by string library, that somehow used to optimize the parsing routine.
for example instruction like finding a token in a string that could be run by hardware instead of looping each byte until a token is found.
*edited->note: I'am asking more to algorithm instead of CPU architecture, so my really question is, is there any special algorithm or specific technique that could optimize the string manipulation routine given the current cpu architecture.

The x86 had a few string instructions, but they fell out of favor on modern processors, because they became slower than more primitive instructions which do the same thing.
The processor world is moving more and more towards RISC, ie, simplistic instruction sets.
Quote from Wikipedia (emphasis mine):
The first highly (or tightly) pipelined x86 implementations, the 486 designs from Intel, AMD, Cyrix, and IBM, supported every instruction that their predecessors did, but achieved maximum efficiency only on a fairly simple x86 subset that resembled only a little more than a typical RISC instruction set (i.e. without typical RISC load-store limitations).
This is still true on today's x86 processors.
You could get marginally better performance processing four bytes at a time, assuming each "token" in the text was four-byte-aligned. Obviously this isn't true for most text... so better to stick with byte-by-byte scanning.

Yes there are special CPU instructions; and the run-time library, which implements functions like strchr, might be written in assembly.
One technique that can be faster than walking bytes is to walk double-words, i.e. to process data 32 bits at a time.
The problem with walking bigger-than-the-smallest-addressable-memory-unit chunks in the context of strings is one of alignment
You add code at the begining and end of your function (before and after your loop), to handle the uneven/unaligned byte[s]. (shrug) It makes your code faster: not simpler.
The following for example is some source code which claims to be an improved version of strchr. It is using special CPU instructions, but it's not simple (and has extra code for the unaligned bytes):
PATCH: Optimize strchr/strrchr with SSE4.2 -- "This patch adds SSE4.2 optimized strchr/strrchr. It can speed up strchr/strrchr by up to 2X on Intel Core i7"

While (some) processors do have string instructions, they're of little use in producing faster code. First of all, as #zildjohn01 noted, they're often slower than other instructions with current processors. More importantly, it rarely makes much difference anyway -- if you're scanning much text at all, the bottleneck will usually be the bandwidth from the memory to the CPU, so essentially nothing you do with changing instructions is likely to make a significant difference in any case.
That said, especially if the token you're looking for is long, a better algorithm may be useful. A Boyer-Moore search (or variant) can avoid looking at some of the text, which can give a substantial improvement.

Well, you have to look at everything to know everything about the text, at some level. Arguably, you could have some sort of structured text, which gives you further information about where to walk at each point in a sort of n-ary space partition. But then, the "parsing" has partially been done back when the file was created. Starting from 0 information, you will need to touch every byte to know everything.

Yes. As your edit specifically asks for algorithms, I'd like to add an example fo that.
You're going to know the language that you are parsing, and you can use that knowledge when building a parser. Take for instance a language in which every token must be at least two characters, but any length of whitespace can occur between tokens. Now when you're scanning through whitespace for the next token, you can skip every second character. Only when you hit the first non-whitespace do you need to back up one character.
Example:
01234567890123456789
FOO BAR
When scanning for the next token, you'd probe 4,6,8,10 and 12. Only when you see the A do you look back to 11 to find B. You never looked at 3,5,7 and 9.
This is a special case of the Boyer–Moore string search algorithm

Even though this question is long gone, I have another answer for you.
The fastest way to parse is to not parse at all. Huh?!
By that I mean, statistically, most source code text, and most code and data generated from that source code, do not change from compile to compile.
Therefore you can use a technique called incremental compilation. The first time you compile source, you divide it into coarse grained chunks, for example, at the boundaries of global declarations and function bodies. You have to persist the chunks' source, or its signatures or checksums, plus the information about the boundaries, what they compiled to, etc.
The next time you have to recompile the same source, given the same environment, you can gallop over the source code looking for changes. One easy way is to compare (longword at a time) the current code to the persisted snapshot of the code you saved from last time. As long as the longwords match, you skip through the source; instead of parsing and compiling, you reuse the snapshotted compilation results for that section from last time.
As seen in "C#'88", QuickC 2.0, and VC++ 4.0 Incremental Recompilation.
Happy hacking!

is there any faster way to parse a text than by walk each byte of the text?
Yes, sometimes you can skip over data. The Boyer-Moore string search algorithm is based on this idea.
Of course, if the result of the parse operation somehow needs to contain all the information in the text, then there is no way around the fact that you have to read everything. If you want to avoid CPU load, you could build hardware which processes data with direct memory access I guess.

Related

Can I use SIMD for speeding up string manipulation?

Are SIMD instructions built for vector numerical calculations only? Or does it lend itself well to a class of string manipulation tasks like, writing rows of data to a text file where the order of the rows does not matter? If so which APIs or libraries should I start with?
Yes! And this is actually done in high performance parsing libraries. One example: simdjson- a parser than can parse gigabytes of JSON per second. There's an About simdjson section in the readme, which has a link to a talk that goes over some of the implementation details.
SIMD instructions operate on numeric values, but once you're at that level, "text" is just numeric values, e.g. UTF-8 codepoints are just unsigned 8-bit integers, with plenty of SIMD support. Processing bitmaps is full of operations on multiple 8-bit unsigned integers in parallel, and it just so conveniently happens that this is so common that SIMD instruction sets cover these operations, and plenty of them are thus also usable for text processing.
I/O is so many orders of magnitude slower than the CPU
Not really. It is slower, but when the CPU has to do tasks that kill the streaming performance, such as branch mispredictions, cache misses, or wasting lots of speculative execution resources on dead-ends, the CPU can very easily not keep up with I/O. Modern network cards used for fast storage access or multi-machine communications can saturate the CPU's memory ports. All of them. And keep them that way. But that's state of the art and quite expensive at the moment (bonded 50 GBit links and such). Sequential, byte-at-a-time parser code is way slower than that.
Yes, especially for ASCII e.g. Convert a String In C++ To Upper Case. Or checking for valid UTF-8 (https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/), or checking if a string happens to be the ASCII subset of UTF-8. (If so, you know you have fixed-width characters which is very useful for other things.)
As Daniel Lemire reported, an early attempt at UTF-8 validation gave "a few CPU cycles per character." But with SIMD, he and collaborators were able to achieve ~1 instruction per byte, for net speeds of ~12GB/s. (vs. DRAM bandwidth of a Haswell desktop being ~25GB/s, or Skylake at 34GB/s with DDR4-2133).
Of course, most C libraries already have hand-written asm implementations of functions like strlen, strcpy, strcasecmp, strstr, etc. that use SIMD if it's a win (like on x86-64 where pmovmskb allows relatively efficient compare/branch on any/all SIMD compare results being true or false.) The first part of my answer on Why does glibc's strlen need to be so complicated to run quickly? has some links to hand-optimized asm that glibc actually uses on mainstream platforms, instead of the portable plain C fallback the question is asking about.
https://github.com/WojciechMula/sse4-strstr has a variety of strstr implementations. Substring searching is a much harder problem, with non-trivial algorithm choices as well as just brute-force. The SSE4.2 "string" instructions may help for that, but if not then SIMD vector compares definitely can for better brute-force building blocks.
(SSE4.2 "string" instructions like pcmpistri are definitely worse for memcmp / strcmp and strlen where plain SSE2 (or AVX2) is better. See How much faster are SSE4.2 string instructions than SSE2 for memcmp? and https://www.strchr.com/strcmp_and_strlen_using_sse_4.2)
You can even do cool tricks with looking up a shuffle control vector based on a vector compare bitmap, e.g. Fastest way to get IPv4 address from string or How to implement atoi using SIMD?.
Although I'm not sure the SIMD atoi is a win vs. scalar, especially for short numbers.
Naively I would say SIMD would not help since for long strings memory bandwidth would be the bottleneck. Why is this not the case?
DRAM bandwidth is really pretty good compared to modern CPU speeds, especially when the data comes in byte chunks, not 8-byte double chunks. And data is often hot in L3 cache after copying (e.g. from a read system call).
Even if data has to come from DRAM, modern desktop / laptop CPUs can load about 8 bytes per core clock cycle, within a factor of 2 of that anyway, especially if this core isn't competing with other bandwidth-intensive code on other cores. Good luck keeping up with that with byte-at-a-time scalar loops.
Besides, if you just did a read() system call to get the kernel to memcpy some data from a network buffer or pagecache into your process's memory, the data might still be hot in L3 cache, or even L2. Xeon CPUs can even DMA into L3 cache, or something like that. Aiming for memory bandwidth is a pretty low / unambitious goal, and a poor excuse for not fully optimizing a function if it actually gets use a lot.
Fewer instructions to process the same data lets out-of-order exec "see" farther ahead, and start demand-loads for later pages / cache lines earlier in cases where HW prefetch wouldn't (e.g. across page boundaries). And also better overlap the string processing with earlier / later independent work.
It can also be more hyperthreading-friendly, leaving the HT sibling core with better throughput if anything's running on it. (Maybe nothing if there aren't a lot of threads active). Also, if SIMD is efficient enough, it may save energy: tracking instructions through the pipeline is a large part of the cost, not the integer execution units themselves. Higher power while running, but finishing sooner, is good: race to sleep. CPUs save much more power when fully idle than when just running "cheap" instructions.
SIMD instructions are used on a very low level. Writing data to a text file is a much higher level, involving buffered I/O etc.
You might use SIMD, e.g., to convert a string from lower case to upper case. Wrapping SIMD into a library would be moot. You write the instructions yourself. Which also means that they are processor-specific (e.g. SSE variants on x86/AMD64).
For processing several rows of text in parallel, you might use micro-parallelization instead, e.g. offered by OpenMP or TBB.
However, if you stick to the example of writing to a the text file, we get to another territory of performance optimizations (I/O instead of computation).

Finding which code segment is faster than the other

Say that we have two C++ code segments, for doing the same task. How can we determine which code will run faster?
As an example lets say there is this global array "some_struct_type numbers[]". Inside a function, I can read a location of this array in two ways(I do not want to alter the content of the array)
some_struct_type val = numbers[i];
some_struct_type* val = &numbers[i]
I assume the second one is faster. but I can't measure the time to make sure because it will be a negligible difference.
So in this type of a situation, how do I figure out which code segment runs faster? Is there a way to compile a single line of code or set of lines and view
how many lines of assembly instructions are there?
I would appreciate your thoughts on this matter.
The basics are to run the piece of code so many times that it takes a few seconds at least to complete, and measure the time.
But it's hard, very hard, to get any meaningful figures this way, for many reasons:
Todays compilers are very good at optimizing code, but the optimizations depend on the context. It often does not make sense to look at a single line and try to optimize it. When the same line appears in a different context, the optimizations applied may be different.
Short pieces of code can be much faster than the surrounding looping code.
Not only the compiler makes optimizations, the processor has a cache, an instruction pipeline, and tries to predict branching code. A value which has been read before will be read much faster the next time, for example.
...
Because of this, it's usually better to leave the code in its place in your program, and use a profiling tool to see which parts of your code use the most processing resources. Then, you can change these parts and profile again.
While writing new code, prefer readable code to seemingly optimal code. Choose the right algorithm, this also depends on your input sizes. For example, insertion sort can be faster than quicksort, if the input is very small. But don't write your own sorting code, if your input is not special, use the libraries available in general. And don't optimize prematurely.
Eugene Sh. is correct that these two lines aren't doing the same thing - the first one copies the value of numbers[i] into a local variable, whereas the second one stores the address of numbers[i] into a pointer local variable. If you can do what you need using just the address of numbers[i] and referring back to numbers[i], it's likely that will be faster than doing a wholesale copy of the value, although it depends on a lot of factors like the size of the struct, etc.
Regarding the general optimization question, here are some things to consider...
Use a Profiler
The best way to measure the speed of your code is to use a profiling tool. There are a number of different tools available, depending on your target platform - see (for example) How can I profile C++ code running in Linux? and What's the best free C++ profiler for Windows?.
You really want to use a profiler for this because it's notoriously difficult to tell just from looking what the costliest parts of a program will be, for a number of reasons...
# of Instructions != # of Processor Cycles
One reason to use a profiler is that it's often difficult to tell from looking at two pieces of code which one will run faster. Even in assembly code, you can't simply count the number of instructions, because many instructions take multiple processor cycles to complete. This varies considerably by target platform. For example, on some platforms the fastest way to load the value 1 to a CPU register is something straightforward like this:
MOV r0, #1
Whereas on other platforms the fastest approach is actually to clear the register and then increment it, like this:
CLR r0
INC r0
The second case has more instruction lines, but that doesn't necessarily mean that it's slower.
Other Complications
Another reason that it's difficult to tell which pieces of code will most need optimizing is that most modern computers employ fairly sophisticated caches that can dramatically improve performance. Executing a cached loop several times is often less expensive than loading a single piece of data from a location that isn't cached. It can be very difficult to predict exactly what will cause a cache miss, but when using a profiler you don't have to predict - it makes the measurements for you.
Avoid Premature Optimization
For most projects, optimizing your code is best left until relatively late in the process. If you start optimizing too early, you may find that you spend a lot of time optimizing a feature that turns out to be relatively inexpensive compared to your program's other features. That said, there are some notable counterexamples - if you're building a large-scale database tool you might reasonably expect that performance is going to be an important selling point.

C++ techniques for reducing CPU instruction sizes?

Each CPU instruction consumes a number of bytes. The smaller the size, the most instructions which can be held in the CPU cache.
What techniques are available when writing C++ code which allow you to reduce CPU instruction sizes?
One example could be reducing the number of FAR jumps (literally, jumps to code across larger addresses). Because the offset is a smaller number, the type used is smaller and the overall instruction is smaller.
I thought GCC's __builtin_expect may reduce jump instruction sizes by putting unlikely instructions further away.
I think I have seen somewhere that its better to use an int32_t rather than int16_t due to being the native CPU integer size and therefore more efficient CPU instructions.
Or is something which can only be done whilst writing assembly?
Now that we've all fought over micro/macro optimization, let's try to help with the actual question.
I don't have a full, definitive answer, but you might be able to start here. GCC has some macro hooks for describing performance characteristics of the target hardware. You could theoretically set up a few key macros to help gcc favor "smaller" instructions while optimizing.
Based on very limited information from this question and its one reply, you might be able to get some gain from the TARGET_RTX_COSTS costs hook. I haven't yet done enough follow up research to verify this.
I would guess that hooking into the compiler like this will be more useful than any specific C++ idioms.
Please let us know if you manage any performance gain. I'm curious.
If a processor has various length (multi-byte) instructions, the best you can do is to write your code to help the compiler make use of the smaller instruction sizes.
Get The Code Working Robustly & Correct first.
Debugging optimized code is more difficult than debugging code that is not optimized. The symbols used by the debugger line up with the source code better. During optimization, the compiler can eliminate code, which gets your code out-of-sync with the source listing.
Know Your Assembly Instructions
Not all processors have variable length instructions. Become familiar with your processors instruction set. Find out which instructions are small (one byte) versus multi-byte.
Write Code to Use Small Assembly Instructions
Help out your compiler and write your code to take advantage of the small length instructions.
Print out the assembly language code to verify that the compiler uses the small instructions.
Change your code if necessary to help out the compiler.
There is no guarantee that the compiler will use small instructions. The compiler emits instructions that it thinks will have the best performance according to the optimization settings.
Write Your Own Assembly Language Function
After generating the assembly language source code, you are now better equipped to replace the high level language with an assembly language version. You have the freedom to use small instructions.
Beware the Jabberwocky
Smaller instructions may not be the best solution in all cases. For example, the Intel Processors have block instructions (perform operations on blocks of data). These block instructions perform better than loops of small instructions. However, the block instructions take up more bytes than the smaller instructions.
The processor will fetch as many bytes as necessary, depending on the instruction, into its instruction cache. If you can write loops or code that fits into the cache, the instruction sizes become less of a concern.
Also, many processors will use large instructions to communicate with other processors, such as a floating point processor. Reduction of floating point math in your program may reduce the quanitity of these instructions.
Trim the Code Tree & Reduce the Branches
In general, branching slows down processing. Branches are the change of execution to a new location, such as loops and function calls. Processors love to data instructions, because they don't have to reload the instruction pipeline. Increasing the amount of data instructions and reducing the quantity of branches will improve performance, usually without regards to the instruction sizes.

C++ cache aware programming

is there a way in C++ to determine the CPU's cache size? i have an algorithm that processes a lot of data and i'd like to break this data down into chunks such that they fit into the cache. Is this possible?
Can you give me any other hints on programming with cache-size in mind (especially in regard to multithreaded/multicore data processing)?
Thanks!
According to "What every programmer should know about memory", by Ulrich Drepper you can do the following on Linux:
Once we have a formula for the memory
requirement we can compare it with the
cache size. As mentioned before, the
cache might be shared with multiple
other cores. Currently {There
definitely will sometime soon be a
better way!} the only way to get
correct information without hardcoding
knowledge is through the /sys
filesystem. In Table 5.2 we have seen
the what the kernel publishes about
the hardware. A program has to find
the directory:
/sys/devices/system/cpu/cpu*/cache
This is listed in Section 6: What Programmers Can Do.
He also describes a short test right under Figure 6.5 which can be used to determine L1D cache size if you can't get it from the OS.
There is one more thing I ran across in his paper: sysconf(_SC_LEVEL2_CACHE_SIZE) is a system call on Linux which is supposed to return the L2 cache size although it doesn't seem to be well documented.
C++ itself doesn't "care" about CPU caches, so there's no support for querying cache-sizes built into the language. If you are developing for Windows, then there's the GetLogicalProcessorInformation()-function, which can be used to query information about the CPU caches.
Preallocate a large array. Then access each element sequentially and record the time for each access. Ideally there will be a jump in access time when cache miss occurs. Then you can calculate your L1 Cache. It might not work but worth trying.
read the cpuid of the cpu (x86) and then determine the cache-size by a look-up-table. The table has to be filled with the cache sizes the manufacturer of the cpu publishes in its programming manuals.
Depending on what you're trying to do, you might also leave it to some library. Since you mention multicore processing, you might want to have a look at Intel Threading Building Blocks.
TBB includes cache aware memory allocators. More specifically, check cache_aligned_allocator (in the reference documentation, I couldn't find any direct link).
Interestingly enough, I wrote a program to do this awhile ago (in C though, but I'm sure it will be easy to incorporate in C++ code).
http://github.com/wowus/CacheLineDetection/blob/master/Cache%20Line%20Detection/cache.c
The get_cache_line function is the interesting one, which returns the location of right before the biggest spike in timing data of array accesses. It correctly guessed on my machine! If anything else, it can help you make your own.
It's based off of this article, which originally piqued my interest: http://igoro.com/archive/gallery-of-processor-cache-effects/
You can see this thread: http://software.intel.com/en-us/forums/topic/296674
The short answer is in this other thread:
On modern IA-32 hardware, the cache line size is 64. The value 128 is
a legacy of the Intel Netburst Microarchitecture (e.g. Intel Pentium
D) where 64-byte lines are paired into 128-byte sectors. When a line
in a sector is fetched, the hardware automatically fetches the other
line in the sector too. So from a false sharing perspective, the
effective line size is 128 bytes on the Netburst processors. (http://software.intel.com/en-us/forums/topic/292721)
IIRC, GCC has a __builtin_prefetch hint.
http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html
has an excellent section on this. Basically, it suggests:
__builtin_prefetch (&array[i + LookAhead], rw, locality);
where rw is a 0 (prepare for read) or 1 (prepare for a write) value, and locality uses the number 0-3, where zero is no locality, and 3 is very strong locality.
Both are optional. LookAhead would be the number of elements to look ahead to. If memory access were 100 cycles, and the unrolled loops are two cycles apart, LookAhead could be set to 50 or 51.
There are two cases that need to be distinguished. Do you need to know the cache sizes at compile time or at runtime?
Determining the cache-size at compile-time
For some applications, you know the exact architecture that your code will run on, for example, if you can compile the code directly on the host machine. In that case, simplify looking up the size and hard-coding it is an option (could be automated in the build system). On most machines today, the L1 cache line should be 64 bytes.
If you want to avoid that complexity or if you need to support compilation on unknown architectures, you can use the C++17 feature std::hardware_constructive_interference_size as a good fallback. It will provide a compile-time estimation for the cache line, but be aware of its limitations. Note that the compiler cannot guess perfectly when it creates the binary, as the size of the cache-line is, in general, architecture dependent.
Determining the cache-size at runtime
At runtime, you have the advantage that you know the exact machine, but you will need platform specific code to read the information from the OS. A good starting point is the code snippet from this answer, which supports the major platforms (Windows, Linux, MacOS). In a similar fashion, you can also read the L2 cache size at runtime.
I would advise against trying to guess the cache line by running benchmarks at startup and measuring which one performed best. It might well work, but it is also error-prone if the CPU is used by other processes.
Combining both approaches
If you have to ship one binary and the machines that it will later run on features a range of different architectures with varying cache sizes, you could create specialized code parts for each cache size, and then dynamically (at application startup) choose the best fitting one.
The cache will usually do the right thing. The only real worry for normal programmer is false sharing, and you can't take care of that at runtime because it requires compiler directives.

Performance of C++ Operators

Is there any sort of performance difference between the arithmetic operators in c++, or do they all run equally fast? E.g. is "++" faster than "+=1"? What about "+=10000"? Does it make a significant difference if the numbers are floats instead of integers? Does "*" take appreciably longer than "+"?
I tried performing 1 billion each of "++", "+=1", and "+=10000". The strange thing is that the number of clock cycles (according to time.h) is actually counterintuitive. One might expect that if any of them are the fastest, it is "++", followed by "+=1", then "+=10000", but the data shows a slight trend in the opposite direction. The difference is more pronounced on 10 billion operations. This is all for integers.
I am dabbling in scientific computing, so I wanted to test the performance of operators. If any of the operators operated in time that was linear in terms of the inputs, for example.
About your edit, the language says nothing about the architecture it's running on. Your question is platform dependent.
That said, typically all fundamental data-type operations have a one-to-one correspondence to assembly.
x86 for example has an instruction which increments a value by 1, which i++ or i += 1 would translate into. Addition and multiplication also have single instructions.
Hardware-wise, it's fairly obvious that adding or multiplying numbers is at least linear in the number of bits in the numbers. Because the hardware has a constant number of bits, it's O(1).
Floats have their own processing unit, usually, which also has single instructions for operations.
Does it matter?
Why not write the code that does what you need it to do. If you want to add one, use ++. If you want to add a large number, add a large number. If you need floats, use floats. If you need to multiply two numbers, then multiply them.
The compiler will figure out the best way to do what you want, so instead of trying to be tricky, do what you need and let it do the hard work.
After you've written your working code, and you decide it's too slow, profile it and find out why. You'll find it's not silly things like multiplying versus adding, but rather going about the entire (sub-)problem in the wrong way.
Practically, all of the operators you listed will be done in a single CPU instruction anyway, on desktop platforms.
No, no, yes*, yes*, respectively.
* but do you really care?
EDIT: to give some kind of idea with a modern processor, you may be able to do 200 integer additions in the time it takes to make one memory access, and only 50 integer multiplications. If you think about it, you're still going to be bound by the memory accesses most of the time.
What you are asking is: What basic operations get transformed into which assembly instructions and what is the performance of those instructions on my specific architecture. And this is also your answer: The code they get translated to is dependant on your compiler and it's knowledge of your architecture, their performance depends on your architecture.
Mind you: in C++ operators can be overloaded for user defined types. They can behave differently from built-in types and the implementation of the overload can be non-trivial (no just one instruction).
Edit: A hint for testing. Most compilers support outputting the generated assembly code. The option for gcc is -S. If you use some other compiler have a look at their documentation.
The best answer is to time it with your compiler.
Look up the optimization manuals for your CPU. That's the only place you're going to find answers.
Get your compiler to output the generated assembly. Download the manuals for your CPU. Look up the instructions used by the compiler in the manual, and you know how they perform.
Of course, this presumes that you already know the basics of how a pipelined, superscalar out-of-order CPU operates, what branch prediction, instruction and data cache and everything else means. Do your homework.
Performance is a ridiculously complicated subject. Depending on context, floating-point code may be as fast as (or faster than) integer code, or it may be four times slower. Usually branches carry almost no penalty, but in special cases, they can be crippling. Sometimes, recomputing data is more efficient than caching it, and sometimes not.
Understand your programming language. Understand your compiler. Understand your CPU. And then examine exactly what the compiler is doing in your case, by profiling/timing, and on when necessary by examining the individual instructions. (and when timing your code, be aware of all the caveats and gotchas that can invalidate your benchmarks: Make sure optimizations are enabled, but also that the code you're trying to measure isn't optimized away. Take the cache into account (if the data is already in the CPU cache, it'll run much faster. If it has to read from physical memory to begin with, it'll take extra time. Both can invalidate your measurements if you're not careful. Keep in mind what you want to measure exactly)
For your specific examples, why should ++i be faster than i += 1? They do the exact same thing? Sometimes, it may make a difference whether you're adding a constant or a variable, but in this case, you're adding the constant one in both cases.
And in general, instructions take a fixed constant time regardless of their operands. adding one to something takes just as long as adding -2000 or 1772051912. The same goes for multiplication or division.
But if you care about performance, you need to understand how the entire technology stack works, not just rely on a few simple rules of thumb like "integer is faster than floating point, and ++ is faster than +=" (Apart from anything else, such simple rules of thumb are almost never true, at least not in every case)
Here is a twist on your evaluations: try Loop Unrolling. Loop unrolling is where you repeat the same statements in a loop to reduce the number of iterations in the loop.
Most modern processors hate branch instructions. The processors have a queue of pre-fetched instructions, which speeds up processing. They really hate branch instructions, because the processor has to clear out the queue and reload it after a branch. This takes more time than just processing sequential instructions.
When coding for processing time, try to minimize the number of branches, which can occur in loop constructs and decision constructs.
Depends on architecture, the built in operators for integer arithmetic translate directly to assembly (as I understand it) ++, +=1, and += 10000 are probably equally fast, multiplication would depend on the platform, overloaded operators would depend on you
Donald Knuth : "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
unless you are writing extremely time critical software, you should probably worry about other things
Short answer: you should turn optimizations on before measuring.
The long answer: If you turned optimizations on, you're performing the operations on integers, and still you get different times for ++i; and i += 1;, then it's probably time to get a better compiler -- the two statements have exactly the same semantics and a competent compiler should translate them into the same instruction sequence.
"Does it make a significant difference if the numbers are floats instead of integers?"
-It depends on what kind of processor you are running on. Integer operations are faster on current x86 compatible CPUs.
About i++ and i+=1: there shouldn't be a difference with any good compiler, while you may expect i+=10000 to be slightly slower on x86 CPUs.
"Does "*" take appreciably longer than "+"?"
-Typically yes.
Note that you may run into all sorts of bottlenecks, in which case the speed difference between the operations doesn't show up. Eg. memory bandwidth, CPU pipeline stall due to data dependencies, etc...
The performance problems caused by C++ operators do not come from the operators and not from the operators implementation. It comes from the syntax, from hidden code being run without you knowing.
The best example, is implementing quick sort, on an object which has the operator[] implemented, but internally it's using a linked list. Now instead of O(nlogn) [1] you will get O(n^2logn).
The problem with performance is that you cannot know exactly what your code will eventually be.
[1] I know that quick sort is actually O(n^2), but it rarely gets to it, the average distribution will give you O(nlogn).