Text iteration, Assembly versus C++

Text iteration, Assembly versus C++ - c++

I am making a program in which is is frequently reading chunks of text received from the web looking for specific characters and parsing the data accordingly. I am becoming fairly skilled with C++, and have made it work well, however, is Assembly going to be faster than a
for(size_t len = 0;len != tstring.length();len++) {
if(tstring[len] == ',')
stuff();
}
Would an inline-assembly routine using cmp and jz/jnz be faster? I don't want to waste my time working with asm for the fact being able to say I used it, but for true speed purposes.
Thank you,

No way. Your loop is so simple, the cost of the optimizer losing the ability to reason about your code is going to be way higher than any performance you could gain. This isn't SSE intrinsics or a bootloader, it's a trivial loop.

An inline assembly routine using "plain old" jz/jnz is unlikely to be faster than what you have; that said, you have a few inefficiencies in your code:
you're retrieving tstring.length() once per loop iteration; that's unnecessary.
you're using random indexing, tstring[len] which might be a more-expensive operation than using a forward iterator.
you're calling stuff() during the loop; depending on what exactly that does, it might be faster to just let the loop build a list of locations within the string first (so that the scanned string as well as the scanning code stays cache-hot and is not evicted by whatever stuff() does), and only afterwards iterate over those results.
There's already a likely low-level optimized standard library function available,strchr(), for exactly that kind of scanning. The C++ STL std::string::find() is also likely to have been optimized for the purpose (and/or might use strchr() in the char specialization).
In particular, strchr() has SSE2 (using pcmpeqb, maskmov... and bsf) or SSE4.2 (using the string op pcmpistri) implementations; for examples/actual SSE code doing this, check e.g. strchr() in GNU libc (as used on Linux). See also the references and comments here (suitably named website ...).
My advice: Check your library implementation / documentation, and/or the actual generated assembly code for your program. You might well be using fast code already ... or would be if you'd switch from your hand-grown character-by-character simple search to just using std::string::find() or strchr().
If this is ultra-speed-critical, then inlining assembly code for strchr() as used by known/tested implementations (watch licensing) would eliminate function calls and gain a few cycles. Depends on your requirements ... code, benchmark, vary, benchmark again, ...

Checking characters one by one is not the fastest thing to do. Maybe you should try something like this and find out if it's faster.
string s("xxx,xxxxx,x,xxxx");
string::size_type pos = s.find(',');
while(pos != string::npos){
do_stuff(pos);
pos = s.find(',', pos+1);
}
Each iteration of the loop will give you the next position of a ',' character so the program will need only few loops to finish the job.

Would an inline-assembly routine using cmp and jz/jnz be faster?
Maybe, maybe not. It depends upon what stuff() does, what the type and scope of tstring is, and what your assembly looks like.
First, measure the speed of the maintainable C++ code. Only if this loop dominates your program's speed should you consider rewriting it.
If you choose to rewrite it, keep both implementations available, and comparatively measure them. Only use the less maintainable version if it is faster, and if the speed increase matters. Also, since you have the original version in place, future readers will be able to understand your intent even if they don't know asm that well.

Related

How good is the Visual Studio compiler at branch-prediction for simple if-statements?

Here is some c++ pseudo-code as an example:
bool importantFlag = false;
for (SomeObject obj : arr) {
if (obj.someBool) {
importantFlag = true;
}
obj.doSomethingUnrelated();
}
Obviously, once the if-statement evaluates as true and runs the code inside, there is no reason to even perform the check again since the result will be the same either way. Is the compiler smart enough to recognize this or will it continue checking the if-statement with each loop iteration and possibly redundantly assign importantFlag to true again? This could potentially have a noticeable impact on performance if the number of loop iterations is large, and breaking out of the loop is not an option here.
I generally ignore these kinds of situations and just put my faith into the compiler, but it would be nice to know exactly how it handles these kinds of situations.

Branch-prediction is a run-time thing, done by the CPU not the compiler.
The relevant optimization here would be if-conversion to a very cheap branchless flag |= obj.someBool;.
Ahead-of-time C++ compilers make machine code for the CPU to run; they aren't interpreters. See also Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” and How to remove "noise" from GCC/clang assembly output? for more about looking at optimized compiler-generated asm.
I guess what you're suggesting could be making a 2nd version of the loop that doesn't look at the bool, and converting the if() into an if() goto to set the flag once and then run the other version of the loop from this point onward. That would likely not be worth it, since a single OR instruction is so cheap if other members of the same object are already getting accessed.
But it's a plausible optimization; however I don't think compilers would typically do it for you. You can of course do it manually, although you'd have to iterate manually instead of using a range-for, because you want to use the same iterator to start part-way through the range.
Branch likelihood estimation at compile time is a thing compilers do to figure out whether branchy or branchless code is appropriate, e.g. gcc optimization flag -O3 makes code slower than -O2 uses CMOV for a case that looks unpredictable, but when run on sorted data is actually very predictable. My answer there shows the asm real-world compilers make; note that they don't multi-version the loop, although that wouldn't be possible in that case if the compiler didn't know about the data being sorted.
Also to guess which side of a branch is more likely, so they can lay out the fast path with fewer taken branches. That's what the C++20 [[likely]] / [[unlikely]] hints are for, BTW, not actually influencing run-time branch prediction. Except on some CPUs, indirectly via static prediction the first time a CPU sees a branch. Or a few ISAs, like PowerPC and MIPS, have "branch-likely" instructions with actual run-time hints for the CPU which compilers might or might not actually use even if available. See
How do the likely/unlikely macros in the Linux kernel work and what is their benefit? - They influence branch layout, making the "likely" path a straight line (branches usually not-taken) for I-cache locality and contiguous fetch.
Is there a compiler hint for GCC to force branch prediction to always go a certain way?

If you expect to have a large data set you could just have two for loops, the first of them breaking when the importantFlag is set to true. It's hard to know specifically what optimizations the compiler will make since it's not well documented.

Peter Cordes has already given a great answer.
I'd also like to mention shortcircuiting
In this example
if( importantFlag || some_expensive_check() ) {
importantFlag = true;
}
Once important Flag is set to true, the expensive check will never be performed, since the || stops at the first true.

How much do C/C++ compilers optimize conditional statements?

I recently ran into a situation where I wrote the following code:
for(int i = 0; i < (size - 1); i++)
{
// do whatever
}
// Assume 'size' will be constant during the duration of the for loop
When looking at this code, it made me wonder how exactly the for loop condition is evaluated for each loop. Specifically, I'm curious as to whether or not the compiler would 'optimize away' any additional arithmetic that has to be done for each loop. In my case, would this code get compiled such that (size - 1) would have to be evaluated for every loop iteration? Or is the compiler smart enough to realize that the 'size' variable won't change, thus it could precalculate it for each loop iteration.
This then got me thinking about the general case where you have a conditional statement that may specify more operations than necessary.
As an example, how would the following two pieces of code compile:
if(6)
if(1+1+1+1+1+1)
int foo = 1;
if(foo + foo + foo + foo + foo + foo)
How smart is the compiler? Will the 3 cases listed above be converted into the same machine code?
And while I'm at, why not list another example. What does the compiler do if you are doing an operation within a conditional that won't have any effect on the end result? Example:
if(2*(val))
// Assume val is an int that can take on any value
In this example, the multiplication is completely unnecessary. While this case seems a lot stupider than my original case, the question still stands: will the compiler be able to remove this unnecessary multiplication?
Question:
How much optimization is involved with conditional statements?
Does it vary based on compiler?

Short answer: the compiler is exceptionally clever, and will generally optimise those cases that you have presented (including utterly ignoring irrelevant conditions).
One of the biggest hurdles language newcomers face in terms of truly understanding C++, is that there is not a one-to-one relationship between their code and what the computer executes. The entire purpose of the language is to create an abstraction. You are defining the program's semantics, but the computer has no responsibility to actually follow your C++ code line by line; indeed, if it did so, it would be abhorrently slow as compared to the speed we can expect from modern computers.
Generally speaking, unless you have a reason to micro-optimise (game developers come to mind), it is best to almost completely ignore this facet of programming, and trust your compiler. Write a program that takes the inputs you want, and gives the outputs you want, after performing the calculations you want… and let your compiler do the hard work of figuring out how the physical machine is going to make all that happen.
Are there exceptions? Certainly. Sometimes your requirements are so specific that you do know better than the compiler, and you end up optimising. You generally do this after profiling and determining what your bottlenecks are. And there's also no excuse to write deliberately silly code. After all, if you go out of your way to ask your program to copy a 50MB vector, then it's going to copy a 50MB vector.
But, assuming sensible code that means what it looks like, you really shouldn't spend too much time worrying about this. Because modern compilers are so good at optimising, that you'd be a fool to try to keep up.

The C++ language specification permits the compiler to make any optimization that results in no observable changes to the expected results.
If the compiler can determine that size is constant and will not change during execution, it can certainly make that particular optimization.
Alternatively, if the compiler can also determine that i is not used in the loop (and its value is not used afterwards), that it is used only as a counter, it might very well rewrite the loop to:
for(int i = 1; i < size; i++)
because that might produce smaller code. Even if this i is used in some fashion, the compiler can still make this change and then adjust all other usage of i so that the observable results are still the same.
To summarize: anything goes. The compiler may or may not make any optimization change as long as the observable results are the same.

Yes, there is a lot of optimization, and it is very complex.
It varies based on the compiler, and it also varies based on the compiler options
Check
https://meta.stackexchange.com/questions/25840/can-we-stop-recommending-the-dragon-book-please
for some book recomendations if you really want to understand what a compiler may do. It is a very complex subject.
You can also compile to assembly with the -S option (gcc / g++) to see what the compiler is really doing. Use -O3 / ... / -O0 / -O to experiment with different optimization levels.

Is there significant performance difference between break and return in a loop at the end of a C function?

Consider the following (Obj-)C(++) code segment as an example:
// don't blame me for the 2-space indents. It's insane to type 12 spaces.
int whatever(int *foo) {
for (int k = 0; k < bar; k++) { // I know it's a boring loop
do_something(k);
if (that(k))
break; // or return
do_more(k);
}
}
A friend told me that using break is not only more logical (and using return causes troubles when someone wants to add something to the function afterwards), but also yields faster code. It's said that the processor gives better predictions in this case for jmp-ly instructions than for ret.
Or course I agree with him on the first point, but if there is actually some significant difference, why doesn't the compiler optimize it?

If it's insane to type 2 spaces, use a decent text editor with auto-indent. 4 space indentation is much more readable than 2 spaces.
Readability should be a cardinal value when you write C code.
Using break or return should be chosen based on context to make your code easier to follow and understand. If not to others, you will be doing a favor to yourself, when a few years from now you will be reading your own code, hunting for a spurious bug and trying to make sense of it.
No matter which option you choose, the compiler will optimize your code its own way and different compilers, versions or configurations will do it differently. No noticeable difference should arise from this choice, and even in the unlikely chance that it would, not a lasting one.
Focus on the choice of algorithm, data structures, memory allocation strategies, possibly memory layout cache implications... These are far more important for speed and overall efficiency than local micro-optimizations.

Any compiler is capable of optimizing jumps to jumps. In practice, though, there will probably be some cleanup to do before exiting anyway. When in doubt, profile. I don’t see how this could make any significant difference.
Stylistically, and especially in C where the compiler does not clean stuff up for me when it goes out of scope, I prefer to have a single point of return, although I don’t go so far as to goto one.

Intel C++ Compiler understanding what optimization is performed

I have a code segment which is as simple as :
for( int i = 0; i < n; ++i)
{
if( data[i] > c && data[i] < r )
{
--data[i];
}
}
It's a part of a large function and project. This is actually a rewrite of a different loop, which proved to be time consuming (long loops), but I was surprised by two things :
When data[i] was temporary stored like this :
for( int i = 0; i < n; ++i)
{
const int tmp = data[i];
if( tmp > c && tmp < r )
{
--data[i];
}
}
It became more much slower. I don't claim this should be faster, but I can not understand why it should be so much slower, the compiler should be able to figure out if tmp should be used or not.
But more importantly when I moved the code segment into a separate function it became around four times slower. I wanted to understand what was going on, so I looked in the opt-report and in both cases the loop is vectorized and seem to do the same optimization.
So my question is what can make such a difference on a function which is not called a million times, but is time consuming in itself ? What to look for in the opt-report ?
I could avoid it by just keeping it inlined, but the why is bugging me.
UPDATE :
I should underline that my main concern is to understand, why it became slower, when moved to a separate function. The code example given with tmp variable, was just a strange example I encountered during the process.

You're probably register starved, and the compiler is having to load and store. I'm pretty sure that the native x86 assembly instructions can take memory addresses to operate on- i.e., the compiler can keep those registers free. But by making it local, you may changing the behaviour wrt. aliasing and the compiler may not be able to prove that the faster version has the same semantics, especially if there is some form of multiple threads in here, allowing it to change the code.
The function was slower when in a new segment likely because function calls not only can break the pipeline, but also create poor instruction cache performance (there's extra code for parameter push/pop/etc).
Lesson: Let the compiler do the optimizing, it's smarter than you. I don't mean that as an insult, it's smarter than me too. But really, especially the Intel compiler, those guys know what they're doing when targetting their own platform.
Edit: More importantly, you need to recognize that compilers are targetted at optimizing unoptimized code. They're not targetted at recognizing half-optimized code. Specifically, the compiler will have a set of triggers for each optimization, and if you happen to write your code in such a way as that they're not hit, you can avoid optimizations being performed even if the code is semantically identical.
And you also need to consider implementation cost. Not every function ideal for inlining can be inlined- just because inlining that logic is too complex for the compiler to handle. I know that VC++ will rarely inline with loops, even if the inlining yields benefit. You may be seeing this in the Intel compiler- that the compiler writers simply decided that it wasn't worth the time to implement.
I encountered this when dealing with loops in VC++- the compiler would produce different assembly for two loops in slightly different formats, even though they both achieved the same result. Of course, their Standard library used the ideal format. You may observe a speedup by using std::for_each and a function object.

You're right, the compiler should be able to identify that as unused code and remove it/not compile it. That doesn't mean it actually does identify it and remove it.
Your best bet is to look at the generated assembly and check to see exactly what is going on. Remember, just because a clever compiler could be able to figure out how to do an optimization, it doesn't mean it can.
If you do check, and see that the code is not removed, you might want to report that to the intel compiler team. It sounds like they might have a bug.

C++ string comparison in one clock cycle

Is it possible to compare whole memory regions in a single processor cycle? More precisely is it possible to compare two strings in one processor cycle using some sort of MMX assembler instruction? Or is strcmp-implementation already based on that optimization?
EDIT:
Or is it possible to instruct C++ compiler to remove string duplicates, so that strings can be compared simply by their memory location? Instead of memcmp(a,b) compared by a==b (assuming that a and b are both native const char* strings).

Just use the standard C strcmp() or C++ std::string::operator==() for your string comparisons.
The implementations of them are reasonably good and are probably compiled to a very highly optimized assembly that even talented assembly programmers would find challenging to match.
So don't sweat the small stuff. I'd suggest looking at optimizing other parts of your code.

You can use the Boost Flyweight library to intern your immutable strings. String equality/inequality tests then become very fast since all it has to do at that point is compare pointers (pun not intended).

Not really. Your typical 1-byte compare instruction takes 1 cycle.
Your best bet would be to use the MMX 64-bit compare instructions( see this page for an example). However, those operate on registers, which have to be loaded from memory. The memory loads will significantly damage your time, because you'll be going out to L1 cache at best, which adds some 10x time slowdown*. If you are doing some heavy string processing, you can probably get some nifty speedup there, but again, it's going to hurt.
Other people suggest pre-computing strings. Maybe that'll work for your particular app, maybe it won't. Do you have to compare strings? Can you compare numbers?
Your edit suggests comparing pointers. That's a dangerous situation unless you can specifically guarantee that you won't be doing substring compares(ie, you are comparing some two byte strings: [0x40, 0x50] with [0x40, 0x42]. Those are not "equal", but a pointer compare would say they are).
Have you looked at the gcc strcmp() source? I would suggest that doing that would be the ideal starting place.
* Loosely speaking, if a cycle takes 1 unit, a L1 hit takes 10 units, an L2 hit takes 100 units, and an actual RAM hit takes really long.

It's not possible to perform general-purpose string operations in one cycle, but there are many optimizations you can apply with extra information.
If your problem domain allows the use of an aligned, fixed-size buffer for strings that fits in a machine register, you can perform single-cycle comparisons (not counting the load instructions).
If you always keep track of the lengths of your strings, you can compare lengths and use memcmp, which is faster than strcmp. If your application is multi-cultural, keep in mind that this only works for ordinal string comparison.
It appears you are using C++. If you only need equality comparisons with immutable strings, you can use a string interning solution (copy/paste link since I'm a new user) to guarantee that equal strings are stored at the same memory location, at which point you can simply compare pointers. See en.wikipedia.org/wiki/String_interning
Also, take a look at the Intel Optimization Reference Manual, Chapter 10 for details on the SSE 4.2's instructions for text processing. www.intel.com/products/processor/manuals/
Edit: If your problem domain allows the use of an enumeration, that is your single-cycle comparison solution. Don't fight it.

If you're optimizing for string comparisons, you may want to employ a string table (then you only need to compare the indexes of the two strings, which can be done in a single machine instruction).
If that's not feasible, you can also create a hashed string object that contains the string and a hash. Then most of the time you only have to compare the hashes if the strings aren't equal. If the hashes do match you'll have to do a full comparison though to make sure it wasn't a false positive.

It depends on how much preprocessing you do. C# and Java both have a process called interning strings which makes every string map to the same address if they have the same contents. Assuming a process like that, you could do a string equality comparison with one compare instruction.
Ordering is a bit harder.
EDIT: Obviously this answer is sidestepping the actual issue of attempting to do a string comparison within a single cycle. But it's the only way to do it unless you happen to have a sequence of instructions that can look at an unbounded amount of memory in constant time to determine the equivalent of a strcmp. Which is improbable, because if you had such an architecture the person who sold it to you would say "Hey, here's this awesome instruction that can do a string compare in a single cycle! How awesome is that?" and you wouldn't need to post a question on stackoverflow.
But that's just my reasoned opinion.

Or is it possible to instruct c++
compiler to remove string duplicates,
so that strings can be compared simply
by their memory location?
No. The compiler may remove duplicates internally, but I know of no compiler that guarantees or provides facilities for accessing such an optimisation (except possibly to turn it off). Certainly the C++ standard has nothing to say in this area.

Assuming you mean x86 ... Here is the Intel documentation.
But off the top of my head, no, I don't think you can compare more than the size of a register at a time.
Out of curiosity, why do you ask? I'm the last to invoke Knuth prematurely, but ... strcmp usually does a pretty good job.
Edit: Link now points to the modern documentation.

You can certainly compare more than one byte in a cycle. If we take the example of x86-64, you can compare up to 64-bits (8 bytes) in a single instruction (cmps), this isn't necessarily one cycle but will normally be in the low single digits (the exact speed depends on the specific processor version).
However, this doesn't mean you'll be able to all the work of comparing two arrays in memory much faster than strcmp :-
There's more than just the compare - you need to compare the two values, check if they are the same, and if so move to next chunk.
Most strcmp implementations will already be highly optimised, including checking if a and b point to the same address, and any suitable instruction-level optimisations.
Unless you're seeing alot of time spent in strcmp, I wouldn't worry about it - have you got a specific problem / use case you are trying to improve?

Even if both strings were cached, it wouldn't be possible to compare (arbitrarily long) strings in a single processor cycle. The implementation of strcmp in a modern compiler environment should be pretty much optimized, so you shouldn't bother to optimize too much.
EDIT (in reply to your EDIT):
You can't instruct the compiler to unify ALL duplicate strings - most compilers can do something like this, but it's best-effort only (and I don't know any compiler where it works across compilation units).
You might get better performance by adding the strings to a map and comparing iterators after that... the comparison itself might be one cycle (or not much more) then
If the set of strings to use is fixed, use enumerations - that's what they're there for.

Here's one solution that uses enum-like values instead of strings. It supports enum-value-inheritance and thus supports comparison similar to substring comparison. It also uses special character "¤" for naming, to avoid name collisions. You can take any class, function, or variable name and make it into enum-value (SomeClassA will become ¤SomeClassA).
struct MultiEnum
{
vector<MultiEnum*> enumList;
MultiEnum()
{
enumList.push_back(this);
}
MultiEnum(MultiEnum& base)
{
enumList.assign(base.enumList.begin(),base.enumList.end());
enumList.push_back(this);
}
MultiEnum(const MultiEnum* base1,const MultiEnum* base2)
{
enumList.assign(base1->enumList.begin(),base1->enumList.end());
enumList.assign(base2->enumList.begin(),base2->enumList.end());
}
bool operator !=(const MultiEnum& other)
{
return find(enumList.begin(),enumList.end(),&other)==enumList.end();
}
bool operator ==(const MultiEnum& other)
{
return find(enumList.begin(),enumList.end(),&other)!=enumList.end();
}
bool operator &(const MultiEnum& other)
{
return find(enumList.begin(),enumList.end(),&other)!=enumList.end();
}
MultiEnum operator|(const MultiEnum& other)
{
return MultiEnum(this,&other);
}
MultiEnum operator+(const MultiEnum& other)
{
return MultiEnum(this,&other);
}
};
MultiEnum
¤someString,
¤someString1(¤someString), // link to "someString" because it is a substring of "someString1"
¤someString2(¤someString);
void Test()
{
MultiEnum a = ¤someString1|¤someString2;
MultiEnum b = ¤someString1;
if(a!=¤someString2){}
if(b==¤someString2){}
if(b&¤someString2){}
if(b&¤someString){} // will result in true, because someString is substring of someString1
}
PS. I had definitely too much free time on my hands this morning, but reinventing the wheel is just too much fun sometimes... :)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js