Which is more efficient: a switch statement or a std::map? - C++

I'm thinking about the tokenizer here.
Each token calls a different function inside the parser.
What is more efficient:
A map of std::functions/boost::functions
A switch case

I would suggest reading switch() vs. lookup table? from Joel on Software. Particularly, this response is interesting:
" Prime example of people wasting time
trying to optimize the least
significant thing."
Yes and no. In a VM, you typically
call tiny functions that each do very
little. It's not the call/return
that hurts you as much as the preamble
and clean-up routine for each function
often being a significant percentage
of the execution time. This has been
researched to death, especially by
people who've implemented threaded
interpreters.
In virtual machines, lookup tables storing computed addresses to call are usually preferred to switches (direct threading, or "labels as values", which jumps directly to the label address stored in the lookup table). That's because it can, under certain conditions, reduce branch mispredictions, which are extremely expensive on long-pipelined CPUs (a misprediction forces a pipeline flush). It does, however, make the code less portable.
This issue has been discussed extensively in the VM community; I suggest looking for scholarly papers in this field if you want to read more about it. Ertl & Gregg wrote a great article on this topic in 2001, The Behavior of Efficient Virtual Machine Interpreters on Modern Architectures.
But as mentioned, I'm pretty sure that these details are not relevant for your code. These are small details, and you should not focus too much on them. The Python interpreter uses switches, because its developers think it makes the code more readable. Why don't you pick the approach you're most comfortable with? The performance impact will be rather small, so you'd better focus on code readability for now ;)
Edit: If it matters, a hash table will generally be slower than a direct lookup table. For a lookup table, you use enum values as your "keys", and the handler is reached with a single indexed load and an indirect jump, which is O(1) with a very small constant. A hash table lookup first has to compute a hash and then retrieve the value, which is considerably more expensive.
Using an array that stores the function addresses, indexed by the values of an enum, is good. But using a hash table to do the same adds significant overhead.
To sum up, we have:
cost(Hash_table) >> cost(direct_lookup_table)
cost(direct_lookup_table) ~= cost(switch) if your compiler translates switches into lookup tables.
cost(switch) >> cost(direct_lookup_table) (O(N) vs O(1)) if your compiler does not translate switches and uses conditionals instead, but I can't think of any compiler doing this.
But inlined direct threading makes the code less readable.
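To make the "array indexed by an enum" idea concrete, here is a minimal sketch next to the equivalent switch. The token names, the Parser struct and the handler functions are all made up for illustration; they are not from the question. Whether the switch really compiles down to the same kind of table depends on your compiler.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical token kinds; the values are consecutive so they can index an array.
enum Token { TOK_NUMBER, TOK_PLUS, TOK_MINUS, TOK_COUNT };

// Hypothetical parser state that the handlers update.
struct Parser {
    int numbers = 0;
    int operators = 0;
};

void onNumber(Parser& p) { ++p.numbers; }
void onOperator(Parser& p) { ++p.operators; }

using Handler = void (*)(Parser&);

// Direct lookup table: one array index plus one indirect call per token.
const std::array<Handler, TOK_COUNT> kHandlers = { onNumber, onOperator, onOperator };

void dispatchTable(Parser& p, Token t) {
    assert(static_cast<std::size_t>(t) < kHandlers.size());
    kHandlers[t](p);
}

// Equivalent switch; a compiler may lower it to a jump table or to compares.
void dispatchSwitch(Parser& p, Token t) {
    switch (t) {
        case TOK_NUMBER: onNumber(p); break;
        case TOK_PLUS:
        case TOK_MINUS:  onOperator(p); break;
        default: break;
    }
}
```

Both functions do the same work, so you can time them against each other with your own token stream before deciding.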

The std::map that ships with Visual Studio 2008 gives you O(log n) for each lookup, since it hides a tree structure beneath.
With a modern compiler (depending on the implementation), a switch statement can give you O(1), because the compiler translates it into some kind of lookup table.
So in general, switch is faster.
However, consider the following facts:
The difference between a map and a switch is that a map can be built dynamically while a switch can't. A map can use an arbitrary type as a key, while a switch is limited to C++ integral and enumeration types (char, int, enum, etc.).
By the way, you can use a hash map to achieve nearly O(1) dispatching (though, depending on the hash table implementation, it can be O(n) in the worst case). Even so, a switch will still be faster.
Edit
I am writing the following only for fun and for the sake of the discussion.
I can suggest a nice optimization, but it depends on the nature of your language and on whether you can predict how it will be used.
When you write the code:
Divide your tokens into two groups: one of very frequently used tokens and one of rarely used tokens. Also sort the frequently used tokens by frequency.
For the frequently used tokens, write an if-else chain with the most frequent token first. For the rarely used tokens, write a switch statement.
The idea is to rely on CPU branch prediction to avoid even one level of indirection (assuming the condition check in the if statement is nearly free).
In most cases the CPU will pick the correct branch without any indirection. There will, however, be a few cases where the branch goes the wrong way.
Depending on the nature of your language, statistically it may give better performance.
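A rough sketch of that hot/cold split, assuming (purely for illustration) that identifiers and numbers are the most frequent tokens. The token names and the returned handler ids are hypothetical stand-ins for calling the real parser functions.

```cpp
#include <cassert>

// Hypothetical token kinds; assume profiling showed identifiers and numbers dominate.
enum Token { TOK_IDENT, TOK_NUMBER, TOK_PLUS, TOK_MINUS, TOK_STAR, TOK_SLASH };

// Stand-in for "call the parser function": returns an id of the chosen handler.
int dispatch(Token t) {
    // Hot path: plain compares, most frequent token first.
    if (t == TOK_IDENT)  return 1;
    if (t == TOK_NUMBER) return 2;

    // Cold path: everything else goes through the switch.
    switch (t) {
        case TOK_PLUS:  return 3;
        case TOK_MINUS: return 4;
        case TOK_STAR:  return 5;
        case TOK_SLASH: return 6;
        default:
            assert(false && "hot tokens are handled before the switch");
            return 0;
    }
}
```

Whether this beats a plain switch depends entirely on the real token distribution, so measure it with representative input.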
Edit: Due to some comments below, I changed the sentence that claimed compilers will always translate a switch into a LUT.

What is your definition of "efficient"? If you mean faster, then you probably should profile some test code for a definite answer. If you're after flexible and easy-to-extend code though, then do yourself a favor and use the map approach. Everything else is just premature optimization...

Like yossi1981 said, a switch could be optimized into a fast lookup table, but there is no guarantee: every compiler has its own algorithm for deciding whether to implement the switch as consecutive ifs, as a fast lookup table, or as a combination of both.
To gain a fast switch your values should meet the following rule:
they should be consecutive, that is e.g. 0,1,2,3,4. You can leave some values out but things like 0,1,2,34,43 are extremely unlikely to be optimized.
The question really is: is the performance of such significance in your application?
And wouldn't a map which loads its values dynamically from a file be more readable and maintainable instead of a huge statement which spans multiple pages of code?

You don't say what type your tokens are. If they are not integers, you don't have a choice - switches only work with integer types.

The C++ standard says nothing about the performance of its requirements, only that the functionality should be there.
These sorts of questions about which is better or faster or more efficient are meaningless unless you state which implementation you're talking about. For example, the string handling in a certain version of a certain implementation of JavaScript was atrocious, but you can't extrapolate that to being a feature of the relevant standard.
I would even go so far as to say it doesn't matter regardless of the implementation since the functionality provided by switch and std::map is different (although there's overlap).
These sorts of micro-optimizations are almost never necessary, in my opinion.


What are the advantages of a custom data structure?

What's the need to go for defining and implementing data structures (e.g. stack) ourselves if they are already available in C++ STL?
What are the differences between the two implementations?
First, implementing an existing data structure yourself is a useful exercise. You understand better what it does (so you can understand better what the standard containers do). In particular, you understand better why time complexity is so important.
Then, there is a quality of implementation issue. The standard implementation might not be suitable for you.
Let me give an example. Indeed, std::stack implements a stack. It is a general-purpose implementation. Have you measured sizeof(std::stack<char>)? Have you benchmarked it for the case of a million stacks holding 3.2 elements on average with a Poisson distribution?
Perhaps in your case, you happen to know that you have millions of stacks of char-s (never NUL), and that 99% of them have fewer than 4 elements. With that additional knowledge, you should probably be able to implement something "better" than what the standard C++ stack provides. So std::stack<char> would work, but given that extra knowledge you'll be able to implement it differently. You would still (for readability and maintenance) use the same methods as in std::stack<char>, so your WeirdSmallStackOfChar would have a push method, etc. If (later during the project) you realize that a bigger stack might be useful (e.g. in 1% of cases), you'll reimplement your stack differently (e.g. if your code base grows to a million lines of C++ and you realize that you quite often have bigger stacks, you might "remove" your WeirdSmallStackOfChar class and add typedef std::stack<char> WeirdSmallStackOfChar; ....)
If you happen to know that all your stacks have less than 4 char-s and that \0 is not valid in them, representing such "stack"-s as a char w[4] field is probably the wisest approach. It is fast and easy to code.
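For what it's worth, here is a minimal sketch of that char w[4] idea, assuming (as stated above) that '\0' never appears as a value, so it can mark unused slots. TinyCharStack and its member names are invented for this example; the method names merely mimic std::stack.

```cpp
#include <cassert>

// Hypothetical stack of at most 4 non-NUL chars; '\0' marks an unused slot.
struct TinyCharStack {
    char w[4] = {0, 0, 0, 0};

    // Number of stored chars: index of the first unused slot.
    int size() const {
        int n = 0;
        while (n < 4 && w[n] != '\0') ++n;
        return n;
    }
    bool empty() const { return w[0] == '\0'; }

    void push(char c) {
        assert(c != '\0');  // NUL is reserved as the "empty" marker
        const int n = size();
        assert(n < 4);      // capacity is fixed by design
        w[n] = c;
    }
    char top() const {
        const int n = size();
        assert(n > 0);
        return w[n - 1];
    }
    void pop() {
        const int n = size();
        assert(n > 0);
        w[n - 1] = '\0';
    }
};
```

The whole object is four bytes and needs no heap allocation, which is the point of exploiting the extra knowledge.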
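So, if performance and memory space matter, you might perhaps code something as weird as
#include <stack>
class MyWeirdStackOfChars {
  bool small;  // true: smallstack is in use; false: bigstack is in use
  union {
    std::stack<char>* bigstack;  // heap-allocated fallback for the rare big stack
    char smallstack[4];          // inline storage for the common small case
  };
  // constructors, destructor, push, pop, etc. are omitted here
};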
Of course, that is very incomplete. When small is true your implementation uses smallstack. For the 1% case where it is false, your implementation uses bigstack. The rest of MyWeirdStackOfChars is left as an exercise (not that easy) to the reader. Don't forget to follow the rule of five.
Ok, maybe the above example is not convincing. But what about std::map<int,double>? You might have millions of them, and you might know that 99.5% of them are smaller than 5. You obviously could optimize for that case. It is highly probable that representing small maps by an array of pairs of int & double is more efficient both in terms of memory and in terms of CPU time.
Sometimes, you even know that all your maps have fewer than 16 entries (and std::map<int,double> doesn't know that) and that the key is never 0. Then you might represent them differently. In that case, I guess that I am able to implement something much more efficient than what std::map<int,double> provides (probably, because of cache effects, an array of 16 entries with an int and a double is the fastest).
That is why any developer should know the classical algorithms (and have read some Introduction to Algorithms), even if in many cases they would use existing containers. Be also aware of the as-if rule.
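A possible sketch of such a small map, under exactly those assumptions: at most 16 entries and a key that is never 0, so 0 can mark an empty slot. The class name and its two methods are hypothetical and cover only insertion and lookup (no erase).

```cpp
#include <array>
#include <cassert>

// Hypothetical fixed-capacity int -> double map; key 0 marks an empty slot.
class SmallIntDoubleMap {
    struct Entry {
        int key = 0;
        double value = 0.0;
    };
    std::array<Entry, 16> entries_{};  // used entries are packed from the front

public:
    // Inserts or overwrites; returns false when all 16 slots are taken.
    bool set(int key, double value) {
        assert(key != 0);  // 0 is reserved as the "empty" marker
        for (Entry& e : entries_) {
            if (e.key == key || e.key == 0) {
                e.key = key;
                e.value = value;
                return true;
            }
        }
        return false;
    }

    // Linear scan; returns nullptr when the key is absent.
    const double* find(int key) const {
        assert(key != 0);
        for (const Entry& e : entries_) {
            if (e.key == key) return &e.value;
            if (e.key == 0) break;  // reached the unused tail
        }
        return nullptr;
    }
};
```

All entries sit in one contiguous block, so a lookup touches at most a few cache lines. Whether that actually beats std::map<int,double> for your data is something to benchmark, not assume.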
STL implementation of Data Structures is not perfect for every possible use case.
I like the example of hash tables. I have been using STL implementation for a while, but I use it mainly for Competitive Programming contests.
Imagine that you are Google and you have billions of dollars in resources destined to storing and accessing hash tables. You would probably like to have the best possible implementation for the company use cases, since it will save resources and make search faster in general.
Oh, and I forgot to mention that you also have some of the best engineers on the planet working for you (:
(This video is made by Kulukundis talking about the new hash table made by his team at Google )
https://www.youtube.com/watch?v=ncHmEUmJZf4
Some other reasons that justify implementing your own version of Data Structures:
Test your understanding of a specific structure.
Customize part of the structure to some peculiar use case.
Seek better performance than STL for a specific data structure.
Hating STL errors.
Benchmarking STL against some simple implementation.

What string search algorithm does strstr use?

I was reading through the String searching algorithm wikipedia article, and it made me wonder what algorithm strstr uses in Visual Studio? Should I try and use another implementation, or is strstr fairly fast?
Thanks!
The implementation of strstr in Visual Studio is not known to me, and I am not sure it is known to anyone. However I found these interesting sources and an example implementation. The latter shows that the algorithm runs in worst-case quadratic time with respect to the size of the searched string. The average case should be better than that. The algorithmic limit of non stochastic solutions should be that.
What is actually the case is that, depending on the size of the input, different algorithms might be used, mainly ones optimized to the metal. However, one cannot really bet on that. If you are doing DNA sequencing, strstr and family are very important and you will most probably have to write your own customized version. Usually, standard implementations are optimized for the general case, but on the other hand those working on compilers know their stuff. At any rate you should not bet your own skills against the pros.
But really all this discussion about time to develop is hurting the effort to write good software. Be certain that the benefit of rewriting a custom strstr outweighs the effort that is going to be needed to maintain and tune it for your specific case, before you embark on this task.
As others have recommended: Profile. Perform valid performance tests.
Without the profile data, you could be optimizing a part of the code that runs 20% of the time, a waste of ROI.
Development costs are the prime concern with modern computers, not execution time. The best use of time is to develop the program to operate correctly with few errors before entering System Test. This is where the focus should be. Also due to this reasoning, most people don't care how Visual Studio implements strstr as long as the function works correctly.
Be aware that there is a point where a linear search outperforms other searches. This point depends on the size of the data or the search criteria. For example, linear search using a processor with branch prediction and a large instruction cache may outperform other techniques for small and medium data sizes. A more complicated algorithm may have more branches that cause reloading of the instruction cache or data cache (wasting execution time).
Another method for optimizing your program is to make the data organization easier for searching. For example, making the string small enough to fit into a cache line. This also depends on the quantity of searching. For a large amount of searches, optimizing the data structure may gain some performance.
In summary, optimize if and only if the program is not working correctly, the User is complaining about speed, it is missing timing constraints or it doesn't fit in the allocated memory. Next step is then to profile and optimize the areas where most of the time is spent. Any other optimization is futile.
The C++ standard refers to the C standard for the description of what strstr does. The C standard doesn't seem to put any restrictions on the complexity, so pretty much any algorithm that finds the first instance of the substring would be compliant.
Thus different implementations may choose different algorithms. You'd have to look at your particular implementation to determine which it uses.
The simple, brute-force approach is likely O(m×n) where m and n are the lengths of the strings. If you need better than that, you can try other libraries, like Boost, or implement one of the sub-linear searches yourself.
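For reference, the brute-force approach looks roughly like this. It is only a sketch of the O(m×n) idea and makes no claim about what any particular library's strstr actually does; naiveStrstr is a made-up name.

```cpp
#include <cassert>

// Brute-force substring search: try every start position in haystack and
// compare character by character. Worst case O(m*n).
const char* naiveStrstr(const char* haystack, const char* needle) {
    assert(haystack != nullptr && needle != nullptr);
    if (*needle == '\0') return haystack;  // same convention as strstr
    for (const char* h = haystack; *h != '\0'; ++h) {
        const char* a = h;
        const char* b = needle;
        while (*a != '\0' && *b != '\0' && *a == *b) {
            ++a;
            ++b;
        }
        if (*b == '\0') return h;  // the whole needle matched starting at h
    }
    return nullptr;
}
```

If profiling shows this kind of loop is too slow for your inputs, that is the point at which trying another algorithm makes sense.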

Example of compiler optimizations that can be 'easily' done on C++ code but not C code

This question talks of an optimization of the sort function that cannot be readily achieved in C:
Performance of qsort vs std::sort?
Are there more examples of compiler optimizations which would be impossible or at least difficult to achieve in C when compared to C++?
As @sehe mentioned in a comment, it's about the abstractions more than anything else. In other words, if the language allows the coder to express intent better, then the compiler can emit code which implements that intent in a more optimal fashion.
A simple example is std::fill. Sure, for basic types you could use memset, but let's say it's an array of 32-bit unsigned longs. std::fill knows that the array size is a multiple of 32 bits. And depending on the compiler, it might even be able to make the assumption that the array is properly aligned on a 32-bit boundary as well.
All of this combined may allow the compiler to emit code which sets the value 32 bits at a time, with no run-time checks to make sure that it is valid to do so. If we are lucky, the compiler will recognize this and replace it with a particularly efficient architecture-specific version of the code.
(in reality gcc and probably the other mainstream compilers do in fact do this for just about anything that could be considered equivalent to a memset already, including std::fill).
Often, memset is implemented in a way that has run-time checks for these types of things in order to choose the optimal code path. While this difference is probably negligible, the idea is that we have better expressed the intent of "filling" an array with a specific value, so the compiler is able to make slightly better choices.
Other more complicated language features do a good job of using the expression of intent to get larger gains, but this is the simplest example.
To be clear, my point is not that std::fill is "better" than memset. Instead, this is an example of how C++ allows better expression of intent to the compiler, allowing it to have more information during compile time, resulting in some optimizations being easier to implement.
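A small side-by-side of the two calls discussed above; the function names are invented for the example. Note one plain difference in expressiveness: memset can only write a repeated byte pattern, whereas std::fill takes a value of the element type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// memset works byte by byte, so it can only produce repeated-byte patterns
// (zero being the common case).
void zeroWithMemset(std::uint32_t* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    if (count != 0) std::memset(data, 0, count * sizeof *data);
}

// std::fill is typed: the element type and count are visible to the compiler,
// and any 32-bit value can be written, not just repeated bytes.
void setWithFill(std::uint32_t* data, std::size_t count, std::uint32_t value) {
    assert(data != nullptr || count == 0);
    std::fill(data, data + count, value);
}
```

What code either call turns into is up to the compiler and library; compare the generated assembly if you care about the details.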
It depends a bit on what you think of as the optimization here. If you're thinking of it purely as "std::sort vs. qsort", then there are thousands of other similar optimizations. Using a C++ template can support inlining in situations where essentially the only reasonable alternative in C is to use a pointer to a function and nearly no known compiler will inline the code being called. Depending on your viewpoint, this is either a single optimization, or an entire (open-ended) family of them.
Another possibility is using template meta-programming to turn something into a compile-time constant that would normally have to be computed at run-time with C. In theory, you could usually do this in C by embedding a magic number, for example via a #define, but that can lose context, flexibility or both (e.g., in C++ you can define a constant at compile time, carry out an arbitrary calculation from that input, and produce a compile-time constant used by the rest of the code; given the much more limited calculations you can carry out in a #define, that's not possible nearly as often in C).
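As a tiny illustration of the compile-time constant idea, here is the classic template factorial next to a run-time loop. Everything in it (Factorial, factorialAtRuntime, permutationsOfFour) is a made-up example, not code from the question.

```cpp
#include <cassert>

// Compile-time factorial via template recursion.
template <unsigned N>
struct Factorial {
    static const unsigned long long value = N * Factorial<N - 1>::value;
};
template <>
struct Factorial<0> {
    static const unsigned long long value = 1;
};

// Run-time equivalent, roughly what plain C would compute on each call.
unsigned long long factorialAtRuntime(unsigned n) {
    assert(n <= 20);  // 21! no longer fits in 64 bits
    unsigned long long result = 1;
    for (unsigned i = 2; i <= n; ++i) result *= i;
    return result;
}

// The template result is a constant expression, so it can size an array
// and be checked by the compiler itself.
int permutationsOfFour[Factorial<4>::value];
static_assert(Factorial<5>::value == 120, "evaluated by the compiler");
```

In newer C++ a constexpr function would express the same thing more directly; the template form is shown because that is the technique named above.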
Yet another possibility is function overloading and template specialization. These are separate, but give the same basic result: using code that's specialized to a particular type. In C, to keep the number of functions you deal with halfway reasonable, you frequently end up writing code that (for example) converts all integers to a long, then does math on that. Templates, template specialization, and overloading make it relatively easy to use code that keeps the smaller types their native sizes, which can give a substantial speed increase (especially when it can enable vectorizing the math).
One last obvious possibility stems from simply providing quite a few pre-built data structures and algorithms, and allowing such things to be packaged for relatively easy, efficient re-use. I doubt I could even count the number of times I wrote code in C using what I knew were relatively inefficient data structures and/or algorithms, simply because it wasn't worth the time to find (or adapt) a more efficient one to the task at hand. Yes, if it really became a major bottleneck, I'd go to the trouble of finding or writing something better -- but doing a bit of comparing, it's still fairly common to see speed double when written in C++.
I should add, however, that all of these are undoubtedly possible with C, at least in theory. If you approach this from a viewpoint of something like language complexity theory and theoretical models of computation (e.g., Turing machines) there's no question that C and C++ are equivalent. With enough work writing specialized versions of each function, you can/could theoretically do all of those same things with C as you can with C++.
From a viewpoint of what code you can plan on really writing in a practical project, the story changes very quickly -- the limit on what you can do mostly comes down to what you can reasonably manage, not anything like the theoretical model of computation represented by the language. Levels of optimization that are almost entirely theoretical in C are not only practical, but quite routine in C++.
Even the qsort vs std::sort example is invalid. If a C implementation wanted, it could put an inline version of qsort in stdlib.h, and any decent C compiler could handle inlining the comparison function. The reason this usually isn't done is that it's massively bloated and of dubious performance benefit -- issues C++ folks tend not to care about...

Equation parser efficiency

I sunk about a month of full time into a native C++ equation parser. It works, except it is slow (between 30-100 times slower than a hard-coded equation). What can I change to make it faster?
I read everything I could find on efficient code. In broad strokes:
The parser converts a string equation expression into a list of "operation" objects.
An operation object has two function pointers: a "getSource" and a "evaluate".
To evaluate an equation, all I do is a for loop on the operation list, calling each function in turn.
There isn't a single if / switch encountered when evaluating an equation - all conditionals are handled by the parser when it originally assigned the function pointers.
I tried inlining all the functions to which the function pointers point - no improvement.
Would switching from function pointers to functors help?
How about removing the function pointer framework, and instead creating a full set of derived "operation" classes, each with its own virtual "getSource" and "evaluate" functions? (But doesn't this just move the function pointers into the vtable?)
I have a lot of code. Not sure what to distill / post. Ask for some aspect of it, and ye shall receive.
In your post you don't mention that you have profiled the code. This is the first thing I would do if I were in your shoes. It'll give you a good idea of where the time is spent and where to focus your optimization efforts.
It's hard to tell from your description if the slowness includes parsing, or it is just the interpretation time.
The parser, if you write it as recursive-descent (LL(1)), should be I/O bound. In other words, the reading of characters by the parser, and construction of your parse tree, should take a lot less time than it takes to simply read the file into a buffer.
The interpretation is another matter.
Interpreted code is usually 10-100 times slower than compiled code, unless the basic operations themselves are lengthy.
That said, you can still optimize it.
You could profile, but in such a simple case, you could also just single-step the program, in the debugger, at the level of individual instructions.
That way, you are "walking in the computer's shoes" and it will be obvious what can be improved.
Whenever I'm doing what you're doing, that is, providing a language to the user, but I want the language to have fast execution, what I do is this:
I translate the source language into a language I have a compiler for, and then compile it on-the-fly into a .dll (or .exe) and run that.
It's very quick, and I don't need to write an interpreter or worry about how fast it is.
The very first thing is: Profile what actually went wrong. Is the bottleneck in parsing or in evaluation? valgrind offers some tools that can help you here.
If it's in parsing, boost::spirit might help you. If it's in evaluation, remember that virtual functions can be pretty slow to evaluate. I've had pretty good experiences with recursive boost::variants.
You know, building an expression recursive descent parser is really easy, the LL(1) grammar for expressions is only a couple of rules. Parsing then becomes a linear affair and everything else can work on the expression tree (while parsing basically); you'd collect the data from the lower nodes and pass it up to the higher nodes for aggregation.
This would avoid function/class pointers for determining the call path at runtime altogether, relying instead on plain recursion (or you can build an iterative LL parser if you wish).
It seems that you're using quite a complicated data structure (as I understand it, a syntax tree with pointers etc.). Walking through it by dereferencing pointers is not very efficient memory-wise (lots of random accesses) and could slow you down significantly. As Mike Dunlavey proposed, you could compile the whole expression at runtime using another language or by embedding a compiler (such as LLVM). As far as I know, Microsoft .NET provides this feature (dynamic compilation) with Reflection.Emit and LINQ expression trees.
This is one of those rare times that I'd advise against profiling just yet. My immediate guess is that the basic structure you're using is the real source of the problem. Profiling the code is rarely worth much until you're reasonably certain the basic structure is reasonable, and it's mostly a matter of finding which parts of that basic structure can be improved. It's not so useful when what you really need to do is throw out most of what you have, and basically start over.
I'd advise converting the input to RPN. To execute this, the only data structure you need is a stack. Basically, when you get to an operand, you push it on the stack. When you encounter an operator, it operates on the items at the top of the stack. When you're done evaluating a well-formed expression, you should have exactly one item on the stack, which is the value of the expression.
Just about the only thing that will usually give better performance than this is to do as @Mike Dunlavey advised, and just generate source code and run it through a "real" compiler. That is, however, a fairly "heavy" solution. If you really need maximum speed, it's clearly the best solution -- but if you just want to improve what you're doing now, converting to RPN and interpreting that will usually give a pretty decent speed improvement for a small amount of code.
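The RPN evaluation described above can be sketched as follows. The Instr struct and evalRpn are hypothetical names, only the four basic operators are covered, and the parser is assumed to have already produced the instruction list in postfix order.

```cpp
#include <cassert>
#include <vector>

// Hypothetical instruction: either push a constant or apply a binary operator.
struct Instr {
    enum Kind { Push, Add, Sub, Mul, Div } kind;
    double value;  // only meaningful when kind == Push
};

// Evaluates a well-formed RPN program using a single value stack.
double evalRpn(const std::vector<Instr>& program) {
    std::vector<double> stack;
    stack.reserve(program.size());
    for (const Instr& in : program) {
        if (in.kind == Instr::Push) {
            stack.push_back(in.value);
            continue;
        }
        assert(stack.size() >= 2);  // a binary operator needs two operands
        const double rhs = stack.back();
        stack.pop_back();
        double& lhs = stack.back();  // result overwrites the left operand
        switch (in.kind) {
            case Instr::Add: lhs += rhs; break;
            case Instr::Sub: lhs -= rhs; break;
            case Instr::Mul: lhs *= rhs; break;
            case Instr::Div: lhs /= rhs; break;
            case Instr::Push: break;  // already handled above
        }
    }
    assert(stack.size() == 1);  // exactly one value remains: the result
    return stack.back();
}
```

For example, (2 + 3) * 4 becomes the program: push 2, push 3, add, push 4, mul. The instructions live in one contiguous vector, so the loop walks memory linearly instead of chasing tree pointers.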

I need high performance. Will there be a difference if I use C or C++?

I need to write a program (a project for university) that solves (approx) an NP-hard problem.
It is a variation of Linear ordering problems.
In general, I will have very large inputs (as Graphs) and will try to find the best solution
(based on a function that will 'rate' each solution)
Will there be a difference if I write this in C-style code (one main, and functions)
or build a Solver class, create an instance and invoke a 'run' method from a main (similar to Java)
Also, there will be a lot of floating point math going on in each iteration.
Thanks!
No.
The biggest performance gains/flaws will come from the algorithm you implement and from how much unneeded work you perform. (Unneeded work could be anything from recalculating a previous value that could have been cached, to using too many malloc/free calls instead of memory pools, to passing large immutable data by value instead of by reference.)
The biggest roadblock to optimal code is no longer the language (for properly compiled languages), but rather the programmer.
No, unless you are using virtual functions.
Edit: If you have a case where you need run-time dynamism, then yes, virtual functions are as fast as or faster than a manually constructed if-else statement. However, if you drop the virtual keyword in front of a method but don't actually need the polymorphism, then you will be paying an unnecessary overhead. The compiler will not necessarily optimize it away at compile time. I am just pointing this out because it's one of the features of C++ that breaks the 'zero-overhead principle' (quoting Stroustrup).
As a side note, since you mention heavy use of fp math:
The following gcc flags may help you speed things up (I'm sure there are equivalent ones for visual C++, but I don't use it): -mfpmath=sse, -ffast-math and -mrecip (The last two are 'slightly dangerous', meaning that they could give you weird results in edge cases in exchange for the speed. The first one reduces precision by a bit -- you have 64-bit doubles instead of 80-bit ones -- but this extra precision is often unneeded.) These flags would work equally well for C and C++ compilers.
Depending on your processor, you may also find that simulating true INFINITY with a large-but-not-infinite value gives you a good speed boost. This is because true INFINITY has to be handled as a special case by the processor.
Rule of thumb - do not optimize until you know what to optimize. So start with C++ and get a working prototype. Then profile it and rewrite the bottlenecks in assembly. But as others noted, the chosen algorithm will have a much greater impact than the language.
When speaking of performance, anything you can do in C can be done in C++.
For example, virtual methods are known to be “slow”, but if it's really a problem, you can still resort to C idioms.
C++ also brings templates, which lead to better performance than using void* for generic programming.
The Solver class will be constructed once, I take it, and the run method executed once... in that kind of environment, you won't see a difference. Instead, here are things to watch out for:
Memory management is hellishly expensive. If you need to do lots of little malloc()s, the operating system will eat your lunch. Make a determined effort to re-use whatever data structures you create if you know you'll be doing the same kind of thing again soon!
Instantiating classes generally means... allocating memory! Again, there's practically no cost for instantiating a handful of objects and re-using them. But beware of creating objects only to tear them down and rebuild them soon after!
Choose the right flavor of floating point for your architecture, insofar as the problem permits. It's possible that double will end up being faster than float, although it will need more memory. You should experiment to fine-tune this. Ideally, you'll use a #define or typedef to specify the type so you can change it easily in one place.
Integer calculations are probably faster than floating point. Depending on the numeric range of your data, you may also consider doing it with integers treated as fixed-point decimals. If you need 3 decimal places, you could use ints and just consider them "milli-somethings". You'll have to remember to shift decimals after division and multiplication... but no big deal. If you use any math functions beyond the basic arithmetic, of course, that would kill this possibility.
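A minimal sketch of the "milli-somethings" idea from the last point, with three decimal places stored in a 64-bit integer. The Milli type and the milli* helpers are invented for the example, results are truncated rather than rounded, and overflow is not handled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fixed-point type: the value is stored in thousandths.
typedef std::int64_t Milli;
const Milli kScale = 1000;

Milli milliFromWhole(std::int64_t whole) { return whole * kScale; }

// Addition needs no rescaling: both operands carry the same scale.
Milli milliAdd(Milli a, Milli b) { return a + b; }

// The raw product carries scale^2, so divide by the scale once.
Milli milliMul(Milli a, Milli b) { return (a * b) / kScale; }

// Pre-multiply the dividend so the quotient keeps three decimals.
Milli milliDiv(Milli a, Milli b) {
    assert(b != 0);
    return (a * kScale) / b;
}
```

So 1.500 * 2.250 is computed as 1500 * 2250 / 1000 = 3375, i.e. 3.375. Whether this is actually faster than double on your hardware is something to measure.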
Since both are compiled, and the compilers now are very good at how to handle C++, I think the only problem would come from how well optimized your code is. I think it would be easier to write slower code in C++, but that depends on which style your model fits into best.
When it comes down to it, I doubt there will be any real difference, assuming both versions are well written, use equally well-written libraries, and are measured on the same computer.
Function call vs. member function call overhead is unlikely to be the limiting factor, compared to file input and the algorithm itself. C++ iostreams are not necessarily super high speed. C has 'restrict' if you're really optimizing; in C++ it's easier to inline function calls. Overall, C++ offers more options for organizing your code clearly, but if it's not a big program, or you're just going to write it in a similar manner whether it's C or C++, then the portability of C libraries becomes more important.
As long as you don't use any virtual functions etc., you won't notice any considerable performance differences. Early C++ was compiled to C, so as long as you know the points where this creates considerable overhead (such as with virtual functions), you can clearly account for the differences.
In addition I want to note that using C++ can give you a lot to gain if you use the STL and Boost Libraries. Especially the STL provides very efficient and proven implementations of the most important data structures and algorithms, so you can save a lot of development time.
Effectively it also depends on the compiler you will be using and how it will optimize the code.
First, writing in C++ doesn't imply using OOP; look at the STL algorithms.
Second, C++ can be even slightly faster at runtime (the compilation times can be terrible compared to C, but that's because modern C++ tends to rely heavily on abstractions that tax the compiler).
Edit: alright, see Bjarne Stroustrup's discussion of qsort and std::sort, and the article that FAQ mentions (Learning Standard C++ as a New Language), where he shows that C++-style code can be not only shorter and more readable (because of higher abstractions), but also somewhat faster.
Another aspect:
C++ templates can be an excellent tool to generate type-specific / optimized code variations.
For example, C qsort requires a function call to the comparator, whereas std::sort can inline the functor passed. This can make a significant difference when compare and swap themselves are cheap.
Note that you could generate "custom qsorts" optimized for various types with a barrage of defines or a code generator, or by hand - you could do these optimizations in C, too, but at much higher cost.
(It's not a general weapon, templates help only in specific scenarios - usually a single algorithm applied to different data types or with differing small pieces of code injected.)
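To make the qsort vs. std::sort comparison concrete, here are both calls on the same data. The wrapper names are made up; the point is only that qsort receives the comparator as a function pointer taking void*, while std::sort is instantiated for the exact comparator type, which makes inlining it straightforward for the compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// C-style comparator: qsort calls it through a function pointer and passes
// untyped pointers that must be cast back.
int compareInts(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);  // avoids the overflow risk of x - y
}

void sortWithQsort(std::vector<int>& v) {
    if (!v.empty()) std::qsort(v.data(), v.size(), sizeof(int), compareInts);
    assert(std::is_sorted(v.begin(), v.end()));
}

// std::sort is a template: the lambda's type is part of the instantiation.
void sortWithStdSort(std::vector<int>& v) {
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });
    assert(std::is_sorted(v.begin(), v.end()));
}
```

Timing the two on a large vector of ints is an easy way to see how much the difference matters on your compiler.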
Good answers. I would put it like this:
Make the algorithm as efficient as possible in terms of its formal structure.
C++ will be as fast as C, except that it will tempt you to do dumb things, like constructing objects that you don't have to, so don't take the bait. Things like STL container classes and iterators may look like the latest-and-greatest thing, but they will kill you in a hotspot.
Even so, single-step it at the disassembly level. You should see it very directly working on your problem. If it is spending lots of cycles getting in and out of routines, try some in-lining (or macros). If it is wandering off into memory allocation and freeing, for much of the time, put a stop to that. If it's got inner loops where the loop overhead is a large percentage, try unrolling the loop.
That's how you can make it about as fast as possible.
I would go with C++ definitely. If you are careful about your design and avoid creating heavy objects inside hotspots you should not see any performance difference but the code is going to be much simpler to understand, maintain, and expand.
Use templates and classes judiciously. Avoid unnecessary object creation by passing objects by reference. Avoid excessive memory allocation; if needed, allocate memory in advance of hotspots. Use the restrict keyword on pointers to tell the compiler that they do not overlap (restrict is standard C; C++ compilers generally offer it only as an extension such as __restrict).
As far as optimization goes, pay careful attention to memory alignment. Assuming you are working on an Intel processor, you can make use of vector instructions, provided you tell the compiler through pragmas about your memory alignment and aliased pointers. You can also use vector instructions directly via intrinsics.
You can also automatically create hotspot code using templates and let the compiler optimize it if you have things like short loops of different sizes. To measure performance and drill down to your bottlenecks, Intel VTune or OProfile are extremely helpful.
hope that helps
I do some DSP coding, where it still pays off to go to assembly language sometimes. I'd say use C or C++, either one, and be prepared to go to assembly language when you need to, especially to exploit SIMD instructions.