I am using C++ to code some complicated FFT algorithm, so I need to implement such algebraic structures as quaternions and Hamilton-Eisenstein codes. Algorithm works with 2D array of that structures. What would be the overhead of implementing them as classes? In other way, should I create the array with [M][N] dimensions which consists of Quaternion classes, or should I create [M][N][4] array and work with [4] arrays as quaternions? Using classes is more convenient, but creating M*N classes and accessing their methods instead of working with just array - wouldn't that be too much overhead? I'm coding the algorithm for large images processing, so performance is important for me.
IMHO you are better served by implementing them as classes simply because this will let you write your code quicker with less errors. You should do measurements to see what performs best if that is important for you, but also make sure that it is actually this code that is the performance bottleneck. (Mandatory Donald Knuth quote: "premature optimization is the root of all evil").
Most compilers will do a very good job at optimizing code for you, I would say. More often than not I find that it is something else than these low level things that make a difference, like adding an early-out test or minimizing the dataset or whatnot.
For a quaternion, you can still implement the class using an array internally (in case that is actually faster), which should make the difference even less important.
You are probably better served by for instance making sure that you can run your algorithms in parallell on multicore machines or make your actual calculations use SSE instructions.
Regarding overhead of classes: Unless your classes have virtual functions, there's no penalty for using classes.
So, for example, an array of complex variables may be written as:
std::complex<double> m[10][10];
Beware of STL collection classes, though, as they tend to use dynamic allocation and sometimes introduce significant overhead (i.e., I wouldn't make arrays using vector< vector<> >.
You might want to investigate the use of a library such as Eigen for fast, optimized, matrix/vector classes.
Related
For many high performance applications, such as game engines or financial software, considerations of cache coherency, memory layout, and cache misses are crucial for maintaining smooth performance. As the C++ standard has evolved, especially with the introduction of Move Semantics and C++14, it has become less clear when to draw the line of pass by value vs. pass by reference for mathematical POD based classes.
Consider the common POD Vector3 class:
class Vector3
{
public:
float32 x;
float32 y;
float32 z;
// Implementation Functions below (all non-virtual)...
}
This is the most commonly used math structure in game development. It is a non-virtual, 12 byte size class, even in 64 bit since we are explicitly using IEEE float32, which uses 4 bytes per float. My question is as follows - What is the general best practice guideline to use when deciding to pass POD mathematical classes by value or by reference for high performance applications?
Some things for consideration when answering this question:
It is safe to assume the default constructor does not initialize any values
It is safe to assume no arrays beyond 1D are used for any POD math structures
Clearly most people pass 4-8 byte POD constants by value, so there doesn't seem to be much debate there
What happens when this Vector is a class member variable vs a local variable on the stack? If pass by reference is used, then it would use the memory address of the variable on the class vs a memory address of something local on the stack. Does this use-case matter? Could this difference where PBR is used result in more cache misses?
What about the case where SIMD is used or not used?
What about move semantic compiler optimizations? I have noticed that when switching to C++14, the compiler will often use move semantics when chain function calls are made passing the same vector by value, especially when it is const. I observed this by perusing the assembly breakdown
When using pass by value and pass by reference with these math structures, does const make a much impact on compiler optimizations? See the above point
Given the above, what is a good guideline for when to use pass by value vs pass by reference with modern C++ compilers (C++14 and above) to minimize cache misses and promote cache coherency? At what point might someone say this POD math structure is too large for pass by value, such as a 4v4 affine transform matrix, which is 64 bytes in size assuming use of float32. Does the Vector, or rather any small POD math structure, declared on the stack vs. being referenced as a member variable matter when making this decision?
I am hoping someone can provide some analysis and insight to where a good modern guideline for best practices can be established for the above situation. I believe the line has become more blurry as for when to use PBV vs PBR for POD classes as the C++ standard has evolved, especially in regard to minimizing cache misses.
I see the question title is on the choice of pass-by-value vs. pass-by-reference, though it sounds like what you are after more broadly is the best practice to efficiently passing around 3D vectors and other common PODs. Passing data is fundamental and intertwined with programming paradigm, so there isn't a consensus on the best way to do it. Besides performance, there are considerations to weigh like code readability, flexibility, and portability to decide which approach to favor in a given application.
That said, in recent years, "data-oriented design" has become a popular alternative to object-oriented programming, especially in video game development. The essential idea is to think about the program in terms of data it needs to process, and how all that data can be organized in memory for good cache locality and computation performance. There was a great talk about it at CppCon 2014: "Data-Oriented Design and C++" by Mike Acton.
With your Vector3 example for instance, it is often the case that a program has not just one but many 3D vectors that are all processed the same way, say, all undergo the same geometric transformation. Data-oriented design suggests it is then a good idea to lay the vectors out in contiguously in memory and that they are all transformed together in a batch operation. This improves caching and creates opportunities to leverage SIMD instructions. You could implement this example with the Eigen C++ linear algebra library. The vectors can be represented using a Eigen::Matrix<float, 3, Eigen::Dynamic> of shape 3xN to store N vectors, then manipulated using Eigen's SIMD-accelerated operations.
What's the need to go for defining and implementing data structures (e.g. stack) ourselves if they are already available in C++ STL?
What are the differences between the two implementations?
First, implementing by your own an existing data structure is a useful exercise. You understand better what it does (so you can understand better what the standard containers do). In particular, you understand better why time complexity is so important.
Then, there is a quality of implementation issue. The standard implementation might not be suitable for you.
Let me give an example. Indeed, std::stack is implementing a stack. It is a general-purpose implementation. Have you measured sizeof(std::stack<char>)? Have you benchmarked it, in the case of a million of stacks of 3.2 elements on average with a Poisson distribution?
Perhaps in your case, you happen to know that you have millions of stacks of char-s (never NUL), and that 99% of them have less than 4 elements. With that additional knowledge, you probably should be able to implement something "better" than what the standard C++ stack provides. So std::stack<char> would work, but given that extra knowledge you'll be able to implement it differently. You still (for readability and maintenance) would use the same methods as in std::stack<char> - so your WeirdSmallStackOfChar would have a push method, etc. If (later during the project) you realize or that bigger stack might be useful (e.g. in 1% of cases) you'll reimplement your stack differently (e.g. if your code base grow to a million lines of C++ and you realize that you have quite often bigger stacks, you might "remove" your WeirdSmallStackOfChar class and add typedef std::stack<char> WeirdSmallStackOfChar; ....)
If you happen to know that all your stacks have less than 4 char-s and that \0 is not valid in them, representing such "stack"-s as a char w[4] field is probably the wisest approach. It is fast and easy to code.
So, if performance and memory space matters, you might perhaps code something as weird as
class MyWeirdStackOfChars {
bool small;
union {
std::stack<char>* bigstack;
char smallstack[4];
}
Of course, that is very incomplete. When small is true your implementation uses smallstack. For the 1% case where it is false, your implemention uses bigstack. The rest of MyWeirdStackOfChars is left as an exercise (not that easy) to the reader. Don't forget to follow the rule of five.
Ok, maybe the above example is not convincing. But what about std::map<int,double>? You might have millions of them, and you might know that 99.5% of them are smaller than 5. You obviously could optimize for that case. It is highly probable that representing small maps by an array of pairs of int & double is more efficient both in terms of memory and in terms of CPU time.
Sometimes, you even know that all your maps have less than 16 entries (and std::map<int,double> don't know that) and that the key is never 0. Then you might represent them differently. In that case, I guess that I am able to implement something much more efficient than what std::map<int,double> provides (probably, because of cache effects, an array of 16 entries with an int and a double is the fastest).
That is why any developer should know the classical algorithms (and have read some Introduction to Algorithms), even if in many cases he would use existing containers. Be also aware of the as-if rule.
STL implementation of Data Structures is not perfect for every possible use case.
I like the example of hash tables. I have been using STL implementation for a while, but I use it mainly for Competitive Programming contests.
Imagine that you are Google and you have billions of dollars in resources destined to storing and accessing hash tables. You would probably like to have the best possible implementation for the company use cases, since it will save resources and make search faster in general.
Oh, and I forgot to mention that you also have some of the best engineers on the planet working for you (:
(This video is made by Kulukundis talking about the new hash table made by his team at Google )
https://www.youtube.com/watch?v=ncHmEUmJZf4
Some other reasons that justify implementing your own version of Data Structures:
Test your understanding of a specific structure.
Customize part of the structure to some peculiar use case.
Seek better performance than STL for a specific data structure.
Hating STL errors.
Benchmarking STL against some simple implementation.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
Bittersweet SoAs
I've recently come to see the delights of using hand-written SIMD intrinsics with the SoA (structure of array) representation.
The speed improvements over my former AoS (array of structures) code, at least for straightforward sequential-type streaming operations, was little short of amazing with doubled to tripled speed ups. As a bonus, it simplified the logic to exclude those tricky horizontal operations and shuffling components around in addition to reducing memory use.
Yet then there's this bittersweet sense afterwards where I realize what a PITA they are to work with in code, especially interface design.
Mid-Level Interface Design
I'm often dealing with designing mid-level interfaces. They're higher-level than say, std::vector, but lower-level than say, a Monster class in a video game. These are always some of the most awkward interfaces for me to design and keep stable, because they're not low-level enough to provide a simple read/write interface as with a standard C++ container. Yet they're not high-level enough (lacking sufficient logic in the entry points in the interface) to completely hide and abstract away the underlying representation and only provide high-level operations.
An example of what I consider a mid-level design is a programmable particle system API which desires to be as efficient and scalable as possible for certain scenarios while being convenient for casual scenarios (ex: to scripters). Such a design has to offer particle access, and unless it's going to have a method for every possible algorithm related to particles imaginable, it'll have to expose some of those raw SoA details somewhere, somewhere, to let clients benefit from them.
The design also shouldn't necessarily be made to require SoA type code to be written all the time. The more daily usage still does not demand utmost efficiency so much as convenience, simplicity, productivity. It's only for those rarer, performance-critical scenarios where the underlying SoA representation comes in handy.
So how do you API/lib designers and large-scale system guys deal with balancing these types of needs?
Balancing Multiple Access Patterns
Since the SoA obliterates away any per-element structure, might it be a decent idea to instantiate structs/classes on the fly as the user accesses the nth element using the more convenient, random-access portions of the interface? Perhaps a structure containing pointers/references to the nth entries of multiple SoA arrays for mutable access?
Also if the more common usage patterns are more random-access scalar logic rather than sequential-access SIMD vector logic, but the SIMD portions are triggered enough to still make it better to just use one data structure for it all, might this kind of hybrid SoA representation balance all the needs better?
struct AoSoA
{
ALIGN16 float x[4];
ALIGN16 float y[4];
ALIGN16 float z[4];
};
ALIGN16 AoSoA elements[n/4];
I don't understand the nature of cache lines that well to know if this kind of representation is worthwhile. I have noticed it doesn't help so much for the sequential SIMD cases where we can devote the full resources to one bulky algorithm, but it seems like it might be helpful for cases that need a lot of horizontal logic across components or random-access scalar logic cases where the system might be doing a lot of other things at the same time.
Anyway, I'm generally looking for insight into how to effectively design middle-level data structure interfaces with SoA backend representations as implementation details without transferring the complexity to the client unless they really want it.
I really want to avoid forcing clients to always write SoA-type code in every place that uses the interface unless they really need that efficiency, and I'm curious as to how to balance those more daily, random-access scalar usage scenarios vs. the rarer but not too uncommon scenarios that take advantage of the SoA representation.
I actually don't know enough software engineering to formulate a general strategy for what you want to do, but in particular for the AoS vs SoA problem, I found this paper by Robert Strzodka fascinating: http://asc.ziti.uni-heidelberg.de/sites/default/files/research/papers/public/St11ASX_CUDA.pdf
The goal of this abstraction is to provide an easy way to switch between AoS and SoA and even more complicated nestings. The author uses it to show how performance can change with different access patterns without touching the algorithmic part, and without the pain of recoding all of your accesses.
Although it is focused more on the GPU side, the code provided works on CPUs as well.
I've found a decent fit so far with this kind of "hybrid SoA" or "AoSoA" rep internally.
struct HybridSoA
{
ALIGN float x[4];
ALIGN float y[4];
ALIGN float z[4];
};
It balances those sequential fast paths using SIMD with random-access and slower paths that don't really care about SIMD with a design that preserves reasonable spatial locality for random-access paths.
For the interface, I haven't gotten too fancy yet, just returning pointers to these structures for those fast sequential paths and proxies allowing scalar-style access for operator[] and so forth.
The interface kind of leaks some of its internals for the SIMD path but it seems inevitable as the design can't anticipate all the high-level needs without growing increasingly monolithic, and it's abstracted in a way and with tight ABI concerns that makes it difficult to utilize a richer interface (the actual interface is coded in C with a C++ wrapper on top).
Perhaps it might be better though if I provided a foreach kind of method that accepted a function pointer (or something which ultimately translates into one like std::function, though I can't use it directly due to ABI reasons) that gets called back instead of directly exposing internal handles. It could be fed this SoA data required for SIMD in bulk to mitigate the calling overhead, and that'd alleviate a temporal coupling issue I have where write accesses to the structure require an explicit commit call to record the changes to the application history.
Iterators might be nice if they double over as ways to access the data in a proxy-style form (less raw exposure). Though I've kind of fallen out of love with iterators for all but generic containers and especially where the algorithms associated don't fall into a generalized category. I just found it a burden in the past to maintain iterators for everything (a burden that outweights the benefits of using a range-based for over operator[], e.g.), and have come to favor a kind of bastardized aesthetic between C and C++ (just for these kinds of mid-level data structures which are more complex than a standard container and store disparate types of data, but aren't high-level enough to impose many constraints on the public interface beyond a generic container).
Just for these specific types of data structures, I've found it most productive to favor the interface aesthetic of a plain old C-style array, though that's definitely biased and certainly just a result of my own tendencies. For things like meshes, I just keep finding myself increasingly being drawn to a C-style aesthetic, if only because I kept erring on the side of too many layers of code in the past for these cases to a point where I became confused by my own creations.
Appreciate all the answers and comments so far!
I used C++ vectors to implement stacks, queue, heaps, priority queue and directed weighted graphs. In the books and references, I have seen big classes for these data structures, all of which can be implemented in short using vectors. (May be there is more flexibility in using pointers)
Can we also implement even advanced data structures using vectors ?
If yes, why do C++ books still explain concepts with the long classes using pointers ?
Is it to keep in mind the lower level idea, if it is more vivid that way or it makes students equipped with such usage of pointers ?
It's true that many data structures can be implemented on top of a vector (array, for the sake of this answer), essentially all of them can, since every computation task can be implemented to run on a turing-machine which has a far more basic data access capability (or, in the real world, you may say that any program you implement with pointers eventually runs on a CPU with a simply array-like virtual memory space, so you could just call that a huge array). However, it's not always clever. Two main reasons :
performance / time complexity - a vector simply can't provide all basic operations that in O(1). There's a solution for fast initialization, but try to randomly insert values into a large vector and see how bad you perform - that's because you have to move all the elements by one place over and over. A list could do that in a single operation. Of course other structures have their own performance shortcomings, but that's the beauty of designing complicated data structures with these basic building blocks.
structural complexity - you can think of a list along the same line of a vector as an ordered container, and perhaps extend this into multidimensional matrices that can be implemented on top of them since they still retain some basic ordering, but there are more complicated structures. Take for e.g. a tree, a simple full binary tree one can be implemented with a vector very easily since the parent-child relations can be easily converted to index arithmetics, but what if the tree isn't full and has varying number of children per node? Now, you may say it can still be done (any graph can be implemented with vectors either through adjacency matrix or adjacency list for e.g.), but there's almost no sense in doing so when you can have a much simpler implementation using pointer links. Just think of doing an AVL roll with an array. :shudder:
Mind you that the second argument may very well boil down to performance ("hey, it's an awkward approach but I still managed to use a vector!"), but it's more than that - it would complicate your code, clutter your data structure design, and could make it far more prone to bugs.
Now, here comes the "but" - even though there's much sense in using all the possible tools the language provides you, it's very widely accepted to use vector-based structures for performance critical tasks. See almost all scientific CPU benchmarks, most of them ultimately rely on vectors (uncited, but I can elaborate further if anyone is interested. Suffice to say that even the well-known *graph*500 does that).
The reason is not that it's best programming practice, but that it's more suited with CPU internal structure and gets more "juice" out of the HW. That's due to spatial locality - CPUs are very fond of that as it allows the memory unit to parallelize accesses (in an array you always know where's the next element, in a list you have to wait until the current one is fetched), and also issue stream/stride prefetches that reduce latency of future requests.
I can't say this is always a good practice, when you run through a graph the accesses are still pretty irregular even if you use an array implementation, but it's still a very common practice.
To summarize, taking the question literally - most of them can, of sorts (for a given definition of "most", ok?), but if the intention was "why teach pointers", I believe you can see that in order to understand your limits and what you can and should use - you need to know a great deal more than just arrays and even pointers. A good programmer should know a bit about everything - OS design, CPU design, etc. You can't do anything decent unless you really understand the fabric you're running on, and that unfortunately (or not) includes lots of pointers
You can implement a kind of allocator using an std::vector as the backing store. If you do that, all the standard data structures from elementary computer science can be implemented on top of vectors. It will hardly free you from using pointers, though: vectors are really just chunks of memory with a few useful additional operations, most notably the ability to expand.
More to the point: if you don't understand pointers, you won't understand how to do use vector for advanced data structures either. vector is a useful abstraction, but it follows the C++ rule that "you don't get what you don't pay for", so it's also a very "thin" abstraction, and you do pay for the cost of abstraction in terms of the amount of code you have to write.
(Jonathan Wakely points out, in the comments, that you won't get the exact guarantees that the C++ standard library requires of allocators data structures when you implement them on top of vector. Put in principle, vectors are just a way of handling blocks of memory.)
If you are learning C++ you need to be familiar with pointers and how to use them even if there are more higher level concepts that does that job for you.
Yes, it is possible to implement most data structures with vectors or lists and if you just started learning programming it's probably a good idea that you'll know how to write these data structures yourself.
With that being said, production code should always use the standard library unless there is a good reason not to do so.
I need to write a program (a project for university) that solves (approx) an NP-hard problem.
It is a variation of Linear ordering problems.
In general, I will have very large inputs (as Graphs) and will try to find the best solution
(based on a function that will 'rate' each solution)
Will there be a difference if I write this in C-style code (one main, and functions)
or build a Solver class, create an instance and invoke a 'run' method from a main (similar to Java)
Also, there will be alot of floating point math going on in each iteration.
Thanks!
No.
The biggest performance gains/flaws will be on the algorithm you implement, and how much unneeded work you perform (Unneeded work could be everything from recalculating a previous value that could have been cached, to using too many malloc/free's vs using memory pools,
passing large immutable data by value instead of reference)
The biggest roadblock to optimal code is no longer the language (for properly compiled languages), but rather the programmer.
No, unless you are using virtual functions.
Edit: If you have a case where you need run-time dynamism, then yes, virtual functions are as fast or faster than a manually constructed if-else statement. However, if you drop in the virtual keyword in front of a method, but you don't actually need the polymorphism, then you will be paying an unnecessary overhead. The compiler won't optimize it away at compile time. I am just pointing this out because it's one of the features of C++ that breaks the 'zero-overhead principle` (quoting Stroustrup).
As a side note, since you mention heavy use of fp math:
The following gcc flags may help you speed things up (I'm sure there are equivalent ones for visual C++, but I don't use it): -mfpmath=sse, -ffast-math and -mrecip (The last two are 'slightly dangerous', meaning that they could give you weird results in edge cases in exchange for the speed. The first one reduces precision by a bit -- you have 64-bit doubles instead of 80-bit ones -- but this extra precision is often unneeded.) These flags would work equally well for C and C++ compilers.
Depending on your processor, you may also find that simulating true INFINITY with a large-but-not-infinite value gives you a good speed boost. This is because true INFINITY has to be handled as a special case by the processor.
Rule of thumb - do not optimize until you know what to optimize. So start with C++ and have some working prototype. Then profile it and rewrite bottle necks in assembly. But as others noted, chosen algorithm will have much greater impact than the language.
When speaking of performance, anything you can do in C can be done in C++.
For example, virtual methods are known to be “slow”, but if it's really a problem, you can still resort to C idioms.
C++ also brings templates, which lead to better performance than using void* for generic programming.
The Solver class will be constructed once, I take it, and the run method executed once... in that kind of environment, you won't see a difference. Instead, here are things to watch out for:
Memory management is hellishly expensive. If you need to do lots of little malloc()s, the operating system will eat your lunch. Make a determined effort to re-use whatever data structures you create if you know you'll be doing the same kind of thing again soon!
Instantiating classes generally means... allocating memory! Again, there's practically no cost for instantiating a handful of objects and re-using them. But beware of creating objects only to tear them down and rebuild them soon after!
Choose the right flavor of floating point for your architecture, insofar as the problem permits. It's possible that double will end up being faster than float, although it will need more memory. You should experiment to fine-tune this. Ideally, you'll use a #define or typedef to specify the type so you can change it easily in one place.
Integer calculations are probably faster than floating point. Depending on the numeric range of your data, you may also consider doing it with integers treated as fixed-point decimals. If you need 3 decimal places, you could use ints and just consider them "milli-somethings". You'll have to remember to shift decimals after division and multiplication... but no big deal. If you use any math functions beyond the basic arithmetic, of course, that would of course kill this possibility.
Since both are compiled, and the compilers now are very good at how to handle C++, I think the only problem would come from how well optimized your code is. I think it would be easier to write slower code in C++, but that depends on which style your model fits into best.
When it comes down to it, I doubt there will be any real difference, assuming both are well-written, any libraries you use, how well written they are, if you are measuring on the same computer.
Function call vs. member function call overhead is unlikely to be the limiting factor, compared to file input and the algorithm itself. C++ iostreams are not necessarily super high speed. C has 'restrict' if you're really optimizing, in C++ it's easier to inline function calls. Overall, C++ offers more options for organizing your code clearly, but if it's not a big program, or you're just going to write it in a similar manner whether it's C or C++, then the portability of C libraries becomes more important.
As long as you don't use any virtual functions etc. you won't note any considerable performance differences. Early C++ was compiled to C, so as long as you know the pinpoints where this creates any considerable overhead (such as with virtual functions) you can clearly calculate for the differences.
In addition I want to note that using C++ can give you a lot to gain if you use the STL and Boost Libraries. Especially the STL provides very efficient and proven implementations of the most important data structures and algorithms, so you can save a lot of development time.
Effectively it also depends on the compiler you will be using and how it will optimize the code.
first, writing in C++ doesn't imply using OOP, look at the STL algorithms.
second, C++ can be even slightly faster at runtime (the compilation times can be terrible compared to C, but that's because modern C++ tends to rely heavily on abstractions that tax the compiler).
edit: alright, see Bjarne Stroustrup's discussion of qsort and std::sort, and the article that FAQ mentions (Learning Standard C++ as a New Language), where he shows that C++-style code can be not only shorter and more readable (because of higher abstractions), but also somewhat faster.
Another aspect:
C++ templates can be an excellent tool to generate type-specific /
optimized code variations.
For example, C qsort requires a function call to the comparator, whereas std::sort can inline the functor passed. This can make a significant difference when compare and swap themselves are cheap.
Note that you could generate "custom qsorts" optimized for various types with a barrage of defines or a code generator, or by hand - you could do these optimizations in C, too, but at much higher cost.
(It's not a general weapon, templates help only in sepcific scenarios - usually a single algorithm applied to different data types or with differing small pieces of code injected.)
Good answers. I would put it like this:
Make the algorithm as efficient as possible in terms of its formal structure.
C++ will be as fast as C, except that it will tempt you to do dumb things, like constructing objects that you don't have to, so don't take the bait. Things like STL container classes and iterators may look like the latest-and-greatest thing, but they will kill you in a hotspot.
Even so, single-step it at the disassembly level. You should see it very directly working on your problem. If it is spending lots of cycles getting in and out of routines, try some in-lining (or macros). If it is wandering off into memory allocation and freeing, for much of the time, put a stop to that. If it's got inner loops where the loop overhead is a large percentage, try unrolling the loop.
That's how you can make it about as fast as possible.
I would go with C++ definitely. If you are careful about your design and avoid creating heavy objects inside hotspots you should not see any performance difference but the code is going to be much simpler to understand, maintain, and expand.
Use templates and classes judiciously. avoid unnecessary object creation by passing objects by reference. Avoid excessive memory allocation, if needed, allocate memory in advance of hotspots. Use restrict keyword on memory pointers to tell compiler whenever pointers overlap or not.
As far as optimization, pay careful attention to memory alignment. Assuming you are working on Intel processor, you can make use of vector instructions, provided you tell the compiler through pragma's about your memory alignment and aliased pointers. you can also use vector instructions directly via intrinsics.
you can also automatically create hotspot code using templates and let compiler optimize it out if you have things like short loops of different sizes. To find out performance and to drill down to your bottlenecks, Intel vtune or oprofile are extremely helpful.
hope that helps
I do some DSP coding, where it still pays off to go to assembly language sometimes. I'd say use C or C++, either one, and be prepared to go to assembly language when you need to, especially to exploit SIMD instructions.