Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
Bittersweet SoAs
I've recently come to see the delights of using hand-written SIMD intrinsics with the SoA (structure of arrays) representation.
The speed improvement over my former AoS (array of structures) code, at least for straightforward sequential streaming operations, was little short of amazing: a two- to threefold speedup. As a bonus, it simplified the logic by eliminating those tricky horizontal operations and component shuffles, and it reduced memory use.
Yet then there's this bittersweet sense afterwards where I realize what a PITA they are to work with in code, especially interface design.
Mid-Level Interface Design
I'm often dealing with designing mid-level interfaces. They're higher-level than say, std::vector, but lower-level than say, a Monster class in a video game. These are always some of the most awkward interfaces for me to design and keep stable, because they're not low-level enough to provide a simple read/write interface as with a standard C++ container. Yet they're not high-level enough (lacking sufficient logic in the entry points in the interface) to completely hide and abstract away the underlying representation and only provide high-level operations.
An example of what I consider a mid-level design is a programmable particle system API which desires to be as efficient and scalable as possible for certain scenarios while being convenient for casual scenarios (e.g., for scripters). Such a design has to offer particle access, and unless it's going to have a method for every imaginable particle-related algorithm, it'll have to expose some of those raw SoA details somewhere to let clients benefit from them.
The design also shouldn't necessarily be made to require SoA type code to be written all the time. The more daily usage still does not demand utmost efficiency so much as convenience, simplicity, productivity. It's only for those rarer, performance-critical scenarios where the underlying SoA representation comes in handy.
So how do you API/lib designers and large-scale system guys deal with balancing these types of needs?
Balancing Multiple Access Patterns
Since the SoA obliterates away any per-element structure, might it be a decent idea to instantiate structs/classes on the fly as the user accesses the nth element using the more convenient, random-access portions of the interface? Perhaps a structure containing pointers/references to the nth entries of multiple SoA arrays for mutable access?
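To make the "structure of references built on the fly" idea concrete, here is a minimal sketch. The names ParticlesSoA and Ref are hypothetical, and storing the components in std::vector is an assumption made only for illustration (a real SIMD backend would likely use aligned storage).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical SoA particle container; all names are illustrative only.
class ParticlesSoA {
public:
    // Lightweight proxy bundling references to the i-th entry of each array.
    struct Ref {
        float& x;
        float& y;
        float& z;
    };

    explicit ParticlesSoA(std::size_t n) : xs_(n), ys_(n), zs_(n) {}

    std::size_t size() const { return xs_.size(); }

    // Scalar-style random access: the proxy is created when the element is requested.
    Ref operator[](std::size_t i) {
        assert(i < size());
        return Ref{xs_[i], ys_[i], zs_[i]};
    }

    // Raw SoA access for the sequential SIMD fast path.
    float* x_data() { return xs_.data(); }
    float* y_data() { return ys_.data(); }
    float* z_data() { return zs_.data(); }

private:
    std::vector<float> xs_, ys_, zs_;
};
```

Casual code can then write p[i].x = 1.0f, while performance-critical code grabs the raw arrays. The proxy is only valid as long as the underlying arrays are not reallocated.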
Also if the more common usage patterns are more random-access scalar logic rather than sequential-access SIMD vector logic, but the SIMD portions are triggered enough to still make it better to just use one data structure for it all, might this kind of hybrid SoA representation balance all the needs better?
struct AoSoA
{
    ALIGN16 float x[4];
    ALIGN16 float y[4];
    ALIGN16 float z[4];
};
// Assumes n is a multiple of 4; otherwise round up with (n + 3) / 4.
ALIGN16 AoSoA elements[n/4];
I don't understand the nature of cache lines well enough to know if this kind of representation is worthwhile. I have noticed it doesn't help so much for the sequential SIMD cases where we can devote the full resources to one bulky algorithm, but it seems like it might be helpful for cases that need a lot of horizontal logic across components or random-access scalar logic cases where the system might be doing a lot of other things at the same time.
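As a rough illustration of why the hybrid layout can help random access: with plain SoA, the x, y and z of element i sit in three separate arrays, so a scalar access usually touches three different cache lines. With blocks of four, all three components of an element lie inside one 48-byte block. The sketch below shows the index arithmetic; Block4 and AoSoAPoints are hypothetical names, and alignas(16) stands in for the ALIGN16 macro.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Four elements stored component-wise (AoSoA with a block width of 4).
struct alignas(16) Block4 {
    float x[4];
    float y[4];
    float z[4];
};

// Hypothetical container: element i lives in block i / 4, lane i % 4.
struct AoSoAPoints {
    explicit AoSoAPoints(std::size_t n) : count(n), blocks((n + 3) / 4) {}

    float& x(std::size_t i) { assert(i < count); return blocks[i / 4].x[i % 4]; }
    float& y(std::size_t i) { assert(i < count); return blocks[i / 4].y[i % 4]; }
    float& z(std::size_t i) { assert(i < count); return blocks[i / 4].z[i % 4]; }

    std::size_t count;           // number of logical elements
    std::vector<Block4> blocks;  // rounded up to whole blocks
};
```

A SIMD loop can still load blocks[b].x, blocks[b].y and blocks[b].z as three aligned 4-float vectors per block.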
Anyway, I'm generally looking for insight into how to effectively design middle-level data structure interfaces with SoA backend representations as implementation details without transferring the complexity to the client unless they really want it.
I really want to avoid forcing clients to always write SoA-type code in every place that uses the interface unless they really need that efficiency, and I'm curious as to how to balance those more daily, random-access scalar usage scenarios vs. the rarer but not too uncommon scenarios that take advantage of the SoA representation.
I actually don't know enough software engineering to formulate a general strategy for what you want to do, but in particular for the AoS vs SoA problem, I found this paper by Robert Strzodka fascinating: http://asc.ziti.uni-heidelberg.de/sites/default/files/research/papers/public/St11ASX_CUDA.pdf
The goal of this abstraction is to provide an easy way to switch between AoS and SoA and even more complicated nestings. The author uses it to show how performance can change with different access patterns without touching the algorithmic part, and without the pain of recoding all of your accesses.
Although it is focused more on the GPU side, the code provided works on CPUs as well.
I've found a decent fit so far with this kind of "hybrid SoA" or "AoSoA" rep internally.
struct HybridSoA
{
    ALIGN float x[4];
    ALIGN float y[4];
    ALIGN float z[4];
};
It balances those sequential fast paths using SIMD with random-access and slower paths that don't really care about SIMD with a design that preserves reasonable spatial locality for random-access paths.
For the interface, I haven't gotten too fancy yet, just returning pointers to these structures for those fast sequential paths and proxies allowing scalar-style access for operator[] and so forth.
The interface kind of leaks some of its internals for the SIMD path but it seems inevitable as the design can't anticipate all the high-level needs without growing increasingly monolithic, and it's abstracted in a way and with tight ABI concerns that makes it difficult to utilize a richer interface (the actual interface is coded in C with a C++ wrapper on top).
Perhaps it might be better though if I provided a foreach kind of method that accepted a function pointer (or something which ultimately translates into one like std::function, though I can't use it directly due to ABI reasons) that gets called back instead of directly exposing internal handles. It could be fed this SoA data required for SIMD in bulk to mitigate the calling overhead, and that'd alleviate a temporal coupling issue I have where write accesses to the structure require an explicit commit call to record the changes to the application history.
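A minimal sketch of that callback idea, assuming a plain function pointer plus a user-data pointer to stay ABI-friendly. Particles, ChunkFn, for_each_chunk and translate_x are hypothetical names, not part of the actual interface described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Callback type: receives one contiguous chunk of SoA data per call.
using ChunkFn = void (*)(float* x, float* y, float* z,
                         std::size_t count, void* user_data);

// Hypothetical SoA storage.
struct Particles {
    std::vector<float> x, y, z;
};

// Feeds the data to fn in chunks of up to chunk_size elements, so the
// indirect-call overhead is paid once per chunk rather than once per element.
inline void for_each_chunk(Particles& p, std::size_t chunk_size,
                           ChunkFn fn, void* user_data) {
    assert(chunk_size > 0 && fn != nullptr);
    const std::size_t n = p.x.size();
    for (std::size_t first = 0; first < n; first += chunk_size) {
        const std::size_t count = (n - first < chunk_size) ? n - first : chunk_size;
        fn(p.x.data() + first, p.y.data() + first, p.z.data() + first,
           count, user_data);
    }
    // A real implementation could record the changes here (the "commit"),
    // so callers would no longer have to remember to do it themselves.
}

// Example callback: shifts every x coordinate by the float passed in user_data.
inline void translate_x(float* x, float*, float*, std::size_t count, void* user_data) {
    const float dx = *static_cast<const float*>(user_data);
    for (std::size_t i = 0; i < count; ++i) x[i] += dx;
}
```

The inner loop of the callback is where SIMD intrinsics would go; the container never hands out a long-lived internal handle.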
Iterators might be nice if they double as ways to access the data in a proxy-style form (less raw exposure). Though I've kind of fallen out of love with iterators for all but generic containers, especially where the associated algorithms don't fall into a generalized category. I just found it a burden in the past to maintain iterators for everything (a burden that outweighs the benefits of using a range-based for over operator[], e.g.), and have come to favor a kind of bastardized aesthetic between C and C++ (just for these kinds of mid-level data structures which are more complex than a standard container and store disparate types of data, but aren't high-level enough to impose many constraints on the public interface beyond a generic container).
Just for these specific types of data structures, I've found it most productive to favor the interface aesthetic of a plain old C-style array, though that's definitely biased and certainly just a result of my own tendencies. For things like meshes, I just keep finding myself increasingly being drawn to a C-style aesthetic, if only because I kept erring on the side of too many layers of code in the past for these cases to a point where I became confused by my own creations.
Appreciate all the answers and comments so far!
I'm fond of dispatch_data_t. It provides a useful abstraction on top of a range of memory: it provides reference counting, allows consumers to create arbitrary sub-ranges (which participate in the ref counting of the parent range), concatenate sub-ranges, etc. (I won't bother to get into the gory details -- the docs are right over here: Managing Dispatch Data Objects)
I've been trying to find out if there's a C++11 equivalent, but the terms "range", "memory" and "reference counting" are pretty generic, which is making googling for this a bit of a challenge. I suspect that someone who spends more time with the C++ Standard Library than I do might know off the top of their head.
Yes, I'm aware that I can use the dispatch_data_t API from C++ code, and yes, I'm aware that it would not be difficult to crank out a naive first pass implementation of such a thing, but I'm specifically looking for something idiomatic to C++, and with a high degree of polish/reliability. (Boost maybe?)
No.
Range views are being proposed for future standard revisions, but they are non-owning.
dispatch_data_t is highly tied to GCD in that cleanup occurs in a specified queue determined at creation: to duplicate that behaviour, we would need thread pools and queues in std, which we do not have.
As you have noted, an owning overlapping immutable range type into sparse or contiguous memory would not be hard to write up. Fully polished, it would have to support allocators, some kind of raw input buffer system (type erasure on the owning/destruction mechanism?), have utilities for asynchronous iteration by block (with tuned block size), deal with errors and exceptions carefully, and provide some way to efficiently turn views with a reference count of 1 into mutable versions.
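For a sense of what the naive first pass might look like, here is a sketch of an owning, immutable, reference-counted range over contiguous memory built on std::shared_ptr. SharedRange is a hypothetical name. It deliberately leaves out everything listed above as necessary for a polished version (allocators, custom deleters, concatenation, queue-bound cleanup, mutation).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical immutable byte range that shares ownership of its buffer.
class SharedRange {
public:
    using Buffer = std::vector<unsigned char>;

    explicit SharedRange(Buffer bytes)
        : owner_(std::make_shared<Buffer>(std::move(bytes))),
          offset_(0), size_(owner_->size()) {}

    // Sub-ranges keep the parent buffer alive through the shared owner.
    SharedRange subrange(std::size_t offset, std::size_t size) const {
        assert(offset <= size_ && size <= size_ - offset);
        return SharedRange(owner_, offset_ + offset, size);
    }

    const unsigned char* data() const { return owner_->data() + offset_; }
    std::size_t size() const { return size_; }
    long use_count() const { return owner_.use_count(); }

private:
    SharedRange(std::shared_ptr<const Buffer> owner, std::size_t offset, std::size_t size)
        : owner_(std::move(owner)), offset_(offset), size_(size) {}

    std::shared_ptr<const Buffer> owner_;  // declared first: size_ depends on it
    std::size_t offset_;
    std::size_t size_;
};
```

The buffer is freed when the last range referring to it goes away, regardless of whether that is the original range or a sub-range.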
Something that complex would first have to show up in a library like Boost and go through iterative improvements. And as it is quite many-faceted, something with enough of its properties for your purposes may already exist.
If you roll your own I encourage you to submit it for boost consideration.
I'm currently working on a project that requires working with gigabytes of scientific data sets. The data sets are in the form of very large arrays (30,000 elements) of integers and floating point numbers. The problem here is that they are too large to fit into memory, so I need an on-disk solution for storing and working with them. To make this problem even more fun, I am restricted to using a 32-bit architecture (as this is for work) and I need to try to maximize performance for this solution.
So far, I've worked with HDF5, which worked okay, but I found it a little too complicated to work with. So, I thought the next best thing would be to try a NoSQL database, but I couldn't find a good way to store the arrays in the database short of casting them to character arrays and storing them like that, which caused a lot of bad pointer headaches.
So, I'd like to know what you guys recommend. Maybe you have a less painful way of working with HDF5 while at the same time maximizing performance. Or maybe you know of a NoSQL database that works well for storing this type of data. Or maybe I'm going in the totally wrong direction with this and you'd like to smack some sense into me.
Anyway, I'd appreciate any words of wisdom you guys can offer me :)
Smack some sense into yourself and use a production-grade library such as HDF5. So you found it too complicated, but did you find its high-level APIs?
If you don't like that answer, try one of the emerging array databases such as SciDB, rasdaman or MonetDB. I suspect though, that if you have baulked at HDF5 you'll baulk at any of these.
In my view, and experience, it is worth the effort to learn how to properly use a tool such as HDF5 if you are going to be working with large scientific data sets for any length of time. If you pick up a tool such as a NoSQL database, which was not designed for the task at hand, then, while it may initially be easier to use, eventually (before very long would be my guess) it will lack features you need or want and you will find yourself having to program around its deficiencies.
Pick one of the right tools for the job and learn how to use it properly.
Assuming your data sets really are large enough to merit it (e.g., instead of 30,000 elements, a 30,000x30,000 array of doubles), you might want to consider STXXL. It provides interfaces that are intended to imitate (and largely succeed at imitating) those of the collections in the C++ standard library, but are intended to work with data too large to fit in memory.
I have been working on scientific computing for years, and I think HDF5 or NetCDF is a good data format for you to work with. It can provide efficient parallel read/write, which is important for dealing with big data.
An alternate solution is to use an array database, like SciDB, MonetDB, or RasDaMan. However, it will be kind of painful if you try to load HDF5 data into an array database. I once tried to load HDF5 data into SciDB, but it requires a series of data transformations. You need to know whether you will query the data often or not. If not, the time-consuming loading may not be worthwhile.
You may be interested in this paper.
It can allow you to query the HDF5 data directly by using SQL.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
I need to map primitive keys (int, maybe long) to struct values in a high-performance hash map data structure.
My program will have a few hundred of these maps, and each map will generally have at most a few thousand entries. However, the maps will be "refreshing" or "churning" constantly; imagine processing millions of add and delete messages a second.
What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!
I would recommend that you try Google SparseHash (or the C++11 version, Google SparseHash-c11) and see if it suits your needs. They have a memory-efficient implementation as well as one optimized for speed.
I did a benchmark a long time ago, it was the best hashtable implementation available in terms of speed (however with drawbacks).
What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!
Check out the LGPL'd Judy arrays. Never used myself, but was advertised to me on few occasions.
You can also try to benchmark the standard library containers (std::unordered_map, or the older non-standard hash_map extensions, etc.). Depending on platform/implementation and source code tuning (preallocate as much as you can; dynamic memory management is expensive), they could be performant enough.
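As a starting point for such a benchmark, the sketch below preallocates the bucket array of a std::unordered_map with reserve so that the table does not rehash while entries churn. Order, OrderMap and the helper names are hypothetical. Note that each entry is still a separately allocated node, so this removes rehashing cost, not per-node allocation cost.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical value type stored under a primitive key.
struct Order {
    double price;
    int quantity;
};

using OrderMap = std::unordered_map<int, Order>;

// Creates a map whose bucket array is already large enough for
// expected_entries elements, so inserting that many never triggers a rehash.
inline OrderMap make_order_map(std::size_t expected_entries) {
    OrderMap m;
    m.reserve(expected_entries);
    assert(static_cast<double>(m.bucket_count()) * m.max_load_factor()
           >= static_cast<double>(expected_entries));
    return m;
}

// Applies one "add" message; returns true if the key was new.
inline bool add_order(OrderMap& m, int key, const Order& value) {
    return m.emplace(key, value).second;
}

// Applies one "delete" message; returns true if the key existed.
inline bool remove_order(OrderMap& m, int key) {
    return m.erase(key) == 1;
}
```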
Also, if performance of the final solution trumps the cost of the solution, you can try to order the system with sufficient RAM to put everything into plain arrays. Performance of access by index is unbeatable.
The add/delete operations are much (100x) more frequent than the get operation.
That hints that you might want to concentrate on improving algorithms first. If data are only written, not read, then why write them at all?
Just use boost::unordered_map (or tr1 etc) by default. Then profile your code and see if that code is the bottleneck. Only then would I suggest to precisely analyze your requirements to find a faster substitute.
If you have a multithreaded program, you can find some useful hash tables in the Intel Threading Building Blocks library. For example, tbb::concurrent_unordered_map has the same API as std::unordered_map, but its main functions are thread-safe.
Also have a look at facebook's folly library, it has high performance concurrent hash table and skip list.
khash is very efficient. The author's detailed benchmark is at https://attractivechaos.wordpress.com/2008/10/07/another-look-at-my-old-benchmark/ and it shows khash beating many other hash libraries.
From the Android sources (thus Apache 2 licensed):
https://github.com/CyanogenMod/android_system_core/tree/ics/libcutils
Look at hashmap.c and pick up include/cutils/hashmap.h. If you don't need thread safety, you can remove the mutex code. A sample implementation is in libcutils/str_parms.c.
First check if existing solutions like libmemcache fit your needs.
If not ...
Hash maps seem to be the definite answer to your requirement. They provide O(1) average-case lookup based on the keys. Most STL implementations provide some sort of hash map these days, so use the one provided by your platform.
Once that part is done, you have to test the solution to see if the default hashing algorithm is good enough performance wise for your needs.
If it is not, you should explore some of the good, fast hashing algorithms found on the net:
good old prime number multiply algo
http://www.azillionmonkeys.com/qed/hash.html
http://burtleburtle.net/bob/
http://code.google.com/p/google-sparsehash/
If this is not good enough, you could roll a hashing module by yourself, that fixes the problem that you saw with the STL containers you have tested, and one of the hashing algorithms above. Be sure to post the results somewhere.
Oh, and it's interesting that you have multiple maps. Perhaps you can simplify by making your key a 64-bit number, with the high bits used to distinguish which map it belongs to, and adding all key-value pairs to one giant hash. I have seen hashes with a hundred thousand or so symbols working perfectly well on the basic prime number hashing algorithm.
You can check how that solution performs compared to hundreds of maps. I think it could be better from a memory profiling point of view. Please do post the results somewhere if you get to do this exercise.
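The combined-key idea could be sketched as follows, assuming that both the map identifier and the per-map key fit in 32 bits (the question mentions int keys, "maybe long", so this is an assumption). The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Packs a map id (high 32 bits) and a per-map key (low 32 bits) into one key.
inline std::uint64_t make_key(std::uint32_t map_id, std::uint32_t key) {
    return (static_cast<std::uint64_t>(map_id) << 32) | key;
}

// Recovers the map id from a combined key.
inline std::uint32_t map_id_of(std::uint64_t combined) {
    return static_cast<std::uint32_t>(combined >> 32);
}

// Recovers the per-map key from a combined key.
inline std::uint32_t key_of(std::uint64_t combined) {
    return static_cast<std::uint32_t>(combined & 0xFFFFFFFFu);
}

// Debug-build sanity check: packing and unpacking must round-trip.
inline void check_round_trip(std::uint32_t map_id, std::uint32_t key) {
    const std::uint64_t combined = make_key(map_id, key);
    assert(map_id_of(combined) == map_id && key_of(combined) == key);
}
```

One trade-off to measure: a single giant table can no longer clear or iterate one logical map cheaply, which may matter if whole maps are reset often.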
I believe that more than the hashing algorithm it could be the constant add/delete of memory (can it be avoided?) and the cpu cache usage profile that might be more crucial for the performance of your application
good luck
Try hash tables from Miscellaneous Container Templates. Its closed_hash_map is about the same speed as Google's dense_hash_map, but is easier to use (no restriction on contained values) and has some other perks as well.
I would suggest uthash. Just #include "uthash.h", then add a UT_hash_handle to the structure and choose one or more fields in your structure to act as the key. A word about performance here.
According to http://incise.org/hash-table-benchmarks.html, gcc has a very, very good implementation. However, keep in mind that it must respect a very bad decision in the standard:
If a rehash happens, all iterators are invalidated, but references and
pointers to individual elements remain valid. If no actual rehash
happens, no changes.
http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/
This means basically the standard says that the implementation MUST BE based on linked lists.
It prevents open addressing which has better performance.
I think google sparse is using open addressing, though in these benchmarks only the dense version outperforms the competition.
However, the sparse version outperforms all competition in memory usage. (also it doesn't have any plateau, pure straight line wrt number of elements)
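The guarantee quoted above from the standard can be observed directly. The sketch below takes a pointer to an element, forces a rehash, and checks that the pointer still refers to the same element; pointer_survives_rehash is a hypothetical helper name. An open-addressing table that stores elements inline in its bucket array could not offer this, because growing the table moves the elements.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Returns true if a pointer taken before a forced rehash still addresses
// the same element afterwards (which the standard requires).
inline bool pointer_survives_rehash() {
    std::unordered_map<int, std::string> m;
    m.emplace(1, "one");
    const std::string* before = &m.at(1);

    m.rehash(m.bucket_count() * 8 + 1);  // force a larger bucket array
    assert(m.size() == 1);

    return before == &m.at(1) && *before == "one";
}
```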
It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 12 years ago.
My question is more about the STL than about the rest of C++, which (I guess) can be about as fast as C as long as classes aren't used at every corner.
The STL is standard for games and in engines like OGRE3D. Its features are nice to use, but since I don't really know how they work, I'd like to know which features can become serious performance hogs before I use them.
I'm very excited to begin that game programming school, and apparently there is no way I am going to not use those advanced features.
Using the STL tends to generate code that is as efficient as, if not more efficient than, hand-written code in many cases.
Use a profiler to see where you have problems.
Even where the C++ STL might perform worse, its code is likely to be less error-prone. So only write your own replacement if the profiler shows there is an issue.
1) Debug builds. They severely slow down many STL containers due to extensive error checking, at least on Microsoft compilers.
2) Excessive dynamic memory allocation. If you have a routine that contains a std::vector and you call it a few thousand times per frame, it will be very slow, and the bottleneck will be somewhere within operator new or another memory allocation routine. If you turn this vector into some kind of static buffer (so you don't have to recreate it every time), it will be much faster. Memory allocation is slow. If you have a buffer, it is normally better to reuse it instead of making a new one during each call.
3) Using inefficient algorithms. For example, using linear search instead of binary search, or using the wrong sort algorithm (quick sort and heap sort are faster than bubble sort on unsorted data, but insertion sort can be faster than quicksort on partially sorted data). Searching instead of using std::map, and so on.
4) Using the wrong kind of container. For example, std::vector isn't suitable for inserting elements at random places. std::deque, while comparable to std::vector (random access) and allowing fast push_front and push_back, can be 10 times slower than std::vector if you subtract two iterators (on MSVC, again).
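Points 2 and 3 can be illustrated with a short sketch. The function names are hypothetical. collect_positive writes into a caller-owned scratch vector whose capacity survives clear(), so repeated calls stop allocating once the buffer has grown large enough. contains_sorted uses std::binary_search, which needs sorted input and does O(log n) comparisons instead of the O(n) of a linear scan.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Point 2: collect the indices of positive values into a reusable buffer.
// clear() resets the size but keeps the capacity, so no allocation happens
// on later calls unless more room is needed than ever before.
inline void collect_positive(const std::vector<float>& values,
                             std::vector<std::size_t>& out) {
    out.clear();
    for (std::size_t i = 0; i < values.size(); ++i)
        if (values[i] > 0.0f) out.push_back(i);
    assert(out.size() <= values.size());
}

// Point 3: binary search on a sorted vector instead of a linear scan.
inline bool contains_sorted(const std::vector<int>& sorted, int key) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    return std::binary_search(sorted.begin(), sorted.end(), key);
}
```

Whether either change matters in a given program is still a question for the profiler.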
Unless you're building the entire engine from scratch, you're not going to notice a difference in using or not using C++ classes or STL. In general, most of the time the CPU is going to be running code that's not even written by you anyway. Plus the overhead imposed by any scripting languages you implement (Lua, Python, etc) will eclipse any small performance penalty you may incur by using C++/STL.
In short, don't worry about it. Writing good OOP code is better than trying to write super-fast code from the get-go. "Premature optimization is the root of much programming evil."
Actually, you can use classes everywhere and still get performance as good as C (and often better than typical C).
Most STL is designed to do most of its "tricky" parts at compile time, so the run-time performance is excellent. The main thing to look out for (especially if you write for things like game consoles or mobile phones, that have less capable graphics hardware) is structuring your data to work well with a cache.
Here are, in my opinion, the key points to writing performant C++/STL code:
Learn what memory allocation strategies are for each STL container,
Learn what algorithms work best with what iterator categories,
Learn run-time polymorphism vs compile-time polymorphism.
Good starting points are:
SGI STL Programmer's Guide,
STL Reference,
The Definitive C++ Book Guide and List (SO).
I recommend Effective STL by Scott Meyers. The biggest performance hog of the STL is the lack of understanding of it. Please learn it well!
Also see Optimizing software in C++ by Agner Fog for C++ specific performance related topics.
The other answers are all accurate: the problems with STL and game programming are mostly with misuse.
My general approach is the following:
1. Write it with STL and carefully choose the appropriate algorithm, container, etc.
2. Profile for bottlenecks.
3. If it's STL causing the problem, replace it.
Optimizing too early can really slow you down and only cause more problems later.
Of course, it depends on platform as well. Sometimes, you have to write all of your own stuff because you simply can't afford the extra CPU/RAM overhead of STL.
Stroustrup talks about the design and performance of the STL in general, and specifically about the different performance characteristics of the various different container types, in his book The C++ Programming Language (3rd edition).
I don't have experience in gaming, but Electronic Arts developed their own (non-conforming) implementation of the STL. There is an extensive article explaining the motives and design of the library here.
Note that in many cases, you will be better off using the STL that comes with your implementation, then measuring, then measuring again, and making sure that you understand what is going on and what really is a performance problem. Then, and only then, and if the problem is within the STL (and not in how the STL is used), would I use non-standard libraries.
These rarely become performance hogs if used correctly. A profiler should always be your primary means of finding bottlenecks in your code short of obvious algorithmic inefficiencies (in which case it's still good practice to use a profiler to make sure if you are on a tight deadline).
There are some legitimate efficiency concerns, however, if STL usage does show up as a profiler hotspot.
vector<ExpensiveElement> v;
// insert a lot of elements to v
v.push_back(ExpensiveElement(...) );
This push_back immediately above has the worst case scenario of having to linearly copy all the ExpensiveElements inserted so far (if we've exceeded the current capacity). In the best case scenario, we still have to copy ExpensiveElement one time unnecessarily.
We can mitigate the issue by making vector store shared_ptr, but now we pay for two additional heap allocations per ExpensiveElement inserted (one for the reference counter in boost::shared_ptr, and one for ExpensiveElement) along with the overhead of a pointer indirection each time we want to access ExpensiveElement stored in the vector.
To mitigate the memory allocation/deallocation overhead (generally more likely to be a hotspot than an additional level of indirection), we can implement a fast memory allocator for ExpensiveElement. Nevertheless, imagine if std::vector provided an alloc_back method:
new (v.alloc_back()) ExpensiveElement(...);
This would avoid any copy ctor overhead, but is unsafe and prone to abuse. Nevertheless, that's exactly what I did with our vector-clone in response to hotspots. Note that I work in raytracing which is a field where performance is often one of the highest measures of quality (other than correctness) and we profile our code daily so it's not like we just decided out of the blue that vector wasn't efficient enough for our purposes.
We also had no choice but to implement a vector clone because we provide a software development kit where other people's std::vector implementations may be incompatible with our own. I don't want to give you the wrong idea: explore these kinds of solutions only if your profiler sessions really call for it!
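If C++11 is available (an assumption; the answer above is written against pre-C++11 std::vector and boost::shared_ptr), much of the copy overhead described above can be avoided with the standard container itself. reserve prevents regrowth, and emplace_back constructs the element in place, which is roughly the safe counterpart of the alloc_back idea. The ExpensiveElement below is a hypothetical stand-in that counts copy constructions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an expensive type; counts copy constructions.
struct ExpensiveElement {
    static int copies;
    std::string payload;

    explicit ExpensiveElement(std::string p) : payload(std::move(p)) {}
    ExpensiveElement(const ExpensiveElement& other) : payload(other.payload) { ++copies; }
    ExpensiveElement(ExpensiveElement&& other) noexcept : payload(std::move(other.payload)) {}
};
int ExpensiveElement::copies = 0;

// Builds n elements without copying any of them.
inline std::vector<ExpensiveElement> build(std::size_t n) {
    std::vector<ExpensiveElement> v;
    v.reserve(n);                    // one allocation, no regrowth up to n elements
    for (std::size_t i = 0; i < n; ++i)
        v.emplace_back("element");   // constructed directly inside the vector
    assert(v.size() == n);
    return v;
}
```

Even without reserve, a noexcept move constructor lets the vector move elements instead of copying them when it grows.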
Another common source of inefficiency is when using linked STL containers like set, multiset, map, multimap, and list. However, that's not necessarily their fault, but rather the fault of the default std::allocator being used. These perform a separate memory allocation/deallocation per node so the default allocator can be pretty slow for these purposes, especially across multiple threads (thread contention, better off with per-thread memory pools). You can really get a speed boost by writing your own memory allocator (though this is not a trivial thing to do and don't forget alignment if you do).
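As a very small sketch of the custom-allocator idea, the code below gives a std::set its nodes from a fixed arena by bumping a pointer. Arena, ArenaAllocator and IntSet are hypothetical names. This is not a general-purpose allocator: deallocate does nothing, so memory is only reclaimed when the arena is destroyed, which suits build-then-discard usage rather than long-lived churn. Using one arena per thread would avoid the contention mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <new>
#include <set>

// Fixed-size bump arena; everything is released at once in the destructor.
class Arena {
public:
    explicit Arena(std::size_t bytes)
        : buffer_(new unsigned char[bytes]), size_(bytes), used_(0) {}

    void* allocate(std::size_t bytes, std::size_t align) {
        assert(align <= alignof(std::max_align_t));
        const std::size_t start = (used_ + align - 1) / align * align;  // honor alignment
        if (start + bytes > size_) throw std::bad_alloc();
        used_ = start + bytes;
        return buffer_.get() + start;
    }

    std::size_t used() const { return used_; }

private:
    std::unique_ptr<unsigned char[]> buffer_;
    std::size_t size_;
    std::size_t used_;
};

// Minimal C++11 allocator that forwards to an Arena.
template <class T>
struct ArenaAllocator {
    using value_type = T;

    explicit ArenaAllocator(Arena& a) noexcept : arena(&a) {}
    template <class U>
    ArenaAllocator(const ArenaAllocator<U>& other) noexcept : arena(other.arena) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena->allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) noexcept {}  // freed with the arena

    Arena* arena;
};

template <class T, class U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return a.arena == b.arena;
}
template <class T, class U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return !(a == b);
}

using IntSet = std::set<int, std::less<int>, ArenaAllocator<int>>;
```

The arena must outlive every container that allocates from it.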
I can't emphasize enough that these kinds of optimizations should only be applied in response to the profiler. You'll make your code harder to use and maintain this way, so you should be doing it only in exchange for a solid, demonstrable boost in your application's performance.
This book covers what issues you face when using C++ in games.