I need to decide whether to use STM in a Clojure system I am involved with, which needs several GB of data stored in a single STM ref.
I would like to hear from anyone who has experience or advice on using Clojure STM with large indexed datasets.
I've been using Clojure for some fairly large-scale data processing tasks (definitely gigabytes of data, typically lots of largish Java arrays stored inside various Clojure constructs/STM refs).
As long as everything fits in available memory, you shouldn't have a problem with extremely large amounts of data in a single ref. The ref itself applies only a small fixed amount of STM overhead that is independent of the size of whatever is contained within it.
A nice extra bonus comes from the structural sharing that is built into Clojure's standard data structures (maps, vectors etc.) - you can take a complete copy of a 10GB data structure, change one element anywhere in the structure, and be guaranteed that both data structures will together only require a fraction more than 10GB. This is very helpful, particularly if you consider that due to STM/concurrency you will potentially have several different versions of the data being created at any one time.
The performance isn't going to be any worse or any better than STM involving a single ref with a small dataset. Performance is more hindered by the number of updates to a dataset than the actual size of the dataset.
If you have one writer to the dataset and many readers, then performance will still be quite good. However, if you have one reader and many writers, performance will suffer.
Perhaps more information would help us help you more.
Related
I'm familiar with column- vs. row-stores in terms of how a database internally persists data to disk. My question is whether, for a dataset that is entirely in memory with no storage to disk, the row- vs. column- orientation makes much of a difference.
The things I can think of that may make a difference would be:
For fields under 8 bytes, a column layout would involve fewer memory accesses than a row layout.
Compression would also be easier on a column store, regardless of whether it is in memory or not (though this seems like a non-issue if nothing is saved back to storage, I suppose? Does compression ever matter for in-memory operations?)
Possible to vectorize operations.
Much, much easier to work with a struct on a row-by-row basis of course.
Are both of those accurate, and are there any more? Given this, would there be a substantial performance improvement from using an in-memory column store vs. a row store on a read-only dataset, or just a marginal improvement?
I'm familiar with column- vs. row-stores in terms of how a database internally persists data to disk. My question is whether, for a dataset that is entirely in memory with no storage to disk, the row- vs. column- orientation makes much of a difference.
A lot depends on the size of the dataset, what the contents of each row are, how you need to search in it, whether you want to add items to or remove items from the dataset, and so on.
There is also the CPU and memory architecture to consider: how big are your caches, what is the size of a cache line, and how intelligent is your CPU's prefetcher?
For fields under 8 bytes, a column layout would involve fewer memory accesses than a row layout.
Memory is not accessed a register at a time, but rather a cache line at a time. On most contemporary machines, cache lines are 64 bytes.
Compression would also be easier on a column store, regardless of whether it is in memory or not
Not really. You can compress/decompress a column even if it is not stored consecutively in memory, though it might be faster if it is.
Does compression ever matter for in-memory operations?
That depends. If it's in-memory, then it's likely that compression will reduce performance, but on the other hand, the amount of data that you need to store is smaller, so you will be able to fit more into memory.
Possible to vectorize operations.
Vectorization works with either layout; it's only loading/storing to memory that might be slower if the data is grouped by rows.
Much, much easier to work with a struct on a row-by-row basis of course.
It's easy to use a pointer to a struct with a row-by-row store, but with C++ you can make classes that hide the fact that data is stored column-by-column. That's a bit more work up front, but might make it as easy as row-by-row once you have set that up.
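As a rough sketch of that idea (all the names here are invented for illustration), a class can store each field in its own contiguous array while still handing out row-style accessors:

#include <cstddef>
#include <vector>

// Hypothetical column-oriented container that still offers row-style access.
struct PersonColumns {
    std::vector<double> ages;
    std::vector<double> incomes;
    std::vector<std::size_t> locations;

    // Lightweight "row" proxy: it only stores an index into the columns,
    // so callers can write row-by-row code against column-by-column storage.
    struct Row {
        PersonColumns* cols;
        std::size_t i;
        double& age() const { return cols->ages[i]; }
        double& income() const { return cols->incomes[i]; }
        std::size_t& location() const { return cols->locations[i]; }
    };

    Row row(std::size_t i) { return Row{this, i}; }
};

int main() {
    PersonColumns people;
    people.ages = {30.0, 42.5};
    people.incomes = {50000.0, 61000.0};
    people.locations = {1, 2};

    PersonColumns::Row r = people.row(1);  // feels like a struct...
    r.income() = 65000.0;                  // ...but the write goes to the income column
}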
Also, column-by-column store is often used in the entity-component-system pattern, and there are libraries such as EnTT that make it quite easy to work with.
Are both of those accurate, and are there any more? Given this, would there be a substantial performance improvement from using an in-memory column store vs. a row store on a read-only dataset, or just a marginal improvement?
Again, it heavily depends on the size of the dataset and how you want to access it. If you frequently use all columns in a row, then row-by-row store is preferred. If you frequently just use one column, and need to access that column of many consecutive rows, then a column-by-column store is best.
Also, there are hybrid solutions possible. You could have one column on its own, and then all the other columns stored in row-by-row fashion.
How you will search in a read-only dataset matters a lot. Is it going to be sorted, or is it more like a hash map? In the former case, you want the index to be as compact as possible, and possibly ordered like a B-tree as Alex Guteniev already mentioned. If it's going to be like a hash map, then you probably want row-by-row.
For in-memory arrays, this is called AoS vs SoA (array of structs vs struct of arrays).
I think the main advantage of SoA for a read-only database is that searches only need to access a smaller memory range. This is more cache-friendly and less prone to page faults.
The amount of improvement depends on how you use the database. There may be a more significant improvement from using a more targeted structure (sorted array, B-tree).
According to this Quora forum,
One of the simplest rules of thumb is to remember that hardware loves arrays, and is highly optimized for iteration over arrays. A simple optimization for many problems is just to stop using fancy data structures and just use plain arrays (or std::vectors in C++). This can take some getting used to.
Are C++ classes one of those "fancy data structures," i.e. a kind of data type that can be replaced by arrays to achieve a higher performance in a C++ program?
If your class looks like this:
// "Array of structs": each person's fields are stored together.
struct Person {
    double age;
    double income;
    size_t location;
};
then you might benefit from rearranging to
// "Struct of arrays": each field is stored in its own contiguous vector.
std::vector<double> ages;
std::vector<double> incomes;
std::vector<size_t> locations;
But it depends on your access patterns. If you frequently access multiple fields of a person at a time, then having the fields stored together makes sense.
If your class looks like this:
struct Population {
std::vector<double> many_ages;
std::vector<double> many_incomes;
std::vector<size_t> many_locations;
};
Then you're using the form your resource recommended. Using any one of these arrays individually is faster than using the first class, but using elements from all three arrays simultaneously is probably slower with the second class.
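For instance, here is a minimal sketch of that access-pattern difference (mirroring the Person layout above): summing one field over the struct-of-arrays form streams through a single dense array, while the same sum over the array-of-structs form drags every other field through the cache as well.

#include <cstddef>
#include <vector>

struct Person { double age; double income; std::size_t location; };

// Array of structs: the loop touches whole Person objects, so income and
// location are pulled into cache even though only age is needed.
double sum_ages_aos(const std::vector<Person>& people) {
    double total = 0.0;
    for (const Person& p : people) total += p.age;
    return total;
}

// Struct of arrays: the loop reads one contiguous array of doubles.
double sum_ages_soa(const std::vector<double>& ages) {
    double total = 0.0;
    for (double a : ages) total += a;
    return total;
}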
Ultimately, you should structure your code to be as clean and intuitive as possible. The biggest source of speed will be a strong understanding and appropriate use of algorithms, not the memory layout. I recommend disregarding this unless you already have strong HPC skills and need to squeeze maximum performance from your machine. In almost every other case your development time and sanity are worth far more than saving a few clock cycles.
More broadly
An interesting paper related to this is SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems. A lot of work has gone into mapping ML algorithms to GPUs and, for ML applications, getting the memory layout right does make a real difference since so much time is spent on training and GPUs are optimized specifically for contiguous-array processing. But, the authors of the paper contend that even here if you understand algorithms well you can beat specialized hardware with optimized memory layouts, and they demonstrate this by getting their CPU to train 3.5x faster than their GPU.
More broadly, your question deals with the idea of cache misses. Since a cache miss is 200x more expensive than an L1 reference (link), if your data layout is optimized to your computation, then you can really save time. However, as the above suggests, it is rarely the case that simply rearranging your data magically makes everything faster. Consider matrix multiplication. It's the perfect example because the data is laid out in a single array, as requested by your resource. However, for a simple triple-loop matmult GEMM implementation there are still 6 ways to arrange your loops. Some of these ways are much more efficient than others, but none of them give you anywhere near peak performance. Read through this step-by-step explanation of matmult to get a better sense of all the algorithmic optimizations necessary to get good performance.
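As a sketch of the loop-ordering point: both functions below multiply square row-major matrices stored in plain arrays, but the i-k-j version walks B and C sequentially in its inner loop and is typically far faster than the naive i-j-k version, while both remain well short of a tuned GEMM.

#include <vector>

// Naive i-j-k ordering: the inner loop strides down a column of b,
// jumping n doubles at a time (poor spatial locality).
void matmul_ijk(const std::vector<double>& a, const std::vector<double>& b,
                std::vector<double>& c, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

// i-k-j ordering: the inner loop reads b and updates c sequentially
// (c must be zero-initialized before the call).
void matmul_ikj(const std::vector<double>& a, const std::vector<double>& b,
                std::vector<double>& c, int n) {
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            double aik = a[i * n + k];
            for (int j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}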
What the above should demonstrate is that even for situations in which we have only a few arrays laid out exactly as your resource suggests, the layout alone doesn't give us the speed. Good algorithms do. Data layout considerations, if any, flow from the algorithms we choose and higher-level hardware constraints.
If this is so for simple arrays and operations like matrix multiplication, by extension you should also expect it to be so for "fancy data structures" as well.
Are C++ classes one of those "fancy data structures,
A C++ class is a construct which can be used to create a data type. It can be used to create data structures such as lists, queues etc.
i.e. a kind of data type
A class is a data type
that can be replaced by arrays
A class and array are not interchangeable. Arrays are data structures. You are comparing apples with oranges.
to achieve a higher performance in a C++ program?
That depends on how you implement your class.
Are C++ classes one of those "fancy data structures,"
I think they are referring in particular to containers like std::map, std::deque, std::list, etc., that hold data in many different heap allocations, so that iterating over the contents of the container requires the CPU to "hop around" the RAM address space to some extent, rather than just reading RAM sequentially. It's that hopping around that often limits performance, as the CPU's on-board memory caches are less effective at avoiding execution stalls due to RAM latency when future RAM access locations aren't easily predictable.
A C++ class on its own may or may not encourage non-sequential RAM access; whether it does or not depends entirely on how the class was implemented (and in particular on whether it is holding its data via multiple heap allocations). The std::vector class (mentioned in the forum text) is an example of a C++ class that does not require any non-sequential memory accesses as you iterate across its contents.
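As a small illustration, the two functions below are identical except for the container type: the std::list version chases a separate heap allocation per element, while the std::vector version streams through one contiguous block that the prefetcher can anticipate.

#include <list>
#include <numeric>
#include <vector>

// Node-based container: every element lives in its own heap allocation,
// so iteration hops around the address space.
long long sum_list(const std::list<int>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0LL);
}

// Contiguous container: elements sit next to each other in memory,
// so iteration is sequential and cache friendly.
long long sum_vector(const std::vector<int>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0LL);
}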
Are C++ classes one of those "fancy data structures," i.e. a kind of data type that can be replaced by arrays to achieve a higher performance in a C++ program?
Both computer time and your development time are valuable.
Don't optimize code unless you are sure it is taking most of the CPU time.
So first use a profiler (e.g. gprof) and read the documentation of your C or C++ compiler (e.g. GCC). Compilers are capable of fancy optimizations.
If you really care about HPC, learn about GPGPU programming with e.g. OpenCL or OpenACC.
If you happen to use Linux (a common OS in the HPC world), read Advanced Linux Programming, then syscalls(2), then time(7).
MonetDB is a very efficient column-oriented database. I learned that it uses lightweight compression algorithms to speed things up. Can someone tell me more about the implementation of these compression/decompression algorithms in MonetDB?
There is currently no compression on primitive values such as integers and floating point numbers. Thus, choosing the appropriate type for your data will make a difference once your tables get large.
The string storage uses pointers into a string heap. Hence, for categorical string columns that contain only a few distinct values, storage will generally be efficient. More advanced compression methods are in the works, but I do not expect them to be available in the next six months.
Finally, we had great experiences running MonetDB on a force-compressed file system (e.g. BTRFS). This greatly reduces the storage footprint of databases and also reduces the IO time, especially on spinning hard disks.
I am working on a mathematical problem that has the advantage of being able to "pre-compute" about half of the problem, save this information to file, and then reuse it many times to compute various 'instances' of my problem. The difficulty is that loading all of this information back in order to solve the actual problem is a major bottleneck.
More specifically:
I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map<int,int>, and much more - and save all this stuff to disk (several GB).
The second half of my program accepts an input argument D. For each D, I need to perform a great many computations that involve a combination of the pre-computed data (from file), and some other data that are specific to D (so that the problem is different for each D).
Sometimes I will need to pick out certain pieces of pre-computed information from the files. Other times, I will need to load every piece of data from a (large) file.
Are there any strategies for making the IO faster?
I already have the program parallelized (MPI, via boost::mpi) for other reasons, but regardless, accessing files on the disk is making my compute time unbearable.
Any strategies or optimizations?
Currently I am doing everything with cstdio, i.e. no iostream. Will that make a big difference?
Certainly the fastest (but the most fragile) solution would be to mmap the data to a fixed address. Slap it all in one big struct, and instantiate the std::map with an allocator which will allocate in a block attached to the end of the struct. It's not simple, but it will be fast; one call to mmap, and the data is in your (virtual) memory. And because you're forcing the address in mmap, you can even store the pointers, etc.
As mentioned above, in addition to requiring a fair amount of work, it's fragile. Recompile your application, and the targeted address might not be available, or the layout might be different, or whatever. But since it's really just an optimization, this might not be an issue; anytime a compatibility issue arises, just drop the old file and start over. It will make the first run after a change which breaks compatibility extremely slow, but if you don't break compatibility too often...
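A bare-bones sketch of the mmap approach on a POSIX system, with error handling kept minimal and the custom map allocator omitted; the fixed base address, the file name, and the Precomputed layout are arbitrary choices for illustration:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// POD-only block of precomputed data. Because the file is mapped back at
// the same fixed address it was built for, stored pointers remain valid.
struct Precomputed {
    long double probabilities[1000];
    // ... further POD fields, plus an arena for the map's block allocator ...
};

int main() {
    void* const kBase = reinterpret_cast<void*>(0x600000000000ULL);  // arbitrary fixed address

    int fd = open("precomputed.bin", O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    // One system call and the whole dataset is addressable; pages are
    // faulted in lazily as they are touched.
    void* p = mmap(kBase, st.st_size, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
    if (p == MAP_FAILED) return 1;

    const Precomputed* data = static_cast<const Precomputed*>(p);
    long double first = data->probabilities[0];  // no parsing, no copying
    (void)first;

    munmap(p, st.st_size);
    close(fd);
    return 0;
}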
The stuff that isn't in a map is easy. You put everything in one contiguous chunk of memory that you know (like a big array, or a struct/class with no pointers), and then use write() to write it out. Later use read() to read it in, in a single operation. If the size might vary, then use one operation to read a single int with the size, allocate the memory, and then use a single read() to pull it in.
The map part is a bit harder, since you can't do it all in one operation. Here you need to come up with a convention for serializing it. To make the I/O as fast as possible, your best bet is to convert the map to an in-memory form that is all in one place and that you can convert back to the map easily and quickly. If, for example, your keys are ints and your values are of constant size, then you could make an array of keys and an array of values, copy your keys into the one array and values into the other, and then write() the two arrays, possibly writing out their size as well. Again, you read things in with only two or three calls to read().
Note that nothing ever got translated to ASCII, and there are a minimum number of system calls. The file will not be human readable, but it will be compact, and fast to read in. Three things make i/o slow: 1) system calls, if you use small reads/writes; 2) translation to/from ASCII (printf, scanf); 3) disk speed. Hard to do much about 3) (other than an SSD). You can do the read in a background thread, but you might need to block waiting for the data to be in.
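A minimal sketch of that convention for a std::map<int,int>, using raw POSIX read()/write() and a [count][keys][values] layout (the helper names are invented for this example and error handling is omitted):

#include <cstdint>
#include <map>
#include <vector>
#include <unistd.h>

// Serialize: copy keys and values into two flat arrays, then write the
// count and both arrays with three write() calls.
void save_map(int fd, const std::map<int, int>& m) {
    std::vector<int> keys, values;
    keys.reserve(m.size());
    values.reserve(m.size());
    for (const auto& kv : m) {
        keys.push_back(kv.first);
        values.push_back(kv.second);
    }
    std::uint64_t count = m.size();
    write(fd, &count, sizeof count);
    write(fd, keys.data(), keys.size() * sizeof(int));
    write(fd, values.data(), values.size() * sizeof(int));
}

// Deserialize: one read() for the count, one per array, then rebuild the map.
std::map<int, int> load_map(int fd) {
    std::uint64_t count = 0;
    read(fd, &count, sizeof count);
    std::vector<int> keys(count), values(count);
    read(fd, keys.data(), count * sizeof(int));
    read(fd, values.data(), count * sizeof(int));
    std::map<int, int> m;
    for (std::uint64_t i = 0; i < count; ++i)
        m.emplace(keys[i], values[i]);
    return m;
}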
Some guidelines:
multiple calls to read() are more expensive than a single call
binary files are faster than text files
single file is faster than multiple files for large values of "multiple"
use memory-mapped files if you can
use a 64-bit OS to let the OS manage the memory for you
Ideally, I'd try to put all the long doubles into a memory-mapped file, and all the maps into binary files.
Divide and conquer: if 64 bits is not an option, try to break your data into large chunks such that the chunks are never all used together, and each chunk is needed in its entirety when it is needed. This way you can load chunks when they are needed and discard them when they are not.
These suggestions to load all the data into RAM are good when two conditions are met:
The sum of all I/O time during the application's run is much greater than the cost of loading all the data into RAM
A relatively large portion of the data is accessed during the application's run
(They are usually met when an application runs for a long time, processing different data.)
However, for other cases other options might be considered.
For example, it is essential to understand whether the access pattern is truly random. If it is not, look into reordering the data so that items which are accessed together are close to each other. This ensures that OS caching performs at its best and also reduces HDD seek times (not an issue for SSDs, of course).
If the accesses are truly random, and the application does not run long enough to amortize the one-time data-loading cost, I would look at the architecture, e.g. by extracting this data manager into a separate module that keeps the data preloaded.
On Windows this might be a system service; on other OSes other options are available.
Cache, cache, cache. If it's only several GB it should be feasible to cache most if not all of your data in something like memcached. This is an especially good solution if you're using MPI across multiple machines rather than just multiple processors on the same machine.
If it's all running on the same machine, consider a shared memory cache if you have the memory available.
Also, make sure your file writes are being done on a separate thread. No need to block an entire process waiting for a file to write.
As was said, cache as much as you can in memory.
If you're finding that the amount you need to cache is larger than your memory will allow, try swapping cache entries between memory and disk, the way it is often done when virtual memory pages need to be swapped to disk. It is essentially the same problem.
One common method for deciding which page gets swapped out is the Least Recently Used (LRU) algorithm.
It really depends on how much memory is available and what the access pattern is.
The simplest solution is to use memory-mapped files. This generally requires that the file has been laid out as if the objects were in memory, so you will need to use only POD data with no pointers (but you can use relative indexes).
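A sketch of what "POD with relative indexes instead of pointers" might look like (the field names are invented):

#include <cstdint>

// A fixed-layout record that can live directly in a memory-mapped file.
// Instead of a pointer to its variable-length payload, it stores the
// payload's byte offset from the start of the file.
struct Record {
    std::uint64_t payload_offset;  // relative index into the mapped file
    std::uint32_t payload_size;    // length of the payload in bytes
    double        weight;
};

// Resolve a relative index against the base address of the mapping.
inline const char* payload(const char* file_base, const Record& r) {
    return file_base + r.payload_offset;
}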
You need to study your access pattern to see if you can group together the values that are often used together. This will help the OS cache those values better (i.e., keep them in memory for you, rather than always going to the disk to read them).
Another option is to split the file into several chunks, preferably in a logical way. It might be necessary to create an index file that maps a range of values to the file that contains them.
Then you only need to access the set of files required.
Finally, for complex data structures (where memory mapped files fail) or for sparse reading (when you only ever extract only a small piece of information from a given file), it might be interesting to read about LRU caches.
The idea is to use serialization and compression. You write several files, among which an index, and compress all of them (e.g. zip). Then, at launch time, you start by loading the index and keep it in memory.
Whenever you need to access a value, you first try your cache; if it is not there, you access the file that contains it, decompress it in memory, and dump its contents into your cache. Note: if the cache is too small, you will have to be picky about what you keep in it... or reduce the size of the files.
The frequently accessed values will stay in the cache, avoiding unnecessary round trips, and because the files are zipped there will be less IO.
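A rough sketch of such an LRU cache over compressed chunk files (decompress_file stands in for whatever decompression routine is actually used; it is only declared here):

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical: reads one compressed chunk file from disk and inflates it.
std::vector<char> decompress_file(const std::string& path);

// Tiny LRU cache: the most recently used chunks live at the front of the
// list, and the least recently used chunk is evicted when the cache is full.
class ChunkCache {
public:
    explicit ChunkCache(std::size_t capacity) : capacity_(capacity) {}

    const std::vector<char>& get(const std::string& path) {
        auto it = index_.find(path);
        if (it != index_.end()) {
            // Cache hit: move the chunk to the front (mark as recently used).
            items_.splice(items_.begin(), items_, it->second);
            return items_.front().second;
        }
        if (!items_.empty() && items_.size() >= capacity_) {
            // Cache full: drop the least recently used chunk.
            index_.erase(items_.back().first);
            items_.pop_back();
        }
        // Cache miss: pay the disk + decompression cost once.
        items_.emplace_front(path, decompress_file(path));
        index_[path] = items_.begin();
        return items_.front().second;
    }

private:
    using Entry = std::pair<std::string, std::vector<char>>;
    std::size_t capacity_;
    std::list<Entry> items_;
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};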
Structure your data in a way that allows caching to be effective. For instance, when you are reading "certain pieces," if those pieces are all contiguous, the program won't have to seek around the disk to gather them.
Reading and writing in batches, instead of record by record, will help if you are sharing disk access with another process.
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several GB).
As far as I understand, the std::map contents are also pre-calculated, and there are no insert/remove operations, only searches. How about replacing the maps with something like std::unordered_map or sparsehash? In theory that can give a performance gain.
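For example (a sketch, assuming integer keys and a purely read-only lookup path), the ordered map can be swapped for a hash map without touching the lookup code:

#include <unordered_map>

// The pre-computed table: O(1) average-case lookups instead of the
// O(log n) tree traversal of std::map. sparsehash would slot in similarly.
std::unordered_map<int, int> table;

// Search-only access, exactly as before.
int lookup(int key, int fallback) {
    auto it = table.find(key);
    return it != table.end() ? it->second : fallback;
}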
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several GB).
Don't reinvent the wheel. I'd suggest using a key-value data store, such as Berkeley DB: http://docs.oracle.com/cd/E17076_02/html/gsg/C/concepts.html
This will enable saving and sharing the files, caching the parts you actually use a lot, and keeping the other parts on disk.
I am developing a Clojure application which makes heavy use of STM. Is it better to use one global ref or many smaller refs in general? Each ref will store data like a relational database table, so there will be several refs.
A possible benefit of using fewer refs is that it will be easier to comprehend what is happening in your presumably multi-threaded app.
A possible benefit of using more refs is that transactions will contend over less state at a time, which can increase speed.
If you have a ref per table and you need to maintain data integrity between two tables, then you are going to be responsible for implementing that logic since the STM has no knowledge of how the data relates between the tables.
Your answer might lie in how often a change in one table will affect another, and whether breaking your single ref out into one ref per table even gives you a notable performance increase at all.
I've usually found it is better to minimise the number of refs.
Key reasons:
It's often helpful to treat large portions of application state as single immutable blocks (e.g. to enable easy snapshots for analysis, or passing to test functions)
There will be more overhead from using lots of small refs
It keeps your top-level namespace simpler
Of course, this will ultimately depend on your workload and concurrency demands, but my general advice would be to go with as few refs as possible until you are sure that you need more.