What is the appropriate madvise setting for reading a file backwards? - c++

I am using gcc 4.7.2 on a 64-bit Linux box.
I have 20 large sorted binary POD files that I need to read as a part of the final merge in an external merge-sort.
Normally, I would mmap all the files for reading and use a multiset<T,LessThan> to manage the merge sort from small to large, before doing a mmap write out to disk.
However, I realised that if I keep a std::mutex on each of these files, I can create a second thread which reads the file backwards, and sort from large to small at the same time. If I decide, beforehand, that the first thread will take exactly n/2 elements and the second thread will take the rest, I will have no need for a mutex on the output end of things.
Read-lock contention can be expected to occur, on average, maybe 1 time in 20 in this particular case, so that's acceptable.
Now, here's my question. In the first case, it is obvious that I should call madvise with MADV_SEQUENTIAL, but I have no idea what I should do for the second case, where I'm reading the file backwards.
I see no MADV_REVERSE in the man pages. Should I use MADV_NORMAL, or maybe not call madvise at all?
Recall that an external sort is needed when the volume of data is so large that it will not fit into memory. So we are left with a more complex algorithm to use disk as a temporary store. Divide-and-conquer algorithms will usually involve breaking up the data, doing partial sorts, and then merging the partial sorts.
My steps for an external merge-sort
Take n=1 billion random numbers and break them into 20 shards of equal sizes.
Sort each shard individually from small to large, and write each out into its own file.
Open 40 mmaps, two for each file (one for going forward, one for going backward), and associate a mutex with each file.
Instantiate a std::multiset<T,LessThan> buff_fwd; for the forward thread and a std::multiset<T,GreaterThan> buff_rev; for the reverse thread. Some people prefer to use priority queues here, but essentially any sort-on-insert container will work.
I like to call the two buffers surface and rockbottom, representing the smallest and largest numbers not yet added to the final sort.
Add items from the shards until n/2 is used up, and flush to one output file using mmap, from the beginning towards the middle in one thread and from the end towards the middle in the other. You can basically flush at will, but you should at least do it before either buffer uses up too much memory.

I would suggest:
MADV_RANDOM
To prevent useless read-ahead (which is in the wrong direction).
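A minimal sketch of that suggestion, assuming a POSIX system and a file of 64-bit records. The function name sum_backwards and the record type are made up for illustration. madvise is only a hint to the kernel, so whether MADV_RANDOM actually helps a backward scan is something to measure on the target machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: sums the 64-bit records of a file from last to first
// through a read-only mapping. Returns false if a system call fails or the
// file is empty.
bool sum_backwards(const char* path, std::uint64_t& sum) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return false; }
    std::size_t len = static_cast<std::size_t>(st.st_size);
    void* p = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return false;
    // Advice only: ask the kernel not to read ahead in the forward direction.
    madvise(p, len, MADV_RANDOM);
    const std::uint64_t* rec = static_cast<const std::uint64_t*>(p);
    sum = 0;
    // Walk the records from the highest index down to index 0.
    for (std::size_t i = len / sizeof(std::uint64_t); i-- > 0;) sum += rec[i];
    munmap(p, len);
    return true;
}
```

The same pattern applies to the reverse-reading thread of the merge: map once, give the advice once, then index downwards.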

Related

Multithreaded array of arrays?

I have a data structure which consists of 1,000 array elements, each array element is a smaller array of 8 ints:
std::array<std::array<int, 8>, 1000>
The data structure contains two "pointers", which track the largest and smallest populated array elements (within the "outer", 1000-element array). So for example they might be:
min = 247
max = 842
How can I read and write to this data structure from multiple threads? I am worried about race conditions between pushing/popping elements and maintaining the two "pointers". My basic mode of operation is:
// Pop element from current index
// Calculate new index
// Write element to new index
// Update min and max "pointers"
You are correct that your current algorithm is not thread-safe; there are a number of places where contention could occur.
This is impossible to optimize without more information though. You need to know where the slow-down is happening before you can improve it - and for that you need metrics. Profile your code and find out what bits are actually taking the time, because you can only gain by parallelizing those bits and even then you may find that it's actually memory or something else that is the limiting factor, not CPU.
The simplest approach will then be to just lock the entire structure for the full process. This will only work if the threads are doing a lot of other processing as well; if not, you will actually lose performance compared to single threading.
After that you can consider having a separate lock for different sections of the data structure. You will need to properly analyse what you are using when and where and work out what would be useful to split. For example you might have chunks of the sub arrays with each chunk having its own lock.
Be careful of deadlocks in this situation, though: you might have a thread claim 32 and then want 79 while another thread already has 79 and then wants 32. Make sure you always claim locks in the same order.
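A minimal sketch of claiming locks in a fixed order, assuming one mutex per outer array element. The Slots struct and move_element are hypothetical names, and the per-element granularity is only one possible choice.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical layout: one mutex guarding each of the 1000 sub-arrays.
struct Slots {
    std::array<std::array<int, 8>, 1000> data{};
    std::array<std::mutex, 1000> locks;
};

// Moves data[from][k] to data[to][k] while holding both element locks.
void move_element(Slots& s, std::size_t from, std::size_t to, std::size_t k) {
    if (from == to) return;  // nothing to move; also avoids locking one mutex twice
    // Always lock the lower index first, so two threads working on the same
    // pair of elements can never each hold the lock the other one wants.
    std::size_t lo = std::min(from, to);
    std::size_t hi = std::max(from, to);
    std::lock_guard<std::mutex> first(s.locks[lo]);
    std::lock_guard<std::mutex> second(s.locks[hi]);
    s.data[to][k] = s.data[from][k];
    s.data[from][k] = 0;
}
```

With this rule, the thread going from 32 to 79 and the thread going from 79 to 32 both take lock 32 before lock 79.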
The fastest option (if possible) may even be to give each thread its own copy of the data structure; each processes 1/N of the work, and the results are merged at the end. This way no synchronization is needed at all during processing.
But again it all comes back to the metrics and profiling. This is not a simple problem.

Algorithm for ordering strings to and from disk efficiently using minimal internal memory resources

I have a very large amount (multiple terabytes) of strings stored on disk that I need to sort alphabetically and store in another file as quickly as possible (preferably in C/C++) and using as little internal memory as possible. It is not an option to pre-index the strings beforehand, so I need to sort the strings whenever needed in a close to real-time fashion.
What would be the best algorithm to use in my case? I would prefer a suggestion for a linear algorithm rather than just a link to an existing software library like Lucene.
You usually sort huge external data by chunking it into smaller pieces, operating on them and eventually merging them back. When choosing the sorting algorithm you usually take a look at your requirements:
If you need a time-complexity guarantee that is also stable you can go for a mergesort (O(nlogn) guaranteed) although it requires an additional O(n) space.
If severely memory-bound you might want to try Smoothsort (constant memory, time O(nlogn))
Otherwise you might want to take a look at the research stuff in the gpgpu accelerators field like GPUTeraSort.
Google servers usually have this sort of problem.
Simply construct a digital tree (trie).
Memory use will be much less than the input data, because many words share a common prefix. While adding a word to the tree, mark its last node as the end of a word (by incrementing a counter). Once all words are added, do a DFS (visiting children in the order you want to sort by, e.g. a->z) and output the data to a file. Time complexity is proportional to the memory size. It is hard to state the exact complexity because it depends on the strings (many short strings give better complexity), but it is still much better than the input size O(n*k), where n is the count of strings and k is the average length of a string. Sorry for my English.
PS. To solve the memory-size problem, you can split the file into smaller parts and sort each with my method. If you then have, for example, 1000 files, keep the first word of each in memory (like queues), then repeatedly output the right word and read in the next one from the same file, which takes very little time.
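The trie approach described above might look like the following sketch. TrieNode, insert, collect and trie_sort are illustrative names. It assumes plain ASCII strings that fit in memory, and it recurses once per character, so very long strings would need an explicit stack instead.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::size_t count = 0;  // how many inserted words end at this node
    std::map<char, std::unique_ptr<TrieNode>> children;  // kept ordered by character
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child.reset(new TrieNode());
        node = child.get();
    }
    ++node->count;  // mark the end of a word; duplicates just raise the count
}

// Depth-first walk: emit the current prefix once per occurrence, then
// descend into the children in ascending character order.
void collect(const TrieNode& node, std::string& prefix, std::vector<std::string>& out) {
    for (std::size_t i = 0; i < node.count; ++i) out.push_back(prefix);
    for (const auto& kv : node.children) {
        prefix.push_back(kv.first);
        collect(*kv.second, prefix, out);
        prefix.pop_back();
    }
}

std::vector<std::string> trie_sort(const std::vector<std::string>& words) {
    TrieNode root;
    for (const std::string& w : words) insert(root, w);
    std::vector<std::string> out;
    std::string prefix;
    collect(root, prefix, out);
    return out;
}
```

In a real run the DFS would write to the output file instead of a vector. Note that a std::map per node carries noticeable overhead, so the memory saving from shared prefixes is not guaranteed for every input.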
I suggest you use the Unix "sort" command that can easily handle such files.
See "How could the UNIX sort command sort a very large file?"
Before disk drives even existed, people wrote programs to sort lists that were far too large to hold in main memory.
Such programs are known as external sorting algorithms.
My understanding is that the Unix "sort" command uses the merge sort algorithm.
Perhaps the simplest version of the external sorting merge sort algorithm works like this (quoting from Wikipedia: merge sort):
Name four tape drives as A, B, C, D, with the original data on A:
Merge pairs of records from A; writing two-record sublists alternately to C and D.
Merge two-record sublists from C and D into four-record sublists; writing these alternately to A and B.
Merge four-record sublists from A and B into eight-record sublists; writing these alternately to C and D
Repeat until you have one list containing all the data, sorted --- in log2(n) passes.
Practical implementations typically have many tweaks:
Almost every practical implementation takes advantage of available RAM by reading many items into RAM at once, using some in-RAM sorting algorithm, rather than reading only one item at a time.
Some implementations are able to sort lists even when some or every item in the list is too large to hold in the available RAM.
polyphase merge sort
As suggested by Kaslai, rather than only 4 intermediate files, it is usually quicker to use 26 or more intermediate files. However, as the external sorting article points out, if you divide up the data into too many intermediate files, the program spends a lot of time waiting for disk seeks; too many intermediate files make it run slower.
As Kaslai commented, using larger RAM buffers for each intermediate file can significantly decrease the sort time -- doubling the size of each buffer halves the number of seeks. Ideally each buffer should be sized so the seek time is a relatively small part of the total time to fill that buffer. Then the number of intermediate files should be picked so the total size of all those RAM buffers put together comes close to but does not exceed available RAM. (If you have very short seek times, as with an SSD, the optimal arrangement ends up with many small buffers and many intermediate files. If you have very long seek times, as with tape drives, the optimal arrangement ends up with a few large buffers and few intermediate files. Rotating disk drives are intermediate.)
etc. -- See the Knuth book "The Art of Computer Programming, Vol. 3: Sorting and Searching" for details.
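To illustrate merging many intermediate files in one pass rather than pairwise, here is a small in-memory sketch of a k-way merge driven by a min-heap. The runs stand in for sorted intermediate files, and merge_runs is a made-up name. A real external sort would refill each run from a per-file RAM buffer instead of holding whole runs in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merges already-sorted runs into one sorted sequence.
std::vector<int> merge_runs(const std::vector<std::vector<int>>& runs) {
    typedef std::pair<int, std::size_t> Head;  // (value, index of the run it came from)
    // std::greater turns the priority queue into a min-heap.
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    std::vector<std::size_t> pos(runs.size(), 0);  // index of the current head of each run
    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push(Head(runs[r][0], r));
    std::vector<int> out;
    while (!heap.empty()) {
        Head h = heap.top();
        heap.pop();
        out.push_back(h.first);
        std::size_t r = h.second;
        // Replace the consumed element with the next one from the same run.
        if (++pos[r] < runs[r].size()) heap.push(Head(runs[r][pos[r]], r));
    }
    return out;
}
```

Each output element costs O(log k) heap work for k runs, which is why the number of runs is usually limited by buffer memory and seek cost rather than by CPU time.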
Use as much memory as you can and chunk your data. Read one chunk at a time into memory.
Step 1) Sort entries inside chunks
For each chunk:
Use IntroSort to sort your chunk. But to avoid copying your strings around and having to deal with variable sized strings and memory allocations (at this point it will be interesting and relevant if you actually have fixed or max size strings or not), preallocate a standard std array or other fitting container with pointers to your strings that point to a memory region inside the current data chunk. => So your IntroSort swaps the pointers to your strings, instead of swapping actual strings.
Loop over each entry in your sort-array and write the resulting (ordered) strings back to a corresponding sorted strings file for this chunk
Step 2) Merge all strings from sorted chunks into resulting sorted strings file
Allocate a "sliding" window memory region for all sorted strings files at once. To give an example: If you have 4 sorted strings files, allocate 4 * 256MB (or whatever fits, the larger the less (sequential) disk IO reads required).
Fill each window by reading the strings into it (so, read as many strings at once as your window can store).
Use MergeSort to compare any of your chunks, using a comparator to your window (e.g. stringInsideHunkA = getStringFromWindow(1, pointerToCurrentWindow1String) - pointerToCurrentWindow1String is a reference that the function advances to the next string). Note that if the string pointer to your window is beyond the window size (or the last record didn't fit into the window), read the next memory region of that chunk into the window.
Use mapped IO (or a buffered writer) and write the resulting strings into a giant sorted strings final file.
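Step 1 above (sorting references to the strings instead of the strings themselves) could be sketched as follows. It assumes the chunk is already in memory and that strings are separated by newlines, which the answer does not specify. StrRef, index_chunk, less_ref and sort_chunk are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// A reference to a string that lives inside the chunk buffer.
struct StrRef {
    const char* ptr;
    std::size_t len;
};

// Splits a newline-delimited chunk into references; no string bytes are copied.
std::vector<StrRef> index_chunk(const std::string& chunk) {
    std::vector<StrRef> refs;
    std::size_t start = 0;
    while (start < chunk.size()) {
        std::size_t end = chunk.find('\n', start);
        if (end == std::string::npos) end = chunk.size();
        refs.push_back(StrRef{chunk.data() + start, end - start});
        start = end + 1;
    }
    return refs;
}

// Byte-wise comparison; a shorter string sorts before a longer one it prefixes.
bool less_ref(const StrRef& a, const StrRef& b) {
    int c = std::memcmp(a.ptr, b.ptr, std::min(a.len, b.len));
    return c != 0 ? c < 0 : a.len < b.len;
}

// Sorts the references, then emits the strings in sorted order.
std::string sort_chunk(const std::string& chunk) {
    std::vector<StrRef> refs = index_chunk(chunk);
    // std::sort only swaps the small StrRef values. Many standard library
    // implementations use an introsort here, but that is not guaranteed.
    std::sort(refs.begin(), refs.end(), less_ref);
    std::string out;
    out.reserve(chunk.size() + 1);
    for (const StrRef& r : refs) {
        out.append(r.ptr, r.len);
        out.push_back('\n');
    }
    return out;
}
```

In the full algorithm the output of sort_chunk would be written to the sorted strings file for that chunk rather than returned.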
I think this could be an IO-efficient way. But I've never implemented such a thing.
However, in regards to your file size and the "non-functional" requirements that are still unknown to me, I suggest you also consider benchmarking a batch import using LevelDB [1]. It's actually very fast, minimizes disk IO, and even compresses your resulting strings file to about half the size without impact on speed.
[1] http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html
Here is a general algorithm that will be able to do what you want with just a few gigs of memory. You could get away with much less, but the more you have, the less disk overhead you have to deal with. This assumes that all of the strings are in a single file; however, it could be applied to a multiple-file setup.
1: Create some files to store loosely sorted strings in. For terabytes of data, you'd probably want 676 of them. One for strings starting in "aa", one for "ab", and so on until you get to "zy" and "zz".
2: For each file you created, create a corresponding buffer in memory. A std::vector<std::string> perhaps.
3: Determine a buffer size that you want to work with. This should not go much beyond 1/2 of your available physical memory.
4: Load as many strings as you can into this buffer.
5: Truncate the file so that the strings in your buffer are no longer on disk. This step can be delayed for later or omitted entirely if you have the disk space to work with or the data is too sensitive to lose in the case of process failure. If truncating, make sure you load your strings from the end of the file, so that the truncation is almost a NOP.
6: Iterate over the strings and store them in their corresponding buffer.
7: Flush all of the buffers to their corresponding files. Clear all the buffers.
8: Go to step 4 and repeat until you have exhausted your source of strings.
9: Read each file to memory and sort it with whatever algorithm you fancy. On the off chance you end up with a file that is larger than your available physical memory, use a similar process from above to split it into smaller files.
10: Overwrite the unsorted file with this new sorted file, or append it to a monolithic file.
If you keep the individual files rather than a monolithic file, you can make insertions and deletions relatively quickly. You would only have to load in, insert, and sort the value into a single file that can be read entirely into memory. Now and then you might have to split a file into smaller files, however this merely amounts to looking around the middle of the file for a good place to split it and then just moving everything after that point to another file.
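The mapping from a string to one of the 676 files in step 1 can be sketched as below. It assumes lowercase ASCII input. Sending every other character, and strings shorter than two characters, to the 'a' position is a simplification of this sketch, so real data would need a bucketing rule that keeps the buckets in sort order. The names letter and bucket_index are invented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Position of s[i] in the alphabet (0 for 'a' ... 25 for 'z').
// Assumption of this sketch: missing or non-lowercase characters count as 'a'.
int letter(const std::string& s, std::size_t i) {
    if (i >= s.size() || s[i] < 'a' || s[i] > 'z') return 0;
    return s[i] - 'a';
}

// Index of the bucket file for s: "aa" -> 0, "ab" -> 1, ..., "zz" -> 675.
int bucket_index(const std::string& s) {
    return letter(s, 0) * 26 + letter(s, 1);
}
```

Because the bucket index grows with the two-letter prefix, concatenating the sorted bucket files in index order yields the fully sorted output for inputs that satisfy the assumption.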
Good luck with your project.

How to design a datastructure that spits out one available space for each thread in CUDA

In my project with CUDA I need a data structure (available to all threads in the block) that is similar to a "stash". In this stash there are multiple spaces, each of which can be either empty or full. I need this data structure to hand out an empty space whenever a thread asks for one. The thread will ask for a space in the stash, put something in, and mark this position as full. I cannot use a FIFO because fetching from the stash is random: any position (and multiple positions) can be marked as empty or full.
The initial version I have uses an array to represent whether each space is empty or not. Each thread loops through the positions (using atomicCAS) until it finds an empty spot. But with this algorithm the search time depends on how full the stash is, which is not acceptable in my design.
How could I design a data structure whose fetch time and write-back time do not depend on how full the stash is?
Does this remind anyone of a similar algorithm?
Thanks
You could implement this with a FIFO containing a list of free locations.
At startup you fill the FIFO with all locations.
Then whenever you want a space, you take the next element from the FIFO.
When you are finished with the slot, you can place the address back into the FIFO again.
This should have O(1) allocation and deallocation time.
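The free-list idea can be illustrated with a single-threaded, host-side C++ sketch. This is not CUDA code: on the device, the head and tail updates would have to be made atomic, which this sketch does not attempt. FreeSlots, acquire and release are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity ring buffer holding the indices of the free slots.
class FreeSlots {
public:
    // At startup every slot is free, so the FIFO is filled with 0..n-1.
    explicit FreeSlots(std::size_t n) : ring_(n), head_(0), count_(n) {
        for (std::size_t i = 0; i < n; ++i) ring_[i] = static_cast<int>(i);
    }

    // Takes a free slot in O(1); returns -1 if every slot is in use.
    int acquire() {
        if (count_ == 0) return -1;
        int slot = ring_[head_];
        head_ = (head_ + 1) % ring_.size();
        --count_;
        return slot;
    }

    // Puts a slot back in O(1). The caller must currently own the slot.
    void release(int slot) {
        std::size_t tail = (head_ + count_) % ring_.size();
        ring_[tail] = slot;
        ++count_;
    }

private:
    std::vector<int> ring_;
    std::size_t head_;   // position of the next free slot to hand out
    std::size_t count_;  // number of free slots currently in the ring
};
```

Slots can be released in any order, which covers the random fetch pattern from the question: the FIFO only tracks which positions are free, not the order of the stash contents.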
You could implement a hash table (SeparateChaining) with ThreadID as the key.
It is more or less similar to an array of linked lists. This way you need not put a lock on the entire array as you did earlier. Instead, you use atomicCAS only while reading a linked list at a specific index. Thereby, you can have n threads running in parallel where the array size is n.
Note: The distribution of threads however depends on the hash function.

remove all duplicate records efficiently

I have a file which might be 30+GB or more. And each line in this file is called a record and is composed of 2 cols, which goes like this
id1 id2
Both ids are 32-bit integers. My job is to write a program that removes all duplicate records, so that each record is unique, and finally outputs the unique id2 values into a file.
There are some constraints: at most 30G of memory is allowed, and the job should preferably be done efficiently by a program that is not multithreaded or multiprocess.
Initially I came up with this idea: because of the memory constraints, I decided to read the file n times, each time keeping in memory only those records with id1 % n = i (i = 0,1,2,..,n-1). The data structure I use is a std::map<int, std::set<int> >; it takes id1 as the key and puts id2 in id1's std::set.
This way, the memory constraint is not violated, but it's quite slow. I think that's because insertion slows down as the std::map and std::set grow larger. Moreover, I need to read the file n times, and when each round is done I have to clear the std::map for the next round, which also costs some time.
I also tried a hash table, but it doesn't satisfy me either; I think there might be too many collisions even with 3 million buckets.
So I am posting my problem here, hoping you guys can offer me a better data structure or algorithm.
Thanks a lot.
PS
Scripts (shell, Python) are welcome if they can do it efficiently.
Unless I overlooked a requirement, it should be possible to do this on the Linux shell as
sort -u inputfile > outputfile
Many implementations enable you to use sort in a parallelised manner as well:
sort --parallel=4 -u inputfile > outputfile
for up to four parallel executions.
Note that sort might use a lot of space in /tmp temporarily. If you run out of disk space there, you may use the -T option to point it to an alternative place on disk to use as temporary directory.
(Edit:) A few remarks about efficiency:
A significant portion of the time spent during execution (of any solution to your problem) will be spent on IO, something that sort is highly optimised for.
Unless you have an extremely large amount of RAM, your solution is likely to end up performing some of the work on disk (just like sort). Again, optimising this means a lot of work, while for sort all of that work has been done.
One disadvantage of sort is that it operates on string representations of the input lines. If you were to write your own code, one thing you could do (similar to what you suggested already) is to convert the input lines to 64-bit integers and hash them. If you have enough RAM, that may be a way to beat sort in terms of speed, if you get IO and integer conversions to be really fast. I suspect it may not be worth the effort, as sort is easy to use and, I think, fast enough.
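As a rough illustration of the "convert to 64-bit integers and hash them" idea, the sketch below packs each (id1, id2) pair into one std::uint64_t and filters through a std::unordered_set. Treating the ids as unsigned is an assumption, Record, pack and unique_records are illustrative names, and the parsing and IO that dominate the real cost are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

typedef std::pair<std::uint32_t, std::uint32_t> Record;  // (id1, id2)

// Packs a record into a single 64-bit key: id1 in the high half, id2 in the low half.
std::uint64_t pack(const Record& r) {
    return (static_cast<std::uint64_t>(r.first) << 32) | r.second;
}

// Keeps the first occurrence of each (id1, id2) record, in input order.
std::vector<Record> unique_records(const std::vector<Record>& in) {
    std::unordered_set<std::uint64_t> seen;
    seen.reserve(in.size());  // size the bucket array up front to avoid rehashing
    std::vector<Record> out;
    for (const Record& r : in)
        if (seen.insert(pack(r)).second) out.push_back(r);  // .second is true for new keys
    return out;
}
```

A node-based hash set spends considerably more than 8 bytes per key, so for a 30+GB input this would only fit in RAM when the number of distinct records is small enough.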
I just don't think you can do this efficiently without using a bunch of disk. Any form of data structure will introduce so much memory and/or storage overhead that your algorithm will suffer. So I would expect a sorting solution to be best here.
I reckon you can sort large chunks of the file at a time, and then merge (ie from merge-sort) those chunks after. After sorting a chunk, obviously it has to go back to disk. You could just replace the data in the input file (assuming it's binary), or write to a temporary file.
As far as the records, you just have a bunch of 64-bit values. With 30GB RAM, you can hold almost 4 billion records at a time. That's pretty sweet. You could sort that many in-place with quicksort, or half that many with mergesort. You probably won't get a contiguous block of memory that size. So you're going to have to break it up. That will make quicksort a little trickier, so you might want to use mergesort in RAM as well.
During the final merge it's trivial to discard duplicates. The merge might be entirely file-based, but at worst you'll use an amount of disk equivalent to twice the number of records in the input file (one file for scratch and one file for output). If you can use the input file as scratch, then you have not exceeded your RAM limits OR your disk limits (if any).
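Discarding duplicates during the merge can be sketched like this, with two sorted in-memory vectors standing in for the sorted chunks and each record already packed into a 64-bit value. merge_unique is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Merges two sorted sequences and drops every value that repeats the
// previously emitted one, so the output is sorted and duplicate-free.
std::vector<std::uint64_t> merge_unique(const std::vector<std::uint64_t>& a,
                                        const std::vector<std::uint64_t>& b) {
    std::vector<std::uint64_t> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() || j < b.size()) {
        std::uint64_t next;
        // Take from a when b is exhausted or a's head is not larger than b's.
        if (j == b.size() || (i < a.size() && a[i] <= b[j])) next = a[i++];
        else next = b[j++];
        // Equal values arrive back to back, so comparing with the last
        // emitted value is enough to detect a duplicate.
        if (out.empty() || out.back() != next) out.push_back(next);
    }
    return out;
}
```

In the file-based version only the last written record needs to be remembered, so duplicate removal adds no extra memory to the merge.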
I think the key here is the requirement that it shouldn't be multithreaded. That lends itself well to disk-based storage. The bulk of your time is going to be spent on disk access. So you wanna make sure you do that as efficiently as possible. In particular, when you're merge-sorting you want to minimize the amount of seeking. You have large amounts of memory as buffer, so I'm sure you can make that very efficient.
So let's say your file is 60GB (and I assume it's binary) so there's around 8 billion records. If you're merge-sorting in RAM, you can process 15GB at a time. That amounts to reading and (over)writing the file once. Now there are four chunks. If you want to do pure merge-sort then you always deal with just two arrays. That means you read and write the file two more times: once to merge each 15GB chunk into 30GB, and one final merge on those (including discarding of duplicates).
I don't think that's too bad. Three times in and out. If you figure out a nice way to quicksort then you can probably do this with one fewer pass through the file. I imagine a data structure like deque would work well, as it can handle non-contiguous chunks of memory... But you'd probably wanna build your own and finely tune your sorting algorithm to exploit it.
Instead of std::map<int, std::set<int> > use std::unordered_multimap<int,int>. If you can not use C++11 - write your own.
The std::map is node-based and calls malloc on each insertion, which is probably why it is slow. With an unordered map (hash table), if you know the number of records, you can pre-allocate. Even if you don't, the number of mallocs will be O(log N) instead of O(N) with std::map.
I bet this will be several times faster and more memory-efficient than using the external sort -u.
This approach may help when there are not too many duplicate records in the file.
1st pass. Allocate most of the memory for a Bloom filter. Hash every pair from the input file and put the result into the Bloom filter. Write each duplicate found by the Bloom filter into a temporary file (this file will also contain some false positives, which are not duplicates).
2nd pass. Load temporary file and construct a map from its records. Key is std::pair<int,int>, value is a boolean flag. This map may be implemented either as std::unordered_map/boost::unordered_map, or as std::map.
3rd pass. Read input file again, search each record in the map, output its id2 if either not found or flag is not yet set, then set this flag.
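For the first pass, a Bloom filter with a combined "test and set" operation is enough. The following is a minimal sketch, not a tuned implementation: the class name, the use of double hashing, and the mixing function (the 64-bit finalizer from MurmurHash3) are choices made for this example, and the key is assumed to be the (id1, id2) pair packed into 64 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class BloomFilter {
public:
    // bits must be greater than zero; hashes is the number of probes per key.
    BloomFilter(std::size_t bits, int hashes) : bits_(bits, false), hashes_(hashes) {}

    // Sets the key's bits. Returns true if all of them were already set,
    // meaning the key is a possible duplicate (false positives can occur,
    // false negatives cannot).
    bool test_and_set(std::uint64_t key) {
        std::uint64_t h1 = mix(key);
        std::uint64_t h2 = mix(h1) | 1;  // odd step for double hashing
        bool seen = true;
        for (int i = 0; i < hashes_; ++i) {
            std::size_t bit = static_cast<std::size_t>((h1 + i * h2) % bits_.size());
            if (!bits_[bit]) {
                seen = false;
                bits_[bit] = true;
            }
        }
        return seen;
    }

private:
    // 64-bit finalizer from MurmurHash3 (fmix64), used here as a cheap mixer.
    static std::uint64_t mix(std::uint64_t x) {
        x ^= x >> 33;
        x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33;
        x *= 0xc4ceb9fe1a85ec53ULL;
        x ^= x >> 33;
        return x;
    }

    std::vector<bool> bits_;
    int hashes_;
};
```

In pass 1, every record for which test_and_set returns true would be written to the temporary file. Passes 2 and 3 then separate the true duplicates from the false positives.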

Fastest way to do many small, blind writes on a huge file (in C++)?

I have some very large (>4 GB) files containing (millions of) fixed-length binary records. I want to (efficiently) join them to records in other files by writing pointers (i.e. 64-bit record numbers) into those records at specific offsets.
To elaborate, I have a pair of lists of (key, record number) tuples sorted by key for each join I want to perform on a given pair of files, say, A and B. Iterating through a list pair and matching up the keys yields a list of (key, record number A, record number B) tuples representing the joined records (assuming a 1:1 mapping for simplicity). To complete the join, I conceptually need to seek to each A record in the list and write the corresponding B record number at the appropriate offset, and vice versa. My question is what is the fastest way to actually do this?
Since the list of joined records is sorted by key, the associated record numbers are essentially random. Assuming the file is much larger than the OS disk cache, doing a bunch of random seeks and writes seems extremely inefficient. I've tried partially sorting the record numbers by putting the A->B and B->A mappings in a sparse array, and flushing the densest clusters of entries to disk whenever I run out of memory. This has the benefit of greatly increasing the chances that the appropriate records will be cached for a cluster after updating its first pointer. However, even at this point, is it generally better to do a bunch of seeks and blind writes, or to read chunks of the file manually, update the appropriate pointers, and write the chunks back out? While the former method is much simpler and could be optimized by the OS to do the bare minimum of sector reads (since it knows the sector size) and copies (it can avoid copies by reading directly into properly aligned buffers), it seems that it will incur extremely high syscall overhead.
While I'd love a portable solution (even if it involves a dependency on a widely used library, such as Boost), modern Windows and Linux are the only must-haves, so I can make use of OS-specific APIs (e.g. CreateFile hints or scatter/gather I/O). However, this can involve a lot of work to even try out, so I'm wondering if anyone can tell me if it's likely worth the effort.
It looks like you can solve this by use of data structures. You've got three constraints:
Access time must be reasonably quick
Data must be kept sorted
You are on a spinning disk
B+ Trees were created specifically to address the kind of workload you're dealing with here. There are several links to implementations in the linked Wikipedia article.
Essentially, a B+ tree works like a binary search tree, except that each node holds a whole group of keys and has many children. That way, instead of having to seek around for each node, the B+ tree loads only a chunk at a time. And it keeps a bit of information around to know which chunk it's going to need in a search.
EDIT: If you need to sort by more than one item, you can do something like:
+--------+-------------+-------------+---------+
| Header | B+Tree by A | B+Tree by B | Records |
+--------+-------------+-------------+---------+
    ||       ^     |       ^     |        ^
    |\-------/     |       |     |        |
    \----------------------/     |        |
                   |             |        |
                   \-------------+--------/
I.e. you have separate B+Trees for each key, and a separate list of records, pointers to which are stored in the B+ trees.
I've tried partially sorting the record numbers by putting the A->B and B->A mappings in a sparse array, and flushing the densest clusters of entries to disk whenever I run out of memory.
it seems that it will incur extremely high syscall overhead.
You can use memory mapped access to the file to avoid syscall overhead. mmap() on *NIX, and CreateFileMapping() on Windows.
Split the file logically into blocks, e.g. 32MB. If something needs to be changed in a block, mmap() it, modify the data, optionally msync() if desired, munmap(), and then move to the next block.
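On a POSIX system, the per-block cycle might look like the sketch below. patch_block is a hypothetical helper that writes a single 8-byte value. A real implementation would apply every pending update for the block before unmapping, and the Windows equivalent would go through CreateFileMapping and MapViewOfFile instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Maps one block of the file, overwrites 8 bytes at offset_in_block, and
// unmaps it again. fd must be open for reading and writing, block_start
// must be a multiple of the page size, and the block must lie within the file.
bool patch_block(int fd, off_t block_start, std::size_t block_len,
                 std::size_t offset_in_block, std::uint64_t value) {
    if (offset_in_block + sizeof value > block_len) return false;
    void* p = mmap(nullptr, block_len, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, block_start);
    if (p == MAP_FAILED) return false;
    // The write goes to the page cache; the kernel reads the page on first access.
    std::memcpy(static_cast<char*>(p) + offset_in_block, &value, sizeof value);
    bool ok = msync(p, block_len, MS_SYNC) == 0;  // optional: force write-back now
    munmap(p, block_len);
    return ok;
}
```

Dropping the msync call leaves the timing of the write-back to the OS, which is usually what you want for throughput.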
That is what I would have tried first. The OS automatically reads whatever needs to be read (on first access to the data), and it queues the IO any way it likes.
The important thing to keep in mind is that real IO isn't that fast. Performance-wise, the limiting factors for random access are (1) the number of IOs per second (IOPS) the storage can handle and (2) the number of disk seeks. (Usual IOPS is in the hundreds. Usual seek latency is 3-5ms.) Storage can, for example, read/write 50MB/s: one continuous block of 50MB in one second. But if you tried to patch a 50MB file byte by byte, seek times would simply kill the performance. Up to some limit, it is OK to read more and write more, even to update only a few bytes.
Another limit to observe is the OS' max size of IO operation: it depends on the storage but most OSs would split IO tasks larger than 128K. The limit can be changed and best if it is synchronized with the similar limit in the storage.
Also keep the storage in mind. Many people forget that there is often only one storage device. What I'm trying to say is that starting a crapload of threads doesn't help IO unless you have multiple storage devices. Even a single CPU/core is capable of easily saturating a RAID10 with its 800 read IOPS and 400 write IOPS limits. (But a dedicated thread per storage device at least theoretically makes sense.)
Hope that helps. Other people here often mention Boost.Asio which I have no experience with - but it is worth checking.
P.S. Frankly, I would love to hear other (more informative) responses to your question. I was in the boat several times already, yet had no chance to really get down to it. Books/links/etc related to IO optimizations (regardless of platform) are welcome ;)
Random disk access tends to be orders of magnitude slower than sequential disk access. So much so that it can be useful to choose algorithms that might sound badly inefficient at first blush. For example, you might try this:
Create your join index, but instead of using it, just write out the list of pairs (A index, B index) to a disk file.
Sort this new file of pairs by the A index. Use a sort algorithm designed for external sorting (though I've not tried it myself, the STXXL library from stxxl.sourceforge.net looked promising when I was researching a similar problem).
Sequentially walk through the A record file and the sorted pair list. Read a huge chunk in, make all the relevant changes in memory, write the chunk out. Never touch that portion of the A record file again (since the changes you planned to make come in sequential order)
Go back, sort the pair file by the B index (again, using an external sort). Use this to update the B record file in the same manner.
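The steps above can be sketched for one direction (updating file A) as follows. To stay short, this sketch sorts the pairs in memory and seeks forward to each record instead of reading and writing huge chunks, so it illustrates the access order rather than the full chunked scheme. Link, apply_links and the layout parameters are assumptions, and the value is written in the machine's native byte order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::uint64_t, std::uint64_t> Link;  // (record to patch, number to store)

// Sorts the links by target record, then patches the file so that the
// write offsets only ever move forward.
bool apply_links(const std::string& path, std::size_t record_size,
                 std::size_t field_offset, std::vector<Link> links) {
    std::sort(links.begin(), links.end());  // ascending record number
    std::fstream f(path.c_str(), std::ios::in | std::ios::out | std::ios::binary);
    if (!f) return false;
    for (const Link& l : links) {
        f.seekp(static_cast<std::streamoff>(l.first * record_size + field_offset));
        f.write(reinterpret_cast<const char*>(&l.second), sizeof l.second);
    }
    f.flush();
    return static_cast<bool>(f);
}
```

For file B, the same function would be called with the pairs swapped, so that the B record number becomes the sort key and the A record number becomes the stored value.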
Instead of building a list of (key, record number A, record number B) I would leave out the key to save space and just build (record number A, record number B). I'd sort that table or file by the A's, sequentially seek to each A record, write the B number, then sort the list by the B's, sequentially seek to each B record, write the A number.
I'm doing very similar large file manipulations, and these newer machines are so damn fast it doesn't take long at all:
On a cheapo 2.4 GHz HP Pavilion with 3 GB RAM and 32-bit Vista, writing 3 million sequential 1,008-byte records to a new file takes 56 seconds, using Delphi library routines (as opposed to the Win API).
Sequentially seeking to each record in the file and writing 8 bytes using Win API FileSeek/FileWrite on a freshly booted machine takes 136 seconds. That's 3 million updates. Immediately rerunning the same code takes 108 seconds, since the O/S has some things cached.
Sorting record offsets first, then sequentially updating the files, is the way to go.