C++ performance - reading numbers from a file

I have a file that contains numbers separated by spaces (i.e. a matrix). For example:
1 2 3
4 5 6
I would like to read these numbers and store them in a two-dimensional array (int**). I've found numerous solutions for this, but I don't know which of them gives the best performance.
Furthermore, I would like to ask whether it is possible to read the mentioned file in parallel.
EDIT: The data I want to read is much bigger than this (I included the data only as an example). I would like to store big matrices, possibly with rows of different lengths, in the mentioned array for further manipulation.

For the best performance when reading the file in parallel, you can use a couple of copies of the file. You can also build a row index so that you can quickly seek to a particular row.
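A minimal sketch of the row-index idea, assuming each thread opens its own stream over the same file (literal copies of the file would work the same way); the file name, thread count, and int element type are assumptions:

#include <fstream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

int main() {
    // Pass 1: record the byte offset at which every row starts.
    std::vector<std::streampos> row_offsets;
    {
        std::ifstream in("matrix.txt");           // placeholder file name
        std::string line;
        std::streampos pos = in.tellg();
        while (std::getline(in, line)) {
            row_offsets.push_back(pos);
            pos = in.tellg();
        }
    }

    // Pass 2: each thread opens its own stream and parses its assigned rows.
    std::vector<std::vector<int>> matrix(row_offsets.size());
    const unsigned n_threads = 4;                 // placeholder thread count
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&matrix, &row_offsets, t, n_threads] {
            std::ifstream in("matrix.txt");
            std::string line;
            for (std::size_t r = t; r < row_offsets.size(); r += n_threads) {
                in.clear();
                in.seekg(row_offsets[r]);         // jump straight to the row via the index
                std::getline(in, line);
                std::istringstream row(line);
                int value;
                while (row >> value) matrix[r].push_back(value);
            }
        });
    }
    for (auto& w : workers) w.join();
}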

Related

Reading a text file with floats fast

I have a text file with ~4 million floats, i.e. 30 MB, and I want to read them into a vector<float>.
The code I have is very bare-bones, and it gets the job done:
#include <fstream>
#include <vector>

int main()
{
    std::vector<float> number_vec;
    std::fstream is("data.txt", std::ios_base::in);
    float number;
    while (is >> number)
    {
        //printf("%f ", number);
        number_vec.push_back(number);
    }
}
The problem is that it takes 20-30 s on a modern desktop workstation. At first I assumed I had done something stupid, but the more I stared at the code, the more I started accepting that maybe it was just the time it takes to parse all those ASCII float values into floats.
However, then I remembered that Matlab can read, and parse, the same file almost instantly (disk speed seems to be the limit), so it is obvious that my code is just very inefficient.
The only thing I could think of was to reserve the required elements in the vector in advance, but it didn't improve the situation at all.
Can someone help me understand why, and maybe help me write a faster solution?
EDIT: The text file looks like this:
152.00256 45.8569 5.87214 0.225 -0.0005 .....
i.e. One row, space delimited.
Please consider taking a look at the possible duplicates shared by @gsamaras and @Brad Allred. Anyway, I will try to reply with a simple answer that aims to keep the code simple and friendly, under the following two premises:
You have a constraint regarding the file and will change neither the file format nor the way the floats are represented textually in it.
You want to keep using the STL and are not looking for a library specialized/optimized for the challenge you are facing.
With those stated constraints and that mindset, my main suggestion would be to preallocate your containers, both the float vector and the internal iostream buffer:
Increase the performance of insertion into number_vec by reserving the required size in the std::vector. This can be achieved by a call to reserve, as explained in this Stack Overflow post.
Increase the performance of the iostream by setting the size of the buffer it uses internally. This can be achieved by a call to pubsetbuf, as explained in this other Stack Overflow post.
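A minimal sketch combining both suggestions; the buffer size and reserved element count below are assumptions, not measured values, and note that pubsetbuf must be called before the file is opened to take effect on common implementations:

#include <fstream>
#include <vector>

int main()
{
    std::vector<float> number_vec;
    number_vec.reserve(4000000);               // roughly the expected element count

    static char io_buffer[1 << 20];            // 1 MiB stream buffer (size is a guess)
    std::ifstream is;
    is.rdbuf()->pubsetbuf(io_buffer, sizeof io_buffer);
    is.open("data.txt");

    float number;
    while (is >> number)
        number_vec.push_back(number);
}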

Algorithm for ordering strings to and from disk efficiently using minimal internal memory resources

I have a very large (multiple terabytes) amount of strings stored on disk that I need to sort alphabetically and store in another file as quickly as possible (preferably in C/C++), using as little internal memory as possible. It is not an option to pre-index the strings beforehand, so I need to sort the strings whenever needed, in a close to real-time fashion.
What would be the best algorithm to use in my case? I would prefer a suggestion for a linear algorithm rather than just a link to an existing software library like Lucene.
You usually sort huge external data by chunking it into smaller pieces, operating on them and eventually merging them back. When choosing the sorting algorithm you usually take a look at your requirements:
If you need a time-complexity guarantee and a stable sort, you can go for mergesort (O(n log n) guaranteed), although it requires an additional O(n) of space.
If you are severely memory-bound, you might want to try smoothsort (constant extra memory, O(n log n) time).
Otherwise you might want to take a look at research in the GPGPU accelerator field, such as GPUTeraSort.
Google's servers usually have this sort of problem.
Construct a simple digital tree (trie).
The memory used will be much smaller than the input data, because many words share a common prefix. While adding data to the tree, you mark the last node of each word as an end-of-word node (by incrementing a counter on it). Once all the words have been added, you do a DFS (visiting children in the desired sort order, e.g. a->z) and output the data to a file. The time complexity is proportional to the memory used. It is hard to state the complexity precisely because it depends on the strings (many short strings give better behavior), but it is still much better than the input size, O(n*k), where n is the number of strings and k is the average string length.
PS: To deal with the memory-size problem, you can split the file into smaller parts, sort each part with this method, and then, if you end up with, say, 1000 sorted files, keep track of the current first word of each file (like queues), repeatedly output the smallest of them, and read the next word from that file. This merge runs in very little time.
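A minimal sketch of the counting-trie idea for lowercase ASCII words; the 26-letter alphabet, the child-array layout, and the file names are assumptions:

#include <fstream>
#include <memory>
#include <string>

struct TrieNode {
    int end_count = 0;                          // how many words end at this node
    std::unique_ptr<TrieNode> child[26];        // one child per lowercase letter
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        int i = c - 'a';
        if (!node->child[i]) node->child[i] = std::make_unique<TrieNode>();
        node = node->child[i].get();
    }
    ++node->end_count;                          // duplicates are counted, not lost
}

// DFS in a->z order writes the words out in sorted order.
void dump(const TrieNode& node, std::string& prefix, std::ofstream& out) {
    for (int k = 0; k < node.end_count; ++k) out << prefix << '\n';
    for (int i = 0; i < 26; ++i) {
        if (node.child[i]) {
            prefix.push_back(static_cast<char>('a' + i));
            dump(*node.child[i], prefix, out);
            prefix.pop_back();
        }
    }
}

int main() {
    TrieNode root;
    std::ifstream in("strings.txt");            // placeholder file names
    std::ofstream out("sorted.txt");
    std::string word;
    while (in >> word) insert(root, word);
    std::string prefix;
    dump(root, prefix, out);
}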
I suggest you use the Unix "sort" command, which can easily handle such files.
See How could the UNIX sort command sort a very large file? .
Before disk drives even existed, people wrote programs to sort lists that were far too large to hold in main memory.
Such programs are known as external sorting algorithms.
My understanding is that the Unix "sort" command uses the merge sort algorithm.
Perhaps the simplest version of the external sorting merge sort algorithm works like this (quoting from Wikipedia: merge sort):
Name four tape drives as A, B, C, D, with the original data on A:
Merge pairs of records from A; writing two-record sublists alternately to C and D.
Merge two-record sublists from C and D into four-record sublists; writing these alternately to A and B.
Merge four-record sublists from A and B into eight-record sublists; writing these alternately to C and D.
Repeat until you have one list containing all the data, sorted, in log2(n) passes.
Practical implementations typically have many tweaks:
Almost every practical implementation takes advantage of available RAM by reading many items into RAM at once, using some in-RAM sorting algorithm, rather than reading only one item at a time.
Some implementations are able to sort lists even when some or every item in the list is too large to hold in the available RAM.
polyphase merge sort
As suggested by Kaslai, rather than only 4 intermediate files, it is usually quicker to use 26 or more intermediate files. However, as the external sorting article points out, if you divide up the data into too many intermediate files, the program spends a lot of time waiting for disk seeks; too many intermediate files make it run slower.
As Kaslai commented, using larger RAM buffers for each intermediate file can significantly decrease the sort time -- doubling the size of each buffer halves the number of seeks. Ideally each buffer should be sized so the seek time is a relatively small part of the total time to fill that buffer. Then the number of intermediate files should be picked so the total size of all those RAM buffers put together comes close to but does not exceed available RAM. (If you have very short seek times, as with a SSD, the optimal arrangement ends up with many small buffers and many intermediate files. If you have very long seek times, as with tape drives, the optimal arrangement ends up with a few large buffers and few intermediate files. Rotating disk drives are intermediate).
etc. -- See the Knuth book "The Art of Computer Programming, Vol. 3: Sorting and Searching" for details.
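A minimal sketch of the run-merging pass those implementations perform, assuming the in-RAM sorting pass has already produced sorted runs as newline-delimited files (file names are placeholders), and using buffered ifstreams plus a priority queue instead of explicit tape passes:

#include <fstream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Sorted runs produced earlier by the in-RAM sorting pass.
    std::vector<std::string> run_names = {"run0.txt", "run1.txt", "run2.txt"};  // placeholders
    std::vector<std::ifstream> runs;
    for (const auto& name : run_names) runs.emplace_back(name);

    // Min-heap of (current line, index of the run it came from).
    using Entry = std::pair<std::string, std::size_t>;
    auto cmp = [](const Entry& a, const Entry& b) { return a.first > b.first; };
    std::priority_queue<Entry, std::vector<Entry>, decltype(cmp)> heap(cmp);

    std::string line;
    for (std::size_t i = 0; i < runs.size(); ++i)
        if (std::getline(runs[i], line)) heap.emplace(line, i);

    std::ofstream out("sorted.txt");
    while (!heap.empty()) {
        auto [value, i] = heap.top();
        heap.pop();
        out << value << '\n';
        if (std::getline(runs[i], line)) heap.emplace(line, i);  // refill from the same run
    }
}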
Use as much memory as you can and chunk your data. Read one chunk at a time into memory.
Step 1) Sort entries inside chunks
For each chunk:
Use introsort to sort your chunk. But to avoid copying your strings around and having to deal with variable-sized strings and memory allocations (at this point it becomes interesting and relevant whether you actually have fixed- or max-size strings or not), preallocate a std::array or other fitting container with pointers to your strings, pointing into a memory region inside the current data chunk. => So your introsort swaps the pointers to your strings instead of swapping the actual strings.
Loop over each entry in your sort array and write the resulting (ordered) strings back to a corresponding sorted-strings file for this chunk, as in the sketch below.
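A minimal sketch of step 1's pointer-based sort for one chunk; the function and file names are placeholders, the chunk is assumed to be newline-delimited and to end with a newline, and std::sort stands in for an explicit introsort (it is typically implemented as one):

#include <algorithm>
#include <cstring>
#include <fstream>
#include <vector>

// Sort one chunk (already read into `data`) by sorting pointers into the buffer
// instead of moving the strings themselves.
void sort_chunk(std::vector<char>& data, const char* out_name) {
    std::vector<const char*> entries;
    char* p = data.data();
    char* end = p + data.size();
    while (p < end) {
        char* nl = static_cast<char*>(std::memchr(p, '\n', end - p));
        if (!nl) break;                     // chunk is assumed to end with a newline
        *nl = '\0';                         // terminate the string in place
        entries.push_back(p);
        p = nl + 1;
    }
    // Only the pointers are swapped; the string bytes stay where they are.
    std::sort(entries.begin(), entries.end(),
              [](const char* a, const char* b) { return std::strcmp(a, b) < 0; });

    std::ofstream out(out_name);
    for (const char* s : entries) out << s << '\n';
}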
Step 2) Merge all strings from sorted chunks into resulting sorted strings file
Allocate a "sliding" window memory region for all sorted strings files at once. To give an example: If you have 4 sorted strings files, allocate 4 * 256MB (or whatever fits, the larger the less (sequential) disk IO reads required).
Fill each window by reading the strings into it (so, read as much strings at once as your window can store).
Use MergeSort to compare any of your chunks, using a comparator to your window (e.g. stringInsideHunkA = getStringFromWindow(1, pointerToCurrentWindow1String) - pointerToCurrentWindow1String is a reference that the function advances to the next string). Note that if the string pointer to your window is beyond the window size (or the last record didn't fit to the window read the next memory region of that chunk into the window.
Use mapped IO (or buffered writer) and write the resulting strings into a giant sorted strings final
I think this could be an IO efficient way. But I've never implemented such thing.
However, with regard to your file size and your "non-functional" requirements, which are still unknown to me, I suggest you also consider benchmarking a batch import using LevelDB [1]. It's actually very fast, minimizes disk IO, and even compresses your resulting strings file to about half the size without impact on speed.
[1] http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html
Here is a general algorithm that will be able to do what you want with just a few gigs of memory. You could get away with much less, but the more you have, the less disk overhead you have to deal with. This assumes that all of the strings are in a single file; however, it could be applied to a multiple-file setup.
1: Create some files to store loosely sorted strings in. For terabytes of data, you'd probably want 676 of them. One for strings starting in "aa", one for "ab", and so on until you get to "zy" and "zz".
2: For each file you created, create a corresponding buffer in memory. A std::vector<std::string> perhaps.
3: Determine a buffer size that you want to work with. This should not exceed much beyond 1/2 of your available physical memory.
4: Load as many strings as you can into this buffer.
5: Truncate the file so that the strings in your buffer are no longer on disk. This step can be delayed for later or omitted entirely if you have the disk space to work with or the data is too sensitive to lose in the case of process failure. If truncating, make sure you load your strings from the end of the file, so that the truncation is almost a NOP.
6: Iterate over the strings and store them in their corresponding buffer.
7: Flush all of the buffers to their corresponding files. Clear all the buffers.
8: Go to step 4 and repeat until you have exhausted your source of strings.
9: Read each file to memory and sort it with whatever algorithm you fancy. On the off chance you end up with a file that is larger than your available physical memory, use a similar process from above to split it into smaller files.
10: Overwrite the unsorted file with this new sorted file, or append it to a monolithic file.
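A minimal sketch of the bucketing pass (steps 4 through 7), assuming lowercase ASCII strings; the file names and buffer limit are placeholders:

#include <fstream>
#include <string>
#include <vector>

// Map a string to one of the 676 "aa".."zz" buckets (lowercase ASCII assumed).
int bucket_index(const std::string& s) {
    int a = (s.size() > 0) ? s[0] - 'a' : 0;
    int b = (s.size() > 1) ? s[1] - 'a' : 0;
    return a * 26 + b;
}

int main() {
    const std::size_t kBufferLimit = 1u << 30;   // ~1 GiB of buffered strings (assumption; see step 3)
    std::vector<std::vector<std::string>> buffers(26 * 26);

    std::ifstream in("strings.txt");             // placeholder input
    std::string s;
    std::size_t buffered_bytes = 0;
    auto flush = [&] {                           // step 7: append each buffer to its bucket file
        for (int i = 0; i < 26 * 26; ++i) {
            if (buffers[i].empty()) continue;
            std::string name = std::string("bucket_") + char('a' + i / 26) + char('a' + i % 26) + ".txt";
            std::ofstream out(name, std::ios::app);
            for (const auto& str : buffers[i]) out << str << '\n';
            buffers[i].clear();
        }
        buffered_bytes = 0;
    };

    while (in >> s) {                            // steps 4 and 6: fill the per-bucket buffers
        buffered_bytes += s.size();
        buffers[bucket_index(s)].push_back(std::move(s));
        if (buffered_bytes >= kBufferLimit) flush();
    }
    flush();
    // Step 9 then sorts each bucket_??.txt individually.
}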
If you keep the individual files rather than a monolithic file, you can make insertions and deletions relatively quickly. You would only have to load in, insert, and sort the value into a single file that can be read entirely into memory. Now and then you might have to split a file into smaller files; however, this merely amounts to looking around the middle of the file for a good place to split it and then just moving everything after that point to another file.
Good luck with your project.

Efficient implementation of tail -n [duplicate]

Possible Duplicate:
How would you implement tail efficiently?
A friend of mine was asked how he'd implement tail -n.
To be clear, we are required to print the last n lines of the file specified.
I thought of using an array of n strings and overwriting them in a cyclic manner.
But if we are given, say a 10 GB file, this approach doesn't scale at all.
Is there a better way to do this?
Memory-map the file, iterate from the end looking for an end of line n times, then write from that point to the end of the file to standard out.
You could potentially complicate the solution by not mapping the whole file, but just the last X KB (say, a couple of memory pages) and seeking there. If there aren't enough lines, memory-map a larger region until you get what you want. You can use some heuristic to guess how much memory to map (say, 1 KB per line as a rough estimate). I would not really do this, though.
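A minimal POSIX sketch of the whole-file mapping approach; error handling is kept to a minimum, and the file name and n are placeholders:

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char* path = "big.log";   // placeholder file name
    long n = 10;                    // number of lines to print

    int fd = open(path, O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) return 1;

    void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) return 1;
    const char* data = static_cast<const char*>(mapped);

    // Scan backwards from the end, counting newlines until n lines are covered
    // (a trailing newline on the last line is not counted as a line break).
    off_t pos = st.st_size;
    long seen = 0;
    while (pos > 0) {
        if (data[pos - 1] == '\n' && pos != st.st_size) {
            if (++seen == n) break;
        }
        --pos;
    }

    // Write everything from that point to the end of the file to stdout.
    fwrite(data + pos, 1, st.st_size - pos, stdout);

    munmap(mapped, st.st_size);
    close(fd);
}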
"It depends", no doubt. Given the size of the file should be knowable, and given a sensible file-manipulation library which can 'seek' to the end of a very large file without literatally traversing each byte in turn or thrashing virtual memory, you could simply scan backwards from the end counting newlines.
When you're dealing with files that big though, what do you do about the degenerate case where n is close to the number of lines in the multi-gigabyte file? Storing stuff in temporary strings won't scale then, either.

How to create a very large array of unique integers?

Part of an assignment for my parallel programming class is to create a file of 313 million unique integers. I figure I'll multiply two random numbers together to get a very large range, but what is the best way to check for uniqueness?
Should I create an array and search the array each time for matches? That seems as though it will be very inefficient.
Edit - the problem is to eventually sort the list (using threading / multiple cores), so having a sequential list doesn't work.
You could fill the file up sequentially - the resulting file would look like:
0 1 2 3 4 5 6 7 8 9 ... 312999999
These numbers would be very simple to generate, and would be guaranteed to be unique.
For a parallel programming assignment, I think you need to divide the task into small, independent pieces.
313M is still less than 2G, which is the maximum value a signed 32-bit integer can represent.
My suggestion is that you break each 32-bit number into an upper 16 bits and a lower 16 bits. You assign fixed upper bits to each thread, and the thread generates the lower 16 bits. Then you combine the two parts; that way, every number generated by every thread is different.
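A minimal sketch of that partitioning, where each "upper bits" block covers 65,536 values and each thread takes every n-th block; the thread count, output format (one decimal number per line, one file per thread), and shuffling of the lower bits are assumptions:

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <random>
#include <string>
#include <thread>
#include <vector>

int main() {
    const std::uint64_t total = 313000000;                // numbers required
    const std::uint32_t block_size = 1u << 16;            // one "upper 16 bits" value per block
    const std::uint32_t blocks = static_cast<std::uint32_t>((total + block_size - 1) / block_size);
    const unsigned n_threads = 4;                          // placeholder thread count

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([=] {
            // Each thread writes its own file, so no synchronization is needed.
            std::ofstream out("unique_" + std::to_string(t) + ".txt");
            std::mt19937 rng(t);
            for (std::uint32_t high = t; high < blocks; high += n_threads) {
                // Shuffle the low 16 bits so the output isn't simply sequential.
                std::vector<std::uint32_t> lows(block_size);
                for (std::uint32_t i = 0; i < block_size; ++i) lows[i] = i;
                std::shuffle(lows.begin(), lows.end(), rng);
                for (std::uint32_t low : lows) {
                    std::uint64_t index = std::uint64_t(high) * block_size + low;
                    if (index >= total) continue;          // trim the final, partial block
                    out << ((high << 16) | low) << '\n';   // upper and lower halves combined
                }
            }
        });
    }
    for (auto& w : workers) w.join();
}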

Dynamic length string array

I am a newbie to C++.
I want to write a program to read values from a file which has data in the format:
text<tab or space>text
text<tab or space>text
...
(... indicates more such lines)
The number of lines in the file varies. Now, I want to read this file and store the text in either one 2D string array or two 1D string arrays.
How do I do it?
Furthermore, I want to run a for loop over this array to process each entry in the file. How can I write that loop?
You're looking for a resizable array. Try std::vector<std::string>. You can find documentation here.
Edit: You could probably also manage to do this by opening the file, looping through it to count the lines, generating your fixed-size array, closing and reopening the file, and then looping through it again to populate the array. However, this is not recommended: it will cost far more at runtime than the slight overhead involved in managing a vector, and it will make your code much more confusing for anyone who reads it.
(ps - I agree with @matthias-vallentin, you should've been able to find this on the site with minimal work)
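A minimal sketch of the std::vector<std::string> approach for the two-column format above, using two 1D "arrays"; the file name is a placeholder:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("data.txt");                 // placeholder file name
    std::vector<std::string> first, second;       // one vector per column

    std::string a, b;
    while (in >> a >> b) {                        // >> skips tabs and spaces alike
        first.push_back(a);
        second.push_back(b);
    }

    // Loop over the entries to process each line of the file.
    for (std::size_t i = 0; i < first.size(); ++i) {
        std::cout << first[i] << " -> " << second[i] << '\n';
    }
}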