Fast CSV parser in C++


I am trying to read a .csv file with 20k+ lines, and each line has ~300 fields.
I am using my own code to read it line by line, then I split each line into fields and convert the fields to the corresponding data types (such as integer, double, etc.). The data are then transferred to class objects via their constructors.
However, I found it is not very efficient. It took about 1 min to read these 20k+ lines and create 20k+ objects.
I've googled for fast CSV parsers and found there are many options. I've tried some of them, but I am not very satisfied with their time performance.
Does anyone have a better method to read large .csv files? Many thanks in advance.

An efficient way to parse, or for that matter process, files is to read as much of the file as possible into memory before you start parsing.
File I/O has been, since the dawn of computers, one of the slower parts of a computer system. For example, parsing your data may take 1 microsecond. Reading the data from a hard drive may take 1 millisecond == 1000 microseconds.
I've made programs faster by allocating a large array for the data then reading the data into the array. Next I process the data in the array and repeat until the entire file is processed.
Another technique is called memory mapping, where the OS handles reading the file into memory as needed.
Please edit your post to show the code where the bottleneck is.
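
A minimal sketch of the read-first, parse-second approach, assuming C++17 and a simple CSV with no quoted fields or embedded commas. The names slurp, split and to_int are made up for illustration; they are not from any library.

```cpp
#include <charconv>
#include <cstddef>
#include <fstream>
#include <string>
#include <string_view>
#include <system_error>
#include <vector>

// Read the whole file into memory with one bulk read.
// Returns an empty string if the file cannot be opened.
std::string slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    const std::streamoff size = in.tellg();
    if (size <= 0) return {};
    std::string data(static_cast<std::size_t>(size), '\0');
    in.seekg(0);
    in.read(data.data(), static_cast<std::streamsize>(size));
    return data;
}

// Split one line into fields without copying: each field is a view into
// the original buffer. Assumption: no quoting or escaping in the CSV.
std::vector<std::string_view> split(std::string_view line, char delim = ',') {
    std::vector<std::string_view> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t end = line.find(delim, start);
        if (end == std::string_view::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}

// Convert a field to int; returns false unless the whole field is a number.
bool to_int(std::string_view field, int& value) {
    const char* last = field.data() + field.size();
    auto [ptr, ec] = std::from_chars(field.data(), last, value);
    return ec == std::errc() && ptr == last;
}
```

The views returned by split are only valid while the buffer they point into is alive, so the string returned by slurp has to outlive the parsed fields.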

Related

Most efficient file type for random data access

I am writing a password generation program. I have collected a list of around 30,000 English words and plan on picking from them at random by index.
Currently, I have all the words in a .txt file each separated by a newline character and organized by length.
My current plan is to write the program in C++ because that is the language I am most comfortable in so I could just load the entire file into memory, but that seems incredibly sloppy.
What would be a more efficient way (or file type like JSON if necessary) to do this? Thanks

30,000 words sounds like an insignificant amount of data to load. Even if it's ~50-500MB just load it in and forget about it.
On a modern system this takes a fraction of a second the first time (any SSD can do ~600MB/s or more), and even less once the file is in the OS disk buffer.
You'd only concern yourself with not loading it if you've got a file too big to fit in memory.
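
A short sketch of "just load it in", assuming one word per line. load_words and pick_word are hypothetical names, and the random engine is left to the caller; for real password generation you would want to check that your source of randomness is suitable on your platform.

```cpp
#include <cstddef>
#include <istream>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Load one word per line, skipping blank lines and stripping a trailing
// '\r' in case the file has Windows line endings.
std::vector<std::string> load_words(std::istream& in) {
    std::vector<std::string> words;
    std::string word;
    while (std::getline(in, word)) {
        if (!word.empty() && word.back() == '\r') word.pop_back();
        if (!word.empty()) words.push_back(word);
    }
    return words;
}

// Pick a word uniformly at random by index.
// Precondition: words is not empty.
template <class Rng>
const std::string& pick_word(const std::vector<std::string>& words, Rng& rng) {
    std::uniform_int_distribution<std::size_t> dist(0, words.size() - 1);
    return words[dist(rng)];
}
```

With a std::ifstream in place of the string stream, loading 30,000 words this way is a single pass, and each pick afterwards is a plain vector index.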

What is the best way to read large file (>2GB) (Text file contains ethernet data) and access the data randomly by different parameters?

I have a text file which looks like below:
0.001 ETH Rx 1 1 0 B45678810000000000000000AF0000 555
0.002 ETH Rx 1 1 0 B45678810000000000000000AF 23
0.003 ETH Rx 1 1 0 B45678810000000000000000AF156500
0.004 ETH Rx 1 1 0 B45678810000000000000000AF00000000635254
I need a way to read this file and form a structure and send it to client application.
Currently, I can do this with the help of circular queue by Boost.
The need here is to access different data at different time.
Ex: If I want to access data at 0.03sec while I am currently at 100sec, how can I do this in a best way instead of having file pointer track, or saving whole file to a memory which causes performance bottleneck? (Considering I have a file of size 2 GB with the above kind of data)

Usually the best practice for handling large files depends on the platform architecture (x86/x64) and OS (Windows/Linux etc.)
Since you mentioned boost, have you considered using boost memory mapped file?
Boost Memory Mapped File

It all depends on:
a. how frequently the data is accessed
b. what the access pattern is
Splitting the file
If you only need to access the data once in a while, the single 2GB log is fine. If not, the logger can be tuned to start a new log file at periodic intervals, or a later step can split the 2GB file into smaller files. Fetching the log file for the needed range and then picking out the needed lines is easier, because fewer bytes have to be read.
Cache
For very frequent data access, maintaining a cache is a good way to get faster responses, although, as you said, it has its own bottleneck. The size and layout of the cache depend on (b), the access pattern. Also, the larger the cache, the slower the response, so the size should be tuned.
Database
If the search pattern is unordered or grows dynamically with usage, a database will work. It will not respond as fast as a small cache, though.
A mix of a database, with tables organized to support the type of query, plus a smaller cache layer will give the best result.

Here is the solution I found:
Used Circular buffers (Boost lock free Buffers) for parsing file and to save the structured format of line
Used Separate threads:
One will continuously parse the file and push to lock free queue
One will continuously read from the buffer, process the line, form a structure and push to another queue
Whenever user needs random data, based on time, I will move the file pointer to particular line and read only the particular line.
Both threads have mutex wait mechanisms to stop parsing once the predefined buffer limit is reached.
The user can get data at any time, and there is no need to store the complete file contents. As soon as a frame is read, I delete it from the queue, so file size doesn't matter. The parallel threads that fill the buffers mean no time is spent reading the file on every request.
If I want to move to another line, I move the file pointer, wipe the existing data, and start the threads again.
Note:
The only issue now is moving the file pointer to a particular line.
I need to parse line by line until I reach that point.
If there is any way to move the file pointer directly to the required line, it would be helpful. A binary search or any other efficient search algorithm could be used to get what I want.
I would appreciate it if anybody could give a solution for this new issue!
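
One possible way to do the binary search mentioned above, sketched under assumptions: the timestamp is the first field of each line, timestamps are in ascending order, and the stream is opened in binary mode so offsets are plain byte positions. IndexEntry, build_index and read_line_at are hypothetical names. The file is scanned once to record where each line starts; after that a jump is a std::lower_bound plus one seek.

```cpp
#include <algorithm>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct IndexEntry {
    double timestamp;       // first field of the line, in seconds
    std::streamoff offset;  // byte offset of the start of the line
};

// One sequential pass over the input: remember where every line starts.
std::vector<IndexEntry> build_index(std::istream& in) {
    std::vector<IndexEntry> index;
    std::string line;
    in.clear();
    in.seekg(0);
    while (true) {
        const std::streamoff pos = in.tellg();
        if (!std::getline(in, line)) break;
        if (line.empty()) continue;
        index.push_back({std::strtod(line.c_str(), nullptr), pos});
    }
    return index;
}

// Seek to the first line whose timestamp is >= t and read it.
// Returns false if no such line exists.
bool read_line_at(std::istream& in, const std::vector<IndexEntry>& index,
                  double t, std::string& line) {
    auto it = std::lower_bound(
        index.begin(), index.end(), t,
        [](const IndexEntry& entry, double value) { return entry.timestamp < value; });
    if (it == index.end()) return false;
    in.clear();  // reset EOF/fail state left over from earlier reads
    in.seekg(it->offset);
    return static_cast<bool>(std::getline(in, line));
}
```

For a 2 GB file a full per-line index may itself be large; keeping only every Nth entry and reading forward a few lines after the seek would be one way to trade memory for a little extra parsing.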

What's the smallest possible file size on disk?

I'm trying to find a way to store a binary file at its smallest size on disk. I'm reading vehicle VINs and plate numbers from a database, 30 bytes per record. When I put one record in a txt file and save it, its size is 30B, but its size on disk is 4KB, which means that if I save 100000 files or more, it would kill storage space.
So my question is: how can I write these 30B to an individual binary file at its smallest size on disk, and what is the smallest possible size of 30B on disk, including other info such as file name and permissions?
Note: I do not want to save this text in a database; I just want to make separate binary files.

the smallest size of a file is always the cluster size of your disk, which is typically 4k. for data like this, having many records in a single file is really the only reasonable solution.
although another possibility would be to store those files in an archive, a zip file for example. under windows you can even access the zip contents pretty similar to ordinary files in explorer.
another creative possibility: store all the data in the filename only. a zero byte file takes only 1024 bytes in the MFT. (assuming NTFS)
edit: reading up on resident files, i found that on the newer 4k sector drives, the MFT entry is actually 4k, too. so it doesn't get smaller than this, whether the data size is 0 or not.
another edit: huge directories, with tens or hundreds of thousands of entries, will become quite unwieldy. don't try to open one in explorer, or be prepared to go drink a coffee while it loads.

Most file systems allocate disk space to files in chunks. It is not possible to take less than one chunk, except for possibly a zero-length file.
Google 'Cluster size'

You should consider using an indexed file library like gdbm: it associates arbitrary data with an arbitrary key. You won't spend a file on each association (only a single file for all of them).
You should reconsider your opposition to "databases". Sqlite is a library giving you SQL and database abilities. And there are NoSQL databases like mongodb.
Of course, all this is horribly operating system and file system specific (but gdbm and sqlite should work on many systems).
AFAIU, you can configure and use both gdbm and sqlite to store millions of entries of a few dozen bytes each quite efficiently.

On filesystems you have the same problem: the smallest allocation is one data block plus an inode. For example, in IBM JFS2 the smallest block size is 4k, and you also have an inode to allocate. The second problem is that you will write many files in a short time. Writing many inodes in a short time causes performance problems.
Every write operation must be journaled and committed, unless you use an old, non-journaled filesystem.
One idea is to gather many of your data records, put a separator between them, and write 200-1000 of them in one file.
For example:
0102030400506070809101112131415;;0102030400506070809101112131415;;...
You can index them with the file name, using sequence numbers or similar.
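
A variant of the many-records-per-file idea, sketched with fixed-width slots instead of separators, so record i always lives at byte offset i * 30 and can be read or overwritten with one seek. kRecordSize, write_record and read_record are hypothetical names, and padding with spaces is an assumption (a record that legitimately ends in spaces would lose them).

```cpp
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>

constexpr std::size_t kRecordSize = 30;  // bytes per record, as in the question

// Write `text` into slot `index`, truncated or space-padded to kRecordSize.
void write_record(std::ostream& out, std::size_t index, const std::string& text) {
    std::string slot = text.substr(0, kRecordSize);
    slot.resize(kRecordSize, ' ');
    out.clear();
    out.seekp(static_cast<std::streamoff>(index * kRecordSize));
    out.write(slot.data(), static_cast<std::streamsize>(slot.size()));
}

// Read slot `index` and strip the padding.
// Returns an empty string if the slot does not exist.
std::string read_record(std::istream& in, std::size_t index) {
    std::string slot(kRecordSize, '\0');
    in.clear();
    in.seekg(static_cast<std::streamoff>(index * kRecordSize));
    in.read(&slot[0], static_cast<std::streamsize>(slot.size()));
    if (!in) return {};
    slot.erase(slot.find_last_not_of(' ') + 1);  // drop trailing padding
    return slot;
}
```

With a std::fstream opened in binary mode, 100000 records stored this way take about 3 MB in total instead of one cluster each.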

skipping first half of a 59GB fastq file to process last half: read line-by-line, or fgetpos?

I have 2 ~59GB text files in ".fastq" format. fastq files are genomics read files from a sequencer. Every 4 lines is a new read, but the lines are of variable size.
The filesize is roughly 59GB, and there are about 211M reads, which means, give or take, approximately 211M*4 = 844M lines. The program I'm using, Bowtie, currently has the ability to do the following options:
"--skip 105M --qupto 105M"
which essentially means "skip the first 105M reads and only process up to the next 105M reads." In this way you can break up processing of the file. The problem is, the way that it does the skipping is incredibly slow. It just reads the first 105M reads as it normally would, but doesn't process them. Then it starts comparisons once it gets to the read value it was given.
I am wondering if I can use something like C/C++'s fsetpos to set the position to the middle of the file [or wherever] which I realize will probably put me somewhere in the middle of a line, and then from there find the beginning of the first full read to start processing rather than waiting for it to read approximately 422M lines until it gets where it needs to go. Does anybody have experience doing fsetpos on such a large file, and know whether or not the performance is any better than it is how it's currently doing it?
Thanks--
Nick

Yes, you can position to the middle of a file using C++.
For huge files, the performance is usually better than reading the data.
In general, the process for positioning within a file:
1) A request is made to read the directory entry for the file.
2) The directory is searched to find the track and sector for the file position.
   Note: Some filesystems may have directory extensions for large files, thus more data will need to be read.
3) On the next read, the hard drive is told to go to the given track and sector, then read in data.
You are saving the time it would take for all the previous data to pass through the communications port and into memory (or be ignored).
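
A sketch of the seek-then-resynchronize idea from the question, assuming strict 4-line FASTQ records (header starting with '@', sequence, a line starting with '+', quality). Since a quality line may itself start with '@', the check here is my own heuristic, not something Bowtie does: a line is accepted as a header only if the line two below it starts with '+'. next_record_offset is a hypothetical name.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Returns the byte offset of the first complete 4-line record found after
// seeking to `start`, or -1 if none is found before the input ends.
std::streamoff next_record_offset(std::istream& in, std::streamoff start) {
    in.clear();
    in.seekg(start);
    std::string line;
    if (start > 0) std::getline(in, line);  // drop the line we landed inside

    // Look at up to six lines: enough to cover all four possible
    // alignments of the landing point within a record.
    std::streamoff offset[6];
    char first[6];
    int n = 0;
    while (n < 6) {
        const std::streamoff pos = in.tellg();
        if (!std::getline(in, line)) break;
        offset[n] = pos;
        first[n] = line.empty() ? '\0' : line[0];
        ++n;
    }
    for (int i = 0; i + 2 < n; ++i) {
        if (first[i] == '@' && first[i + 2] == '+') return offset[i];
    }
    return -1;
}
```

After this, seeking to the returned offset and reading four lines at a time processes whole records only. On a 59GB file the stream must support 64-bit offsets, which std::ifstream does on common 64-bit platforms.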

What's the best way to write to more files than the kernel allows open at a time?

I have a very large binary file and I need to create separate files based on the id within the input file. There are 146 output files and I am using cstdlib and fopen and fwrite. FOPEN_MAX is 20, so I can't keep all 146 output files open at the same time. I also want to minimize the number of times I open and close an output file.
How can I write to the output files effectively?
I also must use the cstdlib library due to legacy code.
The executable must also be UNIX and windows cross-platform compatible.

A couple possible approaches you might take:
Keep a cache of open output file handles that is smaller than FOPEN_MAX. If a write needs to occur on a file that is already open, just do the write. Otherwise, close one of the handles in the cache and open the output file. If the data for a particular set of files is generally grouped together in the input file, this should work nicely with an LRU policy for the file handle cache.
Handle the output buffering yourself instead of letting the library do it for you: keep your own set of 146 (or however many you might need) output buffers and buffer the output to those, and perform an open/flush/close when a particular output buffer gets filled. You could even combine this with the above approach to really minimize the open/close operations.
Just be sure you test well for the edge conditions that can happen on filling or nearly filling an output buffer.
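
A rough sketch of the LRU handle cache from the first approach, using only fopen/fclose so it stays within the cstdlib constraint. FileHandleCache is a hypothetical name. Files are opened in append mode ("ab") on the assumption that an evicted file may be reopened later and must not be truncated, so stale output files should be removed before a run.

```cpp
#include <cstddef>
#include <cstdio>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class FileHandleCache {
public:
    explicit FileHandleCache(std::size_t capacity)
        : capacity_(capacity ? capacity : 1) {}
    ~FileHandleCache() {
        for (auto& entry : lru_) std::fclose(entry.second);
    }
    FileHandleCache(const FileHandleCache&) = delete;
    FileHandleCache& operator=(const FileHandleCache&) = delete;

    // Returns an open handle for `path`. Opens it in append mode if needed,
    // closing the least recently used handle when the cache is full.
    // Returns nullptr if the file cannot be opened.
    std::FILE* get(const std::string& path) {
        auto found = index_.find(path);
        if (found != index_.end()) {
            lru_.splice(lru_.begin(), lru_, found->second);  // mark most recent
            return found->second->second;
        }
        if (lru_.size() >= capacity_) {
            std::fclose(lru_.back().second);  // fclose flushes pending output
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        std::FILE* file = std::fopen(path.c_str(), "ab");
        if (!file) return nullptr;
        lru_.emplace_front(path, file);
        index_[path] = lru_.begin();
        return file;
    }

    std::size_t open_count() const { return lru_.size(); }
    bool is_open(const std::string& path) const { return index_.count(path) != 0; }

private:
    using Entry = std::pair<std::string, std::FILE*>;
    std::size_t capacity_;
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

With FOPEN_MAX at 20, a capacity somewhere below that (leaving room for the input file and the standard streams) would be the natural choice; a handle returned by get should be used right away, since a later get may close it.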

It may also be worth scanning the input file, making a list of each output id and sorting it so that you write all the file1 entries first, then all the file2 entries etc..

If you cannot increase the max FOPEN_MAX somehow, you can create a simple queue of requests and then close and re-open files as needed.
You can also keep track of the last write-time for each file, and try to keep the most recently written files open.

The solution seems obvious - open N files, where N is somewhat less than FOPEN_MAX. Then read through the input file and extract the contents of the first N output files. Then close the output files, rewind the input, and repeat.

First of all, I hope you are running as much in parallel as possible. There is no reason why you can't write to multiple files at the same time. I'd recommend doing what thomask said and queue requests. You can then use some thread synchronization to wait until the entire queue is flushed before allowing the next round of writes to go through.

You haven't mentioned if it's critical to write to these outputs in "real-time", or how much data is being written. Subject to your constraints, one option might be to buffer all the outputs and write them at the end of your software run.
A variant of this is to set up internal buffers of a fixed size. Once you hit the internal buffer limit, open the file, append, and close, then clear the buffer for more output. The buffers reduce the number of open/close cycles and give you bursts of writes, which the file system is usually set up to handle nicely. This would be for cases where you need somewhat real-time writes, and/or the data is bigger than available memory, and the file handles exceed some max in your system.
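
A sketch of the fixed-size buffer variant, assuming records are keyed by an integer id. BufferedSplitter, its Sink callback and append_to_file are hypothetical names; the output file naming scheme in particular is made up. The sink is passed in so the flush policy can be tried out without touching the disk.

```cpp
#include <cstddef>
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <utility>

class BufferedSplitter {
public:
    using Sink = std::function<void(int id, const std::string& bytes)>;

    BufferedSplitter(std::size_t limit, Sink sink)
        : limit_(limit), sink_(std::move(sink)) {}

    // Buffer `size` bytes for output `id`; hand the buffer to the sink
    // once it has reached the limit.
    void append(int id, const char* data, std::size_t size) {
        std::string& buffer = buffers_[id];
        buffer.append(data, size);
        if (buffer.size() >= limit_) flush(id, buffer);
    }

    // Call once at the end of the run to write whatever is left.
    void flush_all() {
        for (auto& entry : buffers_) flush(entry.first, entry.second);
    }

private:
    void flush(int id, std::string& buffer) {
        if (buffer.empty()) return;
        sink_(id, buffer);
        buffer.clear();
    }

    std::size_t limit_;
    Sink sink_;
    std::map<int, std::string> buffers_;
};

// One possible sink: open, append, close, so only one output file is
// open at any time. The "out_<id>.bin" naming is an assumption.
void append_to_file(int id, const std::string& bytes) {
    const std::string path = "out_" + std::to_string(id) + ".bin";
    if (std::FILE* file = std::fopen(path.c_str(), "ab")) {
        std::fwrite(bytes.data(), 1, bytes.size(), file);
        std::fclose(file);
    }
}
```

Usage would look like BufferedSplitter splitter(1 << 20, append_to_file); followed by append calls while reading the input and one flush_all at the end. The buffer limit is a tuning knob: larger buffers mean fewer open/close cycles but more memory across 146 outputs.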

You can do it in 2 steps.
1) Write the first 19 ids to one file, the next 19 ids to the next file and so on. So you need 8 output files (and the input file) opened in parallel for this step.
2) For every so created file create 19 (only 13 for the last one) new files and write the ids to it.
Independent of how large the input file is and how many id-datasets it contains, you always need to open and close 163 files. But you need to write the data twice, so it may only be worth it if the id-datasets are really small and randomly distributed.
I think in most cases it is more efficient to open and close the files more often.
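
The numbers in the two steps can be checked with a few lines of arithmetic. The breakdown of the 163 figure below is my reading of it (the input file once, each intermediate file once for writing and once for reading, and each final output file once); the answer itself does not spell it out. All names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kIds = 146;      // output ids, from the question
constexpr std::size_t kPerGroup = 19;  // FOPEN_MAX (20) minus the input file

// Number of intermediate files needed in step 1 (ceiling division).
constexpr std::size_t group_count(std::size_t ids, std::size_t per_group) {
    return (ids + per_group - 1) / per_group;
}

// Number of ids that end up in the last intermediate file.
constexpr std::size_t last_group_size(std::size_t ids, std::size_t per_group) {
    return ids - (group_count(ids, per_group) - 1) * per_group;
}

// Assumed breakdown: input opened once, intermediates opened twice
// (write in step 1, read in step 2), final outputs opened once.
constexpr std::size_t total_opens(std::size_t ids, std::size_t per_group) {
    return 1 + 2 * group_count(ids, per_group) + ids;
}

static_assert(group_count(kIds, kPerGroup) == 8, "8 intermediate files");
static_assert(last_group_size(kIds, kPerGroup) == 13, "13 ids in the last one");
static_assert(total_opens(kIds, kPerGroup) == 163, "163 opens in total");
```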

The safest method is to open a file and flush after writing, then close if no more recent writing will take place. Many things outside your program's control can corrupt the content of your file. Keep this in mind as you read on.
I suggest keeping an std::map or std::vector of FILE pointers. The map allows you to access file pointers by an ID. If the ID range is small, you could create a vector, reserving elements, and using the ID as an index. This will allow you to keep a lot of files open at the same time. Beware the concept of data corruption.
The limit of simultaneous open files is set by the operating system. For example, if your OS has a maximum of 10, you will have to make arrangements when the 11th file is requested.
Another trick is to reserve buffers in dynamic memory for each file. When all the data is processed, open a file (or more than one), write the buffer (using one fwrite), close and move on. This may be faster since you are writing to memory during the data processing rather than to a file. An interesting side note is that your OS may also page the buffers to the hard drive. The size and number of buffers is an optimization issue that is platform dependent (you'll have to adjust and test to get a good combination). Your program will slow down if the OS pages the memory to the disk.

Well, if I was writing it with your listed constraints in the OP, I would create 146 buffers and plop the data into them, then at the end, sequentially walk through the buffers and close/open a single file-handle.
You mentioned in a comment that speed was a major concern and that the naive approach is too slow.
There are a few things that you can start considering. One is a reorganizing of the binary file into sequential strips, which would allow parallel operations. Another is a least-recently used approach to your filehandle collection. Another approach might be to fork out to 8 different processes, each outputting to 19-20 files.
Some of these approaches will be more or less practical to write depending on how the binary file is organized (highly fragmented vs. highly sequential).
A major constraint is the size of your binary data. Is it bigger than cache? bigger than memory? streamed out of a tape deck? Continually coming off a sensor stream and only existing as a 'file' in memory? Each of those presents a different optimization strategy...
Another question is usage patterns. Are you doing occasional spike writes to the files, or are you having massive chunks written only a few times? That determines the effectiveness of the different caching/paging strategies of filehandles.

Assuming you are on a *nix system, the limit is per process, not system-wide. So that implies you could launch multiple processes, each responsible for a subset of the id's you are filtering for. Each could keep within the FOPEN_MAX for its process.
You could have one parent process reading the input file then sending the data to various 'write' processes through pipe special files.

"Fewest File Opens" Strategy:
To achieve a minimum number of file opens and closes, you will have to read through the input multiple times. Each time, you pick a subset of the ids that need sorting, and you extract only those records into the output files.
Pseudocode for each thread:
1) Run through the file, collect all the unique ids.
2) fseek() back to the beginning of the input.
3) For every group of 19 IDs:
   a) Open a file for each ID.
   b) Run through the input file, appending matching records to the corresponding output file.
   c) Close this group of 19 output files.
   d) fseek() to the beginning of the input.
This method doesn't work quite as nicely with multiple threads, because eventually the threads will be reading totally different parts of the file. When that happens, it's difficult for the file cache to be efficient. You could use barriers to keep the threads more-or-less in lock-step.
"Fewest File Operations" Strategy
You could use multiple threads and a large buffer pool to make only one run-through of the input. This comes at the expense of more file opens and closes (probably). Each thread would, until the whole file was sorted:
1) Choose the next unread page of the input.
2) Sort that input into 2-page buffers, one buffer for each output file. Whenever one buffer page is full:
   a) Mark the page as unavailable.
   b) If this page has the lowest page-counter value, append it to the file using fwrite(). If not, wait until it is the lowest (hopefully, this doesn't happen much).
   c) Mark the page as available, and give it the next page number.
You could change the unit of flushing output files to disk. Maybe you have enough RAM to collect 200 pages at a time, per output file?
Things to be careful about:
Is your data page-aligned? If not, you'll have to be clever about reading "the next page".
Make sure you don't have two threads fwrite()'ing to the same output file at the same time. If that happens, you might corrupt one of the pages.
