Parsing tab separated data

Parsing tab separated data - c++

I have a text file (~10GB) with the following format:
data1<TAB>data2<TAB>data3<TAB>data4<NEWLINE>
I want to scan through it and process only data2. What is the best (fastest) way to extract data2 in C++?
EDIT: Added NEWLINE

Read the file line by line. For each line, split on the tab. That will leave you with an array containing the fields, allowing you to work with the second field (data2).
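A minimal sketch of the split step, assuming each line has already been read into a std::string (for example with std::getline). The helper name second_field is made up for illustration. It stops searching after the second tab instead of building an array of all fields, which is enough when only data2 is needed.

```cpp
#include <cstddef>
#include <string>

// Copies the second tab-separated field of `line` into `out`.
// Returns false if the line has no tab (fewer than two fields).
bool second_field(const std::string& line, std::string& out)
{
    const std::size_t first_tab = line.find('\t');
    if (first_tab == std::string::npos)
        return false;
    const std::size_t start = first_tab + 1;
    const std::size_t second_tab = line.find('\t', start);
    // npos as a length means "up to the end of the line"
    const std::size_t len = (second_tab == std::string::npos)
                                ? std::string::npos
                                : second_tab - start;
    out.assign(line, start, len);
    return true;
}
```

Inside a `while (std::getline(file, line))` loop, call second_field(line, data2) and process data2 whenever it returns true.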

This sounds like a job for a higher level tool like shell utilities:
cut -f2 # from stdin
cut -f2 <my_file # from file
But nonetheless, you can do that with C++ as well:
void parse(std::istream& in)
{
    std::string skip, data2;
    // operator>> splits on any whitespace, so this assumes the fields
    // themselves contain no spaces
    while (in >> skip >> data2) {   // throwaway 1, then data2
        process(data2);
        in >> skip >> skip;         // throwaway 3 and 4
    }
}
// ...
parse(std::cin);
std::ifstream file("my_file");
parse(file);

Read the file a line at a time. It's pretty straightforward to parse out the tabs from there. You could use something like strtok() or a similar routine.

Well, open a file stream (which should be able to handle 10gig files) and then just jump to after the first tab, which is a '\t', read your data and then skip to the next newline and repeat.
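#include <fstream>
#include <limits>
#include <string>
int main(){
    std::ifstream fin("your_file.txt");
    const std::streamsize max = std::numeric_limits<std::streamsize>::max();
    std::string data2;
    // skip to the first tab, then read data2 up to the next tab
    while (fin.ignore(max, '\t') && std::getline(fin, data2, '\t')) {
        // do stuff with data2
        // skip to next line
        fin.ignore(max, '\n');
    }
}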

Since the file is of considerable size, you might consider a technique that lets you overlap your I/O with your processing. In response to a comment, you mentioned you were working on Linux. Provided you are using kernel 2.6 or later, you might consider Linux asynchronous I/O (AIO). Specifically, you would use aio_read to queue up some read requests, then use aio_suspend to wait for one (or more) of the requests to complete. As requests complete, you would scan through the buffers using a plain char* to locate the data you are interested in. For each piece of data you find, you could at that point create a std::string (although avoiding the copy may be beneficial) and process it. Once you have scanned a block, you would requeue it to read another block from the file. You continue doing this until you have processed every block in the file.
The code for this method will be more complex than reading the file line by line, but it may be considerably faster.

You could use iostream as others have suggested. Another way to go would be to simply use fscanf. For example:
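#include <stdio.h>
...
FILE* fp = fopen(path_to_file, "r");
char data[256];
/* %*s reads and discards a field; whitespace in the format matches the tabs
   and the newline. Stop as soon as data2 could not be read (EOF or error). */
while (fscanf(fp, "%*s %255s %*s %*s", data) == 1)
{
    /* do what you want with your data */
}
fclose(fp);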

Related

How to determine I've reached EOF with a non coded file?

I've looked up several examples of EOF handling, but in all of them EOF is checked on a file that the program opens itself, for example:
std::fstream myFile("file.txt", std::ios::in);
while(!myFile.eof()) {
/*program*/
}
However in this instance, I'm not using a file as part of the code. I'm just using basic cin commands. There's a command to quit, but let's say a user runs the program like this:
./program <myFile.txt> myOutput
Let's say that myFile had these commands in this:
add 1
add 2
delete 1
print
That's all fine, but they forgot to add a quit command at the end, so the code won't stop. So how do I get the code to detect EOF in this situation and stop?

The correct way to detect end-of-file is to check each input operation for success.
For example, with getline you'd use
std::string line;
while (std::getline(std::cin, line)) {
// do stuff with line
}
Or with >>:
while (std::cin >> x) {
// do stuff with x
}
This applies to all input streams, whether they're from files (fstream) or e.g. cin.

End of file (EOF) means there is nothing more to read from the stream. It is a condition, not something that is stored explicitly in the file itself. When input is redirected from a file, cin reaches EOF once the file is exhausted, so your code should still get there fine.
Another way is to keep reading from the buffer until there are no more bytes to read.

Is there a way to read an input string with "ifstream"

Currently my programme takes a string as an input which I access using argc and argv
then I use
FILE *fp, *input = stdin;
fp = fopen("input.xml","w+b");
while(fgets(mystring,100,input) != NULL)
{
fputs(mystring,fp);
}
fclose(fp);
I did this part only to create a file input.xml which I then supply to
ifstream in("input.xml");
string s((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
to get s as a string(basic string).
Is there a way to feed my input directly to ifstream? (i.e feeding a string to ifstream).

Let me get this straight:
You read a string from standard input
You write it to a file
You then read it from the file
And use the file stream object to create a string
That's crazy talk!
Drop the file streams and just instantiate the string from STDIN directly:
string s(
(std::istreambuf_iterator<char>(std::cin)),
std::istreambuf_iterator<char>()
);
Remember, std::cin is a std::istream, and the IOStreams part of the standard library is designed for this sort of generic access to data.
Be aware, though, that your approach with std::istreambuf_iterator is not the most efficient, which may be a concern if you're reading a lot of data.
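One commonly used alternative is to stream the underlying buffer into a std::ostringstream, sketched below. The function name read_all is made up, and whether this is actually faster than the iterator pair depends on the standard library implementation, so it is worth measuring.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads everything that is left in `in` into one string.
std::string read_all(std::istream& in)
{
    std::ostringstream buffer;
    buffer << in.rdbuf();   // copies straight from the stream's buffer
    return buffer.str();
}
```

With this helper, `std::string s = read_all(std::cin);` replaces the temporary file entirely.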

If I get it right then you want to read all the text provided through standard input into a string.
An easy way to achieve this could be the use of std::getline, which reads all the bytes up to a specific delimiter into a string. If you may assume that it is text content (which does not contain any \0-character), you could write the following:
std::string s;
std::getline(std::cin, s, '\0');

What is the "right" way to read a file with C++ fstreams?

I am using the standard C++ fstreams library and I am wondering what is the right way to use it. By experience I sort of figured out a small usage protocol, but I am not really sure about it. For the sake of simplicity let's assume that I just want to read a file, e.g., to filter its content and put it on another file. My routine is roughly as follows:
I declare a local ifstream i("filename") variable to open the file;
I check either i.good() or i.is_open() to handle the case where something went bad when opening, e.g., because the file does not exist; after, I assume that the file exists and that i is ok;
I call i.peek() and then again i.good() or i.eof() to rule out the case where the file is empty; after, I assume that I have actually something to read;
I use >> or whatever to read the file's content, and eof() to check that I am over;
I do not explicitly close the file - I rely on RAII and keep my methods as short and coherent as I can.
Is it a sane (correct, minimal) routine? In the negative case, how would you fix it? Please note that I am not considering races - synchronization is a different affair.

I would eliminate the peek/good/eof (your third step). Simply attempt to read your data, and check whether the attempted read succeeded or failed. Likewise, in the fourth step, just check whether your attempted read succeeded or not.
Typical code would be something like:
std::ifstream i("whatever");
if (!i)
error("opening file");
while (i >> your_data)
process(your_data);
if (!i.eof())
// reading failed before end of file

It's simpler than you have described. The first two steps are fine (but the second is not necessary if you follow the rest of my advice). Then you should attempt extraction, but use the extraction as the condition of a loop or if statement. If, for example, the file is formatted as a series of lines (or other delimited sequences) all of the same format, you could do:
std::string line;
while (std::getline(i, line)) {
// Parse line
}
The body of the loop will only execute if the line extraction works. Of course, you will need to check the validity of the line inside the loop.
If you have a certain series of extractions or other operations to do on the stream, you can place them in an if condition like so:
if (i >> some_string &&
i.get() == '-' &&
i >> some_int) {
// Use some_string and some_int
}
If the first extraction fails, i.get() will not execute due to short-circuit evaluation of &&. The body of the if statement will only execute if both extractions succeed and the character between them is '-'. If you have two extractions together, you can of course chain them:
if (i >> some_string >> some_int) {
// Use some_string and some_int
}
The second extraction in the chain will not occur if the first one fails. A failed extraction puts the stream in a state in which all following extractions also fail automatically.
For this reason, it's also fine to place the stream operations outside of the if condition and then check the state of the stream:
i >> some_string >> some_int;
if (i) {
// Use some_string and some_int
}
With both of these methods, you don't have to check for specific problems with the stream. Note that eof() returning false does not guarantee that the next read will succeed. A common case is when people use the following incorrect extraction loop:
// DO NOT DO THIS
while (!i.eof()) {
    std::getline(i, line);
    // Do something with line
}
The stream only reports end of file after a read has actually tried to go past the end. Most text files end with a final newline that text editors hide from you, so after the last real line has been extracted, eof() is still false. The loop therefore runs once more, attempts to extract a line that doesn't exist, and the body processes a line that was never read. People often observe this as "reading the last line of the file twice".

fscanf multiple lines [c++]

I am reading in a file with multiple lines of data like this:
:100093000202C4C0E0E57FB40005D0E0020C03B463
:1000A3000105D0E0022803B40205D0E0027C03027C
:1000B30002E3C0E0E57FB40005D0E0020C0BB4011D
I am reading in values byte by byte and storing them in an array.
fscanf_s(in_file,"%c", &sc); // start code
fscanf_s(in_file,"%2X", &iByte_Count); // byte count
fscanf_s(in_file,"%4X", &iAddr); // 2 byte address
fscanf_s(in_file,"%2X", &iRec_Type); // record type
for(int i=0; i<iByte_Count; i++)
{
fscanf_s(in_file,"%2X", &iData[i]);
iArray[(iMaskedAddr/16)][iMaskedNumMove+3+i]=iData[i];
}
fscanf_s(in_file,"%2X", &iCkS);
This is working great except when I get to the end of the first line. I need this to repeat until I get to the end of the file but when I put this in a loop it craps out.
Can I force the position to the beginning of the next line?
I know I can use a stream and all that but I am dealing with this method.
Thanks for the help

My suggestion is to dump fscanf_s and use either fgets or std::getline.
That said, your issue is handling the newlines, and the next beginning of record token, the ':'.
One method is to use fscanf_s("%c") until the ':' character is read or the end of file is reached:
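char start_of_record = '\0';
// fscanf_s requires the buffer size after every %c argument (here 1, passed
// the way MSVC expects it); stop on ':' or when nothing more can be read
while (fscanf_s(infile, "%c", &start_of_record, 1u) == 1
       && start_of_record != ':')
{
}
// Now process the header....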
The data the OP is reading is a standard format for transmitting binary data, usually for downloading into Flash Memories and EPROMs.
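As a sketch of the std::getline route suggested above: read one line at a time and decode it from the string, so the newline never has to be skipped by hand. The sample records appear to follow the Intel HEX layout (':' then a 2-digit byte count, a 4-digit address, a 2-digit record type, the data bytes and a 2-digit checksum). The type and function names below are made up for illustration, and the checksum is only extracted, not verified.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct HexRecord {
    unsigned byte_count = 0;
    unsigned address = 0;
    unsigned type = 0;
    std::vector<unsigned char> data;
    unsigned checksum = 0;
};

// Converts `digits` hex characters of `s`, starting at `pos`, to a number.
// Returns false on a short line or a non-hex character.
static bool hex_field(const std::string& s, std::size_t pos,
                      std::size_t digits, unsigned& out)
{
    if (pos + digits > s.size()) return false;
    out = 0;
    for (std::size_t i = 0; i < digits; ++i) {
        const char c = s[pos + i];
        unsigned d;
        if (c >= '0' && c <= '9') d = c - '0';
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else return false;
        out = out * 16 + d;
    }
    return true;
}

// Parses one line of the form :CCAAAATT<data bytes>SS.
bool parse_record(const std::string& line, HexRecord& rec)
{
    if (line.empty() || line[0] != ':') return false;
    if (!hex_field(line, 1, 2, rec.byte_count)) return false;
    if (!hex_field(line, 3, 4, rec.address)) return false;
    if (!hex_field(line, 7, 2, rec.type)) return false;
    rec.data.clear();
    std::size_t pos = 9;
    for (unsigned i = 0; i < rec.byte_count; ++i, pos += 2) {
        unsigned byte;
        if (!hex_field(line, pos, 2, byte)) return false;
        rec.data.push_back(static_cast<unsigned char>(byte));
    }
    return hex_field(line, pos, 2, rec.checksum);
}
```

A reader loop is then just `while (std::getline(in_file, line)) { if (parse_record(line, rec)) { /* store rec.data */ } }`. Anything after the checksum, such as a stray '\r', is ignored.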

Your topic clearly states that you are using C++ so, if I may, I suggest you use the standard C++ stream facilities.
To read line-by-line, you can use ifstream::getline. But again, you are not reading the file line by line, you are reading it field by field. So, you should try using ifstream::read, which lets you choose the amount of bytes to read from the stream.
UPDATE:
While doing an unrelated search over the net, I found out about a library called IOF which may help you with this task. Check it out.

Fastest way to find the number of lines in a text (C++)

I need to find the number of lines in a file before doing some operations on that file. I tried reading the file and incrementing a line_count variable at each iteration until I reached EOF, but it was not that fast in my case. I used both ifstream and fgets. They were both slow. Is there a hacky way to do this, which is also used by, for instance, BSD, the Linux kernel or Berkeley DB (maybe by using bitwise operations)?
The number of lines is in the millions in that file and it keeps getting larger, each line is about 40 or 50 characters. I'm using Linux.
Note:
I'm sure there will be people who might say use a DB idiot. But briefly in my case I can't use a db.

The only way to find the line count is to read the whole file and count the number of line-end characters. The fastest way to do this is probably to read the whole file into a large buffer with one read operation and then go through the buffer counting the '\n' characters.
As your current file size appears to be about 60 MB, this is not an attractive option. You can get some of the speed by not reading the whole file at once, but reading it in chunks, say of size 1 MB. You also say that a database is out of the question, but it really does look to be the best long-term solution.
Edit: I just ran a small benchmark on this and using the buffered approach (buffer size 1024K) seems to be a bit more than twice as fast as reading a line at a time with getline(). Here's the code - my tests were done with g++ using -O2 optimisation level:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <ctime>
using namespace std;
unsigned int FileRead( istream & is, vector <char> & buff ) {
is.read( &buff[0], buff.size() );
return is.gcount();
}
unsigned int CountLines( const vector <char> & buff, int sz ) {
int newlines = 0;
const char * p = &buff[0];
for ( int i = 0; i < sz; i++ ) {
if ( p[i] == '\n' ) {
newlines++;
}
}
return newlines;
}
int main( int argc, char * argv[] ) {
time_t now = time(0);
if ( argc == 1 ) {
cout << "lines\n";
ifstream ifs( "lines.dat" );
int n = 0;
string s;
while( getline( ifs, s ) ) {
n++;
}
cout << n << endl;
}
else {
cout << "buffer\n";
const int SZ = 1024 * 1024;
std::vector <char> buff( SZ );
ifstream ifs( "lines.dat" );
int n = 0;
while( int cc = FileRead( ifs, buff ) ) {
n += CountLines( buff, cc );
}
cout << n << endl;
}
cout << time(0) - now << endl;
}

Don't use C++ stl strings and getline ( or C's fgets), just C style raw pointers and either block read in page-size chunks or mmap the file.
Then scan the block at the native word size of your system ( ie either uint32_t or uint64_t) using one of the magic algorithms 'SIMD Within A Register (SWAR) Operations' for testing the bytes within the word. An example is here; the loop with the 0x0a0a0a0a0a0a0a0aLL in it scans for line breaks. ( that code gets to around 5 cycles per input byte matching a regex on each line of a file )
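The linked example is not reproduced here, so the following is only a sketch of the SWAR idea under stated assumptions: 64-bit words, a buffer already in memory, and a made-up function name. XOR with 0x0a0a0a0a0a0a0a0a turns every '\n' byte into zero, and a carry-free zero-byte test then marks exactly those bytes so they can be counted with a population count.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Counts '\n' bytes in buf[0..len), eight bytes at a time.
std::size_t count_newlines_swar(const char* buf, std::size_t len)
{
    const std::uint64_t newlines = 0x0a0a0a0a0a0a0a0aULL;
    const std::uint64_t low7     = 0x7f7f7f7f7f7f7f7fULL;
    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, buf + i, 8);            // safe unaligned load
        const std::uint64_t x = word ^ newlines;   // '\n' bytes become 0x00
        // Sets the top bit of each byte of x that is zero, and of no other.
        // Adding low7 to the masked bytes cannot carry between bytes.
        const std::uint64_t zero_bytes = ~(((x & low7) + low7) | x | low7);
        count += std::bitset<64>(zero_bytes).count();
    }
    for (; i < len; ++i)                           // leftover tail bytes
        if (buf[i] == '\n') ++count;
    return count;
}
```

Whether this beats a plain byte loop depends on the compiler, which may already vectorize the simple version, so it is worth benchmarking both.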
If the file is only a few tens or a hundred or so megabytes, and it keeps growing (ie something keeps writing to it), then there's a good likelihood that linux has it cached in memory, so it won't be disk IO limited, but memory bandwidth limited.
If the file is only ever being appended to, you could also remember the number of lines and previous length, and start from there.
It has been pointed out that you could use mmap with C++ STL algorithms and create a functor to pass to std::for_each. I suggested not doing it that way, not because it can't be done, but because there is no gain in writing the extra code. Or you can use Boost's mmapped iterator, which handles it all for you; but for the problem that the code I linked to was written for, this was much, much slower, and the question was about speed, not style.

You wrote that it keeps getting larger.
This sounds like it is a log file or something similar where new lines are appended but existing lines are not changed. If this is the case you could try an incremental approach:
Parse to the end of file.
Remember the line count and the offset of EOF.
When the file grows fseek to the offset, parse to EOF and update the line count and the offset.
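The steps above can be sketched as follows. This version uses std::ifstream with seekg rather than fseek, and the names LineCounter and count_appended_lines are made up for illustration. It assumes the file is append-only: truncation or in-place edits are not detected.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// State remembered between scans (hypothetical helper type).
struct LineCounter {
    std::streamoff offset = 0;   // how far into the file has been counted
    std::size_t lines = 0;       // '\n' characters seen so far
};

// Counts only the bytes appended since the previous call.
bool count_appended_lines(LineCounter& state, const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in || !in.seekg(state.offset, std::ios::beg)) return false;
    std::vector<char> buf(1 << 16);
    for (;;) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;                     // reached the current EOF
        for (std::size_t i = 0; i < got; ++i)
            if (buf[i] == '\n') ++state.lines;
        state.offset += static_cast<std::streamoff>(got);
    }
    return true;
}
```

Keep one LineCounter per file and call count_appended_lines whenever the file may have grown. Only the first call has to scan the whole file.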

There's a difference between counting lines and counting line separators. Some common gotchas to watch out for if getting an exact line count is important:
What's the file encoding? The byte-by-byte solutions will work for ASCII and UTF-8, but watch out if you have UTF-16 or some multibyte encoding that doesn't guarantee that a byte with the value of a line feed necessarily encodes a line feed.
Many text files don't have a line separator at the end of the last line. So if your file says "Hello, World!", you could end up with a count of 0 instead of 1. Rather than just counting the line separators, you'll need a simple state machine to keep track.
Some very obscure files use Unicode U+2028 LINE SEPARATOR (or even U+2029 PARAGRAPH SEPARATOR) as line separators instead of the more common carriage return and/or line feed. You might also want to watch out for U+0085 NEXT LINE (NEL).
You'll have to consider whether you want to count some other control characters as line breakers. For example, should a U+000C FORM FEED or U+000B LINE TABULATION (a.k.a. vertical tab) be considered going to a new line?
Text files from older versions of Mac OS (before OS X) use carriage returns (U+000D) rather than line feeds (U+000A) to separate lines. If you're reading the raw bytes into a buffer (e.g., with your stream in binary mode) and scanning them, you'll come up with a count of 0 on these files. You can't count both carriage returns and line feeds, because PC files generally end a line with both. Again, you'll need a simple state machine. (Alternatively, you can read the file in text mode rather than binary mode. The text interfaces will normalize line separators to '\n' for files that conform to the convention used on your platform. If you're reading files from other platforms, you'll be back to binary mode with a state machine.)
If you ever have a super long line in the file, the getline() approach can throw an exception causing your simple line counter to fail on a small number of files. (This is particularly true if you're reading an old Mac file on a non-Mac platform, causing getline() to see the entire file as one gigantic line.) By reading chunks into a fixed-size buffer and using a state machine, you can make it bullet proof.
The code in the accepted answer suffers from most of these traps. Make it right before you make it fast.
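A minimal sketch of such a state machine, assuming a single-byte or UTF-8 encoding and treating only LF, CR and CRLF as line terminators (the Unicode separators and form feeds mentioned above are not handled). The class name LineTally is made up. It can be fed fixed-size chunks, and a CRLF pair split across two chunks is still counted once.

```cpp
#include <cstddef>

// Counts lines rather than separators; feed it the file chunk by chunk.
class LineTally {
public:
    void feed(const char* data, std::size_t len)
    {
        for (std::size_t i = 0; i < len; ++i) {
            const char c = data[i];
            if (c == '\n') {
                // The LF of a CRLF pair was already counted at the CR.
                if (!after_cr_) ++lines_;
                after_cr_ = false;
                in_line_ = false;
            } else if (c == '\r') {
                ++lines_;
                after_cr_ = true;
                in_line_ = false;
            } else {
                after_cr_ = false;
                in_line_ = true;    // text seen since the last terminator
            }
        }
    }
    // Total so far, including a last line that has no terminator.
    std::size_t lines() const { return lines_ + (in_line_ ? 1 : 0); }
private:
    std::size_t lines_ = 0;
    bool after_cr_ = false;
    bool in_line_ = false;
};
```

Read the file in binary mode into a fixed buffer, call feed for each chunk, and read lines() once at the end.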

Remember that all fstreams are buffered, so in effect they already read in chunks and you do not have to recreate this functionality. All you need to do is scan the buffer. Don't use getline() though, as this forces a string to be sized and filled for every line. I would just use std::count_if and stream iterators.
#include <iostream>
#include <fstream>
#include <functional>
#include <iterator>
#include <algorithm>
struct TestEOL
{
    bool operator()(char c)
    {
        last = c;
        return last == '\n';
    }
    char last = '\n';        // so that an empty file counts as zero lines
};
int main()
{
    std::ifstream file("Plop.txt");
    TestEOL test;
    // count_if copies its predicate, so pass a reference wrapper;
    // otherwise test.last would never be updated in this object.
    std::size_t count = std::count_if(std::istreambuf_iterator<char>(file),
                                      std::istreambuf_iterator<char>(),
                                      std::ref(test));
    if (test.last != '\n')   // If the last character checked is not '\n'
    {                        // then the last line in the file has not been
        ++count;             // counted. So increment the count so we count
    }                        // the last line even if it is not '\n' terminated.
    std::cout << count << '\n';
}

It isn't slow because of your algorithm; it is slow because I/O operations are slow. I suppose you are using a simple O(n) algorithm that goes over the file sequentially. In that case, there is no faster algorithm that can optimize your program.
However, although there is no faster algorithm, there is a faster mechanism called a "memory-mapped file". Mapped files have some drawbacks and might not be appropriate for your case, so you'll have to read about them and decide for yourself.
Memory-mapped files won't let you implement an algorithm better than O(n), but they may reduce I/O access time.

You can only get a definitive answer by scanning the entire file looking for newline characters. There's no way around that.
However, there are a couple of possibilities which you may want to consider.
1/ If you're using a simplistic loop, reading one character at a time checking for newlines, don't. Even though the I/O may be buffered, function calls themselves are expensive, time-wise.
A better option is to read large chunks of the file (say 5M) into memory with a single I/O operation, then process that. You probably don't need to worry too much about special assembly instruction since the C runtime library will be optimized anyway - a simple strchr() should do it.
2/ If you're saying that the general line length is about 40-50 characters and you don't need an exact line count, just grab the file size and divide by 45 (or whatever average you deem to use).
3/ If this is something like a log file and you don't have to keep it in one file (may require rework on other parts of the system), consider splitting the file periodically.
For example, when it gets to 5M, move it (e.g., x.log) to a dated file name (e.g., x_20090101_1022.log) and work out how many lines there are at that point (storing it in x_20090101_1022.count), then start a new x.log log file. Characteristics of log files mean that this dated section that was created will never change, so you will never have to recalculate the number of lines.
To process the log "file", you'd just cat x_*.log through some process pipe rather than cat x.log. To get the line count of the "file", do a wc -l on the current x.log (relatively fast) and add it to the sum of all the values in the x_*.count files.

The thing that takes time is loading 40+ MB into memory. The fastest way to do that is to either memory-map it, or load it in one go into a big buffer. Once you have it in memory, one way or another, a loop traversing the data looking for \n characters is almost instantaneous, no matter how it is implemented.
So really, the most important trick is to load the file into memory as fast as possible. And the fastest way to do that is to do it as a single operation.
Otherwise, plenty of tricks may exist to speed up the algorithm. If lines are only added, never modified or removed, and if you're reading the file repeatedly, you can cache the lines read previously, and the next time you have to read the file, only read the newly added lines.
Or perhaps you can maintain a separate index file showing the location of known '\n' characters, so those parts of the file can be skipped over.
Reading large amounts of data from the hard drive is slow. There's no way around that.

If your file only grows, then Ludwig Weinzierl's answer is the best solution if you do not have control of the writers. Otherwise, you can make it even faster: increment the counter by one each time a line is written to the file. If multiple writers may try to write to the file simultaneously, then make sure to use a lock. Locking your existing file is enough. The counter can be 4 or 8 bytes written in binary in a file written under /run/<your-prog-name>/counter (which is RAM so dead fast).
Ludwig Algorithm
Initialize offset to 0
Read file from offset to EOF counting '\n' (as mentioned by others, make sure to use buffered I/O and count the '\n' inside that buffer)
Update offset with position at EOF
Save counter & offset to a file or in a variable if you only need it in your software
Repeat from "Read file ..." on a change
This is actually how various programs that process log files function (e.g. fail2ban comes to mind).
The first time, it has to process a huge file. Afterward, it is very small and thus goes very fast.
Proactive Algorithm
When creating the files, reset counter to 0.
Then each time you receive a new line to add to the file:
Lock file
Write one line
Load counter
Add one to counter
Save counter
Unlock file
This is very close to what database systems do, so a SELECT COUNT(*) FROM table on a table with millions of rows returns instantly. Databases also do that per index. So if you add a WHERE clause which matches a specific index, you also get the total instantly. Same principle as above.
Personal note: I see a huge amount of Internet software that works backward. A watchdog makes sense for various things in a software environment. However, in most cases, when something of importance happens, you should send a message at the time it happens, not use a backward concept of checking logs to detect that something bad just happened.
For example, you detect that a user tried to access a website and entered the wrong password 5 times in a row. You want to send an instant message to the admin to make sure there wasn't a 6th time which was successful and the hacker can now see all your users' data... If you use logs, the "instant message" is going to be late by seconds if not minutes.
Don't do processing backward.
