search in a big binary file loading blocks in buffer - c++

I want to know the algorithm for searching a big file in blocks, loading them into a memory buffer.
So I have a huge file and will read it into a small buffer and scan it for the "needle" word:
while ((read = fread(buff, 1, buff_size, file)) > 0)
    if (strstr(buff, needle)) printf("found!\n");
But what if the "needle" in the "hay" is cut by a block border? Then it will be impossible to find.
One solution I see is to fseek back before reading each next block, reducing the offset by the length of the "needle" string:
offset += read - strlen(needle);
if (offset > 0) fseek(file, offset, SEEK_SET);
Am I right?

You are right that you'll need to handle the case where the search pattern spans two blocks.
You are also right that seek can be one solution.
But there are other solutions that don't use seek.
Solution 1
A solution could be to copy the last part of the buffer, i.e. the last strlen(needle) characters, to a little buffer capable of holding 2 times strlen(needle) characters.
Then, when reading the next block, you copy the first strlen(needle) characters of the new buffer into the little buffer so that they are concatenated with the part from the end of the previous buffer.
Finally you can search the little buffer for the needle.
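A minimal sketch of Solution 1 (the function name, the 4 KB block size, and the MAX_NEEDLE bound are illustrative assumptions; like the question's code it uses strstr, so the data is effectively treated as text):

#include <stdio.h>
#include <string.h>

#define MAX_NEEDLE 64  /* assumed upper bound on strlen(needle) */

int contains(FILE *file, const char *needle) {
    size_t nlen = strlen(needle);
    char buff[4096 + 1];            /* main block, NUL-terminated */
    char small[2 * MAX_NEEDLE + 1]; /* end of previous block + start of next */
    size_t tail = 0;                /* number of saved trailing characters */
    size_t read;

    if (nlen == 0 || nlen > MAX_NEEDLE) return 0;

    while ((read = fread(buff, 1, sizeof(buff) - 1, file)) > 0) {
        buff[read] = '\0';
        /* 1. search across the border: previous tail + head of this block */
        if (tail > 0) {
            size_t head = read < nlen ? read : nlen;
            memcpy(small + tail, buff, head);
            small[tail + head] = '\0';
            if (strstr(small, needle)) return 1;
        }
        /* 2. search the block itself */
        if (strstr(buff, needle)) return 1;
        /* 3. save the last strlen(needle) characters for the next round */
        tail = read < nlen ? read : nlen;
        memcpy(small, buff + read - tail, tail);
    }
    return 0;
}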
Solution 2
A solution could be to read from the file into buffer + strlen(needle), i.e. avoid overwriting the first strlen(needle) characters of the buffer. The number of characters read from the file must be decreased accordingly (i.e. buff_size - strlen(needle)).
When done with a buffer, you copy its last strlen(needle) characters to the start of the buffer and read more data from the file into buffer + strlen(needle).
For the first search in the buffer, you'll have to skip the first strlen(needle) characters (or make sure they don't match your pattern, e.g. by initialization). Subsequent searches should search the whole buffer.
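A minimal sketch of Solution 2, under the same assumptions (illustrative names and sizes, strstr-based text search); here the very first search simply starts after the reserved prefix instead of initializing it to a non-matching value:

#include <stdio.h>
#include <string.h>

int contains(FILE *file, const char *needle) {
    enum { BUFF_SIZE = 4096 };
    size_t nlen = strlen(needle);
    char buff[BUFF_SIZE + 1];   /* +1 for the terminating NUL */
    int first = 1;
    size_t read;

    if (nlen == 0 || nlen >= BUFF_SIZE) return 0;
    memset(buff, 0, nlen);      /* so the carry below never copies uninitialized bytes */

    while ((read = fread(buff + nlen, 1, BUFF_SIZE - nlen, file)) > 0) {
        buff[nlen + read] = '\0';
        /* the very first block has no previous data, so skip the prefix */
        if (strstr(first ? buff + nlen : buff, needle)) return 1;
        first = 0;
        /* carry the last strlen(needle) characters over to the front */
        memmove(buff, buff + read, nlen);
    }
    return 0;
}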

Related

How to correctly buffer lines using fread

I want to use fread() for IO for some reason (speed and ...). I have a file with lines of different sizes. Using code like:
while ( !EOF ) {
    fread(buffer, 500MB, 1, fileName);
    // process buffer
}
the last line may be read incompletely, and we have to read it again in the next pass. So how do I force fread() to continue from the beginning of the last line?
Or, if possible, how do I force fread() to read more than 500MB until it reaches a '\n' or another specific character?
Thanks All
Ameer.
Assuming a buffer of bytes in which you have reverse-searched and found a '\n' character at position pos, you want to roll back by the length of the buffer minus pos. Call this step.
You can use fseek to move the file pointer back by this much:
int fseek( FILE *stream, long offset, int origin );
In your case:
int ret = fseek(stream, -step, SEEK_CUR);
This will involve re-reading part of the file, and a fair bit of jumping around - the comments have suggested alternative ways that may be quicker.
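Putting it together, a minimal sketch of the roll-back approach (block size, file name, and the processing stub are illustrative; the 500MB block is scaled down for the example):

#include <stdio.h>

#define BUF_SIZE (1 << 20) /* 1 MB blocks for the sketch */

static void process_complete_lines(const char *buf, size_t len) {
    /* ... parse only the complete lines in buf[0..len) ... */
    (void)buf; (void)len;
}

int main(void) {
    static char buffer[BUF_SIZE];
    FILE *file = fopen("data.txt", "rb");
    if (!file) { perror("fopen"); return 1; }

    size_t n;
    while ((n = fread(buffer, 1, sizeof(buffer), file)) > 0) {
        /* reverse-find the last newline in this block */
        size_t pos = n;
        while (pos > 0 && buffer[pos - 1] != '\n') --pos;

        if (pos == 0) {
            /* no newline at all: the line is longer than the whole buffer */
            process_complete_lines(buffer, n);
            continue;
        }
        process_complete_lines(buffer, pos);

        /* roll back by the incomplete tail so the next fread re-reads it */
        long step = (long)(n - pos);
        if (step > 0 && fseek(file, -step, SEEK_CUR) != 0) {
            perror("fseek");
            break;
        }
    }
    fclose(file);
    return 0;
}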

fstream - How to seekg() to position x from end

I'm looking for a way to set my get pointer at position x from the end of an fstream.
I tried
file.seekg(-x, ios_base::end);
But according to this question, this line is undefined behavior.
How can I, in any way, seek to position x from the end of a fstream?
If you want to set your pointer at position x from the end, you need to know where the end is, so you need to begin with:
file.seekg(0, ios_base::end);
std::streamoff length = file.tellg();
Once you know the file length, you can set your pointer:
file.seekg(length - x, ios_base::beg);
The problem is related to the text mode, which may convert certain sequences of bytes in the file into a sequence of different size.
Take for example text mode under Windows. Here the byte sequence '\r' '\n' in a file on the disk is converted at reading time into '\n'. Now imagine that you have a file:
Hello\r\n
World\r\n
If you position yourself at file.seekg(-1, ios_base::end); the result is undefined, because it's not clear what the result should be:
Should you simply be positioned at the '\n'? But in this case, reading the file in the reverse order would be inconsistent with reading the file in the correct order.
Should it be positioned at the '\r', as '\r' '\n' should be understood as a single character? But in this case the positioning would have to be done byte by byte, and for every byte the library would have to check the previous one, just in case.
This is by the way also the reason why you should only directly seekg() to positions previously acquired by tellg().
If you really have to do this kind of positioning from the end, open the file as ios::binary, because then you're assured that a byte is always a byte, whether counting from the end or from the beginning.
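A minimal sketch of that binary-mode approach (file name and offset are illustrative):

#include <fstream>
#include <iostream>

int main() {
    std::ifstream file("data.bin", std::ios::binary);
    if (!file) return 1;

    const std::streamoff x = 16;    // how far back from the end
    file.seekg(-x, std::ios::end);  // well defined for a binary stream
    char byte;
    if (file.get(byte))
        std::cout << "byte at end-" << x << ": " << (int)(unsigned char)byte << "\n";
}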

Fastest and most efficient way of parsing raw data from a file

I'm working on some project and I'm wondering which way is the most efficient to read a huge amount of data off a file (I'm speaking of files from 100 lines up to roughly 3 billion lines, possibly more). Once read, the data will be stored in a structured data set (vector<entry>, where "entry" defines a structured line).
A structured line of this file may look like :
string int int int string string
which also ends with the appropriate platform EOL and is TAB delimited
What I wish to accomplish is :
Read file into memory (string) or vector<char>
Read raw data from my buffer and format it into my data set.
I need to consider memory footprint and have a fast parsing rate.
I'm already avoiding usage of stringstream as it seems too slow.
I'm also avoiding multiple I/O calls to my file by using:
// open the stream
std::ifstream is(filename);
// determine the file length
is.seekg(0, std::ios_base::end);
std::size_t size = is.tellg();
is.seekg(0, std::ios_base::beg);
// "out" can be a std::string or vector<char>
out.reserve(size / sizeof (char));
out.resize(size / sizeof (char), 0);
// load the data
is.read((char *) &out[0], size);
// close the file
is.close();
I've thought of taking this huge std::string and then looping line by line, extracting the line information (string and integer parts) into my data set rows. Is there a better way of doing this?
EDIT : This application may run on a 32bit, 64bit computer, or on a super computer for bigger files.
Any suggestions are very welcome.
Thank you
Some random thoughts:
Use vector::resize() at the beginning (you did that)
Read large blocks of file data at a time, at least 4k, better still 256k. Read them into a memory buffer and parse that buffer into your vector (see the sketch after this list).
Don't read the whole file at once, this might needlessly lead to swapping.
sizeof(char) is always 1 :)
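A minimal sketch of that block-read-and-parse idea (the entry layout, file name, and field parsing below are illustrative assumptions, not the asker's actual format handling):

#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

struct entry {
    std::string s1;
    int a, b, c;
    std::string s2, s3;
};

// Split one TAB-delimited line of the form "string int int int string string".
static entry parse_line(const std::string &line) {
    entry e;
    size_t pos = 0;
    auto next_field = [&line, &pos]() {
        size_t tab = line.find('\t', pos);
        std::string field = line.substr(pos, tab == std::string::npos ? std::string::npos : tab - pos);
        pos = (tab == std::string::npos) ? line.size() : tab + 1;
        return field;
    };
    e.s1 = next_field();
    e.a = std::atoi(next_field().c_str());
    e.b = std::atoi(next_field().c_str());
    e.c = std::atoi(next_field().c_str());
    e.s2 = next_field();
    e.s3 = next_field();
    return e;
}

int main() {
    std::vector<entry> data;
    std::FILE *f = std::fopen("input.tsv", "rb");
    if (!f) return 1;

    std::vector<char> block(256 * 1024); // 256k read buffer
    std::string carry;                   // unfinished line from the previous block
    size_t n;
    while ((n = std::fread(block.data(), 1, block.size(), f)) > 0) {
        carry.append(block.data(), n);
        size_t start = 0, nl;
        while ((nl = carry.find('\n', start)) != std::string::npos) {
            data.push_back(parse_line(carry.substr(start, nl - start)));
            start = nl + 1;
        }
        carry.erase(0, start);           // keep only the incomplete tail
    }
    if (!carry.empty())                  // last line without a trailing newline
        data.push_back(parse_line(carry));
    std::fclose(f);
    return 0;
}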
While I cannot speak for supercomputers, with 3 billion lines you will get nowhere in memory on a desktop machine.
I think you should first try to figure out all the operations on that data. You should try to design all algorithms to operate sequentially; if you need random access you will be swapping all the time. This algorithm design will have a big impact on your data model.
So do not start with reading all the data just because that is the easy part; design the whole system with a clear view of what data is in memory during the whole processing.
Update
When you do all processing in a single run over the stream and separate the data processing into stages (read - preprocess - ... - write), you can utilize multithreading effectively.
Finally
Whatever you want to do in a loop over the data, try to keep the number of loops to a minimum. Averaging, for example, you can certainly do in the read loop.
Immediately make up a test file of the worst-case size you expect, and time the two approaches below:
time
loop
    read line from disk
time
loop
    process line (counting words per line)
time
loop
    write data (word count) from line to disk
time

versus:

time
loop
    read line from disk
    process line (counting words per line)
    write data (word count) from line to disk
time
If you already have the algorithms, use yours; otherwise make one up (like counting words per line). If the write stage does not apply to your problem, skip it. This test will take you less than an hour to write but can save you a lot.

Appending Binary files

I have to write numerical data to binary files. Since some of the data vectors I deal with can be several gigs in size, I have learned not to use C++ iostreams. Instead I want to use C FILE*. I'm running into a problem right off the bat where I need to write some metadata to the front of the binary file. Since some of the metadata is not known at first, I need to write it at the appropriate offsets in the file as I get it.
For example, let's say I have to write a uint16_t representation of the year, month, and day, but first I need to skip the first entry (a uint32_t value for precision).
I don't know what I'm doing wrong, but I can't seem to append to the file with "ab".
Here's an example of what I wrote:
#include <cstdio>
#include <cstdint>
uint16_t year = 2001;
uint16_t month = 8;
uint16_t day = 23;
uint16_t dateArray[] = {year, month, day};
FILE *fileStream;
fileStream = fopen("/Users/mmmmmm/Desktop/test.bin" , "wb");
if (fileStream) {
    // skip the first 4 bytes
    fseek(fileStream, 4, SEEK_SET);
    fwrite(dateArray, sizeof(dateArray[0]), (sizeof(dateArray) / sizeof(dateArray[0])), fileStream);
    fclose(fileStream);
}
// loops and other code to prepare and gather other parameters
// now append the front of the file with the precision.
uint32_t precision = 32;
FILE *fileStream2;
fileStream2 = fopen("/Users/mmmmmm/Desktop/test.bin" , "ab");
if (fileStream2) {
    // Get to the top of the file
    rewind(fileStream2);
    fwrite(&precision, sizeof(precision), 1, fileStream2);
    fclose(fileStream2);
}
The appended data does not get written. If I change the mode to "wb", then the file is overwritten. I was able to get it to work with "r+b", but I don't understand why; I thought "ab" would be proper. Also, should I be using buffers, or is this a sufficient approach?
Thanks for the advice.
BTW, this is on Mac OS X.
Due to the way that hard drives and filesystems work, inserting bytes in the middle of a file is really slow and should be avoided, especially when dealing with multi-gigabyte files. If your metadata is stored in a fixed-size header, just make sure that there's enough space for it before you start with the other data. If the header is variably sized, then chunk up the header. Put 1k of header space at the beginning, and have 8 bytes reserved to contain the value of the offset to the next header chunk, or 0 for EOF. Then when that chunk gets filled up, just add another chunk to the end of the file and write its offset to the previous header.
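A minimal sketch of what such a chunked header could look like (the 1k chunk size and field layout are illustrative assumptions):

#include <stdint.h>

#define HEADER_CHUNK_SIZE 1024

struct header_chunk {
    uint64_t next_offset;  /* file offset of the next header chunk, 0 = none */
    uint8_t  data[HEADER_CHUNK_SIZE - sizeof(uint64_t)]; /* metadata payload */
};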
As for the technical I/O, use the fopen() modes "r+b", "w+b", or "a+b", depending on your need. They differ in important ways. "r+b" opens the file for reading and writing, starting at the first byte; it will fail if the file doesn't exist. "w+b" also opens for reading and writing, but it truncates an existing file to zero length (or creates the file if it doesn't exist). "a+b" opens for reading and appending, creating the file if needed; you can reposition the read pointer, but every write is forced to the end of the file regardless of any seek, which is why writing after rewind() in append mode does not put the data at offset 0.
You can navigate the file with fseek() and rewind(). rewind() moves the file pointer back to the beginning of the file; fseek() moves it to a specified position.
"r+b" means you can read and write to any position in the file. In your second code block the rewind() call set the current position to byte 0 and the write is done at this position.
If you use "a+b", this also means read and write access, but the writes are all at the end of the file so you cannot position at byte 0, unless a new empty file is created.
To re-access the file at a specific byte, just use fseek().
fseek(fileStream, 0, SEEK_SET); - positions to the precision value (uint32_t at offset 0)
fseek(fileStream, 4, SEEK_SET); - positions to the year value (uint16_t at offset 4)
fseek(fileStream, 6, SEEK_SET); - positions to the month value
fseek(fileStream, 8, SEEK_SET); - positions to the day value
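Putting the pieces together, a minimal sketch of the "reserve space first, then fill it in with r+b" approach, based on the question's code (path and values are the question's own; error handling is kept short):

#include <cstdio>
#include <cstdint>

int main() {
    uint16_t dateArray[] = {2001, 8, 23}; // year, month, day

    // First pass: write the dates, leaving 4 bytes of room for the precision.
    std::FILE *f = std::fopen("/Users/mmmmmm/Desktop/test.bin", "wb");
    if (!f) return 1;
    std::fseek(f, 4, SEEK_SET); // skip the uint32_t slot
    std::fwrite(dateArray, sizeof(dateArray[0]), 3, f);
    std::fclose(f);

    // Later: fill in the precision at offset 0 without truncating the file.
    uint32_t precision = 32;
    f = std::fopen("/Users/mmmmmm/Desktop/test.bin", "r+b"); // not "ab"
    if (!f) return 1;
    std::rewind(f); // back to byte 0
    std::fwrite(&precision, sizeof(precision), 1, f);
    std::fclose(f);
    return 0;
}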
With such large files, it would be highly inefficient to rewrite gigs just to prepend a few bytes.
It would be much better to create a small companion file of the metadata you need for each file, and only add the metadata fields to the beginning of the files if they are to be rewritten anyway as part of an edit.
This is because prepending to a file is so expensive on most file systems.
NTFS has a second data channel for most files (alternate data streams) that goes unseen by almost all programs except for MS internals such as file managers and security-scanning programs. You could easily cook up a program to add your metadata to that channel without needing to rewrite gigs on disk every single time.

Accessing information in a ".txt" file and going to a specific row

When accessing a text file, I want to read from a specific line. Let's suppose that my file has 1000 rows and I want to read row 330. Each row has a different number of characters and could possibly be quite long (let's say around 100,000,000 characters per row). I'm thinking fseek() can't be used effectively here.
I was thinking about a loop to track line breaks, but I don't know exactly how to implement it, and I don't know if that would be the best solution.
Can you offer any help?
Unless you have some kind of index saying "line M begins at position N" in the file, you have to read characters from the file and count newlines until you find the desired line.
You can easily read lines using std::getline if you want to save the contents of each line, or std::istream::ignore if you want to discard the contents of the lines you read until you find the desired line.
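For example, a minimal sketch using std::istream::ignore to discard lines until the desired one (file name and target row are illustrative):

#include <fstream>
#include <iostream>
#include <limits>
#include <string>

int main() {
    std::ifstream file("data.txt");
    const int target = 330; // 1-based row number
    for (int i = 1; i < target && file; ++i)
        file.ignore(std::numeric_limits<std::streamsize>::max(), '\n');

    std::string line;
    if (std::getline(file, line))
        std::cout << line << '\n';
}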
There is no way to know where row 330 starts in an arbitrary text file without scanning the whole file, finding the line breaks, and then counting.
If you only need to do this once, then scan. If you need to do it many times, then you can scan once and build up a data structure listing where all of the lines start. Now you can figure out where to seek to in order to read just that line. If you're still just thinking about how to organize your data, I would suggest using some other type of data structure for random access. I can't recommend which one without knowing the actual problem that you are trying to solve.
Create an index on the file. You can do this "lazily", but as you read each buffer-full you may as well scan it for the newline characters.
If it is a text file on Windows that uses a 2-byte line ending ("\r\n"), then the number of characters you have read up to the point where the newline occurs will not equal the file offset. So what you should do is record the position with tellg() after each call to getline().
Something like:
std::vector<std::streamoff> lineNumbers;
std::string line;
lineNumbers.push_back(0); // first line begins at offset 0
while (std::getline(ifs, line)) // ifs is an already-open std::ifstream
{
    lineNumbers.push_back(ifs.tellg());
}
The last value will tell you where EOF is.
I think you need to scan the file and count the '\n' occurrences until you find the desired line. If this is a frequent operation, and you are the only one writing the file, you could maintain an index file containing this information side by side with the data file, a sort of "poor man's index"; it can save a lot of time.
Try running fgets() in a loop, counting newlines until you reach the desired row:
/* fgets example: skip ahead to a given line */
#include <stdio.h>
#include <string.h>

int main()
{
    FILE *pFile;
    char mystring[100];
    int line = 1;
    const int target = 330;

    pFile = fopen("myfile.txt", "r");
    if (pFile == NULL) { perror("Error opening file"); return 1; }

    /* each chunk that ends in '\n' completes one line */
    while (line < target && fgets(mystring, 100, pFile) != NULL) {
        if (strchr(mystring, '\n') != NULL) ++line;
    }
    if (line == target && fgets(mystring, 100, pFile) != NULL)
        puts(mystring); /* prints (the start of) the desired row */
    fclose(pFile);
    return 0;
}