I need to read a huge 35 GB file from disk line by line in C++. Currently I do it the following way:
ifstream infile("myfile.txt");
string line;
while (getline(infile, line)) {
    long long linepos = infile.tellg(); // position just past this line; needs 64 bits for a 35 GB file
    process(line, linepos);
}
But it gives me only about 2 MB/s, even though the file manager copies the same file at 100 MB/s. I guess that getline() is not buffering correctly. Please propose some sort of buffered line-by-line reading approach.
UPD: process() is not the bottleneck; the code without process() runs at the same speed.
You won't get anywhere close to disk speed with the standard IO streams. Buffering or not, pretty much ANY parsing will kill your speed by orders of magnitude. I ran experiments on data files composed of two ints and a double per line (Ivy Bridge chip, SSD):
IO streams in various combinations: ~10 MB/s. Pure parsing (f >> i1 >> i2 >> d) is faster than a getline into a string followed by a stringstream parse.
C file operations like fscanf: about 40 MB/s.
getline with no parsing: 180 MB/s.
fread: 500-800 MB/s (depending on whether the file was cached by the OS).
I/O is not the bottleneck; parsing is. In other words, your process() is likely your slow point.
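For a sense of what the fread number measures, here is a reconstruction of that kind of loop (my sketch, not the exact benchmark): read in large fixed-size chunks and merely count the bytes.
#include <cstdio>
#include <vector>

int main()
{
    FILE* f = fopen("myfile.txt", "rb");
    if (!f) return 1;
    std::vector<char> buf(1 << 20); // 1 MiB chunks
    size_t total = 0, n;
    while ((n = fread(buf.data(), 1, buf.size(), f)) > 0)
        total += n; // touch nothing else: this is pure I/O
    fclose(f);
    printf("read %zu bytes\n", total);
}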
So I wrote a parallel parser. It's composed of tasks (using a TBB pipeline):
fread large chunks (one such task at a time)
re-arrange chunks such that a line is not split between chunks (one such task at a time)
parse chunk (many such tasks)
I can have an unlimited number of parsing tasks because my data is unordered anyway. If yours isn't, then this might not be worth it to you. A sketch of the pipeline follows below.
This approach gets me about 100 MB/s on a 4-core Ivy Bridge chip.
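Roughly, the pipeline can be sketched like this (a minimal sketch assuming oneTBB's parallel_pipeline; Chunk and parse_lines are hypothetical names, and the re-arranging stage is folded into the serial read stage by carrying the trailing partial line into the next chunk):
#include <tbb/parallel_pipeline.h>
#include <algorithm>
#include <cstdio>
#include <vector>

struct Chunk {
    std::vector<char> data;
    size_t used = 0; // bytes that end on a line boundary
};

// Hypothetical per-chunk parser; here it just counts lines.
void parse_lines(const Chunk& c)
{
    size_t lines = 0;
    for (size_t i = 0; i < c.used; ++i)
        if (c.data[i] == '\n') ++lines;
    (void)lines; // a real parser would extract the ints/doubles here
}

void parallel_parse(FILE* f)
{
    const size_t CHUNK = 1 << 20; // 1 MiB reads
    std::vector<char> carry;      // partial last line of the previous chunk

    tbb::parallel_pipeline(
        8, // max chunks in flight
        tbb::make_filter<void, Chunk*>(tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> Chunk* {
                Chunk* c = new Chunk;
                c->data.resize(carry.size() + CHUNK);
                std::copy(carry.begin(), carry.end(), c->data.begin());
                size_t n = fread(c->data.data() + carry.size(), 1, CHUNK, f);
                if (n == 0 && carry.empty()) { // nothing left at all
                    delete c;
                    fc.stop();
                    return nullptr;
                }
                if (n == 0) { // EOF: flush the final partial line
                    c->used = carry.size();
                    carry.clear();
                    return c;
                }
                size_t total = carry.size() + n;
                size_t end = total; // cut at the last newline
                while (end > 0 && c->data[end - 1] != '\n') --end;
                carry.assign(c->data.begin() + end, c->data.begin() + total);
                c->used = end;
                return c;
            }) &
        tbb::make_filter<Chunk*, void>(tbb::filter_mode::parallel,
            [](Chunk* c) { // many of these can run at once
                parse_lines(*c);
                delete c;
            }));
}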
I've translated my own buffering code from my Java project and it does what I need. I had to add defines to work around the MSVC 2010 compiler's tellg(), which always gives wrong negative values on huge files. This algorithm gives the desired speed of ~100 MB/s, though it does some useless new[].
void readFileFast(ifstream &file, void (*lineHandler)(char* str, int length, __int64 absPos)) {
    int BUF_SIZE = 40000;
    file.seekg(0, ios::end);
    ifstream::pos_type p = file.tellg();
#ifdef WIN32
    // MSVC 2010 workaround: tellg() returns bogus negative values past 2 GB,
    // so read the 64-bit offset straight out of the fpos object.
    __int64 fileSize = *(__int64*)(((char*)&p) + 8);
#else
    __int64 fileSize = p;
#endif
    file.seekg(0, ios::beg);
    BUF_SIZE = (int)min((__int64)BUF_SIZE, fileSize);
    char* buf = new char[BUF_SIZE];
    int bufLength = BUF_SIZE;
    file.read(buf, bufLength);

    int strEnd = -1;          // index of the newline ending the previous line
    int strStart;             // index just before the current line starts
    __int64 bufPosInFile = 0; // absolute file offset of buf[0]
    while (bufLength > 0) {
        int i = strEnd + 1;
        strStart = strEnd;
        strEnd = -1;
        // find the next newline in the buffer
        for (; i < bufLength && i + bufPosInFile < fileSize; i++) {
            if (buf[i] == '\n') {
                strEnd = i;
                break;
            }
        }
        if (strEnd == -1) { // no newline found: scroll the buffer
            if (strStart == -1) {
                // the whole buffer is one partial line; hand it over as-is
                lineHandler(buf + strStart + 1, bufLength, bufPosInFile + strStart + 1);
                bufPosInFile += bufLength;
                bufLength = (int)min((__int64)bufLength, fileSize - bufPosInFile);
                delete[] buf;
                buf = new char[bufLength];
                file.read(buf, bufLength);
            } else {
                // move the unfinished tail to the front, then refill the rest
                int movedLength = bufLength - strStart - 1;
                memmove(buf, buf + strStart + 1, movedLength);
                bufPosInFile += strStart + 1;
                int readSize = (int)min((__int64)(bufLength - movedLength),
                                        fileSize - bufPosInFile - movedLength);
                if (readSize != 0)
                    file.read(buf + movedLength, readSize);
                if (movedLength + readSize < bufLength) {
                    // near EOF: shrink the buffer to the data that is left
                    char* tmpbuf = new char[movedLength + readSize];
                    memmove(tmpbuf, buf, movedLength + readSize);
                    delete[] buf;
                    buf = tmpbuf;
                    bufLength = movedLength + readSize;
                }
                strEnd = -1;
            }
        } else {
            // complete line found: report it with its absolute position
            lineHandler(buf + strStart + 1, strEnd - strStart, bufPosInFile + strStart + 1);
        }
    }
    delete[] buf;         // free the scroll buffer
    lineHandler(0, 0, 0); // EOF sentinel
}
void lineHandler(char* buf, int l, __int64 pos) {
    if (buf == 0) return; // EOF sentinel
    string s = string(buf, l);
    fwrite(s.data(), 1, s.size(), stdout); // not printf(s.c_str()): a '%' in the data would be treated as a format specifier
}
void loadFile() {
    ifstream infile("file", ios::binary); // binary mode keeps absolute positions honest on Windows
    readFileFast(infile, lineHandler);
}
Use a line parser or write your own; there is a sample on SourceForge (http://tclap.sourceforge.net/). Put it in a buffer if necessary.
Related
I am trying to receive and save a file over 5 GB in size using C++. But during the process, the memory used by the application keeps increasing, and sometimes the application crashes. Is there something I am doing wrong? Or is there a better way to do this?
int len = 0;
char* buffer = channel->cread(&len);    // first message carries the file length
__int64_t lengthFile = *(__int64_t*)buffer;
__int64_t received = 0;
__int64_t offset = 0;
__int64_t length;
bool breakCondition = false;
filemsg* file;
string filename = "received/testFile";
ofstream ofs;                           // output stream for the received file
ofs.open(filename.c_str(), ios::binary | ios::out);
while (1) {
    length = 256;                       // presumably equal to MAX_MESSAGE
    if (offset + MAX_MESSAGE >= lengthFile) {
        length = lengthFile - offset;
        breakCondition = true;
    }
    file = new filemsg(offset, length); // request the next chunk
    channel->cwrite((char*)file, sizeof(*file));
    buffer = channel->cread(&len);
    ofs.write(buffer, length);
    received = received + length;
    offset = offset + MAX_MESSAGE;
    if (breakCondition)
        break;
}
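For what it's worth, the same loop can run without the two per-iteration heap allocations (a sketch against the question's own framework: filemsg, channel, and MAX_MESSAGE come from it, and whether the pointer returned by cread() must be freed by the caller depends on ownership rules not shown here):
__int64_t offset = 0;
while (offset < lengthFile) {
    __int64_t chunk = MAX_MESSAGE;
    if (offset + chunk > lengthFile)
        chunk = lengthFile - offset;  // last, possibly short, chunk
    filemsg req(offset, chunk);       // stack object: nothing to leak
    channel->cwrite((char*)&req, sizeof(req));
    int len = 0;
    char* data = channel->cread(&len);
    ofs.write(data, chunk);
    // If cread() transfers ownership of 'data' to the caller,
    // it must be released here (e.g. delete[] data;).
    offset += chunk;
}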
I'm trying to read a file consisting of 100,000,000 float numbers like 0.12345678 or -0.1234567, separated by spaces, in C++. I used fscanf() to read the file, and the code is like this:
FILE *fid = fopen("testingfile.txt", "r");
if (fid == NULL)
    return false;
float v;
for (int i = 0; i < 100000000; i++)
    fscanf(fid, "%f", &v);
fclose(fid);
The file is 1,199,999,988 bytes in size and took around 18 seconds to read using fscanf(). Therefore, I would like to use mmap() to speed up the reading, and the code is like this:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cctype>
#include <cstdio>
#include <cstdlib>

#define FILEPATH "testingfile.txt"

char text[16] = {'\0'}; // enough room for "-0.1234567" and its terminator
struct stat s;
int status = stat(FILEPATH, &s);
int fd = open(FILEPATH, O_RDONLY);
if (fd == -1)
{
    perror("Error opening file for reading");
    return 0;
}
char *map = (char *)mmap(NULL, s.st_size, PROT_READ, MAP_SHARED, fd, 0);
close(fd);
if (map == MAP_FAILED)
{
    perror("Error mmapping the file");
    return 0;
}
for (int i = 0, j = 0; i < s.st_size; i++)
{
    if (isspace(map[i]))
    {
        text[j] = '\0';
        j = 0;
        float v = atof(text);
        memset(text, '\0', sizeof(text)); // clear the scratch buffer
        continue;
    }
    text[j] = map[i];
    j++;
}
if (munmap(map, s.st_size) == -1)
{
    return 0;
}
if (munmap(map, s.st_size) == -1)
{
return 0;
}
However, it still takes around 14.5 seconds to read. I found the most time-consuming part is converting the char array to float, which consumes around 10 seconds.
So I have three questions:
Is there any way I can directly read floats instead of chars, or
Is there any better method to convert a char array to float?
How does fscanf recognize a floating-point value and read it so much faster than atof()?
Thanks in advance!
Based on the advice given, here are two possible solutions to this problem:
The first approach is a bit "stupid". Since the format of the stored floating-point values is known, the conversion from char array to float can easily be done by hand, without using atof().
By removing atof(), it takes only 8 seconds to read and convert the same file.
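For reference, a minimal sketch of that idea (my reconstruction, not the asker's actual code): since every value has an optional sign, one integer digit, a dot, and a fixed handful of fractional digits, the digits can be accumulated as an integer and scaled once at the end.
#include <cstdio>

// A value looks like 0.12345678 or -0.1234567. Accumulate all digits as an
// integer and scale once by the number of fractional digits seen.
float parseFixedFloat(const char* s, int len)
{
    bool neg = (s[0] == '-');
    long mantissa = 0;
    int fracDigits = 0;
    bool seenDot = false;
    for (int i = neg ? 1 : 0; i < len; ++i) {
        if (s[i] == '.') { seenDot = true; continue; }
        mantissa = mantissa * 10 + (s[i] - '0');
        if (seenDot) ++fracDigits;
    }
    static const float scale[] = { 1e0f, 1e-1f, 1e-2f, 1e-3f, 1e-4f,
                                   1e-5f, 1e-6f, 1e-7f, 1e-8f };
    float v = mantissa * scale[fracDigits];
    return neg ? -v : v;
}

int main()
{
    printf("%.8f\n", parseFixedFloat("-0.1234567", 10)); // prints -0.12345670
}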
The second approach is to change the storage format of the float numbers in the file (as advised by Jeremy Friesner). The values are stored in binary format, so no conversion step is needed at all on the mmap() side. The code becomes something like this:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <ctime>

#define FILEPATH "myfile.bin"

int main()
{
    int start_s = clock();
    struct stat s;
    int status = stat(FILEPATH, &s);
    int fd = open(FILEPATH, O_RDONLY);
    if (fd == -1)
    {
        perror("Error opening file for reading");
        return 0;
    }
    // Map the file and read it as an array of floats: no text parsing at all.
    float *map = (float *)mmap(NULL, s.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
    {
        perror("Error mmapping the file");
        return 0;
    }
    for (int i = 0; i < s.st_size / (int)sizeof(float); i++)
    {
        float v = map[i];
    }
    if (munmap(map, s.st_size) == -1)
    {
        return 0;
    }
}
This dramatically reduces the time required to read a file of the same size.
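For completeness, the one-time conversion could look like this (a sketch; it assumes the text file holds space-separated floats as in the question and writes them as raw 4-byte values for the mmap reader above):
#include <cstdio>

// Reads space-separated floats from the text file and writes them back
// as raw 4-byte values, so the mmap-based reader needs no parsing.
int convertToBinary(const char* textPath, const char* binPath)
{
    FILE* in = fopen(textPath, "r");
    FILE* out = fopen(binPath, "wb");
    if (!in || !out) return -1;
    float v;
    while (fscanf(in, "%f", &v) == 1)
        fwrite(&v, sizeof(float), 1, out);
    fclose(in);
    fclose(out);
    return 0;
}

int main()
{
    return convertToBinary("testingfile.txt", "myfile.bin");
}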
I'm currently working on a project using the Genuino 101 where I need to read large amounts of data through I2C to fill an arbitrarily sized buffer. From the following image I can see that the read requests themselves only take about 3 milliseconds and the write request about 200 nanoseconds.
However, there is a very large delay (750+ ms) between read transactions in the same block.
#define BLCK_SIZE 32  // the Wire library buffers at most 32 bytes per request

void i2cRead(unsigned char device, unsigned char memory, int len, unsigned char* rdBuf)
{
    unsigned short bytesRead = 0;
    unsigned short _memstart = memory;
    while (bytesRead < len)
    {
        // point the device at the next block, then request it
        Wire.beginTransmission((int)device);
        Wire.write(_memstart);
        Wire.endTransmission();
        Wire.requestFrom((int)device, BLCK_SIZE);
        int i = 0;
        while (Wire.available())
        {
            rdBuf[bytesRead + i] = Wire.read();
            i++;
        }
        bytesRead += BLCK_SIZE;
        _memstart += BLCK_SIZE;
    }
}
From my understanding this shouldn't take that long, unless adding to _memstart and bytesRead takes extremely long. By my, arguably limited, understanding of time complexity, this function is O(n) and should in the best case take only about 12 ms for a 128-byte query.
Am I missing something?
Those 700 ms are not caused by the execution time of the few instructions in your function; those should be done in microseconds. You may have a buffer overflow, the other device might be delaying transfers, or there's some other bug entirely.
This is about how I'd do it:
void i2cRead(unsigned char device, unsigned char memory, int len, unsigned char* rdBuf, int bufLen)
{
    ushort _memstart = memory;
    if (bufLen < len) {
        len = bufLen;      // never write past the caller's buffer
    }
    while (len > 0)
    {
        Wire.beginTransmission((int)device);
        Wire.write(_memstart);
        Wire.endTransmission();
        int reqSize = 32;  // Wire's internal buffer holds at most 32 bytes
        if (len < reqSize) {
            reqSize = len; // don't request more than we still need
        }
        Wire.requestFrom((int)device, reqSize);
        while (Wire.available() && (len != 0))
        {
            *(rdBuf++) = Wire.read();
            _memstart++;
            len--;
        }
    }
}
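A hypothetical call, reading 128 bytes starting at register 0 of a device at address 0x50 (both values made up for illustration):
unsigned char buf[128];

void setup()
{
    Wire.begin();
    i2cRead(0x50, 0x00, 128, buf, sizeof(buf));
}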
This is my fourth attempt at base64 encoding. My first tries worked, but they weren't standard. They were also extremely slow!!! I used vectors with push_back and erase a lot.
So I decided to re-write it, and this one is much, much faster! Except that it loses data. -__-
I need as much speed as I can possibly get, because I'm compressing a pixel buffer and base64-encoding the compressed string. I'm using zlib. The images are 1366 x 768, so yeah.
I do not want to copy any code I find online because... well, I like to write things myself, and I don't like worrying about copyright stuff or having to put a ton of credits from different sources all over my code.
Anyway, my code is as follows below. It's very short and simple.
const static std::string Base64Chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

inline bool IsBase64(std::uint8_t C)
{
    return (isalnum(C) || (C == '+') || (C == '/'));
}

std::string Copy(std::string Str, int FirstChar, int Count)
{
    if (FirstChar <= 0)
        FirstChar = 0;
    else
        FirstChar -= 1;
    return Str.substr(FirstChar, Count);
}

std::string DecToBinStr(int Num, int Padding)
{
    int Bin = 0, Pos = 1;
    std::stringstream SS;
    while (Num > 0)
    {
        Bin += (Num % 2) * Pos;
        Num /= 2;
        Pos *= 10;
    }
    SS.fill('0');
    SS.width(Padding);
    SS << Bin;
    return SS.str();
}

int DecToBinStr(std::string DecNumber)
{
    int Bin = 0, Pos = 1;
    int Dec = strtol(DecNumber.c_str(), NULL, 10);
    while (Dec > 0)
    {
        Bin += (Dec % 2) * Pos;
        Dec /= 2;
        Pos *= 10;
    }
    return Bin;
}

int BinToDecStr(std::string BinNumber)
{
    int Dec = 0;
    int Bin = strtol(BinNumber.c_str(), NULL, 10);
    for (int I = 0; Bin > 0; ++I)
    {
        if (Bin % 10 == 1)
        {
            Dec += (1 << I);
        }
        Bin /= 10;
    }
    return Dec;
}

std::string EncodeBase64(std::string Data)
{
    std::string Binary = std::string();
    std::string Result = std::string();
    for (std::size_t I = 0; I < Data.size(); ++I)
    {
        Binary += DecToBinStr(Data[I], 8);
    }
    for (std::size_t I = 0; I < Binary.size(); I += 6)
    {
        Result += Base64Chars[BinToDecStr(Copy(Binary, I, 6))];
        if (I == 0) ++I;
    }
    int PaddingAmount = ((-Result.size() * 3) & 3);
    for (int I = 0; I < PaddingAmount; ++I)
        Result += '=';
    return Result;
}

std::string DecodeBase64(std::string Data)
{
    std::string Binary = std::string();
    std::string Result = std::string();
    for (std::size_t I = Data.size(); I > 0; --I)
    {
        if (Data[I - 1] != '=')
        {
            std::string Characters = Copy(Data, 0, I);
            for (std::size_t J = 0; J < Characters.size(); ++J)
                Binary += DecToBinStr(Base64Chars.find(Characters[J]), 6);
            break;
        }
    }
    for (std::size_t I = 0; I < Binary.size(); I += 8)
    {
        Result += (char)BinToDecStr(Copy(Binary, I, 8));
        if (I == 0) ++I;
    }
    return Result;
}
std::string DecodeBase64(std::string Data)
{
std::string Binary = std::string();
std::string Result = std::string();
for (std::size_t I = Data.size(); I > 0; --I)
{
if (Data[I - 1] != '=')
{
std::string Characters = Copy(Data, 0, I);
for (std::size_t J = 0; J < Characters.size(); ++J)
Binary += DecToBinStr(Base64Chars.find(Characters[J]), 6);
break;
}
}
for (std::size_t I = 0; I < Binary.size(); I += 8)
{
Result += (char)BinToDecStr(Copy(Binary, I, 8));
if (I == 0) ++I;
}
return Result;
}
I've been using the above like this:
int main()
{
    std::string Data = EncodeBase64("IMG." + ::ToString(677) + "*" + ::ToString(604)); // IMG.677*604
    std::cout << DecodeBase64(Data); // Prints IMG.677*601
}
As you can see above, it prints the wrong string. It's fairly close, but for some reason the 4 is turned into a 1!
Now if I do:
int main()
{
    std::string Data = EncodeBase64("IMG." + ::ToString(1366) + "*" + ::ToString(768)); // IMG.1366*768
    std::cout << DecodeBase64(Data); // Prints IMG.1366*768
}
It prints correctly. I'm not sure what is going on at all or where to begin looking.
Just in-case anyone is curious and want to see my other attempts (the slow ones): http://pastebin.com/Xcv03KwE
I'm really hoping someone could shed some light on speeding things up or at least figuring out what's wrong with my code :l
The main encoding issue is that you are not accounting for data that is not a multiple of 6 bits. In this case, the final 4 is being converted into 0100 instead of 010000, because there are no more bits to read; you are supposed to pad with 0s.
After changing your Copy like this, the final encoded character is Q instead of the original E:
std::string data = Str.substr(FirstChar, Count);
while (data.size() < (std::size_t)Count)
    data += '0'; // zero-pad a short final group
return data;
Also, it appears that your logic for adding the = padding is off: it adds one too many = in this case.
As far as speed goes, I'd focus primarily on reducing your usage of std::string. The way you currently convert the data into a string of 0s and 1s is pretty inefficient, considering that the source could be read directly with bitwise operators.
I'm not sure whether I could easily come up with a slower method of doing Base-64 conversions.
The code requires 4 headers (on Mac OS X 10.7.5 with G++ 4.7.1) and the compiler option -std=c++11 to make the #include <cstdint> acceptable:
#include <string>
#include <iostream>
#include <sstream>
#include <cstdint>
It also requires a function ToString() that was not defined; I created:
std::string ToString(int value)
{
    std::stringstream ss;
    ss << value;
    return ss.str();
}
The code in your main() — which is what uses the ToString() function — is a little odd: why do you need to build a string from pieces instead of simply using "IMG.677*604"?
Also, it is worth printing out the intermediate result:
int main()
{
    std::string Data = EncodeBase64("IMG." + ::ToString(677) + "*" + ::ToString(604));
    std::cout << Data << std::endl;
    std::cout << DecodeBase64(Data) << std::endl; // Prints IMG.677*601
}
This yields:
SU1HLjY3Nyo2MDE===
IMG.677*601
The output string (SU1HLjY3Nyo2MDE===) is 18 bytes long; that has to be wrong as a valid Base-64 encoded string has to be a multiple of 4 bytes long (as three 8-bit bytes are encoded into four bytes each containing 6 bits of the original data). This immediately tells us there are problems. You should only get zero, one or two pad (=) characters; never three. This also confirms that there are problems.
Removing two of the pad characters leaves a valid Base-64 string. When I use my own home-brew Base-64 encoding and decoding functions to decode your (truncated) output, it gives me:
Base64:
0x0000: SU1HLjY3Nyo2MDE=
Binary:
0x0000: 49 4D 47 2E 36 37 37 2A 36 30 31 00 IMG.677*601.
Thus it appears you have encoded the null that terminates the string. When I encode IMG.677*604, the output I get is:
Binary:
0x0000: 49 4D 47 2E 36 37 37 2A 36 30 34 IMG.677*604
Base64: SU1HLjY3Nyo2MDQ=
You say you want to speed up your code. Quite apart from fixing it so that it encodes correctly (I've not really studied the decoding), you will want to avoid all the string manipulation you do. It should be a bit manipulation exercise, not a string manipulation exercise.
I have 3 small encoding routines in my code, to encode triplets, doublets and singlets:
/* Encode 3 bytes of data into 4 */
static void encode_triplet(const char *triplet, char *quad)
{
    quad[0] = base_64_map[(triplet[0] >> 2) & 0x3F];
    quad[1] = base_64_map[((triplet[0] & 0x03) << 4) | ((triplet[1] >> 4) & 0x0F)];
    quad[2] = base_64_map[((triplet[1] & 0x0F) << 2) | ((triplet[2] >> 6) & 0x03)];
    quad[3] = base_64_map[triplet[2] & 0x3F];
}

/* Encode 2 bytes of data into 4 */
static void encode_doublet(const char *doublet, char *quad, char pad)
{
    quad[0] = base_64_map[(doublet[0] >> 2) & 0x3F];
    quad[1] = base_64_map[((doublet[0] & 0x03) << 4) | ((doublet[1] >> 4) & 0x0F)];
    quad[2] = base_64_map[((doublet[1] & 0x0F) << 2)];
    quad[3] = pad;
}

/* Encode 1 byte of data into 4 */
static void encode_singlet(const char *singlet, char *quad, char pad)
{
    quad[0] = base_64_map[(singlet[0] >> 2) & 0x3F];
    quad[1] = base_64_map[((singlet[0] & 0x03) << 4)];
    quad[2] = pad;
    quad[3] = pad;
}
This is written as C code rather than using native C++ idioms, but the code shown should compile with C++ (unlike the C99 initializers elsewhere in the source). The base_64_map[] array corresponds to your Base64Chars string. The pad character passed in is normally '=', but can be '\0' since the system I work with has eccentric ideas about not needing padding (pre-dating my involvement in the code, and it uses a non-standard alphabet to boot) and the code handles both the non-standard and the RFC 3548 standard.
The driving code is:
/* Encode input data as Base-64 string. Output length returned, or negative error */
static int base64_encode_internal(const char *data, size_t datalen, char *buffer, size_t buflen, char pad)
{
    size_t outlen = BASE64_ENCLENGTH(datalen);
    const char *bin_data = (const void *)data;
    char *b64_data = (void *)buffer;
    if (outlen > buflen)
        return(B64_ERR_OUTPUT_BUFFER_TOO_SMALL);
    while (datalen >= 3)
    {
        encode_triplet(bin_data, b64_data);
        bin_data += 3;
        b64_data += 4;
        datalen -= 3;
    }
    b64_data[0] = '\0';
    if (datalen == 2)
        encode_doublet(bin_data, b64_data, pad);
    else if (datalen == 1)
        encode_singlet(bin_data, b64_data, pad);
    b64_data[4] = '\0';
    return((b64_data - buffer) + strlen(b64_data));
}

/* Encode input data as Base-64 string. Output length returned, or negative error */
int base64_encode(const char *data, size_t datalen, char *buffer, size_t buflen)
{
    return(base64_encode_internal(data, datalen, buffer, buflen, base64_pad));
}
The base64_pad constant is the '='; there's also a base64_encode_nopad() function that supplies '\0' instead. The errors are somewhat arbitrary but relevant to the code.
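The BASE64_ENCLENGTH macro isn't shown; a plausible definition consistent with the code above (my assumption, not necessarily the original) allows four output bytes per whole-or-partial input triplet plus a terminating NUL:
/* Hypothetical: round up to whole triplets, 4 output bytes each, plus NUL */
#define BASE64_ENCLENGTH(len) ((((len) + 2) / 3) * 4 + 1)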
The main point to take away from this is that you should be doing bit manipulation and building up a string that is an exact multiple of 4 bytes for a given input.
std::string EncodeBase64(std::string Data)
{
    std::string Binary = std::string();
    std::string Result = std::string();
    for (std::size_t I = 0; I < Data.size(); ++I)
    {
        Binary += DecToBinStr(Data[I], 8);
    }
    if (Binary.size() % 6)
    {
        Binary.resize(Binary.size() + 6 - Binary.size() % 6, '0');
    }
    for (std::size_t I = 0; I < Binary.size(); I += 6)
    {
        Result += Base64Chars[BinToDecStr(Copy(Binary, I, 6))];
        if (I == 0) ++I;
    }
    if (Result.size() % 4)
    {
        Result.resize(Result.size() + 4 - Result.size() % 4, '=');
    }
    return Result;
}
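With the zero-padding Copy fix from earlier plus this padding logic, the failing case from the question encodes to the expected 16-byte string (a quick check; note that DecodeBase64 still appends a '\0' decoded from the leftover zero bits, which prints invisibly):
int main()
{
    std::string Data = EncodeBase64("IMG." + ::ToString(677) + "*" + ::ToString(604));
    std::cout << Data << std::endl;               // SU1HLjY3Nyo2MDQ=
    std::cout << DecodeBase64(Data) << std::endl; // IMG.677*604
}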
I am trying to read a large file (~5 GB) using ifstream in C++.
Since I'm on a 64-bit OS, I thought this shouldn't be a problem.
Still, I get a segfault. Everything runs fine with smaller files,
so I'm pretty sure that is where the problem is.
I'm using g++ (4.4.5-8) and libstdc++6 (4.4.5-8).
Thanks.
The code looks like this:
void load(const std::string &path, int _dim, int skip = 0, int gap = 0) {
    std::ifstream is(path.c_str(), std::ios::binary);
    BOOST_VERIFY(is);
    is.seekg(0, std::ios::end);
    size_t size = is.tellg();
    size -= skip;
    long int line = sizeof(float) * _dim + gap;
    BOOST_VERIFY(size % line == 0);
    long int _N = size / line;
    reset(_dim, _N);
    is.seekg(skip, std::ios::beg);
    char *off = dims;
    for (long int i = 0; i < N; ++i) {
        is.read(off, sizeof(T) * dim);
        is.seekg(gap, std::ios::cur);
        off += stride;
    }
    BOOST_VERIFY(is);
}
The segfault is in the is.read line for i=187664.
T is float and I'm reading dim=1000 floats at a time.
When the segfault occurs, i * stride is way smaller than size, so I'm not running past the end of the file.
dims is allocated here
void reset(int _dim, int _N)
{
    BOOST_ASSERT((ALIGN % sizeof(T)) == 0);
    dim = _dim;
    N = _N;
    stride = dim * sizeof(T) + ALIGN - 1;
    stride = stride / ALIGN * ALIGN;
    if (dims != NULL) delete[] dims;
    dims = (char *)memalign(ALIGN, N * stride);
    std::fill(dims, dims + N * stride, 0);
}
I don't know if this is the bug, but this code looks very C-like, with plenty of opportunity to leak. Anyway, try changing
void reset (int _dim, int _N)
to
void reset (size_t dim, size_t n)
// I would avoid leading underscores; they are usually used to identify elements of the standard library.
When you are dealing with the size or index of something in memory, ALWAYS use size_t; it is guaranteed to be able to hold the maximum size of any object, including arrays.
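To make the failure mode concrete: with dim = 1000 floats, stride lands around 4 KB after alignment, and a ~5 GB file gives N around 1.25 million, so N * stride no longer fits in 32 bits (the exact numbers below are my assumption):
#include <cstdio>

int main()
{
    int N = 1250000, stride = 4032;          // plausible values for this file (assumption)
    long long bytes = (long long)N * stride; // 5,040,000,000: does not fit in 32 bits
    size_t alloc = (size_t)N * stride;       // what memalign should receive on 64-bit
    // Evaluated in plain int, N * stride overflows (undefined behavior), so
    // memalign gets a much smaller value and later reads run off the buffer.
    printf("%lld %zu\n", bytes, alloc);
}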
I think you have to use _ftelli64 etc. to get the right size of your file, and to use long long (or __int64) variables to manage it. But that's the C library; I haven't found a way to use ifstream with such big files (actually > 2 GB). Did you find one?
PS: In your case size_t is fine, but I'm not sure that's OK in a 32-bit build. I'm sure it's OK in a 64-bit one.
int main()
{
    string name = "tstFile.bin";
    FILE *inFile;
    fopen_s(&inFile, name.c_str(), "rb");
    if (!inFile)
    {
        cout << "\r\n***error -> File not found\r\n";
        return 0;
    }
    _fseeki64(inFile, 0L, SEEK_END);
    long long fileLength = _ftelli64(inFile); // 64-bit tell works past 2 GB
    _fseeki64(inFile, 0L, SEEK_SET);
    cout << "file lg : " << fileLength << endl;
    fclose(inFile);
    return 1;
}