Handling large gzfile in c++


char buffer[1001];
for(;!gzeof(m_fHandle);){
gzread(m_fHandle, buffer, 1000);
}
The file I'm handling is more than 1 GB.
Do I load the entire file into the buffer, or should I malloc and allocate the full size?
Or should I load it line by line? The file uses "\n" to mark the end of each line. If so, how do I do that when handling a gzfile in C++?

The zlib approach would be:
You can just call gzread repeatedly with a limited buffer size. If you can be sure that the maximum line length is at most, e.g., BUFLEN:
#include <zlib.h>
#include <algorithm>
#include <cstdlib> // exit
#include <iostream>
#include <string>
static const unsigned BUFLEN = 1024;
void error(const char* const msg)
{
std::cerr << msg << "\n";
exit(255);
}
void process(gzFile in)
{
char buf[BUFLEN];
char* offset = buf;
for (;;) {
int err, len = sizeof(buf)-(offset-buf);
if (len == 0) error("Buffer too small for input line lengths");
len = gzread(in, offset, len);
if (len == 0) break;
if (len < 0) error(gzerror(in, &err));
char* cur = buf;
char* end = offset+len;
for (char* eol; (cur<end) && (eol = std::find(cur, end, '\n')) < end; cur = eol + 1)
{
std::cout << std::string(cur, eol) << "\n";
}
// any trailing data in [eol, end) now is a partial line
offset = std::copy(cur, end, buf);
}
// BIG CATCH: don't forget about trailing data without eol :)
std::cout << std::string(buf, offset);
if (gzclose(in) != Z_OK) error("failed gzclose");
}
int main()
{
gzFile in = gzopen("test.gz", "rb");
if (!in) error("failed gzopen"); // gzopen returns NULL on failure
process(in);
}
If you cannot know the maximum line size, I'd suggest abstracting it a bit more and deriving from std::basic_streambuf overriding underflow so you can use std::getline with an istream based on this buffer.
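When the maximum line length is unknown, a lighter alternative to a custom streambuf is to keep the partial line in a std::string that grows as needed. The following is a minimal sketch under that assumption; LineSplitter, feed and finish are hypothetical names, not part of zlib, and each chunk is whatever a single gzread call returned.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of zlib): accumulates raw chunks and
// returns complete lines, carrying any partial line over to the next call.
class LineSplitter {
public:
    // Feed one chunk, e.g. the `len` bytes just returned by gzread.
    // Returns every line completed by this chunk, without the '\n'.
    std::vector<std::string> feed(const char* data, std::size_t len) {
        std::vector<std::string> lines;
        std::size_t start = 0;
        for (std::size_t i = 0; i < len; ++i) {
            if (data[i] == '\n') {
                carry_.append(data + start, i - start);
                lines.push_back(carry_);
                carry_.clear();
                start = i + 1;
            }
        }
        carry_.append(data + start, len - start); // partial line, if any
        return lines;
    }

    // Call once after the last chunk: returns the trailing data that had no
    // terminating '\n' (empty if the input ended with a newline).
    std::string finish() {
        std::string rest;
        rest.swap(carry_);
        return rest;
    }

private:
    std::string carry_;
};
```

Inside a gzread loop this would be used as `for (const auto& line : splitter.feed(buf, len)) process(line);`, followed by one call to `finish()` after the loop. Memory use is bounded by the longest line rather than by the file size.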
UPDATE: Since you're new to C++, implementing your own streambuf is likely not a good idea. I recommend using a C++ library (instead of using zlib directly).
E.g. Boost Iostreams allows you to simply do this:
#include <boost/iostreams/device/file.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <ios>
#include <string>
namespace io = boost::iostreams;
int main()
{
io::filtering_istream in;
in.push(io::gzip_decompressor());
// open in binary mode: the compressed data is not text
in.push(io::file_source("my_file.txt", std::ios_base::in | std::ios_base::binary));
// read from in using std::istream interface
std::string line;
while (std::getline(in, line, '\n'))
{
process(line); // your code :)
}
}

You say this is a gzfile. That implies a binary format where '\n' is not valid for EOL (there is no concept of EOL with binary files.)
That said, in practice you have a couple choices for buffer size. Loading the entire file into memory will certainly be easier for you as a developer to work with the data. However, this is a costly solution in terms of memory consumed for the task.
If memory is a concern then you need to work on the data in pieces. There is probably an optimal amount of data to try to fetch at a time and a lot of that will depend on the hardware architecture of the machine you have all the way from the CPU through cache lines, memory bus, SATA bus, and even the drives that hold the file itself.
If this is just a onesy-twosy kind of problem you're solving and you're running this on a modern computer, 1GB is probably ok to keep in memory. Just new a uint8_t[] the size of the file and read the whole thing in then parse the data.
Otherwise, you need to integrate your parsing of the file with the reading of the file.

Related

What are the fastest methods to read from a file in standard C++? [duplicate]

I am currently writing a program in C++ that reads lots of large text files. Each has ~400,000 lines with, in extreme cases, 4000 or more characters per line. Just for testing, I read one of the files using ifstream and the implementation offered by cplusplus.com. It took around 60 seconds, which is way too long. Is there a straightforward way to improve the reading speed?
edit:
The code I am using is more or less this:
string tmpString;
ifstream txtFile(path);
if(txtFile.is_open())
{
while(txtFile.good())
{
m_numLines++;
getline(txtFile, tmpString);
}
txtFile.close();
}
edit 2: The file I read is only 82 MB big. I mainly said that it could reach 4000 because I thought it might be necessary to know in order to do buffering.
edit 3: Thank you all for your answers, but it seems like there is not much room to improve given my problem. I have to use readline, since I want to count the number of lines. Instantiating the ifstream as binary didn't make reading any faster either. I will try to parallelize it as much as I can, that should work at least.
edit 4: So apparently there are some things I can do. Big thank you to sehe for putting so much time into this, I appreciate it a lot! =)

Updates: Be sure to check the (surprising) updates below the initial answer
Memory mapped files have served me well [1]:
#include <boost/iostreams/device/mapped_file.hpp> // for mmap
#include <algorithm> // for std::find
#include <iostream> // for std::cout
#include <cstring>
int main()
{
boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
auto f = mmap.const_data();
auto l = f + mmap.size();
uintmax_t m_numLines = 0;
while (f && f!=l)
if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
m_numLines++, f++;
std::cout << "m_numLines = " << m_numLines << "\n";
}
This should be rather quick.
Update
In case it helps you test this approach, here's a version using mmap directly instead of using Boost:
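#include <algorithm>
#include <iostream>
#include <cstdint> // uintmax_t
#include <cstdio> // perror
#include <cstdlib> // exit
#include <cstring>
// for mmap:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h> // read, close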
const char* map_file(const char* fname, size_t& length);
int main()
{
size_t length;
auto f = map_file("test.cpp", length);
auto l = f + length;
uintmax_t m_numLines = 0;
while (f && f!=l)
if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
m_numLines++, f++;
std::cout << "m_numLines = " << m_numLines << "\n";
}
void handle_error(const char* msg) {
perror(msg);
exit(255);
}
const char* map_file(const char* fname, size_t& length)
{
int fd = open(fname, O_RDONLY);
if (fd == -1)
handle_error("open");
// obtain file size
struct stat sb;
if (fstat(fd, &sb) == -1)
handle_error("fstat");
length = sb.st_size;
const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
if (addr == MAP_FAILED)
handle_error("mmap");
// TODO close fd at some point in time, call munmap(...)
return addr;
}
Update
The last bit of performance I could squeeze out of this I found by looking at the source of GNU coreutils wc. To my surprise using the following (greatly simplified) code adapted from wc runs in about 84% of the time taken with the memory mapped file above:
static uintmax_t wc(char const *fname)
{
static const auto BUFFER_SIZE = 16*1024;
int fd = open(fname, O_RDONLY);
if(fd == -1)
handle_error("open");
/* Advise the kernel of our access pattern. */
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL); // use the named constant: its numeric value is platform-specific
char buf[BUFFER_SIZE + 1];
uintmax_t lines = 0;
while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
{
if(bytes_read == (size_t)-1)
handle_error("read failed");
if (!bytes_read)
break;
for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
++lines;
}
return lines;
}
[1] See e.g. the benchmark here: How to parse space-separated floats in C++ quickly?

4000 * 400,000 = 1.6 GB. If your hard drive isn't an SSD, you're likely getting ~100 MB/s sequential read. That's 16 seconds just in I/O.
Since you don't elaborate on the specific code you're using or how you need to parse these files (do you need to read them line by line? does the system have enough RAM to read the whole file into a large buffer and then parse it?), there's little you can do to speed up the process.
Memory mapped files won't offer any performance improvement when reading a file sequentially. Perhaps manually parsing large chunks for new lines rather than using "getline" would offer an improvement.
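The idea of scanning large chunks for newlines instead of calling getline can be sketched with standard streams only. This is an illustration, not a benchmarked implementation: count_newlines and count_newlines_in_file are hypothetical names, and the 64 KiB default block size is an arbitrary choice.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <istream>
#include <string>
#include <vector>

// Counts '\n' bytes by reading fixed-size blocks instead of calling getline.
// `chunk_size` is an illustrative default, not a tuned value.
std::uintmax_t count_newlines(std::istream& in, std::size_t chunk_size = 64 * 1024) {
    std::vector<char> buf(chunk_size);
    std::uintmax_t lines = 0;
    for (;;) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = in.gcount(); // short on the last block
        if (got <= 0) break;
        lines += static_cast<std::uintmax_t>(
            std::count(buf.data(), buf.data() + got, '\n'));
    }
    return lines;
}

// Convenience wrapper: opens the file in binary mode so that no newline
// translation happens. Returns 0 if the file cannot be opened.
std::uintmax_t count_newlines_in_file(const std::string& path,
                                      std::size_t chunk_size = 64 * 1024) {
    std::ifstream in(path.c_str(), std::ios_base::in | std::ios_base::binary);
    return count_newlines(in, chunk_size);
}
```

Note that this counts newline characters, so a final line without a trailing '\n' is not counted, which matches what wc -l reports.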
EDIT: After doing some learning (thanks @sehe), here's the memory mapped solution I would likely use.
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <errno.h>
int main() {
const char* fName = "big.txt";
//
struct stat sb;
long cntr = 0;
int fd, lineLen;
char *data;
char *line;
// map the file
fd = open(fName, O_RDONLY);
fstat(fd, &sb);
//// int pageSize;
//// pageSize = getpagesize();
//// data = mmap((caddr_t)0, pageSize, PROT_READ, MAP_PRIVATE, fd, pageSize);
data = mmap((caddr_t)0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
line = data;
// get lines
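while(cntr < sb.st_size) {
lineLen = 0;
line = data;
// find the next line (check the bound before dereferencing)
while(cntr < sb.st_size && *data != '\n') {
data++;
cntr++;
lineLen++;
}
/***** PROCESS LINE *****/
// ... processLine(line, lineLen);
// step over the '\n' so the outer loop makes progress
if(cntr < sb.st_size) {
data++;
cntr++;
}
}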
return 0;
}

Neil Kirk, unfortunately I cannot reply to your comment (not enough reputation), but I did a performance test on ifstream and stringstream, and the performance when reading a text file line by line is exactly the same.
std::stringstream stream;
std::string line;
while(std::getline(stream, line)) {
}
This takes 1426ms on a 106MB file.
std::ifstream stream;
std::string line;
while(stream.good()) {
getline(stream, line);
}
This takes 1433ms on the same file.
The following code is faster instead:
const int MAX_LENGTH = 524288;
char* line = new char[MAX_LENGTH];
while (iStream.getline(line, MAX_LENGTH) && strlen(line) > 0) {
}
This takes 884ms on the same file.
It is just a little tricky since you have to set the maximum size of your buffer (i.e. maximum length for each line in the input file).

As someone with a little background in competitive programming, I can tell you: At least for simple things like integer parsing the main cost in C is locking the file streams (which is by default done for multi-threading). Use the unlocked_stdio versions instead (fgetc_unlocked(), fread_unlocked()). For C++, the common lore is to use std::ios::sync_with_stdio(false) but I don't know if it's as fast as unlocked_stdio.
For reference here is my standard integer parsing code. It's a lot faster than scanf, as I said mainly due to not locking the stream. For me it was as fast as the best hand-coded mmap or custom buffered versions I'd used previously, without the insane maintenance debt.
int readint(void)
{
int n, c;
n = getchar_unlocked() - '0';
while ((c = getchar_unlocked()) > ' ')
n = 10*n + c-'0';
return n;
}
(Note: This one only works if there is precisely one non-digit character between any two integers).
And of course avoid memory allocation if possible...
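The restriction noted above (exactly one non-digit character between integers) can be relaxed by skipping any run of separators. The sketch below does this on a std::streambuf, which reads characters without the per-call formatting overhead of operator>>; it is an assumption-laden illustration, not a measured replacement for getchar_unlocked. read_int and parse_all are hypothetical names, and overflow is deliberately not handled.

```cpp
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>
#include <vector>

// Reads the next decimal integer from `sb`, skipping any run of non-digit
// characters before it. Returns false once no further integer is found.
// Overflow is not checked in this sketch.
bool read_int(std::streambuf& sb, int& out) {
    typedef std::char_traits<char> traits;
    traits::int_type c = sb.sgetc();
    bool negative = false;
    // Skip separators; a '-' counts only if it directly precedes a digit.
    while (c != traits::eof() && (c < '0' || c > '9')) {
        negative = (c == '-');
        c = sb.snextc();
    }
    if (c == traits::eof()) return false;
    int n = 0;
    while (c != traits::eof() && c >= '0' && c <= '9') {
        n = 10 * n + (c - '0');
        c = sb.snextc();
    }
    out = negative ? -n : n;
    return true;
}

// Usage example: collect every integer found in a piece of text.
std::vector<int> parse_all(const std::string& text) {
    std::istringstream in(text);
    std::vector<int> values;
    int v = 0;
    while (read_int(*in.rdbuf(), v)) values.push_back(v);
    return values;
}
```

For standard input the same function can be called as `read_int(*std::cin.rdbuf(), n)`, typically after `std::ios::sync_with_stdio(false)`.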

Do you have to read all files at the same time? (at the start of your application for example)
If you do, consider parallelizing the operation.
Either way, consider using binary streams, or unbuffered reads for blocks of data.

Use random file access or binary mode. For sequential reading this is a big file, but it still depends on what you are reading.

File.exe has Triggered a Breakpoint because of Fseek

I'm trying to determine how big the file I'm reading is in bytes, so I used fseek to jump to the end, and it triggered the error: file.exe has triggered a breakpoint.
Here's the code:
FileUtils.cpp:
#include "FileUtils.h"
namespace impact {
std::string read_file(const char* filepath)
{
FILE* file = fopen(filepath, "rt");
fseek(file, 0, SEEK_END);
unsigned long length = ftell(file);
char* data = new char[length + 1];
memset(data, 0, length + 1);
fseek(file, 0 ,SEEK_SET);
fread(data, 1, length, file);
fclose(file);
std::string result(data);
delete[] data;
return result;
}
}
FileUtils.h:
#pragma once
#include <stdio.h>
#include <string>
#include <fstream>
namespace impact {
std::string read_file(const char* filepath);
}
If more info is required just ask me for it I would be more than happy to provide more!

You are doing this in the C way, C++ has much better (in my opinion) ways of handling files.
Your error looks like it may be caused because the file didn't open correctly (you need to check if file != nullptr).
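For readers who want to stay with C stdio, the missing checks could look like the following sketch. It is one possible arrangement, not the asker's final code: read_file_checked is a hypothetical name, and it opens the file in binary mode ("rb") so that the size reported by ftell matches the number of bytes fread returns.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Reads a whole file with C stdio, checking each call that can fail.
std::string read_file_checked(const char* filepath) {
    std::FILE* file = std::fopen(filepath, "rb"); // binary: size matches bytes read
    if (file == nullptr)
        throw std::runtime_error(std::string("cannot open ") + filepath);

    long length = -1;
    if (std::fseek(file, 0, SEEK_END) == 0)
        length = std::ftell(file);
    if (length < 0 || std::fseek(file, 0, SEEK_SET) != 0) {
        std::fclose(file);
        throw std::runtime_error(std::string("cannot determine size of ") + filepath);
    }

    std::string result(static_cast<std::size_t>(length), '\0');
    const std::size_t got =
        length > 0 ? std::fread(&result[0], 1, result.size(), file) : 0;
    std::fclose(file);
    result.resize(got); // keep only what was actually read
    return result;
}
```

Reading straight into the std::string also removes the new[]/delete[] pair and the extra copy made by std::string result(data).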
To do this in C++17 you should use the standard library filesystem
(Note: With pre-C++17 compilers that ship <experimental/filesystem>, you can also do this using the std::experimental::filesystem namespace)
Example:
std::string read_file(const std::filesystem::path& filepath) {
auto f_size = std::filesystem::file_size(filepath);
...
}
Additionally to read a file in C++ you do not need to know the size of the file. You can use streams:
std::string read_file(const std::filesystem::path& filepath) {
std::ifstream file(filepath); // Open the file
// Throw if failed to open the file
if (!file) throw std::runtime_error("File failed to open");
std::stringstream data; // Create the buffer
data << file.rdbuf(); // Read into the buffer the internal buffer of the file
return data.str(); // Convert the stringstream to string and return it
}
As you can see, the C++ way of doing it is much shorter and much easier to debug (helpful exceptions with descriptions are thrown when something goes wrong)

Best way to read binary file c++ through input redirection

I am trying to read a large binary file through input redirection (stdin) at runtime, and stdin is mandatory.
./a.out < input.bin
So far I have used fgets. But fgets skips blanks and newline. I want to include both. My currentBuffersize could dynamically vary.
FILE * inputFileStream = stdin;
int currentPos = INIT_BUFFER_SIZE;
int currentBufferSize = 24; // opt
unsigned short int count = 0; // As Max number of packets 30,000/65,536
while (!feof(inputFileStream)) {
char buf[INIT_BUFFER_SIZE]; // size of byte
fgets(buf, sizeof(buf), inputFileStream);
cout<<buf;
cout<<endl;
}
Thanks in advance.

If it were me I would probably do something similar to this:
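#include <array>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <stdexcept>
#include <vector>
const std::size_t INIT_BUFFER_SIZE = 1024;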
int main()
{
try
{
// on some systems you may need to reopen stdin in binary mode
// this is supposed to be reasonably portable
if(!std::freopen(nullptr, "rb", stdin))
throw std::runtime_error(std::strerror(errno));
std::size_t len;
std::array<char, INIT_BUFFER_SIZE> buf;
// somewhere to store the data
std::vector<char> input;
// use std::fread and remember to only use as many bytes as are returned
// according to len
while((len = std::fread(buf.data(), sizeof(buf[0]), buf.size(), stdin)) > 0)
{
// whoopsie
if(std::ferror(stdin) && !std::feof(stdin))
throw std::runtime_error(std::strerror(errno));
// use {buf.data(), buf.data() + len} here
input.insert(input.end(), buf.data(), buf.data() + len); // append to vector
}
// use input vector here
}
catch(std::exception const& e)
{
std::cerr << e.what() << '\n';
return EXIT_FAILURE;
}
return EXIT_SUCCESS;
}
Note that you may need to reopen stdin in binary mode. I'm not sure how portable that is, but various documentation suggests it is reasonably well supported across systems.

Trouble with C++ file I/O

Noobie Alert.
Ugh. I'm having some real trouble getting some basic file I/O stuff done using <stdio.h> or <fstream>. They both seem so clunky and non-intuitive to use. I mean, why couldn't C++ just provide a way to get a char* pointer to the first char in the file? That's all I'd ever want.
I'm doing Project Euler Question 13 and need to play with 50-digit numbers. I have the 150 numbers stored in the file 13.txt and I'm trying to create a 150x50 array so I can play with the digits of each number directly. But I'm having tons of trouble. I've tried using the C++ <fstream> library and recently straight <stdio.h> to get it done, but something must not be clicking for me. Here's what I have;
#include <iostream>
#include <stdio.h>
int main() {
const unsigned N = 100;
const unsigned D = 50;
unsigned short nums[N][D];
FILE* f = fopen("13.txt", "r");
//error-checking for NULL return
unsigned short *d_ptr = &nums[0][0];
int c = 0;
while ((c = fgetc(f)) != EOF) {
if (c == '\n' || c == '\t' || c == ' ') {
continue;
}
*d_ptr = (short)(c-0x30);
++d_ptr;
}
fclose(f);
//do stuff
return 0;
}
Can someone offer some advice? Perhaps a C++ guy on which I/O library they prefer?

Here's a nice efficient solution (but doesn't work with pipes):
std::vector<char> content;
FILE* f = fopen("13.txt", "r");
// error-checking goes here
fseek(f, 0, SEEK_END);
content.resize(ftell(f));
fseek(f, 0, SEEK_SET);
fread(&content[0], 1, content.size(), f);
fclose(f);
Here's another:
std::vector<char> content;
struct stat fileinfo;
stat("13.txt", &fileinfo);
// error-checking goes here
content.resize(fileinfo.st_size);
FILE* f = fopen("13.txt", "r");
// error-checking goes here
fread(&content[0], 1, content.size(), f);
// error-checking goes here
fclose(f);

I would use an fstream. The one problem you have is that you obviously can't fit the numbers in the file into any of C++'s native numeric types (double, long long, etc.)
Reading them into strings is pretty easy though:
std::fstream in("13.txt");
std::vector<std::string> numbers((std::istream_iterator<std::string>(in)),
std::istream_iterator<std::string>());
That will read in each number into a string, so the number that was on the first line will be in numbers[0], the second line in numbers[1], and so on.
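Since the question asks for direct access to individual digits, the strings in numbers can be converted into rows of digit values. The following is a small sketch under that assumption; to_digit_rows is a hypothetical helper name, and characters that are not decimal digits are simply skipped.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Converts each numeric string into a row of digit values (0..9), so that
// rows[i][j] is the j-th digit of the i-th number.
std::vector<std::vector<unsigned short> >
to_digit_rows(const std::vector<std::string>& numbers) {
    std::vector<std::vector<unsigned short> > rows;
    rows.reserve(numbers.size());
    for (std::size_t i = 0; i < numbers.size(); ++i) {
        std::vector<unsigned short> row;
        row.reserve(numbers[i].size());
        for (std::size_t j = 0; j < numbers[i].size(); ++j) {
            const char ch = numbers[i][j];
            if (ch < '0' || ch > '9') continue; // ignore anything that is not a digit
            row.push_back(static_cast<unsigned short>(ch - '0'));
        }
        rows.push_back(row);
    }
    return rows;
}
```

Unlike a fixed unsigned short nums[N][D] array, the vectors grow with the input, so a file with more rows than expected cannot write past the end of the storage.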
If you really want to do the job in C, it can still be quite a lot easier than what you have above:
char *dupe(char const *in) {
char *ret;
if (NULL != (ret=malloc(strlen(in)+1)))
strcpy(ret, in);
return ret;
}
// read the data:
char buffer[256];
char *strings[256];
size_t pos = 0;
while (pos < 256 && fgets(buffer, sizeof(buffer), stdin))
strings[pos++] = dupe(buffer);

Rather than reading the one hundred 50 digit numbers from a file, why not read them directly in from a character constant?
You could start your code out with:
static const char numbers[] =
"37107287533902102798797998220837590246510135740250"
"46376937677490009712648124896970078050417018260538"...
With a semicolon at the last line.

Combine big files

I'm trying to join two big files (like the UNIX cat command: cat file1 file2 > final) in C++.
I don't know how to do it, because every method I try is very slow (for example, copying the second file into the first one line by line).
What is the best method to do that?
Sorry for being so brief; my English is not very good.

If you're using std::fstream, then don't. It's intended primarily for formatted input/output, and char-level operations for it are slower than you'd expect. Instead, use std::filebuf directly. This is in addition to suggestions in other answers, specifically, using the larger buffer size.
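A minimal sketch of what using std::filebuf directly might look like is given here. It is not taken from the answer: append_file is a hypothetical name, and the 64 KiB block size is only a starting point to tune.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Appends the contents of `src_path` to `dst_path` through std::filebuf,
// bypassing the formatted-I/O layer. Returns false if either file cannot be
// opened or a write comes up short.
bool append_file(const std::string& src_path, const std::string& dst_path,
                 std::size_t block_size = 64 * 1024) {
    std::filebuf src, dst;
    if (!src.open(src_path.c_str(), std::ios_base::in | std::ios_base::binary))
        return false;
    if (!dst.open(dst_path.c_str(),
                  std::ios_base::out | std::ios_base::app | std::ios_base::binary))
        return false;

    std::vector<char> block(block_size);
    for (;;) {
        const std::streamsize got =
            src.sgetn(block.data(), static_cast<std::streamsize>(block.size()));
        if (got <= 0) break; // end of input
        if (dst.sputn(block.data(), got) != got) return false;
    }
    return dst.close() != nullptr; // close() flushes; null signals failure
}
```

To emulate cat file1 file2 > final, call append_file("file1", "final") and then append_file("file2", "final"), starting from an empty or absent final.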

Use binary-mode in the standard streams to do the job, don't deal with it as formatted data.
This is a demo if you want transfer the data in blocks:
#include <fstream>
#include <vector>
std::size_t fileSize(std::ifstream& file)
{
std::size_t size;
file.seekg(0, std::ios::end);
size = file.tellg();
file.seekg(0, std::ios::beg);
return size;
}
int main()
{
// 1MB! choose a convenient buffer size.
const std::size_t blockSize = 1024 * 1024;
std::vector<char> data(blockSize);
std::ifstream first("first.txt", std::ios::binary),
second("second.txt", std::ios::binary);
std::ofstream result("result.txt", std::ios::binary);
std::size_t firstSize = fileSize(first);
std::size_t secondSize = fileSize(second);
for(std::size_t block = 0; block < firstSize/blockSize; block++)
{
first.read(&data[0], blockSize);
result.write(&data[0], blockSize);
}
std::size_t firstFilerestOfData = firstSize%blockSize;
if(firstFilerestOfData != 0)
{
first.read(&data[0], firstFilerestOfData);
result.write(&data[0], firstFilerestOfData);
}
for(std::size_t block = 0; block < secondSize/blockSize; block++)
{
second.read(&data[0], blockSize);
result.write(&data[0], blockSize);
}
std::size_t secondFilerestOfData = secondSize%blockSize;
if(secondFilerestOfData != 0)
{
second.read(&data[0], secondFilerestOfData);
result.write(&data[0], secondFilerestOfData);
}
first.close();
second.close();
result.close();
return 0;
}

Using plain old C++:
#include <fstream>
std::ifstream file1("x", std::ios_base::in | std::ios_base::binary);
std::ofstream file2("y", std::ios_base::app | std::ios_base::binary);
file2 << file1.rdbuf();
The Boost headers claim that copy() is optimized in some cases, though I'm not sure if this counts:
#include <boost/iostreams/copy.hpp>
// The following four overloads of copy_impl() optimize
// copying in the case that one or both of the two devices
// models Direct (see
// http://www.boost.org/libs/iostreams/doc/index.html?path=4.1.1.4)
boost::iostreams::copy(file1, file2);
update:
The Boost copy function is compatible with a wide variety of types, so this can be combined with Pavel Minaev's suggestion of using std::filebuf like so:
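// setbuf is protected; pubsetbuf is the public entry point. Its effect is
// implementation-defined, so supply real buffers and call it before open().
std::vector<char> buf1(64 * 1024), buf2(64 * 1024);
std::filebuf file1, file2;
file1.pubsetbuf(buf1.data(), buf1.size());
file2.pubsetbuf(buf2.data(), buf2.size());
file1.open("x", std::ios_base::in | std::ios_base::binary);
file2.open("y", std::ios_base::app | std::ios_base::binary);
boost::iostreams::copy(file1, file2);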
Of course the actual optimal buffer size depends on many variables, 64k is just a wild guess.

As an alternative, which may or may not be faster depending on your file size and the memory on the machine, you can read the whole second file into memory and append it in one write. If memory is tight, you can make the buffer smaller and loop over f2.read, grabbing the data in chunks and writing it to f1.
#include <fstream>
#include <iostream>
using namespace std;
int main(int argc, char *argv[])
{
ofstream f1("test.txt", ios_base::app | ios_base::binary);
ifstream f2("test2.txt", ios_base::binary);
f2.seekg(0,ifstream::end);
unsigned long size = f2.tellg();
f2.seekg(0);
char *contents = new char[size];
f2.read(contents, size);
f1.write(contents, size);
delete[] contents;
f1.close();
f2.close();
return 0;
}
