Combine big files - C++

I'm trying to join two big files (like the UNIX cat command: cat file1 file2 > final) in C++.
I don't know how to do it, because every method I try is very slow (for example, copying the second file into the first one line by line).
What is the best method to do that?
Sorry for being so brief, my English is not very good.

If you're using std::fstream, then don't. It's intended primarily for formatted input/output, and char-level operations on it are slower than you'd expect. Instead, use std::filebuf directly. This is in addition to the suggestions in other answers, specifically using a larger buffer size.
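A rough sketch of that suggestion (my own illustration, not the poster's code; the file names, the 1 MB buffer, and the lack of error handling are all assumptions/simplifications):

#include <fstream>
#include <vector>

int main()
{
    std::filebuf in1, in2, out;
    out.open("final", std::ios::out | std::ios::binary);
    in1.open("file1", std::ios::in | std::ios::binary);
    in2.open("file2", std::ios::in | std::ios::binary);

    std::vector<char> block(1 << 20); // 1 MB transfer buffer
    for (std::filebuf* in : { &in1, &in2 })
    {
        std::streamsize n;
        while ((n = in->sgetn(block.data(), block.size())) > 0)
            out.sputn(block.data(), n); // raw bytes, no formatting layer
    }
}

sgetn/sputn move raw bytes through the stream buffers without any of the formatted-I/O machinery.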

Use binary mode with the standard streams and don't treat the data as formatted text.
Here is a demo that transfers the data in blocks:
#include <fstream>
#include <vector>

std::size_t fileSize(std::ifstream& file)
{
    std::size_t size;
    file.seekg(0, std::ios::end);
    size = file.tellg();
    file.seekg(0, std::ios::beg);
    return size;
}

int main()
{
    // 1 MB! Choose a convenient buffer size.
    const std::size_t blockSize = 1024 * 1024;
    std::vector<char> data(blockSize);

    std::ifstream first("first.txt", std::ios::binary),
                  second("second.txt", std::ios::binary);
    std::ofstream result("result.txt", std::ios::binary);

    std::size_t firstSize = fileSize(first);
    std::size_t secondSize = fileSize(second);

    for (std::size_t block = 0; block < firstSize / blockSize; block++)
    {
        first.read(&data[0], blockSize);
        result.write(&data[0], blockSize);
    }
    std::size_t firstFileRestOfData = firstSize % blockSize;
    if (firstFileRestOfData != 0)
    {
        first.read(&data[0], firstFileRestOfData);
        result.write(&data[0], firstFileRestOfData);
    }

    for (std::size_t block = 0; block < secondSize / blockSize; block++)
    {
        second.read(&data[0], blockSize);
        result.write(&data[0], blockSize);
    }
    std::size_t secondFileRestOfData = secondSize % blockSize;
    if (secondFileRestOfData != 0)
    {
        second.read(&data[0], secondFileRestOfData);
        result.write(&data[0], secondFileRestOfData);
    }

    first.close();
    second.close();
    result.close();
    return 0;
}

Using plain old C++:
#include <fstream>
std::ifstream file1("x", ios_base::in | ios_base::binary);
std::ofstream file2("y", ios_base::app | ios_base::binary);
file2 << file1.rdbuf();
The Boost headers claim that copy() is optimized in some cases, though I'm not sure if this counts:
#include <boost/iostreams/copy.hpp>
// The following four overloads of copy_impl() optimize
// copying in the case that one or both of the two devices
// models Direct (see
// http://www.boost.org/libs/iostreams/doc/index.html?path=4.1.1.4)
boost::iostreams::copy(file1, file2);
Update:
The Boost copy function is compatible with a wide variety of types, so this can be combined with Pavel Minaev's suggestion of using std::filebuf, like so:
std::filebuf file1, file2;
file1.open("x", std::ios_base::in | std::ios_base::binary);
file2.open("y", std::ios_base::app | std::ios_base::binary);
file1.pubsetbuf(NULL, 64 * 1024);
file2.pubsetbuf(NULL, 64 * 1024);
boost::iostreams::copy(file1, file2);
Of course the actual optimal buffer size depends on many variables, 64k is just a wild guess.

As an alternative, the following may or may not be faster depending on your file size and the memory on the machine. If memory is tight, you can make the buffer smaller and loop over f2.read, grabbing the data in chunks and writing it to f1 (see the sketch after the code below).
#include <fstream>
#include <iostream>

using namespace std;

int main(int argc, char *argv[])
{
    ofstream f1("test.txt", ios_base::app | ios_base::binary);
    ifstream f2("test2.txt", ios_base::binary);

    f2.seekg(0, ifstream::end);
    unsigned long size = f2.tellg();
    f2.seekg(0);

    char *contents = new char[size];
    f2.read(contents, size);
    f1.write(contents, size);
    delete[] contents;

    f1.close();
    f2.close();
    return 0;
}
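A minimal sketch of the chunked alternative mentioned above (the 64 KB buffer is an arbitrary choice and error handling is omitted):

#include <fstream>
#include <vector>

int main()
{
    std::ofstream f1("test.txt", std::ios_base::app | std::ios_base::binary);
    std::ifstream f2("test2.txt", std::ios_base::binary);

    std::vector<char> chunk(64 * 1024); // smaller buffer, reused for each block
    while (f2)
    {
        f2.read(chunk.data(), chunk.size());
        std::streamsize got = f2.gcount(); // may be short on the last block
        if (got > 0)
            f1.write(chunk.data(), got);
    }
    return 0;
}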

Related

Writing a binary file in C++

So I have this program that supposedly reads any file (e.g. images, txt), gets its data, and creates a new file with that same data. The problem is that I want the data in an array and not in a vector, and when I copy that data to a char array, whenever I try to write those bits into a file it doesn't write the file properly.
So the question is: how can I get the data from std::ifstream input( "hello.txt", std::ios::binary ); and save it in a char array[] so that I can write that data into a new file?
Program:
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>   // open
#include <unistd.h>  // write, close
#include <fstream>
#include <iterator>
#include <vector>
#include <iostream>
#include <algorithm>

int main()
{
    FILE *newfile;
    std::ifstream input("hello.txt", std::ios::binary);
    std::vector<unsigned char> buffer(std::istreambuf_iterator<char>(input), {});

    char arr[buffer.size()];
    std::copy(buffer.begin(), buffer.end(), arr);

    int sdfd;
    sdfd = open("newhello.txt", O_WRONLY | O_CREAT);
    write(sdfd, arr, strlen(arr) * sizeof(char));
    close(sdfd);
    return 0;
}
Try this (it basically uses a char* buffer allocated on the heap; with a file this size you probably can't keep the array on the stack):
#include <iostream>
#include <fstream>

int main()
{
    std::ifstream input("hello.txt", std::ios::binary);
    char* buffer;
    size_t len; // keep the length if you don't want to delete the buffer
    if (input)
    {
        input.seekg(0, input.end);
        len = input.tellg();
        input.seekg(0, input.beg);

        buffer = new char[len];
        input.read(buffer, len);
        input.close();

        std::ofstream fileOut("newhello.txt", std::ios::binary);
        fileOut.write(buffer, len);
        fileOut.close();

        // delete[] buffer; // delete the buffer, or keep it for further use elsewhere
    }
}
This should solve your problem; just remember to keep the length (len here) together with the buffer if you don't delete it.

What are the fastest methods to read from a file in standard C++? [duplicate]

I am currently writing a program in C++ which involves reading lots of large text files. Each has ~400,000 lines, with, in extreme cases, 4,000 or more characters per line. Just for testing, I read one of the files using ifstream and the implementation offered by cplusplus.com. It took around 60 seconds, which is way too long. Now I was wondering: is there a straightforward way to improve reading speed?
Edit:
The code I am using is more or less this:
string tmpString;
ifstream txtFile(path);
if (txtFile.is_open())
{
    while (txtFile.good())
    {
        m_numLines++;
        getline(txtFile, tmpString);
    }
    txtFile.close();
}
Edit 2: The file I read is only 82 MB big. I mainly said that lines could reach 4,000 characters because I thought it might be necessary to know for buffering.
Edit 3: Thank you all for your answers, but it seems like there is not much room for improvement given my problem. I have to use getline, since I want to count the number of lines. Instantiating the ifstream as binary didn't make reading any faster either. I will try to parallelize it as much as I can; that should work at least.
Edit 4: So apparently there are some things I can do. Big thank you to sehe for putting so much time into this, I appreciate it a lot! =)
Updates: Be sure to check the (surprising) updates below the initial answer
Memory mapped files have served me well [1]:
#include <boost/iostreams/device/mapped_file.hpp> // for mmap
#include <algorithm> // for std::find
#include <iostream> // for std::cout
#include <cstring>
int main()
{
    boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    uintmax_t m_numLines = 0;
    while (f && f != l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}
This should be rather quick.
Update
In case it helps you test this approach, here's a version using mmap directly instead of using Boost: see it live on Coliru
#include <algorithm>
#include <iostream>
#include <cstring>
#include <cstdio>   // for perror
#include <cstdlib>  // for exit

// for mmap:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

const char* map_file(const char* fname, size_t& length);

int main()
{
    size_t length;
    auto f = map_file("test.cpp", length);
    auto l = f + length;

    uintmax_t m_numLines = 0;
    while (f && f != l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}

void handle_error(const char* msg) {
    perror(msg);
    exit(255);
}

const char* map_file(const char* fname, size_t& length)
{
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    // obtain file size
    struct stat sb;
    if (fstat(fd, &sb) == -1)
        handle_error("fstat");
    length = sb.st_size;

    const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
    if (addr == MAP_FAILED)
        handle_error("mmap");

    // TODO close fd at some point in time, call munmap(...)
    return addr;
}
Update
The last bit of performance I could squeeze out of this I found by looking at the source of GNU coreutils wc. To my surprise using the following (greatly simplified) code adapted from wc runs in about 84% of the time taken with the memory mapped file above:
static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16 * 1024;
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern. */
    posix_fadvise(fd, 0, 0, 1); // FDADVICE_SEQUENTIAL

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    while (size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if (bytes_read == (size_t)-1)
            handle_error("read failed");
        if (!bytes_read)
            break;

        for (char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    return lines;
}
[1] See e.g. the benchmark here: How to parse space-separated floats in C++ quickly?
4,000 * 400,000 = 1.6 GB. If your hard drive isn't an SSD, you're likely getting ~100 MB/s sequential read; that's 16 seconds just in I/O.
Since you don't elaborate on the specific code you're using or how you need to parse these files (do you need to read it line by line? does the system have a lot of RAM, so you could read the whole file into a large RAM buffer and then parse it?), there's little you can do to speed up the process.
Memory mapped files won't offer any performance improvement when reading a file sequentially. Perhaps manually parsing large chunks for newlines rather than using getline would offer an improvement; a sketch of that follows.
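As a rough illustration of that idea (not tested against the asker's data; the 1 MB buffer and the file name are assumptions), counting lines by scanning chunks with memchr could look like this:

#include <cstdint>
#include <cstring>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::ifstream in("input.txt", std::ios::binary);
    std::vector<char> buf(1 << 20); // 1 MB chunk buffer
    std::uintmax_t lines = 0;

    while (in)
    {
        in.read(buf.data(), buf.size());
        std::streamsize got = in.gcount(); // may be short on the last chunk

        // scan the chunk for newlines instead of calling getline
        for (const char* p = buf.data(), *end = p + got;
             (p = static_cast<const char*>(std::memchr(p, '\n', end - p))); ++p)
            ++lines;
    }
    std::cout << lines << " lines\n";
}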
Edit: After doing some learning (thanks @sehe), here's the memory mapped solution I would likely use.
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <errno.h>

int main() {
    const char* fName = "big.txt";
    struct stat sb;
    long cntr = 0;
    int fd, lineLen;
    char *data;
    char *line;

    // map the file
    fd = open(fName, O_RDONLY);
    fstat(fd, &sb);
    //// int pageSize;
    //// pageSize = getpagesize();
    //// data = mmap((caddr_t)0, pageSize, PROT_READ, MAP_PRIVATE, fd, pageSize);
    data = (char *) mmap((caddr_t)0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    line = data;

    // get lines
    while (cntr < sb.st_size) {
        lineLen = 0;
        line = data;

        // find the next line
        while (*data != '\n' && cntr < sb.st_size) {
            data++;
            cntr++;
            lineLen++;
        }

        // skip past the newline so the next iteration starts on the next line
        if (cntr < sb.st_size) {
            data++;
            cntr++;
        }

        /***** PROCESS LINE *****/
        // ... processLine(line, lineLen);
    }

    return 0;
}
Neil Kirk, unfortunately I cannot reply to your comment (not enough reputation), but I did a performance test on ifstream and stringstream, and the performance, reading a text file line by line, is exactly the same.
std::stringstream stream;
std::string line;
while(std::getline(stream, line)) {
}
This takes 1426ms on a 106MB file.
std::ifstream stream;
std::string line;
while (stream.good()) {
    getline(stream, line);
}
This takes 1433ms on the same file.
The following code, instead, is faster:
const int MAX_LENGTH = 524288;
char* line = new char[MAX_LENGTH];
while (iStream.getline(line, MAX_LENGTH) && strlen(line) > 0) {
}
This takes 884ms on the same file.
It is just a little tricky since you have to set the maximum size of your buffer (i.e. maximum length for each line in the input file).
As someone with a little background in competitive programming, I can tell you: at least for simple things like integer parsing, the main cost in C is locking the file streams (which is done by default for multi-threading). Use the unlocked_stdio versions instead (fgetc_unlocked(), fread_unlocked()). For C++, the common lore is to use std::ios::sync_with_stdio(false), but I don't know if it's as fast as unlocked_stdio.
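For completeness, a minimal sketch of the C++ side of that advice (counting lines with getline after turning off stdio synchronization; whether this matches unlocked_stdio speed is exactly the open question above):

#include <iostream>
#include <string>

int main()
{
    std::ios::sync_with_stdio(false); // detach C++ streams from C stdio
    std::cin.tie(nullptr);            // don't flush cout before every read from cin

    std::string line;
    unsigned long long count = 0;
    while (std::getline(std::cin, line))
        ++count;
    std::cout << count << " lines\n";
}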
For reference here is my standard integer parsing code. It's a lot faster than scanf, as I said mainly due to not locking the stream. For me it was as fast as the best hand-coded mmap or custom buffered versions I'd used previously, without the insane maintenance debt.
int readint(void)
{
    int n, c;
    n = getchar_unlocked() - '0';
    while ((c = getchar_unlocked()) > ' ')
        n = 10 * n + c - '0';
    return n;
}
(Note: This one only works if there is precisely one non-digit character between any two integers).
And of course avoid memory allocation if possible...
Do you have to read all files at the same time? (At the start of your application, for example.)
If you do, consider parallelizing the operation.
Either way, consider using binary streams, or unbuffered reads of blocks of data.
Use random file access or use binary mode. For sequential reading the file is big either way, but it still depends on what you are reading.

Handling a large gzfile in C++

char buffer[1001];
for (; !gzeof(m_fHandle); ) {
    gzread(m_fHandle, buffer, 1000);
    // ...
}
The file I'm handling is more than 1 GB.
Do I load the entire file into the buffer, or should I malloc and allocate the size?
Or should I load it line by line? The file has a "\n" demarcating the EOL. If so, how do I do that for a gzfile in C++?
The zlib approach would be: you can just call gzread with a limited buffer size repeatedly, if you can be sure that the maximum line length is, e.g., BUFLEN. See it Live On Coliru:
#include <zlib.h>
#include <iostream>
#include <algorithm>
#include <cstdlib> // for exit

static const unsigned BUFLEN = 1024;

void error(const char* const msg)
{
    std::cerr << msg << "\n";
    exit(255);
}

void process(gzFile in)
{
    char buf[BUFLEN];
    char* offset = buf;

    for (;;) {
        int err, len = sizeof(buf) - (offset - buf);
        if (len == 0) error("Buffer too small for input line lengths");

        len = gzread(in, offset, len);

        if (len == 0) break;
        if (len < 0) error(gzerror(in, &err));

        char* cur = buf;
        char* end = offset + len;

        for (char* eol; (cur < end) && (eol = std::find(cur, end, '\n')) < end; cur = eol + 1)
        {
            std::cout << std::string(cur, eol) << "\n";
        }

        // any trailing data in [cur, end) now is a partial line
        offset = std::copy(cur, end, buf);
    }

    // BIG CATCH: don't forget about trailing data without an eol :)
    std::cout << std::string(buf, offset);

    if (gzclose(in) != Z_OK) error("failed gzclose");
}

int main()
{
    process(gzopen("test.gz", "rb"));
}
If you cannot know the maximum line size, I'd suggest abstracting it a bit more: derive from std::basic_streambuf, override underflow, and then use std::getline with an istream based on that buffer (a rough sketch follows).
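A rough sketch of that idea (my own illustration, not tested; error handling is minimal and the 4 KB internal buffer is an arbitrary choice):

#include <zlib.h>
#include <iostream>
#include <streambuf>
#include <string>

// A tiny read-only streambuf over a gzFile
class gzstreambuf : public std::streambuf
{
    gzFile file_;
    char buf_[4096]; // arbitrary internal buffer size

public:
    explicit gzstreambuf(const char* path) : file_(gzopen(path, "rb")) {}
    ~gzstreambuf() { if (file_) gzclose(file_); }

protected:
    int_type underflow() override
    {
        if (!file_)
            return traits_type::eof();
        if (gptr() < egptr()) // still unread data in the buffer
            return traits_type::to_int_type(*gptr());

        int n = gzread(file_, buf_, sizeof(buf_)); // refill from the compressed stream
        if (n <= 0)
            return traits_type::eof();

        setg(buf_, buf_, buf_ + n); // reset the get area
        return traits_type::to_int_type(*gptr());
    }
};

int main()
{
    gzstreambuf gbuf("test.gz");
    std::istream in(&gbuf);

    std::string line;
    while (std::getline(in, line))
        std::cout << line << "\n";
}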
UPDATE: Since you're new to C++, implementing your own streambuf is likely not a good idea. I recommend using a C++ library (instead of zlib).
E.g. Boost Iostreams allows you to simply do this:
Live On Coliru
#include <boost/iostreams/device/file.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <string> // for std::string, std::getline

namespace io = boost::iostreams;

int main()
{
    io::filtering_istream in;
    in.push(io::gzip_decompressor());
    in.push(io::file_source("my_file.txt"));

    // read from in using the std::istream interface
    std::string line;
    while (std::getline(in, line, '\n'))
    {
        process(line); // your code :)
    }
}
You say this is a gzfile. That implies a binary format where '\n' is not valid for EOL (there is no concept of EOL with binary files.)
That said, in practice you have a couple of choices for buffer size. Loading the entire file into memory will certainly be easier for you as a developer to work with; however, this is a costly solution in terms of memory consumed for the task.
If memory is a concern, then you need to work on the data in pieces. There is probably an optimal amount of data to fetch at a time, and a lot of that will depend on the hardware architecture of the machine, all the way from the CPU through the cache lines, memory bus, SATA bus, and even the drives that hold the file itself.
If this is just a onesy-twosy kind of problem you're solving and you're running this on a modern computer, 1 GB is probably OK to keep in memory: just new a uint8_t[] the size of the data, read the whole thing in, and then parse it.
Otherwise, you need to integrate your parsing of the file with the reading of the file.
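If you do take the read-it-all-into-memory route mentioned above, keep in mind that for a gz file the decompressed size is larger than the on-disk size, so a growing buffer is simpler. A minimal sketch using zlib's gzread (the file name and chunk size are assumptions):

#include <zlib.h>
#include <iostream>
#include <vector>

int main()
{
    gzFile in = gzopen("big.gz", "rb");
    if (!in) { std::cerr << "open failed\n"; return 1; }

    std::vector<unsigned char> data;
    unsigned char chunk[1 << 16];
    int n;
    while ((n = gzread(in, chunk, sizeof(chunk))) > 0)
        data.insert(data.end(), chunk, chunk + n); // append decompressed bytes
    gzclose(in);

    std::cout << "decompressed size: " << data.size() << " bytes\n";
    // parse data here
}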

How to read a file in multiple chunks until EOF (C++)

So, here's my problem: I want to make a program that reads chunks of data from a file. Let's say 1024 bytes per chunk.
So I read the first 1024 bytes, perform various operations, and then read the next 1024 bytes, without re-reading the old data. The program should keep reading data until EOF is reached.
I'm currently using this code:
std::fstream fin("C:\\file.txt");
vector<char> buffer (1024,0); //reads only the first 1024 bytes
fin.read(&buffer[0], buffer.size());
But how can I read the next 1024 bytes? I was thinking of using a for loop, but I don't really know how. I'm a total noob at C++, so if anyone can help me out, that would be great. Thanks!
You can do this with a loop:
std::ifstream fin("C:\\file.txt", std::ifstream::binary);
std::vector<char> buffer (1024,0); //reads only the first 1024 bytes
while(!fin.eof()) {
fin.read(buffer.data(), buffer.size())
std::streamsize s=fin.gcount();
///do with buffer
}
Edit:
http://en.cppreference.com/w/cpp/io/basic_istream/read
The accepted answer doesn't work for me: it doesn't read the last partial chunk. This does:
void readFile(std::istream &input, UncompressedHandler &handler) {
    std::vector<char> buffer(1024, 0); // reads only 1024 bytes at a time
    while (!input.eof()) {
        input.read(buffer.data(), buffer.size());
        std::streamsize dataSize = input.gcount();
        handler({buffer.begin(), buffer.begin() + dataSize});
    }
}
Here UncompressedHandler accepts std::string, so I use constructor from two iterators.
I think you missed that there is a file position pointer that points to the last place you visited in the file, so when you read a second time you will not start from the beginning, but from the last point you visited.
Have a look at this code:
std::ifstream fin("C:\\file.txt");
char buffer[1024]; // I prefer an array to a vector for such an implementation
fin.read(buffer, sizeof(buffer)); // first read gets the first 1024 bytes
fin.read(buffer, sizeof(buffer)); // second read gets the second 1024 bytes
That is how you can think about this concept.
I think this will work:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <fstream>

// Buffer size 16 megabytes (or any number you like)
size_t buffer_size = 1 << 24; // 1 << 20 is 1 megabyte
char* buffer = new char[buffer_size];

std::streampos fsize = 0;
std::ifstream file("c:\\file.bin", std::ios::binary);

fsize = file.tellg();
file.seekg(0, std::ios::end);
fsize = file.tellg() - fsize;
file.seekg(0, std::ios::beg); // seek back to the start before reading

int loops = fsize / buffer_size;
int lastChunk = fsize % buffer_size;

for (int i = 0; i < loops; i++) {
    file.read(buffer, buffer_size);
    // do what's needed with the buffer
}
if (lastChunk > 0) {
    file.read(buffer, lastChunk);
    // do what's needed with the buffer
}

delete[] buffer;

How to write to a memory buffer with a FILE*?

Is there any way to create a memory buffer as a FILE*? In TiXml it can print the XML to a FILE*, but I can't seem to make it print to a memory buffer.
There is a POSIX way to use memory as a FILE descriptor: fmemopen or open_memstream, depending on the semantics you want: Difference between fmemopen and open_memstream
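As a minimal sketch of the open_memstream variant (POSIX only; the format string is just an example), the buffer grows automatically and you own it after fclose:

#include <stdio.h>
#include <stdlib.h>

int main()
{
    char* buf = NULL;
    size_t size = 0;
    FILE* f = open_memstream(&buf, &size); // buffer grows automatically

    fprintf(f, "hello %d\n", 42);
    fclose(f); // flushes and finalizes buf/size

    printf("captured %zu bytes: %s", size, buf);
    free(buf); // caller owns the buffer
    return 0;
}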
I guess the proper answer is the one by Kevin, but here is a hack to do it with FILE*. Note that if the buffer size (here 100000) is too small, you lose data, as it is written out when the buffer is flushed. Also, if the program calls fflush(), you lose the data.
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    FILE *f = fopen("/dev/null", "w");
    int i;
    int written = 0;
    char *buf = malloc(100000);

    setbuffer(f, buf, 100000);

    for (i = 0; i < 1000; i++)
    {
        written += fprintf(f, "Number %d\n", i);
    }
    for (i = 0; i < written; i++) {
        printf("%c", buf[i]);
    }
}
fmemopen can create a FILE* from a buffer; does that make sense for your use case?
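A minimal fmemopen sketch (POSIX only; the buffer size and contents are arbitrary):

#include <stdio.h>

int main()
{
    char buf[256];
    FILE* f = fmemopen(buf, sizeof(buf), "w"); // writes go into buf
    fprintf(f, "Number %d\n", 42);
    fclose(f); // flushes and NUL-terminates the buffer contents

    printf("buffer now holds: %s", buf);
    return 0;
}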
I wrote a simple example of how I would create an in-memory FILE:
#include <unistd.h>
#include <stdio.h>

int main(){
    int p[2]; pipe(p); FILE *f = fdopen(p[1], "w");

    if (!fork()) {
        fprintf(f, "working");
        return 0;
    }
    fclose(f); close(p[1]);

    char buff[100]; int len;
    while ((len = read(p[0], buff, 100)) > 0)
        printf(" from child: '%.*s'", len, buff);
    puts("");
}
C++ basic_streambuf inheritance
In C++, you should avoid FILE* if you can.
Using only the C++ stdlib, it is possible to make a single interface that transparently uses file or memory IO.
This uses techniques mentioned at: Setting the internal buffer used by a standard stream (pubsetbuf)
#include <cassert>
#include <cstring>
#include <fstream>
#include <iostream>
#include <ostream>
#include <sstream>

/* This can write either to files or memory. */
void write(std::ostream& os) {
    os << "abc";
}

template <typename char_type>
struct ostreambuf : public std::basic_streambuf<char_type, std::char_traits<char_type> > {
    ostreambuf(char_type* buffer, std::streamsize bufferLength) {
        this->setp(buffer, buffer + bufferLength);
    }
};

int main() {
    /* To memory, in our own externally supplied buffer. */
    {
        char c[3];
        ostreambuf<char> buf(c, sizeof(c));
        std::ostream s(&buf);
        write(s);
        assert(memcmp(c, "abc", sizeof(c)) == 0);
    }

    /* To memory, but in a hidden buffer. */
    {
        std::stringstream s;
        write(s);
        assert(s.str() == "abc");
    }

    /* To file. */
    {
        std::ofstream s("a.tmp");
        write(s);
        s.close();
    }

    /* I think this is implementation defined.
     * pubsetbuf calls basic_filebuf::setbuf(). */
    {
        char c[3];
        std::ofstream s;
        s.rdbuf()->pubsetbuf(c, sizeof c);
        write(s);
        s.close();
        //assert(memcmp(c, "abc", sizeof(c)) == 0);
    }
}
Unfortunately, it does not seem possible to interchange FILE* and fstream: Getting a FILE* from a std::fstream
You could use the CStr method of TiXmlPrinter, which the documentation states:
The TiXmlPrinter is useful when you need to:
Print to memory (especially in non-STL mode)
Control formatting (line endings, etc.)
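A minimal sketch of that approach, assuming the usual TinyXML visitor API and a placeholder file name:

#include <tinyxml.h>
#include <cstdio>

int main()
{
    TiXmlDocument doc("doc.xml");
    if (!doc.LoadFile())
        return 1;

    TiXmlPrinter printer;
    doc.Accept(&printer); // walk the document, printing into memory

    const char* xml = printer.CStr(); // the XML text, now in a memory buffer
    printf("%s", xml);
    return 0;
}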
https://github.com/Snaipe/fmem is a wrapper for different platform/version specific implementations of memory streams
It tries in sequence the following implementations:
open_memstream.
fopencookie, with growing dynamic buffer.
funopen, with growing dynamic buffer.
WinAPI temporary memory-backed file.
When no other means is available, fmem falls back to tmpfile().