Is there any way to create a memory buffer as a FILE*? TiXml can print the XML to a FILE*, but I can't seem to make it print to a memory buffer.
There is a POSIX way to use memory as a FILE stream: fmemopen or open_memstream, depending on the semantics you want. See: Difference between fmemopen and open_memstream
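As a rough illustration of the open_memstream route, here is a minimal sketch, assuming a POSIX.1-2008 system (it will not build on plain Windows). The helper name capture_to_string and its callback parameter are made up for this example; the only library calls are open_memstream, fclose and free.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: runs `writer` against an in-memory FILE* and returns
// everything it wrote. Assumes POSIX.1-2008 open_memstream() is available.
template <typename Writer>
std::string capture_to_string(Writer writer) {
    char* data = nullptr;   // set by the library on fflush/fclose
    size_t size = 0;        // number of bytes written so far
    FILE* f = open_memstream(&data, &size);
    if (!f) return std::string();
    writer(f);              // any code that prints to a FILE* can go here
    std::fclose(f);         // finalizes data and size
    std::string result(data, size);
    std::free(data);        // the buffer was malloc'ed by the library
    return result;
}
```

Any function that accepts a FILE* can be passed in through the callback, so the printing code itself does not need to change.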
I guess the proper answer is that by Kevin. But here is a hack to do it with FILE *. Note that if the buffer size (here 100000) is too small then you lose data, as it is written out when the buffer is flushed. Also, if the program calls fflush() you lose the data.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
FILE *f = fopen("/dev/null", "w");
int i;
int written = 0;
char *buf = malloc(100000);
setbuffer(f, buf, 100000);
for (i = 0; i < 1000; i++)
{
written += fprintf(f, "Number %d\n", i);
}
for (i = 0; i < written; i++) {
printf("%c", buf[i]);
}
}
fmemopen can create FILE from buffer, does it make any sense to you?
I wrote a simple example of how I would create an in-memory FILE:
#include <unistd.h>
#include <stdio.h>
int main(){
int p[2]; pipe(p); FILE *f = fdopen( p[1], "w" );
if( !fork() ){
fprintf( f, "working" );
return 0;
}
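fclose(f); /* also closes p[1], so no separate close() is needed */
char buff[100]; int len;
while( (len=read(p[0], buff, 100))>0 )
printf(" from child: '%.*s'", len, buff ); /* buff is not NUL-terminated, so limit by precision */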
puts("");
}
C++ basic_streambuf inheritance
In C++, you should avoid FILE* if you can.
Using only the C++ stdlib, it is possible to make a single interface that transparently uses file or memory IO.
This uses techniques mentioned at: Setting the internal buffer used by a standard stream (pubsetbuf)
#include <cassert>
#include <cstring>
#include <fstream>
#include <iostream>
#include <ostream>
#include <sstream>
/* This can write either to files or memory. */
void write(std::ostream& os) {
os << "abc";
}
template <typename char_type>
struct ostreambuf : public std::basic_streambuf<char_type, std::char_traits<char_type> > {
ostreambuf(char_type* buffer, std::streamsize bufferLength) {
this->setp(buffer, buffer + bufferLength);
}
};
int main() {
/* To memory, in our own externally supplied buffer. */
{
char c[3];
ostreambuf<char> buf(c, sizeof(c));
std::ostream s(&buf);
write(s);
assert(memcmp(c, "abc", sizeof(c)) == 0);
}
/* To memory, but in a hidden buffer. */
{
std::stringstream s;
write(s);
assert(s.str() == "abc");
}
/* To file. */
{
std::ofstream s("a.tmp");
write(s);
s.close();
}
/* I think this is implementation defined.
* pubsetbuf calls basic_filebuf::setbuf(). */
{
char c[3];
std::ofstream s;
s.rdbuf()->pubsetbuf(c, sizeof c);
write(s);
s.close();
//assert(memcmp(c, "abc", sizeof(c)) == 0);
}
}
Unfortunately, it does not seem possible to interchange FILE* and fstream: Getting a FILE* from a std::fstream
You could use the CStr method of TiXmlPrinter, which the documentation describes as follows:
The TiXmlPrinter is useful when you
need to:
Print to memory (especially in non-STL mode)
Control formatting (line endings, etc.)
https://github.com/Snaipe/fmem is a wrapper for different platform/version specific implementations of memory streams
It tries in sequence the following implementations:
open_memstream.
fopencookie, with growing dynamic buffer.
funopen, with growing dynamic buffer.
WinAPI temporary memory-backed file.
When no other means is available, fmem falls back to tmpfile().
I am currently writing a program in C++ that reads lots of large text files. Each has ~400,000 lines with, in extreme cases, 4000 or more characters per line. Just for testing, I read one of the files using ifstream and the implementation offered by cplusplus.com. It took around 60 seconds, which is way too long. Is there a straightforward way to improve reading speed?
edit:
The code I am using is more or less this:
string tmpString;
ifstream txtFile(path);
if(txtFile.is_open())
{
while(txtFile.good())
{
m_numLines++;
getline(txtFile, tmpString);
}
txtFile.close();
}
edit 2: The file I read is only 82 MB big. I mainly said that it could reach 4000 because I thought it might be necessary to know in order to do buffering.
edit 3: Thank you all for your answers, but it seems like there is not much room to improve given my problem. I have to use getline, since I want to count the number of lines. Opening the ifstream in binary mode didn't make reading any faster either. I will try to parallelize it as much as I can; that should work at least.
edit 4: So apparently there are some things I can do. Big thank you to sehe for putting so much time into this, I appreciate it a lot! =)
Updates: Be sure to check the (surprising) updates below the initial answer
Memory mapped files have served me well [1]:
#include <boost/iostreams/device/mapped_file.hpp> // for mmap
#include <algorithm> // for std::find
#include <iostream> // for std::cout
#include <cstring> // for memchr
#include <cstdint> // for uintmax_t
int main()
{
boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
auto f = mmap.const_data();
auto l = f + mmap.size();
uintmax_t m_numLines = 0;
while (f && f!=l)
if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
m_numLines++, f++;
std::cout << "m_numLines = " << m_numLines << "\n";
}
This should be rather quick.
Update
In case it helps you test this approach, here's a version using mmap directly instead of using Boost: see it live on Coliru
#include <algorithm>
#include <iostream>
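#include <cstring> // for memchr
#include <cstdint> // for uintmax_t
#include <cstdio> // for perror
#include <cstdlib> // for exit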
// for mmap:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
const char* map_file(const char* fname, size_t& length);
int main()
{
size_t length;
auto f = map_file("test.cpp", length);
auto l = f + length;
uintmax_t m_numLines = 0;
while (f && f!=l)
if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
m_numLines++, f++;
std::cout << "m_numLines = " << m_numLines << "\n";
}
void handle_error(const char* msg) {
perror(msg);
exit(255);
}
const char* map_file(const char* fname, size_t& length)
{
int fd = open(fname, O_RDONLY);
if (fd == -1)
handle_error("open");
// obtain file size
struct stat sb;
if (fstat(fd, &sb) == -1)
handle_error("fstat");
length = sb.st_size;
const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
if (addr == MAP_FAILED)
handle_error("mmap");
// TODO close fd at some point in time, call munmap(...)
return addr;
}
Update
The last bit of performance I could squeeze out of this I found by looking at the source of GNU coreutils wc. To my surprise using the following (greatly simplified) code adapted from wc runs in about 84% of the time taken with the memory mapped file above:
static uintmax_t wc(char const *fname)
{
static const auto BUFFER_SIZE = 16*1024;
int fd = open(fname, O_RDONLY);
if(fd == -1)
handle_error("open");
/* Advise the kernel of our access pattern. */
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL); // a literal 1 means POSIX_FADV_RANDOM on Linux
char buf[BUFFER_SIZE + 1];
uintmax_t lines = 0;
while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
{
if(bytes_read == (size_t)-1)
handle_error("read failed");
if (!bytes_read)
break;
for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
++lines;
}
return lines;
}
[1] See e.g. the benchmark here: How to parse space-separated floats in C++ quickly?
4000 * 400,000 = 1.6 GB. If your hard drive isn't an SSD, you're likely getting ~100 MB/s sequential read. That's 16 seconds just in I/O.
Since you don't elaborate on the specific code you're using or how you need to parse these files (do you need to read them line by line? does the system have enough RAM to read a whole file into one large buffer and then parse it?), there's little you can do to speed up the process.
Memory mapped files won't offer any performance improvement when reading a file sequentially. Perhaps manually parsing large chunks for new lines rather than using "getline" would offer an improvement.
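The "parse large chunks for newlines" idea could look roughly like the sketch below. It is only an illustration: the function name count_newlines and the 64 KiB default chunk size are assumptions, not measured or tuned values, and it counts '\n' bytes rather than lines, so a final line without a trailing newline is not counted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Counts '\n' bytes by reading fixed-size chunks instead of calling getline.
// The 64 KiB default is an assumption, not a tuned value.
std::size_t count_newlines(std::istream& in, std::size_t chunk_size = 64 * 1024) {
    std::vector<char> buf(chunk_size);
    std::size_t lines = 0;
    for (;;) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = in.gcount();  // bytes actually read
        if (got <= 0) break;                      // end of input or error
        lines += static_cast<std::size_t>(
            std::count(buf.data(), buf.data() + got, '\n'));
    }
    return lines;
}
```

For a file, pass a std::ifstream opened with std::ios::binary so that no newline translation happens on platforms that do it.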
EDIT: After doing some learning (thanks @sehe), here's the memory mapped solution I would likely use.
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <errno.h>
int main() {
char* fName = "big.txt";
//
struct stat sb;
long cntr = 0;
int fd, lineLen;
char *data;
char *line;
// map the file
fd = open(fName, O_RDONLY);
fstat(fd, &sb);
//// int pageSize;
//// pageSize = getpagesize();
//// data = mmap((caddr_t)0, pageSize, PROT_READ, MAP_PRIVATE, fd, pageSize);
data = mmap((caddr_t)0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
line = data;
// get lines
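while(cntr < sb.st_size) {
lineLen = 0;
line = data;
// find the end of the line (check the bound before dereferencing)
while(cntr < sb.st_size && *data != '\n') {
data++;
cntr++;
lineLen++;
}
/***** PROCESS LINE *****/
// ... processLine(line, lineLen);
// step over the '\n' so the next iteration starts on the next line
if(cntr < sb.st_size) {
data++;
cntr++;
}
}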
return 0;
}
Neil Kirk, unfortunately I cannot reply to your comment (not enough reputation), but I did a performance test on ifstream and stringstream, and the performance when reading a text file line by line is practically the same.
std::stringstream stream;
std::string line;
while(std::getline(stream, line)) {
}
This takes 1426ms on a 106MB file.
std::ifstream stream;
std::string line;
while(stream.good()) {
getline(stream, line);
}
This takes 1433ms on the same file.
The following code is faster instead:
const int MAX_LENGTH = 524288;
char* line = new char[MAX_LENGTH];
while (iStream.getline(line, MAX_LENGTH) && strlen(line) > 0) {
}
This takes 884ms on the same file.
It is just a little tricky since you have to set the maximum size of your buffer (i.e. maximum length for each line in the input file).
As someone with a little background in competitive programming, I can tell you: At least for simple things like integer parsing the main cost in C is locking the file streams (which is by default done for multi-threading). Use the unlocked_stdio versions instead (fgetc_unlocked(), fread_unlocked()). For C++, the common lore is to use std::ios::sync_with_stdio(false) but I don't know if it's as fast as unlocked_stdio.
For reference here is my standard integer parsing code. It's a lot faster than scanf, as I said mainly due to not locking the stream. For me it was as fast as the best hand-coded mmap or custom buffered versions I'd used previously, without the insane maintenance debt.
int readint(void)
{
int n, c;
n = getchar_unlocked() - '0';
while ((c = getchar_unlocked()) > ' ')
n = 10*n + c-'0';
return n;
}
(Note: This one only works if there is precisely one non-digit character between any two integers).
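If the input is less regular than that, a slightly more tolerant variant is possible. The following is a sketch only: read_int is a hypothetical name, it takes a FILE* instead of reading stdin, it relies on POSIX getc_unlocked, and it does no overflow or malformed-input checking.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical variant of readint(): skips any run of separator characters
// and accepts an optional leading '-'. Returns false at end of input.
// Uses POSIX getc_unlocked(), so `in` must not be shared between threads.
// No overflow or malformed-input checks are performed.
bool read_int(FILE* in, int& out) {
    int c = getc_unlocked(in);
    while (c != EOF && c != '-' && (c < '0' || c > '9'))
        c = getc_unlocked(in);          // skip separators
    if (c == EOF) return false;
    bool negative = false;
    if (c == '-') {
        negative = true;
        c = getc_unlocked(in);
    }
    int n = 0;
    while (c >= '0' && c <= '9') {
        n = 10 * n + (c - '0');
        c = getc_unlocked(in);          // consumes one char past the number
    }
    out = negative ? -n : n;
    return true;
}
```

Calling it as read_int(stdin, value) in a loop reads integers until the input ends.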
And of course avoid memory allocation if possible...
Do you have to read all files at the same time? (at the start of your application for example)
If you do, consider parallelizing the operation.
Either way, consider using binary streams, or unbuffered reads for blocks of data.
Use random file access or binary mode. For sequential reading this is big, but it still depends on what you are reading.
The following code gets the size of a file and then allocates a buffer large enough to hold the file's entire content. I allocated it on the heap because I couldn't know in advance whether the file is huge or not.
#include <iostream>
#include <string>
#include <cstdio>
#include <cstdlib>
#include <cstring> // for memset
size_t filesize(FILE* f) {
size_t size;
fseek(f, 0L, SEEK_END);
size = ftell(f);
fseek(f, 0L, SEEK_SET);
return size;
}
char* read_file(std::string name) {
FILE* f;
fopen_s(&f, name.c_str(), "rb");
size_t size = filesize(f);
char* buffer = new char[size+1];
memset(buffer, 0, size+1);
fread(buffer, sizeof(char), size+1, f);
fclose(f);
return buffer; //this is the buffer with the content to send
}
int main() {
char* buffer = read_file("main.cpp");
printf("%s", buffer);
delete[] buffer;
buffer = nullptr;
getchar();
return 0;
}
My question is, have I successfully deleted
char* buffer = new char[size+1];
from the heap by doing this:
char* buffer = read_file("main.cpp");
delete[] buffer;
buffer = nullptr;
Or does it still remain somewhere?
And if it does, how would I pin-point it and delete it?
Any other tips on how to handle raw pointers are appreciated as well.
Yes, your code is correctly deleting the buffer.
C++ has various ways to handle this for you, though, so you don't need to worry about it and are less likely to forget to free the buffer in some or all code paths. For example, it's easy to make a mistake like this:
int main()
{
char* buffer = read_file("main.cpp");
if ( buffer[0] != 'A' )
{
std::cout << "data is invalid\n";
return 1; // oops forgot to free buffer
}
delete[] buffer;
// data is valid
return 0;
}
One option is to use std::unique_ptr which will free the buffer for you when it goes out of scope:
#include <memory>
#include <string>
#include <iostream>
std::unique_ptr<char[]> read_file(std::string name) {
....
std::unique_ptr<char[]> buffer(new char[size+1]);
....
return buffer;
}
int main()
{
std::unique_ptr<char[]> buffer = read_file("main.cpp");
if ( buffer[0] != 'A' )
{
std::cout << "data is invalid\n";
return 1; // buffer is freed automatically
}
buffer.reset(); // can manually free if we are finished with buffer before it goes out of scope
// data is valid
return 0;
}
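If a raw char* is not strictly required, another option is to let a standard container own the memory. This is a minimal sketch, assuming a std::string result is acceptable to the caller; this read_file is a rewritten, hypothetical version of the one in the question, not a drop-in replacement.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Reads a whole file into a std::string; returns an empty string if the file
// cannot be opened. The string owns the memory, so no delete[] is needed.
std::string read_file(const std::string& name) {
    std::ifstream in(name, std::ios::binary);
    if (!in) return std::string();
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The content is then available through data() and size(), and it is released on every return path without any explicit cleanup.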
I'm using this to decompress a GZIP compressed file "input.gz" into the uncompressed "output.file". It works wonderfully, except I need a fixed size for the buffer (in this case 1MB) and if the output becomes larger the bytes get cut off. Is there a way to get this to work with any output size?
#include "zlib.h"
#include <stdio.h>
int main()
{
char buf[1024*1024];
gzFile in = gzopen("input.gz","rb8");
int len = gzread(in,buf,sizeof(buf));
gzclose(in);
FILE* out = fopen("output.file", "wb");
fwrite(buf,1,len,out);
fclose(out);
free(buf);
return 0;
}
gzread works the same way as fread. Consecutive calls to gzread just read more data from that file. I haven't tested the code, but this should work fine.
#include "zlib.h"
#include <stdio.h>
int main() {
char buf[1024];
gzFile in = gzopen("input.gz","rb8");
FILE* out = fopen("output.file", "wb");
int len;
while ((len = gzread(in, buf, sizeof(buf))) > 0) /* gzread returns -1 on error, 0 at end of file */
fwrite(buf, 1, len, out);
gzclose(in);
fclose(out);
return 0;
}
I'm trying to read a large dataset, format it the way I need, and then write it to another file. I'm trying to use C++ over SAS or STATA for the speed advantage. The data files are usually around 10 gigabytes, and my current code takes over an hour to run (and then I kill it because I'm sure that something is very inefficient with my code).
Is there a more efficient way to do this? Maybe read the file into memory and then analyze it using the switch statements? (I have 32gb ram linux 64bit). Is it possible that reading, and then writing within the loop slows it down since it is constantly reading, then writing? I tried to read it from one drive, and then write to another in an attempt to speed this up.
Are the switch cases slowing it down?
The process I have now reads the data using getline, uses the switch statement to parse it correctly, and then writes it to my outfile. And repeats for 300 million lines. There are about 10 more cases in the switch statement, but I didn't copy for brevity's sake.
The code is probably very ugly all being in the main function, but I wanted to get it working before I worked on attractiveness.
I've tried using read() but without any success. Please let me know if I need to clarify anything.
Thank you for the help!
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <stdio.h>
//#include <cstring>
//#include <boost/algorithm/string.hpp>
#include <vector>
using namespace std;
//using namespace boost;
struct dataline
{
char type[0];
double second;
short mill;
char event[1];
char ticker[6];
char marketCategory[1];
char financialStatus[1];
int roundLotSize;
short roundLotOnly;
char tradingState[1];
char reserved[1];
char reason[4];
char mpid[4];
char primaryMarketMaker[1];
char primaryMarketMode[1];
char marketParticipantState[1];
unsigned long orderNumber;
char buySell[0];
double shares;
float price;
int executedShares;
double matchNumber;
char printable[1];
double executionPrice;
int canceledShares;
double sharesBig;
double crossPrice;
char crossType[0];
double pairedShares;
double imbalanceShares;
char imbalanceDirection[1];
double fairPrice;
double nearPrice;
double currentReferencePrice;
char priceVariationIndicator[1];
};
int main ()
{
string a;
string b;
string c;
string d;
string e;
string f;
string g;
string h;
string k;
string l;
string times;
string smalltimes;
short time; //counter to keep second filled
short smalltime; //counter to keep millisecond filled
double N;
double NN;
double NNN;
int length;
char M;
//vector<> fout;
string line;
ofstream fout ("/media/3tb/test.txt");
ifstream myfile;
myfile.open("S050508-v3.txt");
dataline oneline;
if (myfile.is_open())
{
while ( myfile.good() )
{
getline (myfile,line);
// cout << line<<endl;;
a=line.substr(0,1);
stringstream ss(a);
char type;
ss>>type;
switch (type)
{
case 'T':
{
if (type == 'T')
{
times=line.substr(1,5);
stringstream s(times);
s>>time;
//oneline.second=time;
//oneline.second;
//cout<<time<<endl;
}
else
{
time=time;
}
break;
}
case 'M':
{
if (type == 'M')
{
smalltimes=line.substr(1,3);
stringstream ss(smalltimes);
ss>>smalltime; //oneline.mill;
// cout<<smalltime<<endl; //smalltime=oneline.mill;
}
else
{
smalltime=smalltime;
}
break;
}
case 'R':
{
oneline.second=time;
oneline.mill=smalltime;
a=line.substr(0,1);
stringstream ss(a);
ss>>oneline.type;
b=line.substr(1,6);
stringstream sss(b);
sss>>oneline.ticker;
c=line.substr(7,1);
stringstream ssss(c);
ssss>>oneline.marketCategory;
d=line.substr(8,1);
stringstream sssss(d);
sssss>>oneline.financialStatus;
e=line.substr(9,6);
stringstream ssssss(e);
ssssss>>oneline.roundLotSize;
f=line.substr(15,1);
stringstream sssssss(f);
sssssss>>oneline.roundLotOnly;
*oneline.tradingState=0;
*oneline.reserved=0;
*oneline.reason=0;
*oneline.mpid=0;
*oneline.primaryMarketMaker=0;
*oneline.primaryMarketMode=0;
*oneline.marketParticipantState=0;
oneline.orderNumber=0;
*oneline.buySell=0;
oneline.shares=0;
oneline.price=0;
oneline.executedShares=0;
oneline.matchNumber=0;
*oneline.printable=0;
oneline.executionPrice=0;
oneline.canceledShares=0;
oneline.sharesBig=0;
oneline.crossPrice=0;
*oneline.crossType=0;
oneline.pairedShares=0;
oneline.imbalanceShares=0;
*oneline.imbalanceDirection=0;
oneline.fairPrice=0;
oneline.nearPrice=0;
oneline.currentReferencePrice=0;
*oneline.priceVariationIndicator=0;
break;
}//End Case
}//End Switch
}//end While
myfile.close();
}//End If
else cout << "Unable to open file";
cout<<"Junk"<<endl;
return 0;
}
UPDATE So I've been trying to use memory map, but now I'm getting a segmentation fault.
I've been trying to follow different examples to piece together something that would work for mine. Why would I be getting a segmentation fault? I've taken the first part of my code, which looks like this:
int main (int argc, char** path)
{
long i;
int fd;
char *map;
char *FILEPATH = path;
unsigned long FILESIZE;
FILE* fp = fopen(FILEPATH, "/home/brian/Desktop/S050508-v3.txt");
fseek(fp, 0, SEEK_END);
FILESIZE = ftell(fp);
fseek(fp, 0, SEEK_SET);
fclose(fp);
fd = open(FILEPATH, O_RDONLY);
map = (char *) mmap(0, FILESIZE, PROT_READ, MAP_SHARED, fd, 0);
char z;
stringstream ss;
for (long i = 0; i <= FILESIZE; ++i)
{
z = map[i];
if (z != '\n')
{
ss << z;
}
else
{
// c style tokenizing
ss.str("");
}
}
if (munmap(map, FILESIZE) == -1) perror("Error un-mmapping the file");
close(fd);
The data file are usually around 10gigabytes.
...
Are the switch cases slowing it down?
Almost certainly not, smells like you're I/O bound. But you should consider measuring it. Modern CPUs have performance counters which are pretty easy to leverage with the right tools. But let's start to partition the problems into some major domains: I/O to devices, load/store to memory, CPU. You can place some markers in your code where you read a clock in order to understand how long each of the operations are taking. On linux you can use clock_gettime() or the rdtsc instruction to access a clock with higher precision than the OS tick.
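The timing markers could be placed roughly like this. It is a sketch under assumptions: the PhaseTimes struct and the timed helper are invented for this example, and only clock_gettime with CLOCK_MONOTONIC comes from the advice above.

```cpp
#include <cassert>
#include <ctime>
#include <time.h>

// Returns seconds from the monotonic clock via POSIX clock_gettime().
double now_seconds() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return static_cast<double>(ts.tv_sec) + ts.tv_nsec * 1e-9;
}

// Hypothetical accumulator for the time spent in each phase of the loop.
struct PhaseTimes {
    double read = 0.0;
    double parse = 0.0;
    double write = 0.0;
};

// Runs `work` and adds its elapsed wall-clock time to `bucket`.
template <typename Work>
void timed(double& bucket, Work work) {
    const double start = now_seconds();
    work();
    bucket += now_seconds() - start;
}
```

Wrapping the getline call, the switch statement and the output in separate timed calls and printing the three totals at the end shows which phase dominates. Timing every single line adds overhead of its own, so timing batches of lines may give a cleaner picture.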
Consider mmap/CreateFileMapping, either of which might provide better efficiency/throughput to the pages you're accessing.
Consider large/huge pages if streaming through large amounts of data which has already been paged in.
From the manual for mmap():
Description
mmap() creates a new mapping in the virtual address space of the
calling process. The starting address for the new mapping is specified
in addr. The length argument specifies the length of the mapping.
Here's an mmap() example:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#define FILEPATH "/tmp/mmapped.bin"
#define NUMINTS (1000)
#define FILESIZE (NUMINTS * sizeof(int))
int main(int argc, char *argv[])
{
int i;
int fd;
int *map; /* mmapped array of int's */
fd = open(FILEPATH, O_RDONLY);
if (fd == -1) {
perror("Error opening file for reading");
exit(EXIT_FAILURE);
}
map = mmap(0, FILESIZE, PROT_READ, MAP_SHARED, fd, 0);
if (map == MAP_FAILED) {
close(fd);
perror("Error mmapping the file");
exit(EXIT_FAILURE);
}
/* Read the file int-by-int from the mmap
*/
for (i = 0; i < NUMINTS; ++i) { /* valid indices are 0 .. NUMINTS-1 */
printf("%d: %d\n", i, map[i]);
}
if (munmap(map, FILESIZE) == -1) {
perror("Error un-mmapping the file");
}
close(fd);
return 0;
}
I was wondering if there is a way to output the hexdump or raw data of a file to a txt file.
For example:
I have a file, let's say "data.jpg" (the file type is irrelevant). How can I export the hexdump (14ed 5602 etc.) to a file "output.txt"?
Also, how can I specify the format of the output, for example Unicode or UTF?
In C++.
You can use a loop, fread and fprintf: with fread you get the byte values, then with fprintf you can use %x to print them in hexadecimal to a file.
http://www.cplusplus.com/reference/clibrary/cstdio/fread/
http://www.cplusplus.com/reference/clibrary/cstdio/fprintf/
If you want this to be fast, load whole machine words (int or long long) instead of single bytes. If you want it to be even faster, fread a whole array, sprintf the whole array into a text buffer, then fprintf that buffer to the file.
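The "format a whole array, then write once" step might be sketched like this. Note that it swaps sprintf for a simple lookup table, which is a different technique from the one named above, and the function name to_hex and the space-separated layout are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Formats `len` bytes as lowercase hex pairs separated by single spaces.
// The text is built in memory so it can be written with one call.
// Uses a lookup table instead of sprintf (a swapped-in technique).
std::string to_hex(const unsigned char* data, std::size_t len) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(len * 3);
    for (std::size_t i = 0; i < len; ++i) {
        if (i != 0) out.push_back(' ');
        out.push_back(digits[data[i] >> 4]);    // high nibble
        out.push_back(digits[data[i] & 0x0f]);  // low nibble
    }
    return out;
}
```

After fread fills a buffer of n bytes, the result of to_hex(buffer, n) can be written to the output file with a single fwrite or stream insertion.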
Maybe something like this?
#include <sstream>
#include <iostream>
#include <iomanip>
#include <iterator>
#include <algorithm>
int main()
{
std::stringstream buffer( "testxzy" );
std::istreambuf_iterator<char> it( buffer.rdbuf( ) );
std::istreambuf_iterator<char> end; // eof
std::cout << std::hex << std::showbase;
std::copy(it, end, std::ostream_iterator<int>(std::cout));
std::cout << std::endl;
return 0;
}
You just have to replace buffer with an ifstream that reads the binary file, and write the output to a textfile using an ofstream instead of cout.
This is pretty old -- if you want Unicode, you'll have to add that yourself.
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h> /* for isprint */
int main(int argc, char **argv) {
unsigned long offset = 0;
FILE *input;
int bytes, i, j;
unsigned char buffer[16];
char outbuffer[60];
if ( argc < 2 ) {
fprintf(stderr, "\nUsage: dump filename [filename...]");
return EXIT_FAILURE;
}
for (j=1;j<argc; ++j) {
if ( NULL ==(input=fopen(argv[j], "rb")))
continue;
printf("\n%s:\n", argv[j]);
while (0 < (bytes=fread(buffer, 1, 16, input))) {
sprintf(outbuffer, "%8.8lx: ", offset); /* label each row with its starting offset */
offset += bytes;
for (i=0;i<bytes;i++) {
sprintf(outbuffer+10+3*i, "%2.2X ",buffer[i]);
if (!isprint(buffer[i]))
buffer[i] = '.';
}
printf("%-60s %*.*s\n", outbuffer, bytes, bytes, buffer);
}
fclose(input);
}
return 0;
}