Read file to memory, loop through data, then write file [duplicate] - c++

This question already has answers here:
How to read line by line after i read a text into a buffer?
(4 answers)
Closed 10 years ago.
I'm trying to ask a similar question to this post:
C: read binary file to memory, alter buffer, write buffer to file
but the answers didn't help me (I'm new to c++ so I couldn't understand all of it)
How do I have a loop access the data in memory, and go through line by line so that I can write it to a file in a different format?
This is what I have:
#include <fstream>
#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdlib.h>
using namespace std;
int main()
{
char* buffer;
char linearray[250];
int lineposition;
double filesize;
string linedata;
string a;
//obtain the file
FILE *inputfile;
inputfile = fopen("S050508-v3.txt", "r");
//find the filesize
fseek(inputfile, 0, SEEK_END);
filesize = ftell(inputfile);
rewind(inputfile);
//load the file into memory
buffer = (char*) malloc (sizeof(char)*filesize); //allocate mem
fread (buffer,filesize,1,inputfile); //read the file to the memory
fclose(inputfile);
//Check to see if file is correct in Memory
cout.write(buffer,filesize);
free(buffer);
}
I appreciate any help!
Edit (More info on the data):
My data is different files that vary between 5 and 10gb. There are about 300 million lines of data. Each line looks like
M359
T359 3520 359
M400
A3592 zng 392
Where the first element is a character, and the remaining items could be numbers or characters. I'm trying to read this into memory since it will be a lot faster to loop through line by line, than reading a line, processing, and then writing. I am compiling in 64bit linux. Let me know if I need to clarify further. Again thank you.
Edit 2
I am using a switch statement to process each line, where the first character of each line determines how to format the rest of the line. For example 'M' means millisecond, and I put the next three numbers into a structure. Each line has a different first character that I need to do something different for.

So pardon the potentially blatantly obvious, but if you want to process this line by line, then...
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main(int argc, char *argv[])
{
// read lines one at a time
ifstream inf("S050508-v3.txt");
string line;
while (getline(inf, line))
{
// ... process line ...
}
inf.close();
return 0;
}
And just fill in the body of the while loop? Maybe I'm not seeing the real problem (a forest for the trees kinda thing).
EDIT
The OP is inline with using a custom streambuf which may not necessarily be the most portable thing in the world, but he's more interested in avoiding flipping back and forh between input and output files. With enough RAM, this should do the trick.
#include <iostream>
#include <fstream>
#include <iterator>
#include <memory>
using namespace std;
struct membuf : public std::streambuf
{
membuf(size_t len)
: streambuf()
, len(len)
, src(new char[ len ] )
{
setg(src.get(), src.get(), src.get() + len);
}
// direct buffer access for file load.
char * get() { return src.get(); };
size_t size() const { return len; };
private:
std::unique_ptr<char> src;
size_t len;
};
int main(int argc, char *argv[])
{
// open file in binary, retrieve length-by-end-seek
ifstream inf(argv[1], ios::in|ios::binary);
inf.seekg(0,inf.end);
size_t len = inf.tellg();
inf.seekg(0, inf.beg);
// allocate a steam buffer with an internal block
// large enough to hold the entire file.
membuf mb(len+1);
// use our membuf buffer for our file read-op.
inf.read(mb.get(), len);
mb.get()[len] = 0;
// use iss for your nefarious purposes
std::istream iss(&mb);
std::string s;
while (iss >> s)
cout << s << endl;
return EXIT_SUCCESS;
}

You should look into fgets and scanf, in which you can pull out matched pieces of data so it is easier to manipulate, assuming that is what you want to do. Something like this could look like:
FILE *input = fopen("file.txt", "r");
FILE *output = fopen("out.txt","w");
int bufferSize = 64;
char buffer[bufferSize];
while(fgets(buffer,bufferSize,input) != EOF){
char data[16];
sscanf(buffer,"regex",data);
//manipulate data
fprintf(output,"%s",data);
}
fclose(output);
fclose(input);
That would be more of the C way to do it, C++ handles things a little more eloquently by using an istream:
http://www.cplusplus.com/reference/istream/istream/

If I had to do this, I'd probably use code something like this:
std::ifstream in("S050508-v3.txt");
std::istringstream buffer;
buffer << in.rdbuf();
std::string data = buffer.str();
if (check_for_good_data(data))
std::cout << data;
This assumes you really need the entire contents of the input file in memory at once to determine whether it should be copied to output or not. If (for example) you can look at the data one byte at a time, and determine whether that byte should be copied without looking at the others, you could do something more like:
std::ifstream in(...);
std::copy_if(std::istreambuf_iterator<char>(in),
std::istreambuf_iterator<char>(),
std::ostream_iterator<char>(std::cout, ""),
is_good_char);
...where is_good_char is a function that returns a bool saying whether that char should be included in the output or not.
Edit: the size of files you're dealing with mostly rules out the first possibility I've given above. You're also correct that reading and writing large chunks of data will almost certainly improve speed over working on one line at a time.

Related

Reading multiple bytes from file and storing them for comparison in C++

I want to binary read a photo in 1460 bytes increments and compare consecutive packets for corrupted transmission. I have a python script that i wrote and want to translate in C++, however I'm not sure that what I intend to use is correct.
for i in range(0, fileSize-1):
buff=f.read(1460) // buff stores a packet of 1460 bytes where f is the opened file
secondPacket=''
for j in buff:
secondPacket+="{:02x}".format(j)
if(secondPacket==firstPacket):
print(f'Packet {i+1} identical with {i}')
firstPacket=secondPacket
I have found int fseek ( FILE * stream, long int offset, int origin ); but it's unclear if it reads the first byte that is located offset away from origin or everything in between.
Thanks for clarifications.
#include <iostream>
#include <fstream>
#include <array>
std::array<char, 1460> firstPacket;
std::array<char, 1460> secondPacket;
int i=0;
int main() {
std::ifstream file;
file.open("photo.jpg", std::ios::binary);
while (file.read(firstPacket.data(), firstPacket.size())){
++i;
if (firstPacket==secondPacket)
std::cout<<"Packet "<<i<<" is a copy of packet "<<i-1<<std::endl;
memcpy(&secondPacket, &firstPacket, firstPacket.size());
}
std::cout<<i; //tested to check if i iterate correctly
return 0;
}
This is the code i have so far which doesn't work.
fseek
doesn't read, it just moves the point where the next read operation should begin. If you read the file from start to end you don't need this.
To read binary data you want the aptly named std::istream::read. You can use it like this wih a fixed size buffer:
// char is one byte, could also be uint8_t, but then you would need a cast later on
std::array<char, 1460> bytes;
while(myInputStream.read(bytes.data(), bytes.size())) {
// do work with the read data
}

how can I write pixel color data to a bmp image file with stb_image?

I've already opened the bmp file( one channel grayscale) and stored each pixel color in a new line as hex.
after some doing processes on the data (not the point of this question), I need to export a bmp image from my data.
how can I load the textfile(data) and use stb_image_write?
pixel to image :
#include <cstdio>
#include <cstdlib>
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"
using namespace std;
int main() {
FILE* datafile ;
datafile = fopen("pixeldata.x" , "w");
unsigned char* pixeldata ;//???
char Image2[14] = "image_out.bmp";
stbi_write_bmp(Image2, 512, 512, 1, pixeldata);
image to pixel:
#include <cstdio>
#include <cstdlib>
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
using namespace std;
const size_t total_pixel = 512*512;
int main() {
FILE* datafile ;
datafile = fopen("pixeldata.x" , "w");
char Image[10] = "image.bmp";
int witdth;
int height;
int channels;
unsigned char *pixeldata = stbi_load( (Image) , &witdth, &height, &channels, 1);
if(pixeldata != NULL){
for(int i=0; i<total_pixel; i++)
{
fprintf(datafile,"%x%s", pixeldata[i],"\n");
}
}
}
There are a lot of weaknesses in the question – too much to sort this out in comments...
This question is tagged C++. Why the error-prone fprintf()? Why not std::fstream? It has similar capabilities (if not even more) but adds type-safety (which printf() family cannot provide).
The counter-part of fprintf() is fscanf(). The formatters are similar but the storage type has to be configured in formatters even more carefully than in fprintf().
If the first code sample is the attempt to read pixels back from datafile.x... Why datafile = fopen("pixeldata.x" , "w");? To open a file with fopen() for reading, it should be "r".
char Image2[14] = "image_out.bmp"; is correct (if I counted correctly) but maintenance-unfriendly. Let the compiler do the work for you:
char Image2[] = "image_out.bmp";
To provide storage for pixel data with (in OPs case) fixed size of 512 × 512 bytes, the simplest would be:
unsigned char pixeldata[512 * 512];
Storing an array of that size (512 × 512 = 262144 Bytes = 256 KByte) in a local variable might be seen as potential issue by certain people. The alternative would be to use a std::vector<unsigned char> pixeldata; instead. (std::vector allocates storage dynamically in heap memory where local variables usually on a kind of stack memory which in turn is usually of limited size.)
Concerning the std::vector<unsigned char> pixeldata;, I see two options:
definition with pre-allocation:
std::vector<unsigned char> pixeldata(512 * 512);
so that it can be used just like the array above.
definition without pre-allocation:
std::vector<unsigned char> pixeldata;
That would allow to add every read pixel just to the end with std::vector::push_back().
May be, it's worth to reserve the final size beforehand as it's known from beginning:
std::vector<unsigned char> pixeldata;
pixeldata.reserve(512 * 512); // size reserved but not yet used
So, this is how it could look finally:
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <vector>
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"
int main()
{
const int w = 512, h = 512;
// read data
FILE *datafile = fopen("pixeldata.x" , "r");
if (!datafile) { // success of file open should be tested ALWAYS
std::cerr << "Cannot open 'pixeldata.x'!\n";
return -1; // ERROR! (bail out)
}
typedef unsigned char uchar; // for convenience
std::vector<uchar> pixeldata(w * h);
char Image2[] = "image_out.bmp";
for (int i = 0, n = w * h; i < n; ++i) {
if (fscanf(datafile, "%hhx", &pixeldata[i]) < 1) {
std::cerr << "Failed to read value " << i << of 'pixeldata.x'!\n";
return -1; // ERROR! (bail out)
}
}
fclose(datafile);
// write BMP image
stbi_write_bmp(Image2, w, h, 1, pixeldata.data());
// Actually, success of this should be tested as well.
// done
return 0;
}
Some additional notes:
Please, take this code with a grain of salt. I haven't compiled or tested it. (I leave this as task to OP but will react on "bug reports".)
I silently removed using namespace std;: SO: Why is “using namespace std” considered bad practice?
I added checking of success of file operations. File operations are something which are always good for failing for a lot of reasons. For file writing, even the fclose() should be tested. Written data might be cached until file is closed and just writing the cached data to file might fail (because just this might overflow the available volume space).
OP used magic numbers (image width and size) which is considered as bad practice. It makes code maintenance-unfriendly and might be harder to understand for other readers: SO: What is a magic number, and why is it bad?

C++ reading large files part by part

I've been having a problem that I not been able to solve as of yet. This problem is related to reading files, I've looked at threads even on this website and they do not seem to solve the problem. That problem is reading files that are larger than a computers system memory. Simply when I asked this question a while ago I was referred too using the following code.
string data("");
getline(cin,data);
std::ifstream is (data);//, std::ifstream::binary);
if (is)
{
// get length of file:
is.seekg (0, is.end);
int length = is.tellg();
is.seekg (0, is.beg);
// allocate memory:
char * buffer = new char [length];
// read data as a block:
is.read (buffer,length);
is.close();
// print content:
std::cout.write (buffer,length);
delete[] buffer;
}
system("pause");
This code works well apart from the fact that it eats memory like fat kid in a candy store.
So after a lot of ghetto and unrefined programing, I was able to figure out a way to sort of fix the problem. However I more or less traded one problem for another in my quest.
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <stdio.h>
#include <stdlib.h>
#include <iomanip>
#include <windows.h>
#include <cstdlib>
#include <thread>
using namespace std;
/*======================================================*/
string *fileName = new string("tldr");
char data[36];
int filePos(0); // The pos of the file
int tmSize(0); // The total size of the file
int split(32);
char buff;
int DNum(0);
/*======================================================*/
int getFileSize(std::string filename) // path to file
{
FILE *p_file = NULL;
p_file = fopen(filename.c_str(),"rb");
fseek(p_file,0,SEEK_END);
int size = ftell(p_file);
fclose(p_file);
return size;
}
void fs()
{
tmSize = getFileSize(*fileName);
int AX(0);
ifstream fileIn;
fileIn.open(*fileName, ios::in | ios::binary);
int n1,n2,n3;
n1 = tmSize / 32;
// Does the processing
while(filePos != tmSize)
{
fileIn.seekg(filePos,ios_base::beg);
buff = fileIn.get();
// To take into account small files
if(tmSize < 32)
{
int Count(0);
char MT[40];
if(Count != tmSize)
{
MT[Count] = buff;
cout << MT[Count];// << endl;
Count++;
}
}
// Anything larger than 32
else
{
if(AX != split)
{
data[AX] = buff;
AX++;
if(AX == split)
{
AX = 0;
}
}
}
filePos++;
}
int tz(0);
filePos = filePos - 12;
while(tz != 2)
{
fileIn.seekg(filePos,ios_base::beg);
buff = fileIn.get();
data[tz] = buff;
tz++;
filePos++;
}
fileIn.close();
}
void main ()
{
fs();
cout << tmSize << endl;
system("pause");
}
What I tried to do with this code is too work around the memory issue. Rather than allocating memory for a large file that simply does not exist on a my system, I tried to use the memory I had instead which is about 8gb, but I only wanted to use maybe a few Kilobytes of it if at all possible.
To give you a layout of what I am talking about I am going to write a line of text.
"Hello my name is cake please give me cake"
Basically what I did was read said piece of text letter by letter. Then I put those letters into a box that could store 32 of them, from there I could use something like xor and then write them onto another file.
The idea in a way works but it is horribly slow and leaves off parts of files.
So basically how can I make something like this work without going slow or cutting off files. I would love to see how xor works with very large files.
So if anyone has a better idea than what I have, then I would be very grateful for the help.
To read and process the file piece-by-piece, you can use the following snippet:
// Buffer size 1 Megabyte (or any number you like)
size_t buffer_size = 1<<20;
char *buffer = new char[buffer_size];
std::ifstream fin("input.dat");
while (fin)
{
// Try to read next chunk of data
fin.read(buffer, buffer_size);
// Get the number of bytes actually read
size_t count = fin.gcount();
// If nothing has been read, break
if (!count)
break;
// Do whatever you need with first count bytes in the buffer
// ...
}
delete[] buffer;
The buffer size of 32 bytes, as you are using, is definitely too small. You make too many calls to library functions (and the library, in turn, makes calls (although probably not every time) to OS, which are typically slow, since they cause context-switching). There is also no need of tell/seek.
If you don't need all the file content simultaneously, reduce the working set first - like a set of about 32 words, but since XOR can be applied sequentially, you may further simplify the working set with constant size, like 4 kilo-bytes.
Now, you have the option to use file reader is.read() in a loop and process a small set of data each iteration, or use memmap() to map the file content as memory pointer which you can perform both read and write operations.

Buffer size for reading a UTF-8-encoded file using ICU (ICU4C)

I am trying to read a UTF-8-encoded file using ICU4C on Windows with msvc11. I need to determine the size of the buffer to build a UnicodeString. Since there is no fseek-like function in the ICU4C API I thought I could use an underlying C-file:
#include <unicode/ustdio.h>
#include <stdio.h>
/*...*/
UFILE *in = u_fopen("utfICUfseek.txt", "r", NULL, "UTF-8");
FILE* inFile = u_fgetfile(in);
fseek(inFile, 0, SEEK_END); /* Access violation here */
int size = ftell(inFile);
auto uChArr = new UChar[size];
There are two problems with this code:
It "throws" access violation at the fseek() line for some reason (Unhandled exception at 0x000007FC5451AB00 (ntdll.dll) in test.exe: 0xC0000005: Access violation writing location 0x0000000000000024.)
The size returned by the ftell function will not be the size I want because UTF-8 can use up to 4 bytes for a code point (a u8"tю" string will be of length 3).
So the questions are:
How do I determine a buffer size for a UnicodeString if I know that the input file is UTF-8-encoded?
Is there a portable way to use iostream/fstream for both reading and writing ICU's UnicodeStrings?
Edit:
Here is the possible solution (tested on msvc11 and gcc 4.8.1) based on the first answer and C++11 Standard. A few things from ISO IEC 14882 2011:
"The fundamental storage unit in the C++ memory model is the byte. A
byte is at least large enough to contain any member of the basic
execution character set (2.3) and the eight-bit code units of the
Unicode UTF-8 encoding form..."
"The basic source character set consists of 96 characters...", - 7 bits needed already
"The basic execution character set and the basic execution
wide-character set shall each contain all the members of the basic
source character set..."
"Objects declared as characters (char) shall be large enough to
store any member of the implementation’s basic character set."
So, to make this portable for platforms where the implementation defined size of char is 1 byte = 8 bits (don't know where this isn't true) we can read Unicode characters into chars using unformatted input operation:
std::ifstream is;
is.open("utfICUfSeek.txt");
is.seekg(0, is.end);
int strSize = is.tellg();
auto inputCStr = new char[strSize + 1];
inputCStr[strSize] = '\0'; //add null-character at the end
is.seekg(0, is.beg);
is.read(inputCStr, strSize);
is.seekg(0, is.beg);
UnicodeString uStr = UnicodeString::fromUTF8(inputCStr);
is.close();
What troubles me is that I have to create an additional buffer for chars and only then convert them to the required UnicodeString.
This is an alternative to using ICU.
Using the standard std::fstream you can read the whole/ part of the file into a standard std::string then iterate over that with a unicode aware iterator. http://code.google.com/p/utf-iter/
std::string get_file_contents(const char *filename)
{
std::ifstream in(filename, std::ios::in | std::ios::binary);
if (in)
{
std::string contents;
in.seekg(0, std::ios::end);
contents.reserve(in.tellg());
in.seekg(0, std::ios::beg);
contents.assign((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
in.close();
return(contents);
}
throw(errno);
}
Then in your code
std::string myString = get_file_contents( "foobar" );
unicode::iterator< std::string, unicode::utf8 /* or utf16/32 */ > iter = myString.begin();
while ( iter != myString.end() )
{
...
++iter;
}
Well, either you want to read in the whole file at once for some kind of postprocessing, in which case icu::UnicodeString is not really the best container...
#include <iostream>
#include <fstream>
#include <sstream>
int main()
{
std::ifstream in( "utfICUfSeek.txt" );
std::stringstream buffer;
buffer << in.rdbuf();
in.close();
// ...
return 0;
}
...or what you really want is to read into icu::UnicodeString just like into any other string object but went the long way around...
#include <iostream>
#include <fstream>
#include <unicode/unistr.h>
#include <unicode/ustream.h>
int main()
{
std::ifstream in( "utfICUfSeek.txt" );
icu::UnicodeString uStr;
in >> uStr;
// ...
in.close();
return 0;
}
...or I am completely missing what your problem really is about. ;)

Modifying binary files

I'm trying to write a small program which will search a binary file for a few bytes and replace these with another bunch of bytes. But everytime I try running this small app I got message about istream_iterator is not dereferenceable.
Maybe someone have a suggestion how to do this in another way (iterators are a little bit a new subject for me).
#include <fstream>
#include <iterator>
#include <algorithm>
using namespace std;
int main() {
typedef istream_iterator<char> input_iter_t;
const off_t SIZE = 4;
char before[SIZE] = { 0x12, 0x34, 0x56, 0x78 };
char after[SIZE] = { 0x78, 0x12, 0x34, 0x65 };
fstream filestream("numbers.exe", ios::binary | ios::in | ios::out);
if (search(input_iter_t(filestream), input_iter_t(), before, before + SIZE) != input_iter_t()) {
filestream.seekp(-SIZE, ios::cur);
filestream.write(after, SIZE);
}
return 0;
}
This is my second attempt to do this but also something is wrong. With small files looks like works OK but with bigger (around 2MB) it works very slowly and never find pattern what I'm looking for.
#include <iostream>
#include <cstdlib>
#include <string>
#include <fstream>
#include <iterator>
#include <vector>
#include <algorithm>
#include <windows.h>
using namespace std;
int main() {
const off_t Size = 4;
unsigned char before[Size] = { 0x12, 0x34, 0x56, 0x78 };
unsigned char after[Size] = { 0x90, 0xAB, 0xCD, 0xEF };
vector<char> bytes;
{
ifstream iFilestream( "numbers.exe", ios::in|ios::binary );
istream_iterator<char> begin(iFilestream), end;
bytes.assign( begin, end ) ;
}
vector<char>::iterator found = search( bytes.begin(), bytes.end(), before, before + Size );
if( found != bytes.end() )
{
copy( after, after + Size, found );
{
ofstream oFilestream( "number-modified.exe" );
copy( bytes.begin(), bytes.end(), ostream_iterator<unsigned char>(oFilestream) );
}
}
return 0;
}
Cheers,
Thomas
Read a larger part of the file to memory, replace it in memory and then dump the bunch to the disk. Reading one byte at a time is very slow.
I also suggest you read about mmap (or MapViewOfFile in win32).
search won't work on an istream_iterator because of the nature of the iterator. It's an input iterator, which means it simply moves forward - this is because it reads from the stream, and once it's read from the stream, it can't go back. search requires a forward iterator, which is an input iterator where you can stop, make a copy, and move one forward while keeping the old one. An example of a forward iterator is a singly-linked list. You can't go backwards, but you can remember where you are and restart from there.
The speed issue is because vector is truly terrible at handling unknown data. Every time it runs out of room, it copies the whole buffer over to new memory. Replace it with a deque, which can handle data arriving one by one. You will also likely get improved performance trying to read from the stream in blocks at a time, as character-by-character access is a pretty bad way to load an entire file into memory.
Assuming the file isn't too large, just read the file into memory, then modify the memory buffer as you see fit, then write it back out to a file.
E.g. (untested):
FILE *f_in = fopen("inputfile","rb");
fseek(f_in,0,SEEK_END);
long size = ftell(f_in);
rewind(f_in);
char* p_buffer = (char*) malloc (size);
fread (p_buffer,size,1,f_in);
fclose(f_in);
unsigned char *p= (unsigned char*)p_buffer;
// process.
FILE *f_out = fopen("outoutfile","wb");
fwrite(p_buffer,size,1,f_out);
fclose(f_out);
free(p_buffer);