Optimize reading binary file data into a buffer in C++

I wrote a small command-line tool that needs to loop over and scan a huge file server.
The logic is really simple, but it takes too much time, and I found that the problem
is reading binary files into a buffer. I want to keep the implementation simple
because it's C++ and others have to understand the code too.
std::ifstream input( foundFile.c_str(), std::ios::binary );
std::vector<unsigned char> buffer(std::istreambuf_iterator<char>(input), {});
In the end I guess I will have to refactor to chunked reading. But in general, why
is this way of reading a binary file so slow?
complete source:
https://gitlab.com/Onnebrink/cltools/-/blob/main/src/dupfind/dupfind.cpp
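The usual explanation for the slowdown: istreambuf_iterator is an input iterator, so the vector cannot know the final size up front; it pulls the file one character at a time and repeatedly grows (and reallocates) the buffer. A minimal sketch of a one-shot alternative, reusing foundFile from the snippet above (error handling omitted; this is not the code from the linked repository):
#include <fstream>
#include <vector>

// Sketch: learn the size first, then read the whole file with one read() call.
std::ifstream input(foundFile.c_str(), std::ios::binary);
input.seekg(0, std::ios::end);
std::vector<unsigned char> buffer(static_cast<std::size_t>(input.tellg()));
input.seekg(0);
input.read(reinterpret_cast<char *>(buffer.data()),
           static_cast<std::streamsize>(buffer.size()));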

Now it's much faster. I guess I have to play a little bit with bufferSize;
perhaps 4096 bytes is too little. But I don't have a good overview of the average sizes of the files it will find, so perhaps I should make the buffer size adaptive, depending on the size of the found file.
unsigned long long calcHash(string &foundFile) {
    const int bufferSize = 4096;
    unsigned long long hashValue = 0xeba29ce484222325ULL;
    unsigned long long magicPrime = 0xad3760fd485d7f11ULL;

    ifstream inFile(foundFile.c_str(), std::ios::binary);
    vector<char> buffer(bufferSize);

    // read() sets eofbit/failbit on the final, short chunk, but gcount()
    // still reports how many bytes that chunk delivered; this condition also
    // avoids looping forever if the file failed to open.
    while (inFile.read(buffer.data(), bufferSize) || inFile.gcount() > 0) {
        for (streamsize i = 0; i < inFile.gcount(); i++)
            hashValue ^= buffer[i], hashValue *= magicPrime;
    }
    return hashValue;
}
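If you do experiment with an adaptive buffer size, here is a minimal sketch (my addition; the 4 KiB floor and 1 MiB cap are arbitrary starting points to benchmark, not recommendations):
#include <algorithm>
#include <fstream>

// Sketch: open at the end to learn the file size, then clamp the chunk size.
std::ifstream inFile(foundFile.c_str(), std::ios::binary | std::ios::ate);
const std::streamsize fileSize = inFile.tellg();
inFile.seekg(0);
const std::size_t bufferSize = static_cast<std::size_t>(
    std::min<std::streamsize>(std::max<std::streamsize>(fileSize, 4096), 1 << 20));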

Related

How to load a JPEG image into a char array in C++?

I want to store a JPEG image into a normal unsigned char array. I used ifstream to store it; however, when I checked whether the stored array was correct (by writing it back out as a JPEG image), the image rewritten from the stored array couldn't be displayed correctly, so I think the problem must be in the technique I used to store the image in the array. I want an array that stores the data exactly, so that I can use it to write the JPEG image back out again. I'd really appreciate it if anyone can help me solve this problem!
int size = 921600;
unsigned char output[size];
int i = 0;
ifstream DataFile;
DataFile.open("abc.jpeg");
while(!DataFile.eof()){
    DataFile >> output[i];
    i++;
}
/* i try to rewrite the above array into a new image here */
FILE * image2;
image2 = fopen("def.jpeg", "w");
fwrite(output,1,921600, image2);
fclose(image2);
There are multiple problems in the shown code.
while(!DataFile.eof()){
This is always a bug. See the linked question for a detailed explanation.
DataFile >> output[i];
The formatted extraction operator, >>, by definition, skips over all white space characters and ignores them. Your jpg file surely has bytes 0x09, 0x20, and a few others, somewhere in it, and this automatically skips over and does not read them.
In order to do this correctly, you need to use read() and gcount() to read your binary file. Using gcount() correctly should also result in your code detecting the end-of-file condition properly.
Make sure to add error checks when opening files. Find the file size and read into the buffer according to the file size.
You might also look into using std::vector<unsigned char> for character storage.
#include <fstream>
#include <vector>

int main()
{
    std::ifstream DataFile("abc.jpeg", std::ios::binary);
    if(!DataFile.good())
        return 0;

    DataFile.seekg(0, std::ios::end);
    std::size_t filesize = static_cast<std::size_t>(DataFile.tellg());
    DataFile.seekg(0);

    std::vector<unsigned char> output(filesize);
    // or a heap array: unsigned char *output = new unsigned char[filesize];

    if(DataFile.read(reinterpret_cast<char *>(output.data()),
                     static_cast<std::streamsize>(filesize)))
    {
        std::ofstream fout("def.jpeg", std::ios::binary);
        if(!fout.good())
            return 0;
        fout.write(reinterpret_cast<char *>(output.data()),
                   static_cast<std::streamsize>(filesize));
    }
    return 0;
}

Buffering putc write

I'm new to C++ and am making an app that uses a lot of putc calls to write data to an output file. Because of the high number of writes, it's being slowed down. I used to code in Delphi, so I know how I would solve it there: create a memory stream, write into it every time I need to write to the output, and once the memory stream grows larger than the buffer size I want, flush it to the output file and clear the memory stream. How should I do this in C++, and is there a better solution?
putc is already buffered; 4 KB is the default, and you can use setvbuf to change that value :D
setvbuf
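A minimal sketch of what that could look like (my addition; the 1 MiB size and the output.dat name are just placeholders):
#include <cstdio>
#include <vector>

int main()
{
    std::vector<char> buf(1 << 20);            // 1 MiB stdio buffer
    std::FILE *out = std::fopen("output.dat", "wb");
    if (!out)
        return 1;

    // Must be called after fopen() but before the first write to the stream.
    std::setvbuf(out, buf.data(), _IOFBF, buf.size());

    for (int i = 0; i < 10 * 1000 * 1000; ++i)
        std::putc('x', out);

    std::fclose(out);                          // flushes whatever is still buffered
    return 0;
}
The buffer has to stay alive until the stream is closed, which is why it is declared before the FILE handle here.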
Writing to a file should be very quick. It is usually the emptying of the buffer that takes some time. Consider using the character \n instead of std::endl.
I think a good answer to your question is here: Writing a binary file in C++ very fast
Where the answer is:
#include <stdio.h>

const unsigned long long size = 8ULL*1024ULL*1024ULL;
unsigned long long a[size];

int main()
{
    FILE* pFile;
    pFile = fopen("file.binary", "wb");
    for (unsigned long long j = 0; j < 1024; ++j){
        //Some calculations to fill a[]
        fwrite(a, 1, size*sizeof(unsigned long long), pFile);
    }
    fclose(pFile);
    return 0;
}
The most important thing in your case is to write as much data as you can, with the fewest possible I/O requests.

How can you read different sized bit values from a file?

I'm reading a bunch of values from a file; they are in binary form because I stored them using fwrite. The problem is that the first value in the file is 5 bytes in size and the next 4800 values are 2 bytes each, so when I try to cycle through the file and read the values, I get the wrong results: my program does not know that it should take 5 bytes the first time and then 2 bytes for each of the remaining 4800 values.
Here is how I'm cycling through the file:
long lSize;
unsigned short * buffer;
size_t result;
FILE * pFile;

pFile = fopen("dataValues.txt", "rb");
fseek(pFile, 0, SEEK_END);   // ftell() only reports the size after seeking to the end
lSize = ftell(pFile);
rewind(pFile);

buffer = (unsigned short *) malloc (sizeof(unsigned short)*lSize);
size_t count = lSize/sizeof(short);
for(size_t i = 0; i < count; ++i)
{
    result = fread(buffer+i, sizeof(unsigned short), 1, pFile);
    printf("%u\n", buffer[i]);
}
I'm pretty sure I'm going to need to change my fread statement because the first value is of type time_t, so I'll probably need a statement that looks like this:
result = fread(buffer+i, sizeof(time_t), 1, pFile);
However, this did not work when I tried it, and I think it's because I am not changing the starting position properly. I think that while I do read 5 bytes' worth of data, I don't move the starting position enough.
Does anyone here have a good understanding of fread? Can you please let me know what I can change to make my program accomplish what I need?
EDIT:
This is how I'm writing to the file.
fwrite(&timer, sizeof(timer), 1, pFile);
fwrite(ptr, sizeof(unsigned short), rawData.size(), pFile);
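For reading that layout back, a minimal sketch (my addition, assuming the file is read on the same platform that wrote it, so sizeof(timer) and sizeof(unsigned short) match what was written, and reusing the dataValues.txt name from above):
#include <cstdio>
#include <ctime>
#include <vector>

int main()
{
    std::FILE *pFile = std::fopen("dataValues.txt", "rb");
    if (!pFile)
        return 1;

    time_t timer;                          // read the leading timestamp first
    if (std::fread(&timer, sizeof(timer), 1, pFile) != 1)
        return 1;

    std::vector<unsigned short> values;    // then the 2-byte values until EOF
    unsigned short v;
    while (std::fread(&v, sizeof(v), 1, pFile) == 1)
        values.push_back(v);

    std::printf("read %zu values\n", values.size());
    std::fclose(pFile);
    return 0;
}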
EDIT2:
I tried to read the file using ifstream
#include <ctime>
#include <fstream>
using namespace std;

int main()
{
    time_t x;
    ifstream infile;
    infile.open("binaryValues.txt", ios::binary | ios::in);
    infile.read((char *) &x, sizeof(x));
    return 0;
}
However, now it doesn't compile and just gives me a bunch of undefined reference errors pointing at code that I haven't even written.
I don't see the problem:
uint8_t five_byte_buffer[5];
uint8_t two_byte_buffer[2];
//...
ifstream my_file(/*...*/);
// read() takes a char*, so the uint8_t buffers need a cast
my_file.read(reinterpret_cast<char*>(&five_byte_buffer[0]), 5);
my_file.read(reinterpret_cast<char*>(&two_byte_buffer[0]), 2);
So, what is your specific issue?
Edit 1: Reading in a loop
while (my_file.read(reinterpret_cast<char*>(&five_byte_buffer[0]), 5))
{
    // read 2 bytes here, not 5 - two_byte_buffer is only 2 bytes long
    my_file.read(reinterpret_cast<char*>(&two_byte_buffer[0]), 2);
    Process_Data();
}
You can't. Streams are byte oriented, almost always octet (8-bit byte) oriented.
You can easily enough build a bit-oriented stream on top of that. You just keep a few bytes in a buffer and keep track of which bit is current. Watch out for getting the last few bits, and attempts to mix byte access with bit access.
Untested but this is the general idea.
struct bitstream
{
    unsigned long long rack;   // 64-bit rack
    FILE *fp;                  // file opened for reading
    int rackpos;               // 0 - 63, position of bits read
};
int getbits(struct bitstream *bs, int Nbits)
{
    unsigned long long mask = 0x8000000000000000ULL;
    int answer = 0;

    // top up the rack one byte at a time while more than a byte has been consumed
    while(bs->rackpos > 8)
    {
        bs->rack <<= 8;
        bs->rack |= fgetc(bs->fp);
        bs->rackpos -= 8;
    }

    mask >>= bs->rackpos;
    for(int i = 0; i < Nbits; i++)
    {
        answer <<= 1;
        answer |= (bs->rack & mask) ? 1 : 0;   // pull the masked bit down to bit 0
        mask >>= 1;
    }
    bs->rackpos += Nbits;
    return answer;
}
You need to decide how you know when the stream is terminated. As is you'll corrupt the last few bits with the EOF read by fgetc().

Fastest way to read millions of integers from stdin C++?

I am working on a sorting project and I've come to the point where the main bottleneck is reading in the data. It takes my program about 20 seconds to sort 100,000,000 integers read in from stdin using cin and std::ios::sync_with_stdio(false); but it turns out that 10 of those seconds are spent reading in the data to sort. We do know how many integers we will be reading in (the count is at the top of the file we need to sort).
How can I make this faster? I know it's possible because a student in a previous semester was able to do counting sort in a little over 3 seconds (and that's basically purely read time).
The program is just fed the contents of a file with integers separated by newlines like $ ./program < numstosort.txt
Thanks
Here is the relevant code:
std::ios::sync_with_stdio(false);
int max;
cin >> max;
short num;
short* a = new short[max];
int n = 0;
while(cin >> num) {
    a[n] = num;
    n++;
}
This will get your data into memory about as fast as possible, assuming Linux/POSIX running on commodity hardware. Note that since you apparently aren't allowed to use compiler optimizations, C++ IO is not going to be the fastest way to read data. As others have noted, without optimizations the C++ code will not run anywhere near as fast as it can.
Given that the redirected file is already open as stdin/STDIN_FILENO, use low-level system call/C-style IO. That won't need to be optimized, as it will run just about as fast as possible:
#include <sys/stat.h>
#include <unistd.h>
#include <cstdlib>

struct stat sb;
int rc = ::fstat( STDIN_FILENO, &sb );

// use C-style calloc() to get memory that's been
// set to zero as calloc() is often optimized to be
// faster than a new followed by a memset().
char *data = (char *)::calloc( 1, sb.st_size + 1 );

size_t totalRead = 0UL;
while ( totalRead < ( size_t ) sb.st_size )
{
    ssize_t bytesRead = ::read( STDIN_FILENO,
                                data + totalRead, sb.st_size - totalRead );
    if ( bytesRead <= 0 )
    {
        break;
    }
    totalRead += bytesRead;
}
// data is now in memory - start processing it
That code will read your data into memory as one long C-style string. And the lack of compiler optimizations won't matter one bit as it's all almost bare-metal system calls.
Using fstat() to get the file size allows allocating all the needed memory at once - no realloc() or copying data around is necessary.
You'll need to add some error checking, and a more robust version of the code would check to be sure the data returned from fstat() actually is a regular file with an actual size, and not a "useless use of cat" such as cat filename | YourProgram, because in that case the fstat() call won't return a useful file size. You'll need to examine the sb.st_mode field of the struct stat after the call to see what the stdin stream really is:
::fstat( STDIN_FILENO, &sb );
...
if ( S_ISREG( sb.st_mode ) )
{
    // regular file...
}
(And for really high-performance systems, it can be important to ensure that the memory pages you're reading data into are actually mapped in your process address space. Performance can really stall if data arrives faster than the kernel's memory management system can create virtual-to-physical mappings for the pages data is getting dumped into.)
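A sketch of one way to pre-fault those pages (my addition, reusing sb and data from the snippet above; whether it helps depends on how fast data actually arrives):
#include <unistd.h>

// Touch one byte per page so the kernel builds the virtual-to-physical
// mappings before the read() loop starts dumping data into the buffer.
const long pageSize = ::sysconf( _SC_PAGESIZE );
for ( off_t offset = 0; offset < sb.st_size; offset += pageSize )
{
    data[ offset ] = 0;
}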
To handle a large file as fast as possible, you'd want to go multithreaded, with one thread reading data and feeding one or more data processing threads so you can start processing data before you're done reading it.
Edit: parsing the data.
Again, with compiler optimizations disabled, the overhead of C++ stream operations will probably make them slower than C-style processing. Based on that assumption, something simple will probably run faster.
This would probably work a lot faster in a non-optimized binary, assuming the data is in a C-style string read in as above:
char *next;
long count = ::strtol( data, &next, 0 );
long *values = new long[ count ];
for ( long ii = 0; ii < count; ii++ )
{
    values[ ii ] = ::strtol( next, &next, 0 );
}
That is also very fragile. It relies on strtol() skipping over leading whitespace, meaning if there's anything other than whitespace between the numeric values it will fail. It also relies on the initial count of values being correct. Again - that code will fail if that's not true. And because it can replace the value of next before checking for errors, if it ever goes off the rails because of bad data it'll be hopelessly lost.
But it should be about as fast as possible without allowing compiler optimizations.
That's what crazy about not allowing compiler optimizations. You can write simple, robust C++ code to do all your processing, make use of a good optimizing compiler, and probably run almost as fast as the code I posted - which has no error checking and will fail spectacularly in unexpected and undefined ways if fed unexpected data.
You can make it faster if you use a solid-state drive. If you want to ask something about code performance, you need to post how you are doing things in the first place.
You may be able to speed up your program by reading the data into a buffer, then converting the text in the buffer to internal representation.
The thought behind this is that all stream devices like to keep streaming. Starting and stopping the stream wastes time. A block read transfers a lot of data with one transaction.
Although cin is buffered, by using cin.read and a buffer, you can make the buffer a lot bigger than cin uses.
If the data has fixed width fields, there are opportunities to speed up the input and conversion processes.
Edit 1: Example
const unsigned int BUFFER_SIZE = 65536;
char text_buffer[BUFFER_SIZE];
//...
cin.read(text_buffer, BUFFER_SIZE);
//...
int value1;
// Note: snscanf() is a non-standard, length-bounded variant of sscanf();
// REMAINING_BUFFER_SIZE stands for the not-yet-parsed part of the buffer.
int arguments_scanned = snscanf(text_buffer, REMAINING_BUFFER_SIZE,
                                "%d", &value1);
The tricky part is handling the cases where the text of a number is cut off at the end of the buffer.
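A sketch of one way to handle that cut-off case (my addition, not part of the original answer): parse only up to the last whitespace in each block, then carry the unparsed tail to the front of the buffer before the next read. It assumes whitespace-separated integers on stdin:
#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>

int main()
{
    std::ios::sync_with_stdio(false);

    const std::streamsize BUFFER_SIZE = 65536;
    char buffer[BUFFER_SIZE + 1];
    std::streamsize tail = 0;     // bytes of a possibly cut-off number
    long long sum = 0;            // stand-in for "process the value"

    for (;;)
    {
        std::cin.read(buffer + tail, BUFFER_SIZE - tail);
        std::streamsize valid = tail + std::cin.gcount();
        bool lastBlock = (std::cin.gcount() == 0);
        buffer[valid] = '\0';

        // Only parse up to the last whitespace unless this is the final block:
        // anything after it may be a number whose digits are still missing.
        char *end = buffer + valid;
        if (!lastBlock)
            while (end > buffer && !std::isspace((unsigned char)end[-1]))
                --end;

        char saved = *end;
        *end = '\0';
        for (char *p = buffer; ; )
        {
            char *next;
            long v = std::strtol(p, &next, 10);
            if (next == p)
                break;            // only whitespace (or nothing) left
            sum += v;
            p = next;
        }
        *end = saved;

        if (lastBlock)
            break;

        tail = buffer + valid - end;
        std::memmove(buffer, end, (size_t)tail);   // carry cut-off digits over
    }

    std::printf("%lld\n", sum);
    return 0;
}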
Can you run this little test and compare it against your test, with and without the commented line?
#include <iostream>
#include <cstdlib>
int main()
{
    std::ios::sync_with_stdio(false);
    char buffer[20] {0,};
    int t = 0;

    while( std::cin.get(buffer, 20) )
    {
        // t = std::atoi(buffer);
        std::cin.ignore(1);
    }
    return 0;
}
Pure read test:
#include <iostream>
#include <cstdlib>
int main()
{
    std::ios::sync_with_stdio(false);
    char buffer[1024*1024];

    while( std::cin.read(buffer, 1024*1024) )
    {
    }
    return 0;
}

C++ reading leftover data at the end of a file

I am taking input from a file in binary mode using C++; I read the data into unsigned ints, process them, and write them to another file. The problem is that sometimes, at the end of the file, there might be a little bit of data left that isn't large enough to fit into an int; in this case, I want to pad the end of the file with 0s and record how much padding was needed, until the data is large enough to fill an unsigned int.
Here is how I am reading from the file:
std::ifstream fin;
fin.open("filename.whatever", std::ios::in | std::ios::binary);
if(fin) {
    unsigned int m;
    while(fin >> m) {
        //processing the data and writing to another file here
    }
    //TODO: read the remaining data and pad it here prior to processing
} else {
    //output to error stream and exit with failure condition
}
The TODO in the code is where I'm having trouble. After the file input finishes and the loop exits, I need to read in the remaining data at the end of the file that was too small to fill an unsigned int. I need to then pad the end of that data with 0's in binary, recording enough about how much padding was done to be able to un-pad the data in the future.
How is this done, and is this already done automatically by C++?
NOTE: I cannot read the data into anything but an unsigned int, as I am processing the data as if it were an unsigned integer for encryption purposes.
EDIT: It was suggested that I simply read what remains into an array of chars. Am I correct in assuming that this will read in ALL remaining data from the file? It is important to note that I want this to work on any file that C++ can open for input and/or output in binary mode. Thanks for pointing out that I failed to include the detail of opening the file in binary mode.
EDIT: The files my code operates on are not created by anything I have written; they could be audio, video, or text. My goal is to make my code format-agnostic, so I can make no assumptions about the amount of data within a file.
EDIT: ok, so based on constructive comments, this is something of the approach I am seeing, documented in comments where the operations would take place:
std::ifstream fin;
fin.open("filename.whatever", std::ios::in | std::ios::binary);
if(fin) {
    unsigned int m;
    while(fin >> m) {
        //processing the data and writing to another file here
    }
    //1: declare char array
    //2: fill it with what remains in the file
    //3: fill the rest of it until it's the same size as an unsigned int
} else {
    //output to error stream and exit with failure condition
}
The question, at this point, is this: is this truly format-agnostic? In other words, are bytes used to measure file size as discrete units, or can a file be, say, 11.25 bytes in size? I should know this, I know, but I've got to ask it anyway.
Are bytes used to measure file size as discrete units, or can a file be, say, 11.25 bytes in size?
No data type can be less than a byte, and your file is represented as an array of char, meaning each character is one byte. Thus it is impossible to get anything other than a whole number of bytes.
Here are steps one, two, and three as per your post:
while (fin >> m)
{
    // ...
}

// 2: grab whatever is left in the stream as raw characters
std::ostringstream buffer;
buffer << fin.rdbuf();
std::string contents = buffer.str();

// 3: pad with 0 bytes until the leftover reaches a whole unsigned int,
//    and remember how much padding was added so it can be removed later
std::size_t padding = (sizeof(unsigned int) - contents.size() % sizeof(unsigned int))
                      % sizeof(unsigned int);
contents.append(padding, '\0');
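To then treat the padded leftover like the other values, one option (my sketch, assuming native byte order is acceptable for the encryption-style processing described) is to copy the bytes into an unsigned int and keep the padding count alongside the output:
#include <cstring>

// Copy the padded leftover bytes into an unsigned int so it can be processed
// like the values read in the main loop; record `padding` so it can be
// stripped again when the data is unpacked later.
unsigned int last = 0;
std::memcpy(&last, contents.data(), sizeof(last));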