fread speeds managed unmanaged - c++

Ok, so I'm reading a binary file into a char array I've allocated with malloc.
(By the way, the code here isn't the actual code; I just wrote it on the spot to demonstrate, so any mistakes here are probably not mistakes in the actual program.) This method reads at about 50 million bytes per second.
main
char *buffer = (char*)malloc(file_length_in_bytes*sizeof(char));
memset(buffer, 0, file_length_in_bytes*sizeof(char));
//start time here
read_whole_buffer(buffer);
//end time here
free(buffer);
read_whole_buffer
void read_whole_buffer(char* buffer)
{
    //file already opened
    fseek(_file_pointer, 0, SEEK_SET);
    int a = sizeof(buffer[0]);
    fread(buffer, a, file_length_in_bytes, _file_pointer);
}
I've written something similar in managed C++ that uses a FileStream, I believe, and its ReadByte() function to read the entire file, byte by byte, and it also reads at around 50 million bytes per second.
Also, I have a SATA and an IDE drive in my computer, and I've loaded the file from both; it doesn't make any difference at all (which is weird, because I was under the assumption that SATA reads much faster than IDE).
Question
Maybe you can all understand why this doesn't make any sense to me. As far as I knew, it should be much faster to fread a whole file into an array than to read it byte by byte. On top of that, through testing I've discovered that managed C++ is slower (only noticeable, though, if you are benchmarking your code and you require speed).
SO
Why in the world am I reading at the same speed with both applications? Also, is 50 million bytes per second from a file into an array quick?
Maybe my motherboard is bottlenecking me? That just doesn't seem to make much sense either.
Is there maybe a faster way to read a file into an array?
thanks.
My 'script timer'
Records start and end time with millisecond resolution... Most importantly, it's not a timer.
#pragma once
#ifndef __Script_Timer__
#define __Script_Timer__

#include <sys/timeb.h>

extern "C"
{
    struct Script_Timer
    {
        unsigned long milliseconds;
        unsigned long seconds;
        struct timeb start_t;
        struct timeb end_t;
    };

    void End_ST(Script_Timer *This)
    {
        ftime(&This->end_t);
        This->seconds = This->end_t.time - This->start_t.time;
        This->milliseconds = (This->seconds * 1000) + (This->end_t.millitm - This->start_t.millitm);
    }

    void Start_ST(Script_Timer *This)
    {
        ftime(&This->start_t);
    }
}
#endif
Read buffer thing
char face = 0;
char comp = 0;
char nutz = 0;
for(int i = 0; i < (_length * sizeof(char)); ++i)
{
    face = buffer[i];
    if(face == comp)
        nutz = (face + comp) / (i + 1); // divide by i + 1 to avoid a division by zero on the first iteration
    comp++;
}

Transfers from or to main memory run at speeds of gigabytes per second. Inside the CPU data flows even faster. It is not surprising that, whatever you do at the software side, the hard drive itself remains the bottleneck.
Here are some numbers from my system, using PerformanceTest 7.0:
hard disk: Samsung HD103SI 5400 rpm: sequential read/write at 80 MB/s
memory: 3 * 2 GB at 400 MHz DDR3: read/write around 2.2 GB/s
So if your system is a bit older than mine, a hard drive speed of 50 MB/s is not surprising. The connection to the drive (IDE/SATA) is not all that relevant; it's mainly about the number of bits passing the drive heads per second, purely a hardware thing.
Another thing to keep in mind is your OS's filesystem cache. It could be that the second time round, the hard drive isn't accessed at all.
The 180 MB/s memory read speed that you mention in your comment does seem a bit on the low side, but that may well depend on the exact code. Your CPU's caches come into play here. Maybe you could post the code you used to measure this?

The FILE* API uses buffered streams, so even if you read byte by byte, the API internally reads buffer by buffer. So your comparison will not make a big difference.
The low level IO API (open, read, write, close) is unbuffered, so using this one will make a difference.
It may also be faster for you, if you do not need the automatic buffering of the FILE* API!
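To illustrate the difference, here is a minimal sketch (assuming a POSIX system; "data.bin" is just a placeholder name) that pulls a whole file into memory through the unbuffered low-level API:
#include <fcntl.h>     // open
#include <unistd.h>    // read, close
#include <sys/stat.h>  // fstat
#include <cstdio>
#include <vector>

int main()
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    std::vector<char> buffer(st.st_size);

    // read() hands the data straight from the kernel into our buffer,
    // with no stdio buffer in between; loop because read() may return
    // less than requested.
    size_t total = 0;
    while (total < buffer.size()) {
        ssize_t got = read(fd, buffer.data() + total, buffer.size() - total);
        if (got <= 0) break;   // error or end of file
        total += got;
    }
    close(fd);
    std::printf("read %zu bytes\n", total);
    return 0;
}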

I've done some tests on this, and after a certain point, the effect of increased buffer size goes down the bigger the buffer. There is usually an optimum buffer size you can find with a bit of trial and error.
Note also that fread() (or more specifically the C or C++ I/O library) will probably be doing its own buffering. If your system supports it, a plain read() may (or may not) be a bit faster.
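As a rough way to do that trial and error, one could time the same read with a few candidate buffer sizes; the sizes and the file name below are only placeholders:
#include <cstdio>
#include <chrono>
#include <vector>

int main()
{
    // Candidate buffer sizes to try; there is no universal best value.
    const size_t sizes[] = { 4096, 64 * 1024, 1 << 20 };

    for (size_t chunk : sizes)
    {
        FILE* f = std::fopen("data.bin", "rb");   // placeholder file name
        if (!f) return 1;

        std::vector<char> buf(chunk);
        size_t total = 0;
        auto start = std::chrono::steady_clock::now();
        for (size_t got; (got = std::fread(buf.data(), 1, chunk, f)) > 0; )
            total += got;
        auto stop = std::chrono::steady_clock::now();
        std::fclose(f);

        double secs = std::chrono::duration<double>(stop - start).count();
        std::printf("%8zu-byte buffer: %.1f MB/s\n", chunk, total / secs / 1e6);
    }
    return 0;
}
Keep in mind that after the first pass the file is likely served from the OS file cache, so either compare warm-cache numbers with each other or clear the cache between runs.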

Related

c++ read small portions of big number of files

I have a relatively simple question to ask. There has been an ongoing discussion in many programming languages about which method provides the fastest file read, mostly debated around read() vs mmap(). As a person who also participated in these debates, I failed to find an answer to my current problem, because most answers help in the situation where the file to read is huge (for example, how to read a 10 TB text file...).
But my problem is a bit different: I have lots of files, let's say 100 million. I want to read the first 1-2 lines from these files. Whether a file is 10 KB or 100 TB is irrelevant. I just want the first one or two lines from every file, so I want to avoid reading or buffering the unnecessary parts of the files. My knowledge was not enough to thoroughly test which method is faster, or to discover what all my options are in the first place.
What I am doing right now (I am doing this multithreaded for the moment):
std::string line;
for(const auto& p: std::filesystem::recursive_directory_iterator(path)) {
    if (!std::filesystem::is_directory(p)) {
        std::ifstream read_file(p.path().string());
        if (read_file.is_open()) {
            while (getline(read_file, line)) {
                // Get two lines here.
            }
        }
    }
}
What does C++, or the Linux environment, provide me in this situation? Is there a faster or more efficient way to read small portions of millions of files?
Thank you for your time.
Info: I have access to C++20 and Ubuntu 18.04
You can save one underlying call to fstat by not testing whether the path is a directory, and instead relying on the is_open test:
#include <iostream>
#include <fstream>
#include <filesystem>
#include <string>

int main()
{
    std::string line, path = ".";
    for(const auto& p: std::filesystem::recursive_directory_iterator(path)) {
        {
            std::ifstream read_file(p.path().string());
            if (read_file.is_open()) {
                std::cout << "opened: " << p.path().string() << '\n';
                while (getline(read_file, line)) {
                    // Get two lines here.
                }
            }
        }
    }
}
At least on Windows this code skips the directories. And as suggested in the comments, the is_open test can even be skipped, since getline doesn't read anything from a directory either.
Not the cleanest, but if it can save time it's worth it.
Any function in a program accessing a file under Linux will result in calling some "system calls" (for example read()).
All other available functions in some programming language (like fread(), fgets(), std::filesystem ...) call functions or methods which in turn call some system calls.
For this reason you can't be faster than calling the system calls directly.
I'm not 100% sure, but I think in most cases, the combination open(), read(), close() will be the fastest method for reading data from the start of a file.
(If the data is not located at the start of the file, pread() might be faster than read(); I'm not sure.)
Note that read() does not read a certain number of lines but a certain number of bytes (e.g. into an array of char), so you have to find the end(s) of the line(s) "manually" by searching the '\n' character(s) and/or the end of the file in the array of char.
Unfortunately, a line may be much longer than you expect, so the first N bytes read from the file may not contain the first M lines, and you have to call read() again.
In this case it depends on your system (e.g. file system or even hard disks) how many bytes you should read in each call to read() to get the maximum performance.
Example: Let's say in 75% of all files, the first N lines are found in the first 512 bytes of the file; in the other 25% of all files, first N lines are longer than 512 bytes in sum.
On some computers, reading 1024 bytes at once might require nearly the same time as reading 512 bytes, but reading 512 bytes twice will be much slower than reading 1024 bytes at once; on such computers it makes sense to read() 1024 bytes at once: You save a lot of time for 25% of the files and you lose only very little time for the other 75%.
On other computers, reading 512 bytes is significantly faster than reading 1024 bytes; on such computers it would be better to read() 512 bytes: Reading 1024 bytes would save you only little time when processing the 25% of files but cost you very much time when processing the other 75%.
I think in most cases this "optimal value" will be a multiple of 512 bytes, because most modern file systems organize files in units whose size is a multiple of 512 bytes.
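A minimal sketch of that approach (assuming a POSIX system; the 1024-byte first read and the file name are just placeholder choices along the lines of the guesses discussed above):
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <cstdio>

// Read the first two lines of 'path' using raw system calls.
// Returns what it found; a real program would call read() again
// if two '\n' were not found in the first chunk.
static std::string first_two_lines(const char* path)
{
    char buf[1024];                     // first-read size: a guess, see text
    int fd = open(path, O_RDONLY);
    if (fd < 0) return {};
    ssize_t n = read(fd, buf, sizeof(buf));
    close(fd);
    if (n <= 0) return {};

    // Find the second '\n' (or take everything read if there are fewer).
    int newlines = 0;
    ssize_t end = n;
    for (ssize_t i = 0; i < n; ++i) {
        if (buf[i] == '\n' && ++newlines == 2) { end = i + 1; break; }
    }
    return std::string(buf, buf + end);
}

int main()
{
    std::printf("%s", first_two_lines("example.txt").c_str());   // placeholder name
    return 0;
}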
I was just typing something similar to Martin Rosenau's answer (when his popped up): an unstructured read of the maximum length of two lines. But I would go further: queue that text buffer with the corresponding file name and let another thread parse / analyze it. If parsing takes about the same time as reading, you can save half of the time. If it takes longer (unlikely) - you can use multiple threads and save even more.
Side note - you should not parallelize reading (tried that).
It may be worth experimenting: can you open one file, read it asynchronously while proceeding to open the next one? I don't know if any OS can overlap those things.
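Roughly, the reader/parser split described above could look like this (a sketch only; the fixed 1024-byte first read, the queue shape and the file names are all made up for illustration):
#include <condition_variable>
#include <fstream>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Work item: file name plus the raw text read from its start.
struct Chunk { std::string name, text; };

std::queue<Chunk> work;
std::mutex m;
std::condition_variable cv;
bool done = false;

// Reader thread: pulls the first bytes of each file and queues them.
void reader(const std::vector<std::string>& files)
{
    for (const auto& f : files) {
        std::ifstream in(f, std::ios::binary);
        std::string buf(1024, '\0');                 // unstructured first read
        in.read(&buf[0], buf.size());
        buf.resize(in.gcount());
        {
            std::lock_guard<std::mutex> lk(m);
            work.push({f, std::move(buf)});
        }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_one();
}

// Parser thread: takes buffers off the queue and extracts the first line(s).
void parser()
{
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !work.empty() || done; });
        if (work.empty()) return;                    // reader finished, nothing left
        Chunk c = std::move(work.front());
        work.pop();
        lk.unlock();
        auto eol = c.text.find('\n');                // first line only, as an example
        std::cout << c.name << ": " << c.text.substr(0, eol) << '\n';
    }
}

int main()
{
    std::vector<std::string> files = {"a.txt", "b.txt"};   // placeholder names
    std::thread t1(reader, files), t2(parser);
    t1.join();
    t2.join();
}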

CPU usage 99% when reading large txt file using fgets() c++

I have to read in large txt files (1 GB) line by line, and use fgets() to do so. I run an empty while loop, and execution takes extremely long (30 minutes) with 99% CPU utilization.
const int buffer_size = 30;
char buffer[buffer_size];
while (fgets(buffer, buffer_size, traceFile1) != NULL)
{
}
I did do some reading, and apparently the overheads related to text parsing cause this. So the question is, is there any way to read in a txt file while avoiding this? I'm reading in traces for a network simulator, so each line typically has |Injection_cycle source destination|
I've been searching for a while, so if anyone has a smart answer to this I would be absolutely delighted :)
1 GB = 1024 MB = 1048576 KB = 1073741824 B
30 min = 1800 seconds
So you are basically doing about 596k comparisons per second (checking whether the current character is '\n' or EOF, or whether the buffer is full) and making around the same number of memory assignments. Not just that, but also a jump operation, since you have a loop statement.
While this is not too fast, it's not too slow either; I've seen a few "critical cases" that are unable to use the computer's memory system properly. How does the result scale for various buffer sizes?
I think I'm rather off on the reason, but I hope it helps.
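One way to see how it scales (a sketch only; the buffer sizes and the file name are arbitrary) is to time the same empty loop with several fgets buffer sizes:
#include <cstdio>
#include <chrono>
#include <vector>

// Time the empty fgets loop for a given line-buffer size.
static double time_fgets(const char* path, int buffer_size)
{
    FILE* f = std::fopen(path, "r");
    if (!f) return -1.0;
    std::vector<char> buffer(buffer_size);
    auto start = std::chrono::steady_clock::now();
    while (std::fgets(buffer.data(), buffer_size, f) != NULL)
    {
    }
    auto stop = std::chrono::steady_clock::now();
    std::fclose(f);
    return std::chrono::duration<double>(stop - start).count();
}

int main()
{
    for (int size : {30, 256, 4096, 65536})
        std::printf("%6d byte buffer: %.2f s\n", size, time_fgets("trace.txt", size));
    return 0;
}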

Limit CPU usage of fwrite operations

I'm developing a program with several threads that manages the streaming from several cameras. I have to write every raw images on SSD disk. I'm using fwrite to put the image in a binary file. Something like:
FILE* output;
output = fopen(fileName, "wb");
fwrite(imageData, imageSize, 1, output);
fclose(output);
The procedure seems to run fast enough to save all the images at the given cameras' throughput. The problem is that the save procedure is CPU-consuming, and I start to have sync issues when the save process is enabled, due to the CPU usage of the save threads.
Is there any way to reduce the CPU load of fwrite operations? Like playing with buffering, better DMA settings, ...?
Thanks!
MIX
-- UPDATE 1
Forgetting the multithreaded software for a moment, here is a simple file-writer program:
#include <stdio.h>
#include <stdlib.h>

const unsigned int TOT_DATA = 1280*2*960;

int main(int argc, char* argv[])
{
    if(argc != 2)
    {
        printf("Usage:\n");
        printf(" %s totWrite\n\n", argv[0]);
        return -1;
    }

    char* imageData;
    FILE* output;
    char fileName[256];
    unsigned int totWrite;

    totWrite = atoi(argv[1]);
    imageData = new char[TOT_DATA];

    printf("Write imageData[%u] on file %u times.\n", TOT_DATA, totWrite);
    for(unsigned int i = 0; i < totWrite; i++)
    {
        sprintf(fileName, "image_%06u.raw", i);
        output = fopen(fileName, "wb");
        fwrite(imageData, TOT_DATA, 1, output);
        fclose(output);
    }
    printf("DONE!\n");

    delete [] imageData;
    return 0;
}
A char buffer is created and written to a file totWrite times. There are no overwrites, since each cycle writes to a new file. (Of course, one has to remove the files written by a previous run...)
Running top (I'm on Linux) while the program is running, I see that ~50% of the CPU (which means 50% of one of the 4 cores) is used. I suppose fwrite is the bottleneck regarding the CPU usage, since it is the "slowest" operation in the cycle, so the one "most probably" running when top updates its stats. Even "more probable" if TOT_DATA were increased, say, 100 times.
Any further consideration about what can reduce CPU usage in such program?
If you consider playing with DMA settings you're way out of the scope of the standard C library. It will be nowhere near portable - and then you don't have any benefits of using portable functions.
The first step you should probably take (after you've confirmed that the CPU is indeed the bottleneck) is to use lower-level functions, for example open/write (or whatever your OS calls them).
Basically what can happen with fwrite is that the program first copies the data to another place in memory (the FILE* buffer) before actually writing the data to disc. This operation is certainly CPU-bound, and if data transfer by the CPU is slower than the data transfer to the SSD, it could be a case where CPU power is consumed for no good reason.
Also, one should note that using multiple threads has its drawbacks. First, if it were not an SSD, multiple threads writing to disk could result in redundant head movements, which is bad; even an SSD may suffer somewhat, as you might fragment the layout of the data.
There's also a problem in loading the entire file as you seem to do in the example, especially if you do it in multiple threads. It will simply consume a lot of memory (which could mean that swapping is required). If possible you should write the data to the file as the data arrives.
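A minimal sketch of the lower-level approach suggested above (assuming POSIX open/write; the file name, flags and buffer size here are just an example, not the asker's setup):
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main()
{
    const size_t TOT_DATA = 1280 * 2 * 960;
    std::vector<char> imageData(TOT_DATA);

    // open/write bypass the stdio buffer, so the data is handed to the
    // kernel directly instead of first being copied into a FILE* buffer.
    int fd = open("image_000000.raw", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    size_t written = 0;
    while (written < imageData.size()) {
        ssize_t n = write(fd, imageData.data() + written,
                          imageData.size() - written);
        if (n < 0) { perror("write"); break; }   // a real program would retry on EINTR
        written += n;
    }
    close(fd);
    return 0;
}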

What is the optimal file output buffer size?

See the code below for an example. size is 1 MB, and it certainly runs faster than when it is 1. I think that is because the number of I/O system calls is reduced. Does this mean I will always benefit from a larger buffer size? I hoped so and ran some tests, but it seems that there is some limit: size being 2 runs much faster than when it is 1, but the improvement doesn't continue at that rate.
Could someone explain this better? What is the optimal buffer size likely to be? And why don't I benefit much from expanding its size indefinitely?
By the way, in this example I wrote to stdout for simplicity, but I'm also thinking about when writing to files in the disk.
#include <stdio.h>
#include <stdlib.h>

enum
{
    size = 1 << 20
};

void fill_buffer(char (*)[size]);

int main(void)
{
    long n = 100000000;
    for (;;)
    {
        char buf[size];
        fill_buffer(&buf);
        if (n <= size)
        {
            if (fwrite(buf, 1, n, stdout) != n)
            {
                goto error;
            }
            break;
        }
        if (fwrite(buf, 1, size, stdout) != size)
        {
            goto error;
        }
        n -= size;
    }
    return EXIT_SUCCESS;

error:
    fprintf(stderr, "fwrite failed\n");
    return EXIT_FAILURE;
}
You usually don't need the best buffer size, which may require querying the OS for system parameters, doing complex estimations, or even benchmarking in the target environment, and which is dynamic anyway. Luckily, you just need a value that is good enough.
I would say a 4K~16K buffer suits most normal usage: 4K is the magic number for the page size supported by normal machines (x86, ARM), and it is also a multiple of the usual physical disk sector size (512 B or 4K).
If you are dealing with a huge amount of data (gigabytes) you may realise that the simple fwrite model is inadequate because of its blocking nature.
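Another knob on the same path is the stdio stream's own buffer; if you want to pin it at a value in that range instead of the library default, setvbuf can do it. A minimal sketch, with the 16 KB figure taken from the upper end of the range suggested above:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Give stdout a fully buffered 16 KB buffer; setvbuf must be
       called before the first I/O operation on the stream. */
    static char iobuf[16 * 1024];
    if (setvbuf(stdout, iobuf, _IOFBF, sizeof iobuf) != 0)
    {
        fprintf(stderr, "setvbuf failed\n");
        return EXIT_FAILURE;
    }

    for (int i = 0; i < 100000; i++)
    {
        fputs("some output line\n", stdout);   /* flushed in 16 KB chunks */
    }
    return EXIT_SUCCESS;
}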
On a large partition, cluster size is often 32 KB. On a large read / write request, if the system sees that there are a series of contiguous clusters, it will combine them into a single I/O. Otherwise, it breaks up the request into multiple I/O's. I don't know what the maximum I/O size is. On some old SCSI controllers, it was 64 KB or 1 MB - 8 KB (17 or 255 descriptors, in controller). For IDE / Sata, I've been able to do IOCTL's for 2 MB, confirming it was a single I/O with an external bus monitor, but I never tested to determine the limit.
For external sorting with k way bottom up merge sort with k > 2, read / write size of 10 MB to 100 MB is used to reduce random access overhead. The request will be broken up into multiple I/O's but the read or write will be sequential (under ideal circumstances).

How to change buffer size with boost::iostreams?

My program reads dozens of very large files in parallel, just one line at a time. It seems like the major performance bottleneck is HDD seek time from file to file (though I'm not completely sure how to verify this), so I think it would be faster if I could buffer the input.
I'm using C++ code like this to read my files through boost::iostreams "filtering streams":
input = new filtering_istream;
input->push(gzip_decompressor());
file_source in (fname);
input->push(in);
According to the documentation, file_source does not have any way to set the buffer size but filtering_stream::push seems to:
void push( const T& t,
std::streamsize buffer_size,
std::streamsize pback_size );
So I tried input->push(in, 1E9) and indeed my program's memory usage shot up, but the speed didn't change at all.
Was I simply wrong that read buffering would improve performance? Or did I do this wrong? Can I buffer a file_source directly, or do I need to create a filtering_streambuf? If the latter, how does that work? The documentation isn't exactly full of examples.
You should profile it to see where the bottleneck is.
Perhaps it's in the kernel, perhaps you're at your hardware's limit. Until you profile it to find out, you're stumbling in the dark.
EDIT:
Ok, a more thorough answer this time, then. According to the Boost.Iostreams documentation basic_file_source is just a wrapper around std::filebuf, which in turn is built on std::streambuf. To quote the documentation:
CopyConstructible and Assignable wrapper for a std::basic_filebuf opened in read-only mode.
streambuf does provide a method pubsetbuf (not the best reference perhaps, but the first one Google turned up) which you can, apparently, use to control the buffer size.
For example:
#include <fstream>

int main()
{
    char buf[4096];

    std::ifstream f;
    f.rdbuf()->pubsetbuf(buf, 4096);
    f.open("/tmp/large_file", std::ios::binary);

    while( !f.eof() )
    {
        char rbuf[1024];
        f.read(rbuf, 1024);
    }

    return 0;
}
In my test (optimizations off, though) I actually got worse performance with a 4096-byte buffer than with a 16-byte buffer, but YMMV -- a good example of why you should always profile first :)
But, as you say, basic_file_source does not provide any means to access this, as it hides the underlying filebuf in its private part.
If you think this is wrong you could:
Urge the Boost developers to expose such functionality, use the mailing list or the trac.
Build your own filebuf wrapper which does expose the buffer size. There's a section in the tutorial which explains writing custom sources that might be a good starting point.
Write a custom source based on whatever, that does all the caching you fancy (a sketch of this option follows below).
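For that last option, here is a minimal sketch of a custom source (assuming Boost.Iostreams' Source concept from boost/iostreams/concepts.hpp; the FILE* wrapping, the 1 MB setvbuf call and the path are illustrative choices, not a tested recipe):
#include <boost/iostreams/concepts.hpp>   // boost::iostreams::source
#include <boost/iostreams/stream.hpp>
#include <cstdio>
#include <string>

namespace io = boost::iostreams;

// A Source that reads from a stdio FILE*, so the stdio buffer
// (adjustable with setvbuf) sits underneath the iostreams chain.
class stdio_source : public io::source      // supplies char_type and category
{
public:
    explicit stdio_source(std::FILE* f) : file_(f) {}

    std::streamsize read(char* s, std::streamsize n)
    {
        std::size_t got = std::fread(s, 1, static_cast<std::size_t>(n), file_);
        return got == 0 ? -1 : static_cast<std::streamsize>(got);   // -1 signals EOF
    }

private:
    std::FILE* file_;
};

int main()
{
    std::FILE* f = std::fopen("/tmp/large_file", "rb");
    if (!f) return 1;
    std::setvbuf(f, NULL, _IOFBF, 1 << 20);   // 1 MB underlying buffer, for example

    stdio_source dev(f);
    io::stream<stdio_source> in(dev);

    std::string line;
    while (std::getline(in, line))
    {
        // process line
    }

    std::fclose(f);
    return 0;
}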
Remember that your hard drive as well as the kernel already do caching and buffering on file reads, so I don't think you'll get much of a performance increase from caching even more.
And in closing, a word on profiling. There's a ton of powerful profiling tools available for Linux, and I don't even know half of them by name, but for example there's iotop, which is kind of neat because it's super simple to use. It's pretty much like top but instead shows disk-related metrics. For example:
Total DISK READ: 31.23 M/s | Total DISK WRITE: 109.36 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
19502 be/4 staffan 31.23 M/s 0.00 B/s 0.00 % 91.93 % ./apa
tells me that my program spends over 90% of its time waiting for I/O, i.e. it's I/O bound. If you need something more powerful I'm sure Google can help you.
And remember that benchmarking on a hot or cold cache greatly affects the outcome.