CPU usage 99% when reading large txt file using fgets() c++ - c++

I have to read in large txt files (1Gb) line-by-line, and use fgets() to do so. I run an empty while loop and execution takes extremely long (30mins) with 99% CPU utilization.
int buffer_size = 30;
char buffer[buffer_size];
while (fgets(buffer, buffer_size, traceFile1) != NULL)
{
}
I did do some reading and apparently the overheads related to text parsing causes this. So the question is, is there any way to read in a txt file while avoiding this? I'm reading in traces for a network simulator, so each line typically has |Injection_cycle source destination|
I've been searching for a while, so if anyone has a smart answer to this I would be absolutely delighted :)

1GB = 1024MB= 1048576 KB = 1073741824 B
30 min = 1800 seconds
So you are basically comparing about 595k/s comparisons (checking if the current character is '\n' or '\t' or eof) and making around the same amount of memory assignations. Not just that but also making a jump operation since you have a loop statement.
This while not being too fast it's not too slow either, I've seen a few "critical cases" that are unable to use the computer memory system properly, how does the result scale for various sizes?
I think I'm rather off the reason, but hope it helps

Related

c++ read small portions of big number of files

I have a relatively simple question to ask, there has been an ongoing discussion regarding many programming languages about which method provides the fastest file read. Mostly debated on read() or mmap(). As a person who also participated in these debates, I failed to find an answer to my current problem, because most answers help in the situation where the file to read is huge (example, how to read a 10 TB text file...).
But my problem is a bit different, I have lots of files, lets say a 100 million. I want to read the first 1-2 lines from these files. Whether the file is 10 kb or 100 TB is irrelevant. I just want the first one or two lines from every file. So I want to avoid reading or buffering the unnecessary parts of the files. My knowledge was not enough to thoroughly test which method is faster, or to discover what are all my options in the first place.
What I am doing right know: (I am doing this multithreaded for the moment)
for(const auto& p: std::filesystem::recursive_directory_iterator(path)) {
if (!std::filesystem::is_directory(p)) {
std::ifstream read_file(p.path().string());
if (read_file.is_open()) {
while (getline(read_file, line)) {
// Get two lines here.
}
}
}
}
What does C++, or the linux environment provide me in this situation ? Is there a faster or more efficient way to read small portions of millions of files ?
Thank you for your time.
Info: I have access to C++20 and Ubuntu 18.04
You can save one underlying call to fstat by not testing if the path is a directory, and then rely on is_open test
#include <iostream>
#include <fstream>
#include <filesystem>
#include <string>
int main()
{
std::string line,path=".";
for(const auto& p: std::filesystem::recursive_directory_iterator(path)) {
{
std::ifstream read_file(p.path().string());
if (read_file.is_open()) {
std::cout << "opened: " << p.path().string() << '\n';
while (getline(read_file, line)) {
// Get two lines here.
}
}
}
}
}
At least on Windows this code skips the directories. And as suggested in comments is_open test can even be skipped since getline doesn't read anything from a directory either.
Not the cleanest, but if it can save time it's worth it.
Any function in a program accessing a file under Linux will result in calling some "system calls" (for example read()).
All other available functions in some programming language (like fread(), fgets(), std::filesystem ...) call functions or methods which in turn call some system calls.
For this reason you can't be faster than calling the system calls directly.
I'm not 100% sure, but I think in most cases, the combination open(), read(), close() will be the fastest method for reading data from the start of a file.
(If the data is not located at the start of the file, pread() might be faster than read(); I'm not sure.)
Note that read() does not read a certain number of lines but a certain number of bytes (e.g. into an array of char), so you have to find the end(s) of the line(s) "manually" by searching the '\n' character(s) and/or the end of the file in the array of char.
Unfortunately, a line may be much longer than you expect, so reading the first N bytes from the file does not contain the first M lines and you have to call read() again.
In this case it depends on your system (e.g. file system or even hard disks) how many bytes you should read in each call to read() to get the maximum performance.
Example: Let's say in 75% of all files, the first N lines are found in the first 512 bytes of the file; in the other 25% of all files, first N lines are longer than 512 bytes in sum.
On some computers, reading 1024 bytes at once might require nearly the same time as reading 512 bytes, but reading 512 bytes twice will be much slower than reading 1024 bytes at once; on such computers it makes sense to read() 1024 bytes at once: You save a lot of time for 25% of the files and you lose only very little time for the other 75%.
On other computers, reading 512 bytes is significantly faster than reading 1024 bytes; on such computers it would be better to read() 512 bytes: Reading 1024 bytes would save you only little time when processing the 25% of files but cost you very much time when processing the other 75%.
I think in the most cases this "optimal value" will be a multiple of 512 bytes because most modern file systems organize files in units that have a multiple of 512 bytes.
I was just typing something similar to Martin Rosenau answer (when his popped up): unstructured read of the max length of two lines. But I would go further: queue that text buffer with corresponding file name and let another thread parse / analyze that. If parsing takes about the same time as reading, you can save half of the time. If it takes longer (unlikely) - you can use multiple threads and save even more.
Side note - you should not parallelize reading (tried that).
It may be worth experimenting: can you open one file, read it asynchronously while proceeding to open the next one? I don't know if any OS can overlap those things.

Memory and time issue when reading/writing from a file

I am trying to solve a school problem and I did that, but it should run faster and on less memory if possible - can you please help me achieve that?
Problem statement: Read a natural number N and a string from a file, and output in another file the same string N number of times.
Example of input file:
3
dog
Example of output file:
dog
dog
dog
Restrictions:
1 ≤ n ≤ 50, and the length of the line to be read is maximum 1,000,000
Time limit: 0.27 seconds
This is what I tried (but run time exceeds the limit):
#include<fstream>
using namespace std;
ifstream cin("afisaren.in");
ofstream cout("afisaren.out");
short n;
char s[1000005];
int main() {
cin >> n;
cin >> s;
while(n) {
cout << s << '\n';
n--;
}
cin.close();
cout.close();
return 0;
}
Generally when given this type of problem, you should profile your own code to see which part of the code is consuming what amount of time. This can mostly be done by adding a few calls to a timekeeping-function before and after code execution, to see how long it was executing. However this is not so easy with your code, since one of the biggest problems (optimisation-wise) is your char s[1000005]; line. The memory will be allocated before executing your main() function, which is operating system dependant (or rather depends on the libc and compiler used).
So first, do not use pre-allocated char-arrays. You're using C++! Why not simply read the text into a std::(w)string or any of the C++-classes which will do dynamic memory allocation (and not crash your program if line-length does exceed 1,000,000).
And second, the c++ std::streams usually perform a flush-to-disk every time a line-ending character is written. This is highly inefficient unless your text is exactly the same size as the block-size of the underlying file-system. To optimize this, create a memory object (i.e. std::string) and copy your text into it for k times, where k = fs-block-size / text-length. fs-block-size will most likely be 1024, 2048 or 4096 bytes. There are system-calls to find that out, but performance will usually not be affected too much when writing twice (or 4x) the fs-block-size, so you can safely assume it to be 4096 for close-to-or-maximum-performance.
Since the maximum number of repetitions is 1 < n < 50, and line length is 1,000,000 (approx. 1 MiB if ASCII), maximum file size for the output will be 50,000,000 characters. You could also write everything into memory and then write everything in one call to write(). This would probably be the most efficient way in terms of disk-activity, but obviously not regarding memory consumption.
I'm not a c++ expert but I had a similar problem when I used c++ style file streams, after googling a bit, I tried switching to c-style file system and it boosted my performance a lot because c++ file streams copy file contents into internal buffer and that takes time, you can try it c-style but usually it is not recommended to use c in c++.

How to read 4GB file on 32bit system

In my case I have different files lets assume that I have >4GB file with data. I want to read that file line by line and process each line. One of my restrictions is that soft has to be run on 32bit MS Windows or on 64bit with small amount of RAM (min 4GB). You can also assume that processing of these lines isn't bottleneck.
In current solution I read that file by ifstream and copy to some string. Here is snippet how it looks like.
std::ifstream file(filename_xml.c_str());
uintmax_t m_numLines = 0;
std::string str;
while (std::getline(file, str))
{
m_numLines++;
}
And ok, that's working but to slowly here is a time for my 3.6 GB of data:
real 1m4.155s
user 0m0.000s
sys 0m0.030s
I'm looking for a method that will be much faster than that for example I found that How to parse space-separated floats in C++ quickly? and I loved presented solution with boost::mapped_file but I faced to another problem what if my file is to big and in my case file 1GB large was enough to drop entire process. I have to care about current data in memory probably people who will be using that tool doesn't have more than 4 GB installed RAM.
So I found that mapped_file from boost but how to use it in my case? Is it possible to read partially that file and receive these lines?
Maybe you have another much better solution. I have to just process each line.
Thanks,
Bart
Nice to see you found my benchmark at How to parse space-separated floats in C++ quickly?
It seems you're really looking for the fastest way to count lines (or any linear single pass analysis), I've done a similar analysis and benchmark of exactly that here
Fast textfile reading in c++
Interestingly, you'll see that the most performant code does not need to rely on memory mapping at all there.
static uintmax_t wc(char const *fname)
{
static const auto BUFFER_SIZE = 16*1024;
int fd = open(fname, O_RDONLY);
if(fd == -1)
handle_error("open");
/* Advise the kernel of our access pattern. */
posix_fadvise(fd, 0, 0, 1); // FDADVICE_SEQUENTIAL
char buf[BUFFER_SIZE + 1];
uintmax_t lines = 0;
while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
{
if(bytes_read == (size_t)-1)
handle_error("read failed");
if (!bytes_read)
break;
for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
++lines;
}
return lines;
}
The case of a 64-bit system with small memory should be fine to load a large file into - it's all about address space - although it may well be slower than the "fastest" option in that case, it really depends on what else is in memory and how much of the memory is available for mapping the file into. In a 32-bit system, it won't work, since the pointers into the filemapping won't go beyond about 3.5GB at the very most - and typically around 2GB is the maximum - again, depending on what memory addresses are available to the OS to map the file into.
However, the benefit of memory mapping a file is pretty small - the huge majority of the time spent is from actually reading the data. The saving from using memory mapping comes from not having to copy the data once it's loaded into RAM. (When using other file-reading mechanisms, the read function will copy the data into the buffer supplied, where memory mapping a file will stuff it straight into the correct location directly).
You might want to look at increasing the buffer for the ifstream - the default buffer is often rather small, this leads to lots of expensive reads.
You should be able to do this using something like:
std::ifstream file(filename_xml.c_str());
char buffer[1024*1024];
file.rdbuf()->pubsetbuf(buffer, 1024*1024);
uintmax_t m_numLines = 0;
std::string str;
while (std::getline(file, str))
{
m_numLines++;
}
See this question for more info:
How to get IOStream to perform better?
Since this is windows, you can use the native windows file functions with the "ex" suffix:
windows file management functions
specifically the functions like GetFileSizeEx(), SetFilePointerEx(), ... . Read and write functions are limited to 32 bit byte counts, and the read and write "ex" functions are for asynchronous I/O as opposed to handling large files.

Proper, efficient file reading

I'd like to read and process (e.g. print) entries from the first row of a CSV file one at a time. I assume Unix-style \n newlines, that no entry is longer than 255 chars and (for now) that there's a newline before EOF. This is meant to be a more efficient alternative to fgets() followed by strtok().
#include <stdio.h>
#include <string.h>
int main() {
int i;
char ch, buf[256];
FILE *fp = fopen("test.csv", "r");
for (;;) {
for (i = 0; ; i++) {
ch = fgetc(fp);
if (ch == ',') {
buf[i] = '\0';
puts(buf);
break;
} else if (ch == '\n') {
buf[i] = '\0';
puts(buf);
fclose(fp);
return 0;
} else buf[i] = ch;
}
}
}
Is this method as efficient and correct as possible?
What is the best way to test for EOF and file reading errors using this method? (Possibilities: testing against the character macro EOF, feof(), ferror(), etc.).
Can I perform the same task using C++ file I/O with no loss of efficiency?
What is most efficient is going to depend a lot on the operating system, standard libraries (e.g. libc), and even the hardware you are running on. This makes it nearly impossible to tell you what's "most efficient".
That having been said, there are a few things you could try:
Use mmap() or a local operating system equivalent (Windows has CreateFileMapping / OpenFileMapping / MapViewOfFile, and probably others). Then you don't do explicit file reads: you simply access the file as if it were already in memory, and anything that isn't there will be faulted in by the page fault mechanism.
Read the entire file into a buffer manually, then work on that buffer. The fewer times you call into file read functions, the fewer function-call overheads you take, and likely also fewer application/OS domain switches. Obviously this uses more memory, but may very well be worth it.
Use a more optimal string scanner for your problem and platform. Going character-by-character yourself is almost never as fast as relying on something existing that's close to your problem domain. For example, you can bet that strchr and memchr are probably better-optimized than most code you can roll yourself, doing things like reading entire cachelines or words at once, scanning using better algorithms for this kind of search, etc. For more complicated cases, you might consider a full regular expression engine that could compile your regex to something fast for your complicated case.
Avoid copying your string around. It may be helpful to think in terms of "find delimiters" and then "output between delimiters". You could for example use strchr to find the next character of interest, and then fwrite or something to write to stdout directly from your input buffer. Then you're keeping most of your work in a few local registers, rather than using a stack or heap buf.
When in doubt, though, try a few possibilities and profile, profile, profile.
Also for this kind of problem, be very aware of differences between runs that are caused by OS and hardware caches: profile a bunch of runs rather than just one after each change -- and if possible, use tests that will either likely always hit caches (if you're trying to measure best-case performance) or tests that will likely miss (if you're trying to measure worst-case performance).
Regarding C++ file IO (fstream and such), just be aware that they're larger, more complicated beasts. They tend to include things such as locale management, automatic buffering, and the like -- as well as being less prone to particular types of coding mistakes.
If you're doing something pretty simple (like what you describe here), I tend to find C++ library stuff gets in the way. (Use a debugger and "step instruction" through a stringstream method versus some C string functions some time, you'll get a good feel for this quickly.)
It all depends on whether you're going to want or need that additional functionality or safety in the future.
Finally, the obligatory "don't sweat the small stuff". Only spend time on optimizing here if it's really important. Otherwise trust the libraries and OS to do the right thing for you most of the time -- if you get too far into micro-optimizations you'll find you're shooting yourself in the foot later. This is not to discourage you from thinking in terms of "should I read the whole file in ahead of time, will that break future use cases" -- because that's macro, rather than micro.
But generally speaking if you're not doing this kind of "make it faster" investigation for a good reason -- i.e. "need this app to perform better now that I've written it, and this code shows up as slow in profiler", or "doing this for fun so I can better understand the system" -- well, spend your time elsewhere first. =)
One method, provided you are going to scan through the file serially, is to use 2 buffers of a decent enough size (16K is the optimal size for SSDs and 4K for HDDs IIRC. But 16K should suffice for both). You start off by performing an asynchronous load (In windows look up Overlapped I/O and on Unix/OSX use O_NONBLOCK) of the first 16K into buffer 0 and then start another load into buffer 1 of bytes 16K to 32K. When your read position hits 16K, swap the buffers (so you are now reading from buffer 1 instead) wait for any further loads to complete into buffer 1 and perform an asynchronous load of bytes 32K to 48K into buffer 0 and so on. This way, you have far less chance of ever having to wait for a load to complete as it should be happening while you are processing the previous 16K.
I moved over to a scheme like this in my XML parser having been using fopen and fgetc previously and the speedup was huge. Loading in a 15 meg XML file and processing it reduced from minutes to seconds. Of course, Your milage may vary.
use fgets to read one line at a time. C++ file I/O are basically wrapper code with some compiler optimization tucked inside ( and many unwanted functionality ). Unless you are reading millions of lines of code and measuring time, it does not matter.

fread speeds managed unmanaged

Ok, so I'm reading a binary file into a char array I've allocated with malloc.
(btw the code here isn't the actual code, I just wrote it on the spot to demonstrate, so any mistakes here are probably not mistakes in the actual program.) This method reads at about 50million bytes per second.
main
char *buffer = (char*)malloc(file_length_in_bytes*sizeof(char));
memset(buffer,0,file_length_in_bytes*sizeof(char));
//start time here
read_whole_file(buffer);
//end time here
free(buffer);
read_whole_buffer
void read_whole_buffer(char* buffer)
{
//file already opened
fseek(_file_pointer, 0, SEEK_SET);
int a = sizeof(buffer[0]);
fread(buffer, a, file_length_in_bytes*a, _file_pointer);
}
I've written something similar with managed c++ that uses filestream I believe and the function ReadByte() to read the entire file, byte by byte, and it reads at around 50million bytes per second.
Also, I have a sata and an IDE drive in my computer, and I've loading the file off of both, doesn't make any difference at all(Which is weird because I was under the assumption that SATA read much faster than IDE.)
Question
Maybe you can all understand why this doesn't make any sense to me. As far as I knew, it should be much faster to fread a whole file into an array, as opposed to reading it byte by byte. On top of that, through testing I've discovered that managed c++ is slower (only noticeable though if you are benchmarking your code and you require speed.)
SO
Why in the world am I reading at the same speed with both applications. Also is 50 million bytes from a file, into an array quick?
Maybe I my motherboard is bottle necking me? That just doesn't seem to make much sense eather.
Is there maybe a faster way to read a file into an array?
thanks.
My 'script timer'
Records start and end time with millisecond resolution...Most importantly it's not a timer
#pragma once
#ifndef __Script_Timer__
#define __Script_Timer__
#include <sys/timeb.h>
extern "C"
{
struct Script_Timer
{
unsigned long milliseconds;
unsigned long seconds;
struct timeb start_t;
struct timeb end_t;
};
void End_ST(Script_Timer *This)
{
ftime(&This->end_t);
This->seconds = This->end_t.time - This->start_t.time;
This->milliseconds = (This->seconds * 1000) + (This->end_t.millitm - This->start_t.millitm);
}
void Start_ST(Script_Timer *This)
{
ftime(&This->start_t);
}
}
#endif
Read buffer thing
char face = 0;
char comp = 0;
char nutz = 0;
for(int i=0;i<(_length*sizeof(char));++i)
{
face = buffer[i];
if(face == comp)
nutz = (face + comp)/i;
comp++;
}
Transfers from or to main memory run at speeds of gigabytes per second. Inside the CPU data flows even faster. It is not surprising that, whatever you do at the software side, the hard drive itself remains the bottleneck.
Here are some numbers from my system, using PerformanceTest 7.0:
hard disk: Samsung HD103SI 5400 rpm: sequential read/write at 80 MB/s
memory: 3 * 2 GB at 400 MHz DDR3: read/write around 2.2 GB/s
So if your system is a bit older than mine, a hard drive speed of 50 MB/s is not surprising. The connection to the drive (IDE/SATA) is not all that relevant; it's mainly about the number of bits passing the drive heads per second, purely a hardware thing.
Another thing to keep in mind is your OS's filesystem cache. It could be that the second time round, the hard drive isn't accessed at all.
The 180 MB/s memory read speed that you mention in your comment does seem a bit on the low side, but that may well depend on the exact code. Your CPU's caches come into play here. Maybe you could post the code you used to measure this?
The FILE* API uses buffered streams, so even if you read byte by byte, the API internally reads buffer by buffer. So your comparison will not make a big difference.
The low level IO API (open, read, write, close) is unbuffered, so using this one will make a difference.
It may also be faster for you, if you do not need the automatic buffering of the FILE* API!
I've done some tests on this, and after a certain point, the effect of increased buffer size goes down the bigger the buffer. There is usually an optimum buffer size you can find with a bit of trial and error.
Note also that fread() (or more specifically the C or C++ I/O library) will probably be doing its own buffering. If your system suports it a plain read() may (or may not) be a bit faster.