I am reading 1 million files from my local system (Ubuntu 20.04), about 10 KB each. The first 40,000 files are processed very quickly, and then the remaining 960,000 go at a much slower pace. I assume this has something to do with the OS, hardware, or file system. I am basically performing a file read as follows:
string read_file(std::ifstream& ifs) {
vector<char> buf(64 * 1024);
string s = "";
ifs.read(&buf[0], buf.size());
while (int n = ifs.gcount()) { s.append(&buf[0], n); ifs.read(&buf[0], buf.size()); }
return s;
}
Can someone explain what is happening to cause this slow down and how it can be fixed?
EDIT: I am processing the strings in a function, I am not sure if they persist after execution:
void process_file(string file_name) {
ifstream ifs(file_name);
string s = read_file(ifs);
process_data(s);
}
void p1() {
vector<string> files = get_files(); // 1 million size vector of filenames
for (int i = 0 ; i < files.size(); i++) {
process_file(files[i]);
}
}
process_data writes to another file but does not keep the string
EDIT: I have commented out the process_data(s) function and passed the file_name by reference, and I am getting the same behavior. However, if I run the program up to a number N, press Ctrl+C, and then re-run, the first N files load very fast.
void process_file(string& file_name) {
ifstream ifs(file_name);
string s = read_file(ifs);
// process_data(s);
}
void p1() {
vector<string> files = get_files(); // 1 million size vector of filenames
for (int i = 0 ; i < files.size(); i++) {
process_file(files[i]);
}
}
string.append is not the fastest way to do this: it can cause repeated memory reallocation as the string grows. Instead, you could:
seekg to offset 0 from the end, then call tellg to get the file length.
Create a string and resize it to the needed length.
In a loop, read directly into your string in 64 KB blocks (or even read the whole file at once).
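The steps above can be sketched as follows. This is a minimal sketch, not the asker's code: the name read_whole_file is hypothetical, and it assumes the file fits in memory. Note that it only removes reallocation overhead; the observation that a re-run loads the first N files quickly suggests the OS file cache also plays a role, which this does not address.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Reads an entire file into a string using a single allocation.
// Returns an empty string if the file cannot be opened or is empty.
// (read_whole_file is a hypothetical helper name used for illustration.)
std::string read_whole_file(const std::string& path) {
    // std::ios::ate opens the stream positioned at the end of the file.
    std::ifstream ifs(path, std::ios::binary | std::ios::ate);
    if (!ifs) return {};

    // The position at the end of the file equals the file length.
    const std::streamoff size = ifs.tellg();
    if (size <= 0) return {};

    // Allocate once, then read straight into the string's storage.
    std::string s(static_cast<std::size_t>(size), '\0');
    ifs.seekg(0, std::ios::beg);
    ifs.read(&s[0], size);

    // Keep only the bytes that were actually read.
    s.resize(static_cast<std::size_t>(ifs.gcount()));
    return s;
}
```

In the question's process_file, the call would become something like string s = read_whole_file(file_name);.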
This is my code for allocating memory for the array of strings:
FileReader::FileReader()
{
readBuffer = (char**)malloc(100 * sizeof(char*));
for (int i = 0; i < 100; i++)
{
readBuffer[i] = (char*)malloc(200 * sizeof(char));
}
}
I'm allocating 100 strings for 100 lines, then allocating 200 chars for each string.
This is my code for reading the lines:
char** FileReader::ReadFile(const char* filename)
{
int i = 0;
File.open(filename);
if (File.is_open())
{
while (getline(File, tmpString))
{
readBuffer[i] = (char*)tmpString.c_str();
i++;
}
return readBuffer;
}
}
and for printing:
for (int i = 0; i <= 5; i++)
{
cout << fileCpy[i];
}
The terminal output (posted as a picture) just repeats the last line of the file, even though the file reads:
This is test
line 2
line 3
line 4
line 5
Any idea what's going on? Why aren't the lines copying correctly?
Replace
readBuffer[i] = (char*)tmpString.c_str();
with
strcpy(readBuffer[i], tmpString.c_str());
Your version just saves a pointer to tmpString's internal buffer in your array. When tmpString changes, that pointer points at the new contents of tmpString (and that's just the best possible outcome; the pointer can also be left dangling). strcpy, however, actually copies the characters of the string, which is what you want, as long as each line fits in the 200 characters you allocated.
Of course, I'm sure it doesn't need saying, but you can avoid all the headache and complication like this
vector<string> readBuffer;
This way there are no more pointer problems, no more manual allocation or freeing of memory, and no limits: you aren't restricted to 100 lines or 200 characters per line. I'm sure you have a reason for doing things the hard way, but I wonder if it's a good reason.
First of all, you should switch from C-style code to C++.
Do not allocate memory like that; in modern C++ the preferred way to manage dynamic memory is through smart pointers from the <memory> header.
Anyway, you do not need direct dynamic allocation here. Encapsulate your data within the class and use std::vector<std::string> from the standard library. This vector is a dynamic array that handles all the memory management behind the scenes for you.
To read all lines of a file:
std::string item_name;
std::vector<std::string> your_buffer;
std::ifstream name_fileout;
name_fileout.open("test.txt");
while (std::getline(name_fileout, item_name))
{
your_buffer.push_back(item_name);
std::cout << item_name;
}
name_fileout.close();
This question already has answers here:
C++ Make a file of a specific size
(3 answers)
Closed 9 years ago.
I have to create a text file of a specific size; the user enters the size. All I need to know is how to create the file faster. Currently, creating a 10 MB file takes about 15 seconds, and I have to reduce this to 5 seconds at most. How can I do that? Currently this is how I am making my file:
void create_file()
{
int size;
cout<<"Enter size of File in MB's : ";
cin>>file_size;
size = 1024*1024*file_size; // 1MB = 1024 * 1024 bytes
ofstream pFILE("my_file.txt", ios::out);
for(int i=0; i<size; i++) //outputting spces to create file
pFILE<<' ';
pFILE.close();
}
Update: this is what I am using now, but I get garbage values written to the file as well:
void f_c()
{
int i, size;
cin>>size;
FILE * new_file = fopen("FILE TEST.txt", "w");
char buffer[1024];
memset(buffer,' ', 1024);
for(i = 0; i<1024 * size; i++)
fputs(buffer, new_file);
getchar();
}
You are filling it one character at a time. Instead, you could allocate a larger chunk of memory using new and then write larger chunks at once to speed up the process. You could use memset on the allocated memory so that it does not contain garbage bytes. But also look at the comment about the duplicate question: there are even faster methods if the file needn't have specific content initially.
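As a rough illustration of writing in larger chunks, here is a minimal sketch. The function name create_space_file and the 64 KB chunk size are assumptions made for the example. It uses write with an explicit length rather than fputs: fputs expects a null-terminated string, and a buffer filled entirely with spaces has no terminator, which is the likely source of the garbage seen in the updated code.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Writes `size_bytes` space characters to `path` in 64 KB chunks.
// Returns true if every write succeeded.
// (create_space_file is a hypothetical name used for illustration.)
bool create_space_file(const std::string& path, std::size_t size_bytes) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;

    // One reusable buffer of spaces; its length is passed explicitly to
    // write(), so no null terminator is needed.
    const std::string chunk(64 * 1024, ' ');

    std::size_t remaining = size_bytes;
    while (remaining > 0 && out) {
        const std::size_t n = remaining < chunk.size() ? remaining : chunk.size();
        out.write(chunk.data(), static_cast<std::streamsize>(n));
        remaining -= n;
    }
    out.flush();
    return static_cast<bool>(out);
}
```

For the 10 MB case from the question, the call would be create_space_file("my_file.txt", 10 * 1024 * 1024).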
Here is a simple sample, without error checking.
Let's say you want a size of 1000:
#include <fstream>
int main () {
int size = 1000;
const char* filename= "file.txt";
std::ofstream fout(filename);
fout.fill (' ');
fout.width (size);
fout << " ";
return 0;
}
You don't have to fill all of the bytes in the file if you just want to create a big file; that is what makes it slow. Just tell the file system how big you want the file to be.
On Linux, use truncate:
truncate("data.txt",1024*1024*1024);
On Windows, use SetFilePointer followed by SetEndOfFile:
SetFilePointer (hFile, 1024*1024*1024, NULL, FILE_BEGIN);
SetEndOfFile (hFile);
Both of them can create a file of several gigabytes, without writing its contents, in less than a second.
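A minimal POSIX-only sketch of the truncate approach is shown below. The helper names create_sized_file and file_size_of are hypothetical. It assumes the file may not exist yet, so it creates it first; truncate itself requires an existing file.

```cpp
#include <fstream>
#include <string>
#include <sys/stat.h>   // stat (POSIX)
#include <sys/types.h>  // off_t
#include <unistd.h>     // truncate (POSIX)

// Creates (or empties) `path`, then sets its length to `size_bytes`
// without writing the contents. Returns true on success.
// (create_sized_file is a hypothetical name used for illustration.)
bool create_sized_file(const std::string& path, off_t size_bytes) {
    {
        // truncate() needs an existing file, so create an empty one first.
        std::ofstream create(path, std::ios::binary | std::ios::trunc);
        if (!create) return false;
    }
    return ::truncate(path.c_str(), size_bytes) == 0;
}

// Returns the logical size of `path` in bytes, or -1 on error.
off_t file_size_of(const std::string& path) {
    struct stat st;
    return ::stat(path.c_str(), &st) == 0 ? st.st_size : -1;
}
```

On file systems that support sparse files, the extended region reads back as zero bytes and typically takes up little or no disk space until it is written.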
I have to read a text file into an array of structures. I have already written a program, but it takes too much time, as there are about 13 lakh (1.3 million) structures in the file.
Please suggest the best and fastest way to do this in C++.
Here is my code:
std::ifstream input_counter("D:\\cont.txt");
/**********************************************************/
int counter = 0;
while( getline(input_counter,line) )
{
ReadCont( line,&contract[counter]); // function to read data to structure
counter++;
line.clear();
}
input_counter.close();
Keep your parsing as simple as possible: where you know a field's format, apply that knowledge. For instance,
ReadCont("|PE|1|0|0|0|0|1|1||2|0||2|0||3|0|....", ...)
should apply a fast char-to-integer conversion, something like
void ReadCont(const char *line, Contract &c) {
// the record starts with "|PE|": skip the tag, then read the fields
if (line[1] == 'P' && line[2] == 'E' && line[3] == '|') {
line += 4;
for (int field = 0; field < K_FIELDS_PE; ++field) {
c.int_field[field] = *line++ - '0'; // assumes exactly one digit per field
assert(*line == '|');
++line;
}
}
}
Well, beware of the details, but you get the idea...
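For fields that are not always a single digit, a slightly more general version of the same idea can be sketched with std::from_chars (C++17). The function name parse_int_fields is hypothetical, and storing 0 for empty or non-numeric fields is an assumption made for this sketch; the real record layout is not known from the question.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <vector>

// Splits a '|'-delimited record into integers without allocating
// temporary strings. Empty or non-numeric fields are stored as 0
// (an assumption made for this sketch).
std::vector<int> parse_int_fields(std::string_view line) {
    std::vector<int> fields;
    std::size_t start = 0;
    while (start <= line.size()) {
        // Find the end of the current field.
        std::size_t end = line.find('|', start);
        if (end == std::string_view::npos) end = line.size();

        // from_chars leaves `value` untouched when the field is not a number.
        int value = 0;
        std::from_chars(line.data() + start, line.data() + end, value);
        fields.push_back(value);

        start = end + 1;
    }
    return fields;
}
```

For example, parse_int_fields("1|0||22") yields the four values 1, 0, 0 and 22.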
I would use Qt entirely in this case.
struct MyStruct {
int Col1;
int Col2;
int Col3;
int Col4;
// blabla ...
};
QByteArray Data;
QFile f("D:\\cont.txt");
if (f.open(QIODevice::ReadOnly)) {
Data = f.readAll();
f.close();
}
MyStruct* DataPointer = reinterpret_cast<MyStruct*>(Data.data());
// Accessing data
DataPointer[0] = ...
DataPointer[1] = ...
Now you have your data and you can access it as an array.
In case your data is not binary and you have to parse it first, you will need a conversion routine. For example, if you read a CSV file with 4 columns:
QVector<MyStruct> MyArray;
QString StringData(Data);
QStringList Lines = StringData.split("\n"); // or whatever new line character is
for (int i = 0; i < Lines.count(); i++) {
QString Line = Lines.at(i);
QStringList Parts = Line.split("\t"); // or whatever separator character is
if (Parts.count() >= 4) {
MyStruct t;
t.Col1 = Parts.at(0).toInt();
t.Col2 = Parts.at(1).toInt();
t.Col3 = Parts.at(2).toInt();
t.Col4 = Parts.at(3).toInt();
MyArray.append(t);
} else {
// Malformed input, do something
}
}
Now your data is parsed and in MyArray vector.
As user2617519 says, this can be made faster by multithreading. I see that you are reading each line and parsing it. Put these lines in a queue. Then let different threads pop them off the queue and parse the data into structures.
An easier way to do this (without the complication of multithreading) is to split the input data file into multiple files and run an equal number of processes to parse them. The data can then be merged later.
QFile::readAll() may cause a memory problem and std::getline() is slow (as is ::fgets()).
I faced a similar problem where I needed to parse very large delimited text files in a QTableView. Using a custom model, I parsed the file to find the offsets to the start of each line. Then, when data is needed for display in the table, I read the line and parse it on demand. This results in a lot of parsing, but it is fast enough that there is no noticeable lag in scrolling or update speed.
It also has the added benefit of low memory usage as I do not read the file contents into memory. With this strategy nearly any size file is possible.
Parsing code:
m_fp = ::fopen(path.c_str(), "rb"); // open in binary mode for faster parsing
if (m_fp != NULL)
{
// read the file to get the row pointers
char buf[BUF_SIZE+1];
long pos = 0;
m_data.push_back(RowData(pos));
int nr = 0;
while ((nr = ::fread(buf, 1, BUF_SIZE, m_fp)))
{
buf[nr] = 0; // null-terminate the last line of data
// find new lines in the buffer
char *c = buf;
while ((c = ::strchr(c, '\n')) != NULL)
{
m_data.push_back(RowData(pos + c-buf+1));
c++;
}
pos += nr;
}
// squeeze any extra memory not needed in the collection
m_data.squeeze();
}
RowData and m_data are specific to my implementation, but they are simply used to cache information about a row in the file (such as the file position and number of columns).
The other performance strategy I employed was to use QByteArray to parse each line, instead of QString. Unless you need Unicode data, this will save time and memory:
// optimized line reading procedure
QByteArray str;
char buf[BUF_SIZE+1];
::fseek(m_fp, rd.offset, SEEK_SET);
int nr = 0;
while ((nr = ::fread(buf, 1, BUF_SIZE, m_fp)))
{
buf[nr] = 0; // null-terminate the string
// find new lines in the buffer
char *c = ::strchr(buf, '\n');
if (c != NULL)
{
*c = 0;
str += buf;
break;
}
str += buf;
}
return str.split(',');
If you need to split each line on any of a set of delimiter characters, rather than on a single character, use ::strtok().
Here is my current problem: I am trying to create a file of x MB in C++. The user will enter the file name and then a number between 5 and 10 for the size of the file they want created. Later on in this project I'm gonna do other things with it, but I'm stuck on the first step of creating the darn thing.
My problem code (so far):
char empty[1024];
for(int i = 0; i < 1024; i++)
{
empty[i] = 0;
}
fileSystem = fopen(argv[1], "w+");
for(int i = 0; i < 1024*fileSize; i++){
int temp = fputs(empty, fileSystem);
if(temp > -1){
// Success!
}
else{
cout<<"error"<<endl;
}
}
Now if I'm doing my math correctly, 1 char is 1 byte. There are 1024 bytes in 1 KB and 1024 KB in 1 MB. So if I wanted a 2 MB file, I'd have to write 1024*1024*2 bytes to this file. Yes?
I don't encounter any errors, but I end up with a file of 0 bytes... I'm not sure what I'm doing wrong here, so any help would be greatly appreciated!
Thanks!
Potentially sparse file
This creates output.img of size 300 MB:
#include <fstream>
int main()
{
std::ofstream ofs("output.img", std::ios::binary | std::ios::out);
ofs.seekp((300<<20) - 1);
ofs.write("", 1);
}
Note that technically, this will be a good way to trigger your filesystem's support for sparse files.
Dense file - filled with 0's
Functionally identical to the above, but filling the file with 0's:
#include <iostream>
#include <fstream>
#include <vector>
int main()
{
std::vector<char> empty(1024, 0);
std::ofstream ofs("output.img", std::ios::binary | std::ios::out);
for(int i = 0; i < 1024*300; i++)
{
if (!ofs.write(&empty[0], empty.size()))
{
std::cerr << "problem writing to file" << std::endl;
return 255;
}
}
}
Your code doesn't work because you are using fputs which writes a null-terminated string into the output buffer. But you are trying to write all nulls, so it stops right when it looks at the first byte of your string and ends up writing nothing.
Now, to create a file of a specific size, all you need to do is call the truncate function (or _chsize on Windows) exactly once and set the size you want the file to be.
Good luck!
To make a 2 MB file, you can seek to 2*1024*1024 - 1 and write a single zero byte. fputs()ing an empty string will do no good no matter how many times you do it. And the string is empty because C strings are 0-terminated, so a buffer that starts with a 0 byte is an empty string.
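The fputs problem described in these answers goes away if the buffer is written with an explicit length. Below is a minimal sketch that uses fwrite in place of fputs; the function name write_zero_file is hypothetical, and the 1 KB block size simply mirrors the question's empty[1024] buffer.

```cpp
#include <cstdio>
#include <vector>

// Writes `megabytes` MB of zero bytes to `path`, 1 KB at a time.
// fwrite takes an explicit byte count, so zero bytes are written like
// any other data instead of being treated as a string terminator.
// (write_zero_file is a hypothetical name used for illustration.)
bool write_zero_file(const char* path, int megabytes) {
    std::FILE* fp = std::fopen(path, "wb");
    if (!fp) return false;

    const std::vector<char> block(1024, 0);  // 1 KB of zeros
    bool ok = true;
    for (long i = 0; ok && i < 1024L * megabytes; ++i) {
        ok = std::fwrite(block.data(), 1, block.size(), fp) == block.size();
    }

    // Always close the file; report failure if either step failed.
    return (std::fclose(fp) == 0) && ok;
}
```

Calling write_zero_file(argv[1], 2) would then produce a file of 2*1024*1024 bytes, which matches the arithmetic in the question.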
What is an efficient, proper way of reading in a data file with mixed contents? For example, I have a data file that contains a mixture of data loaded from other files: 32-bit integers, characters and strings. Currently, I am using an fstream object, but it stops once it hits an int32 or the end of a string. If I add random data onto the end of the string in the data file, it seems to carry on through the rest of the file. This leads me to believe that the null terminator added onto strings is messing it up. Here's an example of loading in the file:
void main()
{
fstream fin("C://mark.dat", ios::in|ios::binary|ios::ate);
char *mymemory = 0;
int size;
size = 0;
if (fin.is_open())
{
size = static_cast<int>(fin.tellg());
mymemory = new char[static_cast<int>(size+1)];
memset(mymemory, 0, static_cast<int>(size + 1));
fin.seekg(0, ios::beg);
fin.read(mymemory, size);
fin.close();
printf(mymemory);
std::string hithere;
hithere = cin.get();
}
}
Why might this code stop after reading in an integer or a string? How might one get around this? Is this the wrong approach when dealing with these types of files? Should I be using fstream at all?
Have you ever considered that the file reading is working perfectly and it is printf(mymemory) that is stopping at the first null?
Have a look with the debugger and see if I am right.
Also, if you want to print someone else's buffer, use puts(mymemory) or printf("%s", mymemory). Don't use someone else's input as the format string; it could crash your program.
Try
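for (int i = 0; i < size ; ++i)
{
// 0 - pad with 0s
// 2 - minimum width of two characters
// X - a hex value with capital A-F (0A, 1B, etc)
// cast through unsigned char so bytes >= 0x80 are not sign-extended
printf("%02X ", (unsigned int)(unsigned char)mymemory[i]);
if ((i + 1) % 32 == 0)
printf("\n"); // new line every 32 bytes
}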
as a way to dump your data file back out as hex.
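To make the point about the first null concrete, here is a small sketch that compares the number of bytes held in a buffer with the number a C-string function would see. The sample bytes are made up for illustration (some text, a little-endian 32-bit integer with value 1, then more text), and both function names are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Number of bytes that printf("%s", ...) or puts() would print:
// everything up to, but not including, the first zero byte.
std::size_t bytes_seen_as_c_string(const std::string& buffer) {
    return std::strlen(buffer.c_str());
}

// Made-up stand-in for the file contents: "abc", the 32-bit integer 1
// stored little-endian (01 00 00 00), then "def".
std::string sample_buffer() {
    const char raw[] = {'a', 'b', 'c', 1, 0, 0, 0, 'd', 'e', 'f'};
    return std::string(raw, sizeof raw);
}
```

Here sample_buffer().size() is 10, while bytes_seen_as_c_string(sample_buffer()) is 4: all ten bytes were read, but printing stops at the first zero byte inside the integer.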