Reading from a large text file into a structure array in Qt? - c++

I have to read a text file into an array of structures. I have already written a program, but it takes too long because the file contains about 13 lakh (1.3 million) structures.
Please suggest the fastest way to do this in C++.
Here is my code:
std::ifstream input_counter("D:\\cont.txt");
/**********************************************************/
int counter = 0;
while( getline(input_counter,line) )
{
ReadCont( line,&contract[counter]); // function to read data to structure
counter++;
line.clear();
}
input_counter.close();

Keep your parsing as simple as possible: where you know a field's format, exploit that knowledge. For instance,
ReadCont("|PE|1|0|0|0|0|1|1||2|0||2|0||3|0|....", ...)
should apply a fast char-to-integer conversion, something like
void ReadCont(const char *line, Contract &c) {
if (line[1] == 'P' && line[2] == 'E' && line[3] == '|') {
line += 4;
for (int field = 0; field < K_FIELDS_PE; ++field) {
c.int_field[field] = *line++ - '0'; // assumes a single-digit field
assert(*line == '|');
++line;
}
}
}
Well, mind the details (empty and multi-digit fields, for example), but you get the idea...
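To make the "details" concrete, here is a minimal sketch of a field parser that also copes with multi-digit and empty fields. It assumes C++17 (for std::from_chars) and a record made only of integer fields separated by '|'; the name parse_fields and the choice to store empty fields as 0 are illustrative assumptions, not part of the original code.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <cstring>
#include <system_error>
#include <vector>

// Hypothetical helper: parses "|12|0|345||7|" style records into ints.
// Empty fields (two adjacent '|') are stored as 0 (an assumption).
// Returns false as soon as a field is not a valid integer.
bool parse_fields(const char* line, std::vector<int>& out) {
    assert(line != nullptr);
    const char* end = line + std::strlen(line);
    const char* p = line;
    if (p != end && *p == '|') ++p;            // skip the leading separator
    while (p < end) {
        // locate the end of the current field
        const char* sep = static_cast<const char*>(
            std::memchr(p, '|', static_cast<std::size_t>(end - p)));
        if (sep == nullptr) sep = end;         // last field without trailing '|'
        int value = 0;
        if (sep != p) {
            // from_chars does no locale handling and no allocation
            const auto res = std::from_chars(p, sep, value);
            if (res.ec != std::errc() || res.ptr != sep) return false;
        }
        out.push_back(value);
        p = (sep == end) ? end : sep + 1;      // step over the separator
    }
    return true;
}
```

Calling parse_fields("|1|0|25||3|", v) fills v with 1, 0, 25, 0, 3. Text fields such as the "PE" tag would still need to be checked separately, as in the answer above.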

I would use Qt entirely in this case.
struct MyStruct {
int Col1;
int Col2;
int Col3;
int Col4;
// blabla ...
};
QByteArray Data;
QFile f("D:\\cont.txt");
if (f.open(QIODevice::ReadOnly)) {
Data = f.readAll();
f.close();
}
MyStruct* DataPointer = reinterpret_cast<MyStruct*>(Data.data());
// Accessing data
DataPointer[0] = ...
DataPointer[1] = ...
Now you have your data and you can access it as array.
In case your data is not binary and you have to parse it first you will need a conversion routine. For example if you read csv file with 4 columns:
QVector<MyStruct> MyArray;
QString StringData(Data);
QStringList Lines = StringData.split("\n"); // or whatever new line character is
for (int i = 0; i < Lines.count(); i++) {
QString Line = Lines.at(i);
QStringList Parts = Line.split("\t"); // or whatever separator character is
if (Parts.count() >= 4) {
MyStruct t;
t.Col1 = Parts.at(0).toInt();
t.Col2 = Parts.at(1).toInt();
t.Col3 = Parts.at(2).toInt();
t.Col4 = Parts.at(3).toInt();
MyArray.append(t);
} else {
// Malformed input, do something
}
}
Now your data is parsed and in MyArray vector.

As user2617519 says, this can be made faster by multithreading. I see that you are reading each line and parsing it. Put these lines in a queue. Then let different threads pop them off the queue and parse the data into structures.
An easier way to do this (without the complication of multithreading) is to split the input data file into multiple files and run an equal number of processes to parse them. The data can then be merged later.
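As a rough sketch of the threaded variant: instead of a shared queue, the example below hands each thread a fixed, contiguous slice of the lines, which avoids locking altogether. This is a simplification of the queue idea, not the same design. Record and parse_line are hypothetical stand-ins for the real structure and for ReadCont.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <thread>
#include <vector>

struct Record { int first_field = 0; };        // hypothetical stand-in

// Hypothetical parser: keeps only the leading integer of the line.
Record parse_line(const std::string& line) {
    Record r;
    r.first_field = std::atoi(line.c_str());
    return r;
}

// Parses all lines using up to `workers` threads. Each thread writes only
// to its own slice of `out`, so no mutex is needed.
std::vector<Record> parse_parallel(const std::vector<std::string>& lines,
                                   unsigned workers) {
    assert(workers > 0);
    std::vector<Record> out(lines.size());
    std::vector<std::thread> pool;
    const std::size_t chunk = (lines.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(lines.size(), w * chunk);
        const std::size_t end = std::min(lines.size(), begin + chunk);
        if (begin == end) break;               // nothing left to hand out
        pool.emplace_back([&lines, &out, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                out[i] = parse_line(lines[i]);
        });
    }
    for (auto& t : pool) t.join();             // wait for every slice
    return out;
}
```

Whether this helps depends on where the time goes: if reading the file dominates, extra parsing threads will not gain much, so it is worth measuring first.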

QFile::readAll() may cause a memory problem and std::getline() is slow (as is ::fgets()).
I faced a similar problem where I needed to parse very large delimited text files in a QTableView. Using a custom model, I parsed the file to find the offset of the start of each line. Then, when data is needed for display in the table, I read the line and parse it on demand. This results in a lot of parsing, but that is actually fast enough not to cause any noticeable lag in scrolling or update speed.
It also has the added benefit of low memory usage as I do not read the file contents into memory. With this strategy nearly any size file is possible.
Parsing code:
m_fp = ::fopen(path.c_str(), "rb"); // open in binary mode for faster parsing
if (m_fp != NULL)
{
// read the file to get the row pointers
char buf[BUF_SIZE+1];
long pos = 0;
m_data.push_back(RowData(pos));
int nr = 0;
while ((nr = ::fread(buf, 1, BUF_SIZE, m_fp)))
{
buf[nr] = 0; // null-terminate the last line of data
// find new lines in the buffer
char *c = buf;
while ((c = ::strchr(c, '\n')) != NULL)
{
m_data.push_back(RowData(pos + c-buf+1));
c++;
}
pos += nr;
}
// squeeze any extra memory not needed in the collection
m_data.squeeze();
}
RowData and m_data are specific to my implementation, but they are simply used to cache information about a row in the file (such as the file position and number of columns).
The other performance strategy I employed was to use QByteArray to parse each line, instead of QString. Unless you need unicode data, this will save time and memory:
// optimized line reading procedure
QByteArray str;
char buf[BUF_SIZE+1];
::fseek(m_fp, rd.offset, SEEK_SET);
int nr = 0;
while ((nr = ::fread(buf, 1, BUF_SIZE, m_fp)))
{
buf[nr] = 0; // null-terminate the string
// find new lines in the buffer
char *c = ::strchr(buf, '\n');
if (c != NULL)
{
*c = 0;
str += buf;
break;
}
str += buf;
}
return str.split(',');
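If you need to split each line on any of several delimiter characters rather than a single one, use ::strtok(). Note that strtok() treats its second argument as a set of single-character delimiters, not as a multi-character separator string; for the latter, search with ::strstr().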
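Since strtok() interprets its delimiter argument as a set of individual characters, splitting on a multi-character separator needs a substring search. A minimal sketch using std::string::find follows; split_on is a hypothetical helper name, and it returns std::string parts rather than QByteArray to stay free of Qt.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `line` on every occurrence of the whole
// separator string `sep`. Empty parts are kept.
std::vector<std::string> split_on(const std::string& line,
                                  const std::string& sep) {
    assert(!sep.empty());                      // an empty separator never advances
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        const std::size_t hit = line.find(sep, start);
        if (hit == std::string::npos) {
            parts.push_back(line.substr(start));   // trailing part
            break;
        }
        parts.push_back(line.substr(start, hit - start));
        start = hit + sep.size();              // continue after the separator
    }
    return parts;
}
```

For example, split_on("a::b::c", "::") yields the three parts "a", "b" and "c".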

Trying to copy lines from text file to array of strings (char**)

This is my code for allocating memory for the array of strings:
FileReader::FileReader()
{
readBuffer = (char**)malloc(100 * sizeof(char*));
for (int i = 0; i < 100; i++)
{
readBuffer[i] = (char*)malloc(200 * sizeof(char));
}
}
I'm allocating 100 strings for 100 lines, then allocating 200 chars for each string.
This is my code for reading the lines:
char** FileReader::ReadFile(const char* filename)
{
int i = 0;
File.open(filename);
if (File.is_open())
{
while (getline(File, tmpString))
{
readBuffer[i] = (char*)tmpString.c_str();
i++;
}
return readBuffer;
}
}
and for printing:
for (int i = 0; i <= 5; i++)
{
cout << fileCpy[i];
}
This is the output in the terminal (screenshot not included).
As you can see, it just repeats the last line of the file, which reads:
This is test
line 2
line 3
line 4
line 5
Any idea what's going on? Why aren't the lines copied correctly?
Replace
readBuffer[i] = (char*)tmpString.c_str();
with
strcpy(readBuffer[i], tmpString.c_str());
Your version just saves a pointer to tmpString's internal buffer in your array. When tmpString changes, that pointer points at the new contents of tmpString (and that's just the best possible outcome). strcpy, however, actually copies the characters of the string, which is what you want, provided no line is longer than the 199 characters each buffer can hold.
Of course, I'm sure it doesn't need saying, but you can avoid all the headache and complication like this
vector<string> readBuffer;
This way there are no more pointer problems, no more manual allocation or freeing of memory, and no limit of 100 lines or 200 characters per line. I'm sure you have a reason for doing things the hard way, but I wonder if it's a good reason.
First of all, write C++ rather than C.
Do not allocate memory with malloc like that; in modern C++, dynamic memory is normally managed through smart pointers from the <memory> header.
Anyway, you do not need manual dynamic allocation here. Encapsulate your data within the class and use std::vector<std::string> from the standard library. A vector is a dynamic array that handles all the memory management behind the scenes for you.
To read all lines of a file :
std::string item_name;
std::vector<std::string> your_buffer;
std::ifstream name_fileout;
name_fileout.open("test.txt");
while (std::getline(name_fileout, item_name))
{
your_buffer.push_back(item_name);
std::cout << item_name;
}
name_fileout.close();

File system causing c++ program to run slowly

I am reading 1 million files from my local system (Ubuntu 20.04), about 10KB each. The first 40000 files are processed very quickly, and then the next 960000 go at a much slower pace. I assume this has something to do with the OS, hardware, or file system. I am basically performing a file read as follows:
string read_file(std::ifstream& ifs) {
vector<char> buf(64 * 1024);
string s = "";
ifs.read(&buf[0], buf.size());
while (int n = ifs.gcount()) { s.append(&buf[0], n); ifs.read(&buf[0], buf.size()); }
return s;
}
Can someone explain what is happening to cause this slow down and how it can be fixed?
EDIT: I am processing the strings in a function, I am not sure if they persist after execution:
void process_file(string file_name) {
ifstream ifs(file_name);
string s = read_file(ifs);
process_data(s);
}
void p1() {
vector<string> files = get_files(); // 1 million size vector of filenames
for (int i = 0 ; i < files.size(); i++) {
process_file(files[i]);
}
}
process_data writes to another file but does not keep the string
EDIT: I have commented out the process_data(s) function and passed the file_name by reference and am getting the same behavior. However, if I run the program up to a number N, ctrl C, and then re-run, the first N files load very fast.
void process_file(string& file_name) {
ifstream ifs(file_name);
string s = read_file(ifs);
// process_data(s);
}
void p1() {
vector<string> files = get_files(); // 1 million size vector of filenames
for (int i = 0 ; i < files.size(); i++) {
process_file(files[i]);
}
}
string.append is not the fastest. It can cause repeated memory re-allocation. You could do:
seekg to offset 0 from the end, then tellg, to get the file length.
create a string and resize it to the needed length.
in a loop, read directly into your string, by 64k blocks (or even read the whole file at once).
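The three steps above can be sketched as follows. The function name read_file_whole is an assumption, and the sketch reads the whole file in one call rather than in 64k blocks. It only addresses the re-allocation cost; it does not change how the operating system caches files.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical replacement for read_file: sizes the string once, then
// reads the file into it with a single read call.
// Returns an empty string if the file cannot be opened or is empty.
std::string read_file_whole(const std::string& path) {
    std::ifstream ifs(path, std::ios::binary);
    if (!ifs) return std::string();
    ifs.seekg(0, std::ios::end);               // jump to the end...
    const std::streamoff size = ifs.tellg();   // ...to learn the length
    if (size <= 0) return std::string();
    ifs.seekg(0, std::ios::beg);
    std::string s(static_cast<std::size_t>(size), '\0');
    ifs.read(&s[0], size);                     // read directly into the string
    s.resize(static_cast<std::size_t>(ifs.gcount()));  // keep what was read
    assert(static_cast<std::streamoff>(s.size()) <= size);
    return s;
}
```

Opening in binary mode keeps the size reported by tellg consistent with the number of bytes that read delivers.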

Retrieving data from binary file, non-sense characters

I am trying to retrieve some data from a binary file to put them in a linked list, here's my code to write to the file:
void Pila::memorizzafile()
{
int contatore = 0;
puntarec temp = puntatesta;
ofstream miofile;
miofile.open("data.dat" , ios::binary | ios::out);
if(!miofile) cerr << "errore";
else
{
while(temp)
{
temp->elem.writes(miofile);
contatore++;
temp = temp->next;
}
//I go back at the beginning of the file to write how many elements I have
miofile.seekp(0, ios::beg);
miofile.write((const char *)&contatore , sizeof(int));
miofile.close();
}
}
And the function writes:
void Fiche::writes(ofstream &miofile)
{
//Valore.
miofile.write((const char *)&Valore,sizeof(int));
//Materiale, I write the dimension of the string.
int buff = strlen(Materiale);
miofile.write((const char *)&buff,sizeof(int));
//Writing the string
miofile.write(Materiale,buff*sizeof(char));
//Dimension of Forma
buff = strlen(Forma);
miofile.write((const char*)&buff,sizeof(int));
//The string itself
miofile.write(Forma,buff*sizeof(char));
//Dimension of Colore.
buff = strlen(Colore);
miofile.write((const char*)&buff,sizeof(int));
//The string
miofile.write(Colore,buff*sizeof(char));
}
Now for the reading part, I am trying to make a constructor which should be able to read directly from the file, here it is:
Pila::Pila(char * nomefile)
{
puntatesta = 0;
int contatore = 0;
ifstream miofile;
miofile.open(nomefile , ios::binary | ios::in);
if(!miofile) cerr << "errore";
else
{
//I read how many records are stored in the file
miofile.read((char*)&contatore,sizeof(int));
Fiche temp;
for(int i = 0; i < contatore; i++)
{
temp.reads(miofile);
push(temp);
}
miofile.close();
}
}
And the reading function:
void Fiche::reads(ifstream &miofile)
{
//I read the Valore
miofile.read((char*)&Valore,sizeof(int));
//I create a temporary char *
char * buffer;
int dim = 0;
//I read how long will be the string
miofile.read((char*)&dim,sizeof(int));
buffer = new char[dim];
miofile.read(buffer,dim);
//I use the set function I created to copy the buffer to the actual member char*
setMateriale(buffer);
delete [] buffer;
//Now it pretty much repeats itself for the other stuff
miofile.read((char*)&dim,sizeof(int));
buffer = new char[dim];
miofile.read(buffer,dim);
setForma(buffer);
delete [] buffer;
//And again.
miofile.read((char*)&dim,sizeof(int));
buffer = new char[dim];
miofile.read(buffer,dim);
setColore(buffer);
delete [] buffer;
}
The code doesn't give me any error, but on the screen I read random characters and not even remotely close to what I wrote on my file. Anyone could help me out, please?
EDIT:
As requested here's an example of input&output:
Fiche A("boh" , 4 , "no" , "gaia");
Fiche B("Marasco" , 3 , "boh" , "nonnt");
Fiche C("Valori" , 6 , "asd" , "hey");
Fiche D("TipO" , 7 , "lol" , "nonloso");
Pila pila;
pila.push(A);
pila.push(B);
pila.push(C);
pila.push(D);
pila.stampa();
pila.memorizzafile();
And:
Pila pila("data.dat");
pila.stampa();
This is probably your error:
//I go back at the beginning of the file to write how many elements I have
miofile.seekp(0, ios::beg);
miofile.write((const char *)&contatore , sizeof(int));
miofile.close();
By seeking to the beginning and then writing, you are overwriting part of the first object.
I think your best bet is to run through the list and count the elements first. Write the count, then proceed to write all the elements. It will probably be faster anyway (but you can time it to make sure).
I think you are using way too many C structures to hold things.
Also, I would advise against a binary format unless you are saving huge amounts of information. A text format (for your data) is probably going to be just as good and will be human-readable, so you can look at the file and see what is wrong.
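A minimal sketch of the count-first layout follows, using std::string and std::vector instead of raw char pointers and a hand-written list. Item, save_items and load_items are hypothetical names standing in for Fiche and Pila, and only one string member is shown. Reading each string into a std::string of known length also avoids handing an unterminated char buffer to a setter.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct Item { std::int32_t value; std::string material; };  // hypothetical

// Writes a 32-bit length followed by the characters (no terminator).
void write_string(std::ostream& out, const std::string& s) {
    assert(s.size() <= 0xFFFFFFFFu);
    const std::uint32_t len = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(s.data(), len);
}

// Reads a string written by write_string; the length is trusted as-is.
std::string read_string(std::istream& in) {
    std::uint32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof len);
    std::string s(len, '\0');
    if (len > 0) in.read(&s[0], len);
    return s;
}

bool save_items(const std::string& path, const std::vector<Item>& items) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    // The count goes first, so nothing is overwritten later.
    const std::uint32_t count = static_cast<std::uint32_t>(items.size());
    out.write(reinterpret_cast<const char*>(&count), sizeof count);
    for (const Item& it : items) {
        out.write(reinterpret_cast<const char*>(&it.value), sizeof it.value);
        write_string(out, it.material);
    }
    return static_cast<bool>(out);
}

std::vector<Item> load_items(const std::string& path) {
    std::vector<Item> items;
    std::ifstream in(path, std::ios::binary);
    std::uint32_t count = 0;
    if (!in.read(reinterpret_cast<char*>(&count), sizeof count)) return items;
    for (std::uint32_t i = 0; i < count && in; ++i) {
        Item it{};
        in.read(reinterpret_cast<char*>(&it.value), sizeof it.value);
        it.material = read_string(in);
        if (in) items.push_back(it);           // keep only complete records
    }
    return items;
}
```

The file written this way is only portable between machines with the same byte order, which is one more argument for the text format suggested above.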

Access violation error with the new command

I am working on an assignment for my GUI programming class, in which we are to make a windows program that displays the contents of a file in hexadecimal. I have a class that holds the text and creates the hex in string format.
I'm attempting to create an array of character arrays to store each line for output. However, when I use new to create the array of character pointers, I get an access violation error.
I've done some searching, but haven't had any luck finding the answer.
The class has these member variables:
char* fileText;
char** Lines;
int numChars;
int numLines;
bool fileCopied;
My constructor:
Text::Text(char* fileName){ //load and copy file.
fileText = NULL;
Lines = NULL;
fileCopied = ExtractText(fileName);
if ( fileCopied ) {
CreateHex();
}//endif
}//end constructor
ExtractText loads the file given to the constructor, and copies it into a large string.
bool Text::ExtractText(char fileName[]){
char buffer = '/0'; //buffer for text transfer
numChars = 0; //initialize numLines
ifstream fin( fileName, ios::in|ios::out ); //load file stream
if ( !fin ) { //return false if the file fails to load
return false;
}//endif
while ( !fin.eof() ) { //count the lines in the file
fin.get(buffer);
numChars++;
}//endwh
fileText = new char[numLines]; //create an array of strings, one for each line in the file.
fin.clear(); //clear the eof flag
fin.seekg(0, ios::beg); //move the get pointer back to the start of the file.
for ( int i = 0; i < numChars; i++ ) { //copy the text from the file into the string array.
fin.get(fileText[i]);
}//endfr
fileText[numChars-1] = '\0';
fin.close();
numLines = (numChars % 16 == 0) ? (numChars/16) : (numChars/16 + 1);
return true;
}//end fun ExtractText
Then comes the problem code. In the CreateHex function, the first line is where try to create the array of character pointers.
void Text::CreateHex(){
Lines = new char*[numLines];
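As soon as the program runs that line of code, that's when I get the access violation. I'm not really sure what the problem is, because I've used that exact same method before in a previous program. The only difference was the name of the pointer. I'm using Borland C++ 5.02 if that makes any difference. It's not my first choice in compilers, but it's what our teacher wants us to use.
When you execute the line
fileText = new char[numLines];
the variable numLines has not yet been set. A member variable of built-in type that the constructor does not initialize holds an indeterminate value, so fileText is allocated with an unpredictable size (possibly zero). The loop that follows then writes numChars characters past the end of that array and corrupts the heap, which is why the next new, in CreateHex, crashes. Allocate with numChars instead: fileText = new char[numChars];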
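For comparison, here is a minimal sketch of the loading step that sidesteps the manual size bookkeeping by letting std::vector size itself from the file. It assumes a standard C++11 compiler, which Borland C++ 5.02 is not; HexText and load_text are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

struct HexText {                               // hypothetical stand-in for Text
    std::vector<char> bytes;                   // raw file contents
    std::size_t numLines = 0;                  // rows of 16 bytes in the hex dump
};

// Loads the whole file into out.bytes and derives the number of hex lines.
// Returns false if the file cannot be opened.
bool load_text(const std::string& fileName, HexText& out) {
    std::ifstream fin(fileName, std::ios::binary);
    if (!fin) return false;
    // The vector grows as needed, so no size has to be known beforehand.
    out.bytes.assign(std::istreambuf_iterator<char>(fin),
                     std::istreambuf_iterator<char>());
    out.numLines = (out.bytes.size() + 15) / 16;   // round up
    assert(out.numLines * 16 >= out.bytes.size());
    return true;
}
```

Because the count is taken from the container after reading, there is no window in which an unset member can be used as an allocation size.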

How to read a chunk of memory as char in C++

Hello, I have a chunk of memory (allocated with malloc()) that contains bits (literal bits). I'd like to read it as an array of char or, better, I'd like to print out the ASCII value of 8 consecutive bits of the memory.
I have allocated the memory as char *, but I have not found a better way to take characters out than evaluating each bit, adding its value to a char, and shifting the char left in a loop. I was looking for a faster solution.
Thank you
What I've written so far is this:
for allocation:
char * bits = (char*) malloc(1);
for writing to mem:
ifstream cleartext;
cleartext.open(sometext);
while(cleartext.good())
{
c = cleartext.get();
for(int j = 0; j < 8; j++)
{ //set(index) and reset(index) set or reset the bit at bits[i]
(c & 0x80) ? (set(index)):(reset(index));//(*ptr++ = '1'):(*ptr++='0');
c = c << 1;
}..
}..
and until now I've not been able to get character back, I only get the bits printed out using:
printf("%s\n", bits);
An example of what I'm trying to do is:
input.txt contains the string "AAAB"
My program would have to write "AAAB" as "01000001010000010100000101000010" to memory
(it's the ASCII values in bit of AAAB that are 65656566 in bits)
Then I would like that it have a function to rewrite the content of the memory to a file.
So if memory contains again "01000001010000010100000101000010" it would write to the output file "AAAB".
int numBytes = 512;
char *pChar = (char *)malloc(numBytes);
for( int i = 0; i < numBytes; i++ ){
pChar[i] = '8';
}
Since this is C++, you can also use "new":
int numBytes = 512;
char *pChar = new char[numBytes];
for( int i = 0; i < numBytes; i++ ){
pChar[i] = '8';
}
If you want to visit every bit in the memory chunk, it looks like you need std::bitset.
char* pChunk = static_cast<char*>( malloc( n ) );
// read in pChunk data
// iterate over all the bits.
for( int i = 0; i != n; ++i ){
// copy the byte into a bitset; a bitset cannot safely alias raw memory
std::bitset<8> bits( static_cast<unsigned char>( pChunk[i] ) );
for( int iBit = 7; iBit >= 0; --iBit ) { // most significant bit first
std::cout << bits[iBit];
}
}
I'd like to print out the ASCII value of 8 consecutive bits of the memory.
The possible value for any bit is either 0 or 1. You probably want at least a byte.
char * bits = (char*) malloc(1);
Allocates 1 byte on the heap. A much more efficient and hassle-free thing would have been to create an object on the stack i.e.:
char bits; // a single character, has CHAR_BIT bits
ifstream cleartext;
cleartext.open(sometext);
The above doesn't write anything to mem. It tries to open a file in input mode.
It has ascii characters and common eof or \n, or things like this, the input would only be a textfile, so I think it should only contain ASCII characters, correct me if I'm wrong.
If your file only has ASCII data you don't have to worry. All you need to do is read in the file contents and write it out. The compiler manages how the data will be stored (i.e. which encoding to use for your characters and how to represent them in binary, the endianness of the system etc). The easiest way to read/write files will be:
// include these on as-needed basis
#include <algorithm>
#include <iostream>
#include <iterator>
#include <fstream>
using namespace std;
// ...
/* read from standard input and write to standard output */
/* (istreambuf_iterator is used because istream_iterator<char> skips whitespace) */
copy(istreambuf_iterator<char>(cin), istreambuf_iterator<char>(),
ostreambuf_iterator<char>(cout));
/*-------------------------------------------------------------*/
/* read from standard input and write to text file */
/* (the streams are named objects because the iterators cannot bind to temporaries) */
ofstream fout("output.txt");
copy(istreambuf_iterator<char>(cin), istreambuf_iterator<char>(),
ostreambuf_iterator<char>(fout));
/*-------------------------------------------------------------*/
/* read from text file and write to text file */
ifstream fin2("input.txt");
ofstream fout2("output.txt");
copy(istreambuf_iterator<char>(fin2), istreambuf_iterator<char>(),
ostreambuf_iterator<char>(fout2));
/*-------------------------------------------------------------*/
The last remaining question is: Do you want to do something with the binary representation? If not, forget about it. Else, update your question one more time.
E.g., processing the character array (shown here with a placeholder hash functor; a block cipher would plug in the same way):
/* needs <vector> in addition to the headers above */
/* a hash calculator (placeholder: a real SHA-1 works on the whole message, not byte by byte) */
struct hash_sha1 {
unsigned char operator()(unsigned char x) {
unsigned char rc = x; // process x here
return rc;
}
};
/* store house of characters; a vector is used because basic_string<unsigned char> is not portable */
vector<unsigned char> line;
/* read from text file and write to a vector of unsigned chars (whitespace is skipped) */
ifstream fin("input.txt");
copy(istream_iterator<unsigned char>(fin),
istream_iterator<unsigned char>(),
back_inserter(line) );
/* Calculate a hash of the input */
vector<unsigned char> hashmsg;
transform(line.begin(), line.end(), back_inserter(hashmsg), hash_sha1());
Something like this?
char *buffer = (char*)malloc(42);
// ... put something into the buffer ...
printf("%c\n", buffer[0]);
But, since you're using C++, I wonder why you bother with malloc and such...
char* ptr = pAddressOfMemoryToRead;
while(ptr < pAddressOfMemoryToRead + blockLength)
{
char tmp = *ptr;
// tmp now has the char from this spot in memory
ptr++;
}
Is this what you are trying to achieve:
char* p = (char*)malloc(10 * sizeof(char));
char* p1 = p;
memcpy(p,"abcdefghij", 10);
for(int i = 0; i < 10; ++i)
{
char c = *p1;
cout<<c<<" ";
++p1;
}
cout<<"\n";
free(p);
Can you please explain in more detail, perhaps including code? What you're saying makes no sense unless I'm completely misreading your question. Are you doing something like this?
char * chunk = (char *)malloc(256);
If so, you can access any character's worth of data by treating chunk as an array: chunk[5] gives you the element at index 5 (the sixth element), etc. Of course, these will be characters, which may be what you want, but I can't quite tell from your question... for instance, if chunk[5] is 65, when you print it like cout << chunk[5];, you'll get a letter 'A'.
However, you may be asking how to print out the actual number 65, in which case you want to do cout << int(chunk[5]);. Casting to int will make it print as an integer value instead of as a character. If you clarify your question, either I or someone else can help you further.
Are you asking how to copy the memory bytes of an arbitrary struct into a char* array? If so this should do the trick
SomeType t = GetSomeType();
char* ptr = static_cast<char*>(malloc(sizeof(SomeType))); // C++ requires the cast from void*
if ( !ptr ) {
// Handle no memory. Probably should just crash
}
memcpy(ptr,&t,sizeof(SomeType));
I'm not sure I entirely grok what you're trying to do, but a couple of suggestions:
1) use std::vector instead of malloc/free and new/delete. It's safer and doesn't have much overhead.
2) when processing, try doing chunks rather than bytes. Even though streams are buffered, it's usually more efficient grabbing a chunk at a time.
3) there's a lot of different ways to output bits, but again you don't want a stream output for each character. You might want to try something like the following:
void outputbits(char *dest, char source)
{
dest[8] = 0;
for(int i=0; i<8; ++i)
dest[i] = source & (1<<(7-i)) ? '1':'0';
}
Pass it a char[9] output buffer and a char input, and you get a printable bitstring back. Decent compilers produce OK output code for this... how much speed do you need?
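To cover the round trip the question describes ("AAAB" to a string of '0' and '1' characters and back), here is a minimal sketch. It assumes the bits are stored as text characters, most significant bit first; to_bits and from_bits are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Expands every byte of `text` into eight '0'/'1' characters, MSB first.
std::string to_bits(const std::string& text) {
    std::string bits;
    bits.reserve(text.size() * 8);
    for (unsigned char c : text)
        for (int i = 7; i >= 0; --i)
            bits.push_back(((c >> i) & 1) ? '1' : '0');
    return bits;
}

// Packs groups of eight '0'/'1' characters back into bytes.
// Any character other than '1' is treated as a zero bit.
std::string from_bits(const std::string& bits) {
    assert(bits.size() % 8 == 0);              // whole bytes only
    std::string text;
    text.reserve(bits.size() / 8);
    for (std::size_t i = 0; i < bits.size(); i += 8) {
        unsigned char c = 0;
        for (std::size_t j = 0; j < 8; ++j)
            c = static_cast<unsigned char>((c << 1) | (bits[i + j] == '1' ? 1 : 0));
        text.push_back(static_cast<char>(c));
    }
    return text;
}
```

With these, to_bits("AAAB") gives "01000001010000010100000101000010", and from_bits of that string gives "AAAB" again.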