parse huge csv file with C++

In order to simulate my network I am using a trace file (a csv file) with a size between 5 and 30 GB.
The csv file is row based, where each row contains multiple fields delimited by spaces, together forming the information for a network packet:
3 53 4 12 1 1 2 6
Since the file's size can reach several GB (millions of lines), is it better to divide it into small chunks (myfile00.csv, myfile01.csv, ...), or can I process the entire file from the hard drive without loading it into memory?
I want to read the file line by line at a specific time, which is the clock cycle of the simulation, and get all the information in the line to create an OMNeT++ message.
packet MyTrace::getpacket() {
    int id;                    // first field
    int cycle;                 // second field
    int source;                // third field
    int destination;           // fourth field
    int numberofDep;           // fifth field
    std::list<int> listofDep;  // remaining fields
    if (traceFile.is_open()) {
        // get id
        // get cycle
        // ....
    }
}
Any suggestion would be helpful.
EDIT:
string line;
ifstream myfile("BlackSmall.csv");
int currentline = 0;
if (myfile.is_open())
{
    while (getline(myfile, line)) {
        istringstream ss(line);
        string request;
        int id, cycle, source, dest, srcType, destType, packetSize, dependency;
        int listdep;
        std::list<int> dep;
        ss >> id;
        ss >> cycle;
        ss >> source;
        ss >> dest;
        ss >> request;
        ss >> srcType;
        ss >> destType;
        ss >> packetSize;
        ss >> dependency;
        while (ss >> listdep) dep.push_back(listdep);
        // Create my packet
    }
    myfile.close();
}
else cout << "Unable to open file";
With the above code, I can get all the information I need from a line.
The problem is that I need to use this code inside a class which, when I call it, returns just one line's information. Is there a way to point to a specific line when I call this class?

It seems like your application requires a single sequential pass through the input, so processing a file that is 1 GB or 100 GB is perhaps just a matter of patience and perhaps parallelism.
The approach should be to translate records line by line. You should avoid strategies that attempt to read the entire file into memory. The standard library offers the easy-to-use std::ifstream class together with std::getline, which fills a std::string with the line to be converted.
If you are feeling more ambitious and want to control the amount of data read or buffered more carefully, you would not be the first developer to roll your own buffered reader. This is a fairly empowering exercise and will help you think through some corner cases, such as reading partial lines. But in the end, it probably will not give you a significant boost toward your goal. I suspect the ifstream approach will get you up and running without the hassle and will not ultimately be the bottleneck in processing these files.
If you were really concerned about optimizing execution time then having multiple files might help you launch parallel processing tasks.
// define a class to hold your custom record
class Record {
public:
    int id = 0;
    int cycle = 0;
    // ... remaining fields of the trace line ...
};

// create a parser function to convert a line of text into the record
bool parse(std::string const &line, Record &record) {
    std::istringstream ss(line);
    // extend with the remaining fields; a failed read makes this return false
    return static_cast<bool>(ss >> record.id >> record.cycle);
}

// create a translator method to convert a record into the desired output
bool write(Record const &record, std::ofstream &os) {
    return static_cast<bool>(os << record.id << ' ' << record.cycle << '\n');
}

// actually open streams for the input and output files
std::ifstream is("BlackSmall.csv");  // input name from the question
std::ofstream os("packets.out");     // output name is illustrative
std::string line;
while (std::getline(is, line)) {
    Record record;
    if (!parse(line, record)) break;
    if (!write(record, os)) break;
}
You can re-use the Record instance by moving it outside the while loop, so long as you are careful to reset it so that information from preceding records does not taint the current record. You can also dive head first into the C++ ecosystem by writing stream input and output operators ("<<", ">>"), but I personally find that approach to be more confusing than it is worth.
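As for the follow-up question of returning just one line's information per call: a minimal sketch (class and member names are illustrative, modeled on the question's traceFile) is to keep the stream open as a class member, so that the stream position itself remembers which line comes next:
#include <fstream>
#include <list>
#include <sstream>
#include <string>

class MyTrace {
    std::ifstream traceFile;  // stays open between calls
public:
    explicit MyTrace(const std::string &path) : traceFile(path) {}

    // parses the next line of the trace; returns false when exhausted
    bool getNext(int &id, int &cycle, std::list<int> &dep) {
        std::string line;
        if (!std::getline(traceFile, line)) return false;
        std::istringstream ss(line);
        ss >> id >> cycle;  // ... extract the remaining fixed fields here ...
        for (int d; ss >> d; ) dep.push_back(d);
        return true;
    }
};
Each call consumes exactly one line, so driving it once per simulated clock cycle gives the behavior the question describes.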

Perhaps the best approach for you would be to import your CSV file into an SQLite database.
Once you import it and add some indexes, you can easily and very efficiently query the necessary rows from that database. SQLite has lots of ready-to-use C/C++ client libraries available; you can start with the default one at https://www.sqlite.org/cintro.html.
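As a rough illustration of what that buys you, here is a minimal sketch using the SQLite C API (the database, table, and column names are assumptions based on the trace format above; the CSV would first be imported, for example with the sqlite3 shell's .import command):
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3 *db = nullptr;
    if (sqlite3_open("trace.db", &db) != SQLITE_OK) return 1;

    // assumes the CSV was imported into a table "trace" indexed on "cycle"
    sqlite3_stmt *stmt = nullptr;
    const char *sql = "SELECT id, source, destination FROM trace WHERE cycle = ?;";
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) return 1;

    sqlite3_bind_int(stmt, 1, 53);  // fetch all packets for clock cycle 53
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        std::printf("id=%d src=%d dst=%d\n",
                    sqlite3_column_int(stmt, 0),
                    sqlite3_column_int(stmt, 1),
                    sqlite3_column_int(stmt, 2));
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
}
With an index on cycle, looking up the packets for a given clock cycle no longer means scanning the whole multi-GB file.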

Related

What is the most efficient way to import an .STL file in C++?

What is the most efficient strategy for parsing a .STL file?
A critical part of my code is importing a .STL file, (a common CAD file format) and this is limiting performance overall.
The .STL file format is summarized here- https://en.wikipedia.org/wiki/STL_(file_format)
Using ASCII format is required for this application.
The generic format is:
solid name
facet normal ni nj nk
outer loop
vertex v1x v1y v1z
vertex v2x v2y v2z
vertex v3x v3y v3z
endloop
endfacet
endsolid
However, I've noticed that there are no strict formatting requirements, and the import function must do a minimal amount of error checking. I've done some performance measuring (using chrono), which for a 43,000-line file gives:
stl_import() - 1.177568 s
parsing loop - 3.894250 s
Parsing loop:
cout << "Importing " << stl_path << "... ";
auto file_vec = import_stl(stl_path);
for (auto& l : file_vec) {
trim(l);
if (solid_state) {
if (facet_state) {
if (starts_with(l, "vertex")) {
//---------ADD FACE----------//
l.erase(0, 6);
trim(l);
vector<string> strs;
split(strs, l, is_any_of(" "));
point p = { stod(strs[0]), stod(strs[1]), stod(strs[2]) };
facet_points.push_back(p);
//---------------------------//
}
else {
if (starts_with(l, "endfacet")) {
facet_state = false;
}
}
}
else {
if (starts_with(l, "facet")) {
facet_state = true;
//assert(facet_points.size() == 0);
//---------------------------//
// Normals can be ignored //
//---------------------------//
}
if (starts_with(l, "endsolid")) {
solid_state = false;
}
}
}
else {
if (starts_with(l, "solid")) {
solid_state = true;
}
}
if (facet_points.size() == 3) {
triangle facet(facet_points[0], facet_points[1], facet_points[2]);
stl_solid.add_facet(facet);
facet_points.clear();
//check normal
facet.normal();
}
}
The stl_import function is:
std::vector<std::string> import_stl(const std::string& file_path)
{
    std::ifstream infile(file_path);
    SkipBOM(infile);
    std::vector<std::string> file_vec;
    std::string line;
    while (std::getline(infile, line))
    {
        file_vec.push_back(line);
    }
    return file_vec;
}
I have searched for ways to optimize file reading, and I see that using mmap may improve read speed:
Fast textfile reading in c++
This question asks: what is the best parsing strategy for an .STL file?
Without data that can be used to measure where the time is spent, it is hard to determine what actually improves performance. A decent library already doing the job may be the easiest approach. However, the current code uses a few approaches where there may be easy wins to improve performance. Here are the things I spotted:
The streams library is quite good at skipping leading whitespace. Instead of first reading the spaces and then trimming them off, you may want to use std::getline(infile >> std::ws, line): the std::ws manipulator skips leading whitespace.
Instead of using starts_with() with string literals, I'd rather read each line into a "command" and the tail of the line, and compare the commands against std::string const objects: instead of a full character comparison, it may be sufficient to compare the sizes first.
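A sketch of that idea (names are illustrative):
static std::string const vertex_cmd("vertex");

std::string command;
infile >> command;           // e.g. "solid", "facet", "vertex", "endloop", ...
// comparing sizes first rules out most mismatches before touching characters
if (command.size() == vertex_cmd.size() && command == vertex_cmd) {
    // handle the vertex line here
}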
Instead of split()ing a std::string into a std::vector<std::string> on whitespace I'd rather reset a suitable stream (probably an std::istringstream but to prevent copying possibly a custom memory stream) and read directly from that:
std::istringstream in; // declared outside the reading loop
// ...
point p;
in.clear(); // get rid of potentially existing errors
in.str(line);
if (in >> p.x >> p.y >> p.z) {
    facet_points.push_back(p);
}
This approach has the added advantage of allowing format checking: I always distrust any input received, even when it is from a trusted source.
If you insist on adjusting the character sequence and/or splitting it into subsequences, I'd strongly recommend using std::string_view (or, in case this C++17 class isn't available, a similar class) to avoid moving characters around.
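For instance, a trim along these lines adjusts only the view's bounds instead of copying characters (a sketch, assuming C++17):
#include <algorithm>
#include <cstddef>
#include <string_view>

std::string_view trim_view(std::string_view sv) {
    // shrink the view from both ends; the underlying characters never move
    sv.remove_prefix(std::min(sv.find_first_not_of(" \t\r"), sv.size()));
    std::size_t last = sv.find_last_not_of(" \t\r");
    if (last != std::string_view::npos) sv.remove_suffix(sv.size() - last - 1);
    return sv;
}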
Assuming the file is of a significant size, I'd recommend against reading the file into a std::vector<std::string> and then parsing it. Instead, I'd parse the file on the fly: this way the hot memory is immediately reused instead of being moved out of cache for later post-processing, and dealing with an auxiliary stream (see point 3 above) can be avoided. To prevent an overly complex reading loop I'd split nested sections into appropriate functions, returning from them on closing tags. In addition I'd define input functions for structures like point to simply read them off the stream; see the sketch below.
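A minimal sketch of that structure, assuming normals can be ignored as in the original loop (function names are illustrative):
#include <fstream>
#include <istream>
#include <string>
#include <vector>

struct point { double x, y, z; };

std::istream &operator>>(std::istream &is, point &p) {
    return is >> p.x >> p.y >> p.z;
}

// consumes one facet section token by token, returning on the closing tag
bool read_facet(std::istream &is, std::vector<point> &pts) {
    std::string tok;
    while (is >> tok) {
        if (tok == "endfacet") return pts.size() == 3;
        if (tok == "vertex") {
            point p;
            if (!(is >> p)) return false;
            pts.push_back(p);
        }
        // "normal", its values, "outer", "loop", "endloop" are skipped
    }
    return false;
}

void parse_stl(std::istream &is) {
    std::string tok;
    while (is >> tok) {         // token-driven: no vector of lines needed
        if (tok == "facet") {
            std::vector<point> pts;
            if (read_facet(is, pts)) {
                // build the triangle from pts[0], pts[1], pts[2]
            }
        }
    }
}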
Depending on the system you are working on, you may want to call std::ios_base::sync_with_stdio(false) before reading the file: there used to be at least one often-used implementation of streams which benefited from this call.

How to DELETE a line(s) in C++?

I am new to file-handling...
I am writing a program that saves data in text-files in the following format:
3740541120991
Syed Waqas Ali
Rawalpindi
Lahore
12-12-2012
23:24
1
1
(Sorry for the bad alignment, it's NOT a part of the program)
Now I'm writing a delete function for the program that would delete a record.
So far this is my code:
void masterSystem::cancelReservation()
{
    string line;
    string searchfor = "3740541120991";
    int i = 0;
    ifstream myfile("records.txt");
    while (getline(myfile, line))
    {
        cout << line << endl;
        if (line == searchfor)
        {
            // DELETE THIS + THE NEXT 8 LINES
        }
    }
}
I've done a bit of research and have found out that there is no easy way to delete a specific line of a text file in place, so we have to create another text file.
But then the problem arises: how do I COPY the records/data, other than the record being deleted, into the NEW text file?
Open the input file; read one line at a time from it. If you decide you want to keep that line, write it to the output file. On the other hand, if you want to 'delete' that line, don't write it to the output file.
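A minimal sketch of that filter-copy approach, following the question's file name and record layout (the matched ID line plus the 8 lines after it are skipped; renaming the temporary file over the original is one common way to finish):
#include <cstdio>
#include <fstream>
#include <string>

void cancelReservation(const std::string &searchfor) {
    std::ifstream in("records.txt");
    std::ofstream out("records.tmp");
    std::string line;
    while (std::getline(in, line)) {
        if (line == searchfor) {
            // found the record: swallow its remaining 8 lines instead of copying
            for (int i = 0; i < 8 && std::getline(in, line); ++i) {}
            continue;
        }
        out << line << '\n';  // keep every other line
    }
    in.close();
    out.close();
    std::remove("records.txt");
    std::rename("records.tmp", "records.txt");
}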
You could have one record per line and make this even easier, for example:
3740541120991|Syed Waqas Ali|Rawalpindi|Lahore|12-12-2012|23:24|1|1
with the | character separating each field. This is a well-known technique, a variant of CSV (comma-separated values).
This way you don't have to worry about reading consecutive lines to erase a record, and adding a record accesses the file only once.
So your code becomes:
void masterSystem::cancelReservation()
{
    string line;
    string searchfor = "3740541120991";
    ifstream myfile("records.txt");
    ofstream outfile("records.tmp");
    while (getline(myfile, line))
    {
        // Here each line is a record.
        // You only have to decide if you will copy
        // this line to the output file or not.
        if (line.substr(0, line.find('|')) != searchfor)
            outfile << line << '\n';
    }
}
Don't think only about removing a record; there are other operations you will need to perform on this file: saving a new record, reading it into memory, searching.
Think for a moment about searching and, with your current design in mind, try to answer this: how many reservations exist for date 12-12-2012 and past 12:00 AM?
With your current layout you have to read 8 lines per record even when the other data is irrelevant to the question. But if each record is on one line, you only read one line per record.
With a few reservations the difference is near 0, but it grows with the number of records (8 reads per record versus 1).

Efficiently read CSV file with optional columns

I'm trying to write a program that reads in a CSV file (no need to worry about escaping anything, it's strictly formatted with no quotes) but any numeric item with a value of 0 is instead just left blank. So a normal line would look like:
12,string1,string2,3,,,string3,4.5
instead of
12,string1,string2,3,0,0,string3,4.5
I have some working code using vectors but it's way too slow.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

int main(int argc, char** argv)
{
    string filename("path\\to\\file.csv");
    string outname("path\\to\\outfile.csv");
    ifstream infile(filename.c_str());
    if (!infile)
    {
        cerr << "Couldn't open file " << filename.c_str();
        return 1;
    }
    vector<vector<string>> records;
    string line;
    while (getline(infile, line))
    {
        vector<string> row;
        string item;
        istringstream ss(line);
        while (getline(ss, item, ','))
        {
            row.push_back(item);
        }
        records.push_back(row);
    }
    return 0;
}
Is it possible to overload operator<< of ostream similar to How to use C++ to read in a .csv file and output in another form? when fields can be blank?
Would that improve the performance?
Or is there anything else I can do to get this to run faster?
Thanks
The time spent reading the string data from the file is greater than the time spent parsing it. You won't make significant time savings in the parsing of the string.
To make your program run faster, read bigger "chunks" into memory so you get more data per read, and research memory-mapped files.
One alternative way to get better performance is to read the whole file into a buffer, then go through the buffer and set pointers to where the values start; when you find a ',' or end of line, put in a '\0'.
e.g. https://code.google.com/p/csv-routine/
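A minimal sketch of that in-buffer tokenizing idea (the file path is a placeholder): the file is read once, delimiters are overwritten with '\0' in place, and each field is then usable as a plain C string with no per-field allocation; an empty field, the blank-means-zero case from the question, simply shows up as an empty string:
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream in("file.csv", std::ios::binary);
    std::vector<char> buf((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
    if (buf.empty() || buf.back() != '\n')
        buf.push_back('\n');  // guarantee the final field is terminated

    std::vector<const char *> fields;  // pointers into buf, valid while buf lives
    char *start = buf.data();
    for (char *p = buf.data(), *end = buf.data() + buf.size(); p != end; ++p) {
        if (*p == ',' || *p == '\n') {
            *p = '\0';                // terminate the field in place
            fields.push_back(start);
            start = p + 1;
        }
    }
    std::cout << "parsed " << fields.size() << " fields\n";
}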

For loops and inputting data?

I'm trying to figure out how to make a little inventory program and I can't for the life of me figure out why it isn't working.
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

struct record
{
    int item_id;
    string item_type;
    int item_price;
    int num_stock;
    string item_title;
    string item_author;
    int year_published;
};

void read_all_records(record records[]);
const int max_array = 100;

int main()
{
    record records[max_array];
    read_all_records(records);
    cout << records[2].item_author;
    return 0;
}

void read_all_records(record records[])
{
    ifstream invfile;
    invfile.open("inventory.dat");
    int slot = 0;
    for (int count = 0; count < max_array; count++);
    {
        invfile >> records[slot].item_id >> records[slot].item_type
                >> records[slot].item_price >> records[slot].num_stock
                >> records[slot].item_title >> records[slot].item_author
                >> records[slot].year_published;
        slot++;
    }
    invfile.close();
}
I'm testing it by having it print the author from the second record. When I run it, it doesn't show the author's name at all. The .dat file is located in just about every folder where the project is (I forgot which folder it needs to be in), so it's there.
The issue isn't that the file isn't working; it's that the array isn't printing anything.
my inv file is basically:
123456
book
69.99
16
title
etc
etc
and repeats for different books/CDs etc., one field per line, all without spaces. It should just read in.
You should check to see that the file is open.
invfile.open("inventory.dat");
if (!invfile.is_open())
    throw std::runtime_error("couldn't open inventory file");
You should check that your file reads are working, and break when you hit the end of file.
invfile >> records[slot].item_id >> records[slot].item_type ...
if (invfile.bad())
    throw std::runtime_error("file handling didn't work");
if (invfile.eof())
    break;
You probably want to read one record at a time, as it isn't clear from this code how the C++ streams are supposed to differentiate between the fields.
Usually you'd expect to use std::getline, split the fields on however you delimit them, and then use something like boost::lexical_cast to do the type parsing.
If I were doing this, I think I'd structure it quite a bit differently.
First, I'd overload operator>> for a record:
std::istream &operator>>(std::istream &is, record &r) {
    // code about like you had in read_all_records to read a single record,
    // but be sure to return the stream when you're done reading from it
    return is >> r.item_id >> r.item_type >> r.item_price >> r.num_stock
              >> r.item_title >> r.item_author >> r.year_published;
}
Then I'd use an std::vector<record> instead of an array -- it's much less prone to errors.
To read the data, I'd use std::istream_iterators, probably supplying them to the constructor for the vector<record>:
std::ifstream invfile("inventory.dat");
std::vector<record> records((std::istream_iterator<record>(invfile)),
                            std::istream_iterator<record>());
In between those (i.e., after creating the file stream, but before constructing the vector) is where you'd insert your error handling, roughly along the lines of what @Tom Kerr recommended: checks for is_open(), bad(), eof(), etc., to figure out what (if anything) is going wrong in attempting to open the file.
Add a little check:
if (!invfile.is_open()) {
    cout << "file open failed";
    exit(1);
}
So that way, you don't need to copy your input file everywhere like you do now ;-)
You are reading in a specific order, so your input file should have the same order and the required number of inputs.
You are printing the 3rd element of the records array, so you should have at least 3 records. I don't see anything wrong with your code. It would be a lot easier if you could post your sample input file.

How to read structured data from file in C++?

I have a data file where I need to read a datum from each line and store it, and then, depending on the value of one of those datums, store the data in an array so that I can calculate the median value of all of these data.
The line of data is demographic information and, depending on the geographic location, the address of a person. I need to capture their age and then find the median age of the people that live on a particular street, for example.
So the data set is 150,000 records and each record has 26 fields. A lot of those fields are segments of an address, and the other fields are just numbers: age, street number, and that sort of thing.
So what I need to do is read through each line, and then, if a particular field in the record meets a certain condition, capture a field from the record and store it in an array so that I can calculate the median age of people that live on "Oak Street", for example.
I have the conditional logic and can work the sort out, but I'm uncomfortable with the iostream objects in C++, like instantiating an ifstream object and then reading from the file itself.
Oh, I forgot: the data is a comma-separated value file.
For comma-delimited input:
using namespace std;
ifstream file;  // assume the file has been opened
string line;
while (getline(file, line)) {
    istringstream stream(line);
    string data[3];
    for (int ii = 0; ii < sizeof data / sizeof data[0]; ++ii)
        if (!getline(stream, data[ii], ','))
            throw std::runtime_error("invalid data");
    // process data here
}
For whitespace-delimited input (original answer):
using namespace std;
ifstream file;  // assume the file has been opened
string line;
while (getline(file, line)) {
    int datum1;
    string datum2;
    double datum3;
    istringstream stream(line);
    if (!(stream >> datum1 >> datum2 >> datum3))
        throw std::runtime_error("invalid data");
    // process data here
}
These methods won't win any prizes for performance, but hopefully they're fairly reliable and easy to understand.
This sounds like a perfect problem for an SQLite-style embedded database. Then you could have any number of standard SQL features without having to reinvent the wheel.
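To make that concrete, here is a sketch of the street-median query through the SQLite C API (the database, table, and column names are all assumptions for illustration; the CSV would be imported into the people table first):
#include <sqlite3.h>
#include <cstdio>

// callback invoked by sqlite3_exec once per result row
static int print_row(void *, int argc, char **argv, char **) {
    for (int i = 0; i < argc; ++i)
        std::printf("%s ", argv[i] ? argv[i] : "NULL");
    std::printf("\n");
    return 0;
}

int main() {
    sqlite3 *db = nullptr;
    if (sqlite3_open("people.db", &db) != SQLITE_OK) return 1;

    // SQLite has no built-in MEDIAN(); ordering by age and stepping to the
    // middle row with LIMIT/OFFSET picks the median of an odd-sized group
    const char *sql =
        "SELECT age FROM people WHERE street = 'Oak Street' ORDER BY age "
        "LIMIT 1 OFFSET (SELECT COUNT(*) FROM people "
        "WHERE street = 'Oak Street') / 2;";
    char *err = nullptr;
    if (sqlite3_exec(db, sql, print_row, nullptr, &err) != SQLITE_OK) {
        std::fprintf(stderr, "query failed: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
}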