How to skip header row in CSV using C++

In my scenario, I need to create a parameters file in CSV format. Every row holds one piece of config data, and the first field of the row serves as a header, used as an identifier. A format like the one below would be easy for me to parse:
1,field1,field2,field3,field4 // 1 indicates the TARGET the other fields will be written to
1,field1,field2,field3,field4
2,field1,field2,field3,field4
2,field1,field2,field3,field4
...
But it's not friendly to users. So I define a CSV file like below:
HeaderLine_Begin,1
field1,field2,field3,field4
field1,field2,field3,field4
HeaderLine_Begin,2
field1,field2,field3,field4
field1,field2,field3,field4
This means every row is data that will be written to the target named by the preceding HeaderLine_Begin; I just separate the ID from the real data.
Then I create a struct like this:
enum myenum
{
    ON, OFF, NOCHANGE
};

struct Setting
{
    int TargetID;
    string field1;
    string field2;
    myenum field3;
    myenum field4;
};
I know how to read a CSV file line by line with code like this:
filename += ".csv";
std::ifstream file(filename.c_str());
std::string line;
while (file.good())
{
    getline(file, line, '\n');        // read a line until the last one
    if (line.compare(0, 1, "#") == 0) // ignore comment lines
        continue;
    ParseLine(); // DONE. Parse the line whether it's a header row or a data row
}
file.close(); // close file
What I want to do is build a list, like vector<Setting> settings, to keep the data. The flow should be: find the first header ID, then read the next line. If the next line is a data line, treat it as a data line belonging to that header ID. If it is another header line, loop again.
The problem is that there is no std::getnextline(int lineIndex) for me to fetch the rows after I have found the header row.

Your input loop should be more like:
int id = -1;
while (getline(file, line))
{
    if (line.empty() || line[0] == '#')
        continue;
    if (starts_with_and_remove(line, "HeaderLine_Begin,"))
        id = boost::lexical_cast<int>(line); // or id = atoi(line.c_str())
    else
    {
        assert(id != -1);
        ...parse CSV, knowing "id" is in effect...
    }
}
With:
bool starts_with_and_remove(std::string& lhs, const std::string& rhs)
{
    if (lhs.compare(0, rhs.size(), rhs) == 0) // rhs.size() > lhs.size() IS safe
    {
        lhs.erase(0, rhs.size());
        return true;
    }
    return false;
}
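To connect this back to the asker's Setting struct, the elided "parse CSV" step could look something like the sketch below. It assumes fields never contain embedded commas, and toEnum is a hypothetical helper mapping a field's text onto myenum; neither is part of the original answer.

#include <sstream>
#include <string>

// Hypothetical helper: map a field's text onto the question's myenum values.
myenum toEnum(const std::string& s)
{
    if (s == "ON")  return ON;
    if (s == "OFF") return OFF;
    return NOCHANGE;
}

// Sketch: parse one data line into a Setting, assuming no embedded commas.
Setting parseDataLine(const std::string& line, int id)
{
    std::istringstream fields(line);
    Setting s;
    s.TargetID = id;
    std::string f3, f4;
    std::getline(fields, s.field1, ',');
    std::getline(fields, s.field2, ',');
    std::getline(fields, f3, ',');
    std::getline(fields, f4, ',');
    s.field3 = toEnum(f3);
    s.field4 = toEnum(f4);
    return s;
}

The else branch would then simply do settings.push_back(parseDataLine(line, id));.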

The simplest solution would be to use regular expressions:
std::string line;
int currentId = 0;
while ( std::getline( source, line ) ) {
    trimCommentsAndWhiteSpace( line );
    static std::regex const header( "HeaderLine_Begin,(\\d+)" );
    std::smatch match;
    if ( line.empty() ) {
        // ignore
    } else if ( std::regex_match( line, match, header ) ) {
        std::istringstream s( match[ 1 ] );
        s >> currentId;
    } else {
        // ...
    }
}
I regularly use this strategy to parse .ini files, which pose
the same problem: section headers have a different syntax to
other things.
trimCommentsAndWhiteSpace can be as simple as:
void
trimCommentsAndWhiteSpace( std::string& line )
{
    if ( !line.empty() && line[0] == '#' ) {
        line = "";
    }
}
It's fairly easy to expand it to handle end-of-line comments as well, however, and it's usually a good policy (in contexts like this) to trim leading and trailing whitespace---trailing especially, since a human reader won't see it when looking at the file.
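An expanded version along those lines might look like the sketch below; the end-of-line comment handling is an assumption on my part, and it makes no attempt to honor # inside quoted fields.

void trimCommentsAndWhiteSpace( std::string& line )
{
    // Drop everything from the first '#' on (naive: ignores quoting).
    std::string::size_type hash = line.find( '#' );
    if ( hash != std::string::npos ) {
        line.erase( hash );
    }
    // Trim leading and trailing whitespace.
    std::string::size_type first = line.find_first_not_of( " \t" );
    if ( first == std::string::npos ) {
        line.clear();
    } else {
        std::string::size_type last = line.find_last_not_of( " \t" );
        line = line.substr( first, last - first + 1 );
    }
}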
Alternatively, of course, you could use a regular expression for the lines you want to treat as comments ("\s*#.*"); this works well with your current definition, but doesn't really extend well to end-of-line comments, especially if you want to allow # in quoted strings in your fields.
And one final comment: your loop is incorrect. You don't test
that getline succeeded before using its results, and
file.good() may return true even if there is nothing more to
read. (file.good() is one of those things that are there for
historical reasons; there's no case where it makes sense to use
it.)

Related

What is the fastest way to remove all characters in a line up until a pattern match in c++?

I have very large files that need to be read into memory. These files must be in a human-readable format, and so they are polluted with tab indenting until normal characters appear. For example, the following text is preceded with 3 spaces (which is equivalent to one tab indent):
/// There is a tab before this text.
Sample Text There is a tab in between the word "Text" and "There" in this line.
9919
2250
{
""
5
255
}
Currently I simply run the following code to replace the tabs (after the file has been loaded into memory)...
void FileParser::ReplaceAll(
    std::string& the_string,
    const std::string& from,
    const std::string& to) const
{
    size_t start_pos = 0;
    while ((start_pos = the_string.find(from, start_pos)) != std::string::npos)
    {
        the_string.replace(start_pos, from.length(), to);
        start_pos += to.length(); // In case 'to' contains 'from', like replacing 'x' with 'yx'
    }
}
There are two issues with this code:
It takes 18 seconds just to complete the replacement on this text.
It replaces ALL tabs, but I only want the tabs up to the first non-tab character on each line; tabs appearing after the first non-tab character should be left alone.
Can anyone offer a solution that would speed up the process and only remove the initial tab indents of each line?
I'd do it this way:
std::string without_leading_chars(const std::string& in, char remove)
{
    std::string out;
    out.reserve(in.size());
    bool at_line_start = true;
    for (char ch : in)
    {
        if (ch == '\n')
            at_line_start = true;
        else if (at_line_start)
        {
            if (ch == remove)
                continue; // skip this char, do not copy
            else
                at_line_start = false;
        }
        out.push_back(ch);
    }
    return out;
}
That's one memory allocation and a single pass, so pretty close to optimal.
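A possible usage sketch, assuming the data arrives by slurping a file into a string as the question describes (the file name is illustrative):

#include <fstream>
#include <iterator>
#include <string>

int main()
{
    // Slurp the whole file into memory.
    std::ifstream file("input.txt", std::ios::binary);
    std::string contents((std::istreambuf_iterator<char>(file)),
                         std::istreambuf_iterator<char>());
    // Strip the leading tabs from every line in one pass.
    std::string cleaned = without_leading_chars(contents, '\t');
    // ... parse 'cleaned' ...
}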
As always, we can often gain more speed by thinking about good algorithms and creating a good design.
First comment: I tested your approach with a 100 MB source file, and it took at least 30 minutes on my machine, in Release mode with all optimizations on.
And, as you mentioned yourself, it replaces all tabs, not only those at the beginning of a line. So we need to come up with a better solution.
First we think about how to identify spaces at the beginning of a line. For this we need a boolean flag that indicates that we are at the beginning of a line. We will call it beginOfLine and set it to true initially, because a file always starts at the beginning of a line.
Next we check whether the current character is a space ' ' or a tab '\t'. In contrast to other solutions, we will check for both.
If it is, then whether we keep that space or tab in the output depends on whether we are at the beginning of the line: the result is the inverse of beginOfLine.
If the character is not a space or tab, we check for a newline. If we find one, we set the beginOfLine flag to true, otherwise to false. In any case, we want to keep the character.
All this can be put into a simple stateful lambda:
auto check = [beginOfLine = true](const char c) mutable -> bool {
    if ((c == ' ') || (c == '\t'))
        return !beginOfLine;
    beginOfLine = (c == '\n');
    return true; };
or, more compact:
auto check = [beginOfLine = true](const char c) mutable -> bool {
if (c == ' ' || c == '\t') return !beginOfLine; beginOfLine = (c == '\n'); return true; };
Next: we will not erase the spaces from the original string, because that is a huge memory-shifting activity that takes brutally long. Instead, we copy the characters to a new string, but only the needed ones.
And for that, we can use std::copy_if from the standard library:
std::copy_if(data.begin(), data.end(), data2.begin(), check);
This will do the work. For 100 MB of data it takes 160 ms; compared to 30 minutes, that is a tremendous saving. (Note that data2 must already be large enough, e.g. resized to data.size(), for copy_if to write through its begin() iterator.)
Please see the example code (which of course needs to be adapted to your needs):
#include <iostream>
#include <fstream>
#include <filesystem>
#include <iterator>
#include <algorithm>
#include <string>

namespace fs = std::filesystem;

constexpr size_t SizeOfIOStreamBuffer = 1'000'000;
static char ioBuffer[SizeOfIOStreamBuffer];

int main() {
    // Path to text file
    const fs::path file{ "r:\\test.txt" };

    // Open the file and check if it could be opened
    if (std::ifstream fileStream(file); fileStream) {

        // Lambda that identifies whether a space or tab is at the beginning of a line
        auto check = [beginOfLine = true](const char c) mutable -> bool {
            if (c == ' ' || c == '\t') return !beginOfLine; beginOfLine = (c == '\n'); return true; };

        // Huge string holding all file data
        std::string data{};

        // Resize the string to the file size, so copy_if can write through data.begin()
        data.resize(fs::file_size(file));

        // Use buffered IO with a huge IO buffer
        fileStream.rdbuf()->pubsetbuf(ioBuffer, SizeOfIOStreamBuffer);

        // Read the file, eliminate spaces and tabs at the beginning of each line, and store in data
        std::copy_if(std::istreambuf_iterator<char>(fileStream), {}, data.begin(), check);
    }
    return 0;
}
As you can see, it all boils down to one statement in the code. And this runs (on my machine) in 160 ms for a 100 MB file.
What can be optimized further? We can see that we have two 100 MB std::strings in our program. What a waste. The final optimization is to merge the two statements, file reading and removal of leading spaces and tabs, into one:
std::copy_if(std::istreambuf_iterator<char>(fileStream), {}, data.begin(), check);
We then hold the data in memory only once, and we no longer read data from the file that we do not need. And the beauty of it is that, thanks to modern C++ language elements, only a minor modification is necessary: just exchange the source iterators.
Yes, I know the string is too big at the end, but it can be set to its actual size easily, for example by using data.reserve(...) and std::back_inserter.
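A sketch of that last variant, reusing the check lambda and fileStream names from above: with std::back_inserter the string ends up holding exactly the kept characters, at the cost of growing incrementally instead of one up-front resize.

// Sketch: copy_if through a back_inserter, so 'data' ends up exactly the
// right size (assumes 'check' and 'fileStream' as defined above).
std::string data{};
data.reserve(fs::file_size(file)); // avoid repeated reallocation while growing
std::copy_if(std::istreambuf_iterator<char>(fileStream), {},
             std::back_inserter(data), check);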

Tokenize elements from a text file by removing comments, extra spaces and blank lines in C++

I'm trying to eliminate comments, blank lines and extra spaces within a text file, then tokenize the elements left over. Each token needs a space before and after it.
exampleFile.txt
var
/* declare variables */a1 ,
b2a , c,
Here's what's working as of now:
string line; //line: represents one line of text from file
ifstream InputFile("exampleFile", ios::in); //read from exampleFile.txt

//Remove comments
while (InputFile && getline(InputFile, line, '\0'))
{
    while (line.find("/*") != string::npos)
    {
        size_t Begin = line.find("/*");
        line.erase(Begin, (line.find("*/", Begin) - Begin) + 2);
        // Start at Begin, erase from Begin to where */ is found
    }
}
This removes comments, but I can't seem to figure out a way to tokenize while this is happening.
So my questions are:
Is it possible to remove comments, spaces, and empty lines and tokenize all in this while statement?
How can I implement a function to add spaces between tokens before they are tokenized? Tokens like c, need to be recognized as c and , individually.
Thank you in advance for the help!
If you need to skip whitespace characters and you don't care about new lines then I'd recommend reading the file with operator>>.
You could write simply:
std::string word;
bool isComment = false;
while (file >> word)
{
    if (isInsideComment(word, isComment))
        continue;
    // do processing of the tokens here
    std::cout << word << std::endl;
}
Where the helper function could be implemented as follows:
bool isInsideComment(std::string &word, bool &isComment)
{
    const std::string tagStart = "/*";
    const std::string tagStop = "*/";

    // match start marker
    if (std::equal(tagStart.rbegin(), tagStart.rend(), word.rbegin())) // ends with tagStart
    {
        isComment = true;
        if (word == tagStart)
            return true;
        word = word.substr(0, word.find(tagStart));
        return false;
    }

    // match end marker
    if (isComment)
    {
        if (std::equal(tagStop.begin(), tagStop.end(), word.begin())) // starts with tagStop
        {
            isComment = false;
            word = word.substr(tagStop.size());
            return false;
        }
        return true;
    }
    return false;
}
For your example this would print out:
var
a1
,
b2a
,
c,
The above logic also handles multiline comments, if you're interested in those.
Note, however, that the function's implementation should be adjusted to match your assumptions about the comment tokens. For instance, are they always separated by whitespace from other words? Or could a var1/*comment*/var2 expression be parsed? The above example won't work in that situation.
Hence, another option would be (what you already started implementing) reading lines, or even chunks of data, from the file (to ensure that begin and end comment tokens are matched), finding the positions of the comment markers with find or a regex, and removing them afterwards.
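As for the second question (getting c, to come out as c and ,), one possibility is to pre-process each word and insert spaces around the punctuation you care about before tokenizing. padPunctuation below is an illustrative helper, not part of the original answer, and it only handles commas:

#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper: pad ',' with spaces so it becomes its own token.
std::string padPunctuation(const std::string &word)
{
    std::string out;
    for (char c : word)
    {
        if (c == ',')
            out += " , ";
        else
            out += c;
    }
    return out;
}

int main()
{
    std::istringstream padded(padPunctuation("c,"));
    std::string token;
    while (padded >> token)
        std::cout << token << std::endl; // prints "c", then ","
}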

how to discard from streams? .ignore() doesn't work for this purpose; any other methods?

I have a lack of understanding about streams. The idea is to read a file into an ifstream and then work with it: extract data from the stream into a string, and discard the part now held in the string from the stream. Is that possible? And how should such problems be handled?
The following method is for a file that has been properly opened by the ifstream. (It's a text file containing information about "Lost" episodes; it's an episode guide.) It works fine for one object of the class Episode. Every time I instantiate an Episode from the file, I want to check the stream of that file, pull out the information about one episode (episodes are delimited by "****"), process the extracted information in a string, and discard it from the stream. When I create the next Episode object, I want to consume the information from the next "****" to the one after it, and so on.
void Episode::read(ifstream& in) {
    string contents((istreambuf_iterator<char>(in)), istreambuf_iterator<char>());
    size_t episodeEndPos = contents.find("****");
    if (episodeEndPos == -1) {
        in.ignore(numeric_limits<char>::max());
        in.clear(), in.sync();
        fullContent = contents;
    }
    else { // empty stream for next episode
        in.ignore(episodeEndPos + 4);
        fullContent = contents.substr(0, episodeEndPos);
    }
    // fill attributes
    setNrHelper();
    setTitelHelper();
    setFlashbackHelper();
    setDescriptionHelper();
}
I tried it with inFile >> words (to read the words; that is one way to get the words out of the stream). Another way I was thinking about is to use .ignore (to ignore a number of characters in the stream). But that doesn't work as intended. Sorry for my bad English; hopefully it's clear what I want to do.
If your goal is, on each call of Read(), to read the next episode and advance in the file, then the trick is to use tellg() and seekg() to bookmark the position and update it:
void Episode::Read(ifstream& in) {
    streampos pos = in.tellg(); // back up the current position
    string fullContent;
    string contents((istreambuf_iterator<char>(in)), istreambuf_iterator<char>());
    size_t episodeEndPos = contents.find("****");
    if (episodeEndPos == string::npos) {
        in.ignore(numeric_limits<streamsize>::max()); // consume the rest of the stream
        in.clear(), in.sync();
        fullContent = contents;
    }
    else { // empty stream for next episode
        fullContent = contents.substr(0, episodeEndPos);
        in.seekg(pos + streamoff(episodeEndPos + 4)); // position the file at the next episode
    }
}
This way, you can call your function several times, each call reading the next episode.
However, please note that your approach is not optimised: when you construct your contents string from a stream iterator, you load the whole rest of the file into memory, starting at the current position in the stream. So you keep reading, again and again, big subparts of the file.
Edit: streamlined version adapted to your format
You just need to read each line, check that it's not a separator line, and concatenate...
void Episode::Read(ifstream& in) {
    string line;
    string fullContent;
    while (getline(in, line) && line != "****") {
        fullContent += line + "\n";
    }
    cout << "DATENSATZ: " << fullContent << endl; // just to verify the content
    // fill attributes
    // ...
}
The code you got reads the entire stream in one go, just to use some part of the read text to initialize an object. Imagine a gigantic file: that is almost certainly a bad idea. The easier approach is to read only until the end marker is found. In an ideal world, the end marker is easily found. Based on comments, it seems to be on a line of its own, which makes things quite easy:
void Episode::read(std::istream& in) {
    std::string text;
    for (std::string line; std::getline(in, line) && line != "****"; ) {
        text += line + "\n";
    }
    fullContent = text;
}
If the separator isn't on a line of its own, you could use code like this instead:
void Episode::read(std::istream& in) {
    std::string text;
    for (std::istreambuf_iterator<char> it(in), end; it != end; ++it) {
        text.push_back(*it);
        if (*it == '*' && 4u <= text.size() && text.substr(text.size() - 4u) == "****") {
            break;
        }
    }
    if (4u <= text.size() && text.substr(text.size() - 4u) == "****") {
        text.resize(text.size() - 4u); // drop the "****" marker itself
    }
    fullContent = text;
}
Both of these approaches simply read the file from start to end, consuming the extracted characters in the process and stopping as soon as one record has been read.

Parse buffered data line by line

I want to write a parser for the Wavefront OBJ file format, a plain text file.
An example can be seen here: people.sc.fsu.edu/~jburkardt/data/obj/diamond.obj.
Most people use old scanf to parse this format line by line; however, I would prefer to load the whole file at once to reduce the IO operation count. Is there a way to parse this kind of buffered data line by line?
void ObjModelConditioner::Import(Model& asset)
{
    uint8_t* buffer = SyncReadFile(asset.source_file_info());
    delete [] buffer;
}
Or would it be preferable to load the whole file into a string and try to parse that?
After a while, it seems I have found a sufficient (and simple) solution. Since my goal is to create an asset conditioning pipeline, the code has to handle large amounts of data efficiently. The data can be read into a string at once; once loaded, a stringstream can be initialized with this string.
std::string data;
SyncReadFile(asset.source_file_info(), data);
std::stringstream data_stream(data);
std::string line;
Then I simply call getline():
while (std::getline(data_stream, line))
{
    std::stringstream line_stream(line);
    std::string type_token;
    line_stream >> type_token;
    if (type_token == "v") {
        // Vertex position
        Vector3f position;
        line_stream >> position.x >> position.y >> position.z;
        // ...
    }
    else if (type_token == "vn") {
        // Vertex normal
    }
    else if (type_token == "vt") {
        // Texture coordinates
    }
    else if (type_token == "f") {
        // Face
    }
}
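For the branches left as comments, the face case is the fiddliest, since OBJ allows v, v/vt and v/vt/vn index forms. A hedged sketch for the common v/vt/vn triplet form (an assumption about your files; negative relative indices are not handled) is a small helper called once per whitespace-separated group pulled from line_stream:

#include <algorithm>
#include <sstream>
#include <string>

// Hypothetical helper for the "f" branch: parses one "v/vt/vn" group.
// OBJ indices are 1-based.
void parseFaceVertex(const std::string& group, int& v, int& vt, int& vn)
{
    std::string s = group;
    std::replace(s.begin(), s.end(), '/', ' '); // "1/2/3" -> "1 2 3"
    std::istringstream indices(s);
    v = vt = vn = 0;
    indices >> v >> vt >> vn;
}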
Here's a function that splits a char array into a vector of strings (assuming lines are separated by the '\n' symbol):
#include <string>
#include <vector>

std::vector<std::string> split(const char* arr)
{
    std::string str = arr;
    std::vector<std::string> result;
    std::string::size_type beg = 0; // beginning of the current line
    for (;;)
    {
        std::string::size_type end = str.find('\n', beg); // end of the current line
        if (end == std::string::npos)
        {
            result.push_back(str.substr(beg)); // last line: take the rest
            break;
        }
        result.push_back(str.substr(beg, end - beg));
        beg = end + 1; // skip the '\n' itself
    }
    return result;
}
Here's the usage:
int main()
{
    const char* a = "asdasdasdasdasd \n asdasdasd \n asdasd";
    std::vector<std::string> result = split(a);
}
If you've got the raw data in a char[] (or an unsigned char[]), and you know its length, it's pretty trivial to write an input-only, no-seek-supported streambuf which will allow you to create a std::istream and to use std::getline on it. Just call:
setg( start, start, start + length );
in the constructor. (Nothing else is needed.)
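A minimal sketch of that idea; MemoryBuf is an illustrative name, not a standard class:

#include <cstddef>
#include <iostream>
#include <istream>
#include <streambuf>
#include <string>

struct MemoryBuf : std::streambuf {
    MemoryBuf(char* start, std::size_t length) {
        setg(start, start, start + length); // the whole trick: one call in the constructor
    }
};

int main()
{
    char data[] = "v 1 2 3\nvn 0 0 1\n";
    MemoryBuf buf(data, sizeof(data) - 1);
    std::istream in(&buf); // an istream reading from the in-memory buffer
    for (std::string line; std::getline(in, line); )
        std::cout << line << '\n';
}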
It really depends on how you're going to parse the text. One way would be simply to read the data into a vector of strings. I'll assume that you've already covered issues such as scalability and memory use.
std::vector<std::string> lines;
std::string line;
ifstream file(filename.c_str(), ios_base::in);
while ( getline( file, line ) )
{
    lines.push_back( line );
}
file.close();
This would cache your file in lines. Next you need to go through lines:
for ( std::vector<std::string>::const_iterator it = lines.begin();
      it != lines.end(); ++it )
{
    const std::string& line = *it;
    if ( line.empty() )
        continue;
    switch ( line[0] )
    {
    case 'g':
        // Some stuff
        break;
    case 'v':
        // Some stuff
        break;
    case 'f':
        // Some stuff
        break;
    default:
        // Default stuff including '#' (probably nothing)
        break;
    }
}
Naturally, this is very simplistic and depends largely on what you want to do with your file.
The size of the file you've given as an example is hardly likely to cause IO stress (unless you're using some very lightweight equipment), but if you're reading many files at once I suppose it might be an issue.
I think your concern here is to minimise IO, and I'm not sure this solution will really help that much, since you're going to be iterating over a collection twice. If you need to go back and keep reading the same file over and over again, then caching the file in memory will definitely speed things up, but there are equally easy ways to do this, such as memory-mapping the file and using normal file access. If you're really concerned, try profiling a solution like this against simply processing the file directly as you read it from IO.
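For reference, a minimal sketch of the memory-mapping alternative mentioned above, using the POSIX mmap API (platform-specific, not standard C++; error handling omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("diamond.obj", O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    // Map the whole file read-only into our address space.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    const char* data = static_cast<const char*>(p);
    // ... parse the [data, data + st.st_size) range, e.g. via the
    // streambuf trick shown in the other answer ...
    munmap(p, st.st_size);
    close(fd);
}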

function for searching for a string in a file

This is some code I wrote to check a string's presence in a file:
bool aviasm::in1(string s)
{
    ifstream in("optab1.txt", ios::in); // opening the optab
    //cout << "entered in1 func" << endl;
    char c;
    string x, y;
    while ((c = in.get()) != EOF)
    {
        in.putback(c);
        in >> x;
        in >> y;
        if (x == s)
            return true;
    }
    return false;
}
It is certain that the string being searched for lies in the first column of optab1.txt, and there are two columns in optab1.txt in every row.
Now the problem is that no matter what string is passed as the parameter s, the function always returns false. Can you tell me why this happens?
What a hack! Why not use standard C++ string and file reading functions:
bool find_in_file(const std::string & needle)
{
    std::ifstream in("optab1.txt");
    std::string line;
    while (std::getline(in, line)) // remember this idiom!!
    {
        // if (line.substr(0, needle.length()) == needle) // not so efficient
        if (line.length() >= needle.length() && std::equal(needle.begin(), needle.end(), line.begin())) // better
        // if (std::search(line.begin(), line.end(), needle.begin(), needle.end()) != line.end()) // for arbitrary position
        {
            return true;
        }
    }
    return false;
}
You can replace substr with more advanced string-searching functions if the search string isn't required to be at the beginning of a line. The substr version is the most readable, but it makes a copy of the substring. The equal version compares the two strings in place (but requires the additional size check). The search version finds the substring anywhere, not just at the beginning of the line (but at a price).
It's not too clear what you're trying to do, but the condition in the while will never be met if plain char is unsigned. (It usually isn't, so you might get away with it.) Also, you're not extracting the end of line in the loop, so you'll probably see it instead of EOF, and pass through the loop once too often. I'd write this more along the lines of:
bool
in1( std::string const& target )
{
    std::ifstream in( "optab1.txt" );
    if ( ! in.is_open() ) {
        // Some sort of error handling, maybe an exception.
    }
    std::string line;
    while ( std::getline( in, line )
            && ( line.size() < target.size()
                 || ! std::equal( target.begin(), target.end(), line.begin() ) ) )
        ;
    return ! in.fail(); // true if we stopped on a match, false at end of file
}
Note the check that the open succeeded. One possible reason you're
always returning false is that you're not successfully opening the file.
(But we can't know unless you check the status after the open.)