C++ How to split stringstream using multiple delimiters - c++

How would I go about splitting up a stringstream into individual strings using multiple delimiters?
Right now it uses the default white space delimiter and I manually delete the first and last characters if they are anything other than alphanumeric.
The goal here is to read in a .cpp file and parse it for all the user idents that are not reserved words in C++.
It's working for benign examples but for stuff like this:
OrderedPair<map_iterator, bool> insert(const value_type& kvpair)
It is not working. I'd like to be able to split OrderedPair into its own word, map_iterator into its own, bool, insert, const, value_type, and kvpair all into individual words.
How would I go about using "< > , ( & ) . -> *" as delimiters for my stringstream?
while (getline(inFile, line)) {
isComment = false;
stringstream sstream(line);
while (sstream >> word) {
isCharLiteral = false;
if (!isComment) {
if (word[0] == '/' && word[1] == '/')
isComment = true;
}
if (!isMultilineComment) {
if (word[0] == '/' && word[1] == '*')
isMultilineComment = true;
}
if (!isStringLiteral) {
if (word[0] == '"')
isStringLiteral = true;
}
if (!isCharLiteral) {
if (word[0] == '\'' && word.back() == '\'')
isCharLiteral = true;
}
if (isStringLiteral)
if (word.back() == '"')
isStringLiteral = false;
if (isMultilineComment)
if (word[0] == '*' && word[1] == '/')
isMultilineComment = false;
if (!isStringLiteral && !isMultilineComment && !isComment && !isCharLiteral) {

If you are able to use standard libraries, then I would suggest using std::strtok() to tokenize your string. You can pass any delimiters you like to strtok(). There is a reference for it here.
Since you are using a string datatype, for strtok to work properly, you'd have to copy your string into a null-terminated character array of sufficient length, and then call strtok() on that array.
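A minimal sketch of the copy-then-strtok approach described above. The function name split_with_strtok and the default delimiter set are assumptions for illustration; the delimiters are the characters from the question plus whitespace.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: splits `line` on any character found in `delims`.
// std::strtok overwrites delimiters with '\0', so it runs on a mutable copy.
std::vector<std::string> split_with_strtok(const std::string& line,
                                           const char* delims = " \t<>,()&.-*")
{
    // Copy into a null-terminated character array of sufficient length
    std::vector<char> buffer(line.begin(), line.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    for (char* tok = std::strtok(buffer.data(), delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.emplace_back(tok);
    }
    return tokens;
}
```

Note that std::strtok keeps its position in hidden static state, so two tokenizations cannot be interleaved and it is not safe to call from several threads at once.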

C++ std::istream only provides basic input methods for the most common use cases. Here you can process the std::string directly with the methods find_first_of and find_last_of to identify either delimiters or non-delimiters. It is easy to build something close to the good old strtok that acts directly on a std::string instead of writing \0 into the parsed string.
But for what you are trying to achieve, you should take into account comments, string literals, macros and pragmas, which you should not search for identifiers.
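As a rough sketch of a strtok-like splitter that works on a std::string without modifying it: the name split_any is hypothetical, and this version pairs find_first_of with find_first_not_of rather than find_last_of. It does not handle the comment, literal and preprocessor cases mentioned above.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the non-empty runs of characters in `text`
// that contain none of the characters in `delims`.
std::vector<std::string> split_any(const std::string& text, const std::string& delims)
{
    std::vector<std::string> tokens;
    // Skip leading delimiters to find the start of the first token
    std::string::size_type start = text.find_first_not_of(delims);
    while (start != std::string::npos) {
        // The token ends at the next delimiter (or at the end of the string)
        const std::string::size_type end = text.find_first_of(delims, start);
        tokens.push_back(text.substr(start, end - start));
        // Skip the delimiters that follow; yields npos when nothing is left
        start = text.find_first_not_of(delims, end);
    }
    return tokens;
}
```

Called as split_any(line, " <>,()&.-*"), it treats every character of the question's delimiter list as a separator.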

You could use a regex to replace instances of the characters you want to be delimiters with whitespace. Then use your existing white space splitting setup.
http://en.cppreference.com/w/cpp/regex
Or get extra fancy with the regex and just match on the things you do want, and iterate through the matches.
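A sketch of the second idea, matching only the things you want and iterating through the matches. The function name identifiers and the pattern are assumptions; the pattern matches C-style identifiers and would also pick up the letter part of literals such as 0x1F, so it is a starting point only.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every identifier-like match in `line`.
std::vector<std::string> identifiers(const std::string& line)
{
    // A letter or underscore followed by letters, digits or underscores
    static const std::regex ident(R"([A-Za-z_][A-Za-z0-9_]*)");

    std::vector<std::string> out;
    for (std::sregex_iterator it(line.begin(), line.end(), ident), end; it != end; ++it)
        out.push_back(it->str());
    return out;
}
```

Reserved words such as const and bool still come through and would be filtered afterwards, as in the original program.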

Related

What is the fastest way to remove all characters in a line up until a pattern match in c++?

I have very large files that need to be read into memory. These files must be in a human readable format, and so they are polluted with tab indenting until normal characters appear... For example the following text is preceded with 3 spaces (which is equivalent to one tab indent)
/// There is a tab before this text.
Sample Text There is a tab in between the word "Text" and "There" in this line.
9919
2250
{
""
5
255
}
Currently I simply run the following code to replace the tabs (after the file has been loaded into memory)...
void FileParser::ReplaceAll(
std::string& the_string,
const std::string& from,
const std::string& to) const
{
size_t start_pos = 0;
while ((start_pos = the_string.find(from, start_pos)) != std::string::npos)
{
the_string.replace(start_pos, from.length(), to);
start_pos += to.length(); // In case 'to' contains 'from', like replacing 'x' with 'yx'
}
}
There are two issues with this code...
It takes 18 seconds to just complete the replacing of this text.
This replaces ALL tabs, I just want the tabs up until the first non-tab character. So if the line has tabs after the non-tab characters.... these would not be removed.
Can anyone offer up a solution that would speed up the process and only remove the initial tab indents of each line?
I'd do it this way:
std::string without_leading_chars(const std::string& in, char remove)
{
std::string out;
out.reserve(in.size());
bool at_line_start = true;
for (char ch : in)
{
if (ch == '\n')
at_line_start = true;
else if (at_line_start)
{
if (ch == remove)
continue; // skip this char, do not copy
else
at_line_start = false;
}
out.push_back(ch);
}
return out;
}
That's one memory allocation and a single pass, so pretty close to optimal.
As always, we can often gain more speed by thinking of good algorithms and creating a good design.
First comment: I tested your approach with a 100MB source file and it took at least 30 minutes on my machine in Release mode with all optimizations on.
And, as you mentioned yourself, it replaces all tabs, not only those at the beginning of each line. So, we need to come up with a better solution.
First we think of how we can identify spaces at the beginning of a line. For this we need some boolean flag that indicates that we are at the beginning of a line. We will call it beginOfLine and set it to true initially, because the file starts always with a line.
Then, next, we check, if the next character is a space ' ' or a tab '\t' character. In contrast to other solutions, we will check for both.
If this is the case, whether we keep that space or tab in the output depends on whether we are at the beginning of the line: we drop it there and keep it anywhere else. So, the result is the inverse of beginOfLine.
If the character is not a space or tab, then we check for a newline. If we found one, then we set the beginOfLine flag to true, else to false. In any case, we want to use the character.
All this can be put into a simple stateful Lambda
auto check = [beginOfLine = true](const char c) mutable -> bool {
if ((c == ' ') || (c == '\t') )
return !beginOfLine;
beginOfLine = (c == '\n');
return true; };
or, more compact:
auto check = [beginOfLine = true](const char c) mutable -> bool {
if (c == ' ' || c == '\t') return !beginOfLine; beginOfLine = (c == '\n'); return true; };
Next, we will not erase the spaces from the original string, because this is a huge memory shifting activity that takes brutally long. Instead, we copy the data (characters) to a new string, but just the needed ones.
And for that, we can use the std::copy_if from the standard library.
std::copy_if(data.begin(), data.end(), data2.begin(), check);
This will do the work. And for 100MB data, it takes 160ms. Compared to 30 minutes this is a tremendous saving.
Please see the example code (that of course needs to be adapted for your needs):
#include <iostream>
#include <fstream>
#include <filesystem>
#include <iterator>
#include <algorithm>
#include <string>
namespace fs = std::filesystem;
constexpr size_t SizeOfIOStreamBuffer = 1'000'000;
static char ioBuffer[SizeOfIOStreamBuffer];
int main() {
// Path to text file
const fs::path file{ "r:\\test.txt" };
// Open the file and check, if it could be opened
if (std::ifstream fileStream(file); fileStream) {
// Lambda that identifies whether we have a space or tab at the beginning of a line or not
auto check = [beginOfLine = true](const char c) mutable -> bool {
if (c == ' ' || c == '\t') return !beginOfLine; beginOfLine = (c == '\n'); return true; };
// Huge string with all file data
std::string data{};
// Size the string up front to speed things up and to avoid unnecessary allocations
data.resize(fs::file_size(file));
// Used buffered IO with a huge iobuffer
fileStream.rdbuf()->pubsetbuf(ioBuffer, SizeOfIOStreamBuffer);
// Read file, eliminate spaces and tabs at the beginning of each line and store in data
std::copy_if(std::istreambuf_iterator<char>(fileStream), {}, data.begin(), check);
}
return 0;
}
As you can see, it all boils down to one statement in the code. And this runs (on my machine) in 160ms for a 100MB file.
What can be optimized further? With the first std::copy_if call shown earlier, which copies from data to data2, we have 2 100MB std::strings in our software. What a waste. The final optimization (which the example code above already uses) is to put the 2 statements for file reading and for removing spaces and tabs at the beginning of a line into one statement. And the beauty of it is that by using modern C++ language elements, only minor modifications are necessary. Just exchange the source iterators:
std::copy_if(std::istreambuf_iterator<char>(fileStream), {}, data.begin(), check);
We will then have the data in memory only once, and eliminate the nonsense of reading data from a file that we do not need.
Yes, I know that the string size is too big in the end, but it can be set to the actual value easily, for example by using data.reserve(...) and std::back_inserter.
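The reserve plus std::back_inserter variant mentioned in the last sentence could look roughly like this. The function name read_without_leading_blanks and the size_hint parameter are assumptions for illustration. Like the code above, it relies on std::copy_if applying a single copy of the stateful lambda to the characters in order, which common implementations do.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>   // only needed for the usage described below
#include <string>

// Hypothetical helper: reads everything from `in`, dropping spaces and tabs
// at the beginning of each line. `size_hint` is an optional capacity estimate.
std::string read_without_leading_blanks(std::istream& in, std::size_t size_hint = 0)
{
    auto check = [beginOfLine = true](const char c) mutable -> bool {
        if (c == ' ' || c == '\t') return !beginOfLine;
        beginOfLine = (c == '\n');
        return true;
    };

    std::string data;
    data.reserve(size_hint);  // capacity only; the size stays 0

    // back_inserter appends, so the final size is exactly the number of kept characters
    std::copy_if(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>(),
                 std::back_inserter(data), check);
    return data;
}
```

With a file stream you would pass fs::file_size(file) as the hint; for a quick check, a std::istringstream works as the input as well.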

Tokenize elements from a text file by removing comments, extra spaces and blank lines in C++

I'm trying to eliminate comments, blank lines and extra spaces within a text file, then tokenize the elements leftover. Each token needs a space before and after.
exampleFile.txt
var
/* declare variables */a1 ,
b2a , c,
Here's what's working as of now,
string line; //line: represents one line of text from file
ifstream InputFile("exampleFile", ios::in); //read from exampleFile.txt
//Remove comments
while (InputFile && getline(InputFile, line, '\0'))
{
while (line.find("/*") != string::npos)
{
size_t Begin = line.find("/*");
line.erase(Begin, (line.find("*/", Begin) - Begin) + 2);
// Start at Begin, erase from Begin to where */ is found
}
}
This removes comments, but I can't seem to figure out a way to tokenize while this is happening.
So my questions are:
Is it possible to remove comments, spaces, and empty lines and tokenize all in this while statement?
How can I implement a function to add spaces in between each token before they are tokenized? Tokens like c, need to be recognized as c and , individually.
Thank you in advance for the help!
If you need to skip whitespace characters and you don't care about new lines then I'd recommend reading the file with operator>>.
You could write simply:
std::string word;
bool isComment = false;
while(file >> word)
{
if (isInsideComment(word, isComment))
continue;
// do processing of the tokens here
std::cout << word << std::endl;
}
Where the helper function could be implemented as follows:
bool isInsideComment(std::string &word, bool &isComment)
{
const std::string tagStart = "/*";
const std::string tagStop = "*/";
// match start marker
if (word.size() >= tagStart.size() && std::equal(tagStart.rbegin(), tagStart.rend(), word.rbegin())) // ends with tagStart (size check avoids reading past short words)
{
isComment = true;
if (word == tagStart)
return true;
word = word.substr(0, word.find(tagStart));
return false;
}
// match end marker
if (isComment)
{
if (word.size() >= tagStop.size() && std::equal(tagStop.begin(), tagStop.end(), word.begin())) // starts with tagStop (size check avoids reading past short words)
{
isComment = false;
word = word.substr(tagStop.size());
return false;
}
return true;
}
return false;
}
For your example this would print out:
var
a1
,
b2a
,
c,
The above logic should also handle multiline comments if you're interested.
However, note that the function implementation should be modified according to your assumptions about the comment tokens. For instance, are they always separated by whitespace from other words? Or is it possible that a var1/*comment*/var2 expression would be parsed? The above example won't work in such a situation.
Hence, another option would be (what you already started implementing) reading lines or even chunks of data from the file (to assure begin and end comment tokens are matched) and learning positions of the comment markers with find or regex to remove them afterwards.
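A sketch of that second option, assuming the whole file has already been read into one string: first remove the comments with find, then split what is left. The names strip_block_comments and tokenize are hypothetical, and treating every non-alphanumeric, non-space character as its own token is an assumption about the token rules.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: replaces every /* ... */ block with a single space.
// An unterminated comment is removed up to the end of the text.
std::string strip_block_comments(std::string text)
{
    std::string::size_type begin = 0;
    while ((begin = text.find("/*", begin)) != std::string::npos) {
        const std::string::size_type end = text.find("*/", begin + 2);
        if (end == std::string::npos) {
            text.erase(begin);
            break;
        }
        text.replace(begin, end + 2 - begin, " ");  // the space keeps neighbours apart
    }
    return text;
}

// Hypothetical helper: words (letters, digits, '_') and single punctuation
// characters become tokens; whitespace and blank lines are dropped.
std::vector<std::string> tokenize(const std::string& text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (const char ch : text) {
        const unsigned char uc = static_cast<unsigned char>(ch);
        if (std::isalnum(uc) || ch == '_') {
            current.push_back(ch);
            continue;
        }
        if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
        if (!std::isspace(uc))
            tokens.push_back(std::string(1, ch));
    }
    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}
```

Because the comment is removed before splitting, var1/*comment*/var2 yields the two tokens var1 and var2, and c, is split into c and , as the question asks.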

Parsing a csv with comma in field

I'm trying to create an object using a csv with the below data
Alonso,Fernando,21,31,29,2,Racing
Dhoni,Mahendra Singh,22,30,4,26,Cricket
Wade,Dwyane,23,29.9,18.9,11,Basketball
Anthony,Carmelo,24,29.4,21.4,8,Basketball
Klitschko,Wladimir,25,28,24,4,Boxing
Manning,Peyton,26,27.1,15.1,12,Football
Stoudemire,Amar'e,27,26.7,21.7,5,Basketball
"Earnhardt, Jr.",Dale,28,25.9,14.9,11,Racing
Howard,Dwight,29,25.5,20.5,5,Basketball
Lee,Cliff,30,25.3,25.1,0.2,Baseball
Mauer,Joe,31,24.8,23,1.8,Baseball
Cabrera,Miguel,32,24.6,22.6,2,Baseball
Greinke,Zack,33,24.5,24.4,50,Baseball
Sharapova,Maria,34,24.4,2.4,22,Tennis
Jeter,Derek,35,24.3,15.3,9,Baseball
I'm using the following code to parse it:
void AthleteDatabase::createDatabase(void)
{
ifstream inFile(INPUT_FILE.c_str());
string inputString;
if(!inFile)
{
cout << "Error opening file for input: " << INPUT_FILE << endl;
}
else
{
getline(inFile, inputString);
while(inFile)
{
istringstream s(inputString);
string field;
string athleteArray[7];
int counter = 0;
while(getline(s, field, ','))
{
athleteArray[counter] = field;
counter++;
}
string lastName = athleteArray[0];
string firstName = athleteArray[1];
int rank = atoi(athleteArray[2].c_str());
float totalEarnings = strtof(athleteArray[3].c_str(), NULL);
float salary = strtof(athleteArray[4].c_str(), NULL);
float endorsements = strtof(athleteArray[5].c_str(), NULL);
string sport = athleteArray[6];
Athlete anAthlete(lastName, firstName, rank,
totalEarnings, salary, endorsements, sport);
athleteDatabaseBST.add(anAthlete);
display(anAthlete);
getline(inFile, inputString);
}
inFile.close();
}
}
My code breaks on the line:
"Earnhardt, Jr.",Dale,28,25.9,14.9,11,Racing
obviously because of the quotes. Is there a better way to handle this? I'm still extremely new to C++ so any assistance would be greatly appreciated.
I'd recommend just using a proper CSV parser. You can find some in the answers to this earlier question, or just search for one on Google.
If you insist on rolling your own, it's probably easiest to just get down to the basics and design it as a finite state machine that processes the input one character at a time. With a one-character look-ahead, you basically need two states: "reading normal input" and "reading a quoted string". If you don't want to use look-ahead, you can do this with a couple more states, e.g. like this:
initial state: If next character is a quote, switch to state quoted field; else behave as if in state unquoted field.
unquoted field: If next character is EOF, end parsing; else, if it is a newline, start a new row and switch to initial state; else, if it is a separator (comma), start a new field in the same row and switch to initial state; else append the character to the current field and remain in state unquoted field. (Optionally, if the character is a quote, signal a parse error.)
quoted field: If next character is EOF, signal parse error; else, if it is a quote, switch to state end quote; else append the character to the current field and remain in state quoted field.
end quote: If next character is a quote, append it to the current field and return to state quoted field; else, if it is a comma or a newline or EOF, behave as if in state unquoted field; else signal parse error.
(This is for "traditional" CSV, as described e.g. in RFC 4180, where quotes in quoted fields are escaped by doubling them. Adding support for backslash-escapes, which are used in some fairly common variants of the CSV format, is left as an exercise. It requires one or two more states, depending on whether you want to to support backslashes in quoted or unquoted strings or both, and whether you want to support both traditional and backslash escapes at the same time.)
In a high-level scripting language, such character-by-character iteration would be really inefficient, but since you're writing C++, all it needs to be blazing fast is some half-decent I/O buffering and a reasonably efficient string append operation.
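The four states above could be sketched as follows. The names parse_csv and Row are assumptions, the parser works on input already held in a string, and it only treats '\n' as a row terminator (no "\r\n" handling).

```cpp
#include <stdexcept>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Hypothetical helper: character-by-character state machine for
// "traditional" CSV with doubled quotes inside quoted fields.
std::vector<Row> parse_csv(const std::string& input)
{
    enum class State { Initial, Unquoted, Quoted, EndQuote };
    State state = State::Initial;
    std::vector<Row> rows;
    Row row;
    std::string field;

    for (const char ch : input) {
        if (state == State::Quoted) {
            if (ch == '"') state = State::EndQuote;
            else field.push_back(ch);
            continue;
        }
        if (state == State::EndQuote && ch == '"') {  // doubled quote: literal quote
            field.push_back('"');
            state = State::Quoted;
            continue;
        }
        if (state == State::Initial && ch == '"') {   // opening quote
            state = State::Quoted;
            continue;
        }
        // Everything else behaves as in the unquoted-field state
        if (ch == ',') {
            row.push_back(field);
            field.clear();
            state = State::Initial;
        } else if (ch == '\n') {
            row.push_back(field);
            field.clear();
            rows.push_back(row);
            row.clear();
            state = State::Initial;
        } else if (state == State::EndQuote) {
            throw std::runtime_error("unexpected character after closing quote");
        } else {
            field.push_back(ch);
            state = State::Unquoted;
        }
    }
    if (state == State::Quoted)
        throw std::runtime_error("unterminated quoted field");
    // Flush a final row that is not terminated by a newline
    if (state != State::Initial || !row.empty()) {
        row.push_back(field);
        rows.push_back(row);
    }
    return rows;
}
```

Each Row can then be converted to an Athlete the same way the question's code converts athleteArray.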
You have to parse each line character by character, using a bool flag, and a std::string that accumulates the contents of the next field; instead of just plowing ahead to the next comma, as you did.
Initially, the bool flag is false, and you iterate over the entire line, character by character. The quote character flips the bool flag. The comma character, only when the bool flag is false takes the accumulated contents of the std::string and saves it as the next field on the line, and clears the std::string to empty, ready for the next field. Otherwise, the character gets appended to the buffer.
This is a basic outline of the algorithm, with some minor details that you should be able to flesh out by yourself. There are a couple of other ways to do this, that are slightly more efficient, but for a beginner like yourself this kind of an approach would be the easiest to implement.
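One way to flesh out that outline, as a sketch: the function name split_csv_line is hypothetical, and quote characters are simply dropped, so doubled quotes inside a quoted field are not supported.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits one CSV line, ignoring commas inside quotes.
std::vector<std::string> split_csv_line(const std::string& line)
{
    std::vector<std::string> fields;
    std::string current;     // accumulates the contents of the next field
    bool inQuotes = false;   // flipped by every quote character

    for (const char ch : line) {
        if (ch == '"') {
            inQuotes = !inQuotes;
        } else if (ch == ',' && !inQuotes) {
            fields.push_back(current);  // field complete
            current.clear();
        } else {
            current.push_back(ch);
        }
    }
    fields.push_back(current);  // the last field has no trailing comma
    return fields;
}
```

In the question's loop this would replace the inner while(getline(s, field, ',')) and return the seven fields directly.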
Simple answer: use a different delimiter. Everything's a lot easier to parse if you use something like '|' instead:
Stoudemire|Amar'e|27|26.7|21.7|5|Basketball
Earnhardt, Jr.|Dale|28|25.9|14.9|11|Racing
The advantage there being any other app that might need to parse your file can also do it just as cleanly.
If sticking with commas is a requirement, then you'd have to conditionally grab a field based on its first char:
std::istream& nextField(std::istringstream& s, std::string& field)
{
char c;
if (s >> c) {
if (c == '"') {
// using " as the delimiter
getline(s, field, '"');
return s >> c; // for the subsequent comma
// could potentially assert for error-checking
}
else if (c == ',') {
// handle empty field case
field = "";
}
else {
// normal case, but prepend c
getline(s, field, ',');
field = c + field;
}
}
return s;
}
Used as a substitute for where you have getline:
while (nextField(s, field)) {
athleteVec.push_back(field); // prefer vector to array
}
Could even simplify that logic a bit by just continuing to use getline if we have an unterminated quoted string:
std::istream& nextField(std::istringstream& s, std::string& field)
{
if (std::getline(s, field, ',')) {
while (s && field[0] == '"' && field[field.size() - 1] != '"') {
std::string next;
std::getline(s, next, ',');
field += ',' + next;
}
if (field[0] == '"' && field[field.size() - 1] == '"') {
field = field.substr(1, field.size() - 2);
}
}
return s;
}
I agree with Imari's answer, why re-invent the wheel? That being said, have you considered regex? I believe this answer can be used to accomplish what you want and then some.

Implementing a find-and-replace procedure in C++

Just for fun, I'm trying to write a find-and-replace procedure like word processors have. I was wondering whether someone could help me figure out what I'm doing wrong (I'm getting a Timeout error) and could help me write a more elegant procedure.
#include <iostream>
#include <string>
void find_and_replace(std::string& text, const std::string& fword, const std::string& rword)
{
for (std::string::iterator it(text.begin()), offend(text.end()); it != offend;)
{
if (*it != ' ')
{
std::string::iterator wordstart(it);
std::string thisword;
while (*(it+1) != ' ' && (it+1) != offend)
thisword.push_back(*++it);
if (thisword == fword)
text.replace(wordstart, it, rword);
}
else {
++it;
}
}
}
int main()
{
std::string S("Yo, dawg, I heard you like ...");
std::string f("dawg");
std::string w("dog");
// Replace every instance of the word "dawg" with "dog":
find_and_replace(S, f, w);
std::cout << S;
return 0;
}
A find-and-replace like most editors have would involve regular
expressions. If all you're looking for is for literal
replacements, the function you need is std::search, to find
the text to be replaced, and std::string::replace, to do the
actual replacement. The only real issue you'll face:
std::string::replace can invalidate your iterators. You could
always start the search at the beginning of the string, but this
could lead to endless looping, if the replacement text contained
the search string (e.g. something like s/vector/std::vector/).
You should convert the iterator returned from std::search
to an offset into the string before doing the replace (offset
= iter - str.begin()), and convert it back to an iterator after
(iter = str.begin() + offset + replacement.size()). (The
addition of replacement.size() is to avoid rescanning the text
you just inserted, which can lead to an infinite loop, for the
same reasons as presented above.)
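A sketch of the literal replacement described above, using std::search and converting the iterator to an offset around each std::string::replace call. The function name replace_all is an assumption.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: replaces every occurrence of `from` in `str` with `to`.
void replace_all(std::string& str, const std::string& from, const std::string& to)
{
    if (from.empty()) return;  // an empty pattern would match everywhere

    std::string::size_type offset = 0;
    for (;;) {
        const auto iter = std::search(
            str.begin() + static_cast<std::string::difference_type>(offset),
            str.end(), from.begin(), from.end());
        if (iter == str.end()) break;

        // Convert to an offset before replace() invalidates the iterator
        offset = static_cast<std::string::size_type>(iter - str.begin());
        str.replace(offset, from.size(), to);

        // Skip the inserted text so it is never rescanned
        offset += to.size();
    }
}
```

With the s/vector/std::vector/ example, the skip past the inserted text is what keeps the loop from matching the vector inside each freshly inserted std::vector.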
using text.replace may invalidate any iterators into text (ie, both it and offend): this isn't safe
copying each character into a temporary string (which is created and destroyed every time you start a new word) is wasteful
The simplest thing that could possibly work is to:
1. use find to find the first matching substring: it returns a position which won't be invalidated when you replace substrings
2. check whether:
2.1. your substring is either at the start of the text, or preceded by a word separator
2.2. your substring is either at the end of the text, or succeeded by a word separator
3. if 2.1 and 2.2 are true, replace the substring
4. if you replaced it, increase position (from 1) by the length of your replacement string
5. otherwise increase position by the length of the string you searched for
6. repeat from 1, this time starting your find from position (from 4/5)
7. end when step 1 returns position std::string::npos.
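These steps can be sketched as follows. The names replace_whole_words and is_separator are hypothetical, and treating every non-alphanumeric character as a word separator is an assumption.

```cpp
#include <cctype>
#include <string>

// Assumption: anything that is not a letter or digit separates words.
bool is_separator(char c)
{
    return !std::isalnum(static_cast<unsigned char>(c));
}

// Hypothetical helper: replaces `word` with `replacement` where it stands alone.
void replace_whole_words(std::string& text, const std::string& word,
                         const std::string& replacement)
{
    if (word.empty()) return;

    std::string::size_type pos = 0;
    // Step 1 and 7: find the next match, stop at npos
    while ((pos = text.find(word, pos)) != std::string::npos) {
        const std::string::size_type after = pos + word.size();
        // Step 2: check both word boundaries
        const bool startOk = (pos == 0) || is_separator(text[pos - 1]);
        const bool endOk = (after == text.size()) || is_separator(text[after]);
        if (startOk && endOk) {
            text.replace(pos, word.size(), replacement);  // step 3
            pos += replacement.size();                    // step 4
        } else {
            pos += word.size();                           // step 5
        }
    }
}
```

Positions are plain offsets, so nothing is invalidated by the replacement.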
1) You don't push the first symbol of the found word into the "thisword" variable.
2) You use only the space symbol ' ' as a separator, but what about the comma ','? Your program will find the word "dawg," not "dawg".
The following code works, but you should think about other word separators. Do you really need to replace only whole words, or just a sequence of symbols?
#include <iostream>
#include <string>
void find_and_replace(std::string& text, const std::string& fword, const std::string& rword)
{
    // Use indices instead of iterators: text.replace() may invalidate iterators
    std::string::size_type pos = 0;
    while (pos < text.size())
    {
        if (text[pos] != ' ' && text[pos] != ',')
        {
            const std::string::size_type wordstart = pos;
            // Advance to the end of the current word
            while (pos < text.size() && text[pos] != ' ' && text[pos] != ',')
                ++pos;
            if (text.compare(wordstart, pos - wordstart, fword) == 0)
            {
                text.replace(wordstart, pos - wordstart, rword);
                pos = wordstart + rword.size(); // continue after the replacement
            }
        }
        else {
            ++pos;
        }
    }
}
int main()
{
    std::string S("Yo, dawg, I heard you like ...");
    std::string f("dawg");
    std::string w("dog");
    // Replace every instance of the word "dawg" with "dog":
    find_and_replace(S, f, w);
    std::cout << S;
    return 0;
}

C++ split string with space and punctuation chars

I wanna split a string using C++ which contains spaces and punctuations.
e.g. str = "This is a dog; A very good one."
I wanna get "This" "is" "a" "dog" "A" "very" "good" "one" 1 by 1.
It's quite easy with only one delimiter using getline but I don't know all the delimiters. It can be any punctuation chars.
Note: I don't wanna use Boost!
Use std::find_if() with a lambda to find the delimiter.
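auto it = std::find_if(str.begin(), str.end(), [] (const char element) -> bool {
const unsigned char c = static_cast<unsigned char>(element); // the <cctype> functions need a non-negative value
return std::isspace(c) || std::ispunct(c);});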
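To get all the words rather than just the first delimiter, the same lambda can drive a loop. This is a sketch; the function name split_words is an assumption, and it pairs std::find_if with std::find_if_not.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: returns the runs of characters that are neither
// whitespace nor punctuation.
std::vector<std::string> split_words(const std::string& str)
{
    const auto isDelim = [](const char element) -> bool {
        const unsigned char c = static_cast<unsigned char>(element);
        return std::isspace(c) || std::ispunct(c);
    };

    std::vector<std::string> words;
    // Skip leading delimiters
    auto wordBegin = std::find_if_not(str.begin(), str.end(), isDelim);
    while (wordBegin != str.end()) {
        // The word ends at the next delimiter or at the end of the string
        const auto wordEnd = std::find_if(wordBegin, str.end(), isDelim);
        words.emplace_back(wordBegin, wordEnd);
        wordBegin = std::find_if_not(wordEnd, str.end(), isDelim);
    }
    return words;
}
```

Since std::ispunct covers every punctuation character in the current C locale, the delimiters do not need to be listed up front.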
So, starting at the first position, you find the first valid token. You can use
index = str.find_first_not_of (yourDelimiters);
Then you have to find the first delimiter after this, so you can do
delimIndex = str.substr (index).find_first_of (yourDelimiters);
your first word will then be
// since delimIndex will essentially be the length of the word
word = str.substr (index, delimIndex);
Then you truncate your string and repeat. You have to, of course, handle all of the cases where find_first_not_of and find_first_of return npos, which means that character was/was not found, but I think that's enough to get started.
Btw, I'm not claiming that this is the best method, but it works...
vmpstr's solution works, but could be a bit tedious.
Some months ago, I wrote a C library that does what you want.
http://wiki.gosub100.com/doku.php?id=librerias:c:cadenas
Documentation has been written in Spanish (sorry).
It doesn't need external dependencies. Try with splitWithChar() function.
Example of use:
#include "string_functions.h"
int main(void){
char yourString[]= "This is a dog; A very good one.";
char* elementsArray[8];
int nElements;
int i;
/*------------------------------------------------------------*/
printf("Character split test:\n");
printf("Base String: %s\n",yourString);
nElements = splitWithChar(yourString, ' ', elementsArray);
printf("Found %d element.\n", nElements);
for (i=0;i<nElements;i++){
printf ("Element %d: %s\n", i, elementsArray[i]);
}
return 0;
}
The original string "yourString" is modified after using splitWithChar(), so be careful.
Good luck :)
C++, unlike Java, doesn't provide an elegant way to split a string by a delimiter. You can use the Boost library for that, but if you want to avoid it, manual logic will suffice.
vector<string> split(string s) {
vector<string> words;
string word = "";
for(char x: s) {
if(x == ' ' or x == ',' or x == '?' or x == ';' or x == '!'
or x == '.') {
if(word.length() > 0) {
words.push_back(word);
word = "";
}
}
else
word.push_back(x);
}
if(word.length() > 0) {
words.push_back(word);
}
return words;