How to access previous string while reading from a file? - c++

I need some help with a project I'm working on. The program reads strings from a file and handles them differently depending on whether they end with a punctuation mark. If a string has punctuation, you separate the punctuation and add it as a value under the string's key, then add "$" as the end-of-sentence value, and finally add the next string read as a value under the "^" key, which marks the beginning of the next sentence. I have finished the code for the case where the string ends with punctuation, but I'm not entirely sure what to do if it DOESN'T have punctuation.
Essentially, if the string read in doesn't have punctuation marks, I simply want to do: mapName[previousString].push_back(newString)
But how do I access that previous string? If I try to read in 2 strings at once I would still have to check for punctuation, which defeats the purpose of checking only once for punctuation. Apologies if this is a dumb question, but I've been working on this all day yesterday and today. Any help would be greatly appreciated!
void BookBot::readIn(const std::string & filename) {
ifstream inputFile;
string Startkey = "^"; //beginning of sentence
string value;
string value2;
inputFile.open(filename); //open file;
while(inputFile) {
inputFile >> value; //read a string into value
sanitize(value); //clean up string if needed
size_t end = value.size()-1;
if(isEndPunctuation(value[end])) {
string endKey = "$";
string endChar(1,value[end]);
value = value.substr(0,end);
markov_chain[value].push_back(endChar);
markov_chain[endChar].push_back(endKey);
markov_chain[endKey].push_back(Startkey);
inputFile >> value2;
sanitize(value2);
markov_chain[Startkey].push_back(value2);
} else {
//if it DOESN'T HAVE PUNCTUATION
//Essentially i just want to be able to do
//markov_chain[previousString].push_back(newString)
}
}
}

But how do I access that previous string?
Well, you have to remember it. Unfortunately your question doesn't make it very clear which strings qualify. I'll assume that it is value after processing in this code fragment.
string previousString;
while(inputFile) {
...
if(...) {
...
value = value.substr(0,end);
...
markov_chain[Startkey].push_back(value2);
previousString = value;
}
else {
markov_chain[previousString].push_back(value);
}
...
}
Edit:
From your comment it sounds like the else case may also need to set the previousString
else {
markov_chain[previousString].push_back(value);
previousString = value;
}
in which case it could just be moved to the bottom of the loop.
while(inputFile) {
...
previousString = value;
}

Related

How to read a complex input with istream&, string& and getline in c++?

I am very new to C++, so I apologize if this isn't a good question but I really need help in understanding how to use istream.
There is a project I have to create where it takes several amounts of input that can be on one line or multiple and then pass it to a vector (this is only part of the project and I would like to try the rest on my own), for example if I were to input this...
>> aaa bb
>> ccccc
>> ddd fff eeeee
Makes a vector of strings with "aaa", "bb", "ccccc", "ddd", "fff", "eeeee"
The input can be a char or string and the program stops asking for input when the return key is hit.
I know getline() gets a line of input and I could probably use a while loop to try and get the input such as...(correct me if I'm wrong)
while(!string.empty())
getline(cin, string);
However, I don't truly understand istream and it doesn't help that my class has not gone over pointers so I don't know how to use istream& or string& and pass it into a vector. On the project description, it said to NOT use stringstream but use functionality from getline(istream&, string&). Can anyone give somewhat of a detailed explanation as to how to make a function using getline(istream&, string&) and then how to use it in the main function?
Any little bit helps!
You're on the right track already; you'd only have to pre-fill the string with some dummy value to enter the while loop at all. More elegant:
std::string line;
do
{
std::getline(std::cin, line);
}
while(!line.empty());
This should already do the trick reading line by line (but possibly multiple words on one line!) and exiting, if the user enters an empty line (be aware that whitespace followed by newline won't be recognised as such!).
However, if anything on the stream goes wrong, you'll be trapped in an endless loop processing previous input again and again. So best check the stream state as well:
if(!std::getline(std::cin, line))
{
// this is some sample error handling - do whatever you consider appropriate...
std::cerr << "error reading from console" << std::endl;
return -1;
}
As there might be multiple words on a single line, you still have to split them. There are several ways to do so; quite an easy one is using a std::istringstream. You'll discover that it resembles what you are likely used to with std::cin:
std::istringstream s(line);
std::string word;
while(s >> word)
{
// append to vector...
}
Be aware that operator>> skips leading whitespace and stops at the first whitespace character after the word (or at the end of the stream, if reached), so you don't have to deal with whitespace explicitly.
OK, you're not allowed to use std::stringstream (well, I used std::istringstream, but I suppose this little difference doesn't count, does it?). That changes matters a little: it gets more complex, but on the other hand, we can decide ourselves what counts as a word and what as a separator. We might consider punctuation marks as separators just like whitespace, but allow digits to be part of words, so we'd accept e.g. ab.7c d as "ab", "7c", "d":
auto begin = line.begin();
auto end = begin;
while(end != line.end()) // iterate over each character
{
if(std::isalnum(static_cast<unsigned char>(*end)))
{
// we are inside a word; don't touch begin to remember where
// the word started
++end;
}
else
{
// non-alpha-numeric character!
if(end != begin)
{
// we discovered a word already
// (i. e. we did not move begin together with end)
words.emplace_back(begin, end);
// ('words' being your std::vector<std::string> to place the input into)
}
++end;
begin = end; // skip whatever we had already
}
}
// corner case: a line might end with a word NOT followed by whitespace
// this isn't covered within the loop, so we need to add another check:
if(end != begin)
{
words.emplace_back(begin, end);
}
It shouldn't be too difficult to adjust to different interpretations of what is a separator and what counts as word (e. g. std::isalpha(...) || *end == '_' to detect underscore as part of words, but digits not). There are quite a few helper functions you might find useful...
You could input the value of the first column, then call functions based on the value:
void Process_Value_1(std::istream& input, std::string& value);
void Process_Value_2(std::istream& input, std::string& value);
int main()
{
// ...
std::string first_value;
while (input_file >> first_value)
{
if (first_value == "aaa")
{
Process_Value_1(input_file, first_value);
}
else if (first_value == "ccc")
{
Process_Value_2(input_file, first_value);
}
//...
}
return 0;
}
A sample function could be:
void Process_Value_1(std::istream& input, std::string& value)
{
std::string b;
input >> b;
std::cout << value << "\t" << b << std::endl;
input.ignore(1000, '\n'); // Ignore until newline.
}
There are other methods to perform the process, such as using tables of function pointers and std::map.

Reading from FileStream with arbitrary delimiter

I have run into a problem reading messages from a file using C++. What people usually do is create a file stream and then use the getline() function to fetch each message. getline() accepts an additional delimiter parameter, so that it returns each "line" separated by that delimiter instead of the default '\n'. However, this delimiter has to be a char. In my use case, the delimiter in the messages may be something else, like "|--|", so I am looking for a solution that accepts a string as the delimiter instead of a char.
I have searched StackOverFlow a little bit and found some interesting posts.
Parse (split) a string in C++ using string delimiter (standard C++)
This one gives a solution that uses string::find() and string::substr() to parse with an arbitrary delimiter. However, all the solutions there assume the input is a string instead of a stream. In my case, the file stream data is too big to fit into memory at once (and loading it all would be wasteful), so it should be read message by message (or a batch of messages at a time).
Actually, having read through the gcc implementation of std::getline(), it seems much easier to handle the case where the delimiter is a single char: every time you load a chunk of characters, you can simply search for the delimiter and split on it. It is different if the delimiter is more than one char, because the delimiter itself may straddle two chunks, which causes many other corner cases.
I'm not sure whether anyone else has faced this kind of requirement before, or how you handled it elegantly. It seems it would be nice to have a standard function like istream& getNext (istream&& is, string& str, string delim). This seems like a general use case to me. Why isn't something like this in the standard library, so that people no longer have to implement their own version separately?
Thank you very much
The STL simply does not natively support what you are asking for. You will have to write your own function (or find a 3rd party function) that does what you need.
For instance, you can use std::getline() to read up to the first character of your delimiter, and then use std::istream::get() to read subsequent characters and compare them to the rest of your delimiter. For example:
std::istream& my_getline(std::istream &input, std::string &str, const std::string &delim)
{
if (delim.empty())
throw std::invalid_argument("delim cannot be empty!");
if (delim.size() == 1)
return std::getline(input, str, delim[0]);
str.clear();
std::string temp;
char ch;
bool found = false;
do
{
if (!std::getline(input, temp, delim[0]))
break;
str += temp;
found = true;
for (std::size_t i = 1; i < delim.size(); ++i)
{
if (!input.get(ch))
{
if (input.eof())
input.clear(std::ios_base::eofbit);
str.append(delim.c_str(), i);
return input;
}
if (delim[i] != ch)
{
str.append(delim.c_str(), i);
str += ch;
found = false;
break;
}
}
}
while (!found);
return input;
}
If you are OK with reading byte by byte, you could build a state transition table implementation of a finite state machine to recognize your stop condition:
std::string delimiter = "someString";
// initialize the table with a row per delimiter character, a column per possible byte value, and all zeros
std::vector<std::vector<int> > table(delimiter.size(), std::vector<int>(256, 0));
int endState = static_cast<int>(delimiter.size());
// in state i, seeing the i-th delimiter character moves to state i+1; anything else falls back to state 0
for (std::size_t i = 0; i < delimiter.size(); i++) {
    table[i][static_cast<unsigned char>(delimiter[i])] = static_cast<int>(i) + 1;
}
Now you can use it like this (input being your std::istream):
int currentState = 0;
int read = 0;
bool done = false;
while (!done && (read = input.get()) != std::char_traits<char>::eof()) {
    currentState = table[currentState][read];
    if (currentState == endState) {
        done = true;
    }
    // do your streamy stuff
}
Granted, this only works if every delimiter character fits in a single byte (e.g. extended ASCII), but it will work for something like your example. Note that, as written, a mismatch always falls back to state 0, so a delimiter that begins inside a failed partial match (e.g. the input "||--|" for the delimiter "|--|") is missed.
It seems easiest to create something like getline(): read up to the last character of the separator, then check whether the string is long enough to contain the separator and, if so, whether it ends with the separator. If it does not, carry on reading:
std::string getline(std::istream& in, std::string const& separator) {
    std::istreambuf_iterator<char> it(in), end;
    if (separator.empty()) { // empty separator -> return the entire stream
        return std::string(it, end);
    }
    std::string rc;
    char last(separator.back());
    while (it != end) {
        rc.push_back(*it);
        ++it; // consume the character before a possible return
        if (rc.back() == last
            && separator.size() <= rc.size()
            && rc.substr(rc.size() - separator.size()) == separator) {
            rc.resize(rc.size() - separator.size()); // drop the separator
            return rc;
        }
    }
    return rc; // no separator was found
}

Tokenize elements from a text file by removing comments, extra spaces and blank lines in C++

I'm trying to eliminate comments, blank lines and extra spaces within a text file, then tokenize the elements leftover. Each token needs a space before and after.
exampleFile.txt
var
/* declare variables */a1 ,
b2a , c,
Here's what's working as of now,
string line; //line: represents one line of text from file
ifstream InputFile("exampleFile", ios::in); //read from exampleFile.txt
//Remove comments
while (InputFile && getline(InputFile, line, '\0'))
{
while (line.find("/*") != string::npos)
{
size_t Begin = line.find("/*");
line.erase(Begin, (line.find("*/", Begin) - Begin) + 2);
// Start at Begin, erase from Begin to where */ is found
}
}
This removes comments, but I can't seem to figure out a way to tokenize while this is happening.
So my questions are:
Is it possible to remove comments, spaces, and empty lines and tokenize all in this while statement?
How can I implement a function to add spaces in between each token before they are tokenized? Tokens like c, need to be recognized as c and , individually.
Thank you in advance for the help!
If you need to skip whitespace characters and you don't care about new lines then I'd recommend reading the file with operator>>.
You could write simply:
std::string word;
bool isComment = false;
while(file >> word)
{
if (isInsideComment(word, isComment))
continue;
// do processing of the tokens here
std::cout << word << std::endl;
}
Where the helper function could be implemented as follows:
bool isInsideComment(std::string &word, bool &isComment)
{
const std::string tagStart = "/*";
const std::string tagStop = "*/";
// match start marker
if (word.size() >= tagStart.size() && std::equal(tagStart.rbegin(), tagStart.rend(), word.rbegin())) // ends with tagStart
{
isComment = true;
if (word == tagStart)
return true;
word = word.substr(0, word.find(tagStart));
return false;
}
// match end marker
if (isComment)
{
if (word.size() >= tagStop.size() && std::equal(tagStop.begin(), tagStop.end(), word.begin())) // starts with tagStop
{
isComment = false;
word = word.substr(tagStop.size());
return false;
}
return true;
}
return false;
}
For your example this would print out:
var
a1
,
b2a
,
c,
The above logic should also handle multiline comments if you're interested.
However, note that the function implementation should be adapted to your assumptions about the comment markers. For instance, are they always separated from other words by whitespace? Or could an expression like var1/*comment*/var2 appear? The above example won't work in that situation.
Hence, another option would be (what you already started implementing) reading lines or even chunks of data from the file (to ensure that begin and end comment markers are matched) and locating the comment markers with find or a regex in order to remove them afterwards.
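A minimal sketch of that second option follows, assuming the whole text (or a chunk known to contain complete comments) is already in a std::string. The names stripComments and tokenize are hypothetical, and the rule that every non-alphanumeric, non-space character becomes its own token is an assumption chosen so that c, splits into c and , as the question requires.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Removes every /* ... */ span; an unterminated comment runs to the end of the text.
std::string stripComments(std::string text) {
    std::size_t begin;
    while ((begin = text.find("/*")) != std::string::npos) {
        std::size_t end = text.find("*/", begin + 2);
        if (end == std::string::npos) {
            text.erase(begin);
        } else {
            // Replace with a space so that var1/*c*/var2 does not fuse into one word.
            text.replace(begin, end + 2 - begin, " ");
        }
    }
    return text;
}

// Words are runs of alphanumerics or '_'; any other non-space character is its own token.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        unsigned char uc = static_cast<unsigned char>(c);
        if (std::isalnum(uc) || c == '_') {
            current += c;
            continue;
        }
        if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
        if (!std::isspace(uc)) {
            tokens.push_back(std::string(1, c));   // punctuation token
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}
```

Whitespace and blank lines disappear as a side effect of tokenizing, so printing each token with a space before and after it gives the layout asked for in the question.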

Parsing a csv with comma in field

I'm trying to create an object using a csv with the below data
Alonso,Fernando,21,31,29,2,Racing
Dhoni,Mahendra Singh,22,30,4,26,Cricket
Wade,Dwyane,23,29.9,18.9,11,Basketball
Anthony,Carmelo,24,29.4,21.4,8,Basketball
Klitschko,Wladimir,25,28,24,4,Boxing
Manning,Peyton,26,27.1,15.1,12,Football
Stoudemire,Amar'e,27,26.7,21.7,5,Basketball
"Earnhardt, Jr.",Dale,28,25.9,14.9,11,Racing
Howard,Dwight,29,25.5,20.5,5,Basketball
Lee,Cliff,30,25.3,25.1,0.2,Baseball
Mauer,Joe,31,24.8,23,1.8,Baseball
Cabrera,Miguel,32,24.6,22.6,2,Baseball
Greinke,Zack,33,24.5,24.4,50,Baseball
Sharapova,Maria,34,24.4,2.4,22,Tennis
Jeter,Derek,35,24.3,15.3,9,Baseball
I'm using the following code to parse it:
void AthleteDatabase::createDatabase(void)
{
ifstream inFile(INPUT_FILE.c_str());
string inputString;
if(!inFile)
{
cout << "Error opening file for input: " << INPUT_FILE << endl;
}
else
{
getline(inFile, inputString);
while(inFile)
{
istringstream s(inputString);
string field;
string athleteArray[7];
int counter = 0;
while(getline(s, field, ','))
{
athleteArray[counter] = field;
counter++;
}
string lastName = athleteArray[0];
string firstName = athleteArray[1];
int rank = atoi(athleteArray[2].c_str());
float totalEarnings = strtof(athleteArray[3].c_str(), NULL);
float salary = strtof(athleteArray[4].c_str(), NULL);
float endorsements = strtof(athleteArray[5].c_str(), NULL);
string sport = athleteArray[6];
Athlete anAthlete(lastName, firstName, rank,
totalEarnings, salary, endorsements, sport);
athleteDatabaseBST.add(anAthlete);
display(anAthlete);
getline(inFile, inputString);
}
inFile.close();
}
}
My code breaks on the line:
"Earnhardt, Jr.",Dale,28,25.9,14.9,11,Racing
obviously because of the quotes. Is there a better way to handle this? I'm still extremely new to C++ so any assistance would be greatly appreciated.
I'd recommend just using a proper CSV parser. You can find some in the answers to this earlier question, or just search for one on Google.
If you insist on rolling your own, it's probably easiest to just get down to the basics and design it as a finite state machine that processes the input one character at a time. With a one-character look-ahead, you basically need two states: "reading normal input" and "reading a quoted string". If you don't want to use look-ahead, you can do this with a couple more states, e.g. like this:
initial state: If next character is a quote, switch to state quoted field; else behave as if in state unquoted field.
unquoted field: If next character is EOF, end parsing; else, if it is a newline, start a new row and switch to initial state; else, if it is a separator (comma), start a new field in the same row and switch to initial state; else append the character to the current field and remain in state unquoted field. (Optionally, if the character is a quote, signal a parse error.)
quoted field: If next character is EOF, signal parse error; else, if it is a quote, switch to state end quote; else append the character to the current field and remain in state quoted field.
end quote: If next character is a quote, append it to the current field and return to state quoted field; else, if it is a comma or a newline or EOF, behave as if in state unquoted field; else signal parse error.
(This is for "traditional" CSV, as described e.g. in RFC 4180, where quotes in quoted fields are escaped by doubling them. Adding support for backslash-escapes, which are used in some fairly common variants of the CSV format, is left as an exercise. It requires one or two more states, depending on whether you want to support backslashes in quoted or unquoted strings or both, and whether you want to support both traditional and backslash escapes at the same time.)
In a high-level scripting language, such character-by-character iteration would be really inefficient, but since you're writing C++, all it needs to be blazing fast is some half-decent I/O buffering and a reasonably efficient string append operation.
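The four states described above could be coded roughly as follows. This is a sketch for a single record that has already been read into a string, so embedded newlines and EOF handling are out of scope; parseCsvRecord and the State enum are names assumed for illustration, and a quote inside an unquoted field is accepted leniently instead of being reported.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

enum class State { Initial, Unquoted, Quoted, EndQuote };

// Parses one CSV record (no embedded newlines) into its fields.
std::vector<std::string> parseCsvRecord(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    State state = State::Initial;
    for (char c : line) {
        if (state == State::Initial && c == '"') {
            state = State::Quoted;             // opening quote
        } else if (state == State::Quoted) {
            if (c == '"') state = State::EndQuote;
            else field += c;                   // commas are literal in here
        } else if (state == State::EndQuote && c == '"') {
            field += '"';                      // doubled quote = literal quote
            state = State::Quoted;
        } else if (c == ',') {                 // Initial, Unquoted or EndQuote
            fields.push_back(field);
            field.clear();
            state = State::Initial;
        } else if (state == State::EndQuote) {
            throw std::runtime_error("unexpected character after closing quote");
        } else {
            field += c;                        // Initial or Unquoted
            state = State::Unquoted;
        }
    }
    if (state == State::Quoted) {
        throw std::runtime_error("unterminated quoted field");
    }
    fields.push_back(field);                   // the last field has no trailing comma
    return fields;
}
```

In the question's code this would replace the inner while(getline(s, field, ',')) loop: call parseCsvRecord(inputString) and index into the returned vector.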
You have to parse each line character by character, using a bool flag, and a std::string that accumulates the contents of the next field; instead of just plowing ahead to the next comma, as you did.
Initially, the bool flag is false, and you iterate over the entire line, character by character. The quote character flips the bool flag. The comma character, but only when the bool flag is false, takes the accumulated contents of the std::string and saves it as the next field on the line, and clears the std::string to empty, ready for the next field. Any other character gets appended to the std::string.
This is a basic outline of the algorithm, with some minor details that you should be able to flesh out by yourself. There are a couple of other ways to do this, that are slightly more efficient, but for a beginner like yourself this kind of an approach would be the easiest to implement.
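The outline above translates almost line for line into code. In this sketch the function name splitCsvLine is an assumption, and quote characters are simply dropped, so doubled quotes inside a quoted field are not supported.

```cpp
#include <string>
#include <vector>

// Splits one CSV line; commas inside double quotes do not end a field.
std::vector<std::string> splitCsvLine(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;           // accumulates the current field
    bool inQuotes = false;       // the bool flag from the description
    for (char c : line) {
        if (c == '"') {
            inQuotes = !inQuotes;            // a quote flips the flag
        } else if (c == ',' && !inQuotes) {
            fields.push_back(field);         // field complete
            field.clear();
        } else {
            field += c;
        }
    }
    fields.push_back(field);     // the last field is not followed by a comma
    return fields;
}
```

For the problematic line from the question this yields seven fields, the first being Earnhardt, Jr. without the surrounding quotes.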
Simple answer: use a different delimiter. Everything's a lot easier to parse if you use something like '|' instead:
Stoudemire,Amar'e|27|26.7|21.7|5|Basketball
Earnhardt, Jr.|Dale|28|25.9|14.9|11|Racing
The advantage there being any other app that might need to parse your file can also do it just as cleanly.
If sticking with commas is a requirement, then you'd have to conditionally grab a field based on its first char:
std::istream& nextField(std::istringstream& s, std::string& field)
{
char c;
if (s >> c) {
if (c == '"') {
// using " as the delimiter
getline(s, field, '"');
return s >> c; // for the subsequent comma
// could potentially assert for error-checking
}
else if (c == ',') {
// handle empty field case
field = "";
}
else {
// normal case, but prepend c
getline(s, field, ',');
field = c + field;
}
}
return s;
}
Used as a substitute for where you have getline:
while (nextField(s, field)) {
athleteVec.push_back(field); // prefer vector to array
}
Could even simplify that logic a bit by just continuing to use getline if we have an unterminated quoted string:
std::istream& nextField(std::istringstream& s, std::string& field)
{
if (std::getline(s, field, ',')) {
while (s && field[0] == '"' && field[field.size() - 1] != '"') {
std::string next;
std::getline(s, next, ',');
field += ',' + next;
}
if (field[0] == '"' && field[field.size() - 1] == '"') {
field = field.substr(1, field.size() - 2);
}
}
return s;
}
I agree with Imari's answer, why re-invent the wheel? That being said, have you considered regex? I believe this answer can be used to accomplish what you want and then some.

Cleaning a string of punctuation in C++

Ok so before I even ask my question I want to make one thing clear. I am currently a student at NIU for Computer Science and this does relate to one of my assignments for a class there. So if anyone has a problem read no further and just go on about your business.
Now, for anyone who is willing to help, here's the situation. For my current assignment we have to read a file that is just a block of text. For each word in the file we are to clear any punctuation in the word (ex: "can't" would end up as "can" and "that--to" would end up as "that", obviously without the quotes; the quotes were used just to mark the examples).
The problem I've run into is that I can clean the string fine and then insert it into the map that we are using, but for some reason the code I have written allows an empty string to be inserted into the map. I've tried everything I can come up with to stop this from happening, and the only thing I've come up with is to use the erase method of the map structure itself.
So what I am looking for is two things: any suggestions about how I could a) fix this without simply erasing it and b) any improvements that I could make to the code I have already written.
Here are the functions I have written to read in from the file and then the one that cleans it.
Note: the function that reads in from the file calls the clean_entry function to get rid of punctuation before anything is inserted into the map.
Edit: Thank you Chris. Numbers are allowed :). If anyone has any improvements to the code I've written or any criticisms of something I did, I'll listen. At school we really don't get feedback on the correct, proper, or most efficient way to do things.
int get_words(map<string, int>& mapz)
{
int cnt = 0; //set out counter to zero
map<string, int>::const_iterator mapzIter;
ifstream input; //declare instream
input.open( "prog2.d" ); //open instream
assert( input ); //assure it is open
string s; //temp strings to read into
string not_s;
input >> s;
while(!input.eof()) //read in until EOF
{
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() == 0)
{
input >> s;
clean_entry(s, not_s);
}
mapz[not_s]++; //increment occurence
input >>s;
}
input.close(); //close instream
for(mapzIter = mapz.begin(); mapzIter != mapz.end(); mapzIter++)
cnt = cnt + mapzIter->second;
return cnt; //return number of words in instream
}
void clean_entry(const string& non_clean, string& clean)
{
int i, j, begin, end;
for(i = 0; isalnum(non_clean[i]) == 0 && non_clean[i] != '\0'; i++);
begin = i;
if(begin ==(int)non_clean.length())
return;
for(j = begin; isalnum(non_clean[j]) != 0 && non_clean[j] != '\0'; j++);
end = j;
clean = non_clean.substr(begin, (end-begin));
for(i = 0; i < (int)clean.size(); i++)
clean[i] = tolower(clean[i]);
}
The problem with empty entries is in your while loop. If you get an empty string, you clean the next one, and add it without checking. Try changing:
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() == 0)
{
input >> s;
clean_entry(s, not_s);
}
mapz[not_s]++; //increment occurence
input >>s;
to
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() > 0)
{
mapz[not_s]++; //increment occurence
}
input >>s;
EDIT: I notice you are checking if the characters are alphanumeric. If numbers are not allowed, you may need to revisit that area as well.
Further improvements would be to
declare variables only when you use them, and in the innermost scope
use c++-style casts instead of the c-style (int) casts
use empty() instead of length() == 0 comparisons
use the prefix increment operator for the iterators (i.e. ++mapzIter)
A blank string is a valid instance of the string class, so there's nothing special about adding it into the map. What you could do is first check whether it's empty, and only increment if it isn't:
if (!not_s.empty())
mapz[not_s]++;
Style-wise, there are a few things I'd change; one would be to return clean from clean_entry instead of modifying it:
string not_s = clean_entry(s);
...
string clean_entry(const string &non_clean)
{
string clean;
... // as before
if(begin ==(int)non_clean.length())
return clean;
... // as before
return clean;
}
This makes it clearer what the function is doing (taking a string, and returning something based on that string).
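For completeness, a value-returning clean_entry could be written with standard algorithms instead of index loops. This is one possible sketch, not the asker's or the answerer's exact code; it keeps the same behaviour as the original (first run of alphanumeric characters, lower-cased).

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns the first run of alphanumeric characters in `raw`, lower-cased.
// An input without any alphanumeric character yields an empty string.
std::string clean_entry(const std::string& raw) {
    // Taking unsigned char avoids undefined behaviour for negative char values.
    auto isAlnum = [](unsigned char c) { return std::isalnum(c) != 0; };

    auto first = std::find_if(raw.begin(), raw.end(), isAlnum);   // skip leading punctuation
    auto last = std::find_if_not(first, raw.end(), isAlnum);      // end of the first run

    std::string clean(first, last);
    std::transform(clean.begin(), clean.end(), clean.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return clean;
}
```

With this version the caller writes string not_s = clean_entry(s); and checks not_s.empty() before touching the map.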
The function 'get_words' is doing a lot of distinct actions that could be split out into other functions. There's a good chance that by splitting it up into its individual parts, you would have found the bug yourself.
From the basic structure, I think you could split the code into (at least):
getNextWord: Return the next (non blank) word from the stream (returns false if none left)
clean_entry: What you have now
getNextCleanWord: Calls getNextWord, and if it returns 'true' calls clean_entry. Returns 'false' if no words are left.
The signatures of 'getNextWord' and 'getNextCleanWord' might look something like:
bool getNextWord (std::ifstream & input, std::string & str);
bool getNextCleanWord (std::ifstream & input, std::string & str);
The idea is that each function does a smaller more distinct part of the problem. For example, 'getNextWord' does nothing but get the next non blank word (if there is one). This smaller piece therefore becomes an easier part of the problem to solve and debug if necessary.
The main component of 'get_words' can then be simplified down to:
std::string nextCleanWord;
while (getNextCleanWord (input, nextCleanWord))
{
++mapz[nextCleanWord];
}
An important aspect to development, IMHO, is to try to Divide and Conquer the problem. Split it up into the individual tasks that need to take place. These sub-tasks will be easier to complete and should also be easier to maintain.