Cleaning a string of punctuation in C++ - c++

Ok so before I even ask my question I want to make one thing clear. I am currently a student at NIU for Computer Science and this does relate to one of my assignments for a class there. So if anyone has a problem read no further and just go on about your business.
Now for anyone who is willing to help, here's the situation. For my current assignment we have to read a file that is just a block of text. For each word in the file we are to clear any punctuation in the word (e.g. "can't" would end up as "can" and "that--to" would end up as "that", obviously without the quotes; the quotes are only there to mark the examples).
The problem I've run into is that I can clean the string fine and then insert it into the map that we are using, but for some reason the code I have written allows an empty string to be inserted into the map. I've tried everything I can come up with to stop this from happening, and the only thing that works is to use the erase method of the map itself.
So what I am looking for is two things: any suggestions about how I could a) fix this without simply erasing the entry and b) improve the code I have already written.
Here are the functions I have written to read in from the file and then the one that cleans it.
Note: the function that reads in from the file calls the clean_entry function to get rid of punctuation before anything is inserted into the map.
Edit: Thank you Chris. Numbers are allowed :). If anyone has any improvements to the code I've written or any criticisms of something I did, I'll listen. At school we really don't get feedback on the correct, proper, or most efficient way to do things.
int get_words(map<string, int>& mapz)
{
int cnt = 0; //set out counter to zero
map<string, int>::const_iterator mapzIter;
ifstream input; //declare instream
input.open( "prog2.d" ); //open instream
assert( input ); //assure it is open
string s; //temp strings to read into
string not_s;
input >> s;
while(!input.eof()) //read in until EOF
{
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() == 0)
{
input >> s;
clean_entry(s, not_s);
}
mapz[not_s]++; //increment occurence
input >>s;
}
input.close(); //close instream
for(mapzIter = mapz.begin(); mapzIter != mapz.end(); mapzIter++)
cnt = cnt + mapzIter->second;
return cnt; //return number of words in instream
}
void clean_entry(const string& non_clean, string& clean)
{
int i, j, begin, end;
for(i = 0; isalnum(non_clean[i]) == 0 && non_clean[i] != '\0'; i++);
begin = i;
if(begin ==(int)non_clean.length())
return;
for(j = begin; isalnum(non_clean[j]) != 0 && non_clean[j] != '\0'; j++);
end = j;
clean = non_clean.substr(begin, (end-begin));
for(i = 0; i < (int)clean.size(); i++)
clean[i] = tolower(clean[i]);
}

The problem with empty entries is in your while loop. If you get an empty string, you clean the next one, and add it without checking. Try changing:
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() == 0)
{
input >> s;
clean_entry(s, not_s);
}
mapz[not_s]++; //increment occurence
input >>s;
to
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() > 0)
{
mapz[not_s]++; //increment occurence
}
input >>s;
EDIT: I notice you are checking if the characters are alphanumeric. If numbers are not allowed, you may need to revisit that area as well.

Further improvements would be to
declare variables only when you use them, and in the innermost scope
use c++-style casts instead of the c-style (int) casts
use empty() instead of length() == 0 comparisons
use the prefix increment operator for the iterators (i.e. ++mapzIter)

A blank string is a valid instance of the string class, so there's nothing special about adding it to the map. What you could do is first check whether it's empty, and only increment when it is not:
if (!not_s.empty())
mapz[not_s]++;
Style-wise, there's a few things I'd change, one would be to return clean from clean_entry instead of modifying it:
string not_s = clean_entry(s);
...
string clean_entry(const string &non_clean)
{
string clean;
... // as before
if(begin ==(int)non_clean.length())
return clean;
... // as before
return clean;
}
This makes it clearer what the function is doing (taking a string, and returning something based on that string).
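A minimal sketch of what a return-by-value clean_entry could look like, assuming the same rule as the question (keep the first run of alphanumeric characters and lower-case it). The use of std::find_if and std::find_if_not is my own choice here, not something from the original code:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns the first run of alphanumeric characters in `raw`, lower-cased.
// Returns an empty string when `raw` contains no alphanumeric character.
std::string clean_entry(const std::string& raw)
{
    // The <cctype> functions need a value representable as unsigned char.
    auto is_alnum = [](unsigned char c) { return std::isalnum(c) != 0; };

    auto first = std::find_if(raw.begin(), raw.end(), is_alnum);
    auto last = std::find_if_not(first, raw.end(), is_alnum);

    std::string clean(first, last);
    std::transform(clean.begin(), clean.end(), clean.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return clean;
}
```

The caller then only has to test the result with empty() before touching the map.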

The function 'getWords' is doing a lot of distinct actions that could be split out into other functions. There's a good chance that by splitting it up into its individual parts, you would have found the bug yourself.
From the basic structure, I think you could split the code into (at least):
getNextWord: Return the next (non blank) word from the stream (returns false if none left)
clean_entry: What you have now
getNextCleanWord: Calls getNextWord, and if 'true' calls clean_entry. Returns 'false' if no words left.
The signatures of 'getNextWord' and 'getNextCleanWord' might look something like:
bool getNextWord (std::ifstream & input, std::string & str);
bool getNextCleanWord (std::ifstream & input, std::string & str);
The idea is that each function does a smaller more distinct part of the problem. For example, 'getNextWord' does nothing but get the next non blank word (if there is one). This smaller piece therefore becomes an easier part of the problem to solve and debug if necessary.
The main component of 'getWords' then can be simplified down to:
std::string nextCleanWord;
while (getNextCleanWord (input, nextCleanWord))
{
++mapz[nextCleanWord];
}
An important aspect to development, IMHO, is to try to Divide and Conquer the problem. Split it up into the individual tasks that need to take place. These sub-tasks will be easier to complete and should also be easier to maintain.

Related

How to read a complex input with istream&, string& and getline in c++?

I am very new to C++, so I apologize if this isn't a good question but I really need help in understanding how to use istream.
There is a project I have to create where it takes several amounts of input that can be on one line or multiple and then pass it to a vector (this is only part of the project and I would like to try the rest on my own), for example if I were to input this...
>> aaa bb
>> ccccc
>> ddd fff eeeee
Makes a vector of strings with "aaa", "bb", "ccccc", "ddd", "fff", "eeeee"
The input can be a char or string and the program stops asking for input when the return key is hit.
I know getline() gets a line of input and I could probably use a while loop to try and get the input such as...(correct me if I'm wrong)
while(!string.empty())
getline(cin, string);
However, I don't truly understand istream and it doesn't help that my class has not gone over pointers so I don't know how to use istream& or string& and pass it into a vector. On the project description, it said to NOT use stringstream but use functionality from getline(istream&, string&). Can anyone give somewhat of a detailed explanation as to how to make a function using getline(istream&, string&) and then how to use it in the main function?
Any little bit helps!
You're on the right track already; the only catch is that you'd have to pre-fill the string with some dummy value to enter the while loop at all. More elegant:
std::string line;
do
{
std::getline(std::cin, line);
}
while(!line.empty());
This should already do the trick reading line by line (but possibly multiple words on one line!) and exiting, if the user enters an empty line (be aware that whitespace followed by newline won't be recognised as such!).
However, if anything on the stream goes wrong, you'll be trapped in an endless loop processing previous input again and again. So best check the stream state as well:
if(!std::getline(std::cin, line))
{
// this is some sample error handling - do whatever you consider appropriate...
std::cerr << "error reading from console" << std::endl;
return -1;
}
As there might be multiple words on a single line, you still have to split them. There are several ways to do so; quite an easy one is using a std::istringstream. You'll find that it resembles what you are likely used to with std::cin:
std::istringstream s(line);
std::string word;
while(s >> word)
{
// append to vector...
}
Be aware that operator>> skips leading whitespace and stops at the first whitespace after the word (or at the end of the stream), so you don't have to deal with whitespace explicitly.
OK, you're not allowed to use std::stringstream (well, I used std::istringstream, but I suppose this little difference doesn't count, does it?). That changes things a little: it gets more complex, but on the other hand we can decide ourselves what counts as a word and what as a separator. We might consider punctuation marks as separators just like whitespace, but allow digits to be part of words, so we'd accept e.g. ab.7c d as "ab", "7c", "d":
auto begin = line.begin();
auto end = begin;
while(end != line.end()) // iterate over each character
{
if(std::isalnum(static_cast<unsigned char>(*end)))
{
// we are inside a word; don't touch begin to remember where
// the word started
++end;
}
else
{
// non-alpha-numeric character!
if(end != begin)
{
// we discovered a word already
// (i. e. we did not move begin together with end)
words.emplace_back(begin, end);
// ('words' being your std::vector<std::string> to place the input into)
}
++end;
begin = end; // skip whatever we had already
}
}
// corner case: a line might end with a word NOT followed by whitespace
// this isn't covered within the loop, so we need to add another check:
if(end != begin)
{
words.emplace_back(begin, end);
}
It shouldn't be too difficult to adjust to different interpretations of what is a separator and what counts as word (e. g. std::isalpha(...) || *end == '_' to detect underscore as part of words, but digits not). There are quite a few helper functions you might find useful...
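As a sketch of how the rule for "what is a word character" can be made swappable, the splitter below takes the rule as a predicate. The names split_words, alnum_char and identifier_letter are hypothetical, and the loop builds each word character by character instead of using the iterator pair shown above:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Splits `line` into maximal runs of characters accepted by `is_word_char`.
template <typename Pred>
std::vector<std::string> split_words(const std::string& line, Pred is_word_char)
{
    std::vector<std::string> words;
    std::string current;
    for (unsigned char c : line) {
        if (is_word_char(c)) {
            current += static_cast<char>(c);
        } else if (!current.empty()) {
            words.push_back(current);   // a separator ends the current word
            current.clear();
        }
    }
    if (!current.empty())
        words.push_back(current);       // the line ended inside a word
    return words;
}

// Letters and digits form words.
bool alnum_char(unsigned char c) { return std::isalnum(c) != 0; }

// Letters and underscore form words; digits act as separators.
bool identifier_letter(unsigned char c) { return std::isalpha(c) != 0 || c == '_'; }
```

With alnum_char the input ab.7c d splits into "ab", "7c", "d"; with identifier_letter the digit becomes a separator as well.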
You could input the value of the first column, then call functions based on the value:
void Process_Value_1(std::istream& input, std::string& value);
void Process_Value_2(std::istream& input, std::string& value);
int main()
{
// ...
std::string first_value;
while (input_file >> first_value)
{
if (first_value == "aaa")
{
Process_Value_1(input_file, first_value);
}
else if (first_value == "ccc")
{
Process_Value_2(input_file, first_value);
}
//...
}
return 0;
}
A sample function could be:
void Process_Value_1(std::istream& input, std::string& value)
{
std::string b;
input >> b;
std::cout << value << "\t" << b << std::endl;
input.ignore(1000, '\n'); // Ignore until newline.
}
There are other methods to perform the process, such as using tables of function pointers and std::map.

Reading from FileStream with arbitrary delimiter

I have encountered a problem reading messages from a file using C++. What people usually do is create a file stream and then use the getline() function to fetch each message. getline() accepts an additional parameter as the delimiter, so that it returns each "line" separated by that delimiter instead of the default '\n'. However, this delimiter has to be a char. In my use case the delimiter in the message may be something else, like "|--|", so I am looking for a solution that accepts a string as the delimiter instead of a char.
I have searched StackOverFlow a little bit and found some interesting posts.
Parse (split) a string in C++ using string delimiter (standard C++)
This one gives a solution that uses string::find() and string::substr() to parse with an arbitrary delimiter. However, all the solutions there assume the input is a string instead of a stream. In my case, the file stream data is too big, or too wasteful, to fit into memory at once, so it should be read message by message (or a batch of messages at a time).
Actually, having read through the gcc implementation of std::getline(), it seems much easier to handle the case where the delimiter is a single char: every time you load a chunk of characters, you can search for the delimiter and split there. It is different if your delimiter is more than one char, because the delimiter itself may straddle two chunks, which causes many other corner cases.
I am not sure whether anyone else has faced this kind of requirement before and how you handled it elegantly. It seems it would be nice to have a standard function like istream& getNext (istream&& is, string& str, string delim)? This seems to be a general use case to me. Why is something like this not in the standard library, so that people no longer need to implement their own version separately?
Thank you very much
The STL simply does not natively support what you are asking for. You will have to write your own function (or find a 3rd party function) that does what you need.
For instance, you can use std::getline() to read up to the first character of your delimiter, and then use std::istream::get() to read subsequent characters and compare them to the rest of your delimiter. For example:
std::istream& my_getline(std::istream &input, std::string &str, const std::string &delim)
{
if (delim.empty())
throw std::invalid_argument("delim cannot be empty!");
if (delim.size() == 1)
return std::getline(input, str, delim[0]);
str.clear();
std::string temp;
char ch;
bool found = false;
do
{
if (!std::getline(input, temp, delim[0]))
break;
str += temp;
found = true;
for (int i = 1; i < delim.size(); ++i)
{
if (!input.get(ch))
{
if (input.eof())
input.clear(std::ios_base::eofbit);
str.append(delim.c_str(), i);
return input;
}
if (delim[i] != ch)
{
str.append(delim.c_str(), i);
str += ch;
found = false;
break;
}
}
}
while (!found);
return input;
}
If you are OK with reading byte by byte, you could build a state transition table implementation of a finite state machine to recognize your stop condition:
std::string delimeter="someString";
//initialize table with a row per target string character, a column per possible char and all zeros
std::vector<std::vector<int> > table(delimeter.size(),std::vector<int>(256,0));
int endState=delimeter.size();
//set the entry for the state looking for the next letter and finding that character to the next state
for(unsigned int i=0;i<delimeter.size();i++){
table[i][static_cast<unsigned char>(delimeter[i])]=i+1; //index by unsigned char so bytes >= 128 stay in range
}
Now you can use it like this:
int currentState=0;
int read=0;
bool done=false;
while(!done&&(read=in.get())>=0){ //'in' is your std::istream; get() returns a negative value at end of input
if(read>=256){
currentState=0;
}else{
currentState=table[currentState][read];
}
if(currentState==endState){
done=true;
}
//do your streamy stuff
}
granted this only works if the delimiter is in extended ASCII, but it will work fine for some thing like your example.
It seems, it is easiest to create something like getline(): read to the last character of the separator. Then check if the string is long enough for the separator and, if so, if it ends with the separator. If it is not, carry on reading:
std::string getline(std::istream& in, std::string& value, std::string const& separator) {
std::istreambuf_iterator<char> it(in), end;
if (separator.empty()) { // empty separator -> return the entire stream
return std::string(it, end);
}
std::string rc;
char last(separator.back());
for (; it != end; ++it) {
rc.push_back(*it);
if (rc.back() == last
&& separator.size() <= rc.size()
&& rc.substr(rc.size() - separator.size()) == separator) {
rc.resize(rc.size() - separator.size()); // drop the separator; resize() returns void
return rc;
}
}
return rc; // no separator was found
}

Erase words from a string (in C++)

I want to delete some words from a string but my code doesn't work . I don't have any errors or warnings , but I'm thinking that my string becomes empty. Could someone help me with this? I tried to convert my initial strings into 2 vectors, so that I can navigate more easily then
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;
int main()
{
string s("Somewhere down the road");
string t("down");
istringstream iss(s);
vector <string> plm;
vector <string> plm2;
do
{
string sub;
iss >> sub;
plm.push_back(sub);
} while (iss);
for(unsigned int i=0 ; i<plm.size();i++){
cout<<plm[i];}
istringstream ist(t);
do
{
string subb;
ist >> subb;
plm2.push_back(subb);
} while (ist);
for(int i=0;i<plm.size();i++){
for(int j=0;j<plm2.size();i++){
{if (plm[i]==plm2[j])
plm.erase(plm.begin()+j);}}}
for(int i=0 ; i<plm.size();i++)
cout<<plm[i];
}
Warning: this is really just a comment that's too long to fit in a comment field. Oh, and a bit of a rant at that.
I'm sure glad we have these modern languages to make life so much easier than it was decades ago. Consider, for example, what this job looked like in the long-since moribund SNOBOL 4 programming language:
s = 'somewhere down the road'
del s 'down' = :s(del)
OUTPUT = s
God, it's nice that we've since made so much progress that we don't have to deal with 3 whole lines of code, and we can now do the job with only 52 lines instead (oh, except that the 52 lines don't actually work, but let's ignore that for the moment).
I guess, in fairness, we can do the job a little more compactly in C++ though. One obvious way would be with std::remove_copy, some stream iterators, and a stringstream or two:
std::istringstream input("somewhere down the road");
std::string del_str("down");
std::istream_iterator<std::string> in(input), end;
std::ostringstream result;
std::remove_copy(in, end, std::ostream_iterator<std::string>(result, " "), del_str);
std::cout << result.str();
There is no benefit in converting to vector - string itself already provides all that is necessary for what you want to do. Anyway, do it this way:
vector<char> v;
v.assign(s.c_str(), s.c_str() + s.length()); // without...
v.assign(s.c_str(), s.c_str() + s.length() + 1); // including...
// ... terminating null character
Now it gets easy:
size_t pos = s.find(t);
if(pos != string::npos)
{
s.erase(pos, t.length());
}
This does not care, however, about leaving multiple whitespace or if t is not an entire word within s (e. g. t = "down"; s = "I'm going to downtown."; would result in s == "I'm going to town."), but you did not do so either...
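If whole-word matching does matter, a hedged sketch could check the characters on both sides of each match before erasing. The function name erase_whole_word and the rule that a word boundary is any non-alphanumeric character are assumptions on my part:

```cpp
#include <cctype>
#include <string>

// Erases the first occurrence of `word` in `s` that is not part of a longer word.
// Returns true if something was erased.
bool erase_whole_word(std::string& s, const std::string& word)
{
    if (word.empty())
        return false;
    auto is_word_char = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };

    for (std::string::size_type pos = s.find(word); pos != std::string::npos;
         pos = s.find(word, pos + 1)) {
        const std::string::size_type end = pos + word.size();
        const bool starts_word = (pos == 0) || !is_word_char(s[pos - 1]);
        const bool ends_word = (end == s.size()) || !is_word_char(s[end]);
        if (starts_word && ends_word) {
            // Also drop one following space so no double space is left behind.
            const bool space_follows = end < s.size() && s[end] == ' ';
            s.erase(pos, word.size() + (space_follows ? 1 : 0));
            return true;
        }
    }
    return false;
}
```

With this, "Somewhere down the road" becomes "Somewhere the road", while "I'm going to downtown." is left unchanged.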
First problem is, if std::string::erase is called only with the beginning position, it erases everything until the end of string.
Second problem is, that the code will just erase all letters which are in the second string, one by one. I.e. not the entire word - for that, you would need to check if the entire word matches, and only then erase (the entire length of the word). Ask yourself - what will happen in the code, if e.g. the first two letters will match, but not the rest of the word?
In your second for loop you never incremented j and inside the if (plm[i]==plm2[j]) block you used j instead of i as your offset in erase().
for(int i=0;i<plm.size();i++)
{
for(int j=0;j<plm2.size();j++)//here you need to increment j
{
if (plm[i]==plm2[j])
plm.erase(plm.begin()+i);//here the offset should be i
}
}
Another thing: don't use a do...while loop to read from the stringstream and push back onto the vector. If the read fails, you will push invalid data onto the vector. Instead, try something like:
string sub;
while(iss >> sub)
plm.push_back(sub);//only if reading is successful
...//do the same for the other istringstream too
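Note that erasing inside the index loop shifts the remaining elements down, so the element right after an erased one is skipped, and erasing the last element leaves i out of range for the rest of the inner loop. A sketch that avoids both by using the erase-remove idiom, with hypothetical helper names split_ws and remove_words:

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Splits `text` on whitespace; only successfully read words are stored.
std::vector<std::string> split_ws(const std::string& text)
{
    std::istringstream iss(text);
    std::vector<std::string> words;
    std::string word;
    while (iss >> word)
        words.push_back(word);
    return words;
}

// Removes every element of `words` that also appears in `unwanted`.
void remove_words(std::vector<std::string>& words,
                  const std::vector<std::string>& unwanted)
{
    words.erase(
        std::remove_if(words.begin(), words.end(),
                       [&unwanted](const std::string& w) {
                           return std::find(unwanted.begin(), unwanted.end(), w)
                                  != unwanted.end();
                       }),
        words.end());
}
```

std::remove_if moves the kept words to the front in one pass and erase() then trims the tail, so no index bookkeeping is needed.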
You do not increment j this is the first thing I saw on your code. Write it correctly then if it still doesnt work, then ask!

Pull out data from a file and store it in strings in C++

I have a file which contains records of students in the following format.
Umar|Ejaz|12345|umar#umar.com
Majid|Hussain|12345|majid#majid.com
Ali|Akbar|12345|ali#geeks-inn.com
Mahtab|Maqsood|12345|mahtab#myself.com
Juanid|Asghar|12345|junaid#junaid.com
The data has been stored according to the following format:
firstName|lastName|contactNumber|email
The total number of lines(records) can not exceed the limit 100. In my program, I've defined the following string variables.
#define MAX_SIZE 100
// other code
string firstName[MAX_SIZE];
string lastName[MAX_SIZE];
string contactNumber[MAX_SIZE];
string email[MAX_SIZE];
Now, I want to pull data from the file, and using the delimiter '|', I want to put data in the corresponding strings. I'm using the following strategy to put back data into string variables.
ifstream readFromFile;
readFromFile.open("output.txt");
// other code
int x = 0;
string temp;
while(getline(readFromFile, temp)) {
int charPosition = 0;
while(temp[charPosition] != '|') {
firstName[x] += temp[charPosition];
charPosition++;
}
while(temp[charPosition] != '|') {
lastName[x] += temp[charPosition];
charPosition++;
}
while(temp[charPosition] != '|') {
contactNumber[x] += temp[charPosition];
charPosition++;
}
while(temp[charPosition] != endl) {
email[x] += temp[charPosition];
charPosition++;
}
x++;
}
Is it necessary to attach null character '\0' at the end of each string? And if I do not attach, will it create problems when I will be actually implementing those string variables in my program. I'm a new to C++, and I've come up with this solution. If anybody has better technique, he is surely welcome.
Edit: Also I can't compare a char(acter) with endl, how can I?
Edit: The code that I've written isn't working. It gives me following error.
Segmentation fault (core dumped)
Note: I can only use .txt file. A .csv file can't be used.
There are many techniques to do this. I suggest searching Stack Overflow for "[C++] read file" to see some more methods.
Find and Substring
You could use the std::string::find method to find the delimiter and then use std::string::substr to return a substring between the position and the delimiter.
std::string::size_type position = 0;
position = temp.find('|');
if (position != std::string::npos)
{
firstName[x] = temp.substr(0, position);
}
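Extending the same find/substr idea to every field on the line, a minimal sketch might look like this. The function name split_fields is hypothetical, and empty fields are kept rather than dropped:

```cpp
#include <string>
#include <vector>

// Splits `line` at every occurrence of `delim`; empty fields are kept.
std::vector<std::string> split_fields(const std::string& line, char delim)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));   // last field: rest of the line
            return fields;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;                            // skip the delimiter itself
    }
}
```

For a record line, fields[0] through fields[3] would then be the first name, last name, contact number and email, after checking that fields.size() is 4.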
If you don't terminate a C-style string with a null character, there is no way to determine where the string ends, so C-style strings must be terminated. A std::string keeps track of its own length, so you do not append '\0' to it yourself.
I would personally read the data into std::string objects:
std::string first, last, etc;
while (std::getline(readFromFile, first, '|')
&& std::getline(readFromFile, last, '|')
&& std::getline(readFromFile, etc)) {
// do something with the input
}
std::endl is a manipulator implemented as a function template. You can't compare a char with that. There is also hardly ever a reason to use std::endl because it flushes the stream after adding a newline which makes writing really slow. You probably meant to compare to a newline character, i.e., to '\n'. However, since you read the string with std::getline() the line break character will already be removed! You need to make sure you don't access more than temp.size() characters otherwise.
Your record also contains arrays of strings rather than arrays of characters, and you append individual chars to them. Either you wanted to use char something[SIZE], or you should store whole strings!
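A sketch of the chained std::getline approach applied to the four-field records from the question follows. The Student struct and read_students are illustrative names, a std::vector is assumed in place of the fixed-size arrays, and each line is parsed through a std::istringstream so that a malformed line cannot swallow the next one:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Student {
    std::string firstName, lastName, contactNumber, email;
};

// Reads one record per line in the form first|last|number|email.
// Lines that do not contain all four fields are skipped.
std::vector<Student> read_students(std::istream& in)
{
    std::vector<Student> students;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        Student s;
        if (std::getline(fields, s.firstName, '|')
            && std::getline(fields, s.lastName, '|')
            && std::getline(fields, s.contactNumber, '|')
            && std::getline(fields, s.email)) {
            students.push_back(s);
        }
    }
    return students;
}
```

No character-by-character indexing is involved, so the out-of-range access behind the segmentation fault cannot occur here.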

Reading a text document character by character

I am reading a text file character by character using ifstream infile.get() in an infinite while loop.
This sits inside an infinite while loop, and should break out of it once the end of file condition is reached. (EOF). The while loop itself sits within a function of type void.
Here is the pseudo-code:
void function (...) {
while(true) {
...
if ( (ch = infile.get()) == EOF) {return;}
...
}
}
When I "cout" characters on the screen, it goes through all the character and then keeps running outputting what appears as blank space, i.e. it never breaks. I have no idea why. Any ideas?
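In C++ stream code you would normally test the state of the stream rather than compare against EOF. If you use a stream function such as good(), check the stream again after the read, because the end-of-file state is only set once a read has actually failed. Something like this:
while (infile.good()) {
ch = infile.get();
if (!infile) break; // the read hit end of file; ch is not a real character
// ...
}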
One idiom that makes it relatively easy to read from a file and detect the end of the file correctly is to combine the reading and the testing into a single, atomic, event, such as:
while (infile >> ch)
or:
while (std::getline(infile, instring))
Of course, you should also consider using a standard algorithm, such as copy:
std::copy(std::istream_iterator<char>(infile),
std::istream_iterator<char>(),
std::ostream_iterator<char>(std::cout, "\n"));
One minor note: by default, reading with >> will skip white space. When you're doing character-by-character input/processing, you usually don't want that. Fortunately, disabling that is pretty easy:
infile.unsetf(std::ios_base::skipws);
try converting the function to an int one and return 1 when reaching EOF
The reason it is not working is that get() returns an int but you are using the input as a char.
When you assign the result of get() to a char it is fine as long as what was read was a real character. But if get() returned the special value EOF, that value is truncated when assigned to a char, and the subsequent comparison to EOF is no longer reliable: where char is unsigned it can never succeed, and where char is signed a legitimate byte of value 0xFF compares equal to EOF.
This should work:
void function (...)
{
while(true)
{
...
int value;
if ( (value = infile.get()) == EOF) {return;}
char ch = value;
...
}
}
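But it should be noted that it is a lot easier to use the more standard pattern where the read is done as part of the condition. The no-argument get() does not give you that, because it returns the character rather than the stream (the overload get(ch) that takes a char by reference does return the stream). Another option is to switch to a method that uses iterators.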
Note the standard istream_iterator will not work as you expect (as it ignores white space). But you can use the istreambuf_iterator (notice the buf after istream) which does not ignore white space.
void function (...)
{
for(std::istreambuf_iterator<char> loop(infile);
loop != std::istreambuf_iterator<char>();
++loop)
{
char ch = *loop;
...
}
}