Remove duplicate words from a sentence without using regex in c++?


I would like to remove duplicate words from a sentence without using regex in C++. For example:
Input: Welcome to to the world of of digital computers computers.
Expected Output: Welcome to the world of digital computers.
Actual output: Welcome to the world of digital computers computers.
How can I manage that in the code below?
void removeDupWord(string str)
{
// Used to split string around spaces.
istringstream ss(str);
// To store individual visited words
unordered_set<string> hsh;
// Traverse through all words
do
{
string word;
ss >> word;
// If current word is not seen before.
while (hsh.find(word) == hsh.end()) {
cout << word << " ";
hsh.insert(word);
}
} while (ss);
}

std::set and std::unordered_set can't store duplicate elements. Instead of explicitly calling find method on the set, you can use the values returned by insert method to determine whether the insertion was successful or not.
#include <iostream>
#include <unordered_set>
#include <string>
#include <sstream>
#include <cctype>
void removeDupWord(std::string str)
{
// Used to split string around spaces.
std::istringstream ss(str);
// To store individual visited words
std::unordered_set<std::string> hsh;
std::string word;
// Traverse through all words
while(ss >> word)
{
// Remove punctuation char from end
if(std::ispunct(static_cast<unsigned char>(word.back())))
word.pop_back();
const auto& [itr, result] = hsh.insert(word);
if( result )
std::cout << word << ' ';
}
// output
// Welcome to the world of digital computers
}
int main()
{
std::string input {"Welcome to to the world of of digital computers computers."};
removeDupWord(input);
}

Following from the discussion in the comments, I tinkered a bit with a way to accomplish your goal of removing duplicate words (only if adjacent duplicates) and preserving the assumed punctuation at the end of the sentence. With tasks such as these, it's often easier to simply use a loop and a couple of variables to preserve the last word written and punctuation than trying to use some type of container to store and compare words.
The following does just that. It loops over each word, checking if the last character is punctuation with std::ispunct(), preserving the punctuation for appending later. To output the words a simple first state variable is used to determine the first output, allowing the space to be inserted before the next word rather than afterwards. That allows any punctuation removed for word comparison to be appended to the output as needed.
The program below takes the sentence to process as its first argument (it could just as easily be read from stdin, that's up to you) and then outputs the corrected sentence, e.g.
#include <iostream>
#include <string>
#include <sstream>
#include <cctype>
void removeDupWord(std::string str)
{
bool first = true; /* flag indicating 1st word output */
std::istringstream ss(str); /* initialize stringstream with str */
std::string word, last {};
while(ss >> word) /* iterate over each word */
{
char c = 0; /* if last char is punctuation, save in c */
/* remove any punctuation from word */
if (std::ispunct((unsigned char)word.back())) {
c = word.back();
word.pop_back();
}
if (word != last) { /* if not duplicate of last word */
if (!first) { /* if not 1st word */
std::cout << " " << word; /* prepend space */
}
else { /* otherwise */
std::cout << word; /* output word alone */
first = false; /* set first flag false */
}
}
else if (c) { /* otherwise word is adjacent duplicate */
std::cout << c; /* output trailing punctuation */
}
last = word; /* update last to current word */
}
std::cout.put('\n'); /* tidy up with newline */
}
int main(int argc, char **argv)
{
if (argc != 2) { /* ensure argument provided */
return 1;
}
std::string input { argv[1] }; /* initialize string with argument */
removeDupWord (input); /* remove adjacent duplicate words */
}
Example Use/Output
With your example sentence:
$ ./bin/unordered_set_rm_dup_words "Welcome to to the world of of digital computers computers."
Welcome to the world of digital computers.
And the longer example from the discussion with @PeteBecker in the comments:
$ ./bin/unordered_set_rm_dup_words "If I go go to Mars I would need to turn left left to go to Venus Venus."
If I go to Mars I would need to turn left to go to Venus.
It's up to you whether you simply compare against the previous word, or choose to use a std::unordered_set, etc. to store the words and compare. This, of course, turns on whether you want to remove ALL duplicate words or just adjacent duplicate words. It's hard to tell what you need from your given example, but in my second example above you can see the downside of removing ALL rather than just adjacent duplicates.
Any additional punctuation handling is left as an exercise for you. Let me know if you have questions.
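As one way to take the punctuation handling a step further, here is a minimal sketch that builds the result in a std::string and keeps a trailing punctuation character even when the word carrying it is not a duplicate (the program above only emits it for a duplicate). The name dedupeAdjacent is hypothetical, and the sketch assumes at most one punctuation character at the end of each word.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical helper: collapse adjacent duplicate words and keep one
// trailing punctuation character on the word that survives.
std::string dedupeAdjacent(const std::string& sentence)
{
    std::istringstream ss(sentence);
    std::string out, word, last;
    bool haveLast = false;              // false until the first word is seen

    while (ss >> word) {
        char punct = 0;                 // trailing punctuation, if any
        if (std::ispunct(static_cast<unsigned char>(word.back()))) {
            punct = word.back();
            word.pop_back();
        }
        if (!haveLast || word != last) {    // not an adjacent duplicate
            if (!out.empty())
                out += ' ';
            out += word;
            if (punct)
                out += punct;
        }
        else if (punct && out.back() != punct) {
            out += punct;               // duplicate: keep only its punctuation
        }
        last = word;
        haveLast = true;
    }
    return out;
}
```

Calling dedupeAdjacent("Welcome to to the world of of digital computers computers.") gives back the sentence with each adjacent pair collapsed and the final period kept.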

Related

How do I access individual lines in a string? (updated once again / with more detail)

I am updating this again because I am utterly lost, I apologize. I think I can word it more clearly now after trying these solutions, so here goes:
Say I have a string that goes:
"
199
200
208
210
200
"
Here's the update: I may have worded it poorly before, so here it is in no uncertain terms: How do I make it so if that string is called s, then s[0] = 199 (rather than 1), s[1] = 200 (rather than 9), s[2] = 208 (rather than 9), etc. I am sorry to keep coming back here, but I really want to resolve this.
By the way, this is my code so far:
int main()
{
int increase_altitude = 0;
int previous = 10000;
std::string s;
while (std::getline(std::cin, s))
{
std::stringstream ss(s);
}

The answer to your earlier question showed one way to do just what you ask, but didn't directly show how to access any given string by index. The comment by @RemyLebeau above outlines several approaches. One of the easiest remains separating the newline-separated strings into a std::vector<std::string>. You can then obtain the index where the wanted string is located and .erase() that index (the iterator at that index), swap two indexes, etc.
The snippet of code you show gets as far as creating the std::stringstream; you simply need to complete the process.
To locate the index where the wanted string is found, you can use a simple for loop to obtain the index. The following program takes as its first argument the substring to find in your multi-line input string:
#include <iostream>
#include <string>
#include <vector>
#include <sstream>
int main (int argc, char **argv) {
if (argc < 2) { /* validate one argument provided for wanted substring */
std::cerr << std::string {"usage: "} + argv[0] + " substring\n";
return 1;
}
const std::string want {argv[1]}; /* wanted substring */
std::vector<std::string> vs {}; /* vector<string> for strings */
std::string s {"199\n200\n208\n210\n200"}; /* input string */
std::string tmp; /* tmp to read stringstream */
std::stringstream ss (s); /* initialize stringstream */
while (ss >> tmp) { /* loop reading tmp from stringstream */
vs.push_back(tmp); /* add string to vector of string */
}
size_t ndx = vs.size(); /* index of desired substring */
for (size_t i = 0; i < vs.size(); i++) { /* loop over each string */
if (vs[i] == want) { /* if wanted substring */
ndx = i; /* save index */
break;
}
}
if (ndx != vs.size()) { /* if found, output index of substring */
std::cout << want << " is index: " << ndx << '\n';
}
else {
std::cout << want << " does not exist in input.\n";
}
}
Example Use/Output
$ ./bin/str_multiline 208
208 is index: 2
Once you have the index, you can std::swap or .erase() that index and write the modified collection of strings back to a std::stringstream or output it as desired.
If the wanted substring isn't found, you have:
$ ./bin/str_multiline 207
207 does not exist in input.
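To illustrate the .erase() step mentioned above, here is a minimal sketch that removes the first matching line and writes the remaining strings back into a single newline-separated string. The function name eraseFirstMatch is hypothetical, and it assumes, as in the program above, that the lines contain no embedded whitespace.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: remove the first line equal to `want` from the
// newline-separated string `s` and return the rebuilt string.
std::string eraseFirstMatch(const std::string& s, const std::string& want)
{
    std::vector<std::string> vs;
    std::stringstream ss(s);
    std::string tmp;
    while (ss >> tmp)                   // split on whitespace, as above
        vs.push_back(tmp);

    auto it = std::find(vs.begin(), vs.end(), want);
    if (it != vs.end())                 // erase only if the string was found
        vs.erase(it);

    std::string out;                    // rejoin with '\n' between entries
    for (std::size_t i = 0; i < vs.size(); ++i) {
        if (i)
            out += '\n';
        out += vs[i];
    }
    return out;
}
```

If the wanted string is not present, the input comes back unchanged apart from whitespace normalization.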
Using <algorithm> std::find_if() To Do The Same
You can use some of the niceties of the algorithms library to automate the index find by using std::find_if to return the iterator to the index in the array of strings where the wanted string exists (or the end iterator if the wanted string isn't found). To obtain the index, you can simply subtract the beginning iterator, e.g.
...
#include <algorithm>
...
size_t ndx; /* index of desired substring */
std::vector<std::string> vs {}; /* vector<string> for strings */
std::string s {"199\n200\n208\n210\n200"}; /* input string */
std::string tmp; /* tmp to read stringstream */
std::stringstream ss (s); /* initialize stringstream */
while (ss >> tmp) { /* loop reading tmp from stringstream */
vs.push_back(tmp); /* add string to vector of string */
}
/* locate index of wanted string with find_if - begin iterator */
if ((ndx = std::find_if (vs.begin(), vs.end(),
[want](std::string n) {
return n == want;
}) - vs.begin()) != vs.size()) {
std::cout << want << " is index: " << ndx << '\n';
}
else {
std::cout << want << " does not exist in input.\n";
}
(same use/output)
Manipulating The String Directly
You can avoid using the vector of strings altogether by looping down the string with std::basic_string::find to locate the first '\n'. Your first substring occurs between the beginning and the index returned. Keeping track of that index as pos, pos + 1 is the beginning of the second substring and the starting point for your next call to std::basic_string::find (repeat until the end of the string).
The only caveat there is it is up to you to maintain the locations where each string begins and ends in order to manipulate the string. It's more manual position accounting than is needed using the array of strings approach. But both are equally fine.
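A rough sketch of that position accounting is below. It still collects the pieces into a vector purely so the result is easy to inspect; the begin/end positions (pos and nl) are what you would keep if you wanted to edit the original string in place. The name splitLines is hypothetical, and skipping empty lines is an assumption made to cope with a leading or trailing newline.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: walk `s` with std::string::find, treating each
// '\n' as the end of one substring. Empty lines are skipped.
std::vector<std::string> splitLines(const std::string& s)
{
    std::vector<std::string> lines;
    std::string::size_type pos = 0;             // start of current substring
    while (pos <= s.size()) {
        std::string::size_type nl = s.find('\n', pos);
        if (nl == std::string::npos)            // no more newlines:
            nl = s.size();                      // substring runs to the end
        if (nl > pos)
            lines.push_back(s.substr(pos, nl - pos));
        pos = nl + 1;                           // next substring starts here
    }
    return lines;
}
```

With the input string from the question, splitLines(s)[0] is "199" and splitLines(s)[2] is "208".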
There are more ways as well, these are just a few.
Let me know if this is closer to what you were looking for rather than the answer to your previous question. If you compare with the previous answer, this answer essentially separates the substrings in the same way, but adds how to identify the index of the wanted substring.

Missing last word of string when I split the sentence into word [duplicate]

I am missing the last word of the string. This is the code I used to store the words in an array.
string arr[10];
int Add_Count = 0;
string sentence = "I am unable to store last word";
string Words = "";
for (int i = 0; i < sentence.length(); i++)
{
if (sentence[i] == ' ')
{
arr[Add_Count] = Words;
Words = "";
Add_Count++;
}
else if (isalpha(sentence[i]))
{
Words = Words + sentence[i];
}
}
Let's print the arr:
for(int i =0; i<10; i++)
{
cout << arr[i] << endl;
}

You are inserting the word found when you see a blank character.
Since the end of the string is not a blank character, the insertion for the last word never happens.
What you can do is:
(1) If the current character is blank, skip to the next character.
(2) Otherwise, look at the character that follows the current one.
(2-1) If the next character is blank, insert the accumulated word.
(2-2) If the next character doesn't exist (end of the sentence), insert the accumulated word.
(2-3) If the next character is not blank, accumulate word.
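The steps above could be sketched roughly as follows. The function name splitWords is hypothetical, and the sketch stores words in a std::vector rather than the fixed array from the question, which is an assumption made to keep it short.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper following the look-ahead steps: a word is stored
// when the next character is a blank or does not exist.
std::vector<std::string> splitWords(const std::string& sentence)
{
    std::vector<std::string> words;
    std::string current;
    for (std::size_t i = 0; i < sentence.size(); ++i) {
        if (sentence[i] == ' ')
            continue;                           // (1) skip blanks
        current += sentence[i];                 // accumulate this character
        const bool atEnd = (i + 1 == sentence.size());
        if (atEnd || sentence[i + 1] == ' ') {  // (2-1) and (2-2)
            words.push_back(current);
            current.clear();
        }
        // (2-3) otherwise keep accumulating on the next iteration
    }
    return words;
}
```

Because the end of the sentence is checked explicitly, the last word is stored even though no blank follows it.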

You lose the last word because when the loop reaches the end of the string, the last word has not been stored yet. You can add these lines after the loop to store it:
if (Words.length() != 0) {
arr[Add_Count] = Words;
Words = "";
}

Following on from the very good approach by @Casey, but adding the use of std::vector instead of an array, allows you to break a line into as many words as may be included in it. Using the std::stringstream and extracting with >> allows a simple way to tokenize the sentence while ignoring leading, multiple included and trailing whitespace.
For example, you could do:
#include <iostream>
#include <string>
#include <sstream>
#include <vector>
int main (void) {
std::string sentence = " I am unable to store last word ",
word {};
std::stringstream ss (sentence); /* create stringstream from sentence */
std::vector<std::string> words {}; /* vector of strings to hold words */
while (ss >> word) /* read word */
words.push_back(word); /* add word to vector */
/* output original sentence */
std::cout << "sentence: \"" << sentence << "\"\n\n";
for (const auto& w : words) /* output all words in vector */
std::cout << w << '\n';
}
Example Use/Output
$ ./bin/tokenize_sentence_ss
sentence: " I am unable to store last word "
I
am
unable
to
store
last
word
If you need more fine-grained control, you can use std::string::find_first_not_of and std::string::find_first_of with a set of delimiters to work your way through a string: find the first character of a token with std::string::find_first_not_of (which skips over delimiters), then find the delimiter that ends the token with std::string::find_first_of. That involves a bit more arithmetic, but is a more flexible alternative.
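A minimal sketch of that delimiter-set approach is below. The function name tokenize and its default delimiter set are assumptions for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split `s` into tokens separated by any character
// in `delims`, ignoring leading, repeated and trailing delimiters.
std::vector<std::string> tokenize(const std::string& s,
                                  const std::string& delims = " \t\n")
{
    std::vector<std::string> tokens;
    // first character that is not a delimiter starts a token
    std::string::size_type start = s.find_first_not_of(delims);
    while (start != std::string::npos) {
        // next delimiter (or npos) ends the token
        std::string::size_type end = s.find_first_of(delims, start);
        tokens.push_back(s.substr(start, end - start));
        // skip the run of delimiters to the start of the next token
        start = s.find_first_not_of(delims, end);
    }
    return tokens;
}
```

Passing a different delimiter set, for example tokenize("a,b;;c", ",;"), splits on commas and semicolons instead of whitespace.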

This happens because the last word has no space after it. Just add this line after the for loop:
arr[Add_Count] = Words;

My version :
#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>
int main() {
std::istringstream iss("I am unable to store last word");
std::vector<std::string> v(std::istream_iterator<std::string>(iss), {});
std::copy(v.begin(), v.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
}
Sample Run :
I
am
unable
to
store
last
word

If you know you won't have to worry about punctuation, the easiest way to handle it is to throw the string into an istringstream. You can use the extraction operator overload to extract the "words". The extraction operator defaults to splitting on whitespace and automatically terminates at the end of the stream:
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>
std::string sentence = // ... Get the string from cin, a file, or hard-code it here.
std::istringstream ss(sentence);
std::vector<std::string> arr;
arr.reserve(1 + std::count(std::cbegin(sentence), std::cend(sentence), ' '));
std::string word;
while(ss >> word) {
arr.push_back(word);
}

(C++) Reading a CSV text file as a vector of integers

I'm a beginner programmer working through the 2019 Advent of Code challenges in C++.
The last piece of the puzzle I'm putting together is actually getting the program to read the input.txt file, which is essentially a long string of values in the form of "10,20,40,23" etc. on a single line.
In the previous puzzle I used the lines
int inputvalue;
std::ifstream file("input.txt");
while(file >> inputvalue){
//
}
to grab lines from the file, but it was formatted as a text file in sequential lines with no comma separation.
ie:
10
20
40
23
What can I do to read through the file using the comma as a delimiter, and specifically how can I get those values to be read as integers, instead of as strings or chars, and store them in a vector?

While it would be strange to write a routine to read just one line from a comma separated file, instead of writing a general routine to read all lines (and just take the first one if you only wanted one) -- you can take out the parts for reading multiple lines into a std::vector<std::vector<int>> and just read one line into a std::vector<int> -- though it only saves a handful of lines of code.
The general approach would be to read the entire line of text with getline(file, line) and then create a std::stringstream (line) from which you could then use >> to read each integer followed by a getline (ss, tmpstr, ',') on that stringstream to read the delimiter.
You can take a second argument in addition to the file to read, so you can pass the delimiter as the first character of the second argument -- that way there is no reason to re-compile your code to handle delimiters of ';' or ',' or any other single character.
You can put a short bit of code together to do that which could look like the following:
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
int main (int argc, char **argv) {
if (argc < 2) { /* validate at least 1 argument given */
std::cerr << "error: insufficient number of arguments.\n"
"usage: " << argv[0] << " <filename>\n";
return 1;
}
std::vector<int> v {}; /* vector<int> */
std::string line {}; /* string to hold each line */
std::ifstream f (argv[1]); /* open file-stream with 1st argument */
const char delim = argc > 2 ? *argv[2] : ','; /* delim is 1st char in */
/* 2nd arg (default ',') */
if (!f.good()) { /* validate file open for reading */
std::cerr << "error: file open failed '" << argv[1] << "'.\n";
return 1;
}
if (getline (f, line)) { /* read line of input into line */
int itmp; /* temporary integer to fill */
std::stringstream ss (line); /* create stringstream from line */
while (ss >> itmp) { /* read integer value from ss */
std::string stmp {}; /* temporary string to hold delim */
v.push_back(itmp); /* add to vector */
getline (ss, stmp, delim); /* read delimiter */
}
}
for (auto col : v) /* loop over each integer */
std::cout << " " << col; /* output col value */
std::cout << '\n'; /* tidy up with newline */
}
(note: there are relatively few changes needed to read all lines into a vector of vectors, the more notable is simply replacing the if(getline...) with while(getline..) and then filling a temporary vector which, if non-empty, is then pushed back into your collection of vectors)
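A rough sketch of the multi-line variant described in the note is below. The function name readRows is hypothetical, and taking a std::istream& instead of opening the file inside the function is an assumption made so the same code works with a file stream or a string stream.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read every line of delimiter-separated integers
// from `in` into a vector of vectors, skipping lines with no integers.
std::vector<std::vector<int>> readRows(std::istream& in, char delim = ',')
{
    std::vector<std::vector<int>> rows;
    std::string line;
    while (std::getline(in, line)) {        // one line per row
        std::vector<int> row;               // temporary vector for this line
        std::stringstream ss(line);
        int value;
        while (ss >> value) {               // read integer value from ss
            row.push_back(value);
            std::string skipped;
            std::getline(ss, skipped, delim);   // read past the delimiter
        }
        if (!row.empty())                   // only keep non-empty rows
            rows.push_back(row);
    }
    return rows;
}
```

With an open std::ifstream f, the call would be readRows(f), or readRows(f, ';') for semicolon-separated data.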
Example Input File
With a set of comma separated integers in the file named dat/int-1-10-1-line.txt, e.g.
$ cat dat/int-1-10-1-line.txt
1,2,3,4,5,6,7,8,9,10
Example Use/Output
Your use and results would be:
$ ./bin/read_csv_int-1-line dat/int-1-10-1-line.txt
1 2 3 4 5 6 7 8 9 10
Of course you can change the output format to whatever you need. Look things over and let me know if you have further questions.

You have options. In my opinion, the most straight-forward is to just read a string and then convert to integer. You can use the additional "delimiter" parameter of std::getline to stop when it encounters a comma:
std::string value;
while (std::getline(file, value, ',')) {
int ival = std::stoi(value);
std::cout << ival << std::endl;
}
A common alternative is to read a single character, expecting it to be a comma:
int ival;
while (file >> ival) {
std::cout << ival << std::endl;
// Skip comma (we hope)
char we_sure_hope_this_is_a_comma;
file >> we_sure_hope_this_is_a_comma;
}
If it's possible for whitespace to also be present, you may want a less "hopeful" technique to skip the comma:
// Skip characters up to (and including) next comma
for (char c; file >> c && c != ',';);
Or simply:
// Skip characters up to (and including) next comma
while (file && file.get() != ',');
Or indeed, if you expect only whitespace or a comma, you could do something like:
// Skip comma and any leading whitespace
(file >> std::ws).get();
Of course, all the above are more-or-less clunky ways of doing this:
// Skip characters up to (and including) next comma on next read
file.ignore(std::numeric_limits<std::streamsize>::max(), ',');
All these approaches assume input is a single line. If you expect multiple lines of input with comma-separated values, you'll also need to handle end-of-line occurring without encountering a comma. Otherwise, you might miss the first input on the next line. The exception is the "hopeful" approach, which will work, but only on a technicality.
For robustness, I generally advise you to read line-based input as a whole string with std::getline, and then use std::istringstream to read individual values out of that line.
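As a sketch of that line-based advice, the helper below parses one already-read line with std::istringstream and uses ignore to skip to the next comma. The name parseLine is hypothetical; in a real program you would call it once per line obtained from std::getline.

```cpp
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: extract the comma-separated integers from a
// single line of text. Whitespace around the commas is tolerated.
std::vector<int> parseLine(const std::string& line)
{
    std::vector<int> values;
    std::istringstream ss(line);
    int value;
    while (ss >> value) {
        values.push_back(value);
        // skip characters up to (and including) the next comma
        ss.ignore(std::numeric_limits<std::streamsize>::max(), ',');
    }
    return values;
}
```

Because each line gets its own stream, reaching the end of a line without a comma simply ends that line's loop and cannot swallow the first value of the next line.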

Here is another compact solution using iterators.
#include <iostream>
#include <vector>
#include <string>
#include <iterator>
#include <fstream>
#include <algorithm>
template <char D>
struct WordDelimiter : public std::string
{};
template <char D>
std::istream &
operator>>(std::istream & is, WordDelimiter<D> & output)
{
// Output gets every comma-separated token
std::getline(is, output, D);
return is;
}
int main() {
// Open a test file with comma-separated tokens
std::ifstream f{"test.txt"};
// every token is appended in the vector
std::vector<std::string> vec{ std::istream_iterator<WordDelimiter<','>>{ f },
std::istream_iterator<WordDelimiter<','>>{} };
// Transform str vector to int vector
// WARNING: no error checking made here
std::vector<int> vecint;
std::transform(std::begin(vec),std::end(vec),std::back_inserter(vecint),[](const auto& s) { return std::stoi(s); });
for (auto val : vecint) {
std::cout << val << std::endl;
}
return 0;
}

Splitting sentences and placing in vector

I was given a code from my professor that takes multiple lines of input. I am currently changing the code for our current assignment and I came across an issue. The code is meant to take strings of input and separate them into sentences from periods and put those strings into a vector.
vector<string> words;
string getInput() {
string s = ""; // string to return
bool cont = true; // loop control.. continue is true
while (cont){ // while continue
string l; // string to hold a line
cin >> l; // get line
char lastChar = l.at(l.size()-1);
if(lastChar=='.') {
l = l.substr(0, l.size()-1);
if(l.size()>0){
words.push_back(s);
s = "";
}
}
if (lastChar==';') { // use ';' to stop input
l = l.substr(0, l.size()-1);
if (l.size()>0)
s = s + " " + l;
cont = false; // set loop control to stop
}
else
s = s + " " + l; // add line to string to return
// add a blank space to prevent
// making a new word from last
// word in string and first word
// in line
}
return s;
}
int main()
{
cout << "Input something: ";
string s = getInput();
cout << "Your input: " << s << "\n" << endl;
for(int i=0; i<words.size(); i++){
cout << words[i] << "\n";
}
}
The code puts strings into a vector but takes the last word of the sentence and attaches it to the next string and I cannot seem to understand why.

This line
s = s + " " + l;
will always execute, except for the end of input, even if the last character is '.'. You are most likely missing an else between the two if-s.

You have:
string l; // string to hold a line
cin >> l; // get line
The second line does not read a whole line: operator>> stops at the first whitespace character, so it reads a single word. To read a line of text, use:
std::getline(std::cin, l);
It's hard to tell whether that is tripping your code up since you haven't posted any sample input.

I would at least consider doing this job somewhat differently. Right now, you're reading a word at a time, then putting the words back together until you get to a period.
One possible alternative would be to use std::getline to read input until you get to a period, and put the whole string into the vector at once. Code to do the job this way could look something like this:
#include <iostream>
#include <string>
#include <algorithm>
#include <vector>
#include <iterator>
int main() {
std::vector<std::string> s;
std::string temp;
while (std::getline(std::cin, temp, '.'))
s.push_back(temp);
std::transform(s.begin(), s.end(),
std::ostream_iterator<std::string>(std::cout, ".\n"),
[](std::string const &s) { return s.substr(s.find_first_not_of(" \t\n")); });
}
This does behave differently in one circumstance--if you have a period somewhere other than at the end of a word, the original code will ignore that period (won't treat it as the end of a sentence) but this will. The obvious place this would make a difference would be if the input contained a number with a decimal point (e.g., 1.234), which this would break at the decimal point, so it would treat the 1 as the end of one sentence, and the 234 as the beginning of another. If, however, you don't need to deal with that type of input, this can simplify the code considerably.
If the sentences might contain decimal points, then I'd probably write the code more like this:
#include <iostream>
#include <string>
#include <algorithm>
#include <vector>
#include <iterator>
class sentence {
std::string data;
public:
friend std::istream &operator>>(std::istream &is, sentence &s) {
std::string temp, word;
while (is >> word) {
temp += word + ' ';
if (word.back() == '.')
break;
}
s.data = temp;
return is;
}
operator std::string() const { return data; }
};
int main() {
std::copy(std::istream_iterator<sentence>(std::cin),
std::istream_iterator<sentence>(),
std::ostream_iterator<std::string>(std::cout, "\n"));
}
Although somewhat longer and more complex, at least to me it still seems (considerably) simpler than the code in the question. I guess it's different in one way--it detects the end of the input by...detecting the end of the input, rather than depending on the input to contain a special delimiter to mark the end of the input. If you're running it interactively, you'll typically need to use a special key combination to signal the end of input (e.g., Ctrl+D on Linux/Unix, or F6 on Windows).
In any case, it's probably worth considering a fundamental difference between this code and the code in the question: this defines a sentence as a type, where the original code just leaves everything as strings, and manipulates strings. This defines an operator>> for a sentence, that reads a sentence from a stream as we want it read. This gives us a type we can manipulate as an object. Since it's like a string in other ways, we provide a conversion to string so once you're done reading one from a stream, you can just treat it as a string. Having done that, we can (for example) use a standard algorithm to read sentences from standard input, and write them to standard output, with a new-line after each to separate them.

When parsing a string using a string stream, it extracts a new line character

Description of the program : The program must read in a variable amount of words until a sentinel value is specified ("#" in this case). It stores the words in a vector array.
Problem : I use a getline to read in the string and parse the string with a stringstream. My problem is that the stringstream is not swallowing the new line character at the end of each line and is instead extracting it.
Some solutions I have thought of are to cut off the last character by taking a substring, or to check whether the next extracted word is a new line character, but I feel there is a better, more cost-efficient solution, such as changing the conditions for my loops.
I have included a minimized version of the overall code that reproduces the problem.
#include <iostream>
#include <sstream>
#include <string>
using namespace std;
int main()
{
const int MAX_LIST_SIZE = 1000;
string str;
string list[MAX_LIST_SIZE];
int numWords = 0;
// program starts here
getline(cin, str); // read input
stringstream parse(str); // use stringstream to parse input
while(str != "#") // read in until sentinel value
{
while(!parse.fail()) // until all words are extracted from the line
{
parse >> list[numWords]; // store words
numWords++;
}
getline(cin,str); // get next line
parse.clear();
parse.str(str);
}
// print number of words
cout << "Number of words : " << numWords << endl;
}
And a set of test input data that will produce the problem
Input:
apples oranges mangos
bananas
pineapples strawberries
Output:
Number of words : 9
Expected Output:
Number of words : 6
I would appreciate any suggestions on how to deal with this problem in an efficient manner.

Your logic for parsing out the stream isn't quite correct. fail() only becomes true after a >> operation fails, so you'll be doing an extra increment each time. For example:
while(!parse.fail())
{
parse >> list[numWords]; // fails
numWords++; // increment numWords anyway
} // THEN check !fail(), but we incremented already!
All of these operations have returns that you should check as you go to avoid this problem:
while (getline(cin, str)) { // fails if no more lines in cin
if (str != "#") { // doesn't need to be a while
stringstream parse(str);
while (parse >> list[numWords]) { // fails if no more words
++numWords; // *only* increment if we got one!
}
}
}
Even better would be to not use an array at all for the list of words:
std::vector<std::string> words;
Which can be used in the inner loop:
std::string temp;
while (parse >> temp) {
words.push_back(temp);
}

The increment on numWords happens one more time than you intend at the end of each line. Use a std::vector< std::string > for your list. Then you can use list.size().
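Putting those suggestions together, a minimal sketch might look like the following. The name readWords is hypothetical, and reading from a std::istream& rather than std::cin directly is an assumption. Unlike a plain if on the sentinel, this version stops reading as soon as the sentinel line is seen.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collect every word from `in` until a line equal
// to `sentinel` (or the end of input) is reached.
std::vector<std::string> readWords(std::istream& in,
                                   const std::string& sentinel = "#")
{
    std::vector<std::string> words;
    std::string line;
    while (std::getline(in, line) && line != sentinel) {
        std::istringstream parse(line);     // fresh stream for each line
        std::string word;
        while (parse >> word)               // only store successful reads
            words.push_back(word);
    }
    return words;
}
```

The word count is then just readWords(std::cin).size(), with no separate counter to keep in sync.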
