How to remove duplicate phrases that are enclosed in double quotes or separated by commas in a file with C++

I use this function to remove duplicate words from a file, but I need it to remove duplicate expressions instead.
For example, here is what the function currently does.
If I have the expressions
"Hello World"
"beautiful world"
the function will remove the duplicate word "world", even though it belongs to two different expressions.
I need the function to remove an entire expression only if that expression is found more than once in the file.
For example, if I have the expressions
"Hello World"
"Hello World"
"beautiful world"
"beautiful world"
the function should remove the duplicate "Hello World" and "beautiful world" expressions and leave only one of each, but it should not touch the word "world", because it should treat everything within the quotes as one word.
This is the code I use now:
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <unordered_set>
using namespace std;
void Remove_Duplicate_Words(string str)
{
    ofstream Write_to_file{ "test.txt" };
    // Used to split string around spaces.
    istringstream ss(str);
    // To store individual visited words
    unordered_set<string> hsh;
    // Traverse through all words; the loop ends when extraction fails
    string word;
    while (ss >> word)
    {
        // If current word is not seen before (insert reports whether it was new).
        if (hsh.insert(word).second) {
            cout << word << '\n';
            Write_to_file << word << endl; // write to outfile
        }
    }
}
int main()
{
    ifstream Read_from_file{ "test.txt" };
    // Read the whole file into a string before it is overwritten.
    using ist = istreambuf_iterator<char>;
    string file_content{ ist{Read_from_file}, ist{} };
    Remove_Duplicate_Words(file_content);
    return 0;
}
How do I remove duplicate expressions instead of duplicate words?
Unfortunately, my knowledge of this subject is very basic, and what I usually do is try all kinds of things until I succeed. I tried that here too, and I just cannot figure out how to do it.
Any help would be greatly appreciated

This requires a little bit of string parsing.
Your example works by reading tokens, which are similar to words (but not exactly). For your problem, the token becomes word OR quoted string. The more complex your definition of tokens, the harder the problem becomes. Try starting by thinking of tokens as either words or quoted strings on the same line. A quoted string across lines might be a little more complex.
Here's a similar SO question to get you started: Reading quoted string in c++. You need to do something similar, but instead of having set positions, your quoted string can occur anywhere in the line. So you read tokens something like this:
1. Read the next word token (as you're doing now).
2. If the token just read starts with a quote character ("), keep reading until the next (") and treat everything in between as a single token.
3. Check the set and output the token only if it isn't already there (if the token is quoted, don't forget to output the quotes).
4. Insert the token into the set.
5. Repeat until EOF.
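The steps above could be sketched roughly as follows. This is only a minimal illustration, assuming tokens are separated by whitespace and a quoted phrase starts with a quote character; read_token and unique_tokens are hypothetical helper names, not part of any library. An unterminated quote is simply read to the end of the input.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: reads the next token from `in`.
// A token is either a whitespace-delimited word or a "quoted phrase".
// The quotes are kept so the phrase can be written back unchanged.
bool read_token(std::istream& in, std::string& token)
{
    in >> std::ws;                      // skip leading whitespace
    if (in.peek() == '"') {
        in.get();                       // consume the opening quote
        std::string body;
        std::getline(in, body, '"');    // read up to the closing quote
        token = '"' + body + '"';
        return true;
    }
    return static_cast<bool>(in >> token);  // ordinary word, false at EOF
}

// Hypothetical helper: returns each distinct token once, in first-seen order.
std::vector<std::string> unique_tokens(const std::string& text)
{
    std::istringstream in(text);
    std::unordered_set<std::string> seen;
    std::vector<std::string> result;
    std::string token;
    while (read_token(in, token)) {
        if (seen.insert(token).second)  // true only the first time
            result.push_back(token);
    }
    return result;
}
```

Each element of the returned vector can then be written to the output file on its own line, the same way the original function writes words.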
Hope that helps

Related

Building a list of words from a sentence inputted

I am fairly new to programming and would like help with my homework. I have no idea where to even start.
"
1. Have the user input a sentence
2. Print out the individual words in the sentence, along with the word number
So the string "This is a test of our program." should produce:
1. This
2. is
3. a
4. test
5. of
6. our
7. program
This should strip out all spaces, commas, periods, exclamation points."
I would appreciate it if you could give me some pointers. Thanks.
You will have to use strings and streams from the standard library. You can start by including the following headers
#include <string>
#include <iostream>
A good starting point would be to look at the introduction here
Try some things with std::cout, which lets you output content to the console. Start with something easy, such as:
std::cout << "Hello World" << std::endl;
You can also output the content of a variable the same way:
std::string myString = "SomeText";
std::cout << myString << std::endl;
std::cin does the opposite. It allows you to capture the user input into a variable.
int myNumber;
std::cin >> myNumber;
or
std::string userInputString;
std::getline(std::cin, userInputString);
Notice that in the second case we're using std::getline. This is because reading a string with std::cin >> stops at the first whitespace, so you would only get the first word of a sentence.
Now that you've captured the user input string, you can remove undesired characters, split the string, etc. Look at what is available in the string class. Good luck.
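As one possible way to do the splitting step, here is a small sketch. It assumes that a word is any run of letters or digits and that everything else (spaces, commas, periods, exclamation points) is a separator, so an apostrophe would also split a word. split_words is a hypothetical name chosen for illustration.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: collects runs of letters/digits as words and
// treats every other character as a separator.
std::vector<std::string> split_words(const std::string& sentence)
{
    std::vector<std::string> words;
    std::string current;
    for (unsigned char c : sentence) {   // unsigned char keeps std::isalnum well-defined
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            words.push_back(current);    // a separator ends the current word
            current.clear();
        }
    }
    if (!current.empty())                // last word when there is no trailing separator
        words.push_back(current);
    return words;
}
```

Printing the result with its word number is then a loop over the vector that outputs i + 1, a period, and words[i] through std::cout.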

Extracting individual sentences from a text file ... I haven't got it right YET

As part of a larger program, I'm extracting individual sentences from a text file and placing them as strings into a vector of strings. I first decided to use the procedure I've commented out. But then, after a test, I realized that it's doing 2 things wrong:
(1) It's not separating sentences when they are separated by a new line.
(2) It's not separating sentences when they end in a quotation mark. (Ex. The string Obama said, "Yes, we can." Then the audience gave a thunderous applause. would not be separated.)
I need to fix those problems. However, I'm afraid this is going to end up as spaghetti code, if it isn't already. Am I going about this wrong? I don't want to keep going back and fixing things. Maybe there's some easier way?
// Extract sentences from Plain Text file
std::vector<std::string> get_file_sntncs(std::fstream& file) {
    // The sentences will be stored in a vector of strings, strvec:
    std::vector<std::string> strvec;
    // Print out error if the file could not be found:
    if(file.fail()) {
        std::cout << "Could not find the file. :( " << std::endl;
    // Otherwise, proceed to add the sentences to strvec.
    } else {
        char curchar;
        std::string cursentence;
        /* While we haven't reached the end of the file, add the current character to the
        string representing the current sentence. If that current character is a period,
        then we know we've reached the end of a sentence if the next character is a space or
        if there is no next character; we then must add the current sentence to strvec. */
        while (file >> std::noskipws >> curchar) {
            cursentence.push_back(curchar);
            if (curchar == '.') {
                if (file >> std::noskipws >> curchar) {
                    if (curchar == ' ') {
                        strvec.push_back(cursentence);
                        cursentence.clear();
                    } else {
                        cursentence.push_back(curchar);
                    }
                } else {
                    strvec.push_back(cursentence);
                    cursentence.clear();
                }
            }
        }
    }
    return strvec;
}
Given your request to detect sentence boundaries by punctuation, whitespace, and certain combinations of them, using a regular expression seems to be a good solution. You can use a regular expression to describe possible sequences of characters that indicate sentence boundaries, e.g.
[.!?]\s+
which means: "one of dot, exclamation mark, or question mark, followed by one or more whitespace characters".
One particularly convenient way of using regular expressions in C++ is to use the regex implementation included in the Boost library. Here is an example of how it works in your case:
#include <string>
#include <vector>
#include <iostream>
#include <iterator>
#include <boost/regex.hpp>
int main()
{
    /* Input. */
    std::string input = "Here is a short sentence. Here is another one. And we say \"this is the final one.\", which is another example.";
    /* Define sentence boundaries. */
    boost::regex re("(?: [\\.\\!\\?]\\s+" // case 1: punctuation followed by whitespace
                    "| \\.\\\",?\\s+" // case 2: end of quotation (period, closing quote, optional comma)
                    "| \\s+\\\")", // case 3: start of quotation (whitespace before an opening quote)
                    boost::regex::perl | boost::regex::mod_x);
    /* Iterate through sentences. */
    boost::sregex_token_iterator it(begin(input),end(input),re,-1);
    boost::sregex_token_iterator endit;
    /* Copy them onto a vector. */
    std::vector<std::string> vec;
    std::copy(it,endit,std::back_inserter(vec));
    /* Output the vector, so we can check. */
    std::copy(begin(vec),end(vec),
              std::ostream_iterator<std::string>(std::cout,"\n"));
    return 0;
}
Notice I used the boost::regex::perl and boost::regex::mod_x options to construct the regex matcher. This allowed me to use extra whitespace inside the regex to make it more readable.
Also note that certain characters, such as . (dot), ! (exclamation mark) and others need to be escaped (i.e. you need to put \\ in front of them in the C++ string literal), because they would otherwise be metacharacters with special meanings.
When compiling/linking the code above, you need to link it with the boost-regex library. Using GCC the command looks something like:
g++ -W -Wall -std=c++11 -o test test.cpp -lboost_regex
(assuming your program is stored in a file called test.cpp).
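If Boost is not available, the same idea can be sketched with the standard <regex> header from C++11. This is an illustrative alternative, not the code from the answer above: it only uses the simple boundary pattern [.!?]\s+ and does not handle the quotation cases, and split_sentences is a hypothetical name. Note that the boundary punctuation is consumed by the split.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits text wherever '.', '!' or '?' is followed by
// whitespace (spaces or newlines). The matched boundary itself is dropped.
std::vector<std::string> split_sentences(const std::string& text)
{
    static const std::regex boundary(R"([.!?]\s+)");
    std::vector<std::string> sentences;
    // Submatch index -1 selects the text between matches.
    std::sregex_token_iterator it(text.begin(), text.end(), boundary, -1);
    std::sregex_token_iterator end;
    for (; it != end; ++it) {
        if (it->length() > 0)            // skip empty pieces
            sentences.push_back(it->str());
    }
    return sentences;
}
```

Because \s also matches a newline, sentences separated by a line break are split as well.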

boost::algorithm - splitting a string returns an extra token

Perhaps someone could tell me what is happening here?
My intention is to split an input string on parentheses, i.e. either '(' or ')'.
For an input string of "(well)hello(there)world" I expect 4 tokens to be returned: well; hello; there; world.
As you can see from my example app below, I am getting 5 tokens back (the 1st is an empty string).
Is there any way to get this to return me only the non-empty strings?
#include <iostream>
#include <boost/algorithm/string.hpp>
#include <vector>
int main()
{
    std::string in = "(well)hello(there)world";
    std::vector<std::string> tokens;
    boost::split(tokens, in, boost::is_any_of("()"));
    for (auto s : tokens)
        std::cout << "\"" << s << "\"" << std::endl;
    return 0;
}
Output:
$ a.out
"" <-- where is this token coming from?
"well"
"hello"
"there"
"world"
I have tried using boost::algorithm::token_compress_on but I get the same result.
Yes, the first result returned is the empty string immediately preceding the first open parenthesis. The behavior is as expected.
If you don't want that result, simply test each returned token for emptiness and discard the empty ones.
To test that this is the expected behavior, put a parenthesis at the end of the line and you will have another empty result at the end. :)
This thread is kind of old, but boost::token_compress_on is also worth knowing: pass it as the last argument of boost::split, after the delimiter predicate. It merges adjacent delimiters into one, although an empty token at the very start or end of the input is still returned, so the check above is still needed.
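For comparison, here is a sketch that avoids empty tokens altogether by using only std::string member functions instead of boost::split. split_nonempty is a hypothetical helper name, and the approach (skipping runs of delimiters before each token) is a swapped-in technique, not what Boost does internally.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits `in` on any character in `delims` and
// never produces empty tokens.
std::vector<std::string> split_nonempty(const std::string& in,
                                        const std::string& delims)
{
    std::vector<std::string> tokens;
    // Jump to the first character that is not a delimiter.
    std::string::size_type start = in.find_first_not_of(delims);
    while (start != std::string::npos) {
        // The token runs until the next delimiter (or the end of the string).
        std::string::size_type stop = in.find_first_of(delims, start);
        tokens.push_back(in.substr(start, stop - start));
        // Skip the delimiter run to find the start of the next token.
        start = in.find_first_not_of(delims, stop);
    }
    return tokens;
}
```

Calling split_nonempty("(well)hello(there)world", "()") yields the four tokens the question expects.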

C++ - Remove or skip quote char in reading a file line by tokenizer

I have a csv file that has records like:
837478739*"EP"1"3FB2B464BD5003B55CA6065E8E040A2A"*"F"*21*15*"NH"*"N"0*-1*"-1"*0*0**-1*223944*-1*"23"1"-1""-1""78909""-1""-1""-1""-1""-1""-1""-1""-1""-1""-1""-1""-1""-1""74425""26""-1"*"-1"*1*1*69*23.58*0*0*0*0*"MC"
The file has lots of records, so I need a fast method to break down each line and push_back each of those parts into a vector. The main reason I chose tokenizer is that I heard a lot about its performance. I have a function:
void break(){
    //using namespace boost;
    string s = "This is a , test '' file";
    boost::tokenizer<> tok(s);
    vector<string> line;
    for(boost::tokenizer<>::iterator beg=tok.begin();beg!=tok.end();++beg){
        line.push_back(*beg);
    }
    cout << line[3] << " and " << line[5] << endl;
}
With that I can get each part of the sentence and ignore everything that is not a letter. Can the tokenizer read records like mine, split them on the "*" delimiter, and remove the quotes from the strings? There won't be any special characters between the quotes; I just need to remove the quote marks. I tried reading the tokenizer documentation, but I couldn't work it out.
You need to give your tokenizer a different TokenizerFunc (for example boost::char_separator<char>) to parse the string differently; the default splits on spaces and punctuation.
http://www.boost.org/doc/libs/1_37_0/libs/tokenizer/tokenizerfunction.htm
You can use regex_replace.
"break" is a keyword. You can't use it as a function name.
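As a plain standard-library alternative to the tokenizer, the record could be split with std::getline and the quotes stripped with the erase-remove idiom. This is only a sketch, assuming fields are separated by '*' and quotes never need to be preserved; split_record is a hypothetical name, and no claim is made about how its speed compares with boost::tokenizer.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one record on '*' and removes every
// double-quote character from each field. Empty fields are kept.
std::vector<std::string> split_record(const std::string& record)
{
    std::vector<std::string> fields;
    std::istringstream in(record);
    std::string field;
    while (std::getline(in, field, '*')) {
        // erase-remove idiom: drop all '"' characters from the field
        field.erase(std::remove(field.begin(), field.end(), '"'), field.end());
        fields.push_back(field);
    }
    return fields;
}
```

One limitation of this sketch: a record that ends with '*' does not produce a final empty field.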

formatting a string which contains quotation marks

I am having a problem formatting a string which contains quotation marks.
For example, I got this std::string: server/register?json={"id"="monkey"}
This string needs to have the four quotation marks replaced by \", because it will be used as a c_str() for another function.
What is the best way to do this for this string?
{"id"="monkey"}
EDIT: I need a solution which uses STL libraries only, preferably only with String.h. I have confirmed I need to replace " with \".
EDIT2: Nvm, found the bug in the framework
It is perfectly legal to have the '"' char in a C-string, so the short answer is that you need to do nothing. Escaping the quotes is only required when typing the string in the source code:
std::string str("server/register?json={\"id\"=\"monkey\"}");
my_c_function(str.c_str()); // Nothing to do here
However, in general, if you want to replace one substring with another, use the Boost string algorithms.
#include <boost/algorithm/string/replace.hpp>
#include <iostream>
int main(int, char**)
{
    std::string str = "Hello world";
    boost::algorithm::replace_all(str, "o", "a"); //modifies str
    std::string str2 = boost::algorithm::replace_all_copy(str, "ll", "xy"); //doesn't modify str
    std::cout << str << " - " << str2 << std::endl;
}
// Displays : Hella warld - Hexya warld
If your std::string contains server/register?json={"id"="monkey"}, there's no need to replace anything, as it will already be correctly formatted.
The only place you would need this is if you hard-coded the string in the source and assigned it manually. But then you can just escape the quotes by hand.
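As both answers point out, escaping is normally unnecessary here. If some receiving function really did require a backslash before every quote, a version using only std::string (as the question's edit asks) could look like this sketch; escape_quotes is a hypothetical helper name.

```cpp
#include <string>

// Hypothetical helper: returns a copy of `in` with a backslash inserted
// before every double-quote character.
std::string escape_quotes(const std::string& in)
{
    std::string out;
    out.reserve(in.size());      // at least this many characters are needed
    for (char c : in) {
        if (c == '"')
            out += '\\';         // add the escaping backslash first
        out += c;
    }
    return out;
}
```

The result can be passed on with c_str() in the same way as the original string.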