boost string split to eliminate spaces in words

I have written this code to split a string containing words separated by multiple spaces and/or tabs into a vector of strings that contains just the words.
#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string.hpp>
int main()
{
    using namespace std;
    string str("cONtainS SoMe CApiTaL WORDS");
    vector<string> strVec;
    using boost::is_any_of;
    // Split at every space or tab character
    boost::algorithm::split(strVec, str, is_any_of("\t "));
    vector<string>::iterator i;
    for (i = strVec.begin(); i != strVec.end(); i++)
        cout << *i << endl;
    return 0;
}
I was expecting this output:
cONtainS
SoMe
CApiTaL
WORDS
but I am getting output in which empty strings appear as elements of strVec, i.e.:
cONtainS
SoMe
CApiTaL
WORDS

You need to add a final parameter with the value boost::token_compress_on, as per the documentation:
boost::algorithm::split(strVec,str,is_any_of("\t "),boost::token_compress_on);

It's because your input contains consecutive separators. By default split interprets that to mean they have empty strings between them.
To get the output you expected, you need to specify the optional eCompress parameter, with value token_compress_on.
http://www.boost.org/doc/libs/1_43_0/doc/html/boost/algorithm/split_id667600.html
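To see what the eCompress flag changes without needing Boost installed, here is a minimal hand-written sketch of the same idea. The helper split_any is hypothetical (it is not a Boost function) and only approximates the behaviour of boost::algorithm::split with is_any_of. With compress set to false every delimiter ends a token, so adjacent delimiters yield empty strings; with compress set to true a whole run of delimiters is treated as one separator.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of Boost: splits `text` at any character
// contained in `delims`. When `compress` is true, a run of adjacent
// delimiters counts as a single separator (the idea behind token_compress_on).
std::vector<std::string> split_any(const std::string& text,
                                   const std::string& delims,
                                   bool compress)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type end = text.find_first_of(delims, start);
        if (end == std::string::npos) {
            // No more delimiters: the rest of the string is the last token
            tokens.push_back(text.substr(start));
            break;
        }
        tokens.push_back(text.substr(start, end - start));
        start = end + 1;
        if (compress) {
            // Skip the remaining delimiters of this run
            start = text.find_first_not_of(delims, start);
            if (start == std::string::npos) {
                // The string ended with delimiters: keep one trailing empty token
                tokens.push_back(std::string());
                break;
            }
        }
    }
    return tokens;
}
```

For the input "a  b" (two spaces), the uncompressed call returns three tokens with an empty string in the middle, while the compressed call returns just "a" and "b".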

Related

Parse string with delimiter whitespace but having strings include whitespace as well?

I have a text file with state names and their respective abbreviations. It looks something like this:
Florida FL
Nevada NV
New York NY
So the number of whitespace characters between the state name and the abbreviation differs. I want to extract the name and the abbreviation, and I thought about using getline with whitespace as a delimiter, but I have problems with the whitespace in names like "New York". What function could I use instead?

You know that the abbreviation is always two characters.
So you can read the whole line, and split it at two characters from the end (probably using substr).
Then trim the first string and you have two nice strings for the name and abbreviation.
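The steps above can be sketched as follows. This is only an illustration under the stated assumption that the abbreviation is always the last two non-blank characters of the line; the helper names trim and split_state_line are made up for this example.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: removes leading and trailing spaces, tabs and '\r'
std::string trim(const std::string& s)
{
    const char* blanks = " \t\r";
    std::string::size_type first = s.find_first_not_of(blanks);
    if (first == std::string::npos)
        return std::string();
    std::string::size_type last = s.find_last_not_of(blanks);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: returns {state name, abbreviation}.
// Assumes the abbreviation is exactly the last two characters of the trimmed line.
std::pair<std::string, std::string> split_state_line(const std::string& line)
{
    std::string cleaned = trim(line);
    if (cleaned.size() < 2)
        return {cleaned, std::string()};   // too short to hold an abbreviation
    std::string abbreviation = cleaned.substr(cleaned.size() - 2);
    std::string name = trim(cleaned.substr(0, cleaned.size() - 2));
    return {name, abbreviation};
}
```

Each line read with std::getline from the file can be passed to split_state_line; spaces inside the name, as in "New York", are left untouched because only the ends are trimmed.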

The systematic way is to analyze all possible input data and then search for a pattern in the text. In your case, we analyze the problem and find out that
at the end of the string we have some consecutive uppercase letters
before that we have the state's name
So, if we search for the state abbreviation pattern and split that off, the full name of the state is what remains, possibly with leading and trailing spaces. We remove those and the result is there.
For searching we will use a std::regex. The pattern is: 1 or more uppercase letters, followed by 0 or more whitespace characters, followed by the end of the line. The regular expression for that is: "([A-Z]+)\\s*$"
When this matches, the prefix of the result contains the full state name. We remove leading and trailing spaces and that's it.
Please see:
#include <iostream>
#include <string>
#include <sstream>
#include <regex>
std::istringstream textFile(R"( Florida FL
Nevada NV
New York NY)");
std::regex regexStateAbbreviation("([A-Z]+)\\s*$");
int main()
{
    // Split off the abbreviation at the end of each line
    std::smatch stateAbbreviationMatch{};
    std::string line{};
    while (std::getline(textFile, line)) {
        if (std::regex_search(line, stateAbbreviationMatch, regexStateAbbreviation))
        {
            // Get the state name (everything before the match)
            std::string state(stateAbbreviationMatch.prefix());
            // Remove leading and trailing spaces and collapse runs of inner spaces
            state = std::regex_replace(state, std::regex("^ +| +$|( ) +"), "$1");
            // Get the state abbreviation (capture group 1, without any trailing whitespace)
            std::string stateabbreviation(stateAbbreviationMatch[1]);
            // Print result
            std::cout << stateabbreviation << ' ' << state << '\n';
        }
    }
    return 0;
}

Avoid empty elements in match when optional substrings are not present

I am trying to create a regex that matches the strings returned by the diff terminal command.
These strings start with a decimal number, optionally followed by a substring made of a comma and a number, then a mandatory character (a, c, or d), then another mandatory decimal number, optionally followed by another group like the first one.
Examples:
27a27
27a27,30
28c28
28,30c29,31
1d1
1,10d1
I am trying to extract all the groups separately, with the optional ones captured without the comma.
I am doing this in C++:
#include <iostream>
#include <string>
#include <fstream>
#include <regex>
using namespace std;
int main(int argc, char* argv[])
{
    string t = "47a46";
    std::string result;
    std::regex re2("(\\d+)(?:,(\\d+))?([acd])(\\d+)(?:,(\\d+))?");
    std::smatch match;
    std::regex_search(t, match, re2);
    cout << match.size() << endl;
    cout << match.str(0) << endl;
    if (std::regex_search(t, match, re2))
    {
        // Print every capture group, preceded by its index
        for (std::size_t i = 1; i < match.size(); i++)
        {
            result = match.str(i);
            cout << i << ":" << result << " ";
        }
        cout << endl;
    }
    return 0;
}
The string variable t is the string I want to manipulate.
My regular expression
(\\d+)(?:,(\\d+))?([acd])(\\d+)(?:,(\\d+))?
is working, but with strings that do not have the optional subgroups (such as 47a46), the match variable will contain empty elements in the positions of the expected substrings.
For example, in the program above the elements of match (preceded by their index) are:
1:47 2: 3:a 4:46 5:
The elements in positions 2 and 5 correspond to the optional substrings, which are not present in this case, so I would like match to omit them so that it would be:
1:47 2:a 3:46
How can I do it?

I think the best RE for you would be like this:
std::regex re2(R"((\d+)(?:,\d+)?([a-z])(\d+)(?:,\d+)?)");
That way the pattern still matches the optional parts but captures only the required groups.
output:
4
47a46
1:47 2:a 3:46
Note: the argument string of re2 is written as a C++11 raw string literal.
EDIT: simplified RE a bit
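As a small self-contained sketch of this idea, the function below wraps the pattern and returns only the captured fields. The name parse_diff_line is made up for this example, and it uses [acd] from the question instead of [a-z]; treat it as an illustration rather than a complete diff parser.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the captured fields of a diff change line
// (first number, command letter, second number), or an empty vector
// when the line does not match the pattern.
std::vector<std::string> parse_diff_line(const std::string& line)
{
    // (?:,\d+)? matches an optional ",N" part without creating a capture group,
    // so the match holds exactly three sub-matches besides the full match.
    static const std::regex pattern(R"((\d+)(?:,\d+)?([acd])(\d+)(?:,\d+)?)");
    std::smatch match;
    std::vector<std::string> fields;
    if (std::regex_match(line, match, pattern)) {
        for (std::size_t i = 1; i < match.size(); ++i)
            fields.push_back(match.str(i));
    }
    return fields;
}
```

Note that the numbers after the commas are matched but discarded here; if they are needed when present, the capturing form from the question has to be kept and the empty sub-matches skipped, for example by checking match[i].matched.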

C++ Find Word in String without Regex

I'm trying to find a certain word in a string, but find that word alone. For example, if I had a word bank:
789540132143
93
3
5434
I only want a match to be found for the value 3, as the other values do not match exactly. I used the normal string::find function, but that found matches for all four values in the word bank because they all contain 3.
There is no whitespace surrounding the values, and I am not allowed to use Regex. I'm looking for the fastest implementation of completing this task.

If you want to count the words, you should use a map from string to int. Read each word from your file into a string using >>, then increment its counter in the map:
string word;
map<string,int> count;
ifstream input("file.txt");
while (input >> word) {   // the loop ends cleanly at end of file or on a read error
    count[word]++;
}
Using >> has the benefit that you don't have to worry about whitespace.

It all depends on the definition of a word: is it a string separated from others by whitespace? Or are other word separators (e.g. comma, dot, semicolon, colon, parentheses...) relevant as well?
How to parse for words without regex:
Here is an acceptable approach using find() and its variant find_first_of():
string myline;                      // line to be parsed
string what="3";                    // string to be found
string separator=" \t\n,;.:()[]";   // word separators
while (getline(cin, myline)) {
    size_t nxt=0;
    while ( (nxt=myline.find(what, nxt)) != string::npos) {   // search occurrences of what
        if (nxt==0||separator.find(myline[nxt-1])!=string::npos) {   // if at the beginning of a word
            size_t nsep=myline.find_first_of(separator,nxt+1);   // check if it goes to the end of the word
            if ((nsep==string::npos && myline.length()-nxt==what.length()) || nsep-nxt==what.length()) {
                cout << "Line: "<<myline<<endl;   // bingo !!
                cout << "from pos "<<nxt<<" to " << nsep << endl;
            }
        }
        nxt++;   // ready for next occurrence
    }
}
The principle is to check whether each occurrence found corresponds to a whole word, i.e. it starts at the beginning of the string or of a word (the previous character is a separator) and it extends to the next separator (or to the end of the line).
How to solve your real problem:
Even with the fastest word search function, if you use it to solve your actual problem of counting words, as you explained in your comment, you will waste a lot of effort.
The best way to achieve this would certainly be to use a map<string, int> to store and update a counter for each string encountered in the file.
You then just have to parse each line into words (you could use find_first_of() as suggested above) and use the map:
mymap[word]++;
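A minimal sketch of that combination, assuming the separators are passed in as a string as in the code above. The function name count_words is made up for this example.

```cpp
#include <map>
#include <string>

// Hypothetical helper: splits `line` into words at any character in
// `separators` and counts how often each word occurs.
std::map<std::string, int> count_words(const std::string& line,
                                       const std::string& separators)
{
    std::map<std::string, int> counts;
    // Find the first character that starts a word
    std::string::size_type start = line.find_first_not_of(separators);
    while (start != std::string::npos) {
        // The word ends at the next separator (or at the end of the line)
        std::string::size_type end = line.find_first_of(separators, start);
        // substr clamps the length when `end` is npos
        counts[line.substr(start, end - start)]++;
        // Skip the separators that follow; yields npos when nothing is left
        start = line.find_first_not_of(separators, end);
    }
    return counts;
}
```

Looking up counts["3"] afterwards gives the number of times the exact word 3 occurred, without matching 93 or 5434.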

c++ How to extract the whitespace between words if there is one

I've got two questions. I need to write a program that extracts all non-alphabetic characters and displays them, then removes them.
I am using isalpha which is working for symbols, but only if the input string has no spaces like "hello world"
but if it is more than one word like "hello! world!", it will only extract the first exclamation mark but not the second.
Second question, which may be related: I want my program to detect the spaces between the words (I tried isspace but I must have used it wrong?), remove them, and put them in a char variable
so for example
if the input is hello4 world! How3 are you today?
I want it to tell me
removed: 4
removed:
removed: !
removed:
removed: 3
removed:
removed:
removed:
long story short, if there is no other way, I'd like to detect spaces as !isalpha, or find something similar to isalpha for space between text.
Thanks
#include <cctype>
#include <iostream>
#include <string>
using namespace std;
int main()
{
    string message;
    cin >> message;
    for (int i = 0; message[i]; i++)
        if (!isalpha(message[i]))
            cout << "deleted following character: " << message[i] << endl;
        else
            cout << "All is good! \n";
}

>> reads a single word, stopping when a whitespace character is found. To read a whole line, you want
std::getline(std::cin, message);
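Once the whole line is available, spaces are reported like any other non-alphabetic character, because isalpha returns false for them. A minimal sketch of the extraction step follows; the function name strip_non_alpha is made up for this example.

```cpp
#include <cctype>
#include <string>
#include <utility>

// Hypothetical helper: returns {removed characters, remaining letters}.
// Spaces count as removed characters because std::isalpha is false for them.
std::pair<std::string, std::string> strip_non_alpha(const std::string& message)
{
    std::string removed;
    std::string kept;
    for (char c : message) {
        // Cast to unsigned char: passing a negative char to isalpha is undefined
        if (std::isalpha(static_cast<unsigned char>(c)))
            kept += c;
        else
            removed += c;
    }
    return {removed, kept};
}
```

After reading the input with std::getline(std::cin, message), each character of the first string in the result can be printed on its own "removed:" line.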

There is a better way to get the non-alphabetic characters:
check the ASCII value of each character against the ranges of the alphabetic ASCII characters. If it is outside those ranges and is not a space (compare with the ASCII value of the space character),
then it is one of your non-alphabetic characters.
You can find all the ASCII codes here: http://www.asciitable.com/
-Jayesh

C++ Boost: Split function is_any_of()

I'm trying to use the split() function provided in boost/algorithm/string.hpp in the following function :
vector<std::string> splitString(string input, string pivot) { //Pivot: e.g., "##"
    vector<string> splitInput; //Vector where the string is split and stored
    split(splitInput,input,is_any_of(pivot),token_compress_on); //Split the string
    return splitInput;
}
The following call :
string hello = "Hieafds##addgaeg##adf#h";
vector<string> split = splitString(hello,"##"); //Split the string based on occurrences of "##"
splits the string into "Hieafds" "addgaeg" "adf" & "h". However I don't want the string to be split by a single #. I think that the problem is with is_any_of().
How should the function be modified so that the string is split only by occurrences of "##" ?

You're right, you have to use is_any_of()
std::string input = "some##text";
std::vector<std::string> output;
split( output, input, is_any_of( "##" ) );
Update:
But if you want to split only on exactly two '#' characters in a row, you can use a regular expression with split_regex (declared in boost/algorithm/string_regex.hpp):
split_regex( output, input, regex( "##" ) );
Take a look at the documentation example.
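If pulling in a regex library is not desirable, the same result can be obtained with std::string::find alone. The sketch below is a hand-written alternative, not a Boost function; the name split_on_substring is made up for this example. It treats the whole delimiter string as one separator, so a single '#' stays inside its token.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits `input` at every occurrence of the complete
// string `delimiter` (not at its individual characters).
std::vector<std::string> split_on_substring(const std::string& input,
                                            const std::string& delimiter)
{
    std::vector<std::string> parts;
    if (delimiter.empty()) {
        // An empty delimiter cannot split anything
        parts.push_back(input);
        return parts;
    }
    std::string::size_type start = 0;
    std::string::size_type pos;
    while ((pos = input.find(delimiter, start)) != std::string::npos) {
        parts.push_back(input.substr(start, pos - start));
        start = pos + delimiter.size();   // continue after the delimiter
    }
    parts.push_back(input.substr(start)); // text after the last delimiter
    return parts;
}
```

With the string from the question and "##" as the delimiter, this yields "Hieafds", "addgaeg" and "adf#h".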
