Parse string with delimiter whitespace but having strings include whitespace as well? - c++

I have a text file with state names and their respective abbreviations. It looks something like this:
Florida FL
Nevada NV
New York NY
So the number of whitespaces between state name and abbreviation differs. I want to extract the name and abbreviation and I thought about using getline with whitespace as a delimiter but I have problems with the whitespace in names like "New York". What function could I use instead?

You know that the abbreviation is always two characters.
So you can read the whole line, and split it at two characters from the end (probably using substr).
Then trim the first string and you have two nice strings for the name and abbreviation.

The systematic way is to analyze all possible input data and then search for a pattern in the text. In your case, we analyze the problem and find that
at the end of the string we have some consecutive uppercase letters
before that we have the state's name
So, if we search for the state abbreviation pattern and split that off, what remains is the full name of the state, possibly with leading and trailing spaces, which we then remove.
For searching we will use a std::regex. The pattern is: 1 or more uppercase letters, followed by 0 or more whitespace characters, followed by the end of the line. The regular expression for that is: "([A-Z]+)\\s*$"
When this matches, the prefix of the result contains the full state name. We remove leading and trailing spaces and that's it.
Please see:
#include <iostream>
#include <string>
#include <sstream>
#include <regex>
std::istringstream textFile(R"( Florida FL
Nevada NV
New York NY)");
std::regex regexStateAbbreviation("([A-Z]+)\\s*$");
int main()
{
// Split off some parts
std::smatch stateAbbreviationMatch{};
std::string line{};
while (std::getline(textFile, line)) {
if (std::regex_search(line, stateAbbreviationMatch, regexStateAbbreviation))
{
// Get the state
std::string state(stateAbbreviationMatch.prefix());
// Remove leading and trailing spaces
state = std::regex_replace(state, std::regex("^ +| +$|( ) +"), "$1");
// Get the state abbreviation (group 1, so any trailing whitespace is left out)
std::string stateabbreviation(stateAbbreviationMatch[1]);
// Print Result
std::cout << stateabbreviation << ' ' << state << '\n';
}
}
return 0;
}

Related

How to retrieve the captured substrings from a capturing group that may repeat?

I'm sorry I found it difficult to express this question with my poor English. So, let's go directly to a simple example.
Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I'm using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".
Because the subject string to match doesn't need to be 4 items, for example, it could be "one:two:three", and $1 and $2 will be "one" and "three" respectively. Again, the middle item is missing.
What is the correct pattern to use in this case? By the way, I'm going to use PCRE2 in C++ codes, so there is no split, a Perl built-in function. Thanks.
If the input contains strictly items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can use the regex pattern
[^:]+
which matches consecutive characters which are not :, so a substring up to the first :. That may need to capture as well, ([^:]+), depending on the overall approach. How to use this to get all such matches depends on the language.†
In C++ there are different ways to approach this. Using std::regex_iterator
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>
int main()
{
std::string str{R"(one:two:three)"};
std::regex r{R"([^:]+)"};
std::vector<std::string> result{};
auto it = std::sregex_iterator(str.begin(), str.end(), r);
auto end = std::sregex_iterator();
for(; it != end; ++it) {
auto match = *it;
result.push_back(match[0].str());
}
std::cout << "Input string: " << str << '\n';
for(auto i : result)
std::cout << i << '\n';
}
Prints as expected.
One can also use std::regex_search, even as it returns at first match -- by iterating over the string to move the search start after every match
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string str{"one:two:three"};
std::regex r{"[^:]+"};
std::smatch res;
std::string::const_iterator search_beg( str.cbegin() );
while ( regex_search( search_beg, str.cend(), res, r ) )
{
std::cout << res[0] << '\n';
search_beg = res.suffix().first;
}
std::cout << '\n';
}
(With this string and regex we don't need the raw string literal so I've removed them here.)
† This question was initially tagged with perl (with no c++), also with an explicit mention of it in text (still there), and the original version of this answer referred to Perl with
/([^:]+)/g
The /g "modifier" is for "global," to find all matches. The // are pattern delimiters.
When this expression is bound (=~) to a variable with a target string then the whole expression returns a list of matches when used in a context in which a list is expected, which can thus be directly assigned to an array variable.
my @captures = $string =~ /[^:]+/g;
(when this is used literally as shown then the capturing () aren't needed)
Assigning to an array provides this "list context." If the matching is used in a "scalar context," in which a single value is expected, like in the condition for an if test or being assigned to a scalar variable, then a single true/false is returned (usually 1 or '', empty string).
Repeating a capture group will only capture the value of the last iteration. Instead, you might make use of the \G anchor to get consecutive matches.
If the whole string can only contain word characters separated by colons:
(?:^(?=\w+(?::\w+)+$)|\G(?!^):)\K\w+
The pattern matches:
(?: Non capture group
^ Assert start of string
(?=\w+(?::\w+)+$) Assert from the current position 1+ word characters and 1+ repetitions of : and 1+ word characters till the end of the string
| Or
\G(?!^): Assert the position at the end of the previous match, not at the start and match :
) Close non capture group
\K\w+ Forget what is matched so far, and match 1+ word characters
To allow only words as well from the start of the string, and allow other chars after the word chars:
\G:?\K\w+

How to delimit this text file? strtok

so there's a text file where I have 1. languages, a 2. text of a number written in the said language, 3. the base of the number and 4. the number written in digits. Here's a sample:
francais deux mille quatre cents 10 2400
How I went about it:
struct Nomen{
char langue[21], nomNombre [31], baseC[3], nombreC[21];
int base, nombre;
};
and in the main:
if(myfile.is_open()){
{
while(getline(myfile, line))
{
strcpy(Linguo[i].langue, strtok((char *)line.c_str(), " "));
strcpy(Linguo[i].nomNombre, strtok(NULL, " "));
strcpy(Linguo[i].baseC, strtok(NULL, " "));
strcpy(Linguo[i].nombreC, strtok(NULL, "\n"));
i++;
}
Difficulty: I'm trying to put two whitespaces as a delimiter, but it seems that strtok() counts it as if there were only one whitespace. The fact there are spaces in the text number, etc. is messing up the tokenization. How should I go about it?
strtok treats any single character in the provided string as a delimiter. It does not treat the string itself as a single delimiter. So "  " (two spaces) is the same as " " (one space).
strtok will also treat multiple delimiters together as a single delimiter. So the input "t1  t2" will be tokenized as two tokens, "t1" and "t2".
As mentioned in comments, strtok also writes the NUL character into the input to create the token strings. So, it is an error to pass the result of string::c_str() as input to the function. The fact that you need to cast away the constness of the string should have been enough to dissuade you from this approach.
If you want to treat a double space as a delimiter, you will have to scan the string and search for them yourself. Given you are using C APIs, you can consider strstr. However, in C++, you can use string::find.
Here's an algorithm to parse your string manually:
Given an input string input:
language is the substring from the start of input to the first SPC character.
From where language ends, skip over all whitespace, changing input to begin at the first non-whitespace character.
text is the substring from the start of input to the first double SPC sequence.
From where text ends, skip over all whitespace, changing input to begin at the first non-whitespace character.
Parse base, and parse number.

Using Regex to remove leading/trailing whitespaces, except for quotes

I am trying to write a regular expression which recognises whitespaces from a user input string, except for between quotation marks ("..."). For example, if the user enters
#load "my folder/my files/ program.prog" ;
I want my regex substitution to transform this into
#load "my folder/my files/ program.prog" ;
So far I've implemented the following.
#include <iostream>
#include <string>
#include <regex>
int main(){
// Variables for user input
std::string input_line;
std::string program;
// User prompt
std::cout << ">>> ";
std::getline(std::cin, input_line);
// Remove leading/trailing whitespaces
input_line = std::regex_replace(input_line, std::regex("^ +| +$|( ) +"), "$1");
// Check result
std::cout << input_line << std::endl;
return 0;
}
But this removes whitespaces between quotes too. Is there any way I can use regex to ignore spaces between quotes?
You may add another alternative to match and capture double quoted string literals and re-insert it into the result with another backreference:
input_line = std::regex_replace(
input_line,
std::regex(R"(^ +| +$|(\"[^\"\\]*(?:\\[\s\S][^\"\\]*)*\")|( ) +)"),
"$1$2");
The \"[^\"\\]*(?:\\[\s\S][^\"\\]*)*\" part matches a ", then 0+ chars other than \ and ", then 0 or more occurrences of an escaped char (\ and then any char matched with [\s\S]) followed by 0+ chars other than \ and ", and finally the closing ".
Note I used a raw string literal R"(...)" to avoid having to escape regex escape backslashes (R"([\s\S])" = "[\\s\\S]").

need support defining the right regex

I would like to parse a file using boost::sregex_token_iterator.
Unfortunately I'm not able to find the right regex to extract strings in the form FOO:BAR out of it.
The code example below is usable only if one such occurrence per line is found, but I would like to support multiple such entries per line, and ideally also a comment after a '#'
So entries like this
AA:BB CC:DD EE:FF #this is a comment
should result in 3 identified tokens (AA:BB, CC:DD, EE:FF)
boost::regex re("((\\W+:\\W+)\\S*)+");
boost::sregex_token_iterator i(line.begin(), line.end(), re, -1), end;
for(; i != end; i++){
std::stringstream ss(*i);
...
}
Any support is very welcome.
I suggest you use splitting to get the values you need.
I would begin by first splitting using #. This separates the comment from the rest of the line. Then split using white space, which separates the pairs out. After this, individual pairs can be split using :.
If, for whatever reason, you must use regex, you can iterate over the matches. In this case I would use the following regex:
(?:#(?:.*))*(\w+:\w+)\s*
This regex will match every pair until it finds a comment. If there is a comment, it will skip to the next new line.
You want to match sequences of 1 or more word chars followed with : and then having again 1 or more word chars.
Thus, you need to replace -1 with 1 in the call to boost::sregex_token_iterator to get Group 1 text chunks and replace the regex you use with \w+:\w+ pattern:
boost::regex re(R"(#.*|(\w+:\w+))");
boost::sregex_token_iterator i(line.begin(), line.end(), re, 1), end;
Note that R"(#.*|(\w+:\w+))" is a raw string literal that actually represents #.*|(\w+:\w+) pattern that matches # and then the rest of the line or matches and captures the pattern you need into Group 1.
See a std::regex C++ example (you may easily adjust the code for Boost):
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main() {
std::regex r(R"(#.*|(\w+:\w+))");
std::string s = "AA:BB CC:DD EE:FF #this is a comment XX:YY";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
if (m[1].matched) // the comment alternative leaves Group 1 unset, so skip it
std::cout << m[1].str() << '\n';
}
return 0;
}

C++ Find Word in String without Regex

I'm trying to find a certain word in a string, but find that word alone. For example, if I had a word bank:
789540132143
93
3
5434
I only want a match to be found for the value 3, as the other values do not match exactly. I used the normal string::find function, but that found matches for all four values in the word bank because they all contain 3.
There is no whitespace surrounding the values, and I am not allowed to use Regex. I'm looking for the fastest implementation of completing this task.
If you want to count the words you should use a string to int map. Read a word from your file using >> into a string then increment the map accordingly
string word;
map<string,int> count;
ifstream input("file.txt");
while (input >> word) { // stops cleanly at end of file, so the last word is not counted twice
count[word]++;
}
Using >> has the benefit that you don't have to worry about whitespace.
It all depends on the definition of a word: is it a string separated from the others by whitespace? Or are other word separators (e.g. comma, dot, semicolon, colon, parentheses...) relevant as well?
How to parse for words without regex:
Here is an acceptable approach using find() and its variant find_first_of():
string myline; // line to be parsed
string what="3"; // string to be found
string separator=" \t\n,;.:()[]"; // string separators
while (getline(cin, myline)) {
size_t nxt=0;
while ( (nxt=myline.find(what, nxt)) != string::npos) { // search occurrences of what
if (nxt==0||separator.find(myline[nxt-1])!=string::npos) { // if at the beginning of a word
size_t nsep=myline.find_first_of(separator,nxt+1); // check if it extends to the end of the word
if ((nsep==string::npos && myline.length()-nxt==what.length()) || nsep-nxt==what.length()) {
cout << "Line: "<<myline<<endl; // bingo !!
cout << "from pos "<<nxt<<" to " << nsep << endl;
}
}
nxt++; // ready for next occurence
}
}
The principle is to check whether each occurrence found corresponds to a word, i.e. it starts at the beginning of the string or of a word (the previous char is a separator) and it extends to the next separator (or the end of the line).
How to solve your real problem:
You can have the fastest word search function, but if you use it to solve your problem of counting words, as you've explained in your comment, you'll waste a lot of effort!
The best way to achieve this would certainly be to use a map<string, int> to store and update a counter for each string encountered in the file.
You then just have to parse each line into words (you could use find_first_of() as suggested above) and use the map:
mymap[word]++;