Using regex to parse out numbers - c++

My problem is more or less self-explanatory: I want to write a regex to parse numbers out of a string that the user enters via the console. I read the user input using:
getline(std::cin,stringName); //1 2 3 4 5
I assume that the user enters N numbers, each followed by a space except the last one.
I have solved this problem by analyzing string char by char like this:
std::string helper = "";
std::for_each(stringName.cbegin(), stringName.cend(), [&](char c)
{
if (c == ' ')
{
intVector.push_back(std::stoi(helper.c_str()));
helper = "";
}
else
helper += c;
});
intVector.push_back(std::stoi(helper.c_str()));
I want to achieve the same behavior using regex. I wrote the following code:
std::regex rx1("([0-9]+ )");
std::sregex_iterator begin(stringName.begin(), stringName.end(), rx1);
std::sregex_iterator end;
while (begin != end)
{
std::smatch sm = *begin;
int number = std::stoi(sm.str(1));
std::cout << number << " ";
}
The problem with this regex occurs when it gets to the last number: since there is no space after it, it enters an infinite loop. Can someone give me an idea of how to fix this?

You're going to get an endless loop there because you never increment begin. If you do that, you'll get all the numbers except the last one (which, as you say, is not followed by a space).
But I don't understand why you feel it necessary to include the whitespace in the regular expression. If you just match a string of digits, the regex will automatically select the longest possible match, so the following character (if any) cannot be a digit.
I also see no value in the capture in the regex. If you wanted to restrict the capture to the number itself, you would have used ([0-9]+). (But since stoi only converts until it finds a non-digit, it doesn't matter.)
So you just use this:
std::regex rx1("[0-9]+");
for (auto it = std::sregex_iterator{str.begin(), str.end(), rx1},
end = std::sregex_iterator{};
it != end;
++it) {
std::cout << std::stoi(it->str(0)) << '\n';
}
(Live on coliru)
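For reuse, the same loop can be wrapped in a small function. This is only a sketch: the name extract_numbers is made up for illustration, and it assumes every run of digits fits in an int (std::stoi throws std::out_of_range otherwise).

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): collect every run of
// digits in `line` and convert each one to an int.
std::vector<int> extract_numbers(const std::string& line) {
    static const std::regex digits("[0-9]+");
    std::vector<int> out;
    // A default-constructed sregex_iterator is the end-of-sequence marker.
    for (std::sregex_iterator it(line.begin(), line.end(), digits), end; it != end; ++it) {
        out.push_back(std::stoi(it->str()));  // throws if the value does not fit in int
    }
    return out;
}
```

Called with the line "1 2 3 4 5", it should return the five integers in order; the last number needs no trailing space because the pattern matches digits only.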

Related

Searching for an alternative for strtok() in C++

I am using strtok to divide a string in several parts.
In this example, all sections will be read from the string, which are bounded by a colon or a semicolon
char string[] = "Alice1:IscoolAlice2; Alert555678;Bob1:knowsBeepBob2;sees";
char delimiter[] = ":;";
char *p;
p = strtok(string, delimiter);
while(p != NULL) {
cout << "Result: " << p << endl;
p = strtok(NULL, delimiter);
}
As results I get:
Result: Alice1
Result: IscoolAlice2
Result: Alert555678
Result: Bob1
Result: knowsBeepBob2
Result: sees
But I would like to get this results:
Result: Alice1:
Result: Alice2;
Result: Bob1:
Result: Bob2;
The restriction is that I can only choose individual characters when I use strtok.
Does anyone know an alternative for strtok that I also can search for strings?
Or has anyone an idea to solve my problem?
You cannot do that task with strtok, since you need a more complex search.
Although I am not sure what your delimiter string is, the same output can be produced with:
char string[] = "Alice1:IscoolAlice2; Alert555678;Bob1:knowsBeepBob2;sees";
char delimiter[] = "(?:Alice|Bob)\\d.";
std::regex regex( delimiter );
std::regex_iterator< const char* > first( std::begin( string ), std::end( string ), regex ), last;
while( first != last ){
std::cout << "Result: " << first->str() << '\n';
++first;
}
The output:
Result: Alice1:
Result: Alice2;
Result: Bob1:
Result: Bob2;
It's just a simple bit of scratch logic, along these lines:
char *ptr = string;
while(*ptr)
{
printf("Result:");
while(*ptr)
{
printf("%c", *ptr);
if(ispunct((unsigned char)*ptr)) /* ispunct is declared in <ctype.h> */
{
ptr++;
printf("\n");
break;
}
else
{
ptr++;
}
}
}
It's not possible with your stated data set to properly split it the way you want. You can come up with a "just so" rule to split literally just the data you showed, but given the messy nature of the data it's highly likely it'll fail on other examples. Let's start with this token.
IscoolAlice2
How is a computer program supposed to know which part of this is the name and which is not? You want to get "Alice2" out of this. If you decide that a capital letter specifies a name then it will just spit out the "name" IscoolAlice2. The same with:
knowsBeepBob2
If you search for the first capital letter then the program will decide his name is BeepBob2, so in each case searching for the last occurrence of a capital letter in the token finds the name. But what if a name contains two capital letters? The program will cut their name off and you can't do anything about that.
If you're happy to live with these sorts of limitations you can do an initial split via strtok using only the ; character, which gives:
Alice1:IscoolAlice2
Alert555678
Bob1:knowsBeepBob2
sees
Which is less than ideal. You could then specify a rule such that a name exists in any row which contains a : taking anything left of the : as a name, and then finding the last capital letter and anything from that point is also a name. That would give you the output you desire.
But the rules I outlined are extremely specific to the data that was just fed in. If anything about other samples of data deviates at all from this (e.g. a name with two capitals in it) then it will fail as there will be no way on Earth the program could determine where the "name" starts.
The only way to fix this is to go back to where the data is coming from and format it differently so that there is some sort of punctuation before the names.
Or alternatively you need a full database of all possible names that could appear, then search for them, find any characters up to the next : or ; and append them and print the name. But that seems extremely impractical.
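For what it's worth, the "just so" rule described above (split on ';', keep rows containing ':', take the part left of ':' as one name and the part from the last capital letter onward as the other) could be sketched as follows. The function name extract_names is invented for illustration, std::getline with a delimiter stands in for strtok, and the code assumes every row that contains ':' was terminated by ';' in the input. It inherits all the limitations listed above.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper implementing the data-specific rule:
//  1. split the input on ';'
//  2. skip rows without ':'
//  3. the text up to and including ':' is the first name
//  4. the text from the last capital letter to the end of the row is the second name
std::vector<std::string> extract_names(const std::string& input) {
    std::vector<std::string> names;
    std::istringstream rows(input);
    std::string row;
    while (std::getline(rows, row, ';')) {
        const std::string::size_type colon = row.find(':');
        if (colon == std::string::npos) continue;   // e.g. " Alert555678" or "sees"
        names.push_back(row.substr(0, colon + 1));  // e.g. "Alice1:"
        // Walk backwards from the end of the row to the last capital letter after ':'.
        std::string::size_type i = row.size();
        while (i > colon + 1 && !std::isupper(static_cast<unsigned char>(row[i - 1]))) --i;
        if (i > colon + 1) {
            names.push_back(row.substr(i - 1) + ';');  // assumes the row ended with ';'
        }
    }
    return names;
}
```

On the sample string from the question this should yield Alice1:, Alice2;, Bob1: and Bob2;, but a name with two capital letters would be cut short.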

C++11 Regex submatches

I have the following code to extract the left & right part from a string of type
[3->1],[2->2],[5->3]
My code looks like the following
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
regex expr("([[:d:]]+)->([[:d:]]+)");
string input = "[3->1],[2->2],[5->3]";
const std::sregex_token_iterator end;
int submatches[] = { 1, 2 };
string left, right;
for (std::sregex_token_iterator itr(input.begin(), input.end(), expr, submatches); itr != end;)
{
left = ((*itr).str()); ++itr;
right = ((*itr).str()); ++itr;
cout << left << " " << right << endl;
}
}
Output will be
3 1
2 2
5 3
Now I am trying to extend it so that first part will be a string instead of digit. For example, the input will be
[(3),(5),(0,1)->2],[(32,2)->6],[(27),(61,11)->1]
And I need to split it as
(3),(5),(0,1) 2
(32,2) 6
(27),(61,11) 1
The basic expression that I tried, "(\\(.*+)->([[:d:]]+)", just splits the entire string in two, as follows:
(3),(5),(0,1)->2],[(32,2)->6],[(27),(61,11) 1
Can somebody give me some suggestions on how to achieve this? Appreciate all the help.
You need to match everything after the first '[' up to, but not including, "->". It is much like writing a regex for the multiline comment /* ... */, where "*/" has to be excluded from the body; otherwise the regex is greedy and consumes everything up to the last one, which is what is happening in your case with "->". You can't really use the dot for any char here, because it is too greedy.
This works for me:
\\[([^-\\]]+)->([0-9]+)\\]
The '^' at the start of [...] negates the class, so it accepts every character except '-' (which keeps the match from running past "->") and ']'.
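What you need is to make it a bit more specific:
\[([^\]]*)->([^\]]*)\]
This avoids capturing too much data. (The ] inside the class is escaped because std::regex's default ECMAScript grammar would otherwise close the class at the first ].) See live demo.
You could have used the .*? pattern instead of [^\]]*, but it would have been less efficient.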
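A minimal sketch of how such a pattern might be applied with std::sregex_iterator, reading the two capture groups of each match. The function name split_pairs is made up for illustration, and the right-hand side is assumed to be digits only, as in the question.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: return (left, right) for every "[left->right]" group.
std::vector<std::pair<std::string, std::string>> split_pairs(const std::string& input) {
    // The left part is "anything but ]", so a match can never run past the
    // closing bracket of its own group; the right part is a run of digits.
    static const std::regex expr(R"(\[([^\]]*)->([0-9]+)\])");
    std::vector<std::pair<std::string, std::string>> out;
    for (std::sregex_iterator it(input.begin(), input.end(), expr), end; it != end; ++it) {
        out.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return out;
}
```

For the input [(3),(5),(0,1)->2],[(32,2)->6],[(27),(61,11)->1] this should produce three pairs, the first being "(3),(5),(0,1)" and "2".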

C++ Find Word in String without Regex

I'm trying to find a certain word in a string, but find that word alone. For example, if I had a word bank:
789540132143
93
3
5434
I only want a match to be found for the value 3, as the other values do not match exactly. I used the normal string::find function, but that found matches for all four values in the word bank because they all contain 3.
There is no whitespace surrounding the values, and I am not allowed to use Regex. I'm looking for the fastest implementation of completing this task.
If you want to count the words, you should use a string-to-int map. Read each word from your file into a string using >>, then increment the map entry accordingly:
string word;
map<string,int> count;
ifstream input("file.txt");
while (input >> word) { // stops cleanly at end of file, so the last word is not counted twice
count[word]++;
}
Using >> has the benefit that you don't have to worry about whitespace.
It all depends on the definition of a word: is it a string separated from others by whitespace? Or are other word separators (e.g. comma, dot, semicolon, colon, parentheses...) relevant as well?
How to parse for words without regex:
Here is an acceptable approach using find() and its variant find_first_of():
string myline; // line to be parsed
string what="3"; // string to be found
string separator=" \t\n,;.:()[]"; // string separators
while (getline(cin, myline)) {
size_t nxt=0;
while ( (nxt=myline.find(what, nxt)) != string::npos) { // search occurrences of what
if (nxt==0||separator.find(myline[nxt-1])!=string::npos) { // if at the beginning of a word
size_t nsep=myline.find_first_of(separator,nxt+1); // check if it extends to the end of the word
if ((nsep==string::npos && myline.length()-nxt==what.length()) || nsep-nxt==what.length()) {
cout << "Line: "<<myline<<endl; // bingo !!
cout << "from pos "<<nxt<<" to " << nsep << endl;
}
}
nxt++; // ready for next occurrence
}
}
And here the online demo.
The principle is to check whether each occurrence found corresponds to a whole word, i.e. it starts at the beginning of the line or of a word (the previous char is a separator) and it extends to the next separator (or the end of the line).
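The same principle can be packaged as a standalone function. This is a sketch rather than the code from the demo: the name find_whole_word is invented, and it tests the character right after the occurrence instead of calling find_first_of(), which amounts to the same check.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: position of the first occurrence of `what` in `line`
// that is bounded on both sides by a separator or by the start/end of the
// line; std::string::npos if there is none.
std::string::size_type find_whole_word(const std::string& line, const std::string& what,
                                       const std::string& separators = " \t\n,;.:()[]") {
    if (what.empty()) return std::string::npos;
    for (std::string::size_type pos = line.find(what); pos != std::string::npos;
         pos = line.find(what, pos + 1)) {
        const std::string::size_type after = pos + what.size();
        const bool starts_word =
            pos == 0 || separators.find(line[pos - 1]) != std::string::npos;
        const bool ends_word =
            after == line.size() || separators.find(line[after]) != std::string::npos;
        if (starts_word && ends_word) return pos;
    }
    return std::string::npos;
}
```

With the word bank from the question, searching for "3" should succeed only on the line that is exactly 3 and fail on 789540132143, 93 and 5434.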
How to solve your real problem:
Even with the fastest word search function, if you use it to solve your problem of counting words, as you explained in your comment, you'll waste a lot of effort!
The best way to achieve this would certainly be to use a map<string, int> to store and update a counter for each string encountered in the file.
You then just have to parse each line into words (you could use find_first_of() as suggested above) and use the map:
mymap[word]++;
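A minimal sketch of that counting approach, assuming the words are separated by whitespace (for other separators the lines would have to be split first). The names count_words and count_words_in are made up for illustration.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: count how often each whitespace-separated word occurs.
std::map<std::string, int> count_words(std::istream& in) {
    std::map<std::string, int> counts;
    std::string word;
    while (in >> word) {  // the extraction itself is the loop condition
        ++counts[word];   // operator[] creates a zero entry on first use
    }
    return counts;
}

// Convenience wrapper for text that is already in memory.
std::map<std::string, int> count_words_in(const std::string& text) {
    std::istringstream in(text);
    return count_words(in);
}
```

An std::ifstream can be passed to count_words directly. Looking up "3" in the result then gives the number of exact matches, with no substring hits from 93 or 5434.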

boost regex to extract a number from string

I have a string
resource = "/Music/1"
The string can take multiple numeric values after "/Music/". I am new to regular expressions. I tried the following code:
#include <iostream>
#include<boost/regex.hpp>
int main()
{
std::string resource = "/Music/123";
const char * pattern = "\\d+";
boost::regex re(pattern);
boost::sregex_iterator it(resource.begin(), resource.end(), re);
boost::sregex_iterator end;
for( ; it != end; ++it)
{
std::cout<< it->str() <<"\n";
}
return 0;
}
vickey#tb:~/trash/boost$ g++ idExtraction.cpp -lboost_regex
vickey#tb:~/trash/boost$ ./a.out
123
This works fine. But when the string happens to be something like "/Music23/123", it gives me the value 23 before 123. When I use the pattern "/\d+", it gives results even when the string is /23/Music/123. What I want to do is extract only the number after "/Music/".
I think part of the problem is that you haven't defined very well (at least to us) what it is you are trying to match. I'm going to take some guesses. Perhaps one will meet your needs.
The number at the end of your input string. For example "/a/b/34". Use regex "\\d+$".
A path element that is entirely numeric. For example "/a/b/12/c" or "/a/b/34" but not "/a/b56/d". Use regex "(?:^|/)(\\d+)(?:/|$)" and get captured group [1]. You might do the same thing with lookahead and lookbehind, perhaps with "(?<=^|/)\\d+(?=/|$)".
If there will never be anything after the last slash, could you just use a regex or string.split() to get everything after the last slash? I'd give you code, but I'm on my phone now.
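If the requirement really is "the whole string is /Music/ followed by digits", one possible sketch is below. It uses std::regex rather than Boost so that it has no external dependency (boost::regex offers regex_match and smatch with the same names), the function name parse_music_id is invented, and the id is returned as a string to sidestep integer overflow.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: succeeds only if `resource` is exactly "/Music/<digits>"
// and stores the digits in `id`.
bool parse_music_id(const std::string& resource, std::string& id) {
    static const std::regex re("/Music/([0-9]+)");
    std::smatch m;
    // regex_match requires the pattern to cover the entire string, so no
    // explicit ^ and $ anchors are needed.
    if (!std::regex_match(resource, m, re)) return false;
    id = m[1].str();
    return true;
}
```

Under that assumption, "/Music/123" is accepted while "/Music23/123" and "/23/Music/123" are rejected.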

regex as tokenizer - string beginning with delimiter

sregex_token_iterator works almost perfectly as a tokenizer when the index of the submatch is specified to be -1. But unfortunately it doesn't work well with strings that begin with delimiters e.g:
#include <string>
#include <regex>
#include <iostream>
using namespace std;
int main()
{
string s("--aa---b-c--d--");
regex r("-+");
for (sregex_token_iterator it = sregex_token_iterator(s.begin(), s.end(), r, -1); it != sregex_token_iterator(); ++it)
{
cout << (string) *it << endl;
}
return 0;
}
prints out:

aa
b
c
d
(Note the leading empty line).
So note that it actually handles trailing delimiters well (as it doesn't print an extra empty line).
Reading the standard, it seems there is a clause specifically for making trailing delimiters work well, i.e.:
[re.tokiter] no 4.
If the end of sequence is reached (position is equal to the end of sequence iterator), the iterator becomes equal to the end-of-sequence iterator value, unless the sub-expression being enumerated has index -1, in which case the iterator enumerates one last sub-expression that contains all the characters from the end of the last regular expression match to the end of the input sequence being enumerated, provided that this would not be an empty sub-expression.
Does anyone know what's the reason for this seemingly asymmetric behaviour being specified?
And lastly, is there an elegant solution to make this work? (such that we don't have empty entries at all).
Apparently your regex matches empty strings between the - delimiters; a simple (not necessarily elegant) solution is to discard all strings with length zero:
...
string aux = (string) *it;
if(aux.size() > 0){
cout << aux << endl;
}
...
It seems that when you pass -1 as the fourth argument you're effectively doing a split, and that's the expected behavior for a split. The first token is whatever precedes the first delimiter, and the last token is whatever follows the last delimiter. In this case, both happen to be the empty string, and it's traditional for split() to drop any empty tokens at the end, but to keep the ones at the beginning.
Just out of curiosity, why don't you match the tokens themselves? If "-+" is the correct regex for the delimiters, this should match the tokens:
regex r("[^-]+");
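Matching the tokens instead of the delimiters could look like this. It is only a sketch (the name tokenize is invented), and it assumes '-' is the only delimiter character.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return every maximal run of non-'-' characters.
std::vector<std::string> tokenize(const std::string& s) {
    static const std::regex token("[^-]+");
    std::vector<std::string> out;
    for (std::sregex_iterator it(s.begin(), s.end(), token), end; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

Because "+" requires at least one character, this approach cannot produce an empty token, whether the delimiters are leading, trailing, or repeated.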