Finding a string of numbers within another string - c++

So, I'm having a problem in C++.
I need to search for a string of five numbers that won't always be in the same spot in a string.
For example, sometimes the source string might be "sjdjfut93835sxx" and other times it may be "jj3333333335".
In the first string, I would need to extract "93835". In the second string, I wouldn't extract anything, since the run of digits is longer than five characters.
I need to find strings of numbers that are 5 characters long and only numbers, no letters in-between.
What would be the easiest way of doing this? I'm having a lot of trouble with this and can't find an answer anywhere on Google or in past Stack Overflow questions.
Thanks!

Try splitting the task up into two steps.
First, use something like regular expressions to pull out all of the numeric strings (93835 and 3333333335 in your example).
Second, remove any results that aren't 5 characters long.
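The two steps above could be sketched in C++ roughly as follows. This is only an illustration: the function name digit_runs_of_length and its width parameter are made up for the example, and it assumes std::regex (C++11) is available.

```cpp
#include <regex>
#include <string>
#include <vector>

// Step 1: collect every maximal run of digits.
// Step 2: keep only the runs whose length is exactly `width`.
// (Hypothetical helper name, not from the original answer.)
std::vector<std::string> digit_runs_of_length(const std::string& text,
                                              std::size_t width = 5) {
    static const std::regex digits("[0-9]+");  // greedy, so each match is a whole run
    std::vector<std::string> kept;
    for (std::sregex_iterator it(text.begin(), text.end(), digits), end;
         it != end; ++it) {
        if (it->str().size() == width) {
            kept.push_back(it->str());
        }
    }
    return kept;
}
```

Called on "sjdjfut93835sxx" this should return a single entry, and on "jj3333333335" an empty vector, because the only run there is ten digits long.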

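With std::regex:
#include <regex>
#include <string>
using namespace std;

int extract(const string& str) {
    smatch result;
    regex r("\\d{5}");             // five consecutive digits
    regex_search(str, result, r);  // result stays empty if nothing matches
    return stoi(result.str());     // stoi throws on an empty string
}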
Note that stoi throws an exception (std::invalid_argument) if no number is found.
Edit: this function also matches strings that contain more than 5 consecutive digits.
You can change the regex to (^|\\D)\\d{5}($|\\D), then remove the leading non-digit (if there is one) before calling stoi.
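One way to avoid stripping the extra non-digit by hand is to capture only the digits. The sketch below is a variation on the regex above, not the answer's exact code: the function name first_five_digit_run is hypothetical, and it returns an empty string instead of throwing when nothing matches.

```cpp
#include <regex>
#include <string>

// Returns the first run of exactly five digits, or an empty string if none.
// (Hypothetical helper; a variant of the boundary-checked regex above.)
std::string first_five_digit_run(const std::string& text) {
    // The boundary groups are non-capturing; only the digits land in group 1.
    static const std::regex pattern(R"((?:^|\D)(\d{5})(?:$|\D))");
    std::smatch match;
    if (std::regex_search(text, match, pattern)) {
        return match[1].str();
    }
    return std::string();
}
```

Returning a string also keeps leading zeros, which stoi would drop.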

That would be pretty simple to do with a DFA (deterministic finite automaton) or with a pattern-matching algorithm such as Boyer-Moore or Knuth-Morris-Pratt. You can find thorough descriptions of them in any algorithms book.
Otherwise, as Joshua noted, you could use a ready-made regex library and let it do the searching and pattern matching.
Your specific problem might also be solved "manually" with a hand-crafted solution (if I understood it correctly) like the following:
Scan the string one character at a time.
When you meet a digit, start counting how many consecutive digits follow.
If the run is not exactly 5 digits long, drop it and reset the counter until you find another digit.
This is pretty easy and O(N).

You can create simple finite state machine with the states:
1) Waiting for digit
2) Have first digit, waiting for second digit
3) Have second digit, waiting for third digit
4) ...
5) ...
6) ...
7) Have fifth digit, waiting for letter or end of string
8) Finish. Return string.

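#include <cctype>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string text = "sjdjfut93835sxx";
    string aux; // digits of the current run
    // Go one step past the end so a run that ends the string is checked too.
    for (size_t i = 0; i <= text.size(); i++)
    {
        if (i < text.size() && isdigit(static_cast<unsigned char>(text[i])))
        {
            aux += text[i];
        }
        else
        {
            if (aux.size() == 5) // the run just ended: accept exactly five digits
            {
                cout << "I found it! " << aux << '\n';
            }
            aux.clear();
        }
    }
}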

Related

Regular expression string division, priorize the part lengths

I have this string
0Sc-a+nn1.ed_AI&AO1301#89
That has to be split in three parts
0Sc-a+nn1.ed_AI&AO
1301
89
I am using this regex in Python, though for now I am testing it at https://regex101.com/: (?P<prefix>[a-z\.\_\-\+(\&)]+\W?)(?P<num>((?P<ref_num>\d+)(#(?P<subpart_num>\d+))?))
I am having trouble identifying the first part. "Sc-a+nn.ed_AI&AO1301#89" works fine, but once the first part contains numbers, as in the example, it doesn't.
How can I prioritize the second and third parts so that they take the maximum length allowed around the #, while the first part allows numbers at the beginning and in the middle (never at the end, because those belong to part two)? The ? is there because sometimes the preceding element doesn't exist.
Use [a-zA-Z]{2} to capture the two letters after the & and, if needed, specify the length of each part, e.g. [\d]{4}:
(?P<prefix>[A-Za-z0-9._\-+&;]+[a-zA-Z]{2}?)(?P<num>((?P<ref_num>\d+)(#(?P<subpart_num>\d+))?))
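For comparison, the same idea might look like this with C++ std::regex. This is a sketch under assumptions: std::regex's default ECMAScript grammar has no (?P<name>...) named groups, so numbered groups are used instead, and the Parts struct and split_reference function are names invented for the example.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for the three parts.
struct Parts {
    std::string prefix;
    std::string ref_num;
    std::string subpart_num;  // empty when there is no "#..." suffix
};

// Returns true and fills `out` when the whole input fits the pattern.
bool split_reference(const std::string& input, Parts& out) {
    // Group 1: prefix that must end in two letters; group 2: digits;
    // group 3: optional digits after '#'.
    static const std::regex pattern(
        R"(([A-Za-z0-9._+&;-]+[A-Za-z]{2})([0-9]+)(?:#([0-9]+))?)");
    std::smatch m;
    if (!std::regex_match(input, m, pattern)) {
        return false;
    }
    out.prefix = m[1].str();
    out.ref_num = m[2].str();
    out.subpart_num = m[3].str();
    return true;
}
```

Requiring the prefix to end in two letters is what stops it from swallowing the digits of the second part, even though digits are allowed earlier in the prefix.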

How to get a count of the word sizes in a large amount of text?

I have a large amount of text - roughly 7000 words.
I would like to get a count of the word sizes, e.g. the count of 4-letter words and 6-letter words, using regex.
I am unsure how to go about this - my thought process so far would be to split the sentence into a String array, which would allow me to count each individual element's size. Is there an easier way to go about this using a regex? I am using Groovy for this task.
EDIT: So I did get this working using a normal array, but it was slightly messy. The final solution simply used Groovy's countBy() method coupled with a small amount of logic, for anyone who might come across a similar problem.
Don't forget the word boundary token \b. If you don't put it at both ends of a \w{n} token, then all words longer than n characters are also found. For a 4-character word use \b\w{4}\b; for a 6-character word use \b\w{6}\b. Here is a demo with 7000 words as input string.
Java implementation:
String dummy = ".....";
Pattern pattern = Pattern.compile("\\b\\w{6}\\b");
Matcher matcher = pattern.matcher(dummy);
int count = 0;
while (matcher.find())
count++;
System.out.println(count);
Read the file using any stream word by word and calculate their length. Store counters in an array and increment values after reading each word.
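A minimal sketch of that stream-based idea in C++ (the thread itself uses Groovy and Java, so this is only an illustration). The function name word_length_counts is made up, a std::map stands in for the counter array, and words are taken to be whitespace-separated tokens with punctuation left attached.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Maps word length -> number of words with that length.
// Words are whitespace-separated tokens; punctuation is not stripped here.
std::map<std::size_t, int> word_length_counts(const std::string& text) {
    std::map<std::size_t, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {       // reads one whitespace-delimited word at a time
        ++counts[word.size()];
    }
    return counts;
}
```

The same loop works on a std::ifstream if the text lives in a file.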
You could generate regexes for each size you want.
\b\w{6}\b would get each word with exactly 6 letters
\b\w{7}\b would get each word with exactly 7 letters
and so on...
So you could run one of these regex on the text, with the global flag enabled (finding every instance in the whole string). This will give you an array of every match, which you can then find the length of.
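As a rough C++ illustration of running one generated regex per size: the sketch below builds the pattern for a given n and counts every match. The name count_words_of_length is hypothetical, and word boundaries (\b) are included so that longer words are not counted.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Counts words made of exactly `n` word characters.
// (Hypothetical helper; builds the regex \b\w{n}\b for the requested size.)
std::ptrdiff_t count_words_of_length(const std::string& text, int n) {
    const std::regex pattern("\\b\\w{" + std::to_string(n) + "}\\b");
    // Iterating over all matches plays the role of the "global" flag.
    return std::distance(
        std::sregex_iterator(text.begin(), text.end(), pattern),
        std::sregex_iterator());
}
```

Calling it once per size rescans the text each time, which is fine for around 7000 words.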

Regex less than or greater than 0

I'm trying to find a regex that validates for a number being greater or less than 0.
It must allow numbers such as 1.20, -2, 0.0000001, etc. It simply can't be 0 and it must be a number, which also means it can't be 0.00 or 0.0.
^(?=.*[1-9])(?:[1-9]\d*\.?|0?\.)\d*$
I tried that, but it does not allow negative numbers.
I don't think a regex is the appropriate tool for that problem.
Why not use a simple condition?
long number = ...;
if (number != 0)
{
// ...
}
Why use a bazooka to kill a fly?
I also tried something:
-?[0-9]*([1-9][0-9]*(\.[0-9]*)?|\.[0-9]*[1-9][0-9]*)
demo: http://regex101.com/r/bZ8fE5
Just tried something:
[+-]?(?:\d*[1-9]\d*(?:\.\d+)?|0+\.\d*[1-9]\d*)
Online demo
Take a typical regex for a number, say
^[+-]?[0-9]*(\.[0-9]*)?$
and then require that there be a non-zero digit either before or after the decimal. Based on your examples, you're not expecting leading zeros before the decimal, so a simple regex might be
^[+-]?([1-9][0-9]*(\.[0-9]*)?|[0-9]*\.[0-9]*[1-9][0-9]*)$
Then decide if you still want to use a regex for this.
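If you do stay with a regex, a quick way to check it against the examples from the question is a small test function. The sketch below uses C++ std::regex and a tightened variant of the pattern above (anchored as a whole via regex_match, with a non-zero digit required on one side of the decimal point). Both the function name is_nonzero_number and the exact pattern are assumptions made for illustration.

```cpp
#include <regex>
#include <string>

// True when `s` looks like a signed decimal number that is not zero.
// (Hypothetical helper; regex_match anchors the pattern at both ends.)
bool is_nonzero_number(const std::string& s) {
    static const std::regex pattern(
        R"([+-]?(?:[1-9][0-9]*(?:\.[0-9]*)?|[0-9]*\.[0-9]*[1-9][0-9]*))");
    return std::regex_match(s, pattern);
}
```

With this variant, 1.20, -2 and 0.0000001 should be accepted, while 0, 0.0 and 0.00 should be rejected.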
Try to negate the regex like this
!^[0\.]+$
If you're feeling the need to use regex just because it's stored as a String, you could use Double.parseDouble() to convert the string into a numeric type. This would have the added advantage of checking whether the string is a valid number or not (by catching NumberFormatException).
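The same parse-then-compare idea can be sketched in C++ with std::strtod instead of Double.parseDouble(). The function name parses_to_nonzero is hypothetical. Note that strtod is more permissive than the regexes above: it also accepts forms such as "1e5" and leading whitespace.

```cpp
#include <cmath>
#include <cstdlib>
#include <string>

// True when `s` parses completely as a finite number and that number is not zero.
// (Hypothetical helper illustrating the parse-instead-of-regex approach.)
bool parses_to_nonzero(const std::string& s) {
    if (s.empty()) {
        return false;
    }
    char* end = nullptr;
    const double value = std::strtod(s.c_str(), &end);
    // Reject inputs with trailing junk such as "1.5x".
    const bool consumed_all = (end == s.c_str() + s.size());
    return consumed_all && std::isfinite(value) && value != 0.0;
}
```

Here the role of catching NumberFormatException is played by checking that strtod consumed the whole string.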

How to find a formatted number in a string?

If I have a string, and I want to find if it contains a number of the form XXX-XX-XXX, and return its position in the string, is there an easy way to do that?
XXX-XX-XXX can be any number, such as 259-40-092.
This is usually a job for a regular expression. Have a look at the Boost.Regex library for example.
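As a minimal sketch, here is the same idea with the standard library's std::regex (C++11) swapped in for Boost.Regex. The function name find_formatted_number is made up for the example. It returns the index of the first match, and it does not check what surrounds the match, so a hit inside a longer digit sequence is also reported.

```cpp
#include <regex>
#include <string>

// Returns the index of the first XXX-XX-XXX match, or std::string::npos.
// (Hypothetical helper; uses std::regex rather than Boost.Regex.)
std::string::size_type find_formatted_number(const std::string& text) {
    static const std::regex pattern(R"(\d{3}-\d{2}-\d{3})");
    std::smatch match;
    if (std::regex_search(text, match, pattern)) {
        return static_cast<std::string::size_type>(match.position(0));
    }
    return std::string::npos;
}
```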
I did this before....
Regular Expression is your superhero, become his friend....
// JavaScript
var numRegExp = /^[+]?([0-9- ]+){10,}$/;
if (numRegExp.test("259-40-092")) {
    alert("True - Number found....");
} else {
    alert("False - Not a Number");
}
To give you a position in the string, that will be your homework. :-)
The regular expression in C++ would be:
const char* regExp = "[+]?([0-9- ]+){10,}";
Use Boost.Regex for this instance.
If you don't want regexes, here's an algorithm:
Find the first -
LOOP {
Find the next -
If not found, break.
Check if the distance is 2
Check if the 8 characters surrounding the two minuses are digits
If so, found the number.
}
This is not optimal, but the scan speed will already be dominated by the cache/memory speed. It can be optimized by considering at which part the match failed, and how. For instance, if you've got "123-4X-X............", when you find the X you know that you can skip ahead quickly. The second - preceding the X cannot be the first - of a proper number. Similarly, in "123--" you know that the second - can't be the first - of a number either.
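The unoptimized version of this algorithm could be sketched in C++ as follows. The names all_digits and find_formatted_number_manual are invented for the example, and instead of tracking pairs of minuses it simply treats each '-' as a candidate first minus and checks the fixed positions around it.

```cpp
#include <cctype>
#include <string>

namespace {
// True when s[from .. from + count) are all digits (caller guarantees bounds).
bool all_digits(const std::string& s, std::size_t from, std::size_t count) {
    for (std::size_t i = from; i < from + count; ++i) {
        if (!std::isdigit(static_cast<unsigned char>(s[i]))) {
            return false;
        }
    }
    return true;
}
}  // namespace

// Returns the index where the first XXX-XX-XXX starts, or std::string::npos.
std::string::size_type find_formatted_number_manual(const std::string& s) {
    const std::size_t length = 10;  // 3 digits, '-', 2 digits, '-', 3 digits
    std::size_t dash = s.find('-');
    while (dash != std::string::npos) {
        // The first '-' needs 3 characters before it and 6 after it.
        if (dash >= 3 && dash - 3 + length <= s.size() &&
            s[dash + 3] == '-' &&           // second '-' at distance 2 digits
            all_digits(s, dash - 3, 3) &&
            all_digits(s, dash + 1, 2) &&
            all_digits(s, dash + 4, 3)) {
            return dash - 3;
        }
        dash = s.find('-', dash + 1);
    }
    return std::string::npos;
}
```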

checking float inside a string and return result?

I have a text file which I read with getline into a string. The file looks like this: 0.2abc 0.2 .2abc .2 abc.2abc abc.2 abc0.20 .2 . 20
I want to check the contents and then parse them into separate floats. The expected result is: 0.2 0.2abc 2 20 2abc abc0.20 abc
To explain: check whether there is a digit on both sides of the '.' (full stop), with or without letters around it. If only one side of the '.' is a digit, the '.' is treated as a full stop.
How can I parse a string into separate results like that? I tried using an iterator to find the '.' and its position, but I still got stuck.
The first thing you need to do is split the input into words. That's easy: just don't use getline(),
but instead rely on while (cin >> strWord) { /* do stuff with word */ }
The second thing is to kick out bad input words early: words of 2 characters or less, with more than one ., or with the . first or last.
You now know that the . is somewhere in the middle. std::find() will give you an iterator. ++ and -- give you the next and previous iterators. * gives you the character that the iterator points to. isdigit() tells you whether that character is a digit. Add the ingredients together and you're done.
Seems like some fairly complicated advice above -- and not necessarily helpful.
Your question does not make it entirely clear what the end result should look like. Do you want an array of floating point numbers? Do you just want the sum? Do you want to print out the results?
If you want help with homework, the best policy is to post your own attempt and then others can help you improve it, to make it work.
One approach that might help is to try to break the string into sub-strings (tokens) and discard the junk.
Write a function that accepts a character and returns true (this is part of a floating point number) or false (it isn't).
Scan along the string using an iterator or an index.
While current char is not part of a token, skip it.
If you find a token char, while current char is part of a token, copy it to another string
etc. to get all floating point substrings.
Then you can use std::stringstream or ::atof() to convert.
Have a bit of a go and post what you can get done.
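To make the token idea concrete, here is one rough sketch. The names is_float_char and extract_floats are hypothetical, and it deliberately ignores the special rule about a '.' with a digit on only one side, so treat it as a starting point rather than a full answer to the question.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: is `c` part of a floating-point token?
bool is_float_char(char c) {
    return c == '.' || std::isdigit(static_cast<unsigned char>(c));
}

// Collects every run of digits and dots and converts it with atof.
std::vector<double> extract_floats(const std::string& text) {
    std::vector<double> values;
    std::size_t i = 0;
    while (i < text.size()) {
        // Skip characters that are not part of a token.
        while (i < text.size() && !is_float_char(text[i])) ++i;
        // Copy the token characters into their own string.
        std::string token;
        while (i < text.size() && is_float_char(text[i])) token += text[i++];
        // Keep only tokens that contain at least one digit (drops a lone ".").
        if (token.find_first_of("0123456789") != std::string::npos) {
            values.push_back(std::atof(token.c_str()));
        }
    }
    return values;
}
```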
Sounds like you could use a regex to extract your numbers.
Try this regex in order to extract the floating values within a string.
[0-9]+\.[0-9]+
Keep in mind that this won't extract integer values, e.g. the 234 in 234abc.
I don't know if there is a built-in way to use regex in C++, but I found this library with a quick Google search which allows you to use regex in C++.
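For what it's worth, since C++11 the standard library ships std::regex, so the pattern above can be tried without an external library. A minimal sketch, with the function name find_decimals made up for the example:

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns every substring of the form digits '.' digits, in order.
// (Hypothetical helper using std::regex from the C++11 standard library.)
std::vector<std::string> find_decimals(const std::string& text) {
    static const std::regex pattern(R"([0-9]+\.[0-9]+)");
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

On the sample input from the question this should pick out 0.2, 0.2 and 0.20, and skip entries like .2 and 20 that lack digits on both sides of the dot.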
Sounds like you should look at the "Interpreter" Design Pattern.
Or you could use the "State" Design Pattern and do it by hand.
There should be plenty of examples of both on the web.