What does this regex expression do?

What does this expression mean?
Pattern.compile("^.*(?=.*\\d).*$", Pattern.CASE_INSENSITIVE | Pattern.COMMENTS)
I tried to split the expression into its parts, but could not work out its meaning. Please help me with this.

From regex101.com:
TL;DR:
Matches any string that contains at least one digit (characters '0' to '9').
As a side note I'd like to point out that this is a horrendous way to do so, and could be replaced by the following:
Pattern.compile("\\d");
I basically removed all the nonsense greedy fillers and the useless anchors. Use this regex with the Matcher#find() method and not Matcher#matches(), since matches() requires the entire input to match the pattern.
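A minimal Java sketch of that difference. The class name DigitCheck, the method names, and the sample inputs are made up for illustration; only the two patterns come from the text above.

```java
import java.util.regex.Pattern;

class DigitCheck {
    // The original pattern from the question, flags included.
    static final Pattern ORIGINAL = Pattern.compile(
            "^.*(?=.*\\d).*$", Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);

    // The simplified replacement: a single digit.
    static final Pattern SIMPLE = Pattern.compile("\\d");

    // Whole-input match with the original pattern.
    static boolean originalAccepts(String s) {
        return ORIGINAL.matcher(s).matches();
    }

    // Substring search with the simplified pattern.
    static boolean containsDigit(String s) {
        return SIMPLE.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc1def", "abc", "42"}) {
            System.out.println(s + " -> original: " + originalAccepts(s)
                    + ", simplified: " + containsDigit(s));
        }
        // matches() with the bare \d needs the whole input to be one digit.
        System.out.println(SIMPLE.matcher("abc1def").matches());
    }
}
```

For these sample inputs both approaches agree; the last line shows why the simplified pattern has to be paired with find().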

There are two parts to this regex.
1. The part up to (but not including) the digit.
2. The part from the digit to the end of the string.
The regex is processed left to right.
The first thing it sees is .*. This tells it to go directly to the
end of the string and then backtrack, one character at a time, until it can satisfy
the next thing it sees, which is (?=.*\d).
Inside that assertion the .* ends up matching nothing, because the previous .*
has already consumed everything up to the last digit.
So the search moves to the left (re-testing the assertion at each step) until it
reaches a position where a digit is directly in front of the current position.
Once that is found, the final .* matches that digit and everything after it, up to the end of
the string. This is part 2 described above.
Visually, it can be seen if you add some capture groups, and test it on some
real input.
^
( .* ) # (1)
(?=
( .* ) # (2)
( \d ) # (3)
)
( .* ) # (4)
$
Output:
** Grp 0 - ( pos 0 , len 15 )
12hh34ddd567uuu
** Grp 1 - ( pos 0 , len 11 )
12hh34ddd56
** Grp 2 - ( pos 11 , len 0 ) EMPTY
** Grp 3 - ( pos 11 , len 1 )
7
** Grp 4 - ( pos 11 , len 4 )
7uuu
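The same experiment can be reproduced in Java. This is only a sketch: the class name GroupWalk and the helper groups are hypothetical, and the pattern is the grouped variant shown above with the whitespace removed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupWalk {
    // Same pattern as the question, with the four capture groups added.
    static final Pattern P = Pattern.compile("^(.*)(?=(.*)(\\d))(.*)$");

    // Returns groups 0..4 for the input, or null if it does not match.
    static String[] groups(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("12hh34ddd567uuu");
        for (int i = 0; i < g.length; i++) {
            System.out.println("Grp " + i + " = '" + g[i] + "'");
        }
    }
}
```

Group 2 coming back empty is the point made above: the .* inside the assertion has nothing left to consume before the last digit.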

Related

Extracting inner groups with regex

I have the following string
([Valor][Corr][Fat]: 6M UC x Viz. Lógicos IN('3','6')) AND (((SUM_RevisionAnomalia_UltRevision_1M = 1) AND (CANT_ConsumoFact_UltRevision_1M > 1)) OR ((SUM_RevisionNoAnomalia_UltRevision_1M + 1) AND (CANT_ConsumoFact_UltRevision_1M BETWEEN 1 - 2))) OR (SUM_RevisionNoAnomalia_UltRevision_1M <= 1)
and I am trying to extract all inner groups, so my answer should contain
([Valor][Corr][Fat]: 6M UC x Viz. Lógicos IN('3','6'))
(SUM_RevisionAnomalia_UltRevision_1M = 1)
(CANT_ConsumoFact_UltRevision_1M > 1)
(SUM_RevisionNoAnomalia_UltRevision_1M + 1)
(CANT_ConsumoFact_UltRevision_1M BETWEEN 1 - 2)
(SUM_RevisionNoAnomalia_UltRevision_1M <= 1)
It is quite easy to extract this when there is only 1 set of those strings inside parentheses, but when given the example above my regex captures the whole string.
The regex I am using is
/(\([a-zA-Z0-9\[\]:_+=-\s\.\(\),'óáéíúüçãôàäê><]+\))/g
It seems you want to match the text between ( and ) that contains no further ( or ), except for a nested (...) that is immediately preceded by a word character.
You can use
\((?:[^()]|\b\([^()]*\))*\)
The regex breakdown:
\( - matching a literal (
(?:[^()]|\b\([^()]*\))* - zero or more sequences of:
[^()] - any character other than ( and )
| - or...
\b\([^()]*\) - a word boundary (i.e. there must be a word character just before that position), followed by (, then zero or more characters other than ( and ), then a closing )
\) - a closing )
An alternative pattern can be an unrolled one (more efficient with longer inputs):
\([^()]*(?:\b\([^()]*\)[^()]*)*\)
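Both patterns can be tried out with a short Java sketch. The class InnerGroups, the helper extract, and the shortened input string are assumptions made for illustration; the input only imitates the nesting shape of the string in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnerGroups {
    // The two patterns from the answer, written as Java string literals.
    static final String SIMPLE = "\\((?:[^()]|\\b\\([^()]*\\))*\\)";
    static final String UNROLLED = "\\([^()]*(?:\\b\\([^()]*\\)[^()]*)*\\)";

    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> extract(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical shortened input with the same nesting as the question's string.
        String input = "(A IN('3','6')) AND (((X = 1) AND (Y > 1)) OR (Z BETWEEN 1 - 2))";
        System.out.println(extract(SIMPLE, input));
        System.out.println(extract(UNROLLED, input));
    }
}
```

On this input both patterns return the same four groups; IN('3','6') stays inside its enclosing group because the ( directly follows a word character.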

Ignoring the first match in regex

I'm doing an exercise from C++ Primer
Rewrite your phone program so that it writes only the
second and subsequent phone numbers for people with more than one phone
number.
(The phone program simply recognises phone-numbers that have a certain format using a regular expression).
The chapter has been discussing using regex_replace and the format flags to alter the format of the phone numbers entered in. The question is asking to ignore the first phone number entered and only format/print the second and subsequent. My input might look something like:
dave: 050 000 0020, (402)2031032, (999) 999-2222
and it should output
402.203.1032 999.999.2222
This is my solution:
#include <iostream>
#include <string>
#include <regex>
using namespace std;
using namespace regex_constants;
int main() {
    string pattern = "(\\()?(\\d{3})(\\))?([-. ])?(\\d{3})([-. ])?(\\d{4})";
    regex r(pattern);
    //string firstFormat = "";
    string secondFormat = "$2.$5.$7 ";
    for (string line; getline(cin, line);) {
        unsigned counter = 0;
        for (sregex_iterator b(line.begin(), line.end(), r), e; b != e; ++b)
            if (++counter > 1) cout << (*b).format(secondFormat);
        cout << endl;
        // Below: iterates through twice, maybe not ideal
        // string noFirst = regex_replace(line, r, firstFormat, format_first_only); //removes the first phone number
        // cout << regex_replace(noFirst, r, secondFormat, format_no_copy) << endl;
    }
}
However I am unhappy with the use of a counter to make sure I'm not processing the first match. It feels like there must be a more natural utility (like the format_first_only flag that can be passed to format, except in reverse) that makes it possible to ignore the first match? But I am struggling to find one.
The commented out solution seems a bit better except it requires a second iteration through the input.
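You could use the \G anchor. (Note that std::regex's default ECMAScript grammar does not support \G or \A, so this needs an engine that does, for example Boost.Regex or PCRE.)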
"(?:(?!\\A)\\G|.*?\\d{3}\\D*\\d{3}\\D*\\d{4}).*?(\\d{3})\\D*(\\d{3})\\D*(\\d{4})"
Then set secondFormat = "$1.$2.$3 "; and there is no need for a counter.
Formatted:
(?:
(?! \A ) # Not beginning of string
\G # End of previous match
| # or,
.*? # Anything up to
\d{3} \D* \d{3} \D* \d{4} # First phone number
)
.*? # Anything up to
( \d{3} ) # (1), Next phone number
\D*
( \d{3} ) # (2)
\D*
( \d{4} ) # (3)
Input:
dave: 050 000 0020, (402)2031032, (999) 999-2221
Output:
** Grp 0 - ( pos 0 , len 32 )
dave: 050 000 0020, (402)2031032
** Grp 1 - ( pos 21 , len 3 )
402
** Grp 2 - ( pos 25 , len 3 )
203
** Grp 3 - ( pos 28 , len 4 )
1032
-------------------------------------
** Grp 0 - ( pos 32 , len 16 )
, (999) 999-2221
** Grp 1 - ( pos 35 , len 3 )
999
** Grp 2 - ( pos 40 , len 3 )
999
** Grp 3 - ( pos 44 , len 4 )
2221
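Since std::regex cannot run this pattern, here is a sketch in Java, whose regex engine supports both \G and \A. The class SkipFirstNumber and the method secondAndLater are hypothetical names; the pattern and the first input line are the ones shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkipFirstNumber {
    // The \G-based pattern from the answer, written as a Java string literal.
    static final Pattern P = Pattern.compile(
            "(?:(?!\\A)\\G|.*?\\d{3}\\D*\\d{3}\\D*\\d{4})"
            + ".*?(\\d{3})\\D*(\\d{3})\\D*(\\d{4})");

    // Formats the second and later phone numbers on a line as "ddd.ddd.dddd ".
    static String secondAndLater(String line) {
        StringBuilder out = new StringBuilder();
        Matcher m = P.matcher(line);
        // Each find() resumes where the previous match ended, which is what \G anchors to.
        while (m.find()) {
            out.append(m.group(1)).append('.')
               .append(m.group(2)).append('.')
               .append(m.group(3)).append(' ');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(secondAndLater("dave: 050 000 0020, (402)2031032, (999) 999-2221"));
    }
}
```

The first match swallows the first phone number without capturing it, so no counter is needed; a line with only one number produces no match at all.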
How about changing the regex to something like (?<=\P, *)(\P), where \P is shorthand for a regex that matches a phone number? In other words, you are interested only in phone numbers that follow a previous phone number.
The only problem with this suggestion is that C++ (std::regex) doesn't appear to support positive look-behind.
(Note: you don't want all the captures in the first phone number.)

Iterate through captures with boost::regex

I have a regular expression to capture three fields in a HTML tag using boost::regex
"\\/\\/(.{1,3}?)\\.wikipedia\\.[a-z]+\\/wiki\\/(.*?)\\s*>(.*?)<"
So, from
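<a href="//de.wikipedia.org/wiki/Porky%E2%80%99s" title="Porky’s – German" lang="de" hreflang="de">Deutsch</a>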
I get
de
Porky%E2%80%99s" title="Porky’s – German" lang="de" hreflang="de"
Deutsch
But I'd like to have {de, Porky%E2%80%99s, Deutsch} instead.
How can I make my regex stop matching the second field as soon as it finds the first whitespace character?
I tried
"\\/\\/(.{1,3}?)\\.wikipedia\\.[a-z]+\\/wiki\\/(\\S*?)*>(.*?)<"
so that the second field matches everything but whitespace, but I get this crash report:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error> >'
what(): Ran out of stack space trying to match the regular expression.
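The crash most likely comes from (\S*?)*: the group can match the empty string and is itself repeated by *, so the engine can keep recursing without making progress.
This might work -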
"//(.{1,3}?)\\.wikipedia\\.[a-z]+/wiki/([^\\s>\"]*).*?>(.*?)<"
I would use this instead -
"//(.{1,3}?)\\.wikipedia\\.[a-z]+/wiki/([^\\s>\"]*)[^>]*>(.*?)<"
Formatted:
//
( .{1,3}? ) # (1)
\.
wikipedia
\.
[a-z]+
/wiki/
( [^\s>"]* ) # (2)
[^>]*
>
( .*? ) # (3)
<
Output:
** Grp 0 - ( pos 9 , len 98 )
//de.wikipedia.org/wiki/Porky%E2%80%99s" title="Porky’s – German" lang="de" hreflang="de">Deutsch<
** Grp 1 - ( pos 11 , len 2 )
de
** Grp 2 - ( pos 33 , len 15 )
Porky%E2%80%99s
** Grp 3 - ( pos 99 , len 7 )
Deutsch
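The second suggested pattern can be checked with a short Java sketch. The class WikiLink and the helper fields are hypothetical, and the input is an assumption: an anchor tag rebuilt from the match shown above, with the typographic characters in the title replaced by ASCII.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiLink {
    // The second suggested pattern, written as a Java string literal.
    static final Pattern P = Pattern.compile(
            "//(.{1,3}?)\\.wikipedia\\.[a-z]+/wiki/([^\\s>\"]*)[^>]*>(.*?)<");

    // Returns {language, page, link text}, or null if no link is found.
    static String[] fields(String html) {
        Matcher m = P.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        // Assumed input, reconstructed from the match output above (title in plain ASCII).
        String html = "<a href=\"//de.wikipedia.org/wiki/Porky%E2%80%99s\""
                + " title=\"Porky's - German\" lang=\"de\" hreflang=\"de\">Deutsch</a>";
        System.out.println(String.join(", ", fields(html)));
    }
}
```

The negated class [^\s>"]* stops the second field at the closing quote of the href, and [^>]* then skips the remaining attributes without any nested quantifier.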

Regex fails to extract a double parameter substring from a string

I am trying to use the Regex library tools to extract double and integer parameters from a text file. Here is a minimal code that captures the 'std::regex_error' message I've been getting:
#include <iostream>
#include <string>
#include <regex>
int main()
{
    std::string My_String = "delta = -002.050";
    std::smatch Match;
    std::regex Base("/^[0-9]+(\\.[0-9]+)?$");
    std::regex_match(My_String, Match, Base);
    std::ssub_match Sub_Match = Match[1];
    std::string Sub_String = Sub_Match.str();
    std::cout << Sub_String << std::endl;
    return 0;
}
I am not very familiar with the Regex library and couldn't find anything immediately useful. Any idea what causes this error message? To compile my code, I use g++ with -std=c++11 enabled. However, I am sure that the problem is not caused by my g++ compiler, as suggested in the answers given to this earlier question (I tried several g++ compilers here).
I expect to get "-002.050" from the string "delta = -002.050", but I get:
terminate called after throwing an instance of 'std::regex_error'
what(): regex_error
Abort
Assuming you have GCC 4.9 or later (older versions do not ship with a libstdc++ version that supports <regex>), you can get the desired result by changing your regex to
std::regex Base("[0-9]+(\\.[0-9]+)?");
This will capture the fractional part of the floating point number in the input, along with the decimal point.
There are a couple of problems with your original regex. I think the leading / is an error. You're also trying to match the entire string by enclosing the regular expression in ^...$, which is clearly not what you want.
Finally, since you only want to match part of the input string, and not the entire thing, you need to use regex_search instead of regex_match.
std::regex Base(R"([0-9]+(\.[0-9]+)?)"); // use raw string literals to avoid
                                         // having to escape backslashes
if (std::regex_search(My_String, Match, Base)) {
    std::ssub_match Sub_Match = Match[1];
    std::string Sub_String = Sub_Match.str();
    std::cout << Sub_String << std::endl;
}
You said you expect to get "-002.050" from the string "delta = -002.050".
To do that, modify the regex in the example above to
std::regex Base(R"(([+-]{0,1}[0-9]+\.[0-9]+))");
The above will match a single, optional, leading + or - sign.
The leading forward slash doesn't look right. Also, it looks like you are trying to match an entire line, due to the leading ^ and trailing $, but I'm not really sure that is what you want. Also, your expression isn't matching the negative sign.
Try this (use it with regex_search, and read the whole number from Match[0]):
std::regex Base("-?[0-9]+(\\.[0-9]+)?$");
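I think you are getting an error because the contents of the smatch object
are not valid when there is no match.
To avoid this, you have to check for a match first.
Beyond that, a general regex is shown below (note that it uses a lookbehind, which std::regex's ECMAScript grammar does not support, so it needs an engine such as Boost.Regex):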
# "(?<![-.\\d])(?=[-.\\d]*\\d)(-?\\d*)(\\.\\d*)?(?![-.\\d])"
(?<! [-.\d] ) # Lookbehind, not these chars in behind
# This won't match like -'-3.44'
# Remove if not needed
(?= [-.\d]* \d ) # Lookahead, subject has to contain a digit
# Here, all the parts of a valid number are
# in front, now just define an arbitrary form
# to pick them out.
# Note - the form is all optional, let the engine
# choose what to match.
# -----------------
( -? \d* ) # (1), Required group before decimal, can be empty
( \. \d* )? # (2), Optional group, can be null
# change to (\.\d*) if decimal required
(?! [-.\d] ) # Lookahead, not these chars in front
# This won't match like '3.44'.66
# Remove if not needed
Sample output:
** Grp 0 - ( pos 9 , len 8 )
-002.050
** Grp 1 - ( pos 9 , len 4 )
-002
** Grp 2 - ( pos 13 , len 4 )
.050
-----------------
** Grp 0 - ( pos 28 , len 3 )
.65
** Grp 1 - ( pos 28 , len 0 ) EMPTY
** Grp 2 - ( pos 28 , len 3 )
.65
-----------------
** Grp 0 - ( pos 33 , len 4 )
1.00
** Grp 1 - ( pos 33 , len 1 )
1
** Grp 2 - ( pos 34 , len 3 )
.00
-----------------
** Grp 0 - ( pos 39 , len 4 )
9999
** Grp 1 - ( pos 39 , len 4 )
9999
** Grp 2 - NULL
-----------------
** Grp 0 - ( pos 104 , len 4 )
-99.
** Grp 1 - ( pos 104 , len 3 )
-99
** Grp 2 - ( pos 107 , len 1 )
.
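Because the general regex relies on a lookbehind, the following sketch uses Java, whose engine supports it. The class NumberScan, the method firstNumber, and all inputs except "delta = -002.050" are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberScan {
    // The general number pattern from the answer, written as a Java string literal.
    static final Pattern P = Pattern.compile(
            "(?<![-.\\d])(?=[-.\\d]*\\d)(-?\\d*)(\\.\\d*)?(?![-.\\d])");

    // Returns the first number found in text, or null if there is none.
    static String firstNumber(String text) {
        Matcher m = P.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("delta = -002.050"));
        System.out.println(firstNumber("x = .65"));
        System.out.println(firstNumber("no digits here"));
    }
}
```

Checking the result of find() before reading the match is the same guard the answer recommends for std::smatch.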

Regex pattern works in javascript but fails in scala with PatternSyntaxException: Unclosed character class

Here is the regex:
ws(s)?://([0-9\.a-zA-Z\-_]+):([\d]+)([/([0-9\.a-zA-Z\-_]+)?
Here is a test pattern:
wss://beta5.max.com:18989/abcde.html
softlion.com likes it:
Test results
Match count: 1
Global matches:
wss://beta5.max.com:18989/abcde.html
Value of each capturing group:
0 1 2 3 4
wss://beta5.max.com:18989/abcde.html s beta5.max.com 18989 /abcde.html
scala does not:
val regex = """ws(s)?://([0-9\.a-zA-Z\-_]+):([\d]+)([/([0-9\.a-zA-Z\-_]+)?""".r
Exception in thread "main" java.util.regex.PatternSyntaxException: Unclosed character class near index 58
ws(s)?://([0-9\.a-zA-Z\-_]+):([\d]+)([/([0-9\.a-zA-Z\-_]+)?
My bad, I had an extra [ at the front of the last capturing group.
([/([0-9.a-zA-Z-_]+)?
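Java allows nested character classes (unions and intersections), so the second [ opens a nested class and the outer class is never closed, hence the error: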
ws
( s )?
://
( [0-9\.a-zA-Z\-_]+ )
:
( [\d]+ )
= ( <-- Unbalanced '('
= [ <-- Unbalanced '['
/
( [0-9\.a-zA-Z\-_]+ )?
With every other engine it's no problem:
ws
( s )? # (1)
://
( [0-9\.a-zA-Z\-_]+ ) # (2)
:
( [\d]+ ) # (3)
( [/([0-9\.a-zA-Z\-_]+ )? # (4)
So, it's good to see (know) that the original regex is not what you thought it was.
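The difference can be reproduced directly against java.util.regex, which Scala uses. This is a sketch under one assumption: the ESCAPED variant is not the asker's final fix, it only escapes the inner [ so that Java reads the last class the way JavaScript does. The class NestedClassDemo and the helper compiles are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class NestedClassDemo {
    // The pattern exactly as posted in the question.
    static final String POSTED =
            "ws(s)?://([0-9\\.a-zA-Z\\-_]+):([\\d]+)([/([0-9\\.a-zA-Z\\-_]+)?";

    // Assumption: escaping the inner '[' gives one class containing a literal '[',
    // which mirrors how JavaScript parses the posted pattern.
    static final String ESCAPED =
            "ws(s)?://([0-9\\.a-zA-Z\\-_]+):([\\d]+)([/(\\[0-9\\.a-zA-Z\\-_]+)?";

    // True if java.util.regex accepts the pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("posted compiles:  " + compiles(POSTED));
        System.out.println("escaped compiles: " + compiles(ESCAPED));
        Matcher m = Pattern.compile(ESCAPED).matcher("wss://beta5.max.com:18989/abcde.html");
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                System.out.println(i + ": " + m.group(i));
            }
        }
    }
}
```

With the inner [ escaped, the four groups come out as in the JavaScript test results quoted in the question; a cleaner class without the stray ( and [ is probably what was intended, but that is a guess.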