Iterate through captures with boost::regex - c++

I have a regular expression to capture three fields in an HTML tag using boost::regex:
"\\/\\/(.{1,3}?)\\.wikipedia\\.[a-z]+\\/wiki\\/(.*?)\\s*>(.*?)<"
So, from
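<a href="//de.wikipedia.org/wiki/Porky%E2%80%99s" title="Porky’s – German" lang="de" hreflang="de">Deutsch</a>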
I get
de
Porky%E2%80%99s" title="Porky’s – German" lang="de" hreflang="de"
Deutsch
But I'd like to have {de, Porky%E2%80%99s, Deutsch} instead.
How can I make my regex stop matching the second field as soon as it finds the first whitespace character?
I tried
"\\/\\/(.{1,3}?)\\.wikipedia\\.[a-z]+\\/wiki\\/(\\S*?)*>(.*?)<"
so that the second field matches anything but whitespace, but I get this crash report:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error> >'
what(): Ran out of stack space trying to match the regular expression.

This might work -
"//(.{1,3}?)\\.wikipedia\\.[a-z]+/wiki/([^\\s>\"]*).*?>(.*?)<"
I would use this instead -
"//(.{1,3}?)\\.wikipedia\\.[a-z]+/wiki/([^\\s>\"]*)[^>]*>(.*?)<"
Formatted:
//
( .{1,3}? ) # (1)
\.
wikipedia
\.
[a-z]+
/wiki/
( [^\s>"]* ) # (2)
[^>]*
>
( .*? ) # (3)
<
Output:
** Grp 0 - ( pos 9 , len 98 )
//de.wikipedia.org/wiki/Porky%E2%80%99s" title="Porky’s – German" lang="de" hreflang="de">Deutsch<
** Grp 1 - ( pos 11 , len 2 )
de
** Grp 2 - ( pos 33 , len 15 )
Porky%E2%80%99s
** Grp 3 - ( pos 99 , len 7 )
Deutsch
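As a rough cross-check of the second pattern, here is a minimal sketch that uses Python's re module instead of boost::regex (assuming both engines treat this particular syntax the same way). The input tag is reconstructed from the Grp 0 output above, and the names TAG, PATTERN and extract_fields are made up for illustration.

```python
import re

# Input reconstructed from the Grp 0 output above (an assumption, not quoted
# from the question).
TAG = ('<a href="//de.wikipedia.org/wiki/Porky%E2%80%99s" '
       'title="Porky’s – German" lang="de" hreflang="de">Deutsch</a>')

# Group 2 stops at the first whitespace, '>' or '"'; [^>]* then skips the
# remaining attributes without any nested quantifier.
PATTERN = re.compile(r'//(.{1,3}?)\.wikipedia\.[a-z]+/wiki/([^\s>"]*)[^>]*>(.*?)<')


def extract_fields(html):
    """Return (language, page, link text) for the first match, or None."""
    m = PATTERN.search(html)
    return m.groups() if m else None


print(extract_fields(TAG))
```

The negated character class in group 2 is what replaces the (\S*?)* construct from the question, so the engine has a single way to consume each character.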

Related

What is this regex expression doing?

What does this expression mean?
Pattern.compile("^.*(?=.*\\d).*$", Pattern.CASE_INSENSITIVE | Pattern.COMMENTS)
I tried to split the expression into its parts, but could not work out its meaning. Please help me with this.
From regex101.com:
TL;DR:
Matches any String that contains at least one digit (characters '0' to '9').
As a side note I'd like to point out that this is a horrendous way to do so, and could be replaced by the following:
Pattern.compile("\\d");
I basically removed all the pointless greedy fillers and the useless anchors. Use this regex with the Matcher#find() method, not Matcher#matches().
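To illustrate the find() versus matches() point, here is a small sketch in Python, where re.search plays the role of Matcher#find() and fullmatch the role of Matcher#matches(). That mapping and the helper names are my own assumptions; the samples are single-line ASCII strings only.

```python
import re

# The original pattern, applied to the whole string (like Matcher#matches()).
FULL = re.compile(r'^.*(?=.*\d).*$', re.IGNORECASE)
# The simplified pattern, searched for anywhere (like Matcher#find()).
DIGIT = re.compile(r'\d')


def has_digit_full(text):
    return FULL.fullmatch(text) is not None


def has_digit_search(text):
    return DIGIT.search(text) is not None


for sample in ["abc", "abc1", "12hh34ddd567uuu", ""]:
    # Both checks agree on these single-line samples.
    print(repr(sample), has_digit_full(sample), has_digit_search(sample))
```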
There are two parts to this regex.
1. The part up to (but not including) the digit.
2. The part from the digit to the end of the string.
The regex is processed left to right.
The first thing it sees is .*. This tells it to go directly to the
end of the string and start backtracking from there to satisfy
the next thing it sees, which is (?=.*\d).
In that assertion the .* ends up matching nothing, because the previous .*
has already consumed as much of the string as it can.
So the search progresses (using the assertion) to the left until it finds a
position where a digit is directly in front of the current position.
Once that is found, the final .* matches that digit and everything after it up to the end of
the string. This is part 2 described above.
Visually, it can be seen if you add some capture groups, and test it on some
real input.
^
( .* ) # (1)
(?=
( .* ) # (2)
( \d ) # (3)
)
( .* ) # (4)
$
Output:
** Grp 0 - ( pos 0 , len 15 )
12hh34ddd567uuu
** Grp 1 - ( pos 0 , len 11 )
12hh34ddd56
** Grp 2 - ( pos 11 , len 0 ) EMPTY
** Grp 3 - ( pos 11 , len 1 )
7
** Grp 4 - ( pos 11 , len 4 )
7uuu
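The same experiment can be repeated with a few lines of Python (a sketch only; Python's re is assumed to backtrack the same way as the Java engine for this pattern, and GROUPED is just an illustrative name):

```python
import re

# The question's pattern with the extra capture groups from the listing above.
GROUPED = re.compile(r'^(.*)(?=(.*)(\d))(.*)$')

m = GROUPED.match("12hh34ddd567uuu")
for i in range(1, 5):
    # span() gives the (start, end) offsets of each group.
    print(i, m.span(i), repr(m.group(i)))
```

Group 2 comes out empty at offset 11, which shows that the .* inside the assertion contributes nothing once the first .* has backed up to the last digit.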

Regex to extract switches /switch=value

I have a batch file that I need to extract switches from.
The switches are in this format.
/Switch1=Value1 /Switch2="Value 2" /Switch3 /Switch4="C:\Program Files\DIR"
I need Switch=Value, or just Switch when it has no value (e.g. Switch3), extracted.
I am a beginner at regex. So far I have tried the expression \/\w+=|\/\w+, but that doesn't give me the value.
Seems like you want this,
\/\w+(?:=(?:(["'])(?:(?!\1).)*\1|\S+))?
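A quick way to try that pattern on the sample line, sketched in Python (SWITCH, LINE and find_switches are names I made up; the pattern is the one above with the optional escaping of the slash dropped):

```python
import re

# A switch name, optionally followed by =value, where the value is either
# quoted (single or double quotes) or a run of non-whitespace characters.
SWITCH = re.compile(r'''/\w+(?:=(?:(["'])(?:(?!\1).)*\1|\S+))?''')

LINE = r'/Switch1=Value1 /Switch2="Value 2" /Switch3 /Switch4="C:\Program Files\DIR"'


def find_switches(line):
    """Return every /Switch or /Switch=Value token, leading slash included."""
    return [m.group(0) for m in SWITCH.finditer(line)]


for token in find_switches(LINE):
    print(token)
```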
Not much information, but here is something in perl to get you going:
perl -ne 'print "$1=$3\n" while /\/(\w+)(=((\"[^"]*\")|\S+))?/g;'
You can use a lookbehind for "Switch.=" and a lookahead for the next slash (the quantifier has to be lazy for that). You will have to trim the values afterwards, but you get the values:
(?<=Switch.=).+?(?=/)
It can get hairy to parse a command line with switches.
Something like below.
# /([^ =]+)(?:=(?|"((?:[^"\\]*(?:\\.|[^"\\]*)*))"|([^ ]*)))?
/
( [^ =]+ ) # (1)
(?:
=
(?|
"
( # (2 start)
(?:
[^"\\]*
(?:
\\ .
|
[^"\\]*
)*
)
) # (2 end)
"
|
( [^ ]* ) # (2)
)
)?
Output
** Grp 0 - ( pos 0 , len 15 )
/Switch1=Value1
** Grp 1 - ( pos 1 , len 7 )
Switch1
** Grp 2 - ( pos 9 , len 6 )
Value1
-------------------
** Grp 0 - ( pos 16 , len 18 )
/Switch2="Value 2"
** Grp 1 - ( pos 17 , len 7 )
Switch2
** Grp 2 - ( pos 26 , len 7 )
Value 2
-------------------
** Grp 0 - ( pos 35 , len 8 )
/Switch3
** Grp 1 - ( pos 36 , len 7 )
Switch3
** Grp 2 - NULL
-------------------
** Grp 0 - ( pos 44 , len 31 )
/Switch4="C:\Program Files\DIR"
** Grp 1 - ( pos 45 , len 7 )
Switch4
** Grp 2 - ( pos 54 , len 20 )
C:\Program Files\DIR
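For engines without the branch reset group (?|...), the same idea can be approximated with two ordinary groups that are merged afterwards. The sketch below uses Python's re, which lacks branch reset; the quoted-value sub-pattern is simplified to (?:[^"\\]|\\.)*, and the names are illustrative only.

```python
import re

# Quoted and unquoted values get separate groups (2 and 3) because Python's
# re has no branch reset; they are merged in parse_switches().
SWITCH = re.compile(r'/([^ =]+)(?:=(?:"((?:[^"\\]|\\.)*)"|([^ ]*)))?')

LINE = r'/Switch1=Value1 /Switch2="Value 2" /Switch3 /Switch4="C:\Program Files\DIR"'


def parse_switches(line):
    """Return (name, value) pairs; value is None for a bare switch."""
    pairs = []
    for m in SWITCH.finditer(line):
        quoted, bare = m.group(2), m.group(3)
        pairs.append((m.group(1), quoted if quoted is not None else bare))
    return pairs


for name, value in parse_switches(LINE):
    print(name, value)
```

As in the output above, the surrounding quotes are not part of the captured value.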

Regex fails to extract a double parameter substring from a string

I am trying to use the Regex library tools to extract double and integer parameters from a text file. Here is a minimal example that reproduces the 'std::regex_error' message I've been getting:
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string My_String = "delta = -002.050";
std::smatch Match;
std::regex Base("/^[0-9]+(\\.[0-9]+)?$");
std::regex_match(My_String,Match,Base);
std::ssub_match Sub_Match = Match[1];
std::string Sub_String = Sub_Match.str();
std::cout << Sub_String << std::endl;
return 0;
}
I am not very familiar with the Regex library and couldn't find anything immediately useful. Any idea what causes this error message? To compile my code, I use g++ with -std=c++11 enabled. However, I am sure that the problem is not caused by my g++ compiler, as suggested in the answers given to this earlier question (I tried several g++ compilers here).
I expect to get "-002.050" from the string "delta = -002.050", but I get:
terminate called after throwing an instance of 'std::regex_error'
what(): regex_error
Abort
Assuming you have gcc 4.9 or later (older versions do not ship with a libstdc++ version that supports <regex>), you can get the desired result by changing your regex to
std::regex Base("[0-9]+(\\.[0-9]+)?");
This will capture the fractional part of the floating point number in the input, along with the decimal point.
There are a couple of problems with your original regex. I think the leading / is an error. You are also trying to match the entire string by enclosing the regular expression in ^...$, which is clearly not what you want.
Finally, since you only want to match part of the input string, and not the entire thing, you need to use regex_search instead of regex_match.
std::regex Base(R"([0-9]+(\.[0-9]+)?)"); // use raw string literals to avoid
// having to escape backslashes
if(std::regex_search(My_String,Match,Base)) {
std::ssub_match Sub_Match = Match[1];
std::string Sub_String = Sub_Match.str();
std::cout << Sub_String << std::endl;
}
I expect to get "-002.050" from the string "delta = -002.050"
To do that, modify the regex in the example above to
std::regex Base(R"(([+-]{0,1}[0-9]+\.[0-9]+))");
The above will match a single, optional, leading + or - sign.
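The difference between matching the whole string and searching inside it, and what group 1 holds for each pattern, can be checked quickly. This is a sketch in Python rather than C++, with fullmatch and search standing in for std::regex_match and std::regex_search (an assumed correspondence); the variable names are mine.

```python
import re

TEXT = "delta = -002.050"

# Pattern from the first suggestion: group 1 is only the fractional part.
FRACTION = re.compile(r'[0-9]+(\.[0-9]+)?')
# Pattern with the optional sign: group 1 is the whole number.
SIGNED = re.compile(r'([+-]{0,1}[0-9]+\.[0-9]+)')

# Matching the whole string fails because of the "delta = " prefix.
print(FRACTION.fullmatch(TEXT))          # None
# Searching finds the number anywhere in the string.
print(FRACTION.search(TEXT).group(1))    # .050
print(SIGNED.search(TEXT).group(1))      # -002.050
```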
The leading forward slash doesn't look right. Also, the leading ^ and trailing $ make the expression match an entire line, and I'm not sure that is what you want. Finally, your expression doesn't match the negative sign.
Try this:
std::regex Base("-?[0-9]+(\\.[0-9]+)?$");
Use it with std::regex_search rather than std::regex_match, and read Match[0] to get the whole number.
I think you are getting an error because what is within the smatch object
is not valid.
To avoid this, check that there was a match before reading it.
Beyond that, a general regex is
# "(?<![-.\\d])(?=[-.\\d]*\\d)(-?\\d*)(\\.\\d*)?(?![-.\\d])"
(?<! [-.\d] ) # Lookbehind, not these chars in behind
# This won't match like -'-3.44'
# Remove if not needed
(?= [-.\d]* \d ) # Lookahead, subject has to contain a digit
# Here, all the parts of a valid number are
# in front, now just define an arbitrary form
# to pick them out.
# Note - the form is all optional, let the engine
# choose what to match.
# -----------------
( -? \d* ) # (1), Required group before decimal, can be empty
( \. \d* )? # (2), Optional group, can be null
# change to (\.\d*) if decimal required
(?! [-.\d] ) # Lookahead, not these chars in front
# This won't match like '3.44'.66
# Remove if not needed
Sample output:
** Grp 0 - ( pos 9 , len 8 )
-002.050
** Grp 1 - ( pos 9 , len 4 )
-002
** Grp 2 - ( pos 13 , len 4 )
.050
-----------------
** Grp 0 - ( pos 28 , len 3 )
.65
** Grp 1 - ( pos 28 , len 0 ) EMPTY
** Grp 2 - ( pos 28 , len 3 )
.65
-----------------
** Grp 0 - ( pos 33 , len 4 )
1.00
** Grp 1 - ( pos 33 , len 1 )
1
** Grp 2 - ( pos 34 , len 3 )
.00
-----------------
** Grp 0 - ( pos 39 , len 4 )
9999
** Grp 1 - ( pos 39 , len 4 )
9999
** Grp 2 - NULL
-----------------
** Grp 0 - ( pos 104 , len 4 )
-99.
** Grp 1 - ( pos 104 , len 3 )
-99
** Grp 2 - ( pos 107 , len 1 )
.
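Note that this general pattern relies on a lookbehind, which the default ECMAScript grammar of std::regex does not accept, so it needs an engine such as Boost or PCRE. To experiment with it, here is a sketch in Python; the sample string and the names NUMBER, SAMPLE and find_numbers are my own, not the input behind the output above.

```python
import re

NUMBER = re.compile(r'(?<![-.\d])(?=[-.\d]*\d)(-?\d*)(\.\d*)?(?![-.\d])')

# Made-up sample covering the shapes shown in the output above.
SAMPLE = "delta = -002.050 eps=.65 k=1.00 n=9999 m=-99."


def find_numbers(text):
    """Return (whole match, group 1, group 2) for every number-like token."""
    return [(m.group(0), m.group(1), m.group(2)) for m in NUMBER.finditer(text)]


for whole, before, after in find_numbers(SAMPLE):
    print(repr(whole), repr(before), repr(after))
```

The lookahead that demands a digit is what keeps the otherwise all-optional body from matching the empty string at every position.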

Parse Maven Filename

How can I parse a Maven filename into the artifact and version?
The filenames look like this:
test-file-12.2.2-SNAPSHOT.jar
test-lookup-1.0.16.jar
I need to get
test-file
12.2.2-SNAPSHOT
test-lookup
1.0.16
So the artifactId is the text before the first instance of a dash and a number and the version is the text after the first instance of a number up to .jar.
I could probably do it with split and several loops and checks but it feels like there should be a simpler way.
EDIT:
Actually, the regex wasn't as complicated as I thought!
new File("test").eachFile() { file ->
String fileName = file.name[0..file.name.lastIndexOf('.') - 1]
//Split at the first instance of a dash and a number
def split = fileName.split("-[\\d]")
String artifactId = split[0]
String version = fileName.substring(artifactId.length() + 1, fileName.length())
println(artifactId)
println(version)
}
EDIT2: Hmm. It fails on examples such as this:
http://mvnrepository.com/artifact/org.xhtmlrenderer/core-renderer/R8
core-renderer-R8.jar
Basically it's just this: ^(.+?)-(\d.*?)\.jar$
used in multi-line mode if there is more than one line.
^
( .+? )
-
( \d .*? )
\. jar
$
Output:
** Grp 0 - ( pos 0 , len 29 )
test-file-12.2.2-SNAPSHOT.jar
** Grp 1 - ( pos 0 , len 9 )
test-file
** Grp 2 - ( pos 10 , len 15 )
12.2.2-SNAPSHOT
--------------------------
** Grp 0 - ( pos 31 , len 22 )
test-lookup-1.0.16.jar
** Grp 1 - ( pos 31 , len 11 )
test-lookup
** Grp 2 - ( pos 43 , len 6 )
1.0.16
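A sketch of the same pattern in multi-line mode, written in Python rather than Groovy (MAVEN, NAMES and parse_names are illustrative names; the third filename is the one from EDIT2):

```python
import re

# Lazy artifact id, then a dash followed by the digit that starts the version.
MAVEN = re.compile(r'^(.+?)-(\d.*?)\.jar$', re.MULTILINE)

NAMES = "test-file-12.2.2-SNAPSHOT.jar\ntest-lookup-1.0.16.jar\ncore-renderer-R8.jar"


def parse_names(text):
    """Return (artifactId, version) for every line the pattern matches."""
    return [m.groups() for m in MAVEN.finditer(text)]


for artifact, version in parse_names(NAMES):
    print(artifact, version)
```

Names whose version does not start with a digit, such as core-renderer-R8.jar, are simply not matched by this pattern, so they still need separate handling.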

Regex to extract pattern from text

I have a string that contains a bunch of function calls within it. I need to extract every occurrence of the VariableSet function call. Functions can appear in any order. Here is an example:
parsedExpression = "VariableSet(b, 999)If(a = 0,"Black",SetColor(a,b,c))VariableSet("a" ,1.573) VariableSet( c,-2387)"
I need to find every match that starts with "VariableSet(" and ends with the first close parenthesis that follows it. So, for the example above, I need a list like this:
VariableSet(b, 999)
VariableSet("a" ,1.573)
VariableSet( c,-2387)
I planned to use the code below but I have not been able to determine the correct regex pattern. The best I could come up with is "VariableSet\(.*(?i:\)\b)" but it does not produce the list above.
Dim matches As MatchCollection = Regex.Matches(parsedExpression, "VariableSet\(.*(?i:\)\b)")
' Loop over matches.
For Each m As Match In matches
' Loop over captures.
For Each c As Capture In m.Captures
Dim varName As String = ""
Dim varValue As String = ""
Dim firstCommaPosition As Integer
'For every VariableSet that was found do the following:
'Parse the captured string to get the variable name and value
varName = c.Value.Replace("VariableSet(", "").Replace(")", "")
firstCommaPosition = varName.IndexOf(",")
varValue = varName.Substring(firstCommaPosition + 1)
varName = varName.Substring(0, firstCommaPosition).Replace("""", "")
'Set the variable
ce.Variables(varName) = ce.Evaluate(varValue)
'Remove this instance of VariableSet() function from parsedExpression
parsedExpression = parsedExpression.Replace(c.Value, "")
Next
Next
I would greatly appreciate it if someone could provide the correct regex pattern.
Maybe this will help you:
Dim strMatch As String = ""
Dim strVar1 As String = ""
Dim strVar2 As String = ""
Dim strExpression As String = "VariableSet(b, 999)If(a = 0,""Black"",SetColor(a,b,c))VariableSet(""a"" ,1.573) VariableSet( c,-2387)"
Dim rx As New RegularExpressions.Regex("VariableSet\((?<V1>.*?),(?<V2>.*?)\)", RegularExpressions.RegexOptions.IgnoreCase)
Dim rxMatch As RegularExpressions.MatchCollection = rx.Matches(strExpression)
For intI As Integer = 0 To rxMatch.Count - 1
strMatch = rxMatch(intI).Value 'VariableSet(b, 999)
strVar1 = rxMatch(intI).Groups("V1").ToString 'b
strVar2 = rxMatch(intI).Groups("V2").ToString ' 999
Next
VariableSet\([^)]*\) should be a direct replacement.
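A quick check of that replacement pattern against the example expression, sketched in Python instead of VB.NET (CALL and EXPR are names chosen for illustration):

```python
import re

# Everything from "VariableSet(" up to the first closing parenthesis.
CALL = re.compile(r'VariableSet\([^)]*\)')

EXPR = ('VariableSet(b, 999)If(a = 0,"Black",SetColor(a,b,c))'
        'VariableSet("a" ,1.573) VariableSet( c,-2387)')

for call in CALL.findall(EXPR):
    print(call)
```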
If you wanted to get fancy, all your code could be done using a single regex.
# VariableSet\((\s*"?\s*([^,")]*?)\s*"?\s*(?:,\s*"?\s*([^,")]*?)\s*"?\s*)?)\)
VariableSet
\( # Open paren
( # (1 start), Inside paren's
\s*
"? \s*
( [^,")]*? ) # (2), Var
\s*
"? \s*
(?:
, # Comma
\s*
"? \s*
( [^,")]*? ) # (3), Value
\s*
"? \s*
)?
) # (1 end)
\) # Close paren
Example input string:
VariableSet(b, 999)
VariableSet("a" ,1.573)
VariableSet( c,-2387)
VariableSet( , 999)
VariableSet( "aadsfasdf")
VariableSet( )
Output matches ( Var / Value ):
** Grp 2 - ( pos 12 , len 1 )
b
** Grp 3 - ( pos 16 , len 3 )
999
----------------
** Grp 2 - ( pos 35 , len 1 )
a
** Grp 3 - ( pos 40 , len 5 )
1.573
----------------
** Grp 2 - ( pos 63 , len 1 )
c
** Grp 3 - ( pos 65 , len 5 )
-2387
----------------
** Grp 2 - ( pos 86 , len 0 ) EMPTY
** Grp 3 - ( pos 88 , len 3 )
999
----------------
** Grp 2 - ( pos 108 , len 9 )
aadsfasdf
** Grp 3 - NULL
----------------
** Grp 2 - ( pos 136 , len 0 ) EMPTY
** Grp 3 - NULL
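To see the Var / Value groups without the manual string surgery, the single regex can be exercised like this. It is a sketch in Python rather than VB.NET, assuming the two engines agree on this syntax; ASSIGN, EXPR and parse_assignments are hypothetical names.

```python
import re

# Group 2 is the variable name, group 3 the value (None when there is no comma).
ASSIGN = re.compile(
    r'VariableSet\((\s*"?\s*([^,")]*?)\s*"?\s*(?:,\s*"?\s*([^,")]*?)\s*"?\s*)?)\)'
)

EXPR = ('VariableSet(b, 999)If(a = 0,"Black",SetColor(a,b,c))'
        'VariableSet("a" ,1.573) VariableSet( c,-2387)')


def parse_assignments(expression):
    """Return (name, value) for every VariableSet(...) call found."""
    return [(m.group(2), m.group(3)) for m in ASSIGN.finditer(expression)]


for name, value in parse_assignments(EXPR):
    print(name, value)
```

Surrounding quotes and whitespace are consumed outside the groups, so the name and value come back already trimmed.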