I use a regexp to test a link :
lolspec:\/\/(spectator\.(na|euw1|eu|kr|oc1|br|la1|la2|ru|tr|pbe1)\.lol\.riotgames\.com:(80|8088)((([?&]region=(NA1|EUW1|EUN1|KR|OC1|BR1|LA1|LA2|RU|TR1|PBE1))|([?&]gameID=([0-9]+))|([?&]encKey=(.+)))){3})
to test this link :
lolspec://spectator.euw1.lol.riotgames.com:80?region=NA1&gameID=44584&encKey=fghgdsv1134+ianfcuia
but some groups are empty (#7, #8, #9)
what should I do ?
Probably overkill on the capture groups.
The regex you use there contains a container capture group 4 that is quantified
like this ( ... ){3}.
What that will do is overwrite the container capture buffer 3 times leaving
the last value captured within the capture group.
Moving on to the next level is a single inner group with which the outer group encapsulates, like this (( ... )){3} so thats not needed, and you get the same affect of overwritting.
Moving even deeper, are three capture groups all separated by alternations.
They follow the same rules, each one will get overwritten if they can match
again during each successive 1..3 quantified passes.
Its only that one group match in the alternation cluster.
So, if you have identical adjacent data, it could be matched by the same
alternation cluster, leaving the other cluster groups empty.
So, this is not the approach if you want to match out-of-order parameters
in a string.
The way this is done is using lookahead assertions OR if you are using
an engine that can do conditionals.
The way to do it using conditionals is like this
(?:
.*?
(?:
( (?(1)(?!)) abc ) # (1)
| ( (?(2)(?!)) def ) # (2)
| ( (?(3)(?!)) ghi ) # (3)
)
){3}
It forces finding all of the capture group contents.
The way you are doing it is the same but without the conditionals,
and suffering the consequences as stated above.
Btw, Your regex above does not have any empty groups with that particular sample, But it has many problems.
lolspec:\/\/
( # (1 start)
spectator\.
( na | euw1 | eu | kr | oc1 | br | la1 | la2 | ru | tr | pbe1 ) # (2)
\.lol\.riotgames\.com:
( 80 | 8088 ) # (3)
( # (4 start)
( # (5 start)
( # (6 start)
[?&] region=
( NA1 | EUW1 | EUN1 | KR | OC1 | BR1 | LA1 | LA2 | RU | TR1 | PBE1 ) # (7)
) # (6 end)
| ( # (8 start)
[?&] gameID=
( [0-9]+ ) # (9)
) # (8 end)
| ( # (10 start)
[?&] encKey=
( .+ ) # (11)
) # (10 end)
) # (5 end)
){3} # (4 end)
) # (1 end)
Output
** Grp 0 - ( pos 0 , len 97 )
lolspec://spectator.euw1.lol.riotgames.com:80?region=NA1&gameID=44584&encKey=fghgdsv1134+ianfcuia
** Grp 1 - ( pos 10 , len 87 )
spectator.euw1.lol.riotgames.com:80?region=NA1&gameID=44584&encKey=fghgdsv1134+ianfcuia
** Grp 2 - ( pos 20 , len 4 )
euw1
** Grp 3 - ( pos 43 , len 2 )
80
** Grp 4 - ( pos 69 , len 28 )
&encKey=fghgdsv1134+ianfcuia
** Grp 5 - ( pos 69 , len 28 )
&encKey=fghgdsv1134+ianfcuia
** Grp 6 - ( pos 45 , len 11 )
?region=NA1
** Grp 7 - ( pos 53 , len 3 )
NA1
** Grp 8 - ( pos 56 , len 13 )
&gameID=44584
** Grp 9 - ( pos 64 , len 5 )
44584
** Grp 10 - ( pos 69 , len 28 )
&encKey=fghgdsv1134+ianfcuia
** Grp 11 - ( pos 77 , len 20 )
fghgdsv1134+ianfcuia
Related
I have the following code, originally programmed using C++11's regular expressions library (#include <regex>) but now using Boost in an attempt to troubleshoot:
boost::regex reg(R"(.*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z]*[0-9]+[a-z0-9]*)).*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z]+)).*?(\d+).*?((?:[a-z][a-z]+)))", boost::regex::icase);
boost::cmatch matches;
if (boost::regex_match(request, reg) && matches.size() > 1)
{
printf("Match found");
}
else
{
printf("No match.");
}
When executed, this code seems to "freeze" on boost::regex_match(request, reg), as if it's taking a long time to process. I waited five minutes for it to process (in case this is a processing issue) but the program state was the same.
I tested the STL's regex library version of the above code online on cpp.sh and onlinegdb, and it works flawlessly there. I then copied this code into a VC++ project, and the code freezes again:
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string request = "\\login\\\\challenge\\jRJkdflp3gvTzrwiQ3tyKSqnyppmaZog\\uniquenick\\Lament\\partnerid\\0\\response\\4767846ef255a88da9b10f7c923a1e6e\\port\\-14798\\productid\\11489\\gamename\\crysiswars\\namespaceid\\56\\sdkrevision\\3\\id\\1\\final\\";
std::regex reg(R"(.*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z]*[0-9]+[a-z0-9]*)).*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z]+)).*?(\d+).*?((?:[a-z][a-z]+)))", std::regex::icase);
std::smatch matches;
if (std::regex_search(request, matches, reg) && matches.size() > 1)
{
printf("Match found");
}
else
{
printf("No match.");
}
}
The string concerned is the following:
\login\challenge\jRJwdflp3gvTrrwiQ3tyKSqnyppmaZog\uniquenick\User\partnerid\0\response\4767846ef255a83da9b10f7f923a1e6e\port-14798\productid\11489\gamename\crysiswars\namespaceid\56\sdkrevision\3\id\1\final\
I tested the same code on a Visual Studio 2017 installation on another computer (brand new project) and get the exact same result... which seems to indicate that something that the compiler is doing is causing the code to freeze/take a long time processing. I am unable to test on another compiler locally at present.
The regular expression string checks out on regex101, so functionally the expression is OK.
This is with Visual Studio 2017 Professional targeting v141.
Why is this happening, and how can I fix it?
Your problem is one of backtracking.
In the boost sample, you use regex_match which forces a match on the
whole string.
You will get the same result if using regex_search and adding ^..$.
However, your string can never match because you have forced it to end
on a letter, but the string really ends with a backslash .
This forces the engine to retry all those .*? positions.
The fix is to put a final .*? at the end of your regex which will let
the regex fulfill it's mission of matching the entire string.
Other things may help, you could clean up your regex a bit and/or add some
atomic groups and/or add some slashes in place of those .*?
Anyway, use this :
^.*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z]*[0-9]+[a-z0-9]*)).*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z]+)).*?(\d+).*?((?:[a-z][a-z]+)).*?$
Output
** Grp 0 - ( pos 0 : len 207 )
\login\challenge\jRJwdflp3gvTrrwiQ3tyKSqnyppmaZog\uniquenick\User\partnerid\0\response\4767846ef255a83da9b10f7f923a1e6e\port-14798\productid\11489\gamename\crysiswars\namespaceid\56\sdkrevision\3\id\1\final\
** Grp 1 - ( pos 1 : len 5 )
login
** Grp 2 - ( pos 7 : len 9 )
challenge
** Grp 3 - ( pos 17 : len 32 )
jRJwdflp3gvTrrwiQ3tyKSqnyppmaZog
** Grp 4 - ( pos 50 : len 10 )
uniquenick
** Grp 5 - ( pos 61 : len 4 )
User
** Grp 6 - ( pos 66 : len 9 )
partnerid
** Grp 7 - ( pos 76 : len 1 )
0
** Grp 8 - ( pos 78 : len 8 )
response
** Grp 9 - ( pos 94 : len 25 )
ef255a83da9b10f7f923a1e6e
** Grp 10 - ( pos 120 : len 4 )
port
** Grp 11 - ( pos 125 : len 5 )
14798
** Grp 12 - ( pos 131 : len 9 )
productid
** Grp 13 - ( pos 141 : len 5 )
11489
** Grp 14 - ( pos 147 : len 8 )
gamename
** Grp 15 - ( pos 156 : len 10 )
crysiswars
** Grp 16 - ( pos 167 : len 11 )
namespaceid
** Grp 17 - ( pos 179 : len 2 )
56
** Grp 18 - ( pos 182 : len 11 )
sdkrevision
** Grp 19 - ( pos 194 : len 1 )
3
** Grp 20 - ( pos 196 : len 2 )
id
** Grp 21 - ( pos 199 : len 1 )
1
** Grp 22 - ( pos 201 : len 5 )
final
I am trying to write a regex to parse out seven match objects: four numbers and three operands:
Individual lines in the file look like this:
[ 9] -21 - ( 12) - ( -5) + ( -26) = ______
The number in brackets is the line number which will be ignored. I want the four integer values, (including the '-' if it is a negative integer), which in this case are -21, 12, -5 and -26. I also want the operands, which are -, - and +.
I will then take those values (match objects) and actually compute the answer:
-21 - 12 - -5 + -26 = -54
I have this:
[\s+0-9](-?[0-9]+)
In Pythex it grabs the [ 9] but it also then grabs every integer in separate match objects (four additional match objects). I don't know why it does that.
If I add a ? to the end: [\s+0-9](-?[0-9]+)? thinking it will only grab the first integer, it doesn't. I get seventeen matches?
I am trying to say, via the regex: Grab the line number and it's brackets (that part works), then grab the first integer including sign, then the operand, then the next integer including sign, then the next operand, etc.
It appears that I have failed to explain myself clearly.
The file has hundreds of lines. Here is a five line sample:
[ 1] 19 - ( 1) - ( 4) + ( 28) = ______
[ 2] -18 + ( 8) - ( 16) - ( 2) = ______
[ 3] -8 + ( 17) - ( 15) + ( -29) = ______
[ 4] -31 - ( -12) - ( -5) + ( -26) = ______
[ 5] -15 - ( 12) - ( 14) - ( 31) = ______
The operands are only '-' or '+', but any combination of those three may appear in a line. The integers will all be from -99 to 99, but that shouldn't matter if the regex works. The goal (as I see it) is to extract seven match objects: four integers and three operands, then add the numbers
exactly as they appear. The number in brackets is just the line number and plays no role in the computation.
Much luck with regex, if you just need the result:
import re
s="[ 9] -21 - ( 12) - ( -5) + ( -26) = ______"
s = s[s.find("]")+1:s.find("=")] # cut away line nr and = ...
if not re.sub( "[+-0123456789() ]*","",s): # weak attempt to prevent python code injection
print(eval(s))
else:
print("wonky chars inside, only numbers, +, - , space and () allowed.")
Output:
-54
Make sure to read the eval()
and have a look into:
https://opensourcehacker.com/2014/10/29/safe-evaluation-of-math-expressions-in-pure-python/
https://softwareengineering.stackexchange.com/questions/311507/why-are-eval-like-features-considered-evil-in-contrast-to-other-possibly-harmfu/311510
https://www.kevinlondon.com/2015/07/26/dangerous-python-functions.html
Example for hundreds of lines:
import re
s="[ 9] -21 - ( 12) - ( -5) + ( -26) = ______"
def calcIt(line):
s = line[line.find("]")+1:line.find("=")]
if not re.sub( "[+-0123456789() ]*","",s):
return(eval(s))
else:
print(line + " has wonky chars inside, only numbers, +, - , space and () allowed.")
return None
import random
random.seed(42)
pattern = "[ {}] -{} - ( {}) - ( -{}) + ( -{}) = "
for n in range(1000):
nums = [n]
nums.extend([ random.randint(0,100),random.randint(-100,100),random.randint(-100,100),
random.randint(-100,100)])
c = pattern.format(*nums)
print (c, calcIt(c))
Ahh... I had a cup of coffee and sat down in front of Pythex again.
I figured out the correct regex:
[\s+0-9]\s+(-?[0-9]+)\s+([-|+])\s+\(\s+(-?[0-9]+)\)\s+([-|+])\s+\(\s+(-?[0-9]+)\)\s+([-|+])\s+\(\s+(-?[0-9]+)\)
Yields:
-21
-
12
-
-5
+
-26
I'm looking for a function that returns a map[string]interface{} where interface{} can be a slice, a a map[string]interface{} or a value.
My use case is to parse WKT geometry like the following and retrieves point values; Example for a donut polygon:
POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))
The regex (I voluntary set \d that matches only integers for readability purpose):
(POLYGON \(
(?P<polygons>\(
(?P<points>(?P<point>(\d \d), ){3,})
(?P<last_point>\d \d )\),)*
(?P<last_polygon>\(
(?P<points>(?P<point>(\d \d), ){3,})
(?P<last_point>\d \d)\))\)
)
I have a function (copied from SO) that retrieves some informations but it's not that good for nested groups and list of groups:
func getRegexMatchParams(reg *regexp.Regexp, url string) (paramsMap map[string]string) {
match := reg.FindStringSubmatch(url)
paramsMap = make(map[string]string)
for i, name := range reg.SubexpNames() {
if i > 0 && i <= len(match) {
paramsMap[name] = match[i]
}
}
return match
}
It seems that the group point gets only 1 point.
example on playground
[EDIT] The result I want is something like this:
map[string]interface{}{
"polygons": map[string]interface{} {
"points": []interface{}{
{map[string]string{"point": "0 0"}},
{map[string]string{"point": "0 10"}},
{map[string]string{"point": "10 10"}},
{map[string]string{"point": "10 0"}},
},
"last_point": "0 0",
},
"last_polygon": map[string]interface{} {
"points": []interface{}{
{map[string]string{"point": "3 3"}},
{map[string]string{"point": "3 7"}},
{map[string]string{"point": "7 7"}},
{map[string]string{"point": "7 3"}},
},
"last_point": "3 3",
}
}
So I can use it further for different purposes like querying databases and validate that last_point = points[0] for each polygon.
Try to add some whitespace to the regex.
Also note that this engine won't retain all capture group values that are
within a quantified outer grouping like (a|b|c)+ where this group will only contain the last a or b or c it finds.
And, your regex can be reduced to this
(POLYGON\s*\((?P<polygons>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\)(?:\s*,\s*|\s*\)))+)
https://play.golang.org/p/rLaaEa_7GX
The original:
(POLYGON\s*\((?P<polygons>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\),)*(?P<last_polygon>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\))\s*\))
https://play.golang.org/p/rZgJYPDMzl
See below for what the groups contain.
( # (1 start)
POLYGON \s* \(
(?P<polygons> # (2 start)
\( \s*
(?P<points> # (3 start)
(?P<point> # (4 start)
\s*
( \d+ \s+ \d+ ) # (5)
\s*
,
){3,} # (4 end)
) # (3 end)
\s*
(?P<last_point> \d+ \s+ \d+ ) # (6)
\s* \),
)* # (2 end)
(?P<last_polygon> # (7 start)
\( \s*
(?P<points> # (8 start)
(?P<point> # (9 start)
\s*
( \d+ \s+ \d+ ) # (10)
\s*
,
){3,} # (9 end)
) # (8 end)
\s*
(?P<last_point> \d+ \s+ \d+ ) # (11)
\s* \)
) # (7 end)
\s* \)
) # (1 end)
Input
POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))
Output
** Grp 0 - ( pos 0 , len 65 )
POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))
** Grp 1 - ( pos 0 , len 65 )
POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))
** Grp 2 [polygons] - ( pos 9 , len 30 )
(0 0, 0 10, 10 10, 10 0, 0 0),
** Grp 3 [points] - ( pos 10 , len 23 )
0 0, 0 10, 10 10, 10 0,
** Grp 4 [point] - ( pos 27 , len 6 )
10 0,
** Grp 5 - ( pos 28 , len 4 )
10 0
** Grp 6 [last_point] - ( pos 34 , len 3 )
0 0
** Grp 7 [last_polygon] - ( pos 39 , len 25 )
(3 3, 3 7, 7 7, 7 3, 3 3)
** Grp 8 [points] - ( pos 40 , len 19 )
3 3, 3 7, 7 7, 7 3,
** Grp 9 [point] - ( pos 54 , len 5 )
7 3,
** Grp 10 - ( pos 55 , len 3 )
7 3
** Grp 11 [last_point] - ( pos 60 , len 3 )
3 3
Possible Solution
It's not impossible. It just takes a few extra steps.
(As an aside, isn't there a library for WKT that can parse this for you ?)
Now, I don't know your language capabilities, so this is just a general approach.
1. Validate the form you're parsing.
This will validate and return all polygon sets as a single string in All_Polygons group.
Target POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))
POLYGON\s*\((?P<All_Polygons>(?:\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))(?:\s*,\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))*)\s*\)
** Grp 1 [All_Polygons] - ( pos 9 , len 55 )
(0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3)
2. If 1 was successful, set up a loop match using the output of All_Polygons string.
Target (0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3)
(?:\(\s*(?P<Single_Poly_All_Pts>\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,})\s*\))
This step is equivalent of a find all type of match. It should match successive values of all the points of a single polygon, returned in Single_Poly_All_Pts group string.
This will give you these 2 separate matches, which can be put into a temp array having 2 value strings:
** Grp 1 [Single_Poly_All_Pts] - ( pos 1 , len 27 )
0 0, 0 10, 10 10, 10 0, 0 0
** Grp 1 [Single_Poly_All_Pts] - ( pos 31 , len 23 )
3 3, 3 7, 7 7, 7 3, 3 3
3. If 2 was successful, set up a loop match using the temp array output of step 2.
This will give you the individual points of each polygon.
(?P<Single_Point>\d+\s+\d+)
Again this is a loop match (or a find all type of match). For each array element
(Polygon), this will produce the individual points.
Target[element 1] 0 0, 0 10, 10 10, 10 0, 0 0
** Grp 1 [Single_Point] - ( pos 0 , len 3 )
0 0
** Grp 1 [Single_Point] - ( pos 5 , len 4 )
0 10
** Grp 1 [Single_Point] - ( pos 11 , len 5 )
10 10
** Grp 1 [Single_Point] - ( pos 18 , len 4 )
10 0
** Grp 1 [Single_Point] - ( pos 24 , len 3 )
0 0
And,
Target[element 2] 3 3, 3 7, 7 7, 7 3, 3 3
** Grp 1 [Single_Point] - ( pos 0 , len 3 )
3 3
** Grp 1 [Single_Point] - ( pos 5 , len 3 )
3 7
** Grp 1 [Single_Point] - ( pos 10 , len 3 )
7 7
** Grp 1 [Single_Point] - ( pos 15 , len 3 )
7 3
** Grp 1 [Single_Point] - ( pos 20 , len 3 )
3 3
I have a bunch of email subject lines and I'm trying to extract whether a range of values are present. This is how I'm trying to do it but am not getting the results I'd like:
library(stringi)
df1 <- data.frame(id = 1:5, string1 = NA)
df1$string1 <- c('15% off','25% off','35% off','45% off','55% off')
df1$pctOff10_20 <- stri_match_all_regex(df1$string1, '[10-20]%')
id string1 pctOff10_20
1 1 15% off NA
2 2 25% off NA
3 3 35% off NA
4 4 45% off NA
5 5 55% off NA
I'd like something like this:
id string1 pctOff10_20
1 1 15% off 1
2 2 25% off 0
3 3 35% off 0
4 4 45% off 0
5 5 55% off 0
Here is the way to go,
df1$pctOff10_20 <- stri_count_regex(df1$string1, '^(1\\d|20)%')
Explanation:
^ the beginning of the string
( group and capture to \1:
1 '1'
\d digits (0-9)
| OR
20 '20'
) end of \1
% '%'
1) strapply in gsubfn can do that by combining a regex (pattern= argument) and a function (FUN= argument). Below we use the formula representation of the function. Alternately we could make use of betweeen from data.table (or a number of other packages). This extracts the matches to the pattern, applies the function to it and returns the result simplifying it into a vector (rather than a list):
library(gsubfn)
btwn <- function(x, a, b) as.numeric(a <= as.numeric(x) & as.numeric(x) <= b)
transform(df1, pctOff10_20 =
strapply(
X = string1,
pattern = "\\d+",
FUN = ~ btwn(x, 10, 20),
simplify = TRUE
)
)
2) A base solution using the same btwn function defined above is:
transform(df1, pctOff10_20 = btwn(gsub("\\D", "", string1), 10, 20))
What I need from the string ((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS)))) is this:
"JJ", "RBJJ", "DTJJNNPNNPS", "JJCCRBJJ", "INDTJJNNPNNPS" "VBDJJCCRBJJINDTJJNNPNNPS"
that is, to find the text between innermost brackets, delete the immediately surrounding brackets so that the text can be combined and extracted. But this comprises of different levels. The uncovering of brackets can't be done all at once because the no, of brackets go out of balance:
str1<-c()
str2<-c()
library(gsubfn)
strr<-c("((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS))))")
repeat {
str1<-unlist(strapply(strr, "((\\(([A-Z])+\\))+)"))
str2<-append(str1, str2)
strr<-gsub("(\\(\\w+\\))", "~\\1~", strr)
strr<-gsub("~\\(|\\)~", "", strr)
if (strr == "") {break}
}
strr
[1] "(VBD(JJCCRBJJINDTJJNNPNNPS"
There are brackets left blocking combining of text which makes it escape the regex. The solution to this I think is, to differentiate between innermost brackets (JJ, RB, JJ, DT, JJ, NNP, NNPS, (2, 4, 5, 7 , 8 , 9 , 10 on the fresh string)) and inner brackets. So that when all the inner most brackets are uncovered step by step and the text combined and extracted, we will reach the whole string. Is there a regular expression to do this? Or is there any other way? Please help.
This doesn't use regexp. In fact, I'm not sure that regexp are powerful enough to solve the problem and that a parser is necessary. Rather than create/define a parser in R, I leverage the existing R code parser. Doing so uses some rather potentially dangerous tricks.
The basic idea is to turn the string into parsable code which generates a tree structure using lists. Then this structure is effectively reverse pruned (keeping only the leaf node inward), and the various strings at each level are created.
Some helper packages
library("plotrix")
library("plyr")
The original string that you gave
strr<-c("((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS))))")
Turn this string into parsable code, quoting what is inside the parentheses, and then making each set of parentheses a call to list. Commas have to be inserted between list items, but the innermost parts are always lists of length 1, so that isn't a problem. Then parse the code.
tmp <- gsub("\\(([^\\(\\)]*)\\)", '("\\1")', strr)
tmp <- gsub("\\(", "list(", tmp)
tmp <- gsub("\\)list", "),list", tmp)
tmp <- eval(parse(text=tmp))
At this point, tmp looks like
> str(tmp)
List of 3
$ :List of 1
..$ : chr "VBD"
$ :List of 3
..$ :List of 1
.. ..$ :List of 1
.. .. ..$ : chr "JJ"
..$ :List of 1
.. ..$ : chr "CC"
..$ :List of 2
.. ..$ :List of 1
.. .. ..$ : chr "RB"
.. ..$ :List of 1
.. .. ..$ : chr "JJ"
$ :List of 2
..$ :List of 1
.. ..$ : chr "IN"
..$ :List of 4
.. ..$ :List of 1
.. .. ..$ : chr "DT"
.. ..$ :List of 1
.. .. ..$ : chr "JJ"
.. ..$ :List of 1
.. .. ..$ : chr "NNP"
.. ..$ :List of 1
.. .. ..$ : chr "NNPS"
The nesting of parentheses is now nesting of lists. A few more helper functions are needed. The first collapses everything below a certain depth and throws away any node above that depth. The second is just a wrapper for paste to work one the elements of a list collectively.
atdepth <- function(l, d) {
if (d > 0 & !is.list(l)) {
return(NULL)
}
if (d == 0) {
return(unlist(l))
}
if (is.list(l)) {
llply(l, atdepth, d-1)
}
}
pastelist <- function(l) {paste(unlist(l), collapse="", sep="")}
Create a list where each element is the tree structure collapsed to a particular depth.
down <- llply(1:listDepth(tmp), atdepth, l=tmp)
Iterating backwards over this list, paste the leaf sets together. Work backwards "up" the (collapsed) trees. Doing this produces some blank strings (where there was a leaf higher up), so these are trimmed out.
out <- if (length(down) > 2) {
c(unlist(llply(length(down):3, function(i) {
unlist(do.call(llply, c(list(down[[i]]), replicate(i-3, llply), pastelist)))
})), unlist(pastelist(down[[2]])))
} else {
unlist(pastelist(down[[2]]))
}
out <- out[out != ""]
The result is what I think you asked for:
> out
[1] "JJ" "RBJJ"
[3] "DTJJNNPNNPS" "JJCCRBJJ"
[5] "INDTJJNNPNNPS" "VBDJJCCRBJJINDTJJNNPNNPS"
> dput(out)
c("JJ", "RBJJ", "DTJJNNPNNPS", "JJCCRBJJ", "INDTJJNNPNNPS", "VBDJJCCRBJJINDTJJNNPNNPS"
)
EDIT:
In response to a comment with a subsequent question: How to adapt this to process over a set of these strings.
The general approach to solving the do-it-multiple-times-for-different-inputs is to create a function which takes a single item as input and returns the associated single output. Then loop over the function with one of the apply family of functions.
Pulling together all the code from earlier into a single function:
parsestrr <- function(strr) {
atdepth <- function(l, d) {
if (d > 0 & !is.list(l)) {
return(NULL)
}
if (d == 0) {
return(unlist(l))
}
if (is.list(l)) {
llply(l, atdepth, d-1)
}
}
pastelist <- function(l) {paste(unlist(l), collapse="", sep="")}
tmp <- gsub("\\(([^\\(\\)]*)\\)", '("\\1")', strr)
tmp <- gsub("\\(", "list(", tmp)
tmp <- gsub("\\)list", "),list", tmp)
tmp <- eval(parse(text=tmp))
down <- llply(1:listDepth(tmp), atdepth, l=tmp)
out <- if (length(down) > 2) {
c(unlist(llply(length(down):3, function(i) {
unlist(do.call(llply, c(list(down[[i]]), replicate(i-3, llply), pastelist)))
})), unlist(pastelist(down[[2]])))
} else {
unlist(pastelist(down[[2]]))
}
out[out != ""]
}
Now given a vector of strings to process, say:
strrs<-c("((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS))))",
"((VBD)(((JJ))(CC)((RB)(XX)(JJ)))((IN)(BB)((DT)(JJ)(NNP)(NNPS))))",
"((VBD)(((JJ)(QQ))(CC)((RB)(JJ)))((IN)((TQR)(JJ)(NNPS))))")
You can process all of them with
llply(strr, parsestrr)
which returns
[[1]]
[1] "JJ" "RBJJ"
[3] "DTJJNNPNNPS" "JJCCRBJJ"
[5] "INDTJJNNPNNPS" "VBDJJCCRBJJINDTJJNNPNNPS"
[[2]]
[1] "JJ" "RBXXJJ"
[3] "DTJJNNPNNPS" "JJCCRBXXJJ"
[5] "INBBDTJJNNPNNPS" "VBDJJCCRBXXJJINBBDTJJNNPNNPS"
[[3]]
[1] "JJQQ" "RBJJ"
[3] "TQRJJNNPS" "JJQQCCRBJJ"
[5] "INTQRJJNNPS" "VBDJJQQCCRBJJINTQRJJNNPS"
I'm not sure if you just want to build a tree structure of balanced text or not.
Or, why you want to strip the containing parenthesis on the inner most level.
Using your example, if it is to be done in stages, the inner most level has to be initially determined. Then parenthesis stripped off in subsequent levels in recursive passes.
This of course requires a way to do balanced text. Some regex engines can do this.
If the engine you are using doesn't support this, it would have to be done manually via text processing.
I happen to have a regex analysis program. I pumped your initial string into it and it visually formatted it via group levels. Each pass, I just stripped the inner parenth's which simulates a recursion.
Maybe this can help you to visualize what needs to be done.
## Pass 0
## ---------
(
( VBD )
(
(
( JJ )
)
( CC )
(
( RB )
( JJ )
)
)
(
( IN )
(
( DT )
( JJ )
( NNP )
( NNPS )
)
)
)
## Pass 1
## ---------
(
( VBD )
(
( JJ )
( CC )
( RB JJ )
)
(
( IN )
( DT JJ NNP NNPS )
)
)
## Pass 2
## ---------
(
( VBD )
( JJ CC RB JJ )
( IN DT JJ NNP NNPS )
)
## Pass 3
## ---------
( VBD JJ CC RB JJ IN DT JJ NNP NNPS )
## Pass 4
## ---------
VBD JJ CC RB JJ IN DT JJ NNP NNPS
You don't really need to think of matching brackets here... Sounds like you just want to recursively match the pattern [()]([^()]*)[()].
That is, "match something containing no ( ) and delimited by ( or )"