Get named list of subgroup in golang regex - regex
I'm looking for a function that returns a map[string]interface{} where interface{} can be a slice, a map[string]interface{} or a value.
My use case is to parse WKT geometry like the following and retrieve the point values. Example for a donut polygon:
POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))
The regex (for readability, I deliberately use \d, so it matches only integers):
(POLYGON \(
(?P<polygons>\(
(?P<points>(?P<point>(\d \d), ){3,})
(?P<last_point>\d \d )\),)*
(?P<last_polygon>\(
(?P<points>(?P<point>(\d \d), ){3,})
(?P<last_point>\d \d)\))\)
)
I have a function (copied from SO) that retrieves some information, but it does not handle nested groups or lists of groups well:
func getRegexMatchParams(reg *regexp.Regexp, url string) (paramsMap map[string]string) {
	match := reg.FindStringSubmatch(url)
	paramsMap = make(map[string]string)
	// Index 0 is the whole match and has no name, so skip it.
	for i, name := range reg.SubexpNames() {
		if i > 0 && i < len(match) {
			paramsMap[name] = match[i]
		}
	}
	return paramsMap
}
It seems that the group point gets only 1 point.
example on playground
[EDIT] The result I want is something like this:
map[string]interface{}{
"polygons": map[string]interface{} {
"points": []interface{}{
{map[string]string{"point": "0 0"}},
{map[string]string{"point": "0 10"}},
{map[string]string{"point": "10 10"}},
{map[string]string{"point": "10 0"}},
},
"last_point": "0 0",
},
"last_polygon": map[string]interface{} {
"points": []interface{}{
{map[string]string{"point": "3 3"}},
{map[string]string{"point": "3 7"}},
{map[string]string{"point": "7 7"}},
{map[string]string{"point": "7 3"}},
},
"last_point": "3 3",
}
}
That way I can use it for other purposes, such as querying databases and validating that last_point = points[0] for each polygon.
Try to add some whitespace to the regex.
Also note that this engine won't retain every value captured by a group that sits inside a quantified outer group. In (a|b|c)+, for example, the group will contain only the last a, b or c it finds.
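To make the point about quantified groups concrete, here is a minimal Go sketch. The helper names lastPoint and allPoints and the simplified pattern are assumptions made for illustration, not part of the original regex. It shows that a named group inside a repetition reports only its final iteration, while a separate find-all pass returns every point.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// The named group sits inside a repeated non-capturing group.
	repeated = regexp.MustCompile(`^(?:(?P<point>\d+ \d+), )+`)
	// A plain pattern for a single point, used for the find-all pass.
	single = regexp.MustCompile(`\d+ \d+`)
)

// lastPoint returns what the "point" group holds after one match:
// only the text captured by the final iteration of the repetition.
func lastPoint(s string) string {
	m := repeated.FindStringSubmatch(s)
	if m == nil {
		return ""
	}
	return m[repeated.SubexpIndex("point")]
}

// allPoints collects every point with a separate find-all pass.
func allPoints(s string) []string {
	return single.FindAllString(s, -1)
}

func main() {
	ring := "0 0, 0 10, 10 10, 10 0, 0 0"
	fmt.Printf("%q\n", lastPoint(ring)) // "10 0"
	fmt.Printf("%q\n", allPoints(ring))
}
```

SubexpIndex needs Go 1.15 or later; on older versions the index can be looked up through SubexpNames instead.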
Your regex can also be reduced to this:
(POLYGON\s*\((?P<polygons>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\)(?:\s*,\s*|\s*\)))+)
https://play.golang.org/p/rLaaEa_7GX
The original:
(POLYGON\s*\((?P<polygons>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\),)*(?P<last_polygon>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\))\s*\))
https://play.golang.org/p/rZgJYPDMzl
See below for what the groups contain.
( # (1 start)
POLYGON \s* \(
(?P<polygons> # (2 start)
\( \s*
(?P<points> # (3 start)
(?P<point> # (4 start)
\s*
( \d+ \s+ \d+ ) # (5)
\s*
,
){3,} # (4 end)
) # (3 end)
\s*
(?P<last_point> \d+ \s+ \d+ ) # (6)
\s* \),
)* # (2 end)
(?P<last_polygon> # (7 start)
\( \s*
(?P<points> # (8 start)
(?P<point> # (9 start)
\s*
( \d+ \s+ \d+ ) # (10)
\s*
,
){3,} # (9 end)
) # (8 end)
\s*
(?P<last_point> \d+ \s+ \d+ ) # (11)
\s* \)
) # (7 end)
\s* \)
) # (1 end)
Input
POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))
Output
** Grp 0 - ( pos 0 , len 65 )
POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))
** Grp 1 - ( pos 0 , len 65 )
POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))
** Grp 2 [polygons] - ( pos 9 , len 30 )
(0 0, 0 10, 10 10, 10 0, 0 0),
** Grp 3 [points] - ( pos 10 , len 23 )
0 0, 0 10, 10 10, 10 0,
** Grp 4 [point] - ( pos 27 , len 6 )
10 0,
** Grp 5 - ( pos 28 , len 4 )
10 0
** Grp 6 [last_point] - ( pos 34 , len 3 )
0 0
** Grp 7 [last_polygon] - ( pos 39 , len 25 )
(3 3, 3 7, 7 7, 7 3, 3 3)
** Grp 8 [points] - ( pos 40 , len 19 )
3 3, 3 7, 7 7, 7 3,
** Grp 9 [point] - ( pos 54 , len 5 )
7 3,
** Grp 10 - ( pos 55 , len 3 )
7 3
** Grp 11 [last_point] - ( pos 60 , len 3 )
3 3
Possible Solution
It's not impossible. It just takes a few extra steps.
(As an aside, isn't there a library for WKT that can parse this for you?)
I don't know what your language can do, so this is just a general approach.
1. Validate the form you're parsing.
This will validate and return all polygon sets as a single string in All_Polygons group.
Target POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))
POLYGON\s*\((?P<All_Polygons>(?:\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))(?:\s*,\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))*)\s*\)
** Grp 1 [All_Polygons] - ( pos 9 , len 55 )
(0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3)
2. If 1 was successful, set up a loop match using the output of All_Polygons string.
Target (0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3)
(?:\(\s*(?P<Single_Poly_All_Pts>\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,})\s*\))
This step is the equivalent of a find-all type of match. It should match the points of each polygon in turn, returned as a string in the Single_Poly_All_Pts group.
This will give you these 2 separate matches, which can be put into a temp array having 2 value strings:
** Grp 1 [Single_Poly_All_Pts] - ( pos 1 , len 27 )
0 0, 0 10, 10 10, 10 0, 0 0
** Grp 1 [Single_Poly_All_Pts] - ( pos 31 , len 23 )
3 3, 3 7, 7 7, 7 3, 3 3
3. If 2 was successful, set up a loop match using the temp array output of step 2.
This will give you the individual points of each polygon.
(?P<Single_Point>\d+\s+\d+)
Again this is a loop match (or a find all type of match). For each array element
(Polygon), this will produce the individual points.
Target[element 1] 0 0, 0 10, 10 10, 10 0, 0 0
** Grp 1 [Single_Point] - ( pos 0 , len 3 )
0 0
** Grp 1 [Single_Point] - ( pos 5 , len 4 )
0 10
** Grp 1 [Single_Point] - ( pos 11 , len 5 )
10 10
** Grp 1 [Single_Point] - ( pos 18 , len 4 )
10 0
** Grp 1 [Single_Point] - ( pos 24 , len 3 )
0 0
And,
Target[element 2] 3 3, 3 7, 7 7, 7 3, 3 3
** Grp 1 [Single_Point] - ( pos 0 , len 3 )
3 3
** Grp 1 [Single_Point] - ( pos 5 , len 3 )
3 7
** Grp 1 [Single_Point] - ( pos 10 , len 3 )
7 7
** Grp 1 [Single_Point] - ( pos 15 , len 3 )
7 3
** Grp 1 [Single_Point] - ( pos 20 , len 3 )
3 3
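The three steps above can be sketched in Go roughly as follows. This is one possible reading of the approach, not a definitive implementation: the function name parsePolygon and the variable names are made up for illustration, the named groups are replaced by plain numbered groups, and ^ and $ anchors are added to the validation pattern so that the whole input must match.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Step 1: validate the whole string; group 1 holds all polygons.
	wholeRe = regexp.MustCompile(`^POLYGON\s*\(((?:\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))(?:\s*,\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))*)\s*\)$`)
	// Step 2: one match per polygon; group 1 holds that polygon's points.
	ringRe = regexp.MustCompile(`\(\s*(\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,})\s*\)`)
	// Step 3: one match per point.
	pointRe = regexp.MustCompile(`\d+\s+\d+`)
)

// parsePolygon returns one slice of points per polygon, or false when
// the input does not have the expected form.
func parsePolygon(wkt string) ([][]string, bool) {
	m := wholeRe.FindStringSubmatch(wkt)
	if m == nil {
		return nil, false
	}
	var rings [][]string
	for _, r := range ringRe.FindAllStringSubmatch(m[1], -1) {
		rings = append(rings, pointRe.FindAllString(r[1], -1))
	}
	return rings, true
}

func main() {
	rings, ok := parsePolygon("POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))")
	fmt.Println(ok)
	for i, pts := range rings {
		// A closed ring ends on the point it started from.
		closed := pts[0] == pts[len(pts)-1]
		fmt.Printf("%d %q closed=%v\n", i, pts, closed)
	}
}
```

Keeping the result as a slice of slices makes the check from the question (last point equals first point) a simple comparison per ring.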
Related
RegEx Catastrophic Backtracking
I'm trying to code a program to handle some data from some computations for me and my group of chemists. I need to be able to search through an output file that contains an excess of 100,000 lines with repeating units. I'm having trouble developing an expression to pull the data that I need. Here is a sample portion of the output file from which I need to extract data. -------------------------------------------------------- 6 0.934039 1.910373 -0.007356 6 0.522681 1.025902 1.132490 6 1.295895 0.175184 1.887754 6 2.745917 -0.059133 1.663889 6 3.251755 -1.317777 1.620723 6 2.354786 -2.503163 1.659226 6 2.500165 -3.540741 2.548878 16 1.200827 -4.674482 2.398451 6 0.450003 -3.761272 1.124048 6 1.191704 -2.653037 0.838883 1 0.889391 -1.914706 0.104995 6 -0.831020 -4.156701 0.527821 6 -1.110322 -3.881221 -0.814127 6 -2.349055 -4.235426 -1.332810 7 -3.309523 -4.841208 -0.630339 6 -3.035523 -5.105663 0.648864 6 -1.833017 -4.789011 1.269474 1 -1.692631 -5.008941 2.323469 1 -3.826405 -5.592591 1.215268 1 -2.579432 -4.028326 -2.375778 1 -0.365461 -3.412291 -1.448955 6 3.573349 -3.754217 3.574993 1 3.816975 -2.812111 4.074804 1 3.260231 -4.473156 4.337010 1 4.494421 -4.124194 3.111155 6 4.710061 -1.617156 1.557050 6 5.623546 -1.003416 2.421004 6 6.974181 -1.326012 2.370518 6 7.436595 -2.269579 1.456291 6 6.535520 -2.894414 0.598497 6 5.182590 -2.576469 0.655016 1 4.479434 -3.078256 -0.004163 1 6.885058 -3.634965 -0.114866 1 8.492504 -2.519869 1.416345 1 7.667827 -0.841306 3.051107 1 5.267975 -0.268736 3.136940 6 3.580944 1.164538 1.503126 6 4.497172 1.300372 0.454818 6 5.240591 2.465784 0.310519 6 5.081537 3.515835 1.211519 6 4.166656 3.395074 2.253939 6 3.418260 2.231401 2.393481 1 2.692677 2.145384 3.197718 1 4.033033 4.209217 2.960243 1 5.661476 4.426693 1.096865 1 5.945953 2.554610 -0.510417 1 4.622653 0.485776 -0.251950 6 0.528985 -0.506746 2.884231 6 -0.800876 -0.209971 2.864388 16 -1.134679 0.961768 1.625169 6 -1.858335 -0.812159 3.682689 6 -1.740268 -2.136749 4.118466 
6 -2.752842 -2.684280 4.894687 7 -3.862544 -2.030644 5.248979 6 -3.974818 -0.772056 4.819134 6 -3.016393 -0.122197 4.049880 1 -3.167697 0.912109 3.755521 1 -4.882415 -0.247888 5.110965 1 -2.672356 -3.711555 5.245040 1 -0.881625 -2.737495 3.833528 1 0.964846 -1.208304 3.585910 1 0.076921 2.185394 -0.628553 1 1.403779 2.830811 0.357050 1 1.665791 1.402320 -0.641683 Energy (Hartree) = -2216.64927779 Step: 34 Scan 1 out of 73 Converged(Max Force, RMS Force, Max Disp, RMS Disp): YES, YES, NO, YES -------------------------------------------------------- -------------------------------------------------------- 6 0.934062 1.911021 -0.006793 6 0.522693 1.026243 1.132808 6 1.295849 0.175204 1.887786 6 2.745849 -0.059151 1.663825 6 3.251670 -1.317799 1.620645 6 2.354637 -2.503134 1.659254 6 2.499913 -3.540512 2.549163 16 1.200720 -4.674404 2.398679 6 0.450032 -3.761466 1.124001 6 1.191706 -2.653228 0.838749 1 0.889499 -1.915053 0.104660 6 -0.830799 -4.157197 0.527556 6 -1.109977 -3.881704 -0.814420 6 -2.348529 -4.236224 -1.333319 7 -3.308924 -4.842349 -0.631043 6 -3.035027 -5.106853 0.648170 6 -1.832715 -4.789884 1.268992 1 -1.692398 -5.009943 2.322969 1 -3.825839 -5.594087 1.214409 1 -2.578819 -4.029095 -2.376300 1 -0.365153 -3.412510 -1.449097 6 3.572924 -3.753599 3.575540 1 3.816308 -2.811319 4.075158 1 3.259759 -4.472389 4.337671 1 4.494142 -4.123544 3.111975 6 4.709955 -1.617231 1.556911 6 5.623512 -1.003434 2.420751 6 6.974133 -1.326076 2.370222 6 7.436469 -2.269740 1.456057 6 6.535326 -2.894639 0.598379 6 5.182409 -2.576651 0.654945 1 4.479201 -3.078478 -0.004148 1 6.884808 -3.635261 -0.114937 1 8.492367 -2.520070 1.416076 1 7.667829 -0.841314 3.050721 1 5.268008 -0.268672 3.136635 6 3.580909 1.164495 1.503041 6 4.497111 1.300286 0.454710 6 5.240549 2.465680 0.310363 6 5.081547 3.515754 1.211345 6 4.166716 3.395021 2.253815 6 3.418302 2.231369 2.393405 1 2.692762 2.145371 3.197682 1 4.033146 4.209176 2.960116 1 5.661485 4.426605 1.096636 1 5.945883 2.554473 -0.510601 1 
4.622556 0.485666 -0.252039 6 0.528927 -0.506827 2.884183 6 -0.800849 -0.209681 2.864659 16 -1.134600 0.962406 1.625752 6 -1.858353 -0.811785 3.682977 6 -1.740809 -2.136677 4.117983 6 -2.753416 -2.684132 4.894214 7 -3.862669 -2.030122 5.249222 6 -3.974426 -0.771225 4.820153 6 -3.015920 -0.121422 4.050946 1 -3.166775 0.913141 3.757259 1 -4.881643 -0.246746 5.112609 1 -2.673367 -3.711662 5.243914 1 -0.882533 -2.737672 3.832462 1 0.964690 -1.208688 3.585621 1 0.076974 2.186246 -0.627940 1 1.403854 2.831315 0.357930 1 1.665819 1.403131 -0.641235 Energy (Hartree) = -2216.64927781 Step: 35 Scan 1 out of 73 Converged(Max Force, RMS Force, Max Disp, RMS Disp): YES, YES, YES, YES Optimized Parameters for Coordinate Value: 48.7864 -------------------------------------------------------- -------------------------------------------------------- 6 0.928653 1.914728 -0.015952 6 0.523104 1.029664 1.125513 6 1.299323 0.175435 1.873723 6 2.746979 -0.062752 1.638872 6 3.250582 -1.322711 1.610272 6 2.350949 -2.505931 1.653003 6 2.487964 -3.535693 2.553022 16 1.187646 -4.668406 2.403419 6 0.447689 -3.765331 1.115501 6 1.193508 -2.661056 0.825700 1 0.897892 -1.928805 0.083033 6 -0.829701 -4.163896 0.513583 6 -1.098964 -3.899654 -0.832683 6 -2.334534 -4.256414 -1.357129 7 -3.300974 -4.854584 -0.656318 6 -3.036531 -5.108382 0.627053 6 -1.837984 -4.788211 1.253491 1 -1.705454 -4.999308 2.310312 1 -3.832212 -5.589178 1.191978 1 -2.557135 -4.057999 -2.403474 1 -0.348825 -3.437415 -1.466208 6 3.553334 -3.741710 3.588769 1 3.794918 -2.795527 4.081850 1 3.233495 -4.453260 4.354918 1 4.477112 -4.117327 3.134952 6 4.708676 -1.625521 1.559405 6 5.617384 -1.005953 2.424241 6 6.967681 -1.331651 2.386026 6 7.434526 -2.284193 1.483422 6 6.538165 -2.914834 0.624932 6 5.185521 -2.593731 0.669206 1 4.485944 -3.099951 0.009606 1 6.891156 -3.662361 -0.079406 1 8.490177 -2.536916 1.453052 1 7.657559 -0.842289 3.067116 1 5.258339 -0.264251 3.131155 6 3.588296 1.158822 1.495804 6 4.455584 1.334979 0.412017 6 
5.190024 2.506923 0.274817 6 5.065591 3.527255 1.214482 6 4.191377 3.371568 2.286682 6 3.451948 2.201349 2.419198 1 2.755016 2.090210 3.245500 1 4.081099 4.164280 3.020955 1 5.637641 4.443718 1.104935 1 5.859723 2.624932 -0.572054 1 4.551226 0.545487 -0.327323 6 0.537765 -0.505652 2.874875 6 -0.791246 -0.204606 2.865525 16 -1.130701 0.970006 1.630565 6 -1.844454 -0.804660 3.690867 6 -1.727716 -2.130536 4.123087 6 -2.736173 -2.676023 4.906078 7 -3.840730 -2.019138 5.270315 6 -3.951832 -0.759281 4.843904 6 -2.997115 -0.111294 4.068471 1 -3.146985 0.924153 3.777407 1 -4.855222 -0.232457 5.143904 1 -2.656671 -3.704306 5.253685 1 -0.873433 -2.733722 3.830294 1 0.976622 -1.209660 3.572222 1 0.067886 2.192910 -0.630666 1 1.403422 2.833365 0.346506 1 1.654564 1.405702 -0.656173 Energy (Hartree) = -2216.64908578 Step: 1 Scan 2 out of 73 Converged(Max Force, RMS Force, Max Disp, RMS Disp): NO, YES, NO, NO -------------------------------------------------------- Each section contains a set of coordinates for elements of a molecule, the total energy for the molecule, the step for the computational scan, and convergence criteria. If all four convergence criterion are met, the optimized coordinate scan value is added. The data I need to extract are the coordinates, the total energy, the scan number, and coordinate value for a converged block only! I've tried tirelessly to develop a suitable expression to use for extracting the required sets of data. Currently, my RegEx code looks like the following: \-*?\n(\d{1,2}(?:\s+[+-]?\d.*?)+)?\n\w+.*?([+-]?\d+\.\d+)\n\w.*?\n\w+\s(\d+).*?\n\w.*\n That code is able to capture the entire section for coordinates, the energy, and the step number. Every time I try to go to the next line using \w, I immediately receive a catastrophic backtracking error. it's imperative that I have that next line in the expression as it's what differentiates the desired block of data over the other. 
I'm not terribly great with Python or RegEx, and I'm requesting help. Once I have the correct expression, I'll be using a nested for loop to extract all of the data I need!
Demo
If there are any other questions I can answer to better describe my situation, please let me know! An explanation of what you do to help will be much appreciated, as I want to learn as much as I can! Thank you for your help in advance!
One option is to make the pattern more specific and match the exact words instead of using \w+ and .*?
Based on the example data, if you want to capture the values for the coordinates, the total energy and the scan number, you could use 3 capturing groups:
-+\r?\n(\d{1,2}(?:[^\S\r\n]+[+-]?\d+(?:\.\d+)?)*(?:\r?\n\d{1,2}(?:[^\S\r\n]+[+-]?\d+(?:\.\d+)?)*)*)\r?\nEnergy[^\S\r\n]+\([^[()]+\)[^\S\r\n]+=[^\S\r\n]+([+-]?\d+\.\d+)\r?\nStep:[^\S\r\n]+(\d+)
Explanation
-+\r?\n
( Capture group 1
\d{1,2} Match 1-2 digits
(?: Non capture group
[^\S\r\n]+[+-]?\d+(?:\.\d+)? Match 1+ spaces and a number with an optional decimal part
)* Close group and repeat 0+ times
(?: Non capture group
\r?\n\d{1,2} Match a newline and 1-2 digits
(?:[^\S\r\n]+[+-]?\d+(?:\.\d+)?)* Repeat 0+ times matching spaces followed by a number with an optional decimal part
)* Close group and repeat 0+ times
) Close group 1
\r?\nEnergy[^\S\r\n]+\([^[()]+\)[^\S\r\n]+=[^\S\r\n]+
( Capture group 2
[+-]?\d+\.\d+ Match optional - or + and 1+ digits with a decimal part
) Close group 2
\r?\nStep:[^\S\r\n]+
(\d+) Capture group 3, match 1+ digits
Regex demo
Note that \s could also match a newline. To match whitespace chars without a newline, you could use a negated character class [^\S\r\n]
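The note about \s versus [^\S\r\n] can be checked with a small sketch. It is written in Go rather than Python, and the two helper names and the reduced single-row pattern are assumptions for illustration. Go's regexp package accepts Perl classes such as \S inside a bracket expression, so the same negated class works there.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// \s also matches \r and \n, so the repetition runs on into the next row.
	anySpace = regexp.MustCompile(`^\d{1,2}(?:\s+[+-]?\d+(?:\.\d+)?)*`)
	// [^\S\r\n] means "whitespace except line breaks", so the match
	// stays on a single row.
	sameLine = regexp.MustCompile(`^\d{1,2}(?:[^\S\r\n]+[+-]?\d+(?:\.\d+)?)*`)
)

// matchAnySpace returns the match of the \s-based pattern.
func matchAnySpace(s string) string { return anySpace.FindString(s) }

// matchSameLine returns the match of the pattern that excludes line breaks.
func matchSameLine(s string) string { return sameLine.FindString(s) }

func main() {
	rows := "6 0.934039 1.910373 -0.007356\n6 0.522681 1.025902 1.132490"
	fmt.Printf("%q\n", matchAnySpace(rows)) // both rows
	fmt.Printf("%q\n", matchSameLine(rows)) // first row only
}
```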
Why does this regular expression "freeze" the program in VC++?
I have the following code, originally programmed using C++11's regular expressions library (#include <regex>) but now using Boost in an attempt to troubleshoot:
boost::regex reg(R"(.*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z]*[0-9]+[a-z0-9]*)).*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z]+)).*?(\d+).*?((?:[a-z][a-z]+)))", boost::regex::icase);
boost::cmatch matches;
if (boost::regex_match(request, reg) && matches.size() > 1) {
    printf("Match found");
} else {
    printf("No match.");
}
When executed, this code seems to "freeze" on boost::regex_match(request, reg), as if it's taking a long time to process. I waited five minutes for it to process (in case this is a processing issue) but the program state was the same.
I tested the STL's regex library version of the above code online on cpp.sh and onlinegdb, and it works flawlessly there.
I then copied this code into a VC++ project, and the code freezes again:
#include <iostream>
#include <string>
#include <regex>
int main()
{
    std::string request = "\\login\\\\challenge\\jRJkdflp3gvTzrwiQ3tyKSqnyppmaZog\\uniquenick\\Lament\\partnerid\\0\\response\\4767846ef255a88da9b10f7c923a1e6e\\port\\-14798\\productid\\11489\\gamename\\crysiswars\\namespaceid\\56\\sdkrevision\\3\\id\\1\\final\\";
    std::regex reg(R"(.*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z]*[0-9]+[a-z0-9]*)).*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z]+)).*?(\d+).*?((?:[a-z][a-z]+)))", std::regex::icase);
    std::smatch matches;
    if (std::regex_search(request, matches, reg) && matches.size() > 1) {
        printf("Match found");
    } else {
        printf("No match.");
    }
}
The string concerned is the following:
\login\challenge\jRJwdflp3gvTrrwiQ3tyKSqnyppmaZog\uniquenick\User\partnerid\0\response\4767846ef255a83da9b10f7f923a1e6e\port-14798\productid\11489\gamename\crysiswars\namespaceid\56\sdkrevision\3\id\1\final\
I tested the same code on a Visual Studio 2017 installation on another computer (brand new project) and got the exact same result, which seems to indicate that something the compiler is doing is causing the code to freeze or take a long time processing. I am unable to test on another compiler locally at present.
The regular expression string checks out on regex101, so functionally the expression is OK.
This is with Visual Studio 2017 Professional targeting v141.
Why is this happening, and how can I fix it?
Your problem is one of backtracking.
In the Boost sample, you use regex_match, which forces a match on the whole string. You will get the same result if you use regex_search and add ^ ... $ around the pattern.
However, your string can never match, because you have forced it to end on a letter while the string really ends with a backslash. This forces the engine to retry all those .*? positions.
The fix is to put a final .*? at the end of your regex, which lets the regex fulfill its mission of matching the entire string.
Other things may help: you could clean up your regex a bit, add some atomic groups, or put some backslashes in place of those .*?
Anyway, use this:
^.*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z]*[0-9]+[a-z0-9]*)).*?((?:[a-z][a-z]+)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z0-9_]*)).*?(\d+).*?((?:[a-z][a-z]+)).*?(\d+).*?((?:[a-z][a-z]+)).*?$
Output
** Grp 0 - ( pos 0 : len 207 )
\login\challenge\jRJwdflp3gvTrrwiQ3tyKSqnyppmaZog\uniquenick\User\partnerid\0\response\4767846ef255a83da9b10f7f923a1e6e\port-14798\productid\11489\gamename\crysiswars\namespaceid\56\sdkrevision\3\id\1\final\
** Grp 1 - ( pos 1 : len 5 )
login
** Grp 2 - ( pos 7 : len 9 )
challenge
** Grp 3 - ( pos 17 : len 32 )
jRJwdflp3gvTrrwiQ3tyKSqnyppmaZog
** Grp 4 - ( pos 50 : len 10 )
uniquenick
** Grp 5 - ( pos 61 : len 4 )
User
** Grp 6 - ( pos 66 : len 9 )
partnerid
** Grp 7 - ( pos 76 : len 1 )
0
** Grp 8 - ( pos 78 : len 8 )
response
** Grp 9 - ( pos 94 : len 25 )
ef255a83da9b10f7f923a1e6e
** Grp 10 - ( pos 120 : len 4 )
port
** Grp 11 - ( pos 125 : len 5 )
14798
** Grp 12 - ( pos 131 : len 9 )
productid
** Grp 13 - ( pos 141 : len 5 )
11489
** Grp 14 - ( pos 147 : len 8 )
gamename
** Grp 15 - ( pos 156 : len 10 )
crysiswars
** Grp 16 - ( pos 167 : len 11 )
namespaceid
** Grp 17 - ( pos 179 : len 2 )
56
** Grp 18 - ( pos 182 : len 11 )
sdkrevision
** Grp 19 - ( pos 194 : len 1 )
3
** Grp 20 - ( pos 196 : len 2 )
id
** Grp 21 - ( pos 199 : len 1 )
1
** Grp 22 - ( pos 201 : len 5 )
final
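As an illustration of the suggestion to use explicit backslashes instead of .*?, here is a hedged sketch in Go rather than C++. It assumes the input is a sequence of \key\value pairs, as in the request string from the question's C++ source; the names pairRe and parsePairs are hypothetical. Because neither the key nor the value may contain a backslash, there is only one way to split the input, so there is nothing for an engine to retry. (Go's regexp package is also documented to run in time linear in the input, so this sketch does not reproduce the freeze itself.)

```go
package main

import (
	"fmt"
	"regexp"
)

// pairRe matches one \key\value pair. The value may be empty.
var pairRe = regexp.MustCompile(`\\([^\\]+)\\([^\\]*)`)

// parsePairs returns the key/value pairs in order of appearance.
func parsePairs(s string) [][2]string {
	var out [][2]string
	for _, m := range pairRe.FindAllStringSubmatch(s, -1) {
		out = append(out, [2]string{m[1], m[2]})
	}
	return out
}

func main() {
	// Shortened form of the request string from the question's C++ source.
	req := `\login\\challenge\jRJkdflp3gvTzrwiQ3tyKSqnyppmaZog\uniquenick\Lament\partnerid\0\final\`
	for _, kv := range parsePairs(req) {
		fmt.Printf("%s=%q\n", kv[0], kv[1])
	}
}
```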
for loop in pandas to search dataframe and update list stuck
I want to count areas of interest in my dataframe column 'which_AOI' (ranging from 0 to 9). I would like to have a new column with the results added to a dataframe depending on a variable 'marker' (ranging from 0 to x), which tells me when one 'picture' is done and the next begins (one marker can go on for a variable number of rows).
This is my code so far, but it seems to be stuck and runs on without giving output. I tried reconstructing it from the beginning once, but as soon as I get to 'if df.marker == num' it doesn't stop. What am I missing? (example dataframe below)
## AOI count of spec. type function (in progress):
import numpy as np
import pandas as pd
path_i = "/Users/Desktop/Pilot/results/gazedata_filename.csv"
df = pd.read_csv(path_i, sep =",")
#create a new dataframe for AOIs:
d = {'marker': []}
df_aoi = pd.DataFrame(data=d)
### Creating an Aoi list
item = df.which_AOI
aoi = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] #list for search
aoi_array = [0, 0 , 0, 0, 0, 0, 0, 0, 0, 0] #list for filling
num = 0
for i in range (0, len (df.marker)): #loop through the dataframe
    if df.marker == num: ## if marker = num its one picture
        for index, item in enumerate(aoi): #look for item (being a number in which_AOI) in aoi list
            if (item == aoi[index]):
                aoi_array[index] += 1
        print (aoi)
        print (aoi_array)
        se = pd.Series(aoi_array) # make list into a series to attach to dataframe
        df_aoi['new_col'] = se.values #add list to dataframe
        aoi_array.clear() #clears list before next picture
    else:
        num +=1
index pos_time pos_x pos_y pup_time pup_diameter marker which_AOI fixation Picname shock
1 16300 168.608779907227 -136.360855102539 16300 2.935715675354 0 7 18 5 save
2 16318 144.97673034668 -157.495513916016 16318 3.08838820457459 0 8 33 5 save
3 16351 152.92560577392598 -156.64172363281298 16351 3.0895299911499 0 7 17 5 save
4 16368 152.132453918457 -157.989685058594 16368 3.111008644104 0 7 18 5 save
5 16386 151.59835815429702 -157.55587768554702 16386 3.09514689445496 0 7 18 5 save
6 16404 150.88092803955098 -152.69479370117202 16404 3.10009074211121 1 7 37 5 save
7 16441 152.76554107666 -142.06188964843798 16441 3.0821495056152304 1 7 33 5 save
It's not 100% clear from your question, but it sounds like you want to count the number of rows for each which_AOI value in each marker. You can accomplish this using groupby:
df_aoi = df.groupby(['marker','which_AOI']).size().unstack('which_AOI',fill_value=0)
In:
   pos_time       pos_x       pos_y  pup_time  pup_diameter  marker  which_AOI  fixation  Picname shock
0     16300  168.608780 -136.360855     16300      2.935716       0          7        18        5  save
1     16318  144.976730 -157.495514     16318      3.088388       0          8        33        5  save
2     16351  152.925606 -156.641724     16351      3.089530       0          7        17        5  save
3     16368  152.132454 -157.989685     16368      3.111009       0          7        18        5  save
4     16386  151.598358 -157.555878     16386      3.095147       0          7        18        5  save
5     16404  150.880928 -152.694794     16404      3.100091       1          7        37        5  save
6     16441  152.765541 -142.061890     16441      3.082150       1          7        33        5  save
Out:
which_AOI  7  8
marker
0          4  1
1          2  0
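For readers who want to see what that grouped count computes, here is a minimal sketch of the same tally written out by hand, in Go rather than pandas. The sample type and the countAOI function are hypothetical names introduced only for this illustration, and the rows are the marker and which_AOI columns of the seven sample rows.

```go
package main

import "fmt"

// sample is a hypothetical row type holding only the two columns
// that matter for the count.
type sample struct {
	marker, aoi int
}

// countAOI returns counts[marker][aoi], the number of rows with that
// marker and that which_AOI value.
func countAOI(rows []sample) map[int]map[int]int {
	counts := make(map[int]map[int]int)
	for _, r := range rows {
		if counts[r.marker] == nil {
			counts[r.marker] = make(map[int]int)
		}
		counts[r.marker][r.aoi]++
	}
	return counts
}

func main() {
	rows := []sample{{0, 7}, {0, 8}, {0, 7}, {0, 7}, {0, 7}, {1, 7}, {1, 7}}
	c := countAOI(rows)
	// A missing combination reads as 0, like fill_value=0 in the answer.
	fmt.Println(c[0][7], c[0][8], c[1][7], c[1][8]) // 4 1 2 0
}
```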
Empty groups in regex
I use a regexp to test a link:
lolspec:\/\/(spectator\.(na|euw1|eu|kr|oc1|br|la1|la2|ru|tr|pbe1)\.lol\.riotgames\.com:(80|8088)((([?&]region=(NA1|EUW1|EUN1|KR|OC1|BR1|LA1|LA2|RU|TR1|PBE1))|([?&]gameID=([0-9]+))|([?&]encKey=(.+)))){3})
to test this link:
lolspec://spectator.euw1.lol.riotgames.com:80?region=NA1&gameID=44584&encKey=fghgdsv1134+ianfcuia
but some groups are empty (#7, #8, #9). What should I do?
Probably overkill on the capture groups.
The regex you use there contains a container capture group 4 that is quantified like this: ( ... ){3}. That overwrites the container capture buffer 3 times, leaving only the last value captured within the capture group.
The next level down is a single inner group that the outer group encapsulates, like this: (( ... )){3}. It is not needed, and it gets overwritten in the same way.
Deeper still are three capture groups, all separated by alternations. They follow the same rules: each one gets overwritten if it can match again during each of the 3 quantified passes. Only one group in the alternation cluster matches per pass. So, if you have identical adjacent data, it could be matched by the same alternation branch, leaving the other groups in the cluster empty.
So, this is not the approach if you want to match out-of-order parameters in a string. The way this is done is with lookahead assertions, or with conditionals if you are using an engine that supports them.
The way to do it using conditionals is like this:
(?:
.*?
(?:
( (?(1)(?!)) abc ) # (1)
| ( (?(2)(?!)) def ) # (2)
| ( (?(3)(?!)) ghi ) # (3)
)
){3}
It forces finding all of the capture group contents. The way you are doing it is the same but without the conditionals, and it suffers the consequences stated above.
By the way, your regex above does not have any empty groups with that particular sample, but it has many problems.
lolspec:\/\/
( # (1 start)
spectator\.
( na | euw1 | eu | kr | oc1 | br | la1 | la2 | ru | tr | pbe1 ) # (2)
\.lol\.riotgames\.com:
( 80 | 8088 ) # (3)
( # (4 start)
( # (5 start)
( # (6 start)
[?&] region=
( NA1 | EUW1 | EUN1 | KR | OC1 | BR1 | LA1 | LA2 | RU | TR1 | PBE1 ) # (7)
) # (6 end)
| ( # (8 start)
[?&] gameID=
( [0-9]+ ) # (9)
) # (8 end)
| ( # (10 start)
[?&] encKey=
( .+ ) # (11)
) # (10 end)
) # (5 end)
){3} # (4 end)
) # (1 end)
Output
** Grp 0 - ( pos 0 , len 97 )
lolspec://spectator.euw1.lol.riotgames.com:80?region=NA1&gameID=44584&encKey=fghgdsv1134+ianfcuia
** Grp 1 - ( pos 10 , len 87 )
spectator.euw1.lol.riotgames.com:80?region=NA1&gameID=44584&encKey=fghgdsv1134+ianfcuia
** Grp 2 - ( pos 20 , len 4 )
euw1
** Grp 3 - ( pos 43 , len 2 )
80
** Grp 4 - ( pos 69 , len 28 )
&encKey=fghgdsv1134+ianfcuia
** Grp 5 - ( pos 69 , len 28 )
&encKey=fghgdsv1134+ianfcuia
** Grp 6 - ( pos 45 , len 11 )
?region=NA1
** Grp 7 - ( pos 53 , len 3 )
NA1
** Grp 8 - ( pos 56 , len 13 )
&gameID=44584
** Grp 9 - ( pos 64 , len 5 )
44584
** Grp 10 - ( pos 69 , len 28 )
&encKey=fghgdsv1134+ianfcuia
** Grp 11 - ( pos 77 , len 20 )
fghgdsv1134+ianfcuia
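Go's regexp package supports neither lookaheads nor conditionals, so the sketch below swaps in a different technique for the same goal: validate the fixed prefix once, then search for each parameter with its own small pattern, so the order of the parameters does not matter and no group is overwritten. The function name parseLink, the variable names, and the use of [^&]+ instead of .+ for encKey are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	baseRe   = regexp.MustCompile(`^lolspec://spectator\.(na|euw1|eu|kr|oc1|br|la1|la2|ru|tr|pbe1)\.lol\.riotgames\.com:(80|8088)[?&]`)
	regionRe = regexp.MustCompile(`[?&]region=(NA1|EUW1|EUN1|KR|OC1|BR1|LA1|LA2|RU|TR1|PBE1)(?:&|$)`)
	gameRe   = regexp.MustCompile(`[?&]gameID=([0-9]+)(?:&|$)`)
	keyRe    = regexp.MustCompile(`[?&]encKey=([^&]+)`)
)

// parseLink returns region, gameID and encKey, or ok=false when the
// prefix or any of the three parameters is missing.
func parseLink(link string) (region, gameID, encKey string, ok bool) {
	if !baseRe.MatchString(link) {
		return "", "", "", false
	}
	r := regionRe.FindStringSubmatch(link)
	g := gameRe.FindStringSubmatch(link)
	k := keyRe.FindStringSubmatch(link)
	if r == nil || g == nil || k == nil {
		return "", "", "", false
	}
	return r[1], g[1], k[1], true
}

func main() {
	fmt.Println(parseLink("lolspec://spectator.euw1.lol.riotgames.com:80?region=NA1&gameID=44584&encKey=fghgdsv1134+ianfcuia"))
}
```

Unlike the {3} quantifier in the question, this sketch does not reject extra or repeated parameters; that check would have to be added separately if it matters.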
Regex Negations in Vim
Question:
How do I convert
var x+=1+2+3+(5+6+7)
to
var x += 1 + 2 + 3 + ( 5 + 6 + 7 )
Details:
Using regular expressions, something like :%s/+/\ x\ /g won't work because it will convert += to + = (amongst other problems). So instead one would use negations (negatives, nots, whatever they're called) like so :%s/\s\#!+/\ +/g, which is about as complicated a way as one can say "plus sign without an empty space before it". But now this converts something like x++ into x + +. What I need is something more complex. I need more than one constraint in the negation, and an additional constraint afterwards. Something like so, but this doesn't work :%s/[\s+]\#!+\x\#!/\ +/g
Could someone please provide the one, or possibly two, regex statements which will pad out an example operator, such that I can model the rest of my rules on it/them?
Motivation:
I find beautifiers for languages like JavaScript or PHP don't give me full control (see here). Therefore, I am attempting to use regex to carry out the following conversions:
foo(1,2,3,4) → foo( 1, 2, 3, 4 )
var x=1*2*3 → var x = 1 * 2 * 3
var x=1%2%3 → var x = 1 % 2 % 3
var x=a&&b&&c → var x = a && b && c
var x=a&b&c → var x = a & b & c
Any feedback would also be appreciated.
Thanks to the great feedback, I now have regular expressions to work from. I am running these two:
:%s/\(\w\)\([+\-*\/%|&~)=]\)/\1\ \2/g
:%s/\([+\-*\/%|&~,(=]\)\(\w\)/\1\ \2/g
And it is working fairly well. Here are some results.
(1+2+3+4,1+2+3+4,1+2+3+4) --> ( 1 + 2 + 3 + 4, 1 + 2 + 3 + 4, 1 + 2 + 3 + 4 )
(1-2-3-4,1-2-3-4,1-2-3-4) --> ( 1 - 2 - 3 - 4, 1 - 2 - 3 - 4, 1 - 2 - 3 - 4 )
(1*2*3*4,1*2*3*4,1*2*3*4) --> ( 1 * 2 * 3 * 4, 1 * 2 * 3 * 4, 1 * 2 * 3 * 4 )
(1/2/3/4,1/2/3/4,1/2/3/4) --> ( 1 / 2 / 3 / 4, 1 / 2 / 3 / 4, 1 / 2 / 3 / 4 )
(1%2%3%4,1%2%3%4,1%2%3%4) --> ( 1 % 2 % 3 % 4, 1 % 2 % 3 % 4, 1 % 2 % 3 % 4 )
(1|2|3|4,1|2|3|4,1|2|3|4) --> ( 1 | 2 | 3 | 4, 1 | 2 | 3 | 4, 1 | 2 | 3 | 4 )
(1&2&3&4,1&2&3&4,1&2&3&4) --> ( 1 & 2 & 3 & 4, 1 & 2 & 3 & 4, 1 & 2 & 3 & 4 )
(1~2~3~4,1~2~3~4,1~2~3~4) --> ( 1 ~ 2 ~ 3 ~ 4, 1 ~ 2 ~ 3 ~ 4, 1 ~ 2 ~ 3 ~ 4 )
(1&&2&&3&&4,1&&2&&3&&4,1&&2&&3&&4) --> ( 1 && 2 && 3 && 4, 1 && 2 && 3 && 4, 1 && 2 && 3 && 4 )
(1||2||3||4,1||2||3||4,1||2||3||4) --> ( 1 || 2 || 3 || 4, 1 || 2 || 3 || 4, 1 || 2 || 3 || 4 )
var x=1+(2+(3+4*(965%(123/(456-789))))); --> var x = 1 +( 2 +( 3 + 4 *( 965 %( 123 /( 456 - 789 )))));
It seems to work fine for everything except nested brackets. If I fix the nested brackets problem, I will update it here.