Findall with regular expression in a pandas dataframe returns an incomplete list [duplicate] - regex

I am having trouble with pandas' Series.str.findall(). I am trying to scan a report and retrieve all the electrical components. Each report is either a single line or a paragraph, and it mentions components in a standardized way. I am using this code:
failedCoFromReason = rlist['report'].str.findall(r'([CULJRQF]([\dV]{2,4}))', flags=re.IGNORECASE)
It returns the components, but it also repeats the number for each one, like this: [('r919', '919'), ('r920', '920')]
I would like it to return just [('r919'), ('r920')], but I am struggling to get it to work. I am pretty new to pandas and regex and am confused about how to search. I have tried greedy and non-greedy searches, but they didn't work.

See the Series.str.findall reference:
Equivalent to applying re.findall() to all the elements in the Series/Index.
The re.findall reference says that "if one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group."
So, all you need to do is actually remove all capturing parentheses in this case, as all you need is to get the whole match:
rlist['report'].str.findall(r'[CULJRQF][\dV]{2,4}', flags=re.I)
In other cases, when you need to preserve the group (to quantify it, or to use alternatives), you need to change the capturing groups to non-capturing ones:
rlist['report'].str.findall(r'(?:[CULJRQF](?:[\dV]{2,4}))', flags=re.I)
Though, in this case, it is quite redundant.
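The difference between the three patterns can be checked with plain re.findall, which the Series.str.findall reference quoted above describes as equivalent per element. This is a minimal sketch; the sample report text is made up for illustration.

```python
import re

# Hypothetical report text, used only to illustrate the behaviour.
report = "Replaced r919 and r920 after the failure"

# Two capturing groups: findall returns one tuple per match.
with_groups = re.findall(r'([CULJRQF]([\dV]{2,4}))', report, flags=re.I)

# No capturing groups: findall returns each whole match as a string.
without_groups = re.findall(r'[CULJRQF][\dV]{2,4}', report, flags=re.I)

# Non-capturing groups behave like no groups at all.
non_capturing = re.findall(r'(?:[CULJRQF](?:[\dV]{2,4}))', report, flags=re.I)

print(with_groups)      # [('r919', '919'), ('r920', '920')]
print(without_groups)   # ['r919', 'r920']
print(non_capturing)    # ['r919', 'r920']
```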

Related

Using regex in SPARQL to bind variables?

I'm working on an OWL knowledge graph for info about patients in the Covid pandemic. I've been using SPARQL to transform strings from spreadsheets into the appropriate objects and values of properties.
I have strings like "Infected by P231 and P456 and P39393". What I want is something that can bind variables to the patient IDs. I thought this shouldn't be too hard because the strings only follow a few patterns. E.g., strings will have one, two, or three patient IDs and no more, so I could write a query that matches each separate case.
I thought I could use regex to do this, but now that I look at regex more carefully, I think all it can do is tell me whether such patient IDs exist. Unlike functions such as SUBSTR, which actually return the part of the string that I want so I can bind it to a variable, regex just returns true or false depending on whether the string matches the pattern. Is that correct?
If that is correct are there other ways to do pattern matching in SPARQL where I can actually bind variables to a substring that matches part of the pattern? Or do I need to resort to a full programming language like Python to do this?
REPLACE is the function to apply a regular expression, with () match groups, and calculate a return string based on the match using $1 to get the group actually matched. It is based on fn:replace from "XPath and XQuery Functions and Operators" as are many of the SPARQL functions.
BIND (REPLACE("123", "(.)..", "$1") AS ?str)
will set ?str to "1".
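For comparison, the same replace-with-group idea can be sketched in Python, which the question mentions as a fallback. This is only an analogy: Python's re.sub writes the first group as \1 where REPLACE writes $1, and the patient ID pattern P\d+ is an assumption based on the example string in the question.

```python
import re

# Same example as the BIND above: keep only the first captured character.
first_char = re.sub(r'(.)..', r'\1', '123')

text = 'Infected by P231 and P456 and P39393'

# Match the whole string and replace it with the first captured ID.
first_id = re.sub(r'^Infected by (P\d+).*$', r'\1', text)

# In a general-purpose language, all IDs can also be collected at once.
all_ids = re.findall(r'P\d+', text)

print(first_char)  # 1
print(first_id)    # P231
print(all_ids)     # ['P231', 'P456', 'P39393']
```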

repeated, arbitrary capture groups

Given a string, eg.:
static_string.name__john.id__6.foo__bar.final_string
but with an arbitrary number of label__value. components, how can I repeat the capture groups, split them into label & value, and also capture the terminating final_string ?
For the above I'd want [name, john, id, 6, foo, bar, final_string]
Is something like this possible when I don't know the number of label__value. components in advance?
This is for golang / RE2 if that matters.
Update: I don't have the luxury of doing this in a few lines of code, and would need to do this in a single regex. The regex is defined in a config file to an application I don't control, so a code based loop with conditionals etc is unfortunately not possible.
This totally depends on what the thing you are putting this into expects.
This answer focuses on getting you the capture groups in a basic way, attempting to avoid any issues with RE2 and with the "thing" you are putting the regex into.
Note: You might find that the final_string doesn't get the capture group index you expect with this method, but again depends on what you are putting the regex into.
A regular expression that matches zero or one key/value pair is the following:
^[^.]+(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+))$
This supports the following examples:
static_string.final_string
static_string.name__john.final_string
To support one more key/value pair we repeat part of the regular expression:
Part repeated:
(?:\.([^.]+?)__([^.]+))?
So to support 2 key value pairs the regular expression is:
^[^.]+(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+))$
This now supports the following additional example:
static_string.name__john.foo__bar.final_string
So if I expand that out to support 12 key value pairs the regular expression is:
^[^.]+(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+))$
This supports the following additional examples:
static_string.name__john.id__6.foo__bar.final_string
static_string.name2_1b__john.id__6.foo__bar.final_string
static_string.name__john.id__6.foo__bar.name__john.id__6.foo__bar.name__john.id__6.foo__bar.name__john.id__6.foo__bar.final_string
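The repetition described above can be generated rather than typed by hand. The following is a minimal sketch in Python, whose re module accepts this same pattern syntax; it is not Go/RE2 code, and the helper names build_pattern and extract are hypothetical. Filtering out the unmatched groups happens in code here, which is only to show which group indices get filled.

```python
import re

# One optional ".label__value" component, as in the answer above.
PAIR = r'(?:\.([^.]+?)__([^.]+))?'

def build_pattern(max_pairs):
    """Repeat the optional pair max_pairs times, then capture the final part."""
    return r'^[^.]+' + PAIR * max_pairs + r'(?:\.([^.]+))$'

def extract(text, max_pairs=12):
    """Return the matched groups in order, skipping groups that did not match."""
    m = re.match(build_pattern(max_pairs), text)
    if m is None:
        return None
    return [g for g in m.groups() if g is not None]

print(extract('static_string.name__john.id__6.foo__bar.final_string'))
# ['name', 'john', 'id', '6', 'foo', 'bar', 'final_string']
```

With this construction the final string always lands in the last group (index 2 * max_pairs + 1), regardless of how many pairs were actually present, which is the indexing caveat mentioned in the note above.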

How to extract multiple values with a regular expression in Jmeter

I am running tests with jmeter and I need to extract with a Regular Expression:
insertar?sIws2kyXGJJA_01==
insertar?sIws2kyXGJJA_02==
in the following String:
[\"EMBPAGE1_00010001\",\"**insertar?sIws2kyXGJJA_01==**\",1,100,\"%\",300,\"px\",0,\"center\",\"\",\"[\"EMBPAGE1_00010002\",\"**insertar?sIws2kyXGJJA_02==**\",1,100,\"%\",300,\"px\",0,\"center\",\"\",\"
Use the little-known negative Match No. setting (for example, -1), which extracts all matches.
UPD: g2 appears in my example because I extract two groups from each match.
In each match, g1 is the "uuid" and g2 is the second part, and I need the second part here.
That is why I use the $2$ template and g2. If your matches have only one group, you will most likely use the $1$ template, which places all matches into g1.
If you have only one match group, you don't actually need the _gN suffix at all.
To better understand the variables produced by group extraction, add a "Debug PostProcessor" and inspect its output in the results tree view.
It is nice to know that control elements such as "ForEach" understand groups and can take a prefix like regexUUID_ and walk through the matches. In most cases, that is the next thing you do after extraction.
UPD2. A primitive version of the regexp for the question is (insertar\?sIws2kyXGJJA_\d*)==([^[]*)
with the template $1$$2$.
You will have the first parts in the g1 group and the second parts in g2.
With the answer given by DMC, you need to add the Regular Expression Extractor TWICE to retrieve both values, each with a different Match No. (1, 2). That is correct too, but here is a better approach to achieve the same result.
Another Approach:
1. Capture Both Values:
You can use the Template field to capture both values at the same time, and later refer to them by index.
Here, both values are captured with two groups and stored through the templates $1$ (the first group's match) and $2$ (the second group's match). By default, templates store the data in the order in which the groups appear in the regular expression. (FYI, you can change the order by reordering the templates, for example $2$ and then $1$.)
2. Retrieve Values:
Now, refer these values in your script by using the following syntax:
${insert_values_gn} (n is the group number)
e.g.:
${insert_values_g1} - refers to the value captured by the first group
${insert_values_g2} - refers to the value captured by the second group
To keep it simple, you can think of "insert_values" as a list of the strings captured by the groups, with n (1, 2, 3, etc.) as the index used to retrieve each value.
Note: with templates, a single Regular Expression Extractor can capture any number of values through multiple groups, and you can refer to each of them by index.
I'm sure there is a more efficient way but this worked:
\*\*(.*?)\*\*.*\"\*\*(.*?)\*\*
You can also use only \*\*(.*?)\*\*
It will match both of them anyway, so make sure you set the right 'Match No.' in JMeter if you only want one of the values:
The Match No. should be 1 for the first match and 2 for the second, I believe.
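Both patterns from this answer can be tried outside JMeter. The sketch below uses Python's re module, which accepts the same syntax for these two expressions; the sample string is an abbreviated copy of the one in the question, and treating the ** markers as literal characters follows the answer's own patterns.

```python
import re

# Abbreviated copy of the response text from the question (raw string,
# so the backslashes before the quotes are kept as literal characters).
sample = (r'[\"EMBPAGE1_00010001\",\"**insertar?sIws2kyXGJJA_01==**\",1,100,'
          r'\"[\"EMBPAGE1_00010002\",\"**insertar?sIws2kyXGJJA_02==**\",1,100,\"')

# Single-group pattern: every match is one value (Match No. 1, 2, ...).
each_match = re.findall(r'\*\*(.*?)\*\*', sample)

# Two-group pattern: one match whose groups 1 and 2 hold the two values.
m = re.search(r'\*\*(.*?)\*\*.*\"\*\*(.*?)\*\*', sample)
both_groups = (m.group(1), m.group(2))

print(each_match)   # ['insertar?sIws2kyXGJJA_01==', 'insertar?sIws2kyXGJJA_02==']
print(both_groups)  # ('insertar?sIws2kyXGJJA_01==', 'insertar?sIws2kyXGJJA_02==')
```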

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairs from a particular tag:
<aaa>key0:val0 key1:val1 key2:val2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s] which will also match correctly in groups on this:
key0:val0 key1:val1 key2:val2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with - (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>) - there are two levels of matching. The overall match covers everything inside the tags, while the capturing group matches a single "key:value" occurrence. The regex engine cannot give you each individual occurrence of a repeated group because it does not work that way. So it just gives you the last one.
If you think in Python, on the match object obtained after applying your regex, you will have access to matcher.group(0) (the whole match) and matcher.group(1) (the last repetition of the capturing group), because the lookarounds do not capture and there is only one capturing ( ) group in the regex.
But what you want is the n occurrences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
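The two-step approach can be sketched in Python as follows: first isolate the text inside the tags with the lookaround pattern from the question, then run the simpler pattern on it. The sample document is the one from the question; the last lines only illustrate that a repeated capturing group keeps its final repetition.

```python
import re

doc = '<aaa>key0:val0 key1:val1 key2:val2</aaa>'

# Step 1: grab everything between the tags as a single match.
inner = re.search(r'(?<=<aaa>).*(?=</aaa>)', doc)

# Step 2: find every key:value occurrence inside that text.
pairs = re.findall(r'\w+:\w+', inner.group(0)) if inner else []
print(pairs)  # ['key0:val0', 'key1:val1', 'key2:val2']

# A repeated capturing group only remembers its last repetition.
repeated = re.search(r'(?<=<aaa>)(\w+:\w+\s*)+(?=</aaa>)', doc)
print(repeated.group(1))  # key2:val2
```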
I uploaded this one at parsemarket, and I'm not sure it's what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.
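As a rough illustration of the parser-plus-loop suggestion, here is a sketch that uses Python's standard library xml.etree.ElementTree instead of BeautifulSoup, so that it runs without extra packages. The surrounding <root> element and the <bbb> sibling are assumptions added to make the sample a complete document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the <aaa> element comes from the question.
document = ('<root><aaa>key0:val0 key1:val1 key2:val2</aaa>'
            '<bbb>other:data</bbb></root>')

root = ET.fromstring(document)

pairs = []
for element in root.iter('aaa'):             # visit every <aaa> element
    for item in (element.text or '').split():  # split on whitespace
        key, value = item.split(':', 1)        # split at the first colon
        pairs.append((key, value))

print(pairs)  # [('key0', 'val0'), ('key1', 'val1'), ('key2', 'val2')]
```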

Regex capturing named groups in a language that doesn't support them using a meta regex?

I am using Haskell and I can't seem to find a regex package that supports named groups, so I have to implement them somehow myself.
Basically, a user of my API would use a regex with named groups to get the captured groups back in a map,
so
/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj on /foo/hhhh/bar/jjj
would give
[("name","foo"),("surname","bar")]
I am doing a trivial implementation of a specification with relatively small strings, so for now performance is not a main issue.
To solve this, I thought I'd write a meta regex that will apply on the user's regex
/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj
to extract the names of groups and replace them with nothing to get
0 -> name
1 -> surname
and the regex becomes
/([a-z]*)/hhhh/([a-z]*)/jjj
then apply it to the string and use the indices to pair the group names with the matched text.
Two questions:
Does it seem like a good idea?
What is the meta regex that I need to capture and replace the named group syntax?
For those unfamiliar with named groups: http://www.regular-expressions.info/named.html
Note: all I need from named groups is for the user to be able to give names to matches, so a subset of named groups that gives me only this is OK.
The more generally you want to apply your solution, the more complex your problem becomes. For instance, in your approach, you want to remove the named groups and use the indices to match. This seems like a good start, but you have to consider a few things:
If you replace the (?<name>blah) with (blah) then you also have to replace the /name with /1 or /2 or whatever.
What happens if the user includes non-named groups as well? For example: ([a-z]{3})/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj on /foo/hhhh/bar/jjj. In this case, your numbering will not work because group 1 is the user-defined non-named group.
See this post for some inspiration, as it seems others have successfully tried the same (albeit in Java):
Regex Named Groups in Java
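The meta-regex idea, including the caveat about unnamed groups, can be sketched as follows. The sketch is in Python purely to illustrate the technique (Python's re already supports named groups natively); the helper names strip_named_groups and match_named are hypothetical, and it assumes the user's pattern contains no escaped parentheses and no "(" inside character classes.

```python
import re

# Matches a named group opener "(?P<name>" or a plain capturing "(".
# Assumption: no escaped parentheses and no "(" inside character classes.
GROUP_OPEN = re.compile(r'\(\?P<([A-Za-z_]\w*)>|\((?!\?)')

def strip_named_groups(pattern):
    """Return (plain_pattern, {group_index: name}) with 1-based indices."""
    names = {}
    index = 0

    def replace(match):
        nonlocal index
        index += 1                        # every capturing group gets a number
        if match.group(1) is not None:    # only named groups are recorded
            names[index] = match.group(1)
        return '('

    return GROUP_OPEN.sub(replace, pattern), names

def match_named(pattern, text):
    """Apply the stripped pattern and pair each name with its captured text."""
    plain, names = strip_named_groups(pattern)
    m = re.search(plain, text)
    if m is None:
        return None
    return [(name, m.group(i)) for i, name in sorted(names.items())]

print(match_named('/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj',
                  '/foo/hhhh/bar/jjj'))
# [('name', 'foo'), ('surname', 'bar')]
```

Counting every capturing group, named or not, keeps the indices correct when the user mixes in unnamed groups; backreferences by name are not handled in this sketch.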
Perhaps you should use parser combinators. This looks sufficiently complicated that it would be cleaner and more maintainable to step out and use Parsec or Attoparsec, instead of trying to push regexes further towards parsing.