Using regex in SPARQL to bind variables? - regex

I'm working on an OWL knowledge graph for info about patients in the Covid pandemic. I've been using SPARQL to transform strings from spreadsheets into the appropriate objects and values of properties.
I have strings like Infected by P231 and P456 and P39393 What I want is something that can bind variables to the patient ids. I thought this shouldn't be too hard because the strings only follow a few patterns. E.g, strings will have one, two, or three Patient IDs and no more so I could write a query that matches each separate case.
I thought I could use regex to do this but now that I look at regex more carefully I think all it can do is tell me that such Patient IDs exist but unlike functions such as SUBSTR that will actually return part of the string that I want so I can bind it to a variable, regex just returns true or false that some string matches the pattern or it doesn't. Is that correct?
If that is correct are there other ways to do pattern matching in SPARQL where I can actually bind variables to a substring that matches part of the pattern? Or do I need to resort to a full programming language like Python to do this?

REPLACE is the function to apply a regular expression, with () match groups, and calculate a return string based on the match using $1 to get the group actually matched. It is based on fn:replace from "XPath and XQuery Functions and Operators" as are many of the SPARQL functions.
BIND (REPLACE("123", "(.)..", "$1") AS ?str)
will set ?str to "1".

Related

Findall with regular expression in a pandas dataframe returns an incomplete list [duplicate]

so I am having trouble with Pandas for a series findall(). currently I am trying to look at a report and retrieving all the electric components. Currently the report is either a line or a paragraph and mention components in a standardize way. I am using this code
failedCoFromReason =rlist['report'].str.findall(r'([CULJRQF]([\dV]{2,4}))',flags=re.IGNORECASE)
It returns the components but it also returns a repeat value of the number like this [('r919', '919'), ('r920', '920')]
I would like it just to return [('r919'), ('r920')] but I am struggling with getting it to work. Pretty new to pandas and regex and confused how to search. I have tried greedy and non greedy searches but it didn't work.
See the Series.str.findall reference:
Equivalent to applying re.findall() to all the elements in the Series/Index.
The re.findall references says that "if one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group."
So, all you need to do is actually remove all capturing parentheses in this case, as all you need is to get the whole match:
rlist['report'].str.findall(r'[CULJRQF][\dV]{2,4}', flags=re.I)
In other cases, when you need to preserve the group (to quantify it, or to use alternatives), you need to change the capturing groups to non-capturing ones:
rlist['report'].str.findall(r'(?:[CULJRQF](?:[\dV]{2,4}))', flags=re.I)
Though, in this case, it is quite redundant.

repeated, arbitrary capture groups

Given a string, eg.:
static_string.name__john.id__6.foo__bar.final_string
but with an arbitrary number of label__value. components, how can I repeat the capture groups, split them into label & value, and also capture the terminating final_string ?
For the above I'd want [name, john, id, 6, foo, bar, final_string]
Is something like this possible when I don't know the number of label__value. components in advance?
This is for golang / RE2 if that matters.
Update: I don't have the luxury of doing this in a few lines of code, and would need to do this in a single regex. The regex is defined in a config file to an application I don't control, so a code based loop with conditionals etc is unfortunately not possible.
This totally depends on what the thing you are putting this into expects.
This is answer focused on getting you the capture groups in a basic way attempting to avoid any issues with the "thing" you are putting the regex into and RE2.
Note: You might find that the final_string doesn't get the capture group index you expect with this method, but again depends on what you are putting the regex into.
A regular expression that would match "one" and "no" key/value pairs the following is:
^[^.]+(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+))$
static_string.final_string
static_string.name__john.final_string
To support one more key/value pair we repeat part of the regular expression:
Part repeated:
(?:\.([^.]+?)__([^.]+))?
So to support 2 key value pairs the regular expression is:
^[^.]+(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+))$
This now supports the following additional example:
static_string.name__john.foo__bar.final_string
So if I expand that out to support 12 key value pairs the regular expression is:
^[^.]+(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+))$
This supports the following additional examples:
static_string.name__john.id__6.foo__bar.final_string
static_string.name2_1b__john.id__6.foo__bar.final_string
static_string.name__john.id__6.foo__bar.name__john.id__6.foo__bar.name__john.id__6.foo__bar.name__john.id__6.foo__bar.final_string

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairsfrom a particular tag:
<aaa>key0:val0 key1:val1 key2:va2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s] which will also match correctly in groups on this:
key0:val0 key1:val1 key2:va2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with - (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>) - there are two matching groups. The bigger one matches everything inside the tags, while the other matches a single "key:value" occurrence. The regex engine cannot give each individual occurence because it does not work that way. So it just gives you the last one.
If you think in python, on the matcher object obtained after applying you regex, you will have access to matcher.group(1) and matcher.group(2), because you have two matching ( ) groups in the regex.
But what you want is the n occurences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
I uploaded this one at parsemarket, and I'm not sure its what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.

Regex Pattern Matching Concatenation

Is it possible to concatenate the results of Regex Pattern Matching using only Regex syntax?
The specific instance is a program is allowing regex syntax to pull info from a file, but I would like it to pull from several portions and concatenate the results.
For instance:
Input string: 1234567890
Desired result string: 2389
Regex Pattern match: (?<=1).+(?=4)%%(?<=7).+(?=0)
Where %% represents some form of concatenation syntax. Using starting and ending with syntax is important since I know the field names but not the values of the field.
Does a keyword that functions like %% exist? Is there a more clever way to do this? Must the code be changed to allow multiple regex inputs, automatically concatenating?
Again, the pieces to be concatenated may be far apart with unknown characters in between. All that is known is the information surrounding the substrings.
2011-08-08 edit: The program is written in C#, but changing the code is a major undertaking compared to finding a regex-based solution.
Without knowing exactly what you want to match and what language you're using, it's impossible to give you an exact answer. However, the usual way to approach something like this is to use grouping.
In C#:
string pattern = #"(?<=1)(.+)(?=4).+(?<=7)(.+)(?=0)";
Match m = Regex.Match(input, pattern);
string result = m.Groups[0] + m.Groups[1];
The same approach can be applied to many other languages as well.
Edit
If you are not able to change the code, then there's no way to accomplish what you want. The reason is that in C#, the regex string itself doesn't have any power over the output. To change the result, you'd have to either change the called method of the Regex class or do some additional work afterwards. As it is, the method called most likely just returns either a Match object or a list of matching objects, neither of which will do what you want, regardless of the input regex string.

Regex capturing named groups in a language that doesn't support them using a meta regex?

I am using Haskell and I don't seem to find a REGEX package that supports Named Groups so I have to implement it somehow myself.
basically a user of my api would use some regex with named groups to get back captured groups in a map
so
/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj on /foo/hhhh/bar/jjj
would give
[("name","foo"),("surname","bar")]
I am doing a specification trivial implementation with relatively small strings so for now performance is not a main issue.
To solve this, I thought I'd write a meta regex that will apply on the user's regex
/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj
to extract the names of groups and replace them with nothing to get
0 -> name
1 -> surname
and the regex becomes
/([a-z]*)/hhhh/([a-z]*)/jjj
then apply it to the string and use the index to group names with matched.
Two questions:
does it seem like a good idea?
what is the meta regex that I need to capture and replace the named groups syntax
for those unfamiliar with named groups http://www.regular-expressions.info/named.html
note: all what I need from named groups is that the user give names to matches, so a subset of named groups that only gives me this is ok.
The more generally you want to apply your solution, the more complex your problem becomes. For instance, in your approach, you want to remove the named groups and use the indexes (indices?) to match. This seems like a good start, but you have consider a few things:
If you replace the (?<name>blah) with (blah) then you also have to replace the /name with /1 or /2 or whatever.
What happens if the user includes non named groups as well? for eg: ([a-z]{3})/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj on /foo/hhhh/bar/jjj. In this case, your numbering will not work b/c group 1 is the user defined non named group.
See this post for some insipration, as it seems other have successfully tried the same (albeit in Java)
Regex Named Groups in Java
Perhaps you should use parser combinators. This looks sufficiently complicated that it would be cleaner and more maintainable to step out and use Parsec or Attoparsec, instead of trying to push regexes further towards parsing.