Regex for finding all namespaces in data - regex

I need a regular expression (dubbed SOME_EXPRESSION below) that allows finding all namespaces for resources used as subject in a SPARQL 1.1 endpoint. The query should look like the following. How can I do this?
SELECT DISTINCT ?ns
WHERE
{
?s ?p ?o.
BIND(REPLACE(str(?s), SOME_EXPRESSION, "") AS ?ns)
Filter(isURI(?s))
}

Since the harder part of this is processing the IRI strings, I'll show how you can do this for properties (which must be IRIs, so we don't need to check with isIRI). Adapting this to work with the IRIs of subjects won't be hard. However, there is one thing that needs some consideration: URIs for linked data typically (there's no hard requirement, but conventions do emerge) use prefixes that end in / or in #. Whether one is better than the other is the subject of plenty of debate and discussion (e.g., see section 4 of Cool URIs, or HashVsSlash). In general, you're going to want to replace everything from the final slash or hash onward with just that final slash or hash. Since you can use groups in SPARQL's regex and replace, you can handle both cases with one replace:
select distinct ?ns where {
[] ?p [] .
bind( replace( str(?p), "(#|/)[^#/]*$", "$1" ) as ?ns )
}
This matches the regular expression (#|/)[^#/]*$ against the string form of the IRI, remembering # or / in the variable $1, and then grabs the rest of the characters (which must not contain # or /) up until the end of the string, and replaces the whole thing with $1, which is either # or /. For some data that I pulled from Linked Open British National Bibliography data, I get results like these:
$ sparql --query query.rq --data sample.nt
-----------------------------------------------------
| ns |
=====================================================
| "http://www.w3.org/2000/01/rdf-schema#" |
| "http://www.w3.org/1999/02/22-rdf-syntax-ns#" |
| "http://www.w3.org/2004/02/skos/core#" |
| "http://purl.org/ontology/bibo/" |
| "http://purl.org/dc/terms/" |
| "http://iflastandards.info/ns/isbd/elements/" |
| "http://www.bl.uk/schemas/bibliographic/blterms#" |
| "http://www.w3.org/2002/07/owl#" |
| "http://purl.org/NET/c4dm/event.owl#" |
-----------------------------------------------------
This seems like a reasonable set of namespace prefixes. In fact, when I look at the header of the RDF document, original namespaces included:
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:skos="http://www.w3.org/2004/02/skos/core#"
xmlns:bibo="http://purl.org/ontology/bibo/"
xmlns:dct="http://purl.org/dc/terms/"
xmlns:isbd="http://iflastandards.info/ns/isbd/elements/"
xmlns:blt="http://www.bl.uk/schemas/bibliographic/blterms#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:event="http://purl.org/NET/c4dm/event.owl#"
As applied to your code, we end up with the following query. It's almost exactly what you wanted, since there's just one regular expression that handles both cases (so just one thing to fill in for SOME_EXPRESSION). However, instead of replacing with "", you do have to replace with "$1". I hope that's not a terrible inconvenience, though.
SELECT DISTINCT ?ns
WHERE
{
?s ?p ?o.
BIND(REPLACE(str(?s), "(#|/)[^#/]*$", "$1") AS ?ns)
Filter(isURI(?s))
}
It's important to note, of course, that this is only a heuristic. A given IRI can be abbreviated using lots of different prefixes. This technique should give some relatively good results, though, because there are conventions that people tend to follow pretty well.
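To try the pattern outside a SPARQL endpoint, here is a minimal Python sketch of the same replacement. The function name `namespace_of` is made up for illustration, and Python's `re` module is not the regex dialect SPARQL uses, although this particular pattern should behave the same way in both.

```python
import re

# Same pattern as in the SPARQL query: the last '#' or '/' plus everything after it.
LOCAL_NAME = re.compile(r"(#|/)[^#/]*$")


def namespace_of(iri: str) -> str:
    """Heuristically strip the local name, keeping the final '#' or '/'."""
    return LOCAL_NAME.sub(r"\1", iri)


if __name__ == "__main__":
    for iri in (
        "http://www.w3.org/2000/01/rdf-schema#label",
        "http://purl.org/dc/terms/title",
    ):
        print(namespace_of(iri))
```

An IRI that contains neither # nor / is returned unchanged, which is one more reason to treat the result as a heuristic.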

Related

How to replace a period within string and not in Numeric using singlestore (MemSQL) DB REGEXP_REPLACE function

I have a scenario where I want to replace a period when it is surrounded by letters but not when it is surrounded by digits. I figured out a regular expression pattern that identifies only the periods in key names, but the pattern is not working in SQL:
SELECT REGEXP_REPLACE("Amount.fee:0.75,Amount.tot:645.55","(?<!\d)(\.)(?!\d)","_","ig");
Expected output: Amount_fee:0.75,Amount_tot:645.55
Note: I am trying this because in MemSQL I couldn't access a JSON key when it has a period in it.
I also verified the pattern "(?<!\d)(\.)(?!\d)" using https://coding.tools/regex-replace and it works fine there, but it does not work in SQL. I am using MemSQL 7.1.9, where POSIX enhanced regular expressions are supposed to work. Any help is much appreciated.
Since it looks like you are trying to work around accessing a JSON key with a period, I will show you how to do that.
This can be done either by surrounding the JSON key name with backticks while using the shorthand JSON extract syntax:
select col::%`Amount.fee` from (select '{"Amount.fee":0.75,"Amount.tot":645.55}' col);
+--------------------+
| col::%`Amount.fee` |
+--------------------+
| 0.75 |
+--------------------+
or by using the json_extract_ builtins directly:
select json_extract_double('{"Amount.fee":0.75,"Amount.tot":645.55}', 'Amount.fee');
+------------------------------------------------------------------------------+
| json_extract_double('{"Amount.fee":0.75,"Amount.tot":645.55}', 'Amount.fee') |
+------------------------------------------------------------------------------+
| 0.75 |
+------------------------------------------------------------------------------+
Assuming you only want to target dots that are in between two non digit characters, where the dot is not the first or last character in the string, you may match on ([^\d])\.([^\d]) and replace with \1_\2:
SELECT REGEXP_REPLACE("Amount.fee:0.75,Amount.tot:645.55", "([^\d])\.([^\d])", "\1_\2", "ig");
Here is a regex demo showing that the replacement is working. Note that you might have to use $1_$2 instead of \1_\2 as the replacement, depending on the regex flavor of your SQL tool.
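As a quick way to compare the two patterns discussed in this thread, here is a small Python sketch. It only shows what the patterns do in Python's `re` module; it says nothing about which constructs MemSQL's regex engine accepts, and the function names are invented for this example.

```python
import re


def dots_to_underscores_lookaround(text: str) -> str:
    """Replace a dot that has no digit directly before or after it."""
    return re.sub(r"(?<!\d)\.(?!\d)", "_", text)


def dots_to_underscores_groups(text: str) -> str:
    """Same idea without lookarounds: capture the neighbours and put them back."""
    return re.sub(r"([^\d])\.([^\d])", r"\g<1>_\g<2>", text)


if __name__ == "__main__":
    sample = "Amount.fee:0.75,Amount.tot:645.55"
    print(dots_to_underscores_lookaround(sample))
    print(dots_to_underscores_groups(sample))
```

The two versions differ at the edges of the string: the capture-group form needs a real character on both sides of the dot, while the lookaround form also matches a dot at the very start or end.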

Dialogflow: Regexp entity not matched

I am going crazy with this problem, I am sure I am missing something...
I would like to match words that start with 2 characters or digits, followed by 1 or more character/digit/slash.
Some examples:
AM9
B9C
AS/1
etc...
So I have created an entity, let's say EntityOne, as follows, according to some RegExp tests (I have also tested the same regexp surrounded by "()"; all tests were run on https://regex-golang.appspot.com/assets/html/index.html, which seems to use RE2):
and a test Intent with params defined as follows:
REQUIRED | PARAM NAME | ENTITY | VALUE | IS LIST | PROMPTS
yes | name | #EntityOne | $value | no | test:
And inside this intent I try with words similar to the examples above that should be matched.
But I see the prompt "test:" over and over, the entity is never matched.
Any hints please? Tell me if you want me to share additional info, but I think that there is nothing much to share. Thanks in advance
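For reference, the requirement as stated in words can be checked offline with a short Python sketch. The pattern below is only one possible reading of "2 characters or digits, followed by 1 or more character/digit/slash"; it is an assumption, not the expression actually entered in the entity, and it does not explain the Dialogflow behaviour.

```python
import re

# Assumed reading of the requirement: two letters/digits, then one or more
# letters, digits, or slashes. This is NOT the entity's actual pattern.
CODE = re.compile(r"[A-Za-z0-9]{2}[A-Za-z0-9/]+")


def is_code(word: str) -> bool:
    """True when the whole word fits the assumed pattern."""
    return CODE.fullmatch(word) is not None


if __name__ == "__main__":
    for word in ("AM9", "B9C", "AS/1", "A9", "/AB"):
        print(word, is_code(word))
```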

Spark 2.2/Jupyter Notebook SQL regexp_extract function not matching regex pattern

I'm using the regexp_extract Spark 2.2 SQL function in a Jupyter (Scala) notebook to match a string of 11 or more repeating characters.
Here's the regex:
^(.)\1{10,}$
Now, let's look at that pattern with the regexp_extract function. Here's how I've used it in my notebook:
spark.sql("SELECT REGEXP_EXTRACT('hhhhhhhhhhhhh', '^(.)\\1{10,}$', 1) as ExtractedChar").show()
+-------------+
|ExtractedChar|
+-------------+
| |
+-------------+
Odd, no output. Let's make sure my regex pattern is actually correct. Yep, looks right.
You may be wondering why the regex pattern contains two "\" characters. The backslash is an escape character, so two are necessary. Here's some verification:
val string = "SELECT REGEXP_EXTRACT('hhhhhhhhhhhhhhhhhhhhh', '^(.)\\1{10,}$', 1) as ExtractedChar"
println(string)
SELECT REGEXP_EXTRACT('hhhhhhhhhhhhhhhhhhhhh', '^(.)\1{10,}$', 1) as ExtractedChar
Alright, let's make sure the regexp_extract function is working correctly:
spark.sqlContext.sql("SELECT REGEXP_EXTRACT('TESTING', '^.', 0) as test").show()
+----+
|test|
+----+
| T|
+----+
Okay, maybe the issue is the Jupyter notebook? After checking and using the Scala REPL, I'm still having the same issue.
Any ideas why I'm unable to get this regex to successfully match?
Edit: Spark SQL is a requirement for this. I could create my own UDF using Scala; however, UDFs are black boxed by Spark meaning they will not be fully optimized.
I found the solution. The SQL string needs to include 4 "\" characters, like so:
'^(.)\\\\1{10,}$'
As explained here, four \ characters are needed for two reasons:
\ is a special character in SQL and needs to be escaped, so the query needs two of them.
The input is coming from a string where \ also needs to be escaped. Just having "\\" would give a single \. To get two you need "\\\\".
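The two layers of unescaping can be modelled in a few lines of Python. This is a toy sketch: `sql_unescape` is a hypothetical stand-in that simply turns a backslash plus any character into that character, which is an assumption about how the SQL string literal is parsed rather than Spark's real implementation. Python string literals treat `\\` the same way Scala's do, so the literals below carry the same number of backslashes as in the notebook.

```python
import re


def sql_unescape(literal: str) -> str:
    """Toy model of SQL string-literal parsing: backslash + char becomes char."""
    return re.sub(r"\\(.)", lambda m: m.group(1), literal)


def extract_repeated_char(sql_literal: str, text: str) -> str:
    """Mimic regexp_extract(text, literal, 1): group 1, or '' when nothing matches."""
    match = re.search(sql_unescape(sql_literal), text)
    return match.group(1) if match else ""


# Two backslashes in source leave one in the string; four leave two.
two_in_source = "^(.)\\1{10,}$"
four_in_source = "^(.)\\\\1{10,}$"

if __name__ == "__main__":
    text = "hhhhhhhhhhhhh"
    print(repr(extract_repeated_char(two_in_source, text)))   # backreference lost
    print(repr(extract_repeated_char(four_in_source, text)))  # backreference survives
```

With only two backslashes in the source, the regex engine in this model ends up seeing `^(.)1{10,}$`, which looks for a literal "1" instead of a backreference.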

Splitting a comma separated string with regex in sparql

I have a question about regex() in SPARQL.
I would like to replace a variable, which sometimes contains a phrase with a comma, with another variable that contains just what is before the comma.
For example, if the variable contains "I like it, ok", I want to get a new variable that contains "I like it". I don't know which regular expression to use.
This is a use case for strbefore, you don't need regex at all. As a general tip, I suggest reading (or skimming) through the table of contents for Section 17 of the SPARQL 1.1 Query Language Recommendation. It lists all the SPARQL functions, and while you don't need to memorize them all, you'll at least have an idea of what's out there. (This is good advice for all programmers and languages: skim the table of contents and the index.) This query1 shows how to use strbefore:
select ?x ?prefix where {
values ?x { "we invited the strippers, jfk and stalin" }
bind( strbefore( ?x, "," ) as ?prefix )
}
---------------------------------------------------------------------------
| x | prefix |
===========================================================================
| "we invited the strippers, jfk and stalin" | "we invited the strippers" |
---------------------------------------------------------------------------
1. See Strippers, JFK, and Stalin Illustrate Why You Should Use the Serial Comma
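For readers who want to see the semantics spelled out, here is a rough Python equivalent of strbefore. The helper name `str_before` is made up; the behaviour for a missing or empty separator follows my reading of the SPARQL 1.1 definition, where both cases yield the empty string.

```python
def str_before(text: str, sep: str) -> str:
    """Return the part of text before the first occurrence of sep.

    Returns '' when sep is empty or does not occur in text.
    """
    if sep == "":
        return ""
    head, found, _ = text.partition(sep)
    return head if found else ""


if __name__ == "__main__":
    print(str_before("I like it, ok", ","))
```

Note the difference from a plain `text.split(",")[0]`, which would return the whole string when there is no comma.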

How can I use a regular expression to match something in the form 'stuff=foo' 'stuff' = 'stuff' 'more stuff'

I need a regexp to match something like this,
'text' | 'text' | ... | 'text'(~text) = 'text' | 'text' | ... | 'text'
I just want to divide it up into two sections, the part on the left of the equals sign and the part on the right. Any of the 'text' entries can have "=" between the ' characters though. I was thinking of trying to match an even number of 's followed by a =, but I'm not sure how to match an even number of something. Also note I don't know how many entries there could be on either side. A couple of examples:
'51NL9637X33' | 'ISL6262ACRZ-T' | 'QFN'(~51NL9637X33) = '51NL9637X33' | 'ISL6262ACRZ-T' | 'INTERSIL' | 'QFN7SQ-HT1_P49' | '()'
Should extract,
'51NL9637X33' | 'ISL6262ACRZ-T' | 'QFN'(~51NL9637X33)
and,
'51NL9637X33' | 'ISL6262ACRZ-T' | 'INTERSIL' | 'QFN7SQ-HT1_P49' | '()'
'227637' | 'SMTU2032_1' | 'SKT W/BAT'(~227637) = '227637' | 'SMTU2032_1' | 'RENATA' | 'SKT28_5X16_1-HT5_4_P2' | '()' :SPECIAL_A ='BAT_CR2032', PART_NUM_A='202649'
Should extract,
'227637' | 'SMTU2032_1' | 'SKT W/BAT'(~227637)
and,
'227637' | 'SMTU2032_1' | 'RENATA' | 'SKT28_5X16_1-HT5_4_P2' | '()' :SPECIAL_A ='BAT_CR2032', PART_NUM_A='202649'
Also note the little tilde bit at the end of the first section is optional, so I can't just look for that.
Actually I wouldn't use a regex for that at all. Assuming your language has a split operation, I'd first split on the | character to get a list of:
'51NL9637X33'
'ISL6262ACRZ-T'
'QFN'(~51NL9637X33) = '51NL9637X33'
'ISL6262ACRZ-T'
'INTERSIL'
'QFN7SQ-HT1_P49'
'()'
Then I'd split each of them on the = character to get the key and (optional) value:
'51NL9637X33' <null>
'ISL6262ACRZ-T' <null>
'QFN'(~51NL9637X33) '51NL9637X33'
'ISL6262ACRZ-T' <null>
'INTERSIL' <null>
'QFN7SQ-HT1_P49' <null>
'()' <null>
You haven't specified why you think a regex is the right tool for the job but most modern languages also have a split capability and regexes aren't necessarily the answer to every requirement.
I agree with paxdiablo in that regular expressions might not be the most suitable tool for this task, depending on the language you are working with.
The question "How do I match an even number of characters?" is interesting nonetheless, and here is how I'd do it in your case:
(?:'[^']*'|[^=])*(?==)
This expression matches the left part of your entry by looking for a ' at its current position. If it finds one, it runs forward to the next ', thereby consuming quotes only in pairs. If it does not find a ', it matches any single character that is not an equals sign. The lookahead at the end then ensures that an equals sign follows the matched string. It works because the regex engine evaluates alternation from left to right.
You could get the left and right parts in two capturing groups by using
((?:'[^']*'|[^=])*)=(.*)
I recommend http://gskinner.com/RegExr/ for tinkering with regular expressions. =)
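Here is a short Python sketch that applies the two-group expression to the first example from the question. The function name `split_entry` and the whitespace stripping are additions for illustration; the regex itself is the one given above.

```python
import re

# Group 1: quoted strings or non-'=' characters; group 2: everything after the
# first '=' that is not inside single quotes.
ENTRY = re.compile(r"((?:'[^']*'|[^=])*)=(.*)")


def split_entry(line: str) -> tuple[str, str]:
    """Split a line at the first '=' outside single quotes."""
    match = ENTRY.match(line)
    if match is None:
        raise ValueError("no unquoted '=' found")
    return match.group(1).strip(), match.group(2).strip()


if __name__ == "__main__":
    line = ("'51NL9637X33' | 'ISL6262ACRZ-T' | 'QFN'(~51NL9637X33) = "
            "'51NL9637X33' | 'ISL6262ACRZ-T' | 'INTERSIL' | 'QFN7SQ-HT1_P49' | '()'")
    left, right = split_entry(line)
    print(left)
    print(right)
```

An equals sign inside quotes on the left, as in `'a=b' = 'c'`, stays in the left part because the quoted alternative is tried first.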
As paxdiablo said, you almost certainly don't want to use a regex here. The split suggestion isn't bad; I myself would probably use a parser here—there's a lot of structure to exploit. The idea here is that you formally specify the syntax of what you have—sort of like what you gave us, only rigorous. So, for instance: a field is a sequence of non-single-quote characters surrounded by single quotes; a fields is any number of fields separated by white space, a |, and more white space; a tilde is non-right-parenthesis characters surrounded by (~ and ); and an expr is a fields, optional whitespace, an optional tilde, a =, optional whitespace, and another fields. How you express this depends on the language you are using. In Haskell, for instance, using the Parsec library, you write each of those parsers as follows:
import Text.ParserCombinators.Parsec

field :: Parser String
field = between (char '\'') (char '\'') $ many (noneOf "'\n")

tilde :: Parser String
tilde = between (string "(~") (char ')') $ many (noneOf ")\n")

fields :: Parser [String]
fields = field `sepBy` (try $ spaces >> char '|' >> spaces)

expr :: Parser ([String],Maybe String,[String])
expr = do left <- fields
          spaces
          opt <- optionMaybe tilde
          spaces >> char '=' >> spaces
          right <- fields
          (char '\n' >> return ()) <|> eof
          return (left, opt, right)
Understanding precisely how this code works isn't really important; the basic idea is to break down what you're parsing, express it in formal rules, and build it back up out of the smaller components. And for something like this, it'll be much cleaner than a regex.
If you really want a regex, here you go (barely tested):
^\s*('[^']*'((\s*\|\s*)'[^'\n]*')*)?(\(~[^)\n]*\))?\s*=\s*('[^']*'((\s*\|\s*)'[^'\n]*')*)?\s*$
See why I recommend a parser? When I first wrote this, I got at least two things wrong which I picked up (one per test), and there's probably something else. And I didn't insert capturing groups where you wanted them because I wasn't sure where they'd go. Now yes, I could have made this more readable by inserting comments, etc. And after all, regexen have their uses! However, the point is: this is not one of them. Stick with something better.