Extract triples containing particular substring using SPARQL

I want to extract triples whose subject contains a given word, say "alice". The query I used was:
SELECT ?s ?p ?o WHERE { ?s ?p ?o .FILTER regex(?s, \"alice\") .}
This doesn't return any results, in spite of there being a triple that satisfies this constraint.
On the other hand, when I use the same query to extract triples whose object contains the word "brillant", it returns only one of the two possible matches.
The query used is:
SELECT ?s ?p ?o WHERE { ?s ?p ?o .FILTER regex(?o, \"brillant\") .}
Please let me know where I am going wrong and what the reason for this behaviour is.

I'll assume that the escapes around the quotation marks are just a remnant from copying and pasting. The first argument to regex must be a literal, but literals cannot be the subjects of triples in RDF, so it's not true that you have data that should match this pattern. What you might have, though, is subjects whose URI contains the string "alice", and you can get the string representation of the URI using the str function. E.g.,
SELECT ?s ?p ?o WHERE { ?s ?p ?o .FILTER regex(str(?s), "alice") .}
To illustrate, let's use the two values <http://example.org> and "string containing example" and filter as you did in your original query:
select ?x where {
values ?x { <http://example.org> "string containing example" }
filter( regex(?x, "exam" ))
}
-------------------------------
| x                           |
===============================
| "string containing example" |
-------------------------------
We only got "string containing example" because the other value wasn't a string, and so wasn't a suitable argument to regex. However, if we add the call to str, then it's the string representation of the URI that regex will consider:
select ?x where {
values ?x { <http://example.org> "string containing example" }
filter( regex(str(?x), "exam" ))
}
-------------------------------
| x                           |
===============================
| <http://example.org>        |
| "string containing example" |
-------------------------------
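The same behaviour can be mimicked outside a SPARQL engine. The following is only a toy Python model, not a real RDF library: the IRI and Literal classes and the sparql_regex, sparql_str and keep helpers are names made up for this illustration. It assumes, as described above, that regex accepts only literals and that an error inside a FILTER simply drops the solution.

```python
import re
from dataclasses import dataclass


# Toy stand-ins for RDF terms (hypothetical, for illustration only).
@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


def sparql_regex(term, pattern):
    """Mimic regex(): only literals are acceptable input."""
    if not isinstance(term, Literal):
        raise TypeError("regex() expects a string literal")
    return re.search(pattern, term.value) is not None


def sparql_str(term):
    """Mimic str(): turn an IRI or a literal into a plain string literal."""
    return Literal(term.value)


def keep(term, pattern, use_str):
    """Mimic FILTER: an error in the expression drops the solution."""
    try:
        return sparql_regex(sparql_str(term) if use_str else term, pattern)
    except TypeError:
        return False


values = [IRI("http://example.org"), Literal("string containing example")]
without_str = [v for v in values if keep(v, "exam", use_str=False)]
with_str = [v for v in values if keep(v, "exam", use_str=True)]

print(without_str)  # only the literal survives the filter
print(with_str)     # both values survive once str() is applied
```

The point of the model is only that the IRI is silently filtered out rather than reported as an error, which is why the original query returned nothing for subjects.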

Related

Match at least 3 words in any order from some 5 words

I have a group of words:
"dog", "car", "house", "work", "cat"
I need to be able to match at least 3 of them in a text, for example:
"I always let my cat and dog at the animal nursery when I go to work by car"
Here I want to match the regex because it matches at least 3 words (4 words here):
"cat", "dog", "car" and "work"
EDIT 1
I want to use it with Oracle's regexp_like function
EDIT 2
I also need it to work with consecutive words

Since Oracle's regexp_like doesn't support non-capturing groups and word boundaries, the following expression can be used:
^((.*? )?(dog|car|house|work|cat)( |$)){3}.*$
Alternatively, a larger but arguably cleaner solution is:
^(.*? )?(dog|car|house|work|cat) .*?(dog|car|house|work|cat) .*?(dog|car|house|work|cat)( .*)?$
NOTE: These will both match the same word used multiple times, e.g. "dog dog dog".
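As a quick sanity check, the two expressions can be tried outside Oracle. This is a sketch using Python's re module, which is an assumption on my part: its regex dialect is not identical to Oracle's, but both patterns use only constructs the two share. The names WORDS, REPEATED, SPELLED_OUT and SAMPLES are invented for the example.

```python
import re

WORDS = "dog|car|house|work|cat"

# The two patterns from the answer, built from one word list.
REPEATED = re.compile(rf"^((.*? )?({WORDS})( |$)){{3}}.*$")
SPELLED_OUT = re.compile(
    rf"^(.*? )?({WORDS}) .*?({WORDS}) .*?({WORDS})( .*)?$"
)

SAMPLES = [
    "I always let my cat and dog at the animal nursery when I go to work by car",
    "dog dog dog",                 # same word three times still matches
    "my dog sleeps in the house",  # only two of the words
]

# Map each sample to (matches first pattern, matches second pattern).
results = {
    text: (bool(REPEATED.search(text)), bool(SPELLED_OUT.search(text)))
    for text in SAMPLES
}

for text, outcome in results.items():
    print(outcome, text)
```

The second sample confirms the note above: neither pattern requires the three words to be different.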
EDIT: To address the concerns over punctuation, a small modification can be made. It isn't perfect, but it handles the common case where punctuation directly follows a word (it still won't match punctuation directly before a word, e.g. !dog):
^((.*? )?(dog|car|house|work|cat)([ ,.!?]|$)){3}.*$

This is a solution that doesn't use regular expressions, will exclude repeated words and the words to match can be passed in as a bind parameter in a collection:
SQL Fiddle
Oracle 11g R2 Schema Setup:
Create a collection type to store a list of words:
CREATE TYPE StringList IS TABLE OF VARCHAR2(50)
/
Create a PL/SQL function to split a delimited string into the collection:
CREATE OR REPLACE FUNCTION split_String(
i_str IN VARCHAR2,
i_delim IN VARCHAR2 DEFAULT ','
) RETURN StringList DETERMINISTIC
AS
p_result StringList := StringList();
p_start NUMBER(5) := 1;
p_end NUMBER(5);
c_len CONSTANT NUMBER(5) := LENGTH( i_str );
c_ld CONSTANT NUMBER(5) := LENGTH( i_delim );
BEGIN
IF c_len > 0 THEN
p_end := INSTR( i_str, i_delim, p_start );
WHILE p_end > 0 LOOP
p_result.EXTEND;
p_result( p_result.COUNT ) := SUBSTR( i_str, p_start, p_end - p_start );
p_start := p_end + c_ld;
p_end := INSTR( i_str, i_delim, p_start );
END LOOP;
IF p_start <= c_len + 1 THEN
p_result.EXTEND;
p_result( p_result.COUNT ) := SUBSTR( i_str, p_start, c_len - p_start + 1 );
END IF;
END IF;
RETURN p_result;
END;
/
Create some test data:
CREATE TABLE test_data ( value ) AS
SELECT 'I always let my cat and dog at the animal nursery when I go to work by car' FROM DUAL UNION ALL
SELECT 'dog dog foo bar dog' FROM DUAL
/
Query 1:
SELECT *
FROM test_data
WHERE CARDINALITY(
split_string( value, ' ' ) -- Split the string into a collection
MULTISET INTERSECT -- Intersect it with the input words
StringList( 'dog', 'car', 'house', 'work', 'cat' )
) >= 3 -- Check that the size of the intersection
-- is at least 3 items.
Results:
| VALUE |
|----------------------------------------------------------------------------|
| I always let my cat and dog at the animal nursery when I go to work by car |
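The idea behind the query (split, intersect, count) is not specific to Oracle. Here is a minimal Python sketch of the same logic using sets; TARGET_WORDS and has_enough_matches are names I made up, and splitting on a single space mirrors the split_string( value, ' ' ) call above, so punctuation attached to a word is not handled.

```python
TARGET_WORDS = {"dog", "car", "house", "work", "cat"}


def has_enough_matches(text, targets=TARGET_WORDS, minimum=3):
    """Return True when at least `minimum` distinct target words occur in `text`."""
    tokens = set(text.split(" "))   # split the string into a collection
    common = tokens & targets       # intersect it with the input words
    return len(common) >= minimum   # check the size of the intersection


print(has_enough_matches(
    "I always let my cat and dog at the animal nursery when I go to work by car"
))
print(has_enough_matches("dog dog foo bar dog"))
```

Because the intersection is taken on distinct words, a repeated word counts once, which is why the second test row is excluded.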

Ignoring the questions I asked in a Comment under the original post, here is one easy way to solve the problem, with a join and aggregation (using a HAVING condition). Note that a word like doghouse in the input will match both dog and house, etc. (Do read my comment under the original post!)
In the query below, both the input phrase and the words to match are hardcoded in factored subqueries (the WITH clause). In a serious environment, both should be in base tables, or be provided as input variables, etc.
I show how to use the standard string comparison operator LIKE. This can be changed to REGEXP_LIKE, but that is generally unneeded (and indeed a bad idea). But if you need to differentiate between 'dog' and 'dogs' (and 'dogwood'), or need case insensitive comparison, etc., you can use REGEXP_LIKE. The point of this solution is that you don't need to worry about matching THREE different words; if you know how to match ONE (whether full word match is needed, capitalization does or does not matter, etc.), then you can also, easily, match THREE words under the same rules.
with
inputs ( input_phrase ) as (
select
'I always let my cat and dog at the animal nursery when I go to work by car'
from dual
),
words ( word_to_match) as (
select 'dog' from dual union all
select 'car' from dual union all
select 'house' from dual union all
select 'work' from dual union all
select 'cat' from dual
)
select input_phrase
from inputs inner join words
on input_phrase like '%' || word_to_match || '%'
group by input_phrase
having count(*) >= 3
;
INPUT_PHRASE
--------------------------------------------------------------------------
I always let my cat and dog at the animal nursery when I go to work by car
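To experiment with the join-and-aggregate idea without an Oracle instance, here is a rough sketch using Python's built-in sqlite3 module. This is an assumption-laden port, not the query above: SQLite has no DUAL, so the word list uses VALUES, the phrases table and its rows are invented test data, and SQLite's LIKE is case-insensitive for ASCII by default.

```python
import sqlite3

QUERY = """
WITH words(word_to_match) AS (
    VALUES ('dog'), ('car'), ('house'), ('work'), ('cat')
)
SELECT p.phrase
FROM phrases AS p
JOIN words AS w ON p.phrase LIKE '%' || w.word_to_match || '%'
GROUP BY p.phrase
HAVING COUNT(*) >= 3
ORDER BY p.phrase
"""

PHRASES = [
    "I always let my cat and dog at the animal nursery when I go to work by car",
    "dog dog foo bar dog",  # one distinct word only
    "doghouse cart",        # substrings: dog, house and car all match
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phrases (phrase TEXT)")
conn.executemany("INSERT INTO phrases VALUES (?)", [(p,) for p in PHRASES])

# One joined row per (phrase, word) pair, so COUNT(*) counts distinct words.
matched = [row[0] for row in conn.execute(QUERY)]
print(matched)
conn.close()
```

The third phrase is there on purpose: it passes the filter through substring matches alone, which is the doghouse caveat mentioned above.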

The following solution will exclude repeated matches, doesn't use regular expressions (though you can if you like), and doesn't use PL/SQL.
WITH match_list ( match_word ) AS (
SELECT 'dog' AS match_word FROM dual
UNION ALL
SELECT 'work' FROM dual
UNION ALL
SELECT 'car' FROM dual
UNION ALL
SELECT 'house' FROM dual
UNION ALL
SELECT 'cat' FROM dual
)
SELECT phrase, COUNT(*) AS unique_match_cnt, SUM(match_cnt) AS total_match_cnt
, LISTAGG(match_word, ',') WITHIN GROUP ( ORDER BY match_word ) AS unique_matches
FROM (
SELECT pt.phrase, ml.match_word, COUNT(*) AS match_cnt
FROM phrase_table pt INNER JOIN match_list ml
ON ' ' || LOWER(pt.phrase) || ' ' LIKE '% ' || ml.match_word || ' %' -- pad with spaces so only whole words match
GROUP BY pt.phrase, ml.match_word
) GROUP BY phrase
HAVING COUNT(*) >= 3;
The key is putting the words you want to match into a table or common table expression/subquery. If you like you can use REGEXP_LIKE() in place of LIKE though I think that would be more expensive. Skip LISTAGG() if you're not using Oracle 11g or higher, or if you don't actually need to know which words were matched, and skip LOWER() if you want a case-sensitive match.

If you don't need the matched words to be different from each other:
(?:\b(?:dog|car|house|work|cat)\b.*?){3}
I don't know if this works in your environment.
EDIT: I didn't see there is another answer almost like this one.

Matching double quotes in SPARQL query in Virtuoso

I need a SPARQL query that matches double quotes in a Virtuoso graph. I use this query:
SELECT distinct ?o
FROM <http://graph>
WHERE
{
?s ?p ?o.
}
It returns a column with values such as:
http://some.prefix/Symbol
"abcd"
I need to match only the second value ("abcd"). I tried adding this filter to the WHERE clause:
FILTER regex(str(?o), "\"")
But it returns no results. I also tried '"' as a second parameter to regex, and some other things. Is it possible at all?

"abcd" is a literal of four characters. It does not include the ""; these are the string delimiters and do not form part of the string.
FILTER isLiteral(?o)
should work.

Since your filter is not working, "abcd" does not have a double quote in it. It's a string literal. If you are not sure what datatype it has, you can use --
select (datatype("abcd") as ?type) where { }
-- to get its datatype. You can then use that datatype as a filter in your query:
SELECT distinct ?o
FROM <http://graph>
WHERE
{
?s ?p ?o .
FILTER(datatype(?o) = <whatever type you received in the previous query>)
}

Selecting for a Jsonb array contains regex match

Given a data structure as follows:
{"single":"someText", "many":["text1", "text2"]}
I can query a regex on single with
WHERE JsonBColumn ->> 'single' ~ '^some.*'
And I can query a contains match on the Array with
WHERE JsonBColumn -> 'many' ? 'text2'
What I would like to do is to do a contains match with a regex on the JArray
WHERE JsonBColumn -> 'many' {Something} '.*2$'

I found that it is also possible to convert the entire JSONB array to a plain text string and simply perform the regular expression on that. A side effect though is that a search on something like
xt1", "text
would end up matching.
This approach isn't as clean since it doesn't search each element individually but it gets the job done with a visually simpler statement.
WHERE JsonBColumn ->>'many' ~ 'text2'

Use jsonb_array_elements_text() in lateral join.
with the_data(id, jsonbcolumn) as (
values
(1, '{"single":"someText", "many": ["text1", "text2"]}'::jsonb)
)
select distinct on (id) d.*
from
the_data d,
jsonb_array_elements_text(jsonbcolumn->'many') many(elem)
where elem ~ '^text.*';
 id |                    jsonbcolumn
----+----------------------------------------------------
  1 | {"many": ["text1", "text2"], "single": "someText"}
(1 row)
See also this answer.
If the feature is used frequently, you may want to write your own function:
create or replace function jsonb_array_regex_like(json_array jsonb, pattern text)
returns boolean language sql as $$
select bool_or(elem ~ pattern)
from jsonb_array_elements_text(json_array) arr(elem)
$$;
The function definitely simplifies the code:
with the_data(id, jsonbcolumn) as (
values
(1, '{"single":"someText", "many": ["text1", "text2"]}'::jsonb)
)
select *
from the_data
where jsonb_array_regex_like(jsonbcolumn->'many', '^text.*');
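For comparison, the same per-element test is easy to sketch outside the database. The Python below is only an illustration of the logic (array_regex_like is a made-up name mirroring the SQL function above), and it assumes Python's re.search is a fair stand-in for the unanchored ~ operator.

```python
import json
import re


def array_regex_like(elements, pattern):
    """True when at least one array element matches the pattern (like bool_or)."""
    return any(re.search(pattern, elem) for elem in elements)


doc = json.loads('{"single": "someText", "many": ["text1", "text2"]}')

ends_in_two = array_regex_like(doc["many"], r".*2$")
starts_with_foo = array_regex_like(doc["many"], r"^foo")

# A pattern spanning two elements: matches the serialized array text,
# but no single element, so per-element matching rejects it.
spanning = r'xt1", "text'
matches_as_text = re.search(spanning, json.dumps(doc["many"])) is not None
matches_per_element = array_regex_like(doc["many"], spanning)

print(ends_in_two, starts_with_foo, matches_as_text, matches_per_element)
```

The last two values show why testing each element individually is safer than running the regex over the whole array cast to text.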

Splitting a comma separated string with regex in sparql

I have a question about regex() in SPARQL.
I would like to replace a variable, which sometimes contains a phrase with a comma, with another that contains just what is before the comma.
For example, if the variable contains "I like it, ok", I want to get a new variable which contains "I like it". I don't know which regular expression to use.

This is a use case for strbefore; you don't need regex at all. As a general tip, I suggest reading (or skimming) through the table of contents for Section 17 of the SPARQL 1.1 Query Language Recommendation. It lists all the SPARQL functions, and while you don't need to memorize them all, you'll at least have an idea of what's out there. (This is good advice for all programmers and languages: skim the table of contents and the index.) This query [1] shows how to use strbefore:
select ?x ?prefix where {
values ?x { "we invited the strippers, jfk and stalin" }
bind( strbefore( ?x, "," ) as ?prefix )
}
---------------------------------------------------------------------------
| x                                          | prefix                     |
===========================================================================
| "we invited the strippers, jfk and stalin" | "we invited the strippers" |
---------------------------------------------------------------------------
[1] See Strippers, JFK, and Stalin Illustrate Why You Should Use the Serial Comma
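If it helps to see the semantics spelled out, here is a small Python sketch of what strbefore does with plain strings. The function is my own stand-in, not part of any SPARQL library; it ignores language tags and datatypes, and it returns the empty string when the separator is absent or empty, which is how I read the function's definition in the recommendation.

```python
def strbefore(text, separator):
    """Return the part of `text` before the first `separator`, or "" if none."""
    if separator == "":
        return ""
    head, found, _tail = text.partition(separator)
    # partition() returns the whole string as `head` when nothing is found,
    # so check `found` to return "" in that case.
    return head if found else ""


print(strbefore("I like it, ok", ","))
print(strbefore("we invited the strippers, jfk and stalin", ","))
print(repr(strbefore("no comma here", ",")))
```

Note the last case: a value without a comma yields an empty string rather than the original value, so you may want to fall back to the original variable in that situation.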

SPARQL 1.1: how to use the replace function?

How can one use the replace function in SPARQL 1.1, especially in update commands?
For example, if I have a number of triples ?s ?p ?o where ?o is a string, and for all triples where ?o contains the string "gotit" I want to insert an additional triple where "gotit" is replaced by "haveit", how could I do this? I am trying to achieve this in Sesame 2.6.0.
I tried this naive approach:
INSERT { ?s ?p replace(?o,"gotit","haveit","i") . }
WHERE { ?s ?p ?o . FILTER(regex(?o,"gotit","i")) }
but this caused a syntax error.
I also failed to use replace in the result list of a query like so:
SELECT ?s ?p (replace(?o,"gotit","haveit","i") as ?r) WHERE { .... }
The SPARQL document unfortunately does not contain an example of how to use this function.
Is it possible at all to use functions to create new values and not just test existing values and if yes, how?

You can't use an expression directly in your INSERT clause like you have attempted to do. Also you are binding ?name with the first triple pattern but then filtering on ?o in the FILTER which is not going to give you any results (filtering on an unbound variable will give you no results for most filter expressions).
Instead you need to use a BIND in your WHERE clause to make the new version of the value available in the INSERT clause like so:
INSERT
{
?s ?p ?o2 .
}
WHERE
{
?s ?p ?o .
FILTER(REGEX(?o, "gotit", "i"))
BIND(REPLACE(?o, "gotit", "haveit", "i") AS ?o2)
}
BIND assigns the result of an expression to a new variable so you can use that value elsewhere in your query/update.
The relevant part of the SPARQL specification you are interested in is the section on Assignment.
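To make the filter, bind, insert flow concrete, here is a rough Python sketch over an in-memory set of triples. It is only a model of the update above, not Sesame code: add_replaced_triples and the sample data are invented, triples are plain tuples of strings, and the search text is escaped so it is treated literally (SPARQL's REPLACE treats it as a regular expression).

```python
import re


def add_replaced_triples(triples, old="gotit", new="haveit"):
    """Return the triples plus one extra triple per object containing `old`."""
    pattern = re.compile(re.escape(old), re.IGNORECASE)  # the "i" flag
    added = {
        (s, p, pattern.sub(new, o))   # BIND(REPLACE(?o, ...) AS ?o2)
        for (s, p, o) in triples
        if pattern.search(o)          # FILTER(REGEX(?o, "gotit", "i"))
    }
    return set(triples) | added       # INSERT { ?s ?p ?o2 }


data = {
    ("s1", "p", "I gotit"),
    ("s2", "p", "GotIt now"),
    ("s3", "p", "nothing here"),
}
updated = add_replaced_triples(data)
for triple in sorted(updated):
    print(triple)
```

As with the INSERT, the original triples stay in place and only the rewritten copies are added.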

The usage of replace looks correct afaict according to the spec. I believe REPLACE was just added to the last rev of the spec relatively recently - perhaps Sesame just doesn't support it yet?
If you just do SELECT ?s ?p ?o WHERE { ?s ?p ?name . FILTER(regex(?name,"gotit","i")) } does your query return rows?
