SPARQL: combine and exclude regex filters - regex

I want to filter my SPARQL query for specific keywords while at the same time excluding other keywords. I thought this may be easily accomplished with FILTER (regex(str(?var),"includedKeyword","i") && !regex(str(?var),"excludedKeyword","i")). It works without the "!" condition, but not with. I also separated the FILTER statements, but no use.
I used this query on http://europeana.ontotext.com/ :
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>
SELECT DISTINCT ?CHO
WHERE {
?proxy dc:subject ?subject .
FILTER ( regex(str(?subject),"gemälde","i") && !regex(str(?subject),"Fotografie","i") )
?proxy edm:type "IMAGE" .
?proxy ore:proxyFor ?CHO.
?agg edm:aggregatedCHO ?CHO; edm:country "germany".
}
But I always get the result on the first row with the title "Gemäldegalerie", which has a dc:subject of "Fotografie" (the one I want excluded). I think the problem lies in the fact that one object from the Europeana database can have more than one dc:subject property, so maybe it looks only for one of these properties while ignoring the other ones.
Any ideas? Would be very thankful!

The problem is that your combined filter checks for the same binding of ?subject. So it succeeds if at least one value of ?subject matches both conditions (which is almost always true, because the string "Gemäldegalerie", for example, matches your first regex and does not match the second).
So for the negative condition, you need to formulate something that checks for all possible values, rather than just one particular value. You can do this using SPARQL's NOT EXISTS function, for example like this:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>
SELECT DISTINCT ?CHO
WHERE {
?proxy edm:type "IMAGE" .
?proxy ore:proxyFor ?CHO.
?agg edm:aggregatedCHO ?CHO; edm:country "germany".
?proxy dc:subject ?subject .
FILTER(regex(str(?subject),"gemälde","i"))
FILTER NOT EXISTS {
?proxy dc:subject ?otherSubject.
FILTER(regex(str(?otherSubject),"Fotografie","i"))
}
}
As an aside: since you are doing regular expression checks, and now combining them with an NOT EXISTS operator, this is likely to become very expensive for the query processor quite quickly. You may want to think about alternative ways to formulate your query (for example, using the exact subject string to include or exclude to eliminate the regex), or even having a look at some non-standard extensions that the SPARQL endpoint might provide (OWLIM, for example, the store on which the Europeana endpoint runs, supports various full-text-search extensions, though I am not sure they are enabled in the Europeana endpoint).

Related

cts:value-match on xs:dateTime() type in Marklogic

I have a variable $yearMonth := "2015-02"
I have to search this date on an element Date as xs:dateTime.
I want to use regex expression to find all files/documents having this date "2015-02-??"
I have path-range-index enabled on ModifiedInfo/Date
I am using following code but getting Invalid cast error
let $result := cts:value-match(cts:path-reference("ModifiedInfo/Date"), xs:dateTime("2015-02-??T??:??:??.????"))
I have also used following code and getting same error
let $result := cts:value-match(cts:path-reference("ModifiedInfo/Date"), xs:dateTime(xs:date("2015-02-??"),xs:time("??:??:??.????")))
Kindly help :)
It seems you are trying to use wild card search on Path Range index which has data type xs:dateTime().
But, currently MarkLogic don't support this functionality. There are multiple ways to handle this scenario:
You may create Field index.
You may change it to string index which supports wildcard search.
You may run this workaround to support your existing system:
for $x in cts:values(cts:path-reference("ModifiedInfo/Date"))
return if(starts-with(xs:string($x), '2015-02')) then $x else ()
This query will fetch out values from lexicon and then you may filter your desired date.
You can solve this by combining a couple cts:element-range-querys inside of an and-query:
let $target := "2015-02"
let $low := xs:date($target || "-01")
let $high := $low + xs:yearMonthDuration("P1M")
return
cts:search(
fn:doc(),
cts:and-query((
cts:element-range-query("country", ">=", $low),
cts:element-range-query("country", "<", $high)
))
)
From the cts:element-range-query documentation:
If you want to constrain on a range of values, you can combine multiple cts:element-range-query constructors together with cts:and-query or any of the other composable cts:query constructors, as in the last part of the example below.
You could also consider doing a cts:values with a cts:query param that searches for values between for instance 2015-02-01 and 2015-03-01. Mind though, if multiple dates occur within one document, you will need to post filter manually after all (like in option 3 of Navin), but it could potentially speed up post-filtering a lot..
HTH!

How to Select Date Range using RegEx

I have date strings that looks like so:
20120817110329
Which, as you can see, is formatted: YYYYMMDDHHMMSS
How would I select (using RegEx) dates that are between 7/15 and 8/20? Or what about 8/1 to 8/15?
I have this working if I want to select a range that doesn't involve more than one place, but it is very limited:
^2012081[0-7] //selects 8/10 to 8/17
Update
Never forget the obvious (as pointed out by Wiseguy below), one can simply look for a range between 201207150000 and 201208209999.
Since you're just querying a database field that contains these values, you could simply check for a value between 201207150000 and 201208209999.
If you still want the regex, it ain't pretty, but this does it:
^20120(7(1[5-9]|2\d|3[01])|8([0-1]\d|20))\d{4}$
reFiddle example
You basically have to account for each possible range by hand.
^20120
(
7
(
1[5-9]
|2\d
|3[01]
)
|
8
(
[0-1]\d
|20
)
)
\d{4}$
I think this should work:
^2012(07(1[5-9]|[2-3][0-9])|08([0-1][0-9]|20))
Although the other answers are pretty the same...
You can check this for more info.

sparql regex compare two string variables (one is composed by the other)

I am trying to compare two string variables to discover if one is contained in the other, specifically if one is composed of the other (so, I would like to avoid retrieving that "information" contains "format". I am interested only in results similar to "information_management" includes "information".
I have tried both FILTER CONTAINS() and FILTER regex() with the same results. How can I modify the query so it includes the fact that there needs to be a space either before or after the term?
SELECT DISTINCT ?l1 ?l2
WHERE
{
?term1 skos:prefLabel ?l1.
?term2 skos:prefLabel ?l2.
FILTER(contains(?l1,?l2))
}
So if I understand you correctly you want to find pairs of terms where one term is contained in the other but is not equal to the other?
If so you can add a !SAMETERM() call into the the FILTER clause like so:
SELECT DISTINCT ?l1 ?l2
WHERE
{
?term1 skos:prefLabel ?l1.
?term2 skos:prefLabel ?l2.
FILTER(!SAMETERM(?l1, ?l2) && contains(?l1,?l2))
}
Edit
Re-reading the question I don't think I addressed the whole question, for the problem where you have the terms "format" and "information" and don't want them to be matched you can do something like the following:
SELECT DISTINCT ?l1 ?l2
WHERE
{
?term1 skos:prefLabel ?l1.
?term2 skos:prefLabel ?l2.
FILTER(!SAMETERM(?l1, ?l2)
&& contains(?l1,?l2)
&& ( STRENDS(STRBEFORE(?l1, ?l2)," ")
|| STRSTARTS(STRAFTER(?l1, ?l2), " ")
))
}
This requires that the string before/after the containing term must end/start with whitespace. You may have to play around with this to get something that more closely models your constraints.
Another solution would be by constructing a regex pattern on the fly, like:
FILTER(regex(concat("\\b", ?l1, "\\b"), ?l2))
I'm not entirely sure that SPARQL/XML Schema requires \b, but I think most implementations will have it.

SQL and regular expression to check if string is a substring of larger string?

I have a database filled with some codes like
EE789323
990
78000
These numbers are ALWAYS endings of a larger code. Now I have a function that needs to check if the larger code contains the subcode.
So if I have codes 90 and 990 and my full code is EX888990, it should match both of them.
However I need to do it in the following way:
SELECT * FROM tableWithRecordsWithSubcode
WHERE subcode MATCHES [reg exp with full code];
Is a regular expression like this this even possible?
EDIT:
To clarify the issue I'm having, I'm not using SQL here. I just used that to give an example of the type of query I'm using.
In fact I'm using iOS with CoreData, and I need a predicate to fetch me only the records that match.
In the way that is mentioned below.
Given the observations from a comment:
Do you have two tables, one called tableWithRecordsWithSubcode and another that might be tableWithFullCodeColumn? So the matching condition is in part a join - you need to know which subcodes match any of the full codes in the second table? But you're only interested in the information in the tableWithRecordsWithSubcode table, not in which rows it matches in the other table?
and the laconic "you're correct" response, then we have to rewrite the query somewhat.
SELECT DISTINCT S.*
FROM tableWithRecordsWithSubcode AS S
JOIN tableWithFullCodeColumn AS F
ON F.Fullcode ...ends-with... S.Subcode
or maybe using an EXISTS sub-query:
SELECT S.*
FROM tableWithRecordsWithSubcode AS S
WHERE EXISTS(SELECT * FROM tableWithFullCodeColumn AS F
WHERE F.Fullcode ...ends-with... S.Subcode)
This uses a correlated sub-query but avoids the DISTINCT operation; it may mean the optimizer can work more efficiently.
That just leaves the magical 'X ...ends-with... T' operator to be defined. One possible way to do that is with LENGTH and SUBSTR. However, SUBSTR does not behave the same way in all DBMS, so you may have to tinker with this (possibly adding a third argument, LENGTH(s.subcode)):
LENGTH(f.fullcode) >= LENGTH(s.subcode) AND
SUBSTR(f.fullcode, LENGTH(f.fullcode) - LENGTH(s.subcode)) = s.subcode
This leads to two possible formulations:
SELECT DISTINCT S.*
FROM tableWithRecordsWithSubcode AS S
JOIN tableWithFullCodeColumn AS F
ON LENGTH(F.Fullcode) >= LENGTH(S.Subcode)
AND SUBSTR(F.Fullcode, LENGTH(F.Fullcode) - LENGTH(S.Subcode)) = S.Subcode;
and
SELECT S.*
FROM tableWithRecordsWithSubcode AS S
WHERE EXISTS(
SELECT * FROM tableWithFullCodeColumn AS F
WHERE LENGTH(F.Fullcode) >= LENGTH(S.Subcode)
AND SUBSTR(F.Fullcode, LENGTH(F.Fullcode) - LENGTH(S.Subcode)) = S.Subcode);
This is not going to be a fast operation; joins on computed results such as required by this query seldom are.
I'm not sure why you think that you need a regular expression... Just use the charindex function:
select something
from table
where charindex(code, subcode) <> 0
Edit:
To find strings at the end, you can create a pattern with the % wildcard from the subcode:
select something
from table
where '%' + subcode like code

doctrine2 dql, use setParameter with % wildcard when doing a like comparison

I want to use the parameter place holder - e.g. ?1 - with the % wild cards. that is, something like: "u.name LIKE %?1%" (though this throws an error). The docs have the following two examples:
1.
// Example - $qb->expr()->like('u.firstname', $qb->expr()->literal('Gui%'))
public function like($x, $y); // Returns Expr\Comparison instance
I do not like this as there is no protection against code injection.
2.
// $qb instanceof QueryBuilder
// example8: QueryBuilder port of: "SELECT u FROM User u WHERE u.id = ?1 OR u.nickname LIKE ?2 ORDER BY u.surname DESC" using QueryBuilder helper methods
$qb->select(array('u')) // string 'u' is converted to array internally
->from('User', 'u')
->where($qb->expr()->orx(
$qb->expr()->eq('u.id', '?1'),
$qb->expr()->like('u.nickname', '?2')
))
->orderBy('u.surname', 'ASC'));
I do not like this because I need to search for terms within the object's properties - that is, I need the wild cards on either side.
When binding parameters to queries, DQL pretty much works exactly like PDO (which is what Doctrine2 uses under the hood).
So when using the LIKE statement, PDO treats both the keyword and the % wildcards as a single token. You cannot add the wildcards next to the placeholder. You must append them to the string when you bind the params.
$qb->expr()->like('u.nickname', '?2')
$qb->getQuery()->setParameter(2, '%' . $value . '%');
See this comment in the PHP manual.
The selected answer is wrong. It works, but it is not secure.
You should escape the term that you insert between the percentage signs:
->setParameter(2, '%'.addcslashes($value, '%_').'%')
The percentage sign '%' and the symbol underscore '_' are interpreted as wildcards by LIKE. If they're not escaped properly, an attacker might construct arbirtarily complex queries that can cause a denial of service attack. Also, it might be possible for the attacker to get search results he is not supposed to get. A more detailed description of attack scenarios can be found here: https://stackoverflow.com/a/7893670/623685