sparql regex compare two string variables (one is composed by the other) - regex

I am trying to compare two string variables to discover if one is contained in the other, specifically if one is composed of the other (so, I would like to avoid retrieving that "information" contains "format". I am interested only in results similar to "information_management" includes "information".
I have tried both FILTER CONTAINS() and FILTER regex() with the same results. How can I modify the query so it includes the fact that there needs to be a space either before or after the term?
SELECT DISTINCT ?l1 ?l2
WHERE
{
?term1 skos:prefLabel ?l1.
?term2 skos:prefLabel ?l2.
FILTER(contains(?l1,?l2))
}

So if I understand you correctly you want to find pairs of terms where one term is contained in the other but is not equal to the other?
If so you can add a !SAMETERM() call into the the FILTER clause like so:
SELECT DISTINCT ?l1 ?l2
WHERE
{
?term1 skos:prefLabel ?l1.
?term2 skos:prefLabel ?l2.
FILTER(!SAMETERM(?l1, ?l2) && contains(?l1,?l2))
}
Edit
Re-reading the question I don't think I addressed the whole question, for the problem where you have the terms "format" and "information" and don't want them to be matched you can do something like the following:
SELECT DISTINCT ?l1 ?l2
WHERE
{
?term1 skos:prefLabel ?l1.
?term2 skos:prefLabel ?l2.
FILTER(!SAMETERM(?l1, ?l2)
&& contains(?l1,?l2)
&& ( STRENDS(STRBEFORE(?l1, ?l2)," ")
|| STRSTARTS(STRAFTER(?l1, ?l2), " ")
))
}
This requires that the string before/after the containing term must end/start with whitespace. You may have to play around with this to get something that more closely models your constraints.

Another solution would be by constructing a regex pattern on the fly, like:
FILTER(regex(concat("\\b", ?l1, "\\b"), ?l2))
I'm not entirely sure that SPARQL/XML Schema requires \b, but I think most implementations will have it.

Related

Remove columns by name based on pattern

How can I remove a large number of columns by name based on a pattern?
A data set exported from Jira has a ton of extra columns that I've no interest in. 400 Log entries, 50 Comments, dozens of links or attachments. Problem is that they get random numbers assigned which means that removing them with hardcoded column names will not work. That would look like this and break as the numbers change:
= Table.RemoveColumns(#"Previous Step",{"Watchers", "Watchers_10", "Watchers_11", "Watchers_12", "Watchers_13", "Watchers_14", "Watchers_15", "Watchers_16", "Watchers_17", "Watchers_18", "Watchers_19", "Watchers_20", "Watchers_21", "Watchers_22", "Watchers_23", "Watchers_24", "Watchers_25", "Watchers_26", "Watchers_27", "Watchers_28", "Log Work", "Log Work_29", "Log Work_30", "Log Work_31", "Log Work_32", ...
How can I remove a large number of columns by using a pattern in the name? i.e. remove all "Log Work" columns.
The best way I've found is to use List.FindText on Table.ColumnNames to get a list of column names dynamically based on target string:
= Table.RemoveColumns(#"Previous Step", List.FindText(Table.ColumnNames(#"Previous Step"), "Log Work")
This works by first grabbing the full list of Column Names and keeping only the ones that match the search string. That's then sent to RemoveColumns as normal.
Limitation appears to be that FindText doesn't offer complex pattern matching.
Of course, when you want to remove a lot of different patterns, having individual steps isn't very interesting. A way to combine this is to use List.Combine to join the resulting column names together.
That becomes:
= Table.RemoveColumns(L, List.Combine({ List.FindText(Table.ColumnNames(L), "Watchers_"), List.FindText(Table.ColumnNames(L), "Log Work"), List.FindText(Table.ColumnNames(L), "Comment"), List.FindText(Table.ColumnNames(L), "issue link"), List.FindText(Table.ColumnNames(L), "Attachment")} ))
SO what's actually written there is:
Table.RemoveColumns(PreviousStep, List.Combine({ foundList1, foundlist2, ... }))
Note the { } that signifies a list! You need to use this as List.Combine only accepts a single argument which is itself already a List of lists. And the Combine call is required here.
Also note the L here instead of #"Previous Step". That's used to make the entire thing more readable. Achieved by inserting a step named "L" that just has = #"Promoted Headers".
This allows relatively maintainable removal of multiple columns by name, but it's far from perfect.

Using ARRAYFORMULA with SUMIF for multiple conditions combined with a wildcard to select all values for a given condition

I am using ARRAYFORMULA with multiple conditions using SUMIF concatenating the conditions using & and it works. Now I would like to use similarly this idea for a special condition indicating to consider all values using a wildcard ("ALL") for a given column, but it doesn't work.
Here is my example:
On I2if have the following formula:
=ARRAYFORMULA(if(not(ISBLANK(H2:H)),sumif(B2:B & C2:C & D2:D & year(E2:E),
if($G$2="ALL",B2:B,$G$2) & if($G$4="ALL",C2:C,$G$4) & if($G$6="ALL",D2:D,$G$6) &
H2:H,A2:A),))
and it works, when I enter specific values, but when I use my wildcard: ALL indicating that for a given column/criteria all values should be taken into consideration, it doesn't work as expected. The scenario should consider that all criteria can be labeled as ALLin such case it will provide the sum of NUM per year.
Here is my testing sample in Google Spreadsheet:
https://docs.google.com/spreadsheets/d/1c28KRQWgPCEdzVvwvXFOQ3Y13MBDjpEgKdfoLipFAOk/edit?usp=sharing
Notes:
I was able to get a solution for that using SUMPRODUCT but this function doesn't get expanded with ARRAYFORMULA
In my real example I have more conditions, so I am looking for a solution that escalates having more conditions
use:
=QUERY(QUERY(FILTER(A2:E,
IF(G2="All", B2:B<>"×", B2:B=G2),
IF(G4="All", C2:C<>"×", C2:C=G4),
IF(G6="All", D2:D<>"×", D2:D=G6)),
"select year(Col5),sum(Col1)
where Col1 is not null
group by year(Col5)"),
"offset 1", 0)

SPARQL: combine and exclude regex filters

I want to filter my SPARQL query for specific keywords while at the same time excluding other keywords. I thought this may be easily accomplished with FILTER (regex(str(?var),"includedKeyword","i") && !regex(str(?var),"excludedKeyword","i")). It works without the "!" condition, but not with. I also separated the FILTER statements, but no use.
I used this query on http://europeana.ontotext.com/ :
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>
SELECT DISTINCT ?CHO
WHERE {
?proxy dc:subject ?subject .
FILTER ( regex(str(?subject),"gemälde","i") && !regex(str(?subject),"Fotografie","i") )
?proxy edm:type "IMAGE" .
?proxy ore:proxyFor ?CHO.
?agg edm:aggregatedCHO ?CHO; edm:country "germany".
}
But I always get the result on the first row with the title "Gemäldegalerie", which has a dc:subject of "Fotografie" (the one I want excluded). I think the problem lies in the fact that one object from the Europeana database can have more than one dc:subject property, so maybe it looks only for one of these properties while ignoring the other ones.
Any ideas? Would be very thankful!
The problem is that your combined filter checks for the same binding of ?subject. So it succeeds if at least one value of ?subject matches both conditions (which is almost always true, because the string "Gemäldegalerie", for example, matches your first regex and does not match the second).
So for the negative condition, you need to formulate something that checks for all possible values, rather than just one particular value. You can do this using SPARQL's NOT EXISTS function, for example like this:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>
SELECT DISTINCT ?CHO
WHERE {
?proxy edm:type "IMAGE" .
?proxy ore:proxyFor ?CHO.
?agg edm:aggregatedCHO ?CHO; edm:country "germany".
?proxy dc:subject ?subject .
FILTER(regex(str(?subject),"gemälde","i"))
FILTER NOT EXISTS {
?proxy dc:subject ?otherSubject.
FILTER(regex(str(?otherSubject),"Fotografie","i"))
}
}
As an aside: since you are doing regular expression checks, and now combining them with an NOT EXISTS operator, this is likely to become very expensive for the query processor quite quickly. You may want to think about alternative ways to formulate your query (for example, using the exact subject string to include or exclude to eliminate the regex), or even having a look at some non-standard extensions that the SPARQL endpoint might provide (OWLIM, for example, the store on which the Europeana endpoint runs, supports various full-text-search extensions, though I am not sure they are enabled in the Europeana endpoint).

SQL and regular expression to check if string is a substring of larger string?

I have a database filled with some codes like
EE789323
990
78000
These numbers are ALWAYS endings of a larger code. Now I have a function that needs to check if the larger code contains the subcode.
So if I have codes 90 and 990 and my full code is EX888990, it should match both of them.
However I need to do it in the following way:
SELECT * FROM tableWithRecordsWithSubcode
WHERE subcode MATCHES [reg exp with full code];
Is a regular expression like this this even possible?
EDIT:
To clarify the issue I'm having, I'm not using SQL here. I just used that to give an example of the type of query I'm using.
In fact I'm using iOS with CoreData, and I need a predicate to fetch me only the records that match.
In the way that is mentioned below.
Given the observations from a comment:
Do you have two tables, one called tableWithRecordsWithSubcode and another that might be tableWithFullCodeColumn? So the matching condition is in part a join - you need to know which subcodes match any of the full codes in the second table? But you're only interested in the information in the tableWithRecordsWithSubcode table, not in which rows it matches in the other table?
and the laconic "you're correct" response, then we have to rewrite the query somewhat.
SELECT DISTINCT S.*
FROM tableWithRecordsWithSubcode AS S
JOIN tableWithFullCodeColumn AS F
ON F.Fullcode ...ends-with... S.Subcode
or maybe using an EXISTS sub-query:
SELECT S.*
FROM tableWithRecordsWithSubcode AS S
WHERE EXISTS(SELECT * FROM tableWithFullCodeColumn AS F
WHERE F.Fullcode ...ends-with... S.Subcode)
This uses a correlated sub-query but avoids the DISTINCT operation; it may mean the optimizer can work more efficiently.
That just leaves the magical 'X ...ends-with... T' operator to be defined. One possible way to do that is with LENGTH and SUBSTR. However, SUBSTR does not behave the same way in all DBMS, so you may have to tinker with this (possibly adding a third argument, LENGTH(s.subcode)):
LENGTH(f.fullcode) >= LENGTH(s.subcode) AND
SUBSTR(f.fullcode, LENGTH(f.fullcode) - LENGTH(s.subcode)) = s.subcode
This leads to two possible formulations:
SELECT DISTINCT S.*
FROM tableWithRecordsWithSubcode AS S
JOIN tableWithFullCodeColumn AS F
ON LENGTH(F.Fullcode) >= LENGTH(S.Subcode)
AND SUBSTR(F.Fullcode, LENGTH(F.Fullcode) - LENGTH(S.Subcode)) = S.Subcode;
and
SELECT S.*
FROM tableWithRecordsWithSubcode AS S
WHERE EXISTS(
SELECT * FROM tableWithFullCodeColumn AS F
WHERE LENGTH(F.Fullcode) >= LENGTH(S.Subcode)
AND SUBSTR(F.Fullcode, LENGTH(F.Fullcode) - LENGTH(S.Subcode)) = S.Subcode);
This is not going to be a fast operation; joins on computed results such as required by this query seldom are.
I'm not sure why you think that you need a regular expression... Just use the charindex function:
select something
from table
where charindex(code, subcode) <> 0
Edit:
To find strings at the end, you can create a pattern with the % wildcard from the subcode:
select something
from table
where '%' + subcode like code

How Lucene scores results in a RegexQuery?

I can see how two values, when doing a regular/fuzzy full text search, can be compared to determine which one is "better" (i.e. one value contains more keywords than the other, one contains less non-keywords than the other).
However, how Lucene computes the score when doing regex queries using RegexQuery? It is a boolean query - a field's value is either compatible with the regex or not. Lucene can't take keywords from my regex query and do its usual magic...
There are two passes. In the first, it generates a list of all terms which match the regex. In the second, it finds all documents with terms matching that regex.
The major code you want to look at is in MultiTermQuery:
public Query rewrite(IndexReader reader) throws IOException {
FilteredTermEnum enumerator = getEnum(reader);
BooleanQuery query = new BooleanQuery();
try {
do {
Term t = enumerator.term();
if (t != null) {
TermQuery tq = new TermQuery(t); // found a match
tq.setBoost(getBoost() * enumerator.difference()); // set the boost
query.add(tq, false, false); // add to query
}
} while (enumerator.next());
} finally {
enumerator.close();
}
return query;
}
Two things:
The boolean query is instantiated with coord on. So the standard coord scoring applies (i.e. the more terms you get, the better).
The boost of the term query is given by enumerator.difference(). However, as of 3.0.1 this just returns 1:
#Override
public final float difference() {
// TODO: adjust difference based on distance of searchTerm.text() and term().text()
return 1.0f;
}
So at some point this will return the distance (probably levenstein) between the terms. But for now it does nothing.
This is just a wild guess, but one possible metric could be the number of backtracking steps the regex engine needs to take to match your search strings.
Of course, these values also depend mightily on the quality of your regex, but when comparing several matches, the one that was "easier to match" could be considered a better match than the one that the regex engine had to go through contortions for.