Check if `LIKE` patterns intersect in Postgres - regex

There are two strings in some request that are patterns used within LIKE expressions (with _ and % placeholders). I want to find out whether these patterns intersect (i.e. whether some string matches them both). Is there any way to do that?...
A “LIKE pattern” corresponds to a finite or infinite set of strings. Each string in this set matches the given pattern. I want to check whether the intersection of the string sets for two given patterns is non-empty. Thus it is better to speak of a conjunction of patterns. In mathematical language:
S — set of strings
P — set of patterns (where each pattern has one or more string representation)
Sᵢ — subset of strings (Sᵢ ⊂ S) that match pᵢ pattern (where instead of i could be any index).
In equation form: “Sᵢ = {s | s ∈ S, s matches pᵢ, pᵢ ∈ P}” — that means: “Sᵢ is the set of elements that are strings and match the pattern pᵢ”.
Or in another notation: “Sᵢ ⊂ S, ∀pᵢ ∈ P ∀s ∈ S (s matches pᵢ ≡ s ∈ Sᵢ)” — that means: “Sᵢ is a subset of the strings, and a string is an element of Sᵢ if and only if it matches the pattern pᵢ”.
Let's define conjunction of patterns: “p₁ ∧ p₂ = p₃ ≡ S₁ ∩ S₂ = S₃” — that means: “Set of strings that match conjunction of patterns p₁ and p₂ is intersection of sets of strings that match p₁ pattern and that match p₂ pattern”.
For example:
ab_d and %cd — intersects
k%n and kl___ — intersects

I want to find out whether these patterns intersect (i.e. whether some string matches them both). Is there any way to do that?... (...) I want to check whether the intersection of the string sets for two given patterns is non-empty.
So, if I get this right, given two like patterns, p1 and p2, you're interested in whether there exists a (yet to be determined) string that matches p1 as well as p2.
E.g.:
select check_pattern('a%', 'b_'); -- false
select check_pattern('a%', '_b'); -- true ('ab')
Are you even sure there's a general solution to that problem in the first place?
Assuming there is, plain SQL isn't the right tool to find the solution imho, because you cannot readily express this in terms of "here's my (finite) set of data, join/filter them and yield a set based on it". To find the solution in SQL terms, you'd need to generate the set that stems from your data, and that's obviously not an option when the set in question is infinite.
Methinks you'd want to break up the problem into smaller parts and use a procedural language such as C, Perl, Lisp, whatever you fancy.
One potential solution might be this:
If both p1 and p2 are open on both ends or different ends, the answer is trivially yes: strings matching %foo% will intersect with those matching %bar%, just as strings matching foo% will intersect strings matching %bar.
If p1 yields a finite set (i.e. it contains no %), you could imagine iterating the entire set of potential matches for p1 using generate_series() or a for/while/whatever loop, and trying p2 on each string. It's ugly and inefficient, but it'll eventually work.
If p1 and p2 are both anchored (e.g. abc% and def% or %abc and %def), or reasonably anchored (e.g. _abc% and abcd%) the solution is trivial enough as well by considering the anchored part and proceeding as in the prior case.
I'll leave it to you to enumerate and solve the remaining cases if any...
The key, I think, will be to nail down the anchored parts of your patterns that yield a finite set of strings, and to stick to checking whether the (finite) set of strings they will match will intersect.
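Rather than enumerating cases by hand, the check can also be phrased as a small dynamic program over positions in both patterns. Below is a minimal sketch in Python (not Postgres code). The name like_patterns_intersect is made up for illustration, and it assumes plain LIKE syntax where only % and _ are special and no escape character is in use.

```python
from functools import lru_cache


def like_patterns_intersect(p1: str, p2: str) -> bool:
    """Return True if some string matches both LIKE patterns.

    Assumption: '%' matches any run of characters, '_' matches exactly
    one character, and there is no escape character.
    """
    n, m = len(p1), len(p2)

    @lru_cache(maxsize=None)
    def feasible(i: int, j: int) -> bool:
        # Can the remainders p1[i:] and p2[j:] match a common string?
        if i == n and j == m:
            return True
        if i < n and p1[i] == "%":
            # Either '%' matches nothing, or it absorbs whatever p2[j] consumes.
            return feasible(i + 1, j) or (j < m and feasible(i, j + 1))
        if j < m and p2[j] == "%":
            return feasible(i, j + 1) or (i < n and feasible(i + 1, j))
        if i == n or j == m:
            # One pattern is exhausted while the other still needs a character.
            return False
        if p1[i] == "_" or p2[j] == "_" or p1[i] == p2[j]:
            return feasible(i + 1, j + 1)
        return False

    return feasible(0, 0)


print(like_patterns_intersect("ab_d", "%cd"))   # True
print(like_patterns_intersect("k%n", "kl___"))  # True
print(like_patterns_intersect("a%", "b_"))      # False
print(like_patterns_intersect("a%", "_b"))      # True
```

The number of states is bounded by the product of the two pattern lengths, so nothing has to be enumerated even when both patterns match infinitely many strings. The same logic could be ported to a procedural language inside the database if it needs to live there.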

Related

RegEx that matches "variable" strings/sequences? + backtracking?

I want to use regex like language to match against variable-string (in my case sequence of character|words|numbers stored in a graph DB).
I found a way to implement RegEx engine :
https://deniskyashif.com/2019/02/17/implementing-a-regular-expression-engine/
the problem is that it matches against static string. My case is sort of what I call variable-string/sequence.
For example, let's say I have stored the following sequences:
who; why; when; where;
Keep in mind I don't have the sequences available (so I cannot loop over them); they are deconstructed into a graph. (You can think of the interface to a sequence as a function which, given a prefix, predicts/returns the next character.)
If I match against the regex w*, it should match/return all of the strings one after another (like in backtracking).
If I use whe* => when, where
etc..
Is there a way to modify NFA, DFA in such a way that it will accommodate variable-string ?
I just started exploring implementing NFA and think the change has to be here :
function search(nfa, word) { .... }
it has to be search that passes the next expected regex-symbol/state i.e. given the previous string-symbol does the next-predicted-string-symbol match the expected regex-symbol ?
The regex should 'drive' the match, rather than the string ! It should be doable because the regex is deconstructed to finite states with the transitions..
what do you think ?
They are stored as a tree in a graph DB. For example, they can be represented as:
lvl5: (where:.)
lvl4: (wher:e), (when:.), (whom:.),
lvl3: (whe:r), (whe:n), (who:m), (who:.), (why:.)
lvl2: (wh:y) , (wh:o), (wh:e)
lvl1: (w:h)
lvl0: w h y o .
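To make the "pattern drives the match" idea concrete, here is a rough sketch in Python. It assumes the stored sequences can be modelled as a nested-dict trie with "." as the end marker (as in the listing above), and it treats * glob-style as "any run of characters", which is how the examples w* and whe* read. The names build_trie and match_trie are hypothetical, and this is not a full NFA.

```python
END = "."  # end-of-sequence marker, mirroring the tree listing


def build_trie(words):
    """Build a nested-dict trie; assumes the words never contain END."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def match_trie(node, pattern, prefix=""):
    """Yield stored words matching `pattern`, walking only trie edges
    that the pattern allows. '*' stands for any run of characters."""
    if not pattern:
        if END in node:
            yield prefix
        return
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        # '*' matches the empty string ...
        yield from match_trie(node, rest, prefix)
        # ... or consumes one stored character and stays on '*'.
        for ch, child in node.items():
            if ch != END:
                yield from match_trie(child, pattern, prefix + ch)
    elif head != END and head in node:
        yield from match_trie(node[head], rest, prefix + head)


trie = build_trie(["who", "why", "when", "where"])
print(list(match_trie(trie, "w*")))
print(list(match_trie(trie, "whe*")))
```

The traversal only ever asks "which next characters exist after this prefix?", which is the interface described above. With more than one * in a pattern the same word may be yielded more than once, so a real implementation would want to deduplicate.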
I don’t understand your question, but this regex could be the answer:
<prefix>.*?\b
Where <prefix> is w or whe etc.
This will match all words in the input that start with the prefix.
In whatever language you’re using, there should be a way to loop over all matches found for a given input.
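For instance, in Python the loop over matches could look like the sketch below. It assumes the sequences are available as one flat input string, which the question says is not the case for the graph store; words_with_prefix is a hypothetical helper name.

```python
import re


def words_with_prefix(prefix, text):
    """Return every match of <prefix>.*?\\b in text."""
    pattern = re.escape(prefix) + r".*?\b"
    return [m.group(0) for m in re.finditer(pattern, text)]


stored = "who; why; when; where;"
print(words_with_prefix("w", stored))    # ['who', 'why', 'when', 'where']
print(words_with_prefix("whe", stored))  # ['when', 'where']
```

Note that without a leading \b the prefix may also match in the middle of a word.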

A regex for maximal periodic substrings

This is a follow-up to "A regex to detect periodic strings".
A period p of a string w is any positive integer p such that w[i]=w[i+p]
whenever both sides of this equation are defined. Let per(w) denote
the size of the smallest period of w . We say that a string w is
periodic iff per(w) <= |w|/2.
So informally a periodic string is just a string that is made up of another string repeated at least once. The only complication is that at the end of the string we don't require a full copy of the repeated string, as long as it is repeated in its entirety at least once.
For example, consider the string x = abcab. per(abcab) = 3 as x[1] = x[1+3] = a, x[2] = x[2+3] = b and there is no smaller period. The string abcab is therefore not periodic. However, the string ababa is periodic as per(ababa) = 2.
As more examples, abcabca, ababababa and abcabcabc are also periodic.
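The definition translates fairly directly into code. Here is a small, unoptimised Python sketch; the function names per and is_periodic are just chosen to mirror the notation above, and the input is assumed to be non-empty.

```python
def per(w):
    """Smallest period of a non-empty string w."""
    n = len(w)
    for p in range(1, n + 1):
        # p is a period if w[i] == w[i + p] wherever both sides exist.
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p


def is_periodic(w):
    """w is periodic iff per(w) <= |w| / 2."""
    return 2 * per(w) <= len(w)


print(per("abcab"), is_periodic("abcab"))  # 3 False
print(per("ababa"), is_periodic("ababa"))  # 2 True
```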
@horcruz, amongst others, gave a very nice regex to recognize a periodic string. It is
\b(\w*)(\w+\1)\2+\b
I would like to find all maximal periodic substrings in a longer string. These are sometimes called runs in the literature.
Formally, a substring w = T[i, j] with smallest period p is a maximal periodic substring if it is periodic and neither T[i-1] = T[i-1+p] nor T[j+1] = T[j+1-p] holds. Informally, the "run" cannot be contained in a larger "run" with the same period.
The four maximal periodic substrings (runs) of string T = atattatt are T[4,5] = tt, T[7,8] = tt, T[1,4] = atat, T[2,8] = tattatt.
The string T = aabaabaaaacaacac contains the following 7 maximal periodic substrings (runs):
T[1,2] = aa, T[4,5] = aa, T[7,10] = aaaa, T[12,13] = aa, T[13,16] = acac, T[1,8] = aabaabaa, T[9,15] = aacaaca.
The string T = atatbatatb contains the following three runs. They are:
T[1, 4] = atat, T[6, 9] = atat and T[1, 10] = atatbatatb.
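As a reference to compare any regex against, the runs can be found by brute force straight from the definition. This is only a sketch in Python with made-up function names, roughly O(n^4) and meant for short strings; positions are reported 1-indexed to match the examples.

```python
def smallest_period(w):
    """Smallest p with w[k] == w[k + p] wherever both sides exist."""
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[k] == w[k + p] for k in range(n - p)))


def runs(text):
    """Return (start, end, substring) for every maximal periodic substring."""
    n = len(text)
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            w = text[i:j + 1]
            p = smallest_period(w)
            if 2 * p > len(w):
                continue  # not periodic
            # The run is maximal if it cannot be extended by one character
            # on either side while keeping the period p.
            extends_left = i > 0 and text[i - 1] == text[i - 1 + p]
            extends_right = j + 1 < n and text[j + 1] == text[j + 1 - p]
            if not extends_left and not extends_right:
                found.append((i + 1, j + 1, w))
    return found


for run in runs("atattatt"):
    print(run)
```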
Is there a regex (with backreferences) that will capture all maximal
periodic substrings?
I don't really mind which flavor of regex but if it makes a difference, anything that the Python module re supports. However I would even be happy with PCRE if that makes the problem solvable.
(This question is partly copied from https://codegolf.stackexchange.com/questions/84592/compute-the-maximum-number-of-runs-possible-for-as-large-a-string-as-possible . )
Let's extend the regex version to the very powerful https://pypi.python.org/pypi/regex . This supports variable length lookbehinds for example.
This should do it, using Python's re module:
(?<=(.))(?=((\w*)(\w*(?!\1)\w\3)\4+))
Fiddle: https://regex101.com/r/aA9uJ0/2
Notes:
You must precede the string being scanned by a dummy character; the # in the fiddle. If that is a problem, it should be possible to work around it in the regex.
Get captured group 2 from each match to get the collection of maximal periodic substrings.
Haven't tried it with longer strings; performance may be an issue.
Explanation:
(?<=(.)) - look-behind to the character preceding the maximal periodic substring; captured as group 1
(?=...) - look-ahead, to ensure overlapping patterns are matched; see How to find overlapping matches with a regexp?
(...) - captures the maximal periodic substring (group 2)
(\w*)(\w*...\w\3)\4+ - @horcruz's regex, as proposed by OP
(?!\1) - negative look-ahead to group 1 to ensure the periodic substring is maximal
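For completeness, this is roughly how the regex can be driven from Python's re module. The helper name regex_runs and the choice of # as the dummy character are assumptions for illustration; the pattern itself is the one given above.

```python
import re

RUN_RE = re.compile(r"(?<=(.))(?=((\w*)(\w*(?!\1)\w\3)\4+))")


def regex_runs(text, dummy="#"):
    """Return (start, substring) pairs, with start 1-indexed in text.

    The dummy character is prepended so the look-behind also has
    something to capture at the very first position.
    """
    padded = dummy + text
    # Index k in the padded string is 1-indexed position k in text.
    return [(m.start(), m.group(2)) for m in RUN_RE.finditer(padded)]


for start, run in regex_runs("atattatt"):
    print(start, run)
```

Because the whole pattern consists of look-arounds, every match is zero-width and the interesting text lives in group 2.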
As pointed out by @ClasG, the result of my regex may be incomplete. This happens when two runs start at the same offset. Examples:
aabaab has 3 runs: aabaab, aa and aa. The first two runs start at the same offset. My regex will fail to return the shortest one.
atatbatatb has 3 runs: atatbatatb, atat, atat. Same problem here; my regex will only return the first and third run.
This may well be impossible to solve within the regex. As far as I know, there is no regex engine that is capable of returning two different matches that start at the same offset.
I see two possible solutions:
Ignore the missing runs. If I am not mistaken, then they are always duplicates; an identical run will follow within the same encapsulating run.
Do some postprocessing on the result. For every run found (let's call this X), scan earlier runs trying to find one that starts with the same character sequence (let's call this Y). When found (and not already 'used'), add an entry with the same character sequence as X, but the offset of Y.
I think it is not possible. Regular expressions cannot do complex nondeterministic jobs, even with backreferences. You need an algorithm for this.
This kind of depends on your input criteria... There is no infinite string of characters.. using back references you will be able to create a suitable representation of the last amount of occurrences of the pattern you wish to match.
Personally I would define buckets of length of input and then fill them.
I would then use automata to find patterns in the buckets and then finally coalesce them into larger patterns.
It's not how fast the RegEx is going to be in this case it's how fast you are going to be able to recognize a pattern and eliminate the invalid criterion.

Subsetting a string based on pre- and suffix

I have a column with these type of names:
sp_O00168_PLM_HUMAM
sp_Q8N1D5_CA158_HUMAN
sp_Q15818_NPTX1_HUMAN
tr_Q6FGH5_Q6FGH5_HUMAN
sp_Q9UJ99_CAD22_HUMAN
I want to remove everything before, and including, the second _ and everything after, and including, the third _.
I do not wish to remove based on number of characters, since this is not a fixed number.
The output should be:
PLM
CA158
NPTX1
Q6FGH5
CAD22
I have played around with these, but don't quite get it right..
library(stringr)
str_sub(x,-6,-1)
That’s not really a subset in programming terminology¹, it’s a substring. In order to extract partial strings, you’d usually use regular expressions (pretty much regardless of language); in R, this is accessible via sub and other related functions:
pattern = '^.*_.*_([^_]*)_.*$'
result = sub(pattern, '\\1', strings)
¹ Aside: taking a subset is, as the name says, a set operation, and sets are defined by having no duplicate elements and there’s no particular order to the elements. A string by contrast is a sequence which is a very different concept.
Another possible regular expression is this:
sub("^(?:.+_){2}(.+?)_.+", "\\1", vec)
# [1] "PLM" "CA158" "NPTX1" "Q6FGH5" "CAD22"
where vec is your vector of strings.
> gsub(".*_.*_(.*)_.*", "\\1", "sp_O00168_PLM_HUMAM")
[1] "PLM"
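For comparison only (this is Python, not R), the same extraction can be sketched either by splitting on the underscore or with the first regex from above. It assumes every name has exactly three underscores, as in the sample data; the helper names are made up.

```python
import re

names = [
    "sp_O00168_PLM_HUMAM",
    "sp_Q8N1D5_CA158_HUMAN",
    "sp_Q15818_NPTX1_HUMAN",
    "tr_Q6FGH5_Q6FGH5_HUMAN",
    "sp_Q9UJ99_CAD22_HUMAN",
]


def third_field(name):
    """Take the part between the second and third underscore."""
    return name.split("_")[2]


def third_field_regex(name):
    """Same result via the pattern '^.*_.*_([^_]*)_.*$'."""
    return re.sub(r"^.*_.*_([^_]*)_.*$", r"\1", name)


print([third_field(n) for n in names])
# ['PLM', 'CA158', 'NPTX1', 'Q6FGH5', 'CAD22']
```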

The set of all strings with even number of ‘aba’ over alphabet { a,b }

This is what I've come up with but it leaves out strings such as "baaabba", "bbbaaabba"...
b*a*((aba)a*b*(aba)a*)*b*
No aba
First, let's see how we would match a string with no abas at all. You'd want something like this:
(b|a+bb)*(a*b*)
At each point, we can match bs, but we need to look out for a - we can match an a (or a block of as) only if it is followed by bb. Lastly, near the end of the string, we are free to match a block of as and a block of bs.
Exactly One aba
Next, let's look at words with one aba. This is very similar to what we had before:
(b|a+bb)*(a+ba(b|a+bb)*)(a*b*)
We have the same pattern with (a+ba(b|a+bb)*) added in the middle - a+ba is our aba block, and (b|a+bb)* after it is again a block of as and bs which doesn't contain aba.
Note that the inner group (the parentheses around a+ba(b|a+bb)*) isn't needed - it's there for readability.
Exactly Two abas
(b|a+bb)*(a+ba(b|a+bb)*)(a+ba(b|a+bb)*)(a*b*)
Even Number of abas
(b|a+bb)*(a+ba(b|a+bb)*a+ba(b|a+bb)*)*(a*b*)
Similar to the previous one, but with a star around the inner group.
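If you want to sanity-check the final expression, a quick sketch in Python can compare it with a direct count on all short strings. Two assumptions here: the whole string must match (fullmatch), and occurrences are counted without overlap, which is what str.count does. The helper names are hypothetical.

```python
import re
from itertools import product

EVEN_ABA = re.compile(r"(b|a+bb)*(a+ba(b|a+bb)*a+ba(b|a+bb)*)*(a*b*)")


def has_even_aba(s):
    """True if the whole string matches the 'even number of aba' regex."""
    return EVEN_ABA.fullmatch(s) is not None


def mismatches(max_len):
    """Strings over {a, b} up to max_len where the regex disagrees with
    the parity of the non-overlapping count of 'aba'."""
    bad = []
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            if has_even_aba(s) != (s.count("aba") % 2 == 0):
                bad.append(s)
    return bad


print(has_even_aba("baaabba"))  # True, no aba at all
print(has_even_aba("aba"))      # False, exactly one aba
print(mismatches(6))            # an empty list means agreement up to length 6
```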
This one ^(((?!aba)[ab])*(aba((?!aba)[ab])*aba)*)*$ will do the work.
I assumed that you want the aba substrings not to overlap. In other words ababa is not a match.

Regular expressions, what a trouble!

I need your kind help to resolve this question.
I admit that I am not able to use regular expressions with Oracle PL/SQL, but I promise that I'll study them ASAP!
Please suppose you have a table with a column called MY_COLUMN of type VARCHAR2(4000).
This column is populated as follows:
Description of first no.;00123457;Description of 2nd number;91399399119;Third Descr.;13456
You can see that the values are composed of pairs of strings (containing all alphanumeric characters, and also dot, ', /, \, and so on) and numbers (which may begin with zero):
Description1;Number1;Description2;Number2;Description3;Number3;......;DescriptionN;NumberN
Of course, N is not known, this means that the number of couples for every record can vary from record to record.
In every couple the first element is always the string, and the second element is the number (which may begin with zero, I repeat).
The field separator is ALWAYS semicolon (;).
I would like to transform the numbers as follows:
00123457 ===> 001-23457
91399399119 ===> 913-99399119
13456 ===> 134-56
This means, after the first three digits of the number, I need to put a dash "-"
How can I achieve this using regular expressions?
Thank you in advance for your kind cooperation!
I don't know Oracle/PL/SQL, but I can provide a regex:
([[:digit:]]{3})([[:digit:]]+)
matches a number of at least four digits and remembers the first three separately from the rest.
RegexBuddy constructs the following code snippet from this:
DECLARE
    result VARCHAR2(255);
BEGIN
    result := REGEXP_REPLACE(subject, '([[:digit:]]{3})([[:digit:]]+)', '\1-\2', 1, 0, 'c');
END;
If you need to make sure that those numbers are always directly surrounded by ;, you can alter this slightly:
(^|;)([[:digit:]]{3})([[:digit:]]+)(;|$)
However, this will not work if two numbers can directly follow each other (12345;67890 will only match the first number). If that's not a problem, use
result := REGEXP_REPLACE(subject, '(^|;)([[:digit:]]{3})([[:digit:]]+)(;|$)', '\1\2-\3\4', 1, 0, 'c');
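To illustrate the intended transformation outside the database, here is a small Python sketch (not PL/SQL). It sidesteps the adjacent-numbers issue by splitting on the semicolon and only touching every second field, assuming the description;number alternation from the question holds; dash_numbers is a made-up name.

```python
import re


def dash_numbers(record):
    """Insert '-' after the first three digits of each number field."""
    fields = record.split(";")
    # Fields alternate description, number, description, number, ...
    # so the numbers sit at the odd indexes.
    for i in range(1, len(fields), 2):
        # Only fields that consist of at least four digits are changed.
        fields[i] = re.sub(r"^(\d{3})(\d+)$", r"\1-\2", fields[i])
    return ";".join(fields)


row = ("Description of first no.;00123457;"
       "Description of 2nd number;91399399119;Third Descr.;13456")
print(dash_numbers(row))
# Description of first no.;001-23457;Description of 2nd number;913-99399119;Third Descr.;134-56
```

Descriptions are never modified this way, even if they happen to contain long digit runs.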