Regex: Matching two regular expressions

Assume we have a collection of strings as input:
str1: ABCD
str2: ABCD
str3: AWXYD
str4: AWXYD
The goal is to remove duplicates and preserve the unique ones. Having the above input, our output should look like the following:
ABCD
AWXYD
Unfortunately, the machine that produces this collection is error-prone and sometimes drops some letters (see below). Luckily, we are able to detect that a part is missing, but we don't know how large the missing part is. In reality we have the following input:
str1: A?CD
str2: AB?D
str3: AWXYD
str4: A?D
where ? indicates the missing part.
In this example, we would like to preserve A?CD or AB?D and also AWXYD.
In an attempt to solve this, I substituted ? with .* and assumed these strings to be regular expressions:
Reg1 --> A.*CD
Reg2 --> AB.*D
Reg3 --> AWXYD
Reg4 --> A.*D
Now I am trying to identify the duplicates by comparing regular expressions. Using this approach, one can easily match Reg4 to Reg3 because Reg3 is actually a string (no missing part). Things become complicated when both have missing parts and therefore you have to compare regular expressions.
I wonder if it is possible or if there is a better solution for this.
Thanks!
Edit 1: Note that str1 and str2 might come from different strings (e.g. AXCD and ABXD). Our goal is to remove any (possible) duplicates and be sure that the preserved strings are unique (even if we remove more than necessary). This is why we preserve either str1 or str2. Thanks to Aaron, who pointed this out.
Edit 2: There are millions of strings. That's why an algorithm is needed.

I don't think regular expressions are appropriate for such a task. If you are asking whether there is a ready-made way to compare regular expressions, the answer is no; at least I haven't seen one. If you are asking whether it can be implemented, I would say yes. A regex can be represented as a finite state machine, which is a graph, and two such machines can be compared (for example, by checking whether their minimized forms are isomorphic). But doing this for general regular expressions would be enormously complex. Three things that come to mind are the Levenshtein distance algorithm, a binary search tree (a very search-efficient data structure) and the blackboard architecture. You will also find several answers here that can help you. Good luck!
P.S. PostgreSQL has a fuzzystrmatch module with a Levenshtein implementation.

I think the problem is your pattern.
Reg1 --> A.*CD
and
Reg2 --> AB.*D
Sometimes they match the same text, e.g.
ABCD
Either Reg1 or Reg2 can match this text. That means Reg1 and Reg2 overlap.
You might solve your problem by changing your pattern to
Reg1 --> A(?!B).*CD
// (?!B) means the second character can be any letter except `B`
and
Reg2 --> A.*(?<!C)D
// (?<!C) means the second-to-last character can be any letter except `C`
Otherwise you cannot distinguish between these two patterns.
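For what it's worth, the pairwise check this thread is circling around (can two strings with gaps stand for the same original string?) does not need a regex engine. Below is a minimal Python sketch, assuming ? never occurs as a real symbol and stands for any run of characters, including an empty one. The names may_overlap and dedupe are made up for illustration.

```python
from functools import lru_cache

GAP = "?"  # assumed marker for a missing part of unknown length (may be empty)


def may_overlap(p, q):
    """Return True if some concrete string fits both gapped strings p and q."""

    @lru_cache(maxsize=None)
    def go(i, j):
        # Both strings fully consumed: a common concrete string exists.
        if i == len(p) and j == len(q):
            return True
        if i < len(p) and p[i] == GAP:
            # The gap in p is empty, or it absorbs the next item of q.
            return go(i + 1, j) or (j < len(q) and go(i, j + 1))
        if j < len(q) and q[j] == GAP:
            # Symmetric case for a gap in q.
            return go(i, j + 1) or (i < len(p) and go(i + 1, j))
        # Two literals must agree.
        return i < len(p) and j < len(q) and p[i] == q[j] and go(i + 1, j + 1)

    return go(0, 0)


def dedupe(strings):
    """Greedy pass: keep a string only if it cannot equal any kept string."""
    kept = []
    for s in strings:
        if not any(may_overlap(s, k) for k in kept):
            kept.append(s)
    return kept


print(dedupe(["A?CD", "AB?D", "AWXYD", "A?D"]))  # ['A?CD', 'AWXYD']
```

The greedy pass compares every string against everything kept so far, so it is quadratic. For millions of inputs some bucketing (for instance by the literal prefix before the first gap) would still be needed on top of this.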

Related

A Regex to ignore a set of words

Is there a way to set regex to ignore a set of words separated by space?
I have different products names like:
"Matrix 10X, 10 ml + DISPENSER"
"Matrix 10X,10ml + DISPENSER" where the quantity varies
What I'm trying to do is to replace using regex all words except for:
"10 ml" | "10 ML" | "10ml" ---> these are to be ignored
I have found a code to replace all characters except words separated by space (like "10 ml")
https://regex101.com/r/bG8vB4/5
and to replace them when they are together (like "10ml")
https://regex101.com/r/bG8vB4/4
but I can't find a way to combine them to keep just "10 ml" OR "10 ML" OR "10ml" and remove all other characters up to the end of the string
Regular expressions are a mathematical model for efficient recognition of strings. It is easy to write a regular expression that matches a string if it contains any of some words. Theory shows that a regular expression that matches a string only if it contains none of those words also exists, but obtaining it is far more complex.
In regular language theory, every regular expression can be converted into a finite automaton. An automaton that accepts exactly the strings the original rejects is obtained by making the automaton deterministic and complete and then swapping accepting and non-accepting states. The hardest part is then to build a regular expression from that automaton (this is always possible, but the resulting expression is, in general, far more complex than the original). An example, even a simple one, shows how complex this gets (of course, some regexp libraries offer an operator for this, but you don't say whether the one you are using does). One such example is recognizing a C language comment. A comment is a string delimited by the sequences /* and */, and the inner part cannot contain the sequence */.
The first approach could be to use the following regexp:
\/\*.*\*\/
but that fails, because the inner .* also matches */, so /* bla bla bla */ bla bla bla */ is recognized as a single comment (it should end at the first */). So we need a regexp that recognizes anything except text that includes */.
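A first attempt at such a subexpression is:
([^*]|\*[^/])*
which means an indefinite concatenation of characters other than *, or two-character sequences that start with * and are not followed by /. This leads to the regexp:
\/\*([^*]|\*[^/])*\*\/
This stops at the first */ in the example above, but it is not yet exact: \*[^/] can consume ** and leave a following / to [^*], so the body may still contain */, and a comment such as /***/ is not matched at all. A form that handles runs of stars is:
\/\*([^*]|\*+[^*/])*\*+\/
(now you see how things get complicated)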
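To see the difference in practice, here is a small Python sketch that runs three patterns over the example line: the greedy first attempt, the negated-class version from this answer, and a star-aware variant. The star-aware variant is an addition for comparison, not part of the original answer, and the variable names are illustrative only.

```python
import re

text = "/* bla bla bla */ bla bla bla */"

greedy = re.compile(r"/\*.*\*/")                     # first attempt
simple = re.compile(r"/\*([^*]|\*[^/])*\*/")         # negated-class version
star_aware = re.compile(r"/\*([^*]|\*+[^*/])*\*+/")  # handles runs of '*'

print(greedy.search(text).group(0))      # the whole line, which is too much
print(simple.search(text).group(0))      # /* bla bla bla */
print(star_aware.search(text).group(0))  # /* bla bla bla */

# A comment made only of stars separates the last two patterns.
print(simple.search("/***/"))                # None
print(star_aware.search("/***/").group(0))   # /***/
```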
To extend this to a whole word (longer than two letters), for example word, you can allow:
([^w]|w[^o]|wo[^r]|wor[^d])*
in the set, and if you have two words (like foo and bar) you have to write something like:
([^fb]|f[^o]|fo[^o]|b[^a]|ba[^r])*
(the single-character alternatives have to be merged into [^fb], because [^f] alone would accept b and [^b] alone would accept f). Each word adds its own alternatives, which makes the final regexp complicated. The words can also interact if one is a prefix of another or if they share leading characters, so treat these expressions as sketches rather than exact complements. A further problem is that many regexp libraries are backtracking engines that treat the | operator as ordered rather than commutative (they try the alternatives left to right), which can lead to unexpected results.
You have also not explained what you mean by ignoring. Matching the words and skipping over them is different from ignoring the whole line they appear on, and the regexps (and the definition of the problem you need to solve) are then quite different. My explanation was about rejecting a full sentence if it contains any of the words, which is probably not what you mean. So please explain (in your question) what you mean by:
accepting a sentence that contains a word.
rejecting such a sentence.
what exactly you are rejecting (or ignoring).
Rejecting just a word is simply selecting a sentence that contains that word and marking the word so you can skip over it. But that's a different problem, and it requires selecting sentences that do contain the word.
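For libraries that do offer an operator for "does not contain", the usual tool is a negative lookahead. A minimal Python sketch of the "reject the whole line if it contains any of the words" reading follows; the word list and the helper name accepts are hypothetical, and whether this is what the question wants is exactly the open point above.

```python
import re

words = ["foo", "bar"]  # hypothetical words that make a line unacceptable

# (?!...) fails the match if any listed word occurs anywhere in the line.
alternation = "|".join(re.escape(w) for w in words)
no_banned_word = re.compile(r"^(?!.*\b(?:" + alternation + r")\b).*$")


def accepts(line):
    """True if the line contains none of the listed words as whole words."""
    return no_banned_word.match(line) is not None


print(accepts("nothing to see here"))  # True
print(accepts("a foo walks in"))       # False
print(accepts("food is fine"))         # True, 'food' is not the word 'foo'
```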

Regex issue with car submodels

I'm pulling car submodels from the DB and I'm building my regular expression on the fly.
Here is an example of a search string:
EX-L Sedan 4-Door
Here is my regular expression:
preg_match("/LX|EX|EX-L|LX-P|LX-S/Ui", $input_line, $output_array);
For some reason the output is EX and not EX-L as it is supposed to be. Can someone explain why?
Your pattern is unanchored and thus the first alternative that matches a substring makes the regex engine stop processing the whole group. This is a common behavior with NFA regexes.
Also, there are no quantifiers in your pattern, thus the /U modifier is redundant.
So, you can use
/EX-L|LX-P|LX-S|LX|EX/i
It is a readable form. However, best practice with regexes is to make sure no alternative branch can match at the same location as another. That means you can use
/EX(-L)?|LX(-[PS])?/i
As others have pointed out, the reason for this undesired outcome is because the regex engine is happy to have the first alternative and run for the door since your pattern has no anchors (like: ^, $, and some other lesser known ones). This is the same short-circuiting behavior you'd see in php's if($x || $y) conditions; if $x is true there is no need to evaluate further. But enough about that...
I would like to offer some additional logic that I think is relevant to your case/question.
You say your regex is built on the fly, so I am assuming your method goes something like this:
A user identifies which substrings/keywords they want to search for.
$strings=array('LX','EX','EX-L','LX-P','LX-S');
// array of substrings in any order
As mentioned earlier, you need longer strings to precede shorter ones with identical starting characters.
rsort($strings);
// sort DESC, longer strings precede shorter strings when leading characters match
Pipe all strings into a single regex pattern with implode().
$piped_regex='/\b(?:'.implode('|',$strings).')\b/i';
// word boundaries ensure the string is not part of a larger word; remove if not desired
// pattern: /\b(?:LX-S|LX-P|LX|EX-L|EX)\b/i
While programmatically condensing your similar strings into a concise pattern as Wiktor recommended is possible, it's probably not worth the effort with your on-the-fly patterns.
Finally run preg_match() as normal.
$input_line='EX-L Sedan 4-Door';
if(preg_match($piped_regex,$input_line,$output_array)){
var_export($output_array);
}
// output: array(0=>'EX-L')
I hope stepping out this method is helpful to you and future SO readers.
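The same behaviour is easy to reproduce outside PHP. Here is a rough Python equivalent of the steps above, assuming the submodel list comes from the database as plain strings; it sorts by length rather than using rsort(), which is a swapped-in but equivalent way to put longer alternatives first, and it escapes each string in case one contains regex metacharacters.

```python
import re

submodels = ["LX", "EX", "EX-L", "LX-P", "LX-S"]  # any order
line = "EX-L Sedan 4-Door"

# Alternatives in the original order: the first branch that matches wins.
naive = re.compile("|".join(re.escape(s) for s in submodels), re.IGNORECASE)

# Longer strings first, so EX-L is tried before EX.
ordered = sorted(submodels, key=len, reverse=True)
longest_first = re.compile(
    r"\b(?:" + "|".join(re.escape(s) for s in ordered) + r")\b",
    re.IGNORECASE,
)

print(naive.search(line).group(0))          # EX
print(longest_first.search(line).group(0))  # EX-L
```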

Pattern matching for strings independent from symbols

I need an algorithm which can find pre-defined patterns in data (which is present in the form of strings) independent of the actual symbols/characters of the data and the pattern. I only care about the relations between the symbols, not the symbols themselves. It is also legal to have different pattern symbols for the same symbol in the data. The only thing the pattern matching algorithm has to enforce is that multiple occurrences of the same symbol in the pattern are preserved. To give you an example:
The pattern is abca, so the first and the last letter are the same. For my application, an equivalent way to write this would be 1 2 3 1, where the digits are just variables. The data I have is thistextisatest. The resulting algorithm should give me two correct matches here, text and test. Because only in these two cases, the first and the fourth letter are the same, as in the pattern.
As a second example, the pattern abcd should return 12 matches (one for each position in thistextisat). Since no variable in the pattern is repeated, it is trivially matched everywhere. Even in the case of text and test, because it is legal that the variables a and d of the pattern map to the same symbol.
The goal of this algorithm should be to detect similarities in written language. Imagine having a dictionary of the English language and parsing it with the pattern unseen or equivalently 1 2 3 4 4 2. You would then see that, for example, the word belittle contains the same pattern of letters.
So, now that I hopefully made clear what I need, I have some questions:
What is this algorithm called? Is it a well-known problem that has been solved?
Are there publications on the matter? It is really hard to find anything useful when you don't know the correct search terms to separate this problem from regular pattern matching.
Is there a ready implementation of this?
I have not used Regex for anything too complicated, so I don't know if anything like this would even be possible in Regex, when you basically do not care about the symbols as such, but only consider the pattern of their occurrences.
I'd really appreciate your help!
I don't think you need regular expressions here. Your search term:
unseen
123442
This has six characters, so index each word of your text into 6-mers
belittle
12,12,12,12,11,12,12 2-mers
123,123,123,122,112,123 3-mers
1234,1234,1233,1223,1123 4-mers
12345,12344,12331,12234 5-mers
123455,123442,123314 6-mers
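So just looking at the 6-mers, you've got a match: 123442. To allow for the abcd (1234) case matching an abca (1231) word, an exact signature match is not the only acceptable outcome: a window also matches whenever every pair of positions that share a digit in the search term shares a digit in the window. A plain numeric comparison is not reliable for this (1223 is less than 1231, yet its first and fourth symbols differ).
So given a search term of n characters, split each word into its constituent n-mers and check each one against that rule.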
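A short Python sketch of this n-mer idea, assuming the rule stated in the question: repeated pattern symbols must map to the same data symbol, while distinct pattern symbols are allowed to coincide. The function names are invented for illustration.

```python
def signature(s):
    """Number the symbols of s in order of first appearance, e.g. unseen -> 1 2 3 4 4 2."""
    ids = {}
    return tuple(ids.setdefault(ch, len(ids) + 1) for ch in s)


def window_matches(pattern, window):
    """True if equal pattern symbols always line up with equal window symbols."""
    bound = {}
    for p, w in zip(pattern, window):
        # Bind p to w on first sight; later occurrences must agree.
        if bound.setdefault(p, w) != w:
            return False
    return True


def find_matches(pattern, text):
    """Return every n-mer of text (n = len(pattern)) that fits the pattern."""
    n = len(pattern)
    windows = (text[i:i + n] for i in range(len(text) - n + 1))
    return [w for w in windows if window_matches(pattern, w)]


print(signature("unseen"))                      # (1, 2, 3, 4, 4, 2)
print(find_matches("abca", "thistextisatest"))  # ['text', 'test']
print(len(find_matches("abcd", "thistextisatest")))  # 12
print(find_matches("unseen", "belittle"))       # ['elittl']
```

Binding symbols in a dictionary instead of comparing signatures as numbers is a design choice made here so that the abcd case matches everywhere without special handling.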

Derive RegExp from set of strings

Imagine there is an arbitrary set of strings. Suppose that they are all identical except for one run of consecutive characters (if this assumption does not hold, I'm fine with returning an error). I want to derive a regular expression that identifies the portion of the strings that differs.
Input:
"Hello Alice, I'm Bob.", "Hello John, I'm Bob.", "Hello Josh, I'm Bob."
Output:
"Hello (.+), I'm Bob."
Input:
"Monday", "Tree", "Dog"
Output:
Error
Maybe finding the longest common substrings or the Levenshtein distance could help? I'm not sure yet if one of them really applies to my problem or how to use them to solve it.
You had a problem and decided to use regexp to solve it -- now you have two problems. :-)
All kidding aside, you can break this down into two steps:
Identify differences between strings.
Look at all the differences and figure out a regexp to match them.
For (1), it's a matter of using a diff-computing library in your language (like difflib in Python) to find a list of identical regions between two strings. If all strings have common segments, then compare string-1 to each of string-[2..N] to analyze the resulting identical blocks (you have to be smart about comparing both the contents of each block and its position relative to other identical blocks). Extract and record text between the identical blocks too.
For your example, you'd get two identical blocks every time you compare: "Hello " and ", I'm Bob.".
The text between the identical blocks will be these strings: "Alice", "John", "Josh".
For (2), the most trivial solution is to combine your findings into a quite literal regexp composed of:
Hello + (Alice|John|Josh) + , I'm Bob.
Or, replace any segment between the same identical blocks found in all strings with .*. Consider making that a non-greedy match -- .*?.
I don't know automata theory and can't help you with DFA/NFA, but that's a solid direction to go if you needed more precision.
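For the single-gap case in the question, a full diff is not strictly required: the shared prefix and shared suffix are enough. Here is a minimal Python sketch under that assumption (exactly one differing region, non-empty in every string). It swaps difflib for os.path.commonprefix, and derive_pattern is a made-up name.

```python
import os
import re


def derive_pattern(strings):
    """Build 'prefix(.+)suffix' from the text shared by all strings."""
    prefix = os.path.commonprefix(strings)
    suffix = os.path.commonprefix([s[::-1] for s in strings])[::-1]

    # Do not let prefix and suffix overlap inside the shortest string.
    shortest = min(len(s) for s in strings)
    keep = min(len(suffix), shortest - len(prefix))
    suffix = suffix[len(suffix) - keep:]

    if not prefix and not suffix:
        raise ValueError("strings share no common prefix or suffix")
    return re.escape(prefix) + "(.+)" + re.escape(suffix)


pattern = derive_pattern(
    ["Hello Alice, I'm Bob.", "Hello John, I'm Bob.", "Hello Josh, I'm Bob."]
)
print(re.fullmatch(pattern, "Hello Carol, I'm Bob.").group(1))  # Carol
```

The literal parts are passed through re.escape, so the printed pattern contains backslashes (for the final dot, for example) rather than the exact text shown in the question.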

Finding a string *and* its substrings in a haystack

Suppose you have a string (e.g. needle). Its 19 distinct contiguous substrings are:
needle
needl eedle
need eedl edle
nee eed edl dle
ne ee ed dl le
n e d l
If I were to build a regex to match, in a haystack, any of the substrings I could simply do:
/(needle|needl|eedle|need|eedl|edle|nee|eed|edl|dle|ne|ee|ed|dl|le|n|e|d|l)/
but it doesn't look really elegant. Is there a better way to create a regex that will greedily match any one of the substrings of a given string?
Additionally, what if I posed another constraint, wanted to match only substrings longer than a threshold, e.g. for substrings of at least 3 characters:
/(needle|needl|eedle|need|eedl|edle|nee|eed|edl|dle)/
note: I deliberately did not mention any particular regex dialect. Please state which one you're using in your answer.
As Qtax suggested, the expression
n(e(e(d(l(e)?)?)?)?)?|e(e(d(l(e)?)?)?)?|e(d(l(e)?)?)?|d(l(e)?)?|l(e)?|e
would be the way to go if you wanted to write an explicit regular expression (egrep syntax, optionally replace (...) by (?:...)). The reason why this is better than the initial solution is that the condensed version requires only O(n^2) space compared to O(n^3) space in the original version, where n is the length of the input. Try this with extraordinarily as input to see the difference. I guess the condensed version is also faster with many regexp engines out there.
The expression
nee(d(l(e)?)?)?|eed(l(e)?)?|edl(e)?|dle
will look for substrings of length 3 or longer.
As pointed out by vhallac, the generated regular expressions are a bit redundant and can be optimized. Apart from the proposed Emacs tool, there is a Perl package Regexp::Optimizer that I hoped would help here, but a quick check failed for the first regular expression.
Note that many regexp engines perform non-overlapping search by default. Check this with the requirements of your problem.
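Writing the condensed expression by hand gets tedious for longer words, so it is easier to generate it. Below is a small Python sketch that produces the two expressions above; substring_regex and its min_len parameter are names chosen for illustration, and the output uses plain capturing groups as in the egrep form.

```python
import re


def substring_regex(word, min_len=1):
    """Build the nested-optional regex for all substrings of at least min_len."""

    def optional_tail(rest):
        # 'dle' -> '(d(l(e)?)?)?': each further character is optional,
        # but only if the one before it was present.
        if not rest:
            return ""
        return "(" + re.escape(rest[0]) + optional_tail(rest[1:]) + ")?"

    branches = []
    for start in range(len(word) - min_len + 1):
        head = word[start:start + min_len]   # mandatory part
        tail = word[start + min_len:]        # optional extension
        branches.append(re.escape(head) + optional_tail(tail))
    return "|".join(branches)


print(substring_regex("needle"))
print(substring_regex("needle", 3))
print(re.search(substring_regex("needle", 3), "xxedlexx").group(0))  # edle
```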
I have found an elegant almost-solution, depending on how badly you need a single regexp. For example, here is a regexp that finds a common substring of length 7 (Perl):
"$needle\0$heystack" =~ /(.{7}).*?\0.*\1/s
The matching string is in $1 (the first capture group). The strings must not contain the null character, which is used as a separator.
You should write a loop that starts with the length of the needle, goes down to the threshold, and tries to match the regexp at each length.
Is there a better way to create a regex that will match any one of the
substrings of a given string?
No. But you can generate such expression easily.
Perhaps you're just looking for
.*(.{1,6}).*