Find the simplest regex query to match a set of examples - regex

The online service Kimono provided a GUI for a user to select page elements, then used the selected elements to create a regex that matched those selections. This regex could then be used to extract information from the same page at different points in time. The service was useful because you don't have to write the regex yourself; instead you provide a set of example matches, which are then compiled into a regex. The company was acquired, so the service is no longer available.
However, the problem seems interesting, so my question is this: what algorithm can turn a number of examples (both positive and negative are needed) in a large document into a regex that, when applied, matches those examples?

Regular expressions are typically implemented with NFAs and DFAs.
https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
https://en.wikipedia.org/wiki/Deterministic_finite_automaton
The process of finding the smallest DFA to represent a particular DFA is known as minimization.
https://en.wikipedia.org/wiki/DFA_minimization
The minimized DFA then needs to be converted back into a regular expression.
https://cs.stackexchange.com/questions/2016/how-to-convert-finite-automata-to-regular-expressions
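As a rough illustration of the minimization step, here is a minimal Python sketch that builds a prefix-tree acceptor from the positive examples and merges equivalent states, which is minimization for the special case of an acyclic DFA. The function names and the END marker are hypothetical, chosen only for this example. It accepts exactly the positive examples and does not generalize beyond them, so it is a starting point rather than a full example-to-regex algorithm; converting the result back to a regex is left out.

```python
from typing import Dict, Iterable, Tuple

END = ""  # marker key: a word may end in this state (never equals a real character)


def build_trie(words: Iterable[str]) -> dict:
    """Prefix-tree acceptor: one path per example string."""
    root: dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def minimize(trie: dict) -> Tuple[int, Dict[int, tuple]]:
    """Merge states whose outgoing transitions are identical."""
    registry: Dict[tuple, int] = {}  # transition signature -> state id

    def visit(node: dict) -> int:
        signature = tuple(sorted((ch, visit(child)) for ch, child in node.items()))
        return registry.setdefault(signature, len(registry))

    start = visit(trie)
    return start, {state: sig for sig, state in registry.items()}


def accepts(start: int, states: Dict[int, tuple], text: str) -> bool:
    """Run the merged automaton over text."""
    state = start
    for ch in text:
        transitions = dict(states[state])
        if ch not in transitions:
            return False
        state = transitions[ch]
    return END in dict(states[state])


positives = ["abc", "abbc", "aac"]
negatives = ["ac", "abbbc"]
start, states = minimize(build_trie(positives))
assert all(accepts(start, states, p) for p in positives)
assert not any(accepts(start, states, n) for n in negatives)
```

Checking the negative examples against the automaton, as in the last line, is what tells you whether a later generalization step has gone too far.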

Related

Regex match hyphenated word with hyphen-less query

I have an Azure Storage Table set up that possesses lots of values containing hyphens, apostrophes, and other bits of punctuation that the Azure Indexers don't like. Hyphenated-Word gets broken into two tokens — Hyphenated and Word — upon indexing. Accordingly, this means that searching for HyphenatedWord will not yield any results, regardless of any wildcard or fuzzy matching characters. That said, Azure Cognitive Search possesses support for Regex Lucene queries...
As such, I'm trying to find out if there's a Regex pattern I can use to match words with or without hyphens to a given query. As an example, the query homework should match the results homework and home-work.
I know that if I were trying to do the opposite — match unhyphenated words even when a hyphen is provided in the query — I would use something like /home(-)?work/. However, I'm not sure what the inverse looks like — if such a thing exists.
Is there a raw Regex pattern that will perform the kind of matching I'm proposing? Or am I SOL?
Edit: I should point out that the example I provided is unrealistic because I won't always know where a hyphen should be. Optimally, the pattern that performs this matching would be agnostic to the precise placement of a hyphen.
Edit 2: A solution I've discovered that works but isn't exactly optimal (and, though I have no way to prove this, probably isn't performant) is to just break down the query, remove all of the special characters that cause token breaks, and then dynamically build a regex query that has an optional match in between every character in the query. Using the homework example, the pattern would look something like [-'\.! ]?h[-'\.! ]?o[-'\.! ]?m[-'\.! ]?e[-'\.! ]?w[-'\.! ]?o[-'\.! ]?r[-'\.! ]?k[-'\.! ]?...which is perhaps the ugliest thing I've ever seen. Nevertheless, it gets the job done.
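The pattern from Edit 2 does not have to be written by hand. Below is a minimal sketch in Python of generating it from a query; the function name and the separator set are assumptions for illustration. It is checked here with Python's re module, and Lucene's regex dialect differs in places, so the escaping may need adjusting before it is sent to Azure Cognitive Search.

```python
import re

# Assumed set of token-breaking characters, taken from the pattern above.
SEPARATORS = "-'.! "


def loose_pattern(query: str, separators: str = SEPARATORS) -> str:
    """Allow one optional separator before, between and after every character."""
    optional_sep = "[" + re.escape(separators) + "]?"
    # Drop separators already present in the query, escape everything else.
    chars = [re.escape(ch) for ch in query if ch not in separators]
    return optional_sep + optional_sep.join(chars) + optional_sep


pattern = loose_pattern("homework")
assert re.fullmatch(pattern, "homework")
assert re.fullmatch(pattern, "home-work")
assert not re.fullmatch(pattern, "homeworks")
```

Because separators in the query are dropped first, loose_pattern("home-work") produces the same pattern as loose_pattern("homework").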
My solution to scenarios like this is always to introduce content- and query-processing.
Content processing is easier when you use the push model via the SDK, but you could achieve the same by creating a shadow/copy of your table where the content is manipulated for indexing purposes. You let your original table stay intact. And then you maintain a duplicate table where your text is processed.
Query processing is something you should use regardless. In its simplest form you want to clean the input from the end users before you use it in a query. Additional steps can be to handle special characters like a hyphen. Either escape it, strip it, or whatever depending on what your requirements are.
EXAMPLE
I have to support searches for ordering codes that may contain hyphens or other special characters. The maintainers of our ordering codes may define ordering codes in an inconsistent format. Customers visiting our sites are just as inconsistent.
The requirement is that ABC-123-DE_F-4.56G should match any of
ABC-123-DE_F-4.56G
ABC123-DE_F-4.56G
ABC_123_DE_F_4_56G
ABC.123.DE.F.4.56G
ABC 123 DEF 56 G
ABC123DEF56G
I solve this using my suggested approach above. I use content processing to generate a version of the ordering code without any special characters (using a simple regex). Then, I use query processing to transform the end user's input into an OR-query, like:
<verbatim-user-input-cleaned> OR OrderCodeVariation:<verbatim-user-input-without-special-chars>
So, if the user entered ABC.123.DE.F.4.56G I would effectively search for
ABC.123.DE.F.4.56G OR OrderingCodeVariation:ABC123DEF56G
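A minimal sketch of the two processing steps in Python, under some assumptions: "special character" is taken to mean anything other than an ASCII letter or digit, the helper names are hypothetical, and the field name is copied from the example above. A real implementation would also need to escape characters that the query syntax treats as operators.

```python
import re

# Assumption: every character that is not an ASCII letter or digit is "special".
SPECIAL = re.compile(r"[^A-Za-z0-9]")


def strip_special(code: str) -> str:
    """Content processing: the variation stored in the extra, searchable field."""
    return SPECIAL.sub("", code)


def build_query(user_input: str, field: str = "OrderingCodeVariation") -> str:
    """Query processing: search the cleaned input OR its stripped variation."""
    cleaned = user_input.strip()
    return f"{cleaned} OR {field}:{strip_special(cleaned)}"


print(build_query("  ABC.123.DE.F.4.56G "))
```

The same strip_special function is applied at indexing time and at query time, which is what makes inconsistently formatted codes meet in the middle.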
It sounds like you want to define your own tokenization. Would using a custom tokenizer help? https://learn.microsoft.com/azure/search/index-add-custom-analyzers
To add onto Jennifer's answer, you could consider using a custom analyzer consisting of either of these token filters:
pattern_replace: A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.
pattern_capture: Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
You could use the pattern_replace token filter to replace hyphens with the desired replacement, for example an empty string.

Regular Expressions - Greater than / less than conditionals allowed?

I have a page that returns a list of buyables (i.e. number available) and their associated product ID:
{"buyable":1,"prodId":444123,"
I want to retrieve the product ID from the page with a buyable number greater than 1, is it possible to do this within a regular expression?
EDIT: I have the following regular expression to grab the appropriate groups, but am not finding a good way to set up a conditional statement within the regular expression to filter out the non-"1" buyable items.
.*\"buyable\":([0-9]+),.*\"prodId\":([0-9]+),
or am I drastically overthinking this and I just need to use the below instead?
.*\"buyable\":([1-9]+),.*\"prodId\":([0-9]+),
I think you should be using JSONPath to tackle this problem.
Anyway, if you want to use a regex, then you could use:
.*\"buyable\":(?:[02-9]|[0-9]{2,}),.*\"prodId\":([0-9]+),
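Note that a class such as [02-9] also admits 0, so it expresses "not 1" rather than "greater than 1". Two ways to express the numeric condition are sketched below in Python, assuming that prodId immediately follows buyable as in the sample; the function and constant names are hypothetical. The first captures both numbers and compares them in code, the second encodes "2 to 9, or any multi-digit number without a leading zero" directly in the pattern.

```python
import re
from typing import List

# Capture both numbers, then do the numeric comparison outside the regex.
PAIR = re.compile(r'"buyable":(\d+),"prodId":(\d+)')


def prod_ids_with_buyable_over(text: str, threshold: int = 1) -> List[str]:
    return [prod for buyable, prod in PAIR.findall(text) if int(buyable) > threshold]


# Pure-regex alternative: a single digit 2-9, or two or more digits
# that do not start with 0.
GREATER_THAN_ONE = re.compile(r'"buyable":(?:[2-9]|[1-9][0-9]+),"prodId":(\d+)')

sample = '{"buyable":1,"prodId":444123,"x":0},{"buyable":12,"prodId":555,"x":0}'
assert prod_ids_with_buyable_over(sample) == GREATER_THAN_ONE.findall(sample)
```

Doing the comparison in code is usually easier to change later, for instance when the threshold stops being 1.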

Combine multiple regexes into one / build small regex to match a set of fixed strings

The situation:
We created a tool Google Analytics Referrer Spam Killer, which automatically adds filters to Google Analytics to filter out spam.
These filters exclude traffic which comes from certain spammy domains. Right now we have 400+ spammy domains in our list.
To remove the spam, we add a regex (like so domain1.com|domain2.com|..) as a filter to Analytics and tell Analytics to ignore all traffic which matches this filter.
The problem:
Google Analytics has a 255-character limit for each regex (one regex per filter). Because of that we must create a lot of filters to cover all 400+ domains (currently 30+ filters). The problem is that there is another limit: the number of write operations per day. Each new filter costs 3 more write operations.
The question:
I want to find the shortest regex that matches exactly the same strings as another regex.
For example you need to match the following strings:
`abc`, `abbc` and `aac`
You could match them with the following regexes: /^(abc|abbc|aac)$/, /^a(b|bb|a)c$/, /^a(bb?|a)c$/, etc.
Basically I'm looking for an expression which matches exactly the same strings as /^(abc|abbc|aac)$/, but is shorter in length.
I found multiregexp, but as far as I can tell, it doesn't create a new regex out of another expression which I can use in Analytics.
Is there a tool which can optimize regexes for length?
I found this C tool which compiles on Linux: http://bisqwit.iki.fi/source/regexopt.html
Super easy:
$ ./regex-opt '123.123.123.123'
(123.){3}123
$ ./regex-opt 'abc|abbc|aac'
(aa|ab{1,2})c
$ ./regex-opt 'aback|abacus|abacuses|abaft|abaka|abakas|abalone|abalones|abamp'
aba(ck|ft|ka|lone|mp|(cu|ka|(cus|lon)e)s)
I wasn't able to run the tool suggested by @sln. It looks like it makes an even shorter regex.
I'm not aware of an existing tool for combining / compressing / optimising regexes. There may be one. Maybe by building a finite-state machine out of a regex and then generating a regex back out of that?
You don't need to solve the problem for the general case of arbitrary regexes. I think it's a better bet to look at creating compact regexes to match any of a given set of fixed strings.
There may already be some existing code for making an optimised regex to match a given set of fixed strings, again, IDK.
To do it yourself, the simplest thing would be to sort your strings and look for common prefixes / suffixes, e.g. (afoo|bbaz|c)bar\.com. Looking for common strings in the middle is less easy. You might want to look at algorithms used for lossless data compression for finding redundancy.
You'd ideally want to spot cases where you could use a foo[a-d] range instead of a foo(a|b|c|d), and various other things.
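The prefix part of that idea can be sketched in Python with a trie, as below. This is only an illustration with hypothetical function names: it shares common prefixes, collapses single-character alternatives into a character class, and marks optional tails with ?, but it does not share suffixes or detect ranges, so the result is not guaranteed to be the shortest possible regex and can even be longer than the plain alternation for small inputs.

```python
import re
from typing import Iterable

END = ""  # marker key: a word may end at this node


def build_trie(words: Iterable[str]) -> dict:
    root: dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def trie_to_regex(node: dict) -> str:
    """Emit a regex matching exactly the strings stored below node."""
    can_end = END in node
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != END]
    if not branches:
        return ""
    if len(branches) == 1 and not can_end:
        return branches[0]  # plain concatenation, no grouping needed
    if all(len(b) == 1 for b in branches):
        # Unescaped single characters are safe to place in a character class.
        body = branches[0] if len(branches) == 1 else "[" + "".join(branches) + "]"
    else:
        body = "(" + "|".join(branches) + ")"
    return body + "?" if can_end else body


domains = ["spam-one.com", "spam-two.com", "spammy.net"]
pattern = trie_to_regex(build_trie(domains))
assert all(re.fullmatch(pattern, d) for d in domains)
```

Running the same construction on the reversed strings would give suffix sharing instead; combining both is where a proper automaton minimization becomes the more natural tool.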

Efficiently match table of regex to a string

Typically, you have a regular expression and lots of strings to process.
I have the opposite. I have one string, and I want to find all the regular expressions that match it. Let's say I have 10 million regular expressions. I'm not trying to do any replacement or rewriting of the string, I just want to find things that match.
I'd like to store these in a database. A crude way to do this would be to select all ten million lines and iterate through them. For each iteration, apply the regex and somehow (I'm a little unclear on this piece too) see if it matches. Perhaps my regex library has a function which I give it a string and a regex, and it tells me if it matches. If it does, then I print out the regex.
This would be slow. I'm wondering if I can somehow hand this off to a database, so that it just returns me a table of the regular expressions that match a given string, out of its table of 10 million.
I'm agnostic on the database used, I'd just like it to be fast. I don't need it to be "custom assembler" fast but just "let the database figure it out so I don't have to iterate on 10 million lines" fast.
I'm wondering if I can somehow hand this off to a database, so that it just returns me a table of the regular expressions that match a given string
At least mysql can do this:
SELECT regex FROM table_with_regexes WHERE
'someString' REGEXP regex;
Also it would be helpful if you told us more about your actual problem. I don't think you wrote ten million regexes by hand; they must have been automatically generated, so tell us how.
In your case, I would process in three steps:
Step 1: Find a first SQL query
Build a SQL query that searches for the regexes matching my string.
I would start with a small regex set for building the SQL query.
Step 2: Refine it if necessary
Add more regexes and see how the SQL query performs.
I would optimize or rewrite it here if necessary.
Step 3: Use the chosen database's optimization tools
I would simply fine-tune my SQL query to respond as quickly as possible.
I would use hints for the SQL engine, indices, parallel execution, etc.
Handing off all the hard work to the database is a good approach since IMO it's an elegant and clear approach.
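The "string on the left, stored pattern on the right" query can be tried out locally with Python's sqlite3 module, as in the sketch below. SQLite has no built-in regexp function, so one is registered here; the table contents and the helper name matching_regexes are made up for illustration. Note that the database still evaluates every stored regex against the string, so this moves the loop into the engine rather than removing it.

```python
import re
import sqlite3


def regexp(pattern: str, text: str) -> bool:
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X), so the pattern comes first.
    try:
        return re.search(pattern, text) is not None
    except re.error:
        return False  # treat an invalid stored pattern as a non-match


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE table_with_regexes (regex TEXT)")
conn.executemany("INSERT INTO table_with_regexes VALUES (?)",
                 [("^foo",), ("bar$",), ("^baz",), ("(unclosed",)])


def matching_regexes(text: str) -> list:
    """Return the stored patterns that match text."""
    rows = conn.execute(
        "SELECT regex FROM table_with_regexes WHERE ? REGEXP regex ORDER BY regex",
        (text,),
    )
    return [row[0] for row in rows]


print(matching_regexes("foobar"))
```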

Regex and non-technical users

Given that:
You have some Key-Value data that can be modified
Modification is done by applying filters to the data. Filters that control what gets changed are created by non-technical people
The filters are setup using regular expressions. An example of a rule described as part of a filter may be "If a key matches some regex, replace value with some other value"
How would you:
If filters are to be produced by business people, who can't create regular expressions, in what form would you have them submit their input that would be easily translated to regex?
Agent Ransack contains a GUI editor for creating regular expressions from plain English, I would suggest taking a look at that and implementing your own variation of it.
If it works, I would go for "wildcard only" support - ie the asterisk * is the only special character allowed and you translate that to the regex .*? in code.
Most non-technical people can grasp * meaning "anything".
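A minimal sketch of the wildcard translation in Python, assuming the whole key must match and that every character other than * is taken literally; the function name is hypothetical. Escaping the literal parts matters, because otherwise a user typing a dot or a parenthesis would unknowingly write regex syntax.

```python
import re


def wildcard_to_regex(pattern: str) -> str:
    """Translate a '*'-only wildcard pattern into an anchored regex."""
    literal_parts = pattern.split("*")
    return "^" + ".*?".join(re.escape(part) for part in literal_parts) + "$"


rule = re.compile(wildcard_to_regex("user_*_id"))
assert rule.search("user_42_id")
assert not rule.search("user_42_idx")
```

Python's standard library also has fnmatch.translate, which performs a similar translation for shell-style patterns, including ? and bracket expressions.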