Efficiently match table of regex to a string - regex

Typically, you have a regular expression and lots of strings to process.
I have the opposite. I have one string, and I want to find all the regular expressions that match it. Let's say I have 10 million regular expressions. I'm not trying to do any replacement or rewriting of the string, I just want to find things that match.
I'd like to store these in a database. A crude way to do this would be to select all ten million lines and iterate through them. For each iteration, apply the regex and somehow (I'm a little unclear on this piece too) see if it matches. Perhaps my regex library has a function which I give it a string and a regex, and it tells me if it matches. If it does, then I print out the regex.
This would be slow. I'm wondering if I can somehow hand this off to a database, so that it just returns me a table of the regular expressions that match a given string, out of its table of 10 million.
I'm agnostic on the database used, I'd just like it to be fast. I don't need it to be "custom assembler" fast but just "let the database figure it out so I don't have to iterate on 10 million lines" fast.
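For reference, the crude loop described above can be sketched in a few lines of Python. This is only an illustration of the "give it a string and a regex, and it tells me if it matches" step; the function name `matching_patterns` and the sample patterns are made up for the example, and Python's `re` dialect may differ from the one your regexes were written for.

```python
import re

def matching_patterns(patterns, text):
    """Return the patterns (as strings) that match anywhere in text."""
    matches = []
    for pattern in patterns:
        try:
            # re.search returns a match object, or None when nothing matches.
            if re.search(pattern, text):
                matches.append(pattern)
        except re.error:
            # Skip patterns that are not valid in Python's regex dialect.
            continue
    return matches

# Hypothetical sample data standing in for the table of regexes.
sample_patterns = [r"^foo", r"bar$", r"\d+", "(unclosed"]
print(matching_patterns(sample_patterns, "foo 42"))
```

This is a linear scan, so it has exactly the cost the question is worried about; it only shows what "see if it matches" looks like in code.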

I'm wondering if I can somehow hand this off to a database, so that it just returns me a table of the regular expressions that match a given string
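At least MySQL can do this (the string being tested goes on the left of REGEXP, the column holding the patterns on the right):
SELECT regex FROM table_with_regexes WHERE
'someString' REGEXP regex;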
Also, it would be helpful if you told us more about your actual problem. I don't think you wrote ten million regexes by hand; they must have been generated automatically. Tell us how.
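The same idea can be tried locally with SQLite from Python. This is a sketch under assumptions: SQLite has no built-in REGEXP implementation, so the example registers a Python function for it, and the table contents are invented. SQLite evaluates `X REGEXP Y` as a call to `regexp(Y, X)`, which is why the registered function takes the pattern first.

```python
import re
import sqlite3

def regexp(pattern, value):
    """Backing function for SQLite's REGEXP operator: regexp(pattern, value)."""
    try:
        return int(re.search(pattern, value) is not None)
    except re.error:
        return 0  # treat invalid patterns as non-matching

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE table_with_regexes (regex TEXT)")
conn.executemany(
    "INSERT INTO table_with_regexes VALUES (?)",
    [(r"^foo",), (r"bar$",), (r"\d+",), ("(unclosed",)],  # hypothetical rows
)

def matching_regexes(conn, text):
    """Return every stored regex that matches text."""
    rows = conn.execute(
        "SELECT regex FROM table_with_regexes WHERE ? REGEXP regex", (text,)
    )
    return [row[0] for row in rows]

print(matching_regexes(conn, "foo 42"))
```

Note that the database still evaluates the function once per row, so this moves the loop into the database rather than removing it.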

In your case, I would process in three steps:
Step 1: Find a first SQL query
Build a SQL query that searches for the regexes matching my string.
I would start with a small regex set for building the SQL query.
Step 2: Refine it if necessary
Add more regexes and see how the SQL query performs.
I would optimize or rewrite it here if necessary.
Step 3: Use the chosen database's optimization tools
I would simply fine-tune my SQL query to respond as quickly as possible.
I would use hints for the SQL engine, indices, parallel execution, etc.
Handing off all the hard work to the database is a good approach since IMO it's elegant and clear.

Related

How to decrease regex computational time

I'm writing a regular expression to find selected log messages in Graylog. The problem is that I can use just one regular expression for a variety of messages. So I made this regex:
^<189>.*Authenticate\sfail.*
but it is computationally heavy. There are usernames, IPs, etc. in these messages, so I don't have many ways to get rid of that recursive part. So is it better to use
.*
or should I try to include as many descriptive literal strings from the messages as possible? In short, is the regex:
^<189>.*UserName=.*Authenticate\sfail.*
better for computational performance than:
^<189>.*Authenticate\sfail.*
In this case,
^<189>.*Authenticate\sfail.*
would be faster.
You can use the regex101 website to check how many steps the regular expression needs. Paste your log into the Test string field, experiment, and find a pattern that needs as few steps as possible.
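If you would rather measure locally, a rough timing comparison can be sketched with Python's `timeit`. The log line below is invented, and Python's `re` engine is not the one Graylog uses, so treat the numbers as indicative only and repeat the test with your real messages.

```python
import re
import timeit

# Hypothetical log line; real Graylog messages will differ.
LINE = (
    "<189>Jan 10 12:00:00 fw01 UserName=alice SrcIP=10.0.0.5 "
    "Authenticate fail reason=bad_password"
)

PATTERNS = {
    "short": r"^<189>.*Authenticate\sfail.*",
    "with_username": r"^<189>.*UserName=.*Authenticate\sfail.*",
}

def time_patterns(line, patterns, number=10000):
    """Return total seconds spent matching line `number` times per pattern."""
    results = {}
    for name, pattern in patterns.items():
        compiled = re.compile(pattern)
        results[name] = timeit.timeit(lambda: compiled.match(line), number=number)
    return results

for name, seconds in time_patterns(LINE, PATTERNS).items():
    print(f"{name}: {seconds:.4f}s")
```

Timings also depend on lines that do not match, since failed matches are where backtracking tends to cost the most, so include some of those in a real test.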

Combine multiple regexes into one / build small regex to match a set of fixed strings

The situation:
We created a tool Google Analytics Referrer Spam Killer, which automatically adds filters to Google Analytics to filter out spam.
These filters exclude traffic which comes from certain spammy domains. Right now we have 400+ spammy domains in our list.
To remove the spam, we add a regex (like so domain1.com|domain2.com|..) as a filter to Analytics and tell Analytics to ignore all traffic which matches this filter.
The problem:
Google Analytics has a 255-character limit for each regex (one regex per filter). Because of that, we must create a lot of filters to cover all 400+ domains (currently 30+ filters). The problem is that there is another limit: the number of write operations per day. Each new filter costs 3 more write operations.
The question:
I want to find the shortest regex that matches exactly the same strings as another regex.
For example you need to match the following strings:
`abc`, `abbc` and `aac`
You could match them with the following regexes: /^(abc|abbc|aac)$/, /^a(b|bb|a)c$/, /^a(bb?|a)c$/, etc.
Basically I'm looking for an expression which matches exactly the same strings as /^(abc|abbc|aac)$/, but is shorter in length.
I found multiregexp, but as far as I can tell, it doesn't create a new regex out of another expression which I can use in Analytics.
Is there a tool which can optimize regexes for length?
I found this C tool which compiles on Linux: http://bisqwit.iki.fi/source/regexopt.html
Super easy:
$ ./regex-opt '123.123.123.123'
(123.){3}123
$ ./regex-opt 'abc|abbc|aac'
(aa|ab{1,2})c
$ ./regex-opt 'aback|abacus|abacuses|abaft|abaka|abakas|abalone|abalones|abamp'
aba(ck|ft|ka|lone|mp|(cu|ka|(cus|lon)e)s)
I wasn't able to run the tool suggested by @sln. It looks like it makes an even shorter regex.
I'm not aware of an existing tool for combining / compressing / optimising regexes. There may be one. Maybe by building a finite-state machine out of a regex and then generating a regex back out of that?
You don't need to solve the problem for the general case of arbitrary regexes. I think it's a better bet to look at creating compact regexes to match any of a given set of fixed strings.
There may already be some existing code for making an optimised regex to match a given set of fixed strings, again, IDK.
To do it yourself, the simplest thing would be to sort your strings and look for common prefixes / suffixes. ((afoo|bbaz|c)bar.com). Looking for common strings in the middle is less easy. You might want to look at algorithms used for lossless data compression for finding redundancy.
You'd ideally want to spot cases where you could use a foo[a-d] range instead of a foo(a|b|c|d), and various other things.
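As a starting point for the do-it-yourself route, here is a minimal sketch that merges common prefixes only, by building a trie of the fixed strings and emitting a regex from it. The function names are invented for this example. It does not merge common suffixes or build character ranges, so the result is not guaranteed to be shorter than a plain alternation, and it assumes the target regex flavor supports grouping with `(...)` and the `?` quantifier.

```python
import re

END = ""  # marker key meaning "a word ends at this node"

def build_trie(words):
    """Build a nested-dict trie from an iterable of strings."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return trie

def trie_to_regex(node):
    """Emit a regex matching exactly the suffixes stored below node."""
    word_ends_here = END in node
    alternatives = [
        re.escape(ch) + trie_to_regex(node[ch])
        for ch in sorted(key for key in node if key != END)
    ]
    if not alternatives:
        return ""
    if len(alternatives) == 1 and not word_ends_here:
        return alternatives[0]
    group = "(" + "|".join(alternatives) + ")"
    # If a word may also stop here, everything that follows is optional.
    return group + "?" if word_ends_here else group

words = ["aback", "abacus", "abacuses", "abaft", "abaka",
         "abakas", "abalone", "abalones", "abamp"]
pattern = trie_to_regex(build_trie(words))
print(pattern, len(pattern), len("|".join(words)))
```

Comparing the two printed lengths shows whether prefix merging paid off for a given word list; for domain lists, where the shared part is usually the suffix, reversing the strings or adding suffix merging would likely matter more.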

regex finding elements in xml which contain attributes whose values contain two periods

I'm searching some XML and my tool is regex (my only tools in this case are editors, so I'm using either Eclipse or Notepad++). I need to find all elements which contain attributes whose values contain two non-adjacent periods.
so it would find attr1 and attr3 in this:
<myelement attr1 = "ab.cd.ef", attr2="ab", attr3="zy.sa.xa"/>
I've tried this and variations in notepad++
^(([^\"\.])*(\")[^\"\.]*[\.][^\"\.]*[\.][^\"\.]*[\"])+$
but it isn't picking up second attributes with values containing two periods.
I'm going to keep trying but if someone can point me to an answer I'd appreciate it.
I think you can't do this with regex.
Unless you create a monster regex that opens a black hole swallowing all life on Earth (to put it politely, of course).
Bear in mind that regex has no logic, only pattern matching. For instance, a number is just a number; you can't easily say "if I get 1, then also get 3".
You can use if-then-else in regex, like:
(?(?=condition)(then1|then2|then3)|(else1|else2|else3))
But what you want requires nesting conditionals with multiple conditions for each case, like if 1 then 3 | if 2 then 4 | if 3 then 5, which creates an enormous nested pattern.
Another regex approach would be to use multiple lookarounds (lookaheads in this case), which would make your regex impossible to read.
I think you would find XPath or XQuery expressions more useful for this. They are a better way to match XML than regex.
I'm searching some xml and my tool is regex.
That's a bit like saying that you are cutting down trees and your tool is a screwdriver. Get the right tool for the job: an XML parser and an XPath engine.
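To make the parser suggestion concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. It assumes well-formed XML, so the sample drops the commas between attributes that appear in the question's snippet, and it interprets "two periods not adjacent" as "at least two periods with at least one character between them". The function name and the wrapping `<root>` element are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# A period, at least one character, then another period.
TWO_PERIODS = re.compile(r"\..+\.")

def attrs_with_two_periods(xml_text):
    """Return (tag, attribute name, value) for every matching attribute."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():  # visits root and all descendants
        for name, value in elem.attrib.items():
            if TWO_PERIODS.search(value):
                hits.append((elem.tag, name, value))
    return hits

sample = (
    '<root><myelement attr1="ab.cd.ef" attr2="ab" attr3="zy.sa.xa"/></root>'
)
for hit in attrs_with_two_periods(sample):
    print(hit)
```

The regex is still there, but it only has to examine one attribute value at a time; the parser handles quoting, element boundaries, and multiple attributes per element.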

Regex exclude value from repetition

I'm working with load files and trying to write a regex that will check for any rows in the file that do not have the correct number of delimiters. Let's pretend the delimiter is % (I'm not sure this text field supports the delimiters that are in the load file). The regex I wrote that finds all rows that are correct is:
^%([^%]*%){20}$
Sometimes it's more beneficial to find any rows that do not have the correct number of delimiters, so to accomplish that, I wrote this:
(^%([^%]*%){0,19}$)|(^%([^%]*%){21,}$)
I'm concerned about the efficiency of this (and any regex I write in general), so I'm wondering if there's a better way to write it, or if the way I wrote it is fine. I thought maybe there would be some way to use alternation with the repetition tokens, such as:
{0,19}|{21,}
but that doesn't seem to work.
If it's helpful to know, I'm just searching through the files in Sublime Text, which I believe uses PCRE. I'm also open to suggestions for making the first regex better in general, although I've found it to work pretty well even in exceptionally large load files.
If your regex engine supports negative lookaheads, you can slightly modify your original regex.
^(?!%([^%]*%){20}$)
The regex above only tests a line; the lookahead consumes no characters, so nothing is selected. If you want to capture the line, add a .* part:
^(?!%([^%]*%){20}$).*$
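Outside the editor, the same negative-lookahead pattern can be applied line by line. The sketch below is in Python rather than Sublime Text's engine, and the function name and sample rows are invented. It checks each line separately because `[^%]*` would otherwise be free to run across line breaks.

```python
import re

def bad_rows(text, delimiter="%", count=20):
    """Return (line number, line) for rows without exactly `count` fields."""
    d = re.escape(delimiter)
    # Same shape as ^(?!%([^%]*%){20}$).*$ with the delimiter and count filled in.
    pattern = re.compile("^(?!" + d + "([^" + d + "]*" + d + "){" + str(count) + "}$).*$")
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        if pattern.match(line):
            flagged.append((number, line))
    return flagged

# Hypothetical rows: one correct, one short, one long.
good = "%" + "x%" * 20
short = "%" + "x%" * 19
long_row = "%" + "x%" * 21
for number, line in bad_rows("\n".join([good, short, long_row])):
    print(number, line[:12] + "...")
```

If the regex itself is not needed, comparing `line.count("%")` against the expected total (21 for the layout above) gives the same answer with less machinery, apart from rows that do not begin or end with the delimiter.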

Validate Regex Input, preferably using Regex

I'm looking to have the (admin) user enter some pattern matching string, to give different users of my website access to different database rows, depending on if the text in a particular field of the row matches the pattern matching string against that user.
I decided on Regex because it is trivial to integrate into the MySQL statements directly.
I don't really know where to start with validating that a string is a valid regular expression, with a regular expression.
I did some searching for similar questions, couldn't see one. Google produced the comical answer, sadly not so helpful.
Do people do this in the wild, or avoid it?
Is it able to be done with a simple regex, or will the set of all valid regex need to be limited to a usable subset?
Validating a regex is an incredibly complex task. A regex would not be able to do it.
A simple approach would be to catch any errors that occur when you try to run the SQL statement, then report an appropriate error back to the user.
I am assuming that the 'admin' is a trusted user here. It is quite dangerous to give a non-trusted user the ability to enter regexes, because it is easy to attack your system with a regex that is constructed to take a really long time to execute. And that is before you even start to worry about the Bobby Tables problems.
In JavaScript:
const input = "hello**";
try {
  // The RegExp constructor throws a SyntaxError for an invalid pattern.
  new RegExp(input);
  // submit the regex
} catch (err) {
  // regex is not valid
}
You cannot validate that a string contains a valid regular expression with a regular expression. But you might be able to compromise.
If you only need to know that only characters which are valid in regular expressions were used in the string, you can use the regex:
^[\d\w \-\}\{\)\(\+\*\?\|\.\$\^\[\]\\]*$
This might be enough depending on the application.
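To see the difference between the two checks, the sketch below runs both on the same inputs: the character whitelist from above, and an actual compile attempt. It uses Python's `re`, whose syntax is not identical to MySQL's regex dialect, so a pattern accepted here could still be rejected by the database (and vice versa); the function names are made up for the example.

```python
import re

# The character whitelist quoted above, unchanged.
ALLOWED_CHARS = re.compile(r"^[\d\w \-\}\{\)\(\+\*\?\|\.\$\^\[\]\\]*$")

def uses_only_allowed_chars(candidate):
    """True if every character is one the whitelist permits."""
    return ALLOWED_CHARS.match(candidate) is not None

def compiles(candidate):
    """True if Python's re module accepts candidate as a pattern."""
    try:
        re.compile(candidate)
        return True
    except re.error:
        return False

for candidate in ["hello.*", "hello**", "(unclosed", "a#b"]:
    print(repr(candidate), uses_only_allowed_chars(candidate), compiles(candidate))
```

Inputs such as `hello**` pass the whitelist yet fail to compile, which is why the whitelist is a compromise and not a validity check.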