Combine multiple regexes into one / build a small regex to match a set of fixed strings

The situation:
We created a tool, Google Analytics Referrer Spam Killer, which automatically adds filters to Google Analytics to filter out referrer spam.
These filters exclude traffic which comes from certain spammy domains. Right now we have 400+ spammy domains in our list.
To remove the spam, we add a regex (like so: domain1.com|domain2.com|..) as a filter to Analytics and tell Analytics to ignore all traffic which matches this filter.
The problem:
Google Analytics has a 255-character limit for each regex (one regex per filter). Because of that we must create a lot of filters to cover all 400+ domains (currently 30+ filters). The problem is that there is another limit: the number of write operations per day. Each new filter costs 3 more write operations.
The question:
What I want is to find the shortest regex that matches exactly the same strings as a given regex.
For example, say you need to match the following strings:
`abc`, `abbc` and `aac`
You could match them with the following regexes: /^(abc|abbc|aac)$/, /^a(b|bb|a)c$/, /^a(bb?|a)c$/, etc.
Basically I'm looking for an expression which matches exactly the same strings as /^(abc|abbc|aac)$/, but is shorter in length.
I found multiregexp, but as far as I can tell, it doesn't create a new regex out of another expression which I can use in Analytics.
Is there a tool which can optimize regexes for length?

I found this C tool which compiles on Linux: http://bisqwit.iki.fi/source/regexopt.html
Super easy:
$ ./regex-opt '123.123.123.123'
(123.){3}123
$ ./regex-opt 'abc|abbc|aac'
(aa|ab{1,2})c
$ ./regex-opt 'aback|abacus|abacuses|abaft|abaka|abakas|abalone|abalones|abamp'
aba(ck|ft|ka|lone|mp|(cu|ka|(cus|lon)e)s)
I wasn't able to run the tool suggested by @sln. It looks like it makes an even shorter regex.

I'm not aware of an existing tool for combining / compressing / optimising regexes. There may be one. Maybe by building a finite-state machine out of a regex and then generating a regex back out of that?
You don't need to solve the problem for the general case of arbitrary regexes. I think it's a better bet to look at creating compact regexes to match any of a given set of fixed strings.
There may already be some existing code for making an optimised regex to match a given set of fixed strings, again, IDK.
To do it yourself, the simplest thing would be to sort your strings and look for common prefixes/suffixes, e.g. (afoo|bbaz|c)bar\.com. Looking for common strings in the middle is less easy. You might want to look at algorithms used for lossless data compression for finding redundancy.
You'd ideally want to spot cases where you could use a foo[a-d] range instead of a foo(a|b|c|d), and various other things.
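For the fixed-strings case, here is a minimal sketch of the prefix-factoring idea in Python, built on a trie (the function name and sample output are mine; a production version would also collapse single-character alternatives into ranges):

import re

def trie_regex(words):
    # Build a trie of the words; '' marks the end of a word.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = True

    def to_pattern(node):
        end = '' in node
        branches = [re.escape(ch) + to_pattern(child)
                    for ch, child in node.items() if ch != '']
        if not branches:
            return ''
        if len(branches) == 1 and not end:
            return branches[0]
        alt = '(?:' + '|'.join(branches) + ')'
        return alt + '?' if end else alt

    return '^' + to_pattern(trie) + '$'

print(trie_regex(['abc', 'abbc', 'aac']))  # ^a(?:b(?:c|bc)|ac)$

Factoring shared prefixes this way never changes the matched set, so the output can be pasted straight into a filter.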

Related

How to decrease regex computational time

I'm making a regular expression to find chosen log messages from Graylog. The problem is that I can use just one regular expression for a variety of messages. So I made this regex:
^<189>.*Authenticate\sfail.*
but it is computationally heavy. There are usernames, IPs, etc. in these messages, so I don't have many ways to get rid of that backtracking-prone part. So is it better to use
.*
or should I try to pin down as many distinguishing strings in the messages as possible? In short, is the regex:
^<189>.*UserName=.*Authenticate\sfail.*
better for computational performance than:
^<189>.*Authenticate\sfail.*
In this case,
^<189>.*Authenticate\sfail.*
would be faster.
You can use the Regex101 website to check how many steps were needed to run the regular expression. Paste your log into the Test string field, play around, and find a solution that needs as few steps as possible.
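If you want to measure outside Graylog too, here is a rough timing sketch in Python (the log line is made up, and Python's re engine is not Graylog's, so treat the numbers as indicative only):

import re
import timeit

# A non-matching line is the expensive case: every extra .* multiplies
# the positions the engine has to retry before giving up.
line = '<189>2016-01-01 ' + 'UserName=alice ' * 50 + 'Authenticate ok'

one_star = re.compile(r'^<189>.*Authenticate\sfail.*')
two_stars = re.compile(r'^<189>.*UserName=.*Authenticate\sfail.*')

for name, rx in (('one .*', one_star), ('two .*', two_stars)):
    t = timeit.timeit(lambda: rx.search(line), number=10000)
    print(name, round(t, 3), 'seconds')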

Is rearranging words of text possible with RegEx?

I've got a list of devices in a database, such as Model X123 ABC. Another portion of the system accepts user input and needs to, as well as possible, match their entries to the existing devices. But the users have the ability to enter anything they want. They might enter the above model as Model 100 ABC X123 or Model X123.
Understand, this is a general example; the permutations of available models and matching user entries are enormous, and I'm just trying to match as many as possible so that the manual corrections can be kept to a minimum.
The system is built in FileMaker, but has access to pretty much any plugin I wish, which means I have access to Groovy, PHP, JavaScript, etc. I'm currently working with Groovy using the ScriptMaster plugin for other simple regex pattern matching elsewhere, and I'm wondering if the most straightforward way to do this is to use regex.
My thinking with regex is that I'm looking for patterns, but I'm unsure if I can say, "Assign this grouping to this pattern regardless of where it is in the order of pattern groups." Above, I want to find if the string contains three patterns: (?i)\bmodel\b, (?i)\b[a-z]\d{3}\b, and (?i)\b[a-z]{3}\b, but I don't care about what order they come in. If all three are found, I want to place them in that specific order: first the word "model", capitalized, then the all-caps alphanumeric code and finally the pure alphabetical code in all-caps.
Can (should?) regex handle this?
I suggest tokenizing the input into words, matching each of them against the supported tokens, and assembling them into canonical categorized slots.
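A minimal sketch of that approach in Python (the three token patterns are taken from the question; the slot names and canonical output format are mine):

import re

PATTERNS = {
    'keyword': re.compile(r'(?i)^model$'),      # the word "model"
    'code':    re.compile(r'(?i)^[a-z]\d{3}$'), # e.g. X123
    'suffix':  re.compile(r'(?i)^[a-z]{3}$'),   # e.g. ABC
}

def canonicalize(entry):
    # Classify each word into a slot, ignoring order and extra words.
    slots = {}
    for word in entry.split():
        for name, rx in PATTERNS.items():
            if name not in slots and rx.match(word):
                slots[name] = word
                break
    if set(slots) == set(PATTERNS):
        return 'Model %s %s' % (slots['code'].upper(), slots['suffix'].upper())
    return None  # incomplete; leave for manual correction

print(canonicalize('ABC model x123'))  # Model X123 ABC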
Even better would be to offer search suggestions when the user enters the information, and require the user to pick a suggestion.
But you could do it by (programmatically) constructing a monster regex pattern with all the permutations:
\b(?:(model)\s+([a-z]\d{3})\s+([a-z]{3})
|(model)\s+([a-z]{3})\s+([a-z]\d{3})
|([a-z]\d{3})\s+(model)\s+([a-z]{3})
|([a-z]\d{3})\s+([a-z]{3})\s+(model)
|([a-z]{3})\s+(model)\s+([a-z]\d{3})
|([a-z]{3})\s+([a-z]\d{3})\s+(model)
)\b
It'd have to use named capturing groups but I left that out in the hopes that the above might be close to readable.
I'm not sure I fully understand your underlying objective: is this to be able to match up like products (e.g., products with the same model number)? If so, a word permutations function like this one could be used in a calculated field to create a multikey: http://www.briandunning.com/cf/1535
If you need partial matching in FileMaker, you could also use a redux search function like this one: http://www.fmfunctions.com/fid/380
Feel free to PM me if you have questions that aren't a good format to post here.

How to author and manage very long regex patterns and reuse pattern blocks?

Is there any proven way to overcome the difficulty of authoring and managing large regex patterns in your code? Preferably in a visual tool? Is there any way to build up a pattern from smaller reusable pieces? I could not find any web-based regex visualizers that supported multiline regexes, for instance.
We are currently using a technique of splitting patterns and storing the pieces in variables, but this mixes languages (an architectural no-no for us) and also hinders the ability to paste the pattern into a visualizer.
I am using .NET/Powershell/JavaScript - but I am interested in the flavor agnostic perspective as well.
At my old job we used regex for everything. The best tools I found were these:
Best regex editor in my opinion (it explains each segment and has a reference sheet): http://regex101.com
Best web multi-line regex editor: http://regexpal.com/
Best regex editor overall (a download for the price of $40): http://www.regexbuddy.com/
As far as managing regexes, we used to keep all regexes in a properties file separate from the code, and the code loaded each property (regex) at run time. We also shared RegexBuddy files for exchanging regex patterns. There was one file in source control that had lines and lines of simple patterns for matching certain things. It helped when creating larger ones, using the smaller pieces. However, what I have learned is that basically all regexes need to be tweaked for your specific purposes. It is not as simple as piecing small ones together. The small ones just help you get started in the right direction.
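In Python terms, that properties-file idea might look like this (the section and key names are made up; in production the config would come from a real file via config.read):

import configparser
import re

# Interpolation is disabled so regex metacharacters like % pass through.
config = configparser.ConfigParser(interpolation=None)
config.read_string(r"""
[patterns]
zip_code = ^\d{5}(-\d{4})?$
phone = ^\d{3}-\d{3}-\d{4}$
""")

# Compile at load time so a bad pattern fails fast, not mid-request.
patterns = {name: re.compile(raw) for name, raw in config['patterns'].items()}

print(bool(patterns['zip_code'].match('12345-6789')))  # True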
Over here in flavor agnostic land, I sometimes do something like this (actual working code I just happened to be revisiting):
street = "(#{names}[A-Za-z0-9']+)((?:\\s+(?:#{StreetType.regexp}))?)"
space = '[\s.,]+'
at_a_street =
'(?:and|&|&amp;|at|@|by|just\s+\w+\s+of|just\s+past|looking(?:\s+\w+)?\s+(?:at|to|towards?)|near)' +
"#{space}#{street}"
between_streets =
"(?:between|(?:betw?|btwn)\\.?)#{space}#{street}#{space}(?:and|&|&amp;)#{space}#{street}"
address = '(\b\d+)(?:\s*-\s*\d+|[a-z])?\s+' + street
@regexps = [
/#{street}#{space}#{at_a_street}/i,
/#{street}#{space}#{between_streets}/i,
/#{address}/i,
/#{address}#{space}#{at_a_street}/i,
/#{address}#{space}#{between_streets}/i
]
Namely break the regexp up into meaningful bits, give them comprehensible names, and concatenate them as necessary. (You need to think a little extra about whether each bit can be safely concatenated with others, e.g. watch out for greedy expressions at the end.)
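A related flavor-agnostic trick is free-spacing mode ((?x) inline, re.VERBOSE in Python, RegexOptions.IgnorePatternWhitespace in .NET), which lets one long pattern carry its own comments instead of being split across variables. A small sketch, with a deliberately simplified address grammar of my own:

import re

address = re.compile(r"""
    (\b\d+)                            # house number
    (?: \s*-\s*\d+ | [a-z] )?          # optional range or unit letter
    \s+
    ([A-Za-z0-9']+)                    # street name
    (?: \s+ (St|Ave|Blvd|Rd) \.? )?    # optional street type
""", re.VERBOSE)

print(address.search("123 Main St.").groups())  # ('123', 'Main', 'St')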

Writing Regular Expression for URL in Google Analytics

I have a huge list of URLs, in the format:
http://www.example.com/dest/uk/bath/
http://www.example.com/dest/aus/sydney/
http://www.example.com/dest/aus/
http://www.example.com/dest/uk/
http://www.example.com/dest/nor/
What regex could I use to match the last three URLs but miss the first two, so that every URL without a city attached is matched, and the ones with cities are not?
Note: I am using Google Analytics, so I need to use regexes to monitor my URLs with their advanced feature. So far Google has rejected every regular expression I have tried.
Generally, the best suggestion I can make for parsing URLs with a regex is: don't.
Your time is much better spent finding a library for your language dedicated to the task of processing URLs.
It will have worked out all the edge cases, be fully RFC compliant, bug-free, secure, and have a good interface so you can just pull out the bits you really want.
In your case, the suggested way to process it would be, using your URL library, to extract the elements and then work on them explicitly.
That way, at most you'll have to deal with the path on its own, and not have to worry so much whether it's
http://site.com/
https://site.com/
http://site.com:80/
http://www.site.com/
Unless you really want to.
For the "Path" you might even wish to use a splitter ( or a dedicated path parser ) to tokenise the path into elements first just to be sure.
tj111's current solution doesn't work - it matches all your URLs.
Here's one that works (I checked it against your values). It also matches whether or not there is a trailing slash:
http:\/\/.*dest\/\w+\/?$
/http:\/\/www\.site\.com\/dest\/\w+\/?$/i
matches if they're all on the same site with the "dest" there. You could also do this:
/\w+:\/\/[^/]+\/dest\/\w+\/?$/i
which will match any protocol (http, ftp) and any site with /dest/country at the end, plus an optional trailing /.
Note that this will only work with a subset of what the URLs could legitimately be.
Try this regular expression:
^http://www\.example\.com/dest/[^/]+/$
This would only match the last three URLs.
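A quick sanity check of that pattern against the five sample URLs, sketched here in Python (the constructs used are basic enough to behave the same way in Google Analytics' regex flavor):

import re

pattern = re.compile(r'^http://www\.example\.com/dest/[^/]+/$')

urls = [
    'http://www.example.com/dest/uk/bath/',    # city page: no match
    'http://www.example.com/dest/aus/sydney/', # city page: no match
    'http://www.example.com/dest/aus/',        # country page: match
    'http://www.example.com/dest/uk/',         # country page: match
    'http://www.example.com/dest/nor/',        # country page: match
]

for url in urls:
    print(bool(pattern.match(url)), url)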

Under what situations are regular expressions really the best way to solve the problem?

I'm not sure if Jeff coined it, but it's the joke/saying that people who say "oh, I know, I'll use regular expressions!" now have two problems. I've always taken this to mean that people use regular expressions in very inappropriate contexts.
However, under what circumstances are regular expressions really the best answer? What problems are they really the best or maybe only way to solve a situation?
Regexes are good for:
Text format validation (email, URL, numbers)
Text search/substitution
Mappings (e.g. a URL pattern to a function call)
Filtering some text (related to substitution)
Lexical analysis during parsing
They can be used to validate anything that has a pattern, such as:
Social Security number
Telephone number (555-555-5555)
Email address (something@example.com)
IP address (but it's more complex to make sure it's valid)
All of those have patterns and are easily verifiable by regex.
They are hard to use for input that has logic rather than just a pattern, like a credit card number, but they can still be used to do some client-side validation.
So the best ways to use them?
To sanitize data entry on the client side before sanitizing it on the server.
To do "Search and Replace" on strings that contain a pattern.
I'm sure I am missing a lot of other cases.
Regular expressions are a great way to parse text that doesn't already have a parser (unlike, say, XML). I have used them to create a parser for the mod_rewrite syntax in .htaccess files, and in my URL Rewriter project (http://www.codeplex.com/urlrewriter), for example.
They are really good when you want to be more specific than "*" or "?", like "3 letters, then 2 numbers, then a $ sign, then a period".
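That last example translates directly; a tiny Python sketch (the exact pattern is my own reading of the description):

import re

# "3 letters, then 2 numbers, then a $ sign, then a period":
pattern = re.compile(r'^[A-Za-z]{3}\d{2}\$\.$')

print(bool(pattern.match('abc12$.')))  # True
print(bool(pattern.match('ab123$.')))  # False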
The quote is from an anti-Perl rant by Jamie Zawinski. I think Perl used to do regex really badly, but now it seems to be the standard engine for a lot of programs.
But the same sentiment still applies. If you don't know how to use regex, you had better not try something really fancy, otherwise you get one of these tags too (see bronze list) ;o)
https://stackoverflow.com/users/730/keng
They are good for matching or finding text that takes a very specific and simple format. By "simple" I mean not nested and smaller than the entire HTML spec, for example.
They are primarily of value for highly structured text parsing. If you use named groups (an option in most mature regex engines), you have a phenomenally powerful and crisp way to handle the strings.
Here's an example. Consider that netstat, in its various iterations on different Linux OSes and across versions, can return different results. Sometimes there is an extra column, sometimes there is a shift in the date/time format. Regexes give you a powerful way to handle that with a single expression. Couple that with named groups, and you can retrieve the data without hacks like:
1) split on spaces
2) OK, the netstat version is X, so I need to add 1 to all array references past column 5.
3) OK, the netstat version is Y, so I need to make sure I use multiple array references for the date info.
YUCK. Simple to fix in a Regex :-)
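A sketch of what that looks like with named groups in Python (the netstat-like lines are invented and simplified; real output has more columns):

import re

# Two hypothetical netstat variants: the second has an extra counter
# column, but one pattern with named groups absorbs the difference.
lines = [
    'tcp   0  0  10.0.0.1:22  10.0.0.2:51234  ESTABLISHED',
    'tcp6  0  0  0  10.0.0.1:80  10.0.0.3:40000  ESTABLISHED',
]

row = re.compile(
    r'^(?P<proto>tcp6?)\s+'
    r'(?:\d+\s+)+'              # any number of counter columns
    r'(?P<local>\S+:\d+)\s+'
    r'(?P<remote>\S+:\d+)\s+'
    r'(?P<state>\w+)$'
)

for line in lines:
    m = row.match(line)
    if m:
        print(m.group('proto'), m.group('local'), '->',
              m.group('remote'), m.group('state'))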