Is rearranging words of text possible with RegEx? - regex

I've got a list of devices in a database, such as Model X123 ABC. Another portion of the system accepts user input and needs to, as well as possible, match their entries to the existing devices. But the users have the ability to enter anything they want. They might enter the above model as Model 100 ABC X123 or Model X123.
Understand, this is a general example, and the permutations of available models and matching user entries is enormous, and I'm just trying to match as many as possible so that the manual corrections can be kept to a minimum.
The system is built in FileMaker, but has access to pretty much any plugin I wish, which means I have access to Groovy, PHP, JavaScript, etc. I'm currently working with Groovy using the ScriptMaster plugin for other simple regex pattern matching elsewhere, and I'm wondering if the most straightforward way to do this is to use regex.
My thinking with regex is that I'm looking for patterns, but I'm unsure if I can say, "Assign this grouping to this pattern regardless of where it is in the order of pattern groups." Above, I want to find if the string contains three patterns: (?i)\bmodel\b, (?i)\b[a-z]\d{3}\b, and (?i)\b[a-z]{3}\b, but I don't care about what order they come in. If all three are found, I want to place them in that specific order: first the word "model", capitalized, then the all-caps alphanumeric code and finally the pure alphabetical code in all-caps.
Can (should?) regex handle this?

I suggest tokenizing the input into words, matching each of them against the supported tokens, and assembling them into canonical categorized slots.
Even better would be to offer search suggestions when the user enters the information, and require the user to pick a suggestion.
But you could do it by (programmatically) constructing a monster regex pattern with all the premutations:
\b(?:(model)\s+([a-z]\d{3})\s+([a-z]{3})
|(model)\s+([a-z]{3})\s+([a-z]\d{3})
|([a-z]\d{3})\s+(model)\s+([a-z]{3})
|([a-z]\d{3})\s+([a-z]{3})(model)
|([a-z]{3})(model)\s+([a-z]\d{3})
|([a-z]{3})\s+([a-z]\d{3})\s+(model)
)\b
It'd have to use named capturing groups but I left that out in the hopes that the above might be close to readable.

I'm not sure I fully understand your underlying objective -- is this to be able to match up like products (e.g., products with the same model number)? If so, a word permutations function like this one could be used in a calculated field to create a multikey: http://www.briandunning.com/cf/1535
If you need partial matching in FileMaker, you could also use a redux search function like this one: http://www.fmfunctions.com/fid/380
Feel free to PM me if you have questions that aren't a good format to post here.

Related

Regex match hyphenated word with hyphen-less query

I have an Azure Storage Table set up that possesses lots of values containing hyphens, apostrophes, and other bits of punctuation that the Azure Indexers don't like. Hyphenated-Word gets broken into two tokens — Hyphenated and Word — upon indexing. Accordingly, this means that searching for HyphenatedWord will not yield any results, regardless of any wildcard or fuzzy matching characters. That said, Azure Cognitive Search possesses support for Regex Lucene queries...
As such, I'm trying to find out if there's a Regex pattern I can use to match words with or without hyphens to a given query. As an example, the query homework should match the results homework and home-work.
I know that if I were trying to do the opposite — match unhyphenated words even when a hyphen is provided in the query — I would use something like /home(-)?work/. However, I'm not sure what the inverse looks like — if such a thing exists.
Is there a raw Regex pattern that will perform the kind of matching I'm proposing? Or am I SOL?
Edit: I should point out that the example I provided is unrealistic because I won't always know where a hyphen should be. Optimally, the pattern that performs this matching would be agnostic to the precise placement of a hyphen.
Edit 2: A solution I've discovered that works but isn't exactly optimal (and, though I have no way to prove this, probably isn't performant) is to just break down the query, remove all of the special characters that cause token breaks, and then dynamically build a regex query that has an optional match in between every character in the query. Using the homework example, the pattern would look something like [-'\.! ]?h[-'\.! ]?o[-'\.! ]?m[-'\.! ]?e[-'\.! ]?w[-'\.! ]?o[-'\.! ]?r[-'\.! ]?k[-'\.! ]?...which is perhaps the ugliest thing I've ever seen. Nevertheless, it gets the job done.
My solution to scenarios like this is always to introduce content- and query-processing.
Content processing is easier when you use the push model via the SDK, but you could achieve the same by creating a shadow/copy of your table where the content is manipulated for indexing purposes. You let your original table stay intact. And then you maintain a duplicate table where your text is processed.
Query processing is something you should use regardless. In its simplest form you want to clean the input from the end users before you use it in a query. Additional steps can be to handle special characters like a hyphen. Either escape it, strip it, or whatever depending on what your requirements are.
EXAMPLE
I have to support searches for ordering codes that may contain hyphens or other special characters. The maintainers of our ordering codes may define ordering codes in an inconsistent format. Customers visiting our sites are just as inconsistent.
The requirement is that ABC-123-DE_F-4.56G should match any of
ABC-123-DE_F-4.56G
ABC123-DE_F-4.56G
ABC_123_DE_F_4_56G
ABC.123.DE.F.4.56G
ABC 123 DEF 56 G
ABC123DEF56G
I solve this using my suggested approach above. I use content processing to generate a version of the ordering code without any special characters (using a simple regex). Then, I use query processing to transform the end user's input into an OR-query, like:
<verbatim-user-input-cleaned> OR OrderCodeVariation:<verbatim-user-input-without-special-chars>
So, if the user entered ABC.123.DE.F.4.56G I would effecively search for
ABC.123.DE.F.4.56G OR OrderingCodeVariation:ABC123DEF56G
It sounds like you want to define your own tokenization. Would using a custom tokenizer help? https://learn.microsoft.com/azure/search/index-add-custom-analyzers
To add onto Jennifer's answer, you could consider using a custom analyzer consisting of either of these token filters:
pattern_replace: A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.
pattern_capture: Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
You could use the pattern_replace token filter to replace hyphens with the desired character, maybe an empty character.

Combine multiple regexes into one / build small regex to match a set of fixed strings

The situation:
We created a tool Google Analytics Referrer Spam Killer, which automatically adds filters to Google Analytics to filter out spam.
These filters exclude traffic which comes from certain spammy domains. Right now we have 400+ spammy domains in our list.
To remove the spam, we add a regex (like so domain1.com|domain2.com|..) as a filter to Analytics and tell Analytics to ignore all traffic which matches this filter.
The problem:
Google Analytics has a 255 character limit for each regex (one regex per filter). Because of that we must to create a lot of filters to add all 400+ domains (now 30+ filters). The problem is, there is another limit. Number of write operation per day. Each new filter is 3 more write operations.
The question:
What I want to find the shortest regex to exactly match another regex.
For example you need to match the following strings:
`abc`, `abbc` and `aac`
You could match them with the following regexes: /^abc|abbc|aac$/, /^a(b|bb|a)c$/, /^a(bb?|a)c$/, etc..
Basically I'm looking for an expression which exactly matches /^abc|abbc|aac$/, but is shorter in length.
I found multiregexp, but as far as I can tell, it doesn't create a new regex out of another expression which I can use in Analytics.
Is there a tool which can optimize regexes for length?
I found this C tool which compiles on Linux: http://bisqwit.iki.fi/source/regexopt.html
Super easy:
$ ./regex-opt '123.123.123.123'
(123.){3}123
$ ./regex-opt 'abc|abbc|aac'
(aa|ab{1,2})c
$ ./regex-opt 'aback|abacus|abacuses|abaft|abaka|abakas|abalone|abalones|abamp'
aba(ck|ft|ka|lone|mp|(cu|ka|(cus|lon)e)s)
I wasn't able to run the tool suggested by #sln. It looks like it makes an even shorter regex.
I'm not aware of an existing tool for combining / compressing / optimising regexes. There may be one. Maybe by building a finite-state machine out of a regex and then generating a regex back out of that?
You don't need to solve the problem for the general case of arbitrary regexes. I think it's a better bet to look at creating compact regexes to match any of a given set of fixed strings.
There may already be some existing code for making an optimised regex to match a given set of fixed strings, again, IDK.
To do it yourself, the simplest thing would be to sort your strings and look for common prefixes / suffixes. ((afoo|bbaz|c)bar.com). Looking for common strings in the middle is less easy. You might want to look at algorithms used for lossless data compression for finding redundancy.
You'd ideally want to spot cases where you could use a foo[a-d] range instead of a foo(a|b|c|d), and various other things.

Regular Expression search and replace on Autoblogged Plugin

I'm using Autoblogged to pull a feed in as a blog post. I need to create a reg expression to convert the title of the item to things I can use as meta data. I've attached a screen of the backend I have access to. Any help would be greatly appreciated!
Here are examples of the title from the feed.
Type One Training Event New New Mexico, WY November 2012
Type Two Training Event Seattle, WA November 2012
I need that to become this:
<what>Type One Training Event</what> <city>New New Mexico</city>, <%state>WY</state> <month>November</month> <year>2012</year>
<what>Type Two Training Event</what> <city>Seattle</city>, <state>WA</state> <month>November</month> <year>2012</year>
Essentially says take whatever is before the word event and make that "what"
Take anything after the word event and before the comma and make that "city"
Take the two letters after the comma and make that "state"
Take the last two words and make em month and year
Autblogged backend:
We actually have an email in queue to respond to you directly once we get our v2.9 update out. The update fixes a bug in the regex feature but I thought I would go ahead and comment here so this question isn't just left open.
The ability to extract info from a feed is one of the coolest and most powerful features of AutoBlogged and this is a perfect example of what you can do with those features.
First of all, here are the regex patterns you would use:
What: (.*)\sTraining\sEvent
City: Training\sEvent\s([^,]*)
State: .*,\s([A-Z]{2})
To use these, you create new custom fields in the feed settings. Note that the custom fields also use the same syntax as the post templates so you can use the powerful regex function to extract info from the feed. This is how the fields should look:
Once you create these custom fields you can use them in your post templates and they will be added as custom fields to your post in WordPress.
Once you have these custom fields set up, you can use them in your post template as %what%, %city% or %state_code%. As I mentioned before this will also create custom fields on your blog post in WordPress as well. If you don't want that, you can just use %regex("%title%", "(.*)\sTraining\sEvent", "1")% instead of %what% directly in your post template.
Quick explanation of the syntax:
If you use %regex("%title%", "(.*)\sTraining\sEvent", "1")% it means this:
Get this info from the %title% field
Use the regex pattern (.*)\sTraining\sEvent
Use match reference 1, the (.*) part.
Perhaps match:
^(.* Event) (.*), ([A-Z]{2}) +(?i(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)) +((?:19|20)\d{2})\b
EDIT: re your comment, it looks like you have to surround your regex in delimiters. Try:
/insert_above_regex_here/
If you want case-insensitive, then do:
/insert_above_regex_here_but_remove_(?i_and_matching_)/i
However if you do case-insensitive, your state ([A-Z]{2}) will also match two lower-case letters. If this is OK then go for it. You culd also try changing that part of the regex to (?-i([A-Z]{2})) which says "be case-sensitive for this part", but it depends on whether that engine supports it (don't worry, you'll get an error if it doesn't).
Then replace with:
<what>$1</what> <city>$2</city>, <state>$3</state> <month>$4</month> <year>$5</year>
I'm not sure what flavour of regex that interface has so you might not be able to do the (?i bit in the Month regex (it just makes that bit case insensitive) -- you'll just have to be careful then to write your months with one capital letter and the rest lowercase, or you can modify the regex to allow upper-case too.

Need to create a gmail like search syntax; maybe using regular expressions?

I need to enhance the search functionality on a page listing user accounts. Rather than have multiple search boxes for each possible field, or a drop down menu where the user can only search against one field, I'd like a single search box and to use a gmail like syntax. That's the best way I can describe it, and what I mean by a gmail like search syntax is being able to type the following into the input box:
username:bbaggins type:admin "made up plc"
When the form is submitted, the search string should be split into it's separate parts, which will allow me to construct a SQL query. So for example, type:admin would form part of the WHERE clause so that it would find any record where the field type is equal to admin and the same for username. The text in quotes may be a free text search, but I'm not sure on that yet.
I'm thinking that a regular expression or two would be the best way to do this, but that's something I'm really not good at. Can anyone help to construct a regular expression which could be used for this purpose? I've searched around for some pointers but either I don't know what to search for or it's not out there as I couldn't find anything obvious. Maybe if I understood regular expressions better it would be easier :-)
Cheers,
Adam
No, you would not use regular expressions for this. Just split the string on spaces in whatever language you're using.
You don't necessarily have to use a regex. Regexes are powerful, but in many cases also slow. Regex also does not handle nested parameters very well. It would be easier for you to write a script that uses string manipulation to split the string and extract the keywords and the field names.
If you want to experiment with Regex, try the online REGex tester. Find a tutorial and play around, it's fun, and you should quickly be able to produce useful regexes that find any words before or after a : character, or any sentences between " quotation marks.
thanks for the answers...I did start doing it without regex and just wondered if a regex would be simpler. Sounds like it wouldn't though, so I'll go back to the way I was doing it and test it again.
Good old Mr Bilbo is my go to guy for any naming needs :-)
Cheers,
Adam

Under what situations are regular expressions really the best way to solve the problem?

I'm not sure if Jeff coined it but it's the joke/saying that people who say "oh, I know I'll use regular expressions!" now have two problems. I've always taken this to mean that people use regular expressions in very inappropriate contexts.
However, under what circumstances are regular expressions really the best answer? What problems are they really the best or maybe only way to solve a situation?
RexExprs are good for:
Text Format Validations (email, url, numbers)
Text searchs/substitution.
Mappings (e.g. url pattern to function call)
Filtering some texts (related to substitution)
Lexical analysis during parsing.
They can be used to validate anything that have a pattern like :
Social Security Number
Telephone Number ( 555-555-5555 )
Email Address (something#example.com)
IP Address (but it's more complex to make sure it's valid)
All those have patterns and are easily verifiable by RegEx.
They are difficultly used for entry that have a logic instead of a pattern like a credit card number but they still can be used to do some client validation.
So the best ways?
To sanitize data entry on the client
side before sanitizing them on the
server.
To make "Search and Replace" of some
strings that contains pattern
I'm sure I am missing a lot of other cases.
Regular expressions are a great way to parse text that doesn't already have a parser (i.e. XML) I have used it to create a parser for the mod_rewrite syntax in the .htaccess file or in my URL Rewriter project http://www.codeplex.com/urlrewriter for example
they are really good when you want to be more specific than "*" or "?" like "3 letters then 2 numbers then a $ sign then a period"
The quote is from an anti-Perl rant from Jamie Zawinski. I think Perl used to do regex really badly but now it seems to be a standard engine for a lot of programs.
But the same sentiment still applies. If you don't know how to use regex, you better not try something real fancy other wise you get one of these tags too (see bronze list) ;o)
https://stackoverflow.com/users/730/keng
They are good for matching or finding text that takes a very specific and simple format. By "simple" I mean not nested and smaller than the entire html spec, for example.
They are primarily of value for highly structured text parsing. If you used named groups (and option in most mature regex systems), you have a phenomenally powerful and crisp way to handle the strings.
Here's an example. Consider that netstat in its various iterations on different linux OSes, and versions of netstat can return different results. Sometimes there is an extra column, sometimes there is a shift if the date/time format. Regexes give you a powerful way to handle that with a single expression. Couple that with named groups, and you can retrieve the data without hacks like:
1) split on spaces
2) ok, the netstat version is X so add I need to add 1 to all array references past column 5.
3) ok, the netstat version is Y so I need to make sure that I use multiple array references for the date info.
YUCK. Simple to fix in a Regex :-)