Individually shorten URLs in Hive - regex

I have URLs of the following structure:
https://pinball.globalzone.com/en_US/home?tic=1-dj33jl-dj33jl&goToRegisterNow=true
What I want to do now is to shorten the URLs to be able to group and count similar URL patterns. For instance, I want to cut out https://, the locale en_US/ and the token ?tic=1-dj33jl-dj33jl while keeping the rest. The result should look as follows:
pinball.globalzone.com/home&goToRegisterNow=true
I tried to achieve that by using regexp_extract but this method only lets me extract specific pieces that are always at the same position.
The bigger problem is that the parts I want to cut out are either rule-based (i.e. the locale always consists of two lower-case and two upper-case letters separated by an underscore) or unique with no guaranteed length (i.e. the token).
Moreover, my result set will also contain URLs with a different pattern, in which I only want to cut out the parts that are present (e.g. https://pinball.globalzone.com/en_US/forgottenPassword, where only en_US/ has to be cut out).
If I had to solve the problem quickly, I would just take the URLs and write a bit of Java or R code to split each GET URL into pieces and iterate through the array, cutting out all the parts I don't need. However, I was wondering whether there is a more elegant way to get this result straight out of Hive.

What about
(?:https?:\/\/|\/[a-z]{2}_[A-Z]{2}|[?&]tic=[^&?]*)
It matches the parts you've described as unwanted. Replacing those matches with an empty string should leave you with what you want.
See it here at regex101.
Edit
Updated to check for tic=. Should make it more stable.
And I don't know if it's what you want, but this one allows tic= to be any parameter, not only the first:
(?:https?:\/\/|\/[a-z]{2}_[A-Z]{2}|[?&]tic=[^&?\n]*)
Here at regex101
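In Hive itself, this pattern can be plugged into regexp_replace with an empty replacement string (you typically have to double the backslashes inside the SQL string literal). A quick Python check of the same replacement, with the forward-slash escapes dropped since Python doesn't need them:

import re

# Pattern from the answer: scheme, /xx_XX locale segment, or a tic= parameter.
pattern = re.compile(r'(?:https?://|/[a-z]{2}_[A-Z]{2}|[?&]tic=[^&?\n]*)')

for url in ('https://pinball.globalzone.com/en_US/home?tic=1-dj33jl-dj33jl&goToRegisterNow=true',
            'https://pinball.globalzone.com/en_US/forgottenPassword'):
    print(pattern.sub('', url))
# pinball.globalzone.com/home&goToRegisterNow=true
# pinball.globalzone.com/forgottenPassword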

Related

Regex match hyphenated word with hyphen-less query

I have an Azure Storage Table set up that possesses lots of values containing hyphens, apostrophes, and other bits of punctuation that the Azure Indexers don't like. Hyphenated-Word gets broken into two tokens — Hyphenated and Word — upon indexing. Accordingly, this means that searching for HyphenatedWord will not yield any results, regardless of any wildcard or fuzzy matching characters. That said, Azure Cognitive Search possesses support for Regex Lucene queries...
As such, I'm trying to find out if there's a Regex pattern I can use to match words with or without hyphens to a given query. As an example, the query homework should match the results homework and home-work.
I know that if I were trying to do the opposite — match unhyphenated words even when a hyphen is provided in the query — I would use something like /home(-)?work/. However, I'm not sure what the inverse looks like — if such a thing exists.
Is there a raw Regex pattern that will perform the kind of matching I'm proposing? Or am I SOL?
Edit: I should point out that the example I provided is unrealistic because I won't always know where a hyphen should be. Optimally, the pattern that performs this matching would be agnostic to the precise placement of a hyphen.
Edit 2: A solution I've discovered that works but isn't exactly optimal (and, though I have no way to prove this, probably isn't performant) is to just break down the query, remove all of the special characters that cause token breaks, and then dynamically build a regex query that has an optional match in between every character in the query. Using the homework example, the pattern would look something like [-'\.! ]?h[-'\.! ]?o[-'\.! ]?m[-'\.! ]?e[-'\.! ]?w[-'\.! ]?o[-'\.! ]?r[-'\.! ]?k[-'\.! ]?...which is perhaps the ugliest thing I've ever seen. Nevertheless, it gets the job done.
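A short Python sketch of building that pattern dynamically, assuming the same set of break characters as above:

import re

# Join the escaped query characters with an optional break-character class.
BREAKS = r"[-'\.! ]?"

def breakable(query):
    return BREAKS + BREAKS.join(re.escape(ch) for ch in query) + BREAKS

print(breakable('homework'))
# [-'\.! ]?h[-'\.! ]?o[-'\.! ]?m[-'\.! ]?e[-'\.! ]?w[-'\.! ]?o[-'\.! ]?r[-'\.! ]?k[-'\.! ]?
print(re.fullmatch(breakable('homework'), 'home-work') is not None)   # True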
My solution to scenarios like this is always to introduce content- and query-processing.
Content processing is easier when you use the push model via the SDK, but you could achieve the same by creating a shadow/copy of your table where the content is manipulated for indexing purposes. You let your original table stay intact. And then you maintain a duplicate table where your text is processed.
Query processing is something you should use regardless. In its simplest form you want to clean the input from the end users before you use it in a query. Additional steps can be to handle special characters like a hyphen. Either escape it, strip it, or whatever depending on what your requirements are.
EXAMPLE
I have to support searches for ordering codes that may contain hyphens or other special characters. The maintainers of our ordering codes may define ordering codes in an inconsistent format. Customers visiting our sites are just as inconsistent.
The requirement is that ABC-123-DE_F-4.56G should match any of
ABC-123-DE_F-4.56G
ABC123-DE_F-4.56G
ABC_123_DE_F_4_56G
ABC.123.DE.F.4.56G
ABC 123 DEF 56 G
ABC123DEF56G
I solve this using my suggested approach above. I use content processing to generate a version of the ordering code without any special characters (using a simple regex). Then, I use query processing to transform the end user's input into an OR-query, like:
<verbatim-user-input-cleaned> OR OrderCodeVariation:<verbatim-user-input-without-special-chars>
So, if the user entered ABC.123.DE.F.4.56G I would effectively search for
ABC.123.DE.F.4.56G OR OrderCodeVariation:ABC123DEF56G
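A rough Python sketch of those two steps (the helper names are placeholders; OrderCodeVariation is the extra indexed field holding the normalized code):

import re

def normalize(code):
    # Content processing: strip everything except letters and digits.
    return re.sub(r'[^A-Za-z0-9]', '', code)

def build_query(user_input):
    # Query processing (simplified): clean the input, then OR the verbatim
    # value with the normalized variation.
    cleaned = user_input.strip()
    return '{} OR OrderCodeVariation:{}'.format(cleaned, normalize(cleaned))

query = build_query('ABC.123.DE.F.4.56G')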
It sounds like you want to define your own tokenization. Would using a custom tokenizer help? https://learn.microsoft.com/azure/search/index-add-custom-analyzers
To add onto Jennifer's answer, you could consider using a custom analyzer consisting of either of these token filters:
pattern_replace: A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.
pattern_capture: Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
You could use the pattern_replace token filter to replace hyphens with the desired replacement, for example an empty string, effectively removing them.

Regex needed to search for a numeric id within a tag

My very basic regex skills are not allowing me to successfully extract an id number within a tag.
I think it would be fairly straightforward. I would like to extract the id from the following extract.
<id>53222132</id>
The id number is not a specific length, but I just need to be able to find the id number, which is numeric only.
More specifically, this is the only instance of the id tag, so it's unique and can be used within the regex.
Finally, is there a way that this can be saved within a variable?
I'm using the regex as part of a Splunk query, where I will use the variable to make the results distinct.
I have got as far as the following, which captures everything including the tag.
<\s*id[^>]*>(.*?)<\s*\/\s*id>
Thanks in advance
(?<=<id>)\d+(?=<\/id>)
This would be my first thought. It uses a positive lookahead and a positive lookbehind, and it will only match the string of digit characters in the middle. Another alternative is:
\d+(?=<\/id>)
This uses only the lookahead, since lookbehind is not supported everywhere. One other option:
\d+(?=\s*<\s*\/\s*id\s*>)
This will ignore any spaces that might be present in that ending tag, and still find the id regardless. One of these should work for your scenario.
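For reference, a quick Python check of the first pattern; in Splunk you would typically feed a similar pattern to the rex command to pull the value into a field you can then dedup on:

import re

# Lookbehind/lookahead pattern from above, checked against the id tag from the
# question embedded in a made-up log line.
pattern = re.compile(r'(?<=<id>)\d+(?=</id>)')

m = pattern.search('some log line <id>53222132</id> more data')
print(m.group(0) if m else None)   # 53222132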

Is rearranging words of text possible with RegEx?

I've got a list of devices in a database, such as Model X123 ABC. Another portion of the system accepts user input and needs to, as well as possible, match their entries to the existing devices. But the users have the ability to enter anything they want. They might enter the above model as Model 100 ABC X123 or Model X123.
Understand, this is a general example; the permutations of available models and matching user entries are enormous, and I'm just trying to match as many as possible so that manual corrections can be kept to a minimum.
The system is built in FileMaker, but has access to pretty much any plugin I wish, which means I have access to Groovy, PHP, JavaScript, etc. I'm currently working with Groovy using the ScriptMaster plugin for other simple regex pattern matching elsewhere, and I'm wondering if the most straightforward way to do this is to use regex.
My thinking with regex is that I'm looking for patterns, but I'm unsure if I can say, "Assign this grouping to this pattern regardless of where it is in the order of pattern groups." Above, I want to find if the string contains three patterns: (?i)\bmodel\b, (?i)\b[a-z]\d{3}\b, and (?i)\b[a-z]{3}\b, but I don't care about what order they come in. If all three are found, I want to place them in that specific order: first the word "model", capitalized, then the all-caps alphanumeric code and finally the pure alphabetical code in all-caps.
Can (should?) regex handle this?
I suggest tokenizing the input into words, matching each of them against the supported tokens, and assembling them into canonical categorized slots.
Even better would be to offer search suggestions when the user enters the information, and require the user to pick a suggestion.
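A minimal Python sketch of that tokenize-and-reorder idea, treating the three patterns from the question as whole-word matches:

import re

# Classify each word into a slot, then emit the pieces in the canonical order.
SLOTS = {'model': re.compile(r'(?i)^model$'),
         'code':  re.compile(r'(?i)^[a-z]\d{3}$'),
         'alpha': re.compile(r'(?i)^[a-z]{3}$')}

def canonicalize(text):
    found = {}
    for word in text.split():
        for slot, pattern in SLOTS.items():
            if slot not in found and pattern.match(word):
                found[slot] = word.upper()
                break
    if len(found) == len(SLOTS):
        return 'Model {} {}'.format(found['code'], found['alpha'])
    return None   # not all three pieces were present

print(canonicalize('Model 100 ABC X123'))   # Model X123 ABC
print(canonicalize('abc x123 model'))       # Model X123 ABC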
But you could do it by (programmatically) constructing a monster regex pattern with all the permutations:
\b(?:(model)\s+([a-z]\d{3})\s+([a-z]{3})
|(model)\s+([a-z]{3})\s+([a-z]\d{3})
|([a-z]\d{3})\s+(model)\s+([a-z]{3})
|([a-z]\d{3})\s+([a-z]{3})\s+(model)
|([a-z]{3})\s+(model)\s+([a-z]\d{3})
|([a-z]{3})\s+([a-z]\d{3})\s+(model)
)\b
It'd have to use named capturing groups but I left that out in the hopes that the above might be close to readable.
I'm not sure I fully understand your underlying objective -- is this to be able to match up like products (e.g., products with the same model number)? If so, a word permutations function like this one could be used in a calculated field to create a multikey: http://www.briandunning.com/cf/1535
If you need partial matching in FileMaker, you could also use a redux search function like this one: http://www.fmfunctions.com/fid/380
Feel free to PM me if you have questions that aren't a good format to post here.

How to efficiently match an input string against several regular expressions at once?

How would one efficiently match one input string against any number of regular expressions?
One scenario where this might be useful is with REST web services. Let's assume that I have come up with a number of URL patterns for a REST web service's public interface:
/user/with-id/{userId}
/user/with-id/{userId}/profile
/user/with-id/{userId}/preferences
/users
/users/who-signed-up-on/{date}
/users/who-signed-up-between/{fromDate}/and/{toDate}
…
where {…} are named placeholders (like regular expression capturing groups).
Note: This question is not about whether the above REST interface is well-designed or not. (It probably isn't, but that shouldn't matter in the context of this question.)
It may be assumed that placeholders usually do not appear at the very beginning of a pattern (but they could). It can also be safely assumed that it is impossible for any string to match more than one pattern.
Now the web service receives a request. Of course, one could sequentially match the requested URI against one URL pattern, then against the next one, and so on; but that probably won't scale well for a larger number of patterns that must be checked.
Are there any efficient algorithms for this?
Inputs:
An input string
A set of "mutually exclusive" regular expressions (ie. no input string may match more than one expression)
Output:
The regular expression (if any) that the input string matched against.
The Aho-Corasick algorithm is a very fast algorithm for matching an input string against a set of patterns (actually keywords), which are preprocessed and organized in a trie to speed up matching.
There are variations of the algorithm that support regex patterns (e.g. http://code.google.com/p/esmre/, just to name one) and are probably worth a look.
Or, you could split the URLs into chunks, organize them in a tree, then split the URL to match and walk the tree one chunk at a time. The {userId} can be treated as a wildcard, or matched against some specific format (e.g. an int).
When you reach a leaf, you know which URL pattern you matched.
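A minimal Python sketch of that chunk tree; the nested-dict layout and the '{}' wildcard key are just one illustrative way to represent it:

# One dict level per path chunk; the value under None names the matched route.
ROUTES = {
    'user': {'with-id': {'{}': {None: 'user',
                                'profile': {None: 'user-profile'},
                                'preferences': {None: 'user-preferences'}}}},
    'users': {None: 'users',
              'who-signed-up-on': {'{}': {None: 'users-by-date'}}},
}

def match(path):
    node = ROUTES
    for chunk in path.strip('/').split('/'):
        if chunk in node:
            node = node[chunk]
        elif '{}' in node:          # wildcard chunk, e.g. {userId}
            node = node['{}']
        else:
            return None
    return node.get(None)           # route name stored at the leaf

print(match('/user/with-id/42/profile'))   # user-profile
print(match('/users'))                     # users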
The standard solution for matching multiple regular expressions against an input stream is a lexer generator such as Flex (there are lots of these available, typically several for each programming language).
These tools take a set of regular expressions associated with "tokens" (think of tokens as just names for whatever a regular expression matches) and generates efficient finite-state automata to match all the regexes at the same time. This is linear time with a very small constant in the size of the input stream; hard to ask for "faster" than this. You feed it a character stream, and it emits the token name of the regex that matches "best" (this handles the case where two regexes can match the same string; see the lexer generator for the definition of this), and advances the stream by what was recognized. So you can apply it again and again to match the input stream for a series of tokens.
Different lexer generators will allow you to capture different bits of the recognized stream in different ways, so you can, after recognizing a token, pick out the part you care about (e.g., for a literal string in quotes, you only care about the string content, not the quotes).
If there is a hierarchy in the URL structure, it should be used to maximize performance. Only a URL that starts with /user/ can match any of the first three patterns, and so on.
I suggest storing the hierarchy to match in a tree corresponding to the URL hierarchy, where each node matches a level in the hierarchy. To match a URL, test it against the roots of the tree, which here hold only the regexes for "user" and "users". Matching URLs are then tested against the children of those nodes until a match is found in a leaf node. A successful match can be returned as the list of nodes from the root to the leaf. Named groups with property values such as {user-id} can be fetched from the nodes of the successful match.
Use named expressions and the OR operator, i.e. "(?P<re1>...)|(?P<re2>...)|...".
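In Python that looks roughly like this (the route names and patterns are made up for illustration; lastgroup reports which alternative fired):

import re

# One named alternative per pattern; a single match tells you which one fired.
routes = {
    'user_profile': r'/user/with-id/(?P<userId>[^/]+)/profile',
    'users_by_date': r'/users/who-signed-up-on/(?P<date>[^/]+)',
    'users': r'/users',
}
combined = re.compile('|'.join('(?P<{}>{})'.format(name, pat) for name, pat in routes.items()))

m = combined.fullmatch('/user/with-id/42/profile')
print(m.lastgroup)          # user_profile
print(m.group('userId'))    # 42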
At first I thought I couldn't see any good optimization for this process.
However, if you have a really large number of regexes you might want to partition them (I'm not sure if this is technically partitioning).
What I suggest you do is:
Suppose that you have 20 possible urls that start with user:
/user/with-id/X
/user/with-id/X/preferences # instead of preferences, you could have another 10 possibilities like /friends, /history, etc
Then, you also have 20 possible urls starting with users:
/users/who-signed-up-on
/users/who-signed-up-on-between #others: /registered-for, /i-might-like, etc
And the list goes on for /products, /companies, etc instead of users.
What you could do in this case is use "multi-level" matching.
First, match the start of the string. You'd be matching for /products, /companies, /users, one at a time, ignoring the rest of the string. This way, you don't have to test all 100 possibilities.
After you know the URL starts with /users, you can match only the possible URLs that start with users.
This way, you avoid a lot of unneeded matches. You won't match the string against all the /products possibilities.
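A rough sketch of that two-level idea: bucket the patterns by their first path segment, then only try the regexes in the matching bucket (the patterns here are illustrative):

import re

# Dispatch on the first path segment, then test only that bucket's patterns.
BUCKETS = {
    'users': [re.compile(r'/users/who-signed-up-on/([^/]+)$'),
              re.compile(r'/users$')],
    'user':  [re.compile(r'/user/with-id/([^/]+)$')],
}

def match(path):
    first = path.strip('/').split('/', 1)[0]
    for pattern in BUCKETS.get(first, []):
        m = pattern.match(path)
        if m:
            return m
    return None

print(match('/users/who-signed-up-on/2024-01-01').group(1))   # 2024-01-01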

Regex in URL Rewriting to match Querystring Parameter Values in any order?

Many URL rewriting utilities allow regex matching. I need some URLs to be matched against a couple of main querystring parameter values, no matter what order they appear in. For example, let's consider a URL having two key parameters, ID= and Lang=, in no specific order, maybe with some other non-key params interspersed.
An example URL to be matched, with key params in any order:
http://www.example.com/SurveyController.aspx?ID=500&Lang=4 or
http://www.example.com/SurveyController.aspx?Lang=4&ID=500
Maybe with some interspersed non-key params:
http://www.example.com/SurveyController.aspx?Lang=3&ID=1&misc=3&misc=4 or
http://www.example.com/SurveyController.aspx?ID=1&misc=4&Lang=3 or
http://www.example.com/SurveyController.aspx?misc=4&Lang=3&ID=1 or
etc
Is there a good regex pattern to match querystring param values in any order, or is it best to duplicate some rules, or should I look to other means in general?
Note: The main querystring values will also be captured using brackets i.e. ID=(3)&Lang=(500) and substituted into the destination URL, but that's not the focus of the question.
I would suggest parsing the query string into a dictionary and working from there, but if you want regex, you can use alternation+repetition to match in any order (without inlining all possible sequences). Python example:
>>> import re
>>> p = re.compile(r'(?:[?&](?:abc=([^&]*)|xyz=([^&]*)|[^&]*))+$')
>>> p.findall('x?abc=1&jjj=2&xyz=3')
[('1', '3')]
>>> p.findall('x?abc=1&xyz=3&jjj=2')
[('1', '3')]
>>> p.findall('x?xyz=3&abc=1&jjj=2')
[('1', '3')]
Regex matching depends highly on the sequential nature of a string. Position of the match is not important, but order definitely is.
This means you cannot write a regex pattern that matches its different parts in any arbitrary order. You can write a pattern that matches its parts in any pre-defined order, though - you would have to include every possible permutation in the pattern. This gets inconvenient very fast:
to match (a,b) you would need a,b|b,a
to match (a,b,c) you would need a,b,c|a,c,b|b,a,c|b,c,a|c,a,b|c,b,a
and so on
And this means you would best try to approach the problem sequentially, matching one parameter at a time. It depends on the capabilities of your rewriting engine how this would work.
This is outside of the capabilities of (most flavours of) regex. You would indeed need to duplicate each rewrite rule for every possible order of parameters, which is practical for two and... less practical for ten.
Also, regexes wouldn't do the kind of parsing you'd need to handle all possible parameter inputs. For example:
http://www.example.com/SurveyController.aspx?ID=500&L%61ng=4
would normally be a valid synonym, and
http://www.example.com/SurveyController.aspx?Hello=3&ID=400&Lang=4&ID=500
might often be a synonym for ID 400 or 500, depending on the parser. The simple regex matches might be OK if you only want to 301 a load of deprecated old-format addresses to the shiny new one, but not enough if they are to catch all possible inputs.
So for more complex cases like this, you'd be better off having a real SurveyController.aspx that looks at its parameters and redirects you where you need to go.
If the underlying regular expression implementation understands both named groups and zero-width lookaheads you may be able to make something work, using something like aspx\?(?=ID=(?<ID>\d+))(?=Lang=(?<Lang>\d+)) (this is untested speculation), but the result is likely to be both unmaintainable and to under-perform even a naive implementation that uses multiple regexes to parse the string.
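For what it's worth, a quick Python check of that lookahead idea, adjusted so each lookahead scans the rest of the query string and the parameter order stops mattering:

import re

# Each lookahead searches the whole query string for its own parameter.
pattern = re.compile(r'\.aspx(?=[^#]*[?&]ID=(?P<ID>\d+))(?=[^#]*[?&]Lang=(?P<Lang>\d+))')

for url in ('http://www.example.com/SurveyController.aspx?ID=500&Lang=4',
            'http://www.example.com/SurveyController.aspx?Lang=3&ID=1&misc=3&misc=4',
            'http://www.example.com/SurveyController.aspx?misc=4&Lang=3&ID=1'):
    m = pattern.search(url)
    print(m.group('ID'), m.group('Lang'))
# 500 4
# 1 3
# 1 3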
I might suggest that query strings are best parsed by a simple tokenizer, or that even plain split operations may be the best tool for the job.