How to store regex "literals" in Postgres? - regex

I want to store regex pattern/option "literals" in a Postgres database, like:
/<pattern>/options
I think it's helpful to indicate the expected format and use of the text. Also, the application framework I'm using can coerce this kind of text into the proper Regex type.
I looked through the data types and provided extensions and didn't see anything specific. Am I missing one?
If there is no specialized type, is there a reasonable way to constrain TEXT so that it likely contains a regex (not to validate the regex, just to ensure there is text between forward slashes)? Does this work?
pattern TEXT CONSTRAINT is_regex (pattern LIKE '/%/%')
At the moment, I'm only using these literals in application code, which is why the TEXT to Regex transformation is very helpful. At some point, I might get better at CTEs and transform them back to regular TEXT (without forward-slashes or options) to be used in Postgres pattern matching functions.

PostgreSQL doesn't offer such a type (as of now), but you have a few options to preserve database integrity (I can only assume you want this so that data read from the database cannot break your application by not being a valid regular expression).
Your best bet, as you already figured out, is a CHECK constraint in one form or another. If you plan to use this pattern in multiple places, I suggest using a domain type. That way, you don't have to repeat the constraint on every column. Ironically, the best way to write such a CHECK constraint is a regexp pattern that matches the shape of your regexp literals (there are multiple regexp implementations with slight differences, so validating against any single one may not agree with your application). It obviously won't be perfect, but it might be good enough. For example:
create domain likely_regexp as text
check (value ~ '^/([^/]*(\\/[^/]*)*[^\\])?/[a-z]*$');
But if you're okay with checking against PostgreSQL's own implementation, you can (ab)use the fact that a CHECK constraint fails not only when the evaluated expression is false, but also when the expression raises an error. So you can call a regexp function to detect whether the value is actually a valid regular expression. You still have to split the value into its pattern and options parts, though.
create domain pg_regexp as text
check (regexp_replace('', replace(substring(value from '^/(.*)/'), '\/', '/'),
'', substring(value from '/([^/]*)$')) = '');
https://rextester.com/YFG18381
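On the application side, splitting such a literal back into its pattern and options could look roughly like the sketch below. It is written in Python purely for illustration: `parse_regex_literal` and the flag table are hypothetical names, and Python's `re` dialect is not identical to PostgreSQL's, so a literal accepted here is not guaranteed to be accepted by the database (or the other way round).

```python
import re

# Hypothetical mapping from option letters to Python flags; adjust it to
# whatever options your framework actually supports.
_FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL, "x": re.VERBOSE}

# Greedy (.*) so that the *last* slash separates the pattern from the options.
_LITERAL = re.compile(r"/(.*)/([a-z]*)", re.DOTALL)


def parse_regex_literal(text):
    """Turn '/<pattern>/options' into a compiled Python regex."""
    match = _LITERAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a /pattern/options literal: {text!r}")
    body, options = match.groups()
    flags = 0
    for letter in options:
        if letter not in _FLAGS:
            raise ValueError(f"unsupported option {letter!r}")
        flags |= _FLAGS[letter]
    try:
        return re.compile(body, flags)
    except re.error as exc:
        # Same idea as the second domain: let the engine reject bad patterns.
        raise ValueError(f"invalid pattern: {exc}") from exc
```

Like the `pg_regexp` domain, this relies on the engine itself to reject invalid patterns instead of trying to describe valid regexes with another regex.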

Related

Regex match hyphenated word with hyphen-less query

I have an Azure Storage Table set up that possesses lots of values containing hyphens, apostrophes, and other bits of punctuation that the Azure Indexers don't like. Hyphenated-Word gets broken into two tokens — Hyphenated and Word — upon indexing. Accordingly, this means that searching for HyphenatedWord will not yield any results, regardless of any wildcard or fuzzy matching characters. That said, Azure Cognitive Search possesses support for Regex Lucene queries...
As such, I'm trying to find out if there's a Regex pattern I can use to match words with or without hyphens to a given query. As an example, the query homework should match the results homework and home-work.
I know that if I were trying to do the opposite — match unhyphenated words even when a hyphen is provided in the query — I would use something like /home(-)?work/. However, I'm not sure what the inverse looks like — if such a thing exists.
Is there a raw Regex pattern that will perform the kind of matching I'm proposing? Or am I SOL?
Edit: I should point out that the example I provided is unrealistic because I won't always know where a hyphen should be. Optimally, the pattern that performs this matching would be agnostic to the precise placement of a hyphen.
Edit 2: A solution I've discovered that works but isn't exactly optimal (and, though I have no way to prove this, probably isn't performant) is to just break down the query, remove all of the special characters that cause token breaks, and then dynamically build a regex query that has an optional match in between every character in the query. Using the homework example, the pattern would look something like [-'\.! ]?h[-'\.! ]?o[-'\.! ]?m[-'\.! ]?e[-'\.! ]?w[-'\.! ]?o[-'\.! ]?r[-'\.! ]?k[-'\.! ]?...which is perhaps the ugliest thing I've ever seen. Nevertheless, it gets the job done.
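The pattern from Edit 2 can be generated mechanically. The following is a minimal sketch in Python, assuming the same separator class as above; `loose_pattern` is a hypothetical helper, and the resulting string is only checked here against Python's `re`, not against the Lucene regex syntax that Azure Cognitive Search uses.

```python
import re

# Optional separator taken from the hand-written pattern above.
SEPARATOR = r"[-'\.! ]?"


def loose_pattern(query):
    """Build a pattern that allows one optional separator around every character."""
    # Drop the characters that cause token breaks, keep letters and digits.
    chars = [re.escape(ch) for ch in query if ch.isalnum()]
    return SEPARATOR + SEPARATOR.join(chars) + SEPARATOR


pattern = loose_pattern("homework")
print(bool(re.fullmatch(pattern, "home-work")))
print(bool(re.fullmatch(pattern, "homework")))
```

Because the separators are stripped from the query first, `loose_pattern("home-work")` yields the same pattern as `loose_pattern("homework")`.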
My solution to scenarios like this is always to introduce content- and query-processing.
Content processing is easier when you use the push model via the SDK, but you could achieve the same by creating a shadow/copy of your table where the content is manipulated for indexing purposes. You let your original table stay intact. And then you maintain a duplicate table where your text is processed.
Query processing is something you should use regardless. In its simplest form you want to clean the input from the end users before you use it in a query. Additional steps can be to handle special characters like a hyphen. Either escape it, strip it, or whatever depending on what your requirements are.
EXAMPLE
I have to support searches for ordering codes that may contain hyphens or other special characters. The maintainers of our ordering codes may define ordering codes in an inconsistent format. Customers visiting our sites are just as inconsistent.
The requirement is that ABC-123-DE_F-4.56G should match any of
ABC-123-DE_F-4.56G
ABC123-DE_F-4.56G
ABC_123_DE_F_4_56G
ABC.123.DE.F.4.56G
ABC 123 DEF 456 G
ABC123DEF456G
I solve this using my suggested approach above. I use content processing to generate a version of the ordering code without any special characters (using a simple regex). Then, I use query processing to transform the end user's input into an OR-query, like:
<verbatim-user-input-cleaned> OR OrderingCodeVariation:<verbatim-user-input-without-special-chars>
So, if the user entered ABC.123.DE.F.4.56G I would effectively search for
ABC.123.DE.F.4.56G OR OrderingCodeVariation:ABC123DEF456G
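A rough sketch of both processing steps, in Python for illustration. The function names are hypothetical, the field name is the one used in the example query, and escaping of Lucene special characters in the verbatim part is deliberately left out.

```python
import re


def strip_special(code):
    """Content processing: keep only letters and digits of an ordering code."""
    return re.sub(r"[^A-Za-z0-9]", "", code)


def build_query(user_input):
    """Query processing: search the verbatim input OR its stripped variation."""
    cleaned = user_input.strip()
    return f"{cleaned} OR OrderingCodeVariation:{strip_special(cleaned)}"


print(build_query("ABC.123.DE.F.4.56G"))
```

The same `strip_special` function would be applied to the stored ordering codes when the shadow field is populated, so that every spelling of a code collapses to one indexed value.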
It sounds like you want to define your own tokenization. Would using a custom tokenizer help? https://learn.microsoft.com/azure/search/index-add-custom-analyzers
To add onto Jennifer's answer, you could consider using a custom analyzer consisting of either of these token filters:
pattern_replace: A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.
pattern_capture: Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
You could use the pattern_replace token filter to replace hyphens with the desired replacement, for example an empty string.

Validate Regex Input, preferably using Regex

I want the (admin) user to enter a pattern-matching string for each user of my website, so that different users get access to different database rows depending on whether the text in a particular field of the row matches that user's pattern.
I decided on Regex because it is trivial to integrate into the MySQL statements directly.
I don't really know where to start with validating that a string is a valid regular expression, with a regular expression.
I did some searching for similar questions, couldn't see one. Google produced the comical answer, sadly not so helpful.
Do people do this in the wild, or avoid it?
Can it be done with a simple regex, or will the set of valid regexes need to be limited to a usable subset?
Validating a regex is an incredibly complex task. A regex would not be able to do it.
A simple approach would be to catch any errors that occur when you try to run the SQL statement, then report an appropriate error back to the user.
I am assuming that the 'admin' is a trusted user here. It is quite dangerous to give a non-trusted user the ability to enter regexes, because it is easy to attack your system with a regex that is constructed to take a really long time to execute. And that is before you even start to worry about the Bobby Tables problems.
In JavaScript:
const input = "hello**";
try {
  new RegExp(input); // throws a SyntaxError if the pattern is invalid
  // submit the regex
} catch (err) {
  // regex is not valid
}
You cannot validate that a string contains a valid regular expression with a regular expression. But you might be able to compromise.
If you only need to know that only characters which are valid in regular expressions were used in the string, you can use the regex:
^[\d\w \-\}\{\)\(\+\*\?\|\.\$\^\[\]\\]*$
This might be enough depending on the application.
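To see how far that compromise goes, here is a small Python sketch that compares the character check with actually compiling the pattern. Python's `re` stands in for the real engine here (MySQL's dialect differs), and both function names are hypothetical.

```python
import re

# The character whitelist from above, unchanged.
ALLOWED = re.compile(r"^[\d\w \-\}\{\)\(\+\*\?\|\.\$\^\[\]\\]*$")


def uses_allowed_chars(pattern):
    """True if the string contains only whitelisted characters."""
    return ALLOWED.match(pattern) is not None


def compiles(pattern):
    """True if this engine accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


for candidate in ("hel+o", "hello**"):
    print(candidate, uses_allowed_chars(candidate), compiles(candidate))
```

A string such as `hello**` passes the character check but is still rejected by the engine, so the whitelist can only serve as a first filter, not as validation.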

Regex assistance: include/exclude

Hello, I am trying to figure out this regular expression. I have a URL whose query string parameters can appear in different positions.
test.aspx?foo=bar&abc=123
test.aspx?abc=123&foo=bar
test.aspx?foo=bar&abc=123#T1
test.aspx?abc=123&foo=bar#T2
I am trying to find only the ones without the #T<number> suffix.
Here is what I have so far:
test.aspx\?(?!\#T[0-9])
However, it still selects all of them. Is there a way to anchor on a constant string and then scan the rest of the line?
Juniorflip
If #Tnum is always at the end, you just need to do a bit of anchoring. For example, like this:
test.aspx\?.*(?!\#T[0-9])...$
But that's very fragile as it depends on the bad URLs always ending in a very particular form and good URLs always having enough characters to soak up that end matching. A negative lookbehind assertion is somewhat better, but still fragile and less commonly supported:
test.aspx\?.*(?<!\#T[0-9])$
It's better to write a regular expression that matches what you don't want and to just invert the logic of what to do when you get a match (i.e., "if it matches throw it away", instead of "if it matches use it"). But really it's much better to delegate the parsing of the URLs to a specialized library and then just do a simpler check against the fragment identifier as a logical component instead of as a horrible RE hack.
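As a sketch of the library approach, assuming Python and its standard `urllib.parse` module (`has_t_fragment` is a hypothetical helper name): the fragment is extracted as a logical component, and the regex only has to describe the part that is not wanted.

```python
import re
from urllib.parse import urlsplit


def has_t_fragment(url):
    """True if the URL's fragment identifier looks like T<number>."""
    return re.fullmatch(r"T[0-9]+", urlsplit(url).fragment) is not None


urls = [
    "test.aspx?foo=bar&abc=123",
    "test.aspx?abc=123&foo=bar",
    "test.aspx?foo=bar&abc=123#T1",
    "test.aspx?abc=123&foo=bar#T2",
]

# Inverted logic: match what is NOT wanted and throw those away.
wanted = [url for url in urls if not has_t_fragment(url)]
print(wanted)
```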

Regex in URL Rewriting to match Querystring Parameter Values in any order?

Many URL rewriting utilities allow regex matching. I need some URLs to be matched against a couple of main query string parameter values no matter what order they appear in. For example, consider a URL with two key parameters, ID= and Lang=, in no specific order, possibly with other non-key params interspersed.
An Example URL to be matched with key params in any order:
http://www.example.com/SurveyController.aspx?ID=500&Lang=4 or
http://www.example.com/SurveyController.aspx?Lang=4&ID=500
Maybe with some interspersed non-key params:
http://www.example.com/SurveyController.aspx?Lang=3&ID=1&misc=3&misc=4 or
http://www.example.com/SurveyController.aspx?ID=1&misc=4&Lang=3 or
http://www.example.com/SurveyController.aspx?misc=4&Lang=3&ID=1 or
etc
Is there a good regex pattern to match against querystring param value in any order, or is it best to duplicate some rules, or in general should I look to other means?
Note: The main querystring values will also be captured using brackets i.e. ID=(3)&Lang=(500) and substituted into the destination URL, but that's not the focus of the question.
I would suggest parsing the query string into a dictionary and working from there, but if you want regex, you can use alternation+repetition to match in any order (without inlining all possible sequences). Python example:
>>> import re
>>> p = re.compile(r'(?:[?&](?:abc=([^&]*)|xyz=([^&]*)|[^&]*))+$')
>>> p.findall('x?abc=1&jjj=2&xyz=3')
[('1', '3')]
>>> p.findall('x?abc=1&xyz=3&jjj=2')
[('1', '3')]
>>> p.findall('x?xyz=3&abc=1&jjj=2')
[('1', '3')]
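The dictionary approach mentioned at the start of this answer could look like the sketch below, which uses Python's standard `urllib.parse`. `key_params` is a hypothetical helper, and whether something comparable can be plugged into a given URL rewriting utility is an open assumption.

```python
from urllib.parse import parse_qs, urlsplit


def key_params(url, keys=("ID", "Lang")):
    """Return the requested query parameters, or None if any of them is missing."""
    query = parse_qs(urlsplit(url).query)
    if not all(key in query for key in keys):
        return None
    # parse_qs maps each name to a list of values; keep the first one here.
    return {key: query[key][0] for key in keys}


print(key_params("http://www.example.com/SurveyController.aspx?misc=4&Lang=3&ID=1"))
print(key_params("http://www.example.com/SurveyController.aspx?ID=1&misc=4&Lang=3"))
```

The order of the parameters no longer matters, because the lookup happens by name after parsing.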
Regex matching depends highly on the sequential nature of a string. Position of the match is not important, but order definitely is.
This means you cannot write a regex pattern that matches its different parts in any arbitrary order. You can write a pattern that matches its parts in any pre-defined order, though - you would have to include every possible permutation in the pattern. This gets inconvenient very fast:
to match (a,b) you would need a,b|b,a
to match (a,b,c) you would need a,b,c|a,c,b|b,a,c|b,c,a|c,a,b|c,b,a
and so on
And this means you would best try to approach the problem sequentially, matching one parameter at a time. It depends on the capabilities of your rewriting engine how this would work.
This is outside of the capabilities of (most flavours of) regex. You would indeed need to duplicate each rewrite rule for every possible order of parameters, which is practical for two and... less practical for ten.
Also, regexes wouldn't do the kind of parsing you'd need to handle all possible parameter inputs. For example:
http://www.example.com/SurveyController.aspx?ID=500&L%61ng=4
would normally be a valid synonym, and
http://www.example.com/SurveyController.aspx?Hello=3&ID=400&Lang=4&ID=500
might often be a synonym for ID 400 or 500 depending on the parser. Simple regex matches might be OK if you only want to 301 a load of deprecated old-format addresses to the shiny new ones, but they are not enough if they have to catch all possible inputs.
So for more complex cases like this, you'd be better off having a real SurveyController.aspx that looks at its parameters and redirects you where you need to go.
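For comparison, this is how one real parser treats the two tricky URLs above. Python's standard `urllib.parse` is used as a stand-in; the parser behind an actual SurveyController.aspx may well behave differently, especially for repeated names.

```python
from urllib.parse import parse_qs, urlsplit


def params(url):
    """Parse the query string into a dict of name -> list of values."""
    return parse_qs(urlsplit(url).query)


encoded = params("http://www.example.com/SurveyController.aspx?ID=500&L%61ng=4")
repeated = params("http://www.example.com/SurveyController.aspx?Hello=3&ID=400&Lang=4&ID=500")

print(encoded["Lang"])  # ['4'] because %61 is decoded to "a" in the name
print(repeated["ID"])   # ['400', '500'] because both values are kept
```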
If the underlying regular expression implementation understands both named groups and zero-width look-aheads you may be able to make something work, using something like aspx\?(?=ID=(?<ID>\d+))(?=Lang=(?<Lang>\d+)) (this is untested speculation), but the result is likely to be unmaintainable and to under-perform even a naive implementation that uses multiple regexes to parse the string.
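One way the look-ahead idea might be made to work is sketched below in Python, which writes named groups as `(?P<name>...)`. This is a hypothetical adaptation rather than the pattern given above: each look-ahead is allowed to skip over earlier parameters, since two look-aheads anchored at the same position cannot require both `ID=` and `Lang=` to come first.

```python
import re

# Both look-aheads start right after "aspx" and scan the query string
# independently, so the order of ID and Lang does not matter.
PATTERN = re.compile(
    r"aspx"
    r"(?=\?(?:.*&)?ID=(?P<ID>\d+))"
    r"(?=\?(?:.*&)?Lang=(?P<Lang>\d+))"
)

for url in (
    "http://www.example.com/SurveyController.aspx?ID=500&Lang=4",
    "http://www.example.com/SurveyController.aspx?misc=4&Lang=3&ID=1",
):
    match = PATTERN.search(url)
    print(match.group("ID"), match.group("Lang"))
```

Even in this form the pattern is hard to read, which supports the point about maintainability.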
I would suggest that query strings are best parsed by a simple tokenizer, or even just by split operations.

Under what situations are regular expressions really the best way to solve the problem?

I'm not sure if Jeff coined it but it's the joke/saying that people who say "oh, I know I'll use regular expressions!" now have two problems. I've always taken this to mean that people use regular expressions in very inappropriate contexts.
However, under what circumstances are regular expressions really the best answer? For which problems are they the best, or maybe the only, way to solve them?
RegExps are good for:
Text Format Validations (email, url, numbers)
Text searches/substitution.
Mappings (e.g. url pattern to function call)
Filtering some texts (related to substitution)
Lexical analysis during parsing.
They can be used to validate anything that has a pattern, like:
Social Security Number
Telephone Number ( 555-555-5555 )
Email Address (something@example.com)
IP Address (but it's more complex to make sure it's valid)
All those have patterns and are easily verifiable by RegEx.
They are harder to use for input that follows a logic rather than a pattern, like a credit card number, but they can still be used for some client-side validation.
So the best ways?
To sanitize data entry on the client side before sanitizing it on the server.
To "Search and Replace" strings that contain a pattern.
I'm sure I am missing a lot of other cases.
Regular expressions are a great way to parse text that doesn't already have a parser (unlike, say, XML). I have used them to create a parser for the mod_rewrite syntax in the .htaccess file and in my URL Rewriter project (http://www.codeplex.com/urlrewriter), for example.
They are really good when you want to be more specific than "*" or "?", like "3 letters, then 2 numbers, then a $ sign, then a period".
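That last description translates almost word for word into a pattern. A small Python illustration, reading "letters" as ASCII letters (an assumption):

```python
import re

# 3 letters, then 2 digits, then a literal "$", then a literal "."
SPECIFIC = re.compile(r"[A-Za-z]{3}[0-9]{2}\$\.")

print(bool(SPECIFIC.fullmatch("abc12$.")))
print(bool(SPECIFIC.fullmatch("ab123$.")))
```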
The quote is from an anti-Perl rant from Jamie Zawinski. I think Perl used to do regex really badly but now it seems to be a standard engine for a lot of programs.
But the same sentiment still applies. If you don't know how to use regex, you'd better not try something really fancy; otherwise you get one of these tags too (see bronze list) ;o)
https://stackoverflow.com/users/730/keng
They are good for matching or finding text that takes a very specific and simple format. By "simple" I mean not nested and smaller than the entire html spec, for example.
They are primarily of value for highly structured text parsing. If you use named groups (an option in most mature regex systems), you have a phenomenally powerful and crisp way to handle the strings.
Here's an example. Consider that netstat, in its various iterations on different Linux OSes and versions, can return different results. Sometimes there is an extra column, sometimes there is a shift in the date/time format. Regexes give you a powerful way to handle that with a single expression. Couple that with named groups, and you can retrieve the data without hacks like:
1) split on spaces
2) ok, the netstat version is X so I need to add 1 to all array references past column 5.
3) ok, the netstat version is Y so I need to make sure that I use multiple array references for the date info.
YUCK. Simple to fix in a Regex :-)
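A minimal sketch of the named-group idea in Python. The two sample lines are made up for illustration and only resemble `netstat -an` output on Linux; real output varies by platform and version, which is exactly why the fields are addressed by name and the state column is optional.

```python
import re

# Fields are retrieved by name, so an optional or extra column does not
# shift any positional indexes.
LINE = re.compile(
    r"^(?P<proto>\w+)\s+(?P<recv_q>\d+)\s+(?P<send_q>\d+)\s+"
    r"(?P<local>\S+)\s+(?P<remote>\S+)(?:\s+(?P<state>[A-Z_0-9]+))?"
)

# Hypothetical sample lines, not captured from a real system.
samples = [
    "tcp        0      0 127.0.0.1:5432     0.0.0.0:*          LISTEN",
    "udp        0      0 0.0.0.0:68         0.0.0.0:*",
]

for line in samples:
    fields = LINE.match(line).groupdict()
    print(fields["proto"], fields["local"], fields["state"])
```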