Regular expression - remove last literal - regex

So I have the following table in PostgreSQL.
It is a test table with a single column, route, that holds route names like:
I-95
US-95N
I-95 S
I want to remove the trailing direction literals from all the route names.
UPDATE <schema>.<table>
SET route= regexp_replace(route, '%[:digit:](S|N|E|W)', '%[:digit:]', 'ig');
No change in the records happens. Does anyone have any idea what I am doing wrong here?

To remove any single letter signifying a cardinal direction following immediately after a digit:
UPDATE tbl
SET route = regexp_replace(route, '(\d)[SNEW]', '\1', 'ig')
SQL Fiddle.
A positive lookbehind match would be even more elegant, but sadly only lookahead matches are implemented. So I use a back-reference to re-insert the first (captured) part from the match.
The bracket expression [SNEW] is simpler for the case than multiple branches (S|N|E|W), which would need non-capturing parentheses in this case: (?:S|N|E|W).
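To try the pattern outside the database, here is a minimal Python sketch of the same replacement. The helper names strip_direction and strip_trailing_direction are made up for illustration, and Python's re module stands in for PostgreSQL's regex engine, so treat it as an approximation. The second variant (optional whitespace, anchored at the end of the string) is an assumption about what the question needs for values like 'I-95 S'; it is not part of the answer above.

```python
import re

# Hypothetical helper: same idea as the regexp_replace call with flags 'ig'.
def strip_direction(route: str) -> str:
    # Capture the digit, drop the direction letter right after it,
    # and re-insert the digit through the back-reference \1.
    return re.sub(r"(\d)[SNEW]", r"\1", route, flags=re.IGNORECASE)

# Assumed variant: tolerate whitespace before the letter and require
# the letter to be the last character of the route name.
def strip_trailing_direction(route: str) -> str:
    return re.sub(r"(\d)\s*[SNEW]$", r"\1", route, flags=re.IGNORECASE)

routes = ["I-95", "US-95N", "I-95 S"]
cleaned = [strip_direction(r) for r in routes]
cleaned_trailing = [strip_trailing_direction(r) for r in routes]
```

With the first pattern, 'I-95 S' stays unchanged because a space sits between the digit and the letter; the assumed variant handles that case as well.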

Related

How to highlight SQL keywords using a regular expression?

I would like to highlight SQL keywords that occur within a string in a syntax highlighter. Here are the rules I would like to have:
Match the keywords SELECT and FROM (others will be added, but we'll start here). Must be all-caps
Must be contained in a string -- either starting with ' or "
The first word in that string (ignoring whitespace preceding it) should be one of the keywords.
This of course is not comprehensive (can ignore escapes within a string), but I'd like to start here.
Here are a few examples:
SELECT * FROM main -- will not match (not in a string)
"SELECT name FROM main" -- will match
"
SELECT name FROM main" -- will match
"""Here is a SQL statement:
SELECT * FROM main""" -- no, string does not start with a keyword (SELECT...).
The only way I thought to do it in a single regex would be with a lookbehind, but then it would not be fixed width, as we don't know when the string starts. Something like:
(?<=["']\s*(SELECT)\s*)(SELECT|FROM)
But this of course won't work.
Would something like this be possible to do in a single regex?
A suitable regular expression is likely to get pretty complex, especially as the rules evolve further. As others have noted, it may be worth considering using a parser instead. That said, here is one possible regex attempting to cover the rules mentioned so far:
(["'])\s*(SELECT)(?:\s+.*)?\s+(FROM)(?:\s+.*)?\1(?:[^\w]|$)
Online Demos
Debuggex Demo
Regex101 Demo
Explanation
The regex looks for either a double or single quote at the start (saved in capturing group #1) and then matches the same character at the end via the back-reference \1. The SELECT and FROM keywords are captured in capturing groups #2 and #3. (The (?:x|y) syntax ensures there aren't more groups for other choices, as ?: at the start of a group makes it non-capturing.) There are some further optional details, such as limiting what is allowed between the SELECT and FROM, and not counting the final quotation mark if it is immediately followed by a word character.
Results
SELECT * FROM tbl -- no match - not in a string
"SELECT * FROM tbl" -- matches - in a double-quoted string
'SELECT * FROM tbl;' -- matches - in a single-quoted string
'SELECT * FROM it's -- no match - letter after end quote
"SELECT * FROM tbl' -- no match - quotation marks don't match
'SELECT * FROM tbl" -- no match - quotation marks don't match
"select * from tbl" -- no match - keywords not upper case
'Select * From tbl' -- no match - still not all upper case
"SELECT col1 FROM" -- matches - even though no table name
' SELECT col1 FROM ' -- matches - as above with more whitespace
'SELECT col1, col2 FROM' -- matches - with multiple columns
Possible Improvement?
It might also be necessary to exclude quotation marks from the "any character" parts. This can be done, at the expense of increased complexity, by replacing both instances of .* with (?:(?!\1).)*, which consumes one character at a time only while that character is not the opening quote:
(["'])\s*(SELECT)(?:\s+(?:(?!\1).)*)?\s+(FROM)(?:\s+(?:(?!\1).)*)?\1(?:[^\w]|$)
See this Regex101 Demo.
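As a quick way to check the results table without an online tester, the first pattern can be run in Python. This is only a sketch: the helper name keywords_in is hypothetical, and Python's re module is assumed to behave like the engine used in the demos for this particular pattern.

```python
import re

# The first pattern from the answer, unchanged.
PATTERN = re.compile(
    r"""(["'])\s*(SELECT)(?:\s+.*)?\s+(FROM)(?:\s+.*)?\1(?:[^\w]|$)"""
)

# Hypothetical helper: return the two captured keywords, or None.
def keywords_in(line: str):
    m = PATTERN.search(line)
    return (m.group(2), m.group(3)) if m else None

in_double_quotes = keywords_in('"SELECT * FROM tbl"')
not_in_string = keywords_in("SELECT * FROM tbl")
lower_case = keywords_in('"select * from tbl"')
letter_after_quote = keywords_in("'SELECT * FROM it's")
mismatched_quotes = keywords_in("\"SELECT * FROM tbl'")
no_table_name = keywords_in('"SELECT col1 FROM"')
```

Only in_double_quotes and no_table_name come back with the keyword pair; the other four calls return None, in line with the table.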
You could use capturing groups:
(.*["']\s*\K)(?(1)(SELECT|FROM).*(SELECT|FROM)|)
In this case $2 would refer to the first keyword and $3 would refer to the second keyword. This also only works if there are only two keywords and only one string on a line, which seems to be true in all of your examples, but if those restrictions don't work for you, let me know.
Just tested the regexp below:
If you need to add other commands, things may get a little tricky, because some keywords don't apply. E.g.: ALTER TABLE mytable or UPDATE SET col = val;. For these scenarios you will need to create subgroups, and the regexp may become slow.
Best regards!
If I understand your requirements well I suggest that:
/^'\s*(SELECT)[^']*(FROM)[^']*'|^"\s*(SELECT)[^"]*(FROM)[^"]*"/m
[Regex Fiddle Demo]
Explanation:
To anchor at the start of a string, use ^.
To accept zero or more whitespace characters, use \s*.
To handle multi-line input, use the m flag, so that ^ also matches at the start of each line.
To keep matching case-sensitive, don't use the i flag.
To stay inside a block delimited by a specific character like ", use [^"]* instead of .*, so the match stops at the first closing delimiter.
When a block may start and end with either ' or ", use ' '|" " instead of ['"] ['"], so the closing quote has to be the same as the opening one.
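The points above can be tried out with a short Python sketch. The names SINGLE, DOUBLE and first_keywords are made up for this example, and re.MULTILINE is assumed to play the role of the m flag.

```python
import re

# One branch per quote character, as in the suggested regex.
SINGLE = r"^'\s*(SELECT)[^']*(FROM)[^']*'"
DOUBLE = r'^"\s*(SELECT)[^"]*(FROM)[^"]*"'
PATTERN = re.compile(SINGLE + "|" + DOUBLE, re.MULTILINE)

# Hypothetical helper: return the captured keywords, or None.
def first_keywords(text: str):
    m = PATTERN.search(text)
    if m is None:
        return None
    # Only one branch takes part in a match, so drop the unused groups.
    return tuple(g for g in m.groups() if g is not None)

plain = first_keywords('"SELECT name FROM main"')
leading_newline = first_keywords('"\nSELECT name FROM main"')
unquoted = first_keywords("SELECT * FROM main")
prose_first = first_keywords('"""Here is a SQL statement:\nSELECT * FROM main"""')
```

Note that the ^ anchor means a quoted string is only found when the quote is the first character on its line.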
Update:
If you need to capture any special keyword after verifying existence of SELECT keyword after start of your string, I can update my solution to this:
/^'\s*(SELECT)([^']*(SELECT|FROM))+|^"\s*(SELECT)([^"]*(SELECT|FROM))+/m
Without parsing the quoted strings, this could be done using the \G and \K constructs:
(?:"\s*(?=(?:SELECT|FROM))|(?<!^)\G)[^"]*?\K(SELECT|FROM)
demo

Why is this negative lookahead not working?

I have this regex that is supposed to help me find and replace deprecated mysql queries. For some reason though, once I replace one query, it recaptures the same area and merely extends it to the end of the next deprecated query. I'm trying to solve this by not letting it select a specific keyword that is in the replacement string (stmt), but it's ignoring this constraint for some reason.
(?mis)\$(?<sql>[a-z0-9]*) = (?<query>"select.*?\;)(?:(?!stmt).*?)\$(?<res>[a-z0-9]*?) = mysql_query.*?\;(?<txt>.*?)while\s?\(\$(?<row>[a-z0-9]*?) = mysql.*?\{ [1]
Here is the Regex101 I'm using to debug.
(?:(?!stmt).*?) is the lookahead in question. I want it to allow for an arbitrary amount of text in between the named capture groups before and after. [2] The *? should already be forcing it to find the smallest section. As you can see below, there is a perfectly acceptable match starting on line 14 ($sql = "SELECT admin from user where id=" . $userID;), but it is insisting on starting all the way at the top with the old, and already replaced match.
Why is my negative lookahead not working the way I think it should be working? [3]
1. I'm using (?mis) because PHPStorm doesn't play nice with normal flags.
2. To prevent random code and bad formatting from getting in the way of the pattern
3. If this is an XY problem and I should be forcing the correct match a different way, I'll welcome that as an answer instead.
The negative lookahead has no effect because the condition it forbids never occurs at the position where it is tested. It denies a match only when the semicolon at the end of the <query> is immediately followed by "stmt", which isn't the case in your code: it's followed by newline, whitespace, dollar sign, then "stmt".
You can fix that part by extending the negative lookahead to (?!\s*\$stmt), but then the second problem becomes evident: that just extends the <query> match to the next semicolon, which isn't followed by a $stmt. You fix that by tightening the match in <query> to match greedily on non-semicolons, rather than non-greedily on anything. That is, (?<query>"select.*?\;) becomes (?<query>"select[^;]*\;). This creates a dead stop to the match at the first semicolon.
This will fail to match if you have any semicolons inside your SQL, but hey.
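The three stages described above can be reproduced on a cut-down example. This is a sketch only: the code string is a made-up stand-in for the PHP file, the patterns keep just the <query> part of the full regex, and Python's re module is assumed to behave like the PCRE engine here.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL  # roughly the i and s of (?mis)

# Made-up snippet: a query line followed by an already converted line.
code = '$sql = "select a from t";\n    $stmt = 1;\n    $res = 2;'

# (?!stmt) only inspects the characters directly after the semicolon
# (a newline here), so it rejects nothing.
original = re.search(r'"select.*?;(?!stmt)', code, FLAGS)

# The wider lookahead rejects the first semicolon, but the lazy .*?
# then simply stretches to the next semicolon.
widened = re.search(r'"select.*?;(?!\s*\$stmt)', code, FLAGS)

# [^;]* cannot cross a semicolon, so the match has nowhere else to go.
tightened = re.search(r'"select[^;]*;(?!\s*\$stmt)', code, FLAGS)
```

Here original stops at the first semicolon, widened runs on to the second one, and tightened finds no match at all, which is the desired outcome for a query that has already been converted.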
Does that get the desired result?

regex needed for parsing string

I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the FDA.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours to find a regex solution to retrieve both the delimiter and the value that follows it and, though there seem to be posts that handle this, the code found in those posts hasn't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance, delimiters such as "=", "=,", "/=" are used. In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>"
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
Working demo
The idea is that the first group contains the delimiters and the second group the content. I assume the content consists of word characters (letters, digits and underscore). You then have to grab the content of every capturing group.
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
See live demo.
If your values can contain non-alphanumeric characters, it gets complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
See live demo.
Code the delimiters longest first, eg "=," before "=", otherwise the alternation, which matches left to right, will match "=" and the comma will become part of the value.
This uses a negative look ahead to stop matching past the next delimiter.
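A small Python sketch shows both points: the delimiter/value pairs found in the sample string, and why the longer delimiter has to come first in the alternation. The names DELIMS and PATTERN are illustrative, and the "=,abc" input is a made-up example.

```python
import re

# Delimiters, longest first ("=," before "=").
DELIMS = r"=/|=>|=,|=|&,1"

# Group 1: the delimiter. Group 2: every character that is not directly
# followed by a delimiter, plus one final character.
PATTERN = re.compile(rf"({DELIMS})((?:.(?!{DELIMS}))+.)")

pairs = PATTERN.findall("=/A9999XYZ=>100T0479&,1Blah")

# Alternation order matters: with "=" listed first, the comma leaks
# into the value.
short_first = re.match(r"(=|=,)(.+)", "=,abc").groups()
long_first = re.match(r"(=,|=)(.+)", "=,abc").groups()
```

For the sample string, pairs holds ('=/', 'A9999XYZ'), ('=>', '100T0479') and ('&,1', 'Blah'). As written, the pattern needs a value of at least two characters, because of the trailing dot after the repeated group.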

Strategy advice for this regex (matching in the middle of lookahead and a lookbehind)

I am using positive lookbehind and lookahead to match a word between certain parts (FROM and TO strings).
.*(?<=FROM)\s+(.*?)\s+(?=TO).*
EDIT: That approach cannot be changed. Please take it as given rather than suggesting a workaround for the approach itself, thank you! It's more a theoretical question about how to deal with lookaheads and lookbehinds in the middle of a match.
I'd like to input a string like
FROM table a, table2 b TO
and obtain table and table2 as \1. The a and b labels are optional.
My problem is that if I place something like (?:(\w+)\s*,?)+? for matching every table part, it seems like it's done backwards
http://regex101.com/r/mV4rD8
If I'm understanding what you want correctly, you don't need lookahead/behind. You can do:
FROM (?:(\w+)(?: \w)*(?:,)? )+TO
Of the three parts inside the outermost parentheses, the second and third need to be treated separately because they are optional for different reasons. The second is present if the a and b labels are present. The third is present if the table is not the last one in the list.
This will capture the table names as you described. So e.g.:
FROM table1 a, table2, table3 c TO
Will capture "table1", "table2" and "table3".
I used literal spaces, but you can replace them with \s if you prefer.
EDIT: With the lookahead and lookbehind still present, as per your requirement:
.*(?<=FROM)\s+(?:(\w+)(?:\s+\w)*(?:\s*,)?\s+)+(?=TO).*
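Whether every table name is available afterwards depends on the regex engine. Some engines (.NET, for example, through its capture collections) keep each repetition of a group, while in Python's re module a repeated group only keeps its last match. The sketch below shows that behaviour and a two-step alternative; the two-step approach is my own assumption about how one might work around it, not something taken from the answer.

```python
import re

text = "FROM table1 a, table2, table3 c TO"

# The pattern from the answer: group 1 sits inside a repeated group.
pattern = r"FROM (?:(\w+)(?: \w)*(?:,)? )+TO"
m = re.search(pattern, text)
last_only = m.group(1)  # only the final repetition survives in Python

# Assumed workaround: grab everything between FROM and TO first,
# then take the first word of each comma-separated item.
body = re.search(r"FROM\s+(.*?)\s+TO", text).group(1)
tables = [item.split()[0] for item in body.split(",")]
```

Here last_only is 'table3', while tables holds all three names.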

Regex - Combining positive and negative lookbehind

I am doing some replacements in some huge SSIS packages to reflect changes in table and column names.
Some of the tables have column names which are identical to the table names, and I need to match the column name without matching the table name.
So what I need is a way to match MyName in [MyName] but not in [dbo].[MyName]
(?<=\[)(MyName)(?=\]) matches both, and I thought that (?<!\[dbo\]\.)(?<=\[)(MyName)(?=\]) would do the trick, but it does not seem to work.
You need to include the opening square bracket in the first lookbehind:
(?<!\[dbo\]\.\[)(?<=\[)(MyName)(?=\])
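The reason is that both lookbehinds are tested at the same position, directly before MyName, and the text ending there is [dbo].[ rather than [dbo]. so the negative lookbehind has to account for the bracket too. A short Python sketch, using a made-up sample string and the hypothetical replacement NewName, shows the difference:

```python
import re

# Made-up sample containing both forms.
text = "[MyName] and [dbo].[MyName]"

broken = r"(?<!\[dbo\]\.)(?<=\[)(MyName)(?=\])"
fixed = r"(?<!\[dbo\]\.\[)(?<=\[)(MyName)(?=\])"

# Start offsets of every match for each pattern.
broken_hits = [m.start() for m in re.finditer(broken, text)]
fixed_hits = [m.start() for m in re.finditer(fixed, text)]

# With the fixed pattern, only the bare [MyName] is replaced.
renamed = re.sub(fixed, "NewName", text)
```

The broken pattern still matches at both offsets (1 and 20), whereas the fixed one matches only the first, so renamed becomes "[NewName] and [dbo].[MyName]".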