How to highlight SQL keywords using a regular expression? - regex

I would like to highlight SQL keywords that occur within a string in a syntax highlighter. Here are the rules I would like to have:
Match the keywords SELECT and FROM (others will be added, but we'll start here). Must be all-caps
Must be contained in a string -- either starting with ' or "
The first word in that string (ignoring whitespace preceding it) should be one of the keywords.
This of course is not comprehensive (can ignore escapes within a string), but I'd like to start here.
Here are a few examples:
SELECT * FROM main -- will not match (not in a string)
"SELECT name FROM main" -- will match
"
SELECT name FROM main" -- will match
"""Here is a SQL statement:
SELECT * FROM main""" -- no, string does not start with a keyword (SELECT...).
The only way I could think of to do this in a single regex is with a lookbehind, but it would not be fixed-width, since we don't know where the string starts. Something like:
(?<=["']\s*(SELECT)\s*)(SELECT|FROM)
But this of course won't work, because the lookbehind is not fixed-width.
Would something like this be possible to do in a single regex?

A suitable regular expression is likely to get pretty complex, especially as the rules evolve further. As others have noted, it may be worth considering using a parser instead. That said, here is one possible regex attempting to cover the rules mentioned so far:
(["'])\s*(SELECT)(?:\s+.*)?\s+(FROM)(?:\s+.*)?\1(?:[^\w]|$)
Online Demos
Debuggex Demo
Regex101 Demo
Explanation
As can be seen in the above visualisation, the regex looks for either a double or single quote at the start (saved in capturing group #1) and then matches this reference at the end via \1. The SELECT and FROM keywords are captured in capturing groups #2 and #3. (The (?:x|y) syntax ensures there aren't more groups for other choices as ?: at the start of a choice excludes it as a capturing group.) There are some further optional details such as limiting what is allowed between the SELECT and FROM and not counting the final quotation mark if it is immediately succeeded by a word character.
Results
SELECT * FROM tbl -- no match - not in a string
"SELECT * FROM tbl" -- matches - in a double-quoted string
'SELECT * FROM tbl;' -- matches - in a single-quoted string
'SELECT * FROM it's -- no match - letter after end quote
"SELECT * FROM tbl' -- no match - quotation marks don't match
'SELECT * FROM tbl" -- no match - quotation marks don't match
"select * from tbl" -- no match - keywords not upper case
'Select * From tbl' -- no match - still not all upper case
"SELECT col1 FROM" -- matches - even though no table name
' SELECT col1 FROM ' -- matches - as above with more whitespace
'SELECT col1, col2 FROM' -- matches - with multiple columns
Possible Improvement?
It might also be necessary to exclude quotation marks from the "any character" parts. This can be done, at the expense of increased complexity, by replacing both instances of .* with (?:(?!\1).)*, which consumes one character at a time only while it is not the opening quote:
(["'])\s*(SELECT)(?:\s+(?:(?!\1).)*)?\s+(FROM)(?:\s+(?:(?!\1).)*)?\1(?:[^\w]|$)
See this Regex101 Demo.
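For reference, the improved pattern can be exercised against some of the result lines listed above. This is a minimal sketch in Python, assuming its built-in re module (the answer does not name an engine); the helper name find_select_from is made up for illustration.

```python
import re

# The improved pattern from the answer, unchanged.
PATTERN = re.compile(
    r"""(["'])\s*(SELECT)(?:\s+(?:(?!\1).)*)?\s+(FROM)(?:\s+(?:(?!\1).)*)?\1(?:[^\w]|$)"""
)


def find_select_from(line):
    """Return the spans of SELECT and FROM, or None when the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.span(2), match.span(3)


print(find_select_from('"SELECT * FROM tbl"'))   # two (start, end) spans
print(find_select_from("'SELECT * FROM it's"))   # letter after the end quote
```

The spans of groups 2 and 3 are what a highlighter would colour; everything else in the match is only context.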

You could use capturing groups:
(.*["']\s*\K)(?(1)(SELECT|FROM).*(SELECT|FROM)|)
In this case $2 would refer to the first keyword and $3 would refer to the second keyword. This also only works if there are only two keywords and only one string on a line, which seems to be true in all of your examples, but if those restrictions don't work for you, let me know.

Just tested the regexp below:
If you need to add other commands, things may get a little tricky, because some keywords don't apply. E.g. ALTER TABLE mytable or UPDATE SET col = val;. For these scenarios you will need to create subgroups, and the regexp may become slow.
Best regards!

If I understand your requirements well I suggest that:
/^'\s*(SELECT)[^']*(FROM)[^']*'|^"\s*(SELECT)[^"]*(FROM)[^"]*"/m
[Regex Fiddle Demo]
Explanation:
When you need to check start of a string; use ^.
When you need to accept 0-n spaces; use \s*.
When the input spans several lines, use the m flag so that ^ also matches at the start of each line.
When you need case-sensitive matching, don't use the i flag.
When you need to confine a match between delimiters such as ", use [^"]* instead of .* so that the match stops at the first closing delimiter.
When the opening and closing characters must be the same, such as ' or ", use ' '|" " instead of ['"] ['"].
Update:
If you need to capture any special keyword after verifying existence of SELECT keyword after start of your string, I can update my solution to this:
/^'\s*(SELECT)([^']*(SELECT|FROM))+|^"\s*(SELECT)([^"]*(SELECT|FROM))+/m
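One consequence of the two-branch form is that the keywords land in groups 1 and 2 or in groups 3 and 4, depending on the quote type. The sketch below tries the first pattern of this answer in Python, assuming the re module, with re.MULTILINE standing in for the m flag; keyword_spans is a hypothetical helper name.

```python
import re

# First pattern from the answer, split per quote type for readability.
SINGLE = r"^'\s*(SELECT)[^']*(FROM)[^']*'"
DOUBLE = r'^"\s*(SELECT)[^"]*(FROM)[^"]*"'
PATTERN = re.compile(SINGLE + "|" + DOUBLE, re.MULTILINE)


def keyword_spans(text):
    """Return the spans of the keyword groups that took part in the first match."""
    match = PATTERN.search(text)
    if match is None:
        return []
    # Only one branch matches; the groups of the other branch are None.
    return [match.span(i) for i in range(1, 5) if match.group(i) is not None]


print(keyword_spans('"\nSELECT name FROM main"'))
print(keyword_spans('x = "SELECT a FROM b"'))  # quote is not at the start of a line
```

Because of the leading ^, a string that does not begin at the start of a line is not matched.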

Without parsing the quoted strings, this could be done using the \G and \K constructs:
(?:"\s*(?=(?:SELECT|FROM))|(?<!^)\G)[^"]*?\K(SELECT|FROM)
demo
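Python's built-in re module has neither \G nor \K, so the same idea needs two passes there. The sketch below is an approximation under that assumption, not a port: unlike the pattern above it requires a closing quote, and, like it, it does not really parse strings, so the closing quote of one string followed by a keyword can be mistaken for an opening quote. The names STRING_RE, KEYWORD_RE and keyword_spans are made up for illustration.

```python
import re

# Pass 1: a double-quoted run whose first non-blank word is a keyword.
STRING_RE = re.compile(r'"\s*(?=SELECT|FROM)[^"]*"')
# Pass 2: the keywords inside that run.
KEYWORD_RE = re.compile(r"SELECT|FROM")


def keyword_spans(text):
    """Return (start, end) offsets of keywords inside qualifying strings."""
    spans = []
    for string_match in STRING_RE.finditer(text):
        offset = string_match.start()
        for keyword in KEYWORD_RE.finditer(string_match.group()):
            spans.append((offset + keyword.start(), offset + keyword.end()))
    return spans


text = 'q = "SELECT a FROM t"'
print([text[start:end] for start, end in keyword_spans(text)])
```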


Regex to match(extract) string between dot(.)

I want to select some string combination (with dots(.)) from a very long string (sql). The full string could be a single line or multiple line with new line separator, and this combination could be in start (at first line) or a next line (new line) or at both place.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add some prefix or suffix words (or a space) that must be present wherever this logic is checked? In the above case, if the statement is an update the regex should pick test.table, but if the statement is like select test.table the regex should not pick up this combination, and the same applies for suffixes.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
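As a quick check, the pattern can be run over the first two examples from the question. This is a small sketch assuming Python's re module; the constant name DOTTED is arbitrary.

```python
import re

# The pattern from the answer: non-whitespace, a literal dot, non-whitespace.
DOTTED = re.compile(r"[^\s]+\.[^\s]+")

sentence = "I am testing something like test.test.test in sentence."
sql = """UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )"""

print(DOTTED.findall(sentence))
print(DOTTED.findall(sql))
```

The trailing "sentence." is not matched because at least one non-whitespace character is required after the dot.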
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
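The lookbehind version can be tried on Example 3 from the question. A minimal sketch, assuming Python's re module: the lookbehind is accepted there because INTO and FROM have the same length, and re.IGNORECASE is used in place of the inline (?i).

```python
import re

# Same pattern as above; the flag replaces the inline (?i).
AFTER_KEYWORD = re.compile(r"(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+", re.IGNORECASE)

sql = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

print(AFTER_KEYWORD.findall(sql))
```

Note that the lookbehind allows exactly one whitespace character between the keyword and the name.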
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that understands any word character \w that matches between 0 and unlimited times, followed by a dot, followed by another series of word character that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that this solution is written for your use case (SQL strings): if the input contains something like '.word' or 'word..word', it will still match, which I assume does not occur in your data.
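If inputs such as '.word' or 'word..word' do need to be rejected, one possible variation is to require at least one word character on both sides of every dot. The pattern below is not part of the answer above; it is a hypothetical stricter variant, sketched in Python with the re module.

```python
import re

# Hypothetical stricter variant: word characters are required around every dot.
DOTTED_NAME = re.compile(r"\w+(?:\.\w+)+")

sql = """UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )"""

print(DOTTED_NAME.findall(sql))
print(DOTTED_NAME.findall(".word word..word"))
```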

Powershell script to search, split and join in one line

Been racking my Friday brain on a regex problem with dealing with Sql Server object names.
An input to my Powershell script is a procedure name. The name can take many forms, such as
dbo.Procedure
[dbo].Procedure
dbo.[Procedure.Name]
etc
So far I'd come up with the following to split the value into its constituent parts:
[string[]] $procNameA = $procedure.Split("(?:\.)(?=(?:[^\]]|\[[^\]]*\])*$)")
In addition I have a regex that I could use to handle the square brackets
(?:\[)*([A-Za-z0-9. !]+)(?:\])*
And this is about as far as my limited regex experience will take me.
Now granted I could deal with a lot of this by treating each element in a ForEach and doing a RegEx replace there, but y'know that just seems so, I dunno, ungainly. So, question I have for any passing Powershell & RegEx guru: "How can I do all this in one line?"
What I'm looking for is a way to get the following results:
Original                 Corrected
=====================    =====================
dbo.ProcName             [dbo].[ProcName]
dbo.[ProcName]           [dbo].[ProcName]
[dbo].ProcName           [dbo].[ProcName]
[dbo].[ProcName]         [dbo].[ProcName]
[My.Schema].[My.Proc]    [My.Schema].[My.Proc]
[My.Schema].ProcName     [MySchema].[ProcName]
dbo.[ABadBADName!        [dbo].[[ABadBADName!]
(Notice the last instance, where an object name starts but does not end with a square bracket. Not that I'm expecting that [and if I saw anyone on my team naming an object like that I'd be asking HR if I can fire them for it], but I do like to be thorough.)
Think that covers everything...
So, over to you Powershell & RegEx gurus - how do I do this?
Please limit any answers to FULLY answering the question with code I can actually use and not just syntax suggestions.
Clarification: I am acutely aware that sometimes 'slow and steady wins the race' may apply here and that support wise it would be potentially safer to handle the rest in a ForEach, but that's not the point. Part of this is to help me understand just how flexible RegEx can be, so this is more of an educational exercise rather than a philosophical one.
Okay how about this:
@'
dbo.ProcName
dbo.[ProcName]
[dbo].ProcName
[dbo].[ProcName]
[My.Schema].[My.Proc]
[My.Schema].ProcName
dbo.[ABadBADName!
'@ -split '\s*\r?\n\s*' | % {
$_ -replace '^(?:\[(?<schema>[^\]]+)\]|(?<schema>[^\.]+))\.(?:\[(?<proc>[^\]]+)\]|(?<proc>[^\.]+))$', '[${schema}].[${proc}]'
}
Note that I'm only using ForEach-Object (%) here to iterate through your test cases; the actual replace is done with a single regex / replace.
Explanation
So the important part here is the regex:
^(?:\[(?<schema>[^\]]+)\]|(?<schema>[^\.]+))\.(?:\[(?<proc>[^\]]+)\]|(?<proc>[^\.]+))$
Breaking it down:
^ -- match the beginning of the string
(?: -- open a non-capturing group (for alternation purposes)
\[ -- match a literal left bracket [
(?<schema> -- start a named capture group, with the name schema
[^\]]+ -- match 1 or more of any character that is not a literal right square bracket ]
) -- end the schema capture group
| -- alternation; if the previous expression didn't match, try what comes after this
(?<schema> -- again start a named capture group called schema; this is only tried if the other one didn't match.
[^\.]+ -- match 1 or more of any character that is not a literal dot .
) -- end the alternate schema capture group
) -- end the non-capturing group
\. -- match a literal dot . (this is the one separating schema and proc)
(the next part for proc is exactly the same steps as above, with a different name for the capturing group)
$ -- match the end of the string
In the replace, we just qualify the names of the groups with ${name} syntax instead of the numbers $1 (which would work too actually).
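For comparison outside PowerShell, here is a rough equivalent in Python. It is a sketch under one known difference: Python's re module rejects a group name that is defined twice, which .NET allows, so the two alternatives use numbered groups and the replacement is assembled in a function. The name bracket_name is made up for illustration.

```python
import re

# Numbered groups: 1/2 hold the schema, 3/4 hold the procedure name.
NAME_RE = re.compile(r"^(?:\[([^\]]+)\]|([^.]+))\.(?:\[([^\]]+)\]|([^.]+))$")


def bracket_name(name):
    """Return [schema].[proc]; input that does not match is returned unchanged."""
    match = NAME_RE.match(name)
    if match is None:
        return name
    schema = match.group(1) or match.group(2)
    proc = match.group(3) or match.group(4)
    return f"[{schema}].[{proc}]"


for raw in ["dbo.ProcName", "[My.Schema].ProcName", "dbo.[ABadBADName!"]:
    print(raw, "->", bracket_name(raw))
```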

Postgres asterisk regex quantifier not working

In Postgres 9.5.1 the following command works:
select regexp_replace('JamesBond007','\d+','');
Output:
JamesBond
However, the asterisk does not seem to work:
select regexp_replace('JamesBond007','\d*','');
it produces:
JamesBond007
Even more weird things happen when I put something in as replacement string:
select regexp_replace('JamesBond007','\d+','008');
results in:
JamesBond008
while
select regexp_replace('JamesBond007','\d*','008');
gives me back:
008JamesBond007
The Postgres documentation says * = a sequence of 0 or more matches of the atom.
So what is happening here? (N.B. in Oracle all the above works as expected)
The thing is that \d* can match an empty string and you are not passing the flag g.
See regexp_replace:
The flags parameter is an optional text string containing zero or more single-letter flags that change the function's behavior. Flag i specifies case-insensitive matching, while flag g specifies replacement of each matching substring rather than only the first one.
The \d* matches the empty location at the beginning of the JamesBond007 string, and since g is not passed, that empty string is replaced with 008 when you use select regexp_replace('JamesBond007','\d*','008'); and the result is expected - 008JamesBond007.
With select regexp_replace('JamesBond007','\d*','');, again, \d* matches the empty location at the beginning of the string, and replaces it with an empty string (no visible changes).
Note that Oracle's REGEXP_REPLACE replaces all occurrences by default:
By default, the function returns source_char with every occurrence of the regular expression pattern replaced with replace_string.
In general, you should be cautious when using patterns matching empty strings inside regex-based replace functions/methods. Do it only when you understand what you are doing. If you want to replace digit(s) you usually want to find at least 1 digit. Else, why remove something that is not present in the string in the first place?
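The empty-match effect is not specific to Postgres. As an analogy only, the sketch below reproduces it with Python's re module, where count=1 plays the role of calling regexp_replace without the g flag; it says nothing about how the Postgres engine is implemented.

```python
import re

text = "JamesBond007"

# count=1 replaces only the first match, like regexp_replace without 'g'.
print(re.sub(r"\d+", "008", text, count=1))  # first match is "007"
print(re.sub(r"\d*", "008", text, count=1))  # first match is empty, at offset 0
print(re.sub(r"\d*", "", text, count=1))     # empty match replaced by nothing

# The first match of \d* really is the empty string at the start.
print(re.match(r"\d*", text).span())
```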

Regex Enforcing match

Ok i got this regex:
^[\w\s]+=["']\w+['"]
Now the regex will match:
a href='google'
a href="google"
and also
a href='google"
How can I force the regex to match the same kind of quote at both ends?
If the first quote is a single quote, how can I make the last quote also a single quote rather than a double quote?
Read about backreferences.
^[\w\s]+=(["'])\w+?\1
Note that you want to put a ? after the second + or else it will be greedy. However, in general this is not the right way to parse HTML. Use Beautiful Soup.
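To see the backreference at work, the pattern can be tested against the three inputs from the question. A minimal sketch, assuming Python's re module; QUOTED_ATTR is an arbitrary name.

```python
import re

# Group 1 captures the opening quote; \1 demands the same character at the end.
QUOTED_ATTR = re.compile(r"""^[\w\s]+=(["'])\w+?\1""")

for candidate in ["a href='google'", 'a href="google"', "a href='google\""]:
    print(candidate, "->", bool(QUOTED_ATTR.match(candidate)))
```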
I am afraid you will have to do it the long way:
^[\w\s]+=("\w+"|'\w+')
More technically, ensuring correct matching / nesting of quotes is not a problem for a regular grammar so for more complex problems you would have to use a proper parser (or perl6 style extended regular expression but they technically do not class as regular expressions).
Capture the opening quote in a group and replace the closing ['"] with \1 to use a backreference:
^[\w\s]+=(["'])\w+\1
What exactly do you want to match? It sounds you want to match:
word (tagname)
mandatory whitespace
word (attr name)
optional whitespace
=
optional whitespace
either single quoted or double quoted anything (attr value)
That would be: ^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")
This will allow matches like:
a href='' - empty attr
a href='Hello world' - spaces and other non-word characters in quoted part
a href="one 'n two" - quotes of different kind in quoted part
a href = 'google' - spaces on both sides of =
And disallow things like these that your original regexp allows:
a b c href='google' - extra words
='google' - only spaces on the left
href='google' - only attr on the left
It still doesn't sound exactly right - you're trying to match a tag with exactly one attribute?
With this regexp, tag name will be in $1, attr name in $2, and attr value in either $3 or $4 (the other being nil - most languages distinguish group not taken with nil vs group taken but empty with "" if you need it).
Regexp that would ensure attr value gets in the same group would be messier if you wanted to allow single quotes in doubly quoted attr value and vice versa - something like ^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3 ((?!) is zero-width negative look-ahead - (?:(?!\3).) means something like [^\3] except the latter isn't supported).
If you don't care about this, ^(\w+)\s+(\w+)\s*=\s*(['"])([^'"]*)\3 will do just fine (for both, $3 will be the quote type and $4 the attr value).
By the way re (["'])\w+?\1 above - \w doesn't match quotes, so this ? doesn't change anything.
Having said all that, use a real HTML parser ;-)
These regexps will work in Perl and Ruby. Other languages usually copy Perl's regexp system, but often introduce minor changes so some adjustments might be necessary. Especially the one with negative look-aheads might be unsupported.
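As one data point on portability, the look-ahead variant appears to work unchanged in Python's re module. The sketch below wraps it in a hypothetical helper, parse_attr, that returns the tag name, attribute name and attribute value.

```python
import re

# The look-ahead variant from above: group 3 is the quote, group 4 the value.
ATTR_RE = re.compile(r"""^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3""")


def parse_attr(text):
    """Return (tag, attribute, value), or None when the text does not match."""
    match = ATTR_RE.match(text)
    if match is None:
        return None
    tag, attr, _quote, value = match.groups()
    return tag, attr, value


print(parse_attr('a href="one \'n two"'))
print(parse_attr("a b c href='google'"))
```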
Try this:
^[\w\s]+="\w+"|^[\w\s]+='\w+'

PHP - Regex for prepending table names within SQL

I am looking for an unobtrusive way to find and replace table names based on their position in an SQL query.
Example:
$query = 'SELECT t1.id, t1.name, t2.country FROM users AS t1, country AS t2 INNER JOIN another_table AS t3 ON t3.user_id = t1.id';
I essentially need to prepend client name abbreviations to table names and then have my CMS handle that change. So, going from 'users' to 'so_users' (If Stack Overflow was a client) but not have to add curly braces around all query table names like Drupal. An example is how WordPress will allow you on setup to prepend table names, but the way WordPress handles this issue is not ideal for my means.
For my example I want the output of some method to be:
$query = 'SELECT t1.id, t1.name, t2.country FROM so_users AS t1, so_country AS t2 INNER JOIN so_another_table AS t3 ON t3.user_id = t1.id';
('so_' is prepended to table names)
Thank you.
Kris
Using a query builder class would be the best solution, as you don't want to make any assumption about the pattern you want to replace with regex. If you don't find any existing library suitable for your particular need, roll out your own. It's not hard to make a simple query builder.
Regex does not have the power to parse SQL. Think of constructions like:
SELECT 'SELECT * FROM users';
SELECT * FROM users; -- users
SELECT '* -- users' FROM users;
SELECT '\' FROM users; -- '; -- differs in My/Pg vs others
SELECT users.name FROM country AS users; -- or without AS
SELECT users(name) FROM country; -- users() is procedure
SELECT "users"."name" FROM users; -- or ` on MySQL, [] in TSQL
and so on. To parse SQL you need a proper SQL parser library; trying to hack it after the fact in regex will only make weird mistakes.
This should work for your given example.
A word of caution though: as others have already mentioned, regexes are not the best tool for what you need. The given regex works for your example, nothing more, nothing less. There are lots of conceivable SQL constructions where this regex will not make the replacements you need.
$result = preg_replace('/(FROM|JOIN|,) ([_\w]*) (AS)/m', '$1 so_$2 $3', $subject);
# (FROM|JOIN|,) ([_\w]*) (AS)
#
# Match the regular expression below and capture its match into backreference number 1 «(FROM|JOIN|,)»
#    Match either the regular expression below (attempting the next alternative only if this one fails) «FROM»
#       Match the characters “FROM” literally «FROM»
#    Or match regular expression number 2 below (attempting the next alternative only if this one fails) «JOIN»
#       Match the characters “JOIN” literally «JOIN»
#    Or match regular expression number 3 below (the entire group fails if this one fails to match) «,»
#       Match the character “,” literally «,»
# Match the character “ ” literally « »
# Match the regular expression below and capture its match into backreference number 2 «([_\w]*)»
#    Match a single character present in the list below «[_\w]*»
#       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
#       The character “_” «_»
#       A word character (letters, digits, etc.) «\w»
# Match the character “ ” literally « »
# Match the regular expression below and capture its match into backreference number 3 «(AS)»
#    Match the characters “AS” literally «AS»
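To make the caution concrete, here is the same replacement sketched in Python with the re module, run once on the query from the question and once on a query where the pattern also fires inside a string literal. The function name add_prefix and the second query are made up for illustration.

```python
import re

# Same pattern as the PHP call above.
TABLE_RE = re.compile(r"(FROM|JOIN|,) ([_\w]*) (AS)")


def add_prefix(query, prefix="so_"):
    """Prepend prefix to every name the pattern takes for a table."""
    return TABLE_RE.sub(
        lambda m: f"{m.group(1)} {prefix}{m.group(2)} {m.group(3)}", query
    )


query = (
    "SELECT t1.id, t1.name, t2.country FROM users AS t1, country AS t2 "
    "INNER JOIN another_table AS t3 ON t3.user_id = t1.id"
)
print(add_prefix(query))

# The text inside the string literal is rewritten as well.
print(add_prefix("SELECT 'copied FROM users AS u' FROM log AS l"))
```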