PHP - Regex for prepending table names within SQL - regex

I am looking for an unobtrusive way to find and replace table names based on their position in an SQL query.
Example:
$query = 'SELECT t1.id, t1.name, t2.country FROM users AS t1, country AS t2 INNER JOIN another_table AS t3 ON t3.user_id = t1.id';
I essentially need to prepend client name abbreviations to table names and then have my CMS handle that change. So, going from 'users' to 'so_users' (If Stack Overflow was a client) but not have to add curly braces around all query table names like Drupal. An example is how WordPress will allow you on setup to prepend table names, but the way WordPress handles this issue is not ideal for my means.
For my example I want the output of some method to be:
$query = 'SELECT t1.id, t1.name, t2.country FROM so_users AS t1, so_country AS t2 INNER JOIN so_another_table AS t3 ON t3.user_id = t1.id';
('so_' is prepended to table names)
Thank you.
Kris

Using a query builder class would be the best solution, because you don't want to rely on assumptions about which patterns a regex should replace. If no existing library suits your particular need, roll your own. It's not hard to write a simple query builder.

Regex does not have the power to parse SQL. Think of constructions like:
SELECT 'SELECT * FROM users';
SELECT * FROM users; -- users
SELECT '* -- users' FROM users;
SELECT '\' FROM users; -- '; -- differs in My/Pg vs others
SELECT users.name FROM country AS users; -- or without AS
SELECT users(name) FROM country; -- users() is procedure
SELECT "users"."name" FROM users; -- or ` on MySQL, [] in TSQL
and so on. To parse SQL you need a proper SQL parser library; trying to hack it after the fact in regex will only make weird mistakes.

This should work for your given example.
A word of caution, though: as others have already mentioned, regexes are not the best tool for what you need. The given regex works for your example, nothing more, nothing less. Many SQL constructions are imaginable for which this regex will not make the replacements you need.
$result = preg_replace('/(FROM|JOIN|,) ([_\w]*) (AS)/m', '$1 so_$2 $3', $subject);
# (FROM|JOIN|,) ([_\w]*) (AS)
#
# Match the regular expression below and capture its match into backreference number 1 «(FROM|JOIN|,)»
# Match either the regular expression below (attempting the next alternative only if this one fails) «FROM»
# Match the characters “FROM” literally «FROM»
# Or match regular expression number 2 below (attempting the next alternative only if this one fails) «JOIN»
# Match the characters “JOIN” literally «JOIN»
# Or match regular expression number 3 below (the entire group fails if this one fails to match) «,»
# Match the character “,” literally «,»
# Match the character “ ” literally « »
# Match the regular expression below and capture its match into backreference number 2 «([_\w]*)»
# Match a single character present in the list below «[_\w]*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# The character “_” «_»
# A word character (letters, digits, etc.) «\w»
# Match the character “ ” literally « »
# Match the regular expression below and capture its match into backreference number 3 «(AS)»
# Match the characters “AS” literally «AS»
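For readers who want to try the pattern outside PHP, here is a rough Python sketch of the same substitution. The helper name add_prefix and the second query are illustrative assumptions, not part of the answer. The sketch only shows that the pattern handles the example query, and that it also rewrites text inside a string literal, which is one of the failure modes mentioned above.

```python
import re

# Same pattern as the PHP answer: a keyword or comma, a table name, then AS.
TABLE_PATTERN = re.compile(r'(FROM|JOIN|,) ([_\w]*) (AS)')

def add_prefix(query, prefix="so_"):
    """Hypothetical helper: prepend `prefix` to every name the pattern takes for a table."""
    return TABLE_PATTERN.sub(
        lambda m: f"{m.group(1)} {prefix}{m.group(2)} {m.group(3)}", query)

query = ('SELECT t1.id, t1.name, t2.country FROM users AS t1, country AS t2 '
         'INNER JOIN another_table AS t3 ON t3.user_id = t1.id')
prefixed = add_prefix(query)
print(prefixed)

# A string literal that merely looks like SQL gets rewritten as well.
tricky = "SELECT 'FROM users AS x' FROM users AS u"
print(add_prefix(tricky))
```

The second call changes the literal to 'FROM so_users AS x', which is exactly the kind of mistake a real SQL parser or query builder avoids.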

Related

Regex to match(extract) string between dot(.)

I want to select some string combination (with dots(.)) from a very long string (sql). The full string could be a single line or multiple line with new line separator, and this combination could be in start (at first line) or a next line (new line) or at both place.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add a prefix or suffix word (or whitespace) that must be present wherever this logic is applied? In the case above, if the statement is an UPDATE the regex should pick up test.table, but if the statement is something like SELECT test.table the regex should not pick up this combination; the same applies to suffixes.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: The words before and after the combination could be anything, such as update, select, { or (, so we have to find every occurrence of words joined together with . (dot), however many there are.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
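As a quick check, the two patterns above can be run against Example3 in Python. This is only a sketch; the variable names are made up for illustration. One assumption worth stating: Python's re module requires a fixed-width lookbehind, which happens to hold here because INTO and FROM are both four letters long. Keywords of different lengths (UPDATE, say) would need separate lookbehinds or a capture group instead.

```python
import re

sql = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

# Plain pattern: every whitespace-delimited token that contains a dot.
anything_dotted = re.findall(r'[^\s]+\.[^\s]+', sql)

# Lookbehind pattern: only tokens directly preceded by INTO or FROM.
after_keyword = re.findall(r'(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+', sql)

print(anything_dotted)
print(after_keyword)
```

The first list still contains the unwanted alias references such as as_t.old; the second keeps only the two test.table occurrences.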
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that understands any word character \w that matches between 0 and unlimited times, followed by a dot, followed by another series of word character that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that: this solution is written for your use case (to be used for SQL strings), because now if you have something like '.word' or 'word..word', it will still catch it which I assume you don't have a string like that.

How to highlight SQL keywords using a regular expression?

I would like to highlight SQL keywords that occur within a string in a syntax highlighter. Here are the rules I would like to have:
Match the keywords SELECT and FROM (others will be added, but we'll start here). Must be all-caps
Must be contained in a string -- either starting with ' or "
The first word in that string (ignoring whitespace preceding it) should be one of the keywords.
This of course is not comprehensive (can ignore escapes within a string), but I'd like to start here.
Here are a few examples:
SELECT * FROM main -- will not match (not in a string)
"SELECT name FROM main" -- will match
"
SELECT name FROM main" -- will match
"""Here is a SQL statement:
SELECT * FROM main""" -- no, string does not start with a keyword (SELECT...).
The only way I could think of doing this in a single regex would be with a lookbehind... but then it would not be fixed width, as we don't know where the string starts. Something like:
(?<=["']\s*(SELECT)\s*)(SELECT|FROM)
But this of course won't work.
Would something like this be possible to do in a single regex?
A suitable regular expression is likely to get pretty complex, especially as the rules evolve further. As others have noted, it may be worth considering using a parser instead. That said, here is one possible regex attempting to cover the rules mentioned so far:
(["'])\s*(SELECT)(?:\s+.*)?\s+(FROM)(?:\s+.*)?\1(?:[^\w]|$)
Explanation
The regex looks for either a double or a single quote at the start (saved in capturing group #1) and then matches the same character at the end via the backreference \1. The SELECT and FROM keywords are captured in groups #2 and #3. (The (?:...) syntax keeps the optional parts from creating further groups, because ?: at the start of a group makes it non-capturing.) There are some further details, such as limiting what is allowed between SELECT and FROM and not counting the final quotation mark if it is immediately followed by a word character.
Results
SELECT * FROM tbl -- no match - not in a string
"SELECT * FROM tbl" -- matches - in a double-quoted string
'SELECT * FROM tbl;' -- matches - in a single-quoted string
'SELECT * FROM it's -- no match - letter after end quote
"SELECT * FROM tbl' -- no match - quotation marks don't match
'SELECT * FROM tbl" -- no match - quotation marks don't match
"select * from tbl" -- no match - keywords not upper case
'Select * From tbl' -- no match - still not all upper case
"SELECT col1 FROM" -- matches - even though no table name
' SELECT col1 FROM ' -- matches - as above with more whitespace
'SELECT col1, col2 FROM' -- matches - with multiple columns
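A few of the rows above can be checked with a short Python sketch. It assumes Python's re module behaves like the engine used in the demos for this particular pattern (backreference \1, no lookbehind involved); the names PATTERN, samples and results are illustrative.

```python
import re

# The pattern from the answer, kept in a raw triple-quoted string so that
# both kinds of quotation mark can appear in it unescaped.
PATTERN = re.compile(
    r'''(["'])\s*(SELECT)(?:\s+.*)?\s+(FROM)(?:\s+.*)?\1(?:[^\w]|$)''')

samples = [
    'SELECT * FROM tbl',        # not in a string
    '"SELECT * FROM tbl"',      # double-quoted string
    "'SELECT * FROM tbl;'",     # single-quoted string
    "'SELECT * FROM it's",      # letter after the end quote
    '"SELECT * FROM tbl\'',     # quotation marks don't match
    '"select * from tbl"',      # keywords not upper case
    '"SELECT col1 FROM"',       # no table name
]

results = {s: bool(PATTERN.search(s)) for s in samples}
for sample, matched in results.items():
    print(f"{sample!r:28} {'matches' if matched else 'no match'}")
```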
Possible Improvement?
It might also be necessary to exclude quotation marks from the "any character" parts. This can be done, at the expense of increased complexity, by replacing both instances of .* with (?:(?!\1).)*, which consumes one character at a time only while that character is not the opening quote:
(["'])\s*(SELECT)(?:\s+(?:(?!\1).)*)?\s+(FROM)(?:\s+(?:(?!\1).)*)?\1(?:[^\w]|$)
You could use capturing groups:
(.*["']\s*\K)(?(1)(SELECT|FROM).*(SELECT|FROM)|)
In this case $2 would refer to the first keyword and $3 would refer to the second keyword. This also only works if there are only two keywords and only one string on a line, which seems to be true in all of your examples, but if those restrictions don't work for you, let me know.
Just tested the regexp below:
If you need to add other commands, things may get a little tricky, because some keywords don't apply, e.g. ALTER TABLE mytable or UPDATE SET col = val;. For these scenarios you will need to create subgroups, and the regexp may become slow.
Best regards!
If I understand your requirements correctly, I suggest this:
/^'\s*(SELECT)[^']*(FROM)[^']*'|^"\s*(SELECT)[^"]*(FROM)[^"]*"/m
Explanation:
To anchor the match at the start of the string, use ^.
To accept zero or more whitespace characters, use \s*.
To handle multi-line input, so that ^ can also match at the start of each line, use the m flag.
To keep the match case-sensitive, don't use the i flag.
To stay inside a block delimited by a specific character such as ", use [^"]* instead of .*; it cannot run past the first closing delimiter.
When a block must start and end with the same character, either ' or ", use ' '|" " instead of ['"] ['"].
Update:
If you need to capture any special keyword after verifying existence of SELECT keyword after start of your string, I can update my solution to this:
/^'\s*(SELECT)([^']*(SELECT|FROM))+|^"\s*(SELECT)([^"]*(SELECT|FROM))+/m
Without parsing the quoted strings, this could be done using the \G and \K constructs:
(?:"\s*(?=(?:SELECT|FROM))|(?<!^)\G)[^"]*?\K(SELECT|FROM)

Powershell script to search, split and join in one line

Been racking my Friday brain on a regex problem with dealing with Sql Server object names.
An input to my Powershell script is a procedure name. The name can take many forms, such as
dbo.Procedure
[dbo].Procedure
dbo.[Procedure.Name]
etc
So far I'd come up with the following to split the value into its constituent parts:
[string[]] $procNameA = $procedure.Split("(?:\.)(?=(?:[^\]]|\[[^\]]*\])*$)")
In addition I have a regex that I could use to handle the square brackets
(?:\[)*([A-Za-z0-9. !]+)(?:\])*
And this is about as far as my limited regex experience will take me.
Now granted I could deal with a lot of this by treating each element in a ForEach and doing a RegEx replace there, but y'know that just seems so, I dunno, ungainly. So, question I have for any passing Powershell & RegEx guru: "How can I do all this in one line?"
What I'm looking for is a way to get the following results:
Original               Corrected
=====================  =====================
dbo.ProcName           [dbo].[ProcName]
dbo.[ProcName]         [dbo].[ProcName]
[dbo].ProcName         [dbo].[ProcName]
[dbo].[ProcName]       [dbo].[ProcName]
[My.Schema].[My.Proc]  [My.Schema].[My.Proc]
[My.Schema].ProcName   [MySchema].[ProcName]
dbo.[ABadBADName!      [dbo].[[ABadBADName!]
(Notice the last instance, where an object name starts but does not end with a square bracket. Not that I'm expecting that [and if I saw anyone on my team naming an object like that I'd be asking HR if I can fire them for it], but I do like to be thorough.)
Think that covers everything...
So, over to you Powershell & RegEx gurus - how do I do this?
Please limit any answers to FULLY answering the question with code I can actually use and not just syntax suggestions.
Clarification: I am acutely aware that sometimes 'slow and steady wins the race' may apply here and that support wise it would be potentially safer to handle the rest in a ForEach, but that's not the point. Part of this is to help me understand just how flexible RegEx can be, so this is more of an educational exercise rather than a philosophical one.
Okay how about this:
@'
dbo.ProcName
dbo.[ProcName]
[dbo].ProcName
[dbo].[ProcName]
[My.Schema].[My.Proc]
[My.Schema].ProcName
dbo.[ABadBADName!
'@ -split '\s*\r?\n\s*' | % {
    $_ -replace '^(?:\[(?<schema>[^\]]+)\]|(?<schema>[^\.]+))\.(?:\[(?<proc>[^\]]+)\]|(?<proc>[^\.]+))$', '[${schema}].[${proc}]'
}
Note that I'm only using ForEach-Object (%) here to iterate through your test cases; the actual replace is done with a single regex / replace.
Explanation
So the important part here is the regex:
^(?:\[(?<schema>[^\]]+)\]|(?<schema>[^\.]+))\.(?:\[(?<proc>[^\]]+)\]|(?<proc>[^\.]+))$
Breaking it down:
^ -- match the beginning of the string
(?: -- open a non-capturing group (for alternation purposes)
\[ -- match a literal left bracket [
(?<schema> -- start a named capture group, with the name schema
[^\]]+ -- match 1 or more of any character that is not a literal right square bracket ]
) -- end the schema capture group
\] -- match a literal right bracket ]
| -- alternation; if the previous expression didn't match, try what comes after this
(?<schema> -- again start a named capture group called schema; this is only tried if the other one didn't match.
[^\.]+ -- match 1 or more of any character that is not a literal dot .
) -- end the alternate schema capture group
) -- end the non-capturing group
\. -- match a literal dot . (this is the one separating schema and proc)
(the next part for proc is exactly the same steps as above, with a different name for the capturing group)
$ -- match the end of the string
In the replace, we just qualify the names of the groups with ${name} syntax instead of the numbers $1 (which would work too actually).
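For comparison, the same idea can be sketched in Python. This is an adaptation, not a translation: Python's re module does not allow a group name to be reused within one pattern, so the sketch falls back to numbered groups and picks whichever alternative matched. The function name normalize is a hypothetical helper.

```python
import re

# Groups 1/2: bracketed or bare schema. Groups 3/4: bracketed or bare proc.
NAME_PATTERN = re.compile(
    r'^(?:\[([^\]]+)\]|([^.]+))\.(?:\[([^\]]+)\]|([^.]+))$')

def normalize(name):
    """Hypothetical helper: wrap schema and proc in square brackets."""
    m = NAME_PATTERN.match(name)
    if m is None:
        return name  # leave anything that is not schema.proc untouched
    schema = m.group(1) or m.group(2)
    proc = m.group(3) or m.group(4)
    return f"[{schema}].[{proc}]"

for raw in ["dbo.ProcName", "dbo.[ProcName]", "[dbo].ProcName",
            "[dbo].[ProcName]", "[My.Schema].[My.Proc]",
            "[My.Schema].ProcName", "dbo.[ABadBADName!"]:
    print(f"{raw:22} {normalize(raw)}")
```

Note that with this pattern a bracketed schema keeps its dot, so [My.Schema].ProcName becomes [My.Schema].[ProcName].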

regex: substitute character in captured group

EDIT
In a regex, can a matching capturing group be replaced with the same match altered substituting a character with another?
ORIGINAL QUESTION
I'm converting a list of products into a CSV text file. Every line in the list has: number name[ description] price in this format:
1 PRODUCT description:120
2 PRODUCT NAME TWO second description, maybe:80
3 THIRD PROD:18
The resulting format must also include a slug (with - instead of spaces) as the second field:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product-name-two-2:second description, maybe:80
3 THIRD PROD:third-prod-3::18
The regex I'm using is this:
(\d+) ([A-Z ]+?)[ ]?([a-z ,]*):([\d]+)
and substitution string is:
\1 \2:\L$2-\1:\3:\4
This way my result is:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product name two-2:second description, maybe:80
3 THIRD PROD:third prod-3::18
What I'm missing is the hyphen separator I need in the second field, that is, group \2 with '-' instead of ' '.
Is it possible with a single regex, or should I go for a second pass?
(for now i'm using Sublime text editor)
Thanx.
I don't think doing this in a single pass is reasonable, and it may not even be possible. To replace the spaces with hyphens you need either multiple passes or continuous matching, and both lose the context of the capturing groups you need to rearrange your structure. So after your first replace, I would search for (?m)(?:^[^:\n]*:|\G(?!^))[^: \n]*\K followed by a single space, and replace it with -. I'm not sure whether Sublime uses the multiline modifier by default; if it does, you can drop the (?m).
The answer might be different if you were using a programming language that supports callback functions for regex replace operations; there you could do the space-to-hyphen replacement inside the callback.
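To illustrate the callback route, here is a minimal Python sketch that reuses the question's regex and builds the slug inside the replacement function. The function name add_slug and the variable names are assumptions for illustration; re.sub accepting a callable is standard behaviour of Python's re module.

```python
import re

# The regex from the question: number, upper-case name, optional description, price.
LINE_PATTERN = re.compile(r'(\d+) ([A-Z ]+?)[ ]?([a-z ,]*):([\d]+)')

def add_slug(match):
    """Build 'number name:slug:description:price' for one product line."""
    number, name, description, price = match.groups()
    # Lower-case the name and turn its spaces into hyphens, then append the number.
    slug = name.lower().replace(' ', '-') + '-' + number
    return f"{number} {name}:{slug}:{description}:{price}"

products = """1 PRODUCT description:120
2 PRODUCT NAME TWO second description, maybe:80
3 THIRD PROD:18"""

converted = LINE_PATTERN.sub(add_slug, products)
print(converted)
```

Because the callback sees all capturing groups at once, the rearranging and the space-to-hyphen step happen in the same pass.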

Regex to find complete words at Postgresql

I only want the records whose column matches one of several words. I tried WHERE ... IN (...), but Postgres is case sensitive in that clause.
This is why I tried a regex with the ~* operator.
The following SQL snippet returns all the columns and tables from the DB; I want to restrict the rows to only the tables named in the regex expression.
SELECT ordinal_position as COLUMN_ID, TABLE_NAME, COLUMN_NAME
FROM information_schema.columns
WHERE table_schema = 'public' and table_name ~* 'PRODUCTS|BALANCES|BALANCESBARCODEFORMATS|BALANCESEXPORTCATEGORIES|BALANCESEXPORTCATEGORIESSUB'
order by TABLE_NAME, COLUMN_ID
This regex will bring all the columns of BALANCES and the columns of the tables that contain the 'BALANCES' keyword.
I want to restrict the result to complete names only.
Using regexes, the common solution is using word boundaries before and after the current expression.
See effect without: http://regexr.com?35ecl
See effect with word boundaries: http://regexr.com?35eci
In PostgreSQL, word boundaries are denoted by \y (other popular regex engines, such as PCRE, C# and Java, use \b instead, hence its use in the regex demos above; thanks @IgorRomanchenko).
Thus, for your case, the expression below could be used (the matches are the same as the example regexes in the links above):
'\y(PRODUCTS|BALANCES|BALANCESBARCODEFORMATS|BALANCESEXPORTCATEGORIES|BALANCESEXPORTCATEGORIESSUB)\y'
See demo of this expression in use here:
http://sqlfiddle.com/#!12/9f597/1
If you want to match only whole table_name use something like
'^(PRODUCTS|BALANCES|BALANCESBARCODEFORMATS|BALANCESEXPORTCATEGORIES|BALANCESEXPORTCATEGORIESSUB)$'
^ matches at the beginning of the string.
$ matches at the end of the string.
Alternatively you can use something like:
upper(table_name) IN ('PRODUCTS','BALANCES','BALANCESBARCODEFORMATS','BALANCESEXPORTCATEGORIES', ...)
to make IN case insensitive.
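The difference between the three approaches (plain alternation, word boundaries, full anchoring) can be seen outside the database with a small Python sketch. Python's re uses \b where PostgreSQL uses \y; the table names below, including the one containing a space, are made up purely to show where the variants diverge.

```python
import re

names = "PRODUCTS|BALANCES"
# Hypothetical table names; "old balances" only exists to separate \b from ^...$.
tables = ["products", "balances", "balancesbarcodeformats", "old balances"]

# Plain alternation: matches anywhere inside the name.
substring = [t for t in tables if re.search(names, t, re.IGNORECASE)]

# Word boundaries: the keyword must be a whole word within the name.
whole_word = [t for t in tables
              if re.search(rf"\b(?:{names})\b", t, re.IGNORECASE)]

# Anchors: the keyword must be the entire name.
whole_name = [t for t in tables
              if re.search(rf"^(?:{names})$", t, re.IGNORECASE)]

print(substring)
print(whole_word)
print(whole_name)
```

Only the anchored variant restricts the result to names that are exactly one of the listed words.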