Powershell script to search, split and join in one line - regex

Been racking my Friday brain on a regex problem dealing with SQL Server object names.
An input to my Powershell script is a procedure name. The name can take many forms, such as
dbo.Procedure
[dbo].Procedure
dbo.[Procedure.Name]
etc
So far I'd come up with the following to split the value into its constituent parts:
[string[]] $procNameA = $procedure.Split("(?:\.)(?=(?:[^\]]|\[[^\]]*\])*$)")
In addition I have a regex that I could use to handle the square brackets
(?:\[)*([A-Za-z0-9. !]+)(?:\])*
And this is about as far as my limited regex experience will take me.
Now granted I could deal with a lot of this by treating each element in a ForEach and doing a RegEx replace there, but y'know that just seems so, I dunno, ungainly. So, question I have for any passing Powershell & RegEx guru: "How can I do all this in one line?"
What I'm looking for is a way to get the following results:
Original              Corrected
===================== =====================
dbo.ProcName          [dbo].[ProcName]
dbo.[ProcName]        [dbo].[ProcName]
[dbo].ProcName        [dbo].[ProcName]
[dbo].[ProcName]      [dbo].[ProcName]
[My.Schema].[My.Proc] [My.Schema].[My.Proc]
[My.Schema].ProcName  [MySchema].[ProcName]
dbo.[ABadBADName!     [dbo].[[ABadBADName!]
(Notice the last instance, where an object name starts but does not end with a square bracket. Not that I'm expecting that [and if I saw anyone on my team naming an object like that I'd be asking HR if I can fire them for it], but I do like to be thorough.)
Think that covers everything...
So, over to you Powershell & RegEx gurus - how do I do this?
Please limit any answers to FULLY answering the question with code I can actually use and not just syntax suggestions.
Clarification: I am acutely aware that sometimes 'slow and steady wins the race' may apply here and that support wise it would be potentially safer to handle the rest in a ForEach, but that's not the point. Part of this is to help me understand just how flexible RegEx can be, so this is more of an educational exercise rather than a philosophical one.

Okay how about this:
@'
dbo.ProcName
dbo.[ProcName]
[dbo].ProcName
[dbo].[ProcName]
[My.Schema].[My.Proc]
[My.Schema].ProcName
dbo.[ABadBADName!
'@ -split '\s*\r?\n\s*' | % {
    $_ -replace '^(?:\[(?<schema>[^\]]+)\]|(?<schema>[^\.]+))\.(?:\[(?<proc>[^\]]+)\]|(?<proc>[^\.]+))$', '[${schema}].[${proc}]'
}
Note that I'm only using ForEach-Object (%) here to iterate through your test cases; the actual replace is done with a single regex / replace.
Explanation
So the important part here is the regex:
^(?:\[(?<schema>[^\]]+)\]|(?<schema>[^\.]+))\.(?:\[(?<proc>[^\]]+)\]|(?<proc>[^\.]+))$
Breaking it down:
^ -- match the beginning of the string
(?: -- open a non-capturing group (for alternation purposes)
\[ -- match a literal left bracket [
(?<schema> -- start a named capture group, with the name schema
[^\]]+ -- match 1 or more of any character that is not a literal right square bracket ]
) -- end the schema capture group
| -- alternation; if the previous expression didn't match, try what comes after this
(?<schema> -- again start a named capture group called schema; this is only tried if the other one didn't match.
[^\.]+ -- match 1 or more of any character that is not a literal dot .
) -- end the alternate schema capture group
) -- end the non-capturing group
\. -- match a literal dot . (this is the one separating schema and proc)
(the next part for proc is exactly the same steps as above, with a different name for the capturing group)
$ -- match the end of the string
In the replace, we just qualify the names of the groups with ${name} syntax instead of the numbers $1 (which would work too actually).
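For comparison, here is a minimal sketch of the same substitution in Python. It is not part of the answer above: Python's re module does not allow two groups to share a name, so this version assumes four numbered groups and a replacement function instead. The name normalize_name is made up for the example. Note that a dot inside a bracketed schema (as in [My.Schema].ProcName) is preserved.

```python
import re

# Same pattern as above, but with numbered groups: Python's re module
# rejects a pattern that defines the same group name twice.
PATTERN = re.compile(
    r'^(?:\[([^\]]+)\]|([^.]+))'    # schema: bracketed (group 1) or bare (group 2)
    r'\.'                           # the dot separating schema and proc
    r'(?:\[([^\]]+)\]|([^.]+))$'    # proc: bracketed (group 3) or bare (group 4)
)


def normalize_name(name):
    """Return name as [schema].[proc]; input that does not match is returned unchanged."""
    def repl(m):
        schema = m.group(1) or m.group(2)
        proc = m.group(3) or m.group(4)
        return f'[{schema}].[{proc}]'
    return PATTERN.sub(repl, name)


for raw in ['dbo.ProcName', '[dbo].ProcName', '[My.Schema].[My.Proc]', 'dbo.[ABadBADName!']:
    print(raw, '->', normalize_name(raw))
```

The replacement function picks whichever alternative actually matched, which is what the shared group name does implicitly in .NET.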


RegEx Replace - Remove Non-Matched Values

Firstly, apologies; I'm fairly new to the world of RegEx.
Secondly (more of an FYI), I'm using an application that only has RegEx Replace functionality, therefore I'm potentially going to be limited on what can/can't be achieved.
The Challenge
I have a free text field (labelled Description) that primarily contains "useless" text. However, some records will contain either one or multiple IDs that are useful and I would like to extract said IDs.
Every ID will have the same three-letter prefix (APP) followed by a five digit numeric value (e.g. 12911).
For example, I have the following string in my Description Field;
APP00001Was APP00002TEST APP00003Blah blah APP00004 Apple APP11112OrANGE APP
THE JOURNEY
I've managed to very crudely put together an expression that is close to what I need (although, I actually need the reverse);
/!?APP\d{1,5}/g
Result;
THE STRUGGLE
However, on the Replace, I'm only able to retain the non-matched values;
Was TEST Blah blah Apple OrANGE APP
THE ENDGAME
I would like the output to be;
APP00001 APP00002 APP00003 APP00004 APP11112
Apologies once again if this is somewhat of a 'noddy' question; but any help would be much appreciated and all ideas welcome.
Many thanks in advance.
You could use an alternation (|): the first alternative captures the ID, starting at a word boundary, in group 1; the second matches 1+ word chars followed by optional whitespace chars.
Whatever is captured in group 1 can be used as the replacement. Text matched by the second alternative is dropped, because group 1 is empty for those matches.
Using !? matches an optional exclamation mark. You could prepend that to the pattern, but it is not part of the example data.
\b(APP\d{1,5})\w*|\w+\s*
See a regex demo
In the replacement, use capture group 1, usually written as $1 or \1.
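As a rough check of the idea, here is the same pattern run through Python's re.sub on the sample description from the question. This is only a sketch: the asker's application is unknown, so the \1 replacement syntax and the final strip() (which removes the one trailing space left over) are assumptions of this example.

```python
import re

description = "APP00001Was APP00002TEST APP00003Blah blah APP00004 Apple APP11112OrANGE APP"

# Group 1 holds the ID; anything matched by the second alternative has an
# empty group 1 and so is replaced by nothing (Python 3.5+ behaviour).
ids_only = re.sub(r'\b(APP\d{1,5})\w*|\w+\s*', r'\1', description).strip()
print(ids_only)
```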

How to highlight SQL keywords using a regular expression?

I would like to highlight SQL keywords that occur within a string in a syntax highlighter. Here are the rules I would like to have:
Match the keywords SELECT and FROM (others will be added, but we'll start here). Must be all-caps
Must be contained in a string -- either starting with ' or "
The first word in that string (ignoring whitespace preceding it) should be one of the keywords.
This of course is not comprehensive (can ignore escapes within a string), but I'd like to start here.
Here are a few examples:
SELECT * FROM main -- will not match (not in a string)
"SELECT name FROM main" -- will match
"
SELECT name FROM main" -- will match
"""Here is a SQL statement:
SELECT * FROM main""" -- no, string does not start with a keyword (SELECT...).
The only way I thought to do it in a single regex would be with a lookbehind...but then it would not be fixed width, as we don't know when the string starts. Something like:
(?<=["']\s*(SELECT)\s*)(SELECT|FROM)
But this of course won't work.
Would something like this be possible to do in a single regex?
A suitable regular expression is likely to get pretty complex, especially as the rules evolve further. As others have noted, it may be worth considering using a parser instead. That said, here is one possible regex attempting to cover the rules mentioned so far:
(["'])\s*(SELECT)(?:\s+.*)?\s+(FROM)(?:\s+.*)?\1(?:[^\w]|$)
Online Demos
Debuggex Demo
Regex101 Demo
Explanation
The regex looks for either a double or single quote at the start (saved in capturing group #1) and then matches this reference at the end via \1. The SELECT and FROM keywords are captured in capturing groups #2 and #3. (The (?:x|y) syntax ensures there aren't more groups for other choices, as ?: at the start of a group excludes it from capturing.) There are some further optional details, such as limiting what is allowed between the SELECT and FROM and not counting the final quotation mark if it is immediately followed by a word character.
Results
SELECT * FROM tbl -- no match - not in a string
"SELECT * FROM tbl" -- matches - in a double-quoted string
'SELECT * FROM tbl;' -- matches - in a single-quoted string
'SELECT * FROM it's -- no match - letter after end quote
"SELECT * FROM tbl' -- no match - quotation marks don't match
'SELECT * FROM tbl" -- no match - quotation marks don't match
"select * from tbl" -- no match - keywords not upper case
'Select * From tbl' -- no match - still not all upper case
"SELECT col1 FROM" -- matches - even though no table name
' SELECT col1 FROM ' -- matches - as above with more whitespace
'SELECT col1, col2 FROM' -- matches - with multiple columns
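A few of the results above can be reproduced with a short Python sketch. The regex is the first one from this answer, unchanged; the helper name keywords and the choice of sample lines are assumptions made for illustration.

```python
import re

# The first regex from the answer above, unchanged.
SQL_STRING = re.compile(r'''(["'])\s*(SELECT)(?:\s+.*)?\s+(FROM)(?:\s+.*)?\1(?:[^\w]|$)''')


def keywords(line):
    """Return the two captured keywords, or None when the line does not match."""
    m = SQL_STRING.search(line)
    return (m.group(2), m.group(3)) if m else None


samples = [
    'SELECT * FROM tbl',       # not in a string
    '"SELECT * FROM tbl"',     # double-quoted string
    "'SELECT * FROM it's",     # letter after the end quote
    '"SELECT * FROM tbl\'',    # quotation marks don't match
    '"select * from tbl"',     # keywords not upper case
    '"SELECT col1 FROM"',      # no table name
]
for s in samples:
    print(repr(s), '->', keywords(s))
```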
Possible Improvement?
It might also be necessary to exclude quotation marks from the "any character" parts. This can be done at the expense of increased complexity using the technique described here by replacing both instances of .* with (?:(?!\1).)*:
(["'])\s*(SELECT)(?:\s+(?:(?!\1).)*)?\s+(FROM)(?:\s+(?:(?!\1).)*)?\1(?:[^\w]|$)
See this Regex101 Demo.
You could use capturing groups:
(.*["']\s*\K)(?(1)(SELECT|FROM).*(SELECT|FROM)|)
In this case $2 would refer to the first keyword and $3 would refer to the second keyword. This also only works if there are only two keywords and only one string on a line, which seems to be true in all of your examples, but if those restrictions don't work for you, let me know.
Just tested the regexp below:
If you need to add other commands, things may get a little tricky, because some keywords don't apply. E.g. ALTER TABLE mytable or UPDATE SET col = val;. For these scenarios you will need to create subgroups, and the regexp may become slow.
Best regards!
If I understand your requirements correctly, I suggest this:
/^'\s*(SELECT)[^']*(FROM)[^']*'|^"\s*(SELECT)[^"]*(FROM)[^"]*"/m
[Regex Fiddle Demo]
Explanation:
When you need to check the start of a string, use ^.
When you need to accept 0-n spaces, use \s*.
When you need to accept new-line or multi-line strings, use the m flag on your regex.
When you need case-sensitive mode, don't use the i flag on your regex.
When you need to match a block delimited by a specific character like ", use [^"]* instead of .* so that the match stops at the first closing delimiter.
When you need a block whose start and end characters must be the same, like ' or ", use ' '|" " instead of ['"] ['"].
Update:
If you need to capture any special keyword after verifying existence of SELECT keyword after start of your string, I can update my solution to this:
/^'\s*(SELECT)([^']*(SELECT|FROM))+|^"\s*(SELECT)([^"]*(SELECT|FROM))+/m
Without parsing the quoted strings, this could be done using the \G and \K constructs:
(?:"\s*(?=(?:SELECT|FROM))|(?<!^)\G)[^"]*?\K(SELECT|FROM)
demo

Regular Expression "Matching" vs "Capturing"

I've been looking up regular expression tutorials trying to get the hang of them and was enjoying the tutorial in this link right up until this problem: http://regexone.com/lesson/12
I cannot seem to figure out what the difference between "matching" and "capturing" is. Nothing I write seems to select the text under the "Capture" section (not even .*).
Edit: Here is an example for the tutorial that confuses me: (.* (.*)) is considered correct and (.* .*) is not. Is this a problem with the tutorial or something I am not understanding?
Matching:
When the engine matches part or all of the string but returns nothing.
Capturing:
When the engine matches part or all of the string and returns something.
--
What's the meaning of returning?
When you need to check, store, validate or otherwise work with a part of the string that your regex matched, you need capturing groups (...)
In your example, the regex .*?\d+ just matches the dates and years.
The regex .*?(\d+) matches the whole and captures the year.
And (.*?(\d+)) will match the whole and capture the whole and the year respectively.
*Please notice the bottom right box titled Match groups
So returning....
1:
preg_match("/.*?\d+/", "Jan 1987", $match);
print_r($match);
Output:
Array
(
[0] => Jan 1987
)
2:
preg_match("/(.*?\d+)/", "Jan 1987", $match);
print_r($match);
Output:
Array
(
[0] => Jan 1987
[1] => Jan 1987
)
3:
preg_match("/(.*?(\d+))/", "Jan 1987", $match);
print_r($match);
Output:
Array
(
[0] => Jan 1987
[1] => Jan 1987
[2] => 1987
)
So as you can see in the last example, we have 2 capturing groups indexed at 1 and 2 in the array; index 0 is always the whole matched string, which is not itself a capture group.
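The same three cases can be sketched in Python, where group(0) plays the role of index 0 in the PHP array and groups() returns only the captures. Using Python's re module here is an assumption of this example; the answer itself uses PHP.

```python
import re

text = "Jan 1987"

no_groups = re.match(r'.*?\d+', text)
one_group = re.match(r'(.*?\d+)', text)
two_groups = re.match(r'(.*?(\d+))', text)

# group(0) is always the whole match; groups() lists only the captures.
print(no_groups.group(0), no_groups.groups())
print(one_group.group(0), one_group.groups())
print(two_groups.group(0), two_groups.groups())
```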
capturing in regexps means indicating that you're interested not only in matching (which is finding strings of characters that match your regular expression), but you're also interested in using specific parts of the matched string later on.
for example, the answer to the tutorial you linked to would be (\w{3}\s+(\d+)).
now, why ?
to simply match the date strings it would be enough to write \w{3}\s+\d+ (3 word characters, followed by one or more spaces, followed by one or more digits), but adding capture groups to the expression (a capture group is simply anything enclosed in parentheses ()) will allow me to later extract either the whole expression (using "$1", because the outer-most pair of parentheses is the 1st the parser encounters) or just the year (using "$2", because the pair of parentheses around the \d+ is the 2nd pair that the regexp parser encounters)
capture groups come in handy when you're interested not only in matching strings to pattern, but also extracting data from the matched strings or modifying them in any way. for example, suppose you wanted to add 5 years to each of those dates in the tutorial - being able to extract just the year part from a matched string (using $2) would come in handy then
In a nutshell, a "Capture" saves the collected value in a special place so you can access it later.
As some have pointed out, the captured stuff can be used 'later on' in the same pattern, so that
/(ab*c):\1/
will match ac:ac, or abc:abc, or abbc:abbc etc. The (ab*c) will match an a, any number of b, then a c. Whatever it DOES match is 'captured'. In many programming and scripting languages, the syntax like \1, \2 etc has the special meaning referring to the first, second, etc captures. Since the first one might be abbc, then the \1 bit has to match abbc only, thus the only possible full match would then be 'abbc:abbc'
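A small Python sketch of that backreference behaviour (the choice of re.fullmatch and of the three candidate strings is just for illustration):

```python
import re

# \1 must repeat exactly what group 1 captured.
pattern = re.compile(r'(ab*c):\1')

for candidate in ['ac:ac', 'abbc:abbc', 'abbc:abc']:
    print(candidate, bool(pattern.fullmatch(candidate)))
```

The last candidate fails because the first half captured abbc, so \1 cannot match abc.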
Perl and PHP both allow the \1 \2 syntax inside a pattern. Perl also exposes the captures as the variables $1, $2 etc; PHP accepts $1 only inside a preg_replace replacement string and otherwise returns captures through the $matches array of preg_match. Many languages have picked up the powerful RegEx engine from Perl so there's increasing use of this in the world.
The typical use of $1 in Perl is:
/(ab*c)(de*f)/
then later (eg next line of code)
$x = $1 . $2; # Perl string concatenation
So the capture is available until your next use of a regex. Depending on the programming language in use, those captured values may be smashed by the next pattern match, or they may be permanently available through special syntax or use of the language.
take a look at these 2 regex - from your example
# first
/(... (\d\d\d\d))/
#second
/... \d\d\d\d/
they both match "Jun 1965" and "May 2000"
(and incidentally many other things like "555 1234")
the second one just matches it - yes or no
so you could say
if ($x=~/... \d\d\d\d/){do something}
the first one captures so
/(... (\d\d\d\d))/
print $1,";;;",$2
would print "Jun 1965;;;1965"

RegEx: Word immediately before the last opened parenthesis

I have a little knowledge of RegEx, but at the moment this is far beyond my abilities.
I need help finding the text before the last open-parenthesis that doesn't have a matching close-parenthesis.
(It is for CallTip of a open source software in development.)
Below some examples:
--------------------------
Text                 I need
--------------------------
aaa(                 aaa
aaa(x)               ''
aaa(bbb(             bbb
aaa(y=bbb(           bbb
aaa(y=bbb()          aaa
aaa(y <- bbb()       aaa
aaa(bbb(x)           aaa
aaa(bbb(ccc(         ccc
aaa(bbb(x), ccc(     ccc
aaa(bbb(x), ccc()    aaa
aaa(bbb(x), ccc())   ''
--------------------------
Is it possible to write a RegEx (PCRE) for these situations?
The best I got was \([^\(]+$ but it is not good, and it is the opposite of what I need.
Can anyone help, please?
Take a look at this JavaScript function
var recreg = function(x) {
    // repeatedly strip innermost "word(...)" calls that are already closed
    var r = /[a-zA-Z]+\([^()]*\)/;
    while (x.match(r)) x = x.replace(r, '');
    return x;
}
After applying this you are left with all the unmatched parts which don't have closing parentheses, and we just need the last alphabetic word.
var lastpart = function(y) { return y.match(/([a-zA-Z]+)\([^(]*$/); }
The idea is to use it like
lastpart(recreg('aaa(y <- bbb()'))
Then check if the result is null, or else take the matching group, which will be result[1]. Most regex engines don't support the (?R) construct, which is needed for recursive regex matching.
Note that this is a sample JavaScript representation which simulated recursive regex.
Read http://www.catonmat.net/blog/recursive-regular-expressions/
This works correctly on all your sample strings:
\w+(?=\((?:[^()]*\([^()]*\))*[^()]*$)
The most interesting part is this:
(?:[^()]*\([^()]*\))*
It matches zero or more balanced pairs of parentheses along with the non-paren characters before and between them (like the y=bbb() and bbb(x), ccc() in your sample strings). When that part is done, the final [^()]*$ ensures that there are no more parens before the end of the string.
Be aware, though, that this regex is based on the assumption that there will never be more than one level of nesting. In other words, it assumes these are valid:
aaa()
aaa(bbb())
aaa(bbb(), ccc())
...but this isn't:
aaa(bbb(ccc()))
The string aaa(bbb(ccc( in your samples seems to imply that multi-level nesting is indeed permitted. If that's the case, you won't be able to solve your problem with regex alone. (Sure, some regex flavors support recursive patterns, but the syntax is hideous even by regex standards. I guarantee you won't be able to read your own regex a week after you write it.)
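To make the behaviour and the nesting limitation concrete, here is a hedged Python sketch that runs the lookahead regex from this answer over some of the sample strings. The helper name calltip_word and the deeply nested test string are assumptions of this example, not part of the question.

```python
import re

# The regex from this answer, unchanged.
CALLTIP = re.compile(r'\w+(?=\((?:[^()]*\([^()]*\))*[^()]*$)')


def calltip_word(text):
    """Return the word before the last unclosed "(", or '' if there is none."""
    m = CALLTIP.search(text)
    return m.group(0) if m else ''


for sample in ['aaa(', 'aaa(x)', 'aaa(bbb(', 'aaa(y <- bbb()', 'aaa(bbb(x), ccc()']:
    print(repr(sample), '->', repr(calltip_word(sample)))

# Closed calls nested more than one level deep defeat the pattern:
# xxx( is still open here, yet nothing is found.
print(repr(calltip_word('xxx(aaa(bbb(ccc()))')))
```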
A partial solution - this is assuming that your regex is called from within a programming language that can loop.
1) prune the input: find matching parentheses, and remove them with everything in between. Keep going until there is no match. The regex would look for \([^()]*\) - open parenthesis, any number of non-parenthesis characters, close parenthesis. It has to be part of a "find and replace with nothing" loop. This trims "from the inside out".
2) after the pruning you have either no parentheses left, or only unclosed ones. Now you have to find a word just before an open parenthesis. This requires a regex like \w+\(. But that won't work if there are multiple unclosed parentheses. Taking the last one could be done with a greedy match (with a group around the word): ^.*\b(\w+)\( "as many characters as you can up to a word before a parenthesis" - this will find the last one.
I am saying "partial" solution because, depending on the environment you are using, how you refer to "this matching group" and whether the parentheses need a backslash varies. That is hard to check on my iPhone.
I hope this inspires you or others to come up with a complete solution.
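The two steps above can be sketched in Python as follows. This is one possible reading of the description, not the answerer's own code: the function name unclosed_call is made up, and the exact patterns (\([^()]*\) for pruning, .*\b(\w+)\( for the final lookup) are assumptions.

```python
import re


def unclosed_call(text):
    """Return the word before the last unclosed "(", or '' if there is none."""
    # Step 1: repeatedly delete innermost balanced pairs "( ... )".
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r'\([^()]*\)', '', text)
    # Step 2: the greedy .* pushes the match to the last remaining "(".
    m = re.match(r'.*\b(\w+)\(', text)
    return m.group(1) if m else ''


for sample in ['aaa(', 'aaa(x)', 'aaa(bbb(ccc(', 'aaa(y <- bbb()', 'xxx(aaa(bbb(ccc()))']:
    print(repr(sample), '->', repr(unclosed_call(sample)))
```

Because the pruning runs in a loop, this version does not care how deeply the closed calls are nested.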
Not sure which regex language/platform you're using for this, and I don't know if subpatterns are allowed on your platform or not. However, the following 2-step PHP code will work for all the cases you listed above:
$str = 'aaa(bbb(x), ccc()'; // your original string
// find and replace all balanced parentheses with blank
$repl = preg_replace('/ ( \( (?: [^()]* | (?1) )* \) ) /x', '', $str);
$matched = '';
// find the word just before the last opening parenthesis in the replaced string
if (preg_match('/\w+(?=[^\w(]*\([^(]*$)/', $repl, $arr))
    $matched = $arr[0];
echo "*** Matched: [$matched]\n";
Live Demo: http://ideone.com/evXQYt

how to group in regex matching correctly?

Consider the following scenario:
input string = "WIPR.NS"
I have to replace this with "WIPR2.NS"
I am using the following logic.
match pattern = "(.*)\.NS$" \\ any string that ends with .NS
replace pattern = "$12.NS"
In the above case, since there is no group with index 12, I get the result $12.NS
But what I want is "WIPR2.NS".
If I don't have the digit 2 in the replacement, it works in all other cases; it only fails for 2.
How to resolve this case?
Thanks in advance,
Alok
This usually depends entirely on your regex engine (I'm not familiar with those that use $1 to represent a capture group, I'm more used to \1, but you'd have the same problem with that).
Some will provide a delimiter that you can use, like:
replace pattern = "${1}2.NS"
which clearly indicates that you want capture group 1 followed by the literal 2.NS.
In fact, by looking at this page, it appears that's exactly the way to do it (assuming .NET):
To replace with the first backreference immediately followed by the digit 9, use ${1}9. If you type $19, and there are less than 19 backreferences, the $19 will be interpreted as literal text, and appear in the result string as such.
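The same ambiguity exists in other engines. As an illustration only (the question does not say which environment is in use), Python's re module spells the delimited form \g<1>:

```python
import re

# \g<1> is Python's delimited form of the group reference; a bare \12 would be
# read as group 12, which this pattern does not define.
result = re.sub(r'(.*)\.NS$', r'\g<1>2.NS', 'WIPR.NS')
print(result)
```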
Also keep in mind that Jay provides an excellent answer for this specific use case that doesn't require capture groups at all (by just replacing .NS with 2.NS).
You may want to look into that as a possibility - I'll leave this answer here since:
it's the accepted answer; and
it's probably better for the more complex cases, like changing X([A-Z])4([A-Z]) with X${1}5${2}, where you have variable text on either side of the bit you wish to modify.
You don't need to do anything with what precedes the .NS, since only what is being matched is subject to replacement.
match pattern = "\.NS$" (any string that ends with .NS -- don't forget to escape the .)
replace pattern = "2.NS"
You can further refine this with lookaround zero-width assertions, but that depends on your regex engine, and you have not specified the environment/programming language in which you are working.
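A minimal sketch of both variants, assuming Python's re module purely for illustration since the environment is not specified: the plain replacement described above, and a zero-width lookahead that inserts the 2 without consuming .NS at all.

```python
import re

# Plain replacement: only the matched ".NS" is rewritten.
plain = re.sub(r'\.NS$', '2.NS', 'WIPR.NS')

# Zero-width lookahead: nothing is consumed, '2' is inserted just before ".NS".
lookahead = re.sub(r'(?=\.NS$)', '2', 'WIPR.NS')

print(plain, lookahead)
```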