Regex for string following a string and forward slash

I'm trying to extract the resource group name in Azure via az cli.
The full path to a certain resource group looks like this:
/subscriptions/b049-1234-1256-125456-125/resourceGroups/Test_ResourceGroup
I'm trying to only extract "Test_ResourceGroup" out of the full string (which is stored in a variable), so i think the code would be something like
$scope = '/subscriptions/b049-1234-1256-125456-125/resourceGroups/Test_ResourceGroup'
$resourcegroup = $scope -match 'regex'
But I'm terrible at regex. The additional challenge is that sometimes there are more strings or integers after the resource group name, e.g.
/subscriptions/b049-1234-1256-125456-125/resourceGroups/Test_ResourceGroup/specificnameofresource/blahblah
But again, I just want the resource group name.

while a pure regex solution is likely faster, this is simpler & easier to understand/modify. well, it is for me. [grin]
the neat thing is that the final .Split() simply returns the whole string when there is no '/' left to split on, so it works both when the name is at the end of the string and when more path segments follow it.
$TestList = @(
'/subscriptions/b049-1234-1256-125456-125/resourceGroups/Test_ResourceGroup/specificnameofresource/blahblah'
'/subscriptions/b049-1234-1256-125456-125/resourceGroups/ResourceGroupZiggity'
)
foreach ($TL_Item in $TestList)
{
($TL_Item -split 'resourcegroups/')[-1].Split('/')[0]
}
output ...
Test_ResourceGroup
ResourceGroupZiggity
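For readers who are not working in PowerShell, the same split-then-split idea can be sketched in Python. This is only an illustration: the function name resource_group_by_split is made up here, and the case-insensitive split is an assumption that mirrors the default behavior of PowerShell's -split operator.

```python
import re

def resource_group_by_split(scope: str) -> str:
    """Return the path segment that follows 'resourceGroups/' (hypothetical helper)."""
    # Split on the marker, ignoring case, and keep whatever follows the last marker.
    tail = re.split(r"resourcegroups/", scope, flags=re.IGNORECASE)[-1]
    # If no further '/' exists, split() returns a one-element list, so [0] is the whole tail.
    return tail.split("/")[0]

test_list = [
    "/subscriptions/b049-1234-1256-125456-125/resourceGroups/Test_ResourceGroup/specificnameofresource/blahblah",
    "/subscriptions/b049-1234-1256-125456-125/resourceGroups/ResourceGroupZiggity",
]
names = [resource_group_by_split(item) for item in test_list]
print(names)
```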

You can use the regex-based -replace operator:
$resourcegroup = $scope -replace '.+/resourceGroups/([^/]+).*', '$1'
.+/resourceGroups/ matches one or more (+) characters (.), followed by /resourceGroups/, i.e. everything up to and including /resourceGroups/.
[^/]+ matches one or more characters that are not (^) in the character set ([...]) comprising /, i.e. everything up to, but not including, the next /, if any.
(...) is a capturing subexpression (a so-called capture group), whose captured text can be referred to in the replacement operand (substitution text) as $1.
.* matches zero or more (*) remaining characters, i.e. whatever characters are left.
Since the regex matches the entire input string, replacing what it matched with '$1' in effect extracts just the token of interest.
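The same "match the whole string, keep only the capture group" technique can be tried outside PowerShell. The following is a minimal Python sketch, not the answer's own code: the names SCOPE_RE and resource_group are invented for illustration, and unlike PowerShell's -replace the match here is case-sensitive.

```python
import re

# Same regex as above: everything up to /resourceGroups/, one captured segment, then the rest.
SCOPE_RE = re.compile(r".+/resourceGroups/([^/]+).*")

def resource_group(scope: str) -> str:
    """Replace the whole matched string with capture group 1 (hypothetical helper)."""
    # If the pattern does not match, sub() returns the input unchanged.
    return SCOPE_RE.sub(r"\1", scope)

short_scope = "/subscriptions/b049-1234-1256-125456-125/resourceGroups/Test_ResourceGroup"
long_scope = short_scope + "/specificnameofresource/blahblah"
print(resource_group(short_scope))
print(resource_group(long_scope))
```

Both calls should give the same group name, because .* swallows any trailing path segments.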

Related

Regex to match(extract) string between dot(.)

I want to select some string combination (with dots(.)) from a very long string (sql). The full string could be a single line or multiple line with new line separator, and this combination could be in start (at first line) or a next line (new line) or at both place.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add some prefix or suffix words or spaces that must be present wherever this logic gets checked? In the above case, if the statement is an update, the regex should pick test.table, but if the statement is like select test.table, the regex should not pick up this combination, and the same applies for a suffix.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
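As a rough check of the lookbehind pattern, here is a small Python sketch run against the INS INTO example from the question. The variable names are made up for illustration, and the case-insensitive flag is passed as re.IGNORECASE rather than inline. Python only accepts fixed-width lookbehinds, which works here because INTO and FROM have the same length.

```python
import re

# Lookbehind: the match must be preceded by INTO or FROM plus one whitespace character.
TABLE_AFTER_KEYWORD = re.compile(r"(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+", re.IGNORECASE)

sql = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

# The only group is non-capturing, so findall() returns the full matches.
tables = TABLE_AFTER_KEYWORD.findall(sql)
print(tables)
```

Dotted names such as as_t.old are skipped because they are not directly preceded by one of the keywords.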
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that understands any word character \w that matches between 0 and unlimited times, followed by a dot, followed by another series of word character that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that this solution is written for your use case (SQL strings): if you have something like '.word' or 'word..word', it will still match, which I assume does not occur in your input.

Regex in PowerShell to get the city name from the Managedby property in Active Directory

Can anyone help me with this. I need to derive a City name from the "managedby" attribute in Active Directory which looks like this:
CN=Marley\, Bob,OU=Users,OU=PARIS,DC=Domain,DC=com
So I need to take everything out and be left with "PARIS"
I really don't know enough about Regex but assume its going to involve using -replace in some way. I have tried following some examples on the web but I just get lost. I can remove all special characters using:
'CN=Marley\, Bob,OU=Users,OU=PARIS,DC=Domain,DC=com' -replace '[\W]', ''
But I have no idea how to clean that up further.
Any help would be greatly appreciated
Actually you don't need regex for that. If the structure of the distinguished name is always the same you can use nested -splits ... like this:
(('CN=Marley\, Bob,OU=Users,OU=PARIS,DC=Domain,DC=com' -split '=')[3] -split ',')[0]
or this:
(('CN=Marley\, Bob,OU=Users,OU=PARIS,DC=Domain,DC=com' -split ',')[-3] -split '=')[1]
I'd recommend the second version because this way you avoid confusion you can have with commas in the CN part of the distinguished name. ;-)
If you like to do it with regex anyway you can use look-arounds to extract what's between the users OU and the domain like this:
'CN=Marley\, Bob,OU=Users,OU=PARIS,DC=Domain,DC=com' -match '(?<=Users,OU=).+(?=,DC=Domain)'
$Matches[0]
The following is a -replace-based solution that assumes that the city name follows the last ,OU= in the input string (though it wouldn't be hard to make the regex more specific).
It also supports city names with escaped , characters (\,), such as PARIS\, Texas.
$str = 'CN=Marley\, Bob,OU=Users,OU=PARIS\, Texas,DC=Domain,DC=com'
# -> 'PARIS, Texas'
$str -replace '.+,OU=(.+?),DC=.+', '$1' -replace '\\,', ','
.+,OU= greedily matches one or more (+) arbitrary characters (.) up to the last ,OU= substring in the input string.
(.+?) matches one or more subsequent characters non-greedily (+?), via a capture group (capturing subexpression, (...)).
,DC=.+ matches the next occurrence of substring ,DC= followed by whatever is left in the string (.+).
Note that this means that the regex matches the entire string, so that the value of the substitution expression, $1, is the only thing returned:
$1 refers to the value of the 1st capture group, which contains the city name.
The second -replace operation unescapes the \,, i.e. turns it into , - note how the literal \ to replace had to be escaped as \\ in the regex.
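To see the greedy/non-greedy interplay outside PowerShell, here is a hedged Python sketch of the same two-step approach. The function name city_from_dn is invented for this example, and the second step uses a plain string replacement instead of a second regex.

```python
import re

def city_from_dn(dn: str) -> str:
    """Extract the value of the last OU= component of a DN (hypothetical helper)."""
    # Greedy .+ runs up to the last ',OU='; lazy (.+?) stops at the first ',DC=' after it.
    city = re.sub(r".+,OU=(.+?),DC=.+", r"\1", dn)
    # Unescape '\,' into ',' with an ordinary string replacement.
    return city.replace("\\,", ",")

dn_escaped = r"CN=Marley\, Bob,OU=Users,OU=PARIS\, Texas,DC=Domain,DC=com"
dn_plain = r"CN=Marley\, Bob,OU=Users,OU=PARIS,DC=Domain,DC=com"
print(city_from_dn(dn_escaped))
print(city_from_dn(dn_plain))
```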

RegEx improvement recommendations

Given a string like
Some text and [A~Token] and more text and [not a token] and
[another~token]
I need to extract the "tokens" for later replacement. The tokens are defined as two identifiers separated by a ~ and enclosed in [ ]. What I have been doing is using $string -match "\[.*?~.*?\]", which works. And, as I understand it I am escaping both brackets, doing any character zero or more times and forced lazy, then the ~ and then the same any character sequence. So, my first improvement was to replace .*? with .+?, as I want 1 or more, not zero or more. Then I moved to $string -match "\[[A-Za-z0-9]+~[A-Za-z0-9]+\]", which limits the two identifiers to alpha numerics, which is a big improvement.
So, first question is:
Is this last solution the best approach, or is there further improvements to be made?
Also, currently I only get a single token returned, so I am looping through the string, replacing tokens as they are found, and looping till there are no tokens. But, my understanding is that RegEx is greedy by default, and so I would have expected this last version to return two tokens, and I could loop through the dictionary rather than using a While loop.
So, second question is:
What am I doing wrong that I am only getting one match back? Or am I misunderstanding how greedy matching works?
EDIT:
to clarify, I am using $matches, as shown here, and still only getting a count of 1.
if ($string -match "\[[A-Za-z0-9]+~[A-Za-z0-9]+\]") {
Write-Host "new2: $($matches.count)"
foreach ($key in $matches.keys) {
Write-Host "$($matches.$key)"
}
}
Also, I can't really use a direct replace at the point of identifying the token, because there are a TON of potential replacements. I take the token, strip the square brackets, then split on the ~ to arrive at prefix and suffix values, which then identify a specific replacement value, which I can do with a dedicated -replace.
And one last clarification, the number of tokens is variable. It could just be one, it could be three or four. So my solution has to be pretty flexible.
To list all tokens and use the values you can use code like this:
# Named $tokenMatches so it cannot be confused with the automatic $Matches variable
$tokenMatches = Select-String -Pattern '\[(\w+)~(\w+)\]' -InputObject $string -AllMatches | ForEach-Object { $_.Matches }
foreach ($value in $tokenMatches) {
    $fullToken = $value.Value
    $firstPart = $value.Groups[1].Value
    $secondPart = $value.Groups[2].Value
    Write-Output "full token found: '$fullToken' first part: '$firstPart' second part: '$secondPart'"
}
Note that the parts of the regex grouped with () are capture groups; they give you access to the individual parts of your token.
In this loop you can use firstPart and secondPart to look up the value you want to insert in place of fullToken.
As for \[.*?~.*?\] not working properly: it also matches the text [not a token] and [another~token] as a single token, because that regex allows the characters ] and [ inside the token parts. \[[^\]\[]*?~[^\]\[]*?\] (^ negates the character class, so it reads: all characters except ] and [) would also work, but it is not very readable with all the brackets. If \w is good enough, you should use it.
You can use \w to match a word character (letter, digit, underscore).
That results in the pattern \[\w+~\w+\].
Now you can create a regex object with that pattern:
$rgx = [Regex]::new($pattern)
and replace all occurrences of that pattern with the Replace method:
$rgx.Replace($inputstring, $replacement)
Maybe it's also worth noting that a regex object has a .Match method, which returns the first occurrence of the pattern, and a .Matches method, which returns all occurrences of the pattern.
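The "find all tokens, then look up a replacement per token" workflow described in the question can be sketched in Python as follows. This is only an illustration under assumptions: the replacements dictionary and its values are hypothetical, and a callback passed to sub() stands in for the asker's per-token -replace logic.

```python
import re

TOKEN_RE = re.compile(r"\[(\w+)~(\w+)\]")

text = "Some text and [A~Token] and more text and [not a token] and [another~token]"

# finditer() yields every match, not just the first one.
tokens = [(m.group(0), m.group(1), m.group(2)) for m in TOKEN_RE.finditer(text)]

# Hypothetical lookup table keyed by (prefix, suffix); the values are made up.
replacements = {("A", "Token"): "first value", ("another", "token"): "second value"}

def substitute(match: re.Match) -> str:
    # Unknown tokens are left untouched by returning the original text.
    return replacements.get((match.group(1), match.group(2)), match.group(0))

result = TOKEN_RE.sub(substitute, text)
print(tokens)
print(result)
```

Because the callback runs once per match, no outer while loop is needed, however many tokens the string contains.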
Taking your example line
$String = "Some text and [A~Token] and more text and [not a token] and [another~token]"
This RegEx with capture groups
$RegEx = [RegEx]"\[(\w+~\w+)\][^\[]+\[[^\]]+\][^\[]+\[(\w+~\w+)\]"
if ($string -match $RegEX){
"First token={0} Second token={1}" -f $matches[1],$matches[2]
}
returns:
First token=A~Token Second token=another~token
See the above RegEx explained on https://regex101.com/r/tp6b9e/1
The area between the two tokens is matched by alternating negated character classes ([^\[] and [^\]]) with the literal characters \[ and \].

Get first instance and Get last instance of string

I am trying to match the first instance of the value of Timestamp in one expression and the last instance of the value of Timestamp in another expression:
{'Latitude': 50.00001,'Longitude': 2.00002,'Timestamp': '00:10:00'},{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:20:00'},{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:25:00'},{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:37:00'}
Anyone know how to do that.
Take advantage of regexp's greediness: the * operator will take as many matches as it can find. So the approach here is to match the explicit pattern at the beginning and end of the regexp with a .* in the middle. The .* will slurp up as many characters as it can subject to the rest of the regexp also matching.
/(${pattern}).*(${pattern})/
Here, ${} represents interpolation. The syntax will vary with your language; in Ruby it would be #{}. I have chosen to capture the entire pattern; you can instead put the () capture around the timestamp value, but I find this easier to read and maintain. This regexp will match two instances of $pattern with as much stuff in between as it can fit, thus guaranteeing that you have the first and last.
If you want to be more strict, you could enforce the pattern in the middle as well, *'ing the full pattern rather than just .:
/${pattern},\s*(?:${pattern},\s*)*${pattern}/
Ask in the comments if you don't understand any piece of this regexp.
One pattern we can use is /\{[^}]+\'Timestamp\'[^}]+\}/. Note that this pattern assumes that Timestamp is the LAST key; if this is not always true you need to add a bit more to this pattern.
So the total pattern for the first example will be:
str =~ /(${pattern}).*(${pattern})/
Or, without interpolation:
str =~ /({[^}]+'Timestamp'[^}]+}).*({[^}]+'Timestamp'[^}]+})/
Then, $1 and $2 are the first and last hashes that match the Timestamp key. Again, this matches the entire pattern rather than only the timestamp value itself, but it should be straightforward from there to extract the actual timestamp value.
For the second, more strict example, and the reason I did not want to capture the timestamp value inside the pattern itself, we have:
str =~ /(${pattern}),\s*(?:${pattern},\s*)*(${pattern})/
Or, without interpolation:
str =~ /({[^}]+'Timestamp'[^}]+}), *(?:{[^}]+'Timestamp'[^}]+}, *)*({[^}]+'Timestamp'[^}]+})/
We still have the correct results in $1 and $2 because we explicitly chose NOT to put a capturing group inside the pattern.
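The greedy-middle idea can be tried quickly in Python. Note that this sketch deviates from the answer in one respect: it puts the capture group around the timestamp value itself rather than around the whole hash, and the names STAMP, first and last are made up for illustration.

```python
import re

data = (
    "{'Latitude': 50.00001,'Longitude': 2.00002,'Timestamp': '00:10:00'},"
    "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:20:00'},"
    "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:25:00'},"
    "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:37:00'}"
)

# One timestamp entry; the group captures only the value between the quotes.
STAMP = r"'Timestamp': '([^']*)'"

# The greedy .* in the middle pushes the second STAMP as far right as possible.
match = re.search(STAMP + r".*" + STAMP, data, flags=re.DOTALL)
first, last = match.group(1), match.group(2)
print(first, last)
```

The search starts at the leftmost possible position, so group 1 is the first timestamp, while the greedy .* forces group 2 onto the last one. With a single entry in the input the pattern would not match at all, so real code should check for None.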

Regex matching only a portion of string

I would like to match a portion of a URL in this order.
First, the domain name will remain static, so there is nothing to check there with regex.
$domain_name = "http://foo.com/";
What I would like to validate is what comes after the last /.
So, my AIM is to create something like.
$strings_only = "[\w+]";
$number_only = "[\d+]";
$numbers_and_strings = "[0-9][a-z][A-Z]";
Now, I would like to just use the above variables to check if a URL conforms to the patterns mentioned.
$example_url = "http://foo.com/some-title-with-id-1";
var_dump(preg_match({$domain_name}{$strings_only}, $example_url));
The above should return false, because the title is NOT $strings_only.
$example_url = "http://foo.com/foobartar";
var_dump(preg_match({$domain_name}{$strings_only}, $example_url));
The above should return true, because title is $string_only.
Update:
~^http://foo\.com/[a-z]+/?$~i
~^http://foo\.com/[0-9]+/?$~
~^http://foo\.com/[a-z0-9]+/?$~i
These would be your three expressions to match alphabetical URLs, numeric URLS, and alphanumeric. A couple notes, \w matches [a-zA-Z0-9_] so I don't think it is what you expected. The + inside of your character class ([]) does not have any special meaning, like you may expect. \w and \d are "shorthand character classes" and do not need to be within the [] syntax (however they can be, e.g. [\w.,]). Notice the i modifier, this makes the expressions case-insensitive so we do not need to use [a-zA-Z].
$strings_only = '~^http://foo\.com/[a-z]+/?$~i';
$url = 'http://foo.com/some-title-with-id-1';
var_dump(preg_match($strings_only, $url)); // int(0)
$url = 'http://foo.com/foobartar';
var_dump(preg_match($strings_only, $url)); // int(1)
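For comparison, the three anchored expressions can be exercised together in a short Python sketch. The PATTERNS dictionary, its keys, and the classify helper are invented for this illustration; the i modifier becomes re.IGNORECASE.

```python
import re

# The three expressions from above, keyed by made-up labels.
PATTERNS = {
    "letters": re.compile(r"^http://foo\.com/[a-z]+/?$", re.IGNORECASE),
    "digits": re.compile(r"^http://foo\.com/[0-9]+/?$"),
    "alphanumeric": re.compile(r"^http://foo\.com/[a-z0-9]+/?$", re.IGNORECASE),
}

def classify(url: str) -> list:
    """Return the labels of all patterns the URL satisfies (hypothetical helper)."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(url)]

print(classify("http://foo.com/foobartar"))
print(classify("http://foo.com/some-title-with-id-1"))
print(classify("http://foo.com/12345"))
```

The hyphenated title matches none of the three, because - is not in any of the character classes.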
Test/tweak all of my above expressions with Regex101.
. matches any character, but only once. Use .* for 0+ or .+ for 1+. However, these will be greedy and match your whole string and can potentially cause problems. You can make it lazy by adding ? to the end of them (meaning it will stop as soon as it sees the next character /). Or, you can specify anything but a / using a negative character class [^/].
My final regex of choice would be:
~^https://stolak\.ru/([^/]+)/?$~
Notice the ~ delimiters, so that you don't need to escape every /. Also, you need to escape the . with \ since it has a special meaning. I threw the [^/]+ URI parameter into a capture group and made the trailing slash optional by using /?. Finally, I anchored this to the beginning and the end of the strings (^ and $, respectively).
Your question was somewhat vague, so I tried to interpret what you wanted to match. If I was wrong, let me know and I can update it. However, I tried to explain it all so that you could learn and tweak it to your needs. Also, play with my Regex101 link -- it will make testing easier.
Implementation:
$pattern = '~^https://stolak\.ru/([^/]+)/?$~';
$url = 'https://stolak.ru/car-type-b1';
preg_match($pattern, $url, $matches);
var_dump($matches);
// array(2) {
// [0]=>
// string(29) "https://stolak.ru/car-type-b1"
// [1]=>
// string(11) "car-type-b1"
// }