Powershell Regex Query - regex

I'm having some issues pulling the desired values from my source string.
I have 2 possible string formats (that I'm differentiating based on a -match operation and if check):
{u'specialGroup': u'projectWriters', u'role': u'WRITER'
or
, {u'role': u'WRITER', u'userByEmail': u'john#domain.com'
What I desire to return from the regex:
[0]projectWriters
[1]WRITER
and
[0]WRITER
[1]john#domain.com
Basically I need to return all values between start string : u' and end string ' as array values [0] and [1] but cannot figure out the regex pattern.
Trying:
[regex]::match($stuff[1], ": u'([^']+)'").groups
Groups : {: u'WRITER', WRITER}
Success : True
Captures : {: u'WRITER'}
Index : 10
Length : 11
Value : : u'WRITER'
Success : True
Captures : {WRITER}
Index : 14
Length : 6
Value : WRITER
But no sign of john#domain.com value.

A pragmatic approach, assuming that all strings have the same field structure:
$strings = "{u'specialGroup': u'projectWriters', u'role': u'WRITER'}",
", {u'role': u'WRITER', u'userByEmail': u'john#domain.com'"
$strings | ForEach-Object { ($_ -split 'u''|''' -notmatch '[{}:,]')[1,3] }
yields:
projectWriters
WRITER
WRITER
john#domain.com
As for what you tried:
[regex]::match() only ever returns one match, so you need to base your solution on [regex]::matches() - plural! - which returns all matches, and then extract the capture-group values of interest.
$strings | ForEach-Object { [regex]::matches($_, ": u'([^']+)'").Groups[1,3].Value }
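The same single-match versus all-matches distinction exists in other regex engines. Here is a minimal Python sketch of it, reusing the two sample strings; the helper names first_value and all_values are made up for illustration:

```python
import re

strings = [
    "{u'specialGroup': u'projectWriters', u'role': u'WRITER'}",
    ", {u'role': u'WRITER', u'userByEmail': u'john#domain.com'",
]

# Same pattern as in the question: capture whatever sits between ": u'" and "'".
pattern = re.compile(r": u'([^']+)'")

def first_value(text):
    # search() stops at the first match, like [regex]::Match().
    m = pattern.search(text)
    return m.group(1) if m else None

def all_values(text):
    # findall() returns the capture group of every match, like [regex]::Matches().
    return pattern.findall(text)

for s in strings:
    print(first_value(s), all_values(s))
```

With one capture group in the pattern, findall() hands back just the captured values, so no index juggling is needed.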

Related

Regex that extract string of length that is encoded in string

I have the following string to parse:
X4IitemX6Nabc123
that is structured as follows:
X... marker for 'field identifier'
4... length of item (name), will change according to length of item name
I... identifier for item name, must not be extracted, fixed
item... value that should be extracted as "name"
X... marker for 'field identifier'
6... length of item number, will change according to length of item number
N... identifier for item number, must not be extracted, fixed
abc123... value that should be extracted as "num"
Only these two values will be contained in the string, the sequence is also always the same (name, number).
What I have so far is
\AX(?<namelen>\d+)I(?<name>.+)X(?<numlen>\d+)N(?<num>.+)$
But that does not take into account that the length of the name is contained in the string itself. Somehow the .+ in the name group should be replaced by .{4}. I tried {$1} and {${namelen}}, but neither yields the result I expect (on rubular.com or regex101.com).
Any ideas or further references?
What you ask for is only possible in languages that allow code insertions in the regex pattern.
Here is a Perl example:
#!/usr/bin/perl
use warnings;
use strict;
my $text = "X4IitemX6Nabc123";
if ($text =~ m/^X(?<namelen>[0-9]+)I(?<name>(??{".{".$^N."}"}))X(?<numlen>[0-9]+)N(?<num>.+)$/) {
    print $text . ": PASS!\n";
} else {
    print $text . ": FAIL!\n";
}
# -> X4IitemX6Nabc123: PASS!
In other languages, use a two-step approach:
Extract the number after X,
Build a regex dynamically using the result of the first step.
See a JavaScript example:
const text = "X4IitemX6Nabc123";
const rx1 = /^X(\d+)/;
const m1 = rx1.exec(text);
if (m1) {
    // Step 2: build the second regex with the length found in step 1.
    const rx2 = new RegExp(`^X(?<namelen>\\d+)I(?<name>.{${m1[1]}})X(?<numlen>\\d+)N(?<num>.+)$`);
    if (rx2.test(text)) {
        console.log(text, '-> MATCH!');
    } else {
        console.log(text, '-> FAIL!');
    }
} else {
    console.log(text, '-> FAIL!');
}
See the Python demo:
import re
text = "X4IitemX6Nabc123"
rx1 = r'^X(\d+)'
m1 = re.search(rx1, text)
if m1:
    # Step 2: build the second regex with the length found in step 1.
    rx2 = fr'^X(?P<namelen>\d+)I(?P<name>.{{{m1.group(1)}}})X(?P<numlen>\d+)N(?P<num>.+)$'
    if re.search(rx2, text):
        print(text, '-> MATCH!')
    else:
        print(text, '-> FAIL!')
else:
    print(text, '-> FAIL!')
# => X4IitemX6Nabc123 -> MATCH!
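The demos above only test whether the string matches. If the goal is to actually pull out both values, one alternative is to walk the string field by field, using each length prefix to slice the value. This is a rough sketch, not the method from the answer: parse_fields is a hypothetical helper, and it assumes every field looks like X, a decimal length, one uppercase identifier letter, then the value.

```python
import re

# Header of one field: 'X', the decimal length, and a one-letter identifier.
HEADER = re.compile(r'X(\d+)([A-Z])')

def parse_fields(text):
    """Return {identifier: value} for consecutive X<len><id><value> fields."""
    fields = {}
    pos = 0
    while pos < len(text):
        m = HEADER.match(text, pos)
        if not m:
            raise ValueError(f"malformed field header at offset {pos}")
        length = int(m.group(1))
        start = m.end()
        value = text[start:start + length]
        if len(value) != length:
            raise ValueError(f"field at offset {pos} is shorter than declared")
        fields[m.group(2)] = value
        pos = start + length
    return fields

result = parse_fields("X4IitemX6Nabc123")
print(result["I"], result["N"])
```

Because the value is taken by slicing, it may itself contain X or digits without confusing the parser.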

Powershell regex group regex matches but doesn't have my group. What's missing?

I have some code that I am porting from a jenkins script and I need it as a shell command. So I know the regex works - What's blowing my mind is how it can match but then not have my capture group. What I need is just the root level directory names as such:
foo
baz
How can it "match" but then not have my group? BTW: If there is a simpler way to achieve this, I am all ears.
PS E:\SysData\Jenkins\workspace\chb0_chb0mb_example> git diff --name-only origin/master feature/foo | %{ Resolve-Path -Relative $_ } | sls '.\\.*\\.*' | sls '\\.\\(.+?)\\.*|.*' | %{$_.matches}
Groups : {0}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 48
Value : .\foo\Nuget\deleteme.txt
Groups : {0}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 55
Value : .\baz\QC_OH_DARKESol\deleteme.txt
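Assuming I have the question right: for one thing, a literal period in a regex has to be escaped with a backslash (\.). The pattern works without the backslash anyway, because an unescaped . matches any character, including a period. The input paths don't start with a backslash, so the leading \\ in your second pattern can't match there. Not everyone has the git command, so I'm using a sample path instead. This pattern could be shorter, but it works. I'm expanding the groups property, which you didn't show.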
'.\foo\Nuget\deleteme.txt' | sls '.\\(.+?)\\.*|.*' | % matches | % groups
Groups : {0, 1}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 6
Value : .\foo\
Success : True
Name : 1
Captures : {1}
Index : 2
Length : 3
Value : foo
Piping the object from the first sls to the second sls messes something up with the group capture. It seems like a bug. Submitted as an issue: "piping select-string to itself and the strange effect on matches". The value property isn't even right here:
'abc' | select-string a | select-string '(b)' | % matches | % groups
Groups : {0}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 1
Value : a # should be b
Compare with sending a plain string to the second select-string, which gives the right output:
'abc' | select-string a | % line | select-string '(b)' | % matches | % groups
Groups : {0, 1}
Success : True
Name : 0
Captures : {0}
Index : 1
Length : 1
Value : b
Success : True
Name : 1
Captures : {1}
Index : 1
Length : 1
Value : b
js2010's helpful answer points out a potential problem with your approach (.\\ should be \.\\), succinctly demonstrates the unexplained behavior you've experienced (for which they've created a GitHub issue), and suggests a workaround (inserting | % Line).
To solve your problem more directly:
# Inputs are sample paths.
'.\foo\Nuget\deleteme.txt',
'.\bar\QC_OH_DARKESol\deleteme.txt' |
foreach { if ($_ -match '^\.\\([^\\]+)') { $Matches[1] } }
The above yields the following strings:
foo
bar
That is, it extracts the first path component following literal .\ from the input paths. foreach (ForEach-Object) applies -match, the regular-expression matching operator, to each input string. The matching results are reflected in the automatic $Matches variable, a hashtable whose 0 entry is the overall match, with entry 1 containing the 1st capture group's value, 2 the 2nd's, and so on. Named capture groups (e.g., (?<root>...)), if present, have entries by their name (e.g., root).
An alternative is to use the switch statement with the -Regex option:
switch -Regex (
    git diff --name-only origin/master feature/foo | Resolve-Path -Relative
) {
    '^\.\\([^\\]+)' { $Matches[1] }
}
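For readers who want to try the pattern outside PowerShell, here is a small Python sketch of the same extraction. It uses the sample paths from the answer rather than real git output, and the variable names are illustrative only:

```python
import re

# Sample relative paths, as Resolve-Path -Relative would print them on Windows.
paths = [
    r'.\foo\Nuget\deleteme.txt',
    r'.\bar\QC_OH_DARKESol\deleteme.txt',
]

# Literal ".\" at the start, then capture everything up to the next backslash.
root_rx = re.compile(r'^\.\\([^\\]+)')

roots = []
for p in paths:
    m = root_rx.match(p)
    if m:
        roots.append(m.group(1))

print(roots)
```

Note that, like the PowerShell version, this pattern would also capture a bare file name such as .\file.txt; appending \\ after the group would restrict it to paths that really have a subdirectory.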

Get the word before & after '_-_' with REGEX PowerShell

I am trying to get the Word before and decimal string following a non guaranteed string that looks like ' - '.
Consider this string
"some str (targetWord - 12434 trailing string)"
this string is not guaranteed to have spaces before or after the '-'
so it could look like one of the following
"some str (targetWord-12434 trailing string)"
"some str (targetWord- 12434 trailing string)"
"some str (targetWord -12434 trailing string)"
"some str (targetWord- 12434 trailing string)"
So far I have the following
$allServices = (Get-Service "Known Service Prefix*").DisplayName
foreach ($service in $allServices){
    $service = $service.split('\((.*?)\)')[1] #esc( 'Match any non greedy' esc)
    if($service.split()[0] -Match '-'){
        $arr_services += $service.split('( - )')[0..1]
    }else{
        $arr_services += ($service -replace '-','').split()[0..1]
    }
}
This works to handle the simple case of ' - ' & '-', but can't handle anything else. I feel like this is the kind of problem that could be handled by one line of REGEX or at most two.
What I want to end up with is an array of strings, where the evens (including zero) are the targetWord, and the odd values are the decimal strings.
My issue isn't that I can't make this happen, it's that it looks like crap...
what I mean is my goal is to try and use REGEX to get each word, ignore the '-', and push out to a growing array the targetWord & decimalString.
I see this as more of a puzzle than anything and am trying to use this to improve my REGEX skills. Any help is appreciated!
A single regex passed to the -match operator should suffice:
$arr_services = $allServices | ForEach-Object {
    if ($_ -match '\((?<word>\w+) *- *(?<number>\d+)') {
        # Output the word and number consecutively.
        $Matches.word, $Matches.number
    }
}
# Output the resulting array.
$arr_services
Note how the pipeline output can be directly collected in a variable as an array ($arr_services = ...) - no need to iteratively "add" to an array. If you need to ensure that $arr_services is always an array - even if the pipeline outputs only one object, use [array] $arr_services = ...
With your sample strings, the above yields (a flat array of consecutive word-number pairs):
targetWord
12434
targetWord
12434
targetWord
12434
targetWord
12434
As for the regex:
\( matches a literal (
\w+ matches a nonempty run (+) of word characters (\w - letters, digits, _), captured in named capture group word ((?<word>...)).
 *- * matches a literal - surrounded by any number of spaces - including none (*).
\d+ matches a nonempty run of digits (\d), captured in named group number ((?<number>...)).
if the -match operator finds a match, the results are reflected in the automatic $Matches variable, a hashtable that enables accessing named capture groups directly by name.
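The breakdown above carries over to other engines almost unchanged. As a quick way to experiment with it, here is a Python sketch of the same pattern; Python spells named groups (?P<name>...), and the sample list is just the strings from the question:

```python
import re

samples = [
    'some str (targetWord - 12434 trailing string)',
    'some str (targetWord-12434 trailing string)',
    'some str (targetWord- 12434 trailing string)',
    'some str (targetWord -12434 trailing string)',
]

# "(" + word, optional spaces, "-", optional spaces, then the digits.
rx = re.compile(r'\((?P<word>\w+) *- *(?P<number>\d+)')

flat = []
for s in samples:
    m = rx.search(s)
    if m:
        # Emit word and number consecutively, giving a flat list of pairs.
        flat.extend([m.group('word'), m.group('number')])

print(flat)
```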
here's one way to handle the data set you posted. it presumes all the strings will have the same general format that you posted. that means it WILL FAIL if your sample data set is not realistic. [grin]
$InStuff = @(
    'some str (targetWord - 12434 trailing string)'
    'some str (targetWord-12434 trailing string)'
    'some str (targetWord- 12434 trailing string)'
    'some str (targetWord -12434 trailing string)'
    'some str (targetWord- 12434 trailing string)'
    )
$Results = foreach ($IS_Item in $InStuff)
    {
    # discard the True/False result; the captures land in $Matches
    $Null = $IS_Item -match '.+\((?<Word>.+) *- *(?<Number>\d{1,}) .+\)'
    [PSCustomObject]@{
        Word = $Matches.Word.Trim()
        Number = $Matches.Number
        }
    }
$Results
output ...
Word Number
---- ------
targetWord 12434
targetWord 12434
targetWord 12434
targetWord 12434
targetWord 12434

Powershell - Regex date range replace

I have an input file which contains some start dates and if those dates are before a specific date 1995-01-01 (YYYY-MM-DD format) then replace the date with the minimum value e.g.
<StartDate>1970-12-23</StartDate>
would be changed to
<StartDate>1995-01-01</StartDate>
<StartDate>1996-05-12</StartDate> is ok and would remain unchanged.
I was hoping to use regex replace but checking for the date range isn't working as expected. I was hoping to use something like this for the range check
\b(?:1900-01-(?:3[01]|2[1-31])|1995/01/01)\b
You can use a simple regex like '<StartDate>(\d{4}-\d{2}-\d{2})</StartDate>' to match <StartDate>, 4 digits, -, 2 digits, -, 2 digits, and </StartDate>, and then use a callback to parse the date captured in group 1 and compare it with the minimum date, using Martin's code. If the date is before the minimum, use the minimum date; otherwise, keep the captured one.
$callback = {
    param($match)
    $current = [DateTime]$match.Groups[1].Value
    $minimum = [DateTime]'1995-01-01'
    if ($minimum -gt $current)
    {
        '<StartDate>1995-01-01</StartDate>'
    }
    else {
        '<StartDate>' + $match.Groups[1].Value + '</StartDate>'
    }
}
$text = '<StartDate>1970-12-23</StartDate>'
$rex = [regex]'<StartDate>(\d{4}-\d{2}-\d{2})</StartDate>'
$rex.Replace($text, $callback)
To use it with Get-Content and Foreach-Object, you may define the $callback as above and use
$rex = [regex]'<StartDate>(\d{4}-\d{2}-\d{2})</StartDate>'
(Get-Content $path\$xml_in) | ForEach-Object {$rex.Replace($_, $callback)} | Set-Content $path\$outfile
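The replace-with-callback idea is not specific to .NET. As an illustration only, here is roughly how it might look in Python, where re.sub accepts a function as the replacement; clamp and clamp_start_dates are names invented for this sketch, and it assumes every captured date is a valid calendar date (date.fromisoformat raises ValueError otherwise):

```python
import re
from datetime import date

MIN_DATE = date(1995, 1, 1)
START_DATE_RX = re.compile(r'<StartDate>(\d{4}-\d{2}-\d{2})</StartDate>')

def clamp(match):
    # Parse the captured date and keep whichever is later: it or the minimum.
    current = date.fromisoformat(match.group(1))
    chosen = max(current, MIN_DATE)
    return f'<StartDate>{chosen.isoformat()}</StartDate>'

def clamp_start_dates(text):
    # re.sub calls clamp() once per match and inserts its return value.
    return START_DATE_RX.sub(clamp, text)

print(clamp_start_dates('<StartDate>1970-12-23</StartDate>'))
print(clamp_start_dates('<StartDate>1996-05-12</StartDate>'))
```

Comparing parsed dates rather than trying to express the range in the regex itself keeps the pattern simple and avoids the digit-range gymnastics from the question.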
You don't have to use regex here. Just cast the dates to DateTime and compare them:
$currentDate = [DateTime]'1970-12-23'
$minDate = [DateTime]'1995-01-01'
if ($minDate -gt $currentDate)
{
    $currentDate = $minDate
}

PowerShell regular expression on logfile is capturing too much

I am trying to extract some text from a logfile, and I'm having problems.
Example text I am working on is:
ahksjhadjsadhsah
sakdsjakdjks
ksajdksaj
REF=35464
sadsad
213213
213
2
13
I need to extract the value "35464" (the REF number). I have limited knowledge of regular expressions, but thought 'REF=([0-9]+)' would do this.
Now I'm not sure how best to read this file, so I've tried a couple of ways:
select-string -path e:\powershell\log.txt -pattern 'REF=([0-9]+)' | % { $_.Matches } | % { $_.Value }
Which gives me "REF=35464" - which I don't understand (why REF= is included), because I thought the 'capture' was only the parts in ()'s?
I also tried:
$data=Get-Content e:\powershell\log.txt
$data -match 'REF=([0-9]+)'
$Matches
But $Matches is empty.
I also tried a similar method to the above, but line by line, for example:
foreach ($line in $data)
{
    $line -match 'REF=([0-9]+)'
}
I either get no matches or the full match (including the REF= part). I've also tried groups (that is, '(REF=)([0-9]+)'), and I can't get what I need.
How should I be reading the file? What is wrong with my regular expression?
I just need this extracted value as a usable variable.
It may be the way you are trying to access the capture group
I put this quick static class together to illustrate how to get the match you are looking for.
Note: I am using the @ symbol on the regex and your input string to make them verbatim string literals.
using System;
using System.Text.RegularExpressions;

namespace SkunkWorks.RegexPractice
{
    public static class RegexPractice2
    {
        // Verbatim string literal (@"..."): line breaks and backslashes are taken literally.
        public static string input = @"ahksjhadjsadhsah
sakdsjakdjks
ksajdksaj
REF=35464
sadsad
213213
213
2
13";
        static string pat = @"REF=([0-9]+)";

        public static void Do()
        {
            Regex r = new Regex(pat, RegexOptions.IgnoreCase);
            Match m = r.Match(input);
            int matchCount = 0;
            while (m.Success)
            {
                Console.WriteLine("Match" + (++matchCount));
                // Group 0 is the whole match ("REF=35464"); start at 1 to see only the captures.
                for (int i = 1; i < m.Groups.Count; i++)
                {
                    Group g = m.Groups[i];
                    Console.WriteLine("Group" + i + "='" + g + "'");
                    CaptureCollection cc = g.Captures;
                    for (int j = 0; j < cc.Count; j++)
                    {
                        Capture c = cc[j];
                        Console.WriteLine("Capture" + j + "='" + c + "', Position=" + c.Index);
                    }
                }
                m = m.NextMatch();
            }
        }
    }
}
What I usually do when I need to extract a substring from an array of strings is to use the automatic variable $Matches that is generated from using the -match operator in a Where statement. Like this:
$Data | Where{$_ -match "REF=([0-9]+)"} | ForEach{$Matches[1]}
Now, the $Matches variable there will be a hashtable. Entry 0 is the entire text that the regex matched (REF=35464), and entry 1 is just the captured text, which is why I specify [1]. Now, about the RegEx that you're matching on: it's acceptable as written, because [0-9]+ is greedy and already captures the whole run of digits after REF=. It just isn't very specific about what may follow the number. If you want to be sure that nothing else follows the digits, you can anchor the pattern with the end-of-line character $, like: REF=([0-9]+)$. We can't really tell if there's any whitespace after the numbers, so you might want to allow for that too using the \s notation that matches any whitespace character (spaces, tabs, whatever), with an asterisk after it, which means zero or more. Then it becomes REF=([0-9]+)\s*$, which gets you exactly what you were looking for. Lastly, I would use \d instead of [0-9] because it's shorter and simpler, and for this input it does the same thing (in .NET, \d also matches non-ASCII Unicode digits). So, we have:
$Data | Where{$_ -match "REF=(\d+)\s*$"} | ForEach{$Matches[1]}
And that is broken down step by step and explained here: https://regex101.com/r/dG7jC7/1
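The question of why REF= shows up in the result comes down to group 0 (the whole match) versus group 1 (the part in parentheses), and that split is the same in most regex engines. A minimal Python sketch of it, using an inline copy of the sample log instead of reading e:\powershell\log.txt:

```python
import re

log_text = """ahksjhadjsadhsah
sakdsjakdjks
ksajdksaj
REF=35464
sadsad
213213
"""

ref_rx = re.compile(r'REF=(\d+)\s*$')

whole = None
ref = None
for line in log_text.splitlines():
    m = ref_rx.search(line)
    if m:
        whole = m.group(0)  # entire match, includes "REF="
        ref = m.group(1)    # only the digits inside the parentheses
        break

print(whole, ref)
```

In the Select-String attempt from the question, $_.Value corresponds to group 0 here, while the number alone lives in the first capture group.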