I'm working on a regular expression to extract a map of key and associated string.
For some reason, it's working for lines that don't show a line split, but misses where there are line splits.
This is what I'm using:
$errorMap = [ordered]@{}
# process the lines one-by-one
switch -Regex ($fileContent -split ';') {
'InsertCodeInfo\(([\w]*), "(.*)"' { # key etc., followed by string like "Media size cassette missing"
$key,$value = ($matches[1,2])|ForEach-Object Trim
$errorMap[$key] = $value
}
}
This is an example of $fileContent:
InsertCodeInfo(pjlWarnCommunications,
"communications error");
InsertCodeInfo(pjlNormalOnline,
"Online");
InsertCodeInfo(pjlWarnOffline,
"offline");
InsertCodeInfo(pjlNormalAccessing, "Accessing"); #this is first match :(
InsertCodeInfo(pjlNormalArrive, "Normal arrive");
InsertCodeInfo(pljNormalProcessing, "Processing");
InsertCodeInfo(pjlNormalDataInBuffer, "Data in buffer");
It's returning the pairs from pjlNormalAccessing down, where it doesn't have a line split. I thought that using the semicolon to split the regex content would fix it, but it didn't help. I was formerly splitting regex content with
'\r?\n'
I thought maybe there was something going on with VSCode so I have exited and re-opened it, and re-running the script had the same result. Any idea how to get it to match every InsertCodeInfo through the semicolon line with the key-value pair?
This is using VSCode and Powershell 5.1.
Update:
Someone asked how $fileContent is created:
I call my method with the filenamepath ($FileHandler), and from/to strings/methodNames ($matchFound2 becomes $fileContent later as a method parameter):
$matchFound2 = Get-MethodContents -codePath $FileHandler -methodNameToReturn "OkStatusHandler::PopulateCodeInfo" -followingMethodName "OkStatusHandler::InsertCodeInfo"
Function Get-MethodContents{
[cmdletbinding()]
Param ( [string]$codePath, [string]$methodNameToReturn, [string]$followingMethodName)
Process
{
$contents = ""
Write-Host "In GetMethodContents method File:$codePath method:$methodNameToReturn followingMethod:$followingMethodName" -ForegroundColor Green
$contents = Get-Content $codePath -Raw #raw gives content as single string instead of a list of strings
$null = $contents -match "($methodNameToReturn[\s\S]*)$followingMethodName" #| Out-Null
return $Matches.Item(1)
}#End of Process
}#End of Function
You can use
InsertCodeInfo\((\w+),\s*"([^"]*)
See the online regex demo.
Details:
InsertCodeInfo\( - a literal InsertCodeInfo( text
(\w+) - Group 1: one or more word chars (letters, digits, diacritics, or underscores and other connector punctuation)
, - a comma
\s* - zero or more whitespaces
" - a " char
([^"]*) - Group 2: zero or more chars other than a " char.
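As a quick check of the pattern described above, here is a minimal sketch in Python (used purely for illustration; the pattern relies only on syntax that .NET and Python's re module share). The sample text is a shortened copy of the question's input, and the names SAMPLE, PATTERN and error_map are illustrative. Because \s* also matches a newline, the entries that are split across two lines are captured too.

```python
import re

# Shortened sample from the question; the first two entries are split across lines.
SAMPLE = '''InsertCodeInfo(pjlWarnCommunications,
"communications error");
InsertCodeInfo(pjlNormalOnline,
"Online");
InsertCodeInfo(pjlNormalAccessing, "Accessing");
'''

# \s* matches spaces as well as newlines, so a line break between
# the key and the quoted string does not prevent a match.
PATTERN = re.compile(r'InsertCodeInfo\((\w+),\s*"([^"]*)')

# A dict keeps insertion order, similar to an [ordered] hashtable.
error_map = {key: value for key, value in PATTERN.findall(SAMPLE)}

for key, value in error_map.items():
    print(f"{key} = {value}")
```

Running the pattern over the whole text, rather than over pieces split on ; or newlines, is what lets one expression handle both layouts.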
This regular expression seems to be catching all lines, including ones with a newline in the middle. Thanks for the suggestion @WiktorStribizew. I tweaked your suggestion, and it helped.
InsertCodeInfo\(([\w]*),[\s]*"([^"]*)
It might not be the most succinct, but it's catching all lines. Feel free, as always, to post alternative suggestions; that is why I didn't accept my own answer.
Been beating my head around this one all day and I'm getting close but not quite getting there. I have a small subset of my much larger script for just the regex part. Here is the script so far:
$CCI_ID = @(
"003417 AR-2.1"
"003425 AR-2.9"
"003392 AP-1.12"
"009012 APP-1(21).1"
)
[regex]::matches($CCI_ID, '(\d{1,})|([a-zA-Z]{2}[-][\d][\(?\){0,1}[.][\d]{1,})') |
ForEach-Object {
if($_.Groups[1].Value.length -gt 0){
write-host $('CCI-' + $_.Groups[1].Value.trim())}
else{$_.Groups[2].Value.trim()}
}
CCI-003417
AR-2.1
CCI-003425
AR-2.9
CCI-003392
AP-1.12
CCI-009012
PP-1(21
CCI-1
The output is correct for all but the last one. It should be:
CCI-009012
APP-1(21).1
Thanks for any advice.
Instead of describing and quantifying the (optional) opening and closing parenthesis separately, group them together and then make the whole group optional:
(?:\(\d+\))?
The whole pattern thus ends up looking like:
[regex]::Matches($CCI_ID, '(\d{1,})|([a-zA-Z]{2,3}[-][\d](?:\(\d+\))?[.][\d]{1,})')
Your pattern uses an alternation (|), but judging by the example data you can instead match one or more whitespace characters between the digits and the code.
If the pattern matches, group 1 already contains one or more digits, so you don't have to check Value.length.
The pattern with the optional digits between parentheses:
\b(\d+)\s+([a-zA-Z]{2,}-\d(?:\(\d+\))?\.\d+)\b
See a regex101 demo.
$CCI_ID = @(
"003417 AR-2.1"
"003425 AR-2.9"
"003392 AP-1.12"
"009012 APP-1(21).1"
)
[regex]::matches($CCI_ID, '\b(\d+)\s+([a-zA-Z]{2,}-\d(?:\(\d+\))?\.\d+)\b') |
ForEach-Object {
write-host $( 'CCI-' + $_.Groups[1].Value.trim() )
write-host $_.Groups[2].Value.trim()
}
Output
CCI-003417
AR-2.1
CCI-003425
AR-2.9
CCI-003392
AP-1.12
CCI-009012
APP-1(21).1
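The same pattern can be checked outside PowerShell. The following is a minimal Python sketch, not a translation of the script above: it applies the pattern to each string separately, and the names CCI_IDS, PATTERN and parse_cci are illustrative.

```python
import re

CCI_IDS = ["003417 AR-2.1", "003425 AR-2.9", "003392 AP-1.12", "009012 APP-1(21).1"]

# Group 1: the leading digits. Group 2: letters, a dash, a digit,
# an optional "(digits)" part, a dot, and the trailing digits.
PATTERN = re.compile(r'\b(\d+)\s+([a-zA-Z]{2,}-\d(?:\(\d+\))?\.\d+)\b')


def parse_cci(line):
    """Return ('CCI-<number>', '<code>') or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return "CCI-" + match.group(1), match.group(2)


results = [parse_cci(line) for line in CCI_IDS]
for result in results:
    print(result)
```

Making the whole (?:\(\d+\))? group optional is what allows APP-1(21).1 and AR-2.1 to be matched by the same expression.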
As you are experiencing here, regular expressions can become very complex and unreadable.
Therefore it is often a good idea to view your problem from two different angles:
Try matching the part(s) you want, or
Try matching the part(s) you don't want
In your case it is probably easier to match the part that you don't want (the delimiter, i.e. the whitespace) and split your string on that, which is apparently what you want to achieve:
$CCI_ID | Foreach-Object {
$Split = $_ -Split '\s+', 2
'CCI-' + $Split[0]
$Split[1]
}
$_ -Split '\s+', 2 splits the string on one or more whitespace characters (you might also consider a literal space: -Split ' '). The , 2 prevents the string from being split into more than 2 parts, meaning that the second part will not be split further even if it contains spaces.
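For comparison, the split-with-a-limit idea looks like this in Python. This is a minimal sketch that assumes every input line contains at least one whitespace run; the function name split_cci is illustrative. Note that Python counts the number of splits (maxsplit=1), whereas PowerShell's -split counts the number of resulting parts (2).

```python
import re


def split_cci(line):
    """Split on the first whitespace run only and prefix the number with 'CCI-'."""
    # maxsplit=1 yields at most two parts, so the second part is
    # left untouched even if it contains further spaces.
    number, code = re.split(r'\s+', line, maxsplit=1)
    return "CCI-" + number, code


print(split_cci("009012 APP-1(21).1"))
```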
I'm pulling my hair, to RegEx-tract the bare version information from some filenames.
e.g. "1.2.3.4"
Let's assume, I have the following Filenames:
VendorSetup-x64-1.23.4.exe
VendorSetup-1-2-3-4.exe
Vendor Setup 1.23.456Update.exe
SoftwareName-1.2.34.5-x64.msi
SoftwareName-1.2.3.4-64bit.msi
SoftwareName-64-Bit-1.2.3.4.msi
VendorName_SoftwareName_64_1.2.3_Setup.exe
(And I know there are still some filenames out there, that have "x32" as well as "x86" in them, so I've added them to the title)
First of all, I replaced the _'s & -'s by .'s which I'd like to avoid in general, but haven't found a cleverer approach and to be honest - only works well if there's no other "digit"-information in the String for example like the 2nd Filename.
I then tried to extract the Version information using Regex like
-replace '^(?:\D+)?(\d+((\.\d+){1,4})?)(?:.*)?', '$1'
Which lacks the ability to omit "x64", "64Bit", "64-Bit" or any variation of that generally.
Additionally, I played around with RegExes like
-replace '^(?:[xX]*\d{2})?(?:\D+)?(\d+((\.\d+){1,4})?)(?:.*)?$', '$1'
to try to omit a leading "x64" or "64", but with no success (most probably because of the replacement from -'s to .'s).
And before it gets even worse, I'd like to ask if there's anybody who could help me or lead me in the right direction?
Thanks in advance!
This could be done using a single pattern, but by splitting it up into two separate patterns and letting PowerShell do some of the work, the overall solution can be much easier.
Pattern 1 matches version numbers that are separated by . (dot):
(?<=[\s_-])\d+(?:\.\d+){1,3}
Pattern 2 matches version numbers that are separated by - (dash):
(?<=[\s_-])\d+(?:-\d+){1,3}
The patterns start with (?<=[\s_-]), which is a positive lookbehind assertion that makes sure that the version is separated by a space, underscore or dash on the left side, without including these in the captured value. This prevents the substring 64-1 from the first sample from matching as a version.
Detailed explanations of the pattern can be found at regex101.
Powershell code:
# Create an array of sample filenames
$names = @'
VendorSetup-2022-05-x64-1.23.4.exe
VendorSetup-x64-1.23.4-2022-05.exe
VendorSetup-1-2-3-4.exe
VendorSetup_2022-05_1-2-3-4.exe
Vendor Setup 1.23.456Update.exe
SoftwareName-1.2.34.5-x64.msi
SoftwareName-1.2.3.4-64bit.msi
SoftwareName-64-Bit-1.2.3.4.msi
VendorName_SoftwareName_64_1.2.3_Setup.exe
NoVersion.exe
'@ -split '\r?\n'
# Array of RegEx patterns in order of precedence.
$versionPatterns = '(?<=[\s_-])\d+(?:\.\d+){1,3}', # 2..4 numbers separated by '.'
'(?<=[\s_-])\d+(?:-\d+){1,3}' # 2..4 numbers separated by '-'
foreach( $name in $names ) {
$version = $versionPatterns.
ForEach{ [regex]::Match( $name, $_, 'RightToLeft' ).Value }. # Apply each pattern from right to left of string.
Where({ $_ }, 'First'). # Get first matching pattern (non-empty value).
ForEach{ $_ -replace '\D+', '.' }[0] # Normalize the number separator and get single string.
# Output custom object for nice table formatting
[PSCustomObject]@{ Name = $name; Version = $version }
}
Output:
Name Version
---- -------
VendorSetup-2022-05-x64-1.23.4.exe 1.23.4
VendorSetup-x64-1.23.4-2022-05.exe 1.23.4
VendorSetup-1-2-3-4.exe 1.2.3.4
VendorSetup_2022-05_1-2-3-4.exe 1.2.3.4
Vendor Setup 1.23.456Update.exe 1.23.456
SoftwareName-1.2.34.5-x64.msi 1.2.34.5
SoftwareName-1.2.3.4-64bit.msi 1.2.3.4
SoftwareName-64-Bit-1.2.3.4.msi 1.2.3.4
VendorName_SoftwareName_64_1.2.3_Setup.exe 1.2.3
NoVersion.exe
Explanation of the Powershell code:
To resolve ambiguities when a filename has multiple matches of the patterns, we use the following rules:
Version with . separator is preferred over version with - separator. We simply apply the patterns in this order and stop when the first pattern matches.
Rightmost version is preferred (by passing the RightToLeft flag to [regex]::Match()).
.ForEach and .Where are PowerShell intrinsic methods. They are basically faster variants of the ForEach-Object and Where-Object cmdlets.
The index [0] operator after the last .ForEach is required because .ForEach and .Where always return arrays, even if there is only a single value, contrary to the behaviour of the cmdlets.
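The precedence rules above can also be sketched in Python, with one caveat: Python's re module has no RightToLeft option, so this sketch takes the last left-to-right match instead. That is an approximation which gives the same results for the sample names but could differ when candidate matches overlap. The names VERSION_PATTERNS and extract_version are illustrative.

```python
import re

# Patterns in order of precedence: dot-separated first, dash-separated second.
VERSION_PATTERNS = [
    r'(?<=[\s_-])\d+(?:\.\d+){1,3}',  # 2..4 numbers separated by '.'
    r'(?<=[\s_-])\d+(?:-\d+){1,3}',   # 2..4 numbers separated by '-'
]


def extract_version(name):
    """Return the rightmost version found by the first pattern that matches, or ''."""
    for pattern in VERSION_PATTERNS:
        matches = re.findall(pattern, name)
        if matches:
            # Take the last match and normalize the separator to '.'.
            return re.sub(r'\D+', '.', matches[-1])
    return ''


for name in ("VendorSetup-2022-05-x64-1.23.4.exe", "VendorSetup_2022-05_1-2-3-4.exe"):
    print(name, "->", extract_version(name))
```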
I have a name delimiter that I want to use to extract the whole line where it is found.
[string]$testString = $null
# broken text string of text & newlines which simulates $testString = Get-Content -Raw
$testString = "initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text"
# test1
# simulate text string before(?<content>.*)text string after - this returns "initial text" only (no newline or anything after)
# $testString -match "(?<BOURKE>.*)"
# test2
# this returns all text, including the newlines, so that $testString outputs exactly as it is defined
$testString -match "(?s)(?<BOURKE>.*)"
#test3
# I want just the line with BOURKE
$result = $matches['BOURKE']
$result
#Test1 finds the match but only prints to the newline. #Test2 finds the match and includes all newlines. I would like to know what is the regex pattern that forces the output to begin 001 BOURKE ...
Any suggestions would be appreciated.
Note:
I'm assuming you're looking for the whole line on which BOURKE appears as a substring.
In your own attempts, (?<BOURKE>...) simply gives the regex capture group a self-chosen name (BOURKE), which is unrelated to what the capture group's subexpression (...) actually matches.
For the use case at hand, there's no strict need to use a (named) capture group at all, so the solutions below make do without one, which, when the -match operator is used, means that the result of a successful match is reported in index [0] of the automatic $Matches variable, as shown below.
If your multiline input string contains only Unix-format LF newlines (\n), use the following:
if ($multiLineStr -match '.*BOURKE.*') { $Matches[0] }
Note:
To match case-sensitively, use -cmatch instead of -match.
If you know that the substring is preceded / followed by at least one char., use .+ instead of .*
If you want to search for the substring verbatim and it happens to or may contain regex metacharacters (e.g. . ), apply [regex]::Escape() to it; e.g, [regex]::Escape('file.txt') yields file\.txt (\-escaped metacharacters).
If necessary, add additional constraints for disambiguation, such as requiring that the substring start or end only at word boundaries (\b)
If there's a chance that Windows-format CRLF newlines (\r\n) are present, use:
if ($multiLineStr -match '.*BOURKE[^\r\n]*') { $Matches[0] }
For an explanation of the regexes and the ability to experiment with them, see this regex101.com page for .*BOURKE.*, and this one for .*BOURKE[^\r\n]*
In short:
By default, . matches any character except \n, which obviates the need for line-specific anchors (^ and $) altogether, but with CRLF newlines requires excluding \r so as not to capture it as part of the match.[1]
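The behaviour described above is not specific to .NET. Here is a minimal Python sketch of the same idea, assuming the text uses CRLF newlines; TEXT and line_containing are illustrative names. Unlike PowerShell's -match, Python's re is case-sensitive by default.

```python
import re

# Shortened sample with Windows-format CRLF newlines.
TEXT = "initial text\r\nunfinished line\r\n001 BOURKE, Bridget Mary\r\nline after\r\n"


def line_containing(text, needle):
    """Return the whole line on which needle first appears, or None."""
    # .* cannot cross \n, so the match starts at the beginning of the line;
    # [^\r\n]* stops before the \r of a CRLF pair, so no stray \r is captured.
    match = re.search(r'.*' + re.escape(needle) + r'[^\r\n]*', text)
    return match.group(0) if match else None


print(line_containing(TEXT, "BOURKE"))
```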
Two asides:
PowerShell's -match operator only ever looks for one match; if you need to find all matches, you currently need to use the underlying [regex] API directly; e.g., [regex]::Matches($multiLineStr, '.*BOURKE[^\r\n]*', 'IgnoreCase').Value. GitHub issue #7867 suggests bringing this functionality directly to PowerShell in the form of a -matchall operator.
If you want to anchor the substring to find, i.e. if you want to stipulate that it either occur at the start or at the end of a line, you need to switch to multi-line mode ((?m)), which makes ^ and $ match on each line; e.g., to only match if BOURKE occurs at the very start of a line:
if ($multiLineStr -match '(?m)^BOURKE[^\r\n]*') { $Matches[0] }
If line-by-line processing is an option:
Line-by-line processing has the advantage that you needn't worry about differences in newline formats (assuming the utility handling the splitting into lines can handle both newline formats, which is true of PowerShell in general).
If you're reading the input text from a file, the Select-String cmdlet, whose very purpose is to find the whole lines on which a given regex or literal substring (-SimpleMatch) matches, additionally offers streaming processing, i.e. it reads lines one by one, without the need to read the whole file into memory.
(Select-String -LiteralPath file.txt -Pattern BOURKE).Line
Add -CaseSensitive for case-sensitive matching.
The following example simulates the above (-split '\r?\n' splits the multiline input string into individual lines, recognizing either newline format):
(
@'
initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text
'@ -split '\r?\n' |
Select-String -Pattern BOURKE
).Line
Output:
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
[1] Strictly speaking, the [^\r\n]* would also stop matching at a \r character in isolation (i.e., even if not directly followed by \n). If ruling out that case is important (which seems unlikely), use a (simplified version of) the regex suggested by Mathias R. Jessen in a comment on the question: .*BOURKE.*?(?=\r?\n)
I find it best to have a match consume up to what is not needed; the \r\n. That can be done with the set nomenclature with the ^ in the set such as [^\r\n]+ which says consume up to either a \r or a \n. Hence everything that is not a \r\n.
To do that use
$testString -match "(?<Bourke>\d\d\d\s[^\r\n]+)"
Also, try to avoid * when you know there will be matchable text. Both * and + are greedy, but + requires at least one character, so the engine doesn't have to consider the zero-length case of *. That limits backtracking, as it's called, over candidate matches that are patently not plausible.
I'm trying to clean up a string. An example string:
{
"NodeID": "${NodeID}",
"EventID": "${EventID}"
}
I want to capture all double quotes which occur after the colon, so that the end string will be:
{
"NodeID": ${NodeID},
"EventID": ${EventID}
}
I know that it's JSON, and that technically it is a string in those positions, but they're macros that will be interpreted by a system which generates the actual JSON string and replaces the macros with data, so in my use case this text isn't JSON yet. I can deal with the text line-by-line to make it easier.
I'll be using the regex pattern in both PowerShell and Python.
The closest I've gotten so far have been: (?<=[^*:])("), and (?<=:)(.*)(?<!,)
This is working, but seems incredibly kludgy and inelegant:
$String = '{
"NodeID": "${NodeID}",
"EventID": "${EventID}"
}'
# The Regex to match the text after the colon
[regex]$Regex = '(?<=:)(.*)'
# Splitting each line of the string into an ArrayList element
[System.Collections.ArrayList]$StringArray = $String.Split([string[]][Environment]::NewLine, [StringSplitOptions]::None)
# Declaring an output string
$OutPutString = ''
# Loop through the ArrayList
$i = 1
foreach ($Row in $StringArray) {
# Split each element string at the RegEx match
$RowArray = $Row -split $Regex
[String]$RowString1 = $RowArray[0]
[String]$RowString2 = $RowArray[1]
# Reassemble the element string after replacing the double quotes in the 2nd half
$FullRowString = $RowString1 + $RowString2.Replace('"','')
# If this is the first line in the string, don't add a newline character in front
if ($i -gt 1) {
$NewLine = "`n"
}
# Reassemble the string
$OutPutString += $NewLine + $FullRowString
$i++
}
$OutPutString
Any better ideas?
For a regex to work as expected, it is important to know which regex engine (i.e. which scripting/programming language) it will run on.
Please always add this information as tags besides regex.
Here: powershell, python
Regex to match a JSON text-field and capture the raw-value
Tested on Python, see regex101 demo:
(?<=:\s\s)\"([^\"]*)\"
Components
To explain the composition of the regex and its working in steps:
(?<=:\s\s): positive lookbehind (?<=...) for a colon followed by 2 whitespace characters (:\s\s)
to neglect the field-name also enclosed in double-quotes
\" and \": matching double-quotes before and after the capture group
the unwanted enclosing of the field-value
([^\"]*): capture-group denoted by parentheses surround any non-double-quote character [^\"]*
the wanted raw field-value (string) without enclosing double-quotes
Note:
The character-group [^\"] matches any non (^) double-quote \".
It will start matching at the leading double-quote and stop matching as soon as a double-quote is detected. So the final \" in the regex is optional: It is not required for matching/capturing, but will ensure that each matched field-value is correctly enclosed by double-quotes.
Result
Matching following input lines:
{
"NodeID": "${NodeID}",
"EventID": "${EventID}"
}
Will give the desired raw field-values in group 1 for each match:
e.g.
${NodeID} for the first match
${EventID} for the second match
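To actually remove the quotes rather than only capture the value, a substitution is needed. The following Python sketch swaps in a slightly different technique from the pattern above: instead of a lookbehind (which must be fixed-width in Python's re), it captures the colon plus any amount of whitespace in group 1 and the unquoted value in group 2, then writes both back without the quotes. It assumes values contain no embedded double quotes; TEXT and unquote_values are illustrative names.

```python
import re

TEXT = '''{
"NodeID": "${NodeID}",
"EventID": "${EventID}"
}'''


def unquote_values(text):
    """Drop the double quotes around every value that follows a colon."""
    # Group 1 keeps the colon and the whitespace after it; group 2 is the bare value.
    return re.sub(r'(:\s*)"([^"]*)"', r'\1\2', text)


print(unquote_values(TEXT))
```

Because the field names are not preceded by a colon, their quotes are left untouched.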
Working with JSON in PowerShell
Assuming your context is parsing JSON, the following related links may be useful:
Microsoft Scripting Blog: Working with JSON data in PowerShell
Related Question: PowerShell parsing JSON
PowerShell Explained: Powershell: The many ways to use regex
I am using Perl to do some prototyping.
I need an expression to replace e by [ee] if the string is exactly 2 chars and finishes by "e".
le -> l [ee]
me -> m [ee]
elle -> elle : no change
I cannot test the length of the string, I need one expression to do the whole job.
I tried:
`s/(?=^.{0,2}\z).*e\z%/[ee]/g` but this is replacing the whole string
`s/^[c|d|j|l|m|n|s|t]e$/[ee]/g` same result (I listed the possible letters that could precede my "e")
`^(?<=[c|d|j|l|m|n|s|t])e$/[ee]/g` but I have no match, not sure I can use ^ on a positive look behind
EDIT
Guys you're amazing, hours of search on the web and here I get answers minutes after I posted.
I tried all your solutions and they are working perfectly directly in my script, i.e. this one:
my $test2="le";
$test2=~ s/^(\S)e$/\1\[ee\]/g;
print "test2:".$test2."\n";
-> test2:l[ee]
But I am loading these regex from a text file (using Perl for proto, the idea is to reuse it with any language implementing regex):
In the text file I store for example (I used % to split the line between match and replace):
^(\S)e$% \1\[ee\]
and then I parse and apply all regex like that:
my $test="le";
while (my $row = <$fh>) {
chomp $row;
if( $row =~ /%/){
my @reg = split /%/, $row;
#if no replacement, put empty string
if($#reg == 0){
push(@reg,"");
}
print "reg found, reg:".$reg[0].", replace:".$reg[1]."\n";
push @regs, [ @reg ];
}
}
print "orgine:".$test."\n";
for my $i (0 .. $#regs){
my $p=$regs[$i][0];
my $r=$regs[$i][1];
$test=~ s/$p/$r/g;
}
print "final:".$test."\n";
This technique is working well with my other regex, but not yet when I have a $1 or \1 in the replace... here is what I am obtaining:
final:\1\ee\
PS: you answered to initial question, should I open another post ?
Something like s/(?i)^([a-z])e$/$1[ee]/
Why aren't you using a capture group to do the replacement?
`s/^([cdjlmnst])e$/$1 [ee]/`
(Inside a character class, | is a literal pipe, so the bars from your attempt are dropped here.)
If those are the characters you need and if it is indeed one word to a line with no whitespace before it or after it, then this will work.
Here's another option, depending on what you are looking for. It will match a two-character string consisting of one a-z character followed by one 'e' on its own line, with possible whitespace before or after. It will replace this with the single a-z character followed by ' [ee]'.
`s/^\s*([a-z])e\s*$/\1 [ee]/`
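Since the question's edit mentions reusing the rules from a text file in other languages, here is a minimal Python sketch of that idea. It assumes the same pattern%replacement line format as in the question; apply_rules and RULES are hypothetical names, and the rule line is an example, not taken from the asker's file. Python's re.sub expands \1 in a replacement string even when that string is only known at run time. The brackets in the replacement are left unescaped, because re.sub rejects unknown escapes such as \[ in a replacement.

```python
import re


def apply_rules(word, rule_lines):
    """Apply each 'pattern%replacement' rule to word, in order."""
    for line in rule_lines:
        if '%' not in line:
            continue  # not a rule line
        # Everything before the first % is the pattern, the rest is the replacement;
        # a missing replacement becomes the empty string.
        pattern, _, replacement = line.partition('%')
        word = re.sub(pattern, replacement, word)
    return word


# Hypothetical rule: one a-z letter followed by 'e', as the whole string.
RULES = [r'^([a-z])e$%\1 [ee]']

for word in ("le", "me", "elle"):
    print(word, "->", apply_rules(word, RULES))
```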
^(\S)e$
Try this. Replace by $1 [ee]. See demo.
https://regex101.com/r/hR7tH4/28
I'd do something like this
$word =~ s/^(\w{1})(e)$/$1$2e/;
You can use the following regex, which matches 2 characters, and then replace the match with $1\[$2$2\]:
^([a-zA-Z])([a-zA-Z])$
Demo :
$my_string =~ s/^([a-zA-Z])([a-zA-Z])$/$1[$2$2]/;
See demo https://regex101.com/r/iD9oN4/1