Regex in Powershell fails to check for newlines

I'm trying to get the first block of release notes...
(See sample content in the code)
Whenever I use something simple it works; it only breaks when I try to search across multiple lines (\n). I'm using (Get-Content $changelog | Out-String) because that returns one string instead of an array with one element per line.
$changelog = 'C:\Source\VSTS\AcmeLab\AcmeLab Core\changelog.md'
$regex = '([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
(Get-Content $changelog | Out-String) | Select-String -Pattern $regex -AllMatches
<#
SAMPLE:
------
v1.0.23
- Adds an IContainer API.
- Bugfixes.
v1.0.22
- Hotfix: Language operators.
v1.0.21
- Support duplicate query parameters.
v1.0.20
- Splitting up the ICommand interface.
- Fixing the referrer header empty field value.
#>
The result I need is:
v1.0.23
- Adds an IContainer API.
- Bugfixes.
Update:
Using options..
$changelog = 'C:\Source\VSTS\AcmeLab\AcmeLab Core\changelog.md'
$regex = '(?smi)([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
Get-Content -Path $changelog -Raw | Select-String -Pattern $regex -AllMatches
I also get nothing.. (no matter if I use \n or \r\n)

Unless you're stuck with PowerShell v2, it's simpler and more efficient to use Get-Content -Raw to read an entire file as a single string; besides, Out-String adds an extra newline to the string.[1]
Since you're only looking for the first match, you can use the -match operator - no need for Select-String's -AllMatches switch.
Note: While you could use Select-String without it, it is more efficient to use the -match operator, given that you've read the entire file into memory already.
In PowerShell, regex matching is case-insensitive by default, consistent with PowerShell's overall case-insensitivity.
Thus, the following returns the first block, if any:
if ((Get-Content -Raw $changelog) -match '(?m)^v\d+\.\d+\.\d+.*(\r?\n-\s?.*)+') {
# Match found - output it.
$Matches[0]
}
(?m) turns on inline regex option m (multi-line), which causes anchors ^ and $ to match the beginning and end of individual lines rather than the overall string's.
\r?\n matches both CRLF and LF-only newlines.
You could make the regex slightly more efficient by making the (...) subexpression non-capturing, given that you're not interested in what it captured: (?:...).
Note that -match itself returns a Boolean (with a scalar LHS), but information about the match is recorded in the automatic $Matches hashtable variable, whose 0 entry contains the overall match.
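As a minimal sketch of the command above, the following runs the same regex (with the group made non-capturing, as suggested) against an in-memory string. The $sample text is a hypothetical stand-in for the changelog file and assumes LF-only newlines.

```powershell
# Hypothetical sample text standing in for the contents of the changelog file.
$sample = "v1.0.23`n- Adds an IContainer API.`n- Bugfixes.`nv1.0.22`n- Hotfix: Language operators.`n"

# -match returns a Boolean for a scalar LHS; the matched text is in $Matches[0].
$firstBlock = if ($sample -match '(?m)^v\d+\.\d+\.\d+.*(?:\r?\n-\s?.*)+') { $Matches[0] }

# Outputs the version line plus the two "- ..." lines that follow it.
$firstBlock
```

The match stops before v1.0.22 because the repeated group requires each newline to be followed by a - character.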
As for what you tried:
'([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
doesn't work, because by default $ only matches at the very end of the input string, at the end of the last line (though possibly before a final newline).
To make $ match the end of each line, you'd have to turn on the multiline regex option (which you did in your 2nd attempt).
As a result, nothing matches.
'(?smi)([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
doesn't work as intended, because by using option s (single-line) you've made . match newlines too, so that a greedy subexpression such as .* will match the remainder of the string, across lines.
As a result, everything from the first block on matches.
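The behavior of the two attempts can be reproduced with a short sketch. The $sample string is a hypothetical stand-in for the file contents (LF-only newlines assumed); the two regexes are the ones from the question.

```powershell
# Hypothetical sample text (LF-only newlines) standing in for the changelog.
$sample = "v1.0.23`n- Adds an IContainer API.`n- Bugfixes.`nv1.0.22`n- Hotfix: Language operators.`n"

# Attempt 1: without option m, ^ and $ do not match at line boundaries
# inside the string, so the (^-.*$\n)+ part cannot match after the version line.
$attempt1 = $sample -match '([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'

# Attempt 2: option s lets the greedy .* run across newlines,
# so the match extends to the end of the string.
$attempt2 = if ($sample -match '(?smi)([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+') { $Matches[0] }

$attempt1                # $false
$attempt2 -ceq $sample   # $true: the match spans the whole sample
```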
[1] This problematic behavior is discussed in GitHub issue #14444.

Related

Select-String: match a string only if it isn't preceded by a specific character

I have a list of files that contain either of the two strings:
"stuff" or ";stuff"
I'm trying to write a PowerShell Script that will return only the files that contain "stuff". The script below currently returns all the files because obviously "stuff" is a substring of ";stuff"
For the life of me, I cannot figure out how to match only files that contain "stuff" without a preceding ;
Get-Content "C:\temp\list\list.txt" |
Where-Object { Select-String -Quiet -Pattern "stuff" -SimpleMatch $_ }
Note: C:\temp\list\list.txt contains a list of file paths that are each passed to Select-String.
Thanks for the help.

You cannot perform the desired matching with literal substring searches (-SimpleMatch).
Instead, use a regex with a negative look-behind assertion ((?<!..)) to rule out stuff substrings preceded by a ; char.: (?<!;)stuff
Applied to your command:
Get-Content "C:\temp\list\list.txt" |
Where-Object { Select-String -Quiet -Pattern '(?<!;)stuff' -LiteralPath $_ }
Regex pitfalls:
It is tempting to use [^;]stuff instead, using a negated (^) character set ([...]) (see this answer); however, this will not work as expected if stuff appears at the very start of a line, because a character set - whether negated or not - only matches an actual character, not the start-of-the-line position.
It is then tempting to apply ? to the negated character set (for an optional match - 0 or 1 occurrence): [^;]?stuff. However, that would match a string containing ;stuff again, given that stuff is technically preceded by a "0-repeat occurrence" of the negated character set; thus, ';stuff' -match '[^;]?stuff' yields $true.
Only a look-behind assertion works properly in this case - see regular-expressions.info.
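The three patterns discussed above can be compared side by side. This is a minimal sketch using hypothetical one-line test strings rather than file contents; the variable names are illustrative only.

```powershell
# Pitfall 1: [^;] must consume a real character, so "stuff" at the very
# start of the string is missed.
$negatedSetAtStart = 'stuff' -match '[^;]stuff'        # $false

# Pitfall 2: making the negated set optional lets ";stuff" match again.
$optionalNegatedSet = ';stuff' -match '[^;]?stuff'     # $true

# Look-behind: rejects ";stuff" but accepts "stuff" at the start or elsewhere.
$lookBehindRejects  = ';stuff' -match '(?<!;)stuff'    # $false
$lookBehindAtStart  = 'stuff' -match '(?<!;)stuff'     # $true
$lookBehindMidLine  = 'my stuff' -match '(?<!;)stuff'  # $true
```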

To complement @mklement0's answer, I suggest an alternative approach to make your code easier to read and understand:
#requires -Version 4
@(Get-Content -Path 'C:\Temp\list\list.txt').
ForEach([IO.FileInfo]).
Where({ $PSItem | Select-String -Pattern '(?<!;)stuff' -Quiet })
This turns your strings into objects (System.IO.FileInfo) and uses the array methods ForEach and Where for conciseness. Further, it lets you pipe the paths as objects, which Select-String accepts as input, making the command easier to follow (I find long lists of parameters difficult to read).

The example code posted won't actually run, as it will look at each line as the -Path value.
What you need is to get the content, select the string you're after, then filter the results with Where-Object
Get-Content "C:\temp\list\list.txt" | Select-String -Pattern "stuff" | Where-Object {$_ -notmatch ";stuff"}
You could create a more complex regex if needed, but that depends on what the data in your files looks like.

Powershell regex match sequence doesn't work although it matches in Sublime Text find and replace

I am trying to create a Powershell regex statement to remove the top five lines of this output from a git diff file that has already been modified with Powershell regex.
[1mdiff --git a/uk1.adoc b/uk2.adoc</span>+++
[1mindex b5d3bf7..90299b8 100644</span>+++
[1m--- a/uk1.adoc</span>+++
[1m+++ b/uk2.adoc</span>+++
[36m@@ -1,9 +1,9 @@</span>+++
= Heading
Body text
Image shown because binary code doesn't show in the text
The following statement matches the text so the '= Heading' line is placed at the top of the page if I replace with nothing.
^[^=]*.[+][\n]
But in Powershell, it isn't matching the text.
Get-Content "result2.adoc" | % { $_ -Replace '^[^=]*.[+][\n]', '' } | Out-File "result3.adoc";
Any ideas about why it doesn't work in Powershell?
My overall goal is to create a diff file of two versions of an AsciiDoc file and then replace the ASCII codes with HTML/CSS code to display the resulting AsciiDoc file with green/red track changes.

The simplest - and faster - approach is to read the input file as a single, multiline string with Get-Content -Raw and let the regex passed to -replace operate across multiple lines:
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)' |
Set-Content result3.adoc
(?s) activates in-line option s which makes . match newline (\n) characters too.
^.+?\n(?==) matches from the start of the string (^) any number of characters (including newlines) (.+), non-greedily (?)
until a newline (\n) followed by a = is found.
(?=...) is a look-ahead assertion, which matches = without consuming it, i.e., without considering it part of the substring that matched.
Since no replacement operand is passed to -replace, the entire match is replaced with the implied empty string, i.e., what was matched is effectively removed.
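To see the effect without a file, here is a minimal sketch that applies the same -replace to an in-memory string. The $text value is a hypothetical, shortened stand-in for result2.adoc with LF-only newlines.

```powershell
# Hypothetical, shortened stand-in for the contents of result2.adoc.
$text = "diff --git a/uk1.adoc b/uk2.adoc`nindex b5d3bf7..90299b8 100644`n= Heading`nBody text`n"

# Remove everything from the start of the string up to (and including)
# the newline that precedes the first line starting with "=".
$result = $text -replace '(?s)^.+?\n(?==)'

# Outputs "= Heading" followed by "Body text".
$result
```

Because ^ is used without option m, it matches only at the very start of the string, so only the leading header block is removed.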
As for what you tried:
The -replace operator passes its LHS through if no match is found, so you cannot use it to filter out non-matching lines.
Even if you match an undesired line in full and replace it with '' (the empty string), it will show up as an empty line in the output when sent to Set-Content or Out-File (>).
As for your specific regex, ^[^=]*.[+][\n] (whether or not the first ^ is followed by an ESC (0x1b) char.):
[\n] (just \n would suffice) tries to match a newline char. after a literal + ([+]), yet lines read individually with Get-Content (without -Raw) by definition are stripped of their trailing newline, so the \n will never match; instead, use $ to match the end of a line.
Instead of % (the built-in alias for the ForEach-Object cmdlet) you could have used ? (the built-in alias for the Where-Object cmdlet) to perform the desired filtering:
Get-Content result2.adoc | ? { $_ -notmatch '^\e\[' }
$_ -notmatch '^\e\[' returns $True only for lines that don't start (^) with an ESC character (\e, whose code point is 0x1b) followed by a literal (\) [, thereby effectively filtering out the lines before the = Heading line.
However, the multi-line -replace command at the top is a more direct and faster expression of your intent.
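The line-by-line filter can be tried on an in-memory array as well. In this sketch the $lines array is a hypothetical stand-in for what Get-Content would return, with the ESC character built explicitly via [char]0x1b.

```powershell
# Build hypothetical input lines; the first two start with ESC + "[".
$esc = [char]0x1b
$lines = @(
    "${esc}[1mdiff --git a/uk1.adoc b/uk2.adoc"
    "${esc}[1mindex b5d3bf7..90299b8 100644"
    '= Heading'
    'Body text'
)

# Keep only the lines that do not start with ESC followed by a literal "[".
$kept = $lines | Where-Object { $_ -notmatch '^\e\[' }

# Outputs "= Heading" and "Body text".
$kept
```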

Here is the code I ended up with after help from @mklement0. This Powershell script creates MS Word-style track changes for two versions of an AsciiDoc file. It creates the Diff file, uses regex to replace ASCII codes with HTML/CSS tags, removes the Diff header (thank you!), uses AsciiDoctor to create an HTML file and then PrinceXML to create a PDF file of the output that I can send to document reviewers.
git diff --color-words file1.adoc file2.adoc > result.adoc;
Get-Content "result.adoc" | % {
$_ -Replace '(=+ ?)([A-Za-z\s]+)(\[m)', '$1$2' `
-Replace '\[32m', '+++<span style="color: #00cd00;">' `
-Replace '\[31m', '+++<span style="color: #cd0000; text-decoration: line-through;">' `
-Replace '\[m', '</span>+++' } | Out-File -encoding utf8 "result2.adoc" ;
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)', '' | Out-File -encoding utf8 "result3.adoc" ;
asciidoctor result3.adoc -o result3.html;
prince result3.html --javascript -o result3.pdf;
Read-Host -Prompt "Press Enter to exit"
Here's a screenshot of the result using some text from Wikipedia:

Powershell: Pull URL out of String

I am pulling a string from a text file that looks like:
C:\Users\users\Documents\Firefox\tools\Install.ps1:37: Url = "https://somewebsite.com"
I need to somehow remove everything except the URL, so it should look like:
https://www.somewebsite.com
Here is what I have tried:
$Urlselect = Select-String -Path "$zipPath\tools\chocolateyInstall.ps1" -Pattern "url","Url" -List # Selects URL download path
$Urlselect = $Urlselect -replace ".*" ","" -replace ""*.","" # remove everything but the download link
but this didn't seem to do anything. I think it's going to involve regex, but I am not sure how to write it. Any help is appreciated. Thanks

I suggest using the switch statement with the -Regex and -File options:
$url = switch -regex -file "$zipPath\tools\chocolateyInstall.ps1" {
' Url = "(.*?)"' { $Matches[1]; break }
}
-file makes switch loop over all lines of the specified file.
-regex interprets the branch conditionals as regular expressions, and the automatic $Matches variable can be used in the associated script block ({ ... }) to access the results of the match, notably, what the 1st (and only) capture group in the regex ((...)) captured - the URL of interest.
break stops processing once the 1st match is found. (To continue matching, use continue).
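The same switch logic can be sketched against an in-memory array, which avoids needing the actual file. This is a variation on the command above, not the -File form itself: the $lines array and its contents are hypothetical, and switch iterates over its elements the same way -File iterates over lines.

```powershell
# Hypothetical lines standing in for the contents of chocolateyInstall.ps1.
$lines = @(
    '$packageName = "demo"'
    ' Url = "https://somewebsite.com"'
    ' Url = "https://example.org/second"'
)

# switch -Regex tests each element in turn; break stops at the first match,
# so the second Url line is never examined.
$url = switch -Regex ($lines) {
    ' Url = "(.*?)"' { $Matches[1]; break }
}

$url   # https://somewebsite.com
```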
If you do want to use Select-String:
$url = Select-String -List ' Url = "(.*?)"' "$zipPath\tools\chocolateyInstall.ps1" |
ForEach-Object { $_.Matches.Groups[1].Value }
Note that the switch solution will perform much better.
As for what you tried:
Select-String -Path "$zipPath\tools\chocolateyInstall.ps1" -Pattern "url","Url"
Select-String is case-insensitive by default, so there's no need to specify case variations of the same string. (Conversely, you must use the -CaseSensitive switch to force case-sensitive matching).
Also note that Select-String doesn't output the matching line directly, as a string, but as a match-information object; to get the matching line, access the .Line property[1].
$Urlselect -replace ".*" ","" -replace ""*.",""
".*" " and ""*." result in syntax errors, because you forgot to escape the embedded " as `".
Alternatively, use '...' (single-quoted literal strings), which allows you to embed " as-is and is generally preferable for regexes and replacement operands, because there's no confusion over what parts PowerShell may interpret up front (string expansion).
Even with the escaping problem solved, however, your -replace operations wouldn't have worked, because .*" matches greedily and therefore up to the last "; here's a corrected solution with non-greedy matching, and with the replacement operand omitted (which makes it default to the empty string):
PS> 'C:\...ps1:37: Url = "https://somewebsite.com"' -replace '^.*?"' -replace '"$'
https://somewebsite.com
^.*?" non-greedily replaces everything up to the first ".
"$ replaces a " at the end of the string.
However, you can do it with a single -replace operation, using the same regex as with the switch solution at the top:
PS> 'C:\...ps1:37: Url = "https://somewebsite.com"' -replace '^.*?"(.*?)"', '$1'
https://somewebsite.com
$1 in the replacement operand refers to what the 1st capture group ((...)) captured, i.e. the bare URL; for more information, see this answer.
[1] Note that there's a green-lit feature suggestion - not yet implemented as of PowerShell Core 6.2.0 - to allow Select-String to emit strings directly, using the proposed -Raw switch - see https://github.com/PowerShell/PowerShell/issues/7713

Regular Expressions in powershell split

I need to strip out a UNC fqdn name down to just the name or IP depending on the input.
My examples would be
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
I want to end up with just tom or 123.43.234.23
I have the following code in my array, which strips out the domain name perfectly, but I'm still left with \\tom
-Split '\.(?!\d)')[0]

Your regex succeeds in splitting off the tokens of interest in principle, but it doesn't account for the leading \\ in the input strings.
You can use regex alternation (|) to include the leading \\ at the start as an additional -split separator.
Given that matching a separator at the very start of the input creates an empty element with index 0, you then need to access index 1 to get the substring of interest.
In short: The regex passed to -split should be '^\\\\|\.(?!\d)' instead of '\.(?!\d)', and the index used to access the resulting array should be [1] instead of [0]:
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' |
ForEach-Object { ($_ -Split '^\\\\|\.(?!\d)')[1] }
The above yields:
tom
123.43.234.23
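To make the empty element at index 0 visible, here is a small sketch that inspects the full array returned by -split for both sample inputs. The variable names are illustrative only.

```powershell
# Split each sample on a leading \\ or on any dot not followed by a digit.
$nameParts = '\\tom.overflow.corp.com' -split '^\\\\|\.(?!\d)'
$ipParts   = '\\123.43.234.23.overflow.corp.com' -split '^\\\\|\.(?!\d)'

# The separator matched at the very start yields an empty first element,
# which is why index 1 holds the substring of interest.
$nameParts.Count        # 5
$nameParts[0] -eq ''    # $true
$nameParts[1]           # tom
$ipParts[1]             # 123.43.234.23
```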
Alternatively, you could remove the leading \\ in a separate step, using -replace:
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' |
ForEach-Object { ($_ -Split '\.(?!\d)')[0] -replace '^\\\\' }
Yet another alternative is to use a single -replace operation, which does not require a ForEach-Object call (doesn't require explicit iteration):
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' -replace
'(?x) ^\\\\ (.+?) \.\D .+', '$1'
Inline option (?x) (IgnoreWhiteSpace) allows you to make regexes more readable with insignificant whitespace: any unescaped whitespace can be used for visual formatting.
^\\\\ matches the \\ (escaped with \) at the start (^) of each string.
(.+?) matches one or more characters lazily.
\.\D matches a literal . followed by something other than a digit (\d matches a digit, \D is the negation of that).
.+ matches one or more remaining characters, i.e., the rest of the input.
$1 as the replacement operand refers to what the 1st capture group ((...)) in the regex matched, and, given that the regex was designed to consume the entire string, replaces it with just that.

I'm stealing Lee_Dailey's $InStuff
but appending a RegEx I used recently
$InStuff = -split @'
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
'@
$InStuff |ForEach-Object {($_.Trim('\\') -split '\.(?!\d{1,3}(\.|$))')[0]}
Sample Output:
tom
123.43.234.23
As you can see here on RegEx101 the dots between the numbers are not matched

The Select-String cmdlet uses regex and populates a MatchInfo object with the matches (which can then be queried).
The regex "(\.?\d+)+|\w+" works for your particular example.
"\\tom.overflow.corp.com", "\\123.43.234.23.overflow.corp.com" |
Select-String "(\.?\d+)+|\w+" | % { $_.Matches.Value }

while this is NOT regex, it does work. [grin] i suspect that if you have a really large number of such items, then you will want a regex. they do tend to be faster than simple text operators.
this will get rid of the leading \\ and then replace the domain name with an empty string.
# fake reading in a text file
# in real life, use Get-Content
$InStuff = -split @'
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
'@
$DomainName = '.overflow.corp.com'
$InStuff.ForEach({
$_.TrimStart('\\').Replace($DomainName, '')
})
output ...
tom
123.43.234.23

Using powershell to search for a pattern

I am trying to write a powershell script to search for a pattern in a text file. Specifically I am looking at reading a file line by line and returning any line that has a space at the 32nd character position.
I have this so far but it just returns all lines that have white space. I need to narrow it down to the 32nd position
Get-Content -path C:\VM_names.txt | Where-Object {$_ -match "\s+"}

Use this pattern:
-match '^.{31} '
Explanation:
^ - beginning of the string
. - any character
{31} - repeated 31 times
- a space

This is actually really easy to do. By default, Get-Content reads a text file as an array of strings (individual lines), unless you use the -Raw parameter, which reads it as a single string. You can use the -match PowerShell operator to "match" the lines that meet your regular expression.
(Get-Content -Path c:\VM_names.txt) -match '^.{31}\s'
The result of the above command is an array of lines that match the desired regular expression.
NOTE: The call to Get-Content must be wrapped in parentheses, otherwise the PowerShell parser will think that -match is a parameter on that command.
NOTE2: As a good practice, use single quotes around strings, unless you specifically know that you need double quotes. You'll save yourself from accidental interpolation.
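The filtering behavior of -match with an array on the left-hand side can be checked with a few in-memory strings. This is a minimal sketch; the $lines array is a hypothetical stand-in for the output of Get-Content.

```powershell
# Hypothetical lines: only the first has a space at the 32nd character position.
$lines = @(
    ('a' * 31) + ' tail'   # 31 characters, then a space
    ('b' * 40)             # 32nd character is "b", not whitespace
    'short line'           # fewer than 32 characters
)

# With an array LHS, -match returns the matching elements instead of a Boolean.
$hits = $lines -match '^.{31}\s'

@($hits).Count      # 1
$hits[0][31]        # the space at index 31, i.e. the 32nd character
```

Note that \s also matches tabs and other whitespace; use a literal space in the pattern, as in the first answer, if only a space should count.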
