PowerShell regex match sequence doesn't work although it matches in Sublime Text find and replace

I am trying to create a Powershell regex statement to remove the top five lines of this output from a git diff file that has already been modified with Powershell regex.
[1mdiff --git a/uk1.adoc b/uk2.adoc</span>+++
[1mindex b5d3bf7..90299b8 100644</span>+++
[1m--- a/uk1.adoc</span>+++
[1m+++ b/uk2.adoc</span>+++
[36m@@ -1,9 +1,9 @@</span>+++
= Heading
Body text
Image shown because binary code doesn't show in the text
The following statement matches the text so the '= Heading' line is placed at the top of the page if I replace with nothing.
^[^=]*.[+][\n]
But in Powershell, it isn't matching the text.
Get-Content "result2.adoc" | % { $_ -Replace '^[^=]*.[+][\n]', '' } | Out-File "result3.adoc";
Any ideas about why it doesn't work in Powershell?
My overall goal is to create a diff file of two versions of an AsciiDoc file and then replace the ASCII codes with HTML/CSS code to display the resulting AsciiDoc file with green/red track changes.

The simplest - and fastest - approach is to read the input file as a single, multiline string with Get-Content -Raw and let the regex passed to -replace operate across multiple lines:
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)' |
Set-Content result3.adoc
(?s) activates in-line option s which makes . match newline (\n) characters too.
^.+?\n(?==) matches from the start of the string (^) any number of characters (including newlines) (.+), non-greedily (?)
until a newline (\n) followed by a = is found.
(?=...) is a look-ahead assertion, which matches = without consuming it, i.e., without considering it part of the substring that matched.
Since no replacement operand is passed to -replace, the entire match is replaced with the implied empty string, i.e., what was matched is effectively removed.
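As a minimal sketch of the command above, the following applies the same regex to an in-memory string instead of a file. The content of $sample is made up for illustration and leaves out the escape sequences of the real diff output:

```powershell
# Hypothetical sample standing in for the content of result2.adoc.
$sample = @(
    'diff --git a/uk1.adoc b/uk2.adoc'
    'index b5d3bf7..90299b8 100644'
    '= Heading'
    'Body text'
) -join "`n"

# (?s) lets . match newlines; .+? stops at the first newline that is followed by '='.
# No replacement operand is given, so the matched header is removed.
$result = $sample -replace '(?s)^.+?\n(?==)'

$result
```

Only the two header lines are removed; the '= Heading' line and everything after it are kept.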
As for what you tried:
The -replace operator passes its LHS through if no match is found, so you cannot use it to filter out non-matching lines.
Even if you match an undesired line in full and replace it with '' (the empty string), it will show up as an empty line in the output when sent to Set-Content or Out-File (>).
As for your specific regex, ^[^=]*.[+][\n] (whether or not the first ^ is followed by an ESC (0x1b) char.):
[\n] (just \n would suffice) tries to match a newline char. after a literal + ([+]), yet lines read individually with Get-Content (without -Raw) by definition are stripped of their trailing newline, so the \n will never match; instead, use $ to match the end of a line.
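The points above can be sketched with an in-memory array that stands in for the line-by-line output of Get-Content. The sample lines and the simplified regexes are assumptions made for illustration:

```powershell
# Simulates Get-Content without -Raw: one string per line, trailing newline removed.
$lines = "header+`n= Heading" -split "`n"

# \n can never match inside a single line, so every line is passed through unchanged.
$withNewline = $lines | ForEach-Object { $_ -replace '^[^=]*[+]\n' }

# $ does match at the end of the line, but the matched line survives as an empty string.
$withDollar = $lines | ForEach-Object { $_ -replace '^[^=]*[+]$' }

$withNewline
$withDollar
```

In both cases two elements come out of the pipeline, which is why -replace alone cannot drop the unwanted lines.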
Instead of % (the built-in alias for the ForEach-Object cmdlet) you could have used ? (the built-in alias for the Where-Object cmdlet) to perform the desired filtering:
Get-Content result2.adoc | ? { $_ -notmatch '^\e\[' }
$_ -notmatch '^\e\[' returns $True only for lines that don't start (^) with an ESC character (\e, whose code point is 0x1b) followed by a literal (\) [, thereby effectively filtering out the lines before the = Heading line.
However, the multi-line -replace command at the top is a more direct and faster expression of your intent.
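A small sketch of the Where-Object filter, run against made-up lines rather than the real file. The variable names and sample lines are assumptions; the ESC character is built with a cast so that the snippet does not depend on a particular PowerShell version:

```powershell
# ESC character (code point 0x1b).
$esc = [char] 0x1b

# Hypothetical lines standing in for the output of Get-Content result2.adoc.
$lines = @(
    "${esc}[1mdiff --git a/uk1.adoc b/uk2.adoc"
    "${esc}[36m@@ -1,9 +1,9 @@"
    '= Heading'
    'Body text'
)

# Keep only the lines that do not start with ESC followed by a literal '['.
$kept = $lines | Where-Object { $_ -notmatch '^\e\[' }

$kept
```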

Here is the code I ended up with after help from @mklement0. This Powershell script creates MS Word-style track changes for two versions of an AsciiDoc file. It creates the Diff file, uses regex to replace ASCII codes with HTML/CSS tags, removes the Diff header (thank you!), uses AsciiDoctor to create an HTML file and then PrinceXML to create a PDF file of the output that I can send to document reviewers.
git diff --color-words file1.adoc file2.adoc > result.adoc;
Get-Content "result.adoc" | % {
$_ -Replace '(=+ ?)([A-Za-z\s]+)(\[m)', '$1$2' `
-Replace '\[32m', '+++<span style="color: #00cd00;">' `
-Replace '\[31m', '+++<span style="color: #cd0000; text-decoration: line-through;">' `
-Replace '\[m', '</span>+++' } | Out-File -encoding utf8 "result2.adoc" ;
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)', '' | Out-File -encoding utf8 "result3.adoc" ;
asciidoctor result3.adoc -o result3.html;
prince result3.html --javascript -o result3.pdf;
Read-Host -Prompt "Press Enter to exit"
Here's a screenshot of the result using some text from Wikipedia:

Related

Looking for RegEx to replace a string within a string

I have to replace a string in a text file with Powershell. The text file has this content:
app.framework.locale=de_DE
app.gf.data.profile=somevalues,somevalues[,...]
app.gf.data.profile.path=C:\somepath\somesubpath
app.basket.currencies=currencies
I want to set the profile and the profile.path. So everything behind the "=" should be replaced, regardless of how long the string behind the "=" is.
I am reading the file, replacing the string, and writing back the file with these commands:
change the profile:
(Get-Content $TXT_FILE ).Replace('app.gf.data.profile=default,',"app.gf.data.profile=default,$Man") | Out-File $TXT_FILE -Encoding ASCII
change the profile path:
(Get-Content $TXT_FILE ).Replace('app.gf.data.profile.path=',"app.gf.data.profile.path=$Path") | Out-File $TXT_FILE -Encoding ASCII
As you can see, my replacement script is not correct, the rest behind the "=" will remain. I think I need a regex or some kind of wildcards to fit my needs.
You need to use
(Get-Content $TXT_FILE) `
-replace '^(app\.gf\.data\.profile=).*', "`$1,$Man" `
-replace '^(app\.gf\.data\.profile\.path=).*', "`${1}$Path" | `
Out-File $TXT_FILE -Encoding ASCII
Here, -replace is used to enable regex replacing, and both regular expressions follow a similar logic:
^ - matches start of a line
(app\.gf\.data\.profile=) - captures a literal string (literal dots in regex must be escaped) into Group 1 ($1 or ${1})
.* - matches the rest of the line.
${1} is used in the second case as the variable $Path follows the backreference immediately, and if it starts with a digit, the replacement will not be as expected.
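To illustrate the last point, here is a sketch with a hypothetical $Path value that starts with a digit; the value is made up and only serves to show the difference between the two forms:

```powershell
$line = 'app.gf.data.profile.path=C:\somepath\somesubpath'

# Hypothetical value that starts with a digit.
$Path = '2024\profiles'

# "`$1$Path" expands to '$12024\profiles': the digits merge with the group number,
# so this no longer refers to group 1.
$ambiguous = $line -replace '^(app\.gf\.data\.profile\.path=).*', "`$1$Path"

# "`${1}$Path" expands to '${1}2024\profiles': the braces delimit the group number.
$explicit = $line -replace '^(app\.gf\.data\.profile\.path=).*', "`${1}$Path"

$ambiguous
$explicit
```

Using ${1} everywhere is a cheap safeguard when the appended value comes from a variable whose content is not known in advance.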

Powershell: append text after string in file

Problem: I am trying to append a string after a tag. I got a large text file, and I only need to append some text after the tag (including the text xxxxxx) <xxxxxx>, and I cannot seem to figure it out just yet.
Currently I'm trying this with regex: <[(xxxxxx)]+>, which according to regex101.com does match the exact tag <xxxxxx>, but when I use this in Powershell it returns a lot of other stuff.
How can I make sure that Powershell only matches <xxxxxx> ? And to append some string after <xxxxxx> ?
Sample snippet from the text file: PredefinedSettings=<xxxxxx><abc test123 /abc></xxxxxx>
Sample PS command: Get-Content .\samplefile.ini | Select-String -Pattern "<[(xxxxxx)]+>"
Which returns the entire line PredefinedSettings=<xxxxxx><abc test123 /abc></xxxxxx> instead of just <xxxxxx>
If you want to output just the matched text, you can do the following:
Select-String -Path .\samplefile.ini -Pattern '<(/?xxxxxx)>' -AllMatches | ForEach-Object {
$_.Matches.Groups[1].Value # Outputs matched text between `<>`
$_.Matches.Value # Outputs all matched text
}
The -AllMatches switch will allow matching beyond the first match. So it would return <xxxxxx> and </xxxxxx>.
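A minimal sketch of reading each match separately, using the sample line from the question as in-memory input instead of a file. The property names Full and Inner are made up for illustration:

```powershell
$line = 'PredefinedSettings=<xxxxxx><abc test123 /abc></xxxxxx>'

$tags = $line |
    Select-String -Pattern '<(/?xxxxxx)>' -AllMatches |
    ForEach-Object {
        # $_.Matches holds one Match object per occurrence found in the line.
        foreach ($m in $_.Matches) {
            [pscustomobject]@{
                Full  = $m.Value            # whole match, including < and >
                Inner = $m.Groups[1].Value  # capture group 1, the text between < and >
            }
        }
    }

$tags
```

Looping over $_.Matches keeps the whole match and its capture group paired per occurrence.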
If you want to replace text in a file, you can do the following:
(Get-Content .\samplefile.ini) -replace '<(/?xxxxxx)>','<$1Text>' |
Set-Content .\samplefile.ini
If your replacement text is in a variable, you will need to escape the $ for the capture group.
$Text = 'replacement Text'
(Get-Content .\samplefile.ini) -replace '<(/?xxxxxx)>',"<`$1$Text>" |
Set-Content .\samplefile.ini
$1 is the capture group 1 data matched within the first (). Depending on your Text, it may be wise to name your capture group. If Text is 23OtherText, <$123OtherText> will attempt to substitute capture group 123. Using a named capture group, you can do the following:
(Get-Content .\samplefile.ini) -replace '<(?<Tag>/?xxxxxx)>','<${Tag}Text>' |
Set-Content .\samplefile.ini
/? matches zero or one / character, so the pattern covers both the opening and the closing tag.
-replace returns the whole input: matched text is replaced, and text that did not match is passed through unchanged.
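The difference between the numbered and the named capture group can be sketched as follows, using the sample line from the question and a hypothetical $Text value that starts with digits:

```powershell
$line = 'PredefinedSettings=<xxxxxx><abc test123 /abc></xxxxxx>'

# Hypothetical replacement text that starts with digits.
$Text = '23OtherText'

# Expands to '<$123OtherText>': the digits merge with the group number.
$numbered = $line -replace '<(/?xxxxxx)>', "<`$1$Text>"

# Expands to '<${Tag}23OtherText>': the group name is delimited by the braces.
$named = $line -replace '<(?<Tag>/?xxxxxx)>', "<`${Tag}$Text>"

$numbered
$named
```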
I hope I got your question right.
In regex, quantifiers are greedy by default, so the pattern will select from the first opening tag to the last closing tag; you can change that by appending a ? to the quantifier.
So your regex will be <[(xxxxxx)]+?>.

Replace text + optional newline in file

I've been through other similar questions and tried their advice, but it wouldn't help.
I'm trying to delete a specific line of text in a text file.
My code which works
(Get-Content -Path "MyPath.txt" -Raw).Replace('this is the line', '') | Set-Content "MyPath.txt" -Encoding UTF8
Now this works but leaves an ugly empty line in the text file. I wanted to also replace an optional newline character by adding this regex at the end of the line
\n?
and this wouldn't work. The other threads made other recommendations and I've tried all combinations but just can't match. I'm using windows style ending (CRLF)
Both using -Raw and not using it
\n
\r\n
`n
`r`n
I haven't even added the regex question mark at the end (or non-capturing group in case it needs the \r\n syntax).
The [string] type's .Replace() method doesn't support regexes (regular expressions), whereas PowerShell's -replace operator does.
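As a minimal sketch of the regex route, the following compares the two on a made-up string with CRLF line endings. The sample content and variable names are assumptions:

```powershell
# Hypothetical file content with Windows-style (CRLF) line endings.
$text = "first line`r`nthis is the line`r`nlast line"

# .Replace() treats its argument literally and leaves the line break behind.
$literal = $text.Replace('this is the line', '')

# -replace takes a regex; (?:\r?\n)? also consumes an optional CRLF or LF newline.
$regex = $text -replace 'this is the line(?:\r?\n)?'

$literal
$regex
```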
However, the simplest solution in this case is to take advantage of the fact that the -ne operator acts as a filter with an array-valued LHS (as other comparison operators do):
@(Get-Content -Path MyPath.txt) -ne 'this is the line' |
Set-Content MyPath.txt -Encoding UTF8
Note how Get-Content is called without -Raw in order to return an array of lines, from which -ne then filters out the line of (non)-interest; @(...), the array-subexpression operator, ensures that the output is an array even if the file happens to contain just one line.
The assumption is that string 'this is the line' matches the whole line (case-insensitively).
If that is not the case, instead of -ne you could use -notlike with a wildcard expression or -notmatch with a regex (e.g., -notmatch 'this is the line' or -notlike '*this is the line').
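The filtering behavior can be sketched with in-memory arrays that stand in for the output of Get-Content; the sample lines are made up:

```powershell
# Simulates the array of lines returned by Get-Content without -Raw.
$lines = @('first line', 'this is the line', 'last line')

# With an array on the left, comparison operators act as filters.
$exact   = @($lines) -ne 'this is the line'    # whole-line comparison
$partial = @($lines) -notmatch 'is the'        # regex, matches anywhere in the line
$wild    = @($lines) -notlike '*the line'      # wildcard, must cover the whole line

# Why @(...) matters: with a single string the operator returns a Boolean instead.
$scalar   = 'only line' -ne 'this is the line'
$filtered = @('only line') -ne 'this is the line'

$exact
```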

Regex in Powershell fails to check for newlines

I'm trying to get the first block of releasenotes...
(See sample content in the code)
Whenever I use something simple it works, it only breaks when I try to
search across multiple lines (\n). I'm using (Get-Content $changelog | Out-String) because that gives back 1 string instead of an array from each line.
$changelog = 'C:\Source\VSTS\AcmeLab\AcmeLab Core\changelog.md'
$regex = '([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
(Get-Content $changelog | Out-String) | Select-String -Pattern $regex -AllMatches
<#
SAMPLE:
------
v1.0.23
- Adds an IContainer API.
- Bugfixes.
v1.0.22
- Hotfix: Language operators.
v1.0.21
- Support duplicate query parameters.
v1.0.20
- Splitting up the ICommand interface.
- Fixing the referrer header empty field value.
#>
The result I need is:
v1.0.23
- Adds an IContainer API.
- Bugfixes.
Update:
Using options..
$changelog = 'C:\Source\VSTS\AcmeLab\AcmeLab Core\changelog.md'
$regex = '(?smi)([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
Get-Content -Path $changelog -Raw | Select-String -Pattern $regex -AllMatches
I also get nothing.. (no matter if I use \n or \r\n)
Unless you're stuck with PowerShell v2, it's simpler and more efficient to use Get-Content -Raw to read an entire file as a single string; besides, Out-String adds an extra newline to the string.[1]
Since you're only looking for the first match, you can use the -match operator - no need for Select-String's -AllMatches switch.
Note: While you could use Select-String without it, it is more efficient to use the -match operator, given that you've read the entire file into memory already.
Regex matching is by default always case-insensitive in PowerShell, consistent with PowerShell's overall case-insensitivity.
Thus, the following returns the first block, if any:
if ((Get-Content -Raw $changelog) -match '(?m)^v\d+\.\d+\.\d+.*(\r?\n-\s?.*)+') {
# Match found - output it.
$Matches[0]
}
(?m) turns on inline regex option m (multi-line), which causes anchors ^ and $ to match the beginning and end of individual lines rather than the overall string's.
\r?\n matches both CRLF and LF-only newlines.
You could make the regex slightly more efficient by making the (...) subexpression non-capturing, given that you're not interested in what it captured: (?:...).
Note that -match itself returns a Boolean (with a scalar LHS), but information about the match is recorded in the automatic $Matches hashtable variable, whose 0 entry contains the overall match.
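A self-contained sketch of the approach, run against an in-memory string built from the sample in the question (joined with LF newlines here, which is an assumption) and using the non-capturing variant of the group:

```powershell
# Sample changelog content from the question, joined with LF newlines.
$text = @(
    'v1.0.23'
    '- Adds an IContainer API.'
    '- Bugfixes.'
    'v1.0.22'
    '- Hotfix: Language operators.'
) -join "`n"

# -match returns a Boolean; the details end up in the automatic $Matches variable.
$found = $text -match '(?m)^v\d+\.\d+\.\d+.*(?:\r?\n-\s?.*)+'
$firstBlock = $Matches[0]

$firstBlock
```

The match stops before v1.0.22 because that line does not start with a - after the newline.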
As for what you tried:
'([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
doesn't work, because by default $ only matches at the very end of the input string, at the end of the last line (though possibly before a final newline).
To make $ match the end of each line, you'd have to turn on the multiline regex option (which you did in your 2nd attempt).
As a result, nothing matches.
'(?smi)([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
doesn't work as intended, because by using option s (single-line) you've made . match newlines too, so that a greedy subexpression such as .* will match the remainder of the string, across lines.
As a result, everything from the first block on matches.
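Both attempts can be reproduced on an in-memory string; the sketch below assumes LF-only newlines and a shortened version of the sample changelog:

```powershell
$text = "v1.0.23`n- Adds an IContainer API.`n- Bugfixes.`nv1.0.22`n- Hotfix: Language operators."

# Attempt 1: without (?m), ^ and $ do not match at line boundaries inside the string.
$attempt1 = $text -match '([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'

# Attempt 2: option s lets the greedy .* run across lines, past the first block.
$attempt2 = $text -match '(?smi)([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
$overreach = $Matches[0]

$attempt1
$overreach
```

With CRLF newlines the literal \n right after the version number would additionally fail to match, because a \r sits in between.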
[1] This problematic behavior is discussed in GitHub issue #14444.

Powershell Regex acting per-line rather than on entire string

Given foo.txt
this is a file
it has some text
the text has three lines
The following regex replacement
(get-content -raw foo.txt) -replace ".*", "hello" | write-output
produces the output
hellohello
hellohello
hellohello
rather than the desired
hello
My understanding was that get-content returns the content as an array of strings, one per line. The -raw flag replaces this behavior with returning the contents as a single string. As far as I know, ".*" should match the entire string, but instead it matches twice on each line.
Please advise.
Use the inline (?s) (dotall) modifier which forces . to span across newlines.
(Get-Content .\foo.txt -Raw) -replace "(?s).+", "hello"
Example:
PS> $data = Get-Content .\foo.txt -Raw
PS> $data
this is a file
it has some text
the text has three lines
PS> $data -replace "(?s).+", "hello"
hello
I can't explain it other than to say that . appears not to be matching newline characters so you get one match for each complete line then one match for the zero characters at the end of each line.
This also explains the .+ behavior of hello once per-line.
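The behavior described above can be reproduced without a file; the two-line sample string is made up:

```powershell
$text = "one`ntwo"

# .* matches each line, then matches again as an empty string at the end of that line.
$star = $text -replace '.*', 'hello'

# .+ needs at least one character, so there is one match per line.
$plus = $text -replace '.+', 'hello'

# With (?s), . also matches the newline, so the whole string is a single match.
$dotall = $text -replace '(?s).+', 'hello'

$star
$plus
$dotall
```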
You can "fix" this by using a better pattern that does match the newline characters.
(Get-Content -raw .\foo.txt) -replace "(.|\r|\n)+", "hello"
From https://stackoverflow.com/a/13674250/1252649,
The trick around DotAll mode is to use [\s\S] instead of .. This character class matches any character ...
Of course, this raises the question as to what exactly . is supposed to match other than 'any character'.
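For what it's worth, in .NET regexes (which PowerShell uses), . matches any character except \n unless the s option is on. A small sketch with a made-up CRLF string shows that \r is matched by . while \n is not, and that [\s\S] covers both:

```powershell
$text = "one`r`ntwo"

# Collect what .+ matches: the \r is included in the first match, the \n is not.
$dotMatches = [regex]::Matches($text, '.+') | ForEach-Object { $_.Value }

# [\s\S] matches any character, newlines included, without needing (?s).
$anyChar = $text -replace '[\s\S]+', 'hello'

$dotMatches.Count
$anyChar
```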