Powershell: Remove all between 2 strings - regex

I have an HTML file that contains two marker lines:
<!-- START -->
asdf
<!-- END -->
Anything can appear between those two markers, and the data changes, so it is not the same every time.
Is it possible to erase all lines between the two markers?
I have tried this regex:
(?sm)<!-- START -->.*?(?=^<!-- END -->)
but the match always starts at the first marker line itself rather than on the line below it.
Can someone help me start the match after the marker and then delete what follows?

The main issue here is that your pattern matches (consumes) the left-hand delimiter without capturing it, so the delimiter is removed together with the content.
To match and erase arbitrary content in between two multichar delimiters you need to either put both delimiters inside lookarounds:
-replace '(?<=left_hand_delim).*?(?=right_hand_delim)'
Or, use capturing groups in the regex and backreferences in the replacement:
-replace '(left_hand_delim).*?(right_hand_delim)', '$1$2'
You may use
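# Both delimiters sit inside lookarounds, so there are no groups to restore and no replacement operand is needed.
$regex='(?ms)(?<=^\s*<!-- OPC-ITEM-ENTRIES START -->\s*).*?(?=\s*<!-- OPC-ITEM-ENTRIES END -->)'
(Get-Content -raw $file) -replace $regex | Set-Content $outfile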
See regex demo 1 and regex demo #2 (see Context tab).
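You must use the -Raw option to read the file contents into a single string; otherwise Get-Content emits the lines one by one and the regex can never match across lines. The s (single-line) flag is needed so that . matches any character, including newlines.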
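The two generic approaches above can be tried on an in-memory string. This is only a sketch: the sample text and the variable names are made up for illustration, and the `\r?\n` after the start marker is an added assumption so that the marker keeps its own line.

```powershell
# Hypothetical sample input; in practice use: $text = Get-Content -Raw $file
$text = "keep 1`n<!-- START -->`nasdf`nmore`n<!-- END -->`nkeep 2"

# Option 1: both delimiters inside lookarounds, so only the content
# between them is matched and removed (no replacement operand needed).
$viaLookarounds = $text -replace '(?s)(?<=<!-- START -->\r?\n).*?(?=<!-- END -->)'

# Option 2: capture both delimiters and put them back with $1 and $2.
$viaGroups = $text -replace '(?s)(<!-- START -->\r?\n).*?(<!-- END -->)', '$1$2'

$viaLookarounds
```

Both variables should end up holding the same text, with the two marker lines directly following each other.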

Related

Powershell regex match sequence doesn't work although it matches in Sublime Text find and replace

I am trying to create a Powershell regex statement to remove the top five lines of this output from a git diff file that has already been modified with Powershell regex.
[1mdiff --git a/uk1.adoc b/uk2.adoc</span>+++
[1mindex b5d3bf7..90299b8 100644</span>+++
[1m--- a/uk1.adoc</span>+++
[1m+++ b/uk2.adoc</span>+++
[36m@@ -1,9 +1,9 @@</span>+++
= Heading
Body text
Image shown because binary code doesn't show in the text
The following statement matches the text so the '= Heading' line is placed at the top of the page if I replace with nothing.
^[^=]*.[+][\n]
But in Powershell, it isn't matching the text.
Get-Content "result2.adoc" | % { $_ -Replace '^[^=]*.[+][\n]', '' } | Out-File "result3.adoc";
Any ideas about why it doesn't work in Powershell?
My overall goal is to create a diff file of two versions of an AsciiDoc file and then replace the ASCII codes with HTML/CSS code to display the resulting AsciiDoc file with green/red track changes.
The simplest - and fastest - approach is to read the input file as a single, multiline string with Get-Content -Raw and let the regex passed to -replace operate across multiple lines:
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)' |
Set-Content result3.adoc
(?s) activates in-line option s which makes . match newline (\n) characters too.
^.+?\n(?==) matches from the start of the string (^) any number of characters (including newlines) (.+), non-greedily (?)
until a newline (\n) followed by a = is found.
(?=...) is a look-ahead assertion, which matches = without consuming it, i.e., without considering it part of the substring that matched.
Since no replacement operand is passed to -replace, the entire match is replaced with the implied empty string, i.e., what was matched is effectively removed.
As for what you tried:
The -replace operator passes its LHS through if no match is found, so you cannot use it to filter out non-matching lines.
Even if you match an undesired line in full and replace it with '' (the empty string), it will show up as an empty line in the output when sent to Set-Content or Out-File (>).
As for your specific regex, ^[^=]*.[+][\n] (whether or not the first ^ is followed by an ESC (0x1b) char.):
[\n] (just \n would suffice) tries to match a newline char. after a literal + ([+]), yet lines read individually with Get-Content (without -Raw) by definition are stripped of their trailing newline, so the \n will never match; instead, use $ to match the end of a line.
Instead of % (the built-in alias for the ForEach-Object cmdlet) you could have used ? (the built-in alias for the Where-Object cmdlet) to perform the desired filtering:
Get-Content result2.adoc | ? { $_ -notmatch '^\e\[' }
$_ -notmatch '^\e\[' returns $True only for lines that don't start (^) with an ESC character (\e, whose code point is 0x1b) followed by a literal [ (escaped as \[), thereby effectively filtering out the lines before the = Heading line.
However, the multi-line -replace command at the top is a more direct and faster expression of your intent.
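To compare the two approaches without touching any files, here is a small sketch that builds a stand-in for the diff output in memory. The sample lines and variable names are assumptions for illustration; only the two regexes come from the answer above.

```powershell
# Build a hypothetical diff header followed by the AsciiDoc content.
$esc = [char]0x1b
$text = @(
    "${esc}[1mdiff --git a/uk1.adoc b/uk2.adoc"
    "${esc}[1mindex b5d3bf7..90299b8 100644"
    '= Heading'
    'Body text'
) -join "`n"

# Approach 1: one multi-line -replace that drops everything before the first
# line that starts with '='.
$viaReplace = $text -replace '(?s)^.+?\n(?==)'

# Approach 2: split into lines and keep only those that do not start with ESC + '['.
$viaFilter = @(($text -split '\r?\n') | Where-Object { $_ -notmatch '^\e\[' })

$viaReplace
```

Note that the first approach returns a single string, whereas the second returns an array of lines.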
Here is the code I ended up with after help from @mklement0. This Powershell script creates MS Word-style track changes for two versions of an AsciiDoc file. It creates the Diff file, uses regex to replace ASCII codes with HTML/CSS tags, removes the Diff header (thank you!), uses AsciiDoctor to create an HTML file and then PrinceXML to create a PDF file of the output that I can send to document reviewers.
git diff --color-words file1.adoc file2.adoc > result.adoc;
Get-Content "result.adoc" | % {
$_ -Replace '(=+ ?)([A-Za-z\s]+)(\[m)', '$1$2' `
-Replace '\[32m', '+++<span style="color: #00cd00;">' `
-Replace '\[31m', '+++<span style="color: #cd0000; text-decoration: line-through;">' `
-Replace '\[m', '</span>+++' } | Out-File -encoding utf8 "result2.adoc" ;
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)', '' | Out-File -encoding utf8 "result3.adoc" ;
asciidoctor result3.adoc -o result3.html;
prince result3.html --javascript -o result3.pdf;
Read-Host -Prompt "Press Enter to exit"

Regular Expressions in powershell split

I need to strip out a UNC fqdn name down to just the name or IP depending on the input.
My examples would be
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
I want to end up with just tom or 123.43.234.23
I have the following code in my array, which strips out the domain name perfectly, but I'm still left with \\tom
-Split '\.(?!\d)')[0]
Your regex succeeds in splitting off the tokens of interest in principle, but it doesn't account for the leading \\ in the input strings.
You can use regex alternation (|) to include the leading \\ at the start as an additional -split separator.
Given that matching a separator at the very start of the input creates an empty element with index 0, you then need to access index 1 to get the substring of interest.
In short: The regex passed to -split should be '^\\\\|\.(?!\d)' instead of '\.(?!\d)', and the index used to access the resulting array should be [1] instead of [0]:
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' |
ForEach-Object { ($_ -Split '^\\\\|\.(?!\d)')[1] }
The above yields:
tom
123.43.234.23
Alternatively, you could remove the leading \\ in a separate step, using -replace:
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' |
ForEach-Object { ($_ -Split '\.(?!\d)')[0] -replace '^\\\\' }
Yet another alternative is to use a single -replace operation, which does not require a ForEach-Object call (doesn't require explicit iteration):
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' -replace
'(?x) ^\\\\ (.+?) \.\D .+', '$1'
Inline option (?x) (IgnoreWhiteSpace) allows you to make regexes more readable with insignificant whitespace: any unescaped whitespace can be used for visual formatting.
^\\\\ matches the \\ (escaped with \) at the start (^) of each string.
(.+?) matches one or more characters lazily.
\.\D matches a literal . followed by something other than a digit (\d matches a digit, \D is the negation of that).
.+ matches one or more remaining characters, i.e., the rest of the input.
$1 as the replacement operand refers to what the 1st capture group ((...)) in the regex matched, and, given that the regex was designed to consume the entire string, replaces it with just that.
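If the extraction is needed in more than one place, the single -replace operation can be wrapped in a small helper. The function name Get-UncHostName and its parameter are hypothetical, chosen only for this sketch; the regex is the one explained above.

```powershell
# Hypothetical helper: returns the host name or IP address part of a UNC FQDN.
function Get-UncHostName {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [string] $Path
    )
    process {
        # (?x) lets the pattern contain insignificant whitespace for readability.
        $Path -replace '(?x) ^\\\\ (.+?) \.\D .+', '$1'
    }
}

$hostNames = '\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' |
    Get-UncHostName

$hostNames
```

With the two sample inputs from the question, this should output tom and 123.43.234.23. Inputs that do not match the pattern are passed through unchanged, because -replace returns its input when nothing matches.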
I'm stealing Lee_Dailey's $InStuff
but appending a regex I used recently:
$InStuff = -split @'
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
'@
$InStuff |ForEach-Object {($_.Trim('\\') -split '\.(?!\d{1,3}(\.|$))')[0]}
Sample Output:
tom
123.43.234.23
As you can see here on RegEx101 the dots between the numbers are not matched
The Select-String cmdlet uses regex and populates a MatchInfo object with the matches (which can then be queried).
The regex "(\.?\d+)+|\w+" works for your particular example.
"\\tom.overflow.corp.com", "\\123.43.234.23.overflow.corp.com" |
Select-String "(\.?\d+)+|\w+" | % { $_.Matches.Value }
while this is NOT regex, it does work. [grin] i suspect that if you have a really large number of such items, then you will want a regex. they do tend to be faster than simple text operators.
this will get rid of the leading \\ and then replace the domain name with an empty string.
# fake reading in a text file
# in real life, use Get-Content
$InStuff = -split @'
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
'@
$DomainName = '.overflow.corp.com'
$InStuff.ForEach({
$_.TrimStart('\\').Replace($DomainName, '')
})
output ...
tom
123.43.234.23

Regex in Powershell fails to check for newlines

I'm trying to get the first block of releasenotes...
(See sample content in the code)
Whenever I use something simple it works, it only breaks when I try to
search across multiple lines (\n). I'm using (Get-Content $changelog | Out-String) because that gives back 1 string instead of an array from each line.
$changelog = 'C:\Source\VSTS\AcmeLab\AcmeLab Core\changelog.md'
$regex = '([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
(Get-Content $changelog | Out-String) | Select-String -Pattern $regex -AllMatches
<#
SAMPLE:
------
v1.0.23
- Adds an IContainer API.
- Bugfixes.
v1.0.22
- Hotfix: Language operators.
v1.0.21
- Support duplicate query parameters.
v1.0.20
- Splitting up the ICommand interface.
- Fixing the referrer header empty field value.
#>
The result I need is:
v1.0.23
- Adds an IContainer API.
- Bugfixes.
Update:
Using options..
$changelog = 'C:\Source\VSTS\AcmeLab\AcmeLab Core\changelog.md'
$regex = '(?smi)([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
Get-Content -Path $changelog -Raw | Select-String -Pattern $regex -AllMatches
I also get nothing.. (no matter if I use \n or \r\n)
Unless you're stuck with PowerShell v2, it's simpler and more efficient to use Get-Content -Raw to read an entire file as a single string; besides, Out-String adds an extra newline to the string.[1]
Since you're only looking for the first match, you can use the -match operator - no need for Select-String's -AllMatches switch.
Note: While you could use Select-String without it, it is more efficient to use the -match operator, given that you've read the entire file into memory already.
Regex matching is by default always case-insensitive in PowerShell, consistent with PowerShell's overall case-insensitivity.
Thus, the following returns the first block, if any:
if ((Get-Content -Raw $changelog) -match '(?m)^v\d+\.\d+\.\d+.*(\r?\n-\s?.*)+') {
# Match found - output it.
$Matches[0]
}
(?m) turns on inline regex option m (multi-line), which causes anchors ^ and $ to match the beginning and end of individual lines rather than the overall string's.
\r?\n matches both CRLF and LF-only newlines.
You could make the regex slightly more efficient by making the (...) subexpression non-capturing, given that you're not interested in what it captured: (?:...).
Note that -match itself returns a Boolean (with a scalar LHS), but information about the match is recorded in the automatic $Matches hashtable variable, whose 0 entry contains the overall match.
As for what you tried:
'([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
doesn't work, because by default ^ only matches at the very start of the input string and $ only matches at its very end, at the end of the last line (though possibly before a final newline).
As a result, nothing matches.
To make ^ and $ match at the start and end of each line, you'd have to turn on the multiline regex option (which you did in your 2nd attempt).
'(?smi)([Vv][0-9]+\.[0-9]+\.[0-9]+\n)(^-.*$\n)+'
doesn't work as intended, because by using option s (single-line) you've made . match newlines too, so that a greedy subexpression such as .* will match the remainder of the string, across lines.
As a result, everything from the first block on matches.
[1] This problematic behavior is discussed in GitHub issue #14444.
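The effect of the m and s options on the question's pattern can be checked on a short in-memory changelog. This is a sketch with made-up sample text and variable names; the pattern is the question's, minus the capture group around the version line.

```powershell
# Hypothetical two-block changelog with LF-only newlines.
$text = (@(
    'v1.0.23'
    '- Adds an IContainer API.'
    '- Bugfixes.'
    'v1.0.22'
    '- Hotfix: Language operators.'
) -join "`n") + "`n"

$pattern = 'v\d+\.\d+\.\d+\n(^-.*$\n)+'

# No options: ^ can only match at the start of the string, so nothing matches.
$noOptions = $text -match $pattern

# Option m: ^ and $ match per line, so the first block is found.
$withM = $text -match ('(?m)' + $pattern)
$firstBlock = $Matches[0]

# Options s and m: the greedy .* now runs across newlines to the end of the text.
$withSM = $text -match ('(?sm)' + $pattern)
$greedyMatch = $Matches[0]

$firstBlock
```

Here $noOptions should be $false, $firstBlock should contain only the v1.0.23 block, and $greedyMatch should span the whole sample text.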

Use Powershell and Regex to extract block of lines from a text file

I'm developing PowerShell scripts that use .NET regexes to find patterns in network device configurations. I'm having trouble extracting a block of lines from the config file, and also writing a regex that matches the carriage return and the newline. Below is my example: I have a config file containing the information below, which I want to extract.
vlan no description ports
999 unused Gi1/2,Gi1/3, Gi1/4, Gi1/5, Gi1/6, Gi/7, Gi/8, Gi1/9
Gi1/0, Gi1/11, Gi1/12, Gi1/13, Gi1/14, Gi1/15, Gi1/16
Gi1/17, Gi1/18
Here is my code
$File = Get-content C:\config.txt
$Regex = "(?sm)(^999.*(\r\n\s+.*)"
$unused_ports = Select-String -path $File -Pattern $Regex
Write-host $Unused_ports
it only displays the first line
999 unused Gi1/2,Gi1/3, Gi1/4, Gi1/5, Gi1/6, Gi/7, Gi/8, Gi1/9
I also tried the following $Regex
$Regex = '(?m)(^999.*\s+Gi1/10.*)'
$Regex = '(?m)(^999.*\r\n\s+Gi1/10.*)'
But none of the regex statements I used extracted all the ports (3 lines)
I also tried Get-Content C:\config.txt -Raw, but that displays everything in the config file.
I'd really appreciate help extracting all three lines of port numbers, and an explanation of how to match the carriage return and newline.
Wiktor Stribiżew provided the crucial pointer in a comment on the question[1]: You must use Get-Content -Raw to read the file contents into a single string so your regex can match across lines:
if ((Get-Content -Raw C:\Config.txt) -match '(?ms)^999.*?(?=\r?\n\S|\Z)') {
$Matches[0] # automatic variable $Matches reflects what was captured
}
The regex needed some tweaking, too, including the use of non-greedy quantifier .*?, as suggested by TheMadTechnician:
(?ms) sets regex options m (treats ^ and $ as line anchors) and s (makes . match \n (newlines) too).
^999.*? matches any line starting with 999 and any subsequent characters non-greedily.
(?=\r?\n\S|\Z) is a positive look-ahead assertion ((?=...)) that matches either a newline (\r?\n) followed by a non-whitespace character (\S) - assumed to be the start of the next block - or (|) the very end of the input (\Z) - in effect, this matches either the end of the file or the start of the next block, but without including it in the match recorded in $Matches.
[1] Wiktor also suggests regex (?m)^999.*(?:\r?\n.*){2}, which works well with the sample input, but is limited to blocks that have exactly 3 lines - by contrast, the solution presented here finds blocks of any length, as long as the non-initial block lines all have leading whitespace.
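The block extraction can be tried without a file on disk. In this sketch the config text is a shortened, made-up stand-in for the question's file, with the continuation lines indented as the regex requires, and a hypothetical next block (1000 ...) added to show where the match stops.

```powershell
# Hypothetical config content; in practice use: Get-Content -Raw C:\Config.txt
$config = @(
    'vlan no description ports'
    '999 unused Gi1/2, Gi1/3, Gi1/4'
    '    Gi1/10, Gi1/11, Gi1/12'
    '    Gi1/17, Gi1/18'
    '1000 used Gi2/1'
) -join "`n"

$block = $null
if ($config -match '(?ms)^999.*?(?=\r?\n\S|\Z)') {
    $block = $Matches[0]   # the 999 line plus its indented continuation lines
}

# Split the block into lines and trim the indentation for further processing.
$blockLines = @(($block -split '\r?\n') | ForEach-Object { $_.Trim() })
$blockLines
```

The match should stop before the 1000 line, because that is the first line after the 999 line that starts with a non-whitespace character.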

Powershell Regex acting per-line rather than on entire string

Given foo.txt
this is a file
it has some text
the text has three lines
The following regex replacement
(get-content -raw foo.txt) -replace ".*", "hello" | write-output
produces the output
hellohello
hellohello
hellohello
rather than the desired
hello
My understanding was that get-content returns the content as an array of strings, one per line. The -raw flag replaces this behavior with returning the contents as a single string. As far as I know, ".*" should match the entire string, but instead it matches twice on each line.
Please advise.
Use the inline (?s) (dotall) modifier which forces . to span across newlines.
(Get-Content .\foo.txt -Raw) -replace "(?s).+", "hello"
Example:
PS> $data = Get-Content .\foo.txt -Raw
PS> $data
this is a file
it has some text
the text has three lines
PS> $data -replace "(?s).+", "hello"
hello
I can't explain it other than to say that . appears not to be matching newline characters so you get one match for each complete line then one match for the zero characters at the end of each line.
This also explains why .+ produces one hello per line.
You can "fix" this by using a better pattern that does match the newline characters.
(Get-Content -raw .\foo.txt) -replace "(.|\r|\n)+", "hello"
From https://stackoverflow.com/a/13674250/1252649,
The trick around DotAll mode is to use [\s\S] instead of .. This character class matches any character ...
Of course, this raises the question as to what exactly . is supposed to match other than 'any character'.
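In .NET regexes, which PowerShell uses, . matches any character except \n unless the single-line option (?s) is on. The following sketch contrasts the three patterns discussed in this thread on an in-memory copy of foo.txt; the LF-only newlines and the variable names are assumptions made for the example.

```powershell
# Stand-in for (Get-Content -Raw foo.txt), using LF-only newlines.
$text = "this is a file`nit has some text`nthe text has three lines"

# Without (?s), .* stops at each newline and also matches the empty string
# at the end of each line, hence two replacements per line.
$perLine = $text -replace '.*', 'hello'

# With (?s), . matches newlines too, so a single match covers the whole string.
$dotAll = $text -replace '(?s).+', 'hello'

# [\s\S] matches any character regardless of regex options.
$charClass = $text -replace '[\s\S]+', 'hello'

$dotAll
```

Only $perLine should still contain newlines; the other two results should be the single word hello.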