Remove everything up to and including triple newline - regex

I am very new to Powershell, so I am no doubt doing something really stupid that causes my attempts to get this to work to not actually work... but after an hour of struggling, I'd love a hand.
I have a file for which a triple newline (two empty lines) marks a boundary. I want only everything that comes after the boundary.
My latest fruitless attempt looks like this:
$content = Get-Content -Raw $Path
$content = $content -Replace '^.+`r`n`r`n`r`n', ''
All my attempts to even match a single new line have failed. The -Raw parameter is because I came to understand this would change the way newlines were processed, but it didn't change anything.
I am also aware the regex isn't ideal; I'd want to make it non-greedy but I want to get a super-basic test case working first given my unfamiliarity with whatever flavor of regular expressions Powershell supports. (I assume I can just stick a ? after the + to fix that, but first things first.)
The goal is to go from
useless metadata I don't care about
more useless metadata


actual content
to this:
actual content
What am I doing wrong?

The '`r`n' is a literal 4-character string, while "`r`n" is a 2-character line-break string, so your pattern cannot match any line breaks. It is safer to use \r to match CR and \n to match LF in PowerShell regex patterns.
Also note that there are several lines between the start of the string and your delimiter, but . does not match a newline by default. You need the (?s) inline modifier to make . match newlines, too.
Use
$content -replace '(?s)^.*?(?:\r?\n){3}'
Details
(?s) - a Singleline option that makes . match newlines, too
^ - start of the string
.*? - any 0+ chars, as few as possible
(?:\r?\n){3} - triple CRLF/LF line break.
See the .NET regex demo.
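For anyone who wants to try the same pattern outside PowerShell, here is a minimal Python sketch. It assumes Python's re module instead of the .NET engine (the two should agree for this particular pattern); the helper name strip_header and the sample text are made up for illustration.

```python
import re

# Lazily match from the start of the string through the first run of
# three consecutive line breaks (LF or CRLF). (?s) lets "." cross newlines.
BOUNDARY = re.compile(r'(?s)^.*?(?:\r?\n){3}')

def strip_header(text: str) -> str:
    """Return everything after the first triple line break (hypothetical helper)."""
    return BOUNDARY.sub('', text, count=1)

# Made-up sample: two metadata lines, two empty lines, then the content.
sample = "useless metadata\nmore useless metadata\n\n\nactual content\n"
print(strip_header(sample))
```

If the boundary is not present, the pattern does not match and the text comes back unchanged.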

Related

How can I express this regex with sed?

I have this regex that I would like to use with sed. I would like to use sed, since I want to batch process a few thousand files and my editor does not like that.
Find: "some_string":"ab[\s\S\n]+"other_string_
Replace: "some_string":"removed text"other_string_
Find basically matches everything between some_string and other_string, including special chars like , ; - or _ and replaces it with a warning that text was removed.
I was thinking about combining the character classes [[:space:]] and [[:alnum:]], which did not work.
With the FreeBSD sed that ships with macOS, you can use
sed -i '' -e '1h;2,$H;$!d;g' -e 's/"some_string":"ab.*"other_string_/"some_string":"removed text"other_string_/g' file
The 1h;2,$H;$!d;g part reads the whole file into memory so that all line breaks are exposed to the regex, and then "some_string":"ab.*"other_string_ matches text from "some_string":"ab till the last occurrence of "other_string_ and replaces with the RHS text.
You need to use -i '' with FreeBSD sed to modify the file in place without keeping a backup copy.
By the way, if you decide to use perl, you really can use the -0777 option to enable file slurping with the s modifier (that makes . match any chars including line break chars) and use something like
perl -i -0777 -pe 's/"some_string":"\Kab.*(?="other_string_)/removed text/gs' file
Here,
"some_string":" - matches literal text
\K - omits the text matched so far from the current match memory buffer
ab - matches ab
.* - any zero or more chars as many as possible
OR .*? - any zero or more chars as few as possible
(?="other_string_) - a positive lookahead (that matches the text but does not append to the match value) making sure there is "other_string_ immediately on the right.
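To see the greedy and lazy variants side by side, here is a rough Python sketch of the same slurp-then-replace idea. It is not a translation of the perl one-liner: Python's re module has no \K, so a capturing group that is put back in the replacement stands in for it. The sample text is invented.

```python
import re

# Made-up sample: two records, each with a value that spans two lines.
text = (
    '"some_string":"ab line 1\nline 2"other_string_1\n'
    '"some_string":"ab line 3\nline 4"other_string_2\n'
)

# re.DOTALL plays the role of the s modifier: "." also matches line breaks.
# The prefix is captured and restored with \g<1> because re lacks \K.
greedy = re.compile(r'("some_string":")ab.*(?="other_string_)', re.DOTALL)
lazy = re.compile(r'("some_string":")ab.*?(?="other_string_)', re.DOTALL)

greedy_result = greedy.sub(r'\g<1>removed text', text)
lazy_result = lazy.sub(r'\g<1>removed text', text)

print(greedy_result)  # one replacement, running to the LAST "other_string_
print(lazy_result)    # one replacement per record
```

With several records in one file, the lazy form is usually the one you want, since the greedy form swallows everything between the first and last record.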

Multiline Regex Lookbehind Failing in Powershell

I'm trying to parse a particular text file. One portion of the file is:
Installed HotFix
n/a Internet Explorer - 0
Applications:
In powershell, this is currently in a file C:\temp\software.txt. I'm trying to get it to return all lines in between "HotFix" and "Applications:" (As there may be more in the future.)
My current command looks like this:
Get-Content -Raw -Path 'C:\temp\software.txt' | Where-Object { $_ -match '(?<=HotFix\n)((.*?\n)+)(?=Applications)' }
Other regex I've tried:
'(?<=HotFix`n)((.*?`n)+)(?=Applications)'
'(?<=HotFix`n)((.*?\n)+)(?=Applications)'
'(?<=HotFix\n)((.*?`n)+)(?=Applications)'
'(?<=HotFix$)((.*?\n)+)(?=Applications)'
'(?<=HotFix)((.*?\n)+)(?=Applications)'
'(?<=HotFix)((.*?`n)+)(?=Applications)'
I think Select-String will provide better results here:
((Get-Content -Path 'C:\temp\software.txt' -Raw |
Select-String -Pattern '(?sm)(?<=HotFix\s*$).*?(?=^Applications:)' -AllMatches).Matches.Value).Trim()
Regex modifier s is used because you are expecting the . character to potentially match newline characters. Regex modifier m is used so that the anchors $ and ^ match at the end and start of each line, not only at the end and start of the whole string. Together that syntax is (?sm) in PowerShell.
Where {$_ -match ...} will return anything that makes the condition true. Since you are passing a Get-Content -Raw output, the entire contents of the file will be one string and therefore the entire string will output on a true condition.
Since you used -match here against a single string, any successful matches will be stored in the $matches automatic variable. Your matched string would be available in $matches[0]. If you were expecting multiple matches, -match will not work as constructed here.
Alternatively, the .NET Matches() method of the Regex class can also do the job:
[regex]::Matches((Get-Content 'c:\temp\software.txt' -Raw),'(?sm)(?<=HotFix\s*$).*?(?=^Applications:)').Value.Trim()
Without Trim(), you'd need to understand your newline character situation:
[regex]::Matches((Get-Content software.txt -Raw),'(?m)(?<=HotFix\r?\n?)[^\r\n]+(?=\r?\n?^Applications:)').Value
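An alternative that avoids a multiline regex could use a switch statement, which tests one line at a time:
switch -File Software.txt -Regex {
    # Start collecting after a line that ends with HotFix
    'HotFix\s*$'     { $Hotfix, $Applications = $true, $false }
    # Stop collecting at the Applications: line
    '^Applications:' { $Applications = $true }
    default {
        if ($Hotfix -and !$Applications) {
            $_
        }
    }
}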
If you read the file into a string, the following regular expression will match the lines of interest:
/(?<=HotFix\n).*?(?=\nApplications:)/s
demo
The regex reads:
Match zero or more characters, lazily (?), preceded by the string "HotFix\n" and followed by the string "\nApplications:".
(?<=HotFix\n) is a positive lookbehind; (?=\nApplications:) is a positive lookahead.
The flag s (/s) causes .*? to continue past the ends of lines. (Some languages have a different flag that has the same effect.)
.*? (lazy match) is used in place of .* (greedy match) in the event that there is more than one line following the "HotFix" line that begins "Applications:". The lazy version will match up to the first; the greedy version, up to the last.
I would not be inclined to use a regex for this task. For one, the entire file must be read into a string, which could be problematic (memory-wise) if the file is sufficiently large. Instead, I would simply read the file line-by-line, keeping only the current line in memory. Once the "HotFix" line has been read, save the following lines until the "Applications:" line is read. Then, after closing the file, you're done.
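The line-by-line idea could look roughly like this in Python. This is only a sketch: the function name lines_between and its defaults are my own, and the sample lines are taken from the question with one extra line added.

```python
from typing import Iterable, Iterator

def lines_between(lines: Iterable[str], start: str = "HotFix",
                  stop: str = "Applications:") -> Iterator[str]:
    """Yield the lines after one ending in `start`, up to one beginning with `stop`."""
    collecting = False
    for line in lines:
        stripped = line.rstrip("\r\n")
        if collecting and stripped.startswith(stop):
            return  # reached the end marker, nothing more to read
        if collecting:
            yield stripped
        elif stripped.rstrip().endswith(start):
            collecting = True  # start marker seen, collect from the next line

# Any iterable of lines works, including an open file object.
sample = ["Installed HotFix\n", "n/a Internet Explorer - 0\n",
          "Applications:\n", "ignored\n"]
result = list(lines_between(sample))
print(result)
```

Because it is a generator, only the current line is held in memory, which is the point of avoiding the whole-file regex.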
Instead of using lookarounds, you could make use of a capturing group.
First match the line that ends with HotFix. Then capture in group 1 all the following lines that do not start with Applications:, and then match Applications: itself.
^.*\bHotFix\r?\n((?:(?!Applications:).*\r?\n)+)Applications:
Explanation
^.*\bHotFix\r?\n Match the line that ends with HotFix
( Capture group 1
(?: Non capture group
(?!Applications:).*\r?\n Match the whole line if it does not start with Applications:
)+ Close non capturing group and repeat 1+ times to match all lines
) Close group 1
Applications: Match literally
Regex demo
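Here is a small Python sketch of the capturing-group pattern. Note one assumption: I pass re.MULTILINE so that ^ can match at the start of a line in the middle of the text, since the pattern above does not state its flags. The KB123 line and the surrounding lines are invented sample data.

```python
import re

pattern = re.compile(
    r'^.*\bHotFix\r?\n((?:(?!Applications:).*\r?\n)+)Applications:',
    re.MULTILINE,  # assumption: lets ^ match at any line start
)

# Made-up sample with two lines between the markers.
text = ("Header\nInstalled HotFix\nn/a Internet Explorer - 0\n"
        "KB123 - 1\nApplications:\nFoo\n")

match = pattern.search(text)
captured = match.group(1) if match else ""
print(captured)
```

Group 1 holds the lines between the two markers, each still ending in its line break, so you may want to strip or split the result.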

matching two chars with multiple lines in between

I am new to regex and I am using Perl.
I have below tag:
<CFSC>cfsc_service=TRUE
SEC=1
licenses=10
expires=20170511
</CFSC>
I want to match anything between <CFSC> and </CFSC> tags.
I tried /<CFSC>.*?\n.*?\n.*?\n.*?\n<\/CFSC>/
and /<CFSC>(.*)<\/CFSC>/ but had no luck.
You need the /s single line modifier to make . match line breaks as well.
Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
See this example.
my $foo = qq{<CFSC>cfsc_service=TRUE
SEC=1
licenses=10
expires=20170511
</CFSC>};
$foo =~ m{>(.*)</CFSC>}s;
print $1;
You also need to use a different delimiter than /, or escape it.
Try
/<CFSC>(.*)<\/CFSC>/s
The final s makes the . match newline chars (\n = 0x0a), which it usually doesn't match:
Treat string as single line. That is, change "." to match any
character whatsoever, even a newline, which normally it would not
match.
from http://perldoc.perl.org/perlre.html#Modifiers
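The effect of the modifier is easy to check. Below is a minimal sketch in Python instead of Perl, on the assumption that re.DOTALL behaves like Perl's /s here; the data is the tag from the question.

```python
import re

foo = "<CFSC>cfsc_service=TRUE\nSEC=1\nlicenses=10\nexpires=20170511\n</CFSC>"

# Without DOTALL, "." stops at the first newline, so the closing tag is never reached.
without_flag = re.search(r'<CFSC>(.*)</CFSC>', foo)

# With DOTALL, "." also matches newlines and the whole block is captured.
with_flag = re.search(r'<CFSC>(.*)</CFSC>', foo, re.DOTALL)

print(without_flag)        # no match object
print(with_flag.group(1))  # the four key=value lines
```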
Try this:
$foo =~ m/<CFSC>((?:(?!<\/CFSC>).)*)<\/CFSC>/gs;
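Modifiers and syntax:
g - global: find all matches, not just the first
s - single line: . also matches newlines
i - case-insensitive matching (not used in the pattern above)
\ - escapes the following character (here, the / delimiter)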
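The (?:(?!<\/CFSC>).)* part matters when there is more than one block, because it can never run past a closing tag. A short Python sketch, where re.findall stands in for /g and re.DOTALL for /s; the two-block sample is invented:

```python
import re

# Made-up sample with two <CFSC> blocks and some noise in between.
text = "<CFSC>a=1\nb=2\n</CFSC>\nnoise\n<CFSC>c=3\n</CFSC>"

# Each "." is only consumed if </CFSC> does not start at that position.
blocks = re.findall(r'<CFSC>((?:(?!</CFSC>).)*)</CFSC>', text, re.DOTALL)
print(blocks)
```

Each block comes back as its own list entry, with the noise between the blocks left out.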

Regex to remove string after file extension

I'm using PowerShell to query for a service path from which results should resemble C:\directory\sub-directory\service.exe
Some results however also include characters after the .exe file extension, for example output may resemble one of the following:
C:\directory\sub-directory\service.exe ThisTextNeedsRemoving
C:\directory\sub-directory\service.exe -ThisTextNeedsRemoving
C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving
i.e. ThisTextNeedsRemoving may be preceded by a space, hyphen or forward slash.
I can use the regex -replace '($*.exe).*' to remove everything from the .exe file extension onward, but that removes the .exe too. How do I keep the .exe in the results?
You can use a look-around:
$txt = 'C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving'
$txt -replace '(?<=\.exe).+', ''
This uses a look-behind, which is a zero-width match, so the .exe itself doesn't get replaced.
Debuggex Demo
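A quick way to check the look-behind against all three cases from the question is the Python sketch below. It assumes Python's re module, which accepts this look-behind because it has a fixed width; the fourth sample, with no trailing text, is my addition.

```python
import re

samples = [
    r'C:\directory\sub-directory\service.exe ThisTextNeedsRemoving',
    r'C:\directory\sub-directory\service.exe -ThisTextNeedsRemoving',
    r'C:\directory\sub-directory\service.exe /ThisTextNeedsRemoving',
    r'C:\directory\sub-directory\service.exe',  # nothing to remove
]

# Remove whatever follows ".exe"; the look-behind itself consumes nothing.
cleaned = [re.sub(r'(?<=\.exe).+', '', s) for s in samples]

for path in cleaned:
    print(path)
```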
Using a lookbehind is possible, but note that lookbehinds are only necessary when you need to specify some rather complex condition or to obtain overlapping matches. In most cases, when you can do without a lookbehind, you should consider a non-lookbehind solution, because a lookbehind is a rather costly operation. It is easier to check once whether the current character is not whitespace than to also check whether each character is preceded by something else, be it a single character, a whole substring, or a more complex pattern.
Thus, I'd suggest using a solution based on capturing mechanism, with a backreference in the replacement part to restore the captured substring in the result:
$s -replace '^(\S+\.exe) .*','$1'
or - for paths containing spaces and not inside double quotes:
$s -replace '^(.*?\.exe) .*','$1'
Explanation:
^ - start of string
(\S+\.exe) - one or more characters other than whitespace (\S+) (or, with .*?, any characters other than a newline, as few as possible) followed by a literal . and exe
 .* - a space, then any number of characters other than a newline.
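The difference between the two variants shows up on a path that contains a space. Here is a small Python sketch; the helper name trim_args and the Program Files path are invented, and \g<1> is Python's spelling of the $1 backreference.

```python
import re

def trim_args(path: str) -> str:
    """Keep everything up to the first '.exe' that is followed by a space (hypothetical helper)."""
    return re.sub(r'^(.*?\.exe) .*', r'\g<1>', path)

spaced = r'C:\Program Files\svc\service.exe -k arg'

# The \S+ variant cannot get past the space in "Program Files",
# so it does not match and the string is returned unchanged.
strict_result = re.sub(r'^(\S+\.exe) .*', r'\g<1>', spaced)

print(trim_args(spaced))
print(strict_result)
```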

Powershell replace regex between two tags

Given a block of arbitrary text enclosed by specific tags, I would like to replace the whole chunk with something else (in the example, "BANANA")
$newvar = $oldvar -replace "<!-- URL -->(*.)<!-- END -->","BANANA"
Is there a mode in PS regex to not require escaping and can the syntax then be as simple as this to achieve the replacement?
UPDATE: I understand now that it should be .*, not *., but still no dice. The match covers multiple lines, if that adds complexity to the regex or requires other options.
It looks to me like you have the .* in reverse (*.). Apart from that, try:
$newvar = $oldvar -creplace '(?s)<!-- URL -->.*?<!-- END -->', 'BANANA'
In response to your comments, I have made the .*? lazy so it will not "overmatch" (for details on lazy vs. greedy, see the reference section)
Also in reference to your comments, the (?s) activates DOTALL mode, allowing the .*? to match across multiple lines.
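To illustrate why the lazy quantifier matters once there are two tagged blocks, here is a minimal Python sketch. It assumes Python's re module, which is case-sensitive by default, much like -creplace; the sample text and the example.invalid URL are made up.

```python
import re

# Made-up input with two tagged blocks, the first spanning several lines.
old = ("keep\n<!-- URL -->\nhttp://example.invalid/a\n<!-- END -->\n"
       "middle\n<!-- URL -->x<!-- END -->\nend")

# Lazy: each block is replaced on its own.
lazy_result = re.sub(r'(?s)<!-- URL -->.*?<!-- END -->', 'BANANA', old)

# Greedy: one match from the first opening tag to the last closing tag.
greedy_result = re.sub(r'(?s)<!-- URL -->.*<!-- END -->', 'BANANA', old)

print(lazy_result)
print(greedy_result)
```

The greedy version also eats the "middle" line, which is the overmatching the lazy .*? avoids.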
Reference
The Many Degrees of Regex Greed