Regex multi-line search between two tags - PowerShell

I am unable to apply many of the other PowerShell regex solutions to my problem. The answer may well already be on Stack Overflow, but my lack of experience with PowerShell prevents me from working out how to adapt those solutions to my question.
I have a text file containing an XML document tree (I bring the document tree into PowerShell as one large string; see edit 1) that uses XML tags to mark where certain content is. I need to pull the file name from between the fileName tags. Sometimes both tags and the file name are all on one line; other times the tags and the file name are each on a separate line. An example of my input data is below:
<files>
<file>
<fileName>
ThisTextFileINeedReturned.txt
</fileName>
<lastModifiedTime>1511883780000</lastModifiedTime>
<size>852192</size>
<isDirectory>false</isDirectory>
<isRegularFile>true</isRegularFile>
<isSymbolicLink>false</isSymbolicLink>
<isOther>false</isOther>
<group>group</group>
<transferStatus>Done</transferStatus>
</file>
<file>
<fileName>AnotherTextFileINeedReturned.txt</fileName>
<lastModifiedTime>1511883780000</lastModifiedTime>
<size>852192</size>
<isDirectory>false</isDirectory>
<isRegularFile>true</isRegularFile>
<isSymbolicLink>false</isSymbolicLink>
<isOther>false</isOther>
<group>group</group>
<transferStatus>Done</transferStatus>
</file>
I have written the following code to find the content within the tags. It works when the fileName tags and the file name are on the same line. The problem I'm having is the case where they are each on a separate line (the first entry in the example I provided above). I have already loaded the large string above into $xmldata.
$xmldata -match '<fileName>(.*?)(</fileName>)'
$matches
Using the example text I displayed above, the output I receive is as follows:
<fileName>AnotherTextFileINeedReturned.txt</fileName>
I'm ok with receiving the tags, but I also need the name of the file that is on multiple lines. Like this...
<fileName>
ThisTextFileINeedReturned.txt
</fileName>
<fileName>AnotherTextFileINeedReturned.txt</fileName>
Or any variation that would give me both of the names of the text files. I have seen the (?m) part used before, but I haven't been able to successfully implement it. Thanks in advance for the help!! Let me know if you need any other information!

You should be able to get around that without using any regex. PowerShell has good built-in XML support. Extracting the file name is as easy as:
$Xml = @"
<files>
<file>
<fileName>
ThisTextFileINeedReturned.txt
</fileName>
<lastModifiedTime>1511883780000</lastModifiedTime>
<size>852192</size>
<isDirectory>false</isDirectory>
<isRegularFile>true</isRegularFile>
<isSymbolicLink>false</isSymbolicLink>
<isOther>false</isOther>
<group>group</group>
<transferStatus>Done</transferStatus>
</file>
<file>
<fileName>AnotherTextFileINeedReturned.txt</fileName>
<lastModifiedTime>1511883780000</lastModifiedTime>
<size>852192</size>
<isDirectory>false</isDirectory>
<isRegularFile>true</isRegularFile>
<isSymbolicLink>false</isSymbolicLink>
<isOther>false</isOther>
<group>group</group>
<transferStatus>Done</transferStatus>
</file>
</files>
"@
Select-Xml -Content $Xml -XPath "//files/file/fileName" | foreach {$_.node.InnerXML.Trim()}
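Alternatively, if the XML is already in a string variable, a minimal sketch (same document as above) is to cast it to the [xml] type accelerator and use dot notation:

```powershell
# Cast the here-string to an XmlDocument and walk the tree;
# Trim() strips the newlines around the multi-line file name.
$doc = [xml]$Xml
$doc.files.file | ForEach-Object { $_.fileName.Trim() }
```

Both approaches return ThisTextFileINeedReturned.txt and AnotherTextFileINeedReturned.txt without the surrounding tags.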

You don't explain how you get your data, but I guess you are using Get-Content to read your source file. Get-Content reads the content one line at a time and returns a collection of objects, each of which represents a line of content. In other words, you are probably matching against each line separately and therefore never find the matches that span multiple lines.
If this is indeed the case, the solution would be to simply join the lines first:
($xmldata -Join "") -match '<fileName>(.*?)(</fileName>)'
And check your matches, e.g.:
$Matches[0]
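Note that -match only reports the first match, and $Matches holds that single hit. If you need every fileName value, a sketch using [regex]::Matches on the joined string (the Singleline option makes . match newlines, in case any remain in the input):

```powershell
$joined = $xmldata -join ''
[regex]::Matches($joined, '<fileName>(.*?)</fileName>', 'Singleline') |
    ForEach-Object { $_.Groups[1].Value.Trim() }
```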

Related

A regular expression that I know is correct doesn't work with awk. Please advise

Following up on an answer by @dawg to my question about how to delete multiple sections in a file based on known patterns, I want to use a regular expression in awk to identify the start of the section(s) I want to delete.
The file I am working with is an xml file. It is in fact the file containing the recently used filenames list (RUFL) in Linux Mint (~/.local/share/recently-used.xbel).
This is how the RUFL is structured:
<?xml version="1.0" encoding="UTF-8"?>
<xbel version="1.0"
xmlns:bookmark="http://www.freedesktop.org/standards/desktop-bookmarks"
xmlns:mime="http://www.freedesktop.org/standards/shared-mime-info"
>
<bookmark href="file:///home/ocor61/Documents/Linux/Linux%20Mint%20Cinnamon%20Keyboard%20Shortcuts.pdf" added="2021-07-18T01:57:02Z" modified="2021-07-18T01:57:02Z" visited="1969-12-31T23:59:59Z">
<info>
<metadata owner="http://freedesktop.org">
<mime:mime-type type="application/pdf"/>
<bookmark:applications>
<bookmark:application name="Document Viewer" exec="&apos;xreader %u&apos;" modified="2021-07-18T01:57:02Z" count="1"/>
</bookmark:applications>
</metadata>
</info>
</bookmark>
<bookmark href="file:///home/ocor61/Documents/Linux/Linux%20Command%20Line%20Cheat%20Sheet.pdf" added="2021-07-18T01:57:09Z" modified="2021-07-18T01:57:09Z" visited="1969-12-31T23:59:59Z">
<info>
<metadata owner="http://freedesktop.org">
<mime:mime-type type="application/pdf"/>
<bookmark:applications>
<bookmark:application name="Document Viewer" exec="&apos;xreader %u&apos;" modified="2021-07-18T01:57:09Z" count="1"/>
</bookmark:applications>
</metadata>
</info>
</bookmark>
<bookmark href="file:///home/ocor61/Documents/work.bfproject" added="2021-07-20T10:52:59Z" modified="2021-07-22T08:41:57Z" visited="1969-12-31T23:59:59Z">
<info>
<metadata owner="http://freedesktop.org">
<mime:mime-type type="application/x-bluefish-project"/>
<bookmark:applications>
<bookmark:application name="bluefish" exec="&apos;bluefish %u&apos;" modified="2021-07-22T08:41:57Z" count="2"/>
</bookmark:applications>
</metadata>
</info>
</bookmark>
</xbel>
I am working on a script to remove filenames from the list. It works fine, but I am also working with an array of patterns that should not be used. For example: if the pattern [bookmark] were used to identify a section that must be removed, the entire file would become unusable. The same goes for parts of [bookmark], but also for href, added, info... you get my drift.
So, I want to work with a regexp to counter the problems of entering patterns that cannot be used.
Currently, this is the awk code I am using (thanks to @dawg):
ENDLINE='</bookmark>'
awk -v f=1 -v st="$1" -v end="$ENDLINE" '
match($0, st) {f=0}
f
match($0, end){f=1}' ~/.local/share/recently-used.xbel
$1 would be the pattern a user enters at the command line, which is part of the file name that must be removed from the RUFL.
The following is the code I would like to use, including the regexp, which doesn't work:
STARTLINE='/(<bookmark href)(.*)($1)(.*)(>)/'
ENDLINE='</bookmark>'
awk -v f=1 -v st="$STARTLINE" -v end="$ENDLINE" '
match($0, st) {f=0}
f
match($0, end){f=1}' ~/.local/share/recently-used.xbel
I have tested the regular expression at https://regexr.com/, so I know it is correct. However, when I use it in my script, this is the error message I am getting:
./ruffle.sh: line 99: syntax error near unexpected token `$0,'
./ruffle.sh: line 99: ` match($0, st) {f=0}'
I have also tried to enter the regexp itself in the awk command line instead of the variable, but that has the same result.
I don't know how to proceed, so any help is appreciated.
The answer to my question lies in how regular expressions differ between environments. The website I used to check my regexp validates it for languages like JavaScript, but not for Bash, awk, or other shell tools.
With shellcheck.net as well as by putting the command 'set -vx' in my script right before the awk command, I managed to work things out.
Another mistake I made was to attempt to catch the complete line in the regexp, while I need only the part in that line that can hold the pattern that is entered (which is the part between 'file:' and 'added' in the file ~/.local/share/recently-used.xbel).
The regexp that ultimately works for me now with the variable STARTLINE is:
STARTLINE='file:.*'$1'.*added='
I will have to look into using an XML parser, thanks for the suggestion! For now, however, my script works. Thanks @Sundeep and @EdMorton!
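For reference, the corrected pattern can be exercised end to end. The sketch below uses a small hypothetical sample file in /tmp rather than the real ~/.local/share/recently-used.xbel, with a hard-coded pattern standing in for $1:

```shell
# Sample RUFL-style fragment (hypothetical paths), stored in a temp file.
cat > /tmp/sample.xbel <<'EOF'
<bookmark href="file:///home/user/keep.pdf" added="2021-07-18T01:57:02Z">
<info>keep</info>
</bookmark>
<bookmark href="file:///home/user/remove.pdf" added="2021-07-18T01:57:09Z">
<info>remove</info>
</bookmark>
EOF

pattern='remove'                      # stands in for the user-supplied $1
STARTLINE='file:.*'$pattern'.*added='
ENDLINE='</bookmark>'

# f=1: print lines. Matching STARTLINE turns printing off
# until that section's closing </bookmark> is seen.
awk -v f=1 -v st="$STARTLINE" -v end="$ENDLINE" '
match($0, st) {f=0}
f
match($0, end){f=1}' /tmp/sample.xbel
```

Only the first bookmark survives; the one whose href contains the pattern is dropped, including its closing tag.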

Extract text between tags using PowerShell

I have an XML file that includes many instances of a particular tag/element. I am trying to capture each of these and then dump them into a new file.
I have the following script which does work, in that it takes the first occurrence of the text I am after and displays it to the console.
I am trying to incorporate ForEach-Object to retrieve all occurrences of ...allContent... but am failing to get it working.
Here is my working script that displays the output I am after for the first occurrence only.
$firstString = "<RunListItems>"
$secondString = "</RunListItems>"
#Get content from file
$file = Get-Content "C:\Users\Bob\Desktop\ps\order.xml"
#Regex pattern to compare two strings
$pattern = "$firstString(.*?)$secondString"
#Perform the operation
$result = [regex]::Match($file,$pattern).Groups[1].Value
#Return result
return $result
Parsing XML text with regular expressions is brittle and therefore ill-advised.
PowerShell provides easy access to proper XML parsers, and in the case at hand you can use the Select-Xml cmdlet:
Select-Xml //RunListItems C:\Users\Bob\Desktop\ps\order.xml |
ForEach-Object { $_.Node.InnerText }
//RunListItems is an XPath query that selects all elements whose tag name is RunListItems throughout the document, irrespective of their position in the hierarchy (//)
The .Node property of the output objects (of type Microsoft.PowerShell.Commands.SelectXmlInfo) contains the matching element, and its .InnerText property returns its text content.
Note: If your XML document uses namespaces, you must pass a hashtable with prefix-to-URI mappings to Select-Xml's -Namespace parameter, and use these prefixes in the XPath query (-XPath) when referring to elements - see this answer for more information.
To save the output strings to a file, separated with newlines, simply append something like
| Set-Content out.txt; use Set-Content's -Encoding parameter to control the encoding, if needed.[1]
[1] In Windows PowerShell (versions up to 5.1), Set-Content defaults to the active ANSI code page. In PowerShell (Core) 7+, the consistent default across all cmdlets is BOM-less UTF-8. See this answer for more information.

PowerShell - Removing multiple lines of text between delimiters in a text file

I edit XML files and am using PowerShell to open them in Notepad and replace strings of text. Given two distinct delimiters, a starting and stopping, that appear multiple times in an XML file, I would like to completely remove the text between the delimiters (whether the delimiters get removed as well or not does not matter to me).
In the following example text, I want to completely remove the text between my starting and ending delimiter, but keep all the text before and after it.
The issue I am facing is the fact that there are newlines at the end of each line of text that prevents me from doing a simple:
-replace "<!--A6-->.*?<!--A6 end-->", "KEVIN"
Starting Delimiter:
<!--A6-->
Stopping Delimiter:
<!--A6 end-->
Example Text:
<listItem>
<para>Apple iPhone 6</para>
</listItem>
<listItem>
<para>Apple iPhone 8</para>
</listItem>
<!--A6-->
<listItem>
<para>Apple iPhone X</para>
</listItem>
<!--A6 end-->
</randomList></para>
</levelledPara>
<levelledPara>
<!--A6-->
<title>Available Apple iPhone Colors</title>
<para>The current iPhone model is available in
the follow colors. You can purchase this model
in store, or online.</para>
<!--A6 end-->
<para>If the color option that you want is out
of stock, you can find them at the following
website link.</para>
Current Code:
$Directory = "C:\Users\hellokevin\Desktop\PSTest"
$FindBook = "Book"
$ReplaceBook = "Novel"
$FindBike = "Bike"
$ReplaceBike = "Bicycle"
Get-ChildItem -Path $Directory -Recurse |
Select-Object -Expand FullName|
ForEach-Object {
(Get-Content $_) -replace $FindBook,$ReplaceBook -replace "<!--A6-->.*?<!--A6 end-->", "KEVIN" |
Set-Content ($_ + "_new.xml")
}
Any help would be greatly appreciated. Being fairly new to PowerShell, I don't know how to factor in the newlines at the end of each line in my code. Thanks for looking!
Using search-and-replace on XML files is extremely inadvisable and should be avoided at all costs, because it's way too easy to damage the XML this way.
There are better ways of modifying XML, and they all follow this schema:
load the XML document
modify the document tree
write the XML document back to file.
For your case ("remove nodes between markers") this could be as follows:
load the XML document
look at all XML nodes, in document order
when we see a comment that reads "A6", set a flag to remove nodes from now on
when we see a comment that reads "A6 end", unset that flag
collect all nodes that should be removed (that come up while the flag is on)
in a last step, remove them
write the XML document back to file.
The following program would do exactly this (and also remove the "A6" comments themselves):
$doc = New-Object xml
$doc.Load("C:\path\to\your.xml")
$toRemove = @()
$A6flag = $false
foreach ($node in $doc.SelectNodes('//node()')) {
if ($node.NodeType -eq "Comment") {
if ($node.Value -eq 'A6') {
$A6flag = $true
$toRemove += $node
} elseif ($node.Value -eq 'A6 end') {
$A6flag = $false
$toRemove += $node
}
} elseif ($A6flag) {
$toRemove += $node
}
}
foreach ($node in $toRemove) {
[void]$node.ParentNode.RemoveChild($node)
}
$doc.Save("C:\path\to\your_modified.xml")
You could do string replacement inside the foreach loop as well:
if ($node.NodeType -eq "Text") {
$node.Value = $node.Value -replace "Apple","APPLE"
}
Doing -replace on a single $node.Value is safe. Doing -replace on the entire XML is not.
Note:
Generally, for robust processing, you should use a dedicated XML parser to parse XML text.
See Tomalak's robust, but more complex XML-parsing answer.
In the specific case at hand, using a regex is a convenient shortcut, with the caveat that it only works because the blocks of lines being removed are self-contained elements or element sequences; if this assumption doesn't hold, the modifications will invalidate the XML document.
Additionally, there may be character-encoding issues, because reading an XML file as text doesn't honor an explicit encoding attribute potentially present in the file's XML declaration - see the bottom section for details.
That said, the technique below is appropriate for modifying plain-text files that have no specific formal structure.
You need to use the s (SingleLine) regex option to ensure that . also matches newlines - such options, if used inline, must be placed inside (?...) at the start of the regex; that is, '(?s)...' in this case.
Ad hoc, you can alternatively use the workaround [\s\S] instead of ., as suggested by x15; this expression matches any character that is either a whitespace or a non-whitespace character, and therefore matches any character, including newlines.
To fully remove the lines of interest, you must also match the preceding and succeeding newline.
(Get-Content -Raw file.xml) -replace '(?s)\r?\n<!--A6-->.*?<!--A6 end-->\r?\n'
Get-Content -Raw file.xml reads the file into memory as a whole (single string).
Get-Content makes assumptions about a file's character encoding in the absence of a BOM: Windows PowerShell assumes ANSI encoding, and PowerShell [Core] v6+ now sensibly assumes UTF-8. Since Get-Content is a general-purpose text-file reading cmdlet, it is not aware of a potential encoding attribute in the XML declaration of XML input files (e.g., <?xml version="1.0" encoding="ISO-8859-1"?>)
Similarly, Set-Content defaults to ANSI in Windows PowerShell, and BOM-less UTF-8 PowerShell [Core] v6+.
When in doubt, use the -Encoding parameter, both with Get-Content and Set-Content
See bottom section for more information.
\r?\n matches both Windows-style CRLF newlines and Unix-style LF-only ones.
Use (?:\r?\n)? instead of \r?\n if newlines aren't guaranteed to precede / succeed the lines of interest.
To verify that the resulting string is still a valid XML document, simply cast the command (or its captured result) to [xml]: [xml] ((Get-Content ...) -replace ...)
If you find that the document is broken, use Tomalak's fully robust, but more complex XML-parsing answer.
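A minimal sketch of that verification step (file names as in the example above):

```powershell
# Remove the A6-delimited blocks, then cast the result to [xml];
# the cast throws if the string is no longer well-formed XML.
$cleaned = (Get-Content -Raw file.xml) -replace '(?s)\r?\n<!--A6-->.*?<!--A6 end-->\r?\n'
$checked = [xml] $cleaned
$checked.Save("$pwd/file_cleaned.xml")  # hypothetical output name
```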
XML files and character encodings:
If you use Get-Content to read an XML file as text, and that file has neither a UTF-8 BOM nor a UTF-16 / UTF-32 BOM, Get-Content makes an assumption: it assumes ANSI encoding (e.g., Windows-1252) in Windows PowerShell, and, more sensibly, UTF-8 encoding in PowerShell [Core] v6+. Since Get-Content is a general-purpose text-file reading cmdlet, it is not aware of a potential encoding attribute in the XML declaration of XML input files.
If you know the actual encoding, use the -Encoding parameter to specify it.
Use -Encoding with the same value for saving the file with Set-Content later: As is generally the case in PowerShell, once data has been loaded into memory by a file-reading cmdlet, no information about its original encoding is retained, and using a file-writing cmdlet such as Set-Content later uses its fixed default encoding, which again, is ANSI in Windows PowerShell, and BOM-less UTF-8 in PowerShell [Core] v6+. Note that, unfortunately, different cmdlets have different defaults in Windows PowerShell, whereas PowerShell [Core] v6+ commendably consistently defaults to UTF-8.
The System.Xml.XmlDocument .NET type (whose PowerShell type accelerator is [xml]) offers robust XML parsing, and using its .Load() and .Save() methods provides better encoding support if the document's XML declaration contains an explicit encoding attribute naming the encoding used:
If such an attribute is present (e.g., <?xml version="1.0" encoding="ISO-8859-1"?>), both .Load() and .Save() will honor it.
That is, an input file with an encoding attribute will be read correctly and saved with that same encoding.
Of course, this assumes that the encoding named in the encoding attribute reflects the input file's actual encoding.
Otherwise, if the file has no BOM, (BOM-less) UTF-8 is assumed, as with PowerShell [Core] v6+'s Get-Content / Set-Content. That is sensible, because per the W3C XML Recommendation an XML document that has neither an encoding attribute nor a UTF-8 or UTF-16 BOM should default to UTF-8. If the file does have a BOM, only UTF-8 and UTF-16 are permitted without also naming the encoding in an encoding attribute, although in practice XmlDocument also reads UTF-32 files with a BOM correctly.
This means that .Save() will not preserve the encoding of a (with-BOM) UTF-16 or UTF-32 file that doesn't have an encoding attribute, and will instead create a BOM-less UTF-8 file.
If you want to detect a file's actual encoding - as inferred either from its BOM (or absence thereof) or, if present, from the encoding attribute - read your file via an XmlTextReader instance:
# Create an XML reader.
$xmlReader = [System.Xml.XmlTextReader]::new(
"$pwd/some.xml" # IMPORTANT: use a FULL PATH
)
# Read past the declaration, which detects the encoding,
# whether via the presence / absence of a BOM or an explicit
# `encoding` attribute.
$null = $xmlReader.MoveToContent()
# Report the detected encoding.
$xmlReader.Encoding
# You can now pass the reader to .Load(), if needed
# See next section for how to *save* with the detected encoding.
$xmlDoc = [xml]::new()
$xmlDoc.Load($xmlReader)
$xmlReader.Close()
If a given file is non-compliant and you know the actual encoding used, and/or you want to save with a given encoding (be sure that it doesn't contradict the encoding attribute, if there is one), you can specify encodings explicitly (the equivalent of using -Encoding with Get-Content / Set-Content): use the .Load() / .Save() method overloads that accept a Stream instance, via StreamReader / StreamWriter instances constructed with a given encoding; e.g.:
# Get the encoding to use, matching the input file's.
# E.g., if the input file is ISO-8859-1-encoded, but lacks
# an `encoding` attribute in the XML declaration.
$enc = [System.Text.Encoding]::GetEncoding('ISO-8859-1')
# Create a System.Xml.XmlDocument instance.
$xmlDoc = [xml]::new()
# Create a stream reader for the input XML file
# with explicit encoding.
$streamIn = [System.IO.StreamReader]::new(
"$pwd/some.xml", # IMPORTANT: use a FULL PATH
$enc
)
# Read and parse the file.
$xmlDoc.Load($streamIn)
# Close the stream
$streamIn.Close()
# ... process the XML DOM.
# Create a stream *writer* for saving back to the file
# with the same encoding.
$streamOut = [System.IO.StreamWriter]::new(
"$pwd/t.xml", # IMPORTANT: use a FULL PATH
$false, # don't append
$enc # same encoding as above in this case.
)
# Save the XML DOM to the file.
$xmlDoc.Save($streamOut)
# Close the stream
$streamOut.Close()
A general caveat re passing file paths to .NET methods: Always use full paths, because .NET's idea of the current directory typically differs from PowerShell's.
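One way to satisfy that caveat, sketched with a hypothetical relative path, is to let PowerShell resolve the path first:

```powershell
# Convert-Path resolves a (possibly relative) PowerShell path to a
# full filesystem path that .NET methods interpret correctly.
$fullPath = Convert-Path .\some.xml
$xmlDoc = [xml]::new()
$xmlDoc.Load($fullPath)
```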

Can't seem to get RegEx to match

I am trying to extract the Get-Help comment headers from a PowerShell script...using PowerShell. The file I'm reading looks something like this:
<#
.SYNOPSIS
Synopsis goes here.
It could span multiple lines.
Like this.
.DESCRIPTION
A description.
It could also span multiple lines.
.PARAMETER MyParam
Purpose of MyParam
.PARAMETER MySecondParam
Purpose of MySecondParam.
Notice that this section also starts with '.PARAMETER'.
This one should not be captured.
...and many many more lines like this...
#>
# Rest of the script...
I would like to get all the text below .DESCRIPTION, up to the first instance of .PARAMETER. So the desired output would be:
A description.
It could also span multiple lines.
Here's what I've tried:
$script = Get-Content -Path "C:\path\to\the\script.ps1" -Raw
$pattern = '\.DESCRIPTION(.*?)\.PARAMETER'
$description = $script | Select-String -Pattern $pattern
Write-Host $description
When I run that, $description is empty. If I change $pattern to .*, I get the entire contents of the file, as expected; So there must be something wrong with my RegEx pattern, but I can't seem to figure it out.
Any ideas?
(get-help get-date).description
The `Get-Date` cmdlet gets a DateTime object that represents the current date
or a date that you specify. It can format the date and time in several Windows
and UNIX formats. You can use `Get-Date` to generate a date or time character
string, and then send the string to other cmdlets or programs.
(get-help .\script.ps1).description
The Select-String cmdlet works on entire strings, and you have given it ONE string. [grin]
So, instead of fighting with that, I went with the -match operator. The following presumes you have loaded the entire file into $InStuff as one multiline string with -Raw.
The (?ms) at the start is two regex flags: multiline and singleline.
$InStuff -match '(?ms)(DESCRIPTION.*?)\.PARAMETER'
$Matches.1
output ...
DESCRIPTION
A description.
It could also span multiple lines.
Note that there is a blank line at the end; you will likely want to trim that away.
In the words of @Mathias R. Jessen:
Don't use regex to parse PowerShell code in PowerShell
Use the PowerShell parser instead!
So, let's use PowerShell to parse PowerShell:
$ScriptFile = "C:\path\to\the\script.ps1"
$ScriptAST = [System.Management.Automation.Language.Parser]::ParseFile($ScriptFile, [ref]$null, [ref]$null)
$ScriptAST.GetHelpContent().Description
We use [System.Management.Automation.Language.Parser]::ParseFile() to parse our file and output an Abstract Syntax Tree (AST).
Once we have the Abstract Syntax Tree, we can then use the GetHelpContent() method (exactly what Get-Help uses) to get our parsed help content.
Since we are only interested in the Description portion, we can simply access it directly with .GetHelpContent().Description

PowerShell: how to get URL string from line? Beginner

How do I write a PowerShell script that scrapes one website and extracts one URL from a public, static HTML file?
I am having trouble getting just the link, I can only get the line that contains the link.
'Invoke-WebRequest' downloads and saves the html file.
The link I want ends in .m3u8 so I use
'Select-String' to search for .m3u8 and PowerShell returns one line. But I want a link, not a line, the line contains other normal html markup that I don't want. The link is in double quotes and ends in .m3u8. I want what is inside the quotes.
Should I use split to convert the line into an array?
Should I use regex to "only get what is inside of quotes"? and if so how?
$variable_text = index.html
$variable_line = sls .m3u8 $variable_text
$variable_url = sls "regex inside of the quotes" in $variable_line
When I google regular expressions and enter them into PowerShell, the command just returns the ">>" continuation prompt. Perhaps my problem is with syntax? The online regular-expression checking tools work, but when I put that regular expression into PowerShell it never works. Thank you very much for your time.
No need to download the website to a file or parse through all the lines.
The Invoke-Webrequest cmdlet contains a property named links.
Example of getting all links and searching for the m3u8 link:
$WebSite = Invoke-WebRequest -Uri "your website"
$Links = $WebSite.Links.href
$Links | Where-Object{$_ -like "*.m3u8"} #Will show you all links which end with .m3u8
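If you do want the regex-inside-quotes approach the question asks about, a sketch against the raw HTML (the capture group is everything between the double quotes; the URI is a placeholder):

```powershell
$html = (Invoke-WebRequest -Uri "your website").Content
# Capture any double-quoted value ending in .m3u8;
# Groups[1] is the text between the quotes.
[regex]::Matches($html, '"([^"]*\.m3u8)"') |
    ForEach-Object { $_.Groups[1].Value }
```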