I edit XML files and am using PowerShell to open them in Notepad and replace strings of text. Given two distinct delimiters, a starting one and a stopping one, that appear multiple times in an XML file, I would like to completely remove the text between the delimiters (whether the delimiters get removed as well or not does not matter to me).
In the following example text, I want to completely remove the text between my starting and ending delimiter, but keep all the text before and after it.
The issue I am facing is that there are newlines at the end of each line of text, which prevent me from doing a simple:
-replace "<!--A6-->.*?<!--A6 end-->", "KEVIN"
Starting Delimiter:
<!--A6-->
Stopping Delimiter:
<!--A6 end-->
Example Text:
<listItem>
<para>Apple iPhone 6</para>
</listItem>
<listItem>
<para>Apple iPhone 8</para>
</listItem>
<!--A6-->
<listItem>
<para>Apple iPhone X</para>
</listItem>
<!--A6 end-->
</randomList></para>
</levelledPara>
<levelledPara>
<!--A6-->
<title>Available Apple iPhone Colors</title>
<para>The current iPhone model is available in
the follow colors. You can purchase this model
in store, or online.</para>
<!--A6 end-->
<para>If the color option that you want is out
of stock, you can find them at the following
website link.</para>
Current Code:
$Directory = "C:\Users\hellokevin\Desktop\PSTest"
$FindBook = "Book"
$ReplaceBook = "Novel"
$FindBike = "Bike"
$ReplaceBike = "Bicycle"
Get-ChildItem -Path $Directory -Recurse |
Select-Object -Expand FullName |
ForEach-Object {
(Get-Content $_) -replace $FindBook,$ReplaceBook -replace "<!--A6-->.*?<!--A6 end-->", "KEVIN" |
Set-Content ($_ + "_new.xml")
}
Any help would be greatly appreciated. Being fairly new to PowerShell, I don't know how to factor in the newlines at the end of each line in my code. Thanks for looking!
Using search-and-replace on XML files is extremely inadvisable and should be avoided at all costs, because it's way too easy to damage the XML this way.
There are better ways of modifying XML, and they all follow this schema:
load the XML document
modify the document tree
write the XML document back to file.
For your case ("remove nodes between markers") this could be as follows:
load the XML document
look at all XML nodes, in document order
when we see a comment that reads "A6", set a flag to remove nodes from now on
when we see a comment that reads "A6 end", unset that flag
collect all nodes that should be removed (that come up while the flag is on)
in a last step, remove them
write the XML document back to file.
The following program would do exactly this (and also remove the "A6" comments themselves):
$doc = New-Object xml
$doc.Load("C:\path\to\your.xml")
$toRemove = @()
$A6flag = $false
foreach ($node in $doc.SelectNodes('//node()')) {
if ($node.NodeType -eq "Comment") {
if ($node.Value -eq 'A6') {
$A6flag = $true
$toRemove += $node
} elseif ($node.Value -eq 'A6 end') {
$A6flag = $false
$toRemove += $node
}
} elseif ($A6flag) {
$toRemove += $node
}
}
foreach ($node in $toRemove) {
[void]$node.ParentNode.RemoveChild($node)
}
$doc.Save("C:\path\to\your_modified.xml")
You could do string replacement inside the foreach loop as well:
if ($node.NodeType -eq "Text") {
$node.Value = $node.Value -replace "Apple","APPLE"
}
Doing -replace on a single $node.Value is safe. Doing -replace on the entire XML is not.
Note:
Generally, for robust processing, you should use a dedicated XML parser to parse XML text.
See Tomalak's robust, but more complex XML-parsing answer.
In the specific case at hand, using a regex is a convenient shortcut, with the caveat that it only works because the blocks of lines being removed are self-contained elements or element sequences; if this assumption doesn't hold, the modifications will invalidate the XML document.
Additionally, there may be character-encoding issues, because reading an XML file as text doesn't honor an explicit encoding attribute potentially present in the file's XML declaration - see the bottom section for details.
That said, the technique below is appropriate for modifying plain-text files that have no specific formal structure.
You need to use the s (SingleLine) regex option to ensure that . also matches newlines - such options, if used inline, must be placed inside (?...) at the start of the regex; that is, '(?s)...' in this case.
Ad hoc, you can alternatively use the workaround [\s\S] instead of ., as suggested by x15; this expression matches any character that is either a whitespace character or a non-whitespace character, and therefore matches any character, including newlines.
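For example, a sketch of the removal using the [\s\S] workaround instead of the (?s) option (the file names are assumed for illustration):

```powershell
# [\s\S]*? spans newlines without needing the (?s) inline option.
(Get-Content -Raw file.xml) -replace '<!--A6-->[\s\S]*?<!--A6 end-->', 'KEVIN' |
  Set-Content file_new.xml
```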
To fully remove the lines of interest, you must also match the preceding and succeeding newline.
(Get-Content -Raw file.xml) -replace '(?s)\r?\n<!--A6-->.*?<!--A6 end-->\r?\n'
Get-Content -Raw file.xml reads the file into memory as a whole (single string).
Get-Content makes assumptions about a file's character encoding in the absence of a BOM: Windows PowerShell assumes ANSI encoding, and PowerShell [Core] v6+ now sensibly assumes UTF-8. Since Get-Content is a general-purpose text-file reading cmdlet, it is not aware of a potential encoding attribute in the XML declaration of XML input files (e.g., <?xml version="1.0" encoding="ISO-8859-1"?>).
Similarly, Set-Content defaults to ANSI in Windows PowerShell, and to BOM-less UTF-8 in PowerShell [Core] v6+.
When in doubt, use the -Encoding parameter, both with Get-Content and Set-Content.
See bottom section for more information.
\r?\n matches both Windows-style CRLF newlines and Unix-style LF-only ones.
Use (?:\r?\n)? instead of \r?\n if newlines aren't guaranteed to precede / succeed the lines of interest.
To verify that the resulting string is still a valid XML document, simply cast the command (or its captured result) to [xml]: [xml] ((Get-Content ...) -replace ...)
If you find that the document is broken, use Tomalak's fully robust, but more complex XML-parsing answer.
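Putting it together, a sketch that performs the replacement, validates the result, and only then saves (the file names are assumed for illustration):

```powershell
$modified = (Get-Content -Raw file.xml) -replace '(?s)\r?\n<!--A6-->.*?<!--A6 end-->\r?\n'
$null = [xml] $modified   # throws if the result is no longer well-formed XML
Set-Content file_new.xml -Value $modified
```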
XML files and character encodings:
If you use Get-Content to read an XML file as text, and that file has neither a UTF-8 BOM nor a UTF-16 / UTF-32 BOM, Get-Content makes an assumption: it assumes ANSI encoding (e.g., Windows-1252) in Windows PowerShell, and, more sensibly, UTF-8 encoding in PowerShell [Core] v6+. Since Get-Content is a general-purpose text-file reading cmdlet, it is not aware of a potential encoding attribute in the XML declaration of XML input files.
If you know the actual encoding, use the -Encoding parameter to specify it.
Use -Encoding with the same value for saving the file with Set-Content later: As is generally the case in PowerShell, once data has been loaded into memory by a file-reading cmdlet, no information about its original encoding is retained, and using a file-writing cmdlet such as Set-Content later uses its fixed default encoding, which again, is ANSI in Windows PowerShell, and BOM-less UTF-8 in PowerShell [Core] v6+. Note that, unfortunately, different cmdlets have different defaults in Windows PowerShell, whereas PowerShell [Core] v6+ commendably consistently defaults to UTF-8.
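As an example, assuming you know the file is UTF-8-encoded, a consistent read / modify / write round trip could look like this:

```powershell
$text = Get-Content -Raw -Encoding UTF8 file.xml
$text = $text -replace '(?s)\r?\n<!--A6-->.*?<!--A6 end-->\r?\n'
# Write back with the SAME encoding; otherwise Windows PowerShell
# would fall back to its ANSI default.
Set-Content -Encoding UTF8 file_new.xml -Value $text
```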
The System.Xml.XmlDocument .NET type (whose PowerShell type accelerator is [xml]) offers robust XML parsing, and its .Load() and .Save() methods provide better encoding support if the document's XML declaration contains an explicit encoding attribute naming the encoding used:
If such an attribute is present (e.g., <?xml version="1.0" encoding="ISO-8859-1"?>), both .Load() and .Save() will honor it.
That is, an input file with an encoding attribute will be read correctly, and saved with that same encoding.
Of course, this assumes that the encoding named in the encoding attribute reflects the input file's actual encoding.
Otherwise, if the file has no BOM, (BOM-less) UTF-8 is assumed, as with PowerShell [Core] v6+'s Get-Content / Set-Content. That is sensible, because an XML document that has neither an encoding attribute nor a UTF-8 or UTF-16 BOM should default to UTF-8, per the W3C XML Recommendation. If the file does have a BOM, only UTF-8 and UTF-16 are permitted without also naming the encoding in an encoding attribute, although in practice XmlDocument also reads UTF-32 files with a BOM correctly.
This means that .Save() will not preserve the encoding of a (with-BOM) UTF-16 or UTF-32 file that doesn't have an encoding attribute, and will instead create a BOM-less UTF-8 file.
If you want to detect a file's actual encoding - as either inferred from its BOM / absence thereof or, if present, the encoding attribute, read your file via an XmlTextReader instance:
# Create an XML reader.
$xmlReader = [System.Xml.XmlTextReader]::new(
"$pwd/some.xml" # IMPORTANT: use a FULL PATH
)
# Read past the declaration, which detects the encoding,
# whether via the presence / absence of a BOM or an explicit
# `encoding` attribute.
$null = $xmlReader.MoveToContent()
# Report the detected encoding.
$xmlReader.Encoding
# You can now pass the reader to .Load(), if needed
# See next section for how to *save* with the detected encoding.
$xmlDoc = [xml]::new()
$xmlDoc.Load($xmlReader)
$xmlReader.Close()
If a given file is non-compliant and you know the actual encoding used, and/or you want to save with a given encoding (be sure that it doesn't contradict the encoding attribute, if there is one), you can specify encodings explicitly (the equivalent of using -Encoding with Get-Content / Set-Content) by using the .Load() / .Save() method overloads that accept a Stream instance, via StreamReader / StreamWriter instances constructed with a given encoding; e.g.:
# Get the encoding to use, matching the input file's.
# E.g., if the input file is ISO-8859-1-encoded, but lacks
# an `encoding` attribute in the XML declaration.
$enc = [System.Text.Encoding]::GetEncoding('ISO-8859-1')
# Create a System.Xml.XmlDocument instance.
$xmlDoc = [xml]::new()
# Create a stream reader for the input XML file
# with explicit encoding.
$streamIn = [System.IO.StreamReader]::new(
"$pwd/some.xml", # IMPORTANT: use a FULL PATH
$enc
)
# Read and parse the file.
$xmlDoc.Load($streamIn)
# Close the stream
$streamIn.Close()
# ... process the XML DOM.
# Create a stream *writer* for saving back to the file
# with the same encoding.
$streamOut = [System.IO.StreamWriter]::new(
"$pwd/t.xml", # IMPORTANT: use a FULL PATH
$false, # don't append
$enc # same encoding as above in this case.
)
# Save the XML DOM to the file.
$xmlDoc.Save($streamOut)
# Close the stream
$streamOut.Close()
A general caveat re passing file paths to .NET methods: Always use full paths, because .NET's idea of the current directory typically differs from PowerShell's.
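For example, you can let PowerShell resolve the full path for you via Convert-Path (a sketch; the file must already exist for Convert-Path to resolve it):

```powershell
# Resolve a PowerShell-relative path to a full, filesystem-native path
# before passing it to a .NET method.
$fullPath = Convert-Path .\some.xml
$xmlDoc = [xml]::new()
$xmlDoc.Load($fullPath)
```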
Related
I have an XML file that includes many instances of a particular tag/element. I am trying to capture each of these and then dump them into a new file.
I have the following script which does work, in that it takes the first occurrence of the text I am after and displays it to the console.
I am trying to incorporate ForEach-Object to retrieve all occurrences of ...allContent... but am failing to get it added correctly.
Here is my working script that displays the output I am after for the first occurrence only.
$firstString = "<RunListItems>"
$secondString = "</RunListItems>"
#Get content from file
$file = Get-Content "C:\Users\Bob\Desktop\ps\order.xml"
#Regex pattern to compare two strings
$pattern = "$firstString(.*?)$secondString"
#Perform the operation
$result = [regex]::Match($file,$pattern).Groups[1].Value
#Return result
return $result
Parsing XML text with regular expressions is brittle and therefore ill-advised.
PowerShell provides easy access to proper XML parsers, and in the case at hand you can use the Select-Xml cmdlet:
Select-Xml //RunListItems C:\Users\Bob\Desktop\ps\order.xml |
ForEach-Object { $_.Node.InnerText }
//RunListItems is an XPath query that selects all elements whose tag name is RunListItems throughout the document, irrespective of their position in the hierarchy (//)
The .Node property of the output objects (of type Microsoft.PowerShell.Commands.SelectXmlInfo) contains the matching element, and its .InnerText property returns its text content.
Note: If your XML document uses namespaces, you must pass a hashtable with prefix-to-URI mappings to Select-Xml's -Namespace parameter, and use these prefixes in the XPath query (-XPath) when referring to elements - see this answer for more information.
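A sketch of what that would look like; the namespace URI below is hypothetical:

```powershell
# Map a self-chosen prefix ('ns') to the document's namespace URI
# (hypothetical URI, for illustration only).
$ns = @{ ns = 'http://example.com/orders' }
Select-Xml -Path C:\Users\Bob\Desktop\ps\order.xml -XPath '//ns:RunListItems' -Namespace $ns |
    ForEach-Object { $_.Node.InnerText }
```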
To save the output strings to a file, separated with newlines, simply append something like
| Set-Content out.txt; use Set-Content's -Encoding parameter to control the encoding, if needed.[1]
[1] In Windows PowerShell (versions up to 5.1), Set-Content defaults to the active ANSI code page. In PowerShell (Core) 7+, the consistent default across all cmdlets is BOM-less UTF-8. See this answer for more information.
I am trying to extract the Get-Help comment headers from a PowerShell script...using PowerShell. The file I'm reading looks something like this:
<#
.SYNOPSIS
Synopsis goes here.
It could span multiple lines.
Like this.
.DESCRIPTION
A description.
It could also span multiple lines.
.PARAMETER MyParam
Purpose of MyParam
.PARAMETER MySecondParam
Purpose of MySecondParam.
Notice that this section also starts with '.PARAMETER'.
This one should not be captured.
...and many many more lines like this...
#>
# Rest of the script...
I would like to get all the text below .DESCRIPTION, up to the first instance of .PARAMETER. So the desired output would be:
A description.
It could also span multiple lines.
Here's what I've tried:
$script = Get-Content -Path "C:\path\to\the\script.ps1" -Raw
$pattern = '\.DESCRIPTION(.*?)\.PARAMETER'
$description = $script | Select-String -Pattern $pattern
Write-Host $description
When I run that, $description is empty. If I change $pattern to .*, I get the entire contents of the file, as expected, so there must be something wrong with my regex pattern, but I can't seem to figure it out.
Any ideas?
(get-help get-date).description
The `Get-Date` cmdlet gets a DateTime object that represents the current date
or a date that you specify. It can format the date and time in several Windows
and UNIX formats. You can use `Get-Date` to generate a date or time character
string, and then send the string to other cmdlets or programs.
(get-help .\script.ps1).description
The Select-String cmdlet works on entire strings, and you have given it ONE string. [grin]
So, instead of fighting with that, I went with the -match operator. The following presumes you have loaded the entire file into $InStuff as one multiline string with -Raw.
The (?ms) prefix sets two regex options: Multiline and Singleline.
$InStuff -match '(?ms)(DESCRIPTION.*?)\.PARAMETER'
$Matches.1
output ...
DESCRIPTION
A description.
It could also span multiple lines.
Note that there is a blank line at the end; you will likely want to trim that away.
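For example, applying .Trim() to the captured group (a sketch building on the command above):

```powershell
if ($InStuff -match '(?ms)(DESCRIPTION.*?)\.PARAMETER') {
    $Matches.1.Trim()   # strips the trailing blank line (and any surrounding whitespace)
}
```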
In the words of @Mathias R. Jessen:
Don't use regex to parse PowerShell code in PowerShell
Use the PowerShell parser instead!
So, let's use PowerShell to parse PowerShell:
$ScriptFile = "C:\path\to\the\script.ps1"
$ScriptAST = [System.Management.Automation.Language.Parser]::ParseFile($ScriptFile, [ref]$null, [ref]$null)
$ScriptAST.GetHelpContent().Description
We use the [System.Management.Automation.Language.Parser]::ParseFile() method to parse our file and output an Abstract Syntax Tree (AST).
Once we have the Abstract Syntax Tree, we can then use the GetHelpContent() method (exactly what Get-Help uses) to get our parsed help content.
Since we are only interested in the Description portion, we can simply access it directly with .GetHelpContent().Description
I am unable to apply many of the other PowerShell regex solutions to help solve my problem. The answer may very well already be on Stack Overflow, but my lack of experience with PowerShell is prohibiting me from deducing how to manipulate the solutions to my question.
I have a text file containing an XML document tree (I bring the document tree into PowerShell as one large string) (edit 1) that includes the tags establishing where certain content is. I need to steal the file name from between the fileName tags. Sometimes both tags and the file name are all on one line, and other times the tags are each on a separate line, as is the file name. An example of the input data I have is below:
<files>
<file>
<fileName>
ThisTextFileINeedReturned.txt
</fileName>
<lastModifiedTime>1511883780000</lastModifiedTime>
<size>852192</size>
<isDirectory>false</isDirectory>
<isRegularFile>true</isRegularFile>
<isSymbolicLink>false</isSymbolicLink>
<isOther>false</isOther>
<group>group</group>
<transferStatus>Done</transferStatus>
</file>
<file>
<fileName>AnotherTextFileINeedReturned.txt</fileName>
<lastModifiedTime>1511883780000</lastModifiedTime>
<size>852192</size>
<isDirectory>false</isDirectory>
<isRegularFile>true</isRegularFile>
<isSymbolicLink>false</isSymbolicLink>
<isOther>false</isOther>
<group>group</group>
<transferStatus>Done</transferStatus>
</file>
I have created the following code to find the content within the tags thus far. It works if the fileName tags and the file name are on the same line. The problem I'm having is the instance where they are all on separate lines (the example I provided above). I have already managed to transfer the large string above into $xmldata.
$xmldata -match '<fileName>(.*?)(</fileName>)'
$matches
Using the example text I displayed above, the output I receive is as follows:
<fileName>AnotherTextFileINeedReturned.txt</fileName>
I'm ok with receiving the tags, but I also need the name of the file that is on multiple lines. Like this...
<fileName>
ThisTextFileINeedReturned.txt
</fileName>
<fileName>AnotherTextFileINeedReturned.txt</fileName>
Or any variation that would give me both of the names of the text files. I have seen the (?m) part used before, but I haven't been able to successfully implement it. Thanks in advance for the help!! Let me know if you need any other information!
You should be able to get around that without using any regex. Powershell supports XML pretty well. Extracting the filename would be as easy as:
$Xml = @"
<files>
<file>
<fileName>
ThisTextFileINeedReturned.txt
</fileName>
<lastModifiedTime>1511883780000</lastModifiedTime>
<size>852192</size>
<isDirectory>false</isDirectory>
<isRegularFile>true</isRegularFile>
<isSymbolicLink>false</isSymbolicLink>
<isOther>false</isOther>
<group>group</group>
<transferStatus>Done</transferStatus>
</file>
<file>
<fileName>AnotherTextFileINeedReturned.txt</fileName>
<lastModifiedTime>1511883780000</lastModifiedTime>
<size>852192</size>
<isDirectory>false</isDirectory>
<isRegularFile>true</isRegularFile>
<isSymbolicLink>false</isSymbolicLink>
<isOther>false</isOther>
<group>group</group>
<transferStatus>Done</transferStatus>
</file>
</files>
"#
Select-Xml -Content $Xml -XPath "//files/file/fileName" | foreach {$_.node.InnerXML.Trim()}
You haven't explained how you get your data, but I guess you are using Get-Content to retrieve your source file. Get-Content reads the content one line at a time and returns a collection of objects, each of which represents a line of content. In other words, you're probably doing a match on each separate line and therefore do not find the matches that are spread over multiple lines.
If this is indeed the case, the solution would be to simply join the lines first:
($xmldata -Join "") -match '<fileName>(.*?)(</fileName>)'
And check your matches, e.g.:
$Matches[0]
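Note that -match only reports the first match. To collect every fileName value, a sketch using [regex]::Matches instead:

```powershell
$joined = $xmldata -join ''
# \s* around the capture group absorbs the leftover indentation that
# appears when the name sits on its own line in the source.
[regex]::Matches($joined, '<fileName>\s*(.*?)\s*</fileName>') |
    ForEach-Object { $_.Groups[1].Value }
```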
I have the following (sample) text:
line1
line2
line3
I would like to use the powershell -replace method to replace the whole block with:
lineA
lineB
lineC
I'm not sure how to format this to account for the carriage returns/line breaks... Just encapsulating it in quotes like this doesn't work:
{$_ -replace "line1
line2
line3",
"lineA
lineB
lineC"}
How would this be achieved? Many thanks!
There is nothing syntactically wrong with your command - it's fine to spread string literals and expressions across multiple lines (but see caveat below), so the problem likely lies elsewhere.
Caveat re line endings:
If you use actual line breaks in your string literals, they'll implicitly be encoded based on your script file's line-ending style (CRLF on Windows, LF-only on Unix) - and may not match the line endings in your input.
By contrast, if you use the control-character escapes `r`n (CRLF) vs. `n (LF-only) in double-quoted strings, as demonstrated below, you're not only able to represent multiline strings on a single line, but you also make the line-ending style explicit and independent of the script file's own line-ending style, which is preferable.
In the remainder of this answer I'm assuming that the input has CRLF (Windows-style) line endings; to handle LF-only (Unix-style) input instead, simply replace all `r`n instances with `n.
I suspect that you're not sending your input as a single, multiline string, but line by line, in which case your replacement command will never find a match.
If your input comes from a file, be sure to use Get-Content's -Raw parameter to ensure that the entire file content is sent as a single string, rather than line by line; e.g.:
Get-Content -Raw SomeFile |
ForEach-Object { $_ -replace "line1`r`nline2`r`nline3", "lineA`r`nlineB`r`nlineC" }
Alternatively, since you're replacing literals, you can use the [string] type's Replace() method, which operates on literals (which has the advantage of not having to worry about needing to escape regular-expression metacharacters in the replacement string):
Get-Content -Raw SomeFile |
ForEach-Object { $_.Replace("line1`r`nline2`r`nline3", "lineA`r`nlineB`r`nlineC") }
MatthewG's answer adds a twist that makes the replacement more robust: appending a final line break to ensure that only a line matching line 3 exactly is considered:
"line1`r`nline2`r`nline3" -> "line1`r`nline2`r`nline3`r`n" and
"lineA`r`nlineB`r`nlineC" -> "lineA`r`nlineB`r`nlineC`r`n"
In PowerShell you can use `n (backtick-n) for a newline character.
-replace "line1`nline2`nline3`n", "lineA`nlineB`nlineC`n"
A given XML file with UTF-8 declared as the encoding does not pass xmllint. On the assumption that a non-UTF-8 character is causing the error, the following sed command is being run against the file: sed 's/[^\x00-\x7F]//g' file.xml. Either the command is wrong, or non-UTF-8 characters are not the problem, as xmllint still fails after running the sed. The first question is: does the sed regex appear correct?
= = = = =
Here is the output of xmllint:
$ xmllint file.xml
file.xml:35533: parser error : CData section not finished
<p class="imgcont"><img alt="Diets of 2013" src="h
<b>What You Eat: </b>Foods low in sugar and carbs and high in fat—80% of cal
^
file.xml:35533: parser error : PCDATA invalid Char value 31
<b>What You Eat: </b>Foods low in sugar and carbs and high in fat—80% of cal
^
file.xml:35588: parser error : Sequence ']]>' not allowed in content
as.people.com/2013/11/07/kerry-washington-pregnant-diet-green-smoothie-recipe/"]
^
= = = = =
UPDATE: In TextMate, on viewing the file, there is a character that is being shown as <US>. If that character is manually deleted from the file, the file then passes xmllint.
It is somewhat hard to work with sed to remove specific code points from the Unicode table.
If you need to target specific Unicode categories of characters, it makes more sense to work with Perl.
perl -i -pe 's/(?![\t\n\r])\p{Cc}//g' file
will remove all control characters but TAB, CR and LF.