Powershell JSON transformation removing unicode escape chars without removing literal \n - regex

My issue is similar to this question:
Json file to powershell and back to json file
Importing and exporting ARM templates in PowerShell, using ConvertFrom-Json and ConvertTo-Json, introduces Unicode escape sequences.
I used the code here to unescape again.
Some example code (multiline for clarity):
$armADF = Get-Content -Path $armFile -Raw | ConvertFrom-Json
$armADFString = $armADF | ConvertTo-Json -Depth 50
$armADFString |
ForEach-Object { [System.Text.RegularExpressions.Regex]::Unescape($_) } |
Out-File $outputFile
Here's the doco I've been reading for Unescape
This results in the output file being identical except that all instances of literal \n (that were in the original JSON file) are turned into actual line breaks, which breaks the ARM template.
If I don't include the Unescape code, the \n sequences are preserved, but so are the Unicode escape sequences, which also break the ARM template.
It seems like I need to pre-escape the \n so that when I call Unescape they come out as nice little \n again. I've tried a couple of things, like adding this before calling Unescape:
$armADFString = $armADFString -replace("\\n","\u000A")
Which does not give me the results I need.
Anyone come across this and solved it? Any accomplished escape artists?

I reread the Unescape doco and noticed that it also basically removes leading \ characters, so I tried this unlikely bit of code:
$armADF = Get-Content -Path $armFile -Raw | ConvertFrom-Json
$armADFString = $armADF | ConvertTo-Json -Depth 50
$armADFString = $armADFString -replace("\\n","\\n")
$armADFString |
ForEach-Object { [System.Text.RegularExpressions.Regex]::Unescape($_) } |
Out-File $outputFile
Of course - replacing \\n with \\n makes complete sense :|
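As best I can tell, the reason this works is that the first argument to -replace is a regex, where \\n matches a literal \n, while the second is a plain replacement string in which \\n is inserted literally; the backslash therefore gets doubled, and Unescape collapses it back to a single one:
'a\nb' -replace '\\n', '\\n'                                  # -> a\\nb
[System.Text.RegularExpressions.Regex]::Unescape('a\\nb')     # -> a\nb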
More than happy for anyone to pose a more elegant solution.
EDIT: I am deploying ADF ARM templates, which are themselves JSON based. To cut a long story short, I also found I needed to add this to stop it unescaping legitimately escaped quotes:
$armADFString = $armADFString -replace('\\"','\\"')
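Putting the pieces together, the whole sequence I ended up with looks roughly like this (same placeholder paths as above):
$armADF = Get-Content -Path $armFile -Raw | ConvertFrom-Json
$armADFString = $armADF | ConvertTo-Json -Depth 50
$armADFString = $armADFString -replace '\\n', '\\n'    # double the backslash so literal \n survives Unescape
$armADFString = $armADFString -replace '\\"', '\\"'    # same trick for legitimately escaped quotes
$armADFString |
ForEach-Object { [System.Text.RegularExpressions.Regex]::Unescape($_) } |
Out-File $outputFile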

Related

Powershell - Extract Non-UTF-8 Characters from multiple files and Re-write the new files and create a new file with the bad Characters (ebcdic?)

I have a small script that I can use to find and replace characters or strings in a file. It works, and I can use it to replace the non-UTF-8 characters.
What I need to do is run the script once and replace all the invalid data in one shot AND create another file that has the file name and bad characters.
Right now I have to run the script over and over with however many invalid characters I can ID by eyeball. Then I edit my tracking file with the contents of the script I ran and the file I ran it against.
Not efficient at all. Just to be clear, I have almost no clue how to code the second part of keeping track of what is corrected.
Can anyone offer a better way of doing this?
Thank you,
-Ron
$old = 'BAD DATA'
$new = ' '
$configFiles = Get-ChildItem . *.* -rec
foreach ($file in $configFiles)
{
    (Get-Content $file.PSPath) |
        Foreach-Object { $_ -replace "$old", "$new" } |
        Set-Content $file.PSPath
}
Here is a sample of my DATA...
"PARTHENIA STREET °212 "," "," "," ","CAUGA PARK "
The data ' °' in hex is C2 and B0. The original file before FTP had a single byte, hex 09. Not only did it convert wrong, it added a byte to the file.
Here's an example translating EBCDIC to ASCII based on ASCII-to-EBCDIC or EBCDIC-to-ASCII and Working with non-native PowerShell encoding (EBCDIC), but the EBCDIC file is completely unrecognizable. It doesn't have a BOM.
The file was downloaded with sftp, but it sounds like it was already corrupted.
"hi`tthere","how`tare" | set-content file.txt # tab 0x09 in the middle
# From ASCII to EBCDIC
$asciibytes = get-content file.txt -Encoding byte
$rawstring = [System.Text.Encoding]::ASCII.GetString($asciibytes)
$ebcdicbytes = [System.Text.Encoding]::GetEncoding('ebcdic-cp-us').getbytes($rawstring)
$ebcdicbytes | set-content ebcidic.txt -Encoding Byte
# From EBCDIC to ASCII
$ebcidicbytes = get-content ebcidic.txt -Encoding byte
$rawstring = [System.Text.Encoding]::getencoding('ebcdic-cp-us').GetString($ebcidicbytes)
$asciibytes = [system.text.encoding]::ASCII.GetBytes($rawstring)
$asciibytes | set-content ascii.txt -Encoding Byte
Here's a script called nonascii.ps1 that strips characters that aren't printable ASCII (anything outside the space-to-tilde range, with tab also allowed) and writes back to the same filename.
(get-content $args[0]) -replace '[^ -~\t]' | set-content $args[0]
Note that PowerShell 5.1's Get-Content can't correctly read UTF-8 files without a BOM unless you pass the -Encoding UTF8 parameter:
get-content file -encoding utf8
Also note that powershell 6.2 and above can use any encoding known by .net, although tab completion doesn't reflect this:
"hi`tthere" | set-content ebcidic.txt -encoding ebcdic-cp-us
get-content ebcidic.txt -encoding ebcdic-cp-us
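The question also asked for a second output: a tracking file listing each file name and the bad characters found in it. Here's a minimal sketch of doing the detection, logging and replacement in a single pass (the character class follows the "outside printable ASCII, tab allowed" idea from the script above, with \r\n added because the file is read with -Raw; the badchars.txt path and the Get-ChildItem filter are placeholders to adjust):
$logFile = 'badchars.txt'
Get-ChildItem . -File -Recurse | ForEach-Object {
    $text = Get-Content $_.FullName -Raw
    # collect the distinct characters outside printable ASCII (tab and newlines allowed)
    $bad = [regex]::Matches($text, '[^ -~\t\r\n]') | ForEach-Object Value | Sort-Object -Unique
    if ($bad) {
        # log the file name and the offending characters
        Add-Content $logFile ("{0}: {1}" -f $_.FullName, ($bad -join ' '))
        # strip them in one shot
        ($text -replace '[^ -~\t\r\n]') | Set-Content $_.FullName -NoNewline
    }
}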

How to replace lines depending on the remaining text in file using PowerShell

I need to edit a txt file using PowerShell. The problem is that I need to apply changes to a string only if the remaining part of the string matches some pattern. For example, I need to change 'specific_text' to 'other_text' only if the line ends with 'pattern':
'specific_text and pattern' -> changes to 'other_text and pattern'
But if the line doesn't end with pattern, I don't need to change it:
'specific_text and something else' -> no changes
I know about the Replace operator in PowerShell, but as far as I know it makes the change for every match of the regex. There is also the Select-String cmdlet, but I couldn't combine them properly. My idea was to do it this way:
((get-content myfile.txt | select-string -pattern "pattern") -Replace "specific_text", "other_text") | Out-File myfile.txt
But this call rewrites the whole file and leaves only changed lines.
You may use
(get-content myfile.txt) -replace 'specific_text(?=.*pattern$)', "other_text" | Out-File myfile.txt
The specific_text(?=.*pattern$) pattern matches
specific_text - the literal specific_text...
(?=.*pattern$) - a positive lookahead requiring that the match be immediately followed by any 0 or more chars other than a newline (as many as possible) and then pattern at the end of the string ($).
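A quick way to see it in action, using the two sample lines from the question:
'specific_text and pattern', 'specific_text and something else' -replace 'specific_text(?=.*pattern$)', 'other_text'
# -> other_text and pattern
# -> specific_text and something else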

How to Optimize extended event to JSON conversion

I have a small process for ingesting a .xel file, converting it to custom objects with the dbatools module, and then turning them into single-line JSON and exporting them to a file that gets sent off to wherever it goes. Here it is:
$path = 'C:\temp\big_xe_file.xel'
#Read in file
$xes = Read-DbaXEFile -Path $path
#Output Variable
$file = ""
foreach ($xe in $xes) {
    # format date column
    $xe.timestamp = ($xe.timestamp.DateTime).ToString("yyyy-MM-ddThh:mm:ss.ffff")
    # convert to JSON and change escaped unicode characters back
    # (also tried: | % { [System.Text.RegularExpressions.Regex]::Unescape($_) })
    $xe = ($xe | ConvertTo-Json -Compress) | % {
        [Regex]::Replace($_,
            "\\u(?<Value>[a-zA-Z0-9]{4})", {
                param($m) ([char]([int]::Parse($m.Groups['Value'].Value,
                    [System.Globalization.NumberStyles]::HexNumber))).ToString() })
    }
    # Write line to file
    Add-Content -Value "$($xe)`n" -Path 'C:\temp\myevents.json' -Encoding utf8 -NoNewline
}
This fits the bill and does exactly what I need it to. The nasty regex in the middle is because when you convertto-json, it HANDILY escapes all unicode characters, and the regex magically turns them all back to the characters we know and love.
However, it's all a bit too slow. We churn out lots of .xel files, usually 500mb in size, and we would like to have a shorter delay between the traces being written and being ingested. As it stands, it takes ~35 minutes to serially process a file this way. The delay would likely grow if we got behind, which seems likely at that speed.
I've already sped this up quite a bit. I've tried using [System.Text.RegularExpressions.Regex]::Unescape in place of the regex code I have, but it is only slightly faster and does not provide the correct formatting that we need anyway. My next step is to split the files into smaller pieces and process them in parallel, but that would be significantly more CPU intensive and I'd like to avoid that if possible.
Any help optimizing this is much appreciated!
It turns out there was a config issue and we were able to ditch that regex nonsense and leave the escape characters in the JSON. However, I did also find a solution for speeding it up, in case anyone ever sees this. The solution was changing the writer to use a .NET StreamWriter instead of the PowerShell Add-Content cmdlet:
$stream = [System.IO.StreamWriter] $outfile
foreach ($xe in $xes) {
    # format date column
    $xe.timestamp = ($xe.timestamp.DateTime).ToString("yyyy-MM-ddThh:mm:ss.ffff")
    $xe | Add-Member -MemberType NoteProperty -Name 'source_host_name' -Value $server_name
    # convert to JSON; the escaped unicode characters are now left in place
    $xe = ($xe | ConvertTo-Json -Compress)
    # old unescaping code, no longer needed:
    # | % { [Regex]::Replace($_,
    #     "\\u(?<Value>[a-zA-Z0-9]{4})", {
    #         param($m) ([char]([int]::Parse($m.Groups['Value'].Value,
    #             [System.Globalization.NumberStyles]::HexNumber))).ToString() } )}
    # Add-Content -Value "$($xe)`n" -Path 'C:\DBA Notes\Traces\Xel.json' -Encoding utf8 -NoNewline
    $stream.WriteLine($xe)
}
$stream.Close()
It takes 1/10 the amount of time. Cheers
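If anyone wants to reuse the pattern, a slightly more defensive variant pins the output encoding explicitly and wraps the writer in try/finally so the file handle is released even if something throws mid-loop (a sketch; $outfile and $xes as above, with the per-event massaging from the loop above omitted for brevity):
# UTF-8 without BOM; the $false second argument means "do not append" (overwrite)
$stream = [System.IO.StreamWriter]::new($outfile, $false, [System.Text.UTF8Encoding]::new($false))
try {
    foreach ($xe in $xes) {
        # one compressed JSON object per line
        $stream.WriteLine(($xe | ConvertTo-Json -Compress))
    }
}
finally {
    # flushes buffered output and closes the file even on error
    $stream.Dispose()
}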

Powershell: how to replace quoted text from a batch file

I have a text file that contains:
#define VERSION "0.1.2"
I need to replace that version number from a running batch file.
set NEW_VERSION="0.2.0"
powershell -Command "(gc BBB.iss) -replace '#define VERSION ', '#define VERSION %NEW_VERSION% ' | Out-File BBB.iss"
I know that my match pattern is not correct. I need to select the entire line including the "0.2.0", but I can't figure out how to escape all that because it's all enclosed in double quotes so it runs in a batch file.
I'm guessing that [0-9].[0-9].[0-9] will match the actual old version number, but what about the quotes?
but what about the quotes?
When calling PowerShell's CLI from cmd.exe (a batch file) with powershell -command "....", use \" to pass embedded ".
(This may be surprising, given that PowerShell-internally you typically use `" or "" inside "...", but it is the safe choice from the outside.[1])
Note:
While \" works robustly on the PowerShell side, it can situationally break cmd.exe's parsing. In that case, use "^"" (sic) with powershell.exe (Windows PowerShell), and "" with pwsh.exe (PowerShell (Core) 7+), inside overall "..." quoting. See this answer for details.
Here's an approach that matches and replaces everything between "..." after #define VERSION:
:: Define the new version *without* double quotes
set NEW_VERSION=0.2.0
powershell -Command "(gc BBB.iss) -replace '(?<=#define VERSION\s+\").+?(?=\")', '%NEW_VERSION%' | Set-Content -Encoding ascii BBB.iss"
Note that using Out-File (as used in the question) to rewrite the file creates a UTF-16LE ("Unicode") encoded file, which may be undesired; use Set-Content -Encoding ... to control the output encoding. The above command uses Set-Content -Encoding ascii as an example.
Also note that rewriting an existing file this way (read existing content into memory, write modified content back) bears the slight risk of data loss, if writing the file gets interrupted.
(?<=#define VERSION\s+\") is a look-behind assertion ((?<=...)) that matches literal #define VERSION followed by at least one space or tab (\s+) and a literal "
Note how the " is escaped as \", which - surprisingly - is how you need to escape literal " chars. when you pass a command to PowerShell from cmd.exe (a batch file).[1]
.+? then non-greedily (?) matches one or more (+) characters (.)...
...until the closing " (escaped as \") is found via (?=\"), a look-ahead assertion ((?=...))
The net effect is that only the characters between "..." are matched - i.e., the mere version number - which then allows replacing it with just '%NEW_VERSION%', the new version number.
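To see just the regex at work without the batch-file quoting concerns, here it is run directly in PowerShell on the sample line from the question (the " chars need no \-escaping here because the pattern sits in single quotes):
'#define VERSION "0.1.2"' -replace '(?<=#define VERSION\s+").+?(?=")', '0.2.0'
# -> #define VERSION "0.2.0"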
A simpler alternative, if all that is needed is to replace the 1st line, without needing to preserve specific information from it:
powershell -nop -Command "@('#define VERSION \"%NEW_VERSION%\"') + (gc BBB.iss | Select -Skip 1) | Set-Content -Encoding ascii BBB.iss"
The command simply creates an array (@(...)) of output lines from the new 1st line and (+) all but the 1st line from the existing file (gc ... | Select-Object -Skip 1) and writes that back to the file.
[1] When calling from cmd.exe, escaping an embedded " as "" sometimes, but not always, works (try
powershell -Command "'Nat ""King"" Cole'").
Instead, \"-escaping is the safe choice.
`", which is the typical PowerShell-internal way to escape " inside "...", never works when calling from cmd.exe.
You can try this,
powershell -Command "(gc BBB.iss) -replace '(?m)^\s*#define VERSION .*$', '#define VERSION %NEW_VERSION% ' | Out-File BBB.iss"
If you want double quotes left,
powershell -Command "(gc BBB.iss) -replace '(?m)^\s*#define VERSION .*$', '#define VERSION "%NEW_VERSION%"' | Out-File BBB.iss"

Usage of | in PowerShell regex

I'm trying to split some text using PowerShell, and I'm doing a little experimenting with regex, and I would like to know exactly what the "|" character does in a PowerShell regex. For example, I have the following line of code:
"[02]: ./media/active-directory-dotnet-how-to-use-access-control/acs-01.png" | select-string '\[\d+\]:' | foreach-object {($_ -split '\[|\]')}
Running this line of code gives me the following output:
-blank line-
02
: ./media/active-directory-dotnet-how-to-use-access-control/acs-01.png
If I run the code without the "|" in the -split statement as such:
"[02]: ./media/active-directory-dotnet-how-to-use-access-control/acs-01.png" | select-string '\[\d+\]:' | foreach-object {($_ -split '\[\]')}
I get the following output without the [] being stripped (essentially it's just displaying the select-string output):
[02]: ./media/active-directory-dotnet-how-to-use-access-control/acs-01.png
If I modify the code and run it like this:
"[02]: ./media/active-directory-dotnet-how-to-use-access-control/acs-01.png" | select-string '\[\d+\]:' | foreach-object {($_ -split '\[|')}
In the output, the [ is stripped from the beginning but the output has a carriage return after each character (I did not include the full output for space purposes).
0
2
]
:
.
/
m
e
The Pipe character, "|", separates alternatives in regex.
You can see all the metacharacters defined here:
http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1
The answers already explain what the | is for but I would like to explain what is happening with each example that you have above.
-split '\[|\]': You are trying to match either [ or ], which is why you get 3 results. The first is a blank line, which is the empty string before the first [
-split '\[\]': Since you are omitting the | symbol in this example, you are asking to split on the two-character sequence [], which does not appear in your string. This contrasts with the string method $_.split('\[\]'), which treats its argument as a list of characters and splits on each of \, [ and ] individually. This is by design.
-split '\[|': Here you are running into a caveat of not specifying the right hand operand for the | operator. To quote the help from Regex101 when this regex is specified:
(null, matches any position)
Warning: An empty alternative effectively truncates the regex at this
point because it will always find a zero-width match
Which is why the last example ends up splitting between every character. Also, I don't think any of this is PowerShell-only; this behavior should be seen in other engines as well.
Walter Mitty is correct, | is for alternation.
You can also use [Regex]::Escape("string") in Powershell and it will return a string that has all the special characters escaped. So you can use that on any strings you want to match literally (or to determine if a specific character does or can have special meaning in a regex).
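For example, to split on a literal [ without hand-escaping it (using the sample string from the question; the $line variable is just for illustration):
$line = '[02]: ./media/active-directory-dotnet-how-to-use-access-control/acs-01.png'
$line -split [regex]::Escape('[')
# -> an empty string, then: 02]: ./media/active-directory-dotnet-how-to-use-access-control/acs-01.png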