Splitting text in PowerShell using content delimiter as filename

I am trying to split a txt transcription into single files, one for each folio.
The file is marked as [c. 1r],[c. 1v] ... [c. 7v] and so on.
Using this example I was able to create a PowerShell script that does the magic with a regex that matches each page delimiter, but I seem totally unable to use the regex to give the pages proper names. With this code:
$InputFile = "input.txt"
$Reader = New-Object System.IO.StreamReader($InputFile)
$a = 1
while (($Line = $Reader.ReadLine()) -ne $null) {
    if ($Line -match "\[c\. .*?\]") {
        $OutputFile = "MySplittedFileNumber$a$Matches.txt"
        $a++
    }
    Add-Content $OutputFile $Line
}
all the files are named MySplittedFileNumber1System.Collections.Hashtable.txt instead of using the match. With "$Matches[0]" I'm told that the variable does not exist or has been filtered by -Exclude.
All my attempts at setting the $regex before executing seem to go nowhere. Can someone point me to how to get the resulting filenames formatted as MySplittedFileNumber[c. 1r].txt?
Using just a partial match such as \[(c\. .*?)\] would be even better, but once I know how to retrieve the match, I bet I can find the solution.
I could generate the 1r, 1v labels from $a somehow, but I'd rather use the ones inside the txt file, since some folios may have been misnumbered in the manuscript and I need to retain that.
Content of original input.txt:
[c. 1r]
Text paragraph
text paragraph
...
Text paragraph
[c. 1v]
Text paragraph
text paragraph
...
Text paragraph
[c. 2r]
Text paragraph
text paragraph
...
Text paragraph
Desired result:
Content of MySplittedFileNumber[c. 1r].txt:
[c. 1r]
Text paragraph
text paragraph
...
Text paragraph
Content of MySplittedFileNumber[c. 1v].txt:
[c. 1v]
Text paragraph
text paragraph
...
Text paragraph
Content of MySplittedFileNumber[c. 2r].txt:
[c. 2r]
Text paragraph
text paragraph
...
Text paragraph

I tried to reproduce it and with a little change it worked:
$InputFile = "input.txt"
$Reader = New-Object System.IO.StreamReader($InputFile)
$a = 1
While (($Line = $Reader.ReadLine()) -ne $null) {
    If ($Line -match "\[c\. .*?\]") {
        # $Matches[0] holds the matched delimiter, e.g. [c. 1r]
        $OutputFile = "MySplittedFileNumber$a$($Matches[0]).txt"
        $a++
    }
    Out-File -LiteralPath "<yourFolder>\$OutputFile" -InputObject $Line -Append
}
$Reader.Close()   # release the input file when done
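To index into an array (or a hashtable such as $Matches) inside a double-quoted string, you have to wrap the expression in a subexpression, like this: $($array[number])
To write to the file, you should give the full path and not just the file name. -LiteralPath matters here because the square brackets in the file name would otherwise be interpreted as wildcard characters.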
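The difference between the spellings is easiest to see on a single delimiter line. The following is a minimal sketch, not part of the original answer; the sample line is made up, and the extra capture group is the partial-match pattern mentioned in the question:

```powershell
# Sample delimiter line (illustrative; in the real script it comes from the reader)
$Line = '[c. 1r]'

if ($Line -match '\[(c\. .*?)\]') {
    # $Matches is a hashtable, so plain expansion yields its type name
    "MySplittedFileNumber$Matches.txt"

    # Only $Matches is expanded here; "[0]" stays behind as literal text
    "MySplittedFileNumber$Matches[0].txt"

    # A subexpression evaluates the index first: the whole match, [c. 1r]
    "MySplittedFileNumber$($Matches[0]).txt"

    # Capture group 1: the match without the surrounding brackets, c. 1r
    "MySplittedFileNumber$($Matches[1]).txt"
}
```

The second form leaves a literal [0] in the name, which cmdlets that take -Path treat as a wildcard; that is probably where the "filtered by -Exclude" error in the question came from.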

From version 3 on, PowerShell's Get-Content cmdlet has the -Raw parameter, which reads a file as a whole into a single string. You can then split that string into chunks with a regular expression (using a positive lookahead).
The very same RegEx can be used to grab the section name and insert it into the destination file name.
## Q:\Test\2018\07\19\SO_51421567.ps1
##
$RE = [RegEx]'(?=(\[c\. \d+[rv]\]))'
$Sections = (Get-Content '.\input.txt' -raw) -split $RE -ne ''
ForEach ($Section in $Sections){
    If ($Section -Match $RE){
        $Section | Out-File -LiteralPath ("MySplittedFileNumber{0}.txt" -f $Matches[1])
    }
}
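To see what the lookahead split produces without touching the file system, here is a small sketch on an in-memory string. It is a variation on the answer, not a copy of it: the sample text is made up, and the lookahead here contains no capture group, because -split also emits any captured text as extra elements.

```powershell
# Sample text standing in for the contents of input.txt (illustrative)
$Text = "[c. 1r]`nText A`n[c. 1v]`nText B`n"

# The lookahead splits *before* each delimiter without consuming it,
# so every chunk still starts with its own [c. ...] marker.
# -ne '' drops the empty element produced by the match at position 0.
$Sections = $Text -split '(?=\[c\. \d+[rv]\])' -ne ''

foreach ($Section in $Sections) {
    if ($Section -match '^\[(c\. \d+[rv])\]') {
        # Print the file name each chunk would be written to
        "MySplittedFileNumber[{0}].txt" -f $Matches[1]
    }
}
```

With this sample the loop should print one name per folio, MySplittedFileNumber[c. 1r].txt and MySplittedFileNumber[c. 1v].txt.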

Related

Cannot remove text between two strings with ReadLines

test.txt contents:
foo
[HKEY_USERS\S-1-5-18\Software\Microsoft]
bar
delete me!
[HKEY_other_key]
end-------------
An online regex tester matches the text to be removed correctly (from the string delete up to the string [HKEY), but the same pattern doesn't remove anything when I run it in PowerShell ISE:
$file = [System.IO.File]::ReadLines("test.txt")
$pattern = $("(?sm)^delete.*?(?=^\[HKEY)")
$file -replace $pattern, "" # returns original test.txt including line "delete me!" which should be removed
It seems to be a problem with ReadLines, because when I use Get-Content instead:
$file = Get-Content -Path test.txt -Raw
it removes the unwanted line correctly, but I don't want to use Get-Content.

[System.IO.File]::ReadLines(..), like [System.IO.File]::ReadAllLines(..), returns the file as individual lines, so -replace is applied to each line separately and a pattern that spans several lines can never match.
Get-Content -Raw, like [System.IO.File]::ReadAllText(..), reads all the text in the file into a single string.
[System.IO.File]::ReadAllText("$pwd\test.txt") -replace "(?sm)^delete.*?(?=^\[HKEY)"
Results in:
foo
[HKEY_USERS\S-1-5-18\Software\Microsoft]
bar
[HKEY_other_key]
end-------------
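The difference can be reproduced without any file. This is a minimal sketch with made-up sample lines; the array stands in for what ReadLines returns and the joined string for what ReadAllText returns:

```powershell
$pattern = '(?sm)^delete.*?(?=^\[HKEY)'

# Line-by-line input: -replace runs once per element
$lines   = 'bar', 'delete me!', '[HKEY_other_key]'
$perLine = $lines -replace $pattern

# The same text as one string: the pattern can now span lines
$whole = $lines -join "`n"
$asOne = $whole -replace $pattern

$perLine -contains 'delete me!'   # True: nothing was removed
$asOne -split "`n"                # bar, [HKEY_other_key]
```

Within a single element the lookahead never finds a following line that starts with [HKEY, which is why the per-line version leaves everything in place.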
In case you do need to read the file line-by-line due to, for example, high memory consumption, switch -File is an excellent built-in PowerShell alternative:
switch -Regex -File 'test.txt' {
    '^delete' {             # if the line starts with `delete`
        $skip = $true       # set this variable to `$true`
        continue            # go to the next line
    }
    '^\[HKEY' {             # if the line starts with `[HKEY`
        $skip = $false      # set this variable to `$false`
        $_                  # output this line
        continue            # go to the next line
    }
    { $skip } { continue }  # if the variable is `$true`, go to the next line
    Default { $_ }          # if none of the previous conditions were met, output this line
}

Powershell - Regex match multiple lines from file

I am able to match and replace multiple lines if the text string is part of the PowerShell script:
$regex = @"
(?s)(--match from here--.*?
--up to here--)
"@
$text = @"
first line
--match from here--
other lines
--up to here--
last line
"@
$editedText = ($text -replace $regex, "")
$editedText | Set-Content ".\output.txt"
output.txt:
first line
last line
But if I instead read the text in from a file with Get-Content -Raw, the same regex fails to match anything.
$text = Get-Content ".\input.txt" -Raw
input.txt:
first line
--match from here--
other lines
--up to here--
last line
output.txt:
first line
--match from here--
other lines
--up to here--
last line
Why is this? What can I do to match the text read in from input.txt? Thanks in advance!

With a here-string, the pattern contains whatever newline characters the .ps1 file itself uses. It won't match if those differ from the newline characters used by the input file.
To remove this dependency, define a RegEx that uses \r?\n to match both kinds of newline:
$regex = "(?s)(--match from here--.*?\r?\n--up to here--)"
$text = Get-Content "input.txt" -Raw
$editedText = $text -replace $regex, ""
$editedText | Set-Content ".\output.txt"
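The newline dependency can be made visible by spelling the line breaks out explicitly. This is a sketch under the assumption that the script file uses CRLF and the input file uses LF; both strings are made-up stand-ins:

```powershell
# Input text with LF-only line endings (illustrative sample)
$text = "first line`n--match from here--`nother lines`n--up to here--`nlast line"

# What the here-string pattern effectively contains when the .ps1 file uses CRLF
$crlfPattern = "(?s)(--match from here--.*?`r`n--up to here--)"

# The newline-agnostic pattern from the answer
$anyPattern = '(?s)(--match from here--.*?\r?\n--up to here--)'

$text -match $crlfPattern   # False: the input contains no CR
$text -match $anyPattern    # True
```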
Alternatively, you may use a switch-based solution, which lets you use a simpler RegEx pattern:
$include = $true
& { switch -File 'input.txt' -RegEx {
    '--match from here--' { $include = $false }
    { $include } { $_ } # Output line if $include equals $true
    '--up to here--' { $include = $true }
}} | Set-Content 'output.txt'
The switch -File construct loops over all lines of the input file and passes each one to the match expressions.
When we find the 1st pattern we set an $include flag to $false, which causes the code to skip over all lines until after the 2nd pattern is found, which sets the $include flag back to $true.
Writing $_ on its own causes the current line to be outputted.
We pipe to Set-Content to reduce the memory footprint of the script. Instead of reading all lines into a variable in memory, we use a streaming approach where each processed line is immediately passed to Set-Content. Note that we can't pipe directly from a switch block, so as a workaround we wrap the switch inside a script block (& { ... } creates and calls the script block).
The idea has been adopted from this GitHub comment.

Using Powershell to match a pattern on all occurrences in a text file

I have a text file and I am using PowerShell to list the names that appear in the pattern below.
Contents of the file:
beta-clickstream-class="owner:"mike""
beta-clickstream-class="owner:"kelly""
beta-clickstream-class="owner:"sam""
beta-clickstream-class="owner:"joe""
beta-clickstream-class="owner:"john""
beta-clickstream-class="owner:"tam""
Output I am looking for
mike
kelly
sam
joe
john
tam
Script I am using is
$importPath = "test.txt"
$pattern = 'beta-clickstream-class="owner:"(.*?)""'
$string = Get-Content $importPath
$result = [regex]::match($string, $pattern).Groups[1].Value
$result
The above script only lists the first name in the file. Can you please guide me on how to list all the names in the file?

Get-Content returns an array of strings, so you would have to call [regex]::match() on each element of array $string.
However, the -replace operator, as suggested by AdminOfThings, enables a simpler solution:
(Get-Content $importPath) -replace '.+owner:"([^"]+).+', '$1'
Alternatively, you could have read the file into a single, multi-line string with Get-Content -Raw, followed by [regex]::Matches() (multiple matches), not [regex]::Match() (single match).
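The [regex]::Matches() route might look like the sketch below. It reuses the pattern from the question; the here-string is only a stand-in for reading the file with Get-Content $importPath -Raw, and the variable names are illustrative:

```powershell
# Sample content standing in for test.txt (illustrative)
$text = @'
beta-clickstream-class="owner:"mike""
beta-clickstream-class="owner:"kelly""
beta-clickstream-class="owner:"sam""
'@

$pattern = 'beta-clickstream-class="owner:"(.*?)""'

# Matches() returns every match; Groups[1] is the captured name
$names = [regex]::Matches($text, $pattern) | ForEach-Object { $_.Groups[1].Value }
$names   # mike, kelly, sam
```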

Parsing Data in powershell, with the format of Label:Data

I am doing an Invoke-WebRequest in PowerShell to a URL that does not contain any HTML, just text. I need to pick out a specific part of this data, which is in the format Label:Data. Each piece of data is on its own separate line. I'm looking for some ideas on how to accomplish this. I am looking to isolate the speed-over-ground:0.0 line. Here is a sample of the $Response.Content data:
rate-of-turn:0.0
course-over-ground:293.0
speed-over-ground:0.0
heading-true:243.0
hdop:1.0
active-waypoint-name:
bearing-to-waypoint:
distance-to-waypoint:
cross-track-error:0
cross-track-error-limit:
cross-track-error-scale:0
lateral-speed-bow:0.09
lateral-speed-stern:-0.05
longitudinal-speed:-0.05

I guess it's a single string, rather than an array of lines. So, split it into lines:
$Response.Content -split "`r?`n"
Find the line which says speed-over-ground:
$line = $Response.Content -split "`r?`n" | Where-Object { $_ -match 'speed-over-ground' }
Split the text from the number, using the : separator, and take the second item, converted from text to a number if appropriate:
[decimal]$speedOverGround = $line.Split(':')[1]
Alternatively, I might try to turn all of them into an object in a bulk transform. Complexity varies with the exact possible inputs, but this tries to convert numbers to numbers and leave empty ones as nulls:
$data = New-Object -TypeName PSCustomObject
$Response.Content -split "`r?`n" |
    Where-Object { $_ -match ':' } |      # skip blank or malformed lines
    ForEach-Object {
        $name, $value = $_.Split(':').Trim()
        $decimalValue = 0
        if ([decimal]::TryParse($value, [ref]$decimalValue)) {
            $value = $decimalValue        # numeric text becomes a [decimal]
        }
        elseif ($value -eq '') {
            $value = $null                # empty values become a real $null
        }
        $data | Add-Member -NotePropertyName $name -NotePropertyValue $value
    }
# Then you can do:
$data.'speed-over-ground'

Multiline Regex in PowerShell

I have this PowerShell script whose main purpose is to search through HTML files within a folder, find specific HTML markup, and replace it with what I tell it to.
I have been able to do 3/4 of my find and replaces perfectly. The one I am having trouble with involves a Regular Expression.
This is the markup that I am trying to make my regex find and replace:
<a href="programsactivities_skating.html"><br />
</a>
Here is the regex I have so far, along with the function I am using it in:
automate -school "C:\Users\$env:username\Desktop\schools\$question" -query '(?mis)(?!exclude1|exclude2|exclude3)(<a[^>]*?>(\s| |<br\s?/?>)*</a>)' -replace ''
And here is the automate function:
function automate($school, $query, $replace) {
$processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
foreach ($file in $processFiles) {
$text = Get-Content $file
$text = $text -replace $query, $replace
$text | Out-File $file -Force -Encoding utf8
}
}
I have been trying to figure out the solution to this for about 2 days now, and just can't seem to get it to work. I have determined that the problem is that I need to tell my regex to account for multiple lines, and that's what I'm having trouble with.
Any help anyone can provide is greatly appreciated.
Thanks in Advance.

Get-Content produces an array of strings, where each string contains a single line from your input file, so you won't be able to match text passages spanning more than one line. You need to merge the array into a single string if you want to be able to match more than one line:
$text = Get-Content $file | Out-String
or
[String]$text = Get-Content $file
or
$text = [IO.File]::ReadAllText($file)
Note that the 1st and 2nd methods don't preserve line breaks from the input file. Method 2 simply mangles all line breaks, as Keith pointed out in the comments, and method 1 puts <CR><LF> at the end of each line when joining the array. The latter may be an issue when dealing with Linux/Unix or Mac files.
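How the joining methods treat line breaks can be checked on a small array. This is a rough sketch with made-up sample lines; the -join line is an extra alternative not mentioned in the answer, and the cast result assumes $OFS has not been changed from its default:

```powershell
# Two lines, as Get-Content would return them (illustrative sample)
$lines = 'first', 'second'

$viaOutString = $lines | Out-String   # each line followed by the platform newline
$viaCast      = [string]$lines        # elements joined with $OFS (a space by default)
$viaJoin      = $lines -join "`n"     # explicit separator, no trailing newline

$viaCast   # first second
```

Only [IO.File]::ReadAllText() (or Get-Content -Raw) hands back the file's original line breaks untouched.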

I don't get what it is you're trying to do with those Exclude elements, but I find multi-line regex is usually easier to construct in a here-string:
$text = @'
<a href="programsactivities_skating.html"><br />
</a>
'@
$regex = @'
(?mis)<a href="programsactivities_skating.html"><br />
\s+?</a>
'@
$text -match $regex
True

Get-Content will return an array of strings; you want to concatenate the strings in question to create one:
function automate($school, $query, $replace) {
    $processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
    foreach ($file in $processFiles) {
        $text = ""
        # Append each line plus CRLF; the pipeline itself outputs nothing, so don't assign its result to $text
        Get-Content $file | % { $text += $_ + "`r`n" }
        $text = $text -replace $query, $replace
        $text | Out-File $file -Force -Encoding utf8
    }
}
