PowerShell to split a text file on specific string - regex

I am trying to split a large text file into several files based on a specific string. Every time I see the string ABCDE - 3 I want to cut and paste the content up to that string into a new text file. I also want to extract the last 4 of the social, last name and first name. The new text file needs to be saved as first_name, last_name and last 4 of social.
See text file example and a bit of initial code. I would feel much more comfortable doing it in Python but PowerShell is the only option.
$my_text = Get-Content .\ab.txt
$ssn_pattern = '([0-8]\d{2})-(\d{2})-(\d{4})'
ForEach ($file in $my_text)

To get the firstname, lastname and the last 4 digits of the social, you could make use of capturing groups and use those groups when assembling the filename.
From your pattern, only the last 4 digits should be grouped.
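As a small illustration of grouping only the last 4 digits, here is a sketch on a single made-up line. The sample text and the variable names are assumptions, since the real file layout is not shown here.

```powershell
# Hypothetical sample line; the real file layout may differ.
$line = 'ATTN: ABC MIOTTI, SANREMO 123-45-1111'

# Only the last 4 digits sit inside parentheses, so they are the only capture group.
if ($line -match '[0-8]\d{2}-\d{2}-(\d{4})') {
    $whole = $Matches[0]   # the entire matched number
    $last4 = $Matches[1]   # just the grouped last 4 digits
}
```

After a successful -match, the automatic $Matches variable holds the full match at index 0 and each capture group at its own index.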
You could use a pattern to start the match with TO: and from the next line get the values for the names and the number.
Then match all lines that do not start with ABCDE - 3 using a negative lookahead (?!
You can adjust the pattern and the code to match your exact text.
(?m)^[^\S\r\n]+TO:.*\r?\n\s*ATTN:\s*[A-Z]{3} ([^,\r\n]+),[^\S\r\n]*(.+?)[^\S\r\n]*[0-8]\d{2}-\d{2}-(\d{4})(?:\r?\n(?![^\S\r\n]+ABCDE - 3).*)*\r?\n[^\S\r\n]+ABCDE - 3.*
Regex demo
I constructed a code snippet using stackoverflow postings, so this might be improved. It basically comes down to loading the file as a raw string and getting all the matches.
Then loop over all the matches and get the groups to assemble a filename and save the full match as the content.
If there are names which contain spaces and you don't want those to be in the filename, you could replace those with an empty string.
Example code:
$my_text = Get-Content -Raw ./Documents/stack-overflow/powershell/ab.txt
$pattern = "(?m)^[^\S\r\n]+TO:.*\r?\n\s*ATTN:\s*[A-Z]{3} ([^,\r\n]+),[^\S\r\n]*(.+?)[^\S\r\n]*[0-8]\d{2}-\d{2}-(\d{4})(?:\r?\n(?![^\S\r\n]+ABCDE - 3).*)*\r?\n[^\S\r\n]+ABCDE - 3.*"
Select-String $pattern -input $my_text -AllMatches |
ForEach-Object { $_.Matches } |
ForEach-Object {
$fileName = -join ($_.groups[2].Value, $_.groups[1].Value, $_.groups[3].Value)
Write-Host $fileName
Set-Content -Path "your-path-here/$fileName.txt" -Value $_.Value
}
When I run this, I get 2 files with the content for each match:
MIOTTISAREMO2222.txt
MIOTTSANREMO1111.txt
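For names that contain spaces, the replacement mentioned above could look roughly like this. The three values are made-up stand-ins for the captured groups, not data from the real file.

```powershell
# Hypothetical captured values standing in for Groups[2], Groups[1] and Groups[3].
$first = 'SAN REMO'
$last  = 'DE MIOTTI'
$last4 = '1111'

# Join the parts, then strip every whitespace character from the result.
$fileName = (-join ($first, $last, $last4)) -replace '\s', ''
```

Leaving out the second operand of -replace substitutes the empty string, so writing -replace '\s' alone would do the same.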

Related

Powershell script to replace link:lalala.html[lalala] with xref:lalala.adoc[lalala] capture pattern and replace recursively

I have a folder full of text documents in .adoc format that have some text in them. The text is following: link:lalala.html[lalala]. I want to replace this text with xref:lalala.adoc[lalala]. So, basically, just replace link: with xref:, .html with .adoc, leave all the rest unchanged.
But the problem is that lalala can be anything from a word to ../topics/halva.html.
I definitely know that I need to use regex patterns, I previously used similar script. A replace directive wrapped in an object:
Get-ChildItem -Path *.adoc -file -recurse | ForEach-Object {
$lines = Get-Content -Path $PSItem.FullName -Encoding UTF8 -Raw
$patterns = @{
'(\[\.dfn \.term])#(.*?)#' = '$1_$2_' ;
}
$option = [System.Text.RegularExpressions.RegexOptions]::Singleline
foreach($k in $patterns.Keys){
$pat = [regex]::new($k, $option)
$lines = $pat.Replace($lines, $patterns.$k)
}
$lines | Set-Content -Path $PSItem.FullName -Encoding UTF8 -Force
}
Looks like I need a different script since the new task cannot be added as just another object. I could've just replaced each part separately, using two objects: replace link: with xref:, then replace .html with .adoc.
But this can interfere with other links that end with .html and don't start with link:. In the text, absolute links usually don't have link: in the beginning. They always start with http:// or https://. And they still may or may not end with .html. So the best idea is to take the whole string link:lalala.html[lalala] and try to replace it with xref:lalala.adoc[lalala].
I need the help of someone who knows regex and PowerShell, please this would save me.
As a pattern, you might use
\blink:(.+?)\.html(?=\[[^][]*])
\blink: Match link:
(.+?) Capture 1+ chars, as few as possible, in group 1
\.html match .html
(?=\[[^][]*]) Assert from an opening till closing square bracket at the right
Regex demo
In the replacement use group 1 using $1
xref:$1.adoc
Example
$Strings = @("link:lalala.html[lalala]", "link:../topics/halva.html[../topics/halva.html]")
$Strings -replace "\blink:(.+?)\.html(?=\[[^][]*])",'xref:$1.adoc'
Output
xref:lalala.adoc[lalala]
xref:../topics/halva.adoc[../topics/halva.html]
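To apply the same replacement to every .adoc file in a folder tree, as the question asks, something along these lines might work. The function names Convert-LinkToXref and Update-AdocLinks are made up for this sketch, and it assumes the files are UTF-8 and small enough to read whole.

```powershell
# Hypothetical helper: applies the pattern from above to one string.
function Convert-LinkToXref {
    param([string]$Text)
    $Text -replace '\blink:(.+?)\.html(?=\[[^][]*])', 'xref:$1.adoc'
}

# Hypothetical helper: rewrites every .adoc file below $Root in place.
function Update-AdocLinks {
    param([string]$Root = '.')
    Get-ChildItem -Path $Root -Filter *.adoc -File -Recurse | ForEach-Object {
        # -Raw reads the whole file as one string, so it is fully read before being rewritten.
        $text = Get-Content -LiteralPath $_.FullName -Encoding UTF8 -Raw
        Convert-LinkToXref $text |
            Set-Content -LiteralPath $_.FullName -Encoding UTF8 -NoNewline
    }
}
```

Calling Update-AdocLinks -Root .\docs would then rewrite the files under that folder. Absolute links such as https://example.com/page.html are left alone because they do not start with link:. Running it on a copy of the folder first is probably wise.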

Powershell: append text after string in file

Problem: I am trying to append a string after a tag. I got a large text file, and I only need to append some text after the tag (including the text xxxxxx) <xxxxxx>, and I cannot seem to figure it out just yet.
Currently I'm trying this with regex: <[(xxxxxx)]+>, which according to regex101.com does match the exact tag <xxxxxx>, but when I use this in Powershell it returns a lot of other stuff.
How can I make sure that Powershell only matches <xxxxxx> ? And to append some string after <xxxxxx> ?
Sample snippet from the text file: PredefinedSettings=<xxxxxx><abc test123 /abc></xxxxxx>
Sample PS command: Get-Content .\samplefile.ini | Select-String -Pattern "<[(xxxxxx)]+>"
Which returns the entire line PredefinedSettings=<xxxxxx><abc test123 /abc></xxxxxx> instead of just <xxxxxx>
If you want to output just the matched text, you can do the following:
Select-String -Path sample.ini -Pattern '<(/?xxxxxx)>' -AllMatches | Foreach-Object {
$_.Matches.Groups[1].Value # Outputs matched text between `<>`
$_.Matches.Value # Outputs all matched text
}
The -AllMatches switch will allow matching beyond the first match. So it would return <xxxxxx> and </xxxxxx>.
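A quick way to see the difference between the matched line and the matched text is to run Select-String on the sample line held in a string. This is only a sketch; the variable names are arbitrary.

```powershell
$line = 'PredefinedSettings=<xxxxxx><abc test123 /abc></xxxxxx>'

# Select-String emits a MatchInfo object; displayed as-is, it shows the whole line.
$found = $line | Select-String -Pattern '</?xxxxxx>' -AllMatches

# The individual hits are in the Matches property.
$tags = $found.Matches | ForEach-Object { $_.Value }
```

Here $found.Line is the full input line, while $tags holds only the opening and closing tag.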
If you want to replace text in a file, you can do the following:
(Get-Content .\samplefile.ini) -replace '<(/?xxxxxx)>','<$1Text>' |
Set-Content .\samplefile.ini
If your replacement text is in a variable, you will need to escape the $ for the capture group.
$Text = 'replacement Text'
(Get-Content .\samplefile.ini) -replace '<(/?xxxxxx)>',"<`$1$Text>" |
Set-Content .\samplefile.ini
$1 is the capture group 1 data matched within the first (). Depending on your Text, it may be wise to name your capture group. If Text is 23OtherText, <$123OtherText> will attempt to substitute capture group 123. Using a named capture group, you can do the following:
(Get-Content .\samplefile.ini) -replace '<(?<Tag>/?xxxxxx)>','<${Tag}Text>' |
Set-Content .\samplefile.ini
/? matches zero or one / character.
-replace will return all text not matched and all text replaced by the operator.
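If the goal is to append text right after the opening tag only, a variation like the following may be closer to what was asked. It is a sketch on the sample line, and the appended text is made up. ${1} is used instead of $1 so that text starting with digits is not read as part of the group number.

```powershell
$line = 'PredefinedSettings=<xxxxxx><abc test123 /abc></xxxxxx>'
$text = '23OtherText'   # hypothetical text to append

# `$ keeps PowerShell from expanding ${1}; the regex engine then sees ${1}23OtherText.
$result = $line -replace '(<xxxxxx>)', "`${1}$text"
```

The closing tag is untouched because the pattern has no /? in it.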
I hope I got your question right.
In regex Quantifiers are greedy so it will select from the first open tag to the last closing tag, you can change that by using a ?.
So your Regex will be <[(xxxxxx)]+?>.

Powershell regex match sequence doesn't work although it matches in Sublime Text find and replace

I am trying to create a Powershell regex statement to remove the top five lines of this output from a git diff file that has already been modified with Powershell regex.
[1mdiff --git a/uk1.adoc b/uk2.adoc</span>+++
[1mindex b5d3bf7..90299b8 100644</span>+++
[1m--- a/uk1.adoc</span>+++
[1m+++ b/uk2.adoc</span>+++
[36m@@ -1,9 +1,9 @@</span>+++
= Heading
Body text
Image shown because binary code doesn't show in the text
The following statement matches the text so the '= Heading' line is placed at the top of the page if I replace with nothing.
^[^=]*.[+][\n]
But in Powershell, it isn't matching the text.
Get-Content "result2.adoc" | % { $_ -Replace '^[^=]*.[+][\n]', '' } | Out-File "result3.adoc";
Any ideas about why it doesn't work in Powershell?
My overall goal is to create a diff file of two versions of an AsciiDoc file and then replace the ASCII codes with HTML/CSS code to display the resulting AsciiDoc file with green/red track changes.
The simplest - and faster - approach is to read the input file as a single, multiline string with Get-Content -Raw and let the regex passed to -replace operate across multiple lines:
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)' |
Set-Content result3.adoc
(?s) activates in-line option s which makes . match newline (\n) characters too.
^.+?\n(?==) matches from the start of the string (^) any number of characters (including newlines) (.+), non-greedily (?)
until a newline (\n) followed by a = is found.
(?=...) is a look-ahead assertion, which matches = without consuming it, i.e., without considering it part of the substring that matched.
Since no replacement operand is passed to -replace, the entire match is replaced with the implied empty string, i.e., what was matched is effectively removed.
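The same regex can be tried on an in-memory string before touching any file. The sample below is a simplified stand-in for the diff output, without the color codes.

```powershell
# Simplified stand-in for the diff output, joined with newlines into one string.
$diff = @(
    'diff --git a/uk1.adoc b/uk2.adoc'
    'index b5d3bf7..90299b8 100644'
    '--- a/uk1.adoc'
    '+++ b/uk2.adoc'
    '@@ -1,9 +1,9 @@'
    '= Heading'
    'Body text'
) -join "`n"

# Everything up to, but not including, the first line that starts with = is removed.
$body = $diff -replace '(?s)^.+?\n(?==)'
```

$body then starts at the = Heading line.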
As for what you tried:
The -replace operator passes its LHS through if no match is found, so you cannot use it to filter out non-matching lines.
Even if you match an undesired line in full and replace it with '' (the empty string), it will show up as an empty line in the output when sent to Set-Content or Out-File (>).
As for your specific regex, ^[^=]*.[+][\n] (whether or not the first ^ is followed by an ESC (0x1b) char.):
[\n] (just \n would suffice) tries to match a newline char. after a literal + ([+]), yet lines read individually with Get-Content (without -Raw) by definition are stripped of their trailing newline, so the \n will never match; instead, use $ to match the end of a line.
Instead of % (the built-in alias for the ForEach-Object cmdlet) you could have used ? (the built-in alias for the Where-Object cmdlet) to perform the desired filtering:
Get-Content result2.adoc | ? { $_ -notmatch '^\e\[' }
$_ -notmatch '^\e\[' returns $True only for lines that don't start (^) with an ESC character (\e, whose code point is 0x1b) followed by a literal (\) [, thereby effectively filtering out the lines before the = Heading line.
However, the multi-line -replace command at the top is a more direct and faster expression of your intent.
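The difference between replacing per line and filtering can be seen on a few made-up lines. This is a sketch; the ESC character is built with [char]0x1b because it cannot be typed directly.

```powershell
$esc = [char]0x1b
$lines = @(
    "${esc}[1mdiff --git a/uk1.adoc b/uk2.adoc"
    "${esc}[36m@@ -1,9 +1,9 @@"
    '= Heading'
    'Body text'
)

# -replace blanks the matching lines but still emits them as empty strings.
$replaced = @($lines | ForEach-Object { $_ -replace '^\e\[.*$' })

# Where-Object drops the matching lines entirely.
$filtered = @($lines | Where-Object { $_ -notmatch '^\e\[' })
```

$replaced still has four elements, two of them empty, while $filtered has only the two lines that should be kept.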
Here is the code I ended up with after help from @mklement0. This Powershell script creates MS Word-style track changes for two versions of an AsciiDoc file. It creates the Diff file, uses regex to replace ASCII codes with HTML/CSS tags, removes the Diff header (thank you!), uses AsciiDoctor to create an HTML file and then PrinceXML to create a PDF file of the output that I can send to document reviewers.
git diff --color-words file1.adoc file2.adoc > result.adoc;
Get-Content "result.adoc" | % {
$_ -Replace '(=+ ?)([A-Za-z\s]+)(\[m)', '$1$2' `
-Replace '\[32m', '+++<span style="color: #00cd00;">' `
-Replace '\[31m', '+++<span style="color: #cd0000; text-decoration: line-through;">' `
-Replace '\[m', '</span>+++' } | Out-File -encoding utf8 "result2.adoc" ;
(Get-Content -Raw result2.adoc) -replace '(?s)^.+?\n(?==)', '' | Out-File -encoding utf8 "result3.adoc" ;
asciidoctor result3.adoc -o result3.html;
prince result3.html --javascript -o result3.pdf;
Read-Host -Prompt "Press Enter to exit"
Here's a screenshot of the result using some text from Wikipedia:

How to use regex to remove everything except certain "key"/"character containing"

Running my code gives me this output in a txt file:
19:27:28.636 ASSOS\032AB5601\0223-\032312DEEE8EB423._http._tcp.local. can
be reached at ASSOS-032DEEE8EB423.local.:80 (interface 1)
So I just want to parse out string "ASSOS-032DEEE8EB423.local" and remove everything else from the txt file. I can't figure out how to use regex to do so to remove everything except string containing ASSOS-. So the thing is that the string will always contain ASSOS- but the rest is always changing to different numbers. So I'm trying to always be able to get ASSOS-XXXXXXXXXXX.local
This is how I'm trying to do:
$string = 'Get-Content C:\MyFile.Txt'
$pattern = ''
$string -replace $pattern, ' '
It's just that I don't know so much about regex and how to write it to parse out string containing "ASSOS-" and remove everything after ASSOS-XXXXXXXXXXX.local
I would pipe the file content to Select-String and return the values of matches for a string starting with "ASSOS-", ending with "local" and having whatever non-whitespace characters in between:
Get-Content test.txt | Select-String -Pattern "ASSOS-\S*local" | ForEach-Object {$_.Matches.Value}
A possible solution:
$str = "19:27:28.636 ASSOS\032AB5601\0223-\032312DEEE8EB423._http._tcp.local. can
be reached at **ASSOS-032DEEE8EB423.local**.:80 (interface 1)"
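$str -replace '(?s).*\*\*(.*?)\*\*.*', '$1'
The RegEx .*\*\*(.*?)\*\*.* captures all characters within **...**. The * have to be escaped by a \ to make it work, and the (?s) option lets . match the line break too, so the text on the other line is removed as well.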
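If the ** markers are not actually in the file, a variation that matches the host name directly might be simpler. This is a sketch using the sample text from the question. The pattern is an assumption: it expects the name to start with ASSOS- and end with .local, with no whitespace in between.

```powershell
$text = @'
19:27:28.636 ASSOS\032AB5601\0223-\032312DEEE8EB423._http._tcp.local. can
be reached at ASSOS-032DEEE8EB423.local.:80 (interface 1)
'@

# \S*? takes as few non-whitespace characters as possible before the first .local
$hostNames = [regex]::Matches($text, 'ASSOS-\S*?\.local') | ForEach-Object { $_.Value }
```

The first line is skipped because ASSOS is followed there by a backslash rather than a hyphen.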

Open a file and filter it using a regular expression

I have a large logfile and I want to extract (write to a new file) certain rows. The problem is I need a certain row and the row before. So the regex should be applied on more than one row. Notepad++ is not able to do that and I don't want to write a script for that.
I assume I can do that with Powershell and a one-liner, but I don't know where to start ...
The regular expression is not the problem, will be something like that ^#\d+.*?\n.*?Failed.*?$
So, how can I open a file using the Powershell, passing the regex and get the rows back that fits my expression?
Look at Select-String and -context parameter:
If you only need to display the matching line and the line before, use
(for a test I use my log file and my regex - the date there)
Get-Content c:\Windows\System32\LogFiles\HTTPERR\httperr2.log |
Select-String '2011-05-13 06:16:10' -context 1,0
If you need to manipulate it further, store the result in a variable and use the properties:
$line = Get-Content c:\Windows\System32\LogFiles\HTTPERR\httperr2.log |
Select-String '2011-05-13 06:16:10' -context 1
# for all the members try this:
$line | Get-Member
#line that matches the regex:
$line.Line
$line.Context.PreContext
If there are more lines that match the regex, access them with brackets:
$line = Get-Content c:\Windows\System32\LogFiles\HTTPERR\httperr2.log |
Select-String '2011-05-13 06:16:10' -context 1
$line[0] # first match
$line[1] # second match
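To try the -Context idea without a real log file, a few made-up lines can be piped in. The log content and the Failed pattern here are assumptions based on the regex in the question.

```powershell
# Hypothetical log lines: a header row followed by a status row.
$log = @(
    '#1 job start'
    'step ok'
    '#2 job start'
    'step Failed: disk full'
    '#3 job start'
    'step ok'
)

# -Context 1,0 keeps one line before each match and none after.
$hits = @($log | Select-String -Pattern 'Failed' -Context 1,0)

# Collect the preceding row and the matching row for each hit.
$rows = foreach ($hit in $hits) {
    $hit.Context.PreContext
    $hit.Line
}
```

$rows could then be written to a new file with Set-Content.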