I'm new to PowerShell, and there seem to be a few differences in the way regexes are handled. I'm currently iterating through a large number of txt files and want the start of each one of them (which is a URL) up to the | character.
The start of every file is a URL ending in a slash. This was my umpteenth attempt, with no luck:
$FirstUrl = '.*/\|$'
It's pushed through a ForEach-Object loop, from which every other piece of information I'm trying to grab comes out as expected:
Foreach-Object {
$FileContent = Get-Content $_.FullName
$Pos = Select-String -InputObject $FileContent -Pattern $FirstURL
Any tips on how to phrase the regex in $FirstURL? I'm generally 'OK' at regex and have googled my face off trying to find the proper documentation for PowerShell.
If each file has the URL on the first line, followed by a pipe, then you do not need a regex at all. You can split the string directly:
$FileContent = Get-Content $_.FullName
$FileContent.Split('|')[0]
Split breaks the string into an array; the part before the first pipe ends up at index 0, so you can take it from there.
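For example, tied back to the loop in the question, a minimal sketch could look like this (C:\logs is a placeholder path, not from the original post):
# Grab the URL before the first '|' in each txt file
Get-ChildItem -Path 'C:\logs' -Filter *.txt | ForEach-Object {
    $FileContent = Get-Content $_.FullName
    # The URL sits before the first '|' on the first line
    $Url = ($FileContent | Select-Object -First 1).Split('|')[0]
    Write-Output $Url
}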
Hope it helps.
I am trying to split a large text file into several files based on a specific string. Every time I see the string ABCDE - 3, I want to cut and paste the content up to that string into a new text file. I also want to extract the last 4 of the social, the last name, and the first name. The new text file needs to be saved as first_name, last_name and the last 4 of the social.
See the text file example and a bit of initial code. I would feel much more comfortable doing it in Python, but PowerShell is the only option.
$my_text = Get-Content .\ab.txt
$ssn_pattern = '([0-8]\d{2})-(\d{2})-(\d{4})'
ForEach ($file in $my_text)
To get the firstname, lastname and the last 4 digits of the social, you could make use of capturing groups and use those groups when assembling the filename.
From your pattern, only the last 4 digits should be grouped.
You could use a pattern that starts the match at TO: and, from the next line, picks up the values for the names and the number.
Then match all lines that do not start with ABCDE - 3 using a negative lookahead (?!...).
You can adjust the pattern and the code to match your exact text.
(?m)^[^\S\r\n]+TO:.*\r?\n\s*ATTN:\s*[A-Z]{3} ([^,\r\n]+),[^\S\r\n]*(.+?)[^\S\r\n]*[0-8]\d{2}-\d{2}-(\d{4})(?:\r?\n(?![^\S\r\n]+ABCDE - 3).*)*\r?\n[^\S\r\n]+ABCDE - 3.*
I constructed a code snippet using Stack Overflow postings, so this might be improved. It basically comes down to loading the file as a raw string and getting all the matches.
Then loop over all the matches and use the groups to assemble a filename, and save the full match as the content.
If there are names which contain spaces and you don't want those in the filename, you could replace the spaces with an empty string (see the variation after the example output below).
Example code:
$my_text = Get-Content -Raw ./Documents/stack-overflow/powershell/ab.txt
$pattern = "(?m)^[^\S\r\n]+TO:.*\r?\n\s*ATTN:\s*[A-Z]{3} ([^,\r\n]+),[^\S\r\n]*(.+?)[^\S\r\n]*[0-8]\d{2}-\d{2}-(\d{4})(?:\r?\n(?![^\S\r\n]+ABCDE - 3).*)*\r?\n[^\S\r\n]+ABCDE - 3.*"
# Find all matches of the pattern in the raw file content
Select-String $pattern -InputObject $my_text -AllMatches |
ForEach-Object { $_.Matches } |
ForEach-Object {
    # Group 1 = part before the comma (last name), group 2 = part after it (first name), group 3 = last 4 of the SSN
    $fileName = -join ($_.Groups[2].Value, $_.Groups[1].Value, $_.Groups[3].Value)
    Write-Host $fileName
    Set-Content -Path "your-path-here/$fileName.txt" -Value $_.Value
}
When I run this, I get 2 files with the content for each match:
MIOTTISAREMO2222.txt
MIOTTSANREMO1111.txt
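If you do want to strip spaces from the name parts, a variation of the filename line could look like this (same capture groups as in the pattern above; this is just a sketch, not part of the original answer):
# Variation: remove spaces from the assembled filename
$fileName = (-join ($_.Groups[2].Value, $_.Groups[1].Value, $_.Groups[3].Value)) -replace ' ', ''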
I have a question which I'm pretty much stuck on.
I have a file called xml_data.txt and another file called entry.txt
I want to replace everything between <core:topics> and </core:topics>
I have written the script below:
$test = Get-Content -Path ./xml_data.txt
$newtest = Get-Content -Path ./entry.txt
$pattern = "<core:topics>(.*?)</core:topics>"
$result0 = [regex]::match($test, $pattern).Groups[1].Value
$result1 = [regex]::match($newtest, $pattern).Groups[1].Value
$test -replace $result0, $result1
When I run the script, the output goes to the console, but it doesn't look like it made any change.
Can someone please help me out?
There are three main issues here:
You read the file line by line, but the blocks of text are multiline strings
Your regex does not match newlines as . does not match a newline by default
Also, the literal text you search for must be regex-escaped, and when replacing with a dynamic replacement pattern you must always dollar-escape the $ symbol. Or use the simple string .Replace() method.
So, you need to
Read the whole file into a single variable: $test = Get-Content -Path ./xml_data.txt -Raw
Use the $pattern = "(?s)<core:topics>(.*?)</core:topics>" regex (it can be enhanced in case it works too slow by unrolling it to <core:topics>([^<]*(?:<(?!</?core:topics>).*)*)</core:topics>)
Use $test -replace [regex]::Escape($result0), $result1.Replace('$', '$$') to "protect" $ chars in the replacement, or $test.Replace($result0, $result1).
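Putting the three fixes together, a minimal sketch (using the file names from the question) could look like this:
# Read both files as single multiline strings
$test = Get-Content -Path ./xml_data.txt -Raw
$newtest = Get-Content -Path ./entry.txt -Raw
# (?s) lets . match newlines, so the block between the tags is captured
$pattern = "(?s)<core:topics>(.*?)</core:topics>"
$result0 = [regex]::Match($test, $pattern).Groups[1].Value
$result1 = [regex]::Match($newtest, $pattern).Groups[1].Value
# Escape the search text and protect '$' in the replacement
$test -replace [regex]::Escape($result0), $result1.Replace('$', '$$')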
I am trying to make a script that takes an XML file, looks for a matching condition, and if it finds it, adds a new line of asterisks; then, when done going through the file, it strips all the XML tags and leaves the data in a plain text file.
The script has been tested on a small input XML file and works fine, but when I pass a large XML file to it, it takes forever (I'm not actually sure how long, as I ran it for over an hour with still no result, so I just stopped it).
I'm guessing I must be performing the work in an extremely inefficient manner, hoping you guys can help me make it fast and efficient.
Here is the script below:
# Takes input XML File, cleans up XML elements, outputs plain text file
$FileName = "C:\Users\someguy\Desktop\input.xml"
$Pattern = "ProcessSpecifier = ""true"""
$FileOriginal = Get-Content $FileName
[String[]] $FileModified = @()
Foreach ($Line in $FileOriginal)
{
$FileModified += $Line
if ($Line -match $Pattern)
{
#Add Lines after the selected pattern
$FileModified += "*************isActive=true*****************"
}
}
$FileModified -replace "<[^>]+>", "" | Out-File C:\Users\someguy\Desktop\Output.txt
Let's go with a look behind and a bunch of regex to speed things up here. Also, I'm not going to store the whole thing in memory, I'm just going to pass it down the pipeline, which should help. I remove whitespace from the beginning and ends of lines, and filter out blank lines, but you can remove that bit if you want.
# Takes input XML File, cleans up XML elements, outputs plain text file
$FileName = "C:\Users\someguy\Desktop\input.xml"
$Pattern = '(?<=^.*ProcessSpecifier = "true".*$)'
(Get-Content $FileName) -replace $Pattern, "`n*************isActive=true*****************" -replace '<[^>]+?>' -replace '^\s*|\s*$' | ?{$_} | Set-Content C:\Users\someguy\Desktop\Output.txt
So, the main thing here is that I use a look behind to find your pattern text, and then add a new line and the asterisk line to that line. So that the line
<SomeTag>ProcessSpecifier = "true"</SomeTag>
becomes:
<SomeTag>ProcessSpecifier = "true"</SomeTag>`n*************isActive=true*****************
When used inside double quotes, a backtick ` followed by n creates a new line, so the '*************isActive=true*****************' ends up on its own line immediately following your search-pattern line. Past that I remove the XML tags, and then any leading or trailing whitespace from each line.
After the RegEx replacements I pass the result to a Where statement that removes blank lines, and then pass the remaining lines to Set-Content which I've seen better performance out of than Out-File.
Variation of TheMadTechnician's answer:
# Takes input XML File, cleans up XML elements, outputs plain text file
$FileName = "C:\Users\someguy\Desktop\input.xml"
$Pattern = '(?<=^.*ProcessSpecifier = "true".*$)'
Set-Content -Path C:\Users\someguy\Desktop\Output.txt -Value (((Get-Content $FileName) -replace $Pattern, "`n*************isActive=true*****************" -replace '<[^>]+?>' -replace '^\s*|\s*$').Where{$_})
I actually try to avoid the pipeline; it is rather slow, AFAIK. Of course, you will run into problems with memory consumption if the files are very large.
The "().Where" construct doesn't work on all PowerShell versions (it requires version 4+, IIRC).
This is a guess, I am not sure whether this is actually faster than TheMadTechnician's. I'd be curious about the result :)
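One way to settle it would be to wrap each variant in Measure-Command (a quick sketch; the variant-*.ps1 script names are placeholders for the two snippets above):
# Hypothetical timing comparison of the two approaches
Measure-Command { & .\variant-pipeline.ps1 } | Select-Object TotalSeconds
Measure-Command { & .\variant-setcontent.ps1 } | Select-Object TotalSeconds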
I have a script that goes through an HTTP access log, filters out some lines based on a regex pattern, and copies them into another file:
param($workingdate=(get-date).ToString("yyMMdd"))
Get-Content "access-$workingdate.log" |
Select-string -pattern $pattern |
Add-Content "D:\webStatistics\log\filtered-$workingdate.log"
My logs can be quite large (up to 2 GB), which takes up to 15 minutes to run. Is there anything I can do to improve the performance of the statement above?
Thank you for your thoughts!
See if this isn't faster than your current solution:
param($workingdate=(get-date).ToString("yyMMdd"))
Get-Content "access-$workingdate.log" -ReadCount 2000 |
foreach { $_ -match $pattern |
Add-Content "D:\webStatistics\log\filtered-$workingdate.log"
}
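A small illustration of why this works: with -ReadCount, each $_ coming down the pipeline is an array of lines, and -match applied to an array returns the matching elements rather than a Boolean (the sample lines are made up):
$chunk = 'GET /a.js', 'GET /b.png', 'GET /c.js'
$chunk -match '\.js$'    # returns 'GET /a.js' and 'GET /c.js'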
You don't show your patterns, but I suspect they are a large part of the problem.
You will want to look for a new question here (I am sure it has been asked) or elsewhere for detailed advice on building fast regular expression patterns.
But I find the best advice is to anchor your patterns and avoid runs of unknown length of all characters.
So instead of a pattern like path/.*/.*\.js use one with a $ on the end to anchor it to the end of the string. That way the regex engine can tell immediately that index.html is not a match. Otherwise it has to do some rather complicated scans with path/ and .js possibly showing up anywhere in the string. This example of course assumes the file name is at the end of the log line.
Anchors work well with start-of-line patterns as well. A pattern might look like ^[^"]*"GET /myfile". That has an unknown run length, but at least the engine knows it doesn't have to restart the search for more quotes after finding the first one; the [^"] character class allows the regex engine to stop, because the pattern can't match after the first quote.
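As a quick illustration of the anchored pattern (the log lines below are made up; adjust to your actual format):
# Only scan from the start of the line up to the first quote
$lines = '10.0.0.1 - - "GET /myfile HTTP/1.1" 200',
         '10.0.0.2 - - "GET /index.html HTTP/1.1" 200'
$lines -match '^[^"]*"GET /myfile'    # only the first line matches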
You could also try seeing if using streams would speed it up. Something like this might help, although I couldn't test it because, as mentioned above, I'm not sure what pattern you are using.
param($workingdate=(get-date).ToString("yyMMdd"))
# Stream the log line by line instead of loading it all through the pipeline
$file = New-Object System.IO.StreamReader -Arg "access-$workingdate.log"
$stream = New-Object System.IO.StreamWriter -Arg "D:\webStatistics\log\filtered-$workingdate.log"
while ($line = $file.ReadLine()) {
    # Write out only the lines that match the pattern
    if ($line -match $pattern) {
        $stream.WriteLine($line)
    }
}
$file.Close()
$stream.Close()
I am looking for a way to automate a manual task. I'm not sure if it's even possible.
I have to find a pattern of string in all files in a project folder. It's a C#/.NET project (if that matters at all). I have to also write the function name and file name where the pattern occurs, along with the full string that matches it. So far I've done the following in PowerShell:
PS C:\trunk> Get-ChildItem "C:\trunk" -recurse | Select-String -pattern
"AlertMessage" | group path | select name
This prints file name where string pattern matches.
PS C:\trunk> Select-String -pattern "AlertMessage" -path
"C:\trunk\VATScan.Web\Areas \Administration\Controllers\HomeController.cs”
This prints line number and string that matches it in a given file.
Any pointers on how I can achieve my goal?
By no means perfect, but at least this may fall under the category of a pointer:
$text = @"
Public Sub Bitchin()
Dim AlertMe
End Sub
Private Sub Function() As something
End Function
"#
[void]($text -match "(?smi)((public|private)\W(sub|function)\W(.+?)\(.*?Alertme)")
$Matches[4]
This will look for a Function or Sub declaration, with a single whitespace character between words, followed by the next occurrence of the word AlertMe.
Need to get item 4 from $Matches since there are a bunch of capture groups.
A more concise explanation of the regex used can be found here
Hopefully this will get you started, or at least thinking. I am not familiar with C# declarations, as $text is more of a VBA example, but you should get the idea.
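To apply the same idea across the project folder and also report the file name, a rough sketch could look like the following. The C# pattern is an assumption (it only covers simple public/private method declarations) and the output property names are just for illustration:
# For each .cs file, report the file, the enclosing method name, and the line
# containing 'AlertMessage' (only the first occurrence per file in this sketch)
Get-ChildItem "C:\trunk" -Recurse -Filter *.cs | ForEach-Object {
    $content = Get-Content $_.FullName -Raw
    if ($content -match '(?smi)(public|private)\W\w+\W(\w+)\s*\(.*?([^\r\n]*AlertMessage[^\r\n]*)') {
        [pscustomobject]@{
            File     = $_.FullName
            Function = $Matches[2]
            Line     = $Matches[3].Trim()
        }
    }
}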