powershell script consuming all memory - regex

I'm running the following script to check a group of files for card numbers. When I run it against 38 files totaling 600 MB, it maxes out the CPU (restricted to 50%) and memory (3.3 GB of 4.0 GB physical).
I'm looking for ideas on why this happens and how to optimize it.
Thanks!
Get-ChildItem "c:\REGEX\ScanMeFiles\" -Recurse |`
Foreach-Object{
$content = Get-Content $_.FullName
$outfile = 'c:\regex\results\'+$_.BaseName+'_results.log'
$content | Where-Object {$_ -match '\b(?:3[47]\d|(?:4\d|5[1-5]|65)\d{2}|6011)\d{12}\b'} | Set-Content $outfile
}

I would make it a little more contained. Do something like this with fewer variables:
# Keep the file objects (not just the path strings) so BaseName is still available
$children = Get-ChildItem
foreach($child in $children){
Get-Content $child.FullName | ?{$_ -match '\b(?:3[47]\d|(?:4\d|5[1-5]|65)\d{2}|6011)\d{12}\b'} | Set-Content ('c:\regex\results\'+$child.BaseName+'_results.log')
}

With Matt's help, this is what I came up with. It runs in under a minute against my test data. Thanks!
Get-ChildItem "c:\REGEX\ScanMeFiles\" |
Foreach-Object{
$content = $_.FullName
$outfile = 'c:\regex\results\'+$_.BaseName+'_results.log'
$regex = '\b(?:3[47]\d|(?:4\d|5[1-5]|65)\d{2}|6011)\d{12}\b'
select-string -Path $content -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } | Set-Content $outfile
}
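For anyone comparing the two versions: assigning Get-Content to a variable keeps every line of the file in memory as a separate object before any filtering starts, while Select-String -Path reads the file itself and only emits the lines that match. Below is a small self-contained sketch of the second approach. The sample lines and the temp file are made up for illustration; only the regex comes from the question.

```powershell
# Card-number pattern from the question
$regex = '\b(?:3[47]\d|(?:4\d|5[1-5]|65)\d{2}|6011)\d{12}\b'

# Hypothetical sample data written to a temp file (not from the original post)
$sample = [System.IO.Path]::GetTempFileName()
Set-Content -Path $sample -Value @(
    'order 1001 paid with 4111111111111111',
    'no card on this line',
    'amex 378282246310005 and discover 6011111111111117'
)

# Select-String opens the file itself; nothing is stored in a $content variable
$found = Select-String -Path $sample -Pattern $regex -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Value }

Remove-Item $sample
$found
```

Because -AllMatches is set, a line with two card numbers yields two values, which is why the last sample line contributes more than one result.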

powershell regex Pattern

I get different results.
In PowerShell, using:
$Matches = Select-String -InputObject (Get-Content "StevenBlackhosts-urls.txt") `
-Pattern "(^|\.)ad[sxvkdz]\-" -AllMatches #`
$Matches.Matches.Count
I get 12 matches, which is incorrect.
In Notepad++, using Find and Count with
"(^|\.)ad[sxvkdz]\-"
I get 62 matches, which is correct.
I don't know what's wrong.
The text file "StevenBlackhosts-urls.txt" contains 65106 lines ...
zeus.ad.intl.xiaomi.com
api.ad.intl.xiaomi.com
sdkconfig.ad.intl.xiaomi.com
adv.sec.intl.miui.com
zeus.ad.xiaomi.com
www.api.ad.intl.xiaomi.com
ampmetrics.engadget.com
c.adskeeper.co.uk
events3.adcolony.com
metrics.adage.com
ads.feedly.com
lepodownload.mediatek.com
ads.aerserv.com
ads.mp.mydas.mobi
ads.nexage.com
sdk.adincube.com
dasdada.fu.ck
i1.dl-ad.com
ad.api.kaffnet.com
ad.click.kaffnet.com
api.ad.snappea.com
etc..
Testing it this way, I do get the same result as Notepad++. Why does this happen?
$Matches = Select-String -InputObject (Get-Content "StevenBlackhosts-urls.txt") -Pattern "( |\.)ad[sxvkdz]\-" -AllMatches
$Matches.Matches.Count
It also works well like this, giving me the 62 lines:
Get-Content "StevenBlackhosts-urls.txt" | Select-String -Pattern "(^|\.)ad[sxvkdz]\-" -AllMatches | set-content "test.txt"
Get-Content "StevenBlackhosts-urls.txt" | Select-String -Pattern "(^|\.)ad[sxvkdz]\-" -AllMatches | Measure-Object -Line | Select-Object -ExpandProperty Lines
This works correctly for me.
Thank you all for your suggestions.
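A likely explanation for the difference, as far as I can tell: when a whole array is handed to -InputObject, Select-String treats it as one combined string (the lines joined with spaces), so ^ can only match at the very start. When the lines are piped in, each one is tested separately. That would also explain why swapping ^ for a space gave the right count. The three host names below are invented for the demonstration.

```powershell
# Hypothetical sample lines; each one should count as a hit
$lines = 'ads-one.example', 'x.ads-two.example', 'ads-three.example'
$pattern = '(^|\.)ad[sxvkdz]-'

# Whole array passed as a single parameter value
$viaParam = Select-String -InputObject $lines -Pattern $pattern -AllMatches

# Same data, one line at a time through the pipeline
$viaPipe = $lines | Select-String -Pattern $pattern -AllMatches

"InputObject matches: $($viaParam.Matches.Count)"
"Pipeline lines matched: $(@($viaPipe).Count)"
```

The -InputObject count should come out lower than the pipeline count, because the third line no longer starts at a position where ^ can match. The variable is also named $viaParam rather than $Matches, since $Matches is an automatic variable that the -match operator overwrites.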

How to recursively scrape email addresses from files with Powershell?

I'm trying to scrape email addresses with PowerShell from a directory with subdirectories that contain .txt files. So I have this code:
$input_path = 'C:\Users\Me\Documents\toscrape'
$output_file = 'C:\Users\Me\Documents\toscrape\output.txt'
$regex = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file
But when I execute it, it gives me an error
select-string : The file C:\Users\Me\Documents\toscrape\ can not be read: Could not
path 'C:\Users\Me\Documents\toscrape\'.
At line:1 char:1
+ select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidArgument: (:) [Select-String], ArgumentException
+ FullyQualifiedErrorId : ProcessingFile,Microsoft.PowerShell.Commands.SelectStringCommand
I've tried variations to the $input_path, with Get-Item, Get-ChildItem, -Recurse, but nothing seems to work. Can anyone figure out how I need to scrape my location and all its subdirectories and files for the regex pattern?
The error is because Select-String assumes the -Path points to a file or is a wildcard pattern, and $input_path is pointing to a folder. You could use:
$input_path = 'C:\Users\Me\Documents\toscrape\*.txt'
Select-String $input_path ....
However, since you want to recurse through subdirectories, you'll need to use Get-ChildItem to do that.
$input_path = 'C:\Users\Me\Documents\toscrape'
$output_file = 'C:\Users\Me\Documents\toscrape\output.txt'
$regex = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'
Get-ChildItem $input_path -Include *.txt -Recurse |
Select-String -Pattern $regex -AllMatches |
Select-Object -ExpandProperty Matches |
Select-Object -ExpandProperty Value |
Set-Content $output_file
Note that your regex may cause problems here. You're using \b for word boundary, but period ., hyphen -, and percent sign % are all non-word (\W) characters. The word characters (\w) are [A-Za-z0-9_].
For example:
PS C:\> '%username@example.com' -match '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'
True
PS C:\> $Matches.Values
username@example.com
If that's what you want the pattern to do, that's great, but it is something to be aware of. Regex for an email address is notoriously difficult.
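If you do want leading characters such as % kept in the match, one possible workaround (a sketch only, not the only way to do it) is to swap the leading \b for a negative lookbehind on the same character class. The variable names here are made up for the comparison.

```powershell
$address = '%username@example.com'

# Pattern from the question: \b cannot sit between start-of-string and '%'
$withBoundary = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'

# Alternative: require that no local-part character precedes the match
$withLookbehind = '(?<![A-Za-z0-9._%-])[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'

$null = $address -match $withBoundary
$boundaryResult = $Matches[0]      # match starts after the '%'

$null = $address -match $withLookbehind
$lookbehindResult = $Matches[0]    # match includes the leading '%'

$boundaryResult
$lookbehindResult
```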
Your correction didn't work but gave me another error, @Bacon Bits. However, you put me on the right track. I adapted it a bit, and this worked for me.
$input_path = 'C:\Users\Me\Documents\toscrape'
$output_file = 'C:\Users\Me\Documents\toscrape\output.txt'
$regex = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'
Get-ChildItem $input_path -Recurse | Select-String -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file

Extract MAC address and UUID from string

I'm extracting a string that contains a lot of text along with both a MAC address and a UUID.
For example:
![LOG[AA:AA:AA:AA:AA:AA, 0A0A0000-0000-0000-0000-A0A00A000000: found optional advertisement C0420054]LOG]!><time="09:07:57.573-120" date="04-19-2017" component="SMSPXE" context="" type="1" thread="2900" file="database.cpp:533"
I would like to strip the output so that it only displays the MAC address (e.g. AA:AA:AA:AA:AA:AA) and the UUID (e.g. 0A0A0000-0000-0000-0000-A0A00A000000).
I don't know how to trim the output.
Here is my script:
$Path = "\\AAAAAAAA\logs$"
$Text = "AA:AA:AA:AA:AA:AA"
$PathArray = @()
$Results = "C:\temp\test.txt"
# This code snippet gets all the files in $Path that end in ".log".
Get-ChildItem $Path -Filter "*.log" |
Where-Object { $_.Attributes -ne "Directory"} |
ForEach-Object {
If (Get-Content $_.FullName | Select-String -Pattern $Text) {
$PathArray += $_.FullName
$PathArray += $_.FullName
}
}
Write-Host "Contents of ArrayPath:"
$PathArray | ForEach-Object {$_}
get-content $PathArray -ReadCount 1000 |
foreach { $_ -match $Text}
Instead of using the Where-Object cmdlet to filter out directories, you can use the -File switch of the Get-ChildItem cmdlet. You also don't have to load the content with Get-Content yourself; just pipe the files to the Select-String cmdlet.
To grab the MAC address and UUID, I just googled both regexes and combined them:
$Path = "\\AAAAAAAA\logs$"
$Pattern = '([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2}),\s+(\{{0,1}([0-9a-fA-F]){8}-([0-9a-fA-F]){4}-([0-9a-fA-F]){4}-([0-9a-fA-F]){4}-([0-9a-fA-F]){12}\}{0,1})'
$Results = "C:\temp\test.txt"
Get-ChildItem $Path -Filter "*.log" -File |
Select-String $Pattern |
ForEach-Object {
$_.Matches.Value
} |
Out-File $Results
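If you want the MAC address and UUID as separate values rather than one combined match, named capture groups are one option. This is a simplified sketch: the group names mac and uuid are my own, the pattern is a trimmed-down variant of the one above (no optional braces), and the test line is the sample from the question.

```powershell
# Sample log line from the question (shortened after the closing LOG marker)
$line = '![LOG[AA:AA:AA:AA:AA:AA, 0A0A0000-0000-0000-0000-A0A00A000000: found optional advertisement C0420054]LOG]!>'

# Named groups (hypothetical names) capture each part separately
$pattern = '(?<mac>(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}),\s+(?<uuid>[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12})'

$result = $null
if ($line -match $pattern) {
    # $Matches is filled by -match; index it by group name
    $result = [pscustomobject]@{
        Mac  = $Matches['mac']
        Uuid = $Matches['uuid']
    }
}
$result
```

With Select-String, the same groups should be reachable per match through $_.Matches[0].Groups['mac'].Value and $_.Matches[0].Groups['uuid'].Value.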

How to find all regular expression matches in the file

I have a list of regular expressions (about 2000) and over a million HTML files. I want to check whether each regular expression matches in each file. How can I do this in PowerShell?
Performance is important, so I don't want to loop through the regular expressions.
I tried
$text | Select-String -Pattern pattern1, pattern2,...
It returns all matches, but I also want to find out which patterns matched and which did not. I need to build a list of the matching regular expressions for each file.
You could try something like this:
$regex = "^test","e2$" #Or use (Get-Content <path to your regex file>)
$ht = @{}
#Modify Get-Childitem to your criterias(filter, path, recurse etc.)
Get-ChildItem -Filter *.txt | Select-String -Pattern $regex | ForEach-Object {
$ht[$_.Path] += @($_ | Select-Object -ExpandProperty Pattern)
}
Test-output:
$ht | Format-Table -AutoSize
Name Value
---- -----
C:\Users\graimer\Desktop\New Text Document (2).txt {e2$}
C:\Users\graimer\Desktop\New Text Document.txt {^test, e2$}
You didn't specify how you wanted the output.
UPDATE: To match multiple patterns on a single line, try this (mjolinor's answer is probably faster than this).
$regex = "^test","e2$" #Or use (Get-Content <path to your regex file>)
$ht = @{}
#Modify Get-Childitem to your criterias(filter, path, recurse etc.)
$regex | ForEach-Object {
$pattern = $_
Get-ChildItem -Filter *.txt | Select-String -Pattern $pattern | ForEach-Object {
$ht[$_.Path] += @($_ | Select-Object -ExpandProperty Pattern)
}
}
UPDATE2: I don't have enough samples to try it, but since you have such a huge number of files, you might want to try reading each file into memory before looping through the patterns. It may be faster.
$regex = "^test","e2$" #Or use (Get-Content <path to your regex file>)
$ht = @{}
#Modify Get-Childitem to your criterias(filter, path, recurse etc.)
Get-ChildItem -Filter *.txt | ForEach-Object {
$text = $_ | Get-Content
$filename = $_.FullName
$regex | ForEach-Object {
$text | Select-String -Pattern $_ | ForEach-Object {
$ht[$filename] += @($_ | Select-Object -ExpandProperty Pattern)
}
}
}
I don't see any way around doing a foreach through the regex collection.
This is the best I could come up with performance-wise:
$regexes = 'pattern1','pattern2'
$files = get-childitem -Path <file path> |
select -ExpandProperty fullname
$ht = @{}
foreach ($file in $files)
{
$ht[$file] = New-Object collections.arraylist
foreach ($regex in $regexes)
{
if (select-string $regex $file -Quiet)
{
[void]$ht[$file].add($regex)
}
}
}
$ht
You could speed up the process by using background jobs and dividing up the file collection among the jobs.
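Another thing that might help, though I haven't measured it: build the [regex] objects once up front and test each file's text against them, instead of having Select-String re-read the file for every pattern. A rough sketch follows, assuming PowerShell 5 or later for [regex]::new. Get-MatchingPattern is a made-up helper name, and the sample text stands in for a file's content.

```powershell
$patterns = '^test', 'e2$'

# Build each regex once; Multiline makes ^ and $ apply per line of the text
$compiled = foreach ($p in $patterns) {
    [regex]::new($p, [System.Text.RegularExpressions.RegexOptions]::Multiline)
}

# Hypothetical helper: returns the patterns that match anywhere in the text
function Get-MatchingPattern {
    param([string]$Text, [regex[]]$Regex)
    foreach ($r in $Regex) {
        if ($r.IsMatch($Text)) { $r.ToString() }
    }
}

# Stand-in for one file's content (LF line endings)
$text = "test line one`nsecond line2`nthird"
$hits = Get-MatchingPattern -Text $text -Regex $compiled
$hits
```

For real files you could fill $Text with [System.IO.File]::ReadAllText($file). One caveat: with CRLF line endings, $ in Multiline mode matches just before the `n, so a pattern like 'e2$' would need to allow for the `r or the text would need its line endings normalized first.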

remove lines which start with * (asterisk) in powershell select-string output

I am working on code that finds lines containing $control but should drop lines that start with * in the first column.
I am using the following, but it doesn't seem to work:
$result = Get-Content $file.fullName | Select-String $control | Select-String -pattern "\^*" -notmatch
Thanks in advance
You're escaping the wrong character. You do not want to escape ^, as that's your anchor for "starting with". You'll want to escape the asterisk, so try this:
$result = Get-Content $file.fullName | Select-String $control | select-string -pattern "^\*" -notmatch
Also, if all you want is the lines, you could also use this:
Get-Content $file.fullName | ? { $_ -match $control -and $_ -notmatch '^\*'}
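To see why the original pattern dropped everything: '\^*' means "zero or more literal ^ characters", which matches any line (as an empty match), so a not-match filter on it keeps nothing. A quick illustration with made-up sample lines:

```powershell
# Hypothetical sample lines standing in for the file content
$lines = '* comment with control', 'control statement', '  control indented'

# '\^*' matches every line, so -notmatch filters all of them out
$wrong = $lines | Where-Object { $_ -notmatch '\^*' }

# '^\*' matches only lines that begin with a literal asterisk
$right = $lines | Where-Object { $_ -notmatch '^\*' }

"Kept with '\^*': $(@($wrong).Count)"
"Kept with '^\*': $(@($right).Count)"
```

If the character to escape comes from a variable, [regex]::Escape('*') returns the escaped form so you don't have to add the backslash by hand.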