I am trying to extract words from a text file that contains exactly one word per line, but I only want to match a word if it contains no "_" (underscore) or "-" (dash).
The file might look like this:
< someword
< SomeOtherword
< wordwith-dash-anotherd
< wordwith_under_anotheru
I only want to extract lines 1 and 2 and ignore lines 3 and 4
(i.e. the result of matching each line should be: someword SomeOtherword, without the "<" and the space on each line)
I have been trying "[\w-]+", which matches words containing _ as well as -
I am using PowerShell regex engine.
I am processing a file with close to 100000 lines. I don't want to loop through each line, as I need the processing to be very quick. The code I am using:
$rx = '[\w-]+'
Get-Content $filename | Select-String -Pattern $rx -AllMatches | select -ExpandProperty Matches | select -ExpandProperty Value | out-file $outputfile
If you are performance sensitive, this approach is measurably faster (2.6 secs vs. 80 millisecs):
(Select-String '^[a-zA-Z]+$' file.txt -AllMatches).Matches.Value
This does require a feature that is new to PowerShell v3. You don't say which version you are using.
To do a regex match in PowerShell you can use either the -match operator or Select-String. There is also a -notmatch operator and a -NotMatch switch for Select-String. Both filter for the absence of a match.
So one option is
gc 'file.txt' | where { $_ -notmatch '-|_' } | foreach { $_.Trim('<', ' ') }
and another is
gc 'file.txt' | select-string -NotMatch '-|_' | foreach { $_.Line.Trim('<', ' ') }
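As a further sketch of the same idea: comparison operators such as -notmatch and -replace also work directly on arrays, so the filtering and trimming can be done without a script block. The sample lines below are made up to mirror the file in the question.

```powershell
# Hypothetical sample lines standing in for the file from the question.
$lines = '< someword', '< SomeOtherword', '< wordwith-dash-anotherd', '< wordwith_under_anotheru'

# Applied to an array, -notmatch acts as a filter and returns the elements
# that do NOT match; -replace then strips the leading "< " from each one.
$words = ($lines -notmatch '[-_]') -replace '^<\s*'

$words
```

For a real file, $lines could come from Get-Content; whether this is faster than the pipeline versions above would need to be measured.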
I am using the following regex line in my PowerShell script to pull dates from .txt files. The script reads the dates and writes them to a .csv file in the format Year,Month,Day,Hour,Min,Sec (2020,06,20,00,50,56). I'm looking for guidance on how to get just the date, without the commas, in the format 2020-06-20.
This is how the date is listed in the .txt files; see the line that starts with Generated:
Node 001 Status Report - Report Version 20200505;
Generated 2020-06-20 00:50:56;
Below is the portion of the script that reads and pulls the date:
If($_ -imatch 'Generated'){
$Date = ([regex]::Matches($_,'\b\d+') | select value).value -join ','
}
You can use Select-String to read each file line by line and pattern match against each line:
Select-String -Path a.txt,b.txt -Pattern '^Generated (\d{4}-\d{2}-\d{2})' |
Foreach-Object { $_.Matches.Groups[1].Value }
Select-String also adds other benefits. Each pattern match is a MatchInfo object that contains the file name, the number of the line that matched, and the line that contains the match. The -AllMatches switch will match as many times as possible per input line. The -Path parameter accepts an array of files and/or wildcards in the path. The [1] index refers to the first unnamed capture group, which holds whatever matched within the first set of ().
As an aside, I would verify that the ####-##-## is actually a valid date unless you know that will always be so within your data. You can do this easily if your system culture settings allow for the date format:
Select-String -Path a.txt,b.txt -Pattern '^Generated (\d{4}-\d{2}-\d{2})' | Foreach-Object {
$_.Matches.Groups[1].Value | Where { $_ -as [datetime] }
}
If the culture settings do not allow the format, you will need to use ParseExact or TryParseExact to test the date.
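A minimal sketch of the TryParseExact route, assuming the date text has already been extracted and that yyyy-MM-dd is the expected layout. The variable names are illustrative only.

```powershell
$text   = '2020-06-20'
$parsed = [datetime]::MinValue   # receives the result through [ref]

# Culture-independent check that the text is a real calendar date.
$ok = [datetime]::TryParseExact(
    $text,
    'yyyy-MM-dd',
    [System.Globalization.CultureInfo]::InvariantCulture,
    [System.Globalization.DateTimeStyles]::None,
    [ref]$parsed)

$formatted = $parsed.ToString('yyyy-MM-dd', [System.Globalization.CultureInfo]::InvariantCulture)

# A string with the right shape but an impossible month and day is rejected.
$bad = [datetime]::TryParseExact(
    '2020-13-45',
    'yyyy-MM-dd',
    [System.Globalization.CultureInfo]::InvariantCulture,
    [System.Globalization.DateTimeStyles]::None,
    [ref]$parsed)
```

Here $ok is True and $bad is False, so the result can be used in a Where-Object filter in place of the -as [datetime] test.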
If you must work within your current data format, then you can do the following to extract the date from the comma-delimited string in the required format:
If($_ -imatch 'Generated'){
$Numbers = ([regex]::Matches($_,'\b\d+') | select value).value -join ','
$Date = ($Numbers -split ',')[0..2] -join '-'
}
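You are joining the matches with -join ',' to get commas; if you want dashes instead, change the separator to a dash. Note that this joins all six numbers (2020-06-20-00-50-56), so to get only the date you would also need to limit the join to the first three matches.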
If($_ -imatch 'Generated'){
$Date = ([regex]::Matches($_,'\b\d+') | select value).value -join '-'
}
I have been trying to extract certain values from multiple lines inside a .txt file with PowerShell.
Host
Class
INCLUDE vmware:/?filter=Displayname Equal "server01" OR Displayname Equal "server02" OR Displayname Equal "server03 test"
This is what I want :
server01
server02
server03 test
I have code so far :
$Regex = [Regex]::new("(?<=Equal)(.*)(?=OR")
$Match = $Regex.Match($String)
You may use
[regex]::matches($String, '(?<=Equal\s*")[^"]+')
See the regex demo.
See more ways to extract multiple matches here. However, your main problem is the regex pattern. The (?<=Equal\s*")[^"]+ pattern matches:
(?<=Equal\s*") - a location preceded with Equal and 0+ whitespaces and then a "
[^"]+ - consumes 1+ chars other than double quotation mark.
Demo:
$String = "Host`nClass`nINCLUDE vmware:/?filter=Displayname Equal ""server01"" OR Displayname Equal ""server02"" OR Displayname Equal ""server03 test"""
[regex]::matches($String, '(?<=Equal\s*")[^"]+') | Foreach {$_.Value}
Output:
server01
server02
server03 test
Here is a full snippet reading the file in, getting all matches and saving to file:
$newfile = 'file.txt'
$file = 'newtext.txt'
$regex = '(?<=Equal\s*")[^"]+'
Get-Content $file |
Select-String $regex -AllMatches |
Select-Object -Expand Matches |
ForEach-Object { $_.Value } |
Set-Content $newfile
Another option (PSv3+), combining [regex]::Matches() with the -replace operator for a concise solution:
$str = @'
Host
Class
INCLUDE vmware:/?filter=Displayname Equal "server01" OR Displayname Equal "server02" OR Displayname Equal "server03 test"
'@
[regex]::Matches($str, '".*?"').Value -replace '"'
Regex ".*?" matches all "..."-enclosed tokens; .Value extracts them, and -replace '"' strips the " chars.
It may not be obvious, but this happens to be the fastest solution among the answers here, based on my tests - see bottom.
As an aside: The above would be even more PowerShell-idiomatic if the -match operator - which only looks for a (one) match - had a variant named, say, -matchall, so that one could write:
# WISHFUL THINKING (as of PowerShell Core 6.2)
$str -matchall '".*?"' -replace '"'
See this feature suggestion on GitHub.
Optional reading: performance comparison
Pragmatically speaking, all solutions here are helpful and may be fast enough, but there may be situations where performance must be optimized.
Generally, using Select-String (and the pipeline in general) comes with a performance penalty - while offering elegance and memory-efficient streaming processing.
Also, repeated invocation of script blocks (e.g., { $_.Value }) tends to be slow - especially in a pipeline with ForEach-Object or Where-Object, but also - to a lesser degree - with the .ForEach() and .Where() collection methods (PSv4+).
In the realm of regexes, you pay a performance penalty for variable-length look-behind expressions (e.g. (?<=EQUAL\s*")) and the use of capture groups (e.g., (.*?)).
Here is a performance comparison using the Time-Command function, averaging 1000 runs:
Time-Command -Count 1e3 { [regex]::Matches($str, '".*?"').Value -replace '"' },
{ [regex]::matches($String, '(?<=Equal\s*")[^"]+') | Foreach {$_.Value} },
{ [regex]::Matches($str, '\"(.*?)\"').Groups.Where({$_.name -eq '1'}).Value },
{ $str | Select-String -Pattern '(?<=Equal\s*")[^"]+' -AllMatches | ForEach-Object{$_.Matches.Value} } |
Format-Table Factor, Command
Sample timings from my MacBook Pro; the exact times aren't important (you can remove the Format-Table call to see them), but the relative performance is reflected in the Factor column, from fastest to slowest.
Factor Command
------ -------
1.00 [regex]::Matches($str, '".*?"').Value -replace '"' # this answer
2.85 [regex]::Matches($str, '\"(.*?)\"').Groups.Where({$_.name -eq '1'}).Value # AdminOfThings'
6.07 [regex]::matches($String, '(?<=Equal\s*")[^"]+') | Foreach {$_.Value} # Wiktor's
8.35 $str | Select-String -Pattern '(?<=Equal\s*")[^"]+' -AllMatches | ForEach-Object{$_.Matches.Value} # LotPings'
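Time-Command is not a built-in command. As a rough, hedged alternative, the built-in Measure-Command cmdlet can time two of the approaches. The loop count of 1000 and the shortened sample string are arbitrary choices for this sketch, and absolute numbers will vary by machine.

```powershell
$str = 'Displayname Equal "server01" OR Displayname Equal "server02"'

# Plain pattern plus -replace, repeated 1000 times.
$plain = Measure-Command {
    foreach ($i in 1..1000) {
        $null = [regex]::Matches($str, '".*?"').Value -replace '"'
    }
}

# Variable-length look-behind plus a ForEach-Object pipeline, repeated 1000 times.
$lookbehind = Measure-Command {
    foreach ($i in 1..1000) {
        $null = [regex]::Matches($str, '(?<=Equal\s*")[^"]+') | ForEach-Object { $_.Value }
    }
}

'{0:N1} ms vs. {1:N1} ms' -f $plain.TotalMilliseconds, $lookbehind.TotalMilliseconds
```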
You can modify your regex to use a capture group, which is indicated by the parentheses. The backslashes just escape the quotes. This allows you to just capture what you are looking for and then filter it further. The capture group here is automatically named 1 since I didn't provide a name. Capture group 0 is the entire match including quotes. I switched to the Matches method because that encompasses all matches for the string whereas Match only captures the first match.
$regex = [regex]'\"(.*?)\"'
$regex.matches($string).groups.where{$_.name -eq 1}.value
If you want to export the results, you can do the following:
$regex = [regex]'\"(.*?)\"'
$regex.matches($string).groups.where{$_.name -eq 1}.value | sc "c:\temp\export.txt"
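The same capture group can also be read by index from each match, which avoids filtering on the group name. This is a small sketch with a shortened, made-up input string.

```powershell
$string = 'Displayname Equal "server01" OR Displayname Equal "server03 test"'

# Groups[0] is the whole match including the quotes; Groups[1] is the text
# captured by the first set of parentheses.
$values = foreach ($m in [regex]::Matches($string, '"(.*?)"')) {
    $m.Groups[1].Value
}

$values
```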
An alternative that reads the file directly with Select-String, using Wiktor's good regex:
Select-String -Path .\file.txt -Pattern '(?<=Equal\s*")[^"]+' -AllMatches|
ForEach-Object{$_.Matches.Value} | Set-Content NewFile.txt
Sample output:
> Get-Content .\NewFile.txt
server01
server02
server03 test
I need to search though a folder of logs and retrieve the most recent logs. Then I need to filter each log, pull out the relevant information and save to another file.
The problem is the regular expression I use to filter the log is dropping the carriage return and the line feed so the new file just contains a jumble of text.
$Reg = "(?ms)\*{6}\sBEGIN(.|\n){98}13.06.2015(.|\n){104}00000003.*(?!\*\*)+"
get-childitem "logfolder" -filter *.log |
where-object {$_.LastAccessTime -gt [datetime]$Test.StartTime} |
foreach {
$a=get-content $_;
[regex]::matches($a,$reg) | foreach {$_.groups[0].value > "MyOutFile"}
}
Log structure:
******* BEGIN MESSAGE *******
<Info line 1>
Date 18.03.2010 15:07:37 18.03.2010
<Info line 2>
File Number: 00000003
<Info line 3>
*Variable number of lines*
******* END MESSAGE *******
Basically capture everything between the BEGIN and END where the dates and file numbers are a certain value. Does anyone know how I can do this without losing the line feeds? I also tried using Out-File | Select-String -Pattern $reg, but I've never had success with using Select-String on a multiline record.
As @Matt pointed out, you need to read the entire file as a single string if you want to do multiline matches. Otherwise your (multiline) regular expression would be applied to single lines one after the other. There are several ways to get the content of a file as a single string:
(Get-Content 'C:\path\to\file.txt') -join "`r`n"
Get-Content 'C:\path\to\file.txt' | Out-String
Get-Content 'C:\path\to\file.txt' -Raw (requires PowerShell v3 or newer)
[IO.File]::ReadAllText('C:\path\to\file.txt')
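To make the difference concrete, here is a small sketch that writes a two-line temporary file (the file name is made up) and reads it both ways. The -Raw switch assumes PowerShell v3 or newer.

```powershell
$path = Join-Path ([IO.Path]::GetTempPath()) 'gc-raw-demo.txt'
Set-Content -Path $path -Value 'line one', 'line two'

$asLines  = @(Get-Content $path)      # array with one string per line
$asString = Get-Content $path -Raw    # one string, line breaks included

# A pattern spanning both lines can only match the single string.
$spansLines = $asString -match '(?s)one.*two'
$perLine    = @($asLines -match '(?s)one.*two')   # filters the array, finds nothing

Remove-Item $path
```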
Also, I'd modify the regular expression a little. Most of the time log messages may vary in length, so matching fixed lengths may fail if the log message changes. It's better to match on invariant parts of the string and leave the rest as variable length matches. And personally I find it a lot easier to do this kind of content extraction in several steps (makes for simpler regular expressions). In your case I would first separate the log entries from each other, and then filter the content:
$date = [regex]::Escape('13.06.2015')
$fnum = '00000003'
$re1 = "(?ms)\*{7} BEGIN MESSAGE \*{7}\s*([\s\S]*?)\*{7} END MESSAGE \*{7}"
$re2 = "(?ms)[\s\S]*?Date\s+$date[\s\S]*?File Number:\s+$fnum[\s\S]*"
Get-ChildItem 'C:\log\folder' -Filter '*.log' | ? {
$_.LastAccessTime -gt [DateTime]$Test.StartTime
} | % {
Get-Content $_.FullName -Raw |
Select-String -Pattern $re1 -AllMatches |
select -Expand Matches |
% {
$_.Groups[1].Value |
Select-String -Pattern $re2 |
select -Expand Matches |
select -Expand Groups |
select -Expand Value
}
} | Set-Content 'C:\path\to\output.txt'
BTW, don't use the redirection operator (>) inside a loop. It would overwrite the output file's content with each iteration. If you must write to a file inside a loop use the append redirection operator instead (>>). However, performance-wise it's usually better to put writing to output files at the end of the pipeline (see above).
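The overwrite behaviour is easy to reproduce. The sketch below uses a made-up temporary file name and compares redirecting inside a loop with writing once at the end of a pipeline.

```powershell
$out = Join-Path ([IO.Path]::GetTempPath()) 'redirect-demo.txt'

# Each iteration truncates the file, so only the last entry survives.
foreach ($i in 1..3) { "entry $i" > $out }
$overwritten = @(Get-Content $out)

# Writing once at the end of the pipeline keeps all entries.
1..3 | ForEach-Object { "entry $_" } | Set-Content $out
$piped = @(Get-Content $out)

Remove-Item $out
```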
I wanted to see if I could improve that regex, but for now: if you are using those regex modes, you should read your text file in as a single string, which helps a lot.
$a=get-content $_ -Raw
or if you don't have PowerShell 3.0
$a=(get-content $_) -join "`r`n"
I had to solve the problem of disappearing newlines in a completely different context. What you get when you do a get-content of a text file is an array of records, where each record is a line of text.
The only way I found to put the newline back in after some transformation was to use the automatic variable $OFS (output field separator). The default value is space, but if you set it to carriage return line feed, then you get separate records on separate lines.
So try this (it might work):
$OFS = "`r`n"
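A short sketch of what $OFS changes, assuming it has not already been set in the session: it only affects how an array is expanded into a string.

```powershell
$lines = 'first', 'second'

$default = "$lines"     # elements joined with a space while $OFS is unset

$OFS = "`r`n"
$joined = "$lines"      # elements now joined with CR LF

Remove-Variable OFS     # restore the default behaviour
```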
Maybe my reasoning is faulty, but I can't get this working.
Here's my regex: (Device\s#\d(\n.*)*?(?=\n\s*Device\s#|\Z))
Try it: http://regex101.com/r/jQ6uC8/6
$getdevice is the input string. I'm getting this string from the Stream/Output from a command line tool.
$dstate = $getdevice |
select-string -pattern '(Device\s#\d(\n.*)*?(?=\n\s*SSD\s+|\Z))' -AllMatches |
% { $_ -match '(Device\s#\d(\n.*)*?(?=\n\s*SSD\s+|\Z))' > $null; $matches[0] }
Write-Host $dstate
Output:
Device #0 Device #1 Device #2 Device #3 Device #4
Same output for the $matches[1], $matches[2] is empty.
Is there a way I can get all matches, like on regex101.com? I'm trying to split the Output/String into separate variables (one for Device0, one for Device1, Device2, and so on).
Update: Here's the Output from the command line tool: http://pastebin.com/BaywGtFE
I used your sample data in a here-string for my testing. This should work although it can depend on where your sample data comes from.
Using powershell 3.0 I have the following
$getdevice |
select-string -pattern '(?smi)(Device\s#\d+?(.*?)*?(?=Device\s#|\Z))' -AllMatches |
ForEach-Object {$_.Matches} |
ForEach-Object {$_.Value}
or if your PowerShell version supports it...
($getdevice | select-string -pattern '(?smi)(Device\s#\d+?(.*?)*?(?=Device\s#|\Z))' -AllMatches).Matches.Value
Which returns 4 objects with their device IDs. I don't know if you wanted those or not, but the regex can be modified with lookarounds if you don't need them. I also updated the regex to account for device IDs with more than one digit, in case that happens.
The modifiers that I used
s modifier: single line. Dot matches newline characters
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
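The effect of the s and m modifiers can be seen on a small made-up string; this is only an illustration, not the output of the actual tool.

```powershell
$text = "Device #0`nstate: ok`nDevice #1`nstate: failed"

# Without (?m), ^ only matches at the very start of the string.
$withoutM = [regex]::Matches($text, '^state').Count
$withM    = [regex]::Matches($text, '(?m)^state').Count

# Without (?s), the dot stops at line feeds, so the match cannot span lines.
$withoutS = $text -match 'Device #0.*failed'
$withS    = $text -match '(?s)Device #0.*failed'
```

Here $withoutM is 0 and $withM is 2, while $withoutS is False and $withS is True.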
Another regex pattern that works the same way and is shorter:
'(?smi)(Device\s#).*?(?=Device\s#|\Z)'
With your existing regex, to get a list of all matches in a string, use one of these options:
Option 1
$regex = [regex] '(Device\s#\d(\n.*)*?(?=\n\s*Device\s#|\Z))'
$allmatches = $regex.Matches($yourString);
if ($allmatches.Count -gt 0) {
# Get the individual matches with $allmatches[0], $allmatches[1], ...
} else {
# Nah, no match
}
Option 2
$resultlist = new-object System.Collections.Specialized.StringCollection
$regex = [regex] '(Device\s#\d(\n.*)*?(?=\n\s*Device\s#|\Z))'
$match = $regex.Match($yourString)
while ($match.Success) {
$resultlist.Add($match.Value) | out-null
$match = $match.NextMatch()
}
While it doesn't exactly answer your question, I'll offer a slightly different approach:
($getdevice) -split '\s+(?=Device #\d)' | select -Skip 1
Just for fun,
$drives =
($getdevice) -split '\s+(?=Device #\d)' |
select -Skip 1 |
foreach { $Stringdata =
$_.replace(' : ','=') -replace 'Device #(\d)','Device = $1' -Replace 'Device is a (\w+)','DeviceIs = $1'
New-Object PSObject -Property $(ConvertFrom-StringData $Stringdata)
}
$drives | select Device,DeviceIs,'Total Size'
Device DeviceIs Total Size
------ -------- ----------
0 Hard drive 70007 MB
1 Hard drive 70007 MB
2 Hard drive 286102 MB
3 Hard drive 286102 MB
try this variant:
[regex]::Matches($data,'(?im)device #\d((?!\s*Device #\d)\r?\n.)*?') | select value
Value
-----
Device #0
Device #1
Device #2
Device #3
Device #4
I am having some issues trying to match a certain config block (multiple ones) from a file. Below is the block that I'm trying to extract from the config file:
ap71xx 00-01-23-45-67-89
use profile PROFILE
use rf-domain DOMAIN
hostname ACCESSPOINT
area inside
!
There are multiple ones just like this, each with a different MAC address. How do I match a config block across multiple lines?
The first problem you may run into is that in order to match across multiple lines, you need to process the file's contents as a single string rather than by individual line. For example, if you use Get-Content to read the contents of the file then by default it will give you an array of strings - one element for each line. To match across lines you want the file in a single string (and hope the file isn't too huge). You can do this like so:
$fileContent = [io.file]::ReadAllText("C:\file.txt")
Or in PowerShell 3.0 you can use Get-Content with the -Raw parameter:
$fileContent = Get-Content c:\file.txt -Raw
Then you need to specify a regex option to match across line terminators i.e.
SingleLine mode (. matches any char including line feed), as well as
Multiline mode (^ and $ match embedded line terminators), e.g.
(?smi) - note the "i" is to ignore case
e.g.:
C:\> $fileContent | Select-String '(?smi)([0-9a-f]{2}(-|\s*$)){6}.*?!' -AllMatches |
Foreach {$_.Matches} | Foreach {$_.Value}
00-01-23-45-67-89
use profile PROFILE
use rf-domain DOMAIN
hostname ACCESSPOINT
area inside
!
00-01-23-45-67-89
use profile PROFILE
use rf-domain DOMAIN
hostname ACCESSPOINT
area inside
!
Use the Select-String cmdlet to do the search because you can specify -AllMatches and it will output all matches whereas the -match operator stops after the first match. Makes sense because it is a Boolean operator that just needs to determine if there is a match.
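The difference can be sketched on a shortened, made-up two-block string:

```powershell
$text = "ap71xx 00-01`n!`nap71xx 00-02`n!"

# -match answers "is there a match?" and records only the first one in $Matches.
$null = $text -match 'ap71xx[^!]+!'
$firstOnly = $Matches[0]

# Select-String -AllMatches collects every match in the input string.
$all = ($text | Select-String 'ap71xx[^!]+!' -AllMatches).Matches
$all.Count
```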
In case this may still be of value to someone and depending on the actual requirement, the regex in Keith's answer doesn't need to be that complicated. If the user simply wants to output each block the following will suffice:
$fileContent = [io.file]::ReadAllText("c:\file.txt")
$fileContent |
Select-String '(?smi)ap71xx[^!]+!' -AllMatches |
%{ $_.Matches } |
%{ $_.Value }
The regex ap71xx[^!]+! will perform better, and the use of .* in a regular expression is best avoided because it can generate unexpected results. The pattern [^!]+! matches one or more characters other than the exclamation mark, followed by an exclamation mark.
If the start of the block isn't required in the output, the updated script is:
$fileContent |
Select-String '(?smi)ap71xx([^!]+!)' -AllMatches |
%{ $_.Matches } |
%{ $_.Groups[1] } |
%{ $_.Value }
Groups[0] contains the whole matched string, Groups[1] will contain the string match within the parentheses in the regex.
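A quick way to see both groups side by side, using a shortened, made-up block:

```powershell
$block = "ap71xx 00-01-23-45-67-89`n use profile PROFILE`n!"
$m = [regex]::Match($block, 'ap71xx([^!]+!)')

$whole = $m.Groups[0].Value   # includes the leading "ap71xx"
$inner = $m.Groups[1].Value   # only what the parentheses captured
```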
If $fileContent isn't required for any further processing, the variable can be eliminated:
[io.file]::ReadAllText("c:\file.txt") |
Select-String '(?smi)ap71xx([^!]+!)' -AllMatches |
%{ $_.Matches } |
%{ $_.Groups[1] } |
%{ $_.Value }
This regex will search for the text ap followed by any number of characters and new lines ending with a !:
(?si)(ap).+?\!{1}
So I was a little bored. I wrote a script that will break up the text file as you described (as long as it only contains the lines you displayed). It might work with other random lines, as long as they don't contain the key words: ap, profile, domain, hostname, or area. It will import them, and check line by line for each of the properties (MAC, Profile, domain, hostname, area) and place them into an object that can be used later. I know this isn't what you asked for, but since I spent time working on it, hopefully it can be used for some good. Here is the script if anyone is interested. It will need to be tweaked to your specific needs:
$Lines = Get-Content "c:\test\test.txt"
$varObjs = @()
for ($num = 0; $num -lt $lines.Count; $num =$varLast ) {
#Checks to make sure the line isn't blank or a !. If it is, it skips to next line
if ($Lines[$num] -match "!") {
$varLast++
continue
}
if (([regex]::Match($Lines[$num],"^\s.*$")).success) {
$varLast++
continue
}
$Index = [array]::IndexOf($lines, $lines[$num])
$b=0
$varObj = New-Object System.Object
while ($Lines[$num + $b] -notmatch "!" ) {
#Checks line by line to see what it matches, adds to the $varObj when it finds what it wants.
if ($Lines[$num + $b] -match "ap") { $varObj | Add-Member -MemberType NoteProperty -Name Mac -Value $([regex]::Split($lines[$num + $b],"\s"))[1] }
if ($lines[$num + $b] -match "profile") { $varObj | Add-Member -MemberType NoteProperty -Name Profile -Value $([regex]::Split($lines[$num + $b],"\s"))[3] }
if ($Lines[$num + $b] -match "domain") { $varObj | Add-Member -MemberType NoteProperty -Name rf-domain -Value $([regex]::Split($lines[$num + $b],"\s"))[3] }
if ($Lines[$num + $b] -match "hostname") { $varObj | Add-Member -MemberType NoteProperty -Name hostname -Value $([regex]::Split($lines[$num + $b],"\s"))[2] }
if ($Lines[$num + $b] -match "area") { $varObj | Add-Member -MemberType NoteProperty -Name area -Value $([regex]::Split($lines[$num + $b],"\s"))[2] }
$b ++
} #end While
#Adds the $varObj to $varObjs for future use
$varObjs += $varObj
$varLast = ($b + $Index) + 2
}#End for ($num = 0; $num -lt $lines.Count; $num = $varLast)
#displays the $varObjs
$varObjs
To me, a very clean and simple approach is to use a multiline regex block with named captures, like this:
# Based on this text configuration:
$configurationText = @"
ap71xx 00-01-23-45-67-89
use profile PROFILE
use rf-domain DOMAIN
hostname ACCESSPOINT
area inside
!
"@
# We can build a multiline regex block with the strings to be captured.
# Here, I am using the regex '.*?', which roughly means 'capture anything, but as little as possible'
# A more specific regex can be defined for each field to capture.
# ( ) in the regex is for defining a group
# ?<> is for naming a group
$regex = @"
(?<userId>.*?) (?<userCode>.*?)
use profile (?<userProfile>.*?)
use rf-domain (?<userDomain>.*?)
hostname (?<hostname>.*?)
area (?<area>.*?)
!
"@
# Lets see if this matches !
if($configurationText -match $regex)
{
# it does !
Write-Host "Config text is successfully matched, here are the matches:"
$Matches
}
else
{
Write-Host "Config text could not be matched."
}
This script outputs the following:
PS C:\Users\xdelecroix> C:\FusionInvest\powershell\regex-capture-multiline-stackoverflow.ps1
Config text is successfully matched, here are the matches:
Name Value
---- -----
hostname ACCESSPOINT
userProfile PROFILE
userCode 00-01-23-45-67-89
area inside
userId ap71xx
userDomain DOMAIN
0 ap71xx 00-01-23-45-67-89...
For more flexibility, you can use Select-String instead of -match, but this is not really important here, in the context of this sample.
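As a small sketch of both routes on a single made-up line: with -match the named capture is read from $Matches, and with Select-String it is read from the Groups collection of the Match object.

```powershell
$line = 'hostname ACCESSPOINT'
$pattern = '^hostname (?<hostname>\S+)$'

# -match: named groups become keys in the automatic $Matches hashtable.
if ($line -match $pattern) {
    $viaMatch = $Matches['hostname']
}

# Select-String: named groups are available on each Match object.
$viaSelectString = ($line | Select-String $pattern).Matches[0].Groups['hostname'].Value
```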
Here's my take. If you don't need the regex, you can use -like or .contains(). The question never says what the search pattern is. Here's an example with a windows text file.
$file = (get-content -raw file.txt) -replace "`r" # avoid the line ending issue
$pattern = 'two
three
f.*' -replace "`r"
# just showing what they really are
$file -replace "`r",'\r' -replace "`n",'\n'
$pattern -replace "`r",'\r' -replace "`n",'\n'
$file -match $pattern
$file | select-string $pattern -quiet
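For the non-regex route mentioned at the start of this answer, a minimal sketch with made-up content that uses only line feeds:

```powershell
$file = "one`ntwo`nthree`nfour"

# Literal, case-sensitive substring test across the line break.
$hasLiteral = $file.Contains("two`nthree")

# Wildcard test; * stands for any run of characters.
$hasWildcard = $file -like "*two`nthree*"
```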