Data extraction using Regular expressions


Hi, I have a file in the following format:
[stuff not needed]Type:A1[stuff not needed]
[stuff not needed]Name:B1[stuff not needed]
Row:Sampletext
Row:Sampletext
[stuff not needed]Type:A2[stuff not needed]
[stuff not needed]Name:B2[stuff not needed]
Row:Sampletext2
Row:Sampletext2
Row:Sampletext2
I am using regex in PowerShell to extract the data.
I am using a pattern of the form Regex1|Regex2|Regex3 and saving the output to a file.
The output comes out in this format:
A1
B1
Sampletext
Sampletext
A2
B2
Sampletext2
Sampletext2
Sampletext2
I want it in this format:
A1 B1 Sampletext
A1 B1 Sampletext
A2 B2 Sampletext2
A2 B2 Sampletext2
A2 B2 Sampletext2
I am new to PowerShell. Is there any way I can do this?
This is the exact code that I have:
$input_path = 'idx.txt'
$output_file = 'output.txt'
$regex = 'Type:\s([A-Za-z]*)|Name:\s\s([A-Za-z]*)|[A-Za-z][a-z0-9A-Z_]*(?:\s*[0-6]\s*[0-4]\s\s[\s\d]\d\s*0)'
Select-String -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file
The data is too big to post here, so I have just created a sample data set. The regular expressions are working; they may be crude, but they capture the required data.
For the sake of the example, we can use Type:([A-Za-z])|Name:([A-Za-z])|Row:([A-Za-z]*) as the regular expression.

Check every line: if it contains Type or Name, only set the corresponding variable; if it contains Row, output the current Type and Name values along with the row contents.
$allmatches = Select-String '(Type|Name|Row):\s*(\w*)' $input_path -AllMatches
$output = foreach ($m in $allmatches) {
    # Group 1 is the keyword, group 2 is the value after the colon
    $data = $m.Matches[0].Groups[2].Value
    switch ($m.Matches[0].Groups[1].Value) {
        'Type' { $type = $data; break }
        'Name' { $name = $data; break }
        'Row'  { "$type $name $data" }
    }
}
$output | Set-Content $output_file -Encoding UTF8
Notes:
We use the faster foreach statement instead of slower pipelining via ForEach-Object with a script block.
\w in regex matches any word character: a-z, A-Z, 0-9, _ and, in .NET, other Unicode letters and digits.
Regex matching and string comparison are case-insensitive in PowerShell by default.
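The same "remember Type and Name, emit on Row" idea can be tried without a file. This is a minimal sketch, assuming the unwanted text around each value is separated from it by non-word characters; the $lines sample and the << >> padding are made up for illustration.

```powershell
# Hypothetical sample lines standing in for the real file
$lines = @(
    '<< Type:A1 >>'
    '<< Name:B1 >>'
    'Row:Sampletext'
    'Row:Sampletext'
    '<< Type:A2 >>'
    '<< Name:B2 >>'
    'Row:Sampletext2'
)

$output = foreach ($line in $lines) {
    if ($line -match '(Type|Name|Row):\s*(\w*)') {
        $data = $Matches[2]
        switch ($Matches[1]) {
            'Type' { $type = $data }            # remember the current Type
            'Name' { $name = $data }            # remember the current Name
            'Row'  { "$type $name $data" }      # emit one combined line per Row
        }
    }
}
$output
# A1 B1 Sampletext
# A1 B1 Sampletext
# A2 B2 Sampletext2
```

Because Type and Name are only overwritten when a new one appears, every Row line picks up the most recent pair.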

Related

Powershell use ForEach to match and replace string with regex and replace with incremental value

I have to replace multiple strings that match the same pattern, and several of them are on the same line. The replacement value should be incremental. I need to match and replace only the pattern as in the example, not requestId, nor messageId.
Input:
<requestId>qwerty-qwer12-qwer56</requestId>Ace of Base Order: Q2we45-Uj87f6-gh65De<something else...
<requestId>zxcvbn-zxcv4d-zxcv56</requestId>
<requestId>1234qw-12qw9x-123456</requestId> Stevie Wonder <messageId>1234qw-12qw9x-123456</msg
reportId>plmkjh8765FGH4rt6As</msg:reportId> something <keyID>qwer1234asdf5678zxcv0987bnml65gh</msgdc
The desired output should be:
<requestId>Request-1</requestId>Ace of Base Order: Request-2<something else...
<requestId>Request-3</requestId>
<requestId>Request-4</requestId> Stevie Wonder <messageId>Request-4</msg
reportId>ReportId-1</msg:reportId> something <keyId>KeyId-1</msg
The regex finds all matching values but I cannot make the loop and replace these values. The code I am trying to make work is:
@'
<requestId>qwerty-qwer12-qwer56</requestId>Ace of Base Order: Q2we45-Uj87f6-gh65De<something else...
<requestId>zxcvbn-zxcv12-zxcv56</requestId>
<requestId>1234qw-12qw12-123456</requestId> Stevie Wonder <messageId>1234qw-12qw12-123456</msg
reportId>plmkjh8765FGH4rt6As</msg:reportId> something <keyID>qwer1234asdf5678zxcv0987bnml65gh</msgdc
'@ | Set-Content $log -Encoding UTF8
$requestId = @{
    Count = 1
    Matches = @()
}
$tmp = Get-Content $log | foreach { $n = [regex]::matches((Get-Content $log),'\w{6}-\w{6}-\w{6}').value
    if ($n)
    {
        $_ -replace "$n", "Request-$($requestId.count)"
        $requestId.count++
    } $_ }
$tmp | Set-Content $log

You want Regex.Replace() with a script block as the match evaluator:
$requestId = 1
$tmp = Get-Content $log | ForEach-Object {
    [regex]::Replace($_, '\w{6}-\w{6}-\w{6}', { 'Request-{0}' -f ($script:requestId++) })
}
$tmp | Set-Content $log
The script block runs once per match to calculate the substitute value, which lets us read and increment the $requestId variable and produces the consecutive numbering you need.
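To try the match-evaluator technique without touching a log file, here is a minimal sketch on two sample lines shortened from the question. The $lines and $counter names are assumptions for illustration; a hashtable holds the counter so the script block can increment it without a scope modifier.

```powershell
# Two sample lines shortened from the question's input
$lines = @(
    '<requestId>qwerty-qwer12-qwer56</requestId>Ace of Base Order: Q2we45-Uj87f6-gh65De'
    '<requestId>zxcvbn-zxcv4d-zxcv56</requestId>'
)

# Hypothetical counter; the hashtable is shared by reference with the script block
$counter = @{ Request = 1 }

$result = foreach ($line in $lines) {
    # The script block is called once per match and returns the replacement text
    [regex]::Replace($line, '\w{6}-\w{6}-\w{6}', { 'Request-{0}' -f ($counter['Request']++) })
}
$result
# <requestId>Request-1</requestId>Ace of Base Order: Request-2
# <requestId>Request-3</requestId>
```

The numbering continues across lines because the counter lives outside the loop.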
You can do this for multiple patterns in succession if necessary, although you may want to use an array or hashtable for the individual counters:
$counters = @{ requestId = 1; keyId = 1 }
$tmp = Get-Content $log | ForEach-Object {
    $line = [regex]::Replace($_, '\w{6}-\w{6}-\w{6}', { 'Request-{0}' -f ($counters['requestId']++) })
    [regex]::Replace($line, '\b\w{32}\b', { 'Key-{0}' -f ($counters['keyId']++) })
}
$tmp | Set-Content $log
If you want to capture the mapping between the original and the new value, do that inside the substitution block:
$translations = @{}
# ...
[regex]::Replace($_, '\w{6}-\w{6}-\w{6}', {
    # capture value we matched
    $original = $args[0].Value
    # generate new value
    $substitute = 'Request-{0}' -f ($counters['requestId']++)
    # remember it
    $translations[$substitute] = $original
    return $substitute
})
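The desired output in the question gives a repeated ID the same number (Request-4 appears twice on one line). One way to get that is to key the lookup table by the original value instead. This is a sketch under that assumption, not part of the answer above; $lines is made-up sample data and here $translations maps original to substitute.

```powershell
# Sample lines; the first one contains the same ID twice
$lines = @(
    '<requestId>1234qw-12qw9x-123456</requestId> Stevie Wonder <messageId>1234qw-12qw9x-123456</msg'
    '<requestId>zxcvbn-zxcv4d-zxcv56</requestId>'
)

$translations = @{}   # original value -> substitute (assumed direction for this sketch)

$result = foreach ($line in $lines) {
    [regex]::Replace($line, '\w{6}-\w{6}-\w{6}', {
        $original = $args[0].Value
        if (-not $translations.ContainsKey($original)) {
            # First time we see this ID: hand out the next number
            $translations[$original] = 'Request-{0}' -f ($translations.Count + 1)
        }
        $translations[$original]
    })
}
$result
# <requestId>Request-1</requestId> Stevie Wonder <messageId>Request-1</msg
# <requestId>Request-2</requestId>
```

Afterwards $translations can be exported to record which original ID each placeholder stands for.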
In PowerShell 6.1 and newer versions, you can also do this directly with the -replace operator:
$requestId = 1
$tmp = Get-Content $log | ForEach-Object {
    $_ -replace '\w{6}-\w{6}-\w{6}', { 'Request-{0}' -f ($requestId++) }
}
$tmp | Set-Content $log

PowerShell Regex with csv file

I'm currently trying to match a pattern of IDs and replace each one with a short value that starts with 0 or 1.
For example, pc0045601234 should become 01234: keep the last 4 digits (1234) and put the 3rd character (0) in front.
I tried the code below, but the output only filled the userid column with No Matching Employee.
$reportPath = '.\report.csv'
$csvPath = '.\output.csv'
$data = Import-Csv -Path $reportPath
$output = @()
foreach ($row in $data) {
    $table = "" | Select ID,FirstName,LastName,userid
    $table.ID = $row.ID
    $table.FirstName = $row.FirstName
    $table.LastName = $row.LastName
    switch -Wildcard ($row.ID)
    {
        {$row.ID -match 'P\d\d\d\d\d\D\D\D'} {$table.userid = "Contractor"; continue}
        {$row.ID -match 'SEC\d\d\d\D\D\D\D'} {$table.userid = "Contractor"; continue}
        {$row.ID.StartsWith("P005700477")} {$table.userid = $row.ID -replace "P005700477","0477"; continue}
        {$row.ID.StartsWith("P00570")} {$table.userid = $row.ID -replace "P00570","0"; continue}
        default {$table.userid = "No Matching Employee"}
    }
    $output += $table
}
$output | Export-csv -NoTypeInformation -Path $csvPath

Here are three different ways to achieve the desired result. The first two use the same technique, just written differently.
First we put the sample data in a variable as an array of lines, split from a multiline string. This is equivalent to $text = Get-Content $somefile
$text = @'
PC05601234
PC15601234
'@ -split [environment]::NewLine
Option 1: convert to a character array, then select the 3rd character and the last 4 digits.
$text | foreach {-join ($_.ToCharArray() | select -Skip 2 -First 1 -Last 4)}
Option 2: same as above, but requiring an extra -join to avoid spaces.
$text | foreach {(-join $_.ToCharArray() | foreach {$_[2]+(-join $_[-4..-1])})}
Option 3: my preference, regex. Capture the desired digits and replace the entire string with those two captured values.
$text -replace '^\D+(?=\d)(\d)\w+(\d{4})$','$1$2'
All of these output:
01234
11234
Further testing with different character/digit combinations and lengths:
$text = @'
PC05601234
PC15601234
PC0ABC124321
PC1DE4321
PC0A5678
PC1ABCD215678
'@ -split [environment]::NewLine
Running the new sample data through each option produces this output:
01234
11234
04321
14321
05678
15678
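The regex option can be checked in isolation with an array literal instead of a here-string, which avoids any dependence on line endings. This is a small sketch; the $ids and $short names are assumptions, and the pattern is a simplified form of the one in Option 3.

```powershell
# Same sample IDs as above, as an array literal
$ids = @('PC05601234', 'PC15601234', 'PC0ABC124321', 'PC1DE4321', 'PC0A5678', 'PC1ABCD215678')

# Group 1: first digit after the leading letters; group 2: the last four digits
$short = $ids -replace '^\D+(\d)\w+(\d{4})$', '$1$2'
$short
# 01234
# 11234
# 04321
# 14321
# 05678
# 15678
```

An ID that does not fit the pattern is returned unchanged by -replace, so it may be worth testing with -match first if non-matching rows need a different value.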

Reading list style text file into powershell array

I am given a list of string blocks in a text file, and I need this to be in an array in PowerShell.
The list looks like this:
a:1
b:2
c:3
d:
e:5
[blank line]
a:10
b:20
c:30
d:
e:50
[blank line]
...
and I want this in a PowerShell array to work with it further.
I'm using:
$output = @()
Get-Content ".\Input.txt" | ForEach-Object {
    $splitline = ($_).Split(":")
    if($splitline.Count -eq 2) {
        if($splitline[0] -eq "a") {
            #Write-Output "New Block starting"
            $output += ($string)
            $string = "$($splitline[1])"
        } else {
            $string += ",$($splitline[1])"
        }
    }
}
Write-Host $output -ForegroundColor Green
$output | Export-Csv ".\Output.csv" -NoTypeInformation
$output | Out-File ".\Output.txt"
But this whole thing feels quite cumbersome, and the output is not a CSV file, which I think is because of the way I use the array. Out-File does produce a file that contains rows separated by commas.
Maybe someone can give me a push in the right direction.
Thanks

One solution is to convert your data to an array of hash tables that can be read into a custom object. Then the output array object can be exported, formatted, or read as required.
$hashtables = (Get-Content Input.txt) -replace '^(.*?):','$1=' | ConvertFrom-StringData
$ObjectShell = "" | Select-Object ($hashtables.Keys | Select-Object -Unique)
$output = foreach ($hashtable in $hashtables) {
    $obj = $ObjectShell.psobject.Copy()
    foreach ($n in $hashtable.GetEnumerator()) {
        $obj.($n.Key) = $n.Value
    }
    $obj
}
$output
$output | Export-Csv Output.csv -NoTypeInformation
Explanation:
The first colons (:) on each line are replaced with =. That enables ConvertFrom-StringData to create an array of hash tables with values on the LHS of the = being the keys and values on the RHS of the = being the values. If you know there is only one : on each line, you can make the -replace operation simpler.
$ObjectShell is just an object with all of the properties your data presents. You need all of your properties present for each line of data whether or not you assign values to them. Otherwise, your CSV output or table view within the console will have issues.
The first foreach iterates through the $hashtables array. Then we need to enumerate through each hash table to find the keys and values, which is performed by the second foreach loop. Each key/value pair is stored as a copy of $ObjectShell. The .psobject.Copy() method is used to prevent references to the original object. Updating data that is a reference will update the data of the original object.
$output contains the array of objects of all processed data.
Usability of output:
# Console Output
$output | format-table
a  b  c  d e
-  -  -  - -
1
   2
      3

           5

10
   20
      30

           50
# Convert to CSV
$output | ConvertTo-Csv -NoTypeInformation
"a","b","c","d","e"
"1",,,,
,"2",,,
,,"3",,
,,,"",
,,,,"5"
,,,,
"10",,,,
,"20",,,
,,"30",,
,,,"",
,,,,"50"
# Accessing Properties
$output.b
2
20
$output[0],$output[1]
a : 1
b :
c :
d :
e :
a :
b : 2
c :
d :
e :
Alternative Conversion:
$output = ((Get-Content Input.txt -raw) -split "(?m)^\r?\n") | Foreach-Object {
$data = $_ -replace "(.*?):(.*?)(\r?\n)",'"$1":"$2",$3'
$data = $data.Remove($data.LastIndexOf(','),1)
("{1}`r`n{0}`r`n{2}" -f $data,'{','}') | ConvertFrom-Json
}
$output | ConvertTo-Csv -NoType
Alternative Explanation:
Since ConvertFrom-StringData does not guarantee hash table key order, this alternative readies the file for a JSON conversion. This will maintain the property order listed in the file provided each group's order is the same. Otherwise, the property order of the first group will be respected.
All properties and their respective values are divided by the first : character on each line. The property and value are each surrounded by double quotes. Each property line is separated by a ,. Then finally the opening { and closing } are added. The resulting JSON-formatted string is converted to a custom object.
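If one object per block is what is needed, the same split-on-blank-lines idea also works without the JSON round trip. This is a sketch under that assumption; $raw stands in for Get-Content Input.txt -Raw, and the key and val group names are made up for illustration.

```powershell
# Stand-in for: $raw = Get-Content Input.txt -Raw
$raw = "a:1`nb:2`nc:3`nd:`ne:5`n`na:10`nb:20`nc:30`nd:`ne:50`n"

$objects = foreach ($block in ($raw -split '(?:\r?\n){2,}')) {
    if (-not $block.Trim()) { continue }          # skip empty leftovers
    $props = [ordered]@{}                          # keeps the key order of the file
    foreach ($line in ($block -split '\r?\n')) {
        # Split each line on its first colon only
        if ($line -match '^(?<key>[^:]+):(?<val>.*)$') {
            $props[$Matches['key']] = $Matches['val']
        }
    }
    [pscustomobject]$props
}

$objects | ConvertTo-Csv -NoTypeInformation
```

Each block becomes one row with the columns a to e, and an empty value such as d: is kept as an empty string.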

You can split on the \n newline character; see this example:
$text = @"
a:1
b:2
c:3
d:
e:5
a:10
b:20
c:30
d:
e:50
e:50
e:50
e:50
"@
$Array = $text -split '\n' | ? {$_}
$Array.Count
15
If you want to exclude the empty lines, add ? {$_}
With your example:
$Array = (Get-Content ".\Input.txt") -split '\n' | ? {$_}

How to replace text inside braces using Powershell?

I have the following powershell script running over a C# class in order to replace the constructor with an empty constructor. I am doing this over a number of files that were autogenerated.
$nl = [Environment]::NewLine
foreach($item in Get-ChildItem $path) {
    if($item.extension -eq ".cs") {
        $name = $item.name
        $name = $name.replace('.cs', '')
        $reg = '(public ' + $name + '\(.*?\})'
        $constructorString = [regex]$reg
        $emptyConstructor = 'public ' + $name + '()' + $nl + '{' + $nl + '}' + $nl
        Get-Content $item.fullname -Raw | Where-Object { $_ -match $constructorString } | ForEach-Object { $_ -replace $constructorString, '$1'}
    }
}
The classes have a form of
public Bar()
{
this.foo = new Foo();
}
This results in no matches, let me know if more information is required.

There is no need to convert the pattern to a regex. Try this:
$item | Get-Content -raw | Where {$_ -match "(?s)(public\s+${name}\s*\(.*?\})"} | ...
I believe the issue you are running into is that although you've read the C# file as a single string using the -Raw parameter, there are line breaks in the string that .* won't traverse unless you use singleline mode in the regex. That is what the (?s) does.
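The effect of singleline mode is easy to see on an in-memory string. This is a minimal sketch, assuming a class named Bar and a constructor body without nested braces (the lazy .*? stops at the first closing brace); $source, $withoutS, $withS and $empty are names made up for illustration.

```powershell
$name = 'Bar'   # hypothetical class name, normally taken from the file name

# Stand-in for: Get-Content $item.FullName -Raw
$source = @'
public class Bar
{
    public Bar()
    {
        this.foo = new Foo();
    }
}
'@

$withoutS = 'public\s+' + [regex]::Escape($name) + '\s*\(.*?\}'
$withS    = '(?s)' + $withoutS     # (?s) lets . match line breaks too

$source -match $withoutS   # False: . cannot cross the line break
$source -match $withS      # True

# Replace the whole constructor with an empty one
$empty  = "public $name()`n    {`n    }"
$result = $source -replace $withS, $empty
$result
```

The replacement text here contains no $ characters, so it is safe to pass directly as the -replace substitution.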
Also, if you are on PowerShell V3 or later, you can use BaseName instead of the replace, e.g.:
$name = $item.basename
BTW have you looked at Roslyn as an alternative to the regex search/replace approach? Roslyn will build an AST for you for each source file. With that you could easily find default constructors and replace it with an empty constructor.

Use Powershell to print out line number of code matching a RegEx

I think we have a bunch of commented out code in our source, and rather than delete it immediately, we've just left it. Now I would like to do some cleanup.
So assuming that I have a good enough RegEx to find comments (the RegEx below is simple and I could expand on it based on our coding standards), how do I take the results of the file that I read up and output the following:
Filename
Line Number
The actual line of code
I think I have the basis of an answer here, but I don't know how to take the file that I've read up and parsed with RegEx and spit it out in this format.
I'm not looking for the perfect solution - I just want to find big blocks of commented out code. By looking at the result and seeing a bunch of files with the same name and sequential line numbers, I should be able to do this.
$Location = "c:\codeishere"
[regex]$Regex = "//.*;" #simple example - Will expand on this...
$Files = get-ChildItem $Location -include *cs -recurse
foreach ($File in $Files) {
$contents = get-Content $File
$Regex.Matches($contents) | WHAT GOES HERE?
}

You could do:
dir c:\codeishere -filter *.cs -recurse | select-string -Pattern '//.*;' | select Line,LineNumber,Filename

gci c:\codeishere *.cs -r | select-string "//.*;"
The select-string cmdlet already does exactly what you're asking for, though the filename displayed is a relative path.
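To see the three requested fields without pointing at a real source tree, here is a small sketch that writes a throwaway file to the temp folder first. The folder name, file name and file contents are made up for illustration.

```powershell
# Create a scratch folder with one hypothetical C# file
$dir = Join-Path ([IO.Path]::GetTempPath()) ('regex-demo-' + [guid]::NewGuid())
New-Item -ItemType Directory -Path $dir | Out-Null
Set-Content -Path (Join-Path $dir 'Sample.cs') -Value @(
    'int a = 1;'
    '// int b = 2;'
    '// int c = 3;'
    'return a;'
)

# Select-String reports file name, line number and line text for every match
$hits = Get-ChildItem $dir -Filter *.cs -Recurse |
    Select-String -Pattern '//.*;' |
    Select-Object Filename, LineNumber, Line

$hits | Format-Table -AutoSize

# Clean up the scratch folder
Remove-Item $dir -Recurse -Force
```

Here the two commented-out lines are reported with line numbers 2 and 3, which is the kind of sequential run the question is looking for.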

Personally, I would go even further and compute the number of consecutive commented lines, then print the file name, the count of lines, and the lines themselves. You can sort the result by the count of lines (candidates for deletion?).
Note that my code does not account for empty lines between commented lines, so this part is considered two blocks of commented code:
// int a = 10;
// int b = 20;

// DoSomething()
// SomethingAgain()
Here is my code.
$Location = "c:\codeishere"
$occurences = get-ChildItem $Location *cs -recurse | select-string '//.*;'
$grouped = $occurences | group FileName
function Compute([Microsoft.PowerShell.Commands.MatchInfo[]]$lines) {
    $local:lastLineNum = $null
    $local:lastLine = $null
    $local:blocks = @()
    $local:newBlock = $null
    $lines |
        % {
            if (!$lastLineNum) { # first line
                $lastLineNum = -2 # some number so that the following if is $true (-2 and lower)
            }
            if ($_.LineNumber - $lastLineNum -gt 1) { #new block of commented code
                if ($newBlock) { $blocks += $newBlock }
                $newBlock = $null
            }
            else { # two consecutive lines of commented code
                if (!$newBlock) {
                    $newBlock = '' | select File,StartLine,CountOfLines,Lines
                    $newBlock.File, $newBlock.StartLine, $newBlock.CountOfLines, $newBlock.Lines = $_.Filename,($_.LineNumber-1),2, @($lastLine,$_.Line)
                }
                else {
                    $newBlock.CountOfLines += 1
                    $newBlock.Lines += $_.Line
                }
            }
            $lastLineNum=$_.LineNumber
            $lastLine = $_.Line
        }
    if ($newBlock) { $blocks += $newBlock }
    $blocks
}
# foreach GroupInfo objects from group cmdlet
# get Group collection and compute
$result = $grouped | % { Compute $_.Group }
#how to print
$result | % {
    write-host "`nFile $($_.File), line $($_.StartLine), count of lines: $($_.CountOfLines)" -foreground Green
    $_.Lines | % { write-host $_ }
}
# you may sort it by count of lines:
$result2 = $result | sort CountOfLines -desc
$result2 | % {
    write-host "`nFile $($_.File), line $($_.StartLine), count of lines: $($_.CountOfLines)" -foreground Green
    $_.Lines | % { write-host $_ }
}
If you have any idea how to improve the code, post it! I have a feeling that I could do it using some standard cmdlets and the code could be shorter..
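One possible shortening is to separate the "find runs of consecutive line numbers" step from the Select-String part. This is only a sketch of that step, not a drop-in replacement: the Group-ConsecutiveLine function name and its Start and Lines properties are made up here, and unlike the code above it also reports runs of a single line.

```powershell
function Group-ConsecutiveLine {
    param([int[]]$LineNumber)

    $blocks  = @()
    $current = $null
    foreach ($n in ($LineNumber | Sort-Object)) {
        if ($null -ne $current -and $n -eq $current['Start'] + $current['Lines']) {
            $current['Lines']++                      # extends the current run
        }
        else {
            if ($null -ne $current) { $blocks += [pscustomobject]$current }
            $current = @{ Start = $n; Lines = 1 }    # starts a new run
        }
    }
    if ($null -ne $current) { $blocks += [pscustomobject]$current }
    $blocks
}

# Line numbers as Select-String might report them for one file
$blocks = Group-ConsecutiveLine -LineNumber 3, 4, 5, 9, 12, 13
$blocks | Sort-Object Lines -Descending | Format-Table Start, Lines
```

For the sample input this yields three runs: 3 lines starting at line 3, 1 line at line 9, and 2 lines starting at line 12. Filtering with Where-Object { $_.Lines -ge 2 } would restrict the result to multi-line blocks.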

I would look at doing something like:
Get-ChildItem $location -Include *.cs -Recurse |
    ForEach-Object { $file = $_; $n = 0; Get-Content $_.FullName } |
    ForEach-Object { $_ | Add-Member -NotePropertyMembers @{ FileName = $file.Name; LineNumber = (++$n) } -PassThru } |
    Where-Object { $_ -match $regex } |
    ForEach-Object { '{0}:{1}: {2}' -f $_.FileName, $_.LineNumber, $_ }
I.e. attach extra properties to each line (a string) that specify the file name and line number, so they are carried through the pipeline after the regex match.
(Using ForEach-Object's -begin/-end script blocks should be able to simplify this.)
