Regular Expression to get substrings in PowerShell - regex

I need help with a regular expression. I have thousands of lines in a file with the following format:
+ + [COMPILED]\SRC\FileCheck.cs - TotalLine: 99 RealLine: 27 Braces: 18 Comment: 49 Empty: 5
+ + [COMPILED]\SRC\FindstringinFile.cpp - TotalLine: 103 RealLine: 26 Braces: 22 Comment: 50 Empty: 5
+ + [COMPILED]\SRC\findingstring.js - TotalLine: 91 RealLine: 22 Braces: 14 Comment: 48 Empty: 7
+ + [COMPILED]\SRC\restinpeace.h - TotalLine: 95 RealLine: 24 Braces: 16 Comment: 48 Empty: 7
+ + [COMPILED]\SRC\Getsomething.h++ - TotalLine: 168 RealLine: 62 Braces: 34 Comment: 51 Empty: 21
+ + [COMPILED]\SRC\MemDataStream.hh - TotalLine: 336 RealLine: 131 Braces: 82 Comment: 72 Empty: 51
+ + [CONTEXT]\SRC\MemDataStream.sql - TotalLine: 36 RealLine: 138 Braces: 80 Comment: 76 Empty: 59
I need a regular expression that can give me:
FilePath i.e. \SRC\FileMap.cpp
Extension i.e. .cpp
RealLine value i.e. 17
I'm using PowerShell to implement this and have been successful in getting the results back using the Get-Content (to read the file) and Select-String cmdlets.
The problem is that it's taking a long time to get the various substrings and then write them to the XML file. (I have not included the code for generating the XML.)
I've never used regular expressions before, but I understand that a regular expression would be an efficient way to get the strings.
Help would be appreciated.
The Select-String cmdlet accepts the regular expression to search for the string.
Current code is as follows:
function Get-SubString
{
Param ([string]$StringtoSearch, [string]$StartOfTheString, [string]$EndOfTheString)
If($StringtoSearch.IndexOf($StartOfTheString) -eq -1 )
{
return
}
[int]$StartOfIndex = $StringtoSearch.IndexOf($StartOfTheString) + $StartOfTheString.Length
[int]$EndOfIndex = $StringtoSearch.IndexOf($EndOfTheString , $StartOfIndex)
if( $StringtoSearch.IndexOf($StartOfTheString)-ne -1 -and $StringtoSearch.IndexOf($EndOfTheString) -eq -1 )
{
[string]$ExtractedString=$StringtoSearch.Substring($StartOfTheString.Length)
}
else
{
[string]$ExtractedString = $StringtoSearch.Substring($StartOfIndex, $EndOfIndex - $StartOfIndex)
}
Return $ExtractedString
}
function Get-FileExtension
{
Param ( [string]$Path)
[System.IO.Path]::GetExtension($Path)
}
#For each file extension we will be searching all lines starting with + +
$SearchIndividualLines = "+ + ["
$TotalLines = select-string -Pattern $SearchIndividualLines -Path $StandardOutputFilePath -allmatches -SimpleMatch
for($i = $TotalLines.GetLowerBound(0); $i -le $TotalLines.GetUpperBound(0); $i++)
{
$FileDetailsString = $TotalLines[$i]
#Get File Path
$StartStringForFilePath = "]"
$EndStringforFilePath = "- TotalLine"
$FilePathValue = Get-SubString -StringtoSearch $FileDetailsString -StartOfTheString $StartStringForFilePath -EndOfTheString $EndStringforFilePath
#Write-Host FilePathValue is $FilePathValue
#GetFileExtension
$FileExtensionValue = Get-FileExtension -Path $FilePathValue
#Write-Host FileExtensionValue is $FileExtensionValue
#GetRealLine
$StartStringForRealLine = "RealLine:"
$EndStringforRealLine = "Braces"
$RealLineValue = Get-SubString -StringtoSearch $FileDetailsString -StartOfTheString $StartStringForRealLine -EndOfTheString $EndStringforRealLine
if([string]::IsNullOrEmpty($RealLineValue))
{
continue
}
}

Assume you have those in C:\temp\sample.txt
Something like this?
PS> (get-content C:\temp\sample.txt) | % { if ($_ -match '.*COMPILED\](\\.*)(\.\w+)\s*.*RealLine:\s*(\d+).*') { [PSCustomObject]@{FilePath=$matches[1]; Extention=$Matches[2]; RealLine=$matches[3]} } }
FilePath Extention RealLine
-------- --------- --------
\SRC\FileCheck .cs 27
\SRC\FindstringinFile .cpp 26
\SRC\findingstring .js 22
\SRC\restinpeace .h 24
\SRC\Getsomething .h 62
\SRC\MemDataStream .hh 131
Update:
Anything inside parentheses is captured, so if you want to capture [COMPILED] as well, you just need to move that part inside the parentheses:
Instead of
$_ -match '.*COMPILED\](\\.*)
use
$_ -match '.*(\[COMPILED\]\\.*)
The link in the comment to your question includes a good primer on the regex.
UPDATE 2
Now that you want to capture the whole path, I am guessing your sample looks like this:
+ + [COMPILED]C:\project\Rom\Main\Plan\file1.file2.file3\Cmd\Camera.culture.less-Late-PP.min.js - TotalLine: 336 RealLine: 131 Braces: 82 Comment: 72 Empty: 51
The technique above will still work; you just need a very slight adjustment to the first parenthesis, like this:
$_ -match '(\[COMPILED\].*)
This tells the regex to capture [COMPILED] and everything that comes after it, until
(\.\w+)
i.e. up to the extension, which is a dot followed by a run of word characters (this might not work for an extension containing other characters, such as .h++, which is why that sample line shows up as .h above).
So, your original one-liner would instead be:
(get-content C:\temp\sample.txt) | % { if ($_ -match '.(\[COMPILED\].*)(\.\w+)\s*.*RealLine:\s*(\d+).*') { [PSCustomObject]@{FilePath=$matches[1]; Extention=$Matches[2]; RealLine=$matches[3]} } }
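For comparison, here is a rough sketch of the same extraction in Python rather than PowerShell. The names LINE_RE and parse_line are made up for illustration, and the pattern assumes every line has the "[TAG]path - TotalLine: ... RealLine: N" shape shown in the question. Taking the extension from the captured path with a path helper, instead of a second capture group, avoids the .h++ problem.

```python
import re
from pathlib import PureWindowsPath

# Hypothetical pattern: capture everything between "[TAG]" and " - TotalLine:",
# then the digits that follow "RealLine:".
LINE_RE = re.compile(
    r"\[\w+\](?P<path>.+?) - TotalLine:.*?RealLine:\s*(?P<real>\d+)"
)

def parse_line(line):
    """Return (path, extension, real_lines), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    path = m.group("path")
    # The path helper takes everything from the last dot of the file name.
    extension = PureWindowsPath(path).suffix
    return path, extension, int(m.group("real"))

sample = (r"+ + [COMPILED]\SRC\FileCheck.cs - TotalLine: 99 RealLine: 27 "
          r"Braces: 18 Comment: 49 Empty: 5")
print(parse_line(sample))
```

Reading the file line by line and calling parse_line on each line keeps the work to a single regex match per line.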

Related

How to match variable length number range 0 to $n using perl regex?

I need to match a numeric range from 0 to a number $n where $n can be any random number from 1 - 40.
For example,
if $n = 16, I need to strictly match only the numeric range from 0-16.
I tried m/([0-9]|[1-3][0-9]|40)/ but that is matching all 0-40. Is there a way to use regex to match from 0 to $n ?
The code snippet is attached for context.
$n = getNumber(); #getNumber() returns a random number from 1 to 40.
$answer = getAnswer(); #getAnswer() returns a user input.
#Check whether user enters an integer between 0 and $n
if ($answer =~ m/regex/){
print("Answer is an integer within specified range!\n");
}
I know can probably do something like
if($answer >= 0 && $answer <=$n)
But I am just wondering if there is a regex way of doing it?
I wouldn't pull out the following trick if there's another reasonable way to solve the problem. There is, for instance, Matching Numeric Ranges with a Regular Expression.
The (?(...)true|false) construct is like a regex conditional operator, and you can use one of the regex verbs, (*FAIL), to always fail a subpattern.
For the condition itself, you can use a (?{...}) code block:
my $pattern = qr/
\b # anchor somehow
(\d++) # non-backtracking and greedy
(?(?{ $1 > 42 })(*FAIL))
/x;
my @numbers = map { int( rand(100) ) } 0 .. 10;
say "@numbers";
foreach my $n ( @numbers ) {
next unless $n =~ $pattern;
say "Matched $n";
}
Here's a run:
74 69 24 15 23 26 62 18 18 43 80
Matched 24
Matched 15
Matched 23
Matched 26
Matched 18
Matched 18
This is handy when the condition is complex.
I only think about this because it's an encouraged feature in Raku (and I have several examples in Learning Perl 6). Here's some Raku code in the same form, and the pattern syntax is significantly different:
#!raku
my $numbers = map { 100.rand.Int }, 0 .. 20;
say $numbers;
for @$numbers -> $n {
next unless $n ~~ / (<|w> \d+: <?{ $/ <= 42 }>) /;
say $n
}
The result is the same:
(67 43 31 41 89 14 52 71 48 64 5 21 6 31 44 27 39 94 78 15 39)
31
41
14
5
21
6
31
27
39
15
39
You can dynamically create the pattern. I've used a non-capture group (?:) here to keep the start and end of string anchors outside the list of |-ed numbers.
my $n = int rand 40;
my $answer = 42;
my $pattern = join '|', 0 .. $n;
if ($answer =~ m/^(?:$pattern)$/) {
print "Answer is an integer within specified range";
}
Please keep in mind that for your purpose this makes little sense.
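The same "build the alternation at run time" idea can be sketched in Python for comparison. The helper names range_pattern and in_range are invented here, and the sketch assumes the answer arrives as a string without leading zeros. As noted above, a plain numeric comparison is the more sensible tool; this only shows the mechanics.

```python
import re

def range_pattern(n):
    """Build a regex whose alternatives are the integers 0..n (no leading zeros)."""
    alternatives = "|".join(str(i) for i in range(n + 1))
    # Non-capturing group, so the alternation stays one unit.
    return re.compile(rf"(?:{alternatives})")

def in_range(answer, n):
    """True if the string `answer` is exactly one of the integers 0..n."""
    # fullmatch anchors at both ends, like ^(?:...)$ in the Perl version.
    return range_pattern(n).fullmatch(answer) is not None

print(in_range("16", 16), in_range("17", 16))
```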

Convert HH:MM into decimal hours

I am trying to convert some timestamps from a text file in HH:MM format into decimal numbers (for example, 12:30 -> 12,5) [1] using a Perl regex, for easier processing later.
I am quite new to this topic, so I am struggling with the MM part and don't know how to convert it. Currently I have something like this:
while ( <FILE> ) {
$line = $_;
$line =~ s/([0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])/$2,$1/g;
print $line;
}
[1] In my locale, the comma (,) is used as the decimal separator where others would write a dot (.). So 12,5 means 12 and a half, i.e. 12.5.
I would not use a regular expression for converting. It can be done with pretty simple math. Parse out the times using your search pattern, and then pass it through something like this.
sub to_decimal {
my $time = shift;
my ($hours, $minutes) = split /:/, $time;
my $decimal = sprintf '%.02d', ($minutes / 60) * 100 ;
return join ',', $hours, $decimal;
}
If you run it in a loop like this:
for (qw(00 01 05 10 15 20 25 30 35 40 45 50 55 58 59)) {
say "$_ => " . to_decimal("12:$_");
}
You get:
00 => 12,00
01 => 12,01
05 => 12,08
10 => 12,16
15 => 12,25
20 => 12,33
25 => 12,41
30 => 12,50
35 => 12,58
40 => 12,66
45 => 12,75
50 => 12,83
55 => 12,91
58 => 12,96
59 => 12,98
perl -ple 's|(\d\d):(\d\d)|{$2/60 + $1}|eg'
Your locale should take care of the comma, I think.
This will achieve what you need. It uses an executable substitution to replace the time string with an expression in terms of the hour and minute values. tr/./,/r is used to convert all dots to commas.
use strict;
use warnings 'all';
while ( <DATA> ) {
s{ ( 0[0-9] | 1[0-9] | 2[0-3] ) : ( [0-5][0-9] ) }{
sprintf('%.2f', $1 + $2 / 60) =~ tr/./,/r
}gex;
print;
}
__DATA__
00:00
05:17
12:30
15:59
23:59
output
0,00
5,28
12,50
15,98
23,98
You only have to adjust the substitution to make it work:
$line =~ s/(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])/"$1," . substr( int($2)\/60, 2)/eg;
The e modifier causes the replacement to be evaluated as Perl code, so you can write the intended result as a formula over the capture group contents. Note that the substr call strips the leading "0." from the string representation of the fraction.
If you need to limit your self to a given number of fraction digits, format the result of the division using sprintf:
$line =~ s/(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])/"$1," . substr( sprintf('%.2f', int($2)\/60), 2)/eg;
You could use egrep and awk:
$ echo 12:30 | egrep -o '(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])' | awk -F":" '{printf $1+$2/60}'
12.5
Assume your LC_NUMERIC is correct:
while (<FILE>) {
use locale ':not_characters';
my $line = $_;
$line =~ s!\b([01][0-9]|2[0-3]):([0-5][0-9])\b!$1 + $2/60!eg;
print $line;
}
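The answers above all come down to "match HH:MM, compute hours + minutes/60, format the result". A minimal Python sketch of that same approach follows, for comparison. The names TIME_RE and to_decimal_hours are made up, and the two-decimal rounding and the comma separator are assumptions taken from the question's locale.

```python
import re

# Same shape as the patterns above: 00-23 hours, 00-59 minutes.
TIME_RE = re.compile(r"\b([01][0-9]|2[0-3]):([0-5][0-9])\b")

def to_decimal_hours(text, decimal_sep=","):
    """Replace every HH:MM in `text` with decimal hours, two digits after the separator."""
    def repl(match):
        hours = int(match.group(1)) + int(match.group(2)) / 60
        # Format with a dot first, then swap in the locale's separator.
        return f"{hours:.2f}".replace(".", decimal_sep)
    return TIME_RE.sub(repl, text)

print(to_decimal_hours("start 12:30 end 13:45"))
```

Passing a function as the replacement plays the role of Perl's /e modifier here.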

Perl while loop with a regex condition, does not seem to re-enter the loop

I am trying to read a diff file and group the delete hunks (lines with a leading -<SPACE>) and add hunks (leading +<SPACE>). I have the code below, where an inner while loop is meant to read all the deletes or adds. However, after the first match the loop doesn't seem to re-evaluate: after a line matches the condition on line 32, it enters the loop and I read another line into $_, but control then transfers out of the loop to line 36 instead of re-entering the loop and checking the condition again! (I checked this in the debugger.)
27 while (<$inh>) {
28 next if /^$/;
29 next if /^[^-+]/; # ignore non diff lines
30 chomp;
31 my $tmps = '';
32 while (/^\-\ /) { # Read the entire block (all lines beginning with -<SPACE>
33 $tmps .= getprintables($_);
34 $_ = <$inh>; # Read next line
35 }
36 push(@prevs, $tmps) if $tmps ne '';
37 $tmps = '';
38 while (/^\+\ /) { # Read the entire block (all lines beginning with +<SPACE>
39 $tmps .= getprintables($_);
40 $_ = <$inh>;
41 }
42 push(@curs, $tmps) if $tmps ne '';
43 $tmps = '';
44 }
45 close($inh);
I also tried the longer form while ($_ =~ m/^\+\ /) but same results. I don't see what is wrong here. I have Perl v5.14.2
I would suggest only ever doing your reading from a file handle from a single while loop, and using state variables to perform additional logic.
If I'm reading your intent properly, you can redesign your approach to use the Range operator .. instead like so:
while (<$inh>) {
next if /^$/;
next if /^[^-+]/; # ignore non diff lines
chomp;
if ( my $range = /^- / .. !/^- / ) {
push @prevs, '' if $range == 1;
$prevs[-1] .= getprintables($_) if $range !~ /E/;
}
if ( my $range = /^\+ / .. !/^\+ / ) {
push @curs, '' if $range == 1;
$curs[-1] .= getprintables($_) if $range !~ /E/;
}
}
close($inh);
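The "read in one loop, track state in a variable" advice can also be sketched without the range operator. The following is a Python illustration, not a translation of the code above: group_hunks is a made-up name, the text after the marker stands in for the asker's getprintables helper, and it assumes that any blank or non-diff line closes the current block.

```python
def group_hunks(lines):
    """Collect runs of '- ' lines into prevs and runs of '+ ' lines into curs."""
    prevs, curs = [], []
    last_kind = None  # marker of the previous diff line, or None after a break
    for raw in lines:
        line = raw.rstrip("\n")
        kind = line[:2] if line[:2] in ("- ", "+ ") else None
        if kind is None:
            last_kind = None  # assumption: other lines end the current block
            continue
        target = prevs if kind == "- " else curs
        if kind != last_kind:
            target.append("")  # a change of marker starts a new block
        target[-1] += line[2:]  # stand-in for getprintables($_)
        last_kind = kind
    return prevs, curs

diff = ["- a\n", "- b\n", "+ c\n", "\n", " context\n", "- d\n", "+ e\n", "+ f\n"]
print(group_hunks(diff))
```

Because only the for loop ever reads a line, there is no inner read that can run past the end of a block or the end of the file.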

Matching multiple patterns in the same line using unix utilities

I am trying to match several patterns on the same line and display only the first match of each. One of the fields, the fourth, can match either of two patterns, i.e. A,BCD.EF or AB.CD. An example would be
Example 1:
12:23 23:23 ASDFGH 1,232.00 22.00
21:22 12:12 ASDSDS 22.00 21.00
The expected output would be
Expected Result 1:
12:23 ASDFGH 1,232.00
21:22 ASDSDS 22.00
I have got this far using my little knowledge of grep and stackoverflow.
< test_data.txt grep -one "[0-9]/[0-9][0-9]\|[0-9]*,[0-9]*.[0-9][0-9]\|[0-9]*.[0-9][0-9]" | awk -F ":" '$1 == y { sub(/[^:]:/,""); r = (r ? r OFS : "") $0; next } x { print x, r; r="" } { x=$0; y=$1; sub(/[^:]:/,"",x) } END { print x, r }'
Any ideas on how to make this simpler or cleaner and achieve the complete functionality?
Update 1: Few other examples could be:
Example 2:
12:21 11111 11:11 ASADSS 11.00 11.00
22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00
There could be more fields in some lines.
The order of fields are not necessarily preserved either. I could get around this by treating the files which have different order separately or transforming them to this order somehow. So this condition can be relaxed.
Update 2: Seems like somehow my question was not clear. So one way looking at it would be to look for: the first "time" I find on a line, the first set of alpha-numeric string and first decimal values with/without comma in it, all of them printed on the same output line. A more generic description would be, Given an input line, print the first occurrence of pattern 1, first occurrence of pattern 2 and first occurrence of pattern 3 (which itself is an "or" of two patterns) in one line in the output and must be stable (i.e; preserving the order they appeared in input). Sorry it is a little complicated example and I am also trying to learn if this is the sweet spot to leave using Unix utilities for a full language like Perl/Python. So here is the expected results for the second set of examples.
Expected Result 2:
12:21 ASADSS 11.00
22:22 BASDASD 1,231.00
#!/usr/bin/awk -f
BEGIN {
p[0] = "^[0-9]+:[0-9]{2}$"
p[1] = "^[[:alpha:]][[:alnum:]]*$"
p[2] = "^[0-9]+[0-9,]*[.][0-9]{2}$"
}
{
i = 0
for (j = 1; j <= NF; ++j) {
for (k = 0; k in p; ++k) {
if ($j ~ p[k] && !q[k]++ && j > ++i) {
$i = $j
}
}
}
q[0] = q[1] = q[2] = 0
NF = i
print
}
Input:
12:23 23:23 ASDFGH 1,232.00 22.00
21:22 12:12 ASDSDS 22.00 21.00
12:21 11111 11:11 ASADSS 11.00 11.00
22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00
Output:
12:23 ASDFGH 1,232.00
21:22 ASDSDS 22.00
12:21 ASADSS 11.00
22:22 BASDASD 1,231.00
Perl-regex style should solve the problem:
(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))
It will capture the following data (processing each line you provided separately):
RESULT$VAR1 = [
'12:23',
'ASDFGH',
'1,232.00'
];
RESULT$VAR1 = [
'21:22',
'ASDSDS',
'22.00'
];
RESULT$VAR1 = [
'12:21',
'ASADSS',
'11.00'
];
RESULT$VAR1 = [
'22:22',
'BASDASD',
'1,231.00'
];
Example perl script.pl:
#!/usr/bin/perl
use strict;
use Data::Dumper;
open my $F, '<', shift @ARGV;
my @strings = <$F>;
my $qr = qr/(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))/;
foreach my $string (@strings) {
chomp $string;
next if not $string;
my @tab = $string =~ $qr;
print join(" ", @tab) . "\n";
}
Run as:
perl script.pl test_data.txt
Cheers!
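Since the question asks whether this is the point to move to a full language, here is a hedged Python sketch of the generic description in Update 2: print the first field matching each pattern, in input order. PATTERNS and first_matches are invented names, and the three expressions are assumptions modelled on the awk answer's field-level patterns.

```python
import re

# Assumed field-level patterns, loosely following the awk answer above.
PATTERNS = [
    re.compile(r"\d+:\d{2}"),             # a time such as 12:23
    re.compile(r"[A-Za-z][A-Za-z0-9]*"),  # a word starting with a letter
    re.compile(r"\d[\d,]*\.\d{2}"),       # a decimal, with or without commas
]

def first_matches(line):
    """Return the first field matching each pattern, in the order they appear."""
    found = {}  # pattern index -> (field position, field text)
    for pos, field in enumerate(line.split()):
        for idx, pattern in enumerate(PATTERNS):
            if idx not in found and pattern.fullmatch(field):
                found[idx] = (pos, field)
                break
    # Sorting by field position keeps the output stable with respect to the input.
    return " ".join(field for _, field in sorted(found.values()))

print(first_matches("22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00"))
```

Matching whole fields, rather than searching across the line, is what keeps a bare number like 111232 from being picked up as part of a decimal.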

How do I get rid of this "(" using regex?

I was moving along on a regular expression and have hit a roadblock I can't seem to get around. I am trying to get rid of a "(" in the middle of a line of text using a regex. There were two, but I figured out how to get the one at the end of the line; it's the one in the middle I can't hack out.
Here is a more complete snippet of the file I am searching through.
ide1:0.present = "TRUE"
ide1:0.clientDevice = "TRUE"
ide1:0.deviceType = "cdrom-raw"
ide1:0.startConnected = "FALSE"
floppy0.startConnected = "FALSE"
floppy0.clientDevice = "TRUE"
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000"
ethernet0.networkName = "solignis.local"
ethernet0.addressType = "generated"
guestOSAltName = "Ubuntu Linux (64-bit)"
guestOS = "ubuntulinux"
uuid.location = "56 4d e8 67 57 18 67 04-c8 68 14 eb b3 c7 be bf"
uuid.bios = "56 4d e8 67 57 18 67 04-c8 68 14 eb b3 c7 be bf"
vc.uuid = "52 c7 14 5c a0 eb f4 cc-b3 69 e1 6d ad d8 1a e7"
Here is the entire foreach loop I am working on.
my @virtual_machines;
foreach my $vm (keys %virtual_machines) {
push @virtual_machines, $vm;
}
foreach my $vm (@virtual_machines) {
my $vmx_file = $ssh1->capture("cat $virtual_machines{$vm}{VMX}");
if ($vmx_file =~ m/^\bguestOSAltName\b\s+\S\s+\W(?<GUEST_OS> .+[^")])\W/xm) {
$virtual_machines{$vm}{"OS"} = "$+{GUEST_OS}";
} else {
$virtual_machines{$vm}{"OS"} = "N/A";
}
if ($vmx_file =~ m/^\bguestOSAltName\b\s\S\s.+(?<ARCH> \d{2}\W\bbit\b)/xm) {
$virtual_machines{$vm}{"Architecture"} = "$+{ARCH}";
} else {
$virtual_machines{$vm}{"Architecture"} = "N/A";
}
}
I think the problem is that I cannot match the "(" because the expression before it is ".+", which matches everything in the line of text, be it alphanumeric, whitespace, or symbols like hyphens.
Any ideas how I can get this to work?
This is what I am getting for an output from a hash dump.
$VAR1 = {
'NS02' => {
'ID' => '144',
'Version' => '7',
'OS' => 'Ubuntu Linux (64-bit',
'VMX' => '/vmfs/volumes/datastore2/NS02/NS02.vmx',
'Architecture' => '64-bit'
},
The part of the code block that handles ARCH works flawlessly, so what I really need is to cut off the "(64-bit)" part, if it exists, when the search runs into the (, and also remove the whitespace preceding the (.
What I am wanting is to turn the above hash dump into this.
$VAR1 = {
'NS02' => {
'ID' => '144',
'Version' => '7',
'OS' => 'Ubuntu Linux',
'VMX' => '/vmfs/volumes/datastore2/NS02/NS02.vmx',
'Architecture' => '64-bit'
},
Same thing minus the (64-bit) part.
You can simplify your regex to /^guestOSAltName\s+=\s+"(?<GUEST_OS>.+)"/m. What this does:
^ forces the match to start at the beginning of a line
guestOSAltName is a string literal.
\s+ matches 1 or more whitespace characters.
(?<GUEST_OS>.+) matches all the text from after the opening quote up to the last quote on the line, captures it, and names the group GUEST_OS. If the line could have comments, you might want to change .+ to [^#]+.
The "'s around the group are literal quotes.
The m at the end turns on multi-line matching.
Code:
if ($vmx_file =~ /^guestOSAltName\s+=\s+"(?<GUEST_OS>.+)"/m) {
print "$+{GUEST_OS}";
} else {
print "N/A";
}
See it here: http://ideone.com/1xH5J
So you want to match the contents of the string after guestOSAltName up to (and not including) the first ( if present?
Then replace the first line of your code sample with
if ($vmx_file =~ m/^guestOSAltName\s+=\s+"(?<GUEST_OS>[^"()]+)/xm) {
If there always is a whitespace character before a potential opening parenthesis, then you can use
if ($vmx_file =~ m/^guestOSAltName\s+=\s+"(?<GUEST_OS>[^"()]+)[ "]/xm) {
so you don't need to strip trailing whitespace if present.
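For comparison, the "stop at the first quote or parenthesis" idea can be sketched in Python as follows. GUEST_OS_RE and guest_os are made-up names, and the sketch assumes the VMX text uses the key = "value" layout shown in the question. Ending the capture on a non-space character is one way to drop the blank before the "(" without a separate trim step.

```python
import re

# Capture the quoted value up to (not including) a '"' or '(',
# and require the capture to end on a non-space character.
GUEST_OS_RE = re.compile(
    r'^guestOSAltName\s*=\s*"(?P<os>[^"(\n]*[^"(\s])',
    re.MULTILINE,
)

def guest_os(vmx_text):
    """Return the guest OS name without any parenthesised suffix, or 'N/A'."""
    match = GUEST_OS_RE.search(vmx_text)
    return match.group("os") if match else "N/A"

vmx = (
    'guestOS = "ubuntulinux"\n'
    'guestOSAltName = "Ubuntu Linux (64-bit)"\n'
    'uuid.bios = "56 4d e8 67"\n'
)
print(guest_os(vmx))
```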
Something like this should work:
$match =~ s/^(.*?)\((.*?)$/$1$2/;
I generally find that .* is too powerful (as you are finding!). Two suggestions:
Be more explicit on what you are looking for
my $text = '( something ) ( something else) ' ;
$text =~ /
\(
( [\s\w]+ )
\)
/x ;
print $1 ;
Use non greedy matching
my $text = '( something ) ( something else) ' ;
$text =~ /
\(
( .*? ) # non greedy match
\)
/x ;
print $1 ;
General observation - involved regexps are far easier to read if you use the /x option as this allows spacing and comments.
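The difference between the greedy, non-greedy, and explicit forms is easy to see side by side. This is a small Python sketch using the same test string as above; re.VERBOSE plays the role of Perl's /x, and the three pattern names are made up for illustration.

```python
import re

text = "( something ) ( something else) "

# re.VERBOSE ignores the spaces in the pattern, like Perl's /x.
GREEDY = re.compile(r"\( (.*) \)", re.VERBOSE)            # runs to the LAST ')'
LAZY = re.compile(r"\( (.*?) \)", re.VERBOSE)             # stops at the FIRST ')'
EXPLICIT = re.compile(r"\( ([\s\w]+) \)", re.VERBOSE)     # only spaces and word chars

for name, pattern in (("greedy", GREEDY), ("lazy", LAZY), ("explicit", EXPLICIT)):
    print(name, repr(pattern.search(text).group(1)))
```

The greedy form swallows the first closing parenthesis and the whole second group, which is the same effect the asker is seeing with .+ in front of the "(".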
Use a ? after your quantifier. The ? makes it non-greedy.
The regex is /^guestOSAltName[^"]+"(?<GUEST_OS>.+?)\s*[\("]+.*$/:
#!/usr/bin/env perl
foreach my $x ('guestOSAltName = "Ubuntu Linux (64-bit)"', 'guestOSAltName = "Microsoft Windows Server 2003, Standard Edition"') {
if ($x =~ m/^guestOSAltName[^"]+"(?<GUEST_OS>.+?)\s*[\("]+.*$/xm) {
print "$+{GUEST_OS}\n";
} else {
print "N/A\n";
}
if ($x =~ m/^guestOSAltName[^(]+\((?<ARCH>\d{2}).*/xm) {
print "$+{ARCH}\n";
} else {
print "N/A\n";
}
}
Start the demo:
$ perl t.pl
Ubuntu Linux
64
Microsoft Windows Server 2003, Standard Edition
N/A