How do I get rid of this "(" using regex? - regex

I was moving along on a regex and have hit a roadblock I can't seem to get around. I am trying to get rid of a "(" in the middle of a line of text using a regex. There were two, but I figured out how to get the one at the end of the line. It's the one in the middle I can't hack out.
Here is a more complete snippet of the file which I am searching through.
ide1:0.present = "TRUE"
ide1:0.clientDevice = "TRUE"
ide1:0.deviceType = "cdrom-raw"
ide1:0.startConnected = "FALSE"
floppy0.startConnected = "FALSE"
floppy0.clientDevice = "TRUE"
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000"
ethernet0.networkName = "solignis.local"
ethernet0.addressType = "generated"
guestOSAltName = "Ubuntu Linux (64-bit)"
guestOS = "ubuntulinux"
uuid.location = "56 4d e8 67 57 18 67 04-c8 68 14 eb b3 c7 be bf"
uuid.bios = "56 4d e8 67 57 18 67 04-c8 68 14 eb b3 c7 be bf"
vc.uuid = "52 c7 14 5c a0 eb f4 cc-b3 69 e1 6d ad d8 1a e7"
Here is the entire foreach loop I am working on.
my @virtual_machines;
foreach my $vm (keys %virtual_machines) {
push @virtual_machines, $vm;
}
foreach my $vm (@virtual_machines) {
my $vmx_file = $ssh1->capture("cat $virtual_machines{$vm}{VMX}");
if ($vmx_file =~ m/^\bguestOSAltName\b\s+\S\s+\W(?<GUEST_OS> .+[^")])\W/xm) {
$virtual_machines{$vm}{"OS"} = "$+{GUEST_OS}";
} else {
$virtual_machines{$vm}{"OS"} = "N/A";
}
if ($vmx_file =~ m/^\bguestOSAltName\b\s\S\s.+(?<ARCH> \d{2}\W\bbit\b)/xm) {
$virtual_machines{$vm}{"Architecture"} = "$+{ARCH}";
} else {
$virtual_machines{$vm}{"Architecture"} = "N/A";
}
}
I think the problem is that I cannot match the "(" because the expression before it is ".+", which matches everything in the line of text, be it alphanumeric, whitespace, or even symbols like hyphens.
Any ideas how I can get this to work?
This is what I am getting for an output from a hash dump.
$VAR1 = {
'NS02' => {
'ID' => '144',
'Version' => '7',
'OS' => 'Ubuntu Linux (64-bit',
'VMX' => '/vmfs/volumes/datastore2/NS02/NS02.vmx',
'Architecture' => '64-bit'
},
The part of the code block that handles ARCH works flawlessly, so what I really need is to hack off the "(64-bit)" part, if it exists, when the search runs into the (, and also remove the whitespace preceding the (.
What I am wanting is to turn the above hash dump into this.
$VAR1 = {
'NS02' => {
'ID' => '144',
'Version' => '7',
'OS' => 'Ubuntu Linux',
'VMX' => '/vmfs/volumes/datastore2/NS02/NS02.vmx',
'Architecture' => '64-bit'
},
Same thing minus the (64-bit) part.

You can simplify your regex to /^guestOSAltName\s+=\s+"(?<GUEST_OS>.+)"/m. What this does:
^ forces the match to start at the beginning of a line
guestOSAltName is a string literal.
\s+=\s+ matches a literal = with one or more whitespace characters on each side.
(?<GUEST_OS>.+) matches all the text between the quotes, captures it, and names the group GUEST_OS. If the line could have comments, you might want to change .+ to [^#]+.
The "'s around the group are literal quotes.
The m at the end turns on multi-line matching.
Code:
if ($vmx_file =~ /^guestOSAltName\s+=\s+"(?<GUEST_OS>.+)"/m) {
print "$+{GUEST_OS}";
} else {
print "N/A";
}
See it here: http://ideone.com/1xH5J

So you want to match the contents of the string after guestOSAltName up to (and not including) the first ( if present?
Then replace the first line of your code sample with
if ($vmx_file =~ m/^guestOSAltName\s+=\s+"(?<GUEST_OS>[^"()]+)/xm) {
If there always is a whitespace character before a potential opening parenthesis, then you can use
if ($vmx_file =~ m/^guestOSAltName\s+=\s+"(?<GUEST_OS>[^"()]+)[ "]/xm) {
so you don't need to strip trailing whitespace if present.
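As a minimal sketch of the character-class approach above, the following runs the pattern against two sample lines taken from this thread. The sub name `guest_os` is hypothetical, and the explicit trailing-whitespace trim is an added step, not part of the answer's regex.

```perl
use strict;
use warnings;

# Hypothetical helper: return the OS name without any parenthetical suffix.
sub guest_os {
    my ($vmx) = @_;
    # Capture everything after the opening quote up to the first quote or parenthesis.
    if ($vmx =~ /^guestOSAltName\s+=\s+"(?<GUEST_OS>[^"()]+)/m) {
        my $os = $+{GUEST_OS};
        $os =~ s/\s+$//;    # drop the space that preceded "("
        return $os;
    }
    return 'N/A';
}

print guest_os(qq{guestOS = "ubuntulinux"\nguestOSAltName = "Ubuntu Linux (64-bit)"\n}), "\n";
print guest_os(qq{guestOSAltName = "Microsoft Windows Server 2003, Standard Edition"\n}), "\n";
```

Excluding `"`, `(` and `)` from the captured class means the match cannot run past the parenthesis, so no greedy `.+` needs to be tamed.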

Something like this should work:
$match =~ s/^(.*?)\((.*?)$/$1$2/;

I generally find that .* is too powerful (as you are finding!). Two suggestions:
Be more explicit on what you are looking for
my $text = '( something ) ( something else) ' ;
$text =~ /
\(
( [\s\w]+ )
\)
/x ;
print $1 ;
Use non greedy matching
my $text = '( something ) ( something else) ' ;
$text =~ /
\(
( .*? ) # non greedy match
\)
/x ;
print $1 ;
General observation - involved regexps are far easier to read if you use the /x option as this allows spacing and comments.
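To make the difference concrete, here is a small sketch that applies a greedy and a non-greedy pattern to the same sample string used above. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = '( something ) ( something else) ';

# Greedy: .* runs on to the last ")" in the string.
my ($greedy) = $text =~ /\((.*)\)/;
# Non-greedy: .*? stops at the first ")".
my ($lazy) = $text =~ /\((.*?)\)/;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The greedy capture spans both parenthesised groups, while the non-greedy one holds only the contents of the first.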

Put a ? after your quantifier. The ? makes it non-greedy.
The regex is /^guestOSAltName[^"]+"(?<GUEST_OS>.+?)\s*[\("]+.*$/:
#!/usr/bin/env perl
foreach my $x ('guestOSAltName = "Ubuntu Linux (64-bit)"', 'guestOSAltName = "Microsoft Windows Server 2003, Standard Edition"') {
if ($x =~ m/^guestOSAltName[^"]+"(?<GUEST_OS>.+?)\s*[\("]+.*$/xm) {
print "$+{GUEST_OS}\n";
} else {
print "N/A\n";
}
if ($x =~ m/^guestOSAltName[^(]+\((?<ARCH>\d{2}).*/xm) {
print "$+{ARCH}\n";
} else {
print "N/A\n";
}
}
Start the demo:
$ perl t.pl
Ubuntu Linux
64
Microsoft Windows Server 2003, Standard Edition
N/A

Related

Partial match of strings, operator ( =~ )

I have used =~ to compare two strings of equal length in my script to allow a don't-care condition. If a character in a string is ".", that position is ignored in the comparison. In other words, it is a partial match.
comp_test.pl :
#!/usr/bin/perl
use strict;
use warnings;
my $a=".0.0..0..1...0..........0...0......010.1..........";
my $b="10.0..0..1...0..........0...0......010.1..........";
my $c=".0.0..0..1...0..........0...0......010.1..........";
if ($a =~ $b) {
print "a and b same\n";
}
if ($a =~ $c) {
print "a and c same\n";
}
Because "." is a don't-care character, the expected result is both "a and b same" and "a and c same". However, the result is currently only "a and c same". Is there a better operator for this, or would changing "." to "x" help?
This is not a perl version problem. You are doing a regular expression match. The operand on the left of the =~ is the string and the operand on the right is the regex being applied to it.
This can be used for the kind of partial matching you are doing, given that the strings are the same length and each character of the regular expression matches a character of the string, but only where there is a . on the right. Where there is a 1 or a 0 in the regular expression ($b in the case of $a =~ $b), there must be an exactly matching character in the string ($a), not a ..
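The asymmetry described above can be seen in a two-line sketch: a "." only acts as a wildcard when it sits on the pattern side of =~. The variable names are illustrative.

```perl
use strict;
use warnings;

# The right-hand side is the pattern; the left-hand side is plain text.
my $dot_in_pattern = ('1' =~ /^.$/) ? 1 : 0;   # "." as a pattern matches the text "1"
my $dot_in_string  = ('.' =~ /^1$/) ? 1 : 0;   # "1" as a pattern does not match the text "."

print "pattern '.' against text '1': $dot_in_pattern\n";
print "pattern '1' against text '.': $dot_in_string\n";
```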
To do the kind of partial match you seem to want to do, you can use a bitwise exclusive or, like so:
sub partial_match {
my ($string_1, $string_2) = @_;
return 0 if length($string_1) != length($string_2);
# bitwise exclusive or the two strings together; where there is
# a 0 in one string and a 1 in the other, the result will be "\1".
# count \1's to get the number of mismatches
my $mismatches = ( $string_1 ^ $string_2 ) =~ y/\1//;
return $mismatches == 0;
}
While . matches 1 (or any other character), 1 doesn't match . (or any other character other than 1).
The following is a fast solution. It performs best when most strings match (since it always checks the entire string).
sub is_match { ( ( $_[0] ^ $_[1] ) =~ tr/\x00\x1E\x1F//c ) == 0 }
say is_match($a, $b) ? "match" : "no match";
say is_match($b, $c) ? "match" : "no match";
How it works:
Hex of characters
=================
30 30 31 31 2E 2E "0011.."
30 31 30 31 30 31 "010101"
XOR -----------------
00 01 01 00 1E 1F
^^ ^^ 2 mismatches
This solution even works if one of the strings is shorter than the other (since the XOR will result in 30, 31 or 2E for the extra characters).
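The hex table above can be reproduced with a short sketch that XORs two strings bytewise and prints each resulting byte. The helper name `xor_hex` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: XOR two strings bytewise and show each result byte in hex.
sub xor_hex {
    my ($x, $y) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $x ^ $y);
}

print xor_hex('0011..', '010101'), "\n";
```

Bytes 00, 1E and 1F are the acceptable results (same digit, or a "." against a digit), which is why the answer counts every byte outside that set as a mismatch.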
The following is a fast solution. It performs best when most strings don't match (since it stops checking as soon as a match is impossible).
sub make_matcher {
my $pat =
join '',
map { $_ eq '.' ? $_ : "[.\Q$_\E]" }
split //, $_[0];
return qr/^$pat\z/;
}
sub is_match { $_[0] =~ make_matcher($_[1]) }
say is_match($a, $b) ? "match" : "no match";
say is_match($b, $c) ? "match" : "no match";
Tested.
From my understanding, you are trying to compare the lengths of two strings. Basically, only the length of the strings needs to be compared, not the individual characters.
my $a=".0.0..0..1...0..........0...0......010.1..........";
my $b="10.0..0..1...0..........0...0......010.1..........";
my $c=".0.0..0..1...0..........0...0......010.1..........";
So the code could be:
if(length($a) == length($b))
{
print "match found";
}
else
{
print "No match";
}

How to get the position of all capture groups with quantifiers?

I have a small problem. I have a Perl regexp with multiple capture groups. Some of them have quantifiers (like '+'). If no quantifier is added, then the @- and @+ arrays are filled nicely with the matched positions of the capture groups, but if a quantifier is added, only the last match is recorded. I would like to have all of them.
An example:
my $s = 'xx1a2b3cyy';
my $re = qr/^xx(\d\w)+/;
So I'd like to know that matches are '1a', '2b', '3c' at 2, 4, 6.
Simple matching gives:
if ($s =~ $re) {
print "Match @-, @+\n";
for (my $i = 0; $i < @-; ++$i) {
print 'i: ', $i, " - '", substr($s, $-[$i], $+[$i] - $-[$i]), "\n";
}
}
Gives:
Match 0 6, 8 8
i: 0 - 'xx1a2b3c
i: 1 - '3c
So only the last capture group match is remembered.
My next simple try, which is not really what I want as the RE is different, was:
$re = qr/(\d\w)/;
my @s = ($s =~ /$re/g);
print "RE: '@s'\n";
while ($s =~ /$re/g) {
print "Match @-, @+\n";
for (my $i = 0; $i < @-; ++$i) {
print 'i: ', $i, " - '", substr($s, $-[$i], $+[$i] - $-[$i]), "\n";
}
}
gives:
RE: '1a 2b 3c'
Match 2 2, 4 4
i: 0 - '1a
i: 1 - '1a
Match 4 4, 6 6
i: 0 - '2b
i: 1 - '2b
Match 6 6, 8 8
i: 0 - '3c
i: 1 - '3c
But this is not what I want, as it would match a string like 'ZZ1aAA2bBB3cZZ'.
So somehow I have to combine the two. The best I could get:
$re = '^xx(?:\d\w)*?\G(\d\w)';
pos($s) = 2;
while ($s =~ m($re)g) {
print "Match pos: ", pos($s), ', G: ', $1, ", '@-', '@+'\n"
}
gives:
Match pos: 4, G: 1a, '0 2', '4 4'
Match pos: 6, G: 2b, '0 4', '6 6'
Match pos: 8, G: 3c, '0 6', '8 8'
This is almost nice, but for this I need to know the position of the first possible match. If it is not set properly it will not match anything. I can only determine the first position if I remove the non greedy part:
$re = '^xx(\d\w)';
if ($s =~ m($re)) {
print "Match: '@-', '@+'\n";
}
which gives:
Match: '0 2', '4 4'
So $-[1] gives the first position, but for this I have to modify the RE "manually".
If I add code execution into the pattern I almost get what I need:
use re 'eval';
$re = '^xx(\d\w)+(??{print "Code: <@-> <@+>\n"})';
$s =~ m($re) and print "Match\n";
gives:
Code: <0 6> <8 8>
Code: <0 4> <6 6>
Code: <0 2> <4 4>
For this I need to add the (?{ code }) part.
Does anybody know a simpler method (I mean not need to modify the original RE) to get all the possible matches of a capture group having a quantifier?
Thanks in advance!
There's no general solution; the regex engine simply doesn't store the necessary information. You're asking to use a regex as a parser, and that's a no-go.
sub extract {
for ($_[0]) {
/^ xx /xg
or return ();
my @matches;
push @matches, $1 while /\G (\d\w) /xg;
return @matches;
}
}
or
sub extract {
my ($pairs) = $_[0] =~ /^xx((?:\d\w)+)/
or return ();
return unpack('(a2)*', $pairs);
}
If you just want the positions, it's the same.
sub extract {
for ($_[0]) {
/^ xx /xg
or return ();
my @matches;
push @matches, $-[1] while /\G (\d\w) /xg;
return @matches;
}
}
or
sub extract {
$_[0] =~ /^xx((?:\d\w)+)/
or return ();
return map { $-[1] + ( $_ - 1 )*2 } 1..length($1)/2;
}
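The text and position variants above can be merged into one sketch that returns both the offset and the text of each pair, which is the output the question asks for ('1a', '2b', '3c' at 2, 4, 6). The sub name `pairs_with_offsets` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return [offset, text] for every \d\w pair after a leading "xx".
sub pairs_with_offsets {
    my ($s) = @_;
    $s =~ /^xx/g or return;    # anchor the prefix; /g in scalar context sets pos($s)
    my @found;
    # \G continues exactly where the previous match ended, so gaps end the loop.
    push @found, [ $-[1], $1 ] while $s =~ /\G(\d\w)/g;
    return @found;
}

print "$_->[1] at $_->[0]\n" for pairs_with_offsets('xx1a2b3cyy');
```

Because each pair is matched by its own iteration, `@-` is read once per pair instead of once for the whole quantified group.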
Even a non-general purpose solution is extremely hard using regular expressions. Say you had the following pattern:
xx(\d\w)+yy(\d\w)+zz
The correct solution would be:
use Storable qw( dclone );
my $s = "xx1a2byy3c4dZZ...xx5a6byy7c8dzz";
local our $rv;
if (
$s =~ /
(?{ [] })
xx
(?: (\d\w) (?{ my $r = dclone($^R); push @{ $r->[0] }, $^N; $r }) )+
yy
(?: (\d\w) (?{ my $r = dclone($^R); push @{ $r->[1] }, $^N; $r }) )+
zz
(?{ $rv = $^R; })
/x
) {
say "\$1: @{ $rv->[0] }";
say "\$2: @{ $rv->[1] }";
}
Output:
$1: 5a 6b
$2: 7c 8d
And something like
(zz(\d\w)+)+
would need
use Storable qw( dclone );
my $s = "zz1a2bzz3c4d";
local our $rv;
if (
$s =~ /
(?{ [] })
(?:
(?{ my $r = dclone($^R); push @$r, []; $r })
zz
(?: (\d\w) (?{ my $r = dclone($^R); push @{ $r->[-1] }, $^N; $r }) )+
)+
(?{ $rv = $^R; })
/x
) {
say "\$1: @$_" for @$rv;
}
Output:
$1: 1a 2b
$1: 3c 4d
I think I can give some explanation for the behavior you see:
In the first example I can see only one capture group. The quantifier allows it to be used multiple times, but it is one capture group nonetheless. So every new occurrence of a matching subpattern overwrites the value previously captured there. Even if the RE engine has already advanced past it, backtracking (for example in a more advanced pattern with branching and the like) could revisit the capture group and change it. And since @- and @+ hold the positions of the capture groups (as opposed to each occurrence of a subpattern match), this explains why only the last occurrence of the subpattern is recorded.
You could even play around with named subpatterns and %+/%- and would experience the same thing. It becomes more obvious with the already used (?{ }), at least for debugging purposes. But use re 'debug' is fine for shorter regexes / strings to match.
So be aware of the effects of backtracking to capture groups while matching is still in progress!
But if you don't have to care about backtracking, I can think of kind of a recipe to handle a capture group with a quantifier:
If your capture group is (bla) and your quantifier {0,3}, transform it into
(?:(bla)(?{ print $-[$#-],$+[$#-]."\n" })){0,3}.
You practically put the subpattern into another (non-capturing) group. If the RE engine is done with it, execute code regarding the last capture group matched so far. The quantifier outside the surrounding group is then responsible for the correct number of execution of the code fragment.
So your example becomes this:
use Data::Dumper;
my $s = 'xx1a2b3cyy';
my @submatches;
sub getem { push @submatches, [$-[$#-],$+[$#-]]; }
$s =~ m/^xx(?:(\d\w)(?{ getem() }))+/;
print Dumper(\@submatches);
This also works with multiple capture groups transformed this way:
my $s = 'xx1a2b3cyy4de5fg6hihhh2';
$s =~ m/^xx(?:(\d\w)(?{ getem() }))+yy(?:(\d\w{2})(?{ getem() }))+hh/;
You have to adjust the index used if your capture group contains more capture groups. That's why I prefer named capture groups.
Hope this helps.

Convert HH:MM into decimal hours

I am trying to convert some time stamps from a text file in the format HH:MM into a number format (for example, 12:30 -> 12,5; see note 1) using a Perl regex for easier processing in the future.
I am quite new in this topic so I am struggling with MM part and I don't know how to convert it. Currently I have something like this:
while ( <FILE> ) {
$line = $_;
$line =~ s/([0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])/$2,$1/g;
print $line;
}
1) In my locale, the comma (,) is used as the decimal separator where others use a dot (.). So 12,5 means 12 and a half, or 12.5.
I would not use a regular expression for converting. It can be done with pretty simple math. Parse out the times using your search pattern, and then pass it through something like this.
sub to_decimal {
my $time = shift;
my ($hours, $minutes) = split /:/, $time;
my $decimal = sprintf '%.02d', ($minutes / 60) * 100 ;
return join ',', $hours, $decimal;
}
If you run it in a loop like this:
for (qw(00 01 05 10 15 20 25 30 35 40 45 50 55 58 59)) {
say "$_ => " . to_decimal("12:$_");
}
You get:
00 => 12,00
01 => 12,01
05 => 12,08
10 => 12,16
15 => 12,25
20 => 12,33
25 => 12,41
30 => 12,50
35 => 12,58
40 => 12,66
45 => 12,75
50 => 12,83
55 => 12,91
58 => 12,96
59 => 12,98
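The table above shows that to_decimal truncates the fraction (12:10 becomes 12,16 rather than 12,17). If rounding is preferred, a variant might look like the following sketch. The name `to_decimal_rounded` is hypothetical, and unlike the original it does not preserve a leading zero in the hour.

```perl
use strict;
use warnings;

# Hypothetical variant of to_decimal that rounds to two places instead of truncating.
sub to_decimal_rounded {
    my ($time) = @_;
    my ($hours, $minutes) = split /:/, $time;
    my $value = sprintf '%.2f', $hours + $minutes / 60;
    $value =~ tr/./,/;    # use a comma as the decimal separator
    return $value;
}

print "$_ => ", to_decimal_rounded($_), "\n" for qw(12:10 12:30 12:40);
```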
perl -ple 's|(\d\d):(\d\d)|{$2/60 + $1}|eg'
Your locale should take care of the comma, I think.
This will achieve what you need. It uses an executable substitution to replace the time string with an expression in terms of the hour and minute values. tr/./,/r is used to convert all dots to commas.
use strict;
use warnings 'all';
while ( <DATA> ) {
s{ ( 0[0-9] | 1[0-9] | 2[0-3] ) : ( [0-5][0-9] ) }{
sprintf('%.2f', $1 + $2 / 60) =~ tr/./,/r
}gex;
print;
}
__DATA__
00:00
05:17
12:30
15:59
23:59
output
0,00
5,28
12,50
15,98
23,98
You only have to adjust the substitution to make it work:
$line =~ s/(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])/"$1," . substr( int($2)\/60, 2)/eg;
The e modifier causes the substituting content to be eval'ed, thus you can write the intended result as kind of a formula contingent on the capture group contents. Note that the substr call eliminates the leading 0, in the string representation of fractions.
If you need to limit your self to a given number of fraction digits, format the result of the division using sprintf:
$line =~ s/(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])/"$1," . substr( sprintf('%.2f', int($2)\/60), 2)/eg;
You could use egrep and awk:
$ echo 12:30 | egrep -o '(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])' | awk -F":" '{printf $1+$2/60}'
12.5
Assume your LC_NUMERIC is correct:
while (<FILE>) {
use locale ':not_characters';
my $line = $_;
$line =~ s!\b([01][0-9]|2[0-3]):([0-5][0-9])\b!$1 + $2/60!eg;
print $line;
}

Extract journal title from Genbank file using perl without using $1, $2 etc

This is a part of my input Genbank file:
LOCUS AC_000005 34125 bp DNA linear VRL 03-OCT-2005
DEFINITION Human adenovirus type 12, complete genome.
ACCESSION AC_000005 BK000405
VERSION AC_000005.1 GI:56160436
KEYWORDS .
SOURCE Human adenovirus type 12
ORGANISM Human adenovirus type 12
Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Mastadenovirus.
REFERENCE 1 (bases 1 to 34125)
AUTHORS Davison,A.J., Benko,M. and Harrach,B.
TITLE Genetic content and evolution of adenoviruses
JOURNAL J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)
PUBMED 14573794
And I want to extract the journal title for example J. Gen. Virol. (not including the issue number and pages)
This is my code, and it doesn't give any result, so I am wondering what is going wrong. I did use parentheses with $1, $2, etc., and that worked, but my tutor told me to try without that method and use substr instead.
foreach my $line (@lines) {
if ( $line =~ m/JOURNAL/g ) {
$journal_line = $line;
$character = substr( $line, $index, 2 );
if ( $character =~ m/\s\d/ ) {
print substr( $line, 12, $index - 13 );
print "\n";
}
$index++;
}
}
Another way to do this, is to take advantage of BioPerl, which can parse GenBank files:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $io=Bio::SeqIO->new(-file=>'AC_000005.1.gb', -format=>'genbank');
my $seq=$io->next_seq;
foreach my $annotation ($seq->annotation->get_Annotations('reference')) {
print $annotation->location . "\n";
}
If you run this script with AC_000005.1 saved in a file called AC_000005.1.gb, you get:
J. Gen. Virol. 84 (PT 11), 2895-2908 (2003)
J. Virol. 68 (1), 379-389 (1994)
J. Virol. 67 (2), 682-693 (1993)
J. Virol. 63 (8), 3535-3540 (1989)
Nucleic Acids Res. 9 (23), 6571-6589 (1981)
Submitted (03-MAY-2002) MRC Virology Unit, Church Street, Glasgow G11 5JR, U.K.
Rather than matching and using substr, it is much easier to use a single regex to capture the whole JOURNAL line and use brackets to capture the text representing the journal information:
foreach my $line (@lines) {
if ($line =~ /JOURNAL\s+(.+)/) {
print "Journal information: $1\n";
}
}
The regular expression looks for JOURNAL followed by one or more whitespace characters, and (.+) captures the rest of the characters in the line.
To get the text without using $1, I think you're trying to do something like this:
if ($line =~ /JOURNAL/) {
my $ix = length('JOURNAL');
# variable containing the journal name
my $j_name;
# while the journal name is not defined...
while (! $j_name) {
# starting with $ix = the length of the word JOURNAL, get character $ix in the string
if (substr($line, $ix, 1) =~ /\s/) {
# if it is whitespace, increase $ix by one
$ix++;
}
else {
# if it isn't whitespace, we've found the text!!!!!
$j_name = substr($line, $ix);
}
}
}
If you already know how many characters there are in the left-hand column, you can just do substr($line, 12) (or whatever) to retrieve a substring of $line starting at character 12:
foreach my $line (@lines) {
if ($line =~ /JOURNAL/) {
print "Journal information: " . substr($line, 12) . "\n";
}
}
You can combine the two techniques to eliminate the issue number and dates from the journal data:
if ($line =~ /JOURNAL/) {
my $j_name;
my $digit;
my $indent = 12; # the width of the left-hand column
my $ix = $indent; # we'll use this to track the characters in our loop
while (! $digit) {
# starting with $ix = the length of the indent,
# get character $ix in the string
if (substr($line, $ix, 1) =~ /\d/) {
# if it is a digit, we've found the number of the journal
# we can stop looping now. Whew!
$digit = $ix;
# set j_name
# get a substring of $line starting at $indent going to $digit
# (i.e. of length $digit - $indent)
$j_name = substr($line, $indent, $digit-$indent);
}
$ix++;
}
print "Journal information: $j_name\n";
}
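The character-by-character loop above can also be sketched with a single match whose position is read from `@-`, so substr does the extraction and `$1` is never used. The sub name `journal_title` is hypothetical, and the sketch assumes the value column starts at offset 12, as in the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: journal title from a GenBank JOURNAL line, without $1.
# Assumes the value column starts at offset 12.
sub journal_title {
    my ($line) = @_;
    return unless $line =~ /^\s*JOURNAL/;
    my $indent = 12;
    my $value  = substr($line, $indent);
    # $-[0] is the offset at which the match (whitespace followed by a digit) begins.
    return $value =~ /\s\d/ ? substr($value, 0, $-[0]) : $value;
}

# Build a sample line with the keyword padded out to 12 columns.
my $sample = sprintf '%-12s%s', '  JOURNAL', 'J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)';
print journal_title($sample), "\n";
```

Like the loop version, this would cut a journal name short if the name itself contained a digit after a space.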
I think it would have been easier just to get the data from the Pubmed API! ;)

perl regex match with numbers

I am trying to match the following (the below is in an array called @abc):
goo foo tool: 1.2.1 (a3 change: 234342 # 2014/02/19 14:20:27)
with
my $match = "goo foo tool: (\d+)\.(\d+)\.(\d+) \(a3 change: \d+ # #DATE# #TIME#\)";
and in my code,
foreach (@abc){
print "$_\n";
if ($_ =~ m/$match/){
print "$1\n";
} else {
print "not matched\n";
}
}
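I am not seeing why it is printing "not matched\n".
Does anyone else see why?
The #DATE# and #TIME# string constants aren't going to match the dates you have. Simply adjust the regex to actually match those values. Also build the pattern with qr// rather than a double-quoted string, because a double-quoted string drops the backslash from \d, \. and \( before the regex engine ever sees them:
my $match = qr{goo foo tool: (\d+)\.(\d+)\.(\d+) \(a3 change: \d+ \# \d+/\d+/\d+ \d+:\d+:\d+\)};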
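As a self-contained sketch of the corrected match, the following uses the sample line from the question and builds the pattern with qr//, so the backslashes in \d and \. reach the regex engine intact. In a double-quoted string, "\d" collapses to a plain "d", which is one reason a pattern assembled that way can fail. The variable names are illustrative.

```perl
use strict;
use warnings;

my $line = 'goo foo tool: 1.2.1 (a3 change: 234342 # 2014/02/19 14:20:27)';

# qr// keeps the backslashes, so \d, \. and \( mean what they say.
my $match = qr{goo foo tool: (\d+)\.(\d+)\.(\d+) \(a3 change: \d+ \# \d+/\d+/\d+ \d+:\d+:\d+\)};

# In list context the match returns the three captured version numbers.
my @version = $line =~ $match;
print @version ? "matched: @version\n" : "not matched\n";
```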