Regex only with starting match - regex

I am familiar with capturing multiple words with a definite match in Perl, e.g.:
$string="dasd 341312 ddas 42 fsd 5345";
@numbers=$string=~/(\d+)/g;
This returns an array of numbers in my string.
I have data in this form:
random
text
START=somenumber
lines
of
text
here
START=someothernumber
other
text
here
START=thirdnumber
more
text
...
How can I capture into an array all data blocks that begin with START= and continue (across multiple lines) up to, but not including, the next START=?
For example:
$array[1] = " START=someothernumber
other
text
here"

Perhaps the following will be helpful:
use strict;
use warnings;
use Data::Dumper;
my $data = do { local $/; <DATA> };
my @array = $data =~ /(START=.+?)(?=START=|\z)/gs;
print Dumper \@array;
__DATA__
random
text
START=somenumber
lines
of
text
here
START=someothernumber
other
text
here
START=thirdnumber
more
text
Output:
$VAR1 = [
'START=somenumber
lines
of
text
here
',
'START=someothernumber
other
text
here
',
'START=thirdnumber
more
text
'
];

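There is a simple way of doing this. Switch on multiline and global matching (the /m and /g modifiers) and remember that you have to handle the newlines yourself (this is the key to unwinding this). The loop below captures the text that follows each START=; note that the character class assumes each block contains only word characters and whitespace: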
while ($string =~ /^.*?START=([\w\s\n]*$)/mg) {
print $1,"\n";
}
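If the blocks should keep their START= marker, as in the question's example, one possible alternative is to split the string just before each marker. This is only a sketch: the sample string and the name @blocks are made up for illustration, and it assumes START= always appears at the beginning of a line.

```perl
use strict;
use warnings;

# Sample input modelled on the question; the numbers are made up.
my $string = join "\n",
    'random', 'text',
    'START=1', 'lines', 'here',
    'START=2', 'other', 'text',
    'START=3', 'more', '';

# Split just before every line that begins with START=. The lookahead
# consumes nothing, so each block keeps its own START= marker. The grep
# drops the leading text that precedes the first marker.
my @blocks = grep { /\ASTART=/ } split /^(?=START=)/m, $string;

print "[$_]\n" for @blocks;
```

Each element of @blocks still ends with its trailing newline; chomp the elements if that is not wanted.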

Related

Unable to match string between | in Perl

I have a text read from a fasta file and I am trying to read the accession number in Perl. But I am not getting an output. Here's the code:
use strict;
use warnings;
sub main {
my $file = "PXXXXX.fasta";
if(!open(FASTA, $file)) {
die "Could not find $file\n";
}
my $myLine = <FASTA>;
my $pat = "|";
my @Num = $myLine =~ /$pat(.*?)$pat/;
print($Num[0]);
close(FASTA);
}
main();
The content of the FASTA filehandle is:
sp|P27455|MOMP_CHLPN Major outer membrane porin OS=Chlamydia pneumoniae OX=83558 GN=ompA PE=2 SV=1
MKKLLKSALLSAAFAGSVGSLQALPVGNPSDPSLLIDGTIWEGAAGDPCDPCATWCDAIS
LRAGFYGDYVFDRILKVDAPKTFSMGAKPTGSAAANYTTAVDRPNPAYNKHLHDAEWFTN
AGFIALNIWDRFDVFCTLGASNGYIRGNSTAFNLVGLFGVKGTTVNANELPNVSLSNGVV
ELYTDTSFSWSVGARGALWECGCATLGAEFQYAQSKPKVEELNVICNVSQFSVNKPKGYK
GVAFPLPTDAGVATATGTKSATINYHEWQVGASLSYRLNSLVPYIGVQWSRATFDADNIR
IAQPKLPTAVLNLTAWNPSLLGNATALSTTDSFSDFMQIVSCQINKFKSRKACGVTVGAT
LVDADKWSLTAEARLINERAAHVSGQFRF
Any clue how to fix the code to return: P27455 ?
The pipe | has a special meaning in regular expressions (alternation), so you need to escape it. The easiest way to do that is by using \Q and \E.
$myLine =~ /\Q$pat\E(.*?)\Q$pat\E/;
Or you could use the quotemeta built-in.
my $pat = quotemeta "|";
my @Num = $myLine =~ /$pat(.*?)$pat/; # or use [^$pat]+
You can also skip the regular expression search and simply split the line. If you always want the second column, this will do just as well.
my (undef, $num) = split /\|/, $myLine;
It looks like you are trying to split the line on the | character, so use the split function.
my @Num = split /\|/, $myLine;
This splits $myLine on |. Note that you may have to change the index on @Num to get the correct item out of it.
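To make the difference concrete, here is a small self-contained sketch that uses the header line from the question as a literal string instead of reading the file. The variable names ($header, $unescaped, $accession, $by_split) are chosen for illustration only.

```perl
use strict;
use warnings;

my $header = 'sp|P27455|MOMP_CHLPN Major outer membrane porin OS=Chlamydia pneumoniae OX=83558 GN=ompA PE=2 SV=1';

# Unescaped, | means alternation: the empty alternative matches first,
# so the match succeeds without capturing anything useful.
my ($unescaped) = $header =~ /|(.*?)|/;

# Escaped, the pattern captures the text between the first two pipes.
my ($accession) = $header =~ /\|([^|]+)\|/;

# split on a literal pipe returns the same field as its second element.
my $by_split = (split /\|/, $header)[1];

print "unescaped: ", (defined $unescaped ? "'$unescaped'" : 'undef'), "\n";
print "escaped:   $accession\n";
print "split:     $by_split\n";
```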

Pattern match in perl

my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
my $name = "";
@name = ( $line =~ m/Name:([\w\s\_\,/g );
foreach (@name) {
print $name."\n";
}
I want to capture the text between Name: and ,Region wherever it occurs in the line. The main difficulty is that the name can be in any format:
Amanda_Marry_Rose
Amanda.Marry.Rose
Amanda Marry Rose
Amanda/Marry/Rose
I need help capturing this pattern every time it occurs in the line. For the line I provided, the output should be:
Amanda_Marry_Rose
Raghav.S.Thomas
Does anyone have any idea how to do this? I tried the line below, but it gives me the wrong output:
@name=($line=~m/Name:([\w\s\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~\´]+)\,/g);
Output
Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE
To capture between Name: and the first comma, use a negated character class:
/Name:([^,]+)/g
This matches one or more characters following Name: that are not commas:
while ($line =~ /Name:([^,]+)/g) {
print $1, "\n";
}
This is more efficient than a non-greedy quantifier, e.g.:
/Name:(.+?),/g
because it doesn't require backtracking.
Corrected regex:
my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
my @name = ($line =~ /Name\:([\w\s_.\/]+)\,/g);
foreach my $name (@name) {
print $name."\n";
}
What you have there is comma-separated data. How you should parse it depends a lot on your data. If it is full-fledged CSV data, the safest approach is to use a proper CSV parser, such as Text::CSV. If it is less strict data, you can get away with the lightweight parser Text::ParseWords, which also has the benefit of being a core module in Perl 5. If what you have here is rather basic, user-entered fields, then I would recommend split, simply because when you know the delimiter, it is easier and safer to define it than everything else inside it.
use strict;
use warnings;
use Data::Dumper;
my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
# Simple split
my @fields = split /,/, $line;
print Dumper map /^Name:(.*)/, @fields;
use Text::ParseWords;
print Dumper map /^Name:(.*)/, quotewords(',', 0, $line);
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
});
$csv->parse($line);
print Dumper map /^Name:(.*)/, $csv->fields;
Each of these options gives the same output, except for the one that uses Text::CSV, which also issues an undefined warning, quite correctly, because your data has a trailing comma (meaning an empty field at the end).
Each of these has different strengths and weaknesses. Text::CSV can choke on data that does not conform with the CSV format, and split cannot handle embedded commas, such as Name:"Doe, John",....
The regex we use to extract the names simply captures the rest of each field that begins with Name:. This also allows you to perform sanity checks on the field names, for example issuing a warning if you suddenly find a field called Doe;Name:
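The embedded-comma weakness mentioned above can be seen in a short sketch. The record below is hypothetical (the question's data has no quoted names); it only illustrates how split and quotewords from the core module Text::ParseWords treat the same input differently.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Hypothetical record: the first name is quoted and contains a comma.
my $line = 'Name:"Doe, John",Region:US,Name:Raghav.S.Thomas,Region:UAE';

# split knows nothing about quotes, so it cuts the quoted name in two.
my @naive = split /,/, $line;

# quotewords treats the quoted part as one unit; a false second argument
# tells it to strip the quotes from the returned fields.
my @quoted = quotewords(',', 0, $line);

print "split:      $_\n" for grep { /^Name:/ } @naive;
print "quotewords: $_\n" for grep { /^Name:/ } @quoted;
```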
The simple way is to look for all sequences of non-comma characters after every instance of Name: in the string.
use strict;
use warnings;
my $line = 'Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,';
my @names = $line =~ /Name:([^,]+)/g;
print "$_\n" for @names;
output
Amanda_Marry_Rose
Raghav.S.Thomas
However, it may well be useful to parse the data into an array of hashes so that related fields are gathered together.
use strict;
use warnings;
my $line = 'Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,';
my %info;
my @persons;
while ( $line =~ / ([a-z]+) : ([^:,]+) /gix ) {
my ($key, $val) = (lc $1, $2);
if ($info{$key}) {
push @persons, { %info };
%info = ();
}
$info{$key} = $val;
}
push @persons, { %info };
use Data::Dump;
dd \@persons;
print "\nNames:\n";
print "$_\n" for map $_->{name}, @persons;
output
[
{
cardtype => "DebitCard",
host => "USE",
name => "Amanda_Marry_Rose",
product => "Satin",
region => "US",
},
{
name => "Raghav.S.Thomas",
region => "UAE",
},
]
Names:
Amanda_Marry_Rose
Raghav.S.Thomas

need regexp to help me extract data from within double-quotes

I have sought the answer to this question here in stackoverflow but can't get acceptable results. (Sorry!)
I have a data file that looks like this:
share "SHARE1" "/path/to/some/share" umask=022 maxusr=4294967295 netbios=SOMECIFSHOST
share "SHARE2" "/path/to/a/different/share with spaces in the dir name" umask=022 maxusr=4294967295 netbios=ANOTHERCIFSHOST
... from which I need to extract the values inside double-quotes. In other words, I'd like to get something like this:
share,SHARE1,/path/to/some/share/,umask=022,maxusr=4294967295,netbios=SOMECIFSHOST
share,SHARE2,/path/to/a/different/share with spaces in the dir name,umask=022,maxusr=4294967295,netbios=ANOTHERCIFSHOST
The tricky part I've found is in trying to extract the data inside quotes. Suggestions made here have not worked for me, so I'm guessing I'm just doing it wrong. I also need to extract BOTH values from each line's double-quoted strings, not just the first one; I figure the remaining stuff could easily be parsed by splitting on whitespace.
In case it's relevant, I'm running this on a RHEL box and I need to pull it out with a regexp using Perl.
Thx!
One option is to treat your data as a CSV file and use Text::CSV_XS to parse it, setting the separator character to a space:
use strict;
use warnings;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new( { binary => 1, sep_char => ' ' } )
or die "Cannot use CSV: " . Text::CSV_XS->error_diag();
open my $fh, "<:encoding(utf8)", "data.txt" or die "data.txt: $!";
while ( my $row = $csv->getline($fh) ) {
print join ',', @$row;
print "\n";
}
$csv->eof or $csv->error_diag();
close $fh;
Output on your dataset:
share,SHARE1,/path/to/some/share,umask=022,maxusr=4294967295,netbios=SOMECIFSHOST
share,SHARE2,/path/to/a/different/share with spaces in the dir name,umask=022,maxusr=4294967295,netbios=ANOTHERCIFSHOST
Hope this helps!
You can do this:
if literal quotes inside quotes are escaped with a backslash: share "SHA \" RE1" ...
$str =~ s/(?|"((?>[^"\\]++|\\{2}|\\.)*)"|()) /$1,/gs;
if literal quotes are escaped with another quote: share "SHA "" RE1" ...
$str =~ s/(?|"((?>[^"]++|"")*)"|()) /$1,/g;
if you are absolutely sure that there are no escaped quotes between quotes anywhere in your data:
$str =~ s/(?|"([^"]*)"|()) /$1,/g;
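For the simplest case above (no escaped quotes anywhere), the same idea can also be written as a small tokenizer instead of a substitution. This is a sketch under that assumption; $line is a shortened version of the question's second record and @fields is a name chosen for illustration.

```perl
use strict;
use warnings;

# Shortened version of the second record from the question.
my $line = 'share "SHARE2" "/path/to/a/different/share with spaces in the dir name" umask=022';

my @fields;
# Each token is either a double-quoted string (captured without its
# quotes in $1) or a run of non-whitespace characters (captured in $2).
while ($line =~ /"([^"]*)"|(\S+)/g) {
    push @fields, defined $1 ? $1 : $2;
}

print join(',', @fields), "\n";
```

Because the quoted alternative is tried first, spaces inside the quotes stay part of the field.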
Try this.
[^\" ]*
It selects every char but the quotation marks and the spaces.
my $str = 'share "SHARE1" "/path/to/some/share" umask=022 maxusr=4294967295 netbios=SOMECIFSHOST';
$str =~ s/"?\s*"\s*/,/g;
print $str;
This regex replaces like below:
"space" = ,
"space = ,
space" = ,
"" = ,
I'm not sure I understand the question: you say one thing in the text, but the example shows something different. Anyway, try this:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
chomp;
my @matches = $_ =~ /"(.*?)"/g;
print "@matches\n";
}
__DATA__
share "SHARE1" "/path/to/some/share" umask=022 maxusr=4294967295 netbios=SOMECIFSHOST
share "SHARE2" "/path/to/a/different/share with spaces in the dir name" umask=022 maxusr=4294967295 netbios=ANOTHERCIFSHOST
output:
$ ./p.pl
SHARE1 /path/to/some/share
SHARE2 /path/to/a/different/share with spaces in the dir name
#!/usr/bin/env perl
while(<>){
my @a = split /\s+\"|\"\s+/ , $_; # split on any spaces + ", or any " + spaces
for my $item ( @a ) {
if ( $item =~ /\"/ ) { # if there's a quote, remove
$item =~ s/\"//g;
} elsif ( $item !~ /\"/ ){ # else just replace spaces with comma
$item =~ s/\s+/,/g;
}
}
print join(",", @a);
print "\n";
}
output:
share,SHARE1,/path/to/some/share,umask=022,maxusr=4294967295,netbios=SOMECIFSHOST,
share,SHARE2,/path/to/a/different/share with spaces in the dir name,umask=022,maxusr=4294967295,netbios=ANOTHERCIFSHOST,
I'll leave it to you to remove the trailing comma :)

Multiline match with irregular new line

I have a text file with many entries like this:
[...]
Wind: 83,476,224
Solution: (category,runs)~
0.235,6.52312667,~
0.98962,14.33858333,~
sdasd,cccc,~
0.996052905,sdsd
EnterValues: 656,136,1
Speed: 48,32
State: 2,102,83,476,224
[...]
From the part above I would like to extract:
Solution: (category,runs)~
0.235,6.52312667,~
0.98962,14.33858333,~
sdasd,cccc,~
0.996052905,sdsd
It would be simple if EnterValues: came after every Solution:, but unfortunately it doesn't. Sometimes it is Speed, sometimes something different. I don't know how to construct the end of the regex (I assume it should be something like this: Solution:.*?(?<!~)\n).
My file has \n as a delimiter of new line.
What you need is a "record separator" that has the functionality of a regex. Unfortunately, you cannot use $/ for this, because it cannot be a regex. You can, however, read the entire file into one string and split that string using a regex:
use strict;
use warnings;
use Data::Dumper;
my $str = do {
local $/; # disable input record separator
<DATA>; # slurp the file
};
my @lines = split /^(?=\pL+:)/m, $str; # lines begin with letters + colon
print Dumper \@lines;
__DATA__
Wind: 83,476,224
Solution: (category,runs)~
0.235,6.52312667,~
0.98962,14.33858333,~
sdasd,cccc,~
0.996052905,sdsd
EnterValues: 656,136,1
Speed: 48,32
State: 2,102,83,476,224
Output:
$VAR1 = [
'Wind: 83,476,224
',
'Solution: (category,runs)~
0.235,6.52312667,~
0.98962,14.33858333,~
sdasd,cccc,~
0.996052905,sdsd
',
'EnterValues: 656,136,1
',
'Speed: 48,32
',
'State: 2,102,83,476,224
'
];
You will do some sort of post processing on these variables, I assume, but I will leave that to you. One way to go from here is to split the values on newline.
As I see it, you first read the whole file into memory, which is not good practice. Try using the flip-flop operator:
while ( <$fh> ) {
if ( /Solution:/ ... !/~$/ ) {
print $_; # $_ still ends with its newline, so none is added
}
}
I can't test it right now, but I think this should work fine.
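A self-contained way to try the flip-flop idea is to read from an in-memory filehandle. The sketch below is an illustration rather than a tested drop-in: the sample text is shortened from the question, and $text, $fh and @solution are names made up here. It assumes every continuation line of a Solution: block ends with ~ except the last one.

```perl
use strict;
use warnings;

# Sample records shortened from the question.
my $text = <<'END';
Wind: 83,476,224
Solution: (category,runs)~
0.235,6.52312667,~
sdasd,cccc,~
0.996052905,sdsd
EnterValues: 656,136,1
Speed: 48,32
END

# Open a read-only filehandle on the string instead of a real file.
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

my @solution;
while (my $line = <$fh>) {
    # The range turns on at the Solution: line and turns off after the
    # first later line that does not end with ~. With three dots the
    # right-hand test is not applied to the line that opened the range.
    push @solution, $line if $line =~ /^Solution:/ ... $line !~ /~$/;
}
close $fh;

print @solution;
```

Only one line is held at a time while reading, so this approach also suits files that are too large to slurp.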
You can match from Solution: up to the next word followed by a colon:
my ($solution) = $text =~ /(Solution:.*?) \w+: /xs;

Print line of pattern match in Perl regex

I am looking for a keyword in a multiline input using a regex like this,
if($input =~ /line/mi)
{
# further processing
}
The data in the input variable could be like this,
this is
multi line text
to be matched
using perl
The code works and matches the keyword line correctly. However, I would also like to obtain the line where the pattern was matched - "multi line text" - and store it into a variable for further processing. How do I go about this?
Thanks for the help.
You can grep out the lines into an array, which will then also serve as your conditional:
my @match = grep /line/mi, split /\n/, $input;
if (@match) {
# ... processing
}
TLP's answer is better, but you can also do:
if ($input =~ /([^\n]*line[^\n]*)/i) {
$line = $1;
}
I'd first check whether the multiline string contains a match and, if it does, split it into lines and look for the index of each matching line (starting at 0!):
#!/usr/bin/perl
use strict;
use warnings;
my $data=<<END;
this is line
multi line text
to be matched
using perl
END
if ($data =~ /line/mi){
my @lines = split(/\r?\n/,$data);
for (0..$#lines){
if ($lines[$_] =~ /line/){
print "LineNr of Match: " . $_ . "\n";
}
}
}
Did you try this?
This works for me. $1 holds whatever the part of the regex inside ( and ) captured.
This assumes there is a match in only one of the lines. If several lines match, only the first one will be captured.
if($var=~/(.*line.*)/)
{
print $1
}
If you want to capture all the lines that contain the string line, use the following:
my @a;
push @a,$var=~m/(.*line.*)/g;
print "@a";