Need regexp to help me extract data from within double-quotes

I have sought the answer to this question here in stackoverflow but can't get acceptable results. (Sorry!)
I have a data file that looks like this:
share "SHARE1" "/path/to/some/share" umask=022 maxusr=4294967295 netbios=SOMECIFSHOST
share "SHARE2" "/path/to/a/different/share with spaces in the dir name" umask=022 maxusr=4294967295 netbios=ANOTHERCIFSHOST
... from which I need to extract the values inside double-quotes. In other words, I'd like to get something like this:
share,SHARE1,/path/to/some/share/,umask=022,maxusr=4294967295,netbios=SOMECIFSHOST
share,SHARE2,/path/to/a/different/share with spaces in the dir name,umask=022,maxusr=4294967295,netbios=ANOTHERCIFSHOST
The tricky part I've found is in trying to extract the data inside quotes. Suggestions made here have not worked for me, so I'm guessing I'm just doing it wrong. I also need to extract BOTH values from each line's double-quoted strings, not just the first one; I figure the remaining stuff could easily be parsed by splitting on whitespace.
In case it's relevant, I'm running this on a RHEL box and I need to pull it out with a regexp using Perl.
Thx!

One option is to treat your data as a CSV file and use Text::CSV_XS to parse it, setting the separator character to a space:
use strict;
use warnings;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new( { binary => 1, sep_char => ' ' } )
or die "Cannot use CSV: " . Text::CSV_XS->error_diag();
open my $fh, "<:encoding(utf8)", "data.txt" or die "data.txt: $!";
while ( my $row = $csv->getline($fh) ) {
print join ',', @$row;
print "\n";
}
$csv->eof or $csv->error_diag();
close $fh;
Output on your dataset:
share,SHARE1,/path/to/some/share,umask=022,maxusr=4294967295,netbios=SOMECIFSHOST
share,SHARE2,/path/to/a/different/share with spaces in the dir name,umask=022,maxusr=4294967295,netbios=ANOTHERCIFSHOST
Hope this helps!

You can do this:
if literal quotes inside quotes are escaped with a backslash: share "SHA \" RE1" ...
$str =~ s/(?|"((?>[^"\\]++|\\{2}|\\.)*)"|()) /$1,/gs;
if literal quotes are escaped with another quote: share "SHA "" RE1" ...
$str =~ s/(?|"((?>[^"]++|"")*)"|()) /$1,/g;
if you are absolutely sure that there are no escaped quotes between quotes anywhere in your data:
$str =~ s/(?|"([^"]*)"|()) /$1,/g;
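The same "quoted string or bare word" alternation can also be used to collect the fields into a list instead of substituting in place. This is only a minimal sketch, assuming no escaped quotes inside the quoted values; quoted_fields is a made-up helper name, not something from a module.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line into fields, where a field is either
# a double-quoted string (no embedded quotes assumed) or a run of non-spaces.
sub quoted_fields {
    my ($line) = @_;
    my @fields;
    # \G continues from where the previous match ended, so nothing is skipped.
    while ( $line =~ /\G\s*(?:"([^"]*)"|(\S+))/gc ) {
        push @fields, defined $1 ? $1 : $2;
    }
    return @fields;
}

my $line = 'share "SHARE2" "/path/to/a/different/share with spaces in the dir name" umask=022 maxusr=4294967295 netbios=ANOTHERCIFSHOST';
print join( ',', quoted_fields($line) ), "\n";
```

Since the quotes are stripped while matching, the result can go straight into join without a second clean-up pass.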

Try this.
[^\" ]*
It selects every char but the quotation marks and the spaces.

my $str = 'share "SHARE1" "/path/to/some/share" umask=022 maxusr=4294967295 netbios=SOMECIFSHOST';
$str =~ s/"?\s*"\s*/,/g;
print $str;
This regex replaces like below:
"space" = ,
"space = ,
space" = ,
"" = ,

Not sure if I understand the question; you say one thing in the text but the example says something different. Anyway, try this:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
chomp;
my @matches = $_ =~ /"(.*?)"/g;
print "@matches\n";
}
__DATA__
share "SHARE1" "/path/to/some/share" umask=022 maxusr=4294967295 netbios=SOMECIFSHOST
share "SHARE2" "/path/to/a/different/share with spaces in the dir name" umask=022 maxusr=4294967295 netbios=ANOTHERCIFSHOST
output:
$ ./p.pl
SHARE1 /path/to/some/share
SHARE2 /path/to/a/different/share with spaces in the dir name

#!/usr/bin/env perl
while(<>){
my @a = split /\s+\"|\"\s+/ , $_; # split on any spaces + ", or any " + spaces
for my $item ( @a ) {
if ( $item =~ /\"/ ) { # if there's a quote, remove
$item =~ s/\"//g;
} elsif ( $item !~ /\"/ ){ # else just replace spaces with comma
$item =~ s/\s+/,/g;
}
}
print join(",", @a);
print "\n";
}
output:
share,SHARE1,/path/to/some/share,umask=022,maxusr=4294967295,netbios=SOMECIFSHOST,
share,SHARE2,/path/to/a/different/share with spaces in the dir name,umask=022,maxusr=4294967295,netbios=ANOTHERCIFSHOST,
Leave it to you to remove the last comma :)
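The trailing comma appears because each line still ends in a newline, which the final s/\s+/,/g turns into a comma. Here is a rough sketch of the same approach with a chomp added first; to_csv_line is just an illustrative name, and the split pattern is the one from the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: same split-based approach, but the newline is
# removed first so the last item does not gain a trailing comma.
sub to_csv_line {
    my ($line) = @_;
    chomp $line;
    my @a = split /\s+"|"\s+/, $line;
    for my $item (@a) {
        if ( $item =~ /"/ ) {
            $item =~ s/"//g;       # quoted value: drop the leftover quote
        }
        else {
            $item =~ s/\s+/,/g;    # unquoted run: spaces become commas
        }
    }
    return join ',', @a;
}

print to_csv_line(qq{share "SHARE1" "/path/to/some/share" umask=022 maxusr=4294967295 netbios=SOMECIFSHOST\n}), "\n";
```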

Related

Unable to match string between | in Perl

I have a text read from a fasta file and I am trying to read the accession number in Perl. But I am not getting an output. Here's the code:
use strict;
use warnings;
sub main {
my $file = "PXXXXX.fasta";
if(!open(FASTA, $file)) {
die "Could not find $file\n";
}
my $myLine = <FASTA>;
my $pat = "|";
my @Num = $myLine =~ /$pat(.*?)$pat/;
print($Num[0]);
close(FASTA);
}
main();
The content of the FASTA filehandle is:
sp|P27455|MOMP_CHLPN Major outer membrane porin OS=Chlamydia pneumoniae OX=83558 GN=ompA PE=2 SV=1
MKKLLKSALLSAAFAGSVGSLQALPVGNPSDPSLLIDGTIWEGAAGDPCDPCATWCDAIS
LRAGFYGDYVFDRILKVDAPKTFSMGAKPTGSAAANYTTAVDRPNPAYNKHLHDAEWFTN
AGFIALNIWDRFDVFCTLGASNGYIRGNSTAFNLVGLFGVKGTTVNANELPNVSLSNGVV
ELYTDTSFSWSVGARGALWECGCATLGAEFQYAQSKPKVEELNVICNVSQFSVNKPKGYK
GVAFPLPTDAGVATATGTKSATINYHEWQVGASLSYRLNSLVPYIGVQWSRATFDADNIR
IAQPKLPTAVLNLTAWNPSLLGNATALSTTDSFSDFMQIVSCQINKFKSRKACGVTVGAT
LVDADKWSLTAEARLINERAAHVSGQFRF
Any clue how to fix the code to return: P27455 ?
The pipe | holds a special meaning in regular expressions. You need to escape it. The easiest way to do that is by using \Q and \E.
$myLine =~ /\Q$pat\E(.*?)\Q$pat\E/;
Or you could use the quotemeta built-in.
my $pat = quotemeta "|";
my @Num = $myLine =~ /$pat(.*?)$pat/; # or use [^$pat]+
You can also just not use a regular expression search and simply split the line. If you always want the second column, this will do just as well.
my (undef, $num) = split /\|/, $myLine;
Looks like you are trying to split the line on the | character, so use the split function.
my @Num = split /\|/, $myLine;
This splits $myLine on |. Note that you may have to change the index on @Num to get the correct item out of it.
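To see both suggestions side by side, here is a small sketch run against the header line from the question. It assumes the accession is always the second pipe-separated field; accession is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: take the second |-separated field of a header line.
sub accession {
    my ($header) = @_;
    my ( undef, $acc ) = split /\|/, $header;
    return $acc;
}

my $header = 'sp|P27455|MOMP_CHLPN Major outer membrane porin OS=Chlamydia pneumoniae OX=83558 GN=ompA PE=2 SV=1';

# Variant 1: split on a literal pipe.
print accession($header), "\n";

# Variant 2: regex match with the delimiter quoted by \Q...\E.
my $pat = '|';
my ($num) = $header =~ /\Q$pat\E(.*?)\Q$pat\E/;
print "$num\n";
```

Both print P27455 for this header.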

Perl regex substitution

I need to match exactly "", but not ,"",
for example in this string
"abc","123","def","","asd","876"",345
I need to substitute the "", following the "876" but leave the "" after "def" alone.
The regex I have right now is
$line =~ s/[^,]"",/",/g
However this substitutes the 6 from "876".
Use a group replacement:
$line =~ s/([^,])"",/$1",/g
Or a lookbehind:
$line =~ s/(?<!,)"",/",/g
Having said that, "" is a CSV quoted quote, it can appear inside a string. For example, this is valid: """abc""". To avoid breaking that, also exclude " from the lookbehind:
$line =~ s/(?<![,"])"",/",/g
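A quick way to check the last lookbehind against the sample string is sketched below. fix_stray_quote is only an illustrative wrapper name.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the lookbehind substitution shown above:
# replace "", with ", unless the pair is preceded by a comma or a quote.
sub fix_stray_quote {
    my ($text) = @_;
    $text =~ s/(?<![,"])"",/",/g;
    return $text;
}

my $line = '"abc","123","def","","asd","876"",345';
print fix_stray_quote($line), "\n";
```

The empty field after "def" is left alone because it is preceded by a comma, while the stray quote after 876 is removed.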
You are trying to fix a broken CSV file.
Doing it with a pattern match is not the best solution.
You should try fixing it by using Text::CSV_XS module with allow_loose_quotes => 1 option.
This converts the broken column to an empty column and double-quotes the last column. Not quite sure if that is what you're looking for, but there you go.
use strict ;
my $line = '"abc","123","def","","asd","876"",345' ;
$line =~ s/\,{0,1}\"\"\,/\,\"\"\,/g ;
my @foo = split(/\,/,$line) ;
$foo[$#foo] =~ s/^|$/\"/g ;
$line = join ",", @foo ;
print "\n$line\n" ;

Load regex from file and match groups with it in Perl

I have a file containing regular expressions, e.g.:
City of (.*)
(.*) State
Now I want to read these (line by line), match them against a string, and print out the extraction (matched group). For example: The string City of Berlin should match with the first expression City of (.*) from the file, after that Berlin should be extracted.
This is what I've got so far:
use warnings;
use strict;
my @pattern;
open(FILE, "<pattern.txt"); # open the file described above
while (my $line = <FILE>) {
push @pattern, $line; # store it inside the @pattern variable
}
close(FILE);
my $exampleString = "City of Berlin"; # line that should be matched and
# Berlin should be extracted
foreach my $p (@pattern) { # try each pattern
if (my ($match) = $exampleString =~ /$p/) {
print "$match";
}
}
I want Berlin to be printed.
What happens with the regex inside the foreach loop?
Is it not compiled? Why?
Is there even a better way to do this?
Your patterns contain a newline character which you need to chomp:
while (my $line = <FILE>) {
chomp $line;
push @pattern, $line;
}
First off, the missing chomp is the root of your problem.
Secondly, your code is also inefficient. Rather than checking patterns in a foreach loop, consider instead compiling a regex in advance:
#!/usr/bin/env perl
use strict;
use warnings;
# open ( my $pattern_fh, '<', "pattern.txt" ) or die $!;
my @patterns = <DATA>;
chomp(@patterns);
my $regex = join( '|', @patterns );
$regex = qr/(?:$regex)/;
print "Using regex of: $regex\n";
my $example_str = 'City of Berlin';
if ( my ($match) = $example_str =~ m/$regex/ ) {
print "Matched: $match\n";
}
Why is this better? Well, because it scales more efficiently. With your original algorithm - if I have 100 lines in the patterns file, and 100 lines to check as example str, it means making 10,000 comparisons.
With a single regex, you're making one comparison on each line.
Note - normally you'd use quotemeta when reading in regular expressions, which will escape 'meta' characters. We don't want to do this in this case.
If you're looking for something even more concise, you can use map to avoid needing an intermediate array:
my $regex = join( '|', map { chomp; $_ } <$pattern_fh> );
$regex = qr/(?:$regex)/;
print "Using regex of: $regex\n";
my $example_str = 'City of Berlin';
if ( my ($match) = $example_str =~ m/$regex/ ) {
print "Matched: $match\n";
}
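One thing to watch with the joined alternation: each alternative keeps its own capture group number, so my ($match) only receives $1, and a match through the second pattern would leave it undefined. A rough sketch of one way around that is to compile each pattern separately with qr// and return the first capture that matches. first_capture is a hypothetical name, the patterns are inlined rather than read from pattern.txt, and the 'Ohio State' input is my own test string.

```perl
use strict;
use warnings;

# Hypothetical helper: compile each pattern once and return the first
# capture from the first pattern that matches, or nothing if none does.
sub first_capture {
    my ( $string, @patterns ) = @_;
    for my $re ( map { qr/$_/ } @patterns ) {
        if ( my ($match) = $string =~ $re ) {
            return $match;
        }
    }
    return;
}

# Inlined here; reading pattern.txt and chomping each line gives the same list.
my @patterns = ( 'City of (.*)', '(.*) State' );

print first_capture( 'City of Berlin', @patterns ), "\n";
print first_capture( 'Ohio State',     @patterns ), "\n";
```

This gives up the single-regex scan, so it is a trade-off rather than a strict improvement; a branch-reset group, (?|...), is another option if you want to keep one combined regex.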

perl regex replace only part of string

I need to write a perl regex to convert
site.company.com => dc=site,dc=company,dc=com
Unfortunately I am not able to remove the trailing "," using the regex I came up with below. I could of course remove the trailing "," in the next statement but would prefer that to be handled as a part of the regex.
$data="site.company.com";
$data =~ s/([^.]+)\.?/dc=$1,/g;
print $data;
This above code prints:
dc=site,dc=company,dc=com,
Thanks in advance.
When handling urls it may be a good idea to use a module such as URI. However, I do not think it applies in this case.
This task is most easily solved with a split and join, I think:
my $url = "site.company.com";
my $string = join ",", # join the parts with comma
map "dc=$_", # add the dc= to each part
split /\./, $url; # split into parts
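Wrapped up as a function, the split/map/join approach might look like the sketch below. to_dc is just an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a dotted name into comma-separated dc= parts.
sub to_dc {
    my ($host) = @_;
    return join ',', map { "dc=$_" } split /\./, $host;
}

print to_dc('site.company.com'), "\n";
```

Because join only puts commas between elements, there is no trailing comma to strip afterwards.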
$data =~ s/\./,dc=/g; $data =~ s/^/dc=/;
tested below:
> echo "site.company.com" | perl -pe 's/\./,dc=/g&&s/^/dc=/g'
dc=site,dc=company,dc=com
Try doing this :
my $x = "site.company.com";
my @a = split /\./, $x;
map { s/^/dc=/; } @a;
print join",", @a;
just put like this,
$data="site.company.com";
$data =~ s/,dc=$1/dc=$1/g; #(or) $data =~ s/,dc/dc/g;
print $data;
I'm going to try the /ge route:
$data =~ s{^|(\.)}{
( $1 && ',' ) . 'dc='
}ge;
e = evaluate replacement as Perl code.
So, it says given the start of the string, or a dot, make the following replacement. If it captured a period, then emit a ','. Regardless of this result, insert 'dc='.
Note, that I like to use a brace style of delimiter on all my evaluated replacements.
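For reference, here is a sketch of the same /ge idea that tests the capture with defined, so the replacement never concatenates an undefined $1. dotted_to_dc is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: at the start of the string insert 'dc=';
# at each dot insert ',dc=' in place of the dot.
sub dotted_to_dc {
    my ($name) = @_;
    $name =~ s{^|(\.)}{ ( defined $1 ? ',' : '' ) . 'dc=' }ge;
    return $name;
}

print dotted_to_dc('site.company.com'), "\n";
```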

How do I split a string into an array by comma but ignore commas inside double quotes?

I have a line:
$string = 'Paul,12,"soccer,baseball,hockey",white';
I am trying to split this into @array that has 4 values so
print $array[2];
Gives
soccer,baseball,hockey
How do I do this? Help!
Just use Text::CSV. As you can see from the source, getting CSV parsing right is quite complicated:
sub _make_regexp_split_column {
my ($esc, $quot, $sep) = @_;
if ( $quot eq '' ) {
return qr/([^\Q$sep\E]*)\Q$sep\E/s;
}
qr/(
\Q$quot\E
[^\Q$quot$esc\E]*(?:\Q$esc\E[\Q$quot$esc\E0][^\Q$quot$esc\E]*)*
\Q$quot\E
| # or
[^\Q$sep\E]*
)
\Q$sep\E
/xs;
}
The standard module Text::ParseWords will do this as well.
use Text::ParseWords;
my @array = parse_line(q{,}, 0, $string);
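A complete script for the Text::ParseWords route could look like this sketch. The module ships with Perl, so nothing needs installing; the middle argument 0 tells parse_line not to keep the quotes.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $string = 'Paul,12,"soccer,baseball,hockey",white';

# Delimiter, keep-quotes flag (0 = strip them), then the line to parse.
my @array = parse_line( ',', 0, $string );

print scalar(@array), "\n";
print "$array[2]\n";
```

This prints 4 and then soccer,baseball,hockey.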
In response to how to do it with Text::CSV(_PP). Here is a quick one.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_PP;
my $parser = Text::CSV_PP->new();
my $string = "Paul,12,\"soccer,baseball,hockey\",white";
$parser->parse($string);
my @fields = $parser->fields();
print "$_\n" for @fields;
Normally one would install Text::CSV or Text::CSV_PP through the cpan utility.
To work around your not being able to install modules, I suggest you use the 'pure Perl' implementation so that you can 'install' it. The above example would work assuming you copied the text of Text::CSV_PP source into a file named CSV_PP.pm in a folder called Text created in the same directory as your script. You could also put it in some other location and use the use lib 'directory' method as discussed previously. See here and here to see other ways to get around install restriction using CPAN modules.
Use this regex: m/("[^"]+"|[^,]+)(?:,\s*)?/g;
The above regular expression globally matches any word that starts with a comma or a quote and then matches the remaining word/words based on the starting character (comma or quote).
Here is a sample code and the corresponding output.
my $string = "Word1, Word2, \"Commas, inbetween\", Word3, \"Word4Quoted\", \"Again, commas, inbetween\"";
my @arglist = $string =~ m/("[^"]+"|[^,]+)(?:,\s*)?/g;
map { print $_ , "\n"} @arglist;
Here is the output:
Word1
Word2
"Commas, inbetween"
Word3
"Word4Quoted"
"Again, commas, inbetween"
try this
@array=($string =~ /^([^,]*)[,]([^,]*)[,]["]([^"]*)["][,]([^']*)$/);
The array will contain the output you expect.
use strict;
use warnings;
#use Data::Dumper;
my $string = qq/Paul,12,"soccer,baseball,hockey",white/;
#split string into three parts
my ($st1, $st2, $st3) = split(/,"|",/, $string);
#output: st1:Paul,12 st2:soccer,baseball,hockey st3:white
#split $st1 into two parts
my ($st4, $st5) = split(/,/,$st1);
#push records into array
push (my @test,$st4, $st5,$st2, $st3 ) ;
#print Dumper \@test;
print "$test[2]\n";
output:
soccer,baseball,hockey
#$VAR1 = [
# 'Paul',
# '12',
# 'soccer,baseball,hockey',
# 'white'
# ];
$string = "Paul,12,\"soccer,baseball,hockey\",white";
1 while($string =~ s#"(.*?),(.*?)"#\"$1aaa$2\"#g);
@array = map {$_ =~ s/aaa/ /g; $_ =~ s/\"//g; $_} split(/,/, $string);
$" = "\n";
print "$array[2]";