Perl regex: How do you match multiple words in perl? - regex

Im writing a small script which is supposed to match all strings within another file (words in between "" and '' including the "" and '' symbols as well).
Below is the regex statement i am currently using, however it only produces results for '(.*)' and not "(.*)"
my #string_matches = ($file_string =~ /'(.*)' | "(.*)"/g);
print "\n#string_matches";
Also how would I be able to include the "" or '' symbols in the results as well?(print out "string" instead of just string)
I've tried searching online but couldnt find any material on this
$file_string is basically a string version of an entire file.

use this : '(.*?)' | "(.*?)"
i guess the greedy operator is selecting your string upto the last '. make it lazy
IMHO
use this regex :
['"][^'"]*?['"]
this will also solve your problem of not getting the quotes inside the match.
demo here : http://regex101.com/r/dI6gD7

#!/usr/local/bin/perl
open my $fh, '<', "strings.txt"; #read the content of the file and assign it to $string;
read $fh, my $string, -s $fh;
close $fh;
while ($string =~ m/^['"]{1}(.*?)['"]{1,}$/mg) {
print $&;
}

You could use '[^']*' to match a string between single quotes, "[^"]*" for double quotes.
If you want to support other features, such as escape sequence, then you should consider using modules Text::ParseWords or Text::Balanced.
Note:
Because of the greediness of *, '.*' will match all characters between the first and last single quote, if your string has more than one single quoted substrings, this will only give one match instead of several ones.
You can use ('[^']*') instead of '([^']*)' to capture the single quotes and the substring between them, double quotes are similar.
Because '[^']*' and "[^"]*" cannot be matched at the same time, m/('[^']*')|("[^"]*")/ with /g will give some undefs in the returned list in list context, using m/('[^']*'|"[^"]*")/g can fix this problem.
Here is a test program:
#!/usr/bin/perl
use strict;
use warnings;
use feature qw(switch say);
use Data::Dumper;
my $file_string = q{Test "test in double quotes" test 'test in single quotes' and "test in double quotes again" test};
my #string_matches = ($file_string =~ /('[^']*'|"[^"]*")/g);
local $" = "\n";
print "#string_matches\n";
Testing:
$ perl t.pl
"test in double quotes"
'test in single quotes'
"test in double quotes again"

Related

Replacing a single character in a perl regex match

How can I replace the 6th "_" that appears in the regex match?
Here is the literal input to be searched. It is not representing a path to the input:
/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30_0_59.fsa
Here is my code, which parses out what I need. I just now need to replace the last matched "_" with a "/":
#!/usr/bin/perl
use strict;
use warnings;
open(IN, '<', '/Users/roblogan/Test_Database.txt') or die $!;
open(OUT, '>', '/Users/roblogan/Test_Output.txt') or die $!;
while (my $line = <IN>){
if ($line =~ m/(m160505_031746_42156_c100980652550000001823221307061611_s1_p0_[0-9]*)/){
print OUT $1, "\n";
}
}
Current output:
m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30
Desired output:
m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30
I have tried:
if ($line =~ s/(m160505_031746_42156_c100980652550000001823221307061611_s1_p0_[0-9]*)/(m160505_031746_42156_c100980652550000001823221307061611_s1_p0\/[0-9]*)/){
Any help would be appreciated.
This Perl code will do what I think you need, determined from your subject line and example output
It finds the sixth occurrence of an underscore in the target string and, if that underscore is followed by decimal digits, it changes the underscore to a slash and removes everything following the digits
I have used the pipe character | as the delimiter for the substitute operator s/// to avoid the need to escape forward slashes
use strict;
use warnings 'all';
my $path = q{/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30_0_59.fsa};
$path =~ s|^(?:[^_]*_){5}[^_]*\K_(\d+).*|/$1|s;
print $path, "\n";
output
/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30
From your description, the easiest way is:
$line =~ s!(m160505_031746_42156_c100980652550000001823221307061611_s1_pā€Œā€‹ā€Œā€‹0)_!$1/!
I've chosen ! as the delimiter because / is used in the replacement part.
$1 is a variable containing the text matched by the first ( ) group in the regex (I didn't want to repeat the whole thing twice).
The final _ is not included in $1 (it's outside of the parens); instead we put / in the replacement part.
See perldoc perlretut for more information.

regular expression that matches any word that starts with pre and ends in al

The following regular expression gives me proper results when tried in Notepad++ editor but when tried with the below perl program I get wrong results. Right answer and explanation please.
The link to file I used for testing my pattern is as follows:
(http://sainikhil.me/stackoverflow/dictionaryWords.txt)
Regular expression: ^Pre(.*)al(\s*)$
Perl program:
use strict;
use warnings;
sub print_matches {
my $pattern = "^Pre(.*)al(\s*)\$";
my $file = shift;
open my $fp, $file;
while(my $line = <$fp>) {
if($line =~ m/$pattern/) {
print $line;
}
}
}
print_matches #ARGV;
A few thoughts:
You should not escape the dollar sign
The capturing group around the whitespaces is useless
Same for the capturing group around the dot .
which leads to:
^Pre.*al\s*$
If you don't want words like precious final to match (because of the middle whitespace, change regex to:
^Pre\S*al\s*$
Included in your code:
while(my $line = <$fp>) {
if($line =~ /^Pre\S*al\s*$/m) {
print $line;
}
}
You're getting messed up by assigning the pattern to a variable before using it as a regex and putting it in a double-quoted string when you do so.
This is why you need to escape the $, because, in a double-quoted string, a bare $ indicates that you want to interpolate the value of a variable. (e.g., my $str = "foo$bar";)
The reason this is causing you a problem is because the backslash in \s is treated as escaping the s - which gives you just plain s:
$ perl -E 'say "^Pre(.*)al(\s*)\$";'
^Pre(.*)al(s*)$
As a result, when you go to execute the regex, it's looking for zero or more ses rather than zero or more whitespace characters.
The most direct fix for this would be to escape the backslash:
$ perl -E 'say "^Pre(.*)al(\\s*)\$";'
^Pre(.*)al(\s*)$
A better fix would be to use single quotes instead of double quotes and don't escape the $:
$ perl -E "say '^Pre(.*)al(\s*)$';"
^Pre(.*)al(\s*)$
The best fix would be to use the qr (quote regex) operator instead of single or double quotes, although that makes it a little less human-readable if you print it out later to verify the content of the regex (which I assume to be why you're putting it into a variable in the first place):
$ perl -E "say qr/^Pre(.*)al(\s*)$/;"
(?^u:^Pre(.*)al(\s*)$)
Or, of course, just don't put it into a variable at all and do your matching with
if($line =~ m/^Pre(.*)al(\s*)$/) ...
Try removing trailing newline character(s):
while(my $line = <$fp>) {
$line =~ s/[\r\n]+$//s;
And, to match only words that begin with Pre and end with al, try this regular expression:
/^Pre\w*al$/
(\w means any letter of a word, not just any character)
And, if you want to match both Pre and pre, do a case-insensitive match:
/^Pre\w*al$/i

Perl Regex To Remove Commas Between Quotes?

I am trying to remove commas between double quotes in a string, while leaving other commas intact? (This is an email address which sometimes contains spare commas). The following "brute force" code works OK on my particular machine, but is there a more elegant way to do it, perhaps with a single regex?
Duncan
$string = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93#gmail.com>,1,2';
print "Initial string = ", $string, "<br>\n";
# Extract stuff between the quotes
$string =~ /\"(.*?)\"/;
$name = $1;
print "name = ", $1, "<br>\n";
# Delete all commas between the quotes
$name =~ s/,//g;
print "name minus commas = ", $name, "<br>\n";
# Put the modified name back between the quotes
$string =~ s/\"(.*?)\"/\"$name\"/;
print "new string = ", $string, "<br>\n";
You can use this kind of pattern:
$string =~ s/(?:\G(?!\A)|[^"]*")[^",]*\K(?:,|"(*SKIP)(*FAIL))//g;
pattern details:
(?: # two possible beginnings:
\G(?!\A) # contiguous to the previous match
| # OR
[^"]*" # all characters until an opening quote
)
[^",]* #"# all that is not a quote or a comma
\K # discard all previous characters from the match result
(?: # two possible cases:
, # a comma is found, so it will be replaced
| # OR
"(*SKIP)(*FAIL) #"# when the closing quote is reached, make the pattern fail
# and force the regex engine to not retry previous positions.
)
If you use an older perl version, \K and the backtracking control verbs may be not supported. In this case you can use this pattern with capture groups:
$string =~ s/((?:\G(?!\A)|[^"]*")[^",]*)(?:,|("[^"]*(?:"|\z)))/$1$2/g;
One way would be to use the nice module Text::ParseWords to isolate the specific field and perform a simple transliteration to get rid of the commas:
use strict;
use warnings;
use Text::ParseWords;
my $str = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93#gmail.com>,1,2';
my #row = quotewords(',', 1, $str);
$row[2] =~ tr/,//d;
print join ",", #row;
Output:
06/14/2015,19:13:51,"Mrs Nkolika Nebedom" <ubabankoffice93#gmail.com>,1,2
I assume that no commas can appear legitimately in your email field. Otherwise some other replacement method is required.

Matching multiline string in file using perl regex

I am reading in another perl file and trying to find all strings surrounded by quotations within the file, single or multiline. I've matched all the single lines fine but I can't match the mulitlines without printing the entire line out, when I just want the string itself. For example, heres a snippet of what I'm reading in:
#!/usr/bin/env perl
use warnings;
use strict;
# assign variable
my $string = 'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
so the output I'd like is
'Hello World!';
"chmod";
"This is a fun multiple line string, please match";
but I am getting:
'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
This is the code I am using to find the strings - all file content is stored in #contents:
my #strings_found = ();
my $line;
for(#contents) {
$line .= $_;
}
if($line =~ /(['"](.?)*["'])/s) {
push #strings_found,$1;
}
print #strings_found;
I am guessing I am only getting 'Hello World!'; correctly because I am using the $1 but I am not sure how else to find the others without looping line by line, which I would think would make it hard to find the multi line string as it doesn't know what the next line is.
I know my regex is reasonably basic and doesn't account for some caveats but I just wanted to get the basic catch most regex working before moving on to more complex situations.
Any pointers as to where I am going wrong?
Couple big things, you need to search in a while loop with the g modifier on your regex. And you also need to turn off greedy matching for what's inside the quotes by using .*?.
use strict;
use warnings;
my $contents = do {local $/; <DATA>};
my #strings_found = ();
while ($contents =~ /(['"](.*?)["'])/sg) {
push #strings_found, $1;
}
print "$_\n" for #strings_found;
__DATA__
#!/usr/bin/env perl
use warnings;
use strict;
# assign variable
my $string = 'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
Outputs
'Hello World!'
"chmod"
"This is a fun
multiple line string, please match"
You aren't the first person to search for help with this homework problem. Here's a more detailed answer I gave to ... well ... you ;) finding words surround by quotations perl
regexp matching (in perl and generally) are greedy by default. So your regexp will match from 1st ' or " to last. Print the length of your #strings_found array. I think it will always be just 1 with the code you have.
Change it to be not greedy by following * with a ?
/('"*?["'])/s
I think.
It will work in a basic way. Regexps are kindof the wrong way to do this if you want a robust solution. You would want to write parsing code instead for that. If you have different quotes inside a string then greedy will give you the 1 biggest string. Non greedy will give you the smallest strings not caring if start or end quote are different.
Read about greedy and non greedy.
Also note the /m multiline modifier.
http://perldoc.perl.org/perlre.html#Regular-Expressions

How do I use Perl to intersperse characters between consecutive matches with a regex substitution?

The following lines of comma-separated values contains several consecutive empty fields:
$rawData =
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution.
I tried this first of all:
$rawdata =~ s/,([,\n])/,N\/A/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which returned
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n
Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string.
I thought this could be something to do with lookahead vs. lookback assertions, so I tried the following regex out:
$rawdata =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which resulted in:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n
That didn't work either. It just shifted the comma-pairings by one.
I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions?
The final string should look like this:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,N/A,N/A,N/A,N/A\n
EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
my $str = <<EO_DATA;
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,
EO_DATA
open my $str_h, '<', \$str;
while(my $row = <$str_h>) {
chomp $row;
print join(',',
map { length $_ ? $_ : 'N/A'} split /,/, $row, -1
), "\n";
}
Output:
E:\Home> t.pl
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A
You can also use:
pos $str -= 1 while $str =~ s{,(,|\n)}{,N/A$1}g;
Explanation: When s/// finds a ,, and replaces it with ,N/A, it has already moved to the character after the last comma. So, it will miss some consecutive commas if you only use
$str =~ s{,(,|\n)}{,N/A$1}g;
Therefore, I used a loop to move pos $str back by a character after each successful substitution.
Now, as #ysth shows:
$str =~ s!,(?=[,\n])!,N/A!g;
would make the while unnecessary.
I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there, and that everything after the lookbehind should be enclosed in a (?: ... ) so the | doesn't avoid doing the lookbehind.
Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline:
s!,(?=[,\n])!,N/A!g;
Example:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
print Dumper($rawData);
$rawData =~ s!,(?=[,\n])!,N/A!g;
print Dumper($rawData);
Output:
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"
You could search for
(?<=,)(?=,|$)
and replace that with N/A.
This regex matches the (empty) space between two commas or between a comma and end of line.
The quick and dirty hack version:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
while ($rawData =~ s/,,/,N\/A,/g) {};
print $rawData;
Not the fastest code, but the shortest. It should loop through at max twice.
Not a regex, but not too complicated either:
$string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1);
The ,-1 is needed at the end to force split to include any empty fields at the end of the string.