regex for multiple strings - regex

I have two strings that I'm trying to match in a file and return just that line.
The first string will match what I'm looking for always but not completely.
Example:
I might be looking for Matcht in Matchthisstring so I match but not the entire string.
String 1 might come before or after string 2 and it might start with an uppercase or lower case letter.
Example:
I might be looking for actually but have Actually.
But it together and I have things like Matchthisstring.Actually or actually.Matchthisstring or Matchthisstring.someotherjunkIdon'tcareabout.actually any other combinations like that.
I'm having problems using the two search strings together and getting the reggae to work.
Here's an example of some code that works:
my #matches;
while (<$in_fh>) {
#push #matches, $_ if / \Q$wanted\E .* \Q$prefix\E /x;
#push #matches, $_ if / \Q^*(.*)$wanted\s*(.*)\E .* \Q^*(.*)$prefix\s*(.*)\E /x;
push #matches, $_ if / \Q$wanted\E /x;
}
what I really want to work is one of the other options that's commented out. I don't think I'm joining the two searches into one string together properly.
Thanks in advance for the assistance.

Sounds like you want
push #matches, $_ if /\Q$wanted\E/ and /\Q$prefix\E/;
Demo:
$ perl -ne 'use strict; use warnings; use vars qw($wanted $prefix #matches);
> BEGIN { $wanted = "/donate"; $prefix = "lghfound"; }
> push #matches, $_ if /\Q$wanted\E/ and /\Q$prefix\E/; print #matches' <<'HERE'
> http://www.google.com/donate
> http://lghfoundation.com/pages/donate.html
> http://example.com
> HERE
http://lghfoundation.com/pages/donate.html
http://lghfoundation.com/pages/donate.html

Related

Perl grep a multi line output for a pattern

I have the below code where I am trying to grep for a pattern in a variable. The variable has a multiline text in it.
Multiline text in $output looks like this
_skv_version=1
COMPONENTSEQUENCE=C1-
BEGIN_C1
COMPONENT=SecurityJNI
TOOLSEQUENCE=T1-
END_C1
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
BEGIN_CMD_OPTIONS_RELEASE
-useideenv
The code I am using to grep for the pattern
use strict;
use warnings;
my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV=";
my #matching_lines;
my $output = `cmd to get output` ;
print "output is : $output\n";
if ($output =~ /^$cmd_pattern(?:null_)?(\w+([\.]?\w+)*)/s ) {
print "1 is : $1\n";
push (#matching_lines, $1);
}
I am getting the multiline output as expected from $output but the regex pattern match which I am using on $output is not giving me any results.
Desired output
jdk1.7.0_80
ivy
jdk1.7.3_80
msdotnet_VS2013_x64
ant_1.7.1
Regarding your regular expression:
You need a while, not an if (otherwise you'll only be matching once); when you make this change you'll also need the /gc modifiers
You don't really need the /s modifier, as that one makes . match \n, which you're not making use of (see note at the end)
You want to use the /m modifier so that ^ matches the beginning of every new line, and not just the beginning of the string
You want to add \s* to your regular expression right after ^, because in at least one of your lines you have a leading space
You need parenthesis around $cmd_pattern; otherwise, you're getting two options, the first one being ^CMD_ID= and the second one being CMD_USES_ASSET_ENV= followed by the rest of your expression
You can also simplify the (\w+([\.]?\w+)*) bit down to (.+).
The result would be:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
print "1 is : $1\n";
push (#matching_lines, $1);
}
That being said, your regular expression still won't split ivy and jdk1.7.3_80 on its own; I would suggest adding a split and removing _null with something like:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
my $text = $1;
my #text;
if ($text =~ /,/) {
#text = split /,(?:null_)?/, $text;
}
else {
#text = $text;
}
for (#text) {
print "1 is : $_\n";
push (#matching_lines, $_);
}
}
The only problem you're left with is the lone line CMD_ID=null. I'm gonna leave that to you :-)
(I recently wrote a blog post on best practices for regular expressions - http://blog.codacy.com/2016/03/30/best-practices-for-regular-expressions/ - you'll find there a note to always require the /s in Perl; the reason I mention here that you don't need it is that you're not using the ones you actually need, and that might mean you weren't certain of the meaning of /s)

Finding and Concatenating strings

I want to find some punctuation characters and concatenate them with spaces.
For example:
If any punctuation are found then I want to add spaces to front and end of them.
$line =~ s/[?%&!,.%*\[◦\]\\;#<>{}#^=\+()\$]/" $1 "/g ;
I tried using $ as used in Php where we can use $1, but it didn't work.
I searched on the web and couldn't find the Perl syntax?
Additionally, how can I preserve ... as a single token?
What is the true syntax for my problem.
Use this:
#!/usr/bin/perl -w
use strict;
my $string = "For example; If i found any puncs. above list, i want to add spaces to front and end of token.";
$string =~ s/([[:punct:]])/ $1 /g;
print "$string\n";
Outputs:
For example ; If i found any puncs . above list , i want to add spaces to front and end of token .
Obviously, if you want your output different from above, you can just add it in-between / / - I've just replaced all punctuation with " punctuation ".
You need to surround match pattern with () to capture it into $1
$line =~ s/([?%&!,.%*\[◦\]\\;#<>{}#^=\+\(\)\$])/ $1 /g;
EDIT (as per OP's comment)
how can i preserve '...' a single token ?
One way would be to revert back the changes for that token.
$line =~ s/ \. \. \. /.../g;

Regex to get just everything in CAPS

I am looking for a regex to get just about the words in CAPS
for eg : I have an array that is storing the file paths and these could be in any following pattern
images/p/n/ct/XYZ-WW_V1.jpg
images/p/c/ABC-TY_V2.jpg
So basically I want just "XYZ-WW" and "ABC-TY" .
Any suggestions what regex to use in my split code . I am using the following
foreach (#filefound){
my #result = split('_',$_);
push #split1, $result[0];
}
This is just splitting at the _ and I am accessing the [0] the value but now I want to get just the part that is in CAPS .
Any Suggestions please !!
No reason to use split at all. Just grab the bits you want via a regular expression. From your example, it looks like you want everything which is made of capital ASCII letters and dashes:
my #bignames;
foreach (#filefound){
if ( /([A-Z-]+)/ ) {
push #bignames, $1;
}
}
I'm thinking this should work:
[A-Z]+-[A-Z]+
foreach (#filefound) {
if ($_ ~= /.*([A-Z]+-[A-Z]+)_[A-Z]\d\..{3}$ ) {
push #split1, $1;
}
You can try this:
if ($_ =~ /[\-A-Z]+/)
push #split1, $&;
that will match any combination of uppercase letters and -; or, if you want a stricter control, this:
if ($_ =~ /\/([A-Z]{3}-[A-Z]{2})_/)
push #split1, $1;
which will match only a sequence of uppercase letters followed by - and by a sequence of 2 uppercase letters; starting with a / and ending in _ (those are excluded).
From these example you can build the exact regex that you need.
Keep in mind that a match in list context will return captured strings.
#!/usr/bin/perl
use warnings; use strict;
use File::Basename qw(basename);
my #files = qw(
images/p/n/ct/XYZ-WW_V1.jpg
images/p/c/ABC-TY_V2.jpg
);
my #prefixes = map { (basename $_) =~ /^( [A-Z]+ - [A-Z]+ )/x } #files;
print "$_\n" for #prefixes;

In Perl, how can I correctly extract URLs that are enclosed in parentheses?

I've got two question about Regexp::Common qw/URI/ and Regex in Perl.
I use Regexp::Common qw/URI/ to parse URI in the strings and delete them. But I've got an error when a URI is between parentheses.
For example: (http://www.example.com)
The error is caused by ')', and when it try to parse the URI, the app crash. So I've thought two fixes:
Do a simple (or I thought so) that writes a whitespace between parentheses and ) characters
The Regexp::Common qw/URI/ has a function that implement a fix.
In my code I've tried to implement the Regex but the app freezes. The code that I've tried is this:
use strict;
use Regexp::Common qw/URI/;
my $str = "Hello!!, I love (http://www.example.com)";
while ($str =~ m/\)/){
$str =~ s/\)/ \)/;
}
my ($uri) = $str =~ /$RE{URI}{-keep}/;
print "$uri\n";
print $str;
The output that I want is: (http://www.example.com )
I'm not sure, but I think that the problem is in $str =~ s/\)/ \)/;
BTW, I've got a question about Regexp::Common qw/URI/. I've got two string type:
ablalbalblalblalbal http://www.example.com
asfasdfasdf http://www.example.com aasdfasdfasdf
I want to remove the URI if it is the last component (and save it). And, if not, save it without removing it from the text.
You don't have to first test for a match to be able to use the s/// operator correctly: If the string does not match the search pattern, it will not do anything.
#!/usr/bin/perl
use strict; use warnings;
my $str = "Hello!!, I love (GOOGLE)";
$str =~ s/\)/ )/g;
print "$str\n";
The general problem of detecting URLs correctly in text is error-prone. See for example Jeff's thoughts on this.
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/)/){
$str =~ s/)/ )/;
}
Your program goes into an infinite loop at this point. To see why, try printing the value of $str each time round the loop.
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/)/){
$str =~ s/)/ )/;
print $str, "\n";
}
The first time it prints "Hello!!, I love (GOOGLE )". The while loop condition is then evaluated again. Your string still matches your regular expression (it still contains a closing parenthesis) so the replacement is run again and this time it prints out "Hello!!, I love (GOOGLE )" with two spaces.
And so it goes on. Each time round the loop another space is added, but each time you still have a closing parenthesis, so another substitution is run.
The simplest solution I can see is to only match the closing parenthesis if it is preceded by a non-whitespace character (using \S).
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\S)/){
$str =~ s/)/ )/;
print $str, "\n";
}
In this case the loop is only executed once.
Why not just include the parentheses in the search? If the URLs will always be bracketed, then something like this:
#!/usr/bin/perl
use warnings;
use strict;
use Regexp::Common qw/URI/;
my $str = "Hello!!, I love (http://www.google.com)";
my ($uri) = $str =~ / \( ( $RE{URI} ) \) /x;
print "$uri\n";
The regex from Regex::Common can be used as part of a longer regex, it doesn't have to be used on its own. Also I've used the 'x' modifier on the regex to allow whitespace so you can see more clearly what is going on - the brackets with the backslashes are treated as characters to match, those without define what is to matched (presumably like the {-keep} - I've not used that before).
You could also make the brackets optional, with something like:
/ (?: \( ( $RE{URI} ) \) | ( $RE{URI} ) ) /
although that would result in two match variables, one undefined - so something like following would be needed:
my $uri = $1 || $2 || die "Didn't match a URL!";
There's probably a better way to do this, and also if you're not bothered about matching parentheses then you could simply make the brackets optional (via a '?') in the first regex...
To answer your second question about only matching URLs at the end of the line - have a look at Regex 'anchors' which can force a match against the beginning or end of a line: ^ and $ (or \A and \Z if you prefer). e.g. matching a URL at the end of a line only:
/$RE{URI}\Z/

How do I use Perl to intersperse characters between consecutive matches with a regex substitution?

The following lines of comma-separated values contains several consecutive empty fields:
$rawData =
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution.
I tried this first of all:
$rawdata =~ s/,([,\n])/,N\/A/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which returned
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n
Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string.
I thought this could be something to do with lookahead vs. lookback assertions, so I tried the following regex out:
$rawdata =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which resulted in:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n
That didn't work either. It just shifted the comma-pairings by one.
I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions?
The final string should look like this:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,N/A,N/A,N/A,N/A\n
EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
my $str = <<EO_DATA;
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,
EO_DATA
open my $str_h, '<', \$str;
while(my $row = <$str_h>) {
chomp $row;
print join(',',
map { length $_ ? $_ : 'N/A'} split /,/, $row, -1
), "\n";
}
Output:
E:\Home> t.pl
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A
You can also use:
pos $str -= 1 while $str =~ s{,(,|\n)}{,N/A$1}g;
Explanation: When s/// finds a ,, and replaces it with ,N/A, it has already moved to the character after the last comma. So, it will miss some consecutive commas if you only use
$str =~ s{,(,|\n)}{,N/A$1}g;
Therefore, I used a loop to move pos $str back by a character after each successful substitution.
Now, as #ysth shows:
$str =~ s!,(?=[,\n])!,N/A!g;
would make the while unnecessary.
I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there, and that everything after the lookbehind should be enclosed in a (?: ... ) so the | doesn't avoid doing the lookbehind.
Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline:
s!,(?=[,\n])!,N/A!g;
Example:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
print Dumper($rawData);
$rawData =~ s!,(?=[,\n])!,N/A!g;
print Dumper($rawData);
Output:
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"
You could search for
(?<=,)(?=,|$)
and replace that with N/A.
This regex matches the (empty) space between two commas or between a comma and end of line.
The quick and dirty hack version:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
while ($rawData =~ s/,,/,N\/A,/g) {};
print $rawData;
Not the fastest code, but the shortest. It should loop through at max twice.
Not a regex, but not too complicated either:
$string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1);
The ,-1 is needed at the end to force split to include any empty fields at the end of the string.