perl negative look behind with groupings - regex

I have a problem getting a certain match to work with a negative lookbehind.
Example:
@list = qw( apple banana cherry);
$comb_tlist = join ("|", @list);
$string1 = "include $(dir)/apple";
$string2 = "#include $(dir)/apple";
if( $string1 =~ /^(?<!#).*($comb_tlist)/) #matching regex I tried, works kinda
The array holds a set of variables that is matched against the string.
I need the regex to match $string1, but not $string2.
It matches $string1, but it ALSO matches $string2. Can anyone tell me what I am doing wrong here? Thanks!

The problem is that the negative lookbehind and the beginning-of-line anchor ^ are both zero-width assertions. So when you say
"start at the beginning of the string"
and then say
"check that the character before it is not #"
...you actually check the character before the start of the string. Which is of course not #, because it is nothing.
Use a lookahead instead. This works:
use strict;
use warnings;
use feature 'say';
my @list = qw( apple banana cherry);
my $comb_tlist = join ("|", @list);
my $string1 = 'include $(dir)/apple';
my $string2 = '#include $(dir)/apple';
if( $string1 =~ /^(?!#).*($comb_tlist)/) { say "String1"; }
if( $string2 =~ /^(?!#).*($comb_tlist)/) { say "String2"; }
Note that you have made four critical mistakes in your sample code. First off, you use string1 which is a bareword, which will be interpreted as a string. Second, you declare @list but then use @tlist. Third, you don't (seem to) use
use strict;
use warnings;
These pragmas could have informed you of your error, and without them, it is fairly likely that you would not have been warned about your first two critical errors. There is no good reason not to use them, so do that in the future.
Fourth, the declaration
$string1 = "include $(dir)/apple";
means that you try to interpolate the special variable $( (the real group ID of the process) in your string. $ is a metacharacter in double-quoted strings, so you should use single quotes:
my $string1 = 'include $(dir)/apple';

Some problems:
Always use use strict; use warnings;.
Fix the use of string1 where you meant $string1.
Fix the scoping errors detected by the above by using my where appropriate.
Fix the typo in the variable names (@list vs @tlist).
I'm sure you didn't mean to interpolate the $( variable.
You'll never find a # before the first character of the string, so /^(?<!#).* .../ makes no sense. It simply means /^.* .../. You probably wanted /^[^#].* .../

You don't need negative lookbehind, just match a first character that is not #:
use strict;
use warnings;
my @list = qw( apple banana cherry);
my $comb_tlist = join ("|", @list);
my $string1 = "include dir/apple";
my $string2 = "#include dir/apple";
for ($string1, $string2) {
print "match:$_\n" if( /^[^#].*($comb_tlist)/);
}
Also, if you mean to match a literal $(dir), then you need to escape the $ sign with a backslash, otherwise it denotes a scalar variable. If this is the case, "$(dir)" should be \$(dir) in Perl code.

Sometimes complex regexes become trivial if you just split them into two or three.
Filter out the commented strings in a first step, then match the names in a second.
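A minimal sketch of that two-step idea, assuming the lines are already in a list. The sub name matching_lines is hypothetical and not from the question; quotemeta is added so that names containing regex metacharacters would still match literally.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that are not commented out
# and that mention one of the given names.
sub matching_lines {
    my ($names, @lines) = @_;

    # Build the alternation once; quotemeta escapes regex metacharacters.
    my $alternation = join '|', map { quotemeta($_) } @$names;

    # Step 1: drop lines whose first character is '#'.
    my @uncommented = grep { !/^#/ } @lines;

    # Step 2: keep only lines containing one of the names.
    return grep { /(?:$alternation)/ } @uncommented;
}

my @hits = matching_lines(
    [qw(apple banana cherry)],
    'include $(dir)/apple',
    '#include $(dir)/apple',
);
print "match: $_\n" for @hits;
```

Each step is a plain pattern, so neither needs a lookaround.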

Related

Remove certain characters from a regex group

I have a string that looks like this ("key":["value","value","value"])
"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]
and I use the following regex to select from the string. (The regex is set up so that it won't select a string that looks like this: "key":[{"key":"value","key":"value"}] )
(?<=:\[").*?(?="])
Resulting Selection:
google.co.uk","google.com","google.com","google.com","google.co.uk
I want to remove the " in that selected string, and I was wondering if there was an easy way to do this using the replace command. Desired result...
"emailDomains":["google.co.uk, google.com, google.com, google.com, google.co.uk"]
How do I solve this problem?
If your string indeed has the form "key":["v1", "v2", ... "vN"], you can split off the part that needs to be changed, replace "," by a space in it, and re-assemble:
my @parts = split / (\["\s* | \s*\"]) /x, $string; #"
$parts[2] =~ s/",\s*"/ /g;
my $processed = join '', @parts;
The regex pattern for the separator in split is captured, since in that case the separators are also included in the returned list, which is helpful here for putting the string back together. Then we need to change the third element of the array.
In this approach, we have to change a specific element in the array so if your format varies, even a little, this may not (or still may) be suitable.
This should of course be processed as JSON, using a module. If the format isn't sure, as indicated in a comment, it would be best to try to ensure that you have JSON. Picking bits and pieces like above (or below) is a road to madness once requirements slowly start evolving.
The same approach can be used in a regex, and this may in fact have the advantage of being able to scoop up and ignore everything preceding the : (with split, that part may end up as multiple elements if the format isn't exactly as shown, which then affects everything)
$string =~ s{ :\["\s*\K (.*?) ( "\] ) }{
my $e = $2;
my $n = $1 =~ s/",\s*"/ /gr;
$n.$e
}ex;
Here /e modifier makes it so that the replacement side is evaluated as code, where we do the same as with the split above. Notes on regex
Have to save away $2 first, since it gets reset in the next regex
The /r modifier†, which doesn't change its target but rather returns the changed string, is what allows us to use substitution operator on the read-only $1
If nothing gets captured for $2, and perhaps for $1, that means that there was no match and the outcome is simply that $string doesn't change, quietly. So if this substitution should always work then you may want to add handling of such unexpected data
Don't need a $n above, but can return ($1 =~ s/",\s*"/ /gr) . $e
Or, using lookarounds as attempted
$string =~ s{ (?<=:\[") (.+?) (?="\]) }{ $1 =~ s/",\s*"/ /gr }egx;
which does reduce the amount of code, but may be trickier to work with later.
While this is a direct answer to the question I think it's least maintainable.
†  This useful modifier, for "non-destructive substitution," appeared in v5.14. In earlier Perl versions we would copy the string and run regex on that, with an idiom
(my $n = $1) =~ s/",\s*"/ /g;
In the lookarounds-example we then need a little more
$string =~ s{...}{ (my $n = $1) =~ s/",\s*"/ /g; $n }eg;
since s/ operator returns the number of substitutions made while we need $n to be returned from that whole piece of code in {} (the replacement side), to be used as the replacement.
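As a hedged illustration of the copy-then-substitute idiom inside /e, here is a small self-contained sketch. The sub name flatten_list_value is hypothetical, and the replacement uses a comma plus a space (the separator shown in the question's desired output) rather than the bare space used in the snippets above.

```perl
use strict;
use warnings;

# Hypothetical helper: inside "key":["v1","v2",...] remove the inner quotes,
# leaving one quoted string with comma-space separated values.
sub flatten_list_value {
    my ($string) = @_;

    # Work on a copy so the caller's string is untouched (no /r needed).
    (my $copy = $string) =~ s{ (?<=:\[") (.+?) (?="\]) }{
        # $1 is read-only, so copy it before running the inner substitution.
        (my $values = $1) =~ s/",\s*"/, /g;
        $values;    # the last expression becomes the replacement text
    }egx;

    return $copy;
}

print flatten_list_value(
    q/"emailDomains":["google.co.uk","google.com","google.co.uk"]/
), "\n";
```

If the pattern does not match, the copy is returned unchanged, so unexpected input passes through silently.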
You can use this \G based regex to start the match with :[" and further captures the values appropriately and replaces matched text so that only comma is retained and doublequotes are removed.
(:\[")|(?!^)\G([^"]+)"(,)"
Your text is almost proper JSON, so it's really easy to go the final inch and make it so, and then process that:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say postderef/;
no warnings qw/experimental::postderef/;
use JSON::XS; # Install through your OS package manager or a CPAN client
my $str = q/"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]/;
my $json = JSON::XS->new();
my $obj = $json->decode("{$str}");
my $fixed = $json->ascii->encode({emailDomains =>
join(', ', $obj->{'emailDomains'}->@*)});
$fixed =~ s/^\{|\}$//g;
say $fixed;
Try Regex: " *, *"
Replace with: ,

Perl - how to get values of tokens

I am searching how to get tokens values in properties file with Perl.
Given the source property:
my $source="application.1.hostname={{DNS_APP}}:{{PORT_APP}}/WHATEVER";
And given the target property:
my $target="application.1.hostname=test.test.com:8080/WHATEVER";
I would like to get the following result:
{{DNS_APP}}=test.test.com
{{PORT_APP}}=8080
I have no trouble getting the tokens with:
my @matches= ( $source =~ /({{.*?}})/g );
But then, how to match with their values ?
Is there an easy way, with perl regexps to get these substitutions ?
Another difficulty (but these are exceptions, so it is not a big deal if this problem is not addressed) is that, sometimes, $target can be
my $target="application.1.hostname=test.test.com/WHATEVER";
Or
my $target="application.1.hostname=test.test.com:8080/SOMETHINGELSE";
Or even
my $target="application.1.hostname=test.test.com/SOMETHINGELSE";
How to deal with that ?
Thank you in advance for your answers.
Regards.
OK, at a basic level, you can turn your thing into a named capture for a regex. There's a caveat though - you might need to restrict character sets.
But something like this might work:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $source = "application.1.hostname={{DNS_APP}}:{{PORT_APP}}/WHATEVER";
my $target = "application.1.hostname=test.test.com:8080/WHATEVER";
$source =~ s|\Q{{\E(\w+)\Q}}\E|(?<$1>.*)|g;
$source = qr/$source/;
print "Using Regex:", $source,"\n";
$target =~ m/$source/;
#%+ is the special named-capture hash. You can access $+{DNS_APP} for example
print Dumper \%+;
Note though that .* is a greedy match, which means that without delimiters/anchors between patterns, this will break. You could perhaps define a narrower character class - I would think \w normally, but you also have . so perhaps [\w.]+ - or maybe even .*? for non-greedy matching instead. This depends rather on what would 'fit' with the types of patterns you're trying to match. If you need to do so with arbitrary patterns, I think you're going to need something like a regex to define the match criteria in the first place.
If your 'targets' are purely that pattern - e.g. trailing static words - you can trim your initial pattern with s/\w+$// which will reduce it to:
application.1.hostname={{DNS_APP}}:{{PORT_APP}}/
Which you then regex transform to:
(?^:application.1.hostname=(?<DNS_APP>.*):(?<PORT_APP>.*)/)
And then get %+ of:
$VAR1 = {
'DNS_APP' => 'test.test.com',
'PORT_APP' => '8080'
};
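The narrower character class can be sketched as follows. This is only an illustration under stated assumptions: the sub name extract_tokens is hypothetical, token values are assumed to consist of word characters, dots and hyphens, and the literal parts of the template are escaped with quotemeta so that the dots in the key match only dots. It relies on %+ and so needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex from a {{TOKEN}} template and return
# a hash reference of token => value, or nothing if $target does not fit.
sub extract_tokens {
    my ($template, $target) = @_;

    my $pattern = '';
    # The capturing group in split keeps the {{TOKEN}} pieces in the list.
    for my $piece (split /(\{\{\w+\}\})/, $template) {
        if ($piece =~ /^\{\{(\w+)\}\}$/) {
            # Token: named capture with a restricted character class (assumption).
            $pattern .= '(?<' . $1 . '>[\w.-]+)';
        }
        else {
            # Literal text: escape regex metacharacters.
            $pattern .= quotemeta $piece;
        }
    }

    return unless $target =~ /\A$pattern\z/;
    my %found = %+;    # copy the named captures before they go out of scope
    return \%found;
}

my $found = extract_tokens(
    'application.1.hostname={{DNS_APP}}:{{PORT_APP}}/WHATEVER',
    'application.1.hostname=test.test.com:8080/WHATEVER',
);
if ($found) {
    print "{{$_}}=$found->{$_}\n" for sort keys %$found;
}
```

Because the class excludes ":" and "/", the greedy quantifier cannot run past a delimiter. A target without the port, such as the exception cases in the question, simply fails to match and would need its own template.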
As you're on 5.8.8 - my first advice is upgrade it, because it's 7 year old software, and is long since end of life.
This variable was added in Perl v5.10.0.
However you should be able to work around by:
my @match_names = $source =~ m|\Q{{\E(\w+)\Q}}\E|g; #capture 'names' of matches
$source =~ s|\Q{{\E(\w+)\Q}}\E|(.*)|g;
$source = qr/$source/;
print "Using Regex:", $source, "\n";
my %results;
my @matches = $target =~ m/$source/;
@results{@match_names} = @matches;
print Dumper \%results;
I'm pretty sure there's a way of capturing what matched from the s pattern replacement. If I figure out what it was, I'll update.
(As it stands:
my ( @match_names ) = $source =~ s|\Q{{\E(\w+)\Q}}\E|\(.*\)|g;
doesn't seem to work as I want - @match_names contains the number of replacements. )

Replace only the second occurance of string in a line in perl regex

I have a string like "ven|ven|vett|vejj|ven|ven". Treat "|" as the column delimiter.
I split the string on "|", save all the columns in an array, and read each column into $str.
So, I'm trying to do this as
$string =~ s/$str/venky/g if $str =~ /ven/i; # it will do globally.
This does not meet the requirement.
I need to replace the string at a particular occurrence number, on demand.
For example, I have a request to change the 2nd occurrence of "ven" to "venky".
How can I meet this requirement simply? Is it something like
$string =~ s/ven/venky/2;
As far as I know we have 'o' for replace once and 'g' for global replacement. I'm struggling to find a way to do the replacement at a particular occurrence. I should not use pos() to get the position, because the string keeps changing and it becomes difficult to track it every time.
Please help me on this regard.
There is no flag that you can add to the regex that will do this.
The easiest way would be to split and loop. However, if you insist to use one regex, it is doable:
/^(?:[^v]|v[^e]|ve[^n])*ven(?:[^v]|v[^e]|ve[^n])*\Kven/
If you want to replace the Nth occurrence instead of the second, you can do:
/^(?:(?:[^v]|v[^e]|ve[^n])*ven){N-1}(?:[^v]|v[^e]|ve[^n])*\Kven/
The general idea:
(?:[^v]|v[^e]|ve[^n])* - matches any string that isn't part of ven
\K is a cool matcher that drops everything matched so far, so you can sort of use it as a lookbehind with variable length
Currently you're replacing every instance of 'ven' with 'venky' if your string contains a match for ven, which of course it does.
What I assume you're trying to do is to replace 'ven' with 'venky' within your string if it's the second element:
my $string = 'ven|ven|vett|vejj|ven|ven';
my @elements = split(/\|/, $string);
my $count;
foreach (@elements){
$count++;
s/$_/venky/g if /ven/i and $count == 2;
}
print join('|', @elements);
print "\n";
Your approach was already pretty good. What you described makes sense, but I think you are having trouble implementing it.
I created a function to do the work. It takes 4 arguments:
$string is the string we want to work on
$n is the nth occurrence you want to replace
$needle is the thing you want to replace – think needle in a haystack
Note that right now we allow to pass stuff that might contain regular expressions. So you would have to use quotemeta on it or match with /\Q$needle\E/
$replacement is the replacement for the $needle
The idea is to split up the string, then check each element if it matches the pattern ($needle) and keep track of how many have matched. If the nth one is reached, replace it and stop processing. Then put the string back together.
use strict;
use warnings;
use feature 'say';
say replace_nth_occurance("ven|ven|vett|vejj|ven|ven", 2, 'ven', 'venky');
sub replace_nth_occurance {
my ($string, $n, $needle, $replacement) = @_;
# take the string apart
my @elements = split /\|/, $string;
my $count = 0; # keep track of ...
foreach my $e (@elements) {
$count++ if $e =~ m/$needle/; # ... how many matches we've found
if ($count == $n) {
$e =~ s/$needle/$replacement/; # replace
last; # and stop processing
}
}
# put it back into the pipe-separated format
return join '|', @elements;
}
Output:
ven|venky|vett|vejj|ven|ven
To replace the n'th occurrence of "ven" to "venky":
my $n = 3;
my $test = "seven given ravens";
$test =~ s/ven/--$n == 0 ? "venky" : $&/eg;
This uses the ability with the /e flag to specify the substitution part as an expression.
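The same /e technique can be wrapped in a reusable sub. This is a sketch, not the answerer's code: the name replace_nth is hypothetical, the counter counts up instead of decrementing $n, and the needle is treated as a literal string via \Q...\E.

```perl
use strict;
use warnings;

# Hypothetical helper: replace only the $n-th occurrence of a literal
# $needle in $string and return the new string.
sub replace_nth {
    my ($string, $n, $needle, $replacement) = @_;
    my $seen = 0;

    # /e evaluates the replacement as code for every match; only the
    # $n-th match gets the new text, all others are put back unchanged.
    $string =~ s/(\Q$needle\E)/ ++$seen == $n ? $replacement : $1 /eg;

    return $string;
}

print replace_nth('ven|ven|vett|vejj|ven|ven', 2, 'ven', 'venky'), "\n";
print replace_nth('seven given ravens', 3, 'ven', 'venky'), "\n";
```

Since the sub works on its own copy of the string and never depends on positions, it can be called again after the string has changed.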

How can I use regex to remove /1 or /2?

Regex gurus,
Here is the following line of code I want to parse with regex:
@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1
I want to obtain the following:
@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0
I have written the following regex on rubular.com:
(@.* *.)(!?(\/.))
My idea is to use negation to remove /1 by (!?(\/.)). However, this produces the entire line?
@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1
Why is (?!thisismystring) not removing /1? I googled the fire out of this, but they seemed to suggest similar things I am already trying? I deeply appreciate your help.
I think what you are trying to write is /(\@.* .*)(?=\/\d)/ (you need to escape the at sign @ to prevent Perl from treating it as an array). You need a positive look-ahead because you want to match everything up to the point where the following characters are a slash followed by a digit.
Here is a program that demonstrates.
use strict;
use warnings;
use 5.010;
my $s = '@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1';
$s =~ /(\@.* .*)(?=\/.)/;
print $1, "\n";
But you would be much better off copying the whole string and removing the slash and everything after it, like this
use strict;
use warnings;
my $s = '@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1';
(my $fixed = $s) =~ s{/\d+$}{};
print $fixed, "\n";
output
@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0

How do I use Perl to intersperse characters between consecutive matches with a regex substitution?

The following lines of comma-separated values contains several consecutive empty fields:
$rawData =
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution.
I tried this first of all:
$rawdata =~ s/,([,\n])/,N\/A/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which returned
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n
Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string.
I thought this could be something to do with lookahead vs. lookback assertions, so I tried the following regex out:
$rawdata =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which resulted in:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n
That didn't work either. It just shifted the comma-pairings by one.
I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions?
The final string should look like this:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,N/A,N/A,N/A,N/A\n
EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
my $str = <<EO_DATA;
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,
EO_DATA
open my $str_h, '<', \$str;
while(my $row = <$str_h>) {
chomp $row;
print join(',',
map { length $_ ? $_ : 'N/A'} split /,/, $row, -1
), "\n";
}
Output:
E:\Home> t.pl
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A
You can also use:
pos $str -= 1 while $str =~ s{,(,|\n)}{,N/A$1}g;
Explanation: When s/// finds a ,, and replaces it with ,N/A, it has already moved to the character after the last comma. So, it will miss some consecutive commas if you only use
$str =~ s{,(,|\n)}{,N/A$1}g;
Therefore, I used a loop to move pos $str back by a character after each successful substitution.
Now, as @ysth shows:
$str =~ s!,(?=[,\n])!,N/A!g;
would make the while unnecessary.
I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there, and that everything after the lookbehind should be enclosed in a (?: ... ) so the | doesn't avoid doing the lookbehind.
Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline:
s!,(?=[,\n])!,N/A!g;
Example:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
print Dumper($rawData);
$rawData =~ s!,(?=[,\n])!,N/A!g;
print Dumper($rawData);
Output:
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"
You could search for
(?<=,)(?=,|$)
and replace that with N/A.
This regex matches the (empty) space between two commas or between a comma and end of line.
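In Perl, that zero-width search and replace might look like the sketch below. The sub name fill_empty_fields is hypothetical, and the /m modifier is an assumption needed so that $ matches before each newline of a multi-line string rather than only at the very end.

```perl
use strict;
use warnings;

# Hypothetical helper: insert N/A into every empty field that follows a comma.
sub fill_empty_fields {
    my ($csv) = @_;

    # (?<=,) requires a comma just before the position, (?=,|$) requires a
    # comma or end of line just after it. Nothing is consumed, so runs of
    # consecutive commas are all handled in a single pass.
    (my $filled = $csv) =~ s{(?<=,)(?=,|$)}{N/A}mg;

    return $filled;
}

print fill_empty_fields("6.9,-,N/A,,Clear\nWNW,10.4,,,,\n");
```

An empty field at the very start of a line is not preceded by a comma, so this pattern leaves it alone.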
The quick and dirty hack version:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
while ($rawData =~ s/,,/,N\/A,/g) {};
print $rawData;
Not the fastest code, but the shortest. It should loop at most twice.
Not a regex, but not too complicated either:
$string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1);
The ,-1 is needed at the end to force split to include any empty fields at the end of the string.