Perl regexp to match random character/number combination? - regex

I'm trying to match patterns like these with perl regexp:
_b04it4_
_bg4n5p_
_qp9bp_
_hp32z7_
...that is, underscore followed by some combination of characters and numbers.
I guess the "rule" is that there are >=1 [a-z] characters and >=1 [0-9] character/number, and no spaces, "mixed in any combination", between two underscore-characters.
And want to replace this with something, eg. "_X_".
I'd appreciate some help with this .. My own attempts are looking horrible and don't work very well :)

For at least 1 letter and number:
_(?=[^_]*[a-z])(?=[^_]*\d)[a-z\d]+_
(?=[^_]*[a-z]) checks for the presence of a letter between the two _
(?=[^_]*\d) checks for the presence of a digit between the two _
_[a-z\d]+_ does the actual match
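To tie this back to the question, here is a minimal sketch of the pattern used in a substitution. The helper name mask_tokens and the sample list are only for illustration, and it assumes lowercase ASCII input as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every _token_ that contains at least one
# lowercase letter and at least one digit with "_X_".
sub mask_tokens {
    my ($text) = @_;
    $text =~ s/_(?=[^_]*[a-z])(?=[^_]*\d)[a-z\d]+_/_X_/g;
    return $text;
}

# Two tokens that qualify, one with letters only, one with digits only.
for my $sample (qw(_b04it4_ _qp9bp_ _nonumbers_ _012345_)) {
    printf "%-12s -> %s\n", $sample, mask_tokens($sample);
}
```

Tokens made up only of letters or only of digits should come back unchanged, because one of the two lookaheads fails for them.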

How about:
_(?=.*[a-z])(?=.*[0-9])[0-9a-z]+_
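One thing that may be worth checking with this version: .* in a lookahead is not confined to the text between the two underscores, so a letter or digit further along the line can satisfy it. A small sketch comparing it with a [^_]* variant (the helper matches and the sample string are made up for illustration):

```perl
use strict;
use warnings;

# Lookaheads with .* may look past the closing underscore.
my $loose  = qr/_(?=.*[a-z])(?=.*[0-9])[0-9a-z]+_/;
# Lookaheads with [^_]* stop at the next underscore.
my $strict = qr/_(?=[^_]*[a-z])(?=[^_]*[0-9])[0-9a-z]+_/;

# Hypothetical helper: 1 if the pattern matches the text, 0 otherwise.
sub matches {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}

# "_abc_" has no digit of its own, but a digit appears later on the line.
my $text = '_abc_ page 7';
printf "loose:  %d\n", matches($loose,  $text);
printf "strict: %d\n", matches($strict, $text);
```

If each input line holds exactly one token, as in the question's samples, the two patterns should behave the same.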

Another way without lookaheads:
_([a-z]+[0-9]|[0-9]+[a-z])[a-z0-9]*_

Something like this is easily solved if you separate the conditions into multiple regexes: the first matches the basic constraints, and the second ensures that at least 1 letter and 1 digit are in the match.
use strict;
use warnings;
while (<DATA>) {
    chomp;
    my $before = my $after = $_;
    $after =~ s{_([a-z0-9]+)_}{
        my $chars = $1;
        # Require 1 digit and 1 letter in the match before replacing.
        ($chars =~ /[a-z]/ && $chars =~ /[0-9]/) ? "_X_" : "_${chars}_"
    }e;
    printf "%-12s -> %-12s\n", $before, $after;
}
__DATA__
_b04it4_
_bg4n5p_
_qp9bp_
_hp32z7_
_nonumbers_
_012345_
_1 space_

How about this:
use strict;
my ($replacement, @input) = ('X', qw(_b04it4_ _bg4n5p_ _qp9bp_ _hp32z7_));
my @output = map {'_'.$replacement.'_'} grep {/^_[a-z0-9]+_$/ && /[a-z]+/ && /[0-9]+/} @input;
print "$_\n" foreach @output;

Related

Distinguishing and substituting decimals in Perl

I want to substitute decimals from commas to fullstops in a file and I wanted to try to do this in perl.
An example of my dataset looks something like this:
Species_1:0,12, Species_2:0,23, Species_3:2,53
I want to substitute the decimals but not all commas such that:
Species_1:0.12, Species_2:0.23, Species_3:2.53
I was thinking it might work using the substitution function like such:
$comma_file= "Species_1:0,12 , Species_2:0,23, Species_3:2,53"
$comma = "(:\d+/,\d)";
#match a colon, any digits after the colon, the wanted comma and digits preceding it
if ($comma_file =~ m/$comma/g) {
$comma_file =~ tr/,/./;
}
print "$comma_file\n";
However, when I tried this, all my commas changed into fullstops, not just the ones I was targeting. Is it an issue with the regex, or am I just not doing the match substitution correctly?
Thanks!
This :
use strict;
use warnings;
my $comma_file = "Species_1:0,12, Species_2:0,23, Species_3:2,53";
$comma_file =~ s/(\d+),(\d+)/$1.$2/g;
print $comma_file, "\n";
Yields :
Species_1:0.12, Species_2:0.23, Species_3:2.53
The regex searches for commas having at least one digit on both sides and replaces them with a dot.
Your code doesn't work because you first check whether the string contains a comma surrounded by digits and, if it does, tr/,/./ then replaces ALL commas in the string with dots, not just the ones that matched.
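A minimal side-by-side sketch of that difference. The helper names with_tr and with_subst are made up for illustration, and the condition in with_tr is a simplified stand-in for the check in the question.

```perl
use strict;
use warnings;

# tr/// knows nothing about an earlier match: once the condition is true,
# it rewrites every comma in the string.
sub with_tr {
    my ($text) = @_;
    $text =~ tr/,/./ if $text =~ /\d,\d/;
    return $text;
}

# s///g rewrites only the commas that are part of a match.
sub with_subst {
    my ($text) = @_;
    $text =~ s/(\d+),(\d+)/$1.$2/g;
    return $text;
}

my $data = 'Species_1:0,12, Species_2:0,23';
print with_tr($data), "\n";
print with_subst($data), "\n";
```

The first line printed should have lost its separator commas as well, while the second keeps them.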
From the shown data it appears that a comma to be replaced must always have a number on each side, and that every such occurrence need be replaced. There is a fine answer by GMB.
Another way for this kind of a problem is to use lookarounds
$comma_file =~ s/(?<=[0-9]),(?=[0-9])/./g;
which should be more efficient, as there is no copying into $1 and $2 and no quantifiers.
My benchmark
use warnings;
use strict;
use feature 'say';
use Benchmark qw(cmpthese);
my $str = q(Species_1:0,12, Species_2:0,23, Species_3:2,53);
sub subs {
    my ($str) = @_;
    $str =~ s/(\d+),(\d+)/$1.$2/g;
    return $str;
}
sub look {
    my ($str) = @_;
    $str =~ s/(?<=\d),(?=\d)/./g;
    return $str;
}
die "Output not equal" if subs($str) ne look($str);
cmpthese(-3, {
subs => sub { my $res = subs($str) },
look => sub { my $res = look($str) },
});
with output
         Rate subs look
subs 256126/s   -- -46%
look 472677/s  85%   --
This is only one particular string, but the efficiency advantage should only grow with the length of the string, while longer patterns (numbers here) should reduce it a little.

Regex searching and adding characters

I'm trying to use regex to add $ to the start of words in a string such that:
Answer = partOne + partTwo
becomes
$Answer = $partOne + $partTwo
I'm using / [a-z]/ to locate them, but I'm not sure what I'm meant to replace it with.
Is there any way to do it with a regex, or am I supposed to just split up my string and put in the $?
I'm using perl right now.
You can match word boundary \b, followed by word class \w
my $s = 'Answer = partOne + partTwo';
$s =~ s|\b (?= \w)|\$|xg;
print $s;
output
$Answer = $partOne + $partTwo
You could use a lookahead to match only a space or the start-of-line anchor when it is immediately followed by a letter. Replace the matched space or start anchor with itself plus a $ symbol.
use strict;
use warnings;
while(my $line = <DATA>) {
$line =~ s/(^|\s)(?=[A-Za-z])/$1\$/g;
print $line;
}
__DATA__
Answer = partOne + partTwo
Output:
$Answer = $partOne + $partTwo
Perl's regexes have a word character class \w that is meant for exactly this sort of thing. It matches upper-case and lower-case letters, decimal digits, and the underscore _.
So if you prefix all occurrences of one or more such characters with a dollar then it will achieve what you ask. It would look like this
use strict;
use warnings;
my $str = 'Answer = partOne + partTwo';
$str =~ s/(\w+)/\$$1/g;
print $str, "\n";
output
$Answer = $partOne + $partTwo
But please note that, if the text you're processing is a programming language, this will also process all comments and string literals in a way you probably don't want.
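A related caveat: since \w also matches digits, a bare number in the expression would get a $ as well. A hedged sketch that only prefixes words starting with a letter or underscore (the helper name add_sigils and that identifier rule are assumptions, not something the question states):

```perl
use strict;
use warnings;

# Hypothetical helper: prefix identifier-like words with "$", assuming an
# identifier starts with a letter or underscore; bare numbers are skipped.
sub add_sigils {
    my ($expr) = @_;
    $expr =~ s/\b([A-Za-z_]\w*)/\$$1/g;
    return $expr;
}

print add_sigils('Answer = partOne + 2 * partTwo'), "\n";
```

The \b keeps the pattern from starting in the middle of a word, so something like 2abc is left alone rather than turned into 2$abc.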
(\w+)
You can use this. Replace with \$$1.
See demo.
http://regex101.com/r/lS5tT3/40

Regex greedyness REasking

I have this text $line = "config.txt.1", and I want to match it with regex and extract the number
part of it. I am using two versions:
$line = "config.txt.1";
(my $result) = $line =~ /(\d*).*/; #ver 1, matched, but returns nothing
(my $result) = $line =~ /(\d).*/; #ver 2, matched, returns 1
(my $result) = $line =~ /(\d+).*/; #ver 3, matched, returns 1
I think the * is messing things up. I have been looking at this, but I still don't understand the greedy mechanism in the regex engine. If matching starts from the left, and there might be no digits in the text, then ver 1 will still match, but ver 3 won't. Can someone explain why that is, and how I should write the regex for what I want? (potentially with a number, not necessarily a single digit)
Edit
Requirement: the string potentially contains a number, not necessarily a single digit; the match may capture nothing, but it should not fail.
The output must be as follows (for the above example):
config.txt 1
The regex /(\d*).*/ always matches immediately, because it can match zero characters. It translates to match as many digits at this position as possible (zero or more). Then, match as many non-newline characters as possible. Well, the match starts looking at the c of config. Ok, it matches zero digits.
You probably want to use a regex like /\.(\d+)$/ -- this matches an integer number between a period . and the end of string.
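To see where each pattern actually matches, here is a small sketch that reports the match offset (from the built-in @- array) and the captured text. The helper name match_info is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset where the match starts and the
# text of the first capture group, or an empty list if there is no match.
sub match_info {
    my ($re, $text) = @_;
    return unless $text =~ $re;
    return ($-[0], $1);
}

my $line = 'config.txt.1';
for my $re (qr/(\d*).*/, qr/(\d+).*/, qr/\.(\d+)$/) {
    my ($start, $capture) = match_info($re, $line);
    print "$re: starts at offset $start, captured '$capture'\n";
}
```

The first pattern should report offset 0 with an empty capture, because \d* is satisfied by zero digits at the very first position. The \d+ version cannot succeed until the engine reaches the digit.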
Use the literal '.' as a reference to match before the number:
#!/usr/bin/perl
use strict;
use warnings;
my @line = qw(config.txt file.txt config.txt.1 config.foo.2 config.txt.23 differentname.fsdfsdsdfasd.2444);
my (@capture1, @capture2);
foreach (@line){
    my (@filematch) = ($_ =~ /(\w+\.\w+)/);
    my (@numbermatch) = ($_ =~ /\w+\.\w+\.?(\d*)/);
    my $numbermatch = $numbermatch[0] // $numbermatch[1];
    push @capture1, @filematch;
    push @capture2, @numbermatch;
}
print "$capture1[$_]\t$capture2[$_]\n" for 0 .. $#capture1;
Output:
config.txt
file.txt
config.txt 1
config.foo 2
config.txt 23
differentname.fsdfsdsdfasd 2444
Thanks guys, I think I figured out myself what I want:
my ($match) = $line =~ /\.(\d+)?/; #this will match and capture any digit
#number if there was one, and not fail
#if there wasn't one
To capture all digits following a final . and not fail the match if the string doesn't end with digits, use /(?:\.(\d+))?$/
perl -E 'if ("abc.123" =~ /(?:\.(\d+))?$/) { say "matched $1" } else { say "match failed" }'
matched 123
perl -E 'if ("abc" =~ /(?:\.(\d+))?$/) { say "matched $1" } else { say "match failed" }'
matched
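If the goal is the two-column output from the question (name, then number), the same optional group can be combined with a lazy capture for the name. This is only a sketch; the helper name split_suffix is hypothetical, and it assumes the number, when present, is the final dot-separated part of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: split "name.N" into (name, N); N is undef when the
# string does not end in a dot followed by digits.
sub split_suffix {
    my ($name) = @_;
    my ($base, $number) = $name =~ /\A(.*?)(?:\.(\d+))?\z/s;
    return ($base, $number);
}

for my $name (qw(config.txt config.txt.1 config.txt.23)) {
    my ($base, $number) = split_suffix($name);
    printf "%s\t%s\n", $base, $number // '';
}
```

The match never fails, since both the name part and the numeric suffix are allowed to be empty or absent.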
You do not need .* at all. These two statements assign the exact same number:
my ($match1) = $str =~ /(\d+).*/;
my ($match2) = $str =~ /(\d+)/;
A regex is unanchored by default, so it can match any part of the string; you do not need to add wildcards.
The reason your first match does not capture a number is that * can match zero times as well. The engine tries the leftmost position first, and there \d* succeeds with zero digits, so it never has to reach your number. The trailing .* gains you nothing in that regex. Unless something is truly optional, you should use + instead.

How can I extract a substring up to the first digit?

How can I find the first substring until I find the first digit?
Example:
my $string = 'AAAA_BBBB_12_13_14' ;
Result expected: 'AAAA_BBBB_'
Judging from the tags you want to use a regular expression. So let's build this up.
We want to match from the beginning of the string so we anchor with a ^ metacharacter at the beginning
We want to match anything but digits so we look at the character classes and find out this is \D
We want 1 or more of these so we use the + quantifier which means 1 or more of the previous part of the pattern.
This gives us the following regular expression:
^\D+
Which we can use in code like so:
my $string = 'AAAA_BBBB_12_13_14';
$string =~ /^\D+/;
my $result = $&;
Most people got half of the answer right, but they missed several key points.
You can only trust the match variables after a successful match. Don't use them unless you know you had a successful match.
The $&, $`, and $' variables have well-known performance penalties across all regexes in your program (at least in Perls before 5.20).
You need to anchor the match to the beginning of the string. Since Perl now has user-settable default match flags, you want to stay away from the ^ beginning of line anchor. The \A beginning of string anchor won't change what it does even with default flags.
This would work:
my $substring = $string =~ m/\A(\D+)/ ? $1 : undef;
If you really wanted to use something like $&, use Perl 5.10's per-match version instead. The /p switch provides non-global-performance-sucking versions:
my $substring = $string =~ m/\A\D+/p ? ${^MATCH} : undef;
If you're worried about what might be in \D, you can specify the character class yourself instead of using the shortcut:
my $substring = $string =~ m/\A[^0-9]+/p ? ${^MATCH} : undef;
I don't particularly like the conditional operator here, so I would probably use the match in list context:
my( $substring ) = $string =~ m/\A([^0-9]+)/;
If there must be a number in the string (so you don't match an entire string that has no digits), you can throw in a lookahead, which won't be part of the capture:
my( $substring ) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
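A short sketch of how that last form behaves on a few edge cases. The helper name prefix_before_digit and the extra sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the leading run of non-digits, but only when
# a digit actually follows it; otherwise return undef.
sub prefix_before_digit {
    my ($string) = @_;
    my ($prefix) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
    return $prefix;
}

for my $s ('AAAA_BBBB_12_13_14', 'no_digits_here', '42_leading') {
    my $prefix = prefix_before_digit($s);
    printf "%-20s -> %s\n", $s, defined $prefix ? "'$prefix'" : 'no match';
}
```

A string with no digits and a string that starts with a digit should both come back as no match, so the caller can tell them apart from a real prefix.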
$str =~ /(\d)/; print $`;
This code prints the part of the string that comes before the match.
perl -le '$string=q(AAAA_BBBB_12_13_14);$string=~m{(\D+)} and print $1'
AAAA_BBBB_

Regex to get just everything in CAPS

I am looking for a regex to get just the words in CAPS.
For example, I have an array that stores file paths, and these could follow any of the following patterns:
images/p/n/ct/XYZ-WW_V1.jpg
images/p/c/ABC-TY_V2.jpg
So basically I want just "XYZ-WW" and "ABC-TY" .
Any suggestions what regex to use in my split code . I am using the following
foreach (@filefound){
    my @result = split('_',$_);
    push @split1, $result[0];
}
This is just splitting at the _ and I am accessing the [0] the value but now I want to get just the part that is in CAPS .
Any Suggestions please !!
No reason to use split at all. Just grab the bits you want via a regular expression. From your example, it looks like you want everything which is made of capital ASCII letters and dashes:
my @bignames;
foreach (@filefound){
    if ( /([A-Z-]+)/ ) {
        push @bignames, $1;
    }
}
I'm thinking this should work:
[A-Z]+-[A-Z]+
foreach (@filefound) {
    # No leading .* here: a greedy .* would swallow all but the last letter before the dash.
    if ($_ =~ /([A-Z]+-[A-Z]+)_[A-Z]\d\..{3}$/) {
        push @split1, $1;
    }
}
You can try this:
if ($_ =~ /[\-A-Z]+/) {
    push @split1, $&;
}
that will match any combination of uppercase letters and -; or, if you want stricter control, this:
if ($_ =~ /\/([A-Z]{3}-[A-Z]{2})_/) {
    push @split1, $1;
}
which will match only a sequence of 3 uppercase letters followed by - and a sequence of 2 uppercase letters, preceded by a / and followed by _ (those two are excluded from the capture).
From these examples you can build the exact regex that you need.
Keep in mind that a match in list context will return captured strings.
#!/usr/bin/perl
use warnings; use strict;
use File::Basename qw(basename);
my @files = qw(
    images/p/n/ct/XYZ-WW_V1.jpg
    images/p/c/ABC-TY_V2.jpg
);
my @prefixes = map { (basename $_) =~ /^( [A-Z]+ - [A-Z]+ )/x } @files;
print "$_\n" for @prefixes;