cant get the perl regex to work - regex

My perl is getting rusty. It only prints "matched=" but $1 is blank!?!
EDIT 1: WHo the h#$! downvoted this? There are no wrong questions. If you dont like it, move on to next one!
$crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/([.\n\r]+)/gsi) {
print "matched=", $1, "\n";
} else {
print "not matched!\n";
}
EDIT 2: This is the code fragment with updated regex, works great!
$crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/([\s\S]+)/gsi) {
print "matched=", $1, "\n";
} else {
print "not matched!\n";
}
EDIT 3: Haha, i see perl police strikes yet again!!!

I don't know if this is your exact problem, but inside square brackets, '.' is just looking for a period. I didn't see a period in the input, so I wondered which you meant.
Aside from the period, the rest of the character class is looking for consecutive whitespace. And as you didn't use the multiline switch, you've got newlines being counted as whitespace (and any character), but no indication to scan beyond the first record separator. But because of the way that you print it out, it also gives some indication that you meant more than the literal period, as mentioned above.

Axeman is correct; your problem is that . in a character class doesn't do what you expect.
By default, . outside a character class (and not backslashed) matches any character but a newline. If you want to include newlines, you specify the /s flag (which you seem to already have) on your regex or put the . in a (?s:...) group:
my $crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/((?s:.+))/) {
print "matched=", $1, "\n";
} else {
print "not matched!\n";
}

. in a character class is a literal period, not match anything. What you really want is /(.+)/s. The /g flag says to match multiple times, but you are using the regex in scalar context, so it will only match the first item. The /i flag makes the regex case insensitive, but there are no characters with case in your regex. The \s flag makes . match newlines, and it always matches "\r", so instead of [.\n\r], you can just use ..
However, /(.+)/s will match any string with one or more characters, so you would be better off with
my $crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if (length $crazy) {
print "matched=$crazy\n";
} else {
print "not matched!\n";
}
It is possible you meant to do something like this:
#!/usr/bin/perl
use strict;
use warnings;
my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";
while ($crazy =~ /(.+)[\r\n]+/g) {
print "matched=$1\n";
}
But that would probably be better phrased:
#!/usr/bin/perl
use strict;
use warnings;
my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";
for my $part (split /[\r\n]+/, $crazy) {
print "matched=$part\n";
}

$1 contains white space, that's why you don't see it in a print like that, just add something after it/quote it.
Example:
perl -E "qq'abcd\r\nallo\nXYZ\n\n\nQQQ'=~/([.\n\r]+)/gsi;say 'got(',length($1),qq') >$1<';"
got(2) >
<
Updated for your comments:
To match everything you can simply use /(.+)/s

[.] (dot inside a character class) does not mean "match any character", it just means match the literal . character. So in an input string without any dots,
m/([.\n\r]+)/gsi
will just match strings of \n and \r characters.
With the /s modifier, you are already asking the regex engine to include newlines with . (match any character), so you could just write
m/(.+)/gsi

Related

perl Regex replace for specific string length

I am using Perl to do some prototyping.
I need an expression to replace e by [ee] if the string is exactly 2 chars and finishes by "e".
le -> l [ee]
me -> m [ee]
elle -> elle : no change
I cannot test the length of the string, I need one expression to do the whole job.
I tried:
`s/(?=^.{0,2}\z).*e\z%/[ee]/g` but this is replacing the whole string
`s/^[c|d|j|l|m|n|s|t]e$/[ee]/g` same result (I listed the possible letters that could precede my "e")
`^(?<=[c|d|j|l|m|n|s|t])e$/[ee]/g` but I have no match, not sure I can use ^ on a positive look behind
EDIT
Guys you're amazing, hours of search on the web and here I get answers minutes after I posted.
I tried all your solutions and they are working perfectly directly in my script, i.e. this one:
my $test2="le";
$test2=~ s/^(\S)e$/\1\[ee\]/g;
print "test2:".$test2."\n";
-> test2:l[ee]
But I am loading these regex from a text file (using Perl for proto, the idea is to reuse it with any language implementing regex):
In the text file I store for example (I used % to split the line between match and replace):
^(\S)e$% \1\[ee\]
and then I parse and apply all regex like that:
my $test="le";
while (my $row = <$fh>) {
chomp $row;
if( $row =~ /%/){
my #reg = split /%/, $row;
#if no replacement, put empty string
if($#reg == 0){
push(#reg,"");
}
print "reg found, reg:".$reg[0].", replace:".$reg[1]."\n";
push #regs, [ #reg ];
}
}
print "orgine:".$test."\n";
for my $i (0 .. $#regs){
my $p=$regs[$i][0];
my $r=$regs[$i][1];
$test=~ s/$p/$r/g;
}
print "final:".$test."\n";
This technique is working well with my other regex, but not yet when I have a $1 or \1 in the replace... here is what I am obtaining:
final:\1\ee\
PS: you answered to initial question, should I open another post ?
Something like s/(?i)^([a-z])e$/$1[ee]/
Why aren't you using a capture group to do the replacement?
`s/^([c|d|j|l|m|n|s|t])e$/\1 [ee]/g`
If those are the characters you need and if it is indeed one word to a line with no whitespace before it or after it, then this will work.
Here's another option depending on what you are looking for. It will match a two character string consisting of one a-z character followed by one 'e' on its own line with possible whitespace before or after. It will replace this will the single a-z character followed by ' [ee]'
`s/^\s*([a-z])e\s*$/\1 [ee]/`
^(\S)e$
Try this.Replace by $1 [ee].See demo.
https://regex101.com/r/hR7tH4/28
I'd do something like this
$word =~ s/^(\w{1})(e)$/$1$2e/;
You can use following regex which match 2 character and then you can replace it with $1\[$2$2\]:
^([a-zA-Z])([a-zA-Z])$
Demo :
$my_string =~ s/^([a-zA-Z])([a-zA-Z])$/$1[$2$2]/;
See demo https://regex101.com/r/iD9oN4/1

Use Perl to check if a string has only English characters

I have a file with submissions like this
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
I am stripping everything but the song name by using this regex.
$line =~ s/.*>|([([\/\_\-:"``+=*].*)|(feat.*)|[?¿!¡\.;&\$#%#\\|]//g;
I want to make sure that the only strings printed are ones that contain only English characters, so in this case it would the first song title Ai Wo Quing shut up and not the next one because of the è.
I have tried this
if ( $line =~ m/[^a-zA-z0-9_]*$/ ) {
print $line;
}
else {
print "Non-english\n";
I thought this would match just the English characters, but it always prints Non-english. I feel this is me being rusty with regex, but I cannot find my answer.
Following from the comments, your problem would appear to be:
$line =~ m/[^a-zA-z0-9_]*$/
Specifically - the ^ is inside the brackets, which means that it's not acting as an 'anchor'. It's actually a negation operator
See: http://perldoc.perl.org/perlrecharclass.html#Negation
It is also possible to instead list the characters you do not want to match. You can do so by using a caret (^) as the first character in the character class. For instance, [^a-z] matches any character that is not a lowercase ASCII letter, which therefore includes more than a million Unicode code points. The class is said to be "negated" or "inverted".
But the important part is - that without the 'start of line' anchor, your regular expression is zero-or-more instances (of whatever), so will match pretty much anything - because it can freely ignore the line content.
(Borodin's answer covers some of the other options for this sort of pattern match, so I shan't reproduce).
It's not clear exactly what you need, so here are a couple of observations that speak to what you have written.
It is probably best if you use split to divide each line of data on <SEP>, which I presume is a separator. Your question asks for the fourth such field, like this
use strict;
use warnings;
use 5.010;
while ( <DATA> ) {
chomp;
my #fields = split /<SEP>/;
say $fields[3];
}
__DATA__
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
output
Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
Achète-moi
Also, the word character class \w matches exactly [a-zA-z0-9_] (and \W matches the complement) so you can rewrite your if statement like this
if ( $line =~ /\W/ ) {
print "Non-English\n";
}
else {
print $line;
}

Matching numbers for substitution in Perl

I have this little script:
my #list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (#list) {
s/(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the regex gives me the string start for $1 on this case.
Try this pattern
foreach (#list) {
s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
print $_ . "\n";
}
Explanations:
I use here the branch reset feature (i.e. (?|...()...|...()...)) that allows to put several capturing groups in a single reference ( $1 here ). So, you avoid using a second replacement to trim a zero from the left of the capture.
To remove all from the begining before the number, I use :
.*? # all characters zero or more times
# ( ? -> make the * quantifier lazy to match as less as possible)
_? # an optional underscore
Note that you can ensure that you have only 2 digits adding a lookahead to check if there is not a digit that follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means not followed by a digit.
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my #list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (#list) {
if (/(\d{2})/) {
$_ = "$1.txt";
}
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
I would modify your regular expression. Try using this code:
my #list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (#list) {
s/.*(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The problem is that the first part in your s/// matches, what you think it does, but that the second part isn't replacing what you think it should. s/// will only replace what was previously matched. Thus to replace something like T3_ you will have to match that too.
s/.*(\d{2}).*\.txt$/$1.txt/;

How to match a string which starts with either a new line or after a comma?

My string is $tables="newdb1.table1:100,db2.table2:90,db1.table1:90". My search string is db1.table1 and my aim is to extract the value after : (i.e 90 in this case).
I am using:
if ($tables =~ /db1.table1:(\d+)/) { print $1; }
but the problem is it is matching newdb1.table1:100 and printing 100.
Can you please give my a regular expression to match a string which either starts with a newline or has comma before it.
Use word boundaries:
if ($tables =~ /\bdb1.table1:(\d+)/) { print $1; }
here __^^
if ($tables =~ /(^|,)db1.table1:(\d+)/) { print $2; }
To answer your exact question, that is to match just after the start of the string or a comma, you want a positive look-behind assertion. You may be tempted to write a pattern of
/(?<=^|,)db1\.table1:(\d+)/
but that may fail with an error of
Variable length lookbehind not implemented in regex m/(?<=^|,)db1\.table1:(\d+)/ ...
So hold the regex engine’s hand a bit by making the alternatives equal in length—tricky to do in the general case but workable here.
/(?<=^d|,)db1\.table1:(\d+)/
While we are locking it down, let’s be sure to bracket the end with a look-ahead assertion.
while ($tables =~ /(?<=^d|,)db1\.table1:(\d+)(?=,|$)/g) {
print "[$1]\n";
}
Output:
[90]
You could also use \b for a regex word boundary, which has the same output.
while ($tables =~ /\bdb1\.table1:(\d+)(?=,|$)/g) {
print "[$1]\n";
}
For the most natural solution, follow the rule of thumb proposed by Randal Schwartz, author of Learning Perl. Use capturing when you know what you want to keep and split when you know what you want to throw away. In your case you have a mixture: you want to discard the comma separators, and you want to keep the digits after the colon for a certain table. Write that as
for (split /\s*,\s*/, $tables) { # / to fix Stack Overflow highlighting
if (my($value) = /^db1\.table1:(\d+)$/) {
print "[$value]\n";
}
}
Output:
[90]

How can I find repeated letters with a Perl regex?

I am looking for a regex that will find repeating letters. So any letter twice or more, for example:
booooooot or abbott
I won't know the letter I am looking for ahead of time.
This is a question I was asked in interviews and then asked in interviews. Not so many people get it correct.
You can find any letter, then use \1 to find that same letter a second time (or more). If you only need to know the letter, then $1 will contain it. Otherwise you can concatenate the second match onto the first.
my $str = "Foooooobar";
$str =~ /(\w)(\1+)/;
print $1;
# prints 'o'
print $1 . $2;
# prints 'oooooo'
I think you actually want this rather than the "\w" as that includes numbers and the underscore.
([a-zA-Z])\1+
Ok, ok, I can take a hint Leon. Use this for the unicode-world or for posix stuff.
([[:alpha:]])\1+
I Think using a backreference would work:
(\w)\1+
\w is basically [a-zA-Z_0-9] so if you only want to match letters between A and Z (case insensitively), use [a-zA-Z] instead.
(EDIT: or, like Tanktalus mentioned in his comment (and as others have answered as well), [[:alpha:]], which is locale-sensitive)
Use \N to refer to previous groups:
/(\w)\1+/g
You might want to take care as to what is considered to be a letter, and this depends on your locale. Using ISO Latin-1 will allow accented Western language characters to be matched as letters. In the following program, the default locale doesn't recognise é, and thus créé fails to match. Uncomment the locale setting code, and then it begins to match.
Also note that \w includes digits and the underscore character along with all the letters. To get just the letters, you need to take the complement of the non-alphanum, digits and underscore characters. This leaves only letters.
That might be easier to understand by framing it as the question:
"What regular expression matches any digit except 3?"
The answer is:
/[^\D3]/
#! /usr/local/bin/perl
use strict;
use warnings;
# uncomment the following three lines:
# use locale;
# use POSIX;
# setlocale(LC_CTYPE, 'fr_FR.ISO8859-1');
while (<DATA>) {
chomp;
if (/([^\W_0-9])\1+/) {
print "$_: dup [$1]\n";
}
else {
print "$_: nope\n";
}
}
__DATA__
100
food
créé
a::b
The following code will return all the characters, that repeat two or more times:
my $str = "SSSannnkaaarsss";
print $str =~ /(\w)\1+/g;
Just for kicks, a completely different approach:
if ( ($str ^ substr($str,1) ) =~ /\0+/ ) {
print "found ", substr($str, $-[0], $+[0]-$-[0]+1), " at offset ", $-[0];
}
FYI, aside from RegExBuddy, a real handy free site for testing regular expressions is RegExr at gskinner.com. Handles ([[:alpha:]])(\1+) nicely.
How about:
(\w)\1+
The first part makes an unnamed group around a character, then the back-reference looks for that same character.
I think this should also work:
((\w)(?=\2))+\2
/(.)\\1{2,}+/u
'u' modifier matching with unicode