Empty $1 and $2 values Regex Perl - regex

I have the following code:
my $sDatabase = "abc_def:xyz_comp.";
if ($sDatabase =~ m/^(\w)*\:(\w*)\_em\.$/)
{
print "$1\:$2\.\n";
}
else
{
print "$1\:$2\_em\.\n";
}
But I am getting empty $1 and $2. The output is:
Use of uninitialized value in concatenation (.) or string at new_mscn_iden_parse.pl line 187.
Use of uninitialized value in concatenation (.) or string at new_mscn_iden_parse.pl line 187.
:_em.

This code will do what you want
my $sDatabase = "abc_def:xyz_comp.";
$sDatabase =~ m/^(\w+):(\w+?)(_em)?\.$/ or die "Invalid data";
if ($3) {
print "$1:$2.\n";
}
else {
print "$1:$2_em.\n";
}

What do you expect $1 and $2 to contain when you fail to match?!
They contain whatever they contained before you attempted the match.
Possible solution:
$sDatabase =~ s/(?<!_em)(?=\.\z)/_em/;
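To see the point about stale capture variables in isolation, here is a minimal sketch (the first sample string and all variable names are made up for illustration). A failed match leaves $1 and $2 exactly as the last successful match in that scope set them, and the substitution above appends _em only when it is missing:

```perl
use strict;
use warnings;

# A successful match sets $1 and $2.
"first:second" =~ /^(\w+):(\w+)$/ or die "unexpected: no match";
my @before = ($1, $2);

# This match fails (there is no "_em" in the string),
# so $1 and $2 are left untouched.
my $matched = "abc_def:xyz_comp." =~ /^(\w+):(\w+)_em\.$/;
my @after = ($1, $2);

# The substitution adds "_em" before the final dot unless it is already there.
my @fixed;
for my $s ("abc_def:xyz_comp.", "abc_def:xyz_em.") {
    (my $copy = $s) =~ s/(?<!_em)(?=\.\z)/_em/;
    push @fixed, $copy;
}

print "before: @before\n";
print "after:  @after\n";
print "$_\n" for @fixed;
```

This is why testing the result of the match before touching $1 and $2 matters: without the test, you may silently print values left over from an earlier match.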

You have:
my $sDatabase = "abc_def:xyz_comp.";
if ($sDatabase =~ m/^(\w)*\:(\w*)\_em\.$/)
Let's see if this matches:
Your regular expression says:
Anchor at the start of a line.
You are looking for zero or more word characters. Word characters (in ASCII) are lowercase letters, uppercase letters, digits, and underscores.
Thus /\w*/ will match all the following:
Computer
computer
computer23
computer_32
an empty string
Next, you're looking for a colon
Then, more word characters
Followed by the string _em
Followed by a period
And that should be the end of the string (if there's no NL and you're not doing multi-line string searches; it looks like you're safe there).
Now, let's look at your string: abc_def:xyz_comp.
The first part, (\w)*, will match abc_def. Quantifiers are greedy and will try to match as much of the string as possible. (Because the * sits outside the parentheses, $1 would hold only the last character matched, f; write (\w*) to capture the whole run.)
The : will match the colon. So far, you're matching abc_def:.
The second \w* will match xyz_comp.
Now, you're trying to match _em. Oops! No good. There is no _em in your string, and backtracking doesn't help. Your regular expression match will fail.
Since your regular expression match fails, the $1 and $2 variables are simply not set by it and (in your case) have no value.
That's why you're getting Use of uninitialized value. What you can do is make the latter half of your expression optional:
my $sDatabase = "abc_def:xyz_comp.";
if ($sDatabase =~ /^(\w+):(\w+?)(_em)?\.$/) {
    if ( $3 ) {
        print "$1:${2}${3}.\n";
    }
    else {
        print "$1:${2}_em.\n";
    }
}
else {
    die qq(String doesn't match regular expression at all\n);
}
First of all, I think you want to match at least one character (I could be wrong), so I switched the asterisk, which matches zero or more, to a +, which matches one or more. I also moved the quantifier inside the first set of parentheses so that $1 captures the whole run of word characters rather than only the last one, and made the second group non-greedy (\w+?) so that it doesn't swallow a trailing _em before the optional group gets a chance to match it.
Note I have a third set of parentheses followed by a ?. This means match this zero or one times. Thus, you will have a match, and $1 and $2 will be set, as long as your string starts with one or more word characters, followed by a colon, followed by one or more word characters, followed by a period.
What won't necessarily happen is that $3 will be set. It will only be set if your string also ends with _em. (including the period). If your string doesn't include the _em but ends with a period, $1 and $2 will still match.
In your case, we could simplify it by doing this:
my $sDatabase = "abc_def:xyz_comp.";
if ($sDatabase =~ /^(\w+):(\w+?)(?:_em)?\.$/) {
    print "$1:${2}_em.\n";
}
else {
    die qq(String doesn't match regular expression at all\n);
}
The (?:...) means group without capturing. Thus, $3 will never be set. That's okay: either the string already ends in _em. or we add _em. to the end of the match anyway.
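Two details of these patterns are easy to get wrong: where the quantifier sits relative to the capturing parentheses, and whether the group before an optional _em is greedy. The following is a small sketch, using made-up variable names and a second sample string that is not from the question, that shows both effects side by side:

```perl
use strict;
use warnings;

my $s = "abc_def:xyz_comp.";

# Quantifier outside the group: the group is re-captured on every
# iteration, so only the last character survives.
my ($last_char) = $s =~ /^(\w)*:/;

# Quantifier inside the group: the whole run is captured.
my ($whole_run) = $s =~ /^(\w*):/;

# Greedy second group: it swallows "_em", so the optional group stays unset.
my @greedy = "abc:xyz_em." =~ /^(\w+):(\w*)(_em)?\.$/;

# Non-greedy second group: "_em" is left for the optional group.
my @lazy = "abc:xyz_em." =~ /^(\w+):(\w+?)(_em)?\.$/;

print "last char: $last_char\n";
print "whole run: $whole_run\n";
print "greedy:    $greedy[1]\n";
print "lazy:      $lazy[1] then $lazy[2]\n";
```

With the greedy form, a string that already ends in _em. would get a second _em appended by the code that prints "${2}_em", which is the reason for preferring the non-greedy group here.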

Related

Perl Regular expression | how to exclude words from a file

I am looking for Perl regular expression syntax for some requirements I have in a project.
First, I want to exclude strings that contain words from a text file (a dictionary).
For example, if my file contains these strings:
path.../Document.txt |
tree
car
ship
then the regular expression should behave like this:
a1testtre -- match
orangesh1 -- match
apleship3 -- no match [contains a word from the file]
I also have one more requirement that I couldn't solve. I have to create a regex that does not allow a string to contain a doubled character (the same character twice in a row) 3 or more times.
For example:
adminnisstrator21 -- match (has 2 doubled characters)
kkeeykloakk -- no match (has 3 doubled characters)
stack22ooverflow -- match (has 2 doubled characters)
For this I have tried
\b(?:([a-z])(?!\1))+\b
but it only works for the first doubled character.
Any idea how to solve these two?
To not match a word from a file you might check whether a string contains a substring or use a negative lookahead and an alternation:
^(?!.*(?:tree|car|ship)).*$
^ Assert start of string
(?! negative lookahead, assert what is on the right is not
.*(?:tree|car|ship) Match 0+ times any char except a newline and match either tree car or ship
) Close negative lookahead
.* Match any char except a newline
$ Assert end of string
To not allow a string to have over 3 times a char repeat you could use:
\b(?!(?:\w*(\w)\1){3})\w+\b
\b Word boundary
(?! Negative lookahead, assert what is on the right is not
(?: Non-capturing group
\w*(\w)\1 Match 0+ times a word character followed by capturing a word char in a group followed by a backreference using \1 to that group
){3} Close non capturing group and repeat 3 times
) close negative lookahead
\w+ Match 1+ word characters
\b word boundary
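As a quick check, both patterns can be run against the sample strings from the question. This is only a sketch: the word list is hard-coded here instead of being read from the file, and the variable names are made up.

```perl
use strict;
use warnings;

# Pattern 1: reject strings containing any listed word.
my $no_listed_word = qr/^(?!.*(?:tree|car|ship)).*$/;

# Pattern 2: reject words with three or more doubled characters.
my $under_three_pairs = qr/\b(?!(?:\w*(\w)\1){3})\w+\b/;

my (%word_ok, %pairs_ok);
$word_ok{$_}  = /$no_listed_word/    ? 1 : 0 for qw(a1testtre orangesh1 apleship3);
$pairs_ok{$_} = /$under_three_pairs/ ? 1 : 0 for qw(adminnisstrator21 kkeeykloakk stack22ooverflow);

print "$_: ", ($word_ok{$_}  ? "match" : "no match"), "\n" for sort keys %word_ok;
print "$_: ", ($pairs_ok{$_} ? "match" : "no match"), "\n" for sort keys %pairs_ok;
```

Only apleship3 and kkeeykloakk should be reported as "no match".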
Update
According to this posted answer (which you might add to the question instead) you have 2 patterns that you want to combine but it does not work:
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
In those 2 patterns you use 2 capturing groups, so the second pattern has to point to the second capturing group \2.
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\2){4}))*$)
(The only change is the backreference inside the second lookahead, from \1 to \2.)
One way to exclude strings that contain words from a given list is to form a pattern with an alternation of the words and use that in a regex, and exclude strings for which it matches.
use warnings;
use strict;
use feature qw(say);
use Path::Tiny;
my $file = shift // die "Usage: $0 file\n"; #/
my @words = split ' ', path($file)->slurp;
my $exclude = join '|', map { quotemeta } @words;
foreach my $string (qw(a1testtre orangesh1 apleship3))
{
if ($string !~ /$exclude/) {
say "OK: $string";
}
}
I use Path::Tiny to read the file into a string ("slurp"), which is then split on whitespace into words to use for exclusion. The quotemeta escapes non-"word" characters, should any occur in your words, which are then joined by | to form a string with a regex pattern. (With complex patterns use qr.)
This may be possible to tweak and improve, depending on your use cases, for one with regard to the order of patterns with common parts in the alternation.†
The check that successive duplicate characters do not occur more than three times
foreach my $string (qw(adminnisstrator21 kkeeykloakk stack22ooverflow))
{
my @chars_that_repeat = $string =~ /(.)\1+/g;
if (@chars_that_repeat < 3) {
say "OK: $string";
}
}
A long run of repeated chars (aaaa) counts as one instance, due to the + quantifier in the regex; if you'd rather count all pairs, remove the + and four a's will count as two pairs. The same char repeated at various places in the string counts every time, so aaXaa counts as two pairs.
This snippet can be just added to the above program, which is invoked with the name of the file with words to use for exclusion. They both print what is expected from provided samples.
†  Consider an example with exclusion-words: so, sole, and solely. If you only need to check whether any one of these matches then you'd want shorter ones first in the alternation
my $exclude = join '|', map { quotemeta } sort { length $a <=> length $b } @words;
#==> so|sole|solely
for a quicker match (so matches all three). This, by all means, appears to be the case here.
But, if you wanted to correctly identify which word matched then you must have longer words first,
solely|sole|so
so that a string solely is correctly matched by its word before it can be "stolen" by so. Then in this case you'd want it the other way round,
sort { length $b <=> length $a }
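The effect of the ordering can be seen directly by capturing what the alternation matched. A minimal sketch, with the three footnote words hard-coded and illustrative variable names:

```perl
use strict;
use warnings;

my @words = qw(sole so solely);

# Shorter alternatives first: so|sole|solely
my $short_first = join '|', map { quotemeta } sort { length $a <=> length $b } @words;

# Longer alternatives first: solely|sole|so
my $long_first = join '|', map { quotemeta } sort { length $b <=> length $a } @words;

# Perl tries alternatives left to right and takes the first that succeeds.
my ($hit_short) = "solely" =~ /($short_first)/;
my ($hit_long)  = "solely" =~ /($long_first)/;

print "short first: $hit_short\n";
print "long first:  $hit_long\n";
```

Both patterns match the string, so for a pure yes/no exclusion test either order gives the same answer; only the reported word differs.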
I hope someone else will come with a better solution, but this seems to do what you want:
\b Match word boundary
(?: Start non-capturing group
(?:([a-z0-9])(?!\1))* Match all characters until it encounters a double
(?:([a-z0-9])\2)+ Match all repeated characters until a different one is reached
){0,2} Match the group 0 to 2 times
(?:([a-z0-9])(?!\3))+ Match all characters until it encounters a double
\b Match end of word
I changed the [a-z] to also match numbers, since the examples you gave seem to also include numbers. Perl regex also has the \w shorthand, which is equivalent to [A-Za-z0-9_], which could be handy if you want to match any character in a word.
My problem is that I have two regexes that each work on their own:
Do not allow 3 or more pairs of chars:
(?=^(?!(?:\w*(.)\1){3}).+$)
Do not allow a char to repeat over 4 times:
(?=^(?:(.)(?!(?:.*?\1){4}))*$)
Now I want to combine them into one line, like:
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
but only the first regex takes effect, not both of them.
As mentioned in a comment on @zdim's answer, take it a bit further by making sure that the order in which your words are assembled into the match pattern doesn't trip you up. If the words in the file are not very carefully ordered to start with, I use a subroutine like this when building the match string:
# Returns a list of alternative match patterns in tight matching order.
# E.g., TRUSTEES before TRUSTEE before TRUST
# TRUSTEES|TRUSTEE|TRUST
sub tight_match_order {
    return @_ unless @_ > 1;
    my (@alts, @ordered_alts, %alts_seen);
    @alts = map { $alts_seen{$_}++ ? () : $_ } @_;
    TEST: {
        my $alt = shift @alts;
        if (grep m#$alt#, @alts) {
            push @alts => $alt;
        } else {
            push @ordered_alts => $alt;
        }
        redo TEST if @alts;
    }
    @ordered_alts
}
So following @zdim's answer:
...
my @words = split ' ', path($file)->slurp;
@words = tight_match_order(@words); # add this line
my $exclude = join '|', map { quotemeta } @words;
...
HTH

Perl Regular expression to replace the last matching string after dot

I have a string, $someString = "XXX.v2016.12.016". I am trying to replace the last three digits (after the last dot) with the same number incremented by one (output: "XXX.v2016.12.017"). Does anyone have an idea how to do this with a regex?
This problem has two parts: Matching the digits after the last dot, and replacing/incrementing them.
It's possible to do this with s///:
$someString =~ s{\.([0-9]+)\z}{
my $n = $1;
"." . ++$n
}e;
The regex matches a dot, followed by 1 or more digits, followed by the end of the string. This takes care of matching the last digit group.
The replacement part of a substitution normally behaves like a double-quoted string, but with the e flag it turns into a block of code.
We assign the captured group of digits ($1) to a temporary variable, $n. This is because we want to use the increment operator ++ on it, not just add 1. The ++ operator is a bit special in that it handles strings: For numeric strings it preserves leading zeroes, for example.
The return value of the replacement block is a string consisting of a . (to replace the one we matched), followed by the incremented digit string.
$someString =~ s{\.([0-9]+)\z}{ sprintf ".%03d", $1 + 1 }e;
If you don't want to hardcode the length (maybe because it varies), you can use the following:
$someString =~ s{\.([0-9]+)\z}{ sprintf ".%0*d", length($1), $1 + 1 }e;
In both cases, you can use \K to avoid having to re-add the ., but it actually makes the solution slightly longer.
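For comparison, here is a sketch that runs the approaches discussed in this thread on copies of the sample string, including a \K variant. The variable names are made up, and the \K form is one possible way to write it, not necessarily what the answer had in mind:

```perl
use strict;
use warnings;

my $version = "XXX.v2016.12.016";

# String increment on a copy of $1 keeps the leading zero.
(my $by_increment = $version) =~ s{\.([0-9]+)\z}{ my $n = $1; "." . ++$n }e;

# Fixed width of three digits via sprintf.
(my $by_sprintf = $version) =~ s{\.([0-9]+)\z}{ sprintf ".%03d", $1 + 1 }e;

# Width taken from the matched digits; \K keeps the dot out of the
# replaced text, so the replacement does not have to re-add it.
(my $by_keep = $version) =~ s{\.\K([0-9]+)\z}{ sprintf "%0*d", length($1), $1 + 1 }e;

print "$_\n" for $by_increment, $by_sprintf, $by_keep;
```

All three print the same incremented version string for this input. \K requires Perl 5.10 or later.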

Finding palindrome using regex

This question comes from an attempt to understand one of the answers to: How to check that a string is a palindrome using regular expressions?
The answer given by Markus Jarderot is:
/^((.)(?1)\2|.?)$/
Can someone please explain what exactly is happening here? I need to do something similar in Perl, but I am not able to understand this solution.
PS: I am not very good at Perl, so please go easy. I have also read the remark "this can't be considered a regular expression if you want to be strict", so I am aware that this is not strictly a regex.
^ - matches beginning of string
( - starts capture group #1
(.) - matches any single character except a newline, save it in capture group #2
(?1) - recurse = replace this group with the entire regexp capture group #1
\2 - matches the same thing as capture group #2. This requires the first and last characters of the string to match each other
| - creates an alternative
.? - optionally matches any one character that isn't a newline - This handles the end of the recursion, by matching an empty string (when the whole string is an even length) or a single character (when it's an odd length)
) - ends capture group #1
$ - matches end of string or before a newline at the end of the string.
The recursion (?1) is the key. A palindrome is an empty string, a 1-character string, or a string whose first and last characters are the same and the substring between them is also a palindrome.
It might be easier to understand with this analogous function, that does the same thing for arrays:
sub palindrome {
    if (scalar(@_) >= 2) {
        my $first_dot = shift;
        my $slash_two = pop;
        return $first_dot eq $slash_two && palindrome(@_);
    } else {
        # zero or one items
        return 1;
    }
}
print "yes!\n" if palindrome(qw(one two three two one));
print "really?\n" if palindrome(qw(one two three two two one));
The (?1) notation is a recursive reference to the start of the first parenthesis in the regex, the \2 is a backreference in the current recursion to the (.). Those two are anchored at the start and end of 'whatever is matching at the current recursion depth', so everything else is matched at the next depth down.
ikegami suspects this is faster:
sub palindrome {
    my $next = 0;
    my %symbols;
    # //= rather than ||=, so that the symbol numbered 0 keeps its number
    my $s = join '', map chr( $symbols{$_} //= $next++ ), @_;
    return $s =~ /^((.)(?1)\2|.?)\z/s;
}
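To try the recursive pattern from the question on plain strings, a small sketch like the following can be used. The test words and variable names are made up for illustration; (?1) needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# The pattern from the question, compiled once.
my $palindrome = qr/^((.)(?1)\2|.?)$/;

my %is_palindrome;
$is_palindrome{$_} = /$palindrome/ ? 1 : 0
    for ('', qw(a abba racecar ab abca));

for my $word (sort keys %is_palindrome) {
    printf "%-8s %s\n", "'$word'", $is_palindrome{$word} ? "palindrome" : "not a palindrome";
}
```

The empty string and single characters are accepted through the .? alternative, which is the base case of the recursion.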
I made this regex a few days ago.
If you use it like this, it will give you an array of all palindromes in a given text.
The example is in JavaScript, but you can use the regex itself in any language to do the job.
It works for words of up to 21 characters or numbers of up to 21 digits. You can extend it if you need to.
const palindromeFinder = /\b(\w?)(\w?)(\w?)(\w?)(\w?)(\w?)(\w?)(\w?)(\w?)(\w)\S?\10\9\8\7\6\5\4\3\2\1\b/g;
console.log(inputString.match(palindromeFinder));

Perl Regular Expression extracting sub-string?

I have a String variable containing something like ABCD.asd.qwe.com:/dir1.
I want to extract the ABCD portion, i.e. the portion from the beginning up to the first appearance of a dot (.). The problem is that there can be any number of alphanumeric characters before the dot. So I created this regexp:
if($arg =~ /(.*?\.?)/)
{
my $temp_name = $1;
}
However, it is giving me a blank string. The logic is:
.*? - any character non-greedily
\.? - till first or none appearance of .
What could be wrong?
You can instead use a negated character class, like this:
^[^.]+
[^.] matches any character except a dot (.)
[^.]+ matches one or more characters other than a dot
^ marks the start of the string
OR
^.+?(?=\.|$)
(?=...) is a lookahead, which checks for a particular pattern after the current position without consuming it. So for the text abcdad, with the regex a(?=b), only the first a would match.
$ marks the end of the string (or, with the multiline option, the end of each line)
\.? doesn't mean "till first or none appearance of .". It means "a . here or not".
If the first character of the string is .:
.*? matches 0 chars at position 0.
\.? matches 1 char at position 0.
$1 contains a single dot (.).
If the first character of the string isn't .:
.*? matches 0 chars at position 0.
\.? matches 0 chars at position 0.
$1 is empty.
To match ABCD, the following would do:
/^(.*?)\./
However, I hate the non-greedy modifier. It's fragile, in the sense that it stops doing what you want if you use two in the same pattern. I'd use the following instead ("match non-periods"):
/^([^.]*)\./
or even just
/^([^.]*)/
use strict;
my $string = "ABCD.asd.qwe.com:/dir1";
$string =~ /([^.]+)/;
my $capture = $1;
print "$capture\n";
Or you can use the split function, like this:
my $sub_string = ( split /\./, $string )[0];
print "$sub_string\n";
A general note: for an explanation of a complex regex, take a look at the YAPE::Regex::Explain module.
This should work:
if($arg =~ /(.*?)\..+/)
{
my $temp_name = $1;
}
That would match anything before the first dot (.).
You could change the .+ to .* if your input may end right after the first dot.
You could change the first .*? to .+? if you are sure that there is always at least one character before the first dot.
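The suggestions in this thread can be compared on the sample string with a short sketch like this (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $arg = "ABCD.asd.qwe.com:/dir1";

my ($original) = $arg =~ /(.*?\.?)/;    # pattern from the question
my ($anchored) = $arg =~ /^(.*?)\./;    # non-greedy, up to the first dot
my ($negated)  = $arg =~ /^([^.]*)/;    # run of non-dot characters
my $via_split  = (split /\./, $arg)[0]; # first field when splitting on dots

print "original: '$original'\n";
print "anchored: '$anchored'\n";
print "negated:  '$negated'\n";
print "split:    '$via_split'\n";
```

The question's pattern succeeds at position 0 with an empty match, which is why it yields a blank string, while the other three all yield the leading ABCD.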

help with perl regex rules

I need some help with a regex issue in Perl. I need to match the non-letter characters "nucleated" around each single letter.
That is to say, I have a string like
CDF((E)TR)FT
and I want to match ALL the following:
C, D, F((, ((E), )T, R), )F, T.
I was trying with something like
/([^A-Za-z]*[A-Za-z]{1}[^A-Za-z]*)/
but I'm obtaining:
C, D, F((, E), T, R), F, T.
It is as if, once a non-letter character has been matched, it can NOT be matched again in another match.
How can I do this?
A little late on this. Somebody has probably proposed this already.
I would consume the capture in the assertion to the left (via backref) and not consume the capture in the assertion to the right. All the captures can be seen, but the last one is not consumed, so the next pass continues right after the last atomic letter was found.
Character class is simplified for clarity:
/(?=([^A-Z]*))(\1[A-Z])(?=([^A-Z]*))/
(?=([^A-Z]*)) # ahead is optional non A-Z characters, captured in grp 1
(\1[A-Z]) # capture grp 2, consume capture group 1, plus atomic letter
(?=([^A-Z]*)) # ahead is optional non A-Z characters, captured in grp 3
Apply it globally in a while loop; the concatenation $2$3 (in that order) is the answer.
Test:
$samp = 'CDF((E)TR)FT';
while ( $samp =~ /(?=([^A-Z]*))(\1[A-Z])(?=([^A-Z]*))/g )
{
print "$2$3, ";
}
output:
C, D, F((, ((E), )T, R), )F, T,
The problem is that you consume the letters and non-letter characters the first time you encounter them, so you can't match everything you want. A solution is to use a different regex for each case and combine the results at the end:
This matches a letter preceded by one or more non-letters but NOT followed by a non-letter:
[^A-Z]+[A-Z](?![^A-Z])
This matches a letter enclosed by non-letters, including overlapping results:
(?=([^A-Z]+[A-Z][^A-Z]+))
This matches a letter followed by one or more non-letters, only if it is not preceded by a non-letter:
(?<![^A-Z])[A-Z][^A-Z]+
And this matches single letters that are not adjacent to any non-letter:
(?<![^A-Z])[A-Z](?![^A-Z])
By combining the results, you get the desired result:
C,D,T, )T, )F, ((E), F((, R)
Also if you understand the small parts you could join this into one Regex :
#!/usr/local/bin/perl
use strict;
my $subject = "0C0CC(R)CC(L)C0";
while ($subject =~ m/(?=([^A-Z]+[A-Z][^A-Z]+))|(?=((?<![^A-Z])[A-Z][^A-Z]+))|(?=((?<![^A-Z])[A-Z](?![^A-Z])))|(?=([^A-Z]+[A-Z](?![^A-Z])))/g) {
# matched text = $1, $2, $3, $4
print $1, " " if defined $1;
print $2, " " if defined $2;
print $3, " " if defined $3;
print $4, " " if defined $4;
}
Output :
0C0 0C C( (R) )C C( (L) )C0
You're right, once a character has been consumed in a regex match, it can't be matched again. In regex flavors that fully support lookaround assertions, you could do it with the regex
(?<=(\P{L}*))\p{L}(?=(\P{L}*))
where the match result would be the letter, and $1 and $2 would contain the non-letters around it. Since they are only matched in the context of lookaround assertions, they are not consumed in the match and can therefore be matched multiple times. You then need to construct the match result as $1 + $& + $2. This approach would work in .NET, for example.
In most other flavors (including Perl) that have limited support for lookaround, you can take a mixed approach, which is necessary because lookbehind expressions don't allow for indefinite repetition:
\P{L}*\p{L}(?=(\P{L}*))
Now $& will contain the non-letter characters before the letter and the letter itself, and $1 contains any non-letter characters that follow the letter.
while ($subject =~ m/\P{L}*\p{L}(?=(\P{L}*))/g) {
# matched text = $& . $1
}
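Filled in for the string from the question, the loop above might look like this. It is a sketch only; @pieces is a made-up name for the collected results.

```perl
use strict;
use warnings;

my $subject = 'CDF((E)TR)FT';
my @pieces;

# $& holds the leading non-letters plus the letter (consumed);
# $1 holds the trailing non-letters (only looked at, not consumed),
# so they are still available as the leading part of the next match.
while ($subject =~ /\P{L}*\p{L}(?=(\P{L}*))/g) {
    push @pieces, $& . $1;
}

print join(', ', @pieces), "\n";
```

For this input the pieces come out in the order the question asks for: C, D, F((, ((E), )T, R), )F, T.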
Or, you could do it the hard way and tokenize first, then process the tokens:
#!/usr/bin/perl
use warnings;
use strict;
my $str = 'CDF((E)TR)FT';
my @nucleated = nucleat($str);
print "$_\n" for @nucleated;
sub nucleat {
    my($s) = @_;
    my @parts; # return list stored here
    my @tokens = grep length, split /([a-z])/i, $s;
    # bracket the tokens with empty strings to avoid warnings
    unshift @tokens, '';
    push @tokens, '';
    foreach my $i (0..$#tokens) {
        next unless $tokens[$i] =~ /^[a-z]$/i; # one element per letter token
        my $str = '';
        if ($tokens[$i-1] !~ /^[a-z]$/i) { # punc before letter
            $str .= $tokens[$i-1];
        }
        $str .= $tokens[$i]; # the letter
        if ($tokens[$i+1] !~ /^[a-z]$/i) { # punc after letter
            $str .= $tokens[$i+1];
        }
        push @parts, $str;
    }
    return @parts;
}