Perl Regular expression to replace the last matching string after dot - regex

I have the string $someString = "XXX.v2016.12.016". I am trying to replace the last three digits (after the last dot) with the same number incremented by one (output: "XXX.v2016.12.017"). Does anyone have an idea how to do this with a regex?

This problem has two parts: Matching the digits after the last dot, and replacing/incrementing them.
It's possible to do this with s///:
$someString =~ s{\.([0-9]+)\z}{
my $n = $1;
"." . ++$n
}e;
The regex matches a dot, followed by 1 or more digits, followed by the end of the string. This takes care of matching the last digit group.
The replacement part of a substitution normally behaves like a double-quoted string, but with the e flag it turns into a block of code.
We assign the captured group of digits ($1) to a temporary variable, $n. This is because we want to use the increment operator ++ on it, not just add 1. The ++ operator is a bit special in that it handles strings: For numeric strings it preserves leading zeroes, for example.
The return value of the replacement block is a string consisting of a . (to replace the one we matched), followed by the incremented digit string.

You can also do the arithmetic yourself and zero-pad the result with sprintf:
$someString =~ s{\.([0-9]+)\z}{ sprintf ".%03d", $1 + 1 }e;
If you don't want to hard-code the length (maybe because it varies), you can use the following:
$someString =~ s{\.([0-9]+)\z}{ sprintf ".%0*d", length($1), $1 + 1 }e;
In both cases, you can use \K to avoid having to re-add the ., but it actually makes the solution slightly longer.
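For comparison, here is a rough Python sketch of the same idea. It is not from the answers above: the helper name bump_last_group is made up for illustration, and it assumes the string ends in a dot followed by digits. Like the length($1) variant, it pads the result back to the original width.

```python
import re

def bump_last_group(version):
    """Increment the digit group after the last dot, keeping its width."""
    def repl(match):
        digits = match.group(1)
        # zfill pads back to the original width; a carry may add one digit
        return "." + str(int(digits) + 1).zfill(len(digits))

    # \Z anchors at the very end of the string
    return re.sub(r"\.([0-9]+)\Z", repl, version)

print(bump_last_group("XXX.v2016.12.016"))  # XXX.v2016.12.017
```

A string that does not end in a dot plus digits is returned unchanged, because the substitution simply finds nothing to replace.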

Related

PowerShell Regular Expression match Y or Z

I am trying to match some strings using a regular expression in PowerShell, but I'm running into difficulty because the original strings I'm extracting from come in differing formats. I admittedly am not very strong with creating regular expressions.
I need to extract the numbers from each of these strings. These can vary in length, but in both cases they will be preceded by FOO:
PC1-FOO1234567
PC2-FOO1234567/FOO98765
This works for the second example:
'PC2-FOO1234567/FOO98765' -match 'FOO(.*?)\/FOO(.*?)\z'
It lets me access the matched strings using $matches[1] and $matches[2] which is great.
It obviously doesn't work for the first example. I suspect I need some way to match on either / or the end of the string but I'm not sure how to do this and end up with my desired match.
Suggestions?
You may use
'FOO(.*?)(?:/FOO(.*))?$'
It will match FOO, then capture any 0 or more chars as few as possible into Group 1 and then will attempt to optionally match a sequence of patterns: /FOO, any 0 or more chars as many as possible captured into Group 2 and then the end of string position should follow.
See the regex demo
Details
FOO - literal substring
(.*?) - Group 1: any zero or more chars other than newline, as few as possible
(?:/FOO(.*))? - an optional non-capturing group matching 1 or 0 repetitions of:
/FOO - a literal substring
(.*) - Group 2: any 0+ chars other than newline as many as possible (* is greedy)
$ - end of string.
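To check the breakdown above outside PowerShell, here is a small Python sketch using the same pattern. The function name extract_numbers is hypothetical. One assumption to keep in mind: PowerShell's -match is case-insensitive by default, while Python's re is case-sensitive, so this only matches an upper-case FOO.

```python
import re

# Same pattern as in the answer above
PATTERN = re.compile(r"FOO(.*?)(?:/FOO(.*))?$")

def extract_numbers(text):
    """Return the values captured after each FOO, or an empty list if no match."""
    match = PATTERN.search(text)
    if match is None:
        return []
    # group(2) is None when the optional /FOO part is absent
    return [g for g in match.groups() if g is not None]

print(extract_numbers("PC1-FOO1234567"))           # ['1234567']
print(extract_numbers("PC2-FOO1234567/FOO98765"))  # ['1234567', '98765']
```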
[edit - removed the unneeded pipe to Where-Object. thanks to mklement0 for that! [*grin*]]
this is a somewhat different approach. it splits on the foo, then replaces the unwanted / with nothing, and finally filters out any string that contains letters.
the pure regex solutions others offered will likely be faster, but this may be slightly easier to understand - and therefore to maintain. [grin]
# fake reading in a text file
# in real life, use Get-Content
$InStuff = @'
PC1-FOO1234567
PC2-FOO1234567/FOO98765
'@ -split [environment]::NewLine
$InStuff -split 'foo' -replace '/' -notmatch '[a-z]'
output ...
1234567
1234567
98765
To offer a more concise alternative with the -split operator, which obviates the need to access $Matches afterwards to extract the numbers:
PS> 'PC1-FOO1234568', 'PC2-FOO1234567/FOO98765' -split '(?:^PC\d+-|/)FOO' -ne ''
1234568 # single match from 1st input string
1234567 # first of 2 matches from 2nd input string
98765
Note: -split always returns a [string[]] array, even if only 1 string is returned; result strings from multiple input strings are combined into a single, flat array.
^PC\d+-|/ matches either PC followed by one or more (+) digits (\d) and a - at the start of the string (^), or (|) a / character. Together with the FOO that follows, this matches both PC2-FOO at the beginning and /FOO.
(?:...), a non-capturing subexpression, must be used to prevent -split from including what the subexpression matched in the results array.
-ne '' filters out the empty elements that result from the input strings starting with a separator.
To learn more about the regex-based -split operator and in what ways it is more powerful than the string literal-based .NET String.Split() method, see this answer.
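The split-based approach carries over to other regex engines as well. Below is a minimal Python sketch of it (the name split_numbers is made up for illustration). Python's re.split also leaves an empty first element when the input starts with a separator, so the sketch filters those out the same way -ne '' does.

```python
import re

# Separator: "PC<digits>-FOO" at the start of the string, or "/FOO" anywhere
SEPARATOR = re.compile(r"(?:^PC\d+-|/)FOO")

def split_numbers(lines):
    """Split every input line on the separator and flatten the results."""
    result = []
    for line in lines:
        # drop the empty leading piece produced by a separator at the start
        result.extend(piece for piece in SEPARATOR.split(line) if piece)
    return result

print(split_numbers(["PC1-FOO1234568", "PC2-FOO1234567/FOO98765"]))
# ['1234568', '1234567', '98765']
```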

Finding palindrome using regex

This question comes from an attempt to understand one of the answers to: How to check that a string is a palindrome using regular expressions?
Answer given by Markus Jarderot is :
/^((.)(?1)\2|.?)$/
Can someone please explain what exactly is happening here? I need to do something similar in Perl, but I'm not able to understand this solution.
PS: I am not very good at Perl, so please go easy. Also, I read the line "this can't be considered a regular expression if you want to be strict", so I am aware that this is not strictly a regex.
^ - matches beginning of string
( - starts capture group #1
(.) - matches any single character except a newline, save it in capture group #2
(?1) - recurse = replace this group with the entire regexp capture group #1
\2 - matches the same thing as capture group #2. This requires the first and last characters of the string to match each other
| - creates an alternative
.? - optionally matches any one character that isn't a newline - This handles the end of the recursion, by matching an empty string (when the whole string is an even length) or a single character (when it's an odd length)
) - ends capture group #1
$ - matches end of string or before a newline at the end of the string.
The recursion (?1) is the key. A palindrome is an empty string, a 1-character string, or a string whose first and last characters are the same and the substring between them is also a palindrome.
It might be easier to understand with this analogous function, that does the same thing for arrays:
sub palindrome {
    if (scalar(@_) >= 2) {
        my $first_dot = shift;
        my $slash_two = pop;
        return $first_dot eq $slash_two && palindrome(@_);
    } else {
        # zero or one items
        return 1;
    }
}
print "yes!\n" if palindrome(qw(one two three two one));
print "really?\n" if palindrome(qw(one two three two two one));
The (?1) notation is a recursive reference to the start of the first parenthesis in the regex, the \2 is a backreference in the current recursion to the (.). Those two are anchored at the start and end of 'whatever is matching at the current recursion depth', so everything else is matched at the next depth down.
ikegami suspects this is faster:
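sub palindrome {
    my $next = 0;
    my %symbols;
    # Map each distinct item to one character. //= (not ||=) is needed
    # because the first item gets code 0, which is false but defined.
    my $s = join '', map chr( $symbols{$_} //= $next++ ), @_;
    return $s =~ /^((.)(?1)\2|.?)\z/s;
}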
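The encoding step in that snippet is independent of the recursive regex. Python's built-in re module has no (?1)-style recursion, so the sketch below keeps only the "one character per distinct item" idea and swaps the regex test for a plain comparison with the reversed string. The function name is hypothetical.

```python
def is_palindrome(items):
    """Encode each distinct item as one character, then compare with the reverse."""
    codes = {}
    # setdefault assigns the next unused code the first time an item is seen
    encoded = "".join(chr(codes.setdefault(item, len(codes))) for item in items)
    return encoded == encoded[::-1]

print(is_palindrome("one two three two one".split()))      # True
print(is_palindrome("one two three two two one".split()))  # False
```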
I made this regex a few days ago.
If you use it like this, it will give you an array of all palindromes in a given text.
The example is in JavaScript, but you can use the regex itself in any language that supports backreferences.
It works for words of up to 21 characters or numbers of up to 21 digits (10 captured characters on each side plus an optional middle character). You can add more groups if you need longer matches.
const palindromeFinder = /\b(\w?)(\w?)(\w?)(\w?)(\w?)(\w?)(\w?)(\w?)(\w?)(\w)\S?\10\9\8\7\6\5\4\3\2\1\b/g;
console.log(inputString.match(palindromeFinder));

Empty $1 and $2 values Regex Perl

I have the following code:
my $sDatabase = "abc_def:xyz_comp.";
if ($sDatabase =~ m/^(\w)*\:(\w*)\_em\.$/)
{
print "$1\:$2\.\n";
}
else
{
print "$1\:$2\_em\.\n";
}
but I am getting empty $1 and $2. The output is:
Use of uninitialized value in concatenation (.) or string at new_mscn_iden_parse.pl line 187.
Use of uninitialized value in concatenation (.) or string at new_mscn_iden_parse.pl line 187.
:_em.
This code will do what you want
my $sDatabase = "abc_def:xyz_comp.";
$sDatabase =~ m/^(\w+):(\w+?)(_em)?\.$/ or die "Invalid data";
if ($3) {
print "$1:$2.\n";
}
else {
print "$1:$2_em.\n";
}
What do you expect $1 and $2 to contain when the match fails?
They contain whatever they contained before you attempted the match.
Possible solution:
$sDatabase =~ s/(?<!_em)(?=\.\z)/_em/;
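The same lookaround trick can be tried in Python, whose re module also supports fixed-width lookbehind. This is only a sketch; the name ensure_em_suffix is invented for illustration.

```python
import re

def ensure_em_suffix(name):
    """Insert "_em" before the final dot unless it is already there."""
    # (?<!_em) : the text just before this position is not "_em"
    # (?=\.\Z) : the next character is a dot, and the string ends after it
    return re.sub(r"(?<!_em)(?=\.\Z)", "_em", name)

print(ensure_em_suffix("abc_def:xyz_comp."))  # abc_def:xyz_comp_em.
print(ensure_em_suffix("abc_def:xyz_em."))    # abc_def:xyz_em.
```

The pattern matches an empty position rather than any characters, so the substitution inserts text without removing anything.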
You have:
my $sDatabase = "abc_def:xyz_comp.";
if ($sDatabase =~ m/^(\w)*\:(\w*)\_em\.$/);
Let's see if this matches:
Your regular expression says:
Anchor at the start of a line.
You are looking for zero or more word characters. Word characters (in the ASCII alphabet) include lowercase letters, uppercase letters, digits, and underscores.
Thus /\w*/ will match all the following:
Computer
computer
computer23
computer_32
an empty string
You're next looking for a colon
Then, more word characters
Followed by a _em string
Followed by a period
And that should be the end of the string (if there's no NL and you're not doing multi-line string searches. Looks like you're safe there).
Now, let's look at your string: abc_def:xyz_comp.
\w* will match up to abc_def. Regular expressions are greedy and will try to match as large a portion of the string as possible.
The : will match the colon. So far, you're matching abc_def:.
That \w* will match on xyz_comp.
Now, you're trying to match a _em. Oops! No good. There is no _em in your string. Your regular expression match will fail.
Since your regular expression match fails, the $1 and $2 variables simply are not set and have no value.
That's why you're getting Use of uninitialized value. What you can do is make the later half of your expression optional:
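my $sDatabase = "abc_def:xyz_comp.";
# (\w+) captures the whole first word; (\w)+ would keep only its last character.
# \w*? is non-greedy so that a trailing _em is left for the third group.
if ($sDatabase =~ /^(\w+):(\w*?)(_em)?\.$/) {
    if ( $3 ) {
        print "$1:${2}${3}.\n";
    }
    else {
        print "$1:${2}_em.\n";
    }
}
else {
    die qq(String doesn't match regular expression at all\n);
}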
First of all, I think you want to match at least one character (I could be wrong), so I switched the asterisk which matches zero or more to a + which matches one or more.
Note I have a third set of parentheses followed by a ?. This means match this zero or one times. Thus, you will have a match, and $1 and $2 will be set, as long as your string starts with one or more word characters, followed by a colon, followed by zero or more word characters and a period.
What won't necessarily happen is that $3 will be set. This will only be set if your string also ends with _em.. If your string doesn't include the _em, but ends with a period, $1 and $2 will still match.
In your case, we could simplify it by doing this:
my $sDatabase = "abc_def:xyz_comp.";
# \w*? is non-greedy so that an existing _em is not captured into $2
if ($sDatabase =~ /^(\w+):(\w*?)(?:_em)?\.$/) {
    print "$1:${2}_em.\n";
}
else {
    die qq(String doesn't match regular expression at all\n);
}
The (?:...) means don't capture, just group. Thus, $3 will never be set. That's okay: whether or not the string already ends in _em., we print _em. at the end of the output anyway.
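One detail worth checking in the optional-group approach: _ is itself a word character, so a greedy \w* can swallow a trailing _em before the optional group gets a chance to match it. The Python sketch below compares a greedy and a non-greedy second group. The names are hypothetical, and it assumes Python's re treats these constructs the same way Perl does.

```python
import re

GREEDY = re.compile(r"^(\w+):(\w*)(_em)?\.$")
LAZY = re.compile(r"^(\w+):(\w*?)(_em)?\.$")

def suffix_group(pattern, text):
    """Return what the third (optional) group captured, or None."""
    match = pattern.match(text)
    return match.group(3) if match else None

text = "abc_def:xyz_em."
print(suffix_group(GREEDY, text))  # None, because \w* already consumed "_em"
print(suffix_group(LAZY, text))    # _em
```

With the greedy form the overall match still succeeds, which is why the problem is easy to miss: only the contents of the groups differ.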

Perl Regular Expression extracting sub-string?

I have a String variable containing something like ABCD.asd.qwe.com:/dir1.
I want to extract the ABCD portion i.e. the portion from beginning till the first appearance of .. The problem is that there can be almost any characters (only alphanumeric) of any length before the .. So I created this regexp.
if($arg =~ /(.*?\.?)/)
{
my $temp_name = $1;
}
However it is giving me blank string. The logic is that :
.*? - any character non-greedily
\.? - till first or none appearance of .
What could be wrong?
You can instead use negative character class like this
^[^.]+
[^.] would match any character except .
[^.]+ would match 1 to many characters(except .)
^ depicts the start of string
OR
^.+?(?=\.|$)
(?=...) is a lookahead, which checks for a particular pattern after the current position without consuming it. So for the text abcdad with the regex a(?=b), only the first a would match.
$ matches at the end of the string (or just before a trailing newline); with the multiline option it also matches at the end of each line.
\.? doesn't mean "till first or none appearance of .". It means "a . here or not".
If the first character of the string is .:
.*? matches 0 chars at position 0.
\.? matches 1 char at position 0.
$1 contains ..
If the first character of the string isn't .:
.*? matches 0 chars at position 0.
\.? matches 0 chars at position 0.
$1 is empty.
To match ABCD, the following would do:
/^(.*?)\./
However, I hate the non-greedy modifier. It's fragile, in the sense that it stops doing what you want if you use two in the same pattern. I'd use the following instead ("match non-periods"):
/^([^.]*)\./
or even just
/^([^.]*)/
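The two cases described above are easy to reproduce. Here is a short Python sketch (first_label is a made-up helper name) that shows the empty capture from the original pattern next to the "match non-periods" fix and a plain split; it assumes Python's re behaves like Perl for these simple patterns.

```python
import re

def first_label(text):
    """Everything before the first dot (the whole string if there is none)."""
    return re.match(r"[^.]*", text).group(0)

host = "ABCD.asd.qwe.com:/dir1"

# Original pattern: .*? matches nothing, and \.? also matches nothing at "A"
print(repr(re.search(r"(.*?\.?)", host).group(1)))  # ''

print(first_label(host))      # ABCD
print(host.split(".", 1)[0])  # ABCD
```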
use strict;
my $string = "ABCD.asd.qwe.com:/dir1";
$string =~ /([^.]+)/;
my $capture = $1;
print"$capture\n";
OR you can also use Split function like,
my $sub_string = ( split /\./, $string )[0];
print"$sub_string\n";
Note in general: for an explanation of a regex (to understand a complex one), take a look at the YAPE::Regex::Explain module.
This should work:
if($arg =~ /(.*?)\..+/)
{
my $temp_name = $1;
}
That would match anything before the first . .
You could change the .+ to .* if your input may end after the first ..
You could change the first .*? to .+? if you are sure that there is always at least one character before the first ..

What does this regular expression try to match?

These days I am learning regular expressions, but it seems like a little hard to me. I am reading some code in TCL, but what does it want to match?
regexp ".* (\[\\d]\{3\}:\[\\d]\{3\}:\[\\d]\{3\}.\[\\d]\{5\}).\[^\\n]" $input
If you un-escape the characters, you get the following:
.* ([\d]{3}:[\d]{3}:[\d]{3}.[\d]{5}).[^\n]
The term [\d]{x} would match x number of consecutive digits. Therefore, the portion inside the parentheses would match something of the form ###:###:###?##### (where # can be any digit and ? can be any character). The parentheses themselves aren't matched, they're just used for specifying what part of the input to "capture" and return to the caller. Following this sequence is a single dot ., which matches a single character (which can be anything). The trailing [^\n] will match a single character that is anything except a newline (a ^ at the start of a bracketed expression inverts the match). The .* term at the very beginning matches a sequence of characters of any length (even zero), followed by a space.
With all of this taken into account, it appears that this regular expression extracts a series of digits from the middle of a line. Given the format of the numbers, it may be looking for a timestamp in the hours:minutes:seconds.milliseconds format (although if that is the case, {1,3} and {1,5} should be used instead). The trailing .[^\n] term looks like it could be trying to exclude timestamps that are at or near the end of a line. Timestamped logs often have a timestamp followed by some sort of delimiting character (:, >, a space, etc). A regular expression like this might be used to extract timestamps from the log while ignoring "blank" lines that have a timestamp but no message.
Update:
Here's an example using TCL 8.4:
% set re ".* (\[\\d]\{3\}:\[\\d]\{3\}:\[\\d]\{3\}.\[\\d]\{5\}).\[^\\n]"
% regexp $re "TEST: 123:456:789:12345> sample log line"
1
% regexp $re " 111:222:333.44444 foo"
1
% regexp $re "111:222:333.44444 foo"
0
% regexp $re " 111:222:333.44444 "
0
% regexp $re " 10:44:56.12344: "
0
%
% regexp $re "TEST: 123:456:789:12345> sample log line" match data
1
% puts $match
TEST: 123:456:789:12345>
% puts $data
123:456:789:12345
The first two examples match the expression. The third fails because it lacks the space character before the first number sequence. The fourth fails because it doesn't have a non-newline character at the end after the trailing space. The fifth fails because the numerical sequences don't have enough digits. By passing parameters after the input, you can store the part of the input that matched the expression as well as the data that was "captured" by using parentheses. See the TCL wiki for details on the regexp command.
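The same five inputs can be run against the un-escaped pattern in Python as a cross-check. This is only a sketch with a hypothetical helper name. The samples contain no newlines, so any difference between Tcl and Python in how . treats newlines does not matter here.

```python
import re

# The pattern after removing the Tcl escaping
TIMESTAMP = re.compile(r".* ([\d]{3}:[\d]{3}:[\d]{3}.[\d]{5}).[^\n]")

def capture(line):
    """Return the captured number sequence, or None when the line does not match."""
    match = TIMESTAMP.search(line)
    return match.group(1) if match else None

samples = [
    "TEST: 123:456:789:12345> sample log line",  # matches
    " 111:222:333.44444 foo",                    # matches
    "111:222:333.44444 foo",                     # no space before the digits
    " 111:222:333.44444 ",                       # nothing after the trailing space
    " 10:44:56.12344: ",                         # too few digits
]
for sample in samples:
    print(repr(sample), "->", capture(sample))
```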
The interesting part with TCL is that you have to escape the [ character but not the ], while both the { and } need escaping.
.* ==> match junk part of the input
( ==> start capture
\[\\d]\{3\}: ==> match 3 digits followed by ':'
\[\\d]\{3\}: ==> match 3 digits followed by ':'
\[\\d]\{3\}. ==> match 3 digits followed by any character
\[\\d]\{5\} ==> match 5 digits
). ==> close capture and match any character
\[^\\n] ==> match a character that is not a newline