Perl regex | Match second from the right - regex

I'm trying to parse an OID and extract the #18 but I am unsure on how to write it to count Right to Left using a dot as a delimiter:
1.3.6.1.2.1.31.1.1.1.18.10035
This regex will grab the last value
my $ifindex = ($_=~ /^.*[.]([^.]*)$/);
I haven't found a way to tweak it to get the value I need yet.

How about:
my $str = "1.3.6.1.2.1.31.1.1.1.18.10035";
say ((split(/\./, $str))[-2]);
output:
18

If the format is always the same (ie. always second from right) then you can either use:-
m/(\d+)\.\d+$/;
..and the answer will end up in: $1
Or a different approach would be to split the string into an array on the dots and examine the penultimate value in the array.

What you need is simpler:
my $ifindex;
if (/(\d+)\.\d+$/)
{
$ifindex = $1;
}
A couple of comments:
You don't need to match the entire string, only the part you care about. Thus, no need to anchor to the beginning with ^ and use .*. Anchor to the end only.
[.] is a character class, intended for matching groups of characters. e.g., [abc] will match either a, b, or c. It should be avoided when matching a single character; just match that character instead. In this case you do need to escape it, since it is a special character: \..
I have assumed based on your example that all of the terms have to be numbers. Hence, I used \d+ for the terms.

my $ifindex = ($_=~ /^.*[.]([^.]*)[.][^.]*$/);

Related

How can I match only integers in Perl?

So I have an array that goes like this:
my #nums = (1,2,12,24,48,120,360);
I want to check if there is an element that is not an integer inside that array without using loop. It goes like this:
if(grep(!/[^0-9]|\^$/,#nums)){
die "Numbers are not in correct format.";
}else{
#Do something
}
Basically, the format should not be like this (Empty string is acceptable):
1A
A2
#A
#
#######
More examples:
1,2,3,A3 = Unacceptable
1,2,###,2 = unacceptable
1,2,3A,4 = Unacceptable
1, ,3,4=Acceptable
1,2,3,360 = acceptable
I know that there is another way by using look like a number. But I can't use that for some reason (outside of my control/setup reasons). That's why I used the regex method.
My question is, even though the numbers are in not correct format (A60 for example), the condition always return False. Basically, it ignores the incorrect format.
You say in the comments that you don't want to use modules because you can't install them, but there are many core modules that should come with Perl (although some systems screw this up).
zdim's answer in the comments is to look for anything that is not 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. That's the negated character class [^0-9]. A grep in scalar context returns the number of items that match:
my $found_non_ints = grep { /[^0-9]/ } #items;
Instead of that, I'd go back to the non-negated character class and match string that only has zero or more digits. To do this, anchor the pattern to the absolute start and end of the string:
my $found_non_ints = grep { ! /\A[0-9]*\z/ } #items;
But, this doesn't really match integers. It matches positive whole numbers (and zero). If you want to match negative numbers as well, allow an optional - at the start of the string:
my $found_non_ints = grep { ! /\A-?[0-9]*\z/ } #items;
That - would be a problem in the negated character class.
Also, you don't want the $ anchor here: that allows a possible newline to match at the end, and that's a non-digit (the \Z is the same for the end of the string). Also, the meaning of $ can change based on the setting of the /m flag, which might be set with default regex flags.
Here's a short program with your sample data. Note that you need to decide how to split up the list; does whitespace matter? I decided to remove whitespace around the comma:
#!perl
use v5.10;
while( <DATA> ) {
chomp;
my $found_non_ints = grep { ! /\A[0-9]*\z/ } split /\s*,\s*/;
say "$_ => Found $found_non_ints non-ints";
}
__DATA__
1A
A2
#A
#
1,2,3,A3
1,2,###,2
1,2,3A,4
1, ,3,4
1,,3,4
1,2,3,360
The solution proposed in the question gets close, except that the logic got reversed and there is an error in a regex pattern. One way for it:
if ( grep { /[^0-9] | ^$/x } #nums ) { say 'not all integers' }
Regex explanation
[] is a character class: it matches any one of the characters listed inside (so [abc] matches either of a, b, or c) -- but when it starts with a ^ it matches any character not listed; so [^abc] matches any char not being either of a, b, or c. The pattern 0-9 inside a character class specifies all digits in that range (and we can also use a-z and A-Z)
So [^0-9] matches any character that is not a digit
Then that is or-ed by | with a ^$: ^ matches beginning of the string and $ is for the end of it. So ^$ match a string without anything -- an empty string! We need to account for that as [^0-9] doesn't while an array element can be an empty string. (It can also be a undef but from my understanding that is not possible with actual data, and a regex on undef would draw a warning.)
Note that $ allows for a newline as well, and that ^ and $ may change their meaning if /m modifier is in use, matching on linefeeds inside a string. However, in all these cases we'd be matching a non-digit, which is precisely the point here
/x modifier makes it disregard literal spaces inside so we can space things out for easier reading. (It also allows for newlines and comments with #, so complex patterns can be organized and documented very nicely)
So that's all -- the regex tries to match anything that shouldn't be in an integer (assumed to be strictly positive in OP's data).
If it matches any such, in any one of the array elements, then grep returns a list which isn't empty (but has at least one element) and that is "true" under if. So we caught a non-integer and we go into if's block to deal with that.
A little aside: we can also declare and populate an array right inside the if condition, to catch all those non-integers:
if ( my #non_ints = grep { /[^0-9] | ^$/x } #nums ) {
say 'Non-integers: ', join ' ', map { "|$_|" } #non_ints;
}
This also reads more nicely, telling by the array name what we're after in that complicated condition: "non_ints." I put || around each item in print to be able to see an empty string.†
Now, when you put an exclamation mark in front of that regex, it reverses the true/false return from the regex and our code goes haywire. So drop that !.
The other error is in escaping the ^ by having \^. This would match a literal ^ character, robbing ^ of its special meaning as a pattern in regex, explained above. So drop that \.
One other way is in using an extremely useful List::Util library, which is "core" (so it is normally installed with Perl, even though that can get messed up).
Among a number of essential functions it gives us any, and with it we have
use List::Util qw(any);
if ( any { /[^0-9]|^$/ } #nums ) { say 'not all integers' }
I like any firstly because the name of the function includes at least a part of the needed logic, making code that much clearer and easier to comprehend: is there any element of #nums for which the code in the block is true? So any element which contains a non-digit? Precisely what is needed here.
Then, another advantage is that any will quit as soon as it finds one match, while grep continues through the whole list. But this efficiency advantage shows only on very large arrays or a lot of repeated checks. Also, on the other hand sometimes we want to count all instances.
I'd also like to point out some of any's siblings: none and notall. These names themselves also capture a good deal of logic, making otherwise possibly convoluted code that much clearer. Browse through this library to get accustomed to what is in there.
† A program with your test data
use warnings;
use strict;
use feature 'say';
while (<DATA>) {
chomp;
my #nums = split /\s*,\s*/;
say "Data: #nums";
if ( my #non_ints = grep { /[^0-9] | ^$/x } #nums ) {
say 'Non-ints: ', join ' ', map { "|$_|" } #non_ints;
}
say '---';
}
__DATA__
1A
A2
#A
#
1,2,3,A3
1,2,###,2
1,2,3A,4
1, ,3,4
1,2,3,360

Regex: Separate a string of characters with a non-consistent pattern (Oracle) (POSIX ERE)

EDIT: This question pertains to Oracle implementation of regex (POSIX ERE) which does not support 'lookaheads'
I need to separate a string of characters with a comma, however, the pattern is not consistent and I am not sure if this can be accomplished with Regex.
Corpus: 1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25
The pattern is basically 4 digits, followed by 4 characters, followed by a dot, followed by 1,2, or 3 digits! To make the string above clear, this is how it looks like separated by a space 1710ABCD.13 1711ABCD.43 1711ABCD.4 1711ABCD.404 1711ABCD.25
So the output of a replace operation should look like this:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
I was able to match the pattern using this regex:
(\d{4}\w{4}\.\d{1,3})
It does insert a comma but after the third digit beyond the dot (wrong, should have been after the second digit), but I cannot get it to do it in the right position and globally.
Here is a link to a fiddle
https://regex101.com/r/qQ2dE4/329
All you need is a lookahead at the end of the regular expression, so that the greedy \d{1,3} backtracks until it's followed by 4 digits (indicating the start of the next substring):
(\d{4}\w{4}\.\d{1,3})(?=\d{4})
^^^^^^^^^
https://regex101.com/r/qQ2dE4/330
To expand on #CertainPerformance's answer, if you want to be able to match the last token, you can use an alternative match of $:
(\d{4}\w{4}\.\d{1,3})(?=\d{4}|$)
Demo: https://regex101.com/r/qQ2dE4/331
EDIT: Since you now mentioned in the comment that you're using Oracle's implementation, you can simply do:
regexp_replace(corpus, '(\d{1,3})(\d{4})', '\1,\2')
to get your desired output:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
Demo: https://regex101.com/r/qQ2dE4/333
In order to continue finding matches after the first one you must use the global flag /g. The pattern is very tricky but it's feasible if you reverse the string.
Demo
var str = `1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25`;
// Reverse String
var rts = str.split("").reverse().join("");
// Do a reverse version of RegEx
/*In order to continue searching after the first match,
use the `g`lobal flag*/
var rgx = /(\d{1,3}\.\w{4}\d{4})/g;
// Replace on reversed String with a reversed substitution
var res = rts.replace(rgx, ` ,$1`);
// Revert the result back to normal direction
var ser = res.split("").reverse().join("");
console.log(ser);

split text into words and exclude hyphens

I want to split a text into it's single words using regular expressions. The obvious solution would be to use the regex \\b unfortunately this one does split words also on the hyphen.
So I am searching an expression doing exactly the same as the \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String [] b = s.split("\\b+");
for (int i = 0; i < b.length; i++){
System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
#Matmarbon solution is already quite close, but not 100% fitting it gives me
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
Also not you but somebody who needs this for another purpose (i.e. inserting something) this is more of an equivalent to the \b-solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
You can use this:
(?<!-)\\b(?!-)

Regular expression that finds and replaces a long string of words

I am new to Regular Expressions.
What is the expression that would find a long string of words that begin with a 3-digit number and place spaces at the beginning of capitalized words:
REPLACE:
013TheBlueCowJumpedOverTheFence1984.jpg
WITH:
013 The Blue Cow Jumped Over The Fence 1984
Note: removes the .jpg at the end
This will save me ooooodles of time.
I would not use regular expressions for this task. It's going to be ugly and hard to maintain. A better approach would be to loop through the string and rebuild the string as you go based on your input.
string retVal = "";
foreach(char s in myInput){
if(IsCapitol(s)){
reVal += " " + s;
}
//insert the rest of your conditions
}
try use this regular expression \d+|[A-Z][a-z]*
it will collect all matches, and you must join them with spases
This will need two operations since the replacement is different for each.
The first:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z])))/
Replace with: ' $1' (note the space)
Will put spaces between the words. The second:
/\s*(.*)\s*\..*$/
Replace with: '$1'
Will remove trailing spaces and the extension.
The first expression can be taken into parts: (?<![\d])\d finds a digit not preceded by another digit, the second: ((?<![A-Z])[A-Z](?![A-Z])) finds an uppercase letter not preceded or followed by an uppercase lettter.
You'll likely have more rules that you will want to incorporate into this, such as how are you dealing with the string: 'BackInTheUSSR.jpg'?
Edit: This should handle that example:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z]))|((?<![A-Z])[A-Z]+(?![a-z])))/
match:
'[A-Z][a-z]*'
replace with
' \0'
Note that this doesn't put a space before 1984, and it doesn't remove .jpg.
You can do the former by matching on
'[0-9]+|[A-Z][a-z]*'
instead. And the latter by removing it in a separate instruction, for example with a regexp replacement of '\.jpg$' with ''
Note that \'s need to be written as \\ in many languages.

Regular Expression issue with * laziness

Sorry in advance that this might be a little challenging to read...
I'm trying to parse a line (actually a subject line from an IMAP server) that looks like this:
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
It's a little hard to see, but there are two =?/?= pairs in the above line. (There will always be one pair; there can theoretically be many.) In each of those =?/?= pairs, I want the third argument (as defined by a ? delimiter) extracted. (In the first pair, it's "Here is som", and in the second it's "e text.")
Here's the regex I'm using:
=\?(.+)\?.\?(.*?)\?=
I want it to return two matches, one for each =?/?= pair. Instead, it's returning the entire line as a single match. I would have thought that the ? in the (.*?), to make the * operator lazy, would have kept this from happening, but obviously it doesn't.
Any suggestions?
EDIT: Per suggestions below to replace ".?" with "[^(\?=)]?" I'm now trying to do:
=\?(.+)\?.\?([^(\?=)]*?)\?=
...but it's not working, either. (I'm unsure whether [^(\?=)]*? is the proper way to test for exclusion of a two-character sequence like "?=". Is it correct?)
Try this:
\=\?([^?]+)\?.\?(.*?)\?\=
I changed the .+ to [^?]+, which means "everything except ?"
A good practice in my experience is not to use .*? but instead do use the * without the ?, but refine the character class. In this case [^?]* to match a sequence of non-question mark characters.
You can also match more complex endmarkers this way, for instance, in this case your end-limiter is ?=, so you want to match nonquestionmarks, and questionmarks followed by non-equals:
([^?]*\?[^=])*[^?]*
At this point it becomes harder to choose though. I like that this solution is stricter, but readability decreases in this case.
One solution:
=\?(.*?)\?=\s*=\?(.*?)\?=
Explanation:
=\? # Literal characters '=?'
(.*?) # Match each character until find next one in the regular expression. A '?' in this case.
\?= # Literal characters '?='
\s* # Match spaces.
=\? # Literal characters '=?'
(.*?) # Match each character until find next one in the regular expression. A '?' in this case.
\?= # Literal characters '?='
Test in a 'perl' program:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group 1 -> %s\nGroup 2 -> %s\n], $1, $2 if m/=\?(.*?)\?=\s*=\?(.*?)\?=/;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
Running:
perl script.pl
Results:
Group 1 -> utf-8?Q?Here is som
Group 2 -> utf-8?Q?e text.
EDIT to comment:
I would use the global modifier /.../g. Regular expression would be:
/=\?(?:[^?]*\?){2}([^?]*)/g
Explanation:
=\? # Literal characters '=?'
(?:[^?]*\?){2} # Any number of characters except '?' with a '?' after them. This process twice to omit the string 'utf-8?Q?'
([^?]*) # Save in a group next characters until found a '?'
/g # Repeat this process multiple times until end of string.
Tested in a Perl script:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group -> %s\n], $1 while m/=\?(?:[^?]*\?){2}([^?]*)/g;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?= =?utf-8?Q?more text?=
Running and results:
Group -> Here is som
Group -> e text.
Group -> more text
Thanks for everyone's answers! The simplest expression that solved my issue was this:
=\?(.*?)\?.\?(.*?)\?=
The only difference between this and my originally-posted expression was the addition of a ? (non-greedy) operator on the first ".*". Critical, and I'd forgotten it.