Perl Regex get last digits in string before \ - regex

Do you know of a way to combine these 2 regular expressions?
Or any other way to just get the last 6 digits before the last \.
The end result that I want is 100144 from the string:
\\XXX\Extract_ReduceSize\MonitoringExport\dev\files\100144\
Here are some things I have tried
(.{1})$
Gets rid of the trailing \ of the string resulting in
\\XXX\Extract_ReduceSize\MonitoringExport\dev\files\100144
.*\\
Gets rid of everything before the last \ resulting in 100144
The software I am using requires just one line, so I can't perform 2 calls.

Since you want the last one, would ([^\\]*)\\$ be appropriate? This matches as many non-backslash characters as possible immediately before a backslash at the end of the string. Alternatively, if you don't want to extract the first group, you can use a lookahead: ([^\\]+)(?=\\$).
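A minimal sketch of both patterns applied to the sample path from the question. The sub names last_segment_capture and last_segment_lookahead are made up for illustration:

```perl
use strict;
use warnings;

# Sample path from the question. In a single-quoted Perl string, \\ is one backslash.
my $path = '\\\\XXX\\Extract_ReduceSize\\MonitoringExport\\dev\\files\\100144\\';

# Capture group: the trailing backslash is part of the match, the segment is in $1.
sub last_segment_capture {
    my ($s) = @_;
    return $s =~ /([^\\]*)\\$/ ? $1 : undef;
}

# Lookahead: the trailing backslash is only asserted, not consumed.
sub last_segment_lookahead {
    my ($s) = @_;
    return $s =~ /([^\\]+)(?=\\$)/ ? $1 : undef;
}

print last_segment_capture($path), "\n";    # 100144
print last_segment_lookahead($path), "\n";  # 100144
```

Both variants rely on the string ending in a backslash; without one they return undef.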

This code shows two different solutions. Hope it helps:
use strict;
use warnings;
my $example = '\\XXX\Extract_ReduceSize\MonitoringExport\dev\files\100144\\';
# Method 1: split the string by the \ character. This gives us an array,
# and then, select the last element of that array [-1]
my $number = (split /\\/, $example)[-1];
print $number, "\n"; # <-- prints: 100144
# Method 2: use a regex. Anchor the match at the end of the string ($)
# and capture the digits (\d+) in $1
if ( $example =~ m!(\d+)\\$! ) {
    print $1, "\n"; # <-- prints: 100144
}

This works to extract the last segment of digits:
(\d+)(?=\\$)

Related

Extracting specific values from a Perl regex

I want to use a Perl regex to extract certain values from file names.
They have the following (valid) names:
testImrrFoo_Bar001_off
testImrrFooBar_bar000_m030
testImrrFooBar_bar231_p030
From the above I would like to extract the first 3 digits (always guaranteed to be 3) and the last part of the string, after the last _ (which is either off, or m or p followed by 3 digits). So the first thing I would be extracting is 3 digits, the second a string.
And I came out with the following method (I realise this might be not the most optimal/nicest one):
my $marker = '^testImrr[a-zA-Z_]+\d{3}_(off|(m|p)\d{3})$';
if ($str =~ m/$marker/)
{
print "1=$1 2=$2";
}
Where only $1 has a valid result (namely the last bit of info I want), but $2 turns out empty. Any ideas on how to get those 3 digits in the middle?
You were almost there.
Just:
- capture the three digits by adding parentheses around them: (\d{3})
- don't capture m|p, either by adding ?: after the opening parenthesis ((?:m|p)) or by using [mp] instead:
^testImrr[a-zA-Z_]+(\d{3})_(off|[mp]\d{3})$
And you'll get:
1=001 2=off
1=000 2=m030
1=231 2=p030
You can capture both at once, e.g. with
if ($str =~ /(\d{3})_(off|(?:m|p)\d{3})$/ ) {
    print "1=$1, 2=$2".$/;
}
Your example has two capture groups as well: (off|(m|p)\d{3}) and (m|p). For your first filename, the second capture group captures nothing because the other branch (off) matched. For non-capturing groups use (?:yourgroup).
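As a rough illustration of the corrected pattern, here is a small sketch run against the three file names from the question. The helper name parse_name is hypothetical:

```perl
use strict;
use warnings;

# Returns (digits, suffix) on success, or an empty list if the name does not match.
sub parse_name {
    my ($str) = @_;
    # (\d{3}) captures the digits; (?:m|p) groups without capturing.
    return $str =~ /^testImrr[a-zA-Z_]+(\d{3})_(off|(?:m|p)\d{3})$/ ? ($1, $2) : ();
}

for my $name (qw(testImrrFoo_Bar001_off testImrrFooBar_bar000_m030 testImrrFooBar_bar231_p030)) {
    my ($digits, $suffix) = parse_name($name);
    print "1=$digits 2=$suffix\n";
}
```

Because the inner group is non-capturing, $2 is always the full suffix rather than just m or p.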
There's really no need for regular expressions when a simple split and substr will suffice:
use strict;
use warnings;
while (<DATA>) {
    chomp;
    my @fields = split(/_/);
    my $digits = substr($fields[1], -3);
    print "1=$digits 2=$fields[2]\n";
}
__DATA__
testImrrFoo_Bar001_off
testImrrFooBar_bar000_m030
testImrrFooBar_bar231_p030
Output:
1=001 2=off
1=000 2=m030
1=231 2=p030

Matching numbers for substitution in Perl

I have this little script:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
    s/(\d{2}).*\.txt$/$1.txt/;
    s/^0+//;
    print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the start of the string is kept in the other cases.
Try this pattern:
foreach (@list) {
    s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
    print $_ . "\n";
}
Explanations:
Here I use the branch reset feature (i.e. (?|...()...|...()...)), which lets capturing groups in different alternatives share the same number ($1 here). That way you avoid a second substitution to trim the leading zero from the capture.
To remove everything before the number, I use:
.*?   # any characters, zero or more times
      # (the ? makes the * quantifier lazy, so it matches as little as possible)
_?    # an optional underscore
Note that you can make sure you have only 2 digits by adding a negative lookahead that checks that no digit follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means "not followed by a digit".
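A small sketch of the branch reset version applied to the three names from the question. It assumes Perl 5.10 or later, and the helper name shorten is made up here:

```perl
use strict;
use warnings;

# Hypothetical helper: works on a copy and returns the shortened name.
sub shorten {
    my ($name) = @_;
    # Branch reset (?|...): both alternatives store their capture in $1.
    # "0(\d)" drops a leading zero; "(\d{2})" keeps both digits.
    $name =~ s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/${1}.txt/;
    return $name;
}

print shorten($_), "\n" for 'R3_05_foo.txt', 'T3_12_foo_bar.txt', '01.txt';
# prints 5.txt, 12.txt and 1.txt, one per line
```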
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (@list) {
    if (/(\d{2})/) {
        $_ = "$1.txt";
    }
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
I would modify your regular expression. Try using this code:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
    s/.*(\d{2}).*\.txt$/$1.txt/;
    s/^0+//;
    print $_ . "\n";
}
The first part of your s/// matches what you think it does, but the replacement does not replace what you expect. s/// only replaces the text that was actually matched, so to get rid of something like T3_ you have to match that too:
s/.*(\d{2}).*\.txt$/$1.txt/;
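To make the difference concrete, here is a short sketch that applies both substitutions to one of the names from the question. The variable names are illustrative only:

```perl
use strict;
use warnings;

my $name = 'T3_12_foo_bar.txt';

# Unanchored: the match starts at "12", so the leading "T3_" is never touched.
(my $partial = $name) =~ s/(\d{2}).*\.txt$/${1}.txt/;

# With a leading .* the match covers the whole string, so "T3_" is replaced too.
(my $whole = $name) =~ s/.*(\d{2}).*\.txt$/${1}.txt/;

print "$partial\n";  # T3_12.txt
print "$whole\n";    # 12.txt
```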

Using perl Regular expressions I want to make sure a number comes in order

I want to use a regular expression to check a string to make sure 4 and 5 are in order. I thought I could do this by doing
'$string =~ m/.45./'
I think I am going wrong somewhere. I am very new to Perl. I would honestly like to put it in an array and search through it and find out that way, but I'm assuming there is a much easier way to do it with regex.
print "input please:\n";
$input = <STDIN>;
chop($input);
if ($input =~ m/45/ and $input =~ m/5./) {
print "works";
}
else {
print "nata";
}
EDIT: Added Info
I just want 4 and 5 in order. A 5 coming before a 4 at all is a problem: say 322195458900023 is the number, then the 545 is a problem. A 5 always has to come right after a 4.
Assuming you want to match any string that contains two digits where the first digit is smaller than the second:
There is an obscure feature called "postponed regular expressions". We can include code inside a regular expression with
(??{CODE})
and the value of that code is interpolated into the regex.
The special verb (*FAIL) makes sure that the match fails (in fact only the current branch). We can combine this into following one-liner:
perl -ne'print /(\d)(\d)(??{$1<$2 ? "" : "(*FAIL)"})/ ? "yes\n" :"no\n"'
It prints yes when the current line contains two digits where the first digit is smaller than the second digit, and no when this is not the case.
The regex explained:
m{
(\d) # match a number, save it in $1
(\d) # match another number, save it in $2
(??{ # start postponed regex
$1 < $2 # if $1 is smaller than $2
? "" # then return the empty string (i.e. succeed)
: "(*FAIL)" # else return the *FAIL verb
}) # close postponed regex
}x; # /x modifier so I could use spaces and comments
However, this is a bit advanced and masochistic; using an array is (1) far easier to understand, and (2) probably better anyway. But it is still possible using only regexes.
Edit
Here is a way to make sure that no 5 is followed by a 4:
/^(?:[^5]+|5(?=[^4]|$))*$/
This reads as: the string is composed of any number (zero or more) of the following pieces: a run of characters that are not a five, or a five that is followed either by a character that is not a four or by the end of the string.
This regex is also a possibility:
/^(?:[^45]+|45)*$/
it allows any characters in the string that are not 4 or 5, or the sequence 45. I.e., there are no single 4s or 5s allowed.
You just need to look for any 5 that is not preceded by a 4; if one is found, the check fails:
if( $str =~ /(?<!4)5/ ) {
    # Fail
}
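A hedged sketch comparing the negative lookbehind check with the stricter /^(?:[^45]+|45)*$/ pattern from the earlier answer. The sub names are hypothetical, and which of the two fits depends on how the requirement is read:

```perl
use strict;
use warnings;

# 1 when every 5 is immediately preceded by a 4 (negative lookbehind).
sub no_stray_five {
    my ($s) = @_;
    return $s !~ /(?<!4)5/ ? 1 : 0;
}

# Stricter: 4 and 5 may only occur together, as the pair "45".
sub only_paired {
    my ($s) = @_;
    return $s =~ /^(?:[^45]+|45)*$/ ? 1 : 0;
}

for my $s ('1245', '322195458900023', '1423') {
    printf "%-16s no_stray_five=%d only_paired=%d\n",
        $s, no_stray_five($s), only_paired($s);
}
```

The two checks disagree on a string such as 1423: it has no stray 5, but it does contain a 4 that is not followed by a 5.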

after matching pattern how to add dash after string in perl.regex

Please help me out. I am new to regular expressions, so please explain each step while answering. Thanks.
I have this type of data:
7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
I want to extract only this data from the above lines:
7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
Then the second field needs to be split up. If AX1A starts with two consecutive letters after the underscore, they should be written as AX_; a single digit and a single letter become -1_ and -A_. So after applying this pattern it becomes AX_-1_-A_, and all other data should remain the same.
Similarly, the next line has "W1A". It starts with a single letter "W", which should be converted to -W_; the next character is a single digit, so it is converted the same way to -1_, and the last one is treated the same. So it becomes -W_-1_-A_.
We are only interested in applying the regex to the part after the leading digits and underscore:
_AX1A_
_W1A_
_U1A_
_AV21NA_
output should be:
7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN
7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD
use strict;
use warnings;
use feature 'say'; # enables say(), used below
my $match
= qr/
( \d+ # group of digits
_ # followed by an underscore
) # end group
( \p{Alpha}+ ) # group of alphas
( \d+ ) # group of digits
( \p{Alpha}* ) # group of alphas
( \w+ ) # group of word characters
/x
;
# $input is assumed to be an already-opened filehandle
while ( my $record = <$input> ) { # read one record of input
# match and capture
if ( my ( $pre, $pre_alpha, $num, $post_alpha, $post ) = $record =~ m/$match/ ) {
say $pre
# if the alpha has length 1, add a dash before it
. ( length $pre_alpha == 1 ? '-' : '' )
# then the alpha
. $pre_alpha
# then the underscore
. '_'
# test if the length of the number is 1 and the length of the
# trailing alpha string is 1
. ( length( $num ) == 1 && length( $post_alpha ) == 1
# if true, apply a dash before each
? "-$num\_-$post_alpha"
# otherwise treat as AV21NA in example.
: "$num\_$post_alpha"
)
. $post
;
}
}
I don't know all the ins and outs of what you need stripped, but I'll extrapolate and let you clarify if this doesn't do quite what you need.
For the first step, removing the 1X50_RE_ and 1X50_LI_ pieces, you could search for those strings and replace them with nothing.
Next, to split your second letter/number code into your small chunks, you can use a pair of matches, using a look-ahead on each. However, since you only want to mess with that second code chunk, I'd split the overall line up first, work on the second chunk, and then join the pieces back together again.
while (<$input>) {
# Replace the 1X50_RE/LI_ bits with nothing (i.e., delete them)
s/1X50_(RE|LI)_//;
my @pieces = split /_/; # split the line into pieces at each underscore
# Just working with the second chunk. /g means do it for all matches found
$pieces[1] =~ s/([A-Z])(?=[0-9])/$1_-/g; # Convert AX1 -> AX_-1
$pieces[1] =~ s/([0-9])(?=[A-Z])/$1_-/g; # Convert 1A -> 1_-A
# Join the pieces back together again
$_ = join '_', @pieces;
print;
}
The $_ is the variable many Perl operations work on if you don't specify. The <$input> reads the next line of the file handle named $input into $_. The s///, split, and print functions work on $_ when not given. The =~ operator is the way you tell Perl to use $pieces[1] (or whichever variable you are working on) instead of $_ for regular expression operations. (For split or print, you'd pass the variables as the argument instead, so split /_/ is the same as split /_/, $_ and print is the same as print $_.)
Oh, and to explain the regular expressions a bit:
s/1X50_(RE|LI)_//;
This is matching anything containing 1X50_RE or 1X50_LI (the (|) is a list of alternatives) and replacing them with nothing (the empty // at the end).
Looking at one of the other lines:
s/([A-Z])(?=[0-9])/$1_-/g;
The plain parentheses (...) around [A-Z] cause $1 to be set to whatever is matched inside (in this case a letter, A-Z). The (?=...) parentheses form a zero-width positive look-ahead assertion. That means the regular expression only matches if the very next thing in the string matches the expression (a digit, 0-9), but that part is not included in the string that is replaced.
The replacement, $1_-, puts back the letter captured by the parentheses and adds the _- you require. The digit seen by the look-ahead is left untouched, so the _- ends up between the letter and the digit.
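A minimal sketch of just the two look-ahead substitutions, applied to the code chunks from the question. The helper name split_code is made up, and ${1} is written with braces only to make the end of the variable name explicit:

```perl
use strict;
use warnings;

# Hypothetical helper: inserts "_-" at every letter/digit and digit/letter boundary.
sub split_code {
    my ($code) = @_;
    $code =~ s/([A-Z])(?=[0-9])/${1}_-/g;   # letter followed by a digit
    $code =~ s/([0-9])(?=[A-Z])/${1}_-/g;   # digit followed by a letter
    return $code;
}

print split_code($_), "\n" for qw(AX1A W1A AV21NA);
# prints AX_-1_-A, W_-1_-A and AV_-21_-NA, one per line
```

Note that this only illustrates the look-ahead mechanics: the results for W1A and AV21NA differ from the output the question asks for (-W_-1_-A and AV_21_NA), so the length-based dash rule would still need to be handled separately.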
#!/usr/bin/perl -w
use strict;
while (<>) {
next if /^\s*$/;
chomp;
## Remove those parts of the line we do not want
## You do not specify what, if anything, is constant about
## the parts you do not want. One of the following cases should
## serve.
## i) Remove the string _1X50_ and the next characters between
## two underscores:
s/_1X50_.+?_/_/;
## ii) keep the first 2 and last 3 sections of each line.
## Uncomment this line and comment the previous one to use this:
#s/^(.+?_.+?)_.+_(.+_.+_.+)$/$1_$2/;
## The line now contains only those regions we are
## interested in. Split on '_' to collect an array of the
## different parts (@a):
my @a=split(/_/);
## $a[1] is the second string, eg AX1A,W1A etc.
## We search for zero or more letters, followed by zero or more digits,
## followed by zero or more letters. The 'i' modifier makes the match
## case insensitive; in list context the match returns the captured
## groups, which we collect in the @matches array.
my @matches=($a[1]=~/^([a-z]*)(\d*)([a-z]*)/ig);
## So, for each of the matched strings, if the length of the match
## is less than 2, add a '-' to the beginning of the string:
foreach my $match (@matches) {
if (length($match)<2) {
$match="-" . $match;
}
}
## Now replace the original $a[1] with each string in
## @matches, connected by '_':
$a[1]=join("_", @matches);
## Finally, build the string $kk by joining each element
## of the line (@a) by a '_', and print:
my $kk=join("_", @a);
print "$kk\n";
}
Do you mean something like this?
while (<DATA>) {
s/1X50_(LI|RE)_//;
s/(\d+)_([A-Z])(\d)([A-Z])/$1_-$2_-$3_-$4/;
s/(\d+)_([A-Z]{2})(\d)([A-Z])/$1_$2_-$3_-$4/;
s/(\d+)_([A-Z]{1,2})(\d+)([A-Z]+)/$1_$2_$3_$4/;
print;
}
__DATA__
7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
output:
7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN
7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD
zostay's suggestion of splitting the line may make things easier if you are a regex beginner. However, avoiding the split may be somewhat faster. Here is how to do it without splitting:
open IN_FILE, "filename" or die "Whoops! Can't open file.";
while (<IN_FILE>)
{
    s/^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/-${1}-${2}-${3}/
        or print "line didn't match: $_";
    s/1X50_(LI|RE)_//;
}
Breaking down the first pattern:
s/// is the search-and-replace operator.
^ match the beginning of the line
\d{7}_ match seven digits, followed by an underscore
\K keep operator; it behaves like a look-behind. Whatever was matched before it won't be part of the string that is replaced.
() each set of parentheses specifies a chunk of the match that will be captured. These will be put into the match variables $1, $2, etc. in order.
[A-Z]{1,2} match one or two capital letters. You can probably figure out what the other two sections in parentheses mean.
-${1}-${2}-${3} replace what matched with the first three match variables, each preceded by a dash. The only reason for the curly braces is to make clear where each variable name ends.
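A small sketch showing the effect of \K on one sample line (with the 1X50 part already removed). It assumes Perl 5.10 or later, and the helper name dash_code is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the answer above.
sub dash_code {
    my ($line) = @_;
    # \K keeps the seven digits and the underscore out of the replaced text.
    $line =~ s/^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/-${1}-${2}-${3}/;
    return $line;
}

print dash_code('7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN'), "\n";
# prints 7210315_-AX-1-A_MOTORTRAEGER_VORN_AUSSEN
```

This reproduces the replacement exactly as written in the answer; producing the AX_-1_-A_ form from the question would need a different replacement string.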

Regex to check fix length field with packed space

Say I have a text file to parse, which contains some fixed length content:
123jackysee        45678887
456charliewong     32145644
<3><------16------><--8---> # Not part of the data.
The first three characters are the ID, then come 16 characters of user name, then an 8-digit phone number.
I would like to write a regular expression to match and verify the input for each line. The one I came up with:
(\d{3})([A-Za-z ]{16})(\d{8})
The user name should contain 8-16 characters, but ([A-Za-z ]{16}) would also match an empty or all-space name. I thought of ([A-Za-z]{8,16} {0,8}), but that would accept a field of more than 16 characters. Any suggestions?
No, no, no, no! :-)
Why do people insist on trying to pack so much functionality into a single RE or SQL statement?
My suggestion, do something like:
Ensure the length is 27.
Extract the three components into separate strings (0-2, 3-18, 19-26).
Check that the first matches "\d{3}".
Check that the second matches "[A-Za-z]{8,} *".
Check that the third matches "\d{8}".
If you want the entire check to fit on one line of source code, put it in a function, isValidLine(), and call it.
Even something like this would do the trick:
def isValidLine(s):
    if s.len() != 27 return false
    return s.match("^\d{3}[A-Za-z]{8,} *\d{8}$")
Don't be fooled into thinking that's clean Python code, it's actually PaxLang, my own proprietary pseudo-code. Hopefully, it's clear enough, the first line checks to see that the length is 27, the second that it matches the given RE.
The middle field is automatically 16 characters total due to the first line and the fact that the other two fields are fixed-length in the RE. The RE also ensures that it's eight or more alphas followed by the right number of spaces.
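The step-by-step check listed above could be sketched in Perl roughly as follows. The name is_valid_line and the sample lines are assumptions for illustration, and the line is expected without its trailing newline:

```perl
use strict;
use warnings;

# Returns 1 if the line is a valid 27-character record, 0 otherwise.
sub is_valid_line {
    my ($line) = @_;
    return 0 unless length($line) == 27;
    my $id    = substr($line, 0, 3);    # columns 0-2
    my $name  = substr($line, 3, 16);   # columns 3-18
    my $phone = substr($line, 19, 8);   # columns 19-26
    return 0 unless $id    =~ /^[0-9]{3}\z/;
    return 0 unless $name  =~ /^[A-Za-z]{8,} *\z/;
    return 0 unless $phone =~ /^[0-9]{8}\z/;
    return 1;
}

my $good = '123' . 'jackysee' . (' ' x 8) . '45678887';
my $bad  = '789' . 'jop' . (' ' x 13) . '12345678';
print is_valid_line($good) ? "valid\n" : "invalid\n";   # valid
print is_valid_line($bad)  ? "valid\n" : "invalid\n";   # invalid
```

Checking each field separately keeps every regex short, at the cost of a few more lines than the single-pattern version.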
To do this sort of thing with a single RE would be some monstrosity like:
^\d{3}(([A-Za-z]{8} {8})
|([A-Za-z]{9} {7})
|([A-Za-z]{10} {6})
|([A-Za-z]{11} {5})
|([A-Za-z]{12} {4})
|([A-Za-z]{13} {3})
|([A-Za-z]{14} {2})
|([A-Za-z]{15} {1})
|([A-Za-z]{16}))
\d{8}$
You could do it by ensuring it passes two separate REs:
^\d{3}[A-Za-z]{8,} *\d{8}$
^.{27}$
but, since that last one is simply a length check, it's no different to the isValidLine() above.
I would use the regex you suggested with a small addition:
(\d{3})([A-Za-z]{3,16} {0,13})(\d{8})
which will match things that have a non-whitespace username but still allow space padding. The only addition is that you would then have to check the length of each input to verify the correct number of characters.
Hmm... Depending on the exact version of Regex you're running, consider:
(?P<id>\d{3})(?=[A-Za-z\s]{16}\d)(?P<username>[A-Za-z]{8,16})\s*(?P<phone>\d{8})
Not 100% sure this will work, and I've used the whitespace escape char instead of an actual space. I get nervous with just the space character myself, but you may want to be more restrictive.
See if it works. I'm only intermediate with RegEx myself, so I might be in error.
Check that the named groups syntax a) exists in your version of RegEx and b) matches the syntax I've used above.
EDIT:
Just to expand what I'm trying to do (sorry to make your eyes bleed, Pax!) for those without a lot of RegEx experience:
(?P<id>\d{3})
This will try to match a named capture group - 'id' - that is three digits in length. Most versions of RegEx let you use named capture groups to extract the values you matched against. This lets you do validation and data capture at the same time. Different versions of RegEx have slightly different syntaxes for this - check out http://www.regular-expressions.info/named.html for more detail regarding your particular implementation.
(?=[A-Za-z\s]{16}\d)
The ?= is a lookahead operator. This looks ahead for the next sixteen characters, and will return true if they are all letters or whitespace characters AND are followed by a digit. The lookahead operator is zero length, so it doesn't actually return anything. Your RegEx string keeps going from the point the Lookahead started. Check out http://www.regular-expressions.info/lookaround.html for more detail on lookahead.
(?P<username>[A-Za-z]{8,16})\s*
If the lookahead passes, then we keep counting from the fourth character in. We want to find eight-to-sixteen characters, followed by zero or more whitespaces. The 'or more' is actually safe, as we've already made sure in the lookahead that there can't be more than sixteen characters in total before the next digit.
Finally,
(?P<phone>\d{8})
This should check the eight-digit phone number.
I'm a bit nervous that this won't exactly work - your version of RegEx may not support the named group syntax or the lookahead syntax that I'm used to.
I'm also a bit nervous that this Regex will successfully match an empty string. Different versions of Regex handle empty strings differently.
You may also want to consider anchoring this Regex between a ^ and $ to ensure you're matching against the whole line, and not just part of a bigger line.
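For Perl specifically (5.10 or later), a hedged sketch of this approach with the anchors added might look like the following. It uses Perl's (?<name>...) spelling and the %+ hash; the helper name parse_line is made up:

```perl
use strict;
use warnings;

# Returns a hash ref with id, username and phone, or undef if the line is invalid.
sub parse_line {
    my ($line) = @_;
    return undef unless $line =~ /
        ^ (?<id> \d{3} )
        (?= [A-Za-z\s]{16} \d )            # 16-character field, then a digit
        (?<username> [A-Za-z]{8,16} ) \s*
        (?<phone> \d{8} ) $
    /x;
    my %captures = %+;                     # copy the named captures
    return \%captures;
}

my $rec = parse_line('123' . 'jackysee' . (' ' x 8) . '45678887');
print "$rec->{id} $rec->{username} $rec->{phone}\n";   # 123 jackysee 45678887
```

Because the pattern requires digits on both ends, it cannot match an empty string.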
Assuming you mean perl regex and if you allow '_' in the username:
perl -ne 'exit 1 unless /(\d{3})(\w{8,16})\s+(\d{8})/ && length == 28'
@OP, not every problem needs a regex. Your problem is pretty simple to check. Depending on what language you are using, there will be some sort of built-in string functions. Use them.
The following minimal example is done in Python.
import sys
for line in open("file"):
    line = line.rstrip("\n")
    # check first 3 chars are digits
    if not line[0:3].isdigit(): sys.exit()
    # check length of username (16-char field, space-padded on the right)
    name = line[3:19].rstrip()
    if len(name) < 8: sys.exit()
    # check phone number length and whether it is all digits
    phone = line[19:]
    if len(phone) != 8 or not phone.isdigit(): sys.exit()
    print(line)
I also don't think you should try to pack all the functionality into a single regex. Here is one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
while ( <DATA> ) {
chomp;
last unless /\S/;
my @fields = split;
if (
( my ($id, $name) = $fields[0] =~ /^([0-9]{3})([A-Za-z]{8,16})$/ )
and ( my ($phone) = $fields[1] =~ /^([0-9]{8})$/ )
) {
print "ID=$id\nNAME=$name\nPHONE=$phone\n";
}
else {
warn "Invalid line: $_\n";
}
}
__DATA__
123jackysee 45678887
456charliewong 32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
And here is another way:
#!/usr/bin/perl
use strict;
use warnings;
while ( <DATA> ) {
chomp;
last unless /\S/;
my ($id, $name, $phone) = unpack 'A3A16A8';
if ( is_valid_id($id)
and is_valid_name($name)
and is_valid_phone($phone)
) {
print "ID=$id\nNAME=$name\nPHONE=$phone\n";
}
else {
warn "Invalid line: $_\n";
}
}
sub is_valid_id { ($_[0]) = ($_[0] =~ /^([0-9]{3})$/) }
sub is_valid_name { ($_[0]) = ($_[0] =~ /^([A-Za-z]{8,16})\s*$/) }
sub is_valid_phone { ($_[0]) = ($_[0] =~ /^([0-9]{8})$/) }
__DATA__
123jackysee        45678887
456charliewong     32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
Generalizing:
#!/usr/bin/perl
use strict;
use warnings;
my %validators = (
id => make_validator( qr/^([0-9]{3})$/ ),
name => make_validator( qr/^([A-Za-z]{8,16})\s*$/ ),
phone => make_validator( qr/^([0-9]{8})$/ ),
);
INPUT:
while ( <DATA> ) {
chomp;
last unless /\S/;
my %fields;
@fields{qw(id name phone)} = unpack 'A3A16A8';
for my $field ( keys %fields ) {
unless ( $validators{$field}->($fields{$field}) ) {
warn "Invalid line: $_\n";
next INPUT;
}
}
print "$_ : $fields{$_}\n" for qw(id name phone);
}
sub make_validator {
my ($re) = @_;
return sub { ($_[0]) = ($_[0] =~ $re) };
}
__DATA__
123jackysee        45678887
456charliewong     32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
You can use lookahead: ^(\d{3})((?=[a-zA-Z]{8,})([a-zA-Z ]{16}))(\d{8})$
Testing:
123jackysee        45678887      Match
456charliewong     32145644      Match
789jop             12345678      No Match - username too short
999abcdefghijabcde12345678       No Match - username 'column' is less than 16 characters
999abcdefghijabcdef12345678      Match
999abcdefghijabcdefg12345678     No Match - username column is more than 16 characters
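The test cases above can be reproduced with a short Perl sketch. The sample strings are rebuilt here with explicit padding, and the helper name check is hypothetical:

```perl
use strict;
use warnings;

# Lookahead requires 8+ leading letters; the field itself is exactly 16 letters or spaces.
my $re = qr/^(\d{3})((?=[a-zA-Z]{8,})([a-zA-Z ]{16}))(\d{8})$/;

sub check {
    my ($line) = @_;
    return $line =~ $re ? 'Match' : 'No Match';
}

my @samples = (
    '123jackysee' . (' ' x 8) . '45678887',
    '456charliewong' . (' ' x 5) . '32145644',
    '789jop' . (' ' x 13) . '12345678',
    '999abcdefghijabcde12345678',      # 15-letter name, no padding
    '999abcdefghijabcdef12345678',     # 16-letter name
    '999abcdefghijabcdefg12345678',    # 17-letter name
);

printf "%-30s %s\n", $_, check($_) for @samples;
```

Note that [a-zA-Z ]{16} also accepts spaces between letters once the first eight characters are letters, which may or may not be acceptable for a user name.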