POSIX Regular Expressions Limit Repetitions

I am trying to grep for a maximum number of repetitions allowed on my input string and can't seem to get it working.
The input file has three lines with 1, 5 and 7 repetitions of "pq" respectively. The >=3 and >=5 expressions work fine, but the "between 3 and 5" expression {3,5} shows the line with seven repetitions as well.
DEV /> cat input.txt
pq -- One occurance of pq
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
DEV /> grep "\(pq\)\{3,\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
DEV /> grep "\(pq\)\{5\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
DEV /> grep "\(pq\)\{3,5\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
Am I doing something wrong or is this the expected behavior?
If this is the expected behavior ( as the string with 7 PQs has between 3-5 PQs),
1) in what cases is the maximum repetitions applicable? What would be the difference between {3,5} and {3,} (greater than 3)?
2) I can anchor my regular expressions with "^", but what if my string does not end with "pq" and has more text?

If a line has seven repetitions of something, it also contains between three and five repetitions of that thing, and at several starting points, no less.
Use match anchors if you expect matches to be anchored. Otherwise, of course, they are not.
The practical difference between /X{3,}/ and /X{3,5}/ is how long a string it matches: the extent (or span) of the match. If all you are looking for is a boolean yes/no response and there is nothing further in your pattern, it does not make much of a difference; in fact, a moderately clever regex engine will return early if it knows it is safe to do so.
One way to see the difference is with GNU grep's -o or --only-matching option. Watch:
$ echo 123456789 | egrep -o '[0-9]{3}'
123
456
789
$ echo 123456789 | egrep -o '[0-9]{3,}'
123456789
$ echo 123456789 | egrep -o '[0-9]{3,5}'
12345
6789
$ echo 123456789 | egrep -o '[0-9]{3,5}[2468]'
123456
$ echo 1234567890 | egrep -o '[0-9]{3,5}[13579]'
12345
6789
To understand how those last two work, it is useful to get a trace of the regex engine’s attempts, including backtracking steps. You can do this using Perl in this way:
$ perl -Mre=debug -le 'print $& while 1234567890 =~ /\d{3,5}[13579]/g'
Compiling REx "\d{3,5}[13579]"
Final program:
1: CURLY {3,5} (4)
3: DIGIT (0)
4: ANYOF[13579][] (15)
15: END (0)
stclass DIGIT minlen 4
Matching REx "\d{3,5}[13579]" against "1234567890"
Matching stclass DIGIT against "1234567" (7 chars)
0 <> <1234567890> | 1:CURLY {3,5}(4)
DIGIT can match 5 times out of 5...
5 <12345> <67890> | 4: ANYOF[13579][](15)
failed...
4 <1234> <567890> | 4: ANYOF[13579][](15)
5 <12345> <67890> | 15: END(0)
Match successful!
12345
Matching REx "\d{3,5}[13579]" against "67890"
Matching stclass DIGIT against "67" (2 chars)
5 <12345> <67890> | 1:CURLY {3,5}(4)
DIGIT can match 5 times out of 5...
10 <1234567890> <> | 4: ANYOF[13579][](15)
failed...
9 <123456789> <0> | 4: ANYOF[13579][](15)
failed...
8 <12345678> <90> | 4: ANYOF[13579][](15)
9 <123456789> <0> | 15: END(0)
Match successful!
6789
Freeing REx: "\d{3,5}[13579]"
When you have additional constraints about what comes after the match, then which type of repetition you choose can make a big difference. Here I’ll impose a constraint on where each match is allowed to finish, by saying it needs to end before an odd digit:
$ perl -le 'print $& while 1234567890 =~ /\d{3}(?=[13579])/g'
234
678
$ perl -le 'print $& while 1234567890 =~ /\d{3,5}(?=[13579])/g'
1234
5678
$ perl -le 'print $& while 1234567890 =~ /\d{3,}(?=[13579])/g'
12345678
So when you have things that have to come afterwards, it can make a great deal of difference. When you are just deciding whether the entire line matches something, it may not be as important.

This is expected behavior. The string "pqpqpqpqpqpqpq" does in fact have between three and five repetitions of "pq", and then a few more for good measure. You may want to try anchoring your regular expression, something like ^\(pq\)\{3,5\}$.
Edit to match edited question:
The maximum is applicable in all situations. What is happening is that grep is matching 5 of the 7 repetitions of "pq" (most likely the first five), and since it found a match it prints out the line.
You'll have to change your regex to match what you want and not match what you don't. For example, to match a line starting with 3–5 repetitions of "pq", you might use an extended regex (grep -E) like this: ^(pq){3,5}($|[^p]|p$|p[^q]). That matches 3–5 "pq"s followed immediately by end-of-line, or by any character other than "p", or by "p" at end-of-line, or by "p" followed by any character other than "q".
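As a rough check, the question's sample input can be run through both the unanchored pattern and the anchored, right-bounded one. This is only a sketch: it assumes a grep that supports -E, and the variable names are made up for illustration.

```shell
# Recreate the sample input from the question (lines with 1, 5 and 7 "pq"s).
input=$(printf '%s\n' \
  'pq -- One occurance of pq' \
  'pqpqpqpqpq -five occurances of pq' \
  'pqpqpqpqpqpqpq -- seven occurances of pq')

# Unanchored: any line that contains three consecutive "pq"s somewhere matches.
unanchored=$(printf '%s\n' "$input" | grep -E '(pq){3,5}')

# Anchored at the start and bounded on the right: a sixth "pq" disqualifies the line.
bounded=$(printf '%s\n' "$input" | grep -E '^(pq){3,5}($|[^p]|p$|p[^q])')

printf 'unanchored:\n%s\n\nbounded:\n%s\n' "$unanchored" "$bounded"
```

The unanchored pattern selects both the five- and seven-repetition lines, while the bounded one selects only the five-repetition line.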

Related

Unable to match multiple digits in regex

I am simply trying to print 5 or 6 digit number present in each line.
cat file.txt
Random_something xyz ...64763
Random2 Some String abc-778986
Something something 676347
Random string without numbers
cat file.txt | sed 's/^.*\([0-9]\{5,6\}\+\).*$/\1/'
Current Output
64763
78986
76347
Random string without numbers
Expected Output
64763
778986
676347
The regex doesn't seem to work as intended with 6-digit numbers. It skips the first digit of the 6-digit number for some reason, and it prints the last line, which I don't need because it doesn't contain any 5- or 6-digit number.

grep is a better tool for this, with its -o option that prints only the matched string:
grep -Eo '[0-9]{5,6}' file
64763
778986
676347
-E is for enabling extended regex mode.
If you really want a sed, this should work:
sed -En 's/(^|.*[^0-9])([0-9]{5,6}).*/\2/p' file
64763
778986
676347
Details:
-n: Suppress normal output
(^|.*[^0-9]): Match the start of the line, or any text that ends with a non-digit
([0-9]{5,6}): Match 5 or 6 digits in capture group #2
.*: Match remaining text
\2: Replacement that puts back only the digits captured in group #2
/p: Print the substituted text
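To see why the original attempt drops a digit, the two substitutions can be compared on one sample line. This is a sketch assuming a sed with -E support (GNU or BSD); the question's trailing \+ is omitted here for portability, and the variable names are illustrative.

```shell
line='Random2 Some String abc-778986'

# Original idea: the leading .* is greedy, so it swallows as much as it can
# and leaves the capture group only the minimum five digits.
greedy=$(printf '%s\n' "$line" | sed 's/^.*\([0-9]\{5,6\}\).*$/\1/')

# Requiring a non-digit (or the start of the line) just before the group
# stops the prefix from eating into the number.
fixed=$(printf '%s\n' "$line" | sed -En 's/(^|.*[^0-9])([0-9]{5,6}).*/\2/p')

printf 'greedy: %s\nfixed:  %s\n' "$greedy" "$fixed"
```

The first command yields 78986 and the second 778986, matching the "current" and "expected" output in the question.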

With awk, you could try the following. It uses awk's match function with a regex for 5 to 6 digits; if a line matches, it prints the matched part. (Interval expressions such as {5,6} need a POSIX-compliant awk; some older awk versions do not support them.)
awk 'match($0,/[0-9]{5,6}/){print substr($0,RSTART,RLENGTH)}' Input_file

when does a sub stop in perl regex

I'm trying to translate some Perl code into Python and ran into a problem with a certain regex: I can't figure out what it does or why it stops where it does.
this is the regex
$url =~ s/^.*\///;
now I've tried to pass some urls and see what comes out
so this is what comes out
# string input
"http://perltest.my-mobile.org/c/test.cgi?u=USER&p=PASS"
# string output
"test.cgi?u=USER&p=PASS"
I really don't know why it stops at test. As far as I understand it,
it replaces any characters at the beginning of the string with nothing,
so why does it stop at test?
And if you can help me write a regex in python that does the same thing
that would be cool
Thanks in advance!

I really don't know why it is stopping at 'test' as far as I understand it, it replaces any character in the beginning of the string with nothing so why does it stop at test?
Because of the \/ being part of the pattern.
#            VV here
$url =~ s/^.*\///;
It would be clearer if the code was using a different quoting delimiter, which is possible in Perl. That way, there would not be the leaning toothpick syndrome here.
$url =~ s{^.*/}{};
Note that .* is greedy by default, so it will gobble up everything, slashes included, up to the last slash.
You can use the re pragma in debug mode to learn more about what the regex engine does under the hood.
use re 'debug';
my $url = "http://perltest.my-mobile.org/c/test.cgi?u=USER&p=PASS";
$url =~ s{^.*/}{};
This will output to STDERR.
Compiling REx "^.*/"
Final program:
1: SBOL /^/ (2)
2: STAR (4)
3: REG_ANY (0)
4: EXACT </> (6)
6: END (0)
floating "/" at 0..9223372036854775807 (checking floating) anchored(SBOL) minlen 1
Matching REx "^.*/" against "http://perltest.my-mobile.org/c/test.cgi?u=USER&p=PASS"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [0..54] gave 5
Found floating substr "/" at offset 5 (rx_origin now 0)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <http://per> | 0| 1:SBOL /^/(2)
0 <> <http://per> | 0| 2:STAR(4)
| 0| REG_ANY can match 54 times out of 2147483647...
31 <org/c> </test.cgi?> | 1| 4:EXACT </>(6)
32 <rg/c/> <test.cgi?u> | 1| 6:END(0)
Match successful!
Freeing REx: "^.*/"
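The same "strip everything up to the last slash" behaviour can be sketched in shell, either with sed or with POSIX parameter expansion. This is an illustration, not a translation of the Perl code; the variable names are assumptions.

```shell
url='http://perltest.my-mobile.org/c/test.cgi?u=USER&p=PASS'

# sed with | as the delimiter avoids escaping the slash; .* is greedy here too,
# so the match runs up to and including the last slash.
with_sed=$(printf '%s\n' "$url" | sed 's|^.*/||')

# ${var##pattern} removes the longest prefix matching the glob */ .
with_expansion=${url##*/}

printf '%s\n%s\n' "$with_sed" "$with_expansion"
```

Both forms leave test.cgi?u=USER&p=PASS, the same result as the Perl substitution.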

perl: add to a path using substitution

My script takes in a filepath, and I want to append a directory to the end of the path. The issue is I want to be agnostic of whether the argument has a trailing slash or not. So for example:
$ perl myscript.pl /path/to/dir
/path/to/dir/new
$ perl myscript.pl /path/to/dir/
/path/to/dir/new
I tried $path =~ s/\/?$/\/new/g, but that results in a double /new if a slash is present:
$ perl myscript.pl /path/to/dir/
/path/to/dir/new/new
$ perl myscript.pl /path/to/dir
/path/to/dir/new
What's wrong?

Because /g is 'global' and will match multiple times:
#!/usr/bin/env perl
use strict;
use warnings;
#turn on debugging
use re 'debug';
my $path = '/path/to/dir/';
$path =~ s/\/?$/\/new/g;
print $path;
The first match consumes the trailing / up to the end-of-line position. The engine then tries again at offset 13, where /? can match zero characters and $ still matches, so the pattern matches a second time.
E.g.:
Compiling REx "/?$"
Final program:
1: CURLY {0,1} (5)
3: EXACT </> (0)
5: SEOL (6)
6: END (0)
floating ""$ at 0..1 (checking floating) minlen 0
Matching REx "/?$" against "/path/to/dir/"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [0..13] gave 13
Found floating substr ""$ at offset 13 (rx_origin now 12)...
(multiline anchor test skipped)
try at offset...
Intuit: Successfully guessed: match at offset 12
12 <path/to/dir> </> | 1:CURLY {0,1}(5)
EXACT </> can match 1 times out of 1...
13 <path/to/dir/> <> | 5: SEOL(6)
13 <path/to/dir/> <> | 6: END(0)
Match successful!
Matching REx "/?$" against ""
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [13..13] gave 13
Found floating substr ""$ at offset 13 (rx_origin now 13)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 13
13 <path/to/dir/> <> | 1:CURLY {0,1}(5)
EXACT </> can match 0 times out of 1...
13 <path/to/dir/> <> | 5: SEOL(6)
13 <path/to/dir/> <> | 6: END(0)
Match successful!
Matching REx "/?$" against ""
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [13..13] gave 13
Found floating substr ""$ at offset 13 (rx_origin now 13)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 13
13 <path/to/dir/> <> | 1:CURLY {0,1}(5)
EXACT </> can match 0 times out of 1...
13 <path/to/dir/> <> | 5: SEOL(6)
13 <path/to/dir/> <> | 6: END(0)
This is because $ is a zero-width position anchor, and so is \/? when it matches nothing. Once the pattern has consumed the trailing / and replaced it, the regex engine continues (because you told it to with /g) and finds just $ left, because that is still the end of the line. That is still a valid match to replace.
But why not instead use File::Spec:
#!/usr/bin/env perl
use strict;
use warnings;
use File::Spec;
use Data::Dumper;
my $path = '/path/to/dir/';
my @dirs = File::Spec->splitdir($path);
print Dumper \@dirs;
$path = File::Spec->catdir(@dirs, "new" );
print $path;
This provides you with a platform-independent way to split and join path elements, and doesn't rely on regex matching, which can break in various ways (such as the one you found).

Drop the /g modifier:
$path =~ s/\/?$/\/new/
works fine.
You only want to add one "new" at the end, so having a /g modifier makes no sense.
Also, note that you can use different delimiters for your regex:
$path =~ s{ /? $}{/new}x;
is a little bit clearer.
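For comparison, the same "one trailing slash or none" normalization can be sketched in plain shell, where no regex and no global flag are involved. The function name append_new is hypothetical.

```shell
# Append /new to a path whether or not it already ends in a slash.
append_new() {
  # ${1%/} removes a single trailing slash if there is one.
  printf '%s/new\n' "${1%/}"
}

append_new /path/to/dir
append_new /path/to/dir/
```

Both calls print /path/to/dir/new.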

Replace regex with captured group ONLY

I'm trying to understand why the following does not give me what I think (or want :)) should be returned:
sed -r 's/^(.*?)(Some text)?(.*)$/\2/' list_of_values
or Perl:
perl -lpe 's/^(.*?)(Some text)?(.*)$/$2/' list_of_values
So I want my result to be just the Some text, otherwise (meaning if there was nothing captured in $2) then it should just be EMPTY.
I did notice that with perl it does work if Some text is at the start of the line/string (which baffles me...). (Also noticed that removing ^ and $ has no effect)
Basically, I'm trying to get what grep would return with the --only-matching option as discussed here. Only I want/need to use sub/replace in the regex.
EDITED (added sample data)
Sample input:
$ cat -n list_of_values
1 Black
2 Blue
3 Brown
4 Dial Color
5 Fabric
6 Leather and Some text after that ....
7 Pearl Color
8 Stainless Steel
9 White
10 White Mother-of-Pearl Some text stuff
Desired output:
$ perl -ple '$_ = /(Some text)/ ? $1 : ""' list_of_values | cat -n
1
2
3
4
5
6 Some text
7
8
9
10 Some text

First of all, this shows how to duplicate grep -o using Perl.
You're asking why
foo Some text bar
012345678901234567
results in just an empty string instead of
Some text
Well,
At position 0, ^ matches 0 characters.
At position 0, (.*?) matches 0 characters.
At position 0, (Some text)? matches 0 characters.
At position 0, (.*) matches 17 characters.
At position 17, $ matches 0 characters.
Match succeeds, so the engine never has to backtrack and let (.*?) grow until (Some text) can match.
You could use
s{^ .*? (?: (Some[ ]text) .* | $ )}{ $1 // "" }exs;
or
s{^ .*? (?: (Some[ ]text) .* | $ )}{$1}xs; # Warns if warnings are on.
Far simpler:
$_ = /(Some text)/ ? $1 : "";
I question your use of -p. Are you sure you want a line of output for each line of input? It seems to me you'd rather have
perl -nle'print $1 if /(Some text)/'
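Since the question also asks about sed, here is one way the "captured text or empty line" behaviour might be sketched there. It assumes a sed with -E support and relies on the t command, which branches to the end of the script when the preceding substitution succeeded; the sample lines come from the question's data.

```shell
input=$(printf '%s\n' \
  'Leather and Some text after that ....' \
  'Pearl Color' \
  'White Mother-of-Pearl Some text stuff')

# 1st s: keep only the captured text when it is present.
# t:     if that substitution happened, skip the rest of the script.
# 2nd s: otherwise blank the line.
result=$(printf '%s\n' "$input" |
  sed -E -e 's/.*(Some text).*/\1/' -e t -e 's/.*//')

printf '%s\n' "$result"
```

This prints "Some text", an empty line, and "Some text" again, one output line per input line.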

egrep command for lines that have one or more instance of 1234 but no other numbers?

So I'm fairly new to regular expressions and I'm wondering how this would be implemented as an egrep command.
I basically want to look for lines in a file that have one or more instances of "1234", but no other numbers. (non-digit characters are allowed).
Examples:
1234 - valid
12341234 - valid
12345 - invalid (since 5 is there)

You can use grep to extract the lines that contain 1234, then replace 1234 with something that doesn't appear in the input, then remove lines that still contain any digits, and replace the special string back by 1234:
< input-file grep 1234 \
| sed 's/1234/\x1/g' \
| grep -v '[0-9]' \
| sed 's/\x1/1234/g'

So, we want to select lines that have 1234 one or more times but no other digits:
grep -E '^([^[:digit:]]*1234)+[^[:digit:]]*$' file
How it works
The regex begins with ^ and ends with $. That means that it must match the whole line.
Inside the regex are two parts:
([^[:digit:]]*1234)+ matches one or more 1234 with no other digits.
[^[:digit:]]* matches any non-digits that follows the last 1234.
In olden times, one would use [0-9] to match digits. In some locales, though, a range such as [0-9] can match characters other than the ten ASCII digits, so we use the POSIX class [:digit:], which is the more portable choice.
Example
Let's use this test file:
$ cat file
this 1234 is valid
12341234 valid
not valid 12345
not 2 valid 1234 line
no numbers so not valid
Here is the result:
$ grep -E '^([^[:digit:]]*1234)+[^[:digit:]]*$' file
this 1234 is valid
12341234 valid

If you want no other digit after your 1234 block:
egrep '\<(1234)+(\>|[^0-9])' *
\< and \> --> word delimiters
1234 --> the word you're looking for
[^0-9] --> non digit characters
+ --> one or more times
If you want only "words" made up by the "1234" block, then you can egrep this:
egrep '\<(1234)+\>' *
\< and \> --> word delimiters
1234 --> the word you're looking for
+ --> one or more times.
