Perl: Fastest match of anything?

I want to do this without using /s:
(?:.|\n)+
And I want it to be fast.
It is part of a bigger regexp, which is the reason I cannot use /s. I have tested:
perl -pe 's/(?:.|\n)+//' # 30 MBytes/s
perl -pe 's/[^\777]+//' # 124 MBytes/s
They are not as fast as /s:
perl -pe 's/.+//s' # 164 MBytes/s
Can I somehow get the same speed as /s?
Edit:
perl -pe 's/(?s).+(?-s)//' # 162 MBytes/s
perl -pe 's/[\d\D]+//' # 162 MBytes/s
perl -pe 's/[\s\S]+//' # 161 MBytes/s
These are all good options. Thanks.

If I create a 98MB file:
$ dd bs=1024 count=100000 </dev/urandom >file
100000+0 records in
100000+0 records out
102400000 bytes transferred in 0.558640 secs (183302305 bytes/sec)
Now compare some methods:
time perl -0777 -lne 's/(?:.|\n)+//' file
real 0m1.957s
user 0m1.902s
sys 0m0.050s
time perl -0777 -lne 's/[^\777]+//' file
real 0m0.130s
user 0m0.082s
sys 0m0.046s
time perl -0777 -lne 's/[\s\S]+//' file
real 0m0.065s
user 0m0.022s
sys 0m0.041s
time perl -0777 -lne 's/.+//s' file
real 0m0.064s
user 0m0.022s
sys 0m0.038s
As you can see, since Perl 5.24 the character class [\s\S] is optimized to the same regex program as /./s.
$ perl -Mre=debug -e'qr/./s'
Compiling REx "."
Final program:
1: SANY (2)
2: END (0)
minlen 1
Freeing REx: "."
$ perl -Mre=debug -e'qr/[\s\S]/'
Compiling REx "[\s\S]"
Final program:
1: SANY (2)
2: END (0)
minlen 1
Freeing REx: "[\s\S]"

You do not need to use any workarounds.
You may use inline modifier flags, (?s) to enable dot to match across lines and (?-s) to disable this effect.
For example:
(?s).*PATTERN(?-s).*
where the first .* matches any text and the last .* only matches till the end of a line.
You can also use modifier groups:
(?s:.*)PATTERN(?-s:.*)
See more in Extended Patterns.

You can use (?s:PATTERN) to enable /s for just the subpattern in the parens.
(?s:.)+
or
(?s:.+)
For example,
$ perl -M5.010 -e'
say /^.(?s:.).\z/ ? "match" : "no match"
for "\n\n\n", "A\n\n", "\n\nA", "A\nA";
'
no match
no match
no match
match
There is no difference between . with /s and (?s:.), so you'll get exactly the same performance.
$ perl -Mre=debug -e'qr/(?s:.)/'
Compiling REx "(?s:.)"
Final program:
1: SANY (2)
2: END (0)
minlen 1
Freeing REx: "(?s:.)"
$ perl -Mre=debug -e'qr/./s'
Compiling REx "."
Final program:
1: SANY (2)
2: END (0)
minlen 1
Freeing REx: "."
Advantages over the other solutions:
(?s:.) restores the s flag state, while (?s).(?-s) turns it off.
(?s:.) is a fair bit shorter than (?s).(?-s).
(?s:.) is not longer than [\s\S] and [\d\D].
(?s:.) is not a workaround like [\s\S] and [\d\D].
That said, (?s:.), (?s).(?-s), [\s\S] and [\d\D] all result in exactly the same regex program (since 5.24), so they'll all perform identically.

Related

Regex substitute multiple lines for a single line

I have a plain text file in which I need to substitute multiple consecutive lines of text with a single replacement line. For example, when I have a date and time, followed by a blank line, followed by a page number,
11/13/2018 08:33:00

Page 1 of 1
I'd like to replace it with a single line (e.g., PAGE BREAK).
I've tried
sed 's/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}\n\nPage \d of \d/PAGE BREAK/g' file1.txt > file2.txt
and
perl -pe 's/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}\n\nPage \d of \d/PAGE BREAK/g' file1.txt > file2.txt
but it leaves the text unchanged.
Both sed and Perl process the input line by line. You can tell Perl to load the whole file into memory by using -0777 (if it's not too large):
perl -0777 -pe 's=[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\n\nPage [0-9]+ of [0-9]+=PAGE BREAK=g'
Note that I used [0-9], because \d can match ٤, ໖, ६, or 𝟡.
I also used s=== instead of s/// so I don't have to backslash the slashes in the date part.
Another Perl variant
$ cat page_break.txt
123 45 jh kljl
11/13/2018 08:33:00

Page 1 of 1
ghjgjh hkjhj
fhfghfghfh
11/13/2018 08:33:00

Page 1 of 2
ghgigkjkj
$ perl -ne '{ if ( (/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}/ and $x++)or ( /^\s*$/ and $x++) or (/Page \d of \d/ and $x++) ){} if($x==0) { print "$_" } if($x==3) { print "PAGE BREAK\n"; $x=0} }' page_break.txt
123 45 jh kljl
PAGE BREAK
ghjgjh hkjhj
fhfghfghfh
PAGE BREAK
ghgigkjkj
$

when does a sub stop in perl regex

I'm trying to translate some Perl code into Python and ran into a regex I can't figure out: I don't understand what it does or why it stops where it does.
this is the regex
$url =~ s/^.*\///;
now I've tried to pass some urls and see what comes out
so this is what comes out
# string input
"http://perltest.my-mobile.org/c/test.cgi?u=USER&p=PASS"
# string output
"test.cgi?u=USER&p=PASS"
I really don't know why it is stopping at test as far as I understand it,
it replaces any character in the beginning of the string with nothing
so why does it stop at test?
And if you can help me write a regex in python that does the same thing
that would be cool
Thanks in advance!
I really don't know why it is stopping at 'test' as far as I understand it, it replaces any character in the beginning of the string with nothing so why does it stop at test?
Because of the \/ being part of the pattern.
#            V here
$url =~ s/^.*\///;
It would be clearer if the code was using a different quoting delimiter, which is possible in Perl. That way, there would not be the leaning toothpick syndrome here.
$url =~ s{^.*/}{};
Note that .* is greedy by default, so it consumes as much as it can, including slashes, and the match ends at the last slash in the string.
You can use the re pragma in debug mode to learn more about what the regex engine does under the hood.
use re 'debug';
my $url = "http://perltest.my-mobile.org/c/test.cgi?u=USER&p=PASS";
$url =~ s{^.*/}{};
This will output to STDERR.
Compiling REx "^.*/"
Final program:
1: SBOL /^/ (2)
2: STAR (4)
3: REG_ANY (0)
4: EXACT </> (6)
6: END (0)
floating "/" at 0..9223372036854775807 (checking floating) anchored(SBOL) minlen 1
Matching REx "^.*/" against "http://perltest.my-mobile.org/c/test.cgi?u=USER&p=PASS"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [0..54] gave 5
Found floating substr "/" at offset 5 (rx_origin now 0)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <http://per> | 0| 1:SBOL /^/(2)
0 <> <http://per> | 0| 2:STAR(4)
| 0| REG_ANY can match 54 times out of 2147483647...
31 <org/c> </test.cgi?> | 1| 4:EXACT </>(6)
32 <rg/c/> <test.cgi?u> | 1| 6:END(0)
Match successful!
Freeing REx: "^.*/"
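The greedy behaviour can also be seen without Perl. A small sketch, assuming a POSIX shell and sed: the ## expansion removes the longest prefix matching */ (like the greedy ^.*/), while # removes the shortest one. The variable names are only for illustration.

```shell
url='http://perltest.my-mobile.org/c/test.cgi?u=USER&p=PASS'

# Longest prefix ending in "/" removed: same result as s/^.*\///
longest=${url##*/}
# Shortest prefix ending in "/" removed: stops at the first slash
shortest=${url#*/}
# The same greedy substitution with sed, using | as the delimiter
via_sed=$(printf '%s\n' "$url" | sed 's|^.*/||')

printf '%s\n' "$longest" "$shortest" "$via_sed"
```

The first and third lines printed should be identical, while the second keeps everything after the first slash of "http://".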

Delete all lines which don't match a pattern

I am looking for a way to delete all lines that do not follow a specific pattern (from a txt file).
Pattern which I need to keep the lines for:
x//x/x/x/5/x/
x could be any amount of characters, numbers or special characters.
5 is always a combination of alphanumeric - 5 characters - e.g Xf1Lh, always appears after the 5th forward slash.
/ are actual forward slashes.
Input:
abc//a/123/gds:/4AdFg/f3dsg34/
y35sdf//x/gd:df/j5je:/x/x/x
yh//x/x/x/5Fsaf/x/
45wuhrt//x/x/dsfhsdfs54uhb/
5ehys//srt/fd/ab/cde/fg/x/x
Desired output:
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
grep selects lines according to a regular expression and your x//x/x/x/5/x/ just needs minor changes to make it into a regular expression:
$ grep -E '.*//.*/.*/.*/[[:alnum:]]{5}/.*/' file
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Explanation:
"x could be any amount of characters, numbers or special characters". In a regular expression that is .* where . means any character and * means zero or more of the preceding character (which in this case is .).
"5 is always a combination of alphanumeric - 5 characters". In POSIX regular expressions, [[:alnum:]] means any alphanumeric character. {5} means five of the preceding. [[:alnum:]] is unicode-safe.
Possible improvements
One issue is how x should be interpreted. In the above, x was allowed to be any character. As triplee points out, however, another reasonable interpretation is that x should be any character except /. In that case:
grep -E '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
Also, we might want this regex to match only complete lines. In that case, we can either surround the regex with ^ and $ or we can use grep's -x option:
grep -xE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
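As a quick check of the whole-line variant, the sketch below feeds the sample input from the question through the same pattern. The function name keep_matching is hypothetical:

```shell
# Hypothetical helper: keep only lines that match the whole-line pattern.
keep_matching() {
  grep -xE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/'
}

# The sample input from the question, fed on standard input.
printf '%s\n' \
  'abc//a/123/gds:/4AdFg/f3dsg34/' \
  'y35sdf//x/gd:df/j5je:/x/x/x' \
  'yh//x/x/x/5Fsaf/x/' \
  '45wuhrt//x/x/dsfhsdfs54uhb/' \
  '5ehys//srt/fd/ab/cde/fg/x/x' | keep_matching
```

Only the first and third lines should survive. To rewrite a file, one option is to redirect the output to a temporary file and move it over the original once grep succeeds.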
I was figuring out how to do it in awk at the same time as the other answer and came up with:
awk -F/ 'BEGIN{OFS=FS}$2==""&&$6~/[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]/&&NF=8'
The awk I worked it out on didn't support the {5} regexp frob.
You can use grep's -P option for Perl-compatible regular expressions, like so:
grep -P "^(?:[^/]*/){5}[A-Za-z0-9]{5}(?:/|$)" input
Output
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Regex Breakdown
^                  # Start of line
(?:                # Non-capturing group
  [^/]*            # Match anything except /
  /                # Match / literally
){5}               # Repeat this 5 times
[A-Za-z0-9]{5}     # Match alphanumerics. You can use \w if you want to allow _ along with [A-Za-z0-9]
(?:                # Non-capturing group
  /                # Next character should be /
  |                # OR
  $                # End of line
)
Using sed and in place edit to delete all lines that do not follow a specific pattern (from a txt file):
$ sed -i.bak -n "/.*\/\/.*\/.*\/.*\/[a-zA-Z0-9]\{5\}\/.*\//p" test.in
$ cat test.in
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
-i.bak edits the file in place and keeps a backup in test.in.bak; -n suppresses the default output, so non-matching lines are not printed;
and the trailing p in "/.../p" prints the lines that match.

Bash script grep for pattern in variable of text

I have a variable which contains text; I can echo it to stdout so I think the variable is fine. My problem is trying to grep for a pattern in that variable of text. Here is what I am trying:
ERR_COUNT=`echo $VAR_WITH_TEXT | grep "ERROR total: (\d+)"`
When I echo $ERR_COUNT the variable appears to be empty, so I must be doing something wrong.
How to do this properly? Thanks.
EDIT - Just wanted to mention that testing that pattern on the example text I have in the variable does give me something (I tested with: http://rubular.com)
However the regex could still be wrong.
EDIT2 - Not getting any results yet, so here's the string I'm working with:
ALERT line125: Alert: Cannot locate any description for 'asdf' in the qwer.xml hierarchy. (due to (?i-xsm:\balert?\b) ALERT in ../hgfd.controls) ALERT line126: Alert: Cannot locate any description for 'zxcv' in the qwer.xml hierarchy. (due to (?i-xsm:\balert?\b) ALERT in ../dfhg.controls) ALERT line127: Alert: Cannot locate any description for 'rtyu' in the qwer.xml hierarchy. (due to (?i-xsm:\balert?\b) ALERT in ../kjgh.controls) [1] 22280 IGNORE total: 0 WARN total: 0 ALERT total: 3 ERROR total: 23 [1] + Done /tool/pandora/bin/gvim -u NONE -U NONE -nRN -c runtime! plugin/**/*.vim -bg ...
That's the string, so hopefully there should be no ambiguity anymore... I want to extract the number "23" (after "ERROR total: ") into a variable and I'm having a hard time haha.
Cheers
You can use bash's =~ operator to extract the value.
[[ $VAR_WITH_TEXT =~ ERROR\ total:\ ([0-9]+) ]]
Note that you have to escape the spaces, or quote only
the fixed parts of the regular expression:
[[ $VAR_WITH_TEXT =~ "ERROR total: "([0-9]+) ]]
since quoting any of the metacharacters causes them to be treated
literally.
You can also save the regex in a variable:
regex="ERROR total: ([0-9]+)"
[[ $VAR_WITH_TEXT =~ $regex ]]
In any case, once the expression matches, the text matched by the parenthesized
subexpression can be found in the BASH_REMATCH array.
ERR_COUNT=${BASH_REMATCH[1]}
(The zeroth element contains the entire matched text; the parenthesized subexpressions are found in the remaining elements in the order they appear in the full regex.)
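Putting the pieces together, here is a minimal sketch that also checks whether the match succeeded before reading BASH_REMATCH. It assumes bash (the [[ ... =~ ... ]] operator is not available in plain sh); the function name extract_error_total and the shortened sample string are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number after "ERROR total: ", or fail.
extract_error_total() {
  local text=$1
  local regex='ERROR total: ([0-9]+)'
  if [[ $text =~ $regex ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

# Shortened sample based on the string in the question.
sample='IGNORE total: 0 WARN total: 0 ALERT total: 3 ERROR total: 23 [1] + Done'
ERR_COUNT=$(extract_error_total "$sample") && echo "$ERR_COUNT"
```

Returning a non-zero status when nothing matches lets the caller tell "no match" apart from an empty count.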
If you want to use grep, you'll need a version that can accept Perl-style regexes.
ERR_COUNT=$( echo "$VAR_WITH_TEXT" | grep -Po "(?<=ERROR total: )\d+" )
As long as you need to use Perl-style regexes to enable the look-behind assertion, you can replace [0-9] with \d.
Your error is in the pattern: (\d+) matches:
'('
a digit
'+'
')'
According to your comment, what you want is \(\d\+\), which:
defines a sub-pattern by \( ... \)
Inside it matches at least one (\+) digit (\d).
In this case, if you don't need a sub-pattern, you can just drop the \( and \).
Note: if your grep doesn't understand \d, you can replace it by [0-9]. Easiest way is to write grep '\d' and test it by writing a couple test lines.
# setting example data
test="adfa\nfasetrfaqwe\ndsfa ERROR total: 32514235dsfaewrf"
one solution:
echo $(sed -n 's/^.*ERROR total: \([0-9]*\).*$/\1/p' < <(echo $test))
32514235
other solution:
# throw away everything up to "ERROR total: "
test=${test##*ERROR total: }
# cut from behind assuming number contains no spaces and is
# separated by space
test=${test%% *}
echo $test
32514235
The \d is probably only recognized as a digit in perl regex mode, you probably want to use grep -P.
If you only want the number you could try:
ERR_COUNT=$(echo $VAR_WITH_TEXT | perl -pe "s/.*ERROR total: (\d+).*/\1/g")
or:
ERR_COUNT=$(echo $VAR_WITH_TEXT | sed -En "s/.*ERROR total: ([0-9]+).*/\1/gp")

POSIX Regular Expressions Limit Repetitions

I am trying to grep for a maximum number of repetitions allowed on my input string and can't seem to get it working.
The input file has three lines with 1, 5 and 7 repetitions of "pq" respectively. The >=3 and >=5 expressions are working fine, but the "between 3 and 5" expression {3,5} shows the line with seven repetitions as well.
DEV /> cat input.txt
pq -- One occurance of pq
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
DEV /> grep "\(pq\)\{3,\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
DEV /> grep "\(pq\)\{5\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
DEV /> grep "\(pq\)\{3,5\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
Am I doing something wrong or is this the expected behavior?
If this is the expected behavior ( as the string with 7 PQs has between 3-5 PQs),
1) in what cases is the maximum repetitions applicable? What would be the difference between {3,5} and {3,} (greater than 3)?
2) I can anchor my regular expressions with "^", but what if my string does not end with "pq" and has more text?
If a line has seven repetitions of something, it necessarily also contains between 3 and 5 repetitions of that thing, and at several starting points, no less.
Use match anchors if you expect matches to be anchored. Otherwise, of course, they are not.
The practical difference between /X{3,}/ and /X{3,5}/ is how long a string it matches: the extent (or span) of the match. If all you are looking for is a boolean yes/no response and there is nothing further in your pattern, it does not make much of a difference; in fact, a moderately clever regex engine will return early if it knows it is safe to do so.
One way to see the difference is with GNU grep’s -o or --only-matching option. Watch:
$ echo 123456789 | egrep -o '[0-9]{3}'
123
456
789
$ echo 123456789 | egrep -o '[0-9]{3,}'
123456789
$ echo 123456789 | egrep -o '[0-9]{3,5}'
12345
6789
$ echo 123456789 | egrep -o '[0-9]{3,5}[2468]'
123456
$ echo 1234567890 | egrep -o '[0-9]{3,5}[13579]'
12345
6789
To understand how those last two work, it is useful to get a trace of the regex engine’s attempts, including backtracking steps. You can do this using Perl in this way:
$ perl -Mre=debug -le 'print $& while 1234567890 =~ /\d{3,5}[13579]/g'
Compiling REx "\d{3,5}[13579]"
Final program:
1: CURLY {3,5} (4)
3: DIGIT (0)
4: ANYOF[13579][] (15)
15: END (0)
stclass DIGIT minlen 4
Matching REx "\d{3,5}[13579]" against "1234567890"
Matching stclass DIGIT against "1234567" (7 chars)
0 <> <1234567890> | 1:CURLY {3,5}(4)
DIGIT can match 5 times out of 5...
5 <12345> <67890> | 4: ANYOF[13579][](15)
failed...
4 <1234> <567890> | 4: ANYOF[13579][](15)
5 <12345> <67890> | 15: END(0)
Match successful!
12345
Matching REx "\d{3,5}[13579]" against "67890"
Matching stclass DIGIT against "67" (2 chars)
5 <12345> <67890> | 1:CURLY {3,5}(4)
DIGIT can match 5 times out of 5...
10 <1234567890> <> | 4: ANYOF[13579][](15)
failed...
9 <123456789> <0> | 4: ANYOF[13579][](15)
failed...
8 <12345678> <90> | 4: ANYOF[13579][](15)
9 <123456789> <0> | 15: END(0)
Match successful!
6789
Freeing REx: "\d{3,5}[13579]"
When you have additional constraints about what comes after the match, then which type of repetition you choose can make a big difference. Here I’ll impose a constraint on where each match is allowed to finish, by saying it needs to end before an odd digit:
$ perl -le 'print $& while 1234567890 =~ /\d{3}(?=[13579])/g'
234
678
$ perl -le 'print $& while 1234567890 =~ /\d{3,5}(?=[13579])/g'
1234
5678
$ perl -le 'print $& while 1234567890 =~ /\d{3,}(?=[13579])/g'
12345678
So when you have things that have to come afterwards, it can make a great deal of difference. When you are just deciding whether the entire line matches something, it may not be as important.
This is expected behavior. The string "pqpqpqpqpqpqpq" does in fact have between three and five repetitions of "pq", and then a few more for good measure. You may want to try anchoring your regular expression, something like ^\(pq\)\{3,5\}$.
Edit to match edited question:
The maximum is applicable in all situations. What is happening is that grep is matching 5 of the 7 repetitions of "pq" (most likely the first five), and since it found a match it prints out the line.
You'll have to figure out a way to change your regex to match what you want and not match what you don't. For example, to match a line starting with 3–5 repetitions of "pq", you might do something like this with grep -E: ^(pq){3,5}($|[^p]|p$|p[^q]). That matches 3–5 "pq"s followed immediately by end-of-line or any-character-other-than-"p" or "p"-followed-by-end-of-line or "p"-followed-by-any-character-other-than-"q".
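A small sketch of that idea, written as an extended regular expression for grep -E. The function name starts_with_3_to_5_pq and the sample lines are assumptions, loosely modelled on the input file in the question:

```shell
# Hypothetical helper: lines that start with 3 to 5 copies of "pq" and no more.
starts_with_3_to_5_pq() {
  grep -E '^(pq){3,5}($|[^p]|p$|p[^q])'
}

printf '%s\n' \
  'pq -- one' \
  'pqpqpqpqpq -- five' \
  'pqpqpqpqpqpqpq -- seven' | starts_with_3_to_5_pq
```

Only the line with five repetitions should be printed: the line with seven is rejected because, after any run of 3 to 5 copies, the next two characters are "pq" again.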