Perl - Generate All Matching String To A Regex - regex

I am fairly new to Perl, and I wanted to know if there is a way to generate all the strings that match a regex.
What is the best way to generate all the strings matching:
05[0,2,4,7][\d]{7}
Thanks in advance.

While you cannot take an arbitrary regex and produce every string it might match, in this case you can easily adapt and overcome.
You can use glob to generate combinations:
perl -lwe "print for glob '05{0,2,4,7}'"
050
052
054
057
However, I should not have to tell you that \d{7} actually means quite a few million combinations, right? Generating a list of numbers is trivial, formatting them can be done with sprintf:
my @nums = map sprintf("%07d", $_), 0 .. 9_999_999;
That is assuming you are only looking for 0-9 numericals.
Take those nums and combine them with the globbed ones: Tada.

No, there is no general way to generate all matches for an arbitrary regex. Consider this one:
a+
There is an infinite number of matches for that regex, thus you cannot list them all.
By the way, I think you want your regex to look like this:
05[0247]\d{7}

2012 answer
String::Random
Regexp::Genex - generates random strings that match the regexp; not all the possible strings, even for finite patterns like [class]
Parse::RandGen
§6.5 regex string generation in HOP

Then there is a way to generate all (forty million of) the matches for this certain regex, viz., 05[0247]\d{7}:
use Modern::Perl;
for my $x (qw{0 2 4 7}) {
say "05$x" . sprintf '%07d', $_ for 0 .. 9999999;
}
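The same enumeration can be sketched lazily, so the forty million strings never have to sit in memory at once. This is only an illustration in Python; the function name matches_for_pattern and its parameters are made up for this example and are not part of any answer above.

```python
from itertools import product

def matches_for_pattern(prefixes=("0", "2", "4", "7"), width=7):
    """Lazily yield every string of the form 05, one prefix digit, then `width` digits."""
    for prefix, n in product(prefixes, range(10 ** width)):
        # Zero-pad the counter to `width` digits, like sprintf '%07d' in Perl.
        yield f"05{prefix}{n:0{width}d}"

# Peek at the first few results without materialising the whole list.
gen = matches_for_pattern()
first_three = [next(gen) for _ in range(3)]
print(first_three)  # ['0500000000', '0500000001', '0500000002']
```

Passing a smaller width (for example width=2) is a cheap way to check the output before running the full enumeration.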

Related

Perl - how can I match strings that are not exactly the same?

I have a list of strings I want to find within a file. This would be fairly simple to accomplish if the strings in my list and in the file matched exactly. Unfortunately, there are typos and variations on the name. Here's an example of how some of these strings differ
List          File
B-Arrestin    Beta-Arrestin
Becn-1        BECN 1
CRM-E4        CRME4
Note that each of those pairs should count as a match despite being different strings.
I know that I could categorize every kind of variation and write separate REGEX to identify matches but that is cumbersome enough that I might be better off manually looking for matches. I think the best solution for my problem would be some kind of expression that says:
"Match this string exactly but still count it as a match if there are X characters that do not match"
Does something like this exist? Is there another way to match strings that are not exactly the same but close?
As 200_success pointed out, you can do fuzzy matching with Text::Fuzzy, which computes the Levenshtein distance between bits of text. You will have to play with what maximum Levenshtein distance you want to allow, but if you do a case-insensitive comparison, the maximum distance in your sample data is three:
use strict;
use warnings;
use 5.010;
use Text::Fuzzy;
my $max_dist = 3;
while (<DATA>) {
chomp;
my ($string1, $string2) = split ' ', $_, 2;
my $tf = Text::Fuzzy->new(lc $string1);
say "'$string1' matches '$string2'" if $tf->distance(lc $string2) <= $max_dist;
}
__DATA__
B-Arrestin Beta-Arrestin
Becn-1 BECN 1
CRM-E4 CRME4
Output:
'B-Arrestin' matches 'Beta-Arrestin'
'Becn-1' matches 'BECN 1'
'CRM-E4' matches 'CRME4'
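For readers who want to see what the distance computation does, here is a rough Python sketch of the classic Levenshtein dynamic-programming algorithm applied to the same three pairs. It is not how Text::Fuzzy is implemented internally; the names levenshtein and fuzzy_equal are hypothetical helpers for this illustration.

```python
def levenshtein(a, b):
    """Edit distance counting single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]

def fuzzy_equal(a, b, max_dist=3):
    """Case-insensitive comparison that tolerates up to max_dist edits."""
    return levenshtein(a.lower(), b.lower()) <= max_dist

pairs = [("B-Arrestin", "Beta-Arrestin"), ("Becn-1", "BECN 1"), ("CRM-E4", "CRME4")]
for left, right in pairs:
    print(left, right, levenshtein(left.lower(), right.lower()))
```

The three distances come out as 3, 1 and 1, which is why a threshold of three accepts all of the sample pairs.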
There are CPAN modules for that:
String::Approx
Text::Fuzzy

regex for n characters or at least m characters

This should be a pretty simple regex question but I couldn't find any answers anywhere. How would one write a regex that matches either exactly 2 characters or at least 4 characters? Here is my current method of doing it (ignore the regex itself, that's beside the point):
[A-Za-z0_9_]{2}|[A-Za-z0_9_]{4,}
However, this method takes twice the time (and is approximately 0.3s slower for me on a 400 line file), so I was wondering if there was a better way to do it?
Optimize the beginning, and anchor it.
^[A-Za-z0-9_]{2}(?:|[A-Za-z0-9_]{2,})$
(Also, you did say to ignore the regex itself, but I guessed you probably wanted 0-9, not 0_9)
EDIT Hm, I was sure I read that you want to match lines. Remove the anchors (^$) if you want to match inside the line as well. If you do match full lines only, anchors will speed you up (well, the front anchor ^ will, at least).
Your solution looks pretty good. As an alternative, you can try something like this:
[A-Za-z0-9_]{2}(?:[A-Za-z0-9_]{2,})?
Btw, I think you want hyphen instead of underscore between 0 and 9, don't you?
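A quick way to convince yourself that the optional-group form accepts exactly the lengths 2 and 4+ is to try it on strings of increasing length. The snippet below is a small Python check, assuming whole-string matching (fullmatch) is what is wanted; the variable names are arbitrary.

```python
import re

# Two word characters, optionally followed by two or more additional ones.
pattern = re.compile(r"[A-Za-z0-9_]{2}(?:[A-Za-z0-9_]{2,})?")

# Map each length 1..6 to whether a string of that length matches in full.
results = {n: bool(pattern.fullmatch("a" * n)) for n in range(1, 7)}
print(results)
# {1: False, 2: True, 3: False, 4: True, 5: True, 6: True}
```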
The solution you present is correct.
If you're trying to optimize the routine, and the number of strings of 2 or more characters is much smaller than the number of strings that are not, consider accepting all strings of length 2 or greater, then tossing those of length 3. This may boost performance by checking the regex only once, and the second check need not even be a regular expression; checking a string length is usually an extremely fast operation.
As always, you really need to run tests on real-world data to verify if this would give you a speed increase.
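The accept-then-filter idea might look like this in Python. It is a sketch under the assumption that the candidates are runs of word characters inside a line; the helper name words_of_len_2_or_4plus is invented for the example.

```python
import re

# One pass: grab every run of two or more word characters.
WORD = re.compile(r"[A-Za-z0-9_]{2,}")

def words_of_len_2_or_4plus(text):
    """Keep runs of length 2 or >= 4 by discarding the length-3 runs afterwards."""
    return [w for w in WORD.findall(text) if len(w) != 3]

print(words_of_len_2_or_4plus("ab abc bcad a as asdfa"))
# ['ab', 'bcad', 'as', 'asdfa']
```

Whether this is actually faster than the single alternation depends on the data, so it is worth timing both on a real file.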
So basically you want to match words of length either 2 or 2+2+N, N>=0
([A-Za-z0-9][A-Za-z0-9](?:[A-Za-z0-9][A-Za-z0-9])*)
working example:
#!/usr/bin/perl
use strict;
use warnings;
while (<STDIN>)
{
chomp;
# collect every match on the current line
my @matches = ($_ =~ /([A-Za-z0-9][A-Za-z0-9](?:[A-Za-z0-9][A-Za-z0-9])*)/g);
for my $m (@matches) {
print "match: $m\n";
}
}
input file:
cat in.txt
ab abc bcad a as asdfa
aboioioi i i abc bcad a as asdfa
output:
perl t.pl <in.txt
match: ab
match: ab
match: bcad
match: as
match: asdf
match: aboioioi
match: ab
match: bcad
match: as
match: asdf

Anyone see anything wrong with my regex for port numbers?

I made a regex for port numbers (before you say this is a bad idea, it's going into a bigger regex for URLs, which is much harder than it sounds).
My coworker said this is really bad and isn't going to catch everything. I disagree.
I believe this thing catches everything from 0 to 65535 and nothing else, and I'm looking for confirmation of this.
Single-line version (for computers):
/(^[0-9]$)|(^[0-9][0-9]$)|(^[0-9][0-9][0-9]$)|(^[0-9][0-9][0-9][0-9]$)|((^[0-5][0-9][0-9][0-9][0-9]$)|(^6[0-4][0-9][0-9][0-9]$)|(^65[0-4][0-9][0-9]$)|(^655[0-2][0-9]$)|(^6553[0-5]$))/
Human readable version:
/(^[0-9]$)| # single digit
(^[0-9][0-9]$)| # two digit
(^[0-9][0-9][0-9]$)| # three digit
(^[0-9][0-9][0-9][0-9]$)| # four digit
((^[0-5][0-9][0-9][0-9][0-9]$)| # five digit (up to 59999)
(^6[0-4][0-9][0-9][0-9]$)| # (up to 64999)
(^65[0-4][0-9][0-9]$)| # (up to 65499)
(^655[0-2][0-9]$)| # (up to 65529)
(^6553[0-5]$))/ # (up to 65535)
Can someone confirm that my understanding is correct (or otherwise)?
You could shorten it considerably:
^0*(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$
no need to repeat the anchors every single time
no need for lots of capturing groups
no need to spell out repetitions.
Drop the leading 0* if you don't want to allow leading zeroes.
This regex is also better because it matches the special cases (65535, 65001 etc.) first and thus avoids some backtracking.
Oh, and since you said you want to use this as part of a larger regex for URLs, you should then replace both ^ and $ with \b (word boundary anchors).
Edit: @ceving asked if the repetition of 6553, 655, 65 and 6 is really necessary. The answer is no - you can also use a nested regex instead of having to repeat those leading digits. Let's just consider the section
6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}
This can be rewritten as
6(?:[0-4][0-9]{3}|5(?:[0-4][0-9]{2}|5(?:[0-2][0-9]|3[0-5])))
I would argue that this makes the regex even less readable than it already was. Verbose mode makes the differences a bit clearer. Compare
6553[0-5] |
655[0-2][0-9] |
65[0-4][0-9]{2} |
6[0-4][0-9]{3}
with
6
(?:
[0-4][0-9]{3}
|
5
(?:
[0-4][0-9]{2}
|
5
(?:
[0-2][0-9]
|
3[0-5]
)
)
)
Some performance measurements: Testing each regex against all numbers from 1 through 99999 shows a minimal, probably irrelevant performance benefit for the nested version:
import timeit
r1 = """import re
regex = re.compile(r"0*(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$")"""
r2 = """import re
regex = re.compile(r"0*(?:6(?:[0-4][0-9]{3}|5(?:[0-4][0-9]{2}|5(?:[0-2][0-9]|3[0-5])))|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$")"""
stmt = """for i in range(1,100000):
regex.match(str(i))"""
print(timeit.timeit(setup=r1, stmt=stmt, number=100))
print(timeit.timeit(setup=r2, stmt=stmt, number=100))
Output:
7.7265428834649
7.556472630353351
Personally I would match just a number and then I would check with code that the number is in range.
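A minimal sketch of that approach in Python: let the regex check only the shape (one to five ASCII digits) and leave the range check to integer comparison. The function name is_valid_port is made up here, and whether leading zeroes such as "00080" should pass is an assumption you may want to change.

```python
import re

def is_valid_port(text):
    """Return True if text is 1-5 ASCII digits whose value is between 0 and 65535."""
    if not re.fullmatch(r"[0-9]{1,5}", text):
        return False  # wrong shape: empty, too long, or contains non-digits
    return 0 <= int(text) <= 65535

print(is_valid_port("8080"), is_valid_port("65536"))  # True False
```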
Well, it's easy to prove that it will validate any correct port: just generate each valid string and test that it passes. Making sure it doesn't allow anything that it shouldn't is harder though - obviously you can't test absolutely every invalid string. You should definitely test simple cases and anything which you think might pass incorrectly (or which would pass incorrectly with a lesser regex - "65536" being an example).
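Such a test is short to write. The Python sketch below checks the shortened pattern from the earlier answer (without the leading 0*) against every number from 0 to 99999; the helper name check_port_regex and the upper limit are arbitrary choices for this illustration, and it does not try non-numeric input.

```python
import re

PORT = re.compile(
    r"(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}"
    r"|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])"
)

def check_port_regex(pattern, limit=100000):
    """Return (valid ports that were rejected, invalid numbers that were accepted)."""
    false_negatives = [n for n in range(0, 65536) if not pattern.fullmatch(str(n))]
    false_positives = [n for n in range(65536, limit) if pattern.fullmatch(str(n))]
    return false_negatives, false_positives

missed, wrongly_accepted = check_port_regex(PORT)
print(len(missed), len(wrongly_accepted))  # 0 0
```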
It will allow some slightly odd port specifications though - such as "0000". Do you want to allow leading zeroes?
You might also want to consider whether you actually need to specify ^ and $ separately for each case, or whether you could use ^(?:(case 1)|(case 2)|...)$ (the outer group is needed so that the anchors apply to every alternative, not just the first and the last). Oh, and quantifiers could simplify the "1 to 4 digits" case too: ([0-9]{1,4}) will find between 1 and 4 digits.
(You might want to work on sounding a little less arrogant, by the way. If you're working with other people, communicating in a less aggressive way is likely to do more to improve everyone's day than just proving your regex is correct...)
What's wrong with parsing it into a number and working with integer comparisons? (regardless of whether or not this will be part of a "larger" regex).
If I were to use regex, I would just use:
\d{1,5}
Nope, it doesn't check for "valid" port numbers (neither does yours). But it's much more legible and for practical purposes I'd say it's "good enough."
PS: I'd work on being more humble.
A style note:
Repeating [0-9] over and over again is silly - something like [0-9][0-9][0-9] is much better written as \d{3}.
/^(?:(6553[0-5])|(655[0-2]\d)|(65[0-4]\d{2})|(6[0-4]\d{3})|([1-5]\d{4})|([1-9]\d{1,3})|(\d))$/
Regex has many implementations; which platform is this for? Try the one below:
^([1-5]?\d{1,4}|6([0-4]\d{3}|5([0-4]\d{2}|5([0-2]\d|3[0-5]))))$
Readable version (remove the blanks before use):
^(
[1-5]?\d{1,4}|
6(
[0-4]\d{3}|
5(
[0-4]\d{2}|
5(
[0-2]\d|
3[0-5]
)
)
)
)$
I would use this one:
6(?:[0-4]\d{3}|5(?:[0-4]\d{2}|5(?:[0-2]\d|3[0-5])))|(?:[1-5]\d{0,3}|[6-9]\d{0,2})?\d
The following Perl script tests some numbers:
#! /usr/bin/perl
use strict;
use warnings;
my $port = qr{
6(?:[0-4]\d{3}|5(?:[0-4]\d{2}|5(?:[0-2]\d|3[0-5])))|(?:[1-5]\d{0,3}|[6-9]\d{0,2})?\d
}x;
sub test {
my ($label, $regexp, $start, $stop) = @_;
my $matches = 0;
my $tests = 0;
foreach my $n ($start..$stop) {
$tests++;
$matches++ if "$n" =~ /^$regexp$/;
$tests++;
$matches++ if "0$n" =~ /^$regexp$/;
}
print "$label [$start $stop] => $matches matches in $tests tests\n";
}
test "Port", $port, 0, 2**16;
The output is:
Port [0 65536] => 65536 matches in 131074 tests

RegEx: How can I replace with $n instances of a string?

I'm trying to replace numbers of the form 4.2098234e-3 with 00042098234. I can capture the component parts ok with:
(-?)(\d+)\.(\d+)e-(\d+)
but what I don't know how to do is to repeat the zeros at the start $4 times.
Any ideas?
Thanks in advance,
Ross
Ideally, I'd like to be able to do this with the find/replace feature of TextMate, if that's of any consequence. I appreciate that there are better tools than RegEx for this problem, but it's still an interesting question (to me).
You can't do it purely in regular expressions, because the replace string is just a string with backreferences -- you can't use repetition there.
In most programming languages, you have regex replace with a callback, which would be able to do it. However, it's not something that a text editor can do (unless it has some scripting support).
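As an illustration of the callback approach, here is a small Python sketch using re.sub with a function as the replacement. It assumes the input always has a fractional part and a negative exponent, as in the question; the names SCI and expand are made up for the example.

```python
import re

# sign, integer part, fractional part, exponent magnitude
SCI = re.compile(r"(-?)(\d+)\.(\d+)e-(\d+)")

def expand(match):
    """Build the replacement: the sign, then <exponent> zeroes, then all the digits."""
    sign, whole, frac, exp = match.groups()
    return sign + "0" * int(exp) + whole + frac

print(SCI.sub(expand, "x = 4.2098234e-3"))  # x = 00042098234
```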
This isn't something that should be done with regex. That said, you can do something like this, but it's not really worth the effort: the regex is complicated, and the capability is limited.
Here's an illustrative example of replacing a digit [0-9] with that many zeroes.
// generate the regex and the replacement strings
String seq = "123456789";
String regex = seq.replaceAll(".", "(?=[$0-9].*(0)\\$)?") + "\\d";
String repl = seq.replaceAll(".", "\\$$0");
// let's see what they look like!!!
System.out.println(repl); // prints "$1$2$3$4$5$6$7$8$9"
System.out.println(regex); // prints oh my god just look at the next section!
// let's see if they work...
String input = "3 2 0 4 x 11 9";
System.out.println(
(input + "0").replaceAll(regex, repl)
); // prints "000 00  0000 x 00 000000000"
// it works!!!
The regex is (as seen on ideone.com) (slightly formatted for readability):
(?=[1-9].*(0)$)?
(?=[2-9].*(0)$)?
(?=[3-9].*(0)$)?
(?=[4-9].*(0)$)?
(?=[5-9].*(0)$)?
(?=[6-9].*(0)$)?
(?=[7-9].*(0)$)?
(?=[8-9].*(0)$)?
(?=[9-9].*(0)$)?
\d
But how does it work??
The regex relies on positive lookaheads. It matches \d, but before doing that, it tries to see if it's [1-9]. If so, \1 goes all the way to the end of the input, where a 0 has been appended, to capture that 0. Then the second assertion checks if it's [2-9], and if so, \2 goes all the way to the end of the input to grab 0, and so on.
The technique works, but beyond being a cute regex exercise, it probably has no real practical use.
Note also that 11 is replaced to 00. That is, each 1 is replaced with 1 zero. It's probably possible to recognize 11 as a number and put 11 zeroes instead, but it'd only make the regex more convoluted.

How can I make this regex more compact?

Let's say I have a line of text like this
Small 0.0..20.0 0.00 1.49 25.71 41.05 12.31 0.00 80.56
I want to capture the last six numbers and ignore the Small and the first two groups of numbers.
For this exercise, let's ignore the fact that it might be easier to just do some sort of string-split instead of a regular expression.
I have this regex that works but is kind of horrible looking
^(Small).*?[0-9.]+.*?[0-9.]+.*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+)
Is there some way to compact that?
For example, is it possible to combine the check for the last 6 numbers into a single statement that still stores the results as 6 separate group matches?
If you want to keep each match in a separate backreference, you have no choice but to "spell it out" - if you use repetition, you can either catch all six groups "as one" or only the last one, depending on where you put the capturing parentheses. So no, it's not possible to compact the regex and still keep all six individual matches.
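The "only the last one" behaviour is easy to see in a small experiment. The following Python snippet uses a made-up input string; it shows that a capturing group inside a repetition ends up holding only its final iteration.

```python
import re

# The capturing group sits inside a {3} repetition, so it is overwritten each time.
m = re.match(r"(?:(\d+)\s*){3}", "10 20 30")
print(m.groups())  # ('30',)

# Spelling the groups out keeps all three values.
m2 = re.match(r"(\d+)\s+(\d+)\s+(\d+)", "10 20 30")
print(m2.groups())  # ('10', '20', '30')
```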
A somewhat more efficient (though not beautiful) regex would be:
^Small\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)
since it matches the spaces explicitly. Your regex will result in a lot of backtracking. My regex matches in 28 steps, yours in 106.
Just as an aside: In Python, you could simply do a
>>> pieces = "Small 0.0..20.0 0.00 1.49 25.71 41.05 12.31 0.00 80.56".split()[-6:]
>>> print pieces
['1.49', '25.71', '41.05', '12.31', '0.00', '80.56']
Here is the shortest I could get:
^Small\s+(?:[\d.]+\s+){2}([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$
It must be long because each capture must be specified explicitly. No need to capture "Small", though. But it is better to be specific (\s instead of .) when you can, and to anchor on both ends.
For usability, you should use string substitution to build regex from composite parts.
$d = "[0-9.]+";
$s = ".*?";
$re = "^(Small)$s$d$s$d$s($d)$s($d)$s($d)$s($d)$s($d)$s($d)";
At least then you can see the structure past the pattern, and changing one part changes them all.
If you wanted to go further, you could use a short metasyntax to make it even easier to read:
$re = "^(Small)_#D_#D_(#D)_(#D)_(#D)_(#D)_(#D)_(#D)";
$re = str_replace('#D','[0-9.]+',$re);
$re = str_replace('_', '.*?' , $re );
( This way it also makes it trivial to change the definition of what a space token is, or what a digit token is )
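The same build-it-from-parts idea can be sketched in Python, here using the stricter whitespace-separated form from the earlier answers. NUM, SEP and PATTERN are names chosen for this example only.

```python
import re

NUM = r"[\d.]+"  # one numeric token
SEP = r"\s+"     # whitespace between tokens

# "Small", two skipped numeric tokens, then six captured ones.
PATTERN = re.compile(
    "^Small" + SEP
    + f"(?:{NUM}{SEP}){{2}}"
    + SEP.join([f"({NUM})"] * 6)
    + r"\s*$"
)

line = "Small 0.0..20.0 0.00 1.49 25.71 41.05 12.31 0.00 80.56"
print(PATTERN.match(line).groups())
# ('1.49', '25.71', '41.05', '12.31', '0.00', '80.56')
```

The compiled pattern still contains six explicit groups; only the source that produces it is shorter.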