Regex performance: validating alphanumeric characters


When trying to validate that a string is made up of alphabetic characters only, two possible regex solutions come to mind.
The first one checks that every character in the string is alphabetic:
/^[a-z]+$/
The second one tries to find a character somewhere in the string that is not alphabetic:
/[^a-z]/
(Yes, I could use character classes here.)
Is there any significant performance difference for long strings?
(If anything, I'd guess the second variant is faster.)

Just by looking at it, I'd say the second method is faster.
However, I made a quick non-scientific test, and the results seem to be inconclusive:
Regex Match vs. Negation.
P.S. I removed the group capture from the first method. It's superfluous, and would only slow it down.

Wrote this quick Perl code:
my @testStrings = qw(asdfasdf asdf as aa asdf as8up98;n;kjh8y puh89uasdf ;lkjoij44lj 'aks;nasf na ;aoij08u4 43[40tj340ij3 ;salkjaf; a;lkjaf0d8fua ;alsf;alkj
a a;lkf;alkfa as;ldnfa;ofn08h[ijo ok;ln n ;lasdfa9j34otj3;oijt 04j3ojr3;o4j ;oijr;o3n4f;o23n a;jfo;ie;o ;oaijfoia ;aosijf;oaij ;oijf;oiwj;
qoeij;qwj;ofqjf08jf0 ;jfqo;j;3oj4;oijt3ojtq;o4ijq;onnq;ou4f ;ojfoqn;aonfaoneo ;oef;oiaj;j a;oefij iiiii iiiiiiiii iiiiiiiiiii);
print "test 1: \n";
foreach my $i (1..1000000) {
    foreach (@testStrings) {
        if ($_ =~ /^([a-z])+$/) {
            #print "match"
        } else {
            #print "not"
        }
    }
}
print `date` . "\n";
print "test 2: \n";
foreach my $j (1..1000000) {
    foreach (@testStrings) {
        if ($_ =~ /[^a-z]/) {
            #print "match"
        } else {
            #print "not"
        }
    }
}
Then ran it with:
date; <perl_file>; date
It isn't 100% scientific, but it gives us a good idea. The first regex took 10 or 11 seconds to execute; the second took 8 seconds.
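Speed aside, the two patterns are not quite interchangeable. Here is a minimal Python sketch (not the Perl used above; the function names are made up for illustration) that applies both checks to a few inputs. The positive pattern rejects the empty string while the negated one accepts it, and $ also matches just before a trailing newline, which is true of Perl's $ as well.

```python
import re

ALL_ALPHA = re.compile(r'^[a-z]+$')   # every character must be a-z
NON_ALPHA = re.compile(r'[^a-z]')     # any single character outside a-z

def valid_by_match(s):
    """Positive check: the whole string is one or more of a-z."""
    return ALL_ALPHA.search(s) is not None

def valid_by_negation(s):
    """Negative check: no character outside a-z was found."""
    return NON_ALPHA.search(s) is None

if __name__ == "__main__":
    # The last two inputs are the ones where the checks disagree.
    for s in ["asdf", "as8up98", "", "abc\n"]:
        print(repr(s), valid_by_match(s), valid_by_negation(s))
```

So whichever variant turns out faster, decide first how empty strings and trailing newlines should be treated.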

Related

Unanchored substring searching: index vs regex?

I am writing some Perl scripts where I need to do a lot of string matching.
For example:
my $str1 = "this is a test string";
my $str2 = "test";
To see if $str1 contains $str2 - I found that there are 2 approaches:
Approach 1:
use Index function:
if ( index($str1, $str2) != -1 ) { .... }
Approach 2:
use regular expression:
if( $str1 =~ /$str2/ ) { .... }
Which is better? and when should we use each of these over the other?

Here is the result of Benchmark:
use Benchmark qw(:all) ;
my $count = -1;
my $str1 = "this is a test string";
my $str2 = "test";
my $str3 = qr/test/;
cmpthese($count, {
'type1' => sub { if ( index($str1, $str2) != -1 ) { 1 } },
'type2' => sub { if( $str1 =~ $str3 ) { 1 } },
});
Result (when a match happens):
Rate type2 type1
type2 1747627/s -- -70%
type1 5770465/s 230% --
To be able to draw a conclusion, test not to match:
my $str2 = "text";
my $str3 = qr/text/;
Result (when a match does not happen):
Rate type2 type1
type2 1857295/s -- -67%
type1 5560630/s 199% --
Conclusion:
The index function is much faster than the regexp match.

When I see code that uses index, I usually see an index within an index within an index, etc. There's also more branching too: "if found, look for this; otherwise since not found, look for that." Almost always a single regex would have worked. So, for me, I almost always use a regex unless there's some specific reason I want to use an index.
Unfortunately, most programmers I run into don't read regex well and so for maintainability, the index method should be used more than I do.

If you need a substring match, use index. If you need a regexp match (with special meaning for regexp metacharacters), use =~. A substring match is usually faster, but regexps in Perl are quite well optimized, and simple regexp matches can be surprisingly fast. Benchmark it for yourself.
Since Perl 5.6, Perl is smart enough to recompile the regexp in $str =~ /$str2/ iff $str2 has changed since the last compilation. To fully control when your regexp is compiled, use qr/$str2/. See Does the 'o' modifier for Perl regular expressions still provide any benefit? for q/.../o (obsolete) and qr/.../ (not needed most of the time, but can be useful).
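To illustrate the "special meaning for regexp metacharacters" point, here is a small Python sketch (Python rather than Perl, and the helper names are hypothetical). A needle containing "." behaves differently as a plain substring and as an interpolated pattern; escaping it, which Perl spells \Q$str2\E, restores the literal meaning.

```python
import re

def contains_literal(haystack, needle):
    """Plain substring test: every character of needle is literal."""
    return needle in haystack

def contains_as_regex(haystack, needle):
    """needle is used as a pattern, so '.' matches any character."""
    return re.search(needle, haystack) is not None

def contains_escaped(haystack, needle):
    """needle is escaped first, so it is matched literally again."""
    return re.search(re.escape(needle), haystack) is not None

if __name__ == "__main__":
    haystack, needle = "price: abc", "a.c"
    print(contains_literal(haystack, needle))
    print(contains_as_regex(haystack, needle))
    print(contains_escaped(haystack, needle))
```

If the needle comes from user input and a literal match is intended, the unescaped regex form is the one to avoid.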

regular expression help: catch this: |TrxId=475665|

For example I have a string:
MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|
and I want to catch this: |TrxId=475665|
after TrxId= it could be any numbers and any amount of them, so regex should catch as well:
|TrxId=111333| and |TrxId=0000011112222| and |TrxId=123|

TrxId=(\d+)
That would give a group (1) with the TrxId.
PS: Use global modifier.

The regex should look somewhat like this:
TrxId=[0-9]+
It will match TrxId= followed by at least one digit.

An example solution in Python:
In [107]: data = 'MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|'
In [108]: m = re.search(r'\|TrxId=(\d+)\|', data)
In [109]: m.group(0)
Out[109]: '|TrxId=475665|'
In [110]: m.group(1)
Out[110]: '475665'

/MsgNam\=.*?\|(TrxId\=\d+)\|.*/
for example in perl:
$a = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100111|";
$a =~ /MsgNam\=.*?\|(TrxId\=\d+)\|.*/;
print $1;
will print TrxId=475665

You know what your delimiters look like, so you don't need a regex, you need to split. Here's an implementation in Perl.
use strict;
use warnings;
my $input = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|";
my @first_array = split(/\|/,$input); #splitting $input on "|"
#split drops trailing empty fields, so the final "|" adds no extra
#element; the grep below is only a safeguard against undefined entries.
@first_array = grep{defined}@first_array;
#Also filter out elements that do not have an equals sign appearing.
@first_array = grep{/=/}@first_array;
#Now, put these elements into an associative array:
my %assoc_array;
foreach(@first_array)
{
    if(/^([^=]+)=(.+)$/)
    {
        $assoc_array{$1} = $2;
    }
    else
    {
        #Something weird may be happening...
        #we may have an element starting with "=", or one with an
        #empty value such as "WepNr=", for example.
        #Do what you want: throw a warning, die, silently move on, etc.
    }
}
if(exists $assoc_array{TrxId})
{
    print "|TrxId=" . $assoc_array{TrxId} . "|\n";
}
else
{
    print "Sorry, TrxId not found!\n";
}
The code above yields the expected output:
|TrxId=475665|
Now, obviously this is more complex than some of the other answers, but it's also a bit more robust in that it allows you to search for more keys as well.
This approach does have a potential issue if your keys appear more than once. In that case, it's easy enough to modify the code above to collect an array reference of values for each key.
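The repeated-key variant could look roughly like the following Python sketch (Python instead of Perl; parse_fields and the shortened record with a second TrxId are invented for illustration). Each key maps to a list of values, so duplicates are kept in order. Unlike the Perl code above, this sketch also keeps empty values.

```python
from collections import defaultdict

def parse_fields(record, sep="|"):
    """Split a 'key=value|key=value|' record into {key: [values, ...]}."""
    fields = defaultdict(list)
    for item in record.split(sep):
        if "=" not in item:
            continue  # e.g. the empty piece after the trailing separator
        key, value = item.split("=", 1)  # split on the first '=' only
        fields[key].append(value)
    return dict(fields)

if __name__ == "__main__":
    # Hypothetical record: the second TrxId is not in the original data.
    record = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|TrxId=123|"
    for trx_id in parse_fields(record).get("TrxId", []):
        print("|TrxId=" + trx_id + "|")
```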

Find words, that are substrings of other words efficiently

I have an Ispell list of English words (nearly 50,000 words). My homework in Perl is to quickly (in under a minute or so) get the list of all words that are substrings of some other word. I have tried a solution with two foreach loops comparing all words, but even with some optimizations it is still too slow. I think the right solution could be some clever use of regular expressions on the array of words. Do you know how to solve this problem quickly (in Perl)?

I have found a fast solution that finds all these substrings in about 15 seconds on my computer, using just one thread. Basically, for each word, I create an array of every possible substring (eliminating substrings that differ from the word only by an "s" or "'s" ending):
#take word and return list of all valid substrings
sub split_to_all_valid_subwords {
    my $word = $_[0];
    my @split_list;
    my ($i, $j);
    for ($i = 0; $i < length($word); ++$i){
        for ($j = 1; $j <= length($word) - $i; ++$j){
            unless
            (
                ($j == length($word)) or
                ($word =~ m/s$/ and $i == 0 and $j == length($word) - 1) or
                ($word =~ m/\'s$/ and $i == 0 and $j == length($word) - 2)
            )
            {
                push(@split_list, substr($word, $i, $j));
            }
        }
    }
    return @split_list;
}
Then I just create a list of all substring candidates and intersect it with the words:
my @substring_candidates;
foreach my $word (@words) {
    push( @substring_candidates, split_to_all_valid_subwords($word));
}
#make intersection between substring candidates and words
my %substring_candidates=map{$_ =>1} @substring_candidates;
my %words=map{$_=>1} @words;
my @substrings = grep( $substring_candidates{$_}, @words );
Now @substrings holds all the words that are substrings of some other word.
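The same idea, sketched in Python for comparison (not the Perl above; the function names are illustrative, and the special handling of "s" and "'s" endings is deliberately left out): generate every proper substring of every word, then intersect that set with the word list.

```python
def proper_substrings(word):
    """Yield every contiguous substring of word except word itself."""
    n = len(word)
    for i in range(n):
        for j in range(i + 1, n + 1):
            if j - i < n:          # skip the full word
                yield word[i:j]

def words_inside_other_words(words):
    """Return the words that occur inside some other word of the list."""
    vocabulary = set(words)
    candidates = set()
    for word in vocabulary:
        candidates.update(proper_substrings(word))
    return sorted(vocabulary & candidates)

if __name__ == "__main__":
    print(words_inside_other_words(["car", "carpet", "pet", "dog", "petal"]))
```

The work is proportional to the total number of substrings (quadratic in word length, but linear in the number of words), which is why this beats comparing every pair of words.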

Perl regular expressions will optimize patterns like foo|bar|baz into an Aho-Corasick match - up to a certain limit of total compiled regex length. Your 50000 words will probably exceed that length, but could be broken into smaller groups. (Indeed, you probably want to break them up by length and only check words of length N for containing words of length 1 through N-1.)
Alternatively, you could just implement Aho-Corasick in your perl code - that's kind of fun to do.

update
Ondra supplied a beautiful solution in his answer; I leave my post here as an example of overthinking a problem and failed optimisation techniques.
My worst case kicks in for a word that doesn't match any other word in the input. In that case, it goes quadratic. The OPT_PRESORT was an attempt to avert the worst case for most words. The OPT_CONSECUTIVE was a linear-complexity filter that reduced the total number of items in the main part of the algorithm, but it is just a constant factor when considering the complexity. However, it is still useful with Ondra's algorithm and saves a few seconds, as building his split list is more expensive than comparing two consecutive words.
I updated the code below to select Ondra's algorithm as a possible optimisation. Paired with zero threads and the presort optimisation, it yields maximum performance.
I would like to share a solution I coded. Given an input file, it outputs all those words that are a substring of any other word in the same input file. Therefore, it computes the opposite of ysth's ideas, but I took the idea of optimisation #2 from his answer. There are the following three main optimisations that can be deactivated if required.
Multithreading
The questions "Is word A in list L? Is word B in L?" can be easily parallelised.
Pre-sorting all the words for their length
I create an array that points to the list of all words that are longer than a certain length, for every possible length. For long words, this can cut down the number of possible words dramatically, but it trades quite a lot of space, as one word of length n appears in all lists from length 1 to length n.
Testing consecutive words
In my /usr/share/dict/words, most consecutive lines look quite similar:
Abby
Abby's
for example. As every word that would match the first word also matches the second one, I immediately add the first word to the list of matching words, and only keep the second word for further testing. This saved about 30% of words in my test cases. Because I do that before optimisation No 2, this also saves a lot of space. Another trade-off is that the output will not be sorted.
The script itself is ~120 lines long; I explain each sub before showing it.
head
This is just a standard script header for multithreading. Oh, and you need perl 5.10 or better to run this. The configuration constants define the optimisation behaviour. Set PROCESSORS to the number of processors of your machine. OPT_MAX can take the number of words you want to process; however, this is evaluated after the optimisations have taken place, so the easy words will already have been caught by the OPT_CONSECUTIVE optimisation. Adding anything there will make the script seemingly slower. $|=1 makes sure that the status updates are shown immediately. I exit after the main is executed.
#!/usr/bin/perl
use strict; use warnings; use feature qw(say); use threads;
$|=1;
use constant PROCESSORS => 0; # (false, n) number of threads
use constant OPT_MAX => 0; # (false, n) number of words to check
use constant OPT_PRESORT => 0; # (true / false) sorts words by length
use constant OPT_CONSECUTIVE => 1; # (true / false) prefilter data while loading
use constant OPT_ONDRA => 1; # select the awesome Ondra algorithm
use constant BLABBER_AT => 10; # (false, n) print progress at n percent
die q(The optimisations Ondra and Presort are mutually exclusive.)
if OPT_PRESORT and OPT_ONDRA;
exit main();
main
Encapsulates the main logic, and does the multithreading. The count reported as "n words to be matched" will be considerably smaller than the number of input words if the input was sorted. After I have selected all matched words, I print them to STDOUT. All status updates etc. are printed to STDERR, so that they don't interfere with the output.
sub main {
    my @matching; # the matching words.
    my @words = load_words(\@matching); # the words to be searched
    say STDERR 0+@words . " words to be matched";
    my $prepared_words = prepare_words(@words);
    # do the matching, possibly multithreading
    if (PROCESSORS) {
        my @threads =
            map {threads->new(
                    \&test_range,
                    $prepared_words,
                    @words[$$_[0] .. $$_[1]] )
            } divide(PROCESSORS, OPT_MAX || 0+@words);
        push @matching, $_->join for @threads;
    } else {
        push @matching, test_range(
            $prepared_words,
            @words[0 .. (OPT_MAX || 0+@words)-1]);
    }
    say STDERR 0+@matching . " words matched";
    say for @matching; # print out the matching words.
    0;
}
load_words
This reads all the words from the input files which were supplied as command line arguments. Here the OPT_CONSECUTIVE optimisation takes place. The $last word is either put into the list of matching words, or into the list of words to be matched later. The -1 != index($a, $b) decides if the word $b is a substring of word $a.
sub load_words {
    my $matching = shift;
    my @words;
    if (OPT_CONSECUTIVE) {
        my $last;
        while (<>) {
            chomp;
            if (defined $last) {
                push @{-1 != index($_, $last) ? $matching : \@words}, $last;
            }
            $last = $_;
        }
        push @words, $last // ();
    } else {
        @words = map {chomp; $_} <>;
    }
    @words;
}
prepare_words
This "blows up" the input words, sorting them by length into every slot that holds the words of larger or equal length. Therefore, slot 1 will contain all words. If this optimisation is deselected, it is a no-op and passes the input list right through.
sub prepare_words {
    if (OPT_ONDRA) {
        my $ondra_split = sub { # evil: using $_ as implicit argument
            my @split_list;
            for my $i (0 .. length $_) {
                for my $j (1 .. length($_) - ($i || 1)) {
                    push @split_list, substr $_, $i, $j;
                }
            }
            @split_list;
        };
        return +{map {$_ => 1} map &$ondra_split(), @_};
    } elsif (OPT_PRESORT) {
        my @prepared = ([]);
        for my $w (@_) {
            push @{$prepared[$_]}, $w for 1 .. length $w;
        }
        return \@prepared;
    } else {
        return [@_];
    }
}
test
This tests if the word $w is a substring in any of the other words. $wbl points to the data structure that was created by the previous sub: Either a flat list of words, or the words sorted by length. The appropriate algorithm is then selected. Nearly all of the running time is spent in this loop. Using index is much faster than using a regex.
sub test {
    my ($w, $wbl) = @_;
    my $l = length $w;
    if (OPT_PRESORT) {
        for my $try (@{$$wbl[$l + 1]}) {
            return 1 if -1 != index $try, $w;
        }
    } else {
        for my $try (@$wbl) {
            return 1 if $w ne $try and -1 != index $try, $w;
        }
    }
    return 0;
}
divide
This just encapsulates an algorithm that guarantees a fair distribution of $items items into $parcels buckets. It outputs the bounds of a range of items.
sub divide {
    my ($parcels, $items) = @_;
    say STDERR "dividing $items items into $parcels parcels.";
    my ($min_size, $rest) = (int($items / $parcels), $items % $parcels);
    my @distributions =
        map [
            $_ * $min_size + ($_ < $rest ? $_ : $rest),
            ($_ + 1) * $min_size + ($_ < $rest ? $_ : $rest - 1)
        ], 0 .. $parcels - 1;
    say STDERR "range division: @$_" for @distributions;
    return @distributions;
}
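For readers who find the index arithmetic in divide hard to follow, here is an equivalent sketch in Python (a re-expression for illustration, not part of the script). It hands out inclusive (first, last) index pairs, and the first items % parcels parcels get one extra item.

```python
def divide(parcels, items):
    """Split range(items) into `parcels` inclusive (first, last) index pairs.

    The first `items % parcels` parcels receive one extra item, so parcel
    sizes differ by at most one.
    """
    min_size, rest = divmod(items, parcels)
    bounds = []
    start = 0
    for p in range(parcels):
        size = min_size + (1 if p < rest else 0)
        bounds.append((start, start + size - 1))
        start += size
    return bounds

if __name__ == "__main__":
    print(divide(3, 10))   # [(0, 3), (4, 6), (7, 9)]
```

When there are fewer items than parcels, the surplus parcels come out as empty ranges whose last index is one below the first.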
test_range
This calls test for each word in the input list, and is the sub that is multithreaded. grep selects all those elements in the input list for which the code (given as the first argument) returns true. It also regularly outputs a status message like thread 2 at 10%, which makes waiting for completion much easier. This is a psychological optimisation ;-).
sub test_range {
    my $wbl = shift;
    if (BLABBER_AT) {
        my $range = @_;
        my $step = int($range / 100 * BLABBER_AT) || 1;
        my $i = 0;
        return
            grep {
                if (0 == ++$i % $step) {
                    printf STDERR "... thread %d at %2d%%\n",
                        threads->tid,
                        $i / $step * BLABBER_AT;
                }
                OPT_ONDRA ? $wbl->{$_} : test($_, $wbl)
            } @_;
    } else {
        return grep {OPT_ONDRA ? $wbl->{$_} : test($_, $wbl)} @_;
    }
}
invocation
Using bash, I invoked the script like
$ time (head -n 1000 /usr/share/dict/words | perl script.pl >/dev/null)
Where 1000 is the number of lines I wanted to input, dict/words was the word list I used, and /dev/null is the place I want to store the output list, in this case, throwing the output away. If the whole file should be read, it can be passed as an argument, like
$ perl script.pl input-file >output-file
time just tells us how long the script ran. Using 2 slow processors and 50000 words, it executed in just over two minutes in my case, which is actually quite good.
update: more like 6–7 seconds now, with the Ondra + Presort optimisation, and no threading.
further optimisations
update: overcome by better algorithm. This section is no longer completely valid.
The multithreading is awful. It allocates quite some memory and isn't exactly fast. This isn't surprising considering the amount of data. I considered using a Thread::Queue, but that thing is slow like $#*! and therefore is a complete no-go.
If the inner loop in test was coded in a lower-level language, some performance might be gained, as the index built-in wouldn't have to be called. If you can code C, take a look at the Inline::C module. If the whole script were coded in a lower language, array access would also be faster. A language like Java would also make the multithreading less painful (and less expensive).

Why does my regex fail when the number ends in 0?

This is a really basic regex question but since I can't seem to figure out why the match is failing in certain circumstances I figured I'd post it to see if anyone else can point out what I'm missing.
I'm trying to pull out the 2 sets of digits from strings of the form:
12309123098_102938120938120938
1321312_103810312032123
123123123_10983094854905490
38293827_1293120938129308
I'm using the following code to process each string:
if($string && $string =~ /^(\d)+_(\d)+$/) {
if(IsInteger($1) && IsInteger($2)) { print "success ('$1','$2')"; }
else { print "fail"; }
}
Where the IsInteger() function is as follows:
sub IsInteger {
my $integer = shift;
if($integer && $integer =~ /^\d+$/) { return 1; }
return;
}
This function seems to work most of the time but fails on the following for some reason:
1287123437_1268098784380
1287123437_1267589971660
Any ideas on why these fail while others succeed? Thanks in advance for your help!

This is an add-on to the answers from unicornaddict and ZyX: what are you trying to match?
If you're trying to match the sequences left and right of '_', unicornaddict is correct and your regex needs to be ^(\d+)_(\d+)$. Also, you can get rid of the first qualifier and the IsInteger() function altogether: you already know it's an integer, because it matched (\d+).
if ($string =~ /^(\d+)_(\d+)$/) {
print "success ('$1','$2')";
} else {
print "fail\n";
}
If you're trying to match the last digit in each and wondering why it's failing, it's the first check in IsInteger() ( if($integer && ). It's redundant anyway (you know it's an integer) and fails on 0 because, as ZyX notes, it evaluates to false.
Same thing applies though:
if ($string =~ /^(\d)+_(\d)+$/) {
print "success ('$1','$2')";
} else {
print "fail\n";
}
This will output success ('8','8') given the input 12309123098_102938120938120938

Because you have 0 at the end of the second string, (\d)+ puts only the last match in the $N variable, and the string "0" is equivalent to false.

When in doubt, check what your regex is actually capturing.
use strict;
use warnings;
my @data = (
    '1321312_103810312032123',
    '123123123_10983094854905490',
);
for my $s (@data){
    print "\$1=$1 \$2=$2\n" if $s =~ /^(\d)+_(\d)+$/;
    # Output:
    # $1=2 $2=3
    # $1=3 $2=0
}
You probably intended the second of these two approaches.
(\d)+ # Repeat a regex group 1+ times,
# capturing only the last instance.
(\d+) # Capture 1+ digits.
In addition, both in your main loop and in IsInteger (which seems unnecessary, given the initial regex in the main loop), you are testing for truth rather than something more specific, such as defined or length. Zero, for example, is a valid integer but false.
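The capturing difference is not Perl-specific. As a quick cross-check, here is a Python sketch run on one of the failing inputs from the question (the variable names are just for illustration):

```python
import re

s = "1287123437_1268098784380"

# (\d)+ repeats a one-digit group; only the last repetition is kept.
last_digits = re.match(r'^(\d)+_(\d)+$', s).groups()

# (\d+) puts the repetition inside the group, capturing the whole run.
whole_runs = re.match(r'^(\d+)_(\d+)$', s).groups()

print(last_digits)   # ('7', '0')
print(whole_runs)    # ('1287123437', '1268098784380')
```

Note that the truthiness half of the problem does not carry over: in Python the string "0" is true, whereas in Perl it is false, which is why the original check only failed on numbers ending in 0.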

Shouldn't + be included in the grouping:
^(\d+)_(\d+)$ instead of ^(\d)+_(\d)+$

Many people have commented on your regex, but the actual problem is in your IsInteger (which you really don't need for your example). You checked for "truth" when you really want to check for defined:
sub IsInteger {
my $integer = shift;
if( defined $integer && $integer =~ /^\d+$/) { return 1; }
return;
}
You don't need most of the infrastructure in that subroutine though:
sub IsInteger {
defined $_[0] && $_[0] =~ /^\d+$/
}

Regular Expression to find numbers with same digits in different order

I have been looking for a regular expression with Google for an hour or so now and can't seem to work this one out :(
If I have a number, say:
2345
and I want to find any other number with the same digits but in a different order, like this:
2345
For example, I match
3245 or 5432 (same digits but different order)
How would I write a regular expression for this?

There is an "elegant" way to do it with a single regex:
^(?:2()|3()|4()|5()){4}\1\2\3\4$
will match the digits 2, 3, 4 and 5 in any order. All four are required.
Explanation:
(?:2()|3()|4()|5()) matches one of the numbers 2, 3, 4, or 5. The trick is now that the capturing parentheses match an empty string after matching a number (which always succeeds).
{4} requires that this happens four times.
\1\2\3\4 then requires that all four backreferences have participated in the match - which they do if and only if each number has occurred once. Since \1\2\3\4 matches an empty string, it will always match as long as the previous condition is true.
For five digits, you'd need
^(?:2()|3()|4()|5()|6()){5}\1\2\3\4\5$
etc...
This will work in nearly any regex flavor except JavaScript.
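As a sanity check, here is a small Python sketch of the same pattern (Python's re module is one of the flavors where a backreference to a group that never participated fails; the function name is made up):

```python
import re

# Each alternative ends in an empty capturing group that records
# "this digit was seen"; the backreferences then demand all four groups.
SAME_DIGITS = re.compile(r'^(?:2()|3()|4()|5()){4}\1\2\3\4$')

def is_permutation_of_2345(s):
    """True if s consists of the digits 2, 3, 4 and 5, each exactly once."""
    return SAME_DIGITS.match(s) is not None

if __name__ == "__main__":
    for s in ["3245", "5432", "2245", "23456"]:
        print(s, is_permutation_of_2345(s))
```

The pattern has to be rebuilt for every source number, and it assumes the digits of that number are all distinct.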

I don't think a regex is appropriate. So here is an idea that is faster than a regex for this situation:
check string lengths, if they are different, return false
make a hash from the character (digits in your case) to integers for counting
loop through the characters of your first string:
increment the counter for that character: hash[character]++
loop through the characters of the second string:
decrement the counter for that character: hash[character]--
break if any count is negative (or nonexistent)
loop through the entries, making sure each is 0:
if all are 0, return true
else return false
EDIT: Java Code (I'm using Character for this example, not exactly Unicode friendly, but it's the idea that matters now):
import java.util.*;
public class Test
{
public boolean isSimilar(String first, String second)
{
if(first.length() != second.length())
return false;
HashMap<Character, Integer> hash = new HashMap<Character, Integer>();
for(char c : first.toCharArray())
{
if(hash.get(c) != null)
{
int count = hash.get(c);
count++;
hash.put(c, count);
}
else
{
hash.put(c, 1);
}
}
for(char c : second.toCharArray())
{
if(hash.get(c) != null)
{
int count = hash.get(c);
count--;
if(count < 0)
return false;
hash.put(c, count);
}
else
{
return false;
}
}
for(Integer i : hash.values())
{
if(i.intValue()!=0)
return false;
}
return true;
}
public static void main(String ... args)
{
//tested to print false
System.out.println(new Test().isSimilar("23445", "5432"));
//tested to print true
System.out.println(new Test().isSimilar("2345", "5432"));
}
}
This will also work for comparing letters or other character sequences, like "god" and "dog".

Put the digits of each number in two arrays, sort the arrays, find out if they hold the same digits at the same indices.
RegExes are not the right tool for this task.

You could do something like this to ensure the right characters and length
[2345]{4}
Ensuring they only exist once is trickier and why this is not suited to regexes
(?=.*2.*)(?=.*3.*)(?=.*4.*)(?=.*5.*)[2345]{4}

The simplest regular expression is just all 24 permutations added up via the or operator:
/2345|3245|5432|.../;
That said, you don't want to solve this with a regex if you can get away with it. A single pass through the two numbers as strings is probably better:
1. Check the string length of both strings - if they're different you're done.
2. Build a hash of all the digits from the number you're matching against.
3. Run through the digits in the number you're checking. If you hit a match in the hash, mark it as used. Keep going until you don't get an unused match in the hash or run out of items.
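The three steps above can be sketched in Python like this (one possible reading of them, using a counter per digit instead of "used" marks so that repeated digits are handled too; same_digits is a made-up name):

```python
from collections import Counter

def same_digits(a, b):
    """True if strings a and b contain the same digits with the same counts."""
    if len(a) != len(b):            # step 1: different lengths, done
        return False
    remaining = Counter(a)          # step 2: digit -> how many are unused
    for ch in b:                    # step 3: consume one unused digit each
        if remaining[ch] == 0:      # not present, or already used up
            return False
        remaining[ch] -= 1
    return True

if __name__ == "__main__":
    for candidate in ["3245", "5432", "5542", "1234", "12345"]:
        print(candidate, same_digits("2345", candidate))
```

Because the lengths are equal and no count may go below zero, reaching the end of the loop means every digit was used exactly as often as it occurs. If the number itself should not count as a match, add an a != b check.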

I think it's very simple to achieve if you're OK with matching a number that doesn't use all of the digits, e.g. if you have the number 1234 and you accept a match with the number 1111 to return TRUE.
Let me use PHP for an example as you haven't specified what language you use.
$my_num = 1245;
$my_pattern = '/[' . $my_num . ']{4}/'; // this resolves to pattern: /[1245]{4}/
$my_pattern2 = '/[' . $my_num . ']+/'; // as above but numbers can be of any length
$number1 = 4521;
$match = preg_match($my_pattern, $number1); // will return TRUE
$number2 = 2222444111;
$match2 = preg_match($my_pattern2, $number2); // will return TRUE
$number3 = 888;
$match3 = preg_match($my_pattern, $number3); // will return FALSE
$match4 = preg_match($my_pattern2, $number3); // will return FALSE
Something similar will work in Perl as well.

Regular expressions are not appropriate for this purpose. Here is a Perl script:
#!/usr/bin/perl
use strict;
use warnings;
my $src = '2345';
my @test = qw( 3245 5432 5542 1234 12345 );
my $canonical = canonicalize( $src );
for my $candidate ( @test ) {
    next unless $canonical eq canonicalize( $candidate );
    print "$src and $candidate consist of the same digits\n";
}
sub canonicalize { join '', sort split //, $_[0] }
Output:
C:\Temp> ks
2345 and 3245 consist of the same digits
2345 and 5432 consist of the same digits
