A tiny template language - regex

I pass a string with a "partial" SQL statement like:
"SELECT %FIELDS% FROM ... %ORDER% %LIMIT%"
Then I want to replace %FIELDS% either with a real list of fields (which I get from an array) or with COUNT(*). Likewise, I replace %ORDER% either with ORDER BY ... (which I generate) or with an empty string, and %LIMIT% either with LIMIT ... (which I generate) or with an empty string.
I also want a way to prevent these sequences from being replaced. For example, we may "turn off" replacement when the percent signs are doubled: %%FIELDS%% should not be replaced with a list of fields but with the literal %FIELDS%.
Note that I do not insist on this exact syntax. Instead of percent signs we may use some other escape syntax (for example ${FIELDS} or {{FIELDS}}).
I want the easiest and (what is probably more important) the most efficient way to do this.
Note that I use Perl.
Maybe I should not invent my own template language with regexps and should use the Perl module Template::Tiny instead? Which would be the most efficient?

#!/usr/bin/perl
use strict;
use warnings;
sub MyReplace {
    my ($tmpl, $Hash) = @_;
    my %Hash2;
    foreach my $Key (keys %$Hash) {
        $Hash2{"%$Key%"} = $Hash->{$Key};
        $Hash2{"\\%$Key%"} = "%$Key%"; # escaped
    }
    my $re_str = "(" . (join '|', map { "\Q$_\E" } keys %Hash2) . ")";
    $tmpl =~ s/$re_str/$Hash2{$1}/g;
    return $tmpl;
}
print MyReplace('SELECT %FIELDS% FROM ... %ORDER% %LIMIT%', {FIELDS=>'a, b, c', ORDER=>'ORDER BY id', LIMIT=>''}), "\n";
print MyReplace('SELECT \\%FIELDS% FROM ... %ORDER% \\\\%LIMIT%', {FIELDS=>'a, b, c', ORDER=>'ORDER BY id', LIMIT=>''}), "\n";
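For comparison, here is a minimal sketch of the doubled-percent variant described in the question (%%FIELDS%% becomes a literal %FIELDS%). The name expand_template is hypothetical, and leaving unknown placeholders untouched is an assumption of this sketch, not something the question specifies. It makes no claim about speed relative to Template::Tiny.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post).
# %NAME%   is replaced by $vars->{NAME} when that key exists.
# %%NAME%% is replaced by the literal text %NAME%.
# Unknown names are left as they are (an assumption of this sketch).
sub expand_template {
    my ($tmpl, $vars) = @_;
    # The doubled form is listed first so %%FIELDS%% is never read as %FIELDS%.
    $tmpl =~ s{%%(\w+)%%|%(\w+)%}{
        defined($1)            ? "%$1%"
      : exists($vars->{$2})    ? $vars->{$2}
      :                          "%$2%"
    }ge;
    return $tmpl;
}

my %vars = (FIELDS => 'a, b, c', ORDER => 'ORDER BY id', LIMIT => '');
print expand_template('SELECT %FIELDS% FROM ... %ORDER% %LIMIT%', \%vars), "\n";
print expand_template('SELECT %%FIELDS%% FROM ... %ORDER% %LIMIT%', \%vars), "\n";
```

Because the substitution makes a single pass and never rescans its own output, a value that itself contains %ORDER% is not expanded a second time.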

Related

perl regular expression match scalar plus punctuation

I have scalars (columns in a table) that hold one or two email addresses separated by a comma, such as 'Joek@xyznco.com, jrancher@candyco.us' or 'jsmith@wellingent.com,mjones@wellingent.com'. For several of these records I need to remove a bad/old email address and the trailing comma (if one exists).
If jsmith@wellingent.com is no longer valid, how do I remove that address and the trailing comma?
This only removes the address but leaves the comma:
my $general_email = 'jsmith@wellingent.com,mjones@wellingent.com';
my $bad_addr = 'jsmith@wellingent.com';
$general_email =~ s/$bad_addr//;
Thanks for any help.
You may be better off without a regex but with list splitting:
use strict;
use warnings;
sub remove_bad {
    my ($full, $bad) = @_;
    my @emails = split /\s*,\s*/, $full; # split at comma, allowing for spaces around the comma
    my @filtered = grep { $_ ne $bad } @emails;
    return join ",", @filtered;
}
print 'First: ' , remove_bad('me@example.org, you@example.org', 'me@example.org'), "\n";
print 'Last: ', remove_bad('me@example.org, you@example.org', 'you@example.org'), "\n";
print 'Middle: ', remove_bad('me@example.org, you@example.org, other@example.org', 'you@example.org'), "\n";
First, split the email address list at the commas, creating an array. Filter that using grep to remove the bad address. Then join the remaining elements back into a string.
The above code prints:
First: you@example.org
Last: me@example.org
Middle: me@example.org,other@example.org
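If a regex is still preferred, the following is one possible sketch. The name remove_bad_regex is hypothetical. It removes the address together with the comma in front of it, then cleans up a leading comma if the removed address was the first one. Unlike the split/join version, it keeps whatever spacing the remaining addresses had.

```perl
use strict;
use warnings;

# Hypothetical regex-only alternative to the split/grep/join approach.
sub remove_bad_regex {
    my ($full, $bad) = @_;
    # Remove the address plus the comma before it (or the start of the string).
    # \Q...\E keeps the dots in the address literal; the lookahead makes sure
    # the address is a whole list entry, not a prefix of a longer one.
    $full =~ s/(?:^|,)\s*\Q$bad\E\s*(?=,|\z)//;
    # If the first entry was removed, a leading comma is left over: drop it.
    $full =~ s/^\s*,\s*//;
    return $full;
}

print remove_bad_regex('me@example.org, you@example.org', 'me@example.org'), "\n";
print remove_bad_regex('me@example.org, you@example.org', 'you@example.org'), "\n";
```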

Perl Regular Expression to match special characters

I have the code below, in which I check whether a specific location is in the excluded array. It works fine for all the array elements except one (abc/def/libraries/linux_3.2.60-1+deb7u3.dsc). When I provide this element as my location, it prints "location is not excluded", even though it is excluded.
How can I make my code treat this element as excluded as well?
use strict;
use warnings;
my @excluded = (
    "xyz/efg/headers/",
    "abc/def/libraries/jni-mr.h",
    "abc/def/libraries/linux_3.2.60-1+deb7u3.dsc",
);
my $location = "abc/def/libraries/linux_3.2.60-1+deb7u3.dsc";
my $badpath = 0;
foreach (@excluded) {
    # -- Check if location is contained in excluded array
    if ($location =~ /^$_/) {
        $badpath = 1;
        print "location is excluded : $location \n";
    }
}
if (! $badpath) {
    print "location is not excluded : $location \n";
}
Desired Output:
location is excluded : abc/def/libraries/linux_3.2.60-1+deb7u3.dsc
Current Output:
location is not excluded : abc/def/libraries/linux_3.2.60-1+deb7u3.dsc
Use quotemeta($text) or \Q$text\E (inside double quotes or a regex literal) to create a pattern that matches the value of $text. In other words, use
if ($location =~ /^\Q$_\E/)
instead of:
if ($location =~ /^$_/)
It looks like you intend to define your exclusions by regex, but you have not escaped regex metachars properly in those regexes. For your failing case, the metachar causing it to fail is the plus (+), which is a one-or-more multiplier in most regex flavors (including Perl), but you need to match it literally.
Also, I'd recommend moving the ^ anchor from the loop to each individual regex, which would make the code more flexible, in that you could choose not to anchor some of the exclusion regexes if you wanted.
Also, you should use the qr() construct, which allows you to precompile regexes, saving on CPU.
Also, this requirement is a good candidate for using grep().
use strict;
use warnings;
my @excluded = (
    qr(^xyz/efg/headers/),
    qr(^abc/def/libraries/jni-mr\.h),
    qr(^abc/def/libraries/linux_3\.2\.60-1\+deb7u3\.dsc),
);
my $location = 'abc/def/libraries/linux_3.2.60-1+deb7u3.dsc';
# -- Check if location is contained in excluded array
my $badpath = scalar(grep($location =~ $_, @excluded )) >= 1 ? 1 : 0;
if ($badpath) {
    print "location is excluded : $location \n";
} else {
    print "location is not excluded : $location \n";
}
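The two suggestions can also be combined: keep the exclusions as plain strings and let \Q...\E do the escaping while qr// precompiles each pattern. This is only a sketch; the names @excluded_paths, @excluded_re and is_excluded are hypothetical, and it assumes every exclusion is meant as a literal prefix rather than a real regex.

```perl
use strict;
use warnings;

# Plain strings, no hand-escaping needed.
my @excluded_paths = (
    'xyz/efg/headers/',
    'abc/def/libraries/jni-mr.h',
    'abc/def/libraries/linux_3.2.60-1+deb7u3.dsc',
);

# \Q...\E escapes regex metacharacters such as "+" and ".";
# qr// compiles each anchored pattern once.
my @excluded_re = map { qr/^\Q$_\E/ } @excluded_paths;

# Hypothetical helper: returns 1 if the location starts with any excluded path.
sub is_excluded {
    my ($location) = @_;
    my $hits = grep { $location =~ $_ } @excluded_re;
    return $hits ? 1 : 0;
}

my $location = 'abc/def/libraries/linux_3.2.60-1+deb7u3.dsc';
print is_excluded($location)
    ? "location is excluded : $location \n"
    : "location is not excluded : $location \n";
```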

Passing a parameter to a regular expression to match the first letter in a word in perl

So here is what I'm doing. This is for homework, and I know I can't come on here and get you guys to do my homework for me, but I'm stuck. We have to use Perl (first time ever using it, so forgive my stupidity) to make a function starts_with that takes the parameters $str0 and $prefix. If $str0 starts with $prefix, the function returns true; if it doesn't, it returns false. Pretty simple. We have to use regular expressions, because that is the whole point of the exercise, so here is my code:
sub starts_with
{
    $str0 = $_[0];
    $prefix = $_[1];
    if($prefix =~ /^($str0)/)
    {
        print $str0."\n";
        print m/^(prefix)/."\n";
        $startsWith = "Y"
    }
    if ($startsWith eq "Y")
    {
        print $str0." starts with ".$prefix."\n";
    }
    else
    {
        print $str0." does not start with ".$prefix."\n";
    }
}
I'm almost ashamed to put this up here because I have no idea what I'm doing yet, but I am trying to learn. I don't know how to do true/false in Perl; that's why I have the $startsWith variable. You can fix that if you want. The part I need to fix is the line
if(str0 =~ /^($prefix)/)
I also need to find out how to refer to the first letter in str0...I think
A couple points without giving away the answer:
1) Arguments to functions are passed in a special variable called @_, which is what you are accessing when you say $_[0] and $_[1]. This can be written much more concisely by assigning the argument list (@_) to your variables in list context:
sub starts_with {
    my ($str0, $prefix) = @_;
    ...
}
2) This statement: if($prefix =~ /^($str0)/) tests the exact opposite of the condition you are trying to prove. It asks whether the prefix starts with the value of the variable $str0. What you really want to test is whether $str0 starts with $prefix.
It might also be useful to write your pattern with the explicit match operator, m/PATTERN/, which means "match this pattern".
3) You don't have a return statement in your function, so (as @M42 points out) the result of the last expression is returned; that expression is a print, which will return true. You probably want to return true or false explicitly.
See if you can use this to get started.
What I would do :
use Modern::Perl; # or use strict; use warnings; use feature qw/say/;
sub starts_with {
    # better to use @_, the default array, instead of just its elements
    # ...like $_[0]
    my ($str, $pref) = @_;
    # very short expression; the pattern match returns a boolean.
    # \Q\E is there to treat the prefix as-is (no metacharacters)
    return $str =~ /^\Q$pref\E/;
}
# using our function
if (starts_with("foobar", "f")) {
    say "TRUE";
}
else {
    say "FALSE";
}
Golfing it a bit...
sub starts_with { $_[0] =~ /^\Q$_[1]/ }
Don't hand that version in though :-)
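To see why the \Q...\E matters, here is a small sketch comparing the two forms side by side. The names starts_with_raw and starts_with_quoted are hypothetical and exist only for this comparison; both return an explicit 1 or 0.

```perl
use strict;
use warnings;

# Prefix is interpolated as a regex: "." matches any character.
sub starts_with_raw {
    my ($str, $pref) = @_;
    return $str =~ /^$pref/ ? 1 : 0;
}

# Prefix is quoted: every character is matched literally.
sub starts_with_quoted {
    my ($str, $pref) = @_;
    return $str =~ /^\Q$pref\E/ ? 1 : 0;
}

print starts_with_raw('fxo', 'f.o'), "\n";       # 1: the dot acted as a wildcard
print starts_with_quoted('fxo', 'f.o'), "\n";    # 0: a literal dot is required
print starts_with_quoted('f.obar', 'f.o'), "\n"; # 1
```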

How to use regular expression to find keys in hash

I have 6 million hashes and need to count how many of them have keys that start with AA00, how many have keys that start with AB10, and how many have keys starting with both strings.
For each hash I have done this:
if (exists $hash{AA00}) {
    $AA00 += 1;
}
if (exists $hash{AB10}) {
    $AB10 += 1;
}
if (exists $hash{AA00} and exists $hash{AB10}) {
    $both += 1;
}
but then I count only the number of hashes that contain exactly AA00 or AB10 as keys, and I would also like to count hashes that contain, say, AA001. Can I use a regular expression for this?
I completely misunderstood your question. To find the number of hashes with keys matching a regex (as opposed to the number of keys matching a regex in a single hash), you can still use the grep approach I outlined in my earlier answer. This time, however, you need to loop through your hashes (I assume you're storing them in an array if you have 6 million of them) and run grep twice on each one:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my @array = (
    { AA00 => 'foo' },
    { AB10 => 'bar' },
    { AA001 => 'foo' },
    { AA00 => 'foo', AB10 => 'bar' }
);
my ($hashes_with_aa00, $hashes_with_ab10, $hashes_with_both) = (0, 0, 0);
foreach my $hash (@array) {
    my $aa_count = grep { /^AA00/ } keys %$hash;
    my $ab_count = grep { /^AB10/ } keys %$hash;
    $hashes_with_aa00++ if $aa_count;
    $hashes_with_ab10++ if $ab_count;
    $hashes_with_both++ if $aa_count and $ab_count;
}
say "AA00: $hashes_with_aa00";
say "AB10: $hashes_with_ab10";
say "Both: $hashes_with_both";
Output:
AA00: 3
AB10: 2
Both: 1
This works, but is pretty poor in terms of performance: grep loops through every element in the list of keys for each hash, and we're calling it twice per hash!
Since we don't care how many keys match in each hash, only whether there is a match, a better solution would be any from List::MoreUtils. any works much like grep but returns as soon as it finds a match. To use any instead of grep, change this:
foreach my $hash (@array) {
    my $aa_count = grep { /^AA00/ } keys %$hash;
    my $ab_count = grep { /^AB10/ } keys %$hash;
    $hashes_with_aa00++ if $aa_count;
    $hashes_with_ab10++ if $ab_count;
    $hashes_with_both++ if $aa_count and $ab_count;
}
to this:
use List::MoreUtils 'any';
foreach my $hash (@array) {
    my $aa_exists = any { /^AA00/ } keys %$hash;
    my $ab_exists = any { /^AB10/ } keys %$hash;
    $hashes_with_aa00++ if $aa_exists;
    $hashes_with_ab10++ if $ab_exists;
    $hashes_with_both++ if $aa_exists and $ab_exists;
}
Note that I changed the variable names to better reflect their meaning.
This is much better in terms of performance, but as Borodin notes in a comment on your question, you're losing the speed advantage of hashes by not accessing them with specific keys. You might want to change your data structure accordingly.
Original Answer: Counting keys that match a regex in a single hash
This is my original answer based on a misunderstanding of your question. I'm leaving it up because I think it could be useful for similar situations.
To count the number of keys that match a regex in a single hash, you can use grep:
my $aa_count = grep { /^AA00/ } keys %hash;
my $ab_count = grep { /^AB10/ } keys %hash;
my $both = $aa_count + $ab_count;
As HunterMcMillen points out in the comments, there's no need to search through the hash keys again to get the total count; in this case, you can simply add the two subtotals. You can get away with this because the two patterns you're searching for are mutually exclusive; in other words, you cannot have a key that both begins with AA00 and AB10.
In the more general case, it might be possible for a single key to match both patterns (thanks Borodin). In that case, you cannot simply add up the two subtotals. For example, if you wanted your keys to merely contain AA00 or AB10 anywhere in the string, not necessarily at the beginning, you would need to do something like this:
my $aa_count = grep { /AA00/ } keys %hash;
my $ab_count = grep { /AB10/ } keys %hash;
my $both = grep { /(?:AA00|AB10)/ } keys %hash;
Note that this calls grep multiple times, which means traversing the entire hash multiple times. This could be done more efficiently using a single for loop like FlyingFrog and Kenosis did.
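A single-loop version might look roughly like the sketch below. This is not the code from the other answers (which are not reproduced here); count_key_matches is a hypothetical name, and the third counter counts keys matching either pattern, as in the grep with the alternation above.

```perl
use strict;
use warnings;

# Hypothetical helper: one pass over the keys, three counters.
sub count_key_matches {
    my ($hash) = @_;
    my ($aa, $ab, $either) = (0, 0, 0);
    for my $key (keys %$hash) {
        my $is_aa = $key =~ /AA00/;
        my $is_ab = $key =~ /AB10/;
        $aa++     if $is_aa;
        $ab++     if $is_ab;
        $either++ if $is_aa || $is_ab;   # a key matching both is counted once
    }
    return ($aa, $ab, $either);
}

my %hash = (AA00 => 1, XAB10 => 2, AA00AB10 => 3, ZZ99 => 4);
my ($aa, $ab, $either) = count_key_matches(\%hash);
print "AA00: $aa, AB10: $ab, either: $either\n";   # AA00: 2, AB10: 2, either: 3
```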

regular expression help: catch this: |TrxId=475665|

For example I have a string:
MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|
and I want to catch this: |TrxId=475665|
after TrxId= it could be any numbers and any amount of them, so regex should catch as well:
|TrxId=111333| and |TrxId=0000011112222| and |TrxId=123|
TrxId=(\d+)
That would give a group (1) with the TrxId.
PS: Use global modifier.
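In Perl, that pattern with the global modifier could be used roughly like this. It is a sketch only: trx_ids is a hypothetical helper name, and the \b is an addition so that a key merely ending in TrxId is not picked up.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every TrxId value found in the line.
sub trx_ids {
    my ($line) = @_;
    # In list context, a /g match returns all captured groups.
    my @ids = $line =~ /\bTrxId=(\d+)/g;
    return @ids;
}

my $line = 'MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|';
print "|TrxId=$_|\n" for trx_ids($line);
```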
The regex should look somewhat like this:
TrxId=[0-9]+
It will match TrxId= followed by at least one digit.
An example solution in Python:
In [106]: import re
In [107]: data = 'MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|'
In [108]: m = re.search(r'\|TrxId=(\d+)\|', data)
In [109]: m.group(0)
Out[109]: '|TrxId=475665|'
In [110]: m.group(1)
Out[110]: '475665'
/MsgNam\=.*?\|(TrxId\=\d+)\|.*/
for example in perl:
$a = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100111|";
$a =~ /MsgNam\=.*?\|(TrxId\=\d+)\|.*/;
print $1;
will print TrxId=475665
You know what your delimiters look like, so you don't need a regex, you need to split. Here's an implementation in Perl.
use strict;
use warnings;
my $input = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|";
my @first_array = split(/\|/,$input); # splitting $input on "|"
# Note: split drops trailing empty fields, so the final "|" in $input
# does not produce an extra element. Filtering on defined is only a
# safeguard here.
@first_array = grep { defined } @first_array;
# Also filter out elements that do not have an equals sign appearing.
@first_array = grep { /=/ } @first_array;
# Now, put these elements into an associative array:
my %assoc_array;
foreach (@first_array)
{
    if (/^([^=]+)=(.+)$/)
    {
        $assoc_array{$1} = $2;
    }
    else
    {
        # Something weird may be happening...
        # we may have an element starting with "=", or one with an
        # empty value such as "WepNr=", for example.
        # Do what you want: throw a warning, die, silently move on, etc.
    }
}
if (exists $assoc_array{TrxId})
{
    print "|TrxId=" . $assoc_array{TrxId} . "|\n";
}
else
{
    print "Sorry, TrxId not found!\n";
}
The code above yields the expected output:
|TrxId=475665|
Now, obviously this is more complex than some of the other answers, but it's also a bit more robust in that it allows you to search for more keys as well.
This approach does have a potential issue if your keys appear more than once. In that case, it's easy enough to modify the code above to collect an array reference of values for each key.
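One way that modification could look is sketched below. The name parse_fields is hypothetical, and the sketch deliberately differs from the code above in one respect: it uses (.*) instead of (.+), so keys with empty values (such as WepNr=) are kept with an empty string.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "key=value|key=value|..." into a hash whose
# values are array references, so repeated keys keep all their values.
sub parse_fields {
    my ($input) = @_;
    my %values_for;
    for my $field (split /\|/, $input) {
        # Skip anything that is not of the form key=value.
        next unless $field =~ /^([^=]+)=(.*)$/;
        push @{ $values_for{$1} }, $2;
    }
    return \%values_for;
}

my $fields = parse_fields('TrxId=475665|MndNr=0257|TrxId=123|WepNr=|');
print "|TrxId=$_|\n" for @{ $fields->{TrxId} || [] };
```

With the sample string in the sketch, the TrxId entry holds both 475665 and 123, in the order they appeared.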