Shortening RegEx for Perl - regex

I have created a regexp in Perl that is about 95 characters long. I wish to shorten it to 78 characters but can't find a suitable method. Any advice is welcome. The regexp is similar to the code below; ideally there would be something like the \ line continuation in C.
my ($foo, $bar, $etc) = $input_line =~
/^\d+: .... (\X+)\(\X(\d+.\d+|\d+)\/\X(\d+.\d+|\d+) (\X+)\)$/

There is a way to tell the regex engine to skip embedded whitespace and comments, so you can not only split the pattern across multiple lines but also comment it, format it into sections, and so on. That is the x modifier; see the perlre man page for details.
So you'd change it to something like:
my ($foo, $bar, $etc) = $input_line =~ /
^\d+:\ ....\ # spaces are ignored under x, so escape the literal ones
(\X+)\(
\X(\d+.\d+|\d+) # numerator
\/\X(\d+.\d+|\d+) # denominator
\ (\X+)\)$ # mind the escaped space!
/x;
It's also possible to build pieces of a regular expression separately with the qr// operator and combine them through variable interpolation. Something like
my $num_re = qr/(\X+)\(\X(\d+.\d+|\d+)\/\X(\d+.\d+|\d+)/;
my ($foo, $bar, $etc) = $input_line =~ /^\d+: .... $num_re (\X+)\)$/;
I have not done this for a while, so I am not sure whether any flags are needed.

Perl interpolates variables into a regex, so you could do something like this
my $input_line = '123: .... X(X1.1/X5 XXX)';
my $dOrI = '(\d+.\d+|\d+)';
my ($foo, $bar, $etc) = $input_line =~
/^\d+: .... (\X+)\(\X$dOrI\/\X$dOrI (\X+)\)$/;
print "$foo, $bar, $etc";
Output -
X, 1.1, 5

One thing I see in the regex is the period in '\d+.\d+'.
A '.' in a regex matches ANY character (except a newline by default), not only an actual period character.
If you want to match only an actual period character, you'll have to use '\.' instead.
The other thing is that you may be able to replace '\d+.\d+|\d+' with '\d+(?:\.\d+)?', which makes the fractional part optional.
[EDIT]
One more thing: if you use the interpolated regex more than once and don't change it in between uses (say, in a loop), you can precompile it once with qr// and reuse the compiled object. The older /o option was meant for the same purpose, but the Perl documentation now discourages it in favour of qr//.
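Putting the suggestions from these answers together, here is a minimal sketch that assumes the sample line used earlier. The names $num and $line_re are made up for illustration, and a fourth capture variable ($rest) is added only to show the last group.

```perl
use strict;
use warnings;

# Hypothetical names: $num and $line_re are illustrative only.
# A number with an optional fractional part; note the escaped period.
my $num = qr/\d+(?:\.\d+)?/;

# qr compiles the pattern once, and x lets it span several lines.
my $line_re = qr{
    ^\d+:\ ....\        # leading number and a four-character field
    (\X+)\(             # text before the opening parenthesis
    \X($num)            # numerator
    /\X($num)           # denominator
    \ (\X+)\)$          # trailing text before the closing parenthesis
}x;

my $input_line = '123: .... X(X1.1/X5 XXX)';
my ($foo, $bar, $etc, $rest) = $input_line =~ $line_re;
print "$foo, $bar, $etc, $rest\n";
```

No line of the pattern comes close to 78 characters, and the number sub-pattern is written only once.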


Using Perl split function to keep (capture) some delimiters and discard others

Let's say I am using Perl's split function to split up the contents of a file.
For example:
This foo file has+ a bunch of; (random) things all over "the" place
So let's say I want to use whitespace and semicolons as delimiters.
So I would use something like:
split(/([\s+\;])/, $fooString)
I'm having trouble figuring out a syntax (or even if it exists) to capture the semicolon and discard the whitespace.
You seem to ask for something like
my @fields_and_delim = split /\s+|(;)/, $string; # not quite right
but this doesn't do quite what it may seem to. It also returns undefined elements (with warnings), because when \s+ matches, the () captures nothing, yet the capture is still returned as asked, and it is undef. There are yet more spurious elements when your delimiters come together in the string.
So filter
my @fields_and_delim = grep { defined and /\S/ } split /(\s+|;)/, $string;
in which case you can capture the delimiter normally.
This can also be done with a regex
my @fields_and_delim = $string =~ /([^\s;]+|;+)/g;
which in this case allows more control over what you pick from the string and how.
If repeated semicolons need to be captured separately, change ;+ to ;
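As a quick check, the two approaches above can be run side by side on the sentence from the question. This is only a sketch; the variable names @by_split and @by_match are made up for illustration.

```perl
use strict;
use warnings;

my $string = 'This foo file has+ a bunch of; (random) things all over "the" place';

# Approach 1: split on whitespace or a semicolon, capturing the delimiter,
# then drop empty fields and whitespace-only delimiters.
my @by_split = grep { defined and /\S/ } split /(\s+|;)/, $string;

# Approach 2: match runs of non-delimiter characters, or runs of semicolons.
my @by_match = $string =~ /([^\s;]+|;+)/g;

print join('|', @by_split), "\n";
print join('|', @by_match), "\n";
```

Both lines print the same list, with the semicolon kept as its own element and the whitespace discarded.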
I think that what you want is as simple as:
split /\s*;\s*/, $fooString;
That will separate around the ; character that may or may not have any whitespace before or after.
In your example:
>This foo file has+ a bunch of; (random) things all over "the" place<
It would split into:
>This foo file has+ a bunch of<
and:
>(random) things all over "the" place<
By the way, you need to put the result of split into an array; for instance:
my @parts = split /\s*;\s*/, $fooString;
Then $parts[0] and $parts[1] would have the two bits.
I think grep is really what you're looking for, to filter the list for values that aren't all whitespace:
my @all_exc_ws = grep {!/^\s+$/} split(/([\s\;])/, $fooString);
Also, I removed the + from your regex: inside the [] it is a literal plus sign rather than a quantifier.

How to do conditional ("if exist" logic) search & replace in Perl?

in my Perl script I want to do conditional search & replace using regular expression: Find a certain pattern, and if the pattern exists in a hash, then replace it with something else.
For example, I want to search for a combination of "pattern1" and "pattern2", and if the latter exists in a hash, then replace the combination with "pattern1" and "replacement". I tried the following, but it just doesn't do anything at all.
$_ =~ s/(pattern1)(pattern2)/$1replacement/gs if exists $my_hash{$2};
I also tried stuff like:
$_ =~ s/(pattern1)(pattern2) && exists $my_hash{$2}/$1replacement/gs;
Also does nothing at all, as if no match is found.
Can anyone help me with this regex problem? Thx~
I would do it a different way. It looks like you have a 'search this, replace that' hash.
So:
#!/usr/bin/env perl
use strict;
use warnings;
#our 'mappings'.
#note - there can be gotchas here with substrings
#so make sure you anchor patterns or sort, so
#you get the right 'substring' match occurring.
my %replace = (
"this phrase" => "that thing",
"cabbage" => "carrot"
);
#stick the keys together into an alternation regex.
#quotemeta means regex special characters will be escaped.
#you can remove that, if you want to use regex in your replace keys.
my $search = join( "|", map {quotemeta} keys %replace );
#compile it - note \b is a zero width 'word break'
#so it will only match whole words, not substrings.
$search = qr/\b($search)\b/;
#iterate the special DATA filehandle - for illustration and a runnable example.
#you probably want <> instead for 'real world' use.
while (<DATA>) {
#apply regex match and replace
s/(XX) ($search)/$1 $replace{$2}/g;
#print current line.
print;
}
##inlined data filehandle for testing.
__DATA__
XX this phrase cabbage
XX cabbage carrot cabbage this phrase XX this phrase
XX no words here
and this shouldn't cabbage match this phrase at all
By doing this, we turn your hash keys into a regex (you can print it - it looks like: (?^:\b(cabbage|this\ phrase)\b) )
Which is inserted into the substitution pattern. This will only match if the key is present, so you can safely do the substitution operation.
Note - I've added quotemeta because then it escapes any special characters in the keys. And the \b is a "word boundary" match so it doesn't do substrings within words. (Obviously, if you do want that, then get rid of them)
The above gives output of:
XX that thing cabbage
XX carrot carrot cabbage this phrase XX that thing
XX no words here
and this shouldn't cabbage match this phrase at all
If you wanted to omit lines that didn't pattern match, you can stick && print; after the regex.
What is wrong (as in not working) with
if (exists $h{$patt1}) { $text =~ s/$patt1$patt2/$patt1$replacement/g; }
If $patt1 exists as a key in the hash then you go ahead and replace $patt1$patt2 with $patt1$replacement. Of course, that only happens if $patt1$patt2 is found in $text; otherwise nothing changes. Your first code snippet is circular, because the condition uses $2 before the substitution has matched anything, while the second one can't work like that at all, because the test is read as part of the pattern.
If you want to match $patt1$patt2 first and check the hash key as well, then it seems that you'd have to go slow
if ($str =~ /$patt1$patt2/ && exists $h{$patt2}) {
$str =~ s/$patt1$patt2/$patt1$replacement/gs;
}
If this is what you want then it is really simple: you need two unrelated conditions, whichever way you turn it around. Can't combine them since it would be circular.
From the point of view of the outcome these are the same. If either condition fails nothing happens, regardless of the order in which you check them.
NOTE Or maybe you don't have to go slow, see Sobrique's post.
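One more way to tie the hash lookup to each individual match is the e modifier, which evaluates the replacement as Perl code. This is a different technique from the ones in the answers above, shown as a sketch only: the hash contents, the stand-in patterns (foo-) and (\w+), and the sample text are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical data: %my_hash, the patterns, and the sample text are made up.
my %my_hash = (bar => 1);
my $text    = 'foo-bar foo-baz foo-bar';

# With e, the replacement is Perl code evaluated for each match, so the
# hash lookup can see that match's own $2.
(my $result = $text) =~ s{(foo-)(\w+)}{
    exists $my_hash{$2} ? "${1}replacement" : "$1$2"
}ge;

print "$result\n";
```

Matches whose second part is not a hash key are put back unchanged, so only foo-bar is rewritten here.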

Extracting first two words in perl using regex

I want to extract the first two words from a sentence using a Perl function in PostgreSQL. In PostgreSQL, I can do this with:
text = "I am trying to make this work";
Select substring(text from '(^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)');
It would return "I am"
I tried to build a Perl function in Postgresql that does the same thing.
CREATE OR REPLACE FUNCTION extract_first_two (text)
RETURNS text AS
$$
my $my_text = $_[0];
my $temp;
$pattern = '^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)';
my $regex = qr/$pattern/;
if ($my_text=~ $regex) {
$temp = $1;
}
return $temp;
$$ LANGUAGE plperl;
But I receive a syntax error near the regular expression. I am not sure what I am doing wrong.
Extracting words is non-trivial even in English. Take the following contrived example using Locale::CLDR
use Locale::CLDR;
my $locale = Locale::CLDR->new('en');
my @words = $locale->split_words('adf543. 123.25');
@words now contains
adf543
.
123.25
Note that the full stop after adf543 is split into a separate word, but the one between 123 and 25 is kept as part of the number 123.25, even though the '.' is the same character.
It gets worse when you look at non-English languages, and much worse when you use non-Latin scripts.
You need to define precisely what you think a word is; otherwise the following French gets split incorrectly.
Je avais dit «Elle a dit «Il a dit «Ni» il ya trois secondes»»
The parentheses are mismatched in your regex pattern. It has three opening parentheses and four closing ones.
Also, you have two single quotes in the middle of a singly-quoted string, so
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)'
is parsed as two separate strings
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'
and
')?(\s+)?\w+)'
But I can't suggest how to fix it as I don't understand your intention.
Did you mean a double quote perhaps? In which case (!|,|\&|")? can be written as [!,&"]?
Update
At a rough guess I think you want this
my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;
$temp = $1 if $my_text=~ /($regex)/;
but I can't be sure. If you describe what you're looking for in English then I can help you better. For instance, it's unclear why you don't have question marks, full stops, and semicolons in the list of intervening punctuation.
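To try that guess outside PostgreSQL first, here is a small plain-Perl sketch. The helper name first_two_words and the sample inputs are made up for illustration, and the pattern is the rough guess above, not a general definition of a word.

```perl
use strict;
use warnings;

# Hypothetical helper; plain Perl, outside of PostgreSQL.
sub first_two_words {
    my ($text) = @_;
    # First word, optional spaces and punctuation, then the second word.
    my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;
    return $text =~ /($regex)/ ? $1 : undef;
}

print first_two_words('I am trying to make this work'), "\n";
print first_two_words('well-known fact'), "\n";
```

A string containing only one word returns undef, because the possessive \w++ does not give characters back to the second \w+.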

Matching the last digits of a number in Perl

I have a file in which there are a lot of GUIDs mentioned like this
Dlg1={929EC5C7-0A40-4BE4-8F0A-60C3CB4A62A7}-SdWelcome-0
I wanted to replace the last eight digits of these GUIDs with the last eight digits of a new GUID, which has already been generated using a tool. What I have tried so far follows.
Read the last eight digits of the generated GUID like this:
$GUID =~ /[0-9a-fA-F]{8}/;
Assign it to a new variable like:
$newGUID = $1;
Now try to replace this with the old GUID inside the file:
if ($line =~ /^.* {(.*)}/) {
$line =~ s/[0-9a-fA-F]{8}}/$newGUID/;
}
But it does not seem to be working. It replaces the last eight digits of the old GUID with 32 digits of the new GUID. How can I fix this?
You now have this:
$line =~s/[0-9a-fA-F]{8}}/$newGUID/;
You say that replaces the last eight characters of your GUID with the entire 32 digit new GUID. That means you're finding and replacing the right characters, but what you're replacing them with is wrong.
What is $newGUID equal to? Is it an entire 32 digit GUID? If so, you need to pull off the last 8 characters.
Two things I would recommend.
If you are using a hexadecimal number in your regular expression, use [[:xdigit:]] and not [0-9a-fA-F]. Both are pretty much equivalent, but :xdigit: is cleaner and easier to understand.
In Perl, we love regular expressions. Heck, Perl regular expression syntax has invaded and found homes in almost all other programming languages. However, regular expressions can be difficult to get right, to test, and to understand. Sometimes there is a cleaner way of doing something that is easier to follow than a regular expression.
In this case, you should use substr rather than regular expressions. You know exactly what you want, and you know the location in the string. The substr function would make what you're doing easier to understand and even cleaner:
use feature 'say';
use constant {
GUID_RE => qr/^[[:xdigit:]]{8}-[[:xdigit:]]{4}-[[:xdigit:]]{4}-[[:xdigit:]]{4}-[[:xdigit:]]{12}$/,
};
my $old_guid = '929EC5C7-0A40-4BE4-8F0A-60C3CB4A62A7';
my $new_guid = 'oooooooo-oooo-oooo-oooo-ooooXXXXXXXX';
# Regular expressions are great for verifying formats!
if ( not $old_guid =~ GUID_RE ) {
die qq(Old GUID "$old_guid" is not a GUID string);
}
if ( not $new_guid =~ GUID_RE ) { # Yes, I know this will die in this case
die qq(New GUID "$new_guid" is not a GUID string);
}
# Easy to understand, I'm removing the last eight characters of $old_guid
# and appending the last eight digits of $new_guid
my $munged_guid = substr( $old_guid, 0, -8 ) . substr( $new_guid, -8 );
say $munged_guid; # Would print 929EC5C7-0A40-4BE4-8F0A-60C3XXXXXXXX
I'm using regular expressions to verify that the GUIDs are correctly formatted, which is a great task for regular expressions.
I define a GUID_RE constant. You can look to see how it's defined and verify that it's in the correct format (8 hex digits, then three groups of 4 hex digits, then 12 hex digits, all separated by dashes).
Then, I can use that GUID_RE constant in my program, and it's easy to see what I'm doing. Is my GUID actually in the GUID_RE format?
Using substr instead of regular expressions makes it easy to see exactly what I am doing. I am removing the last eight characters from $old_guid and appending the last eight characters of $new_guid.
Again, your immediate issue is that your s/.../.../ is finding the right characters, but your substitution string isn't correct. However, this isn't the best use for regular expressions.
I think your problem is that you're not correctly setting $1 to the last eight digits (if it's coming from that regex, it would match the first eight digits and isn't setting any groups). You could instead try something like $newGUID = substr($GUID, -8);. I also think something like $GUIDTail makes more sense for the variable since it doesn't store an entire GUID.
Also, at the moment you're eating the closing curly brace. You should either include that in newGuid/guidTail, include it in the s/// call, or change the curly in the match to (?=\}) (which represents match this but don't include it in the match).
P.S.: You're assuming there that there's only one GUID on the line. You may want to tack a global modifier onto the match if there's any chance of multiple GUIDs (or otherwise disambiguate which one you want to modify; as it stands this will just replace the first one).
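A minimal sketch of the two suggestions in this answer, substr for the tail and a lookahead for the closing brace. The new GUID value and the name $guid_tail are made up for illustration; the line is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical values for illustration.
my $GUID = '11111111-2222-3333-4444-555566667777';
my $line = 'Dlg1={929EC5C7-0A40-4BE4-8F0A-60C3CB4A62A7}-SdWelcome-0';

my $guid_tail = substr($GUID, -8);   # last eight characters: 66667777

# (?=\}) requires a closing brace after the eight hex digits but leaves
# it out of the match, so the brace survives the substitution.
(my $updated = $line) =~ s/[[:xdigit:]]{8}(?=\})/$guid_tail/;

print "$updated\n";
```

Because the brace is never consumed, nothing has to be added back in the replacement string.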
Here's a small code snippet that demonstrates the principle I think you are after. First off, I start with a given string, and take the last 8 characters of it and store it in a new variable, $insert. Then I perform a somewhat strict substitution on the input data (here in the internal file handle DATA, which is convenient when demonstrating), and print the altered string.
The regex in the substitution looks for curly brackets { ... } with a mixture of hex digits [:xdigit:] and dashes \- between them ([[:xdigit:]\-]+), followed by 8 hex digits. The \K escape allows us to "keep" the matched string before it, so all we need to do is insert our stored string, and replace the closing curly bracket.
If you wish to try this on a file, change <DATA> to <> and run it like so:
perl script.pl input
Code:
use strict;
use warnings;
my $new = "929EC5C7-0A40-4BE4-8F0A-1234567890";
my $insert = substr($new, -8);
while (<DATA>) {
s/\{[[:xdigit:]\-]+\K[[:xdigit:]]{8}\}/$insert}/i;
print;
}
__DATA__
Dlg1={929EC5C7-0A40-4BE4-8F0A-60C3CB4A62A7}-SdWelcome-0
Output:
Dlg1={929EC5C7-0A40-4BE4-8F0A-60C334567890}-SdWelcome-0

Sensethising domains

So I'm trying to put all numbered domains into one element of a hash by doing this:
### Domanis ###
my $dom = $name;
$dom =~ /(\w+\.\w+)$/; #this regex get the domain names only
my $temp = $1;
if ($temp =~ /(^d+\.\d+)/) { # this regex will take out the domains with number
my $foo = $1;
$foo = "OTHER";
$domain{$foo}++;
}
else {
$domain{$temp}++;
}
where $name will be something like:
something.something.72.154
something.something.72.155
something.something.72.173
something.something.72.175
something.something.73.194
something.something.73.205
something.something.73.214
something.something.abbnebraska.com
something.something.cableone.net
something.something.com.br
something.something.cox.net
something.something.googlebot.com
My code currently prints this:
72.175
73.194
73.205
73.214
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
lstn.net
but I want it to print like this:
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
OTHER
lstn.net
where OTHER is all the numbered domains, so any ideas how?
You really shouldn't need to split the variable into two, e.g. this regex will match the case you want to trap:
/\d{1,3}\.\d{1,3}$/ -- returns true if the string ends with two numbers of 1-3 digits each, separated by a dot
But if you only need to pick out the domains that are not numbered, you could just check whether the last character of the domain is a letter, because TLDs cannot contain numbers, so you would do something like
/[a-zA-Z]$/ -- if this returns true, it is not a numbered domain (provided you've stripped spaces and newlines)
But I suppose it is better to be more specific in the regex, which also better illustrates the logic you are looking for in your script, so I'd use the former regex.
And actually you could do something like this:
if (my ($domain) = $name =~ /\.(\w+\.[a-zA-Z]+)$/)
{
#the domain is assigned to the variable $domain
} else {
#it is a numbered domain
}
Take what it currently puts, and use the regex:
/\d+\.\d+/
If it matches this, then it's a pair of numbers, so remove it.
This way you'll be able to keep any words with numbers in them.
Please, please indent your code correctly, and use whitespace to separate out various bits and pieces. It'll make your code so much easier to read.
Interestingly, you mentioned that you're getting the wrong output, but the section of the code you post has no print, printf, or say statement. It looks like you're attempting to count up the various domain names.
If these are the value of $name, there are several issues here:
if ($temp =~ /(^d+\.\d+)/) {
Matches nothing. This says that your string starts with one or more of the letter d (not digits, since the backslash is missing), followed by a period, followed by one or more digits. The ^ anchors your regular expression to the beginning of the string.
I think, but not 100% sure, you want this:
if ( $temp =~ /\d\.\d/ ) {
This will find all cases where there are two digits with a period in between them. It is a sub-pattern of /\d+\.\d+/, so both regular expressions will match the same strings.
The
$dom =~ /(\w+\.\w+)$/;
Is matching at the end of the string $dom (because of the $ anchor) two runs of letters, digits, or underscores with a period between them. Is that what you want?
I also believe this may indicate an error of some sort:
my $foo = $1;
$foo = "OTHER";
$domain{$foo} ++;
This sets $foo to whatever $temp matched, but then immediately resets $foo to OTHER and increments $domain{OTHER}.
We need a sample of your initial data, and maybe the actual routine that prints your output.
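In the meantime, here is a minimal sketch of the counting logic, assuming the goal is to count each domain and lump the purely numeric ones under OTHER. The @names list is a made-up sample built from the values in the question, and $last_two and $key are illustrative names.

```perl
use strict;
use warnings;

# Hypothetical sample of $name values, taken from the question's list.
my @names = qw(
    something.something.72.154
    something.something.73.194
    something.something.cox.net
    something.something.com.br
    something.something.cox.net
);

my %domain;
for my $name (@names) {
    # Take the last two dot-separated labels.
    my ($last_two) = $name =~ /(\w+\.\w+)$/ or next;

    # Purely numeric pairs such as 72.154 are lumped together as OTHER.
    my $key = $last_two =~ /^\d+\.\d+$/ ? 'OTHER' : $last_two;
    $domain{$key}++;
}

print "$_: $domain{$_}\n" for sort keys %domain;
```

The numeric test is anchored at both ends, so a name such as 3com.net would still be counted under its own key.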