How to construct this regular expression? - regex

How can I ignore abc, def, and 01 in the expression below using a regex? What I tried doesn't work.
Abc-def-smdp-01

One way to go is to split the original string into prefix, portion of interest, and suffix, and then remove the unwanted characters from the affixes:
$raw = "abc-def-smdp-01";
preg_match ( "/^(.*)(-smdp-)(.*)$/", $raw, $matches ); // separate original string into prefix, trunk, suffix
$matches[1] = preg_replace ( "/[^-]/", "", $matches[1] ); // non-'-' characters deleted in prefix
$matches[3] = preg_replace ( "/[^-]/", "", $matches[3] ); // non-'-' characters deleted in suffix
$result = $matches[1].$matches[2].$matches[3]; // composing target string
echo $result;
Online demo available here.
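For readers working outside PHP, the same split-then-strip idea can be sketched in Python. This is only an illustration under assumptions: the helper name keep_trunk and its default trunk value are made up for the example and are not part of the answer above.

```python
import re

def keep_trunk(raw, trunk="-smdp-"):
    """Hypothetical helper: keep the trunk, reduce prefix and suffix to their dashes."""
    m = re.match(r"^(.*)(" + re.escape(trunk) + r")(.*)$", raw)
    if m is None:
        return raw  # trunk not present: leave the string untouched
    prefix = re.sub(r"[^-]", "", m.group(1))  # drop every non-dash character
    suffix = re.sub(r"[^-]", "", m.group(3))  # same treatment for the suffix
    return prefix + m.group(2) + suffix

print(keep_trunk("abc-def-smdp-01"))  # prints --smdp-
```

As in the PHP version, the dashes of the prefix survive, so the result keeps one leading dash in front of the trunk.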
NB
Problems like this one can be tackled easily with some knowledge of the PHP function library, whose documentation in the online PHP manual comes with exact syntax, example code, and user comments.
In this case, look up:
preg_match
preg_replace
Of course, finding suitable candidates for the intended functionality assumes at least a perfunctory grasp of the available facilities, which makes 15 minutes of browsing the hierarchical index a judicious investment of time.

I would use this pattern:
preg_match("/\w+\-\w+\-(\w+)\-\d+/", "Abc-def-smdp-01", $output);
echo $output[1];
http://www.phpliveregex.com/p/ftB
EDIT: or do you need the dashes?
In that case you need to use this pattern: "/\w+(\-)\w+(\-)(\w+)(\-)\d+/"
And then output it as:
echo $output[1].$output[2].$output[3].$output[4];
or loop it and build a string with .=
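The capture-group approach is not PHP-specific. A minimal Python sketch of the first pattern follows; the function name middle_part is an assumption made for illustration.

```python
import re

def middle_part(raw):
    """Return the third dash-separated field, or None if the shape does not match."""
    m = re.search(r"\w+-\w+-(\w+)-\d+", raw)
    return m.group(1) if m else None

print(middle_part("Abc-def-smdp-01"))  # prints smdp
```

Only group 1 is returned here, which corresponds to echoing $output[1] in the PHP snippet.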

Related

How do I use operations inside a Perl regex?

For example let's say I have something like this:
$_ = 23;
$a = 2;
print /$a $a+1/x;
should print 1. Basically, is it possible to use functions inside the regex string?
Variable interpolation in regexes works pretty much the same as variable interpolation in strings. Given my $x = 2, the string "$x $x+1" would be "2 2+1". The variable is expanded, but code in the string is not executed.
One trick around this is to dereference a reference inside the string. This allows us to include arbitrary expressions, but the syntax is a bit cumbersome. Usually, we create an array reference with the value we want to include, [$x + 1], then immediately dereference it: @{[$x + 1]}. This is similar to Ruby's #{...} interpolation, or to Bash $(...) command substitution.
So the regex /$x @{[$x + 1]}/x would work.
But in most cases, it's going to be much clearer to perform all calculations outside of the regex:
my $x = 2;
my $y = $x + 1;
/$x $y/x;
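The advice to compute values before building the pattern carries over to other languages. As a hedged illustration in Python rather than Perl, the arithmetic is done first and only the results are placed in the pattern text:

```python
import re

x = 2
y = x + 1                        # do the arithmetic outside the pattern
pattern = re.compile(f"{x}{y}")  # the pattern text is now simply "23"

print(pattern.pattern)               # prints 23
print(bool(pattern.search("23")))    # prints True
```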
Perl's regex syntax can also generate parts of the regex dynamically. With variable interpolation as above, the variable contents are interpolated and then the regex is compiled. But advanced regexes may change the value of a variable during the pattern match. Such delayed subexpressions are written with the (??{ ... }) syntax. Here: /$x (??{ $x + 1 })/x. However, this is a very advanced and error-prone regex feature, and it will also be slower than an ordinary regex.
There is an extended pattern that provides for code execution in the match operator m// or in the matching part of the substitution operator s///.
Its version that substitutes the code's return and goes on to treat it as a pattern is
/(??{ code })/
so in your case
$_ = 23;
my $x = 2;
my ($m) = /(2(??{ $x+1 }))/;
say $m;
or
RE_EVAL: {
use re 'eval';
my ($m) = /($x(??{ $x+1 }))/;
say $m;
}
matches and captures 23.
Here use re 'eval' specifically allows this, normally disallowed for security reasons.
This is a very involved capability which comes with complex warnings. Apart from its entry at the above link also follow the link in that text and read about Embedded Code Execution frequency.
Please don't use this complex tool for convenience, or to substitute for properly written code.

Perl Regex E-Mail TLD

I have this code:
if ( $Mail =~ /$Tld{$_}/ ) {
$TldFound = 1;
}
The variable $Mail holds, for example, "mail@mail.com". The variable $Tld holds ".com". How can I cut down $Mail so that only the TLD .com remains?
You should use Email::Address to parse email addresses.
To be able to extract a TLD with certainty requires a list of what you consider to be TLDs. For example, do .co.uk, or .com.tr count? Or, do you just want the last string of non-dot characters?
If you restrict your attention to 2 - 3 character TLDs such as .co, .com, .io, .net, .org, .us etc, you can do my ($tld) = ($email =~ /[.] ([a-z]{2,3}) \z/x); and then check with if ($tld and ($tld eq 'com')) { ... } etc, but you really want a good list of acceptable strings that can be TLDs: Net::Domain::TLD, Mozilla::PublicSuffix.
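To make the two-to-three-letter restriction concrete, here is a small Python sketch of the same pattern. The function name short_tld is hypothetical, and the check against a real TLD list is deliberately left out.

```python
import re

def short_tld(email):
    """Return the final 2-3 letter label after the last dot, or None."""
    m = re.search(r"\.([a-z]{2,3})\Z", email)
    return m.group(1) if m else None

print(short_tld("mail@mail.com"))       # prints com
print(short_tld("user@example.co.uk"))  # prints uk
```

Note that the second call returns only uk, which is exactly the .co.uk ambiguity raised above: a regex alone cannot decide what counts as a TLD.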
Naive Regex Solutions
The following solutions will solve your problem as posted, but are not intended to address every possible edge case. Parsing email addresses in a comprehensive way is non-trivial, and requires a parser such as Email::Address if you want to handle the full complexity of the RFCs.
Printing Your TLD from a String
Since you already know the string you want to print on success (e.g. ".com"), you don't actually need the result of your regular expression match; you can print the string stored in $Tld when the match is true using a post-statement condition. For example:
$Mail = 'mail@mail.com';
$Tld = '.com';
print "$Tld\n" if $Mail =~ /\Q$Tld\E$/; # \Q...\E keeps the dot in $Tld literal
This will correctly print:
.com
Printing the Match
If you really want the full match, there are a number of ways to do it. One way would be to use the special $& variable:
$Mail = 'mail@mail.com';
$Tld = '.com';
if ($Mail =~ /\Q$Tld\E$/) { # \Q...\E keeps the dot in $Tld literal
print "$&\n";
}
This will also correctly print:
.com
Partitioning the String
All of the previous examples will solve your problem as posted, but the best generic solution short of a parser is really to partition the TLD, and treat the last segment of the domain as an unvalidated TLD. Ruby has the super-handy String#rpartition method, but I'm unaware of a similar function in Perl. However, you can use an anchored match to accomplish much the same thing. For example:
$Mail = 'mail@mail.com';
$Mail =~ /(\.[[:alpha:]]+)$/;
print "$1\n";
If you need to validate the TLD against an expected value such as .com, you can compare it to a string or variable. For example:
$Mail = 'mail@mail.com';
$Tld = '.com';
# Only use $1 when the match itself succeeded
print "$1\n" if $Mail =~ /(\.[[:alpha:]]+)$/ && $1 eq $Tld;
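For comparison with the rpartition idea mentioned above, Python's built-in str.rpartition does the same split from the right. The sketch below is an assumption-laden illustration (the name last_label is made up) and, like the Perl version, treats the last segment as an unvalidated TLD.

```python
def last_label(email):
    """Return the trailing '.label' if it is purely alphabetic, else None."""
    head, dot, tail = email.rpartition(".")
    # dot is empty when the string contains no '.' at all
    return dot + tail if dot and tail.isalpha() else None

print(last_label("mail@mail.com"))  # prints .com
```

Comparing the result with an expected value such as ".com" is then an ordinary string comparison.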

Extracting first two words in perl using regex

I want to extract the first two words from a sentence using a Perl function in PostgreSQL. In PostgreSQL, I can do this with:
text = "I am trying to make this work";
Select substring(text from '(^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)');
It would return "I am".
I tried to build a Perl function in PostgreSQL that does the same thing.
CREATE OR REPLACE FUNCTION extract_first_two (text)
RETURNS text AS
$$
my $my_text = $_[0];
my $temp;
$pattern = '^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)';
my $regex = qr/$pattern/;
if ($my_text=~ $regex) {
$temp = $1;
}
return $temp;
$$ LANGUAGE plperl;
But I receive a syntax error near the regular expression. I am not sure what I am doing wrong.
Extracting words is non-trivial even in English. Take the following contrived example using Locale::CLDR:
use Locale::CLDR;
my $locale = Locale::CLDR->new('en');
my @words = $locale->split_words('adf543. 123.25');
@words now contains
adf543
.
123.25
Note that the full stop after adf543 is split into a separate word, but the one between 123 and 25 is kept as part of the number 123.25, even though the '.' is the same character.
It gets worse when you look at non-English languages, and much worse when you use non-Latin scripts.
You need to define precisely what you think a word is; otherwise the following French gets split incorrectly.
Je avais dit «Elle a dit «Il a dit «Ni» il ya trois secondes»»
The parentheses are mismatched in your regex pattern. It has three opening parentheses and four closing ones.
Also, you have two single quotes in the middle of a singly-quoted string, so
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)'
is parsed as two separate strings
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'
and
')?(\s+)?\w+)'
But I can't suggest how to fix it as I don't understand your intention.
Did you mean a double quote perhaps? In which case (!|,|\&|")? can be written as [!,&"]?
Update
At a rough guess I think you want this
my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;
$temp = $1 if $my_text=~ /($regex)/;
but I can't be sure. If you describe what you're looking for in English then I can help you better. For instance, it's unclear why you don't have question marks, full stops, and semicolons in the list of intervening punctuation.
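To see how the guessed pattern behaves on sample input, here is a rough Python rendering. It is a sketch only: the possessive \w++ is written as plain \w+ so that it runs on Python versions before 3.11, and the function name first_two_words is an assumption.

```python
import re

# word, optional whitespace, optional punctuation, optional whitespace, word
FIRST_TWO = re.compile(r'^\w+\s*[-!,&"]*\s*\w+')

def first_two_words(text):
    m = FIRST_TWO.match(text)
    return m.group(0) if m else None

print(first_two_words("I am trying to make this work"))  # prints I am
print(first_two_words("Hello, world again"))             # prints Hello, world
```

Without the possessive quantifier a single word such as "Hello" still matches, because the first \w+ can give back its last letter to the second one.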

Sensethising domains

So I'm trying to put all numbered domains into one element of a hash by doing this:
### Domanis ###
my $dom = $name;
$dom =~ /(\w+\.\w+)$/; #this regex get the domain names only
my $temp = $1;
if ($temp =~ /(^d+\.\d+)/) { # this regex will take out the domains with number
my $foo = $1;
$foo = "OTHER";
$domain{$foo}++;
}
else {
$domain{$temp}++;
}
where $name will be something like:
something.something.72.154
something.something.72.155
something.something.72.173
something.something.72.175
something.something.73.194
something.something.73.205
something.something.73.214
something.something.abbnebraska.com
something.something.cableone.net
something.something.com.br
something.something.cox.net
something.something.googlebot.com
My code currently prints this:
72.175
73.194
73.205
73.214
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
lstn.net
but I want it to print like this:
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
OTHER
lstn.net
where OTHER is all the numbered domains, so any ideas how?
You really shouldn't need to split the variable into two, e.g. this regex will match the case you want to trap:
/\d{1,3}\.\d{1,3}$/ -- returns true if the string ends with two groups of 1 to 3 digits separated by a dot
but if you only need to separate out the domains that are not numbered, you could just check whether the last character of the domain is a letter, because TLDs cannot contain numbers, so you would do something like
/[a-z]$/i -- if it returns true, it is not a numbered domain (provided you've stripped spaces and newlines)
But I suppose it is better to be more specific in the regex, which also better illustrates the logic you are looking for in your script, so I'd use the former regex.
And actually you could do something like this:
if (my ($domain) = $name =~ /\.(\w+\.[a-z]+)$/i)
{
#the domain is assigned to the variable $domain
} else {
#it is a number domain
}
Take what it currently outputs, and use the regex:
/\d+\.\d+/
If it matches this, then it's a pair of numbers, so remove it.
This way you'll be able to keep any words with numbers in them.
Please, please indent your code correctly, and use whitespace to separate out various bits and pieces. It'll make your code so much easier to read.
Interestingly, you mentioned that you're getting the wrong output, but the section of the code you post has no print, printf, or say statement. It looks like you're attempting to count up the various domain names.
If these are the value of $name, there are several issues here:
if ($temp =~ /(^d+\.\d+)/) {
Matches nothing. This is saying that your string starts with one or more letter d followed by a period followed by one or more digits. The ^ anchors your regular expression to the beginning of the string.
I think, but not 100% sure, you want this:
if ( $temp =~ /\d\.\d/ ) {
This will find all cases where there are two digits with a period in between them. This is the sub-pattern to /\d+\.\d+/, so both regular expressions will match the same thing.
The
$dom =~ /(\w+\.\w+)$/;
is anchored by $ to the end of $dom, where it matches one or more letters, digits, or underscores, then a period, then one or more letters, digits, or underscores. Is that what you want?
I also believe this may indicate an error of some sort:
my $foo = $1;
$foo = "OTHER";
$domain{$foo} ++;
This sets $foo to whatever $temp matched, but then immediately resets $foo to OTHER and increments $domain{OTHER}.
We need a sample of your initial data, and maybe the actual routine that prints your output.
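Until the real data and printing routine are available, the intended bucketing can be sketched as follows. This is a Python illustration under assumptions: the function name count_domains and the sample list are made up, and "numbered" is taken to mean that the last two labels consist only of digits.

```python
import re
from collections import Counter

def count_domains(names):
    """Count the last two labels of each name; all-numeric pairs go to OTHER."""
    counts = Counter()
    for name in names:
        m = re.search(r"(\w+\.\w+)$", name)
        if m is None:
            continue  # no two-label tail; skipped in this sketch
        tail = m.group(1)
        key = "OTHER" if re.fullmatch(r"\d+\.\d+", tail) else tail
        counts[key] += 1
    return counts

sample = [
    "something.something.72.154",
    "something.something.73.194",
    "something.something.cox.net",
    "something.something.com.br",
]
print(count_domains(sample))
```

Using fullmatch on the already extracted tail avoids the anchoring mistake discussed above, since the whole tail must be digits, a dot, and digits.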

regular expression url rewrite based on folder

I need to be able to take /calendar/MyCalendar.ics, where MyCalendar.ics could be just about anything with an ICS extension, and rewrite it to /feeds/ics/ics_classic.asp?MyCalendar.ics
Thanks
Regular expressions are meant for searching and matching text. Usually you use a regex to tell a text manipulation tool what to search for, and then use a tool-specific way to tell it what to replace the match with.
Regex syntax uses round brackets to define capture groups inside the whole search pattern. Many search-and-replace tools use capture groups to define which part of the match to replace.
We can take the Java Pattern and Matcher classes as example. To complete your task with the Java Matcher you can use the following code:
Pattern p = Pattern.compile("/calendar/(.*\\.(?i)ics)"); // "\\." is a literal dot inside a Java string
Matcher m = p.matcher(url);
String rewritenUrl = "";
if(m.matches()){
rewritenUrl = "/feeds/ics/ics_classic.asp?" + url.substring( m.start(1), m.end(1));
}
This will find the requested pattern but will only take the first regex group for creating the new string.
Here is a link to regex replacement information in (imho) a very good regex information site: http://www.regular-expressions.info/refreplace.html
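The capture-group-plus-replacement idea described above can also be shown in a single call. The following Python sketch is an illustration, not the answerer's method; the function name rewrite_ics is hypothetical, and it assumes the file sits directly under /calendar/.

```python
import re

def rewrite_ics(url):
    """Rewrite /calendar/<name>.ics to the feed URL; other paths are returned unchanged."""
    return re.sub(
        r"^/calendar/([^/]+\.ics)$",        # group 1 is the file name, no slashes allowed
        r"/feeds/ics/ics_classic.asp?\1",   # \1 re-inserts the captured file name
        url,
        flags=re.IGNORECASE,
    )

print(rewrite_ics("/calendar/MyCalendar.ics"))
# prints /feeds/ics/ics_classic.asp?MyCalendar.ics
```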
C:\x>perl foo.pl
Before: a=/calendar/MyCalendar.ics
After: a=/feeds/ics/ics_classic.asp?MyCalendar.ics
...or how about this way?
(regex kind of seems like overkill for this problem)
b=/calendar/MyCalendar.ics
index=9
c=MyCalendar.ics (might want to add check for ending with '.ics')
d=/feeds/ics/ics_classic.asp?MyCalendar.ics
Here's the code:
C:\x>type foo.pl
my $a = "/calendar/MyCalendar.ics";
print "Before: a=$a\n";
my $match = (
$a =~ s|^.*/([^/]+)\.ics$|/feeds/ics/ics_classic.asp?$1.ics|i
);
if( ! $match ) {
die "Expected path/filename.ics instead of \"$a\"";
}
print "After: a=$a\n";
print "\n";
print "...or how about this way?\n";
print "(regex kind of seems like overkill for this problem)\n";
my $b = "/calendar/MyCalendar.ics";
my $index = rindex( $b, "/" ); #find last path delim.
my $c = substr( $b, $index+1 );
print "b=$b\n";
print "index=$index\n";
print "c=$c (might want to add check for ending with '.ics')\n";
my $d = "/feeds/ics/ics_classic.asp?" . $c;
print "d=$d\n";
C:\x>
General thoughts:
If you do solve this with a regex, a semi-tricky bit is making sure your capture group (the parens) exclude the path separator.
Some things to consider:
Are your path separators always forward-slashes?
Regex seems like overkill for this; the simplest thing I can think of is to get the index of your last path separator and do simple string manipulation (2nd part of sample program).
Libraries often have routines for parsing paths. In Java I'd look at the java.io.File object, for example, specifically
getName()
Returns the name of the file or directory denoted by
this abstract pathname. This is just the last name in
the pathname's name sequence.
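The library-routine route can be sketched in Python as well, where posixpath.basename plays a role similar to getName(). The function name rewrite_by_basename is an assumption, and the sketch assumes forward-slash separators.

```python
import posixpath

def rewrite_by_basename(path):
    """Take the last path component and append it to the feed URL as the query."""
    name = posixpath.basename(path)  # last name in the path, e.g. MyCalendar.ics
    if not name.lower().endswith(".ics"):
        raise ValueError("expected path/filename.ics instead of " + repr(path))
    return "/feeds/ics/ics_classic.asp?" + name

print(rewrite_by_basename("/calendar/MyCalendar.ics"))
# prints /feeds/ics/ics_classic.asp?MyCalendar.ics
```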