matching the closest strings to a search term (perl regex)

Basically, I'm trying to search through a rather large PHP file and replace any block of PHP code that contains the string "search_term" with some other code. For example,
<?php
//some stuff
?>
<?php
// some more stuff
$str = "search_term";
// yes...
?>
<?php
// last stuff
?>
should become
<?php
//some stuff
?>
HELLO
<?php
// last stuff
?>
What I've got so far is
$string =~ s/<\?php(.*?)search_term(.*?)\?>/HELLO/ims;
This correctly matches the closest closing ?>, but begins the match at the very first <?php, instead of the one closest to the string search_term.
What am I doing wrong?

Generally, I don't like to use non-greedy matching, because it usually leads to problems like this. Perl looks at your file, finds the first '<?php', then starts looking for the rest of the regexp. It passes over the first '?>' and the second '<?php' because they match .*, then finds search_term and the next '?>', and it's done.
Non-greedy matching means that your regular expression matches more things than you really want, and it leaves it up to Perl to decide which match to return. It's better to use a regular expression that matches exactly what you want. In this case, you can get that by using ((?!\?>).)* instead of .*? (here (?!\?>) is a negative look-ahead assertion, so each . is consumed only if it does not start a ?>):
s/<\?php((?!\?>).)*search_term((?!\?>).)*\?>/HELLO/is;
If you expect multiple matches, you might want to use /isg rather than /is.
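To see the difference concretely, here is a minimal sketch in Python, whose re module accepts the same look-ahead syntax as Perl. The variable names are mine, not from the question; the input is the sample from the question.

```python
import re

source = """<?php
//some stuff
?>
<?php
// some more stuff
$str = "search_term";
// yes...
?>
<?php
// last stuff
?>
"""

# Non-greedy version from the question: the match starts at the FIRST <?php.
lazy = re.compile(r"<\?php(.*?)search_term(.*?)\?>", re.I | re.S)

# Look-ahead version: no character that begins "?>" may be consumed,
# so a match can never run across a block boundary.
tempered = re.compile(r"<\?php((?!\?>).)*search_term((?!\?>).)*\?>", re.I | re.S)

lazy_result = lazy.sub("HELLO", source, count=1)
tempered_result = tempered.sub("HELLO", source)

print(lazy_result)
print(tempered_result)
```

The first print shows the opening block swallowed along with the target block; the second leaves the first and last blocks alone.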
Alternatively, just split the file into blocks:
my @blocks = split /(\?>)/, $string;
while (@blocks) {
    my $block = shift @blocks;
    my $sep   = shift(@blocks) // '';   # the trailing chunk has no separator
    if ($block =~ /search_term/) {
        print "HELLO";
    } else {
        print $block, $sep;
    }
}

You just need to put your first capture group back into your replacement. Something like this:
s/<\?php(.*)<\?php(.*?)search_term(.*?)\?>/<\?php$1HELLO/ims

$string =~ s/<\?php(?:(?!\?>|search_term).)*search_term.*?\?>/HELLO/isg;
(?:(?!\?>|search_term).)* matches one character at a time, after making sure the character isn't the beginning of ?> or search_term. When that stops matching, if the next thing in the string is search_term it consumes that and everything after it until the next ?>. Otherwise, that attempt fails and it starts over at the next <?php.
The crucial point is that, like @RobertYoung's solution, it's not allowed to match ?> as it searches for search_term. By not matching search_term either, it eliminates backtracking, which makes the search more efficient. Depending on the size of the source string that may not matter, but it won't noticeably hurt performance either.
@Benj's solution (as currently posted) does not work. It yields the desired output with the sample string you provided, but that's only by accident. It only replaces the last code block with search_term in it, and (as @mob commented) it completely ignores the contents of the very first code block.
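As a quick check of the pattern at the top of this answer, here is a small Python sketch (same regex syntax; the input string is a made-up example with two blocks that contain search_term):

```python
import re

# Hypothetical input: four PHP blocks, two of which contain search_term.
source = "<?php a ?><?php search_term ?><?php b ?><?php x search_term y ?>"

pattern = re.compile(
    r"<\?php(?:(?!\?>|search_term).)*search_term.*?\?>",
    re.I | re.S,
)

# subn returns the new string and the number of replacements made.
result, count = pattern.subn("HELLO", source)
print(result)
print(count)
```

Both matching blocks are replaced and the other two are left untouched, which is the behaviour the /g flag gives in the Perl version.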

s/(.*)<\?php.*?search_term.*?\?>/${1}HELLO/ims;
In your regular expression, the regex engine is trying to find the earliest occurrence of a substring that matches your target expression, and it finds it between the first <?php and the second ?>.
By putting (.*) at the start of the regex, you trick the regex engine into going to the end of the string (since .* matches the whole string), and then backtracking to spots where it can find the string "<?php". That way the resulting match won't include any more <?php tokens than necessary.

You are using stingy (non-greedy) matching, but that can still match too much.
Matching repetitions in perlretut describes it well.
I sometimes use negated character classes to help, although I don't think they will help here. For example:
s/^[^A]*A/A/
to make sure unwanted characters aren't matched.
But I'm not usually trying to match across multiple lines, and I don't use Perl unless I have to.

Related

adding regex pattern in string in perl

I want to compare a string from one file against a string from another. The second file may contain extra elements (written as <?..?> below), which can occur anywhere and any number of times.
Note: these tags need to be retained in the final output.
For example:
I want to match the word 'scripting'. The <match> tag in $str1 marks the word to be found in $str2.
$str1 = "perl is an <match>scripting</match> language";
$str2 = "perl is an s<?..?>cr<?..?>ipti<?..?>ng langu<?..?>age";
Output required :
perl is an <match>s<?..?>cr<?..?>ipti<?..?>ng</match> langu<?..?>age
I am adding a pattern after each character:
$str1 =~ s{(.)}
{
'$&(?:(?:<?...?>|\n)+)?'
}esgi;
This works in a few cases, but in others it never finishes. Please suggest a fix.
(?:(?:<?...?>|\n)+)? is the same as (?:<?...?>|\n)*.
Also, you don't want to add the pattern after each character, only between the characters of the matched part of $str1: no pattern before the first character and none after the last. Otherwise the match will also swallow the elements next to the word, whereas you want the <match> tags to sit at the front and back of the word itself.
My guess is that if you are running that first replace command over all of $str1, you may end up with quite a large string.
See also my answer to a related question.
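The "filler only between the characters" idea can be sketched as follows, in Python for brevity. The shape of the interleaved elements is an assumption on my part (the question elides it), and wrap_match and FILLER are illustrative names.

```python
import re

# ASSUMPTION: an interleaved element looks like "<?" ... "?>", or is a newline.
# The inner look-ahead keeps one element from running into the next.
FILLER = r"(?:<\?(?:(?!\?>).)*\?>|\n)*"

def wrap_match(str1, str2):
    """Find the <match> word of str1 inside str2, tolerating filler between letters."""
    word = re.search(r"<match>(.*?)</match>", str1).group(1)
    # Filler goes BETWEEN characters only: none before the first, none after the last.
    pattern = FILLER.join(re.escape(ch) for ch in word)
    return re.sub(pattern, lambda m: "<match>" + m.group(0) + "</match>", str2, count=1)

str1 = "perl is an <match>scripting</match> language"
str2 = "perl is an s<?..?>cr<?..?>ipti<?..?>ng langu<?..?>age"
print(wrap_match(str1, str2))
```

Building the pattern only from the word inside <match>, rather than from all of $str1, also keeps the generated regex short.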

Regex substituting opening parenthesis

As part of a parsing script I'm trying to convert strings like this:
<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">
into
<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">
The regex for the closing parenthesis works fine
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%29).)*)%29([^\"\']*[\"\'])~\1)\2~g" "$pageName".html
giving me
<a href="http://www.web.com/%20Special%20event%202013%20%282).pdf">
The problem arises with the equivalent regex for the opening parenthesis:
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(\2~g" "$pageName".html
just returns the two groups with nothing in between:
<a href="http://www.web.com/%20Special%20event%202013%202%29.pdf">
Escaping the ( in the substitution with a backslash (or two) has no effect. If I wrap it in some other characters (say ~\1#(#\2~g ) the parenthesis still disappears (giving me %20##2%29 ).
If, however, in a fit of desperation I add seven parentheses to the substitution, it works.
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(((((((\L\2~g" "$pageName".html
outputs
<a href="http://www.web.com/%20Special%20event%202013%20(2%29.pdf">
Can somebody please make sense of this?
Perhaps the following will be helpful or at least provide some direction. It will work on Perl 5.10 and above.
use strict;
use warnings;
use v5.10.0; # For regex \K
use URI::Escape;
my $string = '<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">';
$string =~ s/.+2013%20\K([^.]+)(?=\.pdf)/uri_unescape($1)/e;
print $string;
Output:
<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">
I left enough of the date and the space (%20) as an anchor, then used \K to keep everything matched up to that point out of the replacement. The URI-encoded text is then captured, decoded with uri_unescape, and used as the substitution text.
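For comparison, the same "anchor, then decode" idea can be sketched in Python. Python's re module has no \K, so this sketch swaps in a fixed-width look-behind on the same anchor text; urllib.parse.unquote plays the role of uri_unescape.

```python
import re
from urllib.parse import unquote

string = '<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">'

# (?<=2013%20) anchors on the text before the encoded part without consuming it;
# (?=\.pdf) does the same for the text after it.
decoded = re.sub(
    r"(?<=2013%20)[^.]+(?=\.pdf)",
    lambda m: unquote(m.group(0)),
    string,
)
print(decoded)
```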
The pattern you have doesn't match the string you show at all. It matches something that looks like
<a href=/"../$i-xxxxxxxxxxxxxxx%29xxxxxxxxxx">
with literal dots, and whatever $i contains.
Also, a couple of points about your substitution:
Don't escape characters that don't need escaping. It may take some experience to know without checking which characters you need to escape, but the main point of using ~ as a delimiter is to avoid having to escape slashes in the regex, so at least you could have avoided that.
Don't use \1, \2 etc. in the replacement string. Perl tries very hard to make this work, but normally in Perl those sequences mean to insert the characters \x01 and \x02. Use $1 and $2.
So your regex could be written
s~(href=/?["']\.\./$i-(?:(?!%29).)*)%29([^"']*["'])~$1)$2~;
but it still doesn't "work fine" with the string you gave, which would have to look something like
<a href=/"../$i-xxxxxxxxxxxxxxx%282%29xxxxxxxxxx">
again, containing whatever is in $i. I don't understand at all the optional slash before the href attribute value: it is invalid HTML.
However, using a string that your first regex matches, your second one also works, replacing opening parentheses correctly, so I can't guess at what the problem may be.
There is often no need to verify the entire string. You can just replace the parts you're interested in. So I would write something like
s/(href="[^"]+)%28(\d+)%29(\.pdf")/$1($2)$3/;
which works fine on the string you gave, and replaces both open and close parentheses at once.
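The single-substitution approach can be tried out with a short Python sketch (decode_parens and PAREN_IN_HREF are illustrative names; the pattern is the one from the line above):

```python
import re

# Same pattern as above: group 1 is the href up to %28, group 2 the digits,
# group 3 the ".pdf" and closing quote.
PAREN_IN_HREF = re.compile(r'(href="[^"]+)%28(\d+)%29(\.pdf")')

def decode_parens(html):
    """Replace %28<digits>%29 before .pdf with literal parentheses."""
    return PAREN_IN_HREF.sub(r"\1(\2)\3", html)

print(decode_parens('<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">'))
```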
I had some problems understanding your regex, but this might work:
perl -pe "s~(href\s*=\s*\"[^\"]*)%28(.*?)%29~\$1(\$2)~g" input

Regex for matching last two parts of a URL

I am trying to figure out the best regex to match only the last two parts of a URL.
For instance, with www.stackoverflow.com I just want to match stackoverflow.com
The issue I have is that some strings can contain a large number of periods. For instance,
a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
should also return only yimg.com
The set of URLs I am working with does not include any path information, so one can assume the last part of the string is always .org or .com or something of that nature.
What regular expression will return stackoverflow.com when run against www.stackoverflow.com and yimg.com when run against a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
under the conditions above?
You don't have to use a regex; a simple explode() call is enough.
You want to split your URL at the periods, so something like
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
$url_split = explode(".",$url);
And then you need to get the last two elements, so you can echo them out from the array created.
//this will return the second to last element, yimg
echo $url_split[count($url_split)-2];
//this will echo the period
echo ".";
//this will return the last element, com
echo $url_split[count($url_split)-1];
So in the end you'll get yimg.com as the final output.
Hope this helps.
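The same split-and-rejoin idea, sketched in Python (last_two_labels is an illustrative name, and the sketch assumes a bare host name with no path, as the question states):

```python
def last_two_labels(host):
    """Return the last two dot-separated parts of a host name."""
    return ".".join(host.split(".")[-2:])

print(last_two_labels("a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com"))
print(last_two_labels("www.stackoverflow.com"))
```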
I don't know what you have tried so far, but I can offer the following solution:
/.*?([\w]+\.[\w]+)$/
There are a couple of tricks here:
Use $ to anchor the match at the end of the string. This way you can be sure the regex engine won't return a match from the very beginning.
Use grouping with (...). The group means: match a word of at least one character, then a dot (backslashed, because a dot has a special meaning in regex and we want it literally), then another word of at least one character.
Use a reluctant (non-greedy) quantifier at the beginning of the pattern; otherwise .* matches greedily. For example, if your text is:
abc.def.gh
the greedy match will give f.gh in your group, which is not what you want.
I assumed that the last two parts of your host contain only word characters (\w matches letters, digits, and underscore); you may need something more elaborate for your data.
Here is a working Groovy example. You didn't specify the language you use, but the engine should behave similarly.
def s = "abc.def.gh"
def m = s =~/.*?([\w]+\.[\w]+)$/
println m[0][1] // outputs the first (and the only you have) group in groovy
Hope this helps
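The greedy versus reluctant difference described above can be reproduced with a short Python sketch (last_two is an illustrative helper; only the prefix quantifier changes between the two calls):

```python
import re

def last_two(host, prefix):
    """Apply prefix + (\\w+\\.\\w+)$ to host and return the captured group."""
    match = re.search(prefix + r"(\w+\.\w+)$", host)
    return match.group(1) if match else None

greedy = last_two("abc.def.gh", r".*")     # .* grabs as much as it can
lazy = last_two("abc.def.gh", r".*?")      # .*? grabs as little as it can
print(greedy, lazy)
print(last_two("a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com", r".*?"))
```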
If you need a solution using Perl-compatible regular expressions, which work in a number of languages, you can use something like the following (the example is in PHP):
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
preg_match('|[a-zA-Z-0-9]+\.[a-zA-Z]{2,3}$|', $url, $m);
print($m[0]);
This regex fetches the domain name plus the top-level domain, provided the top-level domain is two or three letters long. For example, with a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com this produces
yimg.com
as output, and with www.stackoverflow.com (with or without the preceding www) it gives you
stackoverflow.com
as a result.
A shorter version
/(\.[^\.]+){2}$/
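Two things are worth knowing about the shorter pattern: the match begins with a dot, and a host with only two parts does not match at all. A minimal Python sketch (last_two_dotted is an illustrative name) shows both:

```python
import re

def last_two_dotted(host):
    """Match two trailing '.label' groups and drop the leading dot."""
    match = re.search(r"(\.[^.]+){2}$", host)
    # The raw match looks like ".yimg.com", so strip the leading dot.
    return match.group(0).lstrip(".") if match else None

print(last_two_dotted("www.stackoverflow.com"))
print(last_two_dotted("yimg.com"))
```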

Regex to match suffixes to english words

I'm searching for the word "move" and I want to match "moved" as well when I print.
The way I'm going about this is:
if ($sentence =~ /($search_key)d$/i) {
$search_key = $search_keyd;
}
$subsentences[$i] =~ s/$search_key/ **$search_key** /i;
$subsentences[$i] =~ s/\b$parsewords[1]_\w+/ --$parsewords[1]--/i;
print "MATCH #$count\n",split(/_\S+/,$subsentences[$i]), "\n";
$count++;
This is part of a longer program, so let me know if anything is unclear. The _ is there because the words in the sentence are tagged (ex. I_NN move_VB to_PREP ....).
Here $search_keyd is $search_key."d", which worked!
A nice addition would be to check if the word ended in e and therefore only a d would need to be appended. I'd guess it'd look something like this: e?$/d$
Even a general answer will suffice.
I'm new to Perl. So sorry if this is elementary. Thanks in advance!!!
If I understand you correctly, you want to search for "move" and add a highlight, but also include any variation of the basic word, such as "moves" and "moved".
When you are replacing words in a text like this, you usually want to replace all occurrences, and for that you need the /g modifier on the regex, like so:
$subsentences[$i] =~ s/$search_key/ **$search_key** /ig;
Also, you should make sure not to match parts of words. E.g. you want to match "move", but perhaps not "remove". For this, you can use \b to mark a word boundary:
$subsentences[$i] =~ s/\b$search_key/ **$search_key** /ig;
In order to match certain suffixes, you need a character class with valid characters or combination of characters. move[sd] will find "moves" and "moved". However, for a word like "jump", you would need to be a bit more specific: "jump(s|ed)". Note that [sd] can be replaced with (s|d). So barring any bad spelling in your text, you can get away with:
$subsentences[$i] =~ s/\b$search_key(s|d|ed)/ **$search_key$1** /ig;
Note that $1 holds whatever was matched by the first capturing group.
To find the number of matching words:
my $matches = $subsentences[$i] =~ s/\b$search_key(s|d|ed)/ **$search_key$1** /ig;
If you want to be more specific with the suffixes, i.e. make it not match badly spelled words like "moveed", you'd need to do some special matching. Something like:
my $suffix;
if ($search_key =~ /e$/i) { $suffix = '(s|d)' }
else { $suffix = '(s|ed)' }
my $matches = $subsentences[$i] =~ s/\b$search_key$suffix/ **$search_key$1** /ig;
It can probably become very complicated the more search words you add.
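Here is a hedged Python sketch of the suffix selection described above. It is not a translation of the Perl: it assumes plain, untagged text, makes the suffix optional so the bare word matches too, and adds a trailing \b so that "moveed" is rejected. The name highlight is illustrative.

```python
import re

def highlight(search_key, text):
    """Wrap search_key and its simple inflections in ** markers.

    Returns (new_text, number_of_matches).
    """
    # Words ending in "e" take "s" or "d"; other words take "s" or "ed".
    suffix = "(?:s|d)?" if search_key.lower().endswith("e") else "(?:s|ed)?"
    pattern = re.compile(
        r"\b(" + re.escape(search_key) + suffix + r")\b",
        re.IGNORECASE,
    )
    return pattern.subn(r"**\1**", text)

new_text, count = highlight("move", "I moved it; he moves fast. Do not remove it or moveed.")
print(new_text, count)
```

Only "moved" and "moves" are highlighted here; "remove" fails the leading \b and "moveed" fails the trailing one.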
If what you want is to match all complete words which begin with your search term, i.e. 'move' matches 'move', 'moved', 'movers', etc, then you want to use a character class to detect the end of the word.
So, instead of:
if ($sentence =~ /($search_key)d$/i)
Try using:
if ($sentence =~ /($search_key\w*)\W$/i)
The \w* will match any number of standard word characters and the \W should prevent you from including other characters, such as whitespace or punctuation.

regex string does not contain substring

I am trying to match a string that does not contain a given substring.
My string always starts with "http://www.domain.com/".
The substring I want to exclude from matches is ".a/", which comes directly after that prefix (a folder name on the domain).
There will be more characters in the string after the substring I want to exclude.
For example:
"http://www.domain.com/.a/test.jpg" should not be matched
But "http://www.domain.com/test.jpg" should be
Use a negative lookahead assertion as:
^http://www\.domain\.com/(?!\.a/).*$
The part (?!\.a/) fails the match if the domain prefix is immediately followed by the string .a/.
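A quick way to check the lookahead pattern is a short Python sketch (is_allowed and ALLOWED are illustrative names; the domain is the placeholder from the question):

```python
import re

# The prefix must be followed by anything except the literal ".a/".
ALLOWED = re.compile(r"^http://www\.domain\.com/(?!\.a/).*$")

def is_allowed(url):
    """True if url starts with the domain prefix and is not under the .a/ folder."""
    return ALLOWED.match(url) is not None

print(is_allowed("http://www.domain.com/test.jpg"))
print(is_allowed("http://www.domain.com/.a/test.jpg"))
```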
My advice in such cases is not to construct overly complicated regexes with negative lookahead assertions and the like.
Keep it simple and stupid!
Do two matches: one for the positives, then filter out the negatives afterwards (or the other way around). Most of the time the regexes become easier, if not trivial.
And your program gets clearer.
For example, to extract all lines with foo, but not foobar, I use:
grep foo | grep -v foobar
I would try with
^http:\/\/www\.domain\.com\/([^.]|\.[^a]).*$
You want to match your domain, followed either by a character that is not a . or by a . that is not followed by an a. (You can add the / afterwards if needed.)
If you don't want to use lookahead, you can simply check that the string matches your domain but does not match the domain followed by .a/:
<?php
function foo($s) {
$regexDomain = '{^http://www\.domain\.com/}';
$regexDomainBadPath = '{^http://www\.domain\.com/\.a/}';
return preg_match($regexDomain, $s) && !preg_match($regexDomainBadPath, $s);
}
var_dump(foo('http://www.domain.com/'));
var_dump(foo('http://www.otherdomain.com/'));
var_dump(foo('http://www.domain.com/hello'));
var_dump(foo('http://www.domain.com/hello.html'));
var_dump(foo('http://www.domain.com/.a'));
var_dump(foo('http://www.domain.com/.a/hello'));
var_dump(foo('http://www.domain.com/.b/hello'));
var_dump(foo('http://www.domain.com/da/hello'));
?>
Note that http://www.domain.com/.a will pass the test, because the .a is not followed by a /.