Lighttpd URL rewrite with negative matching

Lighttpd URL rewrite with negative matching - regex

I'm trying to rewrite urls that have an image in them but not 'php', for example:
I want to match this:
http://domain.com/trees.jpg
and rewrite it to this:
http://domain.com/viewimg.php?image=trees.jpg
But not match this:
http://domain.com/index.php?image=trees.jpg
because it has 'php' in it, so it shouldn't do any rewriting at all to that url.
I'm no good with regex but I've been trying a number of variations and none have worked.
Here's an example of one variation I've tried:
url.rewrite-once = (
"(?!php\b)\b\w+\.(jpg|jpeg|png|gif)$" => "/viewimg.php?image=$1.$2"
)
Any help would be greatly appreciated, thanks.

This should get you started. It will fail matches where there is the string \.php anywhere behind \w+\.(jpg|jpeg|png|gif)
Pattern: (?<!.*\.php.*)\b(\w+\.(jpg|jpeg|png|gif))$
Replace: viewing.php?image=$1
Tested:
http://domain.com/trees.jpg ---> http://domain.com/viewing.php?image=trees.jpg
http://domain.com/index.php/trees.jpg ---> http://domain.com/index.php/trees.jpg
http://domain.com/index.php/image=trees.jpg ---> http://domain.com/index.php/image=trees.jpg
Edit:
Above uses non-constant-length lookbehind, which is apparently not supported on many platforms. Without this I don't know if you can codify "match XYZ when ABC is nowhere in the string behind it" in pure regex. Maybe someone with more regex-fu than me knows a way.
As a somewhat less general solution, this will check only if the fixed string .php?image= does not come right before the image name:
Pattern: (?<!\.php\?image=)\b(\w+\.(jpg|jpeg|png|gif))$
Replace: viewing.php?image=$1

Related

Multiple slash in URL replacement though regex

I am trying to create a regex in pcre, that is going to salinize URL with multiple slashes like the following:
https://www.domin.com/test1/////test2/somemoretests_67142 https://www.domin.com/test1/test2/somemoretests_67142///// https://www.domin.com/test1/test2///somemoretests_67142
So that I can replace it with the following: https://\2\4 and the link at the end of it looks: https://www.domin.com/test1/test2/somemoretests_67142
I have been struggling with it for the past couple of days, so any regex guru help is more than welcome :)
I have tried the following and more:
(http|https):\/\/(.*)(\/\/+)(.*)
(http|https):\/\/(.*)(\/\/){2,}(.*)
(http|https):\/\/(.*)(\/\/{2})(.*)
I am going to utilize these for Akamai to sanitize our URLs though cloudlet.

You can try:
(?<!https:\/)(?<!http:\/)(\/+$|(?<=\/)\/+)
And substitute the first group with empty string.
Regex demo.
This will produce this output:
https://www.domin.com/test1/test2/somemoretests_67142
https://www.domin.com/test1/test2/somemoretests_67142
https://www.domin.com/test1/test2/somemoretests_67142

How to fix regex url pattern

I need to fix my url pattern:
/^((http(s)?(\:\/\/)){1}(www\.)?([\w\-\.\/])*(\.[a-zA-Z]{2,4}\/?)[^\\\/#?])[^\s\b\n|]*[^\.,;:\?\!\#\^\$ -]/
I thought this regex was ok, but it is not working for urls like: https://xx.xx (without www). 'www' should be optional ((www.)?). Where is the bug?

The problem is not in the (www\.)? part but that parts after that.
Take a look at the [^\\\/#?] and the [^\.,;:\?\!\#\^\$ -] parts.
So a valid URL would be https://xx.xx plus none of \/#? plus none of .,;:?!#^$_- making the url valid if you add those, for example https://xx.xx11.
I do advice you to not try to create your own regex because you are missing a lot!
For example, tlds like .amsterdam are valid. And why are you capturing so many groups?
Your regex as an image made with https://www.debuggex.com/:

Selecting URLs using RegExp but ignoring them when surrounded by double quotes

I've searched around quite a bit now, but I can't get any suggestions to work in my situation. I've seen success with negative lookahead or lookaround, but I really don't understand it.
I wish to use RegExp to find URLs in blocks of text but ignore them when quoted. While not perfect yet I have the following to find URLs:
(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?
I want it to match the following:
www.test.com:50/stuff
http://player.vimeo.com/video/63317960
odd.name.amazone.com/pizza
But not match:
"www.test.com:50/stuff
http://plAyerz.vimeo.com/video/63317960"
"odd.name.amazone.com/pizza"
Edit:
To clarify, I could be passing a full paragraph of text through the expression. Sample paragraph of what I'd like below:
I would like the following link to be found www.example.com. However this link should be ignored "www.example.com". It would be nice, but not required, to have "www.example.com and www.example.com" ignored as well.
A sample of a different one I have working below. language is php:
$articleEntry = "Hey guys! Check out this cool video on Vimeo: player.vimeo.com/video/63317960";
$pattern = array('/\n+/', '/(https?\:\/\/)?(player\.vimeo\.com\/video\/[0-9]+)/');
$replace = array('<br/><br/>',
'<iframe src="http://$2?color=40cc20" width="500" height="281" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe>');
$articleEntry = preg_replace($pattern,$replace,$articleEntry);
The result of the above will replace any new lines "\n" with a double break "" and will embed the Vimeo video by replacing the Vimeo address with an iframe and link.

I've found a solution!
(?=(([^"]+"){2})*[^"]*$)((https?:\/\/)?(\w+\.)+\w{2,}(:[0-9]+)?((\/\w+)+(\.\w+)?)?\/?)
The first part from (? to *$) what makes it work for me. I found this as an answer in java Regex - split but ignore text inside quotes? by https://stackoverflow.com/users/548225/anubhava
While I had read that question before, I had overlooked his answer because it wasn't the one that "solved" the question. I just changed the single quote to double quote and it works out for me.

add ^ and $ to your regex
^(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?$
please notice you might need to escape the slashes after http (meaning https?\:\/\/)
update
if you want it to be case sensitive, you shouldn't use \w but [a-z]. the \w contains all letters and numbers, so you should be careful while using it.

Ruby Puppet Regex string matching

I'm somewhat new to ruby and have done a ton of google searching but just can't seem to figure out how to match this particular pattern. I have used rubular.com and can't seem to find a simple way to match. Here is what I'm trying to do:
I have several types of hosts, they take this form:
Sample hostgroups
host-brd0000.localdomain
host-cat0000.localdomain
host-dog0000.localdomain
host-bug0000.localdomain
Next I have a case statement, I want to keep out the bugs (who doesn't right?). I want to do something like this to match the series of characters. However, it starts matching at host-b, host-c, host-d, and matches only a single character as if I did a [brdcatdog].
case $hostgroups { #variable takes the host string up to where the numbers begin
# animals to keep
/host-[["brd"],["cat"],["dog"]]/: {
file {"/usr/bin/petstore-friends.sh":
owner => petstore,
group => petstore,
mode => 755,
source => "puppet:///modules/petstore-friends.sh.$hostgroups",
}
}
I could do something like [bcd][rao][dtg] but it's not very clean looking and will match nonsense like "bad""cot""dat""crt" which I don't want.
Is there a slick way to use \A and [] that I'm missing?
Thanks for your help.
-wootini

How about using negative lookahead?
host-(?!bug).*
Here is the RUBULAR permalink matching everything except those pesky bugs!

Is this what you're looking for?
host-(brd|cat|dog)
(Following gtgaxiola's example, here's the Rubular permalink)

Regex to match URL not followed by " or <

I'm trying to modify the url-matching regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls to not match anything that's already part of a valid URL tag or used as the link text.
For example, in the following string, I want to match http://www.foo.com, but NOT http://www.bar.com or http://www.baz.com
www.foo.com http://www.baz.com
I was trying to add a negative lookahead to exclude matches followed by " or <, but for some reason, it's only applying to the "m" in .com. So, this regex still returns http://www.bar.co and http://www.baz.co as matches.
I can't see what I'm doing wrong... any ideas?
\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))(?!["<])
Here is a simpler example too:
((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])

I looked into this issue last year and developed a solution that you may want to look at - See: URL Linkification (HTTP/FTP) This link is a test page for the Javascript solution with many examples of difficult-to-linkify URLs.
My regex solution, written for both PHP and Javascript - is not simple (but neither is the problem as it turns out.) For more information I would recommend also reading:
The Problem With URLs by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber
The comments following Jeff's blog post are a must read if you want to do this right...
Note also that John Gruber's regex has a component that can go into realm of catastrophic backtracking (the part which matches one level of matching parentheses).

Yeah, its actually trivial to make it work if you just want to exclude trailing characters, just make your expression 'independent', then no backtracking will occurr in that segment.
(?>\b ...)(?!["<])
A perl test:
use strict;
use warnings;
my $str = 'www.foo.com http://www.baz.comhttp://www.some.com';
while ($str =~ m~
(?>
\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
)
(?!["<])
~xg)
{
print "$1\n";
}
Output:
www.foo.com
http://www.some.com

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Lighttpd URL rewrite with negative matching - regex

Related

Multiple slash in URL replacement though regex

How to fix regex url pattern

Selecting URLs using RegExp but ignoring them when surrounded by double quotes

Ruby Puppet Regex string matching

Regex to match URL not followed by " or <

Categories

Resources