regex: how to eliminate URLs ending with .dtd

This is JavaScript regex.
regex = /(http:\/\/[^\s]*)/g;
text = "I have http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd and I like http://google.com a lot";
matches = text.match(regex);
console.log(matches);
I get both URLs in the result. However, I want to eliminate all the URLs ending with .dtd. How do I do that?
Note that only URLs ending with .dtd should be removed, so a URL like http://a.dtd.google.com should pass.

The nicest way to do it is to use a negative lookbehind (in languages that support them):
/(?>http:\/\/[^\s]*)(?<!\.dtd)/g
The ?> in the first bracket makes it an atomic grouping which stops the regex engine backtracking - so it'll match the full URL as it does now, and if/when the next part fails it won't try going back and matching less.
The (?<!\.dtd) is a negative lookbehind, which only matches if \.dtd doesn't match ending at that position (i.e., the URL doesn't end in .dtd).
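To see why the atomic group matters, here is a minimal sketch in JavaScript. It assumes an engine with lookbehind support (ES2018 or later). JavaScript has no (?>...) syntax, so the sketch emulates the atomic group with a capturing lookahead followed by a backreference; that emulation and the variable names are illustrative choices, not part of the pattern above.

```javascript
const text = "I have http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd and I like http://google.com a lot";

// Lookbehind without an atomic group: the engine backtracks one character,
// so the .dtd URL is matched in truncated form instead of being skipped.
const naive = /http:\/\/\S*(?<!\.dtd)/g;
const naiveMatches = text.match(naive);

// Emulated atomic group: the lookahead captures the whole URL, \1 consumes it,
// and the engine cannot backtrack into the lookahead to shorten the capture.
const atomicLike = /(?=(http:\/\/\S*))\1(?<!\.dtd)/g;
const atomicMatches = text.match(atomicLike);

console.log(naiveMatches);
console.log(atomicMatches);
```

The first pattern reports the .dtd URL cut short at "...3.0.dt" alongside http://google.com, while the second reports only http://google.com.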
For languages that don't support lookbehind (such as JavaScript engines predating ES2018), you can use a negative lookahead instead, which is uglier and generally less efficient:
/(http:\/\/(?![^\s]*\.dtd(?:\s|$))[^\s]*)/g
This matches http://, then scans ahead to make sure the URL doesn't end in .dtd (that is, .dtd followed by whitespace or the end of the input, so http://a.dtd.google.com still passes), then backtracks and scans forward again to get the actual match.
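As a rough check of the lookahead approach, the sketch below runs it against the question's sample text with one extra URL appended. It assumes that whitespace or the end of the input marks the end of a URL, and so ends the check with (?:\s|$) rather than \b; that way a URL such as http://a.dtd.google.com is not rejected just because .dtd appears in the middle of it.

```javascript
// Assumption: a URL ends at the first whitespace character or at end of input.
const urlNotDtd = /http:\/\/(?!\S*\.dtd(?:\s|$))\S*/g;

const text = "I have http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd " +
  "and I like http://google.com a lot, and http://a.dtd.google.com too";

// The lookahead rejects any URL whose last four characters are ".dtd".
const matches = text.match(urlNotDtd);
console.log(matches);
```

With this input the result contains http://google.com and http://a.dtd.google.com, and the hibernate .dtd URL is skipped.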
As always, http://www.regular-expressions.info/ is a good reference for more information.

Related

URL regex that skips ending periods

I'm trying to create a regex that matches url strings within normal text. I have this:
http[s]?://[^\s]+
This seems to work well with the exception that if the url is at the end of a sentence it will grab the period as well. For example for this string:
I am typing some text with the url http://something.com/some-thing?args=someargs. This is another sentence.
it matches:
http://something.com/some-thing?args=someargs.
I would like it to match:
http://something.com/some-thing?args=someargs
Obviously I can't exclude periods because they are in the url previously but I can't figure out how to tell it to exclude the last period if there is one. I could potentially use a negative lookahead for end of line or whitespace, but if it's in the middle of the line (without a period after it) that would leave off the last character of the url.
Most of the ones I have seen online have the same issue that they match the ending dot so maybe it's not possible? I know basic regex but certainly not a genius with it so if someone has a solution I would be very grateful :).
Also, I can do some post-process in this case to remove the dot if I need to, just seems like there should be a Regex solution...
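Try this one:
https?://[^\s]+[^\s.]
The final [^\s.] forces the match to end on a character that is neither whitespace nor a period, so a sentence-ending dot is left out.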
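Here is a small JavaScript sketch of this idea. It uses [^\s.] for the final character so that a tab or newline after the URL cannot be picked up as the last character. The second test string and its example.org URL are made up for illustration.

```javascript
// The last character of a match must be neither whitespace nor a period.
const urlPattern = /https?:\/\/\S+[^\s.]/g;

const sentence = "I am typing some text with the url " +
  "http://something.com/some-thing?args=someargs. This is another sentence.";
const midLine = "See https://example.org/page for details.";

const atSentenceEnd = sentence.match(urlPattern);
const inMiddle = midLine.match(urlPattern);
console.log(atSentenceEnd);
console.log(inMiddle);
```

The trailing period is dropped in the first case, and the URL in the middle of a line keeps its last character. Note that a URL needs at least two characters after the scheme for this pattern to match.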

Regex for capturing all urls with a "/" at the end apart from "/cms/"

I want to capture all urls with a "/" at the end apart from "/cms/"
I currently have this, but it's not right. I'm really bad at regex:
(.*[^\/cms\/])\/$
https://regex101.com/r/Bxa6Ma/1
If I do this:
(.*[^cms\/])\/$
It works except when a url has /blahcms/, at which point it should once again be captured; that's why I'm trying to also include a "/" at the beginning.
Example url I would like to catch:
example/hitherecms/
example/bingbangboomcms/
Example url I do not want to catch:
example/cms/
example/cms
example/bingbangboom
This regex will be used inside a Web.config rewrite rule.
Your approach is buggy, as it doesn't match the string if any of s, m, c, or a slash precedes the ending slash. A character class excludes single characters, not a sequence, so it cannot stand in for a negative lookahead.
One possible approach to solve this in a language that doesn't support negative lookbehind (JavaScript has been a prominent example):
^(?:(?!\/cms\/$).)*\/$
... or (seems to be far more performant):
(?:(?!\/cms).{4}|^.{0,3})\/$
Demo.
It's trivial to do with negative lookbehind, though:
^.*(?<!\/cms)\/$
Demo. Note the change of regex flavor. You can skip the ^.* part if you only need to test, not match.
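The three patterns can be compared side by side with a short JavaScript sketch against the sample URLs from the question. The property names are illustrative, and the lookbehind variant assumes an engine that supports lookbehind (ES2018 or later).

```javascript
const patterns = {
  temperedDot: /^(?:(?!\/cms\/$).)*\/$/,
  fixedWidth: /(?:(?!\/cms).{4}|^.{0,3})\/$/,
  lookbehind: /^.*(?<!\/cms)\/$/, // requires lookbehind support
};

const samples = [
  "example/hitherecms/",
  "example/bingbangboomcms/",
  "example/cms/",
  "example/cms",
  "example/bingbangboom",
];

// Collect, for each pattern, the sample URLs that it accepts.
const results = {};
for (const [name, re] of Object.entries(patterns)) {
  results[name] = samples.filter((s) => re.test(s));
  console.log(name, results[name]);
}
```

All three accept only example/hitherecms/ and example/bingbangboomcms/ from this list.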

URL rewrite using PCRE expression - append prefix to all incoming URIs except one pattern

I am using the match expression https://([^/]*)/(.*) and the replace expression constantprefix/$2 to rewrite incoming URLs by adding '/constantprefix' to all of them.
For the URLs below it works as expected:
https://hostname/incomingURI is converting to
/constantprefix/incomingURI
https://hostname/ is converting to /constantprefix/
https://hostname/login/index.aspx is converting to
/constantprefix/login/index.aspx
I have a problem with URLs that already start with /constantprefix: the output URL contains /constantprefix/constantprefix, which I don't want. Is there any way to avoid that?
If the incoming URL is https://hostname/constantprefix/login/index.aspx, the output URL becomes https://hostname/constantprefix/constantprefix/login/index.aspx
How can I avoid /constantprefix/constantprefix using the match expression?
You can do it with:
https://[^/]*/(?!constantprefix(?:/|$))(.*)
using the replacement string:
constantprefix/$1
(?!...) is a negative lookahead and means "not followed by". It's only a test and doesn't consume characters (pattern elements of this kind are also called "zero-width assertions", as are lookbehinds and the anchors ^ and $).
The first capture group in your pattern was useless, I removed it.
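To try the pattern outside the rewrite engine, here is a minimal JavaScript sketch using String.prototype.replace. The function name rewrite is hypothetical. When the pattern does not match, the sketch simply returns the input unchanged, on the assumption that the real rewrite rule would not fire in that case.

```javascript
// Same pattern and replacement string as above, written as a JavaScript literal.
const rule = /https:\/\/[^/]*\/(?!constantprefix(?:\/|$))(.*)/;

function rewrite(url) {
  // If the lookahead rejects the path, nothing matches and url is returned as is.
  return url.replace(rule, "constantprefix/$1");
}

console.log(rewrite("https://hostname/incomingURI"));
console.log(rewrite("https://hostname/login/index.aspx"));
console.log(rewrite("https://hostname/constantprefix/login/index.aspx"));
console.log(rewrite("https://hostname/constantprefixes/x"));
```

The (?:\/|$) part of the lookahead is what lets a path such as /constantprefixes/x still receive the prefix, since only the exact segment constantprefix is excluded.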

Perl regex to match only if not followed by both patterns

I am trying to write a pattern match to only match when a string is not followed by both following patterns. Right now I have a pattern that I've tried to manipulate but I can't seem to get it to match correctly.
Current pattern:
/(address|alias|parents|members|notes|host|name)(?!(\t{5}|\S+))/
I am trying to match when a string is not spaced correctly but not if it is part of a larger word.
For example I want it to match,
host \t{4} something
but not,
hostgroup \t{5} something
In the above example it will match hostgroup and end up separating it into 2 separate words "host" and "group"
Match:
notes \t{4} something
but not,
notes_url \t{5} something
Using my pattern it ends up turning into:
notes \t{5} _url
Hopefully that makes a bit more sense.
I'm not at all clear what you want, but word boundaries will probably do what you ask.
Does this work for you?
/\b(address|alias|parents|members|notes|host|name)\b(?!\t{5})/
Update
Having understood your problem better, does this do what you want?
/\b(address|alias|parents|members|notes|host|name)\b(?!\t{5}(?!\t))/
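The updated pattern behaves the same way in JavaScript, so here is a quick sketch to check it. It assumes that the tabs follow the keyword directly, with no spaces in between; the helper tabs and the test lines are made up for illustration.

```javascript
const keyword = /\b(address|alias|parents|members|notes|host|name)\b(?!\t{5}(?!\t))/;

// Build a run of n tab characters.
const tabs = (n) => "\t".repeat(n);

// Each entry pairs a line with whether the pattern is expected to match it.
const cases = [
  ["host" + tabs(4) + "something", true],       // too few tabs: match
  ["host" + tabs(5) + "something", false],      // exactly five tabs: no match
  ["host" + tabs(6) + "something", true],       // too many tabs: match
  ["hostgroup" + tabs(5) + "something", false], // part of a longer word
  ["notes_url" + tabs(5) + "something", false], // "_" counts as a word character
];

const outcomes = cases.map(([line]) => keyword.test(line));
console.log(outcomes);
```

The inner (?!\t) is what makes six or more tabs count as wrong spacing, and the \b after the keyword is what stops host from matching inside hostgroup.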

regular expression to parse short urls

I've a list of possible urls on my site like
1 http://dev.site.com/People/
2 http://dev.site.com/People
3 http://dev.site.com/Groups/
4 http://dev.site.com/Groups
5 http://dev.site.com/
6 http://dev.site.com/[extraword]
I want to be able to match all the urls like 6 and redirect them to
http://dev.site.com/?Shorturl=extraword
but I don't want to redirect the first 5 urls
I tried something like
((.*)(?!People|Groups))\r
but something is wrong.
any help?
thanks
You should put the check that it isn't People or Groups at the start:
(?!People|Groups)(.*)
At the moment you're checking that the text matched by (.*) isn't followed by People or Groups.
Depending on which language/framework you're using, you might also need to use ^ and $ to make sure you're matching the whole string:
^(?!People|Groups)(.*)$
You should also think about whether you want to match urls that begin with People, eg. http://dev.site.com/People2/. So this might be better:
^(?!(?:People|Groups)(?:/|$))(.*)$
The lookahead now rejects the string only when People or Groups is immediately followed by a slash or the end of the url.
You might want to make sure you don't match an empty string, so use .+ instead of .*:
^(?!(?:People|Groups)(?:/|$))(.+)$
And if you want a word without any slashes:
^(?!(?:People|Groups)(?:/|$))([^/]+)$
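As a sketch of how the last pattern could drive the redirect, the JavaScript below assumes the input is the path with its leading slash already removed, as rewrite engines commonly present it. The function name redirectTarget is hypothetical, and the target format is taken from the question.

```javascript
const shortUrl = /^(?!(?:People|Groups)(?:\/|$))([^/]+)$/;

// Returns the redirect target for a short url, or null if no redirect applies.
function redirectTarget(path) {
  const m = shortUrl.exec(path);
  return m ? "http://dev.site.com/?Shorturl=" + m[1] : null;
}

for (const path of ["People/", "People", "Groups/", "Groups", "", "extraword", "People2"]) {
  console.log(JSON.stringify(path), redirectTarget(path));
}
```

Only extraword and People2 produce a redirect here; the first five paths return null.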
In your regex, the (.*) subpattern consumes the entire string, which then causes the negative lookahead to succeed.
You need a negative lookahead to exclude People|Groups, and then you need to capture the extra word (and the word needs to have some stuff in it, otherwise we want the match to fail). The crucial piece here is that the negative lookahead does not consume any of the string, so you are able to capture the extra word for subsequent use in the redirect URL you are trying to build.
Here's a solution in Perl, but the approach should work for you in C#:
use warnings;
use strict;
while (<DATA>) {
    print "URL=$1 EXTRA_WORD=$2\n"
        if /^(.*)\/(?!People|Groups)(\w+)\/?$/;
}
__DATA__
http://dev.site.com/People/
http://dev.site.com/People
http://dev.site.com/Groups/
http://dev.site.com/Groups
http://dev.site.com/
http://dev.site.com/extraword1
http://dev.site.com/extraword2/
Output:
URL=http://dev.site.com EXTRA_WORD=extraword1
URL=http://dev.site.com EXTRA_WORD=extraword2