How to exclude the last part of a variable string using regex - regex

I am currently making a bunch of landing pages that use similar URL structure, but each URL varies in number of words.
So it's something like:
http://landingpage.xyz/page-number-five
http://landingpage.xyz/page-number-fifty-four
http://landingpage.xyz/page-for-a-different-topic
and for the sent page I just postfix -sent like this. The reason I am not adding it as /sent is because the platform I am using handles URLs this way.
http://landingpage.xyz/page-number-five-sent
http://landingpage.xyz/page-number-fifty-four-sent
http://landingpage.xyz/page-for-a-different-topic-sent
Now I found it easy to make a regular expression that identifies all the sent pages which is let's say:
\/([a-z0-9\-]*)-sent
The thing is that I am not sure how to identify the ones that are not sent. I tried using a similar regular expression using something like this, but it's not working as expected:
\/([a-z0-9\-]*)(?!-sent)
What's the best way to design the regex for this? Or am I approaching it the wrong way?

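A lookahead only inspects the characters that follow its position, so it needs something left to look at. Placed at the end of the regex, after the greedy group has already consumed -sent, it has nothing left to reject and always succeeds. Since I'm not sure whether your environment supports lookbehinds, this should work around it by putting the lookahead first: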
\/(?!.*-sent\b)([a-z0-9\-]*)
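As a quick check of the pattern above, here is a minimal Python sketch. Python is only my choice for illustration, since the question does not name a regex engine, and the helper name is_unsent is hypothetical. It runs the pattern against sample URLs from the question and treats any match as "not a sent page".

```python
import re

# Pattern from the answer: the lookahead sits right after the slash, so it can
# still see "-sent" further along in the string and reject the match.
NOT_SENT = re.compile(r"/(?!.*-sent\b)([a-z0-9\-]*)")

def is_unsent(url):
    """Return True when the pattern finds a slash with no '-sent' after it."""
    return NOT_SENT.search(url) is not None

urls = [
    "http://landingpage.xyz/page-number-five",
    "http://landingpage.xyz/page-number-five-sent",
]
for url in urls:
    print(url, is_unsent(url))
```

When the pattern is applied to a full URL, the match can start at the first slash of http://, which leaves the capture group empty. The sketch therefore uses the match only as a yes/no test; apply the pattern to the path alone if you need the captured slug.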

Related

Match a string with a fixed substring in variable positions

I want to create a filter in my email server that matches any message that contains any URL (using either http or https protocols) from a certain domain (let's say domain.org). I want it to match things like:
https://site1.domain.org
https://anothersite.domain.org
http://yetanotherone.domain.org
The problem here is that these strings can be wrapped in the message body at any random position of the string. And even worse, when the string is wrapped an equal sign is added before the end of the line, so I would need it to be able to match strings like these:
ht=
tps://thisisanexample.domain.org
https://thisisane=
xample.domain.org
https://thisisanexample.do=
main.org
I came up with a simple (but huge) solution, but I think there must be a much more elegant one than mine:
/h[=[:cntrl:]]*t[=[:cntrl:]]*t[=[:cntrl:]]*p[=[:cntrl:]]*s?[=[:cntrl:]]*:[=[:cntrl:]]*\/[=[:cntrl:]]*\/[=[:cntrl:]]*[-+_#&%$#|()=?¿:;,.,çÇ^[:cntrl:][:alnum:]\[\]\{\}\*\\]*[=[:cntrl:]]*.[=[:cntrl:]]*d[=[:cntrl:]]*o[=[:cntrl:]]*m[=[:cntrl:]]*a[=[:cntrl:]]*[=[:cntrl:]]*i[=[:cntrl:]]*n[=[:cntrl:]]*.[=[:cntrl:]]*o[=[:cntrl:]]*r[=[:cntrl:]]*g/
I have been looking around but I can not find anything that I understand to improve my solution given that my knowledge of regex does not go beyond simple queries.
Thank you very much in advance.
Regards.
2018/04/11 EDIT: Thank you to everyone who tried but the solutions proposed do not meet the requirements of elegance and readability I was expecting. I was looking for something like capturing everything but the equal-return string and performing the web address string search on the captured result of the first search. Is this a doable idea?
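The two-step idea from the edit can be sketched in Python, assuming the filter is able to preprocess the message body before matching. Many mail-server filter languages only accept a single pattern, so treat this as an illustration of the idea rather than something the server can run. The names SOFT_BREAK, DOMAIN_URL and find_domain_urls are made up for the example.

```python
import re

# Step 1: an equals sign directly before a line break marks a wrapped line.
SOFT_BREAK = re.compile(r"=\r?\n")
# Step 2: a plain, readable pattern for any http/https URL under domain.org.
DOMAIN_URL = re.compile(r"https?://[\w.-]+\.domain\.org\b")

def find_domain_urls(body):
    """Remove the '=' + newline wraps, then search the unwrapped text."""
    unwrapped = SOFT_BREAK.sub("", body)
    return DOMAIN_URL.findall(unwrapped)

body = "see ht=\ntps://thisisanexample.domain.org and https://thisisane=\nxample.domain.org"
print(find_domain_urls(body))
```

Splitting the work this way keeps the URL pattern short because it no longer has to allow a wrap between every pair of characters.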

How to use regex for URL-targeting

As a disclaimer, I must say that my experience with regular expressions is very limited. I am using Optimizely for A/B testing and have run into a problem. I only want my experiment to run on one page, however, this page's URL-structure is somewhat complicated. The URL-structure of the page where I want to run my experiment looks like this:
https://mywebsite.co/term/public_id/edit/pricing
The problem is the public_id that changes dynamically, whenever a new user goes through the signup flow. How can I use regex to target this page exclusively? I have been trying to figure it out these past days but without any luck. Optimizely regex docs can be found here. I can't just use a simple match because /term/ appears in the URL of several pages on my site.
You could use this regular expression:
mywebsite\.co/term/.*?/edit/pricing
The .* part means any character can occur here any number of times. The additional ? makes it lazy, meaning the rest of the regular expression will kick in as soon as possible.
Note that a literal . needs to be escaped with a backslash, like \.
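To see the lazy quantifier at work, here is a small Python sketch. It uses the /term/ path from the question, and the public_id values (abc123, xyz789) and the helper name is_pricing_page are invented for illustration. Optimizely's own matching may differ in details, so this only checks the pattern's logic.

```python
import re

# "term" is the path segment from the question; ".*?" stands in for the
# dynamic public_id and stops as soon as "/edit/pricing" can match.
PRICING_PAGE = re.compile(r"mywebsite\.co/term/.*?/edit/pricing")

def is_pricing_page(url):
    """Return True when the URL contains the pricing-page pattern."""
    return PRICING_PAGE.search(url) is not None

print(is_pricing_page("https://mywebsite.co/term/abc123/edit/pricing"))
print(is_pricing_page("https://mywebsite.co/term/abc123/edit/profile"))
```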

Regular expression to exclude local addresses

I'm trying to configure my Foxy Proxy program and one of the features is to provide a regular expression for an exclusion list.
I'm trying to blacklist the local sites (ending in .local), but it doesn't seem to work.
This is what I attempted:
^(?:https?://)?\d+\.(?!local)+/.*$
^(?:https?://)?\d+\.(?!local)(\d)+/.*$
I also researched on Google and Stack Exchange with no success.
Since you indicate in the comments that you actually need a whitelist solution, I went with that:
Try: ^(?:https?://)?[\w.-]+\.(?!local)\w+/.*$
http://regex101.com/r/xV4gS0
Your regular expressions match host names which start with a series of digits followed by a period and then not followed by the string "local". If this is a "blacklist", then that hardly seems like what you want.
If you're trying to match all hostnames which end in .local, you'd want something like the following for the hostname portion:
[^/]*\.local(?:/|$)
with appropriate escapes inserted depending on regex context.
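One way to place that hostname fragment into a full pattern, sketched in Python: the anchor and the optional scheme prefix are borrowed from the question's own attempts, and the host names and the helper name is_local are hypothetical. FoxyProxy's regex dialect may need different escaping.

```python
import re

# Anchoring at the start and consuming the optional scheme keeps the
# "[^/]*\.local" part confined to the hostname rather than the path.
LOCAL_HOST = re.compile(r"^(?:https?://)?[^/]*\.local(?:/|$)")

def is_local(url):
    """Return True when the hostname portion of the URL ends in .local."""
    return LOCAL_HOST.search(url) is not None

print(is_local("http://myhost.local/page"))
print(is_local("http://www.example.com/file.local"))
```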
If your original question was incorrect and you really need a whitelist, then you'd want something like:
^(?:(?!\.local)[^\/])*(?:\/|$)
as illustrated in http://regex101.com/r/yB0uY4
Thank you everyone for the help. Indeed, it turns out that for this program, listing "not .local" as a blacklist is not the same as listing "all .local" as a whitelist.
I also made a rookie mistake in my pattern. I meant "\w" instead of "\d". Thank you Peter Alfvin for catching that.
So my final working solution is what Bart suggested:
^(?:https?://)?[\w.-]+\.(?!local)\w+/.*$ as a whitelist.
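For anyone who wants to test the final pattern outside FoxyProxy, a minimal Python sketch follows. The host names and the helper name is_whitelisted are made up, and FoxyProxy's engine may behave differently from Python's re module.

```python
import re

# Final whitelist pattern from above: the negative lookahead rejects any
# host whose last label starts with "local".
WHITELIST = re.compile(r"^(?:https?://)?[\w.-]+\.(?!local)\w+/.*$")

def is_whitelisted(url):
    """Return True when the URL matches the whitelist pattern."""
    return WHITELIST.match(url) is not None

for url in ("http://www.example.com/", "http://myhost.local/"):
    print(url, is_whitelisted(url))
```

Note that the pattern requires a slash after the host name, so a bare http://www.example.com without a trailing slash is not matched.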

Regex to find bad URLs in a database field

We had an issue with the text editor on our website that was doubling up the URL. So for example, the text field may contain:
This is a description for a media item, and here in a link.
So pretty much I need a regex to detect any string that begins with http and has another http before a closing quote, as in "http://www.example.com/apage.htmlhttp://www.example.com/apage.html"
"http[^"]+http
http://www.example.com/apage.htmlhttp://www.example.com/apage.html
This is actually a valid URL! So you'd want to be a bit careful not to munge any other URLs that happen to have ‘http://’ in the middle of them. To detect only a ‘doubled’ URL you could use backreferences:
"(https?://[^"]*)\1"
(This is a non-standard regex feature, but most modern implementations have it.)
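Here is a short Python sketch of the backreference approach. Python's re module supports \1; your database or language may use different syntax. The helper name undouble and the sample anchor tag are assumptions for illustration.

```python
import re

# The group captures a URL; \1 then demands the exact same text again
# before the closing quote, so only true doubles match.
DOUBLED = re.compile(r'"(https?://[^"]*)\1"')

def undouble(text):
    """Replace each doubled, quoted URL with a single copy."""
    return DOUBLED.sub(r'"\1"', text)

html = '<a href="http://www.example.com/apage.htmlhttp://www.example.com/apage.html">here</a>'
print(undouble(html))
```

A URL that merely contains a second http:// in its query string is left alone, because its second half is not an exact repeat of its first half.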
Using regex to process HTML is a bad idea. HTML cannot reliably be parsed by regex.
If you can use the .*? syntax, you can just look for the following:
http(.*?)http
and if it's present, reject the URL.
The string that begins with http and has another http before a quote is:
^http[^"]*http
But, although this answers exactly your question I suspect you may want Uh Clem's answer instead ;-)
You will probably want something like this:
("http[^"]+)(http)
Then compare the two and if \1 === " + \2 then replace them.
One thought: do you have any query strings in any of your URLs? If you do, are any of them like this "http://someurl.com?http=somemoredatahttp://someurl.com?http=somemoredata"?
If so, you will want something far more complicated.

Writing Regular Expression for URL in Google Analytics

I have a huge list of URLs, in the format:
http://www.example.com/dest/uk/bath/
http://www.example.com/dest/aus/sydney/
http://www.example.com/dest/aus/
http://www.example.com/dest/uk/
http://www.example.com/dest/nor/
What regex could I use to get the last three URLs but miss the first two, so that every URL without a city attached is matched and the ones with cities are rejected?
Note: I am using Google Analytics, so I need to use regexes to monitor my URLs with its advanced feature. As of right now Google is rejecting each regular expression.
Generally, the best suggestion I can make for parsing URLs with a regex is: don't.
Your time is much, much better spent finding a library for your language that is dedicated to processing URLs.
It will have worked out all the edge cases, be fully RFC compliant, be bug free, secure, and have a great user interface so you can just suck out the bits you really want.
In your case, the suggested way to process it would be to use your URL library to extract the elements and then work explicitly on them.
That way, at most you'll have to deal with the path on its own, and not have to worry so much whether it's
http://site.com/
https://site.com/
http://site.com:80/
http://www.site.com/
Unless you really want to.
For the "Path" you might even wish to use a splitter ( or a dedicated path parser ) to tokenise the path into elements first just to be sure.
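As a rough sketch of that library-based approach, here is how it might look with Python's standard urllib.parse. Google Analytics itself only accepts a regex, so this applies when you process the URLs in your own code; the helper name is_country_page is hypothetical.

```python
from urllib.parse import urlsplit

def is_country_page(url):
    """Return True for /dest/<country>/ paths that have no city segment."""
    # urlsplit isolates the path, so scheme, host and port no longer matter.
    path = urlsplit(url).path
    # Tokenise the path and drop the empty pieces around the slashes.
    parts = [p for p in path.split("/") if p]
    return len(parts) == 2 and parts[0] == "dest"

print(is_country_page("http://www.example.com/dest/uk/"))
print(is_country_page("http://www.example.com/dest/uk/bath/"))
```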
tj111's current solution doesn't work - it matches all your urls.
Here's one that works (and I checked with your values). It also matches, no matter if there is a trailing slash or not:
http:\/\/.*dest\/\w+/?$
/http:\/\/www\.site\.com\/dest\/\w+\/?$/i
matches if they're all on the same site with the "dest" there. You could also do this:
/\w+:\/\/[^/]+\/dest\/\w+\/?$/i
which will match any site with any protocol (http, ftp) and the /dest/country at the end, with an optional trailing /
Note, that this will only work with a subset of what the urls could legitimately be.
Try this regular expression:
^http://www\.example\.com/dest/[^/]+/$
This would only match the last three URLs.
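A quick way to confirm this against the five URLs from the question, sketched in Python. Google Analytics uses its own regex engine, so the check only verifies the pattern's logic, and the name COUNTRY_ONLY is made up for the example.

```python
import re

# "[^/]+" allows exactly one path segment after /dest/, so URLs with a
# city segment fail at the "$" anchor.
COUNTRY_ONLY = re.compile(r"^http://www\.example\.com/dest/[^/]+/$")

urls = [
    "http://www.example.com/dest/uk/bath/",
    "http://www.example.com/dest/aus/sydney/",
    "http://www.example.com/dest/aus/",
    "http://www.example.com/dest/uk/",
    "http://www.example.com/dest/nor/",
]
matches = [u for u in urls if COUNTRY_ONLY.search(u)]
print(matches)
```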