How can I use regular expression to match urls starting with https and ending with #? - regex

I'm very new to regex and having a hard time figuring this one out. I have an HTML document and I want to clear out a ton of URLs inside it. All of the URLs begin with https:// and end with a pound sign (#).
Any help would be greatly appreciated. I'm using Sublime Text as my editor, in case that matters.

A basic way to do it:
\bhttps://[^\s#]+#
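Free-spaced (with the x flag, where # starts a comment, so a literal # has to be escaped):
(?x)
\b          # word boundary
https://
[^\s\#]+    # one or more characters that are neither whitespace nor '#'
\#          # the closing '#'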
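As a rough sketch of how the basic pattern could be applied outside an editor, the Python snippet below removes every https://...# run from a string. The sample HTML and the helper name strip_hash_urls are made up for illustration.

```python
import re

# Same pattern as the basic answer: "https://", then one or more characters
# that are neither whitespace nor '#', then the closing '#'.
HASH_URL = re.compile(r'\bhttps://[^\s#]+#')

def strip_hash_urls(text):
    """Return text with every https://...# run removed (hypothetical helper)."""
    return HASH_URL.sub('', text)

# Made-up sample input.
sample = '<p>See https://example.com/page# and https://example.org/a/b?c=1# here.</p>'
print(strip_hash_urls(sample))
```

A URL with no trailing # is left alone, because the pattern requires the closing #. In Sublime Text, the same pattern in the Find/Replace panel with regex mode enabled and an empty replacement should behave similarly.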

If you truly want to clear everything from https:// through # and each URL sits alone on its line, then you can use:
^(https)+(.)*(#)+$
But you may want to be more specific about what you are filtering out. If this comes from a database query you should be OK, since you can assume the URL will be the whole content of the field(s) returned, and you will then be running the regex through a code loop of some kind.
BTW, you can hone your patterns using something like http://regexpal.com/

Related

Use Regex to match beginning and end part of URL in Google Analytics

I'm looking for a regex function to implement in a goal for Google Analytics.
Consider this URL: /dagje-uit/....variable part..../contact/vpv/bedankt
The regex should match when the beginning of the URL is /dagje-uit/ and the end is /contact/vpv/bedankt. Everything in the middle can be variable.
I've tried these without success:
(?=^/dagje-uit/.*)(?=.*/bedankt$).*
(?=^dagje-uit.*)(?=.*bedankt$).*
Thanks in advance!
Regards,
Pim
Forgive me if Google Analytics has some regex standards which I am overlooking, but is it possible that your regex is failing because it does not account for the start of the whole URL? Adding .* to either end of your regex may help.
It also looks like your regex is more complex than the conditions you have described require. Could a simpler match be:
.*/dagje-uit/.*/contact/vpv/bedankt.*
or
http(s)?://.*/dagje-uit/.*/contact/vpv/bedankt.*
if you want to be a little more confident that it is a valid URL.
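To check the simpler pattern before pasting it into a goal, something like the following Python sketch can help. It assumes the goal is tested against the request path only (no scheme or host), so the pattern is anchored with ^ and $; the sample paths are invented.

```python
import re

# Anchored version of the simpler pattern: fixed start, variable middle, fixed end.
GOAL = re.compile(r'^/dagje-uit/.*/contact/vpv/bedankt$')

# Invented sample paths.
paths = [
    '/dagje-uit/efteling/contact/vpv/bedankt',    # fixed start and end: matches
    '/dagje-uit/contact/vpv/bedankt',             # no middle segment: no match
    '/other/efteling/contact/vpv/bedankt',        # wrong start: no match
    '/dagje-uit/efteling/contact/vpv/bedankt/x',  # wrong end: no match
]
matches = [p for p in paths if GOAL.search(p)]
print(matches)
```

If the tool matches against the full URL instead, the leading ^ would have to go, as in the second suggestion above.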

Regex for simple urls

I am looking for regex for simple URLs as
http://www.google.com
http://www.yahoo.in
http://www.example.eu
http://www.example.net
etc.
No subdirectories allowed. For example, it must not validate http://www.google.com/ or http://www.yahoo.in/mail.
Does anyone know any regex to do this?
I'm still a noob, but try this:
^http:\/\/[a-zA-Z0-9_\-]+\.[a-zA-Z0-9_\-]+\.[a-zA-Z0-9_\-]+$
This one should do:
^(https?:\/\/)?[0-9a-zA-Z]+\.[-_0-9a-zA-Z]+\.[0-9a-zA-Z]+$
This should work for URLs starting with http:// or https:// or without the protocol name.
The regex should also be used as case-insensitive. In that case, it can be shortened a bit:
^(https?:\/\/)?[0-9a-z]+\.[-_0-9a-z]+\.[0-9a-z]+$
If you don't care whether it is a valid url, you can use:
\S*www\.\S+
All the examples contain www. followed by a non-space character, and that is unlikely to occur in a normal word.
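A quick way to try the case-insensitive pattern above is a small Python check. This is only a sketch: is_simple_url is a hypothetical helper, and re.IGNORECASE stands in for whatever case-insensitive switch your engine uses.

```python
import re

# The shortened pattern from the answer, compiled case-insensitively.
# Slashes need no escaping in Python because there is no / delimiter.
SIMPLE_URL = re.compile(
    r'^(https?://)?[0-9a-z]+\.[-_0-9a-z]+\.[0-9a-z]+$',
    re.IGNORECASE,
)

def is_simple_url(url):
    """True if url is host-only, with an optional http(s):// prefix."""
    return SIMPLE_URL.match(url) is not None

for candidate in ['http://www.google.com', 'http://www.google.com/',
                  'http://www.yahoo.in/mail', 'www.example.net']:
    print(candidate, is_simple_url(candidate))
```

Note that the pattern expects exactly three dot-separated labels, so a bare example.com or a deeper a.b.example.com would be rejected.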

Regex for excluding URL

I'm working with an email company that has a feature where they spider your site in order to provide custom content. I can have the spider ignore URLs based on the regex patterns I provide.
For this system a pattern starts and ends with a "/".
What I'm trying to do is ignore http://www.website.com/2011/10 BUT allow http://www.website.com/2011/10/title-of-page.html
I would have thought the pattern below would work, since it does not have a trailing slash, but no luck.
Any ideas?
/http:\/\/www\.website\.com\/[0-9][0-9][0-9][0-9]\/[0-9][0-9]/
Your regex matches a part of the URL, so you need to tell it not to allow a slash to follow it:
/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9](?!\/)/
If you want to also avoid other partial matches like in http://www.website.com/2011/100, then an additional word boundary might help:
/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9]\b(?!\/)/
It depends on the regexp engine, but you can probably use either $ (if the URL is tokenised beforehand) or a match for whitespace and delimiters.
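The effect of the word boundary plus negative lookahead can be checked with a short Python sketch. The surrounding / delimiters belong to the spider's pattern syntax and are dropped here; the URL list is made up, and whether the spider's engine supports lookahead is an assumption.

```python
import re

# Year/month archive URL, not followed by more digits (\b) or a slash ((?!/)).
ARCHIVE = re.compile(r'http://www\.website\.com/[0-9]{4}/[0-9][0-9]\b(?!/)')

# Made-up URLs covering each case.
urls = [
    'http://www.website.com/2011/10',                     # should be ignored
    'http://www.website.com/2011/10/title-of-page.html',  # should be allowed
    'http://www.website.com/2011/100',                    # partial match avoided by \b
    'http://www.website.com/2011/10/',                    # trailing slash: not matched
]
ignored = [u for u in urls if ARCHIVE.search(u)]
print(ignored)
```

As the last entry shows, the archive URL with a trailing slash is not caught by this pattern; if the spider also sees that form, the lookahead would need to be relaxed to something like (?!/.) instead.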

transforming URLS to active links with REGEX

I have this code in PHP that transforms URLs inside a text into active HTML links.
For example in a string
Hey check this cool link http://www.example.com
this transforms to:
Hey check this cool link <a href="http://www.example.com">http://www.example.com</a>
As you can see it just adds the correct < a > html tag
The code is this:
$active_links_text = ereg_replace("[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]","<a href=\"\\0\">\\0</a>", $original_text);
My question is, how to do this to work EXCEPT if the URL is a youtube url.
So i want this result: In a string
Wow have you checked http://www.youtube.com/watch?v=dQw4w9WgXcQ its even better than http://www.example.com !!!
i want to be transformed to
Wow have you checked http://www.youtube.com/watch?v=dQw4w9WgXcQ its even better than <a href="http://www.example.com">http://www.example.com</a>
As you can see, the < a > html tag was added to the example.com URL but NOT to the YouTube URL.
How can I make this happen?
I hope I described my problem well enough, and I hope it's easy to implement! Last note: I am using this code in PHP 5.2.14.
Thank you guys!
[EDIT : Wow, I had gotten your question completely wrong! Below's a better attempt at helping you.]
I gave it a go in JS, since I'm not a PHP coder. Here is the original regex: /(http:\/\/(?!www\.youtube)[^<>\s]+)\b/g. The negative lookahead prevents a match when the URL continues with a literal www.youtube (the lookahead content can be adapted if you need a more complex pattern).
There's nothing JS-specific here to my knowledge, but I don't know the ereg regex syntax. With the preg functions, you only need to escape the slashes if you use / as the pattern delimiter; the word boundary \b and the negative lookahead (?!pattern) are the same. The /g flag is for a global replacement, that is, not stopping at the first match. PHP has no such flag, because preg_replace replaces all matches by default.
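The same idea can be sketched in Python, whose re module uses the same lookahead and \b syntax. This is an illustration rather than a PHP drop-in: linkify is a hypothetical helper name, and the dots in the lookahead are escaped so they only match literal dots.

```python
import re

# An http:// URL that does not continue with "www.youtube".
NON_YOUTUBE = re.compile(r'(http://(?!www\.youtube)[^<>\s]+)\b')

def linkify(text):
    """Wrap every non-YouTube http:// URL in an <a> tag (hypothetical helper)."""
    return NON_YOUTUBE.sub(r'<a href="\1">\1</a>', text)

text = ('Wow have you checked http://www.youtube.com/watch?v=dQw4w9WgXcQ '
        'its even better than http://www.example.com !!!')
print(linkify(text))
```

Like the JS original, this only looks at http:// URLs starting with www.youtube, so https:// links or youtube.com without the www. prefix would need a broader lookahead.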
You've made several mistakes about valid URI components. The scheme is defined as ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), not [[:alpha:]]+.
The part after the : of the scheme need not start with //; that's particular to http: and a few other file-oriented schemes. But the [[:alpha:]]+: start of your regex shows you weren't aiming to restrict yourself to http:. In that case, all printable ASCII characters are valid, i.e. everything from ! to ~, or [\x21-\x7E]* as a regex.
To summarize: [[:alpha:]][A-Za-z0-9+.-]*:[\x21-\x7E]* (with the hyphen last in the class, so that it is a literal rather than a range).
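A small Python sketch of that summary follows. Two details are easy to get wrong when typing it out: both ends of the printable-ASCII range need the \x prefix, and the hyphen in the scheme class should come last, since +-. written in that order is a range from + to . that also lets a comma through. The sample strings are invented, and [A-Za-z] stands in for [[:alpha:]], which Python's re does not support.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), then ":" and
# any run of printable ASCII (0x21 '!' through 0x7E '~').
URI = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*:[\x21-\x7E]*')

# Invented samples.
samples = [
    'http://www.example.com',
    'mailto:someone@example.com',
    'svn+ssh://host/repo',
    '1abc:foo',          # scheme may not start with a digit
    'no scheme here',    # no colon at all
]
valid = [s for s in samples if URI.fullmatch(s)]
print(valid)

# Why the hyphen goes last: "+-." is the range 0x2B..0x2E, which includes ','.
print(bool(re.fullmatch(r'[+-.]', ',')))
```

This only checks the overall shape; it does not validate the parts after the colon against any particular scheme's rules.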

Regex Problem (newbie)

I'm writing a little app for spam-checking and I'm having problems with a regex.
Let's say I have this spam URL:
http://hosting.tyumen.ru/tip.html
I want to check that the URL has 2 full stops (subdomain + ending), a slash, a word, a full stop and "html".
Here's what I've got so far:
(http://.*?\..*?..*?/.*?.html)
It might look like rubbish, but it works. The problem: it's really slow and it freezes my app.
Any hints on how to optimize it?
Thanks.
The reason it's slow is that non-greedy quantifiers (.*?) used this way are prone to catastrophic backtracking.
Instead of saying "any amount of anything, but only to an extent where it doesn't conflict with later requirements", which is effectively what .*? is saying, try asking for "as much as possible, that isn't a double quote, which would terminate the href ":
\1
I also added a back-reference (\1) to your first capturing group, inside the <a>...</a>, so that you don't have to do the exact same matching all over again.
Note that this regex will be broken if, say, the a has a class name, an id, or anything else in its body. I left it like this because I wanted to give you what you asked for with as few changes as possible, and as to-the-point as possible.
(http://[\w.-]+/.+?\.html) may work, though perhaps only for your case.
Or a possibly faster one:
(http://[\w.-]+/[^.]+\.html)
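Here is a minimal Python sketch of the second pattern run against the spam URL from the question. The surrounding sentence is invented; the point is that each part of the pattern uses a character class that cannot run past the next delimiter, so the engine has very little to backtrack over.

```python
import re

# Host: word characters, dots and hyphens. Path: anything but a dot, then ".html".
FAST = re.compile(r'(http://[\w.-]+/[^.]+\.html)')

# Invented text around the URL from the question.
m = FAST.search('spam link: http://hosting.tyumen.ru/tip.html click now')
print(m.group(1))
```

Because [^.]+ excludes dots, a path such as /a.b/tip.html would not match; that trade-off is what keeps the pattern cheap.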
Since you say you are a regexp newbie, I will offer some more general advice on creating and debugging regular expressions. When they get pretty complicated, I find using Regexp Coach a must.
It's freeware and really saves a lot of headaches. Not to mention you don't have to build / run your application every minute just to see if the regexp works the way you wanted.
In Python, a simple way to match URLs ending in .html or .htm is to use
import re

url_re = re.compile(
    r'https?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:\S+\.html?)+'  # path ending in .html or .htm (dot escaped)
    , re.IGNORECASE)
which is a modified version of Django's UrlField regex.
This will match any URL ending with .html or .htm (whether the host is localhost, an IP address, or a domain).
#http://[-a-zA-Z0-9]+\.[-a-zA-Z0-9]+\.[-a-zA-Z]+/\w+\.html#