please regex url request - regex

I would like to know if anybody can help me with a regular expression problem. I want to write a regular expression to catch URLs similar to this URL:
www.justin.tv/channel_name_here
I have tried:
/justin\.tv\/(.*)
The problem I get is that when this channel goes live, sometimes the URL transforms to something like this:
www.justin.tv/channel_name_here#/w/45365675688
I can't catch this. :( Can anybody please help me with this? I just want to catch the channel name without the pound symbol and the rest of the URL.
Here are some example URLs:
www.justin.tv/winning_movies#/w/6347562128
http://www.justin.tv/cine_accion_hd16#/w/6347562128/18
http://www.justin.tv/fox_movies_hd1/
I would want to get:
winning_movies
cine_accion_hd16
fox_movies_hd1
Thanks in advance! :)

Short answer:
(?<=justin\.tv\/)([^#\/]+)
Long answer:
Let's split this up into parts. Look at the back part first.
([^#\/]+)
This delimits the string into parts that don't include either '#' or '/'.
Now let's look at the first part.
(?<=justin\.tv\/)
The syntax "(?<=" followed by ")" is called positive lookbehind (this page has good examples and explanation of the different types of lookaround). Using a simple example:
(?<=A)B
The above example says "I want all 'B' that are immediately after an 'A'." Going to our big example, we're saying we want all parts (separated by '#' or '/') that are immediately after a part called "justin.tv/".
Look here for an example of the expression in action.

#justin\.tv/([^#/]+)#
If you want everything up to a certain character(-set), use a negated class.
Also, when working on regex for urls, using / as delimiter is error-prone, as you have to escape all the /'s. Use something else instead (like # in this case)

Related

regular expressions: catch any URLs of the domain example.com

I'm trying to get regexp code for the below case. I tried multiple tries but in vain.
I need to catch any URLs of the domain site.com. Tried using regexp '^site.com/*$
but it does not recognizes it.
i'm just looking for regexp code whichmatches site.com/*
With your expression ^site.com/*$ you match all strings that start with site.com and have zero or more trailing / characters (/*):
If you want to match any strings starting with site.com/ you might want to try ^site\.com/.*$:
There are already a lot of other regex questions regarding domain names on SO, but your question is not clear to me in what context you are trying to do this, or what is the actual goal you want to achieve. If you describe your needs more precisely you could probably find some answers on this forum.
I generally use a helper website like regex101.com.
Also, a few things to note, . has a special meaning in regex meaning any character, and if you wanted to capture site.com/foo you might want to use something where you are not limited to the number of characters by the end. I'd do this with groupings.
^(site\.com\/)(.+)$
You can see this in action here: https://regex101.com/r/AU2iYC/2
Your regex ^site.com/*$ is only matched follow sentences
ex) site.com/ site.com//////// site.com
because * asterisk in regex means Match 0 or more of the preceding token.
so, it should be work
^site.com\/.*$

How to write a regex to validate a specific URL format?

My URL will look something like this :
"/eshop/products/SMART+TV+SAMSUNG+UE55H6500SLXXH+3D/productDetail/ger_20295028/"
Where product names can keep changing here
SMART+TV+SAMSUNG+UE55H6500SLXXH+3D and product id here ger_20295028. I tried writing a regex which is wrong.
How can I correct it for the above URL?
Regex:
.*/products/[^/]*?/productDetail/[^/]*?/([^/].*?)/[^/]*?/([^/]*)(/.*?)*$
You use ? (single character) instead of * (any number) and you also have much more parts at the end than the example you've given. Try something like this
.*/products/[^/]*/productDetail/[^/]*/
You should read up on quantifiers (the ? means once or zero times, you are confusing it with *). This regex might work for you:
/^.*\/products\/[^\/]+\/productDetail\/[^\/]+\/$/
Try it online here.

Detect URL in a string without any whitespace regexp

So I know the idea of catching any URL is a very difficult task, and that's not what i'm wanting to do. I'm wanting to find a piece of regex that'll catch urls in the form of
http://something.xx.yy
http://www.something.xxx
www.something.xx.yy
in a string that will contain lots of other text and no whitespace, so for example
hellopleasevisitwww.something.xxthankyou
I've tried my best to detect something like that by myself, but it's been pretty fruitless. Any help would be great. Below are some of the expressions I tried to modify in order to have these requirements met
.*\\(?\\b(http://|www[.])[-A-Za-z0-9+&##/%?=~_()|!:,.;]*[-A-Za-z0-9+&##/%=~_()|].*
\\b\\w*\\(?\\b(http://|www[.])[-A-Za-z0-9+&##/%?=~_()|!:,.;]*[-A-Za-z0-9+&##/%=~_()|]\\w*\\b
\\(?\\b(http://|www[.])[-A-Za-z0-9+&##/%?=~_()|!:,.;]*[-A-Za-z0-9+&##/%=~_()|]
Thanks for your time
If it really can be as simple as you're saying...
(http://(www\\.)?|www\\.)[^.]+\\.(\\w{3}|\\w{2}\\.\\w{2})
The expressions you tried all have \\b which is a word boundary and your string unfortunately does not have word boundaries.
See it in action

transforming URLS to active links with REGEX

i have this code in php that transforms URL inside a text to active html links.
For example in a string
Hey check this cool link http://www.example.com
this transforms to:
Hey check this cool link http://www.example.com
As you can see it just adds the correct < a > html tag
The code is this:
$active_links_text = ereg_replace("[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]","\\0", $original_text);
My question is, how to do this to work EXCEPT if the URL is a youtube url.
So i want this result: In a string
Wow have you checked http://www.youtube.com/watch?v=dQw4w9WgXcQ its even better than http://www.example.com !!!
i want to be transformed to
Wow have you checked http://www.youtube.com/watch?v=dQw4w9WgXcQ its even better than http://www.example.com
As you can see the < a > html tag was added to the example.com's URL but NOT at the youtube's URL.
How can i make this happen???
I hope i described my problem good enough, i hope its easy to implement this! Last note: i am using this code in php 5.2.14
Thank you guys!
[EDIT : Wow, I had gotten your question completely wrong! Below's a better attempt at helping you.]
I gave it a go in js here, here is the original regex : /(http:\/\/(?!www.youtube)[^<>\s]+)\b/g, since i'm not a php coder. The negative lookahead prevents a litteral www.youtube match (the lookahead content can be adapted if you need a more complex pattern).
There's nothing js-specific here to my knowledge, but I don't know the ereg regex syntax. with preg functions, you would just need not to escape the slashes, the word boundaries \b and negative lookahead (?!*pattern*) are the same. The /g flag is for a global replacement, that is, not stopping on the first match, I suppose you have a kind of replaceAll function in your toolbox.
Also, I'm not sure about the global flag in php, I guess you can just call a kind of replaceAll function.
You've made several mistakes about valid URI components. The scheme is defined as ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), not [[:alpha:]]+.
The part after the : of the scheme need not start with //, that's particular to http: and a few other file-oriented schemes. But the [[:alpha:]]+: start of your regex shows you weren't aiming to restrict yourself to http:. In that case, all printable ASCII characters are valid. I.e. everything from ! to ~, or [\x21-x7E]* as a regex.
To summarize: [[:alpha:]][A-Za-z0-9+-.]*:[\x21-x7E]*.

Regex to find bad URLs in a database field

We had an issue with the text editor on our website that was doubling up the URL. So for example, the text field may look contain:
This is a description for a media item, and here in a link.
So pretty much I need a regex to detect any string that begins with http and has another http before a closing quote, as in "http://www.example.com/apage.htmlhttp://www.example.com/apage.html"
"http[^"]+http
http://www.example.com/apage.htmlhttp://www.example.com/apage.html
This is actually a valid URL! So you'd want to be a bit careful not to munge any other URLs that happen to have ‘http://’ in the middle of them. To detect only a ‘doubled’ URL you could use backreferences:
"(https?://[^"]*)\1"
(This is a non-standard regex feature, but most modern implementations have it.)
Using regex to process HTML is a bad idea. HTML cannot reliably be parsed by regex.
If you can use the *.? syntax, you can just look for the following:
http(.*?)http
and if its present, reject the url.
The string that begins with http and has another http before a quote is:
^http[^"]*http
But, although this answers exactly your question I suspect you may want Uh Clem's answer instead ;-)
You will probably want something like this:
("http[^"]+)(http)
Then compare the two and if \1 === " + \2 then replace them.
One thought; do you have any query strings in any of your urls. If you do, are any of them like this "http://someurl.com?http=somemoredatahttp://someurl.com?http=somemoredata"?
If so, you will want something far more complicated.