I am trying to create a regex in pcre, that is going to salinize URL with multiple slashes like the following:
https://www.domin.com/test1/////test2/somemoretests_67142 https://www.domin.com/test1/test2/somemoretests_67142///// https://www.domin.com/test1/test2///somemoretests_67142
So that I can replace it with the following: https://\2\4 and the link at the end of it looks: https://www.domin.com/test1/test2/somemoretests_67142
I have been struggling with it for the past couple of days, so any regex guru help is more than welcome :)
I have tried the following and more:
(http|https):\/\/(.*)(\/\/+)(.*)
(http|https):\/\/(.*)(\/\/){2,}(.*)
(http|https):\/\/(.*)(\/\/{2})(.*)
I am going to utilize these for Akamai to sanitize our URLs though cloudlet.
You can try:
(?<!https:\/)(?<!http:\/)(\/+$|(?<=\/)\/+)
And substitute the first group with empty string.
Regex demo.
This will produce this output:
https://www.domin.com/test1/test2/somemoretests_67142
https://www.domin.com/test1/test2/somemoretests_67142
https://www.domin.com/test1/test2/somemoretests_67142
Related
I need to fix my url pattern:
/^((http(s)?(\:\/\/)){1}(www\.)?([\w\-\.\/])*(\.[a-zA-Z]{2,4}\/?)[^\\\/#?])[^\s\b\n|]*[^\.,;:\?\!\#\^\$ -]/
I thought this regex was ok, but it is not working for urls like: https://xx.xx (without www). 'www' should be optional ((www.)?). Where is the bug?
The problem is not in the (www\.)? part but that parts after that.
Take a look at the [^\\\/#?] and the [^\.,;:\?\!\#\^\$ -] parts.
So a valid URL would be https://xx.xx plus none of \/#? plus none of .,;:?!#^$_- making the url valid if you add those, for example https://xx.xx11.
I do advice you to not try to create your own regex because you are missing a lot!
For example, tlds like .amsterdam are valid. And why are you capturing so many groups?
Your regex as an image made with https://www.debuggex.com/:
I have a list of text lines. Each line contains a title and a URL as follows:
product-title-7134 http://domain.com/page-1
another-product-title-822 http://domain.com/page-218
etc.
Using only .NET regex, please help me extract the url from each line.
I understand it can be done by looking at the string from the end until the http is met and output that part but I don't know the exact regex formula for that. Any help is much appreciated.
I would do that with this regex:
http://(\S+)
And find first group in every match.
This regex will math all https:// and http:// links:
(http|https)(://\S+)
You can test this in the .NET regex tester: http://regexstorm.net/tester
I want to crawl the pages of Techcrunch uploaded after the 1 Jan of 2013.The website follows the pattern
http://www.techcrunch.com/YYYY/MM/DD
So my question is how to setup the regex in urlfilter in nutch so that i could crawl only pages which i want.
+^http://www.techcrunch.com/2013/dd/dd/([a-z0-9\-A-Z]*\/)*
I don't know nutch but do you try:
+^http://www.techcrunch.com/2013/[0-9]{2}/[0-9]{2}.*$
or
+^http://www.techcrunch.com/2013/[0-9]+/[0-9]+.*$
The following expressions will match the URLs you need:
Without groups
http:\/\/www.techcrunch.com\/\d{4}\/\d{2}\/\d{2}\/\w+
With groups
http:\/\/www.techcrunch.com\/(\d{4})\/(\d{2})\/(\d{2})\/(\w+)
I didn't put anchors (^$), but you can put them if you need them for the filtering.
Try them to see if any of them work.
I don't know how nutch works, but a couple of suggestions about your regex that may apply: the / in the regexp should be escaped; the dd parts should be \d\d so they match two digits.
About setting up the regex, check out this answer to see if it helps you.
I've searched around quite a bit now, but I can't get any suggestions to work in my situation. I've seen success with negative lookahead or lookaround, but I really don't understand it.
I wish to use RegExp to find URLs in blocks of text but ignore them when quoted. While not perfect yet I have the following to find URLs:
(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?
I want it to match the following:
www.test.com:50/stuff
http://player.vimeo.com/video/63317960
odd.name.amazone.com/pizza
But not match:
"www.test.com:50/stuff
http://plAyerz.vimeo.com/video/63317960"
"odd.name.amazone.com/pizza"
Edit:
To clarify, I could be passing a full paragraph of text through the expression. Sample paragraph of what I'd like below:
I would like the following link to be found www.example.com. However this link should be ignored "www.example.com". It would be nice, but not required, to have "www.example.com and www.example.com" ignored as well.
A sample of a different one I have working below. language is php:
$articleEntry = "Hey guys! Check out this cool video on Vimeo: player.vimeo.com/video/63317960";
$pattern = array('/\n+/', '/(https?\:\/\/)?(player\.vimeo\.com\/video\/[0-9]+)/');
$replace = array('<br/><br/>',
'<iframe src="http://$2?color=40cc20" width="500" height="281" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe>');
$articleEntry = preg_replace($pattern,$replace,$articleEntry);
The result of the above will replace any new lines "\n" with a double break "" and will embed the Vimeo video by replacing the Vimeo address with an iframe and link.
I've found a solution!
(?=(([^"]+"){2})*[^"]*$)((https?:\/\/)?(\w+\.)+\w{2,}(:[0-9]+)?((\/\w+)+(\.\w+)?)?\/?)
The first part from (? to *$) what makes it work for me. I found this as an answer in java Regex - split but ignore text inside quotes? by https://stackoverflow.com/users/548225/anubhava
While I had read that question before, I had overlooked his answer because it wasn't the one that "solved" the question. I just changed the single quote to double quote and it works out for me.
add ^ and $ to your regex
^(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?$
please notice you might need to escape the slashes after http (meaning https?\:\/\/)
update
if you want it to be case sensitive, you shouldn't use \w but [a-z]. the \w contains all letters and numbers, so you should be careful while using it.
I have the following regex that attempts to match URLs:
/((http|https):(([A-Za-z0-9$_.+!*(),;/?:#&~=-])|%[A-Fa-f0-9]{2}){2,}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*(),;/?:#&~=%-]*))?([A-Za-z0-9$_+!*();/?:~-]))/g
How can I modify this regex to only match URLs of a single domain?
For example, I only want to match URLs that begin with http://www.google.com?
This should simplify my regex, but I'm too much of a regex noob to get it working (after all these years...)
Did you write that RegEx? I don't know what it's trying to do, but it certainly doesn't match URLs correctly. Here's something it matches:
http:###9#?~
which I'm pretty sure isn't a valid URL.
You shouldn't be using RegEx to match URLs like this. You haven't said what language you're working in, but use whatever its equivalent of urlparse is..
Here's a relevant question: How do you validate a URL with a regular expression in Python?