How to find "complicated" URLs in a text file - regex

I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).

This trims the trailing ) and . characters from your output:
import re
# your_output is the text produced by your regex, one URL per line
regx = re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
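As a quick check, the substitution above can be run against the four lines from the question. This is only a sketch: it assumes `your_output` is a single string with one URL per line, which the question does not state explicitly.

```python
import re

# Assumed input: the regex output from the question, one URL per line.
your_output = "\n".join([
    "http://rda.ucar.edu/datasets/ds117.0/.",
    "http://rda.ucar.edu/datasets/ds111.1/.",
    "http://www.discover-earth.org/index.html).",
    "http://community.eosdis.nasa.gov/measures/).",
])

# (?m) makes $ match at the end of every line, not only at the end of the string.
regx = re.compile(r'(?m)[\.\)]+$')
trimmed = regx.sub('', your_output)
print(trimmed)
```

Without the (?m) flag only the very last line would be trimmed, because $ would then match only at the end of the whole string.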
And this regex seems to work for extracting the URLs from your original sample text:
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
(edited from https?:[\S]*\/)
The Python script may look something like this:
import re
ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
    print(m)

So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring [-;:&=+\$,\w]+ runs, each optionally followed by a dot (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a closing parenthesis, using a negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
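A minimal Python sketch of how the full pattern could be applied to the sample text from the question. Because the pattern starts with ^, the sketch applies it to one whitespace-separated token at a time; that splitting step is an assumption, not something the answer specifies.

```python
import re

# The full pattern from above, unchanged (split over two literals for width).
pattern = re.compile(
    r'^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+'
    r'(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).'
)

text = ("this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this "
        "to be copied over http://rda.ucar.edu/datasets/ds111.1/. "
        "http://www.discover-earth.org/index.html). "
        "http://community.eosdis.nasa.gov/measures/).")

# Assumption: URLs never contain whitespace, so each token is tested on its own.
urls = [m.group(0) for m in map(pattern.match, text.split()) if m]
for url in urls:
    print(url)
```

The final `(?!\.|\)).` forces the engine to back off until the last consumed character is neither a dot nor a closing parenthesis, which is what drops the trailing `.` and `).`.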

There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
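We can go one step further: a-zA-Z0-9_ is the same as \w, so we can replace those in the character class to get [\w$-#.&+!*\(\),]. Note that the - left over from $-_ still sits unescaped between two characters, where it is read as a range operator; the next point deals with it.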
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ on the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
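The range problem described in the list above can be checked directly. The sketch below uses Python's re module and two deliberately tiny character classes (not the full regex) to compare a class containing the range with one where - is moved to the end.

```python
import re

# Every character covered by the range $-_ (ASCII 0x24 through 0x5F).
expanded = "".join(chr(c) for c in range(ord("$"), ord("_") + 1))
print(expanded)

# With the range, "/" and ":" match even though neither was listed explicitly.
range_class = re.compile(r"[$-_]")
# With - moved to the end, only the three literal characters $, _ and - match.
literal_class = re.compile(r"[$_-]")

for ch in "/:)-":
    print(ch, bool(range_class.fullmatch(ch)), bool(literal_class.fullmatch(ch)))
```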
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
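To see the remaining problem concretely, here is a short Python sketch that runs the simplified regex on the sample text. The `rstrip` step afterwards is not part of this answer; it is a stopgap borrowed from the trimming idea in the earlier answer, and it assumes that a trailing run of `.` and `)` is always sentence punctuation.

```python
import re

text = ("this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this "
        "to be copied over http://rda.ucar.edu/datasets/ds111.1/. "
        "http://www.discover-earth.org/index.html). "
        "http://community.eosdis.nasa.gov/measures/).")

# Simplified regex from above, with / added to the character class.
simplified = re.compile(r"https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+")

raw_matches = simplified.findall(text)
print(raw_matches)  # the trailing "." and ")." are still part of each match

# Assumption: trailing "." and ")" characters never belong to the URL.
cleaned = [url.rstrip(".)") for url in raw_matches]
print(cleaned)
```

The stopgap would also strip a closing parenthesis that genuinely belongs to a URL, so it only fits text like the sample.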

Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
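import re
# des holds the text to search
des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # drop whatever follows the last slash (the trailing "." or ")." in the sample)
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)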

Related

How to match all URLs until slash?

I need a RegEx for matching all of these URLs:
https://www.domain.tld/service?itm_pm=de:ncp:ctr:c1cn:0:0
https://www.domain.tld/service
https://www.domain.tld/service/
But not these:
https://www.domain.tld/service/afdsasdaf
https://www.domain.tld/service/afdsasdaf/asdasd
I tried it with
https://www.domain.tld/service[^/]*
but it doesn't work
Mark the end of the string
Summary of changes:
I would work with a $ delimiter for "end of string"
A / usually needs to be escaped. This may be different based on your settings/language etc.
The . must be escaped as well, otherwise wwwwdomain.tld would be found
Solution with working example:
https:\/\/www\.domain\.tld\/service[^\/]*\/?$
You can play around with it here:
https://regex101.com/r/wm6Nit/1
If you want to allow https://www.domain.tld/service/ specifically, do that explicitly:
https://www.domain.tld/service(/?|[^/]*)$
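The difference between the asker's attempt and the two anchored patterns above can be checked with a small Python sketch. The patterns are copied as written; only the test harness around them is new.

```python
import re

urls = [
    "https://www.domain.tld/service?itm_pm=de:ncp:ctr:c1cn:0:0",
    "https://www.domain.tld/service",
    "https://www.domain.tld/service/",
    "https://www.domain.tld/service/afdsasdaf",
    "https://www.domain.tld/service/afdsasdaf/asdasd",
]

# The asker's attempt: no end anchor, so a prefix of a longer URL still matches.
unanchored = re.compile(r"https://www.domain.tld/service[^/]*")
# First answer: escaped dots and slashes, optional trailing slash, end anchor.
anchored = re.compile(r"https:\/\/www\.domain\.tld\/service[^\/]*\/?$")
# Second answer: trailing slash handled as an explicit alternative.
explicit = re.compile(r"https://www.domain.tld/service(/?|[^/]*)$")

for url in urls:
    print(url,
          bool(unanchored.match(url)),
          bool(anchored.match(url)),
          bool(explicit.match(url)))
```

In Python the forward slashes do not need escaping; the backslashes are kept only so the pattern matches the one shown above.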

Regex: matching unknown groups that repeat?

I'm trying to create a generic regex pattern for a crawler, to avoid so-called "crawler traps" (links that just add URL parameters and refer to the exact same page, which results in tons of useless data). A lot of times, those links just add the same part to a URL over and over again. Here is an example out of a log file:
http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/...
I can use regular expressions to narrow the scope of the crawler, and I would love to have a pattern that tells the crawler to ignore everything that has repeating parts. Is that possible with a regex?
Thanks in advance for some tips!
JUST TO CLARIFY:
The crawler traps are not designed to prevent crawling; they are a result of poor web design. All the pages we are crawling explicitly allowed us to do so!
If you are already looping through a list of URLs, you could add matching as a condition to skip the current iteration:
import re
array = ["/abcd/abcd/abcd/abcd/", "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/", "http://examplepage/apple/cake/banana/"]
pattern1 = re.compile(r'.*?([^\/\&?]{4,})(?:[\/\&\?])(.*?\1){3,}.*')
for url in array:
    # skip any URL in which the same path/query part shows up four or more times
    if pattern1.match(url):
        print("It matches; skipping this URL")
        continue
    print(url)
Example regex:
.*?([^\/\&?]{4,})(?:[\/\&\?])(.*?\1){3,}.*
([^\/\&?]{4,}) matches and captures a run of 4 or more characters, none of which is /, & or ?.
(?:[\/\&\?]) looks for one /, & or ?
(.*?\1){3,} lazily matches anything up to the next occurrence of what we captured, doing all of this 3 or more times.
You can use a backreference in Python/Perl regexes (and possibly others) to catch a pattern which is repeated:
>>> re.search(r"(/.+)\1", "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/").group(1)
'/cssms/chrome'
\1 references the first capturing group, so (/.+)\1 means the same sequence repeated twice in a row. The leading / is just to avoid the regex matching the first single repeating letter (the tt in http) and to catch repetitions in the path instead.
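A small sketch of how the backreference could be used as a crawler filter. The helper name `has_adjacent_repeat` is made up for illustration, and the URL list is the one from the earlier answer.

```python
import re

# Same pattern as above: a chunk starting with "/" that is immediately repeated.
repeat = re.compile(r"(/.+)\1")

def has_adjacent_repeat(url):
    """Return True if some path chunk is directly followed by a copy of itself."""
    return repeat.search(url) is not None

candidates = [
    "/abcd/abcd/abcd/abcd/",
    "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/",
    "http://examplepage/apple/cake/banana/",
]

kept = [url for url in candidates if not has_adjacent_repeat(url)]
print(kept)
```

This only catches back-to-back repeats, so a legitimate path such as /2020/2020/ would be skipped as well; whether that trade-off is acceptable depends on the sites being crawled.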

Ruby Puppet Regex string matching

I'm somewhat new to Ruby and have done a ton of Google searching but just can't seem to figure out how to match this particular pattern. I have used rubular.com and can't seem to find a simple way to match. Here is what I'm trying to do:
I have several types of hosts, they take this form:
Sample hostgroups
host-brd0000.localdomain
host-cat0000.localdomain
host-dog0000.localdomain
host-bug0000.localdomain
Next I have a case statement, I want to keep out the bugs (who doesn't right?). I want to do something like this to match the series of characters. However, it starts matching at host-b, host-c, host-d, and matches only a single character as if I did a [brdcatdog].
case $hostgroups { #variable takes the host string up to where the numbers begin
  # animals to keep
  /host-[["brd"],["cat"],["dog"]]/: {
    file {"/usr/bin/petstore-friends.sh":
      owner => petstore,
      group => petstore,
      mode => 755,
      source => "puppet:///modules/petstore-friends.sh.$hostgroups",
    }
  }
I could do something like [bcd][rao][dtg] but it's not very clean looking and will match nonsense like "bad", "cot", "dat", "crt" which I don't want.
Is there a slick way to use \A and [] that I'm missing?
Thanks for your help.
-wootini
How about using negative lookahead?
host-(?!bug).*
Here is the RUBULAR permalink matching everything except those pesky bugs!
Is this what you're looking for?
host-(brd|cat|dog)
(Following gtgaxiola's example, here's the Rubular permalink)
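To compare the character-class approach with the two suggestions above, here is a sketch in Python. Python's re module is used only as a stand-in for Ruby's regex engine; alternation and negative lookahead are written the same way in both.

```python
import re

hosts = [
    "host-brd0000.localdomain",
    "host-cat0000.localdomain",
    "host-dog0000.localdomain",
    "host-bug0000.localdomain",
]

# A character class matches ONE character from the set, so "b" lets bug through.
char_class = re.compile(r"host-[brdcatdog]")
# Alternation matches one of the whole three-letter words.
alternation = re.compile(r"host-(brd|cat|dog)")
# Negative lookahead: anything after "host-" that does not start with "bug".
not_bug = re.compile(r"host-(?!bug).*")

for host in hosts:
    print(host,
          bool(char_class.match(host)),
          bool(alternation.match(host)),
          bool(not_bug.match(host)))
```

The alternation is a whitelist and the lookahead is a blacklist, so they diverge as soon as a fifth host type appears.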

Exclude part of the string with regex

I'm quite bad with regex, and I'm looking to match some criteria.
This is a regex that will be embedded into the URL filter of a firewall, so it will block any URL that is not like the ones in the list at the end.
This is what I'm currently using, but it's not working:
http://www.youtube.com/(*.*)list=UUFwtOm4N5djdcuTAlNIWJaQ
This is the example URL (to be blocked):
http://www.youtube.com/watch?NR=1&feature=fvwp&v=P1b5VY_Bp_o&list=UUFwtOm4N5djdcuTAlNIWJaQ
I'm trying to make a regex that will successfully match when NR=1 or feature=fvwp are NOT present. I assume I can do it like this: (?!^feature=fvwp$), but the v= and list=UUFwtOm4N5djdcuTAlNIWJaQ are allowed.
Also the v= should be limited to any character (uppercase and lowercase) and a length of 11; I assume it's: /^[a-z0-9]{11}$/
How can I build all that together and make it work, so that it allows and matches only these URLs while excluding the ones covered by the criteria I explained above?
http://www.youtube.com/watch?v=4eK_RWpTgcc&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ
http://www.youtube.com/watch?v=TLRl85TJwZM&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ
http://www.youtube.com/watch?v=QEV9yqrpxkc&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ
Can you block based on matching by regex? If so, just use
(.*)www\.youtube\.com/watch\?NR=1&feature=fvwp and block whatever matches that.
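A Python sketch of both directions: the block rule from the answer, and a hypothetical allow rule built from the criteria in the question. The allow rule is an assumption on two counts: it presumes the firewall's regex engine supports lookaheads, and it presumes video IDs consist of letters, digits, _ and - (the sample IDs contain underscores, which the asker's [a-z0-9] would reject).

```python
import re

LIST_ID = "UUFwtOm4N5djdcuTAlNIWJaQ"

# Block rule from the answer above.
block = re.compile(r"(.*)www\.youtube\.com/watch\?NR=1&feature=fvwp")

# Hypothetical allow rule: no NR=1, no feature=fvwp, an 11-character v=,
# and the expected list= as the last parameter.
allow = re.compile(
    r"^http://www\.youtube\.com/watch\?"
    r"(?!.*(?:NR=1|feature=fvwp))"
    r"(?=.*\bv=[A-Za-z0-9_-]{11}(?:&|$))"
    r".*\blist=" + LIST_ID + r"$"
)

blocked_url = ("http://www.youtube.com/watch?NR=1&feature=fvwp"
               "&v=P1b5VY_Bp_o&list=" + LIST_ID)
allowed_urls = [
    "http://www.youtube.com/watch?v=4eK_RWpTgcc&feature=BFa&list=" + LIST_ID,
    "http://www.youtube.com/watch?v=TLRl85TJwZM&feature=BFa&list=" + LIST_ID,
    "http://www.youtube.com/watch?v=QEV9yqrpxkc&feature=BFa&list=" + LIST_ID,
]

print(bool(block.search(blocked_url)), bool(allow.match(blocked_url)))
for url in allowed_urls:
    print(bool(block.search(url)), bool(allow.match(url)))
```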

RegEx check if string contains certain value

I need some help with writing a regex validation to check for a specific value
Here is what I have, but it doesn't work:
Regex exists = new Regex(@"MyWebPage.aspx");
Match m = exists.Match(pageUrl);
if(m)
{
//perform some action
}
So I basically want to know when the variable pageUrl contains the value MyWebPage.aspx.
Also, if possible, I'd like to combine this check to cover several cases, for instance MyWebPage.aspx, MyWebPage2.aspx, MyWebPage3.aspx
Thanks!
try this
"MyWebPage\d*\.aspx$"
This will allow for any pages called MyWebPage#.aspx where # is zero or more digits, provided the URL ends right after .aspx (that is what the $ is for).
if (Regex.IsMatch(url, "MyWebPage[^/]*?\\.aspx")) ....
This will match any form of MyWebPageXXX.aspx (where XXX is zero or more characters). It will not match MyWebPage/test.aspx however
That RegEx should work in the case that MyWebPage.aspx is in your pageUrl, albeit by accident. You really need to replace the dot (.) with \. to escape it.
Regex exists = new Regex(@"MyWebPage\.aspx");
If you want to optionally match a single number after the MyWebPage bit, then look for the (optional) presence of \d:
Regex exists = new Regex(@"MyWebPage\d?\.aspx");
I won't post a regex, as others have good ones going, but one thing that may be an issue is character case. Regexes are, by default, case-sensitive. The Regex class does have a static overload of the Match function (as well as of Matches and IsMatch) which takes a RegexOptions parameter allowing you to specify if you want to ignore case.
For example, I don't know how you are getting your pageUrl variable but depending on how the user typed the URL in their browser, you may get different casings, which could cause your Regex to not find a match.
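The points made in the answers above (escaping the dot, allowing digits, ignoring case) can be tried out quickly. The sketch below uses Python's re module as a stand-in for .NET's Regex class; the pattern syntax for these constructs is the same, `re.IGNORECASE` plays the role of `RegexOptions.IgnoreCase`, and the example.com URLs are made up.

```python
import re

# An unescaped dot matches any character, so this "works" only by accident.
loose_hit = bool(re.search(r"MyWebPage.aspx", "MyWebPageXaspx"))
strict_hit = bool(re.search(r"MyWebPage\.aspx", "MyWebPageXaspx"))
print(loose_hit, strict_hit)

# Hypothetical page URLs for illustration.
page_urls = [
    "http://example.com/MyWebPage.aspx",
    "http://example.com/MyWebPage2.aspx",
    "http://example.com/MyWebPage33.aspx",
    "http://example.com/mywebpage3.aspx",
    "http://example.com/MyWebPage/test.aspx",
]

numbered = re.compile(r"MyWebPage\d*\.aspx$")
insensitive = re.compile(r"MyWebPage\d*\.aspx$", re.IGNORECASE)

for url in page_urls:
    print(url, bool(numbered.search(url)), bool(insensitive.search(url)))
```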