(r'^/(?P<the_param>[a-zA-z0-9_-]+)/$','myproject.myapp.views.myview'),
How can I change this so that "the_param" accepts a URL(encoded) as a parameter?
So, I want to pass a URL to it.
mydomain.com/http%3A//google.com
Edit: Should I remove the regex, like this...and then it would work?
(r'^/(?P[*]?)/?$','myproject.myapp.views.myview'),
Add % and . to the character class:
[a-zA-Z0-9_%.-]
Note: You don't need to escape % or . inside a character class, because most special characters lose their meaning there. The exception here is -, which defines a range when it sits between two characters. To match a literal hyphen, escape it with a backslash or put it at the beginning or the end of the class. Something like [a-zA-Z0-9_%-.] does not do what you want: %-. is read as the range from % to ., so the class also matches characters such as & and *.
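A quick way to check how Python's re module reads the hyphen in these two classes. This is only a minimal sketch; the variable names are made up for illustration.

```python
import re

# A hyphen between two characters forms a range: %-. spans '%' (0x25) through '.' (0x2E).
range_class = re.compile(r'[a-zA-Z0-9_%-.]')
# A hyphen at the end of the class is a literal hyphen.
literal_class = re.compile(r'[a-zA-Z0-9_%.-]')

# '*' (0x2A) lies inside the %-. range, so only the first class accepts it.
print(bool(range_class.fullmatch('*')))    # True
print(bool(literal_class.fullmatch('*')))  # False
# The literal class still accepts the characters that were actually intended.
print(all(literal_class.fullmatch(c) for c in '%.-'))  # True
```

Both classes compile, so the problem with the first one is silent over-matching.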
You'll at least need something like:
(r'^the/uri/is/(?P<uri_encoded>[a-zA-Z0-9~%\._-]+)$', 'project.app.view'),
That expression should match everything described here.
Note that your example should be mydomain.com/http%3A%2F%2Fgoogle.com (slashes should be encoded).
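For reference, a small sketch of producing that fully encoded form with Python's standard library. The target URL is the one from the question; safe='' is what forces the slashes to be encoded as well.

```python
from urllib.parse import quote, unquote

target = 'http://google.com'

# By default quote() leaves '/' alone; safe='' encodes it too.
encoded = quote(target, safe='')
print(encoded)  # http%3A%2F%2Fgoogle.com

# Decoding gives back the original URL.
print(unquote(encoded) == target)  # True
```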
I think you can do it with:
(r'^(?P<url>[\w\.~_-]+)$', 'project.app.view')
I need a RegEx for matching all of these URLs:
https://www.domain.tld/service?itm_pm=de:ncp:ctr:c1cn:0:0
https://www.domain.tld/service
https://www.domain.tld/service/
But not these:
https://www.domain.tld/service/afdsasdaf
https://www.domain.tld/service/afdsasdaf/asdasd
I tried it with
https://www.domain.tld/service[^/]*
but it doesn't work
Mark the end of the string
Summary of changes:
I would work with a $ delimiter for "end of string"
A / usually needs to be escaped. This may be different based on your settings/language etc.
The . must be escaped as well, otherwise wwwwdomain.tld would be found
Solution with a working example:
https:\/\/www\.domain\.tld\/service[^\/]*\/?$
You can play around with it here:
https://regex101.com/r/wm6Nit/1
If you want to allow https://www.domain.tld/service/ specifically, do that explicitly:
https://www.domain.tld/service(/?|[^/]*)$
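To sanity-check both suggested patterns against the URLs from the question, something like the following can be used. It is only a sketch: the dictionary keys and list names are made up for illustration, and the patterns are copied as posted.

```python
import re

patterns = {
    'escaped': re.compile(r'https:\/\/www\.domain\.tld\/service[^\/]*\/?$'),
    'explicit': re.compile(r'https://www.domain.tld/service(/?|[^/]*)$'),
}

should_match = [
    'https://www.domain.tld/service?itm_pm=de:ncp:ctr:c1cn:0:0',
    'https://www.domain.tld/service',
    'https://www.domain.tld/service/',
]
should_not_match = [
    'https://www.domain.tld/service/afdsasdaf',
    'https://www.domain.tld/service/afdsasdaf/asdasd',
]

# Print, for each pattern, whether every URL behaves as the question requires.
for name, pattern in patterns.items():
    accepted = all(pattern.match(url) for url in should_match)
    rejected = not any(pattern.match(url) for url in should_not_match)
    print(name, accepted, rejected)
```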
I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This strips the trailing . and ) characters from each line of your output:
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
And this regex seems workable to extract URL from your original sample text.
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
(Edited from https?:[\S]*\/)
A Python script might look something like this:
import re

ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
# findall returns the whole match here because every group is non-capturing
for m in regx.findall(ss):
    print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring [-;:&=+\$,\w]+ segments separated by dots (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
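A minimal sketch of trying the full pattern in Python. Because the pattern starts with ^, it is applied here to each already-extracted string from the question rather than to the whole text; the variable names are made up for illustration.

```python
import re

full_pattern = re.compile(
    r'^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+'
    r'(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).'
)

# The strings the original regex produced, trailing punctuation included.
raw_matches = [
    'http://rda.ucar.edu/datasets/ds117.0/.',
    'http://rda.ucar.edu/datasets/ds111.1/.',
    'http://www.discover-earth.org/index.html).',
    'http://community.eosdis.nasa.gov/measures/).',
]

# match() anchors at the start of each string; group(0) is the cleaned URL.
cleaned = [full_pattern.match(raw).group(0) for raw in raw_matches]
for url in cleaned:
    print(url)
```

If the URL is followed by more text on the same line, the final . of the pattern can consume a following space, so this sketch only covers the one-URL-per-string case.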
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w (with ASCII matching), so we can replace a-zA-Z0-9 in the character class with \w to get [\w$-_#.&+!*\(\),]
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
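The effect of the accidental $-_ range, and the behaviour of the simplified pattern, can be checked with a short sketch like this one. The sample sentence and variable names are made up for illustration.

```python
import re

# [$-_] is a range from '$' (0x24) to '_' (0x5F), so it matches far more than three characters.
print(re.findall(r'[$-_]', '/:@'))  # ['/', ':', '@']
# With the hyphen last, the class only contains '$', '_' and '-'.
print(re.findall(r'[$_-]', '/:@'))  # []

simplified = re.compile(r'https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+')
sample = 'see http://www.discover-earth.org/index.html). for details'

# The trailing ')' and '.' are in the character class, so they are still captured.
found = simplified.findall(sample)
print(found)  # ['http://www.discover-earth.org/index.html).']
```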
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
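import re

des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # Keep everything before the last '/'; this also drops a trailing file name such as index.html.
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)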
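For comparison, here is a sketch of a different post-processing step, not the coworker's method: strip only trailing . and ) characters from each match. It reuses the simplified pattern from the earlier answer and a shortened version of the sample text; it assumes no wanted URL legitimately ends in . or ).

```python
import re

url_pattern = re.compile(r'https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+')
des = ('this is a test http://rda.ucar.edu/datasets/ds117.0/. and '
       'http://www.discover-earth.org/index.html).')

raw = url_pattern.findall(des)

# Splitting on '/' and dropping the last piece loses the file name and the final slash.
by_split = ['/'.join(u.split('/')[:-1]) for u in raw]
# rstrip removes only trailing '.' and ')' characters.
by_rstrip = [u.rstrip('.)') for u in raw]

print(by_split)
print(by_rstrip)
```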
Suppose I have this URL pattern:
url(r'^delete_group/(\w+)/', 'delete_group_view',name='delete_group')
In template
{%url 'delete_group' 'mwas'%} works, but
{%url 'delete_group' 'mwas 45'%} does not. Is there any way to modify the URL pattern so that it accepts both mwas and mwas 45?
The issue might be your regex. The URL example you're showing has a space in it, and \w won't match spaces. Try this instead: r'^delete_group/([\w\s]+)/', which allows one or more word characters or whitespace characters.
However, know that spaces are not valid in URLs and will likely get converted to %20 or something similar. A best practice is to use hyphens where you would put a space.
I'd also point you at this answer to a similar question.
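The points above can be checked outside Django with plain Python. This is a minimal sketch using only the re and urllib.parse modules; the hyphenated slug at the end is just one possible convention.

```python
import re
from urllib.parse import quote

name = 'mwas 45'

# \w+ stops at the space, so the original pattern cannot match the whole path.
print(re.match(r'^delete_group/(\w+)/', 'delete_group/mwas 45/'))  # None

# Allowing whitespace in the group lets the full name through.
widened = re.match(r'^delete_group/([\w\s]+)/', 'delete_group/mwas 45/')
print(widened.group(1))  # mwas 45

# In an actual URL the space is percent-encoded.
print(quote(name))  # mwas%2045

# Replacing spaces with hyphens avoids the encoding altogether.
slug = name.replace(' ', '-')
print(slug)  # mwas-45
```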
I'm trying to pass a 'string' argument to a view with a url.
The urls.py goes
('^add/(?P<string>\w+)', add ),
I'm having problems with strings including punctuation, newlines, spaces and so on.
I think I have to change the \w+ into something else.
Basically the string will be something copied by the user from a text of his choice, and I don't want to change it. I want to accept any character and special character so that the view acts exactly on what the user has copied.
How can I change it?
Thanks!
Note that you can only use strings that can be understood as proper URLs; it is not a good idea to pass arbitrary strings in the URL path.
I use this regex to allow strings values in my urls:
(?P<string>[\w\-]+)
This allows 'slugs' in your URL (like 'this-is-my_slug').
Well, first off, there are a lot of characters that aren't allowed in URLs. Think ? and spaces for starters. Django will probably prevent these from being passed to your view no matter what you do.
Second, you want to read up on the re module. It is what sets the syntax for those URL matches. \w means any upper or lowercase letter, digit, or _ (basically, identifier characters, except it doesn't disallow a leading digit).
The right way to pass a string to a URL is as a form parameter (i.e. after a ?paramName= in the URL, and with special characters escaped, such as spaces changed to +).
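A small sketch of that escaping with Python's standard library. The parameter name string is taken from the question's pattern and the sample text is made up; how the view reads the parameter is not shown here.

```python
from urllib.parse import urlencode, parse_qs

text = 'Hello, world? Yes & no'

# urlencode escapes special characters and turns spaces into '+'.
query = urlencode({'string': text})
print(query)  # string=Hello%2C+world%3F+Yes+%26+no

# Parsing the query string gives back exactly what the user copied.
print(parse_qs(query)['string'][0] == text)  # True
```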
I need some help with writing a regex validation to check for a specific value
Here is what I have, but it doesn't work:
Regex exists = new Regex(@"MyWebPage.aspx");
Match m = exists.Match(pageUrl);
if(m)
{
//perform some action
}
So I basically want to know when the variable pageUrl contains the value MyWebPage.aspx.
Also, is it possible to combine this check to cover several cases, for instance MyWebPage.aspx, MyWebPage2.aspx, MyWebPage3.aspx?
Thanks!
Try this:
"MyWebPage\d*\.aspx$"
This will allow for any pages called MyWebPage#.aspx where # is zero or more digits.
if (Regex.IsMatch(url, "MyWebPage[^/]*?\\.aspx")) ....
This will match any form of MyWebPageXXX.aspx (where XXX is zero or more characters other than /). It will not match MyWebPage/test.aspx, however.
That RegEx should work in the case that MyWebPage.aspx is in your pageUrl, albeit by accident. You really need to replace the dot (.) with \. to escape it.
Regex exists = new Regex(@"MyWebPage\.aspx");
If you want to optionally match a single number after the MyWebPage bit, then look for the (optional) presence of \d:
Regex exists = new Regex(@"MyWebPage\d?\.aspx");
I won't post a regex, as others have good ones going, but one thing that may be an issue is character case. Regexes are, by default, case-sensitive. The Regex class does have a static overload of the Match function (as well as of Matches and IsMatch) which takes a RegexOptions parameter, allowing you to specify that case should be ignored.
For example, I don't know how you are getting your pageUrl variable but depending on how the user typed the URL in their browser, you may get different casings, which could cause your Regex to not find a match.