Django URL issue with regular expressions - regex

I am new to Python, Django 1.9 and overall regular expressions. So I am trying to write something like this within urls.py
search/doc_name/language/?id
where doc_name, allow for any name/case/length etc. like so: 'My Fave Doc 12'
where language, allow two letters like so: 'en'
where id, allows only numbers.
This is what I have, can someone point out where I went wrong?
url(r'^search/[\w-]+/[a-z]{2}+/(?P<id>[0-9]+)$', '....

The doc_name doesn't allow spaces. Add a space to the character set if you want one, and make sure you put it before the dash ([\w -]+). If other whitespace is allowed, use \s instead ([\w\s-]+).
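Also, the + after {2} in the language part is wrong. Depending on the regex engine it is rejected as an error, treated as a possessive quantifier, or read as a repetition of the pair (which would match any even number of letters). Remove the + and leave only [a-z]{2}. + means repeat one or more times; by default anything is matched exactly once.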
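A minimal sketch of the corrected pattern, checked with Python's re module outside of Django. The names SEARCH_RE and extract_id are made up for illustration; Django matches the path without its leading slash, which is why the test strings start with "search/".

```python
import re

# Corrected pattern: a space is allowed in doc_name, and the language is exactly two letters.
SEARCH_RE = re.compile(r'^search/[\w -]+/[a-z]{2}/(?P<id>[0-9]+)$')


def extract_id(path):
    """Return the captured id as a string, or None if the path does not match."""
    match = SEARCH_RE.match(path)
    return match.group('id') if match else None
```

For example, extract_id('search/My Fave Doc 12/en/42') yields the id, while a three-letter language such as 'eng' is rejected.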

You should really avoid having spaces in your URL. I suggest the following:
url format: /search/<doc_name>/<id>/?lang=<language>
in urls.py:
url(r'^search/(?P<doc_name>[\w]+)/(?P<id>[0-9]+)/$', your_view)
in views.py:
def your_view(request, doc_name, id):
    # doc_name and id are captured from the URL and passed in as keyword arguments
    lang = request.GET.get('lang', 'en')
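To see which values come from the path and which from the query string, here is a rough stand-alone sketch that mimics the suggested URL format using only the standard library. The resolve function is hypothetical and is not how Django is implemented; it only illustrates that the named groups become keyword arguments and that lang is read from the query string with a default of 'en'.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Same pattern as the suggested urls.py entry.
PATTERN = re.compile(r'^search/(?P<doc_name>[\w]+)/(?P<id>[0-9]+)/$')


def resolve(url):
    """Return the keyword arguments a view would receive, or None if no match."""
    parts = urlsplit(url)
    # Django matches against the path without its leading slash.
    match = PATTERN.match(parts.path.lstrip('/'))
    if match is None:
        return None
    kwargs = match.groupdict()
    # The language is taken from the query string, defaulting to 'en'.
    kwargs['lang'] = parse_qs(parts.query).get('lang', ['en'])[0]
    return kwargs
```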

Related

Urlpattern regular expression not working

So I'm trying to make a URL like so
re_path(r'^product_list/(?P<category_slug>[\w-]+)/(?:(?P<filters>[\w~#=]+)?)$', views.ProductListView.as_view(), name='filtered_product_list'),
and at this point it works with things like:
/product_list/sdasdadsad231/bruh=1~3~10#nobruh=1~4
bruh=1~3~10#nobruh=1~4 - those are filters
but later I want to implement search-by-word functionality,
so I want it to recognize things like
/product_list/sdasdadsad231/?filters=bruh-1~3~10&nobruh-1~4&search_for=athing
/product_list/sdasdadsad231/?filters=bruh-1~3~10&nobruh-1~4
/product_list/sdasdadsad231/?search_for=athing
/product_list/sdasdadsad231/
so in different situations it will get filters and/or search_for or nothing at all
You might write the pattern as:
^product_list/(?P<category_slug>[\w-]+)/(?:\??(?P<filters>[\w~#=&-]+)?)$
If you want to match the leading / from the example data, you can append that in the pattern after the ^
The part after the question mark is the query string and does not belong to the path. Django will construct a QueryDict for this, and this will be available through request.GET. Indeed, if the path is for example:
/product_list/sdasdadsad231/?filters=bruh-1~3~10&nobruh-1~4&search_for=athing
Then the ?filters=bruh-1~3~10&nobruh-1~4&search_for=athing is not part of the path, and it will be wrapped in request.GET as a QueryDict that looks like:
>>> QueryDict('filters=bruh-1~3~10&nobruh-1~4&search_for=athing')
<QueryDict: {'filters': ['bruh-1~3~10'], 'nobruh-1~4': [''], 'search_for': ['athing']}>
You thus cannot capture the part after (and including) the question mark: it has already been stripped off the path by the time Django tries to match the re_path(…) and path(…) definitions.
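The same split can be reproduced outside Django with the standard library, as a rough illustration. split_request is a made-up helper name; parse_qs with keep_blank_values=True is used here as an approximation of how the QueryDict above keeps the value-less nobruh-1~4 key.

```python
from urllib.parse import urlsplit, parse_qs


def split_request(url):
    """Split a URL into its path and a dict of query-string parameters."""
    parts = urlsplit(url)
    # keep_blank_values=True keeps keys that have no "=value" part.
    query = parse_qs(parts.query, keep_blank_values=True)
    return parts.path, query
```

Only the path part is what the URL patterns get to see; filters and search_for would then be read from the query dictionary inside the view.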

Regex for Page Filtering in Google Analytics

I'm trying to use GA to filter out certain URL pages. I need to distinguish between pages like this:
www.example.com/hotel/hotelfoofoo
and this:
www.example.com/hotel/hotelfoofoo/various-options-go-here?lots-of-other-stuff-follows
I'm new to regex, so I know very little, but am basically trying to capture URL pages that begin with /hotel/ but do not include any other forward slashes. Is there a way to write that code?
Two possible solutions:
1) Assuming only alphanumeric characters and '-' signs are allowed in the hotel name:
/hotel/([-\w]+)(?![-\/\w])
Note: the hotel name is captured in the first group. The idea here is to capture a run of digits/letters/underscores/'-' symbols that is not followed by a slash.
2) Assuming white space symbol required to designate url end:
/hotel/([^\s/]+)(?=\s)
Note: depending on your regex flavor, some characters may need to be escaped. In JavaScript regex literals every "/" should be escaped, e.g. "\/".
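As a quick check of the first solution, here is a small sketch using Python's re module rather than Google Analytics itself, whose regex support may differ (lookaheads in particular may not be available there). HOTEL_RE and hotel_name are names chosen for this example.

```python
import re

# Solution 1: the hotel name must not be followed by another name character or a slash.
HOTEL_RE = re.compile(r'/hotel/([-\w]+)(?![-/\w])')


def hotel_name(url):
    """Return the hotel name for top-level hotel pages, or None for deeper pages."""
    match = HOTEL_RE.search(url)
    return match.group(1) if match else None
```

The negative lookahead is what rejects the deeper URL: every shorter candidate name is followed by either a word character or the slash.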

RegEx to cut out URL

I try to get an URL from a String of the following format:
RANDOMRUBBISHhttps://www.my-url.com/randomfirstname_randomlastnameRANDOMRUBBISH
I already tried some things, especially lookbehind/lookahead, which I used successfully before on another URL format (starts with https... and ends with .html; this was working).
But seems I'm too stupid to figure out the regex for the kind of string mentioned above. I just want the URL part from https.... to the end of the random last name. Is this even possible?
Any Ideas?
If you can guarantee that randomfirstname_randomlastname is all lowercase and RANDOMRUBBISH is all uppercase, you can use character classes [a-z] and [A-Z]. The language the regex is for will determine how to use these.
This example works in JavaScript:
var str = "RANDOMRUBBISHhttps://www.my-url.com/randomfirstname_randomlastnameRANDOMRUBBISH";
var match = /https:\/\/www\.my-url\.com\/[a-z_]*/.exec(str); // the underscore is included so the match does not stop at "randomfirstname"
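The same idea in Python, as a sketch under the same assumption (lowercase name, uppercase rubbish). The character class includes the underscore that separates first and last name; extract_url is a hypothetical helper name.

```python
import re

# Lowercase letters and underscore only, so the match stops at the first uppercase letter.
URL_RE = re.compile(r'https://www\.my-url\.com/[a-z_]*')


def extract_url(text):
    """Return the embedded URL, or None if the text contains none."""
    match = URL_RE.search(text)
    return match.group(0) if match else None
```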

What is the regex required to find specific urls within content from a list of urls generated by a for loop?

As I write this I realise there are two parts to this question, however I think I am only really stuck on the first part and therefore the second is only provided for context:
Part A:
I need to search the contents of each value returned by a for loop (where each value is a url) for the following:
href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-n"
where:
the numerals 163472311232 could be any length (ie it could be 5478)
-text-text-text-text could be any number of different words
where page-n could be from page-2 up until any number
where matches are not returned more than once, ie only unique matches are returned and therefore only one of the following would be returned:
href="/dir/Sub_Dir/dir/5422-la-la/page-4
href="/dir/Sub_Dir/dir/5422-la-la/page-4
Part B:
So the logic would be something like:
list_of_urls = original_list
for url in list_of_urls:
    headers = {'User-Agent' : 'Mozilla 5.0'}
    request = urllib2.Request(url, None, headers)
    url_for_re = urllib2.urlopen(request).read()
    another_url = re.findall(r'href="(/dir/Sub_dir\/dir/[^"/]*)"', url_for_re, re.I)
    file.write(url)
    file.write('\n')
    file.write(another_url)
    file.write('\n')
Which i am hoping will give me output similar to:
a.html
a/page-2.html
a/page-3.html
a/page-4.html
b.html
b/page-2.html
b/page-3.html
b/page-4.html
So my question is (assuming the logic in part B is ok):
What is the required regex pattern to use for part A?
I am a newbie to python and regex so this will limit my understanding somewhat in regards to relatively complicated regex suggestions etc.
update:
after suggestions i tried to test the following regex which did not produce any results:
import re
content = 'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"'
matches = re.findall(r'href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"', content, re.I)
prefix = 'http://www.test.com'
for match in matches:
    i = prefix + match + '\n'
    print i
solution:
i think this is the regex that will work:
matches = re.findall(r'href="(/dir/Sub_Dir/dir/[^"/]*/page-[2-9])"', content, re.I)
You can have... most of what you want. Regexes don't really do the distinct thing, so I suggest you just use them to get all the URLs, and then remove duplicates yourself.
Off the top of my head it would be something like this:
href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"
Plus or minus escaping rules, specifics on what words are allowed, etc. I'm a Windows guy, there's a great tool called Expresso which is helpful for learning regexes. I hope there's an equivalent for whatever platform you're using, it comes in handy.
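A minimal sketch of that approach in Python: match every link, then drop duplicates afterwards. It assumes the words after the number consist of letters only and allows any number of them, which goes beyond the four-word pattern above; LINK_RE and unique_page_links are names made up for this example.

```python
import re

# One capturing group around the whole path, so findall returns the path itself.
# (?:-[a-zA-Z]+)+ accepts any number of hyphen-separated words;
# page-(?:[2-9]|[1-9][0-9]+) accepts page-2 and upwards, but not page-1.
LINK_RE = re.compile(
    r'href="(/dir/Sub_Dir/dir/[0-9]+(?:-[a-zA-Z]+)+/page-(?:[2-9]|[1-9][0-9]+))"'
)


def unique_page_links(html):
    """Return matching paths in order of first appearance, without duplicates."""
    # dict.fromkeys keeps insertion order, so it removes repeats while preserving order.
    return list(dict.fromkeys(LINK_RE.findall(html)))
```

Because the pattern contains exactly one capturing group, findall returns a list of strings, which can then be prefixed with the site's base URL.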

parse url from string in coldfusion

i need to parse all urls from a paragraph(string)
eg.
"check out this site google.com and don't forget to see this too bing.com/maps"
it should return "google.com and bing.com/maps"
i'm currently using this and its not to perfection.
reMatch("(^|\s)[^\s#]+\.[^\s#\?\/]{2,5}((\?|\/)\S*)?",mystring)
thanks
You need to define more clearly what you consider a URL.
For example, I might use something such as this:
(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?
(use with reMatchNoCase or plonk (?i) at front to ignore case)
Which specifically only allows alphanumerics, underscore, and hyphen in domain and path parts, requires the TLD to be letters only, and only looks for numeric ports.
It might be this is good enough, or you may need something that looks for more characters, or perhaps you want to trim things like quotes, brackets, etc. off the end of the URL, or whatever. It depends on the context of what you're doing as to whether you'd like to err towards missing URLs or detecting non-URLs.
(I'd probably go for the latter, then potentially run a secondary filter to verify if something is a URL, but that takes more work, and may not be necessary for what you're doing.)
Anyhow, the explanation of the above expression is below, hopefully with clear comments to help it make sense. :)
(Note that all groups are non-capturing (?:...) since we don't need the indiv parts.)
# PROTOCOL
(?:https?:)? # optional group of "http:" or "https:"
# SERVER NAME / DOMAIN
(?://)? # optional double forward slash
(?:[\w-]+\.)+ # one or more "word characters" or hyphens, followed by a literal .
# grouped together and repeated one or more times
[a-z]{2,6} # as many as 6 alphas, but at least 2
# PORT NUMBER
(?::\d+)? # an optional group made up of : and one or more digits
# PATH INFO
(?:/[\w.,-]+)* # a forward slash then multiple alphanumeric, underscores, or hyphens
# or dots or commas (add any other characters as required)
# in a group that might occur multiple times (or not at all)
# QUERY STRING
(?:\?\S+)? # an optional group containing ? then any non-whitespace
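For comparison, the commented expression can be written almost verbatim in Python using re.VERBOSE, which ignores whitespace and # comments inside the pattern. This is only a sketch in a different language and regex engine than ColdFusion's; URL_RE and find_urls are names chosen here.

```python
import re

URL_RE = re.compile(r"""
    (?:https?:)?        # PROTOCOL: optional "http:" or "https:"
    (?://)?             # optional double forward slash
    (?:[\w-]+\.)+       # DOMAIN: one or more labels, each followed by a literal dot
    [a-z]{2,6}          # TLD: between 2 and 6 letters
    (?::\d+)?           # PORT: optional colon and digits
    (?:/[\w.,-]+)*      # PATH: zero or more slash-prefixed segments
    (?:\?\S+)?          # QUERY STRING: optional ? followed by non-whitespace
""", re.VERBOSE | re.IGNORECASE)


def find_urls(text):
    """Return every substring of text that looks like a URL under this pattern."""
    # All groups are non-capturing, so findall returns the full matches.
    return URL_RE.findall(text)
```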
Update:
To prevent the end of email addresses being matched, we need to use a lookbehind, to ensure that prior to the URL we don't have an @ sign (or anything else unwanted) but without actually including that prior character in the match.
CF's regex is Apache ORO which doesn't support lookbehinds, but we can use the java.util.regex nice and easily with a component I have created which does support lookbehinds.
Using that is as simple as:
<cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />
...
<cfset Urls = jrex.match( regex , input ) />
After the createObject, it should basically be like using the built-in re~ stuff, but with the slight syntax difference, and the different regex engine under the hood.
(If you have any problems or questions with the component, let me know.)
So, on to your excluding emails from URL matching problem:
We can either do a (?<=positive) or (?<!negative) lookbehind, depending on if we want to say "we must have this" or "we must not have this", like so:
(?<=\s) # there must be whitespace before the current position
(?<!@) # there must NOT be an @ before current position
For this URL example, I would expand either of those examples to:
(?<=\s|^) # look for whitespace OR start of string
or
(?<![@\w/]) # ensure there is not an @ or / or word character.
Both will work (and can be expanded with more chars), but in different ways, so it simply depends which method you want to do it with.
Put whichever one you like at the start of your expression, and it should no longer match the end of abcd@gmail.com, unless I've screwed something up. :)
Update 2:
Here is some sample code which will exclude any email addresses from the match:
<cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />
<cfsavecontent variable="SampleInput">
check out this site google.com and don't forget to see this too bing.com/maps
this is an email@somewhere.com which should not be matched
</cfsavecontent>
<cfset FindUrlRegex = '(?<=\s|^)(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?' />
<cfset MatchedUrls = jrex.match( FindUrlRegex , SampleInput ) />
<cfdump var=#MatchedUrls#/>
Make sure you have downloaded the jre-utils.cfc from here and put in an appropriate place (e.g. same directory as script running this code).
This step is required because the (?<=...) construct does not work in CF regular expressions.
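The lookbehind idea can also be tried in Python, as a sketch in a different engine than the jre-utils component. Python's re module only accepts fixed-width lookbehinds, so the (?<=\s|^) form is rejected there; the negative, single-character form is used instead. The example assumes email addresses are written with a literal @, and find_urls is a hypothetical helper name.

```python
import re

# Negative lookbehind: the match may not start right after @, /, or a word character.
URL_RE = re.compile(
    r'(?<![@\w/])(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}'
    r'(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?',
    re.IGNORECASE,
)


def find_urls(text):
    """Return URL-like substrings, skipping the domain part of email addresses."""
    return URL_RE.findall(text)
```

Forbidding a preceding word character matters as much as forbidding @: without it, the match could simply start one character further into the email's domain.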