I am trying to write a regex that matches any string containing view.php with a GET variable file whose value matches [a-zA-Z0-9_]*. FYI, I need to rewrite this URL to /file/value.
What I tried, but it didn't work: ^view.php\?.*?(&|\?)file=([a-zA-Z0-9_]*).*$
What does work?
Leading ^ means your entire string begins with view.php, which is probably not true.
Also, your regex assumes that file is the last GET parameter, etc.
This regex should capture the value of file in any string:
view\.php\?.*?\bfile=(\w*)\b
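A quick way to sanity-check this pattern is to run it through Python's re module. The sample strings and the helper name file_value are made up for illustration; only the pattern itself comes from the answer above.

```python
import re

# Pattern from the answer above; group 1 captures the value of "file".
FILE_PARAM = re.compile(r"view\.php\?.*?\bfile=(\w*)\b")


def file_value(s):
    """Return the captured value of the file parameter, or None if absent."""
    m = FILE_PARAM.search(s)
    return m.group(1) if m else None


print(file_value("view.php?foo=bar&file=abc_123&x=y"))     # abc_123
print(file_value("http://example.com/view.php?file=abc"))  # abc
print(file_value("view.php?foo=bar"))                      # None
```

Note that \b is a word boundary, not a parameter boundary, so a key such as my-file= would also be accepted by this pattern.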
view.php\? here you accept viewaphp?. Is that ok? You probably mean view\.php. Also, you enforce a question mark at the end, whereas:
(&|\?) here you again enforce either an ampersand or a question mark. Hence, you require something like view.php??file=... or view.php?.*&file=...
What you want is probably something like (although untested, and note the + to not allow empty filenames):
^view\.php\?(?:file=|.*&file=)([a-zA-Z0-9_]+)(?:&|$)
As @Lindrian asked, will this be run against a string starting with view.php or against an entire URL? For the former case, this simple regex should work fine in my opinion (using Python here):
In [1]: import re
In [2]: s = 'view.php?foo=bar&file=blablah123&anotherfoo=anotherbar'
In [3]: re.sub(r'view\.php\?.*\bfile=(\w+).*', r'/file/\g<1>', s)
Out[3]: '/file/blablah123'
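As an alternative to a pure regex, the rewrite could be sketched with Python's urllib.parse, which handles the position of file within the query string for you. The function name rewrite_view_url is hypothetical, and returning None for non-matching input is an assumption about the desired behaviour.

```python
import re
from urllib.parse import urlsplit, parse_qs


def rewrite_view_url(url):
    """Return '/file/<value>' for a view.php URL with a valid file parameter, else None."""
    parts = urlsplit(url)
    # Only rewrite requests whose last path segment is view.php.
    if parts.path.rsplit("/", 1)[-1] != "view.php":
        return None
    values = parse_qs(parts.query).get("file")
    # Require a non-empty value made of [A-Za-z0-9_] only.
    if not values or not re.fullmatch(r"[A-Za-z0-9_]+", values[0]):
        return None
    return "/file/" + values[0]


print(rewrite_view_url("view.php?foo=bar&file=blablah123&anotherfoo=anotherbar"))
# /file/blablah123
```

Because the query string is parsed into key/value pairs first, it does not matter whether file is the first, middle, or last parameter.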
So I'm trying to define a URL pattern like so:
re_path(r'^product_list/(?P<category_slug>[\w-]+)/(?:(?P<filters>[\w~#=]+)?)$', views.ProductListView.as_view(), name='filtered_product_list'),
and at this point it works with things like:
/product_list/sdasdadsad231/bruh=1~3~10#nobruh=1~4
bruh=1~3~10#nobruh=1~4 - those are filters
but later I want to implement search-by-word functionality,
so I want it to recognize things like
/product_list/sdasdadsad231/?filters=bruh-1~3~10&nobruh-1~4&search_for=athing
/product_list/sdasdadsad231/?filters=bruh-1~3~10&nobruh-1~4
/product_list/sdasdadsad231/?search_for=athing
/product_list/sdasdadsad231/
so in different situations it will get filters and/or search_for, or nothing at all.
You might write the pattern as:
^product_list/(?P<category_slug>[\w-]+)/(?:\??(?P<filters>[\w~#=&-]+)?)$
If you want to match the leading / from the example data, you can append that in the pattern after the ^
The part after the question mark is the query string and does not belong to the path. Django will construct a QueryDict for this, and this will be available through request.GET. Indeed, if the path is for example:
/product_list/sdasdadsad231/?filters=bruh-1~3~10&nobruh-1~4&search_for=athing
Then the ?filters=bruh-1~3~10&nobruh-1~4&search_for=athing is not part of the path, and it will be wrapped in request.GET as a QueryDict that looks like:
>>> QueryDict('filters=bruh-1~3~10&nobruh-1~4&search_for=athing')
<QueryDict: {'filters': ['bruh-1~3~10'], 'nobruh-1~4': [''], 'search_for': ['athing']}>
You thus cannot capture the part after (and including) the question mark: it is already stripped off the path before matching against the re_path(…) and path(…) definitions.
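To illustrate the split without needing a Django project, the standard library can mimic it: urlsplit separates the path from the query string, and parse_qs plays a role similar to QueryDict. The helper name split_request and the path-only pattern below are assumptions for this sketch; in a real view you would read request.GET instead.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Path-only pattern: the query string never reaches the URL resolver.
PATH_RE = re.compile(r"^product_list/(?P<category_slug>[\w-]+)/$")


def split_request(url):
    """Return (category_slug, filters, search_for) for a request URL."""
    parts = urlsplit(url)
    match = PATH_RE.match(parts.path.lstrip("/"))
    slug = match.group("category_slug") if match else None
    # keep_blank_values=True keeps keys without a value, like QueryDict does.
    params = parse_qs(parts.query, keep_blank_values=True)
    filters = params.get("filters", [None])[0]
    search_for = params.get("search_for", [None])[0]
    return slug, filters, search_for


print(split_request("/product_list/sdasdadsad231/?filters=bruh-1~3~10&nobruh-1~4&search_for=athing"))
# ('sdasdadsad231', 'bruh-1~3~10', 'athing')
print(split_request("/product_list/sdasdadsad231/"))
# ('sdasdadsad231', None, None)
```

All four example URLs from the question resolve to the same path, so a single route is enough and the optional values come from the query parameters.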
I need to use Python to match a URL in my text file.
However, there is a special case:
i like 🤣pic.twitter.com/Sex8JaP5w5/a7htvq🤣
In this case I would like to keep the emoji next to the url and just match the url in the middle.
Ideally, I would like to have result like this:
i like 🤣<url>🤣
Since I am new to this, this is what I have so far.
pattern = re.compile("([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])")
but the returned result is unsatisfactory, like this:
i like 🤣<url>Sex8JaP5w5/a7htvq🤣
Would you please help me with this? Thank you so much
A solution using existing packages:
from urlextract import URLExtract
import emoji

def remove_emoji(text):
    # get_emoji_regexp() is only available in emoji < 2.0;
    # newer releases offer emoji.replace_emoji(text, replace='') instead.
    return emoji.get_emoji_regexp().sub(r'', text)

extractor = URLExtract()
source = "i like 🤣pic.twitter.com/Sex8JaP5w5/a7htvq🤣 "
urlsWithEmojis = extractor.find_urls(source)
urls = list(map(remove_emoji, urlsWithEmojis))
print(urls)
Output:
['pic.twitter.com/Sex8JaP5w5/a7htvq']
Inspired by How do you extract a url from a string using python? and removing emojis from a string in Python
It looks like you are missing * or + on the last matching group, so it only matches one character. So you want "([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])*" or "([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])+".
Now I don't know if this regex is simplified for your case, but it does not match all urls. For an example of that check out https://www.regextester.com/20
If you are attempting to match any url I would recommend rethinking your problem and trying to simplify down to more specific types of urls, like the example you provided.
EDIT: Also, why (.com)+? Is there really a case where multiple ".com"s appear, like .com.com.com?
Also, I think you have a small typo and it is supposed to be (\.com). Since you already have ([:///a-zA-Z////\.])+ in front of it, it could be reduced to (com); however, I think the explicit (\.com) makes the expression easier to read.
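Following the advice to narrow the pattern to the kind of URL in the example, here is a minimal sketch restricted to .com hosts with an optional ASCII path. The character classes and the name URL_RE are assumptions and will not cover every URL; the point is that an ASCII-only class stops at the emoji automatically.

```python
import re

# Host ending in .com, then an optional path made of ASCII URL characters.
# Emoji are not in either character class, so the match stops before them.
URL_RE = re.compile(r"[A-Za-z0-9.-]+\.com(?:/[A-Za-z0-9_./-]*)?")

text = "i like 🤣pic.twitter.com/Sex8JaP5w5/a7htvq🤣"
print(URL_RE.sub("<url>", text))  # i like 🤣<url>🤣
```

The same expression works with findall if you need the URLs themselves rather than a placeholder.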
I need to put together a regex that matches a pattern only if the string does not begin with 'N'.
Here is my pattern so far [A-E]+[-+]?.
Now I want to make sure that it does not match something like:
N\A
NA
NB+
NB-
NCAB
This is for REGEXP_SUBSTR command in Oracle SQL DB
UPDATE
It looks like I should have been more specific, sorry
I want to extract from a string [A-E]+[-+]? but if the string also matches ^(N|n) then I want my regex to return nothing.
See examples below:
String    Returns
N/A       (nothing)
F1/AAA    AAA
NABC      (nothing)
FABC      ABC
To match a character between A and E not preceded by N, you can use:
([^N]|^)[A-E]+
If you want to avoid fields that contain N[A-E], use a negation in your query with the pattern N[A-E] (in other words, use two predicates: this one to exclude NA and the first to find A).
To be more clear:
WHERE NOT REGEXP_LIKE(coln, 'N[A-E]') AND REGEXP_LIKE(coln, '[A-E]')
OK, I figured it out. I broadened the scope of the problem a little: I realized that I can also use the other parameters of REGEXP_SUBSTR, so that only the second subexpression is returned.
REGEXP_SUBSTR(field1, '^([^NA-D][^A-D]*)?([A-D]+[-+]?)',1,1,'i',2)
I still have to give you guys the credit; a lot of good ideas led me here.
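The requirement from the update (return nothing when the string starts with N or n, otherwise extract [A-E]+[-+]?) can be sanity-checked outside the database. This is a Python sketch of the two-step logic, not Oracle SQL, and the function name extract_grade is made up for illustration.

```python
import re


def extract_grade(s):
    """Return the first [A-E]+[-+]? run, or None if s starts with N/n."""
    # Step 1: reject strings that begin with N or n.
    if re.match(r"[Nn]", s):
        return None
    # Step 2: extract the first run of A-E, optionally followed by + or -.
    m = re.search(r"[A-E]+[-+]?", s)
    return m.group(0) if m else None


for value in ["N/A", "F1/AAA", "NABC", "FABC"]:
    print(value, "->", extract_grade(value))
# N/A -> None
# F1/AAA -> AAA
# NABC -> None
# FABC -> ABC
```

Splitting the check into a rejection step and an extraction step mirrors the two-predicate idea and avoids packing both conditions into one expression.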
Just throw a [^N]? in front. That should do it.
OOPS...
That actually needs to include an " OR ^ "...
It should look like this:
([^N]|^)[A-E]+[-+]?
Sorry about that...It looks like the right answer already got posted anyway.
Got this:
<TAG>something one</TAG><TAG>something two</TAG><TAG>something three</TAG>
I want to match only: something two
I tried: (?<=<TAG>)(.*two.*)(?=<\/TAG>)
but got:
something one</TAG><TAG>something two</TAG><TAG>something three
Maybe another example helps:
RECORDsomething beetwenRECORD RECORDanything beetwenRECORD etc.
I want to get the words between RECORD
You can use
<TAG>.+?<TAG>(.*?)</TAG>
Your something two is captured in group $1 of the first match
Try this:
(?<=</TAG><TAG>)[^<]*(?=</TAG><TAG>)
As already said, parsing HTML using regular expressions is discouraged! There are plenty of HTML parsers for doing this. But if you want a regex at all costs, here is how I would do it in Python:
In [1]: import re
In [2]: s = '<TAG>something one</TAG><TAG>something two</TAG><TAG>something three</TAG>'
In [3]: re.findall(r'(?<=<TAG>).*?(?=</TAG>)', s)[1]
Out[3]: 'something two'
However, this solution only works if you always want to extract the content of the second tag pair. But as I said, don't do this.
If you know that the TAG is not the first and not the last, and your regex flavor supports variable-length lookbehind (.NET does, for example), you can do
(?<=.+<TAG>)(.*two.*)(?=<\/TAG>.+)
Of course, it's much better to capture the tags as well and use a capturing group
.*<TAG>(.*two.*?)<\/TAG>
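If the goal is to select the tag by its content rather than by its position, one option is to forbid < inside the captured text so a match cannot run across tag boundaries. This is only a sketch for the flat, non-nested input from the question; the variable names are made up.

```python
import re

s = "<TAG>something one</TAG><TAG>something two</TAG><TAG>something three</TAG>"

# [^<]* cannot cross into the next tag, so only the element containing "two" matches.
print(re.findall(r"<TAG>([^<]*two[^<]*)</TAG>", s))  # ['something two']

# The RECORD example: a lazy quantifier stops at the nearest closing delimiter.
records = "RECORDsomething beetwenRECORD RECORDanything beetwenRECORD"
print(re.findall(r"RECORD(.*?)RECORD", records))
# ['something beetwen', 'anything beetwen']
```

The original attempt failed because .* is greedy and happily spans several tags; either a negated character class or a lazy quantifier keeps each match inside one pair of delimiters.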
In short, I need to match all URLs in a block of text that are for a certain domain and don't contain a specific querystring parameter and value (refer=twitter).
I have the following regex to match all URLs for the domain.
\b(https?://)?([a-z0-9-]+\.)*example\.com(/[^\s]*)?
I just can't get the last part to work
(?![&?]refer=twitter)\b(https?://)?([a-z0-9-]+\.)*example\.com(/[^\s]*)?
So the following SHOULD match
example.com
http://example.com/
https://www.example.com#link
www.example.com?somevalue=foo
But these should NOT
https://www.anotherexample.com#link
www.example.com?refer=twitter
EDIT:
And if you can get it to match the
http://example.com?foo=foo.bar
out of a sentence like
For examples go to http://example.com?foo=foo.bar.
without picking up the period, that would be great!
EDIT2:
Fixed the trailing period issue with this
\b(https?://)?([a-z0-9-]+\.)*example\.com/?([^\s]*[^.])?
EDIT3:
This seems to work, at least for 99% of the tests I've thrown at it
(?!\b.*[&?]refer=twitter)\b(https?://)?([a-z0-9-]+\.)*example\.com/?([^\s]*[^.])?
EDIT4:
Settled on
\b(?!.*[&?]refer=twitter)(https?://)?([a-z0-9-]+\.)*nygard\.com(?!\.)[^\s]*\b+
(?!\b.*[&?]refer=twitter)
Is what you're looking for.
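One thing worth checking with this lookahead is how far .* reaches: it scans to the end of the line, so a refer=twitter URL later on the same line also suppresses earlier URLs. The sketch below compares it with a variant that uses \S* to stay inside the current token. The sample text and the names LINE_WIDE and TOKEN_ONLY are assumptions for illustration.

```python
import re

DOMAIN = r"(?:https?://)?(?:[a-z0-9-]+\.)*example\.com\S*"

# Lookahead with .* inspects the rest of the line.
LINE_WIDE = re.compile(r"\b(?!.*[&?]refer=twitter)" + DOMAIN)
# Lookahead with \S* inspects only the current whitespace-delimited token.
TOKEN_ONLY = re.compile(r"\b(?!\S*[&?]refer=twitter)" + DOMAIN)

text = "see example.com/a and www.example.com?refer=twitter"

print(LINE_WIDE.findall(text))   # []
print(TOKEN_ONLY.findall(text))  # ['example.com/a']
```

Which behaviour is right depends on the input: if each URL sits on its own line the two are equivalent, otherwise the token-limited form is probably closer to the stated goal.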
To be honest, at first the thought of using a regex didn't even cross my mind (which is a good sign - using a regex must, IMO, always be a secondary option, not primary). Here is how I'd do it in my language of choice (Python 3 shown; in Python 2 the import is from urlparse):
>>> from urllib.parse import urlparse, parse_qs
>>> p = urlparse(r'http://foo.bar.com/baz?refer=twitter&rock=paper')
>>> parse_qs(p.query)
{'refer': ['twitter'], 'rock': ['paper']}
You can do anything from here.
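To make that concrete, here is one way the parsing approach might be combined with a loose candidate regex: find anything that looks like an example.com URL, then let urllib.parse decide whether to keep it. The function name wanted_urls, the trailing-punctuation set, and the candidate pattern are all assumptions of this sketch.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Loose candidate matcher; the real filtering happens after parsing.
CANDIDATE = re.compile(r"\b(?:https?://)?(?:[a-z0-9-]+\.)*example\.com\S*", re.I)


def wanted_urls(text):
    """Return example.com URLs in text that do not carry refer=twitter."""
    result = []
    for m in CANDIDATE.finditer(text):
        url = m.group(0).rstrip(".,;:!?")  # drop sentence punctuation
        # urlsplit needs a scheme or a leading // to recognise the host.
        parts = urlsplit(url if "://" in url else "//" + url)
        host = (parts.hostname or "").lower()
        if host != "example.com" and not host.endswith(".example.com"):
            continue
        if "twitter" in parse_qs(parts.query).get("refer", []):
            continue
        result.append(url)
    return result


sample = ("example.com http://example.com/ https://www.example.com#link "
          "www.example.com?somevalue=foo https://www.anotherexample.com#link "
          "www.example.com?refer=twitter For examples go to http://example.com?foo=foo.bar.")
print(wanted_urls(sample))
# ['example.com', 'http://example.com/', 'https://www.example.com#link',
#  'www.example.com?somevalue=foo', 'http://example.com?foo=foo.bar']
```

Keeping the regex permissive and doing the exclusion on parsed query parameters avoids the lookahead entirely, and the trailing period is handled by a plain rstrip.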