regular expression in python for match url - regex

I need to use python to match url in my text file.
However, there is a special case:
i like 🤣pic.twitter.com/Sex8JaP5w5/a7htvq🤣
In this case I would like to keep the emoji next to the url and just match the url in the middle.
Ideally, I would like to have result like this:
i like 🤣<url>🤣
Since I am new to this, this is what I have so far.
pattern = re.compile("([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])")
but the return result is something unsatisfied like this:
i like 🤣<url>Sex8JaP5w5/a7htvq🤣
Would you please help me with this? Thank you so much

A solution using existing packages:
from urlextract import URLExtract
import emoji
def remove_emoji(text):
return emoji.get_emoji_regexp().sub(r'', text)
extractor = URLExtract()
source = "i like 🤣pic.twitter.com/Sex8JaP5w5/a7htvq🤣 "
urlsWithEmojis = extractor.find_urls(source)
urls = list(map(remove_emoji, urlsWithEmojis))
print(urls)
output
['pic.twitter.com/Sex8JaP5w5/a7htvq']
Try it Online!
Inspired by How do you extract a url from a string using python? and removing emojis from a string in Python

If looks like you are missing * or+ at the last matching group so it only matches one character. So you want "([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])*" or "([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])+".
Now I don't know if this regex is simplified for your case, but it does not match all urls. For an example of that check out https://www.regextester.com/20
If you are attempting to match any url I would recommend rethinking your problem and trying to simplify down to more specific types of urls, like the example you provided.
EDIT: Also why (.com)+? Is there really a case where multiple ".com"s appear like .com.com.com
Also I think you have small typo and it is supposed to be (\.com). But since you have ([:///a-zA-Z////\.])+ it could be reduced to (com), however i think the explicit (\.com) makes it an easier expression to read.

Related

Urlpattern regular expression not working

So i'm trying to make url like so
re_path(r'^product_list/(?P<category_slug>[\w-]+)/(?:(?P<filters>[\w~#=]+)?)$', views.ProductListView.as_view(), name='filtered_product_list'),
and at this point it works with things like:
/product_list/sdasdadsad231/bruh=1~3~10#nobruh=1~4
bruh=1~3~10#nobruh=1~4 - those are filters
but later i want to implement search by word functionality
so i want it recognize things like
/product_list/sdasdadsad231/?filters=bruh-1~3~10&nobruh-1~4&search_for=athing
/product_list/sdasdadsad231/?filters=bruh-1~3~10&nobruh-1~4
/product_list/sdasdadsad231/?search_for=athing
/product_list/sdasdadsad231/
so in different situations it will get filters and/or search_for or nothing at all
You might write the pattern as:
^product_list/(?P<category_slug>[\w-]+)/(?:\??(?P<filters>[\w~#=&-]+)?)$
Regex demo
If you want to match the leading / from the example data, you can append that in the pattern after the ^
The part after the question mark is the query string [wiki] and does not belong to the path. Django will construct a QueryDict for this, and this will be available through request.GET [Django-doc]. Indeed, if the path is for example:
/product_list/sdasdadsad231/?filters=bruh-1~3~10&nobruh-1~4&search_for=athing
Then the ?filters=bruh-1~3~10&nobruh-1~4&search_for=athing is not part of the path, and it will be wrapped in request.GET as a QueryDict that looks like:
>>> QueryDict('filters=bruh-1~3~10&nobruh-1~4&search_for=athing')
<QueryDict: {'filters': ['bruh-1~3~10'], 'nobruh-1~4': [''], 'search_for': ['athing']}>
You thus can not capture the part after (and including) the question mark, this is already stripped of the path when trying to match with the re_path(…) and path(…) definitions.

Regular expressions (RegEx) to filter string from URLs in Google Analytics

I want to filter a string from the URLs in Google Analytics. This can be done using the Views > Filter > Exclude using RegEx, but I have been unable to get it to work.
An outline of how these filters are set up, can be found here, however, I can not work out how to isolate the string using RegEx. I believe it will need to be one filter per URL type.
The URLs follow this format:
/software/11F372288FA/pagename
/software/13F412C5FA/pagename/summary
/software/XIL1P0BFXCKM81/pagename2
I need to exclude this part of the URL:
/11F372288FA/
So that the URL data (e.g. Session time) is recorded against:
/software/pagename
/software/pagename/summary
/software/pagename2
I have worked out that I can isolate the string using thing following RegEx
^\/validate\/(..........)\/accounts\/summary$
It is not very elegant and would require a filter for every URL type.
Thanks for the help!
I'm not certain if this will work in your exact case but instead of using regex for this it might be easier to just create a new string from the start to the end of "software" and append everything from pagename to the end. In Java this might look something like:
String newString = oldString.substring(0, 9) + oldString.substring(oldString.indexOf("pagename"));
Take note though that this will only work if the "software" at the start is always the same length and you are actually only excluding things between "software" and "pagename".

Python regex to ignore what is between quotes

I have a string, part of which is surrounded within quotes. Like the one at the third line of the code snippet below. I want the string to be formatted into a dict literal. Meaning wherever the quotes are missing, they should be added. But the part which is within the quotes has to be ignored. I came up with the code below to handle this:
from ast import literal_eval
from re import sub
str = "key1:[val1,val2,val3],key2:'val4A,val4B'"
str = sub(r"([\w\-\.]+|[\"'].*[\"'])", r"'\1'", f"{{{str}}}")
str = sub(r"[\"']{2,}(.*)[\"']{2,}", r"'\1'", str)
fin = literal_eval(str)
print(fin)
This code does the work, but I want to know if there is a way to achieve this with one time usage of sub. Before you mark this as a duplicate, I tried a large number of the solutions provided on the web including positive and negative look ahead and look behind, exclusion, and simple negative match. Couldn't find any which would work. If there is a solution I have missed or anyone has a solutions, I would highly appreciate knowing about it.
Try this ([\w\-\.]+(?=(?:[^']*'[^']*')*[^']*$)) :
Live Demo

Finding a group of words using Regular Expressions

I am using python to get user input and then by using regular expressions I want to check for certain words. In this case I want to check how the user is feeling and then store it in a list. The problem is that when I print the list it is empty.
import re
phrase = raw_input("How are you feeling ")
phrase = phrase.lower()
feel=(re.findall(r^(?=.*\bsad\b)(?=.*\bhappy\b)(?=.*\bjoyful\b)(?=.*\bmad\b)(?=.*\bsad\b), phrase))
print feel
I'm not a python expert, but am fairly decent with regex. Why wouldn't you just use something like:
\b(happy|sad|joyful|mad)\b
Add chars to match
...(?=.*\bsad\b).*

What is the regex required to find specific urls within content from a list of urls generated by a for loop?

As I write this I realise there are two parts to this question, however I think I am only really stuck on the first part and therefore the second is only provided for context:
Part A:
I need to search the contents of each value returned by a for loop (where each value is a url) for the following:
href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-n"
where:
the numerals 163472311232 could be any length (ie it could be 5478)
-text-text-text-text could be any number of different words
where page-n could be from page-2 up until any number
where matches are not returned more than once, ie only unique matches are returned and therefore only one of the following would be returned:
href="/dir/Sub_Dir/dir/5422-la-la/page-4
href="/dir/Sub_Dir/dir/5422-la-la/page-4
Part B:
So the logic would be something like:
list_of_urls = original_list
for url in list_of_urls:
headers = {'User-Agent' : 'Mozilla 5.0'}
request = urllib2.Request(url, None, headers)
url_for_re = urllib2.urlopen(request).read()
another_url = re.findall(r'href="(/dir/Sub_dir\/dir/[^"/]*)"', url_for_re, re.I)
file.write(url)
file.write('\n')
file.write(another_url)
file.write('\n')
Which i am hoping will give me output similar to:
a.html
a/page-2.html
a/page-3.html
a/page-4.html
b.html
b/page-2.html
b/page-3.html
b/page-4.html
So my question is (assuming the logic in part B is ok):
What is the required regex pattern to use for part A?
I am a newbie to python and regex so this will limit my understanding somewhat in regards to relatively complicated regex suggestions etc.
update:
after suggestions i tried to test the following regex which did not produce any results:
import re
content = 'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"'
matches = re.findall(r'href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"', content, re.I)
prefix = 'http://www.test.com'
for match in matches:
i = prefix + match + '\n'
print i
solution:
i think this is the regex that will work:
matches = re.findall(r'href="(/dir/Sub_Dir/dir/[^"/]*/page-[2-9])"', content, re.I)
You can have... most of what you want. Regexes don't really do the distinct thing, so I suggest you just use them to get all the URLs, and then remove duplicates yourself.
Off the top of my head it would be something like this:
href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9])|([1-9][0-9]+)"
Plus or minus escaping rules, specifics on what words are allowed, etc. I'm a Windows guy, there's a great tool called Expresso which is helpful for learning regexes. I hope there's an equivalent for whatever platform you're using, it comes in handy.