Python Regex List into Another List - regex

I currently have a piece of code that runs mostly as I expect, except that it prints both the original list and the filtered one. Essentially, I am trying to read URLs from a webpage and store them in a list (called match; this part works fine), and then filter that list into a new list (called fltrmtch), because the original contains all of the extra href tags etc.
For example, at the moment it prints both A and B, but I'm only after B:
A Core Development',
B'http://docs.python.org/devguide/'),
Here's the code:
import urllib # Python 2 module that provides urlopen
url = "URL WOULD BE IN HERE BUT NOT ALLOWED TO POST MULTIPLE LINKS" # Name of the url being searched
webpage = urllib.urlopen(url)
content = webpage.read() # places the read url contents into variable content
import re # Imports the re module, which allows searching for matches.
import pprint # This import allows all list items to be printed on separate lines.
match = re.findall(r'\<a.*href\=.*http\:.+', content) # matches any content that begins with a href and ends in >
def filterPick(list, filter):
    return [( l, m.group(1) ) for l in match for m in (filter(l),) if m]
regex=re.compile(r'\"(.+?)\"').search
fltrmtch = filterPick(match, regex)
try:
    if match: # if there is a match, the code below is run.
        print "The number of URL's found is:" , len(match)
        match.sort()
        print "\nAnd here are the URL's found: "
        pprint.pprint(fltrmtch)
except:
    print "No URL matches have been found, please try again!"
Any help would be much appreciated.
Thank you in advance.
UPDATE: Thank you for the answer; however, I managed to find the flaw myself, in this line:
return [( l, m.group(1) ) for l in match for m in (filter(l),) if m]
I simply had to remove the l, from [( l, m.group(1) ), so that only m.group(1) is returned. Thanks again.

It appears that the bottom portion of your code is mostly catching errors from the top portion, and that the regex you provided has no capturing groups. Here is a revised example:
import re
import urllib # Python 2; in Python 3, use urllib.request.urlopen instead
url = "www.site.com" # place real web address here
# read web page into string
page = urllib.urlopen(url).read()
# use regex to extract URLs from <a href=""> patterns;
# finditer yields match objects (findall would return plain tuples, which have no .group)
matches = re.finditer(r'''\<a\s[^\>]*?\bhref\=(['"])(.+?)\1[^\>]*?\>''', page, re.IGNORECASE)
# keep only the second group (the URL) of each match
matches = sorted(match.group(2) for match in matches)
# print matches if they exist
if matches:
    print("The number of URL's found is: " + str(len(matches)))
    print("\nAnd here are the URL's found:")
    # print each match
    print('\n'.join(matches))
else:
    print('No URL matches have been found, please try again!')
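For readers who want to try the capture-group idea without fetching a page, here is a minimal Python 3 sketch. It assumes a small hard-coded HTML snippet (SAMPLE_HTML) in place of the downloaded content; the names HREF_RE and extract_links are made up for this illustration and are not part of the original answer.

```python
import re

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = '''
<p><a class="ref" href="http://docs.python.org/devguide/">Core Development</a></p>
<p><a href='http://www.python.org/'>Python</a></p>
'''

# Group 1 captures the opening quote so \1 can require the same closing quote;
# group 2 captures the URL itself.
HREF_RE = re.compile(r'''<a\s[^>]*?\bhref=(['"])(.+?)\1''', re.IGNORECASE)

def extract_links(html):
    """Return the sorted href values of all <a> tags found in html."""
    return sorted(m.group(2) for m in HREF_RE.finditer(html))

links = extract_links(SAMPLE_HTML)
print("Number of URLs found:", len(links))
print("\n".join(links))
```

Because only group 2 is kept, the surrounding tag text and link label never reach the result list, which is the filtering the question was after.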

Related

How to use re.search on a list?

I have tried changing re.search to re.match and so on, but it still shows "No match result" no matter what I type.
I think there could be a problem in the code, since I wrote it without fully comprehending the concept behind it.
Basically, I am trying to build a "search engine" that, given a word, finds all the names in which that word appears. Can someone tell me what is wrong?
import re
searchlist=[ *insert name here* ]
word_s = input("Search : ")
search_list = re.compile(r'\b(?:%s)\b' % '|'.join(searchlist), re.I|re.M)
result = re.search(search_list, word_s)
if result:
    print("Match Result: ", result.group())
else:
    print("No match result.")
Your last comment shows the problem:
In your code, searchlist is a list of the search terms (the things the regex searches for), not the list of strings to be searched.
For example:
searchlist = ["Fundamentals", "Engineering"]
search_list = re.compile(r'\b(?:%s)\b' % '|'.join(searchlist), re.I|re.M)
Now search_list is \b(?:Fundamentals|Engineering)\b, so it can be used as a regex that checks whether any of those terms appears in word_s:
result = re.search(search_list, word_s)
You want to do the exact opposite:
books = ["Fundamentals of Organic Chemistry, International Edition", "Engineering Mechanics: Statics In SI Units"]
word_s = input("Search for: ")
word_re = re.compile(r"\b{}\b".format(word_s), re.I)
for book in books:
    if re.search(word_re, book):
        print("First Match Result: ", book)
        break # Abort search after first match
else: # Only executed if the for loop was exhausted
    print("No match result.")
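Since the question asks for all matching names rather than just the first, a variation along these lines might be closer to what is wanted. This is only a sketch: the function name find_matches is invented here, and the use of re.escape is an added assumption so that punctuation typed by the user is treated literally.

```python
import re

def find_matches(word, names):
    """Return every name that contains word as a whole word (case-insensitive)."""
    # re.escape keeps characters such as '.' or '+' in the user's input literal.
    word_re = re.compile(r"\b{}\b".format(re.escape(word)), re.IGNORECASE)
    return [name for name in names if word_re.search(name)]

books = [
    "Fundamentals of Organic Chemistry, International Edition",
    "Engineering Mechanics: Statics In SI Units",
]
print(find_matches("mechanics", books))
print(find_matches("chem", books))
```

The second call returns an empty list because \b requires a whole-word match, so "chem" does not match inside "Chemistry".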

Entire text is matched but not able to group in named groups

I have the following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in the following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written the following regex for doing this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but does not group it correctly into the named groups.
I tested it on https://regex101.com/
What am I missing here?
I think the main problem is that you are trying to write a regex that matches ALL the key=value pairs at once. That's not the way to do it. The correct way is to write a pattern that matches ONLY ONE key=value pair and to apply it with a function that finds all occurrences of the pattern. Every language supplies such a function. Here's the code in Python, for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alphanumeric characters, (\w+), followed by an equals sign and a value. The value is matched by ([^|]+), that is, everything except a vertical bar, because the value can contain non-alphanumeric characters, such as the dots in the IP address.
Mind the findall function. There's a search function that finds the first occurrence of a pattern, and there's a findall function that finds all occurrences within the text.
I tested it on regex101 and it worked.
I must comment, though, that the specific text format you are working with doesn't require a regex. All high-level languages supply a split function. You can split on the vertical bar, and then split each piece you get (except the first one) again on the equals sign.
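To make the difference between search and findall concrete, here is a small sketch that runs both on the example text. The variable names (pair_re, first, app_name) are chosen for this illustration only.

```python
import re

txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pair_re = re.compile(r'(\w+)=([^|]+)')

# search stops at the first occurrence and returns a single match object.
first = pair_re.search(txt)
print(first.groups())

# findall scans the whole string and returns one tuple per occurrence.
pairs = pair_re.findall(txt)
print(pairs)

# The application name is simply everything before the first '|'.
app_name = txt.split('|', 1)[0]
print(app_name)
```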
Use the PyPI regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'ip_address': '10.10.10.10', 'key1': 'value1', 'user_id': 'testuser'})]
See the Python demo online.
The .captures() method returns all the values captured by a group across all of its iterations.
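For comparison, the following sketch shows what the standard-library re module does with the same kind of pattern, which may explain the behaviour seen in the question. It assumes the pattern is rewritten with the (?P<name>...) syntax, since re does not accept (?<name>...).

```python
import re

s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
# Same pattern, written with the (?P<name>...) syntax that re requires.
rx = r"(?P<appName>\w+)(?:\|(?P<key>\w+)=(?P<value>[^|]+))+"

m = re.fullmatch(rx, s)
# The whole string matches, but each repeated group holds only its last capture.
print(m.group("appName"))
print(m.group("key"), m.group("value"))
```

The earlier key=value pairs are consumed by the match but are no longer retrievable from the match object, which is why a per-pair pattern with findall, or the regex module's captures, is needed.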
I'm not sure, but a regular expression might be unnecessary here; splitting along these lines,
data='my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x= data.split('|')
appName = []
for index,item in enumerate(x):
    if index>0:
        element = item.split('=')
        temp = {"key":element[0],"value":element[1]}
        appName.append(temp)
appName = str(x[0] + ',' + str(appName))
print(appName)
might return an output similar to the desired output:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
This uses a dict for each pair:
temp = {"key":element[0],"value":element[1]}
temp can be changed to any other data structure you would like to have.
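As one example of a different data structure, the sketch below collects the pairs into a single dict keyed by field name. The function name parse_record is made up for this illustration, and using str.partition (rather than split) is an assumption made so that a value containing '=' would not break the parsing.

```python
def parse_record(record):
    """Split 'app|k1=v1|k2=v2' into (app_name, {'k1': 'v1', 'k2': 'v2'})."""
    app_name, *fields = record.split('|')
    pairs = {}
    for field in fields:
        # partition splits on the first '=' only, so values may contain '='.
        key, _, value = field.partition('=')
        pairs[key] = value
    return app_name, pairs

app_name, pairs = parse_record('my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10')
print(app_name)
print(pairs)
```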

how do you extract a certain part of a URL in python?

I am looking to extract only a portion of a patterned URL:
https://<some_server>/server/rest/services/<that part I need>/<other data I do not need>/...
I just need to extract the portion after 'rest/services/'. The first part can change and the last part can change, but the URL will always contain 'rest/services/', and the part I need comes right after it, followed by '/...'.
What you can do is turn this string into a list by splitting it, and then access the nth part.
m = 'https://<some_server>/server/rest/services/<that part I need>/<other data I do not need>'
a = m.split('/')
a[6] # This is the part that you want
You could try this:
(?<=rest/services/) # look behind for this
[^/]+ # anything except a '/'
import re
rgxp = re.compile(r'(?<=rest/services/)[^/]+')
print(re.findall(rgxp, text)) # text is the URL string to search
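If the number of path segments before 'rest/services/' can vary, a fixed index may not be reliable. A possible alternative without regex is sketched below; the function name service_name and the example URL are hypothetical and only serve to illustrate the idea.

```python
def service_name(url, marker='rest/services/'):
    """Return the path segment that directly follows marker, or None if absent."""
    # partition returns (before, marker, after); sep is '' when marker is missing.
    _, sep, rest = url.partition(marker)
    if not sep:
        return None
    # Keep only the text up to the next '/'.
    return rest.split('/', 1)[0]

# Hypothetical URL following the pattern described in the question.
print(service_name('https://example.com/server/rest/services/Parcels/MapServer/0'))
```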

Splitting a string in Python based on a regex pattern

I have a bytes object that contains urls:
> body.decode("utf-8")
> 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
I need to split it into a list with each url as a separate element:
import re
pattern = '^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$'
urls = re.compile(pattern).split(body.decode("utf-8"))
What I get is a list of one element with all urls pasted together:
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n']
How do I split each url into a separate element?
Try splitting it on \s+.
Here is some sample Python code:
import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
urls = re.compile(r'\s+').split(s) # raw string, so \s is not treated as a string escape
print(urls)
This outputs,
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/', '']
Does this result look OK? If not, we can work on it to get what you want.
In case you don't want the empty string ('') in your result list (it is caused by the \r\n at the end), you can use findall to find all the URLs in your string. Sample Python code for this follows:
import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
urls = re.findall(r'http.*?(?=\s+)', s)
print(urls)
This gives the following output:
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/']
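Another option that avoids regex entirely is the built-in str.split called without arguments. The sketch below assumes body is the bytes object from the question, reproduced here as a literal.

```python
body = (b'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\n'
        b'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-'
        b'on-a-tesla-model-3-maybe-its-complicated/\r\n')

# With no separator, split() treats any run of whitespace as one delimiter
# and ignores leading/trailing whitespace, so no empty strings appear.
urls = body.decode("utf-8").split()
print(urls)
```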

Get digits between slashes or on the end in URL

I need a regular expression (for Groovy) to match 7 digits between two slashes in a URL, or at the end of the URL. For example:
https://stackoverflow.com/questions/6032324/problem-with-this-reg-expression
I need 6032324 but it should also match:
https://stackoverflow.com/questions/6032324
If it has one digit more or less, it should not match.
Maybe it's an easy regex, but I'm not so familiar with this :)
Thanks for your help!
Since you are parsing a URL, it makes sense to use a URL parser to first grab the path part and split it on /. Then you will have direct access to the slash-separated path parts, which you can test against a very simple pattern such as [0-9]{7} (or \d{7}), and get them all with:
def results = new URL(surl).path.split("/").findAll { it.matches(/\d{7}/) }
You may also take the first match:
def results = new URL(surl).path.split("/").findAll { it.matches(/\d{7}/) }.first()
Or last:
def results = new URL(surl).path.split("/").findAll { it.matches(/\d{7}/) }.last()
See the Groovy demo:
def surl = "https://stackoverflow.com/questions/6032324/problem-with-this-reg-expression"
def url = new URL(surl)
final result = url.path.split("/").findAll { it.matches(/\d{7}/) }.first()
print(result) // => 6032324
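For readers following the rest of this page in Python, the same parse-then-filter approach could be sketched as below. This is a translation for illustration, not part of the Groovy answer; the function name seven_digit_parts is made up here.

```python
import re
from urllib.parse import urlsplit

def seven_digit_parts(url):
    """Return the path segments of url that consist of exactly 7 digits."""
    path = urlsplit(url).path
    # fullmatch requires the whole segment to be 7 digits, so 6 or 8 digits fail.
    return [part for part in path.split('/') if re.fullmatch(r'[0-9]{7}', part)]

print(seven_digit_parts("https://stackoverflow.com/questions/6032324/problem-with-this-reg-expression"))
print(seven_digit_parts("https://stackoverflow.com/questions/6032324"))
print(seven_digit_parts("https://stackoverflow.com/questions/60323241"))
```

Splitting the path first means the "between slashes or at the end" condition needs no lookarounds: every segment is tested as a whole.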