How do you extract a certain part of a URL in Python? - regex

I am looking to extract only a portion of a patterned URL:
https://<some_server>/server/rest/services/<that part I need>/<other data I do not need>/...
I just need to extract the portion after 'rest/services/'. The first part can change and the last part can change, but the URL will always contain 'rest/services/', and the part I need comes right after it, followed by '/...'.

What you can do is turn this string into a list by splitting it on '/', and then access the nth part. The index is fixed, so this assumes the number of segments before 'rest/services/' never changes.
m = 'https://<some_server>/server/rest/services/<that part I need>/<other data I do not need>'
a = m.split('/')
print(a[6])  # This is the part that you want
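If the number of segments before 'rest/services/' can vary, splitting relative to the marker avoids the hard-coded index. The following is only a sketch: the helper name segment_after and the example.com URL are made up for illustration.

```python
def segment_after(url, marker="rest/services/"):
    """Return the path segment that directly follows `marker`, or None."""
    _before, found, rest = url.partition(marker)
    if not found:
        return None  # the marker does not occur in the URL
    # Keep everything up to the next slash (or the whole remainder).
    segment = rest.split("/", 1)[0]
    return segment or None


# Hypothetical URL, used only to exercise the helper.
print(segment_after("https://example.com/server/rest/services/Parcels/MapServer/0"))
```

Because str.partition splits at the first occurrence of the marker, the result does not depend on how deep the marker sits in the path.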

You could try this:
(?<=rest/services/) # look behind for this
[^/]+ # anything except a '/'
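import re
rgxp = re.compile(r'(?<=rest/services/)[^/]+')
text = 'https://<some_server>/server/rest/services/<that part I need>/<other data I do not need>'
print(rgxp.findall(text))  # ['<that part I need>']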

Related

Splitting a string in Python based on a regex pattern

I have a bytes object that contains urls:
> body.decode("utf-8")
> 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
I need to split it into a list with each url as a separate element:
import re
pattern = '^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$'
urls = re.compile(pattern).split(body.decode("utf-8"))
What I get is a list of one element with all urls pasted together:
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n']
How do I split each url into a separate element?
Try splitting it on \s+ (one or more whitespace characters).
Try this sample Python code:
import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
urls = re.compile(r'\s+').split(s)
print(urls)
This outputs:
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/', '']
Does this result look OK? If not, we can adjust it to what you need.
If you don't want the empty string ('') in the result list (it comes from the trailing \r\n), you can use findall to find all the URLs in the string instead. Sample Python code follows:
import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
urls = re.findall(r'http.*?(?=\s+)', s)
print(urls)
This gives the following output:
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/']
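For this particular input a regex may not be needed at all. As a minimal sketch, assuming the URLs themselves never contain whitespace, str.split() with no arguments splits on any run of whitespace and drops empty strings, so the trailing \r\n leaves no empty element:

```python
# Same bytes as in the question.
body = (
    b"https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\n"
    b"https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-"
    b"on-a-tesla-model-3-maybe-its-complicated/\r\n"
)

# split() without arguments treats any whitespace run as one separator
# and never returns empty strings.
urls = body.decode("utf-8").split()
print(urls)
```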

Get digits between slashes or on the end in URL

I need a regular expression (for Groovy) to match 7 digits between two slashes (in a URL) or at the end of the URL. For example:
https://stackoverflow.com/questions/6032324/problem-with-this-reg-expression
I need 6032324 but it should also match:
https://stackoverflow.com/questions/6032324
If it has one digit more or less, it should not match.
Maybe it's an easy regex, but I'm not so familiar with this :)
Thanks for your help!
Since you are parsing a URL, it makes sense to use a URL parser to first grab the path part and split it on /. Then you have direct access to the slash-separated path parts, which you can test against a very simple [0-9]{7} pattern, and get them all with
def results = new URL(surl).path.split("/").findAll { it.matches(/\d{7}/) }
You may also take the first match:
def results = new URL(surl).path.split("/").findAll { it.matches(/\d{7}/) }.first()
Or last:
def results = new URL(surl).path.split("/").findAll { it.matches(/\d{7}/) }.last()
See the Groovy demo:
def surl = "https://stackoverflow.com/questions/6032324/problem-with-this-reg-expression"
def url = new URL(surl)
final result = url.path.split("/").findAll { it.matches(/\d{7}/) }.first()
print(result) // => 6032324
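For comparison, the same parse-then-test idea can be sketched in Python (not Groovy, so this only illustrates the approach). The function name seven_digit_parts is made up here; urlsplit and re.fullmatch are standard-library calls.

```python
import re
from urllib.parse import urlsplit


def seven_digit_parts(url):
    """Return the path segments that consist of exactly seven digits."""
    path = urlsplit(url).path
    # fullmatch rejects segments with more or fewer than seven digits.
    return [part for part in path.split("/") if re.fullmatch(r"[0-9]{7}", part)]


print(seven_digit_parts(
    "https://stackoverflow.com/questions/6032324/problem-with-this-reg-expression"))
print(seven_digit_parts("https://stackoverflow.com/questions/6032324"))
```

Testing whole path segments makes the "between two slashes or at the end" condition implicit, so no lookaround is needed.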

Extract string of numbers from URL using regex PIG

I'm using PIG to generate a list of URLs that have been recently visited. In each of the URLs, there is a string of numbers that represents the product page visited. I'm trying to use a regex_extract_all() function to extract just the string of numbers, which vary in length from 6-8. The string of digits can be found directly after jobs2/view/ and usually ends with +&cd but sometimes they may end with ).
Here are a few example URLs:
(http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=cl k&gl=hk)
Here is the current regex I am using:
J = FOREACH jpage GENERATE FLATTEN(REGEX_EXTRACT_ALL(TEXTCOLUMN, '\/view\/(\d+)\+\&')) as (output:chararray)
I have also tried other forms such as:
'[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]', 'view.([0-9]+)', 'view\/([\d]+)\+',
'[0-9][0-9][0-9]+', and
'[0-9][0-9][0-9]*'; none of which work.
Can anybody assist here or have another way of going about it?
Much appreciated,
MM
The reason for the "Unexpected character 'D'" error is that you need to use a double backslash instead of a single backslash, e.g. just replace [\d+] with [\\d+].
Here is a solution; please validate it against all your input strings.
input.txt
http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928)=2&hl=zh-TW&ct=clk&gl=hk
http://webcache.googleusercontent.com/search?q=cache:http://my.linkedin.com/jobs2/view/9919248
Updated Pigscript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE REGEX_EXTRACT(line,'.*/view/(\\d+)([+|&|cd|)?]+)?',1);
dump B;
Output:
(17069404)
(5977065)
(16988928)
(16988928)
(16988928)
(16988928)
I'm not familiar with PIG, but this regex will match your target:
(?<=/jobs2/view/)\d+
By using a (non-consuming) look behind, the entire match (not just a group of the match) is your number.
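The lookbehind pattern can be checked quickly in Python, whose re module supports fixed-width lookbehind. This is only a sketch against three of the sample URLs above; it was not run in Pig, and the helper name job_id is made up.

```python
import re

# Digits that directly follow "/jobs2/view/"; the lookbehind is not part of the match.
JOB_ID = re.compile(r"(?<=/jobs2/view/)\d+")

urls = [
    "http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca",
    "http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928)=2&hl=zh-TW&ct=clk&gl=hk",
    "http://webcache.googleusercontent.com/search?q=cache:http://my.linkedin.com/jobs2/view/9919248",
]


def job_id(url):
    """Return the digit string after /jobs2/view/, or None if absent."""
    m = JOB_ID.search(url)
    return m.group(0) if m else None


ids = [job_id(u) for u in urls]
print(ids)  # ['17069404', '16988928', '9919248']
```

The match stops at the first non-digit, so it does not matter whether the number is followed by +&cd, by ), or by the end of the string.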

Python script to extract data from text file

I have a text file that contains a list of website links, like this:
test.txt:
http://www.site1.com/
http://site232546ee.com/
https://www.site3eiue213.org/
http://site4.biz/
I want to make a simple Python script that extracts only the site names, limited to 8 characters (no name longer than 8 characters). The output should look like this:
output.txt:
site1
site2325
site3eiu
site4
I have written some code:
txt1 = open("test.txt").read()
txt2 = txt1.split("http://www.")
f = open('output.txt', 'w')
for us in txt2:
    f.write(us)
print('./done')
But I don't know how to split() on more than one delimiter in one line. I also tried the re module, but I couldn't work out how to write the code for it.
Can someone please help me write this script? :(
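You can achieve this using a regular expression, as below.
import re
no = 8  # maximum length of the site name
# optional "s" and optional "www."; the dot is escaped so it matches literally
regesx = r"\bhttps?://(?:www\.)?"
text = "http://site232546ee.com/"
match = re.search(regesx, text)
start = match.end(0)  # index where the site name begins
end = start + no
string1 = text[start:end]  # at most `no` characters after the prefix
end = string1.find('.')
if end > 0:
    final = string1[0:end]  # cut at the first dot if it falls inside the slice
else:
    final = string1
print(final)  # site2325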
You said you want to extract site names with 8 characters, but the output.txt example shows truncated bits of domain names. If you instead want to keep only the domain names that have eight or fewer characters, here is a solution.
Step 1: Get all the domain names.
import tldextract
import pandas as pd
text_s = ''
list_u = ('http://www.site1.com/', 'http://site232546ee.com/', 'https://www.site3eiue213.org/', 'http://site4.biz/')
for l in list_u:
    extracted = tldextract.extract(l)
    text_s += extracted.domain + ' '
print(text_s)  # gives a string of domain names delimited by whitespace
Step 2: Filter domain names with 8 or fewer characters.
word = text_s.split()
lent = [len(x) for x in word]
word_len_list = pd.DataFrame({
    'words': word,
    'char_length': lent,
})
word_len_list[word_len_list.char_length <= 8]
Output looks like this:
   words  char_length
0  site1            5
3  site4            5
Disclaimer: I am new to Python. Please ignore any unnecessary and/or stupid steps I may have written
Have you tried printing txt2 before doing anything with it? You will see that it did not do what (I expect) you wanted it to do, since there's only one "http://www." available in the text. Try to split at a newline \n. That way you get a list of all the urls.
Then, for each URL, you'll want to strip the front and back. You can do that with a regular expression, but it can be quite hard, depending on what you want to be able to strip off.
When you have found a regular expression that works for you, simply check each domain's length and write the domains that satisfy your condition to a file using an if statement (if len(domain) <= 8: f.write(domain))
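Put together, the steps above could look roughly like the sketch below. It makes two assumptions: every line starts with an optional http:// or https:// and an optional www., and names are truncated to 8 characters (as in the asker's output.txt) rather than filtered out. PREFIX, site_name and extract_names are illustrative names, not part of any library.

```python
import re

# Assumed prefix: optional scheme, then optional "www."
PREFIX = re.compile(r"^(?:https?://)?(?:www\.)?", re.IGNORECASE)


def site_name(url, max_len=8):
    """Strip the prefix, keep the text before the first '.' or '/', truncate."""
    host = PREFIX.sub("", url.strip(), count=1)
    name = re.split(r"[./]", host, maxsplit=1)[0]
    return name[:max_len]


def extract_names(text, max_len=8):
    """One name per non-empty line of `text`."""
    return [site_name(line, max_len) for line in text.splitlines() if line.strip()]


sample = (
    "http://www.site1.com/\n"
    "http://site232546ee.com/\n"
    "https://www.site3eiue213.org/\n"
    "http://site4.biz/\n"
)
names = extract_names(sample)
print("\n".join(names))
```

To filter instead of truncate, call site_name with a large max_len and keep only the names for which len(name) <= 8. Writing the result to output.txt is then a single f.write("\n".join(names)) inside a with open(...) block.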

Python Regex List into Another List

I currently have a piece of code that runs mainly as I would expect, except that it prints out both the original list and the filtered one. Essentially, what I am trying to do is read URLs from a webpage and store them in a list (called match; this part works fine), and then filter that list into a new list (called fltrmtch), because the original contains all of the extra href tags etc.
For example, at the moment it prints out both A and B, but I'm only after B:
A Core Development',
B'http://docs.python.org/devguide/'),
Here's the code:
import re  # Imports the re module, which allows searching for matches.
import pprint  # This import allows all list items to be printed on separate lines.
import urllib
url = "URL WOULD BE IN HERE BUT NOT ALLOWED TO POST MULTIPLE LINKS"  # Name of the URL being searched
webpage = urllib.urlopen(url)
content = webpage.read()  # places the read URL contents into the variable content
match = re.findall(r'\<a.*href\=.*http\:.+', content)  # matches any content that begins with <a ... href= and contains http:
def filterPick(list, filter):
    return [( l, m.group(1) ) for l in match for m in (filter(l),) if m]
regex = re.compile(r'\"(.+?)\"').search
fltrmtch = filterPick(match, regex)
try:
    if match:  # if there is a match, the code below is run
        print "The number of URL's found is:" , len(match)
        match.sort()
        print "\nAnd here are the URL's found: "
        pprint.pprint(fltrmtch)
except:
    print "No URL matches have been found, please try again!"
Any help would be much appreciated.
Thank you in advance.
UPDATE: Thank you for the answer. However, I managed to find the flaw myself, in this line:
return [( l, m.group(1) ) for l in match for m in (filter(l),) if m]
I simply had to remove the l, from ( l, m.group(1) ) so that only the captured URL is returned. Thanks again.
It appears that the bottom portion of your code is mostly catching errors from the top portion, and that the regex you provided has no capturing groups. Here is a revised example:
import re
from urllib.request import urlopen
url = "http://www.site.com"  # place real web address here
# read web page into string (assumes the page is UTF-8 encoded)
page = urlopen(url).read().decode("utf-8", errors="replace")
# use regex to extract URLs from <a href=""> patterns
matches = re.findall(r'''<a\s[^>]*?\bhref=(['"])(.+?)\1[^>]*?>''', page, re.IGNORECASE)
# findall returns (quote, url) tuples here; keep only the second group
matches = sorted(href for _quote, href in matches)
# print matches if they exist
if matches:
    print("The number of URLs found is: " + str(len(matches)))
    print("\nAnd here are the URLs found:")
    # print each match
    print('\n'.join(matches))
else:
    print('No URL matches have been found, please try again!')
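As an alternative to the regex (a different technique from the answer above), the standard library's html.parser can collect href values without depending on quoting or attribute order. This is a minimal sketch; the LinkCollector class and the sample HTML string are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content standing in for the downloaded HTML.
sample_html = """
<p><a class="ref" href="http://docs.python.org/devguide/">Core Development</a>
<A HREF='http://www.python.org/'>Python</A> <a name="top">no link</a></p>
"""

collector = LinkCollector()
collector.feed(sample_html)
links = sorted(collector.links)
print("\n".join(links))
```

With a real page, the decoded response body would be passed to collector.feed() in place of sample_html.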