How can I improve this piece of code (scraping with Python)? - python-2.7

I'm quite new to programming so I appologise if my question is too trivial.
I've recently taken some Udacity courses like "Intro to Computer Science", "Programming foundations with Python" and some others.
The other day my boss asked me to collect some email addresses from certain websites. Some of them had many addresses at the same page so, the bell rang and I was thinking of creating my own code to do the repetitive task of collecting the emails and pasting them in a spreadsheet.
So, after reviewing some of the lessons of those corses plus some videos on youtube I came up with this code.
Notes: It's written in Python 2.7.12 and I'm using Ubuntu 16.04.
import xlwt
from bs4 import BeautifulSoup
import urllib2
def emails_page2excel(url):
# Create html file from a given url
sauce = urllib2.urlopen(url).read()
soup = BeautifulSoup(sauce,'lxml')
# Create the spreadsheet book and a page in it
wb = xlwt.Workbook()
sheet1 = wb.add_sheet('Contacts')
# Find the emails and write them in the spreadsheet table
count = 0
for url in soup.find_all('a'):
link = url.get('href')
if link.find('mailto')!=-1:
start_email = link.find('mailto')+len('mailto:')
email = link[start_email:]
sheet1.write(count,0,email)
count += 1
wb.save('This is an example.xls')
The code runs fine and it's quite quick. However I'd like to improve it in these ways:
I got the feeling that the for loop could be done in a more elegant
way. Is there any other way to look for the email besides the string find? Just in a similar way in which I found the 'a' tags?
I'd like to be able to evaluate this code with a list of websites(most likely in a spreadsheet) instead of evaluating it only with a url string. I haven't had time to research on how to do this yet but any suggestion is welcome.
Last but not least, I'd like to ask if there's any way to implement this script in some sort of friendly-to-use mini-programme. I mean, for instance, my boss is totally bad at computers: I can't imagine her opening a terminal shell and executing the python code. Instead I'd like to crate some programme where she could just paste the url, or upload a spreadsheet with the websites she wants to extract the emails from, select whether she wants to extract emails or any other information, maybe some more features and then click a button and get the result.
I hope I've expressed myself clearly.
Thanks in advance,
Anqin

As far as BeautifulSoup goes you can search for emails in a in three ways:
1) Use find_all with a lambda to search all tags that are a and have href as an attribute and its value has mailto.
for email in soup.find_all(lambda tag: tag.name == "a" and "href" in tag.attrs and "mailto:" in tag.attrs["href"]):
print (email["href"][7:])
2) Use find_all with regex to find mailto: in an a tag.
for email in soup.find_all("a", href=re.compile("mailto:")):
print (email["href"][7:])
3) Use select to find an a tag on which its href attribute starts with mailto:.
for email in soup.select('a[href^=mailto]'):
print (email["href"][7:])
This is my personal preference, but I prefer using requests over urllib. Far simpler, better at error handling and safer when threading.
As far as your other questions, you can create a method for fetching, parsing and returning the results that you want and pass an url as a parameter. You would only need to loop your list of urls and call that method.
For your boss question you should use a GUI.
Have fun coding.

Related

Random Word from Website

I have a simple program (not related to school) that requires a lot of random words in the local database. Earlier today, I found this website http://www.setgetgo.com/randomword/get.php that will always generate a random word every time the page is reloaded. I have an idea to create a variable that will consistently grab the value from this website, and append it to my list (acts as a local database).
Any idea how to do that? I thought there is a "wget" library in python too. However, my python keeps returning an error.
My idea:
a_variable = wget the website text
Here is the block of code you need
import requests
res = requests.get("http://www.setgetgo.com/randomword/get.php")
print res.content
I would adivce you to dive into Request and BeautifulSoup. If you want to learn more about it.
Goodluck

Twitter search with urllib2 failing

I am trying to search Twitter for a given search term with the following code:
from bs4 import BeautifulSoup
import urllib2
link = "https://twitter.com/search?q=stackoverflow%20since%3A2014-11-01%20until%3A2015-11-01&src=typd&vertical=default"
page = urllib2.urlopen(link).read()
soup = BeautifulSoup(page)
first = soup.find_all('p')
(Replace "stackoverflow" in link with any search term you want.) However, when I do this (and every time I have tried for the past few days, thinking Twitter might be too bogged down), I get this error:
No results.
Twitter may be over capacity or experiencing a momentary hiccup.
(HTML in results of BS omitted for simplicity in viewing.)
This code used to work for me, but now is not. Additionally, plugging link directly into a browser gives the correct result and Twitter status shows all is well.
Thoughts?
I was able to reproduce your results. I believe that Twitter is using this message to discourage people from scraping. It makes sense since they have taken the time to publish an API for people to access their data, that they discourage scraping.
My advice is to use their API which is documented here: https://dev.twitter.com/overview/documentation

Regular expression for URL filter that makes it hard to read the filterd URL

I want to insert a URL filter and I would like the URL to be hard to dechiffre.
For example .*porn\.* in a way maybe that it uses the ASCII code for the letters in hex form .
Of course, the example is obvious and I definately will leave that one as it is ;)
But for the others I would like them to be hard to read!
Thx!
You can use the $_GET function in PHP to pull an ID out of the URL and display it that way, similar to Youtube with their "watch?v=". I recently did one using "?id=49" (I only have a few pages ATM, I will have about 70 soon). What I did is use a database with a song_id to index the information. I use the same basic layout, but you can use the ID to access information wrapped in PHP so that it doesnt get sent to the browser but will still display the page you want.
Or if you really want it to look crazy, you could use a database using the SHA() or MD5() function to encrypt it.
and your display will look like /page.php?id=21a57f2fe765e1ae4a8bf15d73fc1bf2a533f547f2343d12a499d9c0592044d4.

Create/Edit MS Word & Word Perfect docs in Django?

Is it possible to create and/or edit MS Word and Word Perfect documents with django? I'd like to be able to allow the user to fill out a form and have the form fields inserted into an MS Word/Word Perfect document. Or, the form fields are used to create a new MS Word/Word Perfect document. The user can then send that document via email to others who may not have access to the django web-app.
I have a client who needs this functionality and I'd like to keep it all within the web-app.
Any ideas?
Thanks!
For MS Word, you could use docx-mailmerge. Run the following commands to install lxml(dependecy required by docx-mailmerge) and docx-mailmerge
conda install lxml
pip install docx-mailmerge
In order for docx-mailmerge to work correctly, you need to create a standard Word document and define the appropriate merge fields. The examples below are for Word 2010. Other versions of Word should be similar. It actually took me a while to figure out this process but once you do it a couple of times, it is pretty simple.
Start Word and create the basic document structure. Then place the cursor in the location where the merged data should be inserted and choose Insert -> Quick Parts -> Field..:
Word Quick Parts
From the Field dialog box, select the “MergeField” option from the Field Names list. In the Field Name, enter the name you want for the field. In this case, we are using Business Name.
Word Add Field
Once you click ok, you should see something like this: <> in the Word document. You can go ahead and create the document with all the needed fields.
from __future__ import print_function
from mailmerge import MailMerge
from datetime import date
template = "Practical-Business-Python.docx"
document = MailMerge(template)
document.merge(
status='Gold',
city='Springfield',
phone_number='800-555-5555',
Business='Cool Shoes',
zip='55555',
purchases='$500,000',
shipping_limit='$500',
state='MO',
address='1234 Main Street',
date='{:%d-%b-%Y}'.format(date.today()),
discount='5%',
recipient='Mr. Jones')
document.write('test-output.docx')
More at http://pbpython.com/python-word-template.html
I do not know how to do exactly what you ask for, but I would suggest that you also look into creating PDFs with Django. If you only want to send information in a particular format then PFD might be better because it is more portable across platforms. You may also want to look at this documentation for how to send emails from Django.

Django url pattern to retrieve query string parameters

I was about ready to start giving a jqgrid in a django app greater functionality (pagination, searching, etc). In order to do this it looks as though jqgrid sends its parameters in the GET to the server. I plan to write an urlpattern to pull out the necessary stuff (page number, records per page, search term, etc) so I can pass it along to my view to return the correct rows to the grid. Has anyone out there already created this urlpattern I am in search of?
Thanks much.
The answer to this was simpler than I realized. As stated in Chapter 7 of the djangobook in the section titled "Query string parameters" one can simply do something as follows, where "someParam" is the parameter in the query string you want to retrieve. However, Django is designed to be clean in that address bar at the top of the page so you should only use this option if you must.
The query string might look something like this.
http://somedomainname.com/?someString=1
The view might look like this.
def someView(request):
if 'someParam' in request.GET and request.GET['someParam']:
someParam = request.GET['someParam']
Hopefully this is of some help to someone else down the road.