How do I get all cookies from a site, including 3rd party? - cookies

I've been asked to write a web crawler that lists all cookies (including 3rd party ones such as YouTube's) and then looks them up in a database that provides extra info (such as what the cookie is for). Users enter a site address in a search bar and then receive the info.
The problem is: I'm completely lost! I barely have any idea where to begin, what to do, and it's starting to give me actual headaches.
I can think up the logic, and I know it shouldn't be a hard problem, but what do I have to use?
I have tried Selenium with Python mainly (still have no idea how it works), I've looked at Java and even considered C#, but the problem remains that I don't know where to start or what to use. Every step I take is like climbing a wall, only to drop down on the other side and find a larger wall.
All I ask is some guidance, no need for actual code.

Alright so I finally got something going. The trick is Python + Selenium + ChromeDriver. I will post more details in the future once I get this all done.
With Python 3, this is enough to connect to a site and get the cookies written out (in this case they are stored under myuserdir/Documents/Default/cookies):
from selenium import webdriver

co = webdriver.ChromeOptions()
co.add_argument("user-data-dir={}".format("C:\\Users\\myuserdir\\Documents"))
driver = webdriver.Chrome(chrome_options=co)
driver.get("http://www.example.com")

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

def getCookies(self):
    options = Options()
    options.headless = True
    driver = webdriver.Firefox(options=options, executable_path=r'./geckodriver')
    driver.get(self.website_url)
    cookie = driver.get_cookies()
    driver.quit()
    return cookie
The approach I used is to call get_cookies() and store the cookie data in a file for future use. But sometimes you need to let the page's JavaScript run so that cookies set by JavaScript are actually loaded before you read them.
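For that JavaScript case, a minimal sketch (the URL and the fixed 5-second wait are placeholders, not values from the answer above) is to give the page time to run its scripts before calling get_cookies():
import time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
driver = webdriver.Firefox(options=options, executable_path=r'./geckodriver')
driver.get("http://www.example.com")   # placeholder URL
time.sleep(5)                          # crude wait so page scripts can set their cookies
cookies = driver.get_cookies()         # list of dicts with name, value, domain, ...
driver.quit()
print(cookies)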

Related

Random Word from Website

I have a simple program (not related to school) that requires a lot of random words in a local database. Earlier today, I found this website http://www.setgetgo.com/randomword/get.php, which generates a new random word every time the page is reloaded. My idea is to have a variable that repeatedly grabs the value from this website and appends it to my list (which acts as a local database).
Any idea how to do that? I thought there was a "wget" library in Python too, but it keeps returning an error.
My idea:
a_variable = wget the website text
Here is the block of code you need:
import requests

res = requests.get("http://www.setgetgo.com/randomword/get.php")
print(res.text)  # the body of the response, i.e. the random word
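Building on that snippet, a minimal sketch of the idea from the question (appending each word to a local list; the count of 10 is arbitrary):
import requests

word_list = []  # acts as the local "database"
for _ in range(10):  # grab 10 words; adjust as needed
    res = requests.get("http://www.setgetgo.com/randomword/get.php")
    if res.ok:
        word_list.append(res.text.strip())
print(word_list)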
I would advise you to dive into Requests and BeautifulSoup if you want to learn more about this.
Good luck

How can I improve this piece of code (scraping with Python)?

I'm quite new to programming so I apologise if my question is too trivial.
I've recently taken some Udacity courses like "Intro to Computer Science", "Programming foundations with Python" and some others.
The other day my boss asked me to collect some email addresses from certain websites. Some of them had many addresses on the same page, so a bell rang and I thought of creating my own code to do the repetitive task of collecting the emails and pasting them into a spreadsheet.
So, after reviewing some of the lessons of those courses plus some videos on YouTube, I came up with this code.
Notes: It's written in Python 2.7.12 and I'm using Ubuntu 16.04.
import xlwt
from bs4 import BeautifulSoup
import urllib2

def emails_page2excel(url):
    # Fetch the html from the given url
    sauce = urllib2.urlopen(url).read()
    soup = BeautifulSoup(sauce, 'lxml')
    # Create the spreadsheet book and a page in it
    wb = xlwt.Workbook()
    sheet1 = wb.add_sheet('Contacts')
    # Find the emails and write them in the spreadsheet table
    count = 0
    for url in soup.find_all('a'):
        link = url.get('href')
        if link.find('mailto') != -1:
            start_email = link.find('mailto') + len('mailto:')
            email = link[start_email:]
            sheet1.write(count, 0, email)
            count += 1
    wb.save('This is an example.xls')
The code runs fine and it's quite quick. However, I'd like to improve it in these ways:
I get the feeling that the for loop could be written in a more elegant way. Is there any other way to look for the emails besides the string find? Something similar to the way I found the 'a' tags?
I'd like to be able to run this code against a list of websites (most likely kept in a spreadsheet) instead of a single URL string. I haven't had time to research how to do this yet, but any suggestion is welcome.
Last but not least, I'd like to ask if there's any way to wrap this script in some sort of friendly-to-use mini-programme. For instance, my boss is totally bad at computers: I can't imagine her opening a terminal shell and executing the Python code. Instead I'd like to create some programme where she could just paste the URL, or upload a spreadsheet with the websites she wants to extract the emails from, select whether she wants emails or some other information, maybe some more features, and then click a button and get the result.
I hope I've expressed myself clearly.
Thanks in advance,
Anqin
As far as BeautifulSoup goes, you can search for emails in an a tag in three ways:
1) Use find_all with a lambda to search all tags that are a, have href as an attribute, and whose href value contains mailto:.
for email in soup.find_all(lambda tag: tag.name == "a" and "href" in tag.attrs and "mailto:" in tag.attrs["href"]):
    print(email["href"][7:])
2) Use find_all with a regex to find mailto: in an a tag (this needs import re).
import re

for email in soup.find_all("a", href=re.compile("mailto:")):
    print(email["href"][7:])
3) Use select to find an a tag whose href attribute starts with mailto:.
for email in soup.select('a[href^=mailto]'):
    print(email["href"][7:])
This is my personal preference, but I prefer using requests over urllib: it's far simpler, better at error handling and safer when threading.
As for your other questions, you can create a method that fetches, parses and returns the results you want, taking a URL as a parameter. You would then only need to loop over your list of URLs and call that method.
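A rough sketch of that structure, assuming requests and the select approach from above (emails_from_url and the example URL are placeholders, not names from your code):
import requests
from bs4 import BeautifulSoup

def emails_from_url(url):
    # fetch the page and pull the address out of every mailto: link
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    return [a["href"][7:] for a in soup.select('a[href^=mailto]')]

urls = ["http://example.com/contact"]  # placeholder list; could be read from a spreadsheet
all_emails = []
for u in urls:
    all_emails.extend(emails_from_url(u))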
As for your boss, you should build a simple GUI on top of the script.
Have fun coding.

My Web scraper sporadically fails due to very small difference in URLs

I am facing a frustrating problem that has slowed down my data collection a lot. I have written a customized web scraper tailored to a specific sports website, and I read the URLs from a file and then call my scraper:
import re
from bs4 import BeautifulSoup
import html5lib
import socket
from PassesData import *
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

base = "http://www.something.com"

with open('Part2-PostIS-0430PM.txt', 'w') as f5:
    with open('URLLinks.txt') as temp:
        for url in temp:
            f5.write(getData(base + url))
            f5.write("\n")
Sample data in URLLinks.txt --> /something/wherein/12345
The crawler works perfectly for many hours, reading the URLs one by one, passing each to the scraper, parsing the page and writing the result to the text file in the outer with block. But when it reads a URL with a slight difference, like:
/someting/wherein/12345 instead of /something/wherein/12345,
my crawler fails with UnboundLocalError: local variable 'header' referenced before assignment. header is the page header that I parse from the page with header = soup.h1.b.text.strip() and pass to the print function. It works perfectly for 99% of the URLs I am reading, but that one URL in the middle stops the whole process. When I pass the same URL to, for example, Google Chrome, it automatically fixes the missing part and fetches the correct page: passing "http://www.something.com/someting/wherein/12345" to Chrome opens /something/wherein/12345 with no problem. So I go and change that one URL in URLLinks.txt and run my crawler again.
This has caused a huge delay in my data collection, as I have to constantly babysit the process.
I really appreciate any solution for this.
I am using BeautifulSoup4 and socket (no urllib or other modules, since they just don't work for the website I am scraping).
I have to stress again that my crawler works perfectly except for these small variations of the URL, like having /this-is-a/link/to/12345 instead of /this-is/link/to/12345, which a browser understands perfectly but my code fails on, even though I collected these URLs from the same website in the first place!
Please help me out. Thanks, community.
You could use the web-scraping framework Scrapy. I assume it will also fail on the incorrect URLs, but it won't stop the process, so the other requests will still work. It is also asynchronous, so your requests will be handled independently and faster.
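As a hedged illustration only (the spider and field names are made up, and it assumes the header lives in an h1 > b element as in your code), a minimal Scrapy spider over the same URL file might look like:
import scrapy

class SportsSpider(scrapy.Spider):
    name = "sports"
    start_urls = ["http://www.something.com" + line.strip()
                  for line in open("URLLinks.txt")]

    def parse(self, response):
        header = response.css("h1 b::text").get()  # rough equivalent of soup.h1.b.text
        if header:  # a missing header yields nothing instead of crashing the run
            yield {"header": header.strip()}
You could run it with, for example, scrapy runspider sports_spider.py -o out.json; failed or odd pages are logged and the remaining requests keep going.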
Actually I did something that doesn't solve the issue but is the next best thing. It lets me keep scraping while logging the URLs that were problematic, so I can look them up later. The code I used is as follows:
from __future__ import print_function  # needed on Python 2 for print(..., file=log)

with open('Part2-PostIS-Datas-7-1159AM.txt', 'w') as f5:
    log = open("LOG-P2-7-1159.txt", 'a')
    with open('Part2-Links-7.txt') as temp:
        for url in temp:
            try:
                f5.write(getData(base + url))
                f5.write("\n")
            except (KeyboardInterrupt, SystemExit):
                raise
            except:
                print(url, file=log)
Now I can run my script and scrape the pages one by one, and the 1% of problematic URLs won't stop the whole process. After my data collection is over, I look in my log file, fix the URLs and run them again.
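The UnboundLocalError itself could also be avoided inside getData by checking that the expected element exists before using it. This is only a hypothetical sketch of that check (fetch_html stands in for however getData actually retrieves the page over the socket):
from bs4 import BeautifulSoup

def getData(url):
    soup = BeautifulSoup(fetch_html(url), "html5lib")  # fetch_html: hypothetical retrieval step
    h1 = soup.find("h1")
    if h1 is None or h1.b is None:
        raise ValueError("unexpected page layout: " + url)  # caught and logged by the loop above
    header = h1.b.text.strip()
    return header  # the rest of the parsing would follow the same pattern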
I don't really know which module you use for scraping, but on my side requests works fine with the examples you gave:
>>> import requests
>>> response = requests.get('http://stackoverflow.com//questions/41051497/my-web-scraper-sporadically-due-very-small-difference-in-urls?noredirect=1#comment69311269_41051497')
>>> response.url
u'http://stackoverflow.com/questions/41051497/my-web-scraper-sporadically-fails-due-to-very-small-difference-in-urls?noredirect=1'
>>> response = requests.get('http://stackoverflow.com//questions/41051497/how-wrong-can-the-name-be?noredirect=1')
>>> response.url
u'http://stackoverflow.com/questions/41051497/my-web-scraper-sporadically-fails-due-to-very-small-difference-in-urls?noredirect=1'
As you can see, no matter what kind of "wrong name" you provide, the server answers with a 301 and requests follows the redirect to the correct URL, as long as the question number is correct.
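If you want to see (or log) when such a fix-up happened, requests records the redirect chain in response.history; a small sketch (the URL is a placeholder from the question):
import requests

resp = requests.get("http://www.something.com/someting/wherein/12345")  # placeholder "wrong" URL
if resp.history:  # non-empty when one or more redirects were followed
    print("redirected via", resp.history[0].status_code, "to", resp.url)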

Twitter search with urllib2 failing

I am trying to search Twitter for a given search term with the following code:
from bs4 import BeautifulSoup
import urllib2
link = "https://twitter.com/search?q=stackoverflow%20since%3A2014-11-01%20until%3A2015-11-01&src=typd&vertical=default"
page = urllib2.urlopen(link).read()
soup = BeautifulSoup(page)
first = soup.find_all('p')
(Replace "stackoverflow" in link with any search term you want.) However, when I do this (and every time I have tried for the past few days, thinking Twitter might be too bogged down), I get this error:
No results.
Twitter may be over capacity or experiencing a momentary hiccup.
(HTML in results of BS omitted for simplicity in viewing.)
This code used to work for me, but now it does not. Additionally, plugging link directly into a browser gives the correct result, and the Twitter status page shows all is well.
Thoughts?
I was able to reproduce your results. I believe Twitter is using this message to discourage people from scraping: it makes sense that, since they have taken the time to publish an API for people to access their data, they discourage scraping.
My advice is to use their API, which is documented here: https://dev.twitter.com/overview/documentation
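For example, with the tweepy wrapper (assuming tweepy 3.x and placeholder credentials from a Twitter developer app; this is just a sketch, not the only way to call the API), a search looks roughly like:
import tweepy

# placeholder credentials from a Twitter developer app
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

# simple search query; date filtering is subject to the search API's own limits
for tweet in api.search(q="stackoverflow"):
    print(tweet.text)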

I need to restart the Django server to make my app work properly

So I made a Python script to grab images from a subreddit (from Imgur and Imgur albums). I successfully did that (it returns image URLs) and wanted to integrate it into Django so I can deploy it online and let other people use it. When I run the server on my machine, the images from the first subreddit load flawlessly, but when I try another subreddit, it craps out on me (I'll post the exception at the end of the post). So I restart the Django server, and the same thing happens: the images load without a hitch the first time, but the second time I try, it craps out on me. What gives?
Exception Type: siteError, which pretty much encompasses urllib2.HTTPError, urllib2.URLError, socket.error and socket.sslerror.
Since I'm a noob at all of this, I'm not sure what's going on, so would anyone care to help me?
Note: I also host the app on pythoneverywhere.com. Same result.
Using a global in your get_subreddit function looks wrong to me.
reddit_url = 'http://reddit.com/r/'

def get_subreddit(name):
    global reddit_url
    reddit_url += name
Every time you run that function, you append the value of name to the global reddit_url.
It starts as http://reddit.com/r/
run get_subreddit("python") and it changes to http://reddit.com/r/python
run get_subreddit("python") again, and it changes to http://reddit.com/r/pythonpython
At this point, the URL is invalid, and you have to restart your server.
You probably want to change get_subreddit so that it returns a URL, and then fetch that URL in your view.
def get_subreddit(name):
    return "http://reddit.com/r/" + name

# in your view
url = get_subreddit("python")
# now fetch url
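For the "now fetch url" step, one hedged option (this uses requests and reddit's JSON listing, which may differ from how your script actually pulls the images; the User-Agent string is a placeholder) could be:
import requests

url = get_subreddit("python")  # "http://reddit.com/r/python", using the function above
# reddit rejects default/anonymous User-Agents, so send a descriptive one
response = requests.get(url + "/.json", headers={"User-Agent": "my-image-app/0.1"})
posts = response.json()["data"]["children"]  # submissions in the listing
image_urls = [p["data"]["url"] for p in posts]  # each submission's link, e.g. an imgur URL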
There are probably other mistakes in your code as well. You can't really expect somebody on Stack Overflow to fix all the problems in a project of this size for you. The best thing you can do is learn some techniques for debugging your code yourself.
Look at the full traceback, not just the final siteError, and see which line of your code the problem occurs in.
Add some logging or print statements, and try to work out why the siteError is occurring.
Check whether you are actually downloading the URL you think you are (as I explained above, I don't think you are, because of the problem with your get_subreddit function).
Finally, I recommend you make sure the site works on your dev machine before you move on to deploying it on PythonAnywhere. Deploying can cause lots of headaches all by itself, so it's good to start with an app that's already working.
Good luck :)