Automating pulling CSV files off Google Trends - google-trends

pyGTrends does not seem to work; it just gives errors in Python.
pyGoogleTrendsCsvDownloader seems to work and logs in, but after 1-3 requests (per day!) it complains about an exhausted quota, even though manual download with the same login/IP works flawlessly.
Bottom line: neither works. Searching through Stack Overflow turns up many questions from people trying to pull CSVs from Google, but no workable solution that I could find...
Thanks in advance to whoever is able to help. How should the code be changed? Do you know of another solution that works?
Here's the code of pyGoogleTrendsCsvDownloader.py
import httplib
import urllib
import urllib2
import re
import csv
import lxml.etree as etree
import lxml.html as html
import traceback
import gzip
import random
import time
import sys

from cookielib import Cookie, CookieJar
from StringIO import StringIO


class pyGoogleTrendsCsvDownloader(object):
    '''
    Google Trends Downloader

    Recommended usage:
        from pyGoogleTrendsCsvDownloader import pyGoogleTrendsCsvDownloader
        r = pyGoogleTrendsCsvDownloader(username, password)
        r.get_csv(cat='0-958', geo='US-ME-500')
    '''
    def __init__(self, username, password):
        '''
        Provide login and password to be used to connect to Google Trends
        All immutable system variables are also defined here
        '''
        # The amount of time (in secs) that the script should wait before making a request.
        # This can be used to throttle the downloading speed to avoid hitting servers too hard.
        # It is further randomized.
        self.download_delay = 0.25

        self.service = "trendspro"
        self.url_service = "http://www.google.com/trends/"
        self.url_download = self.url_service + "trendsReport?"

        self.login_params = {}
        # These headers are necessary, otherwise Google will flag the request at your account level
        self.headers = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'),
                        ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
                        ("Accept-Language", "en-gb,en;q=0.5"),
                        ("Accept-Encoding", "gzip, deflate"),
                        ("Connection", "keep-alive")]
        self.url_login = ('https://accounts.google.com/ServiceLogin?service=' + self.service +
                          '&passive=1209600&continue=' + self.url_service +
                          '&followup=' + self.url_service)
        self.url_authenticate = 'https://accounts.google.com/accounts/ServiceLoginAuth'
        self.header_dictionary = {}

        self._authenticate(username, password)

    def _authenticate(self, username, password):
        '''
        Authenticate to Google:
        1 - make a GET request to the Login webpage so we can get the login form
        2 - make a POST request with email, password and login form input values
        '''
        # Make sure we get CSV results in English
        ck = Cookie(version=0, name='I4SUserLocale', value='en_US', port=None, port_specified=False,
                    domain='www.google.com', domain_specified=False, domain_initial_dot=False,
                    path='/trends', path_specified=True, secure=False, expires=None, discard=False,
                    comment=None, comment_url=None, rest=None)

        self.cj = CookieJar()
        self.cj.set_cookie(ck)
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
        self.opener.addheaders = self.headers

        # Get all of the login form input values
        find_inputs = etree.XPath("//form[@id='gaia_loginform']//input")
        try:
            resp = self.opener.open(self.url_login)

            if resp.info().get('Content-Encoding') == 'gzip':
                buf = StringIO(resp.read())
                f = gzip.GzipFile(fileobj=buf)
                data = f.read()
            else:
                data = resp.read()

            xmlTree = etree.fromstring(data, parser=html.HTMLParser(recover=True, remove_comments=True))

            for input in find_inputs(xmlTree):
                name = input.get('name')
                if name:
                    name = name.encode('utf8')
                    value = input.get('value', '').encode('utf8')
                    self.login_params[name] = value
        except:
            print("Exception while parsing: %s\n" % traceback.format_exc())

        self.login_params["Email"] = username
        self.login_params["Passwd"] = password

        params = urllib.urlencode(self.login_params)
        self.opener.open(self.url_authenticate, params)

    def get_csv(self, throttle=False, **kwargs):
        '''
        Download CSV reports
        '''
        # Randomized download delay
        if throttle:
            r = random.uniform(0.5 * self.download_delay, 1.5 * self.download_delay)
            time.sleep(r)

        params = {
            'export': 1
        }
        params.update(kwargs)
        params = urllib.urlencode(params)

        r = self.opener.open(self.url_download + params)

        # Make sure everything is working ;)
        if not r.info().has_key('Content-Disposition'):
            print "You've exceeded your quota. Continue tomorrow..."
            sys.exit(0)

        if r.info().get('Content-Encoding') == 'gzip':
            buf = StringIO(r.read())
            f = gzip.GzipFile(fileobj=buf)
            data = f.read()
        else:
            data = r.read()

        myFile = open('trends_%s.csv' % '_'.join(['%s-%s' % (key, value) for (key, value) in kwargs.items()]), 'w')
        myFile.write(data)
        myFile.close()

Although I don't know Python, I may have a solution. I am currently doing the same thing in C#, and though I didn't get the .csv file, I built a custom URL through code, then downloaded that HTML and saved it to a text file (also through code). In this HTML (at line 12) is all the information needed to create the graph that is used on Google Trends. However, it has a lot of unnecessary text within it that needs to be cut away. Either way, you end up with the same result: the Google Trends data. I posted a more detailed answer to my question here:
Downloading .csv file from Google Trends
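In Python, the download-and-save half of that approach would look roughly like the sketch below. The URL is only a placeholder (the linked answer has the real one), and extracting the graph data from the HTML is left out:

import urllib2

# Placeholder: substitute the custom Google Trends URL built as described above
url = "http://www.google.com/trends/..."

html_page = urllib2.urlopen(url).read()
with open("trends_page.html", "w") as f:
    f.write(html_page)
# The data used to draw the Trends graph is embedded in this HTML,
# but it still has to be cut out of the surrounding markup.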

There is an alternative module named pytrends (https://pypi.org/project/pytrends/). It is really good and I would recommend it.
Example usage:
import numpy as np
import pandas as pd
from pytrends.request import TrendReq
pytrend = TrendReq()
# The term that you want to search for
pytrend.build_payload(kw_list=["Eminem is the Rap God"])
# Find which regions have searched for the term
df = pytrend.interest_by_region()
df.to_csv(r"path\Eminem_InterestbyRegion.csv")  # raw string so the backslash in the path is taken literally
If you have a list of terms to search for, you could use a for loop to automate these exports, as sketched below.
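A minimal sketch of that idea (the term list and file names here are just examples):

from pytrends.request import TrendReq

pytrend = TrendReq()
terms = ["Eminem is the Rap God", "Housing market"]  # your own list of search terms

for term in terms:
    pytrend.build_payload(kw_list=[term])
    df = pytrend.interest_by_region()
    # one CSV per term; replace spaces so the term makes a valid file name
    df.to_csv("%s_InterestbyRegion.csv" % term.replace(" ", "_"))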

Related

Authentication with GitLab to a terminal

I have a terminal that is served in a web browser with wetty. I want to authenticate the user against GitLab before letting them interact with the terminal (it is inside a Docker container; once the user is authenticated I will let them see the container's terminal).
I am trying to do OAuth 2.0 but couldn't manage to get it working.
This is what I tried:
I created an application on GitLab.
Got the client ID and secret and made an HTTP call with a Python script.
The script directed me to the login and authorization page.
I tried to get the code but failed (there is no mistake in the code, I think).
Now the problem starts here: I need to get the auth code from the redirect URL to obtain an access token, but I couldn't figure out how. I used the Flask library to get the code.
from flask import Flask, abort, request
from uuid import uuid4
import requests
import requests.auth
import urllib2
import urllib

CLIENT_ID = "clientid"
CLIENT_SECRET = "clientsecret"
REDIRECT_URI = "https://UnrelevantFromGitlabLink.com/console"

def user_agent():
    raise NotImplementedError()

def base_headers():
    return {"User-Agent": user_agent()}

app = Flask(__name__)

@app.route('/')
def homepage():
    text = '<a href="%s">Authenticate with gitlab</a>'
    return text % make_authorization_url()

def make_authorization_url():
    # Generate a random string for the state parameter
    # Save it for use later to prevent xsrf attacks
    state = str(uuid4())
    save_created_state(state)
    params = {"client_id": CLIENT_ID,
              "response_type": "code",
              "state": state,
              "redirect_uri": REDIRECT_URI,
              "scope": "api"}
    url = "https://GitlapDomain/oauth/authorize?" + urllib.urlencode(params)
    print get_redirected_url(url)
    print(url)
    return url

# Left as an exercise to the reader.
# You may want to store valid states in a database or memcache.
def save_created_state(state):
    pass

def is_valid_state(state):
    return True

@app.route('/console')
def reddit_callback():
    print("-----------------")
    error = request.args.get('error', '')
    if error:
        return "Error: " + error
    state = request.args.get('state', '')
    if not is_valid_state(state):
        # Uh-oh, this request wasn't started by us!
        abort(403)
    code = request.args.get('code')
    print(code.json())
    access_token = get_token(code)
    # Note: In most cases, you'll want to store the access token, in, say,
    # a session for use in other parts of your web app.
    return "Your gitlab username is: %s" % get_username(access_token)

def get_token(code):
    client_auth = requests.auth.HTTPBasicAuth(CLIENT_ID, CLIENT_SECRET)
    post_data = {"grant_type": "authorization_code",
                 "code": code,
                 "redirect_uri": REDIRECT_URI}
    headers = base_headers()
    response = requests.post("https://MyGitlabDomain/oauth/token",
                             auth=client_auth,
                             headers=headers,
                             data=post_data)
    token_json = response.json()
    return token_json["access_token"]

if __name__ == '__main__':
    app.run(host="0.0.0.0", debug=True, port=65010)
I think my problem is with my redirect URL, because it is just an irrelevant link from GitLab and there is no API I can call on it.
If I could get the
@app.route('/console')
handler to fire in Python, my problem would probably be solved.
I need a correction to my Python script, or a different angle to solve my problem. Please help.
I had totally misunderstood the concept of OAuth2: the main aim is to obtain the access_token. When I corrected the callback URL to localhost it worked like a charm.
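In other words, the fix described above amounts to registering a redirect URI that points back at the local Flask callback route, roughly like this (localhost per the answer, port and path taken from the script above):

# The redirect_uri registered with the GitLab application must point back
# at the Flask route that exchanges the code for a token.
REDIRECT_URI = "http://localhost:65010/console"  # matches @app.route('/console') and app.run(port=65010)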

How to scrape pages after login

I am trying to find a way to scrape and parse more pages in the signed-in area.
These example links, accessible only when signed in, are what I want to parse:
#http://example.com/seller/demand/?id=305554
#http://example.com/seller/demand/?id=305553
#http://example.com/seller/demand/?id=305552
#....
I want to create a spider that can open each of these links and then parse them.
I have created another spider which can open and parse only one of them.
When I tried to use a "for" or "while" loop to issue more requests with the other links, it didn't let me, because I cannot put more than one return into a generator; it raises an error. I also tried link extractors, but they didn't work for me.
Here is my code:
#!c:/server/www/scrapy
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest
from scrapy.http.request import Request
from scrapy.spiders import CrawlSpider, Rule
from array import *
from stack.items import StackItem
from scrapy.linkextractors import LinkExtractor

class Spider3(Spider):
    name = "Spider3"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/login"]  # this link leads to the login page
When I am signed in, it returns a page whose URL contains "stat"; that is why I put the first "if" condition here.
When I am signed in, I request one link and call the parse_items function.
    def parse(self, response):
        # when "stat" is in the url it means that I just signed in
        if "stat" in response.url:
            return Request("http://example.com/seller/demand/?id=305554", callback=self.parse_items)
        else:
            # this successful login redirects me to a page whose url contains "stat"
            return [FormRequest.from_response(response,
                        formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login',
                                  'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},
                        callback=self.parse)]
The parse_items function simply parses the desired content from one page:
    def parse_items(self, response):
        questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
        for question in questions:
            item = StackItem()
            item['name'] = question.xpath('th/text()').extract()[0]
            item['value'] = question.xpath('td/text()').extract()[0]
            yield item
Can you please help me update this code so that it opens and parses more than one page per session?
I don't want to sign in over and over again for each request.
The session most likely depends on cookies, and Scrapy manages those by itself. I.e.:
def parse_items(self, response):
    questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
    for question in questions:
        item = StackItem()
        item['name'] = question.xpath('th/text()').extract()[0]
        item['value'] = question.xpath('td/text()').extract()[0]
        yield item
    next_url = ''  # find url to next page in the current page
    if next_url:
        yield Request(next_url, self.parse_items)
        # scrapy will retain the session for the next page if it's managed by cookies
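For the asker's concrete case (a known list of ids rather than a "next page" link), the same idea applies: turn parse into a generator and yield one Request per id, so the login happens once and the cookie-based session is reused for every page. A sketch built on the question's own spider, with the id list hard-coded for illustration:

def parse(self, response):
    if "stat" in response.url:
        # already signed in: queue every detail page in this session
        ids = [305554, 305553, 305552]  # example ids from the question
        for page_id in ids:
            yield Request("http://example.com/seller/demand/?id=%d" % page_id,
                          callback=self.parse_items)
    else:
        # not signed in yet: submit the login form first
        yield FormRequest.from_response(
            response,
            formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login',
                      'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},
            callback=self.parse)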
I am currently working on the same problem. I use InitSpider so I can override __init__ and init_request. The first is just for initialisation of custom stuff, and the actual magic happens in my init_request:
def init_request(self):
    """This function is called before crawling starts."""
    # Do not start a request on error,
    # simply return nothing and quit scrapy
    if self.abort:
        return
    # Do a login
    if self.login_required:
        # Start with login first
        return Request(url=self.login_page, callback=self.login)
    else:
        # Start with the parse function
        return Request(url=self.base_url, callback=self.parse)
My login looks like this:
def login(self, response):
    """Generate a login request."""
    self.log('Login called')
    return FormRequest.from_response(
        response,
        formdata=self.login_data,
        method=self.login_method,
        callback=self.check_login_response
    )
self.login_data is a dict with the POST values.
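The check_login_response callback referenced above is not shown; a minimal sketch might look like this, where the marker string is an assumption (use whatever only appears on the page when you are actually logged in, e.g. a logout link):

def check_login_response(self, response):
    """Verify the login worked, then hand control back to the normal crawl."""
    if "Logout" in response.body:  # assumed marker of a logged-in page
        self.log('Login successful')
        return self.initialized()  # InitSpider: continue with the regular requests
    self.log('Login failed')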
I am still a beginner with Python and Scrapy, so I might be doing it the wrong way. Anyway, so far I have produced a working version that can be viewed on GitHub.
HTH:
https://github.com/cytopia/crawlpy

extracting tweet from Twitter API using Python

I want to get every tweet of HousingWire on Twitter (https://twitter.com/HousingWire). I understand how to authenticate to the Twitter account, but how can I get the tweets of HousingWire?
I know how to stream data based on keywords, but I want to stream HousingWire's tweets. How can I do that?
import time
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

ckey = ''
csecret = ''
atoken = ''
asecret = ''

class listener(StreamListener):
    def on_data(self, data):
        try:
            print data
            #tweet=data.split(',"text":"')[1].split('","source')[0]
            #print tweet
            #savethis=str(time.time())+'::'+tweet
            savefile = open('tweetdb.txt', 'a')
            savefile.write(data)
            savefile.write('\n')
            savefile.close()
            return True
        except BaseException, e:
            print 'failed on data', str(e)
            time.sleep(5)

    def on_error(self, status):
        print status

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=["stock"])
You can use the Python script below to grab the last 3,240 tweets from HousingWire (Twitter only allows access to that many tweets from a user - there is no way to grab the complete history). Usage: simply put the Twitter screen name in the script.
#!/usr/bin/env python
# encoding: utf-8

import tweepy  # https://github.com/tweepy/tweepy
import csv

# Twitter API credentials
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

def get_all_tweets(screen_name):
    # Twitter only allows access to a users most recent 3240 tweets with this method

    # authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    # initialize a list to hold all the tweepy Tweets
    alltweets = []

    # make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)

    # save most recent tweets
    alltweets.extend(new_tweets)

    # save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1

    # keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        print "getting tweets before %s" % (oldest)

        # all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)

        # save most recent tweets
        alltweets.extend(new_tweets)

        # update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1

        print "...%s tweets downloaded so far" % (len(alltweets))

    # transform the tweepy tweets into a 2D array that will populate the csv
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text.encode("utf-8")] for tweet in alltweets]

    # write the csv
    with open('%s_tweets.csv' % screen_name, 'wb') as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text"])
        writer.writerows(outtweets)

    pass

if __name__ == '__main__':
    # pass in the username of the account you want to download
    get_all_tweets("J_tsar")
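If the goal is specifically to stream HousingWire's new tweets as they appear (rather than pull the history), tweepy's Stream.filter also takes a follow parameter with numeric user IDs instead of track keywords. A rough sketch reusing the listener class and credentials from the question:

from tweepy import API, OAuthHandler, Stream

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
api = API(auth)

# The streaming endpoint filters by numeric user id, not screen name,
# so resolve the id first.
housingwire_id = str(api.get_user(screen_name="HousingWire").id)

twitterStream = Stream(auth, listener())
twitterStream.filter(follow=[housingwire_id])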

fetching various data corresponding to a tweet

I am trying to fetch data from Twitter for processing. Please see the code below: I want various pieces of data corresponding to a particular tweet on a given topic. I am able to fetch some data (created_at, text, username, user_id), but it shows an error when I try to fetch location, followers_count, friends_count and retweet_count.
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time
import json

ckey = '***********************'
csecret = '************************'
atoken = '*************************'
asecret = '**********************'

class listener(StreamListener):
    def on_data(self, data):
        try:
            all_data = json.loads(data)
            tweet = all_data["text"]
            username = all_data["user"]["screen_name"]
            timestamp = all_data["created_at"]
            user_id = all_data["id_str"]
            location = all_data["location"]
            followers_count = all_data["followers_count"]
            friends_count = all_data["friends_count"]
            retweet_count = all_data["retweet_count"]
            saveThis = str(time.time())+'::'+timestamp+'::'+username+'::'+user_id+'::'+tweet+'::'+followers_count+'::'+friends_count+'::'+retweet_count+'::'+location
            saveFile = open('clean2.txt', 'a')
            saveFile.write(saveThis)
            saveFile.write('\n')
            saveFile.close()
            return True
        except BaseException, e:
            print 'failed on data,', str(e)
            time.sleep(5)

    def on_error(self, status):
        print status

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=["tweepy"])  # topic
The reason it fails on all_data["location"] is that tweets don't have such a property: https://dev.twitter.com/overview/api/tweets
The same goes for friends_count and followers_count - they are properties of users, not tweets.
The code should not be failing on all_data["retweet_count"], as tweets do have that property.
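Concretely, those fields live under the nested "user" object of the tweet payload, so the lookups in the question's on_data would become something like this (sketch of the relevant lines only):

            location        = all_data["user"]["location"]
            followers_count = all_data["user"]["followers_count"]
            friends_count   = all_data["user"]["friends_count"]
            retweet_count   = all_data["retweet_count"]  # this one is on the tweet itself

            # the counts are integers, so convert them before concatenating into saveThis
            saveThis = '::'.join([str(time.time()), timestamp, username, user_id, tweet,
                                  str(followers_count), str(friends_count), str(retweet_count),
                                  str(location)])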
P.S. Please include the error message (even if you skip the full traceback) when reporting errors. It makes it easier to help you; otherwise one has to guess what the error might be.

passing commandline arguments to a selenium python webdriver test case

The following code is written using the Selenium Python WebDriver and is run on Sauce Labs. I am providing the browser name, version and platform in a list; how do I do the same by providing the browser details through command-line arguments? I am using py.test to execute the test cases.
import os
import sys
import httplib
import base64
import json
import new
import unittest
import sauceclient
from selenium import webdriver
from sauceclient import SauceClient

# it's best to remove the hardcoded defaults and always get these values
# from environment variables
USERNAME = os.environ.get('SAUCE_USERNAME', "ranjanprabhub")
ACCESS_KEY = os.environ.get('SAUCE_ACCESS_KEY', "ecec4dd0-d8da-49b9-b719-17e2c43d0165")

sauce = SauceClient(USERNAME, ACCESS_KEY)

browsers = [{"platform": "Mac OS X 10.9",
             "browserName": "chrome",
             "version": ""},
            ]

def on_platforms(platforms):
    def decorator(base_class):
        module = sys.modules[base_class.__module__].__dict__
        for i, platform in enumerate(platforms):
            d = dict(base_class.__dict__)
            d['desired_capabilities'] = platform
            name = "%s_%s" % (base_class.__name__, i + 1)
            module[name] = new.classobj(name, (base_class,), d)
    return decorator

@on_platforms(browsers)
class SauceSampleTest(unittest.TestCase):
    def setUp(self):
        self.desired_capabilities['name'] = self.id()
        sauce_url = "http://%s:%s@ondemand.saucelabs.com:80/wd/hub"
        self.driver = webdriver.Remote(
            desired_capabilities=self.desired_capabilities,
            command_executor=sauce_url % (USERNAME, ACCESS_KEY)
        )
        self.driver.implicitly_wait(30)

    def test_sauce(self):
        self.driver.get('http://saucelabs.com/test/guinea-pig')
        assert "I am a page title - Sauce Labs" in self.driver.title
        comments = self.driver.find_element_by_id('comments')
        comments.send_keys('Hello! I am some example comments.'
                           ' I should be in the page after submitting the form')
        self.driver.find_element_by_id('submit').click()
        commented = self.driver.find_element_by_id('your_comments')
        assert ('Your comments: Hello! I am some example comments.'
                ' I should be in the page after submitting the form'
                in commented.text)
        body = self.driver.find_element_by_xpath('//body')
        assert 'I am some other page content' not in body.text
        self.driver.find_elements_by_link_text('i am a link')[0].click()
        body = self.driver.find_element_by_xpath('//body')
        assert 'I am some other page content' in body.text

    def tearDown(self):
        print("Link to your job: https://saucelabs.com/jobs/%s" % self.driver.session_id)
        try:
            if sys.exc_info() == (None, None, None):
                sauce.jobs.update_job(self.driver.session_id, passed=True)
            else:
                sauce.jobs.update_job(self.driver.session_id, passed=False)
        finally:
            self.driver.quit()
So this is a bit complicated, because you can pass an array of browsers into the @on_platforms decorator. My solution will only work for a single browser, as it looks like that's what you're doing right now.
For the current single-browser situation -- you're looking for argparse. Here's my suggested fix:
import argparse

def setup_parser():
    parser = argparse.ArgumentParser(description='Automation Testing!')
    parser.add_argument('-p', '--platform', help='Platform for desired_caps', default='Mac OS X 10.9')
    parser.add_argument('-b', '--browser-name', help='Browser Name for desired_caps', default='chrome')
    parser.add_argument('-v', '--version', default='')
    args = vars(parser.parse_args())
    return args

desired_caps = setup_parser()
browsers = [desired_caps]
print browsers
But if you're looking to test multiple browsers (which I suggest you do!), you should not try to use command-line arguments for the desired_caps of each individual browser. You should instead load a JSON config file describing the browsers and the desired_caps for each one that you want Sauce to run.
Maybe have a different config file for each set of browsers, and then use command-line arguments to pass in the config file you want to load, as sketched below.
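A rough sketch of that idea, with the file name and JSON structure chosen only for illustration: a JSON file holds a list of desired_capabilities dicts, and a --config argument selects which file to load.

import argparse
import json

def load_browsers():
    parser = argparse.ArgumentParser(description='Automation Testing!')
    parser.add_argument('-c', '--config', default='browsers.json',
                        help='JSON file containing a list of desired_capabilities dicts')
    args = parser.parse_args()
    with open(args.config) as f:
        return json.load(f)

# browsers.json (example contents):
# [
#   {"platform": "Mac OS X 10.9", "browserName": "chrome", "version": ""},
#   {"platform": "Windows 8.1", "browserName": "internet explorer", "version": "11"}
# ]

browsers = load_browsers()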