There is a site that I connect to, but I need to log in 4 times with different usernames and passwords.
Is there any way I can do this by looping through the usernames and passwords in a payload?
This is the first time I am doing this and I am not really sure how to go about it.
The code works fine if I post just one username and password.
I'm using Python 2.7 with BeautifulSoup and requests.
Here is my code.
import requests
import zipfile, StringIO
from bs4 import BeautifulSoup
# Here we add the login details to be submitted to the login form.
payload = [
    {'USERNAME': 'xxxxxx','PASSWORD': 'xxxxxx','option': 'login'},
    {'USERNAME': 'xxxxxx','PASSWORD': 'xxxxxxx','option': 'login'},
    {'USERNAME': 'xxxxx','PASSWORD': 'xxxxx','option': 'login'},
    {'USERNAME': 'xxxxxx','PASSWORD': 'xxxxxx','option': 'login'},
]
# Possibly need headers later.
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
base_url = ""
with requests.Session() as s:
    p = s.post('', data=payload)
    # Get the download page to scrape.
    r = s.get('', stream=True)
    content = r.text
    soup = BeautifulSoup(content, 'lxml')
    # Now I get the most recent download URL.
    download_url = soup.find_all("a", {'class':'tabletd'})[-1]['href']
    # Now we join the base url with the download url.
    download_docs = s.get(base_url + download_url, stream=True)
    print "Checking Content"
    content_type = download_docs.headers['content-type']
    print content_type
    print "Checking Filename"
    content_name = download_docs.headers['content-disposition']
    print content_name
    print "Checking Download Size"
    content_size = download_docs.headers['content-length']
    print content_size
    # This is where we extract and download the specified xml files.
    z = zipfile.ZipFile(StringIO.StringIO(download_docs.content))
    print "---------------------------------"
    print "Downloading........."
    # Now we save the files to the specified location.
    print "Download Complete"
Just use a for loop. You may need to adjust your download directory if files will be overwritten.
payloads = [
    {'USERNAME': 'xxxxxx1','PASSWORD': 'xxxxxx','option': 'login'},
    {'USERNAME': 'xxxxxx2','PASSWORD': 'xxxxxxx','option': 'login'},
    {'USERNAME': 'xxxxx3','PASSWORD': 'xxxxx','option': 'login'},
    {'USERNAME': 'xxxxxx4','PASSWORD': 'xxxxxx','option': 'login'},
]
for payload in payloads:
    with requests.Session() as s:
        p = s.post('', data=payload)
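A slightly fuller sketch of the same loop, written for Python 3, might look like the following. The function name `run_for_each_account`, the `login_url` and `download_dir` parameters, the `make_session` hook, and the one-directory-per-username layout are all assumptions made for illustration; none of them come from the original code. Each account gets its own session so that cookies from one login do not leak into the next, and its own directory so that files are not overwritten.

```python
import os


def run_for_each_account(payloads, login_url, download_dir, make_session=None):
    """Log in once per payload and return {username: target_directory}.

    make_session is a hypothetical hook: any callable returning a context
    manager with a post() method. It defaults to requests.Session.
    """
    if make_session is None:
        import requests  # third-party; imported lazily so the hook can replace it
        make_session = requests.Session

    targets = {}
    for payload in payloads:
        # A fresh session per account keeps each login's cookies separate.
        with make_session() as s:
            response = s.post(login_url, data=payload)
            response.raise_for_status()
            # One directory per username (assumed layout).
            target = os.path.join(download_dir, payload['USERNAME'])
            os.makedirs(target, exist_ok=True)
            # ... scrape and download with `s` here, saving into `target` ...
            targets[payload['USERNAME']] = target
    return targets
```

The scraping and download steps from the question would go where the placeholder comment is, using `s` and writing into `target`.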
I tried to call the Superset REST API described in the documentation below, but it does not work.
request: curl -XGET -L http://[IP:PORT]/api/v1/chart
response: {"msg":"Bad Authorization header. Expected value 'Bearer <JWT>'"}
I have installed Superset both with pip and with the Helm chart, and the result is the same. helm:
How should I call the REST API?
Check the security section of the documentation you linked. It has the /security/login API; follow its JSON parameter format to get the JWT bearer token, then send that token in the Authorization header of your other API calls to Superset.
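The two steps described above (log in, then send the token) could be sketched with the standard library as follows. The helper names, the `base_url` parameter, and the `provider` and `refresh` fields in the login body are assumptions; check the /security/login schema on your own instance before relying on them.

```python
import json
import urllib.request


def build_login_request(base_url, username, password):
    """Build (not send) the POST request for /api/v1/security/login."""
    body = {
        "username": username,
        "password": password,
        "provider": "db",   # assumed: database-backed authentication
        "refresh": True,    # assumed: also ask for a refresh token
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/security/login",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def bearer_headers(access_token):
    """Header the other endpoints ask for: 'Bearer <JWT>'."""
    return {"Authorization": "Bearer " + access_token}


def list_charts(base_url, username, password):
    """Log in, then call /api/v1/chart/ with the token (needs a live server)."""
    login = build_login_request(base_url, username, password)
    with urllib.request.urlopen(login) as resp:
        access_token = json.load(resp)["access_token"]
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/chart/",
        headers=bearer_headers(access_token),
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Only `list_charts` touches the network; the two helpers can be checked on their own before pointing the code at a real host.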
open http://localhost:8080/swagger/v1, assuming http://localhost:8080 is your Superset host address
then find this section
The response will look like this:
{
"access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmcmVzaCI6dHJ1ZSwiaWF0IjoxNjU0MzQ2OTM5LCJqdGkiOiJlZGY2NTUxMC0xMzI1LTQ0NDEtYmFmMi02MDc1MzhjZDcwNGYiLCJ0eXBlIjoiYWNjZXNzIiwic3ViIjoxLCJuYmYiOjE2NTQzNDY5MzksImV4cCI6MTY1NDM0NzgzOX0.TfjUea3ycH77xhCWOpO4LFbYHrT28Y8dnWsc1xS_IOY",
"refresh_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmcmVzaCI6ZmFsc2UsImlhdCI6MTY1NDM0NjkzOSwianRpIjoiNzBiM2EyZDYtNDFlNy00ZDNlLWE0NDQtMTRiNTkyNTk4NjUwIiwidHlwZSI6InJlZnJlc2giLCJzdWIiOjEsIm5iZiI6MTY1NDM0NjkzOSwiZXhwIjoxNjU2OTM4OTM5fQ.OgcctNnO4zTDfTgtHnaEshk7u-D6wOxfxjCsjqjKYyE"
}
Thanks to @andrewsali's comment on this GitHub issue, I finally figured out how to access the Superset REST API from Python code.
import requests
from bs4 import BeautifulSoup
import json
def get_supetset_session():
    # Example use of the returned session:
    # url = f'http://{superset_host}/api/v1/chart/'
    # r = s.get(url)
    # print(r.json())
    superset_host = ''  # replace with your own host
    username = 'YOUR_NAME'
    password = 'YOUR_PASSWORD'
    # set up session for auth
    s = requests.Session()
    login_form = s.post(f"http://{superset_host}/login")
    # get Cross-Site Request Forgery protection token
    soup = BeautifulSoup(login_form.text, 'html.parser')
    csrf_token = soup.find('input',{'id':'csrf_token'})['value']
    data = {
        'username': username,
        'password': password,
        'csrf_token': csrf_token
    }
    # login the given session
    s.post(f'http://{superset_host}/login/', data=data)
    return s
# s = get_supetset_session()
base_url = ''
def get_dashboards_list(s, base_url=base_url):
    """## GET List of Dashboards"""
    url = base_url + '/api/v1/dashboard/'
    r = s.get(url)
    resp_dashboard = r.json()
    for result in resp_dashboard['result']:
        print(result['dashboard_title'], result['id'])
s = get_supetset_session()
# {'session': '.eJwlj8FqAzEMRP_F5z1Islay8jOLJcu0NDSwm5xK_r0uPQ7DG978lGOeeX2U2_N85VaOz1FuxVK6JIHu1QFhGuEOk5NG8qiYGkJ7rR3_Ym-uJMOzJqySeHhIG8SkNQK6GVhTdLf0ZMmG6sZGQtiQ1Gz0qYiUTVoHhohZthLXOY_n4yu_l0-VKTObLaE13i2Hz2A2rzBmhU7WkkN1cfdH9HsuZoFbeV15_l_C8v4F4nBC9A.Ypn16Q.yz4E-vz0gp3EmJwv-6tYIcOGavU'}
Thanks @Ferris for this visual solution!
To add to this, you can also make the appropriate API call with Python, as follows:
import requests
api_url = "your_url/api/v1/security/login"
payload = {"password":"your password",
           "username":"your username"
}
response = requests.post(api_url, json=payload)
# the response body is JSON, which holds access_token and refresh_token
access_token = response.json()['access_token']
# now get a guest token
api_url_for_guesttoken = "your_url/api/v1/security/guest_token"
payload = {}
# now this is the crucial part: add the specific auth-header
response = requests.post(api_url_for_guesttoken, json=payload, headers={'Authorization':f"Bearer {access_token}"})
I have been working on Python code to search for and download SMAP satellite data from the NSIDC HTTPS website. My code worked until last week, when it started failing with this error:
urllib2.HTTPError: HTTP Error 404: Not Found
Any help?
The code is an adaptation of an example from the NSIDC website that does exactly what I need. The example is below:
"""This script,, defines an HTML parser to scrape data files from an earthdata HTTPS URL and bulk downloads all files to your working directory.
This code was adapted from
Last edited Jan 26, 2017 G. Deemer"""
import urllib2
import os
from cookielib import CookieJar
from HTMLParser import HTMLParser
# Define a custom HTML parser to scrape the contents of the HTML data table
class MyHTMLParser(HTMLParser):
def __init__(self):
self.inLink = False
self.dataList = [] = '/'
self.indexcol = ';'
self.Counter = 0
def handle_starttag(self, tag, attrs):
self.inLink = False
if tag == 'table':
self.Counter += 1
if tag == 'a':
for name, value in attrs:
if name == 'href':
if in value or self.indexcol in value:
self.inLink = True
self.lasttag = tag
def handle_endtag(self, tag):
if tag == 'table':
self.Counter +=1
def handle_data(self, data):
if self.Counter == 1:
if self.lasttag == 'a' and self.inLink and data.strip():
parser = MyHTMLParser()
# Define function for batch downloading
def BatchJob(Files, cookie_jar):
for dat in Files:
print "downloading: ", dat
JobRequest = urllib2.Request(url+dat)
JobRequest.add_header('cookie', cookie_jar) # Pass the saved cookie into additional HTTP request
JobRedirect_url = urllib2.urlopen(JobRequest).geturl() + '&app_type=401'
# Request the resource at the modified redirect url
Request = urllib2.Request(JobRedirect_url)
Response = urllib2.urlopen(Request)
f = open( dat, 'wb')
print "Files downloaded to: ", os.path.dirname(os.path.realpath(__file__))
# The following code block is used for HTTPS authentication
# The user credentials that will be used to authenticate access to the data
username = "user"
password = "password"
# The FULL url of the directory which contains the files you would like to bulk download
url = "" # Example URL
# Create a password manager to deal with the 401 response that is returned from
# Earthdata Login
password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, "", username, password)
# Create a cookie jar for storing cookies. This is used to store and return
# the session cookie given to use by the data server (otherwise it will just
# keep sending us back to Earthdata Login to authenticate). Ideally, we
# should use a file based cookie jar to preserve cookies between runs. This
# will make it much more efficient.
cookie_jar = CookieJar()
# Install all the handlers.
opener = urllib2.build_opener(
#urllib2.HTTPHandler(debuglevel=1), # Uncomment these two lines to see
#urllib2.HTTPSHandler(debuglevel=1), # details of the requests/responses
# Create and submit the requests. There are a wide range of exceptions that
# can be thrown here, including HTTPError and URLError. These should be
# caught and handled.
# Open a request to grab filenames within a directory. Print optional
DirRequest = urllib2.Request(url)
DirResponse = urllib2.urlopen(DirRequest)
# Get the redirect url and append 'app_type=401'
# to do basic http auth
DirRedirect_url = DirResponse.geturl()
DirRedirect_url += '&app_type=401'
# Request the resource at the modified redirect url
DirRequest = urllib2.Request(DirRedirect_url)
DirResponse = urllib2.urlopen(DirRequest)
DirBody =
# Uses the HTML parser defined above to print the content of the directory containing data
Files = parser.dataList
# Display the contents of the python list declared in the HTMLParser class
# print Files #Uncomment to print a list of the files
# Call the function to download all files in url
BatchJob(Files, cookie_jar) # Comment out to prevent downloading to your working directory
I fixed the bug by loading the directory page directly and picking the files to download out of its HTML, as in the code below.
"""This script,, defines an HTML parser to scrape data files from an earthdata HTTPS URL and bulk downloads all files to your working directory.
This code was adapted from Last edited Jan 26, 2017 G. Deemer"""
import urllib2
import os
from cookielib import CookieJar
# Define function for batch downloading
def BatchJob(Files, cookie_jar):
for dat in Files:
print "downloading: ", dat
JobRequest = urllib2.Request(url+dat)
JobRequest.add_header('cookie', cookie_jar) # Pass the saved cookie into additional HTTP request
JobRedirect_url = urllib2.urlopen(JobRequest).geturl() + '&app_type=401'
# Request the resource at the modified redirect url
Request = urllib2.Request(JobRedirect_url)
Response = urllib2.urlopen(Request)
f = open( dat, 'wb')
print "Files downloaded to: ", os.path.dirname(os.path.realpath(__file__))
# The following code block is used for HTTPS authentication
# The user credentials that will be used to authenticate access to the data
username = "user"
password = "password"
# The FULL url of the directory which contains the files you would like to bulk download
url = "" # Example URL
# Create a password manager to deal with the 401 response that is returned from
# Earthdata Login
password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, "", username, password)
# Create a cookie jar for storing cookies. This is used to store and return
# the session cookie given to use by the data server (otherwise it will just
# keep sending us back to Earthdata Login to authenticate). Ideally, we
# should use a file based cookie jar to preserve cookies between runs. This
# will make it much more efficient.
cookie_jar = CookieJar()
# Install all the handlers.
opener = urllib2.build_opener(
#urllib2.HTTPHandler(debuglevel=1), # Uncomment these two lines to see
#urllib2.HTTPSHandler(debuglevel=1), # details of the requests/responses
# Create and submit the requests. There are a wide range of exceptions that
# can be thrown here, including HTTPError and URLError. These should be
# caught and handled.
# Open a request to grab filenames within a directory. Print optional
DirResponse = urllib2.urlopen(url)
htmlPage = DirResponse.read()
listFiles = [x.split(">")[0].replace('"', "")
             for x in htmlPage.split("><a href=") if x.split(">")[0].endswith('.h5"')]
# Display the contents of the list of .h5 files built above
# print listFiles #Uncomment to print a list of the files
# Call the function to download all files in url
BatchJob(listFiles, cookie_jar) # Comment out to prevent downloading to your working directory
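Splitting the page on the literal string `><a href=` depends on the exact markup of the listing. As a sketch of a less brittle alternative, the same list of `.h5` links could be pulled out with the standard HTML parser. This is written for Python 3 (`html.parser` rather than the Python 2 `HTMLParser` module), and the names `H5LinkParser` and `list_h5_files` are hypothetical.

```python
from html.parser import HTMLParser


class H5LinkParser(HTMLParser):
    """Collect href values of <a> tags that end in a given suffix."""

    def __init__(self, suffix=".h5"):
        super().__init__()
        self.suffix = suffix
        self.files = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith(self.suffix):
                # Directory listings often repeat a link; keep the first only.
                if value not in self.files:
                    self.files.append(value)


def list_h5_files(html_page, suffix=".h5"):
    """Return the hrefs ending in suffix, in page order, without duplicates."""
    parser = H5LinkParser(suffix)
    parser.feed(html_page)
    return parser.files
```

`html_page` must be text, so a response body read as bytes would need decoding first. The returned list plays the same role as `listFiles` above.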
I would like to print a PDF version of my MediaWiki page using pdfkit.
My MediaWiki requires a valid login to see any pages.
I log in to MediaWiki using requests; this works, and I get some cookies. However, I am not able to use these cookies with pdfkit.from_url().
My Python script looks like this:
#!/usr/bin/env python2
import pdfkit
import requests
import pickle
mywiki = ""# URL
username = 'produnis' # Username to login with
password = 'seeeecret#' # Login Password
## Login to MediaWiki
# Login request
payload = {'action': 'query', 'format': 'json', 'utf8': '', 'meta': 'tokens', 'type': 'login'}
r1 = requests.post(mywiki + 'api.php', data=payload)
# login confirm
login_token = r1.json()['query']['tokens']['logintoken']
payload = {'action': 'login', 'format': 'json', 'utf8': '', 'lgname': username, 'lgpassword': password, 'lgtoken': login_token}
r2 = requests.post(mywiki + 'api.php', data=payload, cookies=r1.cookies)
So, right here I am successfully logged in, and cookies are stored in r2.cookies.
The print()-command gives:
<RequestsCookieJar[<Cookie produniswikiToken=832a1f1da165016fb9d9a107ddb218fc for>, <Cookie produniswikiUserID=1 for>, <Cookie produniswikiUserName=Produnis for>, <Cookie produniswiki_session=oddicobpi1d5af4n0qs71g7dg1kklmbo for>]>
I can save the cookies into a file:
def save_cookies(requests_cookiejar, filename):
    with open(filename, 'wb') as f:
        pickle.dump(requests_cookiejar, f)
save_cookies(r2.cookies, "cookies")
This file looks like this:
Now I want to print a specific page to PDF using pdfkit. The man page states that cookies can be set via a cookie-jar file:
options = {
    'page-size': 'A4',
    'margin-top': '0.5in',
    'margin-right': '0.5in',
    'margin-bottom': '0.5in',
    'margin-left': '0.5in',
    'encoding': "UTF-8",
    'cookie-jar' : "cookies",
    'no-outline': None
}
current_pdf = pdfkit.from_url(pdf_url, the_filename, options=options)
My problem is:
With this code, the "cookies" file becomes 0 KB and the PDF says "You must be logged in to view a page..."
So my question is:
How can I use the cookies from requests in pdfkit.from_url()?
I had the same issue and overcame it with the following:
import requests, pdfkit
# Get login cookie
s = requests.session() # if you're making multiple calls
data = {'username': 'admin', 'password': 'hunter2'}
s.post('', data=data)
# Get yourself a PDF
options = {'cookie': s.cookies.items(), 'javascript-delay': 1000}
pdfkit.from_url('', 'report.pdf', options=options)
Depending on how much javascript you're trying to load you might want to set the javascript-delay to something higher or lower; the default is 200ms.
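The approach above passes the cookies as individual name/value pairs instead of pointing `cookie-jar` at a file. One likely reason the pickled file from the question did not work is that wkhtmltopdf reads and writes its cookie jar in its own format, not as a Python pickle. A minimal sketch of the conversion follows; the helper names `pdfkit_options_from_cookies` and `print_page` are hypothetical.

```python
def pdfkit_options_from_cookies(cookies, javascript_delay_ms=1000):
    """Turn a name -> value cookie mapping into a pdfkit options dict."""
    return {
        # pdfkit accepts a list of (name, value) tuples for a repeatable option.
        'cookie': [(name, value) for name, value in cookies.items()],
        'javascript-delay': javascript_delay_ms,
    }


def print_page(url, output_path, cookies):
    """Render url to output_path, sending the given cookies (needs pdfkit)."""
    import pdfkit  # third-party; also requires the wkhtmltopdf binary
    options = pdfkit_options_from_cookies(cookies)
    return pdfkit.from_url(url, output_path, options=options)
```

Either a plain dict or the session's cookie jar can be passed as `cookies`, since both provide `items()`.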
I am trying to build a basic LinkedIn scraper for a research project and am running into challenges when I try to scrape through levels of the directory. I am a beginner, and whenever I run the code below, IDLE returns an error before shutting down. The code and the error are below:
import requests
from bs4 import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint as pp
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
# use this to gather all of the individual links from the second directory page
def get_second_links(pre_section_link):
    response = requests.get(pre_section_link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column = soup.find("ul", attrs={'class':'column dual-column'})
    second_links = [li.a["href"] for li in column.findAll("li")]
    return second_links
# use this to gather all of the individual links from the third directory page
def get_third_links(section_link):
    response = requests.get(section_link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column = soup.find("ul", attrs={'class':'column dual-column'})
    third_links = [li.a["href"] for li in column.findAll("li")]
    return third_links
# use this to build the individual profile links
def get_profile_link(link):
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column2 = soup.find("ul", attrs={'class':'column dual-column'})
    profile_links = [PROFILE_URL + li.a["href"] for li in column2.findAll("li")]
    return profile_links
if __name__=="__main__":
    sub_directory = get_second_links("")
    sub_directory = map(get_third_links, sub_directory)
    profiles = get_third_links(sub_directory)
    profiles = map(get_profile_link, profiles)
    profiles = [item for sublist in fourth_links for item in sublist]
Error I keep getting:
Error Page
You need to add https to PROFILE_URL:
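The value of PROFILE_URL is not shown above, so the following is only a sketch of the general idea, using a placeholder host. requests rejects a URL that has no scheme, so a base URL needs `https://` in front before scraped links are joined to it. The helper names `with_scheme` and `absolute_link` are hypothetical, and the code is written for Python 3.

```python
from urllib.parse import urljoin, urlsplit


def with_scheme(url, default_scheme="https"):
    """Return url unchanged if it has a scheme, else prefix default_scheme."""
    if urlsplit(url).scheme:
        return url
    return default_scheme + "://" + url.lstrip("/")


def absolute_link(base_url, href):
    """Resolve a scraped href against a base URL that may lack a scheme."""
    return urljoin(with_scheme(base_url), href)
```

Using `urljoin` instead of string concatenation also handles hrefs that are already absolute, which would otherwise end up with the base URL glued in front of them.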
pyGTrends does not seem to work; it gives errors in Python.
pyGoogleTrendsCsvDownloader seems to work and logs in, but after 1-3 requests (per day!) it complains about an exhausted quota, even though manual download with the same login/IP works flawlessly.
Bottom line: neither works. Searching through Stack Overflow, I found many questions from people trying to pull CSVs from Google, but no workable solution.
Thank you in advance to whoever is able to help. How should the code be changed? Do you know of another solution that works?
Here's the code of
import httplib
import urllib
import urllib2
import re
import csv
import lxml.etree as etree
import lxml.html as html
import traceback
import gzip
import random
import time
import sys
from cookielib import Cookie, CookieJar
from StringIO import StringIO
class pyGoogleTrendsCsvDownloader(object):
Google Trends Downloader
Recommended usage:
from pyGoogleTrendsCsvDownloader import pyGoogleTrendsCsvDownloader
r = pyGoogleTrendsCsvDownloader(username, password)
r.get_csv(cat='0-958', geo='US-ME-500')
def __init__(self, username, password):
Provide login and password to be used to connect to Google Trends
All immutable system variables are also defined here
# The amount of time (in secs) that the script should wait before making a request.
# This can be used to throttle the downloading speed to avoid hitting servers too hard.
# It is further randomized.
self.download_delay = 0.25
self.service = "trendspro"
self.url_service = ""
self.url_download = self.url_service + "trendsReport?"
self.login_params = {}
# These headers are necessary, otherwise Google will flag the request at your account level
self.headers = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'),
("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
("Accept-Language", "en-gb,en;q=0.5"),
("Accept-Encoding", "gzip, deflate"),
("Connection", "keep-alive")]
self.url_login = ''+self.service+'&passive=1209600&continue='+self.url_service+'&followup='+self.url_service
self.url_authenticate = ''
self.header_dictionary = {}
self._authenticate(username, password)
def _authenticate(self, username, password):
Authenticate to Google:
1 - make a GET request to the Login webpage so we can get the login form
2 - make a POST request with email, password and login form input values
# Make sure we get CSV results in English
ck = Cookie(version=0, name='I4SUserLocale', value='en_US', port=None, port_specified=False, domain='', domain_specified=False,domain_initial_dot=False, path='/trends', path_specified=True, secure=False, expires=None, discard=False, comment=None, comment_url=None, rest=None)
self.cj = CookieJar()
self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
self.opener.addheaders = self.headers
# Get all of the login form input values
find_inputs = etree.XPath("//form[@id='gaia_loginform']//input")
resp =
if'Content-Encoding') == 'gzip':
buf = StringIO(
f = gzip.GzipFile(fileobj=buf)
data =
data =
xmlTree = etree.fromstring(data, parser=html.HTMLParser(recover=True, remove_comments=True))
for input in find_inputs(xmlTree):
name = input.get('name')
if name:
name = name.encode('utf8')
value = input.get('value', '').encode('utf8')
self.login_params[name] = value
print("Exception while parsing: %s\n" % traceback.format_exc())
self.login_params["Email"] = username
self.login_params["Passwd"] = password
params = urllib.urlencode(self.login_params), params)
def get_csv(self, throttle=False, **kwargs):
Download CSV reports
# Randomized download delay
if throttle:
r = random.uniform(0.5 * self.download_delay, 1.5 * self.download_delay)
params = {
'export': 1
params = urllib.urlencode(params)
r = + params)
# Make sure everything is working ;)
if not'Content-Disposition'):
print "You've exceeded your quota. Continue tomorrow..."
if'Content-Encoding') == 'gzip':
buf = StringIO(
f = gzip.GzipFile(fileobj=buf)
data =
data =
myFile = open('trends_%s.csv' % '_'.join(['%s-%s' % (key, value) for (key, value) in kwargs.items()]), 'w')
Although I don't know Python, I may have a solution. I am currently doing the same thing in C#. I didn't get the .csv file, but I built a custom URL in code, downloaded the HTML it returns, and saved it to a text file (also in code). That HTML (at line 12) contains all the information needed to create the graph shown on Google Trends. It also contains a lot of unnecessary text that needs to be cut down, but either way you end up with the same result: the Google Trends data. I posted a more detailed answer to my question here:
Downloading .csv file from Google Trends
There is an alternative module named pytrends. It is really cool, and I would recommend it.
Example usage:
import numpy as np
import pandas as pd
from pytrends.request import TrendReq
pytrend = TrendReq()
#It is the term that you want to search
pytrend.build_payload(kw_list=["Eminem is the Rap God"])
# Find which region has searched the term
df = pytrend.interest_by_region()
If you have a list of terms to search, you could use a for loop to automate the queries.
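A minimal sketch of that loop follows. It uses only the pytrends calls shown in the example above (`TrendReq`, `build_payload`, `interest_by_region`); the function name `interest_by_region_for_terms` and the `make_client` hook are assumptions added for illustration.

```python
def interest_by_region_for_terms(terms, make_client=None):
    """Return {term: result of interest_by_region()} for each search term."""
    if make_client is None:
        from pytrends.request import TrendReq  # third-party; needs network access
        make_client = TrendReq

    pytrend = make_client()
    results = {}
    for term in terms:
        # Build a fresh payload for each term, then query it.
        pytrend.build_payload(kw_list=[term])
        results[term] = pytrend.interest_by_region()
    return results
```

With the default client, each value in the returned dict is whatever `interest_by_region()` returns for that term. Pausing between iterations may be worth considering if the service starts rejecting requests.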