Parsing a Table from the following website - python-2.7

I want to collect the past weather details of a particular city in India for each day in the year 2016. The following website has this data:
"https://www.timeanddate.com/weather/india/kanpur/historic?month=1&year=2016"
This link has the data for January 2016. There is a nice table on that page, and I want to extract it.
After a lot of trying I managed to extract a different table from the page, but that one does not serve my purpose.
I want the other, bigger table, which gives the data by time of day for each day of that month, because then I can loop over all months using the URL.
The problem is that I do not know HTML or anything related to it, so I am not able to scrape this out myself.

It would have been better if you had included the code you tried. Anyway, this code works for the 1st Jan table. You can write a loop to extract the data for the other days as well.
from urllib.request import urlopen  # Python 3; on Python 2.7 use: from urllib2 import urlopen
from bs4 import BeautifulSoup

url = "https://www.timeanddate.com/weather/india/kanpur/historic?month=1&year=2016"
page = urlopen(url)
soup = BeautifulSoup(page, 'lxml')
data = []
# The table with the readings by time of day has the id "wt-his"
table = soup.find('table', attrs={'id': 'wt-his'})
for tr in table.find('tbody').find_all('tr'):
    row = {}  # one reading; named "row" to avoid shadowing the built-in dict
    row['time'] = tr.find('th').text.strip()
    all_td = tr.find_all('td')
    row['temp'] = all_td[1].text
    row['weather'] = all_td[2].text
    row['wind'] = all_td[3].text
    arrow = all_td[4].text
    if arrow == '↑':
        row['wind_dir'] = 'South to North'
    else:
        row['wind_dir'] = 'North to South'
    row['humidity'] = all_td[5].text
    row['barometer'] = all_td[6].text
    row['visibility'] = all_td[7].text
    data.append(row)
Note: add other cases for the wind_dir logic
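The note above leaves the remaining wind directions open, and the question mentions looping over the months. Here is a minimal sketch of both. It assumes the wind cell holds one of eight Unicode arrow characters (not checked against the live page) and that the URL keeps the month/year query pattern shown in the question. ARROW_TO_DIRECTION, wind_direction and month_urls are hypothetical names.

```python
# Hypothetical mapping: the arrow points where the wind is blowing towards,
# following the '↑' = 'South to North' convention used in the answer.
ARROW_TO_DIRECTION = {
    '↑': 'South to North',
    '↓': 'North to South',
    '→': 'West to East',
    '←': 'East to West',
    '↗': 'Southwest to Northeast',
    '↘': 'Northwest to Southeast',
    '↙': 'Northeast to Southwest',
    '↖': 'Southeast to Northwest',
}


def wind_direction(arrow, default='Unknown'):
    """Translate an arrow character into a readable direction."""
    return ARROW_TO_DIRECTION.get(arrow.strip(), default)


BASE_URL = "https://www.timeanddate.com/weather/india/kanpur/historic"


def month_urls(year):
    """Build one URL per month, using the query pattern from the question."""
    return ["{}?month={}&year={}".format(BASE_URL, month, year)
            for month in range(1, 13)]
```

Inside the scraping loop, wind_direction(all_td[4].text) would then replace the if/else, and each URL from month_urls(2016) can be passed to urlopen in turn.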

Related

How to add weeks to a datetime column, depending on a django model/dictionary?

Context
There is a dataframe of customer invoices and their due dates, identified by customer code.
Week(s) need to be added depending on customer code
Model is created to persist the list of customers and week(s) to be added
What is done so far:
Models.py
from django.db import models

class BpShift(models.Model):
    bp_name = models.CharField(max_length=50, default='')
    bp_code = models.CharField(max_length=15, primary_key=True, default='')
    weeks = models.IntegerField(default=0)
helper.py
from .models import BpShift
# used in views later
def week_shift(self, df):
    df['DueDateRange'] = df['DueDate'] + datetime.timedelta(
        weeks=BpShift.objects.get(pk=df['BpCode']).weeks)
I realised my understanding of DataFrames is seriously flawed.
df['A'] and df['B'] return Series, so of course timedelta cannot work like this: weeks=BpShift.objects.get(pk=df['BpCode']).weeks passes a whole Series where a single primary key is expected.
Dataframe
d = {'BpCode':['customer1','customer2'],'DueDate':['2020-05-30','2020-04-30']}
df = pd.DataFrame(data=d)
Customer List csv
BP Name,BP Code,Week(s)
Customer1,CA0023MY,1
Customer2,CA0064SG,1
Error
BpShift matching query does not exist.
Commentary
I used these methods in the hope that I would be able to change the dataframe all at once, instead of using df.iterrows(). I have recently been avoiding for loops like the plague and wonder if this is the "correct" mentality. Is there a recommended way of doing this? Thanks in advance for any guidance!
The question "Python & Pandas: series to timedelta" will help you get from a Series to a timedelta. And although
pandas.Series(
    list(BpShift.objects.filter(
        pk__in=df['BpCode'].tolist()
    ).values_list('weeks', flat=True))
)
will give you a Series of integers, I doubt the order is the same as in df['BpCode'], because that depends on the Django model and the database backend.
So you might be better off explicitly creating not a Series but a DataFrame with pk and weeks columns, so that you can use df.join. Something like this
pandas.DataFrame(
    list(BpShift.objects.filter(
        pk__in=df['BpCode'].tolist()
    ).values_list('pk', 'weeks')),
    columns=['BpCode', 'weeks'],
)
should give you a DataFrame that you can join with.
So combined this should be the gist of your code:
import pandas as pd

# Stand-in for the (pk, weeks) tuples returned by the Django query
django_response = [('customer1', 1), ('customer2', '2')]
d = {'BpCode':['customer1','customer2'],'DueDate':['2020-05-30','2020-04-30']}
df = pd.DataFrame(data=d).set_index('BpCode').join(
    pd.DataFrame(django_response, columns=['BpCode', 'weeks']).set_index('BpCode')
)
df['DueDate'] = pd.to_datetime(df['DueDate'])
df['weeks'] = pd.to_numeric(df['weeks'])
df['new_duedate'] = df['DueDate'] + df['weeks'] * pd.Timedelta('1W')
print(df)
             DueDate  weeks new_duedate
BpCode
customer1 2020-05-30      1  2020-06-06
customer2 2020-04-30      2  2020-05-14
You were right to want to avoid looping. This approach gets all the data from your Django model in one SQL query by using filter, does a left join with the DataFrame you already have, casts the dates and weeks to the right types, and then computes the new due date on whole columns instead of looping over rows.
NB the left join will give NaN and NaT for customers that don't exist in your Django database. You can either avoid those rows by passing how='inner' to df.join or handle them whatever way you like.
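To make the NaN/NaT behaviour concrete, here is a small sketch with a made-up third customer that has no row in the (simulated) Django result. The data and the names invoices, shifts, left, inner and filled are illustrative only. Treating a missing customer as a shift of 0 weeks is one possible policy, not a recommendation from the answer.

```python
import pandas as pd

# Invoices for three customers; 'customer3' is hypothetical and has no shift row
invoices = pd.DataFrame({
    'BpCode': ['customer1', 'customer2', 'customer3'],
    'DueDate': pd.to_datetime(['2020-05-30', '2020-04-30', '2020-03-31']),
}).set_index('BpCode')

# Stand-in for the (pk, weeks) rows fetched from the Django model
shifts = pd.DataFrame(
    [('customer1', 1), ('customer2', 2)], columns=['BpCode', 'weeks']
).set_index('BpCode')

# Default left join: customer3 is kept, weeks is NaN, so the new date is NaT
left = invoices.join(shifts)
left['new_duedate'] = left['DueDate'] + pd.to_timedelta(left['weeks'] * 7, unit='D')

# Inner join: customer3 is dropped entirely
inner = invoices.join(shifts, how='inner')

# Alternative policy: keep every invoice and treat a missing shift as 0 weeks
filled = invoices.join(shifts)
filled['weeks'] = filled['weeks'].fillna(0)
filled['new_duedate'] = filled['DueDate'] + pd.to_timedelta(filled['weeks'] * 7, unit='D')
```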

Unable to retrieve data in django

I am writing a weblog application in Django. As part of this, I have a view function that fetches the object for a single blog post from the database. The field I use to query the database is the published date (pub_date), which is a Python datetime. I have a MySQL database, and the column type for this field is datetime. However, I am not able to fetch the object from the database even though I am passing the correct date attributes: I get a 404 error. The following is my view function:
def entry_detail(request,year,month,day,slug):
    import datetime,time
    date_stamp = time.strptime(year+month+day,"%Y%b%d")
    pub_date = datetime.date(*date_stamp[:3])
    entry = get_object_or_404(Entry,pub_date__year=pub_date.year,pub_date__month=pub_date.month,pub_date__day=pub_date.day,slug=slug)
    return render_to_response('coltrane/entry_detail.html',{'entry':entry})
The following is the URL of the individual post that I want to fetch:
http://127.0.0.1:8000/weblog/2014/oct/28/third-post/
And this is what the pub_date column value for the third-post looks like in the database:
2014-10-28 13:26:39
The following is the URL pattern:
url(r'^weblog/(?P<year>\d{4})/(?P<month>\w{3})/(?P<day>\d{2})/(?P<slug>[-\w]+)/$','coltrane.views.entry_detail'),
You're doing some odd things here: you're converting to a time, then converting that to a datetime.date, then extracting the year, month and day as integers and passing them to the query. You could bypass almost the whole process: the only thing you need is to convert the month, the other parameters can be passed directly:
month_no = datetime.datetime.strptime(month, '%b').month
entry = get_object_or_404(Entry, pub_date__year=year, pub_date__month=month_no, pub_date__day=day, slug=slug)
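As a quick check of the month conversion outside Django, the sketch below runs the same strptime call on the captured URL parts. The helper names month_number and entry_filters are hypothetical, and '%b' matching depends on the current locale (English abbreviations are assumed here).

```python
import datetime


def month_number(abbrev):
    """Convert a three-letter month abbreviation such as 'oct' to 10."""
    return datetime.datetime.strptime(abbrev, '%b').month


def entry_filters(year, month, day, slug):
    """Build the lookup keyword arguments from the captured URL strings."""
    return {
        'pub_date__year': int(year),
        'pub_date__month': month_number(month),
        'pub_date__day': int(day),
        'slug': slug,
    }


# Values as captured from /weblog/2014/oct/28/third-post/
filters = entry_filters('2014', 'oct', '28', 'third-post')
```

The resulting dictionary could be passed on as get_object_or_404(Entry, **filters).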

Importing salesforce report data using python

I am new to SFDC. I have a report that was already created by a user, and I would like to use Python to dump the report's data into a CSV/Excel file.
I see there are a couple of Python packages for that, but my code gives an error:
from simple_salesforce import Salesforce
sf = Salesforce(instance_url='https://cs1.salesforce.com', session_id='')
sf = Salesforce(password='xxxxxx', username='xxxxx', organizationId='xxxxx')
Can I have the basic steps for setting up the API, and some example code?
This worked for me:
import requests
import csv
from simple_salesforce import Salesforce
import pandas as pd
sf = Salesforce(username=your_username, password=your_password, security_token = your_token)
login_data = {'username': your_username, 'password': your_password_plus_your_token}
with requests.session() as s:
    d = s.get("https://your_instance.salesforce.com/{}?export=1&enc=UTF-8&xf=csv".format(reportid), headers=sf.headers, cookies={'sid': sf.session_id})
d.content will contain a string of comma separated values which you can read with the csv module.
I take the data into pandas from there, hence the function name and import pandas. I removed the rest of the function where it puts the data into a DataFrame, but if you're interested in how that's done let me know.
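For readers who only need the csv module step mentioned above, here is a minimal sketch. It assumes d.content arrives as UTF-8 bytes (which matches the enc=UTF-8 parameter in the URL). The sample bytes and the name parse_report_csv are made up for illustration.

```python
import csv
import io


def parse_report_csv(content):
    """Turn the raw bytes of a CSV export into a list of dicts, one per row."""
    text = content.decode('utf-8')
    return list(csv.DictReader(io.StringIO(text)))


# Made-up stand-in for d.content
sample = b'Account Name,Amount\r\nAcme,100\r\nGlobex,250\r\n'
rows = parse_report_csv(sample)
```

Each row is keyed by the column headers of the export, so rows[0]['Account Name'] gives the first account.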
In case it is helpful, I wanted to write out the steps I used to answer this question now (Aug-2018), based on Obol's comment. For reference, I followed the README instructions at https://github.com/cghall/force-retrieve/blob/master/README.md for the salesforce_reporting package.
To connect to Salesforce:
import pandas as pd
from salesforce_reporting import Connection, ReportParser
sf = Connection(username='your_username',password='your_password',security_token='your_token')
Then, to get the report I wanted into a Pandas DataFrame:
report = sf.get_report(your_reports_id)
parser = ReportParser(report)
report = parser.records_dict()
report = pd.DataFrame(report)
If you were so inclined, you could also simplify the four lines above into one, like so:
report = pd.DataFrame(ReportParser(sf.get_report(your_reports_id)).records_dict())
One difference I ran into from the README is that sf.get_report('report_id', includeDetails=True) threw an error stating get_report() got an unexpected keyword argument 'includeDetails'. Simply removing that argument seemed to result in the code working fine.
report can now be exported via report.to_csv('report.csv',index=False), or manipulated directly.
EDIT: parser.records() changed to parser.records_dict(), as this allows the DataFrame to have the columns already listed, rather than indexing them numerically.
The code below is rather long and might be just for our use case but the basic idea is the following:
Work out a date interval length, plus any additional filters you need, so that no single query ever runs into the "more than 2'000 rows" limit. In my case I could use a weekly date range filter but needed to apply some additional filters on top.
Then run it like this:
report_id = '00O4…'
# "pass" is a reserved word in Python, so the password variable needs another name
sf = SalesforceReport(user, password, token, report_id)
it = sf.iterate_over_dates_and_filters(datetime.date(2020,2,1),
    'Invoice__c.InvoiceDate__c', 'Opportunity.CustomField__c',
    [('a', 'startsWith'), ('b', 'startsWith'), …])
for row in it:
    print(row)  # do something with the dict
The iterator goes through every week (if you need daily iterators or monthly then you'd need to change the code, but the change should be minimal) since 2020-02-01 and applies the filter CustomField__c.startswith('a'), then CustomField__c.startswith('b'), … and acts as a generator so you don't need to mess with the filter cycling yourself.
The iterator throws an Exception if there's a query which returns more than 2000 rows, just to be sure that the data is not incomplete.
One warning here: SF has a limit of max 500 queries per hour. Say if you have one year with 52 weeks and 10 additional filters you'd already run into that limit.
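The arithmetic behind that warning can be checked before running anything. The sketch below counts how many queries the weekly iteration would issue, taking the limit of 500 per hour from the text above. The function name count_report_queries is hypothetical.

```python
import datetime


def count_report_queries(start, end, n_filters, step_days=7):
    """Number of report queries: one per date window per additional filter."""
    n_windows = 0
    current = start
    while current <= end:
        n_windows += 1
        current += datetime.timedelta(days=step_days)
    return n_windows * n_filters


# One year of weekly windows combined with 10 additional filters
queries = count_report_queries(
    datetime.date(2020, 1, 1), datetime.date(2020, 12, 29), n_filters=10)
over_limit = queries > 500
```

With 52 weekly windows and 10 filters this comes to 520 queries, which is already above the limit.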
Here's the class (relies on simple_salesforce)
import simple_salesforce
import json
import datetime
"""
helper class to iterate over salesforce report data
while maneuvering around the 2000 row limit
"""
class SalesforceReport(simple_salesforce.Salesforce):
    def __init__(self, username, password, security_token, report_id):
        super(SalesforceReport, self).__init__(username=username, password=password, security_token=security_token)
        self.report_id = report_id
        self._fetch_describe()

    def _fetch_describe(self):
        url = f'{self.base_url}analytics/reports/{self.report_id}/describe'
        result = self._call_salesforce('GET', url)
        self.filters = dict(result.json()['reportMetadata'])

    def apply_report_filter(self, column, operator, value, replace=True):
        """
        adds/replaces filter, example:
        apply_report_filter('Opportunity.InsertionId__c', 'startsWith', 'hbob').
        For date filters use apply_standard_date_filter.
        column: needs to correspond to a column in your report, AND the report
                needs to have this filter configured (so in the UI the filter
                can be applied)
        operator: equals, notEqual, lessThan, greaterThan, lessOrEqual,
                  greaterOrEqual, contains, notContain, startsWith, includes
                  see https://sforce.co/2Tb5SrS for up to date list
        value: value as a string
        replace: if set to True, then if there's already a restriction on column
                 this restriction will be replaced, otherwise it's added additionally
        """
        filters = self.filters['reportFilters']
        if replace:
            filters = [f for f in filters if not f['column'] == column]
        filters.append(dict(
            column=column,
            isRunPageEditable=True,
            operator=operator,
            value=value))
        self.filters['reportFilters'] = filters

    def apply_standard_date_filter(self, column, startDate, endDate):
        """
        replace date filter. The date filter needs to be available as a filter in the
        UI already
        Example: apply_standard_date_filter('Invoice__c.InvoiceDate__c', d_from, d_to)
        column: needs to correspond to a column in your report
        startDate, endDate: instance of datetime.date
        """
        self.filters['standardDateFilter'] = dict(
            column=column,
            durationValue='CUSTOM',
            startDate=startDate.strftime('%Y-%m-%d'),
            endDate=endDate.strftime('%Y-%m-%d')
        )

    def query_report(self):
        """
        return generator which yields one report row as dict at a time
        """
        url = self.base_url + f"analytics/reports/query"
        result = self._call_salesforce('POST', url, data=json.dumps(dict(reportMetadata=self.filters)))
        r = result.json()
        columns = r['reportMetadata']['detailColumns']
        if not r['allData']:
            raise Exception('got more than 2000 rows! Quitting as data would be incomplete')
        for row in r['factMap']['T!T']['rows']:
            values = []
            for c in row['dataCells']:
                t = type(c['value'])
                if t == str or t == type(None) or t == int:
                    values.append(c['value'])
                elif t == dict and 'amount' in c['value']:
                    values.append(c['value']['amount'])
                else:
                    print(f"don't know how to handle {c}")
                    values.append(c['value'])
            yield dict(zip(columns, values))

    def iterate_over_dates_and_filters(self, startDate, date_column, filter_column, filter_tuples):
        """
        return generator which iterates over every week and applies the filters
        each for column
        """
        date_runner = startDate
        while True:
            print(date_runner)
            self.apply_standard_date_filter(date_column, date_runner, date_runner + datetime.timedelta(days=6))
            for val, op in filter_tuples:
                print(val)
                self.apply_report_filter(filter_column, op, val)
                for row in self.query_report():
                    yield row
            date_runner += datetime.timedelta(days=7)
            if date_runner > datetime.date.today():
                break
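To see what query_report does with a response without calling Salesforce, the sketch below applies the same flattening logic to a hand-written dictionary. The response shape is inferred only from the keys the class reads (allData, reportMetadata.detailColumns, factMap['T!T'].rows, dataCells). The sample values and the name rows_from_report are invented.

```python
def rows_from_report(response):
    """Yield one dict per detail row, mirroring the loop in query_report."""
    columns = response['reportMetadata']['detailColumns']
    if not response['allData']:
        raise ValueError('response is truncated, data would be incomplete')
    for row in response['factMap']['T!T']['rows']:
        values = []
        for cell in row['dataCells']:
            value = cell['value']
            # Currency-like cells arrive as a dict; keep only the amount
            if isinstance(value, dict) and 'amount' in value:
                value = value['amount']
            values.append(value)
        yield dict(zip(columns, values))


# Invented response for illustration only
fake_response = {
    'allData': True,
    'reportMetadata': {'detailColumns': ['ACCOUNT_NAME', 'AMOUNT']},
    'factMap': {'T!T': {'rows': [
        {'dataCells': [{'value': 'Acme'}, {'value': {'amount': 100}}]},
        {'dataCells': [{'value': 'Globex'}, {'value': None}]},
    ]}},
}
rows = list(rows_from_report(fake_response))
```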
For anyone just trying to download a report into a DataFrame this is how you do it (I added some notes and links for clarifications):
import pandas as pd
import csv
import requests
from io import StringIO
from simple_salesforce import Salesforce
# Input Salesforce credentials:
sf = Salesforce(
    username='johndoe@mail.com',
    password='<password>',
    security_token='<security_token>') # See below for help with finding token
# Basic report URL structure:
orgParams = 'https://<INSERT_YOUR_COMPANY_NAME_HERE>.my.salesforce.com/' # you can see this in your Salesforce URL
exportParams = '?isdtp=p1&export=1&enc=UTF-8&xf=csv'
# Downloading the report:
reportId = 'reportId' # You find this in the URL of the report in question between "Report/" and "/view"
reportUrl = orgParams + reportId + exportParams
reportReq = requests.get(reportUrl, headers=sf.headers, cookies={'sid': sf.session_id})
reportData = reportReq.content.decode('utf-8')
reportDf = pd.read_csv(StringIO(reportData))
You can get your token by following the instructions at the bottom of this page

Yield multiple items using scrapy

I'm scraping data from the following URL:
http://www.indexmundi.com/commodities/?commodity=gasoline
There are two sections which contain price: Gulf Coast Gasoline Futures End of Day Settlement Price and Gasoline Daily Price
I want to scrape data from both sections as two different items. Here is the code which I've written:
if dailyPrice:
    item['description'] = u''.join(dailyPrice.xpath(".//h1/text()").extract())
    item['price'] = u''.join(dailyPrice.xpath(".//span/text()").extract())
    item['unit'] = dailyPrice.xpath(".//div/p/text()").extract()[0].split(',')[-1]
    regex = re.compile("Source:(.*)",re.IGNORECASE|re.UNICODE)
    result = re.search(regex, u''.join(dailyPrice.xpath(".//div/p/text()").extract()))
    if result:
        item['source'] = result.group(1).strip()
    yield item
if futurePrice:
    item['description'] = u''.join(futurePrice.xpath(".//h1/text()").extract())
    item['price'] = u''.join(futurePrice.xpath(".//span/text()").extract())
    item['unit'] = u''.join(futurePrice.xpath(".//div[2]/table//tr[1]/td/text()").extract())
    source = futurePrice.xpath(".//div[2]/table//tr[4]/td/a/text()").extract()
    if source:
        item['source'] = u' - '.join(source)
    else:
        item['source'] = ''
    yield item
I want to know if this code will work fine or what should be correct way to do this?
It should work just fine. You can yield as many items from a parse callback as you need. Just some notes:
In the second case it's better to create a new item than to reuse the old one, because you never know what has happened to the old item reference. You may be overwriting and losing the previous data.
You can create different item types for your two cases. And in the pipeline treat them differently.
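The risk of reusing one item can be shown without Scrapy. In the sketch below plain dicts stand in for Scrapy items, and collecting the generator into a list stands in for whatever holds on to the yielded items. The function names are hypothetical.

```python
def parse_reusing_item():
    """Yield the same mutable object twice, changing it in between."""
    item = {}
    item['description'] = 'Gasoline Daily Price'
    yield item
    item['description'] = 'Gulf Coast Gasoline Futures'
    yield item


def parse_fresh_items():
    """Yield a new object for each section."""
    item = {'description': 'Gasoline Daily Price'}
    yield item
    item = {'description': 'Gulf Coast Gasoline Futures'}
    yield item


reused = [i['description'] for i in list(parse_reusing_item())]
fresh = [i['description'] for i in list(parse_fresh_items())]
```

Here reused ends up with the futures description twice, because both list entries point at the same dict, while fresh keeps the two descriptions apart.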

Django queryset aggregate by time interval

Hi, I am writing a Django view which outputs data for graphing on the client side (Highcharts). The data is climate data, with a given parameter recorded once per day.
My query is this:
format = '%Y-%m-%d'
sd = datetime.datetime.strptime(startdate, format)
ed = datetime.datetime.strptime(enddate, format)
data = Climate.objects.filter(recorded_on__range = (sd, ed)).order_by('recorded_on')
Now, as the range is increased the dataset obviously gets larger and this does not present well on the graph (aside from slowing things down considerably).
Is there a way to group my data as averages over time periods, specifically an average for each month or for each year?
I realize this could be done in SQL as mentioned here: django aggregation to lower resolution using grouping by a date range
But I would like to know if there is a handy way in Django itself.
Or is it perhaps better to modify the db directly and use a script to populate month and year fields from the timestamp?
Any help much appreciated.
Have you tried using django-qsstats-magic (https://github.com/kmike/django-qsstats-magic)?
It makes things very easy for charting, here is a timeseries example from their docs:
from django.contrib.auth.models import User
import datetime, qsstats
qs = User.objects.all()
qss = qsstats.QuerySetStats(qs, 'date_joined')
today = datetime.date.today()
seven_days_ago = today - datetime.timedelta(days=7)
time_series = qss.time_series(seven_days_ago, today)
print('New users in the last 7 days: %s' % [t[1] for t in time_series])
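If an extra dependency is not wanted and the rows have already been fetched, the monthly averaging asked about in the question can also be done in plain Python. This is a different technique from qsstats and from database-side grouping. It is only a sketch: it assumes each record is a (date, value) pair, and the name monthly_averages is made up.

```python
import datetime
from collections import defaultdict


def monthly_averages(records):
    """Average the values of (date, value) pairs per (year, month)."""
    buckets = defaultdict(list)
    for recorded_on, value in records:
        buckets[(recorded_on.year, recorded_on.month)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


# Made-up daily readings
records = [
    (datetime.date(2014, 1, 1), 10.0),
    (datetime.date(2014, 1, 2), 12.0),
    (datetime.date(2014, 2, 1), 20.0),
]
averages = monthly_averages(records)
```

This still transfers every daily row from the database, so it helps with the size of the graph but not with the query cost mentioned in the question.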