Grouping and plotting data by time in Python - python-2.7

I have a CSV file and I am trying to plot the average of some values per month. My CSV file is structured as shown below, so I believe I should group my data daily, then monthly, in order to calculate the mean value.
timestamp,heure,lat,lon,impact,type
2007-01-01 00:00:00,13:58:43,33.837,-9.205,10.3,1
2007-01-02 00:00:00,00:07:28,34.5293,-10.2384,17.7,1
2007-01-02 00:00:00,23:01:03,35.0617,-1.435,-17.1,2
2007-01-03 00:00:00,01:14:29,36.5685,0.9043,36.8,1
2007-01-03 00:00:00,05:03:51,34.1919,-12.5061,-48.9,1
I am using this code:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("ave.txt", sep=',', names=["timestamp","heure","lat","lon","impact","type"])
daily = df.set_index('timestamp').groupby(pd.TimeGrouper(key='timestamp', freq='D', axis=1), axis=1)['impact'].count()
monthly = daily.groupby(pd.TimeGrouper(freq='M')).mean()
ax = monthly.plot(kind='bar')
plt.show()
But I keep getting this error:
KeyError: 'The grouper name timestamp is not found'
Any ideas?

You're getting this error because you've set the timestamp column as the index. Try removing either key='timestamp' from TimeGrouper() or the set_index call, and it should group as you expect:
daily = df.set_index('timestamp').groupby(pd.TimeGrouper(freq='D', axis=1), axis=1)['impact'].count()
or
daily = df.groupby(pd.TimeGrouper(key='timestamp', freq='D', axis=1), axis=1)['impact'].count()
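In more recent pandas versions TimeGrouper is deprecated in favour of pd.Grouper. A minimal sketch of the same grouping with it, assuming the timestamp column is parsed to datetimes first (the axis argument is not needed here):
# Sketch with pd.Grouper (newer pandas); assumes df is the DataFrame read above.
df['timestamp'] = pd.to_datetime(df['timestamp'])
daily = df.groupby(pd.Grouper(key='timestamp', freq='D'))['impact'].count()
monthly = daily.groupby(pd.Grouper(freq='M')).mean()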

I believe you need DataFrame.resample.
You also need to convert timestamp to a DatetimeIndex, using the parse_dates and index_col parameters of read_csv.
names =["timestamp","heure","lat","lon","impact","type"]
data = pd.read_csv('fou.txt',names=names, parse_dates=['timestamp'],index_col=['timestamp'])
print (data.head())
#your code
daily = data.groupby(pd.TimeGrouper(freq='D'))['impact'].count()
monthly = daily.groupby(pd.TimeGrouper(freq='M')).mean()
ax = monthly.plot(kind='bar')
plt.show()
# simpler: use resample directly
daily = data.resample('D')['impact'].count()
monthly = daily.resample('M').mean()
ax = monthly.plot(kind='bar')
plt.show()
Also check whether you really need count rather than size (a toy illustration follows the snippet below).
What is the difference between size and count in pandas?
daily = data.resample('D')['impact'].size()
monthly = daily.resample('M').mean()
ax = monthly.plot(kind='bar')
plt.show()
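As a quick illustration of the difference on toy data (not the question's file): count ignores NaN values, while size counts every row in the bin.
import pandas as pd
import numpy as np
s = pd.Series([1.0, np.nan, 3.0], index=pd.to_datetime(['2007-01-01', '2007-01-01', '2007-01-02']))
print(s.resample('D').count())   # 2007-01-01 -> 1 (NaN skipped), 2007-01-02 -> 1
print(s.resample('D').size())    # 2007-01-01 -> 2, 2007-01-02 -> 1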

Related

XGBoost Google AI model expecting float values instead of using categorical values and converting them

I'm trying to run a simple XGBoost prediction on Google Cloud, based on this example: https://cloud.google.com/ml-engine/docs/scikit/getting-predictions-xgboost#get_online_predictions
The model builds fine, but when I try to run a prediction with a sample input JSON it fails with the error "Could not initialize DMatrix from inputs: could not convert string to float:", as shown in the screen below. I understand this is happening because the test input has strings; I was hoping the deployed model would have the information needed to convert the categorical values to floats. I cannot expect my users to submit online prediction requests with float values.
Based on the tutorial it should work without converting the categorical values to floats. Please advise; I have attached a GIF with more details. Thanks.
import json
import numpy as np
import os
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
# these are the column labels from the census data files
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)
# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)
# load training set
with open('./census_data/adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
# remove column we are trying to predict ('income-level') from features list
train_features = raw_training_data.drop('income-level', axis=1)
# create training labels list
train_labels = (raw_training_data['income-level'] == ' >50K')
# load test set
with open('./census_data/adult.test', 'r') as test_data:
    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)
# remove column we are trying to predict ('income-level') from features list
test_features = raw_testing_data.drop('income-level', axis=1)
# create test labels list
test_labels = (raw_testing_data['income-level'] == ' >50K.')
# convert data in categorical columns to numerical values
encoders = {col:LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    train_features[col] = encoders[col].fit_transform(train_features[col])
for col in CATEGORICAL_COLUMNS:
    test_features[col] = encoders[col].fit_transform(test_features[col])
# load data into DMatrix object
dtrain = xgb.DMatrix(train_features, train_labels)
dtest = xgb.DMatrix(test_features)
# train XGBoost model
bst = xgb.train({}, dtrain, 20)
bst.save_model('./model.bst')
Here is a fix. Put the input shown in the Google documentation in a file input.json, then run this. The output is input_numerical.json, and prediction will succeed if you use that file in place of input.json.
This code just preprocesses the categorical columns into numerical form, using the same procedure that was applied to the training and test data.
import json
import pandas as pd
from sklearn.preprocessing import LabelEncoder
COLUMNS = (
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income-level",
)
# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
)
with open("./input.json", "r") as json_lines:
rows = [json.loads(line) for line in json_lines]
prediction_features = pd.DataFrame(rows, columns=(COLUMNS[:-1]))
encoders = {col: LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    prediction_features[col] = encoders[col].fit_transform(prediction_features[col])
with open("input_numerical.json", "w") as input_numerical:
for index, row in prediction_features.iterrows():
input_numerical.write(row.to_json(orient="values") + "\n")
I created this Google Issues Tracker ticket as the Google documentation is missing this important step.
You can use pandas to convert categorical strings into codes for model inputs. For the prediction input you can define, for each categorical column, a dictionary mapping category values to their codes. For example, for workclass:
df['workclass_cat'] = df['workclass'].astype('category')
df['workclass_cat'] = df['workclass_cat'].cat.codes
workclass_dict = dict(zip(list(df['workclass'].values), list(df['workclass_cat'].values)))
If a prediction input is 'somestring' you can access its code as follows:
category_input = workclass_dict['somestring']
XGBoost models take floats as input. In your training script you converted the categorical variables into numbers. The same transformation needs to be done when submitting a prediction.
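One way to keep that transformation consistent is to persist the fitted encoders at training time and reuse them when building the prediction input. A minimal sketch under assumptions: encoders and CATEGORICAL_COLUMNS are the objects from the training script above, prediction_features is the DataFrame built from input.json, and the encoders.pkl filename is made up for illustration.
import pickle

# At training time: save the fitted LabelEncoders next to the model.
with open('encoders.pkl', 'wb') as f:
    pickle.dump(encoders, f)

# At prediction time: reload them and use transform(), not fit_transform(),
# so the string-to-number mapping matches what the model was trained on.
with open('encoders.pkl', 'rb') as f:
    encoders = pickle.load(f)
for col in CATEGORICAL_COLUMNS:
    prediction_features[col] = encoders[col].transform(prediction_features[col])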

How to export the data of a pivot table (of an existing Excel sheet) to multiple Excel sheets

I am a bit new to Python and I am implementing a project to convert an Excel sheet into a pivot-table Excel sheet; once the pivot-table sheet is created, I am trying to create a separate Excel sheet for each row of the pivot data.
I am successful in getting as far as the pivot data in Excel (Workbook3.xlsx), but from there, when I try a for loop, I get the error below:
raise NotImplementedError("Writing as Excel with a MultiIndex is "
NotImplementedError: Writing as Excel with a MultiIndex is not yet implemented.
Can someone help me implement this last step?
Below is my code:
# Author : Abhishek
# Version : December 23, 2016
import pandas as pd
import xlrd
import xlsxwriter
import numpy as np
names=['Business Type','Trip ID','Status','Name of Customer','Driver Name','Trip Type','One way/two way','Channel of Lead','Payment Rate','Remark for Payment/Discount/Fixed Rate/Monthly Rate','Trip Date','Trip ID','Entry Date and Time','CSE','Trip Time(when customer required)','Begin Trip Time','End Trip Time','Hours','Minutes','Basic','Basics','Transit','Transits','Discount ','Discounts','Tax','Total','Wallet','Wallet Type','Cash with Driver','Adjustment','Remark','Blank','Basic','Zuver (20%)','Basic Earning','Transit','Total Earning','Cash Colleted','Balance','Inventive','Total Earning','Total Cash Collected','Total Balance','Total Incentive','Final Invoice']
df=pd.read_excel(r'path/28 Nov - 4 Dec _ Payment IC _ Mumbai.xlsx',sheetname='calc',header=None,names=names)
df = df[df.Status != 'Cancelled']
df = df[df.Status != 'Unfulfilled']
#print df['Begin Trip Time'].values
# Defining variables for the output Report
custname=df['Name of Customer'].values
drivername=df['Driver Name'].values
drivercontact=df['Trip Date'].values
status=df['Status'].values
tripstatus=df['Begin Trip Time'].values
triptype=df['End Trip Time'].values
starttime=df['Hours'].values
endtime=df['Minutes'].values
transit=df['Transits'].values
totalbill=df['Basic Earning'].values
totalearning=df['Transits'].values+df['Basic Earning'].values
cashcollected=df['Cash Colleted'].values
balance=df['Balance'].values
incentive= df['Inventive'].values
df1 = pd.DataFrame(zip(totalbill,drivername,custname,drivercontact,status,tripstatus,triptype,starttime,endtime,totalbill,transit,totalearning,cashcollected,balance,incentive))
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('path/Workbook2.xlsx', engine='xlsxwriter')
df1.to_excel(writer, sheet_name='Sheet1',index=False,header=False)
# Close the Pandas Excel writer and output the Excel file.
writer.save()
#Pivot table
df2=pd.read_excel(r'path/Workbook2.xlsx',sheetname='Sheet1')
#table=pd.pivot_table(df2, index=['Driver Name','Name of Customer'])
df2['Inventive'].fillna(0, inplace=True)
df2 = pd.pivot_table(df2, values=['Hours', 'Minutes'],index=['Driver Name','Name of Customer','Inventive','Trip Date','Status','Begin Trip Time','End Trip Time','Basic Earning','Cash Colleted','Balance'],aggfunc=np.sum, fill_value='0', margins=True)
pivoted = pd.ExcelWriter('path/Workbook3.xlsx', engine='xlsxwriter')
df2.to_excel(pivoted, sheet_name='Sheet1')
pivoted.save()
df3=pd.read_excel(r'path/Workbook3.xlsx',sheetname='Sheet1')
for n in range(0, len(df3)):
    tempdata = df3.iloc[n]
    df4 = pd.DataFrame(tempdata)
    writer = pd.ExcelWriter('path/Final%s.xlsx' % n, engine='xlsxwriter')
    df4.to_excel(writer, sheet_name='Sheet 1')
    writer.save()
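One possible workaround, sketched here rather than tested against this workbook: the NotImplementedError comes from writing a frame that still carries a MultiIndex, so flattening the pivot table with reset_index() before writing each row to its own file avoids it (df2 below is the pivot table built above).
# Sketch: reset_index() turns the MultiIndex levels back into plain columns,
# so to_excel no longer hits the MultiIndex limitation.
flat = df2.reset_index()
for n in range(len(flat)):
    row_frame = flat.iloc[[n]]  # double brackets keep a one-row DataFrame
    writer = pd.ExcelWriter('path/Final%s.xlsx' % n, engine='xlsxwriter')
    row_frame.to_excel(writer, sheet_name='Sheet1', index=False)
    writer.save()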

Plotting error bars from 2 axes

I'm looking to plot the standard deviation of some array data I've been working with in Python; the data is averaged over both longitude and latitude (axes 2 and 3 of my arrays).
What I have so far is a monthly plot (image: "Monthly plot"), but I can't get the standard deviations to work.
I was just wondering if anyone knew how to get around this problem. Here's the code I've used thus far.
Any help is much appreciated!
# import things
import matplotlib.pyplot as plt
import numpy as np
import netCDF4
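# f is assumed to be an already-opened netCDF4.Dataset, e.g. f = netCDF4.Dataset('model_output.nc')  # (filename is illustrative)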
# [ date, hour, 0, lon, lat ]
temp = (f.variables['TEMP2'][:, 14:24, 0, :, :]) # temp at 2m
temp2 = (f.variables['TEMP2'][:, 0:14, 0, :, :])
# concatenate back to 24 hour period
tercon = np.concatenate((temp, temp2), axis=1)
ter1 = tercon.mean(axis=(2, 3))
rtemp = np.reshape(ter1, 672)-273
# X axis dates instead of times
date = np.arange(rtemp.shape[0]) # assume that delta time between data is 1
date21 = (date/24.) # use days instead of hours
# change plot size for monthly
plt.rcParams['figure.figsize'] = 15, 5
plt.plot(date21, rtemp, linestyle='-', linewidth=3.0, c='orange')
You should use errorbar instead of plot and pass the precalculated standard deviations. The following adapted example uses random data to emulate your temperature data at hourly resolution and computes the daily mean and standard deviation; a sketch applied to your own arrays follows it.
# import things
import matplotlib.pyplot as plt
import numpy as np
# x-axis: day-of-month
date21 = np.arange(1, 31)
# generate random "hourly" data
hourly_temp = np.random.random(30*24)*10 + 20
# mean "temperature"
dayly_mean_temp = hourly_temp.reshape(24,30).mean(axis=0)
# standard deviation per day
dayly_std_temp = hourly_temp.reshape(24,30).std(axis=0)
# create a figure
figure = plt.figure(figsize = (15, 5))
#add an axes to the figure
ax = figure.add_subplot(111)
ax.grid()
ax.errorbar(date21, daily_mean_temp, yerr=daily_std_temp, fmt="--o", capsize=15, capthick=3, linestyle='-', linewidth=3.0, c='orange')
plt.show()
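Applying the same idea to the question's arrays, a hedged sketch (assuming tercon keeps the shape [date, hour, lon, lat] after the concatenation above, and plotting one error bar per hour):
# Spatial mean and standard deviation over lon/lat (axes 2 and 3), flattened to one value per hour.
rtemp_mean = tercon.mean(axis=(2, 3)).reshape(-1) - 273
rtemp_std = tercon.std(axis=(2, 3)).reshape(-1)
days = np.arange(rtemp_mean.shape[0]) / 24.  # hours -> days on the x-axis
fig, ax = plt.subplots(figsize=(15, 5))
ax.errorbar(days, rtemp_mean, yerr=rtemp_std, linestyle='-', linewidth=1.0, c='orange')
plt.show()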

Python 2.7: not plotting extremes based on slope in pandas DataFrame

Basically, I want to avoid plotting extremes in my graph. I thought doing this based on the slope of the graph would be a good idea, but I keep getting an error that the dates on my x-axis do not exist (DataFrame has no attribute 'Datumtijd'). (Edit: removed the file location as the question has been answered.)
from pylab import *
import matplotlib.pyplot as plt
import matplotlib.dates as pld
%matplotlib inline
import pandas as pd
from pandas import DataFrame
pbn135 = pd.read_csv('3873_135.csv', parse_dates=[0], index_col = 0, dayfirst = True, delimiter = ';', usecols = ['Datumtijd','DisplayWaarde'])
pbn135.plot()
for i in range(len(pbn135)):
    slope = (pbn135.DisplayWaarde[i+1]-pbn135.DisplayWaarde[i])/(pbn135.Datumtijd[i+1]-pbn135.Datumtijd[i])
Note that because of index_col = 0 the Datumtijd column has become the index, so it is reached via pbn135.index rather than as an attribute. Beyond that, you can't do this slope arithmetic on the DateTime values directly; converting each DateTime to an integer works, usually by calculating the total seconds from a reference date (e.g. 1 Jan 2015).
This is done by importing datetime from the datetime module; with a reference date datetime(2015, 1, 1), the seconds are calculated with total_seconds().
However, this gives a slope whose interval is in seconds rather than the interval of your datetime. If anyone knows how to fix that without manually entering a division, please let us know (a vectorized sketch follows the loop below).
from datetime import datetime
for i in range(len(pbn135) - 1):
    slope = (pbn135.pbn73[i+1]-pbn135.pbn73[i])/((pbn135.index[i+1]-datetime(2015,1,1)).total_seconds()-(pbn135.index[i]-datetime(2015,1,1)).total_seconds())
    print slope
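A vectorized alternative that avoids the manual per-element division, offered as a sketch (it assumes pbn135 has a DatetimeIndex and a numeric column named 'DisplayWaarde', as in the question):
import numpy as np
# Timestamps as seconds since the epoch, then elementwise differences give the slope in units per second.
values = pbn135['DisplayWaarde'].values
seconds = pbn135.index.values.astype('int64') / 1e9  # nanoseconds -> seconds
slope = np.diff(values) / np.diff(seconds)
print slope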

Pandas DataFrame Matplotlib BoxPlot Boxes

How to make a boxplot where each row in my dataframe object is a box in the plot?
I have some stock data that I want to plot with a box plot. My data is from Yahoo Finance and includes Open, High, Low, Close, Adjusted Close, and Volume for each trading day. I want to plot a box plot where each box is one day of OHLC price action.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.io.data import DataReader
# get daily stock price data from yahoo finance for S&P500
SP = DataReader("^GSPC", "yahoo")
SP.head()
Open High Low Close Volume Adj Close
Date
2010-01-04 1116.56 1133.87 1116.56 1132.99 3991400000 1132.99
2010-01-05 1132.66 1136.63 1129.66 1136.52 2491020000 1136.52
2010-01-06 1135.71 1139.19 1133.95 1137.14 4972660000 1137.14
2010-01-07 1136.27 1142.46 1131.32 1141.69 5270680000 1141.69
2010-01-08 1140.52 1145.39 1136.22 1144.98 4389590000 1144.98
plt.figure()
bp = SP.boxplot()
But when I plot this DataFrame as a boxplot, I only get one box, showing the spread (min, quartiles, max) of the entire Volume column.
Likewise, I tried resampling my daily Adjusted Close prices to get weekly OHLC:
close = SP['Adj Close']
wk = close.resample('W', how='ohlc')
wk.head()
open high low close
Date
2010-01-10 1132.99 1144.98 1132.99 1144.98
2010-01-17 1146.98 1148.46 1136.03 1136.03
2010-01-24 1150.23 1150.23 1091.76 1091.76
2010-01-31 1096.78 1097.50 1073.87 1073.87
2010-02-07 1089.19 1103.32 1063.11 1066.19
This yields a box plot with 4 boxes. Each box is the range of a column, not a row: for example, the first box, 'open', shows the spread of the entire 'open' column.
But what I actually want is one box for each 'Date' (index or row of my DataFrame), so the first box would show the OHLC of the first row ('2010-01-10'), the second box the second row ('2010-01-17'), and so on.
What I really want, though, is for each row in my original daily data (the SP DataFrame) to be its own OHLC box. Essentially I want daily candlesticks, generated as a boxplot().
Open High Low Close
Date
2010-01-04 1116.56 1133.87 1116.56 1132.99
How do I do this using the pandas DataFrame and Matplotlib's boxplot()? I just want a basic box plot where each row of the DataFrame is an OHLC box. Nothing fancy at this point. Thanks!
As I said in the comments, you don't really want boxplots. Instead you should be making a candlestick chart. Here's some code to get you started.
import numpy as np
import pandas
import matplotlib.pyplot as plt
from matplotlib.finance import candlestick, candlestick2
import matplotlib.dates as mdates
from pandas.io.data import DataReader
# get daily stock price data from yahoo finance for S&P500
SP = DataReader("^GSPC", "yahoo")
SP.reset_index(inplace=True)
print(SP.columns)
SP['Date2'] = SP['Date'].apply(lambda date: mdates.date2num(date.to_pydatetime()))
fig, ax = plt.subplots()
csticks = candlestick(ax, SP[['Date2', 'Open', 'Close', 'High', 'Low']].values)
plt.show()
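If you do literally want one box per row rather than candlesticks, a hedged sketch: transpose the OHLC columns so each date becomes a column, since DataFrame.boxplot() draws one box per column. (Note that matplotlib.finance and pandas.io.data have since been removed from their packages, so the imports above may need adjusting on newer installs, e.g. to mplfinance and pandas-datareader.)
# Sketch of the literal request: one box per date, built from that date's OHLC values.
ohlc = SP.set_index('Date')[['Open', 'High', 'Low', 'Close']].head(10)  # a few days, to keep the plot readable
ohlc.T.boxplot(rot=90)
plt.ylabel('Price')
plt.show()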