Convert UTC to local time - python-2.7

I have a fairly large dataset with UTC timestamps that I need to convert to the local (Central) timezone. I tried my google-fu, to no avail.
The dataframe is below.
STID UTCTIME TRES VRIR RETY REWT WEDN DELP WDIR DERT RTAX GAIN DEVD
0 ARFW 2012-01-01T00:00 28.47 65 -999 -999 41 41 289 12 20 0 0
1 ARFW 2012-01-01T00:30 28.55 62 -999 -999 32 33 359 23 31 0 0
2 ARFW 2012-01-01T01:00 28.59 60 -999 -999 29 30 345 19 26 0 0
3 ARFW 2012-01-01T01:30 28.63 60 -999 -999 24 25 339 20 27 0 0
4 ARFW 2012-01-01T02:00 28.66 58 -999 -999 22 25 335 24 30 0 0
# Define time as UTC
data_df['UTCTIME'] = pd.to_datetime(data_df['UTCTIME'], utc=True)
data_df.dtypes
STID object
UTCTIME datetime64[ns]
TRES float64
.
.
.
GAIN float64
DEVD int64
dtype: object
Here's the code I'm trying to use:
import pytz, datetime
utc = pytz.utc
fmt = '%Y-%m-%d %H:%M'
CSTM = pytz.timezone('US/Central')
local = pytz.timezone('US/Central')
dt = datetime.datetime.strptime(data_df['UTCTIME'], fmt)
CSTM_dt = CSTM.localize(dt)
and the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-f10301993777> in <module>()
4 CSTM = pytz.timezone('US/Central')
5 local = pytz.timezone('US/Central')
----> 6 dt = datetime.datetime.strptime(data_df['UTCTIME'], fmt)
7 CSTM = CSTM.localize(dt)
TypeError: must be string, not Series
Also, there are duplicate entries for UTCTIME. I don't really understand indexing, and I believe indexing could be one issue here, but I am not sure what is missing.

In your strptime line you do not pass a single date string to datetime.strptime, but the whole data_df['UTCTIME'] Series; that is exactly what TypeError: must be string, not Series is telling you. strptime parses one string at a time. With dateutil, the per-string conversion looks like this (the date string is just an example value from your data):
from dateutil import tz
from datetime import datetime

from_zone = tz.gettz('UTC')
to_zone = tz.tzlocal()
utc = datetime.strptime('2012-01-01T00:30', '%Y-%m-%dT%H:%M') # <====== one date STRING
utc = utc.replace(tzinfo=from_zone)
central = utc.astimezone(to_zone)
If you want to use that on your dataframe, you either need to loop over the UTCTIME column or write a helper function that does the conversion and apply it with data_df['UTCTIME'].apply(helperfunc).
To test the code on its own, replace the date string with one of your own values, or with a variable holding one.
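Since the question already parses the column with pd.to_datetime(..., utc=True), you can also skip strptime entirely and convert the whole Series in one vectorized step. A minimal sketch, assuming a reasonably recent pandas where utc=True yields a timezone-aware column (the CENTRAL column name is just an example):
import pandas as pd

# Parse as UTC-aware datetimes, then convert the entire column at once;
# no per-row strptime or localize is needed.
data_df['UTCTIME'] = pd.to_datetime(data_df['UTCTIME'], utc=True)
data_df['CENTRAL'] = data_df['UTCTIME'].dt.tz_convert('US/Central')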

Related

How can I delete observations without complete data using Stata

I have a large panel dataset that looks as follows.
input id age high weight str6 daily_drink
1 10 110 35 water
1 10 110 35 coffee
1 11 120 38 water
1 11 120 38 coffee
1 12 130 50 water
1 12 130 50 coffee
2 11 118 31 water
2 11 118 31 coffee
2 11 118 31 milk
2 12 123 38 water
2 12 123 38 coffee
2 12 123 38 milk
3 10 98 55 water
3 11 116 36 water
3 12 129 39 water
4 12 125 40 water
end
However, I would like to use Stata to keep only the subjects with complete data for ages 10, 11, and 12, like this:
id age high weight daily_drink
1 10 110 35 water
1 10 110 35 coffee
1 11 120 38 water
1 11 120 38 coffee
1 12 130 50 water
1 12 130 50 coffee
3 10 98 55 water
3 11 116 36 water
3 12 129 39 water
However, none of the rows contain missing data, so I cannot simply delete rows with missing values. Is there any way to do this? Any suggestion will help. Thanks in advance.
You can use bysort and egen for this. Something along the lines of
bysort id: egen has10 = total(age==10)
bysort id: egen has11 = total(age==11)
bysort id: egen has12 = total(age==12)
keep if (has10 != 0) & (has11 != 0) & (has12 != 0)
should work (untested). See help egen for more info. Install gtools if you have very large data (ssc install gtools) and then replace egen by gegen.
A solution that works if 10, 11, 12 are the only age values possible:
bysort id (age) : gen nvals = sum(age != age[_n-1])
by id : replace nvals = nvals[_N]
keep if nvals == 3
Consider also
bysort id (age) : gen OK1 = age[1] == 10 & age[_N] == 12
by id : egen OK2 = max(age == 11)
keep if OK1 & OK2
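As an aside, the same keep-if-complete logic in pandas (a hedged sketch with made-up mini data), for comparison with the Stata idioms above:
import pandas as pd

# Keep only the ids whose set of ages covers all of {10, 11, 12}.
df = pd.DataFrame({'id':  [1, 1, 1, 2, 2, 3, 3, 3, 4],
                   'age': [10, 11, 12, 11, 12, 10, 11, 12, 12]})
complete = df.groupby('id')['age'].apply(lambda s: {10, 11, 12} <= set(s))
df = df[df['id'].map(complete)]
print(df)  # only ids 1 and 3 remain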

Power BI What if analysis

I have a matrix Power BI visualization which is like
         Jan  Feb  Mar  April
Client1   10   20   30     10
Client2   15   25   65     80
Client3   66   22   54     12
I have created three what-if parameter slicer tables (each with values from 1 to 4), one per client.
For example, if the value of the first slicer is 1, the second is 2, and the third is 2, then I want:
         Jan  Feb  Mar  April
Client1    0   20   30     10
Client2    0    0   65     80
Client3    0    0   54     12
That is, it should replace the values with zero. I have been able to achieve that for one client using the DATEADD function (by adding months):
Measure = CALCULATE(SUM('Table'[Value]),
DATEADD('Table'[Column], Parameter[Parameter Value], MONTH))
and I have used this measure to display the value, but how can I make it work for the other two clients as well?
Let's say you have three parameter tables as follows,
Parameter1    Parameter2    Parameter3
  Value1        Value2        Value3
  ------        ------        ------
     1             1             1
     2             2             2
     3             3             3
     4             4             4
and each of them has its own slicer. Then the measure you are after might look something like this:
Measure =
VAR Val1 = MAX(Parameter1[Value1])
VAR Val2 = MAX(Parameter2[Value2])
VAR Val3 = MAX(Parameter3[Value3])
VAR CurrClient = MAX('Table'[Client])
VAR CurrMonth = MONTH(DATEVALUE(MAX('Table'[Month]) & " 1, 2000"))
RETURN SWITCH(CurrClient,
    "Client1", IF(CurrMonth <= Val1, 0, SUM('Table'[Value])),
    "Client2", IF(CurrMonth <= Val2, 0, SUM('Table'[Value])),
    "Client3", IF(CurrMonth <= Val3, 0, SUM('Table'[Value])),
    SUM('Table'[Value])
)
Basically, you read each parameter value and compare it to the month in the current cell.

plotly choropleth not plotting data

here is a link to my data https://docs.google.com/document/d/1oIiwiucRkXBkxkdbrgFyPt6fwWtX4DJG4nbRM309M20/edit?usp=sharing
My problem is that when I run this in a Jupyter Notebook, I get just the USA map with the colour bar and the lakes in blue. No data appears on the map, neither the labels nor the actual z data.
Here is my header:
import plotly.graph_objs as go
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
%matplotlib inline
init_notebook_mode(connected=True) # For Plotly For Notebooks
cf.go_offline() # For Cufflinks For offline use
Here is my data and layout:
data = dict(type='choropleth',
            locations=gb_state['state'],
            locationmode='USA-states',
            colorscale='Portland',
            text=gb_state['state'],
            z=gb_state['beer'],
            colorbar={'title': "Styles of beer"}
            )
data
layout = dict(title='Styles of beer by state',
              geo=dict(scope='usa',
                       showlakes=True,
                       lakecolor='rgb(85,173,240)')
              )
layout
and here is how I fire off the command:
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap)
Any help, guidelines or pointers would be appreciated.
Here is a minimal working example which should give you the desired output.
import pandas as pd
import io
import plotly.graph_objs as go
from plotly.offline import plot
txt = """ state abv ibu id beer style ounces brewery city
0 AK 25 17 25 25.0 25.0 25 25 25
1 AL 10 9 10 10.0 10.0 10 10 10
2 AR 5 1 5 5.0 5.0 5 5 5
3 AZ 44 24 47 47.0 46.0 47 47 47
4 CA 182 135 183 183.0 183.0 183 183 183
5 CO 250 146 265 265.0 263.0 265 265 265
6 CT 27 6 27 27.0 27.0 27 27 27
7 DC 8 4 8 8.0 8.0 8 8 8
8 DE 1 1 2 2.0 2.0 2 2 2
9 FL 56 37 58 58.0 58.0 58 58 58
10 GA 16 7 16 16.0 16.0 16 16 16
"""
gb_state = pd.read_csv(io.StringIO(txt), delim_whitespace=True)
data = dict(type='choropleth',
            locations=gb_state['state'],
            locationmode='USA-states',
            text=gb_state['state'],
            z=gb_state['beer'],
            )
layout = dict(geo=dict(scope='usa',
                       showlakes=False)
              )
choromap = go.Figure(data=[data], layout=layout)
plot(choromap)
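One guess at the root cause, since the variable is named gb_state: if it was produced by a groupby, the state codes live in the index rather than in a 'state' column, so locations=gb_state['state'] does not contain the two-letter codes plotly expects. This is an assumption about your data, not something visible in the question; if it applies, resetting the index restores 'state' as a regular column:
# Hypothetical fix: only needed if gb_state came from something like
#   gb_state = df.groupby('state').count()
# which moves 'state' into the index.
gb_state = gb_state.reset_index()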

Python Pandas - read_csv makes duplicates to NaN

I am currently working with pandas on a DataFrame, and after reading a csv file and converting a specific column to str, pandas seems to turn all duplicates in that column into NaNs.
bla = pd.read_csv(bla_path, sep=',',converters={'order_id':str})
and it gives me these results:
internal_conversion_id order_id conversion_target_id \
0 85 9222 67
1 20 9224 65
2 20 NaN 65
3 20 NaN 65
4 33 9233 67
5 33 NaN 67
Does anybody know what I'm missing? The original file does contain the duplicates.
EDIT: I just checked - this also happens when I don't use converters.
EDIT 2: here some lines from the original csv:
internal_conversion_id,order_id,conversion_target_id,product_nr
85,9222,67,1
20,9224,65,1
20,9224,65,2
20,9224,65,3
33,9223,67,1
33,9223,67,2
EDIT 3:
OK, I think I found the source.
At some point in the code I wanted to create a second variable with the same content as the first one, but without the duplicates. Pandas deletes the duplicates in the first variable too. How can I stop pandas from doing that?
here is the piece of code:
bla2 = bla
bla2['order_id'] = bla2['order_id'].drop_duplicates()
bla2 = bla2[pd.notnull(bla2['order_id'])]
If you wanted to just drop the duplicates you could've done it this way:
bla2 = bla2.drop_duplicates(subset='order_id')
What you did instead was overwrite the column with the result of drop_duplicates on that column:
bla2['order_id'] = bla2['order_id'].drop_duplicates()
At that point you introduced NaN wherever the values were duplicates:
In [3]:
df['order_id'].drop_duplicates()
Out[3]:
0 9222
1 9224
4 9223
Name: order_id, dtype: int64
In [4]:
df['order_id'] = df['order_id'].drop_duplicates()
df
Out[4]:
internal_conversion_id order_id conversion_target_id product_nr
0 85 9222 67 1
1 20 9224 65 1
2 20 NaN 65 2
3 20 NaN 65 3
4 33 9223 67 1
5 33 NaN 67 2
However, your last line of code should've worked:
In [5]:
df = df[pd.notnull(df['order_id'])]
df
Out[5]:
internal_conversion_id order_id conversion_target_id product_nr
0 85 9222 67 1
1 20 9224 65 1
4 33 9223 67 1
So I'm not sure where the confusion is coming from.
EDIT
If you want to make a copy, then just make a copy:
bla2 = bla.copy()
Then you can do whatever you want with bla2 and it won't affect bla.
Or create bla2 directly from the result of bla.drop_duplicates:
bla2 = bla.drop_duplicates(subset='order_id')
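To see the aliasing at work, here is a minimal self-contained sketch (made-up data): bla2 = bla copies nothing, so mutating one name mutates the other, while .copy() decouples them.
import pandas as pd

df = pd.DataFrame({'order_id': ['9222', '9224', '9224']})
alias = df                                    # no copy: both names share one object
alias['order_id'] = alias['order_id'].drop_duplicates()
print(df['order_id'].isnull().sum())          # 1 -- the original df changed too

df2 = pd.DataFrame({'order_id': ['9222', '9224', '9224']})
independent = df2.copy()                      # a real copy
independent['order_id'] = independent['order_id'].drop_duplicates()
print(df2['order_id'].isnull().sum())         # 0 -- df2 is untouched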

Grouping data by value ranges

I have a csv file that shows parts on order. The columns include days late, qty and commodity.
I need to group the data by days late and commodity with a sum of the qty. However the days late needs to be grouped into ranges.
>56
>35 and <= 56
>14 and <= 35
>0 and <=14
I was hoping I could use a dict somehow, something like this:
{'Red': '>56', 'Amber': '>35 and <=56', 'Yellow': '>14 and <=35', 'White': '>0 and <=14'}
I am looking for a result like this
Red Amber Yellow White
STRSUB 56 60 74 40
BOTDWG 20 67 87 34
I am new to pandas so I don't know if this is possible at all. Could anyone provide some advice?
Thanks
Suppose you start with this data:
df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
                   'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
                   'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
# Days Late ID quantity
# 0 60 STRSUB 56
# 1 60 BOTDWG 20
# 2 50 STRSUB 60
# 3 50 BOTDWG 67
# 4 20 STRSUB 74
# 5 20 BOTDWG 87
# 6 10 STRSUB 40
# 7 10 BOTDWG 34
Then you can find the status category using pd.cut. Note that by default, pd.cut splits the Series df['Days Late'] into categories which are half-open intervals, (-1, 14], (14, 35], (35, 56], (56, 365]:
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
print(df)
# ID quantity status
# 0 STRSUB 56 Red
# 1 BOTDWG 20 Red
# 2 STRSUB 60 Amber
# 3 BOTDWG 67 Amber
# 4 STRSUB 74 Yellow
# 5 BOTDWG 87 Yellow
# 6 STRSUB 40 White
# 7 BOTDWG 34 White
Now use pivot to get the DataFrame in the desired form:
df = df.pivot(index='ID', columns='status', values='quantity')
and use reindex to obtain the desired order for the rows and columns:
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
Thus,
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
                   'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
                   'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
df = df.pivot(index='ID', columns='status', values='quantity')
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
print(df)
yields
Red Amber Yellow White
ID
STRSUB 56 60 74 40
BOTDWG 20 67 87 34
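As a side note, pd.cut can also attach the labels directly instead of going through labels=False plus a separate numpy array; a small variation on the pd.cut step above (applied before the Days Late column is deleted):
df['status'] = pd.cut(df['Days Late'],
                      bins=[-1, 14, 35, 56, 365],
                      labels=['White', 'Yellow', 'Amber', 'Red'])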
You can create a column in your DataFrame based on your Days Late column by using the map or apply functions as follows. Let's first create some sample data.
import numpy
import pandas
df = pandas.DataFrame({'ID': 'foo,bar,foo,bar,foo,bar,foo,foo'.split(','),
                       'Days Late': numpy.random.randn(8)*20 + 30})
Days Late ID
0 30.746244 foo
1 16.234267 bar
2 14.771567 foo
3 33.211626 bar
4 3.497118 foo
5 52.482879 bar
6 11.695231 foo
7 47.350269 foo
Create a helper function to transform the data of the Days Late column and add a column called Code.
def days_late_xform(dl):
    if dl > 56: return 'Red'
    elif 35 < dl <= 56: return 'Amber'
    elif 14 < dl <= 35: return 'Yellow'
    elif 0 < dl <= 14: return 'White'
    else: return 'None'
df["Code"] = df['Days Late'].map(days_late_xform)
Days Late ID Code
0 30.746244 foo Yellow
1 16.234267 bar Yellow
2 14.771567 foo Yellow
3 33.211626 bar Yellow
4 3.497118 foo White
5 52.482879 bar Amber
6 11.695231 foo White
7 47.350269 foo Amber
Lastly, you can use groupby to aggregate by the ID and Code columns, and get the counts of the groups as follows:
g = df.groupby(["ID","Code"]).size()
print g
ID Code
bar Amber 1
Yellow 2
foo Amber 1
White 2
Yellow 2
df2 = g.unstack()
print df2
Code Amber White Yellow
ID
bar 1 NaN 2
foo 1 2 2
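Note that the question asks for a sum of qty rather than group counts; assuming your real frame has a qty column (the sample data above does not), the aggregation would look like:
g = df.groupby(['ID', 'Code'])['qty'].sum()  # 'qty' is assumed, not in the sample data
df2 = g.unstack()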
I know this is coming a bit late, but I had the same problem and wanted to share np.digitize, which sounds like exactly what you want.
a = np.random.randint(0, 100, 50)
grps = np.arange(0, 100, 10)
grps2 = [1, 20, 25, 40]
print a
[35 76 83 62 57 50 24 0 14 40 21 3 45 30 79 32 29 80 90 38 2 77 50 73 51
71 29 53 76 16 93 46 14 32 44 77 24 95 48 23 26 49 32 15 2 33 17 88 26 17]
print np.digitize(a, grps)
[ 4 8 9 7 6 6 3 1 2 5 3 1 5 4 8 4 3 9 10 4 1 8 6 8 6
8 3 6 8 2 10 5 2 4 5 8 3 10 5 3 3 5 4 2 1 4 2 9 3 2]
print np.digitize(a, grps2)
[3 4 4 4 4 4 2 0 1 4 2 1 4 3 4 3 3 4 4 3 1 4 4 4 4 4 3 4 4 1 4 4 1 3 4 4 2
4 4 2 3 4 3 1 1 3 1 4 3 1]
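To connect this back to the question's colour ranges, a minimal sketch that maps the digitize indices onto labels (bin edges inferred from the ranges in the question; right=True makes each bin closed on the right, matching >0 and <=14 etc.):
import numpy as np

days = np.array([60, 50, 20, 10])
bins = [0, 14, 35, 56]                       # edges for >0-14, >14-35, >35-56, >56
labels = np.array(['White', 'Yellow', 'Amber', 'Red'])
idx = np.digitize(days, bins, right=True)    # 1..4 for the four colour ranges
print(labels[idx - 1])                       # ['Red' 'Amber' 'Yellow' 'White']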