How to remove the DATE_TIME rows that have NaN values - python-2.7

How do I remove the DATE and TIME rows that have NaN values in the 'oo' column?
This is my CSV:
DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOLUME
02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505
02/03/1997,09:05:00,3047.00,3048.00,3046.00,3047.00,162
02/03/1997,09:06:00,3047.50,3048.00,3047.00,3047.50,98
02/03/1997,09:07:00,3047.50,3047.50,3047.00,3047.50,228
02/03/1997,09:08:00,3048.00,3048.00,3047.50,3048.00,136
02/03/1997,09:09:00,3048.00,3048.00,3046.50,3046.50,174
02/03/1997,09:10:00,3046.50,3046.50,3045.00,3045.00,134
02/03/1997,09:11:00,3045.50,3046.00,3044.00,3045.00,43
02/03/1997,09:12:00,3045.00,3045.50,3045.00,3045.00,214
02/03/1997,09:13:00,3045.50,3045.50,3045.50,3045.50,8
02/03/1997,09:14:00,3045.50,3046.00,3044.50,3044.50,152
02/03/1997,09:15:00,3044.00,3044.00,3042.50,3042.50,126
02/03/1997,09:16:00,3043.50,3043.50,3043.00,3043.00,128
02/03/1997,09:17:00,3042.50,3043.50,3042.50,3043.50,23
02/03/1997,09:18:00,3043.50,3044.50,3043.00,3044.00,51
02/03/1997,09:19:00,3044.50,3044.50,3043.00,3043.00,18
02/03/1997,09:20:00,3043.00,3045.00,3043.00,3045.00,23
02/03/1997,09:21:00,3045.00,3045.00,3044.50,3045.00,51
02/03/1997,09:22:00,3045.00,3045.00,3045.00,3045.00,47
02/03/1997,09:23:00,3045.50,3046.00,3045.00,3045.00,77
02/03/1997,09:24:00,3045.00,3045.00,3045.00,3045.00,131
02/03/1997,09:25:00,3044.50,3044.50,3043.50,3043.50,138
02/03/1997,09:26:00,3043.50,3043.50,3043.50,3043.50,6
02/03/1997,09:27:00,3043.50,3043.50,3043.00,3043.00,56
02/03/1997,09:28:00,3043.00,3044.00,3043.00,3044.00,32
02/03/1997,09:29:00,3044.50,3044.50,3044.50,3044.50,63
02/03/1997,09:30:00,3045.00,3045.00,3045.00,3045.00,28
Here's my code:
exp = pd.read_csv('example.txt', parse_dates=[["DATE", "TIME"]], index_col=0)
exp['oo'] = exp.OPEN.resample("5Min").first()
print exp['oo']
and I get this:
DATE_TIME
1997-02-03 09:04:00 NaN
1997-02-03 09:05:00 3047.0
1997-02-03 09:06:00 NaN
1997-02-03 09:07:00 NaN
1997-02-03 09:08:00 NaN
1997-02-03 09:09:00 NaN
1997-02-03 09:10:00 3046.5
I want to get rid of all the DATE_TIME rows with NaN values in the 'oo' column.
I've tried:
exp['oo'] = exp['oo'].dropna()
But I get the same thing.
I've looked all through http://pandas.pydata.org/pandas-docs/stable/missing_data.html and all over this website.
I would like to keep my CSV reader the same, but I'm stuck.
If anybody could help it would be greatly appreciated; thanks so much for your time.

I think you want this:
>>> exp.OPEN.resample("5Min", how='first')
DATE_TIME
1997-02-03 09:00:00 3046.0
1997-02-03 09:05:00 3047.0
1997-02-03 09:10:00 3046.5
1997-02-03 09:15:00 3044.0
1997-02-03 09:20:00 3043.0
1997-02-03 09:25:00 3044.5
1997-02-03 09:30:00 3045.0
Freq: 5T, Name: OPEN, dtype: float64
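Note that how='first' is the older resample API; in newer pandas the same thing is spelled exp.OPEN.resample("5Min").first().
As for why exp['oo'] = exp['oo'].dropna() appeared to do nothing: assigning a Series back into a DataFrame column aligns it on the frame's index, so the rows you dropped come straight back as NaN. A minimal sketch of actually dropping those rows (assuming you do want to keep 'oo' as a column of exp):
# dropna returns a new frame rather than modifying exp in place,
# so reassign the result.
exp = exp.dropna(subset=['oo'])
print exp['oo']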

Related

SAS using Datalines - "observation read not used"

I am a complete newb to SAS; all I know is basic SQL. I am currently taking a regression class and having trouble with SAS code.
I am trying to input two columns of data, where the x variable is State and the y variable is # of accidents, for a simple regression.
I keep getting this:
ERROR: No valid observations are found.
Number of Observations Read 51
Number of Observations Used 0
Number of Observations with Missing Values 51
Is it because datalines only reads numbers and not characters?
Here is the code as well as the datalines:
Data Firearm_Accidents_1999_to_2014;
ods graphics on;
Input State Sum_OF_Deaths;
Datalines;
Alabama 526
Alaska 0
Arizona 150
Arkansas 246
California 834
Colorado 33
Connecticut 0
Delaware 0
District_of_Columbia 0
Florida 350
Georgia 413
Hawaii 0
Idaho 0
Illinois 287
Indiana 288
Iowa 0
Kansas 44
Kentucky 384
Louisiana 562
Maine 0
Maryland 21
Massachusetts 27
Michigan 168
Minnesota 0
Mississippi 332
Missouri 320
Montana 0
Nebraska 0
Nevada 0
New_Hampshire 0
New_Jersey 85
New_Mexico 49
New_York 218
North_Carolina 437
North_Dakota 0
Ohio 306
Oklahoma 227
Oregon 41
Pennsylvania 465
Rhode_Island 0
South_Carolina 324
South_Dakota 0
Tennessee 603
Texas 876
Utah 0
Vermont 0
Virginia 203
Washington 45
West_Virginia 136
Wisconsin 64
Wyoming 0
;
run; proc print;
proc reg data = Firearm_Accidents_1999_to_2014;
model State = Sum_OF_Deaths;
ods graphics off;
run; quit;
OK, there are a few different levels of issues here.
ODS GRAPHICS statements go before and after procs, not inside a data step.
When reading a character variable you need to tell SAS about it using an informat.
That lets you read in the data. However, your regression has several issues of its own. For one, State is a character variable, and you can't do regression on a character variable. That issue is beyond this forum; review your regression basics and check what you're trying to do.
Data Firearm_Accidents_1999_to_2014;
informat state $32.;
Input State Sum_OF_Deaths;
Datalines;
Alabama 526
Alaska 0
Arizona 150
Arkansas 246
California 834
Colorado 33
....
;
run;

How to draw a multiline chart using python pandas?

Dataframe:
Dept,Date,Que
ece,2015-06-25,96
ece,2015-06-24,89
ece,2015-06-26,88
ece,2015-06-19,87
ece,2015-06-23,82
ece,2015-06-30,82
eee,2015-06-24,73
eee,2015-06-23,71
eee,2015-06-25,70
eee,2015-06-19,66
eee,2015-06-27,60
eee,2015-06-22,56
mech,2015-06-27,10
mech,2015-06-22,8
mech,2015-06-25,8
mech,2015-06-19,7
I need a multiline chart with a grid based on the Dept column; I need each Dept as one line.
For example, for ece the sparkline should be 96, 89, 88, 87, 82, 82, ... and likewise for the other Depts.
I think you need pivot and plot:
import matplotlib.pyplot as plt
df = df.pivot(index='Dept', columns='Date', values='Que')
print df
Date 2015-06-19 2015-06-22 2015-06-23 2015-06-24 2015-06-25 2015-06-26 \
Dept
ece 87.0 NaN 82.0 89.0 96.0 88.0
eee 66.0 56.0 71.0 73.0 70.0 NaN
mech 7.0 8.0 NaN NaN 8.0 NaN
Date 2015-06-27 2015-06-30
Dept
ece NaN 82.0
eee 60.0 NaN
mech 10.0 NaN
df.T.plot()   # transpose so the dates run along the x-axis, one line per Dept
plt.show()
You can check the docs.
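If you would rather avoid the transpose, a minimal self-contained sketch (the file name dept.csv is hypothetical) is to pivot with Date as the index, so each Dept becomes a column directly:
import matplotlib.pyplot as plt
import pandas as pd
# Hypothetical file holding the Dept,Date,Que data from the question.
df = pd.read_csv('dept.csv', parse_dates=['Date'])
# One column per Dept, dates on the index: plot() then draws one line per Dept.
df.pivot(index='Date', columns='Dept', values='Que').plot(grid=True)
plt.show()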

SAS ARIMA modelling for 5 different variables

I am trying to do an ARIMA model estimation for 5 different variables. The data consists of 16 months of point-of-sale data. How do I approach this complicated ARIMA modelling?
Furthermore, I would like to do:
A simple moving average of each product group
A Holt-Winters exponential smoothing model
Data is as follows with date and product groups:
Date Gloves ShoeCovers Socks Warmers HeadWear
apr-14 11015 3827 3465 1264 772
maj-14 11087 2776 4378 1099 1423
jun-14 7645 1432 4490 674 670
jul-14 10083 7975 2577 1558 8501
aug-14 13887 8577 6854 1305 15621
sep-14 9186 5213 5244 1183 6784
okt-14 7611 4279 4150 977 6191
nov-14 6410 4033 2918 507 8276
dec-14 4856 3552 3192 450 4810
jan-15 17506 7274 3137 2216 3979
feb-15 21518 5672 8848 1838 2321
mar-15 17395 5200 5712 1604 2282
apr-15 11405 4531 5185 1479 1888
maj-15 11509 5690 4370 1145 2369
jun-15 9945 2610 4884 882 1709
jul-15 8707 5658 4570 1948 6255
Any skilled forecasters out there willing to help? Much appreciated!

Pandas: How to stack time series into a dataframe with time columns?

I have a pandas timeseries with minute tick data:
2011-01-01 09:30:00 -0.358525
2011-01-01 09:31:00 -0.185970
2011-01-01 09:32:00 -0.357479
2011-01-01 09:33:00 -1.486157
2011-01-01 09:34:00 -1.101909
2011-01-01 09:35:00 -1.957380
2011-01-02 09:30:00 -0.489747
2011-01-02 09:31:00 -0.341163
2011-01-02 09:32:00 1.588071
2011-01-02 09:33:00 -0.146610
2011-01-02 09:34:00 -0.185834
2011-01-02 09:35:00 -0.872918
2011-01-03 09:30:00 0.682824
2011-01-03 09:31:00 -0.344875
2011-01-03 09:32:00 -0.641186
2011-01-03 09:33:00 -0.501414
2011-01-03 09:34:00 0.877347
2011-01-03 09:35:00 2.183530
What is the best way to stack it into a dataframe such as:
09:30:00 09:31:00 09:32:00 09:33:00 09:34:00 09:35:00
2011-01-01 -0.358525 -0.185970 -0.357479 -1.486157 -1.101909 -1.957380
2011-01-02 -0.489747 -0.341163 1.588071 -0.146610 -0.185834 -0.872918
2011-01-03 0.682824 -0.344875 -0.641186 -0.501414 0.877347 2.183530
I'd make sure that this is actually what you want to do, as the resulting df loses a lot of the nice time-series functionality that pandas has.
But here is some code that would accomplish it. First, a time column is added and the index is set to just the date part of the DatetimeIndex. The pivot command then reshapes the data, setting the times as columns.
In [74]: df.head()
Out[74]:
value
date
2011-01-01 09:30:00 -0.358525
2011-01-01 09:31:00 -0.185970
2011-01-01 09:32:00 -0.357479
2011-01-01 09:33:00 -1.486157
2011-01-01 09:34:00 -1.101909
In [75]: df['time'] = df.index.time
In [76]: df.index = df.index.date
In [77]: df2 = df.pivot(index=df.index, columns='time')
The resulting dataframe will have a MultiIndex for the columns (the top level just being the name of your values variable). If you want to get back to a flat list of columns, the code below will flatten the column list.
In [78]: df2.columns = [c for (_, c) in df2.columns]
In [79]: df2
Out[79]:
09:30:00 09:31:00 09:32:00 09:33:00 09:34:00 09:35:00
2011-01-01 -0.358525 -0.185970 -0.357479 -1.486157 -1.101909 -1.957380
2011-01-02 -0.489747 -0.341163 1.588071 -0.146610 -0.185834 -0.872918
2011-01-03 0.682824 -0.344875 -0.641186 -0.501414 0.877347 2.183530
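A sketch of an alternative that skips the pivot and column-flattening steps (this assumes the original minute-tick Series is named s): build a (date, time) MultiIndex from the DatetimeIndex, then unstack the time level into columns.
import pandas as pd
# Split each timestamp into its date and time parts as index levels.
idx = pd.MultiIndex.from_arrays([s.index.date, s.index.time], names=['date', 'time'])
# Unstack moves the time level into the columns, leaving one row per date.
df2 = pd.Series(s.values, index=idx).unstack('time')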

Pandas - Python 2.7: How to convert a timeseries index to seconds of the day?

I'm trying to convert a time series index to seconds of the day, i.e. so that the seconds increase from 0 to 86399 as the day progresses. I can currently recover the time of day, but I'm having trouble converting it to seconds in a vectorized way:
df['timeofday'] = df.index.time
Any ideas? Thanks.
As @Jeff points out, my original answer misunderstood what you were doing. But the following should work, and it is vectorized. My answer relies on numpy datetime64 operations (subtract the beginning of the day from the current datetime64, then divide through by a timedelta64 to get seconds):
>>> df
A
2011-01-01 00:00:00 -0.112448
2011-01-01 01:00:00 1.006958
2011-01-01 02:00:00 -0.056194
2011-01-01 03:00:00 0.777821
2011-01-01 04:00:00 -0.552584
2011-01-01 05:00:00 0.156198
2011-01-01 06:00:00 0.848857
2011-01-01 07:00:00 0.248990
2011-01-01 08:00:00 0.524785
2011-01-01 09:00:00 1.510011
2011-01-01 10:00:00 -0.332266
2011-01-01 11:00:00 -0.909849
2011-01-01 12:00:00 -1.275335
2011-01-01 13:00:00 1.361837
2011-01-01 14:00:00 1.924534
2011-01-01 15:00:00 0.618478
import numpy as np
df['sec'] = (df.index.values
             - df.index.values.astype('datetime64[D]')) / np.timedelta64(1, 's')
A sec
2011-01-01 00:00:00 -0.112448 0
2011-01-01 01:00:00 1.006958 3600
2011-01-01 02:00:00 -0.056194 7200
2011-01-01 03:00:00 0.777821 10800
2011-01-01 04:00:00 -0.552584 14400
2011-01-01 05:00:00 0.156198 18000
2011-01-01 06:00:00 0.848857 21600
2011-01-01 07:00:00 0.248990 25200
2011-01-01 08:00:00 0.524785 28800
2011-01-01 09:00:00 1.510011 32400
2011-01-01 10:00:00 -0.332266 36000
2011-01-01 11:00:00 -0.909849 39600
2011-01-01 12:00:00 -1.275335 43200
2011-01-01 13:00:00 1.361837 46800
2011-01-01 14:00:00 1.924534 50400
2011-01-01 15:00:00 0.618478 54000
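A more pandas-native sketch of the same computation (assuming df has the DatetimeIndex shown above): normalize() floors each timestamp to midnight, and the resulting TimedeltaIndex converts directly to seconds:
df['sec'] = (df.index - df.index.normalize()).total_seconds()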
May be a bit overdone, but this would be my answer:
from numpy.random import randn
from pandas import date_range, Series, to_datetime
# Some test data
rng = date_range('1/1/2011 01:01:01', periods=3, freq='s')
df = Series(randn(len(rng)), index=rng).to_frame()
def sec_in_day(timestamp):
    date = timestamp.date()  # the date without the time
    elapsed_time = timestamp - to_datetime(date)  # time elapsed since midnight
    return elapsed_time.total_seconds()
Series(df.index).apply(sec_in_day)
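For the test data above (01:01:01 is 3600 + 60 + 1 = 3661 seconds past midnight), this gives:
# 0    3661.0
# 1    3662.0
# 2    3663.0
# dtype: float64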
I modified KarlD's answer for datetime with time zone:
d = pd.DataFrame({"t_naive":pd.date_range("20160101","20160102", freq = "2H")})
d['t_utc'] = d['t_naive'].dt.tz_localize("UTC")
d['t_ct'] = d['t_utc'].dt.tz_convert("America/Chicago")
print(d.head())
# t_naive t_utc t_ct
# 0 2016-01-01 00:00:00 2016-01-01 00:00:00+00:00 2015-12-31 18:00:00-06:00
# 1 2016-01-01 02:00:00 2016-01-01 02:00:00+00:00 2015-12-31 20:00:00-06:00
# 2 2016-01-01 04:00:00 2016-01-01 04:00:00+00:00 2015-12-31 22:00:00-06:00
# 3 2016-01-01 06:00:00 2016-01-01 06:00:00+00:00 2016-01-01 00:00:00-06:00
# 4 2016-01-01 08:00:00 2016-01-01 08:00:00+00:00 2016-01-01 02:00:00-06:00
KarlD's answer gives the seconds of the day in UTC:
s0 = (d["t_naive"].values - d["t_naive"].values.astype('datetime64[D]'))/np.timedelta64(1,'s')
s0
# array([ 0., 7200., 14400., 21600., 28800., 36000., 43200.,
# 50400., 57600., 64800., 72000., 79200., 0.])
s1 = (d["t_ct"].values - d["t_ct"].values.astype('datetime64[D]'))/np.timedelta64(1,'s')
s1
# array([ 0., 7200., 14400., 21600., 28800., 36000., 43200.,
# 50400., 57600., 64800., 72000., 79200., 0.])
For sec of day in local time, use:
s2 = (d["t_ct"].view("int64") - d["t_ct"].dt.normalize().view("int64"))//pd.Timedelta(1, unit='s')
#use d.index.normalize() for index
s2.values
# array([64800, 72000, 79200, 0, 7200, 14400, 21600, 28800, 36000,
# 43200, 50400, 57600, 64800], dtype=int64)
or,
s3 = d["t_ct"].dt.hour*60*60 + d["t_ct"].dt.minute*60+ d["t_ct"].dt.second
s3.values
# array([64800, 72000, 79200, 0, 7200, 14400, 21600, 28800, 36000,
# 43200, 50400, 57600, 64800], dtype=int64)
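In newer pandas, a one-line sketch of the same local-time computation (same d as above) would be:
s4 = (d["t_ct"] - d["t_ct"].dt.normalize()).dt.total_seconds()
# same values as s2/s3, but as float64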