How to remove DATE_TIME rows that have NaN values - python-2.7
How do I remove the DATE and TIME rows with NaN values in the 'oo' column?
This is my CSV:
DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOLUME
02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505
02/03/1997,09:05:00,3047.00,3048.00,3046.00,3047.00,162
02/03/1997,09:06:00,3047.50,3048.00,3047.00,3047.50,98
02/03/1997,09:07:00,3047.50,3047.50,3047.00,3047.50,228
02/03/1997,09:08:00,3048.00,3048.00,3047.50,3048.00,136
02/03/1997,09:09:00,3048.00,3048.00,3046.50,3046.50,174
02/03/1997,09:10:00,3046.50,3046.50,3045.00,3045.00,134
02/03/1997,09:11:00,3045.50,3046.00,3044.00,3045.00,43
02/03/1997,09:12:00,3045.00,3045.50,3045.00,3045.00,214
02/03/1997,09:13:00,3045.50,3045.50,3045.50,3045.50,8
02/03/1997,09:14:00,3045.50,3046.00,3044.50,3044.50,152
02/03/1997,09:15:00,3044.00,3044.00,3042.50,3042.50,126
02/03/1997,09:16:00,3043.50,3043.50,3043.00,3043.00,128
02/03/1997,09:17:00,3042.50,3043.50,3042.50,3043.50,23
02/03/1997,09:18:00,3043.50,3044.50,3043.00,3044.00,51
02/03/1997,09:19:00,3044.50,3044.50,3043.00,3043.00,18
02/03/1997,09:20:00,3043.00,3045.00,3043.00,3045.00,23
02/03/1997,09:21:00,3045.00,3045.00,3044.50,3045.00,51
02/03/1997,09:22:00,3045.00,3045.00,3045.00,3045.00,47
02/03/1997,09:23:00,3045.50,3046.00,3045.00,3045.00,77
02/03/1997,09:24:00,3045.00,3045.00,3045.00,3045.00,131
02/03/1997,09:25:00,3044.50,3044.50,3043.50,3043.50,138
02/03/1997,09:26:00,3043.50,3043.50,3043.50,3043.50,6
02/03/1997,09:27:00,3043.50,3043.50,3043.00,3043.00,56
02/03/1997,09:28:00,3043.00,3044.00,3043.00,3044.00,32
02/03/1997,09:29:00,3044.50,3044.50,3044.50,3044.50,63
02/03/1997,09:30:00,3045.00,3045.00,3045.00,3045.00,28
Here's my code:
import pandas as pd

exp = pd.read_csv('example.txt', parse_dates=[["DATE", "TIME"]], index_col=0)
exp['oo'] = exp.OPEN.resample("5Min").first()
print exp['oo']
and I get this:
DATE_TIME
1997-02-03 09:04:00 NaN
1997-02-03 09:05:00 3047.0
1997-02-03 09:06:00 NaN
1997-02-03 09:07:00 NaN
1997-02-03 09:08:00 NaN
1997-02-03 09:09:00 NaN
1997-02-03 09:10:00 3046.5
I want to get rid of all the DATE_TIME rows with NaN values in the 'oo' column.
I've tried:
exp['oo'] = exp['oo'].dropna()
But I get the same thing.
I've looked all through http://pandas.pydata.org/pandas-docs/stable/missing_data.html
and all over this website.
I would like to keep my CSV reader the same, but I don't know how to fix this.
If anybody could help it would be greatly appreciated. Thanks so much for your time.
Your dropna does work, but assigning its result back into a column of the original DataFrame (exp['oo'] = exp['oo'].dropna()) realigns it to the full minute-by-minute index, which puts the NaN rows right back. I think you want the resampled series itself:
>>> exp.OPEN.resample("5Min").first()
DATE_TIME
1997-02-03 09:00:00 3046.0
1997-02-03 09:05:00 3047.0
1997-02-03 09:10:00 3046.5
1997-02-03 09:15:00 3044.0
1997-02-03 09:20:00 3043.0
1997-02-03 09:25:00 3044.5
1997-02-03 09:30:00 3045.0
Freq: 5T, Name: OPEN, dtype: float64
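If you want to keep the result and drop any empty bins, store the resampled series on its own instead of assigning it back into the original minute-level frame. A minimal sketch, assuming the same example.txt from the question:

import pandas as pd

exp = pd.read_csv('example.txt', parse_dates=[["DATE", "TIME"]], index_col=0)
oo = exp.OPEN.resample("5Min").first()  # one row per 5-minute bin
oo = oo.dropna()  # removes bins that contained no data at all
print oo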
Related
SAS using Datalines - "observation read not used"
I am a complete newb to SAS and all I know is basic SQL. I am currently taking a regression class and having trouble with SAS code. I am trying to input two columns of data for a simple regression, where the x variable is State and the y variable is # of accidents. I keep getting this:

ERROR: No valid observations are found.
Number of Observations Read 51
Number of Observations Used 0
Number of Observations with Missing Values 51

Is it because datalines only read numbers and not characters? Here is the code as well as the datalines:

Data Firearm_Accidents_1999_to_2014;
ods graphics on;
Input State Sum_OF_Deaths;
Datalines;
Alabama 526
Alaska 0
Arizona 150
Arkansas 246
California 834
Colorado 33
Connecticut 0
Delaware 0
District_of_Columbia 0
Florida 350
Georgia 413
Hawaii 0
Idaho 0
Illinois 287
Indiana 288
Iowa 0
Kansas 44
Kentucky 384
Louisiana 562
Maine 0
Maryland 21
Massachusetts 27
Michigan 168
Minnesota 0
Mississippi 332
Missouri 320
Montana 0
Nebraska 0
Nevada 0
New_Hampshire 0
New_Jersey 85
New_Mexico 49
New_York 218
North_Carolina 437
North_Dakota 0
Ohio 306
Oklahoma 227
Oregon 41
Pennsylvania 465
Rhode_Island 0
South_Carolina 324
South_Dakota 0
Tennessee 603
Texas 876
Utah 0
Vermont 0
Virginia 203
Washington 45
West_Virginia 136
Wisconsin 64
Wyoming 0
;
run;
proc print;
proc reg data = Firearm_Accidents_1999_to_2014;
model State = Sum_OF_Deaths;
ods graphics off;
run;
quit;
OK, there are a few different levels of issues here. ODS GRAPHICS statements go before and after procs, not inside them. When reading a character variable you need to tell SAS by using an informat; this allows you to read in the data. However, your regression still has several issues. For one, State is a character variable, and you can't do regression with a character variable; I think that issue is beyond this forum. Review your regression basics and check what you're trying to do.

Data Firearm_Accidents_1999_to_2014;
informat state $32.;
Input State Sum_OF_Deaths;
Datalines;
Alabama 526
Alaska 0
Arizona 150
Arkansas 246
California 834
Colorado 33
....
;
run;
How to draw a multiline chart using python pandas?
Dataframe:

Dept,Date,Que
ece,2015-06-25,96
ece,2015-06-24,89
ece,2015-06-26,88
ece,2015-06-19,87
ece,2015-06-23,82
ece,2015-06-30,82
eee,2015-06-24,73
eee,2015-06-23,71
eee,2015-06-25,70
eee,2015-06-19,66
eee,2015-06-27,60
eee,2015-06-22,56
mech,2015-06-27,10
mech,2015-06-22,8
mech,2015-06-25,8
mech,2015-06-19,7

I need a multiline chart with a grid, based on the Dept column; I need each Dept as one line. For example, for ece the sparkline should be 96, 89, 88, 87, 82, 82, ... and likewise for the other Depts.
I think you need pivot and plot:

import matplotlib.pyplot as plt

df = df.pivot(index='Dept', columns='Date', values='Que')
print df

Date  2015-06-19  2015-06-22  2015-06-23  2015-06-24  2015-06-25  2015-06-26  \
Dept
ece         87.0         NaN        82.0        89.0        96.0        88.0
eee         66.0        56.0        71.0        73.0        70.0         NaN
mech         7.0         8.0         NaN         NaN         8.0         NaN

Date  2015-06-27  2015-06-30
Dept
ece          NaN        82.0
eee         60.0         NaN
mech        10.0         NaN

df.plot()
plt.show()

You can check the docs.
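Note that with Dept as the pivot index, df.plot() draws one line per Date. To get one line per Dept with a grid, as asked, a minimal sketch (assuming the original long-format df with columns Dept, Date, Que) would pivot the other way:

import matplotlib.pyplot as plt

wide = df.pivot(index='Date', columns='Dept', values='Que')
wide.plot(grid=True)  # one line per Dept, grid on
plt.show()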
SAS ARIMA modelling for 5 different variables
I am trying to do an ARIMA model estimation for 5 different variables. The data consists of 16 months of point-of-sale data. How do I approach this complicated ARIMA modelling? Furthermore, I would like to do:

A simple moving average of each product group
A Holt-Winters exponential smoothing model

Data is as follows, with date and product groups:

Date    Gloves  ShoeCovers  Socks  Warmers  HeadWear
apr-14   11015        3827   3465     1264       772
maj-14   11087        2776   4378     1099      1423
jun-14    7645        1432   4490      674       670
jul-14   10083        7975   2577     1558      8501
aug-14   13887        8577   6854     1305     15621
sep-14    9186        5213   5244     1183      6784
okt-14    7611        4279   4150      977      6191
nov-14    6410        4033   2918      507      8276
dec-14    4856        3552   3192      450      4810
jan-15   17506        7274   3137     2216      3979
feb-15   21518        5672   8848     1838      2321
mar-15   17395        5200   5712     1604      2282
apr-15   11405        4531   5185     1479      1888
maj-15   11509        5690   4370     1145      2369
jun-15    9945        2610   4884      882      1709
jul-15    8707        5658   4570     1948      6255

Any skilled forecasters out there willing to help? Much appreciated!
Pandas: How to stack time series into a dataframe with time columns?
I have a pandas timeseries with minute tick data:

2011-01-01 09:30:00   -0.358525
2011-01-01 09:31:00   -0.185970
2011-01-01 09:32:00   -0.357479
2011-01-01 09:33:00   -1.486157
2011-01-01 09:34:00   -1.101909
2011-01-01 09:35:00   -1.957380
2011-01-02 09:30:00   -0.489747
2011-01-02 09:31:00   -0.341163
2011-01-02 09:32:00    1.588071
2011-01-02 09:33:00   -0.146610
2011-01-02 09:34:00   -0.185834
2011-01-02 09:35:00   -0.872918
2011-01-03 09:30:00    0.682824
2011-01-03 09:31:00   -0.344875
2011-01-03 09:32:00   -0.641186
2011-01-03 09:33:00   -0.501414
2011-01-03 09:34:00    0.877347
2011-01-03 09:35:00    2.183530

What is the best way to stack it into a dataframe such as:

            09:30:00  09:31:00  09:32:00  09:33:00  09:34:00  09:35:00
2011-01-01 -0.358525 -0.185970 -0.357479 -1.486157 -1.101909 -1.957380
2011-01-02 -0.489747 -0.341163  1.588071 -0.146610 -0.185834 -0.872918
2011-01-03  0.682824 -0.344875 -0.641186 -0.501414  0.877347  2.183530
I'd make sure that this is actually what you want to do, as the resulting df loses a lot of the nice time-series functionality that pandas has. But here is some code that would accomplish it. First, a time column is added, and the index is set to just the date part of the DateTimeIndex. The pivot command reshapes the data, setting the times as columns.

In [74]: df.head()
Out[74]:
                        value
date
2011-01-01 09:30:00 -0.358525
2011-01-01 09:31:00 -0.185970
2011-01-01 09:32:00 -0.357479
2011-01-01 09:33:00 -1.486157
2011-01-01 09:34:00 -1.101909

In [75]: df['time'] = df.index.time

In [76]: df.index = df.index.date

In [77]: df2 = df.pivot(index=df.index, columns='time')

The resulting dataframe will have a MultiIndex for the columns (the top level just being the name of your values variable). If you want it back to just a flat list of columns, the code below will flatten the column list.

In [78]: df2.columns = [c for (_, c) in df2.columns]

In [79]: df2
Out[79]:
            09:30:00  09:31:00  09:32:00  09:33:00  09:34:00  09:35:00
2011-01-01 -0.358525 -0.185970 -0.357479 -1.486157 -1.101909 -1.957380
2011-01-02 -0.489747 -0.341163  1.588071 -0.146610 -0.185834 -0.872918
2011-01-03  0.682824 -0.344875 -0.641186 -0.501414  0.877347  2.183530
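A more compact route (a sketch, assuming the data is a Series s with a DatetimeIndex as in the question) is to build a date/time MultiIndex and unstack the time level into columns:

import pandas as pd

# s is the minute-tick Series; a list of two arrays becomes a MultiIndex
stacked = pd.Series(s.values, index=[s.index.date, s.index.time])
df2 = stacked.unstack()  # dates become rows, times become columns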
Pandas - Python 2.7: How to convert a timeseries index to seconds of the day?
I'm trying to convert a time series index to seconds of the day, i.e. so that the seconds increase from 0 to 86399 as the day progresses. I can currently recover the time of day, but am having trouble converting it to seconds in a vectorized way:

df['timeofday'] = df.index.time

Any ideas? Thanks.
As @Jeff points out, my original answer misunderstood what you were doing. But the following should work, and it is vectorized. My answer relies on numpy datetime64 operations (subtract the beginning of the day from the current datetime64, then divide by a timedelta64 to get seconds):

>>> df
                            A
2011-01-01 00:00:00 -0.112448
2011-01-01 01:00:00  1.006958
2011-01-01 02:00:00 -0.056194
2011-01-01 03:00:00  0.777821
2011-01-01 04:00:00 -0.552584
2011-01-01 05:00:00  0.156198
2011-01-01 06:00:00  0.848857
2011-01-01 07:00:00  0.248990
2011-01-01 08:00:00  0.524785
2011-01-01 09:00:00  1.510011
2011-01-01 10:00:00 -0.332266
2011-01-01 11:00:00 -0.909849
2011-01-01 12:00:00 -1.275335
2011-01-01 13:00:00  1.361837
2011-01-01 14:00:00  1.924534
2011-01-01 15:00:00  0.618478

import numpy as np
df['sec'] = (df.index.values - df.index.values.astype('datetime64[D]')) / np.timedelta64(1, 's')

                            A    sec
2011-01-01 00:00:00 -0.112448      0
2011-01-01 01:00:00  1.006958   3600
2011-01-01 02:00:00 -0.056194   7200
2011-01-01 03:00:00  0.777821  10800
2011-01-01 04:00:00 -0.552584  14400
2011-01-01 05:00:00  0.156198  18000
2011-01-01 06:00:00  0.848857  21600
2011-01-01 07:00:00  0.248990  25200
2011-01-01 08:00:00  0.524785  28800
2011-01-01 09:00:00  1.510011  32400
2011-01-01 10:00:00 -0.332266  36000
2011-01-01 11:00:00 -0.909849  39600
2011-01-01 12:00:00 -1.275335  43200
2011-01-01 13:00:00  1.361837  46800
2011-01-01 14:00:00  1.924534  50400
2011-01-01 15:00:00  0.618478  54000
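In newer pandas versions you can stay at the pandas level instead of dropping to raw numpy; a sketch, assuming df has a DatetimeIndex as above:

# normalize() floors each timestamp to midnight; subtracting gives a
# TimedeltaIndex, and total_seconds() converts it to seconds since midnight
df['sec'] = (df.index - df.index.normalize()).total_seconds()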
Maybe a bit overdone, but this would be my answer:

from numpy.random import randn
from pandas import date_range, Series, to_datetime

# Some test data
rng = date_range('1/1/2011 01:01:01', periods=3, freq='s')
df = Series(randn(len(rng)), index=rng).to_frame()

def sec_in_day(timestamp):
    date = timestamp.date()  # We get the date less the time
    elapsed_time = timestamp.to_datetime() - to_datetime(date)  # We get the time
    return elapsed_time.total_seconds()

Series(df.index).apply(sec_in_day)
I modified KarlD's answer for datetime with a time zone:

d = pd.DataFrame({"t_naive": pd.date_range("20160101", "20160102", freq="2H")})
d['t_utc'] = d['t_naive'].dt.tz_localize("UTC")
d['t_ct'] = d['t_utc'].dt.tz_convert("America/Chicago")
print(d.head())
#               t_naive                     t_utc                      t_ct
# 0 2016-01-01 00:00:00 2016-01-01 00:00:00+00:00 2015-12-31 18:00:00-06:00
# 1 2016-01-01 02:00:00 2016-01-01 02:00:00+00:00 2015-12-31 20:00:00-06:00
# 2 2016-01-01 04:00:00 2016-01-01 04:00:00+00:00 2015-12-31 22:00:00-06:00
# 3 2016-01-01 06:00:00 2016-01-01 06:00:00+00:00 2016-01-01 00:00:00-06:00
# 4 2016-01-01 08:00:00 2016-01-01 08:00:00+00:00 2016-01-01 02:00:00-06:00

The answer by KarlD gives seconds of the day in UTC:

s0 = (d["t_naive"].values - d["t_naive"].values.astype('datetime64[D]')) / np.timedelta64(1, 's')
s0
# array([    0.,  7200., 14400., 21600., 28800., 36000., 43200.,
#        50400., 57600., 64800., 72000., 79200.,     0.])

s1 = (d["t_ct"].values - d["t_ct"].values.astype('datetime64[D]')) / np.timedelta64(1, 's')
s1
# array([    0.,  7200., 14400., 21600., 28800., 36000., 43200.,
#        50400., 57600., 64800., 72000., 79200.,     0.])

For seconds of the day in local time, use:

s2 = (d["t_ct"].view("int64") - d["t_ct"].dt.normalize().view("int64")) // pd.Timedelta(1, unit='s')  # use d.index.normalize() for an index
s2.values
# array([64800, 72000, 79200,     0,  7200, 14400, 21600, 28800, 36000,
#        43200, 50400, 57600, 64800], dtype=int64)

or:

s3 = d["t_ct"].dt.hour*60*60 + d["t_ct"].dt.minute*60 + d["t_ct"].dt.second
s3.values
# array([64800, 72000, 79200,     0,  7200, 14400, 21600, 28800, 36000,
#        43200, 50400, 57600, 64800], dtype=int64)