Pandas dataframe applying NA to part of the data - python-2.7

Let me preface this by saying I am new to pandas, so I'm sorry if this question is basic or has been answered before; I looked online and couldn't find what I needed.
I have a dataframe that holds a baseball team's schedule. Some of the games have already been played, so their results are entered in the dataframe. For games that are yet to happen, however, there is only the time they are to be played (e.g. 1:35 pm).
So I would like to convert all of the values for the games yet to happen into NaNs.
Thank you
As requested here is what the results dataframe for the Arizona Diamondbacks contains
print MLB['ARI']
0 0
1 0
2 0
3 1
4 0
5 0
6 0
7 0
8 1
9 0
10 1
...
151 3:40 pm
152 8:40 pm
153 8:10 pm
154 4:10 pm
155 4:10 pm
156 8:10 pm
157 8:10 pm
158 1:10 pm
159 9:40 pm
160 8:10 pm
161 4:10 pm
Name: ARI, Length: 162, dtype: object

I couldn't figure out a direct solution, only an iterative one:
for i in xrange(len(MLB)):
    # blank out entries that look like a time of day
    if 'pm' in MLB['ARI'].iat[i] or 'am' in MLB['ARI'].iat[i]:
        MLB['ARI'].iat[i] = np.nan
This should work if your actual values (1s and 0s) are also strings. If they are numbers, try:
for i in xrange(len(MLB)):
    # anything that is not an integer result is a game still to be played
    if type(MLB['ARI'].iat[i]) != type(1):
        MLB['ARI'].iat[i] = np.nan

The more idiomatic way to do this would be with the vectorised string methods.
http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods
mask = MLB['ARI'].str.contains('pm', na=False)  # create boolean array; na=False treats non-string values as no match
MLB['ARI'][mask] = np.nan  # the column name goes first
Create the boolean array from the column and then use it to select the data you want.
Make sure that the column name goes before the masking array, otherwise you'll be acting on a copy of the data and your original dataframe won't get updated.
MLB['ARI'][mask]  # returns a view on the MLB dataframe, will be updated
MLB[mask]['ARI']  # returns a copy of MLB, won't be updated
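The masking approach above can also be written with a single .loc call, which sidesteps the view-versus-copy question entirely. This is a minimal sketch under assumptions: the sample values are made up to mimic the ARI column, the pattern 'am|pm' is chosen to cover morning games as well, and the name results is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for MLB['ARI']: played games hold 0/1,
# future games hold a time string.
MLB = pd.DataFrame({'ARI': [0, 1, 0, '3:40 pm', '11:10 am']}, dtype=object)

# na=False makes non-string entries (the 0/1 results) count as "no match".
mask = MLB['ARI'].str.contains('am|pm', na=False)

# One .loc call selects rows and column together and assigns in place.
MLB.loc[mask, 'ARI'] = np.nan

# With the time strings gone, the column can be converted to numbers.
results = pd.to_numeric(MLB['ARI'])
```

Selecting rows and column in one step means there is no intermediate object that could turn out to be a copy.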

Related

How to filter csv data to remove data after a specified year?

I am reading a csv in python with multiple columns.
The first column is the date, and I have to delete the rows for 2017 and earlier years.
time high low Volume Plot Rango
0 2017-12-22 25.17984 24.280560 970 0.329943 0.899280
1 2017-12-26 25.17984 23.381280 2579 1.057921 1.798560
2 2017-12-27 25.17984 23.381280 2499 0.998083 1.798560
3 2017-12-28 25.17984 24.280560 1991 0.919885 0.899280
4 2017-12-29 25.17984 24.100704 2703 1.237694 1.079136
.. ... ... ... ... ... ...
580 2020-04-16 5.45000 4.450000 117884 3.168380 1.000000
581 2020-04-17 5.35000 4.255200 58531 1.370538 1.094800
582 2020-04-20 4.66500 4.100100 25770 0.582999 0.564900
583 2020-04-21 4.42000 3.800000 20914 0.476605 0.620000
584 2020-04-22 4.22000 3.710100 23212 0.519275 0.509900
I want to delete the rows corresponding to years prior to 2018, so 2017, 2016, 2015... should be deleted.
I am trying this, but it does not work:
if 2017 in datos['time']: datos['time'].remove() #check if number 2017 is in each of the items of the column 'time'
The dates are recognized as numbers, not as datetime, but I think I do not need to declare them as datetime.
In pandas, given your data, use Boolean indexing:
The time column must be in datetime64[ns] format.
df.info() will give the dtypes.
df['time'] = pd.to_datetime(df['time'])  # convert the column to datetime64[ns]
df = df[df['time'].dt.year >= 2018]  # keep only rows from 2018 onwards
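As a rough end-to-end sketch of this answer, the snippet below parses the date column while reading and then applies the year filter. The CSV text is a hypothetical excerpt built from a few rows of the question, and the name recent is an assumption.

```python
import io
import pandas as pd

# Hypothetical file contents: a few rows from the question in CSV form.
csv_text = """time,high,low
2017-12-22,25.17984,24.280560
2017-12-29,25.17984,24.100704
2020-04-16,5.45000,4.450000
2020-04-22,4.22000,3.710100
"""

# parse_dates converts the 'time' column to datetime64 while reading.
datos = pd.read_csv(io.StringIO(csv_text), parse_dates=['time'])

# Boolean indexing: keep only the rows whose year is 2018 or later.
recent = datos[datos['time'].dt.year >= 2018]
```

With a real file, the path would replace io.StringIO(csv_text).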

Deleting rows in a file using Python

I have input files such as "input.dat" that contain values like this:
41611 2014 12 18 0 0
41615 2014 12 18 0 0
41625 2014 12 18 0 0
41640 2014 6 14 3 3
42248 2014 12 18 0 0
42323 2014 12 18 0 0
42330 2014 8 13 7 7
42334 2014 12 18 0 0
42335 2014 12 18 0 0
...
I have many such dataset files, and they contain a lot of unwanted data.
How can I delete several rows at once, in this case the rows for 41640 and 42330, including all of their values? For now I use this script:
with open(path+fname,"r") as input:
    with open("00-new.dat","wb") as output:
        for line in input:
            if line!="41640"+"\n":
                output.write(line)
The result: the row for 41640 still exists in the output. Any ideas?
You need to change your condition: as written, it checks whether the whole line is equal to 41640 followed by a newline. Each line actually holds the whole row of data you are reading, followed by a \n. A fixed version of your program looks like this:
with open("00-old.dat", "r") as infile:
    with open("00-new.dat", "w") as outfile:  # text mode, matching the text-mode read
        for line in infile:
            # keep every row that does not mention 41640
            if "41640" not in line:
                outfile.write(line)
To delete multiple lines you can use all() combined with a generator expression, as described, for instance, in this post:
            if all(nb not in line for nb in del_list):
                outfile.write(line)
where del_list is a list of the values you want deleted:
del_list = ["41615", "41640", "42334"]
Also, operator precedence is not the problem in your original condition: + binds more tightly than !=, so "41640"+"\n" is concatenated first and the line is then compared against "41640\n". The condition is True for every line simply because no line consists of 41640 alone; each one also contains the rest of the row.
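One caveat with a substring test is that it would also drop a row where 41640 happens to appear in a later column. A possible variant, assuming the ID is always the first whitespace-separated field, compares only that field against a set of IDs. The helper name drop_rows and the sample rows are hypothetical.

```python
def drop_rows(lines, del_ids):
    """Yield only the lines whose first field is not in del_ids."""
    del_ids = set(del_ids)  # set membership tests are fast
    for line in lines:
        fields = line.split()
        # Keep blank lines and any row whose ID is not marked for deletion.
        if not fields or fields[0] not in del_ids:
            yield line


# Sample rows taken from the question's input.dat.
sample = [
    "41625 2014 12 18 0 0\n",
    "41640 2014 6 14 3 3\n",
    "42248 2014 12 18 0 0\n",
    "42330 2014 8 13 7 7\n",
]
kept = list(drop_rows(sample, ["41640", "42330"]))
```

With real files, the open input file can be passed as lines, and each yielded line written to the output file.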

Data format and pandas

I am using Pandas to format things nicely in a tabular format
data = []
for i in range(start, end_value):
    data.append([i, value])
    # modify value in some way
print pd.DataFrame(data)
gives me
0 1
0 38 2.500000e+05
1 39 2.700000e+05
2 40 2.916000e+05
3 41 3.149280e+05
How can I modify this to remove the scientific notation and, for extra points, add a thousands separator?
data['column_name'] = data['column_name'].apply('{0:,.2f}'.format)
thanks to John Galt's previous SO answer
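To make the one-liner concrete, here is a small sketch applied to the numbers from the question. The column names n, value and value_fmt are assumptions; the original frame only had the default labels 0 and 1.

```python
import pandas as pd

# Values from the question's printed output.
data = [[38, 250000.0], [39, 270000.0], [40, 291600.0], [41, 314928.0]]
df = pd.DataFrame(data, columns=['n', 'value'])

# Format each float with a thousands separator and two decimals.
# Note that the result is a column of strings, not numbers.
df['value_fmt'] = df['value'].apply('{0:,.2f}'.format)
```

Because the formatted column holds strings, it is only suitable for display. If the numbers are still needed for arithmetic, setting pd.set_option('display.float_format', '{:,.2f}'.format) changes how floats are printed without touching the data.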

How to append a new column to my Pandas DataFrame based on a row-based calculation?

Let's say I have a Pandas DataFrame with two columns: 1) user_id, 2) steps (which contains the number of steps on the given date). Now I want to calculate the difference between the number of steps and the number of steps in the preceding measurement (measurements are guaranteed to be in order within my DataFrame).
So basically this comes down to appending an extra column to my DataFrame where the row values of this data frame match the value of the column 'steps' within this same row, minus the value of the 'steps' column in the row above (or 0 if this is the first row). To complicate things further, I want to calculate these differences per user_id, so I want to make sure that I do not subtract the steps values of two rows with different user_id's.
Does anyone have an idea how to get this done with Python 2.7 and pandas?
So an example to illustrate this.
Example input:
user_id steps
1015 48
1015 23
1015 79
1016 10
1016 20
Desired output:
user_id steps d_steps
1015 48 0
1015 23 -25
1015 79 56
2023 10 0
2023 20 10
Your output shows user ids that are not in your original data, but the following does what you want; you have to fill the NaN values with 0:
In [16]:
df['d_steps'] = df.groupby('user_id')['steps'].transform('diff')
df.fillna(0, inplace=True)
df
Out[16]:
user_id steps d_steps
0 1015 48 0
1 1015 23 -25
2 1015 79 56
3 1016 10 0
4 1016 20 10
Here we generate the desired column by calling transform on the groupby object and passing a string that maps to the diff method, which subtracts the previous row's value. transform applies a function and returns a Series with an index aligned to the df.
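For reference, a self-contained sketch of the same per-user difference using the example input from the question. It calls diff() directly on the grouped steps column, which is an equivalent spelling rather than the exact code of the answer, and casts back to integers after filling the first row of each group.

```python
import pandas as pd

# Example input from the question.
df = pd.DataFrame({
    'user_id': [1015, 1015, 1015, 1016, 1016],
    'steps': [48, 23, 79, 10, 20],
})

# diff() within each user_id group gives NaN for the first row of a group;
# fill those with 0 and cast back to int, since NaN forces a float column.
df['d_steps'] = df.groupby('user_id')['steps'].diff().fillna(0).astype(int)
```

Filling only the new column, instead of calling fillna on the whole frame, leaves any missing values in other columns untouched.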

rrd graph configurate query

I am updating my RRD file with some counts...
For example:
time: value:
12:00 120
12:05 135
12:10 154
12:20 144
12:25 0
12:30 23
13:35 36
Here my RRD is updated using the logic below:
((current value)-(previous value))/((current time)-(previous time))
e.g. (135-120)/5 = 3
But my problem is that when a 0 comes in, the reading will be negative:
(0-144)/5
Here a 0 value only occurs on a system failure (in the system the data is fetched from). This reading must not be displayed on the graph.
How can I configure it so that when a 0 comes in, the "RRD graph" is not updated (this reading, (0-144)/5, is skipped), and the next reading is taken as (23-0)/5 rather than (23-144)/10?
When you specify the data sources while creating the RRD, you can specify which range of values is acceptable.
DS:data_source:GAUGE:10:1:U will only accept values of 1 or above.
So if you get a 0 during an update, rrd will replace it with unknown, and I assume it can find a way to discard it.