Subtract value in one data frame from the next value in a second data frame - python-2.7

I have a data frame that is composed of several datasets (about 146 and counting). Two of my columns are labeled "start_time" and "stop_time," which represent the start and stop of a response (i.e., the total duration of the response).
I need to get the "inter-response time": each stop_time subtracted from the next corresponding start_time. Basically, if:
start_time = [1,4,7]
stop_time = [2,5,8]
I need:
start_time[1] - stop_time[0]
start_time[2] - stop_time[1]
in order to get:
iri = [2,2]
My code looks like this:
iri_t = []
def grps():
    for grp in lset2_name_grps.groups:
        beg_eng_t = pd.DataFrame([lset2_name_grps.stop_time, lset2_name_grps.start_time], columns=['end_t', 'beg_t'])
        end_t = [i for i in lset2_name_grps.stop_time]
        beg_t = [i for i in lset2_name_grps.start_time]
        beg_t = np.insert(beg_t, len(beg_t), 0)
        end_t = np.insert(end_t, 0, 0)
        iri_t.append(np.subtract(end_t, beg_t))
        # for i, j in zip(end_t, beg_t):
        #     iri_t.append(np.subtract(i, j))
        # lset2_name_grps['iri'] = iri_t
grps()
Essentially, it doesn't do anything close to what I'm trying to accomplish, and the only output I get is either "Not Implemented" or an error.

How about something like this:
import pandas as pd
starts = pd.Series([1, 4, 7])
stops = pd.Series([2, 5, 8])
iri_t = [0]
for i in range(1, len(starts)):
    iri_t.append(starts[i] - stops[i-1])
times_df = pd.concat([starts, stops, pd.Series(iri_t)], axis=1)
This creates the following DataFrame:
   0  1  2
0  1  2  0
1  4  5  2
2  7  8  2

I think what you're asking (correct me if I'm wrong) is best accomplished by putting the two columns in a single dataframe, using shift to offset one of your columns, then doing an ordinary subtraction.
df = pd.DataFrame({'start_time':[1,4,7], 'stop_time':[2,5,8]})
df.start_time - df.stop_time.shift()
Out[5]:
0    NaN
1    2.0
2    2.0
dtype: float64
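Since the question mentions around 146 grouped datasets, here is a minimal sketch of the same shift-and-subtract idea applied per group; the grouping column 'session' and the sample data are hypothetical:
import pandas as pd

# Hypothetical data: two sessions, each with its own sequence of responses
df = pd.DataFrame({'session': ['a', 'a', 'a', 'b', 'b'],
                   'start_time': [1, 4, 7, 1, 6],
                   'stop_time': [2, 5, 8, 3, 7]})

# Shift stop_time within each group so each row sees the previous stop;
# the first response of every group has no predecessor, hence NaN
df['iri'] = df['start_time'] - df.groupby('session')['stop_time'].shift()
print(df)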

Related

Parsing periods in a dataframe column

I have a CSV where one of the columns contains periods:
timespan (string): PnYnMnD, where P is a literal value that starts the expression, nY is the number of years followed by a literal Y, nM is the number of months followed by a literal M, and nD is the number of days followed by a literal D; any of these numbers and corresponding designators may be absent if they are equal to 0, and a minus sign may appear before the P to specify a negative duration.
I want to return a data frame that contains all the data in the CSV, with the timespan column parsed.
So far I have a code that parses periods:
import re
timespan_regex = re.compile(r'P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?')
def parse_timespan(timespan):
    # check if the input is a valid timespan
    if not timespan or 'P' not in timespan:
        return None
    # check if the timespan is negative and skip the initial 'P' literal
    curr_idx = 0
    is_negative = timespan.startswith('-')
    if is_negative:
        curr_idx = 1
    # extract years, months and days with the regex
    match = timespan_regex.match(timespan[curr_idx:])
    years = int(match.group(1) or 0)
    months = int(match.group(2) or 0)
    days = int(match.group(3) or 0)
    timespan_days = years * 365 + months * 30 + days
    return timespan_days if not is_negative else -timespan_days
print(parse_timespan(''))
print(parse_timespan('P2Y11M20D'))
print(parse_timespan('-P2Y11M20D'))
print(parse_timespan('P2Y'))
print(parse_timespan('P0Y'))
print(parse_timespan('P2Y4M'))
print(parse_timespan('P16D'))
Output:
None
1080
-1080
730
0
850
16
How do I apply this function to the whole timespan column while processing the CSV?
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv(f_path, names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # to remove the first row
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    my_ocan['timespan'] = parse_timespan(my_ocan['timespan'])  # I tried like this, but sure it is not working :)
    return my_ocan
Thank you and have a lovely day :)
Like Python's built-in map, pandas has a map method on Series; you can check its documentation. Since you already have a function ready that takes a single parameter and returns a value, you just need this:
my_ocan['timespan'] = my_ocan['timespan'].map(parse_timespan)  # takes each value in 'timespan', passes it to parse_timespan, and stores the returned value in that row
And here is a generic demo:
import pandas as pd

def demo_func(x):
    # Takes an int or string, prefixes it with 'A', and returns a string.
    return "A" + str(x)

df = pd.DataFrame({"Column_1": [1, 2, 3, 4], "Column_2": [10, 9, 8, 7]})
print(df)
df['Column_1'] = df['Column_1'].map(demo_func)
print("After mapping:\n{}".format(df))
Output:
  Column_1  Column_2
0        1        10
1        2         9
2        3         8
3        4         7
After mapping:
  Column_1  Column_2
0       A1        10
1       A2         9
2       A3         8
3       A4         7
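Applied to the question's own function, a minimal sketch might look like the following; the column names are taken from the question, 'timespan' is no longer passed to parse_dates (it is not a date), and 'creation' is converted explicitly instead:
import pandas as pd

def do_process_citation_data(f_path):
    my_ocan = pd.read_csv(f_path,
                          names=['oci', 'citing', 'cited', 'creation',
                                 'timespan', 'journal_sc', 'author_sc'])
    my_ocan = my_ocan.iloc[1:]  # drop the header row that was read as data
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d")
    # map applies parse_timespan to every value in the column
    my_ocan['timespan'] = my_ocan['timespan'].map(parse_timespan)
    return my_ocan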

Drop rows based on one column values

I've a dataframe which looks like this:
       wave        mean    median        mad
0   4050.32   -0.016182 -0.011940   0.008885
1   4208.98    0.023707  0.007189   0.032585
2   4508.28    3.662293  0.001414   7.193139
3   4531.62  -15.459313 -0.001523  30.408377
4   4551.65    0.009028  0.007581   0.005247
5   4554.46    0.001861  0.010692   0.027969
6   6828.60  -10.604568 -0.000590  21.084799
7   6839.84   -0.003466 -0.001870   0.010169
8   6842.04  -32.751551 -0.002514  65.118329
9   6842.69   18.293519 -0.002158  36.385884
10  6843.66    0.006386 -0.002468   0.034995
11  6855.72    0.020803  0.000886   0.040529
As is clearly evident in the table above, some of the values in the mean and mad columns are very big (outliers), so I want to remove the rows that have these very big values.
For example, in row 3 the value of mad is 30.408377, which is very big, so I want to drop this row. I know that I can use one line to remove these values from the columns, but it doesn't remove the complete row:
df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
But I want to remove the complete row. How can I do that?
Predicates like what you've given will remove entire rows. But none of your data is outside of 3 standard deviations. If you tone it down to just one standard deviation, rows are removed with your example data.
Here's an example using your data:
import pandas as pd
import numpy as np
columns = ["wave", "mean", "median", "mad"]
data = [
    [4050.32, -0.016182, -0.011940, 0.008885],
    [4208.98, 0.023707, 0.007189, 0.032585],
    [4508.28, 3.662293, 0.001414, 7.193139],
    [4531.62, -15.459313, -0.001523, 30.408377],
    [4551.65, 0.009028, 0.007581, 0.005247],
    [4554.46, 0.001861, 0.010692, 0.027969],
    [6828.60, -10.604568, -0.000590, 21.084799],
    [6839.84, -0.003466, -0.001870, 0.010169],
    [6842.04, -32.751551, -0.002514, 65.118329],
    [6842.69, 18.293519, -0.002158, 36.385884],
    [6843.66, 0.006386, -0.002468, 0.034995],
    [6855.72, 0.020803, 0.000886, 0.040529],
]
df = pd.DataFrame(np.array(data), columns=columns)
print("ORIGINAL: ")
print(df)
print()
res = df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())]
print("REMOVED: ")
print(res)
This outputs:
ORIGINAL:
       wave        mean    median        mad
0   4050.32   -0.016182 -0.011940   0.008885
1   4208.98    0.023707  0.007189   0.032585
2   4508.28    3.662293  0.001414   7.193139
3   4531.62  -15.459313 -0.001523  30.408377
4   4551.65    0.009028  0.007581   0.005247
5   4554.46    0.001861  0.010692   0.027969
6   6828.60  -10.604568 -0.000590  21.084799
7   6839.84   -0.003466 -0.001870   0.010169
8   6842.04  -32.751551 -0.002514  65.118329
9   6842.69   18.293519 -0.002158  36.385884
10  6843.66    0.006386 -0.002468   0.034995
11  6855.72    0.020803  0.000886   0.040529
AFTER REMOVAL:
       wave        mean    median        mad
0   4050.32   -0.016182 -0.011940   0.008885
1   4208.98    0.023707  0.007189   0.032585
2   4508.28    3.662293  0.001414   7.193139
3   4531.62  -15.459313 -0.001523  30.408377
4   4551.65    0.009028  0.007581   0.005247
5   4554.46    0.001861  0.010692   0.027969
6   6828.60  -10.604568 -0.000590  21.084799
7   6839.84   -0.003466 -0.001870   0.010169
10  6843.66    0.006386 -0.002468   0.034995
11  6855.72    0.020803  0.000886   0.040529
Observe that rows indexed 8 and 9 are now gone.
Be sure you're reassigning the output of df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())] as shown above. The operation is not done in place.
Doing df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())] will not change the dataframe.
But assign it back to df, so that:
df = df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
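If you want to apply the same cutoff to several columns at once (the question mentions large values in more than one column), here is a minimal sketch reusing the df built in the snippet above; the column list and the one-standard-deviation threshold are assumptions you can adjust:
import numpy as np

# Keep only rows where every listed column lies within one standard
# deviation of that column's mean; tweak cols or the threshold as needed
cols = ['mean', 'mad']
mask = (np.abs(df[cols] - df[cols].mean()) <= df[cols].std()).all(axis=1)
df = df[mask]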

writing to columns in same row in csv file (python)

I'm trying to write values to a CSV file such that for every two iterations the results are in the same row, and then the next values print to a new row. Any help would be greatly appreciated. Thank you!
This is what I have so far:
import csv
import math
savePath = '/home/dehaoliu/opencv_test/Engineering_drawings_outputs/'
with open(str(savePath) + 'outputsTest.csv', 'w') as f1:
    writer = csv.writer(f1, delimiter='\t', lineterminator='\n')
    temp = []
    for k in range(0, 2):
        temp = []
        for i in range(0, 4):
            a = 2 + i
            b = 3 + i
            list = [a, b]
            temp.append(list)
        writer.writerow(temp)
The result I am getting now is
[2 3][3 4][4 5][5 6]
[2 3][3 4][4 5][5 6]
But I would like to get this (without the brackets) where each number in a row is in a separate column:
2 3 3 4
4 5 5 6
Try the following:
import csv
import math
savePath = '/home/dehaoliu/opencv_test/Engineering_drawings_outputs/'
with open(str(savePath) + 'outputsTest.csv', 'w') as f1:
    writer = csv.writer(f1, delimiter='\t', lineterminator='\n')
    temp = [2, 3]
    for i in range(2):
        temp = [x + i for x in temp]
        additional = [y + 1 for y in temp]
        writer.writerow(temp + additional)
        temp = additional[:]
This should return:
# 2 3 3 4
# 4 5 5 6
You start with a temporary list containing the numbers 2 and 3. Then you loop from 0 to 2 (exclusive). At every iteration, you increment the values of the temporary list by the current index and then create an additional list with these new values. Once that's done, you join the two lists together and write the result out to your file. Finally, you set your temporary list equal to the values of the additional list before moving on to the next iteration.
I hope this helps.
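Alternatively, if you want to keep the question's original nested-pair structure, you could flatten each row of pairs just before writing it; a minimal sketch using itertools.chain (the output path is left generic here):
import csv
from itertools import chain

with open('outputsTest.csv', 'w') as f1:
    writer = csv.writer(f1, delimiter='\t', lineterminator='\n')
    for k in range(2):
        # Build the row as [a, b] pairs, as in the question's loop ...
        temp = [[2 + i + 2 * k, 3 + i + 2 * k] for i in range(2)]
        # ... then flatten the pairs into one flat row before writing
        writer.writerow(list(chain.from_iterable(temp)))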
The way you present it, you can do this with a simple seed and increment:
import csv
import os
save_path = "/home/dehaoliu/opencv_test/Engineering_drawings_outputs/"
with open(os.path.join(save_path, "outputsTest.csv"), "w") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    temp = [2, 3, 3, 4]  # init seed
    increment = len(temp) // 2  # how many pairs we have; used to increase our seed each row
    for _ in range(2):  # how many rows you need; any positive integer will do
        writer.writerow(temp)  # write the current values
        temp = [x + increment for x in temp]  # add 'increment' to the elements
Resulting in:
2 3 3 4
4 5 5 6
But if your seed is temp = [2, 3, 3, 4, 4, 5] and you decide to generate 4 rows, it will still adapt:
2 3 3 4 4 5
5 6 6 7 7 8
8 9 9 10 10 11
11 12 12 13 13 14

Random.randint on lists in Python

I want to create a list and fill it with 15 zeros, then change the 0 to 1 in 5 random spots of the list, so it has 10 zeros and 5 ones. Here is what I tried:
import random, time

dasos = []
for i in range(1, 16):
    dasos.append(0)
for k in range(1, 6):
    dasos[random.randint(0, 15)] = 1
Sometimes I would get anywhere from 0 to 5 ones, but I want exactly 5 ones. If I add:
print(dasos)
...to see my list I get:
IndexError: list assignment index out of range
Your code can place fewer than 5 ones because random.randint can return the same index more than once, and random.randint(0, 15) is inclusive of 15, which is out of range for a 15-element list (valid indices are 0-14), hence the IndexError. I think the best solution is to use random.sample, which draws 5 distinct valid indices:
my_lst = [0 for _ in range(15)]
for i in random.sample(range(15), 5):
    my_lst[i] = 1
You could also consider using random.shuffle and taking the first 5 entries:
my_lst = [0 for _ in range(15)]
candidates = list(range(15))
random.shuffle(candidates)
for i in candidates[0:5]:
    my_lst[i] = 1
TL;DR: Read the Python random documentation; this can be done in multiple ways.
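As one more variant, you could build the list with the ones already in it and shuffle the list itself; a minimal sketch:
import random

dasos = [1] * 5 + [0] * 10  # exactly 5 ones and 10 zeros
random.shuffle(dasos)       # randomize their positions in place
print(dasos)                # sum(dasos) is always 5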

Efficiently walking through pandas dataframe index

import pandas as pd
from numpy.random import randn
oldn = pd.DataFrame(randn(10, 4), columns=['A', 'B', 'C', 'D'])
I want to make a new DataFrame that is 0..9 rows long and has one column "avg", whose value for row N = average(oldn[N]['A'], oldn[N]['B'], ..., oldn[N]['D']).
I'm not very familiar with pandas, so all my ideas for how to do this are gross for-loops and things. What is the efficient way to create and populate the new table?
Call mean on your df and pass axis=1 to calculate the mean row-wise; you can then pass this as the data to the DataFrame constructor:
In [128]:
new_df = pd.DataFrame(data=oldn.mean(axis=1), columns=['avg'])
new_df
Out[128]:
        avg
0  0.541550
1  0.525518
2 -0.492634
3  0.163784
4  0.012363
5  0.514676
6 -0.468888
7  0.334473
8  0.669139
9  0.736748
If you want the average for specific columns only, use the following; otherwise you can use the answer provided by @EdChum.
oldn['Avg'] = oldn.apply(lambda v: ((v['A'] + v['B'] + v['C'] + v['D']) / 4.), axis=1)
or
oldn['Avg'] = oldn.apply(lambda v: ((v[['A', 'B', 'C', 'D']]).sum() / 4.), axis=1)
print oldn
          A         B         C         D       Avg
0 -0.201468 -0.832845  0.100299  0.044853 -0.222290
1  1.510688 -0.955329  0.239836  0.767431  0.390657
2  0.780910  0.335267  0.423232 -0.678401  0.215252
3  0.780518  2.876386 -0.797032 -0.523407  0.584116
4  0.438313 -1.952162  0.909568 -0.465147 -0.267357
5  0.145152 -0.836300  0.352706 -0.794815 -0.283314
6 -0.375432 -1.354249  0.920052 -1.002142 -0.452943
7  0.663149 -0.064227  0.321164  0.779981  0.425017
8 -1.279022 -2.206743  0.534943  0.794929 -0.538973
9 -0.339976  0.636516 -0.530445 -0.832413 -0.266579
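For what it's worth, the apply-based versions above can be replaced with a vectorized column subset plus mean, which avoids a Python-level loop per row; a minimal sketch assuming the same oldn frame:
import pandas as pd
from numpy.random import randn

oldn = pd.DataFrame(randn(10, 4), columns=['A', 'B', 'C', 'D'])

# Select just the columns of interest, then take the row-wise mean
oldn['Avg'] = oldn[['A', 'B', 'C', 'D']].mean(axis=1)
print(oldn)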