I have a table with a number of columns. I created a measure that counts how many days its been since the last entry was recorded.
Location
Days since Last entry
Book
10
Hat
4
Dress
9
Shoe
2
Bag
1
I want to create a column that shows the days since the last entry by group. (Red = 9+ days , Amber = 5+&9- days , Green = less than 4 days.
So far I tried
NewColumn=
IF (
[DaysSinceLastEntry] >= 9, "Red",
IF([DaysSinceLastEntry] < 9 && [DaysSinceLastEntry] >5 = "Amber",)
&
IF(
[DaysSinceLastEntry] <= 5, "Green"
))
The above gives something like:
Location
Days since Last entry
Group
Book
10
Red
Book
5
Amber
Book
2
Green
Hat
9
Red
Hat
5
Amber
Hat
2
Green
I want:
Location
Days since Last entry
Group
Book
10
Red
Hat
6
Amber
Dress
9
Red
Shoe
2
Green
Bag
1
Green
I cant figure out how to display the red/amber/green based on the number of days since the last entry. Doesn't have to be an if statement. Any help would be much appreciated, thank you.
Don't know if this is what you are looking for:
import pandas as pd
import plotly.graph_objs as go
# make dataframe
data = {
'Location': [
'Book',
'Hat',
'Dress',
'Shoe',
'Bag',
],
'DaysSinceLastEntry': [
10,
4,
9,
2,
1,
],
}
df = pd.DataFrame(data)
# assign color
def color_filter(x):
if x <= 5:
return '#00FF00' # green
elif 5 < x <= 9:
return '#FFBF00' # amber
else:
return '#FF0000' # red
df['Color'] = df.DaysSinceLastEntry.map(lambda x: color_filter(x))
# plot
fig = go.Figure(
go.Bar(x=df['Location'],
y=df['DaysSinceLastEntry'],
marker={'color': df['Color']})
)
fig.show()
Related
I have a csv with one of the columns that contains periods:
timespan (string): PnYnMnD, where P is a literal value that starts the expression, nY is the number of years followed by a literal Y, nM is the number of months followed by a literal M, nD is the number of days followed by a literal D, where any of these numbers and corresponding designators may be absent if they are equal to 0, and a minus sign may appear before the P to specify a negative duration.
I want to return a data frame that contains all the data in the csv with parsed timespan column.
So far I have a code that parses periods:
import re
timespan_regex = re.compile(r'P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?')
def parse_timespan(timespan):
# check if the input is a valid timespan
if not timespan or 'P' not in timespan:
return None
# check if timespan is negative and skip initial 'P' literal
curr_idx = 0
is_negative = timespan.startswith('-')
if is_negative:
curr_idx = 1
# extract years, months and days with the regex
match = timespan_regex.match(timespan[curr_idx:])
years = int(match.group(1) or 0)
months = int(match.group(2) or 0)
days = int(match.group(3) or 0)
timespan_days = years * 365 + months * 30 + days
return timespan_days if not is_negative else -timespan_days
print(parse_timespan(''))
print(parse_timespan('P2Y11M20D'))
print(parse_timespan('-P2Y11M20D'))
print(parse_timespan('P2Y'))
print(parse_timespan('P0Y'))
print(parse_timespan('P2Y4M'))
print(parse_timespan('P16D'))
Output:
None
1080
-1080
730
0
850
16
How do I apply this code to the whole csv column while running the function processing csv?
def do_process_citation_data(f_path):
global my_ocan
my_ocan = pd.read_csv(f_path, names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
parse_dates=['creation', 'timespan'])
my_ocan = my_ocan.iloc[1:] # to remove the first row
my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
my_ocan['timespan'] = parse_timespan(my_ocan['timespan']) #I tried like this, but sure it is not working :)
return my_ocan
Thank you and have a lovely day :)
Like with Python's builtin map, Pandas also has that method. You can check its documentation here. Since you already have your function ready which takes a single parameter and returns a value, you just need this:
my_ocan['timespan'] = my_ocan['timespan'].map(parse_timespan) #This will take each value in the column "timespan", pass it to your function 'parse_timespan', and update the specific row with the returned value
And here is a generic demo:
import pandas as pd
def demo_func(x):
#Takes an int or string, prefixes with 'A' and returns a string.
return "A" + str(x)
df = pd.DataFrame({"Column_1": [1, 2, 3, 4], "Column_2": [10, 9, 8, 7]})
print(df)
df['Column_1'] = df['Column_1'].map(demo_func)
print("After mapping:\n{}".format(df))
Output:
Column_1 Column_2
0 1 10
1 2 9
2 3 8
3 4 7
After mapping:
Column_1 Column_2
0 A1 10
1 A2 9
2 A3 8
3 A4 7
I've a dataframe which looks like this:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
As it's clearly evident in the above table that some of the values in the column mad and median are very big(outliers). So i want to remove the rows which have these very big values.
For example in row3 the value of mad is 30.408377 which very big so i want to drop this row. I know that i can use one line
to remove these values from the columns but it doesn't removes the complete row
df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
But i want to remove the complete row.
How can i do that?
Predicates like what you've given will remove entire rows. But none of your data is outside of 3 standard deviations. If you tone it down to just one standard deviation, rows are removed with your example data.
Here's an example using your data:
import pandas as pd
import numpy as np
columns = ["wave", "mean", "median", "mad"]
data = [
[4050.32, -0.016182, -0.011940, 0.008885],
[4208.98, 0.023707, 0.007189, 0.032585],
[4508.28, 3.662293, 0.001414, 7.193139],
[4531.62, -15.459313, -0.001523, 30.408377],
[4551.65, 0.009028, 0.007581, 0.005247],
[4554.46, 0.001861, 0.010692, 0.027969],
[6828.60, -10.604568, -0.000590, 21.084799],
[6839.84, -0.003466, -0.001870, 0.010169],
[6842.04, -32.751551, -0.002514, 65.118329],
[6842.69, 18.293519, -0.002158, 36.385884],
[6843.66, 0.006386, -0.002468, 0.034995],
[6855.72, 0.020803, 0.000886, 0.040529],
]
df = pd.DataFrame(np.array(data), columns=columns)
print("ORIGINAL: ")
print(df)
print()
res = df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())]
print("REMOVED: ")
print(res)
this outputs:
ORIGINAL:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
REMOVED:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
Observe that rows indexed 8 and 9 are now gone.
Be sure you're reassigning the output of df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())] as shown above. The operation is not done in place.
Doing df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())] will not change the dataframe.
But assign it back to df, so that:
df = df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
How do I:
Identify which item from the dataframe df falls within each list (list1 or list2)
Create a new column ('new_item')
Determine which variable should be appended to the 'item' value and add it to the new column
Two lists of unique items:
list1 = ['one','two','shoes']
list2 = ['door','four','tires']
If item is in list1, append the following variable value to the end of the item and append it to the 'new_item' column:
twentysix_above = '_26+' (value is equal or greater than 26)
six_to_twentyfive = '_25' (value is between 6 and 25)
one_to_five = '_5' (value is between 1 and 5)
If item is in list2, append the following variable value to the end of each item and append it to the 'new_item' column:
twentyone_above = '_21+' (value is equal or greater than 21)
one_to_twenty = '_20' (value is between 1 and 20)
If the item isn't in either list, carry over the item name to the 'new_item' column.
Dataframe column will have one, some, or none of the 'items' from each list in it and an associated number from the 'number' column. I've gotten partially there, but I'm not sure how to compare to the other list and put that all into the 'new_item' column? Any help is appreciated, thanks!
>> print df
item number
0 one 4
1 door 55
2 sun 2
3 tires 62
4 tires 7
5 water 94
>> list1 = ['one','two','shoes']
>> list2 = ['door','four','tires']
>> df['match'] = df.item.isin(list1)
>> bucket = []
>> for row in df.itertuples():
if row.match == True and row.item > 25:
bucket.append(row.item + '_26+')
elif row.match == True and row.item >5:
bucket.append(row.item + '_25')
elif row.match == True and row.item >0:
bucket.append(row.item +'_5')
else:
bucket.append(row.item)
df['new_item'] = bucket
>> print df
item number match new_item
0 one 4 True one_5
1 door 55 True door
2 sun 2 False sun
3 tires 62 True tires
4 tires 7 True tires
5 water 94 False water
Desired Result: (comparing both lists and potentially not needing the boolean check column)
item number new_item
0 one 4 one_20
1 door 55 door__21+
2 sun 2 sun
3 tires 62 tires_21
4 tires 7 tires_20
5 water 94 water
It looks like your desired result is a bit off. The first row is in list one and has a value of 4, so it should be 'one_5' right?
Anyway, this can be accomplished with boolean masking. DataFrames have a useful isin() function making it easy to find if the value is in your lists. Then you have two more conditions, if you need a value between two numbers, or just one more condition if the range is unbounded.
import pandas as pd
import numpy as np
df = pd.DataFrame({'item': ['one', 'door', 'sun', 'tires', 'tires', 'water'],
'number': [4, 55, 2, 62, 7, 94]})
list1 = ['one','two','shoes']
list2 = ['door','four','tires']
df['new_item'] = df['item']
logic1 = np.logical_and(df.item.isin(list1), df.number > 25)
logic2 = np.logical_and.reduce([df.item.isin(list1), df.number > 5, df.number <= 25])
logic3 = np.logical_and.reduce([df.item.isin(list1), df.number > 1, df.number <= 5])
logic4 = np.logical_and(df.item.isin(list2), df.number >= 21)
logic5 = np.logical_and.reduce([df.item.isin(list2), df.number > 1, df.number < 20])
df.loc[logic1,'new_item'] = df.loc[logic1,'item']+'_26+'
df.loc[logic2,'new_item'] = df.loc[logic2,'item']+'_25'
df.loc[logic3,'new_item'] = df.loc[logic3,'item']+'_5'
df.loc[logic4,'new_item'] = df.loc[logic4,'item']+'_21+'
df.loc[logic5,'new_item'] = df.loc[logic5,'item']+'_20'
And we have this as the output
I have a data frame that is composed of several datasets (about 146 and counting). two of my columns are labeled "start_time" and "stop_time," which represent the start and stop of a response (i.e., the total duration of the response).
I need to get the "inter-response time" or the start_time subtracted from the next corresponding value in start_time. Basically if:
start_time = [1,4,7]
stop_time = [2,5,8]
I need:
stop_time[0] - start_time[1]
stop_time[2] - start_time[3]
in order to get:
iri = [2,2]
My code looks like this:
iri_t = []
def grps():
for grp in lset2_name_grps.groups:
beg_eng_t = pd.DataFrame([lset2_name_grps.stop_time, lset2_name_grps.start_time], columns=['end_t','beg_t'])
end_t = [i for i in lset2_name_grps.stop_time]
beg_t = [i for i in lset2_name_grps.start_time]
beg_t = np.insert(beg_t, len(beg_t),0)
end_t = np.insert(end_t, 0,0)
iri_t.append(np.subtract(end_t, beg_t))
# for i,j in zip(end_t, beg_t):
# iri_t.append(np.subtract(i,j))
# lset2_name_grps['iri'] = iri_t
grps()
Essentially, it doesn't do anything close to what I'm trying to accomplish and the only out I get is either "Not Implemented" or an error.
How about something like this:
import pandas as pd
starts = pd.Series([1, 4, 7])
stops = pd.Series([2, 5, 8])
iri_t = [0]
for i in range(1, len(starts)):
iri_t.append(starts[i] - ends[i-1])
times_df = pd.concat([starts, stops, pd.Series(iri_t)], axis=1)
This creates the following data_frame:
0 1 2
0 1 2 0
1 4 5 2
2 7 8 2
I think what your asking (correct me if I'm wrong) is best accomplished by putting the two columns in a single dataframe, using shift to offset one of your columns, then doing an ordinary subtraction.
df = pd.DataFrame({'start_time':[1,4,7], 'stop_time':[2,5,8]})
df.stop_time - df.start_time.shift()
Out[5]:
0 NaN
1 4
2 4
dtype: float64
How would you create a column(s) in the below pandas DataFrame where the new columns are the expanding mean/median of 'val' for each 'Mod_ID_x'. Imagine this as if were time series data and 'ID' 1-2 was on Day 1 and 'ID' 3-4 was on Day 2.
I have tried every way I could think of but just can't seem to get it right.
left4 = pd.DataFrame({'ID': [1,2,3,4],'val': [10000, 25000, 20000, 40000],
'Mod_ID': [15, 35, 15, 42],'car': ['ford','honda', 'ford', 'lexus']})
right4 = pd.DataFrame({'ID': [3,1,2,4],'color': ['red', 'green', 'blue', 'grey'], 'wheel': ['4wheel','4wheel', '2wheel', '2wheel'],
'Mod_ID': [15, 15, 35, 42]})
df1 = pd.merge(left4, right4, on='ID').drop('Mod_ID_y', axis=1)
Hard to test properly on your DataFrame, but you can use something like this:
>>> df1["exp_mean"] = df1[["Mod_ID_x","val"]].groupby("Mod_ID_x").transform(pd.expanding_mean)
>>> df1
ID Mod_ID_x car val color wheel exp_mean
0 1 15 ford 10000 green 4wheel 10000
1 2 35 honda 25000 blue 2wheel 25000
2 3 15 ford 20000 red 4wheel 15000
3 4 42 lexus 40000 grey 2wheel 40000