Adding elements in a row corresponding to a column - list

I'm extracting data from a CSV with 5 rows and 5 columns.
For example
print("Year Age Scholarship Academic Stipend")
print("1982 20 $20000.00 $1000.00")
print("1983 21 $25000.00 NA")
print("1984 22 $30000.00 $500.00")
print("1982 20 $16000.00 $200.00")
print("1983 21 $17500.00 $600.00")
I extracted individual lists with all these elements:
Year = [1982,1983,1984,1982,1983]
Age = [20,21,22,20,21]
Scholarship = [20000, 25000, 30000, 16000, 17500]
Stipend_Amount = [1000, NA, 500, 200, 600]
I want to group all my years together. How do I add the corresponding elements in column 4, corresponding only to the elements in Year?
For example. I want to be able to print
#Year Total_Scholarship_Granted
#1982 36000.00
But my for loop below is just adding all the elements together:
Start_Fund = 0
for i in range(len(year)):
Start_Fund += Scholarship[i]
print(year[i],Start_Fund)
#1982 108500
I want my results to be:
1982 36000
(which is acquired by adding all amounts from 1982)

You are missing an if statement, to check if the year in the loop is the year you want (in your case 1982). So your pseydocode should look like this:
Start_Fund = 0
my_year=1982
for i in range(len(year)):
if (year[i]==my_year)
Start_Fund += Scholarship[i]
print(my_year,Start_Fund)

Related

Calculating results pro rata over several months with PowerQuery

I am currently stuck on below issue:
I have two tables that I have to work with, one contains financial information for vessels and the other contains arrival and departure time for vessels. I get my data combining multiple excel sheets from different folders:
financialTable
voyageTimeTable
I have to calculate the result for above voyage, and apportion the result over June, July and August for both estimated and updated.
Time in June : 4 hours (20/06/2020 20:00 - 23:59) + 10 days (21/06/2020 00:00 - 30/06/2020 23:59) = 10.1666
Time in July : 31 full days
Time in August: 1 day + 14 hours (02/08/2020 00:00 - 14:00) = 1.5833
Total voyage duration = 10.1666 + 31 + 1.5833 = 42.7499
The result for the "updated" financialItem would be the following:
Result June : 100*(10.1666/42.7499) = 23.7816
Result July : 100*(31/42.7499) = 72.5148
Result August : 100*(1.5833/42.7499) = 3.7036
sum = 100
and then for "estimated" it would be twice of everything above.
This is the format I ideally would like to get:
prorataResultTable
I have to do this for multiple vessels, with multiple timespans and several voyage numbers.
Eagerly awaiting responses, if any. Many thanks in advance.
Brds,
Not sure if you're still looking for an answer, but code below gives me your expected output:
let
financialTable = Table.FromRows({{"A", 1, "profit/loss", 200, 100}}, type table [vesselName = text, vesselNumber = Int64.Type, financialItem = text, estimated = number, updated = number]),
voyageTimeTable = Table.FromRows({{"A", 1, #datetime(2020, 6, 20, 20, 0, 0), #datetime(2020, 8, 2, 14, 0, 0)}}, type table [vesselName = text, vesselNumber = Int64.Type, voyageStartDatetime = datetime, voyageEndDatetime = datetime]),
joined =
let
joined = Table.NestedJoin(financialTable, {"vesselName", "vesselNumber"}, voyageTimeTable, {"vesselName", "vesselNumber"}, "$toExpand", JoinKind.LeftOuter),
expanded = Table.ExpandTableColumn(joined, "$toExpand", {"voyageStartDatetime", "voyageEndDatetime"})
in expanded,
toExpand = Table.AddColumn(joined, "$toExpand", (currentRow as record) =>
let
voyageInclusiveStart = DateTime.From(currentRow[voyageStartDatetime]),
voyageExclusiveEnd = DateTime.From(currentRow[voyageEndDatetime]),
voyageDurationInDays = Duration.TotalDays(voyageExclusiveEnd - voyageInclusiveStart),
createRecordForPeriod = (someInclusiveStart as datetime) => [
inclusiveStart = someInclusiveStart,
exclusiveEnd = List.Min({
DateTime.From(Date.EndOfMonth(DateTime.Date(someInclusiveStart)) + #duration(1, 0, 0, 0)),
voyageExclusiveEnd
}),
durationInDays = Duration.TotalDays(exclusiveEnd - inclusiveStart),
prorataDuration = durationInDays / voyageDurationInDays,
estimated = prorataDuration * currentRow[estimated],
updated = prorataDuration * currentRow[updated],
month = Date.MonthName(DateTime.Date(inclusiveStart)),
year = Date.Year(inclusiveStart)
],
monthlyRecords = List.Generate(
() => createRecordForPeriod(voyageInclusiveStart),
each [inclusiveStart] < voyageExclusiveEnd,
each createRecordForPeriod([exclusiveEnd])
),
toTable = Table.FromRecords(monthlyRecords)
in toTable
),
expanded =
let
dropped = Table.RemoveColumns(toExpand, {"estimated", "updated", "voyageStartDatetime", "voyageEndDatetime"}),
expanded = Table.ExpandTableColumn(dropped, "$toExpand", {"month", "year", "estimated", "updated"})
in expanded
in
expanded
The code tries to:
join financialTable and voyageTimeTable, so that for each vesselName and vesselNumber combination, we know: estimated, updated, voyageStartDatetime and voyageEndDatetime.
generate a list of months for the period between voyageStartDatetime and voyageEndDatetime (which get expanded into new table rows)
for each month (in the list), do all the arithmetic you mention in your question
get rid of some columns (like the old estimated and updated columns)
I recommend testing it with different vesselNames and vesselNumbers from your dataset, just to see if the output is always correct (I think it should be).
You should be able to manually inspect the cells in the $toExpand column (of the toExpand step/expression) to see the nested rows before they get expanded.

Parsing periods in a column dataframe

I have a csv with one of the columns that contains periods:
timespan (string): PnYnMnD, where P is a literal value that starts the expression, nY is the number of years followed by a literal Y, nM is the number of months followed by a literal M, nD is the number of days followed by a literal D, where any of these numbers and corresponding designators may be absent if they are equal to 0, and a minus sign may appear before the P to specify a negative duration.
I want to return a data frame that contains all the data in the csv with parsed timespan column.
So far I have a code that parses periods:
import re
timespan_regex = re.compile(r'P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?')
def parse_timespan(timespan):
# check if the input is a valid timespan
if not timespan or 'P' not in timespan:
return None
# check if timespan is negative and skip initial 'P' literal
curr_idx = 0
is_negative = timespan.startswith('-')
if is_negative:
curr_idx = 1
# extract years, months and days with the regex
match = timespan_regex.match(timespan[curr_idx:])
years = int(match.group(1) or 0)
months = int(match.group(2) or 0)
days = int(match.group(3) or 0)
timespan_days = years * 365 + months * 30 + days
return timespan_days if not is_negative else -timespan_days
print(parse_timespan(''))
print(parse_timespan('P2Y11M20D'))
print(parse_timespan('-P2Y11M20D'))
print(parse_timespan('P2Y'))
print(parse_timespan('P0Y'))
print(parse_timespan('P2Y4M'))
print(parse_timespan('P16D'))
Output:
None
1080
-1080
730
0
850
16
How do I apply this code to the whole csv column while running the function processing csv?
def do_process_citation_data(f_path):
global my_ocan
my_ocan = pd.read_csv(f_path, names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
parse_dates=['creation', 'timespan'])
my_ocan = my_ocan.iloc[1:] # to remove the first row
my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
my_ocan['timespan'] = parse_timespan(my_ocan['timespan']) #I tried like this, but sure it is not working :)
return my_ocan
Thank you and have a lovely day :)
Like with Python's builtin map, Pandas also has that method. You can check its documentation here. Since you already have your function ready which takes a single parameter and returns a value, you just need this:
my_ocan['timespan'] = my_ocan['timespan'].map(parse_timespan) #This will take each value in the column "timespan", pass it to your function 'parse_timespan', and update the specific row with the returned value
And here is a generic demo:
import pandas as pd
def demo_func(x):
#Takes an int or string, prefixes with 'A' and returns a string.
return "A" + str(x)
df = pd.DataFrame({"Column_1": [1, 2, 3, 4], "Column_2": [10, 9, 8, 7]})
print(df)
df['Column_1'] = df['Column_1'].map(demo_func)
print("After mapping:\n{}".format(df))
Output:
Column_1 Column_2
0 1 10
1 2 9
2 3 8
3 4 7
After mapping:
Column_1 Column_2
0 A1 10
1 A2 9
2 A3 8
3 A4 7

Drop rows based on one column values

I've a dataframe which looks like this:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
As it's clearly evident in the above table that some of the values in the column mad and median are very big(outliers). So i want to remove the rows which have these very big values.
For example in row3 the value of mad is 30.408377 which very big so i want to drop this row. I know that i can use one line
to remove these values from the columns but it doesn't removes the complete row
df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
But i want to remove the complete row.
How can i do that?
Predicates like what you've given will remove entire rows. But none of your data is outside of 3 standard deviations. If you tone it down to just one standard deviation, rows are removed with your example data.
Here's an example using your data:
import pandas as pd
import numpy as np
columns = ["wave", "mean", "median", "mad"]
data = [
[4050.32, -0.016182, -0.011940, 0.008885],
[4208.98, 0.023707, 0.007189, 0.032585],
[4508.28, 3.662293, 0.001414, 7.193139],
[4531.62, -15.459313, -0.001523, 30.408377],
[4551.65, 0.009028, 0.007581, 0.005247],
[4554.46, 0.001861, 0.010692, 0.027969],
[6828.60, -10.604568, -0.000590, 21.084799],
[6839.84, -0.003466, -0.001870, 0.010169],
[6842.04, -32.751551, -0.002514, 65.118329],
[6842.69, 18.293519, -0.002158, 36.385884],
[6843.66, 0.006386, -0.002468, 0.034995],
[6855.72, 0.020803, 0.000886, 0.040529],
]
df = pd.DataFrame(np.array(data), columns=columns)
print("ORIGINAL: ")
print(df)
print()
res = df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())]
print("REMOVED: ")
print(res)
this outputs:
ORIGINAL:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
REMOVED:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
Observe that rows indexed 8 and 9 are now gone.
Be sure you're reassigning the output of df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())] as shown above. The operation is not done in place.
Doing df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())] will not change the dataframe.
But assign it back to df, so that:
df = df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]

How do I compare two lists to a python tuple, identify items, and append value based on conditionals?

How do I:
Identify which item from the dataframe df falls within each list (list1 or list2)
Create a new column ('new_item')
Determine which variable should be appended to the 'item' value and add it to the new column
Two lists of unique items:
list1 = ['one','two','shoes']
list2 = ['door','four','tires']
If item is in list1, append the following variable value to the end of the item and append it to the 'new_item' column:
twentysix_above = '_26+' (value is equal or greater than 26)
six_to_twentyfive = '_25' (value is between 6 and 25)
one_to_five = '_5' (value is between 1 and 5)
If item is in list2, append the following variable value to the end of each item and append it to the 'new_item' column:
twentyone_above = '_21+' (value is equal or greater than 21)
one_to_twenty = '_20' (value is between 1 and 20)
If the item isn't in either list, carry over the item name to the 'new_item' column.
Dataframe column will have one, some, or none of the 'items' from each list in it and an associated number from the 'number' column. I've gotten partially there, but I'm not sure how to compare to the other list and put that all into the 'new_item' column? Any help is appreciated, thanks!
>> print df
item number
0 one 4
1 door 55
2 sun 2
3 tires 62
4 tires 7
5 water 94
>> list1 = ['one','two','shoes']
>> list2 = ['door','four','tires']
>> df['match'] = df.item.isin(list1)
>> bucket = []
>> for row in df.itertuples():
if row.match == True and row.item > 25:
bucket.append(row.item + '_26+')
elif row.match == True and row.item >5:
bucket.append(row.item + '_25')
elif row.match == True and row.item >0:
bucket.append(row.item +'_5')
else:
bucket.append(row.item)
df['new_item'] = bucket
>> print df
item number match new_item
0 one 4 True one_5
1 door 55 True door
2 sun 2 False sun
3 tires 62 True tires
4 tires 7 True tires
5 water 94 False water
Desired Result: (comparing both lists and potentially not needing the boolean check column)
item number new_item
0 one 4 one_20
1 door 55 door__21+
2 sun 2 sun
3 tires 62 tires_21
4 tires 7 tires_20
5 water 94 water
It looks like your desired result is a bit off. The first row is in list one and has a value of 4, so it should be 'one_5' right?
Anyway, this can be accomplished with boolean masking. DataFrames have a useful isin() function making it easy to find if the value is in your lists. Then you have two more conditions, if you need a value between two numbers, or just one more condition if the range is unbounded.
import pandas as pd
import numpy as np
df = pd.DataFrame({'item': ['one', 'door', 'sun', 'tires', 'tires', 'water'],
'number': [4, 55, 2, 62, 7, 94]})
list1 = ['one','two','shoes']
list2 = ['door','four','tires']
df['new_item'] = df['item']
logic1 = np.logical_and(df.item.isin(list1), df.number > 25)
logic2 = np.logical_and.reduce([df.item.isin(list1), df.number > 5, df.number <= 25])
logic3 = np.logical_and.reduce([df.item.isin(list1), df.number > 1, df.number <= 5])
logic4 = np.logical_and(df.item.isin(list2), df.number >= 21)
logic5 = np.logical_and.reduce([df.item.isin(list2), df.number > 1, df.number < 20])
df.loc[logic1,'new_item'] = df.loc[logic1,'item']+'_26+'
df.loc[logic2,'new_item'] = df.loc[logic2,'item']+'_25'
df.loc[logic3,'new_item'] = df.loc[logic3,'item']+'_5'
df.loc[logic4,'new_item'] = df.loc[logic4,'item']+'_21+'
df.loc[logic5,'new_item'] = df.loc[logic5,'item']+'_20'
And we have this as the output

Removing duplicates from the data

I already loaded 20 csv files with function:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
I combined all of those filves into one:
all_data = do.call(rbind.fill, list_of_data)
In the new table is a column called "Accession". After combining many of the names (Accession) are repeated. And I would like to remove all of the duplicates.
Another problem is that some of those "names" are ALMOST the same. The difference is that there is name and after become the dot and the number.
Let me show you how it looks:
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
<-- = Same sample, different names. Should be treated as one. So just ignore dot and a number after.
Tried this one:
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))
Error in `$<-.data.frame`(`*tmp*`, "CleanedAccession", value = character(0)) :
You can use this command to both subset and rename the values:
subset(transform(alldata, Ascension = sub("\\..*", "", Ascension)),
!duplicated(Ascension))
Ascension
1 AT3G26450
2 AT5G44520
3 AT4G24770
4 AT2G37220
5 AT3G02520
6 AT5G05270
7 AT1G32060
8 AT3G52380
9 AT2G43910
10 AT2G19760
What about
df <- data.frame( Accession = c("AT3G26450.1",
"AT5G44520.2",
"AT4G24770.1",
"AT2G37220.2",
"AT3G02520.1",
"AT5G05270.1",
"AT1G32060.1",
"AT3G52380.1",
"AT2G43910.2",
"AT2G19760.1",
"AT3G26450.2"))
df[!duplicated(unlist(lapply(strsplit(as.character(df$Accession),
".", fixed = T), "[", 1))), ]