I have two Pandas data frames representing an inventory of items. Both data frames have four columns:
df1
id, item, colour, year
1, car, red, 2015
2, truck,, 2016
3, house, blue,
4, car, blue,
5, truck, red, 2015
df2
id, item, colour, year
1, house, blue, 2015
2, truck,, 2015
3, car, blue,
4, house,,
5, car, red, 2015
I know that these inventories are likely to describe the same objects, so I would like to relate the records in one to the records in the other.
For instance,
df1[1] = df2[5] (3 identical variables)
df1[4] = df2[3] (2 identical variables)
df1[3] (house, blue,) is probably the same as df2[1] (house, blue, 2015).
I have 2 main issues: how to do this efficiently, and how to assign a reliability score to each link.
I've thought of creating a common field that would be a combination of the columns [item, colour, year] and merging on it. I would get the first two matches above, but they don't have the same reliability. I wonder if there is an easy way to 'score' this reliability (at the moment I'm thinking of doing two merges, depending on variable availability).
Then I would create another common field with only 2 variables (item, colour) and merge on this. That would give me the link between (house, blue,) and (house, blue, 2015). This would obviously be a weaker link.
Any idea how to do this without merging sequentially? My current plan is to merge on 3 attributes (when they are all present), then on 2 attributes (there are 3 permutations) for what is left and has at least 2 attributes, and then on 1 only. I would give each link a reliability score based on the number of attributes used in the merge.
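A rough sketch of this sequential-merge plan, assuming the column names from the example data and that missing attributes are stored as NaN (the three 2-attribute permutations are collapsed into a single ['item', 'colour'] pass for brevity):

import pandas as pd

passes = [['item', 'colour', 'year'],   # strongest link: 3 shared attributes
          ['item', 'colour'],           # weaker: 2 shared attributes
          ['item']]                     # weakest: 1 shared attribute

links = []
remaining1, remaining2 = df1.copy(), df2.copy()
for keys in passes:
    left = remaining1.dropna(subset=keys).reset_index()
    right = remaining2.dropna(subset=keys).reset_index()
    merged = left.merge(right, on=keys, suffixes=('_df1', '_df2'))
    merged['reliability'] = len(keys)   # score = number of attributes used
    links.append(merged)
    # remove already-linked rows before the next, weaker pass
    remaining1 = remaining1.drop(index=merged['index_df1'].unique(), errors='ignore')
    remaining2 = remaining2.drop(index=merged['index_df2'].unique(), errors='ignore')

linked = pd.concat(links, ignore_index=True)

The snippet below takes a different, vectorized route: instead of merging pass by pass, it scores every (df1, df2) pair at once by counting equal cells.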
import pandas as pd

# count, for every pair of rows (one from df1, one from df2), how many cells are equal
df = pd.DataFrame(
    (df1.values[:, None] == df2.values).sum(2),
    df1.index, df2.index)

# keep only pairs that share at least two attributes
matches = df.mask(df.lt(2)).stack()

def f(df):
    i, j = df.name
    # the second frame must be indexed with j (the df2 label), not i
    return pd.concat([df1.loc[i], df2.loc[j]], axis=1, keys=['df1', 'df2']).T

# show each linked pair of rows side by side
matches.groupby(level=[0, 1]).apply(f).stack().unstack([-2, -1])
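The stacked values in matches are already the reliability score the question asks for: the number of matching cells for each (df1 row, df2 row) pair. A small follow-up (the column names here are my own) to turn it into a flat table:

# one row per candidate link, with the match count as a reliability score
reliability = (matches.rename('n_shared')
                      .rename_axis(['df1_idx', 'df2_idx'])
                      .reset_index())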
So, I am using the Fragile Families Challenge data for my dataset, to see which individual- and family-level predictors predict adolescent academic performance (measured by GPA). Information about my dataset:
FFCWS is a longitudinal panel study in which baseline interviews were conducted in 1998-2000 with both the mothers and the fathers. Follow-up interviews were conducted when the children were aged 1, 3, 5, 9, and 15. Interviews with the parent, primary caregiver(s), teachers, and children were conducted either in-home or via telephone (FFCWS, 2021). In the 15th year, children/adolescents are asked to report their grades in four subjects: history, mathematics, English, and science. These grades are averaged for each student to measure their individual academic performance at age 15. A series of individual-level and family-level predictors that are known to impact academic performance, as mentioned earlier, are also captured at different time points in the life of the child.
I am very new to machine learning and need some guidance. To do this, I first create a dataset that contains all the theoretically relevant variables; it is 4,898 x 15. My final dataset looks like this (all variables are continuous except:
final <- ffc %>%
  select(Gender, PPVT, WJ10, Grit, `Self-control`, Attention, Externalization,
         Anxiety, Depression, PCG_Income, PCG_Education, Teen_Mom, PCG_Exp,
         School_connectedness, GPA)
Then, I split into test and train as follows:
final_split <- initial_split(final, prop = .7)
final_train <- training(final_split)
final_test  <- testing(final_split)
Next, I run the models:
train <- rpart(GPA ~ ., method = "anova", data = final_train,
               control = rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
test  <- rpart(GPA ~ ., method = "anova", data = final_test,
               control = rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
Next, I visualize cross validation results:
rpart.plot(train, type = 3, digits = 3, fallen.leaves = TRUE)
rpart.plot(test, type = 3, digits = 3, fallen.leaves = TRUE)
Next, I run predictions:
pred_train <- predict(train, final_train)
pred_test  <- predict(test, final_test)
Next, I calculate accuracy:
MAE <- function(actual, predicted) { mean(abs(actual - predicted)) }
MAE(final_train$GPA, pred_train)
MAE(final_test$GPA, pred_test)
Following are my questions:
Now, I am not sure whether I should use rpart, random forest, or XGBoost, so my first question is: how do I decide which algorithm to use? I settled on rpart, but I want to have sound reasoning for that choice.
Are these steps in the right order? What is the point of splitting my dataset into training and testing? I ultimately get two trees (one for the training set and the other for the test set). Which one should I be using? What do I make of these? A step-by-step procedure after understanding my dataset would be quite helpful. Thanks!
I have a spreadsheet where Column A is the year, and column B is the shirt color used. The shirt colors are repeated.
I want to find a way to generate a list of the colors that have not been used in the last 10 years. The problem I am running into is due to the fact that the colors repeat. I tried using
=unique(filter(B2:B, A2:A<today()-(365*10)))
but colors that were also used within the last 10 years are still included.
try:
=ARRAYFORMULA(TEXTJOIN(", ", 1, UNIQUE(IF(NOT(REGEXMATCH(B:B,
TEXTJOIN("|", 1, UNIQUE(FILTER(B:B, A:A>=YEAR(TODAY())-10))))), B:B, ))))
for dates in column A use:
=ARRAYFORMULA(TEXTJOIN(", ", 1, UNIQUE(IF(NOT(REGEXMATCH(B:B,
TEXTJOIN("|", 1, UNIQUE(FILTER(B:B, YEAR(A:A)>=YEAR(TODAY())-10))))), B:B, ))))
I have two lists, the first of which represents times of observation and the second of which represents the observed values at those times. I am trying to find the maximum observed value and the corresponding time given a rolling window of various lengths. For example's sake, here are the two lists.
# observed values
linspeed = [280.0, 275.0, 300.0, 475.2, 360.1, 400.9, 215.3, 323.8, 289.7]
# times that correspond to observed values
time_count = [4.0, 6.0, 8.0, 8.0, 10.0, 10.0, 10.0, 14.0, 16.0]
# actual dataset is of size ~ 11,000
The missing times (e.g. 3.0) correspond to an observed value of zero, whereas duplicate times correspond to multiple observations at the floored time. Since my window will be rolling over the time_count (e.g. max value in the first 2 hours, the next 2 hours, the 2 hours after that; max value in the first 4 hours, the next 4 hours, ...), I plan to use an array-reshaping routine. However, it's important to set up everything properly beforehand, which entails finding the maximum value given duplicate times. To solve this problem, I tried the code just below.
def list_duplicates(data_list):
    seen = set()
    seen_add = seen.add
    seen_twice = set(x for x in data_list if x in seen or seen_add(x))
    return list(seen_twice)
# check for duplicate values
dups = list_duplicates(time_count)
print(dups)
>> [8.0, 10.0]
# get index of duplicates
for dup in dups:
    print(time_count.index(dup))
>> 2
>> 4
When checking for the index of the duplicates, it appears that this code will only return the index of the first occurrence of the duplicate value. I also tried using OrderedDict from the collections module for reasons of code efficiency/speed, but dictionaries have a similar problem: given duplicate keys for non-duplicate observation values, the first instance of the duplicate key and its corresponding observation value is kept while all others are dropped from the dict. Per this SO post, my second attempt is just below.
for dup in dups:
    indexes = [i for i, x in enumerate(time_count) if x == dup]
print(indexes)
>> [4, 5, 6] # indices correspond to duplicate time 10s but not duplicate time 8s
I should be getting [2,3] for time in time_count = 8.0 and [4,5,6] for time in time_count = 10.0. From the duplicate time_counts, 475.2 is the max linspeed that corresponds to duplicate time_count 8.0 and 400.9 is the max linspeed that corresponds to duplicate time_count 10.0, meaning that the other linspeeds at leftover indices of duplicate time_counts would be removed.
I'm not sure what else I can try. How can I adapt this (or find a new approach) to find all of the indices that correspond to duplicate values in an efficient manner? Any advice would be appreciated. (PS - I made numpy a tag because I think there is a way to do this via numpy that I haven't figured out yet.)
Without going into the details of how to implement an efficient rolling-window-maximum filter: reducing the duplicate values can be seen as a grouping problem, for which the numpy_indexed package (disclaimer: I am its author) provides efficient and simple solutions:
import numpy_indexed as npi

# one maximum observation per unique (floored) time
unique_time, unique_speed = npi.group_by(time_count).max(linspeed)
For large input datasets (i.e., where it matters), this should be a lot faster than any non-vectorized solution. Memory consumption is linear and performance is O(n log n) in general; but since time_count appears to be sorted already, performance should be linear too.
OK, if you want to do this with numpy, it's best to turn both of your lists into arrays first:
import numpy as np

l = np.array(linspeed)
tc = np.array(time_count)
Now, finding unique times is just an np.unique call:
u, i, c = np.unique(tc, return_inverse = True, return_counts = True)
u
Out[]: array([ 4., 6., 8., 10., 14., 16.])
i
Out[]: array([0, 1, 2, 2, 3, 3, 3, 4, 5], dtype=int32)
c
Out[]: array([1, 1, 2, 3, 1, 1])
Now you can either build your maxima with a for loop
m = np.array([np.max(l[i == j]) if c[j] > 1 else l[i == j][0] for j in range(u.size)])
m
Out[]: array([ 280. , 275. , 475.2, 400.9, 323.8, 289.7])
Or try some 2d method. This could be faster, but it would need to be optimized. This is just the basic idea.
np.max(np.where(i[None, :] == np.arange(u.size)[:, None], linspeed, 0),axis = 1)
Out[]: array([ 280. , 275. , 475.2, 400.9, 323.8, 289.7])
Now your m and u vectors are the same length and include the output you want.
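Once the duplicates are reduced, the reshaping plan from the question can be carried out directly. A minimal sketch under assumptions that are mine rather than the question's: times are floored to whole hours, the grid starts at hour 0, and hours with no observation count as 0.

import numpy as np

linspeed = [280.0, 275.0, 300.0, 475.2, 360.1, 400.9, 215.3, 323.8, 289.7]
time_count = [4.0, 6.0, 8.0, 8.0, 10.0, 10.0, 10.0, 14.0, 16.0]

l = np.array(linspeed)
tc = np.array(time_count)

# one maximum per unique time (same result as either approach above)
u, inv = np.unique(tc, return_inverse=True)
m = np.array([l[inv == j].max() for j in range(u.size)])

# scatter the maxima onto a dense hourly grid; missing hours stay 0
grid = np.zeros(int(u.max()) + 1)
grid[u.astype(int)] = m

# non-overlapping windows of `window` hours: pad to a multiple, reshape, reduce
window = 2
padded = np.pad(grid, (0, (-grid.size) % window))
window_max = padded.reshape(-1, window).max(axis=1)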
I need to draw a bar graph for the values:
male=('2', '1', '2', '6', '6', '1') # list may increase
time=('Tue_Aug_13_04:37:40_2013', 'Mon_Jul__1_02:33:11_2013','Tue_Aug_13_04:37:40_2013', 'Thu_Jul__4_01:53:32_2013', 'Mon_Jul__1_10:05:55_2013','Mon_Jul__1_04:15:25_2013')# list may increase
female=(16, 11, 16, 12, 12, 11) # list may increase
Male in green colour, female in red colour, as in the image attached below:
The code which I tried:
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse, Polygon
fig = plt.figure()
ax1 = fig.add_subplot(131)
ax1.bar(male, color='red', edgecolor='black')
ax1.bar(bottom=range(female), color='blue', edgecolor='black')
ax1.set_xticks(time)
plt.show()
What modifications do I need to make in order to draw the bar graph as shown in the image attached for my values?
1.) I strongly suggest that you familiarize yourself with Python syntax:
What's the difference between lists enclosed by square brackets and parentheses?
What's the difference between '2' and 2?
2.) Make use of the matplotlib documentation to figure out the correct syntax for the plot commands you are using.
3.) In this particular case: To get you going, change your data to:
male=[2, 1, 2, 6, 6, 1] # list may increase
time=['Tue_Aug_13_04:37:40_2013', 'Mon_Jul__1_02:33:11_2013','Tue_Aug_13_04:37:40_2013', 'Thu_Jul__4_01:53:32_2013', 'Mon_Jul__1_10:05:55_2013','Mon_Jul__1_04:15:25_2013']# list may increase
female=[16, 11, 16, 12, 12, 11] # list may increase
Please examine carefully what has changed.
4.) The bar command you are trying to call does not have enough input arguments. With the changed data from above, try this:
ax1.bar(range(len(time)),male,width=0.5, color='red', edgecolor='black')
ax1.bar(range(len(time)),female,width=0.5,bottom=male,color='blue', edgecolor='black')
What has changed?
you need the following inputs: left, height, width=0.8
you had only one of those
because your dates are given as strings, you need a generic counter for the x-axis, hence the range(len(time)) to provide as many ticks as there are entries in time.
now, you specify the height according to the values in male and female - none of which should be strings!
define a width
in your case, you want the bars to be stacked - therefore, specify the first set of values as bottom for the second
5.) Because time is made up of strings, you cannot use it for the ticks. Instead, try:
ax1.set_xticklabels(time,rotation=90)
Here, you use the strings from time as tick-labels. The rotation=90 is a nice feature so that the long strings do not overlap.
6.) If the labels are cut off by the plot window, try this:
plt.tight_layout()
plt.show()
This should get you back on track.
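For reference, here is everything put together as one runnable sketch (it keeps the red/blue colours from the snippets above; the green/red scheme from the question is just a matter of changing the color arguments):

import matplotlib.pyplot as plt

male = [2, 1, 2, 6, 6, 1]
female = [16, 11, 16, 12, 12, 11]
time = ['Tue_Aug_13_04:37:40_2013', 'Mon_Jul__1_02:33:11_2013',
        'Tue_Aug_13_04:37:40_2013', 'Thu_Jul__4_01:53:32_2013',
        'Mon_Jul__1_10:05:55_2013', 'Mon_Jul__1_04:15:25_2013']

fig, ax1 = plt.subplots()
x = range(len(time))

# stacked bars: male at the bottom, female on top
ax1.bar(x, male, width=0.5, color='red', edgecolor='black', label='male')
ax1.bar(x, female, width=0.5, bottom=male, color='blue', edgecolor='black', label='female')

ax1.set_xticks(x)                       # one tick per bar
ax1.set_xticklabels(time, rotation=90)  # date strings as tick labels
ax1.legend()

plt.tight_layout()
plt.show()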
Good keywords for a web search include:
matplotlib stacked bar
matplotlib tick labels rotation
matplotlib ticks date
I'm encountering this problem and would like to seek your help.
The context:
I have a bag of balls, each of which has an age and a color attribute (red or blue).
What I want is to get the top 10 "youngest" balls, with at most 3 blue balls among them (this means that if there are more than 3 blue balls in the list of the 10 youngest balls, the "redundant" oldest blue balls are replaced with the youngest red balls).
To get top 10:
sel_balls = Ball.objects.all().order_by('age')[:10]
Now, to also satisfy the condition "at most 3 blue balls", I need to process further:
Iterate through sel_balls and count the number of blue balls (= B)
If B <= 3: do nothing
Else: get an additional B - 3 red balls to replace the oldest (B - 3) blue balls (and these red balls must not have appeared in the original 10 balls already taken out). I figure I can do this by getting the oldest age value among the list of red balls and doing another query like:
add_reds = Ball.objects.filter(age__gte=oldest_sel_age)[:B - 3]
My question is:
Is there any way that I can satisfy the constraints in only one query?
If I have to do 2 queries, is there any faster ways than the one method I mentioned above?
Thanks all.
Use Q for complex queries to the database: https://docs.djangoproject.com/en/dev/topics/db/queries/#complex-lookups-with-q-objects
You should use annotate to do it.
See documentation.
.filter() before .annotate() gives 'WHERE'
.filter() after .annotate() gives 'HAVING' (this is what you need)
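As a rough illustration of that WHERE vs. HAVING distinction (the color field name and the aggregation are assumptions for the example, not a full solution to the 3-blue-balls constraint):

from django.db.models import Count

# .values() + .annotate() groups by colour; filtering on the annotation
# afterwards ends up in the HAVING clause of the generated SQL
overrepresented = (Ball.objects
                   .values('color')           # GROUP BY color
                   .annotate(n=Count('id'))   # COUNT(id) per colour
                   .filter(n__gt=3))          # HAVING COUNT(id) > 3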