How to measure the similarity between three vectors?
Suppose I have three students and their marks in five subjects:
Student 1 (12,23,43,35,21)
Student 2 (23, 34, 45, 25, 17) and
Student 3 (34, 43, 22, 11, 39)
Now I want to measure the similarity between these three students. Can anyone help me with this? Thanks in advance.
You want similarity, not dissimilarity. The latter is available in numerous functions, some noted in the comments. The most commonly used metric for dissimilarity is Euclidean distance.
To measure similarity, you could use the simil(...) function in the proxy package in R, as shown below. Assuming that the scores are in the same order for each student, you would combine the scores into a matrix row-wise, then:
Student.1 <- c(12, 23, 43, 35, 21)
Student.2 <- c(23, 34, 45, 25, 17)
Student.3 <- c(34, 43, 22, 11, 39)
students <- rbind(Student.1, Student.2, Student.3)
library(proxy)
simil(students, method = "Euclidean")
# Student.1 Student.2
# Student.2 0.04993434
# Student.3 0.02075985 0.02593140
This calculates the Euclidean distance for every student vs. every other student, and converts that to a similarity score using
sim = 1 / (1+dist)
So if the scores for two students are identical, their similarity will be 1.
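You can check the first value by hand: the Euclidean distance between Student.1 and Student.2 is sqrt((12-23)^2 + (23-34)^2 + (43-45)^2 + (35-25)^2 + (21-17)^2) = sqrt(362), or about 19.03, and 1 / (1 + 19.03) gives the 0.0499 shown above. In R:
sqrt(sum((Student.1 - Student.2)^2))            # 19.026...
1 / (1 + sqrt(sum((Student.1 - Student.2)^2)))  # 0.04993434, matching the output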
But this is only one way to do it. There are 48 similarity/distance metrics coded in the proxy package, which can be listed using:
pr_DB$get_entries()
You can even code your own metric, using, e.g.,
simil(students, FUN = f)
where f(x, y) is a function that takes two vectors as arguments and returns a similarity score defined however you like. This might be relevant if, for example, some courses were "more important", in the sense that you wanted to weight differences in those courses more heavily than the others.
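As an illustration only (the weights here are made up, counting the third course three times as heavily as the others), a weighted version of the similarity above could look like this:
w <- c(1, 1, 3, 1, 1)                            # hypothetical weights, one per course
f <- function(x, y) 1 / (1 + sqrt(sum(w * (x - y)^2)))
f(Student.1, Student.2)                          # weighted similarity for one pair
simil(students, FUN = f)                         # weighted similarity for every pair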
Related
I need to create a Python 2 algorithm to increment and decrement shop prices in the most exact and economical way for the customer, accounting for prices at various quantities.
For example (replaced item names):
75 "stickers" costs $4
200 "stickers" costs $9
675 "stickers" costs $27
The user should be able to increment to the next cheapest combination that provides a greater quantity of items, skipping objectively bad combinations. So in this example the increments would go: 75, 150, 200, 225, 275, 300, 350, 400...
Here the quantity 375 (5 * $4 = $20) is skipped. 400 (2 * $9 = $18) is cheaper and provides more items.
Thank you to anyone who can provide some insights! I am using Python 2.7 and must rely on the Python standard library.
I am using the Fragile Families Challenge for my dataset, to see which individual-level and family-level predictors predict adolescent academic performance (measured by GPA). Some information about my dataset:
FFCWS is a longitudinal panel study in which baseline interviews were conducted in 1998-2000 with both the mothers and the fathers. Follow-up interviews were conducted when the children were aged 1, 3, 5, 9, and 15. Interviews with the parent, primary caregiver(s), teachers, and children were conducted either in-home or via telephone (FFCWS, 2021). In the 15th year, children/adolescents are asked to report their grades in four subjects: history, mathematics, English, and science. These grades are averaged for each student to measure their individual academic performance at age 15. A series of individual-level and family-level predictors that are known to impact academic performance, as mentioned earlier, are also captured at different time points in the life of the child.
I am very new to machine learning and need some guidance. To do this, I first create a dataset that contains all the theoretically relevant variables; it is 4,898 x 15. My final dataset looks like this (all are continuous except:
final <- ffc %>% select(Gender, PPVT, WJ10, Grit, `Self-control`, Attention, Externalization, Anxiety, Depression, PCG_Income, PCG_Education, Teen_Mom, PCG_Exp, School_connectedness, GPA)
Then, I split into test and train as follows:
final_split <- initial_split(final, prop = .7)
final_train <- training(final_split)
final_test <- testing(final_split)
Next, I run the models:
train <- rpart(GPA ~ ., method = "anova", data = final_train,
               control = rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
test <- rpart(GPA ~ ., method = "anova", data = final_test,
              control = rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
Next, I visualize cross validation results:
rpart.plot(train, type = 3, digits = 3, fallen.leaves = TRUE)
rpart.plot(test, type = 3, digits = 3, fallen.leaves = TRUE)
Next, I run predictions:
pred_train <- predict(train, final_train)
pred_test <- predict(test, final_test)
Next, I calculate accuracy:
MAE <- function(actual, predicted) { mean(abs(actual - predicted)) }
MAE(final_train$GPA, pred_train)
MAE(final_test$GPA, pred_test)
Following are my questions:
Now, I am not sure whether I should use rpart, random forest, or XGBoost, so my first question is: how do I decide which algorithm to use? I decided on rpart, but I want to have sound reasoning for that choice.
Are these steps in the right order? What is the point of splitting my dataset into training and testing sets? I ultimately get two trees (one for train and the other for test). Which one should I be using? What do I make of these? A step-by-step procedure after understanding my dataset would be quite helpful. Thanks!
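For reference, the usual pattern is to fit a single model on the training set and then evaluate that same model on the held-out test set, rather than growing a second tree on the test data. A minimal sketch using the objects defined above:
fit <- rpart(GPA ~ ., method = "anova", data = final_train,
             control = rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
pred_test <- predict(fit, newdata = final_test)   # predictions for data the model has never seen
MAE <- function(actual, predicted) { mean(abs(actual - predicted)) }
MAE(final_test$GPA, pred_test)                    # out-of-sample error estimate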
I've got a time series of temperature data with some wrong values that I want to filter out. The problem is that I only want to remove the points within a certain period of time.
If I filter out the wrong points by their temperature value, ALL of the points with that temperature value are removed (across the whole measuring period).
This is a very simplified version of my code (in reality, there are many more values):
laketemperature <- c(15, 14, 14, 12, 11, 9, 9, 8, 6, 4, 15, 14, 3)  # only want to sort out the last 14 and 15
out <- c(15, 14)
laketemperature_clean <- laketemperature[-out]  # the 15 and 14s at the beginning are sorted out, too :(
I want to end up with the whole laketemperature series, just without that second 15 (and the 14 after it).
I already tried ifelse(), but it didn't work out.
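One way to do this (a sketch that assumes the suspect period starts at position 11; with real data you would pick the index range from the measurement dates) is to restrict the search to that window:
window <- 11:length(laketemperature)             # only this part of the series is suspect
bad <- window[laketemperature[window] %in% out]  # positions of the wrong values inside the window
laketemperature_clean <- laketemperature[-bad]
laketemperature_clean
# [1] 15 14 14 12 11  9  9  8  6  4  3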
I'm trying to figure out how to have multiple drop-down values dynamically drive a calculation based on what's in the cells in Google Sheets. I'm not sure if I'm using the correct terms or how to describe what I'm looking for.
I'm working for a building company and I want to work out how to calculate which house will fit on a certain-sized block of land, based on the R Code that the suburb has and the frontage of the property.
Example data:
The size of a block is from 80m2 up to 5000m2
R-Codes are: 2,2.5,5,10,12.5,15,17.5,20,25,30,35,40,50,60,80
Frontage (m) can be: 7.5,8.5,10,12,12.5,14,15,15.65,17
R Codes determine the size of the house that can be built on the land provided.
Example:
R Code: Min house size:
2,2.5 20% of land size
5 30% of land size
10 40% of land size
12.5 45% of land size
15,17.5,20,25 50% of land size
30,35,40 55% of land size
50,60 60% of land size
80 70% of land size
So if a client has a 350m2 piece of land and the code for that area is R20, then the size of the house that can be built on that land is 175m2 (50% of 350m2).
I want a drop-down option for each field:
Block size | R Code | Frontage
and then calculate the house size that fits. With this information, we could reference a house model that fits on the block and show it to the client.
Example:
Name House size (m2) Frontage (m)
Davenport 176.8 8.5
I'm playing around with these formulas:
=if(B2<5,(A2/100)*20,"")
This tells me that if the R Code is under 5 (i.e. R Codes 2 and 2.5), multiply the block size by the percentage relating to that R Code.
I'm trying to figure out which formula can handle the calculation and produce the correct answer under all of the conditions. I don't have any coding experience.
=IFERROR(VLOOKUP(B2,
{{2, 20%};
{2.5, 20%};
{5, 30%};
{10, 40%};
{12.5, 45%};
{15, 50%};
{17.5, 50%};
{20, 50%};
{25, 50%};
{30, 55%};
{35, 55%};
{40, 55%};
{50, 60%};
{60, 60%};
{80, 70%}}, 2, 0)*A2, )
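Assuming the block size is entered in A2 and the R Code drop-down in B2 (the same cells as in your =if(...) attempt), this looks up the percentage for the selected R Code and multiplies it by the block size to give the house size in m2; copy it down the column next to your drop-downs.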
spreadsheet demo
I have a large pandas Series that contains unique numbers from 0 to 1,000,000. The series is not complete: it lacks some numbers in this range. I want to get a rough idea of which numbers are missing, so I'm thinking I should plot the data as a line, with gaps showing the missing data.
How would I accomplish that? This does not work:
nums = pd.Series(myNumbers)
nums.plot()
The following provides a list of the missing numbers in Series nums. You can then plot them as needed. For your purposes adjust the max to 1E6.
max = 10 # highest number to look for in the Series
import pandas as pd
nums = pd.Series([1, 2, 3, 4, 5, 6, 9])
missing = [n for n in xrange(int(max + 1)) if n not in nums.values]
print missing
# prints: [0, 7, 8, 10]
I think there are two concerns with the plotting code you tried. First, there are a million numbers. Second, the x-axis of the plot will be the indexes of the series (starting at 0 and increasing sequentially), while the y-axis will be the numbers you care about (nums.values in the code here). Therefore, you are looking for missing y-axis values.
I think it depends on what you mean by missing. If those are NaNs, then you can do something like:
import numpy
len(nums[nums.apply(numpy.isnan)])
If you are looking for numbers between 0 and 1M that are not present in the series, then do something like:
a = set(xrange(int(1e6)))
b = set(nums.values)
print len(a - b)  # or plot the missing numbers as a scatter