I'm trying to figure out how to have multiple drop-down values in Google Sheets calculated dynamically from what's in the cells. I'm not sure I'm using the correct terms to describe what I'm looking for.
I work for a building company, and I want to calculate which house will fit on a block of land of a given size, based on the suburb's R-code and the property's frontage.
Example data:
The size of a block ranges from 80m2 up to 5000m2.
R-Codes are: 2,2.5,5,10,12.5,15,17.5,20,25,30,35,40,50,60,80
Frontage (m) can be: 7.5,8.5,10,12,12.5,14,15,15.65,17
R Codes determine the size of the house that can be built on the land provided.
Example:
R Code              Min house size
2, 2.5              20% of land size
5                   30% of land size
10                  40% of land size
12.5                45% of land size
15, 17.5, 20, 25    50% of land size
30, 35, 40          55% of land size
50, 60              60% of land size
80                  70% of land size
So if a client has a 350m2 block of land and the code for that area is R20, then the house that can be built on that land is 350m2 x 50% = 175m2.
I want the drop-down to have the options for each field:
Block size | R Code | Frontage
and then calculate the resulting house size. With this information, we could reference a house model that fits on the block and show it to the client.
Example:
Name        House size (m2)   Frontage (m)
Davenport   176.8             8.5
I'm playing around with these formulas:
=if(B2<5,(A2/100)*20,"")
This tells me: if the R code is under 5 (i.e. R codes 2 and 2.5), multiply the block size by the percentage for that R code (20%).
I'm trying to figure out which formula can handle all of the conditions and produce the correct answer. I don't have any coding experience.
=IFERROR(VLOOKUP(B2,
{{2, 20%};
{2.5, 20%};
{5, 30%};
{10, 40%};
{12.5, 45%};
{15, 50%};
{17.5, 50%};
{20, 50%};
{25, 50%};
{30, 55%};
{35, 55%};
{40, 55%};
{50, 60%};
{60, 60%};
{80, 70%}}, 2, 0)*A2, "")
spreadsheet demo
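If it helps to see the same tiered lookup outside of Sheets, here is a minimal Python sketch (my own illustration; the table is copied from above and the function name is made up):

# Percentage of the land that can be built on, keyed by R code (table above)
RATIO_BY_RCODE = {
    2: 0.20, 2.5: 0.20, 5: 0.30, 10: 0.40, 12.5: 0.45,
    15: 0.50, 17.5: 0.50, 20: 0.50, 25: 0.50,
    30: 0.55, 35: 0.55, 40: 0.55, 50: 0.60, 60: 0.60, 80: 0.70,
}

def max_house_size(block_m2, r_code):
    # Returns the largest house (m2) allowed on the block, or None for unknown codes
    ratio = RATIO_BY_RCODE.get(r_code)
    return block_m2 * ratio if ratio is not None else None

print(max_house_size(350, 20))  # 175.0, matching the R20 / 350m2 example above

The VLOOKUP formula does exactly this: the literal array is the lookup table, the third argument 2 picks the percentage column, and the final 0 requests an exact match on the R code.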
So, I am utilizing the Fragile Families Challenge for my dataset, to see which individual- and family-level predictors predict adolescent academic performance (measured by GPA). Information about my dataset:
FFCWS is a longitudinal panel study in which baseline interviews were conducted in 1998-2000 with both the mothers and the fathers. Follow-up interviews were conducted when children were aged 1, 3, 5, 9, and 15. Interviews with the parent, primary caregiver(s), teachers, and children were conducted either in-home or via telephone (FFCWS, 2021). In the 15th year, children/adolescents are asked to report their grades in four subjects (history, mathematics, English, and science). These grades are averaged for each student to measure their individual academic performance at age 15. A series of individual-level and family-level predictors that are known to impact academic performance, as mentioned earlier, are also captured at different time points in the child's life.
I am very new to machine learning and need some guidance. To do this, I first create a dataset that contains all of the theoretically relevant variables. It is 4,898 x 15. My final dataset looks like this (all variables are continuous except a few, such as Gender):
# Backticks are needed around Self-control, since - is an operator in R
final <- ffc %>%
  select(Gender, PPVT, WJ10, Grit, `Self-control`, Attention, Externalization,
         Anxiety, Depression, PCG_Income, PCG_Education, Teen_Mom, PCG_Exp,
         School_connectedness, GPA)
Then, I split into test and train as follows:
final_split <- initial_split(final, prop = .7)
final_train <- training(final_split)
final_test  <- testing(final_split)
Next, I run the models:
train <- rpart(GPA ~ ., method = "anova", data = final_train,
               control = rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
test <- rpart(GPA ~ ., method = "anova", data = final_test,
              control = rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
Next, I visualize cross validation results:
rpart.plot(train, type = 3, digits = 3, fallen.leaves = TRUE)
rpart.plot(test, type = 3, digits = 3, fallen.leaves = TRUE)
Next, I run predictions:
pred_train <- predict(train, final_train)
pred_test  <- predict(test, final_test)
Next, I calculate accuracy:
MAE <- function(actual, predicted) { mean(abs(actual - predicted)) }
MAE(final_train$GPA, pred_train)  # train is the fitted model, so pull GPA from the data frame
MAE(final_test$GPA, pred_test)
Following are my questions:
Now, I am not sure if I should use rpart, random forest, or XGBoost, so my first question is: how do I decide which algorithm to use? I decided on rpart, but I want to have sound reasoning for that choice.
Are these steps in the right order? What is the point of splitting my dataset into training and testing sets? I ultimately get two trees (one for train and one for test). Which one should I be using? What do I make of these? A step-by-step procedure, after understanding my dataset, would be quite helpful. Thanks!
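For what it's worth, the usual pattern is to fit a single model on the training split and evaluate it on the held-out test split, rather than fitting one tree per split. A minimal sketch of that pattern (shown in Python with scikit-learn purely for illustration; the data here is random stand-in data, not FFCWS):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(4898, 14))  # 14 predictors, as in the 4,898 x 15 dataset
y = rng.normal(size=4898)        # GPA

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

# Fit ONE model on the training split only...
tree = DecisionTreeRegressor(max_depth=10, min_samples_leaf=5).fit(X_train, y_train)

# ...then measure error on the unseen test split
print(mean_absolute_error(y_test, tree.predict(X_test)))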
I am trying to plot a graph of throughput numbers.
My data is: x-axis = time in epoch seconds, y-axis = throughput in bytes.
I have these y-ticks:
print loc, labels
[ 0. 5000000. 10000000. 15000000. 20000000. 25000000. 30000000. 35000000.]
<a list of 8 Text yticklabel objects>
I want to show this data in KB or MB. How can I go about it? I am lost and stuck. Currently the y-axis runs from 0 to 3.5 with a 1e7 scale factor, which by itself does not read sensibly as throughput. So the y-ticks are 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, scaled by 1e7.
I appreciate any help!
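One way to do this (a minimal sketch using matplotlib's FuncFormatter; the sample data below is made up, and I'm treating 1 MB as 1e6 bytes):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

times = np.arange(1500000000, 1500000600, 60)         # epoch seconds (made up)
throughput = np.random.uniform(0, 3.5e7, times.size)  # throughput in bytes

fig, ax = plt.subplots()
ax.plot(times, throughput)

# Relabel the y-ticks as megabytes instead of raw bytes with the 1e7 offset
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, pos: '%.1f MB' % (y / 1e6)))
ax.set_xlabel('time (epoch)')
ax.set_ylabel('throughput')
plt.show()

The formatter only changes the tick labels; the underlying data stays in bytes, so nothing else about the plot needs to change.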
I have a large pandas Series, which contains unique numbers from 0 to 1,000,000. The series is not complete, but lacks some numbers in this range. I want to get a rough idea of what numbers are missing, so I'm thinking I should plot the data as a line with gaps showing the missing data.
How would I accomplish that? This does not work:
nums = pd.Series(myNumbers)
nums.plot()
The following provides a list of the missing numbers in the Series nums. You can then plot them as needed. For your purposes, adjust max_num to 1e6.
import pandas as pd

max_num = 10  # highest number to look for in the Series
nums = pd.Series([1, 2, 3, 4, 5, 6, 9])

present = set(nums.values)  # set lookup is O(1), which matters at 1e6 scale
missing = [n for n in range(max_num + 1) if n not in present]
print(missing)
# prints: [0, 7, 8, 10]
I think there are two concerns with the plotting function you wrote. First, there are one million numbers. Second, the x-axis for the plot will be indexes in the series (start at 0, going sequentially); the y-axis will be numbers that you care about (nums.values in the code here). Therefore, you are looking for missing y-axis values.
I think it depends on what you mean by missing. If those are NaNs, then you can count them with something like:
import numpy
len(nums[nums.apply(numpy.isnan)])
If you are looking for numbers between 0 and 1M that are not present in the series, then do something like:
a = set(range(int(1e6)))
b = set(nums.values)
print(len(a - b))  # or plot the difference as a scatter
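Since the original goal was a plot that shows the gaps, here is a small sketch (my own illustration) that marks each missing number along the x-axis:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

nums = pd.Series([1, 2, 3, 4, 5, 6, 9])
full = np.arange(nums.max() + 1)           # the expected range, 0..max
missing = np.setdiff1d(full, nums.values)  # values absent from the Series

# One tick per missing number; clusters of ticks show where the gaps are
plt.scatter(missing, np.zeros_like(missing), marker='|', s=200)
plt.yticks([])
plt.xlabel('value')
plt.title('Missing values')
plt.show()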
How to measure the similarity between three vectors?
Suppose I have three students and their subject marks:
Student 1: (12, 23, 43, 35, 21)
Student 2: (23, 34, 45, 25, 17)
Student 3: (34, 43, 22, 11, 39)
Now I want to measure the similarity between these three students. Can anyone help me with this? Thanks in advance.
You want similarity, not dissimilarity. The latter is available in numerous functions, some noted in the comments. The most commonly used metric for dissimilarity is Euclidean distance.
To measure similarity, you could use the simil(...) function in the proxy package in R, as shown below. Assuming that the scores are in the same order for each student, you would combine the scores into a matrix row-wise, then:
Student.1 <- c(12, 23, 43, 35, 21)
Student.2 <- c(23, 34, 45, 25, 17)
Student.3 <- c(34, 43, 22, 11, 39)
students <- rbind(Student.1,Student.2,Student.3)
library(proxy)
simil(students,method="Euclidean")
# Student.1 Student.2
# Student.2 0.04993434
# Student.3 0.02075985 0.02593140
This calculates the Euclidean distance for every student vs. every other student, and converts that to a similarity score using
sim = 1 / (1+dist)
So if the scores for two students are identical, their similarity will be 1.
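As a quick sanity check (my own sketch, in Python), the Student 1 vs Student 2 entry can be reproduced by hand:

import numpy as np

s1 = np.array([12, 23, 43, 35, 21])
s2 = np.array([23, 34, 45, 25, 17])

dist = np.linalg.norm(s1 - s2)  # Euclidean distance, about 19.03
sim = 1 / (1 + dist)            # the same conversion simil() applies
print(sim)                      # ~0.04993, matching the output above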
But this is only one way to do it. There are 48 similarity/distance metrics coded in the proxy package, which can be listed using:
pr_DB$get_entries()
You can even code your own metric, using, e.g.,
simil(students,FUN=f)
where f(x,y) is a function that takes two vectors as arguments and returns a similarity score defined as you like. This might be relevant if, for example, some courses were "more important" in the sense that you wanted to weight differences with respect to those courses more heavily than the others.
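For instance, a weighted variant (sketched in Python for illustration; the weights are hypothetical) could look like:

import numpy as np

def weighted_sim(x, y, w):
    # Weighted Euclidean distance, converted with the same 1/(1+dist) rule
    d = np.sqrt(np.sum(w * (np.asarray(x) - np.asarray(y)) ** 2))
    return 1 / (1 + d)

w = np.array([2.0, 1.0, 1.0, 1.0, 0.5])  # made-up per-course importance weights
print(weighted_sim([12, 23, 43, 35, 21], [23, 34, 45, 25, 17], w))

The same idea written as an R function f(x, y) could then be passed to simil() as described above.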
I can't manage to create an archive with the correct type.
What am I missing?
My example is very similar to the official example on https://code.google.com/p/rrd4j/wiki/Tutorial
RRD creation:
rrdDef.setStartTime(L - 300);
rrdDef.addDatasource("speed", DsType.GAUGE, 600, Double.NaN, Double.NaN);
rrdDef.addArchive(ConsolFun.MAX, 0.5, 1, 24);
rrdDef.addArchive(ConsolFun.MAX, 0.5, 6, 10);
I add some values (1, 2, 3 in each step):
long x = L;
while (x <= L + 4200) {
Sample sample = rrdDb.createSample();
sample.setAndUpdate((x + 11) + ":1");
sample.setAndUpdate((x + 12) + ":2");
sample.setAndUpdate((x + 14) + ":3");
x += 300;
}
And then I fetch it:
FetchRequest fetchRequest = rrdDb.createFetchRequest(ConsolFun.MAX, (L - 600), L + 4500);
FetchData fetchData = fetchRequest.fetchData();
String s = fetchData.dump();
I get this result (I was hoping to find the maximum):
920804100: NaN
920804400: NaN
920804700: +1.0000000000E00
920805000: +1.0166666667E00
920805300: +1.0166666667E00
...
920808600: +1.0166666667E00
920808900: +1.0166666667E00
920809200: NaN
I would like to see the maximum value here. I tried it with TOTAL as well, and I get the same result.
What do I have to change so that I get the greatest value sent in one step, or the sum of the values sent in one step?
Thanks
MAX is not the maximum input value but the maximum consolidated data point. What you're saying to rrd, given your example, is:
At one point in time I'm going 1MPH
One second later I'm going 2MPH
Two seconds later I'm going 3MPH
rrd now has 3 data points covering 3 seconds of a 300-second interval. What should rrd store? 1, 2, or 3? None of the above: it has to normalize the data in some way to say that between X and X+STEP the rate is Y.
To complicate matters, it's not certain that your 3 data points land in the same 300-second interval. Your first 2 data points could be in one interval and the 3MPH could be in the next one. This is because the first stored data point is not exactly at start+step; i.e. if you start at 14090812456, it might be something like 14090812700, even though your step is 300.
The only way to store exact input values with GAUGE is to push updates at the exact step times at which rrd stores the data points: I'm going 1MPH at x, 2MPH at x+300, 3MPH at x+600, where x starts at the first data point.
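To see where the question's repeated +1.0166666667E00 comes from, here is a back-of-the-envelope sketch (in Python, assuming updates at x+11, x+12, x+14 and the next step's first update at x+311) of the time-weighted average rrd computes for one 300-second step:

# Each update's value applies from the previous update up to its own timestamp
segments = [
    (11, 1.0),   # x     .. x+11  -> 1 (update at x+11)
    (1, 2.0),    # x+11  .. x+12  -> 2 (update at x+12)
    (2, 3.0),    # x+12  .. x+14  -> 3 (update at x+14)
    (286, 1.0),  # x+14  .. x+300 -> 1 (next update, at x+311)
]
print(sum(w * v for w, v in segments) / 300.0)  # 1.01666..., as in the output above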
Here is a bash example showing this working with your rrd settings. I'm using a constant start time, and x starts at what I know is rrd's first data point.
L=1409080000
rrdtool create max.rrd --start=$L DS:speed:GAUGE:600:U:U RRA:MAX:0.5:1:24 RRA:MAX:0.5:6:10
x=$(($L+200))
while [ $x -lt $(($L+3000)) ]; do
rrdtool update max.rrd "$(($x)):1"
rrdtool update max.rrd "$(($x+300)):2"
rrdtool update max.rrd "$(($x+600)):3"
x=$(($x+900))
done
rrdtool fetch max.rrd MAX -r 600 -s 1409080000
speed
1409080200: 1.0000000000e+00
1409080500: 2.0000000000e+00
1409080800: 3.0000000000e+00
1409081100: 1.0000000000e+00
1409081400: 2.0000000000e+00
1409081700: 3.0000000000e+00
1409082000: 1.0000000000e+00
Not really that useful, but if you increase the resolution to, say, 1200 seconds, you start getting the max over larger time intervals:
rrdtool fetch max.rrd MAX -r 1200 -s 1409080000
speed
1409081400: 3.0000000000e+00
1409083200: 3.0000000000e+00
1409085000: nan
1409086800: nan
1409088600: nan