Given a multiple linear regression (hypothetical example: lm(income ~ age*gender + experience)), can I estimate a percentile score for a new observation (not used in model fitting)?
For example, based on the residual between the model's income prediction for the new observation and that observation's true income, can I meaningfully compute something like: 'Based on our model and given the individual's gender, age, and experience level, we estimate they are currently earning more than 55% of individuals with the same characteristics.'
I assume that to do this, I would need to compute a distribution of predictions. Then I could use these to compute a quantile score.
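To sketch what that could look like (a minimal illustration, not a validated method): if the model's errors are roughly normal, the percentile follows from the new observation's residual and the residual standard error. The data frames train and new_obs below are hypothetical stand-ins.

    # Fit the model on the training data (hypothetical data frame 'train')
    fit <- lm(income ~ age * gender + experience, data = train)

    sigma_hat <- summary(fit)$sigma            # residual standard error
    pred <- predict(fit, newdata = new_obs)    # predicted income for the new person

    # Percentile among individuals with the same covariates, assuming normal
    # errors and ignoring uncertainty in the estimated coefficients
    percentile <- pnorm((new_obs$income - pred) / sigma_hat)

If the normality assumption is too strong, quantile regression (e.g., the quantreg package in R) estimates conditional quantiles directly and could serve as an alternative.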
My questions are:
Is this approach logically/statistically sound? If not, what are the alternatives?
Are there any existing implementations/packages to do this?
I am trying to calculate the standard deviation from a hospital Average Daily Census (ADC) report. The report is broken down by floor and by unit. The raw data is midnight census events for each patient...hundreds every day. I also have a filter on the report for different clinical services, so the standard deviation needs to be calculated "on the fly" as I change the filter.
The first picture below shows the results unfiltered. The second shows the results with some services selected.
I have found one way to calculate a standard deviation, but it has to operate on a specific field. Since my ADC is itself a calculated value, this does not work.
I also saw that you can create a table (in DAX?), but I have not been able to get that to work, and I am not sure it can be dynamic and recalculate after filtering.
Is what I am trying to do even possible in Power BI?
Thanks
It sounds like you want the standard deviation of ADC over time at a daily granularity.
If this is correct, the basic approach is to calculate the measure for each day and then take the standard deviation on that set. In DAX, this will look something like this:
StdDevADC =
STDEVX.S (
    // Build a table with one row per date and the ADC measure evaluated for it
    SUMMARIZECOLUMNS ( DateTable[Date], "ADCThisDate", [ADC] ),
    // Then take the sample standard deviation over those per-date values
    [ADCThisDate]
)
Even if this isn't exactly what you need, it should give you an idea of how to approach this: calculate [ADC] for each element of the dimension you want the standard deviation over, and then use the iterator version of the standard deviation function over the table you just built.
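One caveat worth hedging: SUMMARIZECOLUMNS is not supported in every measure context (it can raise an error when a context transition is involved). If that happens, the same idea can be written with ADDCOLUMNS over VALUES; DateTable and [ADC] are the names from the example above.

    StdDevADC_Alt =
    STDEVX.S (
        // Evaluate the [ADC] measure once per visible date
        ADDCOLUMNS ( VALUES ( DateTable[Date] ), "ADCThisDate", [ADC] ),
        [ADCThisDate]
    )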
I have a table with the PIMS code of crudes, the first day of the month, the cost (Rs/MT), and the quantity (thousand MT) to be processed.
I need to calculate the weighted average of cost (Rs/MT) by PIMS code, for that month only.
As you can see in the table, there are duplicate entries of a PIMS code with different quantities and prices but the same date. That difference needs to be considered when averaging, which is why I want a weighted average.
You can create a weighted average measure using the Quick measures functionality; a basic Google search will turn up examples as well.
Here is the documentation for creating a quick measure.
The following should work (assuming your table is called 'Data'):
Weighted Avg :=
// Total quantity in the current filter context (e.g. one PIMS code and month)
VAR TotalUnits = SUM ( Data[Quantity (TMT)] )
// Cost weighted by quantity, accumulated row by row
VAR TotalCost = SUMX ( Data, Data[cost(Rs/MT)] * Data[Quantity (TMT)] )
RETURN
    TotalCost / TotalUnits
Screenshot below shows examples with some dummy data.
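As a small extra safeguard (my addition, not part of the original answer): DAX's DIVIDE function returns BLANK instead of an error when the total quantity is zero, so the RETURN line can be written as

    RETURN
        DIVIDE ( TotalCost, TotalUnits )

Because measures respect filter context, placing the PIMS code and month on the rows of a matrix gives the weighted average per PIMS code per month automatically.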
I am implementing a logit model on a database of households, using as the dependent variable the classification of a household as poor or not poor (1 if poor, 0 if not):
proc logistic data=regression;
model poor(event="1") = variable1 variable2 variable3 variable4;
run;
Using PROC LOGISTIC in SAS, I obtained the table "Association of Predicted Probabilities and Observed Responses", which tells me the concordant percentage. However, I need a detailed breakdown of how many households are correctly classified as poor.
I would appreciate your help with this issue.
Add the CTABLE option to your MODEL statement.
model poor(event="1") = variable1 variable2 variable3 variable4 / ctable;
CTABLE classifies the input binary response observations according to whether the predicted event probabilities are above or below some cutpoint value z in the range (0, 1). An observation is predicted as an event if the predicted event probability exceeds or equals z. You can supply a list of cutpoints other than the default list by specifying the PPROB= option. Also, you can compute positive and negative predictive values as posterior probabilities by using Bayes' theorem. You can use the PEVENT= option to specify prior probabilities for computing these statistics. The CTABLE option is ignored if the data have more than two response levels. This option is not available with the STRATA statement.
For more information, see the section Classification Table.
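If the default cutpoint list is not what you want, PPROB= takes an explicit list. A hedged example with the variables from the question (the particular cutpoints here are arbitrary):

    model poor(event="1") = variable1 variable2 variable3 variable4
          / ctable pprob=(0.1 to 0.9 by 0.1);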
I am trying to develop a spatiotemporal logistic regression model to predict the presence/absence of a disease in U.S. counties (contiguous U.S.) based on climatologic variables, with data points for each year between 2007 and 2014; ideally, I would like a model with functionality to score additional datasets, e.g., use the model developed for 2006-2014 to predict disease probability in future climate scenarios. The model needs to account for spatial autocorrelation, and (again, ideally) repeated measures (each county has one data point per year). Unfortunately, my SAS abilities are not up to the task. Would anyone have suggestions for developing the model? The data, in csv format, take the form of:
countyFIPS year outcome predictor1 predictor2 predictor3 latitude longitude
where
countyFIPS = unique 5-digit identifier for U.S. counties
outcome = at least one case in the county for the given year, coded 0/1
latitude and longitude denote the centroid of the county
I'm really bad at this, so please be gentle and use small words...
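A hedged sketch of one possible starting point, assuming PROC GLIMMIX and the column names above (the dataset name disease and every option choice are assumptions, not a validated specification):

    proc glimmix data=disease;
       class countyFIPS;
       model outcome(event='1') = predictor1 predictor2 predictor3
             / dist=binary link=logit solution;
       /* repeated measures: a random intercept per county */
       random intercept / subject=countyFIPS;
       /* spatial autocorrelation: exponential covariance on county centroids */
       random _residual_ / subject=intercept type=sp(exp)(latitude longitude);
       /* write predicted probabilities out for every row */
       output out=scored pred(ilink)=p_hat;
    run;

To score future climate scenarios, one common trick is to append the future rows with a missing outcome: they do not contribute to the likelihood but still receive predictions in the OUTPUT dataset.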
I am pretty new to MDX, but I know what I want to accomplish; it's just proving very hard. Basically, I have a dataset where each row is a sale for a customer. I also have postcode data and the UK population of each ward.
The total population in each ward is then divided by the count of that ward code within the dataset. E.g., ward A has a population of 1,000 and I have ten customers who live in ward A, so the population value is 1,000/10 = 100.
So as long as no other dimensions are selected, only the region hierarchy, I can drill up and down and the population penetration (count of customers / calculated population value) is correct. However, as soon as I introduce more dimensions, the total population no longer sums to its true value.
I therefore need to do the calculation above within the cube, and I am trying to find the MDX function(s) to do this.
Essentially something like this:
Step 1) Count the rows for each ward code (the lowest level of the Geography hierarchy) grouped by distinct ward code, e.g. wardcodeA = 5, wardcodeB = 10, etc.
Step 2) Take the population of each ward (which could be stored as the total at ward level, taking the average) and divide it by the result of the previous step.
Step 3) Sum the results from each ward at the currently selected geographic level.
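A sketch of those three steps as a calculated member; the hierarchy and measure names ([Geography].[Geography], its [Ward] level, [Measures].[Customer Count], [Measures].[Population]) are assumptions, not names from the question's cube:

    CREATE MEMBER CURRENTCUBE.[Measures].[Adjusted Population] AS
        // For every ward under the current member, divide its population
        // by its customer count, then sum the results (steps 1-3)
        SUM(
            DESCENDANTS(
                [Geography].[Geography].CurrentMember,
                [Geography].[Geography].[Ward]
            ),
            IIF(
                [Measures].[Customer Count] = 0,
                NULL,
                [Measures].[Population] / [Measures].[Customer Count]
            )
        );

Penetration would then be [Measures].[Customer Count] / [Measures].[Adjusted Population] at whatever level is selected.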
The fact that other dimensions change the customers / population value means that something in your modeling is wrong.
You should have a fact table (it can be a view/concept) like this:
REGION_ID, CUSTOMER_COUNT, POPULATION_COUNT
Once you have this, create the fact table and a specific measure each for counting customers and population, with a single linked dimension. This is the main point: do not link your measures to dimensions that are not needed.
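To make the view concrete, a hedged SQL illustration; every table and column name here (WARD_POPULATION, SALES, WARD_CODE, CUSTOMER_ID, and so on) is invented for the example:

    CREATE VIEW FACT_PENETRATION AS
    SELECT
        pop.REGION_ID,
        cust.CUSTOMER_COUNT,
        pop.POPULATION_COUNT
    FROM
        -- each ward's population counted exactly once per region
        ( SELECT REGION_ID, SUM(POPULATION) AS POPULATION_COUNT
            FROM WARD_POPULATION
           GROUP BY REGION_ID ) pop
    LEFT JOIN
        -- distinct customers per region, via the ward they live in
        ( SELECT wp.REGION_ID, COUNT(DISTINCT s.CUSTOMER_ID) AS CUSTOMER_COUNT
            FROM SALES s
            JOIN WARD_POPULATION wp ON wp.WARD_CODE = s.WARD_CODE
           GROUP BY wp.REGION_ID ) cust
        ON cust.REGION_ID = pop.REGION_ID;

Keeping this fact at the region grain means the population total stays stable no matter which other dimensions are sliced.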