Trajectory Analysis (SAS): Incorrect number of start values - sas

I am attempting a trajectory analysis in SAS (proc traj).
Following instructions found online, I first test quadratic models with two, three, four, and five groups (i.e., order 2 2; order 2 2 2; order 2 2 2 2; order 2 2 2 2 2).
I determined that a three-group linear model is the best fit (order 1 1 1;)
I then wish to add time-stable covariates with the risk statement. Following examples found online, I did this after adding a start statement with the parameter values provided in the log.
At this point, I receive the notice: "Incorrect number of start values. There should be 10 start values based on the model specifications."
I understand that it is possible to delete some of the 12 parameter estimates provided, but how do I select which ones to remove?
Thank you.
Code:
proc traj data=followupyes outplot=op outstat=os out=of outest=oe itdetail;
id youthid;
title3 'linear 3-gp model ';
var pronoun_allpar1-pronoun_allpar3;
indep time1-time3;
model logit;
ngroups 3;
order 1 1 1;
weight wgt_00;
start 0.031547 0.499724 1.969017 0.859566 -1.236747 0.007471
0.771878 0.495458 0.000000 0.000000 0.000000 0.000000;
risk P00_45_1;
run;
%trajplot (OP, OS, "linear 3-gp model ", "Traj of Pronoun Support", "Pron Support", "Time");

Because you are estimating a model with three linear trajectories, you need 2 start values (intercept and slope) for each of your 3 groups. The remaining start values expected by the risk model are for the group-membership parameters, and their number changes once a risk covariate is added, which is presumably why the 12 values copied from your covariate-free log no longer match.
See here for more info: https://www.andrew.cmu.edu/user/bjones/example.htm


combining multiple items to create one dummy variable

I have 7 items/variables in Stata that address the same survey question. These 7 items are each different weight control behaviors (diet, exercise, pills, etc.). I am trying to combine these variables to create a single weight control behavior dummy variable that is coded as yes (did engage in weight control) and no (did not engage in weight control).
The response options for each variable look something like this for a given weight control behavior
dieted:
  freq   code   label
 11438      0   not marked
  2771      1   marked
    16      6   refused
  6508      7   legitimate skip
    13      8   don’t know
Here is my code. I re-coded 6,7,8 for all 7 vars as missing:
tab1 h1gh30a-h1gh30g, m
foreach X of varlist h1gh30a-h1gh30g {
replace `X'=. if `X' > 1
}
egen wgt_control= rowmax(h1gh30a-h1gh30g)
ta wgt_control
gen wgt_control_new=wgt_control
replace wgt_control_new = 1 if wgt_control>0 & wgt_control!=.
replace wgt_control_new= 0 if wgt_control <1
ta wgt_control_new
I used rowmax() to combine all 7 items but my issue is that the response option 0 or No doesn't appear when I tabulate it. I only get those who responded yes=1.
Here is a suggestion with a reproducible example for what I think is the cleanest approach. I also included some unsolicited advice about survey data best practices.
* Example generated by -dataex-. For more info, type help dataex
clear
input double(h1gh30a h1gh30b h1gh30c)
1 1 1
1 0 1
6 1 8
0 0 0
7 6 8
end
* Explicit coding is better, so if possible (which it is with 7 vars),
* create a local with the vars explicitly listed
local wgt_controls h1gh30a h1gh30b h1gh30c
* recode is a better command to use here. Also, do not destroy information:
* there is a survey data quality-assurance difference between a respondent
* refusing to answer, not knowing, and the question being legitimately skipped.
* You can replace these survey codes with extended missing values, which behave
* like ordinary missing values but retain the distinctions between the codes
recode `wgt_controls' (6=.a) (7=.b) (8=.c)
* While rowmax() could be used, anymatch() seems to fit what you are
* trying to do better
egen wgt_control = anymatch(`wgt_controls'), values(1)
There is no minimal reproducible example here, so we can't reproduce the problem independently.
From your code, it seems that h1gh30a-h1gh30g are recoded so that all are 0, 1 or missing, so their maximum takes one of the same values.
gen wgt_control_new = wgt_control
replace wgt_control_new = 1 if wgt_control>0 & wgt_control!=.
replace wgt_control_new= 0 if wgt_control <1
seems to boil down to cloning the variable:
gen wgt_control_new = wgt_control
In short, I can't see a reason in your code why you should never see 0 as a possible result.
EDIT
A minimal check on whether there are zeros that aren't showing up as they should might be
egen max = rowmax(h1gh30a-h1gh30g)
list h1gh30a-h1gh30g if max == 0

Divide the testing set into subgroup, then make prediction on each subgroup separately

I have a dataset similar to the following table:
The prediction target is going to be the 'score' column. I'm wondering how I can divide the testing set into subgroups, such as scores between 1 and 3, and then check the accuracy on each subgroup.
Now what I have is as follows:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)
for i in (0,1,2,3,4):
    y_new=y_test[(y_test>=i) & (y_test<=i+1)]
    y_new_pred=model.predict(X_test)
    print metrics.r2_score(y_new, y_new_pred)
However, my code did not work and this is the traceback that I get:
Found input variables with inconsistent numbers of samples: [14279, 55955]
I have tried the solution provided below, but it looks like the r^2 for the full score range (0-5) is 0.67, while the r^2 values for the sub-ranges (0-1, 1-2, 2-3, 3-4, 4-5) are significantly lower than that of the full range. Shouldn't some of the sub-range r^2 values be higher than 0.67 and some of them lower?
Could anyone kindly let me know where I went wrong? Thanks a lot for all your help.
When you are computing the metrics, you have to filter the predicted values as well (based on your subset condition).
Basically you are trying to compute
metrics.r2_score([1,3],[1,2,3,4,5])
which creates an error,
ValueError: Found input variables with inconsistent numbers of
samples: [2, 5]
Hence, my suggested solution would be
model.fit(X_train, y_train)
# compute the prediction only once
y_pred = model.predict(X_test)
for i in (0,1,2,3,4):
    # compute the condition for the subset here
    subset = (y_test>=i) & (y_test<=i+1)
    print metrics.r2_score(y_test[subset], y_pred[subset])
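For completeness, here is a self-contained sketch of that fix, written for Python 3 and a current scikit-learn (X and y are assumed to be defined as in the question, and the 0-5 score range is taken from the question). Casting y_test to a NumPy array guarantees the same boolean mask can index both the true and the predicted values:

import numpy as np
from sklearn import tree
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split  # sklearn.cross_validation on older versions

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)

# Predict once for the whole test set, then slice per score range.
y_pred = model.predict(X_test)
y_true = np.asarray(y_test)

for i in range(5):  # sub-ranges 0-1, 1-2, 2-3, 3-4, 4-5
    subset = (y_true >= i) & (y_true <= i + 1)
    if subset.sum() > 1:  # r2_score needs at least two samples
        print(i, r2_score(y_true[subset], y_pred[subset]))

As for the follow-up question: it is not surprising that every sub-range r^2 comes out below the 0.67 of the full range, because r^2 compares the model against the mean of the targets being scored, and within a narrow sub-range that mean is already a good predictor, so there is far less variance left to explain.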

Filtering on annotations with max date in Django

I have 3 models in Django-project:
class Hardware(models.Model):
inventory_number = models.IntegerField(unique=True,)
class Subdivision(models.Model):
name = models.CharField(max_length=50,)
class Relocation(models.Model):
hardware = models.ForeignKey('Hardware',)
subdivision = models.ForeignKey('Subdivision',)
relocation_date = models.DateField(verbose_name='Relocation Date', default=date.today())
Table 'Hardware_Relocation' with values for example:
id hardware subdivision relocation_date
1 1 1 01.01.2009
2 1 2 01.01.2010
3 1 1 01.01.2011
4 1 3 01.01.2012
5 1 3 01.01.2013
6 1 3 01.01.2014
7 1 3 01.01.2015 # Now hardware 1 located in subdivision 3 because relocation_date is max
I would like to write a filter that finds which hardware is in which subdivision as of today.
I'm trying to write a filter:
subdivision = Subdivision.objects.get(pk=1)
hardware_list = Hardware.objects.annotate(relocation__relocation_date=Max('relocation__relocation_date')).filter(relocation__subdivision = subdivision)
Now hardware_list contains hardware 1, but it is wrong (because now hardware 1 in subdivision 3).
hardware_list must be None in this example.
The following code works wrong (hardware_list contains hardware 1, for subdivision 1).
limit_date = datetime.datetime.now()
q1 = Hardware.objects.filter(relocation__subdivision=subdivision, relocation__relocation_date__lte=limit_date)
q2 = q1.exclude(~Q(relocation__relocation_date__gt=F('relocation__relocation_date')), ~Q(relocation__subdivision=subdivision))
hardware_list = q2.distinct()
Maybe it is better to use raw SQL?
This might work...
from django.db.models import F, Q

(Hardware.objects
    .filter(relocation__subdivision=target_subdivision, relocation__relocation_date__lte=limit_date)
    .exclude(~Q(relocation__subdivision=target_subdivision), relocation__relocation_date__gt=F('relocation__relocation_date'))
    .distinct())
The idea is: give me all hardware that has been relocated to the target subdivision before the limit date and that has NOT been relocated to any other subdivision after that.
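If you prefer to work with the annotation from the question title, here is an untested sketch (assuming a reasonably recent Django; latest_move is a name I made up): annotate each Hardware with its latest relocation date, then require a single Relocation row that carries both that date and the target subdivision.

from django.db.models import F, Max

hardware_list = (
    Hardware.objects
    # Latest relocation date across all of this hardware's relocations.
    .annotate(latest_move=Max('relocation__relocation_date'))
    # Both lookups in one filter() call, so they must hold for the same
    # Relocation row: the one on the latest date, into the given subdivision.
    .filter(relocation__relocation_date=F('latest_move'),
            relocation__subdivision=subdivision)
)

Because the annotate() comes before the filter(), the Max() should be computed over all relocations rather than only the filtered ones, which is what you want here; still, check the generated SQL on your Django version, and note that two relocations sharing the same maximum DateField value could both match.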

Query on plotting Lorenz curves on Stata

I am trying to plot a Lorenz curve, using the following command:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
generate rank1=rank
label variable rank "Cum share of mortality"
label variable rank1 "Equality Line"
twoway (line rank1 rank, sort clwidth(medthin) clpat(longdash))(line yord rank , sort clwidth(medthin) clpat(red)), ///
ytitle(Cumulative share of drug activity, size(medsmall)) yscale(titlegap(2)) xtitle(Cumulative share of mortality (2012), size(medsmall)) ///
legend(rows(5)) xscale(titlegap(5)) legend(region(lwidth(none))) plotregion(margin(zero)) ysize(6.75) xsize(6) plotregion(lcolor(none))
However, in the resulting curves the line of equality does not start from 0. Is there a way to fix this?
Is it recommended to use the following in order to get the perfect 45-degree line of equality:
(function y=x, range(0 1))
Also, what is the minimum number of observations required to plot the above graph? Does it work well with just 2 observations?
The reason your Line of Perfect Equality does not pass through (0,0) is that the values of your rank variable do not contain 0.
The smallest value you will have for rank will be 1/_N. Although this value will asymptotically approach 0, it will never actually reach 0.
To see this, try:
quietly sum rank
di r(min)
di 1/_N
Further, by applying the program code to your data (beginning around line 152 in the ado file and removing unnecessary bits), one can easily see that yord cannot take on a value of 0 without values of 0 for drugs:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
sort death drugs , stable
gen double rank1 = _n / _N
qui sum drugs
gen yord1= (sum(drugs) / _N) / r(mean)
The best way to plot your equality line would be the method from your edit, namely:
twoway(function y = x, ra(0 1))
One quick yet (very) crude fix to force the Lorenz curve to start at the origin (if it doesn't already) is to add an observation to the data after obtaining rank and yord, and then delete it after you have your curve:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
expand 2 in 1
replace yord = 0 in 1
replace rank = 0 in 1
twoway (function y = x, ra(0 1)) ///
(line yord rank)
drop in 1
Like I said, this is admittedly crude and even somewhat ill-advised, but I can't see a much better alternative at the moment, and with this method you will not alter any of the other values of yord, since glcurve is run before the extra observation is added.

Computation of Kullback-Leibler (KL) distance between text-documents using numpy

My goal is to compute the KL distance between the following text documents:
1)The boy is having a lad relationship
2)The boy is having a boy relationship
3)It is a lovely day in NY
I first of all vectorised the documents in order to easily apply numpy
1)[1,1,1,1,1,1,1]
2)[1,2,1,1,1,2,1]
3)[1,1,1,1,1,1,1]
I then applied the following code for computing KL distance between the texts:
import numpy as np
import math
from math import log
v=[[1,1,1,1,1,1,1],[1,2,1,1,1,2,1],[1,1,1,1,1,1,1]]
c=v[0]
def kl(p, q):
    p = np.asarray(p, dtype=np.float)
    q = np.asarray(q, dtype=np.float)
    return np.sum(np.where(p != 0,(p-q) * np.log10(p / q), 0))

for x in v:
    KL=kl(x,c)
    print KL
Here is the result of the above code: [0.0, 0.602059991328, 0.0].
Texts 1 and 3 are completely different, but the distance between them is 0, while texts 1 and 2, which are highly related, have a distance of 0.602059991328. This isn't accurate.
Does anyone have an idea of what I'm doing wrong with regard to KL? Many thanks for your suggestions.
Though I hate to add another answer, there are two points here. First, as Jaime pointed out in the comments, KL divergence (or distance; according to the documentation below they are the same) is designed to measure the difference between probability distributions. This means, basically, that what you pass to the function should be two array-likes, the elements of each of which sum to 1.
Second, scipy apparently does implement this, with a naming scheme more related to the field of information theory. The function is "entropy":
scipy.stats.entropy(pk, qk=None, base=None)
http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html
From the docs:
If qk is not None, then compute a relative entropy (also known as Kullback-Leibler divergence or Kullback-Leibler distance) S = sum(pk * log(pk / qk), axis=0).
The bonus of this function as well is that it will normalize the vectors you pass it if they do not sum to 1 (though this means you have to be careful with the arrays you pass - ie, how they are constructed from data).
Hope this helps, and at least a library provides it, so you don't have to code your own.
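For the vectors from the question, a minimal sketch of that call would look like this (entropy() normalizes the counts to probabilities itself; also note that KL is not symmetric and becomes infinite if q is 0 somewhere p is not):

import numpy as np
from scipy.stats import entropy

p = np.array([1, 1, 1, 1, 1, 1, 1], dtype=float)  # counts for document 1
q = np.array([1, 2, 1, 1, 1, 2, 1], dtype=float)  # counts for document 2

# Relative entropy / KL divergence D(p || q); pk and qk are normalized to sum to 1.
print(entropy(p, q))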
After a bit of googling to understand the KL concept, I think that your problem is due to the vectorization: you're comparing the number of appearances of different words. You should either link each column index to one word, or use a dictionary:
# The boy is having a lad relationship It lovely day in NY
1)[1 1 1 1 1 1 1 0 0 0 0 0]
2)[1 2 1 1 1 0 1 0 0 0 0 0]
3)[0 0 1 0 1 0 0 1 1 1 1 1]
Then you can use your kl function.
To automatically vectorize to a dictionary, see How to count the frequency of the elements in a list? (collections.Counter is exactly what you need). Then you can loop over the union of the keys of the dictionaries to compute the KL distance.
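As a rough sketch of that idea (the helper below is hypothetical, not from the answer): build a Counter per document, expand both counts over the union of the two vocabularies, and add a small smoothing constant, since KL is undefined wherever q is 0 and p is not. The resulting vectors could go into your kl function; to keep the sketch self-contained it uses scipy.stats.entropy from the previous answer instead.

from collections import Counter
from scipy.stats import entropy  # KL divergence, as in the previous answer

doc1 = "The boy is having a lad relationship".lower().split()
doc3 = "It is a lovely day in NY".lower().split()

def aligned_counts(tokens_a, tokens_b, smoothing=1e-9):
    """Return two count vectors aligned on the union of both vocabularies."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(ca) | set(cb))
    p = [ca[w] + smoothing for w in vocab]
    q = [cb[w] + smoothing for w in vocab]
    return p, q

p, q = aligned_counts(doc1, doc3)
print(entropy(p, q))  # no longer 0: the two documents share only a few words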
A potential issue might be in your NumPy definition of KL. Read the Wikipedia page for the formula: http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
Note that you multiply (p-q) by the log result. According to the KL formula, this factor should be just p:
return np.sum(np.where(p != 0,(p) * np.log10(p / q), 0))
That may help...