0 DF in regression in SAS Enterprise Guide

I created dummy variables in SAS (part of the code is below) and ran a regression, dropping M23 as the reference category. It worked fine. I then grouped the dummies into age bands, since we don't have enough members at each single age. I ran the regression the same way and dropped one age group (M20to24, since this group has the highest membership). Now some of my variables have 0 DF. Does anyone know what went wrong?
I got the message - Note: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased. The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.
data Table;
set Table;
M0=(AgeGender = '0M');
M1=(AgeGender = '1M');
M2=(AgeGender = '2M');
M3=(AgeGender = '3M');
M4=(AgeGender = '4M');
M5to9=(AgeGender = ' 5to9M');
M10to14=(AgeGender = '10to14M');
M15to19=(AgeGender = '15to19M');
M20to24=(AgeGender = '20to24M');
M25to29=(AgeGender = '25to29M');
M30to34=(AgeGender = '30to34M');
M35to39=(AgeGender = '35to39M');
M40to44=(AgeGender = '40to44M');
M45to49=(AgeGender = '45to49M');
M50to54=(AgeGender = '50to54M');
M55to59=(AgeGender = '55to59M');
M60to64=(AgeGender = '60to64M');
M65Plus=(AgeGender = '65+M');
F0=(AgeGender = '0F');
F1=(AgeGender = '1F');
F2=(AgeGender = '2F');
F3=(AgeGender = '3F');
F4=(AgeGender = '4F');
F5to9=(AgeGender = ' 5to9F');
F10to14=(AgeGender = '10to14F');
F15to19=(AgeGender = '15to19F');
F20to24=(AgeGender = '20to24F');
F25to29=(AgeGender = '25to29F');
F30to34=(AgeGender = '30to34F');
F35to39=(AgeGender = '35to39F');
F40to44=(AgeGender = '40to44F');
F45to49=(AgeGender = '45to49F');
F50to54=(AgeGender = '50to54F');
F55to59=(AgeGender = '55to59F');
F60to64=(AgeGender = '60to64F');
F65Plus=(AgeGender = '65+F');
Dep = (Relationship = 'Dep');
Mandatory = (Mand_Vo = 'Mandatory');
run;
ods output ParameterEstimates=Parameter_Estimates;
proc reg data= Table;
model logPMPM =
M0
M1
M2
M3
M4
M5to9
M10to14
M15to19
M25to29
M30to34
M35to39
M40to44
M45to49
M50to54
M55to59
M60to64
M65Plus
F0
F1
F2
F3
F4
F5to9
F10to14
F15to19
F20to24
F25to29
F30to34
F35to39
F40to44
F45to49
F50to54
F55to59
F60to64
F65Plus;
weight Membership;
run;
ods output close;

By definition, it doesn't look like you have overlapping or exactly complementary dummy variables. It is more likely that this occurs in your data by chance, which is harder to find. You can probably find it by cross-tabulating variables that you suspect may be related, or by drawing pairwise scatter plots (PROC SGSCATTER) and seeing which two overlap almost identically.
You're correct that you wouldn't get this behaviour with continuous values, because continuous values are less likely to overlap exactly. In general, it's considered best practice NOT to categorize/bin variables when you can keep them continuous. The boundaries are artificial: does a 34-year-old really differ from a 36-year-old? What if everyone in the 30 to 34 group is 34 and everyone in the 35 to 39 group is 36? You may not find a difference, but if one group were all 39 and the other all 31, you might find more of one. Keeping the data continuous avoids these manufactured issues.
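As a rough illustration of the check suggested above, here is a minimal Python sketch (not SAS) that scans 0/1 dummy columns for two simple causes of a rank-deficient design: a dummy that is zero on every row, and two dummies that are identical on every row. The function name find_degenerate_dummies and the sample labels are hypothetical and not taken from the asker's table; the leading space in one comparison string is there only to show how a label that never matches produces an all-zero column.

```python
def find_degenerate_dummies(columns):
    """columns: dict mapping dummy name -> list of 0/1 values (one per row).

    Returns (all_zero, duplicates): names of columns that are zero on every
    row, and pairs of names whose values are identical on every row.
    """
    all_zero = [name for name, values in columns.items() if not any(values)]
    names = list(columns)
    duplicates = [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if columns[a] == columns[b]
    ]
    return all_zero, duplicates


# Hypothetical labels, for illustration only.
age_gender = ["5to9M", "20to24M", "25to29F", "5to9M"]
dummies = {
    # The comparison string has a leading space, so it never matches here.
    "M5to9": [int(v == " 5to9M") for v in age_gender],
    "M20to24": [int(v == "20to24M") for v in age_gender],
    "F25to29": [int(v == "25to29F") for v in age_gender],
}

all_zero, duplicates = find_degenerate_dummies(dummies)
print(all_zero)    # columns that are zero on every row
print(duplicates)  # pairs of identical columns
```

This only catches the two simplest cases; more general linear combinations among the regressors would need a rank computation on the full design matrix.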

How to build an inflation term structure in QuantLib?

This is what I've got, but I'm getting weird results. Can you spot an error?:
#Zero Coupon Inflation Indexed Swap Data
zciisData = [(ql.Date(18,4,2020), 1.9948999881744385),
(ql.Date(18,4,2021), 1.9567999839782715),
(ql.Date(18,4,2022), 1.9566999673843384),
(ql.Date(18,4,2023), 1.9639999866485596),
(ql.Date(18,4,2024), 2.017400026321411),
(ql.Date(18,4,2025), 2.0074000358581543),
(ql.Date(18,4,2026), 2.0297999382019043),
(ql.Date(18,4,2027), 2.05430006980896),
(ql.Date(18,4,2028), 2.0873000621795654),
(ql.Date(18,4,2029), 2.1166999340057373),
(ql.Date(18,4,2031), 2.152100086212158),
(ql.Date(18,4,2034), 2.18179988861084),
(ql.Date(18,4,2039), 2.190999984741211),
(ql.Date(18,4,2044), 2.2016000747680664),
(ql.Date(18,4,2049), 2.193000078201294)]
def build_inflation_term_structure(calendar, observationDate):
    dayCounter = ql.ActualActual()
    yTS = build_yield_curve()
    lag = 3
    fixing_date = calendar.advance(observationDate,-lag, ql.Months)
    convention = ql.ModifiedFollowing
    cpiTS = ql.RelinkableZeroInflationTermStructureHandle()
    inflationIndex = ql.USCPI(False, cpiTS)
    #last observed CPI level
    fixing_rate = 252.0
    baseZeroRate = 1.8
    inflationIndex.addFixing(fixing_date, fixing_rate)
    observationLag = ql.Period(lag, ql.Months)
    zeroSwapHelpers = []
    for date,rate in zciisData:
        nextZeroSwapHelper = ql.ZeroCouponInflationSwapHelper(rate/100,observationLag,date,calendar,
                                                              convention,dayCounter,inflationIndex)
        zeroSwapHelpers = zeroSwapHelpers + [nextZeroSwapHelper]
    # the derived inflation curve
    derived_inflation_curve = ql.PiecewiseZeroInflation(observationDate, calendar, dayCounter, observationLag,
                                                        inflationIndex.frequency(), inflationIndex.interpolated(),
                                                        baseZeroRate, yTS, zeroSwapHelpers,
                                                        1.0e-12, ql.Linear())
    cpiTS.linkTo(derived_inflation_curve)
    return inflationIndex, derived_inflation_curve, cpiTS, yTS
observation_date = ql.Date(17, 4, 2019)
calendar = ql.UnitedStates()
inflationIndex, derived_inflation_curve, cpiTS, yTS = build_inflation_term_structure(calendar, observation_date)
If I plot the inflationIndex zero rates, I get this:
I've been looking at the same problem, and first of all I don't think you need ZeroCouponInflationSwapHelper at all.
For what it's worth, I have exactly replicated the inflation curve of Open Source Risk Engine (ORE), which is itself built on QuantLib, and the idea is quite simple:
1. Compute the base value I(0), which is the lagged CPI (supposedly your value 252.0).
2. Compute CPI(t) = I(T) = I(0)*(1+quote)^T, where T is the ZCIS expiry date and t = T - 3M.
I'm not sure where you took I(0) from, but since a lag is present, I(0) is not the latest available CPI value. Instead, I(0)=I(2019-04-17) is the interpolated value of CPI(2019-01) and CPI(2019-02).
Also, to build the CPI curve you don't need an interest rate yield curve at all, because the ZCIS exchanges a single cash flow at maturity T: the 'floating' leg pays I(T)/I(0)-1 and the 'fixed' leg pays (1+quote)^T-1. If you equate these, you can back out the 'fair' I(T) that I used above.
Assuming your value I(0)=252.0 is correct, your CPI curve would look like this:
import QuantLib as ql
import pandas as pd
fixing_rate = 252.0
observation_date = ql.Date(17, 4, 2019)
zciisData = [(ql.Date(18,4,2020), 1.9948999881744385),
(ql.Date(18,4,2021), 1.9567999839782715),
(ql.Date(18,4,2022), 1.9566999673843384),
(ql.Date(18,4,2023), 1.9639999866485596),
(ql.Date(18,4,2024), 2.017400026321411),
(ql.Date(18,4,2025), 2.0074000358581543),
(ql.Date(18,4,2026), 2.0297999382019043),
(ql.Date(18,4,2027), 2.05430006980896),
(ql.Date(18,4,2028), 2.0873000621795654),
(ql.Date(18,4,2029), 2.1166999340057373),
(ql.Date(18,4,2031), 2.152100086212158),
(ql.Date(18,4,2034), 2.18179988861084),
(ql.Date(18,4,2039), 2.190999984741211),
(ql.Date(18,4,2044), 2.2016000747680664),
(ql.Date(18,4,2049), 2.193000078201294)]
fixing_dates = []
CPI_computed = []
for tenor, quote in zciisData:
    fixing_dates.append(tenor - ql.Period('3M')) # this is 'fixing date' t
    pay_date = ql.ActualActual().yearFraction(observation_date, tenor) # this is year fraction of 'pay date' T
    CPI_computed.append(fixing_rate * (1+quote/100)**(pay_date))
results = pd.DataFrame({'date': pd.Series(fixing_dates).apply(ql.Date.to_date), 'CPI':CPI_computed})
display(results)
results.set_index('date')['CPI'].plot();
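The fair-value relation I(T) = I(0)*(1+quote)^T can also be checked with plain arithmetic, without QuantLib. The sketch below assumes annual compounding and takes the year fraction as given; the function names projected_cpi and implied_quote_pct are illustrative only and are not part of QuantLib or ORE.

```python
def projected_cpi(base_cpi, quote_pct, years):
    """Fair CPI level at maturity: I(T) = I(0) * (1 + quote)^T.

    quote_pct is the zero-coupon swap quote in percent (e.g. 2.0 for 2%).
    """
    return base_cpi * (1.0 + quote_pct / 100.0) ** years


def implied_quote_pct(base_cpi, cpi_at_maturity, years):
    """Inverse relation: the quote (in percent) that maps I(0) to I(T)."""
    return ((cpi_at_maturity / base_cpi) ** (1.0 / years) - 1.0) * 100.0


base = 252.0  # I(0), the value assumed in the answer above
level = projected_cpi(base, 2.0, 1.0)  # hypothetical 2% quote, 1 year
print(level)
print(implied_quote_pct(base, level, 1.0))
```

Equating the two legs and solving for the quote, as in implied_quote_pct, is a quick way to confirm that a computed CPI level reproduces the market quote it came from.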

Extract intercepts from multiple regressions in stata

I am attempting to reproduce the following in Stata: a scatter plot of average portfolio returns (y axis) against predicted returns (x axis).
To do so, I need help extracting the intercepts from 25 regressions into one variable. I am currently running the 25 portfolio regressions as follows. I have seen that parmest can potentially do this, but I can't get it to work with forval. Many thanks.
forval s = 1 / 5 {
forval h = 1 / 5 {
reg S`s'H`h' Mkt_Rf SMB HML
}
}
I don't know what your data look like, but maybe something like this will work:
gen intercepts = .
local i = 1
forval s = 1 / 5 {
forval h = 1 / 5 {
reg S`s'H`h' Mkt_Rf SMB HML
// assign the ith observation of intercepts
// equal to the regression constant
replace intercepts = _b[_cons] if _n == `i'
// increment i
local ++i
}
}
The postfile family of commands can be very helpful in a situation like this. These commands allow you to store results in a separate data set without losing the data in memory.
You can start with this simple example. The code produces a Stata data set called "results.dta" with the variables s, h, and constant, and one record per regression.
cap postclose results
postfile results s h constant using results.dta, replace
forval s = 1 / 5 {
forval h = 1 / 5 {
reg S`s'H`h' Mkt_Rf SMB HML
loc c = _b[_cons]
post results (`s') (`h') (`c')
}
}
postclose results
use results, clear

“quantize result has too many digits for current context” error on save

I know this has been asked a number of times, but somehow the existing answers didn't work for me. I have this field in my Song model:
sent_score = models.DecimalField(default= 0.0,max_digits=4,decimal_places=2,blank=True)
In my view I update the sent score as follows:
song_item = Song.objects.get(id=id)
com_list = Comment.objects.filter(song_id=id)
com_count = com_list.count()
pos_comments = com_list.filter(sentiment='positive')
pos_count = pos_comments.count()
getcontext().prec = 2
sent_score = Decimal(pos_count)/Decimal(com_count)
However it still gives me this error:
InvalidOperation: quantize result has too many digits for current context
I also tried the quantize method in this way:
sc = Decimal(pos_count)/Decimal(com_count)
Decimal(sc).quantize(Decimal('12.12'), rounding = ROUND_DOWN)
song_item.sent_score = sc
song_item.save()
but it still gives the same error. Please tell me where I am going wrong.
You should not use getcontext().prec = 2 as you did in your first attempt, because it changes the precision for every Decimal operation in the current thread, not just this one calculation.
The issue with your second example is that Decimal(sc).quantize(Decimal('12.12'), rounding=ROUND_DOWN) and even sc.quantize(Decimal('12.12'), rounding=ROUND_DOWN) do not change sc: quantize returns a new Decimal and leaves the original untouched. Here is an example that works:
sc = Decimal(pos_count)/Decimal(com_count)
sc_adjusted = sc.quantize(Decimal('12.12'), rounding=ROUND_DOWN)
song_item.sent_score = sc_adjusted
song_item.save()
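To see the point about quantize in isolation, here is a minimal standalone sketch that needs no Django. The helper name positive_share and the choice to return 0.00 when there are no comments are assumptions made for illustration; they are not part of the original view.

```python
from decimal import Decimal, ROUND_DOWN


def positive_share(pos_count, com_count):
    """Return pos_count / com_count rounded down to two decimal places."""
    if com_count == 0:
        # Assumption: treat "no comments yet" as a score of 0.00.
        return Decimal("0.00")
    ratio = Decimal(pos_count) / Decimal(com_count)
    # quantize() returns a NEW Decimal; `ratio` itself is left unchanged,
    # so the result has to be captured (here, by returning it).
    return ratio.quantize(Decimal("0.01"), rounding=ROUND_DOWN)


score = positive_share(1, 3)
print(score)  # 0.33
```

A value produced this way has at most two decimal places, which is what a field declared with decimal_places=2 expects.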

SAS regresses missing values

For some reason when SAS does proportional hazards regression it is including those observations that are specified as . as a group in the results. I suspect it has something to do with how I created my variable (and that SAS thinks my numeric variables are characters) but I can't figure out what I did wrong. I am using SAS 9.4
data final; set final;
if edu_d = 'hs less' then edu_regress = 1;
else if edu_d = 'hs' then edu_regress = 1;
else if edu_d = 'some college' then edu_regress = 2;
else if edu_d = 'college plus' then edu_regress = 3;
else if edu_d = 'missing' then edu_regress=.;
run;
Then I run my regression:
proc phreg data=final;
class edu_regress;
model fuptime*dc(0)=edu_regress/rl;
run;
And the output is as follows:
edu_regress . 1 0.10963 0.12941 0.7177 0.3969 1.116 0.866 1.438
edu_regress 1 1 0.22514 0.10949 4.2278 0.0398 1.252 1.011 1.552
edu_regress 2 1 0.21706 0.11410 3.6190 0.0571 1.242 0.993 1.554
Where . is a category instead of treated as missing.
I'm sure I'm making a rookie mistake but I just can't figure it out.
I would clear your output, and re-run the code, and check the log and output.
As I read the docs, to get missing values treated as a category you would need to have /missing on your CLASS statement, which you do not have in the code shown. Without that, I think missing values should be automatically excluded.
When I run PHREG with a CLASS variable that has missing values, I get a note in the log about observations being deleted due to missing values, and the output shows that the number of observations used is less than the number of observations read.
If SAS thinks edu_regress is character, that's possible if it already was on the dataset as character. This is one reason not to do data x; set x; and instead make a new dataset. You should see notes in the datastep when you run it the way you have now regarding numeric to character conversion, if this is indeed the problem.
Anyway, one way to adjust this is to use CALL MISSING. It sets a variable to missing correctly regardless of the type.
data final;
set final;
if edu_d = 'hs less' then edu_regress = 1;
else if edu_d = 'hs' then edu_regress = 1;
else if edu_d = 'some college' then edu_regress = 2;
else if edu_d = 'college plus' then edu_regress = 3;
else if edu_d = 'missing' then call missing(edu_Regress);
run;

Log base 2 calculation in python

I am trying to calculate the average disorder in ID trees. My code is below:
Republican_yes = yes.count('Republican')
Democrat_yes = yes.count('Democrat')
Republican_no = no.count('Republican')
Democrat_no = no.count('Democrat')
Indep_yes = yes.count('Independent')
Indep_no = no.count('Independent')
disorder_yes= Republican_yes/len(yes)*(math.log(float(Republican_yes)/len(yes),2))+ Democrat_yes/len(yes)*(math.log(float(Democrat_yes)/len(yes),2))+Indep_yes/len(yes)*(math.log(float(Indep_yes)/len(yes),2))
disorder_no= Republican_no/len(no)*(math.log(float(Republican_no)/len(no),2))+Democrat_no/len(no)*(math.log(float(Democrat_no)/len(no),2))+Indep_no/len(no)*(math.log(float(Indep_no)/len(no),2))
avgdisorder = -len(yes)/(len(yes)+len(no))*disorder_yes - len(no)/(len(yes)+len(no))*disorder_no
return avgdisorder
Why do I keep getting a math domain error?
Check that the lengths are not 0 before dividing by them; otherwise the calculation will fail.
if len(yes):
    disorder_yes= Republican_yes/len(yes)*(math.log(float(Republican_yes)/len(yes),2))+ Democrat_yes/len(yes)*(math.log(float(Democrat_yes)/len(yes),2))+Indep_yes/len(yes)*(math.log(float(Indep_yes)/len(yes),2))
if len(no):
    disorder_no= Republican_no/len(no)*(math.log(float(Republican_no)/len(no),2))+Democrat_no/len(no)*(math.log(float(Democrat_no)/len(no),2))+Indep_no/len(no)*(math.log(float(Indep_no)/len(no),2))
if len(yes) or len(no):
    avgdisorder = -len(yes)/(len(yes)+len(no))*disorder_yes - len(no)/(len(yes)+len(no))*disorder_no
If you want, you can add an else clause to each of the three if statements, depending on your requirements.
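Separately from empty lists, math.log raises a ValueError when its argument is 0, which happens in the formula above whenever one of the parties has a count of zero in yes or no. A minimal sketch, assuming Python 3 (where / is true division) and using the hypothetical helper names disorder and average_disorder, avoids this by only looping over labels that actually occur:

```python
import math


def disorder(labels):
    """Base-2 entropy of a list of class labels; 0.0 for an empty list."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for label in set(labels):
        # label occurs at least once, so p > 0 and log2(p) is defined
        p = labels.count(label) / total
        result -= p * math.log2(p)
    return result


def average_disorder(yes, no):
    """Size-weighted average of the disorder of the two branches."""
    n = len(yes) + len(no)
    if n == 0:
        return 0.0
    return len(yes) / n * disorder(yes) + len(no) / n * disorder(no)


yes = ["Republican", "Democrat"]
no = ["Republican", "Republican"]
print(average_disorder(yes, no))  # 0.5
```

Classes with a zero count simply contribute nothing to the sum, which matches the usual convention that 0 * log(0) is taken to be 0 in entropy calculations.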