combining multiple items to create one dummy variable

combining multiple items to create one dummy variable - stata

I have 7 items/variables in Stata that address the same survey question. These 7 items are each different weight control behaviors (diet, exercise, pills, etc.). I am trying to combine these variables to create a single weight control behavior dummy variable that is coded as yes (did engage in weight control) and no (did not engage in weight control).
The response options for each variable look something like this for a given weight control behavior
dieted
11438 0 not marked
2771 1 marked
16 6 refused
6508 7 legitimate skip
13 8 don’t know
Here is my code. I re-coded 6,7,8 for all 7 vars as missing:
tab1 h1gh30a-h1gh30g,m`
foreach X of varlist h1gh30a-h1gh30g {
replace `X'=. if `X' > 1
}
egen wgt_control= rowmax(h1gh30a-h1gh30g)
ta wgt_control
gen wgt_control_new=wgt_control
replace wgt_control_new = 1 if wgt_control>0 & wgt_control!=.
replace wgt_control_new= 0 if wgt_control <1
ta wgt_control_new
I used rowmax() to combine all 7 items but my issue is that the response option 0 or No doesn't appear when I tabulate it. I only get those who responded yes=1.

Here is a suggestion with a reproducible example for what I think is the cleanest approach. I also included some unsolicited advice about survey data best practices
* Example generated by -dataex-. For more info, type help dataex
clear
input double(h1gh30a h1gh30b h1gh30c)
1 1 1
1 0 1
6 1 8
0 0 0
7 6 8
end
* Explicit coding is better, so if possible, which it is with 7 vars,
* create a local with the vars are explicitly listed
local wgt_controls h1gh30a h1gh30b h1gh30c
* Recode is a better command to use here. And do not destroy information,
* there is a survey data quality assurance difference between respondent
* refusing to answer, not knowing or question skipped. You can replace this
* survey codes with these extended missing values that behaves like missing values
* but retain the differences in the survey codes
recode `wgt_controls' (6=.a) (7=.b) (8=.c)
* While rowmax() could be used, I think it seems like anymatch() fits
* what you are trying to do better
egen wgt_control = anymatch(`wgt_controls'), values(1)

There is no minimal reproducible example here, so we can't reproduce the problem independently.
From your code, it seems that h1gh30a-h1gh30g are recoded so that all are 0, 1 or missing, so their maximum takes one of the same values.
gen wgt_control_new = wgt_control
replace wgt_control_new = 1 if wgt_control>0 & wgt_control!=.
replace wgt_control_new= 0 if wgt_control <1
seems to boil down to cloning the variable:
gen wgt_control_new = wgt_control
In short, I can't see a reason in your code why you should never see 0 as a possible result.
EDIT
A minimal check on whether there are zeros that aren't showing up as they should might be
egen max = rowmax(h1gh30a-h1gh30g)
list high30a-high30g if max == 0
```

Related

Is there a way to count from last condition x?

I have a complex set of data that can return 3 different conditions per row. I need to be able to count the last x rows matching one of the specific conditions.
The following formula has been working well for me, but I have discovered a glitch in one instance of this formula (the formula is replicated at least a dozen times)
=ArrayFormula(LOOKUP(9.99999999999999E+307,IF(FREQUENCY(IF(AQ:AQ)=1,ROW(AQ:AQ)),IF(AQ:AQ<>1,ROW(AQ:AQ)))=0,FREQUENCY(IF(AQ:AQ=1,ROW(AQ:AQ)),IF(AQ:AQ=0,ROW(AQ:AQ))))))
Current criteria are as such:
0: Condition x met - Reset counter
1: Condition y met - Increment counter
2: Condition z met - Ignore this row
Therefore this:
1
2
2
2
1
1
0
1
1
1
Should output: 3
This:
1
2
0
2
2
1
2
1
Should output: 2
However the glitch I have encountered isn't resetting the counter when 0 is reached, for example:
1
2
1
2
1
1
2
2
2
2
0
Should output: 0
But in fact is outputting: 4
I have tested all possible conditions with that specific data set and I cannot rectify the issue. I believe there is an error in the formula (specifically the 9.99999999999999E+307) but I wrote it so long ago that I cannot successfully debug it. I have tried 1E+306 but the result is the same.
EDIT1: Upon request I have included as stripped down version of the sheet as I can while recreating the issue.
https://docs.google.com/spreadsheets/d/1SOXiFMEQelqptBvjcabMZGNgG60TRRbe_b65rzT1bi0/edit?usp=sharing
If you scroll to the bottom of the sheet you can see Col AQ has a 0, as a result the value in the cell AF2 should be 0.
You will notice in the sheet that I am using Named Ranges.
EDIT2: player0's answer was PERFECT!! <3
I modified the new formula to adapt to my spreadsheet so it could accommodate Named Ranges and drop-down lists. This question helped me a lot with that:
Convert column index into corresponding column letter
The final formula (just FYI) turned out to be:
=ARRAYFORMULA(COUNTIF(
INDIRECT(REGEXEXTRACT(ADDRESS(ROW(), column(INDIRECT($A$1 & Z$1 & "L"))), "[A-Z]+")&
MAX(IF((INDIRECT($A$1 & Z$1 & "L")=0)*(INDIRECT($A$1 & Z$1 & "L")<>""),
ROW(INDIRECT($A$1 & Z$1 & "L"))+1,5))&":"&
REGEXEXTRACT(ADDRESS(ROW(), column(INDIRECT($A$1 & Z$1 & "L"))), "[A-Z]+")), 1))

=ARRAYFORMULA(COUNTIF(INDIRECT("A"&
MAX(IF((A2:A=0)*(A2:A<>""), ROW(A2:A)+1, ROW(A2)))&":A"), 1))
spreadsheet demo

Precisions and counts

I am working with a educational dataset called IPEDS from the National Center for Educational Statistics. They track students in college based upon major, degree completion, etc. The problem in Stata is that I am trying to determine the total count for degrees obtained by a specific major.
They have a variable cipcode which contains values that serve as "majors". cipcode might be 14.2501 "petroleum engineering, 16.0102 "Linguistics" and so forth.
When I write a particular code like
tab cipcode if cipcode==14.2501
it reports no observations. What code will give me the totals?
/*Convert Float Variable to String Variable and use Force Replace*/
tostring cipcode, gen(cipcode_str) format(%6.4f) force
replace cipcode_str = reverse(substr(reverse(cipcode_str), indexnot(reverse(cipcode_str), "0"), .))
replace cipcode_str = reverse(substr(reverse(cipcode_str), indexnot(reverse(cipcode_str), "."), .))
/* Created a total variable called total_t1 for total count of all stem majors listed in table 1*/
gen total_t1 = cipcode_str== "14.2501" + "14.3901" + "15.0999" + "40.0601"

This minimal example confirms your problem. (See, by the way, https://stackoverflow.com/help/mcve for advice on good examples.)
* code
clear
input code
14.2501
14.2501
14.2501
end
tab code if code == 14.2501
tab code if code == float(14.2501)
* results
. tab code if code == 14.2501
no observations
. tab code if code == float(14.2501)
code | Freq. Percent Cum.
------------+-----------------------------------
14.2501 | 3 100.00 100.00
------------+-----------------------------------
Total | 3 100.00
The keyword is one you use, precision. In Stata, search precision for resources, starting with blog posts by William Gould. A decimal like 14.2501 is hard (impossible) to hold exactly in binary and the details of holding a variable as type float can bite.
It's hard to see what you're doing with your last block of code, which you don't explain. The last statement looks puzzling, as you're adding strings. Consider what happens with
. gen whatever = "14.2501" + "14.3901" + "15.0999" + "40.0601"
. di whatever[1]
14.250114.390115.099940.0601
The result is a long string that cannot be a valid cipcode. I suspect that you are reaching towards
... if inlist(cipcode_str, "14.2501", "14.3901", "15.0999", "40.0601")
which is quite different.
But using float() is the minimal trick for this problem.

Query on plotting Lorenz curves on Stata

I am trying to plot a lorenz curve, using the following command:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
generate rank1=rank
label variable rank "Cum share of mortality"
label variable rank1 "Equality Line"
twoway (line rank1 rank, sort clwidth(medthin) clpat(longdash))(line yord rank , sort clwidth(medthin) clpat(red)), ///
ytitle(Cumulative share of drug activity, size(medsmall)) yscale(titlegap(2)) xtitle(Cumulative share of mortality (2012), size(medsmall)) ///
legend(rows(5)) xscale(titlegap(5)) legend(region(lwidth(none))) plotregion(margin(zero)) ysize(6.75) xsize(6) plotregion(lcolor(none))
However, in the resultant curves, the Line of equality does not start from 0, is there a way to fix this?
Is it recommended to use the following in order to get the perfect 45 degree line of equality:
(function y=x, range(0 1)
Also, how many minimum observations are required to plot the above graph? Does it work well with 2 observations as well?

The reason your Line of Perfect Equality does not pass through (0,0) is because the values for your variable do not contain 0.
The smallest value you will have for rank will be 1/_N. Although this value will asymptotically approach 0, it will never actually reach 0.
To see this, try:
quietly sum rank
di r(min)
di 1/_N
Further, by applying the program code to your data (beginning around line 152 in the ado file and removing unnecessary bits), one can easily see that yord cannot take on a value of 0 without values of 0 for drugs:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
sort death drugs , stable
gen double rank1 = _n / _N
qui sum drugs
gen yord1= (sum(drugs) / _N) / r(mean)
The best way to plot your Equality would be the method from your edit, namely:
twoway(function y = x, ra(0 1))
One quick yet (very) crude fix to force the lorenz curve to start at the origin (if it doesn't already) is to add an observation to the data after obtaining rank and yord, and then deleting it after you have your curve:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
expand 2 in 1
replace yord = 0 in 1
replace rank = 0 in 1
twoway (function y = x, ra(0 1)) ///
(line yord rank)
drop in 1
Like I said, this is admittedly crude and even somewhat ill advised, but I can't see a much better alternative at the moment, and with this method you will not be altering any of the other values of yord by running glcurve on the extrapolated data.

Speedy test on R data frame to see if row values in one column are inside another column in the data frame

I have a data frame of marketing data with 22k records and 6 columns, 2 of which are of interest.
Variable
FO.variable
Here's a link with the dput output of a sample of the dataframe: http://dpaste.com/2SJ6DPX
Please let me know if there's a better way of sharing this data.
All I want to do is create an additional binary keep column which should be:
1 if FO.variable is inside Variable
0 if FO.Variable is not inside Variable
Seems like a simple thing...in Excel I would just add another column with an "if" formula and then paste the formula down. I've spent the past hours trying to get this and R and failing.
Here's what I've tried:
Using grepl for pattern matching. I've used grepl before but this time I'm trying to pass a column instead of a string. My early attempts failed because I tried to force grepl and ifelse resulting in grepl using the first value in the column instead of the entire thing.
My next attempt was to use transform and grep based off another post on SO. I didn't think this would give me my exact answer but I figured it would get me close enough for me to figure it out from there...the code ran for a while than errored because invalid subscript.
transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])
My next attempt was to use str_detect, but I don't think this is the right approach because I want the row level value and I think 'any' will literally use any value in the vector?
kk <- sapply(dd$variable, function(x) any(sapply(dd$FO.variable, str_detect, string = x)))
EDIT: Just tried a for loop. I would prefer a vectorized approach but I'm pretty desperate at this point. I haven't used for-loops before as I've avoided them and stuck to other solutions. It doesn't seem to be working quite right not sure if I screwed up the syntax:
for(i in 1:nrow(dd)){
if(dd[i,4] %in% dd[i,2])
dd$test[i] <- 1
}
As I mentioned, my ideal output is an additional column with 1 or 0 if FO.variable was inside variable. For example, the first three records in the sample data would be 1 and the 4th record would be zero since "Direct/Unknown" is not within "Organic Search, System Email".
A bonus would be if a solution could run fast. The apply options were taking a long, long time perhaps because they were looping over every iteration across both columns?
This turned out to not nearly be as simple as I would of thought. Or maybe it is and I'm just a dunce. Either way, I appreciate any help on how to best approach this.

I read the data
df = dget("http://dpaste.com/2SJ6DPX.txt")
then split the 'variable' column into its parts and figured out the lengths of each entry
v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v) ## sapply(v, length) in R-3.1.3
Then I unlisted v and created an index that maps the unlisted v to the row from which it came from
uv = unlist(v)
idx = rep(seq_along(v), len)
Finally, I found the indexes for which uv was equal to its corresponding entry in FO.variable
test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE
Or combined (it seems more useful to return the logical vector than the modified data.frame, which one could obtain with dd$Keep = f0(dd))
f0 = function(dd) {
v = strsplit(as.character(dd$variable), ",", fixed=TRUE)
len = lengths(v)
uv = unlist(v)
idx = rep(seq_along(v), len)
keep = logical(nrow(dd))
keep[ idx[uv == as.character(dd$FO.variable)[idx]] ] = TRUE
keep
}
(This could be made faster using the fact that the columns are factors, but maybe that's not intentional?) Compared with (the admittedly simpler and easier to understand)
f1 = function(dd)
mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)
f1a = function(dd)
mapply(grepl, as.character(dd$FO.variable),
as.character(dd$variable), fixed=TRUE)
f2 = function(dd)
apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))
with
> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
expr min lq mean median uq max neval
f0(df) 57.559 64.6940 70.26804 69.4455 74.1035 98.322 100
f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183 100
f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115 100
f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704 100
Two subtle but important additions during the development of the timings were to use fixed=TRUE in the regular expression, and to coerce the factors to character.

I would go with a simple mapply in your case, as you correctly said, by row operations will be very slow. Also, (as suggested by Martin) setting fixed = TRUE and apriori converting to character will significantly improve performance.
transform(dd, Keep = mapply(grepl,
as.character(FO.variable),
as.character(variable),
fixed = TRUE))
# VisitorIDTrue variable value FO.variable FO.value Keep
# 22 44888657 Direct / Unknown,Organic Search 1 Direct / Unknown 1 TRUE
# 2 44888657 Direct / Unknown,System Email 1 Direct / Unknown 1 TRUE
# 6 44888657 Direct / Unknown,TV 1 Direct / Unknown 1 TRUE
# 10 44888657 Organic Search,System Email 1 Direct / Unknown 1 FALSE
# 18 44888657 Organic Search,TV 1 Direct / Unknown 1 FALSE
# 14 44888657 System Email,TV 1 Direct / Unknown 1 FALSE
# 24 44888657 Direct / Unknown,Organic Search 1 Organic Search 1 TRUE
# 4 44888657 Direct / Unknown,System Email 1 Organic Search 1 FALSE
...

Here is a data.table approach that I think is very similar in spirit to Martin's:
require(data.table)
dt <- data.table(df)
dt[,`:=`(
fch = as.character(FO.variable),
rn = 1:.N
)]
dt[,keep:=FALSE]
dtvars <- dt[,strsplit(as.character(variable),',',fixed=TRUE),by=rn]
setkey(dt,rn,fch)
dt[dtvars,keep:=TRUE]
dt[,c("fch","rn"):=NULL]
The idea is to
identify all pairs of rn & variable (saved in dtvars) and
see which of these pairs match with rn & F0.variable pairs (in the original table, dt).

Replace zeros with missing values in certain cases

I was wondering if anyone knew an easier way of doing the following:
I have a dataset of health facility caseload by year, where each observation is one health facility. Facilities were 'brought online' in different years, so some have zeros before they have values for caseload. Also, some 'discontinue', as in they did provide services, but don't any more. I would like to replace the zeros with missing values for the years in which a facility discontinued. In the following example, the 3rd and 4th facilities discontinued, so I'd like missing for y2014 for the 3rd and y2013 & y2014 for the 4th.
y2011 y2012 y2013 y2014
0 0 76 82
0 0 29 13
0 0 25 0
5 10 0 0
0 0 17 24
I tried the following, which worked, but I'm going to have many years worth of data to work on (2000-2014), so was wondering if there was a more efficient way.
replace y2014=. if y2014==0 & (y2013>0 | y2012>0 | y2011>0)
replace y2013=. if y2013==0 & ( y2012>0 | y2011>0)
replace y2012=. if y2012==0 & ( y2011>0)
I messed around with egen rowlast to identify the facilities with a zero in the last year (meaning they discontinued), but then wasn't sure where to go with it.

Your problem would benefit from a loop over the variables.
We'll initialise started to 0, change our mind about started when we see a positive value, and change any subsequent 0s to missings if started is 1.
gen started = 0
forval y = 2000/2014 {
replace started = 1 if y`y' > 0
replace y`y' = . if started == 1 & y`y' == 0
}
Note that this scheme allows re-starts.
A more general comment is that this is not the better data structure for such panel or longitudinal data. This particular problem is not too challenging, but most problems with such data will be easier after reshape long.
See here for a survey of "rowwise" technique in Stata.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js