String search for range of ICD codes - stata

I want to search in Stata for C00-D49 and flag them as Neoplasms.
I could do
gen neo =1 if strmatch(diagnosis, "C*")
But I am unsure how to limit the string search to codes up to D49.
Also, I need to flag O00-O9A as Pregnancy.
I can do the following as well:
replace neo = 1 if strmatch(diagnosis, "D0*")
replace neo = 1 if strmatch(diagnosis, "D1*")
replace neo = 1 if strmatch(diagnosis, "D2*")
replace neo = 1 if strmatch(diagnosis, "D3*")
replace neo = 1 if strmatch(diagnosis, "D4*")
But, is there a way to perform a string match for a given range?

As I understand it, ICD codes are organized in alphabetical order, so you do not need a string search at all. Just compare the codes as strings, like this:
* Example generated by -dataex-. For more info, type help dataex
clear
input str7 diagnosis
"ABB"
"A12"
"C34"
"D49.512"
"O02"
"Q34"
"C00.2"
end
gen neoplasm = (diagnosis >= "C00" & diagnosis < "D50")
gen pregnancy = (diagnosis >= "O00" & diagnosis < "P")
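One caveat with the comparison approach: string comparison in Stata is case-sensitive, so a lowercase code such as "c34" would fall outside the range. A minimal sketch, assuming the codes might arrive in mixed case or with stray blanks (the variable name code and the test values below are made up for illustration):

```stata
clear
input str7 diagnosis
"c34"
"C00.2"
"D49.512"
"D50"
"O9A.1"
"P00"
end
* Normalize first: uppercase and strip leading/trailing blanks
gen code = upper(strtrim(diagnosis))
* Same range logic as above, applied to the normalized codes
gen byte neoplasm  = code >= "C00" & code < "D50"
gen byte pregnancy = code >= "O00" & code < "P"
list, noobs
```

Without the upper() call, "c34" would not be flagged, because lowercase letters sort after all uppercase letters.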

Related

Is there a way to flip the order of observations in Stata?

Is it possible to create a backwards counting variable in Stata (like the command _n, just numbering observations backwards)? Or a command to flip the data set, so that the observation with the most recent date is the first one? I would like to make a scatter plot with AfD on the y-axis and the date (row_id) on the x-axis. When I make the plot however, the weeks are ordered backwards. How can I change the order?
This is the code:
generate row_id=_n
twoway scatter AfD row_id || lfit AfD row_id
Here are the data set and the plot:
Your date variable is a string variable, which is unlikely to get you the desired result if you sort on that variable.
You can create a Stata internal form date variable from your string variable:
gen date_num = daily(date, "MDY")
format date_num %td
The values of this new variable will represent the number of days since 1 Jan 1960.
If you create a scatter plot with this date variable on the x-axis, by default it will be sorted from min to max. To let it run from max to min you can specify option xscale(reverse).
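A minimal sketch of the steps above, assuming date is a string in month/day/year form and AfD is numeric. The three data rows are invented purely so the example runs:

```stata
clear
* Invented example rows; replace with your own data
input str10 date AfD
"01/05/2020" 10
"01/12/2020" 11
"01/19/2020" 9
end
* Convert the string to a Stata daily date and give it a date display format
gen date_num = daily(date, "MDY")
format date_num %td
* Plot against the real date; xscale(reverse) runs the axis from max to min
twoway scatter AfD date_num || lfit AfD date_num, xscale(reverse)
```

Drop xscale(reverse) if the usual oldest-to-newest direction is what you want.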
If you still want to create an id variable yourself, you can choose one of these options (ascending or descending). Run only one of them, since both create id:
* ascending
sort date_num
gen id = _n
* descending
gsort -date_num
gen id = _n
For your problem, plotting in terms of a daily date variable and, if for some reason that is a good idea, using xscale(reverse) are likely to be what you need, as well explained by @Wouter.
In general something like
gen long newid = _N - _n + 1
sort newid
will reverse a dataset.

Looping with distance matching

I want to match treated firms to control firms by industry and year considering firms that are the closest in terms of profitability (roa). I want a 1:1 match. I am using a distance measure (mahalanobis).
I have 530,000 firm-year observations in my sample, namely approximately 267,000 treated observations and 263,000 control observations. Here is my code:
gen neighbor1 = .
gen idobs = .
levelsof industry
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
foreach j in `b'{
capture noisily psmatch2 treat if industry == `i' & year == `j', mahalanobis(roa)
capture noisily replace neighbor1 = _n1 if industry == `i' & year == `j'
capture noisily replace idobs = _id if industry == `i' & year == `j'
drop _treated _support _weight _id _n1 _nn
}
}
Treat is my treatment variable. It takes the value of 1 for treated observations and 0 for non-treated observations.
The command psmatch2 creates the variable _n1 and _id among others. _n1 is the id number of the matched observation (closest neighbor) and _id is an id number (1 - 530,000) that is unique to each observation.
The code 'works', i.e. I get no error message. My variable neighbor1 has 290,724 non-missing observations.
However, these 290,724 observations vary between 1 and 933 which is odd. The variable neighbor1 should provide me the observation id number of the matched observation, which can vary between 1 and 530,000.
It seems that the code erases or ignores the result of the matching process in different subgroups. What am I doing wrong?
Edit:
I found a public dataset and adapted my previous code so that you can run my code with this dataset and see more clearly what the problem could be.
I am using the Vella and Verbeek (1998) panel data on 545 men who worked every year from 1980 to 1987, from this website: https://www.stata.com/texts/eacsap/
Let's say that I want to match treated observations, i.e. people, to control observations by marriage status (married) and year considering people that worked a similar number of hours (hours), i.e. the shortest distance.
I create a random treatment variable (treat) for the sake of this example.
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta
gen treat = round(runiform())
gen neighbor1 = .
gen idobs = .
levelsof married
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
foreach j in `b'{
capture noisily psmatch2 treat if married == `i' & year == `j', mahalanobis(hours)
capture noisily replace neighbor1 = _n1 if married == `i' & year == `j'
capture noisily replace idobs = _id if married == `i' & year == `j'
drop _treated _support _weight _id _n1 _nn
}
}
What this code should do is to look at each subgroup of observations: 444 observations in 1980 that are not married, 101 observations in 1980 that are married, ..., and 335 observations in 1987 that are married. In each of these subgroups, I would like to match a treated observation to a control observation considering the shortest distance in the number of hours worked.
There are two problems that I see after running the code.
First, the variable idobs should take a unique number between 1 and 4360 because there are 4360 observations in this dataset. It is just an ID number. That is not the case: several observations share ID number 1, several share 2, and so on.
Second, neighbor1 varies between 1 and 204 meaning that the matched observations have only ID numbers varying from 1 to 204.
What is the problem with my code?
Here is a solution using the command iematch, which is part of the package ietoolkit (install it with ssc install ietoolkit). For disclosure, I wrote this command. psmatch2 is great if you want the ATT. But if all you want is to match observations across two groups using nearest neighbor, then iematch is cleaner.
In both commands you need to make each industry-year match in a subset, then combine that information. In both commands the matched group ID will restart from 1 in each subset.
Using your example data, this creates one matchID var for each subset, then you will have to find a way to combine these to a single matchID without conflicts across the data set.
* use data set and keep only vars required for simplicity
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta, clear
keep year married hours
* Set seed for replicability. NEVER use the 123456 seed in production, randomize a new seed
set seed 123456
*Generate mock treatment
gen treat = round(runiform())
*generate vars to store results
gen matchResult = .
gen matchDiff = .
gen matchCount = .
*Create locals for loops
levelsof married
local married_statuses = r(levels)
levelsof year
local years = r(levels)
*Loop over each subgroup
foreach year of local years {
foreach married_status of local married_statuses {
*This command is similar to psmatch2, but a simplified version for
* when you are not looking for the ATT.
* This command is only about matching.
iematch if married == `married_status' & year == `year', grp(treat) match(hours) seedok m1 maxmatch(1)
*These variables list meta info about the match. See helpfile for docs,
*but this copies info from each subset in this loop to single vars for
*the full data set. Then the loop-specific vars are dropped
replace matchResult = _matchResult if married == `married_status' & year == `year'
replace matchDiff = _matchDiff if married == `married_status' & year == `year'
replace matchCount = _matchCount if married == `married_status' & year == `year'
drop _matchResult _matchDiff _matchCount
*For each loop you will get a match ID restarting at 1 for each loop.
*Therefore we need to save them in one var for each loop and combine afterwards.
rename _matchID matchID_`married_status'_`year'
}
}
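The combining step mentioned above could look roughly like the following sketch. It assumes the per-subset ID variables are numeric and named matchID_<married>_<year> as in the loop. The mock values and the name matchID for the combined variable are hypothetical, chosen only so the example runs on its own:

```stata
clear
* Hypothetical stand-in for two of the per-subset ID variables
input matchID_0_1980 matchID_1_1980
1 .
1 .
2 .
. 1
. 1
end
* Build one string ID by prefixing each value with its subset's variable name,
* so that ID 1 in one subset cannot collide with ID 1 in another
gen str30 matchID = ""
foreach v of varlist matchID_* {
    replace matchID = "`v'_" + string(`v') if !missing(`v')
}
list, noobs
```

A string ID is used here for readability. If a numeric ID is needed, egen with group() on the string variable is one option.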

Simulating AR(1) in Stata from last observation to the first

I want to simulate an AR(1) process, but start from the end. But my code does not work as expected:
clear
set obs 100
gen et=rnormal(0,1)
quietly gen yt= et in L
quietly replace yt=0.5*yt[_n+1]+et in 1/L-1
Your help is really appreciated.
Just do it the normal way and then reverse order:
clear
set obs 100
gen obs = -_n
gen et=rnormal(0,1)
quietly gen yt = et in 1
quietly replace yt = 0.5*yt[_n-1] + et in 2/L
sort obs
The key is that Stata works through the observations in order. So this code cascades as you would want: the value for observation 2 depends on observation 1, 3 on 2, and so forth.
You won't get a cascade going the other direction.
Also, set seed for reproducibility.
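Putting the pieces together, here is a sketch with a seed set and a check that the reversed series really follows the backward recursion. The seed value is arbitrary, and the tolerance in the first assert only allows for yt and et being stored as floats:

```stata
clear
set seed 2803
set obs 100
gen obs = -_n
gen et = rnormal(0, 1)
* Simulate forwards: observation 1 starts the cascade
gen yt = et in 1
replace yt = 0.5*yt[_n-1] + et in 2/L
* Reverse the order of the observations
sort obs
* Each value now depends on the NEXT observation; the last one equals its shock
assert abs(yt - (0.5*yt[_n+1] + et)) < 1e-6 in 1/99
assert yt == et in 100
```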

Using Compress and Put/Input functions in SAS

I have these two datasets here:
data ONE;
input ID LastName $ FirstInit $ 1.;
datalines;
509182793 Smith C
319861601 Williams J
345121778 Connor F
480863211 King L
907636280 Franklin D
729082859 Monroe T
835688938 Hall K
;
run;
data TWO;
input ID $ 11. State $ 2.;
datalines;
334-99-5246 TX
480-86-3211 MD
449-55-9407 VA
345-12-1778 GA
907-63-6280 NY
790-09-9813 WY
319-86-1601 FL
;
run;
I have two questions:
1) How would you use COMPRESS to create a new character variable, "ncv" and set the value of ncv to be the value of the character variable ID with the hyphens removed? Here's my attempt:
data TWO_NUMERIC;
set TWO;
ncv=COMPRESS(TWO, "+-", "d");
run;
2) How would you use PUT/INPUT to convert ncv to a numerical value to create a numeric variable, "newncv"
data TWO_NUMERIC;
set TWO;
put(TWO,z6.);
run;
For both questions, I start with the DATA and SET statements:
data TWO_NUMERIC;
set TWO;
run;
I looked at the SAS 9.2 help page, but the way these functions are used in the example code there confuses me.
Ok, I was going to say RTM, but in this case it's not clear, at least not in my opinion.
Your mistake with COMPRESS is that the first argument should be the variable, in this case ID, not the dataset TWO. In addition, you only need to specify - in your list, not +, unless you think there might be a + in the variable as well. The d modifier adds digits to the list of characters to remove, which is the opposite of what you want.
The same idea applies to PUT/INPUT: reference the variable, and make sure you use the correct function, in this case INPUT, to convert it to numeric.
Data two_numeric;
set two;
ncv=COMPRESS(ID, "-");
ncv_num=input(ncv, 12.);
run;
COMPRESS can be used in multiple ways. One is described by @Reeza above, and another uses the "k" modifier, which means "keep", as shown below:
data TWO_NUMERIC;
set TWO;
ncv_d=COMPRESS(ID," ", "kd"); * kd means keep digits, your code had TWO which is a dataset name;
ncv_n=COMPRESS(ID," ", "kn"); * kn means keep digits, underscores and English letters;
/* The INPUT function is used to convert CHAR to NUM *
* the best. informat reads standard numeric values */
newncv=input(ncv_d,best.);
run;
The link I found useful to explain the K modifier is http://www.amadeus.co.uk/sas-training/tips/1/1/11/the-enhanced-compress-function.php

Is there a way to get past the "too many values" error in Stata when using tabulate?

I am trying to generate frequencies for a variable in Stata conditional on categories of another variable.
This other categorical variable has about 790,000 observations for the category I am interested in.
Stata's limits of 12,000 rows for one-way tables and 1,200 rows for two-way tables make this impossible.
Every time I run tab x if y==<category of interest> I get the following error:
too many values
r(134);
I installed the bigtab package, and though it gives me tables, it cannot be used with by or to run statistical tests.
Is there a workaround for this?
It seems silly that Stata should have this arbitrary limit when SAS and even SPSS can run the exact same operation without trouble.
To some it might seem silly, or at least puzzling, that people want tables with more than 12000 rows, as there must be a better way to display results or answer the question that is in mind.
That said, the limits of tabulate are hard-wired. But you just need to think of reproducing whatever you want to show. So, for one-way frequencies
. bysort rowvar : gen freq = _N
. by rowvar : gen tag = _n == 1
. gsort -freq rowvar
. list rowvar freq if tag, noobs
and for two-way frequencies
. bysort rowvar colvar : gen freq = _N
. by rowvar colvar : gen tag = _n == 1
. gsort -freq rowvar colvar
. list rowvar colvar freq if tag, noobs
A similar approach, with more bells and whistles, is coded within groups (SSC). An even simpler approach in many ways is to collapse or contract the dataset and then list it.
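As a sketch of the contract route, here it is on the auto dataset shipped with Stata, with rep78 and foreign standing in for rowvar and colvar. The name freq for the count variable is a choice made here; the default would be _freq:

```stata
sysuse auto, clear
* preserve/restore keeps the original data, since contract replaces it in memory
preserve
* One observation per distinct (rep78, foreign) combination, with its count
contract rep78 foreign, freq(freq)
gsort -freq rep78 foreign
list, noobs
restore
```

For one-way frequencies, give contract a single variable.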
To flag the general strategy here:
1) Produce what you want as new variables.
2) Select just one observation from each group if there are multiple observations.
3) list, not tabulate.
UPDATE
OP asked
. bysort rowvar : gen freq = _N
OP: This generates the freq variable for the last count of every individual value in my rowvar
Me: No. The freq variable is the count of observations for every distinct value of rowvar.
. by rowvar : gen tag = _n == 1
OP: This generates the tag variable for the first count of every unique observation in rowvar.
Me: Correct, provided you say "distinct", not "unique". Unique values occur once only.
. gsort -freq rowvar
OP: This sorts freq and rowvar in descending order
Me: It sorts freq in descending order and rowvar in ascending order within blocks of constant freq.
. list rowvar freq if tag, noobs
OP: What does if do here?
Me: That one is left as an exercise.
Use the command bigtab. (You have to install the package first: run ssc install bigtab.) For help type h bigtab.