How do I sum up aggregate data from an individual-level dataset? - Stata

* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 householdID byte(childID HHmemberID)
"0940041260" 1 3
"0940041030" 1 .
"0940041030" 2 .
"0940041030" 3 3
"0940041030" 4 .
"0940041030" 5 .
"0940041110" 1 3
"0940041100" 2 3
"0940041100" 3 4
"0940041100" 4 .
"0940041080" 1 .
"0940041080" 2 .
"0940041080" 3 .
"0940041060" 1 3
"0940041140" 1 .
"0940041180" 1 .
"0940041010" 1 .
"0940041010" 2 .
"0940041040" 1 .
"0940041040" 2 .
"0940041190" 1 .
"0940041190" 2 .
"0940041220" 1 3
"0940041160" 1 3
"0940041170" 1 .
"0940041170" 2 .
end
I am trying to compute each household's size and the number of children it has, but I don't know how to do that in Stata. Is there a way to deal with this problem? Within a household, the greatest value of HHmemberID or childID represents the respective count, but I don't know how to extract that information.

If you want this information in your original data, you can use egen (extended generate):
bysort householdID: egen N_members = max(HHmemberID)
bysort householdID: egen N_kids = max(childID)
If you want a new dataset with only that data, you should collapse:
collapse (max) N_members = HHmemberID N_kids = childID, by(householdID)
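Note that egen's max() ignores missing values, so a household whose HHmemberID is missing on every row (for example 0940041080 above) gets a missing N_members. As a minimal sketch to eyeball one row per household after the egen lines:
* tag() marks exactly one observation per household
egen tag = tag(householdID)
list householdID N_members N_kids if tag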

Counting observations with duplicate IDs

I have a dataset that I am converting from wide to long format.
Currently I have one observation per patient, and each patient can have up to 5 aneurysms, recorded in wide format.
I am trying to rearrange this dataset so that I have one observation per aneurysm instead. I have done so successfully, but now I need to label the aneurysms in a new variable called aneurysmIdentifier.
Here is a glimpse at the data. You can see how, when a patient has 4 aneurysms, I have successfully created 4 corresponding observations; however, these are duplicates created via the expand command.
I am stuck at the next point, which, as mentioned, is creating a new variable aneurysmIdentifier that reads 1 if there is only one copy of a given record_id, 1 and 2 if there are two copies, and so forth, up to 1-2-3-4-5. This would give me a point of reference for what I call aneurysm 1, 2, 3, 4 and 5, so I can keep rearranging the data to fit.
I have made a sketch that hopefully shows what I mean: it counts how many duplicates there are and then counts forward, up to the maximum of 5.
Can anyone push me in the right direction on how to achieve this?
Example of data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str32 record_id float aneurysmNumber
"007128de18ce5cb1635b8f27c5435ff3" 1
"00abd7bdb6283dd0ac6b97271608a122" 1
"0142103f84693c6eda416dfc55f65de1" 1
"0153826d93a58d7e1837bb98a3c21ba8" 1
"01c729ac4601e36f245fd817d8977917" 2
"01c729ac4601e36f245fd817d8977917" 2
"01dd90093fbf201a1f357e22eaff6b6a" 1
"0208e14dcabc43dd2b57e2e8b117de4d" 1
"0210f575075e5def7ffa77530ce17ef0" 1
"022cc7a9397e81cf58cd9111f9d1db0d" 1
"02afd543116a22fc7430620727b20bb5" 1
"0303ef0bd5d256cca1c836e2b70415ac" 2
"0303ef0bd5d256cca1c836e2b70415ac" 2
"041b2b0cac589d6e3b65bb924803cf1a" 1
"0536317a2bbb936e85c3eb8294b076da" 1
"06161d4668f217937cac0ac033d8d199" 1
"065e151f8bcebb27fabf8b052fd70566" 4
"065e151f8bcebb27fabf8b052fd70566" 4
"065e151f8bcebb27fabf8b052fd70566" 4
"065e151f8bcebb27fabf8b052fd70566" 4
"07196414cd6bf89d94a33e149983d102" 1
"0721c38f8275dab504fc53aebcc005ce" 4
"0721c38f8275dab504fc53aebcc005ce" 4
"0721c38f8275dab504fc53aebcc005ce" 4
"0721c38f8275dab504fc53aebcc005ce" 4
"07bef516d53279a3f5e477d56d552a2b" 1
"08678829b7e0ee6a01b17974b4d19cfa" 1
"08bb6c65e63c499ea19ac24d5113dd94" 1
"08f036417500c332efd555c76c4654a0" 1
"090c54d021b4b21c7243cec01efbeb91" 1
"09166bb44e4c5cdb8f40d402f706816e" 1
"0930159addcdc35e7dc18812522d4377" 1
"096844af91d2e266767775b0bee9105e" 1
"09884af1bb9d59803de0c74d6df57c23" 1
"09e03748da35e9d799dc5d8ddf1909b5" 1
"0a4ce4a7941ff6d1f5c217bf5a9a3bf9" 1
"0a5db40dc58e97927b407c9210aab7ba" 2
"0a5db40dc58e97927b407c9210aab7ba" 2
"0a73c992955231650965ed87e3bd52f6" 1
"0a84ab77fff74c247a525dfde8ce988c" 3
"0a84ab77fff74c247a525dfde8ce988c" 3
"0a84ab77fff74c247a525dfde8ce988c" 3
"0af333ae400f75930125bb0585f0dcf5" 1
"0af73334d9d2166191f3385de48f15d2" 1
"0b341ac8f396a8cdb88b7c658f66f653" 2
"0b341ac8f396a8cdb88b7c658f66f653" 2
"0b35cf4beb830b361d7c164371f25149" 2
"0b35cf4beb830b361d7c164371f25149" 2
"0b3e110c9765e14a5c41fadcc3cfc300" .
"0b6681f0f441e69c26106ab344ac0733" 1
"0b8d8253a8415275dbc2619e039985bb" 3
"0b8d8253a8415275dbc2619e039985bb" 3
"0b8d8253a8415275dbc2619e039985bb" 3
"0b92c26375117bf42945c04d8d6573d4" 2
"0b92c26375117bf42945c04d8d6573d4" 2
"0ba961f437f43105c357403c920bdef1" 1
"0bb601fabe1fdfa794a5272408997a2f" 1
"0c75b36e91363d596dc46bd563c3f5ef" 1
"0d461328a3bae7164ce7d3a10f366812" 1
"0d4cc4eb459301a804cbef22914f44a3" 1
"0d4e29e11bb94e922112089f3fec61ef" 2
"0d4e29e11bb94e922112089f3fec61ef" 2
"0d513c74d667f55c8f4a9836c304149c" 1
"0da25de126bb3b3ee565eff8888004c2" 2
"0da25de126bb3b3ee565eff8888004c2" 2
"0db9ae1f2201577f431b7603d0819fa6" 1
"0dd8a681f6a5d4c888831a591e57a747" 1
"0e05d6958d878368b5fb831211fad6a1" 1
"0e3ff41e0e2b2cb5ec336fd0b04e5d44" 1
"0f61e560ab56b8fea1f2593d7d3b2718" 2
"0f61e560ab56b8fea1f2593d7d3b2718" 2
"0f69f1f998984d37f133185179d63c60" 1
"1037032886a93e66406a4c910d1ef747" 2
"1037032886a93e66406a4c910d1ef747" 2
"1044b81b354b420e85ae835ea07de2d6" 1
"10620fc488346291281212a404681386" 1
"1074389c469944edf026d193a55b1148" 1
"1090d5a678119b03cddab609289a4d3c" 1
"111eebb45cef2211a2a2ff0219095e6a" 1
"11ddcbc8de8ef56cbc578fc81b602ffc" 1
"11f22488513cf717c333786c789b0289" 2
"11f22488513cf717c333786c789b0289" 2
"121552b22cee2a1eb4360b4d2534cd39" 1
"1251d707c5dc9243dc45d04beb7c3493" 1
"125689659bb3821fa81698dd72462773" 1
"127ba572433921c5bb408fc62eb9b5d7" 1
"129bea3f73e84e37d77d55fadfeb49dd" 1
"12e8dc6fb87822be26d6678cee9644f5" 1
"12f05a65f771c9675c2c5e9cdbfc33d1" 2
"12f05a65f771c9675c2c5e9cdbfc33d1" 2
"13d2bc86f1a19ed2959cd7354bc92d1d" 1
"13db5ede38e2ae1da17884c9a18df202" 1
"13f946e50df8ad74d7cf9fa05b4ad05b" 1
"146c4b8be7996a9789873fe55a47ab41" 1
"147fadd87da13a0271225d944d2a5e98" 1
"14a1dcfa015343bbefaac9a3a45769e5" 2
"14a1dcfa015343bbefaac9a3a45769e5" 2
"14d1377f74a63ffa29db2d99e7f6a1ce" 1
"150017d944a87b4c61f90034380c0659" 1
"150f6ca1ea453260eabf3472d3ebcad1" 1
end
You can go
bysort record_id: gen aneurysm_id = _n
but the results will be arbitrary unless there is some other information, say a date variable, to provide a rationale for the ordering. Let's suppose that there is a date variable date that is numeric and in good order. Then
bysort record_id (date) : gen aneurysm_id = _n
would be a suitable modification. For date read also date-time if time of day is noted and notable.
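Whichever ordering you settle on, you can sanity-check the result against the posted data, where aneurysmNumber already records how many copies each record_id has. A minimal sketch, run after either command above:
* the identifier runs 1, 2, ... within record_id, so it can never exceed the count
assert aneurysm_id <= aneurysmNumber if !missing(aneurysmNumber)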

Reshaping dataset wide

I have a dataset from a small clinic which looks something like this:
What I am trying to do is make the top long form of the dataset look like the bottom wide form.
My code is the following:
reform date injury_code_1 .... , i(ID) j(VisitNum)
The error message I get is this:
There are variables other than a, b, ID, VisitNum in your data. They must be constant within ID because that is the only way they can fit into wide data without loss of information.
The variable or variables listed above are not constant within ID. Perhaps the values are in error. Type reshape error for a list of the problem observations.
Either that, or the values vary because they should vary, in which case you must either add the variables to the list of xij variables to be reshaped, or drop them.
Why is my code wrong?
Using the data as illustrated in the screenshot, the following works for me:
clear
input ID VisitNum str6 date Injury_1 Injury_2 Injury_3 gender
1 1 "12-Mar" 1 2 3 0
2 1 "2-Apr" 4 . . 1
1 2 "23-Jun" 1 2 . 0
3 1 "1-Feb" 5 6 . 1
1 3 "30-Aug" 8 9 10 0
end
reshape wide date Injury_1 Injury_2 Injury_3, i(ID) j(VisitNum)
order ID gender
list, abbreviate(15)
     +----------------------------------------------------------------------------------------------------------------------------------------------------+
     | ID   gender    date1   Injury_11   Injury_21   Injury_31    date2   Injury_12   Injury_22   Injury_32    date3   Injury_13   Injury_23   Injury_33 |
     |----------------------------------------------------------------------------------------------------------------------------------------------------|
  1. |  1        0   12-Mar           1           2           3   23-Jun           1           2           .   30-Aug           8           9          10 |
  2. |  2        1    2-Apr           4           .           .                    .           .           .                    .           .           . |
  3. |  3        1    1-Feb           5           6           .                    .           .           .                    .           .           . |
     +----------------------------------------------------------------------------------------------------------------------------------------------------+
Note that, as posted, your command is not valid Stata syntax: the command is reshape, not reform.

SAS - How to keep the earliest date, handling missing values

I need to create a new variable that repeats the earliest date within an ID's visits; when the date is missing, the new variable should be missing too, and after a missing it should carry the earliest date seen since that missing (as in the example). I've tried the LAG function and it didn't work; I also tried retaining the value, but that just repeats 25NOV2015 for all records. The final result ("what I need") is in the last column.
Thanks
Example
You need to use the RETAIN statement. RETAIN means that your variable's value is not reinitialized to missing at each observation, so in the next iteration of the DATA step the variable remembers its value.
Sample data
data a;
  input date;
  format date ddmmyy10.;
  datalines;
.
5
6
7
.
1
2
.
9
;
run;
Solution
data b;
  set a;
  retain new_date;                        /* carry the value across observations */
  format new_date ddmmyy10.;
  if date = . then new_date = .;          /* a missing date resets the retained value */
  if new_date = . then new_date = date;   /* keep the first date seen after a reset */
run;
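On the sample data these two IF statements produce the following new_date sequence (shown as raw day counts rather than formatted dates): a missing date resets the retained value, and the first non-missing date after a reset is then carried forward:
date:     .  5  6  7  .  1  2  .  9
new_date: .  5  5  5  .  1  1  .  9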
Since you didn't post any data, I will make some up. Also, since the fact that your variable is a date doesn't really affect the answer, I will just use some integers, as they are easier to type.
data have ;
  input id value @@ ;   /* trailing @@ reads several observations per line */
cards;
1 . 1 2 1 3 1 . 1 5 1 6 1 . 1 8
2 1 2 2 2 3 2 . 2 5 2 6
;;;;
Basically, your algorithm says that you want to store the value when either the current value or the stored value is missing. With multiple BY groups you would also want to reset it when you start a new group.
data want ;
  set have ;
  by id ;
  retain new_value ;
  /* store the current value at the start of each ID, or whenever
     either the stored value or the current value is missing */
  if first.id or missing(new_value) or missing(value)
    then new_value = value ;
run;
Results:
Obs    id    value    new_value

  1     1      .          .
  2     1      2          2
  3     1      3          2
  4     1      .          .
  5     1      5          5
  6     1      6          5
  7     1      .          .
  8     1      8          8
  9     2      1          1
 10     2      2          1
 11     2      3          1
 12     2      .          .
 13     2      5          5
 14     2      6          5

SAS Merge Issue

So, I'm familiar with merges in SAS and haven't had issues before, but I ran into an issue today that I have never seen before.
For some reason, while the observations are matched properly even in more complex datasets, the result only keeps the variable values from one of the datasets (e.g. it doesn't overwrite missing values).
For instance, I wrote up this simple program:
data dataset1;
  input id var1 var2 var3 var4;
cards;
1 . . 2 2
2 . . 2 2
3 . . 2 2
4 . . 2 2
5 . . 2 2
6 . . 2 2
7 . . 2 2
8 . . 2 2
9 . 2 . 2
10 1 . . .
;
run;
data dataset2;
  input id var1 var2 var3 var4;
cards;
1 2 2 . .
2 2 2 . .
3 2 2 . .
4 2 2 . .
5 2 2 . .
6 2 2 . .
7 2 2 . .
8 2 2 . .
10 . 1 . .
;
run;
data dataset3;
  merge dataset1 dataset2;
  by id;
run;
This should yield the following:
id var1 var2 var3 var4
1 2 2 2 2
2 2 2 2 2
3 2 2 2 2
4 2 2 2 2
5 2 2 2 2
6 2 2 2 2
7 2 2 2 2
8 2 2 2 2
9 . 2 . 2
10 1 1 . .
but instead, I get this:
id var1 var2 var3 var4
1 2 2 . .
2 2 2 . .
3 2 2 . .
4 2 2 . .
5 2 2 . .
6 2 2 . .
7 2 2 . .
8 2 2 . .
9 . 2 . 2
10 . 1 . .
So, it's as if the merge is merging the observations and then just displaying the second data set's values.
I've tried to figure out the issue (I have a feeling it's something very basic I've overlooked), but I have no idea what's happening, since I've never come across this before.
Anyone know what's going wrong?
Thanks for any help.
Your problem is that you are merging the datasets by ID, but both datasets have the variables VAR1-VAR4. So when both datasets contribute to an observation, the one that is listed last in the MERGE statement will "win".
The reason you probably never saw this before is that normally, when you merge two datasets, the only variables they have in common are the key variables, so the fact that the values read from the first dataset are overwritten by the values read from the second dataset didn't matter.
To get what you want you can use the UPDATE statement instead. UPDATE will not replace a value with a missing value; basically, it is designed to apply transactions to a master dataset.
Since it looks like each ID has only one observation in DATASET1, you can just use DATASET1 as your master dataset.
data want ;
  update dataset1 dataset2 ;
  by id ;
run;
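This refusal to overwrite with missing values is controlled by UPDATE's UPDATEMODE= option, whose default is MISSINGCHECK. If you ever want the opposite behavior, so that missing values in the transaction dataset do overwrite the master, a minimal sketch:
data want2 ;
  update dataset1 dataset2 updatemode=nomissingcheck ;
  by id ;
run;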

Simple counting across columns

In Stata, if you have data such as this:
Location   Person 1 Gifts   Person 2 Gifts   Person 3 Gifts   Person 4 Gifts
1               2                7                1
2               4                1               12                2
3               5                5                5                5
4               4                1
What is the easiest way to create a new variable, over_three_less_than_six, counting how many people per location gave 3 or more gifts but fewer than 6? I want it to ignore missing values. So in the above example the new column would output:
over_three_less_than_six
0
1
4
1
I beg to differ on style in variable naming! I assume variables named gift1 ... gift4:
* add 1 for each person whose gifts fall in [3, 5];
* inrange() returns 0 when gift`j' is missing, so missings are ignored
gen count = 0
quietly forval j = 1/4 {
    replace count = count + inrange(gift`j', 3, 5)
}
For a detailed review of rowwise technique, see also:
SJ-9-1 pr0046 . . . . . . . . . . . . . . . . . . . Speaking Stata: Rowwise
(help rowsort, rowranks if installed) . . . . . . . . . . . N. J. Cox
Q1/09 SJ 9(1):137--157
shows how to exploit functions, egen functions, and Mata
for working rowwise; rowsort and rowranks are introduced
.pdf freely available at http://www.stata-journal.com/sjpdf.html?articlenum=pr0046
inlist(gift`j', 3, 4, 5)
would also work instead of the inrange() call.
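An egen alternative in the same rowwise spirit is anycount(); a sketch, assuming the gift variables hold integers, since anycount() matches only the listed integer values:
egen count2 = anycount(gift1 gift2 gift3 gift4), values(3 4 5)
Like inrange() above, anycount() treats missing values as non-matches, so they are ignored.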