I have a little problem with the RETAIN statement.
I want code that can remember one date and fill a new field when, in my dataset, var > 0 and my date is null.
What I want you can see below: my "data have" is the same as "data want" without the newdate field. I tried writing code, but something doesn't work.
Please take a look and help me with this strange retaining case.
data want;
input dateid person var mydate :yymmdd10. newdate:yymmdd10. ;
format mydate date9. newdate date9.;
datalines;
20210101 1 0 . .
20210102 1 0 . .
20210103 1 0 . .
20210104 1 1 2019-04-28 2019-04-28
20210105 1 2 2019-04-28 2019-04-28
20210106 1 0 . .
20210107 1 1 2019-04-30 2019-04-30
20210108 1 2 2019-04-30 2019-04-30
20210109 1 3 2019-04-30 2019-04-30
20210110 1 4 2019-04-30 2019-04-30
20210111 1 5 2019-04-30 2019-04-30
20210112 1 4 . .
20210113 1 3 . .
20210114 1 5 2019-05-05 2019-04-30
20210115 1 5 2019-05-05 2019-04-30
20210116 1 4 . .
20210117 1 3 . .
20210118 1 2 . .
20210119 1 1 . .
20210120 1 0 . .
20210121 1 1 2019-06-06 2019-06-06
20210122 1 0 . .
20210101 2 0 . .
20210102 2 1 2019-04-28 2019-04-28
20210103 2 2 2019-04-28 2019-04-28
20210104 2 0 . .
20210105 2 0 . .
20210106 2 0 . .
20210107 2 1 2019-04-30 2019-04-30
20210108 2 2 2019-04-30 2019-04-30
20210109 2 3 2019-04-30 2019-04-30
20210110 2 4 2019-04-30 2019-04-30
20210111 2 0 . .
20210112 2 0 . .
20210113 2 0 . .
20210114 2 1 2019-05-05 2019-05-05
20210115 2 0 . .
20210116 2 1 2019-06-05 2019-06-05
20210117 2 2 2019-06-05 2019-06-05
20210118 2 3 2019-06-05 2019-06-05
20210119 2 4 2019-06-05 2019-06-05
20210120 2 5 2019-06-05 2019-06-05
20210121 2 4 . .
20210122 2 5 2019-06-06 2019-06-05
;
run;
Data want2 (drop= data_check);
set want (drop=newdate);
by person;
format newdate date9.;
format DATA_CHECK date9.;
retain DATA_CHECK;
IF var > 0 THEN DATA_CHECK = mydate;
IF (mydate = .) and (var > 0) THEN newdate = DATA_CHECK;
RUN;
Thank you!
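For what it's worth, two things defeat the RETAIN here: on every row where var > 0 but mydate is missing, the statement IF var > 0 THEN DATA_CHECK = mydate; overwrites the retained value with missing, and DATA_CHECK is never reset between runs of var > 0 or between persons. Judging from the expected output, the rule seems to be: carry the first non-missing mydate of each consecutive var > 0 run, and write it to newdate only on rows where mydate itself is non-missing. A minimal sketch of that reading, checked only against the sample above:
data want2 (drop=data_check);
set want (drop=newdate);
by person;
format newdate data_check date9.;
retain data_check;
/* reset the carried date at a new person or when the var > 0 run ends */
if first.person or var = 0 then data_check = .;
/* capture the first non-missing date of the current run */
if var > 0 and not missing(mydate) and missing(data_check) then data_check = mydate;
/* fill newdate only on rows that record a date themselves */
if var > 0 and not missing(mydate) then newdate = data_check;
run;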
I have two datasets. One dataset contains information on product assortment at the grocery store/day level; it reflects all the products that were available at a store on a given day. Another dataset contains data on individuals who visited those stores on a given day.
As you can see in screenshot 2, the same person (highlighted, panid=1101758) bought only 2 products, Michelob and Sam Adams, in week 1677 at store 234140, whereas we know that overall 4 options were available to that individual in that store on that same day, i.e. the 2 additional Budweisers (screenshot 1, highlighted obs.).
I need to merge/append these two datasets at the store/day level for each individual so that the final dataset shows that the person made those two purchases and that, in addition, two more options were available to that individual at that store/day. Thus, that specific individual will have 4 observations: 2 purchased and 2 more available options. I have various stores, days, and individuals.
clear
input store day str9 brand
1 1 "Bud"
1 1 "Bud"
1 1 "Michelob"
1 1 "Sam Adams"
1 1 "Coors"
end
clear
input hh store day str9 brand
1 1 1 "Michelob"
1 1 1 "Sam Adams"
2 1 1 "Bud"
2 1 1 "Bud"
3 1 1 "Coors"
end
In the Stata code above, you can see that it was another individual who purchased the 2 Budweisers. For that individual, a similar action also has to take place, where it can be shown that the individual had 4 options to choose from (Michelob, Sam Adams, Budweiser, Budweiser), but they ended up choosing only the 2 Budweisers.
Here is the example of the end result I would like to receive:
input hh store day brand choice
1 1 1 "Michelob" 1
1 1 1 "Sam Adams" 1
1 1 1 "Bud" 0
1 1 1 "Bud" 0
1 1 1 "Coors" 0
2 1 1 "Bud" 1
2 1 1 "Bud" 1
2 1 1 "Michelob" 0
2 1 1 "Sam Adams" 0
2 1 1 "Coors" 0
3 1 1 "Coors" 1
3 1 1 "Michelob" 0
3 1 1 "Sam Adams" 0
3 1 1 "Bud" 0
3 1 1 "Bud" 0
Here's one way to do it. It involves creating an indicator for repeated products within store and day, using joinby to create all possible combinations between hh and products by store and day, and finally a merge to get the choice variable.
// Import hh data
clear
input hh store day str9 brand
1 1 1 "Michelob"
1 1 1 "Sam Adams"
2 1 1 "Bud"
2 1 1 "Bud"
3 1 1 "Coors"
end
// Create number of duplicate products for merging
bysort store day brand: gen n_brand = _n
gen choice = 1
tempfile hh hh_join
save `hh'
// Create dataset for use with joinby to create all possible combinations
// of hh and products per day/store
drop brand n_brand choice
duplicates drop
save `hh_join'
// Import store data
clear
input store day str9 brand
1 1 "Bud"
1 1 "Bud"
1 1 "Michelob"
1 1 "Sam Adams"
1 1 "Coors"
end
// Create number of duplicate products for merging
bysort store day brand: gen n_brand = _n
// Create all possible combinations of hh and products per day/store
joinby store day using `hh_join'
order hh store day brand n_brand
sort hh store day brand n_brand
// Merge with hh data to get choice variable
merge 1:1 hh store day brand n_brand using `hh'
drop _merge
// Replace choice with 0 if missing
replace choice = 0 if missing(choice)
list, noobs sepby(hh)
And the result:
. list, noobs sepby(hh)
+-------------------------------------------------+
| hh store day brand n_brand choice |
|-------------------------------------------------|
| 1 1 1 Bud 1 0 |
| 1 1 1 Bud 2 0 |
| 1 1 1 Coors 1 0 |
| 1 1 1 Michelob 1 1 |
| 1 1 1 Sam Adams 1 1 |
|-------------------------------------------------|
| 2 1 1 Bud 1 1 |
| 2 1 1 Bud 2 1 |
| 2 1 1 Coors 1 0 |
| 2 1 1 Michelob 1 0 |
| 2 1 1 Sam Adams 1 0 |
|-------------------------------------------------|
| 3 1 1 Bud 1 0 |
| 3 1 1 Bud 2 0 |
| 3 1 1 Coors 1 1 |
| 3 1 1 Michelob 1 0 |
| 3 1 1 Sam Adams 1 0 |
+-------------------------------------------------+
I have a dataset that I am converting from wide to long format.
Currently I have 1 observation per patient, and each patient can have up to 5 aneurysms, recorded in wide format.
I am trying to re-arrange this dataset so that I have one observation per aneurysm instead. I have done so successfully, but now I need to label the aneurysms in a new variable called aneurysmIdentifier.
Here is a glimpse at the data. You can see how, when a patient has 4 aneurysms, I have successfully created 4 corresponding observations; however, these are duplicates created via the expand command.
I am stuck at the next point, which, as mentioned, is creating a new variable aneurysmIdentifier that reads 1 if there is only one copy of a specific record_id, 1 and 2 if there are two copies, and so forth, all the way to 1-2-3-4-5. This would give me a point of reference as to which aneurysm I call 1, 2, 3, 4, and 5, so I can keep re-arranging the data accordingly.
I have created this sketch, hopefully showcasing what I mean. As you can see, it counts how many duplicates there are and then counts forward up to the maximum of 5.
Can anyone push me in the right direction on how to achieve this?
Example of data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str32 record_id float aneurysmNumber
"007128de18ce5cb1635b8f27c5435ff3" 1
"00abd7bdb6283dd0ac6b97271608a122" 1
"0142103f84693c6eda416dfc55f65de1" 1
"0153826d93a58d7e1837bb98a3c21ba8" 1
"01c729ac4601e36f245fd817d8977917" 2
"01c729ac4601e36f245fd817d8977917" 2
"01dd90093fbf201a1f357e22eaff6b6a" 1
"0208e14dcabc43dd2b57e2e8b117de4d" 1
"0210f575075e5def7ffa77530ce17ef0" 1
"022cc7a9397e81cf58cd9111f9d1db0d" 1
"02afd543116a22fc7430620727b20bb5" 1
"0303ef0bd5d256cca1c836e2b70415ac" 2
"0303ef0bd5d256cca1c836e2b70415ac" 2
"041b2b0cac589d6e3b65bb924803cf1a" 1
"0536317a2bbb936e85c3eb8294b076da" 1
"06161d4668f217937cac0ac033d8d199" 1
"065e151f8bcebb27fabf8b052fd70566" 4
"065e151f8bcebb27fabf8b052fd70566" 4
"065e151f8bcebb27fabf8b052fd70566" 4
"065e151f8bcebb27fabf8b052fd70566" 4
"07196414cd6bf89d94a33e149983d102" 1
"0721c38f8275dab504fc53aebcc005ce" 4
"0721c38f8275dab504fc53aebcc005ce" 4
"0721c38f8275dab504fc53aebcc005ce" 4
"0721c38f8275dab504fc53aebcc005ce" 4
"07bef516d53279a3f5e477d56d552a2b" 1
"08678829b7e0ee6a01b17974b4d19cfa" 1
"08bb6c65e63c499ea19ac24d5113dd94" 1
"08f036417500c332efd555c76c4654a0" 1
"090c54d021b4b21c7243cec01efbeb91" 1
"09166bb44e4c5cdb8f40d402f706816e" 1
"0930159addcdc35e7dc18812522d4377" 1
"096844af91d2e266767775b0bee9105e" 1
"09884af1bb9d59803de0c74d6df57c23" 1
"09e03748da35e9d799dc5d8ddf1909b5" 1
"0a4ce4a7941ff6d1f5c217bf5a9a3bf9" 1
"0a5db40dc58e97927b407c9210aab7ba" 2
"0a5db40dc58e97927b407c9210aab7ba" 2
"0a73c992955231650965ed87e3bd52f6" 1
"0a84ab77fff74c247a525dfde8ce988c" 3
"0a84ab77fff74c247a525dfde8ce988c" 3
"0a84ab77fff74c247a525dfde8ce988c" 3
"0af333ae400f75930125bb0585f0dcf5" 1
"0af73334d9d2166191f3385de48f15d2" 1
"0b341ac8f396a8cdb88b7c658f66f653" 2
"0b341ac8f396a8cdb88b7c658f66f653" 2
"0b35cf4beb830b361d7c164371f25149" 2
"0b35cf4beb830b361d7c164371f25149" 2
"0b3e110c9765e14a5c41fadcc3cfc300" .
"0b6681f0f441e69c26106ab344ac0733" 1
"0b8d8253a8415275dbc2619e039985bb" 3
"0b8d8253a8415275dbc2619e039985bb" 3
"0b8d8253a8415275dbc2619e039985bb" 3
"0b92c26375117bf42945c04d8d6573d4" 2
"0b92c26375117bf42945c04d8d6573d4" 2
"0ba961f437f43105c357403c920bdef1" 1
"0bb601fabe1fdfa794a5272408997a2f" 1
"0c75b36e91363d596dc46bd563c3f5ef" 1
"0d461328a3bae7164ce7d3a10f366812" 1
"0d4cc4eb459301a804cbef22914f44a3" 1
"0d4e29e11bb94e922112089f3fec61ef" 2
"0d4e29e11bb94e922112089f3fec61ef" 2
"0d513c74d667f55c8f4a9836c304149c" 1
"0da25de126bb3b3ee565eff8888004c2" 2
"0da25de126bb3b3ee565eff8888004c2" 2
"0db9ae1f2201577f431b7603d0819fa6" 1
"0dd8a681f6a5d4c888831a591e57a747" 1
"0e05d6958d878368b5fb831211fad6a1" 1
"0e3ff41e0e2b2cb5ec336fd0b04e5d44" 1
"0f61e560ab56b8fea1f2593d7d3b2718" 2
"0f61e560ab56b8fea1f2593d7d3b2718" 2
"0f69f1f998984d37f133185179d63c60" 1
"1037032886a93e66406a4c910d1ef747" 2
"1037032886a93e66406a4c910d1ef747" 2
"1044b81b354b420e85ae835ea07de2d6" 1
"10620fc488346291281212a404681386" 1
"1074389c469944edf026d193a55b1148" 1
"1090d5a678119b03cddab609289a4d3c" 1
"111eebb45cef2211a2a2ff0219095e6a" 1
"11ddcbc8de8ef56cbc578fc81b602ffc" 1
"11f22488513cf717c333786c789b0289" 2
"11f22488513cf717c333786c789b0289" 2
"121552b22cee2a1eb4360b4d2534cd39" 1
"1251d707c5dc9243dc45d04beb7c3493" 1
"125689659bb3821fa81698dd72462773" 1
"127ba572433921c5bb408fc62eb9b5d7" 1
"129bea3f73e84e37d77d55fadfeb49dd" 1
"12e8dc6fb87822be26d6678cee9644f5" 1
"12f05a65f771c9675c2c5e9cdbfc33d1" 2
"12f05a65f771c9675c2c5e9cdbfc33d1" 2
"13d2bc86f1a19ed2959cd7354bc92d1d" 1
"13db5ede38e2ae1da17884c9a18df202" 1
"13f946e50df8ad74d7cf9fa05b4ad05b" 1
"146c4b8be7996a9789873fe55a47ab41" 1
"147fadd87da13a0271225d944d2a5e98" 1
"14a1dcfa015343bbefaac9a3a45769e5" 2
"14a1dcfa015343bbefaac9a3a45769e5" 2
"14d1377f74a63ffa29db2d99e7f6a1ce" 1
"150017d944a87b4c61f90034380c0659" 1
"150f6ca1ea453260eabf3472d3ebcad1" 1
end
You can go
bysort record_id: gen aneurysm_id = _n
but the results will be arbitrary unless there is some other information, say a date variable, to provide a rationale for the ordering. Let's suppose that there is a date variable date that is numeric and in good order. Then
bysort record_id (date) : gen aneurysm_id = _n
would be a suitable modification. For date read also date-time if time of day is noted and notable.
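If you want the variable under the name used in the question, plus a check that the copies line up with the recorded counts, something like the following should do (n_copies is just a throwaway helper variable):
bysort record_id: gen aneurysmIdentifier = _n
// sanity check: the number of rows per patient should equal aneurysmNumber
bysort record_id: gen n_copies = _N
assert aneurysmNumber == n_copies if !missing(aneurysmNumber)
drop n_copies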
I would like to add a summary record after each group of records belonging to a specific shop. So, I have this:
Shop_id Trans_id Count
1 1 10
1 2 23
1 3 12
2 1 8
2 2 15
And I want to have this:
Shop_id Trans_id Count
1 1 10
1 2 23
1 3 12
. . 45
2 1 8
2 2 15
. . 23
I have done this using PROC SQL, but I would like to do it using PROC REPORT, as I have read that PROC REPORT should handle such cases.
Try this:
data have;
input shop_id Trans_id Count;
cards;
1 1 10
1 2 23
1 3 12
2 1 8
2 2 15
;
proc report data=have out=want(drop=_:);
define shop_id/group;
define trans_id/order;
define count/sum;
/* add a summary row after each shop_id group */
break after shop_id/summarize;
compute after shop_id;
/* blank out shop_id on the summary rows */
if _break_='shop_id' then call missing(shop_id);
endcomp;
run;
* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 householdID byte(childID HHmemberID)
"0940041260" 1 3
"0940041030" 1 .
"0940041030" 2 .
"0940041030" 3 3
"0940041030" 4 .
"0940041030" 5 .
"0940041110" 1 3
"0940041100" 2 3
"0940041100" 3 4
"0940041100" 4 .
"0940041080" 1 .
"0940041080" 2 .
"0940041080" 3 .
"0940041060" 1 3
"0940041140" 1 .
"0940041180" 1 .
"0940041010" 1 .
"0940041010" 2 .
"0940041040" 1 .
"0940041040" 2 .
"0940041190" 1 .
"0940041190" 2 .
"0940041220" 1 3
"0940041160" 1 3
"0940041170" 1 .
"0940041170" 2 .
end
I am trying to compute the household size and the number of children each household has, but I don't know how to do that in Stata. Is there a way to deal with this problem? The greatest value of childID and of HHmemberID within a household represents those counts, but I don't know how to extract that information.
If you want this info in your original data, you can use extended generate (egen); its max function ignores missing values (and returns missing only if all values in a group are missing):
bysort householdID: egen N_members = max(HHmemberID)
bysort householdID: egen N_kids = max(childID)
If you want a new dataset with only that data, you should collapse:
collapse (max) N_members = HHmemberID N_kids = childID, by(householdID)
So, I'm familiar with merges in SAS and haven't had issues before, but today I noticed something that has never been an issue before.
For some reason the actual merging of observations works properly in more complex datasets; however, it only keeps the variable values from one of the datasets (e.g. it doesn't overwrite missing values).
For instance, I wrote up this simple program:
data dataset1;
input id var1 var2 var3 var4;
cards;
1 . . 2 2
2 . . 2 2
3 . . 2 2
4 . . 2 2
5 . . 2 2
6 . . 2 2
7 . . 2 2
8 . . 2 2
9 . 2 . 2
10 1 . . .
;
data dataset2;
input id var1 var2 var3 var4;
cards;
1 2 2 . .
2 2 2 . .
3 2 2 . .
4 2 2 . .
5 2 2 . .
6 2 2 . .
7 2 2 . .
8 2 2 . .
10 . 1 . .
;
data dataset3;
merge dataset1 dataset2;
by id;
run;
This should yield the following:
id var1 var2 var3 var4
1 2 2 2 2
2 2 2 2 2
3 2 2 2 2
4 2 2 2 2
5 2 2 2 2
6 2 2 2 2
7 2 2 2 2
8 2 2 2 2
9 . 2 . 2
10 1 1 . .
but instead, I get this:
id var1 var2 var3 var4
1 2 2 . .
2 2 2 . .
3 2 2 . .
4 2 2 . .
5 2 2 . .
6 2 2 . .
7 2 2 . .
8 2 2 . .
9 . 2 . 2
10 . 1 . .
So, it's as if the merge matches the observations and then just keeps the second dataset's values.
I've tried to figure out the issue (I have a feeling it's something very basic I've overlooked), but I have no idea what's happening, since I've never come across this before.
Does anyone know what's going wrong?
Thanks for any help.
Your problem is that you are merging the datasets by ID, but both datasets have the variables VAR1-VAR4. So when both datasets contribute to an observation, the one listed last in the MERGE statement will "win".
The reason you probably never saw this before is that normally, when you merge two datasets, the only variables they have in common are the key variables. So the fact that the values read from the first dataset were overwritten by the values read from the second dataset didn't matter.
To get what you want, you can use the UPDATE statement instead. UPDATE will not replace a value with a missing value; basically, it is designed to apply transactions to a master dataset.
Since it looks like each ID has only one observation in DATASET1, you can just use DATASET1 as your master dataset.
data want ;
update dataset1 dataset2 ;
by id ;
run;