Counting observations with duplicate IDs - Stata

I have a dataset that I am converting from wide to long format.
I currently have 1 observation per patient, and each patient can have up to 5 aneurysms, recorded in wide format.
I am trying to re-arrange this dataset so that I have one observation per aneurysm instead. I have done so successfully, but now I need to label the aneurysms in a new variable called aneurysmIdentifier.
Here is a glimpse at the data. You can see how, when a patient has 4 aneurysms, I have successfully created 4 corresponding observations; however, these are duplicates created via the expand function.
I am stuck at the next step, which, as mentioned, is creating a new variable aneurysmIdentifier that reads 1 if there is only one copy of a given record_id, 1 and 2 if there are two copies, and so forth all the way up to 1-2-3-4-5. This would give me a point of reference for what I call aneurysm 1, 2, 3, 4 and 5, so I can keep re-arranging the data accordingly.
I have created this sketch, which hopefully shows what I mean: it counts how many duplicates there are and then counts forward up to the maximum of 5.
Can anyone push me in the right direction on how to achieve this?
Example of data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str32 record_id float aneurysmNumber
"007128de18ce5cb1635b8f27c5435ff3" 1
"00abd7bdb6283dd0ac6b97271608a122" 1
"0142103f84693c6eda416dfc55f65de1" 1
"0153826d93a58d7e1837bb98a3c21ba8" 1
"01c729ac4601e36f245fd817d8977917" 2
"01c729ac4601e36f245fd817d8977917" 2
"01dd90093fbf201a1f357e22eaff6b6a" 1
"0208e14dcabc43dd2b57e2e8b117de4d" 1
"0210f575075e5def7ffa77530ce17ef0" 1
"022cc7a9397e81cf58cd9111f9d1db0d" 1
"02afd543116a22fc7430620727b20bb5" 1
"0303ef0bd5d256cca1c836e2b70415ac" 2
"0303ef0bd5d256cca1c836e2b70415ac" 2
"041b2b0cac589d6e3b65bb924803cf1a" 1
"0536317a2bbb936e85c3eb8294b076da" 1
"06161d4668f217937cac0ac033d8d199" 1
"065e151f8bcebb27fabf8b052fd70566" 4
"065e151f8bcebb27fabf8b052fd70566" 4
"065e151f8bcebb27fabf8b052fd70566" 4
"065e151f8bcebb27fabf8b052fd70566" 4
"07196414cd6bf89d94a33e149983d102" 1
"0721c38f8275dab504fc53aebcc005ce" 4
"0721c38f8275dab504fc53aebcc005ce" 4
"0721c38f8275dab504fc53aebcc005ce" 4
"0721c38f8275dab504fc53aebcc005ce" 4
"07bef516d53279a3f5e477d56d552a2b" 1
"08678829b7e0ee6a01b17974b4d19cfa" 1
"08bb6c65e63c499ea19ac24d5113dd94" 1
"08f036417500c332efd555c76c4654a0" 1
"090c54d021b4b21c7243cec01efbeb91" 1
"09166bb44e4c5cdb8f40d402f706816e" 1
"0930159addcdc35e7dc18812522d4377" 1
"096844af91d2e266767775b0bee9105e" 1
"09884af1bb9d59803de0c74d6df57c23" 1
"09e03748da35e9d799dc5d8ddf1909b5" 1
"0a4ce4a7941ff6d1f5c217bf5a9a3bf9" 1
"0a5db40dc58e97927b407c9210aab7ba" 2
"0a5db40dc58e97927b407c9210aab7ba" 2
"0a73c992955231650965ed87e3bd52f6" 1
"0a84ab77fff74c247a525dfde8ce988c" 3
"0a84ab77fff74c247a525dfde8ce988c" 3
"0a84ab77fff74c247a525dfde8ce988c" 3
"0af333ae400f75930125bb0585f0dcf5" 1
"0af73334d9d2166191f3385de48f15d2" 1
"0b341ac8f396a8cdb88b7c658f66f653" 2
"0b341ac8f396a8cdb88b7c658f66f653" 2
"0b35cf4beb830b361d7c164371f25149" 2
"0b35cf4beb830b361d7c164371f25149" 2
"0b3e110c9765e14a5c41fadcc3cfc300" .
"0b6681f0f441e69c26106ab344ac0733" 1
"0b8d8253a8415275dbc2619e039985bb" 3
"0b8d8253a8415275dbc2619e039985bb" 3
"0b8d8253a8415275dbc2619e039985bb" 3
"0b92c26375117bf42945c04d8d6573d4" 2
"0b92c26375117bf42945c04d8d6573d4" 2
"0ba961f437f43105c357403c920bdef1" 1
"0bb601fabe1fdfa794a5272408997a2f" 1
"0c75b36e91363d596dc46bd563c3f5ef" 1
"0d461328a3bae7164ce7d3a10f366812" 1
"0d4cc4eb459301a804cbef22914f44a3" 1
"0d4e29e11bb94e922112089f3fec61ef" 2
"0d4e29e11bb94e922112089f3fec61ef" 2
"0d513c74d667f55c8f4a9836c304149c" 1
"0da25de126bb3b3ee565eff8888004c2" 2
"0da25de126bb3b3ee565eff8888004c2" 2
"0db9ae1f2201577f431b7603d0819fa6" 1
"0dd8a681f6a5d4c888831a591e57a747" 1
"0e05d6958d878368b5fb831211fad6a1" 1
"0e3ff41e0e2b2cb5ec336fd0b04e5d44" 1
"0f61e560ab56b8fea1f2593d7d3b2718" 2
"0f61e560ab56b8fea1f2593d7d3b2718" 2
"0f69f1f998984d37f133185179d63c60" 1
"1037032886a93e66406a4c910d1ef747" 2
"1037032886a93e66406a4c910d1ef747" 2
"1044b81b354b420e85ae835ea07de2d6" 1
"10620fc488346291281212a404681386" 1
"1074389c469944edf026d193a55b1148" 1
"1090d5a678119b03cddab609289a4d3c" 1
"111eebb45cef2211a2a2ff0219095e6a" 1
"11ddcbc8de8ef56cbc578fc81b602ffc" 1
"11f22488513cf717c333786c789b0289" 2
"11f22488513cf717c333786c789b0289" 2
"121552b22cee2a1eb4360b4d2534cd39" 1
"1251d707c5dc9243dc45d04beb7c3493" 1
"125689659bb3821fa81698dd72462773" 1
"127ba572433921c5bb408fc62eb9b5d7" 1
"129bea3f73e84e37d77d55fadfeb49dd" 1
"12e8dc6fb87822be26d6678cee9644f5" 1
"12f05a65f771c9675c2c5e9cdbfc33d1" 2
"12f05a65f771c9675c2c5e9cdbfc33d1" 2
"13d2bc86f1a19ed2959cd7354bc92d1d" 1
"13db5ede38e2ae1da17884c9a18df202" 1
"13f946e50df8ad74d7cf9fa05b4ad05b" 1
"146c4b8be7996a9789873fe55a47ab41" 1
"147fadd87da13a0271225d944d2a5e98" 1
"14a1dcfa015343bbefaac9a3a45769e5" 2
"14a1dcfa015343bbefaac9a3a45769e5" 2
"14d1377f74a63ffa29db2d99e7f6a1ce" 1
"150017d944a87b4c61f90034380c0659" 1
"150f6ca1ea453260eabf3472d3ebcad1" 1
end

You can go
bysort record_id: gen aneurysm_id = _n
but the results will be arbitrary unless there is some other information, say a date variable, to provide a rationale for the ordering. Let's suppose that there is a date variable date that is numeric and in good order. Then
bysort record_id (date) : gen aneurysm_id = _n
would be a suitable modification. For date read also date-time if time of day is noted and notable.
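As a minimal sketch on the -dataex- excerpt above (no date variable is assumed, so the order within record_id is arbitrary; aneurysmIdentifier is the name asked for in the question, and aneurysmCount is just a hypothetical check variable):
bysort record_id : gen aneurysmIdentifier = _n
bysort record_id : gen aneurysmCount = _N
// aneurysmCount should agree with aneurysmNumber wherever the latter is nonmissing
list record_id aneurysmNumber aneurysmIdentifier in 1/10, sepby(record_id)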

Related

In Stata, how can I only analyze observations with repeated measures using the mixed command?

I have a dataset on multiple outcomes for individuals in two groups that were treated (or not treated) by an intervention at two time points. However, not every individual has complete data for each measure at each time point.
id   outcome      outcome_value   group   time
1    depression   10              1       1
1    depression   8               1       2
2    depression   10              2       1
2    depression   .               2       2
1    anxiety      12              1       1
1    anxiety      8               1       2
2    anxiety      12              2       1
2    anxiety      6               2       2
How do I exclude IDs that do not have an outcome in both periods? I only want to see how outcomes changed between groups over time for observations that have data in all periods. I am using the mixed command in Stata to conduct this analysis.
First drop the missing rows
keep if !missing(outcome_value)
Then, keep the ID/outcome combinations that have _N==2
bysort id outcome: keep if _N==2
Output:
id outcome outco~ue group time ct
1 anxiety 8 1 2 2
1 anxiety 12 1 1 2
1 depression 10 1 1 2
1 depression 8 1 2 2
2 anxiety 6 2 2 2
2 anxiety 12 2 1 2
As @NickCox has pointed out in the comments, while we cannot directly combine these two, there is still a one-line approach:
bysort id outcome (time) : keep if !missing(outcome_value[1], outcome_value[2])
Of note, we cannot do this:
bysort id outcome : keep if !missing(outcome_value) & _N==2
because _N still counts the rows with missing outcome_value; the group size is not reduced until after those rows have actually been removed.
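As a minimal sketch, assuming the example table above is read in with -input- (variable and value names taken from the question), the one-liner keeps exactly the id/outcome pairs observed at both time points:
clear
input id str10 outcome outcome_value group time
1 "depression" 10 1 1
1 "depression" 8 1 2
2 "depression" 10 2 1
2 "depression" . 2 2
1 "anxiety" 12 1 1
1 "anxiety" 8 1 2
2 "anxiety" 12 2 1
2 "anxiety" 6 2 2
end
* keep only groups whose outcome is nonmissing at both time points
bysort id outcome (time) : keep if !missing(outcome_value[1], outcome_value[2])
list, noobs sepby(id outcome)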

Giving subjects a binary id they keep for every period

In Stata I have a list of subjects and contributions from an economic experiment.
There are multiple rounds being played for each treatment. Now I want to keep track of those who contributed in the first period and give them either 1 if a contributor or 0 if a defector. The game is played for multiple periods, but I only really care about the first round. My current code looks like this
g firstroundcont = 0
replace firstroundcont = 1 if c>0 & period==1
This, however, results in everyone getting a 0 for every subsequent period, meaning that they are not "identified" as either a first-round contributor or a defector in all other periods of the dataset. The table below shows a snippet of how my data look and how the variable firstroundcont should look.
sessionID   period   subject   group   contribution   firstroundcont
1           1        1         1       4              1
1           1        2         1       0              0
1           1        3         1       2              1
1           1        4         2       10             1
1           1        5         2       0              0
1           1        6         2       0              0
1           2        1         1       0              1
1           2        2         1       5              0
1           2        3         1       0              1
@JR96 is right: this sorely and surely needs a data example. But I guess you want something with the flavour of
bysort id (period) : gen wanted = c[1] > 0
See https://www.stata.com/support/faqs/data-management/creating-dummy-variables/ and https://www.stata-journal.com/article.html?article=dm0099 for more on how to get indicators in one step. The business of generating with 0 and then replacing with 1 can usually be cut to a direct one-line statement.
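Adapted to the column names in the snippet above, a sketch might look like the following (assuming subjects are uniquely identified by sessionID and subject together, and that contribution is never missing in period 1, since a missing value would also count as greater than 0):
bysort sessionID subject (period) : gen firstroundcont = contribution[1] > 0
// contribution[1] is the period-1 value once observations are sorted by period within subject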

Stata: append by ID and time stamp

I have two datasets. One dataset here
contains information on product assortment at the grocery store/day level. These data reflect all the products that were available at a store on a given day.
Another data set
contains data on individuals who visited those stores on a given day.
As you can see in screenshot 2 the same person (highlighted, panid=1101758) only bought 2 products: Michelob and Sam Adams in week 1677 2 at store 234140, whereas we know that overall 4 options were available to that individual in that store on that same day, i.e. 2 additional Budweisers (screenshot 1, highlighted obs.)
I need to merge/append these two datasets at the store/day level for each individual, in a way that the final dataset shows that a person made those two purchases and, in addition, that there were two more products available to that individual at that store/day. Thus, that specific individual will have 4 observations: 2 purchased and 2 more available options. I have various stores, days, and individuals.
input store day brand
1 1 "Bud"
1 1 "Bud"
1 1 "Michelob"
1 1 "Sam Adams"
1 1 "Coors"
end
input hh store day brand
1 1 1 "Michelob"
1 1 1 "Sam Adams"
2 1 1 "Bud"
2 1 1 "Bud"
3 1 1 "Coors"
end
In the Stata code above you can see that it was another individual who purchased 2 Budweisers. For that individual a similar action has to also take place, where it can be shown that the individual had 4 options to choose from (Michelob, Sam Adams, Budweiser, Budweiser) but they ended up choosing only 2 Budweisers.
Here is the example of the end result I would like to receive:
input hh store day brand choice
1 1 1 "Michelob" 1
1 1 1 "Sam Adams" 1
1 1 1 "Bud" 0
1 1 1 "Bud" 0
1 1 1 "Coors" 0
2 1 1 "Bud" 1
2 1 1 "Bud" 1
2 1 1 "Michelob" 0
2 1 1 "Sam Adams" 0
2 1 1 "Coors" 0
3 1 1 "Coors" 1
3 1 1 "Michelob" 0
3 1 1 "Sam Adams" 0
3 1 1 "Bud" 0
3 1 1 "Bud" 0
Here's one way to do it. It involves creating an indicator for repeated products within store and day, using joinby to create all possible combinations between hh and products by store and day, and finally a merge to get the choice variable.
// Import hh data
clear
input hh store day str9 brand
1 1 1 "Michelob"
1 1 1 "Sam Adams"
2 1 1 "Bud"
2 1 1 "Bud"
3 1 1 "Coors"
end
// Create number of duplicate products for merging
bysort store day brand: gen n_brand = _n
gen choice = 1
tempfile hh hh_join
save `hh'
// Create dataset for use with joinby to create all possible combinations
// of hh and products per day/store
drop brand n_brand choice
duplicates drop
save `hh_join'
// Import store data
clear
input store day str9 brand
1 1 "Bud"
1 1 "Bud"
1 1 "Michelob"
1 1 "Sam Adams"
1 1 "Coors"
end
// Create number of duplicate products for merging
bysort store day brand: gen n_brand = _n
// Create all possible combinations of hh and products per day/store
joinby store day using `hh_join'
order hh store day brand n_brand
sort hh store day brand n_brand
// Merge with hh data to get choice variable
merge 1:1 hh store day brand n_brand using `hh'
drop _merge
// Replace choice with 0 if missing
replace choice = 0 if missing(choice)
list, noobs sepby(hh)
And the result:
. list, noobs sepby(hh)
+-------------------------------------------------+
| hh store day brand n_brand choice |
|-------------------------------------------------|
| 1 1 1 Bud 1 0 |
| 1 1 1 Bud 2 0 |
| 1 1 1 Coors 1 0 |
| 1 1 1 Michelob 1 1 |
| 1 1 1 Sam Adams 1 1 |
|-------------------------------------------------|
| 2 1 1 Bud 1 1 |
| 2 1 1 Bud 2 1 |
| 2 1 1 Coors 1 0 |
| 2 1 1 Michelob 1 0 |
| 2 1 1 Sam Adams 1 0 |
|-------------------------------------------------|
| 3 1 1 Bud 1 0 |
| 3 1 1 Bud 2 0 |
| 3 1 1 Coors 1 1 |
| 3 1 1 Michelob 1 0 |
| 3 1 1 Sam Adams 1 0 |
+-------------------------------------------------+

Count the total of unique numbers occurring in a range of cells

Hello, this is my data sample:
coustmer_NO id
1 5
1 13
2 4
2 4
2 4
3 4
3 10
4 8
4 8
Using SQL, I would like to count, for each customer, how many different IDs they have.
the expected output is:
coustmer_NO total_id
1 2
2 1
3 2
4 1
I guess there is a typo in your data. The result should be:
coustmer_NO total_id
1 2
2 1
3 2
4 1
You can do the following:
SELECT coustmer_NO, count(distinct id) AS total_id FROM <table_name> GROUP BY coustmer_NO;
Try this query in MySQL:
select coustmer_NO, count(distinct id) as 'total_id' from table_name group by coustmer_NO;

Two Way EntityCollection Binding to a Two Dimension Data Matrix

I have a Day Structure table, which has the following columns I want to display:
DoW HoD Value
1 1 1
1 2 2
1 3 2
1 4 2
1 5 2
1 6 2
1 7 2
1 8 2
1 9 2
1 10 2
1 11 4
1 12 4
1 13 4
1 14 4
1 15 4
1 16 4
1 17 4
1 18 4
1 19 4
1 20 4
1 21 1
1 22 1
1 23 1
1 24 1
DoW is the day of week (Monday etc.), HoD is the hour of day, and Value is the actual value.
Now I want to bind this Day Structure EntityCollection directly to a control so that any changes are bound TwoWay.
Like this Format:
I think the best way to achieve this is to use a template and/or a converter, but I just don't know how ;)
I already read this article, but the lack of TwoWay binding functionality makes it not useful for me :(
I hope you can help me.
Jonny
Again I solved it on my own ;)
For this problem I created a Grid with a fixed amount of rows and columns. Inside this Grid I put an ItemsControl bound to my list of data. Inside the DataTemplate I placed a TextBox bound to the current value, and bound the Grid.Row and Grid.Column properties to the day of the week / hour of day.
Pro:
The TextBox is TwoWay-databound to a certain object or element.
Very easy to implement if the row and column properties are numeric.
Con:
Limited to a fixed amount of rows/columns.
A lot of code to write in XAML (copy and paste).
Kind of "dirty" code; it doesn't feel like the best way to do it.
I'm still open to other suggestions.