How can I combine categories? - stata

I have a variable fruit with the following categories:
1
2
3
4
5
6
7
8
9
10
20
25
I want to collapse these as below:
1
2
3
4
5+
How can I do this?

Consider your example:
clear
input fruit
1
2
3
4
5
6
7
8
9
10
20
25
end
tabulate fruit
      fruit |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1        8.33        8.33
          2 |          1        8.33       16.67
          3 |          1        8.33       25.00
          4 |          1        8.33       33.33
          5 |          1        8.33       41.67
          6 |          1        8.33       50.00
          7 |          1        8.33       58.33
          8 |          1        8.33       66.67
          9 |          1        8.33       75.00
         10 |          1        8.33       83.33
         20 |          1        8.33       91.67
         25 |          1        8.33      100.00
------------+-----------------------------------
      Total |         12      100.00
The following works for me:
replace fruit = 5 if fruit >= 5
tabulate fruit
      fruit |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1        8.33        8.33
          2 |          1        8.33       16.67
          3 |          1        8.33       25.00
          4 |          1        8.33       33.33
          5 |          8       66.67      100.00
------------+-----------------------------------
      Total |         12      100.00
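One small caveat worth adding: Stata treats missing values as greater than any number, so fruit >= 5 would also recode missings to 5. A minimal sketch that guards against that, keeps the original variable intact, and labels the collapsed category (fruit5 is a hypothetical new variable name):
generate fruit5 = fruit
replace fruit5 = 5 if fruit >= 5 & !missing(fruit)
* show the top group as "5+" in tables
label define fruit5lbl 5 "5+"
label values fruit5 fruit5lbl
tabulate fruit5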

Related

Subtract Set Value at Aggregated Level

Values are for two groups by quarter.
In DAX, I need to summarize all the data but also need to subtract 5 from each quarter (20 for the full year) in 2021 for Group 1, without allowing the value to go below 0.
This only impacts Group 1 in 2021.
However, I also need to retain the underlying detail rows without the adjustment, so I can't do this in Power Query.
Data:
Group   Date        Value
1       01/01/2020  10
1       02/01/2020  9
1       03/01/2020  10
1       04/01/2020  8
1       05/01/2020  10
1       06/01/2020  11
1       07/01/2020  18
1       08/01/2020  2
1       09/01/2020  1
1       10/01/2020  0
1       11/01/2020  1
1       12/01/2020  0
1       01/01/2021  1
1       02/01/2021  12
1       03/01/2021  12
1       04/01/2021  3
1       05/01/2021  13
1       06/01/2021  14
1       07/01/2021  7
1       08/01/2021  1
1       09/01/2021  0
1       10/01/2021  1
1       11/01/2021  2
1       12/01/2021  1
2       01/01/2020  18
2       02/01/2020  7
2       03/01/2020  6
2       04/01/2020  8
2       05/01/2020  12
2       06/01/2020  13
2       07/01/2020  14
2       08/01/2020  8
2       09/01/2020  7
2       10/01/2020  6
2       11/01/2020  5
2       12/01/2020  4
2       01/01/2021  12
2       02/01/2021  18
2       03/01/2021  19
2       04/01/2021  20
2       05/01/2021  12
2       06/01/2021  12
2       07/01/2021  7
2       08/01/2021  18
2       09/01/2021  16
2       10/01/2021  15
2       11/01/2021  13
2       12/01/2021  1
Result:
Qtr/Year   Group 1 Value   Group 2 Value   Total
Q1-2020    29              31              60
Q2-2020    29              33              62
Q3-2020    21              29              50
Q4-2020    1               15              16
2020       80              108             188
Q1-2021    20              49              69
Q2-2021    25              44              69
Q3-2021    3               41              44
Q4-2021    0               29              29
2021       48              163             211
I'd suggest summarizing at the Year/Quarter/Group granularity and summing that up as follows:
SumValue =
VAR Summary =
    SUMMARIZE (
        Table2,
        Table2[Year],
        Table2[Qtr],
        Table2[Group],
        "#RawValue", SUM ( Table2[Value] ),
        "#RemoveValue", IF ( Table2[Year] = 2021 && Table2[Group] = 1, 5 )
    )
RETURN
    SUMX ( Summary, MAX ( [#RawValue] - [#RemoveValue], 0 ) )
(This assumes the amount to remove for a year is the same as for four quarters.)
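(Side note on the IF with no else branch: in DAX it returns BLANK() for the rows that don't match, and subtracting BLANK() leaves #RawValue unchanged, so only Group 1 in 2021 is adjusted.)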

Duplication of data entries by id if they meet a certain condition

In the original choice data set, individuals (id) are captured making purchases (choice) among all the product options possible (assortchoice is a product code). Every individual always faces the same set of products to choose from; as a result the value of choice is always either 0 or 1 ("was the product chosen or not?").
clear
input
id assortchoice choice sumchoice
2 12 1 2
2 13 0 2
2 14 0 2
2 15 0 2
2 16 0 2
2 17 0 2
2 18 0 2
2 19 0 2
2 20 0 2
2 21 0 2
2 22 0 2
2 23 1 2
3 12 1 1
3 13 0 1
3 14 0 1
3 15 0 1
3 16 0 1
3 17 0 1
3 18 0 1
3 19 0 1
3 20 0 1
3 21 0 1
3 22 0 1
3 23 0 1
4 12 1 3
4 13 0 3
4 14 1 3
4 15 1 3
4 16 0 3
4 17 0 3
4 18 0 3
4 19 0 3
4 20 0 3
4 21 0 3
4 22 0 3
4 23 0 3
end
I created the following code to understand how many choices were made by each individual:
egen sumchoice=total(choice), by(id)
In this example, individual 3 (id = 3) chose only one product (sumchoice = 1), individual 2 made two choices (sumchoice = 2), and individual 4 made three (sumchoice = 3).
Since this is choice data, I need to transform all instances of multiple choices into sets of single choices.
What I mean by that: if an individual made two purchases, I need to duplicate the choice set for that individual twice; for an individual who made three purchases, I need to replicate it three times, so the final structure looks like the data set below.
clear
input
id transaction assortchoice choice
2 1 12 1
2 1 13 0
2 1 14 0
2 1 15 0
2 1 16 0
2 1 17 0
2 1 18 0
2 1 19 0
2 1 20 0
2 1 21 0
2 1 22 0
2 1 23 0
2 2 12 0
2 2 13 0
2 2 14 0
2 2 15 0
2 2 16 0
2 2 17 0
2 2 18 0
2 2 19 0
2 2 20 0
2 2 21 0
2 2 22 0
2 2 23 1
3 1 12 1
3 1 13 0
3 1 14 0
3 1 15 0
3 1 16 0
3 1 17 0
3 1 18 0
3 1 19 0
3 1 20 0
3 1 21 0
3 1 22 0
3 1 23 0
4 1 12 1
4 1 13 0
4 1 14 0
4 1 15 0
4 1 16 0
4 1 17 0
4 1 18 0
4 1 19 0
4 1 20 0
4 1 21 0
4 1 22 0
4 1 23 0
4 2 12 0
4 2 13 0
4 2 14 1
4 2 15 0
4 2 16 0
4 2 17 0
4 2 18 0
4 2 19 0
4 2 20 0
4 2 21 0
4 2 22 0
4 2 23 0
4 3 12 0
4 3 13 0
4 3 14 0
4 3 15 1
4 3 16 0
4 3 17 0
4 3 18 0
4 3 19 0
4 3 20 0
4 3 21 0
4 3 22 0
4 3 23 0
end
*** Update:
transaction indicates the order of the transaction:
bysort id assortchoice (choice): gen transaction=_n
Hence, choice = 1 should appear only once per transaction.
The answer isn't quite "use expand" as there is a twist that you don't want exact replicates.
* one copy of the full choice set per purchase made by that individual
expand sumchoice
* within each id and product, keep choice == 1 in only one of the copies
bysort id assortchoice (choice) : replace choice = 0 if _n != _N & choice == 1
list if id == 2 , sepby(assortchoice)
+-----------------------------------+
| id assort~e choice sumcho~e |
|-----------------------------------|
1. | 2 12 0 2 |
2. | 2 12 1 2 |
|-----------------------------------|
3. | 2 13 0 2 |
4. | 2 13 0 2 |
|-----------------------------------|
5. | 2 14 0 2 |
6. | 2 14 0 2 |
|-----------------------------------|
7. | 2 15 0 2 |
8. | 2 15 0 2 |
|-----------------------------------|
9. | 2 16 0 2 |
10. | 2 16 0 2 |
|-----------------------------------|
11. | 2 17 0 2 |
12. | 2 17 0 2 |
|-----------------------------------|
13. | 2 18 0 2 |
14. | 2 18 0 2 |
|-----------------------------------|
15. | 2 19 0 2 |
16. | 2 19 0 2 |
|-----------------------------------|
17. | 2 20 0 2 |
18. | 2 20 0 2 |
|-----------------------------------|
19. | 2 21 0 2 |
20. | 2 21 0 2 |
|-----------------------------------|
21. | 2 22 0 2 |
22. | 2 22 0 2 |
|-----------------------------------|
23. | 2 23 0 2 |
24. | 2 23 1 2 |
+-----------------------------------+
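To also add the transaction identifier from the desired output (each transaction containing exactly one choice == 1, with chosen products numbered in assortchoice order), a sketch along these lines should work, starting again from the original, un-expanded data; chosenrank is a hypothetical helper variable:
* rank each chosen product within id, in assortchoice order
bysort id (assortchoice): generate chosenrank = sum(choice)
replace chosenrank = . if choice == 0
* one copy of the full choice set per purchase
expand sumchoice
bysort id assortchoice: generate transaction = _n
* keep choice == 1 only in the transaction matching that product's rank
replace choice = 0 if choice == 1 & transaction != chosenrank
sort id transaction assortchoice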

Conditionally create new observations

I have data in the following format (there are a lot more variables):
year ID Dummy
1495 65 1
1496 65 1
1501 65 1
1502 65 1
1520 65 0
1522 65 0
What I am trying to achieve is to conditionally create new observations that fill in the data between two points in time, based on a dummy. If the dummy is equal to 1, the intervening years should be filled in; if it is equal to 0, they should not be.
For example:
year ID Dummy
1495 65 1
1496 65 1
1497 65 1
1498 65 1
.
.
1501 65 1
1502 65 1
1503 65 1
1504 65 1
.
.
.
1520 65 0
1522 65 0
Here's one way to do this:
clear
input year id dummy
1495 65 1
1496 65 1
1501 65 1
1502 65 1
1520 65 0
1522 65 0
end
* tag observations with dummy == 1 that are followed by a different year
generate tag = year[_n] != year[_n+1] & dummy == 1
* size of the gap to the next observation
generate delta = year[_n] - year[_n+1] if tag
replace delta = . if abs(delta) == 1
* expand each tagged observation to one copy per year up to the next observed year
expand abs(delta) if tag & delta != .
sort year
* number the duplicates and shift their years forward to fill the gap
bysort year: egen seq = seq() if delta != .
replace seq = seq - 1
replace seq = 0 if seq == .
replace year = year + seq if year != .
drop tag delta seq
The above code snippet will produce:
list
+-------------------+
| year id dummy |
|-------------------|
1. | 1495 65 1 |
2. | 1496 65 1 |
3. | 1497 65 1 |
4. | 1498 65 1 |
5. | 1499 65 1 |
|-------------------|
6. | 1500 65 1 |
7. | 1501 65 1 |
8. | 1502 65 1 |
9. | 1503 65 1 |
10. | 1504 65 1 |
|-------------------|
11. | 1505 65 1 |
12. | 1506 65 1 |
13. | 1507 65 1 |
14. | 1508 65 1 |
15. | 1509 65 1 |
|-------------------|
16. | 1510 65 1 |
17. | 1511 65 1 |
18. | 1512 65 1 |
19. | 1513 65 1 |
20. | 1514 65 1 |
|-------------------|
21. | 1515 65 1 |
22. | 1516 65 1 |
23. | 1517 65 1 |
24. | 1518 65 1 |
25. | 1519 65 1 |
|-------------------|
26. | 1520 65 0 |
27. | 1522 65 0 |
+-------------------+
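An alternative that avoids the intermediate tag/seq bookkeeping, and that also works with several id values, is to expand each dummy == 1 observation by the size of the gap to the next year directly. A sketch under those assumptions (gap is a hypothetical helper variable):
* gap to the next observed year, only where filling is allowed
bysort id (year): generate gap = year[_n+1] - year if dummy == 1
* one copy per missing year (plus the original)
expand gap if gap > 1 & !missing(gap)
* shift the copies forward so the years become consecutive
bysort id year: replace year = year + _n - 1
sort id year
drop gap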

select minimum value by ID, over range of visits

I'm trying to extract a variable holding the lowest value over a range of visits. In this case:
I want the lowest value over the first 3 days of admission (admission day 1, 2, or 3), by visitID. Any suggestions?
visitID value day of admission
1 941 1
1 948 2
1 935 4
2 83 1
2 84 2
2 50 4
2 79 5
and I would want:
visitID value day minvalue
1 941 1 941
1 948 2 941
1 935 4 941
2 83 1 83
2 84 2 83
2 50 4 83
2 79 5 83
It would have been helpful if you had presented your data in an easily usable form. But here's an approach that should point you in a useful direction.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte visitid int value byte day
1 941 1
1 948 2
1 935 4
2 83 1
2 84 2
2 50 4
2 79 5
end
bysort visitid (day) : egen minvalue = min(cond(day<=3,value,.))
Which results in
. list, sepby(visitid)
+----------------------------------+
| visitid value day minvalue |
|----------------------------------|
1. | 1 941 1 941 |
2. | 1 948 2 941 |
3. | 1 935 4 941 |
|----------------------------------|
4. | 2 83 1 83 |
5. | 2 84 2 83 |
6. | 2 50 4 83 |
7. | 2 79 5 83 |
+----------------------------------+
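The cond(day<=3, value, .) part passes value to min() only for days 1 to 3 and missing otherwise, and egen's min() ignores missings. For readers less comfortable with cond(), an equivalent two-step sketch (value_first3 and minvalue2 are hypothetical names):
* keep value only for the first three admission days
generate value_first3 = value if day <= 3
* minimum of those values within each visit
bysort visitid: egen minvalue2 = min(value_first3)
drop value_first3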

Frequency table with group variable

I have a dataset with firm level data.
I have a variable employees (an integer) and a variable nace2 (an integer indicating which industry or service sector the company belongs to).
I have created a third variable for grouping employees:
gen employees_cat = .
replace employees_cat = 1 if employees >=0 & employees<10
replace employees_cat = 2 if employees >=10 & employees<20
replace employees_cat = 3 if employees >=20 & employees<49
replace employees_cat = 4 if employees >=49 & employees<249
replace employees_cat = 5 if employees >=249
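One caveat about the last line: Stata treats missing values as larger than any number, so employees >= 249 would also put firms with a missing employee count into category 5. A minimal guard, assuming such firms should stay missing:
* keep firms with unknown employee counts out of the top category
replace employees_cat = 5 if employees >= 249 & !missing(employees)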
I would like to create a frequency table showing how many employees work in every nace2 sector per employees_cat.
As a reproducible example take
sysuse auto.dta
Let's try to get a frequency table showing the overall mileage (mpg) of all domestic / foreign cars that have a trunk space of 11, 12, 16, etc.
The starting point for frequency tabulations in Stata is tabulate which can show one- and two-way breakdowns. Used with by: multi-way breakdowns can be produced as a series of two-way tables. See also table.
With the variables you mention in the auto data there are 21 distinct values for mpg and 18 for trunk, so a two-way table would be 21 x 18 or 18 x 21 with many empty cells, as the number of observations, 74, is much smaller than the product 378. (Here the community-contributed command distinct is used to count distinct values: type search distinct in Stata for literature references and the latest version of the code to download.)
. sysuse auto, clear
(1978 Automobile Data)
. distinct mpg trunk
------------------------------
| total distinct
-------+----------------------
mpg | 74 21
trunk | 74 18
------------------------------
One way around this problem is to collapse the tabulation into a list with typical entry {row variable, column variable, frequency information}. This is offered by the program groups, which must be installed first, as here:
. ssc inst groups
. groups trunk mpg
+-------------------------------+
| trunk mpg Freq. Percent |
|-------------------------------|
| 5 28 1 1.35 |
| 6 23 1 1.35 |
| 7 18 1 1.35 |
| 7 24 2 2.70 |
| 8 21 1 1.35 |
|-------------------------------|
| 8 24 1 1.35 |
| 8 26 1 1.35 |
| 8 30 1 1.35 |
| 8 35 1 1.35 |
| 9 22 1 1.35 |
|-------------------------------|
| 9 28 1 1.35 |
| 9 29 1 1.35 |
| 9 31 1 1.35 |
| 10 21 1 1.35 |
| 10 24 1 1.35 |
|-------------------------------|
| 10 25 1 1.35 |
| 10 26 2 2.70 |
| 11 17 1 1.35 |
| 11 18 1 1.35 |
| 11 22 1 1.35 |
|-------------------------------|
| 11 23 1 1.35 |
| 11 28 1 1.35 |
| 11 30 1 1.35 |
| 11 34 1 1.35 |
| 11 35 1 1.35 |
|-------------------------------|
| 12 22 1 1.35 |
| 12 23 1 1.35 |
| 12 25 1 1.35 |
| 13 19 3 4.05 |
| 13 21 1 1.35 |
|-------------------------------|
| 14 14 1 1.35 |
| 14 17 1 1.35 |
| 14 18 1 1.35 |
| 14 19 1 1.35 |
| 15 14 1 1.35 |
|-------------------------------|
| 15 17 1 1.35 |
| 15 18 1 1.35 |
| 15 25 1 1.35 |
| 15 41 1 1.35 |
| 16 14 3 4.05 |
|-------------------------------|
| 16 18 1 1.35 |
| 16 19 3 4.05 |
| 16 20 2 2.70 |
| 16 21 1 1.35 |
| 16 22 1 1.35 |
|-------------------------------|
| 16 25 1 1.35 |
| 17 16 3 4.05 |
| 17 18 1 1.35 |
| 17 19 1 1.35 |
| 17 20 1 1.35 |
|-------------------------------|
| 17 22 1 1.35 |
| 17 25 1 1.35 |
| 18 12 1 1.35 |
| 20 14 1 1.35 |
| 20 15 1 1.35 |
|-------------------------------|
| 20 16 1 1.35 |
| 20 18 2 2.70 |
| 20 21 1 1.35 |
| 21 17 1 1.35 |
| 21 18 1 1.35 |
|-------------------------------|
| 22 12 1 1.35 |
| 23 15 1 1.35 |
+-------------------------------+
groups has many more options, which are documented in its help. But it extends easily to multi-way tables also collapsed to lists, as here with a third grouping variable:
. groups foreign trunk mpg, sepby(foreign trunk)
+------------------------------------------+
| foreign trunk mpg Freq. Percent |
|------------------------------------------|
| Domestic 7 18 1 1.35 |
| Domestic 7 24 2 2.70 |
|------------------------------------------|
| Domestic 8 26 1 1.35 |
| Domestic 8 30 1 1.35 |
|------------------------------------------|
| Domestic 9 22 1 1.35 |
| Domestic 9 28 1 1.35 |
| Domestic 9 29 1 1.35 |
|------------------------------------------|
| Domestic 10 21 1 1.35 |
| Domestic 10 24 1 1.35 |
| Domestic 10 26 1 1.35 |
|------------------------------------------|
| Domestic 11 17 1 1.35 |
| Domestic 11 22 1 1.35 |
| Domestic 11 28 1 1.35 |
| Domestic 11 34 1 1.35 |
|------------------------------------------|
| Domestic 12 22 1 1.35 |
|------------------------------------------|
| Domestic 13 19 3 4.05 |
| Domestic 13 21 1 1.35 |
|------------------------------------------|
| Domestic 14 19 1 1.35 |
|------------------------------------------|
| Domestic 15 14 1 1.35 |
| Domestic 15 18 1 1.35 |
|------------------------------------------|
| Domestic 16 14 3 4.05 |
| Domestic 16 18 1 1.35 |
| Domestic 16 19 3 4.05 |
| Domestic 16 20 2 2.70 |
| Domestic 16 22 1 1.35 |
|------------------------------------------|
| Domestic 17 16 3 4.05 |
| Domestic 17 18 1 1.35 |
| Domestic 17 19 1 1.35 |
| Domestic 17 20 1 1.35 |
| Domestic 17 22 1 1.35 |
| Domestic 17 25 1 1.35 |
|------------------------------------------|
| Domestic 18 12 1 1.35 |
|------------------------------------------|
| Domestic 20 14 1 1.35 |
| Domestic 20 15 1 1.35 |
| Domestic 20 16 1 1.35 |
| Domestic 20 18 2 2.70 |
| Domestic 20 21 1 1.35 |
|------------------------------------------|
| Domestic 21 17 1 1.35 |
| Domestic 21 18 1 1.35 |
|------------------------------------------|
| Domestic 22 12 1 1.35 |
|------------------------------------------|
| Domestic 23 15 1 1.35 |
|------------------------------------------|
| Foreign 5 28 1 1.35 |
|------------------------------------------|
| Foreign 6 23 1 1.35 |
|------------------------------------------|
| Foreign 8 21 1 1.35 |
| Foreign 8 24 1 1.35 |
| Foreign 8 35 1 1.35 |
|------------------------------------------|
| Foreign 9 31 1 1.35 |
|------------------------------------------|
| Foreign 10 25 1 1.35 |
| Foreign 10 26 1 1.35 |
|------------------------------------------|
| Foreign 11 18 1 1.35 |
| Foreign 11 23 1 1.35 |
| Foreign 11 30 1 1.35 |
| Foreign 11 35 1 1.35 |
|------------------------------------------|
| Foreign 12 23 1 1.35 |
| Foreign 12 25 1 1.35 |
|------------------------------------------|
| Foreign 14 14 1 1.35 |
| Foreign 14 17 1 1.35 |
| Foreign 14 18 1 1.35 |
|------------------------------------------|
| Foreign 15 17 1 1.35 |
| Foreign 15 25 1 1.35 |
| Foreign 15 41 1 1.35 |
|------------------------------------------|
| Foreign 16 21 1 1.35 |
| Foreign 16 25 1 1.35 |
+------------------------------------------+
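Translating back to the variables named in the question (nace2, employees_cat, and employees are the asker's names; this is a sketch assuming they exist as described): groups nace2 employees_cat gives firm counts per cell, while the total number of employees per cell can be obtained with collapse on a preserved copy of the data.
* firm counts per industry and size category
groups nace2 employees_cat
* total employees per industry and size category
preserve
collapse (sum) employees, by(nace2 employees_cat)
list, sepby(nace2)
restore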