Merging two observations - stata

I have a list of places with population, much like in the example data below:
sysuse census, clear
How can I combine (sum) only two observations to create a new observation, while maintaining the rest of the data?
In the below example I would like to combine Alabama and Alaska to create a new observation called 'Alabama & Alaska' with the sum of their populations.
With the new observation, the previous records will need to be deleted.
     +----------------------------+
     | state                  pop |
     |----------------------------|
  1. | Alabama          3,893,888 |
  2. | Alaska             401,851 |
  3. | Arizona          2,718,215 |
  4. | Arkansas         2,286,435 |
  5. | California      23,667,902 |
     +----------------------------+
     +-----------------------------------+
     | state                         pop |
     |-----------------------------------|
  1. | Alabama & Alaska        4,295,739 |  <--Alabama & Alaska combined
  2. | Arizona                 2,718,215 |  <--Retain other observations and variables
  3. | Arkansas                2,286,435 |
  4. | California             23,667,902 |
     +-----------------------------------+
This is my original toy data example:
PlaceName          Population
Town 1             100
Town 2             200
Town 3             100
Town 4             100
and its expected output:
PlaceName          Population
Town 1 & Town 2    300
Town 3             100
Town 4             100

Using your original toy example, the following works for me:
clear
input str6 PlaceName Population
"Town 1" 100
"Town 2" 200
"Town 3" 100
"Town 4" 100
end
generate PlaceName2 = cond(_n == 1, PlaceName + " & " + PlaceName[_n+1], PlaceName)
generate Population2 = cond(_n == 1, Population[_n+1] + Population, Population)
replace PlaceName2 = "" in 2
replace Population2 = . in 2
gsort - Population2
list, abbreviate(12)
     +--------------------------------------------------------+
     | PlaceName   Population        PlaceName2   Population2 |
     |--------------------------------------------------------|
  1. |    Town 1          100   Town 1 & Town 2           300 |
  2. |    Town 4          100            Town 4           100 |
  3. |    Town 3          100            Town 3           100 |
  4. |    Town 2          200                               . |
     +--------------------------------------------------------+
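If the absorbed record really has to go, as the question asks, the same idea can be finished in place. This is a minimal sketch, assuming the two observations to combine sit in rows 1 and 2 (as in the toy data). It overwrites PlaceName and Population rather than creating PlaceName2 and Population2.

```stata
clear
input str6 PlaceName Population
"Town 1" 100
"Town 2" 200
"Town 3" 100
"Town 4" 100
end

* fold observation 2 into observation 1, working on the original variables
replace Population = Population + Population[2] in 1
replace PlaceName  = PlaceName + " & " + PlaceName[2] in 1

* remove the observation that has been absorbed
drop in 2

list
```

Because nothing is collapsed, any other variables in the dataset stay as they are; observation 1 simply keeps its own values for them.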

This is how to do it with collapse. As you ask, this combines two observations into one, and thus changes the dataset.
clear
input str6 PlaceName Population
"Town 1" 100
"Town 2" 200
"Town 3" 100
"Town 4" 100
end
replace PlaceName = "Towns 1 and 2" in 1/2
collapse (sum) Population , by(PlaceName)
list
     +--------------------------+
     |     PlaceName   Popula~n |
     |--------------------------|
  1. |        Town 3        100 |
  2. |        Town 4        100 |
  3. | Towns 1 and 2        300 |
     +--------------------------+
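The same recipe carries over to the census example in the question. A sketch, assuming only state and pop are needed: collapse keeps just the by() variable and the variables it is asked to summarise, so any other variable you want to retain has to be listed with a suitable statistic.

```stata
sysuse census, clear
keep state pop

* give both states the same name so that collapse treats them as one group
replace state = "Alabama & Alaska" if inlist(state, "Alabama", "Alaska")

* one observation per distinct state name, with populations summed
collapse (sum) pop, by(state)

format pop %12.0fc
list in 1/4
```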

Computing a sequence of events over time and extract a percentage of duration

I have a dataset which stores events regarding the availability status of a room.
For example, if someone is entering the room at 8:30 am, I get the following row in my table :
# room status date
--- ---- -------- -------------------
0 A1 OCCUPIED 2022-01-01 08:30:00
A similar event is created when this person is leaving the room. My table would then look like this :
# room status date
--- ---- --------- -------------------
0 A1 OCCUPIED 2022-01-01 08:30:00
1 A1 AVAILABLE 2022-01-01 09:15:00
In practice, the table has way more entries, and data are intertwined.
# room status date
--- ---- --------- -------------------
0 A1 OCCUPIED 2022-01-01 08:30:00 <--
1 B4 OCCUPIED 2022-01-01 08:32:00
2 C2 OCCUPIED 2022-01-01 08:41:00
3 A1 AVAILABLE 2022-01-01 09:15:00 <--
4 C2 AVAILABLE 2022-01-01 09:20:00
5 A1 OCCUPIED 2022-01-01 09:30:00 <--
6 B4 AVAILABLE 2022-01-01 10:00:00
7 A1 AVAILABLE 2022-01-01 12:00:00 <--
I am currently looking for a way to extract a percentage/duration of availability from each of my rooms, but I don't know how to proceed.
I have created a few measures :
// A measure to count the total of status
Count status = COUNT(myTable[status])
// A calculated measure for available ones
Total available = CALCULATE([count status], myTable[status]=="AVAILABLE")
// A calculated measure for occupied ones
Total occupied = CALCULATE([count status], myTable[status]=="OCCUPIED")
I already have a date hierarchy which means I can change the granularity from year to month, to week day, to hour of the day. I can also apply a filter to select a range of hours, for example 8:00 to 18:00.
The problem is, the measures I have created simply count the number of changes that occur in a given period (in the chart below, the hours), but they don't reflect the actual duration of each event, which means that my graph is actually wrong.
If I take my room A1 as an example, in the actual configuration, my graph would look like this :
           ___ ___ ___ ___ ___ ___ ___ ___
          | 0 |   |   |   |   |   |   |   |
available |   | 50|   |   |100|   |   |   |
          |   |___|   |   |   |   |   |   |
          |100|   |   |   |   |   |   |   |
occupied  |   | 50|   |   | 0 |   |   |   |
          |___|___|___|___|___|___|___|___|
            8   9  10  11  12  13  14  15
In the column 8, 100% occupied because 1 entry in the dataset for this status vs 0 entry for "available".
In the column 9, 50-50 because 1 entry for each status (one at 09:15, the other at 09:30)
...
The result I am looking for is this one :
           ___ ___ ___ ___ ___ ___ ___ ___
          |   | 25| 0 | 0 |   |   |   |   |
available | 50|___|   |   |100|100|100|100|
          |___|   |   |   |   |   |   |   |
          |   | 75|100|100|   |   |   |   |
occupied  | 50|   |   |   | 0 | 0 | 0 | 0 |
          |___|___|___|___|___|___|___|___|
            8   9  10  11  12  13  14  15
In the column 8, I would get 50-50 because the room was available between 08:00 and 08:30, but then it was occupied
In the column 9, I would get 75% occupied because the room was only available between 09:15 and 09:30
In the column 10, I would get 100% occupied
...
Is it possible to get it through a DAX measure or do I need to restructure some of my data ?
The solution to your problem is to add a calculated column to your source table that holds the time of the next event in the same room. Category in the code below stands for your room column.
First, add an index by category (by room):
Event_asc =
VAR Current_Category = Table[Category]
RETURN
    RANKX (
        FILTER (
            Table,
            Table[Category] = Current_Category
        ),
        Table[DateTime], , ASC, Dense
    )
Then add this column:
Event_Next_Time =
VAR Current_Category = Table[Category]
VAR CurIndex = Table[Event_asc]
VAR Result =
    CALCULATE (
        MAX ( Table[DateTime] ),
        Table[Category] = Current_Category
            && Table[Event_asc] = CurIndex + 1,
        REMOVEFILTERS ()
    )
RETURN
    Result
Once you have it, add a third column that calculates the difference between the two datetimes (the event and the next event).
Lapse = DATEDIFF( Table[DateTime], Table[Event_Next_Time], SECOND )
The rest should be easy for you :-)

Create a new variable if value in var1 exists in var2

Assume I have a list_a variable with all possible sports played in the world:
football
tennis
hockey
cricket
croquet
racquetball
cricket
pingpong
squash
rugby
swimming
swimming
soccer
Also assume I have another variable list_b of only three sports:
cricket
hockey
swimming
I want to create a new variable Cont, which will equal 1 when the sports in list_a are found in list_b, and equal to 0 when the sport is not in list_b.
This is what variable Cont would look like:
0
0
1
1
0
0
1
0
0
0
1
1
0
Will the following work:
gen Cont = 0
replace Cont = 1 if (strmatch( list_a, ( list_b)))
EDIT:
Suppose list_a also contained hoccckey (which is a typo) but I still want it to get counted.
Is there a way to do that?
The answer is no because your approach will compare the values of the two variables in each observation. Instead, you need to compare the value at each row of list_a, with all values of variable list_b.
Using your toy example:
clear
input strL(list_a list_b)
football cricket
tennis hockey
hockey swimming
cricket
croquet
racquetball
cricket
pingpong
squash
rugby
swimming
swimming
soccer
end
The following illustrates the philosophy:
local obs = _N
generate Cont = 0
forvalues i = 1 / `obs' {
    forvalues j = 1 / `obs' {
        replace Cont = 1 if list_a[`i'] == list_b[`j'] in `i'
    }
}
list
+-------------------------------+
| list_a list_b Cont |
|-------------------------------|
1. | football cricket 0 |
2. | tennis hockey 0 |
3. | hockey swimming 1 |
4. | cricket 1 |
5. | croquet 0 |
|-------------------------------|
6. | racquetball 0 |
7. | cricket 1 |
8. | pingpong 0 |
9. | squash 0 |
10. | rugby 0 |
|-------------------------------|
11. | swimming 1 |
12. | swimming 1 |
13. | soccer 0 |
+-------------------------------+
EDIT:
If you have certain typos that you additionally want to take into account, you can combine my solution with @NickCox's. In the above loop, use instead:
replace Cont = 1 if inlist(list_a, "hoccckey") | list_a[`i'] == list_b[`j'] in `i'
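If the misspellings follow a pattern rather than being a fixed list, a regular expression is one option. This sketch assumes the only variation is a repeated c, as in hoccckey; the variable is_hockey is a name made up for the illustration.

```stata
clear
input str10 list_a
"hockey"
"hoccckey"
"cricket"
"hokey"
end

* ^hoc+key$ matches "ho", one or more "c", then "key", and nothing else
generate byte is_hockey = regexm(list_a, "^hoc+key$")

list
```

Anything looser than this (transposed letters, missing letters) needs a different pattern or a proper string-distance measure, so treat it as a starting point only.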
There is a simple technique that works fine for your toy example:
clear
input strL list_a
football
tennis
hockey
cricket
croquet
racquetball
cricket
pingpong
squash
rugby
swimming
swimming
soccer
end
gen wanted = inlist(list_a, "cricket", "hockey", "swimming")
list, sepby(wanted)
+----------------------+
| list_a wanted |
|----------------------|
1. | football 0 |
2. | tennis 0 |
|----------------------|
3. | hockey 1 |
4. | cricket 1 |
|----------------------|
5. | croquet 0 |
6. | racquetball 0 |
|----------------------|
7. | cricket 1 |
|----------------------|
8. | pingpong 0 |
9. | squash 0 |
10. | rugby 0 |
|----------------------|
11. | swimming 1 |
12. | swimming 1 |
|----------------------|
13. | soccer 0 |
+----------------------+
If you had many more values, you could loop over the distinct values sought, using levelsof if they are in a second variable, or put the candidates in a separate dataset and merge as explained in this FAQ.
All these techniques depend on exact equality of strings, so watch out for variations between upper and lower case, leading and trailing spaces and inconsistencies in spelling.
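The levelsof idea mentioned above might look like this. A sketch, assuming list_b sits in the same dataset as list_a and both are held as ordinary str variables, as in the earlier toy data:

```stata
clear
input str12 list_a str8 list_b
"football"    "cricket"
"tennis"      "hockey"
"hockey"      "swimming"
"cricket"     ""
"croquet"     ""
"racquetball" ""
"cricket"     ""
"pingpong"    ""
"squash"      ""
"rugby"       ""
"swimming"    ""
"swimming"    ""
"soccer"      ""
end

generate byte Cont = 0

* collect the distinct non-missing values of list_b in a local macro
levelsof list_b, local(targets)

* one pass over list_a per distinct value sought
foreach t of local targets {
    replace Cont = 1 if list_a == `"`t'"'
}

list, sepby(Cont)
```

This replaces the double loop over observations with one replace per distinct value of list_b, and it still depends on exact equality of strings.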

Save duplicates by id

I have two variables in Stata, id and price:
id price
1 4321
1 7634
1 7974
1 7634
1 3244
2 5943
2 3294
2 5645
2 3564
2 4321
2 4567
2 4567
2 4567
2 4567
3 5652
3 9586
3 5844
3 8684
3 2456
4 7634
Usually I can use the duplicates command to get the duplicate observations of a variable.
However, how can I create a new variable that will save the duplicates
of price for each id?
There is no reason that I can see for duplicates to work with by:. duplicates whatever price id is the general recipe with your example, to examine duplicates jointly for two variables. Consider
clear
input id price
1 4321
1 7634
1 7974
1 7634
1 3244
2 5943
2 3294
2 5645
2 3564
2 4321
2 4567
2 4567
2 4567
2 4567
3 5652
3 9586
3 5844
3 8684
3 2456
4 7634
end
. duplicates example id price
Duplicates in terms of id price
+------------------------------------+
| group: # e.g. obs id price |
|------------------------------------|
| 1 2 2 1 7634 |
| 2 4 11 2 4567 |
+------------------------------------+
. duplicates tag id price, gen(tag)
Duplicates in terms of id price
. list id price if tag , sepby(id)
+------------+
| id price |
|------------|
2. | 1 7634 |
4. | 1 7634 |
|------------|
11. | 2 4567 |
12. | 2 4567 |
13. | 2 4567 |
14. | 2 4567 |
+------------+
Beyond that, I am not clear exactly what output or data result you wish to see.
EDIT In response to comment, here are two more direct approaches. duplicates is based on the idea that duplicates are mostly unwanted; you seem to have the opposite point of view, in which case duplicates is oblique to your wants.
* approach 1
bysort price id : gen wanted = _n == 1 & _N > 1
list if wanted
+---------------------+
| id price wanted |
|---------------------|
7. | 2 4567 1 |
15. | 1 7634 1 |
+---------------------+
* approach 2
drop wanted
bysort price id : keep if _n == 1 & _N > 1
list
+------------+
| id price |
|------------|
1. | 2 4567 |
2. | 1 7634 |
+------------+
Naturally if you want to duplicate data yet further (why?) then after approach 1
gen duplicated_price = price if wanted
gives you one copy of each of the duplicated values in a new variable. This is a slightly simpler equivalent of @Pearly Spencer's approach.
bysort price id : gen duplicated_price = price if _n == 1 & _N > 1
does it in one line.
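If what you want is a count of how often each price occurs within each id, rather than a flag, _N under by: gives it directly. A sketch on a shortened version of the data; copies and dup_price are names chosen for the illustration.

```stata
clear
input id price
1 4321
1 7634
1 7634
2 4567
2 4567
2 4567
3 5652
end

* number of observations sharing the same id and price
bysort id price : generate copies = _N

* keep the price only where it occurs more than once within the id
generate dup_price = price if copies > 1

list, sepby(id)
```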

Create table for asclogit and nlogit

Suppose I have the following table:
id | car | sex | income
-------------------------------
1 | European | Male | 45000
2 | Japanese | Female | 48000
3 | American | Male | 53000
I would like to create the one below:
| id | car | choice | sex | income
--------------------------------------------
1.| 1 | European | 1 | Male | 45000
2.| 1 | American | 0 | Male | 45000
3.| 1 | Japanese | 0 | Male | 45000
| ----------------------------------------
4.| 2 | European | 0 | Female | 48000
5.| 2 | American | 0 | Female | 48000
6.| 2 | Japanese | 1 | Female | 48000
| ----------------------------------------
7.| 3 | European | 0 | Male | 53000
8.| 3 | American | 1 | Male | 53000
9.| 3 | Japanese | 0 | Male | 53000
I would like to fit an asclogit model and, according to Example 1 in Stata's manual, this table format seems necessary. However, I have not found a way to create it easily.
You can use the cross command to generate all the possible combinations:
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
generate choice = 0
save old, replace
keep id
save new, replace
use old
rename id =_0
cross using new
replace choice = 1 if id_0 == id
replace sex = cond(id == 2, "Female", "Male")
replace income = cond(id == 1, 45000, cond(id == 2, 48000, 53000))
Note that the use of the cond() function here is equivalent to:
replace sex = "Male" if id == 1
replace sex = "Female" if id == 2
replace sex = "Male" if id == 3
replace income = 45000 if id == 1
replace income = 48000 if id == 2
replace income = 53000 if id == 3
The above code snippet, followed by some tidying up, produces the desired output:
drop id_0
order id car choice sex income
sort id car
list, sepby(id)
+------------------------------------------+
| id car choice sex income |
|------------------------------------------|
1. | 1 American 0 Male 45000 |
2. | 1 European 1 Male 45000 |
3. | 1 Japanese 0 Male 45000 |
|------------------------------------------|
4. | 2 American 0 Female 48000 |
5. | 2 European 0 Female 48000 |
6. | 2 Japanese 1 Female 48000 |
|------------------------------------------|
7. | 3 American 1 Male 53000 |
8. | 3 European 0 Male 53000 |
9. | 3 Japanese 0 Male 53000 |
+------------------------------------------+
For more information, type help cross and help cond() from Stata's command prompt.
Please see dataex in Stata for how to produce data examples useful in web forums. (If necessary, install first using ssc install dataex.)
This could be an exercise in using fillin followed by filling in the missings.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
fillin id car
foreach v in sex income {
    bysort id (_fillin) : replace `v' = `v'[1]
}
list , sepby(id)
+-------------------------------------------+
| id car sex income _fillin |
|-------------------------------------------|
1. | 1 European Male 45000 0 |
2. | 1 American Male 45000 1 |
3. | 1 Japanese Male 45000 1 |
|-------------------------------------------|
4. | 2 Japanese Female 48000 0 |
5. | 2 European Female 48000 1 |
6. | 2 American Female 48000 1 |
|-------------------------------------------|
7. | 3 American Male 53000 0 |
8. | 3 European Male 53000 1 |
9. | 3 Japanese Male 53000 1 |
+-------------------------------------------+
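To get from there to the layout in the question, the choice indicator can be read straight off _fillin, since fillin marks the observations it created with 1 and the original ones with 0. A sketch on the same example data:

```stata
clear
input byte id str10 car str8 sex long income
1 "European" "Male"   45000
2 "Japanese" "Female" 48000
3 "American" "Male"   53000
end

* add every id-car combination that is not yet in the data
fillin id car

* copy the case-level variables from the original observation of each id
foreach v in sex income {
    bysort id (_fillin) : replace `v' = `v'[1]
}

* the original observations are the chosen alternatives
generate byte choice = !_fillin
drop _fillin

order id car choice sex income
sort id car
list, sepby(id)
```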
A provisional solution using Pandas in Python is the following:
1) Open the base with:
import pandas as pd
df = pd.read_stata("mybase.dta")
2) Use the code of the accepted answer of this question.
3) Save the base:
df.to_stata("newbase.dta")
If one wants to use dummy variables, reshape also is an option.
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
tabulate car, gen(choice)
reshape long choice, i(id)
label define car 2 "European" 3 "Japanese" 1 "American"
drop car
rename _j car
label values car car
list, sepby(id)
+------------------------------------------+
| id car sex income choice |
|------------------------------------------|
1. | 1 American Male 45000 0 |
2. | 1 European Male 45000 1 |
3. | 1 Japanese Male 45000 0 |
|------------------------------------------|
4. | 2 American Female 48000 0 |
5. | 2 European Female 48000 0 |
6. | 2 Japanese Female 48000 1 |
|------------------------------------------|
7. | 3 American Male 53000 1 |
8. | 3 European Male 53000 0 |
9. | 3 Japanese Male 53000 0 |
+------------------------------------------+

Generating a variable only including the top 4 firms with largest sales

My question is closely related to the question below:
Calculate industry concentration in Stata based on four biggest numbers
I want to generate a variable that includes only the top 4 firms with the largest sales and excludes the rest.
In other words, the new variable will hold values only for the 4 firms with the largest sales in a given industry for a given year; for all other firms it will be missing (.).
Consider this:
webuse grunfeld, clear
bysort year (invest) : gen largest4 = cond(_n < _N - 3, ., invest)
sort year invest
list year largest4 if largest4 < . in 1/40, sepby(year)
+-----------------+
| year largest4 |
|-----------------|
7. | 1935 39.68 |
8. | 1935 40.29 |
9. | 1935 209.9 |
10. | 1935 317.6 |
|-----------------|
17. | 1936 50.73 |
18. | 1936 72.76 |
19. | 1936 355.3 |
20. | 1936 391.8 |
|-----------------|
27. | 1937 74.24 |
28. | 1937 77.2 |
29. | 1937 410.6 |
30. | 1937 469.9 |
|-----------------|
37. | 1938 51.6 |
38. | 1938 53.51 |
39. | 1938 257.7 |
40. | 1938 262.3 |
+-----------------+
If you had missing values, they would sort to the end of each block and mess up the results.
So you need one more trick:
generate OK = !missing(invest)
bysort OK year (invest) : gen Largest4 = cond(_n < _N - 3, ., invest) if OK
sort year invest
list year Largest4 if Largest4 < . in 1/40, sepby(year)
With this example, which you can run, there are no missing values and the results are the same.
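The question asks for the top 4 within industry and year, which only needs one more variable in the by: prefix. A sketch on made-up data, where industry, year and sales are hypothetical variable names and top4 is the new variable:

```stata
clear
input byte industry int year float sales
1 2000 10
1 2000 50
1 2000 30
1 2000 20
1 2000 40
1 2000 .
2 2000 5
2 2000 6
end

* keep observations with missing sales out of the ranking
generate byte OK = !missing(sales)

* within each industry-year, sorted by sales, keep only the last 4 values
bysort OK industry year (sales) : generate top4 = cond(_n < _N - 3, ., sales) if OK

sort industry year sales
list, sepby(industry year)
```

Groups with fewer than 4 firms, like industry 2 here, keep all of their values. Ties at the cutoff are broken arbitrarily by the sort.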