Keeping distinct combinations (pairs) of observations - Stata

I have a dataset where I have to remove duplicate combinations.
These combinations are pairs of places, one each in two columns:
ID   Place1        Place2
1    Ann Arbor     Toledo
2    LA            San Francisco
3    Chicago       Peoria
4    Pittsburgh    Cleveland
5    Richmond      New Port
6    Ann Arbor     Cincinnati
7    LA            San Francisco
8    LA            San Jose
9    Springfield   Chicago
10   Richmond      New Port
11   Atlanta       Greenville
How can I get the output below?
ID   Place1        Place2
1    Ann Arbor     Toledo
2    LA            San Francisco
3    Chicago       Peoria
4    Pittsburgh    Cleveland
5    Richmond      New Port
6    Ann Arbor     Cincinnati
7    LA            San Jose
8    Springfield   Chicago
9    Atlanta       Greenville

The following works for me:
clear
input ID str20 Place1 str20 Place2
1 "Ann Arbor" "Toledo"
2 "LA" "San Francisco"
3 "Chicago" "Peoria"
4 "Pittsburgh" "Cleveland"
5 "Richmond" "New Port"
6 "Ann Arbor" "Cincinnati"
7 "LA" "San Francisco"
8 "LA" "San Jose"
9 "Springfield" "Chicago"
10 "Richmond" "New Port"
11 "Atlanta" "Greenville"
end
duplicates drop Place1 Place2, force
list, separator(0)
+----------------------------------+
| ID Place1 Place2 |
|----------------------------------|
1. | 1 Ann Arbor Toledo |
2. | 2 LA San Francisco |
3. | 3 Chicago Peoria |
4. | 4 Pittsburgh Cleveland |
5. | 5 Richmond New Port |
6. | 6 Ann Arbor Cincinnati |
7. | 8 LA San Jose |
8. | 9 Springfield Chicago |
9. | 11 Atlanta Greenville |
+----------------------------------+
Type help duplicates in Stata's command prompt for details and full syntax.
It is important to note that this will not work if you have pairs in your data like the one below:
LA San Francisco
San Francisco LA
See this article by Nick Cox on how to deal with this case.
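The usual trick (a minimal sketch, not quoted from that article; first and second are just illustrative variable names) is to put each pair into a canonical order before dropping duplicates:
gen first  = cond(Place1 <= Place2, Place1, Place2)
gen second = cond(Place1 <= Place2, Place2, Place1)
duplicates drop first second, force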

Related

VLOOKUP: How do I match one column to another sheet and join them without losing data?

I have two sheets that look something like this:
Sheet1
id | phone | age
0 123 23
1 456 42
2 789 36
Sheet2
id | city | country
0 madrid spain
1 nyc usa
2 dubai uae
3 london england
4 lisbon portugal
My goal is to have a sheet that looks like this:
Sheet3
id | phone | age | city | country
0 123 23 madrid spain
1 456 42 nyc usa
2 789 36 dubai uae
3 london england
4 lisbon portugal
I've been using this formula:
=ARRAYFORMULA({'Sheet1'!A$1:C$4, VLOOKUP('Sheet1'!A$1:A$4,{'Sheet2'!A$1:A$6, 'Sheet2'!B$1:C$6}, {2,3}, false)})
This is what I get:
Sheet3
id | phone | age | #N/A | #N/A
0 123 23 madrid spain
1 456 42 nyc usa
2 789 36 dubai uae
So, as you can see, it leaves out the column headers from Sheet2 in the combined table, and it drops any rows where the id doesn't match. How do I tell it to keep those rows (leaving their cells blank) and include the column headers from Sheet2?
try:
=ARRAYFORMULA(QUERY({A:C, IFNA(VLOOKUP(IF(A:A<>"", A:A, "×"), E:G, {2, 3}, 0));
FILTER({E2:E, IFERROR(E2:F/0), F2:G}, NOT(COUNTIF(E2:E, A2:A)))},
"where Col1 is not null order by Col1", 1))

Multiple choices in a choice data set

The original data contains information on the consumerid and the cars they purchased.
clear
input consumerid car purchase
6 American 1
6 Japanese 0
6 European 0
7 American 0
7 Japanese 0
7 European 1
7 Korean 1
end
Since this is purchase data, the data set needs to be expanded to depict the full choice set of cars every time a consumer made a purchase. The final data set should look like the table in the Stata manual (www.stata.com/manuals/cm.pdf, p. 97, "Example 4: Multiple choices per case").
I have written some code (shown below) that almost gets me where I need to be, but I have trouble generating a single value of purchase==1 per consumerid-carnumber combination (due to the expansion, the purchase values are duplicated).
egen sumpurchase = total(purchase), by(consumerid)
expand sumpurchase
bysort consumerid car (purchase): gen carnumber = _n
You could use reshape to get all combinations of consumerid/car per car bought. This example assumes that the sort order in the original dataset defines which car is carnumber 1, carnumber 2 etc.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte consumerid str8 car byte purchase
6 "American" 1
6 "Japanese" 0
6 "European" 0
7 "American" 0
7 "Japanese" 0
7 "European" 1
7 "Korean" 1
end
// Generate carnumber
bys consumerid: gen carnumber = cond(purchase != 0, sum(purchase), 0)
// To wide
reshape wide purchase, i(consumerid car) j(carnumber)
// Keep purchased cars only
drop purchase0
// Back to long
reshape long
// Drop if no cars purchased for consumerid/carnumber
bysort consumerid carnumber (purchase) : drop if missing(purchase[1])
// Replace missing with 0 for non-purchased cars
mvencode purchase, mv(0)
// Sort and see results
sort consumerid carnumber car
list, sepby(consumerid carnumber) abbr(14)
Results:
. list, sepby(consumerid carnumber) abbr(14)
+----------------------------------------------+
| consumerid car carnumber purchase |
|----------------------------------------------|
1. | 6 American 1 1 |
2. | 6 European 1 0 |
3. | 6 Japanese 1 0 |
|----------------------------------------------|
4. | 7 American 1 0 |
5. | 7 European 1 1 |
6. | 7 Japanese 1 0 |
7. | 7 Korean 1 0 |
|----------------------------------------------|
8. | 7 American 2 0 |
9. | 7 European 2 0 |
10. | 7 Japanese 2 0 |
11. | 7 Korean 2 1 |
+----------------------------------------------+

Pandas dynamically pattern match from second dataframe and extract string

Getting my knickers in a twist trying to dynamically build a regex extract pattern from a second dataframe list and populate another column with the string if it's contained in the list.
Here are the two starting tables:
import pandas as pd
import re
import numpy as np

# this is a short extract, there are 1000's of records in this table:
provinces = pd.DataFrame({'country': ['Brazil', 'Brazil', 'Brazil', 'Colombia', 'Colombia', 'Colombia'],
                          'area': ['Cerrado', 'Sul de Minas', 'Mococoa', 'Tolima', 'Huila', 'Quindio'],
                          'index': [13, 21, 19, 35, 36, 34]})
# test dataframe
df_test = pd.DataFrame({'country': ['brazil', 'brazil', 'brazil', 'brazil', 'colombia', 'colombia', 'brazil'],
                        'locality': ['sul de minas minas gerais', 'chapadao cerrado', 'cerrado cerrado', 'mococoa sao paulo', 'pitalito huila', 'pijao quindio', 'espirito santo']})
print(provinces)
country area index
0 Brazil Cerrado 13
1 Brazil Sul de Minas 21
2 Brazil Mococoa 19
3 Colombia Tolima 35
4 Colombia Huila 36
5 Colombia Quindio 34
print(df_test)
country locality
0 brazil sul de minas minas gerais
1 brazil chapadao cerrado
2 brazil cerrado cerrado
3 brazil mococoa sao paulo
4 colombia pitalito huila
5 colombia pijao quindio
6 brazil espirito santo
and end result:
df_result = pd.DataFrame({'country':['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
'locality':['minas gerais','chapadao','cerrado','sao paulo','pitalito','pijao','espirito santo'],
'area': ['sul de minas','cerrado','cerrado','mococoa','huila','quindio',''],
'index': [21,13,13,19,36,34,np.nan]})
print(df_result)
country locality area index
0 brazil minas gerais sul de minas 21.0
1 brazil chapadao cerrado 13.0
2 brazil cerrado cerrado 13.0
3 brazil sao paulo mococoa 19.0
4 colombia pitalito huila 36.0
5 colombia pijao quindio 34.0
6 brazil espirito santo NaN
I can't get past the first step, populating the area column. Once the area column contains a string, stripping that same string from the locality column and adding the index column via a left join on the country and area columns is the easy part(!)
# to create the area column and extract the area string if there's a match (by string and country) in the provinces table
df_test['area'] = ''
df_test.area = df_test.locality.str.extract(flags=re.IGNORECASE, pat = r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list()), expand=False))
and I'd also need to apply a map to exclude some records from this step.
# as above but for added complexity, populate the area column only if df_test.country == 'brazil':
df_test['area'] = ''
mapping = df_test.country =='brazil'
df_test.loc[mapping,'area'] = df_test.loc[mapping,'locality'].str.extract(flags=re.IGNORECASE, pat = r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list()), expand=False))
All the vectorised regex extract solutions I've found rely on pre-defined regex patterns, but given these patterns need to come from the provinces dataframe where the countries match, this question and answer seemed like the closest match to this scenario, but I couldn't make sense of it...
Thanks in advance
Following the trail of error messages (and some sleep!): "Can only compare identically-labeled Series objects" was resolved with this answer, and then "ValueError: Lengths must match to compare" with this answer.
Here's the solution:
df_test['area'] = ''
df_test.area = df_test.locality.str.extract(
    flags=re.IGNORECASE,
    pat=r'({})'.format('|'.join(provinces.loc[provinces.country.str.lower().isin(df_test.country), 'area'].str.lower().to_list())),
    expand=False)
[out]
country locality area
0 brazil sul de minas minas gerais sul de minas
1 brazil chapadao cerrado cerrado
2 brazil cerrado cerrado cerrado
3 brazil mococoa sao paulo mococoa
4 colombia pitalito huila huila
5 colombia pijao quindio quindio
6 brazil espirito santo NaN
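For completeness, the "easy part" could then look something like this (a sketch using the names above; lookup and df_result are illustrative names, and str.replace with a count of 1 removes only the first occurrence, which matters for a locality like "cerrado cerrado"):
# strip the matched area from the locality, one occurrence only
df_test['locality'] = df_test.apply(
    lambda r: r['locality'].replace(r['area'], '', 1).strip() if pd.notna(r['area']) else r['locality'],
    axis=1)
# left join on lower-cased country/area to bring in the index column
lookup = provinces.assign(country=provinces.country.str.lower(), area=provinces.area.str.lower())
df_result = df_test.merge(lookup, on=['country', 'area'], how='left')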

Variable showing the highest value of another variable attained so far over time

I have a dataset of patients and their alcohol-related data over time (in years), like below:
clear
input long patid float(year cohort)
1051 1994 1
2051 1972 1
2051 1989 2
2051 1990 2
2051 2000 2
2051 2001 3
2051 2002 1
2051 2003 2
8051 1995 1
8051 1996 1
8051 2003 1
end
label values cohort cohortlab
label define cohortlab 0 "general population" 1 "no alcohol data" 2 "indeterminate" 3 "non-drinker" 4 "low_risk" 5 "hazardous" 6 "AUD" , replace
I would like to create a variable that shows the highest level of alcohol code that has been used so far at any (year) point in a patient's record, such that the dataset would be like below:
clear
input long patid float(year cohort highestsofar)
1051 1994 1 1
2051 1972 1 1
2051 1989 2 2
2051 1990 2 2
2051 2000 2 2
2051 2001 3 3
2051 2002 1 3
2051 2003 2 3
8051 1995 1 1
8051 1996 1 1
8051 2003 1 1
end
label values cohort cohortlab
label values highestsofar cohortlab
label define cohortlab 0 "general population" 1 "no alcohol data" 2 "indeterminate" 3 "lifetime_abstainer" 4 "low_risk" 5 "hazardous" 6 "AUD" , replace
Thanks for the clear example and question.
The problem is already covered by an FAQ on the StataCorp website. Here's a one-line solution using rangestat from SSC.
clear
input long patid float(year cohort)
1051 1994 1
2051 1972 1
2051 1989 2
2051 1990 2
2051 2000 2
2051 2001 3
2051 2002 1
2051 2003 2
8051 1995 1
8051 1996 1
8051 2003 1
end
label values cohort cohortlab
label define cohortlab 0 "general population" 1 "no alcohol data" 2 "indeterminate" 3 "non-drinker" 4 "low_risk" 5 "hazardous" 6 "AUD" , replace
rangestat (max) highestsofar = cohort, interval(year . 0) by(patid)
list, sepby(patid)
+-------------------------------------------+
| patid year cohort highes~r |
|-------------------------------------------|
1. | 1051 1994 no alcohol data 1 |
|-------------------------------------------|
2. | 2051 1972 no alcohol data 1 |
3. | 2051 1989 indeterminate 2 |
4. | 2051 1990 indeterminate 2 |
5. | 2051 2000 indeterminate 2 |
6. | 2051 2001 non-drinker 3 |
7. | 2051 2002 no alcohol data 3 |
8. | 2051 2003 indeterminate 3 |
|-------------------------------------------|
9. | 8051 1995 no alcohol data 1 |
10. | 8051 1996 no alcohol data 1 |
11. | 8051 2003 no alcohol data 1 |
+-------------------------------------------+
I would like to offer an answer:
bysort patid (year): gen highestsofar = cohort if cohort > cohort[_n-1] | _n == 1
bysort patid (year): replace highestsofar = highestsofar[_n-1] if cohort <= cohort[_n-1] & _n > 1
bysort patid (year): replace highestsofar = highestsofar[_n-1] if (highestsofar < highestsofar[_n-1]) & (cohort > cohort[_n-1]) & _n > 1
label values highestsofar cohortlab
I would be happy if a more compact syntax could be discussed.
Thanks
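For what it's worth, a more compact idiom for a running maximum (a sketch, assuming the data are sorted by year within patid; highestsofar2 is just an illustrative name so it does not clash with the variable above):
bysort patid (year): gen highestsofar2 = cohort
by patid: replace highestsofar2 = max(highestsofar2, highestsofar2[_n-1]) if _n > 1
label values highestsofar2 cohortlab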

Reshaping when year and countries are both columns

I am trying to reshape some data. The issue is that data are usually either long or wide, but this dataset is set up in a way that I cannot figure out how to reshape. The data look as follows:
year australia canada denmark ...
1999 10 15 20
2000 12 16 25
2001 14 18 40
And I would like to get it into a panel format like the following
year country gdppc
1999 australia 10
2000 australia 12
2001 australia 14
1999 canada 15
2000 canada 16
The problem is just in the variable names. See e.g. this FAQ for the advice that you may need to rename first before you can reshape.
For more complicated variants of this problem with similar data, see e.g. this paper.
clear
input year australia canada denmark
1999 10 15 20
2000 12 16 25
2001 14 18 40
end
rename (australia-denmark) (gdppc=)
reshape long gdppc , i(year) string j(country)
sort country year
list, sepby(country)
+--------------------------+
| year country gdppc |
|--------------------------|
1. | 1999 australia 10 |
2. | 2000 australia 12 |
3. | 2001 australia 14 |
|--------------------------|
4. | 1999 canada 15 |
5. | 2000 canada 16 |
6. | 2001 canada 18 |
|--------------------------|
7. | 1999 denmark 20 |
8. | 2000 denmark 25 |
9. | 2001 denmark 40 |
+--------------------------+