Replace value in variable based on other variables - stata

Assuming I have the following dataset, which has a few missing entries for Country:
clear
input strL Person strL Country Population
'ABC' "USA" 3999
'ABC' " " 544
'ABC' " " 7546
'ABD' "China" 10000
'BCG' "India" 6789
'BCG' " " 5454
'ABD' " " 10000
end
I wish to replace each missing Country with the nonmissing value recorded for the same Person. For example, every observation with Person 'ABC' should have the same country.
I need a solution other than manually scripting replace Country = "USA" if Person == "ABC", as my dataset has more than 10,000 distinct values of Person.
The dataset should look like the following:
Person Country Population
'ABC' "USA" 2514
'ABC' "USA" 388
'ABC' "USA" 8245
'ABD' "China" 10000
'BCG' "India" 6789
'BCG' "India" 5454
'ABD' "China" 10000

Your input and output don't match Stata standards. Stata does not use single quotes as string delimiters or show string delimiters in listings.
Stata doesn't regard one or more spaces as string missing.
Nevertheless this may help for a string variable such as Country:
clear
input strL Person strL Country Population
"ABC" "USA" 3999
"ABC" " " 544
"ABC" " " 7546
"ABD" "China" 10000
"BCG" "India" 6789
"BCG" " " 5454
"ABD" " " 10000
end
bysort Person (Country) : replace Country = Country[_N] if missing(trim(Country))
list, sepby(Person)
     +-----------------------------+
     | Person   Country   Popula~n |
     |-----------------------------|
  1. |    ABC       USA       7546 |
  2. |    ABC       USA        544 |
  3. |    ABC       USA       3999 |
     |-----------------------------|
  4. |    ABD     China      10000 |
  5. |    ABD     China      10000 |
     |-----------------------------|
  6. |    BCG     India       5454 |
  7. |    BCG     India       6789 |
     +-----------------------------+
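For readers who want to check the logic outside Stata, here is a minimal Python sketch of the same idea: within each person, blank countries take the nonblank value recorded for that person. The function name fill_country and the in-memory rows are illustrative assumptions, and the sketch assumes each person has at most one distinct nonblank country.

```python
# Rows mirroring the example: (Person, Country, Population).
rows = [
    ("ABC", "USA", 3999),
    ("ABC", " ", 544),
    ("ABC", " ", 7546),
    ("ABD", "China", 10000),
    ("BCG", "India", 6789),
    ("BCG", " ", 5454),
    ("ABD", " ", 10000),
]


def fill_country(rows):
    """Fill blank countries with the nonblank country seen for the same person."""
    known = {}
    for person, country, _ in rows:
        if country.strip():
            # Assumption: at most one distinct nonblank country per person,
            # so the first one seen is kept.
            known.setdefault(person, country)
    return [
        (person, country if country.strip() else known.get(person, country), pop)
        for person, country, pop in rows
    ]


filled = fill_country(rows)
for row in filled:
    print(row)
```

Unlike the bysort approach, this keeps the original row order; a person with no nonblank country at all is left unchanged.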

Counting number of prior observation excluding those belonging to a certain group

I have loan level data which has the following structure and want to create the variable Number
Loan Borrower Lender Date Crop Country Number
1 A X 01/01/20 Coffee USA 0
2 B X 01/02/20 Coffee USA 0
3 C X 01/03/20 Coffee USA 0
4 D X 01/04/20 Coffee USA 0
5 E X 01/05/20 Banana USA 4
6 F X 01/06/20 Banana USA 4
7 G X 01/07/20 Coffee USA 2
8 H X 01/08/20 Orange USA 7
9 I X 01/09/20 Coffee USA 3
. . . . . . .
. . . . . . .
I want to number each loan according to these rules:
Count how many loans the lender has issued up to this point (including this loan).
The count should only include loans in the same country as this loan.
The count should exclude all loans given out for the same crop.
Hence I am left with a number for each observation which states the number of loans given out by the lender in the same country as said loan but excluding those observations in the country which also occur in the same crop.
So far I tried running:
bysort Lender Country (Date): gen var = _n
The problem with this is that I don't subtract the observations which occur in the same crop.
* Example generated by -dataex-.
clear
input byte Loan str8 Borrower str6 Lender float Date str6 Crop str7 Country byte Number
1 "A" "X" 21915 "Coffee" "USA" 0
2 "B" "X" 21916 "Coffee" "USA" 0
3 "C" "X" 21917 "Coffee" "USA" 0
4 "D" "X" 21918 "Coffee" "USA" 0
5 "E" "X" 21919 "Banana" "USA" 4
6 "F" "X" 21920 "Banana" "USA" 4
7 "G" "X" 21921 "Coffee" "USA" 2
8 "H" "X" 21922 "Orange" "USA" 7
9 "I" "X" 21923 "Coffee" "USA" 3
end
format %td Date
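* Running count of loans within each crop, in date order
bysort Crop (Date) : gen this = _n
* Loans of one crop that share a date all get the highest count for that date
bysort Crop Date (this): replace this = this[_N]
sort Loan
* Loans so far minus loans so far in the same crop
* (relies on Loan being numbered in date order, as in the example)
gen wanted1 = _n - this
* Same steps by country; with only one country here, wanted2 is 0 throughout
bysort Country (Date) : replace this = _n
bysort Country Date (this): replace this = this[_N]
sort Loan
gen wanted2 = _n - this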
list
     +---------------------------------------------------------------------------------------------+
     | Loan   Borrower   Lender        Date     Crop   Country   Number   this   wanted1   wanted2 |
     |---------------------------------------------------------------------------------------------|
  1. |    1          A        X   01jan2020   Coffee       USA        0      1         0         0 |
  2. |    2          B        X   02jan2020   Coffee       USA        0      2         0         0 |
  3. |    3          C        X   03jan2020   Coffee       USA        0      3         0         0 |
  4. |    4          D        X   04jan2020   Coffee       USA        0      4         0         0 |
  5. |    5          E        X   05jan2020   Banana       USA        4      5         4         0 |
     |---------------------------------------------------------------------------------------------|
  6. |    6          F        X   06jan2020   Banana       USA        4      6         4         0 |
  7. |    7          G        X   07jan2020   Coffee       USA        2      7         2         0 |
  8. |    8          H        X   08jan2020   Orange       USA        7      8         7         0 |
  9. |    9          I        X   09jan2020   Coffee       USA        3      9         3         0 |
     +---------------------------------------------------------------------------------------------+
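The rule stated in the question (loans issued so far by the same lender in the same country, minus those in the same crop) can be cross-checked with a short Python sketch. This is only an illustration, not the Stata method: the function name count_other_crop_loans is hypothetical, and the sketch assumes dates are unique within a lender and country, so ties are not handled.

```python
from collections import Counter

# (Loan, Lender, Date as a Stata daily date, Crop, Country), copied from the data example.
loans = [
    (1, "X", 21915, "Coffee", "USA"),
    (2, "X", 21916, "Coffee", "USA"),
    (3, "X", 21917, "Coffee", "USA"),
    (4, "X", 21918, "Coffee", "USA"),
    (5, "X", 21919, "Banana", "USA"),
    (6, "X", 21920, "Banana", "USA"),
    (7, "X", 21921, "Coffee", "USA"),
    (8, "X", 21922, "Orange", "USA"),
    (9, "X", 21923, "Coffee", "USA"),
]


def count_other_crop_loans(loans):
    """Map each loan to (loans so far by lender and country) - (of those, same crop)."""
    by_country = Counter()
    by_crop = Counter()
    result = {}
    # Walk through the loans in date order, updating both running counts.
    for loan, lender, date, crop, country in sorted(loans, key=lambda r: r[2]):
        by_country[(lender, country)] += 1
        by_crop[(lender, country, crop)] += 1
        result[loan] = by_country[(lender, country)] - by_crop[(lender, country, crop)]
    return result


number = count_other_crop_loans(loans)
print(number)
```

For the nine example loans this reproduces the Number column of the data example.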

Replacing a line in .txt file corrupts the next line and the line itself

I want to update part (a sub-info) of a line (info) in a text file. The code does replace that part of the line, but it corrupts the beginning of the next line, and part of the old line remains in the updated line.
product.txt file:
| Product | |Stock | |Wholesale| |Retail| | Product | | Note |
| Name | |amount| | price | |price | | Supplier | | |
-----------------------------------------------------------------------------------
Big Banana, 210 ,270, 310 ,kashem suppply ,no note
Apple, 50 ,100, 145 ,Khulna Fruits ,No note
Dragon Fruit, 9 ,700, 890 ,Indo-Asian farms ,Costly
Orange, 88 ,70, 100 ,Khulna Fruits ,rotten
Guava, 16 ,120, 150 ,Sandwip Agro ,Dababy
Goat Fry, 2 ,1000, 1700 ,Sandwip Agro ,Costly
Dumb Miraz, 9 ,700, 890 ,Indo-Asian farms ,Cannibal
How the code is meant to work:
Open product.txt, which contains all the product information, and show it to the user. Then ask which sub-info of a product to update.
Take the product name. The program then finds the line containing that name; strstr() is used because the product name is a substring of the whole line.
Take the sub-info that needs to be updated. Use .find() to locate it and .replace() to replace it with new_sub_info.
Now replace the line in the text file with the updated line. prev keeps track of the file position so the stream can seek back to the beginning of that line.
Confusion: how does the file position change with each getline()?
Observation: after the first line the position is 90 (characters?) (prev = 90). Each following line advances it by 84 (prev += 84). But all lines are the same size (84 characters long).
Windows 10 OS
Related code:
void get_Untill_Int(int* pInput)//keep taking input until input is `int or float`; safer than cin
{...}
string StringInput() //returns null-terminated string
{...}
void Admin_Show_ProductFromFile() //shows product informations and give further option to user
{
string Product_whole_info_line;
fstream Read_n_Write_ProductFrom_txtFile; //fstream used. cause want to both read and write
Read_n_Write_ProductFrom_txtFile.open("product.txt"); //.txt file open.
if (!Read_n_Write_ProductFrom_txtFile)
{
perror("Product File failed to open");
return;
}
/*----------------show product info-------------------------*/
while(getline(Read_n_Write_ProductFrom_txtFile, Product_whole_info_line))
{
cout<<" "<<Product_whole_info_line<<'\n';
}
/*-----------product info shown-------------------------*/
/*------------Edit a particular sub-info of a product----------------------*/
Read_n_Write_ProductFrom_txtFile.clear(); //failbit not in failed state
Read_n_Write_ProductFrom_txtFile.seekg(0);//go back to the start of .txt file
string tempProductName; //whose sub-info will be edited/updated
cout << "Enter product Name: "; /*the product to be searched to edit its' sub-info*/
tempProductName = StringInput();
char chartempProductName[tempProductName.length()+1];
char charProduct_whole_info_line[Product_whole_info_line.length()+1];//will need for strstr() later.
streampos prev=Read_n_Write_ProductFrom_txtFile.tellg(); //save current pointer of .txt file(confused with names)
while (getline(Read_n_Write_ProductFrom_txtFile, Product_whole_info_line))//take line containing product info
{
strcpy(chartempProductName, tempProductName.c_str());//convert string to char[]
strcpy(charProduct_whole_info_line, Product_whole_info_line.c_str());//convert string to char[]
if (strstr(charProduct_whole_info_line, chartempProductName))//wanted product line found: it contains chartempProductName
{
string product_sub_info, newProduct_sub_info;
cout << "Enter info that need to be changed: "; /*which sub-info of product_line to be updated*/
product_sub_info = StringInput(); /*product sub_info*/
cout<<"enter updated info: ";
newProduct_sub_info = StringInput();
Product_whole_info_line.replace(Product_whole_info_line.find(product_sub_info),product_sub_info.length(),newProduct_sub_info);
//product sub-info has been updated inside program. Now we have to replace/update the line in the .txt file
Read_n_Write_ProductFrom_txtFile.seekg(prev);//seek back to the saved start of the matching line
Read_n_Write_ProductFrom_txtFile << Product_whole_info_line<<'\n';
break; //break `while loop`. cause product found and info updated.
}
prev = Read_n_Write_ProductFrom_txtFile.tellg(); //current position of the getline input pointer
}
}
//print updated file
Read_n_Write_ProductFrom_txtFile.clear();
Read_n_Write_ProductFrom_txtFile.seekg(0);//go back to the start of .txt file
while(getline(Read_n_Write_ProductFrom_txtFile, Product_whole_info_line))
{
cout<<" "<<Product_whole_info_line<<'\n';
}
Read_n_Write_ProductFrom_txtFile.close();//.txt file closed.
}
I think the problem is in .seekg() or in the saved file position (streampos prev), but I still can't figure it out.
Input:
Enter product Name: Big Banana
Enter info that need to be changed: 210
enter updated info: 220
Output:
| Product | |Stock | |Wholesale| |Retail| | Product | | Note |
| Name | |amount| | price | |price | | Supplier | | |
-----------------------------------------------------------------------------------
BigBig Banana, 220 ,270, 310 ,kashem suppply ,no note
le, 50 ,100, 145 ,Khulna Fruits ,No note
Dragon Fruit, 9 ,700, 890 ,Indo-Asian farms ,Costly
Orange, 80 ,70, 100 ,Khulna Fruits ,no note
Guava, 16 ,120, 150 ,Sandwip Agro ,Dababy
Goat Fry, 2 ,1000, 1700 ,Sandwip Agro ,Costly
Dumb Miraz, 9 ,700, 890 ,Indo-Asian farms ,Cannibal
Expected output:
| Product | |Stock | |Wholesale| |Retail| | Product | | Note |
| Name | |amount| | price | |price | | Supplier | | |
-----------------------------------------------------------------------------------
Big Banana, 220 ,270, 310 ,kashem suppply ,no note
Apple, 50 ,100, 145 ,Khulna Fruits ,No note
Dragon Fruit, 9 ,700, 890 ,Indo-Asian farms ,Costly
Orange, 88 ,70, 100 ,Khulna Fruits ,rotten
Guava, 16 ,120, 150 ,Sandwip Agro ,Dababy
Goat Fry, 2 ,1000, 1700 ,Sandwip Agro ,Costly
Dumb Miraz, 9 ,700, 890 ,Indo-Asian farms ,Cannibal
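Overwriting a line in place only works if the saved offset is exactly the start of the line and the new line has the same byte length as the old one. A common way to avoid that bookkeeping is to read every line, change the matching one in memory, and write the whole file back. The sketch below shows that alternative in Python rather than C++; it is a swapped-in technique, not a fix to the code above, and the function name replace_in_matching_line and the two-line sample file are illustrative assumptions.

```python
import os
import tempfile


def replace_in_matching_line(path, product_name, old, new):
    """Rewrite the file, replacing the first `old` in the first line that
    contains both `product_name` and `old`. Returns True if a line changed."""
    with open(path, "r", newline="") as f:  # newline="" leaves line endings untouched
        lines = f.readlines()
    for i, line in enumerate(lines):
        if product_name in line and old in line:
            lines[i] = line.replace(old, new, 1)
            break
    else:
        return False  # no matching line: leave the file as it is
    with open(path, "w", newline="") as f:
        f.writelines(lines)
    return True


# Demo on a temporary copy of two lines from product.txt.
sample = (
    "Big Banana, 210 ,270, 310 ,kashem suppply ,no note\n"
    "Apple, 50 ,100, 145 ,Khulna Fruits ,No note\n"
)
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "product.txt")
    with open(path, "w", newline="") as f:
        f.write(sample)
    changed = replace_in_matching_line(path, "Big Banana", "210", "220")
    with open(path, newline="") as f:
        updated = f.read()
print(updated)
```

Because the whole file is rewritten, the replacement may be longer or shorter than the original text without touching neighbouring lines.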

Merging two observations

I have a list of places with population, much like in the example data below:
sysuse census, clear
How can I combine (sum) only two observations to create a new observation, while maintaining the rest of the data?
In the below example I would like to combine Alabama and Alaska to create a new observation called 'Alabama & Alaska' with the sum of their populations.
With the new observation, the previous records will need to be deleted.
     +----------------------------+
     | state                  pop |
     |----------------------------|
  1. | Alabama          3,893,888 |
  2. | Alaska             401,851 |
  3. | Arizona          2,718,215 |
  4. | Arkansas         2,286,435 |
  5. | California      23,667,902 |
     +----------------------------+

     +-----------------------------------+
     | state                         pop |
     |-----------------------------------|
  1. | Alabama & Alaska        4,295,739 | <--Alabama & Alaska combined
  2. | Arizona                 2,718,215 | <--Retain other observations and variables
  3. | Arkansas                2,286,435 |
  4. | California             23,667,902 |
     +-----------------------------------+
This is my original toy data example and its expected output:
PlaceName Population
Town 1 100
Town 2 200
Town 3 100
Town 4 100
PlaceName Population
Town 1 & Town 2 300
Town 3 100
Town 4 100
Using your original toy example, the following works for me:
clear
input str6 PlaceName Population
"Town 1" 100
"Town 2" 200
"Town 3" 100
"Town 4" 100
end
generate PlaceName2 = cond(_n == 1, PlaceName + " & " + PlaceName[_n+1], PlaceName)
generate Population2 = cond(_n == 1, Population[_n+1] + Population, Population)
replace PlaceName2 = "" in 2
replace Population2 = . in 2
gsort - Population2
list, abbreviate(12)
     +--------------------------------------------------------+
     | PlaceName   Population        PlaceName2   Population2 |
     |--------------------------------------------------------|
  1. |    Town 1          100   Town 1 & Town 2           300 |
  2. |    Town 4          100            Town 4           100 |
  3. |    Town 3          100            Town 3           100 |
  4. |    Town 2          200                               . |
     +--------------------------------------------------------+
This is how to do it with collapse. As you ask, this combines two observations into one, and thus changes the dataset.
clear
input str6 PlaceName Population
"Town 1" 100
"Town 2" 200
"Town 3" 100
"Town 4" 100
end
replace PlaceName = "Towns 1 and 2" in 1/2
collapse (sum) Population , by(PlaceName)
list
     +--------------------------+
     |     PlaceName   Popula~n |
     |--------------------------|
  1. |        Town 3        100 |
  2. |        Town 4        100 |
  3. | Towns 1 and 2        300 |
     +--------------------------+
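The collapse approach boils down to two steps: give the observations to be merged a common name, then sum the population within each name. A minimal Python sketch of those two steps, with a hypothetical function merge_places and the toy data from above:

```python
places = [("Town 1", 100), ("Town 2", 200), ("Town 3", 100), ("Town 4", 100)]


def merge_places(places, to_merge, new_name):
    """Relabel the places in `to_merge` as `new_name`, then sum population per name."""
    totals = {}
    for name, population in places:
        key = new_name if name in to_merge else name
        totals[key] = totals.get(key, 0) + population
    # Sorted by name, which is also the order collapse leaves the data in.
    return sorted(totals.items())


merged = merge_places(places, {"Town 1", "Town 2"}, "Towns 1 and 2")
for name, population in merged:
    print(name, population)
```

As with collapse, any variable that is neither the grouping name nor a summed quantity would need its own rule for how to combine the merged rows.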

Grouping multiple text values from Query in Google Sheets

I have a pivot table created with a QUERY function in Google Sheets, and I want to group its rows based on a decision rule.
The pivot table looks something like this (a table of classes and grades, with student names in the header):
| John Dough | John Though | John Doe |... | John A Hill
History | 79 | | |... | |
Chem 101 | | | 87 |... | |
Phys 101 | | | |... | 77 |
Phys 202 | | | |... | |
Geo 101 | | 75 | |... | |
... | | | ... |... | |
Sport AT | | | 85 |... | |
Now, let's say the passing score on the final exam is 75. What I'd like to get is this table:
| Failed Passed
History | John Dough | John A Hill , John Deere
Chem 101 | John E , John Tra | John Son , John Snow
Phys 101 | John B Good , John Na | #N/A
Phys 202 | John Bon Jovi | John Diy , John L , John R
Geo 101 | #N/A | John Lennon
... | ... | ...
Sport AT | John Bone | John the revelator
The catch is that I'd like to wrap the existing pivot table in a formula, so that it looks something like:
=MagicFormula[Query("Data !A1:X99","select yada yada, sum(yada), Pivot(whatever)]
And my question is, can it be done by wrapping?
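To make the target table concrete, the grouping rule can be sketched in Python: for each class, names with a score at or above the pass mark go under Passed and the rest under Failed. The grades below are hypothetical, the function name group_by_result is an assumption, and treating a score equal to 75 as a pass is also an assumption.

```python
PASS_MARK = 75  # passing score stated in the question

# Hypothetical grades: class -> {student: score}; blank cells are simply absent.
grades = {
    "History": {"John Dough": 79},
    "Chem 101": {"John Doe": 87},
    "Phys 101": {"John A Hill": 77, "John Though": 60},
    "Geo 101": {"John Though": 75},
}


def group_by_result(grades, pass_mark):
    """Return class -> {"Failed": names, "Passed": names}, names joined by ' , '."""
    table = {}
    for course, scores in grades.items():
        passed = [name for name, score in scores.items() if score >= pass_mark]
        failed = [name for name, score in scores.items() if score < pass_mark]
        table[course] = {"Failed": " , ".join(failed), "Passed": " , ".join(passed)}
    return table


table = group_by_result(grades, PASS_MARK)
for course, result in table.items():
    print(course, "|", result["Failed"], "|", result["Passed"])
```

An empty string here plays the role of the #N/A cells in the desired output.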
=ARRAYFORMULA({"", "PASSED", "FAILED"; A2:A, REGEXREPLACE(TRIM({
TRANSPOSE(QUERY(TRANSPOSE(IF((B2:E>=79)*(B2:E<>""), B1:E1&",", )),,999^99)),
TRANSPOSE(QUERY(TRANSPOSE(IF((B2:E< 79)*(B2:E<>""), B1:E1&",", )),,999^99))}),
",$", )})
UPDATE:
=ARRAYFORMULA({{QUERY(QUERY({List!B5:D},
"select Col1,sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"select Col1", 0)}, {"PASSED", "FAILED";
REGEXREPLACE(TRIM({TRANSPOSE(QUERY(TRANSPOSE(IF((QUERY(QUERY({List!B5:D},
"select sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"offset 1", 0)>=79)*(QUERY(QUERY({List!B5:D},
"select sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"offset 1", 0)<>""), QUERY(QUERY({List!B5:D},
"select sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"limit 0", 1)&",", )),,999^99)), TRANSPOSE(QUERY(TRANSPOSE(IF((QUERY(QUERY({List!B5:D},
"select sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"offset 1", 0)< 79)*(QUERY(QUERY({List!B5:D},
"select sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"offset 1", 0)<>""), QUERY(QUERY({List!B5:D},
"select sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"limit 0", 1)&",", )),,999^99))}), ",$", )}})

Create table for asclogit and nlogit

Suppose I have the following table:
id | car | sex | income
-------------------------------
1 | European | Male | 45000
2 | Japanese | Female | 48000
3 | American | Male | 53000
I would like to create the one below:
| id | car | choice | sex | income
--------------------------------------------
1.| 1 | European | 1 | Male | 45000
2.| 1 | American | 0 | Male | 45000
3.| 1 | Japanese | 0 | Male | 45000
| ----------------------------------------
4.| 2 | European | 0 | Female | 48000
5.| 2 | American | 0 | Female | 48000
6.| 2 | Japanese | 1 | Female | 48000
| ----------------------------------------
7.| 3 | European | 0 | Male | 53000
8.| 3 | American | 1 | Male | 53000
9.| 3 | Japanese | 0 | Male | 53000
I would like to fit an asclogit model and, according to Example 1 in Stata's manual, this table format seems necessary. However, I have not found a way to create it easily.
You can use the cross command to generate all the possible combinations:
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
generate choice = 0
save old, replace
keep id
save new, replace
use old
rename id =_0
cross using new
replace choice = 1 if id_0 == id
replace sex = cond(id == 2, "Female", "Male")
replace income = cond(id == 1, 45000, cond(id == 2, 48000, 53000))
Note that the use of the cond() function here is equivalent to:
replace sex = "Male" if id == 1
replace sex = "Female" if id == 2
replace sex = "Male" if id == 3
replace income = 45000 if id == 1
replace income = 48000 if id == 2
replace income = 53000 if id == 3
The above code snippet, followed by some tidying, produces the desired output:
drop id_0
order id car choice sex income
sort id car
list, sepby(id)
     +------------------------------------------+
     | id        car   choice      sex   income |
     |------------------------------------------|
  1. |  1   American        0     Male    45000 |
  2. |  1   European        1     Male    45000 |
  3. |  1   Japanese        0     Male    45000 |
     |------------------------------------------|
  4. |  2   American        0   Female    48000 |
  5. |  2   European        0   Female    48000 |
  6. |  2   Japanese        1   Female    48000 |
     |------------------------------------------|
  7. |  3   American        1     Male    53000 |
  8. |  3   European        0     Male    53000 |
  9. |  3   Japanese        0     Male    53000 |
     +------------------------------------------+
For more information, type help cross and help cond() from Stata's command prompt.
Please see dataex in Stata for how to produce data examples useful in web forums. (If necessary, install first using ssc install dataex.)
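This could be an exercise in using fillin followed by filling in the missings. The variable _fillin that fillin creates is 1 for the added observations and 0 for the original ones, so the choice indicator is 1 - _fillin.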
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
fillin id car
foreach v in sex income {
bysort id (_fillin) : replace `v' = `v'[1]
}
list , sepby(id)
     +-------------------------------------------+
     | id        car      sex   income   _fillin |
     |-------------------------------------------|
  1. |  1   European     Male    45000         0 |
  2. |  1   American     Male    45000         1 |
  3. |  1   Japanese     Male    45000         1 |
     |-------------------------------------------|
  4. |  2   Japanese   Female    48000         0 |
  5. |  2   European   Female    48000         1 |
  6. |  2   American   Female    48000         1 |
     |-------------------------------------------|
  7. |  3   American     Male    53000         0 |
  8. |  3   European     Male    53000         1 |
  9. |  3   Japanese     Male    53000         1 |
     +-------------------------------------------+
A provisional solution using Pandas in Python is the following:
1) Read the dataset with:
import pandas as pd
df = pd.read_stata("mybase.dta")
2) Use the code of the accepted answer of this question.
3) Save the dataset:
df.to_stata("newbase.dta")
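The code referred to in step 2 is not reproduced here. As a rough sketch of one way the expansion could be done (not necessarily what that answer does), the following standard-library Python pairs every person with every car alternative and flags the chosen one. The function name expand_choices is hypothetical, and the data are typed in rather than read from a .dta file.

```python
from itertools import product

people = [
    {"id": 1, "car": "European", "sex": "Male", "income": 45000},
    {"id": 2, "car": "Japanese", "sex": "Female", "income": 48000},
    {"id": 3, "car": "American", "sex": "Male", "income": 53000},
]


def expand_choices(people):
    """Return one row per (person, alternative), with choice = 1 for the chosen car."""
    alternatives = sorted({person["car"] for person in people})
    long_rows = []
    for person, alternative in product(people, alternatives):
        long_rows.append({
            "id": person["id"],
            "car": alternative,
            "choice": int(person["car"] == alternative),
            "sex": person["sex"],
            "income": person["income"],
        })
    return long_rows


long_rows = expand_choices(people)
for row in long_rows:
    print(row)
```

The case-specific variables (sex, income) are copied from the person, so nothing needs to be filled in afterwards.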
If one wants to use dummy variables, reshape is also an option.
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
tabulate car, gen(choice)
reshape long choice, i(id)
label define car 2 "European" 3 "Japanese" 1 "American"
drop car
rename _j car
label values car car
list, sepby(id)
     +------------------------------------------+
     | id        car      sex   income   choice |
     |------------------------------------------|
  1. |  1   American     Male    45000        0 |
  2. |  1   European     Male    45000        1 |
  3. |  1   Japanese     Male    45000        0 |
     |------------------------------------------|
  4. |  2   American   Female    48000        0 |
  5. |  2   European   Female    48000        0 |
  6. |  2   Japanese   Female    48000        1 |
     |------------------------------------------|
  7. |  3   American     Male    53000        1 |
  8. |  3   European     Male    53000        0 |
  9. |  3   Japanese     Male    53000        0 |
     +------------------------------------------+