Dropping columns/variables based on count of missing values in Stata

I have a large dataset that looks like the one below. I would like to drop the variables (not the observations/rows) that have fewer than 3 non-missing observations. In this case only variable X1 needs to be dropped.
I apologise if I am asking something obvious, but at this point I do not have a clue how to proceed.
+-----+-----+-----+-----+-----+
| ID | X1 | X2 | X3 | X4 |
+-----+-----+-----+-----+-----+
| 1 | . | 1 | 1 | 2 |
| 2 | . | 2 | 2 | 3 |
| 3 | . | 3 | 1 | . |
| 4 | 1 | . | 3 | 1 |
| 5 | . | 2 | 4 | 3 |
| 6 | 2 | 3 | . | . |
|total| 2 | 5 | 5 | 4 |
+-----+-----+-----+-----+-----+

My interpretation is that you want to drop variables that have at least 3 missing values.
You can use nmissing, from SSC (ssc install nmissing):
clear
set more off
input ///
x y z
. . 5
. 6 8
4 . 9
. . 1
5 . .
end
list
nmissing, min(3)
drop `r(varlist)'
If my interpretation is incorrect, check the help for nmissing and npresent. The syntax is flexible enough.
Edit
A re-interpretation. You want to drop variables that don't have at least 3 non-missing values:
clear
set more off
input ///
ID X1 X2 X3 X4
1 . 1 1 2
2 . 2 2 3
3 . 3 1 .
4 1 . 3 1
5 . 2 4 3
6 2 3 . .
end
list, sep(0)
npresent, min(3)
keep `r(varlist)'
describe
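If installing from SSC is not an option, the same rule can be sketched with built-in commands only. This is a minimal sketch, assuming the variables to check are X1-X4 as in the example and that the threshold is 3 non-missing values; it is not what nmissing or npresent do internally.

```stata
clear
input ID X1 X2 X3 X4
1 . 1 1 2
2 . 2 2 3
3 . 3 1 .
4 1 . 3 1
5 . 2 4 3
6 2 3 . .
end

* Loop over the candidate variables (ID is deliberately left out).
foreach v of varlist X1-X4 {
    * Count the non-missing values of this variable.
    quietly count if !missing(`v')
    * Drop the variable if it has fewer than 3 of them.
    if r(N) < 3 {
        drop `v'
    }
}
describe
```

With the example data only X1 has fewer than 3 non-missing values, so only X1 is dropped. missing() accepts both numeric and string variables, so the loop does not need to treat them separately.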

Related

Stacking variables for each unique ID

I am using Stata 13 to stack several variables into one variable using
stack stand1-stand10, into(all)
However, I need to do it for each unique id, with the id carried alongside all, something like:
bysort familyid: stack stand1-stand10,into(all) keep familyid
We can use a simpler analogue of your data example.
clear
set obs 3
gen familyid = _n
forval j = 1/3 {
gen stand`j' = _n * `j'
}
list
+-------------------------------------+
| familyid stand1 stand2 stand3 |
|-------------------------------------|
1. | 1 1 2 3 |
2. | 2 2 4 6 |
3. | 3 3 6 9 |
+-------------------------------------+
save original
To stack with an identifier, just repeat the identifier variable name. For more than a few variables, it's easiest to prepare a call using a loop.
forval j = 1/3 {
local call `call' familyid stand`j'
}
di "`call'"
familyid stand1 familyid stand2 familyid stand3
stack `call', into(familyid stand)
sort familyid _stack
list, sepby(familyid)
+---------------------------+
| _stack familyid stand |
|---------------------------|
1. | 1 1 1 |
2. | 2 1 2 |
3. | 3 1 3 |
|---------------------------|
4. | 1 2 2 |
5. | 2 2 4 |
6. | 3 2 6 |
|---------------------------|
7. | 1 3 3 |
8. | 2 3 6 |
9. | 3 3 9 |
+---------------------------+
That said, it's easier to use reshape long.
use original, clear
reshape long stand, i(familyid) j(which)
list, sepby(familyid)
+--------------------------+
| familyid which stand |
|--------------------------|
1. | 1 1 1 |
2. | 1 2 2 |
3. | 1 3 3 |
|--------------------------|
4. | 2 1 2 |
5. | 2 2 4 |
6. | 2 3 6 |
|--------------------------|
7. | 3 1 3 |
8. | 3 2 6 |
9. | 3 3 9 |
+--------------------------+

remove duplicate (non-unique) paired values

I'm working with an edge list in Stata, of the type:
var1 var2
a 1
a 2
a 3
b 1
b 2
1 a
2 b
I want to remove non-unique pairs such as 1a and 2b (which are the same as a1 and b2 for me). How can I go about this?
. clear
. input str1 (var1 var2)
var1 var2
1. a 1
2. a 2
3. a 3
4. b 1
5. b 2
6. 1 a
7. 2 b
8. end
. gen first = cond(var1 <= var2, var1, var2)
. gen second = cond(var1 <= var2, var2, var1)
. list
+------------------------------+
| var1 var2 first second |
|------------------------------|
1. | a 1 1 a |
2. | a 2 2 a |
3. | a 3 3 a |
4. | b 1 1 b |
5. | b 2 2 b |
|------------------------------|
6. | 1 a 1 a |
7. | 2 b 2 b |
+------------------------------+
. duplicates list first second
Duplicates in terms of first second
+--------------------------------+
| group: obs: first second |
|--------------------------------|
| 1 1 1 a |
| 1 6 1 a |
| 2 5 2 b |
| 2 7 2 b |
+--------------------------------+
. duplicates drop first second, force
Duplicates in terms of first second
(2 observations deleted)
. list
+------------------------------+
| var1 var2 first second |
|------------------------------|
1. | a 1 1 a |
2. | a 2 2 a |
3. | a 3 3 a |
4. | b 1 1 b |
5. | b 2 2 b |
+------------------------------+
The easy part of the answer is to use duplicates drop. But how to get the data so that 1 a and a 1 are seen to be duplicates? This is all documented here. We can sort the values in each observation so that (in this case) both sort to 1 a. The linked paper says much more, but that's the main idea, and cond() helps.
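The same idea can be written as a short do-file that keeps the earliest row of each unordered pair and then removes the helper variables. This is a sketch under the assumption that the original row order matters; the variable name order is introduced here for illustration only.

```stata
clear
input str1 (var1 var2)
a 1
a 2
a 3
b 1
b 2
1 a
2 b
end

* Order-free key: the smaller string goes first, the larger second.
gen first  = cond(var1 <= var2, var1, var2)
gen second = cond(var1 <= var2, var2, var1)

* Remember the original order, then keep the earliest row of each pair.
gen long order = _n
bysort first second (order): keep if _n == 1

* Restore the original order and remove the helper variables.
sort order
drop first second order
list
```

Using bysort with keep if _n == 1 makes explicit which copy of each pair survives, which duplicates drop with force leaves implicit.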

get new data from old one SAS

I'm new to SAS and I have this example:
proc iml;
x={1 2 3 4 5 6 7 8 9};
y={2,3,5,4,8,6,4,2,2};
z={1,1,1,1,2,2,2,2,2};
data=t(x)||y||z;
print data;
run;
quit;
data
1 2 1
2 3 1
3 5 1
4 4 1
5 8 2
6 6 2
7 4 2
8 2 2
9 2 2
How can I create new data with only Z=1 and only Z=2?
Thank you.
You could use the LOC function to subset your data matrix. The following description of the function is taken from Indexing matrices in Introduction to SAS/IML.
The LOC function is often very useful for subsetting vectors and matrices. This function is used for locating elements which meet a given condition. The positions of the elements are returned in row-major order. For vectors, this is simply the position of the element. For matrices, some manipulation is often required in order to use the result of the LOC function as an index. The syntax of the function is:
matrix2=LOC(matrix1=value);
Applied to your example:
proc iml;
x={1 2 3 4 5 6 7 8 9};
y={2,3,5,4,8,6,4,2,2};
z={1,1,1,1,2,2,2,2,2};
data=t(x)||y||z;
print data;
z1rows=loc(data[,3]= 1);
z1=data[z1rows,];
print z1;
z2rows=loc(data[,3]= 2);
z2=data[z2rows,];
print z2;
run;
quit;
The result for print z1;
+------------+
| z1 |
+---+----+---+
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 5 | 1 |
| 4 | 4 | 1 |
+---+----+---+
The result for print z2;
+------------+
| z2 |
+---+----+---+
| 5 | 8 | 2 |
| 6 | 6 | 2 |
| 7 | 4 | 2 |
| 8 | 2 | 2 |
| 9 | 2 | 2 |
+---+----+---+

How to increase column values?

How to increase column values from:
1 | 1 | 7.317073
2 | 1 | 14.634146
3 | 1 | 24.390244
4 | 2 | 7.317073
5 | 2 | 14.634146
6 | 2 | 24.390244
To:
1 | 1 | 7.317073
2 | 1 | 14.634146
3 | 1 | 24.390244
4 | 2 | 7.317073
5 | 2 | 14.634146
6 | 2 | 24.390244
7 | 3 | 7.317073
8 | 3 | 14.634146
9 | 3 | 24.390244
10 | 4 | 7.317073
11 | 4 | 14.634146
12 | 4 | 24.390244
I'm using Open Office.
Assuming that the top left corner is A1, set the fourth row as follows:
A4: =A3+1
B4: =ROUNDUP(A4/3)
C4: =C1
Then fill them down to row 12.
For ColumnA simply selecting the first three rows, grabbing the fill handle (black square at the bottom right of the range) and dragging down to suit should be sufficient.
An alternative here to ROUNDUP is, in B1 and copied down:
=INT((ROW()-1)/3)+1
For ColumnC, do the same as for ColumnA but with Ctrl held down.

Stata: Cumulative number of new observations

I would like to check if a value has appeared in some previous row of the same column.
At the end I would like to have a cumulative count of the number of distinct observations.
Is there any solution other than concatenating all _n rows and using regular expressions? I'm getting there with concatenating the rows, but given the limit of 244 characters for string variables (in Stata <13), this is sometimes not applicable.
Here's what I'm doing right now:
gen tmp=x
replace tmp = tmp[_n-1]+ "," + tmp if _n > 1
gen cumu=0
replace cumu=1 if regexm(tmp[_n-1],x+"|"+x+",|"+","+x+",")==0
replace cumu= sum(cumu)
Example
+-----+
| x |
|-----|
1. | 12 |
2. | 32 |
3. | 12 |
4. | 43 |
5. | 43 |
6. | 3 |
7. | 4 |
8. | 3 |
9. | 3 |
10. | 3 |
+-----+
becomes
+-------------------------------+
| x | tmp |
|-----|--------------------------
1. | 12 | 12 |
2. | 32 | 12,32 |
3. | 12 | 12,32,12 |
4. | 43 | 12,32,12,43 |
5. | 43 | 12,32,12,43,43 |
6. | 3 | 12,32,12,43,43,3 |
7. | 4 | 12,32,12,43,43,3,4 |
8. | 3 | 12,32,12,43,43,3,4,3 |
9. | 3 | 12,32,12,43,43,3,4,3,3 |
10. | 3 | 12,32,12,43,43,3,4,3,3,3|
+--------------------------------+
and finally
+-----------+
| x | cumu|
|-----|------
1. | 12 | 1 |
2. | 32 | 2 |
3. | 12 | 2 |
4. | 43 | 3 |
5. | 43 | 3 |
6. | 3 | 4 |
7. | 4 | 5 |
8. | 3 | 5 |
9. | 3 | 5 |
10. | 3 | 5 |
+-----------+
Any ideas on how to avoid the 'middle step'? (For me that gets very important when x holds strings instead of numbers.)
Thanks!
Regular expressions are great, but here as often elsewhere simple calculations suffice. With your sample data
. input x
x
1. 12
2. 32
3. 12
4. 43
5. 43
6. 3
7. 4
8. 3
9. 3
10. 3
11. end
end of do-file
you can identify first occurrences of each distinct value:
. gen long order = _n
. bysort x (order) : gen first = _n == 1
. sort order
. l
+--------------------+
| x order first |
|--------------------|
1. | 12 1 1 |
2. | 32 2 1 |
3. | 12 3 0 |
4. | 43 4 1 |
5. | 43 5 0 |
|--------------------|
6. | 3 6 1 |
7. | 4 7 1 |
8. | 3 8 0 |
9. | 3 9 0 |
10. | 3 10 0 |
+--------------------+
The number of distinct values seen so far is then just a cumulative sum of first using sum(). This works with string variables too. In fact this problem is one of several discussed within
http://www.stata-journal.com/sjpdf.html?articlenum=dm0042
which is accessible to all as a .pdf. search distinct would have pointed you to this article.
Becoming fluent with what you can do with by:, sort, _n and _N is an important skill in Stata. See also
http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
for another article accessible to all.
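For completeness, the cumulative sum mentioned above can be sketched as a do-file. The variable name cumu is taken from the question; everything else follows the steps already shown.

```stata
clear
input x
12
32
12
43
43
3
4
3
3
3
end

* Remember the original order of the observations.
gen long order = _n

* Flag the first occurrence of each distinct value of x.
bysort x (order): gen byte first = _n == 1

* Back in the original order, the running sum of the flags is the
* number of distinct values seen so far.
sort order
gen cumu = sum(first)
list x cumu, sep(0)
```

No concatenated string is built at any point, so the string length limit mentioned in the question does not come into play, and the same commands work unchanged if x is a string variable.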