get new data from old one SAS - sas

I'm new in SAS
and I have this example :
proc iml;
x={1 2 3 4 5 6 7 8 9};
y={2,3,5,4,8,6,4,2,2};
z={1,1,1,1,2,2,2,2,2};
data=t(x)||y||z;
print data;
run;
quit;
data
1 2 1
2 3 1
3 5 1
4 4 1
5 8 2
6 6 2
7 4 2
8 2 2
9 2 2
How can I creat new data with only Z=1 and only Z=2 ?
Thank you.

You could use the loc function to subset your data matrix. The following is the description of the function, snipped from Indexing matrices in Introduction to SAS/IML.
the LOC function is often very useful for subsetting vectors and matrices. This function is used for locating elements which meet a given condition. The positions of the elements are returned in row-major order. For vectors, this is simply the position of the element. For matrices, some manipulation is often required in order to use the result of the LOC function as an index. The syntax of the function is:
matrix2=LOC(matrix1=value);
Applied to your example:
proc iml;
x={1 2 3 4 5 6 7 8 9};
y={2,3,5,4,8,6,4,2,2};
z={1,1,1,1,2,2,2,2,2};
data=t(x)||y||z;
print data;
z1rows=loc(data[,3]= 1);
z1=data[z1rows,];
print z1;
z2rows=loc(data[,3]= 2);
z2=data[z2rows,];
print z2;
run;
quit;
The result for print z1;
+------------+
| z1 |
+---+----+---+
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 5 | 1 |
| 4 | 4 | 1 |
+---+----+---+
The result for print z2;
+------------+
| z2 |
+---+----+---+
| 5 | 8 | 2 |
| 6 | 6 | 2 |
| 7 | 4 | 2 |
| 8 | 2 | 2 |
| 9 | 2 | 2 |
+---+----+---+

Related

How to rank observations in panel data?

I have a panel dataset in Stata with several countries and each country containing groups. I would like to rank the groups by country, according to the variable var1.
The structure of my dataset is as follows (the rank column is what I would like to achieve). Note that var1 is indeed constant within groups (it is just the within group average of another variable).
--country--|--groupId--|---time----|---var1----|---rank---
1 | 1 | 1 | 50 | 3
1 | 1 | 2 | 50 | 3
1 | 1 | 3 | 50 | 3
1 | 2 | 1 | 90 | 1
1 | 2 | 2 | 90 | 1
1 | 2 | 3 | 90 | 1
1 | 3 | 1 | 60 | 2
1 | 3 | 2 | 60 | 2
1 | 3 | 3 | 60 | 2
2 | 4 | 1 | 15 | 2
2 | 4 | 2 | 15 | 2
2 | 4 | 3 | 15 | 2
2 | 5 | 1 | 10 | 3
2 | 5 | 2 | 10 | 3
2 | 5 | 3 | 10 | 3
2 | 6 | 1 | 80 | 1
2 | 6 | 2 | 80 | 1
2 | 6 | 3 | 80 | 1
Among the options I have tried is:
sort country groupId
by country (groupId): egen rank = rank(var1)
However, I cannot achieve the desired result.
Thanks for the data example. There are two problems with your code. One is that as you want to rank from highest to lowest, you need to negate the argument to rank(). The second is that given the repetitions, you need to rank on one time only and then copy those ranks to other times.
This works with your data example, here edited to be input code. (See also the Stata tag wiki for that principle.)
clear
input country groupId time var1 rank
1 1 1 50 3
1 1 2 50 3
1 1 3 50 3
1 2 1 90 1
1 2 2 90 1
1 2 3 90 1
1 3 1 60 2
1 3 2 60 2
1 3 3 60 2
2 4 1 15 2
2 4 2 15 2
2 4 3 15 2
2 5 1 10 3
2 5 2 10 3
2 5 3 10 3
2 6 1 80 1
2 6 2 80 1
2 6 3 80 1
end
bysort country : egen wanted = rank(-var) if time == 1
bysort country groupId (time) : replace wanted = wanted[1]
assert rank == wanted

How can I compare values of adjacent observations?

I have two datasets that I have appended together in Stata.
There is one variable, say Age in both data sets. I sorted the data so that the ages are in ascending order. I want to delete the observations in each dataset where the corresponding ages don't match.
Dataset 1:
Obs Age
1 7
2 8
3 10
4 5
Dataset 2:
Obs Age
1 10
2 5
3 9
4 7
Combined and sorted in ascending order:
Obs Age
1 5
2 5
3 7
4 7
5 8
6 9
7 10
8 10
So because the ages when sorted don't match up for observations 5 and 6, I want to delete them. Essentially I want a way to loop through pairs of adjacent numbers and compare their values so that I'm only left with pairs with the same ages.
Looping over observations is inefficient and in the vast majority of cases not necessary.
The following works for me:
clear
input age
5
5
7
7
8
9
10
10
end
generate tag = age != age[_n+1] & age != age[_n-1]
list
+-----------+
| age tag |
|-----------|
1. | 5 0 |
2. | 5 0 |
3. | 7 0 |
4. | 7 0 |
5. | 8 1 |
|-----------|
6. | 9 1 |
7. | 10 0 |
8. | 10 0 |
+-----------+
After getting rid of the relevant observations you get the desired result:
keep if tag == 0
list
+-----------+
| age tag |
|-----------|
1. | 5 0 |
2. | 5 0 |
3. | 7 0 |
4. | 7 0 |
5. | 10 0 |
|-----------|
6. | 10 0 |
+-----------+

Stacking variables for each unique ID

I am using Stata 13 to stack several variables into one variable using
stack stand1-stand10, into(all)
However, I need to do it for each unique id which is pasted parallel to all, something like:
bysort familyid: stack stand1-stand10,into(all) keep familyid
We can use a simpler analogue of your data example.
clear
set obs 3
gen familyid = _n
forval j = 1/3 {
gen stand`j' = _n * `j'
}
list
+-------------------------------------+
| familyid stand1 stand2 stand3 |
|-------------------------------------|
1. | 1 1 2 3 |
2. | 2 2 4 6 |
3. | 3 3 6 9 |
+-------------------------------------+
save original
To stack with an identifier, just repeat the identifier variable name. For more than a few variables, it's easiest to prepare a call using a loop.
forval j = 1/3 {
local call `call' familyid stand`j'
}
di "`call'"
familyid stand1 familyid stand2 familyid stand3
stack `call', into(familyid stand)
sort familyid _stack
list, sepby(familyid)
+---------------------------+
| _stack familyid stand |
|---------------------------|
1. | 1 1 1 |
2. | 2 1 2 |
3. | 3 1 3 |
|---------------------------|
4. | 1 2 2 |
5. | 2 2 4 |
6. | 3 2 6 |
|---------------------------|
7. | 1 3 3 |
8. | 2 3 6 |
9. | 3 3 9 |
+---------------------------+
That said, it's easier to use reshape long.
use original, clear
reshape long stand, i(familyid) j(which)
list, sepby(familyid)
+--------------------------+
| familyid which stand |
|--------------------------|
1. | 1 1 1 |
2. | 1 2 2 |
3. | 1 3 3 |
|--------------------------|
4. | 2 1 2 |
5. | 2 2 4 |
6. | 2 3 6 |
|--------------------------|
7. | 3 1 3 |
8. | 3 2 6 |
9. | 3 3 9 |
+--------------------------+

remove duplicate (non-unique) paired values

I'm working with an edge list in Stata, of the type:
var1 var2
a 1
a 2
a 3
b 1
b 2
1 a
2 b
I want to remove non-unique pairs such as 1a and 2b (which are same as a1 and b2 for me). How can I go about this?
. clear
. input str1 (var1 var2)
var1 var2
1. a 1
2. a 2
3. a 3
4. b 1
5. b 2
6. 1 a
7. 2 b
8. end
. gen first = cond(var1 <= var2, var1, var2)
. gen second = cond(var1 <= var2, var2, var1)
. list
+------------------------------+
| var1 var2 first second |
|------------------------------|
1. | a 1 1 a |
2. | a 2 2 a |
3. | a 3 3 a |
4. | b 1 1 b |
5. | b 2 2 b |
|------------------------------|
6. | 1 a 1 a |
7. | 2 b 2 b |
+------------------------------+
. duplicates list first second
Duplicates in terms of first second
+--------------------------------+
| group: obs: first second |
|--------------------------------|
| 1 1 1 a |
| 1 6 1 a |
| 2 5 2 b |
| 2 7 2 b |
+--------------------------------+
. duplicates drop first second, force
Duplicates in terms of first second
(2 observations deleted)
. list
+------------------------------+
| var1 var2 first second |
|------------------------------|
1. | a 1 1 a |
2. | a 2 2 a |
3. | a 3 3 a |
4. | b 1 1 b |
5. | b 2 2 b |
+------------------------------+
The easy part of the answer is to use duplicates drop. But how to get the data so that 1 a and a 1 are seen to be duplicates? This is all documented here. We can sort the values in each observation so that (in this case) both sort to 1 a. The linked paper says much more, but that's the main idea, and cond() helps.

Dropping columns/variables based on count of missing in Stata

I have a large dataset that looks like the one below. I would like to drop the variables (not the observations/rows) that have less 3 observations in the rows. In this case only variable X1 needs to be dropped.
I apologise if I am asking something obvious, however, at this point I do not have a clue on how to proceed with this.
+-----+-----+-----+-----+-----+
| ID | X1 | X2 | X3 | X4 |
+-----+-----+-----+-----+-----+
| 1 | . | 1 | 1 | 2 |
| 2 | . | 2 | 2 | 3 |
| 3 | . | 3 | 1 | . |
| 4 | 1 | . | 3 | 1 |
| 5 | . | 2 | 4 | 3 |
| 6 | 2 | 3 | . | . |
|total| 2 | 5 | 5 | 4 |
+-----+-----+-----+-----+-----+
My interpretation is you want to drop variables that have at least 3 missing values.
You can use nmissing, from SSC (ssc install nmissing):
clear
set more off
input ///
x y z
. . 5
. 6 8
4 . 9
. . 1
5 . .
end
list
nmissing, min(3)
drop `r(varlist)'
If my interpretation is incorrect, check the help for nmissing and npresent. The syntax is flexible enough.
Edit
A re-interpretation. You want to drop variables that don't have at least 3 non-missing values:
clear
set more off
input ///
ID X1 X2 X3 X4
1 . 1 1 2
2 . 2 2 3
3 . 3 1 .
4 1 . 3 1
5 . 2 4 3
6 2 3 . .
end
list, sep(0)
npresent, min(3)
keep `r(varlist)'
describe