I'm working with an edge list in Stata, of the type:
var1 var2
a 1
a 2
a 3
b 1
b 2
1 a
2 b
I want to remove non-unique pairs such as 1a and 2b (which are same as a1 and b2 for me). How can I go about this?
. clear
. input str1 (var1 var2)
var1 var2
1. a 1
2. a 2
3. a 3
4. b 1
5. b 2
6. 1 a
7. 2 b
8. end
. gen first = cond(var1 <= var2, var1, var2)
. gen second = cond(var1 <= var2, var2, var1)
. list
+------------------------------+
| var1 var2 first second |
|------------------------------|
1. | a 1 1 a |
2. | a 2 2 a |
3. | a 3 3 a |
4. | b 1 1 b |
5. | b 2 2 b |
|------------------------------|
6. | 1 a 1 a |
7. | 2 b 2 b |
+------------------------------+
. duplicates list first second
Duplicates in terms of first second
+--------------------------------+
| group: obs: first second |
|--------------------------------|
| 1 1 1 a |
| 1 6 1 a |
| 2 5 2 b |
| 2 7 2 b |
+--------------------------------+
. duplicates drop first second, force
Duplicates in terms of first second
(2 observations deleted)
. list
+------------------------------+
| var1 var2 first second |
|------------------------------|
1. | a 1 1 a |
2. | a 2 2 a |
3. | a 3 3 a |
4. | b 1 1 b |
5. | b 2 2 b |
+------------------------------+
The easy part of the answer is to use duplicates drop. But how to get the data so that 1 a and a 1 are seen to be duplicates? This is all documented here. We can sort the values in each observation so that (in this case) both sort to 1 a. The linked paper says much more, but that's the main idea, and cond() helps.
Related
I have a table in SAS dataset that looks like this:
proc sql;
create table my_table
(id char(1),
my_date num format=date9.,
my_col num);
insert into my_table
values('A','01JAN2010'd,.)
values('A','02JAN2010'd,0)
values('A','03DEC2009'd,1)
values('A','04NOV2009'd,1)
values('B','01JAN2010'd,.)
values('B','02NOV2009'd,2)
values('C','01JAN2010'd,.)
values('C','02OCT2009'd,3)
values('D','01JAN2010'd,.)
values('D','02NOV2009'd,2)
values('D','03OCT2009'd,1)
values('D','04AUG2009'd,2)
values('D','05MAY2009'd,3)
values('D','06APR2009'd,1);
quit;
I am trying to create a new column desired that, for each group of id column, flags the row with a value of 1 if the value in my_col is missing or less than 3.
The part I'm having trouble with is that when there is a my_col value that is greater than 2, I need the desired value for that row to be missing and also stop flagging any remaining rows in the id group with a value of 1.
The resulting dataset should look like this:
+----+-----------+--------+---------+
| id | my_date | my_col | desired |
+----+-----------+--------+---------+
| A | 01JAN2010 | . | 1 |
| A | 02JAN2010 | 0 | 1 |
| A | 03DEC2009 | 1 | 1 |
| A | 04NOV2009 | 1 | 1 |
| B | 01JAN2009 | . | 1 |
| B | 02NOV2009 | 2 | 1 |
| C | 01JAN2010 | . | 1 |
| C | 02OCT2009 | 3 | . |
| D | 01JAN2010 | . | 1 |
| D | 02NOV2009 | 2 | 1 |
| D | 03OCT2009 | 1 | 1 |
| D | 04AUG2009 | 2 | 1 |
| D | 05MAY2009 | 3 | . |
| D | 06APR2009 | 1 | . |
+----+-----------+--------+---------+
Looks like a simple application of a retained variable. Set the flag to 1 when you start a new group and then set it to missing when the value of MY_COL is larger than 2.
data want;
set my_table ;
by id;
if first.id then desired=1;
if my_col>2 then desired=.;
retain desired;
run;
Also it is not clear why you used such complicated code to create your example data. Why not a simple data step?
data my_table;
input id :$1. my_date :date. my_col;
format my_date date9.;
cards;
A 01JAN2010 .
A 02JAN2010 0
A 03DEC2009 1
A 04NOV2009 1
B 01JAN2010 .
B 02NOV2009 2
C 01JAN2010 .
C 02OCT2009 3
D 01JAN2010 .
D 02NOV2009 2
D 03OCT2009 1
D 04AUG2009 2
D 05MAY2009 3
D 06APR2009 1
;
I can't think of a simpler way to do it, but this works. You will need to have your data sorted by id.
data my_table2;
set my_table;
by id;
format gt2flag $1.;
retain gt2flag;
if first.id then gt2flag='';
if my_col gt 2 then gt2flag='Y';
if gt2flag = 'Y' then desired=.;
else desired=1;
drop gt2flag;
run;
id my_date my_col desired
A 01JAN2010 . 1
A 02JAN2010 0 1
A 03DEC2009 1 1
A 04NOV2009 1 1
B 01JAN2010 . 1
B 02NOV2009 2 1
C 01JAN2010 . 1
C 02OCT2009 3 .
D 01JAN2010 . 1
D 02NOV2009 2 1
D 03OCT2009 1 1
D 04AUG2009 2 1
D 05MAY2009 3 .
D 06APR2009 1 .
In Power BI, I have a table that looks like this:
Table 1
ID | At1
1 | 1
2 | 2
3 | 3
4 | 4
5 | 5
And another table that looks like this:
Table2
Value
A
B
C
D
I need to create a new table that add "n" times each row of table 1, where "n" would be each row of table 2. Something like this:
Table 3
ID | At1 | Value
1 | 1 | A
1 | 1 | B
1 | 1 | C
1 | 1 | D
2 | 2 | A
2 | 2 | B
2 | 2 | C
2 | 2 | D
3 | 3 | A
3 | 3 | B
3 | 3 | C
3 | 3 | D
4 | 4 | A
4 | 4 | B
4 | 4 | C
4 | 4 | D
I am using Stata 13 to stack several variables into one variable using
stack stand1-stand10, into(all)
However, I need to do it for each unique id which is pasted parallel to all, something like:
bysort familyid: stack stand1-stand10,into(all) keep familyid
We can use a simpler analogue of your data example.
clear
set obs 3
gen familyid = _n
forval j = 1/3 {
gen stand`j' = _n * `j'
}
list
+-------------------------------------+
| familyid stand1 stand2 stand3 |
|-------------------------------------|
1. | 1 1 2 3 |
2. | 2 2 4 6 |
3. | 3 3 6 9 |
+-------------------------------------+
save original
To stack with an identifier, just repeat the identifier variable name. For more than a few variables, it's easiest to prepare a call using a loop.
forval j = 1/3 {
local call `call' familyid stand`j'
}
di "`call'"
familyid stand1 familyid stand2 familyid stand3
stack `call', into(familyid stand)
sort familyid _stack
list, sepby(familyid)
+---------------------------+
| _stack familyid stand |
|---------------------------|
1. | 1 1 1 |
2. | 2 1 2 |
3. | 3 1 3 |
|---------------------------|
4. | 1 2 2 |
5. | 2 2 4 |
6. | 3 2 6 |
|---------------------------|
7. | 1 3 3 |
8. | 2 3 6 |
9. | 3 3 9 |
+---------------------------+
That said, it's easier to use reshape long.
use original, clear
reshape long stand, i(familyid) j(which)
list, sepby(familyid)
+--------------------------+
| familyid which stand |
|--------------------------|
1. | 1 1 1 |
2. | 1 2 2 |
3. | 1 3 3 |
|--------------------------|
4. | 2 1 2 |
5. | 2 2 4 |
6. | 2 3 6 |
|--------------------------|
7. | 3 1 3 |
8. | 3 2 6 |
9. | 3 3 9 |
+--------------------------+
I have a large dataset that looks like the one below. I would like to drop the variables (not the observations/rows) that have less 3 observations in the rows. In this case only variable X1 needs to be dropped.
I apologise if I am asking something obvious, however, at this point I do not have a clue on how to proceed with this.
+-----+-----+-----+-----+-----+
| ID | X1 | X2 | X3 | X4 |
+-----+-----+-----+-----+-----+
| 1 | . | 1 | 1 | 2 |
| 2 | . | 2 | 2 | 3 |
| 3 | . | 3 | 1 | . |
| 4 | 1 | . | 3 | 1 |
| 5 | . | 2 | 4 | 3 |
| 6 | 2 | 3 | . | . |
|total| 2 | 5 | 5 | 4 |
+-----+-----+-----+-----+-----+
My interpretation is you want to drop variables that have at least 3 missing values.
You can use nmissing, from SSC (ssc install nmissing):
clear
set more off
input ///
x y z
. . 5
. 6 8
4 . 9
. . 1
5 . .
end
list
nmissing, min(3)
drop `r(varlist)'
If my interpretation is incorrect, check the help for nmissing and npresent. The syntax is flexible enough.
Edit
A re-interpretation. You want to drop variables that don't have at least 3 non-missing values:
clear
set more off
input ///
ID X1 X2 X3 X4
1 . 1 1 2
2 . 2 2 3
3 . 3 1 .
4 1 . 3 1
5 . 2 4 3
6 2 3 . .
end
list, sep(0)
npresent, min(3)
keep `r(varlist)'
describe
I'm new in SAS
and I have this example :
proc iml;
x={1 2 3 4 5 6 7 8 9};
y={2,3,5,4,8,6,4,2,2};
z={1,1,1,1,2,2,2,2,2};
data=t(x)||y||z;
print data;
run;
quit;
data
1 2 1
2 3 1
3 5 1
4 4 1
5 8 2
6 6 2
7 4 2
8 2 2
9 2 2
How can I creat new data with only Z=1 and only Z=2 ?
Thank you.
You could use the loc function to subset your data matrix. The following is the description of the function, snipped from Indexing matrices in Introduction to SAS/IML.
the LOC function is often very useful for subsetting vectors and matrices. This function is used for locating elements which meet a given condition. The positions of the elements are returned in row-major order. For vectors, this is simply the position of the element. For matrices, some manipulation is often required in order to use the result of the LOC function as an index. The syntax of the function is:
matrix2=LOC(matrix1=value);
Applied to your example:
proc iml;
x={1 2 3 4 5 6 7 8 9};
y={2,3,5,4,8,6,4,2,2};
z={1,1,1,1,2,2,2,2,2};
data=t(x)||y||z;
print data;
z1rows=loc(data[,3]= 1);
z1=data[z1rows,];
print z1;
z2rows=loc(data[,3]= 2);
z2=data[z2rows,];
print z2;
run;
quit;
The result for print z1;
+------------+
| z1 |
+---+----+---+
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 5 | 1 |
| 4 | 4 | 1 |
+---+----+---+
The result for print z2;
+------------+
| z2 |
+---+----+---+
| 5 | 8 | 2 |
| 6 | 6 | 2 |
| 7 | 4 | 2 |
| 8 | 2 | 2 |
| 9 | 2 | 2 |
+---+----+---+