Duplicate Observations in Stata (data manipulation)

I have a string variable
var1
x
y
z
that I need to "duplicate" and append to give
var1 var2
x x
x y
x z
--------
y x
y y
y z
--------
z x
z y
z z
where I added the horizontal lines to facilitate reading. Is such an expansion possible in Stata without loops? (I am not sure if "duplicate" is the right term.)

Two commands:
gen var2 = var1
fillin var1 var2
See help fillin and http://www.stata-journal.com/sjpdf.html?articlenum=dm0011
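For readers who want to see what the two commands produce, here is a minimal Python sketch of the same idea: pair every distinct value of var1 with every distinct value of itself. The function name cross_with_self is hypothetical, and the sketch ignores the _fillin indicator variable that fillin also creates.

```python
from itertools import product


def cross_with_self(values):
    """Return every (var1, var2) pair built from the distinct values."""
    distinct = sorted(set(values))
    # product(..., repeat=2) yields the full rectangular set of pairs
    return list(product(distinct, repeat=2))


pairs = cross_with_self(["x", "y", "z"])
for var1, var2 in pairs:
    print(var1, var2)
```

No explicit loop over observations is needed to build the pairs, which mirrors the point of the answer above.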


Deleting first instance of a column after group by in sas proc sql

I have the following SAS dataset.
correlation policynum risknum
A X Y
A X Y
A X Y
B X Y
B X Y
B X Y
B X L
B X L
B X L
C Z M
C Z M
C Z M
D Z M
D Z M
D Z M
In SAS, I want to filter the above dataset so I get my final output as:
correlation policynum risknum
B X Y
B X Y
B X Y
B X L
B X L
B X L
D Z M
D Z M
D Z M
That is, for each group of policynum and risknum, if multiple values exist for correlation, I want to keep the rows with the second value and drop the rows with the first value.
If only a single value of correlation exists for a group of policynum and risknum, I want to retain that group in my final output too.
What would be the best way to do this? It might be something simple as I am relatively new to SAS.
Thanks in advance!
If the correlation value you want is always the highest one in sort order within each group, you can use SQL. Otherwise SQL cannot be used: it is based on set theory and has no implicit row numbers, so it cannot tell which value occurs last in the data set. In that case, use a DATA step with a DOW loop.
FYI, one common situation in which SAS coders use the phrase 'DOW loop' is when SET and BY statements occur inside a DO loop.
Example:
data have;
input correlation $ policynum $ risknum $;
datalines;
A X Y
A X Y
A X Y
B X Y
B X Y
B X Y
B X L
B X L
B X L
C Z M
C Z M
C Z M
D Z M
D Z M
D Z M
;
/* keep last group of a nested group */
* SQL can be used only if correlation wanted is ALWAYS highest valued correlation;
proc sql;
create table want as
select * from have
group by policynum, risknum
having correlation = max(correlation)
;
quit;
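* A DATA step DOW loop can be used when the correlation wanted is the last one occurring within the BY group;
data want;
/* first pass: read one policynum/risknum group and count its rows in _n_ */
do _n_ = 1 by 1 until (last.risknum);
set have;
by policynum risknum notsorted; /* presume at least contiguous */
end;
_want_correlation = correlation; /* correlation on the last row of the group */
/* second pass: re-read the same rows and keep only those that match */
do _n_ = 1 to _n_;
set have;
if _want_correlation = correlation then OUTPUT;
end;
drop _want_correlation;
run;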
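The two-pass logic of the DOW loop can be sketched in Python for comparison. This is only an illustration under the same assumption as the SAS code (rows of a group are contiguous), and it groups on the policynum/risknum pair as described in the question. The function name keep_last_correlation is hypothetical.

```python
from itertools import groupby

# Same rows as the datalines above: (correlation, policynum, risknum)
rows = (
    [("A", "X", "Y")] * 3
    + [("B", "X", "Y")] * 3
    + [("B", "X", "L")] * 3
    + [("C", "Z", "M")] * 3
    + [("D", "Z", "M")] * 3
)


def keep_last_correlation(rows):
    """Keep, per contiguous (policynum, risknum) group, the rows whose
    correlation equals the correlation on the group's last row."""
    kept = []
    for _, group in groupby(rows, key=lambda r: (r[1], r[2])):
        group = list(group)          # first pass: read the whole group
        last_corr = group[-1][0]     # correlation on the last row
        # second pass: output only the matching rows
        kept.extend(r for r in group if r[0] == last_corr)
    return kept


want = keep_last_correlation(rows)
for row in want:
    print(*row)
```

Like BY ... NOTSORTED, itertools.groupby only merges adjacent rows with equal keys, so the input does not have to be sorted, only contiguous.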

Convert Stata "egen, group" to SAS

I am trying to find the equivalent of the Stata code "egen group" in SAS.
The goal is:
I have three variables x, y, and z. I want to create a new variable which will assign a different ordinal number for each combination of values of x, y, and z. How can I do this in SAS?
If you order your data by x, y, and z, SAS knows exactly where the groups x, y, and z start/end. You can use this to create unique identifiers.
Let's make some sample data. This data purposefully has duplicate values to illustrate how first. works.
data have;
do x = 'a', 'b', 'c';
do y = 'd', 'e', 'f';
do z = 'g', 'h', 'i';
output;
output;
end;
end;
end;
run;
Single-Threaded Unique IDs
This is the most likely case for you. This applies if you're running code in Base SAS.
First, sort the data by x y z.
proc sort data=have;
by x y z;
run;
Next, create your identifiers. We'll tell SAS that the data is ordered by x y z. Since z is nested within y and x, if we reach the first value of z, we've reached a unique combination of x y z.
data want;
set have;
by x y z;
if(first.z) then id+1;
run;
Output:
x y z id
a d g 1
a d g 1
a d h 2
a d h 2
a d i 3
a d i 3
...
id+1 is a special SAS shortcut called a sum statement and is equivalent to the following code:
retain id 0;
if(first.z) then id = id+1;
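As a rough cross-check of the sort-then-count idea, here is a small Python sketch that builds the same sample data and bumps a counter whenever the (x, y, z) combination changes. The function name assign_group_ids is hypothetical, and the comparison with the previous row stands in for first.z.

```python
from itertools import product


def assign_group_ids(rows):
    """Sort the rows, then number each distinct combination 1, 2, 3, ..."""
    result = []
    current_id = 0
    previous = None
    for row in sorted(rows):
        if row != previous:      # plays the role of first.z
            current_id += 1      # plays the role of the sum statement id+1
            previous = row
        result.append((*row, current_id))
    return result


# Two copies of every combination, as in the sample data step above
have = [combo for combo in product("abc", "def", "ghi") for _ in range(2)]
want = assign_group_ids(have)
for row in want[:6]:
    print(*row)
```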
Multi-Threaded Unique IDs
This applies if you're running code in SAS Viya in CAS. You need to add _THREADID_ to the ID to make it unique. For example:
cas;
libname casuser cas caslib='casuser';
data casuser.have;
set have;
run;
data casuser.want;
set casuser.have;
by x y z;
if(first.z) then _id+1;
id = catx('_', _THREADID_, _id);
drop _id;
run;
Output:
x y z id
a d g 15_1
a d g 15_1
a d h 15_2
a d h 15_2
a d i 15_3
a d i 15_3
...

How to impute two variables simultaneously in Stata?

I am trying to impute two variables simultaneously in Stata, say y and x, and then run a linear regression on them.
The code I used is:
mi set mlong
mi register imputed y x
mi impute regress y a b c, add(10)
mi impute regress x a b c, add(10)
mi estimate: regress y x
I ran into the error "estimation sample varies between m=1 and m=11". Can someone help me out? Thanks!
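The error most likely arises because each mi impute call with add(10) adds ten new imputations rather than filling in the existing ones: after the two calls there are 20 imputations, with y imputed only in m=1 to m=10 and x imputed only in m=11 to m=20, so the estimation sample differs between them. Imputing both variables in a single call avoids this. I prefer doing it using chained equations. The code below should work (note that Part 1 can be skipped, as I only used it to generate a suitable mock dataset):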
* Part 1
clear all
set seed 0945
set obs 50
gen y0 = _n
gen y = runiform()
sort y
gen x0 = _n
gen x = runiform()
sort x
replace y = . in 1
replace y = . in 5
replace y = . in 10
replace y = . in 15
replace y = . in 20
replace y = . in 25
replace y = . in 30
replace y = . in 35
replace y = . in 40
replace y = . in 45
replace y = . in 50
sort y
replace x = . in 1
replace x = . in 5
replace x = . in 10
replace x = . in 15
replace x = . in 20
replace x = . in 25
replace x = . in 30
replace x = . in 35
replace x = . in 40
replace x = . in 45
replace x = . in 50
gen a = _n
sort x
gen b = _n
gen c = _n
* Part 2
mi set mlong
mi register imputed y x
mi impute chained (regress) y x = a b c, add(10)
mi estimate, dots: regress y x

Merge multiple rows with same value into one row in pandas

I have seen that there are similar questions, but the answers did not quite fit my exact needs. I have a dataframe that contains rows with different values. Some of the rows however have exactly the same value.
Column1 Column2 Column3
0 a x x
1 a x x
2 a x x
3 d y y
4 d y y
What I would like to have is:
Column1 Column2 Column3
0 a x x
1 d y y
So basically I want to merge all rows with the same values in all columns into one row. What is the cleanest way to do that in Python?
Thank you in advance!
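Call drop_duplicates (chain .reset_index(drop=True) if you also want the index renumbered from 0, as in your desired output):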
In [214]:
df.drop_duplicates()
Out[214]:
Column1 Column2 Column3
0 a x x
3 d y y
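To make the behaviour concrete without pandas, here is a minimal standard-library sketch of the same keep-the-first-occurrence idea. It assumes each row is represented as a tuple, and the variable names are illustrative only.

```python
# Rows from the question, one tuple per row (Column1, Column2, Column3)
rows = [
    ("a", "x", "x"),
    ("a", "x", "x"),
    ("a", "x", "x"),
    ("d", "y", "y"),
    ("d", "y", "y"),
]

# dict keys are unique and keep insertion order, so this keeps the
# first occurrence of each distinct row and drops the later repeats
unique_rows = list(dict.fromkeys(rows))

for index, row in enumerate(unique_rows):
    print(index, *row)
```

For real dataframes the drop_duplicates call above remains the natural choice; the sketch only shows what "duplicate" means here, namely rows that are equal in every column.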

How to get only last 4 WORKING days data in SAS?

I'm trying to pull only the last 4 working days of data in SAS. I tried the following code, but I'm not getting what I intended:
data input;
Input id $ id1 $ id2 $ num date date9.;
Format Date Date9.;
datalines;
x y z 3 19JUL2015
x y z 2 18JUL2015
x y z 3 17JUL2015
x y z 2 16JUL2015
x y z 3 15JUL2015
x y z 2 14JUL2015
x y z 3 13JUL2015
a b c 1 12JUL2015
a b c 1 11JUL2015
a b c 1 10JUL2015
a b c 1 09JUL2015
a b c 1 08JUL2015
a b c 2 07JUL2015
x y z 1 06JUL2015
;
Run;
Data test;
Set input;
Weekday=Weekday(Date);
intck=intck('weekday',Date,today());
*if intck('weekday',Date,today()) >4;
if 1<Weekday(Date)<7 and Date>=today()-4;
Run;
I think you need to reverse the > in your code, and add a qualification that you only want weekdays:
Data test;
Set input;
Weekday=Weekday(Date);
intck=intck('weekday',Date,today());
if intck('weekday',Date,'20JUL2015'd) le 4 and 1<weekday(Date)<7;
*if 1<Weekday(Date)<7 and Date>='20JUL2015'd-5;
Run;
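The filter can also be reasoned about with a small Python sketch. It counts the Monday-to-Friday dates between each row's date and a reference date and keeps weekday rows where that count is at most 4. This is an approximation of the intck('weekday', ...) condition for weekday dates, not a reimplementation of the SAS function; the function names are hypothetical, and the reference date 20JUL2015 is taken from the answer's code.

```python
from datetime import date, timedelta


def weekdays_between(start, end):
    """Count Monday-Friday dates in the half-open range (start, end]."""
    count = 0
    day = start
    while day < end:
        day += timedelta(days=1)
        if day.weekday() < 5:    # Monday is 0, Friday is 4
            count += 1
    return count


def last_working_days(dates, reference, n=4):
    """Keep weekday dates at most n working days before the reference."""
    return [
        d for d in dates
        if d <= reference
        and d.weekday() < 5
        and weekdays_between(d, reference) <= n
    ]


# Dates from the question's datalines: 19JUL2015 down to 06JUL2015
dates = [date(2015, 7, day) for day in range(19, 5, -1)]
reference = date(2015, 7, 20)
recent = last_working_days(dates, reference)
for d in recent:
    print(d.isoformat())
```

With this reference date, a Monday, the weekend dates 18 and 19 July are excluded by the weekday test and 13 July falls five working days back, so only 14 to 17 July remain.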