I have the following data
EMPID XVAR SRC
ABC PER1 1
ABC 2
XYZ PER1 1
XYZ 2
LMN PER1 1
LMN 2
LMN PER2 1
LMN 2
LMN 2
LMN PER3 1
LMN 2
I need to create a new variable _XVAR for records where SRC=2, based on the value of XVAR on the most recent previous record where SRC=1.
The output should be like:
EMPID XVAR SRC _XVAR
ABC PER1 1
ABC 2 PER1
XYZ PER1 1
XYZ 2 PER1
LMN PER1 1
LMN 2 PER1
LMN PER2 1
LMN 2 PER2
LMN 2 PER2
LMN PER3 1
LMN 2 PER3
I am trying the following, but it isn't working:
data t003;
set t003;
by EMPID;
retain XVAR;
if SRC eq 2 then _XVAR=XVAR;
run;
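It can also be done by saving XVAR in a new variable (last_XVAR), retaining it, and dropping it (you don't want it in the output). Then use that variable to assign _XVAR. Note that last_XVAR must be updated after the IF, and only on SRC=1 rows. Otherwise the current (blank) XVAR is used in the assignment of _XVAR, and the second of two consecutive SRC=2 rows gets a blank value.
Your code, edited:
data t003;
set t003;
/* the sample data is grouped by EMPID but not in sorted order */
by EMPID notsorted;
length _XVAR last_XVAR $ 10;
retain last_XVAR;
/* do not carry a value over from the previous employee */
if first.EMPID then last_XVAR = ' ';
if SRC eq 2 then _XVAR = last_XVAR;
/* remember XVAR from SRC=1 rows only */
else last_XVAR = XVAR;
drop last_XVAR;
run;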
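To check the carry-forward logic outside SAS, here is a minimal Python sketch of the same idea. It is only an illustration: the `rows` list and the `carry_forward` function are hypothetical names, and a blank XVAR is represented by an empty string.

```python
# Illustrative sketch: mirrors the retain-based DATA step logic in plain Python.
rows = [
    ("ABC", "PER1", 1), ("ABC", "", 2),
    ("XYZ", "PER1", 1), ("XYZ", "", 2),
    ("LMN", "PER1", 1), ("LMN", "", 2),
    ("LMN", "PER2", 1), ("LMN", "", 2), ("LMN", "", 2),
    ("LMN", "PER3", 1), ("LMN", "", 2),
]

def carry_forward(rows):
    """Append a fourth field: XVAR of the latest SRC=1 row, filled on SRC=2 rows."""
    out = []
    last_xvar = ""      # plays the role of the retained last_XVAR
    prev_emp = None
    for empid, xvar, src in rows:
        if empid != prev_emp:
            # new employee: do not carry a value across groups
            last_xvar = ""
            prev_emp = empid
        if src == 2:
            new_xvar = last_xvar
        else:
            new_xvar = ""
            last_xvar = xvar   # remember XVAR from SRC=1 rows only
        out.append((empid, xvar, src, new_xvar))
    return out

result = carry_forward(rows)
for row in result:
    print(row)
```

Because the remembered value is updated only on SRC=1 rows, the two consecutive SRC=2 rows for LMN both receive PER2.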
You can use LAG to retrieve prior row values and conditionally use that value in an assignment.
Sample data
data have; input
EMPID $ XVAR $ SRC; datalines;
ABC PER1 1
ABC . 2
XYZ PER1 1
XYZ . 2
LMN PER1 1
LMN . 2
LMN PER2 1
LMN . 2
LMN . 2
LMN PER3 1
LMN . 2
run;
Example code
data want;
set have;
lag_xvar = lag(xvar);
if src eq 2 then do;
if lag_xvar ne '' then _xvar = lag_xvar;
end;
else
_xvar = ' ';
retain _xvar;
drop lag_xvar;
run;
I am trying to figure out how many customers get their product from a certain store. The problem is that each prod_id can have up to 12 weeks of data for each customer. I have tried a multitude of approaches. Some add up all of the observations for each customer, while others, like the one below, remove all but the last observation.
proc sort data= have; BY Prod_ID cust; run;
Data want;
Set have;
by Prod_Id cust;
if (last.Prod_Id and last.cust);
count= +1;
run;
data have
prod_id cust week store
1 A 7/29 ABC
1 A 8/5 ABC
1 A 8/12 ABC
1 A 8/19 ABC
1 B 7/29 ABC
1 B 8/5 ABC
1 B 8/12 ABC
1 B 8/19 ABC
1 B 8/26 ABC
1 C 7/29 XYZ
1 C 8/5 XYZ
1 F 7/29 XYZ
1 F 8/5 XYZ
2 A 7/29 ABC
2 A 8/5 ABC
2 A 8/12 ABC
2 A 8/19 ABC
2 C 7/29 EFG
2 C 8/5 EFG
2 C 8/12 EFG
2 C 8/19 EFG
2 C 8/26 EFG
What I want it to look like:
prod_id store count
1 ABC 2
1 XYZ 2
2 ABC 1
2 EFG 2
First, read up on how the IF statement works.
I've just edited your code to make it work:
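proc sort data=have;
by prod_id store cust;
run;
data want(drop=cust week);
set have;
by prod_id store cust;
/* first.store is also true whenever prod_id changes */
if first.store then count = 0;
/* one increment per distinct customer; the sum statement retains count */
if last.cust then count + 1;
/* last.store is also true on the last row of each prod_id */
if last.store then output;
run;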
If you have any questions, ask.
You can also do this in PROC SQL with COUNT(DISTINCT ...). The only place where the result of the COUNT() aggregate function might be confusing is that it does not count missing values of the variable.
proc sql;
select prod_id
, store
, count(distinct cust) as count
, count(distinct cust)+max(missing(cust)) as count_plus_missing
from have
group by prod_id ,store
;
quit;
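As a cross-check of the distinct-count logic, here is a small Python sketch that rebuilds the posted rows and counts distinct customers per (prod_id, store). The `spec` shorthand and the variable names are assumptions made for brevity and are not part of the original post.

```python
from collections import defaultdict

# Rebuild the posted rows: (prod_id, cust, store, number of weekly records).
weeks = ["7/29", "8/5", "8/12", "8/19", "8/26"]
spec = [
    (1, "A", "ABC", 4), (1, "B", "ABC", 5),
    (1, "C", "XYZ", 2), (1, "F", "XYZ", 2),
    (2, "A", "ABC", 4), (2, "C", "EFG", 5),
]
have = [(p, c, w, s) for p, c, s, n in spec for w in weeks[:n]]

# Collect the set of customers seen for each (prod_id, store) pair.
customers = defaultdict(set)
for prod_id, cust, week, store in have:
    customers[(prod_id, store)].add(cust)

counts = {key: len(custs) for key, custs in sorted(customers.items())}
for (prod_id, store), count in counts.items():
    print(prod_id, store, count)
```

With the rows exactly as posted, the (2, EFG) group contains only customer C, so its distinct count comes out as 1.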
In a PROC COMPARE with an ID variable, how can I output only the differences and the new records,
but not the old records that are no longer present?
Example, suppose I have two tables:
mybase:
key other
1 Ann
3 Ann
4 Charlie
5 Emily
and mycompare:
key other
2 Bill
3 Charlie
4 Charlie
running:
proc compare data=mybase
compare=mycompare
outnoequal
outdif
out=myoutput
listvar
outcomp
outbase
method = absolute
criterion = 0.0001
;
id key;
run;
I get a table "myoutput" like this:
type obs key other
base 1 1 Ann
compare 1 2 Bill
base 2 3 Ann
compare 2 3 Charlie
dif 2 3 XXXXXXX
base 4 5 Emily
I would like to have this:
type obs key other
compare 1 2 Bill
base 2 3 Ann
compare 2 3 Charlie
dif 2 3 XXXXXXX
This works for your example. I think you want to output the records in compare that have no match in base, plus any matched records that have differences.
data mybase;
input key other $;
cards;
1 Ann
3 Ann
4 Charlie
5 Emily
;;;;
data mycompare;
input key other $;
cards;
2 Bill
3 Charlie
4 Charlie
;;;;
proc compare data=mybase
compare=mycompare
outnoequal
outdif
out=myoutput
listvar
outcomp
outbase
method = absolute
criterion = 0.0001
;
id key;
run;
proc print;
run;
data test;
set myoutput;
by key;
if (first.key and last.key) and _type_ eq 'BASE' then delete;
run;
proc print;
run;
Obs _TYPE_ _OBS_ key other
1 COMPARE 1 2 Bill
2 BASE 2 3 Ann
3 COMPARE 2 3 Charlie
4 DIF 1 3 XXXXXXX.
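The filtering rule in the last DATA step (drop a row when its key occurs only once and the row comes from BASE) can be sketched in Python as follows. The tuples are a simplified, hypothetical stand-in for the myoutput table, keeping only the type, key and other columns.

```python
from collections import Counter

# Rows as (type, key, other), following the PROC COMPARE output in the question.
myoutput = [
    ("BASE", 1, "Ann"),
    ("COMPARE", 2, "Bill"),
    ("BASE", 3, "Ann"),
    ("COMPARE", 3, "Charlie"),
    ("DIF", 3, "XXXXXXX"),
    ("BASE", 5, "Emily"),
]

# A key with a single row plays the role of first.key and last.key both being true.
rows_per_key = Counter(key for _, key, _ in myoutput)

test = [
    row for row in myoutput
    if not (rows_per_key[row[1]] == 1 and row[0] == "BASE")
]

for row in test:
    print(row)
```

Keys 1 and 5 exist only in base, so their single BASE rows are removed, while key 2 (only in compare) and key 3 (changed) are kept.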
I have the following two dataframes:
df1:
name
abc
lmn
pqr
df2:
m_name n_name loc
abc tyu IND
bcd abc RSA
efg poi SL
lmn ert AUS
nne bnm ENG
pqr lmn NZ
xyz asd BAN
I want to generate a new dataframe based on the following conditions:
if df2.m_name==df1.name or df2.n_name==df1.name
eliminate duplicate rows
Following is desired output:
m_name n_name loc
abc tyu IND
bcd abc RSA
lmn ert AUS
pqr lmn NZ
Can I get any suggestions on how to achieve this?
Use:
print (df2)
m_name n_name loc
0 abc tyu IND
1 abc tyu IND
2 bcd abc RSA
3 efg poi SL
4 lmn ert AUS
5 nne bnm ENG
6 pqr lmn NZ
7 xyz asd BAN
df3 = df2.filter(like='name')
#another solution is filter columns by columns names in list
#df3 = df2[['m_name','n_name']]
df = df2[df3.isin(df1['name'].tolist()).any(axis=1)]
df = df.drop_duplicates(df3.columns)
print (df)
m_name n_name loc
0 abc tyu IND
2 bcd abc RSA
4 lmn ert AUS
6 pqr lmn NZ
Details:
Select all columns with name in their label by filter:
print (df2.filter(like='name'))
m_name n_name
0 abc tyu
1 abc tyu
2 bcd abc
3 efg poi
4 lmn ert
5 nne bnm
6 pqr lmn
7 xyz asd
Compare by DataFrame.isin:
print (df2.filter(like='name').isin(df1['name'].tolist()))
m_name n_name
0 True False
1 True False
2 False True
3 False False
4 True False
5 False False
6 True True
7 False False
Get at least one True per row by any:
print (df2.filter(like='name').isin(df1['name'].tolist()).any(axis=1))
0 True
1 True
2 True
3 False
4 True
5 False
6 True
7 False
dtype: bool
Filter by boolean indexing:
df = df2[df2.filter(like='name').isin(df1['name'].tolist()).any(axis=1)]
print (df)
m_name n_name loc
0 abc tyu IND
1 abc tyu IND
2 bcd abc RSA
4 lmn ert AUS
6 pqr lmn NZ
Finally, remove duplicates with drop_duplicates (to judge duplicates by the name columns only, pass the subset parameter):
df = df.drop_duplicates(subset=df3.columns)
print (df)
m_name n_name loc
0 abc tyu IND
2 bcd abc RSA
4 lmn ert AUS
6 pqr lmn NZ
Use
In [56]: df2[df2.m_name.isin(df1.name) | df2.n_name.isin(df1.name)]
Out[56]:
m_name n_name loc
0 abc tyu IND
1 bcd abc RSA
3 lmn ert AUS
5 pqr lmn NZ
Or using query
In [58]: df2.query('m_name in @df1.name or n_name in @df1.name')
Out[58]:
m_name n_name loc
0 abc tyu IND
1 bcd abc RSA
3 lmn ert AUS
5 pqr lmn NZ
From the sample data below, I'm trying to identify accounts (by ID and SEQ) where there is an occurrence of STATUS_DATE in at least 3 consecutive months. I've been messing with this for a while and I'm not at all sure how to tackle it.
Sample Data:
ID SEQ STATUS_DATE
11111 1 01/01/2014
11111 1 02/10/2014
11111 1 03/15/2014
11111 1 05/01/2014
11111 2 01/30/2014
22222 1 06/20/2014
22222 1 07/15/2014
22222 1 07/16/2014
22222 1 08/01/2014
22222 2 02/01/2014
22222 2 09/10/2014
What I need to return:
ID SEQ STATUS_DATE
11111 1 01/01/2014
11111 1 02/10/2014
11111 1 03/15/2014
22222 1 06/20/2014
22222 1 07/15/2014
22222 1 07/16/2014
22222 1 08/01/2014
Any help would be appreciated.
Here is one method:
data have;
input ID SEQ STATUS_DATE $12.;
datalines;
11111 1 01/01/2014
11111 1 02/10/2014
11111 1 03/15/2014
11111 1 05/01/2014
11111 2 01/30/2014
22222 1 06/20/2014
22222 1 07/15/2014
22222 1 07/16/2014
22222 1 08/01/2014
22222 2 02/01/2014
22222 2 09/10/2014
;
run;
data grouped (keep = id seq status_date group) groups (keep = group2);
set have;
/* assumes have is ordered by id, seq and date, as in the sample */
by id seq;
sasdate = input(status_date, mmddyy12.);
/* call LAG unconditionally so its queue is updated on every row */
gap = intck('month', lag(sasdate), sasdate);
if first.seq or gap > 1 then do;
/* new account, or a month was skipped: start a new group */
group+1;
count = 0;
end;
else if gap = 1 then do;
/* next calendar month: extend the run */
count+1;
if count = 2 then do;
/* third consecutive month reached: record the group once */
group2 = group;
output groups;
end;
end;
output grouped;
run;
data want (keep = id seq status_date);
merge grouped groups (in=a rename=(group2=group));
by group;
if a;
run;
Basically, I give observations the same group number if they fall in consecutive months of the same account, and I also create a data set holding the numbers of the groups that span at least three consecutive months. Then I merge the two data sets and keep only the observations whose group appears in the second data set.
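The grouping idea can also be sketched in Python, which may make the logic easier to follow. This is a minimal illustration, assuming the rows are already ordered by ID, SEQ and date. The function names `month_index` and `runs_of_consecutive_months` are hypothetical.

```python
from datetime import datetime
from itertools import groupby

have = [
    (11111, 1, "01/01/2014"), (11111, 1, "02/10/2014"), (11111, 1, "03/15/2014"),
    (11111, 1, "05/01/2014"), (11111, 2, "01/30/2014"),
    (22222, 1, "06/20/2014"), (22222, 1, "07/15/2014"), (22222, 1, "07/16/2014"),
    (22222, 1, "08/01/2014"), (22222, 2, "02/01/2014"), (22222, 2, "09/10/2014"),
]

def month_index(text):
    """Map an mm/dd/yyyy string to a running month number (year * 12 + month)."""
    d = datetime.strptime(text, "%m/%d/%Y")
    return d.year * 12 + d.month

def runs_of_consecutive_months(rows, min_months=3):
    """Keep rows belonging to a run that covers at least min_months consecutive months."""
    kept = []
    for _, account in groupby(rows, key=lambda r: (r[0], r[1])):
        run, months, prev = [], set(), None
        for row in account:
            m = month_index(row[2])
            if prev is not None and m - prev > 1:
                # a month was skipped: close the current run and start a new one
                if len(months) >= min_months:
                    kept.extend(run)
                run, months = [], set()
            run.append(row)
            months.add(m)
            prev = m
        if len(months) >= min_months:
            kept.extend(run)
    return kept

want = runs_of_consecutive_months(have)
for row in want:
    print(row)
```

Two dates in the same month (07/15 and 07/16) stay in the same run but count as a single month toward the threshold.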
How about the following? You may want to sort on month first, if that's what you need.
data want;
do _n_ = 1 by 1 until(last.id);
set survey;
by id;
if _n_ <=3 then output;
end;
run;
I have a panel data set that looks like this:
ID Usage month
1234 2 -2
1234 4 -1
1234 3 1
1234 2 2
2345 5 -2
2345 6 -1
2345 3 1
2345 6 2
Obviously there are more ID variables and usage data, but this is the general form. I want to average the usage data when the month column is negative, and when it is positive for each ID. In other words for each unique ID, average the usage for negative months and for positive months. My goal is to get something like this.
ID avg_usage_neg avg_usage_pos
1234 3 2.5
2345 5.5 4.5
Here are a few options for you.
First create the test data:
data sample;
input ID
Usage
month;
datalines;
1234 2 -2
1234 4 -1
1234 3 1
1234 2 2
2345 5 -2
2345 6 -1
2345 3 1
2345 6 2
;
run;
Here's an SQL solution:
proc sql noprint;
create table result as
select id,
avg(ifn(month < 0, usage, .)) as avg_usage_neg,
avg(ifn(month > 0, usage, .)) as avg_usage_pos
from sample
group by 1
;
quit;
Here's a datastep / proc means solution:
data sample2;
set sample;
usage_neg = ifn(month < 0, usage, .);
usage_pos = ifn(month > 0, usage, .);
run;
proc means data=sample2 noprint missing nway;
class id;
var usage_neg usage_pos;
output out=result2 mean=;
run;
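For comparison, the same conditional averaging can be sketched in plain Python. This is only an illustration of the logic. The helper `mean_or_none` is a hypothetical name, and it returns None where SAS would give a missing value.

```python
from collections import defaultdict
from statistics import mean

# (ID, Usage, month) rows from the sample data.
sample = [
    (1234, 2, -2), (1234, 4, -1), (1234, 3, 1), (1234, 2, 2),
    (2345, 5, -2), (2345, 6, -1), (2345, 3, 1), (2345, 6, 2),
]

neg = defaultdict(list)   # usage values for negative months, per ID
pos = defaultdict(list)   # usage values for positive months, per ID
for id_, usage, month in sample:
    if month < 0:
        neg[id_].append(usage)
    elif month > 0:
        pos[id_].append(usage)

def mean_or_none(values):
    """Average of values, or None when there is nothing to average."""
    return mean(values) if values else None

ids = sorted(set(neg) | set(pos))
result = {i: (mean_or_none(neg[i]), mean_or_none(pos[i])) for i in ids}

for id_, (avg_neg, avg_pos) in result.items():
    print(id_, avg_neg, avg_pos)
```

Rows with month equal to 0, if any exist, fall into neither bucket, which matches the strict comparisons used in the SAS code.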