Before we get to my question please note that I purposefully did not include example data in this post, as my problem occurs on my full dataset and subsets of it. I have two dataset with client data in the following format.
Have_1
+------------+------------+------+
| dt | dt_next | id |
+------------+------------+------+
| 30.09.2010 | 31.10.2010 | 0001 |
+------------+------------+------+
| 31.10.2010 | 30.11.2010 | 0001 |
+------------+------------+------+
| 30.11.2010 | 31.12.2010 | 0001 |
+------------+------------+------+
| 31.12.2010 | 31.01.2011 | 0001 |
+------------+------------+------+
Have_2
+------+-------+------------+------------+
| id | event | start_date | end_date |
+------+-------+------------+------------+
| 0001 | 1 | 31.10.2010 | 30.11.2010 |
+------+-------+------------+------------+
| 0001 | 2 | 31.10.2010 | 31.12.2010 |
+------+-------+------------+------------+
I am trying to use the IFN function to put 1-0 flags in my dataset by using the following logic:
Proc SQL;
Create table want as
Select a.*
,ifn(a.id in (select id from have_2 where a.dt <= end_date and start_date <= a.dt_next), 1, 0) as flg_1
,ifn(a.id in (select id from have_2 where a.dt <= end_date and start_date <= a.dt), 1, 0) as flg_2
From have_1 as a;
Quit;
The code works fine if I take only one client, however, if I take the full dataset (or even a small subset of it such as only 10 clients) then the code gets stuck in the sense that the process begins without error but simply never finishes. I tried setting indexes to both my input datasets, without success.
Are there any peculiarities to the IFN function, which can make it behave that way?
So why not just join and take the max of all events if any event's dates fall into those periods? That should eliminate the need to do two subqueries for every observation in HAVE1.
proc sql;
create table want2 as
select a.id
, a.dt
, a.dt_next
, max(a.dt <= b.end_date and b.start_date <= a.dt_next) as flg1
, max(a.dt <= b.end_date and b.start_date <= a.dt) as flg2
from have1 a
left join have2 b
on a.id = b.id
group by 1,2,3
;
quit;
Note the issue is with the subqueries, not the IFN() function call. Also there is no need for IFN() function here. SAS evaluates boolean expressions to 1 or 0. So the expression a=b returns the same result as IFN(a=b,1,0) returns.
Related
Can someone help me to convert the sql string to Dax?
row_number() p over (partition by date, customer, type order by day)
The row number is my desired output.
Assuming that your data looks like this table:
Sample
+------------+----------+---------+--------+
| Date | Customer | Product | Gender |
+------------+----------+---------+--------+
| 01/01/2018 | 1234 | P2 | F |
| 01/01/2018 | 1234 | P2 | M |
| 03/01/2018 | 1235 | P1 | F |
| 03/01/2018 | 1235 | P2 | F |
+------------+----------+---------+--------+
I have created a calculated column called Rank, using the RANKX and FILTER function.
The first part of the calculation is to create variables outside the scope of the FILTER function. The second part uses RANKX that takes an expression value - in this case Gender - to order the values.
Rank =
VAR _currentdate = 'Sample'[Date]
VAR _customer = 'Sample'[Customer]
var _product = 'Sample'[Product]
return
RANKX(FILTER('Sample',
[Date]=_currentdate &&
[Customer] = _customer &&
[Product] = _product),[Gender],,ASC)
The output is
I contrasted the output to the SQL equivalent.
select
*,
row_number() over(partition by Date,Customer,Product order by Gender)
from (
select '2018-01-01' as Date,1234 as CUSTOMER,'P2' AS PRODUCT, 'M' Gender union
select '2018-01-01' as Date,1234,'P2','F' UNION
select '2018-01-03' as Date,1235,'P1','F' UNION
select '2018-01-03' as Date,1235,'P2','F'
)t1
I have a database with 3 columns. ID, Date and amount. It is ordered by ID and Date. All I want to do is to add a row after the latest occurrence of every ID with the same ID, Date = Date + 1 Month and Amount = 0.
As an Illustration I want to go from this:
id | Date |amount |
A | 01JAN| 1 |
A | 01FEB| 1 |
B | 01FEB| 0 |
B | 01MAR| 1 |
to this:
id | Date |amount |
A | 01JAN| 1 |
A | 01FEB| 1 |
A | 01MAR| 0 | <- ADD THIS ROW
B | 01FEB| 0 |
B | 01MAR| 1 |
B | 01APR| 0 |<- ADD THIS ROW
I know I should use intxn but beyond that I don't really know what to do. I appreciate any input.
Assuming that the DATE variable has actual date values in it you just need to output twice on the last observation in each group.
data want;
set have;
by id;
output;
if last.id then do;
date=intnx('month',date,1,'b');
amount=0;
output;
end;
run;
I have a set of multiple choice responses from a survey with 45 questions, and I've placed the correct responses as my first observation in the dataset.
In my DATA step I would like to set values to 0 or 1depending on whether the variable in each observation matches the same variable in the first observation, I want to replace the response letter (A-D) with the 0 or 1 in the dataset, how do I go about doing that comparison?
I'm not doing any grouping, so I believe I can access the first row using First.x, but I'm not sure how to compare that across each variable(answer1-answer45).
| Id | answer1 | answer2 | ...through answer 45
|:-------------|---------:|
| KEY | A | B |
| 2 | A | C |
| 3 | C | D |
| 4 | A | B |
| 5 | D | C |
| 6 | B | B |
Should become:
| Id | answer1 | answer2 | ...through answer 45
|:-------------|---------:|
| KEY | A | B |
| 2 | 1 | 0 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 0 | 0 |
| 6 | 0 | 1 |
Current code for reading in the data:
DATA TEST(drop=name fill answer0);
INFILE SCORES DSD firstobs=2;
length id $4;
length answer1-answer150 $1;
INPUT name $ fill id $ (answer0-answer150) ($);
RUN;
Thanks in advance!
Here's how I might do it. Create a data set to PROC COMPARE the KEY to the observed. Then you have X for not matching key and missing for matched. You can then use PROC TRANSREG to score the 'X.' to 01. PROC TRANSREG also creates macro variables which contain the names of the new variables and the number.
From log NOTE: _TRGINDN=2 _TRGIND=answer1D answer2D
data questions;
input id:$3. (answer1-answer2)(:$1.);
cards;
KEY A B
2 A C
3 C D
4 A B
5 D C
6 B B
;;;;
run;
data key;
if _n_ eq 1 then set questions(obs=1);
set questions(keep=id firstobs=2);
run;
proc compare base=key compare=questions(firstobs=2) out=comp outdiff noprint;
id id;
run;
options validvarname=v7;
proc transreg design data=comp(drop=_type_ type=data);
id id;
model class(answer:) / noint;
output out=scored(drop=intercept _:);
run;
%put NOTE: &=_TRGINDN &=_TRGIND;
I don't have my SAS license here at home, so I can't actually test this code. I'll give it me best shot, though ...
First, I'd keep my correct answers in a separate table, and then merge it with the answers from the respondents. That also makes the solution scalable, should you have more multiple choice solutions and answers in the same table, since you'd be joining on the assignment ID as well.
Now, import all your correct answers to a table answers_correct with column names answer_correct1-answer_correct45.
Then, merge the two tables and determine the outcome for each question.
DATA outcome;
MERGE answers answers_correct;
* We will not be using any BY.;
* If you later add more questionnaires, merge BY the questionnaire ID;
ARRAY answer(*) answer1-answer45;
ARRAY answer_correct(*) answer_correct1-answer_correct45;
LENGTH result1-result45 $1;
ARRAY result(*) result1-result45;
DROP i;
FOR i = 1 TO DIM(answer);
IF answer(i) = answer_correct(i) THEN result(i) = '1';
ELSE result(i) = '0';
END;
RUN;
Here is a sample code that was derived from actual application. There are two datasets - "aa" for a query and "bb" for subquery. Column "m" from datasets "aa" matches column "y" from datasets "bb". Also, there is "yy" column on "aa" table has a value of 30. Column "m" from datasets "aa" contains value "30" in one of its rows, and column "y" from datasets "bb" does not. First proc sql uses values from "y" column of "bb" table to subset table "aa" based on matching values in column "m". It is a correct query and produces results as expected. Second proc sql block has column "y" intentionally misspelled as "yy" in subquery in a row that stars with where statement. Otherwise the whole proc sql block is the same as the first one. Given that there is no column "yy" on dataset bb, I would expect an error message to appear and the whole query to fail. However, it does return one row without failing or error messages. Closer look would suggest that it actually uses "yy" column from table "aa" (see tree in the log output). I do not think this is a correct behavior. If you would have some comments or explanations, I would greatly appreciate it. Otherwise, I maybe should report it to SAS as a bug. Thank you!
Here is the code:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
yy=30;
output;
end;
run;
data bb;
do i=10 to 20;
y=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Correct sql command*/
proc sql _method
_tree
;
create table cc as
select *
from aa
where m in (select y from bb)
;quit;
/*Incorrect sql command - column "yy" in not on "bb" table"*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select yy from bb)
;quit;
Here is log with sql tree:
119 options
120 msglevel = I
121 ;
122 data aa;
123 do i=1 to 20;
124 m=i*5;
125 yy=30;
126 output;
127 end;
128 run;
NOTE: The data set WORK.AA has 20 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
129
130 data bb;
131 do i=10 to 20;
132 y=i*5;
133 output;
134 end;
135 run;
NOTE: The data set WORK.BB has 11 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
136 option DEBUG=JUNK ;
137
138 /*Correct sql command*/
139 proc sql _method
140 _tree
141 ;
142 create table cc as
143 select *
144 from aa
145 where m in (select y from bb)
146 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-V-(bb.y:2 flag=0001)
| | /-OBJ----|
| | /-SRC----|
| | | \-TABL[WORK].bb opt=''
| \-SUBC---|
--SSEL---|
NOTE: Table WORK.CC created, with 11 rows and 3 columns.
146! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
147
148
149 /*Incorrect sql command - column "yy" in not on "bb" table"*/
150 proc sql _method
151 _tree;
152 create table dd as
153 select *
154 from aa
155 where m in (select yy from bb)
156 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag=
0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
--SSEL---|
NOTE: Table WORK.DD created, with 1 rows and 3 columns.
156! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
Here are datasets:
aa:
i m yy
1 5 30
2 10 30
3 15 30
4 20 30
5 25 30
6 30 30
7 35 30
8 40 30
9 45 30
10 50 30
11 55 30
12 60 30
13 65 30
14 70 30
15 75 30
16 80 30
17 85 30
18 90 30
19 95 30
20 100 30
bb:
i y
10 50
11 55
12 60
13 65
14 70
15 75
16 80
17 85
18 90
19 95
20 100
I agree, this looks pretty weird and may well be a bug. I was able to reproduce this from the code you provided in SAS 9.4 and in SAS 9.1.3, which would make it at least ~12 years old.
In particular, I'm interested in this bit of the output you got from the _method option when creating the DD table but not when creating the CC table:
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps <--- What is this doing?
sqxsrc( WORK.BB )
Similarly, the corresponding section from the _tree output is highly obscure:
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag= 0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
I have never seen sqxreps or reps in the respective bits of output before. Neither of them is listed in any of the papers I was able to find based on a brief bit of googling (in fact, this question is currently the only hit on Google for sas + sqxreps):
http://support.sas.com/resources/papers/proceedings10/139-2010.pdf
http://www2.sas.com/proceedings/sugi30/101-30.pdf
Quoting the first of these:
Codes Description
sqxcrta Create table as Select
Sqxslct Select
sqxjsl Step loop join (Cartesian)
sqxjm Merge join
sqxjndx Index join
sqxjhsh Hash join
sqxsort Sort
sqxsrc Source rows from table
sqxfil Filter rows
sqxsumg Summary stats with GROUP BY
sqxsumn Summary stats with no GROUP BY
Based on a bit of quick testing, this seems to happen regardless of the variable and tables names used, provided that the variable name from AA is repeated multiple times in the subquery referencing table BB. It also happens if you have a variable named e.g. YYY in AA but one named YY in BB, or more generally whenever you have a variable in BB whose name is initially the same as the name of the corresponding variable in AA but then continues for one or more characters.
From this, I'm guessing at some point in the SQL parser, someone used a like operator rather than checking for equality of variable names, and somehow as a result this syntax is triggering an undocumented or incomplete 'feature' in proc sql.
An example of the more general case:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
myvar_plus_suffix=30;
output;
end;
run;
data bb;
do i=10 to 20;
myvar=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Incorrect sql command - column "yy" in not on "bb" table"*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select myvar_plus_suffix from bb)
;quit;
Here is a response from SAS support.
What you are seeing is related to column scoping in PROC SQL.
PROC SQL supports Corellated Subqueries. A Correlated Subquery references a column in the "outer" table which can then be compared to columns in the "inner" table. PROC SQL does not require that a fully qualified column name is used. As a result, if it sees a column in the subquery that does not exist in the inner table (the table referenced in the subquery), it looks for that column in the "outer" table and uses the value if it finds one.
If a fully qualified column name is used, the error you are expecting will occur such as the following:
proc sql;
create table dd as
select *
from aa as outer
where outer.m in (select inner.yyy from bb as inner);
quit;
I have a table that looks like this
(MAIN TABLE)
====================================================================================
|| TRANSACTION_ID || TRADE_PARTY_PREFIX || TRADE_PARTY_VALUE || TRADE_PARTY_TYPE
====================================================================================
| 1 | EXAMPLE | 123456789 | Dealer
--------------------------------------------------------------------------------
| 2 | EXAMPLE | 123456789 |
--------------------------------------------------------------------------------
| 3 | EXAMPLE_2 | 123456789 | Non-Dealer
--------------------------------------------------------------------------------
| 4 | EXAMPLE | 987654321 |
I have another table that looks like this
(LOOKUP TABLE)
=================================================================
|| TRADE_PARTY_PREFIX || TRADE_PARTY_VALUE || TRADE_PARTY_TYPE ||
=================================================================
| EXAMPLE | 123456789 | Dealer
-----------------------------------------------------------------
| EXAMPLE | 987654321 | Dealer
I want to have a SAS script that replaces all the missing values in the TRADE_PARTY_TYPE by looking up the TRADE_PARTY_PREFIX and TRADE_PARTY_VALUE in the lookup table (but it should not touch the row if that column is already filled in.
Something along the lines of (in pseudo-code):
for row in main_table:
if row["trade_party_type"] is not null:
print(row)
else
row["trade_party_type"] == lookup_table["trade_party_prefix"]["trade_party_value"]
Not sure how to do this in SAS.
I would use the coalesce function in a PROC SQL step like so:
PROC SQL;
CREATE TABLE WANT AS
SELECT a.transaction_id, a.trade_party_prefix, a.trade_party_value, coalesce(a.trade_party_type,b.trade_party_type) AS trade_party_type
FROM dset1 a LEFT JOIN dset2 b
ON a.trade_party_prefix = b.trade_party_prefix AND a.trade_party_value = b.trade_party_value;
QUIT;
The coalesce function evaluates the arguments in order and returns the current value of the first expression that initially does not evaluate to NULL.