SAS Spllit a column into multiple - sas

I was wondering if there is a way to separate a column into 2 or more columns. The values are separated by semicolon.
Here is how my data is currently
+------------+
| Col1 |
+------------+
| 541.6;I345 |
+------------+
I would like something as below
+--------+------+
| Col1 | Col2 |
+--------+------+
| 541.6 | I345 |
+--------+------+

Try scan:
data want;
set have;
col2 = scan(col1,2,";");
col1 = scan(col1,1,";");
run;
Let me know in case of any queries.

Related

Deleting rows based on multiple columns conditions

Given the following table have, I would like to delete the records that satisfy the conditions based on the to_delete table.
data have;
infile datalines delimiter="|";
input id :8. item :$8. datetime : datetime18.;
format datetime datetime18.;
datalines;
111|Basket|30SEP20:00:00:00
111|Basket|30SEP21:00:00:00
111|Basket|31DEC20:00:00:00
111|Backpack|31MAY22:00:00:00
222|Basket|31DEC20:00:00:00
222|Basket|30JUN20:00:00:00
;
+-----+----------+------------------+
| id | item | datetime |
+-----+----------+------------------+
| 111 | Basket | 30SEP20:00:00:00 |
| 111 | Basket | 30SEP21:00:00:00 |
| 111 | Basket | 31DEC20:00:00:00 |
| 111 | Backpack | 31MAY22:00:00:00 |
| 222 | Basket | 31DEC20:00:00:00 |
| 222 | Basket | 30JUN20:00:00:00 |
+-----+----------+------------------+
data to_delete;
infile datalines delimiter="|";
input id :8. item :$8. datetime : datetime18.;
format datetime datetime18.;
datalines;
111|Basket|30SEP20:00:00:00
111|Backpack|31MAY22:00:00:00
222|Basket|30JUN20:00:00:00
;
+-----+----------+------------------+
| id | item | datetime |
+-----+----------+------------------+
| 111 | Basket | 30SEP20:00:00:00 |
| 111 | Backpack | 31MAY22:00:00:00 |
| 222 | Basket | 30JUN20:00:00:00 |
+-----+----------+------------------+
In the past, I used to operate with the catx() function to concatenate the conditions in a where statement, but I wonder if there is a better way of doing this
proc sql;
delete from have
where catx('|',id,item,datetime) in
(select catx('|',id,item,datetime) from to_delete);
run;
+-----+--------+------------------+
| id | item | datetime |
+-----+--------+------------------+
| 111 | Basket | 30SEP21:00:00:00 |
| 111 | Basket | 31DEC20:00:00:00 |
| 222 | Basket | 31DEC20:00:00:00 |
+-----+--------+------------------+
Please note that it should allow the have table to have more columns than the table to_delete.
You can use except from to compute difference set of two sets:
proc sql;
create table want as
select * from have except select * from to_delete
;
quit;

Construct continuous intervals from non-contiguous validity ranges

Given
data have;
infile datalines missover delimiter="|" dsd;
input id :$20. (start_date end_date) (:date9.) (attribute_1 attribute_2 attribute_3 attribute_4) ($);
format start_date end_date date9.;
datalines;
ID1|01MAR2014|31DEC9999|BIG|YES||
ID2|01SEP2015|30NOV2020|||TWO|
ID2|01SEP2015|31DEC9999|SMALL|||
ID2|01AUG2021|31DEC9999|||TWO|
ID3|01DEC2014|31MAY2016||YES||
ID3|01DEC2014|29JUN2017||||OK
ID3|01DEC2014|31DEC9999|MEDIUM|||
ID3|31MAR2015|29SEP2017|||ONE|
ID3|30JUN2017|31DEC9999||YES||TBD
ID3|30SEP2017|31DEC9999|||ONE, TWO|
;
I would like to get continuous validity ranges for each id with the correct attributes at each of them.
The desired output would be like this:
+-----+------------+-----------+-------------+-------------+-------------+-------------+
| id | start_date | end_date | attribute_1 | attribute_2 | attribute_3 | attribute_4 |
+-----+------------+-----------+-------------+-------------+-------------+-------------+
| ID1 | 01MAR2014 | 31DEC9999 | BIG | YES | | |
| ID2 | 01SEP2015 | 30NOV2020 | SMALL | | TWO | |
| ID2 | 01DEC2020 | 31JUL2021 | SMALL | | | |
| ID2 | 01AUG2021 | 31DEC9999 | SMALL | | TWO | |
| ID3 | 01DEC2014 | 30MAR2015 | MEDIUM | YES | | OK |
| ID3 | 31MAR2015 | 31MAY2016 | MEDIUM | YES | ONE | OK |
| ID3 | 01JUN2016 | 29JUN2016 | MEDIUM | | ONE | OK |
| ID3 | 30JUN2016 | 29SEP2017 | MEDIUM | YES | ONE | TBD |
| ID3 | 30SEP2017 | 31DEC9999 | MEDIUM | YES | ONE, TWO | TBD |
+-----+------------+-----------+-------------+-------------+-------------+-------------+
I found a way of doing it using joins, but would like to know if there exist a better way of doing the following:
data all_intervals;
set have(keep= id start_date end_date);
_start = start_date; output;
_end = start_date-1; output;
_end = end_date; output;
if end_date < '31DEC9999'd then do;
_start = end_date+1; output;
end;
run;
proc sql;
create table all_intervals as
select distinct t1.id, t1._start, t2._end
from all_intervals t1, all_intervals t2
where t1.id = t2.id and t2._end > t1._start
;
quit;
data all_intervals;
set all_intervals;
by id _start;
if first.id or first._start;
run;
proc sql noprint;
select 'max(t1.'||NAME||') as '||NAME into :attributes separated by ','
from sashelp.vcolumn
where libname = "WORK" and memname = "HAVE" and upcase(name) not in ('ID', "_START", "_END")
;
quit;
proc sql;
create table merge as
select t2.id, t2._start as start_date, t2._end as end_date, &attributes.
from have t1 right join all_intervals t2
on t1.id = t2.id
and ((t2._start <= t1.start_date <= t2._end)
or (t2._start <= t1.end_date <= t2._end)
or (t2._start >= t1.start_date and t2._end <= t1.end_date))
group by t2.id, t2._start
order by t2.id, t2._start
;
quit;
proc sort data=merge out=want nodupkey; by id start_date end_date; run;
The above produce the expected output.

Usage of TSQL string_split in DAX

I have 2 tables in SSAS / Power BI:
Table1:
| ValueName| ValueKey |
|:---- |:------: |
| abc | 1,2,3 |
Table2:
| ID | ValueKey | Value |
|:---- |:------: |:------: |
| ID1 | 1 | 87,8 |
| ID2 | 85 | 14 |
| ID3 | 90 | 95,8 |
| ID4 | 3 | 13,4 |
I need to retrieve (in temp table, later make calculations over this temp table) ID, Value and only those rows, which have ValueKey 1 or 2 or 3.
I need to do it with DAX. In SQL we have for such situation STING_SPLIT function. Is there some way how can I achive this with DAX? My ValueKey column (table1) is comma separated text and ValueKey (table2) column is INT.
Thanks in advance
Like #Jeroen Mostert suggests, you can do this by abusing the PATHCONTAINS function like this:
FilteredTable2 =
VAR CurrKey = SELECTEDVALUE ( Table1[ValueKey] )
VAR PathFromKey = SUBSTITUTE ( CurrKey, ",", "|" ) /* Paths use | as separator. */
RETURN
FILTER ( Table2, PATHCONTAINS ( PathFromKey, Table2[ValueKey] ) )
However, this is not best practice for relating tables. In general, you don't want multiple keys in a single fields.

adding rows given a certain condition

I have a database with 3 columns. ID, Date and amount. It is ordered by ID and Date. All I want to do is to add a row after the latest occurrence of every ID with the same ID, Date = Date + 1 Month and Amount = 0.
As an Illustration I want to go from this:
id | Date |amount |
A | 01JAN| 1 |
A | 01FEB| 1 |
B | 01FEB| 0 |
B | 01MAR| 1 |
to this:
id | Date |amount |
A | 01JAN| 1 |
A | 01FEB| 1 |
A | 01MAR| 0 | <- ADD THIS ROW
B | 01FEB| 0 |
B | 01MAR| 1 |
B | 01APR| 0 |<- ADD THIS ROW
I know I should use intxn but beyond that I don't really know what to do. I appreciate any input.
Assuming that the DATE variable has actual date values in it you just need to output twice on the last observation in each group.
data want;
set have;
by id;
output;
if last.id then do;
date=intnx('month',date,1,'b');
amount=0;
output;
end;
run;

Compare Value of Current Observation with First Observation

I have a set of multiple choice responses from a survey with 45 questions, and I've placed the correct responses as my first observation in the dataset.
In my DATA step I would like to set values to 0 or 1depending on whether the variable in each observation matches the same variable in the first observation, I want to replace the response letter (A-D) with the 0 or 1 in the dataset, how do I go about doing that comparison?
I'm not doing any grouping, so I believe I can access the first row using First.x, but I'm not sure how to compare that across each variable(answer1-answer45).
| Id | answer1 | answer2 | ...through answer 45
|:-------------|---------:|
| KEY | A | B |
| 2 | A | C |
| 3 | C | D |
| 4 | A | B |
| 5 | D | C |
| 6 | B | B |
Should become:
| Id | answer1 | answer2 | ...through answer 45
|:-------------|---------:|
| KEY | A | B |
| 2 | 1 | 0 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 0 | 0 |
| 6 | 0 | 1 |
Current code for reading in the data:
DATA TEST(drop=name fill answer0);
INFILE SCORES DSD firstobs=2;
length id $4;
length answer1-answer150 $1;
INPUT name $ fill id $ (answer0-answer150) ($);
RUN;
Thanks in advance!
Here's how I might do it. Create a data set to PROC COMPARE the KEY to the observed. Then you have X for not matching key and missing for matched. You can then use PROC TRANSREG to score the 'X.' to 01. PROC TRANSREG also creates macro variables which contain the names of the new variables and the number.
From log NOTE: _TRGINDN=2 _TRGIND=answer1D answer2D
data questions;
input id:$3. (answer1-answer2)(:$1.);
cards;
KEY A B
2 A C
3 C D
4 A B
5 D C
6 B B
;;;;
run;
data key;
if _n_ eq 1 then set questions(obs=1);
set questions(keep=id firstobs=2);
run;
proc compare base=key compare=questions(firstobs=2) out=comp outdiff noprint;
id id;
run;
options validvarname=v7;
proc transreg design data=comp(drop=_type_ type=data);
id id;
model class(answer:) / noint;
output out=scored(drop=intercept _:);
run;
%put NOTE: &=_TRGINDN &=_TRGIND;
I don't have my SAS license here at home, so I can't actually test this code. I'll give it me best shot, though ...
First, I'd keep my correct answers in a separate table, and then merge it with the answers from the respondents. That also makes the solution scalable, should you have more multiple choice solutions and answers in the same table, since you'd be joining on the assignment ID as well.
Now, import all your correct answers to a table answers_correct with column names answer_correct1-answer_correct45.
Then, merge the two tables and determine the outcome for each question.
DATA outcome;
MERGE answers answers_correct;
* We will not be using any BY.;
* If you later add more questionnaires, merge BY the questionnaire ID;
ARRAY answer(*) answer1-answer45;
ARRAY answer_correct(*) answer_correct1-answer_correct45;
LENGTH result1-result45 $1;
ARRAY result(*) result1-result45;
DROP i;
FOR i = 1 TO DIM(answer);
IF answer(i) = answer_correct(i) THEN result(i) = '1';
ELSE result(i) = '0';
END;
RUN;