Invert two strings inside brackets - regex

I have a coordinates column that refers to an ID with 3 dimensions [z][x][y]. I want to swap the [x] and [y] dimensions so that each value still has 3 dimensions but in a different order: [z][y][x]
Here is a snippet of the column:
+------------------------+
| COORDINATES |
+------------------------+
| ID_01.02.a[][01][20] |
| ID_02.00[010][02][017] |
| ID_03.01[][][010] |
| ID_04.01.w[010][][] |
+------------------------+
Here is the code to reproduce this snippet:
data have;
input coordinates :$30.;
datalines;
ID_01.02.a[][01][20]
ID_02.00[010][02][017]
ID_03.01[][][010]
ID_04.01.w[010][][]
;
The final table is expected to be:
+------------------------+
| COORDINATES |
+------------------------+
| ID_01.02.a[][20][01] |
| ID_02.00[010][017][02] |
| ID_03.01[][010][] |
| ID_04.01.w[010][][] |
+------------------------+
When I tried using the scan function, it did not work because I could not find a robust index e.g. x_after = trim(scan(coordinates,2,'[]')). Indeed, if the z axis is not empty it will output the z axis and if the z axis is empty, it will output the x axis.

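You can do it with SCAN(). Use [ and ] as the delimiters together with the m modifier, so that consecutive delimiters are not collapsed. The text up to the last ] then splits into 6 values rather than 4, because there is an empty value between every ] and the [ that follows it.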
Here is a way that shows what is happening.
data have;
input COORDINATES $50.;
cards;
ID_01.02.a[][01][20]
ID_02.00[010][02][017]
ID_03.01[][][010]
ID_04.01.w[010][][]
;
data want;
set have;
length want $50;
array temp[6] $50;
do i=1 to dim(temp);
temp[i]=scan(coordinates,i,'[]','m');
end;
want = cats(temp[1],'[',temp[2],'][',temp[6],'][',temp[4],']');
drop i;
run;
Results:
Obs  COORDINATES             want                    temp1       temp2  temp3  temp4  temp5  temp6
1    ID_01.02.a[][01][20]    ID_01.02.a[][20][01]    ID_01.02.a                01            20
2    ID_02.00[010][02][017]  ID_02.00[010][017][02]  ID_02.00    010           02            017
3    ID_03.01[][][010]       ID_03.01[][010][]       ID_03.01                                010
4    ID_04.01.w[010][][]     ID_04.01.w[010][][]     ID_04.01.w  010

You can access regular expression power via PRXCHANGE to modify values containing both simple and complex patterns.
Example:
data have;
input COORDINATES $char50.;
datalines;
ID_01.02.a[][01][20]
ID_02.00[010][02][017]
ID_03.01[][][010]
ID_04.01.w[010][][]
;
data want;
set have;
put coordinates;
coordinates = prxchange
( 's/(.*?\[.*?\])(\[.*?\])(\[.*\])/$1$3$2/'
, 1
, coordinates
);
put coordinates /;
run;
Logs
ID_01.02.a[][01][20]
ID_01.02.a[][20][01]
ID_02.00[010][02][017]
ID_02.00[010][017][02]
ID_03.01[][][010]
ID_03.01[][010][]
ID_04.01.w[010][][]
ID_04.01.w[010][][]
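A stricter variant of the same substitution, as a sketch only. It assumes every value has exactly three bracket pairs and that no segment contains a nested [ or ]. The negated character classes keep each capture group inside a single bracket pair; the names want_strict and swapped are illustrative and not part of the original answer.

```sas
data want_strict;
  set have;
  length swapped $50;
  /* group 1: prefix plus [z]; group 2: [x]; group 3: [y] */
  swapped = prxchange
    ( 's/^([^\[]*\[[^\]]*\])(\[[^\]]*\])(\[[^\]]*\])\s*$/$1$3$2/'
    , 1
    , coordinates
    );
run;
```

Values that do not match the pattern are returned unchanged by PRXCHANGE, so malformed rows can be spotted by comparing swapped with coordinates.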

Related

How to create 2 new columns with appropriate prefix based on values in columns with same prefix in SAS Enterprise Guide / PROC SQL?

I have table in SAS Enterprise Guide like below:
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
-----|-------------|-------------|-----------|------------
111 | 10 | 10 | 320 | 120
222 | 15 | 80 | 500 | 500
333 | 1 | 5 | 110 | 350
444 | 20 | 5 | 670 | 0
Requirements:
I need to create a new column "TOP_COUNT" containing the name of the column (COUNT_COL_A or COUNT_COL_B) with the highest value for each ID.
If an ID has the same value in both "COUNT_" columns, "TOP_COUNT" should take the name of the column whose counterpart with the SUM_ prefix (SUM_COL_A or SUM_COL_B) has the higher value.
I need to create a new column "TOP_SUM" containing the name of the column (SUM_COL_A or SUM_COL_B) with the highest value for each ID.
If an ID has the same value in both "SUM_" columns, "TOP_SUM" should take the name of the column whose counterpart with the COUNT_ prefix (COUNT_COL_A or COUNT_COL_B) has the higher value.
It is not possible for an ID to have only 0 in the columns with the COUNT_ prefix or only 0 in the columns with the SUM_ prefix.
There are no nulls in the table.
Desired output:
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B | TOP_COUNT | TOP_SUM
-----|-------------|-------------|-----------|------------|-------------|---------
111 | 10 | 10 | 320 | 120 | COUNT_COL_A | SUM_COL_A
222 | 15 | 80 | 500 | 500 | COUNT_COL_B | SUM_COL_B
333 | 1 | 5 | 110 | 350 | COUNT_COL_B | SUM_COL_B
444 | 20 | 5 | 670 | 0 | COUNT_COL_A | SUM_COL_A
How can I do that in SAS Enterprise Guide or in PROC SQL?
Use an array with a looping methodology:
Declare an array of the count variables
Set the maximum value to 0
Loop through the array
Check if each value is more than the current maximum
If yes, assign the value to the current maximum and store the variable name
If no, keep looping
Non-looping, function methodology:
Use MAX to find the maximum value of the array
Use WHICHN() to find the position of that maximum in the array
Use VNAME to get the variable name based on that position
*for count - you can extend for sum;
data want;
set have;
array _count(*) count_col_:;
*looping methodology;
top_count_value=0;
do i=1 to dim(_count);
if _count(i) > top_count_value then do;
top_count = vname(_count(i));
top_count_value = _count(i);
end;
end;
/*or function methodology*/
top_count_max = max(of _count(*));
index_top_count = whichn(top_count_max, of _count(*));
top_count_name_2 = vname(_count(index_top_count));
run;
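The tie-breaking rule from the question can be sketched with plain IF/ELSE logic when there are only two columns per prefix. This is an illustration and not part of the original answer: the data set name want_tiebreak is hypothetical, and choosing the _A column when both the counts and the sums are tied is an assumption, since the question does not cover that case.

```sas
data want_tiebreak;
  set have;
  length top_count top_sum $32;

  /* highest COUNT_ column; ties are broken by the SUM_ counterpart */
  if count_col_a > count_col_b then top_count = 'COUNT_COL_A';
  else if count_col_a < count_col_b then top_count = 'COUNT_COL_B';
  else if sum_col_a >= sum_col_b then top_count = 'COUNT_COL_A'; /* assumption: _A wins a double tie */
  else top_count = 'COUNT_COL_B';

  /* highest SUM_ column; ties are broken by the COUNT_ counterpart */
  if sum_col_a > sum_col_b then top_sum = 'SUM_COL_A';
  else if sum_col_a < sum_col_b then top_sum = 'SUM_COL_B';
  else if count_col_a >= count_col_b then top_sum = 'SUM_COL_A'; /* assumption: _A wins a double tie */
  else top_sum = 'SUM_COL_B';
run;
```

With more than two columns per prefix this gets unwieldy, and an array-based or transpose-based approach scales better.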
Just do the same thing as in your other question. But because you want to transpose two sets of variables, it is probably going to be easier to use a data step and arrays to do the first transform.
data tall;
set have;
array counts count_col_a count_col_b;
array sums sum_col_a sum_col_b;
do index=1 to dim(sums);
length type $5 name $32 ;
type='COUNT';
name=vname(counts[index]);
value1=counts[index];
value2=sums[index];
output;
type='SUM';
name=vname(sums[index]);
value1=sums[index];
value2=counts[index];
output;
end;
run;
Now sort and take the last per ID/TYPE combination to find the largest.
proc sort data=tall;
by id type value1 value2 name;
run;
data top;
set tall;
by id type value1 value2;
if last.type;
run;
And then transpose and re-merge.
proc transpose data=top out=want(drop=_name_) prefix=TOP_;
by id;
id type;
var name;
run;
data want;
merge have want;
by id;
run;
Result:
COUNT_ COUNT_ SUM_ SUM_
Obs ID COL_A COL_B COL_A COL_B TOP_COUNT TOP_SUM
1 111 10 10 320 120 COUNT_COL_A SUM_COL_A
2 222 15 80 500 500 COUNT_COL_B SUM_COL_B
3 333 1 5 110 350 COUNT_COL_B SUM_COL_B
4 444 20 5 670 0 COUNT_COL_A SUM_COL_A

How can I add observations to the existing dataset based on dates?

I have a dataset like this:
data have;
input date :date9. index;
format date date9.;
datalines;
31MAR2019 10
30APR2019 12
31MAY2019 15
30JUN2019 14
;
run;
I would like to add observations with dates from the maximum date (hence from 30JUN2019) until 31DEC2019 (by months) with the value of index being the last available value: 14. How can I achieve this in SAS? I want the code to be flexible, thus for every such dataset, take the maximum of date and add monthly observations from that maximum until DEC2019 with the value of index being equal to the last available value (here in the example the value in JUN2019).
An explicit DO loop over the SET statement provides the foundation for a concise solution with no extra working variables. The END= variable last is temporary, so it is never written to the output data set.
data have;
input date :date9. index;
format date date9.;
datalines;
31MAR2019 10
30APR2019 12
31MAY2019 15
30JUN2019 14
;
data want;
do until (last);
set have end=last;
output;
end;
do last = month(date) to 11; %* repurpose automatic variable last as a loop index;
date = intnx ('month',date,1,'e');
output;
end;
run;
Always helpful to refresh understanding. From SET Options documentation
END=variable
creates and names a temporary variable that contains an end-of-file indicator. The variable, which is initialized to zero, is set to 1 when SET reads the last observation of the last data set listed. This variable is not added to any new data set.
You can do it using the END= option in the SET statement and a RETAIN statement.
data want(drop=i tIndex tDate);
set have end=eof;
retain tIndex tDate;
if eof then do;
tIndex=Index;
tDate=Date;
end;
output;
if eof then do;
do i=1 to 12-month(tDate);
index=tIndex;
date = intnx('month',tDate,i,'e');
output;
end;
end;
run;
INPUT:
+-----------+-------+
| date | index |
+-----------+-------+
| 31MAR2019 | 10 |
| 30APR2019 | 12 |
| 31MAY2019 | 15 |
| 30JUN2019 | 14 |
+-----------+-------+
OUTPUT:
+-----------+-------+
| date | index |
+-----------+-------+
| 31MAR2019 | 10 |
| 30APR2019 | 12 |
| 31MAY2019 | 15 |
| 30JUN2019 | 14 |
| 31JUL2019 | 14 |
| 31AUG2019 | 14 |
| 30SEP2019 | 14 |
| 31OCT2019 | 14 |
| 30NOV2019 | 14 |
| 31DEC2019 | 14 |
+-----------+-------+

How to split a combination of numeric and characters into multiple columns

I want to split some variable "15to16" into two columns where for that row I want the values 15 and 16 in each of the column entries. Hence, I want to get from this
+-------------+
| change |
+-------------+
| 15to16 |
| 9to8 |
| 6to5 |
| 10to16 |
+-------------+
this
+-------------+-----------+-----------+
| change | from | to |
+-------------+-----------+-----------+
| 15to16 | 15 | 16 |
| 9to8 | 9 | 8 |
| 6to5 | 6 | 5 |
| 10to16 | 10 | 16 |
+-------------+-----------+-----------+
Could someone help me out? Thanks in advance!
data have;
input change $;
cards;
15to16
9to8
6to5
10to16
;
run;
data want;
set have;
from = input(scan(change,1,'to'), 8.);
to = input(scan(change,2,'to'), 8.);
run;
N.B. in this case the scan function is using both t and o as separate delimiters, rather than looking for the word to. This approach still works because scan by default treats multiple consecutive delimiters as a single delimiter.
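If splitting on the literal word to is preferred over treating t and o as separate delimiters, one possible sketch uses FIND and SUBSTR. It assumes each value contains the substring to exactly once with digits on both sides; the data set name want_word and the helper variable pos are illustrative.

```sas
data want_word;
  set have;
  pos = find(change, 'to');   /* position of the literal substring "to" */
  /* only split when there is text on both sides of "to" */
  if pos > 1 and pos + 2 <= length(change) then do;
    from = input(substr(change, 1, pos - 1), 8.);
    to   = input(substr(change, pos + 2), 8.);
  end;
  drop pos;
run;
```

Rows that do not satisfy the condition keep missing values in from and to.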
Regular expressions use parentheses () to define groups whose contents can be retrieved from capture buffers with PRXPOSN. Here each capture buffer holds one or more consecutive digits (\d+), which INPUT converts to a numeric value.
data have;
input change $20.;
datalines;
15to16
9to8
6to5
10to16
;
data want;
set have;
rx = prxparse('/^\s*(\d+)\s*to\s*(\d+)\s*$/');
if prxmatch (rx, change) then do;
from = input(prxposn(rx,1,change), 12.);
to = input(prxposn(rx,2,change), 12.);
end;
drop rx;
run;
You can get the answer you want by declaring a delimiter when you create the dataset. However, you did not provide enough information about your other variables and how you import them.
Data want;
INFILE datalines DELIMITER='to';
INPUT from to;
datalines;
15to16
9to8
6to5
10to16
;
Run;

Updating value in one row to another row based on same column value

My dataset XXX consists of records where two rows form a pair based on the same value of the FRUIT column. One row of each pair has empty COUNTRY and COLOUR fields, while the other row contains the actual COUNTRY and COLOUR values. I would like to copy the COLOUR value from the row where COUNTRY is populated (the source) into the empty COLOUR field of the row where COUNTRY is empty (the destination).
XXX DATASET [current]
FRUIT | COUNTRY | COLOUR
Banana | . | .
Banana | Spain | Yellow
Apple | . | .
Apple | USA | Red
Pear | China | Green
Pear | . | .
YYY [DESIRED]
FRUIT | COUNTRY | COLOUR
Banana | . | Yellow
Banana | Spain | Yellow
Apple | . | Red
Apple | USA | Red
Pear | China | Green
Pear | . | Green
Of course this example is dumb, but it is a valid business case.
Apologies, I could not attach code here as I am on a bus right now, frantically typing. I tried using first. and last., but somehow the value cannot be passed across rows.
Can you advise on this?
Here's one way of doing this, using retain to carry over values from previous rows. The trick is to retain a temporary column rather than the one you want to fill in:
data have;
infile cards dlm='|';
input FRUIT $ COUNTRY $ COLOUR $;
cards;
Banana | . | .
Banana | Spain | Yellow
Apple | . | .
Apple | USA | Red
Pear | China | Green
Pear | . | .
;
run;
/*Sort missing values of COLOUR to the bottom within each FRUIT*/
proc sort data = have out = temp;
by FRUIT descending COLOUR;
run;
data want;
set temp;
by FRUIT;
retain t_COLOUR 'placeholder';
if first.FRUIT then do;
call missing(t_COLOUR);
if not(missing(COUNTRY)) then t_COLOUR = COLOUR;
end;
else COLOUR = coalescec(COLOUR, t_COLOUR);
drop t_COLOUR;
run;
Try this out:
proc sort data=have;
by fruit country;
run;
data want( rename=(country1=country colour1=colour));
set have end=eof;
by fruit notsorted;
if first.fruit then do;
point = _N_ + 1;
set have (keep= country colour rename= (country = country1 colour = colour1)) point=point;
end;
else do;
country1=country;
colour1 = colour;
end;
drop country colour;
run;
So you want to apply the non-missing values of COLOUR to every record with the same value of FRUIT? Sounds like a simple MERGE problem. Note that MERGE with BY requires XXX to be sorted by FRUIT.
data YYY;
merge XXX(drop=colour) XXX(keep=fruit colour where=(not missing(colour)));
by fruit;
run;
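A sketch of the same merge with the sort step included, since the sample data is not in FRUIT order. The intermediate data set name XXX_sorted is illustrative, and the sketch assumes each FRUIT has exactly one non-missing COLOUR.

```sas
/* BY-group processing needs the input ordered by FRUIT */
proc sort data=XXX out=XXX_sorted;
  by fruit;
run;

data YYY;
  merge XXX_sorted(drop=colour)
        XXX_sorted(keep=fruit colour where=(not missing(colour)));
  by fruit;
run;
```

The output comes back in sorted FRUIT order, not the original row order.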

Compare Value of Current Observation with First Observation

I have a set of multiple choice responses from a survey with 45 questions, and I've placed the correct responses as my first observation in the dataset.
In my DATA step I would like to set values to 0 or 1 depending on whether the variable in each observation matches the same variable in the first observation. I want to replace the response letter (A-D) with the 0 or 1 in the dataset. How do I go about doing that comparison?
I'm not doing any grouping, so I believe I can access the first row using First.x, but I'm not sure how to compare that across each variable(answer1-answer45).
| Id | answer1 | answer2 | ...through answer 45
|:-------------|---------:|---------:|
| KEY | A | B |
| 2 | A | C |
| 3 | C | D |
| 4 | A | B |
| 5 | D | C |
| 6 | B | B |
Should become:
| Id | answer1 | answer2 | ...through answer 45
|:-------------|---------:|---------:|
| KEY | A | B |
| 2 | 1 | 0 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 0 | 0 |
| 6 | 0 | 1 |
Current code for reading in the data:
DATA TEST(drop=name fill answer0);
INFILE SCORES DSD firstobs=2;
length id $4;
length answer1-answer150 $1;
INPUT name $ fill id $ (answer0-answer150) ($);
RUN;
Thanks in advance!
Here's how I might do it. Create a data set so that PROC COMPARE can compare the KEY to the observed answers. You then have X where an answer does not match the key and missing where it matches. You can then use PROC TRANSREG to score the 'X.' to 01. PROC TRANSREG also creates macro variables that contain the names of the new variables and their number.
From log NOTE: _TRGINDN=2 _TRGIND=answer1D answer2D
data questions;
input id:$3. (answer1-answer2)(:$1.);
cards;
KEY A B
2 A C
3 C D
4 A B
5 D C
6 B B
;;;;
run;
data key;
if _n_ eq 1 then set questions(obs=1);
set questions(keep=id firstobs=2);
run;
proc compare base=key compare=questions(firstobs=2) out=comp outdiff noprint;
id id;
run;
options validvarname=v7;
proc transreg design data=comp(drop=_type_ type=data);
id id;
model class(answer:) / noint;
output out=scored(drop=intercept _:);
run;
%put NOTE: &=_TRGINDN &=_TRGIND;
I don't have my SAS license here at home, so I can't actually test this code. I'll give it my best shot, though ...
First, I'd keep my correct answers in a separate table, and then merge it with the answers from the respondents. That also makes the solution scalable, should you have more multiple choice solutions and answers in the same table, since you'd be joining on the assignment ID as well.
Now, import all your correct answers to a table answers_correct with column names answer_correct1-answer_correct45.
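Then, combine the two tables and determine the outcome for each question.
DATA outcome;
* Read the single row of correct answers once. SET retains its variables, so the key stays available for every row;
* (a MERGE without BY would leave the key missing after the first observation);
IF _N_ = 1 THEN SET answers_correct;
SET answers;
* If you later add more questionnaires, MERGE BY the questionnaire ID instead;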
ARRAY answer(*) answer1-answer45;
ARRAY answer_correct(*) answer_correct1-answer_correct45;
LENGTH result1-result45 $1;
ARRAY result(*) result1-result45;
DROP i;
DO i = 1 TO DIM(answer);
IF answer(i) = answer_correct(i) THEN result(i) = '1';
ELSE result(i) = '0';
END;
RUN;