I have data that looks like this:
ID Sequence
---------------------------------
101 E6S,K11T,Q174K,D177E
102 K11T,V245EKQ
I need to add:
A new column with column heading for each sequence, add prefix 'RT', drop the letters following the numeric part of the sequence
Fill the new column with the letters that follow the numeric part
of the sequence
I need to create this:
ID Sequence RTE6 RTK11 RTQ174 RTD177 RTV245
-----------------------------------------------------------------------
101 E6S,K11T,Q174K,D177E S T K E
102 K11T,V245EKQ T EKQ
I assume you want a SAS data set and not a report. ANYDIGIT makes it pretty easy to find the last non-digit sub-string.
data seq;
infile cards firstobs=3;
input id:$3. sequence :$50.;
cards;
ID Sequence
---------------------------------
101 E6S,K11T,Q174K,D177E
102 K11T,V245EKQ
;;;;
run;
proc print;
run;
data seq2V / View=seq2V;
set seq;
length w name sub $32 subl 8;
do i = 1 by 1;
w = scan(sequence,i,',');
if missing(w) then leave;
subl = anydigit(w,-99);
name = substrn(w,1,subl);
sub = substrn(w,subl+1);
output;
end;
run;
proc transpose data=seq2V out=seq3(drop=_name_) prefix=RT;
by id sequence;
var sub;
id name;
run;
proc print;
run;
I had a similar problem a while ago. The code is adapted to your problem.
If found this solution to work faster than anything I tried with proc transpose.
Still overall performance on huge datasets (espc. using many different sequences) is not great at all, as we loop 2*2 over all strings and also the final variables.
Can anyone offer a faster solution?
(Caution: MacroVar is limited to 65534 Characters.)
data var_name ;
set in_data;
length var string $30.;
do i = 1 to countw(Sequence, ',');
string = scan(Sequence,i,',');
var = substr(string,1,anydigit(string,-99));
output;
keep var;
end;
run;
proc sql noprint;
select distinct compress("RT"||var) into :var_list separated by ' '
from var_name;
quit;
%put &var_list.;
data out_data;
set in_data;
length string &var_list. $30. n 8. ;
array a_var [*] &var_list.;
do i = 1 to countw(Sequence, ',');
string = scan(Sequence,i,',');
do j = 1 to dim(a_var);
n = anydigit(string,-99) ;
if substr(vname(a_var[j]),3) eq substr(string,1,n) then a_var[j] = substr(string,n+1);
end;
end;
drop string i j n;
run;
Related
I want to create a variable that resolves to the character before a specified character (*) in a string. However I am asking myself now if this specified character appears several times in a string (like it is in the example below), how to retrieve one variable that concatenates all the characters that appear before separated by a comma?
Example:
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
;
run;
Code:
data want;
set have;
cnt = count(string, "*");
_startpos = 0;
do i=0 to cnt until(_startpos=0);
before = catx(",",substr(string, find(string, "*", _startpos+1)-1,1));
end;
drop i _startpos;
run;
That code output before=C for the first and second observation. However I want it to be before=C,E for the first one and before=C,W,d for the second observation.
You can use Perl regular expression replacement pattern to transform the original string.
Example:
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
;
data want;
set have;
csl = prxchange('s/([^*]*?)([^*])\*/$2,/',-1,string); /* comma separated letters */
csl = prxchange('s/, *$//',1,csl); /* remove trailing comma */
run;
Make sure to increment _STARTPOS so your loop will finish. You can use CATX() to add the commas. Simplify selecting the character by using CHAR() instead of SUBSTR(). Also make sure to TELL the data step how to define the new variable instead of forcing it to guess. I also include test to handle the situation where * is in the first position.
data have;
input string $20.;
datalines;
ABC*EDE*
EFCC*W*d*
*XXXX*
asdf
;
data want;
set have;
length before $20 ;
_startpos = 0;
do cnt=0 to length(string) until(_startpos=0);
_startpos = find(string,'*',_startpos+1);
if _startpos>1 then before = catx(',',before,char(string,_startpos-1));
end;
cnt=cnt-(string=:'*');
drop i _startpos;
run;
Results:
Obs string before cnt
1 ABC*EDE* C,E 2
2 EFCC*W*d* C,W,d 3
3 *XXXX* X 1
4 asdf 0
call scan is also a good choice to get position of each *.
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
****
asdf
;
data want;
length before $20.;
set have;
do i = 1 to count(string,'*');
call scan(string,i,pos,len,'*');
before = catx(',',before,substrn(string,pos+len-1,1));
end;
put _n_ = +7 before=;
run;
Result:
_N_=1 before=C,E
_N_=2 before=C,W,d
_N_=3 before=
_N_=4 before=
I am trying to develop a recursive program to in missing string values using flat probabilities (for instance if a variable had three possible values and one observation was missing, the missing observation would have a 33% of being replace with any value).
Note: The purpose of this post is not to discuss the merit of imputation techniques.
DATA have;
INPUT id gender $ b $ c $ x;
CARDS;
1 M Y . 5
2 F N . 4
3 N Tall 4
4 M Short 2
5 F Y Tall 1
;
/* Counts number of categories i.e. 2 */
proc sql;
SELECT COUNT(Unique(gender)) into :rescats
FROM have
WHERE Gender ~= " " ;
Quit;
%let rescats = &rescats;
%put &rescats; /*internal check */
/* Collects response categories separated by commas i.e. F,M */
proc sql;
SELECT UNIQUE gender into :genders separated by ","
FROM have
WHERE Gender ~= " "
GROUP BY Gender;
QUIT;
%let genders = &genders;
%put &genders; /*internal check */
/* Counts entries to be evaluated. In this case observations 1 - 5 */
/* Note CustomerKey is an ID variable */
proc sql;
SELECT COUNT (UNIQUE(customerKey)) into :ID
FROM have
WHERE customerkey < 6;
QUIT;
%let ID = &ID;
%put &ID; /*internal check */
data want;
SET have;
DO i = 1 to &ID; /* Control works from 1 to 5 */
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 and 2 */
RandGender = (ROUND(u*(&rescats - 1)) + 1)*1;
/* PROBLEM Should if gender is missing set string value of M or F */
IF gender = ' ' THEN gender = SCAN(&genders, RandGender, ',');
END;
RUN;
I the SCAN function does not create a F or M observation within gender. It also appears to create a new M and F variable. Additionally the DO Loop creates addition entry under within CustomerKey. Is there any way to get rid of these?
I would prefer to use loops and macros to solve this. I'm not yet proficient with arrays.
Here is my attempt at tidying this up a little:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
infile cards dlm=',';
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
/*Consolidated into 1 proc, addded noprint and removed unnecessary group by*/
proc sql noprint;
/* Counts number of categories i.e. 2 */
SELECT COUNT(unique(gender)) into :rescats
FROM have
WHERE not(missing(Gender));
/* Collects response categories separated by commas i.e. F,M */
SELECT unique gender into :genders separated by ","
FROM have
WHERE not(missing(Gender))
;
Quit;
/*Removed redundant %let statements*/
%put rescats = &rescats; /*internal check */
%put genders = &genders; /*internal check */
/*Removed ID list code as it wasn't making any difference to the imputation in this example*/
data want;
SET have;
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 or 2 */
RandGender = ROUND(u*(&rescats - 1)) + 1;
IF missing(gender) THEN gender = SCAN("&genders", RandGender, ','); /*Added quotes around &genders to prevent SAS interpreting M and F as variable names*/
RUN;
Halo8:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
infile cards dlm=',';
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
run;
Tip: You can use a dot (.) to mean a missing value for a character variable during INPUT.
Tip: DATALINES is the modern alternative to CARDS.
Tip: Data values don't have to line up, but it helps humans.
Thus this works as well:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
DATALINES;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;
Tip: Your technique requires two passes over the data.
One to determine the distinct values.
A second to apply your imputation.
Most approaches require two passes per variable processed. A hash approach can do only two passes but requires more memory.
There are many ways to deteremine distinct values: SORTING+FIRST., Proc FREQ, DATA Step HASH, SQL, and more.
Tip: Solutions that move data to code back to data are sometimes needed, but can be troublesome. Often the cleanest way is to let data remain data.
For example: INTO will be the wrong approach if the concatenated distinct values would require more than 64K
Tip: Data to Code is especially troublesome for continuous values and other values that are not represented exactly the same when they become code.
For example: high precision numeric values, strings with control-characters, strings with embedded quotes, etc...
This is one approach using SQL. As mentioned before, Proc SURVEYSELECT is far better for real applications.
Proc SQL;
Create table REPLACEMENTS as select distinct gender from have where gender is NOT NULL;
%let REPLACEMENT_COUNT = &SQLOBS; %* Tip: Take advantage of automatic macro variable SQLOBS;
data REPLACEMENTS;
set REPLACEMENTS;
rownum+1; * rownum needed for RANUNI matching;
run;
Proc SQL;
* Perform replacement of missing values;
Update have
set gender =
(
select gender
from REPLACEMENTS
where rownum = ceil(&REPLACEMENT_COUNT * ranuni(1234))
)
where gender is NULL
;
%let SYSLAST = have;
DM 'viewtable have' viewtable;
You don't have to be concerned about columns not having a missing value because no replacement would occur in those. For columns having a missing the list of candidate REPLACEMENTS excludes the missing and the REPLACEMENT_COUNT is correct for computing the uniform probability of replacement, 1/COUNT, coded as rownum = ceil (random)
I have a data set with 400 observations of 4 digit codes which I would like to pad with a space on both sides
ex. Dataset
obs code
1 1111
2 1112
3 3333
.
.
.
400 5999
How can I go through another large data set and replace every occurrence of any of the padded 400 codes with a " ".
ex. Large Dataset
obs text
1 abcdef 1111 abcdef
2 abcdef 1111 abcdef 1112 8888
3 abcdef 1111 abcdef 11128888
...
Data set that I want
ex. New Data set
obs text
1 abcdef abcdef
2 abcdef abcdef 8888
3 abcdef abcdef 11128888
...
Note: I'm only looking to replace 4 digit codes that are padded on both sides by a space. So in obs 3, 1112 won't be replaced.
I've tried doing the following proc sql statement, but it only finds and replaces the first match, instead of all the matches.
proc sql;
select
*,
tranwrd(large_dataset.text, trim(small_dataset.code), ' ') as new_text
from large_dataset
left join small_dataset
on findw(large_dataset.text, trim(small_dataset.code))
;
quit;
You could just use a DO loop to scan through the small dataset of codes for each record in the large dataset. If you want to use TRANWRD() function then you will need to add extra space characters.
data want ;
set have ;
length code $4 ;
do i=1 to nobs while (text ne ' ');
set codes(keep=code) nobs=nobs point=i ;
text = substr(tranwrd(' '||text,' '||code||' ',' '),2);
end;
drop code;
run;
The DO loop will read the records from your CODES list. Using the POINT= option on the SET statement lets you read the file multiple times. The WHILE clause will stop if the TEXT string is empty since there is no need to keep looking for codes to replace at that point.
If your list of codes is small enough and you can get the right regular expression then you might try using PRXCHANGE() function instead. You can use an SQL step to generate the codes as a list that you can use in the regular expression.
proc sql noprint ;
select code into :codelist separated by '|'
from codes
;
quit;
data want ;
set have ;
text=prxchange("s/\b(&codelist)\b/ /",-1,text);
run;
There might be more efficient ways of doing this, but this seems to work fairly well:
/*Create test datasets*/
data codes;
input code;
cards;
1111
1112
3333
5999
;
run;
data big_dataset;
infile cards truncover;
input text $100.;
cards;
abcdef 1111 abcdef
abcdef 1111 abcdef 1112 8888
abcdef 1111 abcdef 11128888
;
run;
/*Get the number of codes to use for array definition*/
data _null_;
set codes(obs = 1) nobs = nobs;
call symput('ncodes',nobs);
run;
%put ncodes = &ncodes;
data want;
set big_dataset;
/*Define and populate array with padded codes*/
array codes{&ncodes} $6 _temporary_;
if _n_ = 1 then do i = 1 to &ncodes;
set codes;
codes[i] = cat(' ',put(code,4.),' ');
end;
do i = 1 to &ncodes;
text = tranwrd(text,codes[i],' ');
end;
drop i code;
run;
I expect a solution using prxchange is also possible, but I'm not sure how much work it is to construct a regex that matches all of your codes compared to just substituting them one by one.
Taking Tom's solution and putting the code-lookup into a hash-table. Thereby the dataset will only be read once and the actual lookup is quite fast. If the Large Dataset is really large this will make a huge difference.
data want ;
if _n_ = 1 then do;
length code $4 ;
declare hash h(dataset:"codes (keep=code)") ;
h.defineKey("code") ;
h.defineDone() ;
call missing (code);
declare hiter hiter('h') ;
end;
set big_dataset ;
rc = hiter.first() ;
do while (rc = 0 and text ne ' ') ;
text = substr(tranwrd(' '||text,' '||code||' ',' '),2) ;
rc = hiter.next() ;
end ;
drop code rc ;
run;
Use array and regular express:
proc transpose data=codes out=temp;
var code;
run;
data want;
if _n_=1 then set temp;
array var col:;
set big_dataset;
do over var;
text = prxchange(cats('s/\b',var,'\b//'),-1,text);
end;
drop col:;
run;
I have a dataset looks like the following:
Name Number
a 1
b 2
c 9
d 6
e 5.5
Total ???
I want to calculate the sum of variable Number and record the sum in the last row (corresponding with Name = 'total'). I know I can do this using proc means then merge the output backto this file. But this seems not very efficient. Can anyone tell me whether there is any better way please.
you can do the following in a dataset:
data test2;
drop sum;
set test end = last;
retain sum;
if _n_ = 1 then sum = 0;
sum = sum + number;
output;
if last then do;
NAME = 'TOTAL';
number = sum;
output;
end;
run;
it takes just one pass through the dataset
It is easy to get by report procedure.
data have;
input Name $ Number ;
cards;
a 1
b 2
c 9
d 6
e 5.5
;
proc report data=have out=want(drop=_:);
rbreak after/ summarize ;
compute after;
name='Total';
endcomp;
run;
The following code uses the DOW-Loop (DO-Whitlock) to achieve the result by reading through the observations once, outputting each one, then lastly outputting the total:
data want(drop=tot);
do until(lastrec);
set have end=lastrec;
tot+number;
output;
end;
name='Total';
number=tot;
output;
run;
For all of the data step solutions offered, it is important to keep in mind the 'Length' factor. Make sure it will accommodate both 'Total' and original values.
proc sql;
select max(5,length) into :len trimmed from dictionary.columns WHERE LIBNAME='WORK' AND MEMNAME='TEST' AND UPCASE(NAME)='NAME';
QUIT;
data test2;
length name $ &len;
set test end=last;
...
run;
I have a question about transposing data without using PROC Transpose.
0 a b c
1 dog cat camel
2 9 7 2534
Without using PROC TRANSPOSE, how can I get a resulting dataset of:
Animals Weight
1 dog 9
2 cat 7
3 camel 2534
This is a bit of a curious request. This example code is hard coded for your 3 variables. You will have to generalize this if needed.
data temp;
input a $ b $ c $;
datalines;
dog cat camel
9 7 2534
;
run;
data animal_weight;
set temp end=last;
format animal animals1-animals3 $8.;
format weight weights1-weights3 best. ;
retain animals: weights:;
array animals[3];
array weights[3];
if _n_ = 1 then do;
animals[1] = a;
animals[2] = b;
animals[3] = c;
end;
else if _n_ = 2 then do;
weights[1] = input(a,best.);
weights[2] = input(b,best.);
weights[3] = input(c,best.);
end;
if last then do;
do i=1 to 3;
animal = animals[i];
weight = weights[i];
output;
end;
end;
drop i animals: weights: a b c;
run;
Read the values into 2 arrays, converting the weights from strings into numbers. Use the _N_ variable to figure out which array to populate. At the end of the data set, output the values in the arrays.
I wouldn't give this as an answer to a homework problem that I actually wanted to get a good grade on (because it's far too advanced, so it's obvious you asked for help); but the hash solution is almost certainly the most flexible and what I'd hope someone doing this in the real world would do (assuming there is a 'don't use proc transpose' real world reason, such as available resources). The problem is somewhat undefined, so this is only moderately fault-tolerant.
data have;
input a $ b $ c $;
datalines;
dog cat camel
9 7 2534
;;;;
run;
data _null_;
set have end=eof;
array charvars _character_;
if _n_ = 1 then do;
length animal $15 weight 8;
declare hash h();
h.defineKey('row');
h.defineData('animal','weight');
h.defineDone();
end;
animal=' ';
weight=.;
do row = 1 to dim(charvars);
rc_f = h.find();
if rc_f ne 0 then do;
animal=charvars[row];
rc_a = h.add();
animal=' ';
end;
else if rc_f eq 0 then do;
weight=input(charvars[row],best12.);
rc_r = h.replace();
end;
end;
if eof then rc_o = h.output(dataset:'want');
run;
Do you always have just two rows or is that the no of columns and the rows are dynamic?
If you have a dynamic no of rows and columns, then the ideal way will be to use open function, get the no of columns to a macro variable. This will be the no of rows in your new dataset. Then take the no of rows in your original dataset which will be the no of columns in your new dataset. This must happen before the actual Transpose method. Post this you can read it in to an array and using the macro variables as the dimensions output the values in to the new dataset.
Having said all this, why would you want to re-invent the wheel when you already have the SAS provided ready made transpose function?