Accumulate text variables in SAS across observations - sas

This seems straightforward, but it's not working as expected:
data names;
input name $12.;
cards;
John
Jacob
Jingleheimer
Schmidt
;
run;
data names;
length namelist $100.;
set names end=eof;
retain namelist;
if _n_=1 then namelist=name;
else namelist = namelist || "|" || name;
if eof then output;
run;
I would expect the result to have one observation containing
John|Jacob|Jingleheimer|Schmidt
but namelist is just John. What am I doing wrong?

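You need to trim the whitespace before concatenating to your list. namelist is padded with trailing blanks to its full length of 100, so anything appended after it falls past the end of the variable and is truncated.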
data names;
length namelist $100.;
set names end=eof;
retain namelist;
if _n_=1 then namelist=trim(name);
else namelist = trim(namelist) || "|" || trim(name);
if eof then output;
run;
You could also use the cats() function (which does the trimming and concatenation for you):
data names;
length namelist $100.;
set names end=eof;
retain namelist;
if _n_=1 then namelist=name;
else namelist = cats(namelist,"|",name);
if eof then output;
run;
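To see why the original step stops at John, here is a minimal sketch (the variable names padded and fixed are made up for illustration). A $20 variable holding John is stored with 16 trailing blanks, so appending to it builds a 26-character value whose tail is cut off when it is assigned back:

```sas
data _null_;
  length padded fixed $20;
  padded = "John";
  fixed  = "John";
  /* "John" + 16 blanks + "|Jacob" is 26 characters; only the first 20 are kept */
  padded = padded || "|" || "Jacob";
  /* trim() drops the trailing blanks first, so the result fits */
  fixed = trim(fixed) || "|" || "Jacob";
  put padded= / fixed=;
run;
```

The log should show padded=John and fixed=John|Jacob.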

If you added STRIP to your assignment
strip(namelist) || "|" || name
it would work also
(but CATS is a really good solution)

Using the catx function allows you to specify the delimiter...
data names;
length namelist $100.;
set names end=eof;
retain namelist;
namelist = catx("|",namelist,name);
if eof then output;
run;
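catx also makes the _n_=1 branch unnecessary, because it skips blank arguments and only puts the delimiter between non-blank ones. A small sketch of that behaviour, using literal values instead of the names dataset:

```sas
data _null_;
  length namelist $100;
  /* namelist is still blank here, so no leading "|" is added */
  namelist = catx("|", namelist, "John");
  /* now both arguments are non-blank, so the delimiter goes between them */
  namelist = catx("|", namelist, "Jacob");
  put namelist=;
run;
```

This should write namelist=John|Jacob to the log.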

Related

Concatenating a variable dynamically in SAS

I want to create a variable that resolves to the character before a specified character (*) in a string. If the specified character appears several times in a string (as in the example below), how can I get one variable that concatenates all the characters that appear before it, separated by commas?
Example:
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
;
run;
Code:
data want;
set have;
cnt = count(string, "*");
_startpos = 0;
do i=0 to cnt until(_startpos=0);
before = catx(",",substr(string, find(string, "*", _startpos+1)-1,1));
end;
drop i _startpos;
run;
That code outputs before=C for both the first and the second observation. However, I want it to be before=C,E for the first one and before=C,W,d for the second observation.
You can use a Perl regular expression replacement pattern to transform the original string.
Example:
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
;
data want;
set have;
csl = prxchange('s/([^*]*?)([^*])\*/$2,/',-1,string); /* comma separated letters */
csl = prxchange('s/, *$//',1,csl); /* remove trailing comma */
run;
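As a rough sketch of how the two patterns work together, the step below applies them to a single literal value and writes the intermediate and final results to the log (step1 and step2 are names chosen here for illustration):

```sas
data _null_;
  string = 'EFCC*W*d*';
  /* ([^*]*?) lazily skips the non-asterisk characters at the start of a run, */
  /* ([^*]) captures the one character right before the asterisk,            */
  /* and the whole match is replaced by that character plus a comma.          */
  step1 = prxchange('s/([^*]*?)([^*])\*/$2,/', -1, string);
  /* remove the final comma and any blanks that follow it */
  step2 = prxchange('s/, *$//', 1, step1);
  put step1= / step2=;
run;
```

The -1 argument asks prxchange to repeat the replacement until no more matches are found, which is what handles a varying number of asterisks.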
Make sure to increment _STARTPOS so your loop will finish. You can use CATX() to add the commas. Simplify selecting the character by using CHAR() instead of SUBSTR(). Also make sure to TELL the data step how to define the new variable instead of forcing it to guess. I also included a test to handle the situation where * is in the first position.
data have;
input string $20.;
datalines;
ABC*EDE*
EFCC*W*d*
*XXXX*
asdf
;
data want;
set have;
length before $20 ;
_startpos = 0;
do cnt=0 to length(string) until(_startpos=0);
_startpos = find(string,'*',_startpos+1);
if _startpos>1 then before = catx(',',before,char(string,_startpos-1));
end;
cnt=cnt-(string=:'*');
drop _startpos;
run;
Results:
Obs  string     before  cnt
1    ABC*EDE*   C,E     2
2    EFCC*W*d*  C,W,d   3
3    *XXXX*     X       1
4    asdf               0
call scan is also a good choice to get position of each *.
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
****
asdf
;
data want;
length before $20.;
set have;
do i = 1 to count(string,'*');
call scan(string,i,pos,len,'*');
before = catx(',',before,substrn(string,pos+len-1,1));
end;
put _n_ = +7 before=;
run;
Result:
_N_=1 before=C,E
_N_=2 before=C,W,d
_N_=3 before=
_N_=4 before=

Concatenating all variables in an observation in SAS

Is there a general purpose way of concatenating each variable in an observation into one larger variable whilst preserving the format of numeric/currency fields in terms of how it looks when you do a proc print on the dataset. (see sashelp.shoes for example)
Here is some code you can run. As you can see in the log, using the catx function to produce comma-separated output removes both the $ currency sign and the period from the numeric variables.
proc print data=sashelp.shoes (obs=10);
run;
proc sql;
select name into :varstr2 separated by ','
from dictionary.columns
where libname = "SASHELP" and
memname = "SHOES";
quit;
data stuff();
format all $5000.;
set sashelp.shoes ;
all = catx(',',&varstr2.) ;
put all;
run;
Any solution needs to be general purpose as it will run on disparate datasets with differently formatted variables.
You can manually loop over PDV variables of the data set, concatenating each formatted value retrieved with vvaluex. A hash can be used to track which variables of the data set to process. If you are comma separating values you will probably want to double quote formatted values that contain a comma.
data want;
set sashelp.cars indsname=_data;
if _n_ = 1 then do;
declare hash vars();
length _varnum 8 _varname $32;
vars.defineKey('_n_');
vars.defineData('_varname');
vars.defineDone();
_dsid = open(_data);
do _n_ = 1 to attrn(_dsid,'NVAR');
rc = vars.add(key:_n_,data:varname(_dsid,_n_));
end;
_dsid = close(_dsid);
call missing (of _:);
end;
format weight comma7.;
length allcat $32000 _vvx $32000;
do _n_ = 1 to vars.NUM_ITEMS;
vars.find();
_vvx = strip(vvaluex(_varname));
if index(_vvx,",") then _vvx = quote(strip(_vvx));
if _n_ = 1
then allcat = _vvx;
else allcat = cats(allcat,',',_vvx);
end;
drop _:;
run;
You can export to a CSV file and then read each line back in:
filename tem temp;
proc export data=sashelp.SHOES file=tem dbms=csv replace;
run;
data l;
length all $ 200;
infile tem truncover firstobs=2;
input all 1-200;
run;
P.S.
If you need to concatenate only character variables, you can create an array of all the CHARACTER columns in the dataset and just iterate through it:
data l;
length all $ 5000;
set sashelp.SHOES;
array ch [*] _CHARACTER_;
do i = 1 to dim(ch);
all=catx(',',all,ch[i]);
end;
run;
The PUT statement is the easiest way to do that. You don't need to know the variable names as you can use the _all_ variable list.
put (_all_) (+0);
It will honor the formats attached to the variables, and if you have used the DSD option on the FILE statement then the result is a delimited list.
What is the ultimate goal of this exercise? If you want to create a file you can just write the file directly.
data _null_;
set sashelp.shoes(obs=3);
file 'myfile.csv' dsd ;
put (_all_) (+0);
run;
If you really do want to get that string into a dataset variable there is no need to invent some new function. Just take advantage of the PUT statement's abilities by creating a file and then reading the lines from the file.
filename junk temp;
data _null_;
set sashelp.shoes(obs=3);
file junk dsd ;
put (_all_) (+0);
run;
data stuff ;
set sashelp.shoes(obs=3);
infile junk truncover ;
input all $5000.;
run;
You can even do it without creating the full text file. Instead just write one line at a time and save the line into a variable using the _FILE_ automatic variable.
filename junk temp;
data stuff;
set sashelp.shoes(obs=3);
file junk dsd lrecl=5000 ;
length all $5000;
put #1 (_all_) (+0) +(-2) ' ' #;
all = _file_;
output;
all=' ';
put #1 all $5000. #;
run;
Solution with vvalue and the concatenation operator (||):
It is similar to 'Solution without catx' (the last one), but it is simplified by using the vvalue function instead of put.
/*edit sashelp.shoes with missing values in Product as test-cases*/
proc sql noprint;
create table wocatx as
select * from SASHELP.SHOES;
update wocatx
set Product = '';
quit;
/*Macro variable for concat function (||)*/
proc sql;
select ('strip(vvalue('|| strip(name) ||'))') into :varstr4 separated by "|| ',' ||"
from dictionary.columns
where libname = "WORK" and
memname = "WOCATX";
quit;
/*Data step to concat all variables*/
data stuff2;
format all $5000.;
set work.wocatx ;
all = &varstr4. ;
put all;
run;
Solution with catx:
proc print data=SASHELP.SHOES;
run;
proc sql;
select ifc(strip(format) is missing,strip(name),ifc(type='num','put('|| strip(name) ||','|| strip(format) ||')','input('|| strip(name) ||','|| strip(format) ||')')) into :varstr2 separated by ','
from dictionary.columns
where libname = "SASHELP" and
memname = "SHOES";
quit;
data stuff();
format all $5000.;
set sashelp.shoes ;
all = catx(',',&varstr2.) ;
put all;
run;
If a variable has no format in dictionary.columns, the macro variable varstr2 contains just its name. If it has a format, the generated expression converts the value to that format when catx is called: for example, put(Sales,DOLLAR12.) for a numeric variable, or a call to the input function for a character variable. You can add more conditions to the select ... into clause if you need them.
If you don't need the input function, just change the select expression to:
ifc(strip(format) is missing,strip(name),'put('|| strip(name) ||','|| strip(format) ||')')
Solution without catx:
/*edit sashelp.shoes with missing values in Product as test-cases*/
proc sql noprint;
create table wocatx as
select * from SASHELP.SHOES;
update wocatx
set Product = '';
quit;
/*Macro variable for catx*/
proc sql;
select ifc(strip(format) is missing,strip(name),ifc(type='num','put('|| strip(name) ||','|| strip(format) ||')','input('|| strip(name) ||','|| strip(format) ||')')) into :varstr2 separated by ','
from dictionary.columns
where libname = "WORK" and
memname = "WOCATX";
quit;
/*data step with catx*/
data stuff;
format all $5000.;
set work.wocatx ;
all = catx(',',&varstr2.) ;
put all;
run;
/*Macro variable for concat function (||)*/
proc sql;
select ifc(strip(format) is missing,
'strip(' || strip(name) || ')',
'strip(put('|| strip(name) ||','|| strip(format) ||'))') into :varstr3 separated by "|| ',' ||"
from dictionary.columns
where libname = "WORK" and
memname = "WOCATX";
quit;
/*Data step without catx*/
data stuff1;
format all $5000.;
set work.wocatx ;
all = &varstr3. ;
put all;
run;
Result with catx and missing values:
Result without catx and with missing values:

specifying data informat using do loops in sas

I have a large data file with data in the following format: country, datatype, year1month1 to year2018month7.
Reading the data using proc import did not work for all data fields. I ended up modifying the SAS datastep code to ensure data format was correct.
However, I am having trouble simplifying the code. Namely, I would like a do loop to go through all the years and months. This way, I could use the current date to figure out the range of dates for the file, and the code to create the Year/Month variables would not have to repeat 100 times in the file.
data test;
infile 'abc.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat Country_Name $34. ;
do i = 1940 to 2018;
do j = 1 to 12;
informat _(i)M(j) best32.;
end;
end;
informat Base_Year $1. ;
format Country_Name $34. ;
do i = 1940 to 2018;
do j = 1 to 12;
format _(i)M(j) best12.;
end;
end;
format Base_Year $1. ;
input
Country_Name $
do i = 1940 to 2018;
do j = 1 to 12;
_(i)M(j) $;
end;
end;
Base_Year $;
run;
There are a few approaches here that could work. The most directly translatable to your approach is to use the macro language.
You need to translate those two loops to something like this:
%do i = 1940 %to 2018;
%do j = 1 %to 12;
informat _&i.M&j. best32.;
%end;
%end;
Notice the % there. This also has to be in a macro; you can't do this in normal datastep code.
I would rewrite it to use a macro like so:
%macro make_ym(startyear=, endyear=, separator=);
%local i j;
%do i = &startyear. %to &endyear.;
%do j = 1 %to 12;
_&i.&separator.&j.
%end;
%end;
%mend make_ym;
data test;
infile 'abc.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat Country_Name $34. ;
informat %make_ym(startyear=1940,endyear=2018,separator=M) best32.;
informat Base_Year $1. ;
format %make_ym(startyear=1940,endyear=2018,separator=M) best12.;
format Base_Year $1. ;
input
Country_Name $
%make_ym(startyear=1940,endyear=2018,separator=M)
Base_Year $;
run;
I took out the $ after the yMm bits in the input since you declared them as numeric.
Don't model your data step after the code generated by PROC IMPORT. It does a lot of useless things, like attaching formats and informats to variables that don't need them.
For your problem you just need a simple program like this:
data test;
infile 'abc.csv' dsd dlm= ',' truncover firstobs=2 ;
input Country_Name :$34. Y1940M01 .... Y2018M08 Base_Year :$1. ;
run;
Now the only tricky part is building that list of numeric variables. If the list is small enough you can just put it into a macro variable. Fortunately that is not a problem in this case: with 8-character names (YyyyyMmm) there is room for over 300 years' worth in a data step character variable. A variable of length 10,800 bytes has room for 100 years of month names (1,200 names of 8 characters, each followed by a space).
So just run this data step first.
data _null_;
length names $10800 ;
basedate = mdy(1,1,1940);
lastdate = today();
do i=0 to intck('month',basedate,lastdate);
date=intnx('month',basedate,i);
names=catx(' ',names,cats('Y',year(date),'M',put(month(date),Z2.)));
end;
call symputx('names',names);
run;
Now you can use the macro variable in your INPUT statement.
data test;
infile 'abc.csv' dsd dlm= ',' truncover firstobs=2 ;
input Country_Name :$34. &names Base_Year :$1. ;
run;
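One thing to watch: the list built with today() runs to the current month. If the file stops earlier, the extra names would shift Base_Year out of position. A variation on the step above, assuming the last month in the file is known (July 2018 is taken from the question; n_months is a macro variable name made up here):

```sas
data _null_;
  length names $10800;
  basedate = mdy(1,1,1940);
  lastdate = mdy(7,1,2018);   /* assumption: last month present in the file */
  do i = 0 to intck('month', basedate, lastdate);
    date = intnx('month', basedate, i);
    names = catx(' ', names, cats('Y', year(date), 'M', put(month(date), z2.)));
  end;
  call symputx('names', names);
  /* after the loop, i equals the number of names generated */
  call symputx('n_months', i);
run;
%put NOTE: generated &n_months month variables;
```

Comparing the count in the log with the number of month columns in the file header is a quick way to confirm the list lines up before running the INPUT step.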

End Statement not working in SAS

Can you only use the END statement in SAS in a SET statement? For example, why isn't this working?
filename FS '/folders/myfolders/list4.txt';
data steward;
infile FS dlm = ',' END = EOF;
input Name $ Age Gender $;
if EOF = 1;
run;
Most SAS data steps actually stop when the INPUT or SET statement reads past the end of the file.
I suspect that your input file is either empty or does not have enough data to satisfy your INPUT statement.
You don't need to check EOF or IF as the data step will terminate automatically once it reaches the last record.
Solution:
DATA WORK.input1;
LENGTH
name $ 5
age 8
gender $ 1 ;
FORMAT
name $CHAR5.
age BEST2.
gender $CHAR1. ;
INFORMAT
name $CHAR5.
age BEST2.
gender $CHAR1. ;
INFILE 'E:\saswork\Input.txt'
LRECL=256
FIRSTOBS=2 /*I am skipping the first row, as it contains column names*/
ENCODING="WLATIN1"
DLM='2c'x /* this is "," delimiter; I am using windows*/
MISSOVER
DSD ;
INPUT
name : $CHAR5.
age : ?? BEST2.
gender : $CHAR1. ;
put _all_;
RUN;
/*Contents of the Input.txt*/
/*name, age, gender*/
/*jack,32,M*/
/*John,45,M*/
/*Sally,38,F*/
Output:
name=jack age=32 gender=M _ERROR_=0 _N_=1
name=John age=45 gender=M _ERROR_=0 _N_=2
name=Sally age=38 gender=F _ERROR_=0 _N_=3
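For what it's worth, END= is also a valid INFILE option, so it is not restricted to SET. A minimal sketch, assuming a small comma-delimited file written to a temporary fileref first (the three records are borrowed from the example above):

```sas
filename fs temp;

/* write a small test file so the example is self-contained */
data _null_;
  file fs;
  put 'jack,32,M' / 'John,45,M' / 'Sally,38,F';
run;

data steward;
  infile fs dlm=',' end=eof;
  input Name $ Age Gender $;
  /* eof becomes 1 once the last record of the file has been read */
  if eof then put 'Last record: ' Name= Age= Gender=;
run;
```

Note that END= does not work when the data come from DATALINES, which is why the sketch reads from a file.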

Reading text file in SAS with delimiter in wrong places

I am reading a .txt file into SAS, that uses "|" as the delimiter. The issue is there is one column that is using "|" as a word separator as well instead of acting like delimiter, this needs to be in one column.
For example the txt file looks like:
apple|fruit|Healthy|choices|of|food|12|2012|chart
needs to look like this in the SAS dataset:
apple | fruit | Healthy choices of Food | 12 | 2012 | chart
How do I eliminate "|" between "Healthy choices of Food"?
I think this will do what you want:
data tmp1;
length tmp $100;
input tmp $;
cards;
apple|fruit|Healthy|choices|of|food|12|2012|chart
apple|fruit|Healthy|choices|of|food|and|lots|of|other|stuff|12|2012|chart
;
run;
data tmp2;
set tmp1;
num_delims=length(tmp)-length(compress(tmp,"|"));
expected_delims=5;
extra_delims=num_delims-expected_delims;
length new_var $100;
i=1;
do while(scan(tmp,i,"|") ne "");
if i<=2 or (extra_delims+2)<i<=num_delims then new_var=trim(new_var)||scan(tmp,i,"|")||"|";
else new_var=trim(new_var)||scan(tmp,i,"|")||"#";
i+1;
end;
new_var=left(tranwrd(new_var,"#"," "));
run;
This isn't particularly elegant, but it will work:
data tmp;
input tmp $50.;
cards;
apple|fruit|Healthy|choices|of|food|12|2012|chart
;
run;
data tmp;
set tmp;
var1 = scan(tmp,1,'|');
var2 = scan(tmp,2,'|');
var4 = scan(tmp,-3,'|');
var5 = scan(tmp,-2,'|');
var6 = scan(tmp,-1,'|');
var3 = tranwrd(tmp,trim(var1)||"|"||trim(var2),"");
var3 = tranwrd(var3,trim(var4)||"|"||trim(var5)||"|"||trim(var6),"");
var3 = strip(tranwrd(var3,"|"," ")); /* strip removes the blanks left behind by the replacements above */
run;
Expanding a little on Itzy's answer, here is another possible solution:
data want;
/* Define variables */
attrib item length=$10 label='Item';
attrib class length=$10 label='Family';
attrib desc length=$80 label='Item Description';
attrib count length=8 label='Some number';
attrib year length=$4 label='Year';
attrib somevar length=$10 label='Some variable';
length countc $8; /* A temp variable */
infile 'c:\temp\delimited_temp.txt' lrecl=1000 truncover;
input;
item = scan(_infile_,1,'|','mo');
class = scan(_infile_,2,'|','mo');
countc = scan(_infile_,-3,'|','mo'); /* Temp var for numeric field */
count = inputn(countc,'8.'); /* Re-read the numeric field */
year = scan(_infile_,-2,'|','mo');
somevar = scan(_infile_,-1,'|','mo');
desc = tranwrd(
substr(_infile_
,length(item)+length(class)+3
,length(_infile_)
- ( length(item)+length(class)+length(countc)
+length(year)+length(somevar)+5))
,'|',' ');
drop countc;
run;
The key in this case is to read your file directly and handle the delimiters yourself. This can be tricky and requires that your data file is exactly as described. A much better solution would be to go back to whoever gave you this data and ask them to deliver it to you in a more appropriate form. Good luck!
Another possible workaround.
data tmp;
infile '/path/to/textfile';
input tmp :$100.;
array varlst (*) $30 v1-v6;
a=countw(tmp,'|');
do i=1 to dim(varlst);
if i<=2 then
varlst(i) = scan(tmp,i,'|');
else if i>=4 then
varlst(i) = scan(tmp,a-(dim(varlst)-i),'|');
else do j=3 to a-(dim(varlst)-i); /* through the last word before the trailing three fields */
varlst(i)=catx(' ', varlst(i),scan(tmp,j,'|'));
end;
end;
drop tmp a i j;
run;