SAS split long character variable into 2 variables

I have a messy dataset which contains last names, first names, and addresses (in that order) in one variable, while I need these to be in 2 different variables (names and address).
I tried
data commainvest (keep=appln_id person_id person_name lastname firstnames newname address);
set commainvest;
lastname=scan(person_name,1,',') ;
firstnames=scan(person_name,2,',') ;
newname=catx(', ',lastname,firstnames) ;
address=substr(person_name,1,length(person_name)-length(newname)) ;
run;
and others such as
address= substr(person_name,-1,length(person_name)-length(newname)) ;
or
address= scan(person_name,3,length(person_name)) ;
but it always cuts the address part incorrectly or leaves all the info in the last column.
There is actually no need to separate the last names and first names, but I could not find a way to keep them together from the start. The items in my data are separated by commas.
I appreciate your help
Thanks
Anna

I believe you are getting truncated data due to the default length of the new address variable.
data commainvest (keep=appln_id person_id person_name lastname firstnames newname address);
set commainvest;
lastname=scan(person_name,1,',') ;
firstnames=scan(person_name,2,',') ;
newname=catx(', ',lastname,firstnames) ;
length address $1000;
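* Find where the third comma-delimited item starts and keep everything from there on;
call scan(person_name,3,_pos,_len,',') ;
if _pos > 0 then address=left(substr(person_name,_pos)) ;
run;
Try the above (with the LENGTH statement). Note that SUBSTR has to start where the address begins; starting at position 1 returns the names again. Your code otherwise looks fine!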

Related

SAS proc transpose duplicate values issue

I need your help, please!
I'm doing a proc transpose in SAS, from a table that has only unique lines. However, it is returning the following error
ERROR: The ID value "'OUTROS_CANAIS_Fatura Eletrónica'n" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
ID_CLIENTE=xxxxxxxxxx
When I check the original table the ID_CLIENTE xxxxxxxxxxx has two lines:
ID_CLIENTE MOTIVO Nr_Solicitacoes
xxxxxxxxxx OUTROS_CANAIS_Fatura Eletrónica - adesão 1
xxxxxxxxxx OUTROS_CANAIS_Fatura Eletrónica - cancelamento 1
I believe it is the '-' that is causing the issue (that comes with the original data), since they are clearly two different values.
Any ideas how to solve this?
EDIT: I've managed to replace the '-' value, however it still returns the same error...
Thank you!!
Proc TRANSPOSE ID statement turns data values into columns names when pivoting data. Column names are limited to 32 characters (and column labels are limited to 200 characters). Your ID values when truncated to 32 characters are the same value and you get the 'occurs twice' LOG message.
You can add a new variable to distinguish the id values and use the IDLABEL statement to store the original id values in the variable labels.
Example:
idnum is added to the data and is used to distinguish the id values. If you have many id values a hash can be used to dynamically assign a unique idnum for each id value
options validvarname = v7;
data have;
id = 'xxxxxxxxxx OUTROS_CANAIS_Fatura Eletrónica - adesão';
idnum = 1;
count = 1;
output;
id = 'xxxxxxxxxx OUTROS_CANAIS_Fatura Eletrónica - cancelamento 1';
idnum = 2;
output;
run;
proc transpose data=have out=want;
id idnum;
idlabel id;
var count;
run;
proc contents data=work.want;
run;
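The hash approach mentioned above could look roughly like the following sketch. It assumes an input dataset called have_raw (a hypothetical name) that holds the long text in a character variable id and has no idnum yet; each distinct id value is given the next unused number.

```sas
/* Sketch only: have_raw and have_numbered are hypothetical dataset names */
data have_numbered;
  length idnum 8;
  if _n_ = 1 then do;
    declare hash seen();          /* maps each distinct id value to its number */
    seen.defineKey('id');
    seen.defineData('idnum');
    seen.defineDone();
  end;
  set have_raw;
  if seen.find() ne 0 then do;    /* id value not seen before */
    idnum = seen.num_items + 1;   /* next unused number */
    seen.add();
  end;
run;
```

The numbered dataset can then go through PROC TRANSPOSE with id idnum; and idlabel id; as in the example above.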
Figured it out!
SAS only allows column names of up to 32 bytes... It was a coincidence that the truncated name ended in '-'.

SAS: How to detect the date format in SAS from UNIX folder?

I have access to a Linux directory that contains multiple folders with various names.
Eg.
01312019
19990131
europe_1
johncena
Based on the 4 samples above, only the first and second lines are in a valid date format (MMDDYYYY & YYYYMMDD).
What I want to achieve is to identify and flag the folder that is having the date format that I want, for example, MMDDYYYY. Once identified, I will write a set of rules to further process it.
My script below is already able to scan for the directory.
data allfilenames;
length fref $8 fname $200;
did = filename(fref,"&ROOT./&Directory.");
did = dopen(fref);
do i = 1 to dnum(did);
fname = dread(did,i);
output;
end;
did = dclose(did);
did = filename(fref);
keep fname;
run;
data folderonly;
set allfilenames;
if count(fname,'.') >0 then delete;
run;
However, now I am stuck on how to check the folder name for its date format. Again, it is possible that a folder name does not contain a valid date at all, or contains a date in a different format (YYYYMMDD or MMDDYYYY).
Is there any guide that I can follow?
data folderonly ;
input #1 fname $20.;
datalines ;
01122021
12312021
20210901
blahblah
;
run ;
data check ;
set folderonly ;
if not missing(input(fname,?? mmddyy8.)) then valid_mmddyy = 1 ;
run ;
The ?? before the informat suppresses any errors due to data which doesn't match the informat.
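To tell the two layouts apart, the same test can be run once per informat. The following is only a sketch: it reuses the folderonly dataset and fname variable from above, the output dataset and flag names are made up for illustration, and the eight-digit check is an added assumption to keep names such as 01-12-21 from passing.

```sas
/* Sketch: flag 8-digit folder names that parse as MMDDYYYY or YYYYMMDD */
data check_formats;
  set folderonly;
  /* only names made of exactly eight digits are candidates */
  is_8digits     = (length(fname) = 8 and notdigit(strip(fname)) = 0);
  /* each flag is 1 when the informat returns a valid date, otherwise 0 */
  valid_mmddyyyy = (is_8digits and not missing(input(fname, ?? mmddyy8.)));
  valid_yyyymmdd = (is_8digits and not missing(input(fname, ?? yymmdd8.)));
run;
```

Rows where both flags are 0 are folder names without a usable date in either layout.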

How to split a column into multiple rows in SAS

I have a SAS table that I imported from Oracle with two fields. SYSTEMID and T_BLOB.
Inside the T_BLOB field there is data:
2203 Mountain Meadow===========OSCAR ST===========Zephyrhill Road
(why they are delimiting with equal signs I do not know nor do I know who to ask).
I'm new to SAS and I'm being asked to split T_BLOB field into multiple rows in a table called rick.split_blob. I tried Google but I can't find the exact example. I'm trying to get the output to look like:
SYSTEM_ID T_BLOB
GID_1 2203 Mountain Ave
GID_1 OSCAR ST
GID_1 Zephyrhill Road
Can anyone help me with how to code this?
If none of the values ever contain = then you can just use the scan() function.
data want;
set have ;
length T_BLOB_VALUE $200 ;
do i=1 by 1 until(t_blob_value=' ');
t_blob_value=scan(t_blob,i,'=') ;
if i=1 or t_blob_value ne ' ' then output;
end;
run;
You could try this:
data rick.split_blob (keep=SYSTEM_ID T_BLOB_SUB rename=(T_BLOB_SUB=T_BLOB));
set orig_dataset;
T_BLOB_TRANS = tranwrd(T_BLOB,"===========","|");
do i = 1 to countw(T_BLOB_TRANS,"|");
T_BLOB_SUB = scan(T_BLOB_TRANS,i,"|");
output;
end;
run;
What I'm trying to do is first translate the odd string of equals signs to a simple pipe to avoid counting them as consecutive delimiters. Then we determine how many "words" (really - delimited strings) there are in T_BLOB_TRANS so we know how many times to run the DO loop. Finally we read everything between each delimiter and output it to a new T_BLOB variable for each new word.
It looks like you'll want to use a combination of the "scan" function and the "output" statement (with countw to get you the number of words if it is variable). Scan returns the nth word where you can specify the delimiter. Output outputs a record. So, for example, you can say
do i=1 to countw(line,'=');
newvar = scan(line,i,'=');
output;
end;

Create dynamic SAS variable name from string

I have something similar to the code below, I want to create every 2 character combination within my strings and then count the occurrence of each and store in a table. I will be changing the substr statement to a do loop to iterate through the whole string. But for now I just want to get the first character pair to work;
data temp;
input cat $50.;
call symput ('regex', substr(cat,1,2));
&regex = count(cat,substr(cat,1,2));
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;
Expected results;
cat bv dv cd ud kd
#### 6
#### 4
#### 8
#### 1
#### 3
#### 9
#### 1
I'd prefer not to use a proc transpose as I can't loop through the string to create all the character pairs. I'll have to manually create them and I have up to 500 characters per string, plus I would like to search for 3 and 4 character patterns.
You can't do what you're asking to directly. You will either have to use the macro language, or use PROC TRANSPOSE. SAS doesn't let you reference data in the way you're trying to, because it has to have already constructed the variable names and such before it reads anything in.
I'll post a different solution that uses the macro language, but I suspect TRANSPOSE is the ultimate solution here; there's no practical reason this shouldn't work with your actual problem, and if you're having trouble with that it should be possible to help - post the do loop and what you're wanting, and we can of course help. Likely you just need to put the OUTPUT in the do loop.
data temp;
input cat $50.;
cat_val = substr(cat,1,2);
_var_ = count(cat,substr(cat,1,2));
output;
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;
proc transpose data=temp out=temp_T(drop=_name_);
by cat notsorted; *or by some ID variable more likely;
id cat_val;
var _var_;
run;
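Putting the OUTPUT inside a do loop, as suggested above, might look like the sketch below. It assumes a dataset called strings (a hypothetical name) with the single character variable cat, and it assumes every pair is a valid SAS variable name, which holds for the all-letter sample data. Because the same pair can start at several positions, duplicates are removed before transposing.

```sas
/* Sketch: strings is a hypothetical dataset with one character variable, cat */
data pairs;
  set strings;
  row = _n_;                          /* keeps identical strings apart */
  length pair $2;
  do pos = 1 to length(cat) - 1;
    pair = substr(cat, pos, 2);       /* two-character window starting at pos */
    n = count(cat, pair);             /* same COUNT call as in the question */
    output;
  end;
  keep row cat pair n;
run;

/* a pair can start at several positions, so keep one row per string and pair */
proc sort data=pairs nodupkey;
  by row pair;
run;

proc transpose data=pairs out=pairs_wide(drop=_name_);
  by row cat;
  id pair;
  var n;
run;
```

For 3 or 4 character patterns, the same loop should work with a longer pair variable and a matching SUBSTR length and loop bound.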
Here's a solution that uses CALL EXECUTE rather than the macro language, as I decided that was actually a better solution. I wouldn't use this in production, but it hopefully shows the concept (in particular, I would not run a PROC DATASETS for each variable separately - I would concat all the renames into one string then run that at the end. I thought this better for showing how the process might work.)
This takes advantage of timing - namely, CALL EXECUTE happens after the data step terminates, so by that point you do know what variable maps to what data point. It does have to pass the data twice in order to drop the spurious variables, though if you either know the actual number of variables you want to have, or if you're okay with the excess variables hanging around, it would be okay to skip that, and PROC DATASETS doesn't actually open the whole dataset, so it would be quite fast (even the above with five calls is quite fast).
data temp;
input cat $50.;
array _catvars[50]; *arbitrary 50 chosen here - pick one big enough for your data;
array _catvarnames[50] $ _temporary_;
cat_val = substr(cat,1,2);
_iternum = whichc(cat_val, of _catvarnames[*]);
if _iternum=0 then do;
_iternum = whichc(' ',of _catvarnames[*]);
_catvarnames[_iternum]=cat_val;
call execute('proc datasets lib=work; modify temp; rename '||vname(_catvars[_iternum])||' = '||cat_val||'; quit;');
end;
_catvars[_iternum]= count(cat,substr(cat,1,2));
if _n_=7 then do; *this needs to actually be a test for end-of-file (so add `end=eof` to the set statement or infile), but you cannot do that in DATALINES so I hardcode the example.;
call execute('data temp; set temp; drop _catvars'||put(whichc(' ',of _catvarnames[*]),2. -l)||'-_catvars50;run;');
end;
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;

How do I stop SAS from adding an extra empty byte to every string variable when I use PROC EXPORT?

When I export a dataset to Stata format using PROC EXPORT, SAS 9.4 automatically adds an extra (empty) byte to every observation of every string variable. For example, in this data set:
data test1;
input cust_id $ 1
month 3-8
category $ 10-12
status $ 14-14
;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 ASD X
B 199912 ASD C
;
quit;
proc export data = test1
file = "test1.dta"
dbms = stata replace;
quit;
the variables cust_id, category, and status should be str1, str3, and str1 in the final Stata file, and thus take up 1 byte, 3 bytes, and 1 byte, respectively, for every observation. However, SAS automatically adds an extra empty byte to each observation, which expands their data types to str2, str4, and str2 data type in the outputted Stata file.
This is extremely problematic because that's an extra byte added to every observation of every string variable. For large datasets (I have some with ~530 million observations and numerous string variables), this can add several gigabytes to the exported file.
Once the file is loaded into Stata, the compress command in Stata can automatically remove these empty bytes and shrink the file, but for large datasets, PROC EXPORT adds so many extra bytes to the file that I don't always have enough memory to load the dataset into Stata in the first place.
Is there a way to stop SAS from padding the string variables in the first place? When I export a file with a one character string variable (for example), I want that variable stored as a one character string variable in the output file.
This is how you can do it using existing functions.
filename FT41F001 temp;
data _null_;
file FT41F001;
set test1;
put 256*' ' @;
__s=1;
do while(1);
length __name $32.;
call vnext(__name);
if missing(__name) or __name eq: '__' then leave;
substr(_FILE_,__s) = vvaluex(__name);
putlog _all_;
__s = sum(__s,vformatwx(__name));
end;
_file_ = trim(_file_);
put;
format month f6.;
run;
To avoid the use of _FILE_;
data _null_;
file FT41F001;
set test1;
__s=1;
do while(1);
length __name $32. __value $128 __w 8;
call vnext(__name);
if missing(__name) or __name eq: '__' then leave;
__value = vvaluex(__name);
__w = vformatwx(__name);
put __value $varying128. __w @;
end;
put;
format month f6.;
run;
If you are willing to accept a flat file answer, I've come up with a fairly simple way of generating one that I think has the properties you require:
data test1;
input cust_id $ 1
month 3-8
category $ 10-12
status $ 14-14
;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 SD X
B 199912 D C
;
run;
data _null_;
file "/folders/myfolders/test.txt";
set test1;
put @;
_FILE_ = cat(of _all_);
put;
run;
/* Print contents of the file to the log (for debugging only)*/
data _null_;
infile "/folders/myfolders/test.txt";
input;
put _infile_;
run;
This should work as-is, provided that the total assigned length of all variables in your dataset is less than 32767 (the limit of the cat function in the data step environment- the lower 200 character limit doesn't apply, as that's only when you use cat to create a variable that hasn't been assigned a length). Beyond that you may start to run into truncation issues. A workaround when that happens is to only cat together a limited number of variables at a time - a manual process, but much less laborious than writing out put statements based on the lengths of all the variables, and depending on your data it may never actually come up.
Alternatively, you could go down a more complex macro route, getting variable lengths from either the vlength function or dictionary.columns and using those plus the variable names to construct the required put statement(s).
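A rough sketch of the dictionary.columns variant follows. It is not a tested production macro: it assumes the data is in work.test1, writes character variables with $CHARw. at their stored length, and uses best12. for numeric variables, which is an arbitrary choice made here. The fileref flat and the macro variable putlist are names invented for the example.

```sas
/* Sketch: build a PUT statement from the variable lengths of work.test1 */
proc sql noprint;
  select case
           when type = 'char' then catx(' ', name, cats('$char', length, '.'))
           else catx(' ', name, 'best12.')   /* assumed width for numerics */
         end
    into :putlist separated by ' '
    from dictionary.columns
    where libname = 'WORK' and memname = 'TEST1'
    order by varnum;
quit;

filename flat temp;   /* hypothetical output fileref */

data _null_;
  set work.test1;
  file flat;
  /* formatted output writes each value at its fixed width, with no separators */
  put &putlist;
run;
```

Each character value is written at exactly its stored length, so no padding byte is added between fields.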