SAS format char - sas

First I created this table:
data rmlib.tableXML;
input XMLCol1 $ 1-10 XMLCol2 $ 11-20 XMLCol3 $ 21-30 XMLCol4 $ 31-40 XMLCol5 $ 41-50 XMLCol6 $ 51-60;
datalines;
| AAAAA A||AABAAAAA|| BAAAAA|| AAAAAA||AAAAAAA ||AAAA |
;
run;
I want to clean the values, concatenate them and export the result. I have written the following code:
data rmlib.tableXML_LARGO;
file CleanXML lrecl=90000;
set rmlib.tableXML;
array XMLCol{6} ;
array bits{6};
array sqlvars{6};
do i = 1 to 6;
*bits{i}=%largo(XMLCol{i})-2;
%let bits =input(%largo(XMLCol{i})-2,comma16.5);
sqlvars{i} = substr(XMLCol{i},2,&bits.);
put sqlvars{i} &char10.. #;
end;
run;
The macro largo counts how many characters I have:
%macro largo(num);
length(put(&num.,32500.))
%mend;
What I need, instead of a fixed char10, is for that number (10) to be the length of each string, so that I have something like
put sqlvars{i} &char&bits.. #;
I don't know if it is possible, but I can't get it to work.
I would like to see something like
AAAAA AAABAAAAA BAAAAA AAAAAAAAAAAAA AAAA
It is important to me to keep the spaces (this is only an example taken from an XML extract). In addition, I will replace (for example) "B" with "XPM", so the size will change after cleaning the text. That is why I need the char width to be flexible.
Thank you for your time
Julen

I'm still not quite sure what you want to achieve, but if you want to combine the text from multiple variables into one variable, then you could do something along these lines:
proc sql;
select name into :names separated by '||'
from dictionary.columns
where 1=1
and upcase(libname)='YOURLIBNAME'
and upcase(memname)='YOURTABLENAME';
quit;
data work.testing;
length resultvar $ 32000;
set YOURLIBNAME.YOURTABLENAME;
resultvar = &names;
resultvar2 = compress(resultvar,'|');
run;
I wasn't able to test this, but it should work if you replace YOURLIBNAME and YOURTABLENAME with your own library and table. I'm not 100% sure that compress will preserve the spaces in the text, but I think it should.

The format $VARYING. <length-variable> is a good candidate for solving this output problem.
This assumes you have a number of variables whose values are bounded by vertical bars, and that you want to write the concatenation of the values to a file without the bounding bars.
data have;
file "c:\temp\want.txt" lrecl=9000;
length xmlcol1-xmlcol6 $100;
array values xmlcol1-xmlcol6 ;
xmlcol1 = '| A |';
xmlcol2 = '|A BB|';
xmlcol3 = '|A BB|';
xmlcol4 = '|A BBXC|';
xmlcol5 = '|DD |';
xmlcol6 = '| ZZZ |';
do index = 1 to dim(values);
value = substr(values[index], 2); * ignore presumed opening vertical bar;
value_length = length(value)-1; * length with still presumed closing vertical bar excluded;
put value $varying. value_length @; * send to file the value excluding the presumed closing vertical bar, trailing @ holds the output line so the next value is appended to it;
end;
run;

You have some coding errors that make it difficult to understand what you want to do.
Your %largo() macro doesn't make any sense. There is no format 32500.. The only reason it would run in your code is because you are trying to apply the format to a character variable instead of a number. So SAS will automatically convert to use the $32500. instead.
The %LET statement that you have hidden in the middle of your data step will execute BEFORE the data step runs. So it would be less confusing to move it before the data step.
So once the call to %largo() has been replaced by its text, your macro variable BITS will contain this text.
%let bits =input(length(put(XMLCol{i},32500.))-2,comma16.5);
Which you then use inside a line of code. So that line will end up being this SAS code.
sqlvars{i} = substr(XMLCol{i},2,input(length(put(XMLCol{i},$32500.))-2,comma16.5));
Which seems to me to be a really roundabout way to do this:
sqlvars{i} = substr(XMLCol{i},2,length(XMLCol{i})-2);
Since SAS stores character variables as fixed length, it will pad the value stored. So what you need to do is to remember the length so that you can use it later when you write out the value. So perhaps you should just create another array of numeric variables where you can store the lengths.
sqllen{i} = length(XMLCol{i})-2;
sqlvars{i} = substr(XMLCol{i},2,sqllen{i});
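A minimal sketch of how the stored lengths could then be used when writing the values out, combining this with the $VARYING. format from the other answer. The three sample values and the use of the log as the output destination are assumptions for illustration. As far as I recall the documentation, the length variable for $VARYING. cannot be an array reference, so the sketch copies it into a scalar first.

```sas
data _null_;
  /* Assumption: sample values standing in for rmlib.tableXML */
  length XMLCol1-XMLCol3 $10;
  XMLCol1 = '| AAAAA A|';
  XMLCol2 = '|AABAAAAA|';
  XMLCol3 = '|  BAAAAA|';
  array XMLCol{3};
  array sqlvars{3} $10;
  array sqllen{3};

  file log;   /* writes to the log; swap in a real fileref to export */
  do i = 1 to dim(XMLCol);
    sqllen{i}  = length(XMLCol{i}) - 2;           /* width between the bars */
    sqlvars{i} = substr(XMLCol{i}, 2, sqllen{i}); /* value without the bars */
    len = sqllen{i};                              /* scalar copy for $VARYING. */
    put sqlvars{i} $varying10. len @;             /* trailing @ holds the line */
  end;
  put;        /* release the held line */
run;
```

Because each value is written with its own remembered width, leading and embedded blanks survive, and the width can change if the cleaned text grows or shrinks.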

Related

Remove single quotes in list of values in macro variable

I have a project with multiple programs. Each program has a proc SQL statement which will use the same list of values for a condition in the WHERE clause; however, the column type of one database table needed is a character type while the column type of the other is numeric.
So I have a list of "Client ID" values I'd like to put into a macro variable as these IDs can change, and I would like to change them once in the variable instead of in multiple programs.
For example, I have this macro variable set up like so and it works in the proc SQL which queries the character column:
%let CLNT_ID_STR = ('179966', '200829', '201104', '211828', '264138');
Proc SQL part:
...IN &CLNT_ID_STR.
I would like to create another macro variable, say CLNT_ID_NUM, which takes the first variable (CLNT_ID_STR) but removes the quotes.
Desired output: (179966, 200829, 201104, 211828, 264138)
Proc SQL part: ...IN &CLNT_ID_NUM.
I've tried using the sysfunc, dequote and translate functions but have not figured it out.
TRANSLATE doesn't seem to want to allow a null string as the replacement.
Below uses TRANSTRN, which has no problem translating single quote into null:
%let CLNT_ID_STR = ('179966', '200829', '201104', '211828', '264138');
%let want=%sysfunc(transtrn(&clnt_id_str,%str(%'),%str())) ;
%put &want ;
(179966, 200829, 201104, 211828, 264138)
It uses the macro quoting function %str() to mask the meaning of a single quote.
Three other ways to remove single quotes are COMPRESS, TRANSLATE and PRXCHANGE
%let CLNT_ID_STR = ('179966', '200829', '201104', '211828', '264138');
%let id_list_1 = %sysfunc(compress (&CLNT_ID_STR, %str(%')));
%let id_list_2 = %sysfunc(translate(&CLNT_ID_STR, %str( ), %str(%')));
%let id_list_3 = %sysfunc(prxchange(%str(s/%'//), -1, &CLNT_ID_STR));
%put &=id_list_1;
%put &=id_list_2;
%put &=id_list_3;
----- LOG -----
ID_LIST_1=(179966, 200829, 201104, 211828, 264138)
ID_LIST_2=( 179966 , 200829 , 201104 , 211828 , 264138 )
ID_LIST_3=(179966, 200829, 201104, 211828, 264138)
It really doesn't matter that TRANSLATE replaces the ' with a single blank, because the list is interpreted in a numeric context.
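For context, a small sketch of how the two macro variables might be used side by side in PROC SQL. The tables work.char_clients and work.num_clients and their contents are hypothetical, created here only so the example is self-contained.

```sas
/* Hypothetical table with a character ID column */
data work.char_clients;
  input clnt_id $;
  datalines;
179966
123456
;
run;

/* Hypothetical table with a numeric ID column */
data work.num_clients;
  input clnt_id;
  datalines;
200829
123456
;
run;

%let CLNT_ID_STR = ('179966', '200829', '201104', '211828', '264138');
/* Derive the unquoted list once, from the quoted list */
%let CLNT_ID_NUM = %sysfunc(compress(&CLNT_ID_STR, %str(%')));

proc sql;
  select * from work.char_clients where clnt_id in &CLNT_ID_STR;
  select * from work.num_clients  where clnt_id in &CLNT_ID_NUM;
quit;
```

Only CLNT_ID_STR needs to be maintained by hand; the numeric list follows from it.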

In SAS compute an md5 hash for whole file given you have an md5 hash for each record

This is a follow-up to my recent question on calculating an md5 hash in SAS and Python. I'm using SAS v9.2, which has an md5 hash function that takes a string and returns a hash. What I'd really like, though, is a way to compute the hash for the file as a whole. Given that I have a hash for each record, is there any way to do this and have the file hash match the value obtained with, say, Python code? Taking the sashelp.shoes dataset as an example, I exported it to a CSV file and manually removed the double quotes, dollar signs and commas from the currency fields. I then computed the hash for the file as a whole using this Python code:
import hashlib

filename = "f:/test/shoes.csv"
md5_hash = hashlib.md5()
with open(filename, "rb") as f:
    # Read the file in 1 MiB blocks and update the hash, dropping line endings
    for byte_block in iter(lambda: f.read(1024*1024), b''):
        md5_hash.update(byte_block.replace(b'\r', b'').replace(b'\n', b''))
print(md5_hash.hexdigest())
And got this hash back as output:
f7f205b5b844bf57f5f51685969e0df0
If anyone can replicate this final hash value in SAS for that dataset that would be great.
PS I'm on SAS V9.2
You have two options:
Implement the MD5 algorithm in SAS. I'm aware of existing implementations for SHA and CRC but I'm not sure about MD5.
Call an external utility from SAS to calculate the md5 hash for the file. There is an example here.
My earlier note on limitations applies only when working with DS1. There is no way around the length restriction in DS1. You could try this and you will get an error:
data test;
length x $30000;
x = repeat('-', 30000);
run;
data _null_;
set test;
format m $hex32.;
m = md5(catx(',', x, x));
put m=;
run;
But Robert Pendridge is correct to point out that DS2 can solve this issue.
%let reclen = 201; /* Length of each record */
%let records = 2000; /* Number of records */
%let totlen = %eval(&reclen * &records);
proc ds2;
data _null_;
retain m;
dcl char(&totlen) m;
method run();
dcl char(200) c;
set shoes;
c = catx(',',&varstr2);
m = strip(m)|| strip(c);
end;
method term();
dcl char(32) hh;
hh = put(md5(m), $hex32.);
put hh=;
end;
enddata;
run;
quit;
This is essentially doing what the Python code is doing. The update merely concatenates the strings and applies the hash. You may have to tighten this up a little bit to remove any extraneous spaces etc., but should work.
Unfortunately you cannot do this in DS1. The reason is that the maximum length SAS allows for a character variable is 32,767 bytes. You could spread the text across multiple variables, but when you try to concatenate them (even directly when invoking the md5 function), the result will still be truncated. Your best bet is to write the output to an external text file (as shown below, based on your previous example) and generate the md5sum on it. This is just one small extra step. You could use the X command to do that from within SAS itself (provided you are configured to do so).
filename ff "contents.txt" TERMSTR=CR;
data _null_;
set shoes end = lastrec;
newvar2 = catx(',',&varstr2);
file ff;
put newvar2;
run;

Convert Number with Format into a String

How do you convert a number or currency variable into a character string that keeps the format as part of the string?
For instance, the below code has a character variable, MSRP_to_text, and currency variable, MSRP. When I set MSRP_to_text equal to MSRP, it takes the unformatted number and converts it to a string, so the dollar sign and the comma are gone.
DATA want;
SET SASHELP.CARS(KEEP=MSRP);
ATTRIB MSRP_to_text FORMAT=$8.;
MSRP_to_text = MSRP;
RUN;
In other words, the code is currently converting $36,945 -> "36945", but what I really want is $36,945 -> "$36,945".
Is there a way to keep the dollar sign and comma in the string?
VVALUE function will retrieve the formatted value of a variable.
MSRP_as_text = VVALUE(MSRP);
VVALUEX goes one step further for the case of the variable name being dynamic; such as being stored in a different variable, or is computed from some name patterning algorithm.
name = 'MSRP';
formatted_value = VVALUEX(name);
Instead of the ATTRIB statement, use the PUT function to convert the number to character; the resulting text keeps the formatting. Since the original format of MSRP is DOLLAR8., using the same format in the PUT function serves the purpose.
DATA want;
SET SASHELP.CARS(KEEP=MSRP);
MSRP_to_text = put(MSRP, DOLLAR8.);
RUN;
proc contents data=want; run;
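A short sketch putting the two approaches next to each other. The variable names MSRP_vvalue and MSRP_put are made up for illustration. PUT with a numeric format right-aligns its result, so the -L alignment specification is used here to left-align the text within the 8 characters.

```sas
data want;
  set sashelp.cars(keep=MSRP);
  length MSRP_vvalue MSRP_put $8;
  MSRP_vvalue = vvalue(MSRP);            /* uses whatever format is attached to MSRP */
  MSRP_put    = put(MSRP, dollar8. -L);  /* explicit format, left-aligned */
run;

proc print data=want(obs=5);
run;
```

VVALUE follows the format currently attached to the variable, while PUT makes the format explicit in the code, which may be preferable if the attached format could change.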

How to use proc format with the number of lines?

I have a table like this :
|Num | Label
-----------------------
1|1 | a thing
2|2 | another thing
3|3 | something else
4|4 | whatever
I want to replace the values of my Label column with something more generic. For example, the first two lines get "label One", the next two get "label Two", and so on:
|Num | Label
-----------------------
1|1 | label One
2|2 | label One
3|3 | label Two
4|4 | label Two
How can I do that using PROC FORMAT? I was wondering whether I can use either the line number or another column such as Num.
I need to do something like this :
proc format;
value label_f
low-2 = "label One"
3-high = "label Two"
;
run;
But I want to specify the number of the line or the value of the Num column.
You could do what you are describing using the words format. You could swap out num for _N_ in the ceil function below in order to use the observation number instead of the value of num (if they are not always equal):
data have;
length num 8 label $20;
infile datalines dlm='|';
input num label $;
datalines;
1|a thing
2|another thing
3|something else
4|whatever
5|whatever else
6|so many things
;
run;
data want;
set have;
label=catx(' ','label',propcase(put(ceil(num/2),words.)));
run;
Although this answer is probably a bit too specific to your example and it may not apply in your actual context.
Gatsby:
It sounds like you want to format NUM instead of LABEL.
Where you want to use the 'generic' representation defined by your format, simply place a FORMAT statement in the Proc being used:
PROC PRINT data=have;
format num label_f.;
RUN;
If you want both num and generic, you will need to add a new column to the data for use during processing. This can be done with a view:
data have_view / view=have_view;
set have;
num_replicate1 = num;
attrib num_replicate1 format=label_f. label='Generic';
num_replacement = put (num,label_f.);
attrib num_replacement label='Generic'; %* no format because the value is the formatted value of the original num;
run;
PROC PRINT data=have_view;
var num num_replicate1 num_replacement;
RUN;
If you want the 'generic' representation of the NUM column to be used in by-processing as a grouping variable, you have several scenarios:
You know a priori that the generic representation is by-group clustered:
    use a view and process with BY, or BY ... NOTSORTED if the clusters are not in sort order
You need to force ordering for use with by-group processing:
    use an ordered SQL view containing the replicate and process with BY
    add a replicate variable to the data set, sort by the formatted value and process with BY
A direct backmap from label to num to generic is possible only if the label is known to be unique, or you know a priori that the transformation backmap-num + num-map is unique.
Proc FORMAT also has a special value construct [format] that can be used to map different ranges of values according to different formatting rules. The other range can also map to a different format that itself has an other range that maps to yet another different format. The SAS format engine will log an error if you happen to define a recursive loop using this advanced kind of format mapping.
propaedeutics
One of my favorite Dorfman words.
A format does not replace underlying values. A format is a map from the underlying data value to a rendered representation. The map can be 1:1 or many:1. The MultiLabel Format (MLF) feature of the format system can even perform 1:many and many:many mappings in MLF-enabled procedures (which is most of them).
To replace an underlying value with its formatted version you need to use the PUT, PUTC or PUTN functions. The PUT functions always output a character value.
character ⇒ PUT ⇒ character [ FILE / PUT ]
numeric ⇒ PUT ⇒ character [ FILE / PUT ]
There is no guarantee that a value will be mapped to itself; it depends on the format.
INFORMATs are similar to FORMATs; however, the type of the target value depends on the informat type.
character ⇒ INPUT ⇒ character [ INFILE / INPUT ]
numeric ⇒ INPUT ⇒ character
character ⇒ INPUT ⇒ numeric [ INFILE / INPUT ]
numeric ⇒ INPUT ⇒ numeric
Custom formats are created with Proc FORMAT. The construction of a format is specified by either the VALUE statement, or the CNTLIN= option. CNTLIN lets you create formats directly from data and avoids really large VALUE statements that are hand-entered or code-generated (via say macro)
Data-centric 'formatting' performs the mapping through a left-join. This is prevalent in SQL data bases. Left-joins in SAS can be done through SQL, DATA Step MERGE BY and FORMAT application. 1:1 left-joins can also be done via Hash object SET POINT=
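As an illustration of the CNTLIN= route mentioned above, here is a minimal sketch that builds the label_f format from a control data set instead of a VALUE statement. The data set name work.label_cntlin and the ranges are assumptions chosen to match the question's example.

```sas
/* Control data set: one row per range of the numeric format label_f */
data work.label_cntlin;
  length fmtname $32 label $20;
  retain fmtname 'label_f' type 'N';   /* 'N' marks a numeric format */
  input start end label & $20.;        /* & lets the label contain single blanks */
  datalines;
1 2 label One
3 4 label Two
;
run;

proc format cntlin=work.label_cntlin;
run;

/* Apply the format: PUT returns the mapped text as a character value */
data _null_;
  do num = 1 to 4;
    generic = put(num, label_f.);
    put num= generic=;
  end;
run;
```

Adding more groups then only means adding rows to the control data set, which could itself be generated from data.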

Extract "dynamic" part from SAS data-set

I am unsure if this is possible (or stupid question), as I just started looking at SAS last week. I've managed to import my .CSV file to a SAS data set using the:
proc import
Specifying the guessingrows= to limit my out=.
My problem is now that my CSV files to import are not of same structure, which I noticed after writing some code using the obsnum= to specify start and x-lines to read.
So my question is whether SAS is capable of looking for a specific string or empty variable and using it as the end observation.
My Data looks like (but number of Var_x varies for each file):
First I tried looking at the slice= but is only useful if I know the exact Places of interest, as the empty Space between the Groups can vary.
Is it possible to use the set function to specify starting at line 1 and reading until a blank field is encountered? Or can you direct me to some function (that I couldn't find myself)?
I would like to look at each "block" separately and process.
Thank you in advance
I think you can do this in a relatively straightforward way if you are comfortable doing some processing after all the data has been inputted.
So do proc import on the whole dataset with no restriction.
Then use a data step and a counter to process through the data and output as necessary. Something like:
data output1 output2 output3;
set imported_data;
retain counter 1; * counter must be retained, otherwise it is reset to missing on every observation;
var1lag = lag(var1);
if var1 = '' and var1lag ne '' then counter=counter+1;
if counter = 1 then output output1;
else if counter = 2 then output output2;
else output output3;
run;
data output1;
set output1;
if var1 = '' and var2 = . and var3 = . then delete;
run;
data output2;
set output2;
if var1 = '' and var2 = . and var3 = . then delete;
run;
data output3;
set output3;
if var1 = '' and var2 = . and var3 = . then delete;
run;
The above code outputs to three datasets based on the value of counter. The lag function lets us look at the previous row to ensure it's the first time we see no data, and the counter is updated each time a new run of blank rows starts.
Then we go back and remove any fully blank rows from our datasets.
If you have many outputs, you could use some arrays instead of the if/else statements to make this more scalable.
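A different way to scale this, sketched here as an alternative rather than the array approach mentioned above, is to keep everything in one data set with a block number and process the blocks with BY. The names blocks, block and prev_var1 are made up for illustration; imported_data and var1-var3 are taken from the code above.

```sas
data blocks;
  set imported_data;
  retain block 1;                       /* block number carried across rows */
  prev_var1 = lag(var1);                /* call LAG on every row, unconditionally */
  if var1 = '' and prev_var1 ne '' then block + 1;   /* a blank run starts a new block */
  if var1 = '' and var2 = . and var3 = . then delete; /* drop fully blank rows */
  drop prev_var1;
run;

/* Each block can now be handled separately with BY-group processing */
proc print data=blocks;
  by block;
run;
```

This avoids having to know the number of blocks in advance, since no output data set names are hard-coded.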