I am unsure whether this is possible (or whether it is a stupid question), as I only started looking at SAS last week. I've managed to import my .CSV file into a SAS data set using:
proc import
Specifying the guessingrows= to limit my out=.
My problem now is that the CSV files I need to import do not all have the same structure, which I noticed after writing some code that uses obsnum= to specify the start line and the number of lines to read.
So my question is whether SAS is capable of looking for a specific string or an empty variable and using it as the end observation.
My Data looks like (but number of Var_x varies for each file):
First I tried looking at slice=, but it is only useful if I know the exact places of interest, and the empty space between the groups can vary.
Is it possible to use the set statement to start at line 1 and read until encountering a blank field? Or can you point me to some function (that I couldn't find myself)?
I would like to look at each "block" separately and process.
Thank you in advance
I think you can do this in a relatively straightforward way if you are comfortable doing some processing after all the data has been inputted.
So do proc import on the whole dataset with no restriction.
Then use a data step and a counter to process through the data and output as necessary. Something like:
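data output1 output2 output3;
set imported_data;
retain counter 1; /* keep the block number across rows; without RETAIN it resets to missing */
var1lag = lag(var1); /* value of var1 on the previous row */
/* the first blank row after a data row marks the start of the next block */
if var1 = '' and var1lag ne '' then counter = counter + 1;
if counter = 1 then output output1;
else if counter = 2 then output output2;
else output output3;
run;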
data output1;
set output1;
if var1 = '' and var2 = . and var3 = . then delete;
run;
data output2;
set output2;
if var1 = '' and var2 = . and var3 = . then delete;
run;
data output3;
set output3;
if var1 = '' and var2 = . and var3 = . then delete;
run;
The above code outputs to three datasets based on the value of counter. The lag function lets us look at the previous row, so that we increment the counter only the first time we see a blank row after a row with data.
Then we go back and remove any fully blank rows from each dataset.
If you have many outputs, you could use arrays instead of the if/else statements to make this scale better.
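To illustrate how the counter idea scales to any number of blocks, here is a minimal sketch in Python rather than SAS. It is not the code above: the function name `split_blocks` and the sample rows are made up for illustration, and it assumes a row counts as blank only when every field is empty.

```python
def split_blocks(rows):
    """Group consecutive non-blank rows into blocks; blank rows act as separators."""
    blocks, current = [], []
    for row in rows:
        is_blank = all(value in ("", None) for value in row)
        if is_blank:
            # A blank row closes the current block, if one is open.
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(row)
    if current:
        # Close the last block when the data does not end with a blank row.
        blocks.append(current)
    return blocks


# Hypothetical sample: two data rows, two blank rows, one data row.
rows = [("a", 1, 2), ("b", 3, 4), ("", None, None), ("", None, None), ("c", 5, 6)]
blocks = split_blocks(rows)
print(len(blocks))
```

Because the blank rows are dropped while splitting, no second clean-up pass is needed in this version.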
This is a follow-up to my recent question on calculating an MD5 hash in SAS and Python. I'm using SAS v9.2, which has an md5 function that takes a string and returns a hash. What I'd really like, though, is a way to compute the hash for the file as a whole. Given that I have a hash for each record, is there any way to do this and have the file hash match the value obtained from, say, Python code? Taking the sashelp.shoes dataset as an example, I exported it to a CSV file and manually removed the double quotes, dollar signs and commas from the currency fields. I then computed the hash for the file as a whole using this Python code:
import hashlib

filename = "f:/test/shoes.csv"
md5_hash = hashlib.md5()
with open(filename, "rb") as f:
    # Read the file in 1 MB blocks and update the hash, stripping line terminators
    for byte_block in iter(lambda: f.read(1024*1024), b''):
        md5_hash.update(byte_block.replace(b'\r', b'').replace(b'\n', b''))
print(md5_hash.hexdigest())
And got this hash back as output:
f7f205b5b844bf57f5f51685969e0df0
If anyone can replicate this final hash value in SAS for that dataset that would be great.
PS I'm on SAS V9.2
You have two options:
Implement the MD5 algorithm in SAS. I'm aware of existing implementations for SHA and CRC but I'm not sure about MD5.
Call an external utility from SAS to calculate the md5 hash for the file. There is an example here.
My earlier note on limitations applies only when working with DS1. There is no way around the length restriction in DS1. You could try this and you will get an error:
data test;
length x $30000;
x = repeat('-', 30000);
run;
data _null_;
set test;
format m $hex32.;
m = md5(catx(',', x, x));
put m=;
run;
But Robert Pendridge is correct to point out that DS2 can solve this issue.
%let reclen = 201; /* Length of each record */
%let records = 2000; /* Number of records */
%let totlen = %eval(&reclen * &records);
proc ds2;
data _null_;
retain m;
dcl char(&totlen) m;
method run();
dcl char(200) c;
set shoes;
c = catx(',',&varstr2);
m = strip(m)|| strip(c);
end;
method term();
dcl char(32) hh;
hh = put(md5(m), $hex32.);
put hh=;
end;
enddata;
run;
quit;
This is essentially doing what the Python code is doing. The update merely concatenates the strings and applies the hash. You may have to tighten this up a little bit to remove any extraneous spaces etc., but should work.
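The claim that repeated updates amount to hashing the concatenation can be checked on the Python side. The sketch below uses two made-up records (not the real shoes data) and the hypothetical helper names `md5_incremental` and `md5_concatenated`.

```python
import hashlib


def md5_incremental(records):
    """Feed each record to the hash object in turn, as the file-reading loop does."""
    h = hashlib.md5()
    for rec in records:
        h.update(rec.encode("utf-8"))
    return h.hexdigest()


def md5_concatenated(records):
    """Hash one long string built by joining the records with no separator."""
    return hashlib.md5("".join(records).encode("utf-8")).hexdigest()


# Made-up sample records, for illustration only.
records = ["Africa,Boot,Addis Ababa,12", "Africa,Sandal,Cairo,4"]
print(md5_incremental(records) == md5_concatenated(records))
```

Any stray padding or trailing blanks in the concatenated string will change the digest, which is why the spaces need to be tightened up on the SAS side.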
Unfortunately you cannot do this in DS1. The reason is that the maximum length SAS allows for a character variable is 32,767 bytes. You could split the data across multiple variables, but when you try to concatenate them (even directly when invoking the md5 function), the result will be truncated. Your best bet is to write the output to an external text file (as shown below, based on your previous example) and generate an md5sum on it. This is just one small extra step. You could use the X command to do that from within SAS itself (provided you are configured to do so).
filename ff "contents.txt" TERMSTR=CR;
data _null_;
set shoes end = lastrec;
newvar2 = catx(',',&varstr2);
file ff;
put newvar2;
run;
I am facing this issue with sas data step. My requirement is to get a list of variables such as
total_jun2018 = sum(jun2018, dep_jun2018);
total_jul2018 = sum(jul2018, dep_jul2018);
Data final4;
set final3;
by hh_no;
do i=0 to &tot_bal_mnth.;
bal_mnth = put(intnx('month',"&min_Completed_dt."d, i-1), monyy7.);
call symputx('bal_mnth', bal_mnth);
&bal_mnth._total=sum(&bal_mnth., Dep_&bal_mnth.);
output;
end;
But I get an error saying that the macro variable bal_mnth is not resolved. It did run successfully once, but I want the output to be produced for every iteration in sequence, and it only produced the output for the last loop: when i=6 it generates only Total_DEC2018=sum(DEC2018, DEP_DEC2018);
Any help will be appreciated!
Thanks,
Ajay
This is a common issue when learning SAS Macro. The problem is that the macro processor needs to resolve &bal_mnth to a value when the data step is first submitted for execution, but the CALL SYMPUT doesn't execute until the data step is actually executed, so at the time you submit the code, there is no value available for &bal_mnth.
In this case you don't need bal_mnth to be created as a variable in the data set, so you could replace the line that starts bal_mnth = put(intnx(...)) with a %let bal_mnth = ... statement. The %let executes while the data step is being submitted, so its value will be available when you need it.
My proposed %let statement will need to wrap the functions in at least one SYSFUNC call, which is left as an exercise for the reader :-)
It looks like you want to generate a series of assignment statements like:
total_jun2018 = sum(jun2018, dep_jun2018);
total_jul2018 = sum(jul2018, dep_jul2018);
...
total_jan2019 = sum(jan2019, dep_jan2019);
This is known as wallpaper code.
If your variable names were easier, such as dep1 to dep18, then it would be easy to use arrays to process the data. With your current naming convention, generating the array statements is not much different from generating a series of assignment statements.
You can create a macro so that you could use a %DO loop to generate your wallpaper code.
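%macro make_totals; /* illustrative macro name; %DO loops only work inside a macro definition */
%local i bal_mnth;
%do i=0 %to &tot_bal_mnth.;
%let bal_mnth = %sysfunc(intnx(month,"&min_Completed_dt."d, &i-1), monyy7.);
total_&bal_mnth = sum(&bal_mnth , Dep_&bal_mnth );
%end;
%mend make_totals;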
Or you could just generate the code to a file with a data step.
%let tot_bal_mnth = 7;
%let min_Completed_dt=01JUN2018;
filename code temp;
data _null_;
file code;
length bal_mnth $7 ;
do i=0 to &tot_bal_mnth.;
bal_mnth = put(intnx('month',"&min_Completed_dt."d, i-1), monyy7.);
put 'total_' bal_mnth $7. ' = sum(' bal_mnth $7. ', Dep_' bal_mnth $7. ');';
end;
run;
So the generated file of code looks like this:
total_MAY2018 = sum(MAY2018, Dep_MAY2018);
total_JUN2018 = sum(JUN2018, Dep_JUN2018);
total_JUL2018 = sum(JUL2018, Dep_JUL2018);
total_AUG2018 = sum(AUG2018, Dep_AUG2018);
total_SEP2018 = sum(SEP2018, Dep_SEP2018);
total_OCT2018 = sum(OCT2018, Dep_OCT2018);
total_NOV2018 = sum(NOV2018, Dep_NOV2018);
total_DEC2018 = sum(DEC2018, Dep_DEC2018);
You can then use %include to run it in your data step.
data final4;
set final3;
by hh_no;
%include code / source2 ;
run;
I would like to offer another point of view: the difficulty you are having here results from the use of a wide data shape, with lots of columns.
Rather than working with your data in this shape, you could first transpose from wide to long, so that instead of having lots of total_xxx columns you just have 3: total, total_dep and date, with one row per month. Once it's in this format, it will be much easier to work with, potentially allowing you to avoid resorting to macros and wallpaper code.
Suggested reading:
Transpose wide to long with dynamic variables
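As a rough illustration of the wide-to-long idea (in Python rather than SAS, and not taken from the linked reading), the sketch below turns one wide row into one row per month. The column names `total`, `total_dep` and `date` follow the answer above; the helper names, the `dep_` prefix handling and the sample values are assumptions made for the example.

```python
def wide_to_long(row, id_col="hh_no", dep_prefix="dep_"):
    """Turn one wide row (a dict) into one long row per month column."""
    long_rows = []
    for name, value in row.items():
        if name == id_col or name.startswith(dep_prefix):
            continue  # skip the key and the dep_ columns; they are looked up below
        long_rows.append({
            id_col: row[id_col],
            "date": name,
            "total": value,
            "total_dep": row.get(dep_prefix + name),
        })
    return long_rows


def add_nonmissing(*values):
    """Add the values that are present, skipping None."""
    return sum(v for v in values if v is not None)


# Hypothetical wide row with two months of data.
wide = {"hh_no": 1, "jun2018": 10, "dep_jun2018": 5, "jul2018": 20, "dep_jul2018": None}
for r in wide_to_long(wide):
    r["grand_total"] = add_nonmissing(r["total"], r["total_dep"])
    print(r)
```

Once the data is long, the monthly total is a single expression applied to every row, so no per-month variable names have to be generated.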
First I created this table:
data rmlib.tableXML;
input XMLCol1 $ 1-10 XMLCol2 $ 11-20 XMLCol3 $ 21-30 XMLCol4 $ 31-40 XMLCol5 $ 41-50 XMLCol6 $ 51-60;
datalines;
| AAAAA A||AABAAAAA|| BAAAAA|| AAAAAA||AAAAAAA ||AAAA |
;
run;
I want to clean, concatenate and export it. I have written the following code:
data rmlib.tableXML_LARGO;
file CleanXML lrecl=90000;
set rmlib.tableXML;
array XMLCol{6} ;
array bits{6};
array sqlvars{6};
do i = 1 to 6;
*bits{i}=%largo(XMLCol{i})-2;
%let bits =input(%largo(XMLCol{i})-2,comma16.5);
sqlvars{i} = substr(XMLCol{i},2,&bits.);
put sqlvars{i} &char10.. @;
end;
run;
The macro largo counts how many characters I have:
%macro largo(num);
length(put(&num.,32500.))
%mend;
What I need is, instead of char10, for this number (10) to be the length of each string, so something like
put sqlvars{i} &char&bits.. @;
I don't know if it is possible, but I can't get it to work.
I would like to see something like
AAAAA AAABAAAAA BAAAAA AAAAAAAAAAAAA AAAA
It is important to me to keep the spaces (this is only an example taken from an XML extract). In addition, I will replace (for example) "B" with "XPM", so the size will change after cleaning the text, which is why I need the length in the char format to be flexible.
Thank you for your time
Julen
I'm still not quite sure what you want to achieve, but if you want to combine the text from multiple variables into one variable, then you could do something along these lines:
proc sql;
select name into :names separated by '||'
from dictionary.columns
where 1=1
and upcase(libname)='YOURLIBNAME'
and upcase(memname)='YOURTABLENAME';
quit;
data work.testing;
length resultvar $ 32000;
set YOURLIBNAME.YOURTABLENAME;
resultvar = &names;
resultvar2 = compress(resultvar,'|');
run;
I wasn't able to test this, but it should work if you replace YOURLIBNAME and YOURTABLENAME with your own library and table. I'm not 100% sure that compress will preserve the spaces in the text, but I think it should.
The format $VARYING. <length-variable> is a good candidate for solving this output problem.
The example below presumes a number of variables whose values are bounded by vertical bars, and writes the concatenation of the values to a file without the bounding bars.
data have;
file "c:\temp\want.txt" lrecl=9000;
length xmlcol1-xmlcol6 $100;
array values xmlcol1-xmlcol6 ;
xmlcol1 = '| A |';
xmlcol2 = '|A BB|';
xmlcol3 = '|A BB|';
xmlcol4 = '|A BBXC|';
xmlcol5 = '|DD |';
xmlcol6 = '| ZZZ |';
do index = 1 to dim(values);
value = substr(values[index], 2); * ignore presumed opening vertical bar;
value_length = length(value)-1; * length with still presumed closing vertical bar excluded;
put value $varying. value_length @; * send to file the value excluding the presumed closing vertical bar, holding the output line;
end;
run;
You have some coding errors that make it difficult to understand what you want to do.
Your %largo() macro doesn't make any sense. There is no format 32500.. The only reason it would run in your code is because you are trying to apply the format to a character variable instead of a number. So SAS will automatically convert to use the $32500. instead.
The %LET statement that you have hidden in the middle of your data step will execute BEFORE the data step runs. So it would be less confusing to move it before the data step.
So replacing the call to %largo() your macro variable BITS will contain this text.
%let bits =input(length(put(XMLCol{i},32500.))-2,comma16.5);
Which you then use inside a line of code. So that line will end up being this SAS code.
sqlvars{i} = substr(XMLCol{i},2,input(length(put(XMLCol{i},$32500.))-2,comma16.5));
Which seems to me to be a really roundabout way to do this:
sqlvars{i} = substr(XMLCol{i},2,length(XMLCol{i})-2);
Since SAS stores character variables as fixed length, it will pad the value stored. So what you need to do is to remember the length so that you can use it later when you write out the value. So perhaps you should just create another array of numeric variables where you can store the lengths.
sqllen{i} = length(XMLCol{i})-2;
sqlvars{i} = substr(XMLCol{i},2,sqllen{i});
I'm running a process that lists jobs I want to check the modification date on. I list the jobs in a dataset and then pass these to macro variables with a number.
e.g.
Data List_Prep;
Format Folder
Code $100.;
Folder = 'C:\FilePath\Job ABC'; Code = '01 Job Name.sas'; Output;
Folder = 'C:\FilePath\Job X&Y'; Code = '01 Another Job.sas'; Output;
Run;
%Macro List_Check();
Data List;
Set List_Prep;
Job + 1;
Call Symput (Cats("Folder", Job), Strip(Folder));
Call Symput (Cats("Code", Job), Strip(Code));
Run;
%Put Folder1 = &Folder1;
%Put Folder2 = &Folder2;
%MEnd;
%List_Check;
It prints the %Put statement just fine for folder 1, but folder 2 doesn't work right.
Folder1 = C:\FilePath\Job ABC
WARNING: Apparent symbolic reference Y not resolved.
Folder2 = C:\FilePath\Job X&Y
When I then go into a loop to check the datasets, it works again (it looks for Folder1, Code1, etc.), but I still get the warnings.
How can I stop these warnings? I've tried %Str("&") instead, but still get the issue.
The %superq() macro function is a great way to mask macro triggers that are already in a macro variable. You could either remember to quote the values when using them,
%put Folder1 = %superq(Folder1) ;
or you could adjust your process to quote them right after creating them.
data List_Prep;
length Folder Code $100;
Folder = 'C:\FilePath\Job ABC'; Code = '01 Job Name.sas'; Output;
Folder = 'C:\FilePath\Job X&Y'; Code = '01 Another Job.sas'; Output;
run;
data List;
set List_Prep;
Job + 1;
length dummy $200 ;
call symputx(cats("Folder", Job), Folder);
dummy = resolve(catx(' ','%let',cats("Folder", Job),'=%superq(',cats("Folder", Job),');'));
call symputx(cats("Code", Job), Code);
dummy = resolve(catx(' ','%let',cats("Code", Job),'=%superq(',cats("Code", Job),');'));
drop dummy;
run;
P.S. Don't use FORMAT to define variables. Use statements like LENGTH or ATTRIB that are designed for defining variables. FORMAT is for attaching formats to variables, not for defining them. The only reason that using FORMAT worked is that it had the side effect of SAS defining each variable's type and length to match the format that you attached to it, because it was the first place you referenced the variable in the data step.
You can prevent SAS from trying to resolve the ampersand in the value by using the %superq function
%put Folder2 = %superq(Folder2);
In a basic data step I'm creating a new variable and I need to filter the dataset based on this new variable.
data want;
set have;
newVariable = 'aaa';
*lots of computations that change newVariable ;
*if xxx then newVariable = 'bbb';
*if yyy AND not zzz then newVariable = 'ccc';
*etc.;
where newVariable ne 'aaa';
run;
ERROR: Variable newVariable is not on file WORK.have.
I usually do this in 2 steps, but I'm wondering if there is a better way.
(Of course you could always write a complex where statement based on variables present in WORK.have. But in this case the computation of newVariable is too complex, and it is more efficient to do the filter in a second data step.)
I couldn't find any info on this. I apologize for the dumb question if the answer is in the documentation and I didn't find it. I'll remove the question if needed.
Thanks!
Use a subsetting if statement:
if newVariable ne 'aaa';
In general, if <condition>; is equivalent to if not(<condition>) then delete;. The delete statement tells SAS to abandon this iteration of the data step and go back to the start for the next iteration. Unless you have used an explicit output statement before your subsetting if statement, this will prevent a row from being output.