I am extracting data from a database that has all values posted in strings, in the format +000000xx.xxx or -00000xx.xxx . I need to convert these to numeric to operate on.
data want;
set have;
numeric_var = string_var*1;
run;
works fine, but, to save compute time and resources on the final running, which will be over a much larger dataset, and in the interest of doing things properly I'd rather do that with a format or informat statement.
data want;
set have;
numeric_var = input(string_var, best8.);
run;
seems to output wrong values and to round everything to 0.
Any ideas?
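Using best8. tells SAS to read only the first 8 characters of the string. Your values are 13 characters long, so the informat never gets past the sign and the leading zeros; that's never going to work. Use a width that covers the whole string, for example best32. (32 is the maximum). Be wary of a bare best. here: as far as I know its default width is 12, which is still one character short of these strings.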
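To see the width effect in isolation, here is a minimal sketch. The sample value '+00000012.345' and the variable names too_narrow and wide are made up for illustration; they are not from your data.

```sas
data _null_;
  length string_var $13;
  string_var = '+00000012.345';            /* hypothetical sample value */
  /* width 8: only '+0000001' is read, the rest is ignored */
  too_narrow = input(string_var, best8.);
  /* width 32: the whole 13-character string is read */
  wide = input(string_var, 32.);
  put too_narrow= wide=;
run;
```

Comparing the two values in the log should make it obvious whether the width is the culprit in your own step.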
However, make sure you run some benchmarks before changing your current simple solution. SAS is already doing a character-to-numeric conversion as part of the numeric_var = string_var*1; statement, and is apparently doing it correctly; changing the code to use an informat will not automatically be any faster.
It would be cool if you benchmarked both methods and reported the results back here.
EDIT:
I did some benchmarking on this, out of curiosity. The code and log are below but TL;DR - the informat seems to be very slightly but consistently faster - 7.58 seconds vs 7.83 seconds in the run below on a 50 million observation data set. So the informat method is the way to go, but the 3% performance gain wouldn't be worth refactoring a large program, particularly if you don't have good test coverage to be sure of avoiding regressions.
483 * Set small for testing, big for benchmarking;
484 %let obs = 50000000;
485
486 * Generate test data;
487 data testdata;
488 do i = 1 to &obs;
489 numeric = round(ranuni(0)*100, 0.001);
490 char = '+' || put(numeric, z12.3-L);
491 output;
492 end;
493 run;
NOTE: The data set WORK.TESTDATA has 50000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 12.55 seconds
user cpu time 11.41 seconds
system cpu time 0.84 seconds
memory 4375.18k
OS Memory 20784.00k
Timestamp 12/10/2019 10:36:11 AM
Step Count 51 Switch Count 0
494
495 %macro charToNum(in=, method=, obs=);
496
497 * Convert back to numeric;
498 data converted;
499 set &in;
500 %if "&method" = "MULT-BY-ONE" %then %do;
501 converted = char * 1;
502 %end; %else %if "&method" = "INFORMAT" %then %do;
503 converted = input(char, 32.);
504 %end;
505 if converted ne numeric then do;
506 put "ERROR: Conversion failed: " numeric= char= converted=;
507 end;
508 run;
509
510 %mend;
511
512 %charToNum(in = testdata, method = MULT-BY-ONE, obs = &obs);
NOTE: Character values have been converted to numeric values at the places given by:
(Line):(Column).
3:20
NOTE: There were 50000000 observations read from the data set WORK.TESTDATA.
NOTE: The data set WORK.CONVERTED has 50000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 7.83 seconds
user cpu time 5.92 seconds
system cpu time 1.88 seconds
memory 14642.84k
OS Memory 31036.00k
Timestamp 12/10/2019 10:36:18 AM
Step Count 52 Switch Count 0
513 %charToNum(in = testdata, method = INFORMAT, obs = &obs);
NOTE: There were 50000000 observations read from the data set WORK.TESTDATA.
NOTE: The data set WORK.CONVERTED has 50000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 7.58 seconds
user cpu time 5.36 seconds
system cpu time 2.15 seconds
memory 14646.18k
OS Memory 31036.00k
Timestamp 12/10/2019 10:36:26 AM
Step Count 53 Switch Count 0
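If you want to keep only the numeric part of the string, use the code below.
Called this way, compress keeps the listed characters and removes everything else.
The first argument is the name of the variable. The second is optional and lists characters, in this case the sign and the decimal point. The third is the modifier string: "k" means keep (rather than remove) and "d" adds all digits to the list. Keeping only the digits would drop the sign and the decimal point and change the value, and the informat width has to cover the whole string, so use 32. rather than best8.
data want;
set have;
numeric_var = input(compress(string_var, "+-.", "kd"), 32.);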
run;
I have a numeric parameter given to my macro and would like to convert it to date, set to end of month and apply a format.
Following code works for many dates, but not for march; throws 'Literal contains unmatched quote'.
proc format;
picture mydatep
low-high = "'%0d-%0b-%0Y'" (datatype = date);
%macro test(cycle=);
%let enddate = %SYSFUNC(intnx(month, %SYSFUNC(inputn(&cycle., yymmn6.)), 0, e), mydatep.);
%put &enddate.;
%mend;
%test(cycle=201602); /* works --> 29-Feb-2016*/
%test(cycle=201603); /* works not */
%test(cycle=201604); /* works again --> 30-Apr-2016*/
%test(cycle=201402); /* works --> 28-Feb-2014*/
%test(cycle=201403); /* works not */
%test(cycle=201404); /* works again --> 30-Apr-2014*/
I have been using the code for some years now, and never had trouble with it. I am using SAS Analytics Pro 9.4
Solution: I was starting the SAS session via SAS (Unicode). Switching to SAS (Deutsch) [engl: SAS (German)], solved the issue.
I don't know why, though.
@Kenji: "Switching to SAS (Deutsch) [engl: SAS (German)], solved the issue. I don't know why, though."
The explanation is quite simple, some date formats in German and English differ in just a few cases:
German       English      equal?
----------------------------------------
01Jan2022    01Jan2022    yes
01Feb2022    01Feb2022    yes
01Mär2022    01Mar2022    NO
01Apr2022    01Apr2022    yes
...
01Okt2022    01Oct2022    NO
01Nov2022    01Nov2022    yes
01Dez2022    01Dec2022    NO
So in a German environment, it is a common observation that your code might work in most cases but not for March, May, October and December.
If the generated text contains non-ASCII characters, the string needs more bytes than it has characters. In a UTF-8 session the ä in Mär takes two bytes, so '01-Mär-2000' grows from 13 bytes to 14. Using an en dash or em dash instead of hyphens in your PICTURE text would have the same kind of effect, since those characters also require more than one byte of storage.
So if the format is applied with its default width of 13, the value is truncated, removing the closing quote and, with more multi-byte characters, possibly the last digit of the year as well.
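A quick way to check the byte-versus-character difference in your own session is sketched below. It assumes the program file itself is saved in the session encoding; the variable names are illustrative only.

```sas
data _null_;
  length month_abbr $8;
  month_abbr = 'Mär';
  bytes = length(month_abbr);   /* LENGTH counts bytes */
  chars = klength(month_abbr);  /* KLENGTH counts characters */
  put bytes= chars=;
run;
```

In a UTF-8 session I would expect bytes to come out one larger than chars, while in a single-byte session such as Latin1 the two should agree, which would fit the observation that switching away from SAS (Unicode) made the problem disappear.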
Tell PROC FORMAT that the default width for the format should be 14 and not 13.
proc format;
picture mydatep (default=14)
low-high = "'%0d-%0b-%0Y'" (datatype = date);
run;
Example:
23 proc options option=ENCODING option=LOCALE option=DATESTYLE option=DFLANG ;
24 run;
SAS (r) Proprietary Software Release 9.4 TS1M5
ENCODING=UTF-8 Specifies the default character-set encoding for the SAS session.
LOCALE=EN_US Specifies a set of attributes in a SAS session that reflect the language, local conventions, and culture for a
geographical region.
DATESTYLE=MDY Specifies the sequence of month, day, and year when ANYDTDTE, ANYDTDTM, or ANYDTTME informat data is ambiguous.
DFLANG=GERMAN Specifies the language for international date informats and formats.
NOTE: PROCEDURE OPTIONS used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
25
26 options dflang=german locale=de_DE ;
27
28 data test;
29 do month=1 to 12;
30 length string $20 ;
31 string=put(mdy(month,1,2000),mydatep.);
32 put month= string=;
33 end;
34 run;
month=1 string='01-Jan-2000'
month=2 string='01-Feb-2000'
month=3 string='01-Mär-2000
month=4 string='01-Apr-2000'
month=5 string='01-Mai-2000'
month=6 string='01-Jun-2000'
month=7 string='01-Jul-2000'
month=8 string='01-Aug-2000'
month=9 string='01-Sep-2000'
month=10 string='01-Okt-2000'
month=11 string='01-Nov-2000'
month=12 string='01-Dez-2000'
NOTE: The data set WORK.TEST has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
35 proc format;
36 picture mydatep (default=14)
37 low-high = "'%0d-%0b-%0Y'" (datatype = date);
NOTE: Format MYDATEP is already on the library WORK.FORMATS.
NOTE: Format MYDATEP has been output.
38 run;
NOTE: PROZEDUR FORMAT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
39
40 data test;
41 do month=1 to 12;
42 length string $20 ;
43 string=put(mdy(month,1,2000),mydatep.);
44 put month= string=;
45 end;
46 run;
month=1 string='01-Jan-2000'
month=2 string='01-Feb-2000'
month=3 string='01-Mär-2000'
month=4 string='01-Apr-2000'
month=5 string='01-Mai-2000'
month=6 string='01-Jun-2000'
month=7 string='01-Jul-2000'
month=8 string='01-Aug-2000'
month=9 string='01-Sep-2000'
month=10 string='01-Okt-2000'
month=11 string='01-Nov-2000'
month=12 string='01-Dez-2000'
NOTE: The data set WORK.TEST has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
I want to store an instance of a data step variable in a macro-variable using call symput, then use that macro-variable in the same data step to populate a new field, assigning it a new value every 36 records.
I tried the following code:
data a;
set a;
if MOB = 1 then do;
MOB1_accounts = accounts;
call symput('MOB1_acct', MOB1_accounts);
end;
else if MOB > 1 then MOB1_accounts = &MOB1_acct.;
run;
I have a series of repeating MOB's (1-36). I want to create a field called MOB1_Accts, set it equal to the # of accounts for that cohort where MOB = 1, and keep that value when MOB = 2, 3, 4 etc. I basically want to "drag down" the MOB 1 value every 36 records.
For some reason this macro-variable is returning "1" instead of the correct # accounts. I think it might be a char/numeric issue but unsure. I've tried every possible permutation of single quotes, double quotes, symget, etc... no luck.
Thanks for the help!
You are misusing the macro system.
The ampersand (&) introducer in source code tells SAS to resolve the following symbol and place its value into the code submission stream. Thus, the resolved &MOB1_acct. cannot be changed in the running DATA step. In other words, a running step cannot change its source code. The resolved macro variable will be the same for all implicit iterations of the step because its value became part of the source code of the step.
You can use SYMPUT() and SYMGET() functions to move strings out of and into a DATA Step. But that is still the wrong approach for your problem.
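The timing difference can be seen in a small sketch like the following. The starting value 1 and the value 42 are arbitrary assumptions chosen only to make the two results distinguishable.

```sas
%let mob1_acct = 1;                  /* value that exists before the step is compiled */

data _null_;
  call symputx('mob1_acct', 42);     /* changes the macro variable while the step runs */
  compiled = &mob1_acct.;            /* was already replaced by the text 1 at compile time */
  runtime  = symgetn('mob1_acct');   /* looked up while the step runs, so it sees 42 */
  put compiled= runtime=;
run;
```

The compiled variable keeps the old value because the macro reference was resolved before the step started executing, which is the same effect as the constant value you are seeing.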
The most straightforward technique would combine:
use of a retained variable
mod (_n_, 36) computation to determine every 36th row. (_n_ is a proxy for row number in a simple step with a single SET.)
Example:
data a;
set a;
retain mob1_accounts;
* every 36 rows change the value, otherwise the value is retained;
if mod(_n_,36) = 1 then mob1_accounts = accounts;
run;
You didn't show any data, so the actual program statements you need might be slightly different.
Contrasting SYMPUT/SYMGET with RETAIN
As stated, SYMPUT/SYMGET is a possible way to retain values by storing them in the macro symbol table. There is a penalty though. Each SYM* call carries function-call overhead, plus whatever goes on behind the scenes to store or retrieve a symbol value, and possibly additional conversions between character and numeric.
Example:
1,000,000 rows read. DATA _null_ steps to avoid writing overhead as part of contrast.
data have;
do rownum = 1 to 1e6;
mob + 1;
accounts = sum(accounts, rand('integer', 1,50) - 10);
if mob > 36 then mob = 1;
output;
end;
run;
data _null_;
set have;
if mob = 1 then call symput ('mob1_accounts', cats(accounts));
mob1_accounts = symgetn('mob1_accounts');
run;
data _null_;
set have;
retain mob1_accounts;
if mob = 1 then mob1_accounts = accounts;
run;
On my system the log shows:
142 data _null_;
143 set have;
144
145 if mob = 1 then call symput ('mob1_accounts', cats(accounts));
146
147 mob1_accounts = symgetn('mob1_accounts');
148 run;
NOTE: There were 1000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 0.34 seconds
cpu time 0.34 seconds
149
150 data _null_;
151 set have;
152 retain mob1_accounts;
153
154 if mob = 1 then mob1_accounts = accounts;
155 run;
NOTE: There were 1000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds
Or, in summary (seconds):
way            real   cpu
-------------  -----  -----
SYMPUT/SYMGET  0.34   0.34
RETAIN         0.04   0.03
I have come across a scenario where DATA _NULL_ seems to prevent a data step from taking effect.
Can someone please have a look and explain why this is happening?
Ran this in SAS EG :
/*create a TEMP1 table*/
data TEMP1;
input Name $ age score;
cards;
A 10 100
B . 20
C 20 .
D . .
;
run;
/* step to overwrite WORK.TEMP1 dots with 0 */
DATa _NULL_;
SET TEMP1;
file print;
array a1 _numeric_;
do over a1;
if a1=. then a1=0;
end;
run;
The expectation is that all numeric fields holding a dot (missing) are overwritten with 0.
That happens only when DATA _NULL_ is replaced with DATA TEMP1.
A bit of a conundrum.
Here are some comments that may help. Basically, as others have indicated, _NULL_ does not create an output data set, so your assumption there is incorrect.
I suspect you're also using FILE incorrectly: FILE PRINT only sets where PUT statements write to, and the step has no PUT statements. I don't know what you were trying to do with that statement.
You are also using a DO OVER loop, which has been deprecated since SAS V7, so you shouldn't use it in production code.
DATa _NULL_;*_Null_ means no output data set is created;
SET TEMP1; *input data set means temp1;
file print; *directs PUT output to the procedure output (listing), but there is no PUT statement so no idea what this means to you;
array a1 _numeric_; *creates an array of all numeric values;
do over a1; *Do over is deprecated as of 20 years ago, it works but I don't recommend using it in production code;
if a1=. then a1=0; *replaces missing with 0;
end;*ends loop;
*no put statements so nothing is written to the print output;
run;
You could fix it by doing this, but I don't recommend using the same data set name. It makes it hard to debug your code later on.
/* step to overwrite WORK.TEMP1 dots with 0 */
DATa TEMP1;
SET TEMP1;
array a1 _numeric_;
do over a1;
if a1=. then a1=0;
end;
run;
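For what it's worth, the same fix without DO OVER and without overwriting the input could look something like this. The output name temp2 and the array name nums are just placeholders I picked.

```sas
/* write a cleaned copy of TEMP1 to a new data set */
data temp2;
  set temp1;
  array nums{*} _numeric_;            /* all numeric variables read from TEMP1 */
  do i = 1 to dim(nums);
    if missing(nums{i}) then nums{i} = 0;
  end;
  drop i;                             /* loop counter is not wanted in the output */
run;
```

Because i is first referenced after the ARRAY statement, it is not part of the _numeric_ list, so the loop does not touch its own counter.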
Here are two different ways to replace values in an existing table
overwriting the entire table with a new copy of itself
data name; set name; …
modifying values in-place within existing table
data name; modify name; …
Example
1 data class;
2 set sashelp.class;
3 run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
cpu time 0.00 seconds
4
5 data class; /* output data set named is same as input data set */
6 set class;
7 age = age * 2;
8 run;
NOTE: There were 19 observations read from the data set WORK.CLASS.
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.03 seconds
9
10 data class; /* output data set name */
11 modify class; /* is same as modify name, values updated in place */
12 age = age / 2;
13 run; /* observations are rewritten (see log) */
NOTE: There were 19 observations read from the data set WORK.CLASS.
NOTE: The data set WORK.CLASS has been updated. There were 19 observations rewritten, 0
observations added and 0 observations deleted.
NOTE: DATA statement used (Total process time):
real time 0.05 seconds
cpu time 0.00 seconds
A third way would be to use SQL UPDATE statement with sets based on coalesce, however, that is not amenable to array processing.
Proc SQL;
update mydata set
a1 = coalesce (a1,0)
, a2 = coalesce (a2,0)
…
;
quit;
When you use data _NULL_ instead of data temp1, you only read from temp1 and your changes are written nowhere. That is not a conundrum; it is basic SAS functionality. Only use _NULL_ when you don't need data to be written.
I am trying to make character informat from the range values given in a dataset.
Dataset : Grade
Start End Label Fmtname Type
0 20 A $grad I
21 40 B $grad I
41 60 C $grad I
61 80 D $grad I
81 100 E $grad I
And here is the code i wrote to create the informat
proc format cntlin = grade;
run;
And now the code to create a temp dataset using the new informat
data temp;
input grade : $grad. @@ ;
datalines;
21 30 0 45 10
;
The output i wanted was a dataset Temp with values :
Grade
A
B
A
..
Whereas the dataset Temp has values :
Grade
21
30
0
...
SAS Log Entry :
1146 proc format cntlin = grade;
NOTE: Informat $GRAD has been output.
1147 run;
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
NOTE: There were 5 observations read from the data set WORK.GRADE.
1148
1149
1150 data temp;
1151 input grade : $grad. @@ ;
1152
1153 datalines;
NOTE: SAS went to a new line when INPUT statement reached past the end of a
line.
NOTE: The data set WORK.TEMP has 5 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
I am not able to understand why the informat is not working. Can anyone please
explain where I am making my mistake?
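Informats convert character input to either character or numeric values, so the START and END values of an informat are treated as character strings. That means you can't use START/END ranges the way you are doing, since that kind of range only works with numbers.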
See the following:
proc format;
invalue $grade
'0'-'20'="A"
'21'-'40'="B"
'41'-'60'="C"
'61'-'80'="D"
'81'-'100'="E";
quit;
proc format;
invalue $grade
'21'='A';
quit;
The latter works, the former gives you an error. So, you could write a dataset with all 101 values (each on a line with START), or just write a format and do it in a second step (read in as a number and then PUT to the format).
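The second option (read the value as a number, then PUT it through a format) might look roughly like this. The format name gradefmt and the variable name score are names I made up; the ranges are taken from your Grade dataset.

```sas
proc format;
  value gradefmt                      /* numeric format, so numeric ranges are fine */
    0  - 20  = 'A'
    21 - 40  = 'B'
    41 - 60  = 'C'
    61 - 80  = 'D'
    81 - 100 = 'E';
run;

data temp;
  input score @@;                     /* read the raw value as a number */
  length grade $1;
  grade = put(score, gradefmt.);      /* map the number to its label */
  datalines;
21 30 0 45 10
;
run;
```

If the ranges need to stay in a dataset, the same CNTLIN approach should work with a numeric format (TYPE of N and a FMTNAME without the dollar sign) instead of a character informat.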
To my disappointment, the following code, which sums up 'value' by week from 'master' for weeks which appear in 'transaction' does not work -
data master;
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input change_week ;
datalines;
1
3
;
run;
data _null_;
set transaction;
do until(done);
set master end=done;
where week=change_week;
sum = sum(value, sum);
end;
file print;
put week= sum=;
run;
SAS complains, rightly, because it doesn't see 'change_week' in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master set (of course, suitably indexed), given a transaction dataset... Does any one know?
I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input week ;
datalines;
1
3
4
;
run;
data _null_;
set transaction;
file print;
do until(done);
set master key = week end=done;
/*Prevent implicit retain from previous row if the key isn't found,
or we've read past the last record for the current key*/
if _IORC_ ne 0 then do;
_ERROR_ = 0;
call missing(value);
end;
else sum = sum(value, sum);
end;
put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.
You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of "master" for every line of "transaction".
(The reason you got the error was because you used where instead of if. In SAS, the sub-setting where in a data step is only aware of columns that already exist in the data set it is sub-setting. Both options are kept because where is faster when it's usable.)
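Just to illustrate the where/if point, here is one way the original idea could be made to run with a subsetting if. This is only a sketch using your master and transaction datasets: I have swapped the end= loop for point= access so that master can be re-read for each transaction row, and i and n_master are helper names I introduced.

```sas
data _null_;
  set transaction;                         /* step ends when transaction is exhausted */
  do i = 1 to n_master;
    set master point=i nobs=n_master;      /* direct access, so master can be re-read */
    /* IF sees change_week because it tests values already in the program data vector */
    if week = change_week then sum = sum(value, sum);
  end;
  put change_week= sum=;
run;
```

It still reads all of master once per transaction row, so it only demonstrates the mechanics and is not something to run against a large table.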
An alternative solution would be use proc sql. Hopefully this example is self-explanatory:
proc sql;
select
a.change_week,
sum(b.value) as value
from
transaction as a,
master as b
where a.change_week = b.week
group by change_week;
quit;
I don't recommend the solution below (I would prefer @Jeff's SQL solution, or even a hash). But just for playing with data step logic, I think the approach below would work, if you trust that every key in transaction exists in master. It relies on the fact that both datasets are sorted, so it makes only one pass of each dataset.
On first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it finds all the matching records for that key, then the DATA step loop iterates and it does it again for the next transaction record.
1003 data _null_;
1004 set transaction;
1005 by change_week;
1006
1007 do until(last.week and _found);
1008 set master;
1009 by week;
1010
1011 if week=change_week then do;
1012 sum = sum(value, sum);
1013 _found=1;
1014 end;
1015 end;
1016
1017 *file print;
1018 put week= sum= ;
1019 run;
week=1 sum=60
week=3 sum=75