Can someone please explain a weird behavior I observe with the ifc function:
I ran the following 3 tests:
data test1;
var = "    "; /* 4 spaces */
length outvar $5;
if not missing(var) then outvar = substr(var, 1,5);
else call missing(outvar);
put outvar=;
run;
data test2;
var = "    "; /* 4 spaces */
length outvar $5;
outvar = ifc(not missing(var), substr(var, 1, 5), "");
put outvar=;
run;
data test3;
var = "     "; /* 5 spaces */
length outvar $5;
outvar = ifc(not missing(var), substr(var, 1, 5), "");
put outvar=;
run;
test1 and test3 run fine. However I get the following warning/note for test2:
Invalid third argument to function SUBSTR
While I understand the meaning of this, it is not clear why it is triggered to begin with, given that it should not evaluate that expression in the ifc function. It appears the ifc function evaluates both expressions regardless of the outcome of the logical test.
IFC/N evaluates all expressions in all arguments. The SUBSTRN function should fix the message and give desired result.
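As a minimal sketch of that suggestion, here is test2 with SUBSTRN swapped in for SUBSTR. The data set name test2_substrn is made up for illustration. SUBSTRN truncates a request that runs past the end of the string instead of complaining, so the unused branch no longer writes the note to the log:

```sas
data test2_substrn;            /* hypothetical name, not from the question */
  var = "    ";                /* 4 spaces */
  length outvar $5;
  /* IFC still evaluates all three arguments, but SUBSTRN accepts */
  /* a length that extends beyond the end of var.                 */
  outvar = ifc(not missing(var), substrn(var, 1, 5), "");
  put outvar=;
run;
```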
SAS does not use lazy-evaluation.
Before ifc is applied, SAS evaluates all its arguments,
so if you submit
data test2;
var = "    "; /* 4 spaces */
length outvar $7;
outvar = ifc(not missing(var), substr(var, 1, 5), "");
put outvar=;
run;
SAS will evaluate
not missing(var), which results in false
substr(var, 1, 5), which gives an error
"", which results in a null string
So the error occurs before the ifc is executed.
But if you submit
data test1;
var = "    "; /* 4 spaces */
length outvar $5;
if not missing(var) then outvar = substr(var, 1,5);
else call missing(outvar);
put outvar=;
run;
SAS will evaluate not missing(var), which results in false.
Next it will
not evaluate substr(var, 1, 5)
but only execute call missing(outvar), which sets outvar to a blank string
Related
In SAS I'd like to add id values to the variables with a specific conditions. I have the following code:
DATA market_new;
SET sashelp.cars;
if Make = 'Audi' then id = 0;
else id = _N_;
RUN;
proc print data=market_new;
run;
Output:
The problem is that the id continues with 27, 28 etc. after the make isn't equal to Audi. My goal is to have 8, 9 instead.
Use a SUM (+) statement to track the Audis.
if make='Audi' then do;
audi_seq + 1; drop audi_seq;
audi_id = audi_seq;
end;
else
audi_id = 0;
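If the goal is the one stated in the question (0 for Audi, consecutive numbers for every other make), the same sum-statement idea can be turned around. A minimal sketch; the counter name seq is made up for illustration:

```sas
data market_new;
  set sashelp.cars;
  if Make = 'Audi' then id = 0;
  else do;
    seq + 1;      /* sum statement: seq starts at 0 and is retained across rows */
    id = seq;     /* only non-Audi rows advance the counter */
  end;
  drop seq;
run;
```

Because the counter only advances on non-Audi rows, the numbering picks up after the Audi block where it left off instead of following _N_.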
I have these strings:
DDSPRJ11
DDSPRJ12
DDSPRJ12
DDRJCT
For the first three I want the last 4 characters, and for the last one I want the last 3 characters. How can I get them using substr, in the correct order, e.g. RJ11?
You can do this with regular expression matching using prxchange:
data have;
infile datalines;
input mystr $ @@;
datalines;
DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT
;
run;
data want;
set have;
suffix = prxchange('s/(DDSP|DDR)(.*)/$2/', 1, mystr);
run;
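Since the question asked about substr specifically, here is a sketch that does without regular expressions. It assumes every value starts with either DDSP or DDR, as in the sample data; the data set name want_substr is made up for illustration:

```sas
data have;
  input mystr $ @@;
  datalines;
DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT
;
run;

data want_substr;              /* hypothetical name */
  set have;
  /* =: compares only as many characters as the shorter operand has */
  if mystr =: 'DDSP' then suffix = substr(mystr, 5);      /* drop 4-char prefix */
  else if mystr =: 'DDR' then suffix = substr(mystr, 4);  /* drop 3-char prefix */
run;
```

For the sample values this leaves RJ11, RJ12, RJ12 and JCT in suffix.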
@user667489's answer is perfect if you can read all of the values separately. If they are all in the same variable, as shown below, you can use the same prxchange approach and add the scan function. prxnext can also be used to achieve the same result. Both examples are shown below.
data have;
val= "DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT";
run;
/* using prxchange with scan*/
data want;
set have;
suffix = prxchange('s/(DDSP|DDR)//', -1, val);
do i = 1 to countw(suffix,' ');
newstr= scan(suffix, i);
output;
end;
drop suffix val;
run;
/* using prxnext */
data want;
length val1 $200;
set have;
start = 1;
stop = length(val);
re = prxparse('/(DDSP|DDR)/'); /* numeric pattern id, so re must not be declared as character */
call prxnext(re, start, stop, trim(val), position, length);
do while (position > 0);
/* takes as many characters after the prefix as the prefix is long, which fits this sample data */
val1 = substr(val, position+length, length);
output;
call prxnext(re, start, stop, trim(val), position, length);
end;
drop re start stop position length val;
run;
Here is how you can do it in plain Python.
I assumed that you want the last 4 characters of every word except the last one, which gets the last 3.
string_1 = 'DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT'
list_string = string_1.split()
new_list = []
for i in range(len(list_string)):
    if i == len(list_string) - 1:
        new_list.append(list_string[i][-3:])
    else:
        new_list.append(list_string[i][-4:])
print(new_list)
output:
['RJ11', 'RJ12', 'RJ12', 'JCT']
libname Prob 'Y:\alsdkjf\alksjdfl';
* the geoid here is char and I want to convert it to num to be able to merge by id;
data Problem2_1;
set Prob.geocode;
id = substr(GEOID, 8, 2);
id = input(id, best5.);
output;
run;
* geoid here is numeric;
data Problem2_2;
set Prob.households;
id = GEOID;
output;
run;
data Problem2_3;
merge Problem2_1
Problem2_2 ;
by ID;
run;
proc print data = Problem2_3;
*ERROR: Variable geoid has been defined as both character and numeric.
*ERROR: Variable id has been defined as both character and numeric.
It looks like you could replace these two lines:
id = substr(GEOID, 8, 2);
id = input(id, best5.);
With:
id = input(substr(GEOID, 8, 2), best.);
This would mean that both merge datasets contain numeric id variables.
SAS requires the linking id to be of the same data type in both data sets. This means that you have to convert the int to a string or vice versa. Personally, I prefer to convert to numeric whenever possible.
Here is a worked-out example:
/*Create some dummy data for testing purposes:*/
data int_id;
length id 3 dummy $3;
input id dummy;
cards;
1 a
2 b
3 c
4 d
;
run;
data str_id;
length id $1 dummy2 $3;
input id dummy2;
cards;
1 aa
2 bb
3 cc
4 dd
;
run;
/*Convert string to numeric. Int in this case.*/
data str_id_to_int;
set str_id;
id2 =id+0; /* or you could use something like input(id, 8.)*/
/*The variable must be new. id=id+0 does _not_ work.*/
drop id; /*move id2->id */
rename id2=id;
run;
/*Same, but other way around. Imho, trickier.*/
data int_id_to_str;
set int_id;
id2=put(id, 1.); /*note that '1.' refers to a length of 1 */
/*There are other ways to convert int to string as well.*/
drop id;
rename id2=id;
run;
/*Testing. Results should be equivalent */
data merged_by_str;
merge str_id(in=a) int_id_to_str(in=b);
by id;
if a and b;
run;
data merged_by_int;
merge int_id(in=a) str_id_to_int(in=b);
by id;
if a and b;
run;
For Problem2_1, if your substring contains only numbers you can coerce it to numeric by adding zero. Something like this should make ID numeric and then you could merge with Problem2_2.
data Problem2_1;
set Prob.geocode;
temp = substr(GEOID, 8, 2);
id = temp + 0;
drop temp;
run;
EDIT:
Your original code first defines ID as the output of substr, which is character, so the later assignment cannot turn it into a numeric variable. This should work as well:
data Problem2_1;
set Prob.geocode;
temp = substr(GEOID, 8, 2);
id = input(temp, 8.0);
drop temp;
run;
I have data similar to this
(this won't work because of an "Array subscript out of range" error):
data test;
array id {5} (1, 8, 4, 12, 23);
array a_ {5};
do i = 1 to 5;
a_[id[i]] = id[i];
end;
run;
What I want to do is
create variables whose names begin with 'a_' followed by the values of array id.
Meaning: a_1, a_8, a_4, a_12, a_23
This will only work if I declare array a_ with 23 members:
data test;
array id {5} (1, 8, 4, 12, 23);
array a_ {23};
do i = 1 to 5;
a_[id[i]] = id[i];
end;
run;
But then I get lots of missing variables I don't need.
I only want the above 5.
How can I achieve that?
PROC TRANSPOSE is usually the easiest way to do this.
First, make a vertical dataset like so:
data vert;
array id[5] (1,8,4,12,23);
do _i = 1 to dim(id);
varname = cats('A_',id[_i]);
vvalue = 1; *it is not apparent to me what the value should be in A_12 or whatnot;
output;
end;
run;
Then PROC TRANSPOSE makes your desired dataset.
proc transpose data=vert out=want;
id varname;
var vvalue;
run;
I have the following function defined via PROC FCMP. The point of the code should be pretty obvious and relatively straightforward. I'm returning the value of an attribute from a line of XHTML. Here's the code:
proc fcmp outlib=library.funcs.crawl;
function getAttr(htmline $, Attribute $) $;
/*-- Find the position of the match --*/
Pos = index( htmline , strip( Attribute )||"=" );
/*-- Now do something about it --*/
if pos > 0 then do;
Value = scan( substr( htmline, Pos + length( Attribute ) + 2), 1, '"');
end;
else Value = "";
return( Value);
endsub;
run;
No matter what I do with length or attrib statement to try to explicitly declare the data type returned, it ALWAYS returns only a max of 33 bytes of the requested string, regardless of how long the actual return value is. This happens no matter which attribute I am searching for. The same code (hard-coded) into a data step returns the correct results so this is related to PROC FCMP.
Here is the datastep I'm using to test it (where PageSource.html is any html file that has xhtml compliant attributes -- fully quoted):
data TEST;
length href $200;
infile "F:\PageSource.html";
input;
htmline = _INFILE_;
href = getAttr( htmline, "href");
x = length(href);
run;
UPDATE: This seems to work properly after upgrading to SAS9.2 - Release 2
I think the problem (though I don't know why) is in the scan function - it seems to be truncating input from substr(). If you pull the substr function out of scan(), assign the result of the substr function to a new variable that you then pass to scan, it seems to work.
Here is what I ran:
proc fcmp outlib=work.funcs.crawl;
function getAttr(htmline $, Attribute $) $;
length y $200;
/*-- Find the position of the match --*/
Pos = index( htmline , strip( Attribute )||"=" );
/*-- Now do something about it --*/
if pos > 0 then do;
y=substr( htmline, Pos + length( Attribute ) + 2);
Value = scan( y, 1, '"');
end;
else Value = "";
return( Value);
endsub;
run;
options cmplib=work.funcs;
data TEST;
length href $200;
infile "PageSource.html";
input;
htmline = _INFILE_;
href = getAttr( htmline, "href");
x = length(href);
run;
In this case, an input pointer control should be enough. Hope this helps.
/* create a test input file */
data _null_;
file "f:\pageSource.html";
input;
put _infile_;
cards4;
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="w3.org/StyleSheets/TR/W3C-REC.css"; type="text/css"?>
;;;;
run;
/* extract the href attribute value, if any. */
/* assuming that the value and the attribute name occurs in one line. */
/* and max length is 200 chars. */
data one;
infile "f:\pageSource.html" missover;
input @("href=") href :$200.;
href = scan(href, 1, '"'); /* unquote */
run;
/* check */
proc print data=one;
run;
/* on lst
Obs href
1
2 w3.org/StyleSheets/TR/W3C-REC.css
*/
It seems like uninitialized variables in PROC FCMP get a default length of 33 bytes. Consider the following demonstration code:
OPTIONS INSERT = (CMPLIB = WORK.FCMP);
PROC FCMP
OUTLIB = WORK.FCMP.FOO
;
FUNCTION FOO(
BAR $
);
* Assign the value of BAR to the uninitialised variable BAZ;
BAZ = BAR;
* Diagnostics;
PUT 'BAR IS ' BAR;
PUT 'BAZ IS ' BAZ;
* Return error code;
IF
LENGTH(BAZ) NE LENGTH(BAR)
THEN
RETURN(0)
; ELSE
RETURN(1)
;
ENDSUB;
RUN;
DATA _NULL_;
X = 'shortstring';
Y = 'exactly 33 characters long string';
Z = 'this string is somewhat longer than 33 characters';
ARRAY STRINGS{*} _CHARACTER_;
ARRAY RC{3} 8 _TEMPORARY_;
DO I = 1 TO DIM(STRINGS);
RC[I] = FOO(STRINGS[I]);
END;
RUN;
Which, with my site's installation (Base SAS 9.4 M2) prints the following lines to the log:
BAR IS shortstring
BAZ IS shortstring
BAR IS exactly 33 characters long string
BAZ IS exactly 33 characters long string
BAR IS this string is somewhat longer than 33 characters
BAZ IS this string is somewhat longer th
This is likely related to the fact that PROC FCMP, like DATA steps, cannot allocate variable lengths dynamically at runtime. However, it's a little confusing, because it does dynamically allocate variable lengths for parameters. I'm assuming that there is a separate "initialization" phase for PROC FCMP subroutines, during which the length of values passed as arguments are determined and parameter variables which must hold those values are initialized to the required length. However, the length of variables defined only within the body of the subroutine can only be discovered at runtime, when memory has already been allocated. So prior to runtime (whether at compile-time or my hypothetical "initialization" phase), memory is allocated to these variables with an explicit LENGTH statement if present, and otherwise falls back to a default of 33 bytes.
Now what's really interesting is that PROC FCMP is as smart as can be about this -- within the strict separation of initialization/runtime stages. If, in the body of the subroutine, a variable A has an explicitly defined LENGTH, and then another uninitialized variable B is assigned a function of A, then B is set to the same length as A. Consider this modification of the above function, in which the value of BAR is not assigned directly to BAZ, but rather via the third variable QUX, which has an explicitly defined LENGTH of 50 bytes:
OPTIONS INSERT = (CMPLIB = WORK.FCMP);
PROC FCMP
OUTLIB = WORK.FCMP.FOO
;
FUNCTION FOO(
BAR $
);
LENGTH QUX $ 50;
QUX = BAR;
* Assign the value of BAR to the uninitialised variable BAZ;
BAZ = QUX;
* Diagnostics;
PUT 'BAR IS ' BAR;
PUT 'BAZ IS ' BAZ;
* Return error code;
IF
LENGTH(BAZ) NE LENGTH(BAR)
THEN
RETURN(0)
; ELSE
RETURN(1)
;
ENDSUB;
RUN;
DATA _NULL_;
X = 'shortstring';
Y = 'exactly 33 characters long string';
Z = 'this string is somewhat longer than 33 characters';
ARRAY STRINGS{*} _CHARACTER_;
ARRAY RC{3} 8 _TEMPORARY_;
DO I = 1 TO DIM(STRINGS);
RC[I] = FOO(STRINGS[I]);
END;
RUN;
The log shows:
BAR IS shortstring
BAZ IS shortstring
BAR IS exactly 33 characters long string
BAZ IS exactly 33 characters long string
BAR IS this string is somewhat longer than 33 characters
BAZ IS this string is somewhat longer than 33 characters
It's likely that this "helpful" behavior is the cause of confusion and differences in the previous answers. I wonder if this behavior is documented?
I'll leave it as an exercise to the reader to investigate exactly how smart SAS tries to get about this. For example, if an uninitialized variable gets assigned the concatenated values of two other variables with explicitly assigned lengths, is its length set to the sum of those of the other two?
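Applied to the getAttr function from the question, the observation above suggests declaring the local variable explicitly. The following is only a sketch under that assumption: the 200-byte length is an arbitrary choice for illustration, and whether this is enough on older releases is not established here (the question reports that LENGTH statements did not help before the SAS 9.2 Release 2 upgrade, and another answer suspects scan() of truncating its input).

```sas
proc fcmp outlib=work.funcs.crawl;
  /* "$ 200" gives the character return value a length of 200 (assumed size) */
  function getAttr(htmline $, Attribute $) $ 200;
    /* Declare the local variable so it does not fall back to the default length */
    length Value $ 200;
    /*-- Find the position of the match --*/
    Pos = index(htmline, strip(Attribute) || "=");
    /*-- Skip past attribute="  and read up to the closing quote --*/
    if Pos > 0 then
      Value = scan(substr(htmline, Pos + length(Attribute) + 2), 1, '"');
    else Value = "";
    return (Value);
  endsub;
run;

options cmplib=work.funcs;
```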
I ended up backing out of using FCMP defined data step functions. I don't think they're ready for primetime. Not only could I not solve the 33 byte return issue, but it started regularly crashing SAS.
So back to the good old (decades old) technology of macros. This works:
/*********************************/
/*= Macro to extract Attribute =*/
/*= from XHTML string =*/
/*********************************/
%macro getAttr( htmline, Attribute, NewVar );
if index( &htmline , strip( &Attribute )||"=" ) > 0 then do;
&NewVar = scan( substr( &htmline, index( &htmline , strip( &Attribute )||"=" ) + length( &Attribute ) + 2), 1, '"' );
end;
%mend;
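A short usage sketch, assuming the %getAttr macro above has already been compiled in the session. The test line and the variable names are made up for illustration. Because the macro expands to ordinary data step statements, it has to be called inside a data step:

```sas
data test;
  length htmline $200 href $200;
  /* hypothetical XHTML line with a fully quoted attribute */
  htmline = '<a href="page.html" class="x">';
  %getAttr(htmline, "href", href);
  put href=;      /* writes href=page.html to the log */
run;
```

Declaring href with an explicit LENGTH before the macro call keeps its size independent of whatever the first assignment would otherwise imply.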