How to avoid spaces in put statement in SAS - sas

I'm trying to write a JSON file from a data step, but my put statements always add unwanted spaces after variables.
put ' {"year":' year ',';
will create {"year":2013 ,
and
put ' {"name":"' %trim(name) '", ' ;
will create {"name":"Rubella virus ",
How can I remove the space after "Rubella virus" without overcomplicating things?
My best solution so far is to create a variable that uses cats and then put the newvariable a bit like this:
newvar=cats('{"name":"',name,'",');
put newvar;
Thanks!

You need to move the pointer back by one column. You do this by asking it to go forward by minus one column. Use this:
put ' {"name":"' name +(-1) '", ' ;
Weird, I know, but it works.
Here is an example with sashelp.class:
Code:
data _null_;
set sashelp.class end = eof;
if _N_ eq 1 then
put '[';
put '{ "Name":"' Name+(-1)
'","Sex":"' Sex+(-1)
'","Age":"' Age+(-1)
'","Height":"' Height+(-1)
'","Weight":"' Weight+(-1)
'"}';
if eof then
put ']';
else put ',';
run;
Result:
[
{ "Name":"Alfred","Sex":"M","Age":"14","Height":"69","Weight":"112.5"}
,
{ "Name":"Alice","Sex":"F","Age":"13","Height":"56.5","Weight":"84"}
,
{ "Name":"Barbara","Sex":"F","Age":"13","Height":"65.3","Weight":"98"}
,
{ "Name":"Carol","Sex":"F","Age":"14","Height":"62.8","Weight":"102.5"}
,
{ "Name":"Henry","Sex":"M","Age":"14","Height":"63.5","Weight":"102.5"}
,
{ "Name":"James","Sex":"M","Age":"12","Height":"57.3","Weight":"83"}
,
{ "Name":"Jane","Sex":"F","Age":"12","Height":"59.8","Weight":"84.5"}
,
{ "Name":"Janet","Sex":"F","Age":"15","Height":"62.5","Weight":"112.5"}
,
{ "Name":"Jeffrey","Sex":"M","Age":"13","Height":"62.5","Weight":"84"}
,
{ "Name":"John","Sex":"M","Age":"12","Height":"59","Weight":"99.5"}
,
{ "Name":"Joyce","Sex":"F","Age":"11","Height":"51.3","Weight":"50.5"}
,
{ "Name":"Judy","Sex":"F","Age":"14","Height":"64.3","Weight":"90"}
,
{ "Name":"Louise","Sex":"F","Age":"12","Height":"56.3","Weight":"77"}
,
{ "Name":"Mary","Sex":"F","Age":"15","Height":"66.5","Weight":"112"}
,
{ "Name":"Philip","Sex":"M","Age":"16","Height":"72","Weight":"150"}
,
{ "Name":"Robert","Sex":"M","Age":"12","Height":"64.8","Weight":"128"}
,
{ "Name":"Ronald","Sex":"M","Age":"15","Height":"67","Weight":"133"}
,
{ "Name":"Thomas","Sex":"M","Age":"11","Height":"57.5","Weight":"85"}
,
{ "Name":"William","Sex":"M","Age":"15","Height":"66.5","Weight":"112"}
]
Regards,
Vasilij
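Applied to the two statements from the question, the same pointer trick might look like the sketch below. The literal values are made up for illustration; the +(-1) after each variable steps back over the single blank that list output writes after a value.

```sas
data _null_;
  length name $40;
  year = 2013;               /* illustrative values, not from a real dataset */
  name = 'Rubella virus';
  /* +(-1) moves the pointer back one column, over the trailing blank */
  put '{"year":' year +(-1) ',';     /* writes {"year":2013, */
  put '{"name":"' name +(-1) '",';   /* writes {"name":"Rubella virus", */
run;
```

This leaves numeric values unquoted, which is usually what a JSON consumer expects.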

For the character fields you can use the $QUOTE. format to add the quotes. Use the : to remove the trailing blanks in the value of the variable.
put '{ "Name":' Name :$quote.
',"Sex":' Sex :$quote.
',"Age":"' Age +(-1) '"'
',"Height":"' Height +(-1) '"'
',"Weight":"' Weight +(-1) '"'
'}'
;

If you are looking to have 'cleaner' code, you could build yourself a helper function or two using proc fcmp. This function will take a string description, the name of the field you want, and then whether or not to quote the returned string. Note that if your values can contain quotes, you may want to use the quote() function instead of t
Example Function:
proc fcmp outlib=work.funcs.funcs;
function json(iName $, iField $, iQuote) $;
length result $200;
quote_char = ifc(iQuote,'"','');
result = cats('"', iName, '":',quote_char, iField, quote_char );
return (result );
endsub;
run;
Example Usage:
options cmplib=work.funcs; /* tell SAS where to find the compiled function */
data _null_;
set sashelp.class;
x = catx(',',
json("name",name,1),
json("age",age,0));
put x;
run;
Example Output:
"name":"Alfred","age":14
"name":"Alice","age":13
"name":"Barbara","age":13
"name":"Carol","age":14
"name":"Henry","age":14
"name":"James","age":12
"name":"Jane","age":12

Related

SAS Retain not working for 1 string variable

The code below doesn't seem to be working for the variable all_s when there is more than 1 record with the same urn. Var1, var2 and var3 work fine but that one doesn't, and I can't figure out why. I am trying to have all_s equal to single_var1,2,3 concatenated with no spaces if it's first.urn but I want it to be
all_s = all_s + ',' + single_var1 + single_var2 + single_var3
when it's not the first instance of that urn.
data dataset_2;
set dataset_1;
by URN;
retain count var1 var2 var3 all_s;
format var1 $40. var2 $40. var3 $40. all_s $50.;
if first.urn then do;
count=0;
var1 = ' ';
var2 = ' ';
var3 = ' ';
all_s = ' ';
end;
var1 = catx(',',var1,single_var1);
var2 = catx(',',var2,single_var2);
var3 = catx(',',var3,single_var3);
all_s = cat(all_s,',',single_var1,single_var2,single_var3);
count = count+1;
if first.urn then do;
all_s = cat(single_var1,single_var2,single_var3);
end;
run;
all_s is not large enough to contain the concatenation if the total length of the var1-var3 values within the group exceeds $50. Such a scenario seems likely with var1-var3 being $40.
I recommend using the length statement to specify variable lengths. format only creates a variable of a certain length as a side effect.
catx removes blank arguments from the concatenation, so if you want spaces in the concatenation when you have a blank single_varN you won't be able to use catx.
A requirement that specifies a concatenation such that non-blank values are stripped and blank values are a single blank will likely have to fall back to the old school trim(left(… approach
Sample code
data have;
length group 8 v1-v3 $5;
input group (v1-v3) (&);
datalines;
1 111 222 333
1 . 444 555
1 . . 666
1 . . .
1 777 888 999
2 . . .
2 . b c
2 x . z
run;
data want(keep=group vlist: all_list);
length group 8 vlist1-vlist3 $40 all_list $50;
length comma1-comma3 comma $2;
do until (last.group);
set have;
by group;
vlist1 = trim(vlist1)||trim(comma1)||trim(left(v1));
vlist2 = trim(vlist2)||trim(comma2)||trim(left(v2));
vlist3 = trim(vlist3)||trim(comma3)||trim(left(v3));
comma1 = ifc(missing(v1), ' ,', ',');
comma2 = ifc(missing(v2), ' ,', ',');
comma3 = ifc(missing(v3), ' ,', ',');
all_list =
trim(all_list)
|| trim(comma)
|| trim(left(v1))
|| ','
|| trim(left(v2))
|| ','
|| trim(left(v3))
;
comma = ifc(missing(v3),' ,',',');
end;
run;
Reference
SAS has operators and multiple functions for string concatenation
|| concatenate
cat concatenate
catt concatenate, trimming (remove trailing spaces) of each argument
cats concatenate, stripping (remove leading and trailing spaces) of each argument
catx concatenate, stripping each argument and delimiting
catq concatenate with delimiter and quote arguments containing the delimiter
From SAS 9.2 documentation
Comparisons
The results of the CAT, CATS, CATT, and CATX functions are usually equivalent to results that are produced by certain combinations of the concatenation operator (||) and the TRIM and LEFT functions. However, the default length for the CAT, CATS, CATT, and CATX functions is different from the length that is obtained when you use the concatenation operator. For more information, see Length of Returned Variable.
Note: In the case of variables that have missing values, the concatenation produces different results. See Concatenating Strings That Have Missing Values.
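A minimal sketch of how blank arguments are treated, assuming three made-up variables a, b and c where b is blank. It contrasts cats and catx with the older trim(left()) idiom, which keeps a single blank for a blank value.

```sas
data _null_;
  length a b c $8;
  a = '  x ';   /* leading and trailing blanks */
  b = ' ';      /* blank (missing) character value */
  c = 'z';
  r_cats = cats(a, b, c);        /* strips every argument */
  r_catx = catx(',', a, b, c);   /* strips, delimits, and skips blank arguments */
  /* trim() of a blank value returns one blank, so the blank survives */
  r_old  = trim(left(a)) || trim(left(b)) || trim(left(c));
  put r_cats=;   /* r_cats=xz */
  put r_catx=;   /* r_catx=x,z */
  put r_old=;    /* r_old=x z */
run;
```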
Some example data would be helpful, but I'm going to give it a shot and ask you to try
all_s = cat(strip(All_s),',',single_var1,single_var2,single_var3);

Get string between two specific char positions

I have a long text string in SAS, and a value within it is of variable length but is always preceded by a '#' and ends with a ','.
Is there a way I can extract this and store it as a new variable, please?
e.g:
word word, word, #12.34, word, word
And I want to get the 12.34
Thanks!
Double scan should also work if you only have a single #:
data _null_;
var1 = 'word word, word, #12.34, word, word';
var2 = scan(scan(var1,2,'#'),1,',');
put var2=;
run;
You can make use of the substr and index functions to do this. The index function returns the first position of the character specified.
data _null_;
var1 = 'word word, word, #12.34, word, word';
pos1 = index(var1,'#'); *Get the position of the first # sign;
tmp = substr(var1,pos1+1); *Create a string that returns only characters after the # sign;
put tmp;
pos2 = index(tmp,','); *Get the position of the first "," in the tmp variable;
var2 = substr(tmp,1,pos2-1);
put var2;
run;
Note that this method only works if there is only one "#" in the string.
One way is to use index to locate the two 'sentinels' delimiting the value and retrieve the innards with substr. If the value is supposed to be numeric, an additional use of input function is needed.
A second way is to use the regular expression routines prxmatch and prxposn to locate and extract the embedded value.
data have;
input;
longtext = _infile_;
datalines;
some thing #12.34, wicked
#, oops
#5a64, oops
# oops
oops ,
oops #
ok #1234,
who wants be a #1e6,aire
space # , the final frontier
double #12, jeopardy #34, alex
run;
data want;
set have;
* locate with index;
_p1 = index(longtext,'#');
if _p1 then _p2 = index(substr(longtext,_p1),',');
if _p2 > 2 then num_in_text = input (substr(longtext,_p1+1,_p2-2), ?? best.);
* locate with regular expression;
if _n_ = 1 then _rx = prxparse('/#(\d*\.?\d*)?,/'); retain _rx;
if prxmatch(_rx,longtext) then do;
call prxposn(_rx,1,_start,_length);
if _length > 0 then num_in_text_2 = input (substr(longtext,_start, _length), ?? best.);
end;
* drop _: ;
run;
The regex way only accepts plain digits with an optional decimal point between the # and the comma, while the index way takes whatever lies between the # and the next comma. The input function will therefore decipher scientific notation values (such as 1e6) that the example regex pattern will not locate. The ?? modifier in the input function suppresses the invalid-argument NOTEs in the log when the enclosed value cannot be parsed as a number.
Another way to do this is with a regular expression; the code is given below.
data have;
infile datalines truncover ;
input var $200.;
datalines;
word word, word, #12.34, word, word
word1 #12.34, hello hi hello hi
word1 #970000 hello hi hello hi #970022, hi
word1 123, hello hi hello hi #97.99
#99456, this is cool
;
A small note about the regular expression and functions below:
(?<=#) is a zero-width positive look-behind assertion that requires a # immediately before the pattern of interest.
(\d+\.?\d+) matches one or more digits, an optional ., and then one or more further digits (so at least two digits in total).
(?=,) is a zero-width positive look-ahead assertion that requires a , immediately after the pattern of interest.
call prxsubstr finds the position and length of the match, and substr extracts the required value.
data want( drop=pattern position length);
retain pattern;
IF _N_ = 1 THEN PATTERN = PRXPARSE("/(?<=#)(\d+\.?\d+)(?=,)/");
set have;
call prxsubstr(pattern, var, position, length);
if position then
match = substr(var, position, length);
run;
If you want to get really lazy, you can just do
want = compress(have,".","kd");
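For context, the k modifier makes compress keep the listed characters instead of removing them, and d adds all digits to the list. A short sketch, using made-up strings, of what that does and where it can surprise you: every digit and period in the string is kept, not only those after the #.

```sas
data _null_;
  /* only one number in the text: works as hoped */
  have1 = 'word word, word, #12.34, word, word';
  want1 = compress(have1, '.', 'kd');
  put want1=;   /* want1=12.34 */

  /* a stray digit elsewhere in the text is kept as well */
  have2 = 'word1 #12.34, hello';
  want2 = compress(have2, '.', 'kd');
  put want2=;   /* want2=112.34 */
run;
```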

SAS : How do I find nth instance of a character/group of characters within a string?

I'm trying to find a function that will index the nth instance of a character(s).
For example, if I have the string ABABABBABSSSDDEE and I want to find the 3rd instance of A, how do I do that? What if I want to find the 4th instance of AB?
ABABABBABSSSDDEE
data HAVE;
input STRING :$20.; /* a plain $ would truncate the value at 8 characters */
datalines;
ABABABBABSSSDDEE
;
RUN;
Here is a much simplified implementation of finding N-th instance of a group of characters in a SAS character string using SAS find() function:
data a;
s='AB bhdf +BA s Ab fs ABC Nfm AB ';
x='AB';
n=3;
/* from left to right */
p = 0;
do i=1 to n until(p=0);
p = find(s, x, p+1);
end;
put p=;
/* from right to left */
p = length(s) + 1;
do i=1 to n until(p=0);
p = find(s, x, -p+1);
end;
put p=;
run;
As you can see it allows for both, left-to-right and right-to-left searches.
You can combine these two into a SAS user-defined function (negative n will indicate search from right to left as it is in find function):
proc fcmp outlib=sasuser.functions.findnth;
function findnth(str $, sub $, n);
p = ifn(n>=0,0,length(str)+1);
do i=1 to abs(n) until(p=0);
p = find(str,sub,sign(n)*p+1);
end;
return (p);
endsub;
run;
Note that the above solutions with FIND() and FINDNTH() functions assume that the searched substring can overlap with its prior instance. For example, if we search for a substring ‘AAA’ within a string ‘ABAAAA’, then the first instance of the ‘AAA’ will be found in position 3, and the second instance – in position 4. That is, the first and second instances are overlapping. For that reason, when we find an instance we increment position p by 1 (p+1) to start the next iteration (instance) of the search.
However, if such overlapping is not a valid case in your searches, and you want to continue the search after the end of the previous substring instance, then we should increment p not by 1, but by the length of the substring x. That will also speed up the search (the more so the longer the substring x is), as we will be skipping more characters as we go through the string s. In this case, in the left-to-right search code we should replace p+1 with p+w, where w=length(x), taking care that the very first search still starts at position 1.
A detailed discussion of this problem is in my recent SAS blog post Finding n-th instance of a substring within a string. I also found that using the find() function works considerably faster than using regular expression functions in SAS.
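A minimal sketch of the non-overlapping, left-to-right variant described above. The variable names start and w and the test string are assumptions for illustration; a separate start variable is used so that the first search begins at position 1.

```sas
data _null_;
  s = 'ABAAAAAA';
  x = 'AAA';
  n = 2;
  w = length(x);   /* width of the substring we search for */
  start = 1;
  do i = 1 to n;
    p = find(s, x, start);
    if p = 0 then leave;   /* fewer than n instances */
    start = p + w;         /* resume after the end of this instance */
  end;
  put p=;   /* p=6 (the overlapping search would report 4) */
run;
```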
I realize I'm late to the party here, but in the interest of adding to the collection of answers, here's what I've come up with.
DATA test;
input = "ABABABBABSSSDDEE";
A_3 = find(prxchange("s/A/#/", 2, input), "A");
AB_4 = find(prxchange("s/AB/##/", 3, input), "AB");
RUN;
Breaking it down, prxchange() just does a pattern matching replacement, but the great thing about it is that you can tell it how many times to replace that pattern. So, prxchange("s/A/#/", 2, input) replaces the first two A's in input with #. Once you've replaced the first two A's, you can wrap it in a find() function to find the "first A", which is actually the third A of the original string.
One thing to note about this approach is that, ideally, the replacement string should be the same length as the string you're replacing. For instance, notice the difference between
find(prxchange("s/AB/##/", 3, input), "AB") /* gives 8 (correct) */
and
find(prxchange("s/AB/#/", 3, input), "AB") /* gives 5 (incorrect) */
That's because we've replaced a string of length 2 with a string of length 1 three times. In other words:
(length("#") - length("AB")) * 3 = -3
so 8 + (-3) = 5.
Hopefully that helps someone out there!
data _null_;
findThis = 'A'; *** substring to find;
findIn = 'ADABAACABAAE'; **** the string to search;
instanceOf=1; *** and the instance of the substring we want to find;
pos = 0;
len = 0;
startHere = 1;
endAt = length(findIn);
n = 0; *** count occurrences of the pattern;
pattern = '/' || findThis || '/';
rx = prxparse(pattern);
CALL PRXNEXT(rx, startHere, endAt, findIn, pos, len);
if pos le 0 then do;
put 'Could not find ' findThis ' in ' findIn;
end;
else do while (pos gt 0);
n+1;
if n eq instanceOf then leave;
CALL PRXNEXT(rx, startHere, endAt, findIn, pos, len);
end;
if n eq instanceOf then do;
put 'found ' instanceOf 'th instance of ' findThis ' at position ' pos ' in ' findIn;
end;
else do;
put 'No ' instanceOf 'th instance of ' findThis ' found';
end;
run;
Here is a solution using the find() function and a do loop within a datastep. I then take that code, and place it into a proc fcmp procedure to create my own function called find_n(). This should greatly simplify whatever task is using this and allows for code re-use.
Define the data:
data have;
length string $50;
input string $;
datalines;
ABABABBABSSSDDEE
;
run;
Do-loop solution:
data want;
set have;
search_term = 'AB';
nth_time = 4;
counter = 0;
last_find = 0;
start = 1;
pos = find(string,search_term,'',start);
do while (pos gt 0 and nth_time gt counter);
last_find = pos;
start = pos + 1;
counter = counter + 1;
pos = find(string,search_term,'',start); /* start is already pos + 1; adding 1 again would skip a position */
end;
if nth_time eq counter then do;
put "The nth occurrence was found at position " last_find;
end;
else do;
put "Could not find the nth occurrence";
end;
run;
Define the proc fcmp function:
Note: If the nth-occurrence cannot be found return 0.
options cmplib=work.temp.temp;
proc fcmp outlib=work.temp.temp;
function find_n(string $, search_term $, nth_time) ;
counter = 0;
last_find = 0;
start = 1;
pos = find(string,search_term,'',start);
do while (pos gt 0 and nth_time gt counter);
last_find = pos;
start = pos + 1;
counter = counter + 1;
pos = find(string,search_term,'',start); /* start is already pos + 1; adding 1 again would skip a position */
end;
result = ifn(nth_time eq counter, last_find, 0);
return (result);
endsub;
run;
Example proc fcmp usage:
Note that this calls the function twice. The first example is showing the original request solution. The second example shows what happens when a match cannot be found.
data want;
set have;
nth_position = find_n(string, "AB", 4);
put nth_position =;
nth_position = find_n(string, "AB", 5);
put nth_position =;
run;

SAS: Reading fields from datalines in a data step

Could someone please provide an explanation or a link which has explanation on the functionality of ':' in the below code:
data voter;
infile datalines dsd dlm='~';
input age party : $1. (ques1 - Ques4) ($1. + 1);
format age 2. party $1. ques1 - ques4 $likert.;
label Ques1 = ' performance '
Ques2 = ' taxes '
Ques3 = ' amenities '
Ques4 = ' endurance ';
datalines;
23~D~2~1~3~4
34~R~2~1~4~4
43~D~2~2~1~1
;
This is the test code used for learning SAS. When I remove the ':' from the INPUT statement I am not able to read the data properly. Also, kindly let me know what is the +1 in the ($1. + 1); context. This snippet is taken from learning SAS through examples. Thanks in advance.
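The : is the colon format modifier. It tells SAS to use the informat that follows but to stop reading when it encounters a delimiter (modified list input), instead of reading a fixed number of columns.
Because this is list input, the pointer then moves forward past the delimiter.
(Ques1-Ques4) ($1. +1);
is the same as Ques1 $1. +1 Ques2 $1. +1 Ques3 $1. +1 Ques4 $1. +1, i.e. each $1. reads one column and the +1 then moves the pointer one more column forward, skipping the ~ before the next value.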
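A small sketch of the colon modifier on its own, using a made-up dataset demo with values of different lengths. With the colon, $20. sets the maximum length but reading stops at the ~ delimiter.

```sas
data demo;
  infile datalines dlm='~';
  /* : reads up to the next delimiter, at most 20 characters */
  input name : $20. age;
  put name= age=;
  datalines;
Alfred~14
Barbara~13
;
run;
/* log shows: name=Alfred age=14 and name=Barbara age=13 */
```

Without the colon, $20. is plain formatted input: it would try to read a fixed 20 columns, delimiter included, which is why removing it breaks the read.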

parsing a string in SAS 9.2 that contains the '|' character

I have a variable that contains a number of firms separated by the | symbol. I would like to be able to count how many firms there are, i.e., the number of | + 1, and ideally identify the location of each | symbol in the string. Note there will not be more than five firms in a single variable. I was trying to use the following approach but ran into the fact that SAS treats the | symbol as a special operator.
pattern1 = prxparse('/|/'); /* I can't seem to get SAS to treat this as a text to compare */
start = 1;
stop = length(reassignment2); /* my list of firms is in the variable reassignment2 */
call prxnext(pattern1, start, stop, reassignment2, position, length);
ARRAY Y[5];
do J=1 to 5 while (position > 0);
Y[J]=position;
call prxnext(pattern1, start, stop, reassignment2, position, length);
end;
nfirms=j+1;
run;
I would do it somewhat differently. What you really want is not the number of | characters, but the actual firms, right? So search for those. Your code had a number of minor issues; primarily, you must first prxmatch before using call prxnext, your j+1 is wrong because the loop iterator actually increments one beyond the last qualifying loop value (I use j-1 because I will find one more element than you), and | is a regular expression metacharacter and must be escaped if you actually want to use it, unless it is inside [] like I am using it.
data test;
infile datalines truncover;
input #1 reassignment2 $50.;
pattern1 = prxparse('/[^|]+/io'); /* Look for non-| characters */
start = 1;
stop = length(reassignment2); /* my list of firms is in the variable reassignment2 */
rc=prxmatch(pattern1,reassignment2);
if rc>0 then do;
ARRAY Y[5];
do J=1 by 1 until (position = 0);
call prxnext(pattern1, start, stop, reassignment2, position, length);
Y[J]=position;
end;
nfirms=j-1;
end;
else nfirms=0;
put nfirms=;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;;;;
run;
For completeness' sake, you could also do this easily without regular expressions, using call scan.
data test;
infile datalines truncover;
input #1 reassignment2 $50.;
array y[5];
do nfirms=1 by 1 until (position le 0);
call scan(reassignment2,nfirms,position,length,'|');
y[nfirms]=position;
end;
nfirms=nfirms-1; *loop ends one iteration too late;
put nfirms=;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;;;;
run;
I agree with @Joe that this could be done more simply without regular expressions, though I would simplify his code a little further to exclude the use of an array.
data test;
infile datalines truncover length = reclen;
input firmlist $varying256. reclen;
i = 0;
do until(scan(firmlist,i,"|") = "");
i + 1;
end;
nfirms = i - 1;
drop i;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;
run;
You said you'd also like to capture the position of the "|" character in the string, but if there are multiple firms per record there will be multiple "|" characters in the string. If you want the position of each one, an array might be a better route, though if you only want one, the index function will get you what you want. You'd use delimpos = index(firmlist,"|");.
I hope that helps!
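If you do want the position of every "|", one possible sketch is below. The array name delimpos and the literal test value are assumptions for illustration; it relies on the stated limit of five firms (so at most four delimiters) and counts an empty string as one firm.

```sas
data _null_;
  firmlist = 'Firm1|Firm2|Firm3';
  array delimpos[4];                    /* at most 5 firms, so at most 4 delimiters */
  nfirms = countc(firmlist, '|') + 1;   /* number of | characters plus one */
  pos = 0;
  do k = 1 to nfirms - 1;
    pos = find(firmlist, '|', pos + 1); /* next | after the previous one */
    delimpos[k] = pos;
  end;
  put nfirms= delimpos1= delimpos2=;    /* nfirms=3 delimpos1=6 delimpos2=12 */
run;
```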