Adding a new column to an existing SAS dataset

I have a SAS dataset that I created by reading in a .txt file. It has about 20-25 rows, and I'd like to add a new column that assigns a letter of the alphabet to each row, in order.
Row 1 A
Row 2 B
Row 3 C
.......
It sounds like a really basic question and one that should have an easy solution, but unfortunately, I'm unable to find this anywhere. I get solutions for adding new calculated columns and so on, but in my case, I just want to add a new column to my existing datatable - there is no other relation between the variables.

This is kind of ugly, and if you have more than 26 rows it will run past 'Z' into whatever characters come next in the ASCII table. But it does solve the problem as defined by the question.
Test data:
data have;
do row = 1 to 26;
output;
end;
run;
Explanation:
On my computer, the letter 'A' is at position 65 in the ASCII table (YMMV). We can determine this by using this code:
data _null_;
pos = rank('A');
put pos=;
run;
The ASCII table stores the alphabet sequentially, so if A is at position 65, then B is at 66, C is at 67, and so on.
The byte() function returns a character from the ASCII table at a certain position. We can take advantage of this by using the position of ASCII character A as an offset, subtracting 1, then adding the row number (_n_) to it.
Final Solution:
data want;
set have;
alphabet = byte(rank('A')-1 + _n_);
run;
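If the dataset might grow past 26 rows, one possible variation is to wrap the offset with MOD() so that row 27 starts again at A instead of running past Z. This is only a sketch: whether repeated letters are acceptable depends on your data, and the variable name letter and the 30-row test set are assumptions made for illustration.

```sas
/* Test data with more than 26 rows (illustrative only) */
data have;
  do row = 1 to 30;
    output;
  end;
run;

/* mod(_n_ - 1, 26) cycles through 0..25, so the letters repeat A..Z */
data want;
  set have;
  letter = byte(rank('A') + mod(_n_ - 1, 26));
run;
```

With this version, row 26 gets Z and row 27 gets A again.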

Not better than Tom's, but essentially a brute-force alternative: create a string containing the alphabet, then use CHAR() to pick out the character at position _n_.
data want;
set sashelp.class;
retain string 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
letter = char(string, _n_);
run;

Related

Create a new, named column based on a formatted numeric variable

I have a dataset (DIN) that consists of formatted numeric variables (e.g., column 1 'BLD' has values 1-3, but they are formatted as 'Yes', 'No', 'Unknown'). All columns have slightly different formatting.
In each row, only one column has a value, the rest are missing. I am trying to use the following to get the maximum of each row (which will always be the non-missing value)
data DIN;
set DIN;
MAX = max(of BLD--VASC);
run;
Unfortunately as these columns are numeric the MAX column is showing as numbers, not the formatted value. I have tried using vvalue to get the formatted value, like below but I don't know how to do this for all columns at once.
data _null_;
set DIN;
BLD_C = vvalue(BLD);
run;
I felt like a do loop might help, and I tried looping over an array of variable names, but it just doesn't work. Nothing seems to happen
data DIN_C;
set DIN;
array nums(*) _numeric_;
do i = 1 to dim(nums);
nums_C = vvalue(nums(i));
end;
run;
Can anyone help me? Or is there another approach I could take for this problem?
You can use MAX() to find the actual non-missing numeric value. Then use WHICHN() to find the index of the variable that holds that value. Finally, use VVALUE() to get the formatted value of that variable.
data DIN_FIXED;
set DIN;
array _num BLD--VASC;
length max 8 max_formatted $50 ;
MAX = max(of _num[*]);
if not missing(max) then max_formatted=vvalue(_num[whichn(max,of _num[*])]);
run;
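To try this approach without the real DIN dataset, a small stand-in can be built first. The sketch below assumes only two columns and two made-up formats (yn. and sev. are hypothetical names, not the asker's actual formats), and then applies the same MAX/WHICHN/VVALUE logic.

```sas
/* Hypothetical formats standing in for the asker's real ones */
proc format;
  value yn  1 = 'Yes'  2 = 'No'  3 = 'Unknown';
  value sev 1 = 'Mild' 2 = 'Severe';
run;

/* Stand-in data: exactly one non-missing column per row */
data DIN;
  input BLD VASC;
  format BLD yn. VASC sev.;
  datalines;
1 .
. 2
;
run;

data DIN_FIXED;
  set DIN;
  array _num BLD--VASC;
  length max 8 max_formatted $50;
  max = max(of _num[*]);
  /* whichn() returns the position of the first array element equal to max */
  if not missing(max) then
    max_formatted = vvalue(_num[whichn(max, of _num[*])]);
run;
```

Because VVALUE() uses the format attached to whichever variable is picked, each row gets the label from its own column's format.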

How to split a column into multiple rows in SAS

I have a SAS table that I imported from Oracle with two fields. SYSTEMID and T_BLOB.
Inside the T_BLOB field there is data:
2203 Mountain Meadow===========OSCAR ST===========Zephyrhill Road
(why they are delimiting with equal signs I do not know nor do I know who to ask).
I'm new to SAS and I'm being asked to split T_BLOB field into multiple rows in a table called rick.split_blob. I tried Google but I can't find the exact example. I'm trying to get the output to look like:
SYSTEM_ID T_BLOB
GID_1 2203 Mountain Ave
GID_1 OSCAR ST
GID_1 Zephyrhill Road
Can anyone help me with how to code this?
If none of the values ever contain = then you can just use the scan() function.
data want;
set have ;
length T_BLOB_VALUE $200 ;
do i=1 by 1 until(t_blob_value=' ');
t_blob_value=scan(t_blob,i,'=') ;
if i=1 or t_blob_value ne ' ' then output;
end;
run;
You could try this:
data rick.split_blob (keep=SYSTEM_ID T_BLOB_SUB rename=(T_BLOB_SUB=T_BLOB));
set orig_dataset;
T_BLOB_TRANS = tranwrd(T_BLOB,"===========","|");
do i = 1 to countw(T_BLOB_TRANS,"|");
T_BLOB_SUB = scan(T_BLOB_TRANS,i,"|");
output;
end;
run;
What I'm trying to do is first translate the odd string of equals signs to a simple pipe to avoid counting them as consecutive delimiters. Then we determine how many "words" (really - delimited strings) there are in T_BLOB_TRANS so we know how many times to run the DO loop. Finally we read everything between each delimiter and output it to a new T_BLOB variable for each new word.
It looks like you'll want to use a combination of the "scan" function and the "output" statement (with countw to get you the number of words if it is variable). Scan returns the nth word where you can specify the delimiter. Output outputs a record. So, for example, you can say
do i=1 to countw(line);
newvar = scan(line,i);
output;
end;
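Putting the SCAN/COUNTW/OUTPUT idea together into one step might look like the sketch below. It assumes the values themselves never contain an equals sign, and the test row and the variable name T_BLOB_VALUE are illustrative. Without the M modifier, both functions treat a run of consecutive = characters as a single delimiter, so no translation step is needed here.

```sas
/* Illustrative test data, using the sample value from the question */
data have;
  length SYSTEM_ID $10 T_BLOB $200;
  infile datalines dlm='|' truncover;
  input SYSTEM_ID T_BLOB;
  datalines;
GID_1|2203 Mountain Meadow===========OSCAR ST===========Zephyrhill Road
;
run;

data want (keep=SYSTEM_ID T_BLOB_VALUE);
  set have;
  length T_BLOB_VALUE $200;
  /* countw() gives the number of =-delimited pieces in this row */
  do i = 1 to countw(T_BLOB, '=');
    T_BLOB_VALUE = scan(T_BLOB, i, '=');
    output;   /* one output row per piece */
  end;
run;
```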

SAS: Transform variable into time series in text file import - length greater than 32,767

I get a calendar file from a vendor containing all holidays for a specific calendar.
The file contains 7 columns separated by a pipe (|). However, column 7, which contains the actual holidays, comes as a single string separated by semicolons (;).
My problem is that column 7 is longer than 32,767 characters, so the solution I have used so far, based on some array and transpose tricks, doesn't work anymore.
Basically the text file looks like:
INTERNAL_NAME|ERROR_CODE|NUMBER_OF_FIELDS|CALENDAR_CODE|CALENDAR_TYPE|CALENDAR_NAME|DATES
US|0|4|US|Country|United States|;2;15728;1;5;19440101;5;19440102;5;19440103;5;19440108;5;19440109......etc.
However, column 7 is delivered in a nice format: the size of the array/matrix is given, and the delimiter is given at the start of the string.
*1st character = delimiter -> ;
*Number of dimensions in matrix -> 2
*Number of rows in matrix -> 15,728
*Number of columns -> 1
*Data element type + data -> 5 = date, followed by the date itself (19440101 = 01JAN1944), etc.
My desired result would be a dataset looking like
INTERNAL_NAME DATES
US 01JAN1944
US 02JAN1944
US 03JAN1944
US 08JAN1944
etc. until all 15,728 observations are read.
You can do this fairly easily.
The manual solution, i.e., assuming the fields are just as you say they are, is to read the line with the secondary delimiter (;). Everything before the first semicolon lands in one initial string, which you can parse on your own later since it's known to be short. Then iterate the inputs over the rest of the line, using a trailing @ to hold the line between INPUT statements.
data want;
infile datalines4 dlm=';' truncover;
length initial_string $500;
input initial_String $ @;
input dim row col @;
do _n_ = 1 by 1 until (missing(holiday_date));
input col_type holiday_Date @;
if not missing(holiday_date) then output;
end;
datalines4;
US|0|4|US|Country|United States|;2;15728;1;5;19440101;5;19440102;5;19440103;5;19440108;5;19440109
;;;;
run;
If you want to use that information that tells you about the delimiter/etc. to drive the readin, you could do that, but it would take two passes on the data file (unless it has a limited set of possibilities and you could just use if/else branching with those limited set of input statements). One pass would read just that part, then call a macro to read in the rest in a separate data step. But if this is always the format of the file, and you don't really care about those fields - you just have to work with them being there - the above is probably better as it's faster and less complicated.
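To get closer to the desired output (INTERNAL_NAME plus a real SAS date), the same trailing-@ loop can read each date with the YYMMDD8. informat and pull the first pipe-delimited field out of the leading string. This is a sketch under the assumption that every data element is a yyyymmdd date. The sample line is shortened, and for the real file you would point INFILE at the file itself and probably need an LRECL= value large enough for the longest record.

```sas
data want (keep=internal_name holiday_date);
  infile datalines4 dlm=';' truncover;
  length initial_string $500 internal_name $32;
  format holiday_date date9.;
  /* everything before the first ; is the pipe-delimited header part */
  input initial_string $ @;
  internal_name = scan(initial_string, 1, '|');
  /* matrix description: dimensions, rows, columns */
  input dim row col @;
  do _n_ = 1 by 1 until (missing(holiday_date));
    /* the colon modifier applies the informat but still stops at the delimiter */
    input col_type holiday_date :yymmdd8. @;
    if not missing(holiday_date) then output;
  end;
datalines4;
US|0|4|US|Country|United States|;2;15728;1;5;19440101;5;19440102;5;19440103
;;;;
run;
```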

Extracting coordinates from SVG path syntax

*the title may be misleading
I have (column) cells values as follows:
d="M200,170L149,385"
d="M200,170L150,387"
d="M200,170L275,384"
d="M200,170L49,317"
d="M200,170L92,347"
The values 200 & 170 in each cell represent the x and y origins respectively, while the second set of values (i.e. 149 and 385) represent the x and y values.
I want to separate the x-origin, y-origin, x and y values into four columns. (I'm relatively new to SAS... I think these are Cartesian coordinates.)
How would I go about doing this?
Use the scan function. It is used to select the nth word of a string. First argument is the string you want parsed, second is the word (1st, 2nd, etc), and third lists your delimiters (characters that separate the words). That should be all you need.
data want;
set have;
origx = scan(d,1,'M,L');
origy = scan(d,2,'M,L');
x = scan(d,3,'M,L');
y = scan(d,4,'M,L');
run;
Do you have a SAS dataset with a variable named d in it, or do you have a text file? My first read was that you have a SAS dataset already, in which case you need to parse the variable. You could use SCAN() function, or plenty of other methods, e.g.:
data have;
input d $16.;
cards;
M200,170L149,385
M200,170L150,387
M200,170L275,384
M200,170L49,317
M200,170L92,347
;
run;
data want;
set have;
x_origin=scan(d,1,"M,L");
y_origin=scan(d,2,"M,L");
x=scan(d,3,"M,L");
y=scan(d,4,"M,L");
run;
proc print data=want;
run;

SAS date or numeric data?

%let months_back = %sysget(months_back);
data;
m = intnx('month', "&sysdate9"d, -&months_back - 2, 'begin');
m = intnx('day', put(m, date9.), 26, 'same');
m2back = put(m, yymmddd10.);
put m2back;
run;
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
5:19
NOTE: Invalid numeric data, '01OCT2012' , at line 5 column 19.
I really don't know why this goes wrong. Is the date string numeric data?
PUT(m, date9.) is the culprit here. The 2nd argument of INTNX needs to be numeric (i.e. a date), the PUT function always returns a character value, in this instance '01OCT2012'. Just take out the PUT function completely and the code should work.
m = intnx('day', m, 26, 'same');
SAS stores dates as numbers - and in fact does not have a truly separate type for them. A SAS date is the number of days since January 1, 1960, so a bit over 19000 for today. The date format is entirely irrelevant to any date calculations - it is solely for human readability.
The bit where you say:
"&sysdate9"d
actually converts the string "01JAN2012" to a numeric value (18993).
There's actually a quicker way to accomplish what you're trying to do. Because days correspond to whole numbers in SAS, to increment by one day you can simply add one to the value.
For example:
%let months_back=5;
data _null_;
m = intnx('month', today(), -&months_back - 2, 'begin');
m2 = intnx('day', m, 26, 'same');
m3 = intnx('month',"&sysdate9"d, -&months_back - 2)+26;
m2back = put(m2, yymmdd10.);
put m= date9. m2= yymmdd10. m3= yymmdd10.;
run;
M3 does your entire calculation in one step, by using the MONTH interval, then adding 26. INTNX('day'...) is basically pointless, unless there's some other value to using the function (using a shift index for example).
You can also see the use of a format in the PUT statement (the one that writes to the log) here - you don't have to convert the value to a character string with the PUT function and then write that to the log to get the formatted value. Just write put var format.; and string together as many variable/format pairs as you want that way.
Also, "&sysdate9."d is not the best way to get the current date. &sysdate9 is set only once, when the SAS session starts, so if your session ran for 3 days you would not be on the current day (though perhaps that's desired?). Instead, the TODAY() function gets the current date, up to date no matter how long your SAS session has been running.
Finally - I recommend data _null_; if you don't want a dataset (and naming the result dataset if you do want it). data _null_ does not create a dataset. data; simply creates increasing numbers of datasets (data1, data2, ...) which quickly fill up your workspace and make it hard to tell what you're doing.
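The point that SAS dates are plain numbers can be checked directly in the log. This is a minimal sketch; the variable names d, next_day and now are only illustrative.

```sas
data _null_;
  d = '01JAN2012'd;        /* date literal: a plain number underneath */
  next_day = d + 1;        /* adding 1 moves forward one day */
  now = today();           /* current date, evaluated when the step runs */
  put d=;                  /* unformatted: writes d=18993 */
  put d= date9. next_day= date9.;
  put now= yymmdd10.;
run;
```

The first PUT shows the raw day count, while the second shows the same value, and the next day, rendered through a date format.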