I have table1 that contains one column (city), I have a second table (table2) that has two columns (city, distance),
I am trying to create a third table, table 3, this table contains two columns (city, distance), the city in table 3 will come from the city column in table1 and the distance will be the corresponding distance in table2.
I tried doing this using Proc IML based on Joe's suggestion and this is what I have.
proc iml;
use Table1;
read all var _CHAR_ into Var2 ;
use Table2;
read all var _NUM_ into Var4;
read all var _CHAR_ into Var5;
do i=1 to nrow(Var2);
do j=1 to nrow(Var5);
if Var2[i,1] = Var5[j,1] then
x[i] = Var4[i];
end;
create Table3 from x;
append from x;
close Table3 ;
quit;
I am getting an error, matrix x has not been set to a value. Can somebody please help me here. Thanks in advance.
The technique you want to use is called the "unique-loc technique". It enables you to loop over unique values of a categorical variable (in this case, unique cities) and do something for each value (in this case, copy the distance into another array).
So that others can reprodce the idea, I've imbedded the data directly into the program:
T1_City = {"Gould","Boise City","Felt","Gould","Gould"};
T2_City = {"Gould","Boise City","Felt"};
T2_Dist = {10, 15, 12};
T1_Dist = j(nrow(T1_City),1,.); /* allocate vector for results */
do i = 1 to nrow(T2_City);
idx = loc(T1_City = T2_City[i]);
if ncol(idx)>0 then
T1_Dist[idx] = T2_Dist[i];
end;
print T1_City T1_Dist;
The IF-THEN statement is to prevent in case there are cities in Table2 that are not in Table1. You can read about why it is important to use that IF-THEN statement. The IF-THEN statement is not needed if Table2 contains all unique elements of Table1 cities.
This technique is discussed and used extensively in my book Statistical Programming with SAS/IML Software.
You need a nested loop, or to use a function that finds a value in another matrix.
IE:
do i = 1 to nrow(table1);
do j = 1 to nrow(table2);
...
end;
end;
Related
I am trying to sum one variable as long as another remains constant. I want to cumulative sum dur as long as a is constant. when a changes the sum restarts. when a new id, the sum restarts.
enter image description here
and I would like to do this:
enter image description here
Thanks
You can use a BY statement to specify the variables whose different value combinations organize data rows into groups. You are resetting an accumulated value at the start of each group and adding to the accumulator at each row in the group. Use retain to maintain a new variables value between the DATA step implicit loop iterations. The SUM statement is a unique SAS feature for accumulating and retaining.
Example:
data want;
set have;
by id a;
if first.a then mysum = 0;
mysum + dur;
run;
The SUM statement is different than the SUM function.
<variable> + <expression>; * SUM statement, unique to SAS (not found in other languages);
can be thought of as
retain <variable>;
<variable> = sum (<variable>, <expression>);
As far as I am concerned, you need to self-join your table with a ranked column.
It should be ranked by id and a columns.
FROM WORK.QUERY_FOR_STCKOVRFLW t1; is the table you provided in the screenshot
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_STCKOVRFLW_0001 AS
SELECT t1.id,
t1.a,
t1.dur,
/* mono */
(monotonic()) AS mono
FROM WORK.QUERY_FOR_STCKOVRFLW t1;
QUIT;
PROC SORT
DATA=WORK.QUERY_FOR_STCKOVRFLW_0001
OUT=WORK.SORTTempTableSorted
;
BY id a;
RUN;
PROC RANK DATA = WORK.SORTTempTableSorted
TIES=MEAN
OUT=WORK.RANKRanked(LABEL="Rank Analysis for WORK.QUERY_FOR_STCKOVRFLW_0001");
BY id a;
VAR mono;
RANKS rank_mono ;
RUN; QUIT;
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_RANKRANKED AS
SELECT t1.id,
t1.a,
t1.dur,
/* SUM_of_dur */
(SUM(t2.dur)) FORMAT=BEST12. AS SUM_of_dur
FROM WORK.RANKRANKED t1
LEFT JOIN WORK.RANKRANKED t2 ON (t1.id = t2.id) AND (t1.a = t2.a AND (t1.rank_mono >= t2.rank_mono ))
GROUP BY t1.id,
t1.a,
t1.dur;
QUIT;
I stumbled upon the following code snippet in which the variable top3 has to be filled from a table have rather than from an array of numbers.
%let top3 = 14 15 42; /* This should be made obsolete.. */
%let no = 3;
proc sql;
create table want as
select *
from (select x, y from foo) a
%do i = 1 %to &no.;
%let current = %scan(&top3.,&i.); /* What do I need to put here? */
left join (select x, y from bar where z=¤t.) row_¤t.
on a.x = row_¤t..x
%end;
;
quit;
The table have contains the xs from the string and looks as follows:
i x
1 14
2 15
3 42
I am now wondering how I should modify the %let current = ... line such that current is populated from the table have. I know how to populate a macro variable using proc sql with select .. into, but I am afraid that the way I am going right now is fully against SAS philosophy.
It looks like you're more or less transposing something. If that's the case, this is doable in macro/sql pretty easily.
First, here's the simple version - no macro.
proc sql;
create table class_t as
select * from (
select name from sashelp.class ) class
left join (
select name, age as age_Alfred
from sashelp.class
where name='Alfred') Alfred
on class.name = Alfred.name
;
quit;
We grab the value of age from the Alfred row and put it on the main join. This isn't exactly what you're doing, but it seems similar. (I'm just using one table, but you can of course use two here.)
Now, how do we extend this to be table-driven and not handwritten? Macros!
First, here's the macro - just taking the Alfred bit and making it generic.
%macro joiner(name=);
left join (
select name, age as age_&name.
from sashelp.class
where name="&name.") &name.
on class.name = &name..name
%mend joiner;
Second, we look at this and see two things we need to put into macro lists: the SELECT variable list (we'll get one new variable for each call), and the JOIN list.
proc sql;
select cats('%joiner(name=',name,')')
into :joinlist separated by ' '
from sashelp.class;
select cats(name,'.age_',name)
into :selectlist separated by ','
from sashelp.class;
quit;
And then, we just call it!
proc sql;
create table class_t as
select class.name,&selectlist. from (
select name from sashelp.class) class
&joinlist.
;
quit;
Now, your dataset you call the macro lists from is perhaps the dataset with the 3 rows in it you have above ("have"). The dataset you actually get the appending data from is some other dataset ("bar"), right? And then the ones you join to is perhaps a third dataset ("foo"). Here I just use the one, for simplicity, but the concept is the same, just different sources.
When the lookup data is in a table you can perform a three way join without any need for SAS Macro. You don't provide any data so the example will mock some.
Example:
Suppose a master record has several associated detail records, and the detail records contain a z value used for selection into a result set per a wanted z lookup table.
data masters;
call streaminit(2020);
do id = 1 to 100;
do x = 1 to 100;
m_rownum + 1;
code = rand('integer', 10,45);
output;
end;
end;
run;
data details;
call streaminit(2020);
do date = 1 to 20;
do x = 1 to 100;
do rep = 1 to 5;
d_rownum + 1;
amount = rand('integer', 100,200);
z = rand('integer', 10,45);
output;
end;
end;
end;
run;
data zs;
input z ##; datalines;
14 15 42
;
proc sql;
create table want as
select
m_rownum
, d_rownum
, masters.id
, masters.x
, masters.code
, details.z
, details.date
, details.amount
from
masters
left join
details
on
details.x = masters.x
inner join
zs
on
zs.z = details.z
order by
masters.id, masters.x, details.z, details.date
;
quit;
I have the following data strucure, which I need to transform:
I have been working on transposing the data properly to obtain a table structure.
Target structure
Attribute 1, Attribute 2, ...., Attribute n
Outcome1-A, Outcome2-A, ...., Outcome n-A .....
....
Outcome A-Z, Outcome2-Z, ...., Outcome n-Z .....
I started with the following statement, modifications would be great. The static names of the attributes are imported as duplicate records.
PROC TRANSPOSE DATA=INPUT_TAB OUT=vertikal ;
VAR v name ;
id n;
RUN;
The id statement serves tels SAS which variable to use for column names, and if I understand your question well, that is Name, not n. Hence Name should not be in the VAR list
You should also tell SAS when to start a new observation (aka row), using a by statement. Probably one of the columns left of Name is a good candidate. Let's name it group_id. Hence the code should look like this:
PROC TRANSPOSE DATA=INPUT_TAB OUT=vertikal ;
by group_id;
VAR v;
id Name;
RUN;
If there is no variable like group_id available, you can create one this way:
DATA INPUT_VIEW / view=INPUT_VIEW;
set INPUT_TAB;
retain group_id 1;
if Name eq "SZENARIO_ID" then group_id = group_id + 1;
PROC TRANSPOSE DATA=INPUT_VIEW OUT=vertikal ;
by group_id;
VAR v;
id Name;
RUN;
I choose a view to avoid needlessly writing data to disk, but that only matters for large datasets.
I expanded the table using COL_TYP, however the transpose command does not work, see error message
PROC SQL;
CREATE TABLE TEST_CASE_WHEN AS
SELECT A.*,
CASE WHEN A.NAME IN ('SZENARIO_ID') THEN 1
ELSE 0 END AS COL_TYP
FROM WORK.TEST A
;
QUIT;
proc sort
data = work.INPUT_VIEW out= input_view2;
BY N;
RUN;
PROC TRANSPOSE DATA=INPUT_VIEW2 OUT=vertikal ;
by COL_TYP;
VAR v;
id Name;
RUN;
ERROR: Data set WORK.INPUT_VIEW2 is not sorted in ascending sequence. The current BY group has
COL_TYP = 2 and the next BY group has COL_TYP = 0.
Thanks and Kind Regards
Kingsley
I have a variable UserName that contains IDs of variable length. A shortened example:
How can I sort all rows by variable X where longer strings are listed first.
Context: This is for calculating HEI 2015 scores using the ASA24 macro. It writes:
/*Note: Some users have found that the SAS program will drop observations from the analysis if the ID field is not the same length for all observations. To prevent this error, the observations with the longest ID length should be listed first when the data is imported into SAS. */
Proc SQL with an ORDER BY clause specifying an ordering value computed in a CASE expression.
The computation when length(X) > 8 then -length(X) else 0 ensures longest values are first when sorted and all value lengths <= some-capping-length (8) are treated equally
ORDER BY length(X) desc, X would also select longest X values first and then by X itself, but length would predominate ordering even when value lengths < 8.
data have;
length X $50;
input X; datalines;
GFHsp036
GFHsp038
GFHsp039
GFHsp040
GFHsp0400
GFHsp0401
GFHsp0402
GFHsp04021
;
proc sql;
create table want as
select * from have
order by
case when length(x) > 8 then -length(X) else 0 end,
X
;
quit;
proc print;
var X / style=[fontfamily='Courier'];
run;
Here is probably the simplest way to do this
data have;
input string $;
datalines;
abcde
ab
a
abcd
abc
;
proc sql;
create table want as
select * from have
order by length(string) desc;
quit;
Re-ordering IDs did not help in my case as PROC IMPORT needed GUESSINGROWS = MAX.
Please see SAS Macro Truncating IDs
For how to fix the truncating IDs that this question attempted to fix.
Im a new SAS User and I have a small problem
I have one large empty table A with lets say 100 columns that I have created with a simple proc sql; create table
I have another table B with lets say 40 columns and table C with 55 columns.
I want to add these two tables into table A, basically I want a table with 100 columns containing the data from table B & C and I'm doing this with a Union command.
Since I dont have values for all 100 variables I have to set default values.
Lets say I have a column named nourishment in table A, food in table B and has no equivalent in table C. I have rules like "If the data comes from table B then value =xxx if its from table C then Value="DefaultValue"
I'd do this easily with R or python but Im struggling with sas.
I'm using SAS sql commands (a Union command)
How do you set default values ? (for all data types : character, numeric or dateI'm using SAS sql commands )
Dates in SAS are actually just numeric values. Often they have a date format applied to make them readable.
So you could just assign a missing value by default like so:
. as ColumnName
or any default date like so
'17NOV2017'd as ColumnName
. as MyColumnName
SAS can deal with missing values.
Using a specially coded value, such as 'NA', to represent a missing value condition can work but may lead to headaches and extra coding. Recommended read in SAS help: "Working with Missing Values"
The default SAS missing value for numerics (which also includes dates) is period.
. as MyColumnName
SAS also has 27 special missing values for numerics that are expressed as . < character >
.A as MyColumnName
...
.Z as MyColumnName
._ as MyColumnName
The missing value for character variables is a single space
' '
'' empty quote string also works
' ' as does a longer empty string
Rule of thumb: be consistent when coding your missing values.
You can use OPTIONS MISSING to specify what character is shown when a missing value is printed.
OPTIONS MISSING = '*'; * My special representation of missing for this report;
Proc PRINT data=myData;
run;
OPTIONS MISSING = '.'; * Restore to the default;
SAS custom formats can also be used to customize what is printed for missing values.
Proc FORMAT;
value MissingN
. = 'N/A'
.N = 'Special N/A different than regular N/A' /* for .N */
;
value $MissingC
' ' = 'N/A'
;
value SillyChristmasStocking
.C = 'Bad'
.O = 'children'
.A = 'get'
.L = 'No toys'
;
The token after the value keyword can be any new valid SAS name that you want to used for your format name.
Proc PRINT data=myData;
format myColumnName MissingN.;
format name $MissingC.;
format behaviour SillyChristmasStocking.;
run;
As for your character missing value conditions, I would continue to use " " or ' '
You mention UNION which is a SQL feature. In SQL, JOIN also occur, perhaps more often then UNION. When JOINing and values from two source columns collide, you will want to use either COALESCE() function or CASE
statements to select the non-missing value.
I would not recommend using UNION in PROC SQL at any point in your SAS usage. UNION is almost always inferior to a simple data step, or a data step view.
That's because the data step seamlessly handles issues like differing variables on different tables. SAS is quite comfortable with vertically combining datasets; SQL is always a bit trickier when they're not identical.
data c;
set a b;
run;
That runs whether or not a and b are identical, so long as a and b don't have conflicting variable names (that aren't intended to be in the same column); and if they do, just use the rename dataset option to resolve it.
If you do as the above, and don't use union, you'll get a missing value automatically for those dates.
NFN:
DATA Step
A DATA Step approach for stacking data is the simplest. Use SET to stack the data and array processing to apply your defaults. For example:
data stacked_data;
set
TARGET_TEMPLATE (obs = 0)
ONE
TWO
;
array allchar _character_;
array allnum _numeric_;
array dates d1-d5;
do over allchar; if missing(allchar) then allchar = '*UNKNOWN*'; end;
do over allnum; if missing(allnum) then allnum = -995; end;
do over dates; if missing(dates) then dates='01NOV1971'd; end;
run;
A subtle issue is that any missing values in ONE or TWO will be replaced with the default value.
Proc SQL
In Proc SQL you will want to create a single row table containing the default values for A. That table can be joined to the union of B and C. The join select will involve coalesce() in order to choose the predefined default value when a column is not from B or C.
For example, suppose you have an empty (zero rows), richly columned, target table (your A) acting as a template:
data TARGET_TEMPLATE;
length _n_ 8;
length a1-a5 $25 d1-d5 4 x1-x20 y1-y20 z1-z20 p1-p20 q1-q20 r1-r20 8;
call missing (of _all_);
format d1-d5 yymmdd10.;
stop;
run;
Because Proc SQL does not provide syntax for a default constraint you need to create a table of your own defaults. This is probably easiest with DATA Step:
data TARGET_DEFAULTS;
if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv to match TARGET;
array allchar _character_ (1000 * '*UNKNOWN*');
array allnum _numeric_ (1000 * -995);
array d d1-d5 (5 * '01NOV1971'd); * override the allnum array initialization;
output;
stop;
run;
Here is some generated demo data, ONE and TWO, that correspond to your B and C:
data ONE;
if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv of demo data to match TARGET;
do _n_ = 1 to 100;
array a a1 a3 a5;
array num x: y: z:;
array d d1 d2;
do over a; a = catx (' ', 'ONE', _n_, _i_); end;
do over num; num = 1000 + _n_ + _i_; end;
retain foodate '01jan1975'd;
do over d; d=foodate; foodate+1; end;
output;
end;
keep a1 a3 a5 x: y: z: d1 d2; * keep the disparate columns that were populated;
run;
data TWO;
if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv of demo data to match TARGET;
do _n_ = 1 to 200;
array a a1 a2 a3;
array num x5 y5 z5 p: q: r:;
array d d1 d2;
do over a; a = catx (' ', 'TWO', _n_, _i_); end;
do over num; num = 20000 + _n_*10 + _i_; end;
retain foodate '01jan1985'd;
do over d; d=foodate; foodate+1; end;
output;
end;
keep a1 a2 a3 x5 y5 z5 p: q: r:; * keep the disparate columns that were populated;
run;
A stacking of A, B and C is simple SQL but does not introduce target specific default values:
proc sql noprint;
* generic UNION stack with SAS missing values (space and dot) for cells
* where ONE and TWO did not contribute any data;
create table stacked_data as
select * from have_data_TEMPLATE %*** empty template first ensures template column order and formats are honored in output data;
outer union corresponding %*** align by column name, do not remove duplicates;
select * from ONE
outer union corresponding
select * from TWO
;
When the stacking is put in a sub-query, it can be joined with the defaults. The choosing of the target default value for each column involves examining DICTIONARY.COLUMNS and generating the SQL source for selecting the coalescence of stack and default.
proc sql noprint;
* codegen select items ;
select cat('coalesce(STACK.',trim(name),',DEFAULT.',trim(name),') as ',trim(name))
into :coalesces separated by ','
from DICTIONARY.COLUMNS
where libname = 'WORK' and memname = 'HAVE_DATA_TEMPLATE' %* dictionary lib and mem name values are always uppercase;
order by npos
;
create table stacked_data_with_defaults as
select * from TARGET_TEMPLATE %*** output honors template;
outer union corresponding
select
source
, &coalesces %*** apply codegen;
from
(
select * from WORK.have_data_TEMPLATE %*** ensure fully columned sub-select that will align with coalesces;
outer union corresponding
select 'one' as source, * from ONE
outer union corresponding
select 'two' as source, * from TWO
) as STACK
join
TARGET_DEFAULTS as DEFAULT
on 1=1
;
quit;
Why would you create an empty dataset? What is it going to be used for? Perhaps you want to use it as a default structure definition? If so and you want to stack B and C and get them in the structure defined by A you could code this way.
data want ;
set a(obs=0) b c ;
run;
Not sure what the purpose would be to have default values. Couldn't you use formats if you want missing values to display in special ways?
Or you could create code to default values and perhaps just %include it or wrap the logic into a macro. So it you had a code file name 'defaults.sas' with lines like this.
startdate=coalesce(startdate,'01JAN2013'd);
gender=coalescec(gender,'UNKNOWN');
Then your little program to make a new dataset that looks like A and uses the data from B and C would look like this.
data want ;
set a(obs=0) b c ;
%include 'defaults.sas';
run;
If you really did want to aggregate the records into some large dataset then perhaps you want to use PROC APPEND to add the records once they are created in the right structure.
proc append data=want base=a ;
run;