Column Sorting with PROC Transpose in SAS

Column Sorting with PROC Transpose in SAS - sas

I am using below mentioned code to get the columns sorted dynamically after proc transpose. I have gone a lot of solutions for this solution. But now I am getting an error if I run
data work.AB ;
input name $ class $ dt $ gpa $;
datalines;
JOHN 1 201607 C-
JOHN 1 201608 C+
JOHN 1 201702 B-
JOHN 2 201608 A
NICK 1 201608 A
NICK 1 201707 A
MIKE 2 201608 B
MIKE 2 201607 B
MIKE 2 201707 B+
MIKE 2 201702 B
BOB 3 201702 D
BOB 3 201607 C
BOB 3 201707 C
;
proc sort data=work.AB;
by NAME ClASS dt;
run;
PROC TRANSPOSE DATA = AB OUT = ABC(drop=_name_) ;
BY nAME cLASS;
VAR GPA;
ID dt;
RUN ;
proc sql ;
create table test as
select name into : list separated by ' '
from dictionary.columns
where libname='WORK' and memname='ABC'
order by input(substr(name,anydigit(name)),best32.)
;
quit;
%put &list;
data want;
retain &list;
set ABC;
run;
Error that I get is
22 GOPTIONS ACCESSIBLE;
WARNING: Apparent symbolic reference LIST not resolved.
23 %put &list;
&list
24 data want;
25 retain &list;
_
22
200
WARNING: Apparent symbolic reference LIST not resolved.
ERROR 22-322: Syntax error, expecting one of the following: a name, ;, _ALL_, _CHARACTER_, _CHAR_, _NUMERIC_.
ERROR 200-322: The symbol is not recognized and will be ignored.
26 set ABC;
27 run;
Kindly suggest.

You cannot put the values from the same SELECT statement into both a dataset and macro variables. Remove the create table test as from your SQL code.
You also might want to suppress some of the warnings by changing the query to:
proc sql noprint ;
select case when (not anydigit(name)) then -1
else input(substr(name,anydigit(name)),?32.)
end as order
, name
into :list
, :list separated by ' '
from dictionary.columns
where libname='WORK' and memname='ABC'
order by 1
;
quit;
%put &list;

Related

Look up and replace values from a separate table in SAS

Dataset HAVE includes two variables with misspelled names in them: names and friends.
Name Age Friend
Jon 11 Ann
Jon 11 Tom
Jimb 12 Egg
Joe 11 Egg
Joe 11 Anne
Joe 11 Tom
Jed 10 Ann
I have a small dataset CORRECTIONS that includes wrong_names and resolved_names.
current_names resolved_names
Jon John
Ann Anne
Jimb Jim
I need any name in names or friends in HAVE that matches a name in the wrong_names column of CORRECTIONS to get recoded to the corresponding string in resolved_name. The resulting dataset WANT should look like this:
Name Age Friend
John 11 Anne
John 11 Tom
Jim 12 Egg
Joe 11 Egg
Joe 11 Anne
Joe 11 Tom
Jed 10 Anne
In R, I could simply invoke each dataframe and vector using if_else(), but the DATA step in SAS doesn't play nicely with multiple datasets. How can I make these replacements using CORRECTIONS as a look-up table?

There are many ways to do a lookup in SAS.
First of all, however, I would suggest to de-duplicate your look-up table (for example, using PROC SORT and Data Step/Set/By) - deciding which duplicate to keep (if any exist).
As for the lookup task itself, for simplicity and learning I would suggest the following:
The "OLD SCHOOL" way - good for auditing inputs and outputs (it is easier to validate the results of a join when input tables are in the required order):
*** data to validate;
data have;
length name $10. age 4. friend $10.;
input name age friend;
datalines;
Jon 11 Ann
Jon 11 Tom
Jimb 12 Egg
Joe 11 Egg
Joe 11 Anne
Joe 11 Tom
Jed 10 Ann
run;
*** lookup table;
data corrections;
length current_names $10. resolved_names $10.;
input current_names resolved_names;
datalines;
Jon John
Ann Anne
Jimb Jim
run;
*** de-duplicate lookup table;
proc sort data=corrections nodupkey; by current_names; run;
proc sort data=have; by name; run;
data have_corrected;
merge have(in=a)
corrections(in=b rename=(current_names=name))
;
by name;
if a;
if b then do;
name=resolved_names;
end;
run;
The SQL way - which avoids sorting the have table:
proc sql;
create table have_corrected_sql as
select
coalesce(b.resolved_names, a.name) as name,
a.age,
a.friend
from work.have as a left join work.corrections as b
on a.name eq b.current_names
order by name;
quit;
NB the Coalesce() is used to replace missing resolved_names values (ie when there is no correction) with names from the have table
EDIT: To reflect Quentin's (CORRECT) comment that I'd missed the update to both name and friend fields.
Based on correcting the 2 fields, again many approaches but the essence is one of updating a value only IF it exists in the lookup (corrections) table. The hash object is pretty good at this, once you've understood it's declaration.
NB: any key fields in the Hash object need to be specified on a Length statement BEFOREHAND.
EDIT: as per ChrisJ's alternative to the Length statement declaration, and my reply (see below) - it would be better to state that key variables need to be defined BEFORE you declare the hash table.
data have_corrected;
keep name age friend;
length current_names $10.;
*** load valid names into hash lookup table;
if _n_=1 then do;
declare hash h(dataset: 'work.corrections');
rc = h.defineKey('current_names');
rc = h.defineData('resolved_names');
rc = h.defineDone();
end;
do until(eof);
set have(in=a) end=eof;
*** validate both name fields;
if h.find(key:name) eq 0 then
name = resolved_names;
if h.find(key:friend) eq 0 then
friend = resolved_names;
output;
end;
run;
EDIT: to answer the comments re ChrisJ's SQL/Update alternative
Basically, you need to restrict each UPDATE statement to ONLY those rows that have name values or friend values in the corrections table - this is done by adding another where clause AFTER you've specified the set var = (clause). See below.
NB. AFAIK, an SQL solution to your requirement will require MORE than 1 pass of both the base table & the lookup table.
The lookup/hash table, however, requires a single pass of the base table, a load of the lookup table and then the lookup actions themselves. You can see the performance difference in the log...
proc sql;
*** create copy of have table;
create table work.have_sql as select * from work.have;
*** correct name field;
update work.have_sql as u
set name = (select resolved_names
from work.corrections as n
where u.name=n.current_names)
where u.name in (select current_names from work.corrections)
;
*** correct friend field;
update work.have_sql as u
set friend = (select resolved_names
from work.corrections as n
where u.friend=n.current_names)
where u.friend in (select current_names from work.corrections)
;
quit;

Given data
*** data to validate;
data have;
length name $10. age 4. friend $10.;
input name age friend;
datalines;
Jon 11 Ann
Jon 11 Tom
Jimb 12 Egg
Joe 11 Egg
Joe 11 Anne
Joe 11 Tom
Jed 10 Ann
run;
*** lookup table;
data corrections;
length from_name $10. to_name $10.;
input from_name to_name;
datalines;
Jon John
Ann Anne
Jimb Jim
run;
One SQL alternative is to perform a existent mapping select look up on each field to be mapped. This would be counter to joining the corrections table one time for each field to be mapped.
proc sql;
create table want1 as
select
case when exists (select * from corrections where from_name=name)
then (select to_name from corrections where from_name=name)
else name
end as name
, age
, case when exists (select * from corrections where from_name=friend)
then (select to_name from corrections where from_name=friend)
else friend
end as friend
from
have
;
Another, SAS only way, to perform inline left joins is to use a custom format.
data cntlin;
set corrections;
retain fmtname '$cohen'; /* the fixer */
rename from_name=start to_name=label;
run;
proc format cntlin=cntlin;
run;
data want2;
set have;
name = put(name,$cohen.);
friend = put(friend,$cohen.);
run;

You can use an UPDATE in proc sql :
proc sql ;
update have a
set name = (select resolved_names b from corrections where a.name = b.current_names)
where name in(select current_names from corrections)
;
update have a
set friend = (select resolved_names b from corrections where a.friend = b.current_names)
where friend in(select current_names from corrections)
;
quit ;
Or, you could use a format :
/* Create format */
data current_fmt ;
retain fmtname 'NAMEFIX' type 'C' ;
set resolved_names ;
start = current_names ;
label = resolved_names ;
run ;
proc format cntlin=current_fmt ; run ;
/* Apply format */
data want ;
set have ;
name = put(name ,$NAMEFIX.) ;
friend = put(friend,$NAMEFIX.) ;
run ;

Try this:
proc sql;
create table want as
select p.name,p.age,
case
when q.current_names is null then p.friend
else q.resolved_names
end
as friend1
from
(
select
case
when b.current_names is null then a.name
else b.resolved_names
end
as name,
a.age,a.friend
from
have a
left join
corrections b
on upcase(a.name) = upcase(b.current_names)
) p
left join
corrections q
on upcase(p.friend) = upcase(q.current_names);
quit;
Output:
name age friend
John 11 Anne
Jed 10 Anne
Joe 11 Anne
Jim 12 Egg
Joe 11 Egg
Joe 11 Tom
John 11 Tom
Let me know in case of any clarifications.

how to split four columns into two tables in sas report

I have one table having 4 columns and i want to separate them into 2 table 2 columns in one table and 2 columns in another table.but both table should be below to each other.I want this in proc report format.code should be in report.
id name age gender
1 abc 21 m
2 pqr 23 f
3 qwe 25 f
4 ert 54 m
i want id and name in one table and age and gender in other table.but one below the other in ods excel.

I've split the main table into two tables using a data setp then appended them to each other, I added an extra columns called "source" in order to be differniate between the tables. if you use a Proc report you can group by "source"
Code:
*Create input data*/
data have;
input id name $ age gender $ ;
datalines;
1 abc 21 m
2 pqr 23 f
3 qwe 25 f
4 ert 54 m
;;;;
run;
/*Split / create first table*/
data table1;
set have;
source="table1: id & name";
keep source id name ;
run;
/*Split / create second table*/
data table2;
set have;
source="table2: age & gender";
keep source age gender;
run;
/*create Empty table*/
data want;
length Source $30. column1 8. column2 $10.;
run;
proc sql; delete * from want; quit;
/* Append both tables to each other*/
proc append base= want data=table1(rename=(id=column1 name=column2)) force ; run;
proc append base= want data=table2(rename=(age=column1 gender=column2)) force ; run;
/*Create Report*/
proc report data= want;
col source column1 column2 ;
define source / group;
run;
Output Table:
Report:

For data
data have;input
id name $ age gender $; datalines;
1 abc 21 m
2 pqr 23 f
3 qwe 25 f
4 ert 54 m
run;
Being output as Excel, the splitting into two parts can be done via two Proc REPORT steps; each step responsible for a single set of columns. Options are used in the ODS EXCEL to control how sheet processing is handled.
The first step manages the common header through DEFINE, the subsequent steps are NOHEADER and don't need DEFINE statements. Each step must define and compute the value of the new source column. There will be a one Excel row gap between each table.
ods _all_ close;
ods excel file='want.xlsx' options(sheet_interval='NONE');
proc report data=have;
column source id name;
define id / 'Column 1';
define name / 'Column 2';
define source / format=$20.;
compute source / character length=20; source='ID and NAME'; endcomp;
run;
proc report data=have noheader;
column source age gender;
define source / format=$20.;
compute source / character length=20; source='AGE and GENDER'; endcomp;
run;
ods excel close;
There is no reasonable single Proc REPORT step that would produce similar output from dataset have.

How to use proc summary and keep all variables (without naming them)

I want to sum over a specific variable in my dataset, without loosing all the other columns. I have tried the following code:
proc summary data=work.test nway missing;
class var_1 var_2 ; *groups;
var salary;
id _character_ _numeric_; * keeps all variables;
output out=test2(drop=_:) sum= ;
run;
But it does not seem to sum properly, and for the "salary" column I'm just left with the value of the last value in each group (var_1 and var_2). If I remove
id _character_ _numeric_;
it works fine, but I loose all other columns.
Example:
data:
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
desired output:
John Sales 66 M
Mary Acctng 21 F

I think this does what you want. You still get warnings about name conflicts and variables being dropped but at least the ones you want are kept. The ID statement is depreciated in favor in the new and better IDGROUP output statement option.
You could add the AUTONAME option to the output statement if you wanted PROC SUMMARY to automatically rename the conflicting variables.
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;;;;
run;
proc print;
run;
proc summary nway missing;
class name dept;
var salary;
output out=test2(drop=_:) sum= idgroup(out(_all_)=);
run;
proc print;
run;

Try this:
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
proc sql;
create table salary2 as
select *,
monotonic() as n,
sum(salary) as sum_salary
from salary
group by name
having max(n)=n;
quit;

I wasn't aware that SAS did this, but the problem appears to lie in the fact that the id statement takes preference over the var statement. By including all variables in the id statement, all the output is showing is the maximum value for each variable, including Salary.
One option is to pull a list of the variables not included in the class or var statements from dictionary.columns, then use that list in the id statement. Just be aware that proc summary runs in memory and I have come across out of memory problems in the past when many variables have been included in the id statement
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
proc sql noprint;
select name into :cols separated by ' '
from dictionary.columns
where libname='WORK'
and
memname='SALARY'
and
name not in ('name','Salary');
quit;
%put &cols.;
proc summary data=salary nway missing;
class name;
var salary;
id &cols.;
output out=want (drop=_:) sum=;
run;

Sas Macro to semi-efficiently manipulate data

Objective: Go from Have table + Help table to Want table. The current implementation (below) is slow. I believe this is a good example of how not to use SAS Macros, but I'm curious as to whether...
1. the macro approach could be salvaged / made fast enough to be viable
(e.g. proc append is supposed to speed up the action of stacking datasets, but I was unable to see any performance gains.)
2. what all the alternatives would look like.
I have written a non-macro solution that I will post below for comparison sake.
Data:
data have ;
input name $ term $;
cards;
Joe 2000
Joe 2002
Joe 2008
Sally 2001
Sally 2003
; run;
proc print ; run;
data help ;
input terms $ ;
cards;
2000
2001
2002
2003
2004
2005
2006
2007
2008
; run;
proc print ; run;
data want ;
input name $ term $ status $;
cards;
Joe 2000 here
Joe 2001 gone
Joe 2002 here
Joe 2003 gone
Joe 2004 gone
Joe 2005 gone
Joe 2006 gone
Joe 2007 gone
Joe 2008 here
Sally 2001 here
Sally 2002 gone
Sally 2003 here
; run;
proc print data=have ; run;
I can write a little macro to get me there for each individual:
%MACRO RET(NAME);
proc sql ;
create table studtermlist as
select distinct term
from have
where NAME = "&NAME"
;
SELECT Max(TERM) INTO :MAXTERM
FROM HAVE
WHERE NAME = "&NAME"
;
SELECT MIN(TERM) INTO :MINTERM
FROM HAVE
WHERE NAME = "&NAME"
;
CREATE TABLE TERMLIST AS
SELECT TERMS
FROM HELP
WHERE TERMS BETWEEN "&MINTERM." and "&MAXTERM."
ORDER BY TERMS
;
CREATE TABLE HEREGONE_&Name AS
SELECT
A.terms ,
"&Name" as Name,
CASE
WHEN TERMS EQ TERM THEN 'Here'
when term is null THEN 'Gone'
end as status
from termlist a left join studtermlist b
on a.terms eq b.term
;
quit;
%MEND RET ;
%RET(Joe);
%RET(Sally);
proc print data=HEREGONE_Joe; run;
proc print data=HEREGONE_Sally; run;
But it's incomplete. If I loop through for (presumably quite a few names)...
*******need procedure for all names - grab info on have ;
proc sql noprint;
select distinct name into :namelist separated by ' '
from have
; quit;
%let n=&sqlobs ;
%MACRO RETYA ;
OPTIONS NONOTEs ;
%do i = 1 %to &n ;
%let currentvalue = %scan(&namelist,&i);
%put &currentvalue ;
%put &i ;
%RET(&currentvalue);
%IF &i = 1 %then %do ;
data base; set HEREGONE_&currentvalue; run;
%end;
%IF &i gt 1 %then %do ;
proc sql ; create table base as
select * from base
union
select * from HEREGONE_&currentvalue
;
drop table HEREGONE_&currentvalue;
quit;
%end;
%end ;
OPTIONS NOTES;
%MEND;
%RETYA ;
proc sort data=base ; by name terms; run;
proc print data=base; run;
So now I have want, but with 6,000 names, it takes over 20 minutes.

Let's try the alternative solution. For each name find the min/max term via a proc SQL data step. Then use a data step to create the time period table and merge that with your original table.
*Sample data;
data have ;
input name $ term ;
cards;
Joe 2000
Joe 2002
Joe 2008
Sally 2001
Sally 2003
; run;
*find min/max of each name;
proc sql;
create table terms as
select name, min(term) as term_min, max(term) as term_max
from have
group by name
order by name;
quit;
*Create table with the time periods for each name;
data empty;
set terms;
do term=term_min to term_max;
output;
end;
drop term_min term_max;
run;
*Create final table by merging the original table with table previously generated;
proc sql;
create table want as
select a.name, a.term, case when missing(b.term) then 'Gone'
else 'Here' end as status
from empty a
left join have b
on a.name=b.name
and a.term=b.term
order by a.name, a.term;
quit;
EDIT: Now looking at your macro solution, part of the problem is that you're scanning your table too many times.
The first table, studenttermlist is not required, the last join can
be filtered instead.
The two macro variables, min/max term can be
calculated in a single pass
Avoid the smaller interim term list and use a where clause to filter your results
Use Call Execute to call your macro rather than another macro loop
Rather than loop through to append the
data, take advantage of a naming convention and use a single data
step to append all outputs.
%MACRO RET(NAME);
proc sql noprint;
SELECT MIN(TERM), Max(TERM) INTO :MINTERM, :MAXTERM
FROM HAVE
WHERE NAME = "&NAME"
;
CREATE TABLE _HG_&Name AS
SELECT
A.terms ,
"&Name" as Name,
CASE
WHEN TERMS EQ TERM THEN 'Here'
when term is null THEN 'Gone'
end as status
from help a
left join have b
on a.terms eq b.term
and b.name="&name"
where a.terms between "&minterm" and "&maxterm";
;
quit;
%MEND RET ;
*call macro;
proc sort data=have;
by name term;
run;
data _null_;
set have;
by name;
if first.name then do;
str=catt('%ret(', name, ');');
call execute(str);
end;
run;
*append results;
data all;
set _hg:;
run;

You can actually do this in a single nested SQL query. It would be messy and hard to read.
I'm going to break it out into the three components.
First, get the distinct names;
proc sql noprint;
create table names as
select distinct name from have;
quit;
Second, Cartesian product names and terms to get all the combos.
proc sql noprint;
create table temp as
select a.name, b.terms as term
from names as a,
help as b;
quit;
Third, left join to find the matches
proc sql noprint;
create table want as
select a.name,
a.term,
case
when missing(b.term) then "gone"
else "here"
end as Status
from temp as a
left join
have as b
on a.name=b.name
and a.term=b.term;
quit;
Last, delete the temp table to save space;
proc datasets lib=work nolist;
delete temp;
run;
quit;
As Reeza shows, there are other ways to do this. As I said above, you can merge all this into a single SQL join and get the results you want. Depending on computer memory and data size, it should be OK (and might be faster as everything is in memory).

proc sql;
create table want as
select c.name, c.terms, a.term,
( case when missing(a.term) then "Gone"
else "Here" end ) as status
from (select distinct a.name, b.terms
from have a, help b) c
left join have a
on c.terms = a.term and c.name = a.name
order by c.name, c.terms, a.term
;

I'm going to throw in my similar answer so I can compare them all later.
proc sql ;
create table studtermlist as
select distinct term,name
from have
;
create table MAXMINTERM as
SELECT Max(TERM) as MAXTERM, Min(TERM) as MINTERM, name
FROM HAVE
GROUP BY name
;
CREATE TABLE TERMLIST AS
SELECT TERMS,name
FROM HELP a,MAXMINTERM b
WHERE TERMS BETWEEN MINTERM and MAXTERM
ORDER BY name,TERMS
;
CREATE TABLE HEREGONE AS
SELECT
a.terms ,
a.Name ,
CASE
WHEN TERMS EQ TERM THEN 'Here'
when term is null THEN 'Gone'
end as status
from termlist a left join studtermlist b
on a.terms eq b.term
and a.name eq b.name
order by name, terms
;
quit;

How to sort by formatted values

proc sort data=sas.mincome;
by F3 F4;
run;
Proc sort doesn't sort the dataset by formatted values, only internal values. I need to sort by two variables prior to a merge. Is there anyway to do this with proc sort?

I don't think you can sort by formatted values in proc sort, but you can definitely use a simple proc SQL procedure to sort a dataset by formatted values. proc SQL is similar to the data step and proc sort, but is more powerful.
The general syntax of proc sql for sorting by formatted values will be:
proc sql;
create table NewDataSet as
select variable(s)
from OriginalDataSet
order by put(variable1, format1.), put(variable2, format2.);
quit;
For example, we have a sample data set containing the names, sex and ages of some people and we want to sort them:
proc format;
value gender 1='Male'
2='Female';
value age 10-15='Young'
16-24='Old';
run;
data work.original;
input name $ sex age;
datalines;
John 1 12
Zack 1 15
Mary 2 18
Peter 1 11
Angela 2 24
Jack 1 16
Lucy 2 17
Sharon 2 12
Isaac 1 22
;
run;
proc sql;
create table work.new as
select name, sex format=gender., age format=age.
from work.original
order by put(sex, gender.), put(age, age.);
quit;
Output of work.new will be:
Obs name sex age
1 Mary Female Old
2 Angela Female Old
3 Lucy Female Old
4 Sharon Female Young
5 Jack Male Old
6 Isaac Male Old
7 John Male Young
8 Zack Male Young
9 Peter Male Young
If we had used proc sort by sex, then Males would have been ranked first because we had used 1 to represent Males and 2 to represent Females which is not what we want. So, we can clearly see that proc sql did in fact sort them according to the formatted values (Females first, Males second).
Hope this helps.

Because of the nature of formats, SAS only uses the underlying values for the sort. To my knowledge, you cannot change that (unless you want to build your own translation table via PROC TRANTAB).
What you can do is create a new column that contains the formatted value. Then you can sort on that column.
proc format library=work;
value $test 'z' = 'a'
'y' = 'b'
'x' = 'c';
run;
data test;
format val $test.;
informat val $1.;
input val $;
val_fmt = put(val,$test.);
datalines;
x
y
z
;
run;
proc print data=test(drop=val_fmt);
run;
proc sort data=test;
by val_fmt;
run;
proc print data=test(drop=val_fmt);
run;
Produces
Obs val
1 c
2 b
3 a
Obs val
1 a
2 b
3 c

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Column Sorting with PROC Transpose in SAS - sas

Related

Look up and replace values from a separate table in SAS

how to split four columns into two tables in sas report

How to use proc summary and keep all variables (without naming them)

Sas Macro to semi-efficiently manipulate data

How to sort by formatted values

Categories

Resources