Hi, I have two tables whose columns are in different orders, and the column names are not capitalized the same way. How can I check whether the contents of the two tables are the same?
For example, I have two tables of students' grades.
Table A:
Name    Math   English   History
-------+------+---------+---------
Tim     98     95        90
Helen   100    92        85
Table B:
Name    history   MATH   english
-------+---------+------+---------
Tim     90        98     95
Helen   85        100    92
You can use either of the following two approaches; both work regardless of column order or the capitalization of column names.
/*1. Proc compare*/
proc sort data=A; by name; run;
proc sort data=B; by name; run;
proc compare base=A compare=B;
id name;
run;
/*2. Proc SQL*/
proc sql;
select Math, English, History from A
except /* or union / intersect, depending on the check you need */
select MATH, english, history from B;
quit;
Use EXCEPT CORR (CORRESPONDING); it matches columns by name rather than by position. If everything matches, you will get zero records.
data have1;
input Math English History;
datalines;
1 2 3
;
run;
data have2;
input English math History;
datalines;
2 1 3
;
run;
proc sql ;
select * from have1
except corr
select * from have2;
quit;
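One caveat: EXCEPT only returns rows from the first query that are missing from the second, so a one-way check can miss extra rows in have2. A minimal sketch that runs the check in both directions, assuming the have1 and have2 datasets above (the titles are only illustrative):

```sas
/* Rows in have1 with no match in have2, then the reverse.
   Both queries returning zero rows suggests the tables agree. */
proc sql;
  title 'In have1 but not in have2';
  select * from have1
  except corr
  select * from have2;

  title 'In have2 but not in have1';
  select * from have2
  except corr
  select * from have1;
quit;
title; /* clear the title */
```

If duplicate rows matter, EXCEPT ALL CORR keeps duplicates instead of collapsing them.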
Edit 1:
If you want to see which particular column differs, you may have to transpose and compare, as in the example below.
data have1;
input name $ Math English pyschology History;
datalines;
Tim 98 95 76 90
Helen 100 92 55 85
;
run;
data have2;
input name $ English Math pyschology History;
datalines;
Tim 95 98 76 90
Helen 92 100 99 85
;
run;
proc sort data = have1 out =hav1;
by name;
run;
proc sort data = have2 out =hav2;
by name;
run;
proc transpose data =hav1 out=newhave1 (rename = (_name_= subject
col1=marks));
by name;
run;
proc transpose data =hav2 out=newhave2 (rename = (_name_= subject
col1=marks));
by name;
run;
proc sql;
create table want(drop=mark_dif) as
select
a.name as name
,a.subject as subject
,a.marks as have1_marks
,b.marks as have2_marks
,a.marks -b.marks as mark_dif
from newhave1 a inner join newhave2 b
on upcase(a.name) = upcase(b.name)
and upcase(a.subject) =upcase(b.subject)
where calculated mark_dif ne 0;
quit;
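As a possible alternative, PROC COMPARE can write the differing observations to a dataset. This is only a sketch, assuming the sorted hav1 and hav2 datasets from above; the output dataset name diffs is hypothetical.

```sas
/* OUTNOEQUAL keeps only observations where at least one variable differs;
   OUTDIF writes the differences between the matched observations;
   NOPRINT suppresses the printed comparison report */
proc compare base=hav1 compare=hav2
             out=diffs outnoequal outdif noprint;
  id name;
run;

proc print data=diffs;
run;
```

Unlike the transpose approach, this keeps one row per name, so it may be easier to scan when only a few observations differ.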
I have one table with 4 columns, and I want to split it into 2 tables: 2 columns in one table and 2 columns in the other. The two tables should appear one below the other. I want this done with PROC REPORT.
id name age gender
1 abc 21 m
2 pqr 23 f
3 qwe 25 f
4 ert 54 m
I want id and name in one table and age and gender in the other, one below the other, in ODS Excel.
I've split the main table into two tables using a DATA step and then appended them to each other. I added an extra column called "source" to differentiate between the tables. If you use PROC REPORT, you can group by "source".
Code:
/*Create input data*/
data have;
input id name $ age gender $ ;
datalines;
1 abc 21 m
2 pqr 23 f
3 qwe 25 f
4 ert 54 m
;;;;
run;
/*Split / create first table*/
data table1;
set have;
source="table1: id & name";
keep source id name ;
run;
/*Split / create second table*/
data table2;
set have;
source="table2: age & gender";
keep source age gender;
run;
/*create Empty table*/
data want;
length Source $30. column1 8. column2 $10.;
run;
proc sql; delete * from want; quit;
/* Append both tables to each other*/
proc append base= want data=table1(rename=(id=column1 name=column2)) force ; run;
proc append base= want data=table2(rename=(age=column1 gender=column2)) force ; run;
/*Create Report*/
proc report data= want;
col source column1 column2 ;
define source / group;
run;
For the data
data have;
input id name $ age gender $;
datalines;
1 abc 21 m
2 pqr 23 f
3 qwe 25 f
4 ert 54 m
;
run;
Because the output goes to Excel, the split can be done with two PROC REPORT steps, each responsible for one set of columns. Options on the ODS EXCEL statement control how sheets are handled.
The first step supplies the common header through DEFINE statements; the subsequent steps use NOHEADER and do not need DEFINE statements for column headers. Each step must define and compute the value of the new source column. There will be a one-row gap in Excel between the tables.
ods _all_ close;
ods excel file='want.xlsx' options(sheet_interval='NONE');
proc report data=have;
column source id name;
define id / 'Column 1';
define name / 'Column 2';
define source / format=$20.;
compute source / character length=20; source='ID and NAME'; endcomp;
run;
proc report data=have noheader;
column source age gender;
define source / format=$20.;
compute source / character length=20; source='AGE and GENDER'; endcomp;
run;
ods excel close;
There is no reasonable single Proc REPORT step that would produce similar output from dataset have.
This is some example data. The real data is more complex, with other fields, about 40,000 observations, and up to 180 values per id (I know that I will get 360 rows in the transposed table, but that's OK):
Data have;
input lastname $ firstname $ value;
datalines;
miller george 47
miller george 45
miller henry 44
miller peter 45
smith peter 42
smith frank 46
;
run;
And I want to transpose it this way, so that I have lastname followed by alternating firstname and value for every line matching that lastname.
data want:
Lastname  Firstname1  Value1  Firstname2  Value2  Firstname3  Value3  Firstname4  Value4
miller    george      47      george      45      henry       44      peter       45
smith     peter       42      frank       46
I tried a bit with PROC TRANSPOSE, but I was not able to build a table exactly the way I want it, as described above. I need the want table in exactly that layout (the real data is more complex and has other fields), so please no answers that propose a want table with a different layout.
PROC SUMMARY has a very useful option for this, IDGROUP. You need to specify how many values to keep per lastname, so I've included a step to calculate the maximum number.
Data have;
input lastname $ firstname $ value;
datalines;
miller george 47
miller george 45
miller henry 44
miller peter 45
smith peter 42
smith frank 46
;
run;
/* get frequency count of lastnames */
proc freq data=have noprint order=freq;
table lastname / out=name_freq;
run;
/* store maximum into a macro variable (first record will be the highest) */
data _null_;
set name_freq (obs=1);
call symputx('max_num', count); /* symputx converts the number without a note and strips blanks */
run;
%put &max_num.;
/* transpose data using proc summary */
proc summary data=have nway;
class lastname;
output out=want (drop=_:)
idgroup(out[&max_num.] (firstname value)=) / autoname;
run;
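The maximum group size could also be computed in a single PROC SQL step instead of PROC FREQ plus a DATA step. A minimal sketch, assuming the have dataset above and SAS 9.3 or later for the TRIMMED keyword and the %put &= syntax:

```sas
proc sql noprint;
  /* count rows per lastname, then keep the largest count */
  select max(n) into :max_num trimmed
  from (select count(*) as n
        from have
        group by lastname);
quit;
%put &=max_num;
```

Either way, the macro variable max_num is then available for the out[&max_num.] specification in PROC SUMMARY.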
I want to sum over a specific variable in my dataset without losing all the other columns. I have tried the following code:
proc summary data=work.test nway missing;
class var_1 var_2 ; *groups;
var salary;
id _character_ _numeric_; * keeps all variables;
output out=test2(drop=_:) sum= ;
run;
But it does not seem to sum properly: for the "salary" column I'm just left with the last value in each group (var_1 and var_2). If I remove
id _character_ _numeric_;
it works fine, but I lose all the other columns.
Example:
data:
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
desired output:
John Sales 66 M
Mary Acctng 21 F
I think this does what you want. You still get warnings about name conflicts and variables being dropped, but at least the ones you want are kept. The ID statement is deprecated in favor of the newer and better IDGROUP option of the OUTPUT statement.
You could add the AUTONAME option to the output statement if you wanted PROC SUMMARY to automatically rename the conflicting variables.
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;;;;
run;
proc print;
run;
proc summary nway missing;
class name dept;
var salary;
output out=test2(drop=_:) sum= idgroup(out(_all_)=);
run;
proc print;
run;
Try this:
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
proc sql;
create table salary2 as
select *,
monotonic() as n,
sum(salary) as sum_salary
from salary
group by name
having max(n)=n;
quit;
I wasn't aware that SAS did this, but the problem appears to lie in the fact that the ID statement takes precedence over the VAR statement. When all variables are included in the ID statement, the output shows only the maximum value for each variable, including Salary.
One option is to pull a list of the variables not included in the CLASS or VAR statements from dictionary.columns, then use that list in the ID statement. Just be aware that PROC SUMMARY runs in memory, and I have come across out-of-memory problems in the past when many variables were included in the ID statement.
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
proc sql noprint;
select name into :cols separated by ' '
from dictionary.columns
where libname='WORK'
and
memname='SALARY'
and
name not in ('name','Salary');
quit;
%put &cols.;
proc summary data=salary nway missing;
class name;
var salary;
id &cols.;
output out=want (drop=_:) sum=;
run;
proc sort data=sas.mincome;
by F3 F4;
run;
PROC SORT doesn't sort the dataset by formatted values, only by internal values. I need to sort by two variables prior to a merge. Is there any way to do this with PROC SORT?
I don't think you can sort by formatted values in PROC SORT, but you can use a simple PROC SQL step to sort a dataset by formatted values. PROC SQL can do much of what the DATA step and PROC SORT do, and its ORDER BY clause accepts expressions such as PUT().
The general syntax of proc sql for sorting by formatted values will be:
proc sql;
create table NewDataSet as
select variable(s)
from OriginalDataSet
order by put(variable1, format1.), put(variable2, format2.);
quit;
For example, we have a sample data set containing the names, sex and ages of some people and we want to sort them:
proc format;
value gender 1='Male'
2='Female';
value age 10-15='Young'
16-24='Old';
run;
data work.original;
input name $ sex age;
datalines;
John 1 12
Zack 1 15
Mary 2 18
Peter 1 11
Angela 2 24
Jack 1 16
Lucy 2 17
Sharon 2 12
Isaac 1 22
;
run;
proc sql;
create table work.new as
select name, sex format=gender., age format=age.
from work.original
order by put(sex, gender.), put(age, age.);
quit;
Output of work.new will be:
Obs name sex age
1 Mary Female Old
2 Angela Female Old
3 Lucy Female Old
4 Sharon Female Young
5 Jack Male Old
6 Isaac Male Old
7 John Male Young
8 Zack Male Young
9 Peter Male Young
If we had used proc sort by sex, then Males would have been ranked first because we had used 1 to represent Males and 2 to represent Females which is not what we want. So, we can clearly see that proc sql did in fact sort them according to the formatted values (Females first, Males second).
Hope this helps.
Because of the nature of formats, SAS only uses the underlying values for the sort. To my knowledge, you cannot change that (unless you want to build your own translation table via PROC TRANTAB).
What you can do is create a new column that contains the formatted value. Then you can sort on that column.
proc format library=work;
value $test 'z' = 'a'
'y' = 'b'
'x' = 'c';
run;
data test;
format val $test.;
informat val $1.;
input val $;
val_fmt = put(val,$test.);
datalines;
x
y
z
;
run;
proc print data=test(drop=val_fmt);
run;
proc sort data=test;
by val_fmt;
run;
proc print data=test(drop=val_fmt);
run;
Produces
Obs val
1 c
2 b
3 a
Obs val
1 a
2 b
3 c
I have some data (created using the code below) in which observations are ranked according to two variables. In this case, it ranks the players' first bet and second bet and creates two 'rank' variables. What I want to do instead is rank the observations according to a function of the two variables (such as their average), and I'd like to do this in PROC RANK itself rather than in a preliminary DATA step, as the ranking will get fairly involved once I replicate this on all the variables I need. Can I put operators into the PROC RANK statement? Rather than doing this:
Proc rank data=want ties=mean out=ranked groups=2;
var bet1stake bet2stake;
ranks bet1stakeRank bet2stakeRank;
run;
I would like to do this:
Proc rank data=want ties=mean out=ranked groups=2;
var avg(bet1stake, bet2stake);
ranks firstTwoBetsRank;
run;
Is this possible?
This is how the full example data can be created.
data have;
input username $ betdate : datetime. stake winnings;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90
player1 04NOV2008:09:03:44 100 40
player2 07NOV2008:14:03:33 120 -120
player1 05NOV2008:09:00:00 50 15
player1 05NOV2008:09:05:00 30 5
player1 05NOV2008:09:00:05 20 10
player2 09NOV2008:10:05:10 10 -10
player2 15NOV2008:15:05:33 35 -35
player1 15NOV2008:15:05:33 35 15
player1 15NOV2008:15:05:33 35 15
run;
proc sort data=have;
by username betdate;
run;
data have;
set have;
by username betdate;
retain eventTime;
if first.username then eventTime = 0;
if first.betdate then eventTime + 1;
run;
proc sql;
create table want as
select
distinct username,
(select distinct stake from have where username = main.username and eventTime = 1) as bet1Stake,
(select distinct stake from have where username = main.username and eventTime = 2) as bet2Stake
from have main;
quit;
Proc rank data=want ties=mean out=want groups=2;
var bet1stake bet2stake;
ranks bet1stakeRank bet2stakeRank;
run;
Thanks for any help on this.
I'm afraid you cannot apply operators to the variables you'd like to rank your observations by.
You have two choices: either use a DATA step to do both the application of the operators and the calculation of the ranking,
or
use a DATA step view or SQL view to apply the operators as an intermediate step, in case you are concerned about disk space.
If you are pulling the data from a SQL database (assuming it supports windowing functions), you should be able to do exactly what you are trying to do with SQL code that is passed through to the database.
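The view option could look roughly like this. It is a minimal sketch, assuming the want dataset built in the question; the view name want_v and the variable avgStake are hypothetical names introduced for illustration.

```sas
/* A DATA step view computes the average on the fly;
   no intermediate table is stored, only the view definition */
data want_v / view=want_v;
  set want;
  avgStake = mean(bet1stake, bet2stake);
run;

/* PROC RANK reads the view like an ordinary dataset */
proc rank data=want_v ties=mean out=ranked groups=2;
  var avgStake;
  ranks firstTwoBetsRank;
run;
```

Note that the MEAN function ignores missing values, so a player with only one recorded bet would be ranked on that single stake; use (bet1stake + bet2stake) / 2 instead if such players should get a missing average.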