This is some example data, real data is more complex, other fields and about 40000 observations and up to 180 values per id (i know that i will get 360 rows in transposed table, but thats ok):
Data have;
input lastname firstname $ value;
datalines;
miller george 47
miller george 45
miller henry 44
miller peter 45
smith peter 42
smith frank 46
;
run;
And i want it to transpose in this way, so I have lastname, and then alternating firstname and value for ervery line matching the lastname.
data want:
Lastname Firstname1 Value1 Firstname2 value2 Firstname3 Value3 firstname4 value4
miller george 47 george 45 henry 44 peter 45
smith peter 42 frank 46
I tried a bit with proc transpose, but i was not able to build a table exactly the way i want it, described above. I need the want table exactly that way (real data is more complex and with other fields), so please no answers which propose to create a want table with other layout.
proc summary has a very useful function to do this, idgroup. You need to specify how many values you have per lastname, so I've included a step to calculate the maximum number.
Data have;
input lastname $ firstname $ value;
datalines;
miller george 47
miller george 45
miller henry 44
miller peter 45
smith peter 42
smith frank 46
;
run;
/* get frequency count of lastnames */
proc freq data=have noprint order=freq;
table lastname / out=name_freq;
run;
/* store maximum into a macro variable (first record will be the highest) */
data _null_;
set name_freq (obs=1);
call symput('max_num',count);
run;
%put &max_num.;
/* transpose data using proc summary */
proc summary data=have nway;
class lastname;
output out=want (drop=_:)
idgroup(out[&max_num.] (firstname value)=) / autoname;
run;
Related
Hi I have two tables with different column orders, and the column name are not capitalized as the same. How can I compare if the contents of these two tables are the same?
For example, I have two tables of students' grades
table A:
Math English History
-------+--------+---------
Tim 98 95 90
Helen 100 92 85
table B:
history MATH english
--------+--------+---------
Tim 90 98 95
Helen 85 100 92
You may use either of the two approaches to compare, regardless of the order or column name
/*1. Proc compare*/
proc sort data=A; by name; run;
proc sort data=B; by name; run;
proc compare base=A compare=B;
id name;
run;
/*2. Proc SQL*/
proc sql;
select Math, English, History from A
<union/ intersect/ Except>
select MATH, english, history from B;
quit;
use except corr(corresponding) it will check by name. if everything is matching you will get zero records.
data have1;
input Math English History;
datalines;
1 2 3
;
run;
data have2;
input English math History;
datalines;
2 1 3
;
run;
proc sql ;
select * from have1
except corr
select * from have2;
edit1
if you want to check which particular column it differs you may have to transpose and compare as shown below example.
data have1;
input name $ Math English pyschology History;
datalines;
Tim 98 95 76 90
Helen 100 92 55 85
;
run;
data have2;
input name $ English Math pyschology History;
datalines;
Tim 95 98 76 90
Helen 92 100 99 85
;
run;
proc sort data = have1 out =hav1;
by name;
run;
proc sort data = have2 out =hav2;
by name;
run;
proc transpose data =hav1 out=newhave1 (rename = (_name_= subject
col1=marks));
by name;
run;
proc transpose data =hav2 out=newhave2 (rename = (_name_= subject
col1=marks));
by name;
run;
proc sql;
create table want(drop=mark_dif) as
select
a.name as name
,a.subject as subject
,a.marks as have1_marks
,b.marks as have2_marks
,a.marks -b.marks as mark_dif
from newhave1 a inner join newhave2 b
on upcase(a.name) = upcase(b.name)
and upcase(a.subject) =upcase(b.subject)
where calculated mark_dif ne 0;
Dataset HAVE includes two variables with misspelled names in them: names and friends.
Name Age Friend
Jon 11 Ann
Jon 11 Tom
Jimb 12 Egg
Joe 11 Egg
Joe 11 Anne
Joe 11 Tom
Jed 10 Ann
I have a small dataset CORRECTIONS that includes wrong_names and resolved_names.
current_names resolved_names
Jon John
Ann Anne
Jimb Jim
I need any name in names or friends in HAVE that matches a name in the wrong_names column of CORRECTIONS to get recoded to the corresponding string in resolved_name. The resulting dataset WANT should look like this:
Name Age Friend
John 11 Anne
John 11 Tom
Jim 12 Egg
Joe 11 Egg
Joe 11 Anne
Joe 11 Tom
Jed 10 Anne
In R, I could simply invoke each dataframe and vector using if_else(), but the DATA step in SAS doesn't play nicely with multiple datasets. How can I make these replacements using CORRECTIONS as a look-up table?
There are many ways to do a lookup in SAS.
First of all, however, I would suggest to de-duplicate your look-up table (for example, using PROC SORT and Data Step/Set/By) - deciding which duplicate to keep (if any exist).
As for the lookup task itself, for simplicity and learning I would suggest the following:
The "OLD SCHOOL" way - good for auditing inputs and outputs (it is easier to validate the results of a join when input tables are in the required order):
*** data to validate;
data have;
length name $10. age 4. friend $10.;
input name age friend;
datalines;
Jon 11 Ann
Jon 11 Tom
Jimb 12 Egg
Joe 11 Egg
Joe 11 Anne
Joe 11 Tom
Jed 10 Ann
run;
*** lookup table;
data corrections;
length current_names $10. resolved_names $10.;
input current_names resolved_names;
datalines;
Jon John
Ann Anne
Jimb Jim
run;
*** de-duplicate lookup table;
proc sort data=corrections nodupkey; by current_names; run;
proc sort data=have; by name; run;
data have_corrected;
merge have(in=a)
corrections(in=b rename=(current_names=name))
;
by name;
if a;
if b then do;
name=resolved_names;
end;
run;
The SQL way - which avoids sorting the have table:
proc sql;
create table have_corrected_sql as
select
coalesce(b.resolved_names, a.name) as name,
a.age,
a.friend
from work.have as a left join work.corrections as b
on a.name eq b.current_names
order by name;
quit;
NB the Coalesce() is used to replace missing resolved_names values (ie when there is no correction) with names from the have table
EDIT: To reflect Quentin's (CORRECT) comment that I'd missed the update to both name and friend fields.
Based on correcting the 2 fields, again many approaches but the essence is one of updating a value only IF it exists in the lookup (corrections) table. The hash object is pretty good at this, once you've understood it's declaration.
NB: any key fields in the Hash object need to be specified on a Length statement BEFOREHAND.
EDIT: as per ChrisJ's alternative to the Length statement declaration, and my reply (see below) - it would be better to state that key variables need to be defined BEFORE you declare the hash table.
data have_corrected;
keep name age friend;
length current_names $10.;
*** load valid names into hash lookup table;
if _n_=1 then do;
declare hash h(dataset: 'work.corrections');
rc = h.defineKey('current_names');
rc = h.defineData('resolved_names');
rc = h.defineDone();
end;
do until(eof);
set have(in=a) end=eof;
*** validate both name fields;
if h.find(key:name) eq 0 then
name = resolved_names;
if h.find(key:friend) eq 0 then
friend = resolved_names;
output;
end;
run;
EDIT: to answer the comments re ChrisJ's SQL/Update alternative
Basically, you need to restrict each UPDATE statement to ONLY those rows that have name values or friend values in the corrections table - this is done by adding another where clause AFTER you've specified the set var = (clause). See below.
NB. AFAIK, an SQL solution to your requirement will require MORE than 1 pass of both the base table & the lookup table.
The lookup/hash table, however, requires a single pass of the base table, a load of the lookup table and then the lookup actions themselves. You can see the performance difference in the log...
proc sql;
*** create copy of have table;
create table work.have_sql as select * from work.have;
*** correct name field;
update work.have_sql as u
set name = (select resolved_names
from work.corrections as n
where u.name=n.current_names)
where u.name in (select current_names from work.corrections)
;
*** correct friend field;
update work.have_sql as u
set friend = (select resolved_names
from work.corrections as n
where u.friend=n.current_names)
where u.friend in (select current_names from work.corrections)
;
quit;
Given data
*** data to validate;
data have;
length name $10. age 4. friend $10.;
input name age friend;
datalines;
Jon 11 Ann
Jon 11 Tom
Jimb 12 Egg
Joe 11 Egg
Joe 11 Anne
Joe 11 Tom
Jed 10 Ann
run;
*** lookup table;
data corrections;
length from_name $10. to_name $10.;
input from_name to_name;
datalines;
Jon John
Ann Anne
Jimb Jim
run;
One SQL alternative is to perform a existent mapping select look up on each field to be mapped. This would be counter to joining the corrections table one time for each field to be mapped.
proc sql;
create table want1 as
select
case when exists (select * from corrections where from_name=name)
then (select to_name from corrections where from_name=name)
else name
end as name
, age
, case when exists (select * from corrections where from_name=friend)
then (select to_name from corrections where from_name=friend)
else friend
end as friend
from
have
;
Another, SAS only way, to perform inline left joins is to use a custom format.
data cntlin;
set corrections;
retain fmtname '$cohen'; /* the fixer */
rename from_name=start to_name=label;
run;
proc format cntlin=cntlin;
run;
data want2;
set have;
name = put(name,$cohen.);
friend = put(friend,$cohen.);
run;
You can use an UPDATE in proc sql :
proc sql ;
update have a
set name = (select resolved_names b from corrections where a.name = b.current_names)
where name in(select current_names from corrections)
;
update have a
set friend = (select resolved_names b from corrections where a.friend = b.current_names)
where friend in(select current_names from corrections)
;
quit ;
Or, you could use a format :
/* Create format */
data current_fmt ;
retain fmtname 'NAMEFIX' type 'C' ;
set resolved_names ;
start = current_names ;
label = resolved_names ;
run ;
proc format cntlin=current_fmt ; run ;
/* Apply format */
data want ;
set have ;
name = put(name ,$NAMEFIX.) ;
friend = put(friend,$NAMEFIX.) ;
run ;
Try this:
proc sql;
create table want as
select p.name,p.age,
case
when q.current_names is null then p.friend
else q.resolved_names
end
as friend1
from
(
select
case
when b.current_names is null then a.name
else b.resolved_names
end
as name,
a.age,a.friend
from
have a
left join
corrections b
on upcase(a.name) = upcase(b.current_names)
) p
left join
corrections q
on upcase(p.friend) = upcase(q.current_names);
quit;
Output:
name age friend
John 11 Anne
Jed 10 Anne
Joe 11 Anne
Jim 12 Egg
Joe 11 Egg
Joe 11 Tom
John 11 Tom
Let me know in case of any clarifications.
I want to sum over a specific variable in my dataset, without loosing all the other columns. I have tried the following code:
proc summary data=work.test nway missing;
class var_1 var_2 ; *groups;
var salary;
id _character_ _numeric_; * keeps all variables;
output out=test2(drop=_:) sum= ;
run;
But it does not seem to sum properly, and for the "salary" column I'm just left with the value of the last value in each group (var_1 and var_2). If I remove
id _character_ _numeric_;
it works fine, but I loose all other columns.
Example:
data:
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
desired output:
John Sales 66 M
Mary Acctng 21 F
I think this does what you want. You still get warnings about name conflicts and variables being dropped but at least the ones you want are kept. The ID statement is depreciated in favor in the new and better IDGROUP output statement option.
You could add the AUTONAME option to the output statement if you wanted PROC SUMMARY to automatically rename the conflicting variables.
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;;;;
run;
proc print;
run;
proc summary nway missing;
class name dept;
var salary;
output out=test2(drop=_:) sum= idgroup(out(_all_)=);
run;
proc print;
run;
Try this:
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
proc sql;
create table salary2 as
select *,
monotonic() as n,
sum(salary) as sum_salary
from salary
group by name
having max(n)=n;
quit;
I wasn't aware that SAS did this, but the problem appears to lie in the fact that the id statement takes preference over the var statement. By including all variables in the id statement, all the output is showing is the maximum value for each variable, including Salary.
One option is to pull a list of the variables not included in the class or var statements from dictionary.columns, then use that list in the id statement. Just be aware that proc summary runs in memory and I have come across out of memory problems in the past when many variables have been included in the id statement
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
proc sql noprint;
select name into :cols separated by ' '
from dictionary.columns
where libname='WORK'
and
memname='SALARY'
and
name not in ('name','Salary');
quit;
%put &cols.;
proc summary data=salary nway missing;
class name;
var salary;
id &cols.;
output out=want (drop=_:) sum=;
run;
In a dataset in SAS, I have some observations multiple times. What I am trying to do is: I am trying to add a column with the frequency of each observation and make sure I keep it only one time in my dataset. I have to do this for a dataset with many rows and around 8 variables.
name id address age
jack 2 chicago 50
peter 4 new york 45
jack 2 chicago 50
This would have to become:
name id address age frequency
jack 2 chicago 50 2
peter 4 new york 45 1
Is there anybody who knows how to do this in SAS (preferably without using SQL)?
Thank you a lot!
#kl78 is right, proc summary is the best non-sql solution here. This runs in memory which can cause problems with very large datasets, but you should be ok with 8 columns.
class _all_ will group by all the variables and the frequency is output by default, so there's no need to specify any measures. I've dropped the other automatic variable, _type_, as it isn't relevant here and renamed _freq_.
data have;
input name $ id address &$ age;
datalines;
jack 2 chicago 50
peter 4 new york 45
jack 2 chicago 50
;
run;
proc summary data=have nway;
class _all_;
output out=want (drop=_type_ rename=(_freq_=frequency));
run;
proc sort data=sas.mincome;
by F3 F4;
run;
Proc sort doesn't sort the dataset by formatted values, only internal values. I need to sort by two variables prior to a merge. Is there anyway to do this with proc sort?
I don't think you can sort by formatted values in proc sort, but you can definitely use a simple proc SQL procedure to sort a dataset by formatted values. proc SQL is similar to the data step and proc sort, but is more powerful.
The general syntax of proc sql for sorting by formatted values will be:
proc sql;
create table NewDataSet as
select variable(s)
from OriginalDataSet
order by put(variable1, format1.), put(variable2, format2.);
quit;
For example, we have a sample data set containing the names, sex and ages of some people and we want to sort them:
proc format;
value gender 1='Male'
2='Female';
value age 10-15='Young'
16-24='Old';
run;
data work.original;
input name $ sex age;
datalines;
John 1 12
Zack 1 15
Mary 2 18
Peter 1 11
Angela 2 24
Jack 1 16
Lucy 2 17
Sharon 2 12
Isaac 1 22
;
run;
proc sql;
create table work.new as
select name, sex format=gender., age format=age.
from work.original
order by put(sex, gender.), put(age, age.);
quit;
Output of work.new will be:
Obs name sex age
1 Mary Female Old
2 Angela Female Old
3 Lucy Female Old
4 Sharon Female Young
5 Jack Male Old
6 Isaac Male Old
7 John Male Young
8 Zack Male Young
9 Peter Male Young
If we had used proc sort by sex, then Males would have been ranked first because we had used 1 to represent Males and 2 to represent Females which is not what we want. So, we can clearly see that proc sql did in fact sort them according to the formatted values (Females first, Males second).
Hope this helps.
Because of the nature of formats, SAS only uses the underlying values for the sort. To my knowledge, you cannot change that (unless you want to build your own translation table via PROC TRANTAB).
What you can do is create a new column that contains the formatted value. Then you can sort on that column.
proc format library=work;
value $test 'z' = 'a'
'y' = 'b'
'x' = 'c';
run;
data test;
format val $test.;
informat val $1.;
input val $;
val_fmt = put(val,$test.);
datalines;
x
y
z
;
run;
proc print data=test(drop=val_fmt);
run;
proc sort data=test;
by val_fmt;
run;
proc print data=test(drop=val_fmt);
run;
Produces
Obs val
1 c
2 b
3 a
Obs val
1 a
2 b
3 c