I have to merge two very large files and I want to avoid doing this in a DATA step, as that would mean sorting the data. I need all observations for all IDs from the left file, excluding IDs that are in the second file.
data leftdata;
input id $ y;
datalines;
AA 10
AA 20
BB 30
BB 40
CC 50
CC 80
DD 60
;run;
data rightdata;
input id $ ;
datalines;
AA
BB
;
run;
*Using datastep;
PROC SORT DATA=leftdata; BY id;
PROC SORT DATA=rightdata; BY id; RUN;
DATA datastep;
MERGE leftdata(IN=a) rightdata(IN=b);
BY id; IF a and b=0;
RUN;
How can the same be achieved using PROC SQL?
Final output must include the following observations:
CC 50
CC 80
DD 60
Here are two ways:
WHERE NOT IN (SELECT ...) filtering, or
LEFT JOIN where missing the right id.
Example:
data have;
input id $ y;
datalines;
AA 10
AA 20
BB 30
BB 40
CC 50
CC 80
DD 60
;
data excluded_ids;
input id $ ;
datalines;
AA
BB
;
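proc sql;
/* Way 1: keep rows whose id is not in the exclusion list */
create table want as
select * from have
where id not in (select id from excluded_ids)
;
/* Way 2: anti-join, keep rows with no match on the right (replaces WANT from way 1) */
create table want as
select have.* from have
left join excluded_ids as remove
on have.id = remove.id
where remove.id is null
;
quit;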
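For the second way, a repeated id in the exclusion list is harmless: the extra matches it creates all have a non-missing remove.id and are filtered out by the WHERE clause, so no SELECT DISTINCT is needed.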
Data step
Use a hash object to store the exclusion list and its CHECK method to test each row for removal.
Example:
data want;
set have;
if _n_ = 1 then do;
declare hash exclude(dataset:'excluded_ids');
exclude.defineKey('id');
exclude.defineDone();
end;
if exclude.check() = 0 then delete;
run;
I am trying to compute the frequency of observations in each group.
My dataset looks like:
Date Account C_group Age ...
1 152627 A 28
2 152627 B 28
1 163718 B 32
3 163628 D 12
4 163717 C 41
.
.
I would like to determine the percentage of accounts in the different groups.
Do you know how I could do that?
Thanks
The following should get you close to what you are looking for:
data dset ;
input
freqgroup $
subgroup ;
datalines ;
A 12
B 12
C 12
C 21
C 23
A 12
A 21
B 12
B 21
B 21
;
run;
proc sort data=dset;
by freqgroup;
run;
proc freq data=dset ;
table freqgroup ;
run ;
proc freq data=dset ;
by freqgroup ;
table subgroup ;
run ;
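Applied to the layout in the question, the same one-way table can also be written to a data set. This is a minimal sketch: the data set names accounts and group_pct are made up for illustration, and only the five rows shown in the question are used. The OUT= option of the TABLES statement stores the COUNT and PERCENT of each C_group value.

```sas
/* The five rows shown in the question; accounts is a hypothetical name */
data accounts;
  input Date Account C_group $ Age;
  datalines;
1 152627 A 28
2 152627 B 28
1 163718 B 32
3 163628 D 12
4 163717 C 41
;
run;

/* One-way frequency table; OUT= saves COUNT and PERCENT per group */
proc freq data=accounts;
  tables C_group / out=group_pct;
run;

proc print data=group_pct;
run;
```

This counts rows, not distinct accounts. If an account can appear more than once within a group, it would probably need to be de-duplicated on C_group and Account first.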
I've got a wide dataset with each month listed as a column. I'd like to transpose the data to a long format, but the problem is that my column names will be changing in the future. What's the best way to transpose with a dynamic variable being passed to the transpose statement?
For example:
data have;
input subject $ "Jan-10"n $ "Feb-10"n $ "Mar-10"n $;
datalines;
1 12 18 22
2 13 19 23
;
run;
data want;
input subject month $ value;
datalines;
1 Jan-10 12
1 Feb-10 18
1 Mar-10 22
2 Jan-10 13
2 Feb-10 19
2 Mar-10 23
;
run;
Simply run the transpose procedure and provide only the by statement.
I've updated your sample data to convert the months to numeric values (character variables are not transposed unless you list them in a VAR statement). I've also changed them to valid Base SAS names by removing the hyphen.
data have;
input subject $ "Jan10"n "Feb10"n "Mar10"n ;
datalines;
1 12 18 22
2 13 19 23
;
run;
Here's the transpose syntax you need. It will transpose all numeric variables by default:
proc transpose data=have out=want;
by subject;
run;
You could also do something more explicit, but still dynamic such as:
proc transpose data=have out=want;
by subject;
var jan: feb: mar: ; * ETC;
run;
This would transpose all vars that begin with jan/feb/mar etc... Useful in case your table contains other numeric variables that you don't want to include in the transpose.
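By default the transposed output has the columns _NAME_ and COL1 rather than the month and value columns asked for in the question. A sketch of renaming them on the way out, assuming the numeric version of have shown above (one row per subject, already sorted by subject):

```sas
/* Numeric sample data with valid SAS names, as in the answer above */
data have;
  input subject $ Jan10 Feb10 Mar10;
  datalines;
1 12 18 22
2 13 19 23
;
run;

/* _NAME_ holds the former column name, COL1 the transposed value;
   the RENAME= data set option gives them the names from the question */
proc transpose data=have
               out=want(rename=(_name_=month col1=value));
  by subject;
run;
```

Because no VAR statement is given, any month columns added later are picked up without changing the code, as long as they are numeric.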
I am working with a very large dataset containing the same columns several times, but with different column names (both character and numeric).
Does anyone know how to find and delete these identical columns?
Example
A B C D E F G
12 ab 12 ab 8 h 12
14 cd 14 cd 65 j 14
6 fs 6 fs 3 g 6
. . . . 4 q .
3 d 3 d 5 d 3
A-G are variable names, and I want to be able to see that A, C and G are identical and then remove all except one.
Also B and D are identical. I want to keep only one.
Is this even possible?
Here is an example using the technique proposed by Shenglin Chen in the comments.
data have ;
input A B $ C D $ E F $ G ;
cards;
12 ab 12 ab 8 h 12
14 cd 14 cd 65 j 14
6 fs 6 fs 3 g 6
. . . . 4 q .
3 d 3 d 5 d 3
;;;;
Find the unique numeric columns.
proc transpose data=have out=tall_numbers ;
var _numeric_;
run;
proc sort data=tall_numbers nodupkey out=keep_numbers(keep=_name_);
by col: ;
run;
Find the unique character columns.
proc transpose data=have out=tall_characters ;
var _character_;
run;
proc sort data=tall_characters nodupkey out=keep_characters(keep=_name_);
by col: ;
run;
Get the combined list of columns.
proc sql noprint ;
select _name_
into :keep_list separated by ' '
from (select _name_ from keep_characters
union select _name_ from keep_numbers)
order by 1
;
quit;
Make new table with only the unique columns.
data want ;
set have ;
keep &keep_list ;
run;
I have a SAS Table like:
DATA test;
INPUT id sex $ age inc r1 r2 Zaehler work $;
DATALINES;
1 F 35 17 7 2 1 w
17 M 40 14 5 5 1 w
33 F 35 6 7 2 1 w
49 M 24 14 7 5 1 w
65 F 52 9 4 7 1 w
81 M 44 11 7 7 1 w
2 F 35 17 6 5 1 n
18 M 40 14 7 5 1 n
34 F 47 6 6 5 1 n
50 M 35 17 5 7 1 w
;
PROC PRINT; RUN;
I want to compare rows and, where sex and age are equal, build the sum over Zaehler. For example:
1 F 35 17 7 2 1 w
and
33 F 35 6 7 2 1 w
sex=F and age=35 are equal, so I want to merge them like:
id sex age inc r1 r2 Zaehler work
1 F 35 17 7 2 2 w
I thought I could do it with PROC SQL, but I can't use sum in PROC SQL. Can someone help me out?
PROC SUMMARY is the normal way to compute statistics.
proc summary data=test nway ;
class sex age ;
var Zaehler;
output out=want sum= ;
run;
Why would you want to include variables other than SEX, AGE and Zaehler in the output?
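Note that the OUT= data set from PROC SUMMARY also carries the automatic variables _TYPE_ and _FREQ_. A sketch, using the test data from the question, that drops them and names the sum explicitly (want_sum and Zaehler_sum are illustrative names, not from the question):

```sas
DATA test;
  INPUT id sex $ age inc r1 r2 Zaehler work $;
  DATALINES;
1 F 35 17 7 2 1 w
17 M 40 14 5 5 1 w
33 F 35 6 7 2 1 w
49 M 24 14 7 5 1 w
65 F 52 9 4 7 1 w
81 M 44 11 7 7 1 w
2 F 35 17 6 5 1 n
18 M 40 14 7 5 1 n
34 F 47 6 6 5 1 n
50 M 35 17 5 7 1 w
;
RUN;

/* NWAY keeps only the full sex*age combinations;
   DROP= removes the automatic _TYPE_ and _FREQ_ columns */
proc summary data=test nway;
  class sex age;
  var Zaehler;
  output out=want_sum(drop=_type_ _freq_) sum=Zaehler_sum;
run;
```

The result has one row per sex and age combination, with no sorting of the input required.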
Your requirement is not difficult to understand or to satisfy; however, I am not sure what your underlying reason for doing this is. Explaining more about your purpose may lead to better answers that address the root of your project. Although I have a feeling that PROC MEANS may give you better metrics, here is a one-step PROC SQL solution that produces the summary while retaining "the value of first row":
proc sql;
create table want as
select id, sex , age, inc, r1, r2, sum(Zaehler) as Zaehler, work
from test
group by sex, age
having id = min(id) /* keep only the row with the smallest id within each sex, age group */
;
quit;
You can use proc sql to sum over sex and age
proc sql;
create table sum as
select
sex
,age
,sum(Zaehler) as Zaehler_sum
from test
group by
sex
,age;
quit;
You can then join it back to the main table if you want to include all the variables:
proc sql;
create table test_With_Sum as
select
t.*
,s.Zaehler_sum
from test t
inner join sum s on t.sex = s.sex
and t.age = s.age
order by
t.sex
,t.age
;
quit;
You can write it all as one PROC SQL query if you wish. The ORDER BY is not needed; it is only there to make the summarised results easier to read.
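One possible shape for the single-query version is sketched here, with the summary placed in an inline view instead of a separate table. It uses the test data from the question; the alias s and the output name test_with_sum follow the two-step code above.

```sas
DATA test;
  INPUT id sex $ age inc r1 r2 Zaehler work $;
  DATALINES;
1 F 35 17 7 2 1 w
17 M 40 14 5 5 1 w
33 F 35 6 7 2 1 w
49 M 24 14 7 5 1 w
65 F 52 9 4 7 1 w
81 M 44 11 7 7 1 w
2 F 35 17 6 5 1 n
18 M 40 14 7 5 1 n
34 F 47 6 6 5 1 n
50 M 35 17 5 7 1 w
;
RUN;

proc sql;
  create table test_with_sum as
  select t.*
        ,s.Zaehler_sum
  from test t
  inner join
       /* inline view: one row per sex, age with the summed Zaehler */
       (select sex, age, sum(Zaehler) as Zaehler_sum
        from test
        group by sex, age) s
    on  t.sex = s.sex
    and t.age = s.age
  ;
quit;
```

Every original row is kept, with the group total repeated on each row of the same sex and age.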
Not a good solution. But it should give you some ideas.
DATA test;
INPUT id sex $ age inc r1 r2 Zaehler work $;
DATALINES;
1 F 35 17 7 2 1 w
17 M 40 14 5 5 1 w
33 F 35 6 7 2 1 w
49 M 24 14 7 5 1 w
65 F 52 9 4 7 1 w
81 M 44 11 7 7 1 w
2 F 35 17 6 5 1 n
18 M 40 14 7 5 1 n
34 F 47 6 6 5 1 n
50 M 35 17 5 7 1 w
;
run;
data t2;
set test;
nobs = _n_;
run;
proc sort data=t2;by descending sex descending age descending nobs;run;
data t3;
set t2;
by descending sex descending age;
if first.age then count = 0;
count + 1;
zaehler = count;
if last.age then output;
run;
proc sort data=t3 out=want(drop=nobs count);by nobs sex age;run;
Thanks for your help. Here is my final code.
proc sql;
create table sum as
select distinct
sex
,age
,sum(Zaehler) as Zaehler
from test
WHERE work = 'w'
group by
sex
,age
;
quit;
PROC PRINT DATA=sum; RUN;
I just modified the code a little bit: I filtered on work = 'w' and merged the rows with the same values.
It was just an example; the real data is much bigger and has more columns and rows.