I have a dataset looks like the following. This dataset contains four variable Country name Country, company ID Company, Year and Date.
Country Company Year Date
------- ------- ---- ----
A 1 2000 2000/01/02
A 1 2001 2001/01/03
A 1 2001 2001/07/02
A 1 2000 2001/08/03
B 2 2000 2001/08/03
C 3 2000 2001/08/03
I know how to count number of distinct company in each country. I did it using the following code.
proc sql;
create table lib.count as
select country, count(distinct company) as count
from lib.data
group by country;
quit;
My problem is how to count the number of distinct company-Years in each country. Essentially i want to know how many different company or same company in different year. If there are two observation for the same company in the same year, I want to count it as 1 different value. If same company have two observation in differeny year I want to count it as two different value. I want the output looks like the following (one number per country):
Country No. firm_year
A 2
B 1
C 1
Can anyone can teach me how to do it please.
A quick method is to concatenate all the variables you want to compare, creating a new variable. Something like:
data data_mod;
set data;
length company_year $ 20;
company_year= cats(company,year);
run;
Then you can run your proc sql with count(distinct company_year).
You need nested queries, as #DaBigNikoladze hinted at...
An "internal" query which will generate a list of distinct combinations of Country + Company + Year;
An "external" query which will count how many rows per country are present in the internal query.
Generate dataset
data have;
informat Country $1.
Company 1.
Year 4.
Date YYMMDD10.;
format Date YYMMDDs10.;
input country company year date;
datalines;
A 1 2000 2000/01/02
A 1 2001 2001/01/03
A 1 2001 2001/07/02
A 1 2000 2001/08/03
B 2 2000 2001/08/03
C 3 2000 2001/08/03
;
Execute query
PROC SQL;
CREATE TABLE want AS
SELECT country, Count(company) AS Firm_year
FROM (SELECT DISTINCT country, company, year FROM have)
GROUP BY country;
QUIT;
Results
Country Firm_year
A 2
B 1
C 1
proc sort data=lib.data out=temp nodupkey;
by country company year;
run;
data firm_year(keep=country cnt_fyr);
set out;
by country company year
retain cnt_fyr;
if first.country then cnt_fyr=1;
else cnt_fyr+1;
if last.country;
run;
The answer for your first question is:
data lib.count(keep=country companyCount);
set lib.data;
by country;
retain companyList '';
retain companyCount 0;
if first.country then do;
companyList = company;
companyCount = 1;
end;
else do;
if ^index(companyList, company) then do;
companyList = cats(companyList,',',company);
companyCount + 1;
end;
end;
if last.country then output;
run;
The resutl is:
Country companyCount
------- ------------
A 2
B 1
C 1
Similary you will take the number of distinct company-Years in each country.
Guess i'm a bit confused as to what you are expecting the result to look like. Here is an sql method that gets the same result as posted by the other answer so far.
data temp;
attrib Country length = $10;
attrib Company length = $10;
attrib Year length = $10;
attrib Date length = $10;
input Country $ Company $ Year $ Date $;
infile datalines delimiter = '#';
datalines;
A#1#x#x1#
A#1#x#x2#
B#2#x#x1#
C#3#x#x3#
;
run;
proc sql;
create table temp2 as
select country, count(distinct Date) as count
from temp
group by country, company;
quit;
Related
Date set having id and date .I want a date set with two duplicate id but condition is that one should be before 8th June and other should be after 8th June.
To take the first date and the first date after 2021-06-08 you can sort by ID and DATE and use LAG() to detect when you cross the date boundary.
data have ;
input id date :date. ;
format date date9.;
cards;
1 01jun2021
1 07jun2021
1 08jun2021
1 09jun2021
;
data want;
set have ;
by id date;
if first.id or ( (date<='08JUN2021'd) ne lag(date<='08JUN2021'd));
run;
results
Obs id date
1 1 01JUN2021
2 1 09JUN2021
Hi I am trying to calculate how much the customer paid on the month by subtracting their balance from the next month.
Data looks like this: I want to calculate PaidAmount for A111 in Jun-20 by Balance in Jul-20 - Balance in June-20. Can anyone help, please? Thank you
For this situation there is no need to look ahead as you can create the output you want just by looking back.
data have;
input id date balance ;
informat date yymmdd10.;
format date yymmdd10.;
cards;
1 2020-06-01 10000
1 2020-07-01 8000
1 2020-08-01 5000
2 2020-06-01 10000
2 2020-07-01 8000
3 2020-08-01 5000
;
data want;
set have ;
by id date;
lag_date=lag(date);
format lag_date yymmdd10.;
lag_balance=lag(balance);
payment = lag_balance - balance ;
if not first.id then output;
if last.id then do;
payment=.;
lag_balance=balance;
lag_date=date;
output;
end;
drop date balance;
rename lag_date = date lag_balance=balance;
run;
proc print;
run;
Result:
Obs id date balance payment
1 1 2020-06-01 10000 2000
2 1 2020-07-01 8000 3000
3 1 2020-08-01 5000 .
4 2 2020-06-01 10000 2000
5 2 2020-07-01 8000 .
6 3 2020-08-01 5000 .
This is looking for a LEAD calculation which is typically done via PROC EXPAND but that's under the SAS/ETS license which not many users have. Another option is to merge the data with itself, offsetting the records by one so that the next months record is on the same line.
data want;
merge have have(firstobs=2 rename=balance = next_balance);
by clientID;
PaidAmount = Balance - next_balance;
run;
If you can be missing months in your series this is not a good approach. If that is possible you want to do an explicit merge using SQL instead. This assumes you have month as a SAS date as well.
proc sql;
create table want as
select t1.*, t1.balance - t2.balance as paidAmount
from have as t1
left join have as t2
on t1.clientID = t2.ClientID
/*joins current month with next month*/
and intnx('month', t1.month, 0, 'b') = intnx('month', t2.month, 1, 'b');
quit;
Code is untested as no test data was provided (I won't type out your data to test code).
I have a sas datebase with something like this:
id birthday Date1 Date2
1 12/4/01 12/4/13 12/3/14
2 12/3/01 12/6/13 12/2/14
3 12/9/01 12/4/03 12/9/14
4 12/8/13 12/3/14 12/10/16
And I want the data in this form:
id Date Datetype
1 12/4/01 birthday
1 12/4/13 1
1 12/3/14 2
2 12/3/01 birthday
2 12/6/13 1
2 12/2/14 2
3 12/9/01 birthday
3 12/4/03 1
3 12/9/14 2
4 12/8/13 birthday
4 12/3/14 1
4 12/10/16 2
Thanks by ur help, i'm on my second week using sas <3
Edit: thanks by remain me that i was not finding a sorting method.
Good day. The following should be what you are after. I did not come up with an easy way to rename the columns as they are not in beginning data.
/*Data generation for ease of testing*/
data begin;
input id birthday $ Date1 $ Date2 $;
cards;
1 12/4/01 12/4/13 12/3/14
2 12/3/01 12/6/13 12/2/14
3 12/9/01 12/4/03 12/9/14
4 12/8/13 12/3/14 12/10/16
; run;
/*The trick here is to use date: The colon means everything beginning with date, comparae with sql 'date%'*/
proc transpose data= begin out=trans;
by id;
var birthday date: ;
run;
/*Cleanup. Renaming the columns as you wanted.*/
data trans;
set trans;
rename _NAME_= Datetype COL1= Date;
run;
See more from Kent University site
Two steps
Pivot the data using Proc TRANSPOSE.
Change the names of the output columns and their labels with PROC DATASETS
Sample code
proc transpose
data=have
out=want
( keep=id _label_ col1)
;
by id;
var birthday date1 date2;
label birthday='birthday' date1='1' date2='2' ; * Trick to force values seen in pivot;
run;
proc datasets noprint lib=work;
modify want;
rename
_label_ = Datetype
col1 = Date
;
label
Datetype = 'Datetype'
;
run;
The column order in the TRANSPOSE output table is:
id variables
copy variables
_name_ and _label_
data based column names
The sample 'want' shows the data named columns before the _label_ / _name_ columns. The only way to change the underlying column order is to rewrite the data set. You can change how that order is perceived when viewed is by using an additional data view, or an output Proc that allows you to specify the specific order desired.
I have a SAS question. I have a large dataset containing unique ID's and a bunch of variables for each year in a time series. Some ID's are present throughout the entire timeseries, some new ID's are added and some old ID's are removed.
ID Year Var3 Var4
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
As can be seen from the table above, ID 1 is present in all three years (2015-2017), ID 2 is present from 2016 and onwards, ID 3 is removed in 2017 and ID 4 is present in 2015, removed in 2016 and then present again in 2017.
I would like to know which ID's are new and which are removed in any given year, whilst keeping all the data. Eg. a new table with indicators for which ID's are new and which are removed. Furthermore, it would be nice to get a frequency of how many ID' are added/removed in a given year and the sum og their "Var3" and "Var4". Do you have any suggestions how to do that?
************* UPDATE ******************
Okay, so I tried the following program:
**** Addition to suggested code ****;
options validvarname=any;
proc sql noprint;
create table years as
select distinct year
from have;
create table ids as
select distinct id
from have;
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
Now this will provide me with a table that only contains the ID's that are new in 2017:
data new_in_17;
set indicators;
where ('2016'n=0) and ('2017'n=1);
run;
I can now merge this table to add var3 and var4:
data new17;
merge new_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
Now I can find the frequence of new ID's in 2017 and the sum of var3 and var4:
proc means data=new17 noprint;
var var3 var4;
where year in (2017);
output out=sum_var_freq_new sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
This gives me the output I need. However, I would like the equivalent output for the ID's that are "gone" between 2016 and 2017 which can be made from:
data gone_in_17;
set indicators;
where ('2016'n=1) and ('2017'n=0);
run;
data gone17;
merge gone_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
proc means data=gone17 noprint;
var var3 var4;
where year in (2016);
output out=sum_var_freq_gone sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
The end result should be a combination of the two tables "sum_var_freq_new" and "sum_var_freq_gone" into one table. Furthermore, I need this table for every new year, so my current approach is very inefficient. Do you guys have any suggestions how to achieve this efficiently?
Aside from a different sample, you didn't provide much extra info from your previous question in order to understand what was lacking in the previous answer.
To build on the latter though, you could use a macro do loop to dynamically account for the distinct year values present in your dataset.
data have;
infile datalines;
input ID year var3 var4;
datalines;
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
;
run;
proc sql noprint;
select distinct year
into :year1-
from have
;
quit;
%macro doWant;
proc sql;
create table want as
select distinct ID
%let i=1;
%do %while(%symexist(year&i.));
,exists(select * from have b where year=&&year&i.. and a.id=b.id) as "&&year&i.."n
%let i=%eval(&i.+1);
%end;
from have a
;
quit;
%mend;
%doWant;
This will produce the following result:
ID 2015 2016 2017
-----------------
1 1 1 1
2 0 1 1
3 1 1 0
4 1 0 1
Here is a more efficient way of doing this and also giving you the summary values.
First a little SQL magic. Create the cross product of years and IDs, then join that to the table you have to create an indicator;
proc sql noprint;
/*All Years*/
create table years as
select distinct year
from have;
/*All IDS*/
create table ids as
select distinct id
from have;
/*All combinations of ID/year*/
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
/*Original data with rows added for missing years. Indicator=1 if it*/
/*existed prior, 0 if not.*/
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
Now transpose that.
proc transpose data=indicators out=indicators(drop=_name_);
by id;
id year;
var indicator;
run;
Create the sums. You could also add other summary stats if you wanted here:
proc summary data=have;
by id;
var var3 var4;
output out=summary sum=;
run;
Merge the indicators and the summary values:
data want;
merge indicators summary(keep=id var3 var4);
by id;
run;
Here is my data :
data example;
input id sports_name;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
This is just a sample. The variable sports_name is categorical with 56 types.
I am trying to transpose the data to wide form where each row would have a user_id and the names of sports as the variables with values being 1/0 indicating Presence or absence.
So far, I used proc freq procedure to get the cross tabulated frequency table and put that in a different data set and then transposed that data. Now i have missing values in some cases and count of the sports in rest of the cases.
Is there any better way to do this?
Thanks!!
You need a way to create something from nothing. You could have also used the SPARSE option in PROC FREQ. SAS names cannot have length greater than 32.
data example;
input id sports_name :$16.;
retain y 1;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;;;;
run;
proc print;
run;
proc summary data=example nway completetypes;
class id sports_name;
output out=freq(drop=_type_);
run;
proc print;
run;
proc transpose data=freq out=wide(drop=_name_);
by id;
var _freq_;
id sports_name;
run;
proc print;
run;
Same theory here, generate a list of all possible combinations using SQL instead of Proc Summary and then transposing the results.
data example;
informat sports_name $20.;
input id sports_name $;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
proc sql;
create table complete as
select a.id, a_x.sports_name, case when not missing(e.sports_name) then 1 else 0 end as Present
from (select distinct ID from example) a
cross join (select distinct sports_name from example) a_x
full join example as e
on e.id=a.id
and e.sports_name=a_x.sports_name;
quit;
proc transpose data=complete out=want;
by id;
id sports_name;
var Present;
run;