I have a dataset of customers of a grocery store with information including name, customer id, gender, date of birth, customer type, etc.
DOB is in the format of DDMMMYYYY such as 25JAN1990.
Customer types include online shopper, gold member, silver member, etc.
I need to print a table that identifies the youngest and oldest customers for each customer type using PROC SQL. I'm not sure where to even start with this. I have a rough idea of grouping by customer type and using max and min function on dates but I am not sure if that will work or if I can implement it.
PROC SQL;
quit;
I think the important bit of this task is the SQL command having, which allows you to do comparisons withing the select statement.
Since there is no clear beginning data, I've taken liberty to produce some.
data begin;
input ID Test $ date mmddyy10.;
cards;
001 A 09/01/2011
001 A 10/02/2011
001 A 09/12/2012
001 A 10/10/2013
001 B 10/01/2011
001 B 01/01/2012
002 A 10/12/2014
002 A 10/13/2014
002 A 02/02/2015
002 A 11/15/2015
;
run;
proc sql;
select * from begin
group by ID, test /*These are the grouping variables*/
having date=max(date) or date=min(date); /*Conditions that must be fulfilled*/
quit;
More about having and group by: http://www.dofactory.com/sql/having
Related
I have a dataset as follows:
ID status
101 Checked
101 Checked
101 NotChecked
101 Checked
101 NotChecked
I want to count the number of obs base don the status variable like
ID status Count
101 Checked 2
101 Checked 2
101 NotChecked 1
101 Checked 1
101 NotChecked 1
I dont want to use proc sql because when I say group by then it sorts the dataset and gives the result where as here the Status variable is not sorted.
Aggregating by groups will always require sorting unless you want to use some complex data step logic.
If you have a particular sort order that you want to keep, the easiest way is to create a key column that holds your desired order. You can then resort it back to the way you'd like it after grouping.
data have2;
set have;
varorder = _N_;
run;
proc sql;
create table want as
select id, status, count(*) as count
from have2
group by id, status
order by varorder
;
quit;
This works for me, a bit of a longer solution but basically add row and group identifiers to control the count. The NOTSORTED option on the BY statement helps to identify your groups uniquely.
data have;
input ID status $12.;
cards;
101 Checked
101 Checked
101 NotChecked
101 Checked
101 NotChecked
;;;;
run;
data grouped;
set have;
by id status notsorted;
retain MyGroups count;
if first.id then count=1;
else count+1;
if first.status then MyGroups+1;
run;
proc sql;
create table want as
select *, count(*) as numberFound
from grouped
group by MyGroups
order by ID, count;
quit;
I have a dataset looks like the following. This dataset contains four variable Country name Country, company ID Company, Year and Date.
Country Company Year Date
------- ------- ---- ----
A 1 2000 2000/01/02
A 1 2001 2001/01/03
A 1 2001 2001/07/02
A 1 2000 2001/08/03
B 2 2000 2001/08/03
C 3 2000 2001/08/03
I know how to count number of distinct company in each country. I did it using the following code.
proc sql;
create table lib.count as
select country, count(distinct company) as count
from lib.data
group by country;
quit;
My problem is how to count the number of distinct company-Years in each country. Essentially i want to know how many different company or same company in different year. If there are two observation for the same company in the same year, I want to count it as 1 different value. If same company have two observation in differeny year I want to count it as two different value. I want the output looks like the following (one number per country):
Country No. firm_year
A 2
B 1
C 1
Can anyone can teach me how to do it please.
A quick method is to concatenate all the variables you want to compare, creating a new variable. Something like:
data data_mod;
set data;
length company_year $ 20;
company_year= cats(company,year);
run;
Then you can run your proc sql with count(distinct company_year).
You need nested queries, as #DaBigNikoladze hinted at...
An "internal" query which will generate a list of distinct combinations of Country + Company + Year;
An "external" query which will count how many rows per country are present in the internal query.
Generate dataset
data have;
informat Country $1.
Company 1.
Year 4.
Date YYMMDD10.;
format Date YYMMDDs10.;
input country company year date;
datalines;
A 1 2000 2000/01/02
A 1 2001 2001/01/03
A 1 2001 2001/07/02
A 1 2000 2001/08/03
B 2 2000 2001/08/03
C 3 2000 2001/08/03
;
Execute query
PROC SQL;
CREATE TABLE want AS
SELECT country, Count(company) AS Firm_year
FROM (SELECT DISTINCT country, company, year FROM have)
GROUP BY country;
QUIT;
Results
Country Firm_year
A 2
B 1
C 1
proc sort data=lib.data out=temp nodupkey;
by country company year;
run;
data firm_year(keep=country cnt_fyr);
set out;
by country company year
retain cnt_fyr;
if first.country then cnt_fyr=1;
else cnt_fyr+1;
if last.country;
run;
The answer for your first question is:
data lib.count(keep=country companyCount);
set lib.data;
by country;
retain companyList '';
retain companyCount 0;
if first.country then do;
companyList = company;
companyCount = 1;
end;
else do;
if ^index(companyList, company) then do;
companyList = cats(companyList,',',company);
companyCount + 1;
end;
end;
if last.country then output;
run;
The resutl is:
Country companyCount
------- ------------
A 2
B 1
C 1
Similary you will take the number of distinct company-Years in each country.
Guess i'm a bit confused as to what you are expecting the result to look like. Here is an sql method that gets the same result as posted by the other answer so far.
data temp;
attrib Country length = $10;
attrib Company length = $10;
attrib Year length = $10;
attrib Date length = $10;
input Country $ Company $ Year $ Date $;
infile datalines delimiter = '#';
datalines;
A#1#x#x1#
A#1#x#x2#
B#2#x#x1#
C#3#x#x3#
;
run;
proc sql;
create table temp2 as
select country, count(distinct Date) as count
from temp
group by country, company;
quit;
Just a general question lets say I have two datasets called dataset1 and dataset2 and If I want to compare the rows of dataset1 with the complete dataset2 so essentially compare each row of dataset1 with dataset2. Below is just an example of the two datasets
Dataset1
EmployeeID Name Employeer
12345 John Microsoft
1234567 Alice SAS
1234565 Jim IBM
Dataset1
EmployeeID2 Name DateAbsent
12345 John 25/06/2009
12345 John 26/06/2009
1234567 Alice 27/06/2010
1234567 Alice 30/06/2011
1234567 Alice 2/8/2012
12345 John 28/06/2009
12345 John 25/07/2009
12345 John 25/08/2009
1234565 Jim 26/08/2009
1234565 Jim 27/08/2010
1234565 Jim 28/08/2011
1234565 Jim 29/08/2012
I have written some programming logic its not sas code, this is just my logic
for item in dataset1:
for item2 in dataset2:
if item.EmployeeID=item2.EmployeeID2 and item.Name=item2.Name then output newSet
This is an inner join.
proc sql noprint;
create table output as
select a.EmployeeId,
a.Name,
a.Employeer,
b.DateAbsent
from dataset1 as a
inner join
dataset2 as b
on a.EmployeeID = b.EmployeeID2
and a.Name = b.name;
quit;
I recommend reading the SAS documentation on PROC SQL if you are unfamiliar with the syntax
To do this in a Data step, the data sets need to be sorted by the variables to join on (or indexed). Also the variable names need to be the same, so I will assume both variables are EmployeeID.
/*sort*/
proc sort data=dataset1;
by EmployeeID Name;
run;
proc sort data=dataset2;
by EmployeeID Name;
run;
data output;
merge dataset1 (in=ds1) dataset2 (inds2);
by EmployeeID Name;
if ds1 and ds2;
run;
The data step does the loop for you. It needs sorted sets because it only takes 1 pass over the data sets. The if clause checks to make sure you are getting a value from both data sets.
Is your goal to compare the two dataset and see where there are differences? Proc Compare will do this for you. You can compare specific columns or the entire dataset.
I'm fairly new with SAS and am looking for a little guidance.
I have two tables. One contains my data (something like the below, although much larger):
Data DataTable;
Input Var001 $ Var002;
Datalines;
000 050
063 052
015 017
997 035;
run;
My variables are integers (read in as text) from 000 to 999. There can be as few as two, or as many as 500 depending on what the user is doing.
The second table contains user specified groupings of the variables in the DataTable:
Data Var_Groupings;
input var $ range $ Group_Desc $;
Datalines;
001 025 0-25
001 075 26-75
001 999 76-999
002 030 0-30
002 050 31-50
002 060 51-60
002 999 61-999;
run;
(In actuality, this table in adjusted by the user in excel and then imported, but this will work for the purposes of troubleshooting).
The "var" variable in the var_groupings table corresponds to a var column in the DataTable. So for instance a "var" of 001 in the var_groupings table is saying that this grouping will be on var001 of the DataTable.
the "Range" variable specifics the upper bound of a grouping. So looking at ranges in the var_grouping table where var is equal to 001, the user wants the first group to span from 0 to 25, the second group to span from 26 to 75, and the last group to span from 76 to 999.
EDIT: The Group_Desc column can contain any string and is not necessarily of the form presented here.
the final table should look something like this:
Var001 Var002 Var001_Group Var002_group
000 050 0-25 31-50
063 052 26-75 51-60
015 017 0-25 0-30
997 035 76-999 31-50
I'm not sure how I would even approach something like this. Any guidance you can give would be greatly appreciated.
That's an interesting one, thanks! It can be solved using CALL EXECUTE, since we need to create variable names from values. And obviously PROC FORMAT is the easiest way to convert some values into ranges. So, combining these two things we can do something like this:
proc sort data=Var_Groupings; by var range; run;
/*create dataset which will be the source of our formats' descriptions*/
data formatset;
set Var_Groupings;
by var;
fmtname='myformat';
type='n';
label=Group_Desc;
start=input(lag(range),8.)+1;
end=input(range,8.);
if FIRST.var then start=0;
drop range Group_Desc;
run;
/*put the raw data into new one, which we'll change to get what we want (just to avoid
changing the raw one)*/
data want;
set Datatable;
run;
/*now we iterate through all distinct variable numbers. A soon as we find new number
we generate with CALL EXECUTE three steps: PROC FORMAT, DATA-step to apply this format
to a specific variable, and then PROC CATALOG to delete format*/
data _null_;
set formatset;
by var;
if FIRST.var then do;
call execute(cats("proc format library=work cntlin=formatset(where=(var='",var,"')); run;"));
call execute("data want;");
call execute("set want;");
call execute(cats('_Var',var,'=input(var',var,',8.);'));
call execute(cats('Var',var,'_Group=put(_Var',var,',myformat.);'));
call execute("drop _:;");
call execute("proc catalog catalog=work.formats; delete myformat.format; run;");
end;
run;
UPDATE. I've changed the first DATA-step (for creating formatset) so that now end and start for each range is taken from variable range, not Group_Desc. And PROC SORT moved to the beginning of the code.
I have the following sample data and 'proc means' command.
data have;
input measure country $;
datalines;
250 UK
800 Ireland
500 Finland
250 Slovakia
3888 Slovenia
34 Portugal
44 Netherlands
4666 Austria
run;
PROC PRINT data=have; RUN;
The following PROC MEANS command prints out a listing for each country above. How can I group some of those countries (i.e. UK & Ireland, Slovakia/SLovenia as Central Europe) in the PROC MEANS step, rather than adding another datastep to add a 'case when' etc?
proc means data=have sum maxdec=2 order=freq STACKODS;
var measure;
class country;
run;
Thanks for any help at all on this. I understand there are various things you can do in the PROC MEANS command itself (like limit the number of countries by doing this:
proc means data=have(WHERE=(country not in ('Finland', 'UK')
I'd like to do the grouping in the PROC MEANS command for brevity.
Thanks.
This is very easy with a format for any PROC that takes a CLASS statement.
Simply build a format, either with code or from data; then apply the format in the PROC MEANS statement.
proc format lib=work;
value $countrygroup
"UK"="British Isles"
"Ireland"="British Isles"
"Slovakia","Slovenia"="Central Europe"
;
quit;
proc means data=have;
class country;
var measure;
format country $countrygroup.;
run;
It's usually better to have numeric codes for country and then format those to be whichever set of names is needed at any one time, particularly as capitalization/etc. is pretty irritating, but this works well enough even here.
The CNTLIN= option in PROC FORMAT allows you to make a format from a dataset, with FMTNAME as the value statement, START as the value-to-label, LABEL as the label. (END=end of range if numeric.) There are other options also, the documentation goes into more detail.