Splitting a column into two based on conditions in Proc SQL, SAS - sas

I want to split the Airlines column into two groups and then add up each group's amount for every client:
Group 1 = Air India and Jet Airways
Group 2 = Others
Loc Client_Name Airlines Amount
BBI A_1ABC2 Air India 41302
BBI A 1ABC2 Air India 41302
MAA Th 1ABC2 Spice Jet Airlines 288713
HYD Ma 1ABC2 Jet Airways 365667
BOM Vi 1ABC2 Air India 552506
Something like this: -
Rank Client_name Group1 Group2 Total
1 Ca 1ABC2 5266269 7040320 12306589
2 Ve 1ABC2 2815593 2675886 5491479
3 Ma 1ABC2 1286686 437843 1724529
4 Th 1ABC2 723268 701712 1424980
5 Ec 1ABC2 113517 627734 741251
6 A 1ABC2 152804 439381 592185
I grouped the rows first, but I am confused about how to split them into columns:
Data assign6.Airlines_grouping1;
Set assign6.Airlines_grouping;
if Scan(Airlines,1) IN ('Air','Jet') then Group = "Group1";
else
if Scan(Airlines,1) Not in('Air','Jet') then Group = "Group2";
Run;

You are categorizing a row based on the first word of the airline.
Proc TRANSPOSE with an ID statement is one common way to reshape data so that a categorical value becomes a column. A second way is to bypass the categorization and use a data step to produce the new shape of data directly.
Here is an example of the second way -- create new columns group1 and group2 and set value based on airline criteria.
data airlines_group_amounts;
set airlines;
if scan (airlines,1) in ('Air', 'Jet') then
group1 = amount;
else
group2 = amount;
run;
Then summarize over client:
proc sql;
create table want as
select
client_name
, sum(group1) as group1
, sum(group2) as group2
, sum(amount) as total
from airlines_group_amounts
group by client_name
;
quit;
You can avoid the two steps and do all of the processing in a single query, or you can do the summarization with Proc MEANS.
Here is a single query way.
proc sql;
create table want as
select
client_name
, sum(case when scan (airlines,1) in ('Air', 'Jet') then amount else 0 end) as group1
, sum(case when scan (airlines,1) in ('Air', 'Jet') then 0 else amount end) as group2
, sum(amount) as total
from airlines
group by client_name
;
quit;
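The Proc TRANSPOSE route mentioned at the start of this answer could look roughly like the sketch below. It assumes the same input table airlines with columns client_name, airlines and amount; the names airlines_grouped, group_totals, want_wide and want_transposed are made up for illustration. The rows are categorized first, summed per client and group, and the group value then becomes a column through the ID statement.

```sas
/* Step 1: tag each row with its group (same first-word test as above) */
data airlines_grouped;
  set airlines;
  length group $6;
  if scan(airlines, 1) in ('Air', 'Jet') then group = 'Group1';
  else group = 'Group2';
run;

/* Step 2: one row per client and group; NWAY keeps only the full
   client_name*group combinations, already ordered by the CLASS variables */
proc summary data=airlines_grouped nway;
  class client_name group;
  var amount;
  output out=group_totals (drop=_type_ _freq_) sum=amount;
run;

/* Step 3: the values of group (Group1, Group2) become column names */
proc transpose data=group_totals out=want_wide (drop=_name_);
  by client_name;
  id group;
  var amount;
run;

/* Step 4: SUM() ignores a missing group, e.g. a client with no Group1 rows */
data want_transposed;
  set want_wide;
  total = sum(Group1, Group2);
run;
```

This is more steps than the single query, but it keeps working unchanged if more groups are added later, since the column names come from the data.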

Related

SAS repeating a set of statements for each value of macro

I have two tables in Teradata. One of them, Reports, looks like this:
Year Report_ID BAD_PART_NUMBERS
2015 P12568 6989820
2015 P12568 1769819
2015 P12568 1988700
2015 P12697 879010
2015 P12697 287932
2015 P12697 17902
and the other table, Orders:
order_no Customer_id Purchase_dt PART_NUM PART_DESC
265187 B1792 3/4/2016 02-6989820 gfsahj
1669 B1792 7/8/2017 01-32769237 susisd
1692191 B1794 5/7/2015 03-6989820 gfsahj
16891 B1794 3/24/2016 78-1769819 ysatua
62919 B1794 2/7/2017 15-3287629 at8a9s7d
One of my objectives is to find, for every Report_ID, the part number that was most frequently purchased after purchasing a bad part.
For one Report_ID I wrote the code like this:
%let REPORT_ID=('P12568');
Proc SQL;
connect to teradata as tera1 (server='XXX' user=&userid pwd=&pwd Database="XXXXX");
create table BAD_PART as
select * from connection to tera1
(
select REPORT_ID,BAD_PART_NUMBERS from REPORTS where REPORT_ID=&REPORT_ID
/* other where conditions */
group by 1,2
)
;
disconnect from tera1;
quit;
/*creating a PART_NUM macro*/
PROC SQL NOPRINT;
SELECT quote(cats('%',BAD_PART_NUMBERS),"'")
INTO :PART_NUM separated by ", "
FROM BAD_PART ;
QUIT;
%put macro variable PART_NUM:&PART_NUM;
/*FINDING SECONDARY PART INFORMATION*/
proc sql;
connect to teradata as tera1 (server='XXXX' user=&userid pwd=&pwd Database="XXXX");
create table SEC_PART as
select * from connection to tera1
(
SELECT &REPORT_ID as REPORT_ID, B.PART_NUM, B.PART_DESC, COUNT(DISTINCT B.order_no) as frequency
from (
select Customer_id, Min(Purchase_dt) as FIRST_BAD_PART_PURCHASE
from ORDERS
where PART_NUM like any (&PART_NUM)
group by Customer_id ) A
left join (
select Customer_id, Purchase_dt, PART_NUM, PART_DESC, order_no
from ORDERS group by 1,2,3,4,5 ) B
on A.Customer_id = B.Customer_id
AND A.FIRST_BAD_PART_PURCHASE < B.Purchase_dt
group by 1,2,3
having frequency > 0
order by frequency desc
)
;
disconnect from tera1;
quit;
/*---various PROC SQL and Data steps*/
Ultimately, I have a dataset that looks like this:
Report_ID MONTHS VALUE
P12568 0 21
P12568 1 34
P12568 2 40.38
P12568 3 67.05
P12568 4 100.08
where MONTHS is continuous and represents months of exposure. The final table for every Report_ID needs to be appended. Suppose I am interested in all Report_IDs for a year, e.g.:
select REPORT_ID from reports where year='2015'.
Right now my code handles one Report_ID, but I would like to run it for more than one at once.
Try performing the entire query in Teradata. Instead of constructing the any list, join to the bad_part query and use concatenation to construct the like pattern.
Deep in the query try having a
JOIN ( select part_num bad_part_num from <bad part_num query> ) bad_list ON
PART_NUM like '%' || bad_list.bad_part_num
instead of the
where (PART_NUM like any(&PART_NUM)) A
wherein the any list is a list of %-prefixed bad part numbers, constructed via the SAS Proc SQL (into :) macro variable.
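Put together, the join suggested above might look roughly like the following. This is only a sketch, not something run against your tables: it reuses the table and column names from the question, assumes BAD_PART_NUMBERS is stored as character (cast and trim it if it is numeric), and reuses the year filter from the question as is. Because REPORT_ID is carried through the inner query, all reports for the year are handled in one pass and no macro list or loop is needed.

```sas
proc sql;
  connect to teradata as tera1 (server='XXX' user=&userid pwd=&pwd database="XXXXX");
  create table SEC_PART as
  select * from connection to tera1
  (
    select A.REPORT_ID, B.PART_NUM, B.PART_DESC,
           count(distinct B.order_no) as frequency
    from (
      /* first purchase of any bad part, per report and customer */
      select R.REPORT_ID, O.Customer_id,
             min(O.Purchase_dt) as FIRST_BAD_PART_PURCHASE
      from ORDERS O
      join (select distinct REPORT_ID, BAD_PART_NUMBERS
            from REPORTS
            where year = '2015') R
        on O.PART_NUM like '%' || trim(R.BAD_PART_NUMBERS)
      group by 1, 2
    ) A
    /* every later purchase by the same customer */
    join ORDERS B
      on A.Customer_id = B.Customer_id
     and A.FIRST_BAD_PART_PURCHASE < B.Purchase_dt
    group by 1, 2, 3
  );
  disconnect from tera1;
quit;
```

The result has one row per REPORT_ID and purchased part, so the most frequent part per report can be picked afterwards by sorting on REPORT_ID and descending frequency.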

Proc sql - Group by aggregate function from subquery in main query

I have two data sets containing millions of rows. Table1 contains two different ID numbers, ID1 and ID2. It also contains a variable (y1) indicating which group a certain ID belongs to.
The second table (Table2) contains two variables from the first table and an additional one.
I want to join the two tables together but before the join, I want table1 to only contain information grouped by ID1 and also for it to give me information which group an ID belongs to.
I could do this in two Proc Sql stages where I first create a table on table1 where I group by ID1 and then create another step where I merge it onto table2. However this is rather inefficient as my tables contain so many rows and I would therefore like to do it in one run. Hence I have instead created a subquery that does what I want. My problem is that I get the error that I can't group by the variable "WhichGroup" from my subquery as it stems from an aggregate function. I'm wondering if there is some good workaround to what I want to achieve?
Many thanks in advance!
Example code:
data table1;
input ID1 $ ID2 $ x1 2. y1 $;
datalines;
1 p1 10 Group1
1 p2 20 Group2
2 p3 50 Group1
;
run;
data table2;
input ID1 $ x1 x2;
datalines;
1 10 500
1 20 600
2 50 700
;
run;
Proc sql;
Create table Test
as select
t1.WhichGroup
,sum(t1.Sum_x1) as Sum_x1
,sum(t2.x2) as Sum_x2
from (select
a.ID1
,case when max(case when a.y1 = 'Group1' then 1 else 0 end) = 0 then 'Group2'
when max(case when a.y1 = 'Group2' then 1 else 0 end) = 0 then 'Group1'
else 'Both' end as WhichGroup
,Sum(a.x1) as Sum_x1
from work.table1 as a
group by 1
) as t1
left join
work.table2 as t2
on t1.ID1 = t2.ID1
Group by 1;
Quit;
- Answering my own question -
I am not sure why this is happening, but I have encountered a very interesting phenomenon and potentially a bug in SAS.
It appears that the query fails because SAS does not handle the outer group by when it is given as a column position rather than as the explicit name of the variable to group by. Perhaps SAS gets lost in the column order?
Has anyone else encountered this in SAS before?
Hence the query works if the following code is used:
Proc sql;
Create table Test
as select
t1.WhichGroup
,sum(t1.Sum_x1) as sum_x1
,sum(t2.x2) as Sum_x2
from (select
a.ID1
,case when max(case when a.y1 = 'Group1' then 1 else 0 end) = 0 then 'Group2'
when max(case when a.y1 = 'Group2' then 1 else 0 end) = 0 then 'Group1'
else 'Both' end as WhichGroup
,Sum(a.x1) as Sum_x1
from work.table1 as a
group by 1
) as t1
left join
work.table2 as t2
on t1.ID1 = t2.ID1
Group by WhichGroup;
Quit;

How to count distinct values over two dimensions using SAS

I have a dataset that looks like the following. It contains four variables: country name (Country), company ID (Company), Year, and Date.
Country Company Year Date
------- ------- ---- ----
A 1 2000 2000/01/02
A 1 2001 2001/01/03
A 1 2001 2001/07/02
A 1 2000 2001/08/03
B 2 2000 2001/08/03
C 3 2000 2001/08/03
I know how to count the number of distinct companies in each country. I did it using the following code.
proc sql;
create table lib.count as
select country, count(distinct company) as count
from lib.data
group by country;
quit;
My problem is how to count the number of distinct company-years in each country. Essentially, I want to know how many different companies, or the same company in different years, each country has. If there are two observations for the same company in the same year, I want to count them as one distinct value. If the same company has two observations in different years, I want to count them as two distinct values. I want the output to look like the following (one number per country):
Country No. firm_year
A 2
B 1
C 1
Can anyone teach me how to do this, please?
A quick method is to concatenate all the variables you want to compare, creating a new variable. Use a delimiter so that different company/year pairs cannot produce the same string. Something like:
data data_mod;
set lib.data;
length company_year $ 20;
company_year = catx('_', company, year);
run;
Then you can run your proc sql with count(distinct company_year).
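For instance, the query from the question only needs its input table and counted column swapped. A minimal sketch, assuming data_mod and company_year as created in the data step above (the output name lib.count_fyr is made up here):

```sas
proc sql;
  create table lib.count_fyr as
  select country,
         count(distinct company_year) as firm_year  /* one per company/year pair */
  from data_mod
  group by country;
quit;
```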
You need nested queries, as @DaBigNikoladze hinted at:
An "internal" query that generates the list of distinct combinations of Country + Company + Year;
An "external" query that counts how many rows per country are present in the internal query.
Generate dataset
data have;
informat Country $1.
Company 1.
Year 4.
Date YYMMDD10.;
format Date YYMMDDs10.;
input country company year date;
datalines;
A 1 2000 2000/01/02
A 1 2001 2001/01/03
A 1 2001 2001/07/02
A 1 2000 2001/08/03
B 2 2000 2001/08/03
C 3 2000 2001/08/03
;
Execute query
PROC SQL;
CREATE TABLE want AS
SELECT country, Count(company) AS Firm_year
FROM (SELECT DISTINCT country, company, year FROM have)
GROUP BY country;
QUIT;
Results
Country Firm_year
A 2
B 1
C 1
/* keep one row per country/company/year combination */
proc sort data=lib.data out=temp nodupkey;
by country company year;
run;
/* count the remaining rows within each country */
data firm_year(keep=country cnt_fyr);
set temp;
by country company year;
if first.country then cnt_fyr=1;
else cnt_fyr+1;
if last.country;
run;
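The answer to your first question (lib.data must be sorted by country) is:
data lib.count(keep=country companyCount);
set lib.data;
by country;
/* assumed long enough to hold all company IDs of one country */
length companyList $200;
retain companyList '' companyCount 0;
if first.country then do;
companyList = cats(company);
companyCount = 1;
end;
/* whole-word match, so that company 1 is not found inside 11 */
else if not findw(companyList, cats(company), ', ') then do;
companyList = catx(',', companyList, company);
companyCount + 1;
end;
if last.country then output;
run;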
The result is:
Country companyCount
------- ------------
A 1
B 1
C 1
Similarly, you can get the number of distinct company-years in each country.
I guess I'm a bit confused as to what you expect the result to look like. Here is an SQL method that gets the same result as the other answers posted so far.
data temp;
attrib Country length = $10;
attrib Company length = $10;
attrib Year length = $10;
attrib Date length = $10;
infile datalines delimiter = '#';
input Country $ Company $ Year $ Date $;
datalines;
A#1#x#x1#
A#1#x#x2#
B#2#x#x1#
C#3#x#x3#
;
run;
proc sql;
create table temp2 as
select country, count(distinct Date) as count
from temp
group by country, company;
quit;

How to pivot up data by summing values?

I have a data set as below. I need to print 2 data sets, one for EU and the other for US, such that I have unique IDs in the rows and the sales for each ID is the sum of its sales (e.g. for ID 1 sales will be 1200+1500, for ID 4 sales will be 3000+9000). Can someone please suggest a proc or a short way of getting this?
ID Country Sales
1 EU 1200
2 US 1000
1 EU 1500
3 EU 2000
4 US 3000
4 US 9000
This should be easy with a proc sql containing a group by statement:
proc sql;
create table work.sales_by_id as (
select ID, country, sum( sales ) as total_sales
from input_data
group by ID, country
);
quit;
Edit: added grouping by country as I think this is what you wanted
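If two separate data sets are really needed, the summary table can be split afterwards. A minimal sketch, assuming work.sales_by_id from the query above; the output names sales_eu and sales_us are made up here:

```sas
/* route each summarized row to the data set for its country */
data sales_eu sales_us;
  set work.sales_by_id;
  if country = 'EU' then output sales_eu;
  else if country = 'US' then output sales_us;
run;

proc print data=sales_eu; run;
proc print data=sales_us; run;
```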

sas count number of different combinations

I have a data set with information on students' educations at an institution.
I want to get the number of different study programmes they have been enrolled in. I have information on both master and bachelor level, and I want to count the number of different study programmes at each education level (master, bachelor).
For example person1 can have:
Bachelor:
- study1
- study2
- study3
- study3
Master:
- studyA
- studyA
Then I want a count of 3 study programmes at bachelor level (study3 should not count twice), and a count of 1 at master level.
Each study programme has its own row - so in the dataset person1 has 6 rows.
I want one row per person telling the number of study programmes per education level:
person number_bachelor number_master
person1 3 1
....etc...
I have tried with this:
proc sql;
create table new as
select distinct personid, name,
count(study) as number_of_bach
from old
group by personid, edu_level, study;
quit;
But it doesn't give me what I want.
This gives me two rows with person1 with the values of 1 and 2 in the variable "number_of_bach".
How can I edit this code to get the result I want?
Code:
data education;
input person $ level $ program $;
datalines;
person1 bachelor study1
person1 bachelor study2
person1 bachelor study3
person1 bachelor study3
person1 master study1
person2 bachelor study1
person2 master study2
person2 master study1
;
run;
proc sort data = education nodupkey;
by person level program;
run;
proc sql;
select person,
sum(case when level eq 'bachelor' then 1 else 0 end) as num_bachelors,
sum(case when level eq 'master' then 1 else 0 end) as num_masters
from education
group by person;
quit;
Working: the SORT procedure eliminates duplicate records, if any. The SQL procedure then counts, per person, the programs at bachelor level and the programs at master level. Note that PROC SORT without OUT= overwrites the education data set.
Output:
person num_bachelors num_masters
person1 3 1
person2 1 2
Is this what you want?
DATA old;
INPUT personid edu_level $ study $;
DATALINES;
1 bachelor study1
1 bachelor study2
1 bachelor study3
1 bachelor study3
1 master studyA
1 master studyA
1 master studyB
;
PROC SQL;
CREATE TABLE new AS
SELECT personid, edu_level, COUNT (DISTINCT study) AS num_bach
FROM OLD
GROUP BY personid, edu_level;
QUIT;
The column study is an aggregated column in your query (because COUNT is an aggregate function) and as such should not be included in the GROUP BY clause; otherwise your query will also group by study and the count will always be 1.
If you want each person on one line, add a PROC TRANSPOSE:
PROC TRANSPOSE DATA = new OUT = new2;
BY personid;
ID edu_level;
RUN;
(You could also create a more complex query using subqueries and joins instead of the transpose. As long as you don't have millions of rows, the overhead of the TRANSPOSE doesn't matter.)
For the sake of completeness here is a SQL-only solution to your question:
PROC SQL;
CREATE TABLE new AS
SELECT p.personid, b.num_bachelors, m.num_masters
/* Select unique personids */
FROM (SELECT DISTINCT personid
FROM old) AS p
/* Count number of bachelor-level courses */
LEFT JOIN (SELECT personid,
COUNT(DISTINCT study) AS num_bachelors
FROM old WHERE edu_level = 'bachelor'
GROUP BY personid) AS b on p.personid = b.personid
/* Count number of master-level courses */
LEFT JOIN (SELECT personid,
COUNT(DISTINCT study) AS num_masters
FROM old WHERE edu_level = 'master'
GROUP BY personid) AS m on p.personid = m.personid;
QUIT;