I have a data set as below. I need to print 2 data sets -one for EU and other for US such that I have unique IDs in the rows and the sales for each ID is the sum of the sales.( E.g. for ID 1 sales will be 1200+1500, for ID 4 sales will be 3000+9000). Can someone please suggest some proc or short way of getting this?
ID Country Sales
1 EU 1200
2 US 1000
1 EU 1500
3 EU 2000
4 US 3000
4 US 9000
This should be easy with a proc sql containing a group by statement:
proc sql;
create table work.sales_by_id as (
select ID, country, sum( sales ) as total_sales
from input_data
group by ID, country
)
quit;
Edit: added grouping by country as I think this is what you wanted
Related
I have two tables in Power BI as follows:
COUNTRIES
COD COUNTRY
1 BRAZIL
2 ARGENTINA
3 CHILE
4 BRASIL
5 COLOMBIA
6 ARGENTINA
7 URUGUAI
SALES
COD DATE
1 2021-01-02
2 2021-10-01
3 2019-09-04
1 2018-07-05
7 2019-04-10
There's a relationship between the two tables, on the COD column.
I need to count how many countries (column "COUNTRY" from the table "COUNTRIES") have a status CHURN. It's considered CHURN when their latest order from the table "SALES" is more than 180 days, using today as a reference.
I know that I need to group by the MAX date per country, do a DATEDIFF, and then do a COUNT. I've tried using ALL and SUMMARIZE, but I haven't found a way to solve this problem.
Are you able to add a calculated column to store the max sales date for each country in your COUNTRIES table? Either in Power BI or directly in your database. If so, here's one solution with 2 steps.
Create a MaxSalesDate column in your COUNTRIES table. DAX for a calculated column below:
MaxSalesDate =
VAR COD = COUNTRIES[COD]
RETURN MAXX(FILTER(SALES, SALES[COD] = COD), SALES[DATE])
Create a measure that counts the number of MaxSalesDate values that are older than 180 days old:
CountCHURN = COUNTX(COUNTRIES, IF(DATEDIFF(COUNTRIES[MaxSalesDate], TODAY(), Day) > 180, 1))
I have tables as below.
Table A(total of 3000 rows, end_date may have duplicates, ex, 123 and 223 may have the same end_date)
enroll_dt,end_date, acct_nbr
12/31/2016, 01/03/2017, 123
12/31/2016, 01/04/2017, 234
01/05/2017, 02/02/2017, 334
Table B(total of 30 unique values)
enroll_dt
12/31/2016
01/01/2017
01/02/2017
01/03/2017
01/04/2017
01/05/2017
...
Desired table:
Date number_of_records
12/31/2016 2
01/01/2017 2
01/02/2017 2
01/03/2017 2
01/04/2017 1
02/01/2017 1
What I want to do is for each value from Table B, I would sort all of rows from Table A, and return # of acct_nbr if
for total # of accounts get enrolled until dateA, how many accounts have
end_date>DateA.
Ex. for 01/01/2017 from Table B, number_of_records = 2 since we only have 2 accounts enrolled until 01/01/2017(acct_nbr=123 and 234)
and end_date'01/03/2017' and '01/04/2017' both greater than '01/01/2017'
Thanks a lot for your help
Assuming your dates are stored as actual dates:
select
b.datea,
count(distinct a.acct_nbr)
from
b
inner join a
on a.end_date >= b.datea
group by
1
Suppose I have the following database:
DATA have;
INPUT id date gain;
CARDS;
1 201405 100
2 201504 20
2 201504 30
2 201505 30
2 201505 50
3 201508 200
3 201509 200
3 201509 300
;
RUN;
I want to create a new table want where the average of the variable gain is grouped by id and by date. The final database should look like this:
DATA want;
INPUT id date average_gain;
CARDS;
1 201405 100
2 201504 25
2 201505 40
3 201508 200
3 201509 250
I tried to obtain the desired result using the code below but it didn't work:
PROC sql;
CREATE TABLE want as
SELECT *,
mean(gain) as average_gain
FROM have
GROUP BY id, date
ORDER BY id, date
;
QUIT;
It's the asterisk that's causing the issue. That will resolve to id, date, gain, which is not what you want. ANSI SQL would not allow this type of functionality so it's one way in which SAS differs from other SQL implementation.
There should be a note in the log about remerging with the original data, which is essentially what's happening. The summary values are remerged to every line.
To avoid this, list your group by fields in your query and it will work as expected.
PROC sql;
CREATE TABLE want as
SELECT id, date,
mean(gain) as average_gain
FROM have
GROUP BY id, date
ORDER BY id, date
;
QUIT;
I will say, in general, PROC MEANS is usually a better option because:
calculate for multiple variables & statistics without need to list them all out multiple times
can get results at multiple levels, for example totals at grand total, id and group level
not all statistics can be calculated within PROC MEANS
supports variable lists so you can shortcut reference long lists without any issues
I want to Split the airlines column into two groups and then
Add each group 's amount for all clients... : -
Group 1 = Air India & jet airways
| Group 2 = Others.
Loc Client_Name Airlines Amout
BBI A_1ABC2 Air India 41302
BBI A 1ABC2 Air India 41302
MAA Th 1ABC2 Spice Jet Airlines 288713
HYD Ma 1ABC2 Jet Airways 365667
BOM Vi 1ABC2 Air India 552506
Something like this: -
Rank Client_name Group1 Group2 Total
1 Ca 1ABC2 5266269 7040320 1230658
2 Ve 1ABC2 2815593 2675886 5491479
3 Ma 1ABC2 1286686 437843 1724529
4 Th 1ABC2 723268 701712 1424980
5 Ec 1ABC2 113517 627734 741251
6 A 1ABC2 152804 439381 592185
I grouped it first ..but i am confused regarding how to split: -
Data assign6.Airlines_grouping1;
Set assign6.Airlines_grouping;
if Scan(Airlines,1) IN ('Air','Jet') then Group = "Group1";
else
if Scan(Airlines,1) Not in('Air','Jet') then Group = "Group2";
Run;
You are categorizing a row based on the first word of the airline.
Proc TRANSPOSE with an ID statement is one common way to reshape data so that a categorical value becomes a column. A second way is to bypass the categorization and use a data step to produce the new shape of data directly.
Here is an example of the second way -- create new columns group1 and group2 and set value based on airline criteria.
data airlines_group_amounts;
set airlines;
if scan (airlines,1) in ('Air', 'Jet') then
group1 = amount;
else
group2 = amount;
run;
summarize over client
proc sql;
create table want as
select
client_name
, sum(group1) as group1
, sum(group2) as group2
, sum(amount) as total
from airlines_group_amounts
group by client_name
;
You can avoid the two steps and do all of the processing in a single query, or you can do the summarization with Proc MEANS
Here is a single query way.
proc sql;
create table want as
select
client_name
, sum(case when scan (airlines,1) in ('Air', 'Jet') then amount else 0 end) as group1
, sum(case when scan (airlines,1) in ('Air', 'Jet') then 0 else amount end) as group2
, sum(amount) as total
from airlines
group by client_name
;
I have a dataset looks like the following. This dataset contains four variable Country name Country, company ID Company, Year and Date.
Country Company Year Date
------- ------- ---- ----
A 1 2000 2000/01/02
A 1 2001 2001/01/03
A 1 2001 2001/07/02
A 1 2000 2001/08/03
B 2 2000 2001/08/03
C 3 2000 2001/08/03
I know how to count number of distinct company in each country. I did it using the following code.
proc sql;
create table lib.count as
select country, count(distinct company) as count
from lib.data
group by country;
quit;
My problem is how to count the number of distinct company-Years in each country. Essentially i want to know how many different company or same company in different year. If there are two observation for the same company in the same year, I want to count it as 1 different value. If same company have two observation in differeny year I want to count it as two different value. I want the output looks like the following (one number per country):
Country No. firm_year
A 2
B 1
C 1
Can anyone can teach me how to do it please.
A quick method is to concatenate all the variables you want to compare, creating a new variable. Something like:
data data_mod;
set data;
length company_year $ 20;
company_year= cats(company,year);
run;
Then you can run your proc sql with count(distinct company_year).
You need nested queries, as #DaBigNikoladze hinted at...
An "internal" query which will generate a list of distinct combinations of Country + Company + Year;
An "external" query which will count how many rows per country are present in the internal query.
Generate dataset
data have;
informat Country $1.
Company 1.
Year 4.
Date YYMMDD10.;
format Date YYMMDDs10.;
input country company year date;
datalines;
A 1 2000 2000/01/02
A 1 2001 2001/01/03
A 1 2001 2001/07/02
A 1 2000 2001/08/03
B 2 2000 2001/08/03
C 3 2000 2001/08/03
;
Execute query
PROC SQL;
CREATE TABLE want AS
SELECT country, Count(company) AS Firm_year
FROM (SELECT DISTINCT country, company, year FROM have)
GROUP BY country;
QUIT;
Results
Country Firm_year
A 2
B 1
C 1
proc sort data=lib.data out=temp nodupkey;
by country company year;
run;
data firm_year(keep=country cnt_fyr);
set out;
by country company year
retain cnt_fyr;
if first.country then cnt_fyr=1;
else cnt_fyr+1;
if last.country;
run;
The answer for your first question is:
data lib.count(keep=country companyCount);
set lib.data;
by country;
retain companyList '';
retain companyCount 0;
if first.country then do;
companyList = company;
companyCount = 1;
end;
else do;
if ^index(companyList, company) then do;
companyList = cats(companyList,',',company);
companyCount + 1;
end;
end;
if last.country then output;
run;
The resutl is:
Country companyCount
------- ------------
A 2
B 1
C 1
Similary you will take the number of distinct company-Years in each country.
Guess i'm a bit confused as to what you are expecting the result to look like. Here is an sql method that gets the same result as posted by the other answer so far.
data temp;
attrib Country length = $10;
attrib Company length = $10;
attrib Year length = $10;
attrib Date length = $10;
input Country $ Company $ Year $ Date $;
infile datalines delimiter = '#';
datalines;
A#1#x#x1#
A#1#x#x2#
B#2#x#x1#
C#3#x#x3#
;
run;
proc sql;
create table temp2 as
select country, count(distinct Date) as count
from temp
group by country, company;
quit;