PROC TABULATE WITH TOTAL - sas

I am building reports with proc tabulate, but I am unable to add a total to the report.
Example
+--------+------+----------+--------+---+---+---+
| Shop | Year | Month | Family | A | B | C |
+--------+------+----------+--------+---+---+---+
| raoas | 2006 | january | TA12 | 5 | 6 | 0 |
| taba | 2008 | january | TS01 | 0 | 1 | 1 |
| suptop | 2008 | april | TZ05 | 0 | 0 | 1 |
| taba | 2006 | December | TA12 | 5 | 6 | 0 |
| raoas | 2008 | january | TA15 | 0 | 2 | 0 |
| sup | 2008 | april | TQ05 | 0 | 1 | 1 |
+--------+------+----------+--------+---+---+---+
code
proc tabulate data=REPORTDATA_T6 format=12.;
CLASS YEAR;
var A C;
table (A C)*SUM='',YEAR=''
/box = 'YEAR';
TITLE 'FORECAST SUMMARY';
run;
output
YEAR 2006 2008 2009
A 800 766 813
C 854 832 812
I tried table (A C)*sum, year all, but that adds a total across all the years, whereas I want a total within each year.
I also tried table (A C)*sum all, year, but that gives the number of observations (N) rather than a sum. Thanks, Jon Clements, but I don't want to add a TOTAL variable to the table: this is sample data, the real data has more than 10 variables, and the variables sometimes change, so I don't want to add a new total variable every time.

I'm not sure it's possible to do what you want in one step using only the original data. The keyword ALL only sums over the categories of CLASS variables, but you want to sum over two different analysis variables.
It is easy enough with an interim step, though: create a dataset in which the A, B and C variables become categories of a single variable:
data REPORTDATA_T6;
input Shop $ Year Month $ Family $ A B C;
datalines;
raoas 2006 january TA12 5 6 0
taba 2008 january TS01 0 1 1
suptop 2008 april TZ05 0 0 1
taba 2006 December TA12 5 6 0
raoas 2008 january TA15 0 2 0
sup 2008 april TQ05 0 1 1
;
run;
proc sort data=REPORTDATA_T6; by Shop Year Month Family; run;
proc transpose data=REPORTDATA_T6 out=REPORTDATA_T6_long;
var A B C;
by Shop Year Month Family;
run;
proc tabulate data=REPORTDATA_T6_long;
class _NAME_ YEAR;
var COL1;
table (_NAME_ all)*COL1=' '*SUM=' ', YEAR=' '
/box = 'YEAR';
TITLE 'FORECAST SUMMARY';
run;
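The reshape-then-total idea above does not depend on SAS. As a minimal, language-neutral sketch in Python (the function name sums_by_year is hypothetical, and only the six sample rows from the question are used), each measure becomes a category, and the extra "All" row is the per-year sum over those categories:

```python
from collections import defaultdict

# Sample rows from the question: (shop, year, month, family, A, B, C).
ROWS = [
    ("raoas", 2006, "january", "TA12", 5, 6, 0),
    ("taba", 2008, "january", "TS01", 0, 1, 1),
    ("suptop", 2008, "april", "TZ05", 0, 0, 1),
    ("taba", 2006, "December", "TA12", 5, 6, 0),
    ("raoas", 2008, "january", "TA15", 0, 2, 0),
    ("sup", 2008, "april", "TQ05", 0, 1, 1),
]
MEASURES = ("A", "B", "C")


def sums_by_year(rows, measures=MEASURES):
    """Return {row_label: {year: sum}} including an extra 'All' row."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        year = row[1]
        # "Transpose": every measure value becomes one (name, value) record.
        for name, value in zip(measures, row[4:]):
            table[name][year] += value
            table["All"][year] += value  # total over all measures, per year
    return {label: dict(by_year) for label, by_year in table.items()}


result = sums_by_year(ROWS)
for label in (*MEASURES, "All"):
    print(label, result[label])
```

Because the total is computed over whatever names appear in MEASURES, adding or removing a variable does not require a hand-written total column, which is the point of the transpose step.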

Related

How to extract all values that contain part of a particular number and then delete them?

How do you extract all values containing part of a particular number and then delete them?
I have data where the IDs have different lengths, and I want to extract all the IDs that contain a particular number. For example, if the ID ends in "-00", "02" or "-01", I want to pull those rows so I can see the hit rate that includes them, and then delete that suffix from the ID. Is there a more efficient way of writing this code?
I tried to use the substring function to slice the ID, but some IDs have other characters at the specified positions.
Code:
Proc sql;
Create table work.data1 AS
SELECT Product, Amount_sold, Price_per_unit,
CASE WHEN Product Contains "Pen" and Length(ID) >= 9 Then SUBSTR(ID,1,9)
WHEN Product Contains "Book" and Length(ID) >= 11 Then SUBSTR(ID,1,11)
WHEN Product Contains "Folder" and Length(ID) >= 12 Then SUBSTR(ID,1,12)
...
END AS ID
FROM A;
Quit;
Have:
+------------------+-----------------+-------------+----------------+
| ID | Product | Amount_sold | Price_per_unit |
+------------------+-----------------+-------------+----------------+
| 123456789 | Pen | 30 | 2 |
| 63495837229-01 | Book | 20 | 5 |
| ABC134475472 02 | Folder | 29 | 7 |
| AB-1235674467-00 | Pencil | 26 | 1 |
| 69598346-02 | Correction pen | 15 | 1.50 |
| 6970457688 | Highlighter | 15 | 2 |
| 584028467 | Color pencil | 15 | 10 |
+------------------+-----------------+-------------+----------------+
Wanted the final result:
+------------------+-----------------+-------------+----------------+
| ID | Product | Amount_sold | Price_per_unit |
+------------------+-----------------+-------------+----------------+
| 123456789 | Pen | 30 | 2 |
| 63495837229 | Book | 20 | 5 |
| ABC134475472 | Folder | 29 | 7 |
| AB-1235674467 | Pencil | 26 | 1 |
| 69598346 | Correction pen | 15 | 1.50 |
| 6970457688 | Highlighter | 15 | 2 |
| 584028467 | Color pencil | 15 | 10 |
+------------------+-----------------+-------------+----------------+
Test whether the string has any embedded spaces or hyphens, and whether the last word (delimited by a space or hyphen) is 00, 01 or 02. If both are true, chop off the last three characters.
data have;
infile cards dsd dlm='|' truncover ;
input id :$20. product :$20. amount_sold price_per_unit;
cards;
123456789 | Pen | 30 | 2 |
63495837229-01 | Book | 20 | 5 |
ABC134475472 02 | Folder | 29 | 7 |
AB-1235674467-00 | Pencil | 26 | 1 |
69598346-02 | Correction pen | 15 | 1.50 |
6970457688 | Highlighter | 15 | 2 |
584028467 | Color pencil | 15 | 10 |
;
data want;
set have ;
if indexc(trim(id),'- ') and scan(id,-1,'- ') in ('00' '01' '02') then
id = substrn(id,1,length(id)-3)
;
run;
Result
amount_ price_
Obs id product sold per_unit
1 123456789 Pen 30 2.0
2 63495837229 Book 20 5.0
3 ABC134475472 Folder 29 7.0
4 AB-1235674467 Pencil 26 1.0
5 69598346 Correction pen 15 1.5
6 6970457688 Highlighter 15 2.0
7 584028467 Color pencil 15 10.0
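The same rule can be checked outside SAS. The following is a minimal Python sketch, not the answer's method: it swaps the indexc/scan test for a regular expression, and the names SUFFIX and strip_suffix are hypothetical. It assumes the suffix is always a single hyphen or space followed by exactly 00, 01 or 02.

```python
import re

# A hyphen or space followed by 00, 01 or 02 at the very end of the ID.
SUFFIX = re.compile(r"[- ](00|01|02)$")


def strip_suffix(raw_id):
    """Drop a trailing '-00', '-01' or '-02' (or space-separated) suffix."""
    return SUFFIX.sub("", raw_id.strip())


# IDs from the question's "Have" table.
ids = [
    "123456789",
    "63495837229-01",
    "ABC134475472 02",
    "AB-1235674467-00",
    "69598346-02",
    "6970457688",
    "584028467",
]
cleaned = [strip_suffix(i) for i in ids]
print(cleaned)
```

IDs with a different trailing part, such as 221344-09, are left unchanged by this pattern.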
There may be other solutions, but you will need some string functions. Here I used substr, reverse (to reverse the string) and indexc (the position of the first of the given characters in the string). Note that this version removes whatever follows the last hyphen or space, not only 00, 01 or 02:
data have;
input text $20.;
datalines;
12345678
AB-142353 00
AU-234343-02
132453 02
221344-09
;
run;
data want (drop=reverted pos);
set have;
if countw(text) gt 1
then do;
reverted=strip(reverse(text));
pos=indexc(reverted,'- ')+1;
new=strip(reverse(substr(reverted,pos)));
end;
else new=text;
run;

Grouping child items and displaying parent sum

I have the following table
+-------+--------+---------+
| group | item | value |
+-------+--------+---------+
| 1 | a | 10 |
| 1 | b | 20 |
| 2 | b | 30 |
| 2 | c | 40 |
+-------+--------+---------+
I would like to group the table by group, insert the grouped sum into value, and then ungroup:
+-------+--------+
| item | value |
+-------+--------+
| 1 | 30 |
| a | 10 |
| b | 20 |
| 2 | 70 |
| b | 30 |
| c | 40 |
+-------+--------+
The purpose of the result is to interpret the first column as items a and b belonging to group 1 with sum 30 and items b and c belonging to group 2 with sum 70.
Such a data transformation can be indicative of a reporting requirement more than a useful data structure for downstream processing. Proc REPORT can create output in the form desired.
data have;
infile datalines;
input group $ item $ value @@;
datalines;
1 a 10 1 b 20 2 b 30 2 c 40
;
proc report data=have;
column group item value;
define group / order order=data noprint;
break before group / summarize;
compute item;
if missing(item) then item=group;
endcomp;
run;
I assume that both group and item are character variables:
data have;
infile datalines firstobs=4 dlm='|';
input group $ item $ value;
datalines;
+-------+--------+---------+
| group | item | value |
+-------+--------+---------+
| 1 | a | 10 |
| 1 | b | 20 |
| 2 | b | 30 |
| 2 | c | 40 |
+-------+--------+---------+
;
data want (keep=group value);
do _N_=1 by 1 until (last.group);
set have;
by group;
v + value;
end;
value = v;
output;
v = 0;
do _N_=1 to _N_;
set have;
group = item;
output;
end;
run;
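The two-pass idea in the data step above (sum each group first, then replay its rows) can be sketched in Python as follows. This is an illustration under the same assumption that the rows are already ordered by group; the function name with_group_totals is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Rows from the question: (group, item, value), already ordered by group.
rows = [("1", "a", 10), ("1", "b", 20), ("2", "b", 30), ("2", "c", 40)]


def with_group_totals(rows):
    """Emit a (group, total) row ahead of each group's (item, value) rows."""
    out = []
    # groupby only merges adjacent rows, like BY-group processing in SAS.
    for group, members in groupby(rows, key=itemgetter(0)):
        members = list(members)
        out.append((group, sum(value for _, _, value in members)))
        out.extend((item, value) for _, item, value in members)
    return out


result = with_group_totals(rows)
for label, value in result:
    print(label, value)
```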

Calculating if Start time occurs within 1 hour range for each person (Single column)

I'm trying to work out whether the start times of the subjects occur within 1 hour of each other. However, I only have one column, and two groups with a different date for each. I have no second variable to compare against for a dhms time difference, because all the times sit under the same column variable. I thought of taking a lag of the first time and then using intck to calculate the 24 hour time difference between each, but I don't think I have sufficient arguments for the intck function. Alternatively, I could do a proc transpose and then take a time difference between each array variable, but that seems messy. Does anyone have a less clunky, more efficient solution? I might be overthinking this.
Sample Data:
+----------+-------+------+------------+------------+
| CLIENTID | GRPID | date | start_date | start_time |
+----------+-------+------+------------+------------+
| 2 | 1 | -2 | 10Nov2019 | 23:19:52 |
| 3 | 1 | -2 | 10Nov2019 | 23:22:51 |
| 4 | 1 | -2 | 10Nov2019 | 23:20:16 |
| 5 | 1 | -2 | 10Nov2019 | 23:21:30 |
| 6 | 1 | -2 | 10Nov2019 | 23:23:51 |
| 23 | 2 | -2 | 11Nov2019 | 23:11:38 |
| 24 | 2 | -2 | 11Nov2019 | 23:38:33 |
| 25 | 2 | -2 | 11Nov2019 | 23:15:01 |
| 26 | 2 | -2 | 11Nov2019 | 23:08:43 |
+----------+-------+------+------------+------------+
You can compile the start date and time into a temporary datetime variable (_start_dt) to ease the comparison. Then, taking the first datetime for each GRPID as the baseline, you could use a RETAIN statement to pass that baseline datetime (_base_dt) down the related data rows and find the time difference (time_diff) using the INTCK function with a dtsecond interval.
proc sort data=your_data;
by grpid clientid;
run;
data your_results (drop=_:);
retain CLIENTID GRPID DATE start_date start_time _base_dt;
format _base_dt _start_dt datetime16. time_diff time8.;
set your_data;
by grpid clientid;
_start_dt = dhms(start_date,hour(start_time),minute(start_time),second(start_time));
if first.grpid then _base_dt = _start_dt;
time_diff = intck('dtsecond', _base_dt, _start_dt);
run;
This gives the following results dataset:
+----------+-------+------+------------+------------+-----------+
| CLIENTID | GRPID | date | start_date | start_time | time_diff |
+----------+-------+------+------------+------------+-----------+
| 2 | 1 | -2 | 10Nov2019 | 23:19:52 | 00:00:00 |
| 3 | 1 | -2 | 10Nov2019 | 23:22:51 | 00:02:59 |
| 4 | 1 | -2 | 10Nov2019 | 23:20:16 | 00:00:24 |
| 5 | 1 | -2 | 10Nov2019 | 23:21:30 | 00:01:38 |
| 6 | 1 | -2 | 10Nov2019 | 23:23:51 | 00:03:59 |
| 23 | 2 | -2 | 11Nov2019 | 23:11:38 | 00:00:00 |
| 24 | 2 | -2 | 11Nov2019 | 23:38:33 | 00:26:55 |
| 25 | 2 | -2 | 11Nov2019 | 23:15:01 | 00:03:23 |
| 26 | 2 | -2 | 11Nov2019 | 23:08:43 | -0:02:55 |
+----------+-------+------+------------+------------+-----------+
I think I've interpreted your requirements correctly. Let me know if not.
It sounds like you want to check whether the RANGE of start_time within each group is less than 1 hour.
Convert start_date to a datetime value and add start_time before computing the range.
data have;
input
CLIENTID GRPID date start_date: date9. start_time: hhmmss6.;
format start_date date9. start_time time8.;
datalines;
2 1 -2 10Nov2019 23:19:52
3 1 -2 10Nov2019 23:22:51
4 1 -2 10Nov2019 23:20:16
5 1 -2 10Nov2019 23:21:30
6 1 -2 10Nov2019 23:23:51
23 2 -2 11Nov2019 23:11:38
24 2 -2 11Nov2019 23:38:33
25 2 -2 11Nov2019 23:15:01
26 2 -2 11Nov2019 23:08:43
run;
proc sql;
create table want (label="start range status by group") as
select
grpid,
range(dhms(start_date,0,0,0)+start_time) as start_range format time8.,
calculated start_range < '01:00:00't as one_hr_start_flag
from have
group by grpid;
quit;
If you want to disregard the groups and focus only on the time of day, disregarding the date, the range computation would be:
* Presuming 'noon' is the center of the day;
proc sql;
create table want (label="time of day start range status overall") as
select
range(start_time) as range format time8.,
calculated range < '01:00:00't as one_hr_start_flag
from have;
quit;
Looking only at the time of day is troublesome when some time values fall slightly after midnight, because they then appear almost a full day away from values just before midnight.
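As a cross-check of the per-group range logic, here is a minimal Python sketch using the sample data. The function name start_range_by_group is hypothetical, and the dates are rewritten in ISO form purely so that the standard library can parse them. Combining date and time into one value before taking max minus min is what keeps the comparison correct if a group ever spans midnight.

```python
from datetime import datetime, timedelta

# (CLIENTID, GRPID, start date and time) from the sample data, in ISO form.
ROWS = [
    (2, 1, "2019-11-10 23:19:52"),
    (3, 1, "2019-11-10 23:22:51"),
    (4, 1, "2019-11-10 23:20:16"),
    (5, 1, "2019-11-10 23:21:30"),
    (6, 1, "2019-11-10 23:23:51"),
    (23, 2, "2019-11-11 23:11:38"),
    (24, 2, "2019-11-11 23:38:33"),
    (25, 2, "2019-11-11 23:15:01"),
    (26, 2, "2019-11-11 23:08:43"),
]


def start_range_by_group(rows):
    """Return {group: latest start minus earliest start} as timedeltas."""
    spans = {}
    for _client, grp, text in rows:
        stamp = datetime.fromisoformat(text)
        lo, hi = spans.get(grp, (stamp, stamp))
        spans[grp] = (min(lo, stamp), max(hi, stamp))
    return {grp: hi - lo for grp, (lo, hi) in spans.items()}


ranges = start_range_by_group(ROWS)
# Flag groups whose starts all fall within one hour of each other.
flags = {grp: span < timedelta(hours=1) for grp, span in ranges.items()}
print(ranges, flags)
```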

Transposing variables

Is there an easy way to transpose my variables in Stata?
From:
-.48685038 -.13912173 -.91550094 -.96246505
-1.4760038 1.2873173 -.22300169 .25329232
-.01091149 -.58777297 .49454963 2.2842488
-.01376025 -.03060045 -.26231077 .32238093
.51557881 -2.1968436 .36612388 -.40590465
To:
-.48685038 -1.4760038 -.01091149 -.01376025 .51557881
-.13912173 1.2873173 -.58777297 -.03060045 -2.1968436
-.91550094 -.22300169 .49454963 -.26231077 .36612388
-.96246505 .25329232 2.2842488 .32238093 -.40590465
My understanding is that I have to create a matrix first:
mkmat *, matrix(data)
matrix data = data'
svmat data
Try xpose:
. webuse xposexmpl, clear
. list
+--------------------------------+
| county year1 year2 year3 |
|--------------------------------|
1. | 1 57.2 11.3 19.5 |
2. | 2 12.5 8.2 28.9 |
3. | 3 18 14.2 33.2 |
+--------------------------------+
. xpose, clear varname
. list
+-------------------------------+
| v1 v2 v3 _varname |
|-------------------------------|
1. | 1 2 3 county |
2. | 57.2 12.5 18 year1 |
3. | 11.3 8.2 14.2 year2 |
4. | 19.5 28.9 33.2 year3 |
+-------------------------------+

Generating a variable only including the top 4 firms with largest sales

My question is closely related to the question below:
Calculate industry concentration in Stata based on four biggest numbers
I want to generate a variable that only includes the 4 firms with the largest sales and excludes the rest.
In other words, the new variable will only hold values for the 4 firms with the largest sales in a given industry in a given year, and will be missing (.) for the rest.
Consider this:
webuse grunfeld, clear
bysort year (invest) : gen largest4 = cond(_n < _N - 3, ., invest)
sort year invest
list year largest4 if largest4 < . in 1/40, sepby(year)
+-----------------+
| year largest4 |
|-----------------|
7. | 1935 39.68 |
8. | 1935 40.29 |
9. | 1935 209.9 |
10. | 1935 317.6 |
|-----------------|
17. | 1936 50.73 |
18. | 1936 72.76 |
19. | 1936 355.3 |
20. | 1936 391.8 |
|-----------------|
27. | 1937 74.24 |
28. | 1937 77.2 |
29. | 1937 410.6 |
30. | 1937 469.9 |
|-----------------|
37. | 1938 51.6 |
38. | 1938 53.51 |
39. | 1938 257.7 |
40. | 1938 262.3 |
+-----------------+
If you had missing values, they would sort to the end of each block and mess up the results.
So you need one more trick:
generate OK = !missing(invest)
bysort OK year (invest) : gen Largest4 = cond(_n < _N - 3, ., invest) if OK
sort year invest
list year Largest4 if Largest4 < . in 1/40, sepby(year)
With this example, which you can run, there are no missing values and the results are the same.
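The logic of the second version (set missing values aside, then keep only the last four values of each sorted block) can be sketched in Python. This is an illustration only: the function name keep_largest is hypothetical and the records are made-up toy values, not the Grunfeld data.

```python
from collections import defaultdict


def keep_largest(records, n=4):
    """records: list of (year, value or None).

    Returns a list parallel to records in which every value that is not
    among the n largest non-missing values of its year is set to None.
    """
    by_year = defaultdict(list)
    for idx, (year, value) in enumerate(records):
        if value is not None:  # missing values never compete for a slot
            by_year[year].append((value, idx))
    keep = set()
    for pairs in by_year.values():
        # Sort ascending and keep the last n, like _n >= _N - 3 in Stata.
        keep.update(idx for _, idx in sorted(pairs)[-n:])
    return [value if idx in keep else None
            for idx, (_, value) in enumerate(records)]


# Made-up toy input: one year with a missing value, one with fewer than 4 rows.
toy = [(1935, 5), (1935, 1), (1935, None), (1935, 9),
       (1935, 3), (1935, 7), (1936, 2), (1936, 8)]
largest4 = keep_largest(toy)
print(largest4)
```

A year with fewer than four non-missing values keeps all of them, which matches what the cond(_n < _N - 3, ., invest) expression does in that case.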