Values as labels in Box plots - sas

I have the following sample of data
Y X1 X2 X3 X4 ...
123 121 214 241 241
431 143 141 241 124
214 124 214 142 241
531 432 134 412 124
243 124 134 134 123
I would like to plot the data above as box plots. Specifically, I would like X1, X2, ... on the x-axis and, on the y-axis, the values within each column summarized as a box plot.
However, since I would also like to identify visually the corresponding Y value (e.g., for the maximum of X1 it would be 531), I thought about using some labels.
For creating the box plot, I am using
ods graphics off;
proc boxplot data=test;
plot Y*X;
run;
where X is
X Y
X1 121
X1 143
X1 124
X1 432
... ...
X2 214
X2 141
X2 214
...
As shown above, however, I am losing the values of Y (i.e., 123, 431, ...).
Is there any way to keep this information in a (box) plot as well? Any other ideas would also be appreciated.

Transpose your data; you can then use the HBOX statement of PROC SGPLOT.
Example:
data have;
input
Y X1 X2 X3 X4 ;
datalines;
123 121 214 241 241
431 143 141 241 124
214 124 214 142 241
531 432 134 412 124
243 124 134 134 123
;
proc transpose data=have out=tall (rename=col1=x);
by y notsorted;
var x1-x4;
run;
ods html file='hbox-plot.html';
proc sgplot data=tall;
hbox x / category=y;
yaxis type=linear;
run;
ods html close;
This produces a plot with one horizontal box per value of Y.

Related

SAS transpose data into long form

I have the following dataset and hoping to transpose it into long form:
data have ;
input Height $ Front Middle Rear ;
cards;
Low 125 185 126
Low 143 170 136
Low 150 170 129
Low 138 195 136
Low 149 162 147
Medium 141 176 128
Medium 137 161 133
Medium 145 167 148
Medium 150 165 145
Medium 130 184 141
High 129 157 149
High 141 152 137
High 148 186 138
High 130 164 126
High 137 176 138
;
run;
Here Height is low, medium and high. Location is front, middle and rear. Numerical values are prices by location and height of a book on a bookshelf.
I'm hoping to transpose the dataset into long form with columns:
Height, Location and Price
The following code only allows me to transpose Location into long form. How should I transpose Height at the same time?
data bookp;
set bookp;
dex = _n_;
run;
proc sort data=bookp;
by dex;
run;
proc transpose data=bookp
out=bookpLong (rename=(col1=price _name_= location )drop= _label_ dex);
var front middle rear;
by dex;
run;
I think you just need to include HEIGHT in the BY statement.
First let's convert your example data into a SAS dataset.
data have ;
input Height $ Front Middle Rear ;
cards;
Low 125 185 126
Low 143 170 136
Medium 141 176 128
Medium 137 161 133
High 129 157 149
High 141 152 137
;
Now let's add an identifier to uniquely identify each row. Note that if you are really reading the data with a data step you could do this in the same step that reads the data.
data with_id ;
row_num+1;
set have;
run;
Now we can transpose.
proc transpose data=with_id out=want (rename=(_name_=Location col1=Price));
by row_num height ;
var front middle rear ;
run;
Results:
Obs row_num Height Location Price
1 1 Low Front 125
2 1 Low Middle 185
3 1 Low Rear 126
4 2 Low Front 143
5 2 Low Middle 170
6 2 Low Rear 136
7 3 Medium Front 141
8 3 Medium Middle 176
9 3 Medium Rear 128
10 4 Medium Front 137
11 4 Medium Middle 161
12 4 Medium Rear 133
13 5 High Front 129
14 5 High Middle 157
15 5 High Rear 149
16 6 High Front 141
17 6 High Middle 152
18 6 High Rear 137
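For readers who want to check the reshaping logic outside SAS, here is a minimal Python sketch of the same wide-to-long step. The names `to_long`, `LOCATIONS`, and `wide_rows` are illustrative only and are not part of the SAS answer; the row number plays the role of `row_num` in the BY statement.

```python
# Wide-to-long reshape, mirroring PROC TRANSPOSE with BY row_num height.
# All names here are hypothetical and chosen for illustration.
LOCATIONS = ("Front", "Middle", "Rear")

wide_rows = [
    ("Low", 125, 185, 126),
    ("Low", 143, 170, 136),
    ("Medium", 141, 176, 128),
]


def to_long(rows):
    """Return one (row_num, height, location, price) tuple per cell."""
    long_rows = []
    for row_num, (height, *prices) in enumerate(rows, start=1):
        for location, price in zip(LOCATIONS, prices):
            long_rows.append((row_num, height, location, price))
    return long_rows


long_rows = to_long(wide_rows)
for record in long_rows:
    print(*record)
```

Each wide row yields three long rows, and Height is simply carried along with the row identifier, which is what adding it to the BY statement achieves.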

Replace missing values in SAS by Specific Condition

I have a large dataset named Planes with missing values in arrival delays (Arr_Delay). I want to replace those missing values with the average delay on the specific route (Origin - Dest) for the specific carrier.
Here is a sample of the dataset:
date carrier Flight tailnum origin dest Distance Air_time Arr_Delay
01-01-2013 UA 1545 N14228 EWR IAH 1400 227 17
01-01-2013 UA 1714 N24211 LGA IAH 1416 227 .
01-01-2013 AA 1141 N619AA JFK MIA 1089 160 .
01-01-2013 EV 5708 N829AS LGA IAD 229 53 -18
01-01-2013 B6 79 N593JB JFK MCO 944 140 14
01-01-2013 AA 301 N3ALAA LGA ORD 733 138 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 .
01-01-2013 B6 71 N657JB JFK TPA 1005 158 19
01-01-2013 UA 194 N29129 JFK LAX 2475 345 23
01-01-2013 UA 1124 N53441 EWR SFO 2565 361 -29
Code I tried:
Proc stdize data=cs1.Planes reponly method=mean out=cs1.Complete_data;
var Arrival_delay_minutes;
Run;
But as my problem states, I want to use the mean by specific route and specific carrier for the missing values. Please help me with this!
PROC STDIZE has no CLASS statement, although it does accept a BY statement on sorted data. Alternatively, you can compute the group means with PROC MEANS and merge them back using the code below:
/* NWAY keeps only the full carrier*origin*dest combinations in the output */
Proc means data=cs1.Planes noprint nway;
var Arr_Delay;
class carrier origin dest;
output out=mean1;
Run;
proc sort data=cs1.Planes;
by carrier origin dest;
run;
proc sort data=mean1;
by carrier origin dest;
run;
data cs1.Complete_data(drop=Arr_Delay1 _stat_);
merge cs1.Planes(in=a) mean1(where=(_stat_="MEAN")
keep=carrier origin dest Arr_Delay _stat_
rename=(Arr_Delay = Arr_Delay1) in=b);
by carrier origin dest;
if a;
if Arr_Delay =. then Arr_Delay=Arr_Delay1;
run;
You just need to sort the table cs1.Planes by origin, dest and carrier before running PROC STDIZE, and add by origin dest carrier; to do the grouping you wanted. Values remain missing only when there are no other values for that carrier/route.
See the SAS documentation for PROC STDIZE and its available options.
Code:
data have;
informat date ddmmyy10.;
format date ddmmyy10.;
input
date carrier $ Flight tailnum $ origin $ dest $ Distance Air_time Arr_Delay;
datalines;
01-01-2013 UA 1545 N14228 EWR IAH 1400 227 17
01-01-2013 UA 1714 N24211 LGA IAH 1416 227 .
01-01-2013 AA 1141 N619AA JFK MIA 1089 160 .
01-01-2013 EV 5708 N829AS LGA IAD 229 53 -18
01-01-2013 B6 79 N593JB JFK MCO 944 140 14
01-01-2013 AA 301 N3ALAA LGA ORD 733 138 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 15
01-01-2013 B6 71 N657JB JFK TPA 1005 158 19
01-01-2013 UA 194 N29129 JFK LAX 2475 345 23
01-01-2013 UA 1124 N53441 EWR SFO 2565 361 -29
;
run;
proc sort data=work.have; by origin dest carrier; run;
Proc stdize data=work.have reponly method=mean out=work.Complete_data ;
var Arr_Delay;
by origin dest carrier ;
Run;
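As a cross-check of what the grouped replacement is expected to do, here is a small Python sketch of mean imputation per (carrier, origin, dest) group. It is an illustration under assumptions, not the PROC STDIZE implementation: `impute_group_mean` and `flights` are hypothetical names, and `None` stands in for a SAS missing value.

```python
from collections import defaultdict

# (carrier, origin, dest, arr_delay); None marks a missing delay.
flights = [
    ("B6", "JFK", "PBI", None),
    ("B6", "JFK", "PBI", 15),
    ("UA", "LGA", "IAH", None),
    ("UA", "EWR", "IAH", 17),
]


def impute_group_mean(records):
    """Replace missing delays with the mean of their carrier/route group."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for carrier, origin, dest, delay in records:
        if delay is not None:
            entry = totals[(carrier, origin, dest)]
            entry[0] += delay
            entry[1] += 1

    filled = []
    for carrier, origin, dest, delay in records:
        key = (carrier, origin, dest)
        # Groups with no observed delay at all stay missing.
        if delay is None and key in totals:
            group_sum, group_count = totals[key]
            delay = group_sum / group_count
        filled.append((carrier, origin, dest, delay))
    return filled


filled = impute_group_mean(flights)
for record in filled:
    print(record)
```

The UA LGA-IAH row stays missing because no other flight in that group has a delay, which matches the caveat in the answer above.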

Reading and halving a sas data set

I have to read a data set of 50 numbers from a text file. The numbers are separated by spaces and spread over multiple lines of uneven length, for example:
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15
16 17 18 19 20 21
Etc.
The first 25 numbers belong to group 1, and the 2nd 25 belong to group 2. So I need to make a group variable (binary either 1 or 2), a count number (1 to 25), and a value variable which is holding the value of the number.
I am stuck on how to split the data in half when reading it. I tried to use truncover but it did not work.
Try something like this, replacing the datalines keyword with the path to your file:
data groups;
infile datalines;
format number 8. counter 2. group 1.; * Not mandatory, used here to order variables;
retain group (1);
input number @@; * Double trailing @ holds the line so several values are read from it;
counter + 1;
if counter = 26 then do;
group = 2;
counter = 1;
end;
datalines;
192 105 435 448 160 499 184 246 388 190 316
139 146 147 192 231 449 101 216 342 399 352 122 418
280 400 187 352 321 180 425 500 320 179 105
232 105 323 132 106 255 449
186 135 472 174 119 255
308 350
run;
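The group and counter logic can also be checked outside SAS. The sketch below is a hypothetical Python equivalent (the function name `assign_groups` and the `group_size` parameter are assumptions): it flattens the uneven lines into one stream of numbers, then derives the counter and group from each number's position.

```python
text = """\
192 105 435 448 160 499 184 246 388 190 316
139 146 147 192 231 449 101 216 342 399 352 122 418
280 400 187 352 321 180 425 500 320 179 105
232 105 323 132 106 255 449
186 135 472 174 119 255
308 350
"""


def assign_groups(raw_text, group_size=25):
    """Return (number, counter, group) tuples, ignoring line boundaries."""
    numbers = [int(token) for token in raw_text.split()]
    return [
        (number, index % group_size + 1, index // group_size + 1)
        for index, number in enumerate(numbers)
    ]


records = assign_groups(text)
for record in records:
    print(*record)
```

Because `split()` with no arguments treats newlines like any other whitespace, the uneven line lengths have no effect, which is the same role the double trailing `@` plays in the DATA step.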

How to extract the next 12 months of data from a master table for each ID in a sample table, based on yearmonth and ID, using SAS

I am currently practicing SAS programming with two SAS datasets (sample and master). The data below are hypothetical, created to illustrate my problem. I would like to extract data from the master dataset (test) for the IDs in the sample dataset: for each ID and yearmonth in sample, I need the next 12 months of information from test (the desired output is shown in the third listing).
Below is the code that extracts the previous 12 months of data, but I do not know how to extract the next 12 months in the same way. Can anyone help me solve this problem efficiently in SAS?
proc sort data=test;
by id yearmonth;
run;
data result;
set test;
array prev_month {13} PREV_MONTH_0-PREV_MONTH_12;
by id;
if first.id then do;
do i =1 to 13;
prev_month(i)=0;
end;
end;
do i = 13 to 2 by -1;
prev_month(i)=prev_month(i-1);
end;
prev_month(1)=no_of_cust;
drop i prev_month_0;
retain prev_month:;
run;
data sample1;
set sample(drop=no_of_cust);
run;
proc sort data=sample1;
by id yearmonth;
run;
data all;
merge sample1(in=a) result(in=b);
by id yearmonth;
if a;
run;
One sample dataset (dataset name - sample).
ID YEARMONTH NO_OF_CUST
1 200909 50
1 201005 65
1 201008 78
1 201106 95
2 200901 65
2 200902 45
2 200903 69
2 201005 14
2 201006 26
2 201007 98
One master dataset (dataset name - test), a huge dataset covering each ID from the start of the account to date.
ID YEARMONTH NO_OF_CUST
1 200808 125
1 200809 125
1 200810 111
1 200811 174
1 200812 98
1 200901 45
1 200902 74
1 200903 73
1 200904 101
1 200905 164
1 200906 104
1 200907 22
1 200908 35
1 200909 50
1 200910 77
1 200911 86
1 200912 95
1 201001 95
1 201002 87
1 201003 79
1 201004 71
1 201005 65
1 201006 66
1 201007 66
1 201008 78
1 201009 88
1 201010 54
1 201011 45
1 201012 100
1 201101 136
1 201102 111
1 201103 17
1 201104 77
1 201105 111
1 201106 95
1 201107 79
1 201108 777
1 201109 758
1 201110 32
1 201111 15
1 201112 22
2 200711 150
2 200712 150
2 200801 44
2 200802 385
2 200803 65
2 200804 66
2 200805 200
2 200806 333
2 200807 285
2 200808 265
2 200809 222
2 200810 220
2 200811 205
2 200812 185
2 200901 65
2 200902 45
2 200903 69
2 200904 546
2 200905 21
2 200906 256
2 200907 214
2 200908 14
2 200909 44
2 200910 65
2 200911 88
2 200912 79
2 201001 65
2 201002 45
2 201003 69
2 201004 54
2 201005 14
2 201006 26
2 201007 98
The desired output should look like this:
ID YEARMONTH NO_OF_CUST AFTER_MONTH_1 AFTER_MONTH_2 AFTER_MONTH_3 AFTER_MONTH_4 AFTER_MONTH_5 AFTER_MONTH_6 AFTER_MONTH_7 AFTER_MONTH_8 AFTER_MONTH_9 AFTER_MONTH_10 AFTER_MONTH_11 AFTER_MONTH_12
1 200909 50 77 86 95 95 87 79 71 65 66 66 78 88
Step 1: Join your sample table with the master (test) table, using intnx to pick up all the values for the next 12 months.
Step 2: Build the column names "after_month_N".
Step 3: Transpose to get your final output.
proc sql;
create table abc as
select a.id,a.yearmonth,b.yearmonth as yearmonth1, b.no_of_cust
from
sample a
left join
test b
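/* Assumes yearmonth is a numeric yyyymm value; convert it to a SAS date before calling intnx */
on a.id = b.id
and input(put(b.yearmonth,6.),yymmn6.)
between input(put(a.yearmonth,6.),yymmn6.)
and intnx("month",input(put(a.yearmonth,6.),yymmn6.),12)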
order by a.id,a.yearmonth,b.yearmonth;
quit;
data abc1(drop=col yearmonth1);
set abc;
by id yearmonth;
if first.yearmonth then col=-1;
col+1;
columns = compress("after_month_"||col);
run;
proc transpose data=abc1 out=abc2(rename=(after_month_0 = no_of_cust) drop=_name_);
by id yearmonth;
id columns;
var no_of_cust;
run;
Alternatively, if you would rather adapt your original code, you could use the version below.
proc sort data=test;
by id descending yearmonth;
run;
data result;
set test;
array after_month {13} after_MONTH_0-after_MONTH_12;
by id;
if first.id then do;
do i = 1 to 13;
after_month(i) = 0;
end;
end;
do i = 13 to 2 by -1;
after_month(i) = after_month(i-1);
end;
after_month(1) = NO_OF_CUST;
drop i after_MONTH_0;
retain after_MONTH:;
run;
data sample1;
set sample(drop=no_of_cust);
run;
proc sort data=result;
by id yearmonth;
run;
proc sort data=sample1;
by id yearmonth;
run;
data all;
merge sample1(in=a) result(in=b);
by id yearmonth;
if a;
run;
Let me know in case of any queries.
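To sanity-check the expected AFTER_MONTH values, here is a short Python sketch that looks up the next 12 calendar months for one sample row. It assumes yearmonth is an integer in yyyymm form; `add_months`, `next_months`, and the `master` dictionary are hypothetical names, and the data is the ID 1 slice of the master table above. Unlike the record-shifting DATA step, this version is calendar based, so a month absent from the master comes back as `None` instead of shifting later months forward.

```python
def add_months(yearmonth, k):
    """Shift a yyyymm integer by k calendar months."""
    year, month = divmod(yearmonth, 100)
    index = year * 12 + (month - 1) + k
    return (index // 12) * 100 + index % 12 + 1


def next_months(master, id_, yearmonth, horizon=12):
    """Return NO_OF_CUST for each of the next `horizon` months (None if absent)."""
    return [
        master.get((id_, add_months(yearmonth, k)))
        for k in range(1, horizon + 1)
    ]


# (id, yearmonth) -> no_of_cust, copied from the master table for ID 1.
master = {
    (1, 200909): 50, (1, 200910): 77, (1, 200911): 86, (1, 200912): 95,
    (1, 201001): 95, (1, 201002): 87, (1, 201003): 79, (1, 201004): 71,
    (1, 201005): 65, (1, 201006): 66, (1, 201007): 66, (1, 201008): 78,
    (1, 201009): 88,
}

after = next_months(master, 1, 200909)
print(after)
```

For ID 1 and yearmonth 200909 this reproduces the twelve AFTER_MONTH values in the desired output row shown in the question.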

replace string with multiple-string line using awk

I have two separate files and I was hoping to search and replace a string in file1 for an entire line of multiple strings in file2. I have been working on using awk but I am not sure how to replace a string for a line of strings. Below is an example of what I was looking to do.
The string to be replaced would match the first field of the line to replace it (multiple strings to insert in place of the single string). It's a "find and replace" task.
file1:
001 111 112 113 116 117
002 221 222
003 331
004
005 551 555
file2:
113 114 115
222 223 224 225 226 227
551 552 553 554
Desired output:
001 111 112 113 114 115 116 117
002 221 222 223 224 225 226 227
003 331
004
005 551 552 553 554 555
Try this:
awk 'NR==FNR{a[$1]=$0;next}{for(i=1;i<=NF;i++)$i=($i in a?a[$i]:$i)}1' file2 file1
001 111 112 113 114 115 116 117
002 221 222 223 224 225 226 227
003 331
004
005 551 552 553 554 555
We read file2 first and build an array indexed by the first column, with the entire line as the value.
For file1, we loop through each field; if it is found in the array, we substitute it with that value.
Here you go:
awk '
FILENAME == "file2" {
key = $1
map[key] = $0
next
}
{
for (i = 1; i <= NF; i++) {
if ($i in map)
$i = map[$i]
}
print
}
' file2 file1
001 111 112 113 114 115 116 117
002 221 222 223 224 225 226 227
003 331
004
005 551 552 553 554 555
This takes lines from file2 and populates an array called map with the whole line, keyed on the first field (I'm treating awk's associative array like a hash). For every other line, it loops through each field, substitutes those that have map values, and prints the result. Note that file2 must be given first so that the map array is populated before file1 is read.
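The same two-pass idea can be sketched in Python for comparison. This is an illustrative equivalent and not a replacement for the awk commands: the function name `expand` and the in-memory line lists are assumptions standing in for the two files.

```python
def expand(file1_lines, file2_lines):
    """Replace each field of file1 that starts a file2 line with that whole line."""
    mapping = {}
    for line in file2_lines:
        fields = line.split()
        if fields:  # skip blank lines
            mapping[fields[0]] = " ".join(fields)

    result = []
    for line in file1_lines:
        # Fields without a mapping are kept unchanged.
        result.append(" ".join(mapping.get(field, field) for field in line.split()))
    return result


file1 = ["001 111 112 113 116 117", "002 221 222", "003 331", "004", "005 551 555"]
file2 = ["113 114 115", "222 223 224 225 226 227", "551 552 553 554"]

output = expand(file1, file2)
print("\n".join(output))
```

As in the awk versions, the lookup table has to be built from file2 before any line of file1 is processed.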