Create sub-tables with a condition in SAS

I want to check whether the operator (OP_NAME) changes within each SN. If the operator changes for an SN, output all of that SN's rows (SN, OP_NAME, Value) to Table1: OP_CHANGE; otherwise output them to Table2: OP_UNQ.
Condition: if even one operator change occurs within an SN, output that SN's entire set of rows to the first table.
SN OP_Name Value
109029 SPAN 150
109029 SPAN 235
109032 SPAN 550
109033 SPAN 650
109033 SPAN 700
109036 FRAN 124
109036 SECURIT 224
109036 SECURIT 560
109036 SECURIT 752
109037 AOM 44
109037 SECA 58
109037 SECURIT 85
109037 SECURIT 98
109038 FRAN 45
109038 SECURIT 47
109038 SECURIT 58
109038 SECURIT 65
109039 GOVER 33
109039 GOVER 45
109039 GOVER 48
109041 SOREL 45
109041 SOREM 55
109041 INA 45
109043 SPAN 96
109044 SPAN 53
109045 SOREM 25
109045 SOREM 65
I want to see output tables like
Table1:OP_CHANGE
SN OP_Name Value
109036 FRAN 124
109036 SECURIT 224
109036 SECURIT 560
109036 SECURIT 752
109037 AOM 44
109037 SECA 58
109037 SECURIT 85
109037 SECURIT 98
109038 FRAN 45
109038 SECURIT 47
109038 SECURIT 58
109038 SECURIT 65
109041 SOREL 45
109041 SOREM 55
109041 INA 45
Table2:OP_UNQ
SN OP_Name Value
109029 SPAN 150
109029 SPAN 235
109032 SPAN 550
109033 SPAN 650
109033 SPAN 700
109039 GOVER 33
109039 GOVER 45
109039 GOVER 48
109043 SPAN 96
109044 SPAN 53
109045 SOREM 25
109045 SOREM 65
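
To try the answers below, the sample can be loaded into a dataset first; a minimal sketch, assuming the dataset name have used by the first answer (the second answer calls the same data table):
data have;
   input SN OP_Name $ Value;
   datalines;
109029 SPAN 150
109029 SPAN 235
109032 SPAN 550
109033 SPAN 650
109033 SPAN 700
109036 FRAN 124
109036 SECURIT 224
109036 SECURIT 560
109036 SECURIT 752
109037 AOM 44
109037 SECA 58
109037 SECURIT 85
109037 SECURIT 98
109038 FRAN 45
109038 SECURIT 47
109038 SECURIT 58
109038 SECURIT 65
109039 GOVER 33
109039 GOVER 45
109039 GOVER 48
109041 SOREL 45
109041 SOREM 55
109041 INA 45
109043 SPAN 96
109044 SPAN 53
109045 SOREM 25
109045 SOREM 65
;
run;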

You could do something like the following:
data op_change(drop=change) op_unq(drop=change);
   /* Interleave two copies of HAVE: the first pass (in=h1) scans each SN
      group for an operator change, the second pass writes the rows out. */
   set have(in=h1) have;
   by sn;
   length change $1;
   retain change;
   if h1 then do;
      /* first pass: flag the group as soon as OP_NAME differs from the previous row */
      if lag(op_name) ne op_name then change = 'Y';
      if first.sn then change = 'N';
   end;
   else do;
      /* second pass: the flag is settled, so route every row of the group */
      if change = 'Y' then output op_change;
      else output op_unq;
   end;
run;

This should work, assuming your input table is called table:
proc sql;
   /* Find SN values with more than one distinct OP_NAME */
   create table op_change as
   select t.*
   from table t
   inner join
      (select SN, count(distinct OP_NAME) as n_ops
       from table
       group by SN
       having count(distinct OP_NAME) > 1) x
   on t.SN = x.SN;

   /* Find SN values with a single distinct OP_NAME */
   create table op_unq as
   select t.*
   from table t
   inner join
      (select SN, count(distinct OP_NAME) as n_ops
       from table
       group by SN
       having count(distinct OP_NAME) = 1) x
   on t.SN = x.SN;
quit;
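
For what it's worth, SAS PROC SQL also allows a more compact variant that relies on its summary-statistic remerge (SAS logs a note that the query remerges, and the row order within each SN is not guaranteed):
proc sql;
   /* groups where more than one operator appears */
   create table op_change as
   select *
   from have
   group by sn
   having count(distinct op_name) > 1;

   /* groups with a single operator */
   create table op_unq as
   select *
   from have
   group by sn
   having count(distinct op_name) = 1;
quit;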

Related

Replace missing values in SAS by Specific Condition

I have a large dataset named Planes with missing values in the arrival delay (Arr_Delay). I want to replace those missing values with the average delay on the specific route (Origin-Dest) by the specific Carrier.
Here is a sample of the dataset:
date carrier Flight tailnum origin dest Distance Air_time Arr_Delay
01-01-2013 UA 1545 N14228 EWR IAH 1400 227 17
01-01-2013 UA 1714 N24211 LGA IAH 1416 227 .
01-01-2013 AA 1141 N619AA JFK MIA 1089 160 .
01-01-2013 EV 5708 N829AS LGA IAD 229 53 -18
01-01-2013 B6 79 N593JB JFK MCO 944 140 14
01-01-2013 AA 301 N3ALAA LGA ORD 733 138 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 .
01-01-2013 B6 71 N657JB JFK TPA 1005 158 19
01-01-2013 UA 194 N29129 JFK LAX 2475 345 23
01-01-2013 UA 1124 N53441 EWR SFO 2565 361 -29
The code I tried:
proc stdize data=cs1.Planes reponly method=mean out=cs1.Complete_data;
   var Arr_Delay;
run;
But as my problem states, I want the mean computed by specific route and specific carrier for the missing values. Please help me on this!
The STDIZE procedure has no CLASS statement (it does accept BY groups; see the next answer). You can use the code below to complete your task:
proc means data=cs1.Planes noprint nway; /* nway keeps only the full carrier*origin*dest crossing */
   var Arr_Delay;
   class carrier origin dest;
   output out=mean1;
run;

proc sort data=cs1.Planes;
   by carrier origin dest;
run;

proc sort data=mean1;
   by carrier origin dest;
run;

data cs1.Complete_data(drop=Arr_Delay1 _stat_);
   merge cs1.Planes(in=a)
         mean1(where=(_stat_="MEAN")
               keep=carrier origin dest Arr_Delay _stat_
               rename=(Arr_Delay=Arr_Delay1) in=b);
   by carrier origin dest;
   if a;
   /* fill missing delays with the carrier/route mean */
   if Arr_Delay = . then Arr_Delay = Arr_Delay1;
run;
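
If you prefer a single step, PROC SQL can compute the group mean and remerge it in the same query. This is a sketch under the same dataset and variable names; SAS will log a note about remerging summary statistics, and the output row order is not guaranteed:
proc sql;
   create table cs1.Complete_data as
   select date, carrier, Flight, tailnum, origin, dest, Distance, Air_time,
          /* keep the observed delay; fall back to the carrier/route mean */
          coalesce(Arr_Delay, mean(Arr_Delay)) as Arr_Delay
   from cs1.Planes
   group by carrier, origin, dest;
quit;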
You just need to sort the table cs1.Planes by origin, dest, and carrier before running PROC STDIZE, and add by origin dest carrier; to do the grouping you want. Values remain missing only when there are no other non-missing values for that carrier/route.
See the SAS documentation for PROC STDIZE for details and the available options.
Code:
data have;
   informat date ddmmyy10.;
   format date ddmmyy10.;
   input date carrier $ Flight tailnum $ origin $ dest $ Distance Air_time Arr_Delay;
   datalines;
01-01-2013 UA 1545 N14228 EWR IAH 1400 227 17
01-01-2013 UA 1714 N24211 LGA IAH 1416 227 .
01-01-2013 AA 1141 N619AA JFK MIA 1089 160 .
01-01-2013 EV 5708 N829AS LGA IAD 229 53 -18
01-01-2013 B6 79 N593JB JFK MCO 944 140 14
01-01-2013 AA 301 N3ALAA LGA ORD 733 138 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 15
01-01-2013 B6 71 N657JB JFK TPA 1005 158 19
01-01-2013 UA 194 N29129 JFK LAX 2475 345 23
01-01-2013 UA 1124 N53441 EWR SFO 2565 361 -29
;
run;

proc sort data=work.have;
   by origin dest carrier;
run;

proc stdize data=work.have reponly method=mean out=work.Complete_data;
   var Arr_Delay;
   by origin dest carrier;
run;
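
As an optional sanity check, the B6 JFK-PBI flight that was missing Arr_Delay should now show the route mean of 15:
proc print data=work.Complete_data;
   where origin = 'JFK' and dest = 'PBI';
run;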

Get frequency from dataset with repeated measurements over time

This is my problem: I have a dataset with 10 measurements over time per ID, something like this:
ID Expenditure Age
25 100 89
25 102 89
25 178 89
25 290 89
25 200 89
.
.
.
26 100 79
26 102 79
26 178 79
26 290 79
26 200 79
.
.
.
27 100 80
27 102 80
27 178 80
27 290 80
27 200 80
.
.
.
Now I want to obtain the frequency of age, so I did this:
proc freq data=Expenditure;
table Age / out= Age_freq outexpect sparse;
run;
Output:
Age Frequency Count Percent of total frequency
79 10 0.1
80 140 1.4
89 50 0.5
The problem is that this counts all rows and doesn't take the repeated measurements per ID into account. So I wanted to create a new column with the actual frequencies, like this:
data Age;
set Age_freq;
freq = Frequency Count /10;
run;
but I think SAS doesn't recognize this 'Frequency Count' variable. Can anybody give me some insight on this?
Thanks
You have to remove the duplicate records so that each ID has one record containing the age. (Incidentally, the frequency variable in the OUT= dataset is simply named COUNT, so freq = Count / 10; would run, but dividing by a hard-coded 10 breaks as soon as the number of measurements per ID changes.)
Solution: create a new table with the distinct values of ID and Age, then run PROC FREQ on it.
Code:
I created a new table called Expenditure_ids that doesn't have any duplicate values for ID and Age.
data Expenditure;
   input ID Expenditure Age;
   datalines;
25 100 89
25 102 89
25 178 89
25 290 89
25 200 89
26 100 79
26 102 79
26 178 79
26 290 79
26 200 79
27 100 80
27 102 80
27 178 80
27 290 80
27 200 80
28 100 80
28 102 80
28 178 80
28 290 80
28 200 80
;
run;

proc sql;
   create table Expenditure_ids as
   select distinct ID, Age
   from Expenditure;
quit;

proc freq data=Expenditure_ids;
   table Age / out=Age_freq outexpect sparse;
run;
Output:
Age   COUNT   PERCENT
79    1       25
80    2       50
89    1       25
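
An equivalent way to build the deduplicated table is PROC SORT with NODUPKEY (assuming every record for an ID carries the same Age):
proc sort data=Expenditure(keep=ID Age) out=Expenditure_ids nodupkey;
   by ID Age;   /* one record per distinct ID/Age pair */
run;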

Select lowest value per group

This question is related to Stata: select the minimum of each observation.
I have data as follows:
clear
input str4 id int eventdate byte dia_bp_copy int sys_bp_copy
"pat" 15698 100 140
"pat" 16183 80 120
"pat" 19226 98 155
"pat" 19375 80 130
"sue" 14296 80 120
"sue" 14334 88 127
"sue" 14334 96 158
"sue" 14334 84 136
"sue" 14403 86 124
"sue" 14403 88 134
"sue" 14403 90 156
"sue" 14403 86 134
"sue" 14403 90 124
"sue" 14431 80 120
"sue" 14431 80 140
"sue" 14431 80 130
"sue" 15456 80 130
"sue" 15501 80 120
"sue" 15596 80 120
"mary" 14998 90 154
"mary" 15165 91 179
"mary" 15280 91 156
"mary" 15386 81 154
"mary" 15952 77 133
"mary" 15952 80 144
"mary" 16390 91 159
end
Some people have multiple readings on one day, e.g. Sue on 31 March 1999 (eventdate 14334). I want to select the lowest reading per day.
Here is my code, which gets me some of the way. It is clunky and clumsy, and I am looking for help to do what I want in a more straightforward way.
*make flag for repeat observations on same day
sort id eventdate
by id: gen flag =1 if eventdate==eventdate[_n-1]
by id: gen flag2=1 if eventdate==eventdate[_n+1]
by id: gen flag3 =1 if flag==1 | flag2==1
drop flag flag2
* group repeat observations together
egen group = group(id flag3 eventdate)
* find lowest `sys_bp_copy` value per group
bys group (eventdate flag3): egen low_sys=min(sys_bp_copy)
*remove the observations where the lowest value of `sys_bp`_copy doesn't exist
bys group: gen remove =1 if low_sys!=sys_bp_copy
drop if remove==1 & group !=.
Problems with this, and where I'd like help:
The problem with the above approach is that two of Sue's repeat readings have the same value of sys_bp_copy, so my approach leaves me with multiple readings for her.
In this instance I would like to refer to dia_bp_copy and select the lowest value there, to help me pick out one row per person when multiple readings remain. Code for this is below, but there must be a simpler way to do this?
drop flag3 remove group
sort id eventdate
by id: gen flag =1 if eventdate==eventdate[_n-1]
by id: gen flag2=1 if eventdate==eventdate[_n+1]
by id: gen flag3 =1 if flag==1 | flag2==1
egen group = group(id flag3 eventdate)
bys group (eventdate flag3): egen low_dia=min(dia_bp_copy)
bys group: gen remove =1 if low_dia!=dia_bp_copy
drop if remove==1 & group !=.
The lowest systolic pressure for a patient on a particular day is easy to define: you just sort and look for the lowest value in each block of observations.
We can refine the definition by breaking ties on systolic by values of diastolic. That's another sort. In this example, that makes no difference.
clear
input str4 id int eventdate byte dia_bp_copy int sys_bp_copy
"pat" 15698 100 140
"pat" 16183 80 120
"pat" 19226 98 155
"pat" 19375 80 130
"sue" 14296 80 120
"sue" 14334 88 127
"sue" 14334 96 158
"sue" 14334 84 136
"sue" 14403 86 124
"sue" 14403 88 134
"sue" 14403 90 156
"sue" 14403 86 134
"sue" 14403 90 124
"sue" 14431 80 120
"sue" 14431 80 140
"sue" 14431 80 130
"sue" 15456 80 130
"sue" 15501 80 120
"sue" 15596 80 120
"mary" 14998 90 154
"mary" 15165 91 179
"mary" 15280 91 156
"mary" 15386 81 154
"mary" 15952 77 133
"mary" 15952 80 144
"mary" 16390 91 159
end
bysort id eventdate (sys) : gen lowest = sys[1]
bysort id eventdate (sys dia) : gen lowest_2 = sys[1]
egen tag = tag(id eventdate)
count if lowest != lowest_2
list id event dia sys lowest* if tag, sepby(id)
+-----------------------------------------------------------+
| id eventd~e dia_bp~y sys_bp~y lowest lowest_2 |
|-----------------------------------------------------------|
1. | mary 14998 90 154 154 154 |
2. | mary 15165 91 179 179 179 |
3. | mary 15280 91 156 156 156 |
4. | mary 15386 81 154 154 154 |
5. | mary 15952 77 133 133 133 |
7. | mary 16390 91 159 159 159 |
|-----------------------------------------------------------|
8. | pat 15698 100 140 140 140 |
9. | pat 16183 80 120 120 120 |
10. | pat 19226 98 155 155 155 |
11. | pat 19375 80 130 130 130 |
|-----------------------------------------------------------|
12. | sue 14296 80 120 120 120 |
13. | sue 14334 88 127 127 127 |
16. | sue 14403 86 124 124 124 |
21. | sue 14431 80 120 120 120 |
24. | sue 15456 80 130 130 130 |
25. | sue 15501 80 120 120 120 |
26. | sue 15596 80 120 120 120 |
+-----------------------------------------------------------+
egen is very useful (disclosure of various interests there), but the main idea here is just that by: defines groups of observations; you can do that for two or more variables, not just one, and control the sort order too. As it were, about half of egen is built on such ideas, but it can be easiest and best to use them directly.
If I understand correctly:
Create an identifier for same id and same date:
egen temp_group = group(id eventdate)
Find the first occurrence based on lowest sys_bp_copy and then lowest dia_bp_copy:
bys temp_group (sys_bp_copy dia_bp_copy): gen temp_first = _n
keep if temp_first == 1
drop temp*
or in one line, as suggested in the comments:
bys id eventdate (sys_bp_copy dia_bp_copy): keep if _n == 1

Large defrecord causes "Method code too large"

Is there a way to build a defrecord with lots of fields? It appears there is a limit of around 122 fields, as this gives a "Method code too large!" error:
(defrecord WideCsvFile
[a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19
a20 a21 a22 a23 a24 a25 a26 a27 a28 a29 a30 a31 a32 a33 a34 a35 a36 a37 a38 a39
a40 a41 a42 a43 a44 a45 a46 a47 a48 a49 a50 a51 a52 a53 a54 a55 a56 a57 a58 a59
a60 a61 a62 a63 a64 a65 a66 a67 a68 a69 a70 a71 a72 a73 a74 a75 a76 a77 a78 a79
a80 a81 a82 a83 a84 a85 a86 a87 a88 a89 a90 a91 a92 a93 a94 a95 a96 a97 a98 a99
a100 a101 a102 a103 a104 a105 a106 a107 a108 a109 a110 a111 a112 a113 a114 a115 a116 a117 a118 a119
a120 a121 a122])
whereas removing any one of the fields allows the record to be created.
Java has a maximum size for its methods (see the answers to this question for specifics). defrecord creates methods whose size depends on the number of values the record will contain.
To deal with this issue, I see two options:
macroexpand-1 your call to defrecord, copy the results, and find a way to re-write the generated methods to be smaller.
Take a different approach to storing your data, such as using Clojure's vector class.
EDIT:
Now that I know what you want to do, I am more convinced that you should use vectors. Since you want to use indexes like a101, I've written you a macro to generate them:
(defmacro auto-index-vector [v prefix]
  (let [indices (range (count (eval v)))
        definitions (map (fn [ind]
                           `(def ~(symbol (str prefix ind)) ~ind))
                         indices)]
    `(do ~@definitions)))
Let's try it out!
stack-prj.bigrecord> (def v1 (into [] (range 122)))
#'stack-prj.bigrecord/v1
stack-prj.bigrecord> (auto-index-vector v1 "a")
#'stack-prj.bigrecord/a121
stack-prj.bigrecord> (v1 a101)
101
stack-prj.bigrecord> (assoc v1 a101 "hi!")
[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
95 96 97 98 99 100 "hi!" 102 103 104 105 106 107 108 109 110 111 112
113 114 115 116 117 118 119 120 121]
To use this: you'll read your CSV data into a vector, call auto-index-vector on it with the prefix of your choosing, and use the resulting indices to perform vector operations on your data.

Creating statistical data from a table

I have a table with 20 columns of measurements. I would like to 'convert' the table into a table with 20 rows and columns of Avg, Min, Max, StdDev, and Count information. There is another question like this, but it was for the R language.
I could do the following for each column (processing the results with C++):
select count(case when [avgZ_l1] <= 0.15 and [avgZ_l1] > 0    then 1 end) as countValue1,
       count(case when [avgZ_l1] <= 0.16 and [avgZ_l1] > 0.15 then 1 end) as countValue2,
       count(case when [avgZ_l1] <= 0.18 and [avgZ_l1] > 0.16 then 1 end) as countValue3,
       count(case when [avgZ_l1] <= 0.28 and [avgZ_l1] > 0.18 then 1 end) as countValue4,
       avg(avgwall_l1) as avg1, min(avgwall_l1) as min1, max(avgZ_l1) as max1,
       stdev(avgZ_l1) as stddev1, count(*) as totalCount
from myProject.dbo.table1
But I do not want to process the 50,000 records 20 times (once for each column). I thought there would be a way to 'pivot' the table onto its side and process the data at the same time. I have seen examples of PIVOT, but they all seem to pivot on an integer field such as a month number or a device ID. Once the table is converted, I could fetch each row with C++. Maybe this is really just 'insert into ... select ... from' statements.
Would the fastest approach (in execution time) be to simply create a really long select statement that returns all the information I want for all the columns?
We might end up with 500,000 rows. I am using C++ and SQL Server 2014.
Any thoughts or comments are welcome. I just don't want my naive code to be used as a shining example of how NOT to do something... ;)
If your table looks like the sample data shown below (taken from the R question), the following query should work for you. It selects the data that you requested and pivots it at the same time.
create table #temp (ID int identity(1,1), columnName nvarchar(50));

insert into #temp
select COLUMN_NAME as columnName
from myProject.INFORMATION_SCHEMA.COLUMNS -- change myProject to the name of your database
where TABLE_NAME = N'table1';             -- change table1 to the table you are analyzing

declare @TableName nvarchar(50) = 'table1'; -- change table1 to your table again
declare @loop int = 1;
declare @query nvarchar(max) = '';
declare @columnName nvarchar(50);
declare @endQuery nvarchar(max) = '';

while (@loop <= (select count(*) from #temp))
begin
    set @columnName = (select columnName from #temp where ID = @loop);
    set @query = 'select t.columnName, avg(['+@columnName+']) as Avg, min(['+@columnName+']) as min, max(['+@columnName+']) as max, stdev(['+@columnName+']) as STDEV, count(*) as totalCount from '+@TableName+' join #temp t on t.columnName = '''+@columnName+''' group by t.columnName';
    set @loop += 1;
    set @endQuery += 'union all('+ @query + ')';
end;

set @endQuery = stuff(@endQuery, 1, 9, ''); -- strip the leading 'union all'
execute(@endQuery);
drop table #temp;
It creates a #temp table that stores your column headings next to an ID, then uses that ID while looping over the columns. Each pass generates a query that selects the statistics for one column, and the per-column queries are unioned together. The query works for any number of columns, so adding or removing columns should still give the correct result.
With this input:
age height_seca1 height_chad1 height_DL weight_alog1
1 19 1800 1797 180 70
2 19 1682 1670 167 69
3 21 1765 1765 178 80
4 21 1829 1833 181 74
5 21 1706 1705 170 103
6 18 1607 1606 160 76
7 19 1578 1576 156 50
8 19 1577 1575 156 61
9 21 1666 1665 166 52
10 17 1710 1716 172 65
11 28 1616 1619 161 66
12 22 1648 1644 165 58
13 19 1569 1570 155 55
14 19 1779 1777 177 55
15 18 1773 1772 179 70
16 18 1816 1809 181 81
17 19 1766 1765 178 77
18 19 1745 1741 174 76
19 18 1716 1714 170 71
20 21 1785 1783 179 64
21 19 1850 1854 185 71
22 31 1875 1880 188 95
23 26 1877 1877 186 106
24 19 1836 1837 185 100
25 18 1825 1823 182 85
26 19 1755 1754 174 79
27 26 1658 1658 165 69
28 20 1816 1818 183 84
29 18 1755 1755 175 67
It will produce this output:
avg min max stdev totalcount
age 20 17 31 3.3 29
height_seca1 1737 1569 1877 91.9 29
height_chad1 1736 1570 1880 92.7 29
height_DL 173 155 188 9.7 29
weight_alog1 73 50 106 14.5 29
Hope this helps and works for you. :)