Counting how many times a condition succeeds in SAS

I have a table A with real time values in it.
Amount Count Pct1 Pct2
300 2 0.000 100.000
1,891 2 0.001 100.000
500 2 0.000 100.000
100 2 0.000 100.000
1,350 2 0.001 100.000
2,648 2 0.001 100.000
2,255 2 0.001 100.000
500 2 0.000 100.000
200 2 0.000 30.441
10 2 0.000 100.000
1,928 2 0.001 100.000
40 2 0.000 100.000
200 2 0.000 100.000
256 2 0.000 100.000
254 2 0.000 100.000
100 2 0.001 100.000
50 1 0.000 33.333
1,512 2 0.001 100.000
I have a table B with a set of conditions. I want to generate the condition success count in SAS, i.e. if I pass row 1 of the table below as a condition to table A, it succeeds 2 times. I am using a join to generate a Cartesian product and it's not efficient. I want an efficient way to solve this problem (similar to what the COUNTIFS function does in Excel). Thanks a lot for your help.
Amount Count Pct1 Pct2 Condition Success Count
1,576 2 0 100 4
1,537 2 0 100 4
1,484 2 0 100 5
1,405 2 0 100 5
1,290 2 0 100 6
1,095 2 0 100 6
948 2 0 100 6
932 2 0 100 6
914 2 0 100 6
887 2 0 100 6
850 2 0 100 6
774 2 0 100 6
707 2 0 100 6
704 2 0 100 6
695 2 0 100 6
646 2 0 100 6
50 1 0 5.42 16
50 1 0 5.42 16

You said that you have tried a join to make a Cartesian product, but since you didn't post any code I am not sure whether you materialized the full product and then counted the rows. Doing the counting in one SQL statement is much faster, since the full Cartesian product is never actually written anywhere. Like this:
proc sql;
  create table tableC as
  select c.*, coalesce(d.s, 0) as SuccessCount
  from tableB c
  left join (
    /* for each condition row in tableB, count how many rows of tableA satisfy it */
    select b.id, count(*) as s
    from tableA a, tableB b
    where a.amount >= b.amount and
          a.count  >= b.count  and
          a.pct1   >= b.pct1   and
          a.pct2   >= b.pct2
    group by b.id
  ) as d
  on c.id = d.id;
quit;
Note that tableB needs to have some unique id column. You should always have a column to use as an id, but if you don't already have one, simply create it, for example like this:
data tableB;
  set tableB;
  id = _N_;  /* use the row number as a unique id */
run;

Related

KDB moving percentile using Swin function

I am trying to create a list of the 99th and 1st percentiles: rather than a single percentile for today, I want percentiles for 500 days, each computed using the prior 500 days. The functions I am using for this are the following:
swin:{[f;w;s] f each { 1_x,y }\[w#0;s]}
percentile:{[x;y] y (100 xrank y:asc y) bin x}
swin[percentile[99;];500;List]
The issue I come across is that the 99th percentile calculates perfectly, but the 1st percentile makes the entire list 0. I'm a bit lost as to why it would do that; suggestions appreciated!
What's causing the zeros is two-fold:
1. What behaviour do you want for the earliest 500 days, when there aren't 500 days of history to work with? On day 1 there's only 1 datapoint, on day 2 only 2, etc. Only on the 500th day are there 500 days of actual data to work with. By default that swin function fills the gaps with some seed value.
2. You're using zero as that seed value, aka w#0.
For example a 5 day lookback on each date looks something like:
q)swin[::;5;1 2 3 4 5]
0 0 0 0 1
0 0 0 1 2
0 0 1 2 3
0 1 2 3 4
1 2 3 4 5
You have zeros until you have data, so naturally the 1st percentile will pick up the zeros for roughly the first 500 dates.
So then you can decide to seed with a different value, or else possibly exclude zeros from your percentile function:
q)List:1000?1000
q)percentile:{[x;y] y (100 xrank y:asc y except 0) bin x}
q)swin[percentile[1;];500;List]
908 360 360 257 257 257 90 90 90 90 90 90 90 90...
If zeros are a legitimate value in your list and can't be excluded then maybe seed the swin with some other value that you know won't be in the list (negatives? infinity? null?) and then exclude that seed from the percentile function.
EDIT: A final alternative is to use a different sliding window function which doesn't fill gaps with a seed value, e.g.
q)swin2:{[f;w;s] f each(),/:{neg[x]sublist y,z}[w]\[s]}
q)swin2[::;5;1 2 3 4 5]
,1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
q)percentile:{[x;y] y (100 xrank y:asc y) bin x}
q)swin2[percentile[99;];500;List]
908 908 908 908 908 908 908 908 908 908 908 959 959..
q)swin2[percentile[1;];500;List]
908 360 360 257 257 257 90 90 90 90 90 90 90 90 90..

How can I collapse my dataset to medians and 95% confidence intervals of the median in Stata?

I wish to collapse my dataset and (A) obtain medians by group, and (B) obtain the 95% confidence intervals for those medians.
I can achieve (A) by using collapse (p50) median = cost, by(group).
I can obtain the confidence intervals for the groups using bysort group: centile cost, c(50), but I ideally want to do this in a manner similar to collapse, where I can create a collapsed dataset of medians, lower limits (ll), and upper limits (ul) for each group (so I can export the dataset for graphing in Excel).
Data example:
input id group cost
1 0 20
2 0 40
3 0 50
4 0 40
5 0 30
6 1 20
7 1 10
8 1 10
9 1 60
10 1 30
end
Desired dataset (or something similar):
. list
+-----------------------+
| group p50 ll ul |
|-----------------------|
1. | 0 40 20 50 |
2. | 1 20 10 60 |
+-----------------------+
clear
input id group cost
1 0 20
2 0 40
3 0 50
4 0 40
5 0 30
6 1 20
7 1 10
8 1 10
9 1 60
10 1 30
end
statsby median=r(c_1) ub=r(ub_1) lb=r(lb_1), by(group) clear: centile cost
list
+--------------------------+
| group median ub lb |
|--------------------------|
1. | 0 40 50 20 |
2. | 1 20 60 10 |
+--------------------------+
In addition to the usual help and manual entry, this paper includes a riff on essentially this problem of accumulating estimates and confidence intervals.

Operations with reference cells in proc sql?

I have this table, call it "pre_report":
initial_balance deposit withdrawal final_balance
1000 50 0 .
1000 0 25 .
1000 45 0 .
1000 30 0 .
1000 0 70 .
I want to write SAS code that updates the "final_balance" field (the "deposit" field adds to the balance and "withdrawal" subtracts) and at the same time carries that result forward into the next row's "initial_balance" field, so that my desired output is this:
initial_balance deposit withdrawal final_balance
1000 50 0 1050
1050 0 25 1025
1025 45 0 1070
1070 30 0 1100
1100 0 70 1030
I tried this:
proc sql;
  select initial_balance format=dollar32.2,
         deposit format=dollar32.2,
         withdrawal format=dollar32.2,
         sum(initial_balance,deposit,-withdrawal) as final_balance,
         calculated final_balance as initial_balance
  from work.pre_report;
quit;
But it doesn't work properly. This code creates two fields, "final_balance" and "initial_balance", but both contain the same quantity.
Code for creating the "pre_report" table:
data work.pre_report;
  input initial_balance deposit withdrawal final_balance;
  datalines;
1000 50 0 .
1000 0 25 .
1000 45 0 .
1000 30 0 .
1000 0 70 .
;
run;
I would really appreciate your help.
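The pattern described is inherently sequential: each row's initial_balance is the previous row's final_balance, and a single SELECT has no notion of row order or carried state, which is why the proc sql attempt yields the same quantity in both computed columns. Below is a minimal sketch of the sequential pass that is needed, written in Python purely to illustrate the idea (the data is the question's five rows; in SAS the analogous approach is a data step that retains the running balance across rows):

# Illustrative sketch only: the row-by-row logic a sequential pass applies.
rows = [
    {"deposit": 50, "withdrawal": 0},
    {"deposit": 0,  "withdrawal": 25},
    {"deposit": 45, "withdrawal": 0},
    {"deposit": 30, "withdrawal": 0},
    {"deposit": 0,  "withdrawal": 70},
]

balance = 1000  # opening balance, taken from the first row's initial_balance
for row in rows:
    row["initial_balance"] = balance                # carry the previous result forward
    balance += row["deposit"] - row["withdrawal"]   # apply this row's movement
    row["final_balance"] = balance

for row in rows:
    print(row["initial_balance"], row["deposit"], row["withdrawal"], row["final_balance"])

Run against the five rows above, this prints exactly the desired output (1000 50 0 1050, then 1050 0 25 1025, and so on).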

Django ORM query: adjacent row sum with SQLite

In my database I'm storing data as below:
id amt
-- -------
1 100
2 -50
3 100
4 -100
5 200
I want to get output like below:
id amt balance
-- ----- -------
1 100 100
2 -50 50
3 100 150
4 -100 50
5 200 250
How can I do this with the Django ORM?
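One way to express a running balance like this in the Django ORM (2.0+; SQLite supports window functions from 3.25) is a cumulative Sum over a Window ordered by id. A sketch, assuming a hypothetical model named Transaction whose fields match the columns above:

from django.db import models
from django.db.models import F, Sum, Window

class Transaction(models.Model):  # hypothetical model matching the table above
    amt = models.IntegerField()   # "id" is Django's implicit auto primary key

rows = (
    Transaction.objects
    .annotate(balance=Window(expression=Sum("amt"), order_by=F("id").asc()))
    .values("id", "amt", "balance")
)
# Roughly equivalent SQL: SELECT id, amt, SUM(amt) OVER (ORDER BY id) AS balance ...

With an ORDER BY and the default window frame, SUM becomes a running total, so each row carries the cumulative balance up to and including itself.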

Issue Calculating Mean of Grouped Data for entire range of dataset using Pandas

I have a data set of daily temperatures for which I want to calculate 20 year means. The data look like this:
1974 1 1 5.3 4.6 7.3 3.4
1974 1 2 3.3 7.2 4.5 6.5
...
2005 12 364 4.2 5.2 3.3 4.6
2005 12 365 3.1 5.5 2.6 6.8
There is no header in the file but the first column contains the year, the second column the month, and the third column the day of the year. The rest of the columns are temperature data.
I want to calculate the average temperature for each day over a period of 20 years. I thought the best way to do that would be to group the data by day and calculate the mean of each day for a specific range of years. Here is my code:
import pandas as pd
hist_fn = 'tmean_daily_1974_2005.txt'
twenty_year_fn = '20_yr_mean_1974_1993.txt'
start = 1974
end = 1993
hist_mean = pd.read_csv(hist_fn, sep='\s+', header=None)
# Limit dataframe to only the 20 years for which I want the mean calculated
interval_mean = hist_mean[(hist_mean[0]>=start) & (hist_mean[0]<=end)]
# Rename the first column to reflect what mean this file is displaying
interval_mean.iloc[:, 0] = ("%s-%s" % (start, end))
# Generate mean for each day spread across all the years in the dataframe
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
# Write multiyear mean to txt
interval_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)
The data set spans more than 20 years; the method I used has worked for the first 20-year interval but gives me a (mostly) empty text file for any other set of years entered.
So when I use these inputs it works:
start = 1974
end = 1993
and it produces a file that looks like this:
1974-1993 1 1 4.33 5.25 6.84 3.67
1974-1993 1 2 7.23 6.22 5.65 6.23
...
1974-1993 12 364 5.12 4.34 5.21 2.16
1974-1993 12 365 4.81 5.95 3.56 6.78
but when I change the inputs to this:
start = 1975
end = 1994
it produces a .txt file with no temperatures:
1975-1994 1 1
1975-1994 1 2
...
1975-1994 12 364
1975-1994 12 365
I don't understand why this method works for the first 20 year interval but none of the subsequent intervals. Is it something to do with the way the data is organized or how it is being sliced?
The strange behavior is due to the fact that pandas aligns on the index during assignment, and slicing preserves the original index. Consider the assignment
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
Note that interval_mean.groupby(2, as_index=False).mean() has indices 0, ..., ~364 (since as_index=False makes the groupby operation create a fresh integer index; otherwise the index would have been the day number). On the other hand, interval_mean keeps the original indices from hist_mean: for the first interval (1974-1993) those are 0, ..., ~20*365, which overlap the groupby result's 0, ..., ~364, but for the second interval (1975-1994) they start at about 365 (one year of rows in) and count up, so they no longer overlap at all.
This is a bit confusing at first, but pandas offers great documentation about it, and people quickly discover why it is so useful.
I'll explain what happens with an example.
Assume we have the following DataFrame:
import numpy as np
df = pd.DataFrame(np.reshape(np.random.randint(5, size=30), [-1,3]))
df
0 1 2
0 1 1 2
1 2 1 1
2 0 1 2
3 0 2 0
4 2 1 0
5 0 1 2
6 2 2 1
7 1 0 2
8 0 1 0
9 1 2 0
Note that the column names are 0,1,2 and the row names (the index) are 0, ..., 9.
When we perform the groupby we obtain
df.groupby(0, as_index=False).mean()
0 1 2
0 0 1.250000 1.000000
1 1 1.000000 1.333333
2 2 1.333333 0.666667
(The index happens to equal the grouped-by values only because column 0 drew only the numbers 0, 1, and 2.) Now, when we assign to df.loc, pandas replaces every cell by the corresponding cell of the right-hand side, if such a cell exists; otherwise it leaves NaN.
df.loc[:,:] = df.groupby(0, as_index=False).mean()
df
0 1 2
0 0.0 1.250000 1.000000
1 1.0 1.000000 1.333333
2 2.0 1.333333 0.666667
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
And when you write NaN to CSV, pandas leaves the cell blank.
The last piece of the puzzle is that interval_mean kept the original indices, because boolean slicing preserves them:
df[df[1] > 1]
0 1 2
3 0 2 0
6 2 2 1
9 1 2 0
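Given that diagnosis, the simplest way to sidestep alignment entirely is to aggregate into a fresh frame instead of assigning back into the slice. A sketch along the lines of the original script (same assumed filename and column layout as the question):

import pandas as pd

start, end = 1975, 1994
hist_mean = pd.read_csv('tmean_daily_1974_2005.txt', sep=r'\s+', header=None)
interval = hist_mean[(hist_mean[0] >= start) & (hist_mean[0] <= end)]

# Aggregate once per day-of-year into a brand-new frame, so the original
# row labels never take part in any assignment.
daily = interval.groupby(2, as_index=False).mean()
daily[0] = '%s-%s' % (start, end)        # replace the (averaged) year column with the interval label
daily[1] = daily[1].round().astype(int)  # month comes back as a float mean; restore it
daily = daily[hist_mean.columns]         # restore the original column order

daily.to_csv('20_yr_mean_%s_%s.txt' % (start, end), sep='\t', header=False, index=False)

Alternatively, interval.groupby(2)[interval.columns[3:]].transform('mean') returns a frame aligned to interval's own row labels, so it can be assigned back into the slice without producing NaNs.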