Creating statistical data from a table - c++

I have a table with 20 columns of measurements. I would like to 'convert' it into a table with 20 rows, with columns for Avg, Min, Max, StdDev, and Count. There is another question like this, but it was for the R language (other question here).
I could do the following for each column (processing the results with C++):
Select Count(Case When [avgZ_l1] <= 0.15 And [avgZ_l1] > 0    Then 1 End) as countValue1,
       Count(Case When [avgZ_l1] <= 0.16 And [avgZ_l1] > 0.15 Then 1 End) as countValue2,
       Count(Case When [avgZ_l1] <= 0.18 And [avgZ_l1] > 0.16 Then 1 End) as countValue3,
       Count(Case When [avgZ_l1] <= 0.28 And [avgZ_l1] > 0.18 Then 1 End) as countValue4,
       Avg(avgwall_l1) as avg1, Min(avgwall_l1) as min1, Max(avgZ_l1) as max1,
       STDEV(avgZ_l1) as stddev1, Count(*) as totalCount
from myProject.dbo.table1
But I do not want to process the 50,000 records 20 times (once for each column). I thought there would be a way to 'pivot' the table onto its side and process the data at the same time. The examples of PIVOT I have seen all pivot on an integer-type field, such as a month number or device ID. Once the table is converted I could then fetch each row with C++. Maybe this is really just a series of 'Insert into ... select ... from' statements.
Would the fastest (execution time) approach be to simply create a really long select statement that returns all the information I want for all the columns?
We might end up with 500,000 rows. I am using C++ and SQL 2014.
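To illustrate, the kind of single long statement I have in mind would be roughly this (only two of the 20 columns shown; the rest would repeat the same pattern):
Select Avg(avgZ_l1) as avg1, Min(avgZ_l1) as min1, Max(avgZ_l1) as max1, STDEV(avgZ_l1) as stddev1,
       Avg(avgwall_l1) as avg2, Min(avgwall_l1) as min2, Max(avgwall_l1) as max2, STDEV(avgwall_l1) as stddev2,
       -- ...the same four aggregates for each of the remaining 18 columns...
       Count(*) as totalCount
from myProject.dbo.table1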
Any thoughts or comments are welcome. I just don't want my naive code to be used as a shining example of how NOT to do something... ;)...

If your table looks the same as the data in the R question you linked, then the following query should work for you. It selects the data that you requested and pivots it at the same time.
create table #temp(ID int identity(1,1), columnName nvarchar(50));
insert into #temp
SELECT COLUMN_NAME as columnName
FROM myProject.INFORMATION_SCHEMA.COLUMNS -- change myProject to the name of your database
WHERE TABLE_NAME = N'table1'; -- change table1 to the table you are looking at
declare @TableName nvarchar(50) = 'table1'; -- change table1 to your table again
declare @loop int = 1;
declare @query nvarchar(max) = '';
declare @columnName nvarchar(50);
declare @endQuery nvarchar(max) = '';
while (@loop <= (select count(*) from #temp))
begin
set @columnName = (select columnName from #temp where ID = @loop);
set @query = 'select t.columnName, avg(['+@columnName+']) as Avg, min(['+@columnName+']) as Min, max(['+@columnName+']) as Max, stdev(['+@columnName+']) as StDev, count(*) as totalCount from '+@TableName+' join #temp t on t.columnName = '''+@columnName+''' group by t.columnName';
set @loop += 1;
set @endQuery += 'union all('+ @query + ')';
end;
set @endQuery = stuff(@endQuery, 1, 9, ''); -- remove the leading 'union all'
Execute(@endQuery);
drop table #temp;
It creates a #temp table which stores your column headings next to an ID. It then uses that ID while looping through the columns, generating a query for each one and combining them with UNION ALL. This works for any number of columns, so if you add or remove columns it should still give the correct result.
With this input:
age height_seca1 height_chad1 height_DL weight_alog1
1 19 1800 1797 180 70
2 19 1682 1670 167 69
3 21 1765 1765 178 80
4 21 1829 1833 181 74
5 21 1706 1705 170 103
6 18 1607 1606 160 76
7 19 1578 1576 156 50
8 19 1577 1575 156 61
9 21 1666 1665 166 52
10 17 1710 1716 172 65
11 28 1616 1619 161 66
12 22 1648 1644 165 58
13 19 1569 1570 155 55
14 19 1779 1777 177 55
15 18 1773 1772 179 70
16 18 1816 1809 181 81
17 19 1766 1765 178 77
18 19 1745 1741 174 76
19 18 1716 1714 170 71
20 21 1785 1783 179 64
21 19 1850 1854 185 71
22 31 1875 1880 188 95
23 26 1877 1877 186 106
24 19 1836 1837 185 100
25 18 1825 1823 182 85
26 19 1755 1754 174 79
27 26 1658 1658 165 69
28 20 1816 1818 183 84
29 18 1755 1755 175 67
It will produce this output:
avg min max stdev totalcount
age 20 17 31 3.3 29
height_seca1 1737 1569 1877 91.9 29
height_chad1 1736 1570 1880 92.7 29
height_DL 173 155 188 9.7 29
weight_alog1 73 50 106 14.5 29
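As an aside, the generated UNION ALL above still reads table1 once per column. If that becomes a concern with 500,000 rows, one possible single-scan alternative (a sketch only, not tested against your schema) is to unpivot the measurement columns with CROSS APPLY (VALUES ...) and group by the column name; every measurement column would need its own ('name', t.column) pair in the VALUES list:
select v.columnName,
       avg(v.val) as [Avg], min(v.val) as [Min], max(v.val) as [Max],
       stdev(v.val) as [StDev], count(*) as totalCount
from myProject.dbo.table1 t
cross apply (values ('avgZ_l1', t.avgZ_l1),
                    ('avgwall_l1', t.avgwall_l1)
                    -- ...one ('columnName', t.columnName) pair per measurement column...
            ) as v(columnName, val)
group by v.columnName;
This touches the 50,000 rows only once, at the cost of writing out (or generating) the VALUES list for all 20 columns.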
Hope this helps and works for you. :)

Related

Get frequency from dataset with repeated measurements over time

This is my problem: I have a dataset that has 10 measurements over time for each ID, something like this:
ID Expenditure Age
25 100 89
25 102 89
25 178 89
25 290 89
25 200 89
.
.
.
26 100 79
26 102 79
26 178 79
26 290 79
26 200 79
.
.
.
27 100 80
27 102 80
27 178 80
27 290 80
27 200 80
.
.
.
Now I want to obtain the frequency of age, so I did this:
proc freq data=Expenditure;
table Age / out= Age_freq outexpect sparse;
run;
Output:
Age Frequency Count Percent of total frequency
79 10 0.1
80 140 1.4
89 50 0.5
The problem is that this counts all rows, but doesn't take into account the repeated measurements per ID. So I wanted to create a new column with the actual frequencies like this:
data Age;
set Age_freq;
freq = Frequency Count /10;
run;
but I think SAS doesn't recognize this 'Frequency Count' variable. Can anybody give me some insight on this?
thanks
You have to remove the duplicate records so that each ID has one record containing the age.
Solution: create a new table with the distinct values of ID and Age, then run PROC FREQ.
Code:
I created a new table called Expenditure_ids that doesn't have any duplicate values for the ID & Age.
data Expenditure;
input ID Expenditure Age ;
datalines;
25 100 89
25 102 89
25 178 89
25 290 89
25 200 89
26 100 79
26 102 79
26 178 79
26 290 79
26 200 79
27 100 80
27 102 80
27 178 80
27 290 80
27 200 80
28 100 80
28 102 80
28 178 80
28 290 80
28 200 80
;
run;
proc sql;
create table Expenditure_ids as
select distinct ID, Age from Expenditure ;
quit;
proc freq data=Expenditure_ids;
table Age / out= Age_freq outexpect sparse;
run;
Output:
Age=79 COUNT=1 PERCENT=25
Age=80 COUNT=2 PERCENT=50
Age=89 COUNT=1 PERCENT=25

Year to date vs Year to date last year | Pandas

I would like to calculate the Year to date (YTD) value for this year and compare it to the same period last year in Pandas. My df looks like this:
Month Product A Product B
2015-01-01 24 62
2015-02-01 46 24
2015-03-01 30 70
2015-04-01 26 51
2015-05-01 34 42
2015-06-01 45 35
2015-07-01 25 13
2015-08-01 98 95
2015-09-01 6 81
2015-10-01 93 38
2015-11-01 98 59
2015-12-01 98 1
2016-01-01 67 42
2016-02-01 72 34
2016-03-01 7 6
2016-04-01 19 24
2016-05-01 82 38
2016-06-01 15 79
2016-07-01 49 83
2016-08-01 97 56
The two values I am after for product A are
YTD = 408 and YTD SPLY = 328 (sum of Jan-Aug 2016 and sum of Jan-Aug 2015).
When a new month is added to the df, I would like the calculation to cover Jan-Sep, and so on.
Any ideas how to proceed?
Not exactly sure what you want, but it looks like you want the cumulative sum within each year. The YTD value is then the latest A_cumsum (408 for 2016-08), and YTD SPLY is the A_cumsum for the same month a year earlier (328 for 2015-08).
df[['A_cumsum', 'B_cumsum']] = df.resample('A', on='Month').transform('cumsum')
Month Product A Product B A_cumsum B_cumsum
0 2015-01-01 24 62 24 62
1 2015-02-01 46 24 70 86
2 2015-03-01 30 70 100 156
3 2015-04-01 26 51 126 207
4 2015-05-01 34 42 160 249
5 2015-06-01 45 35 205 284
6 2015-07-01 25 13 230 297
7 2015-08-01 98 95 328 392
8 2015-09-01 6 81 334 473
9 2015-10-01 93 38 427 511
10 2015-11-01 98 59 525 570
11 2015-12-01 98 1 623 571
12 2016-01-01 67 42 67 42
13 2016-02-01 72 34 139 76
14 2016-03-01 7 6 146 82
15 2016-04-01 19 24 165 106
16 2016-05-01 82 38 247 144
17 2016-06-01 15 79 262 223
18 2016-07-01 49 83 311 306
19 2016-08-01 97 56 408 362

Reading and halving a sas data set

I have to read a data set of 50 numbers from a text file. The numbers are space-delimited and spread over multiple lines of uneven length. For example:
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15
16 17 18 19 20 21
Etc.
The first 25 numbers belong to group 1, and the second 25 belong to group 2. So I need to make a group variable (either 1 or 2), a counter (1 to 25), and a value variable holding the number itself.
I am stuck on how to split the data in half when reading it. I tried to use truncover but it did not work.
Try something like this, replacing the datalines keyword with the path to your file:
data groups;
infile datalines;
format number 8. counter 2. group 1.; * Not mandatory, used here to order variables;
retain group (1);
input number @@;
counter + 1;
if counter = 26 then do;
group = 2;
counter = 1;
end;
datalines;
192 105 435 448 160 499 184 246 388 190 316
139 146 147 192 231 449 101 216 342 399 352 122 418
280 400 187 352 321 180 425 500 320 179 105
232 105 323 132 106 255 449
186 135 472 174 119 255
308 350
run;

How to extract next 12 months' data from master table for each id in Sample table based on the yearmonth and ID using sas

I am currently practicing SAS programming using two SAS datasets (sample and master). Below is hypothetical (dummy) data created for illustration. I would like to extract, from the master dataset (test), the data for the IDs in the sample dataset. For each ID in the sample dataset I need to extract the next 12 months' information from the master table (test), based on the yearmonth value (desired output given below).
Below is the code I use to extract the previous 12 months' data, but I cannot work out how to extract the next 12 months' records in the same way. Can anyone help me solve this in SAS in an optimized way?
proc sort data=test;
by id yearmonth;
run;
data result;
set test;
array prev_month {13} PREV_MONTH_0-PREV_MONTH_12;
by id;
if first.id then do;
do i =1 to 13;
prev_month(i)=0;
end;
end;
do i = 13 to 2 by -1;
prev_month(i)=prev_month(i-1);
end;
prev_month(1)=no_of_cust;
drop i prev_month_0;
retain prev_month:;
run;
data sample1;
set sample(drop=no_of_cust);
run;
proc sort data=sample1;
by id yearmonth;
run;
data all;
merge sample1(in=a) result(in=b);
by id yearmonth;
if a;
run;
One sample dataset (dataset name - sample).
ID YEARMONTH NO_OF_CUST
1 200909 50
1 201005 65
1 201008 78
1 201106 95
2 200901 65
2 200902 45
2 200903 69
2 201005 14
2 201006 26
2 201007 98
One master dataset (dataset name: test); a huge dataset with a row for every month per ID, from the start of the account up to the present.
ID YEARMONTH NO_OF_CUST
1 200808 125
1 200809 125
1 200810 111
1 200811 174
1 200812 98
1 200901 45
1 200902 74
1 200903 73
1 200904 101
1 200905 164
1 200906 104
1 200907 22
1 200908 35
1 200909 50
1 200910 77
1 200911 86
1 200912 95
1 201001 95
1 201002 87
1 201003 79
1 201004 71
1 201005 65
1 201006 66
1 201007 66
1 201008 78
1 201009 88
1 201010 54
1 201011 45
1 201012 100
1 201101 136
1 201102 111
1 201103 17
1 201104 77
1 201105 111
1 201106 95
1 201107 79
1 201108 777
1 201109 758
1 201110 32
1 201111 15
1 201112 22
2 200711 150
2 200712 150
2 200801 44
2 200802 385
2 200803 65
2 200804 66
2 200805 200
2 200806 333
2 200807 285
2 200808 265
2 200809 222
2 200810 220
2 200811 205
2 200812 185
2 200901 65
2 200902 45
2 200903 69
2 200904 546
2 200905 21
2 200906 256
2 200907 214
2 200908 14
2 200909 44
2 200910 65
2 200911 88
2 200912 79
2 201001 65
2 201002 45
2 201003 69
2 201004 54
2 201005 14
2 201006 26
2 201007 98
The desired output should look like the row below:
ID YEARMONTH NO_OF_CUST AFTER_MONTH_1 AFTER_MONTH_2 AFTER_MONTH_3 AFTER_MONTH_4 AFTER_MONTH_5 AFTER_MONTH_6 AFTER_MONTH_7 AFTER_MONTH_8 AFTER_MONTH_9 AFTER_MONTH_10 AFTER_MONTH_11 AFTER_MONTH_12
1 200909 50 77 86 95 95 87 79 71 65 66 66 78 88
Step 1: Join your sample table with the main (test) table, using intnx to get all the values for the next 12 months.
Step 2: Build the column names of the form "after_month_n".
Step 3: Transpose to get your final output.
proc sql;
create table abc as
select a.id,a.yearmonth,b.yearmonth as yearmonth1, b.no_of_cust
from
sample a
left join
test b
on a.id = b.id and a.yearmonth <= b.yearmonth <= intnx("month",a.yearmonth,12)
order by a.id,a.yearmonth,b.yearmonth;
quit;
data abc1(drop=col yearmonth1);
set abc;
by id yearmonth;
if first.yearmonth then col=-1;
col+1;
columns = compress("after_month_"||col);
run;
proc transpose data=abc1 out=abc2(rename=(after_month_0 = no_of_cust) drop=_name_);
by id yearmonth;
id columns;
var no_of_cust;
run;
My output: (screenshot not reproduced here)
Alternatively, if you would rather adapt your original query, you could use the code below.
proc sort data=test;
by id descending yearmonth;
run;
data result;
set test;
array after_month {13} after_MONTH_0-after_MONTH_12;
by id;
if first.id then do;
do i = 1 to 13;
after_month(i) = 0;
end;
end;
do i = 13 to 2 by -1;
after_month(i) = after_month(i-1);
end;
after_month(1) = NO_OF_CUST;
drop i after_MONTH_0;
retain after_MONTH:;
run;
data sample1;
set sample(drop=no_of_cust);
run;
proc sort data=result;
by id yearmonth;
run;
proc sort data=sample1;
by id yearmonth;
run;
data all;
merge sample1(in=a) result(in=b);
by id yearmonth;
if a;
run;
Let me know in case of any queries.

Pandas DataFrame: How to get a min value in a vectorized way?

I have a pandas dataframe:
import numpy
import pandas
df1 = abs((pandas.DataFrame(numpy.random.randn(20, 8))*100).astype(int))
df1.columns = list('abcdefgh')
df1.index = pandas.date_range('1/1/2014', periods=20)
How would I create a new column that will give me the minimum value of the first half of the current row and the last 3 values in the previous row?
For example, the first five rows in the created column would be:
NaN
12
4
14
21
Here is one way to do it. Basically, you first shift the last three columns, combine them with the first four columns of the current row, and then take the min.
import numpy
import pandas
# your data
# ===================================
numpy.random.seed(0)
df1 = abs((pandas.DataFrame(numpy.random.randn(20, 8))*100).astype(int))
df1.columns = list('abcdefgh')
df1.index = pandas.date_range('1/1/2014', periods=20)
# processing
# ===================================
df1['custom_min'] = pandas.concat([df1[df1.columns[:4]], df1[df1.columns[-3:]].shift(1)], axis=1).min(axis=1)
print(df1)
a b c d e f g h custom_min
2014-01-01 176 40 97 224 186 97 95 15 40
2014-01-02 10 41 14 145 76 12 44 33 10
2014-01-03 149 20 31 85 255 65 86 74 12
2014-01-04 226 145 4 18 153 146 15 37 4
2014-01-05 88 198 34 15 123 120 38 30 15
2014-01-06 104 142 170 195 50 43 125 77 30
2014-01-07 161 21 89 38 51 118 2 42 21
2014-01-08 6 30 63 36 67 35 81 172 2
2014-01-09 17 40 163 46 90 5 72 12 17
2014-01-10 113 123 40 68 87 57 31 5 5
2014-01-11 116 90 46 153 148 189 117 17 5
2014-01-12 107 105 40 122 20 97 35 70 17
2014-01-13 1 178 12 40 188 134 127 96 1
2014-01-14 117 194 41 74 192 148 186 90 41
2014-01-15 86 191 26 80 94 15 61 92 26
2014-01-16 37 109 29 132 69 14 43 184 15
2014-01-17 67 40 76 53 67 3 63 67 14
2014-01-18 57 20 39 109 149 43 16 63 3
2014-01-19 238 94 91 111 131 46 6 171 16
2014-01-20 74 82 9 66 112 107 114 43 6