Incorrect floating point aggregation - sas

I am running the SAS query below and getting odd values in exponential notation in the summed column:
data t;
input a b $ c $ d ;
datalines;
481710428888 24Nov2010 NP 34961.0000
481710428888 07Mar2013 IP 175455.7500
481710428888 09Nov2015 WB -63835.6400
481710428888 23Nov2015 WO 27074.9000
481710428888 23Nov2015 WO 49240.6500
481710428888 23Nov2015 WO 70265.5600
481910257067 01Apr2010 NP 47129.0000
481910257067 27May2010 WO 47129.0000
481910257067 22Mar2013 IP 3287.6900
481910257067 11Apr2013 WO 3287.6900
;
run;
PROC SQL;
CREATE TABLE WORK.IAP_VLTEST AS
SELECT DISTINCT
put(a, z20.) AS ACCOUNT_NUMBER,
b,
c,
d,
(CASE WHEN c = 'WO' THEN -1 ELSE 1 END) * d AS PRVN_A
,SUM (CALCULATED PRVN_A) AS iap
FROM T
GROUP BY 1
ORDER BY a ;
QUIT;
I get the following output
ACCOUNT_NUMBER b c d PRVN_A iap
00000000481710428888 07Mar201 IP 175455.75 175455.75 7.27596E-12
00000000481710428888 09Nov201 WB -63835.64 -63835.64 7.27596E-12
00000000481710428888 23Nov201 WO 27074.9 -27074.9 7.27596E-12
00000000481710428888 23Nov201 WO 49240.65 -49240.65 7.27596E-12
00000000481710428888 23Nov201 WO 70265.56 -70265.56 7.27596E-12
00000000481710428888 24Nov201 NP 34961 34961 7.27596E-12
00000000481910257067 01Apr201 NP 47129 47129 0
00000000481910257067 11Apr201 WO 3287.69 -3287.69 0
00000000481910257067 22Mar201 IP 3287.69 3287.69 0
00000000481910257067 27May201 WO 47129 -47129 0
I don't understand why I am getting this odd exponential value for the first value of a.
This is happening with several rows in my original dataset.
Could anyone please help me understand what is going wrong here?
Thanks!

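The sum iap is not wrong, just very small. Presuming the prvn_a items in each group are transactions that should reconcile to zero, you are getting a tiny non-zero result (7.27596E-12 is 2^-37, on the order of a single rounding error for values this large) because of how floating-point numbers are stored; see "Numerical Accuracy in SAS Software" in the SAS documentation. Most decimal fractions, such as 0.65 or 0.9, have no exact binary representation, so each value carries a small error and those errors do not always cancel when summed. This must be contended with in almost all programming languages; there is nothing weird or erroneous going on. I would recommend rounding the sum to the nearest 1e-5, which covers the four decimal places of d: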
, ROUND (
SUM (CALCULATED PRVN_A), 0.00001
) AS iap
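The effect is not specific to SAS. As a minimal sketch in Python (used here only for illustration; the variable names are mine), the same values can be summed once in binary floating point and once in exact decimal arithmetic. Python floats are IEEE 754 doubles, which is also what SAS uses on most platforms, though the exact residue may differ with the order of summation.

```python
from decimal import Decimal

# PRVN_A values for account 481710428888, signs already applied (WO rows negated)
prvn_a = ["34961.0000", "175455.7500", "-63835.6400",
          "-27074.9000", "-49240.6500", "-70265.5600"]

float_total = sum(float(v) for v in prvn_a)    # binary floating point
exact_total = sum(Decimal(v) for v in prvn_a)  # exact decimal arithmetic

print(float_total)            # may be a tiny non-zero value rather than 0.0
print(exact_total)            # 0.0000
print(round(float_total, 5))  # zero after rounding, like ROUND(..., 0.00001)
```

The exact total is zero, so any non-zero float total is pure representation error, and rounding to five decimal places removes it.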

Related

SAS length warning : Multiple lengths were specified for the variable np by input data set(s)

I want to concatenate two datasets with a SET statement, but the same variable has a different length in each. My example:
data set 1
det np ord
C 5 0
data set 2
det np ord
A 1(10) 1
B 3(30) 2
Could someone help me set these two datasets correctly, without the warning? Many thanks!
My final dataset should be:
det np ord
C 5 0
A 1(10) 1
B 3(30) 2
Assuming the following situation with two data sets with different lengths in column "det":
data have1;
length det $4 np ord c $10;
det='1234';np='np1'; ord='ord1';
c='c text';
run;
data have2;
length det $5 np ord a b $10;
det='12345'; np='np2'; ord='ord2';
a='a sample';
b='b others';
run;
If you put them together by a simple data step there will be a warning due to different lengths depending on the order of your input data sets:
data want1;
set have1 have2;
run;
WARNING: Multiple lengths were specified for the variable det by input data set(s). This can cause truncation of data.
And here is the code without warning:
data want2;
set have2 have1;
run;
/* Sort step, just to get the original order */
proc sort data=want2;
by det;
run;
Keep in mind that in the first case the warning has real consequences: your data may be truncated, because the resulting column det will have a length of 4. In the example above, the second row will have det="1234" instead of "12345", i.e. want1 looks like this:
det   np   ord   c       a         b
-------------------------------------------------
1234  np1  ord1  c text
1234  np2  ord2          a sample  b others
In the case of want2 the length will be 5 and there will not be any truncation.
Explanation: SAS takes the length of a column in the resulting data set from the first occurrence of the variable, here the first data set listed on the SET statement.
But the best way to avoid this warning is to define the lengths for the resulting table before the set statement. In this way you can also define the order of the resulting columns:
data want;
length det $5 np ord c a b $10;
set have1 have2;
run;

How to make a table (with proc report or data step) of a grouped variable where in different columns are counts of different variables?

Could you please give some advice on how to calculate different counts in different columns when grouping by a certain variable with proc report (if that is possible)?
I include an example and the solution below to show what I want to achieve. I can build this table in SQL by grouping each count individually (with where clauses, for example where Building_code = 'A') and then joining the results into one table, but that is rather long, especially when I want to add more columns. Is there a way to define it in proc report or in a shorter data step or query? If yes, could you give a short example please?
Example:
Solution:
Thank you for your time.
This should work. There is absolutely no need to do this by joining multiple tables.
data have;
input Person_id Country :$15. Building_code $ Flat_code $ age_category $;
datalines;
1000 England A G 0-14
1001 England A G 15-64
1002 England A H 15-64
1003 England B H 15-64
1004 England B J 15-64
1005 Norway A G 15-64
1006 Norway A H 65+
1007 Slovakia A G 65+
1008 Slovakia B H 65+
;
run;
This is a solution in proc sql. It's not really long or complicated. I don't think you could do it any shorter using data step.
proc sql;
create table want as
select distinct country,
    sum(Building_code = 'A') as A_buildings,
    sum(Flat_code = 'G') as G_flats,
    sum(age_category = '15-64') as adults
from have
group by country
;
quit;
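The query works because a comparison such as Building_code = 'A' evaluates to 1 when true and 0 when false, so summing it within each group counts the matching rows. As a rough illustration of the same idea outside SAS, here is a Python sketch using only the standard library (the dictionary layout is my own choice, not part of the answer above):

```python
from collections import defaultdict

rows = [
    (1000, "England", "A", "G", "0-14"),
    (1001, "England", "A", "G", "15-64"),
    (1002, "England", "A", "H", "15-64"),
    (1003, "England", "B", "H", "15-64"),
    (1004, "England", "B", "J", "15-64"),
    (1005, "Norway", "A", "G", "15-64"),
    (1006, "Norway", "A", "H", "65+"),
    (1007, "Slovakia", "A", "G", "65+"),
    (1008, "Slovakia", "B", "H", "65+"),
]

# One counter dictionary per country, created on first use
counts = defaultdict(lambda: {"A_buildings": 0, "G_flats": 0, "adults": 0})
for _person_id, country, building, flat, age in rows:
    c = counts[country]
    c["A_buildings"] += building == "A"  # True adds 1, False adds 0
    c["G_flats"] += flat == "G"
    c["adults"] += age == "15-64"

for country in sorted(counts):
    print(country, counts[country])
```

Adding another column is one more conditional sum, which is why no joins are needed.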

SAS Proc NPAR1WAY wilcoxon exact test produce null P value?

I have campaign results split into a test and a holdout dataset, and the analysis variable (VARIANCE) is not normally distributed. I tried the PROC NPAR1WAY Wilcoxon exact test to get the p-value. The rest of the output is populated, but every field in the Exact Test section is missing (.). I am not sure what else to check: none of the values of the VAR variable are missing, and the log shows no ERROR message.
PROC NPAR1WAY WILCOXON DATA = TEST1;
CLASS TEST_HOLDOUT_FLAG;
VAR VARIANCE;
EXACT WILCOXON;
RUN;
Result
Wilcoxon Two-Sample Test
Statistic (S) 3.18E+12
Normal Approximation
Z 8.1747
One-Sided Pr > Z <.0001
Two-Sided Pr > |Z| <.0001
t Approximation
One-Sided Pr > Z <.0001
Two-Sided Pr > |Z| <.0001
Exact Test
One-Sided Pr >= S .
Two-Sided Pr >= |S - Mean| .
Z includes a continuity correction of 0.5.
Kruskal-Wallis Test
Chi-Square 66.826
DF 1
Pr > Chi-Square <.0001

Wilcoxon Z score is negative when it should be positive and vice versa

In SAS, I run a t-test on the difference between two groups (independent, but from the same population). The sign of the difference and the sign of the t-statistic match: when the difference between the two groups is negative the t-statistic is negative, and when it is positive the t-statistic is positive.
However, when I run a Wilcoxon rank-sum test, the sign of the z-score does not match the sign of the group difference: a negative difference gives a positive z-score, and a positive difference gives a negative z-score.
I have tried sorting the dataset in both ascending and descending order.
Here's my code:
proc sort data = fundawin3t;
by vb_nvb_TTest;
run;
**Wilcoxon rank sums for vb vs nvb firms.;
proc npar1way data = fundawin3t wilcoxon;
title "NVB vs VB univariate tests and Wilcoxon-Table 4";
var ma_score_2015 age mve roa BM BHAR prcc_f CFI CFF momen6 vb_nvb SERIAL recyc_v;
class vb_nvb_TTest;
run;
Here is my log:
3208
3209 proc sort data = fundawin3t;
3210 by vb_nvb_TTest;
3211 run;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
3212
3213 **Wilcoxon rank sums for vb vs nvb firms.;
3214 proc npar1way data = fundawin3t wilcoxon;
3215 title "NVB vs VB univariate tests and Wilcoxon-Table 4";
3216 var ma_score_2015 age mve roa BM BHAR prcc_f CFI CFF momen6
tenure vb_nvb SERIAL
3216! recyc_v;
3217 class vb_nvb_TTest;
3218 run;
NOTE: PROCEDURE NPAR1WAY used (Total process time):
real time 6.59 seconds
cpu time 5.25 seconds
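This is documented behavior: the sign of z depends on which sample's scores PROC NPAR1WAY sums, not on the direction in which you compute the difference. From the PROC NPAR1WAY documentation: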
To compute the linear rank statistic S, PROC NPAR1WAY sums the scores of the observations in the smaller of the two samples. If both samples have the same number of observations, PROC NPAR1WAY sums those scores for the sample that appears first in the input data set.
PROC NPAR1WAY computes one-sided and two-sided asymptotic p-values for each two-sample linear rank test. When the test statistic z is greater than its null hypothesis expected value of 0, PROC NPAR1WAY computes the right-sided p-value, which is the probability of a larger value of the statistic occurring under the null hypothesis. When the test statistic is less than or equal to 0, PROC NPAR1WAY computes the left-sided p-value, which is the probability of a smaller value of the statistic occurring under the null hypothesis. The one-sided p-value $P_1(z)$ can be expressed as
$$P_1(z) = \begin{cases} \mathrm{Prob}(Z > z) & \text{if } z > 0 \\ \mathrm{Prob}(Z < z) & \text{if } z \le 0 \end{cases}$$
where $Z$ has a standard normal distribution.
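To see why the sign can disagree with a t-test, here is a small Python sketch of the rank-sum z-score. It is a simplified illustration, not the procedure's exact computation: it assumes the plain normal approximation with no tie correction and no continuity correction, and the function name and sample values are made up.

```python
import math

def rank_sum_z(sample, other):
    """z-score of the rank sum of `sample` in the pooled data (simplified)."""
    pooled = sorted(sample + other)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1..j for ties
        i = j
    n, m = len(sample), len(other)
    s = sum(ranks[x] for x in sample)      # rank sum S of the chosen sample
    expected = n * (n + m + 1) / 2         # E[S] under the null hypothesis
    variance = n * m * (n + m + 1) / 12    # Var[S], ignoring ties
    return (s - expected) / math.sqrt(variance)

small = [1.2, 2.3, 3.1]        # smaller sample, lower values
large = [4.0, 5.5, 6.1, 7.3]   # larger sample, higher values

z_small = rank_sum_z(small, large)  # summing the smaller sample's ranks
z_large = rank_sum_z(large, small)  # summing the other sample's ranks
print(z_small, z_large)
```

The two z-scores have equal magnitude and opposite sign, so the sign reported depends only on which sample is summed. If the smaller sample happens to be the group you subtract in the t-test, the signs of z and t will be opposite.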

Pandas: ignore null values when using .astype(str)?

So I have a dataframe, call it TABLE and I'm using Pandas with Python 2.7 to analyze it. It's mostly categorical data so right now my goal is to have a summary of my table where I list each column name and the average length of the values in that column.
Example table:
A B C E F
0 djsdd 973 348f NaN abcd
1 dsa 49 34h5 NaN NaN
Then my desired output would be something like:
Column AvgLength
A 4.0
B 2.5
C 4.0
E NaN
F 4.0
Now the first problem I had was that there are some numerical values in the dataset. I thought I could resolve that by using .astype(str) so I did the following:
for k in TABLE:
    print "%s\t %s"%(k,TABLE[k].astype(str).str.len().mean())
The issue now is that it looks to me like .astype(str) is converting the null values to strings because I ended up with the following output:
Column AvgLength
A 4.0
B 2.5
C 4.0
E 3.0
F 3.5
Notice that column E containing the null values is giving me an average length of 3, and column F is giving me an average of 3.5. My understanding is this happened because it's taking the length of the string "NaN."
Is there some way to do what I want and ignore the Null values? Or is there a completely different approach I should be taking (I'm very new to pandas)?
(I did read about .dropna() but I don't want to omit all columns that might contain null values because some columns may have null values alongside data. I want to just ignore the null values from my mean).
stack to get series
dropna to get rid of NaN
astype(str).str.len() to get lengths
unstack().mean() for average length
reindex(TABLE.columns) to ensure we get all original columns represented
TABLE.stack().dropna().astype(str).str.len().unstack().mean().reindex(TABLE.columns)
A 4.0
B 2.5
C 4.0
E NaN
F 4.0
dtype: float64
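A per-column variant of the same idea, sketched for Python 3 and a recent pandas: drop the nulls in each column before converting to strings, so a missing value never turns into the three-character text "nan". The helper name avg_length is mine, and the frame below just rebuilds the example table from the question.

```python
import numpy as np
import pandas as pd

TABLE = pd.DataFrame({
    "A": ["djsdd", "dsa"],
    "B": [973, 49],
    "C": ["348f", "34h5"],
    "E": [np.nan, np.nan],
    "F": ["abcd", np.nan],
})

def avg_length(col):
    """Mean string length of the non-null values in one column."""
    # dropna() first, so nulls are excluded instead of being stringified
    return col.dropna().map(lambda v: len(str(v))).mean()

summary = TABLE.apply(avg_length)  # one value per column, in column order
print(summary)
```

Columns that are entirely null, like E, come out as NaN because the mean of an empty series is NaN, so no reindex step is needed.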