Weka machine learning - Interpreting Naive Bayes - weka

I have a training dataset of ill horses; it contains data about surgeries and diseases. Some of the fields in each record are: temperature of the horse, age, pulse, respiratory rate, etc.
What I want to do is build a classifier on the lived/died/euthanized column of every row. What I am asked to check is:
Think about the hypothesis of independence of the variables
Check whether I have enough elements to obtain reliable probabilities
The dataset had about 25% missing values, which were imputed using MIMMI imputation.
Thinking about the possibility of getting reliable probabilities, I can see that the training dataset is a little unbalanced: 179 horses lived and 121 died (died + euthanized), but I'm not really sure about that. Any help with those two questions would be much appreciated.
=== Run information ===
Scheme:weka.classifiers.bayes.NaiveBayes
Relation: horseColic-weka.filters.unsupervised.attribute.Remove-R25-27
Instances: 300
Attributes: 24
surgery
age
id
temp
pulse
respRate
tempExtrem
periPulse
mucMemb
capRefT
pain
peri
abdDist
ngTube
ngReflux
ngRPH
feces
abd
pCellVol
totProt
abdCentApp
abdCentTotProt
outc
surgLes
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute lived died euthanized
(0.59) (0.26) (0.15)
==================================================================
surgery
yes 97.0 59.0 28.0
no 84.0 20.0 18.0
[total] 181.0 79.0 46.0
age
adult 168.0 67.0 44.0
young 13.0 12.0 2.0
[total] 181.0 79.0 46.0
id
mean 1009274.0202 1452556.3598 751596.8611
std. dev. 1431022.1677 1887025.7703 989556.6807
weight sum 179 77 44
precision 16915.735 16915.735 16915.735
temp
mean 34.8733 35.0055 33.054
std. dev. 10.2335 13.0545 14.9588
weight sum 179 77 44
precision 0.9275 0.9275 0.9275
pulse
mean 29.2039 33.2115 29.0187
std. dev. 10.8578 14.6404 16.7248
weight sum 179 77 44
precision 0.9107 0.9107 0.9107
respRate
mean 15.0771 16.9169 15.9348
std. dev. 8.9803 7.0278 8.1221
weight sum 179 77 44
precision 0.8667 0.8667 0.8667
tempExtrem
normal 82.0 16.0 12.0
warm 36.0 7.0 3.0
cool 53.0 48.0 25.0
cold 12.0 10.0 8.0
[total] 183.0 81.0 48.0
periPulse
normal 133.0 22.0 11.0
increased 5.0 8.0 7.0
reduced 43.0 47.0 25.0
absent 2.0 4.0 5.0
[total] 183.0 81.0 48.0
mucMemb
normal-pink 95.0 9.0 7.0
bright-pink 23.0 13.0 6.0
pale-pink 37.0 19.0 12.0
pale-cyanotic 16.0 17.0 12.0
bright-red 7.0 14.0 8.0
dark-cyanotic 7.0 11.0 5.0
[total] 185.0 83.0 50.0
capRefT
short 153.0 46.0 23.0
long 28.0 33.0 23.0
long2 1.0 1.0 1.0
[total] 182.0 80.0 47.0
pain
no-pain 53.0 6.0 8.0
depressed 42.0 21.0 14.0
inte-mild-pain 64.0 10.0 8.0
inte-severe-pain 12.0 18.0 12.0
cont-severe-pain 13.0 27.0 7.0
[total] 184.0 82.0 49.0
peri
hypermotile 42.0 7.0 7.0
normal 22.0 8.0 5.0
hypomotile 90.0 37.0 17.0
absent 29.0 29.0 19.0
[total] 183.0 81.0 48.0
abdDist
none 88.0 17.0 13.0
slight 53.0 18.0 8.0
moderate 28.0 30.0 14.0
severe 14.0 16.0 13.0
[total] 183.0 81.0 48.0
ngTube
none 79.0 40.0 27.0
slight 90.0 32.0 15.0
significant 13.0 8.0 5.0
[total] 182.0 80.0 47.0
ngReflux
none 149.0 50.0 30.0
much 17.0 15.0 6.0
less 16.0 15.0 11.0
[total] 182.0 80.0 47.0
ngRPH
mean 11.3797 13.0882 8.0606
std. dev. 2.3535 3.2916 5.1673
weight sum 179 77 44
precision 0.7917 0.7917 0.7917
feces
normal 77.0 14.0 10.0
increased 16.0 14.0 8.0
decreased 44.0 15.0 11.0
absent 46.0 38.0 19.0
[total] 183.0 81.0 48.0
abd
normal 48.0 13.0 4.0
other 39.0 5.0 7.0
firm-large-intestine 18.0 8.0 6.0
dist-small-intest 32.0 24.0 8.0
distended-large-intest 47.0 32.0 24.0
[total] 184.0 82.0 49.0
pCellVol
mean 31.0162 47.0465 46.0112
std. dev. 14.1207 18.5468 17.672
weight sum 179 77 44
precision 0.9518 0.9518 0.9518
totProt
mean 42.6539 41.451 43.7936
std. dev. 16.9138 18.6362 19.3247
weight sum 179 77 44
precision 0.9432 0.9432 0.9432
abdCentApp
clear 112.0 25.0 10.0
cloudy 54.0 22.0 20.0
serosanguinous 16.0 33.0 17.0
[total] 182.0 80.0 47.0
abdCentTotProt
mean 16.1341 21.1634 14.3203
std. dev. 6.8038 4.9109 8.6619
weight sum 179 77 44
precision 0.8837 0.8837 0.8837
surgLes
yes 94.0 70.0 30.0
no 87.0 9.0 16.0
[total] 181.0 79.0 46.0
Time taken to build model: 0.01 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 216 72 %
Incorrectly Classified Instances 84 28 %
Kappa statistic 0.5134
Mean absolute error 0.1965
Root mean squared error 0.3803
Relative absolute error 52.8451 %
Root relative squared error 88.2672 %
Total Number of Instances 300
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.777 0.198 0.853 0.777 0.813 0.873 lived
0.675 0.175 0.571 0.675 0.619 0.871 died
0.568 0.082 0.543 0.568 0.556 0.824 euthanized
Weighted Avg. 0.72 0.175 0.735 0.72 0.725 0.865
=== Confusion Matrix ===
a b c <-- classified as
139 28 12 | a = lived
16 52 9 | b = died
8 11 25 | c = euthanized

Naive Bayes makes the prominent assumption that all attributes are conditionally independent given the class, meaning that in this case age, surgery, temp, and so on are treated as mutually independent. This may not hold, and in many instances it does not. Naive Bayes will nevertheless generally obtain decent results with little training data, though it is normally not as good as a model whose assumptions match the data better. Finding such models takes time and effort, and often a Naive Bayes model reaches an adequate accuracy. I'm not sure about your sample size; you'll have to look at the statistical power of your dataset.
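As a rough sanity check on "enough elements for reliable probabilities", you can flag attribute/class cells whose counts fall below a rule-of-thumb minimum of about 5. This is a hand-rolled sketch, not a Weka feature; the counts are copied from the classifier output above:

```python
# Conditional counts for capRefT, copied from the Weka output above.
# Weka's NaiveBayes display adds a +1 Laplace correction to every cell,
# so a displayed 1.0 means the value never occurred for that class.
cap_ref_t = {
    "short": {"lived": 153.0, "died": 46.0, "euthanized": 23.0},
    "long":  {"lived": 28.0,  "died": 33.0, "euthanized": 23.0},
    "long2": {"lived": 1.0,   "died": 1.0,  "euthanized": 1.0},
}

def sparse_cells(table, min_count=5.0):
    """List (value, class) cells whose count is below min_count."""
    return [(value, cls)
            for value, per_class in table.items()
            for cls, count in per_class.items()
            if count < min_count]

# Every class is data-starved for the 'long2' level, so its conditional
# probabilities come almost entirely from smoothing rather than from data.
print(sparse_cells(cap_ref_t))
```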

Related

Python 2.7: Reading a text file online to a string and printing output

I am reading data from this link: http://www.weerindelft.nl/clientraw.txt.
The main goal is to print out the temperature that http://www.weerindelft.nl displays. I have discovered that it's in that text file, so I only need to print out the right part of the file.
This is my code:
import socket
from decimal import Decimal
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("www.weerindelft.nl" , 80))
s.sendall("GET http://www.weerindelft.nl/clientraw.txt HTTP/1.0\n\n")
write = s.recv(1427247693)
variable_1 = str(write[311:])
integer = float(variable_1[46:50])
tim = round(integer,0)
print Decimal(tim)
f = open("output.txt", "w")
f.write(write)
f.close()
s.close()
This is my output:
HTTP/1.1 200 OK
Date: Wed, 04 Jan 2017 12:34:14 GMT
Server: Apache
Last-Modified: Wed, 04 Jan 2017 12:34:12 GMT
Vary: Accept-Encoding
Content-Type: text/plain
X-Varnish: 110069959 109349321
Age: 32
Via: 1.1 varnish (Varnish/5.0)
ETag: W/"b173bdaf-2fb-54544008156e2"
Accept-Ranges: bytes
Content-Length: 763
Connection: close
12345 7.0 7.8 318 5.4 85 1016.9 1.0 4.2 4.2 0.014 0.086 18.7 38 100.0 34 0.0 0 0 0.2 -100.0 255.0 -100.0 -100.0 -100.0 -100.0 -100 -100 -100 13 20 58 WeerinDelft-13:20:58 2 100 4 1 100 100 100 100 100 100 100 2.6 4.0 8.0 5.1 34 zonnig/Gestopt_met_regenen 0.2 4 4 4 7 5 5 8 6 6 5 6 6 4 4 4 4 5 6 9 8 30.4 3.0 949.9 4/1/2017 7.5 3.6 6.0 0.9 0.5 14 12 10 12 7 11 8 5 6 10 6.8 6.9 6.7 6.5 6.4 6.5 5.7 5.3 5.1 5.3 0.8 0.8 0.8 0.8 0.8 0.8 0.8 1.0 1.0 1.0 8.0 5.1 5.4 18.2 0 13:09:41 2017/04/01 326 522 91 -100.0 -100.0 5 0 0 0 0 102.0 18.9 18.7 4.7 1017.2 1014.5 24 12:40 10:35 6.1 0.8 6.2 1.5 15 2017 -13.9 -1 1 -1 341 336 336 309 331 358 336 318 310 318 10.0 255.0 7.5 4.4 51.97944 -4.34139 0.6 90 66 1.0 10:46 0.0 0.0 0.0 0.0 0.0 0.0 249.8 05:47 13:11 !!C10.37S13!!
I have used requests before and it worked like a charm. Unfortunately the assignment is to use the socket module. I think I know where the problem lies, but not how to solve it. I need to get rid of the HTTP headers and just read the file so I can print out the right part of it. At the moment the script only succeeds some of the time, because the text file shifts and my script is hard-coded to:
integer = float(variable_1[46:50])
This part of the text file/string.
I hope you guys understand what I mean. My apologies in advance if this post has any flaws; it's my first one and I am fairly new to programming.
Thanks in advance.
An HTTP response separates the headers and the body with a blank line.
So you can use
write.split('\r\n\r\n', 1)[1]
to get rid of the status line and headers and keep only the body of the response.
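To make this robust, you can also read in a loop (a single recv is not guaranteed to return the whole response) before splitting off the headers. A sketch along those lines; the field index 4 for the temperature is an assumption based on the sample output in the question, where 5.4 sits in that position:

```python
import socket

def response_body(raw):
    """Split a raw HTTP response at the first blank line and return the body."""
    return raw.split(b"\r\n\r\n", 1)[1].decode("latin-1")

def http_get_body(host, path, port=80, chunk=4096):
    """Fetch path from host over a raw socket and return only the body."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, port))
    request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)
    s.sendall(request.encode("ascii"))
    chunks = []
    while True:                      # recv until the server closes the connection
        data = s.recv(chunk)
        if not data:
            break
        chunks.append(data)
    s.close()
    return response_body(b"".join(chunks))

def temperature_from_clientraw(body, field=4):
    """Pick one whitespace-separated field; index 4 is assumed from the sample."""
    return float(body.split()[field])

# Usage (needs network access):
# body = http_get_body("www.weerindelft.nl", "/clientraw.txt")
# print(round(temperature_from_clientraw(body)))
```

Splitting the body on whitespace instead of slicing at fixed offsets is what makes this immune to the shifting positions the question describes.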

fetching data from the web page using DataFrame

I am trying to scrape time series data into a pandas DataFrame with Python 2.7 from the web page (http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm). Could somebody please help me write the code? Thanks!
I tried my code as follows:
import urllib
import pandas as pd
html = urllib.urlopen("http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm")
text = html.read()
df=pd.DataFrame(index=datum, columns=['m_ta','m_tax','m_taxd', 'm_tan','m_tand'])
But it doesn't produce anything. I want to display the table as it is.
You can use BeautifulSoup to parse all the font tags, then split column a, set_index on column idx, and rename_axis to None to remove the index name:
import pandas as pd
import urllib
from bs4 import BeautifulSoup
html = urllib.urlopen("http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm")
soup = BeautifulSoup(html, 'html.parser')
#print soup
fontTags = soup.findAll('font')
#print fontTags
#get text from tags fonts
li = [x.text for x in soup.findAll('font')]
#remove first 13 tags, before not contain necessary data
df = pd.DataFrame(li[13:], columns=['a'])
#split data by arbitrary whitespace
df = df.a.str.split(r'\s+', expand=True)
#set column names
df.columns = ['idx','m_ta','m_tax','m_taxd', 'm_tan','m_tand']
#convert column idx to period
df['idx'] = pd.to_datetime(df['idx']).dt.to_period('M')
#convert columns to datetime
df['m_taxd'] = pd.to_datetime(df['m_taxd'])
df['m_tand'] = pd.to_datetime(df['m_tand'])
#set column idx to index, remove index name
df = df.set_index('idx').rename_axis(None)
print df
m_ta m_tax m_taxd m_tan m_tand
1901-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10
1901-02 -2.1 3.5 1901-02-06 -7.9 1901-02-15
1901-03 5.8 13.5 1901-03-20 0.6 1901-03-01
1901-04 11.6 18.2 1901-04-10 7.4 1901-04-23
1901-05 16.8 22.5 1901-05-31 12.2 1901-05-05
1901-06 21.0 24.8 1901-06-03 14.6 1901-06-17
1901-07 22.4 27.4 1901-07-30 16.9 1901-07-04
1901-08 20.7 25.9 1901-08-01 14.7 1901-08-29
1901-09 15.9 19.9 1901-09-01 11.8 1901-09-09
1901-10 12.6 17.9 1901-10-04 8.3 1901-10-31
1901-11 4.7 11.1 1901-11-14 -0.2 1901-11-26
1901-12 4.2 8.4 1901-12-22 -1.4 1901-12-07
1902-01 3.4 7.5 1902-01-25 -2.2 1902-01-15
1902-02 2.8 6.6 1902-02-09 -2.8 1902-02-06
1902-03 5.3 13.3 1902-03-22 -3.5 1902-03-13
1902-04 10.5 15.8 1902-04-21 6.1 1902-04-08
1902-05 12.5 20.6 1902-05-31 8.5 1902-05-10
1902-06 18.5 23.8 1902-06-30 14.4 1902-06-19
1902-07 20.2 25.2 1902-07-01 15.5 1902-07-03
1902-08 21.1 25.4 1902-08-07 14.7 1902-08-13
1902-09 16.1 23.8 1902-09-05 9.5 1902-09-24
1902-10 10.8 15.4 1902-10-12 4.9 1902-10-25
1902-11 2.4 9.1 1902-11-01 -4.2 1902-11-18
1902-12 -3.1 7.2 1902-12-27 -17.6 1902-12-15
1903-01 -0.5 8.3 1903-01-11 -11.5 1903-01-23
1903-02 4.6 13.4 1903-02-23 -2.7 1903-02-17
1903-03 9.0 16.1 1903-03-28 4.9 1903-03-09
1903-04 9.0 16.5 1903-04-29 2.6 1903-04-19
1903-05 16.4 21.2 1903-05-03 11.3 1903-05-19
1903-06 19.0 23.1 1903-06-03 15.6 1903-06-07
... ... ... ... ... ...
1998-07 22.5 30.7 1998-07-23 15.0 1998-07-09
1998-08 22.3 30.5 1998-08-03 14.8 1998-08-29
1998-09 16.0 21.0 1998-09-12 10.4 1998-09-14
1998-10 11.9 17.2 1998-10-07 8.2 1998-10-27
1998-11 3.8 8.4 1998-11-05 -1.6 1998-11-21
1998-12 -1.6 6.2 1998-12-14 -8.2 1998-12-26
1999-01 0.6 4.7 1999-01-15 -4.8 1999-01-31
1999-02 1.5 6.9 1999-02-05 -4.8 1999-02-01
1999-03 8.2 15.5 1999-03-31 3.0 1999-03-16
1999-04 13.1 17.1 1999-04-16 6.1 1999-04-18
1999-05 17.2 25.2 1999-05-31 11.1 1999-05-06
1999-06 19.8 24.4 1999-06-07 12.2 1999-06-22
1999-07 22.3 28.0 1999-07-06 16.3 1999-07-23
1999-08 20.6 26.7 1999-08-09 17.3 1999-08-23
1999-09 19.3 22.9 1999-09-26 15.0 1999-09-02
1999-10 11.5 19.0 1999-10-03 5.7 1999-10-18
1999-11 3.9 12.6 1999-11-04 -2.2 1999-11-21
1999-12 1.3 6.4 1999-12-13 -8.1 1999-12-25
2000-01 -0.7 8.7 2000-01-31 -6.6 2000-01-25
2000-02 4.5 10.2 2000-02-01 -0.1 2000-02-23
2000-03 6.7 11.6 2000-03-09 0.6 2000-03-17
2000-04 14.8 22.1 2000-04-21 5.8 2000-04-09
2000-05 18.7 23.9 2000-05-27 12.3 2000-05-22
2000-06 21.9 29.3 2000-06-14 15.4 2000-06-17
2000-07 20.3 26.6 2000-07-03 14.0 2000-07-16
2000-08 23.8 29.7 2000-08-20 18.5 2000-08-31
2000-09 16.1 21.5 2000-09-14 12.7 2000-09-24
2000-10 14.1 18.7 2000-10-04 8.0 2000-10-23
2000-11 9.0 14.9 2000-11-15 3.7 2000-11-30
2000-12 3.0 9.4 2000-12-14 -6.8 2000-12-24
[1200 rows x 5 columns]

pandas selection from a specific year

I am trying to select the following data using pandas for Python 2.7 from the web page (http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm), starting from the year 1991 up to 2000. Could somebody please help me write the code? Thanks!
datum m_ta m_tax m_taxd m_tan m_tand
------- ----- ----- ---------- ----- ----------
1901-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10
1901-02 -2.1 3.5 1901-02-06 -7.9 1901-02-15
1901-03 5.8 13.5 1901-03-20 0.6 1901-03-01
1901-04 11.6 18.2 1901-04-10 7.4 1901-04-23
1901-05 16.8 22.5 1901-05-31 12.2 1901-05-05
1901-06 21.0 24.8 1901-06-03 14.6 1901-06-17
1901-07 22.4 27.4 1901-07-30 16.9 1901-07-04
1901-08 20.7 25.9 1901-08-01 14.7 1901-08-29
1901-09 15.9 19.9 1901-09-01 11.8 1901-09-09
1901-10 12.6 17.9 1901-10-04 8.3 1901-10-31
1901-11 4.7 11.1 1901-11-14 -0.2 1901-11-26
1901-12 4.2 8.4 1901-12-22 -1.4 1901-12-07
1902-01 3.4 7.5 1902-01-25 -2.2 1902-01-15
1902-02 2.8 6.6 1902-02-09 -2.8 1902-02-06
1902-03 5.3 13.3 1902-03-22 -3.5 1902-03-13
1902-04 10.5 15.8 1902-04-21 6.1 1902-04-08
1902-05 12.5 20.6 1902-05-31 8.5 1902-05-10
1902-06 18.5 23.8 1902-06-30 14.4 1902-06-19
....
You can use the .dt.year accessor with boolean indexing to select data by the column datum:
#convert column datum to period
df['datum'] = pd.to_datetime(df['datum']).dt.to_period('M')
#convert columns to datetime
df['m_taxd'] = pd.to_datetime(df['m_taxd'])
df['m_tand'] = pd.to_datetime(df['m_tand'])
print df.datum.dt.year
0 1901
1 1901
2 1901
3 1901
4 1901
5 1901
6 1901
7 1901
8 1901
9 1901
10 1901
11 1901
12 1902
13 1902
14 1902
15 1902
16 1902
17 1902
Name: datum, dtype: int64
#change 1901 to 2000
print df[df.datum.dt.year <= 1901]
datum m_ta m_tax m_taxd m_tan m_tand
0 1901-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10
1 1901-02 -2.1 3.5 1901-02-06 -7.9 1901-02-15
2 1901-03 5.8 13.5 1901-03-20 0.6 1901-03-01
3 1901-04 11.6 18.2 1901-04-10 7.4 1901-04-23
4 1901-05 16.8 22.5 1901-05-31 12.2 1901-05-05
5 1901-06 21.0 24.8 1901-06-03 14.6 1901-06-17
6 1901-07 22.4 27.4 1901-07-30 16.9 1901-07-04
7 1901-08 20.7 25.9 1901-08-01 14.7 1901-08-29
8 1901-09 15.9 19.9 1901-09-01 11.8 1901-09-09
9 1901-10 12.6 17.9 1901-10-04 8.3 1901-10-31
10 1901-11 4.7 11.1 1901-11-14 -0.2 1901-11-26
11 1901-12 4.2 8.4 1901-12-22 -1.4 1901-12-07
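For the 1991-to-2000 range from the question, the same idea becomes a combined boolean mask. A minimal sketch, with a small synthetic frame standing in for the scraped data:

```python
import pandas as pd

# Synthetic stand-in: one row just outside each end of the range.
df = pd.DataFrame({
    'datum': pd.to_datetime(['1990-12-01', '1991-01-01', '2000-12-01', '2001-01-01']),
    'm_ta': [1.0, 2.0, 3.0, 4.0],
})

# Combine the two year conditions into one boolean mask.
mask = (df['datum'].dt.year >= 1991) & (df['datum'].dt.year <= 2000)
selected = df[mask]
```

Each comparison must be parenthesized, because & binds more tightly than >= in Python.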

Subtract values in a data frame ignoring specific keys

I have two data frames as such:
df1 = pd.DataFrame({ 'pressure' : [42,42,42,42,42,42,42,36,36,36,36,36,36,36],
'load' : [350,350,350,350,350,350,350,700,700,700,700,700,700,700],
'speed' : [70,60,50,40,30,20,10,70,60,50,40,30,20,10],
'lforce' : [3.6,3.5,3.3,3.2,3.1,3.1,2.9,7.7,7.3,7.0,6.8,6.5,6.4,6.1],
'rforce' : [3.4,3.2,3.1,3.0,2.9,2.8,2.7,7.6,7.2,6.9,6.6,6.3,6.2,5.9]
}).set_index(['pressure','load','speed'])
df2 = pd.DataFrame({ 'pressure' : [47,47,47,47,47,47,47],
'load' : [20,20,20,20,20,20,20],
'speed' : [70,60,50,40,30,20,10],
'lforce' : [2.5,2.1,1.9,1.7,1.5,1.3,1.2],
'rforce' : [2.8,2.6,2.4,2.2,2.0,1.8,1.7]
}).set_index(['pressure','load','speed'])
Formatted:
>>> df1
lforce rforce
pressure load speed
42 350 70 3.6 3.4
60 3.5 3.2
50 3.3 3.1
40 3.2 3.0
30 3.1 2.9
20 3.1 2.8
10 2.9 2.7
36 700 70 7.7 7.6
60 7.3 7.2
50 7.0 6.9
40 6.8 6.6
30 6.5 6.3
20 6.4 6.2
10 6.1 5.9
>>> df2
lforce rforce
pressure load speed
47 20 70 2.5 2.8
60 2.1 2.6
50 1.9 2.4
40 1.7 2.2
30 1.5 2.0
20 1.3 1.8
10 1.2 1.7
I would like to subtract df2 from df1 on the lforce and rforce columns for each speed to get the resulting data frame df3.
My problem is that I need to ignore the pressure and load in df2 during the subtraction, but retain the originals from df1.
Desired result:
>>> df3
lforce rforce
pressure load speed
42 350 70 1.1 0.6
60 1.3 0.6
50 1.4 0.7
40 1.5 0.8
30 1.6 0.9
20 1.7 1.0
10 1.7 1.0
36 700 70 5.2 4.8
60 5.1 4.6
50 5.1 4.4
40 5.1 4.4
30 5.0 4.3
20 5.0 4.3
10 4.9 4.2
#drop df2's pressure/load index levels, then subtract aligned on speed (level 2)
df1.sub(df2.reset_index([0, 1], drop=True), level=2)
output:
lforce rforce
pressure load speed
42 350 70 1.1 0.6
60 1.4 0.6
50 1.4 0.7
40 1.5 0.8
30 1.6 0.9
20 1.8 1.0
10 1.7 1.0
36 700 70 5.2 4.8
60 5.2 4.6
50 5.1 4.5
40 5.1 4.4
30 5.0 4.3
20 5.1 4.4
10 4.9 4.2
Maybe something like this:
>>> df3 = df1.reset_index(level=[0,1])
>>> df4 = df2.reset_index(level=[0,1])
>>> df4['pressure'] = 0
>>> df4['load'] = 0
>>> df3 - df4
pressure load lforce rforce
speed
10 42 350 1.7 1.0
10 36 700 4.9 4.2
20 42 350 1.8 1.0
20 36 700 5.1 4.4
30 42 350 1.6 0.9
30 36 700 5.0 4.3
40 42 350 1.5 0.8
40 36 700 5.1 4.4
50 42 350 1.4 0.7
50 36 700 5.1 4.5
60 42 350 1.4 0.6
60 36 700 5.2 4.6
70 42 350 1.1 0.6
70 36 700 5.2 4.8
Now you just have to move pressure and load back into the index.
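Moving them back can be done with a reset_index/set_index round trip. A self-contained sketch, using shortened versions of df1 and df2 (two speeds only):

```python
import pandas as pd

# Compact stand-ins for df1 and df2 from the question.
df1 = pd.DataFrame({'pressure': [42, 42], 'load': [350, 350],
                    'speed': [70, 60],
                    'lforce': [3.6, 3.5], 'rforce': [3.4, 3.2]}
                   ).set_index(['pressure', 'load', 'speed'])
df2 = pd.DataFrame({'pressure': [47, 47], 'load': [20, 20],
                    'speed': [70, 60],
                    'lforce': [2.5, 2.1], 'rforce': [2.8, 2.6]}
                   ).set_index(['pressure', 'load', 'speed'])

# Zero out pressure/load in df2 as above and subtract aligned on speed...
d1 = df1.reset_index(level=[0, 1])
d2 = df2.reset_index(level=[0, 1])
d2['pressure'] = 0
d2['load'] = 0
res = d1 - d2

# ...then move pressure and load back into the index.
df3 = res.reset_index().set_index(['pressure', 'load', 'speed'])
```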
Is this what you're looking for?
d1 = df1.reset_index(['pressure','load'])
d2 = df2.reset_index(['pressure','load'])
r0 = d1.merge(d2, left_index=True, right_index=True)
r1 = r0.set_index(['pressure_x','load_x'], drop=False)
r1['lforce'] = r1.lforce_x - r1.lforce_y
r1['rforce'] = r1.rforce_x - r1.rforce_y
df3 = r1[['lforce','rforce']]
df3

Why were the ranks different?

One:
data have;
input x1 x2;
diff=x1-x2;
a_diff= round(abs(diff), .01);
* a_diff=abs(diff);
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
proc rank data =have out =have_r;
where diff;
var a_diff ;
ranks a_diff_r;
run;
proc print data =have_r;run;
Results:
Obs x1 x2 diff a_diff a_diff_r
1 50.7 60.0 -9.3 9.3 9.0
2 28.8 30.0 -1.2 1.2 4.0
3 46.2 43.2 3.0 3.0 7.5
4 1.2 2.2 -1.0 1.0 3.0
5 25.5 27.5 -2.0 2.0 5.5
6 2.9 4.9 -2.0 2.0 5.5
7 5.4 5.0 0.4 0.4 1.0
8 3.8 3.2 0.6 0.6 2.0
9 1.0 4.0 -3.0 3.0 7.5
Two:
data have;
input x1 x2;
diff=x1-x2;
a_diff=abs(diff);
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
proc rank data =have out =have_r;
where diff;
var a_diff ;
ranks a_diff_r;
run;
proc print data =have_r;run;
results:
Obs x1 x2 diff a_diff a_diff_r
1 50.7 60.0 -9.3 9.3 9.0
2 28.8 30.0 -1.2 1.2 4.0
3 46.2 43.2 3.0 3.0 7.5
4 1.2 2.2 -1.0 1.0 3.0
5 25.5 27.5 -2.0 2.0 5.0
6 2.9 4.9 -2.0 2.0 6.0
7 5.4 5.0 0.4 0.4 1.0
8 3.8 3.2 0.6 0.6 2.0
9 1.0 4.0 -3.0 3.0 7.5
Please look at Obs 3, 9, 5 and 6: why were the ranks different? Thank you!
Run the code below and you'll see that the values are actually different. That's because of inaccuracies in numeric storage. It is similar to how 1/3 is not representable in decimal notation (0.333333333333333 etc.), so 1-(1/3)-(1/3)-(1/3) is not equal to zero if you use, say, ten digits to store each intermediate result (it equals 0.000000001 instead); any computer system has trouble with certain numbers that appear to store nicely in decimal (base 10) but do not in binary.
The solution here is basically to round, as you are doing, or to fuzz the result, which amounts to the same thing (it ignores differences smaller than 1x10^-12).
data have;
input x1 x2;
diff=x1-x2;
a_diff=abs(diff);
put a_diff= hex16.;
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
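The same binary-storage effect is easy to reproduce in any IEEE-754 environment. This sketch is Python rather than SAS, but the arithmetic is identical, and it shows why Obs 5 and 6 stop being tied without rounding:

```python
# Obs 5: 25.5 - 27.5, Obs 6: 2.9 - 4.9 -- both look like a difference of 2.0.
d5 = abs(25.5 - 27.5)   # 25.5 and 27.5 are exact binary fractions
d6 = abs(2.9 - 4.9)     # 2.9 and 4.9 are not exactly representable

print(repr(d5))         # exactly 2.0
print(repr(d6))         # slightly more than 2.0, so it ranks above d5

# Rounding (or SAS's fuzzing) restores the tie:
r5 = round(d5, 2)
r6 = round(d6, 2)
```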