weka doesnot show arff feature values - weka

I have an .arff file with 5 features named:
JaccardCoefficient,adamicadar,commonneighbors,katz,rootedpagerank
I open the file in weka but it does not show katz values. It shows the max:0 min:0 mean:0 stddev:0
Note that the katz values are so small like 0.0000312. What should I do to see katz values?

I have had a look at your sample file in Weka and have found the zero values that were reported. The data appears to be visually represented correctly, but the precision of the attributes appear to be limited to three decimal places. For this reason, the values are too small to be represented in the attribute list.
One way that you could change this for use with Weka's prediction models is to pre-process the data to a more suitable range. This could be done using normalisation or other rescaling techniques as required for your purposes. In the image below, I have adjusted the data by simply multiplying the attribute by 100, which brought the attribute summaries into a visible range on the screen:
Hope this helps!

Related

number expected.read Token[2015-02-02 14:19:00] weka project

i hope you all are doing well!
I have a project at data mining class.Τhe data consists of numerical data and many algorithms do not work.I have to do this:"you should compare the performance of the following categorization algorithms:
RandomForest, C4.5, JRip, Bayesian Network. Where necessary use them
Weka filters to replace or create values ​​for some properties
new properties. For comparison, adopt the Train / Test Percentage Split type with
percentage for training data equal to 80%.Describe your observations by giving tables with the results and
presenting the performance of the algorithms. Repeat the experiment by putting
percentage for training data equal to 70% and 50% presenting the results."
So my first try was to transform the data inside weka with preprocessing data numeric to nominal but a friend of mine suggest that is statistical wrong.So my second try was to use excel to transform all data even the date to numeric,remove the first row(id) and pass it to the weka(I leave double quotes only at date)
.But i have the error that i mention on the title.The dataset is:https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
Thank you for the time.
If you define date-like data as a DATE attribute in the ARFF file (using the right format for parsing the strings), then WEKA will treat it as a numeric attribute internally (Java epoch, ie milli-seconds since 1970-01-01).
Instead of using NumericToNominal, use either the supervised or unsupervised Discretize filter if the algorithm cannot handle numeric attributes.
Converting nominal attributes to numeric ones is not a recommended approach. Instead, try the supervised or unsupervised NominalToBinary filter.

Google vision fails to identify numbers in a table

I am aiming to extract a table of text & numbers from a document using Google's Vision API. The results are far from satisfactory - Vision seems to completely miss the contents of 2 columns in my table.
Recognition rate improves when I manually erase the column border but I cannot pre-process each file which I intend to process.
Cropping the column text, MOVing the column text to a new location dont seem to make a difference.
Increasing the brightness/contrast of the document seems to help a little bit but not enough to be satisfactory.
I'm using the "Try-It" web interface at cloud.google.com/vision/docs/drag-and-drop to test all my experiments... It mimics the results of running my code on the document.
I'm uploading JPG images, created from scanned PDF originals (converted in photoshop).
I dont have any code since the problem shows up just using the web-tool.
many of the numbers are single digits but many are not.
The numbers missed are 1,3,4,8,500,1,16,100,10
Other columns (which ARE read ok) contain decimal numbers
Perhaps there are some tricks/tips that I've not found that I can use?

Can I make Amazon SageMaker deliver a recommendation based on historic data instead of a probability score?

We have a huge set of data in CSV format, containing a few numeric elements, like this:
Year,BinaryDigit,NumberToPredict,JustANumber, ...other stuff
1954,1,762,16, ...other stuff
1965,0,142,16, ...other stuff
1977,1,172,16, ...other stuff
The thing here is that there is a strong correlation between the third column and the columns before that. So I have pre-processed the data and it's now available in a format I think is perfect:
1954,1,762
1965,0,142
1977,1,172
What I want is a predicition on the value in the third column, using the first two as input. So in the case above, I want the input 1965,0 to return 142. In real life this file is thousands of rows, but since there's a pattern, I'd like to retrieve the most possible value.
So far I've setup a train job on the CSV file using the Linear Learner algorithm, with the following settings:
label_size = 1
feature_dim = 2
predictor_type = regression
I've also created a model from it, and setup an endpoint. When I invoke it, I get a score in return.
response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
ContentType='text/csv',
Body=payload)
My goal here is to get the third column prediction instead. How can I achieve that? I have read a lot of the documentation regarding this, but since I'm not very familiar with AWS, I might as well have used the wrong algorithms for what I am trying to do.
(Please feel free to edit this question to better suit AWS terminology)
For csv input, the label should be in the first column, as mentioned here: So you should preprocess your data to put the label (the column you want to predict) on the left.
Next, you need to decide whether this is a regression problem or a classification problem.
If you want to predict a number that's as close as possible to the true number, that's regression. For example, the truth might be 4, and the model might predict 4.15. If you need an integer prediction, you could round the model's output.
If you want the prediction to be one of a few categories, then you have a classification problem. For example, we might encode 'North America' = 0, 'Europe' = 1, 'Africa' = 2, and so on. In this case, a fractional prediction wouldn't make sense.
For regression, use 'predictor_type' = 'regressor' and for classification with more than 2 classes, use 'predictor_type' = 'multiclass_classifier' as documented here.
The output of regression will contain only a 'score' field, which is the model's prediction. The output of multiclass classification will contain a 'predicted_label' field, which is the model's prediction, as well as a 'score' field, which is a vector of probabilities representing the model's confidence. The index with the highest probability will be the one that's predicted as the 'predicted_label'. The output formats are documented here.
predictor_type = regression is not able to return the predicted label, according to
the linear-learner documentation:
For inference, the linear learner algorithm supports the application/json, application/x-recordio-protobuf, and text/csv formats. For binary classification models, it returns both the score and the predicted label. For regression, it returns only the score.
For more information on input and output file formats, see Linear
Learner Response Formats for inference, and the Linear Learner Sample
Notebooks.

Weka not display Correctly classified instances as output

I am new on weka. I have a dataset in csv with 5000 samples. here 20 samples of it; when I upload this dataset into weka, it looks ok, but when I run knn algorithm it gives a result that is not supposed to give. here is the sample data.
a,b,c,d
74,85,123,1
73,84,122,1
72,83,121,1
70,81,119,1
70,81,119,1
69,80,118,1
70,81,119,1
70,81,119,1
76,87,125,1
76,87,125,1
82,92,146,2
74,86,140,2
68,80,134,2
64,76,130,2
64,75,132,2
83,96,152,2
72,85,141,2
71,83,141,2
69,81,139,2
65,79,137,2
here is the result :
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.6148
Mean absolute error 0.2442
Root mean squared error 0.4004
Relative absolute error 50.2313 %
Root relative squared error 81.2078 %
Total Number of Instances 5000
it is supposed to give this kind of result like:
Correctly classified instances: 69 92%
Incorrectly classified instances: 6 8%
What should be the problem? What am I missing? I did this in all other algorithms but they all give the same output. I have used sample weka datasets, they all work as expected.
The IBk algorithm can be used for regression (predicting the value of a numeric response for each instance) as well as for classification (predicting which class each instance belongs to).
It looks like all the values of the class attribute in your dataset (column d in your CSV) are numbers. When you load this data into Weka, Weka therefore guesses that this attribute should be treated as a numeric one, not a nominal one. You can tell this has happened because the histogram in the Preprocess tab looks something like this:
instead of like this (coloured by class):
The result you're seeing when you run IBk is the result of a regression fit (predicting a numeric value of column d for each instance) instead of a classification (selecting the most likely nominal value of column d for each instance).
To get the result you want, you need to tell Weka to treat this attribute as nominal. When you load the csv file in the Preprocess tab, check Invoke options dialog in the file dialog window. Then when you click Open, you'll get this window:
The field nominalAttributes is where you can give Weka a list of which attributes are nominal ones even if they look numeric. Entering 4 here will specify that the fourth attribute (column) in the input is a nominal attribute. Now IBk should behave as you expect.
You could also do this by applying the NumericToNominal unsupervised attribute filter to the already loaded data, again specifying attribute 4 otherwise the filter will apply to all the attributes.
The ARFF format used for the Weka sample datasets includes a specification of which attributes are which type. After you've imported (or filtered) your dataset as above, you can save it as ARFF and you'll then be able to reload it without having to go through the same process.

predicting a numerical value using KNN in weka

I have two data sets, one for training and one for testing.
I am going to predict the values of a column with numerical type in test data set. In order to predict the value of an instance, I have to find the k nearest neighbors of that instance in training data set, and calculate the average of values. (waiting also can be used).
For example:
column0 column1 column2
......a..................b....................10
......a..................b....................12
......c..................d....................16
......a..................b....................?
I need a method of data mining to give me the result = (10+12)/2 = 11
Which method should I use to get such a result?
And do you know any good document which explains how to use that method?
KNN in Weka is implemented as IBk. It is capable of predicting numerical and nominal values.
If you are using the Weka Explorer (GUI) you can find it by looking for the "Choose" button under the Classify tab. Once there navigate the folders:
classifiers -> lazy -> IBk
Once you select IBk, click on the box immediately to the right of the button. This will open up a large number of options. If you then click on the button "More" in the options window, you will see all of the options explained. If you need more of an explanation of the classifier they even list the academic paper that the classifier is based on. You can do this for all of the classifiers to obtain additional information.