Convert NA values to ? automatically while loading - weka

Is there a way to automatically convert NA values to ? in weka while loading .csv files?
Or do we have to use some other script/program to either replace them with ? or a blank space before loading into weka.
Any help or suggestions are welcome. Thanks

Unfortunately I do not believe Weka has a way to do this conversion. This is the case because Weka's native format is .arff files. In .arff files, missing values are denoted with a "?". When a .csv file is loaded, it expects missing values to also be denoted by "?".
Depending on your method of using Weka I suggest:
For the Weka GUI, use "find and replace" in a simple text editor to change "NA" to "?" before loading the .csv into Weka.
For the Weka Java API, write a method to preprocess your ".csv" file before handing it over to the Weka .csv loader.

Related

How to open a tab-delimited file in Weka

When I try to open a tab-delimited file in Weka it says: "file format is not recognized". In the subsequent dialog box it shows weka.core.converters.CSVLoader and says "Reads a source that is in comma separated or tab separate format." When I click the OK button, it throws an error saying "wrong number of values. Read 11, expected 10 line 4." I verified the same file in Excel that the line had 10 fields.
Could someone advise a workaround?
The data file cannot be converted to CSV format because some of the fields contain a comma.
When installing the unofficial Weka package common-csv-weka-package, you can load tab-delimited CSV files using the CommonCSVLoader loader. Simply change the loader's format from DEFAULT to TDF (-F command-line option).
I had same problem. So far the best solution I found is using R to convert a tabular data file into arff. Google two keywords "import data to R" and "export R data to weka arff". My second choice is using JMP or SAS to open a csv or Excel workbook and then export as CSV.
I found a solution: for Windows 10, install the R language package from this url:
https://cran.r-project.org/web/packages/rio/index.html
install RStudio from:
https://www.rstudio.com/products/rstudio/download/#download
from the prompt in RStudio follow the Import, Export, and Convert Data Files instructions here:
https://cran.microsoft.com/snapshot/2015-11-15/web/packages/rio/vignettes/rio.html
works a treat, converted my .tsv files to Weka arff format no problem. The only thing I haven't done is test the arff files in Weka yet (and compare with Python sklearn results), as I'm hoping there isn't a problem with commas embedded in the text message bodies. Scikit-Learn and TfidfVectorizer has no problems with embedded commas in a tsv file!

Converting source code directory into ARFF (WEKA)

Currently, i am working on a project using WEKA. Being naive and newbie in it, there are many things which i am not familair with. In my last project I used text files as a classification using WEKA. I applied the TextDirectoryLoader convertor to convert a directory containing text files as mentioned on this URL Text categorization with WEKA. Now I want to use the same stretagy for converting a directory containing source code (instead of text). For example, I have a Jedit source file containing Java source code. I am trying to convert it to ARFF file so that i can apply classifiers or other functions present in WEKA on that ARFF file for data mining purposes. I have also tried a test file given on following URL ARFF files from Text Collections. I believe i can use the same file as an example to convert source code files. However, I do not know what attributes should I define in a FastVector? and What format should the data be in (String or numeric). And what other sections should an ARFF file may have?
As in the example the authors have defined following attributes
FastVector atts = new FastVector(2);
atts.addElement(new Attribute("filename", (FastVector) null));
atts.addElement(new Attribute("contents", (FastVector) null));
I have tried to find some examples on Google but no success.
Could anyone here suggests me any solution or alternate to solve the above said problem? (Example code will be highly appreciated).
Or atleast could give me a short example which convertes a source code directory into an ARFF file. (If it is possible).
If not possible what could be the possible reason
Any alternate solution (except WEKA) where I can use the same set of functions on a source code.
It is not clear, what is your goal? Do you want to classify the source code files, or find the files which are contains any bug, or what?
As I imagine, you want to extract features from each source file, and represent it with an instance. Then you can apply any machine learning based algorithm.
Here, you can find a java example, how can you construct an arff file from java:
https://weka.wikispaces.com/Creating+an+ARFF+file
But, you have to define your task specific features and extract it from each source code files.

Create arff training and test files for weka

Good day.
An apology for my English but it's not my native language so they know apologize for any errors.
I have a text file with the data processing which want to get a .arff file, which is the file type using weka.
I do not want to generate a single file. I get 2 files, one for training the model (training) and another to test the model (test).
This is done directly weka eh applying a filter stringtowordtokenizer but the problem is that when you use the second file to test a mistake because it is not fair test a model that has words that should not be.
If someone helps me I would appreciate.
Thank you and best regards.

Issues when loading data with weka

I am trying to load some csv data in weka. Some gene expression feature for 12 patients. There are around 22,000 features. However, when I load the csv file, it says
not recognized as an "CSV data files' file
to my csv file.
I am wondering is it because of the size of the features or something else. I have checked the csv file and it is nicely comma separated. Any suggestions?
I would not encourage you to use CSV files in Weka. While it is entirely possible (http://weka.wikispaces.com/Can+I+use+CSV+files%3F) it leads to some severe drawbacks. Try to generate a ARFF file from your CSV instead.

how can i read datasets in Weka?

I want to use some of the datasets available at the website of the Weka to perform some
experiments with Neural Networks.
What do I have to do to read the data?
I downloaded the datasets and they were saved as .arff.txt so I deleted the extension of .txt to have only .arff. So I used this file as an ipnut but an error occurs.
Which is the right way to read data?
Do I have to write code?
Please help me.
Thank you
I'm using Weka 3.6.6 and coc81.arff opens just fine. You are using Weka 3.7.x, which is the development branch of Weka. I suggest that you download 3.6.6 or 3.6.7 (the latest stable release) and try to open the file again.
There is also another simple throw...
open your dataset file in excel in my case MS Excel2010, format fields intype.
and save as 'csv',
then reload that csv file in the weka explorer and save on the local drive as arff format.
may be this help.