Issues when loading data with Weka

I am trying to load some CSV data into Weka: gene expression features for 12 patients, around 22,000 features in total. However, when I load the CSV file, it says
not recognized as an 'CSV data files' file
I am wondering whether this is because of the number of features or something else. I have checked the CSV file and it is cleanly comma-separated. Any suggestions?

I would not encourage you to use CSV files in Weka. While it is entirely possible (http://weka.wikispaces.com/Can+I+use+CSV+files%3F), it has some severe drawbacks. Try generating an ARFF file from your CSV instead.
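If you want to script that conversion, a small Python sketch is enough for the simple case. This assumes an all-numeric CSV with a single header row (as in the gene-expression file described above); the function name csv_to_arff is just illustrative:

```python
import csv

def csv_to_arff(csv_path, arff_path, relation="genes"):
    """Convert a simple all-numeric CSV (header row + data rows) to ARFF."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    with open(arff_path, "w") as out:
        out.write(f"@relation {relation}\n\n")
        for name in header:
            out.write(f"@attribute {name} numeric\n")
        out.write("\n@data\n")
        for row in data:
            # ARFF denotes missing values with '?'
            out.write(",".join(v if v.strip() else "?" for v in row) + "\n")
```

Attribute names containing spaces or commas would need quoting in the @attribute lines, and nominal or string columns would need their own type declarations.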

Related

What would be a way to convert a csv file to json WITHOUT using avroSchema or ConvertRecord Processors in Apache NIFi?

So I had built a workflow in Apache NiFi that extracted email attachments and converted the CSV files into JSON files. I used InferAvroSchema and ConvertRecord to convert the CSV into JSON. Everything worked well until I got a CSV file that did not follow the Avro schema I had written. Now I need to find a way to convert CSV to JSON without using those two processors, since the CSV formatting will vary from time to time. The CSV format I am currently working with is linked below.
I have tried ExtractText, but I am having trouble writing the correct regex to extract the values that match their headers. I also tried AttributesToJSON, but it does not seem to read the desired attributes. I know I can specify which attributes to pull, but since the headers/values change constantly, I can't find a way to set it up dynamically. Current CSV format
If you are using NiFi 1.9.2+, you can use a CsvReader which automatically infers schema on a per-flowfile basis. As the JsonRecordSetWriter can use the embedded inferred schema to write out the JSON as well, you no longer need an explicit Avro schema to be pre-defined.
As long as all the lines of CSV in a single flowfile follow the same schema, you won't have any problems. If you can have different schemas in the same flowfile (which I suspect would cause many additional problems as well), you'll have to filter them first into separate flowfiles.
Have you tried writing a script and invoking it with the ExecuteStreamCommand processor?
And more specifically, are you talking about the headers being different? There are options in the record-based processors to treat the first line as a header.
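If you do go the ExecuteStreamCommand route, the header-driven conversion itself is tiny. Here is a Python sketch (csv_to_json is an illustrative name) that derives the field names from each file's own header row, so the layout can vary from file to file without a pre-defined schema:

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Convert CSV text to a JSON array of objects.
    The keys come from the header row, so varying layouts still convert."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader))
```

As with the CSVReader/JsonRecordSetWriter approach, this assumes all the rows in a single file follow the header's layout; mixed layouts in one file would need splitting first.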

How to load the training data to OpenCV from UCI?

I have a character/font dataset found in UCI repository:
https://archive.ics.uci.edu/ml/datasets/Character+Font+Images
Take any CSV file as an example, for instance 'AGENCY.csv'. I am struggling to load it into OpenCV using the C++ API. The structure of the dataset seems quite different from what is normally assumed by
cv::ml::TrainData::loadFromCSV
Any ideas on how to do this neatly, or do I need to pre-process the CSV files first?
You can try to load the CSV file with the legacy (OpenCV 2.x) ML API like this:
CvMLData data;
data.read_csv( filename );
For details on the OpenCV ML CSV loader, refer to this page:
http://www.opencv.org.cn/opencvdoc/2.3.1/html/modules/ml/doc/mldata.html
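One way to pre-process: since loadFromCSV expects numeric data (unless you declare categorical variables in its type specification), you can strip the string-valued columns first. A Python sketch, assuming the file has a header row and that non-numeric columns (such as the font-name column in the UCI files) should simply be dropped; keep_numeric_columns is an illustrative name:

```python
import csv

def keep_numeric_columns(in_path, out_path):
    """Copy a CSV, keeping only columns whose every value parses as a float.
    Empty cells count as non-numeric here; adapt if you need missing values."""
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]

    def is_numeric(col):
        try:
            for row in data:
                float(row[col])
            return True
        except ValueError:
            return False

    keep = [i for i in range(len(header)) if is_numeric(i)]
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([header[i] for i in keep])
        for row in data:
            writer.writerow([row[i] for i in keep])
```

Alternatively, keep the string columns and pass an explicit variable-type specification to loadFromCSV so they are treated as categorical.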

How to open a tab-delimited file in Weka

When I try to open a tab-delimited file in Weka, it says: "file format is not recognized". The subsequent dialog box shows weka.core.converters.CSVLoader and says "Reads a source that is in comma separated or tab separated format." When I click the OK button, it throws an error saying "wrong number of values. Read 11, expected 10 line 4." I verified the file in Excel, and that line does have 10 fields.
Could someone advise a workaround?
The data file cannot be converted to CSV format because some of the fields contain a comma.
When installing the unofficial Weka package common-csv-weka-package, you can load tab-delimited CSV files using the CommonCSVLoader loader. Simply change the loader's format from DEFAULT to TDF (-F command-line option).
I had the same problem. So far the best solution I have found is using R to convert a tabular data file into ARFF: Google the two phrases "import data to R" and "export R data to weka arff". My second choice is using JMP or SAS to open the file or an Excel workbook and then export it as CSV.
I found a solution: on Windows 10, install the R 'rio' package from this URL:
https://cran.r-project.org/web/packages/rio/index.html
install RStudio from:
https://www.rstudio.com/products/rstudio/download/#download
then, from the RStudio prompt, follow the Import, Export, and Convert Data Files instructions here:
https://cran.microsoft.com/snapshot/2015-11-15/web/packages/rio/vignettes/rio.html
It works a treat: it converted my .tsv files to Weka's ARFF format with no problem. The only thing I haven't done yet is test the ARFF files in Weka (and compare with Python sklearn results), as I'm hoping there isn't a problem with commas embedded in the text message bodies. Scikit-Learn's TfidfVectorizer has no problems with embedded commas in a TSV file!
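As for the embedded-comma worry: commas only cause trouble if the conversion fails to quote fields. Python's csv module quotes automatically, so a tab-to-comma conversion can be sketched in a few lines (tsv_to_csv is an illustrative name):

```python
import csv
import io

def tsv_to_csv(tsv_text):
    """Convert tab-delimited text to CSV. Fields containing commas are
    quoted automatically, so they survive Weka's CSVLoader intact."""
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        writer.writerow(row)
    return out.getvalue()
```

A field like "hello, world" comes out as "hello, world" in quotes, so the column count Weka reads stays correct.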

Converting source code directory into ARFF (WEKA)

Currently, I am working on a project using WEKA. Being a naive newbie with it, there are many things I am not familiar with. In my last project I classified text files using WEKA. I applied the TextDirectoryLoader converter to convert a directory containing text files, as described at Text categorization with WEKA. Now I want to use the same strategy to convert a directory containing source code (instead of plain text). For example, I have a jEdit source file containing Java source code. I am trying to convert it to an ARFF file so that I can apply classifiers or other WEKA functions to it for data mining purposes. I have also tried a test file given at ARFF files from Text Collections. I believe I can use the same file as an example to convert source code files. However, I do not know what attributes I should define in a FastVector, what format the data should be in (string or numeric), and what other sections an ARFF file may have.
As in the example, the authors defined the following attributes:
FastVector atts = new FastVector(2);
// passing a null FastVector of values declares a string attribute
atts.addElement(new Attribute("filename", (FastVector) null));
atts.addElement(new Attribute("contents", (FastVector) null));
I have tried to find some examples on Google, but with no success.
Could anyone here suggest a solution or an alternative to the problem above? (Example code would be highly appreciated.)
Or at least give me a short example which converts a source code directory into an ARFF file, if that is possible.
If it is not possible, what could the reason be?
Is there any alternative (other than WEKA) where I can apply the same set of functions to source code?
It is not clear what your goal is. Do you want to classify the source code files, find the files which contain a bug, or something else?
As I imagine it, you want to extract features from each source file and represent it as an instance. Then you can apply any machine-learning algorithm.
Here you can find a Java example of how to construct an ARFF file programmatically:
https://weka.wikispaces.com/Creating+an+ARFF+file
But you have to define your task-specific features and extract them from each source code file.
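To mirror what TextDirectoryLoader does, you can also generate the ARFF text directly: one instance per file, with the filename and the file contents as the two string attributes from the snippet in the question. A Python sketch (dir_to_arff is an illustrative name):

```python
import os

def dir_to_arff(src_dir, arff_path, relation="source-files"):
    """Write one ARFF instance per file in src_dir, with the filename
    and the file's full text as string attributes."""
    with open(arff_path, "w") as out:
        out.write(f"@relation {relation}\n\n")
        out.write("@attribute filename string\n")
        out.write("@attribute contents string\n\n@data\n")
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if not os.path.isfile(path):
                continue
            text = open(path, encoding="utf-8", errors="replace").read()
            # ARFF string values are single-quoted; escape backslashes,
            # quotes, and newlines so each instance stays on one line
            esc = (text.replace("\\", "\\\\")
                       .replace("'", "\\'")
                       .replace("\n", "\\n"))
            out.write(f"'{name}','{esc}'\n")
```

From there you could run StringToWordVector in Weka, or replace the raw contents with task-specific features (identifier counts, metrics, and so on) as the answer suggests.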

Convert NA values to ? automatically while loading

Is there a way to automatically convert NA values to ? in Weka while loading .csv files?
Or do we have to use some other script/program to replace them with ? or a blank before loading into Weka?
Any help or suggestions are welcome. Thanks
Unfortunately, I do not believe older versions of Weka have a way to do this conversion out of the box. Weka's native format is the .arff file, in which missing values are denoted by "?", and when a .csv file is loaded Weka expects missing values to be denoted by "?" as well. (Note that recent versions of Weka's CSVLoader do expose a missingValue option, -M on the command line, which lets you declare a different missing-value string such as "NA".)
Depending on your method of using Weka I suggest:
For the Weka GUI, use "find and replace" in a simple text editor to change "NA" to "?" before loading the .csv into Weka.
For the Weka Java API, write a method to preprocess your ".csv" file before handing it over to the Weka .csv loader.
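For the scripted route, a field-aware replacement is safer than a blind find-and-replace, which could mangle values that merely contain the letters "NA". A Python sketch (na_to_question is an illustrative name):

```python
import csv
import io

def na_to_question(csv_text):
    """Replace whole-field 'NA' values with '?', Weka's missing-value
    marker. Field-aware, so a value like 'BANANA' is left untouched."""
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(csv_text)):
        writer.writerow(["?" if v == "NA" else v for v in row])
    return out.getvalue()
```

Run the .csv through this before handing it to the Weka loader, and the NA cells arrive as proper missing values.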