Weka CSV to ARFF conversion: IllegalArgumentException: Attribute names are not unique

I have a CSV file downloaded from http://yann.lecun.com/exdb/mnist/index.html and I need to convert it to the ARFF file format.
I tried running
java weka.core.converters.CSVLoader /home/saket/Documents/Assignment/NIST7000 > /home/saket/Documents/Myfile.arff
but it gives the following error:
java.lang.IllegalArgumentException: Attribute names are not unique! Causes: '0' '0' '0' '0' '0' '0' '0'
Then I tried the Java code from http://weka.wikispaces.com/Converting+CSV+to+ARFF, but the same error came up.
Can someone please suggest what I am doing wrong?

There were no header fields in the CSV, so I wrote a script that added column0,column1,...,class as the first line of the CSV file.
Then I opened the generated CSV file in Weka.

I encountered the same exception but for a different reason. I had used "class" as an attribute name, but that word also appeared in my data as a string value (after @data), so Weka did not correctly separate the attributes from the data.
I solved it by simply renaming the "class" attribute to something else, such as "s_label".

It happens when the same attribute name is used for more than one column of the spreadsheet. Just rename the duplicated columns so that every name is unique; I changed my third column name, which was a duplicate. For a large dataset this can also be done with a script. This worked for me.

Related

Upload and parse file in Oracle APEX

I'm trying to find the best way to upload, parse, and work with a text file in Oracle APEX (current version 20.1). Business case: I must upload a text file; its first line will be saved to table A.
The remaining lines contain records (columns are pipe-delimited) that should be validated. Correct records should then be saved to table B, and records with errors should be saved to table C (an error log).
I tried to do something with the Data Loading wizard, but it doesn't fit my requirements.
Right now I have added a "File browse..." item to the page, and after the page is submitted I can find the file in APEX_APPLICATION_TEMP_FILES in blob_content.
Is there any option other than working with blob_content from APEX_APPLICATION_TEMP_FILES? I find that type of data difficult to work with.
The text file looks something like this:
2020-06-05 info: header line
2020-06-05|columnAValue|columnBValue|
2020-06-05|columnAValue||columnCValue
2020-06-05|columnAValue|columnBValue|columnCValue
Have a look at the APEX_DATA_PARSER.PARSE table function. It parses the CSV file and returns the values as rows and columns. It's described in more detail in this blog posting:
https://blogs.oracle.com/apex/super-easy-csv-xlsx-json-or-xml-parsing-about-the-apex_data_parser-package
Simply pass "file.csv" (literally) as the p_file_name argument. APEX_DATA_PARSER does not care about the "real" file name....
The function uses the file extension only to differentiate between delimited, XLSX, XML or JSON files. So simply pass in a static file name like "file.csv". That should be enough.
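As a rough sketch of how that might look for the file uploaded through the "File browse..." item (the page item :P1_FILE, the column aliases, and the use of p_csv_col_delimiter for the pipe-delimited layout are assumptions to adapt to your page and APEX version):
-- Parse the uploaded file straight from APEX_APPLICATION_TEMP_FILES;
-- :P1_FILE is the (hypothetical) "File browse..." page item.
select p.line_number,
       p.col001 as record_date,
       p.col002 as column_a,
       p.col003 as column_b,
       p.col004 as column_c
  from apex_application_temp_files f,
       table( apex_data_parser.parse(
                p_content           => f.blob_content,
                p_file_name         => 'file.csv',
                p_csv_col_delimiter => '|' ) ) p
 where f.name = :P1_FILE;
From there, line 1 can be inserted into table A and the remaining rows validated and routed to table B or table C.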

AWS Glue: ignoring spaces in JSON properties

I have a dataset of JSON files. Some entries in these JSONs have spaces in their property names, like:
{
'propertyOne': 'something',
'property Two': 'something'
}
I've had this dataset crawled by several different crawlers to try to get the schema I want. For some reason, on one of my crawls the spaces were removed, but when trying to replicate the process I cannot get the spaces to be removed, and when querying in Athena I get this error:
HIVE_METASTORE_ERROR: : expected at position x in 'some string' but ' ' found instead.
Position x is the position of the space between 'property' and 'Two' in the JSON entry.
I would just like to be able to exclude this field or have the space removed when crawled, but I'm not sure how. I can't change the JSON format. Any help is appreciated.
This is actually a bug in the AWS Glue JSON classifier: it doesn't play nicely with nested properties that have spaces in them. The syntax error is in the schema generated by the crawler, not in the JSON. It generates something like this:
struct<propertyOne:string, property Two:string>
The space in "property two" should have been escaped, by the crawler. At this point, generating the DDL for the table is also not working. We are also facing this issue and looking for workarounds
I believe your only option, in this case, would be to create your own custom JSON classifier to select only those attributes you want the Crawler to add to the Data Catalog.
I.e., if you want to retrieve only propertyOne, you can specify the JSONPath expression $.propertyOne.
Note also that your JSON should use double quotes; the single quotes could also be causing issues when the data is parsed.

Extract filename and store the name in a new column in a CSV file

I want to extract the filename and store it in one of the existing columns in the CSV file. How can I do this? Which processor should I use, and with what configuration?
E.g. I have a filename 'FE_CHRGRSIM_20171207150616_CustRec.csv' and I want to extract 'FE_CHRGRSIM_20171207150616' and store this value under an existing column in the same CSV file. Please help. TIA
Usually the "real" file name is available as an attribute on the flow file called "filename". You can use UpdateRecord with a Replacement Strategy of "Literal Value"; add a user-defined property called /filename and set the value to ${filename:substringBeforeLast('.')}. You'll need to make sure that the "filename" field is added to your schema (either by UpdateRecord or manually). If you won't know your CSV schema ahead of time you can use InferAvroSchema and it will try to figure it out.
If UpdateRecord and the schema stuff doesn't seem to be working for you, an alternative (since it's CSV) is to use ReplaceText, match the entire line, then replace with that value followed by ,${filename:substringBeforeLast('.')}. That should add the filename (with extension removed) as the last column in the outgoing CSV.

Split attribute labels with delimiter for processing

I opened a CSV file in Weka 3.8 and selected an attribute/column. The labels are delimited by a pipe character. There should be 23 distinct labels, but Weka displays 914, so it cannot visualize the attribute because there are too many values. Action is one label, Adventure is another, etc. Basically, there can be more than one label per row.
For processing (e.g. classification), how can I separate those values so Weka can read them?
This question is similar to an existing one, but that question asks about a date attribute (e.g. "dd-MM-yyyy HH:mm"), whereas this one asks about a character-separated value (e.g. "Action|Adventure|Drama").
Edit:
The data is taken from Kaggle.
Ah, I had run into this problem too.
Firstly, ensure that the Genres attribute is recognised as a String type. If you are only using the GUI, go to Open File... and open the file (I presume it's a .dat file; if you've renamed it to .csv, tick the check box that says "Invoke options dialog").
In the Generic Object Editor window, enter the index of the Genres attribute (here it's the last one) as a string attribute.
Doing that will make the attribute appear as a String type in the GUI.
Now choose the filter called StringToWordVector (weka.filters.unsupervised.attribute.StringToWordVector). In the editor window, find the Tokenizer entry, click on its field, and under delimiters remove the defaults and add the pipe character. You may optionally edit the attribute prefix field as well.
Hit Apply and the required genres will be added as numeric attributes, set to 0 where the genre was not present in the original string and 1 otherwise.
StringToWordVector is a pretty useful filter, and there's much more to it in the docs: http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html.

Redshift COPY command delimiter not found

I'm trying to load some text files into Redshift. They are tab-delimited, except that there is no delimiter after the final value in each row. That's causing a delimiter not found error. I only see a way to set the field delimiter in the COPY statement, not a way to set a row delimiter. Any ideas that don't involve processing all my files to add a tab to the end of each row?
Thanks
I don't think the problem is a missing <tab> at the end of lines. Are you sure that ALL lines have the correct number of fields?
Run the query:
select le.starttime, d.query, d.line_number, d.colname, d.value,
le.raw_line, le.err_reason
from stl_loaderror_detail d, stl_load_errors le
where d.query = le.query
order by le.starttime desc
limit 100
to get the full error report. It will show the file name with errors, the offending line number, and the error details.
This will help to find where the problem lies.
You can get the delimiter not found error if your row has fewer columns than expected. Some CSV generators may just output a single quote at the end if the last columns are null.
To solve this you can use the FILLRECORD option of the Redshift COPY command.
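A minimal sketch of what that might look like for a tab-delimited file (schema, table, bucket, and role names are hypothetical):
-- FILLRECORD fills the missing trailing columns with NULLs or empty strings
COPY my_schema.my_table
FROM 's3://my_bucket/my/folder/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my_redshift_role'
DELIMITER '\t'
FILLRECORD;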
From my understanding, the Delimiter not found error message may also be caused by not specifying the COPY command correctly, in particular by not specifying the data format parameters: https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
In my case I was trying to load Parquet data with this expression:
COPY my_schema.my_table
FROM 's3://my_bucket/my/folder/'
IAM_ROLE 'arn:aws:iam::my_role:role/my_redshift_role'
REGION 'my-region-1';
and I received the Delimiter not found error message when looking into the system table stl_load_errors. But specifying that I'm dealing with Parquet data, like this:
COPY my_schema.my_table
FROM 's3://my_bucket/my/folder/'
IAM_ROLE 'arn:aws:iam::my_role:role/my_redshift_role'
FORMAT AS PARQUET;
solved my problem and I was able to correctly load the data.
I know this was answered, but I just dealt with the same error and I had a simple solution, so I'll share it.
This error can also be solved by stating the specific columns of the table that are copied from the S3 files (if you know which columns are in the data on S3).
In my case the data had fewer columns than the table.
Madahava's answer with the 'FILLRECORD' option DID solve the issue for me, but then I noticed that a column that was supposed to be filled with default values remained null.
COPY <table> (col1, col2, col3) from 's3://somebucket/file' ...
This may not be directly related to the OP's question, but I received the same Delimiter not found error, and it was caused by newline characters within one of the fields.
For any field that you think may have newline characters you can remove them with:
replace(my_field, chr(10), '')
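For example, if the data comes out of another SQL table before being loaded into Redshift, a hedged sketch (table and column names are hypothetical; carriage returns, chr(13), can be handled the same way):
-- Strip line feeds from the suspect field before the data is exported
-- for the Redshift COPY.
SELECT replace(my_field, chr(10), '') AS my_field
FROM my_source_table;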
When you send fewer fields than the destination table expects, it will also throw this error.
I'm sure there are multiple scenarios that would return this error. I just came across one that I don't see mentioned in the other answers, while debugging someone else's code. The COPY had the EXPLICIT_IDS option listed, the table it was importing into had a column with a data type of identity(1,1), but the file being imported into Redshift did not have an ID field. It made sense for me to add the identity field to the file, but I imagine removing the EXPLICIT_IDS option would also have fixed the issue.
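A hedged sketch of that second option (all names are hypothetical): with an IDENTITY column on the table and no ID field in the file, leaving EXPLICIT_IDS out lets Redshift generate the values itself:
-- Table with an auto-generated identity column
CREATE TABLE my_schema.events (
    event_id INT IDENTITY(1,1),
    event_ts TIMESTAMP,
    payload  VARCHAR(256)
);
-- Without EXPLICIT_IDS, the file should contain only event_ts and payload;
-- Redshift generates event_id automatically.
COPY my_schema.events
FROM 's3://my_bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my_redshift_role'
DELIMITER '\t';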
So recently I came across this Delimiter not found error in Redshift SQL while loading data with the COPY command. In my case, the problem was the number of columns.
I had created a table with 20 columns, but I was loading a file with 21 columns.
I corrected it by making the table 21 columns wide, then re-loaded the data, and boom, it worked.
Hope this is helpful to those who are facing the same kind of problem.
Ta-da
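A minimal sketch of that kind of fix (table, column name, and type are hypothetical), adding the missing 21st column before re-running the COPY:
-- Bring the table up to the 21 columns present in the file
ALTER TABLE my_schema.my_table
ADD COLUMN extra_col VARCHAR(64);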
Sometimes this pops up when you don't specify the file type, for example CSV.
Ref: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html
copy "dev"."my"."table" from 's3://bucket/myfile_upload.csv' credentials 'aws_iam_role=arn:aws:iam::2112277888:role/RedshiftAccessRole' IGNOREHEADER 1 csv;