Hi, I'm trying to load a pipe-delimited file into Weka using the Java CSVLoader. It looks like CSVLoader only handles comma and tab delimiters. Is there a way I can change the delimiter on these loaders?
Has anyone loaded a pipe-separated file in Weka?
Thanks,
Amit
The new version does allow you to specify a delimiter or separator using the -F option. See: http://weka.sourceforge.net/doc.dev/weka/core/converters/CSVLoader.html
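If you are calling the loader from Java rather than the command line, a rough sketch along these lines should work (assuming a Weka release recent enough for CSVLoader to expose setFieldSeparator, the programmatic counterpart of -F; the file name is a placeholder):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class PipeDelimitedLoad {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setFieldSeparator("|");            // use pipe instead of the default comma
        loader.setSource(new File("data.psv"));   // placeholder file name
        Instances data = loader.getDataSet();
        System.out.println(data.numInstances() + " instances loaded");
    }
}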
It doesn't look like there are any options for setting a different delimiter. Could you just read the file first and replace the pipes with commas?
I have a CSV which has line breaks in one of the columns. I get the error "Delimiter not found".
If I replace the text so that it is continuous, without line breaks, then it works. But how do I deal with the line breaks?
My COPY command:
COPY cat_crt_test_scores
from 's3://rds-cat-crt-test-score-table/checkcsv.csv'
iam_role 'arn:aws:iam::423639311527:role/RedshiftS3Access'
explicit_ids
delimiter '|'
TIMEFORMAT 'auto'
ESCAPE;
The error I get is: Delimiter not found after reading till 'Dear Conduira,'
As suggested by John Rotenstein in the comments, using the CSV option is the right way to deal with this.
A more detailed answer is given here.
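For reference, a minimal sketch of the same COPY with the CSV option added and ESCAPE dropped (CSV and ESCAPE cannot be combined); this assumes the multi-line column is wrapped in double quotes in the source file, so that quoted fields may contain the delimiter and line breaks:

COPY cat_crt_test_scores
from 's3://rds-cat-crt-test-score-table/checkcsv.csv'
iam_role 'arn:aws:iam::423639311527:role/RedshiftS3Access'
explicit_ids
CSV
delimiter '|'
TIMEFORMAT 'auto';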
I've sample data like below:
id,log,code,sequence
100,sample <(>&<)> O sample ? PILE UP - 3 sample,20,7^M$
101,sample- 4/52$
sample$
CM,21,7^M$
102,sample AT 3PM,22,4^M$
In the second row (id=101), the log column contains newline characters, so one logical record spans three physical lines.
I've enabled ":set list" in the vim editor to show the end-of-line ($) and carriage-return (^M) characters.
To handle the newline characters, AWS suggested OpenCSVSerde here.
I tried the OpenCSVSerde serialization with escapeChar = \\, quoteChar = \", separatorChar = , (a sketch of the corresponding table DDL is shown below).
Nonetheless, it shows the data as 5 rows, whereas I need 3 rows.
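A sketch of the kind of table definition this corresponds to (the table name, column types, and S3 location below are placeholders):

CREATE EXTERNAL TABLE `sample_logs` (
  `id` string,
  `log` string,
  `code` string,
  `sequence` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"',
  'escapeChar'    = '\\'
)
LOCATION 's3://my-bucket/sample-logs/';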
When I query in Athena, id=101 shows only the first line and the rest is missing:
id,log,code,sequence
101,sample- 4/52
Any tips or examples on how to handle multi-line values in a CSV file column?
I'm exploring custom classifiers but have had no luck yet.
According to this doc, https://docs.aws.amazon.com/athena/latest/ug/csv.html, OpenCSVSerde does not support line breaks.
I see that you are trying to put some kind of log there.
Your options are:
Clean up the log so that it does not include line breaks. Or,
Use RegexSerDe, which is not useful if your log format keeps changing. Or,
If neither is an option, change your format from CSV to Parquet or something else where line breaks are not an issue.
I am trying to load a Control-A ("^A") delimited file into Redshift using the COPY command. I see the default delimiter is pipe (|), and with CSV it is comma.
I couldn't find a way to use ^A; when I tried the COPY command with ^A or \x01, it throws the message below. Has anybody tried this before? The documentation says we can set the delimiter, but there is no clue on using ^A.
ERROR: COPY delimiter must be a single character
I have used '\\001' as the delimiter for Ctrl-A based field separation in Redshift, and also in Pig.
Example:
copy redshiftinfo from 's3://mybucket/data/redshiftinfo.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '\\001'
I'm loading a CSV file from S3 into Redshift. This CSV file is analytics data which contains the PageUrl (which may, for example, contain user search info inside a query string).
The load chokes on rows where there is a single double-quote character; for example, if there is a page for a 14" toy then the PageUrl would contain:
http://www.mywebsite.com/a-14"-toy/1234.html
Redshift understandably can't handle this as it is expecting a closing double quote character.
The way I see it my options are:
Pre-process the input and remove these characters
Configure the COPY command in Redshift to ignore these characters but still load the row
Set MAXERRORS to a high value and sweep up the errors using a separate process
Option 2 would be ideal, but I can't find it!
Any other suggestions if I'm just not looking hard enough?
Thanks
Duncan
It's 2017 and I ran into the same problem; happy to report there is now a way to get Redshift to load CSV files with the odd " in the data.
The trick is to use the ESCAPE keyword, and also NOT to use the CSV keyword.
I don't know why, but having the CSV and ESCAPE keywords together in a COPY command resulted in failure with the error message "CSV is not compatible with ESCAPE;".
However, with no change to the data being loaded, I was able to load successfully once I removed the CSV keyword from the COPY command.
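For example, a rough sketch of this kind of COPY (the table, bucket, and credentials are placeholders; the point is ESCAPE without the CSV keyword):

copy analytics_events
from 's3://my-bucket/analytics/pageviews.csv'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter ','
ESCAPE;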
You can also refer to this documentation for help:
http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-escape
Unfortunately, there is no way to fix this. You will need to pre-process the file before loading it into Amazon Redshift.
The closest options you have are CSV [ QUOTE [AS] 'quote_character' ] to wrap fields in an alternative quote character, and ESCAPE if the quote character is preceded by a backslash. Alas, both require the file to be in a particular format before loading.
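For example, a minimal sketch of the QUOTE [AS] form (only helpful if the file is actually written with that alternative quote character; the table, bucket, and the '^' character here are placeholders):

copy analytics_events
from 's3://my-bucket/analytics/pageviews.csv'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
CSV QUOTE AS '^';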
See:
Redshift COPY Data Conversion Parameters
Redshift COPY Data Format Parameters
I have done this using DELIMITER ',' IGNOREHEADER 1 as the replacement for 'CSV' at the end of the COPY command. It's working really well.
I have created an Informatica workflow. The target is a flat file. The delimiter used is \037 with UTF-8 encoding, but the output file created contains , as the delimiter. It works fine with other workflows I have created.
How do I get the required delimiter in the output file?
Regards
Sriram
Just check whether the delimiter is set to only \037 or to ,\037. Also check the same in the session, under Set File Properties for the flat file target.