Parsing a CSV file that has newline characters in one of the columns in AWS Athena / AWS Glue catalog - amazon-web-services

I've sample data like below:
id,log,code,sequence
100,sample <(>&<)> O sample ? PILE UP - 3 sample,20,7^M$
101,sample- 4/52$
sample$
CM,21,7^M$
102,sample AT 3PM,22,4^M$
In the second row (id=101), the log column contains newline characters, splitting one record across three lines.
I've enabled the ":set list" option in the vim editor, which shows the end-of-line markers ($) and the carriage-return characters (^M).
To handle newline characters, AWS suggested OpenCSVSerde here.
I tried using OpenCSVSerde serialization with escapeChar=\\, quoteChar=\" and separatorChar=,
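The table definition I'm experimenting with looks roughly like this (table name and S3 location are placeholders):
CREATE EXTERNAL TABLE sample_logs (
  id string,
  log string,
  code string,
  `sequence` string
)
-- all columns declared as string, since OpenCSVSerde reads every field as a string
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION 's3://my-bucket/sample-logs/'
TBLPROPERTIES ('skip.header.line.count' = '1');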
Nonetheless, it shows the data as 5 rows, whereas I need 3 rows.
When I query in Athena, id=101 shows only the first line and the rest is missing:
id,log,code,sequence
101,sample- 4/52
Any tips or examples on how to handle multiline values in a CSV file column?
I'm exploring custom classifiers, but no luck yet.

According to this doc, https://docs.aws.amazon.com/athena/latest/ug/csv.html, OpenCSVSerde does not support line breaks.
I see that you are trying to put some kind of log there.
Your options are:
Clean up the log so it does not include line breaks. Or,
use RegexSerDe, which is not useful if your log format keeps changing (a rough sketch follows after this list). Or,
if neither is an option, change your format from CSV to Parquet or something else, where there are no line-break issues.
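For the RegexSerDe route, a table definition would look something like the sketch below. The regex is only an assumption based on the four sample columns (numeric id, free-text log, numeric code and sequence), and RegexSerDe still reads the file one line at a time, so it does not remove the embedded line-break problem on its own:
CREATE EXTERNAL TABLE sample_logs_regex (
  id string,
  log string,
  code string,
  `sequence` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- one capture group per column; the greedy middle group picks up the log text
  'input.regex' = '^([0-9]+),(.*),([0-9]+),([0-9]+)$'
)
LOCATION 's3://my-bucket/sample-logs/';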

Related

Importing file with split lines Azure Data Factory pipeline

I have a pipe-delimited text file with a header row that I need to import into a SQL Server table (obtained via SFTP). That should be easy enough; however, the input file has rows split over several lines if the data for a row exceeds 80 characters in length. The EOL character is a newline character, which ADF can cope with just fine.
So, we have something like:
Col1Name|Col2Name|Col3Name|Col4Name
aaaa|bbbbbb|cccc|ddddd
eeeeeee|fffff|gggggg|this is some data that pushes the row over the 80 character li\
mit
hhhhhh|iiiiiii|jjjjjjjj|kk
If some of the rows of data weren't split in this manner it would be straightforward to shunt the data into the destination table but I can't work out how to merge the split lines prior to mapping the data to output columns.
Things I have tried/looked at doing:
Using a text file source with pipes as delimiters and newlines as row terminators, replacing the backslash and newline combination with an empty string. Unfortunately, the data is already processed into separate rows at this point so this achieves nothing.
Mucking around with the column/row delimiters to read the file into one big blob and replacing the backslash/newline combos in the blob with an empty string. This doesn't work as the file gets truncated doing this.
Some combination of aggregate transformation with a collect() expression to merge the lines. Again, I can't seem to manage this because the lines a row has been split into share no grouping column in common to aggregate on.
Do I need to write an Azure function to pre-process the file and merge the split lines, or is there something I'm missing that would help?

How to ignore a specific character and a new line using regex

I am trying to validate a CSV file using Apache NiFi.
My CSV file has some defects.
id,name,address
1,sachith,{"Lane":"ABC.RTG.EED","No":"12"}
2,nalaka,{"Lane":"DEF",
"No":"23"}
3,muha,{"Lane":"GRF.FFF","No":"%$&%*^%"}
Here the second row has been divided into two lines, and the third row has some special characters.
I want to ignore both of these rows. For that I use \{("\w+":"\w+",)*[^%&*#]*\}, but this does not capture the row-split error and the newline.
I also used \{("\w+":"\w+",)*[^%&*#]*\}$, but it doesn't give the right answer either.
This might be what you are looking for: ^[0-9]+,[a-z]+,\{("\w+":"[\w\.]+","\w+":"[a-zA-Z0-9]+")\}$

Escaping delimiter in Amazon Redshift COPY command

I'm pulling data from Amazon S3 into a table in Amazon Redshift. The table contains various columns, where some column data might contain special characters.
The copy command has an option called Delimiter where we can specify the delimiter while pulling the data into the table.
The issue is two-fold:
When I export (UNLOAD command) to S3 using a delimiter, say ',', it works fine, but when I try to import into Redshift from S3, the issue creeps in because certain columns contain the ',' character, which the COPY command misinterprets as a delimiter and throws an error.
I tried various delimiters, but the data in my table seems to contain one kind of special character or another that causes the above issue.
I even tried unloading using a multi-character delimiter, like #% or ~, but when loading from S3 using the COPY command, a multi-character delimiter is not supported.
Any solutions?
I think the delimiter can be escaped using \, but for some reason that isn't working either, or maybe I'm not using the right syntax for escaping in the COPY command.
The following example shows the contents of a text file with the field values separated by commas.
12,Shows,Musicals,Musical theatre
13,Shows,Plays,All "non-musical" theatre
14,Shows,Opera,All opera, light, and "rock" opera
15,Concerts,Classical,All symphony, concerto, and choir concerts
If you load the file using the DELIMITER parameter to specify comma-delimited input, the COPY command will fail because some input fields contain commas. You can avoid that problem by using the CSV parameter and enclosing the fields that contain commas in quote characters. If the quote character appears within a quoted string, you need to escape it by doubling the quote character. The default quote character is a double quotation mark, so you will need to escape each double quotation mark with an additional double quotation mark. Your new input file will look something like this.
12,Shows,Musicals,Musical theatre
13,Shows,Plays,"All ""non-musical"" theatre"
14,Shows,Opera,"All opera, light, and ""rock"" opera"
15,Concerts,Classical,"All symphony, concerto, and choir concerts"
Source: Load Quote from a CSV File
What I use:
COPY tablename FROM 'S3-Path' CREDENTIALS '' MANIFEST CSV QUOTE '\"' DELIMITER ',' TRUNCATECOLUMNS ACCEPTINVCHARS MAXERROR 2
If I’ve made a bad assumption please comment and I’ll refocus my answer.
If the delimiter is appearing within fields, then use the ADDQUOTES parameter with the UNLOAD command:
Places quotation marks around each unloaded data field, so that Amazon Redshift can unload data values that contain the delimiter itself.
Then:
If you use ADDQUOTES, you must specify REMOVEQUOTES in the COPY if you reload the data.
A popular delimiter is the pipe character (|) that is rare in text files.
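As an illustration only (table name, S3 path and credentials are placeholders), the unload/reload pair would look something like this:
UNLOAD ('SELECT * FROM tablename')
TO 's3://my-bucket/unload/tablename_'
CREDENTIALS '<aws_credentials>'
DELIMITER '|'
ADDQUOTES        -- wraps every field in quotes so embedded delimiters survive
ALLOWOVERWRITE;

COPY tablename
FROM 's3://my-bucket/unload/tablename_'
CREDENTIALS '<aws_credentials>'
DELIMITER '|'
REMOVEQUOTES;    -- strips the quotes that ADDQUOTES added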
Adding CSV QUOTE as '\"' before the DELIMITER worked for me.

How to replace \n in Impala parquet files?

I have some text data stored in parquet format in HDFS in the Hive metastore. Each observation may or may not include \n as part of the text itself.
I need to export this data to a text (tab or comma delimited) file to analyze further in Python.
If I were to run a query against the data and save to text file I would get:
id,txt
1,I like this site \n tomorrow I'll write more
2,How cool \n is this website
At that point my rows get broken up due to the extra \n.
I tried to export the data but the regexp_replace function doesn't seem to produce the stripping I was expecting:
select id, regexp_replace(txt,'\\n',' ') as txt
from table
limit 1000
Any ideas on how to deal with this?
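One thing worth checking (this is only a sketch, not a confirmed fix) is whether the column stores an actual newline character or the literal two characters backslash and n; the pattern needs an extra level of escaping in the second case:
-- if txt contains a real newline or carriage return
select id, regexp_replace(txt, '\\n|\\r', ' ') as txt
from table
limit 1000;

-- if txt contains the literal characters \ and n
select id, regexp_replace(txt, '\\\\n', ' ') as txt
from table
limit 1000;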

Amazon Redshift - COPY from CSV - single Double Quote in row - Invalid quote formatting for CSV Error

I'm loading a CSV file from S3 into Redshift. This CSV file is analytics data which contains the PageUrl (which may contain user search info inside a query string for example).
It chokes on rows where there is a single double-quote character; for example, if there is a page for a 14" toy then the PageUrl would contain:
http://www.mywebsite.com/a-14"-toy/1234.html
Redshift understandably can't handle this as it is expecting a closing double quote character.
The way I see it my options are:
Pre-process the input and remove these characters
Configure the COPY command in Redshift to ignore these characters but still load the row
Set MAXERRORS to a high value and sweep up the errors using a separate process
Option 2 would be ideal, but I can't find it!
Any other suggestions if I'm just not looking hard enough?
Thanks
Duncan
It's 2017 and I ran into the same problem; happy to report there is now a way to get Redshift to load CSV files with the odd " in the data.
The trick is to use the ESCAPE keyword, and also to NOT use the CSV keyword.
I don't know why, but having the CSV and ESCAPE keywords together in a copy command resulted in failure with the error message "CSV is not compatible with ESCAPE;"
However, with no change to the data being loaded, I was able to load successfully once I removed the CSV keyword from the COPY command.
You can also refer to this documentation for help:
http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-escape
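Spelled out, the command shape is roughly the following (table name, path and credentials are placeholders):
COPY tablename
FROM 's3://my-bucket/analytics/'
CREDENTIALS '<aws_credentials>'
DELIMITER ','
ESCAPE           -- note: no CSV keyword alongside ESCAPE
TRUNCATECOLUMNS
ACCEPTINVCHARS
MAXERROR 2;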
Unfortunately, there is no way to fix this. You will need to pre-process the file before loading it into Amazon Redshift.
The closest options you have are CSV [ QUOTE [AS] 'quote_character' ] to wrap fields in an alternative quote character, and ESCAPE if the quote character is preceded by a slash. Alas, both require the file to be in a particular format before loading.
See:
Redshift COPY Data Conversion Parameters
Redshift COPY Data Format Parameters
I have done this using DELIMITER ',' IGNOREHEADER 1 as the replacement for 'CSV' at the end of the COPY command. It's working really fine.
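For completeness, a sketch of that variant (everything apart from the DELIMITER and IGNOREHEADER options is a placeholder):
COPY tablename
FROM 's3://my-bucket/path/'
CREDENTIALS '<aws_credentials>'
DELIMITER ','
IGNOREHEADER 1;  -- skips the header row instead of relying on the CSV keyword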