Specify row delimiter in Redshift COPY command - amazon-web-services

I am trying to use the COPY command to import data into Redshift. Unfortunately the data is not sanitized very well and there are CRLF characters in some of the data. This is causing an error because it thinks it is a new record.
I am already using the DELIMITER parameter, but that is setting the delimiter for the fields in each record. Is there a similar way to specify what character(s) are delimiting each record?

No. Redshift expects \n (0x0A) as the end-of-record character and doesn't handle CRLF (0x0D 0x0A). I believe it just sees the CR as another piece of input data, but that byte cannot be inserted into anything other than a varchar column. If your lines have only CR (0x0D), Redshift won't see a record boundary at all and will combine rows.
You will need to cleanse your data to remove the CR characters. Each record needs to end with a newline, NL (0x0A). (Yes, LF and NL are the same ASCII code; they just have different names in different applications.) Hopefully you can simply remove the CRs, but I've seen data that uses a lone CR as the record terminator, in which case you will need to change the CRs to NL rather than just remove them.
If your last column of data is a varchar, then you can (I believe) just strip the CR character from those strings after the data is loaded into Redshift. Otherwise your data needs to be fixed before it enters Redshift.
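For the post-load cleanup route, a minimal sketch, assuming the stray CR lands at the end of a varchar column; the table and column names here (staging_table, comment_text) are placeholders:
-- Strip the carriage return (0x0D) that ends up at the tail of the last
-- varchar column after the load; CHR(13) is the CR character in Redshift.
UPDATE staging_table
SET comment_text = REPLACE(comment_text, CHR(13), '')
WHERE comment_text LIKE '%' || CHR(13);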

Related

Importing a file with split lines in an Azure Data Factory pipeline

I have a pipe-delimited text file (obtained via SFTP) with a header row that I need to import into a SQL Server table. That should be easy enough; however, the input file has rows split over several lines when the data for a row exceeds 80 characters in length. The EOL character is a newline, which ADF can cope with just fine.
So, we have something like:
Col1Name|Col2Name|Col3Name|Col4Name
aaaa|bbbbbb|cccc|ddddd
eeeeeee|fffff|gggggg|this is some data that pushes the row over the 80 character li\
mit
hhhhhh|iiiiiii|jjjjjjjj|kk
If none of the rows were split in this manner, it would be straightforward to shunt the data into the destination table, but I can't work out how to merge the split lines prior to mapping the data to output columns.
Things I have tried/looked at doing:
Using a text file source with pipes as delimiters and newlines as row terminators, replacing the backslash and newline combination with an empty string. Unfortunately, the data is already processed into separate rows at this point so this achieves nothing.
Mucking around with the column/row delimiters to read the file into one big blob and replacing the backslash/newline combos in the blob with an empty string. This doesn't work as the file gets truncated doing this.
Some combination of an aggregate transformation with a collect() expression to merge the lines. Again, I can't seem to manage this, because the lines a row has been split across share no grouping column in common that would let me perform this sort of aggregation.
Do I need to write an Azure function to pre-process the file and merge the split lines, or is there something I'm missing that would help?

Parsing a CSV file that has newline characters in one of the columns in AWS Athena / AWS Glue catalog

I've sample data like below:
id,log,code,sequence
100,sample <(>&<)> O sample ? PILE UP - 3 sample,20,7^M$
101,sample- 4/52$
sample$
CM,21,7^M$
102,sample AT 3PM,22,4^M$
In the second row (id=101), the log column contains newline characters that split one record across three lines.
I've enabled the ":set list" option in the vim editor, which shows the end-of-line marker ($) and carriage-return (^M) characters.
To handle newline characters, AWS suggests OpenCSVSerde here.
I tried using OpenCSVSerde serialization with escapeChar=\\, quoteChar=\", separatorChar=,
Nonetheless, it shows the data as 5 rows, whereas I need three rows.
When I query in Athena, id=101 shows only the first line and the rest is missing:
id,log,code,sequence
101,sample- 4/52
Any tips or examples on how to handle multiline values in a CSV file column?
I'm exploring custom classifiers but no luck yet.
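For reference, a table definition with the serde properties described above would look roughly like the following; the table name, column list, and S3 location are placeholders, and OpenCSVSerde treats every column as a string:
CREATE EXTERNAL TABLE sample_logs (
  id string,
  log string,
  code string,
  sequence string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"',
  'escapeChar'    = '\\'
)
LOCATION 's3://your-bucket/sample-logs/'
TBLPROPERTIES ('skip.header.line.count' = '1');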
According to this doc https://docs.aws.amazon.com/athena/latest/ug/csv.html, OpenCSVSerde does not support embedded line breaks.
I see that you are trying to put some kind of log there.
Your options are:
Clean up the log so it does not include line breaks. Or,
use RegexSerDe, which is not useful if your log format keeps changing. Or,
if neither is an option, change your format from CSV to Parquet or something else where there are no line-break issues, as sketched below.
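A hedged sketch of the Parquet route, assuming the cleaned data is already readable from a staging table; the table and bucket names (sample_logs_cleaned, logs_parquet, your-bucket) are placeholders:
-- CTAS in Athena: rewrite the cleaned data as Parquet, which has no
-- embedded-line-break problems.
CREATE TABLE logs_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/logs-parquet/'
) AS
SELECT id, "log", code, "sequence"
FROM sample_logs_cleaned;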

Escaping delimiter in Amazon Redshift COPY command

I'm pulling data from Amazon S3 into a table in Amazon Redshift. The table contains various columns, where some column data might contain special characters.
The copy command has an option called Delimiter where we can specify the delimiter while pulling the data into the table.
The issue is two-fold:
When I export (UNLOAD command) to S3 using a delimiter, say ',', it works fine, but when I try to import into Redshift from S3 the issue creeps in because certain columns contain the ',' character, which the COPY command misinterprets as a delimiter and throws an error.
I tried various delimiters, but the data in my table seems to contain some or other kind of special character which causes the above issue.
I even tried unloading using a multi-character delimiter, like #% or ~, but when loading from S3 using the COPY command, a multi-character delimiter is not supported.
Any solutions?
I think the delimiter can be escaped using \ but for some reason that isn't working either, or maybe I'm not using the right syntax for escaping in the COPY command.
The following example shows the contents of a text file with the field values separated by commas.
12,Shows,Musicals,Musical theatre
13,Shows,Plays,All "non-musical" theatre
14,Shows,Opera,All opera, light, and "rock" opera
15,Concerts,Classical,All symphony, concerto, and choir concerts
If you load the file using the DELIMITER parameter to specify comma-delimited input, the COPY command will fail because some input fields contain commas. You can avoid that problem by using the CSV parameter and enclosing the fields that contain commas in quote characters. If the quote character appears within a quoted string, you need to escape it by doubling the quote character. The default quote character is a double quotation mark, so you will need to escape each double quotation mark with an additional double quotation mark. Your new input file will look something like this.
12,Shows,Musicals,Musical theatre
13,Shows,Plays,"All ""non-musical"" theatre"
14,Shows,Opera,"All opera, light, and ""rock"" opera"
15,Concerts,Classical,"All symphony, concerto, and choir concerts"
Source: Load Quote from a CSV File
What I use -
COPY tablename FROM 'S3-Path' CREDENTIALS '' MANIFEST CSV QUOTE '\"' DELIMITER ',' TRUNCATECOLUMNS ACCEPTINVCHARS MAXERROR 2
If I’ve made a bad assumption please comment and I’ll refocus my answer.
If the delimiter is appearing within fields, then use the ADDQUOTES parameter with the UNLOAD command:
Places quotation marks around each unloaded data field, so that Amazon Redshift can unload data values that contain the delimiter itself.
Then:
If you use ADDQUOTES, you must specify REMOVEQUOTES in the COPY if you reload the data.
A popular delimiter is the pipe character (|) that is rare in text files.
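As a rough sketch of how the two options pair up (the table name, S3 prefix, and IAM role below are placeholders; substitute CREDENTIALS if you authenticate with keys):
-- Unload with quotes around every field so embedded delimiters survive...
UNLOAD ('SELECT * FROM tablename')
TO 's3://your-bucket/unload/tablename_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|' ADDQUOTES ALLOWOVERWRITE;

-- ...then strip those quotes again on the way back in.
COPY tablename
FROM 's3://your-bucket/unload/tablename_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|' REMOVEQUOTES;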
Adding CSV QUOTE as '\"' before the DELIMITER worked for me.

Remove or replace '�' character in Informatica

We have a requirement wherein we need to replace or remove the '�' character (which is an unrecognizable, undefined character) present in our source. My workflow runs successfully, but when I check the records in the target they are not committed. I get the following error in Informatica:
Error executing query for record 37: 6706: The string contains an untranslatable character.
I tried functions like replace_chr, reg_replace, replace_str, etc., but none of them seems to work. Kindly advise on how to get rid of this. Any reply is greatly appreciated.
You need to use charset => utf8_unicode_ci in your schema definitions,
but for now you can do:
UPDATE tablename
SET columnToCheck = REPLACE(CONVERT(columnToCheck USING ascii), '?', '')
WHERE ...
or
update tablename
set columnToCheck = replace(columnToCheck , char(146), '');
Replace NonASCII Characters in MYSQL
You can replace the special characters in an expression transformation.
REPLACESTR(1,Column_Name,'?',NULL)
REPLACESTR - Function
1 - CaseFlag (a non-zero value makes the match case sensitive)
Column_Name - Column name which has a special character
? - Special character
NULL - Replacing character
You need to fetch rows with the appropriate character set defined on your connection. What is the connection you're using, ODBC or native? What's the DB?
Special characters are a challenge. Having checked the Informatica network, I can see there is a kludge involving replace_str: first set a variable to the string containing only the non-special (allowed) characters, then use that variable in a replace_str so that the final value keeps only the allowed characters: https://network.informatica.com/thread/20642 (an awesome workaround by nico, so long as you can positively identify every character that should be allowed)...
As an alternate kludge, I would also attempt something using an XML transformation somewhere within the mapping, as Informatica conveniently converts special characters to encoded (decimal or hex, I can't remember which) values... so long as you can live with these encoded values appearing in your target text, you should be fine (and build some extra space into your strings to accommodate any bloat from the extra characters).

SAS while reading varbinary data from Amazon RDS is appending spaces at the end of the data. Can we avoid it?

While reading varbinary data from Amazon RDS, SAS appends spaces at the end of the data.
proc sql;
select emailaddr from tablename1;
quit;
The column emailaddr is varbinary(20)
For example:
I inserted "XX#WWW.com ", but while reading from the db, SAS pads the value with spaces up to the length of the column.
Since the column length is 20, it returns "XX#WWW.com " (note the appended spaces). I cannot use the trim() function, since that would also remove spaces that might genuinely be part of the original inserted data.
How can I stop SAS from appending these spaces?
For my program, I need the exact data as present in the database, without any extra spaces attached.
That's how SAS works; SAS has only a CHAR-equivalent data type (in Base SAS, anyway; DS2 is different), with no VARCHAR concept. Whatever the length of the column is (20 here), the value will have 20 total characters, with spaces at the end to pad it to that length.
Most of the time, it doesn't matter; when SAS inserts into another RDBMS for example it will typically treat trailing spaces as nonexistent (so they won't be inserted). You can use TRIM and similar to deal with the spaces if you're using regular expressions or concatenation to work with these values; CATS and similar functions perform concatenation-with-trimming.
If trailing spaces are part of your data, you are mostly out of luck in SAS. SAS considers trailing spaces irrelevant (equivalent to null characters). You can append a non-space character in SQL, or translate the spaces to NBSPs ('A0'x) or something else, while still in SQL, or use quotes or something around your actual values - but whatever you do will be complicated.
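If you go the NBSP route, here is a hedged sketch of the database-side piece, assuming the RDS engine is MySQL (the column and table names come from the question; the cast and character functions would differ on other engines):
/* Convert the varbinary column to text and mark genuine spaces as NBSP
   (160 = 0xA0) on the database side, so the trailing-space padding SAS adds
   can be told apart from real data after the read. */
SELECT REPLACE(CONVERT(emailaddr USING latin1), ' ', CHAR(160 USING latin1)) AS emailaddr_marked
FROM tablename1;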