Redshift - Load data which has a newline in a field

I am trying to load data that includes a newline within a field:
001|myname|fav\
movie | myaddress| myphone|
There is a line break between fav\ and movie.
I am loading the data with this command:
COPY catdemo
FROM 's3://tickit/catego.csv'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
REGION 'ap-south-1'
DELIMITER '|'
ESCAPE
ACCEPTINVCHARS
IGNOREBLANKLINES
NULL AS '\0';
I want to ignore this line break; can anyone help me?
It's showing 'Delimiter not found' between fav\ and movie, but it's actually a single record:
fav\
movie
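Since COPY reports only a generic 'Delimiter not found' to the client, the exact reason and the raw line it failed on can be read back from Redshift's stl_load_errors system table:
-- Most recent load failures: raw_line is the text COPY saw,
-- err_reason the specific parse error.
SELECT starttime, line_number, colname, err_reason, raw_line
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;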

Related

Change the delimiter in AWS Glue Pyspark

from datetime import datetime
from pyspark.sql.functions import lit

# Read the tab-delimited CSV from S3; 'separator' goes in the format options.
abv_data = glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": ["s3://{}/{}".format(bucket, prefix)],
     "recurse": True, "groupFiles": "inPartition"},
    "csv",
    {"withHeader": True, "separator": "\t"})
abv_df_1 = abv_data.toDF()
# Stamp each row with the load time.
abv_df_2 = abv_df_1.withColumn("save_date", lit(datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")))
conparms_r = glueContext.extract_jdbc_conf("reporting", catalog_id=None)
abv_df_2.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://rs_cluster:8192/rptg") \
    .option("dbtable", redshift_schema_table_output) \
    .option("user", conparms_r["user"]) \
    .option("password", conparms_r["password"]) \
    .option("aws_iam_role", "arn:aws:iam::123456789:role/redshift_admin_role") \
    .option("tempdir", args["TempDir"]) \
    .option("extracopyoptions", "DELIMITER '\t' IGNOREHEADER 1 DATEFORMAT AS 'YYYY-MM-DD'") \
    .mode("append") \
    .save()
The CSV has a tab delimiter on read, but when I add the column to the dataframe it uses a comma delimiter, and that is causing the Redshift load to fail.
Is there a way to add the column with a tab delimiter, OR to change the delimiter on the entire data frame?
This isn't necessarily the way to do this, but here is what I ended up doing:
Bring the CSV in with a ',' separator:
glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": ["s3://{}/{}".format(bucket, prefix)],
     "recurse": True, "groupFiles": "inPartition"},
    "csv",
    {"withHeader": True, "separator": ","})
Then split the first column on tab, add each piece to its own column, and add the extra column at the same time, as sketched below.
Drop the first column, because it is still the combined column.
This gives you a comma-separated dataframe to load.
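A minimal sketch of that split (the target column names are hypothetical, since the question doesn't list the real header):
from datetime import datetime
from pyspark.sql.functions import split, col, lit

# df: the dataframe read with ',' as separator, so the whole tab-separated
# record sits in the first column.
parts = split(col(df.columns[0]), "\t")
for i, name in enumerate(["id", "description", "flag"]):  # hypothetical names
    df = df.withColumn(name, parts.getItem(i))
df = df.withColumn("save_date", lit(datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")))
df = df.drop(df.columns[0])  # drop the original combined column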
Use spark.read.option("delimiter", "\t").csv(file), or sep instead of delimiter.
For a special character, use a double backslash: spark.read.option("delimiter", "\\t").csv(file)

Snowflake - getting 'Error parsing JSON' while using the Copy command from S3 to snowflake

I'm trying to copy gz files from my S3 directory to Snowflake.
I created a table in Snowflake (notice that the 'extra' field is defined as 'variant'):
CREATE TABLE accesslog
(
loghash VARCHAR(32) NOT NULL,
logdatetime TIMESTAMP,
ip VARCHAR(15),
country VARCHAR(2),
querystring VARCHAR(2000),
version VARCHAR(15),
partner INTEGER,
name VARCHAR(100),
countervalue DOUBLE PRECISION,
username VARCHAR(50),
gamesessionid VARCHAR(36),
gameid INTEGER,
ingameid INTEGER,
machineuid VARCHAR(36),
extra variant,
ingame_window_name VARCHAR(2000),
extension_id VARCHAR(50)
);
I used this copy command in Snowflake:
copy INTO accesslog
FROM 's3://XXX'
pattern='.*cds_201911.*'
CREDENTIALS = (
aws_key_id='XXX',
aws_secret_key='XXX')
FILE_FORMAT=(
error_on_column_count_mismatch=false
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
TYPE = CSV
COMPRESSION = GZIP
FIELD_DELIMITER = '\t'
)
ON_ERROR = CONTINUE
I ran it, and got this result (there were many error lines; this is an example of one, taken from the Snowflake result screenshots):
a17589e44ae66ffb0a12360beab5ac12 2019-11-01 00:08:39 155.4.208.0 SE 0.136.0 3337 game_process_detected 0 OW_287d4ea0-4892-4814-b2a8-3a5703ae68f3 e9464ba4c9374275991f15e5ed7add13 765 19f030d4-f85f-4b85-9f12-6db9360d7fcc [{"Name":"file","Value":"wowvoiceproxy.exe"},{"Name":"folder","Value":"C:\\Program Files (x86)\\World of Warcraft\\_retail_\\Utils\\WowVoiceProxy.exe"}]
Can you please tell me what causes this error?
Thanks!
I'm guessing:
The 'Error parsing JSON' is certainly related to the extra variant field.
The JSON looks fine, but there are potential problems with the backslashes (\).
If you look at the successfully loaded lines, have the backslashes been removed?
This can (maybe) happen if you have STAGE settings involving escape characters.
The \\Utils substring in the Windows path value can then trigger a Unicode decode error, e.g.:
Error parsing JSON: hex digit is expected in \U???????? escape sequence, pos 123
UPDATE:
It turns out you have to turn off escape char processing by adding the following to the FILE_FORMAT:
ESCAPE_UNENCLOSED_FIELD = NONE
The alternative is to double-quote fields or to doubly escape backslashes, e.g. C:\\\\Program Files.
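Folding that into the question's COPY gives something like this (a sketch; the S3 path and credentials stay elided as in the question):
copy INTO accesslog
FROM 's3://XXX'
pattern='.*cds_201911.*'
CREDENTIALS = (
    aws_key_id='XXX',
    aws_secret_key='XXX')
FILE_FORMAT=(
    error_on_column_count_mismatch=false
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    TYPE = CSV
    COMPRESSION = GZIP
    FIELD_DELIMITER = '\t'
    ESCAPE_UNENCLOSED_FIELD = NONE  -- turn off escape-character processing
)
ON_ERROR = CONTINUE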

Redshift Copy with Newline Embedded in Quotes

Trying to copy data from S3 to Redshift with a newline within quotes.
Example CSV file:
Line 1 --> ID,Description,flag
Line 2 --> "1111","this is a test", "FALSE"
Line 3 --> "2222","I hope someone
could help", "TRUE"
Line 4 --> "3333", "NA", "FALSE"
Sample Table:
TEST_TABLE:
ID VARCHAR(100)
DESCRIPTION VARCHAR(100)
FLAG VARCHAR(100)
If you notice, in line 2 there is a linefeed, and I get the error 'Delimited value missing end quote' when using the COPY command.
This is the Copy command I use:
copy table_name
from sample.csv
credentials aws_access_key_id= blah; aws_secret_access_key=blah
DELIMITER ','
removequotes
trimblanks
ESCAPE ACCEPTINVCHARS
EMPTYASNULL
IGNOREHEADER 1
COMPUPDATE OFF;
commit;
I've also tried the CSV option, but get "Extra column(s) found":
copy table_name
from sample.csv
credentials aws_access_key_id= blah; aws_secret_access_key=blah
CSV
IGNOREHEADER 1
COMPUPDATE OFF;
commit;
I would expect the description column in Line 2 to be loaded with the linefeed.
Since the fields are enclosed in quotes, use the CSV parameter.
Note: CSV cannot be used with FIXEDWIDTH, REMOVEQUOTES, or ESCAPE.
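One plausible reason (an assumption, not verified against the real file) that the CSV attempt still reports 'Extra column(s) found' is the space after the comma before some opening quotes in the sample: with the CSV parameter, a quotation mark that does not immediately follow the delimiter is treated as literal data, so the embedded newline and commas are no longer protected. With the file written without those stray spaces,
ID,Description,flag
"1111","this is a test","FALSE"
"2222","I hope someone
could help","TRUE"
"3333","NA","FALSE"
the CSV variant of the COPY above should load line 2 with the linefeed intact in the description column.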

Redshift Copy errors out when trying to load NUL

I am loading data to Redshift using COPY. The text file contains NUL characters.
I have looked at several options and tried using options such as:
null as '\0' EMPTYASNULL ACCEPTINVCHARS TRIMBLANKS TRUNCATECOLUMNS escape
However, it still errors out.
Below are sample records and the error message.
The NUL is after 'Main St|':
2278|2047|5|1|1|1|18 N Main St| |Bowman|1|39|16443|15811|58623|Y|544|2018-05-21 17:29:12.000||||
2491|2047|6|1|1|1|18 N Main| |Bowman|1|39|16443|15811|58623-9613|Y|920|2018-11-26 18:28:26.000||||
2491|2047|7|1|1|1|18 N Main| |Bowman|1|39|16443|15811|58623-9613|Y|920|2018-11-26 18:28:26.000||||
2408|2154|7|1|1|1|101 Main St| |Lakota|1|39|16469|15956|58344|Y|447|2018-08-17 08:10:54.000||||
copy table1 from 's3://....txt' iam_role xx delimiter '|' null as '\0' EMPTYASNULL ACCEPTINVCHARS TRIMBLANKS TRUNCATECOLUMNS escape;
Missing newline: Unexpected character 0x7d found at location nn
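One experiment that may help isolate this (an assumption on my part; the reported byte 0x7d is a '}', which does not appear in the sample rows): with ESCAPE in the option list, a literal backslash in the data escapes whatever follows it, including a record's newline, and that is the classic trigger for 'Missing newline' errors. A sketch of the same command without ESCAPE:
copy table1 from 's3://....txt'
iam_role xx
delimiter '|'
null as '\0'
EMPTYASNULL ACCEPTINVCHARS TRIMBLANKS TRUNCATECOLUMNS;
-- If this loads, the failure is ESCAPE consuming bytes, not the NUL itself.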

How to save MySQL query output to a file named (currentTime).csv

I am trying to store the output of a MySQL query in a file. The file needs the extension .csv, and its name should be the current time of my PC, e.g. 2015-03-26 19:26:13.065000.csv.
When I execute this query:
import mysql.connector

conn = mysql.connector.connect(user='root', password='', host='localhost', database='ER_PC_NK')
exe2 = conn.cursor()
exe2.execute("""SELECT tbl_site.Site_name, State_Code, Country_Code, Street_Address, instrum_start_date, instrum_end_date, Comment INTO OUTFILE 'myrecord.csv' FIELDS TERMINATED BY '|' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\\\' LINES TERMINATED BY '\\n' FROM tbl_site JOIN tbl_site_monit_invent ON site_id = tbl_Site_site_id""")
the first time it saved a file named myrecord.csv, but the second time it did NOT. After a long search on the internet I found that INTO OUTFILE cannot overwrite an existing file, so I decided to name the file currentTime.csv. To do this I thought to try something like this:
SET @TS = DATE_FORMAT(NOW(),'_%Y_%m_%d_%H_%i_%s');
SET @FOLDER = 'c:/tmp/';
SET @PREFIX = 'orders';
SET @EXT = '.csv';
SET @CMD = CONCAT("SELECT * FROM orders INTO OUTFILE '",@FOLDER,@PREFIX,@TS,@EXT,
    "' FIELDS ENCLOSED BY '\"' TERMINATED BY ';' ESCAPED BY '\"'",
    " LINES TERMINATED BY '\r\n';");
PREPARE statement FROM @CMD;
but I got the error that user-defined variables are not defined. Googling again, I found that user-defined variables are supported from Connector/NET version 5.2.2, while I'm using MySQL Connector/Python v2.0.3 with Python 2.7.
I am very confused; if you have a better solution, please tell me. Your effort will be of great help. Thank you.
Instead of user-defined variables you can just do:
import datetime
import mysql.connector

conn = mysql.connector.connect(user='root', password='', host='localhost', database='ER_PC_NK')
exe2 = conn.cursor()
# mysql.connector substitutes %s client-side, so the timestamped filename is
# inlined as a quoted string literal before the statement reaches the server.
exe2.execute(
    """SELECT tbl_site.Site_name, State_Code, Country_Code,
    Street_Address, instrum_start_date, instrum_end_date,
    Comment INTO OUTFILE %s FIELDS TERMINATED BY '|' OPTIONALLY
    ENCLOSED BY '"' ESCAPED BY '\\\\' LINES TERMINATED BY '\\n'
    FROM tbl_site JOIN tbl_site_monit_invent ON site_id = tbl_Site_site_id
    """, (str(datetime.datetime.now()) + '.csv',))