DELIMITER not found during Amazon Redshift COPY

I have added DELIMITER ',' but I am still getting an error.
Code:
"copy %s.%s_tmp
from '%s'
CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
REMOVEQUOTES
ESCAPE
ACCEPTINVCHARS
ENCODING AS UTF8
DELIMITER ','
GZIP
ACCEPTANYDATE
region '%s'"
% (schema, table, s3_path, access_key, secret_key, region)
Error:
InternalError: Load into table 'my_table' failed. Check 'stl_load_errors' system table for details.
In that table in Redshift the error is 'Delimiter not found'.
How can I fix this?
One of the raw lines is in this format:
1122,"",4332345,"2016-07-28 15:00:09","2032-09-28
15:00:09",19.00,"","some string","","som string","abc","abc","abc"

Try using the MAXERROR parameter in the COPY command. It will let the load succeed partially even when some records are in error.
Also try this form of COPY, with an explicit column list:
copy tblname(col1,col2,col3...) from s3 path
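For example, a sketch of the original template with MAXERROR and an explicit column list added (the column names and the error limit are placeholders to adapt):

# Sketch only: col1/col2/col3 and the MAXERROR limit are placeholders.
copy_sql = """copy %s.%s_tmp (col1, col2, col3)
from '%s'
CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
REMOVEQUOTES
ESCAPE
ACCEPTINVCHARS
ENCODING AS UTF8
DELIMITER ','
GZIP
ACCEPTANYDATE
MAXERROR 10
region '%s'""" % (schema, table, s3_path, access_key, secret_key, region)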

Related

Spark SQL error from EMR notebook with AWS Glue table partition

I'm testing some pyspark code in an EMR notebook before I deploy it and keep running into this strange error with Spark SQL. I have all my tables and metadata integrated with the AWS Glue catalog so that I can read and write to them through spark.
The first part of the code reads some data from S3/Glue, does some transformations and what not, then writes the resulting dataframe to S3/Glue like so:
df.repartition('datekey', 'coeff') \
    .write \
    .format('parquet') \
    .partitionBy('datekey', 'coeff') \
    .mode('overwrite') \
    .option("path", S3_PATH) \
    .saveAsTable('hive_tables.my_table')
I then try to access this table with Spark SQL, but when I run something as simple as
spark.sql('select * from hive_tables.my_table where datekey=20210506').show(),
it throws this:
An error was encountered:
"org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 778, in saveAsTable
self._jwrite.saveAsTable(name)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"
I've learned this happens only when specifying the datekey partition. For example, both of the following commands work fine:
spark.sql('select * from hive_tables.my_table where coeff=0.5').show() and
spark.sql('select * from hive_tables.my_table').show()
I've verified through Spark SQL that the partitions exist and have data in them. The datekey query also works fine through AWS Athena - just not Spark SQL.
Also Glue definitely has the two partition columns recognized:
datekey: int
coeff: double
Any ideas here? I've tried everything I can think of and it just isn't making any sense.
I had the same error on EMR 6.3.0 (Spark 3.1.1).
After upgrading to EMR 6.5.0 (Spark 3.1.2), it was solved.
I would still like a straight-forward solution to this, but currently this workaround suffices:
I first read the table straight from the S3 path
temp_df = spark.read.parquet(S3_PATH)
so that it doesn't use the Glue catalog as the metadata. Then I create a temp table for the session:
temp_df.createGlobalTempView('my_table')
which allows me to query it using Spark SQL with the global_temp database:
spark.sql('select * from global_temp.my_table where datekey=20210506').show()
and this works.
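Put together, a minimal sketch of that workaround (S3_PATH is assumed to point at the parquet output written by saveAsTable):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

S3_PATH = 's3://my-bucket/path/to/my_table'  # placeholder for the table's S3 location

# Read the parquet files directly so the Glue catalog is bypassed for metadata.
temp_df = spark.read.parquet(S3_PATH)

# Register a view in the global_temp database and query it with the partition filter.
temp_df.createGlobalTempView('my_table')
spark.sql('select * from global_temp.my_table where datekey=20210506').show()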
I had a similar issue in a similar environment (EMR cluster + Spark SQL + AWS Glue catalog). The query was like this:
select *
from ufd.core_agg_data
where year <> date_format(current_timestamp, 'yyyy')
This is a table partitioned by "year", and "year" is a string. Note that "year" is used in the filter.
I got
User class threw exception: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown operator '!='
Then I "modified" the query to this one, and it worked!
select *
from ufd.core_agg_data
where year in (select date_format(current_timestamp, 'yyyy'))

How to input fsx for lustre to Amazon Sagemaker?

I am trying to set up Amazon SageMaker to read our dataset from our Amazon FSx for Lustre file system.
We are using the SageMaker API, and previously we were reading our dataset from S3, which worked fine:
estimator = TensorFlow(
    entry_point='model_script.py',
    image_uri='some-repo:some-tag',
    instance_type='ml.m4.10xlarge',
    instance_count=1,
    role=role,
    framework_version='2.0.0',
    py_version='py3',
    subnets=["subnet-1"],
    security_group_ids=["sg-1", "sg-2"],
    debugger_hook_config=False,
)
estimator.fit({
    'training': f"s3://bucket_name/data/{hyperparameters['dataset']}/"
})
But now that I'm changing the input data source to the FSx for Lustre file system, I'm getting an error that the file input should be s3:// or file://. I was following these docs (FSx for Lustre):
estimator = TensorFlow(
    entry_point='model_script.py',
    # image_uri='some-docker:some-tag',
    instance_type='ml.m4.10xlarge',
    instance_count=1,
    role=role,
    framework_version='2.0.0',
    py_version='py3',
    subnets=["subnet-1"],
    security_group_ids=["sg-1", "sg-2"],
    debugger_hook_config=False,
)
fsx_data_folder = FileSystemInput(
    file_system_id='fs-1',
    file_system_type='FSxLustre',
    directory_path='/fsx/data',
    file_system_access_mode='ro')
estimator.fit(f"{fsx_data_folder}/{hyperparameters['dataset']}/")
Throws the following error:
ValueError: URI input <sagemaker.inputs.FileSystemInput object at 0x0000016A6C7F0788>/dataset_name/ must be a valid S3 or FILE URI: must start with "s3://" or "file://"
Does anyone understand what I am doing wrong? Thanks in advance!
I was (quite stupidly, it was late ;)) treating the FileSystemInput object as a string instead of an object. The error complained that the concatenation of object + string is not a valid URI pointing to a location in S3.
The correct way is to make a FileSystemInput object out of the entire path to the dataset. Note that fit now takes this object and will mount it at data_dir = "/opt/ml/input/data/training".
from sagemaker.inputs import FileSystemInput

fsx_data_obj = FileSystemInput(
    file_system_id='fs-1',
    file_system_type='FSxLustre',
    directory_path='/fsx/data/{dataset}',
    file_system_access_mode='ro')
estimator.fit(fsx_data_obj)
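Inside model_script.py the mounted data can then be located via the channel environment variable; a minimal sketch, assuming the default 'training' channel name is used:

import os

# With a single input and no explicit channel name, the channel is called 'training',
# so the FSx directory is mounted at /opt/ml/input/data/training and exposed via
# SM_CHANNEL_TRAINING (falling back to the default path here as an assumption).
data_dir = os.environ.get('SM_CHANNEL_TRAINING', '/opt/ml/input/data/training')

for name in os.listdir(data_dir):
    print(name)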

Redshift COPY import getting aborted

I have a query:
copy quality_temp.temp_qa_leave_list_193de282ea194c169a97bb82a8fcc3b9 (exclusion_value)
from 's3://quality-staging1/quality-staging1/qa_leave/industry_copy_5eb3f0f88f5fe.csv'
iam_role ''
IGNOREBLANKLINES
delimiter '\^'
MAXERROR 1
But when I look at the Redshift console, it shows 'Aborted'.
I am not able to understand why it is not working. Can anyone help?

Error in uploading file in S3 using boto 3

I am trying to upload a file to S3 using boto3. I tried the code below:
import boto3
s3 = boto3.resource('s3')
buck_name = s3.create_bucket(Bucket='trubuckboto')
s3.Object('trubuckboto', 'tlearn.txt').upload_file(
    Filename='G:\tlearn.txt')
The bucket creation is successful, but I am not able to upload the file from G:\tlearn.txt into that bucket. Below is the error I am getting:
return os.stat(filename).st_size
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'G:\tlearn.txt'
Can someone suggest what I am missing here?
In Python strings, the backslash "\" is a special character, also called the "escape" character. If you want a literal backslash then you need to escape the escape character, for example G:\\tlearn.txt:
import boto3
s3 = boto3.resource('s3')
# buck_name = s3.create_bucket(Bucket='trubuckboto')
s3.Object('trubuckboto', 'tlearn.txt').upload_file(
    Filename='G:\\tlearn.txt')
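An equivalent fix is a raw string literal, which turns off backslash escaping altogether:

import boto3

s3 = boto3.resource('s3')
# r'...' keeps the backslash literally, so no double escaping is needed.
s3.Object('trubuckboto', 'tlearn.txt').upload_file(Filename=r'G:\tlearn.txt')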

Copying txt file to Redshift

I am trying to copy a text file from S3 to Redshift using the command below but keep getting the following error.
Error:
Missing newline: Unexpected character 0xffffffe2 found at location 177
copy table from 's3://abc_def/txt_006'
credentials '1234567890'
DELIMITER '|'
NULL AS 'NULL'
NULL AS '' ;
The text file has no header and the field delimiter is |.
I tried passing the ACCEPTINVCHARS parameter, but Redshift shows the same error:
1216 error code: invalid input line.
Can anyone advise how to resolve this issue?
Thanks in advance.
Is your file in UTF-8 format? If not, convert it and try reloading.
I am assuming the path to the text file is correct, and that you generated the text file with some tool and uploaded it manually.
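If the file turns out not to be UTF-8, a minimal re-encoding sketch in Python (the source encoding below is an assumption; replace cp1252 with whatever actually produced the file):

# Re-encode the extract to UTF-8 before uploading it to S3.
# 'cp1252' is only a guess at the source encoding - adjust as needed.
with open('txt_006', encoding='cp1252') as src, \
        open('txt_006_utf8', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)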
I faced the same issue, and the problem was whitespace. I recommend generating the text file after nulling and trimming the whitespace.
Your query should be select RTRIM(LTRIM(NULLIF({columnname}, ''))), ..., from {table}; generate the output of this query into the text file.
If you are using SQL Server, export the table using BCP.exe, passing the above query with all the columns and functions.
Then use the COPY command below after uploading the txt file to S3:
copy {table}
from 's3://{path}.txt'
access_key_id '{value}'
secret_access_key '{value}'  -- you can alternatively use CREDENTIALS as mentioned above
delimiter '|' COMPUPDATE ON
removequotes
acceptinvchars
emptyasnull
trimblanks
BLANKSASNULL
FILLRECORD
;
commit;
This solved my problem. Please let us know if you are facing anything else.