I have Parquet files that I need to load into Redshift with the COPY command. The COPY fails with a Spectrum Scan Error, so I want to skip any file that causes an error.
Is there any way to ignore bad records (something like the MAXERROR option) in the Redshift COPY command when loading Parquet files?
COPY <targettablename> from '<s3 path>' iam_role 'arn:aws:iam::1232432' format as parquet maxerror 250
Error: MAXERROR argument is not supported for PARQUET based COPY
To copy data from a Parquet file into Redshift, use the following format:
COPY SchemaName.TableName
FROM 's3://buckets/file path'
access_key_id 'Access key id details' secret_access_key 'Secret access key details'
FORMAT AS PARQUET
STATUPDATE OFF
You get a Spectrum Scan Error when there is a discrepancy between the source column data types and the destination column data types; to fix it, change the data types to match Redshift's supported data types.
To check the errors, you can refer to this query:
SELECT * FROM SVL_S3LOG WHERE query = <query_id>;
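One way to spot such a mismatch is to inspect the Parquet file's schema directly and compare it with the target table's column types. Below is a minimal sketch using pyarrow on a locally downloaded copy of one of the files; the file name is a placeholder, not something from the question.

```python
# Sketch: print a Parquet file's column names and types so they can be
# compared against the Redshift table definition.
# Assumes pyarrow is installed and one file has been downloaded locally;
# "part-00000.parquet" is a placeholder name.
import pyarrow.parquet as pq

schema = pq.read_schema("part-00000.parquet")
for field in schema:
    print(field.name, field.type)  # compare these with the target column types
```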
I'm creating a table in Athena and specifying the format as PARQUET, but the file extension is not being recognized in S3. The type is displayed as "-", meaning the file extension is not recognized, even though I can read the files (written from Athena) successfully in a Glue job using:
df = spark.read.parquet("s3://xxxxxxxxx/TEST TABLE/")
Here is my statement:
CREATE EXTERNAL TABLE IF NOT EXISTS test (
numeric_field INT
,numeric_field2 INT)
STORED AS PARQUET
LOCATION 's3://xxxxxxxxx/TEST TABLE/'
TBLPROPERTIES ('classification'='PARQUET');
INSERT INTO test
VALUES (10,10),(20,20);
I'm specifying the format as PARQUET, but when I check the S3 bucket the file type is displayed as "-". Also, when I check the Glue catalog, the table type is set as 'unknown'.
[Screenshot: S3 console showing the object type as "-"]
I expected the type to be recognized as "parquet" in the S3 bucket.
After contacting AWS Support, it was confirmed that with CTAS queries Athena does not create file extensions for Parquet files.
"Further to confirm this, I do see the Knowledge Center article [1] where CTAS generates the Parquet files without extension ( Under section 'Convert the data format and set the approximate file size' Point 5)."
However, the files written from Athena are readable even without the extension.
Reference:
[1] https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/
Workaround: I created a function to change the file extension, basically iterating over the files in the S3 bucket and writing the contents back to the same location with a .parquet file extension, as sketched below.
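A minimal sketch of that workaround, assuming boto3 with credentials already configured; the bucket name and prefix are taken from the placeholder path in the question and are not real values.

```python
# Sketch of the rename workaround: copy each extensionless object to a new
# key ending in ".parquet" and delete the original.
import boto3

s3 = boto3.client("s3")
bucket = "xxxxxxxxx"      # placeholder bucket from the question
prefix = "TEST TABLE/"    # placeholder prefix from the question

resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in resp.get("Contents", []):
    key = obj["Key"]
    if key.endswith("/") or key.endswith(".parquet"):
        continue  # skip folder markers and files that already have the extension
    s3.copy_object(Bucket=bucket,
                   Key=key + ".parquet",
                   CopySource={"Bucket": bucket, "Key": key})
    s3.delete_object(Bucket=bucket, Key=key)
```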
I have five JSON files in one folder in Amazon S3. I am trying to load all five files from S3 into Redshift using the COPY command. I am getting an error while loading one of the files. Is there any way in Redshift to skip that file and load the next one?
Use the MAXERROR parameter in the COPY command to increase the number of errors permitted. This will skip over any lines that produce errors.
Then, use the STL_LOAD_ERRORS table to view the errors and diagnose the data problem.
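A minimal sketch of both steps, assuming psycopg2 and placeholder connection details, IAM role, bucket, and table names:

```python
# Sketch: run the COPY with MAXERROR so bad records are skipped,
# then inspect STL_LOAD_ERRORS to see what was rejected and why.
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",  # placeholder
                        port=5439, dbname="dev", user="awsuser", password="...")
cur = conn.cursor()

cur.execute("""
    COPY my_schema.my_table
    FROM 's3://my-bucket/json-folder/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS JSON 'auto'
    MAXERROR 100;
""")
conn.commit()

# Most recent load errors: which file, which line, which column, and why.
cur.execute("SELECT filename, line_number, colname, err_reason "
            "FROM stl_load_errors ORDER BY starttime DESC LIMIT 20;")
for row in cur.fetchall():
    print(row)
```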
I have set up a Kinesis Firehose that passes data through a Glue job, which transforms the JSON to Parquet, compresses it, and stores it in an S3 bucket. The transformation is successful and I can query the output file normally with Apache Drill. However, I cannot get Athena to work: doing a table preview (select * from s3data limit 10) returns results with the proper column headers, but the data is empty.
Steps I have taken:
I already added the newline to my source: JSON.stringify(event) + '\n';
Downloaded the Parquet file and queried it successfully with Apache Drill
Glue puts the Parquet files in YY/MM/DD/HH folders. I have tried moving a Parquet file to the root folder and I get the same empty results.
The end goal is to eventually get the data into QuickSight, so if I'm going about this wrong, let me know.
What am I missing?
I am trying to sync a table from MySQL RDS to Redshift through Data Pipeline.
There was no issue copying the data from RDS to S3, but while copying from S3 to Redshift the following issue is seen:
amazonaws.datapipeline.taskrunner.TaskExecutionException: java.lang.RuntimeException: Unable to load data: Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SS]
On inspecting the data, it appears that while copying to S3 an extra "0" is appended to the end of the timestamp, i.e. 2015-04-28 10:25:58 from the MySQL table is written as 2015-04-28 10:25:58.0 in the CSV file, which causes the issue.
I also tried copying with the COPY command using the following:
copy XXX
from 's3://XXX/rds//2018-02-27-14-38-04/1d6d39b9-4aac-408d-8275-3131490d617d.csv'
iam_role 'arn:aws:iam::XXX:role/XXX' delimiter ',' timeformat 'auto';
but I still get the same issue.
Can anyone help me sort this out?
Thanks in advance.
For a given DataFrame (df) we get the schema via df.schema, which is a StructType. Can I save just this schema onto HDFS while running from spark-shell? Also, what would be the best format in which to save the schema?
You can use treeString:
schema = df._jdf.schema().treeString()
and convert it to an RDD and use saveAsTextFile:
sc.parallelize([schema]).saveAsTextFile(...)
Or use saveAsPickleFile:
temp_rdd = sc.parallelize([schema])  # wrap in a list so the whole string is one record
temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")
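As for the format question, another option is JSON, since a StructType can be serialized with schema.json() and rebuilt later with StructType.fromJson. A minimal sketch, assuming a live sc and df in the shell; the HDFS paths are placeholders.

```python
# Sketch: save the schema as a JSON string on HDFS and rebuild a StructType from it later.
import json
from pyspark.sql.types import StructType

schema_json = df.schema.json()
sc.parallelize([schema_json]).coalesce(1).saveAsTextFile("hdfs:///schemas/my_table_schema")

# Later: read the JSON back and reconstruct the StructType.
restored_json = sc.textFile("hdfs:///schemas/my_table_schema").first()
restored_schema = StructType.fromJson(json.loads(restored_json))
```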
Yes, you can save the schema (it is stored together with the data) with df.write.format("parquet").save("path")
# Give the path as an HDFS path
You can also read it back from HDFS with sqlContext.read.parquet("Path")  # Give the HDFS path
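A minimal sketch of that round trip, assuming df and sqlContext already exist in the shell; the path is a placeholder.

```python
# Sketch: write the DataFrame as Parquet to HDFS (the schema travels with the data),
# then read it back and print the recovered schema.
df.write.format("parquet").save("hdfs:///data/my_table")

restored = sqlContext.read.parquet("hdfs:///data/my_table")
restored.printSchema()
```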
Parquet + compression is the best storage strategy, whether the data resides on S3 or not.
Parquet is a columnar format, so it performs well without iterating over all columns.
Please also refer to this link: https://stackoverflow.com/questions/34361222/dataframe-to-hdfs-in-spark-scala