I'm creating an app that works with AWS Athena on compressed Parquet (SNAPPY) data.
It works almost fine; however, after every query execution, two files get uploaded to the S3_OUTPUT_BUCKET: a .csv file and a .metadata file (as they should).
These two files break the execution of the next query.
I get the following error:
HIVE_BAD_DATA: Not valid Parquet file: s3://MY_OUTPUT_BUCKET/logs/QUERY_NAME/2022/08/07/tables/894a1d10-0c1d-4de1-9e61-13b2b0f79e40.metadata expected magic number: PAR1 got: HP
I need to manually delete those files for the next query to work.
Any suggestions on how to make this work?
(I know I cannot exclude those files with a regex, etc., but I don't want to have to delete the files manually for the app to work.)
I read everything about the output files ("Working with query results, recent queries, and output files"), but it didn't help.
Any help is appreciated.
When setting up Athena, you need to specify where the metadata and CSV files from query execution are written. This must be a different location than the table location.
Go to Athena Query Editor > Settings > Manage
and change the Query Result Location to a different S3 bucket than the table, or to a different folder within the same bucket.
I am getting an issue when loading my file: I have backslashes in my CSV file.
What delimiter and options can I use in my COPY command so that I don't get an
error loading the data from S3 to Redshift?
I tried the QUOTE keyword, but it gave me a syntax error, so it seems the new format
doesn't accept the QUOTE keyword.
Can anyone provide a correct command, or do I need to clean or preprocess my data before uploading it to S3?
If the data size is too big, preprocessing might not be a very feasible solution.
If I do have to preprocess it, should I use PySpark or Python (pandas)?
Below is the COPY command I am using to copy data from S3 to Redshift.
I tried passing a QUOTE option in the COPY command, but it no longer seems to accept it,
and there is no example in the Amazon docs on how to achieve this.
It would help if someone could suggest a command that can handle special characters while loading
the data.
COPY redshifttable from 'mys3filelocation'
CREDENTIALS 'aws_access_key_id=myaccess_key;aws_secret_access_key=mysecretID'
region 'us-west-2'
CSV
DATASET:
US063737,2019-11-07T10:23:25.000Z,richardkiganga,536737838,Terminated EOs,"",f,Uganda,Richard,Kiganga,Business owner,Round Planet DTV Uganda,richardkiganga,0.0,4,7.0,2021-06-1918:36:05,"","",panama-
Disc.s3.amazon.com/photos/…,\"\",Mbale,Wanabwa p/s,Eastern,"","",UACE Certificate,"",drive.google.com/file/d/148dhf89shh499hd9303-JHBn38bh/… phone,Mbale,energy_officer's_id_type,letty
mainzi,hakuna Cell,Agent,8,"","",4,"","","",+647739975493,Feature phone,"",0,Boda goda,"",1985-10-12,Male,"",johnatlhnaleviski,"",Wife
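Not a definitive fix, but for reference, here is a sketch of two COPY variants that I believe are valid Redshift syntax: QUOTE is only accepted as part of the CSV clause (CSV QUOTE AS '...'), while the ESCAPE / REMOVEQUOTES form (which cannot be combined with CSV) treats backslashes in the input as escape characters. The S3 path and IAM role ARN below are placeholders.

-- Variant 1: CSV format; QUOTE must be attached to the CSV clause.
COPY redshifttable
FROM 's3://my-bucket/my-file.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
REGION 'us-west-2'
CSV QUOTE AS '"';

-- Variant 2: plain delimited format; ESCAPE treats a backslash in the input as an
-- escape character, and REMOVEQUOTES strips surrounding quotation marks.
COPY redshifttable
FROM 's3://my-bucket/my-file.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
REGION 'us-west-2'
DELIMITER ','
REMOVEQUOTES
ESCAPE;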
I'm running a SELECT Athena query on an S3 bucket manifest. I then want to use the results of that query, in .csv format, in an S3 Batch operation.
My query runs fine and I am able to access the .csv output via S3 Batch, but since the first row is actually column headers, S3 Batch throws an unrecoverable error because it thinks that the manifest is now referring to multiple buckets.
How can I easily strip the column headers out of my results? I would prefer to just do it in SQL. The file size makes using standard unix tools prohibitive. I could use AWS Glue, but this seems like overkill for just suppressing headers in a SQL query.
Here's a hacky way to get around it
SELECT bucket AS "my-bucket-name", key AS "fakekey"
FROM your_athena_table
This makes the header row look like the rest of the file, so it will not break the S3 Batch copy job. You will just have one failed record, for fakekey.
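Not part of the original answer, but if your Athena engine version supports UNLOAD, that statement writes query results to S3 without a header row, which would avoid the fake-header trick entirely. A rough sketch (the output location and table name are placeholders, and I have not verified the output against S3 Batch's exact manifest requirements):

-- Writes the result rows as comma-delimited text files with no header row.
UNLOAD (SELECT bucket, key FROM your_athena_table)
TO 's3://my-output-bucket/manifest-without-header/'
WITH (format = 'TEXTFILE', field_delimiter = ',')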
I am trying to sync a table from MySQL RDS to Redshift through Data Pipeline.
There was no issue copying the data from RDS to S3, but while copying from S3 to Redshift the following issue appears:
amazonaws.datapipeline.taskrunner.TaskExecutionException: java.lang.RuntimeException: Unable to load data: Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SS]
Looking at the data, it appears that while copying to S3 an extra "0" is appended to the end of the timestamp, i.e. 2015-04-28 10:25:58 from the MySQL table is copied as 2015-04-28 10:25:58.0 into the CSV file, which causes the issue.
I also tried copying with the COPY command using the following:
copy XXX
from 's3://XXX/rds//2018-02-27-14-38-04/1d6d39b9-4aac-408d-8275-3131490d617d.csv'
iam_role 'arn:aws:iam::XXX:role/XXX' delimiter ',' timeformat 'auto';
but I still get the same issue.
Can anyone help me sort out this issue?
Thanks in advance.
I have a large number of files in an S3 bucket that I usually import into Redshift. Since the number of files is large, I need a column in the Redshift table that contains the source file name from the S3 location.
Is there any way to achieve this?
I agree with Ketan that this is currently not possible in Redshift. If this is what you want to achieve, it is possible through either of the following:
Reading the S3 files programmatically, writing new S3 files with the file name as a column, and loading the new files.
Alternatively, using Hive: create an external table on the S3 bucket location, use INPUT__FILE__NAME to get the file names, create a new table, and then write back to S3. You can also do some pre-processing in Hive (see the sketch below).
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
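A rough HiveQL sketch of that second option (the table names, columns, and S3 locations are placeholders):

-- External table over the existing S3 files (the schema here is only illustrative).
CREATE EXTERNAL TABLE raw_events (
    col1 STRING,
    col2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/raw/';

-- Target table written back to S3 with the source file name as an extra column.
CREATE EXTERNAL TABLE events_with_source (
    col1 STRING,
    col2 STRING,
    source_file STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/with-source/';

-- INPUT__FILE__NAME is a Hive virtual column holding the file each row was read from.
INSERT OVERWRITE TABLE events_with_source
SELECT col1, col2, INPUT__FILE__NAME
FROM raw_events;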
Hope this helps.
That isn't possible. During a Copy operation, Redshift only loads file contents into a table; it doesn't provide access to S3 file names.
To achieve what you want, you need to preprocess the data to add additional information inside the files.