I would like to run SQL queries, using Athena, against some files stored in S3. However, I'm having some difficulties due to the structure of the files. The structure is always the same, but the indentation can differ from file to file. Some examples are listed below.
There are tons of files formatted like this and I need to run basic SQL queries on them. I only have read access to the S3 buckets and would like to avoid duplicating whole buckets just to change the indentation so that Athena can read them.
Do you have any idea how this might be achieved?
I tried creating the table both directly via Athena and via Glue, but it doesn't work. When creating it via Athena I get the error Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 48 [character 49 line 1], which seems to be caused by the fact that the files start with an array.
Example file 1:
[{"name":"sample name1", "category":"category 1"},{"name":"sample name2", "category":"category 2"}]
Example file 2:
[{"name":"sample name3", "category":"category 1"},
{"name":"sample name4", "category":"category 3"}]
Example file 3:
[
{"name":"sample name1", "category":"category 1"},
{"name":"sample name5", "category":"category 3"}
]
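For context, a table definition along these lines reproduces that error, because the JSON SerDes Athena uses expect one JSON object per line rather than a top-level array (the SerDe choice and bucket path below are assumptions, not taken from the actual setup):
-- Minimal sketch of the kind of DDL that hits the error above.
-- The OpenX JSON SerDe and the S3 location are placeholders/assumptions.
CREATE EXTERNAL TABLE samples (
  name string,
  category string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-read-only-bucket/samples/';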
Related
I have read about a similar issue here but was not able to understand whether it has been fixed.
Google BigQuery export table to multiple files in Google Cloud Storage and sometimes one single file
I am using the BigQuery EXPORT DATA OPTIONS statement below to export the data from 2 tables into a file. I have written a SELECT query for this.
EXPORT DATA OPTIONS(
  uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_'||CURRENT_DATE()||'*.csv',
  format='CSV',
  overwrite=true,
  header=true,
  field_delimiter='|') AS
SELECT
Only 2 rows are returned by my SELECT query, so I assumed that only one file should be created in Google Cloud Storage; multiple files are created only when the data is more than 1 GB. That's my understanding.
However, 3 files got created in Cloud Storage: 2 files contain just the header record, and the third file has 3 records (one header and 2 actual data records).
radhika_sharma_ibm#cloudshell:~ (whr-asia-datalake-nonprod)$ gsutil ls gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000000.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000001.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000002.csv
Why are empty files getting created?
Can anyone please help? We don't want empty files to be created. I believe only one file should be created when the data is under 1 GB; above 1 GB we should get multiple files, but NOT empty ones.
You have to force all the data to be processed by one worker. That way you will export only one file (if it is < 1 GB).
My workaround: add a SELECT DISTINCT * on top of the SELECT statement.
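A sketch of that workaround applied to the statement from the question (the table below is a hypothetical stand-in for the original SELECT query):
EXPORT DATA OPTIONS(
  uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_'||CURRENT_DATE()||'*.csv',
  format='CSV',
  overwrite=true,
  header=true,
  field_delimiter='|') AS
SELECT DISTINCT *                  -- DISTINCT pushes everything through a single worker
FROM mydataset.customer_master;    -- hypothetical table standing in for the original query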
Under the hood, BigQuery uses multiple workers to read and process different sections of the data, and when we use a wildcard URI, each worker creates a separate output file.
Currently BigQuery produces an output file even when a worker has no data to write, and that is why we get multiple empty files. The BigQuery product team is aware of this issue and is working on a fix; however, there is no ETA that can be shared.
There is a public issue tracker that will be updated with periodic progress. You can STAR the issue to receive automatic updates and give it traction by referring to this link.
However, for the time being I would like to offer a workaround:
If you know that the output will be less than 1 GB, you can normally specify a single URI to get a single output file; however, the EXPORT DATA statement doesn't support a single URI.
You can use the bq extract command to export the BQ table.
bq --location=location extract \
--destination_format format \
--compression compression_type \
--field_delimiter delimiter \
--print_header=boolean \
project_id:dataset.table \
gs://bucket/filename.ext
In fact, bq extract should not have the empty-file issue that the EXPORT DATA statement has, even when you use a wildcard URI.
I faced the same empty files issue when using EXPORT DATA.
After doing a bit of R&D I found a solution: put LIMIT xxx in your SELECT SQL and it will do the trick.
You can find the row count and use that as the LIMIT value.
SELECT ....
FROM ...
WHERE ...
LIMIT xxx
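In terms of the EXPORT DATA statement, the LIMIT simply goes at the end of the exported query; a sketch with a hypothetical table and a known count of 2 rows:
EXPORT DATA OPTIONS(
  uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_'||CURRENT_DATE()||'*.csv',
  format='CSV',
  overwrite=true,
  header=true,
  field_delimiter='|') AS
SELECT *
FROM mydataset.customer_master   -- hypothetical table standing in for the original query
LIMIT 2;                         -- set to the known row count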
It turns out you need to allow for multiple files by using the wildcard syntax: a wildcard file name for CSV, or a folder for other formats like Avro.
The uri option must be a single-wildcard URI as described
https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements
Specifying a wildcard seems to start several workers on the extract, and, as per the documentation, the size of the exported files will vary.
Zero-length files are unusual but technically possible if the first worker finishes before any of the others really get started. That is why the wildcard is expected to be used only when you think your exported data will be larger than 1 GB.
I have just faced the same issue with Parquet but found that the bq CLI works, which should do the trick for any format.
See (and star for traction) https://issuetracker.google.com/u/1/issues/181016197
I used Athena's CTAS and INSERT commands, and Avro files were created at the external_location.
But the file names are very strange and the filename extension has also disappeared. (The files don't have any extension; they only have strange names that look like hash codes.)
How can I define a filename rule for Athena's output files?
Thank you.
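For reference, a minimal CTAS of the kind described might look like this (the database, table, and bucket names are placeholders); the files it writes under external_location get auto-generated names with no extension:
CREATE TABLE mydb.events_avro
WITH (
  format = 'AVRO',
  external_location = 's3://my-bucket/events_avro/'   -- placeholder bucket/path
) AS
SELECT name, category
FROM mydb.events;   -- placeholder source table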
As stated on page 20 of the AWS Athena manual: "...This location in Amazon S3 comprises all of the files representing your table. For more information, see Using Folders in the Amazon Simple Storage Service Console User Guide..."
Reference:
https://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf
So, no, you can't define the name of the file (or files, because more than one may be needed to represent a table). The right way to think about it is that the bucket/path is what represents the output table, not a file name.
We might get confused because you're generating an Avro file, which really is a file, like Parquet, but remember that Athena can also output other formats, which may be multi-file.
What happens when, after creating a table in AWS Athena for files on S3, the structure of those files changes?
For eg:
If the files had 5 columns when the table was created, and later new files start arriving with 1 more column:
a) at the end?
b) in between?
What happens when some columns are not available in new files?
What happens when the columns remain the same but the column order changes?
Can we alter Athena tables to adjust to these changes?
1 - Athena is not a NoSQL solution, and it does not have a dynamic schema either. If you change the schema, all the files in a particular folder should reflect that change; Athena won't magically update the table to include the new column.
2 - Then it'll be a problem and it'll break. You should include NULL values (i.e. an empty field, ,,) to force it to stay aligned.
3 - Athena maps data by column order, not really by name. If your column order changes, it'll probably break (different types).
4 - Yes. You can always easily recreate an Athena table by dropping it and creating a new one.
If you have files with varying schemas, you should place them in different folders so that each folder represents one consistent schema. You can then unify them later in Athena with a union or similar to create a condensed, simplified view that applies the consistent schema, as sketched below.
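A sketch of that union approach, with hypothetical table and column names (events_v1 covers the folder with the original 5 columns, events_v2 the folder with the extra one):
CREATE OR REPLACE VIEW mydb.events_all AS
SELECT col1, col2, col3, col4, col5,
       CAST(NULL AS varchar) AS col6   -- old files have no col6
FROM mydb.events_v1                    -- table over the 5-column folder
UNION ALL
SELECT col1, col2, col3, col4, col5, col6
FROM mydb.events_v2;                   -- table over the 6-column folder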
It depends on the file format you are using and on the setup (whether the schema is resolved by field order or by field name). All the details are here: https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html
Take note that if the data is nested or in arrays, schema updates will completely break your data; to quote from that page:
Schema updates described in this section do not work on tables with complex or nested data types, such as arrays and structs.
I'm loading data from zipped flat files into Redshift using the COPY command, and I would like to understand whether there is any way to validate that the column order of the files is correct (for example, if all fields are varchar, the data could silently be loaded into the wrong columns).
The COPY command documentation shows that you can specify the column order, but not for flat files. I was wondering whether there are any other approaches that would let me check how the columns have been supplied (for example, loading only the header row into a dummy table to check, but that doesn't seem to be possible).
You can't really do this inside Redshift. COPY doesn't provide any options to only load a specific number of rows or perform any validation.
Your best option would be to do this in the tool where you schedule the loads. You can get the first line from a compressed file easily enough (zcat < file.z|head -1) but for a file on S3 you may have to download the whole thing first.
FWIW, the process generating the load file should be fully automated in such a way that the column order can't change. If these files are being manually prepared you're asking for all sorts of trouble.
I have a MySQL table that I'm migrating over to Redshift. The steps are pretty straightforward.
1. Export the MySQL table to CSV
2. Place the CSV into Amazon S3
3. Create a table in Redshift with the exact specifications of the MySQL table
4. Copy the CSV export into Redshift
I'm having a problem with the last step. My MySQL CSV export has headers, and I can't currently recreate it, so I'm stuck with the CSV file as it is. Step 4 gives me an error because of the headers.
Instead of changing the CSV, I would love to add a line to the COPY command to account for the headers. I've searched through AWS's documentation for copying tables, which is pretty extensive, but found nothing about headers. I'm looking for something like header = TRUE to add to the query below.
My COPY statement into Redshift right now looks like:
COPY apples FROM
's3://buckets/apples.csv'
CREDENTIALS 'aws_access_key_id=abc;aws_secret_access_key=def'
csv
;
I found the IGNOREHEADER option, but still couldn't figure out where to write it.
Pretty obvious in hindsight: just add IGNOREHEADER at the bottom. The 1 is the number of header rows you want to skip; my CSV had one row of headers.
COPY apples FROM
's3://buckets/apples.csv'
CREDENTIALS 'aws_access_key_id=abc;aws_secret_access_key=def'
csv
IGNOREHEADER 1
;
There is a parameter that the COPY command can use; refer to the documentation.
So you can do something like this using the S3ToRedshiftOperator in Airflow.
You'd want to add 'IGNOREHEADER 1' to copy_options: list[str].
To use it:
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

# copy_options are appended to the COPY statement the operator generates
copy_options_list = ["csv", "timeformat 'auto'", "IGNOREHEADER 1"]
transfer_s3_to_redshift = S3ToRedshiftOperator(
    task_id="music_story_s3_to_redshift",
    redshift_conn_id=redshift_connection_id,
    s3_bucket=s3_bucket_name,
    s3_key=s3_key,
    schema=schema_name,
    table=redshift_table,
    column_list=cols_list,
    copy_options=copy_options_list,
    dag=dag,
)
The copy instruction then becomes:
COPY <schema.table> (column1, column2, column3…)
FROM 's3://<BUCKET_NAME>/<PATH_TO_YOUR_S3_FILE>'
credentials
'aws_access_key_id=<>;aws_secret_access_key=<>;token=<>'
csv
timeformat 'auto'
IGNOREHEADER 1;