I am new to the AWS ecosystem. I am building an application that queries data using AWS Athena. The data is transformed from JSON into Parquet using AWS Glue and stored in S3.
My use case now is to update that Parquet data using SQL.
Can we update the underlying Parquet data with an AWS Athena SQL command?
No, it is not possible to use UPDATE in Amazon Athena.
Amazon Athena is a query engine, not a database. It performs queries on data that is stored in Amazon S3. It reads those files, but it does not modify or update those files. Therefore, it cannot 'update' a table.
The closest capability is using CREATE TABLE AS to create a new table. You can provide a SELECT query that uses data from other tables, so you could effectively modify information and store it in a new table, and tell it to use Parquet for that new table. In fact, this is an excellent way to convert data from other formats into Snappy-compressed Parquet files (with partitioning, if you wish).
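As a rough sketch of the CREATE TABLE AS approach described above, the following Python helper builds a CTAS statement that writes Snappy-compressed Parquet. The helper name build_ctas, the table names and the S3 location are made up for illustration; the WITH properties used (format, write_compression, external_location, partitioned_by) are Athena CTAS properties, but check them against your Athena engine version before relying on them.

```python
def build_ctas(new_table, select_sql, external_location, partitioned_by=()):
    """Return an Athena CTAS statement that stores the SELECT result as Parquet."""
    props = [
        "format = 'PARQUET'",
        "write_compression = 'SNAPPY'",
        f"external_location = '{external_location}'",
    ]
    if partitioned_by:
        cols = ", ".join(f"'{c}'" for c in partitioned_by)
        props.append(f"partitioned_by = ARRAY[{cols}]")
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH ({', '.join(props)})\n"
        f"AS {select_sql}"
    )


# Hypothetical names: "events_raw" is the existing table, "events_fixed" the new one.
# The "modification" happens in the SELECT; partition columns go last in its column list.
sql = build_ctas(
    "events_fixed",
    "SELECT id, upper(status) AS status, dt FROM events_raw",
    "s3://my-example-bucket/events_fixed/",
    partitioned_by=["dt"],
)
print(sql)
```

The original table and its files are left untouched; the "updated" data lives in the new table, which you would query from then on.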
Depending on how the data is stored, you can update it from Athena using SQL UPDATE statements. See Updating Iceberg table data and Using governed tables.
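For the Iceberg case, here is a minimal sketch of what such an UPDATE might look like, built as a string in Python. It assumes the table was created as an Iceberg table; the helper name, table name and columns are hypothetical, and the same statement would be rejected for a plain external Parquet table.

```python
def build_iceberg_update(table, assignments, where):
    """Build an UPDATE statement; Athena only accepts it for Iceberg tables.

    Values are inserted verbatim (no escaping), so pass trusted SQL literals only.
    """
    set_clause = ", ".join(f"{col} = {expr}" for col, expr in assignments.items())
    return f"UPDATE {table} SET {set_clause} WHERE {where}"


# Hypothetical table and columns, for illustration only.
update_sql = build_iceberg_update(
    "prices_iceberg",
    {"status": "'closed'"},
    "item_id = 42",
)
print(update_sql)  # UPDATE prices_iceberg SET status = 'closed' WHERE item_id = 42
```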
Related
I have stored changelogs (data with information about data) from non-relational, schemaless data tables in S3. Now I want a structured, relational way to query all of that data, so I need to create a database from S3. I am confused about what to do: should I use another S3 bucket or a traditional database?
You can create a Glue catalog over the data and query it using serverless Athena.
This way you are not bound to an RDBMS and can query your data whenever you need to while keeping the files in S3.
This is also cost-effective.
Or you can spin up an RDS instance in AWS at any time if required. So keeping the files in S3 is a good option.
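To make "query it using serverless Athena" a bit more concrete, here is a minimal sketch of submitting a query and waiting for it to finish. The function name run_athena_query is made up; it takes a client object that is assumed to behave like boto3.client("athena"), whose start_query_execution and get_query_execution calls it uses. The database name and output location in the usage note are placeholders.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start an Athena query and wait until it reaches a terminal state.

    `client` is expected to behave like boto3.client("athena").
    Returns (query_execution_id, final_state).
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
```

With boto3 installed and credentials configured, it could be called as run_athena_query(boto3.client("athena"), "SELECT * FROM changelogs LIMIT 10", "my_database", "s3://my-example-bucket/athena-results/"), where the table, database and bucket names are placeholders. Passing the client in keeps the function easy to try out with a stub.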
I need to store Amazon Athena query results in a new Amazon Athena table.
Updates:
Athena now supports CREATE TABLE AS SELECT (CTAS) queries. Examples are available here.
They have implemented several nice features, namely the ability to compress the output (GZIP, SNAPPY) and to specify the output format.
Athena recently added support for CREATE TABLE AS SELECT (CTAS).
My particular scenario: I expect to amass TBs or even PBs of JSON data entries that track price history for many items. New data will be written to the data store hundreds or even thousands of times per day. This data will be analyzed by Redshift and possibly AWS ML. I don't expect to query it outside of Redshift or ML.
Question: How do I decide whether to store my data in S3 or DynamoDB? I am having trouble deciding because I know that both stores are supported by Redshift, but I did notice that Redshift Spectrum exists specifically for S3 data.
Firstly, DynamoDB is far more expensive than S3. S3 is only a storage solution, while DynamoDB is a full-fledged NoSQL database.
If you want to query with Redshift itself, you have to load the data into Redshift (Redshift Spectrum, which you mention, is the exception: it queries data in S3 directly). Redshift is again an independent, full-fledged database (a warehousing solution).
You can use Athena to query data directly from S3.
I'm running a pyspark job which creates a dataframe and stores it to S3 as below:
df.write.saveAsTable(table_name, format="orc", mode="overwrite", path=s3_path)
I can read the ORC file without a problem just by using spark.read.orc(s3_path), so the schema information is in the ORC file, as expected.
However, I'd really like to view the dataframe contents using Athena. Clearly, if I wrote to my Hive metastore, I could call Hive and run show create table ${table_name}, but that's a lot of work when all I want is a simple schema.
Is there another way?
One of the approaches would be to set up a Glue crawler for your S3 path, which would create a table in the AWS Glue Data Catalog. Alternatively, you could create the Glue table definition via the Glue API.
The AWS Glue Data Catalog is fully integrated with Athena, so you would see your Glue table in Athena, and be able to query it directly:
http://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
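As an illustration of the "Glue table definition via the Glue API" route, the sketch below builds the TableInput dictionary that Glue's create_table call expects for ORC data. The helper name orc_table_input and all table, column, database and bucket names are hypothetical. It assumes the column types are already Hive-style type strings; in the PySpark job, df.dtypes returns (name, type) pairs that are often usable for simple types, but that mapping should be checked for your schema.

```python
def orc_table_input(table_name, columns, s3_path):
    """Build a TableInput dict for Glue's create_table describing ORC data on S3.

    `columns` is a list of (name, hive_type) pairs.
    """
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "orc"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_path,
            "InputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
            },
        },
    }


# Hypothetical schema and path; in the PySpark job these could be df.dtypes and s3_path.
table_input = orc_table_input(
    "my_table",
    [("item_id", "bigint"), ("price", "double"), ("name", "string")],
    "s3://my-example-bucket/path/to/table/",
)

# With boto3 installed and credentials configured (not run here):
#   import boto3
#   boto3.client("glue").create_table(DatabaseName="my_database", TableInput=table_input)
```

Compared with a crawler, this skips schema inference, so the column list has to be kept in sync with what the job actually writes.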
I am currently working with AWS Athena, and it does not support CREATE TABLE AS. That is fine, so I thought I would work around it by doing INSERT OVERWRITE DIRECTORY S3://PATH and then loading from S3, but that doesn't seem to work either. How would I create a table from a query if both of these options are out the window?
Amazon Athena is read-only. It cannot be used to create tables in Amazon S3.
However, the output of an Amazon Athena query is stored in Amazon S3 and could be used as input for another query, although you'd have to know the path of the output.
Amazon Athena is ideal for individual queries against data stored in Amazon S3, but is not the best tool for ETL actions, which typically involve transforming data, storing it and then sequentially processing it again.
You don't have to use INSERT; just create an external table over the location of the previous query results.
https://aws.amazon.com/premiumsupport/knowledge-center/athena-query-results/
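A minimal sketch of that idea, assuming the earlier query's results are CSV files with a header row and quoted fields, and that the S3 prefix used as LOCATION contains only those CSV files (Athena also writes a .metadata file next to each result, so the CSV may need to be copied to its own prefix first). The helper name, table name, columns and bucket are hypothetical, and every column is declared as string to keep the example simple.

```python
def results_table_ddl(table_name, columns, results_location):
    """Build CREATE EXTERNAL TABLE DDL over CSV query results stored in S3."""
    cols = ",\n  ".join(f"`{name}` string" for name in columns)
    return (
        f"CREATE EXTERNAL TABLE {table_name} (\n  {cols}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'\n"
        f"LOCATION '{results_location}'\n"
        "TBLPROPERTIES ('skip.header.line.count' = '1')"
    )


# Hypothetical table, columns and location, for illustration only.
ddl = results_table_ddl(
    "previous_results",
    ["item_id", "price"],
    "s3://my-example-bucket/athena-results/copied/",
)
print(ddl)
```

The resulting statement can be run in Athena like any other DDL; numeric columns can then be cast in later queries.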