Can you create views in Amazon Athena? - amazon-athena

Is it possible to create views in Amazon Athena?
Since an External table is essentially metadata for data stored in files on S3, there's no transformation involved. Therefore, you can't handle data inconsistencies. Quite often, this can result in tables being defined with lots of string fields.
Can you create a view over the top of the External table that can contain the transformation logic, allowing users to query a "cleansed" view of the data?

It looks like they have added this support now: AWS Doc
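For example, the "cleansed" view from the question could cast the raw string columns to proper types. Here is a minimal sketch that builds such a CREATE VIEW statement (the view, table, and column names are hypothetical; TRY_CAST returns NULL for rows that fail to convert, which is one way to absorb data inconsistencies):

```python
def build_cleansed_view_ddl(view_name, source_table, casts):
    """Build an Athena/Presto CREATE OR REPLACE VIEW statement that casts
    string columns of a raw external table to typed columns.
    `casts` maps column name -> target type."""
    select_list = ",\n    ".join(
        f"TRY_CAST({col} AS {typ}) AS {col}" for col, typ in casts.items()
    )
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n    {select_list}\n"
        f"FROM {source_table}"
    )

ddl = build_cleansed_view_ddl(
    "cleansed_events",
    "raw_events",
    {"event_id": "BIGINT", "event_ts": "TIMESTAMP", "amount": "DECIMAL(10,2)"},
)
```

The resulting string can be submitted through the Athena console or via boto3's `athena.start_query_execution`.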

While that would be a nice feature to have,
AWS Athena does not support creating views.
Reference documentation of supported DDL statements:
http://docs.aws.amazon.com/athena/latest/ug/language-reference.html
Hope it helps.

Related

Copy tables and views from Athena

Hi, I need to copy tables and views from one Athena instance to another (I'm not using Glue). How can I do this via the AWS Console or the boto3/pyathena APIs without losing any data? I can't find anything in the documentation :(
Athena tables are built on top of files stored on S3. If you need the data, you will have to export the data.
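One way to export the data is a CTAS statement, which makes Athena rewrite the table's contents as new files under an S3 location you choose; those files (together with the DDL from `SHOW CREATE TABLE`) can then be copied to the target account. A minimal sketch of the statement (table name and S3 path are hypothetical):

```python
def build_export_ctas(table, export_location, fmt="PARQUET"):
    """Build an Athena CTAS statement that rewrites a table's data as
    new files under `export_location`, ready to copy elsewhere."""
    return (
        f"CREATE TABLE {table}_export\n"
        f"WITH (format = '{fmt}', external_location = '{export_location}')\n"
        f"AS SELECT * FROM {table}"
    )

sql = build_export_ctas("events", "s3://target-bucket/events/")
```

Run the resulting statement in Athena (console or `start_query_execution`), then sync the exported files and replay the table DDL in the other account.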

Strategy for Updating Schema/Data of Data Stored in AWS S3

At my organization, we are using a stack of AWS S3, AWS Glue, and Athena to drive some reporting of internal metrics. In general, this stack is great for quickly setting up reporting on raw data (stored in S3). The problem we've come up against is what to do if we notice we need to update data that's already stored in S3. For example, we want to find values in a column that contain a certain string and update them.
Unlike a database, we can't just run a query to update all the existing data. I've tried to see if we can utilize Glue Jobs to accomplish this, but from my limited understanding, it doesn't seem like it's meant to do ETL from a bucket back to the same bucket.
The only thing I can think is to write a custom tool that iterates through an S3 bucket, loads a file, provides the transformation, and puts it back, overwriting the original. It seems there has to be a better way though.
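Roughly the kind of tool described above, as a minimal sketch. It assumes uncompressed JSON-lines files under a prefix, and a hypothetical transformation; the per-record function is kept pure so it can be tested without touching S3:

```python
import json

def fix_record(record, column="status", old_value="N/A", new_value=None):
    """Pure per-record transformation, separated out so it is testable
    in isolation. Column and values are hypothetical examples."""
    if record.get(column) == old_value:
        record[column] = new_value
    return record

def rewrite_prefix(bucket, prefix):
    """Iterate every object under a prefix, transform each JSON line,
    and overwrite the original object. boto3 is imported here so the
    transform above stays testable without the AWS SDK installed."""
    import boto3

    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            records = [fix_record(json.loads(line))
                       for line in body.splitlines() if line.strip()]
            s3.put_object(
                Bucket=bucket,
                Key=obj["Key"],
                Body="\n".join(json.dumps(r) for r in records).encode("utf-8"),
            )
```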
Updates are not handled natively in a traditional Hive-like warehousing solution, which is what I consider Athena to be. A common solution is an engineering workaround where you "insert overwrite" a partition (borrowing Hive syntax; this is possible in Presto and hopefully also in Athena, which is based on Presto).
Other solutions include creating new tables and atomically replacing a view, which users are supposed to query, instead of querying the underlying table(s) directly.
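The view-swap pattern above can be sketched as a sequence of Athena statements, run in order on each reload (the table and view names here are hypothetical, and the staging source is assumed to already exist):

```python
def rebuild_and_swap_statements(view_name, version):
    """Statements for the rebuild-then-swap pattern: build a fresh table
    with CTAS, atomically repoint the view that users query, then drop
    the previous build."""
    new_table = f"{view_name}_v{version}"
    old_table = f"{view_name}_v{version - 1}"
    return [
        f"CREATE TABLE {new_table} AS SELECT * FROM {view_name}_staging",
        f"CREATE OR REPLACE VIEW {view_name} AS SELECT * FROM {new_table}",
        f"DROP TABLE IF EXISTS {old_table}",
    ]
```

Because the view replacement is a single metadata operation, readers never see a half-rebuilt table.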
As this is a common problem, there are also some ready-to-use solutions for it, but I do not know which of them, if any, work with Athena. They are certainly possible with Presto (Presto SQL):
Hive ACID transactional tables (updates currently require the Hive runtime)
Delta Lake (open sourced by Databricks; updates currently require the Spark runtime)
Hudi (I know little about this one)

Using Amazon S3 as a limited-database

I have looked into this post on s3 vs database. But I have a different use case and want to know whether s3 is enough. The primary reason for using s3 instead of other databases on cloud is because of cost.
I have multiple scrapers that download data from websites and APIs every day. Most of them return data in JSON format. Currently, I insert the results into MongoDB. I then run analyses by querying for data on a specific date, for specific fields, or for records that match certain criteria. After querying the data, I usually load it into a dataframe and do whatever is needed.
The data will not be updated; it just needs to be stored and ready for retrieval according to some criteria. I am aware of S3 Select, which may be able to handle the retrieval task.
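For reference, an S3 Select call can push a SQL filter down to a single JSON-lines object. A minimal sketch of the request arguments (bucket, key, and fields are hypothetical; pass them as keyword arguments to a boto3 S3 client):

```python
def s3_select_args(bucket, key, sql):
    """Arguments for s3.select_object_content (S3 Select) over a
    JSON-lines object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Expression": sql,
        "ExpressionType": "SQL",
        "InputSerialization": {"JSON": {"Type": "LINES"}},
        "OutputSerialization": {"JSON": {}},
    }

args = s3_select_args(
    "scraper-data",
    "2020/01/15/site_a.json",
    "SELECT s.id, s.price FROM S3Object s WHERE s.category = 'books'",
)
```

With boto3 this would be used as `boto3.client("s3").select_object_content(**args)`, reading the matched rows from the `Records` events in the streamed response. Note that S3 Select operates on one object per call, so scanning many files still means one request per file.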
Any recommendations?
Given the use cases you have mentioned, it seems that you are not using MongoDB's capabilities (or any database capabilities, for that matter) to any great degree.
I think S3 suits your use cases well; in fact, you should go for S3 Infrequent Access with a lifecycle policy to archive and eventually purge objects, to be cost efficient.
Hope it helps!
I think your code will be more efficient if you use DynamoDB with all of its features. Using S3 as a database or data store will make your code more complex, since you need to retrieve the file from S3 every time and iterate through it every time. With DynamoDB you can easily query and filter for exactly the data you need. In the end, S3 is a file store and DynamoDB is a database.
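The "query and filter" part might look like this, assuming a hypothetical table keyed on (`site`, `scraped_at`) where the sort key starts with an ISO date. The helper builds the arguments for `boto3.client("dynamodb").query`:

```python
def daily_scrape_query_args(table_name, site, day):
    """Arguments for dynamodb.query: fetch one scraper's records for a
    given day. Table schema and attribute names are hypothetical."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "site = :site AND begins_with(scraped_at, :day)",
        "ExpressionAttributeValues": {
            ":site": {"S": site},
            ":day": {"S": day},
        },
    }

args = daily_scrape_query_args("scraped_items", "site_a", "2020-01-15")
```

Because this is a key-condition query (not a scan), DynamoDB reads only the matching items rather than iterating every stored record, which is the efficiency difference the answer is pointing at.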

Are Amazon Athena views actually hive views, or are they a separate bolt-on?

Amazon Athena is based on Presto. Amazon Athena supports views.
Presto does not support Hive views because it doesn't want to deal with Hive Query Language. Since a Hive view is stored as a Hive query, Presto would have to understand Hive's entire query language rather than just its schema. Presto instead supports its own views via its Hive connector. These "Presto views" are Presto-specific and cannot be queried from Hive.
Does Athena support Hive views under the covers? Or are Athena views an entirely separate layer/bolt-on that just saves named Presto/Athena queries?
To the best of my knowledge they are Presto views. I've dug into how views are saved in the Glue catalog, and talked to the Athena team about why it's done the way it is. I'm no expert on what makes something a Presto view vs. a Hive view, but Athena is not doing anything on top of Presto when it comes to views.
When you create a view in Athena, it creates a table in Glue of type VIRTUAL_VIEW whose TableInput.ViewOriginalText has a very special structure (see below). Parameters also needs to contain presto_view: true.
The structure in TableInput.ViewOriginalText looks like this: /* Presto View: <BASE64 DATA> */, where the payload is a base64-encoded JSON structure that describes the view. The value of TableInput.ViewOriginalText is produced by Presto (see https://github.com/prestosql/presto/blob/27a1b0e304be841055b461e2c00490dae4e30a4e/presto-hive/src/main/java/io/prestosql/plugin/hive/HiveUtil.java#L597-L600).
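Decoding that structure is straightforward: strip the comment markers and base64-decode the JSON payload. A minimal sketch (the sample view definition at the end is hypothetical but follows the shape described above, with `originalSql`, `catalog`, `schema`, and `columns` keys):

```python
import base64
import json

PREFIX, SUFFIX = "/* Presto View: ", " */"

def decode_presto_view(view_original_text):
    """Decode the JSON view definition that Athena stores in Glue's
    TableInput.ViewOriginalText for a VIRTUAL_VIEW table."""
    if not (view_original_text.startswith(PREFIX)
            and view_original_text.endswith(SUFFIX)):
        raise ValueError("not a Presto view definition")
    payload = view_original_text[len(PREFIX):-len(SUFFIX)]
    return json.loads(base64.b64decode(payload))

# Round-trip example with a minimal, hypothetical view definition:
definition = {
    "originalSql": "SELECT 1 AS x",
    "catalog": "awsdatacatalog",
    "schema": "default",
    "columns": [{"name": "x", "type": "integer"}],
}
encoded = PREFIX + base64.b64encode(json.dumps(definition).encode()).decode() + SUFFIX
```

In practice the input would come from `glue.get_table(...)["Table"]["ViewOriginalText"]`.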
If the question is whether or not views created in Athena can be used by other tools that connect to the Glue catalog I think the answer is no. The way they are encoded is Presto-specific.

AWS AppSync multiple DynamoDB requests in one DynamoDB resolver

I would like to know if it is possible to make multiple DynamoDB requests using only one DynamoDB resolver in AppSync.
Or is a Lambda function the only/best way to do more complicated processing?
Practically, no. You cannot even query multiple indices in a single resolver's resource definition.
However, if you are using that structure to join multiple DynamoDB tables, you can attach resolvers not to the query entry but to the fields that relate one type to another.
I had a similar issue, relating users to another table containing their posts, and I solved it by attaching a resolver to the Posts field of the User type.
This issue describes a similar problem and is quite helpful for this kind of case: https://github.com/awslabs/aws-mobile-appsync-sdk-js/issues/17
If that's not your case, you can elaborate on the question; I may just be guessing your purpose in relating the tables.
Have you looked at batch resolvers with AWS AppSync? https://docs.aws.amazon.com/appsync/latest/devguide/tutorial-dynamodb-batch.html
This will allow you to write to one or more tables in a single request, and also allow you to do multiple write/read/delete operations in a single request.
You can do it with pipeline resolvers:
https://docs.aws.amazon.com/appsync/latest/devguide/tutorial-pipeline-resolvers.html