Create temporary function in Amazon Athena - amazon-athena

I am doing some querying in Amazon Athena (which is using Presto from my understanding). I would like to create a temporary function in similar fashion as in Presto:
CREATE TEMPORARY FUNCTION square(x int)
RETURNS int
RETURN x * x
SELECT square(col) from table
Is it possible to do this in Athena? The only tutorial I found was hard for me to follow.

Amazon Athena provides an extension mechanism for user-defined functions (UDFs) backed by AWS Lambda. Because the function body runs in Lambda, this gives you more flexibility in what the UDF can do and lets it scale with Lambda. The tutorial you mentioned in your question explains how to develop the Lambda function, deploy it to the cloud, and then call it from Athena.
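To make the calling side concrete: as far as I know, Athena declares a Lambda-backed UDF inline in each query with a USING EXTERNAL FUNCTION clause rather than with CREATE TEMPORARY FUNCTION. Below is a minimal sketch that builds such a query as a Python string. The Lambda name my-square-udf and the table name my_table are hypothetical, and the Lambda itself still has to implement square (for example with the Athena Query Federation SDK, as the tutorial describes).

```python
def build_udf_query(lambda_name: str, table: str, column: str) -> str:
    """Return an Athena query that calls a Lambda-backed UDF named `square`.

    `lambda_name` is the name or ARN of the Lambda that implements the UDF.
    All names passed in here are placeholders for illustration only.
    """
    return (
        "USING EXTERNAL FUNCTION square(x INT) RETURNS INT\n"
        f"LAMBDA '{lambda_name}'\n"
        f"SELECT square({column}) FROM {table}"
    )


if __name__ == "__main__":
    # Hypothetical Lambda and table names.
    print(build_udf_query("my-square-udf", "my_table", "col"))
```

The resulting string can be pasted into the Athena console or submitted through the API like any other query.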

Related

Trigger a Custom Function Every X Hours in AWS

I am looking to trigger code every 1 Hour in AWS.
The code should: Parse through a list of zip codes, fetch data for each of the zip codes, store that data somewhere in AWS.
1. Is there a specific AWS service I would use to parse through the list of zip codes and call the API for each zip code? Would this be Lambda?
2. How could I schedule this service to run every X hours? Do I have to use another AWS service to call my Lambda function (assuming that's the right answer to #1)?
3. Which AWS service could I use to store this data?
I tried looking up different approaches and services in AWS. I found I could write serverless code in Lambda, which made me think it would be the answer to my first question. Then I tried to look into how that could be run every X hours, but that's where I was struggling to know if I could still use Lambda. I also wasn't sure what my options were for storing the data. I saw that Glue may be an option, but wasn't sure.
Yes, you can use Lambda to run your code (as long as the total run time is less than 15 minutes).
You can use Amazon EventBridge Scheduler to trigger the Lambda every 1 hour.
Which AWS service could I use to store this data?
That depends on the format of the data and how you will subsequently use it. Some options are:
Amazon DynamoDB for key-value, NoSQL data
Amazon Aurora for relational data
Amazon S3 for object storage
If you choose S3, you can still run SQL queries on the data using Amazon Athena.
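A minimal sketch of what the hourly Lambda could look like, assuming the S3 option. The bucket name, the zip code list, and fetch_zip_data are hypothetical placeholders; the real API call goes where the stub is. The schedule itself lives outside the code, e.g. an EventBridge Scheduler schedule with the expression rate(1 hour) targeting this function.

```python
import json
from typing import Callable, Iterable, List

ZIP_CODES = ["10001", "94105"]      # hypothetical input list
BUCKET = "my-zip-data-bucket"       # hypothetical bucket name


def process_zip_codes(
    zip_codes: Iterable[str],
    fetch: Callable[[str], dict],
    store: Callable[[str, str], None],
) -> List[str]:
    """Fetch data for each zip code and store it; return the zip codes that failed."""
    failed = []
    for zip_code in zip_codes:
        try:
            data = fetch(zip_code)
            store(f"zip-data/{zip_code}.json", json.dumps(data))
        except Exception as exc:
            # Keep going so one bad zip code does not abort the whole run.
            print(f"failed for {zip_code}: {exc}")
            failed.append(zip_code)
    return failed


def fetch_zip_data(zip_code: str) -> dict:
    """Placeholder for the real API call (assumption: it returns JSON-like data)."""
    return {"zip": zip_code}


def lambda_handler(event, context):
    import boto3  # bundled with the Lambda Python runtime

    s3 = boto3.client("s3")

    def store(key: str, body: str) -> None:
        s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))

    failed = process_zip_codes(ZIP_CODES, fetch_zip_data, store)
    return {"failed": failed}
```

Passing fetch and store in as functions keeps the loop testable locally without AWS credentials.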

replace glue with presto built-in commands

I have completed all the steps mentioned in this tutorial.
https://aws.amazon.com/blogs/big-data/improve-amazon-athena-query-performance-using-aws-glue-data-catalog-partition-indexes/
I am getting the expected results, but I would like to know if this is possible without Glue.
I would like to use only Athena (as well as S3) and nothing else to achieve the same results.
Athena uses the Glue Data Catalog to store table metadata. Technically you can use Athena Federation to not use the Glue Data Catalog, but for normal usage the Glue Data Catalog is necessary, just like S3 is.
Partition indexes are a feature of the Glue Data Catalog. Athena does not store table metadata itself, so it has no equivalent feature of its own; the only alternative would be to implement something similar yourself and expose it through Federation.
Perhaps you could explain in more detail why you don't want to use Glue Data Catalog?

Querying and updating Redshift through AWS lambda

I am using a Step Function that passes a JSON event to a Lambda (object data from an S3 upload). I have to check the JSON and compare two values in it (file name and eTag) to the data in my Redshift DB. If the entry does not exist, I have to move the file to a different bucket and add an entry to the Redshift DB (versioning). The trouble is, I do not have a good idea of how to query and update Redshift from Lambda. Can someone please suggest which methods I should adopt? Thanks!
Edit: Should've mentioned the lambda is in Python
One way to achieve this is to write the Lambda function using the Java runtime API and, within the function, use a RedshiftDataClient object. With this API you can perform CRUD operations on a Redshift cluster.
To see examples:
https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/javav2/example_code/redshift/src/main/java/com/example/redshiftdata
If you are unsure how to build a Lambda function with the Lambda Java runtime API that invokes AWS services, please refer to:
Creating an AWS Lambda function that detects images with Personal Protective Equipment
This example shows you how to develop a Lambda function using the Java runtime API that invokes AWS services. Instead of invoking Amazon S3 or Rekognition, use the RedshiftDataClient within the Lambda function to perform Redshift CRUD operations.
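Since the question mentions a Python Lambda, here is a rough Python sketch of the same idea using the Redshift Data API through boto3 (the redshift-data client, which is the counterpart of RedshiftDataClient). The table file_versions and its columns are hypothetical, as are the connection values you pass in; adapt them to your schema.

```python
import time


def run_statement(client, sql, parameters=None, *, cluster_id, database, db_user,
                  poll_seconds=0.5):
    """Submit a SQL statement through the Redshift Data API and wait for it to finish."""
    kwargs = dict(ClusterIdentifier=cluster_id, Database=database,
                  DbUser=db_user, Sql=sql)
    if parameters:
        # Named parameters are referenced in the SQL text as :name.
        kwargs["Parameters"] = [{"name": k, "value": str(v)}
                                for k, v in parameters.items()]
    statement_id = client.execute_statement(**kwargs)["Id"]
    while True:
        desc = client.describe_statement(Id=statement_id)
        if desc["Status"] == "FINISHED":
            return desc
        if desc["Status"] in ("FAILED", "ABORTED"):
            raise RuntimeError(desc.get("Error", desc["Status"]))
        time.sleep(poll_seconds)


def file_version_exists(client, file_name, etag, **conn):
    """Return True if (file_name, etag) is already recorded (hypothetical table)."""
    desc = run_statement(
        client,
        "SELECT 1 FROM file_versions "
        "WHERE file_name = :file_name AND etag = :etag LIMIT 1",
        {"file_name": file_name, "etag": etag},
        **conn,
    )
    if not desc.get("HasResultSet"):
        return False
    result = client.get_statement_result(Id=desc["Id"])
    return len(result["Records"]) > 0


def record_file_version(client, file_name, etag, **conn):
    """Insert a new (file_name, etag) row into the hypothetical versioning table."""
    run_statement(
        client,
        "INSERT INTO file_versions (file_name, etag) VALUES (:file_name, :etag)",
        {"file_name": file_name, "etag": etag},
        **conn,
    )
```

Inside the handler you would create the client with boto3.client("redshift-data") and call file_version_exists(client, name, etag, cluster_id=..., database=..., db_user=...) before deciding where to move the file. The Data API is asynchronous, which is why the sketch polls describe_statement.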

How to invoke athena triggered automatically by lambda when objects are updated in the s3 bucket?

I have the following two use cases:
Case 1. I need to call the Lambda on its own so that it invokes Athena to run a query on S3 data. Question: how do I invoke a Lambda on its own via an API?
Case 2. I need the Lambda function to invoke Athena whenever a file is copied to the S3 bucket that is already mapped to Athena.
I am following the link below to run Athena queries from Lambda.
Link:
https://dev.classmethod.jp/cloud/run-amazon-athenas-query-with-aws-lambda/
For case 2, here is an example of what I want to integrate:
The file in s3-1 is sales.csv, and I would be updating the sales details by copying data from another bucket, s3-2. The schema/columns defined for the s3-1 data would remain the same.
So when I copy a file to the S3 location that is mapped to Athena, the Lambda should call Athena to run the query.
I would appreciate it if someone could suggest a better way to achieve the above cases.
Thanks
Case 1
An AWS Lambda can be directly invoked via the invoke() command. This can be done via the AWS Command-Line Interface (CLI) or from a programming language using an AWS SDK.
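As a rough illustration of Case 1 from Python, the sketch below wraps the boto3 Lambda client's invoke call. The function name and payload are hypothetical; the client is passed in so the helper can be exercised without AWS access.

```python
import json


def invoke_lambda(lambda_client, function_name, payload):
    """Synchronously invoke a Lambda function and return its decoded JSON result."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function to return
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # The response payload is a stream; read it fully and decode the JSON body.
    return json.loads(response["Payload"].read())
```

Typical use would be invoke_lambda(boto3.client("lambda"), "my-athena-runner", {"query": "..."}), where my-athena-runner is a placeholder name.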
Case 2
An Amazon S3 event can be configured on a bucket to automatically trigger an AWS Lambda function when a file is uploaded. The event provides the bucket name and file name (object name) to the Lambda function.
The Lambda function can extract these details from the event record and can then use that information in an Amazon Athena command.
Please note that, if the file name is different each time, a CREATE TABLE command would be required before a SELECT command can query the data.
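A minimal sketch of Case 2, assuming the table already exists over the bucket location. The database name, results location, and query text are hypothetical. The handler only starts the query and returns the execution IDs rather than waiting for Athena to finish.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    return [
        # Object keys arrive URL-encoded in the event, so decode them.
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
    ]


def start_athena_query(athena, sql, database, output_location):
    """Start an Athena query and return its execution ID without waiting for it."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


def lambda_handler(event, context):
    import boto3  # bundled with the Lambda Python runtime

    athena = boto3.client("athena")
    execution_ids = []
    for bucket, key in extract_s3_objects(event):
        print(f"new object s3://{bucket}/{key}")
        execution_ids.append(
            start_athena_query(
                athena,
                "SELECT COUNT(*) FROM sales",   # hypothetical query
                "my_database",                  # hypothetical database
                "s3://my-athena-results/",      # hypothetical results location
            )
        )
    return {"query_execution_ids": execution_ids}
```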
General Comments
A Lambda function can run for a maximum of 15 minutes, so make sure the Athena queries do not take more than this time. This is not a particularly efficient use of an AWS Lambda function because it will be billed for the duration of the function call, even if it is just waiting for Athena to finish.
Another option would be to have the Lambda function process the file directly, assuming that the query is not particularly complex. For example, the Lambda function could download the file to temporary storage (512 MB by default, configurable up to 10 GB), read through the file, do some calculations (e.g. add up the total of some columns), then store the results somewhere.
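A rough sketch of that direct-processing approach, assuming the uploaded file is a CSV with a header row. The column name amount is hypothetical, and what you do with the total (store it in S3, DynamoDB, etc.) is left out.

```python
import csv
from urllib.parse import unquote_plus


def sum_column(lines, column):
    """Sum a numeric column from CSV lines (any iterable of strings, e.g. an open file)."""
    reader = csv.DictReader(lines)
    # Skip rows where the column is missing or empty.
    return sum(float(row[column]) for row in reader if row.get(column))


def lambda_handler(event, context):
    import boto3  # bundled with the Lambda Python runtime

    s3 = boto3.client("s3")
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])

    # /tmp is the Lambda function's temporary storage.
    local_path = "/tmp/input.csv"
    s3.download_file(bucket, key, local_path)

    with open(local_path, newline="") as f:
        total = sum_column(f, "amount")  # hypothetical column name
    return {"bucket": bucket, "key": key, "total": total}
```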
The next step would be to create an endpoint for your Lambda; you can use Amazon API Gateway for that.
Alternatively, you can invoke the Lambda from the AWS console or the AWS CLI in order to test it.

Best way to develop application with multiple instances in AWS

I am working on an AWS POC that uses several AWS components. Below are the details of each individual component.
1- A Java function contains the code to generate data; I call it from a Lambda function through a CloudWatch scheduler.
2- Data Pipeline to copy data from RDS to S3.
3- Run Hive scripts using Athena over the S3 data.
4- QuickSight for visualization.
I am done creating the individual pieces, but I cannot work out the best way to connect all these components so that they run in one go.
One thought is to use Lambda as a connector for each step, but I have no template for connecting Lambda with Athena.
Can anyone suggest the best way to connect all the above components so that they run in one go?
I am not familiar with Hive scripts or QuickSight, but a CloudFormation stack or a Terraform stack should help you connect the various AWS components as your workflow demands.