I am new to Amazon's MapReduce. Can anyone tell me how to easily submit a message to Amazon's SQS (simple queue service) from a Hive job? I want to put a pointer to my job's S3 results into SQS for simpler automated retrieval.
I don't think you can call SQS from HIVE itself, however, you can create a script (as Steve suggested) that will call SQS and Hive. You can call SQS before the Hive script start as well as after the Hive job has ended with a pointer to where the result is stored (or will be stored).
Another possible approach:
create a new DynamoDB table; setup a lambda that would push an SQS on each new item using SQS streams
in Hive, insert a record into that table each time you want to push an SQS
Related
What is the easiest way to automatically ingest csv-data from a S3 bucket into a Timestream database ?
I have a s3-bucket which continuasly is generating csv files inside a folder structure. I want to save these files inside a timestream database so i can visualize them inside my grafana instance.
I already tried to do that via a Glue crawler but that wont wont for me. Is there any workaround or tutorial on how to solve this task ?
I do this using a Lambda function, an SNS topic and a queue.
New files in my bucket triggers a notification on an SNS Topic
The notification gets added to an SQS queue.
The lambda function consumes the queue, recovers the bucket and key of the new s3 object, downloads the csv file, does some processing and ingests the data into timestream. The lambda is implemented in Python.
This has been working ok, with the caveat that large files may not ingest fully within the lambda 15 minute limit. Timestream is not super fast. It gets better by using multi-valued records, as well as using the "common attributes' feature of the timestream client in boto3.
(it should be noted that the lambda can be triggered directly by the S3 bucket, if one prefers. Using a queue allows a bit more flexibility, such as being able to manually add files to the queue for reprocessing)
I want to write a python function that send data to Bigquery every time a put event occurs in my S3 bucket but I'm new in AWS is it possible to integrate bigquery with a lambda function? or can someone give me another way to stream my dynamodb data to bigquery? Thank you my language is python
N.B: I used dynamostream firehose to send my data in S3 now I want to retrieve my data from s3 every time a put event occur to send it into bigquery.
There are already plenty of resources online about "how to trigger a lambda after a put object on a S3".
But here are a few links to get you set up:
You will need to set up an EventBridge (CloudWatch Events are legacy) to trigger your Lambda when some action happens on your S3 bucket:
https://aws.amazon.com/fr/blogs/compute/using-dynamic-amazon-s3-event-handling-with-amazon-eventbridge/
You can use the boto3 Python framework to write AWS Lambdas:
https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
You can check the BigQuery Python SDK by GCP to communicate with your BQ database: https://googleapis.dev/python/bigquery/latest/index.html
I have gone through couple of stackoverflow questions regarding hourly backups from DDB to S3 where the best solution turned out to be to enable DDB Stream, subscribe lambda function and push to S3.
I am trying to understand if directly pushing from Lambda to S3 is fine or from Lambda to Kinesis Firehose and then to S3. Can someone share what is the advantage if we introduce Firehose in between. We anyways trigger lambda only after specific batch window that implies we are already buffering there.
Thanks in advance.
Firehose gives you the possibility to convert and compress your data. In addition you can directly attach a Glue Metadata table, so you can query your data with Athena.
You can write a Lambda function that reads a DynamoDB table, gets a result set, encodes the data to some format (ie, JSON), then place that JSON into an Amazon S3 bucket. You can use scheduled events to fire off the Lambda function on a regular schedule.
Here in AWS tutorial that shows you how to use scheduled events to invoke a Lambda function:
Creating scheduled events to invoke Lambda functions
This AWS tutorial also shows you how to read data from an Amazon DynamoDB table from a Lambda function.
I am using AWS Dynamo DB and Elastic Search. I am looking for some way to keep Dynamo DB data in sync with Elastic Search if any of them fails.
Currently I use lambda to push my record into Elastic Search. I know there is plugin - Logstash available but I can't use that as it will require a lot of changes.
Also, I won't prefer scanning the DynamoDB table, as it is too expensive. Is there any other way I could achieve this?
You can make use of SQS. Move failed records to SQS and later you can schedule a lambda to read records from SQS and send the records to ElasticSearch.
If you don't want to go with the plugin solution, you can continue with a lambda but triggered by DynamoDb Streams. In this way you shouldn't have to scan the table since the stream will have the added item and you can reuse the part of sending it to ES.
Take a look at DynamoDB Streams and Lambda triggers.
As per AWS docs, there's no Redshift-Lambda integration yet.
What we would like to do is monitoring redshift activity in order to do something when a redshift table is created, a copy from S3 is made or a bulk insert is performed.
Is there a way to register this kind of activity, and then do something similar to run a lambda function ir order run a small script or so?
Redshift provides an event notification mechanism. You can find a full list of the event categories and messages here. If that covers the kind of information you are interested in you can simply have your Lambda function add the SNS topic used by Redshift for event notification as an event source and your Lambda function will get called every time an event is sent by Redshift.
You can enable audit logs that end up in s3.
All the info you want is also available in various admin tables with prefixes like stl_, stv_ and pg_. For example, COPY commands from S3 are recorded in stl_load_commits, and stl_utilitytext has info on non-select queries like CREATE.
As for triggering events, you could have S3 trigger a lambda when one of the log files lands or run occasional jobs that query the system tables and take action with something like cron jobs or airflow.