Crawl is not running in S3 event mode - amazon-web-services

When running an AWS Glue crawler that points to S3, the second log entry in CloudWatch is always:
Crawl is not running in S3 event mode
What is S3 event mode?
The name sounds like some way of getting S3 to invoke Glue for partial crawls after every object upload to the prefix. But as far as I can tell, such functionality does not exist. So what is this log entry referring to?
The closest thing I found in the Glue documentation was event based triggers for Glue jobs, but Glue Jobs are different to Glue Crawlers.
Steps to reproduce
Create a Glue Crawler. Choose any configuration. Point it to anywhere in any S3 bucket with any dataset (even an empty one)
Run the crawler. It doesn't matter if the crawl fails or succeeds
Open the logs for that crawl
Look at the second log entry
2021-07-01T20:04:39.882+10:00
[6588c8ba-57e2-46e3-94b4-1bc4dfc5957d] BENCHMARK : Running Start Crawl for Crawler my-crawler
2021-07-01T20:04:40.200+10:00
[6588c8ba-57e2-46e3-94b4-1bc4dfc5957d] INFO : Crawl is not running in S3 event mode

AWS Support gave me an answer.
S3 Event mode is functionality available internally inside AWS. As I suspected it means S3 triggers crawler crawls for every file upload. But this functionality is not public at the moment.

I had the same problem and I found a solution in this article https://www.linkedin.com/pulse/my-top-5-gotchas-working-aws-glue-tanveer-uddin/
In short though the solution was to have aws-glue- before the name of my bucket. So, for example trying to get a crawler to go through a bucket called test-bucket would not work but if I change the name to aws-glue-test-bucket then works.

Related

AWS glue incremental load

I have a S3 bucket where everyday files are getting dumped. AWS crawler crawls the data from this location.On the very first day when my glue job runs it takes all the data present in the table that is created by AWS crawler.For example on very first day three files are there.(i.e. file1.txt,file2.txt,file3.txt) and glue job processes these files on the first day of glue job execution.On the second day another two files reaches to S3 location.Now in S3 location these are the files present.(i.e. file1.txt,file2.txt,file3.txt,file4.txt,file5.txt).Can i somehow design my AWS crawler in such a way that on the next day of job execution it just reads two files (file4.txt,file5.txt)?Or else how can I write AWS glue job just to identify these incremental files?
You need to enable AWS job bookmark for glue and it will be able to persist the state of already processed data. You can refer to the link below about how to do it.
aws glue job bookmark
You could implement an intermediate service like SQS. With that said, you can setup your SQS to wait events or messages from S3 (Such Put event in your case) and then, you can configure your crawler in order to poll from SQS when a new message comes and this it would apply for the new files.
The previous answer marked as correct does not answer your question and/or scenario.

Identifying and deleting S3 Objects that are not being accessed?

I have recently joined a company that uses S3 Buckets for various different projects within AWS. I want to identify and potentially delete S3 Objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all the options are workarounds depending on your usecase.
You have a few options:
Tag each S3 Object (e.g. 2018-10-24). First turn on Object Level Logging for your S3 bucket. Set up CloudWatch Events for CloudTrail. The Tag could then be updated by a Lambda Function which runs on a CloudWatch Event, which is fired on a Get event. Then create a function that runs on a Scheduled CloudWatch Event to delete all objects with a date tag prior to today.
Query CloudTrail logs on, write a custom function to query the last access times from Object Level CloudTrail Logs. This could be done with Athena, or a direct query to S3.
Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using the files, can you probably delete them. But you would be the only person who would know whether they are necessary.
There is recent AWS blog post which I found very interesting and cost optimized approach to solve this problem.
Here is the description from AWS blog:
The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
The Lambda function initiates an S3 Batch Operation job to tag objects in the source bucket. These must be expired using the following logic:
Capture the number of days (x) configuration from the S3 Lifecycle configuration.
Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
Write a manifest file with the list of these objects to an S3 bucket.
Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to 'x' days. They will have the tag given via the S3 batch operation of "delete=True".
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs

Run ETL python script in AWS triggered by S3

I am new with AWS and don't know how to do the following. When I put an object in S3 I want to launch a python script that does some transformations and returns it to another path in S3. I've tried a lambda function but the process takes more than 300 seconds. I've also tried it with a Glue job but I don't know how to trigger it when I put the file in S3.
Does anyone know how to do it? Maybe I'm using the wrong AWS tools.
The simple solution for your problem is here:
Since you've already mentioned that you have AWS Glue job working to do this operation. And all you don't know is how to trigger glue job when file placed in s3, I am answering to that question.
You can write an AWS lambda using boto3 module which can be triggered based up on the s3 event and have setup glue.start_job_run command in your lambda function.
response = client.start_job_run(
JobName='string')
https://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.start_job_run
Note:: I strongly believe Glue is the right tool rather than lambda for your requirement that you mentioned in question, because AWS lambda have time out limitation. It will get timeout after 300 seconds.
One option would be to use SQS:
Create the SQS queue.
Setup S3 to send notifications to the SQS queue when new objects are added to the source bucket. See Configuring Amazon S3 Event Notifications.
Setup your Python script on an EC2 instance and listen to the SQS queue in your code.
Upload the output of your Python script into the target S3 bucket after script finished.
Can you break up the Python processing into smaller steps? I'd definitely recommend that you use Lambda instead of managing EC2 if you can get your code to run within the Lambda restrictions.

Scheduling data extraction from AWS Redshift to S3

I am trying to build out a job for extracting data from Redshift and write the same data to S3 buckets.
Till now I have explored AWS Glue, but Glue is not capable to run custom sql's on redshift. I know we can run unload commands and can be stored to S3 directly. I am looking for a solution which can be parameterised and scheduled in AWS.
Consider using AWS Data Pipeline for this.
AWS Data Pipeline is AWS service that allows you to define and schedule regular jobs. These jobs are referred to as pipelines. Pipeline contains a business logic of the work required, for example, extracting data from Redshift to S3. You can schedule a pipeline to run however often you require e.g. daily.
Pipeline is defined by you, you can even version control it. You can prepare a pipeline definition in a browser using Data Pipeline Architect or compose it using JSON file locally on your computer. Pipeline definition is composed of components, such as, Redshift database, S3 node , SQL activity, as well as parameters, for example to specifying S3 path to use for extracted data.
AWS Data Pipeline service handles scheduling, dependency between components in your pipeline, monitoring and error handling.
For your specific use case, I would consider the following options:
Option 1
Define pipeline with the following components: SQLDataNode and S3DataNode. SQLDataNode would reference your Redshift database and SELECT query to use to extract your data. S3DataNode would point to S3 path to be used to store your data. You add a CopyActivity activity to copy data from SQLDataNode to S3DataNode. When such pipeline runs, it will retrieve data from Redshift using SQLDataNode and copy that data to S3DataNode using CopyActivity. S3 path in S3DataNode can be parameterised so it is different every time you run a pipeline.
Option 2
Firstly, define SQL query with UNLOAD statement to be used to unload your data to S3. Optionally, you can save it in a file and upload to S3. Use SQLActivity component to specify SQL query to execute in Redshift database. SQL query in SQLActivity can be a reference to S3 path where you stored your query (optionally), or just a query itself. Whenever a pipeline runs, it will connect to Redshift and execute SQL query which stores the data in S3.
Constraints of option 2: in UNLOAD statement, S3 path is static. If you plan to store every data extract in a separate S3 path, you will have to modify UNLOAD statement to use another S3 path every time you run it which is not out-of-the-box function.
Where do these pipelines run?
On EC2 instance with a TaskRunner, a tool provided by AWS to run data pipelines. You can start that instance automatically at the time when pipeline runs, or you can reference already running instance with a TaskRunner installed on it. You have to make sure that EC2 instance is allowed to connect to your Redshift database.
Relevant documentation:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftdatabase.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqldatanode.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqlactivity.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-using-task-runner.html
I think Pawel has answered this correctly , I'm just adding details on option two for anyone who wants to implement this:
Go to "Data Pipeline" from AWS console
Click on "New Pipeline" on top right corner page
Edit each field in this json file(after copying to your favorite editor) and update the fields which has "$NEED_TO_UPDATE_THIS_WITH_YOURS" with the correct value that pertains to your AWS environment and save it as data_pipeline_template.json some where on your computer
Go back to AWS Console again, Click on "Load Local File" for the source field and upload the json file
if you are not able to upload it because you may be getting some error related to your database instances etc then follow these steps:
Go to "Data Pipeline" from AWS console
Click on "New Pipeline" on top right corner page
Populate all the fields manually (see below)
Click on "Edit in Architect" at the bottom of the page
Implement the same activities and resources as below , again make sure your are adding the correct values such as your Database JDBC connection etc

process s3 access logs using AWS datapipeline

My usecase is to process S3 access logs(having those 18 fields) periodically and push to table in RDS. I'm using AWS data pipeline for this task to run everyday to process previous day's logs.
I decided to split the task into two activities
1. Shell Command Activity : To process s3 access logs and create a csv file
2. Hive Activity : To read data from csv file and insert to RDS table.
My input s3 bucket has lots of log files hence first activity fails due to out of memory error while staging. However i don't want to stage all the logs, staging the previous day's log is enough for me. I searched around internet but didn't get any solution. How do i achieve this ? Is my solution the optimal one ? Does any solution better than this exist ? Any suggestions will be helpful
Thanks in Advance
You can define your S3 data node use timestamps. For e.g. you can say the directory path is
s3://yourbucket/ #{format(#scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}
Since your log files should have a timestamp in the name (or they could be organized by timestamped directories).
This will only stage the files matching that pattern.
You may be recreating a solution that is already done by Logstash (or more precisely the ELK stack).
http://logstash.net/docs/1.4.2/inputs/s3
Logstash can consume S3 files.
Here is a thread on reading access logs from S3
https://groups.google.com/forum/#!topic/logstash-users/HqHWklNfB9A
We use Splunk (not-free) that has the same capabilities through its AWS plugin.
May I ask why are you pushing the access logs to RDS?
ELK might be a great solution for you. You can build it on your own or use ELK-as-a-service from Logz.io (I work for Logz.io).
It enables you to easily define an S3 bucket, get all your logs read regularly from the bucket and ingested by ELK and view them in preconfigured dashboards.