Process S3 access logs using AWS Data Pipeline

My use case is to process S3 access logs (which have those 18 fields) periodically and push them to a table in RDS. I'm using AWS Data Pipeline for this task, running every day to process the previous day's logs.
I decided to split the task into two activities
1. Shell Command Activity: to process S3 access logs and create a CSV file
2. Hive Activity: to read data from the CSV file and insert it into the RDS table.
My input S3 bucket has lots of log files, so the first activity fails with an out-of-memory error while staging. However, I don't want to stage all the logs; staging only the previous day's logs is enough for me. I searched around the internet but didn't find a solution. How do I achieve this? Is my solution the optimal one? Does a better solution exist? Any suggestions would be helpful.
Thanks in Advance

You can define your S3 data node using timestamps. For example, you can set the directory path to
s3://yourbucket/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}
since your log files should have a timestamp in the name (or they could be organized into timestamped directories).
This will only stage the files matching that pattern.
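For illustration, here is a minimal sketch of how such an S3DataNode could be defined through boto3's put_pipeline_definition. The pipeline ID, bucket, and object names are placeholders, not values from the question, and the expression assumes the bucket's log prefixes are date-based.

```python
# Minimal sketch, assuming the pipeline already exists and the access logs are laid
# out under date-based prefixes. Pipeline ID, bucket, and object names are
# illustrative placeholders.
import boto3

dp = boto3.client("datapipeline")

dp.put_pipeline_definition(
    pipelineId="df-XXXXXXXXXXXXXXXX",
    pipelineObjects=[
        {
            "id": "S3LogInput",
            "name": "S3LogInput",
            "fields": [
                {"key": "type", "stringValue": "S3DataNode"},
                # Stage only the prefix for the scheduled day instead of the whole bucket.
                # Wrapping @scheduledStartTime in minusDays(..., 1) would target the
                # previous day's logs, which is what a daily run needs.
                {
                    "key": "directoryPath",
                    "stringValue": "s3://yourbucket/#{format(@scheduledStartTime, 'YYYY-MM-dd')}",
                },
            ],
        },
        # ... the ShellCommandActivity, HiveActivity, schedule, and resources go here.
    ],
)
```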

You may be recreating a solution that already exists in Logstash (or, more precisely, the ELK stack).
http://logstash.net/docs/1.4.2/inputs/s3
Logstash can consume S3 files.
Here is a thread on reading access logs from S3
https://groups.google.com/forum/#!topic/logstash-users/HqHWklNfB9A
We use Splunk (not free), which has the same capabilities through its AWS plugin.

May I ask why you are pushing the access logs to RDS?
ELK might be a great solution for you. You can build it on your own or use ELK-as-a-service from Logz.io (I work for Logz.io).
It lets you easily define an S3 bucket, have all your logs read regularly from the bucket and ingested by ELK, and view them in preconfigured dashboards.

Related

Crawl is not running in S3 event mode

When running an AWS Glue crawler that points to S3, the second log entry in CloudWatch is always:
Crawl is not running in S3 event mode
What is S3 event mode?
The name sounds like some way of getting S3 to invoke Glue for partial crawls after every object upload to the prefix. But as far as I can tell, such functionality does not exist. So what is this log entry referring to?
The closest thing I found in the Glue documentation was event-based triggers for Glue jobs, but Glue Jobs are different from Glue Crawlers.
Steps to reproduce
Create a Glue Crawler. Choose any configuration. Point it to anywhere in any S3 bucket with any dataset (even an empty one)
Run the crawler. It doesn't matter if the crawl fails or succeeds
Open the logs for that crawl
Look at the second log entry
2021-07-01T20:04:39.882+10:00
[6588c8ba-57e2-46e3-94b4-1bc4dfc5957d] BENCHMARK : Running Start Crawl for Crawler my-crawler
2021-07-01T20:04:40.200+10:00
[6588c8ba-57e2-46e3-94b4-1bc4dfc5957d] INFO : Crawl is not running in S3 event mode
AWS Support gave me an answer.
S3 event mode is functionality available internally inside AWS. As I suspected, it means S3 triggers crawler runs for every file upload, but this functionality is not public at the moment.
I had the same problem and found a solution in this article: https://www.linkedin.com/pulse/my-top-5-gotchas-working-aws-glue-tanveer-uddin/
In short, the solution was to prefix my bucket name with aws-glue-. For example, a crawler pointed at a bucket called test-bucket would not work, but after renaming it to aws-glue-test-bucket it does. This is most likely because the default AWSGlueServiceRole managed policy only grants S3 access to resources whose names start with aws-glue-.
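If renaming the bucket is not an option, and assuming the root cause really is the role's policy scope, a hedged alternative is to grant the crawler's IAM role explicit access to the original bucket, for example with an inline policy. Role and bucket names below are illustrative placeholders.

```python
# Minimal sketch: attach an inline policy so the crawler's role can read the
# original bucket without renaming it to aws-glue-*.
import json

import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::test-bucket",
                "arn:aws:s3:::test-bucket/*",
            ],
        }
    ],
}

iam.put_role_policy(
    RoleName="AWSGlueServiceRole-my-crawler",  # the role the crawler runs as
    PolicyName="allow-crawl-test-bucket",
    PolicyDocument=json.dumps(policy),
)
```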

Move data from S3 to Amazon Aurora Postgres

I have multiple files present in different buckets in S3. I need to move these files to Amazon Aurora PostgreSQL every day on a schedule. Every day I will get a new file and, based on the data, an insert or update will happen. I was using Glue for inserts, but for upserts Glue doesn't seem to be the right option. Is there a better way to handle this? I saw that the load command from S3 to RDS would solve the issue, but I couldn't find enough details on it. Any recommendations, please?
You can trigger a Lambda function from S3 events; it can then process the file(s) and insert them into Aurora. Alternatively, you can create a cron-type function that runs daily on whatever schedule you define.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
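For illustration, here is a minimal Lambda sketch along those lines. It assumes a CSV with (id, name, amount) columns and no header row, a matching Aurora PostgreSQL table keyed on id, connection details in environment variables, and psycopg2 bundled with the function (for example via a layer); all of these names are hypothetical.

```python
# Minimal sketch: S3 event -> read CSV -> upsert rows into Aurora PostgreSQL.
import csv
import io
import os

import boto3
import psycopg2

s3 = boto3.client("s3")

def handler(event, context):
    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    try:
        for record in event["Records"]:  # one record per uploaded object
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            rows = csv.reader(io.StringIO(body))  # assumes no header row
            with conn.cursor() as cur:
                for row_id, name, amount in rows:
                    # Upsert: insert new rows, update existing ones on key conflict.
                    cur.execute(
                        """
                        INSERT INTO my_table (id, name, amount)
                        VALUES (%s, %s, %s)
                        ON CONFLICT (id) DO UPDATE
                        SET name = EXCLUDED.name, amount = EXCLUDED.amount
                        """,
                        (row_id, name, amount),
                    )
        conn.commit()
    finally:
        conn.close()
```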

How to fetch data from EMR Spark session?

I'm designing some ETL data pipelines with Airflow. Data transformation is done by provisioning an AWS EMR Spark cluster and submitting jobs to it. The jobs read data from S3, process it, and write it back to S3, using the date as a partition.
For my last step, I need to load the S3 data into a data warehouse using SQL scripts that are submitted to Redshift from a Python script. However, I cannot find a clean way to retrieve which data needs to be loaded, i.e. which date partitions were generated during the Spark transformations (this is only known during the execution of the job, not beforehand).
Note that everything is orchestrated through a Python script using the boto3 library, run from a corporate VM that cannot be accessed from outside.
What would be the best way to fetch this information from EMR?
For now I'm thinking about different solutions:
- Write the information into a log file and fetch it from the Spark master node over SSH from the Python script
- Write the information to an S3 file
- Write the information to a database (RDS?)
I'm struggling to determine the pros and cons of these solutions. I'm also wondering what the best way would be to signal that the data transformations are over and that the metadata can be fetched.
Thanks in advance
The most straightforward approach is to use S3 as your temporary storage. After finishing your Spark execution (writing the results to S3), you can add one more step that writes the data you want to pass to the next step to an S3 bucket.
The RDS approach would be similar to S3, but it requires more implementation work: you need to set up RDS, maintain a schema, and implement the code that talks to RDS.
With an S3 temp file, after EMR has terminated and Airflow runs the next step, use boto3 to fetch that temp file (the S3 path depends on your requirements) and that is it; a minimal sketch follows.
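As a rough sketch of that temp-file approach, the Spark step could publish a small JSON "manifest" of the date partitions it wrote, and the following Airflow task could read it back with boto3. Bucket, key, and function names here are illustrative, not prescribed.

```python
# Minimal sketch, assuming the Spark job tracks the date partitions it wrote and
# the next Airflow task reads them back. Bucket, key, and partition values are
# illustrative placeholders.
import json

import boto3

# --- Inside the Spark job (runs on EMR) ---
def publish_partitions(partitions, bucket="my-etl-bucket", key="runs/latest/partitions.json"):
    """Write the list of generated date partitions to a small S3 manifest file."""
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps({"partitions": partitions}).encode("utf-8"),
    )

# e.g. after df.write.partitionBy("date").parquet(...) completes:
# publish_partitions(["2021-07-01", "2021-07-02"])

# --- In the next Airflow task (runs on the corporate VM) ---
def fetch_partitions(bucket="my-etl-bucket", key="runs/latest/partitions.json"):
    """Read the manifest back and return the partitions the warehouse load should target."""
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)["partitions"]

# for partition in fetch_partitions():
#     run_redshift_load_for(partition)  # hypothetical helper that submits the SQL
```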

S3 to RDS file management system

I'm new to AWS and have a feasibility question for a file management system I'm trying to build. I would like to set up a system where people use the Amazon S3 browser and drop either a CSV or Excel file into their specific bucket. Then I would like to automate the process of taking that CSV/Excel file and inserting it into a table within RDS. This assumes that the table has already been built and that those Excel/CSV files will always be formatted the same and will be in the exact same place every single time. Is it possible to automate this process, or at least get it to a point where very minimal human interference is needed? I'm new to AWS, so I'm not exactly sure of the limits of S3 to RDS. Thank you in advance.
It's definitely possible. AWS supports notifications from S3 to SNS, which can be forwarded automatically to SQS: http://aws.amazon.com/blogs/aws/s3-event-notification/
S3 can also send notifications to AWS Lambda to run your own code directly.
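As a hedged sketch of the Lambda route, the bucket notification can be wired up once with boto3; the bucket name and function ARN below are placeholders, and the loader function itself (which would parse the CSV and insert it into RDS) is assumed to exist already.

```python
# Minimal sketch: send S3 "object created" events for .csv uploads to an existing
# Lambda function. Bucket name and function ARN are illustrative placeholders.
import boto3

s3 = boto3.client("s3")
lambda_arn = "arn:aws:lambda:us-east-1:123456789012:function:csv-to-rds-loader"

# S3 must be allowed to invoke the function before the notification is attached.
boto3.client("lambda").add_permission(
    FunctionName=lambda_arn,
    StatementId="allow-s3-invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::my-upload-bucket",
)

s3.put_bucket_notification_configuration(
    Bucket="my-upload-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                # Only react to CSV uploads; adjust or duplicate for .xlsx as needed.
                "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".csv"}]}},
            }
        ]
    },
)
```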

Big data zip on amazon S3 files

I have a large amount of data stored on Amazon S3 in the form of objects.
For example, I have a user with 200+ GB of photos (about 100,000+ objects) stored on Amazon S3; each object is a photo, about 5 MB on average.
Now I want to give the user a link to download the data.
Currently, this is what I am doing:
Using s3cmd, I copy all the objects from S3 to EC2, and then create a zip using the zip or tar command.
After the zip process is complete, I move the zip file back to S3 and create a signed link that I send to the user by email.
But this process takes a very long time, frequently runs into out-of-memory and storage issues, and is very slow.
I need to know:
Is there any way I can speed up this process?
Is there any third-party service or tool with which I can quickly create a zip of my files and send it to the user?
Or any other third-party solution? I am ready to pay for it.
Try using EMR (Elastic MapReduce) and S3DistCp, which can be helpful in your situation; for EMR you have to create a cluster and then run your job.
The direction you are following is correct at a high level. However, there isn't any straightforward answer that will solve your problem in a single shot.
These are the things you can try:
Ask your user to create an AWS account (or create an IAM user) and provide read-only access to that user/account
While uploading to S3, group the photos into bundles of 50 or 100, compress each bundle, and then put it in S3 (from EC2, i.e. during creation of the media itself); see the sketch after this list
Export to external media from S3 using AWS Import/Export
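As a small sketch of that bundling idea, assuming the photos are available on local disk while the media is being created; bucket, key, and paths are illustrative placeholders.

```python
# Minimal sketch: zip a batch of local photos in memory and upload the bundle to S3.
import io
import zipfile
from pathlib import Path

import boto3

s3 = boto3.client("s3")

def upload_bundle(photo_paths, bucket="my-photo-bucket", key="bundles/user-123/bundle-0001.zip"):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in photo_paths:
            zf.write(path, arcname=Path(path).name)  # store without directory prefix
    buf.seek(0)
    s3.upload_fileobj(buf, bucket, key)

# e.g. bundle photos in groups of 100 as they are produced:
# upload_bundle(batch_of_100_paths)
```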
S3DistCp is a tool that can greatly help in cases such as this.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
S3DistCp can copy from and to S3 using an EMR cluster instead of a single instance, and it can compress objects on the fly.
However, in "big data" processing, the user will probably have a better experience if you either create the bundles in advance proactively or start the process asynchronously on-demand and notify the user on completion with the download link.