Best practice for logging on Elastic MapReduce - AWS

I am planning to use Amazon EMR for a Spark Streaming application. Amazon provides a nice interface to show the stderr and controller logs, but for a streaming application I am not sure how to manage the logs.
Amazon logs the data to /var/log/hadoop/steps/<step-id> and similar places for Spark: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-manage-view-web-log-files.html
I was wondering how we can rotate logs and still have them accessible via the AWS EMR web interface. We can easily change the log rotation policy by configuring hadoop-log4j, but then I cannot access the logs via the web interface. Also, EMR should still manage the log upload to S3.

AWS EMR also stores the logs in S3.
Navigate to the console page for your running cluster; in the left middle column you'll see the path to the S3 bucket.
Be careful not to reuse the same S3 bucket path for future clusters, otherwise you could overwrite your log data.
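One way to avoid reusing a log path is to generate a fresh prefix for every cluster. This is only a minimal sketch: the function name, the "emr-logs" prefix and the bucket name are hypothetical, and the resulting string is simply what you would supply as the cluster's log location (for example the LogUri argument of boto3's run_job_flow).

```python
from datetime import datetime, timezone
from typing import Optional


def unique_log_uri(bucket: str, cluster_name: str,
                   now: Optional[datetime] = None) -> str:
    """Build an S3 log path containing the cluster name and a UTC timestamp.

    The "emr-logs" prefix is an arbitrary choice for this sketch.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return f"s3://{bucket}/emr-logs/{cluster_name}/{stamp}/"


# Hypothetical usage: pass the result as the log location when creating the cluster.
log_uri = unique_log_uri("my-log-bucket", "spark-streaming")
```

Because the timestamp changes on every launch, two clusters started at different times never share a prefix.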

Related

Can I use s3fs to perform "free data transfer" between AWS EC2 and S3?

I am looking to deploy a Python Flask app on an AWS EC2 (Ubuntu 20.04) instance. The app fetches data from an S3 bucket (in the same region as the EC2 instance) and performs some data processing.
I prefer using s3fs to achieve the connection to my S3 bucket. However, I am unsure if this will allow me to leverage the 'free data transfer' from S3 to EC2 in the same region - or if I must use boto directly to facilitate this transfer?
My app works when deployed with s3fs, but I would have expected the data transfer to be much faster, so I am wondering whether AWS EC2 is perhaps not able to "correctly" fetch data from S3 using s3fs.
All communication between Amazon EC2 and Amazon S3 in the same region will not incur a Data Transfer fee. It does not matter which library you are using.
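In fact, communication between most AWS services in the same region does not incur Data Transfer fees, although some traffic (for example between EC2 instances in different Availability Zones) is charged.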

Using AWS Lambda to copy S3 files to an on-premise LAN folder

Problem:
We need to transfer all files (CSV format) stored in an AWS S3 bucket to an on-premise LAN folder using Lambda functions. This will be a scheduled task that runs every hour; each run transfers the files from S3 to the on-premise LAN folder again, replacing the existing ones. The files are not large (preferably under a few MB).
I am not able to find any AWS managed service to accomplish this task.
I am a newbie to AWS, so any solution to this problem is most welcome.
Thanks,
Actually, I am looking for a solution by which I can push S3 files to the on-premise folder automatically.
For that you need to make the on-premise network visible to the logic (Lambda or whatever) "pushing" the content. The default solution is AWS Site-to-Site VPN.
There are multiple options for setting up the VPN; you can choose based on your needs.
Then the on-premise network will look just like another subnet.
However, a VPN has its complexity and cost. In most cases it is much easier to "pull" data from the on-premise environment.
To sync data there are multiple options. For a managed service, I could point out the S3 Gateway, which based on your description sounds like an insane overkill.
Maybe you could start with a simple cron job (or a scheduled task if working with Windows) and run a CLI command to sync the S3 content or just copy specified files.
Check out S3 Sync, I think it will help you accomplish this task: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html#examples
To run any AWS CLI command on your computer, you will need to set up credentials, and the account/role you set up must have permission to do the task (e.g. access S3).
Check out AWS CLI setup here: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
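The "pull" approach could be sketched as a small script that cron (or the Windows Task Scheduler) runs every hour on the on-premise machine. The bucket name, prefix and local folder below are placeholders, and the CSV-only filter is an assumption based on the question; it relies on the AWS CLI being installed and configured as described above.

```python
import subprocess
from typing import List


def build_sync_command(bucket: str, prefix: str, local_dir: str,
                       dry_run: bool = False) -> List[str]:
    """Assemble an `aws s3 sync` command that copies only CSV files."""
    cmd = [
        "aws", "s3", "sync",
        f"s3://{bucket}/{prefix}", local_dir,
        "--exclude", "*",      # start by excluding everything...
        "--include", "*.csv",  # ...then re-include CSV files only
    ]
    if dry_run:
        cmd.append("--dryrun")  # list the operations without performing them
    return cmd


def run_sync(cmd: List[str]) -> None:
    """Run the command and raise CalledProcessError if the CLI fails."""
    subprocess.run(cmd, check=True)


# Hypothetical usage from an hourly scheduled job:
# run_sync(build_sync_command("my-bucket", "exports/", "/mnt/lan/exports"))
```

Trying it once with dry_run=True is a cheap way to check what would be copied before scheduling it.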

How to view AWS Glue Spark UI

In my Glue job, I have enabled Spark UI and specified all the necessary details (s3 related etc.) needed for Spark UI to work.
How can I view the DAG/Spark UI of my Glue job?
You need to set up an EC2 instance that can host the history server.
The below documentation has links to CloudFormation templates that you can use.
https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
You can access the history server via the EC2 instance (default port 18080). You need to configure the networks and ports suitably.
EDIT - There is also an option to set up the Spark UI locally. This requires downloading the Docker image from the aws-glue-samples repo and setting the AWS credentials and S3 location there. This server consumes the files that the Glue job generates. The files are about 4 MB in size.

Easier way to access ElasticBeanstalk EC2 Log files

I am programming a Jersey service on Tomcat via Elastic Beanstalk with a load balancer. I find retrieving the EC2 instances' catalina log files from S3 very cumbersome. Currently I need to determine the EC2 instance(s), then work my way to each of the S3 locations and download the files before I can diagnose anything.
The snapshot doesn't help: because of the number of requests that come in, it doesn't hold enough information, and by the time I get the snapshot the relevant entries have "rolled" off it.
Two questions:
1) Is there an easier approach to log files via AWS? (Increasing the time before rotation, which I don't believe is supported as of now, scripts, etc.)
2) Is there any software or script to access all the logs under a load balancer? I basically want to say "give me all logs for this Elastic Beanstalk environment" and have it get all logs for that day from all servers under that load balancer (up or down). The clincher is "down": the problem becomes more complex when the load balancer takes down an instance right when the issue occurs.
Thanks!
As an immediate solution to your problem you can follow the approach suggested in this answer. Essentially, you can use ebextensions to modify the logrotate configuration so that it rotates at a larger log size.
Then snapshot logs should work for you.
Let me know if you need more clarifications on this approach.
AWS released CloudWatch Logs just last week, which enables you to monitor and troubleshoot your systems and applications using your existing system, application, and custom log files:
You can send your existing system, application, and custom log files to CloudWatch Logs and monitor these logs in near real-time. [...] you can store your logs using highly durable, low-cost storage for later access.
See the introductory blog post Store and Monitor OS & Application Log Files with Amazon CloudWatch for an illustrated walkthrough, which already touches on using Elastic Beanstalk with CloudWatch Logs; this is further detailed in Using AWS Elastic Beanstalk with Amazon CloudWatch Logs.

Enable log file rotation to s3

I have enabled this option.
Problem is:
If I don't press the snapshot logs button, the log does not go to S3.
Is there any way to have the logs published to S3 each day?
Or how does the log file rotation option work?
If you are using the default instance profile with Elastic Beanstalk, then AWS automatically creates the permission to rotate the logs to S3.
If you are using a custom instance profile, you have to grant Elastic Beanstalk permission to rotate logs to Amazon S3.
The logs are rotated every 15 minutes.
AWS Elastic Beanstalk: Working with Logs
For a more robust mechanism to push your logs to S3 from any EC2 server instance, you can pair LogRotate with S3. I've put all the details in this post as a reference, which should let you achieve exactly what you're describing.
Hope that helps.
NOTICE: if you want to rotate custom log files, then, depending on your container, you need to add links to your custom log files in the proper places. For example, consider a Ruby on Rails deployment: if you want to store custom information, e.g. some monitoring using the Oink gem in an oink.log file, add a proper link in /var/app/support/logs using .ebextensions
.ebextensions/XXXlog.config
    files:
      "/var/app/support/logs/oink.log" :
        mode: "120400"
        content: "/var/app/current/log/oink.log"
This, after deploy, will create a symlink:
    /var/app/support/logs/oink.log -> /var/app/current/log/oink.log
I'm not sure why permissions 120400 are used; I took them from the example in the Amazon AWS doc page http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html (it seems that 120xxx is for symlinks in Unix file systems).
This log file rotation is good for archival purposes, but the logs are difficult to search and consolidate when you need them most.
Consider using services like Splunk or Loggly.
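The guess about 120xxx can be checked with Python's standard stat module by reading the "120400" string as an octal Unix mode. This is just a quick sanity check of the mode bits, not something Elastic Beanstalk itself runs; the variable names are made up for illustration.

```python
import stat

# The "120400" value from the config, interpreted as an octal Unix mode.
mode = 0o120400

file_type = stat.S_IFMT(mode)     # high bits: the file type
permissions = stat.S_IMODE(mode)  # low bits: the permission bits
is_symlink = stat.S_ISLNK(mode)   # True when the type bits mean "symbolic link"

print(oct(file_type), oct(permissions), is_symlink, stat.filemode(mode))
```

On Linux the type bits 0o120000 are the symbolic link type (stat.S_IFLNK) and the remaining 0o400 is read permission for the owner, which matches the "120xxx is for symlinks" reading.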