I have a custom application/service for AWS EMR that I install via a bootstrap action on all nodes. I want to put the log files of that application in the same S3 bucket that I'm using for the EMR logs.
Can anyone suggest where I have to configure my log path in logpusher so that the logs are pushed to the S3 bucket at a fixed interval, the same as for a Hadoop application?
You can configure it in /etc/logpusher/hadoop.config and restart the logpusher service on all nodes.
Did anyone try the new Amazon managed service for Airflow (MWAA)? I did all the steps mentioned and could open the Airflow link, but none of the DAGs are synced. I don't see any error. I even tried updating the environment with a different S3 bucket.
Are you uploading the DAGs into the dags folder in your S3 bucket?
Your DAG .py files should be inside the dags folder in the S3 bucket to get synced.
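If it helps, a quick way to push a DAG file into that prefix from a script; the bucket and file names below are placeholders for your environment:

import boto3

s3 = boto3.client("s3")
# "my-airflow-bucket" and "my_dag.py" are placeholder names.
s3.upload_file("my_dag.py", "my-airflow-bucket", "dags/my_dag.py")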
I have done a number of searches and can't seem to figure out if this is doable at all.
I have a data logger that has an FTP-push function. The FTP-push function has the following settings:
FTP server
Port
Upload directory
User name
Password
In general, I understand that a FileZilla client (I have the Pro edition) is able to drop files into my AWS S3 bucket, and I have done this successfully on my local PC.
Is it possible to remove the FileZilla client requirement and input my S3 information directly into my data logger? Something like the diagram below:
Data logger ----FTP----> S3 bucket
If not, what would be the most sensible method to have my data logger's JSON files dropped into AWS S3 via FTP?
Frankly, you'd be better off with:
Logging to local files
Using a schedule to copy the log files to Amazon S3 using the aws s3 sync command
The schedule could be triggered by cron (Linux) or a Scheduled Task (Windows).
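A minimal sketch of what that scheduled copy could look like in code, assuming a hypothetical local log directory and bucket (aws s3 sync does the same thing in one command):

import os
import boto3
from botocore.exceptions import ClientError

LOG_DIR = "/var/log/datalogger"   # assumed local directory the logger writes to
BUCKET = "my-log-bucket"          # assumed destination bucket
PREFIX = "datalogger/"

s3 = boto3.client("s3")

def already_uploaded(key):
    # Skip files that already exist in the bucket.
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError:
        return False

for name in os.listdir(LOG_DIR):
    path = os.path.join(LOG_DIR, name)
    key = PREFIX + name
    if os.path.isfile(path) and not already_uploaded(key):
        s3.upload_file(path, BUCKET, key)

Run it from cron (or a Scheduled Task) at whatever interval suits the logger, so only new files are transferred.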
Amazon recently added FTP support to AWS Transfer. This provides an integration with Amazon S3 via FTP without setting up any additional infrastructure; however, you should review the pricing before going this route.
As an alternative, you could create an intermediary server that syncs between itself and Amazon S3 using the CLI command aws s3 sync.
I have Spark jobs running on an EKS cluster to ingest AWS logs from S3 buckets.
Now I have to ingest logs from another AWS account. I have managed to use the settings below to successfully read data from the other account with the Hadoop AssumedRoleCredentialProvider.
But how do I save the DataFrame back to S3 in my own AWS account? There seems to be no way to set the Hadoop S3 config back to my own AWS account.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.assumed.role.external.id","****")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.assumed.role.credentials.provider","com.amazonaws.auth.InstanceProfileCredentialsProvider")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.assumed.role.arn","****")
val data = spark.read.json("s3a://cross-account-log-location")
data.count
// changing back to InstanceProfileCredentialsProvider does not work
spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider","com.amazonaws.auth.InstanceProfileCredentialsProvider")
data.write.parquet("s3a://bucket-in-my-own-aws-account")
As per the Hadoop documentation, different S3 buckets can be accessed with different S3A client configurations by using a per-bucket configuration that includes the bucket name.
E.g.: fs.s3a.bucket.<bucket name>.access.key
See: http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets
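A sketch of how that could look for this case, with placeholder bucket names and role ARN; the spark.hadoop. prefix passes the settings through to the Hadoop configuration, and the same fs.s3a.bucket.<name>.* keys can be set from Scala with spark.sparkContext.hadoopConfiguration.set(...) exactly as in the question:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Cross-account bucket: read through the assumed role (placeholder ARN)
    .config("spark.hadoop.fs.s3a.bucket.cross-account-logs.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
    .config("spark.hadoop.fs.s3a.bucket.cross-account-logs.assumed.role.arn",
            "arn:aws:iam::111122223333:role/log-reader")
    .config("spark.hadoop.fs.s3a.bucket.cross-account-logs.assumed.role.credentials.provider",
            "com.amazonaws.auth.InstanceProfileCredentialsProvider")
    # Own-account bucket: plain instance profile credentials
    .config("spark.hadoop.fs.s3a.bucket.my-own-bucket.aws.credentials.provider",
            "com.amazonaws.auth.InstanceProfileCredentialsProvider")
    .getOrCreate()
)

data = spark.read.json("s3a://cross-account-logs/path/to/logs/")
data.write.parquet("s3a://my-own-bucket/ingested/")

Because the credential settings are scoped to each bucket, the read and the write no longer fight over the global fs.s3a.aws.credentials.provider value.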
As per my project requirement, I want to fetch some files from an on-prem FTP server and put them into an S3 bucket. The files are 1-2 GB in size. Once a file is placed in the FTP server folder, I want it to be uploaded to the S3 bucket.
Please suggest the easiest way to achieve this.
Note: the files will mostly be put on the FTP server only once a day, hence I don't want to continuously scan the FTP server. Once the files have been uploaded to S3 from the FTP server, I want to terminate any resources (like EC2) created in AWS.
These are my ideas:
I think you could create an agent on your FTP server that uploads the files every N seconds/minutes/hours/etc. using the AWS CLI. This way you avoid external access to your FTP server.
Another approach is a Lambda function for the pulling process, but as you said, the FTP server doesn't allow external access.
Create a VPN between your on-prem network and the cloud infrastructure, create a CloudWatch Events rule, and run the pulling process through a Lambda function (a sketch of this pull is shown after this list). Here you can also configure a timeout for the Lambda.
Create a VPN between your on-prem network and the cloud infrastructure, and upload the files from your FTP server using the AWS CLI (pay attention to the sync option). Take a look at this link: https://aws.amazon.com/answers/networking/accessing-vpc-endpoints-from-remote-networks/
With Jenkins, create a task that executes a process to upload the files.
You can use AWS Storage Gateway; visit its site here: https://aws.amazon.com/es/storagegateway/
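For the Lambda/VPN pulling idea above, a minimal sketch of the pull itself, with placeholder host, credentials, and bucket names; the Lambda would need connectivity to the FTP server (VPN/VPC), enough ephemeral storage for the 1-2 GB files, and a sufficiently long timeout:

import ftplib
import os
import boto3

FTP_HOST = "ftp.example.internal"   # placeholder on-prem FTP host
FTP_USER = "logger"                 # placeholder credentials
FTP_PASS = "secret"
REMOTE_DIR = "/upload"
BUCKET = "my-landing-bucket"        # placeholder destination bucket

s3 = boto3.client("s3")

def handler(event, context):
    ftp = ftplib.FTP(FTP_HOST)
    ftp.login(FTP_USER, FTP_PASS)
    ftp.cwd(REMOTE_DIR)
    for name in ftp.nlst():
        local_path = os.path.join("/tmp", name)
        # Download to the function's /tmp, then upload to S3.
        with open(local_path, "wb") as f:
            ftp.retrbinary("RETR " + name, f.write)
        s3.upload_file(local_path, BUCKET, name)
        os.remove(local_path)
    ftp.quit()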
Here is how we solved it.
Enable S3 Transfer Acceleration on your S3 bucket. This is very much needed, since you are pushing large files.
If you have access to the server, install the AWS CLI and perform a sync of the folder to the S3 bucket. The AWS CLI will automatically sync your folder to the bucket, so if you change any of your existing files, they will stay in sync with the S3 bucket. This is the ideal and simplest way if you have access to the server and are able to install the AWS CLI.
https://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration-examples.html#transfer-acceleration-examples-aws-cli
aws s3api put-bucket-accelerate-configuration --bucket bucketname --accelerate-configuration Status=Enabled
If you want to enable it for a specific or the default profile:
aws configure set default.s3.use_accelerate_endpoint true
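If the upload ends up happening from code rather than the CLI, the accelerate endpoint can also be opted into per client; a sketch with placeholder names:

import boto3
from botocore.config import Config

# Placeholder bucket/file names; Transfer Acceleration must already be enabled on the bucket.
s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3.upload_file("bigfile.dat", "bucketname", "incoming/bigfile.dat")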
If you don't have access to the FTP server on your premises, you need an external server to perform this process. In this case you need to poll or share the file system, copy the file locally, and move it to the S3 bucket. There will be a lot of failure points with this process.
Hope it helps.
I am using Elastic Beanstalk to serve up our app. I have enabled the "Enable log file rotation to Amazon S3" option and I see the logs being written to a bucket. It looks like the logs are going to the default bucket for the Elastic Beanstalk environment.
How can I tell the app to write logs to a different bucket?
I want to make sure we set a retention policy (say 10 days) for the logs in that bucket so that the bucket stays groomed.
Thank you in advance for your help. Much appreciated.
It is not possible to configure the Elastic Beanstalk app to publish logs to a different bucket; the option can only be turned off or on.
To delete the logs after 10 days, you can add a lifecycle rule to that bucket: http://docs.aws.amazon.com/AmazonS3/latest/UG/lifecycle-configuration-bucket-no-versioning.html
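A sketch of such a rule with boto3, assuming a placeholder bucket name and log prefix (the same rule can be created from the console or the CLI):

import boto3

s3 = boto3.client("s3")
# Bucket name and prefix are assumptions; scope the rule to wherever the rotated logs land.
s3.put_bucket_lifecycle_configuration(
    Bucket="elasticbeanstalk-my-region-my-account-id",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-eb-logs-after-10-days",
                "Filter": {"Prefix": "resources/environments/logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 10},
            }
        ]
    },
)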
However, it may also be possible to configure your Elastic Beanstalk application to publish logs to CloudWatch Logs:
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.managing.cw.html
The docs show a console screenshot, but I don't see anywhere in the console to set this. The docs do reference settings you can set via configuration files, the CLI, or the SDK.