Does Logstash download logs from S3, or does it read them without downloading? - amazon-web-services

I'm using Logstash to feed logs into our centralized logging, and the inputs are in S3 in gzip (.gz) format. I need to create a cost projection for this process: does Logstash download the S3 objects, or does it parse them remotely?

Amazon S3 is an object storage service. Data is not "processed" on Amazon S3.
If an application wants to process data from Amazon S3, it would need to download the files from Amazon S3 and process the data locally. An exception to this is if the application uses the Amazon S3 Select service, which can query data directly on Amazon S3.
In terms of cost, if the Amazon EC2 instance is in the same region as the Amazon S3 bucket, there is no data transfer cost for downloading the data.
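To illustrate the download-then-process pattern (which is essentially what the Logstash s3 input plugin does: it fetches each object and then parses it locally), here is a minimal boto3 sketch. The bucket name and prefix are placeholders, not values from the question.
# Download gzipped log objects from S3 and process them locally.
import gzip
import boto3

s3 = boto3.client("s3")
bucket = "my-log-bucket"        # placeholder bucket name

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="logs/"):
    for obj in page.get("Contents", []):
        # Each object is transferred out of S3 before it can be parsed.
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        for line in gzip.decompress(body).splitlines():
            # parse/forward each log line here (stand-in: just print it)
            print(line)
The GET requests are billed per request, while the data transfer itself is free within the same region, as noted above.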

Related

Regularly pull files from On-Prem server to S3 using AWS Transfer family

I'm trying to set up a flow that regularly pulls new files from a third party's on-premises server into our S3 bucket using AWS Transfer Family.
I read this documentation, https://aws.amazon.com/blogs/storage/how-discover-financial-secures-file-transfers-with-aws-transfer-family/, but it was not clear on how to set up and configure the process.
Can someone share clear documentation or reference links on using AWS Transfer Family to pull files from an external on-premises server into our S3 bucket?
@Sampath, I think you misunderstood the available features of the AWS Transfer service. That service actually acts as a serverless SFTP endpoint with Amazon S3 as the backend storage, to which you can connect via the SFTP protocol (it now supports FTP and FTPS as well). You can either PUSH data to S3 or PULL data from S3 via the AWS Transfer service. You cannot PULL data into S3 from anywhere else via the AWS Transfer service alone.
You may have to use another solution, such as a Python script running on AWS EC2, for that purpose.
Another option is to connect the external third-party server to the AWS Transfer service and have that server PUSH files to S3 via AWS Transfer.
For your use case, I think you need a simple solution that connects to the external third-party server and copies files from it to the AWS S3 bucket. That can be done with a Python script, and you can run it on AWS EC2, AWS ECS, AWS Lambda, AWS Batch, etc., depending on your specifications and requirements.
I have used AWS Transfer once; I found it to be very expensive and went with AWS EC2 instead. In the case of AWS EC2, you can even buy Reserved Instances to further reduce the cost. If the task is just about copying files from an external server to S3 and the copy job will never take more than 10 minutes, then it is better to run it on AWS Lambda.
In short, you cannot PULL data from any other server into S3 using the AWS Transfer service alone. You can only PUSH data to or PULL data from S3 using the AWS Transfer service.
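As a rough illustration of the Python-script approach mentioned above, here is a minimal sketch that pulls files from an external SFTP server and copies them into S3 using paramiko and boto3. The host, credentials, remote path, and bucket name are placeholders, not values from the question.
# Pull new files from a third-party SFTP server and land them in S3.
import boto3
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("sftp.thirdparty.example.com", username="user", password="****")  # placeholder host/credentials
sftp = ssh.open_sftp()

s3 = boto3.client("s3")
for name in sftp.listdir("/outgoing"):                      # placeholder remote directory
    with sftp.open(f"/outgoing/{name}", "rb") as remote_file:
        # Stream each remote file straight into the landing bucket.
        s3.upload_fileobj(remote_file, "my-landing-bucket", f"incoming/{name}")

sftp.close()
ssh.close()
A script like this can run on EC2, ECS, Lambda, or Batch, as described above, for example on a schedule.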
References to some informative blogs:
Centralize data access using AWS Transfer Family and AWS Storage Gateway
How Discover Financial secures file transfers with AWS Transfer Family
Moving external site data to AWS for file transfers with AWS Transfer Family
Easy SFTP Setup with AWS Transfer Family
With the AWS Transfer Family service you can create servers that use the SFTP, FTPS, and FTP protocols for your file transfers, and use Amazon S3 or Amazon EFS as the domain that stores and serves your files.
To connect your on-premises servers with the Transfer Family server, you will need a service like AWS Storage Gateway (File Gateway), which syncs your files to S3 over HTTPS.
If you want more details on how to connect your on-premises servers with the AWS S3/Transfer Family services, take a look at this blog post: Centralize data access using AWS Transfer Family and AWS Storage Gateway

Accessing Amazon S3 via FTP?

I have done a number of searches and can't work out whether this is doable at all.
I have a data logger that has an FTP-push function. The FTP-push function has the following settings:
FTP server
Port
Upload directory
User name
Password
In general, I understand that a FileZilla client (I have the Pro edition) is able to drop files into my AWS S3 bucket, and I have done this successfully from my local PC.
Is it possible to remove the FileZilla client requirement and enter my S3 information directly into my data logger? Something like the diagram below:
Data logger ----FTP----> S3 bucket
If not, what would be the most sensible method to have my data logger's JSON files dropped into AWS S3 via FTP?
Frankly, you'd be better off with:
Logging to local files
Using a schedule to copy the log files to Amazon S3 using the aws s3 sync command
The schedule could be triggered by cron (Linux) or a Scheduled Task (Windows).
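A minimal sketch of this scheduled-copy approach: a small Python wrapper around aws s3 sync that cron or Task Scheduler can trigger. The local directory and bucket name are placeholders.
# Invoke "aws s3 sync" to copy new/changed log files to S3.
import subprocess

LOCAL_LOG_DIR = "/var/log/datalogger"                  # placeholder local path
S3_DESTINATION = "s3://my-logger-bucket/datalogger/"   # placeholder bucket

# sync only uploads files that are new or have changed since the last run.
subprocess.run(["aws", "s3", "sync", LOCAL_LOG_DIR, S3_DESTINATION], check=True)
A cron entry such as */15 * * * * python3 /opt/scripts/sync_logs.py (path hypothetical) would run it every 15 minutes.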
Amazon recently added FTP support to AWS Transfer Family. This provides an integration with Amazon S3 via FTP without setting up any additional infrastructure; however, you should review the pricing before going down that route.
As an alternative, you could create an intermediary server that syncs between itself and Amazon S3 using the aws s3 sync CLI command.

How to use Spark to read data from one AWS account and write to another AWS account?

I have Spark jobs running on an EKS cluster to ingest AWS logs from S3 buckets.
Now I have to ingest logs from another AWS account. I have managed to use the settings below to successfully read data from the other account with the Hadoop AssumedRoleCredentialProvider.
But how do I save the DataFrame back to an S3 bucket in my own AWS account? There seems to be no way to set the Hadoop S3 config back to my own AWS account.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.assumed.role.external.id","****")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.assumed.role.credentials.provider","com.amazonaws.auth.InstanceProfileCredentialsProvider")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.assumed.role.arn","****")
val data = spark.read.json("s3a://cross-account-log-location")
data.count
//change back to InstanceProfileCredentialsProvider not working
spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider","com.amazonaws.auth.InstanceProfileCredentialsProvider")
data.write.parquet("s3a://bucket-in-my-own-aws-account")
As per the Hadoop documentation, different S3 buckets can be accessed with different S3A client configurations by using per-bucket configuration options that include the bucket name.
Eg: fs.s3a.bucket.<bucket name>.access.key
Check the below URL: http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets
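A rough PySpark sketch of what that per-bucket setup could look like for this case; the bucket names, role ARN, and external ID are placeholders, and the same fs.s3a.bucket.<name>.* keys can equally be set from the Scala snippet above via hadoopConfiguration.set.
# Per-bucket S3A credentials: assumed role for the cross-account bucket,
# instance-profile credentials for the bucket in your own account.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cross-account-ingest")
    # Reads from the cross-account bucket assume a role in the other account.
    .config("spark.hadoop.fs.s3a.bucket.cross-account-logs.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
    .config("spark.hadoop.fs.s3a.bucket.cross-account-logs.assumed.role.arn",
            "arn:aws:iam::111111111111:role/log-reader")    # placeholder ARN
    .config("spark.hadoop.fs.s3a.bucket.cross-account-logs.assumed.role.external.id", "****")
    .config("spark.hadoop.fs.s3a.bucket.cross-account-logs.assumed.role.credentials.provider",
            "com.amazonaws.auth.InstanceProfileCredentialsProvider")
    # The bucket in your own account keeps plain instance-profile credentials.
    .config("spark.hadoop.fs.s3a.bucket.my-own-bucket.aws.credentials.provider",
            "com.amazonaws.auth.InstanceProfileCredentialsProvider")
    .getOrCreate()
)

data = spark.read.json("s3a://cross-account-logs/path/")
data.write.parquet("s3a://my-own-bucket/ingested/")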

Using Elastic Transcoder on local storage

My RDS database is currently running on AWS. There is a Lambda function for uploading and transcoding videos.
Can I change the transcoder to use my local storage instead of an Amazon S3 bucket?
If you are using the AWS Elastic Transcoder service for transcoding, the input file has to be on S3, so you have to upload your files there. If you are transcoding your files inside Lambda instead, the Lambda function can fetch your local server's files over plain FTP (for example). But the best practice is to upload them to S3 first. You can clean up your S3 files after you are done with them if storage cost is your concern.
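A rough boto3 sketch of that "upload to S3, then transcode" flow; the bucket, pipeline ID, and preset ID below are placeholders for illustration.
# Upload the source video to the pipeline's input bucket, then create a job.
import boto3

s3 = boto3.client("s3")
transcoder = boto3.client("elastictranscoder")

s3.upload_file("/local/videos/input.mp4", "my-input-bucket", "videos/input.mp4")

transcoder.create_job(
    PipelineId="1111111111111-abcde1",        # placeholder pipeline ID
    Input={"Key": "videos/input.mp4"},
    Outputs=[{
        "Key": "videos/output-720p.mp4",
        "PresetId": "1351620000001-000010",   # a generic 720p system preset
    }],
)

# Once the job has finished, the source object can be deleted to save storage cost:
# s3.delete_object(Bucket="my-input-bucket", Key="videos/input.mp4")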

AWS s3 cli sync multipart upload

Can the AWS CLI's s3 sync command use multipart upload?
I'm syncing from an on-premises server to S3 using aws s3 sync, but the speed is very slow.
Assuming you mean the aws command (and not e.g. s3cmd): Yes, sync uses multipart upload by default. From the docs:
All high-level commands that involve uploading objects into an Amazon S3 bucket (aws s3 cp, aws s3 mv, and aws s3 sync) automatically perform a multipart upload when the object is large
I guess the slowness is caused by another factor, e.g. your bandwidth is low (check it with a speed test, for example) or it is already saturated.
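If you still want to tune the multipart behaviour, the knobs the CLI exposes through its s3 configuration (e.g. max_concurrent_requests, multipart_chunksize) have direct equivalents in a boto3 script; a small sketch with a placeholder bucket and file path:
# Explicit multipart-upload tuning, mirroring what the CLI does automatically.
import boto3
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,   # switch to multipart above 8 MB
    multipart_chunksize=16 * 1024 * 1024,  # size of each uploaded part
    max_concurrency=10,                    # parts uploaded in parallel
    use_threads=True,
)

s3 = boto3.client("s3")
s3.upload_file("/data/big-archive.tar.gz", "my-backup-bucket",
               "backups/big-archive.tar.gz", Config=config)
Raising concurrency only helps if the link is not already saturated, which is why checking the bandwidth first is the better starting point.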