I run a service where the users can publicly upload and download files to our site, using Amazon S3. Last month we had a problem where a user uploaded a file that was downloaded like crazy, resulting in 170 TB of bandwidth and a huge bill.
Talking to Amazon and searching on StackOverflow the way to ensure this doesn't happen again is to download the S3 logs parse them, and take actions from there.
We could build such script, but I guess there must be some open source or third party service providing a script or service for this?
What about:
Create a CloudFront Distribution for downloads
Setup a CloudWatch alarm that is triggered when the distribution's BytesDownloaded metric exceeds your chosen monthly limit
Add a notification (sent to an SNS topic you create) that is triggered when the alarm is fired
Add a Lambda function that is triggered by SNS when a notification is sent to that topic (the SNS topic should also have your email subscribed of course so you receive an email with the alarm)
In the Lambda function write code that uses the AWS SDK to update the cloudfront distribution and sets the enabled value to false
(You could also create a notification that is fired when the state of the alarm changes back to OK and trigger a lambda function that re-enables the distribution)
My solution to this, and problem like this, is to have billing alerts on my account. I know roughly how much I should spend each month, and setup alerts accordingly - roughly I have divided that amount by 4 (weeks), and set a series of billing alerts at 1/4, 1/2, 3/4 and 1X my estimated spend.
This is not a technical solution to stop the downloads, but at least someone will get notified and they can take action before it gets out of control.
Your best approach is distribute your S3 content using AWS Cloudfront and implement AWS Web Application Firewall (WAF) and implement IP blocking.
So if a IP hits your Cloud Front Distribution more than for say 5 times the AWS WAF will block that IP.
Here is the detailed guide.
https://blogs.aws.amazon.com/security/post/Tx1ZTM4DT0HRH0K/How-to-Configure-Rate-Based-Blacklisting-with-AWS-WAF-and-AWS-Lambda
We had similar kind of requirement long ago.
We used CloudTrail logs to figure out all the activities being performed on our AWS Account.
hope the script for downloading and filter Cloudtrail logs helps you out. ( The following script is only for figuring out launched instance-ids, owner, eventname. please modify according to your need)
import boto3
import gzip
import os
import json
client = boto3.client('s3')
bucketname = "mybucketname"
list_bucket_objects = client.list_objects(Bucket=bucketname )
download_path = '/home/ec2-user/cloudtrail/'
# DOWNLOADING: Downloading Log files from S3
for object in list_bucket_objects['Contents']:
print object['Key']
object_name = object['Key'].split('/')
if len(object_name)==8:
print "Downloading --->%s" % object_name[7]
client.download_file(bucketname, object['Key'], download_path+object_name[7])
# UNZIPPING: Unzipping the files in one folder
file_path = '/home/ec2-user/cloudtrail/'
new_file_path = '/home/ec2-user/cloudtrail/logs/'
#Create Log Directory
if not os.path.exists(new_file_path):
os.mkdir(new_file_path)
files = os.listdir(file_path)
for file in files:
boolean = os.path.isfile(file_path+file)
if boolean == True:
f = gzip.GzipFile(file_path+file, 'rb')
s = f.read()
f.close()
split_file = file.split('.')
log_path = new_file_path+split_file[0]
print log_path
out = open(log_path, 'wb')
out.write(s)
out.close()
# PARSING AND FILTERING: parsing output into json format, filtering output and writing it in result.txt file
fin = open(log_path).read()
content = json.loads(fin)
for i in range(0, len(content['Records'])):
event = content['Records'][i]['eventName']
if 'userName' in content['Records'][i]['userIdentity']:
user = content['Records'][i]['userIdentity']['userName']
if 'responseElements' in content['Records'][i]:
res_ele = content['Records'][i]['responseElements']
if res_ele:
if 'instancesSet' in content['Records'][i]['responseElements']:
if 'items' in content['Records'][i]['responseElements']['instancesSet']:
instance_id = content['Records'][i]['responseElements']['instancesSet']['items'][0]['instanceId']
if (event == "RunInstances" and instance_id != ""):
open('result.txt', 'ab').write(event+": :"+user+": :"+instance_id+"\n")
#result.txt is stored in your current working directory.
Related
I have a requirement, in which an excel file is being uploaded to S3 bucket, so as soon as that file gets uploaded, I want to trigger a lambda function which will read that excel file and then persist the data in aerospike db.
For reading the excel file, I have got this piece of code
key = 'key-name'
bucket = 'bucket-name'
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object(bucket, key)
data = s3_object.get()['Body'].read().decode('utf-8').splitlines()
lines = csv.reader(data)
headers = next(lines)
print('headers: %s' %(headers))
for line in lines:
#print complete line
print(line)
But I not able to figure out how to connect to aerospike db, as boto3 library doesn't support aerospike.
Please help me in connecting to db cluster and persist the data ?
Or any reference would be helpful
I think the way to interact with Aerospike from something like AWS Lambda is to use the Aerospike REST Client that provides a server which translates Restful API requests into messages to an Aerospike Cluster (it is mentioned in the blog post).
Basically you can run a REST server (Aerospike REST Client) that you can send HTTP requests from AWS Lambda using Python to the server and the server will translate these requests to Aerospike operations and will be responsible of executing them.
This is the GitHub repository of Aerospike REST Client - it also contains couple of blog posts of how to use it and a Swagger UI documentation of the actual supported requests:
https://github.com/aerospike/aerospike-client-rest
There is also this blog post of Serverless Event Stream Processing with Aerospike which can help you get started:
https://medium.com/aerospike-developer-blog/serverless-event-stream-processing-with-aerospike-679f2a5cbba6
I've created a dashboard and deployed it on AWS Elastic Beanstalk. The data fed into my dashboard is supplied by a CSV file in my S3 bucket, set to update every 12 hours with AWS EventBridge. For some reason, my deployed dashboard is not updating. It's still using the same old data from my previous deployment even though the CSV file has been updating correctly.
More specifically:
I'm trying to create a Dashboard with Plotly Dash to visualize some trends starting from 2020-01-01.
I had a Lambda function that scrapes the data and saves them as a CSV file in an S3 bucket. This CSV file gets overwritten every 12 hours to capture the latest available trends.
I used boto3 to fetch the CSV file directly from my S3 bucket and use its data to construct my dashboard.
The app was then deployed with Elastic Beanstalk.
Everything was written in a Cloud9 environment, except for setting up the EventBridge trigger.
Say I deployed the app on 2020-12-10. The CSV file would contain all data up till 2020-12-10, and my dashboard would show trends between 2020-01-01 and 2020-12-10.
However, if I check the dashboard anytime after 2020-12-10 (or when the CSV file is updated with data post 2020-12-10), it still shows the same trends (between 2020-01-01 and 2020-12-10), though the CSV file in my S3 bucket is up to date.
The dashboard would update only if I redeploy the app on Elastic Beanstalk. Not sure why this is the case since my app is pulling the data directly from the updated CSV file.
Is my architecture incorrect here? Or do I need to tweak some settings in AWS?
Thanks in advance!
Update:
I'm using the following codes to load my data into trends_data dataframe.
# define bucket name
bucket = "mobilitytrends"
# define s3 client
s3 = boto3.client('s3')
# define file names
historical_file_name = 'historical_trends.csv'
# load historical data from s3
data_obj = s3.get_object(Bucket= bucket, Key= historical_file_name)
trend_data = pd.read_csv(data_obj['Body'],low_memory = False)
I then have some functions that clean this dataframe. I have a scatterplot that's rendered using the code snippet below:
fig.add_scatter(x = filtered_trend.index,
y = filtered_trend[transportation],
line = dict(color = line_color[idx]),
name = transportation)
filtered_trend is a subset of trends_data, which gets selected based on some callback functions that I set up. But I don't think that's where the problem lies since everything worked fine locally.
In Dash, global variables will break your app. More specifically, modifying global variables will not work, at least not reliably.
One approach to avoid the use of global variables would be to create a single callback that first loads the data from S3, and then renders the layout. Other approaches are discussed in this similar question.
I had a similar problem, EB was not fetching the latest version of CSV from the s3 bucket.
The only option I could find was to restart the app server after a new version of the CSV is updated in s3 bucket.
you can use below code in AWS lambda function to restart your app server at specific times in a day:
import boto3
client = boto3.client('elasticbeanstalk', region_name='your-region')
def lambda_handler(event, context):
try:
response = client.restart_app_server(EnvironmentName='your-environment-name')
if response:
print('restarting app server')
else:
print('Failed to restart server')
except Exception as e:
print(e)
Make sure to set up cron using eventbridge for timings
in S3 buckets we have a folder where incoming files are being placed. And then some of our system picks it up and processes it.
I want to know how many files in this folder is older than some period and then send a notification to corresponding team.
I.e. In S3 bucket, if some file arrived today and it's still there even after 3 hours, I want to get notified.
I am thinking to use boto python library to iterate through all the objects inside S3 bucket at schduled interval to check files are folder. And then send notification. However, this pulling solution doesn't seem good.
I am thinking to have some event based solution. I know, S3 has events which I can subscribe using either queue or lambda. However, I don't want to do any action as soon as I have file available, I just want to to check which files are older than some time and send email notification.
can we achieve this using event based solution?
Per hour we are expecting around 1000 files. Once file is processed they are moved to different folder. However if something goes wrong it will be there. So in day, I am not expecting more than 10,000 files in one bucket. Consider I have multiple buckets.
Itarate through S3 files to do that kind of filter is not a good idea. It can get very slow when you have more than a thousad of files in there. I would suggest you to use a database to store that records.
You can have a dynamodb with 2 columns: file name and upload date. Or, if budget is a problem, you can even have a sqlite3 file on the bucket, and fetch it whenever you need to query or add data to it. I did this using lambda, and it works just fine. Just don't forget to upload the file again when new records are inserted.
You could create an Amazon CloudWatch Event rule that triggers an AWS Lambda function at a desired time interval (eg every 5 minutes or once an hour).
The AWS Lambda function could list the desired folder looking for files older than a desired time period. It would be something like this:
import boto3
from datetime import datetime, timedelta, timezone
s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(
Bucket = 'my-bucket',
Prefix = 'to-be-processed/'
)
for page in page_iterator:
for object in page['Contents']:
if object['LastModified'] < datetime.now(tz=timezone.utc) - timedelta(hours=3):
// Print name of object older than given age
print(object['Key'])
You could then have it notify somebody. The easiest way would be to send a message to an Amazon SNS topic, and then people can subscribe to that topic via SMS or email to receive a notification.
The above code is quite simple in that it will find the same file every time, not just the new files that have been added to the notification period.
I am creating temporary credentials via AWS Security Token Service (AWS STS).
And Using these credentials to upload a file to S3 from S3 JAVA SDK.
I need some way to restrict the size of file upload.
I was trying to add policy(of s3:content-length-range) while creating a user, but that doesn't seem to work.
Is there any other way to specify the maximum file size which user can upload??
An alternative method would be to generate a pre-signed URL instead of temporary credentials. It will be good for one file with a name you specify. You can also force a content length range when you generate the URL. Your user will get URL and will have to use a specific method (POST/PUT/etc.) for the request. They set the content while you set everything else.
I'm not sure how to do that with Java (it doesn't seem to have support for conditions), but it's simple with Python and boto3:
import boto3
# Get the service client
s3 = boto3.client('s3')
# Make sure everything posted is publicly readable
fields = {"acl": "private"}
# Ensure that the ACL isn't changed and restrict the user to a length
# between 10 and 100.
conditions = [
{"acl": "private"},
["content-length-range", 10, 100]
]
# Generate the POST attributes
post = s3.generate_presigned_post(
Bucket='bucket-name',
Key='key-name',
Fields=fields,
Conditions=conditions
)
When testing this make sure every single header item matches or you'd get vague access denied errors. It can take a while to match it completely.
I believe there is no way to limit the object size before uploading, and reacting to that would be quite hard. A workaround would be to create an S3 event notification that triggers your code, through a Lambda funcation or SNS topic. That could validate or delete the object and notify the user for example.
I am considering moving to lambdas and after spending some time reading docs and various blogs with user experiences I am still struggling with a simple question. Is there a proposed/proper way to use lambda with existing s3 files?
I have an s3 bucket that contains archived data spanning a couple of years. The size of these data is rather large (hundreds of GB). Each file is a simple txt file. Each line in the file represents an event and it's just a comma separated string.
My endgame is to consume these files, parse each one of them line by line apply some transformation, create batches of lines and send them to an external service. From what I've read so far, if I write a proper lambda, this will be triggered by an s3 event (for example an upload of a new file).
Is there a way to apply the lambda to all the existing contents of my bucket?
Thanks
For existing resources you would need to write a script that gets a listing of all your resources and sends each item to a Lambda function somehow. I'd probably look into sending the location of each of your existing S3 objects to a Kenesis stream and configure a Lambda function to pull records from that stream and process them.
Try using s3cmd.
s3cmd modify --recursive --add-header="touched:touched" s3://path/to/s3/bucket-or-folder
This will modify metadata and invoke an event for lambda
I had a similar problem I solved it with minimal changes to my existing Lambda function. The solution involves creating API Gateway trigger (in addition to S3 trigger) - the API gateway trigger is used to process historical files in S3 & the regular S3 trigger will processes my files as new files are uploaded to my S3 bucket.
Initially - I started by building my function to expect a S3 event as trigger. Recall that the S3 events have this structure - so I would look for the S3 bucket name and key to process - like so:
for record in event['Records']:
bucket = record['s3']['bucket']['name']
key = unquote_plus(record['s3']['object']['key'], encoding='utf-8')
temp_dir = tempfile.TemporaryDirectory()
video_filename = os.path.basename(key)
local_video_filename = os.path.join(temp_dir.name, video_filename)
s3_client.download_file(bucket, key, local_video_filename)
But when you send the API Gateway trigger there is no "Records" object in the request/event. You can use query parameters in the API Gateway Trigger - so the modification required to the above snippet of code is:
if 'Records' in event:
# this means we are working off of an S3 event
records_to_process = event['Records']
else:
# this is for ad-hoc posts via API Gateway trigger for Lambda
records_to_process = [{
"s3":{"bucket": {"name": event["queryStringParameters"]["bucket"]},
"object":{"key": event["queryStringParameters"]["file"]}}
}]
for record in records_to_process:
# below lines of code s same as the earlier snippet of code
bucket = record['s3']['bucket']['name']
key = unquote_plus(record['s3']['object']['key'], encoding='utf-8')
temp_dir = tempfile.TemporaryDirectory()
video_filename = os.path.basename(key)
local_video_filename = os.path.join(temp_dir.name, video_filename)
s3_client.download_file(bucket, key, local_video_filename)
Postman result of sending the post request
Try to copy your bucket content and catch create events with lambda.
copy:
s3cmd sync s3://from/this/bucket/ s3://to/this/bucket
for larger buckets:
https://github.com/paultuckey/s3_bucket_to_bucket_copy_py