I have a requirement in which an Excel file is uploaded to an S3 bucket. As soon as the file is uploaded, I want to trigger a Lambda function that reads the file and persists the data in an Aerospike DB.
For reading the file, I have this piece of code:
import csv
import boto3

key = 'key-name'
bucket = 'bucket-name'

s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object(bucket, key)

# read the object body and split it into lines
data = s3_object.get()['Body'].read().decode('utf-8').splitlines()

lines = csv.reader(data)
headers = next(lines)
print('headers: %s' % (headers))
for line in lines:
    # print the complete line
    print(line)
But I am not able to figure out how to connect to the Aerospike DB, as the boto3 library doesn't support Aerospike.
How can I connect to the DB cluster and persist the data?
Any reference would be helpful.
I think the way to interact with Aerospike from something like AWS Lambda is to use the Aerospike REST Client, which provides a server that translates RESTful API requests into messages to an Aerospike cluster (it is mentioned in the blog post linked below).
Basically, you run a REST server (the Aerospike REST Client) and send it HTTP requests from AWS Lambda using Python; the server translates those requests into Aerospike operations and is responsible for executing them.
This is the GitHub repository of the Aerospike REST Client; it also links to a couple of blog posts on how to use it and Swagger UI documentation of the supported requests:
https://github.com/aerospike/aerospike-client-rest
There is also this blog post on Serverless Event Stream Processing with Aerospike, which can help you get started:
https://medium.com/aerospike-developer-blog/serverless-event-stream-processing-with-aerospike-679f2a5cbba6
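To give an idea of what that looks like from the Lambda side, here is a minimal sketch. The host name, namespace, set, and the /v1/kvs/... path are assumptions based on the REST client's documented key-value endpoints, so check the Swagger UI of your deployment for the exact routes and methods:
import json
import urllib3

# Hypothetical address of your Aerospike REST Client deployment
REST_CLIENT_URL = "http://aerospike-rest-client.internal:8080"

http = urllib3.PoolManager()

def persist_row(namespace, set_name, key, bins):
    # Store one record (a dict of bins) via the Aerospike REST Client
    url = "{}/v1/kvs/{}/{}/{}".format(REST_CLIENT_URL, namespace, set_name, key)
    resp = http.request(
        "POST",
        url,
        body=json.dumps(bins).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    if resp.status >= 300:
        raise RuntimeError("Aerospike REST call failed: {} {}".format(resp.status, resp.data))
Each line produced by the csv.reader loop in the question could then be turned into a dict of bins and passed to persist_row.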
Context
I'm building a mock service to learn AWS. I want a user to be able to upload a sound file (which other users can listen to). To do this I need the sound file to be uploaded to S3, and metadata such as the file name, the uploader's name, the length, and the S3 key to be stored in RDS. It is preferable that the user uploads directly to S3 with a signed URL instead of doubling the data transferred by first uploading it to my server and from there to S3.
Ideally this would be transactional, but from what I have gathered there is no built-in functionality for that. To implement this and minimize the risk of the file being successfully uploaded to S3 without the metadata reaching RDS (and vice versa), my best guess is as follows:
My solution
With words:
First is an attempt to upload the file to S3 with a key (uuid) I generate locally or server-side. If this is successful I make a request to my API to upload the metadata including the key to RDS. If this is unsuccessful I remove the object from S3.
With code:
uuid = get_uuid_from_server();
s3Client.putObject({.., key: uuid, ..}, function(err, data) {
    if (err) {
        reject(err);
    } else {
        resolve(data);
        // Upload metadata to RDS through an API call to the EC2 server.
        // Remove the S3 object with key `uuid` if this call is unsuccessful.
    }
});
As I'm still learning, my approaches are seldom best practice, but I was unable to find any good information on this particular problem. Is my approach above in line with best practices?
Bonus question: is it beneficial for security purposes to generate the file's key (uuid) server-side instead of client-side?
Here are two approaches you can pick from, assuming the client is a web browser or mobile app.
1. Use your server as a proxy to S3.
Your server acts as a proxy between your clients and S3. You have full control of the upload flow: you can restrict the supported file types and inspect file contents (for example, to make sure the file is a valid sound file) before uploading to S3.
2. Use your server to create pre-signed upload URLs
In this approach, your client first asks your server to create one or more (for multipart upload) pre-signed URLs. Clients then upload directly to S3 using those URLs. Your server can save those URLs to keep track of the uploads later; a sketch of generating one follows below.
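For reference, here is a minimal sketch of issuing a pre-signed PUT URL with boto3 on the server side; the bucket name, expiry, and the server-generated uuid key are assumptions:
import uuid
import boto3

s3 = boto3.client("s3")

def create_upload_url(bucket="my-sound-bucket", expires_in=900):
    # Return a server-generated object key and a pre-signed URL the client can PUT to
    key = str(uuid.uuid4())
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,  # seconds the URL stays valid
    )
    return key, url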
To be notified when the upload finishes successfully or unsuccessfully, you can either
(1) Ask clients to call another API, e.g. /ack, after the upload for a particular signed URL finishes. If this API is not called after some time, e.g. 1 hour, you can check with S3 and delete the file accordingly. You can do this because you stored the signed URL in your DB at the start of the upload.
or
(2) Make use of S3 events. You can configure the ObjectCreated event in S3, which fires whenever an object is created, send all the events to an SQS queue, and have your server process each event from there. This way you do not rely on clients to update your server after an upload finishes; S3 notifies your server for every successful upload. A configuration sketch follows below.
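Here is a minimal sketch of wiring that notification up with boto3; the bucket name and queue ARN are assumptions, and the SQS queue policy must separately allow S3 to send messages to it:
import boto3

s3 = boto3.client("s3")

# Send every ObjectCreated event in the bucket to an SQS queue
s3.put_bucket_notification_configuration(
    Bucket="my-sound-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:upload-events",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)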
I've created a dashboard and deployed it on AWS Elastic Beanstalk. The data fed into my dashboard is supplied by a CSV file in my S3 bucket, set to update every 12 hours with AWS EventBridge. For some reason, my deployed dashboard is not updating. It's still using the same old data from my previous deployment even though the CSV file has been updating correctly.
More specifically:
I'm trying to create a Dashboard with Plotly Dash to visualize some trends starting from 2020-01-01.
I had a Lambda function that scrapes the data and saves them as a CSV file in an S3 bucket. This CSV file gets overwritten every 12 hours to capture the latest available trends.
I used boto3 to fetch the CSV file directly from my S3 bucket and use its data to construct my dashboard.
The app was then deployed with Elastic Beanstalk.
Everything was written in a Cloud9 environment, except for setting up the EventBridge trigger.
Say I deployed the app on 2020-12-10. The CSV file would contain all data up till 2020-12-10, and my dashboard would show trends between 2020-01-01 and 2020-12-10.
However, if I check the dashboard anytime after 2020-12-10 (or when the CSV file is updated with data post 2020-12-10), it still shows the same trends (between 2020-01-01 and 2020-12-10), though the CSV file in my S3 bucket is up to date.
The dashboard would update only if I redeploy the app on Elastic Beanstalk. Not sure why this is the case since my app is pulling the data directly from the updated CSV file.
Is my architecture incorrect here? Or do I need to tweak some settings in AWS?
Thanks in advance!
Update:
I'm using the following code to load my data into the trends_data dataframe.
import boto3
import pandas as pd

# define bucket name
bucket = "mobilitytrends"

# define s3 client
s3 = boto3.client('s3')

# define file name
historical_file_name = 'historical_trends.csv'

# load historical data from s3
data_obj = s3.get_object(Bucket=bucket, Key=historical_file_name)
trends_data = pd.read_csv(data_obj['Body'], low_memory=False)
I then have some functions that clean this dataframe. I have a scatterplot that's rendered using the code snippet below:
fig.add_scatter(x=filtered_trend.index,
                y=filtered_trend[transportation],
                line=dict(color=line_color[idx]),
                name=transportation)
filtered_trend is a subset of trends_data, which gets selected based on some callback functions that I set up. But I don't think that's where the problem lies since everything worked fine locally.
In Dash, global variables will break your app. More specifically, modifying global variables will not work, at least not reliably.
One approach to avoid the use of global variables would be to create a single callback that first loads the data from S3, and then renders the layout. Other approaches are discussed in this similar question.
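As a minimal sketch of that pattern, assuming the bucket and key from the question (the dcc.Location trigger and the placeholder figure are illustrative): the CSV is read from S3 inside the callback, so every page load sees the latest file instead of a module-level global loaded once at deploy time.
import boto3
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html
from dash.dependencies import Input, Output

app = Dash(__name__)

# dcc.Location makes the callback fire on every page load
app.layout = html.Div([
    dcc.Location(id="url"),
    html.Div(id="page-content"),
])

@app.callback(Output("page-content", "children"), Input("url", "pathname"))
def render_layout(_):
    # Fetch the data inside the callback so it is never cached in a global
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="mobilitytrends", Key="historical_trends.csv")
    trends_data = pd.read_csv(obj["Body"], low_memory=False)
    fig = px.line(trends_data)  # placeholder; plug in your own cleaning/figure code
    return dcc.Graph(figure=fig)

if __name__ == "__main__":
    app.run_server()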
I had a similar problem: Elastic Beanstalk was not fetching the latest version of the CSV from the S3 bucket.
The only option I could find was to restart the app server after a new version of the CSV is written to the S3 bucket.
You can use the code below in an AWS Lambda function to restart your app server at specific times of the day:
import boto3

client = boto3.client('elasticbeanstalk', region_name='your-region')

def lambda_handler(event, context):
    try:
        response = client.restart_app_server(EnvironmentName='your-environment-name')
        if response:
            print('restarting app server')
        else:
            print('Failed to restart server')
    except Exception as e:
        print(e)
Make sure to set up a cron schedule with EventBridge to control the timing; a sketch follows below.
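A minimal sketch of that scheduling with boto3 (the rule name, cron expression, and Lambda ARN are assumptions, and EventBridge also needs permission to invoke the function):
import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Run twice a day, shortly after the CSV refresh (hypothetical times)
rule = events.put_rule(
    Name='restart-eb-app-server',
    ScheduleExpression='cron(30 0,12 * * ? *)',
)

events.put_targets(
    Rule='restart-eb-app-server',
    Targets=[{'Id': 'restart-lambda',
              'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:restart-eb'}],
)

# Allow EventBridge to invoke the Lambda
lambda_client.add_permission(
    FunctionName='restart-eb',
    StatementId='allow-eventbridge-restart',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn'],
)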
I'm new to AWS Lambda, Cognito, AWS API Gateway, and serverless computing.
My user uploads a CSV file, and I want to insert its data into Amazon RDS and return a success or failure response to the user.
I understand that I can upload the file to S3 (using Cognito identity pools) and then write a Lambda that triggers on upload to S3 and inserts the data from the CSV into Amazon RDS. I want to show the success or failure response from the Lambda to the user.
One way I thought of is:
After upload to S3, show a message, "Upload successful. File processing"
Then redirect the user to a file list page and show the status of the file there.
Meanwhile, my Lambda function will insert the file name into a file table with the status column set to "IN PROGRESS", and update its status depending on the success/failure of the CSV insert.
I will keep checking the file table every 10 seconds or so, and change the status shown on the file list page for the recent file when its status changes in the table.
Is there a better way to do this using AWS serverless computing?
Going with the serverless approach is good. If you want to deliver real-time notifications, take a look at using API Gateway with WebSocket APIs.
You would enhance your suggestion by replacing the 10-second poll with an open WebSocket connection.
Once the file is processed your Lambda would notify the web socket connection and then you would notify the customer.
This is how real time notification systems and instant messenger style applications tend to work.
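As a minimal sketch (the WebSocket endpoint URL, the connection-ID bookkeeping, and the message shape are assumptions): once the CSV insert finishes, the processing Lambda can push a status message to the client over the API Gateway WebSocket connection.
import json
import boto3

# Endpoint of your WebSocket API: https://{api-id}.execute-api.{region}.amazonaws.com/{stage}
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://abc123.execute-api.us-east-1.amazonaws.com/prod",
)

def notify_upload_result(connection_id, file_name, success):
    # Send the processing result to the client that uploaded the file
    message = {"file": file_name, "status": "SUCCESS" if success else "FAILED"}
    apigw.post_to_connection(
        ConnectionId=connection_id,
        Data=json.dumps(message).encode("utf-8"),
    )
The connection ID is received on the $connect route of the WebSocket API and would typically be stored (e.g. in DynamoDB) against the user or file so the processing Lambda can look it up.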
I need to automate a process to extract data from Google BigQuery and export it to a CSV on an external server outside of GCP.
While researching how to do that, I found some commands to run from my external server, but I would prefer to do everything in GCP to avoid possible problems.
To export the table to a CSV in Google Cloud Storage:
bq --location=US extract --compression GZIP 'dataset.table' gs://example-bucket/myfile.csv
To download the CSV from Google Cloud Storage:
gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [OBJECT_DESTINATION]
But I would like to hear your suggestions
If you want to fully automate this process, I would do the following:
Create a Cloud Function to handle the export:
This is the most lightweight solution, as Cloud Functions are serverless and give you the flexibility to implement code with the client libraries. See the quickstart; I recommend using the console to create the function to start with.
In this example I recommend triggering the Cloud Function from an HTTP request, i.e. when the function's URL is called, it will run the code inside it.
Example Cloud Function code in Python that creates the export when an HTTP request is made:
main.py
from google.cloud import bigquery

def hello_world(request):
    project_name = "MY_PROJECT"
    bucket_name = "MY_BUCKET"
    dataset_name = "MY_DATASET"
    table_name = "MY_TABLE"

    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )

    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)
requirements.txt
google-cloud-bigquery
Note that the job runs asynchronously in the background; you will receive a response with the job ID, which you can use to check the state of the export job in the Cloud Shell by running:
bq show -j <job_id>
Create a Cloud Scheduler scheduled job:
Follow this documentation to get started. You can set the Frequency with the standard cron format, for example 0 0 * * * will run the job every day at midnight.
As a target, choose HTTP, in the URL put the Cloud Function HTTP URL (you can find it in the console, inside the Cloud Function details, under the Trigger tab), and as HTTP method choose GET.
Create it, and you can test it in the Cloud Scheduler by pressing the Run now button in the Console.
Synchronize your external server and the bucket:
Up until now you have only scheduled the exports to run every 24 hours. To synchronize the bucket contents with your external server, you can use the gsutil rsync command. If you want to save the exports to, let's say, the my_exports folder, you can run the following on your external server:
gsutil rsync gs://BUCKET_WITH_EXPORTS /local-path-to/my_exports
To run this command periodically, you could create a standard cron job in the crontab of your external server, scheduled to run each day as well, just a few hours after the BigQuery export to ensure the export has completed; an example entry is shown below.
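As an illustration (the 3 a.m. time and the destination path are assumptions), the crontab entry could look like this:
# m h dom mon dow command
0 3 * * * gsutil rsync gs://BUCKET_WITH_EXPORTS /local-path-to/my_exports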
Extra:
I have hard-coded most of the variables in the Cloud Function to always be the same. However, you can send parameters to the function if you make a POST request instead of a GET request and send the parameters as data in the body.
You will have to change the Cloud Scheduler job to send a POST request to the Cloud Function HTTP URL, and in the same place you can set the body to send the parameters regarding the table, dataset, and bucket, for example. This will allow you to run exports from different tables at different hours, and to different buckets. A sketch of reading those parameters is below.
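For instance, the start of the function could read those parameters from the JSON body instead of the hard-coded values (a sketch; the parameter names are assumptions):
def hello_world(request):
    params = request.get_json(silent=True) or {}
    project_name = params.get("project", "MY_PROJECT")
    bucket_name = params.get("bucket", "MY_BUCKET")
    dataset_name = params.get("dataset", "MY_DATASET")
    table_name = params.get("table", "MY_TABLE")
    # ... the rest of the export code stays the same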
I run a service where the users can publicly upload and download files to our site, using Amazon S3. Last month we had a problem where a user uploaded a file that was downloaded like crazy, resulting in 170 TB of bandwidth and a huge bill.
Talking to Amazon and searching on Stack Overflow, the suggested way to ensure this doesn't happen again is to download the S3 logs, parse them, and take action from there.
We could build such a script, but I assume there must be some open-source or third-party service providing a script or tool for this?
What about:
Create a CloudFront Distribution for downloads
Set up a CloudWatch alarm that is triggered when the distribution's BytesDownloaded metric exceeds your chosen monthly limit
Add a notification (sent to an SNS topic you create) that is triggered when the alarm is fired
Add a Lambda function that is triggered by SNS when a notification is sent to that topic (the SNS topic should also have your email subscribed of course so you receive an email with the alarm)
In the Lambda function, write code that uses the AWS SDK to update the CloudFront distribution and set its Enabled value to false (see the sketch after this list)
(You could also create a notification that is fired when the state of the alarm changes back to OK and trigger a lambda function that re-enables the distribution)
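A minimal sketch of that disabling step with boto3 (the distribution ID is an assumption; note that update_distribution requires sending back the full distribution config together with the current ETag):
import boto3

cloudfront = boto3.client("cloudfront")

def disable_distribution(distribution_id="EDFDVBD6EXAMPLE"):
    # Fetch the current config, flip Enabled to False, and push it back
    current = cloudfront.get_distribution_config(Id=distribution_id)
    config = current["DistributionConfig"]
    config["Enabled"] = False
    cloudfront.update_distribution(
        Id=distribution_id,
        DistributionConfig=config,
        IfMatch=current["ETag"],  # required optimistic-locking token
    )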
My solution to this, and to problems like this, is to have billing alerts on my account. I know roughly how much I should spend each month and set up alerts accordingly: roughly, I divided that amount by 4 (weeks) and set a series of billing alerts at 1/4, 1/2, 3/4, and 1x my estimated spend.
This is not a technical solution to stop the downloads, but at least someone will get notified and they can take action before it gets out of control.
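If you prefer to create these alerts programmatically, one option (a sketch; the threshold and SNS topic ARN are assumptions, and billing metrics must be enabled and are only published in us-east-1) is a CloudWatch alarm on the EstimatedCharges metric:
import boto3

# Billing metrics live in us-east-1 only
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="monthly-spend-quarter",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=6 * 60 * 60,          # the billing metric updates a few times a day
    EvaluationPeriods=1,
    Threshold=250.0,             # e.g. 1/4 of a 1000 USD monthly estimate
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],
)
Repeating this with thresholds at 1/2, 3/4, and 1x the estimate gives the series of alerts described above.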
Your best approach is to distribute your S3 content using Amazon CloudFront, put AWS Web Application Firewall (WAF) in front of it, and implement IP blocking.
So if an IP hits your CloudFront distribution more than, say, 5 times within the configured period, AWS WAF will block that IP.
Here is a detailed guide:
https://blogs.aws.amazon.com/security/post/Tx1ZTM4DT0HRH0K/How-to-Configure-Rate-Based-Blacklisting-with-AWS-WAF-and-AWS-Lambda
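The linked post describes a Lambda-based rate-blocking setup for classic WAF; as a rough sketch of the same idea with the newer WAFv2 rate-based rules (not from the post; the limit, names, and region are assumptions):
import boto3

# Web ACLs with CLOUDFRONT scope must be created in us-east-1
wafv2 = boto3.client("wafv2", region_name="us-east-1")

wafv2.create_web_acl(
    Name="download-rate-limit",
    Scope="CLOUDFRONT",
    DefaultAction={"Allow": {}},
    Rules=[{
        "Name": "block-heavy-downloaders",
        "Priority": 0,
        # Block IPs that exceed this many requests per 5-minute window
        "Statement": {"RateBasedStatement": {"Limit": 500, "AggregateKeyType": "IP"}},
        "Action": {"Block": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "block-heavy-downloaders",
        },
    }],
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "download-rate-limit",
    },
)
The web ACL then needs to be associated with the CloudFront distribution.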
We had a similar kind of requirement a while ago.
We used CloudTrail logs to figure out all the activities being performed on our AWS account.
Hope the script below for downloading and filtering CloudTrail logs helps you out. (The following script only extracts launched instance IDs, the owner, and the event name; please modify it according to your needs.)
import boto3
import gzip
import os
import json

client = boto3.client('s3')
bucketname = "mybucketname"
list_bucket_objects = client.list_objects(Bucket=bucketname)
download_path = '/home/ec2-user/cloudtrail/'

# DOWNLOADING: download the log files from S3
for obj in list_bucket_objects['Contents']:
    print(obj['Key'])
    object_name = obj['Key'].split('/')
    if len(object_name) == 8:
        print("Downloading ---> %s" % object_name[7])
        client.download_file(bucketname, obj['Key'], download_path + object_name[7])

# UNZIPPING: unzip the downloaded files into one folder
file_path = '/home/ec2-user/cloudtrail/'
new_file_path = '/home/ec2-user/cloudtrail/logs/'

# create the log directory
if not os.path.exists(new_file_path):
    os.mkdir(new_file_path)

files = os.listdir(file_path)
for file in files:
    if os.path.isfile(file_path + file):
        f = gzip.GzipFile(file_path + file, 'rb')
        s = f.read()
        f.close()
        split_file = file.split('.')
        log_path = new_file_path + split_file[0]
        print(log_path)
        out = open(log_path, 'wb')
        out.write(s)
        out.close()

        # PARSING AND FILTERING: parse the log as JSON, filter it, and append the result to result.txt
        fin = open(log_path).read()
        content = json.loads(fin)
        for i in range(0, len(content['Records'])):
            event = content['Records'][i]['eventName']
            user = ''
            if 'userName' in content['Records'][i]['userIdentity']:
                user = content['Records'][i]['userIdentity']['userName']
            if 'responseElements' in content['Records'][i]:
                res_ele = content['Records'][i]['responseElements']
                if res_ele:
                    if 'instancesSet' in res_ele:
                        if 'items' in res_ele['instancesSet']:
                            instance_id = res_ele['instancesSet']['items'][0]['instanceId']
                            if event == "RunInstances" and instance_id != "":
                                open('result.txt', 'a').write(event + ": :" + user + ": :" + instance_id + "\n")

# result.txt is stored in your current working directory.