I've created a dashboard and deployed it on AWS Elastic Beanstalk. The data fed into my dashboard is supplied by a CSV file in my S3 bucket, set to update every 12 hours with AWS EventBridge. For some reason, my deployed dashboard is not updating. It's still using the same old data from my previous deployment even though the CSV file has been updating correctly.
More specifically:
I'm trying to create a dashboard with Plotly Dash to visualize some trends starting from 2020-01-01.
I have a Lambda function that scrapes the data and saves it as a CSV file in an S3 bucket. This CSV file gets overwritten every 12 hours to capture the latest available trends.
I use boto3 to fetch the CSV file directly from my S3 bucket and use its data to construct the dashboard.
The app was then deployed with Elastic Beanstalk.
Everything was written in a Cloud9 environment, except for setting up the EventBridge trigger.
Say I deployed the app on 2020-12-10. The CSV file would contain all data up to 2020-12-10, and my dashboard would show trends between 2020-01-01 and 2020-12-10.
However, if I check the dashboard any time after 2020-12-10 (or whenever the CSV file is updated with data past 2020-12-10), it still shows the same trends (between 2020-01-01 and 2020-12-10), even though the CSV file in my S3 bucket is up to date.
The dashboard only updates if I redeploy the app on Elastic Beanstalk. I'm not sure why this is the case, since the app pulls its data directly from the updated CSV file.
Is my architecture incorrect here? Or do I need to tweak some settings in AWS?
Thanks in advance!
Update:
I'm using the following code to load my data into the trends_data dataframe.
import boto3
import pandas as pd

# define bucket name
bucket = "mobilitytrends"
# define S3 client
s3 = boto3.client('s3')
# define file name
historical_file_name = 'historical_trends.csv'
# load historical data from S3
data_obj = s3.get_object(Bucket=bucket, Key=historical_file_name)
trends_data = pd.read_csv(data_obj['Body'], low_memory=False)
I then have some functions that clean this dataframe. I have a scatterplot that's rendered using the code snippet below:
fig.add_scatter(x=filtered_trend.index,
                y=filtered_trend[transportation],
                line=dict(color=line_color[idx]),
                name=transportation)
filtered_trend is a subset of trends_data, which gets selected based on some callback functions that I set up. But I don't think that's where the problem lies since everything worked fine locally.
In Dash, global variables will break your app. More specifically, modifying global variables will not work, at least not reliably.
One approach to avoid the use of global variables would be to create a single callback that first loads the data from S3, and then renders the layout. Other approaches are discussed in this similar question.
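For illustration, here is a minimal sketch of that approach (assuming Dash 2.x; the bucket and key names are taken from the question, everything else is placeholder scaffolding):

import boto3
import pandas as pd
from dash import Dash, dcc, html
from dash.dependencies import Input, Output

app = Dash(__name__)

# The layout only contains a placeholder; the data is loaded inside the callback.
app.layout = html.Div([
    dcc.Location(id='url'),
    html.Div(id='page-content'),
])

def load_trends():
    # Re-read the CSV from S3 on every call instead of once at module import time.
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='mobilitytrends', Key='historical_trends.csv')
    return pd.read_csv(obj['Body'], low_memory=False)

@app.callback(Output('page-content', 'children'), Input('url', 'pathname'))
def render_page(_):
    trends_data = load_trends()
    # Build and return the actual figures/layout from trends_data here.
    return html.Div(f"Loaded {len(trends_data)} rows")

Because the callback runs on every page load, each visit reflects the current contents of the CSV without redeploying.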
I had a similar problem: Elastic Beanstalk was not fetching the latest version of the CSV from the S3 bucket.
The only option I could find was to restart the app server after a new version of the CSV is uploaded to the S3 bucket.
You can use the code below in an AWS Lambda function to restart your app server at specific times of day:
import boto3

client = boto3.client('elasticbeanstalk', region_name='your-region')

def lambda_handler(event, context):
    try:
        response = client.restart_app_server(EnvironmentName='your-environment-name')
        if response:
            print('restarting app server')
        else:
            print('Failed to restart server')
    except Exception as e:
        print(e)
Make sure to set up a cron schedule with EventBridge to trigger this Lambda at the right times (for example, shortly after each CSV refresh).
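If you prefer to script the schedule rather than use the console, a rough boto3 sketch could look like this (the rule name, region, and Lambda ARN are placeholders):

import boto3

events = boto3.client('events', region_name='your-region')

# Fire shortly after each 12-hourly CSV refresh (here 00:30 and 12:30 UTC).
events.put_rule(
    Name='restart-eb-app-server',
    ScheduleExpression='cron(30 0,12 * * ? *)',
)
events.put_targets(
    Rule='restart-eb-app-server',
    Targets=[{
        'Id': 'restart-app-server-lambda',
        'Arn': 'arn:aws:lambda:your-region:123456789012:function:restart-app-server',
    }],
)
# The Lambda also needs a resource-based permission that allows
# events.amazonaws.com to invoke it (the console adds this automatically).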
Related
I am working on a feature where a user can upload multiple files which need to be parsed and converted to PDF if required. For that, I'm using AWS and when the user selects N files for upload then the following happens:
The client browser is connected to an AWS WebSocket API which is responsible for sending back the parsed data to respective clients later.
A signed S3 URL is fetched from the web server, and all of the user's files are uploaded to an S3 bucket with it.
As soon as each file is uploaded, a Lambda function is triggered for it, which fetches the object to get its content and some metadata so the file can be associated with its client.
Once the files are parsed, the response data is sent back to the respective connected clients via the WebSocket, and the browser JS catches the event data and renders it.
The issue I'm facing is that the Lambda function randomly times out at the line that fetches the object (either head_object or get_object). This happens for roughly 50% of the files (usually I test by sending 15 files at once, and 6-7 of them fail).
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"], encoding="utf-8")
    response = s3.get_object(Bucket=bucket, Key=key)  # This or head_object gets stuck for 50% of the files
What I have observed is that even if I call head_object or get_object on a file that already exists in S3, instead of the file whose upload triggered the Lambda, it still times out at the same rate.
But if the objects are fetched in bulk via a local script using boto3, all 15 files are fetched in under a second.
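For reference, the local comparison script was roughly the following (a sketch; the bucket name and prefix are placeholders):

import boto3

s3 = boto3.client("s3")
bucket = "your-bucket-name"

# List the test objects and fetch each one sequentially.
listing = s3.list_objects_v2(Bucket=bucket, Prefix="uploads/")
for obj in listing.get("Contents", []):
    body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
    print(obj["Key"], len(body))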
I have also tried using my own AWS access key ID and secret key in the Lambda to rule out any issue caused by the temporary credentials.
So it seems that multiple Lambda instances have trouble fetching the S3 objects in parallel, which shouldn't happen, since AWS is supposed to scale well.
What should be done to get around it?
I am new to SageMaker, and I have created a pipeline from the SageMaker notebook, consisting of training and deployment components.
In the training script, we can upload the model to S3 via SM_MODEL_DIR. But now I also want to upload the classification report to S3. I tried the code below, but it says this is not a proper S3 bucket.
df_classification_report = pd.DataFrame(class_report).transpose()
classification_report_file_name = os.path.join(
    args.output_data_dir,
    f"{args.eval_model_name}_classification_report.csv")
df_classification_report.to_csv(classification_report_file_name)

# instantiate S3 client and upload to s3
# save classification report to s3
s3 = boto3.resource('s3')
print(f"classification_report is being uploaded to s3- {args.model_dir}")
s3.meta.client.upload_file(classification_report_file_name, args.model_dir,
                           f"{args.eval_model_name}_classification_report.csv")
And the error:
Invalid bucket name "/opt/ml/output/data": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
Can anybody help? I really appreciate any help you can provide.
SageMaker Training Jobs will compress any files located in /opt/ml/model (which is the value of SM_MODEL_DIR) and upload them to S3 automatically. You could look at saving your file to SM_MODEL_DIR (your classification report would then be uploaded to S3 inside the model tarball).
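For example, a minimal sketch of that option, reusing args and df_classification_report from the question's training script (args.model_dir is assumed to default to SM_MODEL_DIR, i.e. /opt/ml/model):

import os

report_path = os.path.join(args.model_dir,
                           f"{args.eval_model_name}_classification_report.csv")
df_classification_report.to_csv(report_path)
# No boto3 upload is needed: SageMaker tars and uploads /opt/ml/model
# to S3 when the training job finishes.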
The upload_file() function requires you to pass an S3 bucket.
You could also look at manually specifying an S3 bucket in your code to upload the file to:
s3.meta.client.upload_file(classification_report_file_name, "<your-s3-bucket>",
                           f"{args.eval_model_name}_classification_report.csv")
You can save non-model artifacts, such as reports, to output_data_dir. See here.
parser.add_argument("--output_data_dir", type=str,
                    default=os.environ.get('SM_OUTPUT_DATA_DIR'),
                    help="Directory to save output data artifacts.")
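With that argument in place, the to_csv() call from the question is already enough; SageMaker uploads the contents of SM_OUTPUT_DATA_DIR (as output.tar.gz) to the job's output S3 path, separate from the model artifacts, so the boto3 upload_file() call can simply be dropped:

report_path = os.path.join(args.output_data_dir,
                           f"{args.eval_model_name}_classification_report.csv")
df_classification_report.to_csv(report_path)
# No explicit upload is needed for this directory either.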
If you want the artifacts to be packaged with the model files then follow #Marc's answer. Maybe it makes sense in the case of a report that pertains to a specific model, though capturing this in a model registry makes more sense to me.
Note that these additional artifacts would be carried over if you deploy the model to an endpoint (might confuse the inference runtime model loading code).
I have a requirement in which an Excel file is uploaded to an S3 bucket; as soon as the file is uploaded, I want to trigger a Lambda function that reads the file and persists the data in an Aerospike DB.
For reading the file, I have this piece of code:
import csv

import boto3

key = 'key-name'
bucket = 'bucket-name'
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object(bucket, key)

data = s3_object.get()['Body'].read().decode('utf-8').splitlines()

lines = csv.reader(data)
headers = next(lines)
print('headers: %s' % (headers))
for line in lines:
    # print complete line
    print(line)
But I am not able to figure out how to connect to the Aerospike DB, as the boto3 library doesn't support Aerospike.
How can I connect to the DB cluster and persist the data?
Any reference would be helpful.
I think the way to interact with Aerospike from something like AWS Lambda is to use the Aerospike REST Client, which provides a server that translates RESTful API requests into messages to an Aerospike cluster (it is mentioned in the blog post below).
Basically, you run a REST server (the Aerospike REST Client), send it HTTP requests from AWS Lambda using Python, and the server translates these requests into Aerospike operations and executes them.
This is the GitHub repository of the Aerospike REST Client; it also contains a couple of blog posts on how to use it and Swagger UI documentation of the supported requests:
https://github.com/aerospike/aerospike-client-rest
There is also this blog post on Serverless Event Stream Processing with Aerospike, which can help you get started:
https://medium.com/aerospike-developer-blog/serverless-event-stream-processing-with-aerospike-679f2a5cbba6
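As a rough sketch of that setup (the REST client host, namespace, set, and record layout below are all placeholders, and the exact endpoint path should be checked against the Swagger UI of your REST client version), the Lambda could write a parsed row like this:

import json
import urllib.request

REST_CLIENT_URL = "http://your-rest-client-host:8080"

def write_record(namespace, set_name, user_key, bins):
    # Send a key/value write to the Aerospike REST Client; it translates the
    # HTTP request into an Aerospike operation against the cluster.
    url = f"{REST_CLIENT_URL}/v1/kvs/{namespace}/{set_name}/{user_key}"
    req = urllib.request.Request(
        url,
        data=json.dumps(bins).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example: persist one parsed CSV row keyed by its first column.
# write_record("test", "uploads", line[0], dict(zip(headers, line)))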
I need to automate a process that extracts data from Google BigQuery and exports it to a CSV file on an external server outside of GCP.
While researching how to do that, I found some commands to run from my external server, but I would prefer to do everything in GCP to avoid possible problems.
To export the table to a CSV file in Google Cloud Storage:
bq --location=US extract --compression GZIP 'dataset.table' gs://example-bucket/myfile.csv
To download the CSV from Google Cloud Storage:
gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [OBJECT_DESTINATION]
But I would like to hear your suggestions.
If you want to fully automatize this process, I would do the following:
Create a Cloud Function to handle the export:
This is the most lightweight solution, as Cloud Functions are serverless and provide the flexibility to implement code with the client libraries. See the quickstart; I recommend using the console to create the function to start with.
In this example I recommend triggering the Cloud Function with an HTTP request, i.e. when the function's URL is called, it will run the code inside it.
Example Cloud Function code in Python that creates the export when an HTTP request is made:
main.py
from google.cloud import bigquery

def hello_world(request):
    project_name = "MY_PROJECT"
    bucket_name = "MY_BUCKET"
    dataset_name = "MY_DATASET"
    table_name = "MY_TABLE"
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )
    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)
requirements.txt
google-cloud-bigquery
Note that the job runs asynchronously in the background; you will receive a response with the job ID, which you can use to check the state of the export job in Cloud Shell by running:
bq show -j <job_id>
Create a Cloud Scheduler scheduled job:
Follow this documentation to get started. You can set the Frequency with the standard cron format, for example 0 0 * * * will run the job every day at midnight.
As a target, choose HTTP, in the URL put the Cloud Function HTTP URL (you can find it in the console, inside the Cloud Function details, under the Trigger tab), and as HTTP method choose GET.
Create it, and you can test it in the Cloud Scheduler by pressing the Run now button in the Console.
Synchronize your external server and the bucket:
Up until now you have only scheduled the export to run every 24 hours. To synchronize the bucket contents with your external server, you can use the gsutil rsync command. If you want to save the exports to, let's say, the my_exports folder, you can run the following on your external server:
gsutil rsync gs://BUCKET_WITH_EXPORTS /local-path-to/my_exports
To run this command periodically, you could create a standard cron job in the crontab on your external server, also scheduled to run each day, just a few hours after the BigQuery export, to ensure the export has completed.
Extra:
I have hard-coded most of the variables in the Cloud Function to always be the same. However, you can send parameters to the function if you make a POST request instead of a GET request and pass the parameters as data in the body, as sketched below.
You will have to change the Cloud Scheduler job to send a POST request to the Cloud Function HTTP URL, and in the same place you can set the body to send the parameters for the table, dataset and bucket, for example. This will allow you to run exports from different tables, at different hours, and to different buckets.
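A sketch of how the Cloud Function could read those parameters from the POST body (the field names are illustrative, and the hard-coded values are kept as fallbacks):

from google.cloud import bigquery

def hello_world(request):
    # Expect a JSON body such as:
    # {"project": "...", "dataset": "...", "table": "...", "bucket": "..."}
    params = request.get_json(silent=True) or {}
    project_name = params.get("project", "MY_PROJECT")
    dataset_name = params.get("dataset", "MY_DATASET")
    table_name = params.get("table", "MY_TABLE")
    bucket_name = params.get("bucket", "MY_BUCKET")

    destination_uri = "gs://{}/{}_export.csv.gz".format(bucket_name, table_name)
    bq_client = bigquery.Client(project=project_name)
    table_to_export = bq_client.dataset(dataset_name, project=project_name).table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP
    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        location="US",  # must match the source table's location
        job_config=job_config,
    )
    return "Job with ID {} started exporting {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)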
I am trying to put my Django app into production on AWS. I used Elastic Beanstalk to deploy it, so the EC2 instance is created and connected to an RDS MySQL instance, and I use an Amazon S3 bucket to store my media files.
When a user uploads a video, it is stored in S3 as: "https://bucketname.s3.amazonaws.com/media/videos/videoname.mp4".
In Django development mode, I was using the video filename as input to a batch script which produces a video as output.
My view in development mode is as follows:
def get(request):
    # get video
    var = Video.objects.order_by('id').last()
    v = '/home/myproject/media/videos/' + str(var)
    # call process
    subprocess.call("./step1.sh %s" % (str(v)), shell=True)
    return render(request, 'endexecut.html')
In production mode on AWS (the problem), I tried:
v = 'https://bucketname.s3.amazonaws.com/media/videos/' + str(var)
but the batch process doesn't accept the URL as input.
How can I use my video file from the S3 bucket for processing in my view, as described? Thank you in advance.
You should not hard-code that string. There are a couple of things wrong with that:
"bucketname" is not the name of your bucket. You should use the name of your bucket if this would at all work.
Your Media File URl (In settings.py) should be pointing to the bucket url where your files are saved (If it's well configured). So you can make use of:
video_path = settings.MEDIA_URL + video_name
I am assuming you are using s3boto (django-storages) to handle your storage (that's not a prerequisite, but it makes your storage handling smarter, and it's highly recommended if you are pushing to S3 from a Django app).
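For reference, a minimal settings.py sketch for that setup with django-storages and the s3boto3 backend (the bucket name and region are placeholders):

# settings.py
# Add 'storages' to INSTALLED_APPS.
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'
AWS_STORAGE_BUCKET_NAME = 'your-bucket-name'
AWS_S3_REGION_NAME = 'your-region'
MEDIA_URL = 'https://%s.s3.amazonaws.com/media/' % AWS_STORAGE_BUCKET_NAME

# In the view, build the path from settings instead of hard-coding it:
# video_path = settings.MEDIA_URL + 'videos/' + str(var)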