SageMaker ProfilerReport stops with InternalServerError but the training job succeeds

When running a simple training job on Amazon SageMaker, the ProfilerReport rule (which I did not configure) is enabled by default, and a processing job runs in parallel with the training job.
The training job runs successfully, but a few times (so I don't know how to reproduce the error) the profiler report fails with a generic error:
InternalServerError: An internal error occurred. Try again.
Looking at the CloudWatch logs, the last entries all look like this:
Put the output notebook in /opt/ml/processing/output/rule/profiler-output/profiler-report.ipynb
Put the html in /opt/ml/processing/output/rule/profiler-output/profiler-report.html
Current timestamp 1666357140000000 last timestamp 1666357080000000: waiting for new profiler data.
Current timestamp 1666357140000000 most recent timestamp 1666357080000000: waiting for new profiler data.
Current timestamp 1666357140000000 most recent timestamp 1666357080000000: waiting for new profiler data.
......
with this waiting for new profiler data message repeating until the end of the log.
The job in question ran for 2 days, but the profiler report failed after 20 hours. Looking at the instance metrics, there is no sign of a resource problem.
The only thing I can think of is that I configured early stopping (progressively saving only the best model), so in the last phase of training it does not save any data.
Could the explanation then be that, since nothing is saved, the profiler report times out? Shouldn't the ProfilerReport also show a lot of other information about the training job gathered by the debugger, such as GPU utilization and more?
This is a simplified example of the training job code:
from sagemaker.pytorch import PyTorch

tft_train_estimator = PyTorch(
    base_job_name="my-training-job-name",
    entry_point="training.py",
    framework_version="1.12.0",
    py_version="py38",
    role=role,
    instance_count=1,
    instance_type=train_instance_type,
    code_location=code_location,
    output_path=output_model_path,
)
In each case, the trained model works correctly.
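For what it's worth, if the report itself is not needed, the default profiler can be switched off on the estimator so that no companion processing job is launched at all. A minimal sketch, assuming the SageMaker Python SDK's disable_profiler flag is available in your SDK version:

from sagemaker.pytorch import PyTorch

tft_train_estimator = PyTorch(
    base_job_name="my-training-job-name",
    entry_point="training.py",
    framework_version="1.12.0",
    py_version="py38",
    role=role,
    instance_count=1,
    instance_type=train_instance_type,
    # Turn off SageMaker Debugger profiling so the ProfilerReport
    # processing job is never created alongside the training job.
    disable_profiler=True,
)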

Related

AWS Glue Job using awsglueml.transforms.FindMatches gives timeout error seemingly randomly

I have a Glue ETL job (using PySpark) that gives a timeout error, seemingly at random, when trying to access the awsglueml.transforms.FindMatches library. The error given on the Glue dashboard is:
An error occurred while calling z:com.amazonaws.services.glue.ml.FindMatches.apply. The target server failed to respond
Basically, if I run this Glue ETL job late at night, it succeeds most of the time. But if I run it in the middle of the day, it fails with this error. Sometimes just retrying it enough times causes it to succeed, but this doesn't seem like a good solution. It seems like the issue is that the AWS FindMatches service doesn't have enough capacity for everyone wanting to use it, but I could be wrong here.
The Glue ETL job was set up using the option "A proposed script generated by AWS Glue".
The line of code that it times out on was provided by Glue when I created this job:
from awsglueml.transforms import FindMatches
...
findmatches2 = FindMatches.apply(frame = datasource0, transformId = "<redacted>", computeMatchConfidenceScores = True, transformation_ctx = "findmatches2")
Welcoming any information on this elusive issue.
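In the meantime, since plain retries do eventually succeed, one stopgap (not a fix) is to wrap the call in a bounded retry loop. A hedged sketch; the helper name, attempt count, and backoff values are illustrative:

import time

from awsglueml.transforms import FindMatches

def apply_find_matches_with_retry(frame, transform_id, attempts=5, base_delay=30):
    # Retry FindMatches.apply a few times with a growing delay, since
    # the timeout appears to be load-dependent.
    for attempt in range(1, attempts + 1):
        try:
            return FindMatches.apply(
                frame=frame,
                transformId=transform_id,
                computeMatchConfidenceScores=True,
                transformation_ctx="findmatches2",
            )
        except Exception:  # the failure surfaces as a Py4J error
            if attempt == attempts:
                raise
            time.sleep(base_delay * attempt)

findmatches2 = apply_find_matches_with_retry(datasource0, "<redacted>")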

Used Dataflow's DLP to read from GCS and write to BigQuery - only 50% of the data written to BigQuery

I recently started a Dataflow job to load data from GCS, run it through a DLP de-identification template, and write the masked data to BigQuery. I could not find a Google-provided template for batch processing, hence used the streaming one (ref: link).
I see only 50% of the rows written to the destination BigQuery table. There has been no activity on the pipeline for a day, even though it is in the running state.
Yes, the DLP Dataflow template is a streaming pipeline, but with some easy changes you can also use it as a batch pipeline. Here is the template source code. As you can see, it uses a FileIO transform and polls/watches for new files every 30 seconds. If you take out the window transform and the continuous-polling syntax, you should be able to execute it as batch.
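As a rough illustration of that change (the template itself is written in Java; the names below are the Beam Python SDK analogues, and the bucket path is a placeholder):

import apache_beam as beam
from apache_beam.io import fileio

# Streaming variant (what the template does): keep watching the bucket,
#   fileio.MatchContinuously("gs://my-bucket/*.csv", interval=30)
# Batch variant: match the files that exist right now, once, so the
# pipeline can finish instead of running forever.
with beam.Pipeline() as p:
    files = (
        p
        | "MatchOnce" >> fileio.MatchFiles("gs://my-bucket/*.csv")
        | "ReadMatches" >> fileio.ReadMatches()
    )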
In terms of the pipeline not processing all the data, can you confirm whether you are running a large file with default settings, e.g. workerMachineType, numWorkers, maxNumWorkers? The current pipeline code uses line-based offsetting, which requires a highmem machine type with a large number of workers if the input file is large. E.g., for 10 GB / 80M lines you may need 5 highmem workers.
One thing you can try, to see if it helps, is to trigger the pipeline with more resources, e.g. --workerMachineType=n1-highmem-8, numWorkers=10, maxNumWorkers=10, and see if it's any better.
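For example, launching the streaming template with larger workers might look like this with gcloud (the template path and parameter names here are illustrative; check the template's documentation for the exact ones):

gcloud dataflow jobs run dlp-masking-job \
    --gcs-location gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery \
    --region us-central1 \
    --worker-machine-type n1-highmem-8 \
    --num-workers 10 \
    --max-workers 10 \
    --parameters inputFilePattern=gs://my-bucket/*.csv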
Alternatively, there is a V2 solution that uses byte-based offsetting with the state and timer API for optimized batching and resource utilization that you can try out.

How to process files serially in cloud function?

I have written a Cloud Storage-triggered Cloud Function. I have 10-15 files landing at 5-second intervals in the bucket, and each one loads data into a BigQuery table (truncate and load).
While there are 10 files in the bucket, I want the Cloud Function to process them sequentially, i.e. 1 file at a time, since all the files operate on the same table.
Currently the Cloud Function is triggered for multiple files at a time, and the BigQuery operation fails because multiple files try to access the same table.
Is there any way to configure this in Cloud Functions?
Thanks in advance!
You can achieve this by using Pub/Sub and the max instances parameter on Cloud Functions.
First, use the notification capability of Google Cloud Storage to sink the events into a Pub/Sub topic.
Now you will receive a message every time an event occurs on the bucket. If you want to filter on file creation only (OBJECT_FINALIZE), you can apply a filter on the subscription. I wrote an article on this.
Then, create an HTTP function (an HTTP function is required if you want to apply a filter) with max instances set to 1. This way, only 1 instance of the function can run at a time. So, no concurrency!
Finally, create a Pub/Sub subscription on the topic, with a filter or not, to call your function over HTTP.
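A hedged sketch of that wiring with gsutil and gcloud (bucket, topic, function, and endpoint names are placeholders):

# Send OBJECT_FINALIZE events from the bucket to a Pub/Sub topic.
gsutil notification create -t gcs-events -f json -e OBJECT_FINALIZE gs://my-bucket

# Deploy the HTTP function capped at a single instance.
gcloud functions deploy load-to-bq \
    --runtime python310 --trigger-http --max-instances 1

# Push subscription that calls the function for each event.
gcloud pubsub subscriptions create gcs-events-sub \
    --topic gcs-events \
    --push-endpoint https://REGION-PROJECT.cloudfunctions.net/load-to-bq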
EDIT
Thanks to your code, I understood what happens. In fact, BigQuery is a declarative system. When you perform a request or a load job, a job is created and works in the background.
In Python you can explicitly wait for the end of the job, but with pandas I didn't find how!
I just found a Google Cloud page explaining how to migrate from pandas to the BigQuery client library. As you can see, there is a line at the end:
# Wait for the load job to complete.
job.result()
which waits for the end of the job.
You did this correctly in the _insert_into_bigquery_dwh function, but it's not the case in the staging _insert_into_bigquery_staging one. This can lead to 2 issues:
The dwh function works on the old data, because the staging load isn't finished yet when you trigger this job.
If the staging load takes, let's say, 10 seconds and runs in the "background" (you don't wait for it explicitly in your code) while the dwh load takes 1 second, the next file is processed at the end of the dwh function even if the staging load is still running in the background. And that leads to your issue.
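For illustration, a minimal sketch of the blocking pattern with the google-cloud-bigquery client (the table name and DataFrame are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Start the load job, then block until BigQuery reports it finished,
# so the next step never sees a half-loaded staging table.
job = client.load_table_from_dataframe(df, "my_project.my_dataset.staging")
job.result()  # wait for the load job to complete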
The architecture you describe isn't the same as the one in the documentation you linked. Note that in the flow diagram and the code samples the storage event triggers the Cloud Function, which streams the data directly to the destination table. Since BigQuery allows multiple concurrent streaming inserts, several functions could execute at the same time without problems. In your use case, the intermediate table loaded with write-truncate for data cleaning makes a big difference, because each execution needs the previous one to finish, thus requiring a sequential processing approach.
I would like to point out that Pub/Sub doesn't allow configuring the rate at which messages are sent: if 10 messages arrive at the topic, they will all be sent to the subscriber, even if processed one at a time. Limiting the function to one instance may lead to overhead for the above reason and could increase latency as well. That said, since the expected workload is 15-30 files a day, this may not be a big concern.
If you'd like to have parallel executions, you may try creating a new table for each message and setting a short expiration on it via the table.expires property (see the sketch below), so that multiple executions don't conflict with each other. Here is the related library reference. Otherwise, the great answer from Guillaume would completely get the job done.
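A minimal sketch of a per-message table with an expiration, assuming the google-cloud-bigquery client (the table name and schema are placeholders):

from datetime import datetime, timedelta, timezone

from google.cloud import bigquery

client = bigquery.Client()

# One throwaway table per message; BigQuery deletes it automatically
# once the expiration timestamp passes.
schema = [bigquery.SchemaField("name", "STRING")]
table = bigquery.Table("my_project.my_dataset.staging_msg_123", schema=schema)
table.expires = datetime.now(timezone.utc) + timedelta(hours=1)
client.create_table(table)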

com.amazonaws.services.gluejobexecutor.model.VersionMismatchException

Exactly like in this AWS forum question, I was running 2 jobs concurrently. The job was configured with Max concurrency: 10, but when executing job.commit() I received this error message:
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.amazonaws.services.glue.util.Job.commit.
: com.amazonaws.services.gluejobexecutor.model.VersionMismatchException:
Continuation update failed due to version mismatch. Expected version 6 but found version 7
(Service: AWSGlueJobExecutor; Status Code: 400; Error Code: VersionMismatchException; Request ID: 123)
The two jobs read different portions of data.
But I can't understand what the problem is here and how to deal with it. Can anyone help?
Reporting #bgiannini's answer from this other AWS forum question: it looks like the "version" refers to job bookmarking.
If multiple instances of the same job are running simultaneously (i.e. max concurrency > 1) and using bookmarks, then when job run 1 calls job.init() it gets a version, and job.commit() seems to expect a certain value (+1 to the version for every job.commit that is executed, I guess). If job run 2 started at the same time and got the same initial version from job.init(), then submits job.commit() before job run 1 does, job run 1 doesn't increment to the version it expected.
Actually, I was running the 2 jobs with Job bookmark: Enable. Indeed, disabling bookmarking makes it work for me.
I understand it might not be the best solution, but it can be a good compromise.
The default JobName for your bookmark is the Glue JOB_NAME, but it doesn't have to be.
Suppose you have a Glue job called JobA which executes concurrently, taking different input parameters, and you have two concurrent executions with the input parameter contextName. Let's call the values passed into this parameter contextA and contextB.
The default initialisation in your pyspark script is:
Job.init(args['JOB_NAME'], args)
but you can change this to be unique for your execution context. Instead:
Job.init(args['JOB_NAME']+args['contextName'], args)
This is unique for each concurrent execution, so it would never clash. When you view the bookmark state from the CLI for this job, you'd need to view it like this:
aws glue get-job-bookmark --job-name "jobAcontextA"
or
aws glue get-job-bookmark --job-name "jobAcontextB"
You wouldn't be able to use the UI to pause or reset the bookmark; you'd need to do it programmatically.
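For completeness, a hedged sketch of how the context-suffixed bookmark name fits into the usual Glue script boilerplate (contextName is the example parameter from above):

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve JOB_NAME plus the extra per-execution parameter.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "contextName"])

glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)

# Bookmark state is keyed on this name, so each context gets its own
# bookmark and concurrent runs no longer race on the same version.
job.init(args["JOB_NAME"] + args["contextName"], args)

# ... ETL work ...

job.commit()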

Django cron-job for user-specified times

I want to send notifications based on a user-specified time. E.g., in Google Calendar, I can receive a text message when my task time is hit.
Is the solution to run a cron job that executes every minute and scans for users whose time equals the current time?
Since you tagged your question with celery, I assume you have Celery running. You could use the eta kwarg to apply_async() to schedule a task to run at a specific time; see here:
http://docs.celeryproject.org/en/latest/userguide/calling.html#eta-and-countdown
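A minimal sketch of that approach (the task body and the notify_at field are placeholders):

from celery import shared_task

@shared_task
def send_notification(user_id, task_id):
    # Look up the user and deliver the notification here.
    ...

# Schedule the task to run at the user's chosen time instead of now,
# assuming `user` and `task` are objects from your models.
send_notification.apply_async(args=[user.id, task.id], eta=task.notify_at)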
If you need to use a cron job, I would not check whether notification_time == current_time, but rather track unsent notifications with a boolean is_sent field on the model and check for notification_time <= current_time and not is_sent. This seems slightly less error-prone. You could also add some form of cutoff to prevent mass-sending notifications in case your system goes down for a few hours.
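A sketch of that cron-driven query with the Django ORM, assuming a hypothetical Notification model with the two fields mentioned:

from django.utils import timezone

from myapp.models import Notification  # hypothetical model

def send_due_notifications():
    # Catch everything that is due and unsent, not just this exact
    # minute, so notifications missed during downtime still go out.
    due = Notification.objects.filter(
        notification_time__lte=timezone.now(),
        is_sent=False,
    )
    for notification in due:
        deliver(notification)  # placeholder for the actual send
        notification.is_sent = True
        notification.save(update_fields=["is_sent"])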