AWS SageMaker training job stuck in "InProgress" state

I created a training job yesterday, the same as usual, just adding a bit more training data. I haven't had any problem with this in the last 2 years (the exact same procedure and code). This time, after roughly 14 hours, it simply stalled.
The training job is still "InProgress", but CloudWatch has not logged anything since then. Eight more hours have now passed and there is no new entry in the logs, no errors, no crash.
Can someone explain this? Unfortunately I don't have any AWS support plan.
As you can see from the picture below, after 11 AM there is nothing.
The training job is supposed to complete in the next couple of hours, but now I'm not sure whether it is actually running (in which case this would be a CloudWatch problem) or not.
UPDATE
Suddenly the training job failed, without any further log. The reason is
ClientError: Artifact upload failed:Error 7: The credentials received
have been expired
But there is still nothing in the logs after 11 AM. Very weird.

For future readers: I can confirm that this is something that can happen very rarely (I haven't experienced it again since then), but it is on AWS's side. Same data, same algorithm.
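If you ever hit this and want to check whether the job is really progressing independently of CloudWatch, the rough sketch below (using boto3's DescribeTrainingJob call; the job name is a placeholder) prints the status transitions SageMaker tracks server-side. If those transitions keep advancing while nothing shows up in the logs, it is more likely a log-delivery problem than a stuck job.

```python
# Sketch: check a SageMaker training job's server-side status without relying
# on CloudWatch logs. "my-training-job" is a placeholder name.
import boto3

sm = boto3.client("sagemaker")
job = sm.describe_training_job(TrainingJobName="my-training-job")

print(job["TrainingJobStatus"], job.get("SecondaryStatus"))

# SecondaryStatusTransitions records when the job moved between phases
# (Starting, Downloading, Training, Uploading, ...), independently of
# whether log delivery to CloudWatch has stalled.
for t in job.get("SecondaryStatusTransitions", []):
    print(t["Status"], t["StartTime"], t.get("StatusMessage", ""))

# If the job has already failed, the reason is also returned here.
print(job.get("FailureReason", "no failure reason"))
```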

Related

AlphaFold on VertexAI - Stuck in setting up notebook for 2 hours

I am trying to run AlphaFold on VertexAI as explained here. However, my instance creation has been stuck in this state for roughly two hours now. There is no error message either. I am wondering whether something has gone wrong or this is just the expected time it takes to set up a new instance.
I actually tried with two different notebooks. One is the default one linked in the above article and the other is https://raw.githubusercontent.com/deepmind/alphafold/main/notebooks/AlphaFold.ipynb
Both are in the same state for roughly the same time.
I finally gave up and cancelled the notebook creation. When I went back to the Workbench screen, only THEN did it display this error message:
So, it turns out that the new Google Cloud account I created has no quota for GPUs. In order to increase the quota, I first had to upgrade to a full GCP account. And now I need to wait a couple of days before I can actually request the quota increase, because I got this automated response when I submitted the quota increase request.
I have also contacted Sales via the link given at the end of this email to see if they can escalate the process in any way.

AWS CloudWatch rule schedule has irregular intervals (when it shouldn't)

There is an Elastic Container Service cluster running an application internally referred to as Deltaload. It checks the data in the Oracle production database and in the dev database in Amazon RDS, and loads whatever is missing into RDS. A CloudWatch rule is set up to trigger this process every hour.
Now, for some reason, every 20-30 hours there is one interval of a different length. Normally it is a ~25 min gap, but on other occasions it can be 80-90 min instead of 60. I could understand a difference of 1-2 minutes, but being off by 30 minutes from an hourly schedule sounds really problematic, especially given that a full run takes ~45 min. Does anyone have any idea what could be the reason for this? Or at least how I can figure out why it happens?
The interesting part is that this glitch in the schedule either breaks or fixes the Deltaload app. What I mean is: if it has been running successfully every hour for a whole day and then the 20 min interval happens, it will crash every hour for the next day until the next glitch arrives, after which it will work again (the very same process, same container, same everything). It crashes because the connection to RDS times out. This 'day of crashes, day of runs' pattern has been going on since early February.
I am not too proficient with AWS, and the Deltaload app is written in C#, which I don't know. The only thing I managed to do was increase the RDS connection timeout to 10 min, which did not fix the problem. The guy who wrote the app left the company a while ago and is unavailable, and there are no other developers on this project, as everyone was let go because of corona. So far, the best alternative I see is to rewrite the whole thing in Python (which I know). If anyone has any other thoughts on how to understand or fix this, I'd greatly appreciate any input.
To restate my actual question: why does the CloudWatch rule fire at irregular intervals on a regular schedule, and how can I prevent this from happening?
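In case it helps with a diagnosis, the rough sketch below (the rule name is a placeholder for whatever the Deltaload rule is actually called) is how I plan to compare the rule's configured schedule with the times it actually fired, to see whether the rule itself triggers off-schedule or only the ECS task start drifts:

```python
# Sketch: check whether the CloudWatch/EventBridge rule fired at odd times,
# or whether only the ECS task start drifted. "deltaload-hourly" is a
# placeholder rule name.
from datetime import datetime, timedelta
import boto3

events = boto3.client("events")
cw = boto3.client("cloudwatch")

# 1. The schedule expression: rate(1 hour) is not aligned to clock
#    boundaries, while cron(0 * * * ? *) fires at fixed clock times.
rule = events.describe_rule(Name="deltaload-hourly")
print("ScheduleExpression:", rule["ScheduleExpression"])

# 2. When the rule actually triggered over the last three days.
stats = cw.get_metric_statistics(
    Namespace="AWS/Events",
    MetricName="TriggeredRules",
    Dimensions=[{"Name": "RuleName", "Value": "deltaload-hourly"}],
    StartTime=datetime.utcnow() - timedelta(days=3),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```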

Scheduling an Informatica workflow with a customized frequency

Hello Dear Informatica admin/platform experts,
I have a workflow that I need to schedule Monday-Friday and Sunday. On all 6 days the job should run 10 times a day, but the timing is not uniform: the runs are at predefined times (9 AM, 11 AM, 1:30 PM, etc.), so the gap between runs varies. Because of that, we had 10 different scheduling workflows, one per run, each triggering a shell script that uses the pmcmd command.
That looked a bit weird to me, so what I did was create a single workflow that triggers the pmcmd shell script, with a link between the Start task and the shell script on which I specified a time condition, and I scheduled it to run Monday-Friday and Sunday every 30 minutes.
So what happens is that it runs 48 times a day but actually triggers the "real" workflow only 10 times; the remaining 38 times it runs and does nothing.
One of my Informatica admin colleagues says that running it those 38 times (when it does nothing) still consumes Informatica resources. I was quite sure it does not, but as I am just an Informatica developer and not an expert, I thought of posting it here to check whether that is really true.
Thanks.
Regards
Raghav
Well... it does consume some resources. Each time a workflow starts, it performs quite a few operations on the Repository. It also allocates some memory on the Integration Service and creates a log file for the workflow, even if no sessions are executed at all.
So there is an impact. Multiply that by the number of workflows, times the number of executions, and there might be a problem.
Not to mention there are some limitations regarding the number of workflows that can be executed at the same time.
I don't know your platform and setup, but this does look like an area for improvement. A cron scheduler should help you a lot.
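For illustration, if your shell script just wraps a pmcmd startworkflow call, the ten runs could be listed directly as cron entries, so nothing is started outside the intended schedule. The times and script path below are made up; Sunday through Friday is day-of-week 0-5 in cron:

```
# Illustrative crontab: one line per predefined run time, Sunday-Friday (0-5).
# run_wf.sh is assumed to wrap the "pmcmd startworkflow ..." call from the question.
0  9  * * 0-5  /opt/informatica/scripts/run_wf.sh
0  11 * * 0-5  /opt/informatica/scripts/run_wf.sh
30 13 * * 0-5  /opt/informatica/scripts/run_wf.sh
# ...remaining entries at the other predefined times
```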

DynamoDB on-demand mode suddenly stops working

I have a table that is incrementally populated by a Lambda function every hour. The write capacity metric is full of predictable spikes, and throttling was normally avoided by relying on burst capacity.
The first three loads after turning on on-demand mode worked. After that, it stopped loading new entries into the table and began to time out (run time went from ~10 seconds to the current timeout limit of 4 minutes). The Lambda function was not modified at all.
Does anyone know why this might be happening?
EDIT: I just see timeouts in the logs.
Logs before failure
Logs after failure
Errors and availability (%)
Since you are using Lambda to perform the incremental writes, this issue is more than likely on the Lambda side; that is where I would start looking. Do you have CloudWatch logs to look through? If you cannot find anything there, open a case with AWS support.
Unless this was recently fixed, there is a known bug in Lambda where you can get a series of timeouts. We encountered it on a project I worked on: a lambda would just start up and sit there doing nothing, quite like yours.
So like Kirk, I'd guess the problem is with the Lambda, not DynamoDB.
At the time there was no fix. As a workaround, we had another Lambda checking the one that suffered from the failures and re-running the failed invocations. Not sure if there are other solutions. Maybe deleting everything and setting it back up again (with your fingers crossed :))? Should be easy enough if everything is in CloudFormation.
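Either way, if you want the DynamoDB calls inside the function to fail fast instead of silently eating the 4-minute Lambda timeout, a rough sketch like the one below (assuming a Python Lambda using boto3; the table name and handler are placeholders) puts explicit connect/read timeouts and a retry cap on the client, so a stuck connection surfaces as an exception in the CloudWatch logs within roughly a minute rather than after four:

```python
# Sketch: make stuck DynamoDB writes fail fast inside the Lambda instead of
# silently running into the function timeout. Table name is a placeholder.
import boto3
from botocore.config import Config

dynamodb = boto3.resource(
    "dynamodb",
    config=Config(
        connect_timeout=5,               # seconds to establish the connection
        read_timeout=10,                 # seconds to wait for a response
        retries={"max_attempts": 3},
    ),
)
table = dynamodb.Table("my-incremental-table")

def handler(event, context):
    items = event.get("items", [])
    # batch_writer buffers writes and retries unprocessed items automatically
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)
    return {"written": len(items)}
```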

ZONE_RESOURCE_POOL_EXHAUSTED for DataFlow & DataPrep

Alright team... Dataprep running into BigQuery. I cannot for the life of me figure out why I have been hitting the ZONE_RESOURCE_POOL_EXHAUSTED error for the past 5 hours. The night before, everything was going great, but today I am having some serious issues.
Can anyone give any insight into how to change the resource pool for Dataflow jobs with regard to Dataprep? I can't even get a basic column transform to push through.
Looking forward to anyone helping me with this because, honestly, this issue is one of those "just change this and maybe that will fix it, and if not, maybe in a few weeks it'll work" situations.
Here is the issue in screenshot: https://i.stack.imgur.com/Qi4Dg.png
UPDATE:
I believe some of my issue may be related to GCP Compute incident 18012, especially since it is a us-central-based issue with the creation of instances.
The incident you mentioned was actually resolved on November 5th and only affected the us-central1-a zone. Seeing that your question was posted on November 10th and other users in the comments got the error in the us-central1-b zone, the error is not related to the incident you linked.
As the error message suggests, this is a resource availability issue. These scenarios are rare and usually resolve quickly. If this ever happens in the future, using Compute Engine instances in other regions/zones will solve the issue. To do so with Dataprep, as mentioned in the comment, after the job is launched from Dataprep you can re-run it from Dataflow while specifying the region/zone you would like the job to run in.