I have the following architecture
I followed this example to a T, https://github.com/aws/amazon-sagemaker-examples/blob/main/step-functions-data-science-sdk/automate_model_retraining_workflow/automate_model_retraining_workflow.ipynb. I am not sure how to debug to see what is going wrong. Any suggestions would be appreciated.
To provide more context, this is a machine learning deployment project. What I am doing in the picture is chaining processes together. The "Query Training Results" part is a Lambda function that pulls the training metrics data from an S3 location. For some reason this part gets cancelled.
From what I found online (Why would a step function cancels itself when there are no errors), “this happens in step functions when you have a Choice state, and the Variable you are referencing is not actually in the state input.” There are also answers in that post suggesting that the metric values in the dictionary need to be of string type, which I made sure to cast them as.
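For reference, this is roughly the shape of what my Lambda returns; the bucket, key, and metric names below are simplified placeholders rather than my exact values:

    import json
    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        # Pull the evaluation report that the training job wrote to S3
        # (bucket and key are placeholders).
        obj = s3.get_object(Bucket="my-model-bucket", Key="output/evaluation.json")
        metrics = json.loads(obj["Body"].read())

        # Cast every metric value to a string, since the suggestion online was
        # that the Choice state comparisons expect string values.
        return {"trainingMetrics": {k: str(v) for k, v in metrics.items()}}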
The problem is that when I click on that grey box, it provides no information other than the fact that the step was cancelled, so I have no clue what is going wrong.
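Is there a way to pull more detail programmatically? I was thinking of something along these lines with boto3 to dump the input the Choice state actually received (the execution ARN is a placeholder, and I'm not sure this is the right approach):

    import json
    import boto3

    sfn = boto3.client("stepfunctions")

    history = sfn.get_execution_history(
        executionArn="arn:aws:states:us-east-1:123456789012:execution:my-machine:my-run",
        reverseOrder=True,
        maxResults=100,
    )

    # Print what each Choice state saw as its input.
    for event in history["events"]:
        if event["type"] == "ChoiceStateEntered":
            print(json.dumps(event["stateEnteredEventDetails"], indent=2, default=str))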
I have a multi-label dataset with 727,253 labeled images. The smallest label occurrence is ~15 and the largest is around 200,000. Model training started ~18h ago and has now failed with the following message:
Unable to deploy model
cancel_lro() got an unexpected keyword argument 'min_nodes'
Pipeline d884756f14314048b7a036f5b07f0fd2 timeout.
The automatically generated email contained the following:
Last error message
Please reference 116298312436989152 when reporting errors.
Is this already known? Also, I chose the free plan (1h) to train. Do I need to increase this for training to work properly? Is there any way to see the status during training, so I can anticipate long waits that end without a result? (I tried the API, but there was no percentage or anything like that; status only seems to be available for finished models.)
Thanks in advance!
This seems like an internal error. The main problem seems to be that the pipeline timed out. As part of the timeout, it tries to do some sort of cleanup, and this cleanup seems to have a bug.
My recommendation is to retry the pipeline.
Newer to AWS and working with Athena for the first time. Would appreciate any help/clarification.
I set the query results location to s3://aws-athena-query-results-{ACCOUNTID}-{Region}, and I can see that whenever I run a query, whether from the console or externally elsewhere, the two result files are created as expected.
However, my question is: what am I supposed to do with these files long term? What are some recommendations on rotating them? From what I understand, these are the query results (the other one is a metadata file) that contain the results of the user's query and are passed back to them. What are the recommendations on how to manage the files in the query results bucket? I don't want to just let them accumulate and come back to a million files, if that makes sense.
I did search through the docs and couldn't find info on the above topic, maybe I missed it? Would appreciate any help!
Thanks!
From the documentation,
You can delete metadata files (*.csv.metadata) without causing errors,
but important information about the query is lost
The query result files can be safely deleted if you don't want to refer back to a query that ran on a particular date in the past and the result it returned. If you have deleted the result files from the S3 bucket and then try to download the result from Athena's "History" tab, it will just give you an error message that the result file is not available.
In summary, it comes down to your use case: can you afford to re-run the same query in the future if required, or do you want to be able to pull the result from past run history?
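If you decide you don't need to keep old results around, one common pattern (a sketch under my own assumptions, not something from the documentation quoted above) is to put an S3 lifecycle rule on the query results bucket so the files expire automatically, for example with boto3:

    import boto3

    s3 = boto3.client("s3")

    # Bucket name and retention period are placeholders; adjust to your account/region.
    s3.put_bucket_lifecycle_configuration(
        Bucket="aws-athena-query-results-123456789012-us-east-1",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-old-athena-results",
                    "Filter": {"Prefix": ""},    # apply to the whole bucket
                    "Status": "Enabled",
                    "Expiration": {"Days": 30},  # delete result files after 30 days
                }
            ]
        },
    )

The metadata files get expired along with the result files, so the bucket stays tidy without any manual cleanup.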
I'm trying to understand the Enterprise Guide process flow. As I understand it, the process flow is supposed to make it easy to run related steps in the order they need to run, so that a dependent step later in the flow can run and stay up to date.
Given that understanding, I'm getting stuck trying to make the process flow work in cases where the temporary data is purged. I'm warned when closing Enterprise Guide that the project has references to temporary data, which must be the tables I created. That should be fine; the source data is on the SAS server, and I wrote code to import that data into SAS.
I would expect that the data can be regenerated when I try run an analysis that depends on that data again later, but instead I'm getting an error indicating that the input data does not exist. If I then run the code to import the data and/or join tables in each necessary place, the process flow seems to work as expected.
See the flow that I'm working with below:
I'm sure I must be missing something. Imagine I want to rerun the rightmost linear regression. Is there a way to make the process flow import the data without doing so manually for each individual table creation the first time round?
The general answer to your question is probably that you can't really do what you're wanting directly, but you can do it indirectly.
A process flow (of which you can have many per project, don't forget) is a single set of programs/tasks/etc. that you intend to run as a group. Typically, you will run whole process flows at once, rather than just individual pieces. If you have a point that you want to pause, look at things, then continue, then you have a few choices.
One is to have a process flow that goes to that point, then a second process flow that starts from that point. You can even take your 'import data' steps out of the process flow entirely, make an 'import data' process flow, always run that first, then run the other process flows individually as you need them. In fact, if you use the AUTOEXEC process flow, you could have the import data steps run whenever you open the project, and imported data ready and waiting for you.
A second is to use the UI: Ctrl+click or drag a selection box on the process flow to choose a group of programs to run; select the first five, say, run them, and then use the 'Run branch from program...' option to run from that point on. You could also make separate 'branches' and run just one branch at a time, making each branch dependent on its input streams.
A third option would be to have different starting points for different analysis tasks, with the data import happening after that starting point. The import could be common to all the starting points, using macro variables and conditional execution to branch in different directions. For example, you could set a macro variable in the first program that says which analysis you're running, then have the conditional after the last import step (with the imports in sequence, not in parallel as you have them) send execution off to whichever analysis task the macro variable indicates. You could also use macro variables to record whether an import has already run in the current session, and conditional steps that skip rerunning it if so.
Unfortunately, though, there's no direct way to run something and say 'run this and all of its dependencies'.
Long story short - I'm familiar with BASE 9, and am now using EG (7.1) due to a new role with another company. The transition is painful, but the one thing that bothers me the most is the log.
As I am sure most know, it will rewrite/refresh for every piece of code you execute.
Surely there must be an option to maintain a "running log" within the SAS code you are running/building (not necessarily for the whole project, but just for the program node within the project).
Can this be done?
Any assistance is greatly appreciated. I searched for a reference, but found none that addresses this subject specifically.
Yes - from SAS's support pages:
You’ll notice that a separate log node is generated for each code node. By turning on Project Logging, you can easily tell Enterprise Guide that you’d like a single SAS log to be generated for all of the tasks and code nodes in your project. This single Project Log will be created in addition to the individual logs created for each task or code node.
Helpful Hint: If Project Logging is turned on, the log represents a running log of the entire project. To turn on Project Logging, select Project Log in the context menu of the Process Flow, and then select Turn On.
I have just started with Spark and am struggling with the concept of tasks.
Can anyone please help me understand when an action (say, reduce) does not run in the driver program?
From the Spark tutorial:
"Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel."
I'm currently experimenting with an application which reads a directory of 'n' files and counts the number of words.
From the web UI, the number of tasks is equal to the number of files, and all the reduce functions appear to take place on the driver node.
Can you please describe a scenario where the reduce function won't execute at the driver? And does a task always include "transformation + action", or only "transformation"?
All the actions are performed on the cluster and results of the actions may end up on the driver (depending on the action).
Generally speaking, the Spark code you write around your business logic is not the program that actually runs; rather, Spark uses it to create a plan which will execute your code on the cluster. The plan groups into a single task all the operations that can be done on a partition without the need to shuffle data around. Every time Spark needs the data arranged differently (e.g. after sorting), it will create a new task and a shuffle between the first task and the latter one.
I'll take a stab at this, although I may be missing part of the question. A task is indeed always transformation(s) plus an action. The transformations are lazy and would not submit anything, hence the need for an action. You can always call .toDebugString on your RDD to see where each job will be split; each level of indentation is a new stage. I think the reduce function showing on the driver is a bit of a misnomer, as it will first run in parallel and then merge the results. So I would expect that the task does indeed run on the workers as far as it can.
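To make that concrete, here is a rough PySpark sketch of the word-count case (the input path is a placeholder). The reduceByKey introduces a shuffle, so the lineage printed by toDebugString shows two stages, and only the small merged result of the final reduce comes back to the driver:

    from pyspark import SparkContext

    sc = SparkContext(appName="word-count-sketch")

    # Each file in the directory becomes at least one partition, hence one task.
    lines = sc.textFile("hdfs:///data/input-dir")

    # Transformations are lazy; nothing runs on the cluster yet.
    words = lines.flatMap(lambda line: line.split())
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    # The indentation in the lineage marks the stage boundary created by the shuffle.
    print(counts.toDebugString().decode())

    # Action: the per-partition work runs on the executors; only the merged
    # total comes back to the driver.
    total_words = counts.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
    print(total_words)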