I noticed some strange behavior in AWS Data Pipeline.
The execution start time is earlier than the scheduled start time. Please refer to the screenshot below.
Am I missing something here?
Is this expected behavior for AWS Data Pipeline? What is the recommended way to avoid it?
Data Pipeline records the creation of the instance as the execution start time. However, it does not actually start executing (enter the Running state) before the scheduled start time. You can verify this by clicking on the instance and viewing all fields, which show additional info.
This is definitely misleading. Data Pipeline needs to fix how it records timestamps.
I have an AWS Glue job, with max concurrent runs set to 1. The job is currently not running. But when I try to run it, I keep getting the error: "Max concurrent runs exceeded".
Deleting and re-creating the job does not help. Other jobs in the same account run fine, so it cannot be a problem with account-wide service quotas.
Why am I getting this error?
I raised this issue with AWS support, and they confirmed that it is a known bug:
I would like to inform you that this is a known bug, where an internal distributed counter that keeps track of job concurrency goes into a stale state due to an edge case, causing this error. Our internal Service team has to manually reset the counter to fix this issue. Service team has already added the bug fix in their product roadmap and will be working on it. Unfortunately I may not be able to comment on the ETA on the deployment, as we don’t have any visibility on product teams road map and fix release timeline.
The suggested workarounds are:
Increase the max concurrency to 2 or higher
Re-create the job with a different name
The Glue container takes some time to start, and likewise takes some time to shut down when your job ends. If you try to execute a new job in between and the concurrency is at its default of 1, you will get this error.
How to resolve:
Go to your Glue job. Under the Job details tab you will find "Maximum concurrency", whose default value is 1. Change it to 3 or more as needed.
I tried changing "Maximum concurrency" to 2 and then ran the job.
It worked, but running it again caused the same issue. I checked my S3 bucket and the data had been written, so it did run once.
I'm still looking for a stable solution, but this may work.
I'm using AWS DMS to migrate existing data only, from a Postgres database as source to Amazon S3 as target. I have created a migration task for this, and the migration itself works.
However, I wanted to know how much time the task took to complete. I couldn't find a completion-time metric in either the metrics for the task or the metrics for the replication instance.
How do I find out the time taken for the full load?
Using the AWS CLI, you can try the describe-replication-tasks command.
Its output includes the start and stop times, as well as the elapsed time.
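For anyone who prefers a script over the CLI, here is a minimal Python sketch of the same idea. It assumes boto3 is installed and credentials are configured, and that the task's ReplicationTaskStats contains FullLoadStartDate and FullLoadFinishDate (both are only present once a full load has run). The function names and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def full_load_duration(stats: dict) -> Optional[timedelta]:
    """Return how long the full load took, or None if it has not finished."""
    started = stats.get("FullLoadStartDate")
    finished = stats.get("FullLoadFinishDate")
    if started is None or finished is None:
        return None
    return finished - started


def fetch_task_stats(task_arn: str) -> dict:
    """Fetch ReplicationTaskStats for one task (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without it

    dms = boto3.client("dms")
    response = dms.describe_replication_tasks(
        Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
    )
    return response["ReplicationTasks"][0].get("ReplicationTaskStats", {})


# Offline example with made-up timestamps, shaped like the API response.
sample_stats = {
    "FullLoadStartDate": datetime(2020, 1, 1, 10, 0, 0, tzinfo=timezone.utc),
    "FullLoadFinishDate": datetime(2020, 1, 1, 10, 42, 30, tzinfo=timezone.utc),
    "ElapsedTimeMillis": 2550000,
}
print(full_load_duration(sample_stats))  # 0:42:30
```

Against a real task you would call full_load_duration(fetch_task_stats(your_task_arn)) instead of using the sample dictionary.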
I have a use case where I schedule a task 24h into the future after an event occurs. This task represents some sort of "deadline" for other things to happen.
The scheduled task triggers a creation of a report. If not all of the above mentioned "other things" have completed by this time, then the triggered report creation process creates it anyways with the information it has at the time.
If, on the other hand, all other things do complete before these 24h, then ideally I'd like to re-use the same Google Cloud Task to trigger the same process (as it's identical as the previous case but will contain all of the information possible).
I would imagine the easiest way to achieve the above is to:
schedule a task 24h into the future
if all information arrives: run the task early, before its scheduled time
Reading through the Google Cloud Tasks documentation, though, I don't see an option to run a task early. That feature does exist in the Cloud Tasks console, so I was wondering whether it is also available through the API and client libraries.
Thanks!
This is probably what you're looking for
https://cloud.google.com/tasks/docs/reference/rest/v2/projects.locations.queues.tasks/run
NOTE: It does say however that "This command is meant to be used for manual debugging"
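A rough Python sketch of calling that REST method directly, in case it helps. The project, location, queue, and task names are placeholders, and obtaining the OAuth access token is left out (it is assumed you already have one with permission to run tasks). The Python client library also has a corresponding run_task method on CloudTasksClient if you would rather not build the request by hand.

```python
import json
import urllib.request


def run_task_url(project: str, location: str, queue: str, task: str) -> str:
    """Build the tasks.run URL for a fully qualified task name."""
    name = f"projects/{project}/locations/{location}/queues/{queue}/tasks/{task}"
    return f"https://cloudtasks.googleapis.com/v2/{name}:run"


def run_task_now(url: str, access_token: str) -> dict:
    """POST to tasks.run; the access token is assumed to be obtained elsewhere."""
    request = urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),  # empty JSON request body
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder identifiers, for illustration only.
print(run_task_url("my-project", "us-central1", "reports", "report-123"))
```

To be able to address the task later, you need to know its name, for example by choosing the task ID yourself when creating it or by storing the name returned at creation time.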
Currently my data goes through the following steps:
New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load the data into BigQuery.
I need a low-cost solution to know when this BigQuery job has finished, and to trigger a Dataflow pipeline only after the job has completed.
Notes:
I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know if it is a good idea. From what I saw, this trigger uses the job ID, which cannot be fixed in advance, so apparently I would have to deploy the function again for every job. And of course it's an alpha solution.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log entry that indicates that the job finished.
My files are large, so it isn't a good idea to have a Google Cloud Function wait until the job finishes.
Regarding your note about Stackdriver Logging: you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can add a dataset filter in addition if needed.
Then create a sink on this advanced filter that exports to a Pub/Sub topic, and have that topic trigger a Cloud Function that runs your Dataflow job.
If this doesn't match your expectation, can you detail why?
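As a rough sketch of the function side, assuming the sink exports matching entries to a Pub/Sub topic and a Python background function is subscribed to it: the handler below decodes the message, re-checks the job state from the filter above, and only then starts the pipeline. The names on_log_message and launch_dataflow are hypothetical, the launch itself is left as a placeholder, and treating a populated error field in jobStatus as "job failed" is an assumption about the log entry shape.

```python
import base64
import json


def job_is_done(log_entry: dict) -> bool:
    """True if the entry reports a completed job with no error attached."""
    status = (
        log_entry.get("protoPayload", {})
        .get("serviceData", {})
        .get("jobCompletedEvent", {})
        .get("job", {})
        .get("jobStatus", {})
    )
    # Assumption: a failed job is also DONE but carries an "error" field.
    return status.get("state") == "DONE" and not status.get("error")


def launch_dataflow(log_entry: dict) -> None:
    """Hypothetical placeholder: start the Dataflow pipeline here."""
    print("would launch the Dataflow pipeline now")


def on_log_message(event: dict, context=None) -> bool:
    """Pub/Sub-triggered entry point; event["data"] is base64-encoded JSON."""
    log_entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    if not job_is_done(log_entry):
        return False
    launch_dataflow(log_entry)
    return True
```

Since the function only reacts to the completion event, it never has to wait on the load job, which keeps the cost low for large files.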
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs in sequence. In Composer you define a DAG; it executes each node of the DAG and checks dependencies so that tasks run either in parallel or sequentially, based on the conditions you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
I have a linear three step Dataflow pipeline - for some reason the last step started, but the preceding two steps hung in Not started for a long time before I gave up and killed the job. I'm not sure what caused this, as this same pipeline had successfully run in the past, and I'm surprised it didn't show any errors in the logs as to what was preventing the first two steps from starting. What can cause such a situation and how can I prevent it from happening?
This was happening because of an error during worker startup. Certain Dataflow steps do not seem to require workers (e.g. writing to GCS), which is why that step was able to start; that step starting does not imply that workers are being created correctly. Worker startup logs are not displayed in the job logs by default: you need to click the link to Stackdriver in the job logs and then add worker-startup in the logs drop-down in order to see those errors.