How to disable/deactivate a data pipeline? - amazon-web-services

I have just created a data pipeline and activated it. But while running, it showed WAITING_ON_DEPENDENCIES for my EC2Resource. I suspect that this might be due to some permissions issue.
So I now want to edit the pipeline, but when I open it, it says "Pipeline is active." and many of the fields are not editable anymore. Is there any way to deactivate and/or edit the pipeline?
Regards.

I encountered the same problem.
The only way I was able to proceed was to clone the pipeline into a new one and then edit the new one; all the fields are editable there. I deleted the old one afterwards.
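If you end up doing this more than once, the same clone-and-edit flow can be scripted; a rough sketch with boto3 (pipeline IDs and names below are hypothetical):

import boto3

dp = boto3.client("datapipeline")
old_id = "df-OLD0123456789"  # hypothetical id of the active pipeline

# Read the old pipeline's definition.
old_def = dp.get_pipeline_definition(pipelineId=old_id, version="latest")

# Create a fresh pipeline and push the (optionally edited) objects into it.
new_id = dp.create_pipeline(name="my-pipeline-v2", uniqueId="my-pipeline-v2")["pipelineId"]
dp.put_pipeline_definition(
    pipelineId=new_id,
    pipelineObjects=old_def["pipelineObjects"],
    parameterObjects=old_def.get("parameterObjects", []),
    parameterValues=old_def.get("parameterValues", []),
)
dp.activate_pipeline(pipelineId=new_id)

# Once the clone looks good, delete the old pipeline.
dp.delete_pipeline(pipelineId=old_id)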

The limitations on editing an active pipeline are documented in the AWS Data Pipeline Developer Guide ("Editing Your Pipeline").
Before you activate a pipeline, you can make any changes to it. After you activate a pipeline, you can edit the pipeline with the following restrictions. The changes you make apply to new runs of the pipeline objects after you save them and then activate the pipeline again.
You can't remove an object
You can't change the schedule period of an existing object
You can't add, delete, or modify reference fields in an existing object
You can't reference an existing object in an output field of a new object
You can't change the scheduled start date of an object (instead, activate the pipeline with a specific date and time)
Basically, you can't change the schedule or the execution graph. That still leaves a lot of non-"ref" properties you can edit (S3 paths, etc.). Otherwise, you will need to use the Clone-and-Edit trick.
You can also deactivate a pipeline in order to pause the execution schedule, either to edit the pipeline or for something like a DB maintenance window.
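Deactivating and reactivating doesn't have to happen in the console either; a minimal boto3 sketch (the pipeline ID is hypothetical):

import boto3
from datetime import datetime, timezone

dp = boto3.client("datapipeline")
pipeline_id = "df-0123456789EXAMPLE"  # hypothetical

# Pause the execution schedule (cancelActive=True would also cancel objects that are currently running).
dp.deactivate_pipeline(pipelineId=pipeline_id, cancelActive=False)

# ...edit the pipeline, or wait out the DB maintenance window...

# Resume, optionally from an explicit start timestamp (see the note above about
# not being able to change an object's scheduled start date directly).
dp.activate_pipeline(pipelineId=pipeline_id,
                     startTimestamp=datetime(2024, 1, 1, tzinfo=timezone.utc))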

Related

How to edit an already deployed pipeline in Data Fusion?

I am trying to edit a pipeline that is already deployed. I understand that we can duplicate the pipeline and rename it, but how can I make a change to the existing pipeline? Renaming it would require a change in the production scheduling jobs as well.
There is one way, through the HTTP calls executor:
Open https://<cdf instance url ..datafusion.googleusercontent.com>/cdap/httpexecutor
Select PUT (to change the pipeline code) from the drop-down and enter the path
namespaces/<namespaces_name>/apps/<pipeline_name>
Go to the body section and paste the new pipeline code (i.e. the exported, JSON-formatted definition of the updated pipeline).
Click SEND; the response should be "Deploy Complete" with status code 200.
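The same deploy can also be scripted against the CDAP REST API rather than the httpexecutor page; a rough sketch in Python, assuming the instance's CDAP API endpoint and an OAuth access token (e.g. from gcloud auth print-access-token); the endpoint, namespace, pipeline name, and file name below are hypothetical:

import json
import requests

CDAP_ENDPOINT = "https://my-instance-dot-datafusion.googleusercontent.com/api"  # hypothetical
TOKEN = "ya29.example-access-token"                                             # hypothetical
NAMESPACE = "default"
PIPELINE = "my_pipeline"

# The exported (updated) pipeline code, i.e. the JSON definition.
with open("my_pipeline-cdap-data-pipeline.json") as f:
    pipeline_spec = json.load(f)

resp = requests.put(
    f"{CDAP_ENDPOINT}/v3/namespaces/{NAMESPACE}/apps/{PIPELINE}",
    headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
    json=pipeline_spec,
)
print(resp.status_code, resp.text)  # expect 200 / "Deploy Complete"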

Is there a way for AWS s3 source in pipeline to detect actual file changes?

I have configured an s3 bucket as a source for my pipeline.
But whenever I upload the files, the pipeline triggers, regardless of whether the newly uploaded file is exactly the same as before (no change).
I was wondering if there's a configuration that will only trigger the pipeline when the file has actually changed.
The PUT actually occurs, so from S3's point of view the object has changed, and the pipeline will fire even if the new file happens to be identical to the old one. After all, you will also see a previous object version added to the version history, etc.
If you control the uploading process, I would suggest first fetching the object metadata and only putting an updated version if it actually differs (e.g. a different MD5 hash). This also makes sense because it saves you the upload itself.
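A minimal sketch of that check with boto3, assuming single-part uploads without SSE-KMS (so the S3 ETag equals the object's MD5); the bucket, key, and file names are hypothetical:

import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket, key, local_path = "my-pipeline-source-bucket", "data/input.csv", "input.csv"  # hypothetical

# MD5 of the file we are about to upload.
with open(local_path, "rb") as f:
    local_md5 = hashlib.md5(f.read()).hexdigest()

# MD5 (ETag) of the object already in S3, if any.
try:
    remote_etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
except ClientError:
    remote_etag = None  # object does not exist yet

if remote_etag != local_md5:
    s3.upload_file(local_path, bucket, key)  # only PUT (and trigger the pipeline) on a real change
else:
    print("Unchanged, skipping upload")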

AWS Data Pipeline: how to add steps other than data nodes and activities

EDIT
What I'm really asking here is whether most folks use the "Architect" GUI to build their pipelines, or whether most folks just use JSON. Is JSON the only way to access some functionality?
/EDIT
I'm just getting started in AWS, so I'm hoping someone here can help me out.
I've used the template for "Load S3 Data into RDS MySQL table" to create a basic pipeline that does a very simple insert.
For learning purposes I want to recreate that pipeline from scratch, but I can't figure out how to add anything to the pipeline that isn't an activity or a data node. Does this have to be done through the CLI? When I try to use the "Add" button in Architect I only see options for activities and data nodes.
Task Runners, Preconditions, Databases, Actions, and Resources can be added to the pipeline only from their respective Activities and DataNodes.
For example, an RDSDatabase can be added to the pipeline from a SqlActivity, SqlDataNode, or MySqlDataNode:
Add SqlActivity -- Choose Database -- Create new: Database (this adds a Database object to the pipeline)
Database -- Choose Type -- Select type: RDSDatabase
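On the "Is JSON the only way?" question: everything the Architect GUI can add ends up as just another object in the same pipeline definition, so you can always fall back to JSON or the API. A rough sketch with boto3 (the pipeline ID, RDS instance, credentials, and worker group are hypothetical, and the definition is trimmed to show only how a Database object is wired up):

import boto3

dp = boto3.client("datapipeline")

pipeline_objects = [
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "ondemand"},
        {"key": "role", "stringValue": "DataPipelineDefaultRole"},
        {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
    ]},
    # The Database object itself (the type name in the definition is RdsDatabase).
    {"id": "rds_mysql", "name": "rds_mysql", "fields": [
        {"key": "type", "stringValue": "RdsDatabase"},
        {"key": "rdsInstanceId", "stringValue": "my-rds-instance"},  # hypothetical
        {"key": "username", "stringValue": "admin"},
        {"key": "*password", "stringValue": "secret"},
    ]},
    # The activity references the database via a "ref" field.
    {"id": "load_activity", "name": "load_activity", "fields": [
        {"key": "type", "stringValue": "SqlActivity"},
        {"key": "database", "refValue": "rds_mysql"},
        {"key": "script", "stringValue": "INSERT INTO mytable VALUES (1);"},
        {"key": "workerGroup", "stringValue": "my-worker-group"},  # or a runsOn ref to an Ec2Resource
    ]},
]

dp.put_pipeline_definition(pipelineId="df-0123456789EXAMPLE",  # hypothetical
                           pipelineObjects=pipeline_objects)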

Is there a way to group my DynamoDB export tasks on one EMR cluster?

When I set up a recurring backup via the export function in the DynamoDB console, the task it creates automatically spins up a new EMR cluster when it runs. Some of my tables need to be backed up but are fairly small, so what I end up with is a huge number of large servers running to back up some relatively small tables. Is there any easy way to chain these tasks to run on one server group, in series or in parallel?
Yes, it is possible. There is no direct way; it needs some additional tweaking on the Data Pipeline end. You first need to understand how Data Pipeline actually runs your export job by default.
When you click the Export button in the DynamoDB console, it takes you to the Data Pipeline console to create a pipeline for the export.
After filling out the template, instead of running it, you can use the Edit in Architect feature to alter the template, which by default works with only one table.
On the Architect page, if you look at the Activities section, you will find an EmrActivity running an EMR step with the following parameters. This EMR step runs the export job using the parameters you initially passed in the template. Note that it also has RunsOn set to the EMRclusterforBackup resource, which you can find in the Resources section.
s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}
To run the export on other DynamoDB tables using the same EMR resource, you simply need to create another EmrActivity object by clicking Add and then adding an EmrActivity in Architect. On this activity you can use the same RunsOn that the previous activity uses, and in the Step parameters you can manually edit in the other table's name and its export path,
like
s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,s3://myexport-bucket/table2/,table2,0.9
You can extend this to as many tables as you need.
Note: this can also be done for multiple tables by exporting the pipeline definition as a JSON file, editing it to add more activities and parameters, and then importing it to run later; a Python sketch of that edit follows.
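A rough sketch of that JSON edit in Python (the file names, second table, and export bucket are hypothetical, and the exact shape of an exported definition may differ slightly):

import copy
import json

# Definition exported from the Data Pipeline console as JSON.
with open("ddb-export-pipeline.json") as f:
    definition = json.load(f)

# Find the EmrActivity the template generated.
template_activity = next(o for o in definition["objects"] if o.get("type") == "EmrActivity")

# Clone it for a second table, reusing the same RunsOn EMR cluster resource.
second = copy.deepcopy(template_activity)
second["id"] = second["name"] = "TableBackupActivity2"
second["step"] = ("s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,"
                  "org.apache.hadoop.dynamodb.tools.DynamoDbExport,"
                  "s3://myexport-bucket/table2/,table2,0.9")
definition["objects"].append(second)

with open("ddb-export-pipeline-multi.json", "w") as f:
    json.dump(definition, f, indent=2)
# Import the edited file back (Edit in Architect, or put-pipeline-definition) and run.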

Sitecore Scheduled Job: Unable to run

I am very new to Sitecore. I am trying to create a task; after creating the task I configured the command and the task in the Content Editor, but I still don't see a "Run Now" option for my task in the Content Editor. I also want to know where the logs of scheduled jobs are written.
There are two places where you can define a custom task:
In the database
In a config file
If you decide to go with the first option:
a task item must be created under the /sitecore/system/tasks/schedules item in the “core” database (default behavior)
no matter what schedule you set on that item, it may never be executed if you do not have the right DatabaseAgent looking after that task item
the DatabaseAgent periodically checks the task items and, if a task must be executed (based on the value set in the Scheduling field), executes the actual code
by default, the DatabaseAgent is called every 10 minutes
If you decide to go with the second option, check this article first.
In short, you need to define your task class and start method in the config files (check out the /sitecore/admin/showconfig.aspx page to make sure your config changes are applied successfully):
<scheduling>
  <!-- Time between checking for scheduled tasks waiting to execute -->
  <frequency>00:00:05</frequency>
  <agent type="NameSpace.TaskClass" method="MethodName" interval="00:10:00" />
</scheduling>
As specified in the other answers, you can use a config file or the database to execute your task. However, it seems that you want to run it manually.
I have a custom module on the Sitecore Marketplace which allows you to select the task you want to run. Here is the link.
In brief, go to the Sitecore Control Panel, click Administration, and then click Run Agent.
It will open a window where you can select the task. I am assuming that the task you have implemented does not depend on the item you are on when triggering the job.