How to upgrade a Data Pipeline definition from EMR 3.x to 4.x/5.x?

I would like to upgrade my AWS Data Pipeline definition to EMR 4.x or 5.x, so I can take advantage of Hive's latest features (version 2.0+), such as CURRENT_DATE and CURRENT_TIMESTAMP.
The change from EMR 3.x to 4.x/5.x requires the use of releaseLabel in EmrCluster instead of amiVersion.
When I use "releaseLabel": "emr-4.1.0", I get the following error: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
Below is my Data Pipeline definition for EMR 3.x. It works well, so I hope others find it useful (along with the answer for EMR 4.x/5.x): the common recommendation for importing data into DynamoDB from a file is to use Data Pipeline, but no one seems to have put forward a solid, simple working example (say, for a custom data format).
{
"objects": [
{
"type": "DynamoDBDataNode",
"id": "DynamoDBDataNode1",
"name": "OutputDynamoDBTable",
"dataFormat": {
"ref": "DynamoDBDataFormat1"
},
"region": "us-east-1",
"tableName": "testImport"
},
{
"type": "Custom",
"id": "Custom1",
"name": "InputCustomFormat",
"column": [
"firstName", "lastName"
],
"columnSeparator" : "|",
"recordSeparator" : "\n"
},
{
"type": "S3DataNode",
"id": "S3DataNode1",
"name": "InputS3Data",
"directoryPath": "s3://data.domain.com",
"dataFormat": {
"ref": "Custom1"
}
},
{
"id": "Default",
"name": "Default",
"scheduleType": "ondemand",
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "s3://logs.data.domain.com"
},
{
"type": "HiveActivity",
"id": "HiveActivity1",
"name": "S3ToDynamoDBImportActivity",
"output": {
"ref": "DynamoDBDataNode1"
},
"input": {
"ref": "S3DataNode1"
},
"hiveScript": "INSERT OVERWRITE TABLE ${output1} SELECT reflect('java.util.UUID', 'randomUUID') as uuid, TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())) as loadDate, firstName, lastName FROM ${input1};",
"runsOn": {
"ref": "EmrCluster1"
}
},
{
"type": "EmrCluster",
"name": "EmrClusterForImport",
"id": "EmrCluster1",
"coreInstanceType": "m1.medium",
"coreInstanceCount": "1",
"masterInstanceType": "m1.medium",
"amiVersion": "3.11.0",
"region": "us-east-1",
"terminateAfter": "1 Hours"
},
{
"type": "DynamoDBDataFormat",
"id": "DynamoDBDataFormat1",
"name": "OutputDynamoDBDataFormat",
"column": [
"uuid", "loadDate", "firstName", "lastName"
]
}
],
"parameters": []
}
A sample file could look like this:
John|Doe
Jane|Doe
Carl|Doe
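To make the Custom data format above concrete, here is a small local sketch of how such a file splits into records and columns. It only mimics the columnSeparator and recordSeparator settings from the definition; it is not how Data Pipeline or Hive actually parse the file, and the function name parse_records is made up for illustration.

```python
# Values copied from the "Custom1" data format object in the definition above.
COLUMNS = ["firstName", "lastName"]
COLUMN_SEPARATOR = "|"
RECORD_SEPARATOR = "\n"


def parse_records(text):
    """Split raw file content into one dict per record (illustrative helper)."""
    records = []
    for raw in text.split(RECORD_SEPARATOR):
        if not raw:
            continue  # skip the empty piece after a trailing record separator
        values = raw.split(COLUMN_SEPARATOR)
        if len(values) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}: {raw!r}")
        records.append(dict(zip(COLUMNS, values)))
    return records


sample = "John|Doe\nJane|Doe\nCarl|Doe\n"
rows = parse_records(sample)
for row in rows:
    print(row)
```

Running a check like this on a local copy of the input is a cheap way to catch rows with the wrong number of columns before starting a cluster.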
Bonus: rather than setting CURRENT_DATE in a column, how can I set it as a variable in the hiveScript section? I tried "SET loadDate = CURRENT_DATE;\n\n INSERT OVERWRITE..." to no avail. Not shown in my example are other dynamic fields I would like to set before the query clause.
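For reference, the amiVersion to releaseLabel swap described above can be sketched as a small script over the definition JSON. This is only a sketch of that one field change: the helper name use_release_label is hypothetical, the definition is trimmed to two objects, and the swap by itself is not claimed to resolve the TezTask error.

```python
import copy
import json


def use_release_label(definition, release_label):
    """Return a copy where every EmrCluster uses releaseLabel instead of amiVersion."""
    updated = copy.deepcopy(definition)
    for obj in updated.get("objects", []):
        if obj.get("type") == "EmrCluster" and "amiVersion" in obj:
            del obj["amiVersion"]
            obj["releaseLabel"] = release_label
    return updated


# Trimmed-down version of the definition above (only two objects kept).
definition = {
    "objects": [
        {
            "type": "EmrCluster",
            "id": "EmrCluster1",
            "amiVersion": "3.11.0",
            "region": "us-east-1",
        },
        {"type": "HiveActivity", "id": "HiveActivity1"},
    ],
    "parameters": [],
}

upgraded = use_release_label(definition, "emr-4.1.0")
print(json.dumps(upgraded["objects"][0], indent=2))
```

The original dict is left untouched, so the 3.x definition can be kept alongside the upgraded one while testing.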

Related

Add virtual network to existing Event hub namespace

I have an ARM template that adds a new virtual network to an existing Event Hub namespace.
The problem is that I have to hard-code, or ask for as parameters, the VNet and subnet address prefixes.
Is there a way to avoid this, or to extract those values and use them in the template?
The VNet and the subnet already exist.
Using the reference function gets me the values, but I get a circular dependency error if I use it in the resource definition.
I tried to get the IP ranges for the VNet and subnet with the reference function, but I can't use it in a parameter or variable either.
Basically, I would like to do this in a clean template:
"variables": {
"namespaceVirtualNetworkRuleName": "[concat(parameters('eventhubNamespaceName'), concat('/', parameters('vnetRuleName')))]",
"subNetId": "[resourceId('Microsoft.Network/virtualNetworks/subnets/', parameters('vnetRuleName'), parameters('subnetName'))]"
},
"resources": [
{
"apiVersion": "2018-01-01-preview",
"name": "[variables('namespaceVirtualNetworkRuleName')]",
"type": "Microsoft.EventHub/namespaces/VirtualNetworkRules",
"properties": {
"virtualNetworkSubnetId": "[variables('subNetId')]"
}
}
]
But I have to add the VNet and subnet with the correct IP ranges for this to work:
"variables": {
"namespaceVirtualNetworkRuleName": "[concat(parameters('eventhubNamespaceName'), concat('/', parameters('vnetRuleName')))]",
"subNetId": "[resourceId('Microsoft.Network/virtualNetworks/subnets/', parameters('vnetRuleName'), parameters('subnetName'))]"
},
"resources": [
{
"apiVersion": "2018-01-01-preview",
"name": "[parameters('eventhubNamespaceName')]",
"type": "Microsoft.EventHub/namespaces",
"location": "[parameters('location')]",
"sku": {
"name": "Standard",
"tier": "Standard"
},
"properties": { }
},
{
"apiVersion": "2017-09-01",
"name": "[parameters('vnetRuleName')]",
"location": "[parameters('location')]",
"type": "Microsoft.Network/virtualNetworks",
"properties": {
"addressSpace": {
"addressPrefixes": [
"a.b.c.0/24",
"x.y.0.0/16"
]
},
"subnets": [
{
"name": "[parameters('subnetName')]",
"properties": {
"addressPrefix": "x.y.z.w/26",
"serviceEndpoints": [
{
"service": "Microsoft.EventHub"
}
]
}
}
]
}
},
{
"apiVersion": "2018-01-01-preview",
"name": "[variables('namespaceVirtualNetworkRuleName')]",
"type": "Microsoft.EventHub/namespaces/VirtualNetworkRules",
"dependsOn": [
"[concat('Microsoft.EventHub/namespaces/', parameters('eventhubNamespaceName'))]"
],
"properties": {
"virtualNetworkSubnetId": "[variables('subNetId')]"
}
}
]
Any idea how to remove the VNet part, or at least get the IP ranges before the template runs and pass them in as parameters?
I found out how to make this work.
The solution is this:
"variables": {
},
"resources": [
{
"apiVersion": "2018-01-01-preview",
"name": "[parameters('eventhubNamespaceName')]",
"type": "Microsoft.EventHub/namespaces",
"location": "[resourceGroup().location]",
"sku": {
"name": "Standard",
"tier": "Standard"
},
"properties": { }
},
{
"apiVersion": "2018-01-01-preview",
"name": "[concat(parameters('eventhubNamespaceName'), concat('/', parameters('vnetSubscriptioID'), parameters('vnetResorceGroupName'), parameters('vnetRuleName'), parameters('subnetName')[copyIndex('subnetcopy')]))]",
"type": "Microsoft.EventHub/namespaces/VirtualNetworkRules",
"location": "[resourceGroup().location]",
"dependsOn": [
"[concat('Microsoft.EventHub/namespaces/', parameters('eventhubNamespaceName'))]"
],
"properties": {
"virtualNetworkSubnetId": "[resourceId(parameters('vnetSubscriptioID'), parameters('vnetResorceGroupName'),'Microsoft.Network/virtualNetworks/subnets/', parameters('vnetRuleName'), parameters('subnetName')[copyIndex('subnetcopy')])]"
},
"copy": {
"name": "subnetcopy",
"count": "[length(parameters('subnetName'))]"
}
}
]
}
In the end, no network part was needed. I updated my solution so that it can deploy more than one subnet at a time.
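To show what the cross-subscription resourceId call and the copy loop in the solution expand to, here is a minimal sketch that builds the same fully qualified subnet IDs as plain strings. The subscription ID, resource group, VNet, and subnet names are made-up placeholder values, and subnet_resource_id is an illustrative helper, not an Azure API.

```python
def subnet_resource_id(subscription_id, resource_group, vnet_name, subnet_name):
    """Build the fully qualified resource ID of a subnet in an existing VNet."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.Network"
        f"/virtualNetworks/{vnet_name}"
        f"/subnets/{subnet_name}"
    )


# Placeholder inputs standing in for the template parameters.
subscription_id = "00000000-0000-0000-0000-000000000000"
resource_group = "network-rg"
vnet_name = "my-vnet"
subnet_names = ["subnet-a", "subnet-b"]

# One ID per subnet name, like one rule per iteration of the copy loop.
subnet_ids = [
    subnet_resource_id(subscription_id, resource_group, vnet_name, name)
    for name in subnet_names
]
for subnet_id in subnet_ids:
    print(subnet_id)
```

Because the ID is built purely from names, nothing about the VNet's address prefixes is needed, which matches the observation that the network part could be dropped from the template.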

How to use Data Pipeline to copy data of a DynamoDB table to another DynamoDB table when both have on-demand capacity

I used to copy data from one DynamoDB table to another using a pipeline.json. It works when the source table has provisioned capacity, regardless of whether the destination is set to provisioned or on-demand. I want both of my tables set to on-demand capacity, but when I use the same template it doesn't work. Is there any way to do that, or is it still under development?
Here is my original functioning script:
{
"objects": [
{
"startAt": "FIRST_ACTIVATION_DATE_TIME",
"name": "DailySchedule",
"id": "DailySchedule",
"period": "1 day",
"type": "Schedule",
"occurrences": "1"
},
{
"id": "Default",
"name": "Default",
"scheduleType": "ONDEMAND",
"pipelineLogUri": "#{myS3LogsPath}",
"schedule": {
"ref": "DailySchedule"
},
"failureAndRerunMode": "CASCADE",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
},
{
"id": "DDBSourceTable",
"tableName": "#{myDDBSourceTableName}",
"name": "DDBSourceTable",
"type": "DynamoDBDataNode",
"readThroughputPercent": "#{myDDBReadThroughputRatio}"
},
{
"name": "S3TempLocation",
"id": "S3TempLocation",
"type": "S3DataNode",
"directoryPath": "#{myTempS3Folder}/#{format(#scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}"
},
{
"id": "DDBDestinationTable",
"tableName": "#{myDDBDestinationTableName}",
"name": "DDBDestinationTable",
"type": "DynamoDBDataNode",
"writeThroughputPercent": "#{myDDBWriteThroughputRatio}"
},
{
"id": "EmrClusterForBackup",
"name": "EmrClusterForBackup",
"amiVersion": "3.8.0",
"masterInstanceType": "m3.xlarge",
"coreInstanceType": "m3.xlarge",
"coreInstanceCount": "1",
"region": "#{myDDBSourceRegion}",
"terminateAfter": "10 Days",
"type": "EmrCluster"
},
{
"id": "EmrClusterForLoad",
"name": "EmrClusterForLoad",
"amiVersion": "3.8.0",
"masterInstanceType": "m3.xlarge",
"coreInstanceType": "m3.xlarge",
"coreInstanceCount": "1",
"region": "#{myDDBDestinationRegion}",
"terminateAfter": "10 Days",
"type": "EmrCluster"
},
{
"id": "TableLoadActivity",
"name": "TableLoadActivity",
"runsOn": {
"ref": "EmrClusterForLoad"
},
"input": {
"ref": "S3TempLocation"
},
"output": {
"ref": "DDBDestinationTable"
},
"type": "EmrActivity",
"maximumRetries": "2",
"dependsOn": {
"ref": "TableBackupActivity"
},
"resizeClusterBeforeRunning": "true",
"step": [
"s3://dynamodb-emr-#{myDDBDestinationRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbImport,#{input.directoryPath},#{output.tableName},#{output.writeThroughputPercent}"
]
},
{
"id": "TableBackupActivity",
"name": "TableBackupActivity",
"input": {
"ref": "DDBSourceTable"
},
"output": {
"ref": "S3TempLocation"
},
"runsOn": {
"ref": "EmrClusterForBackup"
},
"resizeClusterBeforeRunning": "true",
"type": "EmrActivity",
"maximumRetries": "2",
"step": [
"s3://dynamodb-emr-#{myDDBSourceRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}"
]
},
{
"dependsOn": {
"ref": "TableLoadActivity"
},
"name": "S3CleanupActivity",
"id": "S3CleanupActivity",
"input": {
"ref": "S3TempLocation"
},
"runsOn": {
"ref": "EmrClusterForBackup"
},
"type": "ShellCommandActivity",
"command": "(sudo yum -y update aws-cli) && (aws s3 rm #{input.directoryPath} --recursive)"
}
],
"parameters": [
{
"myComment": "This Parameter specifies the S3 logging path for the pipeline. It is used by the 'Default' object to set the 'pipelineLogUri' value.",
"id" : "myS3LogsPath",
"type" : "AWS::S3::ObjectKey",
"description" : "S3 path for pipeline logs."
},
{
"id": "myDDBSourceTableName",
"type": "String",
"description": "Source DynamoDB table name"
},
{
"id": "myDDBDestinationTableName",
"type": "String",
"description": "Target DynamoDB table name"
},
{
"id": "myDDBWriteThroughputRatio",
"type": "Double",
"description": "DynamoDB write throughput ratio",
"default": "1",
"watermark": "Enter value between 0.1-1.0"
},
{
"id": "myDDBSourceRegion",
"type": "String",
"description": "Region of the DynamoDB table",
"default": "us-west-2"
},
{
"id": "myDDBDestinationRegion",
"type": "String",
"description": "Region of the DynamoDB table",
"default": "us-west-2"
},
{
"id": "myDDBReadThroughputRatio",
"type": "Double",
"description": "DynamoDB read throughput ratio",
"default": "1",
"watermark": "Enter value between 0.1-1.0"
},
{
"myComment": "Temporary S3 path to store the dynamodb backup csv files, backup files will be deleted after the copy completes",
"id": "myTempS3Folder",
"type": "AWS::S3::ObjectKey",
"description": "Temporary S3 folder"
}
]
}
And here is the error message from Data Pipeline execution when source DynamoDB table is set to On Demand capacity:
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
at org.apache.hadoop.dynamodb.tools.DynamoDbExport.run(DynamoDbExport.java:79)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.dynamodb.tools.DynamoDbExport.main(DynamoDbExport.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
The following JSON file worked for upload (DynamoDB to S3):
{
"objects": [
{
"id": "Default",
"name": "Default",
"scheduleType": "ONDEMAND",
"pipelineLogUri": "#{myS3LogsPath}",
"failureAndRerunMode": "CASCADE",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
},
{
"id": "DDBSourceTable",
"tableName": "#{myDDBSourceTableName}",
"name": "DDBSourceTable",
"type": "DynamoDBDataNode",
"readThroughputPercent": "#{myDDBReadThroughputRatio}"
},
{
"name": "S3TempLocation",
"id": "S3TempLocation",
"type": "S3DataNode",
"directoryPath": "#{myTempS3Folder}/data"
},
{
"subnetId": "subnet-id",
"id": "EmrClusterForBackup",
"name": "EmrClusterForBackup",
"masterInstanceType": "m5.xlarge",
"coreInstanceType": "m5.xlarge",
"coreInstanceCount": "1",
"releaseLabel": "emr-5.23.0",
"region": "#{myDDBSourceRegion}",
"terminateAfter": "10 Days",
"type": "EmrCluster"
},
{
"id": "TableBackupActivity",
"name": "TableBackupActivity",
"input": {
"ref": "DDBSourceTable"
},
"output": {
"ref": "S3TempLocation"
},
"runsOn": {
"ref": "EmrClusterForBackup"
},
"resizeClusterBeforeRunning": "true",
"type": "EmrActivity",
"maximumRetries": "2",
"step": [
"s3://dynamodb-dpl-#{myDDBSourceRegion}/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}"
]
}
],
"parameters": [
{
"myComment": "This Parameter specifies the S3 logging path for the pipeline. It is used by the 'Default' object to set the 'pipelineLogUri' value.",
"id" : "myS3LogsPath",
"type" : "AWS::S3::ObjectKey",
"description" : "S3 path for pipeline logs."
},
{
"id": "myDDBSourceTableName",
"type": "String",
"description": "Source DynamoDB table name"
},
{
"id": "myDDBSourceRegion",
"type": "String",
"description": "Region of the DynamoDB table",
"default": "us-west-2"
},
{
"id": "myDDBReadThroughputRatio",
"type": "Double",
"description": "DynamoDB read throughput ratio",
"default": "1",
"watermark": "Enter value between 0.1-1.0"
},
{
"myComment": "Temporary S3 path to store the dynamodb backup csv files, backup files will be deleted after the copy completes",
"id": "myTempS3Folder",
"type": "AWS::S3::ObjectKey",
"description": "Temporary S3 folder"
}
]
}
And the following worked for download (S3 to DynamoDB):
{
"objects": [
{
"id": "Default",
"name": "Default",
"scheduleType": "ONDEMAND",
"pipelineLogUri": "#{myS3LogsPath}",
"failureAndRerunMode": "CASCADE",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
},
{
"name": "S3TempLocation",
"id": "S3TempLocation",
"type": "S3DataNode",
"directoryPath": "#{myTempS3Folder}/data"
},
{
"id": "DDBDestinationTable",
"tableName": "#{myDDBDestinationTableName}",
"name": "DDBDestinationTable",
"type": "DynamoDBDataNode",
"writeThroughputPercent": "#{myDDBWriteThroughputRatio}"
},
{
"subnetId": "subnet-id",
"id": "EmrClusterForLoad",
"name": "EmrClusterForLoad",
"releaseLabel": "emr-5.23.0",
"masterInstanceType": "m5.xlarge",
"coreInstanceType": "m5.xlarge",
"coreInstanceCount": "1",
"region": "#{myDDBDestinationRegion}",
"terminateAfter": "10 Days",
"type": "EmrCluster"
},
{
"id": "TableLoadActivity",
"name": "TableLoadActivity",
"runsOn": {
"ref": "EmrClusterForLoad"
},
"input": {
"ref": "S3TempLocation"
},
"output": {
"ref": "DDBDestinationTable"
},
"type": "EmrActivity",
"maximumRetries": "2",
"resizeClusterBeforeRunning": "true",
"step": [
"s3://dynamodb-dpl-#{myDDBDestinationRegion}/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,org.apache.hadoop.dynamodb.tools.DynamoDBImport,#{input.directoryPath},#{output.tableName},#{output.writeThroughputPercent}"
]
},
{
"dependsOn": {
"ref": "TableLoadActivity"
},
"name": "S3CleanupActivity",
"id": "S3CleanupActivity",
"input": {
"ref": "S3TempLocation"
},
"runsOn": {
"ref": "EmrClusterForLoad"
},
"type": "ShellCommandActivity",
"command": "(sudo yum -y update aws-cli) && (aws s3 rm #{input.directoryPath} --recursive)"
}
],
"parameters": [
{
"myComment": "This Parameter specifies the S3 logging path for the pipeline. It is used by the 'Default' object to set the 'pipelineLogUri' value.",
"id" : "myS3LogsPath",
"type" : "AWS::S3::ObjectKey",
"description" : "S3 path for pipeline logs."
},
{
"id": "myDDBDestinationTableName",
"type": "String",
"description": "Target DynamoDB table name"
},
{
"id": "myDDBWriteThroughputRatio",
"type": "Double",
"description": "DynamoDB write throughput ratio",
"default": "1",
"watermark": "Enter value between 0.1-1.0"
},
{
"id": "myDDBDestinationRegion",
"type": "String",
"description": "Region of the DynamoDB table",
"default": "us-west-2"
},
{
"myComment": "Temporary S3 path to store the dynamodb backup csv files, backup files will be deleted after the copy completes",
"id": "myTempS3Folder",
"type": "AWS::S3::ObjectKey",
"description": "Temporary S3 folder"
}
]
}
Also, the subnet ID fields in both pipeline definitions are optional, but it is good practice to set them.
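The step strings in these definitions appear to follow a "jar,main class,arguments" layout separated by commas. As a rough sketch under that assumption, a tiny helper (build_step is a hypothetical name) can assemble the string and guard against stray commas; the values below are copied from the export definition above, with the #{...} expressions left as literal text for Data Pipeline to resolve.

```python
def build_step(jar, main_class, *args):
    """Join the jar location, main class, and arguments into one step string."""
    parts = [jar, main_class, *args]
    for part in parts:
        if "," in part:
            # A comma inside a part would be read as an extra argument.
            raise ValueError(f"step part must not contain a comma: {part!r}")
    return ",".join(parts)


export_step = build_step(
    "s3://dynamodb-dpl-#{myDDBSourceRegion}/emr-ddb-storage-handler/4.11.0/"
    "emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar",
    "org.apache.hadoop.dynamodb.tools.DynamoDBExport",
    "#{output.directoryPath}",
    "#{input.tableName}",
    "#{input.readThroughputPercent}",
)
print(export_step)
```

Note that the class name differs between the two jars used on this page: the older definition uses DynamoDbExport, while the working one uses DynamoDBExport.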

AWS Data Pipeline stuck on Waiting For Runner

My goal is to copy a table in a PostgreSQL database running on AWS RDS to a .csv file on Amazon S3. For this I use AWS Data Pipeline and followed a tutorial; however, after following all the steps my pipeline is stuck at "WAITING FOR RUNNER". The AWS documentation states:
ensure that you set a valid value for either the runsOn or workerGroup fields for those tasks
However, the "runsOn" field is set. Any idea why this pipeline is stuck?
Here is my definition file:
{
"objects": [
{
"output": {
"ref": "DataNodeId_Z8iDO"
},
"input": {
"ref": "DataNodeId_hEUzs"
},
"name": "DefaultCopyActivity01",
"runsOn": {
"ref": "ResourceId_oR8hY"
},
"id": "CopyActivityId_8zaDw",
"type": "CopyActivity"
},
{
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"name": "DefaultResource1",
"id": "ResourceId_oR8hY",
"type": "Ec2Resource",
"terminateAfter": "1 Hour"
},
{
"*password": "xxxxxxxxx",
"name": "DefaultDatabase1",
"id": "DatabaseId_BWxRr",
"type": "RdsDatabase",
"region": "eu-central-1",
"rdsInstanceId": "aqueduct30v05.cgpnumwmfcqc.eu-central-1.rds.amazonaws.com",
"username": "xxxx"
},
{
"name": "DefaultDataFormat1",
"id": "DataFormatId_wORsu",
"type": "CSV"
},
{
"database": {
"ref": "DatabaseId_BWxRr"
},
"name": "DefaultDataNode2",
"id": "DataNodeId_hEUzs",
"type": "SqlDataNode",
"table": "y2018m07d12_rh_ws_categorization_label_postgis_v01_v04",
"selectQuery": "SELECT * FROM y2018m07d12_rh_ws_categorization_label_postgis_v01_v04 LIMIT 100"
},
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "s3://rutgerhofste-data-pipeline/logs",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
},
{
"dataFormat": {
"ref": "DataFormatId_wORsu"
},
"filePath": "s3://rutgerhofste-data-pipeline/test",
"name": "DefaultDataNode1",
"id": "DataNodeId_Z8iDO",
"type": "S3DataNode"
}
],
"parameters": []
}
Usually, the "WAITING FOR RUNNER" state implies that the pipeline is waiting for a resource (such as an EMR cluster). You seem not to have set the 'workerGroup' field. That means you have specified what to do, but not who should do it.
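The documented check (every task needs a valid runsOn or workerGroup) can be sketched as a quick script over the definition JSON. This is a minimal sketch, assuming that activity objects are the ones whose type ends in "Activity"; find_runner_problems is a made-up helper, and the definition below is trimmed to the two relevant objects from the question.

```python
def find_runner_problems(definition):
    """List activities that have neither a resolvable runsOn nor a workerGroup."""
    objects = {obj["id"]: obj for obj in definition.get("objects", [])}
    problems = []
    for obj in objects.values():
        if not obj.get("type", "").endswith("Activity"):
            continue  # only activities need a runner
        runs_on = obj.get("runsOn")
        if runs_on is None and "workerGroup" not in obj:
            problems.append(f"{obj['id']}: neither runsOn nor workerGroup is set")
        elif runs_on is not None and runs_on.get("ref") not in objects:
            problems.append(f"{obj['id']}: runsOn refers to unknown object {runs_on.get('ref')!r}")
    return problems


# Trimmed-down version of the definition in the question.
definition = {
    "objects": [
        {
            "id": "CopyActivityId_8zaDw",
            "type": "CopyActivity",
            "runsOn": {"ref": "ResourceId_oR8hY"},
        },
        {"id": "ResourceId_oR8hY", "type": "Ec2Resource", "terminateAfter": "1 Hour"},
    ]
}

problems = find_runner_problems(definition)
print(problems)
```

An empty result only means the definition passes this particular check; the resource itself may still fail to start for other reasons, which a script like this cannot see.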

AWS Data Pipeline S3 to DynamoDB JSON Error

I'm trying to import a TSV file from S3 into DynamoDB using Data Pipeline, but I keep hitting a MalformedJsonException. I've validated both pieces of JSON that I provide (the definition of the data pipeline and the manifest of the S3 folder), so that's not the problem. Is there any way to figure out which JSON is malformed?
Definition of the job:
{
"objects": [
{
"output": {
"ref": "DDBDestinationTable"
},
"input": {
"ref": "S3InputDataNode"
},
"maximumRetries": "2",
"name": "TableLoadActivity",
"step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbImport,#{input.directoryPath},#{output.tableName},#{output.writeThroughputPercent}",
"runsOn": {
"ref": "EmrClusterForLoad"
},
"id": "TableLoadActivity",
"type": "EmrActivity",
"resizeClusterBeforeRunning": "true"
},
{
"column": [
"property_id STRING",
"addr_line_1 STRING",
...
],
"name": "DefaultDataFormat1",
"id": "DataFormatId_JMZkM",
"type": "TSV"
},
{
"bootstrapAction": "s3://#{myDDBRegion}.elasticmapreduce/bootstrap-actions/configure-hadoop, --mapred-key-value,mapreduce.map.speculative=false",
"name": "EmrClusterForLoad",
"coreInstanceCount": "1",
"coreInstanceType": "m3.xlarge",
"amiVersion": "3.9.0",
"id": "EmrClusterForLoad",
"masterInstanceType": "m3.xlarge",
"region": "#{myDDBRegion}",
"type": "EmrCluster",
"terminateAfter": "1 Month"
},
{
"directoryPath": "#{myInputS3Loc}",
"dataFormat": {
"ref": "DataFormatId_JMZkM"
},
"name": "S3InputDataNode",
"id": "S3InputDataNode",
"type": "S3DataNode"
},
{
"writeThroughputPercent": "#{myDDBWriteThroughputRatio}",
"name": "DDBDestinationTable",
"id": "DDBDestinationTable",
"type": "DynamoDBDataNode",
"tableName": "#{myDDBTableName}"
},
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "s3://log-bucket/",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
}
],
"parameters": [
{
"description": "Input S3 folder",
"id": "myInputS3Loc",
"type": "AWS::S3::ObjectKey"
},
{
"description": "Target DynamoDB table name",
"id": "myDDBTableName",
"type": "String"
},
{
"default": "0.25",
"watermark": "Enter value between 0.1-1.0",
"description": "DynamoDB write throughput ratio",
"id": "myDDBWriteThroughputRatio",
"type": "Double"
},
{
"default": "us-east-1",
"watermark": "us-east-1",
"description": "Region of the DynamoDB table",
"id": "myDDBRegion",
"type": "String"
}
],
"values": {
"myDDBRegion": "us-east-1",
"myDDBTableName": "TableName",
"myDDBWriteThroughputRatio": "0.5",
"myInputS3Loc": "s3://input/folder/"
}
}
Exception:
24 Jan 2018 23:59:56,657 [INFO] (TaskRunnerService-df-02737991EW1XAIM4T1PD_#EmrClusterForLoad_2018-01-24T23:27:35-0) df-02737991EW1XAIM4T1PD amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg : at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: com.google.gson.stream.MalformedJsonException: Expected ':' at line 1 column 36
at com.google.gson.stream.JsonReader.syntaxError(JsonReader.java:1298)
at com.google.gson.stream.JsonReader.objectValue(JsonReader.java:762)
at com.google.gson.stream.JsonReader.peek(JsonReader.java:380)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:158)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.read(TypeAdapterRuntimeTypeWrapper.java:40)
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:188)
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:146)
at com.google.gson.Gson.fromJson(Gson.java:755)
... 17 more
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.hadoop.dynamodb.tools.DynamoDbImport.run(DynamoDbImport.java:68)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.dynamodb.tools.DynamoDbImport.main(DynamoDbImport.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
In the AWS console, click Create new pipeline, then Import a definition, and see whether your JSON can be imported correctly.
Did you create the pipeline from the command line? I suspect there is some problem with that command.
I assume the ... are not present in your actual JSON :)
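One more way to narrow down which JSON is malformed: the Gson error reports "line 1 column 36", which suggests a parser reading the input data line by line rather than the pipeline definition. As a rough sketch, assuming the import step expects one JSON document per line (an assumption, not something confirmed here), the following checks a local copy of an input file and reports the first line that does not parse. The helper name and the sample lines are made up for illustration.

```python
import json


def first_malformed_line(lines):
    """Return (line number, column, message) for the first non-JSON line, or None."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            continue  # ignore blank lines
        try:
            json.loads(text)
        except json.JSONDecodeError as err:
            return number, err.colno, err.msg
    return None


# Hypothetical sample: one JSON line followed by one tab-separated line.
sample_lines = [
    '{"property_id": {"s": "P-1"}}\n',
    "P-2\t12 Main St\n",
]

result = first_malformed_line(sample_lines)
print(result)
```

If a tab-separated row shows up as the first malformed line, the problem is likely the data format rather than the pipeline definition or the manifest.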

DynamoDB table duplication through Data Pipeline producing incomplete duplicate

I have a DynamoDB table that is 14.05 GB, with 140,000,000 items. I am trying to clone it (to the same region) using Data Pipeline, but the destination table has only about 160,000 items when the pipeline finishes, even after I wait 6 hours for the item count to update.
I set the throughput to 256 for each table and the pipeline took about 20 minutes to complete. Is there anything that might be causing the pipeline to only copy a section of the table? Are there invisible limits on size and item count? I have tried this 3 times with similar results each time with the 'completed' destination table containing only 90-150k of the 140M items.
I also made sure the max execution time was set very high.
Is the Data Pipeline the simplest way to quickly copy a Dynamo table?
Thanks.
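As a back-of-the-envelope check on the numbers above, here is a small calculation of how many items a table can accept in a given time. It assumes the throughput of 256 means 256 write capacity units, that one write capacity unit covers one write per second of an item up to 1 KB, that 14.05 GB means 14.05e9 bytes, and that items are close to the average size. The function names are illustrative only.

```python
import math


def write_units_per_item(item_bytes):
    """Write capacity units consumed by one item (1 unit per started KB)."""
    return max(1, math.ceil(item_bytes / 1024))


def max_items_written(write_capacity_units, seconds, item_bytes):
    """Upper bound on items written in the given time at full throughput."""
    return int(write_capacity_units * seconds // write_units_per_item(item_bytes))


def seconds_to_copy(total_items, write_capacity_units, item_bytes):
    """Minimum time to write all items at full throughput."""
    return total_items * write_units_per_item(item_bytes) / write_capacity_units


total_items = 140_000_000
average_item_bytes = 14.05e9 / total_items  # roughly 100 bytes per item

in_twenty_minutes = max_items_written(256, 20 * 60, average_item_bytes)
full_copy_hours = seconds_to_copy(total_items, 256, average_item_bytes) / 3600

print(in_twenty_minutes)       # 307200
print(round(full_copy_hours))  # 152
```

Under these assumptions, a 20-minute run at 256 write capacity units cannot have written more than a few hundred thousand items, so a complete copy of 140M items would need either far more throughput or several days.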
Amazon has replied to my ticket and confirmed that it is a known issue (bug) in Data Pipeline.
They recommended this Java program, https://github.com/awslabs/dynamodb-import-export-tool, to first export the table to S3 and then import it back into DynamoDB.
Using the EmrActivity of AWS Data Pipeline, one can copy from one DynamoDB table to another. Below is an example pipeline definition.
{
"objects": [
{
"startAt": "FIRST_ACTIVATION_DATE_TIME",
"name": "DailySchedule",
"id": "DailySchedule",
"period": "1 day",
"type": "Schedule",
"occurrences": "1"
},
{
"id": "Default",
"name": "Default",
"scheduleType": "CRON",
"pipelineLogUri": "#{myS3LogsPath}",
"schedule": {
"ref": "DailySchedule"
},
"failureAndRerunMode": "CASCADE",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
},
{
"id": "DDBSourceTable",
"tableName": "#{myDDBSourceTableName}",
"name": "DDBSourceTable",
"type": "DynamoDBDataNode",
"readThroughputPercent": "#{myDDBReadThroughputRatio}"
},
{
"name": "S3TempLocation",
"id": "S3TempLocation",
"type": "S3DataNode",
"directoryPath": "#{myTempS3Folder}/#{format(#scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}"
},
{
"id": "DDBDestinationTable",
"tableName": "#{myDDBDestinationTableName}",
"name": "DDBDestinationTable",
"type": "DynamoDBDataNode",
"writeThroughputPercent": "#{myDDBWriteThroughputRatio}"
},
{
"id": "EmrClusterForBackup",
"name": "EmrClusterForBackup",
"releaseLabel": "emr-4.2.0",
"masterInstanceType": "m3.xlarge",
"coreInstanceType": "m3.xlarge",
"coreInstanceCount": "1",
"region": "#{myDDBSourceRegion}",
"terminateAfter": "6 Hours",
"type": "EmrCluster"
},
{
"id": "EmrClusterForLoad",
"name": "EmrClusterForLoad",
"releaseLabel": "emr-4.2.0",
"masterInstanceType": "m3.xlarge",
"coreInstanceType": "m3.xlarge",
"coreInstanceCount": "1",
"region": "#{myDDBDestinationRegion}",
"terminateAfter": "6 Hours",
"type": "EmrCluster"
},
{
"id": "TableLoadActivity",
"name": "TableLoadActivity",
"runsOn": {
"ref": "EmrClusterForLoad"
},
"input": {
"ref": "S3TempLocation"
},
"output": {
"ref": "DDBDestinationTable"
},
"type": "EmrActivity",
"maximumRetries": "2",
"dependsOn": {
"ref": "TableBackupActivity"
},
"resizeClusterBeforeRunning": "true",
"step": [
"s3://dynamodb-emr-#{myDDBDestinationRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbImport,#{input.directoryPath},#{output.tableName},#{output.writeThroughputPercent}"
]
},
{
"id": "TableBackupActivity",
"name": "TableBackupActivity",
"input": {
"ref": "DDBSourceTable"
},
"output": {
"ref": "S3TempLocation"
},
"runsOn": {
"ref": "EmrClusterForBackup"
},
"resizeClusterBeforeRunning": "true",
"type": "EmrActivity",
"maximumRetries": "2",
"step": [
"s3://dynamodb-emr-#{myDDBSourceRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}"
]
},
{
"dependsOn": {
"ref": "TableLoadActivity"
},
"name": "S3CleanupActivity",
"id": "S3CleanupActivity",
"input": {
"ref": "S3TempLocation"
},
"runsOn": {
"ref": "EmrClusterForBackup"
},
"type": "ShellCommandActivity",
"command": "(sudo yum -y update aws-cli) && (aws s3 rm #{input.directoryPath} --recursive)"
}
],
"parameters": [
{
"myComment": "This Parameter specifies the S3 logging path for the pipeline. It is used by the 'Default' object to set the 'pipelineLogUri' value.",
"id" : "myS3LogsPath",
"type" : "AWS::S3::ObjectKey",
"description" : "S3 path for pipeline logs."
},
{
"id": "myDDBSourceTableName",
"type": "String",
"description": "Source DynamoDB table name"
},
{
"id": "myDDBDestinationTableName",
"type": "String",
"description": "Target DynamoDB table name"
},
{
"id": "myDDBWriteThroughputRatio",
"type": "Double",
"description": "DynamoDB write throughput ratio",
"default": "0.25",
"watermark": "Enter value between 0.1-1.0"
},
{
"id": "myDDBSourceRegion",
"type": "String",
"description": "Region of the DynamoDB table",
"default": "us-east-1",
"watermark": "us-east-1"
},
{
"id": "myDDBDestinationRegion",
"type": "String",
"description": "Region of the DynamoDB table",
"default": "us-east-1",
"watermark": "us-east-1"
},
{
"id": "myDDBReadThroughputRatio",
"type": "Double",
"description": "DynamoDB read throughput ratio",
"default": "0.25",
"watermark": "Enter value between 0.1-1.0"
},
{
"myComment": "Temporary S3 path to store the dynamodb backup csv files, backup files will be deleted after the copy completes",
"id": "myTempS3Folder",
"type": "AWS::S3::ObjectKey",
"description": "Temporary S3 folder"
}
]
}