Importing User Events in Google Cloud's Recommendations AI

I'm working with Google Cloud's Recommendations AI. For importing user events it supports BigQuery, but I can't find the required table schema. As the product is in beta, community support and documentation are limited. Can anyone please guide me in creating the table schema for importing user events from BigQuery?

As you can see here, to import data from BigQuery you should use the Recommendations AI schema, which is a JSON table schema like the one below:
[
{
"name": "product_metadata",
"type": "RECORD",
"mode": "NULLABLE",
"fields": [
{
"name": "images",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "uri",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "height",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "width",
"type": "STRING",
"mode": "NULLABLE"
}
]
},
{
"name": "exact_price",
"type": "RECORD",
"mode": "REQUIRED",
"fields": [
{
"name": "original_price",
"type": "FLOAT",
"mode": "NULLABLE"
},
{
"name": "display_price",
"type": "FLOAT",
"mode": "NULLABLE"
}
]
},
{
"name": "canonical_product_uri",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "currency_code",
"type": "STRING",
"mode": "NULLABLE"
}
]
},
{
"name": "language_code",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "description",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "title",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "tags",
"type": "STRING",
"mode": "REPEATED"
},
{
"name": "category_hierarchies",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "categories",
"type": "STRING",
"mode": "REPEATED"
}
]
},
{
"name": "id",
"type": "STRING",
"mode": "REQUIRED"
}
]
It's also worth mentioning that you can import data from Google Analytics as well; the reference for the Google Analytics schema can be found here.
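For illustration, a hypothetical row conforming to the schema above could look like this (all identifiers and values here are made up):
{
  "id": "sku-123",
  "title": "Example product",
  "description": "A sample item used only to illustrate the schema.",
  "language_code": "en",
  "tags": ["sale", "new-arrival"],
  "category_hierarchies": [
    { "categories": ["Apparel", "Shoes"] }
  ],
  "product_metadata": {
    "exact_price": {
      "original_price": 39.99,
      "display_price": 29.99
    },
    "currency_code": "USD",
    "canonical_product_uri": "https://www.example.com/products/sku-123",
    "images": [
      { "uri": "https://www.example.com/images/sku-123.png", "height": "400", "width": "400" }
    ]
  }
}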

Related

Data Fusion - Argument Setter plugin throws Null pointer exception

The Argument Setter plugin calls an HTTP POST endpoint URL. Below is the plugin configuration:
{
"name": "Argument Setter",
"plugin": {
"name": "ArgumentSetter",
"type": "action",
"label": "Argument Setter",
"artifact": {
"name": "argument-setter-plugins",
"version": "1.1.1",
"scope": "USER"
},
"properties": {
"method": "POST",
"connectTimeout": "60000",
"readTimeout": "60000",
"numRetries": "0",
"followRedirects": "true",
"url": "-----url----",
"body": "{\"type\": \"DELTA\", \"fileType\": \"CSV\", \"targetName\": \"DATAFUSION_TEST_1\", \"dataPoolId\": \"4d3164d9-c8e4-4042-9c69-ff758a17b140\"}"
}
},
"outputSchema": [
{
"name": "id",
"type": "string",
"nullable" : true
},
{
"name": "targetName",
"type": "string",
"nullable" : true
},
{
"name": "lastModified",
"type": "int",
"nullable" : true
},
{
"name": "lastPing",
"type": "string",
"nullable" : true
},
{
"name": "status",
"type": "string",
"nullable" : true
},
{
"name": "type",
"type": "string",
"nullable" : true
},
{
"name": "fileType",
"type": "string",
"nullable" : true
},
{
"name": "targetSchema",
"type": "string",
"nullable" : true
},
{
"name": "upsertStrategy",
"type": "string",
"nullable" : true
},
{
"name": "fallbackVarcharLength",
"type": "string",
"nullable" : true
},
{
"name": "dataPoolId",
"type": "string",
"nullable" : true
},
{
"name": "connectionId",
"type": "string",
"nullable" : true
},
{
"name": "postExecutionQuery",
"type": "string",
"nullable" : true
},
{
"name": "sanitizedPostExecutionQuery",
"type": "string",
"nullable" : true
},
{
"name": "allowDuplicate",
"type": "boolean",
"nullable" : true
},
{
"name": "tableSchema",
"type": "string",
"nullable" : true
},
{
"name": "mirrorTargetNames",
"type": "array",
"nullable" : true
},
{
"name": "changeDate",
"type": "int",
"nullable" : true
},
{
"name": "keys",
"type": "array",
"nullable" : true
},
{
"name": "logs",
"type": "array",
"nullable" : true
},
{
"name": "csvParsingOptions",
"type": "string",
"nullable" : true
},
{
"name": "optionalTenantId",
"type": "string",
"nullable" : true
}
]
}
],
The response from the endpoint URL looks like this:
[{
"id": "b489dc71-96fd-4e94-bcc5-9a4b3732855e",
"targetName": "POSTMAN_TBL",
"lastModified": 1631598169840,
"lastPing": null,
"status": "NEW",
"type": "DELTA",
"fileType": "PARQUET",
"targetSchema": null,
"upsertStrategy": "UPSERT_WITH_UNCHANGED_METADATA",
"fallbackVarcharLength": null,
"dataPoolId": "f37b8619-30e2-4804-9355-38d123142ac4",
"connectionId": null,
"postExecutionQuery": null,
"sanitizedPostExecutionQuery": null,
"allowDuplicate": false,
"tableSchema": null,
"changeDate": 1631598169840,
"mirrorTargetNames": [],
"keys": [],
"logs": [],
"csvParsingOptions": null,
"optionalTenantId": null
}
]
When I execute this, the call to the endpoint URL returns a 200 response, but the pipeline fails with a NullPointerException:
java.lang.NullPointerException: null
at io.cdap.plugin.ArgumentSetter.handleResponse(ArgumentSetter.java:73) ~[na:na]
at io.cdap.plugin.http.HTTPArgumentSetter.run(HTTPArgumentSetter.java:76) ~[na:na]
at io.cdap.cdap.etl.common.plugin.WrappedAction.lambda$run$1(WrappedAction.java:49) ~[na:na]
Can someone help me figure out what I am missing here?
As mentioned in the comments, the response from the server has to be structured in the following format:
{
"arguments" : [
{ argument }, { argument }, ..., {argument}
]
}
Since your response is not structured in this format, you are getting a NullPointerException when you execute your pipeline.
See docs for reference https://github.com/data-integrations/argument-setter/tree/9f6aabbf28e00644726d485188235b406cac522f#usage
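Based on the README linked above, a response shaped like the following should parse correctly; the argument names and values here are just taken from your request body for illustration:
{
  "arguments": [
    { "name": "targetName", "type": "string", "value": "DATAFUSION_TEST_1" },
    { "name": "dataPoolId", "type": "string", "value": "4d3164d9-c8e4-4042-9c69-ff758a17b140" }
  ]
}
Each entry in the arguments array becomes a runtime argument that downstream plugins in the pipeline can reference.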

How to delete a column in BigQuery that is part of a nested column

I want to delete a column in a BigQuery table that is part of a RECORD (nested) column. I've found this command in their documentation; unfortunately, it is not available for nested columns inside existing RECORD fields.
Is there any workaround for this?
For example, with the schema below, I want to remove the address2 field inside the addresses field. So from this:
[
{
"name": "name",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "addresses",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "address1",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "address2",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "country",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
]
to this:
[
{
"name": "name",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "addresses",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "address1",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "country",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
]
Use the query below:
select * replace(
array(select as struct * except(address2) from t.addresses)
as addresses)
from `project.dataset.table` t
If you want to permanently remove that field, use CREATE OR REPLACE TABLE as in the example below:
create or replace table `project.dataset.new_table` as
select * replace(
array(select as struct * except(address2) from t.addresses)
as addresses)
from `project.dataset.table` t
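To double-check the result, you can dump the new table's schema and confirm that address2 is gone, for example with the bq CLI (replace the table reference with your own):
bq show --schema --format=prettyjson project:dataset.new_table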

How to use Data Pipeline to copy data of a DynamoDB table to another DynamoDB table when both have on-demand capacity

I used to copy data from one DynamoDB table to another using a pipeline.json. It works when the source table has provisioned capacity, and it doesn't matter whether the destination is provisioned or on-demand. I want both of my tables set to on-demand capacity, but when I use the same template it doesn't work. Is there any way to do that, or is it still under development?
Here is my original functioning script:
{
"objects": [
{
"startAt": "FIRST_ACTIVATION_DATE_TIME",
"name": "DailySchedule",
"id": "DailySchedule",
"period": "1 day",
"type": "Schedule",
"occurrences": "1"
},
{
"id": "Default",
"name": "Default",
"scheduleType": "ONDEMAND",
"pipelineLogUri": "#{myS3LogsPath}",
"schedule": {
"ref": "DailySchedule"
},
"failureAndRerunMode": "CASCADE",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
},
{
"id": "DDBSourceTable",
"tableName": "#{myDDBSourceTableName}",
"name": "DDBSourceTable",
"type": "DynamoDBDataNode",
"readThroughputPercent": "#{myDDBReadThroughputRatio}"
},
{
"name": "S3TempLocation",
"id": "S3TempLocation",
"type": "S3DataNode",
"directoryPath": "#{myTempS3Folder}/#{format(#scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}"
},
{
"id": "DDBDestinationTable",
"tableName": "#{myDDBDestinationTableName}",
"name": "DDBDestinationTable",
"type": "DynamoDBDataNode",
"writeThroughputPercent": "#{myDDBWriteThroughputRatio}"
},
{
"id": "EmrClusterForBackup",
"name": "EmrClusterForBackup",
"amiVersion": "3.8.0",
"masterInstanceType": "m3.xlarge",
"coreInstanceType": "m3.xlarge",
"coreInstanceCount": "1",
"region": "#{myDDBSourceRegion}",
"terminateAfter": "10 Days",
"type": "EmrCluster"
},
{
"id": "EmrClusterForLoad",
"name": "EmrClusterForLoad",
"amiVersion": "3.8.0",
"masterInstanceType": "m3.xlarge",
"coreInstanceType": "m3.xlarge",
"coreInstanceCount": "1",
"region": "#{myDDBDestinationRegion}",
"terminateAfter": "10 Days",
"type": "EmrCluster"
},
{
"id": "TableLoadActivity",
"name": "TableLoadActivity",
"runsOn": {
"ref": "EmrClusterForLoad"
},
"input": {
"ref": "S3TempLocation"
},
"output": {
"ref": "DDBDestinationTable"
},
"type": "EmrActivity",
"maximumRetries": "2",
"dependsOn": {
"ref": "TableBackupActivity"
},
"resizeClusterBeforeRunning": "true",
"step": [
"s3://dynamodb-emr-#{myDDBDestinationRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbImport,#{input.directoryPath},#{output.tableName},#{output.writeThroughputPercent}"
]
},
{
"id": "TableBackupActivity",
"name": "TableBackupActivity",
"input": {
"ref": "DDBSourceTable"
},
"output": {
"ref": "S3TempLocation"
},
"runsOn": {
"ref": "EmrClusterForBackup"
},
"resizeClusterBeforeRunning": "true",
"type": "EmrActivity",
"maximumRetries": "2",
"step": [
"s3://dynamodb-emr-#{myDDBSourceRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}"
]
},
{
"dependsOn": {
"ref": "TableLoadActivity"
},
"name": "S3CleanupActivity",
"id": "S3CleanupActivity",
"input": {
"ref": "S3TempLocation"
},
"runsOn": {
"ref": "EmrClusterForBackup"
},
"type": "ShellCommandActivity",
"command": "(sudo yum -y update aws-cli) && (aws s3 rm #{input.directoryPath} --recursive)"
}
],
"parameters": [
{
"myComment": "This Parameter specifies the S3 logging path for the pipeline. It is used by the 'Default' object to set the 'pipelineLogUri' value.",
"id" : "myS3LogsPath",
"type" : "AWS::S3::ObjectKey",
"description" : "S3 path for pipeline logs."
},
{
"id": "myDDBSourceTableName",
"type": "String",
"description": "Source DynamoDB table name"
},
{
"id": "myDDBDestinationTableName",
"type": "String",
"description": "Target DynamoDB table name"
},
{
"id": "myDDBWriteThroughputRatio",
"type": "Double",
"description": "DynamoDB write throughput ratio",
"default": "1",
"watermark": "Enter value between 0.1-1.0"
},
{
"id": "myDDBSourceRegion",
"type": "String",
"description": "Region of the DynamoDB table",
"default": "us-west-2"
},
{
"id": "myDDBDestinationRegion",
"type": "String",
"description": "Region of the DynamoDB table",
"default": "us-west-2"
},
{
"id": "myDDBReadThroughputRatio",
"type": "Double",
"description": "DynamoDB read throughput ratio",
"default": "1",
"watermark": "Enter value between 0.1-1.0"
},
{
"myComment": "Temporary S3 path to store the dynamodb backup csv files, backup files will be deleted after the copy completes",
"id": "myTempS3Folder",
"type": "AWS::S3::ObjectKey",
"description": "Temporary S3 folder"
}
]
}
And here is the error message from the Data Pipeline execution when the source DynamoDB table is set to on-demand capacity:
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
at org.apache.hadoop.dynamodb.tools.DynamoDbExport.run(DynamoDbExport.java:79)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.dynamodb.tools.DynamoDbExport.main(DynamoDbExport.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
The following JSON file worked for the export (DynamoDB to S3). Compared to the original script, it uses a newer EMR release (releaseLabel emr-5.23.0 instead of amiVersion 3.8.0) and the 4.11.0 emr-dynamodb-tools jar in the EMR step:
{
"objects": [
{
"id": "Default",
"name": "Default",
"scheduleType": "ONDEMAND",
"pipelineLogUri": "#{myS3LogsPath}",
"failureAndRerunMode": "CASCADE",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
},
{
"id": "DDBSourceTable",
"tableName": "#{myDDBSourceTableName}",
"name": "DDBSourceTable",
"type": "DynamoDBDataNode",
"readThroughputPercent": "#{myDDBReadThroughputRatio}"
},
{
"name": "S3TempLocation",
"id": "S3TempLocation",
"type": "S3DataNode",
"directoryPath": "#{myTempS3Folder}/data"
},
{
"subnetId": "subnet-id",
"id": "EmrClusterForBackup",
"name": "EmrClusterForBackup",
"masterInstanceType": "m5.xlarge",
"coreInstanceType": "m5.xlarge",
"coreInstanceCount": "1",
"releaseLabel": "emr-5.23.0",
"region": "#{myDDBSourceRegion}",
"terminateAfter": "10 Days",
"type": "EmrCluster"
},
{
"id": "TableBackupActivity",
"name": "TableBackupActivity",
"input": {
"ref": "DDBSourceTable"
},
"output": {
"ref": "S3TempLocation"
},
"runsOn": {
"ref": "EmrClusterForBackup"
},
"resizeClusterBeforeRunning": "true",
"type": "EmrActivity",
"maximumRetries": "2",
"step": [
"s3://dynamodb-dpl-#{myDDBSourceRegion}/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}"
]
}
],
"parameters": [
{
"myComment": "This Parameter specifies the S3 logging path for the pipeline. It is used by the 'Default' object to set the 'pipelineLogUri' value.",
"id" : "myS3LogsPath",
"type" : "AWS::S3::ObjectKey",
"description" : "S3 path for pipeline logs."
},
{
"id": "myDDBSourceTableName",
"type": "String",
"description": "Source DynamoDB table name"
},
{
"id": "myDDBSourceRegion",
"type": "String",
"description": "Region of the DynamoDB table",
"default": "us-west-2"
},
{
"id": "myDDBReadThroughputRatio",
"type": "Double",
"description": "DynamoDB read throughput ratio",
"default": "1",
"watermark": "Enter value between 0.1-1.0"
},
{
"myComment": "Temporary S3 path to store the dynamodb backup csv files, backup files will be deleted after the copy completes",
"id": "myTempS3Folder",
"type": "AWS::S3::ObjectKey",
"description": "Temporary S3 folder"
}
]
}
And the following worked for the import (S3 to DynamoDB):
{
"objects": [
{
"id": "Default",
"name": "Default",
"scheduleType": "ONDEMAND",
"pipelineLogUri": "#{myS3LogsPath}",
"failureAndRerunMode": "CASCADE",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
},
{
"name": "S3TempLocation",
"id": "S3TempLocation",
"type": "S3DataNode",
"directoryPath": "#{myTempS3Folder}/data"
},
{
"id": "DDBDestinationTable",
"tableName": "#{myDDBDestinationTableName}",
"name": "DDBDestinationTable",
"type": "DynamoDBDataNode",
"writeThroughputPercent": "#{myDDBWriteThroughputRatio}"
},
{
"subnetId": "subnet-id",
"id": "EmrClusterForLoad",
"name": "EmrClusterForLoad",
"releaseLabel": "emr-5.23.0",
"masterInstanceType": "m5.xlarge",
"coreInstanceType": "m5.xlarge",
"coreInstanceCount": "1",
"region": "#{myDDBDestinationRegion}",
"terminateAfter": "10 Days",
"type": "EmrCluster"
},
{
"id": "TableLoadActivity",
"name": "TableLoadActivity",
"runsOn": {
"ref": "EmrClusterForLoad"
},
"input": {
"ref": "S3TempLocation"
},
"output": {
"ref": "DDBDestinationTable"
},
"type": "EmrActivity",
"maximumRetries": "2",
"resizeClusterBeforeRunning": "true",
"step": [
"s3://dynamodb-dpl-#{myDDBDestinationRegion}/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,org.apache.hadoop.dynamodb.tools.DynamoDBImport,#{input.directoryPath},#{output.tableName},#{output.writeThroughputPercent}"
]
},
{
"dependsOn": {
"ref": "TableLoadActivity"
},
"name": "S3CleanupActivity",
"id": "S3CleanupActivity",
"input": {
"ref": "S3TempLocation"
},
"runsOn": {
"ref": "EmrClusterForLoad"
},
"type": "ShellCommandActivity",
"command": "(sudo yum -y update aws-cli) && (aws s3 rm #{input.directoryPath} --recursive)"
}
],
"parameters": [
{
"myComment": "This Parameter specifies the S3 logging path for the pipeline. It is used by the 'Default' object to set the 'pipelineLogUri' value.",
"id" : "myS3LogsPath",
"type" : "AWS::S3::ObjectKey",
"description" : "S3 path for pipeline logs."
},
{
"id": "myDDBDestinationTableName",
"type": "String",
"description": "Target DynamoDB table name"
},
{
"id": "myDDBWriteThroughputRatio",
"type": "Double",
"description": "DynamoDB write throughput ratio",
"default": "1",
"watermark": "Enter value between 0.1-1.0"
},
{
"id": "myDDBDestinationRegion",
"type": "String",
"description": "Region of the DynamoDB table",
"default": "us-west-2"
},
{
"myComment": "Temporary S3 path to store the dynamodb backup csv files, backup files will be deleted after the copy completes",
"id": "myTempS3Folder",
"type": "AWS::S3::ObjectKey",
"description": "Temporary S3 folder"
}
]
}
Also, the subnetId fields in both pipeline definitions are optional, but it is always good to set them.
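If you create these pipelines from the command line rather than the console, the general flow with the AWS CLI is sketched below; the pipeline name, the df-... pipeline ID returned by create-pipeline, the definition file name, and all parameter values are placeholders:
aws datapipeline create-pipeline --name ddb-export --unique-id ddb-export-1
aws datapipeline put-pipeline-definition \
    --pipeline-id df-EXAMPLE1234567890 \
    --pipeline-definition file://export-pipeline.json \
    --parameter-values myS3LogsPath=s3://my-bucket/logs myDDBSourceTableName=my-source-table myDDBSourceRegion=us-west-2 myDDBReadThroughputRatio=0.5 myTempS3Folder=s3://my-bucket/tmp
aws datapipeline activate-pipeline --pipeline-id df-EXAMPLE1234567890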

Elasticsearch index data exists and can be queried against but the returned json is blank

I suspect I'm making a newbie mistake.
I have an Elasticsearch index (lswl) that is receiving data from Logstash and Winlogbeat and has indexed (not_analyzed) data, but I can't seem to retrieve it.
When I run the following query:
POST /lswl-2016.08.15/_search?pretty
{
"query": {
"match_all": {}
}
}
I get the following results:
"hits": {
"total": 9,
"max_score": 1,
"hits": [
{
"_index": "lswl-2016.08.15",
"_type": "wineventlog",
"_id": "AVaLgghl49PiM_pqlihQ",
"_score": 1
}
I know there's data in there because queries like this return a subset of values.
POST /lswl-*/_search?pretty
{
"query": {
"term": { "host": "BTRDAPTST02"}
}
}
I suspect that the problem is in the template I created for the lswl index, but for the life of me I can't figure out what I did incorrectly. The template is below for reference:
"template": "lswl*",
"settings":{
"number_of_shards": 1
},
"mappings": {
"wineventlog":{
"_source": {
"enabled": false
},
"properties": {
"#timestamp": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"#version": {
"type": "string",
"index": "not_analyzed"
},
"category": {
"type": "string"
},
"computer_name": {
"type": "string",
"index": "not_analyzed"
},
"count": {
"type": "long"
},
"event_id": {
"type": "long"
},
"host": {
"type": "string",
"index": "not_analyzed"
},
"level": {
"type": "string",
"index": "not_analyzed"
},
"log_name": {
"type": "string",
"index": "not_analyzed"
},
"message": {
"type": "string",
"fields": {
"original": {
"type": "string",
"index": "not_analyzed"
}
}
},
"record_number": {
"type": "string",
"index": "not_analyzed"
},
"source_name": {
"type": "string",
"index": "not_analyzed"
},
"tags": {
"type": "string",
"index": "not_analyzed"
},
"type": {
"type": "string",
"index": "not_analyzed"
},
"user": {
"properties": {
"domain": {
"type": "string",
"index": "not_analyzed"
},
"identifier": {
"type": "string",
"index": "not_analyzed"
},
"name": {
"type": "string",
"index": "not_analyzed"
},
"type": {
"type": "string",
"index": "not_analyzed"
}
Because _source is disabled in your template, Elasticsearch does not store the original JSON documents, so search hits come back with only metadata (_index, _type, _id) and no document body. Just remove the following part, or set it to true:
"_source": {
"enabled": false
},
And then delete the index and re-index your data. You'll be able to see the data afterwards.
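A rough sketch of the fix, in the same console style as the queries above (the template name is assumed; keep the rest of your mappings as they were):
PUT _template/lswl
{
  "template": "lswl*",
  "settings": { "number_of_shards": 1 },
  "mappings": {
    "wineventlog": {
      "_source": { "enabled": true },
      "properties": { ... }
    }
  }
}

DELETE /lswl-2016.08.15
Once the template is updated and the index deleted, let Logstash/Winlogbeat re-send the data; the documents will then come back with their _source in search results.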

DynamoDB table duplication through Data Pipeline producing incomplete duplicate

I have a DynamoDB table that is 14.05 GB, with 140,000,000 items. I am trying to clone it (to the same region) using Data Pipeline, but the destination table has only about 160,000 items when the pipeline is finished, even after waiting 6 hours for the item count to update.
I set the throughput to 256 for each table and the pipeline took about 20 minutes to complete. Is there anything that might be causing the pipeline to copy only a section of the table? Are there invisible limits on size and item count? I have tried this 3 times, with similar results each time: the 'completed' destination table contains only 90-150k of the 140M items.
I also made sure the max execution time was set very high.
Is the Data Pipeline the simplest way to quickly copy a Dynamo table?
Thanks.
Amazon has replied to my ticket and confirmed that it is a known issue (bug) in Data Pipeline.
They recommended this Java program, https://github.com/awslabs/dynamodb-import-export-tool, to first export the table to S3 and then import it back into DynamoDB.
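For reference, that tool is a plain Java CLI. A rough invocation sketch based on its README is below; the endpoints, table names, and throughput ratios are placeholders, and the exact flag names should be verified against the repository:
java -jar dynamodb-import-export-tool.jar \
    --sourceEndpoint dynamodb.us-east-1.amazonaws.com \
    --sourceTable my-source-table \
    --destinationEndpoint dynamodb.us-east-1.amazonaws.com \
    --destinationTable my-destination-table \
    --readThroughputRatio 0.5 \
    --writeThroughputRatio 0.5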
Using the EmrActivity of AWS Data Pipeline, one can copy from one DynamoDB table to another. Below is an example pipeline definition:
{
"objects": [
{
"startAt": "FIRST_ACTIVATION_DATE_TIME",
"name": "DailySchedule",
"id": "DailySchedule",
"period": "1 day",
"type": "Schedule",
"occurrences": "1"
},
{
"id": "Default",
"name": "Default",
"scheduleType": "CRON",
"pipelineLogUri": "#{myS3LogsPath}",
"schedule": {
"ref": "DailySchedule"
},
"failureAndRerunMode": "CASCADE",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
},
{
"id": "DDBSourceTable",
"tableName": "#{myDDBSourceTableName}",
"name": "DDBSourceTable",
"type": "DynamoDBDataNode",
"readThroughputPercent": "#{myDDBReadThroughputRatio}"
},
{
"name": "S3TempLocation",
"id": "S3TempLocation",
"type": "S3DataNode",
"directoryPath": "#{myTempS3Folder}/#{format(#scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}"
},
{
"id": "DDBDestinationTable",
"tableName": "#{myDDBDestinationTableName}",
"name": "DDBDestinationTable",
"type": "DynamoDBDataNode",
"writeThroughputPercent": "#{myDDBWriteThroughputRatio}"
},
{
"id": "EmrClusterForBackup",
"name": "EmrClusterForBackup",
"releaseLabel": "emr-4.2.0",
"masterInstanceType": "m3.xlarge",
"coreInstanceType": "m3.xlarge",
"coreInstanceCount": "1",
"region": "#{myDDBSourceRegion}",
"terminateAfter": "6 Hours",
"type": "EmrCluster"
},
{
"id": "EmrClusterForLoad",
"name": "EmrClusterForLoad",
"releaseLabel": "emr-4.2.0",
"masterInstanceType": "m3.xlarge",
"coreInstanceType": "m3.xlarge",
"coreInstanceCount": "1",
"region": "#{myDDBDestinationRegion}",
"terminateAfter": "6 Hours",
"type": "EmrCluster"
},
{
"id": "TableLoadActivity",
"name": "TableLoadActivity",
"runsOn": {
"ref": "EmrClusterForLoad"
},
"input": {
"ref": "S3TempLocation"
},
"output": {
"ref": "DDBDestinationTable"
},
"type": "EmrActivity",
"maximumRetries": "2",
"dependsOn": {
"ref": "TableBackupActivity"
},
"resizeClusterBeforeRunning": "true",
"step": [
"s3://dynamodb-emr-#{myDDBDestinationRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbImport,#{input.directoryPath},#{output.tableName},#{output.writeThroughputPercent}"
]
},
{
"id": "TableBackupActivity",
"name": "TableBackupActivity",
"input": {
"ref": "DDBSourceTable"
},
"output": {
"ref": "S3TempLocation"
},
"runsOn": {
"ref": "EmrClusterForBackup"
},
"resizeClusterBeforeRunning": "true",
"type": "EmrActivity",
"maximumRetries": "2",
"step": [
"s3://dynamodb-emr-#{myDDBSourceRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}"
]
},
{
"dependsOn": {
"ref": "TableLoadActivity"
},
"name": "S3CleanupActivity",
"id": "S3CleanupActivity",
"input": {
"ref": "S3TempLocation"
},
"runsOn": {
"ref": "EmrClusterForBackup"
},
"type": "ShellCommandActivity",
"command": "(sudo yum -y update aws-cli) && (aws s3 rm #{input.directoryPath} --recursive)"
}
],
"parameters": [
{
"myComment": "This Parameter specifies the S3 logging path for the pipeline. It is used by the 'Default' object to set the 'pipelineLogUri' value.",
"id" : "myS3LogsPath",
"type" : "AWS::S3::ObjectKey",
"description" : "S3 path for pipeline logs."
},
{
"id": "myDDBSourceTableName",
"type": "String",
"description": "Source DynamoDB table name"
},
{
"id": "myDDBDestinationTableName",
"type": "String",
"description": "Target DynamoDB table name"
},
{
"id": "myDDBWriteThroughputRatio",
"type": "Double",
"description": "DynamoDB write throughput ratio",
"default": "0.25",
"watermark": "Enter value between 0.1-1.0"
},
{
"id": "myDDBSourceRegion",
"type": "String",
"description": "Region of the DynamoDB table",
"default": "us-east-1",
"watermark": "us-east-1"
},
{
"id": "myDDBDestinationRegion",
"type": "String",
"description": "Region of the DynamoDB table",
"default": "us-east-1",
"watermark": "us-east-1"
},
{
"id": "myDDBReadThroughputRatio",
"type": "Double",
"description": "DynamoDB read throughput ratio",
"default": "0.25",
"watermark": "Enter value between 0.1-1.0"
},
{
"myComment": "Temporary S3 path to store the dynamodb backup csv files, backup files will be deleted after the copy completes",
"id": "myTempS3Folder",
"type": "AWS::S3::ObjectKey",
"description": "Temporary S3 folder"
}
]
}