Is there any efficient way to delete all the items from an Amazon DynamoDB table at once? I have gone through the AWS docs, but they only show how to delete a single item.
Do the following steps:
Make a DeleteTable request
In the response you will get the TableDescription
Using the TableDescription, create the table again.
For steps 1 and 2, see the DeleteTable API reference.
For step 3, see the CreateTable API reference.
That's what I do in my application.
DynamoDBMapper will do the job in a few lines:
AWSCredentials credentials = new PropertiesCredentials(credentialFile);
AmazonDynamoDBClient client = new AmazonDynamoDBClient(credentials);
DynamoDBMapper mapper = new DynamoDBMapper(client);
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
PaginatedScanList<LogData> result = mapper.scan(LogData.class, scanExpression);
for (LogData data : result) {
mapper.delete(data);
}
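If the table is large, scanning and deleting one item at a time gets slow. DynamoDBMapper also offers batchDelete, which groups the deletes into BatchWriteItem calls of up to 25 items; a minimal sketch, reusing the mapper and scan result from above:

// batchDelete issues BatchWriteItem requests under the hood;
// anything DynamoDB could not process comes back as FailedBatch entries.
List<DynamoDBMapper.FailedBatch> failures = mapper.batchDelete(result);
if (!failures.isEmpty()) {
    // Retry or log the unprocessed deletes here
    System.err.println("Unprocessed deletes in " + failures.size() + " batch(es)");
}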
As ihtsham says, the most efficient way is to delete and re-create the table. However, if that is not practical (e.g. due to complex configuration of the table, such as Lambda triggers), here are some AWS CLI commands to delete all records. They require the jq program for JSON processing.
Deleting records one-by-one (slow!), assuming your table is called my_table, your partition key is called partition_key, and your sort key (if any) is called sort_key:
aws dynamodb scan --table-name my_table | \
jq -c '.Items[] | { partition_key, sort_key }' | \
tr '\n' '\0' | \
xargs -0 -n1 -t aws dynamodb delete-item --table-name my_table --key
Deleting records in batches of up to 25 records:
aws dynamodb scan --table-name my_table | \
jq -c '[.Items | keys[] as $i | { index: $i, value: .[$i]}] | group_by(.index / 25 | floor)[] | { "my_table": [.[].value | { "DeleteRequest": { "Key": { partition_key, sort_key }}}] }' | \
tr '\n' '\0' | \
xargs -0 -n1 -t aws dynamodb batch-write-item --request-items
If you start seeing non-empty UnprocessedItems responses, your write capacity has been exceeded. You can account for this by reducing the batch size. For me, each batch takes about a second to submit, so with a write capacity of 5 per second, I set the batch size to 5.
Just for the record, a quick solution with item-by-item delete in Python 3 (using Boto3 and scan()):
(Credentials need to be set.)
def delete_all_items(table_name):
# Deletes all items from a DynamoDB table.
# You need to confirm your intention by pressing Enter.
import boto3
client = boto3.client('dynamodb')
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(table_name)
response = client.describe_table(TableName=table_name)
keys = [k['AttributeName'] for k in response['Table']['KeySchema']]
response = table.scan()
items = response['Items']
number_of_items = len(items)
if number_of_items == 0: # no items to delete
print("Table '{}' is empty.".format(table_name))
return
print("You are about to delete all ({}) items from table '{}'."
.format(number_of_items, table_name))
input("Press Enter to continue...")
with table.batch_writer() as batch:
for item in items:
key_dict = {k: item[k] for k in keys}
print("Deleting " + str(item) + "...")
batch.delete_item(Key=key_dict)
delete_all_items("test_table")
Obviously, this shouldn't be used for tables with a lot of items (100+); note also that a single scan() call returns at most 1 MB of data, so larger tables would need pagination with LastEvaluatedKey. For those cases, the delete / recreate approach is cheaper and more efficient.
You will want to use BatchWriteItem if you can't drop the table. If all your entries are within a single HashKey, you can use the Query API to retrieve the records, and then delete them 25 items at a time. If not, you'll probably have to Scan.
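If you go the Scan + BatchWriteItem route, a hedged sketch with the low-level v1 Java client looks roughly like this (my_table and partition_key are assumed names for a table with no sort key; if the table has a sort key, project it and include it in the delete key as well):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class ScanAndBatchDelete {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
        String tableName = "my_table";      // assumed table name
        String keyName = "partition_key";   // assumed partition key, no sort key

        Map<String, AttributeValue> lastKey = null;
        do {
            // Project only the key attribute(s); that is all a delete needs
            ScanResult scan = client.scan(new ScanRequest()
                    .withTableName(tableName)
                    .withProjectionExpression(keyName)
                    .withExclusiveStartKey(lastKey));
            lastKey = scan.getLastEvaluatedKey();

            List<WriteRequest> deletes = new ArrayList<>();
            for (Map<String, AttributeValue> item : scan.getItems()) {
                deletes.add(new WriteRequest(new DeleteRequest(item)));
                if (deletes.size() == 25) {   // BatchWriteItem limit
                    flush(client, tableName, deletes);
                }
            }
            if (!deletes.isEmpty()) {
                flush(client, tableName, deletes);
            }
        } while (lastKey != null);
    }

    private static void flush(AmazonDynamoDB client, String tableName, List<WriteRequest> deletes) {
        BatchWriteItemResult result = client.batchWriteItem(new BatchWriteItemRequest(
                Collections.singletonMap(tableName, new ArrayList<>(deletes))));
        // Naive retry of unprocessed items; a real implementation should back off exponentially
        while (!result.getUnprocessedItems().isEmpty()) {
            result = client.batchWriteItem(new BatchWriteItemRequest(result.getUnprocessedItems()));
        }
        deletes.clear();
    }
}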
Alternatively, you could provide a simple wrapper around AmazonDynamoDBClient (from the official SDK) that collects a Set of Hash/Range keys that exist in your table. Then you wouldn't need to Query or Scan for the items you inserted after the test, since you would already have the Set built. That would look something like this:
public class KeyCollectingAmazonDynamoDB implements AmazonDynamoDB
{
private final AmazonDynamoDB delegate;
// Key is the SDK's hash/range key holder (see the note below); you could also define your own pair class
private final Set<Key> contents;
public KeyCollectingAmazonDynamoDB( AmazonDynamoDB delegate )
{
this.delegate = delegate;
this.contents = new HashSet<>();
}
@Override
public PutItemResult putItem( PutItemRequest putItemRequest )
throws AmazonServiceException, AmazonClientException
{
contents.add( extractKey( putItemRequest.getItem() ) );
return delegate.putItem( putItemRequest );
}
private Key extractKey( Map<String, AttributeValue> item )
{
// TODO Define your hash/range key extraction here
// Create a Key object
return new Key( hashKey, rangeKey );
}
@Override
public DeleteItemResult deleteItem( DeleteItemRequest deleteItemRequest )
throws AmazonServiceException, AmazonClientException
{
contents.remove( extractKey( deleteItemRequest.getKey() ) );
return delegate.deleteItem( deleteItemRequest );
}
@Override
public BatchWriteItemResult batchWriteItem( BatchWriteItemRequest batchWriteItemRequest )
throws AmazonServiceException, AmazonClientException
{
// Similar extraction, but in bulk.
for ( Map.Entry<String, List<WriteRequest>> entry : batchWriteItemRequest.getRequestItems().entrySet() )
{
String tableName = entry.getKey();
List<WriteRequest> writeRequests = entry.getValue();
for ( WriteRequest writeRequest : writeRequests )
{
PutRequest putRequest = writeRequest.getPutRequest();
if ( putRequest != null )
{
// Add to Set just like putItem
}
DeleteRequest deleteRequest = writeRequest.getDeleteRequest();
if ( deleteRequest != null )
{
// Remove from Set just like deleteItem
}
}
}
// Write through to DynamoDB
return delegate.batchWriteItem( batchWriteItemRequest );
}
// remaining methods elided, since they're direct delegation
}
Key is a class within the DynamoDB SDK that accepts zero, one, or two AttributeValue objects in the constructor to represent a hash key or a hash/range key. Assuming its equals and hashCode methods work, you can use it within the Set I described. If they don't, you'll have to write your own Key class.
This should get you a maintained Set for use within your tests. It's not specific to a table, so you might need to add another layer of collection if you're using multiple tables. That would change Set<Key> to something like Map<TableName, Set<Key>>. You would need to look at the getTableName() property to pick the correct Set to update.
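A hedged fragment of what that per-table bookkeeping could look like, reusing the hypothetical extractKey helper from the wrapper above:

// Hypothetical multi-table variant of the Set bookkeeping
private final Map<String, Set<Key>> contentsByTable = new HashMap<>();

private void recordPut( String tableName, Map<String, AttributeValue> item )
{
    contentsByTable.computeIfAbsent( tableName, t -> new HashSet<>() )
                   .add( extractKey( item ) );
}

private void recordDelete( String tableName, Map<String, AttributeValue> key )
{
    Set<Key> keys = contentsByTable.get( tableName );
    if ( keys != null )
    {
        keys.remove( extractKey( key ) );
    }
}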
Once your test finishes, grabbing the contents of the table and deleting should be straightforward.
One final suggestion: use a different table for testing than you do for your application. Create an identical schema, but give the table a different name. You probably even want a different IAM user to prevent your test code from accessing your production table. If you have questions about that, feel free to open a separate question for that scenario.
You can recreate a DynamoDB table using the AWS Java SDK:
// Init DynamoDB client
AmazonDynamoDB dynamoDB = AmazonDynamoDBClientBuilder.standard().build();
// Get table definition
TableDescription tableDescription = dynamoDB.describeTable("my-table").getTable();
// Delete table
dynamoDB.deleteTable("my-table");
// Create table
CreateTableRequest createTableRequest = new CreateTableRequest()
.withTableName(tableDescription.getTableName())
.withAttributeDefinitions(tableDescription.getAttributeDefinitions())
.withProvisionedThroughput(new ProvisionedThroughput()
.withReadCapacityUnits(tableDescription.getProvisionedThroughput().getReadCapacityUnits())
.withWriteCapacityUnits(tableDescription.getProvisionedThroughput().getWriteCapacityUnits())
)
.withKeySchema(tableDescription.getKeySchema());
dynamoDB.createTable(createTableRequest);
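Note that DeleteTable is asynchronous, so calling createTable immediately can fail with a ResourceInUseException while the old table is still being deleted. A minimal sketch using the SDK's built-in waiters, reusing dynamoDB and createTableRequest from above and moving the create call after the wait (also keep in mind that the snippet above does not copy secondary indexes, streams, or TTL settings):

import com.amazonaws.services.dynamodbv2.model.DescribeTableRequest;
import com.amazonaws.waiters.WaiterParameters;

// Wait until the old table is fully deleted before re-creating it
dynamoDB.waiters().tableNotExists()
        .run(new WaiterParameters<DescribeTableRequest>()
                .withRequest(new DescribeTableRequest("my-table")));

dynamoDB.createTable(createTableRequest);

// Wait until the new table is ACTIVE before writing to it
dynamoDB.waiters().tableExists()
        .run(new WaiterParameters<DescribeTableRequest>()
                .withRequest(new DescribeTableRequest("my-table")));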
I use the following JavaScript code to do it:
// Assumes: const AWS = require('aws-sdk');
//          const db  = new AWS.DynamoDB();                 // low-level client, for describeTable
//          const ddb = new AWS.DynamoDB.DocumentClient();  // document client, for scan/batchWrite
async function truncate(table, keys) {
const limit = (await db.describeTable({
TableName: table
}).promise()).Table.ProvisionedThroughput.ReadCapacityUnits;
let total = 0;
let lastEvaluatedKey = null;
do {
const qp = {
TableName: table,
Limit: limit,
ExclusiveStartKey: lastEvaluatedKey,
ProjectionExpression: keys.join(', '), // ProjectionExpression is comma-separated
};
const qr = await ddb.scan(qp).promise();
lastEvaluatedKey = qr.LastEvaluatedKey;
const dp = {
RequestItems: {
},
};
dp.RequestItems[table] = [];
if (qr.Items) {
for (const i of qr.Items) {
const dr = {
DeleteRequest: {
Key: {
}
}
};
keys.forEach(k => {
dr.DeleteRequest.Key[k] = i[k];
});
dp.RequestItems[table].push(dr);
if (dp.RequestItems[table].length % 25 == 0) {
await ddb.batchWrite(dp).promise();
total += dp.RequestItems[table].length;
dp.RequestItems[table] = [];
}
}
if (dp.RequestItems[table].length > 0) {
await ddb.batchWrite(dp).promise();
total += dp.RequestItems[table].length;
dp.RequestItems[table] = [];
}
}
console.log(`Deleted ${total}`);
await new Promise(resolve => setTimeout(resolve, 1000)); // actually pause so write capacity can recover
} while (lastEvaluatedKey);
}
(async () => {
await truncate('table_name', ['id']);
})();
In this case, you may delete the table and create a new one.
Example:
from __future__ import print_function # Python 2/3 compatibility
import boto3
dynamodb = boto3.resource('dynamodb', region_name='us-west-2', endpoint_url="http://localhost:8000")
table = dynamodb.Table('Movies')
table.delete()
Related
I'm currently writing integration tests for my BatchWriteItem logic using Spock/Groovy. I'm running a Docker container which spins up a real DynamoDB table for this purpose.
This is my Java logic for BatchWriteItem:
public Promise<Boolean> createItemsInBatch(ClientKey clientKey, String accountId, List<SrItems> srItems) {
List<Item> items = srItems.stream()
.map(srItem -> createItemFromSrItem(clientKey, createItemRef(srItem.getId(), accountId), srItem))
.collect(Collectors.toList());
List<List<Item>> batchItems = Lists.partition(items, 25);
var promises = batchItems.stream().map(itemsList -> Blocking.get(() -> {
TableWriteItems tableWriteItems = new TableWriteItems(table.getTableName());
tableWriteItems.withItemsToPut(itemsList);
BatchWriteItemOutcome outcome = dynamoDB.batchWriteItem(tableWriteItems);
return outcome.getUnprocessedItems().values().stream().flatMap(Collection::stream).collect(Collectors.toList());
})).collect(Collectors.toList());
return ParallelPromises.yieldAll(promises).map((List<? extends ExecResult<List<WriteRequest>>> results) -> {
if(results.isEmpty()) {
return true;
} else {
results.stream().map(Result::getValue).flatMap(Collection::stream).forEach(failure -> {
var failedItem = failure.getPutRequest().getItem();
logger.error(append("item", failedItem), "Failed to batch write item");
});
return false;
}
});
}
And this is my current implementation for the test (happy path)
@Unroll
def "createItemsInBatch - #description"(description, srItemsList, createResult) {
given:
def dynamoItemService = new DynamoItemService(realTable, amazonDynamoDBClient1) //passing the table running in the docker image + the dynamo client associated
when:
def promised = ExecHarness.yieldSingle {
dynamoItemService.createItemsInBatch(CLIENT_KEY, 'account-id', srItemsList as List<SrItem>)
}
then:
promised.success == createResult
where:
description | srItemsList | createResult
"single batch req not reaching batch size limit" | srItems(10) | true
"double batch req reaching batch size limit" | srItems(25) | true
"triple batch req reaching batch size limit" | srItems(51) | true
}
For context:
srItems() is a function that just creates a bunch of different items to be injected in the service for the BatchWriteItem request
What I want now is to be able to test the unhappy path of my logic, i.e. get some UnprocessedItems in my outcome, to verify that the code below is actually doing its job:
BatchWriteItemOutcome outcome = dynamoDB.batchWriteItem(tableWriteItems);
return outcome.getUnprocessedItems().values().stream().flatMap(Collection::stream).collect(Collectors.toList());
Any help would be greatly appreciated
This is quite easy to do, actually: we can force throttling on your DynamoDB table, which will result in UnprocessedItems.
Configure your table with 1 WCU and disable auto-scaling. Now run your BatchWriteItem calls in batches of 25 for a couple of seconds and DynamoDB will begin to throttle requests, returning the throttled items in the UnprocessedItems response and exercising your unhappy path.
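For example, dropping the write capacity from your test setup might look like the sketch below (a hedged illustration assuming the v1 Java SDK and a table that actually enforces provisioned capacity; the class and method names are made up):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;

public class ThrottleHelper {
    // Drop the table to 1 WCU so sustained BatchWriteItem calls start returning UnprocessedItems
    public static void throttleWrites(AmazonDynamoDB client, String tableName) {
        client.updateTable(new UpdateTableRequest()
                .withTableName(tableName)
                .withProvisionedThroughput(new ProvisionedThroughput(1L, 1L)));
    }
}

Call it before running the batches, and restore the original capacity afterwards so the other tests are not throttled as well.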
We are using a Spark job with the emr-dynamodb-connector to load data from S3 files into DynamoDB.
https://github.com/awslabs/emr-dynamodb-connector
But if a document is already present in DynamoDB, my code is replacing it.
Is there a way to avoid updating existing records (based on id) if they are present in DynamoDB? If an id is already present, I simply don't want to update it, just skip that id and write the rest of the records. The code I am using is:
JobConf ddbConf = new JobConf(spark.sparkContext().hadoopConfiguration());
ddbConf.set("dynamodb.output.tableName", tableName);
ddbConf.set("dynamodb.throughput.write.percent", "50");
ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat");
ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat");
JavaPairRDD<Text, DynamoDBItemWritable> ddbInsertFormattedRDD = finalDatasetToBeSaved.toJavaRDD().mapToPair(new PairFunction<Row, Text, DynamoDBItemWritable>() {
@Override
public Tuple2<Text, DynamoDBItemWritable> call(Row row) throws Exception {
Map<String, AttributeValue> ddbMap = new HashMap<String, AttributeValue>();
for (int i = 0 ; i <= schemaDdb.length - 1; i++) {
Object value = row.get(i);
if (value != null) {
AttributeValue att = new AttributeValue();
if(schemaDdb[i]._2.toString().equalsIgnoreCase("IntegerType")){
att.setN(value.toString());
}else{
att.setS(value.toString());
}
ddbMap.put((String)schemaDdb[i]._1, att);
}
}
DynamoDBItemWritable item = new DynamoDBItemWritable();
item.setItem(ddbMap);
return new Tuple2<Text, DynamoDBItemWritable>(new Text(""), item);
}
});
ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf);
By saying "Is there a way to avoid updating existing records (based on id) if they are already present", do you want to add another document instead of replacing/updating it?
If yes, I am afraid it won't be possible with the primary key, since that must be unique and is what distinguishes one item from another. You would need to make the key non-primary in order to do this.
If you want to ignore the insertion (if the item already exists), you can use the condition expression attribute_not_exists(your-key), as described in the documentation.
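For reference, a conditional put with the plain DynamoDB client looks roughly like the sketch below (a hedged illustration; id, payload, and my-table are assumed names). Note that BatchWriteItem, which the connector uses under the hood, does not support condition expressions, so you would have to fall back to per-item PutItem calls for the ids you want to protect.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

import java.util.HashMap;
import java.util.Map;

public class ConditionalPutExample {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();

        Map<String, AttributeValue> item = new HashMap<>();
        item.put("id", new AttributeValue().withS("some-id"));        // assumed key attribute
        item.put("payload", new AttributeValue().withS("some data")); // assumed data attribute

        try {
            // The put succeeds only if no item with this id exists yet
            client.putItem(new PutItemRequest()
                    .withTableName("my-table")                        // assumed table name
                    .withItem(item)
                    .withConditionExpression("attribute_not_exists(id)"));
        } catch (ConditionalCheckFailedException e) {
            // Item already exists: skip it
        }
    }
}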
Just like backing up all the tables in DynamoDB, I also want to clear all the tables in my test environment after testing, without deleting the tables.
We wrote the backup service in such a way that it doesn't need the schema structure or a Java object for the table schema, as below:
Map<String, AttributeValue> exclusiveStartKey = null;
do {
// Let the rate limiter wait until our desired throughput "recharges"
rateLimiter.acquire(permitsToConsume);
ScanSpec scanSpec = new ScanSpec().withReturnConsumedCapacity(ReturnConsumedCapacity.TOTAL)
.withMaxResultSize(25);
if(exclusiveStartKey!=null){
KeyAttribute haskKey = getExclusiveStartHashKey(exclusiveStartKey, keySchema);
KeyAttribute rangeKey = getExclusiveStartRangeKey(exclusiveStartKey, keySchema);
if(rangeKey!=null){
scanSpec.withExclusiveStartKey(haskKey, rangeKey);
}else{
scanSpec.withExclusiveStartKey(haskKey);
}
}
Table table = dynamoDBInstance.getTable(tableName);
ItemCollection<ScanOutcome> response = table.scan(scanSpec);
StringBuffer data = new StringBuffer();
Iterator<Item> iterator = response.iterator();
while (iterator.hasNext()) {
Item item = iterator.next();
data.append(item.toJSON());
data.append("\n");
}
logger.debug("Data read from table: {} ", data.toString());
if(response.getLastLowLevelResult()!=null){
exclusiveStartKey = response.getLastLowLevelResult().getScanResult().getLastEvaluatedKey();
}else{
exclusiveStartKey = null;
}
// Account for the rest of the throughput we consumed,
// now that we know how much that scan request cost
if(response.getTotalConsumedCapacity()!=null){
double consumedCapacity = response.getTotalConsumedCapacity().getCapacityUnits();
if(logger.isDebugEnabled()){
logger.debug("Consumed capacity : " + consumedCapacity);
}
permitsToConsume = (int)(consumedCapacity - 1.0);
if(permitsToConsume <= 0) {
permitsToConsume = 1;
}
}
} while (exclusiveStartKey != null);
Is it possible to delete all items without knowing the table schema? Can we do it using DeleteItemSpec?
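For what it's worth, the key attribute names can be discovered at runtime with DescribeTable and then fed into DeleteItemSpec, so a schema-agnostic delete is possible. A minimal sketch with the document API (ignoring the rate limiting from the code above; the class and method names are made up):

import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.DeleteItemSpec;
import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
import java.util.List;

public class SchemaAgnosticTruncate {
    public static void truncate(DynamoDB dynamoDBInstance, String tableName) {
        Table table = dynamoDBInstance.getTable(tableName);

        // Discover the key attribute names from the table description
        List<KeySchemaElement> keySchema = table.describe().getKeySchema();
        String hashKey = keySchema.get(0).getAttributeName();
        String rangeKey = keySchema.size() > 1 ? keySchema.get(1).getAttributeName() : null;

        // Delete every scanned item by its key, without hard-coded schema knowledge
        for (Item item : table.scan()) {
            DeleteItemSpec spec = (rangeKey == null)
                    ? new DeleteItemSpec().withPrimaryKey(hashKey, item.get(hashKey))
                    : new DeleteItemSpec().withPrimaryKey(hashKey, item.get(hashKey),
                                                          rangeKey, item.get(rangeKey));
            table.deleteItem(spec);
        }
    }
}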
I have a table called friends:
Friend 1 | Friend 2 | Status
Friend 1 is my HASH attribute and Friend 2 is my range attribute.
I would like to update an item's status attribute where friend 1 = 'Bob' and friend 2 = 'Joe'. Reading through the documentation at http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaDocumentAPICRUDExample.html I can only see how to update an item by one key; how do I include the other key?
Here you go:
DynamoDBQueryExpression<Reply> queryExpression = new DynamoDBQueryExpression<Reply>()
.withKeyConditionExpression("Id = :val1 and ReplyDateTime > :val2")
.withExpressionAttributeValues(
...
where Id is the Hash Key and ReplyDateTime is the Range Key.
Reference:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBMapper.QueryScanExample.html
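If the goal is the update itself rather than the query, the document API lets you address an item by both keys directly. A hedged sketch (friends, friend1, friend2, and status are assumed table/attribute names; status is a DynamoDB reserved word, hence the name placeholder):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.UpdateItemSpec;
import com.amazonaws.services.dynamodbv2.document.utils.NameMap;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public class UpdateByCompositeKey {
    public static void main(String[] args) {
        DynamoDB dynamoDB = new DynamoDB(AmazonDynamoDBClientBuilder.standard().build());
        Table table = dynamoDB.getTable("friends");  // assumed table name

        // Both the hash key and the range key are needed to identify the single item to update
        table.updateItem(new UpdateItemSpec()
                .withPrimaryKey("friend1", "Bob", "friend2", "Joe")   // assumed attribute names
                .withUpdateExpression("set #s = :newStatus")
                .withNameMap(new NameMap().with("#s", "status"))      // "status" is reserved
                .withValueMap(new ValueMap().withString(":newStatus", "accepted")));
    }
}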
I'm writing an example where you can update multiple items in a single table. I have the primary (hash) key as id and the range key as DATETIME.
There is no built-in bulk-update feature in DynamoDB, so what I'm doing here is first querying, by hash key and range key condition, all the items I want to update. Once all the data is stored in a List, I load each item by its hash key and range key, change the field with a setter, and save it.
Since I'm editing the hash key, the item with the original hash key will still be there and needs to be deleted; if you only update a non-key attribute, this is not necessary. I haven't added the deletion code, so write it yourself. If in doubt, query the table: you will see that the entry with the old hash key is still present and a new entry with the new hash key has been added.
The code is below:
public static void main(String[] args) {
AmazonDynamoDBClient client = new AmazonDynamoDBClient();
DynamoDBMapper mapper = new DynamoDBMapper(client);
client.setEndpoint("http://localhost:8000/");
String fromDate = "2016-01-13";
String toDate = "2016-02-05";
User user = new User();
user.setId("YourHashKey");
LocalDate frmdate = LocalDate.parse(fromDate, DateTimeFormatter.ISO_LOCAL_DATE);
LocalDate todate = LocalDate.parse(toDate, DateTimeFormatter.ISO_LOCAL_DATE);
LocalDateTime startfrm = frmdate.atStartOfDay();
LocalDateTime endto = todate.atTime(23, 59, 59);
Condition rangeCondition = new Condition().withComparisonOperator(ComparisonOperator.BETWEEN.toString()).withAttributeValueList(new AttributeValue().withS(startfrm.toString()), new AttributeValue().withS(endto.toString()));
DynamoDBQueryExpression<User> queryExpression = new DynamoDBQueryExpression<User>().withHashKeyValues(user).withRangeKeyCondition("DATETIME", rangeCondition);
List<User> latestReplies = mapper.query(User.class, queryExpression);
for (User in : latestReplies) {
System.out.println(" Hashid: " + in.getId() + " DateTime: " + in.getDATETIME() + "location:" + in.getLOCID());
User ma = mapper.load(User.class, in.getId(), in.getDATETIME());
ma.setLOCID("Ohelig");
mapper.save(ma);
}
}
I'm using Scalding with Spyglass to read from/write to HBase.
I'm doing a left outer join of table1 and table2 and writing back to table1 after transforming a column.
Both table1 and table2 are declared as Spyglass HBaseSource.
This works fine. But I need to access a different row in table1 using its rowkey to compute the transformed value.
I tried the following for HBase get:
val hTable = new HTable(conf, TABLE_NAME)
val result = hTable.get(new Get(rowKey.getBytes()))
I'm getting access to the Configuration in the Scalding job as mentioned in this link:
https://github.com/twitter/scalding/wiki/Frequently-asked-questions#how-do-i-access-the-jobconf
This works when I run the Scalding job locally.
But when I run it on the cluster, conf is null when this code is executed in the reducer.
Is there a better way to do HBase get/scan in a Scalding/Cascading job for cases like this?
There are a couple of ways to do this:
1) You can use a managed resource
class SomeJob(args: Args) extends Job(args) {
val someConfig = HBaseConfiguration.create().addResource(new Path(pathtoyourxmlfile))
lazy val hPool = new HTablePool(someConfig, 3)
def getConf = {
implicitly[Mode] match {
case Hdfs(_, conf) => conf
case _ => ??? // whatever you are doing for a local conf
}
}
... somePipe.someOperation.... {
val gets = key.map { key => new Get(key) }
managed(hPool.getTable("myTableName")) acquireAndGet { table =>
val results = table.get(gets)
...do something with these results
}
}
}
2) You can use some more specific Cascading code, where you write a custom Scheme and override the source method (and possibly some others, depending on your needs). In there you can access the JobConf like this:
class MyScheme extends Scheme[JobConf, SomeRecordReader, SomeOutputCollector, ..] {
@transient var jobConf: Configuration = super.jobConfiguration
override def source(flowProcess: FlowProcess[JobConf], ...): Boolean = {
jobConf = flowProcess match {
case h: HadoopFlowProcess => h.getJobConf
case _ => jobConf
}
... dosomething with the jobConf here
}
}