We're trying to make a simple data migration in one of our tables in DDB.
Basically we're adding a new field and we need to backfill all the Documents in one of our tables.
This table has around 700K documents.
The process we follow is quite simple:
Manually trigger a Lambda that scans the table and, for each document it reads, updates it, continuing until it gets close to the 15-minute execution limit. At that point it puts the LastEvaluatedKey into SQS to trigger a new Lambda execution, which uses that key to continue the scan.
The process keeps spawning Lambdas sequentially as needed until there are no more documents.
The problem we found is as follows...
Once the migration is done, we noticed that the number of documents updated is way lower than the total number of documents in the table. The difference is not always the same; it ranges from tens of thousands to hundreds of thousands (the worst case we've seen was a difference of 300K).
This is obviously a problem, because if we scan the table again it's clear that some documents were not migrated. At first we thought this was caused by clients updating/inserting documents while the migration ran, but the throughput on that table is nowhere near large enough to justify such a big difference, so new documents being added during the migration is not the cause.
We tried a second approach: scan first, because when we only scan, the number of scanned documents equals the count of documents in the table. So we dumped the IDs of the documents into another table, then scanned that table and updated the corresponding items. Funnily enough, the same problem happens with this new IDs-only table: it ends up with far fewer items than the count of the table we want to update, so we're back to square one.
We thought about using parallel scans, but I don't see how that would help, and I don't want to compromise the table's read capacity while the migration is running.
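(For reference, a segmented, i.e. parallel, scan with the low-level Java SDK v1 client would look roughly like the sketch below; the table name, segment count and per-item update are placeholders rather than our actual code.)
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
int totalSegments = 4;                                     // hypothetical number of parallel workers
for (int segment = 0; segment < totalSegments; segment++) {
    Map<String, AttributeValue> startKey = null;
    do {
        ScanRequest scanRequest = new ScanRequest()
                .withTableName("KnowledgeDocument")        // placeholder table name
                .withSegment(segment)
                .withTotalSegments(totalSegments)
                .withExclusiveStartKey(startKey);
        ScanResult page = client.scan(scanRequest);
        // ... update each item in page.getItems() here ...
        startKey = page.getLastEvaluatedKey();
    } while (startKey != null);
}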
Has anybody with experience in DDB data migrations run into this? We haven't been able to figure out what we're doing wrong.
UPDATE: sharing the function that is triggered and that actually does the scanning and updating:
@Override
public Map<String, AttributeValue> migrateDocuments(String lastEvaluatedKey, String typeKey) {
    LOG.info("Migrate Documents started {} ", lastEvaluatedKey);
    int noOfDocumentsMigrated = 0;
    Map<String, AttributeValue> docLastEvaluatedKey = null;
    DynamoDBMapperConfig documentConfig = new DynamoDBMapperConfig.TableNameOverride("KnowledgeDocumentMigration").config();
    if (lastEvaluatedKey != null) {
        docLastEvaluatedKey = new HashMap<String, AttributeValue>();
        docLastEvaluatedKey.put("base_id", new AttributeValue().withS(lastEvaluatedKey));
        docLastEvaluatedKey.put("type_key", new AttributeValue().withS(typeKey));
    }
    Instant endTime = Instant.now().plusSeconds(840);
    LOG.info("Migrate Documents endTime:{}", endTime);
    try {
        do {
            ScanResultPage<Document> docScanList = documentDao.scanDocuments(docLastEvaluatedKey, documentConfig);
            docLastEvaluatedKey = docScanList.getLastEvaluatedKey();
            LOG.info("Migrate Docs- docScanList Size: {}", docScanList.getScannedCount());
            LOG.info("lastEvaluatedKey:{}", docLastEvaluatedKey);
            final int chunkSize = 25;
            final AtomicInteger counter = new AtomicInteger();
            final Collection<List<Document>> docChunkList = docScanList.getResults().stream()
                    .collect(Collectors.groupingBy(it -> counter.getAndIncrement() / chunkSize)).values();
            List<List<Document>> docListSplit = docChunkList.stream().collect(Collectors.toList());
            docListSplit.forEach(docList -> {
                TransactionWriteRequest documentTx = new TransactionWriteRequest();
                for (Document document : docList) {
                    LOG.info("Migrate Documents- docList Size: {}", docList.size());
                    LOG.info("Migrate Documents- Doc Id: {}", document.getId());
                    if (!StringUtils.isNullOrEmpty(document.getType()) && document.getType().equalsIgnoreCase("Faq")) {
                        if (docIdsList.contains(document.getId())) {
                            LOG.info("this doc already migrated:{}", document);
                        } else {
                            docIdsList.add(document.getId());
                        }
                        if (!StringUtils.isNullOrEmpty(document.getFaq().getQuestion())) {
                            LOG.info("doc FAQ {}", document.getFaq().getQuestion());
                            document.setTitle(document.getFaq().getQuestion());
                            document.setTitleSearch(document.getFaq().getQuestion().toLowerCase());
                            documentTx.addUpdate(document);
                        }
                    } else if (StringUtils.isNullOrEmpty(document.getType())) {
                        if (!StringUtils.isNullOrEmpty(document.getTitle())) {
                            if (!StringUtils.isNullOrEmpty(document.getQuestion())) {
                                document.setTitle(document.getQuestion());
                                document.setQuestion(null);
                            }
                            LOG.info("title {}", document.getTitle());
                            document.setTitleSearch(document.getTitle().toLowerCase());
                            documentTx.addUpdate(document);
                        }
                    }
                }
                if (documentTx.getTransactionWriteOperations() != null
                        && !documentTx.getTransactionWriteOperations().isEmpty() && docList.size() > 0) {
                    LOG.info("DocumentTx size {}", documentTx.getTransactionWriteOperations().size());
                    documentDao.executeTransaction(documentTx, null);
                }
            });
            noOfDocumentsMigrated = noOfDocumentsMigrated + docScanList.getScannedCount();
        } while (docLastEvaluatedKey != null && (endTime.compareTo(Instant.now()) > 0));
        LOG.info("Migrate Documents execution finished at:{}", Instant.now());
        if (docLastEvaluatedKey != null && docLastEvaluatedKey.get("base_id") != null) {
            sqsAdapter.get().sendMessage(docLastEvaluatedKey.get("base_id").toString(), docLastEvaluatedKey.get("type_key").toString(),
                    MIGRATE, MIGRATE_DOCUMENT_QUEUE_NAME);
        }
        LOG.info("No Of Documents Migrated:{}", noOfDocumentsMigrated);
    } catch (Exception e) {
        LOG.error("Exception", e);
    }
    return docLastEvaluatedKey;
}
Note: I would've added this speculation as a comment, but my reputation does not allow it.
I think the issue you're seeing here could be caused by Scans not being ordered. As long as your Scan is executed within a single Lambda, I'd expect everything to be handled fine. However, as soon as you hit the Lambda's runtime limit and start a new one, your Scan essentially gets a new "scan ID", which might return items in a different order. Because of that different order, you're now skipping a certain set of entries.
I haven't tried to replicate this behavior, and sadly there is no clear indication in the AWS documentation of whether a Scan request can be continued in a new session/application.
I think @Charles' suggestion might help you in this case, as you could simply run the entire migration in one process.
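To illustrate that suggestion: a single long-running process (a container or an EC2 instance instead of chained Lambdas) can just keep paging with ExclusiveStartKey until LastEvaluatedKey comes back null. A minimal sketch with the low-level Java SDK v1 client; the table name and the per-item update are placeholders:
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
Map<String, AttributeValue> startKey = null;
int processed = 0;
do {
    ScanResult page = client.scan(new ScanRequest()
            .withTableName("KnowledgeDocument")      // placeholder table name
            .withExclusiveStartKey(startKey));
    for (Map<String, AttributeValue> item : page.getItems()) {
        // apply the backfill/update to each item here
        processed++;
    }
    startKey = page.getLastEvaluatedKey();           // null once the whole table has been read
} while (startKey != null);
System.out.println("Documents processed: " + processed);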
I have a small Dataflow job triggered from a Cloud Function using a Dataflow template. The job basically reads from a table in BigQuery, converts the resulting TableRow to a key-value pair, and writes that key-value pair to Datastore.
This is what my code looks like:
PCollection<TableRow> bigqueryResult = p.apply("BigQueryRead",
        BigQueryIO.readTableRows().withTemplateCompatibility()
                .fromQuery(options.getQuery()).usingStandardSql()
                .withoutValidation());

bigqueryResult.apply("WriteFromBigqueryToDatastore", ParDo.of(new DoFn<TableRow, String>() {
    @ProcessElement
    public void processElement(ProcessContext pc) {
        TableRow row = pc.element();
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
        KeyFactory keyFactoryCounts = datastore.newKeyFactory().setNamespace("MyNamespace")
                .setKind("MyKind");
        Key key = keyFactoryCounts.newKey("Key");
        Builder builder = Entity.newBuilder(key);
        builder.set("Key", StringValue.newBuilder("Value").setExcludeFromIndexes(true).build());
        Entity entity = builder.build();
        datastore.put(entity);
    }
}));
This pipeline runs fine when the number of records I try to process is anywhere in the range of 1 to 100. However, when I put more load on the pipeline, i.e. ~10,000 records, it does not scale, even though autoscaling is set to THROUGHPUT_BASED and maximumWorkers is specified as high as 50 with an n1-standard-1 machine type. The job keeps processing 3 or 4 elements per second with one or two workers. This is impacting the performance of my system.
Any advice on how to scale up the performance is very welcome.
Thanks in advance.
Found a solution by using DatastoreIO instead of the Datastore client.
Following is the snippet I used:
PCollection<TableRow> row = p.apply("BigQueryRead",
        BigQueryIO.readTableRows().withTemplateCompatibility()
                .fromQuery(options.getQueryForSegmentedUsers()).usingStandardSql()
                .withoutValidation());

PCollection<com.google.datastore.v1.Entity> userEntity = row.apply("ConvertTablerowToEntity", ParDo.of(new DoFn<TableRow, com.google.datastore.v1.Entity>() {
    @SuppressWarnings("deprecation")
    @ProcessElement
    public void processElement(ProcessContext pc) {
        final String namespace = "MyNamespace";
        final String kind = "MyKind";
        com.google.datastore.v1.Key.Builder keyBuilder = DatastoreHelper.makeKey(kind, "root");
        if (namespace != null) {
            keyBuilder.getPartitionIdBuilder().setNamespaceId(namespace);
        }
        final com.google.datastore.v1.Key ancestorKey = keyBuilder.build();

        TableRow row = pc.element();
        String entityProperty = "sample";
        String key = "key";

        com.google.datastore.v1.Entity.Builder entityBuilder = com.google.datastore.v1.Entity.newBuilder();
        com.google.datastore.v1.Key.Builder keyBuilder1 = DatastoreHelper.makeKey(ancestorKey, kind, key);
        if (namespace != null) {
            keyBuilder1.getPartitionIdBuilder().setNamespaceId(namespace);
        }
        entityBuilder.setKey(keyBuilder1.build());
        entityBuilder.getMutableProperties().put(entityProperty, DatastoreHelper.makeValue("sampleValue").build());
        pc.output(entityBuilder.build());
    }
}));

userEntity.apply("WriteToDatastore", DatastoreIO.v1().write().withProjectId(options.getProject()));
This solution was able to scale from 3 elements per second with 1 worker to ~1500 elements per second with 20 workers.
At least with Python's ndb client library it's possible to write up to 500 entities at a time in a single .put_multi() Datastore call, which is a whole lot faster than calling .put() for one entity at a time (the calls block on the underlying RPCs).
I'm not a Java user, but a similar technique appears to be available for it as well. From Using batch operations:
You can use the batch operations if you want to operate on multiple entities in a single Cloud Datastore call.
Here is an example of a batch call:
Entity employee1 = new Entity("Employee");
Entity employee2 = new Entity("Employee");
Entity employee3 = new Entity("Employee");
// ...
List<Entity> employees = Arrays.asList(employee1, employee2, employee3);
datastore.put(employees);
I'm totally new to coding in general, so this is really my first attempt; don't shoot me if I ask stupid questions ;) Right now I'm having trouble even understanding the vast world of backend development.
I'm having some problems in my service, and I can't even decide which way is best to go: scanning, querying... what?
So I think the way to go for me is scanning. I'm having trouble retrieving an item from the database based on the ID of that item. Retrieving all items works like a charm, and I need something similar for getting one item. I get confused when searching the web, and I don't really understand the difference between, for example, a scan filter and a scan expression. That's why I haven't even come up with a good attempt, but what I need is to scan the table and retrieve the item with the matching ID. I tried looking at my method for retrieving all search cases and adapting it to retrieve one, as it should look quite similar, but no success...
Method I need help with (edited a bit):
public SearchCase getSearchCase(String id) {
    // this is obviously for a list, but how do I do it for ONE item?
    HashMap<String, AttributeValue> sc = new HashMap<String, AttributeValue>();
    sc.put("scId", new AttributeValue().withS(id));
    ScanRequest scanRequest = new ScanRequest()
            .withTableName(searchCaseTableName)
            .withFilterExpression("id = scId");
    ScanResult scanResult = client.scan(scanRequest);
    ?????
    return searchCase;
}
As a reference here is the method for retrieving all items, that does work:
public List<SearchCase> getSearchCases() {
    final List<SearchCase> cases = new ArrayList<SearchCase>();
    ScanRequest scanRequest = new ScanRequest()
            .withTableName(searchCaseTableName);
    ScanResult result = client.scan(scanRequest);
    try {
        for (Map<String, AttributeValue> item : result.getItems()) {
            SearchCase searchCase = mapper.readValue(item.get("payload").getS(), SearchCase.class);
            cases.add(searchCase);
        }
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    return cases;
}
It has been forever, but I thought I'd post the correct answer that I fought with for a long time back in June. This is the solution that worked for me for retrieving a single item:
public SearchCase getSearchCase(String id) throws Exception {
    Table t = db.getTable(searchCaseTableName);
    GetItemSpec gis = new GetItemSpec()
            .withPrimaryKey("id", id);
    Item item = t.getItem(gis);
    SearchCase searchCase = mapper.readValue(StringEscapeUtils.unescapeJson(item.getJSON("payload").substring(1)), SearchCase.class);
    return searchCase;
}
This method actually took quite a different approach than the one I originally expected: no ScanRequest, but GetItemSpec and Item instead. It did produce some funky backslashes in the JSON that my frontend wouldn't accept until I ran the result through StringEscapeUtils.unescapeJson; otherwise it worked like a charm.
I'm having trouble retrieving an item from the database, based on the ID of that item
If you want to retrieve an item from DynamoDB based on some unique ID, then use DynamoDBMapper's load, which "Loads an object with the hash key given".
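A minimal sketch of what that could look like, assuming a mapper-annotated SearchCase class (which is an assumption here, since the table above stores the object as a JSON payload attribute):
@DynamoDBTable(tableName = "SearchCase")            // placeholder table name
public class SearchCase {
    private String id;

    @DynamoDBHashKey(attributeName = "id")
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    // other mapped attributes would go here
}

// elsewhere:
DynamoDBMapper dynamoDBMapper = new DynamoDBMapper(AmazonDynamoDBClientBuilder.defaultClient());
SearchCase searchCase = dynamoDBMapper.load(SearchCase.class, id);   // id is the hash key value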
How do I increment a number in AWS DynamoDB?
The guide says that when saving an item you can simply re-save it:
http://docs.aws.amazon.com/mobile/sdkforios/developerguide/dynamodb_om.html
However I am trying to use a counter where many users may be updating at the same time.
Other documentation has told me to use an UpdateItem operation, but I cannot find a good example of how to do so.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Expressions.Modifying.html
However, I cannot find a method to implement the expression. In the future I will be adding values to arrays and maps; will this be the same? My code is in Objective-C.
Currently my code looks like:
AWSDynamoDBUpdateItemInput *updateItemInput = [AWSDynamoDBUpdateItemInput new];
updateItemInput.tableName = @"TableName";
updateItemInput.key = @{
    UniqueItemKey: @"KeyValue"
};
updateItemInput.updateExpression = @"SET counter = counter + :val";
updateItemInput.expressionAttributeValues = @{
    @":val": @1
};
It looks like you're missing the last bit of code that actually makes the update item request:
AWSDynamoDB *dynamoDB = [AWSDynamoDB defaultDynamoDB];
[[dynamoDB updateItem:updateItemInput]
 continueWithBlock:^id(AWSTask *task) {
     if (task.error) {
         NSLog(@"The request failed. Error: [%@]", task.error);
     }
     if (task.exception) {
         NSLog(@"The request failed. Exception: [%@]", task.exception);
     }
     if (task.result) {
         // Do something with result.
     }
     return nil;
 }];
In DynamoDB, if you want to increment the value of any property/field, you can use UpdateItemRequest with the ADD action. I used this in Android and it updates the existing value of the field. Let me share the code snippet; you can use other actions as well, such as ADD, DELETE, PUT, etc.
.....
AttributeValue viewcount = new AttributeValue().withN("100"); // ADD expects a number (N), not a string
AttributeValueUpdate attributeValueUpdate = new AttributeValueUpdate().withAction(AttributeAction.ADD).withValue(viewcount);
updateItems.put(UploadVideoData.FIELD_VIEW_COUNT, attributeValueUpdate);

UpdateItemRequest updateItemRequest = new UpdateItemRequest().withTableName(UploadVideoData.TABLE_NAME)
        .withKey(primaryKey).withAttributeUpdates(updateItems);
UpdateItemResult updateItemResult = amazonDynamoDBClient.updateItem(updateItemRequest);
....
The code above will add 100 to the existing value of that field.
This code is for Android, but the technique remains the same.
Thank you.
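For reference, the same atomic increment can also be written with an update expression, which is the approach the linked expressions guide describes. A sketch with the Java SDK v1; the table, key and attribute names are placeholders:
Map<String, AttributeValue> key = new HashMap<String, AttributeValue>();
key.put("id", new AttributeValue().withS("KeyValue"));             // placeholder key

Map<String, AttributeValue> values = new HashMap<String, AttributeValue>();
values.put(":inc", new AttributeValue().withN("1"));

UpdateItemRequest request = new UpdateItemRequest()
        .withTableName("TableName")                                // placeholder table name
        .withKey(key)
        .withUpdateExpression("SET viewCount = viewCount + :inc")  // or "ADD viewCount :inc" to create it if missing
        .withExpressionAttributeValues(values);
amazonDynamoDBClient.updateItem(request);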
My problem is how to ensure that no data is lost under concurrent access.
I have a script published as a web app. I want to add a new row to DATA_SHEET. The function that handles the submit button looks like this:
function onButtonSubmit(e) {
  var app = UiApp.getActiveApplication();
  var lock = LockService.getPublicLock();
  while (!lock.tryLock(1000))
    ;
  var ssheet = SpreadsheetApp.openById(SHEET_ID);
  var sheet = ssheet.getSheetByName(DATA_SHEET);
  var lastRow = sheet.getLastRow();
  var lastCol = sheet.getLastColumn();
  var rangeToInsert = sheet.getRange(lastRow + 1, 1, 1, lastCol);
  var statText = rangeToInsert.getA1Notation();
  rangeToInsert.setValues(<some data from webapp form>);
  app.getElementById('statusLabel').setText(statText);
  lock.releaseLock();
  return app;
}
But it seems that this does not work. When I open two forms and click the submit button in both within one second, they show the same range in statusLabel and write their data into the same range, so I lose the data from one of the forms.
What is wrong with this code? It seems like tryLock() does not block the script.
Is there any other way to prevent concurrent write access to the sheet?
It might be worth taking a look at appendRow(), rather than using getLastRow()/setValues() etc.
Allows for atomic appending of a row to a spreadsheet; can be used safely even when multiple instances of the script are running at the same time. Previously, one would have to call getLastRow(), then write to that row. But if two invocations of the script were running at the same time, they might both read the same value for getLastRow(), and then overwrite each other's values.
while (!lock.tryLock(1000))
  ;
seems a bit hinky. Try this instead:
if (lock.tryLock(30000)) {
  // I got the lock! Wo000t!!!11 Do whatever I was going to do!
} else {
  // I couldn't get the lock, now for plan B :(
  GmailApp.sendEmail("admin@example.com", "epic fail",
      "lock acquisition fail!");
}
http://googleappsdeveloper.blogspot.com/2011/10/concurrency-and-google-apps-script.html
When using getLastRow()/setValues() with a lock, you must also call SpreadsheetApp.flush() before releasing the lock:
SpreadsheetApp.flush();  // commit pending spreadsheet changes first
lock.releaseLock();
I am creating items on the fly via the Sitecore Web Service. So far I can create the items with this function:
AddFromTemplate
And I also tried this link: http://blog.hansmelis.be/2012/05/29/sitecore-web-service-pitfalls/
But I am finding it hard to access the fields. So far, here is my code:
public void CreateItemInSitecore(string getDayGuid, Oracle.DataAccess.Client.OracleDataReader reader)
{
    if (getDayGuid != null)
    {
        var sitecoreService = new EverBankCMS.VisualSitecoreService();
        var addItem = sitecoreService.AddFromTemplate(getDayGuid, templateIdRTT, "Testing", database, myCred);
        var getChildren = sitecoreService.GetChildren(getDayGuid, database, myCred);

        for (int i = 0; i < getChildren.ChildNodes.Count; i++)
        {
            if (getChildren.ChildNodes[i].InnerText.ToString() == "Testing")
            {
                var getItem = sitecoreService.GetItemFields(getChildren.ChildNodes[i].Attributes[0].Value, "en", "1", true, database, myCred);
                string p = getChildren.ChildNodes[i].Attributes[0].Value;
            }
        }
    }
}
So, as you can see, I am creating an item and I want to access the fields of that item.
I thought GetItemFields would give me some values, but I'm finding it hard to get them. Any clue?
My advice would be not to use the VSS (Visual Sitecore Service), but to write your own service specifically for the thing you want it to do.
This approach is usually more efficient, because you can do exactly what you want directly inside the service, instead of making a lot of calls to the VSS and handling your logic on the client side.
For me, this has always been a better solution than using the VSS.
I am assuming you are looking to find out what the fields look like and what the field IDs are.
You can call GetXml with the ID; it returns the item with all its versions and the fields that are set on it. It won't show fields you haven't set.