Map-reduce index returns old data

I have the following map:
from doc in docs
select new { Name = doc.Name, Count = 1 }

and the reduce:

from result in results
group result by new { result.Name } into g
select new {
    Name = g.Key.Name,
    Count = Enumerable.Sum(g, x => ((int) x.Count))
}
If I put a lock on the index folder, then save a document, delete it, and re-save it to trigger a reindex, the old document still appears in the index query results even though the index is reported as up to date. The last-indexed date is also older than the date the document was updated, so the index should not contain any stale results.
Any ideas what's going on? This is part of a larger problem I've discovered on a production system. I'm not clear why it's happening, but I've been able to reproduce a similar situation by locking the index, so I suspect there's some process causing the lock. The result is that index queries return projections that are stale.
How can I get the reduce to filter out stale results?

If you disable the index and then update or delete documents, you'll get outdated results from the map-reduce index. This can happen even when the index isn't disabled.
The reason is that indexes are eventually consistent. You can read about it here:
https://ravendb.net/docs/article-page/3.5/Csharp/users-issues/understanding-eventual-consistency
You can use WaitForNonStaleResultsAsOfLastWrite:
https://ravendb.net/docs/article-page/2.5/Csharp/client-api/querying/stale-indexes#setting-cut-off-point

What you're describing is stale indexes: you update/create/delete a document and immediately query for it, but the query returns stale results.
The recommended way to fix this is by calling .WaitForIndexesAfterSaveChanges() as part of your create/update/delete calls:
// Inform Raven you'll wait for indexes when calling .SaveChanges
session.Advanced.WaitForIndexesAfterSaveChanges(
    timeout: TimeSpan.FromSeconds(30),
    throwOnTimeout: false);

// Do your update.
session.Store(new Employee
{
    FirstName = "John",
    LastName = "Doe"
});

// This won't return until affected indexes are updated.
session.SaveChanges();

// Now you can run a query against your index, and it will return the updated data.
...
This way, .SaveChanges will block until the indexes are updated. Run your query immediately after .SaveChanges and you'll see the updated results as expected.


Consistently modifying the same item in a DynamoDB table from multiple machines and multiple threads

I have an item (a number) in a DynamoDB table. This value is read by a service, incremented, and written back to the table. Multiple threads on multiple machines do this simultaneously.
My problem is being able to read the correct, consistent value and update it with the correct value.
I tried doing the increment and update in a Java synchronized block, but I still noticed inconsistencies in the final count. It doesn't seem to be updating in a consistent manner.
"My problem here is to be able to read the correct consistent value, and update with the correct value."
To read/write the correct consistent value
Read Consistency in dynamodb (you can set it in your query as ConsistentRead parameter):
There are two types of read.
Eventually Consistent Read: if you read data after changes in table, that might be stale and should wait a bit to be consistent.
Strongly Consistent Data: it returns most up-to-date data, so, should not be worried about the stale data
ConditionExpression (specify in your query):
in your query you can specify that update the value if some conditions are true (for example current value in db is the same as the value you read before. meaning no one updated it in between) otherwise it returns ConditionalCheckFailedException and you need to handle it in your code to redo, ...
So, to answer your question, first you need to ready strongly consistent to get the current counter value in db. Then, to update it, your query should look like this (removed unnecessary parameters) and you should handle ConditionalCheckFailedException in your code:
"TableName": "counters",
"ReturnValues": "UPDATED_NEW",
"ExpressionAttributeValues": {
":a": currentValue,
":bb": newValue
},
"ExpressionAttributeNames": {
"#currentValue": "currentValue"
},
**// current value is what you ve read
// by Strongly Consistent **
"ConditionExpression": "(#currentValue = :a)",
"UpdateExpression": "SET #currentValue = :bb", // new counter value
Alternatively, store a uuid (a long random string) with every record. Whenever you try to update the record, send an update request that only succeeds if the stored uuid still equals the value you read, and have the update also write a new uuid.
A synchronized block will not work if you are writing from multiple machines at once.

How to update multiple items in a DynamoDB table at once

I'm using DynamoDB and I need to update a specific attribute on multiple records. Writing my requirement in pseudo-language I would like to do an update that says "update table persons set relationshipStatus = 'married' where personKey IN (key1, key2, key3, ...)" (assuming that personKey is the KEY in my DynamoDB table).
In other words, I want to do an update with an IN-clause, or I suppose one could call it a batch update. I have found this link that asks explicitly if an operation like a batch update exists and the answer there is that it does not. It does not mention IN-clauses, however. The documentation shows that IN-clauses are supported in ConditionalExpressions (100 values can be supplied at a time). However, I am not sure if such an IN-clause is suitable for my situation because I still need to supply a mandatory KEY attribute (which expects a single value it seems - I might be wrong) and I am worried that it will do a full table scan for each update.
So my question is: how do I achieve an update on multiple DynamoDB records at the same time? At the moment it almost looks like I will have to call an update statement for each Key one-by-one and that just feels really wrong...
As you noted, DynamoDB does not support a batch update operation. You would need to query for, and obtain the keys for all the records you want to update. Then loop through that list, updating each item one at a time.
You can use the TransactWriteItems action to update multiple records in a DynamoDB table.
The official documentation is available here, and you can also see a TransactWriteItems JavaScript/Node.js example here.
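Not from the linked docs, but as a rough sketch in the AWS SDK for Java (v1), a TransactWriteItems call that applies the question's pseudo-update to several keys at once could look like the following; the table "persons", key "personKey", and attribute "relationshipStatus" are taken from the question:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.TransactWriteItem;
import com.amazonaws.services.dynamodbv2.model.TransactWriteItemsRequest;
import com.amazonaws.services.dynamodbv2.model.Update;

public class MarkMarried {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        List<TransactWriteItem> writes = new ArrayList<>();
        for (String personKey : new String[] {"key1", "key2", "key3"}) {
            Map<String, AttributeValue> key = new HashMap<>();
            key.put("personKey", new AttributeValue(personKey));
            Map<String, AttributeValue> values = new HashMap<>();
            values.put(":s", new AttributeValue("married"));
            writes.add(new TransactWriteItem().withUpdate(new Update()
                    .withTableName("persons")
                    .withKey(key)
                    .withUpdateExpression("SET relationshipStatus = :s")
                    .withExpressionAttributeValues(values)));
        }

        // All the updates succeed or fail together; mind the per-transaction item limit.
        client.transactWriteItems(new TransactWriteItemsRequest().withTransactItems(writes));
    }
}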
I don't know if it has changed since that answer was given, but it's possible now.
See the docs:
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
I have used it like this in JavaScript (mapping the new blocks to an array of objects with the wanted structure):
let params = { RequestItems: {} };  // RequestItems must be initialized before keying into it
let tableName = 'Blocks';
// Note: BatchWriteItem does not support condition expressions on individual
// put/delete requests, so each PutRequest carries only the Item.
params.RequestItems[tableName] = _.map(newBlocks, block => {
    return {
        PutRequest: {
            Item: {
                'org_id': orgId,
                'block_id': block.block_id,
                'block_text': block.block_text
            }
        }
    };
});
docClient.batchWrite(params, function(err, data) {
    // ... do stuff with the result
});
You can even mix puts and deletes.
And if you're using dynogels, you can't mix them (a dynogels limitation), but what you can do for updating is use create, because behind the scenes it maps to the batchWrite function as puts:
var item1 = {email: 'foo1@example.com', name: 'Foo 1', age: 10};
var item2 = {email: 'foo2@example.com', name: 'Foo 2', age: 20};
var item3 = {email: 'foo3@example.com', name: 'Foo 3', age: 30};

Account.create([item1, item2, item3], function (err, accounts) {
    console.log('created 3 accounts in DynamoDB', accounts);
});
Note this from DynamoDB limitations (from the docs):
The BatchWriteItem operation puts or deletes multiple items in one or more tables. A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB.
If I remember correctly, I think dynogels chunks the requests into chunks of 25 before sending them off, then collects them in one promise and returns (though I'm not 100% certain of this); otherwise a wrapper function would be pretty simple to assemble.
DynamoDB is not designed as a relational DB with native transaction support. It is better to design the schema to avoid multi-item updates in the first place; or, if that is not practical in your case, keep in mind that you may improve things when restructuring the design.
The only way to update multiple items at the same time is the TransactWriteItems operation provided by DynamoDB. But it comes with a limitation (at most 25 items per transaction, for example), so you should probably enforce a similar limit in your application as well. Despite being quite costly (the implementation involves a consensus algorithm), it is still much faster than a simple loop, and it gives you the ACID properties, which are probably what you need most. Think about a loop-based approach: if one of the updates fails, how do you deal with the failure? Is it possible to roll back all the changes without causing a race condition? Are the updates idempotent? It really depends on the nature of your application, of course. Be careful.
Another option is to use a thread pool for the network I/O, which can definitely save a lot of time, but it has the same failure-and-rollback issue to think about.
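For illustration only, a minimal sketch of the thread-pool option in Java, reusing the "persons" table from the question (all other names here are illustrative, and note that a failed update is not rolled back; it must be handled per item):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;

public class ParallelUpdates {
    public static void main(String[] args) throws InterruptedException {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        ExecutorService pool = Executors.newFixedThreadPool(8);

        for (String personKey : Arrays.asList("key1", "key2", "key3")) {
            pool.submit(() -> {
                Map<String, AttributeValue> key = new HashMap<>();
                key.put("personKey", new AttributeValue(personKey));
                Map<String, AttributeValue> values = new HashMap<>();
                values.put(":s", new AttributeValue("married"));
                // Each update is independent: if this one fails, the others still go through.
                client.updateItem(new UpdateItemRequest()
                        .withTableName("persons")
                        .withKey(key)
                        .withUpdateExpression("SET relationshipStatus = :s")
                        .withExpressionAttributeValues(values));
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}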

Pagination with DynamoDBMapper Java AWS SDK

From the API docs, DynamoDB does support pagination for scan and query operations. The catch here is to set the ExclusiveStartKey of the current request to the value of the LastEvaluatedKey of the previous request to get the next set (logical page) of results.
I'm trying to implement the same thing, but I'm using DynamoDBMapper, which seems to have a lot more advantages, like tight coupling with data models. So if I wanted to do the above, I'm assuming I would do something like this:

// Mapping of the hash key of the last item in the previous query operation
Map<String, AttributeValue> lastHashKey = ..
DynamoDBQueryExpression expression = new DynamoDBQueryExpression();
...
expression.setExclusiveStartKey(lastHashKey);
List<Table> nextPageResults = mapper.query(Table.class, expression);
I hope my above understanding is correct on paginating using DynamoDBMapper.
Secondly, how would I know that I've reached the end of the results? From the docs, if I use the following API:

QueryResult result = dynamoDBClient.query((QueryRequest) request);
boolean isEndOfResults = (result.getLastEvaluatedKey() == null);

Coming back to using DynamoDBMapper, how can I know that I've reached the end of the results in this case?
You have a couple different options with the DynamoDBMapper, depending on which way you want go.
query - returns a PaginatedQueryList
queryPage - returns a QueryResultPage
scan - returns a PaginatedScanList
scanPage - returns a ScanResultPage
The key here is understanding the difference between the methods, and what functionality their returned objects encapsulate.
I'll go over PaginatedScanList and ScanResultPage, but these methods/objects basically mirror each other.
The PaginatedScanList says the following, emphasis mine:
Implementation of the List interface that represents the results from a scan in AWS DynamoDB. Paginated results are loaded on demand when the user executes an operation that requires them. Some operations, such as size(), must fetch the entire list, but results are lazily fetched page by page when possible.
This says that results are loaded as you iterate through the list. When you get through the first page, the second page is automatically fetched without you having to explicitly make another request. Lazily loading the results is the default behavior, but it can be overridden by calling the overloaded methods and supplying a DynamoDBMapperConfig with a different DynamoDBMapperConfig.PaginationLoadingStrategy.
This is different from the ScanResultPage. You are given a page of results, and it is up to you to deal with the pagination yourself.
Here is a quick code sample showing example usage of both methods, which I ran against a table of 5 items using DynamoDBLocal:
final DynamoDBMapper mapper = new DynamoDBMapper(client);

// Using 'PaginatedScanList'
final DynamoDBScanExpression paginatedScanListExpression = new DynamoDBScanExpression()
        .withLimit(limit);
final PaginatedScanList<MyClass> paginatedList = mapper.scan(MyClass.class, paginatedScanListExpression);
paginatedList.forEach(System.out::println);

System.out.println();

// Using 'ScanResultPage'
final DynamoDBScanExpression scanPageExpression = new DynamoDBScanExpression()
        .withLimit(limit);
do {
    ScanResultPage<MyClass> scanPage = mapper.scanPage(MyClass.class, scanPageExpression);
    scanPage.getResults().forEach(System.out::println);
    System.out.println("LastEvaluatedKey=" + scanPage.getLastEvaluatedKey());
    scanPageExpression.setExclusiveStartKey(scanPage.getLastEvaluatedKey());
} while (scanPageExpression.getExclusiveStartKey() != null);
And the output:
MyClass{hash=2}
MyClass{hash=1}
MyClass{hash=3}
MyClass{hash=0}
MyClass{hash=4}
MyClass{hash=2}
MyClass{hash=1}
LastEvaluatedKey={hash={N: 1,}}
MyClass{hash=3}
MyClass{hash=0}
LastEvaluatedKey={hash={N: 0,}}
MyClass{hash=4}
LastEvaluatedKey=null

Entity Framework DB First: Timestamp column not working

Using the DB-first approach, I want my application to throw a concurrency exception whenever I try to update an (out-of-date) entity whose corresponding row in the database has already been updated by another application/user/session.
I am using Entity Framework 5 on .Net 4.5. The corresponding table has a Timestamp column to maintain row version.
I have done this in the past by adding a timestamp field to the table you wish to perform a concurrency check on (in my example I added a column called ConcurrencyCheck).
There are two types of concurrency mode here, depending on your needs:
1. Concurrency Mode: Fixed
Re-add/refresh your table in your model, and make sure you set the concurrency mode of the new column to Fixed when you import it.
Then to trap this:
try
{
    context.SaveChanges();
}
catch (OptimisticConcurrencyException ex)
{
    // handle your exception here...
}
2. Concurrency Mode: None
If you wish to handle your own concurrency checking, i.e. raise a validation message informing the user and not even allow a save to occur, then you can set the concurrency mode to None.
1. Ensure you change the ConcurrencyMode in the properties of the new column you just added to "None".
2. To use this in your code, I would create a variable to store the current record's timestamp on the screen where you wish to check the save:
private byte[] CurrentRecordTimestamp
{
    get
    {
        return (byte[])Session["currentRecordTimestamp"];
    }
    set
    {
        Session["currentRecordTimestamp"] = value;
    }
}
3. On page load (assuming you're using ASP.NET Web Forms and not MVC/Razor; you don't mention which above), or when you populate the screen with the data you wish to edit, pull the record-under-edit's ConcurrencyCheck value into the variable you created:

this.CurrentRecordTimestamp = currentAccount.ConcurrencyCheck;

Then if the user leaves the record open and someone else changes it in the meantime, when they attempt to save you can compare the timestamp value you saved earlier with the concurrency value the row holds now:
if (Convert.ToBase64String(accountDetails.ConcurrencyCheck) != Convert.ToBase64String(this.CurrentRecordTimestamp))
{
    // The row has changed since it was loaded; warn the user instead of saving.
}
After reviewing many posts here and on the web explaining concurrency and timestamps in Entity Framework 5, I came to the conclusion that it is basically impossible to get a concurrency exception when the model is generated from an existing database.
One workaround is modifying the generated entities in the .edmx file and setting the "Concurrency Mode" of the entity's timestamp property to "Fixed". Unfortunately, if the model is repeatedly re-generated from the database, this modification may be lost.
However, there is one tricky workaround:
1. Initialize a transaction scope with an isolation level of Repeatable Read or higher.
2. Get the timestamp of the row.
3. Compare the new timestamp with the old one.
4. Not equal --> throw a concurrency exception.
5. Equal --> commit the transaction.
The isolation level is important to prevent concurrent modifications from interfering.
PS:
Erikset's solution seems to be a fine way to overcome the model-file regeneration issue.
EF detects a concurrency conflict if no rows were affected. So if you use stored procedures to delete and update, you can manually add the timestamp value to the WHERE clause:

UPDATE | DELETE ... WHERE PKfield = PkValue AND RowVersionField = rowVersionValue

Then if the row has been deleted or modified by anyone else, the SQL statement affects 0 rows and EF interprets it as a concurrency conflict.

EclipseLink JPA: Can I run multiple queries from one builder?

I have a method that builds and runs a Criteria query. The query does what I want it to, specifically it filters (and sorts) records based on user input.
Also, the query size is restricted to the number of records on the screen. This is important because the data table can be potentially very large.
However, if filters are applied, I want to count the number of records that would be returned if the query was not limited. So this means running two queries: one to fetch the records and then one to count the records that are in the overall set. It looks like this:
public List<Log> runQuery(TableQueryParameters tqp) {
    // get the builder, query, and root
    CriteriaBuilder builder = em.getCriteriaBuilder();
    CriteriaQuery<Log> query = builder.createQuery(Log.class);
    Root<Log> root = query.from(Log.class);

    // build the requested filters
    Predicate filter = null;
    for (TableQueryParameters.FilterTerm ft : tqp.getFilterTerms()) {
        // this section runs through the user input and constructs the
        // predicate
    }
    if (filter != null) query.where(filter);

    // attach the requested ordering
    List<Order> orders = new ArrayList<Order>();
    for (TableQueryParameters.SortTerm st : tqp.getActiveSortTerms()) {
        // this section constructs the Order objects
    }
    if (!orders.isEmpty()) query.orderBy(orders);

    // run the query
    TypedQuery<Log> typedQuery = em.createQuery(query);
    typedQuery.setFirstResult((int) tqp.getStartRecord());
    typedQuery.setMaxResults(tqp.getPageSize());
    List<Log> list = typedQuery.getResultList();

    // if we need the result size, fetch it now
    if (tqp.isNeedResultSize()) {
        CriteriaQuery<Long> countQuery = builder.createQuery(Long.class);
        countQuery.select(builder.count(countQuery.from(Log.class)));
        if (filter != null) countQuery.where(filter);
        tqp.setResultSize(em.createQuery(countQuery).getSingleResult().intValue());
    }
    return list;
}
As a result, I call createQuery twice on the same CriteriaBuilder and I share the Predicate object (filter) between both of them. When I run the second query, I sometimes get the following message:
Exception [EclipseLink-6089] (Eclipse Persistence Services - 2.2.0.v20110202-r8913): org.eclipse.persistence.exceptions.QueryException
Exception Description: The expression has not been initialized correctly. Only a single ExpressionBuilder should be used for a query. For parallel expressions, the query class must be provided to the ExpressionBuilder constructor, and the query's ExpressionBuilder must always be on the left side of the expression.
Expression: [ Base com.myqwip.database.Log]
Query: ReportQuery(referenceClass=Log )
at org.eclipse.persistence.exceptions.QueryException.noExpressionBuilderFound(QueryException.java:874)
at org.eclipse.persistence.expressions.ExpressionBuilder.getDescriptor(ExpressionBuilder.java:195)
at org.eclipse.persistence.internal.expressions.DataExpression.getMapping(DataExpression.java:214)

Can someone tell me why this error shows up intermittently, and what I should do to fix this?
Short answer to the question: yes, you can, but only sequentially.
In the method above, you start creating the first query, then start creating the second, then execute the second, then execute the first.
I had the exact same problem; I don't know why it's intermittent, though.
In other words, you start creating your first query, and before having finished it, you start creating and executing another.
Hibernate doesn't complain, but EclipseLink doesn't like it.
If you just start with the count query, execute it, and then create and execute the other query (which is what you've done by splitting it into 2 methods), EclipseLink won't complain.
see https://issues.jboss.org/browse/SEAMSECURITY-91
It looks like this posting isn't going to draw much more response, so I will answer it with how I resolved it.
Ultimately I ended up breaking my runQuery() method into two methods: runQuery(), which fetches the records, and runQueryCount(), which fetches the count of records without sort parameters. Each method has its own call to em.getCriteriaBuilder(). I have no idea what effect that has on the EntityManager, but the problem has not appeared since.
Also, the DAO object that has these methods used to be @ApplicationScoped. It now has no declared scope, so it is constructed on demand by the various @RequestScoped and @ConversationScoped beans that use it. I don't know if this has any effect on the problem, but since it has not reappeared I will use this as my code pattern from now on. Suggestions welcome.
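For what it's worth, here is a minimal sketch of that split (the method name runQueryCount and the buildFilter helper are my own inventions; the rest follows the code in the question). The point is that each method gets its own CriteriaBuilder and builds its own Predicate, so no Expression is shared between the two queries:

public List<Log> runQuery(TableQueryParameters tqp) {
    CriteriaBuilder builder = em.getCriteriaBuilder(); // builder used only by this query
    CriteriaQuery<Log> query = builder.createQuery(Log.class);
    Root<Log> root = query.from(Log.class);
    Predicate filter = buildFilter(builder, root, tqp); // hypothetical helper
    if (filter != null) query.where(filter);
    // ... attach ordering as before ...
    TypedQuery<Log> typedQuery = em.createQuery(query);
    typedQuery.setFirstResult((int) tqp.getStartRecord());
    typedQuery.setMaxResults(tqp.getPageSize());
    return typedQuery.getResultList();
}

public long runQueryCount(TableQueryParameters tqp) {
    CriteriaBuilder builder = em.getCriteriaBuilder(); // a fresh builder for the count
    CriteriaQuery<Long> countQuery = builder.createQuery(Long.class);
    Root<Log> root = countQuery.from(Log.class);
    countQuery.select(builder.count(root));
    Predicate filter = buildFilter(builder, root, tqp); // rebuilt, not reused
    if (filter != null) countQuery.where(filter);
    return em.createQuery(countQuery).getSingleResult();
}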