Similarly to my other previously opened threads, I'm trying to achieve a fast and efficient folder search. I've tried waiting for the fnevSearchComplete event, which took forever to arrive, and for fnevTableModified with TABLE_ROW_ADDED, which never arrived because of IMsgStore's decision tree (more than 10K mails).
Is it possible to query the IMAPITable associated with the search folder during the search operation, i.e. between the SetSearchCriteria call and the fnevSearchComplete event?
If that's feasible, is a simple IMAPITable->QueryRows call in an infinite loop enough? Will the table's order stay stable during the search operation, and will the cursor correctly move to the next record?
Edit: I've found out that SetSearchCriteria moves the table's cursor position each time it inserts new records into the search table. Is there a way to overcome this behavior for an on-the-fly table query?
I'm trying to loop through all the records displayed in a page, from the selected one to the end of the rows:
For example here, since I'm selecting the 5th row, it should loop through the 5th and 6th rows (as there are no more rows below).
What I've been trying is this:
ProdOrderLine := Rec;
REPEAT
  // process ProdOrderLine here
UNTIL ProdOrderLine.NEXT = 0;
But it loops through all the records in the table, including ones that are not even displayed in the page...
How can I loop through only the page's records, from the selected one to the last?
Try Copy instead of assignment, e.g. ProdOrderLine.COPY(Rec); instead of ProdOrderLine := Rec;. Assignment only copies the field values from one instance of the record variable to another; it does not copy filters or keys (sort order).
Alas, I have to mention that this is an uncommon scenario for handling records in BC. The general best-practice approach would be to ask the user to select all the records he or she needs with shift+click, ctrl+click, or by dragging the mouse. In that case you would use SetSelectionFilter to instantly grab the selected records.
This is how it works across the system and how users should be taught to work. It is a bad idea to add a way of interacting with records that only works in one page in the whole system, even if users are asking for it bursting into tears. They probably just had this type of interaction in some other system they worked with before. I know this is a tough fight, but it is worth it, for the sake of the stability (less coding = fewer bugs) and predictability (a certain way of interaction works across all the pages) of the system.
Very basic setup: source-to-target - I wanted to replicate the MERGE behavior.
I removed the Update Strategy and activated the "update then insert" rule on the target within the session. It doesn't work as described: it always attempts to insert into the primary key column, even when the same key arrives again, which should have triggered an "update" statement. I tried other target methods - it always attempts to insert. Attached is the mapping pic.
(screenshot: basic merge attempt)
Finally figured this out. You have to make edits in 3 places:
a) mapping - remove the Update Strategy
b) session: target properties - set the "update then insert" method
c) session's own properties - "Treat source rows as"
In the third case you have to switch "Treat source rows as" from Insert to Update, which will then allow both updates and inserts.
Why it is set up like this is beyond me, but it works.
I'll make an attempt to clarify this a bit.
First of all, using an Update Strategy transformation in the mapping requires the session's Treat source rows as property to be set to Data driven. This is the slowest possible option, as it means the operation is decided on a row-by-row basis within the mapping - but that's exactly what you need if you are using the Update Strategy transformation. So in order to mirror MERGE, you need to remove it.
You also need to tell the session not to expect this in the mapping anymore, so the property needs to be set to one of the remaining values. There are two options (sketched in code below):
Set Treat source rows as to Insert - this means all the rows will be inserted each time. If there are no errors (e.g. caused by a unique index), the data will be multiplied. In order to mimic MERGE behavior, you'd need to add a unique index that prevents duplicate inserts and tell the target connector to insert else update. This way, when the insert fails, an update attempt will be made.
Set Treat source rows as to Update - this tells PowerCenter to try an update for each and every input row. Using update else insert means that in case of failure (i.e. no row to update) there will be no error - instead, an insert attempt will be made. Here there's no need for a unique index. That's one difference.
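To make the distinction concrete, here is a rough sketch of the row-level logic the two settings boil down to. This is not PowerCenter code - insert_row, update_row and UniqueKeyViolation are hypothetical stand-ins for whatever the target connector actually does:

def insert_else_update(row):
    # Treat source rows as = Insert, target set to "insert else update":
    # try the insert first, fall back to an update when the unique index rejects it.
    try:
        insert_row(row)
    except UniqueKeyViolation:
        update_row(row)

def update_else_insert(row):
    # Treat source rows as = Update, target set to "update else insert":
    # try the update first; if no row was matched, insert instead.
    if update_row(row) == 0:  # zero rows affected -> nothing to update
        insert_row(row)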
An additional difference - although both solutions reflect the MERGE operation - might be observed in performance. In an environment where new data is rare, the first approach will be slow: each time, an insert attempt will be made just to fail and be followed by an update, and only occasionally will it succeed on the first attempt. The second approach will be faster: updates will succeed most of the time, and only on rare occasions will one fail and result in an insert.
Of course, if updates are rarely expected, it will be exactly the opposite.
This can be seen as a complex solution for a simple merge, but it also lets the developer influence the performance.
Hope this sheds some light!
We have a setup where various worker nodes perform computations and update their respective states in a DynamoDB table. The table acts as a kind of history of the worker nodes' activity. A watchdog node needs to periodically scan through the table and build an object representing the current state of the worker nodes and their jobs. As such, it's important for our application to be able to scan the table and retrieve data in chronological order (i.e. sorted by timestamp). The table will eventually be too large to scan into local memory for later ordering, so we cannot sort it after scanning.
Reading from the AWS documentation about the primary key:
DynamoDB uses the partition key value as input to an internal hash
function. The output from the hash function determines the partition
(physical storage internal to DynamoDB) in which the item will be
stored. All items with the same partition key are stored together, in
sorted order by sort key value.
The documentation on the Scan operation doesn't seem to mention anything about the order of the returned results. But can the last part of the quote above (the last sentence) be interpreted to mean that the results of a scan are ordered by the sort key? If I set all partition keys to the same value, say "0", and then use my timestamp as the sort key, am I guaranteed that the Scan operation will return data in chronological order?
Some notes:
All code is written in Python, and thus I'm using the boto3 module to perform scan operations.
Our system architect is steadfast against the idea of updating any entries in the table to reflect their current state, or deleting items when the job is complete. We can only ever add to the table, and thus we need to scan through the whole thing each time to determine the worker states.
I am using strong read consistency for scan operations.
Technically, Scan never guarantees order (although, as an observation, the lack of an order guarantee seems to apply across partitions - within a partition, items still come back sorted by the sort key).
What you've proposed will work, but instead of scanning you'd be doing a Query on partition-key == 0, which will return all the items with a partition key of 0 (up to the limit, and optionally sorted forward or backward), sorted by the sort key.
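As a rough boto3 sketch of that Query (the table and attribute names here are made up - substitute your own):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('worker-history')  # hypothetical table name

kwargs = {
    'KeyConditionExpression': Key('pk').eq('0'),  # the single shared partition key
    'ScanIndexForward': True,                     # ascending by sort key (timestamp)
    'ConsistentRead': True,
}
response = table.query(**kwargs)
items = response['Items']
# Query results are paginated, so keep going while LastEvaluatedKey is returned.
while 'LastEvaluatedKey' in response:
    response = table.query(ExclusiveStartKey=response['LastEvaluatedKey'], **kwargs)
    items.extend(response['Items'])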
That said, this is really not the way DynamoDB wants you to use it. For example, it guarantees your partition will run hot (because you've explicitly put everything on the same partition), and this operation will cost you the read capacity of every item in the table.
I would recommend investigating patterns such as using a DynamoDB stream processed by a Lambda to build and maintain a materialised view of this "current state", rather than "polling" the table with this expensive scan and the resulting poor key design.
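A minimal sketch of what such a stream consumer could look like (the state table name and the attribute names are assumptions, not something from the question):

import boto3

# Hypothetical table holding one "current state" item per worker.
state_table = boto3.resource('dynamodb').Table('worker-current-state')

def handler(event, context):
    # Invoked by the DynamoDB stream (via an event source mapping) with a batch of records.
    for record in event['Records']:
        if record['eventName'] not in ('INSERT', 'MODIFY'):
            continue
        image = record['dynamodb']['NewImage']  # requires a NEW_IMAGE stream view type
        state_table.put_item(Item={
            'worker_id': image['worker_id']['S'],  # assumed attribute names
            'status': image['status']['S'],
            'updated_at': image['timestamp']['S'],
        })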
You’re better off using yyyy-mm-dd as the partition key, rather than all 0. There’s a limit of 10 GB of data per partition, which also means you can’t have more than 10 GB of data per partition key value.
If you want to be able to retrieve data sorted by date, take the ISO 8601 timestamp format (roughly yyyy-mm-ddThh:mm:ss.sss), split it somewhere reasonable for your data, and use the first part as the partition key and the second part as the sort key. (Another advantage of this approach is that you can use eventually consistent reads for most of the queries, since it's pretty safe to assume that after a day (or an hour or something) the data is completely replicated.)
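For example, a simple (purely illustrative) way to split the timestamp in Python:

from datetime import datetime, timezone

ts = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%f')
# Split at the 'T': the date becomes the partition key, the time of day the sort key.
partition_key, sort_key = ts.split('T', 1)
# e.g. partition_key = '2021-06-01', sort_key = '13:45:12.123456'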
If you can manage it, it would be even better to use Worker ID or Job ID as a partition key, and then you could use the full time stamp as the sort key.
As @thomasmichaelwallace mentioned, it would be best to use DynamoDB Streams with Lambda to create a materialized view.
Now, that being said, if you’re dealing with jobs being run on workers, then you should also consider whether you can achieve your goal by using a workflow service rather than a database. Workflows will maintain a job history and/or current state for you. AWS offers Step Functions and Simple Workflow.
When we scan a DynamoDB table, we can/should use LastEvaluatedKey to track the progress so that we can resume in case of failures. The documentation says that
LastEvaluatedKey is the primary key of the item where the operation stopped, inclusive of the previous result set. Use this value to start a new operation, excluding this value in the new request.
My question is: if I start a scan, pause, insert a few rows, and then resume the scan from the previous LastEvaluatedKey, will I get those new rows after resuming the scan?
My guess is that I might miss some or all of the new rows, because the new keys will be hashed and the resulting values could be smaller than the LastEvaluatedKey.
Is my guess right? Any explanation or documentation links are appreciated.
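For reference, a minimal boto3 sketch of the pause/resume pattern I'm describing (the table name is made up):

import boto3

table = boto3.resource('dynamodb').Table('my-table')  # hypothetical table name

response = table.scan(Limit=100)
checkpoint = response.get('LastEvaluatedKey')  # persist this to resume later

# ... pause here; other writers may insert new items in the meantime ...

# Resume from the checkpoint: the scan continues from that key onward.
if checkpoint:
    response = table.scan(Limit=100, ExclusiveStartKey=checkpoint)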
Scan goes through your data sequentially, and it does not know about items that were added while it is in progress:
Scan operations proceed sequentially; however, for faster performance
on a large table or secondary index, applications can request a
parallel Scan operation by providing the Segment and TotalSegments
parameters.
Not only can it miss some of the items that were added after you started scanning, it can also miss some of the items that were added before the scan started if you are using eventually consistent reads:
Scan uses eventually consistent reads when accessing the data in a
table; therefore, the result set might not include the changes to data
in the table immediately before the operation began.
If you need to keep track of items that were added after you've started a scan you can use DynamoDB streams for that.
I am currently developing an application for Azure Table Storage. In that application I have a table which will have relatively few inserts (a couple of thousand per day), and the primary key of these entities will be used in another table, which will have billions of rows.
Therefore I am looking for a way to use an auto-incremented integer, instead of a GUID, as the primary key in the small table (since it will save a lot of storage, and scalability of the inserts is not really an issue).
There've been some discussions on the topic, e.g. on http://social.msdn.microsoft.com/Forums/en/windowsazure/thread/6b7d1ece-301b-44f1-85ab-eeb274349797.
However, since concurrency problems can be really hard to debug and spot, I am a bit uncomfortable with implementing this on my own. My question is therefore: is there a well-tested implementation of this?
For everyone who finds this in search, there is a better solution. The minimal time for a table lock is 15 seconds - that's awful. Do not use it if you want to create a truly scalable solution. Use the ETag!
Create one entity in the table to hold the ID (you can even name it ID or whatever).
1) Read it.
2) Increment.
3) InsertOrUpdate WITH ETag specified (from the read query).
If the last operation (InsertOrUpdate) succeeds, then you have a new, unique, auto-incremented ID. If it fails (an exception with HttpStatusCode == 412), it means that some other client changed it. So, repeat steps 1, 2 and 3.
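A rough Python sketch of steps 1-3 using the azure-data-tables SDK (the table and entity names are made up, and treat the exact call shapes as an assumption rather than a drop-in implementation):

from azure.core import MatchConditions
from azure.core.exceptions import ResourceModifiedError
from azure.data.tables import TableClient, UpdateMode

conn_str = "..."  # your storage connection string
table = TableClient.from_connection_string(conn_str, table_name='Counters')  # hypothetical table

def next_id():
    while True:
        # 1) Read the counter entity (and its ETag).
        entity = table.get_entity(partition_key='ids', row_key='ID')
        # 2) Increment.
        entity['Value'] += 1
        try:
            # 3) Write back only if nobody changed it since we read it (ETag match).
            table.update_entity(
                entity,
                mode=UpdateMode.REPLACE,
                etag=entity.metadata['etag'],
                match_condition=MatchConditions.IfNotModified,
            )
            return entity['Value']
        except ResourceModifiedError:
            # HTTP 412: another client incremented first - repeat steps 1-3.
            continue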
The usual time for Read+InsertOrUpdate is less than 200ms. My test utility with source on github.
See UniqueIdGenerator class by Josh Twist.
I haven't implemented this yet but am working on it ...
You could seed a queue with your next ids to use, then just pick them off the queue when you need them.
You need to keep a table to contain the value of the biggest number added to the queue. If you know you won't be using a ton of the integers, you could have a worker every so often wake up and make sure the queue still has integers in it. You could also have a used int queue the worker could check to keep an eye on usage.
You could also hook that worker up so that if the queue happened to be empty when your code needed an id, it could interrupt the worker's nap to create more keys asap.
If that call failed, you would need a way to tell the worker that you are going to do the work for them (lock), then do the worker's work of getting the next id, and unlock:
lock
get the last key created from the table
increment and save
unlock
then use the new value.
The solution I found that prevents duplicate ids and lets you auto-increment is to:
lock (lease) a blob and let that act as a logical gate.
Then read the value.
Write the incremented value
Release the lease
Use the value in your app/table
Then if your worker role were to crash during that process, you would only have a missing ID in your store. IMHO that is better than duplicates.
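As a rough Python sketch of that lease-as-a-gate idea, using the azure-storage-blob SDK (the container/blob names and the lease duration are assumptions, and the counter blob is assumed to already exist with an initial value):

from azure.storage.blob import BlobClient

conn_str = "..."  # your storage connection string
blob = BlobClient.from_connection_string(conn_str, container_name='locks', blob_name='id-counter')

def next_id():
    # Acquire the lease: this is the logical gate (15 seconds is the minimum lease duration).
    lease = blob.acquire_lease(lease_duration=15)
    try:
        current = int(blob.download_blob().readall() or b'0')
        new_value = current + 1
        # Only the lease holder may overwrite the blob.
        blob.upload_blob(str(new_value).encode(), overwrite=True, lease=lease)
        return new_value
    finally:
        lease.release()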
Here is a code sample and more information on this approach from Steve Marx
If you really need to avoid GUIDs, have you considered using something based on date/time and then leveraging partition keys to minimize the concurrency risk?
Your partition key could be by user, year, month, day, hour, etc and the row key could be the rest of the datetime at a small enough timespan to control concurrency.
Of course you have to ask yourself, at the price of data in Azure, whether avoiding a GUID is really worth all of this extra effort (assuming a GUID will just work).