Using column versions for time series

Using column versions for time series - google-cloud-platform

In the official documentation there is a text for which I can't totally understand the reason:
When working with time series, do not leverage the transactional behavior of rows. Changes to data in an existing row should be stored as a new, separate row, not changed in the existing row. This is an easier model to construct, and it enables you to maintain a history of activity without relying upon column versions.
The last sentence is not obvious and concrete, so it doesn't convince me. For now, using versioning for updating the cell's data still looks to me like a good fit for the update task. At least versions are managed by BigTable, so it's simplier solution.
Can anybody please provide more obvious explanation of why the versioning shouldn't be used in that use case?

Earlier in that page under Patterns for row key design, a bit more detail is explained. The high level view being that using row keys instead of column versions will:
Make it easier to run queries across your data, allowing for scanning of less data.
Avoid going over the recommended maximum row size.
The one caveat being:
It is acceptable to use versions of a column where the use case is
actually amending a value, and the value's history is important. For
example, suppose you did a set of calculations based on the closing
price of ZXZZT, and initially the data was mistakenly entered as
559.40 for the closing price instead of 558.40. In this case, it might be important to know the value's history in case the incorrect value
had caused other miscalculations.

Related

DynamoDB version control using sort keys

Anyone that has implemented versioning using sort keys as stated in https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html?
Trying to implement this using typescript for building a database with versions of the items. Is there any way of doing this using updateItem or is it a get + put operation needed?
Any sample to get me started or help is much appreciated!

The concept of versioning using sort key involves the creation of a completely new item that uses same Partition Key and different Sort Key.
DynamoDB offers some operations that allow to update values within an object in an atomic way, this use case is perfect for when you have something like a counter or a quantity and you want to decrease/increase it without having to read its value first. - Docs here.
In the case you're trying to achieve, as mentioned, you are essentially creating a new object. DynamoDB, by itself, doesn't have any concept of versioning and what this pattern does is to cleverly leverage the relation between Partition Key and Sort Key and the fact that a PK can have multiple SK associated with it, to correlate multiple rows of the same table.
To answer your question, if your only source of truth (or data store) is DynamoDB, then yes, your client will have to first query the table to know which was the last version of the item being updated and then insert the new version.
In case you are recording this information elsewhere and are using DynamoDB only to store these versions, then no, one put operation will be enough but again, this assumes you can retrieve this info somewhere else.
In terms of samples, the official documentation of the AWS SDK is always a good start, in your case I assume you'll want to use the Javascript one which you can find here.
At a very high level, you'll have to do the following:
Create an AWS.DynamoDB() client.
Execute a query using the dynamodb.query() method and specifying the PK of the item you want to update.
Go through the items (rows) returned from the previous query and find the one with the bigger version number as SK.
Put a new item using the dynamodb.putItem() method passing an item with the incremented version number as SK and same PK.

You can do the technique described by Amazon with a read and then a write, or more accurately, a read and then two writes (since they want to update both v0 and a new v4!). Often, you need the extra read because you want to build the new version v4 based on data you read from v3 (or equivalently, v0) - but in case you don't need that, the read is not necessary, and two writes are enough:
You first do an UpdateItem to v0 which increments the "Latest" attribute, sets whatever attributes you want to set in the new version, and uses the ReturnValues parameter to ask the update operation to return the new "Latest" attribute.
Then you write with PutItem the new row for v4 (where 4 is the "Latest" you just read).
This approach is safe in the sense that if two clients try to create two new versions at the same time, each one will pick a different "Latest", and both will appear on the version histories. However, it is not safe in the sense that if the client dies between step 1 and 2, you'll have a "hole" in the version history. However, I don't think there's any implementation of this technique that doesn't suffer from this problem.
After saying this, I want to reiterate what I said in the first paragraph: In most realistic use cases, the new version would be based on the old version, so your code anyway needs to read the old version first, then decide how to change it - and then write it (twice). You can't avoid the read in these cases. By the way, in this case the first write (to v0) would be a conditional update to verify that you only write the new version if the old version is still the same one ("Latest" is the same one you read during the read) - otherwise you'd be basing your modification on a non-current version. This is an example of optimistic locking.

Better method for querying DynamoDB table randomly?

I've included some links along with our approaches to other answers, which seem to be the most optimal on the web right now.
Our records need to be categorized (eg. "horror", "thriller", "tv"), and randomly accessible both in specific categories and across all/some categories. We generally need to access about 20 - 100 items at a time. We also have a smallish number of categories (less than 100).
We write to the database for uploading/removing content, although this is done in batches and does not need to be real time.
We have tried two different approaches, with two different data structures.
Approach 1
AWS DynamoDB - Pick a record/item randomly?
Help selecting nth record in query.
In short, using the category as a hash key, and a UUID as the sort key. Generate a random UUID, query Dynamo using greater than or less than, and limit to 1. This is even suggested by an AWS employee in the second link. (We've also tried increasing the limit to the number of items we need, but this increases the probability of the query failing the first time around).
Issues with this approach:
First query can fail if it is greater than/less than any of the UUIDs
Querying on any specific category will cause throttling at scale (Small number of partitions)
We've also considered adding a suffix to each category to artificially increase the number of partitions we have, as pointed out in the following link.
AWS Database Blog
Choosing the Right DynamoDB Partition Key
Approach 2
Amazon Web Services: How do we get random item from the dynamoDb's table?
Doing something similar to this, where we concatenate the category with a sequential number, and use this as the hash key. e.g. horror-000001.
By knowing the number of records in each category, we're able to perform random queries across our entire data set, while also avoiding hot partitions/keys.
Issues with this approach
We need a secondary data structure to manage the sequential counts across each category
Writing (especially deleting) is significantly more complex, although this doesn't need to happen in real time.
Conclusion
Both approaches solve our main use case of random queries on category/categories, but the cons they offer are really deterring us from using them. We're leaning more towards approach #1 using suffixes to solve the hot partitioning issue, although we would need the additional retry logic for failed queries.
Is there a better way of approaching this problem? Specifically looking for solutions capable of scaling well (No scan), without requiring extra resources be implemented. #1 fits the bill, but needing to manage suffixes and failed attempts really deters us from using it, especially when it is being called inside a lambda (billed for time used).
Thanks!

Follow Up
After more research and testing, my team has decided to move towards MySQL hosted on RDS for these tables. We learned that this is one of the few use cases were DynamoDB does not fit, and requires rewriting your use case to fit the DB (Bad).
We felt that the extra complexity required to integrate random sampling on DynamoDB wasn't worth it, and we were unable to come up with any comparable solutions. We are, however, sticking with DynamoDB for our tables that do not need random accessibility due to the price and response times.
For anyone wondering why we chose MySQL, it was largely due to the Nodejs library available, great online resources (which DynamoDB definitely lacks), easy integration via RDS with our Lambdas, and the option to migrate to Amazons Aurora database.
We also looked at PostgreSQL, but we weren't as happy with the client library or admin tools, and we believe that MySQL will suit our needs for these tables.
If anybody has anything else they'd like to add or a specific question please leave a comment or send me a message!

This was too long for a comment, and I guess it's pretty much a full fledged answer now.
Approach 2
I've found that my typical time to get a single item from dynamodb to a host in the same region is <10ms. As long as you're okay with at most 1-2 extra calls, you can quite easily implement approach 2.
If you use a keys only GSI where the category is your hash key and the primary key of the table is your range key, you can quickly find the largest numbered single item within a category.
When you add a new item, find the largest number for that category from the GSI and then write the new item to the table with sequence number n+1.
When you delete, find the item with the largest sequence number for that category from the GSI, overwrite the item you are deleting, and then delete the now duplicated item from its position at the highest sequence number.
To randomly get an item, query the GSI to find the highest numbered item in the category, and then randomly pick a number since you now know the valid range.
Approach 1
I'm not sure exactly what you mean when you say "without requiring extra resources to be implemented". If you're okay with using a managed resource (no dev work to implement), you can also make Approach 1 work by putting a DAX cluster in front of your dynamodb table. Then you can query to your heart's content without really worrying about hot partitions. (Though the caching layer means that new/deleted items won't be reflected right away.)

Reducing the number of calls to MongoDB with mongoengine

I'm working to optimize a Django application that's (mainly) backed by MongoDB. It's dying under load testing. On the current problematic page, New Relic shows over 700 calls to pymongo.collection:Collection.find. Much of the code was written by junior coders and normally I would look for places to add indicies, make smarter joins and remove loops to reduce query calls, but joins aren't an option here. What I have done (after adding indicies based on EXPLAINs) is tried to reduce the cost in loops by making a general query and then filtering that smaller set in the loops*. While I've gotten the number down from 900 queries, 700 still seems insane even with the intense amount of work being done on the page. I thought perhaps find was called even when filtering an existing queryset, but the code suggests it's always a database query.
I've added some logging to mongoengine to see where the queries come from and to look at EXPLAIN statements, but I'm not having a ton of luck sifting through the wall of info. mongoengine itself seems to be part of the performance problem: I switched to mongomallard as a test and got a 50% performance improvement on the page. Unfortunately, I got errors on a bunch of other pages (as best I can tell it appears Mallard doesn't do well when filtering an existing queryset; the error complains about a call to deepcopy that's happening in a generator, which you can't do-- I hit a brick wall there). While Mallard doesn't seem like a workable replacement for us, it does suggest a lot of the proessing time is spent converting objects to and from Python in mongoengine.
What can I do to further reduce the calls? Or am I focusing on the wrong thing and should be attacking the problem somewhere else?
EDIT: providing some code/ models
The page in question displays the syllabus for a course, showing all the modules in the course, their lessons and the concepts under the lessons. For each concept, the user's progress in the concept is also shown. So there's a lot of looping to get the hierarchy teased out (and it's not stored according to any of the patterns the Mongo docs suggest).
class CourseVersion(Document):
...
course_instances = ListField(ReferenceField('CourseInstance'))
courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))
class CoursewareContainer(EmbeddedDocument):
id = UUIDField(required=True, binary=False, default=uuid.uuid4)
....
courseware_containers = ListField(EmbeddedDocumentField('self'))
teaching_element_instances = ListField(StringField())
The course's modules, lessons and concepts are stored in courseware_containers; we need to get all of the concepts so we can get the list of ids in teaching_element_instances to find the most recent one the user has worked on (if any) for that concept and then look up their progress.
* Just to be clear, I am using a profiler and looking at times and doings things The Right Way as best I know, not simply changing things and hoping for the best.

The code sample isn't bad per-sae but there are a number of areas that should be considered and may help improve performance.
class CourseVersion(Document):
...
course_instances = ListField(ReferenceField('CourseInstance'))
courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))
class CoursewareContainer(EmbeddedDocument):
id = UUIDField(required=True, binary=False, default=uuid.uuid4)
....
courseware_containers = ListField(EmbeddedDocumentField('self'))
teaching_element_instances = ListField(StringField())
Review
Unbounded lists.
course_instances, courseware_containers, teaching_element_instances
If these fields are unbounded and continuously grow then the document will move on disk as it grows, causing disk contention on heavily loaded systems. There are two patterns to help minimise this:
a) Turn on Power of two sizes. This will cost disk space but should lower the amount of io churn as the document grows
b) Initial Padding - custom pad the document on insert so it gets put into a larger extent and then remove the padding. Really an anti pattern but it may give you some mileage.
The final barrier is the maximum document size - 16MB you can't grow your data bigger than that.
Lists of ReferenceFields - course_instances
MongoDB doesn't have joins so it costs an extra query to look up a ReferenceField - essentially they are an in app join. Which isn't bad per-sae but its important to understand the tradeoff. By default mongoengine won't automatically dereference the field only doing course_version.course_instances will it do another query and then populate the whole list of references. So it can cost you another query - if you don't need the data then exclude() it from the query to stop any leaking queries.
EmbeddedFields
These fields are part of the document, so there is no cost for them, other than the wire costs of transmitting and loading the data. **As they are part of the document, you don't need select_related to get this data.
teaching_element_instances
Are these a list of id's? It says its a StringField in the code sample above. Either way, if you don't need to dereference the whole list then storing the _ids as a StringField and manually dereferencing may be more efficient if coded correctly - especially if you just need the latest (last?) id.
Model complexity
The CoursewareContainer is complex. For any given CourseVersion you have n CoursewareContainers with themselves have a list of n containers and those each have n containers and on...
Finding the most recent instances
We need to get all of the concepts so we can get the list of ids in
teaching_element_instances to find the most recent one the user has
worked on (if any) for that concept and then look up their progress.
I'm unsure if there is a single instance you are after or one per Container or one per Course. Either way - the logic for querying the data should be examined. If its a single instance you are after - then that could be stored against the user so to simplify the logic of looking this up. If its per course or container then to improve performance ensure you minimise the number of queries - if possible collect all the ids and then at the end issue a single $in query, rather than doing a query per container.
Mongoengine costs
Currently, there is a performance cost to loading the data into Mongoengine classes - if you don't need the classes and are happy to work with simple dictionaries then either issue a raw pymongo query or use as_pymongo.
Schema design
The schema looks logical enough but is it suitable for the use case - in essence is it using MongoDB's strengths or is it putting a relational peg in a document database shaped hole? I can't answer than for you but I do know the way to the happy path with MongoDB is design the schema based on its use case. With relational databases schema design from the outset is simple - you normalise, with document databases how the data is used is a primary factor.
MongoDB best practices
There are many other best practices and mongodb have a guide which might be of interest: MongoDB Operations Best Practices.
Feel free to contact me via the Mongoengine mailing list to discuss further and if needs be discuss in private.
Ross

Add Indexes (db_index=True)

I'm reading a book about coding style in Django and one thing they discuss is db_index=True. Ever since I started using Django, I've never used this function because I'm not really sure what it does.
So my question is, when to consider adding indexes?

This is not really django specific; more to do with databases. You add indexes on columns when you want to speed up searches on that column.
Typically, only the primary key is indexed by the database. This means look ups using the primary key are optimized.
If you do a lot of lookups on a secondary column, consider adding an index to that column to speed things up.
Keep in mind, like most problems of scale, these only apply if you have a statistically large number of rows (10,000 is not large).
Additionally, every time you do an insert, indexes need to be updated. So be careful on which column you add indexes.
As always, you can only optimize what you can measure - so use the EXPLAIN statement and your database logs (especially any slow query logs) to find out where indexes can be useful.

The above answer is correct but in some cases where the search is being done on columns that have only varchar datatype like email. There you need to add an index.
Following is the way of doing that:
Index(name='covering_index', fields=['headline'], include=['pub_date'])
reference from https://docs.djangoproject.com/en/3.2/ref/models/indexes/

How do I find if any names in a list of names appears in a paragraph with ColdFusion?

Assume that I have a list of employee names from a database (thousands, potentially tens-of-thousands in the near future). To make the problem simpler assume that each firstname/lastname combination is unique (a big if, but a tangent).
I also have a RSS stream of news content that pertains to the business (again, could be in the hundreds of items per day).
What I would like to do is detect if an employees name appears in the several paragraph news item and, if so, 'tag' the item with the person its talking about.
There may be more than one employee named in a single news item so breaking the loop after the first positive match isn't a possibility.
I can certainly brute force things: for every news item, loop over each and every employee name and if a regex expression returns a match, make note of it.
Is there a simpler way in ColdFusion or should I just get on with my nested loops?

Just throwing this out there as something you could do...
It sounds like you'll almost unanimously have significantly more employee names than words per post. Here's how I might handle it:
Have an always-running CF app that will pull in the feeds and onAppStart
Grab all employees from your db
Create an app-scoped look up struct with first names as keys and a struct of last names as values ( you could also add middle names sibling to last names with a 3rd tier if desired ).
So one key in the look up might be "Vanessa" with a struct with 2 keys ( "Johnson" and "Forta" ) as its value.
Then, each article you parse, just listToArray with a space as a delimiter and loop through the array doing a simple structKeyExists with each token. For matches, check the next item in the array as a last name.
I'd guess this would be much more performant processingwise than doing however many searches and also take almost no time to code and you can feed in any future sources extremely simply ( your checker takes one argument, any text on Earth ).
Interested to see what route you go and whether your experiments expose anything new about performance in CF.

Matthew, you have a tall order there, and there are really multiple parts to the challenge/solution. But just in terms of comparing a list of values to a given set of text to see if one of them occur in there, you'll find that there's no one could CF function. BEcause of that, I created a new one, findList, available at cflib:
http://cflib.org/index.cfm?event=page.udfbyid&udfid=1908
It's not perfect, nor as optimal as it could be, but it may be a useful first step or you, or give you some ideas. That said, it suited my need (determine if a given blog comment had reference to any of the blacklisted words). I show it comparing a list of URLs, but it could be any words at all. Hope that's a little helpful.

Another option worth exploring is leveraging the Solr engine that ships with CF now. It will do the string search heavy lifting for you and you can probably focus on dynamically keeping your collections up to date and optimized as new feed items come in.
Good luck!

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js