Cassandra NOT EQUAL Operator - mapreduce

Question to all Cassandra experts out there.
I have a column family with about a million records.
I would like to query these records in a way that lets me perform a not-equal-to kind of operation.
I Googled this and it seems I have to use some sort of MapReduce.
Can somebody tell me what options are available in this regard?

I can suggest a few approaches.
1) If you have a limited number of values that you would like to test for not-equality, consider modeling those as boolean columns (e.g. a column isEqualToUnitedStates with true or false).
2) Otherwise, consider emulating the unsupported query != X by combining results of two separate queries, < X and > X on the client-side.
3) If your schema cannot support either type of query above, you may have to resort to writing custom routines that will do client-side filtering and construct the not-equal set dynamically. This will work if you can first narrow down your search space to manageable proportions, such that it's relatively cheap to run the query without the not-equal.
So let's say you're interested in all purchases by a particular customer, of every product type except Widget. The ideal query would look something like SELECT * FROM purchases WHERE customer = 'Bob' AND item != 'Widget'. Of course you cannot run this, but in this case you should be able to run SELECT * FROM purchases WHERE customer = 'Bob' without wasting too many resources and then drop the 'Widget' rows in the client application (a sketch of this appears after the list).
4) Finally, if there is no way to restrict the data in a meaningful way before doing the scan (querying without the equality check would return too many rows to handle comfortably), you may have to resort to MapReduce. This means running a distributed job that scans all rows in the table across the cluster. Such jobs will obviously run a lot slower than native queries and are quite complex to set up. If you want to go this way, please look into Cassandra's Hadoop integration.
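To make option 3 concrete, here is a rough sketch of that client-side filter using the Python driver; the contact point, keyspace, and column names are assumptions, so adapt them to your schema:
from cassandra.cluster import Cluster

# Contact point and keyspace are placeholders.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('shop')

# Run the cheap, supported query...
rows = session.execute("SELECT * FROM purchases WHERE customer = %s", ['Bob'])

# ...and apply the not-equal condition client-side.
non_widget_purchases = [row for row in rows if row.item != 'Widget']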

If you want to use the not-equals operator on a specific partition key and get all the other data from the table, you can use a combination of range queries and the TOKEN function in CQL to achieve this.
For example, if you want to fetch all rows except the ones with partition key 'abc', you would execute the two queries below:
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) < TOKEN('abc');
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) > TOKEN('abc');
But beware that the result is going to be huge (depending on the size of the table and the fields you need), so you might want to use this in conjunction with a utility like dsbulk. Also note that there is no guarantee of ordering in your result. This is just a kind of data dump, which will most probably be useful for one-time, data-migration-like scenarios.
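For completeness, a sketch of running both token-range halves from the Python driver and stitching the results together client-side; the keyspace, table, and column names are the placeholders from the queries above:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])          # contact point is a placeholder
session = cluster.connect('keyspace1')

rows = []
for op in ('<', '>'):
    cql = ("SELECT column1, column2 FROM table1 "
           "WHERE TOKEN(partition_key_column_name) {} TOKEN(%s)").format(op)
    # The driver pages through results automatically; this can still be a lot of data.
    rows.extend(session.execute(cql, ['abc']))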

Related

What is the right way to query different filters on DynamoDB?

I save my order data in a DynamoDB table. The partition key is orderId and the sort key is timestamp. Each order has many other attributes like category, userName, price, items, status. I am going to build a filter service to let clients query orders based on these attributes. I'd also like to add a limit on the query for pagination, but I've found some limitations in DynamoDB.
In order to support querying different fields, I have two options:
Create a GSI for each attribute. This is very expensive, but it makes querying each individual attribute performant. It doesn't support combining multiple attributes in one filter.
Attach a filter expression to a Scan to express the attribute conditions. Scan is not very performant in the first place. Also, the filter expression is applied after the limit, which means the response is very likely to contain fewer items than the limit the user requested.
So what is a good way to achieve this in DynamoDB?
There is unfortunately no magic way to solve your problems, and no DynamoDB feature that you missed. Indeed, as you said, making each of the attributes available for efficient queries requires a GSI, which will cost you additional money - but that's reasonable. Indeed, as you said, there is no efficient way to search for an intersection of requirements on two different attributes. And indeed, the "limit" feature doesn't quite do what you want, so you'll need to emulate your page-size requirement in the client code (asking for more pages until the desired amount is received), potentially with unacceptably high latency.
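For illustration, a rough sketch of that client-side page-size emulation with boto3; the table, index, and attribute names here are made up:
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource('dynamodb').Table('orders')   # hypothetical table

def query_page(user_name, page_size):
    # Keep fetching pages until enough filtered items are collected or the data runs out.
    items, start_key = [], None
    while len(items) < page_size:
        kwargs = dict(
            IndexName='userName-index',                       # hypothetical GSI on userName
            KeyConditionExpression=Key('userName').eq(user_name),
            FilterExpression=Attr('status').eq('SHIPPED'),    # filter runs after the read, hence the loop
            Limit=page_size,
        )
        if start_key:
            kwargs['ExclusiveStartKey'] = start_key
        resp = table.query(**kwargs)
        items.extend(resp['Items'])
        start_key = resp.get('LastEvaluatedKey')
        if not start_key:
            break
    return items[:page_size]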
It sounds like what you really need is a search engine. These have exactly the features you asked for. You'll still be paying for them (indexing individual columns still takes up CPU and disk space, and intersecting searches on multiple attributes still requires significant work at query time), but search engines are designed for exactly these operations and do them more efficiently and with lower latency (which is important for interactive searches, the bread and butter of search engines).
You can add the limit for pagination using the Limit parameter of the query. But can you please be more specific about your access patterns: are your clients going to query all the orders, or only the orders belonging to them?

Cloud Spanner - read performance with large number of items in WHERE clause

I'm in the process of evaluating some different data stores for a project, and I have a strange but inflexible requirement to check the existence of 1500 keys per query... Basically, the only query I'll be running is of the form:
SELECT user_id, name, gender
WHERE user_id in (user1, user2, ..., user1500)
I will have around 3.5 billion rows in the table. One data store that has caught my eye is Spanner. I was wondering if querying the data in this way would be feasible, or if I would run into performance issues due to the large number of items in my WHERE clause. I have only been able to test these queries on a small amount of data so far, so I'm leaning more on what the theoretical performance hit might look like instead of having the luxury to just "try and find out".
Also, are there other data stores that might work better for this read pattern? I expect to run no more than 80 queries per second, and the data will be bulk loaded on a weekly basis. The data is structured by nature, but we don't use it in a relational way (i.e. no joins).
Anyways, sorry if this question is vague in any way. I'm happy to provide more detail if needed.
1500 keys should not be a problem if you use a bound array parameter to specify the keys:
SELECT user_id, name, gender
FROM table
WHERE user_id IN UNNEST(@users)
https://cloud.google.com/spanner/docs/sql-best-practices#write_efficient_queries_for_range_key_lookup
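A minimal sketch of the bound array parameter with the Python client; the instance, database, and table names are made up:
from google.cloud import spanner

client = spanner.Client()
database = client.instance('my-instance').database('my-db')   # placeholders

user_ids = ['user1', 'user2']   # the ~1500 keys to look up
with database.snapshot() as snapshot:
    rows = list(snapshot.execute_sql(
        "SELECT user_id, name, gender FROM users WHERE user_id IN UNNEST(@users)",
        params={'users': user_ids},
        param_types={'users': spanner.param_types.Array(spanner.param_types.STRING)},
    ))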

DynamoDB - Partition grouping or sharding?

So, looking through the DynamoDB docs, they often recommend that you "group" together related items in the same partition, so as to better distribute your partition usage.
Take the following example, where we have a user that has contacts and invoices inside its partition:
So, if I need all of user_001's invoices, I will simply query (pseudo):
QUERY WHERE PartitionKey = "user_001" AND SortKey.begins_with("invoice_")
But I recently noticed there's quite an issue when you use the method above.
You see, DynamoDB will search inside the whole user_001 partition for the invoices, and will consume read capacity based on all items searched, whether they were invoices or not.
This can end up being very inefficient if you have a partition that is too big. Let's say I had 10,000 contacts and 2 invoices; it could end up being very costly to get those 2 invoices.
I'm assuming this based on the following quote from the docs:
DynamoDB calculates the number of read capacity units consumed based on item size, not on the amount of data that is returned to an application
The solution:
Wouldn't this be a better approach?
1) It shards the data better, so I don't need to use begins_with
2) It allows me to use a time-based uuid as the sort key and enable more complex ordering/pagination
3) I will consume much less capacity on queries since it won't have to go through items I don't need
What's the question?
Well, what I said above is just theories and assumptions; the documentation doesn't make it clear how this really works behind the scenes, and it even recommends the layout in picture 1.
But I'm really thinking picture 2 is the better option here, especially when you consider that DynamoDB now distributes capacity adaptively across your partitions (and not evenly, like it used to).
So, are my points for thinking picture 2 being much better than 1 valid?
You have assumed incorrectly: the documentation you quoted applies to filter expressions.
If you have a condition that applies to your sort key, that should be part of the query expression, not a filter expression.
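In boto3 terms, the sort-key condition belongs in the KeyConditionExpression, so capacity is only consumed for the matching items; the table and key names below are the ones from the example and purely illustrative:
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('my-table')   # illustrative name

response = table.query(
    KeyConditionExpression=Key('PartitionKey').eq('user_001')
                           & Key('SortKey').begins_with('invoice_')
)
invoices = response['Items']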

Better method for querying DynamoDB table randomly?

I've included links to other answers, along with our approaches, which seem to be the most optimal on the web right now.
Our records need to be categorized (eg. "horror", "thriller", "tv"), and randomly accessible both in specific categories and across all/some categories. We generally need to access about 20 - 100 items at a time. We also have a smallish number of categories (less than 100).
We write to the database for uploading/removing content, although this is done in batches and does not need to be real time.
We have tried two different approaches, with two different data structures.
Approach 1
AWS DynamoDB - Pick a record/item randomly?
Help selecting nth record in query.
In short: use the category as the hash key and a UUID as the sort key. Generate a random UUID, query Dynamo using greater-than or less-than, and limit to 1. This is even suggested by an AWS employee in the second link. (We've also tried increasing the limit to the number of items we need, but this increases the probability of the query failing the first time around.) A sketch of this query appears at the end of this section.
Issues with this approach:
The first query can fail if the random UUID sorts above (or below) all of the stored UUIDs
Querying on any specific category will cause throttling at scale (Small number of partitions)
We've also considered adding a suffix to each category to artificially increase the number of partitions we have, as pointed out in the following link.
AWS Database Blog
Choosing the Right DynamoDB Partition Key
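Here is roughly what the Approach 1 query looks like with boto3, including a retry in the other direction when the random pivot sorts past every stored UUID; the table and attribute names are made up:
import uuid
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('content')   # table and attribute names are made up

def random_item(category):
    pivot = str(uuid.uuid4())
    resp = table.query(
        KeyConditionExpression=Key('category').eq(category) & Key('id').gt(pivot),
        Limit=1,
    )
    if not resp['Items']:
        # The pivot sorted above every stored UUID; retry in the other direction.
        resp = table.query(
            KeyConditionExpression=Key('category').eq(category) & Key('id').lt(pivot),
            ScanIndexForward=False,
            Limit=1,
        )
    return resp['Items'][0] if resp['Items'] else None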
Approach 2
Amazon Web Services: How do we get random item from the dynamoDb's table?
Doing something similar to this, where we concatenate the category with a sequential number, and use this as the hash key. e.g. horror-000001.
By knowing the number of records in each category, we're able to perform random queries across our entire data set, while also avoiding hot partitions/keys.
Issues with this approach
We need a secondary data structure to manage the sequential counts across each category
Writing (especially deleting) is significantly more complex, although this doesn't need to happen in real time.
Conclusion
Both approaches solve our main use case of random queries on category/categories, but the cons they offer are really deterring us from using them. We're leaning more towards approach #1 using suffixes to solve the hot partitioning issue, although we would need the additional retry logic for failed queries.
Is there a better way of approaching this problem? Specifically looking for solutions capable of scaling well (no Scan), without requiring extra resources to be implemented. #1 fits the bill, but needing to manage suffixes and failed attempts really deters us from using it, especially when it is being called inside a Lambda (billed for time used).
Thanks!
Follow Up
After more research and testing, my team has decided to move towards MySQL hosted on RDS for these tables. We learned that this is one of the few use cases where DynamoDB does not fit, and it would require rewriting our use case to fit the DB (bad).
We felt that the extra complexity required to integrate random sampling on DynamoDB wasn't worth it, and we were unable to come up with any comparable solutions. We are, however, sticking with DynamoDB for our tables that do not need random accessibility due to the price and response times.
For anyone wondering why we chose MySQL: it was largely due to the Node.js library available, great online resources (which DynamoDB definitely lacks), easy integration via RDS with our Lambdas, and the option to migrate to Amazon's Aurora database.
We also looked at PostgreSQL, but we weren't as happy with the client library or admin tools, and we believe that MySQL will suit our needs for these tables.
If anybody has anything else they'd like to add or a specific question please leave a comment or send me a message!
This was too long for a comment, and I guess it's pretty much a full fledged answer now.
Approach 2
I've found that my typical time to get a single item from dynamodb to a host in the same region is <10ms. As long as you're okay with at most 1-2 extra calls, you can quite easily implement approach 2.
If you use a keys only GSI where the category is your hash key and the primary key of the table is your range key, you can quickly find the largest numbered single item within a category.
When you add a new item, find the largest number for that category from the GSI and then write the new item to the table with sequence number n+1.
When you delete, find the item with the largest sequence number for that category from the GSI, copy it over the item you are deleting, and then delete the now-duplicated item at the highest sequence number.
To randomly get an item, query the GSI to find the highest numbered item in the category, and then randomly pick a number since you now know the valid range.
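A rough boto3 sketch of that read path, assuming a keys-only GSI named category-index (hash key category, range key the table's concatenated primary key, here called pk) and zero-padded sequence numbers starting at 1; all of these names are assumptions:
import random
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('content')   # table/index/key names are assumptions

def random_item(category):
    # Descending query: zero-padded keys sort so the first result holds the highest sequence number.
    resp = table.query(
        IndexName='category-index',
        KeyConditionExpression=Key('category').eq(category),
        ScanIndexForward=False,
        Limit=1,
    )
    if not resp['Items']:
        return None
    highest_key = resp['Items'][0]['pk']          # e.g. 'horror-000123'
    count = int(highest_key.rsplit('-', 1)[1])
    pick = random.randint(1, count)
    return table.get_item(Key={'pk': f'{category}-{pick:06d}'}).get('Item')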
Approach 1
I'm not sure exactly what you mean when you say "without requiring extra resources to be implemented". If you're okay with using a managed resource (no dev work to implement), you can also make Approach 1 work by putting a DAX cluster in front of your dynamodb table. Then you can query to your heart's content without really worrying about hot partitions. (Though the caching layer means that new/deleted items won't be reflected right away.)

Convert Django ORM query with large IN clause to table value constructor

I have a bit of Django code that builds a relatively complicated query programmatically, with various filters getting applied to an initial queryset through a series of filter and exclude calls:
for filter in filters:
    if filter['name'] == 'revenue':
        accounts = accounts.filter(account_revenue__in=filter['values'])
    if filter['name'] == 'some_other_type':
        if filter['type'] == 'inclusion':
            accounts = accounts.filter(account__some_relation__in=filter['values'])
        if filter['type'] == 'exclusion':
            accounts = accounts.exclude(account__some_relation__in=filter['values'])
    # ...etc
return accounts
For most of these conditions, the possible values of the filters are relatively small and contained, so the IN clauses that Django's ORM generates are performant enough. However there are a few cases where the IN clauses can be much larger (10K - 100K items).
In plain Postgres I can make this query much more efficient by using a table value constructor, e.g.:
SELECT domain
FROM accounts
INNER JOIN (
VALUES ('somedomain.com'), ('anotherdomain.com'), ...etc 10K more times
) VALS(v) ON accounts.domain=v
With a 30K+ IN clause in the original query it can take 60+ seconds to run, while the table value version of the query takes 1 second, a huge difference.
But I cannot figure out how to get Django ORM to build the query like I want, and because of the way all the other filters are constructed from ORM filters I can't really write the entire thing as raw SQL.
I was thinking I could get the raw SQL that Django's ORM is going to run and regex-parse it, but that seems very brittle (and it is surprisingly difficult to get the actual SQL that is about to be run, because of parameter handling etc.). I don't see how I could annotate with RawSQL, since I don't want to add a column to the select but instead want to add a join condition. Is there a simple solution I am missing?