I have a DynamoDB table. I have an index on the "Id" field.
I also have other fields - status, timestamp, etc. - but I don't have an index on these other fields.
The status can be "online", "offline", "in-progress", etc.
I want to get the count of the records based on the "status" and "Id" fields.
The user will pass the Id value, and the query needs to return the counts broken down by the status field, e.g.
"online" : 20
"offline" : 30
"in-progress" : 40
The query works fine.
As I understand it, the maximum size of a DynamoDB query result is 1 MB, and this limit applies before any FilterExpression is applied to the results.
Since the number of records in the table is huge (around 100k), I need to execute the query again and again, passing the "ExclusiveStartKey" parameter each time.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html#Query.Pagination
In fact, I need to run multiple queries (one per status value) in a loop to calculate the counts based on the "status" field.
Is there any efficient way to retrieve these counts?
I am thinking of extending the index to include the status field as well, which would eliminate the need for a filter expression.
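For reference, here is roughly what I am doing today; a minimal sketch assuming boto3, with the table, index, and attribute names being illustrative:

import boto3

client = boto3.client("dynamodb")

def count_by_status(record_id, status):
    """Count items for one Id/status pair, paging past the 1 MB limit."""
    total = 0
    kwargs = {
        "TableName": "MyTable",                        # illustrative name
        "IndexName": "Id-index",                       # illustrative name
        "KeyConditionExpression": "Id = :id",
        "FilterExpression": "#s = :status",
        "ExpressionAttributeNames": {"#s": "status"},  # "status" is a reserved word
        "ExpressionAttributeValues": {
            ":id": {"S": record_id},
            ":status": {"S": status},
        },
        "Select": "COUNT",                             # counts only, no items
    }
    while True:
        resp = client.query(**kwargs)
        total += resp["Count"]
        if "LastEvaluatedKey" not in resp:
            return total
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

counts = {s: count_by_status("some-id", s)
          for s in ("online", "offline", "in-progress")}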
If the field isn't indexed, you need to do a table scan to get the full count. You can parallelize the scan to make it faster, or just index it.
The response includes the fields ScannedCount and Count, but even if the field is indexed, you only get a complete count when the query result is under 1 MB.
If you have a lot of rows, or individual rows are big (the maximum item size is 400 KB), this matters: with 400 KB items you can scan only a couple of them before hitting the 1 MB limit, and you will get a count of just those. With small items you can scan through more in a single query. In any case, DynamoDB will not scan all the data to give you results in one go; you will get paginated results.
With a proper index your query won't need filters; without a good index you will do an index scan or table scan, probably with filters applied. But none of that works around the fact that a query always scans up to 1 MB of data and returns paginated results.
From the docs:
ScannedCount — The number of items that matched the key condition expression before a filter expression (if present) was applied.
Count — The number of items that remain after a filter expression (if present) was applied.
If the size of the Query result set is larger than 1 MB, ScannedCount and Count represent only a partial count of the total items. You need to perform multiple Query operations to retrieve all the results. Each Query response contains the ScannedCount and Count for the items that were processed by that particular Query request. To obtain grand totals for all of the Query requests, you could keep a running tally of both ScannedCount and Count.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html
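To make this concrete: if the index also covered status (a hypothetical GSI keyed on Id and status, as the question suggests), the filter expression disappears and each page's Count reflects exact matches only. A hedged sketch, assuming boto3 and invented names:

import boto3

client = boto3.client("dynamodb")

# Hypothetical GSI keyed on (Id, status): status moves into the key
# condition, so no FilterExpression and ScannedCount == Count per page.
kwargs = {
    "TableName": "MyTable",
    "IndexName": "Id-status-index",
    "KeyConditionExpression": "Id = :id AND #s = :status",
    "ExpressionAttributeNames": {"#s": "status"},
    "ExpressionAttributeValues": {":id": {"S": "some-id"},
                                  ":status": {"S": "online"}},
    "Select": "COUNT",
}
total = 0
while True:
    resp = client.query(**kwargs)
    total += resp["Count"]              # running tally across pages
    if "LastEvaluatedKey" not in resp:
        break
    kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
print(total)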
This may seem a silly question. The result returned from a DynamoDB query has Items and Count. Items is an array, which has a length property. Are Items.length and Count always the same?
I am using javascript SDK.
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/DynamoDB/DocumentClient.html#query-property
Yes, the length of Items and the Count should be the same.
A few other fun facts about counts:
Each Query response will contain the ScannedCount and Count for the items that were processed by that particular Query request. To obtain grand totals for all of the Query requests, you could keep a running tally of both ScannedCount and Count.
If the size of the Query result set is larger than 1 MB, then ScannedCount and Count will represent only a partial count of the total items. You will need to perform multiple Query operations in order to retrieve all of the results (see Paginating the Results).
Also, if you just care about the count and not the data, you can ask DynamoDB to only return the count via the Select property of the request.
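A minimal sketch of such a count-only request, assuming boto3 (the JavaScript DocumentClient accepts the same Select parameter); table and key names are illustrative. With Select set to "COUNT", the response carries Count and ScannedCount but no Items array at all:

import boto3

client = boto3.client("dynamodb")

resp = client.query(
    TableName="MyTable",                        # illustrative names
    KeyConditionExpression="Id = :id",
    ExpressionAttributeValues={":id": {"S": "some-id"}},
    Select="COUNT",                             # ask for counts only
)
print(resp["Count"], resp["ScannedCount"])      # resp has no "Items" key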
Is there an efficient way to get at least n records for a given filter and order? More specifically, I want to get a list of all entries in a model that have a certain date field greater than a month ago; but if there are fewer than 10 entries matching that filter, I want at least 10 entries by relaxing the filter. I can do this by getting the count, checking it, and querying again, but I was wondering if there is a better way to do it.
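The two-step version I have in mind looks roughly like this; a sketch assuming boto3, with invented table and attribute names, and assuming the date field is the sort key:

import boto3
from datetime import datetime, timedelta, timezone

client = boto3.client("dynamodb")
MIN_RESULTS = 10
month_ago = (datetime.now(timezone.utc) - timedelta(days=30)).isoformat()

# First pass: only entries newer than a month ago.
resp = client.query(
    TableName="Entries",                        # invented names throughout
    KeyConditionExpression="pk = :pk AND entryDate > :d",
    ExpressionAttributeValues={":pk": {"S": "some-pk"},
                               ":d": {"S": month_ago}},
)
items = resp["Items"]

# Second pass: too few matches, so relax the date condition and take
# the MIN_RESULTS most recent entries instead.
if len(items) < MIN_RESULTS:
    resp = client.query(
        TableName="Entries",
        KeyConditionExpression="pk = :pk",
        ExpressionAttributeValues={":pk": {"S": "some-pk"}},
        ScanIndexForward=False,                 # newest first
        Limit=MIN_RESULTS,
    )
    items = resp["Items"]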
I thought of this scenario while querying/scanning a DynamoDB table.
What if I want to get a single item from a table that holds 20k items, and the item I'm looking for sits at around the 19,000th row? I'm using Scan with a Limit of 1000, for example. Does each Scan page consume throughput even when it returns no items, as the first 18 pages here would? For instance,
I have a User table:
type UserTable {
  userId: ID!
  username: String
  password: String
}
then my query
var params = {
  TableName: "UserTable",
  FilterExpression: "username = :username",
  ExpressionAttributeValues: {
    ":username": username
  },
  Limit: 1000
};
How do I handle this effectively?
According to the docs:
A Scan operation always scans the entire table or secondary index. It then filters out values to provide the result you want, essentially adding the extra step of removing data from the result set.
Performance
If possible, you should avoid using a Scan operation on a large table or index with a filter that removes many results. Also, as a table or index grows, the Scan operation slows.
Read units
The Scan operation examines every item for the requested values and can use up the provisioned throughput for a large table or index in a single operation. For faster response times, design your tables and indexes so that your applications can use Query instead of Scan.
For better performance and less read-unit consumption, I advise you to create a GSI and use it with Query.
A Scan operation looks at the entire table and visits all records to find out which of them match your filter criteria, so it consumes enough throughput to read all of the visited records. A Scan operation is also very slow, especially if the table is large.
To your second question: you can create a secondary index on the table with username as the hash key. Then you can convert the Scan operation into a Query, which will only consume enough throughput to fetch the matching record.
Read about Secondary Indices Here
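A sketch of what the converted Query could look like, assuming a hypothetical GSI named username-index with username as its hash key:

import boto3

client = boto3.client("dynamodb")

# Query the hypothetical GSI instead of scanning all 20k items; only
# the matching record's read throughput is consumed.
resp = client.query(
    TableName="UserTable",
    IndexName="username-index",                 # hypothetical GSI
    KeyConditionExpression="username = :u",
    ExpressionAttributeValues={":u": {"S": "alice"}},
)
user = resp["Items"][0] if resp["Items"] else None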
In DynamoDB, is there a way to guarantee that exactly n results will be returned if I specify a limit and a filter?
The problem I see is that the docs state:
In a response, DynamoDB returns all the matching results within the scope of the Limit value. For example, if you issue a Query or a Scan request with a Limit value of 6 and without a filter expression, DynamoDB returns the first six items in the table that match the specified key conditions in the request (or just the first six items in the case of a Scan with no filter). If you also supply a FilterExpression value, DynamoDB will return the items in the first six that also match the filter requirements (the number of results returned will be less than or equal to 6).
So this means six items will be retrieved and then the filter applied. How can I keep searching until I get exactly six items? (Ideally there would be some setting in the query to keep going until the limit is reached, or the table is exhausted.)
For example, suppose I make a query to get 50 people whose name is "john". Dynamo would read 50 people and then apply the "john" filter, so perhaps only 3 people are returned.
Is there a way I can ensure it will keep searching until the limit of 50 is satisfied?
I don't want to use a Scan, since a Scan always searches every item in the table (regardless of limit; correct me if I'm wrong on this).
How can I make the query apply its filter lazily until the Limit is satisfied? How can I keep searching until the Limit is satisfied?
If you can filter in the query itself (i.e., in the key condition), that's best, since you wouldn't need a filter expression at all. If you can't, then given the way Dynamo works, I suspect the filter is just a pass over the results: basically a way to save on bandwidth, not much more. You can still use pagination to get more results; and if you're using Dynamo, you probably care about the rate at which you're querying, so having control over how many queries you're actually doing (and their size) is a good thing.
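As far as I know there is no built-in setting for this; the usual workaround is a client-side loop that follows LastEvaluatedKey until enough filtered matches have accumulated or the data runs out. A sketch, assuming boto3 and invented names:

import boto3

client = boto3.client("dynamodb")

def query_at_least(n, pk, name):
    """Keep paging until n items survive the filter or the data runs out."""
    matches = []
    kwargs = {
        "TableName": "People",                          # invented names
        "KeyConditionExpression": "pk = :pk",
        "FilterExpression": "firstName = :n",
        "ExpressionAttributeValues": {":pk": {"S": pk},
                                      ":n": {"S": name}},
    }
    while len(matches) < n:
        resp = client.query(**kwargs)
        matches.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:
            break                                       # exhausted
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
    return matches[:n]

people = query_at_least(50, "some-pk", "john")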
My table has 77k entries (and the number of entries keeps increasing at a high rate), and I need to make a select query in CQL 3. When I do select count(*) ... where (some_conditions) allow filtering, I get:
count
-------
10000
(1 rows)
Default LIMIT of 10000 was used. Specify your own LIMIT clause to get more results.
Let's say 23k rows satisfy this some_condition. The 10000 count above covers only the first 10k of these 23k rows, right? So how do I get the actual count?
More importantly, how do I get access to all of these 23k rows, so that my Python API can perform some in-memory operations on the data in some of the columns? Is there some sort of pagination principle in Cassandra CQL 3?
I know I can just increase the limit to a very large number, but that's not efficient.
Working Hard is right, and LIMIT is probably what you want. But if you want to "page" through your results at a more detailed level, read through this DataStax document titled: Paging through unordered partitioner results.
This will involve using the token function on your partitioning key. If you want more detailed help than that, you'll have to post your schema.
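A minimal sketch of that token-based paging with the Python driver; the contact point, keyspace, table name, and partition key column are all placeholders:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])              # placeholder contact point
session = cluster.connect("my_keyspace")      # placeholder keyspace

total = 0
rows = list(session.execute("SELECT id FROM my_table LIMIT 1000"))
while rows:
    total += len(rows)                        # or process the page in memory
    last_id = rows[-1].id
    # Next page: everything whose partition token is past the last one seen.
    rows = list(session.execute(
        "SELECT id FROM my_table WHERE token(id) > token(%s) LIMIT 1000",
        (last_id,)))
print(total)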
While I cannot see your complete table schema, by virtue of the fact that you are using ALLOW FILTERING I can tell that you are doing something wrong. Cassandra was not designed to serve data based on multiple secondary indexes. That approach may work with a RDBMS, but over time that query will get really slow. You should really design a column family (table) to suit each query you intend to use frequently. ALLOW FILTERING is not a long-term solution, and should never be used in a production system.
You just have to specify a LIMIT in your query.
Let's assume your table contains fewer than 100k records; then executing the query below will give you the actual count of the records in the table.
select count(*) ... where (some_conditions) allow filtering limit 100000;
Another way is to write Python code; cqlsh itself is in fact a Python script.
Use:
statement = " select count(*) from SOME_TABLE"
future = session.execute_async(statement)
rows = future.result()
count = 0
for row in rows:
count = count + 1
The code above relies on the Cassandra Python driver's automatic paging: the result set is fetched page by page as you iterate over it.