Cassandra get more than 10k rows - django

I am getting stuck with Cassandra all() query.
I am using the Django platform. My query is to get all rows from Cassandra table. But, CQL has some limit to 10k rows at a time.
Before, I have less than 10k rows in Cassandra table. But, now the count has increased up-to 12k.
How do I get the all() query to return all 12k rows?

CQL have a default limitation to 10k rows. That means there's an implicit limit to 10k when you perform any SELECT. If you want you can override that by specifying a new LIMIT value, eg:
SELECT * FROM mytable LIMIT 500000;

Related

BQ google update columns - Exceeded rate limits

Scenario is to update column descriptions in tables(About 1500 columns in 50 tables). Due to multiple restrictions I have been asked to use the bq query command to execute the ALTER TABLE sql for updating column descriptions, thorugh cloud CLI. query -
bq query --nouse_legacy_sql \ 'ALTER TABLE `<Table>` ALTER COLUMN <columnname> SET OPTIONS(DESCRIPTION="<Updated Description>")';
Issue is if I bunch the bq queries together for 1500 columns it is 1500 sql statements.
This is causing the standard Exceeded rate limits: too many table update operations for this table error.
Any suggestions on how to execute it better.
You are hitting the rate limit:
Maximum rate of table metadata update operations per table: 5 operations per 10 seconds
You will need to stagger the updates to make sure it happens in batch of 5 operations per 10 seconds. You could also try to alter all the columns in a single table with a single statement to reduce the number of calls required.

Redshift slow select query

All,
I have a fact table in redshift with around 90 million rows, max columns are integers and have AUTO sort key and EVEN dist key. 2 nodes cluster. Running a simple select statement taking forever and aborting. Any help.
select * from facts_invoice
Basically trying to feed this data to Powerbi and seems like the slowness coming from Redshift itself. In Snowflake, I used 200 Billions select * before and never took more than 10-15 minutes.

BigQuery - querying only a subset of keys in a table with key value schema

So I have a table with the following schema:
timestamp: TIMESTAMP
key: STRING
value: FLOAT
There are around 200 unique keys. I am partitioning the dataset by date.
I want to run several (5-6 currently, but I expect to add at least 15 more) queries on a daily basis on this database. Brute forcing these would cost me a lot daily, which I want to avoid.
The issue is that because of this key - value format, and BigQuery being a columnar database, each query queries the whole day's data, despite each query actually using a maximum of 4 keys. What is a best way to optimize this?
I am thinking the best way I can go about it right now is to create separate temp tables for each key as a daily batch process, run my queries on them and then delete them.
Ideal way I would want to go about it is partitioning by key, I am not sure there is any such provision?
You can try using recently introduced clustering partitioned tables
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
Update (moved from comments)
Also have in mind below
Feature Partitioning Clustering
--------------- ------------- -------------
Cardinality Less than 10k Unlimited
Dry Run Pricing Available Not available
Query Pricing Exact Best Effort
Pay special attention to Dry Run Pricing - unfortunately - clustered tables do not support dry run (validation) based on clustered keys - and rather show only validation based on partitions. but if you set your clustering properly - actual run will end up with lower cost. you should try with smaller data to get comfortable with this
See more at Clustering partitioned tables

Can Spanner maintain indexes to easily count analytics queries of my data?

I'd like to give my partners the results of simple COUNT(*) ... GROUP BY items.color type queries and perhaps joins over items and orders or some such. I'd like query response time to be sub-second (on the order of a second, at worst), and scale to billions of rows counted.
My current approach is to either backup my GCDatastore data and load it into BigQuery and provide daily analytics or use GCDataflow to maintain a set of pre-defined counters.
Is this something Spanner has as a use-case for, if I transition my backend from Datastore to Spanner?
Today, running counting queries in Cloud Spanner requires a full table scan. Depending on the size of the table this could take more than a second.
One thing you could do is to track the count in a separate table, and whenever you update the items table, update the count in the same transaction.

Cloud spanner: select count * takes over a minute

I created a test table in cloud spanner and populated it with 120 million rows. i have created a composite primary key for the table.
when i run a simple "select count(*) from " query, it takes approximately a minute for cloud spanner web UI to return results.
Is anyone else facing similar problem?
Cloud Spanner does not materialize counts, so queries will like "select count(*) ...." will scan the entire table to return the count of rows, hence the higher time to execute.
If you require faster counts, recommend keeping a sharded counter updated transactionally with changes to the table.
#samiz - you answer "recommend keeping a sharded counter updated transactionally with changes to the table"
how can we detect how many sharded counter need for the table? there is no retry n transaction...
thank you