I created a test table in Cloud Spanner and populated it with 120 million rows. The table has a composite primary key.
When I run a simple "select count(*) from <table>" query, it takes approximately a minute for the Cloud Spanner web UI to return results.
Is anyone else facing a similar problem?
Cloud Spanner does not materialize counts, so queries like "select count(*) ..." will scan the entire table to return the count of rows, hence the long execution time.
If you require faster counts, I recommend keeping a sharded counter that is updated transactionally with changes to the table.
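As a minimal sketch of that approach using the Python client (google-cloud-spanner): the instance and database IDs, the Items and RowCounters tables, and the shard count of 16 are all assumptions for illustration, not part of the original answer.

from google.cloud import spanner
import random

NUM_SHARDS = 16  # assumption: size this to your peak write throughput

client = spanner.Client()
database = client.instance("my-instance").database("my-db")  # hypothetical IDs

def insert_row_and_bump_counter(transaction, row_id, payload):
    # Insert into the (hypothetical) base table.
    transaction.execute_update(
        "INSERT INTO Items (Id, Payload) VALUES (@id, @payload)",
        params={"id": row_id, "payload": payload},
        param_types={"id": spanner.param_types.INT64,
                     "payload": spanner.param_types.STRING},
    )
    # Increment a randomly chosen shard in the same transaction so the
    # counter stays consistent with the table contents.
    shard = random.randint(0, NUM_SHARDS - 1)
    transaction.execute_update(
        "UPDATE RowCounters SET Cnt = Cnt + 1 WHERE Shard = @shard",
        params={"shard": shard},
        param_types={"shard": spanner.param_types.INT64},
    )

# run_in_transaction automatically retries the function on transient aborts.
database.run_in_transaction(insert_row_and_bump_counter, 123, "hello")

The fast count is then SELECT SUM(Cnt) FROM RowCounters, which reads at most NUM_SHARDS rows instead of scanning 120 million. Sharding the counter avoids a write hotspot: with a single counter row, every insert would contend on the same cell.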
@samiz - in your answer you recommend "keeping a sharded counter updated transactionally with changes to the table".
How can we determine how many counter shards the table needs? And there is no retry on the transaction...
Thank you
The scenario is to update column descriptions in tables (about 1500 columns across 50 tables). Due to multiple restrictions, I have been asked to use the bq query command, through the Cloud CLI, to execute the ALTER TABLE SQL that updates the column descriptions. The query:
bq query --nouse_legacy_sql \
  'ALTER TABLE `<Table>` ALTER COLUMN <columnname> SET OPTIONS(DESCRIPTION="<Updated Description>")'
The issue is that if I bunch the bq queries together for the 1500 columns, that is 1500 SQL statements.
This causes the standard "Exceeded rate limits: too many table update operations for this table" error.
Any suggestions on how to execute this better?
You are hitting the rate limit:
Maximum rate of table metadata update operations per table: 5 operations per 10 seconds
You will need to stagger the updates so that they happen in batches of at most 5 operations per table per 10 seconds. You could also try to alter all the columns of a single table in one statement to reduce the number of calls required.
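A rough sketch of the staggered approach in Python, assuming a hypothetical updates worklist; the 2-second sleep keeps each table within the documented 5-operations-per-10-seconds quota:

import subprocess
import time

# Hypothetical worklist: (table, column, description) triples, ~1500 entries.
updates = [
    ("mydataset.my_table", "col_a", "Updated description for col_a"),
    # ...
]

for table, column, description in updates:
    sql = (
        f'ALTER TABLE `{table}` '
        f'ALTER COLUMN {column} '
        f'SET OPTIONS(description="{description}")'
    )
    subprocess.run(["bq", "query", "--nouse_legacy_sql", sql], check=True)
    # At most 5 metadata updates per table per 10 seconds => 1 every 2 s is safe.
    time.sleep(2)

Since the limit is per table, interleaving updates across your 50 tables would let the job finish faster while each individual table stays within its quota.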
All,
I have a fact table in Redshift with around 90 million rows; most columns are integers, and the table has an AUTO sort key and an EVEN dist key, on a 2-node cluster. Running a simple select statement takes forever and then aborts. Any help?
select * from facts_invoice
Basically I am trying to feed this data to Power BI, and it seems the slowness is coming from Redshift itself. In Snowflake, I have run select * over 200 billion rows before and it never took more than 10-15 minutes.
I have millions of records in a Spanner table and I would like to delete rows using a query condition, for example: delete from spanner table where id > 2000. I'm not able to run this query in the Spanner UI because of Spanner's 20k mutation limit in a single operation. Is there any way I can delete these records by making some tweaks at the API level, or is there a workaround for this type of use case?
You can use the gcloud command line as follows:
gcloud spanner databases execute-sql <database_id> --instance=<instance_id> --enable-partitioned-dml --sql="delete from YourTable where id > 2000"
NOTE: the SQL query must be fully partitionable and idempotent.
According to the official documentation, Deleting rows in a table, I think you should consider the Partitioned DML execution model:
If you want to delete a large amount of data, you should use Partitioned DML, because Partitioned DML handles transaction limits and is optimized to handle large-scale deletions.

Partitioned DML enables large-scale, database-wide operations with minimal impact on concurrent transaction processing by partitioning the key space and running the statement over partitions in separate, smaller-scoped transactions.
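If you want to do this at the API level, as the question asks, here is a minimal sketch with the Python client's execute_partitioned_dml; the instance, database, and table names are placeholders:

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")  # hypothetical IDs

# Runs as Partitioned DML: Spanner partitions the key space and applies the
# delete in separate, smaller-scoped transactions, so a single transaction's
# 20k mutation limit is never hit.
row_count = database.execute_partitioned_dml(
    "DELETE FROM YourTable WHERE id > 2000"
)
print(f"{row_count} row(s) deleted")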
I have a very basic Azure SQL Data Warehouse set up for test purposes at DWU100. It has one table in it with 60 million rows. I run a query of the form:
SELECT SUM(TheValue), GroupId
FROM [dbo].[Fact_TestTable]
GROUP BY GroupId
Running this query takes 5 seconds.
Running the same query on a DTU 250 SQL database (equivalent by price), I get an execution time of 1 second.
I'm assuming there must be things I can do to speed this up; can anyone suggest what I can do to improve this?
The group by GroupId above is just an example, I can't assume people will always group by any one particular column.
Based on your question, it's not clear how your table is designed - are you using a ROUND_ROBIN or a HASH distributed table design? If you did not choose a distribution type during table creation, the default design is round-robin. Given your query, choosing a HASH distributed table design would likely lead to improved query execution time, as this query would be converted to a local-global aggregation type of query. It's hard to say exactly what is happening, given you did not share the query plan.
Below is a link to SQL DW documentation that talks about various table design options.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse?view=aps-pdw-2016-au7
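For illustration only (not from the original answer), recreating the table hash-distributed on GroupId via CTAS might look like the following Python sketch; the server, credentials, and index choice are placeholders, and autocommit is on because CTAS cannot run inside a user-defined transaction:

import pyodbc

# Hypothetical connection details for the SQL DW (dedicated SQL pool).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydw;UID=user;PWD=secret",
    autocommit=True,
)
cursor = conn.cursor()

# Hash-distributing on GroupId lets GROUP BY GroupId aggregate locally on
# each distribution before a final global merge.
cursor.execute("""
    CREATE TABLE dbo.Fact_TestTable_Hash
    WITH (DISTRIBUTION = HASH(GroupId), CLUSTERED COLUMNSTORE INDEX)
    AS SELECT * FROM dbo.Fact_TestTable
""")

Since you noted people won't always group by any one particular column, keep in mind that hashing on GroupId only speeds up queries that aggregate or join on that column; pick the most common aggregation/join key as the hash column.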
hope this helps, Igor
I'd like to give my partners the results of simple COUNT(*) ... GROUP BY items.color type queries and perhaps joins over items and orders or some such. I'd like query response time to be sub-second (on the order of a second, at worst), and scale to billions of rows counted.
My current approach is either to back up my GCDatastore data and load it into BigQuery to provide daily analytics, or to use GCDataflow to maintain a set of pre-defined counters.
Is this a use case Spanner is designed for, if I transition my backend from Datastore to Spanner?
Today, running counting queries in Cloud Spanner requires a full table scan. Depending on the size of the table this could take more than a second.
One thing you could do is to track the count in a separate table, and whenever you update the items table, update the count in the same transaction.
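A minimal sketch of that approach with the Python client, assuming a hypothetical per-color counter table item_counts with one row per color seeded up front:

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")  # hypothetical IDs

def insert_item(transaction, item_id, color):
    # Insert the item and bump its color's count atomically, in one transaction.
    transaction.insert("items", columns=("id", "color"), values=[(item_id, color)])
    transaction.execute_update(
        "UPDATE item_counts SET n = n + 1 WHERE color = @color",
        params={"color": color},
        param_types={"color": spanner.param_types.STRING},
    )

database.run_in_transaction(insert_item, 42, "red")

Your COUNT(*) ... GROUP BY items.color then becomes a read of the small item_counts table (SELECT color, n FROM item_counts), which stays sub-second no matter how large items grows.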