Best practice for using FORCE_INDEX on Spanner - google-cloud-platform

I have a client application which queries data in Spanner.
Let's say I have a table with 10 columns and my client application can search on a combination of columns. Let's say I've added 5 indexes to optimise searching.
The documentation at https://cloud.google.com/spanner/docs/sql-best-practices#secondary-indexes says:
In this scenario, Spanner automatically uses the secondary index SingersByLastName when executing the query (as long as three days have passed since database creation; see A note about new databases). However, it's best to explicitly tell Spanner to use that index by specifying an index directive in the FROM clause:
And https://cloud.google.com/spanner/docs/secondary-indexes#index-directive suggests:
When you use SQL to query a Spanner table, Spanner automatically uses any indexes that are likely to make the query more efficient. As a result, you don't need to specify an index for SQL queries. However, for queries that are critical for your workload, Google advises you to use FORCE_INDEX directives in your SQL statements for more consistent performance.
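For reference, the directive is written inline in the FROM clause; a minimal sketch using the Singers/SingersByLastName example from the linked docs:

-- Pin the query to the SingersByLastName secondary index.
SELECT s.SingerId, s.FirstName, s.LastName
FROM Singers@{FORCE_INDEX=SingersByLastName} AS s
WHERE s.LastName = 'Smith';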
Both links suggest YOU (the developer) should be supplying FORCE_INDEX on your queries. This means I now need business logic in my client to say something like:
If (object.SearchTermOne)
    queryBuilder.IndexToUse = "Idx_SearchTermOne"
This feels like I'm essentially trying to do the job of the optimiser by setting the index to use. It also means that if I add an extra index, I need a code change to make use of it.
So what are the best practices when it comes to using FORCE_INDEX in Spanner queries?

At this time, the best practice is to use FORCE_INDEX as described in the documentation.
This feels like I'm essentially trying to do the job of the optimiser by setting the index to use.
I feel the same.
https://cloud.google.com/spanner/docs/secondary-indexes#index-directive
Note: The query optimizer requires up to three days to collect the databases statistics required to select a secondary index for a SQL query. During this time, Cloud Spanner will not automatically use any indexes.
As noted there, even after enough data has been added for the index to be effective, it may take up to three days for the optimizer to figure that out.
Queries during that time will probably be full scans.
If you want to avoid this without using FORCE_INDEX, you will need to run the ANALYZE DDL statement manually.
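A minimal sketch of doing that, assuming your tooling lets you submit DDL statements (ANALYZE is issued as DDL, e.g. through an UpdateDatabaseDdl call or gcloud):

-- Start building a new query optimizer statistics package manually.
ANALYZE;

-- If I recall correctly, the available statistics packages can then be inspected with:
SELECT * FROM INFORMATION_SCHEMA.SPANNER_STATISTICS;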
https://cloud.google.com/blog/products/databases/a-technical-overview-of-cloud-spanners-query-optimizer
But none of this changes the fact that we are essentially trying to do the optimizer's job...

Related

How to Speed up AWS Timestream query performance?

The AWS Timestream database is queried using the Grafana API and the results are shown on dashboards.
Everything works well when we query for fewer data points, but my queries fail when I query too much data, i.e. 1-2 months of data for 100 or more dimensions; the query fails while fetching the data.
As stated in the AWS Timestream docs, there are some best practices that, if followed, will make your queries quite fast. I can vouch that, by obeying those rules, you can return a huge data set (4M records) in under 40s.
In addition to the guidance below, I would also suggest avoiding high-cardinality dimensions. To explain: if you have a dimension, like time or something else that grows indefinitely, the indexes on this dimension will get out of hand and, soon, your queries will be too slow to be useful.
The original document can be found here (some links in the list did not paste through; consult the doc).
Following are suggested best practices for queries with Amazon Timestream (a sketch of a query that combines several of them follows the list).
- Include only the measure and dimension names essential to the query. Adding extraneous columns will increase data scans, which impacts the performance of queries.
- Where possible, push the data computation to Timestream using the built-in aggregates and scalar functions in the SELECT clause and WHERE clause as applicable to improve query performance and reduce cost. See SELECT and Aggregate functions.
- Where possible, use approximate functions. E.g., use APPROX_DISTINCT instead of COUNT(DISTINCT column_name) to optimize query performance and reduce the query cost. See Aggregate functions.
- Use a CASE expression to perform complex aggregations instead of selecting from the same table multiple times. See The CASE statement.
- Where possible, include a time range in the WHERE clause of your query. This optimizes query performance and costs. For example, if you only need the last one hour of data in your dataset, then include a time predicate such as time > ago(1h). See SELECT and Interval and duration.
- When a query accesses a subset of measures in a table, always include the measure names in the WHERE clause of the query.
- Where possible, use the equality operator when comparing dimensions and measures in the WHERE clause of a query. An equality predicate on dimensions and measure names allows for improved query performance and reduced query costs.
- Wherever possible, avoid using functions in the WHERE clause to optimize for cost.
- Refrain from using the LIKE clause multiple times. Rather, use regular expressions when you are filtering for multiple values on a string column. See Regular expression functions.
- Only use the necessary columns in the GROUP BY clause of a query.
- If the query result needs to be in a specific order, explicitly specify that order in the ORDER BY clause of the outermost query. If your query result does not require ordering, avoid using an ORDER BY clause to improve query performance.
- Use a LIMIT clause if you only need the first N rows in your query.
- If you are using an ORDER BY clause to look at the top or bottom N values, use a LIMIT clause to reduce the query costs.
- Use the pagination token from the returned response to retrieve the query results. For more information, see Query.
- If you've started running a query and realize that the query will not return the results you're looking for, cancel the query to save cost. For more information, see CancelQuery.
- If your application experiences throttling, continue sending data to Amazon Timestream at the same rate to enable Amazon Timestream to auto-scale to satisfy the query throughput needs of your application.
- If the query concurrency requirements of your applications exceed the default limits of Timestream, contact AWS Support for limit increases.
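As an illustration only, here is a hedged sketch of a query that applies several of the practices above (bounded time range, measure_name filter, equality predicate on a dimension, approximate aggregate, LIMIT). The database, table, measure and dimension names (metrics, cpu, cpu_utilization, region, host) are made up for the example:

-- Hypothetical schema: database "metrics", table "cpu", dimensions "region" and "host",
-- single measure "cpu_utilization" stored as a double.
SELECT region,
       APPROX_DISTINCT(host) AS approx_hosts,   -- approximate instead of COUNT(DISTINCT ...)
       AVG(measure_value::double) AS avg_cpu
FROM "metrics"."cpu"
WHERE time > ago(1h)                            -- bound the time range
  AND measure_name = 'cpu_utilization'          -- name the measures you need
  AND region = 'us-east-1'                      -- equality predicate on a dimension
GROUP BY region
ORDER BY avg_cpu DESC
LIMIT 10;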

Safely segregating customer data in Spanner

We're exploring options for reliably segregating customer data in Spanner. The most obvious solution is a customer per database, but the 100 database/instance limitation renders that impractical. Past experience leads me to be very suspicious of any plan to add a customer-id field to the primary key of each table, because it's far too easy to screw that up in SQL queries, leading to dangerous data cross-talk.
I'm considering weird solutions like using all 2k tables/instance, and taking the ~32 tables we need per customer and prefixing those. E.g., [cust-id]-Table1, [cust-id]-Table2, etc. At least then the customer segregation logic that needs to be iron-clad can be put in one place that's hard to screw up in queries. But is anyone aware of a less weird approach? E.g., "100" is a suspiciously-non-round number in a technical limitation -- is that adjustable somehow?
Unfortunately, 100 databases/instance is not an adjustable value.
Though, I don't fully understand "very suspicious of any plan to add a customer-id field to the primary key of each table, because it's far too easy to screw that up in SQL queries, leading to dangerous data cross-talk." Are you concerned about query performance, data correctness, code correctness or the schema?
With this schema, ~32 tables per customer will only allow you to store ~6000 customers. That said, I would suggest benchmarking other schema choices Spanner exposes.
Would you be able to provide a high-level schema of these customer tables, as well as your query patterns?
Also, I suggest reading the following for more ideas that may fit your use case better:
Spanner Schema
Interleaved Tables
Secondary Indexes
SQL Best Practices
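If it helps to weigh the option you're suspicious of, here is a hedged sketch of the customer-id-in-the-primary-key approach with interleaved tables (table and column names are made up for illustration; interleaving co-locates each customer's child rows and forces every child key to start with CustomerId):

-- Hypothetical multi-tenant layout: every table keys on CustomerId first.
CREATE TABLE Customers (
  CustomerId STRING(36) NOT NULL,
  Name       STRING(MAX),
) PRIMARY KEY (CustomerId);

-- Child rows are stored physically under their parent customer row.
CREATE TABLE Orders (
  CustomerId STRING(36) NOT NULL,
  OrderId    STRING(36) NOT NULL,
  Total      NUMERIC,
) PRIMARY KEY (CustomerId, OrderId),
  INTERLEAVE IN PARENT Customers ON DELETE CASCADE;

Every query then has to carry a CustomerId predicate (e.g. WHERE CustomerId = @customerId), which is exactly the cross-talk risk you describe, so it usually needs to be enforced in one shared data-access layer rather than repeated in each query.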

Identifying needed statistics - Azure SQL Data Warehouse

Is there any hint or directive that can be used with EXPLAIN of a query on Azure SQL Data Warehouse that would return recommended statistics that were not available to the optimizer? Alternatively, is there a tool that can analyze a workload and make recommendations?
Today, no. Right now the recommendation is to create statistics on every column, as these are needed to create an optimal parallel query plan (i.e. how to move data around between nodes to return a result, since it's an MPP architecture).
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices#maintain-statistics
An example of how to script this out can be found here as well (example H).
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-statistics#examples-create-statistics
As you know, statistics should be created (according to this article):
on columns involved in JOINs, GROUP BY, HAVING and WHERE clauses.
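As a minimal sketch (the table and column names here are hypothetical), single-column statistics on join and filter columns look like this:

-- Hypothetical fact table; create statistics on the columns used in
-- JOIN, GROUP BY, HAVING and WHERE clauses.
CREATE STATISTICS stats_FactSales_CustomerKey ON dbo.FactSales (CustomerKey);
CREATE STATISTICS stats_FactSales_OrderDate ON dbo.FactSales (OrderDate);

-- Refresh after significant data changes so the optimizer sees current distributions.
UPDATE STATISTICS dbo.FactSales;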
There are no tools to do this (yet), but if you have access to the EXPLAIN plans they give you certain information. For example the shuffle_columns element lists all columns involved in a SHUFFLE_MOVE:
<shuffle_columns>col;</shuffle_columns>
as well as myriad other information. Review the annotation I did of an Azure SQL Data Warehouse plan here.
Lastly (and I haven't actually done this, I've only been thinking about doing it), you could set up a copy of your database on SQL Server 2016, bearing in mind the syntax differences (e.g. distribution, lack of unique indexes, etc.). This would give you access to certain useful resources like execution plans, including index suggestions, and certain trace flags which tell you what stats were used. I mean, the database engines and indexing are really different, so I don't know how worthwhile this might be. I'll post back if I progress my thinking on this. I do find the question "Why is this query going slow?" much harder to answer on this platform than on ordinary "box product" SQL Server, because the tools aren't as mature yet.

BigQuery tabledata:list output into a BigQuery table

I know there is a way to place the results of a query into a table; there is a way to copy a whole table into another table; and there is a way to list a table piecemeal (tabledata:list using startIndex, maxResults and pageToken).
However, what I want to do is go over an existing table with tabledata:list and output the results piecemeal into other tables. I want to use this as an efficient way to shard a table.
I cannot find a reference to such a functionality, or any workaround to it for that matter.
Important to realize: the Tabledata.List API is not part of BigQuery SQL, but rather part of the BigQuery API, which you can use from the client of your choice.
That said, the logic you outlined in your question can be implemented in many ways; below is an example (high-level steps):
Call Tabledata.List within a loop, using pageToken for the next iteration or for exiting the loop.
In each iteration, process the response from Tabledata.List, extract the actual data, and insert it into the destination table using streaming inserts with the Tabledata.InsertAll API. You can also have an inner loop that goes through the rows extracted in a given iteration and decides which row goes to which table/shard.
This is very generic logic, and the particular implementation depends on the client you use.
Hope this helps
For what you describe, I'd suggest you use the batch version of Cloud Dataflow:
https://cloud.google.com/dataflow/
Dataflow already supports BigQuery tables as sources and sinks, and will keep all data within Google's network. This approach also scales to arbitrarily large tables.
TableData.list-ing your entire table might work fine for small tables, but network overhead aside, it is definitely not recommended for anything of moderate size.

Cassandra NOT EQUAL Operator

Question to all Cassandra experts out there.
I have a column family with about a million records.
I would like to query these records in such a way that I should be able to perform a Not-Equal-To kind of operation.
I Googled this and it seems I have to use some sort of MapReduce.
Can somebody tell me what options are available in this regard?
I can suggest a few approaches.
1) If you have a limited number of values that you would like to test for not-equality, consider modeling those as boolean columns (e.g. a column isEqualToUnitedStates with true or false).
2) Otherwise, consider emulating the unsupported query != X by combining the results of two separate queries, < X and > X, on the client side (see the sketch after this list).
3) If your schema cannot support either type of query above, you may have to resort to writing custom routines that will do client-side filtering and construct the not-equal set dynamically. This will work if you can first narrow down your search space to manageable proportions, such that it's relatively cheap to run the query without the not-equal.
So let's say you're interested in all purchases of a particular customer of every product type except Widget. An ideal query could look something like SELECT * FROM purchases WHERE customer = 'Bob' AND item != 'Widget'; Now of course, you cannot run this, but in this case you should be able to run SELECT * FROM purchases WHERE customer = 'Bob' without wasting too many resources and filter item != 'Widget' in the client application.
4) Finally, if there is no way to restrict the data in a meaningful way before doing the scan (querying without the equality check would returning too many rows to handle comfortably), you may have to resort to MapReduce. This means running a distributed job that would scan all rows in the table across the cluster. Such jobs will obviously run a lot slower than native queries, and are quite complex to set up. If you want to go this way, please look into Cassandra Hadoop integration.
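For option 2, a hedged sketch using the purchases example above, assuming customer is the partition key and item is a clustering column (CQL only allows range predicates on clustering columns once the partition key is fixed):

-- Emulate item != 'Widget' within one partition with two range scans,
-- then merge the result sets client-side.
SELECT * FROM purchases WHERE customer = 'Bob' AND item < 'Widget';
SELECT * FROM purchases WHERE customer = 'Bob' AND item > 'Widget';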
If you want to use a not-equals operator on a specific partition key and get all other data from the table, you can use a combination of range queries and the TOKEN function from CQL to achieve this.
For example, if you want to fetch all rows except the ones having the partition key 'abc', then you execute the two queries below:
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) < TOKEN('abc');
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) > TOKEN('abc');
But beware that the result is going to be huge (depending on the size of the table and the fields you need), so you might want to use this in conjunction with a utility like dsbulk. Also note that there is no guarantee of ordering in your result. This is just a kind of data dump, which will most probably be useful for one-time data-migration-like scenarios.