The documentation states that one split should not be bigger then 'a few GB'.
Is there a hard limit on that where Cloud Spanner will stop storing more data in one split ?
Nothing can be found in the limits-section here: https://cloud.google.com/spanner/quotas
What is the implication of e.g. splits growing to 20-30GB ?
I can think of problems when those splits need to be moved around between instances while being read/written
I know the second point sound like we should split up our primary key/add a sharding-key as first primary-key-part.
But if you have hundreds of customers having really big product catalogs and you need to interleave brand- and category-tables so you can join on them. And alternative approaches of storing one product-catalog in several splits become very slow on secondary index queries (like: query all active products in a catalog).
Thanks a lot in advance because this would help us a lot of understanding Cloud Spanner better for our planned production-use.
Christian Gintenreiter
A split can only be served by a single node, so very large splits may cause the single node to become a performance bottleneck. You may start to see performance degradation with a split size greater than 2GB. The hard limit on split size is bound by the the storage limit for a single node, which is 2TB.
Can you please provide some more details about your schema and interleaving?
Related
I want to take a general Idea of how I can optimise the query performance in redshift Database, I have Huge queries with lots of joins , I do understand using sort and Dist key it can be achieved but is there a method which we can follow in order to get some optimal results.
What to look in a table and how to approach query optimisation in redshift?
What are the necessary steps to look for or approach in order to have a certain plan for optimisation?
Any guidance will help a lot
Having improved many queries on Redshift there are a few things I can point you towards. First let me list a few tools / techniques to make sure you have these in your toolbox.
Ability to read and EXPLAIN plan and find expected costly points
Know where to find the query "actual" execution report
Know the system tables to find join, distribution, and disk io reports
So with those understood let's look at where many queries go sideways on Redshift. I will try to list these out in pareto order but any of these, or combos, can create significant issue.
#1 - Fat in the middle queries. When joining it is possible to expand the number of rows being operated upon many fold. Cross joining is a clear way this can happen but isn't how this usually happens. If the join on conditions create a many to many join pattern the number of rows can expand. When the table sizes are very large and the "multiplication" can make absurd data sizes. The explain plan can show this but not always - use of DISTINCT and GROUP BY can "hide" the true size of the dataset in play. Performing a SELECT COUNT(*) on your join tree can help show how big this is. You may also may need to look a pieces of the join tree if a later join is collapsing the rows (failure of the query optimizer?). Redshift is a columnar database and not well set up for the creation of data - this includes during the execution of query.
#2 - Distribution of large amounts of data. Redshift is a cluster and the node are connected together by ethernet cables and these connections are the slowest part of the cluster. A lot of work is done by the query optimizer to minimize the amount of data that needs to move around the network. However, it doesn't know your data as well as you do and doesn't always do this well. Look at the type of joins you are getting - is distribution needed? how much data is being distributed? Also, group by (and window functions) need to combine rows and therefore may need redistribution to complete. How big are the data sets entering your aggregation steps?
Moving a lot of data around the network will be slow. The difficulty is that it isn't always clear how to reduce this movement. Large join trees like you say you have can do "odd" things when it comes to the resulting distribution of the "joined" data. Joins are performed one at a time and the order these happen can matter. The query optimizer is making a number of decisions about the order of joins and how to organize the resulting data from each join. The choices it makes is based on what it sees in the table metadata so completeness of metadata matters. WHERE conditions can also impact the optimizer's choices. There are just way to many interactions to itemize them out here. Best advice is to look at the performance per step and see if data distribution is a factor. Then work to control how data is distributed in the query's execution. This may mean changing the join trees or even decomposing the query into several with temp table that have distribution set so that data movement is minimized.
#3 Excessive IO traffic - While not as slow as the networks, the disk IO subsystem is often a bottleneck. This shows up in a few ways. Are you reading more data from disk than is needed? (Metadata up to date?) Do you need a redundant WHERE clause to eliminate data? (Redundant WHERE clause is one that isn't needed functionally but is added so Redshift can perform the metadata comparisons that will reduce data read at scan.) Data spill is another way that disk IO can be strained (this goes back to #1). If data needs to spill to disk it can bring the disk IO performance down considerably. Use your metadata and Where clauses well.
Now these 3 areas often team up to kill your performance. Read too many rows from your tables, join all these extra rows together across the network while also making many new rows. This data doesn't fit in memory so now Redshift needs to spill to disk to complete the query. Things slow down real fast in these conditions.
Lastly these factors I've listed are cluster wide "resources" of Redshift. If one query take up a lot of one of these then there is less for other queries running at the same time. What often happens is that the query writers on a cluster follow similar patterns (good or bad) and when their pattern is costly on one axis then many of their queries are costly on the same axis. This shows up as queries that work "ok" when run in isolation but very badly when others are using the cluster. This generally means that many queries are contributing to pushing the cluster "over the edge" on some limited resource. There are system tables that you can look at to see aggregated IO or network traffic to see these effects.
Good queries are:
Don't make a lot of new "rows" during execution (not fat in the middle)
Keep large data sets "on node" and only redistribute data once the data has been pared down significantly
Don't read more data from disk than is necessary and don't spill
The problem is that doing all of these isn't always possible the trick is to not over subscribe the cluster resources you have.
I have data I have to join at the record level. For example data about users is coming in from different source systems but there is not a common primary key or user identifier
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another
to see if customer in source system 1 is the same as source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied again when a record changes.
We have millions of records per day we'd have to unify
Apache Beam / Dataflow implementation
Apache beam DAG is by definition acyclic but I could just republish the data through pubsub to the same DAG to make it a cyclic algorithm.
I could create a PCollection of hashmaps that continuously do a self join against all other elements but this seems it's probably an inefficient method
Immutability of a PCollection is a problem if I want to be constantly modifying things as it goes through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX
Is there any way you may know in dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running on subset of this data with a subset of the rules but swi-prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
JDrools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many many rules to individual records, rather than inferring record information from possibly related records.
Graph database: Something like neo4j or datomic would be nice since joins are at the record level rather than row/column scans, but I don't know if it's possible in beam to do something similar
BigQuery or Spanner: Brute forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concat all columns and run multiple compare and update across all columns
Or maybe there's a more standard way to solving these class of problems.
It is hard to say what solution works best for you from what I can read so far. I would try to split the problem further and try to tackle different aspects separately.
From what I understand, the goal is to combine together the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
collection of rules is static;
So, the logic probably roughly goes like:
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably try to approach this by first understanding it in more detail (maybe you already do), few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how data is read from the sources?
can all the systems publish the records in batches in files, submit them into PubSub, does your solution need to poll them, etc?
can the data be read in parallel or is it a single stream?
then the main question of how can you efficiently match a record in general will probably look different under different assumptions and requirements as well. For example I would think about:
can you fit all data in memory;
are your rules dynamic. Do they change at all, what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc;
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?
What are the limitations / considerations in scaling down spanner nodes? Since there is a tight coupling of nodes to data stored - is it fair to say that it is highly scalable but not elastic? The following is a quote from the quizlet case study on GCP website...
"it might be impossible to reduce the number of nodes on your database, even if you previously ran the database with that number of nodes."
The word "might" needs some expanding
To expand on the "might" -- we restrict the reduction of nodes to meet a 2T/node limit for the instance. You can scale up and down, as long as the down-sizing doesn't cross that threshold.
Hope this helps!
A few things we would recommend to scale down effectively is by deleting unused data (databases, tables, global index, rows etc.). This data will be cleaned within ~7 days, allowing you to potentially run with lower node counts.
fully aware of the documentations about how to name a S3 object within a bucket to optimize performance
can not understand the example in this article https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
2134857/gamedata/start.png
2134857/gamedata/resource.rsrc
2134857/gamedata/results.txt
2134858/gamedata/start.png
2134858/gamedata/resource.rsrc
2134858/gamedata/results.txt
2134859/gamedata/start.png
2134859/gamedata/resource.rsrc
2134859/gamedata/results.txt
the article says "All these reads and writes will basically always go to the same partitio"
but we should have three partitions
2134857, 2134858, 2134859
, right ?
if we reverse the id
7584312/gamedata/start.png
7584312/gamedata/resource.rsrc
7584312/gamedata/results.txt
8584312/gamedata/start.png
8584312/gamedata/resource.rsrc
8584312/gamedata/results.txt
9584312/gamedata/start.png
9584312/gamedata/resource.rsrc
9584312/gamedata/results.txt
we have also three partitions 7584312, 8584312, 9584312
what is the difference.
What is the definition of a prefix and its relationship to partitioning strategy.
The S3 partitioning does not (always) occur on the full ID. It will usually be some sort of partial match on the ID. It's likely your first example will be on the same partition using a partition match of, say, 2134, 21348, or 213485.
More info from the blog post you linked to:
As we said, S3 has automation that continually looks for areas of the
keyspace that need splitting. Partitions are split either due to
sustained high request rates, or because they contain a large number
of keys (which would slow down lookups within the partition). ... This split
operation happens dozens of times a day all over S3 and simply goes unnoticed
from a user performance perspective.
I am looking for a strategy to batch all my queries (with IN clause) to overcome the restrictions by databases on IN clause (See here).
I usually get list of size 100000 to 305000. So, this has become very important to tackle.
I have tried two strategies so far.
Strategy 1:
Create an entity and hence a table with one column to hold such values (can we create temp tables on the fly with JPA 2.0 vendor-independent?) and use the data from the temp table as a subquery to the original query before eventually cleaning up the temp table.
Advantage: Very performant queries. Really quick, I must admit for the numbers I have mentioned, it was mostly under a minute.
Possible drawback: Use of temp table which is actually a permanent one in my case thus far.
Strategy 2:
Calculate the batch size for the given input list and for each batch execute the query and accumulate the result.
Advantage: No temp tables. Easy for any threads within the same transaction.
Disadvantage: A big disadvantage is amount of time it takes to execute all the batches. For the mentioned numbers, this is at an unacceptable level at the moment. Takes anything between 5 to 15 mins!
I would appreciate any feedback, suggestions or improvements from all you JPA gurus.
Thanks.
I only tested up to 50,000 integers but I have some pretty good performance data around splitting large lists using various methods, with CLR and a numbers table leading the pack at the higher end:
Splitting a list of integers : another roundup
Not sure if you are using integers or strings but the results should be roughly equivalent.
As an aside, I'll confess I have no idea what JPA 2.0 is, but I assume you can control the format of the lists that it sends to SQL Server.