How does Azure SQL DW know the row count without statistics? - azure-sqldw

If I run a CREATE EXTERNAL TABLE cetasTable AS SELECT command then run:
EXPLAIN
select * from cetasTable
I see in the distributed query plan:
<operation_cost cost="4231.099968" accumulative_cost="4231.099968" average_rowsize="2056" output_rows="428735" />
It seems to know the correct row count. However, there are no statistics created on that table, as this query returns zero rows:
select * from sys.stats where object_id = object_id('cetasTable')
If I already have files in blob storage and I run a CREATE EXTERNAL TABLE cetTable command then run:
EXPLAIN
select * from cetTable
The distributed query plan shows SQL DW thinks there are only 1000 rows in the external table:
<operation_cost cost="4.512" accumulative_cost="4.512" average_rowsize="940" output_rows="1000" />
Of course I can create statistics to ensure SQL DW knows the right row count when it creates the distributed query plan. But can someone explain how it knows the correct row count some of the time and where that correct row count is stored?

What you are seeing is the difference between a table created using CxTAS (CTAS, CETAS or CRTAS) and CREATE TABLE.
When you run CREATE TABLE, the row count and page count values are fixed because the table is empty. If memory serves, the fixed values are 1000 rows and 100 pages. When you create a table with CTAS, they are not fixed. The actual values are known to the CTAS command, as it has just created and populated the table in a single command. Consequently, the metadata correctly reflects the table SIZE when a CxTAS is used. This is good. The APS / SQLDW cost based optimizer can immediately make better estimations for MPP plan generation based on table SIZE when a table has been created via CxTAS as opposed to CREATE TABLE.
Having an accurate understanding of table size is important.
Imagine you have a table created using CREATE TABLE and then 1 billion rows are inserted into said table using INSERT. The shell database still thinks that the table has 1000 rows and 100 pages. However, this is clearly not the case. This is because the table size attributes are not automatically updated at this time.
Now imagine that a query is fired that requires data movement on this table. Things may begin to go awry. You are now more likely to see the engine make poor MPP plan choices (typically using BROADCAST rather than SHUFFLE) as it does not understand the table size amongst other things.
What can you do to improve this?
Create at least one column-level statistics object per table. Generally speaking, you will create statistics objects on all columns used in JOINs, GROUP BYs, WHEREs and ORDER BYs in your queries. I will explain the underlying process for statistics generation in a moment. I just want to emphasise that the call to action here is to ensure that you create and maintain your statistics objects.
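As a minimal sketch (the table, column and statistics names here are placeholders, not from the question), creating and later refreshing a statistics object looks like this:
-- create a column-level statistics object
CREATE STATISTICS stat_FactSales_CustomerKey
ON dbo.FactSales (CustomerKey);
-- refresh it after significant data changes, e.g. large loads
UPDATE STATISTICS dbo.FactSales (stat_FactSales_CustomerKey);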
When CREATE STATISTICS is executed for a column three events actually occur.
1) Table level information is updated on the CONTROL node
2) Column level statistics object is created on every distribution on the COMPUTE nodes
3) Column level statistics object is created and updated on the CONTROL node
1) Table level information is updated on the CONTROL node
The first step is to update the table level information. To do this, APS / SQLDW executes DBCC SHOW_STATISTICS (table_name) WITH STAT_STREAM against every physical distribution, merging the results and storing them in the catalog metadata of the shell database. Row count is held in sys.partitions and page count is held in sys.allocation_units. sys.partitions is visible to you in both SQLDW and APS. However, sys.allocation_units is not visible to the end user at this time. I've mentioned these locations for information and context, for those familiar with the internals of SQL Server.
At the end of this stage the metadata held in the shell database on the CONTROL node has been updated for both row count and page count. There is now no difference between a table created by CREATE TABLE and a CTAS - both know the size.
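If you want to check what the shell database currently believes about a table's size, a hedged sketch (the exact shape of the shell metadata may vary) is to sum the row counts recorded in sys.partitions:
-- cetasTable is the table from the question; adjust as needed
SELECT SUM(p.rows) AS shell_row_count
FROM sys.partitions p
WHERE p.object_id = OBJECT_ID('cetasTable')
AND p.index_id IN (0, 1);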
2) Column level statistics object is created on every distribution on the COMPUTE nodes
The statistics object must be created in every distribution on every COMPUTE node. Creating a statistics object generates important, detailed statistical data (notably the histogram and the density vector) for the column.
This information is used by APS and SQLDW for generating distribution level SMP plans. SMP plans are used by APS / SQLDW in the PHYSICAL layer only. Therefore, at this point the statistical data is not in a location that can be used for generating MPP plans. The information is distributed and not accessible in a timely fashion for cost based optimisation. Therefore a third step is necessary...
3) Column level statistics object is created and updated on the CONTROL node
Once the data is created PHYSICALLY on the distributions in the COMPUTE layer it must be brought together and held LOGICALLY to facilitate MPP plan cost based optimisation. The shell database on the CONTROL node also creates a statistics object. This is a LOGICAL representation of the statistics object.
However, the shell database stat does not yet reflect the column level statistical information held PHYSICALLY in the distributions on the COMPUTE nodes. Consequently, the statistics object in the shell database on the CONTROL node needs to be UPDATED immediately after it has been created.
DBCC SHOW_STATISTICS (table_name, stat_name) WITH STAT_STREAM is used to do this.
Notice that the command has a second parameter. This changes the result set, providing APS / SQLDW with all the information required to build a LOGICAL view of the statistics object for that column.
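Using the hypothetical names from the earlier sketch, the two forms described above would look like this (STAT_STREAM is an internal option used by the engine itself, so treat this purely as illustration):
-- table-level form used in step 1
DBCC SHOW_STATISTICS ('dbo.FactSales') WITH STAT_STREAM;
-- column-level form used in step 3
DBCC SHOW_STATISTICS ('dbo.FactSales', 'stat_FactSales_CustomerKey') WITH STAT_STREAM;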
I hope this goes some way to explaining what you were seeing but also how statistics are created and why they are important for Azure SQL DW and for APS.

Related

Can we alter AWS QLDB table?

Suppose I have created a table like this.
CREATE TABLE Vehicle
and inserted some documents into this table:
INSERT INTO Vehicle
<< {
'VIN' : '1N4AL11D75C109151',
'Type' : 'Sedan',
} >>
So my requirement is to change the table name from Vehicle to VehicleCar, and I also want to rename the field 'VIN' to 'VID'.
How can I do that?
Thanks,
Dasun.
QLDB doesn't currently offer an ALTER TABLE capability. You'd have to DROP the table and re-create it. This counts against your table limits, so don't do it too often.
QLDB is schema-less, so you can change your field names and/or the structure of your documents anytime you want to, simply by writing new revisions to your documents in the new format. The journal will still contain the old revisions, however. If your application has any functionality that uses the history() function to access old revisions, then it needs to be able to gracefully handle variations in the document format.
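For the field rename, a hedged sketch of writing a document's new revision in the new format might look like the following PartiQL (the WHERE filter reuses the VIN from the question; the exact REMOVE syntax should be checked against the QLDB PartiQL reference, and each statement is subject to the transaction limits noted below):
UPDATE Vehicle AS v
SET v.VID = v.VIN
WHERE v.VIN = '1N4AL11D75C109151'
FROM Vehicle AS v
WHERE v.VID = '1N4AL11D75C109151'
REMOVE v.VIN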
It is important to note that QLDB is not optimized for scanning large volumes of data. It's optimized for targeted queries against an index using an equality operator. A query like "SELECT * FROM table" will scan the entire table. This is an anti-pattern for QLDB and will not perform well as your ledger grows. So if you change your document format, running a SELECT * and updating every document to the new format may be more work than you realize. First, that SELECT * scan query may time out, or it may be aborted with an Optimistic Concurrency Control exception because another process inserted a document into the table. Second, you'd have to do it in batches of 40 documents at a time because of the limit on the number of documents in a transaction.
All of this is to say that making your application resilient to schema changes is a good idea. :-)

Google Big Query splitting an ingestion time partitioned table

I have an ingestion time partitioned table that's getting a little large. I wanted to group by the values in one of the columns and use that to split it into multiple tables. Is there an easy way to do that while retaining the original _PARTITIONTIME values in the set of new ingestion time partitioned tables?
Also I'm hoping for something that's relatively simple/cheap. I could do something like copy my table a bunch of times and then delete the data for all but one value on each copy, but I'd get charged a huge amount for all those DELETE operations.
Also I have enough unique values in the column I want to split on that saving a "WHERE column = value" query result to a table for every value would be cost prohibitive. I'm not finding any documentation that mentions whether this approach would even preserve the partitions, so even if it weren't cost prohibitive it may not work.
The case you describe requires two-level partitioning, which is not supported yet.
You can create a column-partitioned table instead: https://cloud.google.com/bigquery/docs/creating-column-partitions
You would then populate the partitioning column as needed before insert, but in this case you lose the _PARTITIONTIME value.
Based on the additional clarification: I had a similar problem, and my solution was to write a Python application that reads the source table (reading is important here, not querying, so it is free), splits the data based on your criteria, and then either streams the data into the target tables (simple, but not free) or generates JSON/CSV files and loads them into the target tables (also free, but with some limits on the number of such operations). The second route requires more coding and exception handling.
You can also do it via Dataflow; that will definitely be more expensive than a custom solution but potentially more robust.
Example using the BigQuery Python client library:
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")
# source_table_ref: a reference to the source table, e.g. client.dataset("src_dataset").table("src_table")
t1 = client.get_table(source_table_ref)
target_schema = t1.schema[1:]  # removing first column, which is the key to split on
ds_target = client.dataset(dataset_id=target_dataset, project=target_project)
# read rows directly (free) instead of querying them
rows_to_process_iter = client.list_rows(t1, start_index=start_index, max_results=max_results)
# convert to list
rows_to_process = list(rows_to_process_iter)
# ... split rows_to_process by the key column into records_to_stream per target table ...
# stream records to the destination table (newer library versions call this insert_rows)
errors = client.create_rows(target_table, records_to_stream)
BigQuery now supports clustered partitioned tables, which allow you to specify additional columns that the data should be split by.
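As a rough sketch (table and column names are placeholders), a clustered, ingestion-time partitioned table can be created with DDL along these lines:
CREATE TABLE mydataset.split_table
(
  category STRING,
  payload STRING
)
PARTITION BY _PARTITIONDATE
CLUSTER BY category;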

Detecting delta records for nightly capture?

I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables, different options can be used:
- if a table explicitly contains a reference to the last change time, this can be used
- if a table has guaranteed update characteristics (e.g. no in-place update and monotone ID values), this could be used, e.g. read all records where the ID is larger than the last processed ID (see the sketch after this list)
- if the table does not provide intrinsic information about change time, then one could maintain a copy of the table that contains only the records processed so far. This copy can then be used to compare against the current table and compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
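A minimal sketch of the ID-based option (table and column names are hypothetical, with :last_processed_id supplied and remembered by the nightly batch job):
SELECT *
FROM "SALES_DOCS"
WHERE "DOC_ID" > :last_processed_id
ORDER BY "DOC_ID";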
In my experience, efforts to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming and expensive than using the corresponding features of ETL tools.
It is possible to create a log table, with columns organized according to your needs, and then create triggers on your source tables that write a log record with a timestamp for every change. You can then query your log table to determine which records were inserted, updated or deleted in your source tables.
For example, the following is from one of my test triggers:
CREATE TRIGGER "A00077387"."SALARY_A_UPD" AFTER UPDATE ON "A00077387"."SALARY" REFERENCING OLD ROW MYOLDROW,
NEW ROW MYNEWROW FOR EACH ROW
begin INSERT
INTO SalaryLog ( Employee,
Salary,
Operation,
DateTime ) VALUES ( :mynewrow.Employee,
:mynewrow.Salary,
'U',
CURRENT_DATE )
;
end
;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE trigger above.
You can organize your log table so that you can track more than one source table if you wish, just by storing the table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use a separate log table for each source table.
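The nightly batch job then just reads the log table for the window since the last run. A hedged sketch, assuming the SalaryLog table above and a :last_run_time parameter remembered by the job:
SELECT Employee, Salary, Operation, DateTime
FROM SalaryLog
WHERE DateTime >= :last_run_time
ORDER BY DateTime;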

Efficiency using triggers inside attached database with SQLite

Situation
I'm using multiple storage databases as attachments to one central "manager" DB.
The storage tables share one pseudo-AUTOINCREMENT index across all storage databases.
I need to iterate over the shared index frequently.
The final number and names of storage tables are not known on storage DB creation.
On some signal, a then-given range of entries will be deleted.
It is vital that no insertion fails and no entry gets deleted before its signal.
Energy outage is possible, data loss in this case is hardly, if ever, tolerable. Any solutions that may cause this (in-memory databases etc) are not viable.
Database access is currently controlled using strands. This takes care of sequential access.
Due to the high frequency of INSERT transactions, I must trigger WAL checkpoints manually. I've seen journals of up to 2GB in size otherwise.
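(A manual checkpoint is just a pragma; a minimal sketch, assuming one of the storage databases was attached AS storage1:)
-- checkpoint the main database
PRAGMA wal_checkpoint(TRUNCATE);
-- checkpoint a specific attached database
PRAGMA storage1.wal_checkpoint(TRUNCATE);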
Current solution
I'm inserting datasets using parameter binding to a precreated statement.
INSERT INTO datatable VALUES (:idx, ...);
Doing that, I remember the start and end index. Next, I bind it to an insert statement into the registry table:
INSERT INTO regtable VALUES (:idx, 'datatable');
My query determines the datasets to return like this:
SELECT MIN(rowid), MAX(rowid), tablename
FROM (SELECT rowid,tablename FROM entryreg LIMIT 30000)
GROUP BY tablename;
After that, I query
SELECT * FROM datatable WHERE rowid >= :minid AND rowid <= :maxid;
where I use predefined statements for each datatable and bind both variables to the first query's results.
This is too slow. As soon as I create the registry table, my insertions slow down so much I can't meet benchmark speed.
Possible Solutions
There are several other ways I can imagine it can be done:
Create a view of all indices as a UNION or OUTER JOIN of all table indices. This can't be done persistently on attached databases.
Create triggers for INSERT/REMOVE on table creation that fill a registry table (see the sketch after this list). This can't be done persistently on attached databases.
Create a trigger for CREATE TABLE on database creation that will create the triggers described above. Requires user functions.
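For the second option, a minimal sketch of such a trigger (using the datatable / regtable names from above; in practice one of these would be generated per storage table):
CREATE TRIGGER IF NOT EXISTS datatable_reg_ai
AFTER INSERT ON datatable
BEGIN
    INSERT INTO regtable VALUES (NEW.rowid, 'datatable');
END;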
Questions
Now, before I go and add user functions (something I've never done before), I'd like some advice if this has any chances of solving my performance issues.
Assuming I create the databases using a separate connection before attaching them. Can I create views and/or triggers on the database (as main schema) that will work later when I connect to the database via ATTACH?
From what it looks like, a trigger AFTER INSERT will fire after every single line of insert. If it inserts stuff into another table, does that mean I'm increasing my number of transactions from 2 to 1+N? Or is there a mechanism that speeds up triggered interaction? The first case would slow down things horribly.
Is there any chance that a FULL OUTER JOIN (I know that I need to create it from other JOIN commands) is faster than filling a registry with insertion transactions every time? We're talking roughly ten transactions per second with an average of 1000 elements (insert) vs. one query of 30000 every two seconds (query).
Open the sqlite3 databases in multi-threaded mode and handle the insert/update/query/delete functions in separate threads. I prefer to transfer query results into an STL container for processing.

Releasing the space created by deletion of rows from table on greenplum

I have a table on Greenplum from which data is being deleted on a daily basis using a simple DELETE statement. However, the size of the table is not decreasing. Is there some way the space freed by deleting rows can be reclaimed so that the size of the table is reduced?
Greenplum Database at its core uses much of the same code as a Postgres database. Therefore, the command you want is the VACUUM command. From the docs at http://gpdb.docs.gopivotal.com/4300/pdf/GPDB43RefGuide.pdf:
VACUUM reclaims storage occupied by deleted tuples. In normal Greenplum Database operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present on disk until a VACUUM is done. Therefore it is necessary to do VACUUM periodically, especially on frequently-updated tables.
Also, if you are altering a significant number of rows, then you may want to use VACUUM ANALYZE to allow the statistics for the table to be updated for better query planning.
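A minimal sketch (the table name is a placeholder):
-- reclaim space left by deleted tuples
VACUUM my_table;
-- or reclaim space and refresh planner statistics in one pass
VACUUM ANALYZE my_table;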