I am trying to understand whether this is possible:
Central: table1
On the local nodes: table1_1, table1_2, table1_3, table1_4, ...
I have set up propagation from table1 to the tables table1_1, table1_2, table1_3, table1_4, ...
It is working.
But now I need to do the opposite. E.g. when changes happen in table1_1, they have to be passed to table1 (central) and after that to table1_2, table1_3, ...
Is it possible or not?
Thanks.
Yes, it is possible.
As written in the documentation:
"A node will always push and pull data to other node groups according to the node group link configuration. A node can only pull and push data to other nodes that are represented [in the] node table in its database and having sync_enabled = 1."
So you need to configure the NODE_GROUP_LINK table for each tier.
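For example, a two-tier setup might use group links like these (a rough sketch, inserted here with psycopg2; the group names 'central' and 'local', the connection details, and the sync_on_incoming_batch note are assumptions about a typical setup, not taken from your configuration):
import psycopg2

# Hypothetical connection to the database that holds the SymmetricDS configuration
# (the registration server). All names below are placeholders.
conn = psycopg2.connect("dbname=central_node user=symmetric password=secret")

with conn, conn.cursor() as cur:
    # Changes captured at 'central' wait for the local nodes to pull them down.
    cur.execute(
        "insert into sym_node_group_link "
        "(source_node_group_id, target_node_group_id, data_event_action) "
        "values ('central', 'local', 'W')"
    )
    # Changes captured at a local node are pushed up to 'central'.
    cur.execute(
        "insert into sym_node_group_link "
        "(source_node_group_id, target_node_group_id, data_event_action) "
        "values ('local', 'central', 'P')"
    )
    # Note: for a change arriving at central from table1_1 to fan back out to
    # table1_2, table1_3, ..., the trigger on the central group usually also
    # needs sync_on_incoming_batch enabled in sym_trigger.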
Suppose I have created a table like this.
CREATE TABLE Vehicle
and inserted some documents into this table:
INSERT INTO Vehicle
<< {
    'VIN' : '1N4AL11D75C109151',
    'Type' : 'Sedan'
} >>
My requirement is to change the table name from Vehicle to VehicleCar, and I also want to rename the field 'VIN' to 'VID'.
How can I do that?
Thanks,
Dasun.
QLDB doesn't currently offer an ALTER TABLE capability. You'd have to DROP the table and re-create it. This counts against your table limits, so don't do it too often.
QLDB is schema-less, so you can change your field names and/or the structure of your documents anytime you want to, simply by writing new revisions to your documents in the new format. The journal will still contain the old revisions, however. If your application has any functionality that uses the history() function to access old revisions, then it needs to be able to gracefully handle variations in the document format.
It is important to note that QLDB is not optimized for scanning large volumes of data. It's optimized for targeted queries against an index using an equality operator. A query like "SELECT * FROM table" will scan the entire table. This is an anti-pattern for QLDB and will not perform well as your ledger grows. So if you change your document format, running a SELECT * and updating every document to the new format may be more work than you realize. First, that SELECT * scan query may time out, or it may be aborted with an Optimistic Concurrency Control exception because another process inserted a document in the table. Second, you'd have to do it in batches of 40 documents at a time because of the limit on the number of documents in a transaction.
All of this is to say that making your application resilient to schema changes is a good idea. :-)
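Since there is no rename, the table change is really a copy-and-drop. A rough sketch with the pyqldb driver, assuming a ledger named my-ledger and a Vehicle table small enough to copy in one transaction (otherwise batch the copy as described above):
from amazon.ion.simpleion import dumps, loads
from pyqldb.driver.qldb_driver import QldbDriver

driver = QldbDriver(ledger_name="my-ledger")  # hypothetical ledger name

# Create the replacement table (this counts against the ledger's table limit).
driver.execute_lambda(lambda txn: txn.execute_statement("CREATE TABLE VehicleCar"))

def copy_documents(txn):
    # A full scan is an anti-pattern on large tables; it is assumed here that
    # Vehicle is small. Otherwise copy in batches of at most ~40 documents per
    # transaction.
    for doc in txn.execute_statement("SELECT VIN, Type FROM Vehicle"):
        # Re-write each document in the new shape: VIN becomes VID.
        new_doc = loads(dumps({"VID": doc["VIN"], "Type": doc["Type"]}))
        txn.execute_statement("INSERT INTO VehicleCar ?", new_doc)

driver.execute_lambda(copy_documents)

# Drop the old table once the copy has been verified.
driver.execute_lambda(lambda txn: txn.execute_statement("DROP TABLE Vehicle"))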
I started looking into DynamoDB, but got stuck reading this part about the materialized graph pattern: Best Practices for Managing Many-to-Many Relationships.
I guess I get some ideas, but don't understand the whole thing yet.
As far as I understand the pattern the main table stores edges and each edge can have properties (the data-attribute).
For example (taken from the shown tables):
Node 1 (PK 1) has an edge to Node 2 which is of type DATE, and the edge is of type BIRTH (SK DATE|2|BIRTH).
I guess this would somewhat be the same as ()-[:BIRTH]->(:DATE { id: 2 }) in Cypher, right?
But after this it becomes unclear how everything fits together.
For example:
Can the data attribute be a map?
Does the data attribute have to be written to two places on writes? E.g. under (1, DATE|2|BIRTH) and (2, DATE|2)?
If I want to add a new person that is born 1980-12-19, do I have to look up the corresponding node first?
How can I get all properties associated with a node? How to get all properties associated with an edge?
How can I query adjacent nodes?
...
Can someone explain to me how everything fits together by walking through a few use cases?
Thanks in advance.
Hopefully this answers all of your questions. Here are a couple of introductory things. I'll be using a generic table for all of my examples. The hash key is node_a and the sort key is node_b. There is a reverse lookup GSI where node_b is the hash key and node_a is the sort key.
1. Can the data attribute be a map?
The data attribute can be any of the supported data types in DynamoDB, including a map.
2. Does the data attribute have to be written to two places on writes?
The data attribute should be written to only one place. For the example of birthdate, you could do either one of these DynamoDB entries:
node_a | node_b | data
----------|-----------|---------------
user-1 | user-1 | {"birthdate":"2000-01-01", "firstname": "Bob", ...}
user-1 | birthdate | 2000-01-01
In the first row, we created an edge from the user-1 node that loops back on itself. In the second row, we created an edge from user-1 to birthdate. Either way is fine, and the best choice depends on how you will be accessing your data. If you need to be able to find users with a birthdate in a given range, then you should create a birthdate node. If you just need to look up a user's information from their user ID, then you can use either strategy, but the first row will usually be a more efficient use of your table's throughput.
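For example, with boto3 (the table name "graph-table" is a placeholder; the keys are node_a and node_b as above), the two options look like this:
import boto3

# Hypothetical table name; hash key "node_a", range key "node_b".
table = boto3.resource("dynamodb").Table("graph-table")

# Option 1: a self-edge on the user node that carries profile data as a map.
table.put_item(Item={
    "node_a": "user-1",
    "node_b": "user-1",
    "data": {"birthdate": "2000-01-01", "firstname": "Bob"},
})

# Option 2: an edge from the user node to a birthdate node.
table.put_item(Item={
    "node_a": "user-1",
    "node_b": "birthdate",
    "data": "2000-01-01",
})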
3. If I want to add a new person that is born 1980-12-19, do I have to look up the corresponding node first?
No. Just insert one of the rows from the example above.
You only have to look up the node if there is a more complex access pattern, such as "update the name of the person who was born on 1980-12-19". In that case, you would need to look up by birthdate to get the person node, and then modify something related to the person node. However, that use case is really two different operations. You can rephrase that sentence as "Find the person who was born on 1980-12-19, and update the name", which makes the two operations more apparent.
4.(a) How can I get all properties associated with a node?
Suppose you want to find all the edges for "myNode". You would query the main table with the key condition expression of node_a="myNode" and query the reverse lookup GSI with the key condition expression of node_b="myNode". This is the equivalent of SELECT * FROM my_table WHERE node_a="myNode" OR node_b="myNode".
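For example, with boto3 (table and GSI names are placeholders):
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical names: table "graph-table", reverse-lookup GSI "node_b-node_a-index".
table = boto3.resource("dynamodb").Table("graph-table")

# Edges where myNode is node_a (main table).
outgoing = table.query(KeyConditionExpression=Key("node_a").eq("myNode"))["Items"]

# Edges where myNode is node_b (reverse-lookup GSI).
incoming = table.query(
    IndexName="node_b-node_a-index",
    KeyConditionExpression=Key("node_b").eq("myNode"),
)["Items"]

edges = outgoing + incoming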
4.(b) How to get all properties associated with an edge?
All of the properties of an edge are stored directly in the attributes of the edge, but you may still run into a situation where you don't know exactly where the data is. For example:
node_a | node_b | data
----------|-----------|---------------
thing-1 | thing-2 | Is the data here?
thing-2 | thing-1 | Or here?
If you know the ordering of the edge nodes (i.e. which node is node_a and which is node_b) then you need only one GetItem operation to retrieve the data. If you don't know which order the nodes are in, then you can use BatchGetItems to look up both of the rows in the table (only one of the rows should exist unless you're doing something particularly complex involving a directed graph).
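For example, with boto3 (again, the table name is a placeholder):
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("graph-table")  # hypothetical table name

# If you know the ordering of the edge nodes, one GetItem is enough.
item = table.get_item(Key={"node_a": "thing-1", "node_b": "thing-2"}).get("Item")

# If you don't, look up both orderings in one BatchGetItem; only one row should exist.
response = dynamodb.batch_get_item(RequestItems={
    "graph-table": {
        "Keys": [
            {"node_a": "thing-1", "node_b": "thing-2"},
            {"node_a": "thing-2", "node_b": "thing-1"},
        ]
    }
})
edge_rows = response["Responses"]["graph-table"]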
5. How can I query adjacent nodes?
Adjacent nodes are simply two nodes that have an edge connecting them. You would use the same query as 4a, except that instead of being interested in the data attribute, you're interested in the IDs of the other nodes.
Some more examples
Using a graph pattern to model a simple social network
Using a graph pattern to model user-owned resources
How to model a circular relationship between actors and films in DynamoDB (answer uses a graph pattern)
Modeling many-to-many relationships in DynamoDB
From relational DB to single DynamoDB table: a step-by-step exploration. This is a killer piece. It's got an AWS re:Invent talk embedded in it, and the author of this blog post adds his own further explanation on top of it.
I am trying to do an upsert with a stage table when copying data from S3.
I want to do this because I want to be able to backfill the data (or launch the process more than just once), and right now it creates duplicate rows.
I see a bunch of responses that show how to do a DELETE from {table} USING {stage_table} WHERE {table.primarykey} = {stage_table.primarykey}
The thing is that I want to do this with a generic function, which means: how can I access the primary key somehow "automatically"? Writing "primarykey" or "primaryKey", as I read in many places, does not work; I am guessing it is just pseudo-code.
Any help would be appreciated. Thanks!
EDIT
The idea is to execute the upsert like this:
# [transaction]
# Create an empty staging table with the same structure as the target table.
connection.execute("CREATE TEMP TABLE {stage_table} (like {table});".format(stage_table=stage_table, table=text(self.table)))
# COPY the S3 data into the staging table.
connection.execute(self.clean(self.compile_query(copy)))
# Delete rows from the target that are about to be re-inserted ("primarykey" is the placeholder I need to resolve).
connection.execute("DELETE FROM {table} USING {stage_table} WHERE {table}.primarykey = {stage_table}.primarykey;".format(stage_table=stage_table, table=text(self.table)))
# Insert the staged rows into the target table.
connection.execute("INSERT INTO {table} SELECT * FROM {stage_table};".format(stage_table=stage_table, table=text(self.table)))
# Drop the staging table.
connection.execute("DROP TABLE {stage_table};".format(stage_table=stage_table))
# [end transaction]
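One direction I am considering (just a rough sketch, assuming the target table was created with a declared PRIMARY KEY constraint and that I can read it back from information_schema):
import psycopg2

PK_SQL = """
    SELECT kcu.column_name
    FROM information_schema.table_constraints tc
    JOIN information_schema.key_column_usage kcu
      ON kcu.constraint_name = tc.constraint_name
     AND kcu.table_schema = tc.table_schema
    WHERE tc.constraint_type = 'PRIMARY KEY'
      AND tc.table_schema = %s
      AND tc.table_name = %s
    ORDER BY kcu.ordinal_position
"""

def delete_matching_rows(cur, schema, table, stage_table):
    # Look up the declared primary-key columns (Redshift stores them as
    # informational constraints even though it does not enforce them).
    cur.execute(PK_SQL, (schema, table))
    pk_columns = [row[0] for row in cur.fetchall()]
    if not pk_columns:
        raise ValueError("No PRIMARY KEY declared on {}.{}".format(schema, table))
    predicate = " AND ".join(
        "{t}.{c} = {s}.{c}".format(t=table, s=stage_table, c=col) for col in pk_columns
    )
    cur.execute("DELETE FROM {t} USING {s} WHERE {p};".format(t=table, s=stage_table, p=predicate))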
If I run a CREATE EXTERNAL TABLE cetasTable AS SELECT command then run:
EXPLAIN
select * from cetasTable
I see in the distributed query plan:
<operation_cost cost="4231.099968" accumulative_cost="4231.099968" average_rowsize="2056" output_rows="428735" />
It seems to know the correct row count. However, no statistics appear to have been created on that table, as this query returns zero rows:
select * from sys.stats where object_id = object_id('cetasTable')
If I already have files in blob storage and I run a CREATE EXTERNAL TABLE cetTable command then run:
EXPLAIN
select * from cetTable
The distributed query plan shows SQL DW thinks there are only 1000 rows in the external table:
<operation_cost cost="4.512" accumulative_cost="4.512" average_rowsize="940" output_rows="1000" />
Of course I can create statistics to ensure SQL DW knows the right row count when it creates the distributed query plan. But can someone explain how it knows the correct row count some of the time and where that correct row count is stored?
What you are seeing is the difference between a table created using CxTAS (CTAS, CETAS or CRTAS) and CREATE TABLE.
When you run CREATE TABLE, the row count and page count values are fixed, as the table is empty. If memory serves, the fixed values are 1000 rows and 100 pages. When you create a table with CTAS, they are not fixed. The actual values are known to the CTAS command, as it has just created and populated the table in a single command. Consequently, the metadata correctly reflects the table SIZE when a CxTAS is used. This is good. The APS / SQLDW cost-based optimizer can immediately make better estimations for MPP plan generation based on table SIZE when a table has been created via CxTAS as opposed to CREATE TABLE.
Having an accurate understanding of table size is important.
Imagine you have a table created using CREATE TABLE and then 1 billion rows are inserted using INSERT into said table. The shell database still thinks that the table has 1000 rows and 100 pages. However, this is clearly not the case. This is because the table size attributes are not automatically updated at this time.
Now imagine that a query is fired that requires data movement on this table. Things may begin to go awry. You are now more likely to see the engine make poor MPP plan choices (typically using BROADCAST rather than SHUFFLE) as it does not understand the table size amongst other things.
What can you do to improve this?
Create at least one column-level statistics object per table. Generally speaking, you will create statistics objects on all columns used in JOINs, GROUP BYs, WHEREs and ORDER BYs in your queries. I will explain the underlying process for statistics generation in a moment. I just want to emphasise that the call to action here is to ensure that you create and maintain your statistics objects.
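For example, from Python with pyodbc (a sketch only; the connection string, table name and column name are placeholders):
import pyodbc

# Hypothetical connection string for an Azure SQL DW / APS endpoint.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=mydw.database.windows.net;"
    "DATABASE=mydw;UID=loader;PWD=secret",
    autocommit=True,
)
cursor = conn.cursor()

# One statistics object per column used in JOINs, GROUP BYs, WHEREs and ORDER BYs.
cursor.execute("CREATE STATISTICS stat_FactSales_customer_id ON FactSales (customer_id)")

# Keep it maintained after significant data changes so MPP plans stay accurate.
cursor.execute("UPDATE STATISTICS FactSales (stat_FactSales_customer_id)")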
When CREATE STATISTICS is executed for a column three events actually occur.
1) Table level information is updated on the CONTROL node
2) Column level statistics object is created on every distribution on the COMPUTE nodes
3) Column level statistics object is created and updated on the CONTROL node
1) Table level information is updated on the CONTROL node
The first step is to update the table level information. To do this APS / SQLDW executes DBCC SHOW_STATISTICS (table_name) WITH STAT_STREAM against every physical distribution; merging the results and storing them in the catalog metadata of the shell database. Row count is held on sys.partitions and page count is held on sys.allocation_units. Sys.partitions is visible to you in both SQLDW and APS. However, sys.allocation_units is not visible to the end user at this time. I referenced the location for those familiar with the internals of SQL Server for information and context.
At the end of this stage the metadata held in the shell database on the CONTROL node has been updated for both row count and page count. There is now no difference between a table created by CREATE TABLE and a CTAS - both know the size.
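If you are curious, you can see the row count the shell database is holding by querying sys.partitions yourself (a sketch; the connection details and table name are placeholders):
import pyodbc

# Hypothetical DSN for the same SQL DW / APS endpoint as above.
conn = pyodbc.connect("DSN=mydw;UID=loader;PWD=secret", autocommit=True)
cursor = conn.cursor()

# Row counts as recorded in the shell database's catalog metadata.
cursor.execute("""
    SELECT t.name AS table_name, SUM(p.rows) AS row_count
    FROM sys.tables AS t
    JOIN sys.partitions AS p
      ON p.object_id = t.object_id
     AND p.index_id IN (0, 1)   -- heap or clustered index
    WHERE t.name = 'cetasTable'
    GROUP BY t.name
""")
print(cursor.fetchall())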
2) Column level statistics object is created on every distribution on the COMPUTE nodes
The statistics object must be created in every distribution on every COMPUTE node. Creating the statistics object generates important, detailed statistical data for the column (notably the histogram and the density vector).
This information is used by APS and SQLDW for generating distribution level SMP plans. SMP plans are used by APS / SQLDW in the PHYSICAL layer only. Therefore, at this point the statistical data is not in a location that can be used for generating MPP plans. The information is distributed and not accessible in a timely fashion for cost based optimisation. Therefore a third step is necessary...
3) Column level statistics object is created and updated on the CONTROL node
Once the data is created PHYSICALLY on the distributions in the COMPUTE layer it must be brought together and held LOGICALLY to facilitate MPP plan cost based optimisation. The shell database on the CONTROL node also creates a statistics object. This is a LOGICAL representation of the statistics object.
However, the shell database stat does not yet reflect the column level statistical information held PHYSICALLY in the distributions on the COMPUTE nodes. Consequently, the statistics object in the shell database on the CONTROL node needs to be UPDATED immediately after it has been created.
DBCC SHOW_STATISTICS (table_name, stat_name) WITH STAT_STREAM is used to do this.
Notice that the command has a second parameter. This changes the result set, providing APS / SQLDW with all the information required to build a LOGICAL view of the statistics object for that column.
I hope this goes some way to explaining what you were seeing but also how statistics are created and why they are important for Azure SQL DW and for APS.
I'm having some trouble getting this table creation query to work, and I'm wondering if I'm running in to a limitation in redshift.
Here's what I want to do:
I have data that I need to move between schema, and I need to create the destination tables for the data on the fly, but only if they don't already exist.
Here are queries that I know work:
create table if not exists temp_table (id bigint);
This creates a table if it doesn't already exist, and it works just fine.
create table temp_2 as select * from temp_table where 1=2;
So that creates an empty table with the same structure as the previous one. That also works fine.
However, when I do this query:
create table if not exists temp_2 as select * from temp_table where 1=2;
Redshift chokes and says there is an error near "as" (for the record, I did try removing "as", and then it says there is an error near "select").
I couldn't find anything in the redshift docs, and at this point I'm just guessing as to how to fix this. Is this something I just can't do in redshift?
I should mention that I absolutely can separate out the queries that selectively create the table and populate it with data, and I probably will end up doing that. I was mostly just curious if anyone could tell me what's wrong with that query.
EDIT:
I do not believe this is a duplicate. The post linked to offers a number of solutions that rely on user-defined functions... Redshift doesn't support UDFs. They did recently implement a Python-based UDF system, but my understanding is that it's in beta, and we don't know how to implement it anyway.
Thanks for looking, though.
I couldn't find anything in the redshift docs, and at this point I'm just guessing as to how to fix this. Is this something I just can't do in redshift?
Indeed, this combination of CREATE TABLE ... AS SELECT and IF NOT EXISTS is not possible in Redshift (per the documentation). In PostgreSQL, it has been possible since version 9.5.
On SO, this is discussed here: PostgreSQL: Create table if not exists AS. The accepted answer provides options that don't require any UDF or procedural code, so they're likely to work with Redshift too.
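For reference, the split version of what you are doing might look like this from Python with psycopg2 (connection details are placeholders):
import psycopg2

# Hypothetical connection details for the Redshift cluster.
conn = psycopg2.connect("host=my-cluster.example.com port=5439 dbname=analytics user=etl password=secret")

with conn, conn.cursor() as cur:
    # Step 1: create the destination table only if it is missing.
    # LIKE copies the column definitions; for your WHERE 1=2 case this is all you
    # need, since it already yields an empty copy of temp_table.
    cur.execute("create table if not exists temp_2 (like temp_table);")
    # Step 2: when you actually want data in it, populate it in a separate statement.
    cur.execute("insert into temp_2 select * from temp_table;")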