All,
I have a Django multitenant application with PostgreSQL (13.1) as a backend. I'm in a situation where I need to replicate data between "Table 1" and "Table 2". Right now I'm using logical replication (publish/subscribe) for this one-directional replication, but we need to replicate the data in both directions (i.e. from A to B and from B to A).
I have some questions for the experts while learning more about this situation; the answers would help me find a solution.
Context
We have a user with permission in both schemas
Uses logical replication (publish/subscribe)
Logical replication from Table 1 to Table 2 is working fine (a sketch of the current one-way setup is shown below this list)
But we need to replicate data in both directions (i.e. from A to B and from B to A)
Simultaneous data modification won't happen (in Schema 1 or Schema 2)
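For reference, a minimal sketch of the kind of one-way setup described above, assuming PostgreSQL 13 logical replication (the schema, table, and connection values are placeholders):

-- On the publishing side (source):
CREATE PUBLICATION pub_table1 FOR TABLE schema1.table1;

-- On the subscribing side (target):
CREATE SUBSCRIPTION sub_table1
    CONNECTION 'host=source-host dbname=app user=repl_user'
    PUBLICATION pub_table1;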
Questions
Will this result in an infinite loop?
Is there any way to avoid this looping?
Related
I have a single Django web application deployed on Azure with a transactional SQL database (PostgreSQL), and older historical data is kept in Azure Data Lake Storage (ADLS).
Within the Django application, this historical data needs to be accessed every day (e.g. to show patterns over a period of years, months, etc.) from ADLS.
However, ADLS only returns files (single or multiple), so my application needs an intermediary such as Azure Synapse to convert this unstructured data into a structured database, in order to run queries on the historical data and show the results within the web application.
Question A) Would Azure Synapse fulfil this 'unstructured to structured conversion' requirement, or is there another Azure alternative?
Question B) Since Django is inherently tied to its ORM (Object-Relational Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (e.g. ArrayField, JSONField, etc.)?
This entire exercise is being undertaken in order to store older historical data in a large repository and also to access/query data from that ADLS repository whenever required.
Please advise which Azure alternatives may work in this case.
You need to break down your problem. For each piece you have multiple choices, with different cost implications, implementation complexity, and amounts of control/flexibility.
Question A) Would Azure Synapse fulfil this 'unstructured to structured conversion' requirement, or is there another Azure alternative?
Synapse Serverless SQL Pool lets you query JSON files in the data lake without a physical database. It is compute only, no storage.
This is meant for infrequent access to large datasets, because every query goes out and parses the data in the data lake.
If you want, you can also COPY INTO some_table all the data from the files and then query some_table more efficiently (it is stored in the database, with indexes, partitions, ...) using a dedicated Synapse SQL Pool; a sketch of this is shown after the example below.
E.g. the following JSON
{
  "_id": "ahokw88",
  "type": "Book",
  "title": "The AWK Programming Language",
  "year": "1988",
  "publisher": "Addison-Wesley",
  "authors": [
    "Alfred V. Aho",
    "Brian W. Kernighan",
    "Peter J. Weinberger"
  ],
  "source": "DBLP"
}
can be queried with the following SQL:
SELECT
    JSON_VALUE(jsonContent, '$.title') AS title,
    JSON_VALUE(jsonContent, '$.publisher') AS publisher,
    jsonContent
FROM OPENROWSET
(
    BULK 'json/books/*.json',
    DATA_SOURCE = 'SqlOnDemandDemo',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
)
WITH
    ( jsonContent varchar(8000) ) AS [r]
WHERE
    JSON_VALUE(jsonContent, '$.title') = 'Probabilistic and Statistical Methods in Cryptology, An Introduction by Selected Topics'
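For the dedicated SQL pool route mentioned above, a COPY INTO load might look roughly like the sketch below. This is only a sketch: the table name, storage URL, and credential are placeholders, and it reuses the same single-column trick of loading each JSON document into one varchar column.

-- Staging table in the dedicated SQL pool
CREATE TABLE dbo.books_raw (jsonContent varchar(8000));

-- Load raw JSON documents from the data lake into the staging table
COPY INTO dbo.books_raw (jsonContent)
FROM 'https://mystorageaccount.dfs.core.windows.net/mycontainer/json/books/'
WITH (
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);

-- Subsequent queries hit the table instead of re-parsing the files
SELECT JSON_VALUE(jsonContent, '$.title') AS title
FROM dbo.books_raw;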
Question B) Since Django is inherently tied to its ORM (Object-Relational Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (e.g. ArrayField, JSONField, etc.)?
Synapse offers good old JDBC drivers, so as long as your ORM layer can use a JDBC source you should be good to go. Remember that the underlying data source (Synapse) is meant for MPP analytics, not transactional processing. Inserting 1000 rows in a for loop using INSERT INTO ... could take 1000 seconds, but querying 10 million rows with a single SELECT ... statement would probably take less than 100. So know what you are doing with it.
Does Synapse have to be configured with both the app DB and ADLS in a pipeline through Azure Data Factory? And is this achievable for a PostgreSQL DB? I could not find Azure docs that talk specifically about PostgreSQL DB <---> ADLS connections. – Simran
You're mixing things up here. You can NOT use Synapse to give a single view of data across two data sources: 1) PostgreSQL, 2) ADLS.
The only source for Serverless is ADLS.
You can do this using Data Factory, which would allow you to create two data sources (ADLS and PostgreSQL), read from them, merge them to produce a new dataset, and write the output to some data sink such as PostgreSQL. Your Django code would then be able to read this from PostgreSQL as usual.
Understand the cost and performance implications of each piece before you make a decision:
Serverless SQL Pool
Dedicated SQL pool
Data Factory
We are looking at Amazon Redshift to implement our data warehouse, and I would like some suggestions on how to properly design schemas in Redshift, please.
I am completely new to Redshift. In the past, when I worked with "traditional" data warehouses, I was used to creating schemas such as "Source", "Stage", "Final", etc. to group database objects according to what stage the data was in.
By default, a database in Redshift has a single schema, which is named PUBLIC. So, my question to those who have worked with Redshift: does the approach I have outlined above apply here? If not, I would love some suggestions.
Thanks.
From my experience working with Redshift, I can assert the following points with confidence (a sketch of the layout follows after these points):
Multiple schemas: You should create multiple schemas and create tables accordingly. When you scale, it will be easier for you to pinpoint exactly where a table is supposed to be. Let us say you have 3 schemas, named production, aggregates and rough. You know that production will contain the tables that are not supposed to be changed (mostly OLTP data), such as the user, order and transactions tables. aggregates will have aggregated data built over the raw tables, such as the number of orders placed per user per day per category. Finally, rough will contain any table that doesn't hold business logic but is required for some temporary work, say checking the genre of movies for a list of 100,000 users shared with you in an Excel file: simply create a table in the rough schema, perform your operations and drop the table. Now you know very clearly where you'll find tables based on whether they are raw, aggregated or simply temporary.
Public schema: Forget it exists. Any table that is not preceded by a schema name gets created there. A lot of clutter; there is no point in storing any important data there.
Cross-schema joins: There is nothing stopping you here. You may join as many tables from as many schemas as required. In fact, it is desirable to create dimension tables and join on a PK later, rather than keeping all the information in a single table.
Spend some quality time designing the schemas and the underlying table structure. When you expand, it will be easier for you to classify things better in terms of access control. Do let me know if I've missed some obvious points.
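A minimal sketch of that layout, using the example schema and table names above (the column definitions are placeholders):

CREATE SCHEMA production;
CREATE SCHEMA aggregates;
CREATE SCHEMA rough;

-- Raw, OLTP-style data lives in production
CREATE TABLE production.orders (
    order_id   BIGINT,
    user_id    BIGINT,
    category   VARCHAR(64),
    created_at TIMESTAMP
);

-- Derived data built on top of the raw tables goes in aggregates
CREATE TABLE aggregates.orders_per_user_per_day AS
SELECT user_id, category, TRUNC(created_at) AS order_date, COUNT(*) AS order_count
FROM production.orders
GROUP BY user_id, category, TRUNC(created_at);

-- Throwaway working tables live in rough and are dropped when done
DROP TABLE IF EXISTS rough.movie_genre_check;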
You can have multiple databases in a Redshift cluster but I would stick with one. You are correct that schemas (essentially namespaces) are a good way to divide things up. You can query across schemas but not databases.
I would avoid using the public schema, as managing certain permissions there can be difficult (it is easier to deny someone access to public than to prevent them from being able to create a table there, for example).
For best results, if you have the time, learn about the permissions system up front. You want to create groups that have access to schemas or tables, and add/remove users from groups to control what they can do. Once you have that going, it becomes pretty easy to manage.
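A minimal sketch of that group-based setup (the group, schema and user names are placeholders):

-- A group that may read everything in the analytics schema
CREATE GROUP analytics_readers;
GRANT USAGE ON SCHEMA analytics TO GROUP analytics_readers;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO GROUP analytics_readers;

-- Control access by moving users in and out of the group
ALTER GROUP analytics_readers ADD USER report_user;
ALTER GROUP analytics_readers DROP USER report_user;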
In addition to the other responses, here are some suggestions for improving schema performance.
First: Automatic compression encodings using the COPY command
Improve Amazon Redshift performance by loading data with the COPY command. The COPY command is clever enough to automatically choose the most appropriate encoding settings for the data it loads, so you don't have to think about it. However, it does so only for the first load into an empty table.
So make sure to use a significant data set when loading data for the first time, which Redshift can assess to set the column encodings in the best way. Loading only a few lines of test data will prevent Redshift from knowing how best to optimize compression for the real workload.
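A minimal sketch of such a first load (the table name, bucket path and IAM role are placeholders):

-- COMPUPDATE ON lets COPY choose column encodings on the first load into an empty table
COPY production.orders
FROM 's3://my-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS CSV
COMPUPDATE ON;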
Second: Use Best Distribution Style and Key
Distribution style decides how data is distributed across the nodes. Applying a distribution style at the table level tells Redshift how you want the table distributed and on which key, so how you specify the distribution style is important for good query performance with Redshift. The style you choose may affect storage requirements and cluster size, and it also affects the time taken by the COPY command to execute.
I recommend the ALL distribution style for the smaller dimension tables. For a large dimension, distribute both the dimension and the associated fact table on their join column. To optimize a second large dimension, either take the storage hit and distribute it ALL, or design its columns into the fact table.
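A minimal sketch of those choices (table and column names are placeholders):

-- Small dimension: replicate a full copy to every node
CREATE TABLE dim_date (
    date_key      INT,
    calendar_date DATE
) DISTSTYLE ALL;

-- Large dimension and fact distributed on their join column,
-- so the join is co-located on each node
CREATE TABLE dim_customer (
    customer_id   BIGINT,
    customer_name VARCHAR(256)
) DISTKEY (customer_id);

CREATE TABLE fact_orders (
    order_id    BIGINT,
    customer_id BIGINT,
    date_key    INT,
    amount      DECIMAL(18, 2)
) DISTKEY (customer_id);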
Third: Use the Best Sort Key
A Redshift database keeps the data of a table arranged by the sort key column, if one is specified. The data is sorted within each slice, so each cluster node holds its portion of the data in a predefined order. (While designing your Redshift schema, also consider the impact on your budget: Redshift is priced by the amount of stored data and by the number of nodes.)
The sort key can improve Amazon Redshift performance significantly, in several ways. First, data filtering: if a WHERE clause filters on a sort key column, entire data blocks can be skipped. This is because Redshift stores data in blocks and each block header records the minimum and maximum sort key values; if the filter falls outside that range, the whole block may be skipped.
Second, when joining two tables that are sorted on their join keys, the data is read in matching order and Redshift can merge-join without separate sort steps. Joining a large dimension to a large fact table becomes feasible with this method, because neither would fit into a hash table.
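A minimal sketch of a sort key and the kind of range filter that benefits from it (names are placeholders):

-- Sort the fact table by date so date-range filters can skip blocks
CREATE TABLE fact_orders_sorted (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(18, 2)
) DISTKEY (customer_id) SORTKEY (order_date);

-- Only blocks whose min/max order_date overlap this range are scanned
SELECT COUNT(*), SUM(amount)
FROM fact_orders_sorted
WHERE order_date BETWEEN '2021-01-01' AND '2021-01-31';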
I'm quite new to NoSQL and DynamoDB, and I'm used to RDBMSs. I'm designing the database for a game and we're using DynamoDB and AWS Lambda for our backend. I created a table named "Users" for player profiles that contains the user information and resources. Because the game has an inventory system, I also created a table named "UserItems".
It was all good until I realized DynamoDB doesn't have transactions, and any operation that is executed on both tables (for example, using an item that increases a resource) has a chance of failing on one table while succeeding on the other, causing inconsistent data that affects our customers.
So I was thinking that maybe my multiple-table design is not good, since it's a habit of mine to design multiple tables when working with an RDBMS. That led me to consider storing the entire "UserItems" data as a hash inside "Users", but I'm not sure this is good practice, because the size of a single row in the Users table would become really big (we may have 500 unique items per user), and each time I pull or put data from/to "Users" (most of the time I don't need the "UserItems" data) the read/write throughput would also be really large.
What should I do: keep the multiple-table design and handle transactions manually, or switch to a single-table design? Or maybe there is a third option?
Updated: more information about my use case
Currently I have 2 tables:
Users: UserId (key), Username, Gold
UserItems: UserId (partition key), ItemId (sort key), Name, GoldValue
Scenarios:
User buys an item: Users.Gold will be deducted, and a new item will be added to the UserItems table.
User sells an item: Users.Gold will be increased, and the item will be deleted from the UserItems table.
In both scenarios above I have to do two update operations across two tables, and without a transaction there is a chance that one of them fails.
To solve that, I am considering the single-table solution: a single Users table with 4 attributes, UserId (key), Username, Gold and UserItems. However, there are two things I'm worried about:
The data in UserItems might become too big for a single attribute, because one user could have up to 500 items.
To add/delete an item I have to pull UserItems from DynamoDB, add/delete the item and then put it back into Users. So I have to do 1 read and 1 write operation for 1 action, and because of issue (1) the read/write data size could become really big.
FWIW, the AWS documentation on NoSQL Design for DynamoDB suggests using a single table:
As a general rule, you should maintain as few tables as possible in a DynamoDB application. As emphasized earlier, most well designed applications require only one table, unless there is a specific reason for using multiple tables.
Exceptions are cases where high-volume time series data are involved, or datasets that have very different access patterns—but these are exceptions. A single table with inverted indexes can usually enable simple queries to create and retrieve the complex hierarchical data structures required by your application.
NoSQL databases are best suited for non-transactional data. If you bring normalization (splitting your data into multiple tables) into NoSQL, you are defeating the whole purpose of it. If performance is what matters most, you should consider having only a single table for your use case. DynamoDB supports range (sort) keys and also secondary indexes. For your use case, it would be better to redesign your table to use a range key.
If you can share more details about your current tables, maybe I can help you with more input.
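As a rough illustration of that single-table, range-key idea (the table name "GameData", the sort key "Sk" and its value format are hypothetical; written in DynamoDB's PartiQL syntax for brevity): the profile and the items share one table keyed by UserId, with the sort key distinguishing the record type.

-- Fetch only the profile record for a user
SELECT * FROM "GameData" WHERE UserId = 'u-123' AND Sk = 'PROFILE';

-- Fetch all inventory items for the same user in one query
SELECT * FROM "GameData" WHERE UserId = 'u-123' AND begins_with(Sk, 'ITEM#');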
I'm using AWS Redshift as a back end for Tableau Desktop. The cluster is running two dc1.large nodes and the database table I'm analyzing is 30 GB (with Redshift compression enabled). I chose Redshift over a Tableau extract for performance reasons, but the Redshift live connection seems much slower than an extract. Any suggestions on where I should look?
To use Redshift as a backend for a BI platform like Tableau, there are four things you can do to address latency:
1) Concurrency: Redshift is not great at running multiple queries at the same time, so before you start tuning the database, make sure your query is not waiting in line behind other queries. (If you are the only one on the cluster, this shouldn't be a problem.)
2) Table size: Whenever you can, use aggregate tables for better performance. Fewer rows to scan means less IO and faster turnaround!
3) Query complexity: Ideally, you want your BI tool to issue simple, fast performing queries. Make sure your source tables are fast, and that Tableau isn't being forced to do a bunch of joins. Also, if your query does need to join multiple tables, make sure any large tables have the same distribution key.
4) "Indexing": Technically, Redshift does not support true indexing, but you can get close to the same thing by using "interleaved" sort keys. Traditional compound sort keys won't help, but an interleaved sort key can allow you to quickly access rows from multiple vectors (date and customer_id, for instance) without having to scan the entire table.
Reality Check
After all of these things are optimized, you will often find that you still can't be as fast as a Tableau extract. Simply stated, a "fast" Tableau dashboard needs to return data to its user in under 5 seconds. If you have 7 visuals on your dashboard and each of the underlying queries takes 800 milliseconds to return (which is super fast for a database query), you are still only barely reaching your target performance. Now, if just one of those queries takes 5 seconds or more, your dashboard is going to feel "slow" no matter what you do.
In Summary
Redshift can be tuned using the approach above, but it may or may not be worth the effort. The best applications for using a live Redshift query instead of Tableau Extracts are in cases where the data is physically too large to create an extract of, and when you require data at a level of granularity that makes pre-aggregation infeasible. One good strategy is to create your main dashboard using an extract so that exploration/discovery is as fast as possible, and then use direct (live) Redshift queries for your drill-through reports (for instance, when you want to see exactly which customers roll up into your totals).
A few pointers:
1) Run VACUUM & ANALYZE once your ETL completes
2) Have you created the table with a proper distribution key and sort key?
3) Aggregate, if that is acceptable from the point of view of data granularity, requirements, etc.
1. Remove cursors: Tableau accesses data from the Redshift leader node using a cursor, and cursors work iteratively, which impacts performance.
2. Perform a manual ANALYZE on the table after running heavy load operations: https://docs.aws.amazon.com/redshift/latest/dg/r_ANALYZE.html
3. Check the distribution key to avoid data skew and improve performance (see the sketch below).
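A minimal sketch of those maintenance checks (the table name is a placeholder); svv_table_info is Redshift's system view that reports skew and stale statistics:

-- Reclaim space and re-sort, then refresh planner statistics after heavy loads
VACUUM sales;
ANALYZE sales;

-- Check for distribution skew and out-of-date statistics
SELECT "table", diststyle, skew_rows, unsorted, stats_off
FROM svv_table_info
WHERE "table" = 'sales';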
We are working on deploying our product (currently on prem) to AWS and are looking at DynamoDB as an alternative to Cassandra, mainly to avoid the DevOps costs associated with a large number of Cassandra clusters.
The DynamoDB documentation says that the per-account limit on the number of tables is 256 per region, but that it can be increased by contacting AWS support. What is the maximum this can be raised to per account?
Our product is separated into distinct logical units, where each such unit will have several tables (say 100). Each customer can have several such units. Each logical unit can be backed up (i.e. a snapshot taken) and that snapshot can be restored at any time in the future (to overwrite the current content of all tables). The backup/restore performance (the time taken to take a snapshot or import old data for all the tables) needs to be good; it cannot be several minutes or hours.
We were thinking of using a distinct set of tables for each such logical unit, so that backup/restore is quick using EMR on S3. But if we follow this approach, we will exceed the 256-table limit even with one customer. It looks like there are 2 options:
Create a new account for each such logical unit for each customer. Is this possible? We will have a main corporate account, I suppose (I am still learning about this), but can it have a set of sub-accounts for our customers, using IAM, each of which is considered an independent AWS account?
Use each table in a true multi-tenant manner, where the primary key contains the customer ID + logical unit ID. But in this scenario, when using EMR to back up an entire table, we would need to selectively back up a specific set of rows/items, which may number in the millions, while other read/write operations are going on against a different set of items. Is this feasible at large scale?
Any other thoughts on how to approach this?
Thanks for any info.
I would suggest changing the approach: rather than thinking about how to get more tables by creating more accounts,
think about how to use fewer tables.
Having said that, you could contact support and increase the number of tables for your account.
I think you will run into a money problem, due to the current pricing model of provisioned throughput per table.
Many people split tables based on time frame,
e.g. this week's table, last week's table, then moving the data into last month's table, and so on.
This helps when analyzing the data with EMR/Redshift, so you won't have to pull the whole table every time.