Loopback 3.x Transaction over two data sources - loopbackjs

I've used the higher-level Transaction API of LoopBack, but it supports only one data source, since you start the transaction with
await app.dataSources.db.transaction(async models => {});
I have two data sources pointing to two separate databases on the same MySQL database server, and I would like to write into two tables, each residing in a separate database, within the same transaction. Is it possible to achieve this somehow in LoopBack, perhaps using the lower-level Transaction API? Any experience?

Related

PostgreSQL: How to store and fetch historical SQL data from Azure Data Lakes (ADLS)

I have a single Django web application deployed on Azure with a transactional SQL database (PostgreSQL).
Within the Django application, this historical data needs to be accessed every day (e.g. to show patterns over a period of years, months, etc.) from ADLS.
However, ADLS only returns one or more files, so my application needs an intermediary such as Azure Synapse to convert this unstructured data into a structured database in order to run queries on the historical data and show it within the web application.
Question A) Would Azure Synapse fulfil this 'unstructured to structured' conversion requirement, or is there another Azure alternative?
Question B) Since Django is inherently tied to its ORM (object-relational mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (e.g. ArrayField, JSONField, etc.)?
This entire exercise is being undertaken in order to store older historical data in a large repository and also to access/query that data from the ADLS repository whenever required.
Please advise which Azure alternatives may work in this case.
You need to break your problem down. For each piece you have multiple choices, with different cost implications, implementation complexity, and degrees of control/flexibility.
Question A) Would Azure Synapse fulfil this 'unstructured to structured' conversion requirement, or is there another Azure alternative?
Synapse Serverless SQL Pool lets you query JSON files from the Data Lake without a physical DB. It is compute only, no storage.
This is for infrequent access to large datasets, because every query goes back and parses the data in the Data Lake.
If you want, you can also COPY INTO some_table all the data from the files and then run queries more efficiently against some_table (which is stored in a database, with indexes, partitions, ...) using a dedicated Synapse SQL pool.
E.g. the following JSON
{
  "_id": "ahokw88",
  "type": "Book",
  "title": "The AWK Programming Language",
  "year": "1988",
  "publisher": "Addison-Wesley",
  "authors": [
    "Alfred V. Aho",
    "Brian W. Kernighan",
    "Peter J. Weinberger"
  ],
  "source": "DBLP"
}
can be queried with the following SQL:
SELECT
    JSON_VALUE(jsonContent, '$.title') AS title,
    JSON_VALUE(jsonContent, '$.publisher') AS publisher,
    jsonContent
FROM OPENROWSET
(
    BULK 'json/books/*.json',
    DATA_SOURCE = 'SqlOnDemandDemo',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
)
WITH ( jsonContent varchar(8000) ) AS [r]
WHERE JSON_VALUE(jsonContent, '$.title') = 'Probabilistic and Statistical Methods in Cryptology, An Introduction by Selected Topics'
Question B) Since Django is inherently tied to its ORM (object-relational mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (e.g. ArrayField, JSONField, etc.)?
Synapse offers good old JDBC drivers, so as long as your ORM layer can use a JDBC source you should be good to go. Remember that the underlying data source (Synapse) is meant for MPP (massively parallel processing), not transactional processing. So inserting 1,000 rows in a for loop using INSERT INTO ... would take about 1,000 seconds, but querying 10 million rows using a single SELECT ... statement would probably take less than 100. So know what you are doing with it.
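For what it's worth, from a Django/Python app you would more likely reach the Synapse SQL endpoint through ODBC than JDBC. A minimal sketch, assuming pyodbc and the Microsoft ODBC driver are installed; the server, database and table names below are placeholders, not anything from the question:
import pyodbc

# Placeholder connection details; assumes the "ODBC Driver 17 for SQL Server" is installed.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-workspace.sql.azuresynapse.net;"
    "DATABASE=your_sql_pool;"
    "UID=your_user;PWD=your_password"
)
cursor = conn.cursor()
# Large analytical reads are what Synapse is built for; avoid row-by-row INSERTs.
cursor.execute("SELECT title, publisher FROM books WHERE year = ?", "1988")
for title, publisher in cursor.fetchall():
    print(title, publisher)
conn.close()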
Does Synapse have to be configured with both the app DB and ADLS in a pipeline system through Azure Data Factory? And is this achievable for a PostgreSQL DB? Since I could not find Azure docs that talk specifically about PostgreSQL DB <---> ADLS connections. – Simran 14 hours ago
You're mixing things here. You can NOT use Synapse to give a single view of data across two data sources: 1) PostgreSQL, 2) ADLS.
The only source for Serverless is ADLS.
You can do this using Data Factory, which would allow you to create two data sources (ADLS and PostgreSQL), read from them, merge them to produce a new data set, and write the output to some data sink such as PostgreSQL. Your Django code would then be able to read this from PostgreSQL as usual.
Understand the cost and performance implications of each piece before you make a decision:
Serverless SQL Pool
Dedicated SQL pool
Data Factory

Django model creation connecting to MSSQL

I am connecting to a legacy MySQL database in the cloud from my Django project.
I need to fetch data and, if required, insert data into DB tables.
Do I need to write models for the tables which are already present in the DB?
If so, there are 90-plus tables; what happens if I create a model for a single table?
How do I talk to the database other than by creating models and migrating? Is there a better way in Django, or are models the better way?
When I create a model, what happens in the backend? Does it create those tables again in the same database?
There are several ways to connect to a legacy database; the two I use are either by creating a model for the data you need from the legacy database, or using raw SQL.
For example, if I'm going to be connecting to the legacy database for the foreseeable future, and performing both reads and writes, I'll create a model containing only the fields from the foreign table I need as a secondary database. That method is well documented and a bit more time consuming.
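A minimal sketch of that first approach, using Django's multiple-database support; the connection settings, table and field names below are invented for illustration:
# settings.py: register the legacy database as a second connection (values are placeholders).
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app_db",
    },
    "legacy": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "legacy_db",
        "HOST": "legacy-host.example.com",
        "USER": "readonly_user",
        "PASSWORD": "change-me",
    },
}

# models.py: an unmanaged model covering only the columns you need.
from django.db import models

class LegacyCustomer(models.Model):
    email = models.CharField(max_length=255)
    created_at = models.DateTimeField()

    class Meta:
        managed = False          # Django will never create or migrate this table
        db_table = "customers"   # existing table name in the legacy database

# Usage: point the query at the secondary connection.
rows = LegacyCustomer.objects.using("legacy").all()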
However, if I'm only reading data from a legacy database which will be retired, I'll create a read-only user on the legacy database, and use raw SQL to grab what I need like so:
from django.db import connections

# Grab a cursor on the secondary (legacy) connection defined in DATABASES.
cursor = connections["my_secondary_db"].cursor()
cursor.execute("SELECT * FROM my_table")
for row in cursor.fetchall():
    insert_data_into_my_new_system_model(row)  # your own routine that writes to the new system
I'm doing this right now with a legacy SQL Server database from our old website as we migrate our user and product data to Django/PostgreSQL. This has served me well over the years and saved a lot of time. I've used this to create sync routines all within a single app as Django management commands, and then when the legacy database is done being migrated to Django, I've completely deleted the single app containing all of the sync routines, for a clean break. Good luck!
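If it helps, a management command wrapping such a sync routine could look roughly like this; the command name, target model and column names are assumptions, not code from my project:
# myapp/management/commands/sync_legacy.py
from django.core.management.base import BaseCommand
from django.db import connections

from myapp.models import Customer  # hypothetical model in the new system


class Command(BaseCommand):
    help = "Copy rows from the legacy database into the new Django models"

    def handle(self, *args, **options):
        cursor = connections["my_secondary_db"].cursor()
        cursor.execute("SELECT email, created_at FROM my_table")
        for email, created_at in cursor.fetchall():
            # Idempotent upsert so the command can be re-run during the migration window.
            Customer.objects.update_or_create(
                email=email,
                defaults={"created_at": created_at},
            )
        self.stdout.write(self.style.SUCCESS("Legacy sync complete"))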

PostgreSQL with Django: should I store static JSON in a separate MongoDB database?

Context
I'm making a Django web application that depends on scraped API data.
The workflow:
A) I retrieve data from external API
B) Insert structured, processed data that I need in my PostgreSQL database (about 5% of the whole JSON)
I would like to add a third step (before or after step B) which will store the whole external API response in my database, for three reasons:
I want to "freeze" data, as an "audit trail" in case of the API changes the content (It happened before)
API calls in my business are expensive, and often limited to 6 months of history.
I might decide to integrate more data from the API later.
Calling the external API again when data is needed is not possible because of 2) and 3)
Please note that the stored API responses will never be updated, and read performance is not really important. Also, being able to query the stored API responses would be really nice for exploratory analysis.
To provide additional context, there are a few thousand API calls a day, which represent around 50 GB of data a year.
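For the sake of concreteness, the "same PostgreSQL database" option could look something like this in Django (assuming Django 3.1+ where models.JSONField is available; the model and field names are made up):
# models.py: archive the full API response next to the parsed data.
from django.db import models


class RawApiResponse(models.Model):
    fetched_at = models.DateTimeField(auto_now_add=True)
    endpoint = models.CharField(max_length=255)
    payload = models.JSONField()  # stored as jsonb on PostgreSQL, so it stays queryable

    class Meta:
        indexes = [models.Index(fields=["fetched_at"])]


# Exploratory queries remain possible on the frozen payloads, e.g.:
# RawApiResponse.objects.filter(payload__source="external_api")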
Here comes my question(s)
Should I store the raw JSON in the same PostgreSQL database I'm using for the Django web application, or in a separate datastore (MongoDB or some other NoSQL database)?
If I go with storing the raw JSON in my PostgreSQL database, I fear that my web application's performance will decrease due to the database being "bloated" (50 MB of parsed SQL data in my Django database is equivalent to 2 GB of raw JSON from the external API, so integrating the full API responses will multiply its size by 40).
What about cost, since all of this is hosted on a DBaaS? I understand that the cost will increase greatly (due to the increase in DB size), but is either of the two options more cost-effective?

Cross-service references in DB

I am building a service-oriented system with multiple services and applications.
Currently I am not sure how to handle DB references between resources from multiple services and databases.
For example, I have a users service, where I can define all users and their roles.
Next I have a products service, where I can define my products, their prices and other information.
I also have an invoicing service, which is used to create invoices. This service will use information from the previous two services: it will link products and users to invoices. Now I am not sure what the best approach is.
Do I just save the product ID and user ID that I got from the other two services, without any referential integrity?
If I do this, then I will have a problem when generating reports, because at generation time I will need to send a lot of requests to the products service to get the names and prices of the products on an invoice. The same goes for users.
Or do I create some products table in my invoicing application, and store the name and price of the product at the moment of invoice creation?
If I go with this approach, then whenever the price or name of a product changes, I will have inconsistent data across my applications.
Is there some well-known pattern for this kind of problem, i.e. what is the best solution?
Cross-service references in the DB are a common challenge for data integrity between multiple web services, especially when we are talking about real-time access.
There are two approaches for your case:
1- Database replication across your servers
I suppose that you have each application hosted on a separate server, so I can name your servers Users_server, Products_server and Invoices_server.
In your example, your invoice web service needs to grab data from the Users and Products servers; in this case, you can create a replica of your Users database and Products database on your Invoices_server.
This way you can run your join queries on the same server and get data from multiple databases.
Query example :
SELECT *
FROM UsersDB.User u
JOIN InvoicesDB.Invoice i ON u.Id = i.ClientId
2- Main database replication
As a first step, you have to replicate all your databases onto one main server, which we can call Base_server, and which basically contains all the databases from all your services.
Then you can build an internal web service for your application to provide the needed data in just "one call"; this answers your question about generating reports.
In other words, you will make one call to the main Base service instead of making two or three calls to your separate services.
Note: as backend developers we use this organization as a best practice when building a large bundle-based application: we create a base bundle and then create service bundles which rely on the base bundle.
If your services are already live, we may need more details about the technology and the database types you are using in order to give you a more accurate solution.
Just because you are using SOA doesn't mean you abandon database integrity. Continue to use referential integrity where your database design requires it.
At the service level, you can have each service be responsible for returning identity information for the entities which it owns. This identity information may or may not be the actual primary key from the database, but it will be used by the clients of the service as though it were the actual primary key.
When a client wants to create an invoice, it will call the User service and receive a User entity, which will contain a User Identifier. It will call the Product service and receive a set of products, each with a product identifier. It will then call the Invoice service to create an invoice, passing the user identifier and the product identifiers. This will likely return an invoice identifier.
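A rough sketch of that flow from the invoicing client's point of view; the service URLs and payload shapes below are invented for illustration, not a prescribed API:
# Sketch of the invoice-creation flow across the three services.
import requests

# The users service owns user identity.
user = requests.get("https://users.internal/api/users/42").json()

# The products service owns product identity.
products = requests.get(
    "https://products.internal/api/products", params={"ids": "7,9"}
).json()

# The invoicing service receives only the identifiers it needs to link them.
invoice = requests.post(
    "https://invoices.internal/api/invoices",
    json={
        "user_id": user["id"],
        "lines": [{"product_id": p["id"]} for p in products],
    },
).json()
print(invoice["id"])  # invoice identifier returned by the invoicing service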
You can (and probably should) enforce the integrity by making the productId and userId foreign keys in your invoice table. Then your DB makes sure the referenced entities exist. Reports should join tables, not query services for each item. I am assuming a central DB shared across the system.

Making sharding simple with Django

I have a Django project based on multiple PostgreSQL servers.
I want users to be sharded across those database servers using the same sharding logic used by Instagram:
User ID => logical shard ID => physical shard ID => database server => schema => user table
The logical shard ID is directly calculated from the user ID (13 bits embedded in the user id).
The mapping from logical to physical shard ID is hard coded (in some configuration file or static table).
The mapping from physical shard ID to database server is also hard coded. Instagram uses Pgbouncer at this point to retrieve a pooled database connection to the appropriate database server.
Each logical shard lives in its own PostgreSQL schema (for those not familiar with PostgreSQL, this is not a table schema, it's rather like a namespace, similar to MySQL 'databases'). The schema is simply named something like "shardNNNN", where NNNN is the logical shard ID.
Finally, the user table in the appropriate schema is queried.
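To make that chain concrete, here is a minimal sketch of the ID-to-schema mapping, assuming an Instagram-style ID layout where the 13 shard bits sit just above a 10-bit per-shard sequence; the shard counts and the physical-shard table are placeholders:
# Sketch: user ID -> logical shard -> physical shard -> database alias + schema.
NUM_LOGICAL_SHARDS = 8192  # 2 ** 13
PHYSICAL_SHARDS = {0: "shard_db_0", 1: "shard_db_1"}  # hard-coded mapping, as described above


def logical_shard_id(user_id):
    # The 13 shard bits are assumed to sit above a 10-bit sequence counter.
    return (user_id >> 10) & (NUM_LOGICAL_SHARDS - 1)


def route(user_id):
    logical = logical_shard_id(user_id)
    database = PHYSICAL_SHARDS[logical % len(PHYSICAL_SHARDS)]  # logical -> physical, hard coded
    schema = "shard%04d" % logical                              # one PostgreSQL schema per logical shard
    return database, schema


# route(123456789) -> ("shard_db_1", "shard5875")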
How can this be achieved as simply as possible in Django?
Ideally, I would love to be able to write Django code such as:
Fetching an instance
# this gets the user object on the appropriate server, in the appropriate schema:
user = User.objects.get(pk = user_id)
Fetching related objects
# this gets the user's posted articles, located in the same logical shard:
articles = user.articles
Creating an instance
# this selects a random logical shard and creates the user there:
user = User.create(name = "Arthur", title = "King")
# or:
user = User(name = "Arthur", title = "King")
user.save()
Searching users by name
# fetches all relevant users (kings) from all relevant logical shards
# - either by querying *all* database servers (not good)
# - or by querying a "name_to_user" table then querying just the
# relevant database servers.
users = User.objects.filter(title = "King")
To make things even more complex, I use Streaming Replication to replicate every database server's data to multiple slave servers. The masters should be used for writes, and the slaves should be used for reads.
Django provides support for automatic database routing, which is probably sufficient for most of the above, but I'm stuck with User.objects.get(pk = user_id) because the router does not have access to the query parameters: it does not know what the user ID is, it only knows that the code is trying to read the User model.
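For reference, this is roughly the shape of the router hook and where it falls short; the alias logic below is a skeleton for illustration, not a working sharding router:
# Skeleton of a Django database router, illustrating the limitation described above.
class ShardRouter:
    def db_for_read(self, model, **hints):
        # The router only sees the model class and optional hints (e.g. an instance),
        # not the filter arguments, so the user_id in User.objects.get(pk=user_id)
        # is not available here.
        instance = hints.get("instance")
        if instance is not None and hasattr(instance, "shard_alias"):
            return instance.shard_alias  # hypothetical attribute set elsewhere
        return None  # fall back to the default database

    def db_for_write(self, model, **hints):
        return self.db_for_read(model, **hints)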
I am well aware that sharding should probably be used only as a last resort optimization since it has limitations and really makes things quite complex. Most people don't need sharding: an optimized master/slave architecture can go a very long way. But let's assume I do need sharding.
In short: how can I shard data in Django, as simply as possible?
Thanks a lot for your kind help.
Note
There is an existing question which is quite similar, but IMHO it's too general and lacks precise examples. I wanted to narrow things down to a particular sharding technique I'm interested in (the Instagram way).
Mike Clarke recently gave a talk at PyPgDay on how Disqus shards their users with Django and PostgreSQL. He wrote up a blog post on how they do it.
Several strategies can be employed when sharding Postgres databases. At Disqus, we chose to shard based on table name. Whereas the original table name as generated by Django might be comments_post, our sharding tools will rewrite the SQL to query a table comments_post_X, where X is the shard ID calculated based on a consistent hashing scheme. All these tables live in a single schema, on a single database instance.
In addition, they released some code as part of a sample application demonstrating how they shard.
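A rough illustration of that table-name rewriting idea (not Disqus's actual code; the hash function and shard count here are assumptions):
import zlib

NUM_SHARDS = 64  # assumed shard count


def shard_table(base_table, shard_key):
    # Map the sharding key onto a stable shard ID and append it to the table name.
    shard_id = zlib.crc32(str(shard_key).encode()) % NUM_SHARDS
    return "%s_%d" % (base_table, shard_id)


sql = "SELECT * FROM {} WHERE post_id = %s".format(shard_table("comments_post", 12345))
# -> e.g. "SELECT * FROM comments_post_17 WHERE post_id = %s" (the exact shard depends on the hash)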
You really don't want to be in the position of asking this question. If you are sharding by user id then you probably don't want to search by name.
If you are sharding your database then it's not going to be invisible to your application and will probably end up requiring schema alterations.
You might find SkyTools useful - read up on PL/Proxy. It's how Skype shards their databases.
It is better to use professional sharding middleware, for example Apache ShardingSphere.
The project contains two products: ShardingSphere-JDBC, a Java driver, and ShardingSphere-Proxy, which works with any programming language. It can support Python and Django as well.