PostgreSQL: How to store and fetch historical SQL data from Azure Data Lakes (ADLS) - django

I have one single Django web application deployed on Azure with a transactional SQL DB i.e. PostgreSQL.
Within the Django application, every day this historical data needs to be accessed (eg: to show the pattern over a period of years, months etc.) from the ADLS.
However, the ADLS will only return a single/multiple Files, and my application needs an intermediate such as Azure Synapse to convert this unstructured data into Structured DB in order to perform Queries on this historical data to show it within the web application.
Question. A) Would Azure Synapse fulfil this 'unstructured to structured conversion' requirement, or is there another Azure alternative.
Question. B) Since Django is inherently tied to ORM (Object Relation Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (i.e. ArrayField, JSONField etc.)
This entire exercise is being undertaken in order to store older historical data in a large repository and also access/query data from that ADLS repository whenever required.
Please guide what Azure alternatives may work in this case.

You need to break down your problem. For each piece you have multiple choices with different cost implications and complexity of implementation and amount of control/flexibility you get.
Question. A) Would Azure Synapse fulfil this 'unstructured to structured conversion' requirement, or is there another Azure alternative.
Synapse Serverless SQL Pool lets you query JSON files from Datalake without a physical DB. It's only compute no storage.
This is for infrequent access to large datasets, because every query goes and parses the data in Datalake.
If you want you can also COPY INTO some_table all the data from files and then perform queries more efficiently on some_table (which is stored in DB, with indices, partitions, ...) using a dedicated Synapse SQL Pool.
E.g. following JSON
{
"_id":"ahokw88",
"type":"Book",
"title":"The AWK Programming Language",
"year":"1988",
"publisher":"Addison-Wesley",
"authors":[
"Alfred V. Aho",
"Brian W. Kernighan",
"Peter J. Weinberger"
],
"source":"DBLP"
}
Can be queried with following SQL:
SELECT
JSON_VALUE(jsonContent, '$.title') AS title
, JSON_VALUE(jsonContent, '$.publisher') as publisher
, jsonContent
FROM OPENROWSET
(
BULK 'json/books/*.json',
DATA_SOURCE = 'SqlOnDemandDemo'
, FORMAT='CSV'
, FIELDTERMINATOR ='0x0b'
, FIELDQUOTE = '0x0b'
, ROWTERMINATOR = '0x0b'
)
WITH
( jsonContent varchar(8000) ) AS [r]
WHERE
JSON_VALUE(jsonContent, '$.title') = 'Probabilistic and Statistical Methods in Cryptology, An Introduction by Selected Topics'
Question. B) Since Django is inherently tied to ORM (Object Relation Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (i.e. ArrayField, JSONField etc.)
Synapse offers good old JDBC drivers, so as long as your ORM layer can use a JDBC source you should be good to go. Remember that underlying data source (Synapse) is meant for MPP and not transactional processing. So inserting 1000 rows in a for loop using INSERT INTO... would take 1000 seconds, but querying 10 million rows using a SELECT ... statement would probably take less than 100. So know what you do with it.
Does Synapse have to be configured with both the App DB and ADLS in a pipeline system through Azure Data Factory? And is this achievable for a PostgreSQL DB? Since I could not Azure docs that talk specifically about PostgreSQL DB <---> ADLS connections. – Simran 14 hours ago
You're mixing things here. You can NOT use Synapse to give a single view of data across two data sources: 1) PostgreSQL, 2) ADLS.
Only source for Serverless is ADLS.
You can do this using Data Factory, which would allow you to create two data sources (ADLS and PostgreSQL), read from them, merge them to produce a new data set, write the output to some output data sink like PostgreSQL. Your Django code then would be able to read this from PostgreSQL as usual.
Understand the cost and performance implications of each piece before you make a decision:
Serverless SQL Pool
Dedicated SQL pool
Data Factory

Related

Correct appreoach to consume and prepare data from multiple sources power BI

I'm trying to establish if my planned way of working is correct.
I have two data sources; a MySql & MSSQL database. I need to combine these data sources and expose this data for Power BI to consume.
I've decided to use Azure Synapse Analytics for the ETL and would like to understand if there is anything in the process I can simplify or do better.
The process is as followed:
MySql & MSSQL delta loaded into ASA as parquet format, stored in Azure Gen 2 Storage.
Once copy pipeline is complete a subsiquent data flow unions the data from the two sources and inserts into MSSQL storage in ASA.
BI Consumes from this workspace / data soruce.
I'm not sure if I should be storing from the data sources to Azure Gene 2, or I should just perform the transform and insert from the source straight into the MSSQL storage. Any thoughts or suggestions would be greatly appreciated.
The pattern that you're following is the data lake pattern, where data is moved between 3 zones:
Raw
Enriched
Curated
The Raw zone keeps an original copy of the data before transformation. The benefit of storing the data this way (as parquet files, here) is so that you can troubleshoot a problem with the transformation or create a different transformation to address a new need.
The Enriched zone is where you have done some transformation, like UNIONing your data, or providing some other clean up steps, maybe removing unneeded columns, correcting addresses, etc. You have done this by inserting the data into a SQL database, but this might also be accomplished by using views in the serverless pool, if the transformations are simple enough: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-views
The Curated zone is a place to transform your data into a form that BI applications will do well with, i.e. a star schema. Even if this is a very simple dataset, it will be well worth incorporating a date dimension, which will yield a lot of benefits in Power BI. The bottom line here is that Power BI is optimized to work with star schemas, so that's what you should give it.
You do not need to use data lake technologies to follow this pattern and still get the benefits. As far as whether what you are doing is good will be based on how everything performs versus how simple you can keep it.
Here's more on the topic: https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/best-practices/data-lake-overview
Once copy pipeline is complete a subsiquent data flow unions the data
from the two sources and inserts into MSSQL storage in ASA
What is the use MSSQL storage ? Is it only used by PowerBI to create reports , if yes then you can use ADLS gen2 , as it will be cheaper, ( basically very in line with Mark said above as "curated"
Just one more thing to consider , PowerBI can read data from both the sources and then do the transformation within itself.

What does "interactive query" or "interactive analytics" mean in Presto SQL?

The Presto website (and other docs) talk about "interactive queries" on Presto. What is an "interactive query"? From the Presto Website: "Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse."
An interactive query system is basically a user interface that translates the input from the user into SQL queries. These are then sent to Presto, which processes the queries and gets the data and sends it back to the user interface.
The UI then renders the output, which is typically NOT just a simple table of numbers and text, but rather a complex chart, a diagram or some other powerful visualization.
The users expects to be able to e.g. update one criteria and get the updated chart or visualization in near real time, just like you expect on any application typically. Even if the creation of this analysis involves LOTS of data to be processed.
Presto can do that since it can query massive distributed object storage systems like HDFS and many other cloud storage systems, as well as RDBMSs and so on. And it can be set up to have a huge cluster of workers that query the source in parallel and therefore process massive amounts of data for analysis, and still be fast enough for the user expectations.
A typical application to use for the visualization is Apache Superset. You just hook up Presto to it via the JDBC driver. Presto has to be configured to point at the underlying data sources and you are ready to go.

Logstash and looking up additional data from a relational table?

I have mobile app log data being posted daily (eventually it will be a data stream). I am looking at different solutions for processing this log data and providing analytics. I am considering using logstash/elasticsearch/kibana combination, but we have additional data on our users stored in a redshift database. So in addition to the mobile data, I would like to pull in additional data from redshift about the user at the time of interaction with mobile app.
However, I've read in some places that doing an actual database query through logstash isn't feasible, but you can use a dictionary file to do a lookup of each user.
I have two questions regarding this approach
Is there a limit to have large this lookup file can be? Mine would be < 500K records so I'd imagine it would be fine?
Can the process of making the the lookup file from redshift tables be fully automated (ideally though aws services) - i.e. each night the lookup table is refreshed and posted to logstash, and then used for breakouts in Kibana
The way we're currently doing it is processing a daily jason file with a lambda function, posting it to s3 and then reading it into a redshift table. This data is then processed into sessions and joined up with other tables to generate the final dataset to be used for visualization. This is currently done in Tableau but we are exploring other options (such as quicksight, or possibly the ELK stack)
Just trying to figure out what solution is going to be scalable to clickstream data and will be the most useful down the line.
Thanks!
logstash 7 has a jdbc_streaming filter plugin for dynamically adding stuff to your events, as well as the jdbc_static filter for static stuff.
As you found, you can also use the translate filter. The man page says they've tested "very large" datasets up to 100,000 entries, so your dataset may require some testing. The good part about this filter is that it will reload the data when it detects a change, so you can publish the data on your own schedule (e.g. cron) without restarting logstash. Be on the lookout for events that don't get the translated value, which might be a sign that your publishing frequency should be updated.

PostgreSQL with Django: should I store static JSON in a separate MongoDB database?

Context
I'm making, a Django web application that depends on scraped API data.
The workflow:
A) I retrieve data from external API
B) Insert structured, processed data that I need in my PostgreSQL database (about 5% of the whole JSON)
I would like to add a third step, (before or after the "B" step) which will store the whole external API response in my databases. For three reasons:
I want to "freeze" data, as an "audit trail" in case of the API changes the content (It happened before)
API calls in my business are expensive, and often limited to 6 months of history.
I might decide to integrate more data from the API later.
Calling the external API again when data is needed is not possible because of 2) and 3)
Please note that the stored API responses will never be updated and read performance is not really important. Also, being able to query the stored API responses would be really nice, to perform exploratory analysis.
To provide additional context, there is a few thousand API calls a day, which represent around 50GB of data a year.
Here comes my question(s)
Should I store the raw JSON in the same PostgreSQL database I'm using for the Django web application, or in a separate datastore (MongoDB or some other NoSQL database)?
If I go with storing the raw JSON in my PostgreSQL database, I fear that my web application performance will decrease due to the database being "bloated" (50Mb of parsed SQL data in my Django database are equivalent to 2GB of raw JSON from the external API, so integrating the full API responses in my database will multiply its size by 40)
What about cost? as all this is hosted on a DBaaS. I understand that the cost will increase greatly (due to the DBs size increase), but is any of the two options more cost effective?

Problem regarding integration of various datasources

We have 4 datasources.2 datasources are internal and we can directly connect to the database.For the 3rd datasource we get a flat file (.csv) and have to pull in the data.4rth datasource is external and we cannot access it directly.
We need to pull data from all the 4 datasources, run business rules on them and store them in our database. We have a web application that runs on top of this database.Also every month we have to pull the data and do any updates/deletes/adds etc to existing data.
I am pretty much ignorant about this process.Also Can you please point some good books to study this topic.
These are the current approaches that i was thinking of.
To write an internal webservice that will talk to internal datasoureces and pull data.Create a handler to the external datasource using middleware (mqseries is already setup for this in some other existing project,planning to reuse that).PUll data from csv file again using Java.
On this data run some business rules from Java.Use this data.
This approach might run in my dev box, but not sure what all problems can occur in prod (specially due to synchronization)
Pull data from internal using plain java jdbc connection.For the remaining 2 get flat files, dump data using sql loader.All the data goes to temporary tables first.Run busines rules thru pl/sql and use.
Use some ELT tool like informatica to pull data.write business rules in perl (invoked by informatica)
Thanks.
A book like "The Data Warehouse ETL Toolkit" by Ralph Kimball is a good resource for learning techniques/architectures to bring data from different sources into one place.