BigQuery vs Cloud SQL for audit data - google-cloud-platform

Which database is better for transactional data - Cloud SQL or Google BigQuery?
I have a requirement wherein I need to load data into a single table from multiple jobs. Which database will be better for this, Google BigQuery or Cloud SQL?
I know that in terms of cost effectiveness Cloud SQL is the better choice. But are there any other pointers apart from this?

Cloud SQL is a managed database for transactional workloads (OLTP). It can be tweaked to work for OLAP (analytical) use, but it is intended to be a transactional database.
BigQuery is for analytical data (OLAP), that is, data that won't change. Think of it as data "at rest" that's not going to be modified.
If some of your transactions are not finalized (there are in-flight transactions, or your end-to-end process needs more steps), you need a transactional database - like the ones from Cloud SQL.
If your data won't change, use BigQuery.
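If BigQuery does fit (i.e. the audit rows are append-only), the "multiple jobs into a single table" part is straightforward: each job can run its own load with WRITE_APPEND against the same table. Below is a minimal sketch with the google-cloud-bigquery client; the project, dataset, table and bucket names are made up, and it assumes the data lands as newline-delimited JSON in Cloud Storage.

# Minimal sketch: several independent jobs appending into one BigQuery table.
# Project, dataset, table and bucket names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table_id = "my-project.audit.events"   # the single target table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # append, don't overwrite
)

# Each job runs its own load against the same table; load jobs are atomic,
# so concurrent appends from different jobs don't interfere with each other.
load_job = client.load_table_from_uri(
    "gs://my-audit-bucket/job-42/*.json",  # hypothetical path written by one job
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for completion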

How to deal with failing Athena queries as AWS Glue Data Catalog metadata size grows large?

Based on my research, the easiest and most straightforward way to get metadata out of Glue's Data Catalog is using Athena and querying the information_schema database. The article below has come up frequently in my research and is written by Amazon's team:
Querying AWS Glue Data Catalog
However, under the section titled Considerations and limitations the following is written:
Querying information_schema is most performant if you have a small to moderate amount of AWS Glue metadata. If you have a large amount of metadata, errors can occur.
Unfortunately, this article does not seem to give any indication or suggestion regarding what constitutes a "large amount of metadata" and exactly what errors could occur when the metadata is large and one needs to query it.
My question is: how do I deal with the ever-growing size of the Data Catalog's metadata so that I never encounter errors when using Athena to query the metadata?
Is there a best practice for this? Or perhaps a better solution for getting the same metadata that querying the catalog with Athena provides, without a great many API calls (using boto3, Hive DDL, etc.)?
I talked to AWS Support and did some research on this. Here's what I gathered:
The information_schema is built at query execution time; there doesn't seem to be any caching.
If you access information_schema.tables, it will make separate calls for each schema you have to the Hive Metastore (Glue Data Catalog).
If you access information_schema.columns, it will make separate calls for each schema and each table in that schema you have to the Hive Metastore.
These queries are affected by the general service quotas. In this case, DML queries like your select must finish within 30 minutes.
If your Glue Data Catalog has many thousands of schemas, tables, and columns, all of this may result in slow performance. As a rough guesstimate, support told me that you should be fine as long as you have fewer than ~10,000 tables, which should be the case for most people.
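If you do end up calling the catalog directly (the boto3 route mentioned in the question), note that the Glue API is paginated, so a large catalog mostly means a longer enumeration rather than a hard failure. A rough sketch, assuming default credentials and whatever databases and tables happen to be in your own catalog:

# Rough sketch: enumerate databases, tables and columns straight from the Glue
# Data Catalog with boto3 paginators, instead of querying information_schema.
import boto3

glue = boto3.client("glue")

for db_page in glue.get_paginator("get_databases").paginate():
    for db in db_page["DatabaseList"]:
        for tbl_page in glue.get_paginator("get_tables").paginate(DatabaseName=db["Name"]):
            for table in tbl_page["TableList"]:
                # Column metadata rides along on each table entry.
                columns = table.get("StorageDescriptor", {}).get("Columns", [])
                print(db["Name"], table["Name"], len(columns))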

Is there a way to connect PBI to a Databricks cluster that is not running?

In my scenario, Databricks is performing read and writing transformations in Delta tables. We have PBI connected to the Databricks cluster that needs to be running most of the time, which is expensive.
Knowing that the Delta tables live in a container, what would be the best way in terms of cost vs. performance to feed PBI from Delta tables?
If your dataset size is under the maximum allowed size in Power BI (100 GB, I guess) and a daily refresh is enough, you can just load everything into your Power BI model.
https://blog.gbrueckl.at/2021/01/reading-delta-lake-tables-natively-in-powerbi/
If you want to save costs, maybe you don't need transactions and can store the data as CSV in the data lake; then loading everything into Power BI and refreshing daily is really easy.
If you want to save costs and query newly incoming data all the time using DirectQuery, consider using Azure SQL. It has really competitive prices, starting from about 5 EUR/USD. Integration with Databricks is also straightforward: write in append mode and it does all the magic.
Another option to consider is to create an Azure Synapse workspace and use serverless SQL compute to query the Delta Lake files. This is a pay-per-TB-consumed pricing model, so you don't have to keep your Databricks cluster running all the time. It's a great way to load Power BI import models.
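For what it's worth, the serverless pool can read Delta folders directly with OPENROWSET(... FORMAT = 'DELTA'), so neither Databricks nor a dedicated pool needs to be running for the query itself. A hedged sketch of such a query, issued here through pyodbc just to show the SQL; the endpoint, credentials and storage path are placeholders, and Power BI would send an equivalent query through its own Synapse/SQL connector.

# Hedged sketch: query Delta Lake files through a Synapse serverless SQL pool.
# The endpoint, credentials and storage path below are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"  # serverless endpoint (placeholder)
    "DATABASE=master;UID=sqluser;PWD=secret"             # prefer AAD auth in practice
)

sql = """
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/lake/sales_delta/',
    FORMAT = 'DELTA'
) AS rows;
"""

for row in conn.cursor().execute(sql):
    print(row)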

PostgreSQL: How to store and fetch historical SQL data from Azure Data Lakes (ADLS)

I have a single Django web application deployed on Azure with a transactional SQL DB, i.e. PostgreSQL.
Within the Django application, this historical data needs to be accessed from ADLS every day (e.g. to show patterns over a period of years, months, etc.).
However, ADLS will only return single/multiple files, and my application needs an intermediary such as Azure Synapse to convert this unstructured data into a structured form in order to run queries on this historical data and show it within the web application.
Question A) Would Azure Synapse fulfil this 'unstructured to structured conversion' requirement, or is there another Azure alternative?
Question B) Since Django is inherently tied to an ORM (Object-Relational Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (i.e. ArrayField, JSONField, etc.)?
This entire exercise is being undertaken in order to store older historical data in a large repository and also to access/query data from that ADLS repository whenever required.
Please advise which Azure alternatives may work in this case.
You need to break down your problem. For each piece you have multiple choices, with different cost implications, implementation complexity, and the amount of control/flexibility you get.
Question A) Would Azure Synapse fulfil this 'unstructured to structured conversion' requirement, or is there another Azure alternative?
Synapse Serverless SQL Pool lets you query JSON files from the data lake without a physical DB. It's compute only, no storage.
This is for infrequent access to large datasets, because every query goes and parses the data in the data lake.
If you want, you can also COPY INTO some_table all the data from the files and then perform queries more efficiently on some_table (which is stored in a database, with indexes, partitions, ...) using a dedicated Synapse SQL pool.
E.g. the following JSON
{
    "_id": "ahokw88",
    "type": "Book",
    "title": "The AWK Programming Language",
    "year": "1988",
    "publisher": "Addison-Wesley",
    "authors": [
        "Alfred V. Aho",
        "Brian W. Kernighan",
        "Peter J. Weinberger"
    ],
    "source": "DBLP"
}
can be queried with the following SQL:
SELECT
    JSON_VALUE(jsonContent, '$.title') AS title,
    JSON_VALUE(jsonContent, '$.publisher') AS publisher,
    jsonContent
FROM OPENROWSET
(
    BULK 'json/books/*.json',
    DATA_SOURCE = 'SqlOnDemandDemo',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
)
WITH
    ( jsonContent varchar(8000) ) AS [r]
WHERE
    JSON_VALUE(jsonContent, '$.title') = 'Probabilistic and Statistical Methods in Cryptology, An Introduction by Selected Topics'
Question B) Since Django is inherently tied to an ORM (Object-Relational Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (i.e. ArrayField, JSONField, etc.)?
Synapse offers good old JDBC drivers, so as long as your ORM layer can use a JDBC source you should be good to go. Remember that the underlying data source (Synapse) is meant for MPP, not transactional processing. So inserting 1000 rows in a for loop using INSERT INTO ... would take 1000 seconds, but querying 10 million rows using a SELECT ... statement would probably take less than 100. So know what you're doing with it.
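From Django specifically, the usual route is ODBC rather than JDBC. Below is a hedged sketch of what a second database entry might look like with the mssql-django backend; the host, credentials and alias are placeholders, and it is not guaranteed that Postgres-specific fields such as ArrayField or JSONField map onto Synapse at all.

# Hedged sketch of a settings.py entry pointing a *second* Django connection at
# Synapse via ODBC (mssql-django backend). Names and credentials are placeholders;
# PostgreSQL stays the default database for the app's own models.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "appdb",
        # ... existing settings ...
    },
    "synapse": {                                       # read-mostly analytical source
        "ENGINE": "mssql",                             # from the mssql-django package
        "NAME": "mydb",
        "HOST": "myworkspace.sql.azuresynapse.net",    # placeholder endpoint
        "USER": "sqluser",
        "PASSWORD": "secret",
        "OPTIONS": {"driver": "ODBC Driver 17 for SQL Server"},
    },
}

# Queries would then look like MyHistoryModel.objects.using("synapse").filter(...),
# typically with managed = False models mapped onto the Synapse tables.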
Does Synapse have to be configured with both the app DB and ADLS in a pipeline system through Azure Data Factory? And is this achievable for a PostgreSQL DB? I could not find Azure docs that talk specifically about PostgreSQL DB <---> ADLS connections. – Simran 14 hours ago
You're mixing things here. You can NOT use Synapse to give a single view of data across two data sources: 1) PostgreSQL, 2) ADLS.
The only source for Serverless is ADLS.
You can do this using Data Factory, which would allow you to create two data sources (ADLS and PostgreSQL), read from them, merge them to produce a new dataset, and write the output to some data sink like PostgreSQL. Your Django code would then be able to read this from PostgreSQL as usual.
Understand the cost and performance implications of each piece before you make a decision:
Serverless SQL Pool
Dedicated SQL pool
Data Factory

Migrating a relational DB into AWS services

I have a terabyte-sized SQL Server DB table which has only two columns:
Id,
HTML Content
There are a few applications that call this table to retrieve the HTML content by providing the Id of the row.
The DB resides on-premises, and its maintenance cost and size keep getting higher. I am thinking of moving this DB into AWS DynamoDB. The reason I have chosen DynamoDB is the cost and the performance I have read about.
Are there any concerns I should know about before choosing DynamoDB?
Are there any other services in AWS that I could possibly use instead of DynamoDB?
I understand that SQL Server is a relational DB, while DynamoDB is NoSQL, and it seems a NoSQL DB could be a potential solution for this scenario. I have no joins or transactions against that table. All I am doing with the table is insert and select.
Are there any concerns I should know about before choosing DynamoDB?
As with most NoSQL big-data DBs, DynamoDB is eventually consistent by default, so if your application writes and then immediately reads the same record, you should expect occasional inconsistencies (strongly consistent reads are available, at a higher read cost).
I'm not familiar with "prem", and assuming you mean that you're working with your own private servers, I feel obligated to provide the following warning: working in the cloud is very different from working with your own servers. Requests fail more often, the latency pattern is different, and you should architect your software to handle these sorts of issues. If you're planning on moving to the cloud, I'd start by migrating your application and leave the DB for last.
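For the access pattern described in the question (insert and select by Id), the DynamoDB side is a plain key-value lookup. A minimal boto3 sketch with a hypothetical table name; ConsistentRead is the knob that closes the read-after-write window mentioned above, at extra read cost.

# Minimal sketch of the Id -> HTML access pattern on DynamoDB with boto3.
# Table name and attribute names are hypothetical; "Id" is the partition key.
# Note that DynamoDB items are capped at 400 KB, so very large HTML documents
# would have to live in S3 with only a pointer stored here.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("HtmlContent")

# Insert
table.put_item(Item={"Id": "page-123", "Html": "<html>...</html>"})

# Select by Id; ConsistentRead=True avoids the stale-read window at extra cost.
resp = table.get_item(Key={"Id": "page-123"}, ConsistentRead=True)
html = resp.get("Item", {}).get("Html")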
If you really need real-time updates of your data, you should reconsider moving to DynamoDB. Also, DynamoDB is useful when you need a dynamic number of columns for each row. So except for the cost, I don't see any benefits here.
If you don't need real-time updates, you can look into AWS Redshift or Google BigQuery; these will be cheaper solutions compared to DynamoDB.
As you have mentioned, you just have two columns, so take a look at Redis as well. A plain key-value structure will help with performance. But since Redis stores everything in physical memory, the cost will be high and you'll still need permanent storage/a DB like SQL Server or MySQL. So in terms of performance, yes, you'll be able to see a huge difference, but you'll be paying more than your current cost.
How about AWS Aurora? AWS claims 1/10th of the cost compared to other SQL/MySQL instances, and it has backward compatibility as well.

Backup/Remove Datastore entities to BigQuery automatically at some interval

I am looking to stream some data into BigQuery and had a question around step 3 of Google's best practices for streaming data into BigQuery. The process makes sense at a high level, but I'm struggling with the implementation of step 3. (I am looking to use Datastore as my transactional data store.) Step 3 says to "append the reconciled data from the transactional data store and truncate the unreconciled data table". My question is this: if my reconciled data is in Google Datastore, is there a way to automate the backup and deletion of this data without manual intervention?
I know I could achieve this recommended practice by using the Datastore Admin. I could:
1) Pause all writes to the datastore
2) Back up the Datastore table to Cloud Storage
3) Delete all entities in the table I just backed up.
4) Import the backup into BigQuery
Is there a way I can automate this so I don't have to do it manually at regular intervals?
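Roughly, I imagine the automated version of steps 2 and 4 looking something like the sketch below, run from a scheduled job (project, bucket and kind names are placeholders, and I have not verified the exact layout of the export folder):

# Sketch only: export one Datastore kind to Cloud Storage, then load the export
# into BigQuery. Project, bucket and kind names are placeholders.
from google.cloud import bigquery
from google.cloud import datastore_admin_v1

PROJECT = "my-project"
BUCKET = "gs://my-backup-bucket"
KIND = "Transaction"

# Step 2: managed export of one kind (long-running operation).
admin = datastore_admin_v1.DatastoreAdminClient()
op = admin.export_entities(
    request={
        "project_id": PROJECT,
        "output_url_prefix": BUCKET,
        "entity_filter": {"kinds": [KIND]},
    }
)
export = op.result()  # blocks until the export finishes

# Step 4: load the per-kind export metadata file into BigQuery.
prefix = export.output_url.rsplit("/", 1)[0]  # folder holding the export files
uri = f"{prefix}/all_namespaces/kind_{KIND}/all_namespaces_kind_{KIND}.export_metadata"

client = bigquery.Client(project=PROJECT)
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.DATASTORE_BACKUP,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
client.load_table_from_uri(uri, f"{PROJECT}.backup.{KIND}", job_config=job_config).result()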
Real-time dashboards and queries
In certain situations, streaming data into BigQuery enables real-time analysis over transactional data. Since streaming data comes with a possibility of duplicated data, ensure that you have a primary, transactional data store outside of BigQuery.
You can take a few precautions to ensure that you'll be able to perform analysis over transactional data, and also have an up-to-the-second view of your data:
1) Create two tables with an identical schema. The first table is for the reconciled data, and the second table is for the real-time, unreconciled data.
2) On the client side, maintain a transactional data store for records.
Fire-and-forget insertAll() requests for these records. The insertAll() request should specify the real-time, unreconciled table as the destination table.
3) At some interval, append the reconciled data from the transactional data store and truncate the unreconciled data table.
4) For real-time dashboards and queries, you can select data from both tables. The unreconciled data table might include duplicates or dropped records.
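A hedged sketch of what steps 2 and 4 above can look like with the google-cloud-bigquery client; the dataset, table and column names are made up, and step 3 (the periodic load-and-truncate from the transactional store) is only indicated in a comment, since its schedule and trigger are up to you.

# Hedged sketch of the dual-table pattern: stream into the unreconciled table,
# query the union of both. Dataset, table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
UNRECONCILED = "my-project.analytics.events_unreconciled"  # real-time, may contain dupes
RECONCILED = "my-project.analytics.events"                 # loaded from the transactional store

# Step 2: fire-and-forget streaming inserts into the unreconciled table.
errors = client.insert_rows_json(UNRECONCILED, [{"event_id": "e1", "amount": 10}])
if errors:
    print("streaming insert problems:", errors)

# Step 4: dashboards/queries read both tables, preferring reconciled rows.
query = f"""
SELECT * FROM `{RECONCILED}`
UNION ALL
SELECT * FROM `{UNRECONCILED}` u
WHERE NOT EXISTS (SELECT 1 FROM `{RECONCILED}` r WHERE r.event_id = u.event_id)
"""
for row in client.query(query).result():
    pass  # feed the dashboard

# Step 3 would periodically load the reconciled data (e.g. the Datastore export
# from the question) into RECONCILED and then TRUNCATE TABLE the unreconciled
# table, triggered from Cloud Scheduler or a similar scheduler.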