How to maximize DB upload rate with Azure Cosmos DB - python-2.7

Here is my problem. I am trying to upload a large CSV file (~14 GB) to Cosmos DB, but I am finding it difficult to maximize the throughput I am paying for. On the Azure portal metrics overview UI, it says that I am using 73 RU/s when I am paying for 16,600 RU/s. Right now, I am using pymongo's bulk write function to upload to the DB, but I find that any bulk_write batch longer than 5 operations throws a hard "Request rate is large" exception. Am I doing this wrong? Is there a more efficient way to upload data in this scenario? Internet bandwidth is probably not a problem, because I am uploading from an Azure VM to Cosmos DB.
Structure of how I am uploading in Python now (the find query and row data are omitted here):
import csv

import pymongo

operations = []
with open(csv_path) as csv_file:
    for row in csv.reader(csv_file):
        row[id_index_1] = convert_id_to_useful_id(row[id_index_1])
        find_criteria = {
            # find query
        }
        upsert_dict = {
            # row data, wrapped in an update operator such as $set
        }
        operations.append(pymongo.UpdateOne(find_criteria, upsert_dict, upsert=True))
        if len(operations) > 5:
            results = collection.bulk_write(operations)
            operations = []
# flush whatever is left over after the loop
if operations:
    collection.bulk_write(operations)
Any suggestions would be greatly appreciated.

Aaron, yes, as you said in the comment, the Data Migration tool is not supported for the Azure Cosmos DB MongoDB API. You can find the statement below in the official doc.
The Data Migration tool does not currently support Azure Cosmos DB MongoDB API either as a source or as a target. If you want to migrate the data in or out of MongoDB API collections in Azure Cosmos DB, refer to Azure Cosmos DB: How to migrate data for the MongoDB API for instructions. You can still use the Data Migration tool to export data from MongoDB to Azure Cosmos DB SQL API collections for use with the SQL API.
I can offer you a workaround: use Azure Data Factory. Please refer to this doc to configure Cosmos DB as the sink, and to this doc to configure the CSV file in Azure Blob Storage as the source. In the pipeline, you can configure the batch size.
Of course, you could also do this programmatically. You didn't miss anything: the error "Request rate is large" just means you have exceeded the provisioned RU quota. You could raise the provisioned RU setting. Please refer to this doc.
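As a rough illustration only (not production code - the batch size, sleep values, and helper name are arbitrary), you could keep batching with pymongo and simply back off and retry whenever Cosmos DB throttles you; error code 16500 is what the MongoDB API returns for "Request rate is large", and because your operations are idempotent upserts, re-sending the whole batch on retry is safe:
import time

from pymongo.errors import BulkWriteError

def bulk_write_with_retry(collection, operations, max_retries=10):
    """Retry a bulk_write until it succeeds or max_retries is exhausted."""
    for attempt in range(max_retries):
        try:
            return collection.bulk_write(operations, ordered=False)
        except BulkWriteError as exc:
            codes = set(err.get("code") for err in exc.details.get("writeErrors", []))
            if codes and codes <= set([16500]):
                # Throttled ("Request rate is large"): back off and retry.
                time.sleep(0.5 * (attempt + 1))
            else:
                raise
    raise RuntimeError("still throttled after %d retries" % max_retries)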
If you have any concerns, please feel free to let me know.

I'd take a look at the Cosmos DB: Data Migration Tool. I haven't used this with the MongoDB API, but it is supported. I have used this to move lots of documents from my local machine to Azure with great success, and it will utilize RU/s that are available.
If you need to do this programmatically, I suggest taking a look at the underlying source code for the Data Migration Tool. It is open source; you can find the code here.

I was able to improve the upload speed. I noticed that each physical partition has its own throughput limit (and, oddly, the number of physical partitions times the per-partition throughput still does not add up to the total throughput for the collection), so what I did was split the data by partition and then create a separate upload process for each partition key. This increased my upload speed by roughly (# of physical partitions) times.
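A simplified sketch of that fan-out (not my exact code - the connection string, collection names, and the group_rows_by_partition_key helper are placeholders, and rows are assumed to already be dicts):
import multiprocessing

import pymongo

def upload_partition(rows):
    # Each worker process gets its own client; pymongo clients are not fork-safe.
    client = pymongo.MongoClient("<cosmos-db-connection-string>")
    collection = client["mydb"]["mycollection"]
    operations = []
    for row in rows:
        operations.append(
            pymongo.UpdateOne({"_id": row["_id"]}, {"$set": row}, upsert=True)
        )
        if len(operations) >= 5:
            collection.bulk_write(operations, ordered=False)
            operations = []
    if operations:
        collection.bulk_write(operations, ordered=False)

if __name__ == "__main__":
    # group_rows_by_partition_key stands in for however you split the CSV
    # rows into one list per partition key value.
    partitions = group_rows_by_partition_key("data.csv")
    pool = multiprocessing.Pool(processes=len(partitions))
    pool.map(upload_partition, list(partitions.values()))
    pool.close()
    pool.join()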

I have used the Cosmos DB Data Migration tool, which is great for sending data to Cosmos DB without much configuration. I assume it can handle CSV files of 14 GB as well.
Below is the data we transferred:
[10000 records transferred | throughput 4000 | 500 parallel requests | 25 seconds]
[10000 records transferred | throughput 4000 | 100 parallel requests | 90 seconds]
[10000 records transferred | throughput 350 | 10 parallel requests | 300 seconds]

PostgreSQL: How to store and fetch historical SQL data from Azure Data Lakes (ADLS)

I have a single Django web application deployed on Azure with a transactional SQL DB, i.e. PostgreSQL.
Within the Django application, this historical data needs to be accessed from the ADLS every day (e.g. to show patterns over a period of years, months, etc.).
However, the ADLS will only return single or multiple files, and my application needs an intermediate layer such as Azure Synapse to convert this unstructured data into a structured DB in order to run queries on the historical data and show it within the web application.
Question. A) Would Azure Synapse fulfil this 'unstructured to structured conversion' requirement, or is there another Azure alternative.
Question. B) Since Django is inherently tied to ORM (Object Relation Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (i.e. ArrayField, JSONField etc.)
This entire exercise is being undertaken in order to store older historical data in a large repository and also access/query data from that ADLS repository whenever required.
Please advise which Azure alternatives might work in this case.
You need to break your problem down. For each piece you have multiple choices, with different cost implications, implementation complexity, and amount of control/flexibility.
Question. A) Would Azure Synapse fulfil this 'unstructured to structured conversion' requirement, or is there another Azure alternative.
Synapse Serverless SQL Pool lets you query JSON files from the Data Lake without a physical DB. It's compute only, no storage.
This is for infrequent access to large datasets, because every query goes and parses the data in the Data Lake.
If you want, you can also COPY INTO some_table all the data from the files and then run queries more efficiently against some_table (which is stored in the DB, with indexes, partitions, ...) using a dedicated Synapse SQL Pool.
E.g. following JSON
{
    "_id": "ahokw88",
    "type": "Book",
    "title": "The AWK Programming Language",
    "year": "1988",
    "publisher": "Addison-Wesley",
    "authors": [
        "Alfred V. Aho",
        "Brian W. Kernighan",
        "Peter J. Weinberger"
    ],
    "source": "DBLP"
}
Can be queried with following SQL:
SELECT
    JSON_VALUE(jsonContent, '$.title') AS title,
    JSON_VALUE(jsonContent, '$.publisher') AS publisher,
    jsonContent
FROM OPENROWSET
(
    BULK 'json/books/*.json',
    DATA_SOURCE = 'SqlOnDemandDemo',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
)
WITH
    ( jsonContent varchar(8000) ) AS [r]
WHERE
    JSON_VALUE(jsonContent, '$.title') = 'Probabilistic and Statistical Methods in Cryptology, An Introduction by Selected Topics'
Question. B) Since Django is inherently tied to ORM (Object Relation Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (i.e. ArrayField, JSONField etc.)
Synapse offers good old JDBC drivers, so as long as your ORM layer can use a JDBC source you should be good to go. Remember that the underlying data source (Synapse) is meant for MPP, not transactional processing. So inserting 1,000 rows in a for loop using INSERT INTO ... would take 1,000 seconds, but querying 10 million rows with a SELECT ... statement would probably take less than 100. So know what you are using it for.
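As an illustration only (the server name, database, and credentials below are placeholders, and this goes through ODBC via pyodbc rather than JDBC, since the app is Python/Django), querying the serverless endpoint from application code could look roughly like this:
import pyodbc

# Placeholder serverless SQL endpoint and credentials.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=mydb;"
    "Uid=myuser;Pwd=mypassword;"
    "Encrypt=yes;TrustServerCertificate=no;"
)

cursor = conn.cursor()
# Reuse the OPENROWSET query from above to read JSON files in the Data Lake.
cursor.execute(
    "SELECT TOP 10 JSON_VALUE(jsonContent, '$.title') AS title "
    "FROM OPENROWSET(BULK 'json/books/*.json', DATA_SOURCE = 'SqlOnDemandDemo', "
    "FORMAT = 'CSV', FIELDTERMINATOR = '0x0b', FIELDQUOTE = '0x0b') "
    "WITH (jsonContent varchar(8000)) AS [r]"
)
for row in cursor.fetchall():
    print(row.title)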
Does Synapse have to be configured with both the app DB and ADLS in a pipeline through Azure Data Factory? And is this achievable for a PostgreSQL DB? I could not find Azure docs that talk specifically about PostgreSQL DB <---> ADLS connections. – Simran 14 hours ago
You're mixing things here. You can NOT use Synapse to give a single view of data across two data sources: 1) PostgreSQL, 2) ADLS.
Only source for Serverless is ADLS.
You can do this using Data Factory, which would allow you to create two data sources (ADLS and PostgreSQL), read from them, merge them to produce a new data set, and write the output to some data sink like PostgreSQL. Your Django code would then be able to read this from PostgreSQL as usual.
Understand the cost and performance implications of each piece before you make a decision:
Serverless SQL Pool
Dedicated SQL pool
Data Factory

How to use Snowflake as an application back-end database for fast search

To save storage cost, we are planning to migrate from Aurora/MySQL to Snowflake for one of our use cases, where we store audit-related information.
We store all audit info in Aurora, which gives us millisecond latency when we connect Aurora to the application.
We have a huge amount of audit info (about 12 TB, including a text column) and it is growing.
Now, to save cost while keeping future growth in mind, we are exploring other options where we can save money and still match the performance.
While doing research we came across Snowflake, and we are doing a POC on it, but I observe that a search on the ID/primary key does not give us the same performance as Aurora MySQL.
So I wanted some expert advice: how can we make Snowflake our application back end, where I can do inserts/updates/deletes and display records directly from the Snowflake database?
2022 update
Things have changed since my reply below!
Check the Snowflake Search Optimization Service:
The search optimization service can significantly improve the performance of certain types of lookup and analytical queries that use an extensive set of predicates for filtering.
https://docs.snowflake.com/en/user-guide/search-optimization-service.html
Unistore and Hybrid Tables are coming to Snowflake:
Unistore is a new workload that delivers a modern approach to working with transactional and analytical data together in a single platform.
https://www.snowflake.com/blog/introducing-unistore/
Don't do this.
I read from the requirements in the question that you are looking for a backend that will:
Retrieve rows by id in milliseconds.
Be a backend for an app that's constantly performing updates and deletes.
Those are not the strengths of Snowflake, nor what people love it for.
Read more about the strengths of Snowflake and the workloads you would use it for at https://www.snowflake.com/cloud-data-platform/.

What are the pros and cons of loading data directly into Google BigQuery vs going through Cloud Storage first?

Also, is there anything wrong with doing transforms/joins directly within BigQuery? I'd like to minimize the number of components and steps involved for a data warehouse I'm setting up (simple transaction and inventory data for a chain of retail stores.)
Well, if you go through GCS it means you are not streaming your data; loading from a file into BQ is free, and files can be up to 5 TB in size, which is sometimes an advantage (the large-file capability, and being free). Streaming, on the other hand, is real time, while going through GCS is not.
If you want to stream data directly into BQ tables, that has a cost. Currently the price for streaming inserts is $0.01 per 200 MB (June 2018), so around $50 for 1 TB.
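For reference, a minimal sketch of such a free file load from GCS with the google-cloud-bigquery Python client (the bucket, dataset, and table names here are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table and source files in GCS.
table_id = "my-project.my_dataset.transactions"
uri = "gs://my-bucket/exports/transactions-*.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema
)

# The load job itself is free; you only pay for storage afterwards.
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the job to finish

table = client.get_table(table_id)
print("Loaded {} rows into {}".format(table.num_rows, table_id))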
As for transformation, it can be done with SQL if you can express the task that way. Otherwise you have plenty of options; most of the time people use Dataflow to transform things. See the linked tutorial for an advanced example.
Look also into
Cloud Dataprep - Data Preparation and Data Cleansing and
Google Data Studio: Easily Build Custom Reports and Dashboards
Also an advanced example:
Performing ETL from a Relational Database into BigQuery
Loading data via Cloud Storage is the fastest (and the cheapest) way.
Loading directly can be done from the app (using streaming inserts, which add some additional cost).
As for doing transformation - if what you plan/need to do can be done in BigQuery - you should do it in BigQuery :) - it is the best and fastest way of doing ETL.
But you should take into account the cost of running queries (if you are not paying Google for slots, it could be $5 per 1 TB scanned).
Another good option for complex ETL is using Dataflow - but it can become expensive very quickly, in exchange for more flexibility.

Is Google BigQuery suitable for inserting data from IoT devices?

I am working at a startup company where we would sell an IoT device of some sort. These devices will be connected to our server hosted in Google Cloud and will send data every second, and my server will store it in a database as a time series. Let's say we have 1,000 devices connected and all are sending their data every second. Is it suitable to use Google BigQuery to insert this data every second, for each device, into the table corresponding to the owner of the device?
Since my data is in the form of a time series, I am thinking of using a partitioned table for each user (owner of the device), but with the limits and quotas listed in the official documentation I am worried about reaching the limit with my high number of inserts every second (not to mention that I will also query the data on demand from my phone app).
If it's not suitable, what would be suited for my use case?
EDIT: my main concern is the huge number of inserts per second, which can exceed BigQuery limits or might cause slowdowns since it is mainly a data warehouse. Bigtable seems expensive for us, and Cloud SQL seems the way to go, but we are worried about slow query times once the table gets filled, since I am inserting 86,400 rows per user per day.
Thanks.
You should check out Cloud IoT Core - a fully managed service to easily and securely connect, manage, and ingest data from globally dispersed devices.
Device data captured by Cloud IoT Core gets published to Cloud Pub/Sub for downstream analytics. You can do ad hoc analysis using Google BigQuery, easily run advanced analytics and apply machine learning with Cloud Machine Learning Engine, or visualize IoT data results with rich reports and dashboards in Google Data Studio.
Check also IoT Core with PubSub, Dataflow, and BigQuery
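As a rough sketch of that ingestion path (the project, topic, and payload fields below are placeholders), each device reading would be published to Pub/Sub and picked up downstream by Dataflow/BigQuery:
import json
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Placeholder project and topic names.
topic_path = publisher.topic_path("my-iot-project", "device-telemetry")

def publish_reading(device_id, value):
    # Attributes carry the device id so downstream consumers
    # (Dataflow, BigQuery) can route or partition by device/owner.
    payload = json.dumps({
        "device_id": device_id,
        "value": value,
        "timestamp": time.time(),
    }).encode("utf-8")
    future = publisher.publish(topic_path, data=payload, device_id=device_id)
    return future.result()  # blocks until the message is accepted

message_id = publish_reading("device-0001", 42.0)
print("Published message {}".format(message_id))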

Migrating a relational DB into AWS services

I have a terabyte size SQL Server DB table which has only two columns:
Id,
HTML Content
There are a few applications that call this table to retrieve the HTML content by providing the Id of the row.
The DB resides on-premises, and its maintenance cost and size are getting higher and higher. I am thinking of moving this DB into AWS DynamoDB. The reason I have chosen DynamoDB is the cost and the performance I have read about.
Are there any concerns I should know about before choosing DynamoDB?
Are there any other services in AWS that I could possibly use over DynamoDB?
I understand that SQL Server is a relational DB, while DynamoDB is NoSQL, and it seems a NoSQL DB could be a potential solution for this scenario. I have no joins or transactions against that table. All I do with the table is insert and select.
Are the any concerns I should know about before choosing Dynamo DB?
As with many NoSQL big-data DBs, Dynamo is "eventually consistent" by default, so if your application writes and then immediately reads the same record, you should expect occasional inconsistencies.
I'm not familiar with "Prem", and assuming you mean that you're working with your own private servers, I feel obligated to provide the following warning: working in the cloud is very different from working with your own servers. Requests fail more often, the latency pattern is different, and you should architect your software to handle these sorts of issues. If you're planning on moving to the cloud, I'd start by migrating your application and leave the DB for last.
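For what it's worth, the insert/select-by-Id pattern described in the question maps directly onto DynamoDB's put_item/get_item. A minimal boto3 sketch (the table and attribute names are made up for illustration):
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
# Hypothetical table with "Id" as the partition key.
table = dynamodb.Table("HtmlContent")

# Insert (or overwrite) a row.
table.put_item(Item={"Id": "page-123", "Html": "<html>...</html>"})

# Retrieve by Id.
response = table.get_item(Key={"Id": "page-123"})
item = response.get("Item")
if item:
    print(item["Html"])
Note that a single DynamoDB item is limited to 400 KB, which is worth checking against your largest HTML documents.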
If you really need real-time updates of your data, you should reconsider moving to Dynamo. Also, Dynamo is useful when you need a dynamic number of columns for each row. So, except for the cost, I don't see any benefits here.
If you don't need real-time updates, you can look into AWS Redshift or Google BigQuery, and these will be cheaper solutions compared to Dynamo.
Since, as you mentioned, you have just two columns, take a look at Redis as well. A plain key-value structure will help with performance. But since Redis stores everything in physical memory, the cost will be high and you'll still need permanent storage/DB like SQL Server or MySQL. So in terms of performance, yes, you'll see a huge difference, but you'll end up above the current cost.
How about AWS Aurora? AWS claims it is 1/10th the cost compared to other SQL/MySQL instances, and it has backward compatibility as well.