I am working at a startup that will sell an IoT device. These devices will be connected to our server hosted on Google Cloud and will send data every second, and the server will store it in a database as a time series. Say we have 1,000 devices connected and all of them send their data every second. Is it suitable to use Google BigQuery and insert this data every second for each device into the table corresponding to the device's owner?
Since my data is a time series, I am thinking of using a partitioned table per user (the owner of the device), but given the limits and quotas listed in the official documentation I am worried about hitting the limit with this high number of inserts per second (not to mention that I will also query the data on demand from my phone app).
If it's not suitable, what would suit my use case?
EDIT: My main concern is the huge number of inserts per second, which could exceed BigQuery's limits or cause slowdowns, since BigQuery is primarily a data warehouse. Bigtable seems expensive for us, and Cloud SQL seems like the way to go, but we are worried about slow query times once the table fills up, since I would be inserting 86,400 rows per user per day.
Thanks.
You should check out Cloud IoT Core - a fully managed service to easily and securely connect, manage, and ingest data from globally dispersed devices.
Device data captured by Cloud IoT Core gets published to Cloud Pub/Sub for downstream analytics. You can do ad hoc analysis using Google BigQuery, easily run advanced analytics and apply machine learning with Cloud Machine Learning Engine, or visualize IoT data results with rich reports and dashboards in Google Data Studio.
Check also IoT Core with PubSub, Dataflow, and BigQuery
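For illustration only, here is a minimal sketch of the device-data-to-Pub/Sub leg from a Python client; IoT Core would normally do this bridging for you over MQTT/HTTP, and the project and topic names here are placeholders:

    import json
    import time
    from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

    publisher = pubsub_v1.PublisherClient()
    # Hypothetical project and topic names.
    topic_path = publisher.topic_path("my-project", "device-telemetry")

    def publish_reading(device_id: str, value: float) -> None:
        # Each message carries the reading as JSON, plus the device id as a
        # message attribute so a downstream Dataflow/BigQuery pipeline can
        # route it per owner.
        payload = json.dumps({"device_id": device_id,
                              "ts": time.time(),
                              "value": value}).encode("utf-8")
        future = publisher.publish(topic_path, data=payload, device_id=device_id)
        future.result()  # block until Pub/Sub accepts the message

    publish_reading("device-0001", 21.7)

From Pub/Sub, a Dataflow template or your own pipeline can then write the readings into BigQuery in batches instead of one insert per device per second.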
I want to get the "entire" Ethereum blockchain data, not just from a few sets of smart contracts. By data I mean transaction details, including the generated logs.
I can get real-time data using Infura, but it's pretty much impossible to fetch all the old data that way; it would simply cost too much because I would have to make too many network requests.
I need the old data because I am trying to build an indexed database out of the append-only Ethereum transaction data so that I can easily query it.
To be more precise, I would like to retrieve all NFT (ERC-721, ERC-1155) transfer transactions and their logs, so that I can run queries such as: all the NFTs owned by a particular wallet, or the transfer history of a particular NFT token.
You can do this as follows:
Run your own node.
Query the data from your node - locally it is fast.
For some data, you might need to run the node in archival mode.
You can use the same Web3 / JSON-RPC APIs on a local node as you use on Infura.
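As a rough sketch (assuming web3.py v6; the IPC path and endpoints are placeholders), pointing the client at your local node instead of Infura looks like this:

    from web3 import Web3  # pip install web3

    # Connect to a locally running geth node instead of Infura,
    # either over IPC ...
    w3 = Web3(Web3.IPCProvider("/home/me/.ethereum/geth.ipc"))
    # ... or over HTTP if the node exposes its JSON-RPC port:
    # w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))

    print(w3.is_connected())
    latest = w3.eth.get_block("latest")
    print(latest.number)

    # Same JSON-RPC surface as Infura, e.g. fetching a receipt and its logs:
    if latest.transactions:
        receipt = w3.eth.get_transaction_receipt(latest.transactions[0])
        print(len(receipt.logs))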
Two solutions I have discovered.
Just like @Mikko mentioned, you can run your own node, and it seems not to be as complex as I had expected. You can search for "geth" and then simply connect this node to your web3 library, just like connecting to Infura.
But I have not tried this, as I found a much better solution.
Google Cloud BigQuery's public datasets include all of the old Ethereum data. BigQuery is Google's data warehouse service, where you can use plain SQL to query the data, and new data is added every day. I have already tested some simple queries from its console and the results were good.
I am planning to fetch all the old data I need from BigQuery and store it in my own database, and afterwards get real-time data from Infura. Now that I don't have to fetch all the old data from Infura, the price becomes very affordable.
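As a sketch of the kind of query involved, using the Python client (table and column names are from the bigquery-public-data.crypto_ethereum public dataset as I understand it, and the wallet address is a placeholder - verify both in the console):

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()

    # ERC-20 and ERC-721 Transfer events share the token_transfers table.
    sql = """
        SELECT token_address, from_address, to_address, value,
               transaction_hash, block_number, block_timestamp
        FROM `bigquery-public-data.crypto_ethereum.token_transfers`
        WHERE to_address = @wallet OR from_address = @wallet
        ORDER BY block_number
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("wallet", "STRING", "0x..."),  # wallet placeholder
        ]
    )
    for row in client.query(sql, job_config=job_config).result():
        print(row.transaction_hash, row.token_address, row.value)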
You may check https://github.com/blockchain-etl/ethereum-etl
It is a Python library for ETL (extract, transform, and load) jobs for Ethereum blocks, transactions, ERC-20 / ERC-721 tokens, transfers, receipts, logs, contracts, and internal transactions.
For example, you may run the CLI command
> ethereumetl export_token_transfers --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output token_transfers.csv
You may export ERC-20 and ERC-721 transfers by specifying the block range, which enables you to retrieve the old data.
Data is also available in Google BigQuery.
In my scenario, Databricks performs read and write transformations on Delta tables. We have Power BI (PBI) connected to the Databricks cluster, which needs to be running most of the time and is therefore expensive.
Knowing that delta tables are in a container, what would be the best way in terms of cost x performance to feed PBI from delta tables?
If your dataset size is under the maximum allowed in Power BI (100 GB, I believe) and a daily refresh is enough, you can just load everything into your Power BI model.
https://blog.gbrueckl.at/2021/01/reading-delta-lake-tables-natively-in-powerbi/
If you want to save costs and you don't need transactions, you can save the data as CSV in the data lake; then loading everything into Power BI and refreshing daily is really easy.
If you want to save costs and query new incoming data all the time using DirectQuery, consider using Azure SQL. It has really competitive prices, starting from 5 EUR/USD. Integration with Databricks is also straightforward: write in append mode and it does all the magic.
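A minimal sketch of that Databricks-side write (run inside a notebook or job where spark and dbutils are available; the server, secret scope, table, and path names are placeholders):

    # Read the curated Delta table and append it to Azure SQL over JDBC.
    jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"  # placeholder

    df = spark.read.format("delta").load("/mnt/datalake/gold/sales")  # placeholder path

    (df.write
       .format("jdbc")
       .option("url", jdbc_url)
       .option("dbtable", "dbo.sales")                                # placeholder target table
       .option("user", dbutils.secrets.get("kv", "sql-user"))         # credentials from a secret scope
       .option("password", dbutils.secrets.get("kv", "sql-password"))
       .mode("append")   # append mode, as described above; feed it only the new increment
       .save())

Power BI then uses DirectQuery against Azure SQL, so the Databricks cluster only needs to run while the job writes.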
Another option to consider is to create an Azure Synapse workspace and use serverless SQL compute to query the Delta Lake files. This is a pay-per-TB-consumed pricing model, so you don't have to have your Databricks cluster running all the time. It's a great way to load Power BI import models.
To save storage costs, we are planning to migrate from Aurora/MySQL to Snowflake for one of our use cases, where we store audit-related information.
We store all the audit info in Aurora, which gives us millisecond latency when we integrate it into the application.
We have a huge amount of audit info - about 12 TB, including text columns - and it is growing.
Now, to save cost while keeping future growth in mind, we are exploring other options where we can save money and performance can still match.
While doing research I came across Snowflake, and we are doing a POC on it, but I observe that lookups by ID on the primary key do not give us the same performance as Aurora MySQL.
So I wanted some expert advice on how we can make Snowflake our application back end, where I can insert/update/delete and display records directly from the Snowflake database.
2022 update
Things have changed since my reply below!
Check the Snowflake Search Optimization Service:
The search optimization service can significantly improve the performance of certain types of lookup and analytical queries that use an extensive set of predicates for filtering.
https://docs.snowflake.com/en/user-guide/search-optimization-service.html
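A hedged sketch of enabling it from Python with the Snowflake connector (the table and column names are hypothetical, and the ON EQUALITY clause may require a recent Snowflake edition - check the linked docs):

    import snowflake.connector  # pip install snowflake-connector-python

    conn = snowflake.connector.connect(
        account="myaccount", user="me", password="...",   # placeholders
        warehouse="WH", database="AUDIT_DB", schema="PUBLIC",
    )
    cur = conn.cursor()

    # One-time: register the table with the Search Optimization Service.
    cur.execute("ALTER TABLE audit_events ADD SEARCH OPTIMIZATION ON EQUALITY(audit_id)")

    # Point lookups on audit_id should then avoid full partition scans.
    cur.execute("SELECT * FROM audit_events WHERE audit_id = %s", ("a1b2c3",))
    print(cur.fetchone())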
Unistore and Hybrid Tables are coming to Snowflake:
Unistore is a new workload that delivers a modern approach to working with transactional and analytical data together in a single platform.
https://www.snowflake.com/blog/introducing-unistore/
Don't do this.
I read from the requirements in the question that you are looking for a backend that will:
Retrieve rows by id in milliseconds.
Be a backend for an app that's constantly performing updates and deletes.
Those are not the strengths of Snowflake, nor what people love it for.
Read more about the strengths of Snowflake and the workloads you would use it for at https://www.snowflake.com/cloud-data-platform/.
For context, we would like to visualize our data in Google Data Studio - this dataset receives more entries each week. I have tried hosting our datasets in Google Drive, but it seems they're too large and this slows down Google Data Studio (the file is only 50 MB; am I doing something wrong?).
I have loaded our data into Google Cloud Storage and then into Google BigQuery, and connected Google Data Studio to my BigQuery table. This has made the Data Studio dashboard much quicker!
I'm not sure what the best way is to update our data weekly in Google Cloud/BigQuery. I have found a slow way to do this by uploading the new weekly data to Google Cloud Storage and then appending the data to my table manually in BigQuery, but I'm wondering if there's a better way to do this (or at least a more automated one).
I'm open to any suggestions, and if you think that BigQuery/Google Cloud Storage is not the answer for me, please let me know!
If I understand your question correctly, you want to automate the query that populates your table, which is connected to Data Studio.
If this is the case, you can use scheduled queries in BigQuery. A scheduled query lets you define a query whose results are written to a destination table. In particular, you can specify repetition rules (minimum every 15 minutes) and destination write options (destination table, write mode: append or truncate).
In order to use scheduled queries, your account must have the right permissions. Have a look at the following documentation to better understand how to use scheduled queries [1].
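Besides the console UI, a scheduled query can also be created programmatically. A sketch with the BigQuery Data Transfer Python client (the project, dataset, table, and query are placeholders; see [1] for the exact options):

    from google.cloud import bigquery_datatransfer  # pip install google-cloud-bigquery-datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()
    parent = client.common_project_path("my-project")   # placeholder project

    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id="reporting",              # placeholder dataset
        display_name="Weekly append for Data Studio",
        data_source_id="scheduled_query",
        params={
            "query": "SELECT * FROM `my-project.staging.weekly_upload`",  # placeholder query
            "destination_table_name_template": "dashboard_table",
            "write_disposition": "WRITE_APPEND",          # append each run, as described above
        },
        schedule="every mon 09:00",                       # i.e. weekly
    )
    client.create_transfer_config(parent=parent, transfer_config=transfer_config)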
Also, please note that on the front end, the updated data in the BigQuery table will only show up in Data Studio at each refresh (click the refresh button in Data Studio). To automatically refresh the front-end visualization you can use the following plugin [2] or automate the click on the refresh button through browser console commands.
[1] https://cloud.google.com/bigquery/docs/scheduling-queries
[2] https://chrome.google.com/webstore/detail/data-studio-auto-refresh/inkgahcdacjcejipadnndepfllmbgoag?hl=en
Also, is there anything wrong with doing transforms/joins directly within BigQuery? I'd like to minimize the number of components and steps involved for a data warehouse I'm setting up (simple transaction and inventory data for a chain of retail stores).
Well, if you go through GCS it means you are not streaming your data. Loading from a file into BQ is free, and files can be up to 5 TB in size, which is sometimes an advantage: the large-file capability and the fact that it's free. Streaming, on the other hand, is real-time, while going through GCS is not.
If you want to stream data directly into BQ tables, that has a cost. Currently the price for streaming inserts is $0.01 per 200 MB (June 2018), so around $50 per TB.
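A small sketch of both paths with the Python client (project, bucket, and table names are placeholders):

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    table_id = "my-project.dwh.sales"  # placeholder table

    # Batch load from GCS: no loading charge, but not real-time.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    client.load_table_from_uri(
        "gs://my-bucket/exports/sales.csv", table_id, job_config=job_config
    ).result()

    # Streaming insert: billed per volume, rows become queryable almost immediately.
    errors = client.insert_rows_json(table_id, [{"store_id": "s-01", "sku": "1234", "qty": 3}])
    if errors:
        print("streaming insert failed:", errors)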
On the other hand, transformation can be done with SQL if you can express the task that way. Otherwise you have plenty of options; most of the time people use Dataflow to transform things. See the linked tutorial for an advanced example.
Look also into
Cloud Dataprep - Data Preparation and Data Cleansing and
Google Data Studio: Easily Build Custom Reports and Dashboards
Also an advanced example:
Performing ETL from a Relational Database into BigQuery
Loading data via Cloud Storage is the fastest (and the cheapest) way.
Loading directly from an application is done via streaming inserts, which add some extra cost.
As for doing the transformation: if what you plan/need to do can be done in BigQuery, you should do it in BigQuery :) - it is the best and fastest way of doing ETL.
But you should take into account the cost of running queries (if you are not paying Google for slots, it could be $5 per 1 TB scanned).
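A quick way to check that cost before running a transformation is a dry run, which reports how many bytes the query would scan (the query and table are placeholders; $5/TB is the on-demand rate mentioned above):

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    job = client.query(
        "SELECT store_id, SUM(quantity) AS qty "
        "FROM `my-project.dwh.transactions` GROUP BY store_id",   # placeholder query
        job_config=job_config,
    )
    tb = job.total_bytes_processed / 1e12
    print(f"would scan ~{tb:.4f} TB, roughly ${tb * 5:.2f} at on-demand pricing")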
Another good option for complex ETL is Dataflow - but it can become expensive very quickly, in exchange for more flexibility.