BigQuery Data Transfer Service benchmarks for Campaign Manager data - google-cloud-platform

There's some good info here on general transfer times over the wire for data to/from various sources.
Besides the raw data transfer time, I am trying to estimate roughly how long it would take to import ~12TB/day into BigQuery using the BigQuery Data Transfer service for DoubleClick Campaign Manager.
Is this documented anywhere?

In the first link you've shared, there's an image that shows the transfer speed (estimated) depending on the network bandwidth.
So let's say you have a bandwidth of 1 Gbps: since you are transferring 12 TB, which is close to the 10 TB figure in the chart, the data will be available in your GCP project in roughly 30 hours, i.e. about a day and a half.
If you really want to transfer 12 TB/day because you need that data to be available each day, and increasing bandwidth is not an option, I would recommend batching the data and creating a separate transfer job for each batch. As an example:
Split 12 TB into 12 batches of 1 TB -> 12 transfer jobs of 1 TB each
Each batch will take ~3 hours to complete, so you will have about 8 of the 12 TB available per day.
This can be applied to smaller batches of data if you want to have a more fine-grained solution.
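To make the batch sizing concrete, here is a rough back-of-the-envelope calculator (plain Python; it assumes the chart's ~30 hours per 10 TB at 1 Gbps, i.e. roughly 3 hours per TB, and ignores protocol overhead):

```python
# Rough wire-transfer estimate, assuming the chart's ~30 h per 10 TB at 1 Gbps
# (about 3 hours per TB) and ignoring protocol overhead.
HOURS_PER_TB_AT_1GBPS = 3.0

def batches_per_day(batch_size_tb: float, link_gbps: float = 1.0) -> float:
    """How many batches of `batch_size_tb` fit into 24 hours on the given link."""
    hours_per_batch = batch_size_tb * HOURS_PER_TB_AT_1GBPS / link_gbps
    return 24 / hours_per_batch

# 1 TB batches on a 1 Gbps link -> ~8 batches/day, i.e. ~8 of the 12 TB.
print(f"{batches_per_day(1.0):.1f} batches/day at 1 Gbps")
# Doubling the bandwidth (or halving the batch size) scales linearly.
print(f"{batches_per_day(1.0, link_gbps=2.0):.1f} batches/day at 2 Gbps")
```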

Related

Dataflow resource usage

After following the Dataflow tutorial, I used the Pub/Sub Topic to BigQuery template to parse a JSON record into a table. The job has been streaming for 21 days. During that time I have ingested about 5000 JSON records, containing 4 fields (around 250 bytes).
After the bill came this month I started to look into resource usage. I have used 2,017.52 vCPU hr, memory 7,565.825 GB hr, Total HDD 620,407.918 GB hr.
This seems absurdly high for the tiny amount of data I have been ingesting. Is there a minimum amount of data I should have before using Dataflow? It seems overpowered for small cases. Is there another preferred method for ingesting data from a Pub/Sub topic? Is there a different configuration when setting up a Dataflow job that uses fewer resources?
It seems that the numbers you mentioned correspond to not customizing the job resources. By default, streaming jobs use an n1-standard-4 machine:
Streaming worker defaults: 4 vCPU, 15 GB memory, 400 GB Persistent Disk.
4 vCPU x 24 hrs x 21 days = 2,016 vCPU hr
15 GB x 24 hrs x 21 days = 7,560 GB hr
If you really need streaming in Dataflow, you will need to pay for resources allocated even if there is nothing to process.
Options:
Optimizing Dataflow
Considering that the number and size of the JSON records you need to process are really small, you can reduce the cost to approximately 1/4 of the current charge. You just need to set the job to use an n1-standard-1 machine, which has 1 vCPU and 3.75 GB memory. Also be careful with the maximum number of workers; unless you are planning to increase the load, one node may be enough.
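If you launch the pipeline yourself with the Beam Python SDK instead of the template, the worker shape can be pinned in the pipeline options. A minimal sketch, with placeholder project, region, bucket, topic and table names:

```python
# Sketch: a streaming Pub/Sub -> BigQuery pipeline pinned to a single small worker.
# Project, region, bucket, topic and table names are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    streaming=True,
    machine_type="n1-standard-1",  # 1 vCPU / 3.75 GB instead of the default n1-standard-4
    max_num_workers=1,             # cap autoscaling; one node is enough for this volume
)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.events",
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```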
Your own way
If you don't really need streaming (which is probably not your case), you can just create a function that pulls messages using synchronous pull and then writes them to BigQuery. You can schedule it according to your needs.
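A minimal sketch of that approach with the Pub/Sub and BigQuery client libraries (subscription and table names are placeholders); you could run it on whatever schedule suits you:

```python
# Sketch: periodically drain a Pub/Sub subscription and write the rows to BigQuery.
# Subscription and table names are placeholders.
import json

from google.cloud import bigquery, pubsub_v1

SUBSCRIPTION = "projects/my-project/subscriptions/events-sub"
TABLE = "my-project.my_dataset.events"

def drain_once(max_messages: int = 500) -> int:
    """Pull up to `max_messages`, insert them into BigQuery, then ack."""
    subscriber = pubsub_v1.SubscriberClient()
    bq = bigquery.Client()
    response = subscriber.pull(
        request={"subscription": SUBSCRIPTION, "max_messages": max_messages})
    if not response.received_messages:
        return 0
    rows = [json.loads(m.message.data.decode("utf-8"))
            for m in response.received_messages]
    errors = bq.insert_rows_json(TABLE, rows)  # streaming insert; tiny cost at this volume
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")
    subscriber.acknowledge(request={
        "subscription": SUBSCRIPTION,
        "ack_ids": [m.ack_id for m in response.received_messages]})
    return len(rows)
```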
Cloud functions (my recommendation)
You can create a serverless, event-driven Cloud Function with a Cloud Pub/Sub trigger. This way, considering your low volume, you can take advantage of the free tier and keep the real-time processing:
"Cloud Functions provides a perpetual free tier for compute-time resources, which includes an allocation of both GB-seconds and GHz-seconds. In addition to the 2 million invocations, the free tier provides 400,000 GB-seconds, 200,000 GHz-seconds of compute time and 5GB of Internet egress traffic per month."[1]
[1] https://cloud.google.com/functions/pricing
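A minimal sketch of such a function (the table name is a placeholder; the (event, context) signature is the standard one for background functions with a Pub/Sub trigger):

```python
# main.py - Pub/Sub-triggered background Cloud Function (Python runtime).
# The table name is a placeholder; deploy the function with a Pub/Sub trigger.
import base64
import json

from google.cloud import bigquery

bq = bigquery.Client()
TABLE = "my-project.my_dataset.events"

def pubsub_to_bigquery(event, context):
    """Decode one Pub/Sub message and stream it into BigQuery."""
    record = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    errors = bq.insert_rows_json(TABLE, [record])
    if errors:
        # Raising surfaces the failure in the function's logs.
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```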

Use Case for Amazon Athena

We are building a web application to give customers insight into their activity, based on events currently streaming into Elasticsearch. A customer is an organisation sending messages to people.
A concern has been raised that the requirement to host this data for three years implies a very large amount of storage and a high cost of implementation, given Elasticsearch.
An alternative is to process each day's data into a report CSV stored in S3 and use something like Amazon Athena to perform the queries. Is Athena something that our application can send ad-hoc queries to in response to a web browser request? It is unlikely to generate a large volume of requests all the time, but I'm uncertain what the latency could be like.
Yes, Athena would be a possible solution to this use case – and done right it could also be fairly cheap.
Athena is not a low-latency query engine, but for reporting purposes it's usually good enough. There's no way to say for sure without knowing more, but done right we're talking low single-digit seconds.
You can approach this in different ways. Either you do as you say: generate a CSV every day, store these for as long as you need, and run queries against them as needed. From your description it sounds like these CSVs would already be aggregates, and I assume they would be significantly less than a megabyte per customer per day. If you partition by customer and month you should be able to run queries for arbitrary time periods in seconds.
Another approach would be to store all your data on S3 and run queries on the full data set. As you stream data into Elasticsearch, stream it to S3 too. Depending on how you do that, you probably need some ETL in the form of Lambda functions that partition the data per customer and time (day or month depending on the volume). You can then run Athena queries on the full historical data set. The downside would be slower queries (double-digit seconds for most queries, but I don't know your data volumes); the upside would be full flexibility on what you can query.
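To illustrate the "ad-hoc queries from the web app" part: Athena queries are submitted asynchronously through the API and polled until they finish. A rough boto3 sketch, where the database, table, partition columns and result bucket are all assumptions:

```python
# Sketch: run an ad-hoc Athena query from application code and wait for the result.
# Database, table, partition columns and the result bucket are assumptions;
# in real code, validate/parameterize customer_id instead of interpolating it.
import time

import boto3

athena = boto3.client("athena", region_name="eu-west-1")

def run_report_query(customer_id: str, month: str) -> list:
    qid = athena.start_query_execution(
        QueryString=f"""
            SELECT event_date, messages_sent, messages_delivered
            FROM daily_activity
            WHERE customer_id = '{customer_id}' AND month = '{month}'
        """,
        QueryExecutionContext={"Database": "reporting"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query finishes; well-partitioned queries typically take seconds.
    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(0.5)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
```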
With more details about the particulars of the use case I could help you with the details.
Athena is serverless. You can quickly query your data without having to set up and manage any servers or data warehouses. Just point to your data in Amazon S3, define the schema, and start querying using the built-in query editor.
Amazon Athena automatically executes queries in parallel, so most results come back within seconds or minutes.

bigQuery data export costs

I'm working on a project that consists of receiving data in BigQuery, doing some aggregations and exporting it to Qlik Sense to make dashboards. I'm still in an exploratory phase with only a few GB of data, but I want to calculate how much I will pay to export data from BigQuery to an external platform in the production phase.
When I move data from Cloud Storage buckets (where the data comes from) into BigQuery, data compression occurs, i.e. the tables in BigQuery occupy less space than in the Cloud Storage buckets.
Thus, my question is: when I then export the data from BigQuery to an external platform, what volume of traffic will be considered? The data's original size (before compression), or the (compressed) size it has in BigQuery?
Will the egress (export) volume depend on the format Qlik uses to export the data? If yes, what is the format Qlik uses to export the data?
If you follow BigQuery Exporting Data and export BigQuery table data to GCS, the export itself is free of cost.
Although free, exporting has a limit of 10 TB of data per day per project. You are advised to use the Storage API if the export quota doesn't work for you.
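For reference, a table export to GCS can be triggered with the BigQuery client library; a minimal sketch with placeholder project, dataset and bucket names:

```python
# Sketch: export a BigQuery table to GCS. The extract job itself is free,
# subject to the 10 TB/day/project quota. Names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,  # compressed output -> less GCS storage/egress
)
extract_job = client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://my-bucket/exports/my_table-*.csv.gz",  # wildcard: BigQuery may shard the output
    job_config=job_config,
)
extract_job.result()  # wait for the job to finish
```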
In case you are trying to use the Qlik/BigQuery connector: I noticed that the maximum response size is 128 MB compressed, so it might not be able to export a few GB of data for you. From the documentation, the pricing model on the BigQuery side is simple, since the connector just issues a query and reads the response:
Query cost: see BigQuery query pricing, which is based on the amount of data the query scans
Transfer cost: free

What are the pros and cons of loading data directly into Google BigQuery vs going through Cloud Storage first?

Also, is there anything wrong with doing transforms/joins directly within BigQuery? I'd like to minimize the number of components and steps involved for a data warehouse I'm setting up (simple transaction and inventory data for a chain of retail stores.)
Well, if you go through GCS it means you are not streaming your data: loading from a file into BQ is free, and files can be up to 5 TB in size, which is sometimes an advantage (the large-file capability and being free). Streaming, on the other hand, is real-time, while going through GCS means it's not.
If you want to stream data directly into BQ tables, that has a cost. Currently the price for streaming inserts is $0.01 per 200 MB (June 2018), so around $50 for 1 TB.
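A minimal sketch of both ingestion paths with the Python client library (project, dataset, table and bucket names are placeholders): the GCS load is a free batch job, while insert_rows_json is the billed streaming path.

```python
# Sketch: the two ingestion paths. Project, dataset, table and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.sales"

# 1) Batch load from GCS: free, but not real-time (it runs as a load job).
load_job = client.load_table_from_uri(
    "gs://my-bucket/sales/2018-06-01.csv",
    table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
)
load_job.result()

# 2) Streaming insert: rows become queryable within seconds, but billed per 200 MB.
errors = client.insert_rows_json(table_id, [{"store": "A12", "sku": "X-1", "qty": 3}])
assert not errors, errors
```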
On the other hand, transformation can be done with SQL if you can express the task that way. Otherwise you have plenty of options; most of the time people use Dataflow to transform things. See the linked tutorial below for an advanced example.
Look also into
Cloud Dataprep - Data Preparation and Data Cleansing and
Google Data Studio: Easily Build Custom Reports and Dashboards
Also an advanced example:
Performing ETL from a Relational Database into BigQuery
Loading data via Cloud Storage is the fastest (and the cheapest) way.
Loading directly can be done from your app (using streaming inserts, which add some additional cost).
As for the transformation: if what you plan/need to do can be done in BigQuery, you should do it in BigQuery :) - it is the best and fastest way of doing ETL.
But you should take into account the cost of running queries (if you are not paying Google for slots, it is $5 per 1 TB scanned).
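One way to keep an eye on that on-demand cost is a dry run, which reports the bytes a query would scan without actually running it; a small sketch with a placeholder table name:

```python
# Sketch: estimate on-demand query cost with a dry run (nothing is billed or executed).
# The table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT store_id, SUM(qty) FROM `my-project.my_dataset.sales` GROUP BY store_id",
    job_config=job_config,
)
tb_scanned = job.total_bytes_processed / 1024**4
print(f"~{tb_scanned:.4f} TB scanned -> ~${tb_scanned * 5:.2f} at $5/TB on demand")
```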
Another good option for complex ETL is using Dataflow, but it can become expensive very quickly, in exchange for more flexibility.

is Google Bigquery suitable for inserting data from IoT devices?

I am working at a startup company where we would sell an IoT device of some sort. These devices will be connected to our server hosted in Google Cloud and will send data every second, which my server will store in a database as a time series. Let's say we have 1,000 devices connected and all are sending their data every second. Is it suitable to use Google BigQuery to insert this data every second, with each device writing to the table corresponding to the device's owner?
Since my data is in the form of a time series, I am thinking of using a partitioned table for each user (owner of the device), but with the limits and quotas listed in the official documentation I am worried about reaching the limit with my high number of inserts every second (not to mention that I will also query the data on demand from my phone app).
If it's not suitable, what would be suited for my use case?
EDIT: my main concern is the huge number of inserts per second, which can exceed BigQuery limits or might cause slowdowns since it's mainly a data warehouse. Bigtable seems expensive for us, and Cloud SQL seems the way to go, but we are worried about slow query times once the table fills up, since I am inserting 86,400 rows per user per day.
Thanks.
You should check out Cloud IoT Core, a fully managed service to easily and securely connect, manage, and ingest data from globally dispersed devices.
Device data captured by Cloud IoT Core gets published to Cloud Pub/Sub for downstream analytics. You can do ad hoc analysis using Google BigQuery, easily run advanced analytics and apply machine learning with Cloud Machine Learning Engine, or visualize IoT data results with rich reports and dashboards in Google Data Studio.
Check also IoT Core with PubSub, Dataflow, and BigQuery
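To get a feel for the ingest side, here is a toy publisher that simulates the per-second device readings landing in Pub/Sub (in production the devices would talk to Cloud IoT Core over MQTT/HTTP and IoT Core would forward to this topic); the topic name and payload fields are made up:

```python
# Sketch: simulate ~1,000 devices each publishing one JSON reading per second to
# Pub/Sub. In production the devices would talk to Cloud IoT Core (MQTT/HTTP) and
# IoT Core would forward to this topic. Topic name and payload fields are made up.
import json
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "device-telemetry")

def publish_reading(device_id: str, value: float) -> None:
    payload = json.dumps({
        "device_id": device_id,
        "value": value,
        "ts": time.time(),
    }).encode("utf-8")
    publisher.publish(topic_path, payload)  # returns a future; fire-and-forget here

if __name__ == "__main__":
    while True:
        for i in range(1000):
            publish_reading(f"device-{i:04d}", 42.0)
        time.sleep(1)  # one reading per device per second
```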