BigQuery python client dropping some rows using Streaming API - google-cloud-platform

I have around a million rows being inserted into BigQuery using the streaming API (the BigQuery Python client's insert_rows function), but there's some data loss: roughly 10,000 rows go missing during insertion. Is there a chance BigQuery might be dropping some of the data? There are no insertion errors (or any errors whatsoever, for that matter).

I would recommend filing a private Issue Tracker report so that the BigQuery engineers can look into this. Make sure to provide the affected project, the source of the data, the code you're using to stream into BigQuery, and the client library version.
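Before escalating, it is worth ruling out two client-side causes: the streaming methods return per-row errors rather than raising exceptions, so failures are easy to miss if the return value is never checked, and rows that share an insertId are best-effort deduplicated by BigQuery, which can look like silent drops. A minimal sketch (not the asker's actual code; table and field names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical table

rows = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

errors = client.insert_rows_json(
    table_id,
    rows,
    # insertId per row: make sure these are unique, otherwise BigQuery may
    # deduplicate rows that share an ID and they will appear to be "lost".
    row_ids=[str(r["id"]) for r in rows],
)
if errors:
    # Each entry reports the index of the failed row and the reason it was rejected.
    for entry in errors:
        print(entry)
```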

Related

Correct approach to consume and prepare data from multiple sources for Power BI

I'm trying to establish if my planned way of working is correct.
I have two data sources: a MySQL and an MSSQL database. I need to combine these data sources and expose the data for Power BI to consume.
I've decided to use Azure Synapse Analytics for the ETL and would like to understand if there is anything in the process I can simplify or do better.
The process is as follows:
MySQL & MSSQL delta loaded into ASA in Parquet format, stored in Azure Gen 2 storage.
Once the copy pipeline is complete, a subsequent data flow unions the data from the two sources and inserts it into MSSQL storage in ASA.
Power BI consumes from this workspace / data source.
I'm not sure whether I should be staging the data in Azure Gen 2 storage, or whether I should just perform the transform and insert from the source straight into the MSSQL storage. Any thoughts or suggestions would be greatly appreciated.
The pattern that you're following is the data lake pattern, where data is moved between 3 zones:
Raw
Enriched
Curated
The Raw zone keeps an original copy of the data before transformation. The benefit of storing the data this way (as parquet files, here) is so that you can troubleshoot a problem with the transformation or create a different transformation to address a new need.
The Enriched zone is where you have done some transformation, like UNIONing your data, or providing some other clean up steps, maybe removing unneeded columns, correcting addresses, etc. You have done this by inserting the data into a SQL database, but this might also be accomplished by using views in the serverless pool, if the transformations are simple enough: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-views
The Curated zone is a place to transform your data into a form that BI applications will do well with, i.e. a star schema. Even if this is a very simple dataset, it will be well worth incorporating a date dimension, which will yield a lot of benefits in Power BI. The bottom line here is that Power BI is optimized to work with star schemas, so that's what you should give it.
You do not need to use data lake technologies to follow this pattern and still get the benefits. Whether what you are doing is good will come down to how everything performs versus how simple you can keep it.
Here's more on the topic: https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/best-practices/data-lake-overview
Once the copy pipeline is complete, a subsequent data flow unions the data
from the two sources and inserts it into MSSQL storage in ASA
What is the use of the MSSQL storage? Is it only used by Power BI to create reports? If yes, then you can use ADLS Gen2 instead, as it will be cheaper (basically in line with what Mark described above as the "Curated" zone).
One more thing to consider: Power BI can read data from both sources and then do the transformation within itself.

Streaming Insert/Update using Google Cloud Functions

This is regarding streaming data insert/update using a Google Cloud Function. I am using Salesforce as the source database and want to do streaming inserts/updates to Google BigQuery tables. The insert part is working fine, but how can I do an update, given that streamed data first lands in a streaming buffer, which won't allow DML operations for a period of 30 minutes or so? Any help on this would be really appreciated.
Got a reply from Google Support like the one below:
"It is true that modifying recent data from the last 30 minutes (with an active streaming buffer) is not possible, as one of the limitations of BigQuery DML operations"
One workaround we can try is to copy the data from the streaming table to a new table and perform the operation on that copy. This helped me.
Thanks
vak
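A minimal sketch of that copy-then-update workaround with the BigQuery Python client (project, dataset, table, and column names are all hypothetical; note that rows still sitting in the streaming buffer are not included in a copy job, so the copy only covers data that has already been committed to the table):

```python
from google.cloud import bigquery

client = bigquery.Client()
source = "my-project.my_dataset.streamed_table"   # table receiving streaming inserts
staging = "my-project.my_dataset.staging_table"   # copy we can safely run DML on

# 1. Copy the streamed table to a staging table.
copy_job = client.copy_table(source, staging)
copy_job.result()  # wait for the copy to finish

# 2. Run the update against the copy, outside the streaming buffer restriction.
update_sql = f"""
UPDATE `{staging}`
SET status = 'processed'   -- hypothetical column
WHERE status IS NULL
"""
client.query(update_sql).result()
```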

Logstash and looking up additional data from a relational table?

I have mobile app log data being posted daily (eventually it will be a data stream). I am looking at different solutions for processing this log data and providing analytics. I am considering using logstash/elasticsearch/kibana combination, but we have additional data on our users stored in a redshift database. So in addition to the mobile data, I would like to pull in additional data from redshift about the user at the time of interaction with mobile app.
However, I've read in some places that doing an actual database query through logstash isn't feasible, but you can use a dictionary file to do a lookup of each user.
I have two questions regarding this approach
Is there a limit to how large this lookup file can be? Mine would be < 500K records, so I'd imagine it would be fine?
Can the process of building the lookup file from Redshift tables be fully automated (ideally through AWS services)? I.e. each night the lookup table is refreshed and published to Logstash, and then used for breakouts in Kibana.
The way we're currently doing it is processing a daily JSON file with a Lambda function, posting it to S3 and then reading it into a Redshift table. This data is then processed into sessions and joined with other tables to generate the final dataset used for visualization. This is currently done in Tableau, but we are exploring other options (such as QuickSight, or possibly the ELK stack).
Just trying to figure out what solution is going to be scalable to clickstream data and will be the most useful down the line.
Thanks!
logstash 7 has a jdbc_streaming filter plugin for dynamically adding stuff to your events, as well as the jdbc_static filter for static stuff.
As you found, you can also use the translate filter. The man page says they've tested "very large" datasets up to 100,000 entries, so your dataset may require some testing. The good part about this filter is that it will reload the data when it detects a change, so you can publish the data on your own schedule (e.g. cron) without restarting logstash. Be on the lookout for events that don't get the translated value, which might be a sign that your publishing frequency should be updated.
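If you go the translate-filter route, the nightly refresh from question 2 could be automated with a small script run from cron or a scheduled Lambda. A sketch under assumed connection details, table, and column names (the dictionary path and the psycopg2/PyYAML dependencies are assumptions too):

```python
import psycopg2
import yaml

# Pull the user lookup from Redshift (hypothetical cluster and credentials).
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="readonly",
    password="secret",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT user_id, segment FROM users")  # hypothetical lookup columns
    lookup = {str(user_id): segment for user_id, segment in cur.fetchall()}

# Write the dictionary where the translate filter can find it; the filter
# reloads the file when it detects a change, so no Logstash restart is needed.
with open("/etc/logstash/dictionaries/users.yml", "w") as f:
    yaml.safe_dump(lookup, f, default_flow_style=False)
```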

What are the pros and cons of loading data directly into Google BigQuery vs going through Cloud Storage first?

Also, is there anything wrong with doing transforms/joins directly within BigQuery? I'd like to minimize the number of components and steps involved for a data warehouse I'm setting up (simple transaction and inventory data for a chain of retail stores.)
Well, if you go through GCS it means you are not streaming your data; loading from a file into BQ is free, and files can be up to 5 TB in size, which is sometimes an advantage (the large-file capability and being free). Also, streaming is real-time, while going through GCS means it's not real-time.
If you want to stream data directly into BQ tables, that has a cost. Currently the price for streaming is $0.01 per 200 MB (June 2018), so around $50 per 1 TB.
On the other hand, transformation can be done with SQL if you can express the task that way. Otherwise you have plenty of options; most of the time people use Dataflow to transform things. See the linked tutorial for an advanced example.
Look also into
Cloud Dataprep - Data Preparation and Data Cleansing and
Google Data Studio: Easily Build Custom Reports and Dashboards
Also an advanced example:
Performing ETL from a Relational Database into BigQuery
Loading data via Cloud Storage is the fastest (and the cheapest) way.
Loading directly can be done from an app (using streaming inserts, which add some additional cost)
As for doing the transformation: if what you plan/need to do can be done in BigQuery, you should do it in BigQuery :) - it is the best and fastest way of doing ETL.
But you should take into account the cost of running queries (if you are not paying Google for reserved slots, it can be $5 per 1 TB scanned)
Another good option for complex ETL is Dataflow, but it can become expensive very quickly, in exchange for more flexibility.
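For reference, a hedged sketch of the two ingestion paths being compared, using the BigQuery Python client (bucket, file, and table names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.sales"  # hypothetical table

# Option 1: batch-load from Cloud Storage (the load job itself is free, but not real-time).
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/sales_2024-01-01.csv", table_id, job_config=load_config
)
load_job.result()  # wait for the load to complete

# Option 2: streaming insert (near real-time, billed per volume streamed).
errors = client.insert_rows_json(table_id, [{"store": "A", "amount": 12.5}])
if errors:
    print(errors)
```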

Comparison of loading from different file formats in BigQuery

We currently load most of our data into BigQuery either via CSV or directly via the streaming API. However, I was wondering if there are any benchmarks available (or maybe a Google engineer could just tell me in the answer) on how loading different formats compares in efficiency.
For example, if we have the same 100M rows of data, does BigQuery show any performance difference from loading it in:
parquet
csv
json
avro
I'm sure one of the answers will be "why don't you test it", but we're hoping that before architecting a converter or re-writing our application, an engineer could share with us what (if any) of the above formats would be the most performant in terms of loading data from a flat file into BQ.
Note: all of the above files would be stored in Google Cloud Storage: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage.
https://cloud.google.com/blog/big-data/2016/03/improve-bigquery-ingestion-times-10x-by-using-avro-source-format
"Improve BigQuery ingestion times 10x by using Avro source format"
The ingestion speed has, to this point, been dependent upon the file format that we export from BigQuery. In prior releases of the SDK, tables and queries were made available to Dataflow as JSON-encoded objects in Google Cloud Storage. Considering that every such entry has the same schema, this representation is extremely redundant, essentially duplicating the schema, in string form, for every record.
In the 1.5.0 release, Dataflow uses the Avro file format to binary-encode and decode BigQuery data according to a single shared schema. This reduces the size of each individual record to correspond to the actual field values
Take care not to limit your comparison to just benchmarks. Those formats also imply some limitations for the client that writes data into BigQuery, and you should also consider them. For instance:
Size of the allowed compressed files (https://cloud.google.com/bigquery/quotas#load_jobs )
CSV is quite "fragile" as a serialization format (no control of types, for instance)
Avro offers poor support for types like Timestamp, Date, and Time.
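If you do end up testing it yourself, the format choice is just the source_format of the load job, so the comparison harness can stay small. A sketch with the Python client (URIs and the table name are hypothetical; the numbers it prints are only indicative of load-job duration on your own data):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # hypothetical table

jobs = {
    "avro": bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO),
    "parquet": bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET),
    "csv": bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1, autodetect=True
    ),
    "json": bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON, autodetect=True
    ),
}

for fmt, config in jobs.items():
    # Overwrite the table on each run so every format loads into the same target.
    config.write_disposition = "WRITE_TRUNCATE"
    job = client.load_table_from_uri(
        f"gs://my-bucket/exports/events.{fmt}", table_id, job_config=config
    )
    job.result()
    print(fmt, "loaded in", (job.ended - job.started).total_seconds(), "seconds")
```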