Dataflow streaming template for data masking/tokenization giving inconsistent results

The Google-provided Dataflow streaming template for data masking/tokenization from Cloud Storage to BigQuery using Cloud DLP is giving inconsistent output across source files.
We have about 100 files with 1M records each in a GCS bucket, and we call the Dataflow streaming template to tokenize the data using DLP and load it into BigQuery.
While loading the files sequentially, we saw that the results were inconsistent.
For a few files the full 1M rows were loaded, but for most of them the row count varied between 0.98M and 0.99M. Is there any reason for this behaviour?

I am not sure, but it may be due to the BigQuery best-effort deduplication mechanism used when streaming data to BigQuery.
From the Beam documentation:
Note: Streaming inserts by default enables BigQuery best-effort deduplication mechanism. You can disable that by setting ignoreInsertIds. The quota limitations are different when deduplication is enabled vs. disabled :
Streaming inserts applies a default sharding for each table
destination. You can use withAutoSharding (starting 2.28.0 release) to
enable dynamic sharding and the number of shards may be determined and
changed at runtime. The sharding behavior depends on the runners.
From the Google Cloud documentation :
Best effort de-duplication When you supply insertId for an inserted
row, BigQuery uses this ID to support best effort de-duplication for
up to one minute. That is, if you stream the same row with the same
insertId more than once within that time period into the same table,
BigQuery might de-duplicate the multiple occurrences of that row,
retaining only one of those occurrences.
The system expects that rows provided with identical insertIds are
also identical. If two rows have identical insertIds, it is
nondeterministic which row BigQuery preserves.
De-duplication is generally meant for retry scenarios in a distributed
system where there's no way to determine the state of a streaming
insert under certain error conditions, such as network errors between
your system and BigQuery or internal errors within BigQuery. If you
retry an insert, use the same insertId for the same set of rows so
that BigQuery can attempt to de-duplicate your data. For more
information, see troubleshooting streaming inserts.
De-duplication offered by BigQuery is best effort, and it should not
be relied upon as a mechanism to guarantee the absence of duplicates
in your data. Additionally, BigQuery might degrade the quality of best
effort de-duplication at any time in order to guarantee higher
reliability and availability for your data.
If you have strict de-duplication requirements for your data, Google
Cloud Datastore is an alternative service that supports transactions.
This mechanism can be disabled with ignoreInsertIds.
You can test with this mechanism disabled and check whether all the rows are inserted.
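To make the suspected effect concrete, here is a toy Java model of the behaviour described in the quoted documentation: a row whose insertId was already seen within the last minute is dropped. This is only a sketch. The class name, the Row shape and the window handling are assumptions made for illustration; it is not BigQuery's implementation and not the template's code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only. This is NOT how BigQuery implements de-duplication.
final class BestEffortDedupSketch {
    // "up to one minute", as stated in the quoted documentation.
    static final long WINDOW_MILLIS = 60_000L;

    // Hypothetical row shape used only for this illustration.
    static final class Row {
        final String insertId;
        final long timestampMillis;

        Row(String insertId, long timestampMillis) {
            this.insertId = insertId;
            this.timestampMillis = timestampMillis;
        }
    }

    // Counts the rows that survive when a repeated insertId inside the window is dropped.
    static int countKept(List<Row> rowsInArrivalOrder) {
        Map<String, Long> lastSeen = new HashMap<>();
        int kept = 0;
        for (Row row : rowsInArrivalOrder) {
            Long seenAt = lastSeen.get(row.insertId);
            boolean duplicate = seenAt != null && row.timestampMillis - seenAt < WINDOW_MILLIS;
            if (!duplicate) {
                kept++;
            }
            lastSeen.put(row.insertId, row.timestampMillis);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("a", 0L),
            new Row("a", 1_000L),     // same insertId one second later: dropped
            new Row("b", 2_000L),
            new Row("a", 120_000L));  // same insertId two minutes later: kept
        System.out.println(countKept(rows) + " of " + rows.size() + " rows kept");
    }
}
```

If distinct rows ever ended up sharing an insertId, the loaded count would fall short of the source count in the same way.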

After adjusting the batch size value in the template, all the files of 1M records each loaded successfully.

Related

How to deal with failing Athena queries as AWS Glue Data Catalog metadata size grows large?

Based on my research, the easiest and most straightforward way to get metadata out of Glue's Data Catalog is to use Athena and query the information_schema database. The article below has come up frequently in my research and is written by Amazon's team:
Querying AWS Glue Data Catalog
However, under the section titled Considerations and limitations the following is written:
Querying information_schema is most performant if you have a small to moderate amount of AWS Glue metadata. If you have a large amount of metadata, errors can occur.
Unfortunately, the article gives no indication of what constitutes a "large amount of metadata" or exactly what errors can occur when the metadata is large and one needs to query it.
My question is: how should one deal with the ever-growing size of the Data Catalog's metadata so as to avoid errors when using Athena to query the metadata?
Is there a best practice for this? Or perhaps a better solution for getting the same metadata that querying the catalog using Athena provides, without a great many API calls (using boto3, Hive DDL, etc.)?

I talked to AWS Support and did some research on this. Here's what I gathered:
The information_schema is built at query execution time, there doesn't seem to be any caching.
If you access information_schema.tables, it will make separate calls for each schema you have to the Hive Metastore (Glue Data Catalog).
If you access information_schema.columns, it will make separate calls for each schema and each table in that schema you have to the Hive Metastore.
These queries are affected by the general service quotas. In this case, DML queries like your select must finish within 30 minutes.
If your Glue Data Catalog has many thousands of schemas, tables, and columns, all of this may result in slow performance. As a rough guesstimate, support told me that you should be fine as long as you have fewer than ~10,000 tables, which should be the case for most people.
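To get a feel for how the number of metastore calls grows, here is a rough Java sketch based on the description above: one call per schema for information_schema.tables, and one call per schema plus one per table for information_schema.columns. How the calls add up exactly is my assumption, and the schema names and table counts are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Back-of-the-envelope estimate only; the real call pattern inside Athena may differ.
final class InformationSchemaCallEstimate {
    // information_schema.tables: one metastore call per schema.
    static long callsForTables(Map<String, Integer> tablesPerSchema) {
        return tablesPerSchema.size();
    }

    // information_schema.columns: assumed one call per schema plus one per table in it.
    static long callsForColumns(Map<String, Integer> tablesPerSchema) {
        long calls = 0;
        for (int tableCount : tablesPerSchema.values()) {
            calls += 1 + tableCount;
        }
        return calls;
    }

    public static void main(String[] args) {
        // Hypothetical catalog layout.
        Map<String, Integer> catalog = new LinkedHashMap<>();
        catalog.put("sales", 1_200);
        catalog.put("logs", 8_000);
        catalog.put("staging", 900);
        System.out.println("tables query calls:  " + callsForTables(catalog));
        System.out.println("columns query calls: " + callsForColumns(catalog));
    }
}
```

The point is only that the columns query scales with the total number of tables, which is why a catalog approaching the ~10,000 table mark starts to hurt.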

Bigquery Pricing Comparison : Loading data into Bigquery vs Using Create External Table

My team is developing a data platform on Google Cloud Platform.
We uploaded our company's data to Google Cloud Storage and are trying to build a data mart on BigQuery.
To save on GCP usage costs, we are deciding between loading all the data from GCS into BigQuery and creating external tables on BigQuery.
Which way is more cost-efficient?

BigQuery's external table capability blurs the border between data lake (files) and data warehouse (structured data), so your question is relevant.
When you use an external table, several features are missing, such as clustering and partitioning, and your files are parsed on the fly (with type casting). As a result, processing is slower and you can't control or limit the volume of data that you process. In addition, errors in a file can break your query.
When you use a native table, the data storage is optimized for BigQuery processing: the data is already clean and parsed, and the table can be partitioned and clustered.
The question of cost has several parts. First, data storage: if you have files in GCS and the same data in BigQuery, you pay for the storage twice. However, after 90 days without any update, the data moves to a cheaper long-term ("archive") storage mode in BigQuery and costs half as much. You can also move your GCS files to cold storage after they have been loaded into BigQuery.
That's for the storage. Then the processing. Processing costs roughly 10 times more than storage, so it is the most important thing to focus on. When you run a BigQuery query, you pay for the volume of data that the query scans. With partitioned or clustered native tables, you can limit the amount of data that you scan and therefore greatly reduce the cost. With external tables, you can't use the partitioning and clustering features, so you always pay for the full amount of data.
Therefore, it depends (as always) on your volume of data and the frequency of the requests.
Don't forget one more thing: with an external table, errors in the files can break your queries. In production, that can be serious. Think carefully about that.
Finally, querying an external table is slower than querying a native table (no partitioning, so more data to process, plus parsing/casting time). Time is money (if you have time-critical queries), and that intangible cost can also influence your choice.
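The processing argument comes down to simple arithmetic: cost is proportional to the data scanned, so pruning partitions shrinks the bill. Here is a small Java sketch of that calculation. The table size, the one-day-out-of-365 fraction and the price per TB are placeholder assumptions, so plug in your own numbers and current pricing.

```java
// Rough on-demand cost arithmetic; every number in main() is a placeholder.
final class ScanCostSketch {
    // Cost of one query that scans the given fraction of a table.
    static double onDemandCostUsd(double tableTb, double fractionScanned, double usdPerTb) {
        if (fractionScanned < 0.0 || fractionScanned > 1.0) {
            throw new IllegalArgumentException("fractionScanned must be between 0 and 1");
        }
        return tableTb * fractionScanned * usdPerTb;
    }

    public static void main(String[] args) {
        double tableTb = 10.0;  // hypothetical table size
        double usdPerTb = 5.0;  // assumed price, check the current pricing page
        // Full scan (no partition filter available or used).
        System.out.println("full scan:     $" + onDemandCostUsd(tableTb, 1.0, usdPerTb));
        // One daily partition out of a year of data.
        System.out.println("one partition: $" + onDemandCostUsd(tableTb, 1.0 / 365, usdPerTb));
    }
}
```

Multiply the per-query figure by how often the query runs to compare it against the extra storage cost of keeping a native copy.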

@guillaume blaquiere's answer is okay, but he forgot to mention something important: it is possible to run partitioned queries. You can create partitioned external tables linked to a bucket in Cloud Storage. E.g.:
gs://myBucket/myTable/dt=2019-10-31/lang=en/foo
gs://myBucket/myTable/dt=2018-10-31/lang=fr/bar
Then, you can use "dt" or "lang" filters in SQL queries from BigQuery.
https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs
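To show what the key=value layout encodes, here is a small Java sketch that pulls the partition keys out of object paths like the two above. It only illustrates the naming convention; it is not how BigQuery reads the files, and the class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrates the hive-style key=value path convention; not BigQuery code.
final class HivePartitionPathSketch {
    // Returns the key=value segments found after the table prefix, in path order.
    static Map<String, String> partitionKeys(String uri, String tablePrefix) {
        if (!uri.startsWith(tablePrefix)) {
            throw new IllegalArgumentException("URI is not under the table prefix: " + uri);
        }
        Map<String, String> keys = new LinkedHashMap<>();
        String rest = uri.substring(tablePrefix.length());
        for (String segment : rest.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                keys.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        String prefix = "gs://myBucket/myTable/";
        System.out.println(partitionKeys("gs://myBucket/myTable/dt=2019-10-31/lang=en/foo", prefix));
        System.out.println(partitionKeys("gs://myBucket/myTable/dt=2018-10-31/lang=fr/bar", prefix));
    }
}
```

A filter on "dt" or "lang" can then be satisfied by looking only at the matching path prefixes instead of every file.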

BigQuery read is Slow in Google DataFlow pipeline

For our near-real-time analytics, data is streamed into Pub/Sub, and an Apache Beam Dataflow pipeline processes it by first writing into BigQuery, then reading from BigQuery again to do the aggregate processing, and then storing the aggregated results in HBase for OLAP cube computation.
Here is the sample ParDo function used to fetch records from BigQuery:
String eventInsertedQuery="Select count(*) as usercount from <tablename> where <condition>";
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(eventInsertedQuery).build();
TableResult result = bigquery.query(queryConfig);
FieldValueList row = result.getValues().iterator().next();
LOG.info("rowCounttt {}",row.get("usercount").getStringValue());
bigquery.query is taking around 4 seconds. Any suggestions to improve it? Since this is near-real-time analytics, this duration is not acceptable.

Frequent reads from BigQuery can add undesired latency to your app. Considering that BigQuery is a data warehouse for analytics, I would think that 4 seconds is a good response time. I would suggest optimizing the query to get below the 4-second mark.
Here is a list of options you can choose from:
Optimizing the query statement, including changing the database schema to add partitioning or clustering.
Using a relational database provided by Cloud SQL for getting better response times.
Changing the architecture of your app. As recommended in the comments, it is a good option to transform the data before writing to BQ, so you can avoid the latency of querying the data twice. There are several articles on performing near-real-time computation with Dataflow (e.g. building real time app and real time aggregate data).
On the other hand, keep in mind that query completion time is not covered by the BigQuery SLA page. In fact, errors are expected to occur and can make a query take even longer to finish; see Back-off Requirements in the same link.
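As a rough illustration of the third option, the count can be kept inside the pipeline as events arrive instead of being queried back from BigQuery. The sketch below is plain Java with fixed one-minute windows. In a real Beam pipeline you would use its windowing and counting transforms instead; this class is a hypothetical stand-in that only shows the idea.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for in-pipeline aggregation; not Beam code.
final class WindowedCountSketch {
    private final long windowMillis;
    private final Map<Long, Long> countsByWindowStart = new TreeMap<>();

    WindowedCountSketch(long windowMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        this.windowMillis = windowMillis;
    }

    // Assigns the event to its fixed window and bumps that window's count.
    void add(long eventTimestampMillis) {
        long windowStart = Math.floorDiv(eventTimestampMillis, windowMillis) * windowMillis;
        countsByWindowStart.merge(windowStart, 1L, Long::sum);
    }

    long countFor(long windowStart) {
        return countsByWindowStart.getOrDefault(windowStart, 0L);
    }

    public static void main(String[] args) {
        WindowedCountSketch counts = new WindowedCountSketch(60_000L);
        counts.add(0L);
        counts.add(59_999L);
        counts.add(60_000L);
        System.out.println("window 0:     " + counts.countFor(0L));
        System.out.println("window 60000: " + counts.countFor(60_000L));
    }
}
```

The aggregate is then available as soon as the window closes, without a round trip to BigQuery.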

Can we schedule StackDriver Logging to export log?

I am new to Google Cloud Stackdriver Logging, and as per this documentation Stackdriver stores the Data Access audit logs for 30 days. The same page also mentions that the size of a log entry is limited to 100KB.
I am aware that logs can be exported to Google Cloud Storage using the Cloud SDK as well as the Logging libraries in many languages (we prefer Python).
I have two questions related to exporting the logs:
Is there any way in StackDriver to schedule something similar to a task or cronjob that keeps exporting the Logs in the Google Cloud storage automatically after a fixed interval of time?
What happens to the log entries which are larger than 100KB. I assume they get truncated. Is my assumption correct? If yes, is there any way in which we can export/view the full(which is not at all truncated) Log entry?

Is there any way in StackDriver to schedule something similar to a
task or cronjob that keeps exporting the Logs in the Google Cloud
storage automatically after a fixed interval of time?
Stackdriver supports exporting log data via sinks. There is no schedule that you can set, as everything is automatic. Basically, the data is exported as soon as possible and you have no control over the amount exported at each sink or the delay between exports. I have never found this to be an issue. Logging, by design, is not meant to be used as a real-time system. The closest is to sink to Pub/Sub, which has a delay of a couple of seconds (based on my experience).
There are two methods to export data from Stackdriver:
Create an export sink. Supported destinations are BigQuery, Cloud Storage and PubSub. The log entries will be written to the destination automatically. You can then use tools to process the exported entries. This is the recommended method.
Write your own code in Python, Java, etc. to read the log entries and do what you want with them. Scheduling is up to you. This method is manual and requires your management of schedule and destination.
What happens to the log entries which are larger than 100KB. I assume
they get truncated. Is my assumption correct? If yes, is there any way
in which we can export/view the full(which is not at all truncated)
Log entry?
Entries that exceed the max size of an entry cannot be written to Stackdriver. The API call that attempts to create the entry will fail with an error message similar to (Python error message):
400 Log entry with size 113.7K exceeds maximum size of 110.0K
This means that entries that are too large will be discarded unless the writer has logic to handle this case.
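A minimal sketch of such writer-side logic, in Java: measure the payload in UTF-8 bytes and truncate it before writing. The 100 KB figure is taken from the question, and the service counts the whole entry rather than just the text payload, so treat the limit here as an assumption and leave some headroom. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Writer-side guard; the limit is an assumption, not an official constant.
final class LogEntrySizeGuard {
    static final int ASSUMED_MAX_BYTES = 100 * 1024;

    static boolean fits(String payload, int maxBytes) {
        return payload.getBytes(StandardCharsets.UTF_8).length <= maxBytes;
    }

    // Cuts the payload to at most maxBytes of UTF-8 without splitting a multi-byte character.
    static String truncateToFit(String payload, int maxBytes) {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must not be negative");
        }
        byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return payload;
        }
        int end = maxBytes;
        // Step back while the first excluded byte is a UTF-8 continuation byte (10xxxxxx).
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String message = "some very long log message";
        String safe = truncateToFit(message, ASSUMED_MAX_BYTES);
        System.out.println(fits(safe, ASSUMED_MAX_BYTES));
    }
}
```

Instead of truncating, the writer could also split the payload across several entries or store the full payload elsewhere and log a reference to it.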

As per the Stackdriver Logging documentation, the whole process is automatic. An export sink to Google Cloud Storage is slower than one to BigQuery or Cloud Pub/Sub (link to the documentation).
I recently used an export sink to BigQuery, which is better than Cloud Pub/Sub if you don't want to use a third-party application for log analysis. A BigQuery sink needs a dataset in which to store the log entries. I noticed that the sink creates BigQuery tables on a timestamp basis in the dataset.
One more thing: if you want to query timestamp-partitioned tables, check this link:
Legacy SQL Functions and Operators

What are the pros and cons of loading data directly into Google BigQuery vs going through Cloud Storage first?

Also, is there anything wrong with doing transforms/joins directly within BigQuery? I'd like to minimize the number of components and steps involved for a data warehouse I'm setting up (simple transaction and inventory data for a chain of retail stores.)

Well, if you go through GCS, it means you are not streaming your data. Loading from files into BQ is free, and files can be up to 5TB in size. The large-file capability and being free are sometimes an advantage. Also, streaming is real time, and going through GCS means it's not real time.
If you want to stream data directly into BQ tables, that has a cost. Currently the price for streaming is $0.01 per 200 MB (June 2018), so around $50 for 1TB.
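The "around $50 for 1TB" figure is just the per-200 MB price scaled up. A quick Java sketch of the arithmetic, using the June 2018 price quoted above (it may well have changed since, so treat it as a parameter):

```java
// Streaming insert cost arithmetic using the price quoted in the answer.
final class StreamingInsertCostSketch {
    static double costUsd(double megabytes, double usdPer200Mb) {
        return megabytes / 200.0 * usdPer200Mb;
    }

    public static void main(String[] args) {
        double price = 0.01; // USD per 200 MB, as quoted above (June 2018)
        // 1 TB counted as 1,000,000 MB.
        System.out.println("decimal TB: $" + costUsd(1_000_000.0, price));
        // 1 TB counted as 1024 * 1024 MB.
        System.out.println("binary TB:  $" + costUsd(1024.0 * 1024.0, price));
    }
}
```

Either way of counting a terabyte lands in the $50 range, while a load job from GCS for the same data costs nothing.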
On the other hand, transformations can be done with SQL if you can express the task that way. Otherwise you have plenty of options; most of the time people use Dataflow to transform things. See the linked tutorial for an advanced example.
Look also into
Cloud Dataprep - Data Preparation and Data Cleansing and
Google Data Studio: Easily Build Custom Reports and Dashboards
Also an advanced example:
Performing ETL from a Relational Database into BigQuery

Loading data via Cloud Storage is the fastest (and cheapest) way.
Loading directly can be done from an app (using streaming inserts, which add some cost).
As for transformations: if what you plan or need to do can be done in BigQuery, you should do it in BigQuery :) It is the best and fastest way of doing ETL.
But you should take into account the cost of running queries (if you are not paying Google for slots, it could be $5 per 1TB scanned).
Another good option for complex ETL is Dataflow, but it can become expensive very quickly in exchange for more flexibility.
