Streaming Insert/Update using Google Cloud Functions - google-cloud-platform

This is regarding streaming data insert/update using a Google Cloud Function. I am using Salesforce as the source database and want to do streaming inserts/updates to Google BigQuery tables. My insert part is working fine, but how can I do an update, given that streamed data first lands in a streaming buffer, which won't allow DML operations for a period of roughly 30 minutes? Any help on this will be really appreciated.

Got a reply from Google Support like the below:
"It is true that modifying recent data from the last 30 minutes (with an active streaming buffer) is not possible, as one of the limitations of BigQuery DML operations."
One workaround we can try is to copy the data from the streaming table to a new table and perform any operation on that copy; a sketch of that copy-then-update flow is shown below. This helped me.
Thanks
vak
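
A minimal sketch of that workaround with the BigQuery Python client, assuming placeholder project, dataset, and table names and a hypothetical status column:

    from google.cloud import bigquery

    client = bigquery.Client()

    source = "my-project.my_dataset.streaming_table"   # table receiving streaming inserts
    working = "my-project.my_dataset.working_table"    # copy that accepts DML immediately

    # Copy the streamed data into a working table. Rows still sitting in the
    # streaming buffer may not be included in the copy, so run this once the
    # buffer has had time to flush if completeness matters.
    copy_job = client.copy_table(
        source,
        working,
        job_config=bigquery.CopyJobConfig(
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE
        ),
    )
    copy_job.result()

    # DML on the copy is not blocked by the streaming buffer.
    client.query(
        f"UPDATE `{working}` SET status = 'processed' WHERE status = 'new'"
    ).result()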

Related

Export Data from BigQuery to Google Cloud SQL using Create Job From SQL tab in DataFlow

I am working on a project which crunches data and does a lot of processing, so I chose BigQuery since it has good support for running analytical queries. However, the final computed result is stored in a table that has to power my webpage (used as a transactional/OLTP store). My understanding is that BigQuery is not suitable for transactional queries. Looking into alternatives, I realized I can use Dataflow to do the analytical processing and move the data to Cloud SQL (a relational DB fits my purpose).
However, it's not as straightforward as it seems. First I have to create a pipeline to move the data to a GCP bucket and then move it to Cloud SQL.
Is there a better way to manage this? Can I use "Create Job from SQL" in Dataflow to do it? I haven't found any examples that use "Create Job from SQL" to process and move data to Cloud SQL.
Consider a simple example from Robinhood:
compute a user's returns by looking at their portfolio and show a graph of the returns for every month.
There are other options besides using a pipeline, but in all cases you cannot export table data to a local file, to Sheets, or to Drive. The only supported export location is Cloud Storage, as stated on the Exporting table data documentation page.
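
A rough sketch of that Cloud Storage route with the BigQuery Python client (project, dataset, bucket, and Cloud SQL names below are all placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Export the computed result table to Cloud Storage; BigQuery cannot
    # export directly to a local file, Sheets, Drive, or Cloud SQL.
    extract_job = client.extract_table(
        "my-project.analytics.monthly_returns",          # hypothetical result table
        "gs://my-export-bucket/monthly_returns.csv",
    )
    extract_job.result()

    # The exported file can then be imported into Cloud SQL, for example:
    #   gcloud sql import csv my-instance gs://my-export-bucket/monthly_returns.csv \
    #       --database=webapp --table=monthly_returns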

Real time ingestion to Redshift without using S3?

I'm currently using Apache NiFi to ingest real-time data via Kafka, and I would like to push this data to Redshift to have a near-real-time transaction table for online analytics of campaign results and other stuff.
The catch: if I use Kinesis or COPY from S3, I would have a LOT of reads/writes from/to S3, and I've found from previous experience that this becomes very expensive.
So, is there a way to send data to a Redshift table without constantly locking the destination table? The idea is putting the data in directly from NiFi without persisting it. I have an hourly batch process, so it would not be a problem if I lost a couple of rows on the online stream.
Why Redshift? That's the Data Lake platform, and it will come in handy to cross online data with other tables.
Any idea?
Thanks!
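
For the "insert directly without persisting to S3" idea mentioned above, a hedged sketch with psycopg2 and batched multi-row INSERTs (the cluster endpoint, credentials, and campaign_events table are all placeholders); Redshift handles a few larger statements far better than many single-row inserts, so micro-batching is assumed here:

    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
        port=5439, dbname="analytics", user="loader", password="***",
    )

    # A micro-batch of events collected from the NiFi/Kafka flow.
    rows = [
        ("campaign_a", "click", "2023-01-01T12:00:00"),
        ("campaign_b", "view",  "2023-01-01T12:00:01"),
    ]

    with conn, conn.cursor() as cur:
        # One multi-row INSERT per batch keeps commits (and table locks) low
        # compared with one statement per row.
        execute_values(
            cur,
            "INSERT INTO campaign_events (campaign, event, event_ts) VALUES %s",
            rows,
        )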

How to update data in google cloud storage/bigquery for google data studio?

For context, we would like to visualize our data in Google Data Studio - this dataset receives more entries each week. I have tried hosting our data sets in Google Drive, but it seems that they're too large and this slows down Data Studio (the file is only 50 MB, am I doing something wrong?).
I have loaded our data into Google Cloud Storage --> Google BigQuery, and connected Data Studio to my BigQuery table. This has made the Data Studio dashboard much quicker!
I'm not sure what the best way is to update our data weekly in Google Cloud/BigQuery. I have found a slow way to do this by uploading the new weekly data to Cloud Storage and then appending the data to my table manually in BigQuery, but I'm wondering if there's a better (or at least more automated) way to do this.
I'm open to any suggestions, and if you think that bigquery/google cloud storage is not the answer for me, please let me know!
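
A rough sketch of scripting that upload-and-append step with the BigQuery Python client, assuming placeholder bucket, dataset, and table names:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # add to existing rows
    )

    load_job = client.load_table_from_uri(
        "gs://my-bucket/weekly/latest_upload.csv",     # this week's file
        "my-project.reporting.dashboard_table",        # table Data Studio reads
        job_config=job_config,
    )
    load_job.result()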
If I understand your question correctly, you want to automate the query that populates your table, which is connected to Data Studio.
If this is the case, then you can use Scheduled Queries in BigQuery. A scheduled query lets you define a query whose results are written to a destination table. In particular, you can specify different repetition rules (minimum every 15 minutes) and destination writing options (destination table, write mode: append or truncate); a sketch of setting this up programmatically follows the links below.
In order to use scheduled queries your account must have the right permissions. You can have a look at the following documentation to better understand how to use scheduled queries [1].
Also, please note that on the front end the updated data in the BigQuery table will only be seen in Data Studio after a refresh (click the refresh button in Data Studio). To automatically refresh the front-end visualization you can use the following plugin [2] or automate the click on the refresh button through browser console commands.
[1] https://cloud.google.com/bigquery/docs/scheduling-queries
[2] https://chrome.google.com/webstore/detail/data-studio-auto-refresh/inkgahcdacjcejipadnndepfllmbgoag?hl=en
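
A sketch of creating such a scheduled query with the BigQuery Data Transfer Service Python client (project, dataset, query, and schedule are placeholders; see [1] for the exact schedule syntax and options):

    from google.cloud import bigquery_datatransfer

    transfer_client = bigquery_datatransfer.DataTransferServiceClient()
    parent = transfer_client.common_project_path("my-project")

    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id="reporting",
        display_name="weekly_dashboard_refresh",
        data_source_id="scheduled_query",
        params={
            "query": "SELECT * FROM `my-project.staging.weekly_upload`",
            "destination_table_name_template": "dashboard_table",
            "write_disposition": "WRITE_APPEND",   # or WRITE_TRUNCATE
        },
        schedule="every mon 09:00",
    )

    transfer_config = transfer_client.create_transfer_config(
        parent=parent,
        transfer_config=transfer_config,
    )
    print("Created scheduled query:", transfer_config.name)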

BigQuery python client dropping some rows using Streaming API

I have around a million data items being inserted into BigQuery using the streaming API (the BigQuery Python client's insert_rows function), but there's some data loss: ~10,000 items are lost during insertion. Is there a chance BigQuery might be dropping some of the data? There aren't any insertion errors (or any errors whatsoever, for that matter).
I would recommend filing a private Issue Tracker report so the BigQuery engineers can look into this. Make sure to provide the affected project, the source of the data, the code you're using to stream into BigQuery, and the client library version.
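
One thing worth checking before filing: the streaming insert methods report per-row failures through their return value rather than raising, so silent drops can hide there. A minimal sketch (table name and rows are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.events"    # placeholder table

    rows = [{"id": 1, "payload": "a"}, {"id": 2, "payload": "b"}]

    errors = client.insert_rows_json(table_id, rows)  # returns [] when every row lands
    if errors:
        # Each entry names the failed row index and the reasons it was rejected.
        for entry in errors:
            print("Row", entry["index"], "failed:", entry["errors"])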

BigQuery API - retrieving rows limited

I'm trying to retrieve the result of a query with aggregates, based on the GA sessions data, using the BigQuery API in Python, and then push it to my data warehouse.
Issue: I can only retrieve 8,333 records of the aforementioned query result.
But there are always 40k+ records for any day of the year.
I tried setting 'allowLargeResults': True.
I read I should extract everything to Google Cloud Storage first and then retrieve it from there...
I also read somewhere in the Google docs that I might only be getting the first page?!
Has anybody faced the same situation?
See the section on paging through results in the BigQuery docs: https://cloud.google.com/bigquery/docs/data#paging
Alternatively, you can export your table to Google Cloud Storage: https://cloud.google.com/bigquery/exporting-data-from-bigquery
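
A rough sketch of paging through a large result set with the Python client (the query and project are placeholders); iterating the returned RowIterator fetches successive pages instead of stopping at the first one:

    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
        SELECT date, COUNT(*) AS sessions
        FROM `my-project.ga_dataset.ga_sessions_*`
        GROUP BY date
    """

    row_iterator = client.query(query).result(page_size=10000)

    # Iterating the RowIterator transparently requests page after page,
    # so all 40k+ rows come back, not just the first 8,333.
    all_rows = [dict(row) for row in row_iterator]
    print("Fetched", len(all_rows), "rows")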