I'm trying to retrieve the result of a query with aggregates, based on the GA sessions tables, using the BigQuery API in Python, and then push it to my data warehouse.
Issue: I can only retrieve 8,333 records of the aforementioned query result.
But there are always 40k+ records on any day of the year.
I tried setting 'allowLargeResults': True.
I read that I should extract everything to Google Cloud Storage first and then retrieve it from there...
I also read somewhere in the Google docs that I might only be getting the first page of results?!
Has anybody faced the same situation?
See the section on paging through results in the BigQuery docs: https://cloud.google.com/bigquery/docs/data#paging
Alternatively, you can export your table to Google Cloud Storage: https://cloud.google.com/bigquery/exporting-data-from-bigquery
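For reference, here is a minimal sketch (assuming the google-cloud-bigquery Python client and a placeholder GA query) showing that the client's row iterator pages through the full result set rather than stopping at the first page:

```python
# A minimal sketch, assuming the google-cloud-bigquery client library;
# the project, dataset and SQL below are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

sql = """
    SELECT date, COUNT(*) AS sessions
    FROM `my_project.my_dataset.ga_sessions_*`   -- hypothetical table
    GROUP BY date
"""

query_job = client.query(sql)              # start the query
rows = query_job.result(page_size=5000)    # RowIterator: transparently pages through ALL rows

all_rows = [dict(row) for row in rows]     # iterating walks every page, not just the first
print(f"Fetched {len(all_rows)} rows")
```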
I want to validate the data that is exported from Y42 to BigQuery in Google Cloud (e.g. given a predefined schema, I want to check whether all columns appear in the data, the ranges of the values, etc.).
I created a Python script that validates the data that comes in a CSV file. However, I do not know how to run the script before exporting the data to Google Cloud. I can create a VM instance in Google Cloud and run a Python script there, but I don't know how to use the data that is stored in Google Cloud in my script. Can anyone give me any hints regarding this issue?
I investigated whether there are any other ways to validate data directly in Google Cloud, but I did not find anything. Is anyone aware of any data validation methods in Google Cloud?
What I usually do is import the data into BigQuery (into a temporary table, so I don't break my clean prod table) and run a query on it. That query performs all the checks that I want.
If the query returns rows, those rows are in error; the others are OK. Then I merge the valid data into the clean prod table, and the bad data into a log table for further analysis.
That whole sequence is orchestrated with Cloud Workflows.
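A minimal Python sketch of that pattern, assuming hypothetical staging, prod and log tables and a couple of example checks; the real validation SQL depends on your schema:

```python
# Sketch only: table names and the range/null checks are hypothetical;
# extend the WHERE clauses with your own validation rules.
from google.cloud import bigquery

client = bigquery.Client()

# 1. Find the bad rows in the temporary/staging table.
bad_rows_sql = """
    SELECT *
    FROM `my_project.staging.events_tmp`
    WHERE value NOT BETWEEN 0 AND 100   -- example check: value out of range
       OR id IS NULL                    -- example check: missing key
"""
bad_rows = list(client.query(bad_rows_sql).result())

if bad_rows:
    # 2a. Route the bad rows to a log table for further analysis.
    client.query("""
        INSERT INTO `my_project.logs.events_rejected`
        SELECT * FROM `my_project.staging.events_tmp`
        WHERE value NOT BETWEEN 0 AND 100 OR id IS NULL
    """).result()

# 2b. Merge only the valid rows into the clean prod table.
client.query("""
    INSERT INTO `my_project.prod.events`
    SELECT * FROM `my_project.staging.events_tmp`
    WHERE value BETWEEN 0 AND 100 AND id IS NOT NULL
""").result()
```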
I am working on a project that crunches data and does a lot of processing, so I chose BigQuery because it has good support for analytical queries. However, the final computed result is stored in a table that has to power my webpage (used as a transactional/OLTP store). My understanding is that BigQuery is not suitable for transactional queries. Looking into alternatives, I realized I could use Dataflow to do the analytical processing and move the data to Cloud SQL (a relational DB fits my purpose).
However, it seems it's not as straightforward as it looks. First I have to create a pipeline to move the data to a Cloud Storage bucket and then move it to Cloud SQL.
Is there a better way to manage this? Can I use "Create Job from SQL" in Dataflow to do it? I haven't found any examples that use "Create Job from SQL" to process and move data to Cloud SQL.
Consider a simple example on Robinhood:
Compute a user's returns by looking at their portfolio, and show a graph of the returns for every month.
There are other options besides using a pipeline, but in all cases you cannot export table data to a local file, to Sheets, or to Drive. The only supported export location is Cloud Storage, as stated on the Exporting table data documentation page.
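As a rough sketch of that export path (bucket, table and instance names are placeholders): export the BigQuery result table to Cloud Storage as CSV with the Python client, then import the CSV into Cloud SQL, for example with gcloud:

```python
# Sketch only: the bucket, table and Cloud SQL instance names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

source_table = "my_project.analytics.monthly_returns"      # hypothetical result table
destination_uri = "gs://my-bucket/exports/monthly_returns-*.csv"

extract_job = client.extract_table(
    source_table,
    destination_uri,
    job_config=bigquery.ExtractJobConfig(destination_format="CSV"),
)
extract_job.result()  # wait for the export to finish

# Then load the exported CSV into Cloud SQL, for example:
#   gcloud sql import csv my-instance gs://my-bucket/exports/monthly_returns-000000000000.csv \
#       --database=webapp --table=monthly_returns
```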
For context, we would like to visualize our data in Google Data Studio - this dataset receives more entries each week. I have tried hosting our datasets in Google Drive, but it seems that they're too large and this slows down Google Data Studio (the file is only 50 MB, am I doing something wrong?).
I have loaded our data into Google Cloud Storage --> Google BigQuery, and connected Google Data Studio to my BigQuery table. This has made the Data Studio dashboard much quicker!
I'm not sure what the best way is to update our data weekly in Google Cloud/BigQuery. I have found a slow way to do this by uploading the new weekly data to Cloud Storage and then appending it to my table manually in BigQuery, but I'm wondering if there's a better way to do this (or at least a more automated one)?
I'm open to any suggestions, and if you think that bigquery/google cloud storage is not the answer for me, please let me know!
If I understand your question correctly, you want to automate the query that populates your table, which is connected to Data Studio.
If this is the case, you can use Scheduled Queries in BigQuery. A scheduled query lets you define a query whose results are written to a destination table. In particular, you can specify different repetition rules (minimum every 15 minutes) and execution settings, as well as destination write options (destination table, write mode: append or truncate).
In order to use scheduled queries, your account must have the right permissions. You can have a look at the following documentation to better understand how to use scheduled queries [1].
Also, please note that on the front end the updated data in the BigQuery table will only be reflected in Data Studio at each refresh (click the refresh button in Data Studio). To automatically refresh the front-end visualization you can use the following plugin [2] or automate the click on the refresh button through browser console commands.
[1] https://cloud.google.com/bigquery/docs/scheduling-queries
[2] https://chrome.google.com/webstore/detail/data-studio-auto-refresh/inkgahcdacjcejipadnndepfllmbgoag?hl=en
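If you prefer to set the schedule up in code rather than in the console, here is a minimal sketch using the BigQuery Data Transfer Service client; the project, dataset, destination table and query are placeholders:

```python
# Sketch only: project, dataset, table and query are hypothetical placeholders.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = client.common_project_path("my-project")

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="reporting",
    display_name="Weekly append for Data Studio",
    data_source_id="scheduled_query",
    schedule="every monday 06:00",            # weekly run
    params={
        "query": "SELECT * FROM `my-project.staging.weekly_upload`",
        "destination_table_name_template": "dashboard_table",
        "write_disposition": "WRITE_APPEND",  # append to the existing table
    },
)

config = client.create_transfer_config(parent=parent, transfer_config=transfer_config)
print(f"Created scheduled query: {config.name}")
```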
This is about streaming data inserts/updates using a Google Cloud Function. I am using Salesforce as the source database and want to do streaming inserts/updates to Google BigQuery tables. The insert part is working fine, but how can I do an update, given that streamed data first lands in a streaming buffer, which doesn't allow DML operations for a period of around 30 minutes? Any help on this would be really appreciated.
I got a reply from Google Support, quoted below:
"It is true that modifying recent data for the last 30 minutes (with an active streaming buffer) is not possible as one of the limitations of BigQuery DML operations"
One workaround we can try is to copy the data from the streaming table into a new table and perform the operations on that. This helped me.
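A minimal sketch of that workaround with the Python client (dataset and table names are placeholders): copy the streamed table to a working table, then run the UPDATE there:

```python
# Sketch only: dataset and table names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

source = "my_project.staging.sf_stream"        # table receiving the streaming inserts
workbench = "my_project.staging.sf_workbench"  # copy that is safe for DML

# Copy the streamed data out of the table with the active streaming buffer.
# Note: rows still sitting in the legacy streaming buffer may not be included yet.
copy_job = client.copy_table(
    source,
    workbench,
    job_config=bigquery.CopyJobConfig(write_disposition="WRITE_TRUNCATE"),
)
copy_job.result()

# DML is fine on the copy, since it has no streaming buffer.
client.query(f"""
    UPDATE `{workbench}`
    SET status = 'processed'
    WHERE status IS NULL
""").result()
```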
I am new to development, so I am sorry if this is a really basic question. I am trying to access some of the data available from Instagram's API as documented here: https://developers.facebook.com/docs/instagram-api/insights.
I would like some kind of data repository to pull the data into, so I am looking at Google BigQuery to see if I can pull in the data. (The ultimate destination will be Power BI so I can publish online.)
Looking at the Facebook request code - is it possible to put this into Google BigQuery to return the data?
I am replacing 'instagram-business-user-id' with an ID I have already generated - but it feels like it perhaps needs more markup to let BigQuery know what language it is in.
Any help would be much appreciated.
GET graph.facebook.com/{instagram-business-user-id}/insights
?metric=impressions,reach,profile_views
&period=day
Looking at the Facebook request code - is it possible to put this into Google BigQuery to return the data?
Yes, it's absolutely possible using the BigQuery API or the BigQuery CLI (bq); note that your code calls Instagram's API and then loads the result into BigQuery, rather than BigQuery calling Instagram directly.
You can use this pseudo-workflow as an example (using the BigQuery API); a code sketch follows the steps below:
Create a table in BigQuery with the desired schema. For this you also have 2 options:
Save the result in a single column holding the full JSON. This means that in your SELECT you will need JSON_EXTRACT to fetch specific data.
Process the JSON in your code and save it into specific columns, to simplify the SELECT statement.
Call Instagram's API.
Call the BigQuery API or the BigQuery CLI to insert the data. This link provides one option for how to do this.
Call the BigQuery API or the BigQuery CLI to fetch the data. This link provides one option for how to do this.
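As mentioned above, here is a minimal sketch of that workflow in Python, using option 1 (one column with the full JSON). The access token and BigQuery table name are placeholders; the endpoint, metrics and period come from the question:

```python
# Sketch only: the access token and BigQuery table are hypothetical;
# the endpoint, metrics and period are taken from the request in the question.
import datetime
import json

import requests
from google.cloud import bigquery

IG_USER_ID = "your-instagram-business-user-id"   # placeholder
ACCESS_TOKEN = "your-access-token"               # placeholder: a valid Graph API token

# 1. Call Instagram's Graph API from your own code (BigQuery does not call it for you).
resp = requests.get(
    f"https://graph.facebook.com/{IG_USER_ID}/insights",
    params={
        "metric": "impressions,reach,profile_views",
        "period": "day",
        "access_token": ACCESS_TOKEN,
    },
)
resp.raise_for_status()
payload = resp.json()

# 2. Insert into BigQuery - option 1 from the list: one column with the full JSON.
client = bigquery.Client()
table_id = "my_project.social.ig_insights_raw"   # hypothetical table: fetched_at TIMESTAMP, raw STRING

rows = [{
    "fetched_at": datetime.datetime.utcnow().isoformat(),
    "raw": json.dumps(payload),
}]
errors = client.insert_rows_json(table_id, rows)  # streaming insert
if errors:
    print("Insert errors:", errors)

# 3. Later, fetch specific fields with JSON_EXTRACT, e.g.:
#    SELECT JSON_EXTRACT(raw, '$.data') FROM `my_project.social.ig_insights_raw`
```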