I have data stored in BigQuery - it is a small dataset - roughly 500 rows. I want to be able to query this data and load it in to the front end of Django Application. What is the best practice for this type of data flow?
I want to be able to make calls to the BigQuery API using Javascript. I will then parse the result of the query and serve it in the webpage. The alternative seems to be to find a way of making a regular copy of the BigQuery data which I could store in a Cloud Storage Bucket but this adds a potentially unnecessary level of complexity which I could hopefully avoid if there is a way to query the live dataset.
Related
Currently we have a problem with loading data when updating the report data with respect to the DB, since it has too many records and it takes forever to load all the data. The issue is how can I load only the data from the last year to avoid taking so long to load everything. As I see, trying to connect to the COSMO DB in the box allows me to place an SQL query, but I don't know how to do it in this type of non-relational database.
Example
Power BI has an incremental refresh feature. You should be able to refresh the current year only.
If that still doesn’t meet expectations I would look at a preview feature called Azure Synapse Link which automatically pulls all Cosmos DB updates out into analytical storage you can query much faster in Azure Synapse Analytics in order to refresh Power BI faster.
Depending on the volume of the data you will hit a number of issues. First is you may exceed your RU limit, slowing down the extraction of the data from CosmosDB. The second issue will be the transforming of the data from JSON format to a structured format.
I would try to write a query to specify the fields and items that you need. That will reduce the time of processing and getting the data.
For SQL queries it will be some thing like
SELECT * FROM c WHERE c.partitionEntity = 'guid'
For more information on the CosmosDB SQL API syntax please see here to get you started.
You can use the query window in Azure to run the SQL commands, or Azure Storage Explorer to test the query, then move it to Power BI.
What is highly recommended is to extract the data into a place where is can be transformed into a strcutured format like a table or csv file.
For example use Azure Databricks to extract, then turn the JSON format into a table formatted object.
You do have the option of using running Databricks notebook queries in CosmosDB, or Azure DataBricks in its own instance. One other option would to use change feed to send the data and an Azure Function to send and shred the data to Blob Storage and query it from there, using Power BI, DataBricks, Azure SQL Database etc.
In the Source of your Query, you can make a select based on the CosmosDB _ts system property, like:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > 1609455599"
In this case, 1609455599 is the timestamp which corresponds to 31.12.2020, 23:59:59. So, only data from 2021 will be selected.
I am trying to automate the entire data loading, that means whenever I upload a file to Google Cloud storage, it automatically triggers the data to be uploaded into the BigQuery dataset. I know that there is a daily set timing update available, but I want something where it only triggers whenever the CSV file is re-uploaded.
You have 2 possibilities:
Either you react on event. I mean you can plug a function on Google Cloud Storage events. In the event message you have the file stored in GCS and you can do what you want with it, for exemple run a load job from Google Cloud Storage.
Or, do nothing! Let the file in GCS and create a BigQuery federated table to read into GCS
With this 2 solutions, your data are accessible by BigQuery. Your Datastudio graph can query BigQuery, the data are here. However.
The load job is more efficient, you can partition and clusterize your data for optimize the speed and the cost. However, you duplicate your data (from GCS) and you have to code and to run your function. Anyway, cost is very low and function very simple. For Big Data it's my recommended solution
The federated table are very useful when the quantity of data is low and for occasional access or for prototyping. You can't clusterize and partition your data and the speed is lower than data loaded into BigQuery (because the CSV parsing is performing on the fly).
So, Big Data is a wide area: do you need to transform the data before the load? can you transform them after the log? How can you link query the ones after the others? ....
Don't hesitate if you have other questions on this!
We want to use AWS Athena for analytics and segmentation, our problem is that our data is schemaless, rows are different with some similar columns.
Is it possible to create table without defining all the columns?
When we query we know the type (string/int) of each column so if there is a way to define on the query it will be great.
We can structure the data in anyway needed to support schemaless and in any format: CSV / JSON.
Is Athena an option for schemaless uses?
There are many ways to use Athena in schemaless uses and you need to give specific examples of scenarios that you want to support more efficiently as in Athena you pay based on the data that you scan and optimizing your data to minimize the data scan is critical to make it a useful tool in scale.
The simplest way to get you started as you are learning the tool, and the types of queries that you can run on your data, is to define a table with a single column ("line"), and then do the parsing of the data that you want using string functions, or JSON functions if the lines are in JSON format.
You will get good time performance if you have multiple files, but it will be expensive as you need to scan all your data for every query. I suggest that you start with these queries as a good way to define your requirements. As you see the growth of usage, start optimizing the use cases by using the CTAS (Create Table As Select) commands that will generate parquet versions of the original raw data to support the more popular (and expensive) use cases.
You are welcome to read my blog post that is describing the strategy and tactics of a cloud environment using Athena and the other AWS tools around it.
We can query results in any language from Google BigQuery using the predefined methods -> see docs.
Alternatively, we can also query the results and store them to cloud storage, for example in a .csv -> see docs on storing data to GCS
When we repeatedly need to extract the same data, eg lets say 100 times per day, does it make sense to cache the data to Cloud Storage and load it from there, or to redo the BigQuery requests?
What is more cost efficient and how would I obtain the unit cost of these requests, to estimate a % difference?
BigQuery's pricing model is based on how much bytes your query uses.
So if you want to query the results, than store the results as a BigQuery table. Set a destination table when you run your query.
There is no point of reloading from GCS a previous results. The cost will be the same, you just complicate this.
I am new to development, so I am sorry if this is a really basic question. I am trying to access some of the data available from instagram's API as documented here. https://developers.facebook.com/docs/instagram-api/insights.
I would like some kind of data repository to pull the data into, so I am looking at Google Big Query to see if I can pull in the data. (The ultimate place will be PowerBi so I can publish online)
Looking at the Facebook request code - is it possible to put this into Google Big query to return the data?
I am replacing the 'instagram-business-user-id' with an ID I have generated already - but it feels like perhaps it needs more markup to let Big Query know what language it is in.
Any help would be much appreciated.
GET graph.facebook.com/{instagram-business-user-id}/insights
?metric=impressions,reach,profile_views
&period=day
Looking at the Facebook request code - is it possible to put this into Google Big query to return the data?
Yes it's absolutely possible using bigQuery API or bigQuery CLI
You can use this Psuedo workflow as an example (using BigQuery API):
Create a table in bigQuery with the desired schema for this you also have 2 options:
Save the result in 1 column with the full JSON, This means to the select you need you use JSON_EXTRACT to fetch specific data
Process the JSON in your code and save it in specific columns to simplify the select statement
Call instagram's API
Call bigQuery API or bigQuery CLI to insert the data, This link provides one option how to do this
Call bigQuery API or bigQuery CLI to fetch the data, This link provides one option how to do this