I am performing a query to generate a new BigQuery table of of size ~1 Tb (a few billion rows), as part of migrating a Cloud SQL table to BigQuery, using Federated query. I use the BigQuery Python client to submit the query job, in the query I select all from the the Cloud SQL database table and use EXTERNAL_QUERY.
I find that the query can take 6+ hours (and fails with "Operation timed out after 6.0 hour")! Even if it didn't fail, I would like to speed it up as I may need to perform this migration again.
I see that the PostgreSQL egress is 20Mb/sec, consistent with a job that would take half a day. Would it help if I consider something more distributed with Dataflow? Or simpler, extend my Python code using the BigQuery client to generate multiple queries, which can run async by BigQuery?
Or is it possible to still use that single query but increase the egress traffic (database configuration)?
I think it is more suitable to use the dump export.
Running a query on large table is an inefficient job.
I recommend to export Cloud SQL data to a CSV file.
BigQuery can import CSV format file, So you can use this file to create your new bigQuery table.
I'm not sure of how long this job will takes, But at least will not be failed.
Refer here to get more detailed job about export Cloud SQL to CSV dump.
Related
I am trying to see if there are any benefits to running dbt on bigquery as oppose to scheduled queries. Hence, what are the benefits?
If DBT runs a simple query, there is no real advantages. But in that case, I won't use Schedule queries, but Cloud Workflows, or Cloud Scheduler that invoque a BigQuery API call (jobs API, with a query job).
However, if you build a complex query, DBT can help you to create and maintain it, even to create temporary working table to get the final table.
But at the end of the day, DBT is simply a SQL query generator and you can schedule the output with the service that you prefer.
We are currently using Apache sqoop once daily to export an oracle DB table containing a CLOB column into HDFS. As part of this we first map the CLOB column to java string(using --map-column-java) and have the imported data to be saved in the format of parquet. We have this scheduled as an oozie workflow.
There is a plan to move from apache hive to bigquery. I am not able to find a way to get this table into bigquery and would like help on the best approach to get this done.
If we go withreal time streaming from oracle DB into bigquery using google datastream, can you tell me if the clob column will get streamed correctly, as it has some malformed xml data (close to xml structure but might have some discrepancies in obeying the structure).
Another option i read was to have the table extracted as a csv file,and have it transferred to GCS and have the bigquery table refer it there.But since mydata in CLOB column is very large and is wild with multiple commas and special chsracters in between, i think there will be issues with parsing or exporting. Any options to do it in parquet or ORC formats?
The preferred approach is to have a scheduled batch upload performed daily from oracle to bigquery. Appreciate any inputs on how to achieve the same.
We can convert CLOB data from Oracle DB to desired format like ORC, Parquet, TSV, Avro files through Enterprise Flexter.
Also, you can refer to this on how to ingest on-premises Oracle data with Google Cloud Dataflow via JDBC, using the Hybrid Data Pipeline On-Premises Connector.
For your other query moving from apache hive to bigquery-
The fastest way to import to BQ is using GCP resources. Dataflow is a scalable solution to read and write. Dataproc is also another option that is more flexible and you can use more open source stacks to read from the Hive cluster.
You can also use this Dataflow template, which would require a connection to be established directly between the Dataflow workers and the Apache Hive nodes.
There is also a plugin for moving data from Hive into BigQuery which utilises GCS as a temporary storage and uses BigQuery Storage API to move data to BigQuery.
You can also use Cloud SQL to migrate your Hive data to BigQuery.
I am working on a project which crunching data and doing a lot of processing. So I chose to work with BigQuery as it has good support to run analytical queries. However, the final result that is computed is stored in a table that has to power my webpage (used as a Transactional/OLTP). My understanding is, BigQuery is not suitable for transactional queries. I was looking more into other alternatives and I realized I can use DataFlow to do analytical processing and move the data to Cloud SQL (relationalDb fits my purpose).
However, It seems, it's not as straightforward as it seems. First I have to create a pipeline to move the data to the GCP bucket and then move it to Cloud SQL.
Is there a better way to manage it? Can I use "Create Job from SQL" in the dataflow to do it? I haven't found any examples which use "Create Job From SQL" to process and move data to GCP Cloud SQL.
Consider a simple example on Robinhood:
Compute the user's returns by looking at his portfolio and show the graph with the returns for every month.
There are other options, beside pipeline use, but in all cases you cannot export table data to a local file, to Sheets, or to Drive. The only supported export location is Cloud Storage, as stated on the Exporting table data documentation page.
Currently we have a problem with loading data when updating the report data with respect to the DB, since it has too many records and it takes forever to load all the data. The issue is how can I load only the data from the last year to avoid taking so long to load everything. As I see, trying to connect to the COSMO DB in the box allows me to place an SQL query, but I don't know how to do it in this type of non-relational database.
Example
Power BI has an incremental refresh feature. You should be able to refresh the current year only.
If that still doesn’t meet expectations I would look at a preview feature called Azure Synapse Link which automatically pulls all Cosmos DB updates out into analytical storage you can query much faster in Azure Synapse Analytics in order to refresh Power BI faster.
Depending on the volume of the data you will hit a number of issues. First is you may exceed your RU limit, slowing down the extraction of the data from CosmosDB. The second issue will be the transforming of the data from JSON format to a structured format.
I would try to write a query to specify the fields and items that you need. That will reduce the time of processing and getting the data.
For SQL queries it will be some thing like
SELECT * FROM c WHERE c.partitionEntity = 'guid'
For more information on the CosmosDB SQL API syntax please see here to get you started.
You can use the query window in Azure to run the SQL commands, or Azure Storage Explorer to test the query, then move it to Power BI.
What is highly recommended is to extract the data into a place where is can be transformed into a strcutured format like a table or csv file.
For example use Azure Databricks to extract, then turn the JSON format into a table formatted object.
You do have the option of using running Databricks notebook queries in CosmosDB, or Azure DataBricks in its own instance. One other option would to use change feed to send the data and an Azure Function to send and shred the data to Blob Storage and query it from there, using Power BI, DataBricks, Azure SQL Database etc.
In the Source of your Query, you can make a select based on the CosmosDB _ts system property, like:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > 1609455599"
In this case, 1609455599 is the timestamp which corresponds to 31.12.2020, 23:59:59. So, only data from 2021 will be selected.
I want to retrieve data from BigQuery that arrived every hour and do some processing and pull the new calculate variables in a new BigQuery table. The things is that I've never worked with gcp before and I have to for my job now.
I already have my code in python to process the data but it's work only with a "static" dataset
As your source and sink of that are both in BigQuery, I would recommend you to do your transformations inside BigQuery.
If you need a scheduled job that runs in a pre determined time, you can use Scheduled Queries.
With Scheduled Queries you are able to save some query, execute it periodically and save the results to another table.
To create a scheduled query follow the steps:
In BigQuery Console, write your query
After writing the correct query, click in Schedule query and then in Create new scheduled query as you can see in the image below
Pay attention in this two fields:
Schedule options: there are some pre-configured schedules such as daily, monthly, etc.. If you need to execute it every two hours, for example, you can set the Repeat option as Custom and set your Custom schedule as 'every 2 hours'. In the Start date and run time field, select the time and data when your query should start being executed.
Destination for query results: here you can set the dataset and table where your query's results will be saved. Please keep in mind that this option is not available if you use scripting. In other words, you should use only SQL and not scripting in your transformations.
Click on Schedule
After that your query will start being executed according to your schedule and destination table configurations.
According with Google recommendation, when your data are in BigQuery and when you want to transform them to store them in BigQuery, it's always quicker and cheaper to do this in BigQuery if you can express your processing in SQL.
That's why, I don't recommend you dataflow for your use case. If you don't want, or you can't use directly the SQL, you can create User Defined Function (UDF) in BigQuery in Javascript.
EDIT
If you have no information when the data are updated into BigQuery, Dataflow won't help you on this. Dataflow can process realtime data only if these data are present into PubSub. If not, it's not magic!!
Because you haven't the information of when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries is the right solution is you use BigQuery for your processing.