Google Cloud Platform Dataflow is not Loading or Down

I am facing a problem where Dataflow does not load even after waiting 30 minutes. How do I complete that lab?
The task is: make a chart in Dataflow by running a query on BigQuery.

I found a second option.
You can also open Dataflow in another tab; restarting it from the fresh tab helps it load correctly. Then make a new chart using the BigQuery option with a query, which completes the task.

Related

How to get the Dataflow template of a Dataprep job?

Good morning everyone. The client I work for is going to deprecate Dataprep in October; we currently do everything on Google Cloud Platform.
Dataprep is a "pretty" layer that runs on top of Dataflow.
Currently, one of the implemented solutions works like this: a file is received in a bucket, and I then execute the Dataprep job with Python.
https://www.trifacta.com/blog/automate-cloud-dataprep-pipeline-data-warehouse/
I need to know how I can obtain the templates of those Dataprep jobs so that, when a file is received in the bucket, I can trigger the Dataflow job corresponding to that Dataprep flow and remove Dataprep from the solution.
https://mbha-phoenix.medium.com/running-cloud-dataprep-jobs-on-cloud-dataflow-for-more-control-37ed84e73cf3
Dataprep previously allowed you to do this, but this option is no longer available.
Screenshot: "Export Result Window"
I would appreciate your help.
Thanks a lot
Cheers
The export option is no longer available in Dataprep. From Dataprep -> Job History you can view the previously executed Dataflow job, but you cannot export it.
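If you do end up with a Dataflow template for the recipe (for example by rebuilding the Dataprep flow as a Beam pipeline and staging it as a template, since the export path above is gone), the bucket-triggered launch described in the question could look like the minimal sketch below. It is not Dataprep-specific: the project, template path, and parameter names are all placeholders, and it uses the Dataflow templates.launch REST method via the google-api-python-client.

    # Minimal sketch: a GCS-triggered Cloud Function that launches an existing
    # Dataflow template when a file lands in the bucket. All names (project,
    # template path, parameters) are placeholders, not values from the question.
    from googleapiclient.discovery import build  # google-api-python-client


    def trigger_dataflow(event, context):
        """Background Cloud Function, triggered by google.storage.object.finalize."""
        input_file = 'gs://{}/{}'.format(event['bucket'], event['name'])

        dataflow = build('dataflow', 'v1b3', cache_discovery=False)
        dataflow.projects().templates().launch(
            projectId='my-project',                          # placeholder project
            gcsPath='gs://my-bucket/templates/my_template',  # placeholder template
            body={
                'jobName': 'replace-dataprep-job',
                'parameters': {'inputFile': input_file},     # hypothetical template parameter
            },
        ).execute()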

Google Cloud Dataprep job failing with error message

I have simple Dataprep jobs that transfer GCS data to BigQuery. Until today, the scheduled jobs were running fine, but today two jobs failed and two jobs succeeded only after taking between half an hour and an hour.
The error message I am getting is below:
java.lang.RuntimeException: Failed to create job with prefix beam_load_clouddataprepcmreportalllobmedia4505510bydataprepadmi_aef678fce2f441eaa9732418fc1a6485_2b57eddf335d0c0b09e3000a805a73d6_00001_00000, reached max retries: 3, last failed job:
I ran the same job again; it again took a very long time and failed, but this time with a different message:
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
Any pointers or direction toward the possible cause would be appreciated, as would links or troubleshooting tips for Dataprep or Dataflow jobs.
Thank you
There are many potential causes for jobs to get stuck: transient issues, a quota or limit being reached, a change in data format or size, or another issue with the resources being used. I suggest starting the troubleshooting from the Dataflow side.
Here are some useful resources that can guide you through the most common job errors, and how to troubleshoot them:
Troubleshooting your pipeline
Dataflow common errors
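As a concrete starting point for the worker logs that the second error message points to, a minimal sketch using the google-cloud-logging client could look like this (the project and job id are placeholders):

    # Minimal sketch: pull the Dataflow worker logs referenced by the
    # "no worker activity" error. Project id and job id are placeholders.
    from google.cloud import logging

    client = logging.Client(project='my-project')  # placeholder project
    log_filter = (
        'resource.type="dataflow_step" '
        'AND resource.labels.job_id="2021-01-01_00_00_00-123456789" '  # placeholder job id
        'AND severity>=WARNING'
    )
    for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
        print(entry.timestamp, entry.severity, entry.payload)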
In addition, you could check the Google Issue Trackers for Dataprep and Dataflow to see if the issue has been reported before:
Issue tracker for Dataprep
Issue tracker for Dataflow
You can also look at the GCP Status Dashboard to rule out a widespread issue with a service:
Google Cloud Status Dashboard
Finally, if you have GCP support, you can reach out to them directly. If you don't, you can use the Issue Tracker to create a new issue for Dataprep and report the behavior you're seeing.

Way to trigger dataflow only after Big Query Job finished

Currently, my data goes through the following steps:
New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load the data into BigQuery.
I need a low-cost solution for knowing when this BigQuery job has finished, so that a Dataflow pipeline is triggered only after the job is completed.
Notes:
I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know if it is a good idea: from what I saw, this trigger uses the job ID, which apparently cannot be fixed, so I would have to redeploy the function every time a job runs. And of course it's an alpha solution.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log entry that indicates the job has finished.
My files are large, so it isn't a good idea to use a Google Cloud Function to wait until the job finishes.
Despite your concerns about Stackdriver Logging, you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can additionally add a dataset filter if needed.
Then create a sink on this advanced filter (for example to a Pub/Sub topic that triggers a Cloud Function) and run your Dataflow job from there.
If this doesn't match your expectations, can you detail why?
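A minimal sketch of the Cloud Function behind that sink, assuming the sink's destination is a Pub/Sub topic that triggers the function; the payload path mirrors the filter above, and the actual Dataflow launch is left as a comment:

    # Minimal sketch: Pub/Sub-triggered Cloud Function that fires when the
    # BigQuery job-completed log entry is routed through the sink.
    import base64
    import json


    def on_bq_job_done(event, context):
        """Runs when the log sink publishes a jobCompletedEvent entry."""
        entry = json.loads(base64.b64decode(event['data']).decode('utf-8'))
        job = entry['protoPayload']['serviceData']['jobCompletedEvent']['job']

        # Defensive check; the sink filter already restricts entries to state DONE.
        if job['jobStatus']['state'] != 'DONE':
            return

        # Assumption: the completed job was a load job with a destination table.
        table = job['jobConfiguration']['load']['destinationTable']
        print('BigQuery load finished for table:', table)
        # ...launch the Dataflow pipeline/template here...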
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs sequentially. You define a DAG, and Composer executes each node of the DAG, checking dependencies to ensure that tasks run in parallel or sequentially based on the conditions you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
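For illustration, here is a minimal Composer/Airflow sketch of that ordering (load into BigQuery, then start Dataflow only after the load succeeds); the operator names come from the apache-airflow-providers-google package, and every table, bucket, and template path is a placeholder:

    # Minimal sketch of a Composer DAG: BigQuery load first, Dataflow second.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
    from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

    with DAG('bq_then_dataflow', start_date=datetime(2024, 1, 1),
             schedule_interval=None, catchup=False) as dag:

        load_to_bq = BigQueryInsertJobOperator(
            task_id='load_to_bq',
            configuration={'load': {
                'sourceUris': ['gs://my-bucket/incoming/*.csv'],   # placeholder
                'destinationTable': {'projectId': 'my-project',    # placeholder
                                     'datasetId': 'my_dataset',
                                     'tableId': 'my_table'},
                'sourceFormat': 'CSV',
            }},
        )

        start_dataflow = DataflowTemplatedJobStartOperator(
            task_id='start_dataflow',
            template='gs://my-bucket/templates/my_template',       # placeholder
            location='us-central1',
        )

        # Dataflow only starts after the BigQuery load task succeeds.
        load_to_bq >> start_dataflow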

How to schedule a query (Export Data) from Google Big Query to External Storage space (Eg: Box)

I have read many articles and solutions about scheduling BigQuery queries so that the results are exported to external storage, but they didn't seem very clear.
Note: my company has a subscription only to Google BigQuery and not to the complete set of cloud services (Google Cloud Platform).
I know how to do it manually, but I am looking to automate the process since I need the same data every week.
Any suggestions will be appreciated. Thank you.
Option 1
You can use Apache Airflow, which provides the option to create scheduled tasks on top of BigQuery using the BigQuery operator.
You can find the basic steps required to start setting this up in this link.
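For example, the scheduled export itself could be a single Airflow task; this is a minimal sketch assuming the apache-airflow-providers-google package, with placeholder table and bucket names:

    # Minimal sketch: a weekly Airflow task that exports a BigQuery table to GCS.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

    with DAG('weekly_bq_export', start_date=datetime(2024, 1, 1),
             schedule_interval='@weekly', catchup=False) as dag:

        export_table = BigQueryToGCSOperator(
            task_id='export_table',
            source_project_dataset_table='my-project.my_dataset.my_table',        # placeholder
            destination_cloud_storage_uris=['gs://my-bucket/exports/my_table-*.csv'],
            export_format='CSV',
            print_header=True,
        )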
Option 2
You can use the BigQuery command-line tool to export your data just as you would from the web UI, for example:
bq --location=[LOCATION] extract --destination_format [FORMAT] --compression [COMPRESSION_TYPE] --field_delimiter [DELIMITER] --print_header [BOOLEAN] [PROJECT_ID]:[DATASET].[TABLE] gs://[BUCKET]/[FILENAME]
Once you get this working, you can use any scheduling process of your liking to run this job.
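The same export can also be driven from Python with the google-cloud-bigquery client, which may be easier to drop into whatever scheduler you choose; a minimal sketch with placeholder names:

    # Minimal sketch: the equivalent of the bq extract command using the
    # google-cloud-bigquery client. Project, table, and bucket are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project='my-project')       # placeholder project

    job_config = bigquery.ExtractJobConfig(
        destination_format='CSV',
        compression='GZIP',
        field_delimiter=',',
        print_header=True,
    )

    extract_job = client.extract_table(
        'my-project.my_dataset.my_table',                # placeholder table
        'gs://my-bucket/exports/my_table-*.csv.gz',      # placeholder bucket
        location='US',
        job_config=job_config,
    )
    extract_job.result()  # block until the export job finishes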
BTW, Airflow has a connector which enables you to run the command-line tool.
Once the file is in GCS, you can use the Box G Suite integration to see and manage your files.

Google App Engine Parse Logs in DataStore Save to Table

I am new to GAE and I am trying to quickly find a way to retrieve logs from Datastore, clean them to my specs, and then save them to a table to be called on later for a reports view in my app. I was thinking of using Google Cloud Dataflow and creating batch jobs (the app is Python/Django), but the documentation does not seem to fit my use case, so maybe Dataflow is not the answer. I could create a Python script with BigQuery and schedule it through cron, but then I would have to contend with errors, and it seems there should be a faster way to solve this problem.
Any help/thoughts/suggestions are always greatly appreciated.
You can use the Dataflow/Beam Python SDK to develop a pipeline that reads entities from Datastore [1], transforms the data, and writes a table to BigQuery [2]. To schedule this job to run regularly, you'll have to use a third-party mechanism such as a cron job. Note that Dataflow performs automatic scaling and retries to handle errors, so you are not expected to address these complexities manually.
[1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/datastore/v1/datastoreio.py
[2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
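A minimal sketch of such a pipeline, assuming Beam's v1new Datastore connector and a hypothetical "LogEntry" kind; the module paths, field names, and BigQuery table are placeholders you would adapt to your Beam version and schema:

    # Minimal sketch: read Datastore entities, clean them, write to BigQuery.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
    from apache_beam.io.gcp.datastore.v1new.types import Query


    def clean_entity(entity):
        """Flatten a Datastore entity into a dict matching the BigQuery schema."""
        props = entity.properties  # assumption: v1new entities expose a properties dict
        return {'timestamp': str(props.get('timestamp')),
                'message': props.get('message')}


    def run():
        options = PipelineOptions(
            project='my-project',                   # placeholder project
            runner='DataflowRunner',
            temp_location='gs://my-bucket/tmp',     # placeholder bucket
            region='us-central1',
        )
        with beam.Pipeline(options=options) as p:
            (p
             | 'ReadLogs' >> ReadFromDatastore(Query(kind='LogEntry', project='my-project'))
             | 'Clean' >> beam.Map(clean_entity)
             | 'WriteToBQ' >> beam.io.WriteToBigQuery(
                 'my-project:reports.cleaned_logs',  # placeholder table
                 schema='timestamp:STRING,message:STRING',
                 create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                 write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))


    if __name__ == '__main__':
        run()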