I am importing data from a CSV file in Cloud Storage to BigQuery using Dataflow. I am creating the Dataflow job from the existing "Text Files on Cloud Storage to BigQuery" template. I am the owner of the GCP account, but I still granted the different BigQuery roles mentioned here: https://stackoverflow.com/questions/49640105/bigquery-unable-to-insert-job-workflow-failed
I am getting the following error message:
"Error message from worker: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_textiotobigquerydataflow0releaser01181655135f622d5e_fd7c58dd997c4778989618b7cc9232f7_f4de7979252441e28f37f4de908b70ba_00001_00000, reached max retries: 3, last failed job: null.
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:199)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:152)
org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:376)"
Can anyone please help me with this?
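For reference, granting the roles from Cloud Shell would look roughly like this; MY_PROJECT and PROJECT_NUMBER are placeholders, and this assumes the job runs as the default Compute Engine worker service account:
# grant the Dataflow worker service account permission to create BigQuery load jobs and write data
gcloud projects add-iam-policy-binding MY_PROJECT \
  --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"
gcloud projects add-iam-policy-binding MY_PROJECT \
  --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"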
I'm new to Dataflow. I am sending JSON data to Pub/Sub using a Python script, and I am using the "Pub/Sub to Text Files on Cloud Storage" template in the Dataflow job that I created.
When the job writes to Cloud Storage, it writes to the folder named .temp-beam instead of the bucket path I specified as the output. I know this folder is part of Beam's fault-tolerance mechanism.
In the Dataflow logs, I get the following error:
Error message from worker: java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.io.FileNotFoundException: gs://mybucket/data/.temp-beam/d96626fc549ca4e3-77c2
Example data in Pub/Sub:
{"productFullName": "watch", "productBrand": "Rolex", "productPrice": "1089.00", "productRating": "100", "productRatingCount": "15", "productDealer": "WatchCenter", "dealerRating": "100"}
I tried everything on the permission side.
I've verified from .temp-beam that my data is coming in correctly.
I tried both txt and json as the output file suffix.
The bucket path was set as desired (gs://bucket/data/).
Dataflow SDK version: Apache Beam SDK for Java 2.36.0
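For reference, an equivalent test message can also be published from Cloud Shell like this (my-topic is a placeholder for the actual topic name):
# publish one of the example JSON records to the input topic
gcloud pubsub topics publish my-topic \
  --message='{"productFullName": "watch", "productBrand": "Rolex", "productPrice": "1089.00", "productRating": "100", "productRatingCount": "15", "productDealer": "WatchCenter", "dealerRating": "100"}'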
I am referring to this article on creating a Cloud Dataprep pipeline.
When following the step of importing the data while creating the flow, I am not able to read the data and it says access denied, as shown in the screenshot above.
Reference Link : https://www.trifacta.com/blog/data-quality-monitoring-for-cloud-dataprep-pipelines/
I tried importing the JSON file, and I am expecting the flow to be able to read the table.
I'm having trouble with a job I've set up on Dataflow.
Here is the context: I created a dataset in BigQuery with the following path:
bi-training-gcp:sales.sales_data
In the properties I can see that the data location is "US".
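For what it's worth, the location can also be double-checked from Cloud Shell by inspecting the dataset metadata, which includes a "location" field:
bq show --format=prettyjson bi-training-gcp:sales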
Now I want to run a job on Dataflow, so I enter the following command in the Google Cloud Shell:
gcloud dataflow sql query 'SELECT country, DATE_TRUNC(ORDERDATE, MONTH), sum(sales) FROM bi-training-gcp.sales.sales_data group by 1, 2' \
  --job-name=dataflow-sql-sales-monthly --region=us-east1 \
  --bigquery-dataset=sales --bigquery-table=monthly_sales
The query is accepted by the console and returns a sort of confirmation message.
After that I go to the Dataflow dashboard. I can see a new job queued, but after 5 minutes or so the job fails and I get the following error messages:
Error
2021-09-29T18:06:00.795Z Invalid/unsupported arguments for SQL job launch: Invalid table specification in Data Catalog: Could not resolve table in Data Catalog: bi-training-gcp.sales.sales_data
Error
2021-09-29T18:10:31.592036462Z Error occurred in the launcher container: Template launch failed. See console logs.
My guess is that it cannot find my table, maybe because I specified the wrong location/region. Since my table's location is "US", I thought it would be on a US server (which is why I specified us-east1 as the region), but I tried all US regions with no success...
Does anybody know how I can solve this?
Thank you
This error occurs if the Dataflow service account doesn't have access to the Data Catalog API. To resolve this issue, enable the Data Catalog API in the Google Cloud project that you're using to write and run queries. Alternatively, assign the roles/datacatalog.viewer role to that account.
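A rough sketch of both options from the command line, with PROJECT_ID and PROJECT_NUMBER as placeholders (the account shown is the Dataflow service agent; depending on your setup, the role may instead need to go to the worker service account that runs the job):
# enable the Data Catalog API in the project used to write and run queries
gcloud services enable datacatalog.googleapis.com --project=PROJECT_ID
# grant the Dataflow service agent the Data Catalog viewer role
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:service-PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com" \
  --role="roles/datacatalog.viewer"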
I have created a simple Dataprep workflow (source: a CSV file from GCS, a simple transformation (uppercase conversion), and target: load into BigQuery).
When I run this workflow job in the Dataprep UI, I am getting the following error:
Unable to rename output files from
gs://test//temp/dax-tmp-2021-04-02_09_21_11-16351772716646701863-S07-0-d007cba17fe923f9/tmp-d007cba17fe92b44#DAX.ism to gs://test//temp/tmp-d007cba17fe92b44#.ism.,
I have the below IAM roles:
Dataprep User,
Storage Admin,
Storage Object Admin,
Storage Object Creator,
Viewer
Since I already have admin access in GCS, I am not sure why I am getting the 'Unable to rename output files' error. I am, however, able to modify the files available in GCS using the gsutil command.
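For example, commands like these work when I run them myself (the object name is just a placeholder):
# listing and renaming objects in the temp folder with my own credentials works fine
gsutil ls gs://test/temp/
gsutil mv gs://test/temp/SOME_OBJECT gs://test/temp/SOME_OBJECT_renamed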
Kindly advise whether this is an access issue in Dataprep and how to solve this problem.
DataPrep Logs:
DataFlow Log:
I'm trying to load around 1000 files from Google Cloud Storage into BigQuery using the BigQuery transfer service, but it appears I have an error in one of my files:
Job bqts_601e696e-0000-2ef0-812d-f403043921ec (table streams) failed with error INVALID_ARGUMENT: Error while reading data, error message: CSV table references column position 19, but line starting at position:206 contains only 19 columns.; JobID: 931777629779:bqts_601e696e-0000-2ef0-812d-f403043921ec
How can I find which file is causing this error?
I feel like this is in the docs somewhere, but I can't seem to find it.
Thanks!
You can use bq show --format=prettyjson -j job_id_here, which will show a verbose error about the failed job. You can see more info about the usage of the command in the BigQuery managing jobs docs.
I tried this with a failed job of mine where I'm loading CSV files from a Google Cloud Storage bucket in my project.
Command used:
bq show --format=prettyjson -j bqts_xxxx-xxxx-xxxx-xxxx
Here is a snippet of the output. Output is in JSON format: