Google Cloud Dataprep - error when importing dataset from BigQuery - google-cloud-platform

When I try to import BigQuery tables as datasets in my Dataprep flow, I get the following error:
Could not create dataset: I/o error.
I imported many BigQuery tables from the same BigQuery dataset. All of them imported successfully except this one, which has a very large number of columns (more than 2700!).
Maybe the large number of columns is the cause, but I can't find any such limitation in the docs.
When I select the table, I get the message "Preview not available", like this:
and after clicking "import":
Does anyone have any idea why this is happening or has any suggestion?

The Dataprep documentation doesn't list a maximum number of columns, so that is unlikely to be the problem.
On the other hand, I have seen generic messages like 'I/O error', or simply red icons, when importing data, and they were related to data type conversion between Dataprep and BigQuery.
I think that finding the BigQuery column types that are not compatible with Dataprep and converting them to compatible ones should solve your issue.
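With 2700+ columns, checking types by hand is impractical, so here is a minimal Python sketch of the idea. It builds a query against BigQuery's `INFORMATION_SCHEMA.COLUMNS` view and flags columns whose type is outside an allow-list. The allow-list `ASSUMED_SUPPORTED_TYPES` and the function names are assumptions made for illustration; the set of types Dataprep actually handles should be taken from its documentation.

```python
# Assumption: this set is illustrative only, not Dataprep's real list.
ASSUMED_SUPPORTED_TYPES = {"STRING", "INT64", "FLOAT64", "BOOL", "TIMESTAMP", "DATE"}


def columns_query(project: str, dataset: str, table: str) -> str:
    """Build a query listing column names and types for one table."""
    return (
        "SELECT column_name, data_type "
        f"FROM `{project}.{dataset}.INFORMATION_SCHEMA.COLUMNS` "
        f"WHERE table_name = '{table}'"
    )


def suspicious_columns(schema, supported=ASSUMED_SUPPORTED_TYPES):
    """Return (name, type) pairs whose base type is not in `supported`.

    `schema` is an iterable of (column_name, data_type) pairs, for example
    the rows returned by running the query above.
    """
    flagged = []
    for name, data_type in schema:
        # Reduce NUMERIC(10, 2) or ARRAY<STRING> to the base type name.
        base = data_type.split("(")[0].split("<")[0].strip().upper()
        if base not in supported:
            flagged.append((name, data_type))
    return flagged
```

The flagged columns are the candidates to cast (or exclude through a view) before importing the table again.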

Related

BigQuery returned Premature EOF

I have a table in BigQuery with around 32 million rows and 85 columns, about 22 GB in size. (It is a physical table, not a view or temporary table.) I am using this table as a data source in Domo. This is the Domo connector I am using: https://www.domo.com/appstore/connector/google-bigquery-service-connector/overview .
I am getting the "Domo is ready, but BigQuery returned Premature EOF. Please try again later." error in Domo, and Domo gives no further details about it. At the suggestion of Domo support I also tried the "allow large results" option, but the error is the same. When I run a query on the same dataset in the BigQuery editor, it works fine and completes in 17 seconds.
Are there other options I should try?
I don't think the volume of data is the issue, because yesterday I got the same error on another Domo dataset that connects to a BigQuery table with only 0.7 million rows.
Are there any bigquery limits that are causing this error?

How to know how many query jobs are submitted

I am working on a pipeline that takes data and partitions it. I am trying to load the data into a BigQuery table on GCP, but I get: Too many partitions produced by query, allowed 4000, query produces at least 10000 partitions. I understand that this is a BigQuery limitation, and I have found several proposed solutions, such as clustering the data or partitioning by week instead of by day. The problem is that I have no visibility into the data itself, so I cannot do this. If there are any other ideas, please help.
Also, for the sake of investigation and analysis, how can I find out how many BigQuery jobs are submitted? Is there a way to get the number of BigQuery jobs submitted by a specific Dataflow job?
Thanks
You can view the jobs created by a particular Dataflow job by navigating to the Google Cloud Console and clicking through to the Dataflow Job UI. Here is the relevant documentation with screenshots.
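On the partition limit mentioned in the question: if a sample of the partitioning column's values can be obtained at all, a small Python sketch like the one below shows how many partitions each granularity would produce. It uses month and year instead of the week granularity suggested in the question, since those are time units BigQuery partitioning offers alongside day. The function names are hypothetical, and the limit of 4000 is simply the number quoted in the error message.

```python
from datetime import date
from typing import Iterable, Optional

PARTITION_LIMIT = 4000  # value quoted in the error message


def partition_counts(dates: Iterable[date]) -> dict:
    """Count the distinct partitions the values produce per granularity."""
    dates = list(dates)
    return {
        "day": len(set(dates)),
        "month": len({(d.year, d.month) for d in dates}),
        "year": len({d.year for d in dates}),
    }


def finest_fitting(dates: Iterable[date], limit: int = PARTITION_LIMIT) -> Optional[str]:
    """Return the finest granularity that stays within `limit`, else None."""
    counts = partition_counts(dates)
    for granularity in ("day", "month", "year"):
        if counts[granularity] <= limit:
            return granularity
    return None
```

For 10000 distinct days, daily partitioning exceeds the limit while monthly partitioning needs only a few hundred partitions.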

PowerBi - Connection Type (DIRECT QUERY or IMPORT DATA) Question

I am working on a PowerBi project and I need some advice/questions on the best way to approach this project. I am tasked to create a dashboard for employee metrics pulled from an onsite SQL Server database. The managers here are going to have access to the PowerBi cloud, so I will end up uploading this to the cloud. There are 10 or so metrics that need to be shown on the dashboard. We have 5000+ employees. My first thought was to create a table and dump all the metrics into a table and set the PowerBi report to import the data, but that seems excessive and a waste of space to upload all that data to the CLOUD because all of the managers don't need access to every employee. They may want to see 1 or 2 employees' metrics on the dashboard.
My second thought is to (and if this is possible) create a stored procedure that will take the employee id and output a dataset for PowerBi to create a visual for. On the dashboard, have a list of employees and when a manager selects one, PowerBi will call the stored procedure with the employee id and the dataset will be returned for PowerBi to decipher into a visual based on my measurements. I guess I would set the PowerBi report connection type as DIRECT QUERY?
Here are my questions:
Is my second plan possible? Is this how DIRECT QUERY works?
If so, how does DIRECT QUERY work with the PowerBi cloud?
What is setup like? Do I just install the PowerBi Data Gateway/configure it like IMPORT DATA and PowerBi does the rest?
A couple of questions first:
How frequently is the data updated?
If it is a batch job, it is preferable to import the data from the source into the Power BI model and report on the imported data, because:
a) Performance would be better
b) There would be no to and fro of data between the on-prem database and the cloud
c) The source would not be impacted constantly
Is the requirement to have RLS, so that managers see only the employees under them?
If so, RLS is much easier to implement in the imported version than with Direct Query.
Also, you won't be able to pass parameters to stored procedures, and you can't execute them in Direct Query mode. You can, however, create table-valued functions, which let you use table variables and perform other, more complex operations in Direct Query mode.
You can refer to this for additional details:
https://community.powerbi.com/t5/Desktop/Can-i-call-Stored-Procedure-with-Direct-Query/m-p/267141#:~:text=%40Pallavi%20you%20won't%20be,nature%20in%20Direct%20Query%20mode.

Spanner to CSV DataFlow

I am trying to copy a table from Spanner to BigQuery. I have created two Dataflow jobs: one copies from Spanner to a text file, and the other imports the text file into BigQuery.
The table has a column whose value is a JSON string. The issue appears when the Dataflow job that imports the text file into BigQuery runs. The job throws the error below:
INVALD JSON: :1:38 Expected eof but found, "602...
Is there any way I can exclude this column while copying, or any way I can copy the JSON object as it is? I tried excluding this column in the schema file, but it did not help.
Thank you!
Looking at https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#cloud-spanner-to-cloud-storage-text, there are no options on BigQuery import jobs that would allow skipping columns, nor Cloud Spanner options that would skip a column when extracting.
I think your best shot is to write a custom processor that will drop the column, similar to "Cleaning data in CSV files using dataflow".
It's more complicated, but you can also try Dataprep: http://cloud/dataprep/docs/html/Drop-Transform_57344635. It should be possible to run Dataprep jobs as a Dataflow template.
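The custom-processor idea can be sketched in Python as a per-line function that parses the CSV record, removes one field, and writes the record back out. This assumes the intermediate text file uses standard CSV quoting, so that the commas inside the JSON string are quoted; `drop_column` is a hypothetical helper, not part of any Dataflow template.

```python
import csv
import io


def drop_column(line: str, index: int) -> str:
    """Parse one CSV line, remove the field at `index`, and re-serialize it."""
    # csv.reader handles quoted fields, so commas inside the JSON string
    # do not split the field.
    row = next(csv.reader([line]))
    del row[index]
    out = io.StringIO()
    # An empty lineterminator keeps the result a single line without a newline.
    csv.writer(out, lineterminator="").writerow(row)
    return out.getvalue()
```

In a Beam pipeline this could be applied to each line read from the text file, for example with `beam.Map(drop_column, index=1)`, before the BigQuery import step. The index of the JSON column is something you would need to fill in for your table.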

"No data" message in Google Data Studio chart after connecting dataset from BigQuery?

I am trying to connect and visualise aggregation of metrics from a wildcard table in BigQuery. This is the first time I am connecting a table from this particular Google Cloud project to Data Studio. Prior to this, I have successfully connected and visualised metrics from other BigQuery tables from other Google Cloud projects in Google Data Studio and never encountered this issue. Any ideas? Could this be something to do with project-level permissions for Google Data Studio to access a BigQuery table for the first time?
More details of this instance: the dataset itself seems to be successfully connected to Data Studio, and no errors were encountered. After adding some charts connected to that data source and aggregating metrics, no other Data Studio error messages appeared, just the words "No data" displayed in the chart. Could this also be a formatting issue in the BigQuery table itself? The BigQuery table in question was created via pandas-gbq in a loop to split the original dataset into individual daily _YYYYMMDD tables. However, this has been done before and never presented a problem.
I struggled with the same problem for a while and eventually found out that, at least in my case, it is related to the date I use in the suffix (_YYYYMMDD). If I use "today" in the suffix, Data Studio won't recognize it and displays "No data", but if I change it to "yesterday" (a day earlier), it displays the data correctly. I think it is probably related to time zones: "today" here has not yet started in the US, so the system can't show it. Hopefully it helps.
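To illustrate the time zone explanation, here is a small Python sketch that computes the _YYYYMMDD suffix for the same instant at different fixed UTC offsets. The helper `table_suffix` and the offsets used are assumptions for illustration; which time zone Data Studio actually uses when resolving the suffix is not confirmed here.

```python
from datetime import datetime, timedelta, timezone


def table_suffix(instant: datetime, utc_offset_hours: float) -> str:
    """Return the YYYYMMDD suffix for `instant` as seen at a fixed UTC offset.

    `instant` should be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    return instant.astimezone(tz).strftime("%Y%m%d")


# 03:00 on 2 March 2021 at UTC+9 is still 1 March at UTC and at UTC-8.
instant = datetime(2021, 3, 2, 3, 0, tzinfo=timezone(timedelta(hours=9)))
print(table_suffix(instant, 9))   # 20210302
print(table_suffix(instant, 0))   # 20210301
print(table_suffix(instant, -8))  # 20210301
```

If the loop that creates the daily tables names them from a local clock that is ahead of the clock on the reading side, the newest table's date would appear to lie in the future there, which is consistent with the behaviour described above.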