Not able to upload Dataset into AutoML Natural Language text classification GUI - google-cloud-platform

I'm trying to perform custom text classification by using AutoML on Google Cloud Platform. I am using the official google documentation to help me get started. The link to the blog is https://cloud.google.com/blog/products/ai-machine-learning/no-deep-learning-experience-needed-build-a-text-classification-model-with-google-cloud-automl-natural-language
The blog uses the 20 Newsgroups dataset. After preparing the dataset and following the instructions given here, I get an error while uploading the dataset into the GCP AutoML Text Classification GUI.
I have also tried to upload a CSV file with just one data entry, but that doesn't work either.
Every time I try to upload the dataset I get the following error:
ERROR CODES:
4
Last error message
CSV file is empty

This looks like a CSV issue. If you use the CSV file provided in the quickstart, it will work.
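For reference, a minimal sketch (the file name and sample rows are made up) of writing an import CSV in the layout the quickstart uses: one text/label pair per line, with no header row.

```python
import csv

# Sketch only: file name and sample rows are assumptions.
# AutoML Natural Language classification expects one row per document:
# the text content (or a gs:// URI) followed by its label, no header line.
rows = [
    ("The team won the championship last night", "rec.sport.hockey"),
    ("New GPU drivers were released this week", "comp.graphics"),
]

with open("automl_import.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for text, label in rows:
        writer.writerow([text, label])  # csv.writer quotes commas in the text
```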

Related

Why doesn't my CSV with JSONL have labels for training AutoML?

I have a problem with a CSV that I create using Python. I'm following the Google Cloud example on how to create JSONL from my PDF files. The big problem is that I can build the CSV with JSONL URLs for my PDF files, but the JSONL doesn't have any labels. I'm using this file ->
input_helper_v2.py to create my CSV with JSONL. After that, when I upload the CSV to Google Cloud to train my AutoML model, I get this error:
I tried to add some labels to the CSV myself, but that doesn't work. Maybe this isn't the correct way to do it, but I can't find any solutions.
This is an example of my CSV without labels:
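Purely as an illustration (the input/output file names and the label value are assumptions), appending a label column to each row of such a CSV could look like this:

```python
import csv

# Sketch only: file names and the label value are assumptions.
# Reads a CSV whose rows contain only a JSONL URI and writes a copy
# with a label appended to every row.
with open("jsonl_uris.csv", newline="", encoding="utf-8") as src, \
     open("jsonl_uris_labeled.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        writer.writerow(row + ["invoice"])  # hypothetical label
```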

Illegal Characters in Parquet file

I recently got data from Google Analytics (GA) and wanted to store it in AWS as a Parquet file.
When I tried to preview the file in the web UI, I got an error.
It took me a while to realise that the column "pagePath" coming from GA is the cause, as I am able to preview the data once I remove that column.
I can't share any data, but are there any "illegal" characters that lead to such failures?
I have >10k unique page paths and I can't figure out what the problem is.
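One way to narrow this down, as a rough sketch (the file and column names are assumptions), is to scan the pagePath values for control characters or values that cannot be encoded as UTF-8 before writing the Parquet file:

```python
import pandas as pd

# Sketch only: file name and column name are assumptions.
df = pd.read_csv("ga_export.csv")

for i, value in df["pagePath"].dropna().items():
    text = str(value)
    # Control characters other than tab/newline often break downstream tools.
    if any(ord(ch) < 0x20 and ch not in "\t\n\r" for ch in text):
        print(f"row {i}: control character in {text!r}")
    # Lone surrogates and similar artefacts cannot be encoded as UTF-8.
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        print(f"row {i}: not valid UTF-8: {text!r}")
```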

fileOffset empty in Data Loss Prevention results for PDF and DOCX files

I configured a DLP inspection job using GCP Console to scan PDF and DOCX files. It is working as expected, finding the expected entities and saving results to a BigQuery table.
According to the docs, DLP uses Intelligent Document Parsing for PDF and DOCX. This should give me additional location details in a DocumentLocation object.
I can see column location.content_locations.document_location.file_offset in the BigQuery table, but it is empty.
I am getting location.byte_range values for TXT files and location.content_locations.image_location.bounding_boxes for images, but no location information for documents.
What could be causing this issue?
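As a minimal sketch (the project, dataset and table names are assumptions), the location fields can be inspected directly with the BigQuery client to confirm what the findings rows actually contain:

```python
from google.cloud import bigquery

# Sketch only: project, dataset and table names are assumptions.
client = bigquery.Client()
query = """
SELECT location.content_locations
FROM `my-project.dlp_results.findings`
LIMIT 10
"""
for row in client.query(query).result():
    # Per the docs, PDF/DOCX findings should carry a
    # document_location.file_offset inside each content location.
    print(row.content_locations)
```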

Google Analytics DataSet Type (CRMint)

I am a novice with Google Analytics and I am using a tool called CRMint to import a custom audience into Google Analytics. A data scientist is using a model to predict whether a user is more likely to buy a product than another. Right now, I have a CSV file containing two columns: fullVisitorId and predictions.
In CRMint, I am using a job called "GaDataImporter" to import that CSV file into Google Analytics. As you can see in the picture below, I need to provide a GA Dataset ID.
I am currently trying to create a new dataset from my Google Analytics dashboard, but I am not sure about the dataset type and the import behavior. Does anyone have suggestions?
fullVisitorId is not an available dimension in Google Analytics (it is found only in BigQuery), so you cannot use it to link information to users in Google Analytics.
Instead, you should use the clientId passed to Analytics as a custom dimension, and then use that as the key when importing the data as Custom Data.
(If you are new to Google Analytics this is not something that can be fully explained in a post, but the process described above is what you need.)
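For illustration, a rough sketch (the file names and dimension indexes are assumptions) of reshaping the predictions file into the header/key layout a Custom Data import expects, assuming the rows have already been re-keyed from fullVisitorId to clientId:

```python
import csv

# Sketch only: file names and dimension indexes are assumptions.
# Assumes clientId is collected as ga:dimension1 and the prediction is
# imported into ga:dimension2.
with open("predictions_by_clientid.csv", newline="", encoding="utf-8") as src, \
     open("ga_custom_data_import.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)  # columns: clientId, prediction
    writer = csv.writer(dst)
    writer.writerow(["ga:dimension1", "ga:dimension2"])  # key, imported value
    for row in reader:
        writer.writerow([row["clientId"], row["prediction"]])
```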

Big Query can't query some csvs in Cloud Storage bucket

I created a permanent BigQuery table that reads some CSV files from a Cloud Storage bucket sharing the same prefix (filename*.csv) and the same schema.
Some of the CSVs, however, make BigQuery queries fail with a message like the following: "Error while reading table: xxxx.xxxx.xxx, error message: CSV table references column position 5, but line starting at position:10 contains only 2 columns."
By moving the CSVs out of the bucket one by one, I identified the file responsible.
This CSV file doesn't have 10 lines...
I found the ticket "BigQuery error when loading csv file from Google Cloud Storage", so I thought the issue was an empty line at the end. But other CSVs in my bucket have one too, so that can't be the reason.
On the other hand, this CSV is the only one with content type text/csv; charset=utf-8; all the others are text/csv, application/vnd.ms-excel, or application/octet-stream.
Furthermore, after downloading this CSV to my local Windows machine and uploading it again to Cloud Storage, its content type is automatically converted to application/vnd.ms-excel.
Then, even with the missing line, BigQuery can query the permanent table based on filename*.csv.
Is it possible that BigQuery has issues querying CSVs with UTF-8 encoding, or is it just a coincidence?
Use Google Cloud Dataprep to load your CSV file. Once the file is loaded, analyze the data and clean it if required.
Once all the rows are cleaned, you can sink that data into BigQuery.
Dataprep is a GUI-based ETL tool and it runs a Dataflow job internally.
Let me know if any more clarification is required.
To sum up the issue: the CSV file had gzip set as its content encoding, which is why BigQuery did not interpret it as a CSV file.
According to the documentation, BigQuery expects CSV data to be UTF-8 encoded:
"encoding": "UTF-8"
In addition, since this issue is related to the metadata of the files in GCS, you can edit the metadata directly from the Console.
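Alternatively, a minimal sketch (the bucket and object names are assumptions) of fixing the metadata with the Cloud Storage client library:

```python
from google.cloud import storage

# Sketch only: bucket and object names are assumptions.
client = storage.Client()
blob = client.bucket("my-bucket").get_blob("filename_bad.csv")

# Clear the gzip content encoding and set a plain CSV content type,
# so BigQuery reads the object as uncompressed CSV.
blob.content_encoding = None
blob.content_type = "text/csv"
blob.patch()
```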