Can you import Parquet or Delta file format data sets into Vertex AI Feature Store? - google-cloud-ml

Just wondering whether it is possible to import not just CSV-based data sets into the Vertex AI Feature Store, but Parquet or Delta file formats as well. When trying to import a data set from within GCP, the only options offered are BigQuery and CSV.
I have attached a picture of the options given (screenshot: only CSV and BigQuery are listed, with no Parquet option).
Does anyone know of an API, plug-in, or other methodology by which one can load Parquet or Delta files directly into the Vertex AI Feature Store?
Thank you!

Sorry, but no.
Vertex AI Feature Store can only ingest from BigQuery tables or Cloud Storage files, and the latter must be in either CSV or Avro format.
I am not aware of any planned support for the Parquet format. However, you can load Parquet into BigQuery first and ingest from there, so that may be your best option.
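As a rough sketch of that route (the project, dataset, and bucket names below are placeholders), a Parquet load into BigQuery with the Python client is a single load job, and the Feature Store import can then point at the resulting table:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,  # Parquet is self-describing, so no explicit schema is needed
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/features/*.parquet",        # placeholder GCS path
    "my-project.my_dataset.feature_values",     # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
# The Feature Store import can now use my-project.my_dataset.feature_values as its BigQuery source.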

Where to best store data on Google Cloud?

If my end goal is to run a machine learning model on some CSV data, where should I best store my data file?
In a bucket,
in BigQuery, or
as a dataset under Vertex AI?
It seems that these three options can lead to overlap/redundancies in storage. Is there a practical reason why a basic CSV would have so many options for storage?
If your goal is to train an ML model in Vertex AI, the best way is to store the data in a Vertex AI dataset.
Vertex AI datasets make data discoverable from a central place and provide the ability to annotate and label the data within the UI. You can upload your CSV data into the dataset from wherever it resides, i.e. GCS, BigQuery, or local storage.
Is there a practical reason why a basic CSV would have so many options for storage? It depends on the requirements. If someone only wants to query and visualize the data, they need not create a Vertex AI dataset; they can upload the data directly to BigQuery and get insights.
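For illustration, here is a minimal sketch (the display name, project, region, and GCS path are placeholders) of registering a CSV that already sits in a bucket as a Vertex AI tabular dataset using the Python SDK:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
# Point the dataset at the CSV in GCS; the file itself stays in the bucket.
dataset = aiplatform.TabularDataset.create(
    display_name="my-training-data",
    gcs_source=["gs://my-bucket/data/train.csv"],
)
print(dataset.resource_name)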

How to export an Oracle DB table with complex CLOB data into BigQuery through batch upload?

We are currently using Apache Sqoop once daily to export an Oracle DB table containing a CLOB column into HDFS. As part of this, we first map the CLOB column to a Java string (using --map-column-java) and save the imported data in Parquet format. We have this scheduled as an Oozie workflow.
There is a plan to move from Apache Hive to BigQuery. I am not able to find a way to get this table into BigQuery and would like help on the best approach to get this done.
If we go with real-time streaming from Oracle DB into BigQuery using Google Datastream, can you tell me whether the CLOB column will get streamed correctly, as it has some malformed XML data (close to an XML structure but with some discrepancies in obeying the structure)?
Another option I read about was to extract the table as a CSV file, transfer it to GCS, and have the BigQuery table refer to it there. But since my data in the CLOB column is very large and wild, with multiple commas and special characters in between, I think there will be issues with parsing or exporting. Are there any options to do it in Parquet or ORC formats?
The preferred approach is a scheduled batch upload performed daily from Oracle to BigQuery. Appreciate any inputs on how to achieve this.
We can convert CLOB data from Oracle DB to a desired format such as ORC, Parquet, TSV, or Avro through Enterprise Flexter.
Also, you can refer to this guide on how to ingest on-premises Oracle data with Google Cloud Dataflow via JDBC, using the Hybrid Data Pipeline On-Premises Connector.
For your other query about moving from Apache Hive to BigQuery:
The fastest way to import to BQ is using GCP resources. Dataflow is a scalable solution for reading and writing. Dataproc is another, more flexible option where you can use more open-source stacks to read from the Hive cluster (a sketch of that route follows below).
You can also use this Dataflow template, which requires a connection to be established directly between the Dataflow workers and the Apache Hive nodes.
There is also a plugin for moving data from Hive into BigQuery, which uses GCS as temporary storage and the BigQuery Storage API to move the data into BigQuery.
You can also use Cloud SQL to migrate your Hive data to BigQuery.
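As an illustration of the Dataproc route (the table, project, dataset, and bucket names are placeholders, and the spark-bigquery connector jar must be available on the cluster), a minimal PySpark sketch would be:

from pyspark.sql import SparkSession

# Spark session with Hive support so the existing Hive metastore tables are visible
spark = (
    SparkSession.builder
    .appName("hive-to-bigquery")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the Hive table; the CLOB column is just a string column here, so embedded commas are not an issue
df = spark.sql("SELECT * FROM my_hive_db.clob_table")

# Write to BigQuery via the spark-bigquery connector, staging through a GCS bucket
(
    df.write.format("bigquery")
    .option("table", "my-project.my_dataset.clob_table")
    .option("temporaryGcsBucket", "my-staging-bucket")
    .mode("overwrite")
    .save()
)

Scheduling this as a daily Dataproc job would give you the batch upload you described.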

Import Google Analytics to Redshift

I am trying to figure out how to import Google Analytics data into AWS Redshift. So far I have been able to set up an export job so that the data makes it to Google's BigQuery, and then export the tables to Google's Cloud Storage.
BigQuery stores data in a particular way, so when you export it to a file, it gives you a multilevel nested JSON structure. So, in order to import it into Redshift, I would have to "explode" that JSON into a table or CSV file.
I haven't been able to find a simple solution to do this.
Does anyone know how I can do this in an elegant and efficient way, instead of having to write a long function that will go through the whole JSON object?
Here's Google's documentation about how to export data: https://cloud.google.com/bigquery/docs/exporting-data
You can try the following:
Export your BigQuery data as JSON and transfer it to an S3 bucket
Create a JSONPaths file according to the specification
Include the JSONPaths file in your COPY command to import into Redshift (see the sketch below)
You may also try to export your BigQuery table as Avro (one of the supported export file formats in BigQuery) instead of JSON. This link has an example of how to write the JSONPaths file for nested Avro objects.
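To make the JSONPaths step concrete, here is a minimal sketch (the table name, selected fields, bucket, IAM role, and connection details are all placeholders) of mapping a few nested fields and running the COPY from Python with psycopg2:

import json
import psycopg2

# Hypothetical JSONPaths file: each entry picks one nested field from the exported
# JSON and maps positionally to a column of the target Redshift table.
jsonpaths = {
    "jsonpaths": [
        "$['visitId']",
        "$['device']['browser']",
        "$['totals']['pageviews']",
    ]
}
with open("ga_sessions_jsonpaths.json", "w") as f:
    json.dump(jsonpaths, f)
# Upload ga_sessions_jsonpaths.json to S3 next to the exported JSON data.

copy_sql = """
    COPY ga_sessions
    FROM 's3://my-bucket/ga-export/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS JSON 's3://my-bucket/ga_sessions_jsonpaths.json';
"""
conn = psycopg2.connect("dbname=analytics host=my-cluster.example.com port=5439 user=admin password=<password>")
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)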

AWS GLUE Data Import Issue

There's an Excel file testFile.xlsx; it looks like the table below:
ID ENTITY STATE
1 Montgomery County Muni Utility Dist No.39 TX
2 State of Washington WA
3 Waterloo CUSD 5 IL
4 Staunton CUSD 6 IL
5 Berea City SD OH
6 City of Coshocton OH
Now I want to import the data into the AWS Glue database. A crawler has been created in AWS Glue, but there's nothing in the table in the AWS Glue database after running the crawler. I guess the issue is with the classifier in AWS Glue, but I have no idea how to create a proper classifier to successfully import the data from the Excel file into the AWS Glue database. Thanks for any answers or advice.
I'm afraid Glue crawlers have no classifier for MS Excel files (.xlsx or .xls). Here you can find the list of supported formats and built-in classifiers. It would probably be better to convert the files to CSV or some other supported format before exporting them to the AWS Glue Catalog.
Glue crawlers don't support MS Excel files.
If you want to create a table for the Excel file, you have to convert it first from Excel to CSV/JSON/Parquet and then run the crawler on the newly created file.
You can convert it easily using pandas.
Create a normal Python job and read the Excel file.
import pandas as pd
# Read the sheet as strings so values are preserved as-is, with no column used as the index
df = pd.read_excel('yourFile.xlsx', 'SheetName', dtype=str, index_col=None)
# Write a UTF-8 CSV (without the index column) that a Glue crawler can classify
df.to_csv('yourFile.csv', encoding='utf-8', index=False)
This will convert your file to CSV; then run the crawler over this file and your table will be loaded.
Hope it helps.
When you say that "there's nothing in the table in AWS Glue database after running the crawler" are you saying that in the Glue UI, you are clicking on Databases, then the database name, then on "Tables in xxx", and nothing is showing up?
The second part of your question seems to indicate that you are looking for Glue to import the actual data rows of your file into the Glue database. Is that correct? The Glue database does not store data rows, just the schema information about the files. You will need to use a Glue ETL job, or Athena, or Hive to actually move the data from the data file into something like MySQL.
You should write a script (most likely a Python shell job in Glue) to convert the Excel file to CSV and then run the crawler over it.

Re-parsing Blob data stored in HDFS imported from Oracle by Sqoop

Using Sqoop I've successfully imported a few rows from a table that has a BLOB column. Now the part-m-00000 file contains all the records, along with the BLOB field, as CSV.
Questions:
1) As per the docs, knowledge of the Sqoop-specific format can help in reading those BLOB records. So, what does the Sqoop-specific format mean?
2) Basically the BLOB is a .gz file of a text file containing some float data. These .gz files are stored in the Oracle DB as BLOBs and imported into HDFS using Sqoop. So how can I get back that float data from the HDFS file?
Any sample code would be of great use.
I see these options.
Sqoop import from Oracle directly to a Hive table with a binary data type. This option may limit the processing capabilities outside Hive, like MR, Pig, etc., i.e. you may need knowledge of how the BLOB gets stored in Hive as binary. This is the same limitation that you described in your question 1.
Sqoop import from Oracle to Avro, Sequence, or ORC file formats, which can hold binary data. You should be able to read this by creating a Hive external table on top of it, and you can write a Hive UDF to decompress the binary data. This option is more flexible, as the data can be processed easily with MR as well, especially with the Avro and Sequence file formats.
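As a rough illustration of the second option's decompression step done outside Hive (using fastavro in Python rather than a Hive UDF; the file name and the "payload" column name are placeholders for whatever your Sqoop Avro import actually produces):

import gzip
from fastavro import reader

with open("part-m-00000.avro", "rb") as fh:
    for record in reader(fh):
        blob = record["payload"]                      # bytes column holding the original .gz file
        text = gzip.decompress(blob).decode("utf-8")  # gunzip back to the text payload
        floats = [float(x) for x in text.split()]     # parse the whitespace-separated float data
        print(floats[:5])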
Hope this helps. How did you resolve it?