I was trying to figure out the key differences between using GCP Vertex AI Feature Store and saving preprocessed features to BigQuery, then loading them whenever necessary.
I still cannot understand why I would choose the first option rather than the second, which seems easier and more accessible.
Is there any good reason to use the Feature Store in Vertex AI rather than storing features in BigQuery tables?
As you mentioned, both Vertex AI Feature Store and BigQuery can be used to store features. However, Vertex AI Feature Store has several advantages over BigQuery that make it favorable for storing features.
Advantages of Vertex AI Feature Store over BigQuery:
Vertex AI Feature Store is designed for creating and managing featurestores, entity types, and features, whereas BigQuery is a data warehouse where you perform analysis on data.
Vertex AI Feature Store supports both batch and online serving of feature values, whereas BigQuery is not designed for low-latency online lookups.
Vertex AI Feature Store can be used to share features across the organization from a central repository, which BigQuery does not provide (see the sketch after this list).
Vertex AI Feature Store is a managed solution for online feature serving, which BigQuery does not support.
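As an illustration of what that central, managed repository looks like in practice, here is a minimal sketch using the google-cloud-aiplatform Python SDK; the project, region, and all resource IDs below are hypothetical placeholders, not anything from your setup.

```python
from google.cloud import aiplatform

# Hypothetical project/region used purely for illustration.
aiplatform.init(project="my-project", location="us-central1")

# Create a featurestore with one online serving node.
fs = aiplatform.Featurestore.create(
    featurestore_id="movies_fs",
    online_store_fixed_node_count=1,
)

# Organize features under an entity type.
movie = fs.create_entity_type(entity_type_id="movie")
movie.create_feature(feature_id="average_rating", value_type="DOUBLE")
```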
For more information, you can check this link.
Related
What is the difference between Feature Store and Dataset inside of Vertex AI (GCP)?
And why does the Feature Store have offline and online serving nodes? What are they for?
As described in the official documentation of Vertex AI's Feature Store, a feature store is a container for organizing, storing, and serving ML features. Basically, it is a more organized container that makes it easy to store and share features with permitted users. I would suggest reading the article linked above.
Online serving nodes are best described here:
"Online serving nodes provide the compute resources used to store and serve feature values for low-latency online serving."
My organization plans to store a set of data in BigQuery and would like to periodically extract some of that data and bring it back to an on-premise database. In reviewing what I've found online about Dataflow, the most common examples involve moving data in the other direction - from an on-premise database into the cloud. Is it possible to use Dataflow to bring data back out of the cloud to our systems? If not, are there other tools that are better suited to this task?
Abstractly, yes. If you've got a set of sources and sinks and you want to move data between them with some set of transformations, then Beam/Dataflow should be perfectly suitable for the task. It sounds like you're discussing a batch-based periodic workflow rather than a continuous streaming workflow.
In terms of implementation effort, there are more questions to consider. Does an appropriate Beam connector exist for your intended on-premise database? You can see the built-in connectors here: https://beam.apache.org/documentation/io/built-in/ (note the per-language SDK toggle at the top of the page).
Do you need custom transformations? Are you combining data from systems other than just BigQuery? Either implies to me that you're on the right track with Beam.
On the other hand, if your extract process is relatively straightforward (e.g. just run a query once a week and extract it), you may find there are simpler solutions, particularly if you're not moving much data and your database can ingest data in one of the BigQuery export formats.
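To make the Beam route concrete, here is a minimal, hedged sketch of a batch pipeline (Python SDK) that reads a query result from BigQuery and writes it to an on-premise database over JDBC. The project, dataset, table, bucket, JDBC URL, and credentials are hypothetical placeholders, and WriteToJdbc is a cross-language transform that needs a Java environment available to the runner.

```python
import typing

import apache_beam as beam
from apache_beam import coders
from apache_beam.io import ReadFromBigQuery
from apache_beam.io.jdbc import WriteToJdbc
from apache_beam.options.pipeline_options import PipelineOptions


# JdbcIO needs schema-aware rows, so declare the row type and register a RowCoder.
class OrderRow(typing.NamedTuple):
    order_id: str
    total: float


coders.registry.register_coder(OrderRow, coders.RowCoder)

options = PipelineOptions(
    temp_location="gs://my-temp-bucket/tmp",  # hypothetical bucket; used by the BigQuery export
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromBQ" >> ReadFromBigQuery(
            query="SELECT order_id, total FROM `my_project.my_dataset.orders`",  # hypothetical table
            use_standard_sql=True)
        | "ToRows" >> beam.Map(
            lambda r: OrderRow(order_id=r["order_id"], total=float(r["total"]))
        ).with_output_types(OrderRow)
        | "WriteToOnPrem" >> WriteToJdbc(                  # cross-language JDBC sink
            table_name="orders",                           # hypothetical on-prem table
            driver_class_name="org.postgresql.Driver",
            jdbc_url="jdbc:postgresql://onprem-host:5432/warehouse",
            username="etl_user",
            password="REPLACE_ME")
    )
```

Scheduled periodically (for example, a Dataflow job triggered on a schedule), a pipeline along these lines gives you the batch "extract back out" flow you describe.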
I am new to BigQuery. I have a table with around 200 columns, and when I wanted to get the DDL of this table there was no ready-made option available. CTAS is not always desirable: sometimes we don't have a reference table to create with CTAS, and sometimes we just want a simple DDL statement to recreate a table.
I also wanted to edit the schema of a BigQuery table by changing a column's mode: the previous mode is NULLABLE and now it needs to be REQUIRED (the column has only been loaded with non-null values so far).
Looking at all these scenarios, the lengthy solutions in the Google documentation, and the lack of a direct solution in terms of SQL statements (only API calls, the UI, scripts, etc.), I am not impressed with BigQuery and its many limitations. The BigQuery web UI is also so small that you need to scroll many times to see a query as a whole, plus many other web UI issues, as you know.
I just wanted to know how you are all handling/coping with BigQuery.
I would like to elaborate a little bit more on @Pentium10's and @guillaume blaquiere's comments.
BigQuery is a serverless, highly scalable data warehouse that comes with a built-in query engine, which is capable of running SQL queries on terabytes of data in a matter of seconds, and petabytes in only minutes. You get this performance without having to manage any infrastructure.
BigQuery is based on Google's column-based data processing technology called Dremel and is able to run queries against up to 20 different data sources and 200 GB of data concurrently. The Prediction API allows users to create and train a model hosted within Google's system; the API recognizes historical patterns to make predictions about patterns in new data.
BigQuery is unlike anything else that has been used as a big data tool; little compares to its speed and the amount of data that can fit into it. Views over the data are possible and recommended, together with basic data visualization tools.
This product typically comes at the end of the big data pipeline. It is not a replacement for existing technologies, but it complements them. Real-time streams representing sensor data, web server logs, or social media graphs can be ingested into BigQuery to be queried in real time. After running ETL jobs on a traditional RDBMS, the resulting data set can be stored in BigQuery. Data can also be ingested from data sets stored in Google Cloud Storage, through direct file import or through streaming.
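For instance, a minimal sketch of that last ingestion path, loading a file from Google Cloud Storage into a table with the google-cloud-bigquery Python client, might look like the following (bucket, dataset, and table names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer the schema from the file
)

# Load a file exported to GCS into a BigQuery table (hypothetical names).
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/events.json",
    "my_project.my_dataset.events",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```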
I recommend having a look at Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale, a book about BigQuery that includes a walkthrough of how to use the service and a deep dive into how it works.
More than that, I found a really interesting article on Medium for data engineers new to BigQuery, where you can find considerations regarding DDL and the UI, as well as best practices.
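On the DDL point specifically, one sketch worth mentioning (assuming a BigQuery release where INFORMATION_SCHEMA.TABLES exposes a ddl column, and using hypothetical project, dataset, and table names) is to pull the generated CREATE TABLE statement with a query from the Python client:

```python
from google.cloud import bigquery

client = bigquery.Client()

# INFORMATION_SCHEMA.TABLES includes a generated `ddl` column per table.
query = """
    SELECT ddl
    FROM `my_project.my_dataset.INFORMATION_SCHEMA.TABLES`
    WHERE table_name = 'my_wide_table'  -- hypothetical 200-column table
"""
for row in client.query(query).result():
    print(row.ddl)  # full CREATE TABLE statement, ready to rerun elsewhere
```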
I hope you find the above pieces of information useful.
The documentation for AWS Neptune does not indicate that it supports the GeoSPARQL extension. Is there any way to support geospatial queries over data we have stored as GeoJSON objects? Perhaps using Gremlin? We are considering Azure Cosmos as an alternative but would rather stay in the AWS ecosystem. Thank you!
Added re: comment for GeoJSON specifics:
As a recent workaround, we have been exploring the use of geohashing and bounding boxes to solve this in a NoSQL store, or going in the original Azure Cosmos direction. The problem with the former is the computational cost of doing this at scale, and then fetching and combining results from the graph with some kind of unique ID mapping. Traversing raw lat/long coordinates via GeoJSON-enabled queries in the graph would instead let us do everything in one pass. Ultimately it enables a dramatically simpler architecture and cleaner DevOps.
As for specifics, mostly just querying items within a radius or Manhattan distance of a given point. We don't currently use polygons but have been considering them for various use cases.
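To make the workaround concrete, a small pure-Python sketch of that pre-filter idea (a bounding box around a point as the coarse filter, followed by an exact haversine radius check; all coordinates and IDs are hypothetical) looks roughly like this:

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def bounding_box(lat, lon, radius_km):
    """Cheap prefilter: a lat/long box that contains the radius around (lat, lon)."""
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    dlon = math.degrees(radius_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)

# Hypothetical points pulled from GeoJSON features: (id, lat, lon).
points = [("a", 40.7128, -74.0060), ("b", 40.7306, -73.9352), ("c", 41.8781, -87.6298)]

center_lat, center_lon, radius_km = 40.72, -74.0, 5.0
min_lat, max_lat, min_lon, max_lon = bounding_box(center_lat, center_lon, radius_km)

hits = [
    pid for pid, lat, lon in points
    if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon        # coarse box filter
    and haversine_km(center_lat, center_lon, lat, lon) <= radius_km   # exact radius check
]
print(hits)  # IDs of points within 5 km of the centre
```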
Can you provide more details on the types of geospatial queries that you'd want to support over the GeoJSON objects?
From the latest AWS Neptune User Guide (PDF) in the section on Blazegraph to Neptune compatibility (dated Dec 2020 at time of writing):
Geospatial search – Blazegraph supports the configuration of namespaces that enable geospatial support. This feature is not yet available in Neptune.
Therefore, the answer appears to be 'no' if using out-of-the-box functionality.
I am an entry-level developer at a startup. I am trying to deploy a text classifier on GCP. For storing inputs (training data) and outputs, I am struggling to find the right storage option.
My data isn't huge in terms of columns but is fairly huge in terms of instances. It could even be just key-value pairs. My use case is to retrieve each entity from just one particular column in the DB, apply some classification to it, store the result in the corresponding column, and update the DB. Our platform requires a DB that can handle a lot of small queries at once without much delay. Also, the data is completely non-relational.
I looked into GCP's article on choosing a storage option but couldn't narrow my options down to any specific answer. I would love to get some advice on this.
You should take a look at Google's "Choosing a Storage Option" guide: https://cloud.google.com/storage-options/
Your data is structured, your main goal is not analytics, your data isn't relational, you don't mostly need mobile SDKs, so you should probably use Cloud Datastore. That's a great choice for durable key-value data.
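As a rough sketch of that key-value pattern with the google-cloud-datastore Python client (the kind name, entity ID, and label below are hypothetical), writing an instance and later updating it with the classifier's output might look like this:

```python
from google.cloud import datastore

client = datastore.Client()

# Store a training instance as a small entity keyed by a hypothetical ID.
key = client.key("Sentence", "doc-0001")
entity = datastore.Entity(key=key)
entity.update({"text": "great product, would buy again", "label": None})
client.put(entity)

# Later: fetch it, classify it, and write the predicted label back.
fetched = client.get(key)
fetched["label"] = "positive"   # result of the (hypothetical) classifier
client.put(fetched)
```

Many small point reads and writes like this are exactly the workload Datastore is designed for, which matches the "lots of small queries without much delay" requirement.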
In brief, these are the storage options currently available; the set may grow or shrink in the future.
Depending on your requirements, you can select the storage option that is best suited.
SOURCE: Linux Academy