What is the difference between Feature Store and Dataset inside of Vertex AI (GCP)?
And why does the Feature Store have offline and online serving nodes? What are they for?
As described in the official documentation of Vertex AI's Feature Store, a feature store is a container for organizing, storing, and serving ML features. Basically, it's a more organized container that makes it easy to store features and share them with permitted users. I would suggest reading the article linked above.
Online serving nodes are best described here:
"Online serving nodes provide the compute resources used to store and serve feature values for low-latency online serving."
Related
I was trying to figure out the key differences between using the GCP Vertex AI Feature Store and saving preprocessed features to BigQuery and loading them whenever necessary.
I still cannot understand why I would choose the first option over the second, which seems easier and more accessible.
Is there any good reason to use the Feature Store in Vertex AI rather than storing features in BigQuery tables?
Vertex AI Feature Store and BigQuery can both be used to store features, as you mention. But Vertex AI Feature Store has several advantages over BigQuery that make it preferable for storing features.
Advantages of Vertex AI Feature Store over BigQuery:
Vertex AI Feature Store is designed to create and manage featurestores, entity types, and features, whereas BigQuery is a data warehouse where you can perform analysis on data.
Vertex AI Feature Store can be used for both batch and online storage, but BigQuery is not a solution for online storage.
Vertex AI Feature Store can serve as a central repository for sharing features across the organization, which BigQuery does not provide.
Vertex AI Feature Store is a managed solution for online feature serving, which BigQuery does not support.
For more information, you can check this link.
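To make the batch (offline) versus online distinction a bit more concrete, here is a minimal in-memory sketch. It is not the Vertex AI API: the class `ToyFeatureStore` and its methods are hypothetical names made up for illustration. The idea it tries to show is that the offline side keeps the full timestamped history (so training data can be read "as of" a past time), while the online side keeps only the latest value for fast lookups at prediction time.

```python
from bisect import bisect_right
from collections import defaultdict


class ToyFeatureStore:
    """Illustration only: a hypothetical in-memory stand-in, not the Vertex AI API."""

    def __init__(self):
        # Offline side: full history per (entity_id, feature), kept sorted by timestamp.
        self._history = defaultdict(list)
        # Online side: only the most recent value per (entity_id, feature).
        self._latest = {}

    def ingest(self, entity_id, feature, timestamp, value):
        key = (entity_id, feature)
        rows = self._history[key]
        rows.append((timestamp, value))
        rows.sort(key=lambda row: row[0])
        # The online view always reflects the newest timestamp seen so far.
        self._latest[key] = rows[-1][1]

    def read_online(self, entity_id, feature):
        """Single dictionary lookup, as a serving path would want."""
        return self._latest[(entity_id, feature)]

    def read_offline(self, entity_id, feature, as_of):
        """Point-in-time lookup: the last value written at or before `as_of`."""
        rows = self._history.get((entity_id, feature), [])
        idx = bisect_right([ts for ts, _ in rows], as_of)
        return rows[idx - 1][1] if idx else None


store = ToyFeatureStore()
store.ingest("user_1", "avg_basket", timestamp=10, value=20.0)
store.ingest("user_1", "avg_basket", timestamp=20, value=35.0)
latest = store.read_online("user_1", "avg_basket")
historical = store.read_offline("user_1", "avg_basket", as_of=15)
```

Here `latest` is the value a live prediction request would see, while `historical` is what a training job would see when rebuilding features for an example dated at time 15. With plain BigQuery tables you would have to build and keep both views in sync yourself.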
I have been exploring using Vertex AI for my machine learning workflows. Because deploying different models to the same endpoint utilizing only one node is not possible in Vertex AI, I am considering a workaround. With this workaround, I will be unable to use many Vertex AI features, like model monitoring, feature attribution, etc., and it simply becomes, I think, a managed alternative to running the prediction application on, say, a GKE cluster. So, besides the cost difference, I am exploring whether running the custom prediction container on Vertex AI vs. GKE will involve any limitations, for example, only N1 machine types being available for prediction in Vertex AI.
There is a similar question, but it does not raise the specific questions I hope to have answered.
I am not sure of the available disk space. In Vertex AI, one can specify the machine type, such as n1-standard-2, but I am not sure how much disk space will be available or if/how one can specify it. In the custom container code, I may copy multiple model artifacts or data from outside sources to a local directory before processing them, so understanding any disk space limitations is important.
For custom training in Vertex AI, one can use an interactive shell to inspect the container where the training code is running, as described here. Is something like this possible for a custom prediction container? I have not found anything in the docs.
For custom training, one can use a private IP, as described here. Again, I have not found anything similar for custom prediction in the docs. Is it possible?
If you know of any other possible limitations, please post.
We don't specify a disk size, so it defaults to 100 GB.
I'm not aware of a way to do this right now. But since it's a custom container, you could just run it locally or on GKE for debugging purposes.
Are you looking for this? https://cloud.google.com/vertex-ai/docs/predictions/using-private-endpoints
My organization plans to store a set of data in BigQuery and would like to periodically extract some of that data and bring it back to an on-premise database. In reviewing what I've found online about Dataflow, the most common examples involve moving data in the other direction - from an on-premise database into the cloud. Is it possible to use Dataflow to bring data back out of the cloud to our systems? If not, are there other tools that are better suited to this task?
Abstractly, yes. If you've got a set of sources and sinks and you want to move data between them with some set of transformations, then Beam/Dataflow should be perfectly suitable for the task. It sounds like you're discussing a batch-based periodic workflow rather than a continuous streaming workflow.
In terms of implementation effort, there are more questions to consider. Does an appropriate Beam connector exist for your intended on-premise database? You can see the built-in connectors here: https://beam.apache.org/documentation/io/built-in/ (note the per-language SDK toggle at the top of the page)
Do you need custom transformations? Are you combining data from systems other than just BigQuery? Either implies to me that you're on the right track with Beam.
On the other hand, if your extract process is relatively straightforward (e.g. just run a query once a week and extract it), you may find there are simpler solutions, particularly if you're not moving much data and your database can ingest data in one of the BigQuery export formats.
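As a rough sketch of that simpler path: if the weekly extract is exported as CSV (one of the BigQuery export formats) and copied on-premise, loading it can be a few lines of code. The example below uses Python's built-in `sqlite3` purely as a stand-in for your on-premise database. The function name `load_csv_export`, the table name, and the sample rows are all made up for illustration, and the export and download steps are assumed to have already happened.

```python
import csv
import io
import sqlite3


def load_csv_export(conn, table, csv_file):
    """Load a CSV file with a header row into `table`; returns the row count.

    `table` and the header names are interpolated into SQL, so this sketch
    assumes they come from a trusted source.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    rows = list(reader)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    with conn:  # commit on success, roll back if an insert fails
        conn.executemany(
            f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})', rows
        )
    return len(rows)


# Stand-in for a file exported from BigQuery and copied on-premise.
SAMPLE_EXPORT = "order_id,amount\n1,9.50\n2,12.00\n"

conn = sqlite3.connect(":memory:")
loaded = load_csv_export(conn, "orders", io.StringIO(SAMPLE_EXPORT))
```

A real setup would swap in your database's own driver or bulk loader and add column types, but if something this small covers your needs, a full Beam pipeline may be more than you need.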
The documentation for AWS Neptune does not indicate that it supports the GeoSPARQL extension. Is there any way to support geospatial queries over data that we have stored as GeoJSON objects? Perhaps using Gremlin? We are considering Azure Cosmos as an alternative but would rather stay in the AWS ecosystem. Thank you!
Added re: comment for GeoJSON specifics:
As a recent workaround, we have been exploring geohashing and bounding boxes to solve this in a NoSQL store, or else going in the original Azure Cosmos direction. The problem with the former is the computational time needed at scale, plus having to fetch and combine results from the graph through some type of unique ID mapping. Traversing raw lat/long coordinates via GeoJSON-enabled queries in the graph would instead let us do everything in one pass. Ultimately, it would enable a dramatically simpler architecture and cleaner DevOps.
As for specifics, we mostly query items within a radius or Manhattan distance of a given point. We don't currently use polygons but have been considering them for various use cases.
Can you provide more details on the types of geospatial queries that you'd want to support over the GeoJSON objects?
From the latest AWS Neptune User Guide (PDF) in the section on Blazegraph to Neptune compatibility (dated Dec 2020 at time of writing):
Geospatial search – Blazegraph supports the configuration of namespaces that enable geospatial support. This feature is not yet available in Neptune.
Therefore, the answer appears to be 'no' if using out-of-the-box functionality.
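For the radius case, the bounding-box workaround mentioned in the question can be sketched on the client side like this. It is plain Python, not Neptune or Gremlin code, and the function names are made up. It assumes the items are GeoJSON Point features (coordinates in `[longitude, latitude]` order) with a hypothetical `id` field, and that the search area is not near the poles or the antimeridian. The bounding box is a cheap prefilter, and the haversine check then removes the corners of the box.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bounding_box(lat, lon, radius_km):
    """(min_lat, max_lat, min_lon, max_lon) enclosing the circle.

    Assumes the circle does not touch a pole or cross the antimeridian.
    """
    ang = radius_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(ang)
    dlon = math.degrees(math.asin(math.sin(ang) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def within_radius(features, center_lat, center_lon, radius_km):
    """Ids of GeoJSON Point features within radius_km of the centre."""
    min_lat, max_lat, min_lon, max_lon = bounding_box(center_lat, center_lon, radius_km)
    hits = []
    for feature in features:
        lon, lat = feature["geometry"]["coordinates"][:2]  # GeoJSON order: lon, lat
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            continue  # cheap prefilter
        if haversine_km(center_lat, center_lon, lat, lon) <= radius_km:
            hits.append(feature["id"])
    return hits


def point(feature_id, lon, lat):
    return {"type": "Feature", "id": feature_id,
            "geometry": {"type": "Point", "coordinates": [lon, lat]}}


features = [
    point("paris", 2.3522, 48.8566),
    point("versailles", 2.1204, 48.8049),
    point("london", -0.1278, 51.5074),
]
nearby = within_radius(features, 48.8566, 2.3522, radius_km=50)
```

If latitude and longitude are also stored as plain numeric properties on the vertices, the bounding-box part could in principle become a range filter in the graph query itself, leaving only the exact distance check for the client. That is an assumption about your data model, not a Neptune feature.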
I am an entry-level developer at a startup. I am trying to deploy a text classifier on GCP. For storing inputs (training data) and outputs, I am struggling to find the right storage option.
My data doesn't have many columns but has a very large number of instances. It could even be just key-value pairs. My use case is to retrieve each entity from just one particular column in the DB, apply some classification to it, store the result in the corresponding column, and update the DB. Our platform requires a DB that can handle a lot of small queries at once without much delay. Also, the data is completely non-relational.
I looked into GCP's article of choosing a storage option but couldn't narrow down my options to any specific answer. Would love to get some advice on this.
You should take a look at Google's "Choosing a Storage Option" guide: https://cloud.google.com/storage-options/
Your data is structured, your main goal is not analytics, your data isn't relational, you don't mostly need mobile SDKs, so you should probably use Cloud Datastore. That's a great choice for durable key-value data.
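The chain of questions above can be written down as a small function. Only the path that ends at Cloud Datastore comes from this answer. The other branches, and the function and parameter names, are my recollection of how the guide's flowchart looked at the time, so treat them as assumptions and check the current guide before relying on them.

```python
def suggest_storage(structured, analytics=False, relational=False,
                    mobile_sdk=False, low_latency=False, horizontal_scale=False):
    """Rough encoding of the decision path; not an official mapping."""
    if not structured:
        # Assumption: unstructured data goes to object storage.
        return "Cloud Storage for Firebase" if mobile_sdk else "Cloud Storage"
    if analytics:
        # Assumption: analytics splits on the need for low-latency access.
        return "Cloud Bigtable" if low_latency else "BigQuery"
    if relational:
        # Assumption: relational splits on the need for horizontal scaling.
        return "Cloud Spanner" if horizontal_scale else "Cloud SQL"
    # Path taken in this answer: structured, not analytics, not relational.
    return "Firebase Realtime Database" if mobile_sdk else "Cloud Datastore"


choice = suggest_storage(structured=True, analytics=False,
                         relational=False, mobile_sdk=False)
```

For the workload described in the question, `choice` comes out as Cloud Datastore, matching the recommendation above.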
In brief, these are the storage options available. The list may grow or shrink in the future.
Depending on your needs, you can select the storage option that suits you best.
SOURCE: Linux Academy