If my end goal is to run a machine learning model on some CSV data, where should I best store my data file?
In a bucket,
in BigQuery, or
as a dataset under Vertex AI?
It seems that these three options can lead to overlap/redundancies in storage. Is there a practical reason why a basic CSV would have so many options for storage?
If your goal is to train an ML model in Vertex AI, the best place to store the data is a Vertex AI dataset.
Vertex AI datasets make data discoverable from a central place and let you annotate and label the data within the UI. You can import your CSV data into the dataset from wherever it currently resides, i.e. GCS, BigQuery, or local storage.
Is there a practical reason why a basic CSV would have so many options for storage? It depends on each person's requirements. Someone who only wants to query and visualize the data does not need to create a Vertex AI dataset; they can load the data directly into BigQuery and get insights there.
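As a rough sketch of the dataset route, assuming the `google-cloud-aiplatform` Python client is installed and authenticated: the helper names `source_kwargs` and `create_tabular_dataset` are hypothetical, and the project, location, and URIs are placeholders you would supply yourself. It only illustrates that the same CSV can be registered from either GCS or BigQuery.

```python
def source_kwargs(uri: str) -> dict:
    """Map a data location to the matching TabularDataset.create argument.

    Hypothetical helper: gs:// URIs go to gcs_source, bq:// URIs to bq_source.
    """
    if uri.startswith("gs://"):
        return {"gcs_source": [uri]}
    if uri.startswith("bq://"):
        return {"bq_source": uri}
    raise ValueError(f"Expected a gs:// or bq:// URI, got {uri!r}")


def create_tabular_dataset(display_name: str, uri: str, project: str, location: str):
    """Register a CSV (in GCS) or a table (in BigQuery) as a Vertex AI dataset."""
    # Imported lazily so source_kwargs can be used without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    return aiplatform.TabularDataset.create(
        display_name=display_name, **source_kwargs(uri)
    )
```

Either way the dataset only points at the underlying source, so choosing between a bucket and BigQuery mostly comes down to what else you want to do with the data.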
Using CSV upload in Apache Superset works as expected. I can use it to add data from a CSV to a database, e.g. Postgres. Now I want to append data from a different CSV to this table/dataset. But how?
The CSVs all have the same format. But there is a new one for every day. In the end I want to have a dashboard which updates every day, taking the new data into account.
Generally, I agree with Ana that if you want to repeatedly upload new CSV data then you're better off operationalizing this into some type of process, pipeline, etc that runs on a schedule.
But if you need to stick with the uploading CSV route through the Superset UI, then you can set the Table Exists field to Append instead of Replace.
You can find a helpful GIF in the Preset docs: https://docs.preset.io/docs/tips-tricks#append-csv-to-a-database
Probably you'll be better served by creating a simple process to load the CSV to a table in the database and then querying that table in Superset.
Superset is a tool for visualizing data. It allows uploading a CSV for quick and dirty "only once" kinds of charts, but if this is going to be a recurrent, structured, periodic load of data, it's better to load the data with a dedicated integration tool. There are zillions of ETL (Extract-Transform-Load) tools out there (or scripts that do the same job). Ask whether your company is already using one, or choose the one that is simplest for you.
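To make the "simple process" idea concrete, here is a minimal sketch of a daily append step. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; with Postgres you would use a Postgres driver and its own bulk-load mechanism instead. The function name `append_csv` and the table and column names are made up for illustration, and the sketch assumes every CSV has the same header.

```python
import csv
import io
import sqlite3


def append_csv(conn, table, csv_file):
    """Append every data row of csv_file to table, creating it on first load.

    The table and column names are interpolated into SQL, so they must come
    from a trusted source (here: your own daily export).
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    cursor = conn.executemany(
        f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return cursor.rowcount


# Two "daily" files with the same format end up in one table.
conn = sqlite3.connect(":memory:")
append_csv(conn, "daily_visits", io.StringIO("day,visits\n2024-01-01,10\n"))
append_csv(conn, "daily_visits", io.StringIO("day,visits\n2024-01-02,12\n"))
```

Run on a schedule (cron or similar), a step like this keeps the table growing, and a Superset dashboard built on that table picks up the new rows on its next refresh.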
I have unstructured data in the form of document images. We are converting these documents to JSON files. I now want to have technical metadata captured for this. Can someone please give me some tips/best practices for building a data catalog on unstructured data in Google Cloud Platform?
This answer assumes that you are not using a tool such as BigQuery, Hive, or Presto to create schemas around your unstructured data and query it, and that you simply want to catalog your files.
I had a similar use case. Google Data Catalog has an option to create custom entries.
Some tips on building a Data Catalog on unstructured file data:
Use meaningful file names for your JSON files. That way, searching for them becomes easier.
Since you are already using GCP, use their managed Data Catalog, and leverage its custom entries API to ingest the file metadata into it.
In case you also want to look for sensitive data in your JSON files, you could run DLP on them.
Use Data Catalog Tags to enrich the file metadata. The tutorial on the link shows how to do it on BigQuery tables, but you can do the same on custom entries.
I would add some information about the ETL jobs that convert these documents into JSON files as Tags, such as execution time, data quality score, user, business owner, etc.
In case you are wondering how to do step 2, I put together a script that does it automatically: link for the GitHub.
Another option is to work with Data Catalog Filesets.
So, to choose between custom entries and filesets, I'd ask you this: do you need information about your file names?
If not, then filesets might be easier. At the time of this writing, a fileset does not show any information about your file names, but filesets are good for managing file patterns in GCS buckets: a fileset is defined by one or more file patterns that specify a set of one or more Cloud Storage files.
The datacatalog-util also has an option to enrich your filesets, in case you just want statistics about them, like average file size, types, etc.
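Before ingesting anything into the catalog, you need the technical metadata itself. The following is only a sketch of how that could be gathered per JSON file with the Python standard library; the function name `technical_metadata` and the field names are my own choices, not a Data Catalog schema, and the resulting dictionary is what you would then map onto a custom entry and its tags.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def technical_metadata(path):
    """Collect basic technical metadata for one converted JSON file.

    Field names are illustrative; adapt them to your own tag templates.
    """
    path = Path(path)
    raw = path.read_bytes()
    document = json.loads(raw)
    stat = path.stat()
    return {
        "file_name": path.name,
        "size_bytes": stat.st_size,
        "modified_utc": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        # Content hash, useful for spotting duplicates or re-conversions.
        "sha256": hashlib.sha256(raw).hexdigest(),
        # Top-level keys give a cheap hint of the document's structure.
        "top_level_keys": sorted(document) if isinstance(document, dict) else [],
    }
```

Job-level details such as execution time or a data quality score would come from the ETL job itself and could be merged into the same dictionary before tagging.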
I am trying to automate the entire data loading process, so that whenever I upload a file to Google Cloud Storage, it automatically triggers the data to be loaded into the BigQuery dataset. I know that a scheduled daily update is available, but I want something that only triggers whenever the CSV file is re-uploaded.
You have 2 possibilities:
Either you react to an event: you can plug a function into Google Cloud Storage events. The event message identifies the file stored in GCS, and you can do what you want with it, for example run a load job from Google Cloud Storage.
Or, do nothing! Leave the file in GCS and create a BigQuery federated table that reads directly from GCS.
With either solution, your data is accessible from BigQuery, so your Data Studio charts can query BigQuery and find the data there. However, the two options differ:
The load job is more efficient: you can partition and cluster your data to optimize speed and cost. However, you duplicate your data (from GCS), and you have to write and run your function. In any case, the cost is very low and the function is very simple. For big data, this is my recommended solution.
Federated tables are very useful when the quantity of data is low, for occasional access, or for prototyping. You can't cluster or partition your data, and queries are slower than on data loaded into BigQuery (because the CSV is parsed on the fly).
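For the first option, a function reacting to a Cloud Storage event could look roughly like the sketch below. It assumes a background Cloud Function triggered when an object is finalized in the bucket, with the `google-cloud-bigquery` client available. `TABLE_ID` is a placeholder, and the choice to overwrite the table on each upload (`WRITE_TRUNCATE`) is my assumption based on the file being re-uploaded; use `WRITE_APPEND` if each upload carries new rows.

```python
# Placeholder destination table: replace with your own project.dataset.table.
TABLE_ID = "my-project.my_dataset.my_table"


def gcs_uri_from_event(event: dict) -> str:
    """Build the gs:// URI of the uploaded object from the event payload."""
    return f"gs://{event['bucket']}/{event['name']}"


def load_csv_on_upload(event, context):
    """Entry point: load the uploaded CSV into BigQuery."""
    # Imported lazily so the URI helper can be tested without the SDK.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes the CSV has a header row
        autodetect=True,      # let BigQuery infer the schema
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    load_job = client.load_table_from_uri(
        gcs_uri_from_event(event), TABLE_ID, job_config=job_config
    )
    load_job.result()  # wait for completion so failures surface in the logs
```

The function only runs when a file lands in the bucket, so there is no schedule to maintain.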
So, big data is a wide area: do you need to transform the data before the load? Can you transform it after the load? How can you chain the queries one after another? And so on.
Don't hesitate if you have other questions on this!
I have data stored in BigQuery. It is a small dataset, roughly 500 rows. I want to be able to query this data and load it into the front end of a Django application. What is the best practice for this type of data flow?
I want to be able to make calls to the BigQuery API using Javascript. I will then parse the result of the query and serve it in the webpage. The alternative seems to be to find a way of making a regular copy of the BigQuery data which I could store in a Cloud Storage Bucket but this adds a potentially unnecessary level of complexity which I could hopefully avoid if there is a way to query the live dataset.
I have a central data store in AWS. I want to access multiple tables in that database and find patterns and make predictions on that collection of data.
My tables hold several kinds of transactional data, such as call details, marketing campaign details, contact information of people, etc.
How can I integrate all this data for a big data analysis, find the relationships, and store them efficiently?
I am confused about whether to use Hadoop or not, and which architecture would be the best fit.
The easiest way for you to start is to export the tables you wish to analyze into a CSV file and process it using Amazon Machine Learning.
The following guide describes the entire process:
http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html