Append CSV Data to Apache Superset Dataset - apache-superset

Using CSV upload in Apache Superset works as expected. I can use it to add data from a CSV to a database, e.g. Postgres. Now I want to append data from a different CSV to that table/dataset. But how?
The CSVs all have the same format. But there is a new one for every day. In the end I want to have a dashboard which updates every day, taking the new data into account.

Generally, I agree with Ana that if you want to repeatedly upload new CSV data, you're better off operationalizing this into some kind of process or pipeline that runs on a schedule.
But if you need to stick with uploading CSVs through the Superset UI, you can set the Table Exists field to Append instead of Replace.
You can find a helpful GIF in the Preset docs: https://docs.preset.io/docs/tips-tricks#append-csv-to-a-database

Probably you'll be better served by creating a simple process to load the CSV to a table in the database and then querying that table in Superset.
Superset is a tool to visualize data. It allows uploading a CSV for quick and dirty "only once" kinds of charts, but if this is going to be a recurring, structured, periodic load of data, it's better to use a proper integration tool to load it. There are zillions of ETL (Extract-Transform-Load) tools out there (or scripting languages to do it yourself); ask whether your company is already using one, or choose whichever is simplest for you.
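If you do go the scripted route, a minimal sketch could look like the following. It assumes a daily CSV with the same columns every day, a hypothetical Postgres table daily_metrics behind the Superset dataset, and pandas/SQLAlchemy available; names and the connection string are placeholders.

```python
# Minimal daily loader: append today's CSV to the Postgres table that the
# Superset dataset points at. File name, table name, and connection string
# are placeholders for your environment.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@db-host:5432/analytics")

df = pd.read_csv("export_2024-01-31.csv")  # today's file, same columns every day

df.to_sql(
    "daily_metrics",       # table the Superset dataset is built on
    engine,
    if_exists="append",    # keep existing rows, add the new day's data
    index=False,
)
```

Run it from cron or any scheduler once the new file lands, and the dashboard picks up the new rows on its next refresh.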

Related

AWS Glue: Add an Attribute to CSV to Distinguish Between Data Sets

I need to pull two companies' data from their respective AWS S3 buckets, map their columns in Glue, and export them to a specific schema in a Microsoft SQL database. The schema is to have one table, with the companies' data being distinguished with attributes for each of their sites (each company has multiple sites).
I am completely new to AWS and SQL; would someone mind explaining how to add an attribute to the data, or pointing me to some good literature on this? I feel like manipulating the .csv in the Python script I'm already running (which automatically downloads the data from another site and uploads it to S3) could be an option, deleting the NaN columns and adding a column for site name, but I'm not entirely sure.
I apologize if this has already been answered elsewhere. Thanks!
I generally find this website pretty helpful for figuring out SQL. I've linked to the ALTER TABLE commands that would let you do this through SQL.
If you are running a Python script to edit the .csv anyway, then I would edit the data there, personally. Depending on the size of the data sets, you can run your script as a Lambda or Batch job to grab, edit, and then upload to S3. Then you can run your Glue crawler or whatever process you're using to map the columns.
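As a rough illustration of that clean-up step, a pandas + boto3 sketch might look like the following; the file names, site name, bucket, and key are all made up for illustration.

```python
# Drop all-NaN columns, add a site_name attribute, and upload the cleaned
# CSV to S3 so the Glue crawler can pick it up. All names are placeholders.
import boto3
import pandas as pd

def prepare_and_upload(local_csv: str, site_name: str, bucket: str, key: str) -> None:
    df = pd.read_csv(local_csv)
    df = df.dropna(axis="columns", how="all")   # remove columns that are entirely NaN
    df["site_name"] = site_name                 # attribute that distinguishes the data sets
    cleaned = f"/tmp/{site_name}.csv"
    df.to_csv(cleaned, index=False)
    boto3.client("s3").upload_file(cleaned, bucket, key)

prepare_and_upload("company_a_site_1.csv", "company_a_site_1",
                   "my-glue-landing-bucket", "company_a/site_1.csv")
```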

Building Google Cloud Platform Data Catalog on unstructured data

I have unstructured data in the form of document images. We are converting these documents to JSON files. I now want to have technical metadata captured for this. Can someone please give me some tips/best practices for building a data catalog on unstructured data in Google Cloud Platform?
This answer assumes you are not using any tool (like BigQuery, Hive, or Presto) to create schemas around your unstructured data and query it, and that you simply want to catalog your files.
I had a similar use case; Google Data Catalog has an option to create custom entries.
Some tips on building a Data Catalog on unstructured file data:
Use meaningful file names on your JSON files. That way searching for them will become easier.
Since you are already using GCP, use their managed Data Catalog, and leverage their custom entries API to ingest the files metadata into it.
In case you also want to look for sensitive data in your JSON files, you could run DLP on them.
Use Data Catalog Tags to enrich the files' metadata. The tutorial in the link shows how to do it on BigQuery tables, but you can do the same on custom entries.
I would add some information about the ETL jobs that convert these documents into JSON files as Tags, such as execution time, data quality score, user, business owner, etc.
In case you are wondering how to do step 2, I put together a script that automates it:
link to the GitHub repo (see the sketch at the end of this answer). Another option is to work with Data Catalog Filesets.
So between using custom entries or filesets, I'd ask you this: do you need information about your file names?
If not, then filesets might be easier, since at the time of this writing they do not show any info about file names, but they are good for managing file patterns in GCS buckets: a fileset is defined by one or more file patterns that specify a set of one or more Cloud Storage files.
The datacatalog-util tool also has an option to enrich your filesets, in case you just want statistics about them, like average file size, file types, etc.
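For reference, a minimal sketch of what step 2 (ingesting file metadata as custom entries) might look like with the google-cloud-datacatalog Python client. The project, location, entry group, and file details are placeholders, and the exact keyword arguments may differ slightly between client library versions.

```python
# Create an entry group and one custom entry describing a converted JSON file.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

project_id = "my-project"    # placeholder
location = "us-central1"     # placeholder

# Entry group that will hold the custom entries for the converted documents.
entry_group = client.create_entry_group(
    parent=f"projects/{project_id}/locations/{location}",
    entry_group_id="json_documents",
    entry_group=datacatalog_v1.EntryGroup(display_name="Converted JSON documents"),
)

# Custom entry describing one JSON file produced by the conversion pipeline.
entry = datacatalog_v1.Entry(
    display_name="invoice_2021_001.json",
    user_specified_type="json_file",
    user_specified_system="document_conversion_pipeline",
    linked_resource="//storage.googleapis.com/my-bucket/invoice_2021_001.json",
)

created = client.create_entry(
    parent=entry_group.name,
    entry_id="invoice_2021_001",
    entry=entry,
)
print(created.name)
```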

Use AWS Athena With Dynamic Fields / Schemaless

We want to use AWS Athena for analytics and segmentation. Our problem is that our data is schemaless: rows differ, with only some columns in common.
Is it possible to create a table without defining all the columns?
When we query, we know the type (string/int) of each column, so if there is a way to define it in the query, that would be great.
We can structure the data in any way needed to support schemaless querying, and in any format: CSV / JSON.
Is Athena an option for schemaless uses?
There are many ways to use Athena for schemaless data, and you should give specific examples of the scenarios you want to support more efficiently. In Athena you pay based on the data you scan, so optimizing your data to minimize the scan is critical to making it a useful tool at scale.
The simplest way to get started, while you are learning the tool and the types of queries you can run on your data, is to define a table with a single column ("line") and then parse the data you need using string functions, or JSON functions if the lines are in JSON format.
You will get good query times if you have multiple files, but it will be expensive, as you need to scan all your data for every query. I suggest starting with these queries as a good way to define your requirements. As usage grows, start optimizing the popular (and expensive) use cases with CTAS (Create Table As Select) commands that generate Parquet versions of the original raw data.
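A sketch of both ideas using boto3; the raw and curated databases, bucket paths, and JSON field names are all assumptions, and the databases and result bucket are assumed to already exist.

```python
# Run Athena queries via boto3: a single-column "line" table over raw
# JSON-lines files, ad-hoc parsing with JSON functions, and a CTAS copy
# in Parquet for the heavy use cases. All names are illustrative.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run(sql: str) -> str:
    """Submit a query and return its execution id (results land in S3)."""
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "raw"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return resp["QueryExecutionId"]

# 1. One table, one string column, pointed at the raw files. With the default
#    text SerDe each whole line ends up in the single column.
run("""
CREATE EXTERNAL TABLE IF NOT EXISTS raw.events (line string)
LOCATION 's3://my-raw-data/events/'
""")

# 2. Parse only what a query needs, at query time, with JSON functions.
run("""
SELECT json_extract_scalar(line, '$.user_id')                AS user_id,
       CAST(json_extract_scalar(line, '$.amount') AS double) AS amount
FROM raw.events
WHERE json_extract_scalar(line, '$.event_type') = 'purchase'
""")

# 3. Once a query pattern becomes popular, materialise a Parquet copy with
#    CTAS so repeated runs scan far less data.
run("""
CREATE TABLE curated.purchases
WITH (format = 'PARQUET',
      external_location = 's3://my-curated-data/purchases/')
AS SELECT json_extract_scalar(line, '$.user_id')                AS user_id,
          CAST(json_extract_scalar(line, '$.amount') AS double) AS amount
FROM raw.events
WHERE json_extract_scalar(line, '$.event_type') = 'purchase'
""")
```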
You are welcome to read my blog post describing the strategy and tactics of a cloud environment using Athena and the other AWS tools around it.

Is there any way in DSS to read data from multiple sheets of a single Excel file and insert that data into multiple database tables using WSO2 6.4.0?

I am new to WSO2 DSS 6.4.0. I have to retrieve data from multiple sheets of a single Excel file and insert that data into multiple tables. Please help me with this; just guide me.
It looks like you need to implement some fairly sophisticated logic. Excel files can be a data source, but first of all, how does WSO2 DSS know when it must start reading the Excel file? That sounds like a job for WSO2 ESB, which supports a virtual file system and can track a directory and generate an event if there are any changes.
Why don't you use WSO2 ESB to read the file sheet by sheet and insert the data?
It provides the necessary tools (mediators) to do this.
In any case, it does look like an ETL job.
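The answer above points at WSO2 ESB mediators; purely as an illustration of the same idea in a plain script (not the ESB approach itself), a pandas sketch with hypothetical sheet names, table names, and connection string could look like this:

```python
# Read each sheet of one workbook and append it to its own database table.
# Sheet-to-table mapping, workbook name, and connection string are made up.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@db-host:3306/reporting")

sheet_to_table = {"Customers": "customers", "Orders": "orders", "Items": "order_items"}

sheets = pd.read_excel("daily_report.xlsx", sheet_name=list(sheet_to_table))
for sheet, table in sheet_to_table.items():
    sheets[sheet].to_sql(table, engine, if_exists="append", index=False)
```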

How to split data when archiving from AWS database to S3

For a project we've inherited, we have a largish set of legacy data (600 GB) that we would like to archive, but still have available if need be.
We're looking at using AWS Data Pipeline to move the data from the database into S3, following this tutorial:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
However, we would also like to be able to retrieve a 'row' of that data if we find the application is actually using a particular row.
Apparently that tutorial puts all of the data from a table into a single massive CSV file.
Is it possible to split the data up into separate files, with 100 rows of data in each file, and give each file a predictable file name, such as:
foo_data_10200_to_10299.csv
So that if we realise we need to retrieve row 10239, we can know which file to retrieve, and download just that, rather than all 600GB of the data.
If your data is stored in CSV format in Amazon S3, there are a couple of ways to easily retrieve selected data:
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
S3 Select (currently in preview) enables applications to retrieve only a subset of data from an object by using simple SQL expressions.
These work on compressed (gzip) files too, to save storage space.
See:
Welcome - Amazon Athena
S3 Select and Glacier Select – Retrieving Subsets of Objects
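As an illustration, an S3 Select call with boto3 might look like the following; the bucket, key, and the id column used in the WHERE clause are assumptions, and the archive file is assumed to be a gzip-compressed CSV with a header row.

```python
# Pull only the matching row(s) out of one archived CSV object with S3 Select,
# instead of downloading the whole file. Names are placeholders.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-archive-bucket",
    Key="legacy/foo_data.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.id = '10239'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; print the record chunks as they arrive.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```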