Multi-level Sankey - apache-superset

I am currently using v0.99 of Superset and am able to create a 2-level Sankey diagram.
I would like to plot a sequence of User actions as follows:
Sign up selected -> Base package selected -> Extras selected -> Accounts created -> Sign up completed -> Content Watched.
(a combination of 2-3 columns in the Apache Druid dataset captures each of the above states)
I was thinking a multi-level Sankey diagram would be useful to capture this flow. Which release of Superset has this support (I don't see it in v0.99)? Is there another way to represent this flow?
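In the meantime, one workaround I am considering is to pre-aggregate the transitions myself into source/target pairs (one row per adjacent pair of stages) and feed those to the existing Sankey chart. A rough Python sketch of that reshaping; the column and value names are just placeholders for my Druid columns:

```python
import pandas as pd

# Hypothetical event-level data: one row per user, one column per funnel stage.
df = pd.DataFrame({
    "signup_selected": ["yes", "yes", "yes"],
    "base_package":    ["basic", "premium", "basic"],
    "extras_selected": ["none", "sports", "none"],
})

stages = ["signup_selected", "base_package", "extras_selected"]

# Build source -> target edges for each adjacent pair of stages and
# count how many users made each transition.
edges = []
for src, dst in zip(stages, stages[1:]):
    counts = (
        df.groupby([src, dst]).size().reset_index(name="value")
          .rename(columns={src: "source", dst: "target"})
    )
    # Prefix node names with the stage so nodes stay distinct across levels.
    counts["source"] = src + ": " + counts["source"].astype(str)
    counts["target"] = dst + ": " + counts["target"].astype(str)
    edges.append(counts)

sankey_rows = pd.concat(edges, ignore_index=True)
print(sankey_rows)  # source, target, value -- the shape a Sankey chart expects
```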

Related

Google Document AI - Add New Label Workflow

We’re just getting started with Document AI. So far, we have about 80 labeled documents and one trained version.
We are making changes to the schema and adding a property. We’d like to go back in and apply this new label to the previously labeled documents.
The Document AI user interface presents some challenges here.
I want to isolate the documents that don’t contain this label. With the filtering capabilities, it looks like I’m only able to filter on documents that have that label, not the inverse.
I also don’t see a way to mark a bunch of documents as unlabeled once they have been marked as labeled. That would be useful for indicating which previously labeled documents need some additional work.
For those that are making schema changes and need to go back and re-label documents, what does your workflow look like?

Attach multiple query tabs in BigQuery to the same BQ Session

I cannot find a way to do this in the UI: I'd like to have distinct query tabs in the BigQuery's UI attached to the same session (i.e. so they share the same ##session_id and _SESSION variables). For example, I'd like to create a temporary table (session-scoped) in one tab, then in a separate query tab be able to refer to that temp table.
As far as I can tell, when I put a query tab in Session Mode, it always creates a new session, which is precisely what I don't want :-\
Is this doable in BQ's UI?
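For context, outside the UI I can get the behavior I want programmatically: the BigQuery Python client lets one query create a session and later queries attach to it via connection properties. A minimal sketch (the queries and table name are just examples):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Query 1: start a new session and create a session-scoped temp table in it.
job1 = client.query(
    "CREATE TEMP TABLE my_temp AS SELECT 1 AS x",
    job_config=bigquery.QueryJobConfig(create_session=True),
)
job1.result()
session_id = job1.session_info.session_id

# Query 2: attach to the existing session so the temp table is visible here too.
job2 = client.query(
    "SELECT * FROM my_temp",
    job_config=bigquery.QueryJobConfig(
        connection_properties=[
            bigquery.ConnectionProperty(key="session_id", value=session_id)
        ],
    ),
)
for row in job2.result():
    print(row.x)
```

But I am specifically after the equivalent of this in the web UI's query tabs.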
There is a 3rd-party IDE for BigQuery that supports such a feature (namely, joining Tab(s) to an existing session).
It is Goliath - part of the Potens.io suite, available in the Marketplace.
Let's see how it works there:
Step 1 - create a Tab with a new session and run some query to actually initiate the session.
Step 2 - create new Tab(s) and join them to the existing session (either using the session_id or simply the respective Tab name).
So, now both Tabs (Tab 2 and Tab 3) share the same session with all the expected perks.
You can add as many Tabs to that session as you want to comfortably organize your workspace
And, as you can see, Tabs that belong to the same session are colored in a user-defined color, so it is easy to navigate between them.
Note: Another tool in this suite is Magnus - Workflow Automator. It supports all of BigQuery, Cloud Storage and most Google APIs, as well as many simple utility-type Tasks (BigQuery Task, Export to Storage Task, Loop Task and many more), along with advanced scheduling, triggering, etc. It also supports GitHub for source control.
Disclosure: I am a GDE for Google Cloud, the creator of those tools, and the lead of the Potens team.

Using google cloud for image classification, cropping and OCR

Please allow me to ask a rather newbie question. So far, I have been using local tools like ImageMagick or GOCR to perform the job, but that is rather old-fashioned, and I am urged to "move to google cloud AI".
The setup
I have a (training) data set of various documents (as JPG and PDF) of different kinds, and by certain features (like prevailing color, repetitive layout) I intend to classify them, e.g. as invoice type 1, invoice type 2, not an invoice. In a 2nd step, I would like to OCR certain predefined areas of each document and extract e.g. the address of the company sending the invoice and the date.
The architecture I am envisioning
In a modern platform as a service (PaaS), I have already set up a UI where I can upload new files. These are then stored locally in a directory with their filenames (or in a MongoDB). Meta info like the upload timestamp, user, and original file name is stored in a DB.
The newly uploaded file should then be submitted to Google Cloud, which should do the classification step and deliver back the label to be saved in the database.
The document pages should be auto-cropped, i.e. black or white margins removed, most probably with Google Cloud as well. The parameters of the crop should be persisted in the DB.
In case it is e.g. an invoice, OCR should be performed (again by Google Cloud) on certain regions of the document, e.g. a bounding box spanning from the middle of the page to the right margin in the upper 10% of the cropped page. The results of the OCR should again be persisted locally.
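(For step 4, a rough sketch in Python of what I have in mind: crop a fixed region locally with Pillow and OCR it, e.g. via the Cloud Vision API. The file name and coordinates below are only illustrative.)

```python
import io
from PIL import Image
from google.cloud import vision

# Illustrative region: right half of the page, top 10% of its height.
page = Image.open("cropped_page.jpg")
w, h = page.size
region = page.crop((w // 2, 0, w, int(h * 0.10)))

buf = io.BytesIO()
region.save(buf, format="JPEG")

# OCR just that region with the Cloud Vision API.
client = vision.ImageAnnotatorClient()
response = client.text_detection(image=vision.Image(content=buf.getvalue()))
print(response.full_text_annotation.text)
```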
The problem
I seem to be missing the correct search term to figure out how to do this with Google Cloud. Is there a Google API (e.g. REST) that I can use for the upload and which gives me back the results of steps 2 to 4?
I think that your best option here is to use Document AI (REST API and Libraries).
Using Document AI, you can:
Convert images to text
Classify documents
Analyze and extract entities
Additionally, for your use case, we have a new Document AI feature that is still in preview and has limited access: the Invoice parser.
The Invoice parser is similar to the Form parser, but for invoices instead of forms. Check out the Invoice parser page and you will see what I mean by preview and limited access.
AFAIK, there isn't any GCP tool for image editing.
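To give a feel for the client library, here is a rough Python sketch of sending a document to a Document AI processor and reading back the OCR text and extracted entities. The project, location and processor values are placeholders you would replace with your own:

```python
from google.cloud import documentai_v1 as documentai

# Placeholder identifiers -- substitute your own project/processor values.
PROJECT_ID = "my-project"
LOCATION = "us"
PROCESSOR_ID = "abcdef1234567890"

client = documentai.DocumentProcessorServiceClient(
    client_options={"api_endpoint": f"{LOCATION}-documentai.googleapis.com"}
)
name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

with open("invoice.pdf", "rb") as f:
    raw_document = documentai.RawDocument(
        content=f.read(), mime_type="application/pdf"
    )

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)
document = result.document

print(document.text[:500])          # full OCR text of the document
for entity in document.entities:    # extracted fields (e.g. from the Invoice parser)
    print(entity.type_, entity.mention_text, entity.confidence)
```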

Interpretation of output of bounding box annotation job

The output folder of an annotation job contains the following file structure:
active learning
annotation-tools
annotations
intermediate
manifests
Each line of the manifests/output/output.manifest file is a dictionary, where the key 'jobname' contains information about the annotations and the key 'jobname-metadata' contains the confidence score and other information about each of the bounding box annotations. There is also another folder called annotations, which contains JSON files with information about the annotations and the associated worker IDs. How are the two sets of annotation information related to each other? Are there any blogs/tutorials which discuss how to interpret the data received from the Amazon SageMaker Ground Truth service? Thanks in advance.
Links I referred to:
1. https://docs.aws.amazon.com/sagemaker/latest/dg/sms-data-output.html
2. https://github.com/awslabs/amazon-sagemaker-examples/blob/master/ground_truth_labeling_jobs/ground_truth_object_detection_tutorial/object_detection_tutorial.ipynb
I have displayed the annotations received using the code available in link 2, which treats consolidated annotations and worker responses separately.
Thank you for your question. I’m the product manager for Amazon SageMaker Ground Truth and am happy to answer your question here.
We have a feature called annotation consolidation that takes the responses from multiple workers for a single image and consolidates those responses into a single set of bounding boxes for the image. The bounding boxes referenced in the manifest file are the consolidated responses, whereas what you see in the annotations folder are the raw annotations (which is why you have the respective worker IDs).
You can find out more about the annotation consolidation feature here: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-annotation-consolidation.html
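To make the relationship concrete, here is a small Python sketch of reading both sources. The label attribute name ('jobname' below) and the worker-response path are illustrative and depend on how your labeling job was configured:

```python
import json

LABEL_ATTRIBUTE = "jobname"  # placeholder: the label attribute name of your job

# Consolidated annotations: one JSON object per line of output.manifest.
with open("manifests/output/output.manifest") as f:
    for line in f:
        record = json.loads(line)
        boxes = record[LABEL_ATTRIBUTE]["annotations"]      # consolidated boxes
        meta = record[LABEL_ATTRIBUTE + "-metadata"]        # confidences, class map
        print(record["source-ref"], len(boxes),
              [obj["confidence"] for obj in meta["objects"]])

# Raw, un-consolidated responses live in the annotations/ folder, one JSON file
# per data object, keyed by worker ID (the exact path here is illustrative).
with open("annotations/worker-response/iteration-1/0/response.json") as f:
    for answer in json.load(f)["answers"]:
        print(answer["workerId"], answer["answerContent"])
```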
Please let us know if you have any further questions.

How to create a drill down graph using apache superset?

Is it possible to create a drill down graph with apache superset?
Say for example - population of all countries and onclick of a country, population of all states within that country should be drawn and onclick of state, population of state should be drawn.
Can someone help me with steps/tips to create this using Apache Superset, as I did not find any example or option to create the same?
There is a walkthrough on this from ApacheCon Asia 2022 on YouTube - https://www.youtube.com/watch?v=7YnpKLZ1PRM
It covers more than I can summarize here for you.
Please see the response of mistercrunch (one of the creators of Apache Superset) below or here: https://github.com/apache/incubator-superset/issues/2890.
Drill down assumes the framework is aware of hierarchies which Superset isn't at the moment. We encourage our users to slice and dice by entering the explore mode, applying filters and altering the "Group By" field which is pretty easy and very flexible. It's an open field instead of a guided flow.
Preset, which uses Apache Superset, has implemented a feature for Drilling to Chart details. You can find more information about it here:
https://docs.preset.io/docs/drilling-to-chart-details
There is also a pull request for a drill-down prototype, but I don't think it was integrated into Superset, according to the comments.
https://github.com/apache/superset/pull/14688
Including this article link here in case anyone finds it helpful: https://www.tetranyde.com/blog/embedding-superset
It is possible by using custom JavaScript and charts.