Druid datasource columns visible in Superset as STRING - apache-superset

I have ingested data into Druid using Tranquility.
The data source is visible through the Overlord console and I can query it without issues.
Tranquility 0.1.0
Druid 12.3
Superset 0.1.0
When I attach the Druid datasource to Superset, I see that all defined columns are of type STRING. That is pretty weird, because I defined types in the Tranquility schema as follows:
"dimensionsSpec": {
"dimensions": [
"some_id",
{
"type": "double",
"name": "total_positions"
}]
}
I tried to use Calculated Columns and Metrics, but when I save them the new elements do not appear in Druid.
Druid chart -> datasource editor
Has anyone had a similar issue? Is there any way to change the column type in Superset, or should the schema be defined in some different way?

We have the same issue in our environment. We were planning to use it in Apache Branch Report.
As a workaround, we created an external table for Druid in Hive and use the Hive connector in Superset so that we can cast to integer in SQL Lab: https://cwiki.apache.org/confluence/display/Hive/Druid+Integration
However, it would have been much better if Superset charts could interpret numeric dimensions out of the box, so that the architecture would be leaner.
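For reference, a minimal sketch of that Hive-side mapping, assuming a hypothetical table name and datasource name and that the Hive/Druid integration from the linked page is configured (Hive discovers the column list from the Druid segments):

-- Map the existing Druid datasource into Hive via the Druid storage handler
-- (table and datasource names below are placeholders).
CREATE EXTERNAL TABLE druid_positions
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "your_druid_datasource");

-- In Superset's SQL Lab (through the Hive connector) the string dimension
-- can then be cast to a numeric type for charting.
SELECT CAST(total_positions AS DOUBLE) AS total_positions
FROM druid_positions;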

We faced a similar issue. By default, all dimensions were treated as strings. In Tranquility, we used metricsSpec and defined the column as longSum. These columns then show up as numbers in Superset. Remember to refresh the Druid metadata in Superset.
"metricsSpec": [
{
"name": "trafficUp",
"type": "longSum",
"fieldName": "trafficUp"
}
]

Related

How to add columns to an existing Athena table using Avro storage

I have an existing Athena table (w/ hive-style partitions) that's using the Avro SerDe. When I first created the table, I declared the Athena schema as well as the Athena avro.schema.literal schema per AWS instructions. Everything has been working great.
I now wish to add new columns that will apply going forward but not be present on the old partitions. I tried a basic ADD COLUMNS command that claims to succeed but has no impact on SHOW CREATE TABLE. I then wondered if I needed to change the Avro schema declaration as well, which I attempted to do but discovered that ALTER TABLE SET SERDEPROPERTIES DDL is not supported in Athena.
AWS claims I should be able to add columns when using Avro, but at this point I'm unsure how to do it. Even if I'm willing to drop the table metadata and redeclare all of the partitions, I'm not sure how to do it right since the schema is different on the historical partitions.
Looking for high-level guidance on the steps to be taken. Documentation is scant, and Athena seems to lack support for commands that are referenced for this same scenario in the vanilla Hive world. Thanks for any insights.
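For what it's worth, here is a rough sketch of the drop-and-recreate path hinted at in the question, with hypothetical table, column, and bucket names. Whether the new column reads as NULL on historical partitions depends on the Avro reader schema giving it a null default, so treat this as a starting point rather than a confirmed fix:

-- Drop only the table metadata (the S3 data is untouched for an external table),
-- then re-declare it with the new column in both the Athena column list and
-- the avro.schema.literal. All names and types here are placeholders.
DROP TABLE IF EXISTS my_avro_table;

CREATE EXTERNAL TABLE my_avro_table (
  existing_col string,
  new_col string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.literal' = '{"type":"record","name":"my_record","fields":[{"name":"existing_col","type":["null","string"],"default":null},{"name":"new_col","type":["null","string"],"default":null}]}'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://my-bucket/my-prefix/';

-- Re-register the existing hive-style partitions.
MSCK REPAIR TABLE my_avro_table;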

Warning "Not accelerated by BigQuery BI Engine"

I'm using GCP BigQuery with Google Data Studio. I set up the BI Engine reservation in the same location as the dataset (the EU multi-region).
I still get in Data Studio this warning: "Not accelerated by BigQuery BI Engine"
BI Engine accelerates only some types of queries; the limitations are documented in the BI Engine documentation.
Go to your BigQuery console, open the project's job history, and filter to queries only.
Examine each query coming from Data Studio; if you open its details pane, it will show why it was not accelerated.
If you prefer the command line, you can use the bq CLI tool.
To list the most recent jobs:
bq ls -j -a --max_results=15
To fetch the statistics for a query (including BI Engine acceleration details), run the following bq command-line tool command:
bq show --format=prettyjson -j job_id
The output will include a section such as:
"statistics": {
"creationTime": "1602175128902",
"endTime": "1602175130700",
"query": {
"biEngineStatistics": {
"biEngineMode": "DISABLED",
"biEngineReasons": [
{
"code": "UNSUPPORTED_SQL_TEXT",
"message": "Detected unsupported join type"
}
]
},

Power BI and parquet at ADLS Gen2

I'm able to connect to ADLS Gen2 from Power BI Desktop and work with CSV files.
The issue is that the same doesn't work for the Parquet format. Have you ever worked with parquet in Power BI Desktop?
The problem arises after adding the parquet table: when I click on the Binary reference, Power Query is unable to read/preview the parquet data. I tried both with and without snappy compression.
I also tried writing the query manually:
let
    Source = AzureStorage.DataLake("https://xxx.dfs.core.windows.net/yyy/data.parquet"),
    #"File" = Source{[#"Folder Path" = "https://xxx.dfs.core.windows.net/yyy/data.parquet", Name = "data.parquet"]}[Content],
    #"Imported File" = Parquet.Document(#"File")
in
    #"Imported File"
But I got the following exception:
The name 'Parquet.Document' wasn't recognized. Make sure it's spelled correctly.
This is despite the fact that the Parquet.Document function is documented. I'm using the latest version of Power BI Desktop (Dec 2019).
P.S. I've also faced the same issue while developing a DAX model for AAS from Visual Studio (SSDT).
Power BI supports this natively now.
Just paste in the URL to the parquet file on your lake/storage account and you're good to go. Apparently this wasn't slated to go live until March 2021, but it appears for me in the Dec 2020 release.
Currently, you can't work directly with parquet files in Power BI Desktop. You'll need to leverage something like Azure Data Factory's wrangling data flows to convert to CSV or another consumable format first.
It looks like the function you're referring to was specifically added for this new feature in Azure Data Factory, which allows usage of Parquet files in wrangling data flows.
This might come soon for the Power BI Service's dataflows, too, but that's speculation on my part.
I have been able to successfully read parquet files stored in ADLS Gen2 via a Power BI dataflow.
Unfortunately, you cannot progress to completion via the GUI; the Parquet format is not natively detected as a source data type at the time of this writing. To get around the issue, use the Advanced query editor (to reach it, first select JSON or another data type, then overwrite the M code in the Advanced editor).
Note: this does not currently work with the June 2020 release of Power BI Desktop. It only works via a dataflow, from what I can tell:
let
    Source = AzureStorage.DataLake("https://xxxxxxxxxx.dfs.core.windows.net/Container"),
    Navigation = Parquet.Document(Source{[#"Folder Path" = "https://xxxxxxxxxx.dfs.core.windows.net/yourcontainer/yoursubfolder/", Name = "yourParquetFile"]}[Content]),
    #"Remove columns" = Table.RemoveColumns(Navigation, Table.ColumnsOfType(Navigation, {type table, type record, type list, type nullable binary, type binary, type function}))
in
    #"Remove columns"

how to give individual schema access to users in apache-superset?

I have searched a lot about this, but there are no concrete answers.
I have an AWS Redshift database with around 6-7 schemas and 10-12 tables in each.
Dashboards are built within a single schema as well as across schemas.
Here's the use case:
Some users need to see only dashboards related to "schema 1" but not "schema 2".
Other users look at dashboards connected to both "schema 1" and "schema 2". I'm not able to find any workaround for this.
I have seen a thread saying that it's possible to give access at the schema level, but it doesn't mention how:
https://github.com/apache/incubator-superset/issues/5483#issuecomment-494227986
As per the Superset documentation, you cannot grant access at the schema level, but you can grant access at the data source level. Alternatively, you can create custom data sources and define the desired roles as per your needs.
Refer: https://superset.incubator.apache.org/security.html#managing-gamma-per-data-source-access

Query 100Gb of S3 data in milliseconds

I have JSON data in S3. The data looks like:
{
  "act_timestamp": 1576480759864,
  "action": 26,
  "cmd_line": "\\??\\C:\\Windows\\system32\\conhost.exe 0xffffffff",
  "guid": "45af94911fb911ea827300270e098ff0",
  "md5": "d5669294f78a7d48c318ef22d5685ba7",
  "name": "conhost.exe",
  "path": "C:\\Windows\\System32\\conhost.exe",
  "pid": 1968,
  "sha2": "6bd1f5ab9250206ab3836529299055e272ecaa35a72cbd0230cb20ff1cc30902",
  "proc_id": "45af94901fb911ea827300270e098ff0",
  "proc_name": "gcxvdf.exe"
}
I have around 100 GB of such JSON stored in S3, in a folder structure like year/month/day/hour.
I have to query this data and get results in milliseconds.
Queries can be like:
select proc_id where name = 'conhost.exe'
select proc_id where cmd_line contains 'conhost.exe'
I tried using AWS Athena and Redshift, but both return results in around 10-20 seconds. I even tried the Parquet and ORC file formats.
Is there any tool/technology/technique that can query this kind of data and return results in milliseconds?
(The response time needs to be in milliseconds because I am developing an interactive application.)
I think you are looking for a distributed search system like Solr or Elasticsearch (I am sure there are others, but those are the ones I am familiar with).
It is also worth considering whether you can reduce your data size at all. Is there any old or stale data in your 100 GB?
I was able to solve the above use case by using Presto and Hive on AWS EMR.
With Hive we can create a table over the data in S3, and by using Presto with Hive as the catalog we can query this data.
I found that Presto on EMR is much faster than AWS Athena
(strange, since Athena uses Presto internally).
Create the table in Hive:
CREATE EXTERNAL TABLE `test_table` (
  `field_name1` datatype,
  `field_name2` datatype,
  `field_name3` datatype
)
STORED AS ORC
LOCATION 's3://test_data/data/';
Query this table in Presto:
>presto-cli --catalog hive
>select field_name1 from test_table limit 5;
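Since the data is laid out in year/month/day/hour folders, a partitioned variant of the same Hive table lets Presto prune most of the 100 GB per query. Here is a minimal sketch with hypothetical column names; if the folders are not in key=value form, the partitions have to be added with ALTER TABLE ... ADD PARTITION instead of MSCK REPAIR:
CREATE EXTERNAL TABLE `test_table_partitioned` (
  `proc_id` string,
  `name` string,
  `cmd_line` string
)
PARTITIONED BY (`year` string, `month` string, `day` string, `hour` string)
STORED AS ORC
LOCATION 's3://test_data/data/';
-- Registers folders named like year=2019/month=12/day=17/hour=05 as partitions.
MSCK REPAIR TABLE test_table_partitioned;
Then query in Presto with a partition filter so only the matching folders are scanned:
>presto-cli --catalog hive
>select proc_id from test_table_partitioned where "year" = '2019' and "month" = '12' and name = 'conhost.exe' limit 5;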