Saving SQL Lab query as dataset? - apache-superset

Currently, to create a virtual dataset in Superset, I create a dataset based on a physical table and edit it afterwards. This is not as straightforward as I expected, so perhaps there is a better way to do it? In particular, I expected a way to create a dataset directly from a SQL Lab query.
Am I missing an option to do so?
Thanks,
Holger

Good question. A virtual dataset in Superset is more or less just a query (albeit with some optional additions, like Jinja-templated variables; see the sketch at the end of this answer).
SQL Lab to virtual dataset: When you're in SQL Lab, any query you run can be published as a virtual dataset by clicking "Explore". You'll then be asked to give the virtual dataset a name, and a no-code Explore workflow will be kicked off with this virtual dataset as the data context/input!
Editing an existing virtual dataset: To edit the virtual dataset (which, again, is mostly a SQL query), select Edit from the Datasets tab next to the line item for your dataset.
I think the first bullet is what you're looking for :)
The following documentation may also be helpful: https://docs.preset.io/docs/sql-editor#results-tab
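To make that concrete, a virtual dataset is ultimately just the SQL you wrote in SQL Lab, optionally with Jinja-templated variables. A minimal sketch (the orders table and its columns are made up for illustration, not from the question; exact Jinja macro names depend on your Superset setup):
SELECT customer_id,
       SUM(order_total) AS total_spend   -- plain SQL; this becomes the dataset definition
FROM orders
WHERE order_date >= '{{ from_dttm }}'    -- optional Jinja-templated time bound
GROUP BY customer_id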

. "Save" just saves as query, not as dataset.
create sql
type "explore"
save as virtual dataset

I don't have Superset open right now, but yes, you can run a query in SQL Lab and then save it as a virtual dataset.

Related

QuickSight "Geographical fields aren't supported in joins between data sources"

I've been trying to work around this issue for a couple of days now, without success so far.
Imagine you have these two dummy datasets:
dataset_1
id,latitud,longitude
1,-0.023437,-0.070068
2,-0.069099,-0.069099
dataset_2
id,name
1,"site one"
2,"site two"
and you want to JOIN them by id. This is very straightforward with the QuickSight dataset editor. The issue happens when you change the data type of latitud and longitude to their geospatial types, since the error shown in the title pops up and won't let you save the dataset.
The weird thing is that the error suggests that the latitud and/or longitude fields are being used for the JOIN instead of id.
Before contacting AWS about a possible bug: has anyone encountered and solved this issue before?
In the end we contacted AWS support. It seems they have this feature under consideration but it's still not addressed. They did suggest a workaround, though:
Change the data type of the geospatial fields to string and perform the join.
Once the join is successful, go back to the dataset page, click on the dataset and select the "Use in a new Dataset" option.
This will create a new child dataset of the main dataset.
Here you can change the data types back to geospatial and save it.
Bear in mind that the "Use in a new Dataset" option is disabled if your dataset has row-level security or if it exceeds 3 levels of JOIN (in which case you'd have to follow #darcoli's answer first).
This seems to be a limitation with QuickSight. Can you do the join in custom SQL (something like the sketch below) and then set the fields as geospatial coordinates in data preparation?
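For illustration, a custom SQL sketch for the two dummy datasets above (table and column names are taken from the question, the rest is assumed):
-- join on id while latitud/longitude are still plain (non-geospatial) columns
SELECT d1.id,
       d1.latitud,
       d1.longitude,
       d2.name
FROM dataset_1 AS d1
JOIN dataset_2 AS d2
  ON d1.id = d2.id
After the join, you would switch latitud and longitude to their geospatial types in the data-preparation step rather than before the join.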

Power BI datasets

This is a bit of a newbie question. I've tried to google it and understand how datasets work, but I'm not having much luck. I have a dataset created by a colleague that connects to one of our systems.
I want to look at using it and try to make some changes. I can see it created a .pbix file when I saved a copy of the dataset. I want to look at the model section and see if I can pull some further fields (columns) into a table in the dataset that already links corresponding data from two other tables. I'd like to add more fields (columns) that are not currently in that table.
However, I don't want to affect other datasets or the data in the system it is communicating with.
Can anyone advise me if this is the case?
I really only want to test things for now and not make any changes that might affect other people.
You can create a duplicate of the original table (query) and then experiment with that duplicate.

Most efficient Snowflake connection type from PowerBI?

We're trialling PowerBI on a Snowflake dimensional model and performance seems very non-optimised. Can anyone point me to information on best practices for this connection? I've previously used Tableau and there's an excellent white paper describing the pros/cons of each connection type and how to set this up so that as much heavy lifting as possible is done in Snowflake, with minimal load on the viz tool.
e.g. when you summarise 1 million invoices to get a chart of sales volume by year, which distils this to ~10 data points, Tableau would send 'SELECT year, sum(volume) FROM t GROUP BY year' (~10 rows), but with PowerBI we see Snowflake receiving a query like 'SELECT invoice_id, sum(volume) FROM t GROUP BY invoice_id' (~1M rows), leaving the viz tool to do a lot more work.
So far, we've tried mapping the individual facts and dimensions within PowerBI, and also using a mix of direct query and import, but without significant improvement. Is there any guidance on best practice?
Thanks in advance!
I've never used Snowflake, and I have no clue about how PowerBI interfaces with it. That said, on the PowerBI side you may be interested in the composite model and aggregations.
MS Docs:
https://learn.microsoft.com/en-us/power-bi/desktop-composite-models
https://learn.microsoft.com/en-us/power-bi/desktop-storage-mode
https://learn.microsoft.com/en-us/power-bi/desktop-aggregations
Radacad's blog about aggregations:
https://radacad.com/power-bi-fast-and-furious-with-aggregations
https://radacad.com/dual-storage-mode-the-most-important-configuration-for-aggregations-step-2-power-bi-aggregations
In practice, when you are using a composite model, the aggregation functionality allows you to create a hidden table (in import mode) in your model with aggregated data (by year, month, customer, etc.).
Now when you query your data, PowerBI checks whether this table can answer the query; if yes, it just picks the data from this table, otherwise it runs a query against the source (direct query).
The example you shared, where PowerBI queries the source without asking for aggregation (but instead asks for every single invoice_id), might be caused by the composite model not being set up correctly.
A table in "direct query" mode cannot reference other tables in its query (in this case the calendar) unless that table is also in "direct query" or "dual" mode.
What does the model look like in the case you shared, and what is the storage mode of each table?

The processing of large data sets in sas

I am looking for solutions or ideas for how to speed up the processing of large data sets in SAS.
What would you recommend?
Which is better: a DATA step or the PROC SQL procedure?
Speeding up your data processing depends on where your data is saved.
Your data can be in either:
a SAS table,
a database table (Microsoft SQL Server, Oracle, DB2, MySQL, etc.)
Use a SAS DATA step when:
You are querying/processing SAS tables,
You want to do iterative processing (e.g. retaining values or using arrays).
Use PROC SQL when:
You are querying a large database table,
You can do a SQL "pass-through", where you send SQL code to be executed on the DB server and only the output is sent back to SAS, instead of bringing the entire tables across the network to SAS and then filtering them (see the sketch after this list),
You want to query SAS tables but prefer SQL joins to DATA step merges.
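As a rough illustration of the pass-through point above, a minimal sketch of explicit SQL pass-through (the Oracle connection details and the sales table are assumptions, not from the question):
proc sql;
  /* hypothetical connection details */
  connect to oracle (user=myuser password=mypass path=mydb);
  create table work.sales_summary as
    select * from connection to oracle
    ( /* this inner query runs entirely on the database server;
         only the aggregated result travels back to SAS */
      select region, sum(amount) as total_amount
      from sales
      group by region
    );
  disconnect from oracle;
quit;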
Another topic you should consider is efficient programming, i.e. optimising your queries and look-ups.
I find PROC SQL to be better for my use cases. We may need some more specifics on the size and variety of the data you're trying to join/export, etc.
Give us some info on that and we can try to help.
Tips:
Limit the fields you're pulling over
Subset the data
Anecdotally, from my experience, PROC SQL seems faster.
Here are two tips on speeding up queries with Proc SQL:
In general, you want to rule out as much data as possible when querying. If you are using PROC SQL, the order of the restrictions in the WHERE clause matters. Put the most restrictive parts first.
For example, if I'm querying a database for teachers with the last name "JONES" that were hired after Jan 2005, I would structure my WHERE clause like this: where last_name = 'JONES' and hire_date > 200501. I would do this because the last name is likely to exclude more records than the hire date restriction.
When possible, don't use SELECT *; instead, list out the specific columns that you need. Remember, even if you are doing a calculation with a column, you don't have to include that column in your SELECT statement.
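Putting both tips together, a minimal PROC SQL sketch using the hypothetical teachers table from the example above (the column list and the db libref are assumptions):
proc sql;
  create table work.jones_hires as
    select teacher_id, last_name, hire_date   /* list only the columns you need */
    from db.teachers
    where last_name = 'JONES'                 /* most restrictive condition first */
      and hire_date > 200501;
quit;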
Here is a very useful resource for understanding how to use PROC SQL efficiently. I recommend reading it in its entirety if you do a lot of work with large data sets in SAS.
http://www2.sas.com/proceedings/sugi29/127-29.pdf

Reading (even joining) a very large (1.1bn row) table in Enterprise Guide from Teradata

Hopefully you guys can help with what I'm hoping is quite a simple question for those in the know!
I live (well, work) in SAS Enterprise Guide and am trying to perform a simple left join against a table in Teradata.
The table is extremely large (700+ columns, 1.1bn rows) and so far I have been connecting via a LIBNAME statement at the top of my program, followed by the usual PROC SQL to read the data.
The issue I am having is that it is extremely slow. I performed the join successfully using 90 rows in the left table and it took 3 hours to complete. The real table I want to use has something like 15,000 rows.
I have tried to connect via the SQL Pass-Through method, but this throws a hosts file error, which I can't fix due to corporate security limitations.
Has anyone had any experience performing this kind of task?
I should mention that I can run a simple SELECT * query in Teradata SQL Assistant in just over 1 minute (~16,666,666 obs/s!), so the limitation must be somewhere between SAS and Teradata, or in SAS itself.
I'm sorry I haven't posted actual code snippets as they're on my work machine, but this has been bugging me for ages so I thought I'd see if I'm missing any tricks.
Thanks in advance for your help.
So you're joining a SAS data set to a Teradata table and want to return the matching records. You'll want to use SAS's DBMASTER= data set option. It designates which of the tables is larger. By telling SAS this, it knows which table to move.
Here I assume librefs have already been assigned and that the Teradata table is larger--more obs--than the SAS data set.
proc sql threads;
  select tdTable.*
    from sastables.sasTable1,
         td.tdTable (dbmaster=yes)
    where tdTable.idNum = sasTable1.idNum;
quit;
If by chance your SAS data set is larger, you'll want to use the MULTI_DATASRC_OPT= option. Either google these terms or look in the SAS/ACCESS to Relational Databases manual. It's pretty good.
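If you do end up needing MULTI_DATASRC_OPT=, it goes on the LIBNAME statement; a rough sketch with placeholder connection details (check the SAS/ACCESS documentation for the exact behaviour):
/* MULTI_DATASRC_OPT=IN_CLAUSE asks SAS to build an IN list of the join-key
   values and pass it to Teradata instead of pulling the whole table across */
libname td teradata server=tdprod user=myuser password=mypass
        multi_datasrc_opt=in_clause;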
Good luck.
Have you considered creating a volatile table in Teradata? Since this is created in your spool allocation you shouldn't need explicit permissions to create the table. Once created you can load the SAS data set into the Volatile table and collect statistics on the table's join columns and filter columns. This will help the optimizer understand the demographics about your "small" table. The volatile table will only persist for the duration of your session and is not accessible across multiple sessions.
Then rewrite your SAS code to push down the SQL to Teradata, joining the large table to your volatile table (there is a rough sketch of this at the end of this answer). The results can be returned to SAS and loaded into another data set.
CREATE VOLATILE TABLE MyTable, NO FALLBACK
( ColA SMALLINT NOT NULL,
ColB VARCHAR(10) NOT NULL
) PRIMARY INDEX (ColA)
ON COMMIT PRESERVE ROWS /* This is important */
;
The primary index is how Teradata distributes the data and accesses the data. Tables distributed on the same column will join "AMP local" and will not require a redistribution. This is not always possible, as your primary index selection has to consider even distribution as well as access path. The primary index does not have to be unique, but can be.
Hope this helps.
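For what it's worth, a rough sketch of what that push-down could look like with explicit pass-through, once the volatile table above has been created and loaded in the same Teradata session (the connection details, BigTable, and its columns are placeholders):
proc sql;
  connect to teradata (server=tdprod user=myuser password=mypass);
  /* the volatile table only exists for this session, so create and load it
     through this same connection, e.g. with EXECUTE ( ... ) BY TERADATA */
  execute ( collect statistics on MyTable column (ColA) ) by teradata;
  create table work.results as
    select * from connection to teradata
    ( select v.ColA, v.ColB, big.some_measure
      from MyTable v
      join BigTable big
        on big.ColA = v.ColA
    );
  disconnect from teradata;
quit;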