Creating tables with dynamic columns in Apache Calcite

There are two tables in a graph database.
User { id, name}
Group { id, name}
User is connected to Group via an edge. Now I want to query this via Apache Calcite with a where clause such as
select * from User where User.Group.id="Foo"
Since Apache Calcite accepts a Schema with predefined Tables with predefined columns, the above query fails in the validation step. One way to achieve this is to define User with four columns as {id, name, Group.id, Group.name}. The problem is that, in my case, a table can be connected to more than one other table, and the nesting can go up to 6 levels deep. Creating a table with all the columns of its child classes will lead to a table with a lot of dynamic columns.
Is there a way to define the columns of a table as they appear in the query?

Look at the resolved issue https://issues.apache.org/jira/browse/CALCITE-1150.
It introduces DynamicRecordType to Apache Calcite. Here is the proposed specification: https://docs.google.com/document/d/1vCWlqRyJQCtYbtVAjGOKP-8BD4_hrhoM9-4qbdoJs6k/edit.
I think it's used by the Apache Drill project, see https://github.com/apache/drill/search?q=DynamicRecordType.

Related

With SQL or Python how can I find out if a table is part of a sharded set of tables (in BigQuery)?

I want to find out what my table sizes are (in BigQuery).
However, I want to sum up the size of all tables that belong to a specific set of sharded tables.
So I need to find metadata that shows that a table is part of a set of sharded tables.
So far I can do this (from "How to get BigQuery storage size for a single table"):
select
sum(size_bytes)/pow(2, 30) as size_gb
from
<your_dataset>.__TABLES__
But here I can't see whether a table is part of a set of sharded tables.
This is what my Google Analytics sharded tables look like in BQ:
So somewhere there must be metadata indicating that tables named, for example, ga_sessions_20220504 belong to the sharded set ga_sessions_.
Where/how can I find that metadata?
I think you are exploring the right query. Most of the time I use the following query to drill down into shards and their sizes:
SELECT
project_id,
dataset_id,
table_id,
array_reverse(SPLIT(table_id, '_'))[OFFSET(0)] AS shard_pt,
DATE(TIMESTAMP_MILLIS(creation_time)) creation_dt,
ROUND(size_bytes/POW(1024, 3), 2) size_in_gb
FROM
`<project>.<dataset>.__TABLES__`
WHERE
table_id LIKE 'ga_sessions_%'
ORDER BY
4 DESC
Result (on some random GA dataset I have access to FYI)
There is no metadata on sharded tables available via SQL.
Tables are displayed as sharded in the BigQuery UI when you do the following:
Create 2 or more tables that have the following characteristics:
exist in the same dataset
have the exact same table schema
have the same prefix
have a suffix of the form _YYYYMMDD (e.g. _20210130)
These are something of a legacy feature; they were more commonly used with BigQuery's legacy SQL.
This blog was very insightful on this:
https://mark-mccracken.medium.com/bigquery-date-sharding-vs-date-partitioning-cee3754f7900
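Since the grouping is purely a naming convention, one way to total sizes per sharded set is to apply that convention yourself to the rows returned from __TABLES__. The following is a minimal Python sketch under stated assumptions: the rows are already fetched as (table_id, size_bytes) pairs, the function names are hypothetical, and only the prefix plus 8-digit suffix is checked (not schema equality or whether the suffix is a valid date).

```python
import re
from collections import defaultdict

# A shard name is assumed to be "<prefix>_" followed by exactly 8 digits (YYYYMMDD).
SHARD_RE = re.compile(r"^(?P<prefix>.+_)(?P<suffix>\d{8})$")


def group_by_shard_prefix(rows):
    """Sum table count and size per shard prefix.

    rows: iterable of (table_id, size_bytes) pairs, e.g. fetched from __TABLES__.
    Tables that do not match the naming pattern are keyed by their full name.
    """
    groups = defaultdict(lambda: {"tables": 0, "size_bytes": 0})
    for table_id, size_bytes in rows:
        match = SHARD_RE.match(table_id)
        key = match.group("prefix") if match else table_id
        groups[key]["tables"] += 1
        groups[key]["size_bytes"] += size_bytes
    return dict(groups)


def sharded_sets(rows):
    """Keep only prefixes shared by 2 or more tables, mirroring the UI rule above."""
    return {
        key: stats
        for key, stats in group_by_shard_prefix(rows).items()
        if key.endswith("_") and stats["tables"] >= 2
    }
```

Dividing size_bytes by 2**30 gives the size in GB, as in the queries above.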

Concatenation of data in Superset

There are two tables: one collects facts on a daily basis, the other on a monthly basis, with the same set of attributes (for example, region, city, technology).
I need to calculate this formula in Superset:
SUM(t1.count_exp) / SUM(t2.count_base)
It should be visualized correctly when calculated by region, by city, or by region + city + technology per month.
In other BI systems, the GROUP BY is performed first, then the join is executed and the formula above is calculated, which gives the desired result. How can I achieve a similar result in Superset?
Assuming both tables are in the same database, you can write your own query joining the two tables in 'SQL Lab' and then visualize the query results using the 'Explore' option available there.
Once you click 'Explore' in SQL Lab, Superset creates a virtual dataset (table) from the results of the SQL query. Any filters/GROUP BY/limit applied to this virtual dataset from a visualization are run on top of this query.
https://superset.apache.org/docs/frequently-asked-questions
A view is a simple logical layer that abstract an arbitrary SQL
queries as a virtual table. This can allow you to join and union
multiple tables, and to apply some transformation using arbitrary SQL
expressions. The limitation there is your database performance as
Superset effectively will run a query on top of your query (view). A
good practice may be to limit yourself to joining your main large
table to one or many small tables only, and avoid using GROUP BY where
possible as Superset will do its own GROUP BY and doing the work twice
might slow down performance.
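The "group first, then join" behaviour from the question can be approximated by writing the SQL Lab query so that the daily table is rolled up to the monthly grain before it is joined. The sketch below uses Python's sqlite3 only to make the idea runnable; the table names (daily_facts, monthly_facts), the column layout, and the sample rows are assumptions, not taken from the question. The outer SELECT in ratio_by stands in for the query Superset would generate on top of the virtual dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_facts (day TEXT, region TEXT, city TEXT, technology TEXT, count_exp REAL);
CREATE TABLE monthly_facts (month TEXT, region TEXT, city TEXT, technology TEXT, count_base REAL);
INSERT INTO daily_facts VALUES
  ('2024-01-01', 'North', 'A', '4G', 10),
  ('2024-01-02', 'North', 'A', '4G', 20),
  ('2024-01-01', 'North', 'B', '4G', 30);
INSERT INTO monthly_facts VALUES
  ('2024-01', 'North', 'A', '4G', 100),
  ('2024-01', 'North', 'B', '4G', 200);
""")

# Query one might save as the virtual dataset: daily rows are aggregated to
# month + attributes first, so each monthly row is joined to at most one row
# and count_base is not repeated once per day.
VIRTUAL_DATASET_SQL = """
WITH daily_by_month AS (
  SELECT substr(day, 1, 7) AS month, region, city, technology,
         SUM(count_exp) AS count_exp
  FROM daily_facts
  GROUP BY substr(day, 1, 7), region, city, technology
)
SELECT m.month, m.region, m.city, m.technology, d.count_exp, m.count_base
FROM monthly_facts AS m
LEFT JOIN daily_by_month AS d
  ON d.month = m.month AND d.region = m.region
 AND d.city = m.city AND d.technology = m.technology
"""

ALLOWED_DIMENSIONS = ("month", "region", "city", "technology")


def ratio_by(connection, dimension):
    """Compute SUM(count_exp) / SUM(count_base) grouped by one dimension."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    sql = (
        f"SELECT {dimension}, SUM(count_exp) / SUM(count_base) "
        f"FROM ({VIRTUAL_DATASET_SQL}) AS v GROUP BY {dimension}"
    )
    return dict(connection.execute(sql).fetchall())
```

With this shape, the metric SUM(count_exp) / SUM(count_base) can be grouped by any subset of the attributes without the monthly values being double-counted, at the cost of one GROUP BY inside the dataset query.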

Problems loading data in to Analysis Services Model

I’m building a model in Azure Analysis Services. The model should contain only data for the last 3 months and is processed every day.
I have a separate date dimension that is related to a fact table through a datekey. I’m using a Power Query query to load only the last 3 months into the date dimension. In the Power Query query that loads the fact table, I used Table.NestedJoin to load only the rows that have a matching value in the date table.
When I do this, processing the model takes forever. After some troubleshooting I saw that the query Analysis Services uses to retrieve data from the SQL database retrieves all rows. So, am I correct in saying that AS loads all the data before it merges the rows? Is there a way to change this? Or is there a better way to achieve my solution?
Kind regards,
Joins are very slow in Power Query. Avoid them if you can do the join in the data source or use normal relationships in the data model.
Also, you can set up the date dimension in DAX and populate it dynamically so that it contains only dates present in the fact table.
As for loading all the data, it could be because the data is fetched as is, and only then does Power Query apply the transformations (the join).
You can modify the query in the Power Query Editor / Advanced Editor to add a WHERE clause directly in the query.

Azure SQL Data Warehouse CTAS statistics

Does the "Create table as" function in SQL Data Warehouse create statistics in the background, or do they have to be created manually (as I would when I do a normal "Create table" statement)?
As of the current version, you always have to create column-level statistics on tables, irrespective of whether it was created with a normal CREATE TABLE or the CTAS CREATE TABLE AS... command. It's also good practice to create stats for columns used in JOINs, WHERE clauses, GROUP BY, ORDER BY and DISTINCT clauses.
Regarding tables created with CTAS, the database engine has a correct idea of how many rows are in the table as listed in sys.partitions, but not at the column-level statistics level. For tables created by CREATE TABLE this defaults to 1,000 rows. For the example below, the first table was created with a CTAS and has 208 rows, the second table with an ordinary CREATE TABLE and INSERT from the first table and also has 208 rows, but sys.partitions believes it to have 1,000 eg
Creating any column-level statistics manually will correct this number.
In summary, always manually create statistics against important columns irrespective of how the table was created.

How can I create a model with ActiveRecord capabilities but without an actual table behind?

I think this is a recurring question on the Internet, but unfortunately I still can't find a satisfactory answer.
I'm using Ruby on Rails 4 and I would like to create a model that interfaces with a SQL query, not with an actual table in the database. For example, let's suppose I have two tables in my database: Questions and Answers. I want to make a report that contains statistics from both tables. For this purpose, I have a complex SQL statement that takes data from these tables to build up the statistics. However, the SELECT used in the SQL statement does not take values directly from either the Answers or the Questions table, but from nested SELECTs.
So far I've been able to create the StatItem model, without any migration, but when I try StatItem.find_by_sql("...nested selects...") the system complains about a nonexistent table stat_items in the database.
How can I create a model whose instances' data is retrieved from a complex query and not from a table? If that's not possible, I could create a temporary table to store the data. In that case, how can I tell the migration file not to create such a table (it would be created by the query)?
How about creating a materialized view from your complex query and following this tutorial:
ActiveRecord + PostgreSQL Materialized Views
Michael Kohl and his proposal of materialized views has given me an idea, which I initially discarded because I wrongly thought that a single database connection could be shared by two processes, but after reading about how Rails processes requests, I think my solution is fine.
STEP 1 - Create the model without migration
rails g model StatItem --migration=false
STEP 2 - Create a temporary table called stat_items
# First, drop any existing table created by older requests (database connections are kept open by the server processes).
ActiveRecord::Base.connection.execute('DROP TABLE IF EXISTS stat_items')
# Second, create the temporary table with the desired columns (note: a dummy column called 'id:integer' should exist in the table).
ActiveRecord::Base.connection.execute('CREATE TEMP TABLE stat_items (id integer, ...)')
STEP 3 - Execute an SQL statement that inserts rows in stat_items
STEP 4 - Access the table using the model, as usual
For example:
StatItem.find_by_...
Any comments/improvements are highly appreciated.
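The same drop / create / fill / read sequence can be tried outside Rails to check the SQL before wiring it into a model. The sketch below is not Rails code: it uses Python's sqlite3 with an in-memory database, and the questions/answers schema, the stat_items columns, and the nested SELECT are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE answers (id INTEGER PRIMARY KEY, question_id INTEGER, body TEXT);
INSERT INTO questions VALUES (1, 'first'), (2, 'second');
INSERT INTO answers VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c');
""")


def rebuild_stat_items(connection):
    """Recreate the temporary stat_items table and return its rows."""
    # STEP 2: drop any leftover table on this connection, then create it again.
    connection.execute("DROP TABLE IF EXISTS stat_items")
    connection.execute(
        "CREATE TEMP TABLE stat_items "
        "(id INTEGER, question_id INTEGER, answer_count INTEGER)"
    )
    # STEP 3: fill it from a query with a nested SELECT (hypothetical statistic).
    connection.execute("""
        INSERT INTO stat_items (id, question_id, answer_count)
        SELECT q.id, q.id,
               (SELECT COUNT(*) FROM answers AS a WHERE a.question_id = q.id)
        FROM questions AS q
    """)
    # STEP 4: read it back like an ordinary table.
    return connection.execute(
        "SELECT question_id, answer_count FROM stat_items ORDER BY question_id"
    ).fetchall()
```

Because the table is temporary, it lives only on the connection that created it, which is why the DROP TABLE IF EXISTS step matters when connections are reused between requests.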