Cannot create table in QuestDB: "No space left on device" error

When I try to create a table with an index
CREATE TABLE NEW AS (SELECT DISTINCT * FROM OLD),
INDEX(RIC CAPACITY 1000000)
PARTITION BY MONTH
I get back the error:
io.questdb.cairo.CairoException: [2] No space left on device [need=99472113664]
I have 800+ GB free on the filesystem, and table OLD is not particularly big, a few GB on disk. Any idea why I get this error?

Index capacity is how many rows you expect per symbol value on average. If you specify a capacity of 1 million, QuestDB will allocate about 7 MB of data per symbol value. If you then insert 150k distinct symbol values, the table will try to allocate roughly 1 TB of space.
If you have few distinct symbol values and many rows for each of them, increase the index capacity. If you have many distinct symbol values, increase the symbol capacity instead.
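For illustration, a minimal sketch reusing the names from the question with a much smaller index capacity (256 is QuestDB's default index value block capacity, not a tuned recommendation for your data):
CREATE TABLE NEW AS (SELECT DISTINCT * FROM OLD),
INDEX(RIC CAPACITY 256)
PARTITION BY MONTH;
Pick a capacity close to the average number of rows you expect per distinct RIC value.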

Related

What does this EXPLAIN query plan output mean for my Redshift query?

I ran this:
EXPLAIN select id, birth_date, ROW_NUMBER() OVER (ORDER BY 1) AS load_id from user_profile;
and I see this:
WindowAgg (cost=0.00..133833424.40 rows=30901176 width=36)
-> Seq Scan on user_profile (cost=0.00..133369906.76 rows=30901176 width=28)
What does this query plan mean?
The query plan is the execution plan that the PostgreSQL planner (Amazon Redshift is based on PostgreSQL) has generated for your SQL statement.
The first node is a window aggregation (WindowAgg) over the data as you're using the OVER window function to calculate a row number.
The second node is a sequential scan (Seq Scan) on the user_profile table, as you're doing a full select of the table without any filtering.
A sequential scan reads the entire table as stored on disk, since your query requires a full traversal of the table. Even if there were a multi-column index on id & birth_date, the query engine would pretty much always go for a sequential scan here since you need every row (depending on the random_page_cost & enable_seqscan parameters in PostgreSQL).
The cost number is in arbitrary units, but conventionally represents the number of disk page fetches; it's split into two values separated by the .. delimiter.
The first value shows the startup cost - this is the estimated cost to return the first row. The second value shows the total cost - this is the estimated cost to return all rows.
For example, for the Seq Scan, the startup cost is 0 and the total cost is estimated to be 133369906.76.
For sequential scans, the startup cost is usually 0; there's nothing to do other than read and return data, so it can start returning rows right away. The total cost for a node includes the cost of all its child nodes as well - in this case, the final total cost of both operations is 133833424.40, which is the scan cost plus the aggregation cost.
The rows value is the estimated number of rows that will be returned. In this case, both operations have the same value, as the aggregation applies to all rows and no filtering is carried out that would reduce the number of final rows.
The width value is the estimated size in bytes of each returned row, i.e. each row will most likely be 28 bytes in length before the aggregation and 36 bytes after the aggregation.
Putting that all together, you could read the query plan as such:
Sequential Scan on table user_profile
- will most likely start returning rows immediately
- estimated disk page fetch count of 133369906.76
- estimated 30,901,176 rows to be returned
- estimated total row size of 28 bytes
Window Aggregation on data from the above operation
- will most likely start returning rows immediately
- estimated disk page fetch count of 133833424.40
- estimated 30,901,176 rows to be returned
- estimated total row size of 36 bytes
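For contrast, a sketch of how a filter changes the plan (the column and date below are hypothetical, and the exact cost and row numbers depend entirely on your data): adding a WHERE clause typically produces a Filter condition under the Seq Scan node and a lower rows estimate, since the planner expects fewer rows to survive the filter.
EXPLAIN
SELECT id, birth_date, ROW_NUMBER() OVER (ORDER BY 1) AS load_id
FROM user_profile
WHERE birth_date >= '1990-01-01';  -- hypothetical filter; expect a Filter step and a smaller rows estimate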

Redshift table size identification based on date

I would like to create a query in Redshift where I pass dates, for example between 25-07-2021 and 24-09-2022, and get the result in MB (table size) for a particular table between those dates.
I assume that by "get result in MB" you are saying that, if those matching rows were all placed in a new table, you would like to know how many MB that table would occupy.
Data is stored in Amazon Redshift in different ways, based upon the particular compression type for each column, and therefore the storage taken on disk is specific to the actual data being stored.
The only way to know how much disk space would be occupied by these rows would be to actually create a table with those rows. It is not possible to accurately predict the storage any other way.
You could, of course, obtain an approximation by counting the number of rows matching the dates and then taking that as a proportion of the whole table size. For example, if the table contains 1m rows and the dates matched 50,000 rows, then they would represent 50,000/1,000,000 (5%) of the table. However, this would not be a perfectly accurate measure.
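A sketch of that approximation (the schema, table, and date column names are placeholders; SVV_TABLE_INFO reports size in 1 MB blocks, i.e. effectively MB):
WITH matched AS (
    SELECT COUNT(*) AS matched_rows
    FROM my_schema.my_table
    WHERE event_date BETWEEN '2021-07-25' AND '2022-09-24'
)
SELECT ti.size * matched.matched_rows::float / NULLIF(ti.tbl_rows, 0) AS approx_mb
FROM svv_table_info ti, matched
WHERE ti."schema" = 'my_schema'
  AND ti."table" = 'my_table';
Again, this only scales the whole-table size by the matched-row fraction; per-column compression means the true size of a table built from just those rows could differ noticeably.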

Linear LATEST BY performance in QuestDB

I observe a linear time increase with the number of rows in the table when I query LATEST BY symbol in QuestDB, as if it does a full scan to find the values. Here is my table:
CREATE TABLE metric(
  ObjectType SYMBOL capacity 2 cache,
  ObjectId SYMBOL capacity 20000 cache,
  Group SYMBOL capacity 4000 cache,
  Region SYMBOL capacity 20 cache,
  CC SYMBOL capacity 50 cache,
  value DOUBLE,
  timestamp TIMESTAMP
)
timestamp(timestamp)
PARTITION BY DAY;
And query is
select value from metric
LATEST BY ObjectType
where ObjectType = 'queue'
I'd expect constant or logarithmic time growth for it.
An index is needed to avoid the full table scan and to keep LATEST BY query performance from degrading linearly with table size.
Try
ALTER TABLE metric ALTER COLUMN ObjectType ADD INDEX
or create the table with the index in place from the start:
CREATE TABLE metric(
  ObjectType SYMBOL capacity 2 cache index,
  ObjectId SYMBOL capacity 20000 cache,
  Group SYMBOL capacity 4000 cache,
  Region SYMBOL capacity 20 cache,
  CC SYMBOL capacity 50 cache,
  value DOUBLE,
  timestamp TIMESTAMP
)
timestamp(timestamp)
PARTITION BY DAY;
A SYMBOL column by itself is not indexed; it merely means that repeated string values are stored as integers in the column, with a separate lookup table (the symbol dictionary) translating from integer to string.
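If your QuestDB version supports SHOW COLUMNS (an assumption about your server version, so treat this as a sketch), you can confirm the index took effect:
SHOW COLUMNS FROM metric;
-- the indexed flag for ObjectType should now be true
Otherwise, re-running the LATEST BY query as the table grows and checking that its execution time no longer increases with row count serves the same purpose.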

RedShift Deep Copy Without INSERT_XID (Hidden metadata) Column Data

I have a very long, narrow table in AWS Redshift. It has fallen victim to the issue of the hidden metadata column, INSERT_XID, being hugely disproportionate in size compared to the visible data.
Picture a table of 632K rows that has 22 GB of visible data and a hidden column of 83 GB.
I want to reclaim that space, but VACUUM has no effect on it.
I tried copying the table:
BEGIN;
CREATE TABLE test.copied (like prod.table);
INSERT INTO test.copied (select * from prod.table);
COMMIT;
This results in a true deep copy where the hidden metadata column is still very large. I was hoping that copying the table in one go into a new one would allow the hidden INSERT_XID column to compress, but it failed to do so.
Any ideas how I can optimize this hidden column in AWS Redshift?
I measured the size of each column with the following:
SELECT col, attname, COUNT(*) AS "mbs"
FROM stv_blocklist bl
JOIN stv_tbl_perm perm
ON bl.tbl = perm.id AND bl.slice = perm.slice
LEFT JOIN pg_attribute attr ON
attr.attrelid = bl.tbl
AND attr.attnum-1 = bl.col
WHERE perm.name = 'table_name'
GROUP BY col, attname
ORDER BY col;
Update:
I also tried an UNLOAD of this table into S3 and then a single COPY back into a new table, and the size of the hidden column was unchanged. I'm not sure if this is even resolvable.
Thank you!
I did some math on the numbers you provided and I think you may be running into 1MB block size quanta effects. However, the math still doesn't work out.
Redshift stores your data around the cluster per the table's distribution style. For non-DISTSTYLE-ALL tables this means that each column has rows on each slice of the cluster. The minimum unit of storage on Redshift, a block, is 1 MB in size. When you have a small (for Redshift) number of rows in your table, there isn't enough data on each slice to fill even one block, so there is a lot of wasted space on disk.
If you have a table of, say, 2 columns with 630K rows and you are working on a cluster that has 1024 slices (like 32 nodes of dc2.8xl), then these effects can be quite pronounced. Each slice has only about 615 rows on average, nowhere close to filling a 1 MB block. So the non-metadata portion of this table will take up 2 x 1024 x 1 MB = 2.048 GB. As you can see, even in this case, I can only get to one tenth of what you are showing.
I could rerun this with 20 columns instead of 2 and I would get up to your 22 GB figure, but then the size of the metadata columns wouldn't make a whole lot of sense - they aren't that inefficient. It is possible that I'm not looking at configurations like what you have - 4000 slices? 8 columns?
22 GB of space is 22,000 blocks spread across the slices and columns of your cluster / table. Knowing your column count and cluster configuration will greatly help in understanding how the data is being stored.
Recommendation - move this table to DISTSTYLE ALL and you will save greatly on storage space. 600K rows is tiny for Redshift, and spreading the data across all the slices is just inefficient. Be advised that DISTSTYLE ALL has query compilation implications - mostly positive, but not all, so monitor your query performance if you make this change.
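A sketch of that change, using the table name from the question:
ALTER TABLE prod.table ALTER DISTSTYLE ALL;
After the change, re-running the stv_blocklist query above should show far fewer blocks per column, since with DISTSTYLE ALL each column only needs blocks on one slice per node rather than on every slice of the cluster.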

List size per sqlite column

I have imported a list of characters, about 100 MB in size, into a column in a sqlite3 DB.
When I iterate through the list it's fast enough, but opening the database takes too long.
Is there a maximum or limit on the size per column?