I would like to create a query in Redshift where I pass dates between 25-07-2021 and 24-09-2022 and get the result in MB (table size) for a particular table between those dates.
I assume that by "get result in MB" you mean that, if those matching rows were all placed in a new table, you would like to know how many MB that table would occupy.
Data is stored in Amazon Redshift in different ways, based upon the particular compression type for each column, and therefore the storage taken on disk is specific to the actual data being stored.
The only way to know how much disk space would be occupied by these rows would be to actually create a table with those rows. It is not possible to accurately predict the storage any other way.
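For example, a minimal sketch of that approach, assuming a hypothetical table my_table with a date column event_date (your table and column names will differ):

-- Materialize just the matching rows into a new table
CREATE TABLE my_table_slice AS
SELECT *
FROM my_table
WHERE event_date BETWEEN '2021-07-25' AND '2022-09-24';

-- SVV_TABLE_INFO reports table size as the number of 1 MB blocks used
SELECT "table", size AS size_mb, tbl_rows
FROM svv_table_info
WHERE "table" = 'my_table_slice';

-- DROP TABLE my_table_slice;   -- clean up afterwards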
You could, of course, obtain an approximation by counting the number of rows matching the dates and then taking that as a proportion of the whole table size. For example, if the table contains 1 million rows and the dates matched 50,000 rows, then those rows would represent 5% of the table. However, this would not be a perfectly accurate measure.
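As a rough sketch of that approximation (again with the hypothetical names my_table and event_date), you could combine the matching row count with the total size reported by SVV_TABLE_INFO:

-- approximate size = total size in MB * (matching rows / total rows)
SELECT
    (SELECT size FROM svv_table_info WHERE "table" = 'my_table')
    * SUM(CASE WHEN event_date BETWEEN '2021-07-25' AND '2022-09-24' THEN 1 ELSE 0 END)::float
    / COUNT(*) AS approx_size_mb
FROM my_table;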
Related
Is there a way to calculate the final size of a BigQuery table based on the size of the Cloud Storage data?
For example, an 80GB bucket might turn into a 100GB table.
I want an approximation to know whether the data from a Cloud Storage bucket would end up smaller than 100GB once loaded into BQ.
Thanks!
The answer to your question is: it's hard to say. It will vary as a function of how the data in the GCS files is stored. If you have 80GB of data, the BQ size will be one value if that data is in CSV, another if it is stored in JSON, yet another if it's AVRO, and so on. It will also be a function of the schema types for your columns and how many columns you have. Google has documented how much storage (in BQ) is required for each of the data types:
In the docs on BQ Storage Pricing there is a table showing the amount of data required to store different column types.
If I needed to know the resulting BQ size from a file of data, I would determine each of my resulting columns and the average data size for each column; that would give me the approximate size of a row in the BQ table. From there, I would multiply that by the number of rows in my source files.
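For illustration only, using the per-type sizes from the BQ storage pricing docs and a hypothetical three-column schema: one INT64 (8 bytes) + one FLOAT64 (8 bytes) + one STRING averaging 20 characters (2 bytes + the UTF-8 length, so about 22 bytes) gives roughly 38 bytes per row, so 10 million such rows would come to roughly 380 MB in BQ.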
Another way you might want to try is to load in some existing files one at a time and see what the "apparent" multiplier is. In theory, that might be a good enough indication for given sets of file / table pairs.
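A minimal sketch of that check, assuming a hypothetical dataset my_dataset and a table sample_load created from one of the files (the __TABLES__ metadata view reports the stored size in bytes):

-- Compare the stored size of the loaded sample against the size of the source file
SELECT table_id,
       row_count,
       size_bytes,
       size_bytes / POW(1024, 3) AS size_gb
FROM `my_dataset.__TABLES__`
WHERE table_id = 'sample_load';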
I have S3 with compressed JSON data partitioned by year/month/day.
I was thinking that it might reduce the amount of scanned data if I construct the query with filtering, looking something like this:
...
AND year = 2020
AND month = 10
AND day >= 1
ORDER BY year, month, day DESC
LIMIT 1
Is this combination of partitioning, order and limit an effective measure to reduce the amount of data being scanned per query?
Partitioning is definitely an effective way to reduce the amount of data that is scanned by Athena. A good article that focuses on performance optimization can be found here: https://aws.amazon.com/de/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ - and better performance mostly comes from reducing the amount of data that is scanned.
It's also recommended to store the data in a column-based format, like Parquet, and additionally compress the data. If you store data like that, you can optimize queries by selecting only the columns you need (there is a difference between select * and select col1, col2, ... in this case).
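For example, a rough CTAS sketch for converting the JSON table to partitioned Parquet (table names, bucket path, and columns are hypothetical placeholders):

-- Create a Parquet copy of the JSON table, keeping the year/month/day partitioning
CREATE TABLE events_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/events-parquet/',
    partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT col1, col2,        -- the non-partition columns you actually need
       year, month, day   -- partition columns must come last in the SELECT
FROM events_json;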
ORDER BY definitely doesn't limit the data that is scanned, as you need to scan all of the columns in the ORDER BY clause to be able to order them. As you have JSON as the underlying storage, it most likely reads all data.
LIMIT will potentially reduce the amount of data that is read, it depends on the overall size of the data - if limit is way smaller than the overall count of rows it will help.
In general, I recommend testing queries in the Athena interface in AWS - it tells you the amount of scanned data after a successful execution. I tested this on one of my partitioned tables (based on compressed Parquet):
partition columns in WHERE clause reduces the amount of scanned data
LIMIT further reduces the amount of scanned data in some cases
ORDER BY leads to reading all partitions again, because otherwise the rows can't be sorted
I created two tables with 43,547,563 rows each:
CREATE TABLE metrics_compressed (
some_id bigint ENCODE ZSTD,
some_value varchar(200) ENCODE ZSTD distkey,
...,
some_timestamp bigint ENCODE ZSTD,
...,
PRIMARY KEY (some_id, some_timestamp, some_value)
)
sortkey (some_id, some_timestamp);
The second one is exactly like the first one but without any column compressed.
Running this query (it just counts one row):
select count(*)
from metrics_compressed
where some_id = 20906
and some_timestamp = 1475679898584;
shows a table scan of 42,394,071 rows (from the rows_pre_filter column in svl_query_summary, with column is_rrscan true), while running it over the uncompressed table scans 3,143,856 rows. I guess the reason for this is that the compressed one uses fewer 1MB blocks, hence the scan shows the total number of rows from the retrieved blocks.
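For reference, a sketch of how those numbers can be pulled for the most recent statement (pg_last_query_id() and the svl_query_summary columns are standard Redshift; the layout is just my own):

-- Inspect the scan steps of the last query run in this session
SELECT query, seg, step, rows, rows_pre_filter, is_rrscan
FROM svl_query_summary
WHERE query = pg_last_query_id()
ORDER BY seg, step;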
Are the scanned rows a sign of bad performance? Or does Redshift use some kind of binary search within a block for such simple queries as this one, and the scanned rows is just confusing info for optimizing queries?
In general, you should let Amazon Redshift determine its own compression types. It does this by loading 100,000 rows and determining the optimal compression type to use for each column based on this sample data. It then drops those rows and restarts the load. This happens automatically when a table is first loaded if there is no compression type specified on the columns.
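If you want to see what Redshift would recommend for a table that is already loaded, a quick sketch using the table from the question:

-- Report the recommended encoding per column, based on a sample of the table's data
-- (automatic compression during COPY into an empty table performs the same kind of analysis)
ANALYZE COMPRESSION metrics_compressed;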
The SORTKEY is more important for fast queries than compression, because it allows Redshift to totally skip over blocks that do not contain desired data. In your example, using some_id within the WHERE clause allows it to only look at blocks containing that specific value and since it is also the SORTKEY this will be extremely efficient.
Once a block is identified as potentially containing the SORTKEY data, Redshift will read the block from disk and process the contents.
The general rule is to use DISTKEY for columns most used in JOIN and use SORTKEY for columns most used in WHERE statements (but there are also more subtle variations on those general rules).
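As an illustrative sketch of that rule of thumb (hypothetical table and column names):

-- customer_id is used in joins -> DISTKEY; event_time is used in WHERE filters -> SORTKEY
CREATE TABLE events (
    event_id    bigint,
    customer_id bigint,
    event_time  timestamp,
    payload     varchar(200)
)
DISTKEY (customer_id)
SORTKEY (event_time);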
Would encoding on a join column hurt query performance? I let the COPY command decide the encoding type.
In general, no - an encoding on your DIST KEY will even have a positive impact due to the reduction in disk I/O.
According to the AWS table design playbook, there are a few edge cases where an encoding on your DIST KEY will indeed hurt your query performance:
Your query patterns apply range restricted scans to a column that is very well compressed.
The well compressed column’s blocks each contain a large number of values per block, typically many more values than the actual count of values your query is interested in.
The other columns necessary for the query pattern are large or don’t compress well. These columns are > 10x the size of the well compressed column.
If you want to find the optimal encoding for your table you can use the Redshift column encoding utility.
Amazon Redshift is a column-oriented database, which means that rather than organising data on disk by rows, data is stored by column, and rows are extracted from column storage at runtime. This architecture is particularly well suited to analytics queries on tables with a large number of columns, where most queries only access a subset of all possible dimensions and measures. Amazon Redshift is able to only access those blocks on disk that are for columns included in the SELECT or WHERE clause, and doesn’t have to read all table data to evaluate a query. Data stored by column should also be encoded, which means that it is heavily compressed to offer high read performance. This further means that Amazon Redshift doesn’t require the creation and maintenance of indexes: every column is almost like its own index, with just the right structure for the data being stored.
Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers find large performance gains when they ensure that column encoding is optimally applied.
So, to answer your question: it will not hurt query performance, but it is not a best practice.
There are a couple of details on this from AWS respondents:
AWS Redshift : DISTKEY / SORTKEY columns should be compressed?
Generally:
DISTKEY can be compressed but the first SORTKEY column should be uncompressed (ENCODE raw).
If you have multiple sort keys (compound) the other sort key columns can be compressed.
Also, it is generally recommended to use a commonly filtered date/timestamp column (if one exists) as the first sort key column in a compound sort key.
Finally, if you are joining between very large tables, try using the same dist and sort keys on both tables so Redshift can use a faster merge join.
Based on this, I think that as long as both sides of the join have the same compression, Redshift will join on the compressed values safely.
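Putting those recommendations together, a hedged sketch with a hypothetical orders table that is filtered mostly on order_date and joined on customer_id:

-- First (compound) sort key column left uncompressed, remaining columns encoded
CREATE TABLE orders (
    order_id    bigint        ENCODE ZSTD,
    customer_id bigint        ENCODE ZSTD,   -- DISTKEY may be compressed
    order_date  date          ENCODE RAW,    -- first sort key column: keep uncompressed
    amount      decimal(12,2) ENCODE ZSTD
)
DISTKEY (customer_id)
COMPOUND SORTKEY (order_date, customer_id);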
Question Summary
I can read all values out of the single column of a one-column table quite quickly. How can I read all values just as quickly from a single column of a table that has several other columns as well?
Details
I'm using the C++ API to read a SQLite database containing a single table with 2.2 million records.
The data has a "coordinates" column and (optionally) several other columns. The "coordinates" column is a BLOB and currently is always 8 bytes long. The other columns are a mix of TEXT and REAL, with the TEXT strings anywhere from a few characters to about 100 characters (the lengths vary record by record).
In one experiment, I created the table with the "coordinates" column, plus about 15 other columns. The total database file size was 745 MB. I did a simple
int rc = sqlite3_exec( db, "select coordinates from facilities", ReadSQLiteCallback, NULL, &errorMessage );
and it took 91 seconds to execute.
I then created the table with just the "coordinates" column and no other data columns. The total database file size was 36 MB. I ran the same select statement and it took 1.23 seconds.
I'm trying to understand what accounts for this dramatic difference in speed, and how I can improve the speed when the table has those additional data columns.
I do understand that the larger file means simply more data to read through. But I would expect the slowdown to be at worst more or less linear with the file size (i.e., that it would take maybe 20 times the 1.23 seconds, or about 25 seconds, but not 91 seconds).
Question Part I
I'm not using an index on the file because in general I tend to read most or all of the entire "coordinates" column at once as in the simple select above. So I don't really need an index for sorting or quickly accessing a subset of the records. But perhaps having an index would help the engine move from one variable-sized record to the next more quickly as it reads through all the data?
Is there any other simple idea that might help cut down on those 91 seconds?
Question Part II
Assuming there is no magic bullet for bringing the 91 seconds (when the 15 other data columns are included) down close to the 1.23 seconds (when just the coordinates column is present) in a single table, it seems like I could just use multiple tables, putting the coordinates in one table and the rest of the fields (to which I don't need such quick access) in another.
This sounds like it may be a use for foreign keys, but it seems like my case doesn't necessarily require the complexity of foreign keys, because I have a simple 1-to-1 correspondence between the coordinates table and the other data table -- each row of the coordinates table corresponds to the same row number of the other data table, so it's really just like I've "split" each record across two tables.
So the question is: I can of course manage this splitting by myself, by adding a row to both tables for each of my records, and deleting a row from both tables to delete a record. But is there a way to make SQLite manage this splitting for me (I googled "sqlite split record across tables" but didn't find much)?
Indexes are typically used for searching and sorting.
However, if all the columns actually used in a query are part of a single index, you have a covering index, and the query can be executed without accessing the actual table.
An index on the coordinates column is likely to speed up this query.
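A minimal sketch of that covering index for the facilities table from the question:

-- With this index, "SELECT coordinates FROM facilities" can be answered from the
-- index alone, without touching the much wider table rows
CREATE INDEX idx_facilities_coordinates ON facilities(coordinates);

You can confirm it is being used with EXPLAIN QUERY PLAN SELECT coordinates FROM facilities; which should report a scan using the covering index.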
Even with a 1:1 relationship, you still need to know which rows are associated, so you still need a foreign key in one table. (This also happens to be the primary key, so in effect you just have the primary key column(s) duplicated in both tables.)
If you don't have an INTEGER PRIMARY KEY, you could use the internal ROWID instead of your primary key.
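A rough sketch of that split (hypothetical names; the shared INTEGER PRIMARY KEY doubles as the rowid in both tables):

-- Narrow table: fast to scan in full
CREATE TABLE facility_coordinates (
    id          INTEGER PRIMARY KEY,
    coordinates BLOB
);

-- Wide table: id refers to the same logical record
CREATE TABLE facility_details (
    id      INTEGER PRIMARY KEY REFERENCES facility_coordinates(id),
    name    TEXT,
    address TEXT,
    rating  REAL
);

-- Fetch all coordinates quickly:
--   SELECT coordinates FROM facility_coordinates;
-- Fetch a full record when needed:
--   SELECT c.coordinates, d.name, d.address
--   FROM facility_coordinates c JOIN facility_details d ON d.id = c.id;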