Size of a BQ Table from GCS - google-cloud-platform

Is there a way to calculate the final size of a BigQuery table based on the size of the Cloud Storage data?
For example, an 80GB bucket might end up as a 100GB table.
I want an approximation so I can tell whether the data in a Cloud Storage bucket would come out under 100GB once loaded into BQ.
Thanks!

This is hard to answer precisely. The result depends on how the data in your GCS files is stored: 80GB of CSV will produce one BigQuery table size, the same data as JSON another, Avro yet another, and so on. It is also a function of the schema types of your columns and how many columns you have. Google has documented how much BQ storage each data type requires:
In the docs on BQ Storage Pricing there is a table showing the amount of storage required for each column data type.
If I needed to know the resulting BQ size from a file of data, I would work out each of the resulting columns and the (average) data size for each column; that gives an approximate size for a row in the BQ table. From there, I would multiply by the number of rows in my source files.
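As a rough sketch of that arithmetic (the three-column schema is made up; the per-type byte counts are the ones from the storage pricing table):

-- Hypothetical schema: id INT64 (8 bytes), amount FLOAT64 (8 bytes),
-- name STRING (2 bytes + the UTF-8 text, say 20 bytes on average).
-- Approximate row size: 8 + 8 + (2 + 20) = 38 bytes.
-- 100 million source rows would then be roughly 3.8 GB of BigQuery storage.
SELECT 100000000 * (8 + 8 + 2 + 20) / POW(10, 9) AS approx_gb;  -- ≈ 3.8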
Another approach is to load some existing files one at a time and see what the "apparent" multiplier is. In practice, that may be a good enough indication for a given set of file/table pairs.
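For that second approach, one way to read back the loaded size is BigQuery's own table metadata (the project, dataset and table names below are placeholders); dividing it by the source file's size in GCS gives the apparent multiplier:

-- Assumes a sample file has already been loaded from GCS into my_dataset.sample_table.
SELECT table_id, row_count, size_bytes
FROM `my_project.my_dataset.__TABLES__`
WHERE table_id = 'sample_table';
-- size_bytes divided by the file's size in GCS (e.g. from the console or gsutil du)
-- gives the apparent GCS-to-BQ multiplier for that file format and schema.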

Related

Redshift table size identification based on date

I would like to create a query in Redshift where I pass a date range, e.g. between 25-07-2021 and 24-09-2022, and get back the result in MB (table size) for a particular table between those dates.
I assume that by "get result in MB" you are saying that, if those matching rows were all placed in a new table, you would like to know how many MB that table would occupy.
Data is stored in Amazon Redshift in different ways, based upon the particular compression type for each column, and therefore the storage taken on disk is specific to the actual data being stored.
The only way to know how much disk space would be occupied by these rows would be to actually create a table with those rows. It is not possible to accurately predict the storage any other way.
You could, of course, obtain an approximation by counting the number of rows matching the dates and then taking that as a proportion of the whole table size. For example, if the table contains 1m rows and the dates matched 50,000 rows, then they would represent 50,000/1,000,000 (5%). However, this would not be a perfectly accurate measure.
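A sketch of that proportional estimate as a single query (the table and date column names are made up; SVV_TABLE_INFO reports table size in 1 MB blocks):

-- Approximate MB occupied by the rows in the date range, assuming they are
-- roughly the same width as the average row in the table.
SELECT t.size * m.matching_rows::float / m.total_rows AS approx_mb
FROM svv_table_info t
CROSS JOIN (
    SELECT COUNT(*) AS total_rows,
           COUNT(CASE WHEN order_date BETWEEN '2021-07-25' AND '2022-09-24'
                      THEN 1 END) AS matching_rows
    FROM my_table
) m
WHERE t."table" = 'my_table';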

Amazon Athena scans lots of data when query involves only partitions

I have a table on Athena partitioned by day (huge table, TB of data). There's no day column on the table, at least not explicitly. I would expect that a query like the following:
select max(day) from my_table
would scan virtually no data. However, Athena reports that several hundreds of GB are scanned. Any idea why?
=== EDIT 2021-01-14 ===
I've recently bumped into this issue again. It turns out that when the underlying data is Parquet, operations on partitions don't scan any data. For the other formats I've tried (including ORC) there is an associated scan cost. It doesn't make any sense to me.
I don't know the answer for a fact, but my guess is:
Athena simply does not have the optimization of looking only at the partition names when only they are queried; this is clear from its behaviour, so it scans everything.
Parquet stores min/max values for every column, whereas ORC does so only if an index is present, as far as I understand. So for Parquet, Athena's query optimizer can read those rollup values directly and no scan is performed; it's different for ORC.
I know it's a little late to answer this question for you, Nicolas, but it is important to record some possible solutions here as well.
Unfortunately, this is the way Athena works: it will read all the data as a table scan just to list the partition values.
A workaround that works well here is to query the partition metadata instead of the data itself. For example:
Instead of using this syntax:
select max(day) from my_table
Try to use this syntax:
SELECT day FROM my_schema."my_table$partitions" ORDER BY day DESC LIMIT 1
This second statement reads only metadata and returns the same value you need.
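If you just need to eyeball the partitions interactively rather than use them inside a SELECT, Athena also accepts the Hive-style statement below (same placeholder table name), which likewise reads only metadata:

SHOW PARTITIONS my_table;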
It does not depend on the format but on the compression algorithm used: mostly Snappy for ORC and GZIP for Parquet. That is what makes the difference.

Spanner - How to find table size

Cloud Spanner shows Total storage in the database details.
Can I get this value with a query?
Also, how do I find the table sizes (by query), so that adding them up gives the total storage size?
Thank you
Cloud Spanner table sizes aren't currently exposed.
To get a general idea of the table size proportions, we check the resulting Avro file sizes when exporting the DB (we have a daily export job for backup).
Of course this is not accurate, since Avro uses a completely different storage model.
Cloud Spanner just released a table size statistics system table that includes table and index sizes. Sample query from the docs:
SELECT interval_end
,table_name
,used_bytes
FROM spanner_sys.table_sizes_stats_1hour
WHERE interval_end =
(
SELECT MAX(interval_end)
FROM spanner_sys.table_sizes_stats_1hour
)
ORDER BY used_bytes DESC;
See full details at https://cloud.google.com/spanner/docs/introspection/table-sizes-statistics
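To get something close to the console's Total storage figure, a sketch that sums the same system table at the latest interval (this covers tables and indexes; the console number may include additional overhead):

SELECT SUM(used_bytes) AS total_used_bytes
FROM spanner_sys.table_sizes_stats_1hour
WHERE interval_end = (
  SELECT MAX(interval_end)
  FROM spanner_sys.table_sizes_stats_1hour
);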

Dist and Sort Keys Redshift

I'm trying to add dist and sort keys to some of the tables in Redshift.
I notice that before adding them the size of the table is 0.50, and after adding them it increases to 0.51 or 0.52. Is this possible? The whole purpose of having dist and sort keys is to decrease the size of the table and help increase read/write performance.
That is not the purpose of having a DISTKEY and SORTKEY.
To decrease the storage size of a table, use compression.
The DISTKEY is used to distribute data amongst slices. By co-locating information on the same slice, queries can run faster. For example, if you had these tables:
customer table, DISTKEY = customer_id
invoices table, DISTKEY = customer_id
...then these tables would be distributed in the same manner. All records in both tables for a given customer_id would be located on the same slice, thereby avoiding the need to transfer data between slices. The DISTKEY should be the column that is most often used for JOINs.
The SORTKEY is used to sort data on disk, for the benefit of Zone Maps. Each storage block on disk is 1MB in size and contains data for only one column in one table. The data for this column is sorted, then stored in multiple blocks. The Zone Map associated with each block identifies the minimum and maximum values stored within that block. Then, when a query is run with a WHERE statement, Amazon Redshift only needs to read the blocks that contain the desired range of data. By skipping over blocks that do not contain data within the WHERE clause, Redshift can run queries much faster.
The above can all work together. For example, compressed data requires fewer blocks, which also allows Redshift to skip over more data based on the Zone Maps. To get the best possible performance out of queries, use DISTKEY, SORTKEY and compression together.
(It is often recommended not to compress the first SORTKEY column, because compression packs so many rows into each block that range-restricted scans end up reading more rows than they need.)
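As a minimal sketch of combining the three (the table, columns and encodings here are hypothetical, not a recommendation for your schema):

CREATE TABLE invoices (
    invoice_id   BIGINT        ENCODE az64,
    customer_id  BIGINT        ENCODE az64,   -- join/distribution column
    invoice_date DATE          ENCODE raw,    -- first sort key column left uncompressed
    amount       DECIMAL(12,2) ENCODE az64,
    notes        VARCHAR(256)  ENCODE lzo
)
DISTKEY (customer_id)
SORTKEY (invoice_date);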
See also: Top 10 Performance Tuning Techniques for Amazon Redshift

Redshift performance: encoding on join column

Would encoding on a join column hurt query performance? I let the COPY command decide the encoding type.
In general, no - encoding on your DISTKEY will even have a positive impact, due to the reduction in disk I/O.
According to the AWS table design playbook, there are a few edge cases where an encoding on your DISTKEY will indeed hurt your query performance:
Your query patterns apply range-restricted scans to a column that is very well compressed.
The well-compressed column’s blocks each contain a large number of values per block, typically many more values than the actual count of values your query is interested in.
The other columns necessary for the query pattern are large or don’t compress well. These columns are more than 10x the size of the well-compressed column.
If you want to find the optimal encoding for your table you can use the Redshift column encoding utility.
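As a built-in alternative sketch, Redshift can also sample an existing table and suggest a per-column encoding (placeholder table name):

ANALYZE COMPRESSION my_table;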
Amazon Redshift is a column-oriented database, which means that rather than organising data on disk by rows, data is stored by column, and rows are extracted from column storage at runtime. This architecture is particularly well suited to analytics queries on tables with a large number of columns, where most queries only access a subset of all possible dimensions and measures. Amazon Redshift is able to only access those blocks on disk that are for columns included in the SELECT or WHERE clause, and doesn’t have to read all table data to evaluate a query. Data stored by column should also be encoded, which means that it is heavily compressed to offer high read performance. This further means that Amazon Redshift doesn’t require the creation and maintenance of indexes: every column is almost like its own index, with just the right structure for the data being stored.
Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers find large performance gains when they ensure that column encoding is optimally applied.
So, to answer your question: it will not hurt query performance, but leaving columns unencoded is not a best practice.
There are a couple of details on this from AWS respondents:
AWS Redshift : DISTKEY / SORTKEY columns should be compressed?
Generally:
DISTKEY can be compressed, but the first SORTKEY column should be uncompressed (ENCODE raw).
If you have multiple sort keys (compound), the other sort key columns can be compressed.
Also, generally recommend using a commonly filtered date/timestamp column (if one exists) as the first sort key column in a compound sort key.
Finally, if you are joining between very large tables, try using the same dist and sort keys on both tables so Redshift can use a faster merge join.
Based on this, I think that as long as both sides of the join have the same compression, Redshift will join on the compressed values safely.
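To check whether the merge join mentioned in that guidance actually kicks in, a quick sketch with hypothetical tables that share customer_id as both dist key and sort key:

EXPLAIN
SELECT c.customer_id, i.amount
FROM customer c
JOIN invoices i ON i.customer_id = c.customer_id;
-- Look for a Merge Join rather than a Hash Join in the plan; Redshift only
-- picks it when both tables are distributed and sorted on the join column
-- and the data is mostly in sorted order (i.e. recently vacuumed).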