I have a relatively large data set in a table with about 60 columns, of which about 20 have gone stale. I've found a few posts on dropping multiple columns and the performance of DROP COLUMN, but nothing on whether or not dropping a bunch of columns would result in a noticeable performance increase.
Any insight as to whether or not something like this could have a perceptible impact?
Dropping one or more columns can be done in a single ALTER TABLE statement and is very fast. All it needs is a short ACCESS EXCLUSIVE lock on the table, so long-running queries on the table would block it.
The table is not rewritten by this operation, and it will not shrink. A subsequent rewrite (with VACUUM (FULL) or similar) will get rid of the dropped columns' data.
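For reference, a minimal sketch of what this could look like from a Java application over JDBC; the connection details, table name, and column names are placeholders, not anything from the question:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
public class DropStaleColumns {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- adjust for your environment.
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "myuser", "mypassword");
             Statement stmt = conn.createStatement()) {
            // One ALTER TABLE can drop several columns at once; it only needs a
            // brief ACCESS EXCLUSIVE lock and does not rewrite the table.
            stmt.executeUpdate(
                "ALTER TABLE my_table DROP COLUMN stale_col_1, DROP COLUMN stale_col_2");
            // Optional: rewrite the table to physically reclaim the space.
            // VACUUM cannot run inside a transaction block, so this relies on
            // JDBC's default autocommit mode.
            stmt.execute("VACUUM (FULL) my_table");
        }
    }
}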
Given a large number of known row keys, how does Bigtable read those rows (not a scan operation)? Does it read the rows one after the other or all at once? If I have a large number of non-contiguous rows that I want to read, is it better to make separate concurrent or parallel requests for each one, or to give all of the row keys to Bigtable at once, i.e. a "batch read"?
There are three options for a non-contiguous batch read, and the choice depends on your latency and CPU requirements: you can do all the reads as get requests in parallel, you can issue a single read-rows request (a scan) with multiple ranges that each include only one row, or you can do a hybrid of the two.
Reading with multiple parallel get requests
This option can be great if you have a lot of processing power or don't need to read a huge number of rows. It issues multiple requests to Bigtable, so it's going to have an impact on your CPU utilization. One Bigtable node supports around 10K reads per second, but if you have 1000 rows to read individually, that might make a dent in your capacity.
Also, if you need all of the requests to resolve before you can process the data, you may run into performance issues: if one request is slow, it slows down the entire result.
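As a rough sketch (not an official recipe), parallel point reads with the Java client might look like the following; the project, instance, and table IDs are placeholders:
import com.google.api.core.ApiFuture;
import com.google.api.core.ApiFutures;
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Row;
import java.util.ArrayList;
import java.util.List;
public class ParallelGets {
    public static void main(String[] args) throws Exception {
        try (BigtableDataClient dataClient =
                BigtableDataClient.create("my-project", "my-instance")) {
            List<String> rowKeys = List.of(
                "phone#4c410523#20190501", "phone#4c410523#20190502");
            // Issue one asynchronous point read per row key; the requests go
            // out in parallel instead of one after the other.
            List<ApiFuture<Row>> futures = new ArrayList<>();
            for (String rowKey : rowKeys) {
                futures.add(dataClient.readRowAsync("my-table", rowKey));
            }
            // Wait for every read to resolve; the overall latency is governed
            // by the slowest individual request.
            List<Row> rows = ApiFutures.allAsList(futures).get();
            for (Row row : rows) {
                if (row != null) { // a missing row resolves to null
                    System.out.println(row.getKey().toStringUtf8());
                }
            }
        }
    }
}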
Scan with multiple rows
Bigtable supports scanning with multiple row ranges, where each range is based on the row key. You can create a row range that contains exactly one row and issue a single scan with one such range per row you need.
The Bigtable client libraries support queries like this, so you can just pass the row keys and don't need to construct all of those row ranges yourself. However, it's important to know what is happening under the hood for performance: this one query is processed sequentially on the Bigtable server, so it could take a lot more time than multiple gets.
In Java, to do this kind of query, you just pass multiple row keys to the Query builder like so:
Query query = Query.create(tableId)
    .rowKey("phone#4c410523#20190501")
    .rowKey("phone#4c410523#20190502");
ServerStream<Row> rows = dataClient.readRows(query);
for (Row row : rows) {
    printRow(row);
}
Hybrid approach
Depending on the scale of rows you're working with, it may make sense to take your set of row keys, divide them up and issue multiple scans in parallel. You can get the benefit of fewer requests while still potentially getting better latency since the requests are parallelized.
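For illustration, a hedged sketch of the hybrid approach with the Java client; the IDs, the key set, and the round-robin chunking are all arbitrary choices, not something prescribed by Bigtable:
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Query;
import com.google.cloud.bigtable.data.v2.models.Row;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class HybridBatchRead {
    public static void main(String[] args) throws Exception {
        List<String> rowKeys = List.of(
            "phone#4c410523#20190501", "phone#4c410523#20190502",
            "phone#5c10102#20190501", "phone#5c10102#20190502");
        int batches = 2; // tune: fewer batches = fewer requests, more batches = more parallelism
        try (BigtableDataClient dataClient =
                BigtableDataClient.create("my-project", "my-instance")) {
            ExecutorService executor = Executors.newFixedThreadPool(batches);
            List<Future<List<Row>>> futures = new ArrayList<>();
            // Split the key set into a few chunks and issue one multi-key scan
            // per chunk, each on its own thread.
            for (int b = 0; b < batches; b++) {
                List<String> chunk = new ArrayList<>();
                for (int i = b; i < rowKeys.size(); i += batches) {
                    chunk.add(rowKeys.get(i));
                }
                Callable<List<Row>> task = () -> {
                    Query query = Query.create("my-table");
                    for (String key : chunk) {
                        query = query.rowKey(key);
                    }
                    List<Row> result = new ArrayList<>();
                    for (Row row : dataClient.readRows(query)) {
                        result.add(row);
                    }
                    return result;
                };
                futures.add(executor.submit(task));
            }
            for (Future<List<Row>> future : futures) {
                for (Row row : future.get()) {
                    System.out.println(row.getKey().toStringUtf8());
                }
            }
            executor.shutdown();
        }
    }
}
The number of batches is the knob that trades request count against parallelism, as described above.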
I would recommend experimenting to see which scenario works best for your use case, or leave a comment with more information on your use case and I can see if there is more information I can offer you.
Amazon describes columnar storage as storing the values of each column together on disk, rather than storing whole rows together.
So I guess this means that in what PostgreSQL would call the "heap", blocks contain all the values for one column, then the next column, and so on.
Say I want to query for all people in their 30's, and I want to know their names. So columnar storage means less IO is required to read just the age of every row and find those that are 30-something, because all the other columns don't need to be read. Also maybe some efficient compression can be applied. That's neat, I guess.
Then what? This data structure alone doesn't explain how anything useful can happen after that. After determining what records are 30-something, how are the associated names found? What data structure is used? What are its performance characteristics?
If the Age column is the Sort Key, then the rows in the table will be stored in order of Age. This is great, because each 1MB storage block on disk keeps data for only one column, and it keeps note of the minimum and maximum values within the block.
Thus, searching for the rows that contain an Age of 30 means that Redshift can "skip over" blocks that do not contain Age=30. Since reading from disk is the slowest part of a database, this means it can operate much faster.
Once it has found the blocks that potentially contain Age=30, it reads those blocks from disk. Blocks are compressed, so they might contain much more data than the 1MB on disk. This means many rows can be read with fewer disk accesses.
Once those blocks are decompressed into memory, it finds the rows with Age=30 and then loads the corresponding blocks for the Name column. The compression ratio would be different for the Name column since it is text and is not sorted, so this might result in loading more blocks from disk for Name than for Age.
Redshift then assembles the data from Name and Age for the desired rows and performs any remaining operations.
These operations are also parallelized across multiple nodes based on the Distribution Key, which distributes data based on a given column (or replicates it between nodes for often-used tables). Data is typically distributed based upon a column that is frequently used in JOIN statements so that similar data is co-located on the same node. Each node returns its data to the Leader Node, which combines the data and produces the final results.
Bottom line: Minimise the amount of data read from disk and parallelize operations on separate nodes.
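To make that concrete, here is a hypothetical sketch over JDBC of the Age/Name example above; the cluster endpoint, credentials, and table definition are invented for illustration:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class AgeQuerySketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:redshift://my-cluster.example:5439/dev"; // placeholder endpoint
        try (Connection conn = DriverManager.getConnection(url, "awsuser", "password");
             Statement stmt = conn.createStatement()) {
            // SORTKEY(age) lets the zone maps skip 1MB blocks whose min/max age
            // cannot cover 30-39; DISTKEY(person_id) spreads rows across nodes.
            stmt.executeUpdate(
                "CREATE TABLE people (" +
                "  person_id BIGINT, " +
                "  name VARCHAR(100), " +
                "  age INT) " +
                "DISTKEY(person_id) SORTKEY(age)");
            // Only the Age blocks that might contain 30-39 are read; the matching
            // Name blocks are then fetched and the rows assembled.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT name, age FROM people WHERE age BETWEEN 30 AND 39")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " is " + rs.getInt("age"));
                }
            }
        }
    }
}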
AFAIK every value in the columnar storage has an ID pointer (similar to the CTID you mentioned), and to get the SELECT results Redshift needs to find and combine the values that share the same ID pointer for each column in the select list. If memory allows, this happens in memory; otherwise it spills to disk. This process is called materialization (not to be confused with materialized-view materialization). In your case there are two technically possible scenarios:
materialize all Age/Name pairs, then filter by Age=30, and output the result
filter Age column by Age=30, get IDs, get Name values with corresponding IDs, materialize pairs and output
I guess in this case #2 is what happens, because materialization is more expensive than filtering. However, there are plenty of scenarios where this is much less obvious (with complex queries and aggregations). It is the responsibility of the query optimizer to decide what's better. Even #1 is still better than a row-oriented store, because it would read just the 2 columns involved.
My table is 500 GB with 8+ billion rows, INTERLEAVED SORTED by 4 keys.
One of the keys has a big skew (680+). When I run a VACUUM REINDEX, it takes very long, about 5 hours for every billion rows.
When I track the vacuum progress, it says the following:
SELECT * FROM svv_vacuum_progress;
table_name | status | time_remaining_estimate
-----------------------------+--------------------------------------------------------------------------------------+-------------------------
my_table_name | Vacuum my_table_name sort (partition: 1761 remaining rows: 7330776383) | 0m 0s
I am wondering how long it will be before it finishes, as it is not giving any time estimate either. It's currently processing partition 1761... is it possible to know how many partitions there are in a certain table? Note that these seem to be some lower-level storage partitions within Redshift.
These days, it is recommended that you not use interleaved sorting.
The sort algorithm places a tremendous load on the VACUUM operation, and the benefits of interleaved sorts apply only to a narrow set of use cases.
I would recommend you change to a compound sort on the fields most commonly used in WHERE clauses.
The most efficient sorts are those involving date fields that are always incrementing. For example, imagine a situation where rows are added to the table with a transaction date. All new rows have a date greater than the previous rows. In this situation, a VACUUM is not actually required because the data is already sorted according to the Date field.
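As a rough sketch of switching to a compound sort key via the classic deep-copy approach (the JDBC endpoint, table, and columns are placeholders; newer Redshift releases also offer ALTER TABLE ... ALTER SORTKEY, which may avoid the copy):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
public class DeepCopyToCompoundSort {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:redshift://my-cluster.example:5439/dev"; // placeholder endpoint
        try (Connection conn = DriverManager.getConnection(url, "awsuser", "password");
             Statement stmt = conn.createStatement()) {
            // New table sorted by the columns most often used in WHERE clauses,
            // leading with the ever-increasing date column.
            stmt.executeUpdate(
                "CREATE TABLE my_table_compound (" +
                "  transaction_date DATE, " +
                "  customer_id BIGINT, " +
                "  amount DECIMAL(12,2)) " +
                "COMPOUND SORTKEY (transaction_date, customer_id)");
            // The bulk insert writes rows into the new table in sorted order,
            // so no VACUUM REINDEX is needed afterwards.
            stmt.executeUpdate("INSERT INTO my_table_compound SELECT * FROM my_table");
            stmt.executeUpdate("DROP TABLE my_table");
            stmt.executeUpdate("ALTER TABLE my_table_compound RENAME TO my_table");
        }
    }
}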
Also, please realise that 500 GB is actually a LOT of data. Doing anything that rearranges that amount of data will take time.
If your vacuum is running slowly, you probably don't have enough free space on the cluster. I suggest you double the number of nodes temporarily while you do the vacuum.
You might also want to think about changing how your schema is set up. It’s worth going through this list of redshift tips to see if you can change anything:
https://www.dativa.com/optimizing-amazon-redshift-predictive-data-analytics/
The way we recovered to the previous state was to drop the table and restore it from a backup snapshot taken before the VACUUM REINDEX.
I wonder why unloading from a big table (>100 billion rows), selecting by a column that is NOT a sort key or part of the sort key, is immensely faster for newly added data. How does Redshift know that it can stop the sequential scan in the second scenario?
Time the query spent executing: 39m 37.02s
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\'2017-01-15\' AND \'2017-01-16\'') TO ...
vs.
Time the query spent executing: 23.01s
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\'2017-06-24\' AND \'2017-06-25\'') TO ...
Thanks!
Amazon Redshift uses zone maps to identify the minimum and maximum value stored in each 1MB block on disk. Each block only stores data related to a single column (eg daytime).
If the SORTKEY is not set to daytime, then the data is unsorted and any particular date could appear in many different blocks. If SORTKEY is used, then a particular date will only appear in a minimum number of blocks.
Your second query possibly executes faster, even without a SORTKEY, because you are querying data that was probably added recently and is therefore all stored together in just a few blocks. The historical data might be spread in many blocks because a VACUUM probably reordered the data based upon the correct SORTKEY. In fact, if you did a VACUUM now, you might find that your second query becomes slower.
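If you want to sanity-check that theory, a small hedged sketch over JDBC (endpoint, credentials, and table name are placeholders) could read the sort-key and unsorted statistics from SVV_TABLE_INFO:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class CheckSortState {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:redshift://my-cluster.example:5439/dev"; // placeholder endpoint
        try (Connection conn = DriverManager.getConnection(url, "awsuser", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT \"table\", sortkey1, unsorted " +
                 "FROM svv_table_info WHERE \"table\" = 'some_table'")) {
            while (rs.next()) {
                // sortkey1 is the leading sort key column (null if there is none);
                // unsorted is the percentage of rows in the unsorted region.
                System.out.printf("%s sortkey1=%s unsorted=%s%%%n",
                    rs.getString("table"), rs.getString("sortkey1"), rs.getString("unsorted"));
            }
        }
    }
}
A large unsorted percentage with no sort key on daytime would be consistent with the behaviour described above: recently loaded blocks happen to be clustered by date, while older blocks are not.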
I'm using a QSqlQuery object to do a linear walk of a 10-million-row SQL table with a very complex 6-column primary key in sorted order. Because of the key (which I CANNOT change), breaking up my query SELECT * FROM table1 with < or > plus LIMIT causes a huge number of issues with the algorithm I'm using.
My problem is as follows: for whatever reason, QSqlQuery seems to be caching the entire result set in memory until it hits a bad alloc and kills the application. So I may read a couple hundred rows, seek() over a couple hundred thousand, and by that point QSqlQuery is using 300 MB of memory and my application dies. I read the docs, and it seems the only thing that can be done is to use setForwardOnly(); however, I often need previous() (which is why breaking up the query with LIMIT is a PITA).
Is there no way to cap the cache for QSqlQuery?
Why don't you store the previously read rows yourself? There's a QContiguousCache which seems ideal for this.