Redshift derive column value based on other column

Let's say I have 3 date columns (d1, d2, d3) in a Redshift table.
d1 = max(d2,d3)
Instead of my application computing the value and setting it during insert, if I specify only d2 and d3, can Redshift auto-populate d1 = max(d2,d3)?

There are two ways to load data into Amazon Redshift.
The first is via the COPY command, where data is loaded from files stored in Amazon S3. Each column in a file is mapped to one column in the table, so you cannot 'compute' a column during this process.
The second is via an INSERT command. This is not very efficient with Amazon Redshift, and if used, it is preferable to insert rows in bulk rather than one row at a time.
A common practice is to load the data into a staging table, manipulate it as desired, then re-insert it into the target table.
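For example, a minimal sketch of that pattern from Python, assuming a psycopg2 connection to the cluster; the table, bucket, and IAM role names are placeholders, not from the original question:
import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin", password="secret")
with conn, conn.cursor() as cur:
    # Stage the raw columns exactly as they appear in the input files.
    cur.execute("CREATE TEMP TABLE events_staging (d2 date, d3 date);")
    cur.execute("""
        COPY events_staging FROM 's3://my-bucket/input/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        FORMAT AS CSV;
    """)
    # Compute d1 while moving the rows into the target table.
    cur.execute("""
        INSERT INTO events (d1, d2, d3)
        SELECT GREATEST(d2, d3), d2, d3 FROM events_staging;
    """)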
You might even be able to do some fancy stuff with Redshift Spectrum, where you can SELECT directly from files in S3 and insert into a table. This would also allow you to include expressions, e.g.:
INSERT INTO normal-table
SELECT GREATEST(d2, d3), d2, d3 FROM spectrum-table
An alternative is to load the data, then use an UPDATE command to set the value of the extra column based upon existing columns.
Update:
It appears that using an UPDATE statement in Amazon Redshift (and, in fact, in any columnar database) is not a good idea. This is because each column is stored separately, but in the same row order. Updating one value requires the whole row to be rewritten at the end of the storage space rather than updated in place. Thus, you'd need to VACUUM the table after such updates.

In PostgreSQL (on which Redshift is based), you can do what you want like this:
create table test (a int, b int, c int);
insert into test (a, b, c)
values (1, 2, greatest(1, 2)),
       (4, 1, greatest(4, 1));
It should also work in Redshift, although I can't verify that at the moment. But this won't help when bulk loading data via the COPY command.
If the above doesn't work, the other option would be to insert data, and then set column c using an update query.
insert into test (a, b) values (1, 2);
update test set c = greatest(a, b) where c is null;
For bulk loading, you need to load the data into columns a and b first using the COPY command, then use the UPDATE query to set the value of column c.

Related

Would an HBase Scan perform better with multiple Column Families or single Column Family?

I would like to store an object (payload) along with some metadata in HBase.
Then I would like to run queries on the table and pull out the payload part based on metadata info.
For example, let's say I have the following column qualifiers
P: Payload (larger than M1 + M2).
M1: Meta-Data1
M2: Meta-Data2
Then I would run a query such as:
Fetch all Payload where M1='search-key1' && M2='search-key2'
Does it make sense to:
1. Keep M1 and M2 in one column family and P in another column family? Would the scan be quicker?
2. Keep all 3 columns in the same column family?
Normally, I would do a spike (I may still need to), but I thought I'd ask first.
I'd try to follow the advice given in the HBase Reference and go with option #2 (keep all 3 columns in the same column family):
Try to make do with one column family if you can in your schemas. Only introduce a second and third column family in the case where data access is usually column scoped; i.e. you query one column family or the other but usually not both at the one time.
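To make option #2 concrete, here is a hedged sketch using the happybase Python client, with a single column family holding P, M1 and M2 as qualifiers; the table and family names are made up for illustration:
import happybase

connection = happybase.Connection("hbase-host")
connection.create_table("payloads", {"d": dict()})  # one column family, "d"
table = connection.table("payloads")
table.put(b"row1", {b"d:P": b"payload-bytes",
                    b"d:M1": b"search-key1",
                    b"d:M2": b"search-key2"})

# Scan only the small metadata qualifiers, then fetch the payload for matches.
for key, data in table.scan(columns=[b"d:M1", b"d:M2"]):
    if data.get(b"d:M1") == b"search-key1" and data.get(b"d:M2") == b"search-key2":
        payload = table.row(key, columns=[b"d:P"])[b"d:P"]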

For each distinct value in col_a, yield a new table

I have an Athena table of data in S3 that acts as a source table, with columns id, name, event. For every unique name value in this table, I would like to output a new table with all of the rows corresponding to that name value, and save to a different bucket in S3. This will result in n new files stored in S3, where n is also the number of unique name values in the source table.
I have tried single Athena queries in Lambda using PARTITION BY and CTAS queries, but can't seem to get the result that I wanted. It seems that AWS Glue may be able to get my expected result, but I've read online that it's more expensive, and that perhaps I may be able to get my expected result using Lambda.
How can I store a new file (JSON format, preferably) that contains all rows corresponding to each unique name in S3?
Preferably I would run this once a day to update the data stored by name, but the question above is the main concern for now.
When you write your Spark/Glue code you will need to partition the data using the name column. However, this will result in paths of the following format:
s3://bucketname/folder/name=value/file.json
This should give a separate set of files for each name value, but if you want to access each set as a separate table you might need to remove the = sign from the key before you crawl the data and make it available via Athena.
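As a rough illustration, a minimal PySpark sketch of that partitioned write (bucket names and paths are placeholders; on Glue you would typically get the DataFrame from the Glue context instead):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-by-name").getOrCreate()
df = spark.read.json("s3://source-bucket/source-data/")

# Writes one folder per distinct value: s3://target-bucket/by-name/name=<value>/part-*.json
df.write.mode("overwrite").partitionBy("name").json("s3://target-bucket/by-name/")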
If you do use a Lambda, the operation involves going through the data and partitioning it, similar to what Glue does.
I guess it all depends on the volume of data that needs to be processed. Glue, if using Spark, may have a little extra start-up time; Glue Python shells have comparatively better start-up times.

DynamoDB update one column of all items

We have a huge DynamoDB table (~4 billion items), and one of the columns is a kind of category (string). We would like to either map this column to a new one, category_id (integer), or convert the existing one from string to int. Is there a way to do this efficiently without creating a new table and populating it from the beginning? In other words, can we update the existing table?
Is there a way to do this efficiently
Not in DynamoDB, that use case is not what it's designed for...
Also note, unless you're talking about the hash or sort key (of the table or of an existing index), DDB doesn't have columns.
You'd run Scan() (in a loop, since it only returns 1MB of data at a time)...
Then Update each item one at a time. (Note: BatchWriteItem can write up to 25 items per call, but that only saves network overhead; it still performs individual writes and only supports puts and deletes, not updates.)
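A hedged boto3 sketch of that scan-and-update loop, assuming a hash key named id, an existing category attribute, and a made-up string-to-integer mapping (none of these names come from the question):
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")

CATEGORY_TO_ID = {"books": 1, "music": 2}  # example mapping

scan_kwargs = {}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        table.update_item(
            Key={"id": item["id"]},
            UpdateExpression="SET category_id = :cid",
            ExpressionAttributeValues={":cid": CATEGORY_TO_ID[item["category"]]},
        )
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]  # continue past the 1MB page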
If the attribute in question is used as a key in the table or an existing index...then a new table is your only option. Here's a good article with a strategy for migrating a production table.
Create a new table (let us call this NewTable), with the desired key structure, LSIs, GSIs.
Enable DynamoDB Streams on the original table
Associate a Lambda with the Stream, which pushes each record into NewTable. (This Lambda should trim off the migration flag from Step 5; a sketch of such a Lambda follows these steps.)
[Optional] Create a GSI on the original table to speed up scanning items. Ensure this GSI only has attributes: Primary Key, and Migrated (See Step 5).
Scan the GSI created in the previous step (or entire table) and use the following Filter:
FilterExpression = "attribute_not_exists(Migrated)"
Update each item in the table with a migration flag (i.e. "Migrated": { "S": "0" }), which sends it to the DynamoDB stream (use the UpdateItem API to ensure no data loss occurs).
NOTE You may want to increase write capacity units on the table during the updates.
The Lambda will pick up all items, trim off the Migrated flag and push them into NewTable.
Once all items have been migrated, repoint the code to the new table.
Remove the original table and the Lambda function once you're happy all is good.
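For reference, a hedged sketch of the Stream-triggered Lambda from step 3, which copies each record into NewTable and drops the Migrated flag (apart from Migrated, the names are illustrative):
import boto3
from boto3.dynamodb.types import TypeDeserializer

dynamodb = boto3.resource("dynamodb")
new_table = dynamodb.Table("NewTable")
deserializer = TypeDeserializer()

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            continue
        # Stream images arrive in DynamoDB JSON; convert them to plain Python values.
        item = {k: deserializer.deserialize(v)
                for k, v in record["dynamodb"]["NewImage"].items()}
        item.pop("Migrated", None)  # trim off the migration flag
        new_table.put_item(Item=item)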

Google Big Query splitting an ingestion time partitioned table

I have an ingestion time partitioned table that's getting a little large. I wanted to group by the values in one of the columns and use that to split it into multiple tables. Is there an easy way to do that while retaining the original _PARTITIONTIME values in the set of new ingestion time partitioned tables?
Also I'm hoping for something that's relatively simple/cheap. I could do something like copy my table a bunch of times and then delete the data for all but one value on each copy, but I'd get charged a huge amount for all those DELETE operations.
Also I have enough unique values in the column I want to split on that saving a "WHERE column = value" query result to a table for every value would be cost prohibitive. I'm not finding any documentation that mentions whether this approach would even preserve the partitions, so even if it weren't cost prohibitive it may not work.
The case you describe requires two-level partitioning, which is not supported yet.
You can create a column-partitioned table: https://cloud.google.com/bigquery/docs/creating-column-partitions
Then, before insert, populate that partitioning column with the value that was used for partitioning before (the ingestion time) - but in this case you lose the _PARTITIONTIME value.
Based on the additional clarification - I had a similar problem - my solution was to write a Python application that reads the source table (read is important here - not query - so it is free), splits the data based on your criteria, and then either streams the data into the target tables (simple, but not free) or generates JSON/CSV files and loads them (also free, but with some limits on the number of such load operations) - the second route requires more coding and exception handling.
You can also do it via Dataflow - it will definitely be more expensive than a custom solution, but potentially more robust.
Examples with the google-cloud-bigquery Python library:
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")
t1 = client.get_table(source_table_ref)
target_schema = t1.schema[1:]  # drop the first column, which is the key used to split
ds_target = client.dataset(dataset_id=target_dataset, project=target_project)
# read rows from the source table (a read, not a query, so it is free)
rows_to_process_iter = client.list_rows(t1, start_index=start_index, max_results=max_results)
rows_to_process = list(rows_to_process_iter)  # convert the iterator to a list
# ... split rows_to_process by the key column into records_to_stream per target table ...
# stream the records for one key value into its destination table
errors = client.create_rows(target_table, records_to_stream)  # renamed insert_rows in newer client versions
BigQuery now supports clustered partitioned tables, which allow you to specify additional columns that the data should be split by.
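As a hedged sketch, creating such a table with the google-cloud-bigquery client might look like this (project, dataset, table, and column names are placeholders):
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")
schema = [bigquery.SchemaField("split_column", "STRING"),
          bigquery.SchemaField("payload", "STRING")]
table = bigquery.Table("PROJECT_NAME.target_dataset.clustered_table", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY)
table.clustering_fields = ["split_column"]  # queries filtering on this column read less data
client.create_table(table)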

Detecting delta records for nightly capture?

I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to the current version, HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables, different options can be used:
If a table explicitly contains a reference to the last change time, this can be used.
If a table has guaranteed update characteristics (e.g. no in-place updates and monotonically increasing ID values), this can be used: read all records where the ID is larger than the last processed ID (see the sketch after this list).
If the table does not provide intrinsic information about change time, then one could maintain a copy of the table that contains only the records processed so far. This copy can then be used to compare against the current table and compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
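A minimal sketch of the second option (the ID watermark) with the hdbcli Python driver, assuming a monotonically increasing ID column; host, schema, and table names are placeholders:
from hdbcli import dbapi

conn = dbapi.connect(address="hana-host", port=39015, user="DELTA_USER", password="secret")
cursor = conn.cursor()

last_processed_id = 42  # watermark persisted from the previous nightly run
cursor.execute(
    "SELECT * FROM MY_SCHEMA.MY_TABLE WHERE ID > ? ORDER BY ID",
    (last_processed_id,))
delta_rows = cursor.fetchall()  # records to include in tonight's delta file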
In my experience, efforts to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming, and expensive than using the corresponding features of ETL tools.
It is possible to create a log table, with columns organized according to your needs, and populate it from triggers on your source tables so that each change produces a log record with a timestamp. Then you can query the log table to determine which records were inserted, updated, or deleted in your source tables.
For example, the following is one of my test triggers:
CREATE TRIGGER "A00077387"."SALARY_A_UPD" AFTER UPDATE ON "A00077387"."SALARY" REFERENCING OLD ROW MYOLDROW,
NEW ROW MYNEWROW FOR EACH ROW
begin INSERT
INTO SalaryLog ( Employee,
Salary,
Operation,
DateTime ) VALUES ( :mynewrow.Employee,
:mynewrow.Salary,
'U',
CURRENT_DATE )
;
end
;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE one.
You can organize your log table so that you can track more than one source table if you wish, just by keeping the table name, PK fields and values, operation type, timestamp, etc.
But it is better and easier to use separate log tables for each source table.