Google BigQuery splitting an ingestion time partitioned table - google-cloud-platform

I have an ingestion time partitioned table that's getting a little large. I wanted to group by the values in one of the columns and use that to split it into multiple tables. Is there an easy way to do that while retaining the original _PARTITIONTIME values in the set of new ingestion time partitioned tables?
Also I'm hoping for something that's relatively simple/cheap. I could do something like copy my table a bunch of times and then delete the data for all but one value on each copy, but I'd get charged a huge amount for all those DELETE operations.
Also I have enough unique values in the column I want to split on that saving a "WHERE column = value" query result to a table for every value would be cost prohibitive. I'm not finding any documentation that mentions whether this approach would even preserve the partitions, so even if it weren't cost prohibitive it may not work.

The case you describe requires two-level partitioning, which is not supported yet.
You can create a column-partitioned table instead: https://cloud.google.com/bigquery/docs/creating-column-partitions
You would then populate that partitioning column with whatever value you need before each insert - but in this case you lose the _PARTITIONTIME value.
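For reference, creating such a column-partitioned table with the Python client could look roughly like this (dataset, table and field names are placeholders I made up):
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")

# hypothetical schema; "event_date" is the column used for partitioning
schema = [
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("category", "STRING"),
    bigquery.SchemaField("value", "INT64"),
]
table = bigquery.Table("PROJECT_NAME.target_dataset.orders_by_date", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",  # partition on this column instead of ingestion time
)
table = client.create_table(table)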
Based on your additional clarification - I had a similar problem, and my solution was to write a Python application that reads the source table (read is important here - not query - so it is free), splits the data based on your criteria, and then either streams the data into the target tables (simple, but not free) or generates JSON/CSV files and loads them into the target tables (also free, but with some limits on the number of load jobs). The second route requires more coding and exception handling.
You could also do it via Dataflow - it will definitely be more expensive than a custom solution, but potentially more robust.
Example using the Google Cloud Python client library (google-cloud-bigquery):
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")

# source_table_ref, target_project, target_dataset, etc. are placeholders
t1 = client.get_table(source_table_ref)
target_schema = t1.schema[1:]  # removing first column, which is the key to split on
ds_target = client.dataset(dataset_id=target_dataset, project=target_project)

# read the table directly (free) rather than querying it
rows_to_process_iter = client.list_rows(t1, start_index=start_index, max_results=max_results)
# convert to list
rows_to_process = list(rows_to_process_iter)
# doing something with records: split them by the key column into records_to_stream
# stream records to destination (create_rows was renamed to insert_rows in newer library versions)
errors = client.create_rows(target_table, records_to_stream)

BigQuery now supports clustered partitioned tables, which let you specify additional columns that the data is sorted and grouped by within each partition.
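For example, keeping ingestion-time partitioning while clustering on the column you wanted to split by could look roughly like this with the Python client (dataset, table and column names are assumptions):
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")

# hypothetical schema; "category" stands in for the column you wanted to split on
schema = [
    bigquery.SchemaField("category", "STRING"),
    bigquery.SchemaField("value", "INT64"),
]
table = bigquery.Table("PROJECT_NAME.target_dataset.orders_clustered", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY)  # ingestion-time partitioning
table.clustering_fields = ["category"]
table = client.create_table(table)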

Related

Athena query is very slow

I am storing 400,000 parquet files in S3 that are partitioned based on a unique id (e.g. 412812). The files range in size from 25kb to 250kb of data. I then want to query the data using Athena. Like so,
Select *
From Table
where id in (412812, 412813, 412814)
This query is much slower than anticipated. I want to be able to search for any set of ids and get a fast response. I believe it is slow because Athena must search through the entire Glue catalog looking for the right files (i.e., a full scan of files).
The following query is extremely fast. Less than a second.
Select *
From Table
where id = 412812
partition.filtering is enabled on the table. I tried adding an index to the table that was the same as the partition, but it did not speed anything up.
Is there something wrong with my approach or a table configuration that would make this process faster?
Your basic problem is that you have too many files and too many partitions.
While Amazon Athena does operate in parallel, there are limits to how many files it can process simultaneously. Plus, each extra file adds overhead for listing, opening, etc.
Also, putting just a single file in each partition greatly adds to the overhead of handling so many partitions and is probably counterproductive for increasing the efficiency of the system.
I have no idea how you actually use your data, but based on your description I would recommend that you create a new table that is bucketed by the id, rather than partitioned:
CREATE TABLE new_table
WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location = 's3://bucket/new_location/',
      bucketed_by = ARRAY['id'],
      bucket_count = 100 -- required alongside bucketed_by; pick a count that suits your data volume
)
AS SELECT * FROM existing_table
Let Athena create as many files as it likes -- it will optimize based upon the amount of data. More importantly, it will create larger files that allow it to operate more efficiently.
See: Bucketing vs Partitioning - Amazon Athena
In general, partitions are great when you can divide the data into some major subsets (eg by country, state or something that represents a sizeable chunk of your data), while bucketing is better for fields that have values that are relatively uncommon (eg user IDs). Bucketing will create multiple files and Athena will be smart enough to know which files contain the IDs you want. However, it will not be partitioned into subdirectories based upon those values.
Creating this new table will greatly reduce the number of files that Amazon Athena would need to process for each query, which will make your queries run a LOT faster.
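If you want to submit that CTAS programmatically rather than from the console, a minimal boto3 sketch could look like this (region, database name and output location are placeholders):
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# ctas_sql holds the CREATE TABLE ... AS SELECT statement shown above
response = athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-query-results/"},
)
print(response["QueryExecutionId"])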

Indexed Range Query with DynamoDB

With DynamoDB, there is simply no straightforward way to perform an indexed range query over a column. Primary key, local secondary index, and global secondary index all require a partition key to range query.
For example, suppose I have a high-scores table with a numerical score attribute. There is no way to get the top 10 scores, or top scores 25 to 50, with an indexed range query.
So, what is the idiomatic or preferred way to perform this incredibly common task?
1. Settle for a table scan.
2. Use a static partition key and take advantage of partition queries.
3. Use a fixed number of static partition keys and use multiple partition queries.
It's either 2) or 3) but it depends on the amount and structure of data as well as the read/write activity.
There's no generic answer here as it's use-case specific.
As long as you can get away with it, you probably want to use 2) as it only requires a single Query API call. If you have lots of data or heavy read/write activity, you'd use some bucketing strategy (very close to your third option) to write to multiple partitions, then do multiple queries and aggregate the results.
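As a rough sketch of option 2: suppose a GSI whose partition key is a constant attribute (e.g. bucket = "ALL") and whose sort key is score - the top 10 is then a single Query call. Table, index and attribute names below are assumptions:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("HighScores")

# "score-index": hypothetical GSI with partition key "bucket" (a constant) and sort key "score"
response = table.query(
    IndexName="score-index",
    KeyConditionExpression=Key("bucket").eq("ALL"),
    ScanIndexForward=False,  # descending by sort key (score)
    Limit=10,                # top 10
)
top_scores = response["Items"]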
DDB isn't suited for analytics. As Maurice said you can facilitate what you need via secondary index, but there are also other options to consider:
If you are providing this Top N to your customers consistently/frequently and N is fixed, then you can have dedicated item(s) that hold this information, and you would update that item (or those items) whenever you write an item to the table. You can have one item for the whole Top N or you can apply some bucketing strategy.
If your system needs this information infrequently (on some singular occasions), then scan might be also fine.
If this is for analytics/research, consider exporting the table to S3 and using Athena.

Redshift - Redesign tables to use DIST and SORT keys (performance issue)

I'm having serious performance problems on Redshift and I've started to rethink my tables structures.
Right now, I'm identifying tables that have most significance on my dashboard. First of all, I run the following query:
SELECT * FROM admin.v_extended_table_info
WHERE table_id IN (
SELECT DISTINCT s.tbl FROM stl_scan s
JOIN pg_user u ON u.usesysid = s.userid
WHERE s.type=2 AND u.usename='looker'
)
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC,
size DESC;
Based on the query result, I could identify a lot of small tables (1-1000 records) that are distributed as EVEN but could be ALL - these tables are used in a lot of joins.
Besides that, I've identified that 99% of my tables use EVEN without a sort key. I'm not using denormalized tables, so I need to run plenty of joins to get data - from what I've read, EVEN is not good for joins because the data may have to be redistributed over the network.
I have 3 tables related to the ticket flow: user, ticket and ticket_history. All of those tables have diststyle EVEN and no sort keys.
For now, I would like to redesign the user table: this table is joined on the condition ticket.user_id = user.id and appears in where clauses like user.email = 'xxxx@xxxx.com' or user.email like '%@something.com%', or in group by user.email.
The first thing I'm planning to do is use diststyle KEY with id as the dist key. Does it make sense to use a unique value as the dist key? I've read plenty of posts about dist keys and it's still confusing to me.
For the sort key, does it make sense to use email as a compound sort key? I've read to avoid columns that grow, like dates, timestamps or identities - that's why I'm not using an interleaved key. To avoid that like, I'm planning to create a new column that identifies the email domain.
After that, I'll change small tables to dist ALL and try my queries again.
Am I on the right track? Any other tips?
This question may sound stupid, but my tech background is only in software development; I'm learning about Redshift and reading a lot of documentation.
The basic rule of thumb is:
Set the DISTKEY to the column that is most used in JOINs
Set the SORTKEY to the column(s) most used in WHEREs
You are correct that small tables can have a distribution of ALL, which would avoid sending data between nodes.
DISTKEY provides the most benefit when tables are joined via a common column that is the DISTKEY in both tables. This means that each row is contained on the same node and no data needs to be sent between nodes (or, more accurately, slices). However, you can only select one DISTKEY, so choose the column that is most often used for the JOIN.
SORTKEY provides the most benefit when Redshift can skip over blocks of storage. Each block of storage contains data for one column and is marked with a MIN and MAX value. When a table is sorted on a particular column, it minimises the number of disk blocks that contain data for a given column value (since they are all located together, rather than being spread randomly throughout disk storage). Thus, use column(s) that are most frequently used in WHERE statements.
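For illustration, a redesigned user table following those rules might look like the DDL below; the column list and types are assumptions, and I'm submitting it through psycopg2 just as an example:
import psycopg2

# connection details are placeholders
conn = psycopg2.connect(host="REDSHIFT_ENDPOINT", port=5439,
                        dbname="warehouse", user="admin", password="PASSWORD")
cur = conn.cursor()

# DISTKEY on the join column (ticket.user_id = user.id), compound SORTKEY on the filtered column
cur.execute("""
    CREATE TABLE public.user_new (
        id     BIGINT NOT NULL,
        email  VARCHAR(256),
        domain VARCHAR(128)
    )
    DISTSTYLE KEY
    DISTKEY (id)
    COMPOUND SORTKEY (email)
""")
conn.commit()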
If the user.email wildcard search is slow, you can certainly create a new column with the domain. Or, for even better performance, you could consider creating a separate lookup table with just user_id and domain, having SORTKEY = domain. This will perform the fastest when searching by domain.
A tip from experience: I would advise against using an email address as a user_id because people sometimes want to change email address. It is better to use a unique number for such id columns, with email address as a changeable attribute. (I've seen software systems need major rewrites to fix such an early design decision!)

What is the best practice for loading data into BigQuery table?

Currently I'm loading data from Google Storage to stage_table_orders using WRITE_APPEND. Since this loads both new and existing orders, there can be cases where the same order has more than one version; the field etl_timestamp tells which row is the most updated one.
Then I WRITE_TRUNCATE my production_table_orders with a query like:
select ...
from (
  SELECT *, ROW_NUMBER() OVER
    (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) as rn
  FROM `warehouse.stage_table_orders`
)
where rn = 1
Then the production_table_orders always contains the most updated version of each order.
This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice.
I have around 20M rows. It seems not smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Suggestion?
We are doing the same. To help improve performance though, try to partition the table by date_purchased and cluster by orderid.
Use a CTAS statement (to the table itself) as you cannot add partitioning after the fact.
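A rough sketch of such a CTAS via the Python client; the target table name is an assumption, and PARTITION BY assumes date_purchased is a DATE column (wrap it in DATE() if it is a TIMESTAMP):
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")

# create a partitioned, clustered copy of the staging table (target name is a placeholder)
ctas = """
CREATE TABLE `warehouse.stage_table_orders_partitioned`
PARTITION BY date_purchased
CLUSTER BY orderid
AS SELECT * FROM `warehouse.stage_table_orders`
"""
client.query(ctas).result()  # wait for the job to complete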
EDIT: use 2 tables and MERGE
Depending on your particular use case, i.e. the number of fields that could change between the old and new version, you could use 2 tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
F.date_purchased = S.date_purchased
WHEN MATCHED THEN
UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
INSERT (field1, field2, ...) VALUES(S.field1, S.field2, ...)
Pro: efficient if only a few rows are "upserted", not millions (although not tested), plus partition pruning should work.
Con: you have to explicitly list the fields in the update and insert clauses. A one-time effort if schema is pretty much fixed.
There are many ways to de-duplicate and there is no one-size-fits-all. Search on SO for similar requests using ARRAY_AGG, or EXISTS with DELETE, or UNION ALL, ... Try them out and see which performs better for YOUR dataset.

Detecting delta records for nightly capture?

I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables, different options can be used:
- if a table explicitly contains a reference to the last change time, this can be used
- if a table has guaranteed update characteristics (e.g. no in-place updates and monotonically increasing ID values), this could be used, e.g. read all records where the ID is larger than the last processed ID (see the sketch after this list)
- if the table does not provide intrinsic information about change time, then one could maintain a copy of the table that contains only the records processed so far. This copy can then be used to compare against the current table and compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
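A minimal sketch of the ID-based variant using the hdbcli Python driver (connection details, schema, table and column names are assumptions):
from hdbcli import dbapi

# connection parameters are placeholders
conn = dbapi.connect(address="HANA_HOST", port=30015, user="USER", password="PASSWORD")
cursor = conn.cursor()

# last_processed_id would be stored from the previous batch run
cursor.execute(
    "SELECT * FROM MY_SCHEMA.MY_TABLE WHERE ID > ? ORDER BY ID",
    (last_processed_id,),
)
delta_rows = cursor.fetchall()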
In my experience, efforts to try to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming and expensive than using the corresponding features of ETL tools.
It is possible to create a log table and, by creating triggers on your database tables, write a log record with timestamp values for every change. Then you can query your log table to determine which records were inserted, updated or deleted in your source tables.
For example, the following is one of my test triggers:
CREATE TRIGGER "A00077387"."SALARY_A_UPD"
AFTER UPDATE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW, NEW ROW MYNEWROW
FOR EACH ROW
BEGIN
    INSERT INTO SalaryLog (
        Employee,
        Salary,
        Operation,
        DateTime
    ) VALUES (
        :mynewrow.Employee,
        :mynewrow.Salary,
        'U',
        CURRENT_DATE
    );
END;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE one.
You can organize your log table so that you can track more than one table if you wish, just by keeping the table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use separate log tables for each source table.