What is the best practice for loading data into a BigQuery table?

Currently I'm loading data from Google Cloud Storage into stage_table_orders using WRITE_APPEND. Since this loads both new and existing orders, the same order can appear in more than one version; the field etl_timestamp tells which row is the most recent one.
Then I WRITE_TRUNCATE my production_table_orders with a query like:
SELECT ...
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
WHERE rn = 1
Then the production_table_orders always contains the most updated version of each order.
This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice.
I have around 20M rows. It seems not smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Suggestion?

We are doing the same. To help improve performance, though, try partitioning the table by date_purchased and clustering it by orderid.
Use a CTAS statement (to the table itself), as you cannot add partitioning after the fact.
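For example, a one-time CTAS roughly like the following (a sketch reusing the question's names; it writes to a new table name because BigQuery may refuse to CREATE OR REPLACE a table with a different partitioning spec, and it assumes date_purchased is a DATE - wrap it in DATE() if it is a TIMESTAMP):
-- rebuild the staging data partitioned by purchase date and clustered by order id
CREATE TABLE `warehouse.stage_table_orders_partitioned`
PARTITION BY date_purchased
CLUSTER BY orderid
AS
SELECT * FROM `warehouse.stage_table_orders`;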
EDIT: use 2 tables and MERGE
Depending on your particular use case, i.e. the number of fields that could change between the old and new versions, you could use two tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
F.date_purchased = S.date_purchased
WHEN MATCHED THEN
UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
INSERT (field1, field2, ...) VALUES(S.field1, S.field2, ...)
Pro: efficient if only a few rows are "upserted" rather than millions (although not tested), and partition pruning should work.
Con: you have to explicitly list the fields in the update and insert clauses. A one-time effort if schema is pretty much fixed.
There are many ways to de-duplicate and there is no one-size-fits-all solution. Search SO for similar questions using ARRAY_AGG, EXISTS with DELETE, UNION ALL, etc. Try them out and see which performs better for YOUR dataset.
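For reference, a hedged sketch of the ARRAY_AGG variant on the question's staging table; it keeps only the newest version of each order without a ROW_NUMBER subquery:
SELECT latest.*
FROM (
  -- keep the most recent row per (date_purchased, orderid) as a struct, then re-expand it
  SELECT ARRAY_AGG(t ORDER BY etl_timestamp DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM `warehouse.stage_table_orders` AS t
  GROUP BY date_purchased, orderid
);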


Simultaneously `CREATE TABLE LIKE` in AWS Redshift and change a few columns' default values

Workflow
In a data import workflow, we are creating a staging table using a CREATE TABLE LIKE statement.
CREATE TABLE abc_staging (LIKE abc INCLUDING DEFAULTS);
Then, we run COPY to import CSV data from S3 into the staging table.
The data in CSV is incomplete. Namely, there are fields partition_0, partition_1, partition_2 which are missing in the CSV file; we fill them in like this:
UPDATE
abc_staging
SET
partition_0 = 'BUZINGA',
partition_1 = '2018',
partition_2 = '07';
Problem
This query seems expensive (it often takes ≈20 minutes), and I would like to avoid it. That would be possible if I could configure DEFAULT values on these columns when creating the abc_staging table. I did not find any way to do that, nor any explicit indication that it is impossible. So perhaps it is still possible and I am just missing how to do it?
Alternative solutions I considered
Drop these columns and add them again
That would be easy to do, but ALTER TABLE ADD COLUMN only adds columns at the end of the column list. In the abc table they are not at the end, which means the schemas of abc and abc_staging would mismatch. That breaks the ALTER TABLE APPEND operation I use to move data from the staging table to the main table.
Note: reordering the columns in abc to alleviate this difficulty would require recreating the huge abc table, which I'd like to avoid.
Generate the staging table creation script programmatically with proper columns and get rid of CREATE TABLE LIKE
I will have to do that if I do not find any better solution.
Fill in the partition_* fields in the original CSV file
That is possible but would break backwards compatibility (I already have perhaps hundreds of thousands of files in there). Harder but manageable.
As you are finding, you are not creating a table exactly LIKE the original, and Redshift doesn't let you ALTER a column's default value. Your proposed path is likely the best: define the staging table explicitly.
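If you end up generating that DDL per import, a minimal sketch might look like the following (column names other than partition_* are placeholders; put the partition_* columns wherever they sit in abc, and the DEFAULT literals are the values from the question's UPDATE):
CREATE TABLE abc_staging (
    id          BIGINT,
    partition_0 VARCHAR(32) DEFAULT 'BUZINGA',
    partition_1 VARCHAR(8)  DEFAULT '2018',
    partition_2 VARCHAR(8)  DEFAULT '07',
    some_field  VARCHAR(256)
    -- ...the rest of abc's columns, keeping abc's exact column order and types...
);
Keeping the column order and types identical to abc matters because ALTER TABLE APPEND requires matching schemas.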
Since I don't know your exact situation, other paths might be better, so let me explore a bit. First off, when you UPDATE the staging table you are in fact reading every row, invalidating it, and writing a new row (with the new information) at the end of the table. This leads to a lot of invalidated rows. Then, when you do ALTER TABLE APPEND, all these invalidated rows are carried over to your main table, unless you vacuum the staging table beforehand. So you may not be getting the value you want out of ALTER TABLE APPEND.
You may be better off INSERTing the data into your main table with an ORDER BY clause. This is slower than the ALTER TABLE APPEND statement, but you won't have to do the UPDATE, so the overall process could be faster. You could come out further ahead because of the reduced need to VACUUM. Your situation will determine if this is better or not. Just another option for your list.
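A rough sketch of that alternative, with the partition values filled inline so the UPDATE disappears entirely (column names and the sort key are placeholders):
INSERT INTO abc
SELECT csv_col_1,
       'BUZINGA',      -- partition_0
       '2018',         -- partition_1
       '07',           -- partition_2
       csv_col_2       -- ...remaining columns, listed in abc's column order...
FROM abc_staging
ORDER BY abc_sort_key; -- loading in sort-key order reduces the need to VACUUM afterwards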
I am curious about your UPDATE speed. It just needs to read and then write every row in the staging table. Unless the staging table is very large, it doesn't seem like this should take 20 minutes. Other activity could be causing this slowdown. Just curious.
Another option would be to change your main table to have these 3 columns last (yes this would be some work). This way you could add the columns to the staging table and things would line up for ALTER TABLE APPEND. Just another possibility.
The easiest solution turned out to be adding the necessary partition_* fields to the source CSV files.
After making that change and removing the UPDATE from the importer pipeline, performance greatly improved. Imports now take ≈10 minutes each in total (that encompasses COPY, deleting duplicates, and ALTER TABLE APPEND).
Disk space is no longer climbing up to 100%.
Thanks everyone for help!

INSERT INTO table SELECT Redshift super slow

We have a large table that we need to do a DEEP COPY of.
Since we don't have enough free disk space to do it in one statement, I've tried to do it in batches.
But the batches seem to run very, very slowly.
I'm running something like this:
INSERT INTO new_table
SELECT * FROM old_table
WHERE creation_date between '2018-01-01' AND '2018-02-01'
Even though the SELECT itself returns only a small number of rows (~1K):
SELECT * FROM old_table
WHERE creation_date BETWEEN '2018-01-01' AND '2018-02-01'
the INSERT query takes around 50 minutes to complete.
The old_table has ~286M rows and ~400 columns
creation_date is one of the SORTKEYs
Explain plan looks like:
XN Seq Scan on old_table (cost=0.00..4543811.52 rows=178152 width=136883)
Filter: ((creation_date <= '2018-02-01'::date) AND (creation_date >= '2018-01-01'::date))
My question is:
What may be the reason for INSERT query to take this long?
In my opinion, the following are two possibilities, though it would be great if you could add more details to your question.
As @John stated in the comments, your SORTKEY matters a lot in Redshift; is creation_date the sort key?
Did you do a lot of updates to your old_table? If so, you should vacuum first: run VACUUM DELETE ONLY old_table, then run your queries.
Another option would be to go the S3 way, but I'm not sure whether you want to do that.
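A common reading of "the S3 way" here is an UNLOAD of each slice followed by a COPY into the new table; a hedged sketch (bucket, prefix, and IAM role are placeholders):
UNLOAD ('SELECT * FROM old_table WHERE creation_date BETWEEN ''2018-01-01'' AND ''2018-02-01''')
TO 's3://my-bucket/old_table_2018_01_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
ESCAPE GZIP;

COPY new_table
FROM 's3://my-bucket/old_table_2018_01_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
ESCAPE GZIP;
If disk space is the constraint, unloading everything first and truncating old_table before copying back avoids holding two full copies in the cluster at once.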

Google BigQuery: splitting an ingestion-time partitioned table

I have an ingestion time partitioned table that's getting a little large. I wanted to group by the values in one of the columns and use that to split it into multiple tables. Is there an easy way to do that while retaining the original _PARTITIONTIME values in the set of new ingestion time partitioned tables?
Also I'm hoping for something that's relatively simple/cheap. I could do something like copy my table a bunch of times and then delete the data for all but one value on each copy, but I'd get charged a huge amount for all those DELETE operations.
Also I have enough unique values in the column I want to split on that saving a "WHERE column = value" query result to a table for every value would be cost prohibitive. I'm not finding any documentation that mentions whether this approach would even preserve the partitions, so even if it weren't cost prohibitive it may not work.
The case you describe would require two-level partitioning, which is not supported yet.
You can create a column-partitioned table instead: https://cloud.google.com/bigquery/docs/creating-column-partitions
Then, before inserting, populate that partitioning column with the value that used to drive the partitioning - but in this case you lose the _PARTITIONTIME value.
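A hedged sketch of that workaround (hypothetical names): give the new table a regular DATE column, partition on it, and fill it with the old _PARTITIONTIME when you load the data.
CREATE TABLE mydataset.split_table_a (
  original_partition_date DATE
  -- ...remaining columns copied from the source table's schema...
)
PARTITION BY original_partition_date;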
Based on the additional clarification - I had a similar problem - my solution was to write a Python application that reads the source table (reading is important here, not querying, so it is free), splits the data based on your criteria, and then either streams the data into the target tables (simple, but not free) or generates JSON/CSV files and loads them into the target tables (loading is also free, with some limits on the number of such operations). The second route requires more coding and exception handling.
You can also do it via Dataflow - it will definitely be more expensive than a custom solution, but potentially more robust.
Example using the google-cloud-bigquery Python library (the project, dataset/table references, and the paging window are placeholders):
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")

t1 = client.get_table(source_table_ref)  # reference to the table being split
target_schema = t1.schema[1:]  # removing first column, which is the key we split on
ds_target = client.dataset(project=target_project, dataset_id=target_dataset)

# tabledata.list is a free read, unlike running a query
rows_to_process_iter = client.list_rows(t1, start_index=start_index, max_results=max_results)
# convert to list
rows_to_process = list(rows_to_process_iter)

# ...split rows_to_process on the key column into records_to_stream per target table...

# stream records to the destination table (a Table fetched with client.get_table)
# (create_rows in older releases was renamed to insert_rows in google-cloud-bigquery >= 1.0)
errors = client.insert_rows(target_table, records_to_stream)
BigQuery now supports clustered partitioned tables, which allow you to specify additional columns that the data should be split by.
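For example, a hedged one-off CTAS (hypothetical names) that keeps the old ingestion date as the partition column and clusters on the column you were going to split by, so queries filtering on that column prune storage blocks instead of needing separate tables:
CREATE TABLE mydataset.events_clustered
PARTITION BY partition_date
CLUSTER BY split_col
AS
SELECT *, DATE(_PARTITIONTIME) AS partition_date
FROM mydataset.source_table;
Note this is a single full read of the source table rather than one query per value of split_col.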

Cassandra: How to query the complete data set?

My table has 77k entries (and the number of entries keeps increasing at a high rate), and I need to make a select query in CQL 3. When I do select count(*) ... where (some_conditions) allow filtering I get:
count
-------
10000
(1 rows)
Default LIMIT of 10000 was used. Specify your own LIMIT clause to get more results.
Let's say 23k rows satisfy some_condition. The 10000 above is just the count capped at the first 10k of those 23k rows, right? But how do I get the actual count?
More importantly, how do I get access to all of these 23k rows, so that my Python API can perform some in-memory operations on the data in some of the columns? Is there some sort of pagination mechanism in Cassandra CQL 3?
I know I can just increase the limit to a very large number but that's not efficient.
Working Hard is right, and LIMIT is probably what you want. But if you want to "page" through your results at a more detailed level, read through this DataStax document titled: Paging through unordered partitioner results.
This will involve using the token function on your partitioning key. If you want more detailed help than that, you'll have to post your schema.
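A rough CQL sketch of that token-based paging (hypothetical table my_table with partition key pk, page size 1000):
-- first page
SELECT pk, col1 FROM my_table LIMIT 1000;
-- subsequent pages: feed in the last partition key returned by the previous page
SELECT pk, col1 FROM my_table WHERE token(pk) > token('last_pk_seen') LIMIT 1000;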
While I cannot see your complete table schema, by virtue of the fact that you are using ALLOW FILTERING I can tell that you are doing something wrong. Cassandra was not designed to serve data based on multiple secondary indexes. That approach may work with a RDBMS, but over time that query will get really slow. You should really design a column family (table) to suit each query you intend to use frequently. ALLOW FILTERING is not a long-term solution, and should never be used in a production system.
You just have to specify a LIMIT with your query.
Let's assume your table contains fewer than 100,000 (1 lakh) records; then the query below will give you the actual count of records in the table.
select count(*) ... where (some_conditions) allow filtering limit 100000;
Another way is to write Python code; cqlsh itself is, in fact, a Python script.
For example, using the Cassandra Python driver (the contact point, keyspace, and table name below are placeholders):
from cassandra.cluster import Cluster

# placeholder contact point and keyspace - adjust for your cluster
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")
# fetch the rows themselves (not count(*)) so the driver can page through them
statement = "SELECT * FROM SOME_TABLE"
future = session.execute_async(statement)
rows = future.result()
count = 0
for row in rows:
    count = count + 1
The above relies on the Cassandra Python driver's automatic paging of large result sets.

Partitioning a table in Sybase - select query

My main concern:
I have an existing table with a huge amount of data. It has a clustered index.
My C++ process has a list of many keys. For each key it checks whether the key exists in the table,
and if yes, it then checks whether the existing row and the new row are similar; if there is a change, the row is updated in the table.
In general there will be few changes, but there is a huge amount of data in the table.
So it means there will be a lot of select queries but not many update queries.
What I would I like to achieve:
I just read about partitioning a table in Sybase here.
I wanted to know whether this will be helpful for me, as the article only talks about insert queries. How can I improve my select query performance?
Could anyone please suggest what I should look for in this case?
Yes it will improve your query (read) performance so long as your query is based on the partition keys defined. Indexes can also be partitioned and it stands to reason that a smaller index will mean faster read performance.
For example if you had a query like select * from contacts where lastName = 'Smith' and you have partitioned your table index based on first letter of lastName, then the server only has to search one partition "S" to retrieve its results.
Be warned that partitioning your data can be difficult if you have a lot of different query profiles. Queries that do not include the index partition key (e.g. lastName), such as select * from staff where created > [some_date], will then have to hit every index partition in order to retrieve their result set.
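As a rough illustration only (Sybase ASE 15+ syntax, reusing the contacts example above; the partition boundaries are arbitrary and the real DDL should come from your version's reference manual):
create table contacts (
    contactid int         not null,
    firstName varchar(50) null,
    lastName  varchar(50) not null
)
partition by range (lastName)
    (p_a_to_h values <= ('H'),
     p_i_to_p values <= ('P'),
     p_q_to_z values <= (MAX))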
No one can tell you what you should or shouldn't do, as it is very application specific and you will have to perform your own analysis. Before meddling with partitions, my advice is to ensure you have the correct indexes in place, that they are being hit by your queries (i.e. no table scans), that your server is appropriately resourced (i.e. it has enough fast disk and RAM), and that you have tuned your server caches to suit your queries.