BigQuery UPDATE query to fix broken timestamps is too complex or resource-intensive - sql-update

I have several tables that have invalid Standard-SQL TIMESTAMP records. These records are nested at least one level deep in an array. They break SELECT * on these tables, even when using Legacy SQL, and they also break exporting the table as JSON. When I try to UPDATE these tables to fix the records, the UPDATE errors out unless the statement fixes all the broken fields at the same time, which leads to very large UPDATE statements. Example: https://gist.github.com/dadrian/b83585c23f6cbbcd5f6d6478c92c745d
That UPDATE statement is too big to compile!
Error: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex
I then took a second approach: SELECT a "fixed-up" STRUCT for each parent field containing the arrays with invalid timestamps (e.g. the p443.https.tls field) into its own table, then UPDATE the original table by joining on each "fixed-up struct" table. After each SELECT happens, the UPDATE looks like:
UPDATE `scratch.domain_20170819_copy` target
SET
  target.p443.https.tls = fixed_https.tls,
  target.p443.https_www.tls = fixed_https_www.tls,
  target.p25.smtp.starttls.tls = fixed_smtp_starttls.tls
FROM `scratch.domain_20170819_copy` AS original
INNER JOIN `scratch.domain_20170819_https_tls` AS fixed_https
  ON original.domain = fixed_https.domain
INNER JOIN `scratch.domain_20170819_https_www_tls` AS fixed_https_www
  ON original.domain = fixed_https_www.domain
INNER JOIN `scratch.domain_20170819_smtp_starttls_tls` AS fixed_smtp_starttls
  ON original.domain = fixed_smtp_starttls.domain
WHERE
  original.domain = target.domain
  AND original.domain = fixed_https.domain
  AND original.domain = fixed_https_www.domain
  AND original.domain = fixed_smtp_starttls.domain
This works fine on small enough tables. On larger tables, or tables with more (similarly-broken) fields, the UPDATE statement does not finish, and errors after about 30 minutes with
Resources exceeded during query execution: ORDER BY operator used too much memory..
How can I fix these tables?
EDIT: The invalid timestamps are due to accidentally outputting an integer timestamp in milliseconds instead of seconds in our data source. BigQuery interprets the value as seconds, which puts the timestamps around the year 48000.
EDIT: I added the schema and an example data object to the gist.
A quick description of the schema: The relevant data is of the form a.b.c.tls, where tls is the parent object containing all the broken data. tls contains a number of things, including certificate, which is an object, and chain, which is an array of certificate objects. A certificate contains parsed.extensions.signed_certificate_timestamps, which is an array of structs. One of the fields of signed_certificate_timestamps is timestamp, which contains the invalid timestamps in question. This effectively means I have a nested invalid timestamp for every ...tls.certificate.parsed.extensions.signed_certificate_timestamps, and a doubly-nested invalid timestamp for every ...tls.chain.certificate.parsed.....signed_certificate_timestamps.
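For illustration, here is the general shape of one of those "fixed-up" SELECTs, using deliberately simplified, hypothetical field names (a single top-level array instead of the real certificate/chain nesting). The repair expression TIMESTAMP_MILLIS(UNIX_SECONDS(...)) is an assumption that undoes "milliseconds read as seconds"; substitute whatever expression actually repairs a single value in your data. The real query has to wrap this pattern in a SELECT AS STRUCT ... REPLACE at every enclosing struct level, which is what makes the full statements so large.
-- Hypothetical, simplified schema: scts ARRAY<STRUCT<timestamp TIMESTAMP, log_id STRING>>
-- stands in for ...tls.certificate.parsed.extensions.signed_certificate_timestamps.
SELECT
  domain,
  ARRAY(
    SELECT AS STRUCT sct.* REPLACE(
      TIMESTAMP_MILLIS(UNIX_SECONDS(sct.timestamp)) AS timestamp  -- assumed repair expression
    )
    FROM UNNEST(scts) AS sct
  ) AS scts
FROM `scratch.domain_20170819_copy`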

Related

Google BigQuery: splitting an ingestion-time partitioned table

I have an ingestion time partitioned table that's getting a little large. I wanted to group by the values in one of the columns and use that to split it into multiple tables. Is there an easy way to do that while retaining the original _PARTITIONTIME values in the set of new ingestion time partitioned tables?
Also I'm hoping for something that's relatively simple/cheap. I could do something like copy my table a bunch of times and then delete the data for all but one value on each copy, but I'd get charged a huge amount for all those DELETE operations.
Also I have enough unique values in the column I want to split on that saving a "WHERE column = value" query result to a table for every value would be cost prohibitive. I'm not finding any documentation that mentions whether this approach would even preserve the partitions, so even if it weren't cost prohibitive it may not work.
The case you describe requires two-level partitioning, which is not supported yet.
You can create a column-partitioned table: https://cloud.google.com/bigquery/docs/creating-column-partitions
You would then populate that partitioning column as needed before inserting, but in this case you lose the original _PARTITIONTIME value.
Based on the additional clarification: I had a similar problem, and my solution was to write a Python application that reads the source table (reading is important here, not querying, so it is free), splits the data based on your criteria, and then either streams the data into the target tables (simple, but not free) or generates JSON/CSV files and loads them into the target tables (also free, but with some limits on the number of such operations). The second route requires more coding and exception handling.
You could also do it via Dataflow; it will definitely be more expensive than a custom solution, but potentially more robust.
Example using the google-cloud-bigquery Python library:
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")
t1 = client.get_table(source_table_ref)
target_schema = t1.schema[1:]  # drop the first column, which is the key to split on
ds_target = client.dataset(dataset_id=target_dataset, project=target_project)
rows_to_process_iter = client.list_rows(t1, start_index=start_index, max_results=max_results)
# materialize the row iterator
rows_to_process = list(rows_to_process_iter)
# ... do something with the records ...
# stream the processed records into the destination table
errors = client.create_rows(target_table, records_to_stream)
BigQuery now supports clustered partitioned tables, which allow you to specify additional columns that the data should be split by.
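For example, a minimal DDL sketch (hypothetical project, dataset, and column names) of a table partitioned by a date column and clustered by the column you would otherwise have split the table on:
-- Hypothetical names throughout.
CREATE TABLE `my_project.my_dataset.events_clustered`
(
  event_date   DATE,    -- e.g. a copy of DATE(_PARTITIONTIME) from the source table
  split_column STRING,  -- the column you wanted to split on
  payload      STRING
)
PARTITION BY event_date
CLUSTER BY split_column;
Queries that filter on split_column then read only the matching blocks within each partition, which gives much of the benefit of per-value tables without maintaining many copies.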

Detecting delta records for nightly capture?

I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables, different options can be used:
if a table explicitly contains a reference to the last change time, this can be used
if a table has guaranteed update characteristics (e.g. no in-place updates and monotonically increasing ID values), this can be used, e.g. read all records where the ID is larger than the last processed ID
if the table does not provide intrinsic information about change time, one could maintain a copy of the table that contains only the records processed so far. This copy can then be used to compare against the current table and compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach. (A plain-SQL sketch of the last two options follows below.)
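A minimal plain-SQL sketch of the last two options, assuming hypothetical names (SRC is the live table, SRC_SNAPSHOT is the copy of already-processed rows, ID is the key, and :last_processed_id is a stored watermark):
-- Option: monotone ID watermark (hypothetical bind variable).
SELECT * FROM SRC WHERE ID > :last_processed_id;

-- Option: compare against a snapshot copy of already-processed rows.
-- (NULL-safe column comparison left out for brevity; deleted rows would need the reverse anti-join.)
SELECT src.*
FROM SRC AS src
LEFT OUTER JOIN SRC_SNAPSHOT AS snap
  ON snap.ID = src.ID
WHERE snap.ID IS NULL                        -- row is new
   OR snap.SOME_COLUMN <> src.SOME_COLUMN;   -- row changed (repeat per relevant column)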
In my experience, efforts to try to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming and expensive than using the corresponding features of ETL tools.
It is possible to create a log table, organize its columns according to your needs, and populate it from triggers on your database tables so that each change produces a log record with a timestamp value. Then you can query your log table to determine which records were inserted, updated or deleted in your source tables.
For example, the following is one of my test triggers:
CREATE TRIGGER "A00077387"."SALARY_A_UPD"
AFTER UPDATE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW, NEW ROW MYNEWROW
FOR EACH ROW
BEGIN
  INSERT INTO SalaryLog (Employee, Salary, Operation, DateTime)
  VALUES (:mynewrow.Employee, :mynewrow.Salary, 'U', CURRENT_DATE);
END;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE one.
You can organize your log table so that it can track more than one source table if you wish, just by keeping the table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use separate log tables for each source table.
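A minimal DDL sketch of such a shared log table, with hypothetical column names (the trigger would fill CHANGED_AT, e.g. with CURRENT_TIMESTAMP):
-- Hypothetical shared log table covering several source tables.
CREATE COLUMN TABLE CHANGE_LOG (
  TABLE_NAME NVARCHAR(128),  -- which source table the change belongs to
  PK_VALUES  NVARCHAR(512),  -- primary-key value(s) of the changed row
  OPERATION  NVARCHAR(1),    -- 'I', 'U' or 'D'
  CHANGED_AT TIMESTAMP       -- change timestamp written by the trigger
);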

Does searching by id depend on the number of columns in Postgres?

I have the following query: MyModel.objects.filter(id__in=ids).
I noticed that increasing the number of columns in the table decreases the speed of the above query.
Why is that?
Query time in Postgres mostly consists of planning time, execution time and data fetch.
Planning time and execution time should not be affected by the number of columns in the table, but the data fetch phase definitely is, as you are returning more data.
Also, an additional step that happens is the mapping of the returned data into a Django QuerySet, which takes more time when more columns are involved.
To limit the scope of the data returned, where applicable, you can use values, defer, or only.
In some complex data-modeling situations, your models might contain a lot of fields, some of which could contain a lot of data (for example, text fields), or require expensive processing to convert them to Python objects. If you are using the results of a queryset in some situation where you don’t know if you need those particular fields when you initially fetch the data, you can tell Django not to retrieve them from the database.
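For example (hypothetical table and column names), the difference is roughly the one between the two statements below; Django generates explicit column lists rather than SELECT *:
-- Default queryset: every column of myapp_mymodel is fetched.
SELECT id, name, big_text_col, another_col
FROM myapp_mymodel
WHERE id IN (1, 2, 3);

-- With .values('id', 'name') or .only('id', 'name'): far less data is returned per row.
SELECT id, name
FROM myapp_mymodel
WHERE id IN (1, 2, 3);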

Resources Exceeded During Query Execution: Custom Dimensions & MAX(IF(...))

I'm performing what I thought was a simple events query in BigQuery with two custom dimensions. When trying to execute this query for year-to-date, I get the following error:
query: Resources exceeded during query execution. (error code:
resourcesExceeded)
Research into this error and the 'resourcesExceeded' error code indicates that it happens most frequently when using window functions, joins, COUNT(DISTINCT ...) or GROUP EACH BY, none of which I am using here. The only thing I am doing is ORDER BY Date ASC, and removing that line does not appear to eliminate the error. Since this is a shared-resources issue, I think it has to be related to the two custom dimensions I'm trying to pull, since the MAX/IF functions seem to be the most resource-intensive portions of this query. Here's a snippet of the query I'm running:
SELECT
DATE,
userId,
fullVisitorId,
visitId,
trafficSource.source,
trafficSource.medium,
trafficSource.campaign,
trafficSource.adContent,
MAX(IF (hits.customDimensions.INDEX = 1,hits.customDimensions.value,NULL)) WITHIN RECORD AS XXXXXX,
MAX(IF (hits.customDimensions.INDEX = 2,hits.customDimensions.value,NULL)) WITHIN RECORD AS YYYYYY,
totals.visits,
totals.bounces,
totals.pageviews
FROM (
TABLE_DATE_RANGE([########.ga_sessions_],
TIMESTAMP('2016-01-01'), # start date
TIMESTAMP('2016-07-31')) # end date
)
ORDER BY DATE ASC;
I've tried this query both through the BigQuery console UI and from the command line. I've also set the 'allowLargeResults' option to True.

Database polling, prevent duplicate fetches

I have a system whereby a central MSSQL database keeps in a table a queue of jobs that need to be done.
Because processing requirements would not be that high, and requests would not be particularly frequent (probably once every few seconds at most), we decided to have the applications that utilise the queue simply query the database whenever a job is needed; there is no message queue service at this time.
A single fetch is performed by having the client application run a stored procedure, which performs the query(ies) involved and returns a job ID. The client application then fetches the job information by querying by ID and sets the job as handled.
Performance is fine; the only snag we have felt is that, because the client application has to query for the details and perform a check before the job is marked as handled, on very rare occasions (once every few thousand jobs), two clients pick up the same job.
As a way of solving this problem, I was considering having the initial stored procedure "tag" the record it pulls with the current date and time. When querying for records, the stored procedure would only pull records whose tag is at least a certain amount of time, say 5 seconds, in the past. That way, if the stored procedure runs twice within 5 seconds, the second run will not pick up the same job.
Can anyone foresee any problems with fixing the problem this way or offer an alternative solution?
Use a UNIQUEIDENTIFIER field as your marker. When the stored procedure runs, lock the row you're reading and update the field with a NEWID(). You can mark your polling statement using something like WITH(READPAST) if you're worried about deadlocking issues.
The reason to use a GUID here is to have a unique identifier that will serve to mark a batch. Your NEWID() call is guaranteed to give you a unique value, which will be used to prevent you from accidentally picking up the same data twice. GETDATE() wouldn't work here because you could end up having two calls that resolve to the same time; BIT wouldn't work because it wouldn't uniquely mark off batches for picking up or reporting.
For example,
declare @ReadID uniqueidentifier
declare @BatchSize int = 20; -- make this a parameter of your procedure
set @ReadID = NEWID();

UPDATE tbl WITH (ROWLOCK)
SET HasBeenRead = @ReadID -- your UNIQUEIDENTIFIER field
FROM (
    SELECT TOP (@BatchSize) Id
    FROM tbl WITH (UPDLOCK, ROWLOCK, READPAST)
    WHERE HasBeenRead IS NULL
    ORDER BY [Id]
) AS t1
WHERE tbl.Id = t1.Id

SELECT Id, OtherCol, OtherCol2
FROM tbl WITH (UPDLOCK, ROWLOCK, READPAST)
WHERE HasBeenRead = @ReadID
And then you can use a polling statement like
SELECT COUNT(*) FROM tbl WITH(READPAST) WHERE HasBeenRead IS NULL
Adapted from here: https://msdn.microsoft.com/en-us/library/cc507804%28v=bts.10%29.aspx