Resources Exceeded During Query Execution: Custom Dimensions & MAX(IF(...)) - python-2.7

I'm performing what I thought was a simple events query in BigQuery with two custom dimensions. When trying to execute this query for year-to-date, I get the following error:
query: Resources exceeded during query execution. (error code:
resourcesExceeded)
Research into this error and the 'resourcesExceeded' error code indicates that it happens most frequently when using window functions, joins, COUNT(DISTINCT()) or GROUP EACH BY, none of which I am using here. The only candidate is ORDER BY Date ASC, but removing that line doesn't appear to eliminate the error. Since this is a shard resource issue, I think it has to be related to the two custom dimensions I'm trying to pull, since the MAX(IF(...)) expressions seem to be the most resource-intensive portion of this query. Here's a snippet of the query I'm running:
SELECT
  DATE,
  userId,
  fullVisitorId,
  visitId,
  trafficSource.source,
  trafficSource.medium,
  trafficSource.campaign,
  trafficSource.adContent,
  MAX(IF(hits.customDimensions.INDEX = 1, hits.customDimensions.value, NULL)) WITHIN RECORD AS XXXXXX,
  MAX(IF(hits.customDimensions.INDEX = 2, hits.customDimensions.value, NULL)) WITHIN RECORD AS YYYYYY,
  totals.visits,
  totals.bounces,
  totals.pageviews
FROM (
  TABLE_DATE_RANGE([########.ga_sessions_],
    TIMESTAMP('2016-01-01'),  # start date
    TIMESTAMP('2016-07-31'))  # end date
)
ORDER BY DATE ASC;
I've tried this query both through the BigQuery console UI and from the command line. I've also set the option 'allowLargeResults' to True.
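One detail worth noting: in legacy SQL, allowLargeResults only takes effect when a destination table is also specified, and such a query cannot use a top-level ORDER BY. Whether that resolves the error for this particular query is not certain, but for reference, this is a sketch of how the query would have to look when writing large results to a destination table (same placeholders as above, with the ORDER BY removed; sorting can be applied afterwards over the much smaller destination table):
SELECT
  DATE,
  userId,
  fullVisitorId,
  visitId,
  trafficSource.source,
  trafficSource.medium,
  trafficSource.campaign,
  trafficSource.adContent,
  MAX(IF(hits.customDimensions.INDEX = 1, hits.customDimensions.value, NULL)) WITHIN RECORD AS XXXXXX,
  MAX(IF(hits.customDimensions.INDEX = 2, hits.customDimensions.value, NULL)) WITHIN RECORD AS YYYYYY,
  totals.visits,
  totals.bounces,
  totals.pageviews
FROM (
  TABLE_DATE_RANGE([########.ga_sessions_],
    TIMESTAMP('2016-01-01'),  # start date
    TIMESTAMP('2016-07-31'))  # end date
)
# no top-level ORDER BY when allowLargeResults writes to a destination table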

Related

Google Bigquery: Join of two external tables fails if one of them is empty

I have two external tables in BigQuery, created on top of JSON files on Google Cloud Storage. The first one is a fact table; the second holds error data, and it might or might not be empty.
I can query each table separately just fine, even an empty one - here is an empty table query result example.
I'm also able to left join them if both of them are non-empty.
However, if the errors table is empty, my query fails with the following error:
The query specified one or more federated data sources but not all of them were scanned. It usually indicates incorrect uri specification or a 'limit' clause over a union of federated data sources that was satisfied without having to read all sources.
This situation isn't covered anywhere in the docs, and it's not related to this versioning issue - Reading BigQuery federated table as source in Dataflow throws an error
I'd rather avoid converting either of these tables to native tables, since they are used in just one step of the ETL process and the data is dropped afterwards. One of them being empty doesn't look like an exceptional situation, since a plain SELECT works just fine.
Is some workaround possible?
UPD: raised an issue with Google, waiting for response - https://issuetracker.google.com/issues/145230326
It feels like a bug. One workaround is to use scripting to avoid querying the empty table:
DECLARE is_external_table_empty BOOL DEFAULT
  (SELECT 0 = (SELECT COUNT(*) FROM your_external_table));
-- do things differently when is_external_table_empty is true
IF is_external_table_empty = true THEN
  ...
ELSE
  ...
END IF;
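A slightly fuller sketch of the same workaround applied to this scenario, with hypothetical names throughout (my_dataset.fact_table, my_dataset.errors_table, the destination joined_result, the join key id and the column error_message are placeholders, not taken from the question):
-- Hypothetical: only attempt the LEFT JOIN when the external errors table
-- actually has rows, so the empty federated source is never part of the join.
DECLARE errors_empty BOOL DEFAULT
  (SELECT 0 = (SELECT COUNT(*) FROM my_dataset.errors_table));
IF errors_empty THEN
  CREATE OR REPLACE TABLE my_dataset.joined_result AS
  SELECT f.*, CAST(NULL AS STRING) AS error_message
  FROM my_dataset.fact_table AS f;
ELSE
  CREATE OR REPLACE TABLE my_dataset.joined_result AS
  SELECT f.*, e.error_message
  FROM my_dataset.fact_table AS f
  LEFT JOIN my_dataset.errors_table AS e
    ON f.id = e.id;
END IF;
Both branches produce the same output shape, so downstream ETL steps don't need to care which one ran.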

PDI - Update field value in Logging tables

I'm trying to create a transformation that can change a field value in my database (PostgreSQL is what I use).
Case:
In my Postgres database I have a table called Monitoring with several fields, such as id, date, starttime, endtime, duration, transformation name, status and desc. All of those values come from Transformation Logging.
So, when I run the transformation it inserts a row into the Monitoring table and sets the status field to Running, and when it is done it updates the status to Finish. What I'm trying to do is define the values of the table fields myself rather than take them from Transformation Logging, so I can customize the values the way I want.
The goal is to update the transformation status value from 'running' to 'finish/error/abort etc.' in my database using Pentaho, and to display that status in a web app.
I have been thinking of using the Modified Java Script step to do it, but is there another, better way? (I just need opinions on this.)
Apart from my remark, did you try the Value Mapper?
Modified JavaScript is not a good idea to use; ideally it should be avoided due to its performance cost. You can use the "Add constant" step or a "User Defined Java Class" step as an alternative.
You cannot change the values of the built-in logging tables, for the simple reason that they are reserved for PDI's own use. This causes a known issue in the case of a hard error: for example, the status is not set to Finish when the database server crashes, or when a NullException is not caught by the PDI code.
You have some workarounds.
The simplest, the one used in the ETL-Pilot, is to test (Status = Finish OR LogDate < 15 minutes ago) in the web app.
You can update the table when the transformation is not running. For example, put an hourly (or more frequent) crontab in place that sets the status to Finish for any transformation whose LogDate is older than 15 minutes (see the sketch after this list). This crontab may be a simple SQL statement, or part of a transformation that also checks the table sizes and/or sends an email in case of a potential error.
You can copy the table (if that is a non-locking operation in your DB system), modify the Status column and use that copy for your web app.
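A minimal sketch of that crontab-style cleanup, assuming a PostgreSQL monitoring table named monitoring with status and logdate columns (the table and column names are placeholders for whatever your logging table actually uses):
-- Hypothetical hourly cleanup: any transformation still marked as running
-- whose last log entry is older than 15 minutes is assumed to be done (or dead).
UPDATE monitoring
SET status = 'finish'   -- or 'error', depending on how stale runs should appear
WHERE status = 'running'
  AND logdate < NOW() - INTERVAL '15 minutes';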

Query hive table with Spark

I am a newbie to Apache Hive and Spark. I have some existing Hive tables sitting on my Hadoop server that I can query with HQL commands, using hive or beeline, to get what I want out of a table, e.g. selecting its first 5 rows. Instead of that I want to use Spark to achieve the same goal. My Spark version on the server is 1.6.3.
Using the code below (where I have replaced my actual database and table names with database and table):
sc = SparkContext(conf = config)
sqlContext = HiveContext(sc)
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
df = query.toPandas()
df.show()
I get this error:
ValueError: Some of types cannot be determined after inferring.
Error:root: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
However, I can use beeline with same query and see the results.
After a day of googling and searching I modified the code as:
table_ccx = sqlContext.table("database.table")
table_ccx.registerTempTable("temp")
sqlContext.sql("SELECT * FROM temp LIMIT 5").show()
Now the error is gone but all the row values are null except one or two dates and column names.
I also tried
table_ccx.refreshTable("database.table")
and it did not help. Is there a setting or configuration that I need to ask my IT team to change? I appreciate any help.
EDIT: Having said that, my Python code works for some of the tables on Hadoop. I don't know whether the problem is caused by certain entries in the table or not. If so, how come the corresponding beeline/Hive command works?
As came out in the comments, straightening up the code a little makes the thing work.
The problem lies on this line of code:
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
What you are doing here is:
asking Spark to query the data source (which creates a DataFrame)
collect everything on the driver as a local collection
parallelize the local collection on Spark with createDataFrame
In general the approach should work, although it's evidently unnecessarily convoluted.
The following will do:
query = sqlContext.sql("SELECT * from database.table LIMIT 5")
I'm not entirely sure why the original version breaks your code, but it evidently does (as came out in the comments), and the simpler form also improves it.

UPDATE BigQuery query to fix broken timestamps is too complex or resource intensive

I have several tables that have invalid Standard-SQL TIMESTAMP records. These records are nested at least one level deep in an array. They break SELECT * on these tables, even when using Legacy SQL. They also break exporting the table as JSON. When I try to UPDATE these tables to fix the records, the UPDATE errors out unless the statement fixes all the broken fields at the same time. This leads to large UPDATE statements. Example: https://gist.github.com/dadrian/b83585c23f6cbbcd5f6d6478c92c745d
That UPDATE statement is too big to compile!
Error: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex
I then took a second approach: SELECT out a "fixed-up" STRUCT for each parent field containing the arrays with invalid timestamps (e.g. the p443.https.tls field) into its own table, then UPDATE the original table by joining on each "fixed-up struct" table. After each SELECT happens, the UPDATE looks like:
UPDATE `scratch.domain_20170819_copy` target
SET
  target.p443.https.tls = fixed_https.tls,
  target.p443.https_www.tls = fixed_https_www.tls,
  target.p25.smtp.starttls.tls = fixed_smtp_starttls.tls
FROM `scratch.domain_20170819_copy` AS original
INNER JOIN `scratch.domain_20170819_https_tls` AS fixed_https
  ON original.domain = fixed_https.domain
INNER JOIN `scratch.domain_20170819_https_www_tls` AS fixed_https_www
  ON original.domain = fixed_https_www.domain
INNER JOIN `scratch.domain_20170819_smtp_starttls_tls` AS fixed_smtp_starttls
  ON original.domain = fixed_smtp_starttls.domain
WHERE
  original.domain = target.domain
  AND original.domain = fixed_https.domain
  AND original.domain = fixed_https_www.domain
  AND original.domain = fixed_smtp_starttls.domain
This works fine on small enough tables. On larger tables, or tables with more (similarly broken) fields, the UPDATE statement does not finish, and errors out after about 30 minutes with:
Resources exceeded during query execution: ORDER BY operator used too much memory.
How can I fix these tables?
EDIT: The invalid timestamps are due to accidentally outputting an integer timestamp as milliseconds instead of seconds in our data source. BigQuery interprets the value as seconds, which puts the timestamps around the year 48000.
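Given that cause, the scalar repair amounts to reinterpreting the epoch value: the broken value was effectively built as TIMESTAMP_SECONDS(milliseconds), so reading the seconds back out recovers the original millisecond count. A minimal Standard SQL sketch, assuming the broken value can still be read as a TIMESTAMP at all (the column and table names here are placeholders):
-- Hypothetical scalar fix: UNIX_SECONDS() on the broken value returns the
-- original millisecond count, which TIMESTAMP_MILLIS() turns back into the real time.
SELECT TIMESTAMP_MILLIS(UNIX_SECONDS(broken_ts)) AS fixed_ts
FROM `scratch.some_table`;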
EDIT: I added the schema and an example data object to the gist.
A quick description of the schema: the relevant data is of the form a.b.c.tls, where tls is the parent object containing all the broken data. tls contains a number of things, including certificate, which is an object, and chain, which is an array of certificate objects. A certificate contains parsed.extensions.signed_certificate_timestamps, which is an array of structs. One of the fields of signed_certificate_timestamps is timestamp, which contains the invalid timestamps in question. This effectively means I have a nested invalid timestamp for every ...tls.certificate.parsed.extensions.signed_certificate_timestamps, and doubly-nested invalid timestamps for every ...tls.chain.certificate.parsed.....signed_certificate_timestamps.
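For the nested case, each per-field "fixed-up" SELECT would follow the same pattern at every level: rebuild the array of structs with only the timestamp replaced. A simplified sketch, assuming (unlike the real schema) that the signed_certificate_timestamps array sits directly on the row; the real query would wrap this inside each enclosing STRUCT up to the tls field:
-- Hypothetical, simplified: rebuild the array of SCT structs with the
-- timestamp corrected and every other field left untouched.
SELECT
  domain,
  ARRAY(
    SELECT AS STRUCT
      sct.* REPLACE (TIMESTAMP_MILLIS(UNIX_SECONDS(sct.timestamp)) AS timestamp)
    FROM UNNEST(signed_certificate_timestamps) AS sct
  ) AS signed_certificate_timestamps
FROM `scratch.domain_20170819_copy`;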

Oracle Materialized View Refresh fails with ORA-01555

I have a materialized view set to refresh on demand:
CREATE MATERIALIZED VIEW XYZ
REFRESH COMPLETE ON DEMAND
AS
SELECT * FROM ABC WHERE LAST_UPD > SYSDATE-30;
When I run a procedure to refresh it, it fails every two days.
Refresh command:
dbms_mview.refresh(list => 'XYZ',
                   method => 'C',
                   parallelism => 0,
                   atomic_refresh => false);
Error:
1 - ERROR IN MERGE : ORA-12008: error in materialized view refresh path
ORA-01555: snapshot too old: rollback segment number 406 with name "_SYSSMU406_3487494604$" too small
ORA-02063: preceding line from IJSFASIEBEL
I've read that using SELECT * to create the materialized view can cause this error, but I've dropped the view and recreated it many times; the refresh runs fine one day and errors out the next. No changes were made to the base table.
Can anyone tell me what the error message means or what might be causing the issue?
The problem is that your rollback segments are not large enough for the query that is being run given the other updates happening on the database at the same time.
There is a full discussion of what this means here:
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:275215756923
Possible solutions:
Create a larger rollback segment, to allow more changes to occur during the refresh without running out of rollback space.
Create an index on LAST_UPD to improve the speed of the query, if it indeed does (see the sketch after this list).
Run the refresh at a quieter time of day.
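A couple of these options translate into very small statements. A hedged sketch, assuming the base table is ABC as in the question; the index name is made up, and on a database with automatic undo management the "larger rollback segment" advice usually becomes an undo retention / undo tablespace sizing question, so the retention value below is only an example to be agreed with your DBA:
-- Speed up the materialized view query so it holds a consistent read for less time
CREATE INDEX abc_last_upd_idx ON abc (last_upd);
-- Ask Oracle to keep undo around longer during the refresh (value in seconds)
ALTER SYSTEM SET UNDO_RETENTION = 10800;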
Pratheek Ponnuru,
Please check whether there are any LOBs in the table, and then check for LOB corruption.
If LOBs are corrupted, this error tends to appear.
I faced the same issue recently: I checked all the LOBs in the table for corruption and, on further investigation, found some corrupted LOB segments, which I later set to blob_null().
-- Milind Kale