hive aggregate query takes wrong value from cache - mapreduce

I am running an aggregate query in a Hive session:
hive> select count(1) from table_name;
The first time, it runs a MapReduce job and returns the result. But for consecutive runs later in the day it returns the same count from the cache (even though the table is updated hourly), which is the wrong count.
I have tried:
set hive.metastore.aggregate.stats.cache.enabled=false;
set hive.cache.expr.evaluation=false;
set hive.fetch.task.conversion=none;
But no luck. I am using Hive version 1.2.1.2.3.4.29-5. Thanks.

Disable using stats for query calculation:
set hive.compute.query.using.stats=false;
See also this answer for more details: https://stackoverflow.com/a/41021682/2700344
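A minimal session sketch (table_name is hypothetical): after disabling stats-based answers, the count is computed by a real job again. Alternatively, if you want to keep the fast stats-based answers, refreshing the statistics after each hourly load should also bring them back in line:

```sql
-- sketch: disable answering queries from table statistics for this session
set hive.compute.query.using.stats=false;
select count(1) from table_name;  -- now runs a job instead of reading stale stats

-- alternative: keep stats-based answers, but refresh stats after each hourly load
analyze table table_name compute statistics;
```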

Related

Schedule the creation a partitioned table overwriting an existing table in BigQuery GCP

Yesterday I scheduled a daily query that overwrites a table. The new table should be partitioned, just like the one it overwrites... It did not run at the scheduled time, nor did it give an error... It just did not start.
My feeling is that it has to do with the partitioning option. Note that the casting of the field date_formatted, which will be used as the partition field, works fine.
As far as I know, when scheduling a query you can't use create or replace table T partitioned by column C as select...
You start from the select... clause, as shown in the image, and I don't know if the problem comes from there.
PS: I had no trouble scheduling an append to a table partitioned by day with this same procedure.
The destination table is in the same dataset.
If the very same query is scheduled to deliver the results to a table with the same name, but in a different dataset (in the same project), it works.
By the way, the input table and the output table were never the same.
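One workaround worth trying, sketched with hypothetical names (mydataset, my_table, source_table, and the assumption that date_formatted is already a DATE column): put the DDL in the scheduled query text itself instead of relying on the schedule's destination-table settings, since BigQuery standard SQL supports partitioned CREATE OR REPLACE TABLE. As far as I know, when a scheduled query contains DDL like this, the schedule's destination table must be left unset.

```sql
-- sketch: hypothetical dataset/table/column names
create or replace table mydataset.my_table
partition by date_formatted
as
select *
from mydataset.source_table;
```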

I would like to know about BigQuery output destination tables

I am new to Google Cloud Platform. I am using BigQuery scheduled queries, and I would like to know one thing about writing the output of a select statement into a separate table.
If the query ends with an order by, is the result written to the output table in that order?
I would like the scheduled query's output table to be sorted by order by id. Is that possible?
Sorry for the rudimentary question, but thank you in advance.
BigQuery tables don't guarantee any particular order of their records. The exception is clustering, which physically sorts the data in the background; but even that does not guarantee a sorted query output. You have to use ORDER BY to get a guaranteed sorted output, which means you need a key you can sort by.
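In other words, sort at read time rather than at write time. A minimal sketch (mydataset.schedule_output and id are hypothetical names):

```sql
-- sketch: the stored order of the table is irrelevant; sort when you read it
select *
from mydataset.schedule_output
order by id;
```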

CURRENT_DATE in BigQuery result to yesterday's date

I am trying to use the current_date function in my BQ query to fetch today's data, but it is not working. After debugging, I found that this function returns yesterday's date.
Unfortunately I am not able to add a screenshot here.
Below is the query that I ran:
select current_date as the_date
result = 2020-08-20
It should be 2020-08-21.
Any idea how to resolve this, or how to fetch the current date in BigQuery?
If you don't specify a time zone, it uses the default for your project. I would guess your default is misconfigured.
Running these queries returns different results:
select current_date('US/Pacific') as the_date
select current_date('Australia/Melbourne') as the_date
The allowable time zone values are listed here, along with how to use current_date().
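A quick way to see the effect side by side (a sketch; Asia/Kolkata is just an example zone, substitute your own):

```sql
select
  current_timestamp() as utc_now,               -- TIMESTAMP, always anchored to UTC
  current_date() as default_zone_date,          -- uses the project's default time zone
  current_date('Asia/Kolkata') as pinned_date;  -- explicit time zone, independent of defaults
```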

Using merge in Power Query while keeping native query

I'm trying to reduce my dataset of 1,000,000 records to only the subset I need (roughly 500) by creating an inner join to a different table. Unfortunately, it seems that Power Query drops the native query and loads the entire dataset before reducing it by merging it with the related table. I have no access to the database, unfortunately; otherwise I would have written the SQL myself. Is there a way to make Merge work with a native SQL query?
Thanks
I would first check that your "related table" query can run as a native query - right-click on its last step and check whether View Native Query is enabled.
If that's the case, then it may be due to the Join Kind in the Merge Queries step. I've noticed that against SQL Server data sources, Join Kinds other than the default Left Outer Join tend to kill the Native Query option.
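For reference, when the merge does fold, Power Query pushes the whole reduction down to the server as a single join, roughly like this (hypothetical table and column names):

```sql
-- sketch: what a folded inner-join merge would send to the server,
-- so only ~500 rows travel over the wire instead of 1,000,000
select t.*
from big_table t
inner join filter_table f
  on f.key_column = t.key_column;
```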

Are Redshift system tables immutable and well ordered?

Redshift system tables only store a few days of logging data, so periodically backing up rows from these tables is a common practice for collecting and maintaining a proper history. To find new rows added to the system logs, I need to check against my backup tables on either the query (number) or the execution time.
According to an answer on How do I keep more than 5 days' worth of query logs?, we can simply select all rows with query > (select max(query) from log). The answer is unreferenced and assumes that query values are inserted sequentially.
My question, in two parts - hoping for references or code-as-proof - is:
are query identifiers expected to be inserted sequentially, and
are system tables, e.g. stl_query, immutable or unchanging?
Assuming we can't verify or prove both of the above, what is the right strategy for backing up the system tables?
I am wary of this because I fully expect long-running queries to complete after many other queries have started and completed.
I know the query identifier is generated at query submit time, because I can monitor in-progress queries. Therefore it is completely expected that a long-running query=1 may complete after query=2. If the stl_query table is immutable (written only at completion), then query=1 will be inserted after query=2, and the max(query) logic is flawed.
Alternatively, if query=1 is inserted into stl_query at run time, then the row must be updated upon completion (with end time, duration, etc.). This would require me to do an upsert into the backup table.
I think the stl_query table is indeed immutable; it would seem that it's only written to after a query finishes.
Here is why I think that. First off, I ran this query on a cluster with running queries:
select count(*) from stl_query where endtime is null
This returns 0. My hunch is that you'll probably see the same thing on your side.
To double-check, I also ran this query:
select count(*) from stv_inflight i
inner join stl_query q on q.query = i.query
This also returns zero (while I did have queries in flight), which seems to confirm that queries are only logged in stl_query once they have finished executing, and are not updated afterwards.
That said, I would rewrite the query that inserts into your history table as follows:
insert into admin.query_history (
  select * from stl_query
  where query not in (select query from admin.query_history)
)
That way, you'll always insert any records you don't have in the history table.
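The same idea can be written as an anti-join, which tends to scale better than NOT IN as the history table grows (a sketch, assuming the same hypothetical admin.query_history table):

```sql
-- sketch: anti-join instead of NOT IN; same result, usually cheaper on large tables
insert into admin.query_history
select q.*
from stl_query q
left join admin.query_history h
  on h.query = q.query
where h.query is null;
```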