DMV [dm_pdw_exec_requests] showing NULL start_time and resource class - azure-sqldw

I have a SQL procedure that runs on SQL DW every day, and I was trying to analyze the stats captured in the DMV [dm_pdw_exec_requests].
My procedure ran for 288 minutes, but when I looked in the DMV I saw 10 rows, a few of which have a NULL resource class and an empty start_time. Is it fair to exclude all rows with a NULL/empty start_time and resource_class when calculating the total elapsed time?
Thanks,
Aravind

If you query sys.dm_pdw_exec_requests, you should see a single entry for each batch execution. If you have a statement like this:
SELECT 1;
GO
SELECT 2;
GO
You would expect to see two rows in dm_pdw_exec_requests, one for each batch.
In your case, I'm assuming your procedure is the one with the command exec dbo.Proc1. You would look only at the total_elapsed_time for that statement; the other entries are other batches you have executed against your instance. We have a great write-up on how to monitor your workload using DMVs that will be very helpful.
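For example, here is a minimal sketch that isolates the procedure's batch and reads its elapsed time (exec dbo.Proc1 is the stand-in name from above; filter on whatever your actual command text is):
SELECT request_id,
       status,
       submit_time,
       start_time,
       end_time,
       total_elapsed_time, -- milliseconds
       resource_class
FROM sys.dm_pdw_exec_requests
WHERE [command] LIKE 'exec dbo.Proc1%'
ORDER BY submit_time DESC;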

Related

Microsoft PowerBI: Match multiple values in separate table using ID and Date

I am making a demo IoT device monitoring dashboard, and I can't work out which function I should use to check whether devices were online on a certain date.
This is my sample report, table1, from which I split the Date out into a separate column.
A device can report to the server multiple times on a single day; if it doesn't hit the server, no report is generated.
Then I created a lookup table, table2, that contains all the device IDs.
Next I created another table, table3, with generated calendar dates, which I linked to the table1 date.
Now, in the columns, I put my device IDs and want to fill each cell with true or false depending on whether the device reported on that particular date. I am unable to do it.
I tried a measure along the lines of IF ( ISBLANK ( COUNTROWS ( RELATEDTABLE, but it didn't work.
I want to create something like this, which looks up the ID and date and reports accordingly.
It would be a great help if anyone could share an idea.
The screenshots below give you everything you need.
There is a Device Connections table listing the devices and when they connected. I converted the DateTime to a date so that it can be joined to the Dates table, which is just a list of the dates you want to check connections for. There is a relationship connecting the dates of the two tables.
Note: You could also preserve the time at which the device was seen if needed. It is probably best to have it as a separate column. I have discarded the time for simplicity.
I have created a single measure:
HasBeenSeen = IF (CALCULATE(COUNTROWS(DeviceConnections)) > 0, TRUE, FALSE)
This gives TRUE/FALSE depending on whether the device has been seen for whatever filter context exists (e.g. a given date). You could also just count the number of occurrences and display that instead.
Then I created a matrix visual with the Date from the dates table on the rows, Device ID on the columns and HasBeenSeen as the values to give the desired result.
As I said in the comments to your question, if you can accept BLANK in the cells where FALSE is shown, you can apply these simple steps. Table1 alone is sufficient for this; no other table or join is needed.
Create a very simple measure like the one below.
true_false = "True"
Now add a Matrix visual and configure it as below.
And here is the final output:

AWS IoT Analytics queries for retrieving data from dataset using boto3

Can we use a query when retrieving data from a dataset in AWS IoT Analytics? I want data between two timestamps. I'm using boto3 to fetch the data, and I didn't see any option to use a query in get_dataset_content. Below is the boto3 code:
response = client.get_dataset_content(
datasetName='string',
versionId='string'
)
Does anyone have suggestions on how to use a query, or how to retrieve the data between two timestamps, in AWS IoT Analytics?
Thanks,
Pankaj
There could be a few ways to do this depending on what your workflow is; if you have a few more details, that would be helpful.
Possible approaches are:
1) Create a scheduled query to run every hour (for example) where the query looks something like this:
SELECT * FROM my_datastore WHERE __dt >= current_date - interval '1' day
AND my_timestamp >= now() - interval '1' hour
You may need to adjust the format of the timestamp to suit, depending on how you are storing it (epoch seconds, epoch milliseconds, ISO 8601, etc.); see the sketch after this list. If you set this to run every hour, each time it executes you will get the last hour of data. Note that the __dt constraint just helps your query run faster (and cheaper) by limiting the scan to the most recent day only.
2) You can improve on the above by using the delta window function of the dataset, which lets you more easily get the data that has arrived since the query last ran. You could then simplify your query to look like:
select * from my_datastore where __dt >= current_date - interval '1' day
And configure the delta time window to look at your timestamp field. You then control how much data is retrieved by the frequency at which you execute the query (every 15 minutes, every hour, etc.).
3) If you have a more general-purpose requirement to fetch the data between 2 timestamps that you are calculating programmatically, and which may not be of the form now() - some interval, the way you could do this is to create a dataset and then update the dataset with the revised SQL expression before running it with create-dataset-content. That way the dataset content is updated with just the results you need on each execution. If this is of interest, I can expand upon the actual python required.
4) As Thomas suggested, it can often be just as easy to pull out a larger chunk of data with the dataset (for example the last day) and then filter down to the timestamp you want in code. This is particularly easy if you are using panda dataframes for example and there are plenty of related questions such as this one that have good answers.
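On the timestamp-format point in option 1, here is a hedged sketch assuming my_timestamp is stored as epoch milliseconds (the dataset SQL dialect is Presto-like, and from_unixtime expects seconds):
select * from my_datastore
where __dt >= current_date - interval '1' day
-- convert epoch milliseconds to a timestamp before comparing
and from_unixtime(my_timestamp / 1000) >= now() - interval '1' hour
If the field is an ISO 8601 string instead, from_iso8601_timestamp(my_timestamp) would do the conversion.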
Frankly, the easiest thing would be to do your own time filtering (the result of get_dataset_content is a CSV file).
That's what QuickSight does to allow you to navigate the dataset in time.
If this isn't feasible the alternative is to reprocess the datastore with an updated pipeline that filters out everything except the time range you're interested in (more information here). You should note that while it's tempting to use the startTime and endTime parameters for StartPipelineReprocessing, these are only approximate to the nearest hour.

Are Redshift system tables immutable and well ordered?

Redshift system tables only store a few days of logging data, so periodically backing up rows from these tables is a common practice to collect and maintain a proper history. To find new rows added to the system logs, I need to check against my backup tables on either query (number) or execution time.
According to an answer on How do I keep more than 5 days' worth of query logs?, we can simply select all rows with query > (select max(query) from log). The answer is unreferenced and assumes that query values are inserted sequentially.
My question, in two parts (hoping for references or code as proof), is:
are query identifiers expected to be inserted sequentially, and
are system tables, e.g. stl_query, immutable or unchanging?
Assuming that we can't verify or prove both of the above, what's the right strategy for backing up the system tables?
I am wary of this because I fully expect long-running queries to complete after many other queries have started and completed.
I know the query identifier is generated at query submit time, because I can monitor in-progress queries. It is therefore entirely expected that a long-running query=1 may complete after query=2. If the stl_query table is immutable, then query=1 will be inserted after query=2, and the max(query) logic is flawed.
Alternatively, if query=1 is inserted into stl_query at run time, then the row must be updated upon completion (with end time, duration, etc.). That would require me to do an upsert into the backup table, as sketched below.
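For reference, this is the kind of upsert I mean (a sketch only; Redshift has no native UPSERT, so the usual pattern is delete-then-insert in a transaction, shown here against a hypothetical backup table admin.query_history):
begin;
-- drop any previously backed-up rows that may have changed
delete from admin.query_history
using stl_query
where admin.query_history.query = stl_query.query;
-- re-insert the current versions from the system log
insert into admin.query_history
select * from stl_query;
commit;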
I think the stl_query table is indeed immutable; it would seem that it's only written to after a query finishes.
Here is why I think that. First off, I ran this query on a cluster with running queries:
select count(*) from stl_query where endtime is null
This returns 0. My hunch is that you'll probably see the same thing on your side.
To be doubly sure, I also ran this query:
select count(*) from stv_inflight i
inner join stl_query q on q.query = i.query
This also returns zero (while I did have queries in flight), which seems to confirm that queries are only logged in stl_query once they have finished executing, and are not updated afterwards.
That said, I would rewrite the query that inserts into your history table as follows:
insert into admin.query_history (
select * from stl_query
where query not in (select query from admin.query_history)
)
That way, you'll always insert any records you don't have in the history table.
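If the history table grows large, an equivalent anti-join formulation (a sketch using the same table names) often plans better than NOT IN:
insert into admin.query_history (
select q.*
from stl_query q
-- keep only rows that have no match in the history table
left join admin.query_history h on h.query = q.query
where h.query is null
)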

hive aggregate query takes wrong value from cache

I am running an aggregate query in a Hive session:
hive> select count(1) from table_name;
The first time, it runs a MapReduce program and returns the result, but for consecutive runs later in the day it returns the same count from the cache (though the table is updated hourly), which is the wrong count.
I tried:
set hive.metastore.aggregate.stats.cache.enabled=false;
set hive.cache.expr.evaluation=false;
set hive.fetch.task.conversion=none;
But no luck. I am using Hive version 1.2.1.2.3.4.29-5. Thanks
Disable using stats for query calculation:
set hive.compute.query.using.stats=false;
See also this answer for more details: https://stackoverflow.com/a/41021682/2700344
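If you would rather keep the fast stats-based answers, an alternative sketch (assuming you control the hourly load) is to refresh the statistics after each load so the cached count stays accurate:
-- recompute table-level statistics after each hourly load
ANALYZE TABLE table_name COMPUTE STATISTICS;
-- for a partitioned table, recompute stats across all partitions
-- (dt is an illustrative partition column name)
ANALYZE TABLE table_name PARTITION (dt) COMPUTE STATISTICS;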

Creating pairwise combination of ids of a very large table in bigquery

I have a very large table of ids (string) that has 424,970 rows and only a single column.
I am trying to create all pairwise combinations of those ids in a new table. The motivation for creating that table can be found in this question.
I tried the following query to create the pairwise combination table:
#standardSQL
SELECT
t1.id AS id_1,
t2.id AS id_2
FROM
`project.dataset.id_vectors` t1
INNER JOIN
`project.dataset.id_vectors` t2
ON
t1.id < t2.id
But the query fails after 15 minutes with the following error message:
Query exceeded resource limits. 602467.2409093559 CPU seconds were used, and this query must use less than 3000.0 CPU seconds. (error code: billingTierLimitExceeded)
Is there any workaround to run the query and get the desired output table with all combination of ids?
You can try splitting your table T into 2 smaller tables, T1 and T2, then performing the 4 joins between the smaller tables (T1:T1, T1:T2, T2:T1, T2:T2) and unioning the results; see the sketch at the end of this answer. This will be equivalent to joining T with itself. If it still fails, try breaking it down into even smaller tables.
Alternatively, set maximumBillingTier to a higher value (see https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs):
configuration.query.maximumBillingTier - Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default.
If using Java, it can be set in JobQueryConfiguration. This configuration property is not supported in the UI console at the moment.
To split the table, you can use the FARM_FINGERPRINT function in BigQuery. E.g., the first part will have the filter:
where mod(abs(farm_fingerprint(id)), 10) < 5
And the second part will have the filter:
where mod(abs(farm_fingerprint(id)), 10) >= 5
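Putting the two ideas together, here is a sketch of the split-and-union approach, written as a single statement for clarity (in practice you would likely run each of the four joins as a separate job appending to the same destination table):
#standardSQL
WITH t1 AS (
  SELECT id FROM `project.dataset.id_vectors`
  WHERE MOD(ABS(FARM_FINGERPRINT(id)), 10) < 5
),
t2 AS (
  SELECT id FROM `project.dataset.id_vectors`
  WHERE MOD(ABS(FARM_FINGERPRINT(id)), 10) >= 5
)
-- the four partial joins together cover every id_1 < id_2 pair exactly once
SELECT a.id AS id_1, b.id AS id_2 FROM t1 a JOIN t1 b ON a.id < b.id
UNION ALL
SELECT a.id AS id_1, b.id AS id_2 FROM t1 a JOIN t2 b ON a.id < b.id
UNION ALL
SELECT a.id AS id_1, b.id AS id_2 FROM t2 a JOIN t1 b ON a.id < b.id
UNION ALL
SELECT a.id AS id_1, b.id AS id_2 FROM t2 a JOIN t2 b ON a.id < b.id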