Identifying unique texts - if-statement

I'm making a Utilization file for our team.
I'm having a bit of difficulty identifying what kind of workflow an agent did on a given day.
I first need to identify the workflows done by that agent on a specific day, because each workflow has a different AHT (average handling time) used in the computation of their capacity for that day.
I have this file where
Column A = agent's name
Column B = date
Column C = workflow
Is there a way to identify the workflows that the agent did that day?
Note: some agents work on different workflows each day.
Here's a sample of what I was trying to do (screenshot "Sample 2").

try:
=IF((I2="")*(I3=""),,UNIQUE(IFERROR(FILTER(D2:D, B2:B=I2, C2:C=I3), "no data")))
spreadsheet demo
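If your data is laid out exactly as described in the question (A = agent, B = date, C = workflow) and I2/I3 hold the agent and the date you are checking, a hedged adaptation of the same idea would be (the I2/I3 helper cells are an assumption carried over from the formula above):

=IF(OR(I2="",I3=""),"",UNIQUE(IFERROR(FILTER(C2:C, A2:A=I2, B2:B=I3), "no data")))

FILTER keeps only the rows for that agent and date, UNIQUE collapses repeated workflows, and IFERROR returns "no data" when no rows match.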

Related

How to calculate avgPreviousExecutionMs for a job in google bigquery

In the performanceInsights of a BigQuery job there is a field called avgPreviousExecutionMs. Its description reads:
Output only. Average execution ms of previous runs. Indicates the job ran slow compared to previous executions. To find previous executions, use INFORMATION_SCHEMA tables and filter jobs with the same query hash.
I tried to validate the avgPreviousExecutionMs value for one of my jobs against the INFORMATION_SCHEMA views, filtering for queries with the same hash using the query_info.query_hashes.normalized_literals field.
Steps I took to validate this:
I ran a job on flat-rate pricing concurrently with other queries to make it slower, so that the avgPreviousExecutionMs field would appear in the performanceInsights section.
Now I want to validate this field against the INFORMATION_SCHEMA data.
I ran this query against the INFORMATION_SCHEMA, excluding my current job ID:
SELECT
  AVG(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_duration
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE query_info.query_hashes.normalized_literals = "myqueryHash"
  AND job_id != "myjobId";
The result of this query and the avgPreviousExecutionMs value shown for that job do not match.
How can we validate this value?
What time period of data is avgPreviousExecutionMs based on?
And is this average computed from the JOBS view, the JOBS_BY_USER view, JOBS_BY_FOLDER, or JOBS_BY_ORGANIZATION?
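One way to narrow this down empirically is to recompute the same average over a bounded look-back window and compare it with the reported value. A rough sketch, where the project-level JOBS view and the 30-day window are assumptions ("myqueryHash" and "myjobId" are the placeholders from above):

SELECT
  AVG(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_duration_ms,
  COUNT(*) AS previous_runs  -- how many jobs actually feed the average
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE query_info.query_hashes.normalized_literals = "myqueryHash"
  AND job_id != "myjobId"
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY);

Varying the view (JOBS vs. JOBS_BY_USER) and the window, then checking which combination matches the reported avgPreviousExecutionMs, at least tells you what the field is not based on.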

Get all BigQuery Query jobs across organisation that reference a specific table

Problem Statement
We're a large organisation (7000+ people) with many BigQuery projects. My team owns a highly used set of approx. 250 tables. We are aware of some data quality issues, but need to prioritise which tables we focus our efforts on.
In order to prioritise our effort, we plan to calculate two metrics for each table:
Monthly total count of query jobs referencing that table
Total number of distinct destination tables written by query jobs that reference that table
However, we are stuck on one aspect -- how do you access all the query jobs across the entire org that reference a specific table?
What we've tried
We've tried using the following query to find all query jobs referencing a table:
select count(*) from
`project-a`.`region-qualifier`.INFORMATION_SCHEMA.JOBS
where job_type = 'QUERY'
and referenced_tables.project_id = 'project-a'
and referenced_tables.dataset_id = 'dataset-b'
and referenced_tables.table_id = 'table-c'
Unfortunately, this is only showing query jobs that are kicked off with project-a as the billing project (afaik).
Summary
Imagine we have 50+ GCP projects that could be executing queries referencing a table we own, what we want is to see ALL those query jobs across all those projects.
Currently it's not possible to access all the query jobs across the entire organization that reference a specific table.
As you mentioned, you can list query jobs within a project using a query like:
select * from `PROJECT_ID`.`region-REGION_NAME`.INFORMATION_SCHEMA.JOBS
where job_type = 'QUERY'
PROJECT_ID is the ID of your Cloud project. If not specified, the default project is used.
You can also run the query without the project ID:
select * from `region-REGION_NAME`.INFORMATION_SCHEMA.JOBS
where job_type = 'QUERY'
For more information you can refer to this document.
If you would like a feature that lists query jobs across the entire organization to be implemented, you can open a new feature request on the issue tracker describing your requirement.
Turns out that you can get this information through Google Cloud Logging.
The following command extracted the logs of all queries across the org referencing tables within <DATASET_ID>.
gcloud logging read 'timestamp >= "2022-09-01T00:00:00Z" AND resource.type=bigquery_dataset AND resource.labels.dataset_id=<DATASET_ID> AND severity=INFO'
Importantly, this command needs to be run from the project in which <DATASET_ID> exists, and you need the roles/logging.admin role.
Worth noting that I was not able to test INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION, which should also do the trick.
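For reference, an untested sketch of what that JOBS_BY_ORGANIZATION query might look like, assuming the view is available for your organization, that you have the required org-level permissions, and that referenced_tables has to be UNNESTed (project-a / dataset-b / table-c and region-qualifier are the placeholders from the question):

SELECT
  DATE_TRUNC(DATE(creation_time), MONTH) AS month,
  COUNT(*) AS query_jobs,
  COUNT(DISTINCT CONCAT(destination_table.project_id, '.',
                        destination_table.dataset_id, '.',
                        destination_table.table_id)) AS distinct_destination_tables
FROM `region-qualifier`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION,
  UNNEST(referenced_tables) AS ref
WHERE job_type = 'QUERY'
  AND ref.project_id = 'project-a'
  AND ref.dataset_id = 'dataset-b'
  AND ref.table_id = 'table-c'
GROUP BY month
ORDER BY month;

If it works for you, this would give both metrics from the problem statement (monthly job counts and distinct destination tables) in one pass.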

Run update on multiple tables in BigQuery

I have a lake dataset which takes data from an OLTP system. Given the nature of the transactions, we get a lot of updates the next day, so to keep track of the latest record we use active_flag = '1'.
We also created an update script which retires old records by setting active_flag = '0'.
Now the main question: how can I execute an UPDATE statement while changing the table name automatically (programmatically)?
I know we have the option of using Cloud Functions, but it times out after 9 minutes and I have at least 350 tables to update.
Has anyone faced this situation before?
You can easily do this with Cloud Workflows.
There you set up the templated calls to BigQuery as substeps, pass in a list of tables, and loop through the items, invoking the BigQuery step for each item/table.
I wrote an article with samples that you can adapt: Automate the execution of BigQuery queries with Cloud Workflows
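If you would rather stay inside BigQuery, the same looping idea can be sketched with BigQuery scripting and EXECUTE IMMEDIATE. This is an alternative to the Workflows approach above, not part of it; the dataset name my_lake and the columns active_flag / load_date are assumptions you would replace with your own:

-- Loop over every base table in one dataset and run the retire UPDATE against it.
FOR t IN (
  SELECT table_name
  FROM my_lake.INFORMATION_SCHEMA.TABLES
  WHERE table_type = 'BASE TABLE'
) DO
  EXECUTE IMMEDIATE FORMAT("""
    UPDATE `my_lake.%s`
    SET active_flag = '0'
    WHERE active_flag = '1'
      AND load_date < CURRENT_DATE()  -- placeholder retire condition, replace with your own logic
  """, t.table_name);
END FOR;

Each iteration runs as its own UPDATE job, so the 9-minute Cloud Functions limit does not apply here, although 350 sequential updates can still take a while.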

How can I monitor incurred BigQuery billings costs (jobs completed) by table/dataset in real-time?

The biggest chunk of my BigQuery billing comes from query consumption. I am trying to optimize this by understanding which datasets/tables consume the most.
I am therefore looking for a way to track my BigQuery usage, ideally something closer to real time (so that I don't have to wait a day for the final results). The best output would be, for instance, how much each table/dataset consumed in the last hour.
So far I have found the monitoring dashboard, but it only displays the queries in flight per project and the stored bytes per table, which is not what I am after.
What other solutions are there to retrieve this kind of information?
Using Stackdriver logs, you could create a sink with a Pub/Sub topic as the target for real-time analysis, filtering only the relevant BigQuery logs like this:
resource.type="bigquery_resource" AND
proto_payload.method_name="jobservice.jobcompleted" AND
proto_payload.service_data.job_completed_event.job.job_statistics.total_billed_bytes:*
(see example queries here: https://cloud.google.com/logging/docs/view/query-library?hl=en_US#bigquery-filters)
You could create the sink on a specific project, a folder or even an organization. This will retrieve all the queries done in BigQuery in that specific project, folder or organization.
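As a rough illustration (all names are placeholders, and the --organization / --include-children flags assume you want an organization-wide sink), creating such a sink from the command line could look like this:

gcloud logging sinks create bq-jobs-to-pubsub \
  pubsub.googleapis.com/projects/my-project/topics/bq-job-costs \
  --organization=ORGANIZATION_ID --include-children \
  --log-filter='<the BigQuery jobcompleted filter shown above>'

The sink's writer identity also needs to be granted the Pub/Sub Publisher role on the destination topic.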
The field proto_payload.service_data.job_completed_event.job.job_statistics.total_billed_bytes will give you the number of bytes processed by the query.
Based on on-demand BigQuery pricing (as of now, $5/TB for most regions, but check for your own region), you could easily estimate the billing in real time. You could create a Dataflow job that aggregates the results in BigQuery, or simply consume the destination Pub/Sub topic with any job you want in order to make the pricing calculation:
jobPriceInUSD = totalBilledBytes / 1_000_000_000_000 * pricePerTB
because 1 TB = 1_000_000_000_000 B. As I said before, pricePerTB depends on the region (see https://cloud.google.com/bigquery/pricing#on_demand_pricing for the exact price). For example, as of the time of writing:
$5/TB for us-east1
$6/TB for asia-northeast1
$9/TB for southamerica-east1
Also, for each month, as of now, the 1st TB is free.
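As a quick worked example of that formula: a job that billed 2 TB (2_000_000_000_000 bytes) in us-east1 would be estimated at 2_000_000_000_000 / 1_000_000_000_000 * 5 = $10, before the free first TB of the month is taken into account.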
It might be easier to use the INFORMATION_SCHEMA.JOBS_BY_* views, because you don't have to set up Stackdriver logging and can use them right away.
Example taken & modified from How to monitor query costs in Google BigQuery
DECLARE gb_divisor INT64 DEFAULT 1024*1024*1024;
DECLARE tb_divisor INT64 DEFAULT gb_divisor*1024;
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5;
DECLARE cost_factor FLOAT64 DEFAULT cost_per_tb_in_dollar / tb_divisor;
SELECT
ROUND(SUM(total_bytes_processed) / gb_divisor,2) as bytes_processed_in_gb,
ROUND(SUM(IF(cache_hit != true, total_bytes_processed, 0)) * cost_factor,4) as cost_in_dollar,
user_email,
FROM (
(SELECT * FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER)
UNION ALL
(SELECT * FROM `other-project.region-us`.INFORMATION_SCHEMA.JOBS_BY_USER)
)
WHERE
DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) and CURRENT_DATE()
GROUP BY
user_email
Some caveats:
you need to UNION ALL all of the projects that you use explicitly
JOBS_BY_USER did not work for me on my private account (presumably because my login email is @googlemail.com while BigQuery stores it as @gmail.com)
the WHERE condition needs to be adjusted for your billing period (instead of the last 30 days)
doesn't provide the "bytes billed" information, so we need to determine those based on the cache usage
doesn't include the "if less than 10MB use 10MB" condition
data is only retained for the past 180 days
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5; reflects only US costs - other regions might have different costs - see https://cloud.google.com/bigquery/pricing#on_demand_pricing
you can only query one region at a time
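Since the question asks for consumption per table/dataset rather than per user, a hedged variation of the same JOBS_BY_* idea is to UNNEST referenced_tables and group on it (the same caveats apply, and note that a job touching several tables has its full bytes counted against each of them):

SELECT
  ref.project_id,
  ref.dataset_id,
  ref.table_id,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 5, 4) AS approx_cost_in_dollar  -- $5/TB US on-demand price
FROM `region-us`.INFORMATION_SCHEMA.JOBS, UNNEST(referenced_tables) AS ref
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY 1, 2, 3
ORDER BY approx_cost_in_dollar DESC;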

Sequence constraints of product feeds for Amazon MWS

I am currently working on a specification for a software component which will synchronize the product catalog of an ecommerce company with the Amazon Marketplace using Amazon MWS.
According to the MWS developer documentation, publishing products requires submitting up to 6 different feeds, which are processed asynchronously:
Product Feed: defines SKUs and contains descriptive data for the products
Inventory Feed: sets quantities/availability for each SKU
Price Feed: sets prices for SKUs
Image Feed: product images for each SKU
Relationship Feed: defines mappings between parent SKUs (e.g. a T-Shirt) and child SKUs (e.g. T-Shirt in a concrete size and color which is buyable)
Override Feed
My question concerns the following passage in the MWS documentation:
The Product feed is the first step in setting up your products on
Amazon. All subsequent catalog feeds are dependent upon the success of
this feed.
I am wondering what this means. There are at least two possibilities:
Do you have to wait until the Product feed is successfully processed before submitting subsequent feeds? This would mean that one has to poll the processing state periodically until it is finished. This may take hours depending on the feed size and the server load at Amazon, and the process of synchronizing products would be more complex.
Can you send all the feeds immediately in one sequence, with Amazon taking care that they are processed in a reasonable order? In this interpretation, the documentation would just be stating the obvious: that the success of, say, image feed processing for a particular SKU depends on the success of inserting the SKU itself.
As I understand it, for all feeds other than the Product feed the products in question must already be in the catalogue, so your first possibility is the correct one.
However, this should only affect you on the very first run of the Product feed or when you are adding a new product; once the product is there you can run the feeds in any order, unless you are doing a PurgeAndReplace of your entire catalogue each time, which is not recommended.
The way I would plan it is this.
1) Run a Product Feed of the entire catalogue the very first time and wait for it to complete.
2) Run the other feeds in any order you like.
3) Changes to any of the products already on Amazon can now be done in any order, e.g. you can run the price feed before the product feed if all you are doing is amending the description data, etc.
4) When you have to add a new product, make sure you run the product feed first, then the other feeds.
If possible, I would create a separate process for adding new products. Also, I think it will help you if you only upload changes to products rather than the entire catalogue each time. It's a bit more work for you to determine what has changed but it will speed up the feed process and mean you're not always waiting for the product feed to complete.
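For what it's worth, the waiting logic from the plan above can be sketched in a few lines of pseudo-Python. submit_feed() and get_feed_status() stand in for whatever MWS client wrapper you use around SubmitFeed / GetFeedSubmissionList; they are assumptions, not a real library API:

import time

def wait_until_done(get_feed_status, submission_id, poll_seconds=120):
    # Poll the Product feed submission until Amazon reports it as processed.
    while get_feed_status(submission_id) != "_DONE_":
        time.sleep(poll_seconds)

def publish_catalog(submit_feed, get_feed_status, product_feed_xml, other_feeds):
    # 1) Product feed first, and block until it is processed.
    submission_id = submit_feed("_POST_PRODUCT_DATA_", product_feed_xml)
    wait_until_done(get_feed_status, submission_id)
    # 2) Everything else (inventory, price, image, relationship, override) in any order.
    for feed_type, xml in other_feeds.items():
        submit_feed(feed_type, xml)

Here "_POST_PRODUCT_DATA_" is the MWS feed type for the Product feed, and the hypothetical get_feed_status() is assumed to return the FeedProcessingStatus string.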
Yes, the Product feed is the primary feed.
You need to wait until the Product feed has completed before sending out the other feeds.
When you send the Product feed, its status becomes:
1) _SUBMITTED_
2) _IN_PROGRESS_
3) _DONE_
4) _COMPLETED_
You need to wait until the status changes to "_DONE_" or "_COMPLETED_".
Thanks.