Amazon Athena LEFT OUTER JOIN query not working as expected - amazon-athena

I am trying to do a left outer join in Athena and my query looks like the following:
SELECT customer.name, orders.price
FROM customer LEFT OUTER JOIN orders
ON customer.id = orders.customer_id
WHERE price IS NULL;
Each customer can have at most one order in the orders table, and some customers have no order at all. So I expect to get the customers that have no matching record in the orders table, because after the LEFT OUTER JOIN their price will be NULL. But this query returns 0 rows every time I run it. I have queried both tables separately and am pretty sure there is data in both, so I am not sure why this returns zero rows when it works if I remove the price IS NULL condition. I have also tried price = '' and price IN (''), and neither of them works. Has anyone here had a similar experience before? Or is there something wrong with my query that I cannot see?

It seems that your query is correct. To validate, I created two CTEs that should match your customer and orders tables and ran your query against them. The query below returns one record, for customer 3 (Ted Johnson), who did not have an order.
WITH customer AS (
SELECT 1 AS id, 'John Doe' AS name
UNION
SELECT 2 AS id, 'Jane Smith' AS name
UNION
SELECT 3 AS id, 'Ted Johnson' AS name
),
orders AS (
SELECT 1 AS customer_id, 20 AS price
UNION
SELECT 2 AS customer_id, 15 AS price
)
SELECT customer.name, orders.price
FROM customer LEFT OUTER JOIN orders
ON customer.id = orders.customer_id
WHERE price IS NULL;
I'd suggest running the following queries:
SELECT COUNT(DISTINCT id) FROM customer;
SELECT COUNT(DISTINCT customer_id) FROM orders;
Based on the results you are seeing, I would expect those counts to match. Perhaps your system is creating a record in the orders table whenever a customer is created with a price of 0.

You probably can't filter on the orders table in the WHERE clause. Try moving the condition into the ON clause instead:
SELECT customer.name, orders.price
FROM customer LEFT OUTER JOIN orders
ON customer.id = orders.customer_id AND orders.price IS NULL;
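The two placements of the price IS NULL condition are not equivalent, and a small experiment may make the difference clearer. This is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Athena (an assumption: Athena's engine is not SQLite, but the outer join semantics shown here are standard SQL). The sample rows are the ones from the CTE answer above.

```python
import sqlite3

# In-memory SQLite database standing in for Athena (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, price INTEGER);
    INSERT INTO customer VALUES (1, 'John Doe'), (2, 'Jane Smith'), (3, 'Ted Johnson');
    INSERT INTO orders VALUES (1, 20), (2, 15);
""")

# Condition in WHERE: the join runs first, then unmatched customers are kept.
where_rows = conn.execute("""
    SELECT customer.name, orders.price
    FROM customer LEFT OUTER JOIN orders
    ON customer.id = orders.customer_id
    WHERE orders.price IS NULL
    ORDER BY customer.id
""").fetchall()

# Condition in ON: it only restricts which orders rows may match.
on_rows = conn.execute("""
    SELECT customer.name, orders.price
    FROM customer LEFT OUTER JOIN orders
    ON customer.id = orders.customer_id AND orders.price IS NULL
    ORDER BY customer.id
""").fetchall()

print(where_rows)  # only the customer without an order
print(on_rows)     # every customer, each with a NULL price
```

With the condition in WHERE, only Ted Johnson comes back. With the condition in ON, no orders row has a NULL price, so nothing matches and all three customers are returned with a NULL price. If the original query returns zero rows, it is probably worth checking the data (for example, placeholder orders rows) before moving the condition.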

Related

Loop through a ColdFusion query and sorting results

I have two tables:
Users (3 columns): ID, DisplayName, Active
Ticket_Followups (4 columns): id, requested_by, requested_date, ticket_id
I am trying to group all the similar records in the ticket_followup table, first by recordcount and then by displayName.
Here is what I have so far:
<cfquery name="active_users" datasource="#datasource#">
select * from users
where active='1'
</cfquery>
<cfloop query="active_users">
<cfquery name="get_followups" datasource="#datasource#">
select date_of_followup_request, requested_by, ticket_id
from ticket_followup
where requested_by = '#active_users.displayName#'
</cfquery>
<cfoutput>
<tr>
<td>#active_users.displayName#</td>
<td>#get_followups.recordcount#</td>
</tr>
</cfoutput>
</cfloop>
I am able to successfully show the output for the total records by user, but there is no order to the output. I would like to group it so that it shows the DisplayName with the highest recordcount first, descending in order.
How can I do that?
This is a SQL issue, CF is just displaying data after the data is gathered.
You need to do this in one query.
You need to associate the ticket follow ups by user ID, not by name (Name could change, but not the ID).
There's a table of tickets I assume, but we'll stick to your two tables.
First, the tables:
Users
----------
id
DisplayName
Active
Ticket_Followups
----------
id
requested_by_id (Users.id)
requested_date
ticket_id
You can technically join by name, but it's a much slower query and I've no idea how much data you have.
This query joins the two tables and gives you a count of ticket follow-ups by user. You can add an ORDER BY clause after the GROUP BY depending on your needs.
SELECT
a.DisplayName
, count(*) AS requested_count
FROM
Users AS a
INNER JOIN
Ticket_Followups b ON b.requested_by_id = a.id
WHERE
a.active = 1
GROUP BY
a.id
, a.DisplayName
If you don't do this in one query, then for every user that has an active ticket, you're making another query.
10 users, 11 queries
20 users, 21 queries
etc.
Updated 2022-02-15
Query using DisplayName with an ORDER BY clause. This should make it clearer that you're counting the tickets per user and not the number of users.
SELECT
b.DisplayName
, count(*) AS ticket_count
FROM
Ticket_Followups AS a
INNER JOIN
Users AS b ON b.DisplayName = a.requested_by
WHERE
b.active = 1
GROUP BY
b.DisplayName
ORDER BY
ticket_count DESC
Output:
<cfoutput query="queryName">
<li>#queryName.DisplayName# - #queryName.ticket_count#</li>
</cfoutput>
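As a rough check of the single-query approach, here is a minimal sketch using Python's built-in sqlite3 module instead of ColdFusion and the real datasource (both are assumptions made so the example can run anywhere). The table layout follows the suggested schema with requested_by_id, and the sample users and follow-ups are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (id INTEGER PRIMARY KEY, DisplayName TEXT, Active INTEGER);
    CREATE TABLE Ticket_Followups (
        id INTEGER PRIMARY KEY,
        requested_by_id INTEGER,   -- references Users.id
        requested_date TEXT,
        ticket_id INTEGER
    );
    -- Hypothetical sample data: Carol is inactive and should be excluded.
    INSERT INTO Users VALUES (1, 'Alice', 1), (2, 'Bob', 1), (3, 'Carol', 0);
    INSERT INTO Ticket_Followups VALUES
        (1, 1, '2022-02-01', 100),
        (2, 2, '2022-02-02', 101),
        (3, 2, '2022-02-03', 102),
        (4, 2, '2022-02-04', 103),
        (5, 3, '2022-02-05', 104);
""")

# One query: count follow-ups per active user, highest count first.
rows = conn.execute("""
    SELECT u.DisplayName, COUNT(*) AS requested_count
    FROM Users AS u
    INNER JOIN Ticket_Followups AS f ON f.requested_by_id = u.id
    WHERE u.Active = 1
    GROUP BY u.id, u.DisplayName
    ORDER BY requested_count DESC, u.DisplayName
""").fetchall()

for display_name, requested_count in rows:
    print(display_name, requested_count)
```

The sorting happens in the database, so the display code only has to loop over the result once. Note that the INNER JOIN drops users with no follow-ups; a LEFT JOIN with COUNT(f.id) would keep them with a count of 0.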

Count of column values in Table 1 present in Table 2 column - Power BI

I have two tables:
table 1
product_name | ID
abc | 123
abc | 456
table 2
product_name | ID
abc | 123
Report layout: trials on the left side, purchases on the right side.
I want to know how many of them downloaded the trial product and, out of those, how many purchased.
PS: columns are not unique
Your example is rather vague. I am not sure what values you have in both tables.
For the "left visual" you want to count unique IDs, right?
LeftVisual = CALCULATE(DISTINCTCOUNT(table1[id]))
And for the right side you want to count IDs from Table2 only when those IDs appear in Table1?
RightVisual = CALCULATE(DISTINCTCOUNT(table2[id]), TREATAS(VALUES(table1[id]), table2[id]))
Count1 = CALCULATE(DISTINCTCOUNT(table1[id]))
Count2 = CALCULATE(DISTINCTCOUNT(table2[id]),
INTERSECT(SELECTCOLUMNS(TABLE2, "id", [ID]), SELECTCOLUMNS(TABLE1, "id", [ID])))

Why bigquery can't handle a query processing 4TB data?

I'm trying to run this query
SELECT
id AS id,
ARRAY_AGG(DISTINCT users_ids) AS users_ids,
MAX(date) AS date
FROM
users,
UNNEST(users_ids) AS users_ids
WHERE
users_ids != " 1111"
AND users_ids != " 2222"
GROUP BY
id;
The users table is a split table with an id column, a users_ids column (comma separated), and a date column. Running the query on more than 4 TB of data gives me a resources error:
Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations.
Any idea why?
id | userids | date
1 | 2,3,4 | 1-10-20
2 | 4,5,6 | 1-10-20
1 | 7,8,4 | 2-10-20
So the final result I'm trying to reach is:
id | userids | date
1 | 2,3,4,7,8 | 2-10-20
2 | 4,5,6 | 1-10-20
Execution details:
It's constantly repartitioning. I would guess that you're trying to cram too much into the aggregation step. Just remove the aggregation; I don't even think you need the cross join here.
Use a subquery instead of this cross join + aggregation combo.
Edit: I just realized that you want to aggregate the arrays, but with distinct values:
WITH t AS (
SELECT
id AS id,
ARRAY_CONCAT_AGG(ARRAY(SELECT DISTINCT uids FROM UNNEST(users_ids) AS uids WHERE
uids != " 1111" AND uids != " 2222")) AS users_ids,
MAX(date) AS date
FROM
users
GROUP BY id
)
SELECT
id,
ARRAY(SELECT DISTINCT uid FROM UNNEST(users_ids) AS uid) AS users_ids
,date
FROM t
This is just a draft (I assume id is unique), but it should be something along those lines. Grouping by arrays is not possible.
ARRAY_CONCAT_AGG() has no DISTINCT option, so the deduplication comes in a second step.
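To make the intended result concrete, here is a small plain-Python sketch of the same merge logic: per id, filter out the excluded values, concatenate the arrays, drop duplicates, and keep the latest date. It says nothing about BigQuery's resource usage. The sample rows come from the question; reading "1-10-20" as day-month-year and the helper name merge_users are assumptions made for illustration.

```python
from datetime import date

# Values excluded in the original WHERE clause (leading space kept as written).
EXCLUDED = {" 1111", " 2222"}

def merge_users(rows):
    """Merge user id arrays per id, deduplicated, with the latest date."""
    merged = {}
    for row_id, user_ids, row_date in rows:
        ids, latest = merged.get(row_id, ([], row_date))
        for uid in user_ids:
            # Skip excluded values and anything already collected for this id.
            if uid not in EXCLUDED and uid not in ids:
                ids.append(uid)
        merged[row_id] = (ids, max(latest, row_date))
    return merged

# Sample rows from the question (dates assumed to be day-month-year).
rows = [
    (1, ["2", "3", "4"], date(2020, 10, 1)),
    (2, ["4", "5", "6"], date(2020, 10, 1)),
    (1, ["7", "8", "4"], date(2020, 10, 2)),
]

merged = merge_users(rows)
for row_id, (ids, latest) in sorted(merged.items()):
    print(row_id, ",".join(ids), latest.isoformat())
```

For id 1 the duplicate "4" from the second row is dropped, which is the part ARRAY_CONCAT_AGG cannot do on its own and why the SQL draft deduplicates again in the outer query.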

Best way to get unique count of customers across multiple products

Using AWS Athena I am trying to write a query to get a count of the number of unique customers who have ordered per product.
If a customer ordered a product 5 times, I only want them counted once for that product, though I still want them counted for each of the 3 other products they ordered with different SKU codes. The issue is that our product titles have changed over time, so when I run the following query I get results by product title, with the SKU code listed multiple times because of the title changes. What I want is the unique customer count by sku_code.
SELECT product_title, product_code, COUNT(DISTINCT customer_reference_id)
FROM "business_usage"."daily_business_usage_by_instance_type"
GROUP BY product_title, product_code
ORDER BY Product_code
This is the query I tried in order to get a distinct count of customers per SKU purchased, but the first line fails with: Syntax_error: Unexpected parameters (varchar, varchar) for function count. Expected: count(), count(T) T
SELECT product_name, COUNT(DISTINCT sku_code, customer_id)
FROM "Data"."Orders"
GROUP BY product_name, sku_code
ORDER BY sku_code
Any ideas on what I am doing wrong or if this is even the correct query to get the information I need?
If I understand you correctly, you want the number of unique customers by SKU, but you also want to retrieve the product title, which has changed over time and although related to SKU does not have a one-to-one relationship.
One way to achieve that is to group by SKU and use the ARBITRARY aggregate function to pick one product title from the group:
SELECT
ARBITRARY(product_title) AS product_title,
product_code,
COUNT(DISTINCT customer_reference_id)
FROM "business_usage"."daily_business_usage_by_instance_type"
GROUP BY product_code
ORDER BY product_code
As the name suggests, ARBITRARY will give you a value, but it is not defined which one, and it might vary from run to run. You could also use MIN or MAX to get the first or last title in alphabetical order.
It could be the case that you want to pick a product title in a more specific way, like the one from the row with the highest timestamp. Assuming your table has a column called order_date you could use the MAX_BY function to pick the product title from the most recent row in the group:
SELECT
MAX_BY(product_title, order_date) AS product_title,
product_code,
COUNT(DISTINCT customer_reference_id)
FROM "business_usage"."daily_business_usage_by_instance_type"
GROUP BY product_code
ORDER BY product_code
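The effect of grouping by SKU only can be checked with a minimal sketch. It uses Python's built-in sqlite3 module rather than Athena (an assumption), and since SQLite has no ARBITRARY or MAX_BY function it uses the MAX variant mentioned above to pick a title. The table name daily_usage and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_usage (
        product_title TEXT,
        product_code TEXT,
        customer_reference_id TEXT
    );
    -- Hypothetical rows: SKU A1 was renamed from 'Widget' to 'Widget Pro',
    -- and customer c1 ordered it twice.
    INSERT INTO daily_usage VALUES
        ('Widget',     'A1', 'c1'),
        ('Widget',     'A1', 'c1'),
        ('Widget Pro', 'A1', 'c2'),
        ('Gadget',     'B2', 'c1');
""")

# Group by SKU only; the title is reduced to one value per group with MAX.
rows = conn.execute("""
    SELECT
        MAX(product_title) AS product_title,
        product_code,
        COUNT(DISTINCT customer_reference_id) AS unique_customers
    FROM daily_usage
    GROUP BY product_code
    ORDER BY product_code
""").fetchall()

for title, code, unique_customers in rows:
    print(title, code, unique_customers)
```

Each SKU now appears once: A1 has two unique customers even though c1 ordered it twice under the old title, and c1 is still counted separately for B2.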

Adding multiple record count to a table

My students table has the following columns:
student id | student year | test result | semester
I would like to group the records together to see how many re-tests a student did in a particular semester.
I am trying to alter the table to add a total_tests_taken column and then fill it with an
update statement like:
ALTER TABLE students
ADD (total_tests_taken NUMBER);
UPDATE students
SET total_tests_taken = (select count(*) OVER ( PARTITION BY student_id, semester) FROM students)
But my SQL fails with: "ORA-01427: single-row subquery returns more than one row"
What am I doing wrong?
Do I need to create a temp table and then do it?
Thanks
The reason you are getting the error is that you are trying to set a column's value to an entire result set: the subquery returns one row for every row in students, while SET needs a single value for each row being updated. What you are trying to do can be accomplished with an UPDATE with JOIN statement if your DBMS supports it. You can check out the answer to this question for the syntax:
How can I do an UPDATE statement with JOIN in SQL?
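Another option that avoids both the join syntax and a temp table is a correlated subquery, so that the subquery returns exactly one count for the row being updated. The sketch below uses Python's built-in sqlite3 module instead of Oracle (an assumption; the UPDATE itself is plain SQL and should carry over, but it was not run against Oracle here). The sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (
        student_id INTEGER,
        student_year INTEGER,
        test_result INTEGER,
        semester TEXT,
        total_tests_taken INTEGER
    );
    -- Hypothetical rows: student 1 took two tests in semester S1.
    INSERT INTO students (student_id, student_year, test_result, semester) VALUES
        (1, 2020, 55, 'S1'),
        (1, 2020, 70, 'S1'),
        (1, 2020, 80, 'S2'),
        (2, 2020, 90, 'S1');
""")

# Correlated subquery: counts only the rows that share the outer row's
# student_id and semester, so it yields a single value per updated row.
conn.execute("""
    UPDATE students
    SET total_tests_taken = (
        SELECT COUNT(*)
        FROM students AS s2
        WHERE s2.student_id = students.student_id
          AND s2.semester = students.semester
    )
""")

rows = conn.execute("""
    SELECT student_id, semester, total_tests_taken
    FROM students
    ORDER BY rowid
""").fetchall()
print(rows)
```

Storing the count on every row means it has to be refreshed whenever tests are added; computing it on demand with GROUP BY student_id, semester may be simpler if the column is only needed for reporting.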