Hive - filter different names - regex

I´ve got stucked by filtering some movie titels. My problem is that i have a lot of different movie titels for example:
Movies: Visitors:
Breaking Dawn Part 1+2 100
Breaking Dawn 1+2 40
Breaking Dawn 1 + 2 30
Dark Knight trilogy 3D 100
Dark Knight trilogy 3D 40
Dark Knight Trilogy HDF 30
Dark Knight Trilogy -HDF 100
Dark Knight trilogy_ (blank) 44
etc. +10000
So there are a lot of different movie titels, which aren´t named unique and also have some whitespaces at the end. I could fixed that problem a litte, but there are already a lot of titels, which have the same meaning but are different.
At the moment is that my Query:
SELECT regexp_replace(rtrim(allcinemadata.title)," - 3D | - 3D |3D |3D| 3D| - (3D) | - (3D) |(3D)"|"")
as clearTitle
FROM default.allcinemadata
group by
regexp_replace(rtrim(allcinemadata.title)," - 3D | - 3D |3D |3D| 3D| - (3D) | - (3D) |(3D)", "")
I´m not sure if that is the best solution for that problem.
Hope you guys can help me! :)

I couldn't test this with a larger dataset but it works for the sample data you provided in the question. based on soundex function on movie name get the total of the views, rest of the sql is self explanatory.
WITH movviews
AS (SELECT moviename,
totalviews,
Rank()
OVER (
partition BY Soundex(moviename)
ORDER BY totalviews DESC) rnk
FROM (SELECT moviename,
Sum(views)
OVER (
partition BY Soundex(moviename)
ORDER BY moviename) AS totalviews,
views
FROM movieviews
ORDER BY moviename)vv)
SELECT movviews.moviename,
movviews.totalviews
FROM movviews
WHERE rnk = 1
Output
movviews.moviename movviews.totalviews
Breaking Dawn Part 1+2 170
Dark Knight trilogy_ (blank) 314
Time taken: 62.257 seconds, Fetched: 2 row(s)
hive (default)>

Related

Power BI Matrix Visual Showing Row of Blank Values Even Though Source Data Does Not Have Blanks

I have two tables one with data about franchise locations (Franchise Profile Info) and one with Award data. Each franchise location is given a certain number of awards they are allowed to give out per year. Each franchise location rolls up to a larger group depending on where in the country they are located. These tables are in a 1 to 1 relationship using Franchise ID. I am trying to create a matrix with the number of awards, total utilized, and percentage utilized rolled up to group with the ability to expand the groups and see individual locations. For some reason when I add the value fields a blank row is created. There are not any blank rows in either of the original tables so I'm not sure where this is coming from.
Franchise Profile Info table
ID
Franchise Name
Group
Street Address
City
State
164
Park's
West
12 Park Dr.
Los Angeles
CA
365
A & J
East
243 Whiteoak Rd
Stafford
VA
271
Otto's
South
89 Main St.
St. Augustine
FL
Award table
ID
Year
TotalAwards
Utilized
164
2022
16
12
365
2022
5
5
271
2022
22
17
This tables are in a relationship with a 1 to 1 match on ID
What I want the matrix to look like
Group
Total Awards
Utilized
%Awards Utilized
East
5
5
100%
West
16
12
75%
South
22
17
77%
Instead what I'm getting is this
Group
Total Awards
Utilized
%Awards Utilized
East
5
5
100%
West
16
12
75%
South
22
17
77%
0
0
0%
I can't for the life of me figure out where this row is coming from. I can add in the Group and Franchise name as rows but as soon as I add any of the value columns this blank row shows up.
You have a value on the many side that does not exist on the one side. You can read a full explanation here. https://www.sqlbi.com/articles/blank-row-in-dax/

How do I create a pivot table with weighted averages from a table in PowerBI?

I have data in the following format:
Building
Tenant
Type
Floor
Sq Ft
Rent
Term Length
1 Example Way
Jeff
Renewal
5
100
100
6
47 Fake Street
Tom
New
3
500
200
12
I need to create a visualisation in PowerBI that displays a pivot table of attribute by tenant, with a weighted averages (by square foot) column, like this:
Jeff
Tom
Weighted Average (by Sq Ft)
Building
1 Example Way
47 Fake Street
-
Type
Renewal
New
-
Floor
5
3
-
Sq Ft
100
500
433.3333333
Rent
100
200
183.3333333
Term Length (months)
6
12
11
I have unpivoted the original data, like this:
Tenant
Attribute
Value
Jeff
Building
1 Example Way
Jeff
Type
Renewal
Jeff
Floor
5
Jeff
Sq Ft
100
Jeff
Rent
100
Jeff
Term Length (months)
6
Tom
Building
47 Fake Street
Tom
Type
New
Tom
Floor
3
Tom
Sq Ft
500
Tom
Rent
200
Tom
Term Length (months)
12
I can almost create what I need from the unpivoted data using a matrix (as below), but I can't calculate the weighted averages column from that matrix.
Jeff
Tom
Building
1 Example Way
47 Fake Street
Type
Renewal
New
Floor
5
3
Sq Ft
100
500
Rent
100
200
Term Length (months)
6
12
I can also create a table with my attributes as headers (instead of in a column). This displays the right values and lets me calculate weighted averages (as below).
Building
Type
Floor
Sq Ft
Rent
Term Length (months)
Jeff
1 Example Way
Renewal
5
100
100
6
Tom
47 Fake Street
New
3
500
200
12
Weighted Average (by Sq Ft)
-
-
-
433.3333333
183.3333333
11
However, it's important that these values are displayed vertically instead of horizontally. This is pretty straightforward in Excel, but I can't figure out how to do it in PowerBI. I hope this is clear. Can anyone help?
Thanks!

Average of percent of column totals in DAX

I have a fact table named meetings containing the following:
- staff
- minutes
- type
I then created a summarized table with the following:
TableA =
SUMMARIZECOLUMNS (
'meetings'[staff]
, 'meetings'[type]
, "SumMinutesByStaffAndType", SUM( 'meetings'[minutes] )
)
This makes a pivot table with staff as rows and columns as types.
For this pivottable I need to calculate each cell as a percent of the column total. For each staff I need the average of their percents. There are only 5 meeting types so I need the sum of these percents divided by 5.
I don't know how to divide one number grouped by two columns by another number grouped by one column. I'm coming from the SQL world so my DAX is terrible and I'm desperate for advice.
I tried creating another summarized table to get the sum of minutes for each type.
TableB =
SUMMARIZECOLUMNS (
'meetings'[type]
, "SumMinutesByType", SUM( 'meetings'[minutes] )
)
From there I want 'TableA'[SumMinutesByStaffAndType] / 'TableB'[SumMinutesByType].
TableC =
SUMMARIZECOLUMNS (
'TableA'[staff],
'TableB'[type],
DIVIDE ( 'TableA'[SumMinutesByType], 'TableB'[SumMinutesByType]
)
"A single value for column 'Minutes' in table 'Min by Staff-Contact' cannot be determined. This can happen when a measure formula refers to a column that contains many values without specifying an aggregation such as min, max, count, or sum to get a single result."
I keep arriving at this error which leads me to believe I'm not going about this the "Power BI way".
I have tried making measures and creating matrices on the reports view. I've tried using the group by feature in the Query Editor. I even tried both measures and aggregate tables. I'm likely overcomplicating it and way off the mark so any help is greatly appreciated.
Here's an example of what I'm trying to do.
## Input/First table
staff minutes type
--------- --------- -----------
Bill 5 TELEPHONE
Bill 10 FACE2FACE
Bill 5 INDIRECT
Bill 5 EMAIL
Bill 10 OTHER
Gary 10 TELEPHONE
Gary 5 EMAIL
Gary 5 OTHER
Madison 20 FACE2FACE
Madison 5 INDIRECT
Madison 15 EMAIL
Rob 5 FACE2FACE
Rob 5 INDIRECT
Rob 20 TELEPHONE
Rob 45 FACE2FACE
## Second table with SUM of minutes, Grand Total is column total.
Row Labels EMAIL FACE2FACE INDIRECT OTHER TELEPHONE
------------- ------- ----------- ---------- ------- -----------
Bill 5 10 5 10 5
Gary 5 5 10
Madison 15 20 5
Rob 50 5 20
Grand Total 25 80 15 15 35
## Third table where each of the above cells is divided by its column total.
Row Labels EMAIL FACE2FACE INDIRECT OTHER TELEPHONE
------------- ------- ----------- ------------- ------------- -------------
Bill 0.2 0.125 0.333333333 0.666666667 0.142857143
Gary 0.2 0 0 0.333333333 0.285714286
Madison 0.6 0.25 0.333333333 0 0
Rob 0 0.625 0.333333333 0 0.571428571
Grand Total 25 80 15 15 35
## Final table with the sum of the rows in the third table divided by 5.
staff AVERAGE
--------- -------------
Bill 29.35714286
Gary 16.38095238
Madison 23.66666667
Rob 30.5952381
Please let me know if I can clarify an aspect.
You can make use of the built in functions like %Row total in Power BI, Please find the snapshot below
If this is not what you are looking for, kindly let me know (I have used your Input table)

Combine Toad sql queries with decreasing output results into one list

I've been trying to produce a result where multiple queries return more restrictive returns. How can I see the full list as well as those records that meet the more restrictive conditions? Query 1 returns 538 records of sites in the given counties.
SELECT E_SITES.ID "SITE ID",
E_SITES.NAME "SITE NAME",
E_SITES.ADDR_1 "SITE ADDRESS"
E_SITES.CITY_NAME || ', ' || E_SITES.STATE_CODE || ' ' || E_SITES.POSTAL_CODE,
E_SITES.COUNTY_NAME
FROM E_SITES
WHERE E_SITES.COUNTY_NAME IN ('ALLAMAKEE', 'BENTON', 'BLACK HAWK', 'BREMER', 'BUCHANAN', 'CHICKASAW', 'CLAYTON', 'DELAWARE', 'DUBUQUE')
ORDER BY E_SITES.ID
Query 2 returns the number of sites that have a contact person identified. This is 503 records.
SELECT E_SITES.ID "SITE ID",
E_SITES.NAME "SITE NAME",
E_SITES.ADDR_1 "SITE ADDRESS"
E_SITES.CITY_NAME || ', ' || E_SITES.STATE_CODE || ' ' || E_SITES.POSTAL_CODE,
E_SITES.COUNTY_NAME,
E_INDIVIDUALS.FIRST_NAME || ' ' || E_INDIVIDUALS.LAST_NAME
FROM E_SITES, E_AFFILIATIONS, E_INDIVIDUALS
WHERE E_SITES.SITE_ID = E_AFFILIATIONS.SITE_ID
AND E_AFFILIATIONS.INDIVIDUAL_RID = E_INDIVIDUALS.RID
AND E_AFFILIATIONS.AFFILIATION_TYPE = ('SITE_CONTACT')
AND E_SITES.COUNTY_NAME IN ('ALLAMAKEE', 'BENTON', 'BLACK HAWK', 'BREMER', 'BUCHANAN', 'CHICKASAW', 'CLAYTON', 'DELAWARE', 'DUBUQUE')
ORDER BY E_SITES.ID
A further query would return those sites with a mailing address, which reduces the results down to 486 records. I need to get all 538 records, whether or not they have a contact or mailing address, and for those that do, have one row for each site.
Additional Information
My current results can look like this for Query 1 (including column headers for clarity, quotes to distinguish data elements):
"SITE ID" "SITE NAME" "SITE ADDRESS" "CITY, STATE ZIP" "COUNTY_NAME"
"09698" "BODINE ELECTRIC" "18114 KAPP DR" "PEOSTA, IA 52067" "BREMER"
"16895" "BRUGGEMAN LUMBER" "3003 WILLOW RD" "HOPKINTON, IA 52237" "DELAWARE"
"40047" "GENEVIEVE, LLC" "707 LINCOLN ST" "GARNAVILLOR, IA 52052" "CLAYTON"
Query 2 which requires a contact person currently only returns records that meet the requirement, even though I use the (+) operator.
"SITE ID" "SITE NAME" "SITE ADDRESS" "CITY, STATE ZIP" "COUNTY_NAME" "FIRST NAME LAST NAME"
"40047" "GENEVIEVE, LLC" "707 LINCOLN ST" "GARNAVILLOR, IA 52052" "CLAYTON" "DALE KARTMAN"
I get 1 record rather than the 3 records, with 2 having no contact person and 1 with a contact person. This is my dilema. I have to run each of these queries separately, get the results and copy them to a spreadsheet. Then I have to align the records with contact names to the 1st query of all facilities. Very labor intensive. Hope this helps clarify my needs.
If I understood you correctly, it is the OUTER JOIN you're looking for.
Here's a simple example (based on Scott's EMP and DEPT tables) which shows what it is.
There are 4 departments in the DEPT table:
SQL> select deptno from dept order by deptno;
DEPTNO
----------
10
20
30
40
However, no employee works in department 40:
SQL> select deptno, ename from emp order by deptno;
DEPTNO ENAME
---------- ----------
10 KING
10 CLARK
10 MILLER
20 FORD
20 SMITH
20 JONES
30 JAMES
30 TURNER
30 MARTIN
30 WARD
30 ALLEN
30 BLAKE
12 rows selected.
SQL>
If you want to display information collected from both of those tables (department name from the DEPT table and employee name from the EMP table), you'd join those tables - just like you did (I'll use ANSI syntax which actually JOINS tables, instead of enumerating them and putting join conditions into the WHERE clause):
SQL> select d.deptno, d.dname, e.ename
2 from dept d join emp e on e.deptno = d.deptno
3 order by d.deptno;
DEPTNO DNAME ENAME
---------- -------------- ----------
10 ACCOUNTING KING
10 ACCOUNTING CLARK
10 ACCOUNTING MILLER
20 RESEARCH FORD
20 RESEARCH SMITH
20 RESEARCH JONES
30 SALES JAMES
30 SALES TURNER
30 SALES MARTIN
30 SALES WARD
30 SALES ALLEN
30 SALES BLAKE
12 rows selected.
SQL>
Looks OK, but - I'd like to get information about DEPTNO = 40, although nobody works in it. So, use outer join:
SQL> select d.deptno, d.dname, e.ename
2 from dept d left join emp e on e.deptno = d.deptno
3 order by d.deptno;
DEPTNO DNAME ENAME
---------- -------------- ----------
10 ACCOUNTING KING
10 ACCOUNTING CLARK
10 ACCOUNTING MILLER
20 RESEARCH FORD
20 RESEARCH SMITH
20 RESEARCH JONES
30 SALES JAMES
30 SALES TURNER
30 SALES MARTIN
30 SALES WARD
30 SALES ALLEN
30 SALES BLAKE
40 OPERATIONS
13 rows selected.
SQL>
Right! Here it is! (note that LEFT JOIN produces the same result as LEFT OUTER JOIN; no need to specify "outer", although it makes thinks somewhat more obvious).
Also, there's the "old" Oracle outer join operator, (+) (literally, a + sign enclosed into round brackets). The above query would work as well if we put it like this:
select d.deptno, d.dname, e.ename
from dept d, emp e
where d.deptno = e.deptno (+);
I'd suggest you do the same with (outer join) your query. Once again:
join tables in the JOIN clause
put filters into the WHERE clause
Query will be easier to read and maintain, you'll know what is what, and - if necessary (and it might even be the case for you), if you use the "old" (+) operator, you won't be able to outer join one table to more than just one another table. As you're going deeper and deeper, you might need to outer join some table to several others, and that's where ANSI join takes place.
Good luck!

tabstat: How to sort/order the output by a certain variable?

I gathered some NBA players' data of their triple-double games, and would like to find out who got the most explosive data on average.
The source is "Basketball Reference - Player Game Finder - Triple Doubles".(Sorry that I can't post the direct url because of the lack of reputation)
So I generated a table summarizing descriptive statistics (e.g. count mean) for several variables (pts trb ast stl blk) using:
tabstat pts trb ast stl blk, statistics(count mean) format(%9.1f) by(player)
What I get is the following table:
tabstat result:
How can I tell Stata to filter the players by count >= 10 (who got 10 or more triple-doubles ever) as a column then sort the table by pts and get:
Ideal result:
Like above, I would say Michael Jordan and James Harden are the Top 2 most explosive triple-double players and Darrell Walker is the most economic one.
Do study https://stackoverflow.com/help/mcve on how to present an example other people can work with straight away. Also, avoiding sports-specific jargon that won't be universally comprehensible and focusing more on the general programming problem would help. Fortunately, what you want seems clear nevertheless.
To do this you need to create a variable defining the order desired in advance of your tabstat call. To get it (value) labelled as you wish, use labmask (search labmask then download from the Stata Journal location given).
Here is some technique.
sysuse auto, clear
egen mean = mean(weight), by(rep78)
egen count = count(weight), by(rep78)
egen group = group(mean rep78) if count >= 5
replace group = -group
labmask group, values(rep78)
label var group "`: var label rep78'"
tabstat mpg weight , by(group) s(count mean) format(%1.0f)
Summary statistics: N, mean
by categories of: group (Repair Record 1978)
group | mpg weight
-------+--------------------
2 | 8 8
| 19 3354
-------+--------------------
3 | 30 30
| 19 3299
-------+--------------------
4 | 18 18
| 22 2870
-------+--------------------
5 | 11 11
| 27 2323
-------+--------------------
Total | 67 67
| 21 3030
----------------------------
Key details:
The grouping variable is based not only on the means you want to sort on but also on the original grouping variable, just in case there are ties on the means.
To get ordering from highest mean downwards, the grouping variable must be negated.
tabstat doesn't show variable labels in the body of the table. (Usually there wouldn't be enough space for them.)