AWS EMR Hive: Not yet supported place for UDAF 'COUNT'


I have a fairly complicated query that I am trying to convert for use with Hive.
Specifically, I am running it as a Hive "step" in an AWS EMR cluster.
I have tried to clean up the query a bit for this post and leave just the essence of it.
The full error message is:
FAILED: SemanticException [Error 10128]: Line XX:XX Not yet supported place for UDAF 'COUNT'
The line number is pointing to the COUNT at the bottom of the select statement:
INSERT INTO db.new_table (
    new_column1,
    new_column2,
    new_column3,
    ... ,
    new_column20
)
SELECT MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(new_column5," ")||"_"||...) AS new_col1,
    TBL1.col2,
    TBL1.col3,
    TBL1.col3 AS new_column3,
    TBL1.col4,
    CASE
        WHEN TBL1.col5 = …
        ELSE "some value"
    END AS new_column5,
    TBL1.col6,
    TBL1.col7,
    TBL1.col8,
    CASE
        WHEN TBL1.col9 = …
        ELSE "some value"
    END AS new_column9,
    CASE
        WHEN TBL1.col10 = …
        ELSE "value"
    END AS new_column10,
    TBL1.col11,
    "value" AS new_column12,
    TBL2.col1,
    TBL2.col2,
    from_unixtime(…) AS new_column13,
    CAST(…) AS new_column14,
    CAST(…) AS new_column15,
    CAST(…) AS new_column16,
    COUNT(DISTINCT TBL1.col17) AS new_column17
FROM db.table1 TBL1
LEFT JOIN
    db.table2 TBL2
    ON TBL1.col311 = TBL2.col311
WHERE TBL1.col14 BETWEEN "low" AND "high"
    AND TBL1.col44 = "Y"
    AND TBL1.col55 = "N"
GROUP BY 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20;
If I have left out too much, please let me know.
Thanks for your help!
Updates
It turns out I did in fact leave out far too much information. Apologies to those who have already tried to help.
I have made the updates above.
Removing the 20th GROUP BY column, e.g.:
GROUP BY 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19;
Produced: Expression not in GROUP BY key '' ''
LATEST
Removing the 20th GROUP BY column and adding the first one, e.g.:
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19;
Produced:
Line XX:XX Invalid table alias or column reference 'new_column5':(possible column
names are: TBL1.col1, TBL1.col2, (looks like all columns of TBL1),
TBL2.col1, TBL2.col2, TBL2.col311)
The line number refers to the line with the SELECT statement. Only those three columns from TBL2 are listed in the error output.
The error seems to be pointing to COALESCE(new_column5). Note that the TBL1 select contains a CASE expression that I alias AS new_column5.

You are referencing the calculated column alias new_column5 at the same subquery level where it is calculated. This is not possible in Hive. Replace the alias with the calculation itself, or compute it in an inner subquery and reference it from the outer query.
Use this:
MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(CASE WHEN TBL1.col5 = … ELSE "some value" END," ")||"_"||...) AS new_col1,
Instead of this:
MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(new_column5," ")||"_"||...) AS new_col1,
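
The second option, moving the calculation into an inner subquery, can be sketched as follows. This is only an illustration of the query shape: it runs against an in-memory SQLite database as a stand-in for Hive, and the table contents and the 'matched' value are made up for the demo. MD5 is left out because SQLite has no such built-in.

```python
import sqlite3

# SQLite is only a stand-in for Hive here; the table and its rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT, col5 TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [("a", "x"), (None, "y")])

query = """
SELECT COALESCE(s.col1, ' ') || '_' || COALESCE(s.new_column5, ' ') AS new_col1,
       s.new_column5
FROM (
    -- inner level: the CASE expression is evaluated and given its alias
    SELECT col1,
           CASE WHEN col5 = 'x' THEN 'matched' ELSE 'some value' END AS new_column5
    FROM table1
) s
ORDER BY s.new_column5
"""
# outer level: new_column5 is now an ordinary column of the subquery "s"
rows = conn.execute(query).fetchall()
print(rows)
```

The outer SELECT can use new_column5 freely because, from its point of view, the alias is just a column of the derived table.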

Related

Python printing lists with column headers

So I have a nested list containing these values
#[[Mark, 10, Orange],
#[Fred, 15, Red],
#[Gary, 12, Blue],
#[Ned, 21, Yellow]]
You can see that each row is laid out as (name, age, favcolour).
I want to display each column under its corresponding header,
e.g.:
Name|Age|Favourite colour
Mark|10 |Orange
Fred|15 |Red
Gary|12 |Blue
Ned |21 |Yellow
Thank You!

Simple solution using the str.format() function:
l = [['Mark', 10, 'Orange'], ['Fred', 15, 'Red'], ['Gary', 12, 'Blue'], ['Ned', 21, 'Yellow']]
f = '{:<10}|{:<3}|{:<15}'  # left-align each field in a fixed width
# header (the `Name` column has some gap as there could be long names, like "Cristopher")
print(f.format('Name', 'Age', 'Favourite colour'))
for i in l:
    print(f.format(*i))
The output:
Name      |Age|Favourite colour
Mark      |10 |Orange
Fred      |15 |Red
Gary      |12 |Blue
Ned       |21 |Yellow
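
If the column widths should follow the data rather than being hard-coded, one possible variation is to measure the longest cell in each column first. The helper name format_table below is hypothetical, not something from the answer above:

```python
rows = [['Mark', 10, 'Orange'], ['Fred', 15, 'Red'],
        ['Gary', 12, 'Blue'], ['Ned', 21, 'Yellow']]
headers = ['Name', 'Age', 'Favourite colour']


def format_table(headers, rows):
    """Return the table as a list of lines, each column padded to its widest cell."""
    # Convert every cell to text so len() and ljust() work for numbers too.
    table = [list(headers)] + [[str(cell) for cell in row] for row in rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]
    return ['|'.join(cell.ljust(width) for cell, width in zip(row, widths))
            for row in table]


lines = format_table(headers, rows)
print('\n'.join(lines))
```

With the sample data this keeps the Name column just four characters wide, which matches the layout asked for in the question.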

Pivoting with missing values

I have a DataFrame with the following simple schema:
root
|-- amount: double (nullable = true)
|-- Date: timestamp (nullable = true)
I was trying to see the sum of amounts per day and per hour, something like:
+---+--------+--------+ ... +--------+
|day|       0|       1|     |      23|
+---+--------+--------+ ... +--------+
|148|   306.0|   106.0|     |     0.0|
|243|  1906.0|    50.0|     |     1.0|
| 31|   866.0|   100.0|     |     0.0|
+---+--------+--------+ ... +--------+
First I added an hour column, then I grouped by day and pivoted by hour. However, I got an exception, which is perhaps related to missing sales for some hours. This is what I'm trying to fix, but I haven't figured out how.
(df.withColumn("hour", hour("date"))
    .groupBy(dayofyear("date").alias("day"))
    .pivot("hour")
    .sum("amount").show())
An excerpt of the exception:
AnalysisException: u'resolved attribute(s) date#3972 missing from
day#5367,hour#5354,sum(amount)#5437 in operator !Aggregate
[dayofyear(cast(date#3972 as date))], [dayofyear(cast(date#3972 as
date)) AS day#5367, pivotfirst(hour#5354, sum(amount)#5437, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 0, 0) AS __pivot_sum(amount) AS sum(amount)#5487];'

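The problem is the unresolved day column. You can create it outside the groupBy clause to address that:
from pyspark.sql.functions import col, dayofyear, hour

# sc is assumed to be an existing SparkContext
df = (sc
    .parallelize([
        (1.0, "2016-03-30 01:00:00"), (30.2, "2015-01-02 03:00:02")])
    .toDF(["amount", "Date"])
    .withColumn("Date", col("Date").cast("timestamp"))
    .withColumn("hour", hour("date")))
with_day = df.withColumn("day", dayofyear("Date"))
with_day.groupBy("day").pivot("hour", list(range(0, 24))).sum("amount")
The values argument for pivot is optional but advisable: supplying it spares Spark a separate job to collect the distinct hours, and it yields a column for every listed hour, including hours with no rows (those cells come back as null).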
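
To make the effect of a fixed list of pivot values concrete, here is a small plain-Python sketch of the same aggregation. It does not use Spark at all; it only mimics the grouping by day of year and the one-slot-per-hour layout, and it fills empty hours with 0.0 (as in the table the question asks for) where Spark itself would leave null.

```python
from collections import defaultdict
from datetime import datetime

# Same sample records as in the answer above.
records = [(1.0, "2016-03-30 01:00:00"), (30.2, "2015-01-02 03:00:02")]
hours = range(24)  # plays the role of the explicit pivot values

# One dict per day, pre-filled with a 0.0 slot for every hour.
totals = defaultdict(lambda: dict.fromkeys(hours, 0.0))
for amount, stamp in records:
    ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    day = ts.timetuple().tm_yday  # day of the year, 1-based
    totals[day][ts.hour] += amount

for day in sorted(totals):
    print(day, [totals[day][h] for h in hours])
```

Every day ends up with 24 entries regardless of which hours actually had data, which is the property the explicit values list gives the pivoted DataFrame.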

XlsxWriter: set_column() with one format for multiple non-continuous columns

I want to write my Pandas dataframe to Excel and apply a format to multiple individual columns (e.g., A and C but not B) using a one-liner as such:
writer = pd.ExcelWriter(filepath, engine='xlsxwriter')
my_format = writer.book.add_format({'num_format': '#'})
writer.sheets['Sheet1'].set_column('A:A,C:C', 15, my_format)
This results in the following error:
File ".../python2.7/site-packages/xlsxwriter/worksheet.py", line 114, in column_wrapper
cell_1, cell_2 = [col + '1' for col in args[0].split(':')]
ValueError: too many values to unpack
It doesn't accept the syntax 'A:A,C:C'. Is it even possible to apply the same formatting without calling set_column() for each column?

If the column ranges are non-contiguous you will have to call set_column() for each range:
writer.sheets['Sheet1'].set_column('A:A', 15, my_format)
writer.sheets['Sheet1'].set_column('C:C', 15, my_format)
Note, to do this programmatically you can also use a numeric range:
for col in (0, 2):
    writer.sheets['Sheet1'].set_column(col, col, 15, my_format)

Or you could reference columns like this:
for col in ('X', 'Z'):
    writer.sheets['Sheet1'].set_column(col + ':' + col, None, my_format)
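
If you would still like to write the ranges as a single string such as 'A:A,C:C', a thin wrapper can split the string and issue one set_column() call per range. The helper set_columns below is hypothetical and not part of XlsxWriter; RecordingSheet is only a stand-in so the sketch runs without the library installed.

```python
def set_columns(worksheet, spec, width, cell_format=None):
    """Call worksheet.set_column() once per comma-separated range in spec."""
    for col_range in spec.split(','):
        worksheet.set_column(col_range.strip(), width, cell_format)


class RecordingSheet:
    """Stand-in for a worksheet: it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def set_column(self, col_range, width, cell_format=None):
        self.calls.append((col_range, width, cell_format))


sheet = RecordingSheet()
set_columns(sheet, 'A:A, C:C', 15, 'my_format')
print(sheet.calls)
```

With a real workbook you would pass writer.sheets['Sheet1'] and the format object instead of the stand-ins.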

ValueError:[number] is not in the list, even though it is and the code i believe is correct

When I execute the test code below, I get the error shown beneath it:
my_numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
my_input = input("Pick a number from 1 to 10?")
number_index = my_numbers.index(my_input)
print(number_index)
ERROR-----
number_index = my_numbers.index(my_input)
ValueError: '1' is not in list

Is this Python? If so, it looks like Python 3, and the error is simple: input() returns a string, whereas your list contains integers, and an integer never compares equal to a string. When you pass my_input, a string, to index(), it searches my_numbers for a match, finds only integers, and raises the error. The solution is to convert the input to an integer like this:
my_input = int(input("Pick a number from 1 to 10?"))
The same applies to other languages, though the fine details may vary.
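
The conversion itself can fail if the user types something that is not a number, and index() still raises ValueError for a number outside the list. A minimal sketch that handles both cases follows; the function name find_index is made up for illustration, and the text is passed in as an argument so the example runs without a prompt.

```python
my_numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


def find_index(numbers, raw_text):
    """Return the position of the typed number, or None if it cannot be found."""
    try:
        value = int(raw_text)  # input() always returns text, so convert first
    except ValueError:
        return None            # the text was not a whole number at all
    try:
        return numbers.index(value)
    except ValueError:
        return None            # a valid number, but not in the list


print(find_index(my_numbers, '1'))
print(find_index(my_numbers, 'abc'))
```

In the original program you would call it as find_index(my_numbers, input("Pick a number from 1 to 10?")) and check the result for None.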

Pandas unstack but only create multi index for certain columns

I have a data frame of production data for a factory. The factory is organised into lines. The data is structured so that one of the columns contains repeating values that are really headers, so I need to reshape it. In the following DataFrame the 'Quality' column contains 4 measures, each of which is recorded for every hour. Clearly this gives us four observations per line.
The goal here is to transpose this data, but such that some of the columns are single index and some are multi index. The row index should remain ['Date', 'ID']. The single index columns should be 'line_no', 'floor', 'buyer' and the multi index columns should be the hourly measures for each of the quality measures.
I know that this is possible because I accidentally stumbled across a way to do it. As my code will show, I put everything in the index except the hourly data and then unstacked the Quality column from the index. Then, by chance, I tried to reset the index and it created this amazing dataframe where some columns were single index and some multi. Of course it's highly impractical to have loads of columns in the index, because we might want to do stuff with them, like change them. My question is how to achieve this type of thing without having to go through what I feel is a workaround.
import random
import pandas as pd
d = {'ID' : [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] * 2,
     'Date' : ['2013-05-04' for x in range(12)] + \
              ['2013-05-06' for x in range(12)],
     'line_no' : [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] * 2,
     'floor' : [5, 5, 5, 5, 6, 6, 6, 6, 5, 5, 5, 5] * 2,
     'buyer' : ['buyer1', 'buyer1', 'buyer1', 'buyer1',\
                'buyer2', 'buyer2', 'buyer2', 'buyer2',\
                'buyer1', 'buyer1', 'buyer1', 'buyer1'] * 2,
     'Quality' : ['no_checked', 'good', 'alter', 'rejected'] * 6,
     'Hour1' : [random.randint(1000, 15000) for x in range(24)],
     'Hour2' : [random.randint(1000, 15000) for x in range(24)],
     'Hour3' : [random.randint(1000, 15000) for x in range(24)],
     'Hour4' : [random.randint(1000, 15000) for x in range(24)],
     'Hour5' : [random.randint(1000, 15000) for x in range(24)],
     'Hour6' : [random.randint(1000, 15000) for x in range(24)]}
DF = pd.DataFrame(d, columns = ['ID', 'Date', 'line_no', 'floor', 'buyer',
                                'Quality', 'Hour1', 'Hour2', 'Hour3', 'Hour4',
                                'Hour5', 'Hour6'])
# inplace = True so that the reset_index() call further down restores Date and ID as columns
DF.set_index(['Date', 'ID'], inplace = True)
So this is how I achieved what I wanted, but there must be a way to do this without having to go through all these steps. Help please...
# Reset the index
DF.reset_index(inplace = True)
# Put everything in the index
DF.set_index(['Date', 'ID', 'line_no', 'floor', 'buyer', 'Quality'], inplace = True)
# Unstack Quality
DFS = DF.unstack('Quality')
#Now this was the accidental workaround - gives exactly the result I want
DFS.reset_index(inplace = True)
DFS.set_index(['Date', 'ID'], inplace = True)
All help appreciated. Sorry for the long question, but at least there is some data riiiight!

In general, inplace operations are not faster and, IMHO, are less readable.
In [18]: df.set_index(['Date','ID','Quality']).unstack('Quality')
Out[18]:
line_no floor buyer Hour1 Hour2 Hour3 Hour4 Hour5 Hour6
Quality alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected
Date ID
2013-05-04 1 1 5 buyer1 6920 8681 9317 14631 5739 2112 4211 12026 13577 1855 13884 12710 7250 2540 1948 7116 9874 7302 10961 8251 3070 2793 14293 10895
2 2 6 buyer2 7943 7501 13725 1648 7178 9670 6278 6888 9969 11766 9968 4722 7242 4049 6704 2225 6546 8688 11513 14550 2140 11941 1142 6683
3 3 5 buyer1 5155 2449 13648 2183 14184 7309 1185 10454 11742 14102 2242 14297 6185 5554 12505 13312 3062 7426 4421 5693 12342 11622 10431 13375
2013-05-06 1 1 5 buyer1 14563 1343 14419 3350 8526 1185 5244 14777 2238 3640 6717 1109 7777 13136 1732 8681 14454 1059 10606 6942 9349 4524 13931 11799
2 2 6 buyer2 14837 9524 8453 6074 11516 12356 9651 10650 15000 11374 4690 10914 1857 3231 14627 6590 6503 9268 13108 8581 8448 12013 14175 10783
3 3 5 buyer1 9032 12959 4613 6793 7918 2827 6027 13002 11771 13370 12767 11080 12624 13269 11740 10543 8609 14709 11921 12484 8670 12706 8001 8991
[6 rows x 27 columns]
This is quite a reasonable idiom for what you are doing.
