I have a pretty complicated query I am trying to convert over to use with Hive.
Specifically, I am running it as a Hive "step" in an AWS EMR cluster.
I have tried to clean up the query a bit for the post, leaving just the essence of the thing.
The full error message is:
FAILED: SemanticException [Error 10128]: Line XX:XX Not yet supported place for UDAF 'COUNT'
The line number points to the COUNT at the bottom of the SELECT statement:
INSERT INTO db.new_table (
new_column1,
new_column2,
new_column3,
... ,
new_column20
)
SELECT MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(new_column5," ")||"_"||...) AS new_col1,
TBL1.col2,
TBL1.col3,
TBL1.col3 AS new_column3,
TBL1.col4,
CASE
WHEN TBL1.col5 = …
ELSE "some value"
END AS new_column5,
TBL1.col6,
TBL1.col7,
TBL1.col8,
CASE
WHEN TBL1.col9 = …
ELSE "some value"
END AS new_column9,
CASE
WHEN TBL1.col10 = …
ELSE "value"
END AS new_column10,
TBL1.col11,
"value" AS new_column12,
TBL2.col1,
TBL2.col2,
from_unixtime(…) AS new_column13,
CAST(…) AS new_column14,
CAST(…) AS new_column15,
CAST(…) AS new_column16,
COUNT(DISTINCT TBL1.col17) AS new_column17
FROM db.table1 TBL1
LEFT JOIN
db.table2 TBL2
ON TBL1.col311 = TBL2.col311
WHERE TBL1.col14 BETWEEN "low" AND "high"
AND TBL1.col44 = "Y"
AND TBL1.col55 = "N"
GROUP BY 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20;
If I have left out too much, please let me know.
Thanks for your help!
Updates
It turns out I did in fact leave out way too much info. Apologies to those who have already tried to help...
I made the updates above.
Removing the 20th GROUP BY column, e.g.:
GROUP BY 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19;
Produced: Expression not in GROUP BY key '' ''
LATEST
Removing the 20th GROUP BY column and adding the first one, e.g.:
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19;
Produced:
Line XX:XX Invalid table alias or column reference 'new_column5':(possible column
names are: TBL1.col1, TBL1.col2, (looks like all columns of TBL1),
TBL2.col1, TBL2.col2, TBL2.col311)
The line number refers to the line with the SELECT statement. Just those three columns from TBL2 are listed in the error output.
The error seems to be pointing at COALESCE(new_column5). Note that I have a CASE statement within the TBL1 select that I alias AS new_column5.
You are referencing the calculated column name new_column5 at the same subquery level where it is being calculated. This is not possible in Hive. Replace it with the calculation itself, or wrap the query in an upper-level subquery (see the sketch after the example below).
This:
MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(CASE WHEN TBL1.col5 = … ELSE "some value" END," ")||"_"||...) AS new_col1,
Instead of this:
MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(new_column5," ")||"_"||...) AS new_col1,
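And here is a minimal sketch of the upper-level-subquery alternative (heavily abbreviated with …; the inner query computes new_column5 once, so the outer query can reference it by name):
SELECT MD5(COALESCE(t.col1," ")||"_"||COALESCE(t.new_column5," ")||"_"||...) AS new_col1,
       ...,
       COUNT(DISTINCT t.col17) AS new_column17
FROM (
    SELECT TBL1.col1,
           CASE WHEN TBL1.col5 = … ELSE "some value" END AS new_column5,
           ...
    FROM db.table1 TBL1
    LEFT JOIN db.table2 TBL2
        ON TBL1.col311 = TBL2.col311
    WHERE ...
) t
GROUP BY ...;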
I have a DataFrame with the following simple schema:
root
|-- amount: double (nullable = true)
|-- Date: timestamp (nullable = true)
I was trying to see the sum of amounts per day and per hour, something like:
+---+--------+--------+ ... +--------+
|day| 0| 1| | 23|
+---+--------+--------+ ... +--------+
|148| 306.0| 106.0| | 0.0|
|243| 1906.0| 50.0| | 1.0|
| 31| 866.0| 100.0| | 0.0|
+---+--------+--------+ ... +--------+
Well, first I added an hour column and then I grouped by day and pivoted by hour. However, I got an exception, which is perhaps related to missing sales for some hours. This is what I'm trying to fix, but I haven't figured out how.
(df.withColumn("hour", hour("date"))
.groupBy(dayofyear("date").alias("day"))
.pivot("hour")
.sum("amount").show())
An excerpt of the exception.
AnalysisException: u'resolved attribute(s) date#3972 missing from
day#5367,hour#5354,sum(amount)#5437 in operator !Aggregate
[dayofyear(cast(date#3972 as date))], [dayofyear(cast(date#3972 as
date)) AS day#5367, pivotfirst(hour#5354, sum(amount)#5437, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 0, 0) AS __pivot_sum(amount) AS sum(amount)#5487];'
The problem is the unresolved day column. You can create it outside the groupBy clause to address that:
from pyspark.sql.functions import col, dayofyear, hour

# Example data, with Date cast to a proper timestamp
df = (sc
      .parallelize([
          (1.0, "2016-03-30 01:00:00"), (30.2, "2015-01-02 03:00:02")])
      .toDF(["amount", "Date"])
      .withColumn("Date", col("Date").cast("timestamp"))
      .withColumn("hour", hour("Date")))

# Create the day column before the groupBy so Spark can resolve it
with_day = df.withColumn("day", dayofyear("Date"))
with_day.groupBy("day").pivot("hour", range(0, 24)).sum("amount")
The values argument for pivot is optional but advisable: providing it means Spark does not have to run an extra job to determine the distinct pivot values, and it guarantees a column for every hour even when some hours have no sales.
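If you want 0.0 instead of null for hours with no sales, as in the desired output above, you can fill the gaps after pivoting:
with_day.groupBy("day").pivot("hour", range(0, 24)).sum("amount").na.fill(0).show()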
I want to write my Pandas dataframe to Excel and apply a format to multiple individual columns (e.g., A and C but not B) with a one-liner like this:
writer = pd.ExcelWriter(filepath, engine='xlsxwriter')
my_format = writer.book.add_format({'num_format': '#'})
writer.sheets['Sheet1'].set_column('A:A,C:C', 15, my_format)
This results in the following error:
File ".../python2.7/site-packages/xlsxwriter/worksheet.py", line 114, in column_wrapper
cell_1, cell_2 = [col + '1' for col in args[0].split(':')]
ValueError: too many values to unpack
It doesn't accept the syntax 'A:A,C:C'. Is it even possible to apply the same formatting without calling set_column() for each column?
If the column ranges are non-contiguous you will have to call set_column() for each range:
writer.sheets['Sheet1'].set_column('A:A', 15, my_format)
writer.sheets['Sheet1'].set_column('C:C', 15, my_format)
Note: to do this programmatically you can also loop over zero-indexed column numbers:
for col in (0, 2):  # columns A and C
    writer.sheets['Sheet1'].set_column(col, col, 15, my_format)
Or you could reference columns like this:
for col in ('X', 'Z'):
    # A width of None keeps the default column width
    writer.sheets['Sheet1'].set_column(col + ':' + col, None, my_format)
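For context, a minimal end-to-end sketch (the DataFrame and file name here are illustrative; note that the sheet must be written before writer.sheets['Sheet1'] exists):
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)

my_format = writer.book.add_format({'num_format': '#'})
worksheet = writer.sheets['Sheet1']

# Non-contiguous ranges need one set_column() call each
worksheet.set_column('A:A', 15, my_format)
worksheet.set_column('C:C', 15, my_format)

writer.save()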
I am quite new to numpy and I am trying to read a tab (\t) delimited text file into a numpy array using the following code:
train_data = np.genfromtxt('training.txt', dtype=None, delimiter='\t')
File contents:
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
What I expect is a 2-D array of shape (3, 15), but with my above code I only get a single-row array of shape (3,). I am not sure why the fifteen fields of each row are not each assigned a column.
I also tried numpy's loadtxt(), but it could not handle the type conversions on my data, i.e., even though I gave dtype=None it tried to convert the strings to the default float type and failed.
Tried code:
train_data = np.loadtxt('try.txt', dtype=None, delimiter='\t')
Error:
ValueError: could not convert string to float: State-gov
Any pointers?
Thanks
Actually, the issue here is that np.genfromtxt and np.loadtxt both return a structured array if the dtype is structured (i.e., has multiple types). Your array reports a shape of (3,) because technically it is a 1-D array of 'records'. These 'records' hold all your columns, and you can still access the data much as if it were 2-D.
You are loading it correctly:
In [82]: d = np.genfromtxt('tmp',dtype=None)
As you reported, it has a 1d shape:
In [83]: d.shape
Out[83]: (3,)
But all your data is there:
In [84]: d
Out[84]:
array([ (38, 'Private', 215646, 'HS-grad', 9, 'Divorced', 'Handlers-cleaners', 'Not-in-family', 'White', 'Male', 0, 0, 40, 'United-States', '<=50K'),
(53, 'Private', 234721, '11th', 7, 'Married-civ-spouse', 'Handlers-cleaners', 'Husband', 'Black', 'Male', 0, 0, 40, 'United-States', '<=50K'),
(30, 'State-gov', 141297, 'Bachelors', 13, 'Married-civ-spouse', 'Prof-specialty', 'Husband', 'Asian-Pac-Islander', 'Male', 0, 0, 40, 'India', '>50K')],
dtype=[('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])
The dtype of the array is structured as so:
In [85]: d.dtype
Out[85]: dtype([('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])
And you can still access "columns" (known as fields) using the names given in the dtype:
In [86]: d['f0']
Out[86]: array([38, 53, 30])
In [87]: d['f1']
Out[87]:
array(['Private', 'Private', 'State-gov'],
dtype='|S9')
It's more convenient to give proper names to the fields:
In [104]: names = "age,military,id,edu,a,marital,job,fam,ethnicity,gender,b,c,d,country,income"
In [105]: d = np.genfromtxt('tmp',dtype=None, names=names)
So you can now access the 'age' field, etc.:
In [106]: d['age']
Out[106]: array([38, 53, 30])
In [107]: d['income']
Out[107]:
array(['<=50K', '<=50K', '>50K'],
dtype='|S5')
Or the incomes of people under 35:
In [108]: d[d['age'] < 35]['income']
Out[108]:
array(['>50K'],
dtype='|S5')
And over 35:
In [109]: d[d['age'] > 35]['income']
Out[109]:
array(['<=50K', '<=50K'],
dtype='|S5')
Updated answer
Sorry, I misread your original question:
What I expect is a 2-D array of shape (3, 15), but with my above code I only get a single-row array of shape (3,).
I think you misunderstand what np.genfromtxt() will return. In this case, it will try to infer the type of each 'column' in your text file and give you back a structured / "record" array. Each row will contain multiple fields (f0...f14), each of which can contain values of a different type corresponding to a 'column' in your text file. You can index a particular field by name, e.g. data['f0'].
You simply can't have a (3,15) numpy array of heterogeneous types. You can have a (3,15) homogeneous array of strings, for example:
>>> string_data = np.genfromtxt('test', dtype=str, delimiter='\t')
>>> print string_data.shape
(3, 15)
Then of course you could manually cast the columns to whatever type you want, as in #DrRobotNinja's answer. However you might as well let numpy create a structured array for you, then index it by field and assign the columns to new arrays.
I do not believe plain NumPy arrays can hold different datatypes within a single array. What can be done is to load the entire array as strings, then convert the necessary columns to numbers as needed:
# Load all fields as strings
train_data = np.loadtxt('try.txt', dtype=str, delimiter='\t')

# Convert the numeric columns to integers
first_col = train_data[:, 0].astype(int)
third_col = train_data[:, 2].astype(int)
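Putting this together with the original tab-delimited file, a minimal sketch (the field names here are only illustrative guesses at the columns):
import numpy as np

names = ("age,workclass,fnlwgt,education,edu_num,marital,occupation,"
         "relationship,race,sex,capital_gain,capital_loss,hours,country,income")
train_data = np.genfromtxt('training.txt', dtype=None, delimiter='\t', names=names)

print(train_data.shape)      # (3,) -- one 'record' per row
print(train_data['age'])     # the first column as an integer array
print(train_data['income'])  # the last column as a string array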
I have two annotated results:
cred_rec = Disponibilidad.objects.values('mac__mac', 'int_mes').annotate(tramites=Count('fuar'), recibidas=Count('fecha_disp_mac'))
cred14 = Disponibilidad.objects.filter(int_disponible__lte=14).values('mac__mac', 'int_mes').annotate(en14=Count('fuar'))
Both have the same keys 'mac__mac' and 'int_mes'; what I want is to create a new dictionary with the keys in cred_rec, plus en14 from cred14.
I tried some answers found here, but I am missing something.
Thanks.
EDIT. After some trial and error, I got this:
for linea in cred_rec:
    clave = (linea['mac__mac'], linea['int_mes'])
    for linea2 in cred14:
        clave2 = (linea2['mac__mac'], linea2['int_mes'])
        if clave2 == clave:
            linea['en14'] = linea2['en14']
            linea['disp'] = float(linea2['en14'])/linea['recibidas']*100
Now, I have to ask for a better solution. Thanks again.
EDIT
This is what the input looks like:
fuar, mac_id, int_mes, int_disponible, int_exitoso, fecha_tramite, fecha_actualiza_pe, fecha_disp_mac
1229012106349,1,7,21,14,2012-07-02 08:33:54.0,2012-07-16 17:33:21.0,2012-07-23 08:01:22.0
1229012106350,1,7,25,17,2012-07-02 09:01:25.0,2012-07-19 17:45:57.0,2012-07-27 17:45:59.0
1229012106351,1,7,21,14,2012-07-02 09:15:12.0,2012-07-16 19:14:35.0,2012-07-23 08:01:22.0
1229012106352,1,7,24,16,2012-07-02 09:25:19.0,2012-07-18 07:52:18.0,2012-07-26 16:04:11.0
... a few thousand lines dropped ...
The fuar is like an order_id; mac__mac is like a site_id; mes is the month; int_disponible is the timedelta between fecha_tramite and fecha_disp_mac; int_exitoso is the timedelta between fecha_tramite and fecha_actualiza_pe.
The output is like this:
mac, mes, tramites, cred_rec, cred14, % rec, % en 14
1, 7, 2023, 2006, 1313, 99.1596638655, 65.4536390828
1, 8, 1748, 1182, 1150, 67.6201372998, 97.2927241963
2, 8, 731, 471, 441, 64.4322845417, 93.6305732484
3, 8, 1352, 840, 784, 62.1301775148, 93.3333333333
tramites is the sum of all orders (fuar) in a month
cred_rec: cred is our product; in theory there is a cred for each fuar, and cred_rec is the sum of all cred produced in a month
cred14 is the sum of all cred made within 14 days
% rec is the relationship between fuar received and cred produced, in %
% en 14 is the relationship between the cred produced and the cred produced on time
I will use this table in an Annotated Time Line chart or a Combo Chart from Google Charts to show the performance of our manufacturing process.
Thanks for your time.
One immediate improvement to your current code would be to precalculate the en14 values and index them by key. This reduces the repeated scans over the cred14 list, at the cost of the memory used to store the index. (Note that recibidas lives on the cred_rec rows, not on cred14, so disp has to be computed in the second loop.)
def line_key(line):
    return (line['mac__mac'], line['int_mes'])

# Index the en14 counts by (mac, month) so each cred_rec row is matched in O(1)
en14_by_key = {}
for line in cred14:
    en14_by_key[line_key(line)] = line['en14']

for line in cred_rec:
    en14 = en14_by_key.get(line_key(line))
    if en14 is not None:
        line['en14'] = en14
        # recibidas comes from the cred_rec row itself
        line['disp'] = float(en14) / line['recibidas'] * 100
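For a quick sanity check, here is a usage sketch with stand-in dictionaries shaped like the querysets (values taken from the sample output above):
cred_rec = [{'mac__mac': 1, 'int_mes': 7, 'tramites': 2023, 'recibidas': 2006}]
cred14 = [{'mac__mac': 1, 'int_mes': 7, 'en14': 1313}]

# After running the merge above, the cred_rec row gains en14 and disp:
# {'mac__mac': 1, 'int_mes': 7, 'tramites': 2023, 'recibidas': 2006,
#  'en14': 1313, 'disp': 65.4536390828}   # disp matches the "% en 14" column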