Unable to find custom Hive InputFormat when using `where 1=1` - mapreduce

I'm using Hive and I'm encountering an exception when I run a query that uses a custom InputFormat.
When I use the query select * from micmiu_blog; Hive works without problems, but if I use select * from micmiu_blog where 1=1; it seems that the framework cannot find my custom InputFormat class.
I have put the JAR file into "hive/lib" and "hadoop/lib", and I have also added "hadoop/lib" to the CLASSPATH. This is the log:
hive> select * from micmiu_blog where 1=1;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1415530028127_0004, Tracking URL = http://hadoop01-master:8088/proxy/application_1415530028127_0004/
Kill Command = /home/hduser/hadoop-2.2.0/bin/hadoop job -kill job_1415530028127_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-11-09 19:53:32,996 Stage-1 map = 0%, reduce = 0%
2014-11-09 19:53:52,010 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1415530028127_0004 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1415530028127_0004_m_000000 (and more) from job job_1415530028127_0004
Task with the most failures(4):
-----
Task ID:
task_1415530028127_0004_m_000000
URL:
http://hadoop01-master:8088/taskdetails.jsp?jobid=job_1415530028127_0004&tipid=task_1415530028127_0004_m_000000
-----
Diagnostic Messages for this Task:
Error: java.io.IOException: cannot find class hiveinput.MyDemoInputFormat
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:564)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:167)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:408)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

I ran into this problem just now. You should add your JAR file to the classpath from the Hive CLI; a plain select * can be served by a local fetch task, but the where clause launches a MapReduce job, and the task JVMs only see JARs that are shipped with the job, which is exactly what add jar does.
You can do it like this:
hive> add jar /usr/lib/xxx.jar;
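You can check that the JAR has actually been registered for the session (if your Hive CLI version supports the command) with:

hive> list jars;

If you want the JAR available in every session without repeating add jar, the usual place to register it permanently is the hive.aux.jars.path property in hive-site.xml (or the HIVE_AUX_JARS_PATH environment variable).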

Related

Query pyarrow dataset from GCP bucket is extremely slow

I am using a pyarrow dataset to query a Parquet file in GCP; the code is straightforward:
import pyarrow.dataset as ds
import duckdb
import json
lineitem = ds.dataset("gs://duckddelta/lineitem", format="parquet")
lineitem_partition = ds.dataset("gs://duckddelta/delta2", format="parquet", partitioning="hive")
con = duckdb.connect()
def Query(request):
    SQL = request.get_json().get('name')
    df = con.execute(SQL).df()
    return json.dumps(df.to_json(orient="records")), 200, {'Content-Type': 'application/json'}
Then I call that function with a SQL query:
SQL = '''
SELECT
l_returnflag,
l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM
lineitem
GROUP BY 1,2
ORDER BY 1,2 ;
'''
I know that local SSD storage is faster, but I am seeing a massive difference.
The query takes 4 seconds when the file is saved on my laptop,
it takes 54 seconds when run from a Google Cloud Function in the same region,
and it takes 3 minutes when I run it in Colab.
It seems to me there is a bottleneck somewhere in Cloud Functions; I was expecting better performance.
Edit for more context: the file is 1.2 GB, the region is us-central1 (Iowa), and it is a gen 2 Cloud Function with 8 GB of memory and 8 CPUs.
Sorry, it turns out this is a possible bug in DuckDB; apparently DuckDB is not multi-threaded in this particular scenario:
https://github.com/duckdb/duckdb/issues/4525
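As an aside (my own sketch, not from the issue above): it may be worth confirming how many threads DuckDB is actually using inside the Cloud Function, and registering the Arrow dataset explicitly instead of relying on the variable name being picked up. Something like the following, reusing the bucket path from the question:

import duckdb
import pyarrow.dataset as ds

lineitem = ds.dataset("gs://duckddelta/lineitem", format="parquet")

con = duckdb.connect()
con.register("lineitem", lineitem)   # expose the Arrow dataset to SQL explicitly
con.execute("PRAGMA threads=8")      # match the 8 CPUs of the gen 2 function

print(con.execute("SELECT current_setting('threads')").fetchone())
df = con.execute("SELECT l_returnflag, COUNT(*) AS n FROM lineitem GROUP BY 1").df()
print(df)

If the linked issue is the culprit, raising the thread count will not help this particular scan, but the check at least rules out a mis-detected CPU count.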

How does the BigQuery command line tool "bq query" stop printing the sql query string for some jobs?

Let's say the following file exists in the current directory:
# example1.sql
SELECT 1 AS data
When I run bq query --use_legacy_sql=false --format=csv < example1.sql I get the expected output:
Waiting on bqjob_r1cc6530a5317399f_0000017f2b966269_1 ... (0s) Current status: DONE
data
1
But if I run the following query:
# example2.sql
DECLARE useless_variable_1 STRING DEFAULT 'var_1';
DECLARE useless_variable_2 STRING DEFAULT 'var_2';
SELECT 1 AS data;
like this: bq query --use_legacy_sql=false --format=csv < example2.sql, I get the following output:
Waiting on bqjob_r501cc26a3d336b7c_0000017f2b97ceb1_1 ... (0s) Current status: DONE
SELECT 1 AS data; -- at [6:1]
data
1
As you can see, adding variables caused bq query to print the contents of "example2.sql" to stdout, which is not what I want. How can I avoid printing the contents of the file to stdout and instead just return the result of the query like in the first example?
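One possible workaround (a sketch of my own, not something mentioned above): run the same script with the google-cloud-bigquery Python client, where fetching the parent job's results should return only the rows produced by the final SELECT of a multi-statement script:

from google.cloud import bigquery

client = bigquery.Client()

# Assumes example2.sql is the multi-statement script shown above.
with open("example2.sql") as f:
    sql = f.read()

job = client.query(sql)      # the whole script runs as one job
for row in job.result():     # rows of the final statement
    print(row["data"])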

How can we use clustering results in Weka?

I am using Weka for my internship, but I have little knowledge of data mining. Maybe someone knows how I can apply the following results to my data sets to get all the data grouped by cluster? The method I use now is to compute the distance between each instance's attributes and the mean values of each cluster and then assign the instance to the nearest cluster, but this method is too rough for me.
=== Run information ===
Scheme:weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: wcet_cluster6 - Copie-weka.filters.unsupervised.attribute.Remove-R1-3,5-weka.filters.unsupervised.attribute.Remove-R5-12
Instances: 467
Attributes: 4
max
alt
stmt
bb
Test mode:evaluate on training data
=== Model and evaluation on training set ===
EM
Number of clusters selected by cross validation: 6
Cluster
Attribute            0         1         2         3         4         5
                (0.28)    (0.11)    (0.25)    (0.16)    (0.04)    (0.17)
========================================================================
max
  mean          9.0148   10.9112   11.2826   10.4329   11.2039   10.0546
  std. dev.     1.8418    2.7775    3.0263    2.5743    2.2014    2.4614
alt
  mean          0.0003   19.6467    0.4867    2.4565    44.191    8.0635
  std. dev.     0.0175    5.7685    0.5034    1.3647   10.4761    3.3021
stmt
  mean          0.7295   77.0348    3.2439   12.3971  140.9367   33.9686
  std. dev.     1.0174   21.5897    2.3642    5.1584   34.8366   11.5868
bb
  mean          0.4362   53.9947    1.4895    7.2547  114.7113   22.2687
  std. dev.     0.5153   13.1614    0.9276    3.5122   28.0919    7.6968
Time taken to build model (full training data) : 4.24 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 163 ( 35%)
1 50 ( 11%)
2 85 ( 18%)
3 73 ( 16%)
4 18 ( 4%)
5 78 ( 17%)
Log likelihood: -9.09081
Thanks for your help!!
I think no one can really answer this, but here are some tips off the top of my head.
You have used the EM clustering algorithm (see the animated GIF on the Wikipedia page). From Weka's documentation synopsis:
"EM assigns a probability distribution to each instance which
indicates the probability of it belonging to each of the clusters. "
Is this complex output really what you want?
It also selects a number of clusters for you (unless you constrain that number).
In Weka 3.7 you can use the unsupervised attribute filter "ClusterMembership" in the Preprocess dialog to replace your dataset with the result of the cluster assignments. You need to select one reference attribute, though; by default it selects the last one. This creates hard-to-interpret output.
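If the goal is simply a list of instances per cluster, a small program against Weka's Java API may be easier to work with than the filter output. This is only a sketch under assumptions: the data has been exported to an ARFF file (the file name below is hypothetical) and the same EM options as in the run information are reused:

import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class DumpClusters {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF export of the same 4-attribute data set.
        Instances data = DataSource.read("wcet_cluster6.arff");

        EM em = new EM();
        em.setOptions(Utils.splitOptions("-I 100 -N -1 -M 1.0E-6 -S 100"));
        em.buildClusterer(data);

        // Print "<cluster>  <instance>" so the output can be grouped by cluster.
        for (int i = 0; i < data.numInstances(); i++) {
            int cluster = em.clusterInstance(data.instance(i));
            System.out.println(cluster + "\t" + data.instance(i));
        }
    }
}

In the Explorer you can get much the same thing without code: right-click the finished run in the Cluster tab's result list, choose "Visualize cluster assignments", and save the plotted data.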

Omnet Tkenv run config for multiple parameters: only the first parameter value is executed

My ini code for the config is as follows:
[Config BR54MBPS1MS]
description = "at 54MBPS with SI 1ms for 1250 Bytes with all time interval"
repeat = 2
sim-time-limit = 1 min
**.scalar-recording = true
**.vector-recording = false
**.host1.udpApp[0].messageLength = 1250B
**.wlan*.bitrate = 54Mbps
**.host1.udpApp[*].sendInterval = ${interval = 100..1200 step 100} us
**.vector-recording = false
output-scalar-file = 54Mbps/${configname}54Mbps${interval}us.sca
I want to run it for all the given intervals from 100 us to 1200 us in steps of 100 us (at 100, 200, 300 ... us) in the OMNeT++ Tkenv/GUI. The only option I have found is to run it through the run configuration dialog.
The problem is that it runs successfully only for 100 us, generates the output .sca file, and then terminates. I am not able to figure out why it does not run for the next send interval.
In order to run all combinations of sendInterval values, you should write * (asterisk) in the Run number field and select Command line interface. Multiple runs are not possible when the Tcl/Tk user interface is selected.
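For reference, the same thing can be done from a terminal, which is often easier for parameter studies. A rough sketch (the executable name is an assumption; omitting the run filter should make Cmdenv execute every run of the config, here 12 interval values times repeat = 2, i.e. 24 runs):

./mysimulation -u Cmdenv -c BR54MBPS1MS -f omnetpp.ini

You can also restrict it to specific runs with, for example, -r 0..23.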

RRD DB fake value generator

I want to generate fake values in an RRD database for a period of one month with a data-collection frequency of 5 seconds. Is there any tool that would fill an RRD database with fake data for a given time duration?
I have googled a lot but did not find any such tool.
Please help.
I would recommend the following one-liner:
perl -e 'my $start = time - 30 * 24 * 3600; print join " ","update","my.rrd",(map { ($start+$_*5).":".rand} 0..(30*24*3600/5))' | rrdtool -
This assumes you have an RRD file called my.rrd and that it contains just one data source expecting GAUGE-type data.
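If you do not have my.rrd yet, a matching database could be created roughly like this (a sketch; the data-source name, heartbeat and RRA layout are my own choices, sized for one month of 5-second samples):

rrdtool create my.rrd --step 5 --start now-31d \
  DS:data:GAUGE:10:U:U \
  RRA:AVERAGE:0.5:1:518400

518400 rows of one 5-second step each keep 30 days of primary data points, which matches what the one-liner above feeds in.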