I wrote a simple Pig program, shown below, to analyze a small, modified version of the Google n-grams dataset on AWS. The data looks something like this:
I am 1936 942 90
I am 1945 811 5
I am 1951 47 12
very cool 1923 118 10
very cool 1980 320 100
very cool 2012 994 302
very cool 2017 1820 612
and has the form:
n-gram TAB year TAB occurrences TAB books NEWLINE
I wrote the following program to calculate the occurrences of an n-gram per book:
inp = LOAD <insert input path here> AS (ngram:chararray, year:int, occurences:int, books:int);
filter_input = FILTER inp BY (occurences >= 400) AND (books >= 8);
groupinp = GROUP filter_input BY ngram;
sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(occurences) AS socc , SUM(books) AS nbooks;
DUMP sum_occ;
However, the DUMP command does not work and gives the following error:
892520 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
18/03/28 00:56:09 INFO pigstats.ScriptState: Pig features used in the script: GROUP_BY,FILTER
1892554 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
18/03/28 00:56:09 INFO data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
1892555 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[ConstantCalculator, LoadTypeCastInserter, PredicatePushdownOptimizer, StreamTypeCastInserter], RULES_DISABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter]}
18/03/28 00:56:09 INFO optimizer.LogicalPlanOptimizer: {RULES_ENABLED=[ConstantCalculator, LoadTypeCastInserter, PredicatePushdownOptimizer, StreamTypeCastInserter], RULES_DISABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter]}
1892591 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher - Tez staging directory is /tmp/temp383666093 and resources directory is /tmp/temp383666093
18/03/28 00:56:09 INFO tez.TezLauncher: Tez staging directory is /tmp/temp383666093 and resources directory is /tmp/temp383666093
1892592 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.plan.TezCompiler - File concatenation threshold: 100 optimistic? false
18/03/28 00:56:09 INFO plan.TezCompiler: File concatenation threshold: 100 optimistic? false
1892593 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.AccumulatorOptimizerUtil - Reducer is to run in accumulative mode.
18/03/28 00:56:09 INFO util.AccumulatorOptimizerUtil: Reducer is to run in accumulative mode.
1892606 [main] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
18/03/28 00:56:09 INFO builtin.PigStorage: Using PigTextInputFormat
18/03/28 00:56:09 INFO input.FileInputFormat: Total input files to process : 1
1892626 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
18/03/28 00:56:09 INFO util.MapRedUtil: Total input paths to process : 1
1892627 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
18/03/28 00:56:09 INFO util.MapRedUtil: Total input paths (combined) to process : 1
18/03/28 00:56:09 INFO hadoop.MRInputHelpers: NumSplits: 1, SerializedSize: 408
1892653 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: joda-time-2.9.4.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: joda-time-2.9.4.jar
1892653 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: pig-0.17.0-core-h2.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: pig-0.17.0-core-h2.jar
1892653 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: antlr-runtime-3.4.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: antlr-runtime-3.4.jar
1892653 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: automaton-1.11-8.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: automaton-1.11-8.jar
1892709 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - For vertex - scope-239: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
18/03/28 00:56:09 INFO tez.TezDagBuilder: For vertex - scope-239: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
1892709 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Processing aliases: filter_input,groupinp,inp
18/03/28 00:56:09 INFO tez.TezDagBuilder: Processing aliases: filter_input,groupinp,inp
1892709 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],groupinp[3,11]
18/03/28 00:56:09 INFO tez.TezDagBuilder: Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],groupinp[3,11]
1892709 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Pig features in the vertex:
18/03/28 00:56:09 INFO tez.TezDagBuilder: Pig features in the vertex:
1892744 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Set auto parallelism for vertex scope-240
18/03/28 00:56:09 INFO tez.TezDagBuilder: Set auto parallelism for vertex scope-240
1892744 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - For vertex - scope-240: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
18/03/28 00:56:09 INFO tez.TezDagBuilder: For vertex - scope-240: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
1892744 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Processing aliases: sum_occ
18/03/28 00:56:09 INFO tez.TezDagBuilder: Processing aliases: sum_occ
1892744 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Detailed locations: sum_occ[5,10]
18/03/28 00:56:09 INFO tez.TezDagBuilder: Detailed locations: sum_occ[5,10]
1892745 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Pig features in the vertex: GROUP_BY
18/03/28 00:56:09 INFO tez.TezDagBuilder: Pig features in the vertex: GROUP_BY
1892762 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration.
18/03/28 00:56:09 ERROR grunt.Grunt: ERROR 2017: Internal error creating job configuration.
Details at logfile: /mnt/var/log/pig/pig_1522196676602.log
How do I fix this?
If you are using an old version of Pig, updating it should solve your problem.
Pig scripts are lazily evaluated, so unless you use a DUMP or STORE command you will not know what is wrong with your code.
Once the script actually executes, it will throw the following error:
ERROR 1025: Invalid field projection. Projected field [occurences] does not exist in schema: group:chararray,filter_input:bag{:tuple(ngram:chararray,year:int,occurences:int,books:int)}.
Changing the line
sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(occurences) AS socc , SUM(books) AS nbooks;
to
sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(filter_input.occurences) AS socc, SUM(filter_input.books) AS nbooks;
will solve this error.
I don't have enough reputation to comment, so I am writing it here.
My guess is that you have an unclosed quote.
What do you have in the "insert input path here" part? Is the path enclosed in single quotes?
Not having enough reputation to comment, I am posting here: are you writing the above Pig statements in a script, or running them individually from the Grunt shell? Also, can you briefly explain the logic behind the sum_occ relation?
I'm using GCP Composer2 to schedule PySpark (Structured Streaming) jobs. The PySpark code reads from and writes to Kafka.
The DAG uses these operators: DataprocCreateClusterOperator (creates a GKE cluster), DataprocSubmitJobOperator (runs the PySpark job), and DataprocDeleteClusterOperator (deletes the Dataproc cluster).
In the code below, I'm passing the jars and the files (certs/config files) required to run the PySpark code that reads from and writes to Kafka:
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        "jar_file_uris": [
            "gs://dataproc-spark-jars/mongo-spark-connector_2.12-3.0.2.jar",
            "gs://dataproc-spark-jars/bson-4.0.5.jar", "gs://dataproc-spark-jars/mongo-spark-connector_2.12-3.0.2.jar", "gs://dataproc-spark-jars/mongodb-driver-core-4.0.5.jar",
            "gs://dataproc-spark-jars/mongodb-driver-sync-4.0.5.jar", "gs://dataproc-spark-jars/spark-avro_2.12-3.1.2.jar", "gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar",
            "gs://dataproc-spark-jars/spark-token-provider-kafka-0-10_2.12-3.2.0.jar", "gs://dataproc-spark-jars/htrace-core4-4.1.0-incubating.jar", "gs://dataproc-spark-jars/hadoop-client-3.3.1.jar", "gs://dataproc-spark-jars/spark-sql-kafka-0-10_2.12-3.2.0.jar", "gs://dataproc-spark-jars/hadoop-client-runtime-3.3.1.jar", "gs://dataproc-spark-jars/hadoop-client-3.3.1.jar", "gs://dataproc-spark-configs/kafka-clients-3.2.0.jar"
        ],
        "file_uris": [
            "gs://kafka-certs/versa-kafka-gke-ca.p12", "gs://kafka-certs/syslog-vani.p12",
            "gs://kafka-certs/alarm-compression-user.p12", "gs://kafka-certs/appstats-user.p12",
            "gs://kafka-certs/insights-user.p12", "gs://kafka-certs/intfutil-user.p12",
            "gs://kafka-certs/reloadpred-chkpoint-user.p12", "gs://kafka-certs/reloadpred-user.p12",
            "gs://dataproc-spark-configs/topic-customer-map.cfg", "gs://dataproc-spark-configs/params.cfg", "gs://kafka-certs/issues-user.p12", "gs://kafka-certs/anomaly-user.p12"
        ]
    }
}
path = "gs://dataproc-spark-configs/pip_install.sh"

CLUSTER_GENERATOR_CONFIG = ClusterGenerator(
    project_id=PROJECT_ID,
    zone="us-east1-b",
    master_machine_type="n1-standard-4",
    worker_machine_type="n1-standard-4",
    num_workers=4,
    storage_bucket="dataproc-spark-logs",
    init_actions_uris=[path],
    metadata={'PIP_PACKAGES': 'pyyaml requests pandas openpyxl kafka-python'},
).make()
with models.DAG(
    'UsingComposer2',
    # Continue to run DAG twice per day
    default_args=default_dag_args,
    schedule_interval='0 0/12 * * *',
    catchup=False,
) as dag:
    create_dataproc_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        cluster_name="composer2",
        region=REGION,
        cluster_config=CLUSTER_GENERATOR_CONFIG,
    )
    run_dataproc_spark = DataprocSubmitJobOperator(
        task_id="run_dataproc_spark",
        job=PYSPARK_JOB,
        location=REGION,
        project_id=PROJECT_ID,
    )
    delete_dataproc_cluster = DataprocDeleteClusterOperator(
        task_id="delete_dataproc_cluster",
        project_id=PROJECT_ID,
        cluster_name=CLUSTER_NAME,
        region=REGION,
    )
    create_dataproc_cluster >> run_dataproc_spark >> delete_dataproc_cluster
Question is: how do I pass packages instead of the individual jars for spark-kafka?
When I do a spark-submit, I can pass a package; how do I do the same with Composer/Airflow?
Sample spark-submit command, where I pass the spark-sql-kafka and mongo-spark-connector packages:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 /Users/karanalang/PycharmProjects/Kafka/StructuredStreaming-KafkaConsumer-insignts.py
Thanks in advance!
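In other words, I'm after something along the lines of the sketch below. This is only a rough, unverified sketch: it assumes the Dataproc job's "properties" map forwards standard Spark configuration keys (such as spark.jars.packages) to the underlying spark-submit, and PYSPARK_JOB_WITH_PACKAGES is just an illustrative name.

# Rough sketch only (assumption, not verified): the "properties" map is assumed to
# accept standard Spark configuration keys such as spark.jars.packages.
# PROJECT_ID, CLUSTER_NAME and PYSPARK_URI are the variables already defined above;
# PYSPARK_JOB_WITH_PACKAGES is a hypothetical variant of PYSPARK_JOB.
PYSPARK_JOB_WITH_PACKAGES = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        "properties": {
            # The same comma-separated coordinates that spark-submit --packages receives.
            "spark.jars.packages": (
                "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,"
                "org.mongodb.spark:mongo-spark-connector_2.12:3.0.2"
            ),
        },
    },
}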
Update:
Based on @Anjela B's suggestion, I tried the following, but it does not work.
Changes to PYSPARK_JOB to pass the package:
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        "properties": {  # you can use this field to pass other properties
            "org.apache.spark": "spark-sql-kafka-0-10_2.12:3.1.3",
            "org.mongodb.spark": "mongo-spark-connector_2.12:3.0.2"
        },
        "file_uris": [
            "gs://kafka-certs/versa-kafka-gke-ca.p12", "gs://kafka-certs/syslog-vani.p12",
            "gs://kafka-certs/alarm-compression-user.p12", "gs://kafka-certs/appstats-user.p12",
            "gs://kafka-certs/insights-user.p12", "gs://kafka-certs/intfutil-user.p12",
            "gs://kafka-certs/reloadpred-chkpoint-user.p12", "gs://kafka-certs/reloadpred-user.p12",
            "gs://dataproc-spark-configs/topic-customer-map.cfg", "gs://dataproc-spark-configs/params.cfg", "gs://kafka-certs/issues-user.p12", "gs://kafka-certs/anomaly-user.p12"
        ]
    }
}
Error:
22/06/17 22:57:28 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1655505629376_0004
22/06/17 22:57:29 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at versa-insights2-m/10.142.0.70:8030
22/06/17 22:57:30 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
Traceback (most recent call last):
File "/tmp/8991c714-7036-45ff-b61b-ece54cfffc51/alarm_insights.py", line 442, in <module>
sys.exit(main())
File "/tmp/8991c714-7036-45ff-b61b-ece54cfffc51/alarm_insights.py", line 433, in main
main_proc = insightGen()
File "/tmp/8991c714-7036-45ff-b61b-ece54cfffc51/alarm_insights.py", line 99, in __init__
self.all_DF = self.spark.read \
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 210, in load
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.load.
: java.lang.ClassNotFoundException: Failed to find data source: mongo. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:692)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:746)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:265)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: mongo.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:666)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:666)
at scala.util.Failure.orElse(Try.scala:224)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
... 14 more
You may use the following code to pass the configuration:
import datetime

from airflow import models
from airflow.operators import bash
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# If you are running Airflow in more than one time zone
# see https://airflow.apache.org/docs/apache-airflow/stable/timezone.html
# for best practices
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

PYSPARK_JOB = {
    "pyspark_job": {
        "main_python_file_uri": "gs://<bucket>/20220606.py",  # this field is for .py packages
        "properties": {  # you can use this field to pass other properties
            "org.apache.spark": "spark-sql-kafka-0-10_2.12:3.2.0",
            "org.mongodb.spark": "mongo-spark-connector_2.12:3.0.2"
        },
        "python_file_uris": ["gs://<bucket>/20220606.py"]
    },
    "reference": {
        "project_id": "<project_id>"
    },
    "placement": {
        "cluster_name": "<cluster_name>"
    }
}

REGION = "us-central1"
PROJECT_ID = "<project_id>"

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}

with models.DAG(
        'composer_quickstart',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    # Print the dag_run id from the Airflow logs
    print_dag_run_conf = bash.BashOperator(
        task_id='print_dag_run_conf', bash_command='echo {{ dag_run.id }}')

    run_dataproc_spark = DataprocSubmitJobOperator(
        task_id="run_dataproc_spark",
        job=PYSPARK_JOB,
        location=REGION,
        project_id=PROJECT_ID,
    )

    print_dag_run_conf >> run_dataproc_spark
I followed this PySpark Job Documentation to know which field to use to pass required packages.
Airflow DAG logs:
*** Reading remote log from gs://us-central1-case-20220331-fde8f6be-bucket/logs/composer_quickstart/run_dataproc_spark/2022-06-06T06:53:24.637504+00:00/1.log.
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1033} INFO - Dependencies all met for <TaskInstance: composer_quickstart.run_dataproc_spark manual__2022-06-06T06:53:24.637504+00:00 [queued]>
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1033} INFO - Dependencies all met for <TaskInstance: composer_quickstart.run_dataproc_spark manual__2022-06-06T06:53:24.637504+00:00 [queued]>
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1239} INFO -
--------------------------------------------------------------------------------
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1240} INFO - Starting attempt 1 of 2
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1241} INFO -
--------------------------------------------------------------------------------
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1260} INFO - Executing <Task(DataprocSubmitJobOperator): run_dataproc_spark> on 2022-06-06 06:53:24.637504+00:00
[2022-06-06, 06:53:39 UTC] {standard_task_runner.py:52} INFO - Started process 65510 to run task
[2022-06-06, 06:53:39 UTC] {standard_task_runner.py:76} INFO - Running: ['airflow', 'tasks', 'run', 'composer_quickstart', 'run_dataproc_spark', 'manual__2022-06-06T06:53:24.637504+00:00', '--job-id', '21439', '--raw', '--subdir', 'DAGS_FOLDER/20220606_1.py', '--cfg-path', '/tmp/tmp7p1eyqqm', '--error-file', '/tmp/tmpdr2m4rwe']
[2022-06-06, 06:53:39 UTC] {standard_task_runner.py:77} INFO - Job 21439: Subtask run_dataproc_spark
[2022-06-06, 06:53:41 UTC] {logging_mixin.py:109} INFO - Running <TaskInstance: composer_quickstart.run_dataproc_spark manual__2022-06-06T06:53:24.637504+00:00 [running]> on host airflow-worker-7b5f8fc749-pd8f9
[2022-06-06, 06:53:44 UTC] {taskinstance.py:1426} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_EMAIL=
AIRFLOW_CTX_DAG_OWNER=Composer Example
AIRFLOW_CTX_DAG_ID=composer_quickstart
AIRFLOW_CTX_TASK_ID=run_dataproc_spark
AIRFLOW_CTX_EXECUTION_DATE=2022-06-06T06:53:24.637504+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-06-06T06:53:24.637504+00:00
[2022-06-06, 06:53:44 UTC] {dataproc.py:1878} INFO - Submitting job
[2022-06-06, 06:53:44 UTC] {credentials_provider.py:312} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2022-06-06, 06:53:45 UTC] {dataproc.py:1890} INFO - Job e7e800e7-fbfd-45e0-8021-eca4e2a7a377 submitted successfully.
[2022-06-06, 06:53:45 UTC] {dataproc.py:1903} INFO - Waiting for job e7e800e7-fbfd-45e0-8021-eca4e2a7a377 to complete
[2022-06-06, 06:54:16 UTC] {dataproc.py:1907} INFO - Job e7e800e7-fbfd-45e0-8021-eca4e2a7a377 completed successfully.
[2022-06-06, 06:54:16 UTC] {taskinstance.py:1268} INFO - Marking task as SUCCESS. dag_id=composer_quickstart, task_id=run_dataproc_spark, execution_date=20220606T065324, start_date=20220606T065339, end_date=20220606T065416
[2022-06-06, 06:54:16 UTC] {local_task_job.py:154} INFO - Task exited with return code 0
[2022-06-06, 06:54:16 UTC] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check
Submitted Job: (screenshot omitted)
I'm trying to connect to the Google NGrams dataset on AWS in EMR (https://aws.amazon.com/datasets/google-books-ngrams/). However, when I try to load the data using Pig, I get a lot of error messages and no real data, likely because the file in the S3 bucket referenced in the above link is encoded. Is there a way to access it directly from Pig and apply the proper conversions to make it readable?
I've tried loading the data and then using LIMIT to dump the first few rows; however, I got several errors and a lot of random characters and boxes.
These are the commands I've tried to load the data:
trigrams = LOAD 's3://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/3gram/data' AS (trigram:chararray, year:int, occurrences:float, pages:float, books:float);
out = LIMIT trigrams 10;
I expected to get the data output in the below format
n-gram TAB year TAB occurrences TAB pages TAB books
However, all I get are the following error messages, and I'm not able to analyze the data.
268988 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
19/09/04 01:48:04 INFO pigstats.ScriptState: Pig features used in the script: LIMIT
269024 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
19/09/04 01:48:04 INFO data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
269047 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
19/09/04 01:48:04 INFO optimizer.LogicalPlanOptimizer: {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
269103 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
19/09/04 01:48:04 INFO util.SpillableMemoryManager: Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
19/09/04 01:48:04 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/04 01:48:04 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/04 01:48:04 INFO output.DirectFileOutputCommitter: Direct Write: DISABLED
269186 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
19/09/04 01:48:04 INFO data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
269242 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
19/09/04 01:48:05 WARN data.SchemaTupleBackend: SchemaTupleBackend has already been initialized
269245 [main] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
19/09/04 01:48:05 INFO builtin.PigStorage: Using PigTextInputFormat
19/09/04 01:48:05 INFO input.FileInputFormat: Total input files to process : 1
269252 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
19/09/04 01:48:05 INFO util.MapRedUtil: Total input paths to process : 1
19/09/04 01:48:05 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
19/09/04 01:48:05 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 7e6c862e89bc8db32c064454a55af74ddff73bae]
19/09/04 01:48:05 INFO s3n.S3NativeFileSystem: Opening 's3://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/3gram/data' for reading
19/09/04 01:48:05 INFO output.FileOutputCommitter: Saved output of task 'attempt__0001_m_000001_1' to hdfs://ip-172-31-24-80.ec2.internal:8020/tmp/temp1150533356/tmp1066986243/_temporary/0/task__0001_m_000001
269523 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
19/09/04 01:48:05 WARN data.SchemaTupleBackend: SchemaTupleBackend has already been initialized
19/09/04 01:48:05 INFO input.FileInputFormat: Total input files to process : 1
269531 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
19/09/04 01:48:05 INFO util.MapRedUtil: Total input paths to process : 1
(SEQ!org.apache.hadoop.io.LongWritableorg.apache.hadoop.io.Text#com.hadoop.compression.lzo.LzoCodec�+Gz2rF?��n`�m�������+Gz2rF?��n`�m�֎~� ��|y��hx�������,,,,)
(…the remaining output is several more tuples of unreadable binary/LZO-compressed data…)
Any help in solving this problem would be greatly appreciated!
The input file is in a sequence file format.
The default Pig loader is PigStorage(), which is text-based.
If it's a straightforward sequence file with no custom Writable objects, using SequenceFileLoader might work:
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
...
trigrams = LOAD 's3://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/3gram/data' using SequenceFileLoader AS (trigram:chararray, year:int, occurrences:float, pages:float, books:float);
...
I have created the following Citrus test cases to test a basic connection between a REST client and server:
@Test
@CitrusTest
fun httpActionTest() {
    variable("username", "user")
    variable("password", "password")

    http().client("httpClient")
        .send()
        .post("/api/authenticate")
        .messageType(MessageType.JSON)
        .contentType("application/json")
        .payload("{ \"username\": \"\${username}\", \"password\": \"\${password}\"}");

    http().client("httpClient")
        .receive()
        .response(HttpStatus.OK)
        .validate("$.token", "asasasasas")
}
@CitrusTest
fun httpServerActionTest() {
    http().server("httpServer")
        .receive()
        .post("/api/authenticate")
        .payload("{ \"username\": \"\${username}\", \"password\": \"\${password}\"}")
        .contentType("application/json")
        .accept("application/json")
        .extractFromPayload("username", "username")
        .extractFromPayload("password", "password")
        .validate("$.username", "user")
        .validate("$.password", "pass")

    http().server("httpServer")
        .send()
        .response(HttpStatus.OK)
        .payload("{\"token\": \"lsdkfjkh8sdfg98zsd\"}")
        .version("HTTP/1.1")
        .contentType("application/json")
}
I have defined the server and client endpoints in citrux-context.xml as follows:
<citrus-http:client id="httpClient"
                    request-url="http://localhost:8080"
                    request-method="GET"
                    content-type="application/json"
                    charset="UTF-8"
                    timeout="60000"/>

<citrus-http:server id="httpServer"
                    port="8080"
                    auto-start="true"
                    resource-base="src/test/resources"/>
While executing via IntelliJ, the following logs are observed:
INFO: Loading XML bean definitions from URL [file:/home/jass/intersales/jk-magento/magento2-auth-service/target/test-classes/citrus-context.xml]
[main] INFO org.eclipse.jetty.util.log - Logging initialized #9851ms to org.eclipse.jetty.util.log.Slf4jLog
[main] INFO org.eclipse.jetty.server.Server - jetty-9.4.6.v20170531
[main] INFO org.eclipse.jetty.server.handler.ContextHandler.ROOT - Initializing Spring FrameworkServlet 'httpServer-servlet'
Oct 23, 2017 8:49:45 AM com.consol.citrus.http.servlet.CitrusDispatcherServlet initServletBean
INFO: FrameworkServlet 'httpServer-servlet': initialization started
Oct 23, 2017 8:49:45 AM org.springframework.web.context.support.XmlWebApplicationContext prepareRefresh
INFO: Refreshing WebApplicationContext for namespace 'httpServer-servlet-servlet': startup date [Mon Oct 23 08:49:45 CEST 2017]; root of context hierarchy
Oct 23, 2017 8:49:45 AM org.springframework.beans.factory.xml.XmlBeanDefinitionReader loadBeanDefinitions
INFO: Loading XML bean definitions from class path resource [com/consol/citrus/http/citrus-servlet-context.xml]
Oct 23, 2017 8:49:46 AM org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerMapping register
...
INFO: Looking for #ControllerAdvice: WebApplicationContext for namespace 'httpServer-servlet-servlet': startup date [Mon Oct 23 08:49:45 CEST 2017]; root of context hierarchy
Oct 23, 2017 8:49:47 AM com.consol.citrus.http.servlet.CitrusDispatcherServlet initServletBean
INFO: FrameworkServlet 'httpServer-servlet': initialization completed in 1570 ms
[main] INFO org.eclipse.jetty.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler#1bb1fde8{/,file:///home/jass/intersales/jk-magento/magento2-auth-service/src/test/resources/,AVAILABLE}
[main] INFO org.eclipse.jetty.server.AbstractConnector - Started ServerConnector#1286528d{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
[main] INFO org.eclipse.jetty.server.Server - Started #12166ms
[main] INFO com.consol.citrus.http.server.HttpServer - Started server: httpServer
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus - ------------------------------------------------------------------------
[main] INFO com.consol.citrus.Citrus - .__ __
[main] INFO com.consol.citrus.Citrus - ____ |__|/ |________ __ __ ______
[main] INFO com.consol.citrus.Citrus - _/ ___\| \ __\_ __ \ | \/ ___/
[main] INFO com.consol.citrus.Citrus - \ \___| || | | | \/ | /\___ \
[main] INFO com.consol.citrus.Citrus - \___ >__||__| |__| |____//____ >
[main] INFO com.consol.citrus.Citrus - \/ \/
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus - C I T R U S T E S T S 2.7.2
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus - ------------------------------------------------------------------------
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus - BEFORE TEST SUITE: SUCCESS
[main] INFO com.consol.citrus.Citrus - ------------------------------------------------------------------------
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.actions.EchoAction - Today is: 23.10.2017
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus - TEST SUCCESS VerticleCitrusTest.echoToday (de.intersales.qbus2)
[main] INFO com.consol.citrus.Citrus - ------------------------------------------------------------------------
[main] INFO com.consol.citrus.Citrus -
[qtp191568263-12] INFO com.consol.citrus.channel.ChannelSyncProducer - Message was sent to channel: 'httpServer.inbound'
[qtp191568263-12] WARN com.consol.citrus.channel.ChannelEndpointAdapter - Reply timed out after 1000ms. Did not receive reply message on reply channel
[main] INFO com.consol.citrus.http.client.HttpClient - HTTP message was sent to endpoint: 'http://localhost:8080/magento2/authenticate'
[main] INFO com.consol.citrus.validation.xml.DomXmlMessageValidator - XML message validation successful: All values OK
[main] INFO com.consol.citrus.validation.DefaultMessageHeaderValidator - Message header validation successful: All values OK
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus - TEST SUCCESS VerticleCitrusTest.httpActionTest (de.intersales.qbus2)
[main] INFO com.consol.citrus.Citrus - ------------------------------------------------------------------------
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus - ------------------------------------------------------------------------
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus - AFTER TEST SUITE: SUCCESS
[main] INFO com.consol.citrus.Citrus - ------------------------------------------------------------------------
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus - ------------------------------------------------------------------------
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus - CITRUS TEST RESULTS
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus - VerticleCitrusTest.echoToday ................................... SUCCESS
[main] INFO com.consol.citrus.Citrus - VerticleCitrusTest.httpActionTest .............................. SUCCESS
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus - TOTAL: 2
[main] INFO com.consol.citrus.Citrus - FAILED: 0 (0.0%)
[main] INFO com.consol.citrus.Citrus - SUCCESS: 2 (100.0%)
[main] INFO com.consol.citrus.Citrus -
[main] INFO com.consol.citrus.Citrus - ------------------------------------------------------------------------
[main] INFO com.consol.citrus.report.HtmlReporter - Generated HTML test report
But when executing via mvn clean verify, I get the following error:
[main] ERROR com.consol.citrus.Citrus - TEST FAILED VerticleCitrusTest.httpActionTest <de.intersales.qbus2> Nested exception is:
org.springframework.beans.factory.NoSuchBeanDefinitionException: No bean named 'httpClient' available
at org.springframework.beans.factory.support.DefaultListableBeanFactory.getBeanDefinition(DefaultListableBeanFactory.java:687)
...
Any suggestions or help would be greatly appreciated.
EDIT: The following is my project structure:
[Placement of resources]: https://i.stack.imgur.com/aVabX.png
I see multiple issues in your code and setup. First of all, httpServerActionTest() is missing the @Test annotation. If it is not placed at class level, this annotation needs to be repeated on each method in your test class.
Secondly, the overall test structure does not make much sense to me. In httpActionTest() you send a client request to the server, while in httpServerActionTest() you receive that very same request as a server and validate its contents with Citrus. Your test is both client and server at the same time, which feels wrong to me. In particular, this setup can never work, because HTTP is a synchronous protocol by nature and httpActionTest() cannot succeed without httpServerActionTest() running; you will get timeout exceptions on the client side. It would only work if both methods were executed in parallel.
Regarding the Maven failure: citrux-context.xml is misspelled (citrux vs. citrus). It also seems to me that the file is not properly added to the Maven project as a resource. Did you keep the default Maven directory layout?
Once again, the purpose of the overall test setup is not clear to me.
Installation details:
Pig Version: 0.16
Hadoop: 2.7.3
pig -h gives me results as expected.
I have tried ant clean jar-all -Dhadoopversion=23, but it didn't help.
My Hadoop installation folder is /usr/local/bin/hadoop-2.7.3/.
bashrc file:
export PIG_HOME="/usr/local/bin/pig/pig-0.16.0"
export PIG_CONF_DIR="$PIG_HOME/conf"
export PIG_CLASSPATH="/usr/local/bin/hadoop-2.7.3/etc/hadoop/"
export PATH=$PATH:$PIG_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/bin/pig/lib/*:.
Program:
log = LOAD '/home/dhaval/Desktop/excite-small.log' AS (user:chararray, time:long, query:chararray);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
DUMP cntd;
Error:
2017-04-20 23:38:39,761 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-20 23:38:39,831 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY
2017-04-20 23:38:39,897 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2017-04-20 23:38:39,898 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2017-04-20 23:38:39,926 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2017-04-20 23:38:39,995 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2017-04-20 23:38:40,063 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2017-04-20 23:38:40,078 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.CombinerOptimizerUtil - Choosing to move algebraic foreach to combiner
2017-04-20 23:38:40,107 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2017-04-20 23:38:40,107 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2017-04-20 23:38:40,139 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2017-04-20 23:38:40,140 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2017-04-20 23:38:40,148 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - session.id is deprecated. Instead, use dfs.metrics.session-id
2017-04-20 23:38:40,149 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
2017-04-20 23:38:40,174 [main] WARN org.apache.pig.backend.hadoop20.PigJobControl - falling back to default JobControl (not using hadoop 0.20 ?)
java.lang.NoSuchFieldException: runnerState
at java.lang.Class.getDeclaredField(Class.java:2070)
at org.apache.pig.backend.hadoop20.PigJobControl.<clinit>(PigJobControl.java:51)
at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.newJobControl(HadoopShims.java:109)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:314)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:196)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:308)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1474)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1459)
at org.apache.pig.PigServer.storeEx(PigServer.java:1118)
at org.apache.pig.PigServer.store(PigServer.java:1081)
at org.apache.pig.PigServer.openIterator(PigServer.java:994)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:747)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:231)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:206)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:564)
at org.apache.pig.Main.main(Main.java:176)
2017-04-20 23:38:40,177 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2017-04-20 23:38:40,183 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2017-04-20 23:38:40,183 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2017-04-20 23:38:40,183 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2017-04-20 23:38:40,184 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
2017-04-20 23:38:40,185 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2017-04-20 23:38:40,190 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=208348
2017-04-20 23:38:40,190 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2017-04-20 23:38:40,190 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2017-04-20 23:38:40,201 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2017-04-20 23:38:40,207 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2017-04-20 23:38:40,207 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2017-04-20 23:38:40,207 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Distributed cache not supported or needed in local mode. Setting key [pig.schematuple.local.dir] with code temp directory: /tmp/1492745920207-0
2017-04-20 23:38:40,285 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2017-04-20 23:38:40,285 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker.http.address is deprecated. Instead, use mapreduce.jobtracker.http.address
2017-04-20 23:38:40,294 [JobControl] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2017-04-20 23:38:40,302 [JobControl] ERROR org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl - Error while trying to run jobs.
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setupUdfEnvAndStores(PigOutputFormat.java:243)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecs(PigOutputFormat.java:191)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:458)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:343)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.run(JobControl.java:240)
at org.apache.pig.backend.hadoop20.PigJobControl.run(PigJobControl.java:121)
at java.lang.Thread.run(Thread.java:745)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
2017-04-20 23:38:40,302 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2017-04-20 23:38:40,309 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2017-04-20 23:38:40,309 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job null has failed! Stop running all dependent jobs
2017-04-20 23:38:40,309 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2017-04-20 23:38:40,310 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Could not write to log file: /log/path :/log/path (No such file or directory)
2017-04-20 23:38:40,310 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Unexpected System Error Occured: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setupUdfEnvAndStores(PigOutputFormat.java:243)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecs(PigOutputFormat.java:191)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:458)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:343)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.run(JobControl.java:240)
at org.apache.pig.backend.hadoop20.PigJobControl.run(PigJobControl.java:121)
at java.lang.Thread.run(Thread.java:745)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
2017-04-20 23:38:40,311 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2017-04-20 23:38:40,313 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.5.1 0.16.0 dhaval 2017-04-20 23:38:40 2017-04-20 23:38:40 GROUP_BY
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
N/A cntd,grpd,log GROUP_BY,COMBINER Message: Unexpected System Error Occured: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setupUdfEnvAndStores(PigOutputFormat.java:243)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecs(PigOutputFormat.java:191)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:458)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:343)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.run(JobControl.java:240)
at org.apache.pig.backend.hadoop20.PigJobControl.run(PigJobControl.java:121)
at java.lang.Thread.run(Thread.java:745)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
file:/tmp/temp1942265384/tmp-1728388493,
Input(s):
Failed to read data from "/home/dhaval/Desktop/excite-small.log"
Output(s):
Failed to produce result in "file:/tmp/temp1942265384/tmp-1728388493"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
null
2017-04-20 23:38:40,314 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2017-04-20 23:38:40,317 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias cntd
2017-04-20 23:38:40,317 [main] WARN org.apache.pig.tools.grunt.Grunt - Could not write to log file: /log/path :/log/path (No such file or directory)
2017-04-20 23:38:40,317 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias cntd
at org.apache.pig.PigServer.openIterator(PigServer.java:1019)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:747)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:231)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:206)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:564)
at org.apache.pig.Main.main(Main.java:176)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
at org.apache.pig.PigServer.openIterator(PigServer.java:1011)
... 7 more
The problem is a compatibility issue between Pig 0.16 and Hadoop 2.x.
You can check which Pig release is compatible with Hadoop 2.7.3 here:
https://pig.apache.org/releases.html#19+June%2C+2017%3A+release+0.17.0+available
That means you should use Pig 0.17 with Hadoop versions starting from 2.7.3.
I know it's an old question, but someone facing the same issue may benefit from this.
I am trying to push data to AWS S3. I used the example in http://druid.io/docs/0.7.0/Tutorial:-The-Druid-Cluster.html but modified common.runtime.properties as below:
druid.storage.type=s3
druid.s3.accessKey=AKIAJWTETHZDEQLHQ7AQ
druid.s3.secretKey=tcTtvGXcqLmmMbo2hRunzlSA1P2X0O0bjVf537Nt
druid.storage.bucket=testfeed
druid.storage.baseKey=sample
Below are the logs for the realtime node:
2015-03-02T15:03:44,809 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.query.QueryConfig] from props[druid.query.] as [io.druid.query.QueryConfig#2edcd9d]
2015-03-02T15:03:44,843 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.query.search.search.SearchQueryConfig] from props[druid.query.search.] as [io.druid.query.search.search.SearchQueryConfig#7939de8b]
2015-03-02T15:03:44,861 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.query.groupby.GroupByQueryConfig] from props[druid.query.groupBy.] as [io.druid.query.groupby.GroupByQueryConfig#bea8209]
2015-03-02T15:03:44,874 INFO [main] org.skife.config.ConfigurationObjectFactory - Assigning value [100000000] for [druid.processing.buffer.sizeBytes] on [io.druid.query.DruidProcessingConfig#intermediateComputeSizeBytes()]
2015-03-02T15:03:44,878 INFO [main] org.skife.config.ConfigurationObjectFactory - Assigning value [2] for [druid.processing.numThreads] on [io.druid.query.DruidProcessingConfig#getNumThreads()]
2015-03-02T15:03:44,878 INFO [main] org.skife.config.ConfigurationObjectFactory - Using method itself for [${base_path}.columnCache.sizeBytes] on [io.druid.query.DruidProcessingConfig#columnCacheSizeBytes()]
2015-03-02T15:03:44,880 INFO [main] org.skife.config.ConfigurationObjectFactory - Assigning default value [processing-%s] for [${base_path}.formatString] on [com.metamx.common.concurrent.ExecutorServiceConfig#getFormatString()]
2015-03-02T15:03:44,956 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.query.topn.TopNQueryConfig] from props[druid.query.topN.] as [io.druid.query.topn.TopNQueryConfig#276503c4]
2015-03-02T15:03:44,960 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.segment.loading.LocalDataSegmentPusherConfig] from props[druid.storage.] as [io.druid.segment.loading.LocalDataSegmentPusherConfig#360548eb]
2015-03-02T15:03:44,967 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.client.DruidServerConfig] from props[druid.server.] as [io.druid.client.DruidServerConfig#75ba7964]
2015-03-02T15:03:44,971 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.server.initialization.BatchDataSegmentAnnouncerConfig] from props[druid.announcer.] as [io.druid.server.initialization.BatchDataSegmentAnnouncerConfig#1ff2a544]
2015-03-02T15:03:44,984 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.server.initialization.ZkPathsConfig] from props[druid.zk.paths.] as [io.druid.server.initialization.ZkPathsConfig#58d3f4be]
2015-03-02T15:03:44,990 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.curator.CuratorConfig] from props[druid.zk.service.] as [io.druid.curator.CuratorConfig#5fd11499]
I found the issue: I had missed the S3 extension in common.runtime.properties. Once that was added, data started getting pushed to S3.
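For reference, pulling in the S3 extension on Druid 0.7.x meant adding an extensions line alongside the druid.storage.* properties shown above. A minimal sketch from memory (the exact coordinate and version string are illustrative and not verified against this particular setup):

druid.extensions.coordinates=["io.druid.extensions:druid-s3-extensions:0.7.0"]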