Unable to use Flume to extract files into HDFS

I am using flume-ng-1.2.0 with cdh3u5. I am simply trying to extract data from a text file and put it into HDFS.
Here is the configuration I am using:
agent1.sources = tail1
agent1.channels = Channel-2
agent1.sinks = HDFS
agent1.sources.tail1.type = exec
agent1.sources.tail1.command = tail -F /usr/games/sample1.txt
agent1.sources.tail1.channels = Channel-2
agent1.sinks.HDFS.channel = Channel-2
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.path = hdfs://10.12.1.2:8020/user/hdfs/flume
agent1.sinks.HDFS.hdfs.fileType = DataStream
agent1.channels.Channel-2.type = memory
agent1.channels.Channel-2.capacity = 1000
and I am running the agent with: bin/flume-ng agent -n agent1 -c ./conf/ -f conf/flume.conf
The logs I am getting are:
2012-10-11 12:10:36,626 INFO lifecycle.LifecycleSupervisor: Starting lifecycle supervisor 1
2012-10-11 12:10:36,631 INFO node.FlumeNode: Flume node starting - agent1
2012-10-11 12:10:36,639 INFO nodemanager.DefaultLogicalNodeManager: Node manager starting
2012-10-11 12:10:36,639 INFO lifecycle.LifecycleSupervisor: Starting lifecycle supervisor 12
2012-10-11 12:10:36,641 INFO properties.PropertiesFileConfigurationProvider: Configuration provider starting
2012-10-11 12:10:36,646 INFO properties.PropertiesFileConfigurationProvider: Reloading configuration file:conf/flume.conf
2012-10-11 12:10:36,657 INFO conf.FlumeConfiguration: Processing:HDFS
2012-10-11 12:10:36,670 INFO conf.FlumeConfiguration: Processing:HDFS
2012-10-11 12:10:36,670 INFO conf.FlumeConfiguration: Processing:HDFS
2012-10-11 12:10:36,670 INFO conf.FlumeConfiguration: Processing:HDFS
2012-10-11 12:10:36,671 INFO conf.FlumeConfiguration: Added sinks: HDFS Agent: agent1
2012-10-11 12:10:36,758 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [agent1]
2012-10-11 12:10:36,758 INFO properties.PropertiesFileConfigurationProvider: Creating channels
2012-10-11 12:10:36,800 INFO instrumentation.MonitoredCounterGroup: Monitoried counter group for type: CHANNEL, name: Channel-2, registered successfully.
2012-10-11 12:10:36,800 INFO properties.PropertiesFileConfigurationProvider: created channel Channel-2
2012-10-11 12:10:36,835 INFO sink.DefaultSinkFactory: Creating instance of sink: HDFS, type: hdfs
2012-10-11 12:10:37,753 INFO hdfs.HDFSEventSink: Hadoop Security enabled: false
2012-10-11 12:10:37,896 INFO instrumentation.MonitoredCounterGroup: Monitoried counter group for type: SINK, name: HDFS, registered successfully.
2012-10-11 12:10:37,899 INFO nodemanager.DefaultLogicalNodeManager: Starting new configuration:{ sourceRunners:{tail1=EventDrivenSourceRunner: { source:org.apache.flume.source.ExecSource#362f0d54 }} sinkRunners:{HDFS=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor#4b142196 counterGroup:{ name:null counters:{} } }} channels:{Channel-2=org.apache.flume.channel.MemoryChannel#16a9255c} }
2012-10-11 12:10:37,900 INFO nodemanager.DefaultLogicalNodeManager: Starting Channel Channel-2
2012-10-11 12:10:37,901 INFO instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: Channel-2 started
2012-10-11 12:10:37,901 INFO nodemanager.DefaultLogicalNodeManager: Starting Sink HDFS
2012-10-11 12:10:37,905 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: HDFS started
2012-10-11 12:10:37,910 INFO nodemanager.DefaultLogicalNodeManager: Starting Source tail1
2012-10-11 12:10:37,912 INFO source.ExecSource: Exec source starting with command:tail -F /usr/games/sample1.txt
I don't know where I am making a mistake. I am a beginner; I am not getting anything in HDFS and the Flume agent just keeps running. Any suggestions and corrections would be very helpful, thanks.

One issue is that you have set agent1.sinks.HDFS.hdfs.file.Type = DataStream but the property is hdfs.fileType -- see https://flume.apache.org/FlumeUserGuide.html#hdfs-sink for more info.
I would try with a logger sink -- sink.type = logger -- just to see if anything comes through. Also make sure you're getting something when you run that tail -F command from your shell.
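For example, a quick test configuration could look like the sketch below. It reuses the source and channel from your config and simply swaps the HDFS sink for a logger sink; treat it as a sketch, not a drop-in file:
# Sketch: same exec source and memory channel as in the question,
# but events go to a logger sink so they show up in the agent's log.
agent1.sources = tail1
agent1.channels = Channel-2
agent1.sinks = LoggerSink

agent1.sources.tail1.type = exec
agent1.sources.tail1.command = tail -F /usr/games/sample1.txt
agent1.sources.tail1.channels = Channel-2

agent1.sinks.LoggerSink.type = logger
agent1.sinks.LoggerSink.channel = Channel-2

agent1.channels.Channel-2.type = memory
agent1.channels.Channel-2.capacity = 1000
If events appear in the agent log with this setup, the problem is on the HDFS sink side rather than the source side.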
One more thing, which may be a red herring: there is a backtick (`) at the end of your log message. Maybe that was a paste error, but if it is actually in your config file, I wouldn't be surprised if it caused trouble. The message I am referring to is the last line of your log:
Exec source starting with command:tail -F /usr/games/value.txt`

Related

GCP Composer 2 (Airflow 2) Dataproc operators - pass package to PYSPARK_JOB

I'm using GCP Composer 2 to schedule PySpark (Structured Streaming) jobs. The PySpark code reads from and writes to Kafka.
The DAG uses the operators DataprocCreateClusterOperator (creates a GKE cluster), DataprocSubmitJobOperator (runs the PySpark job), and DataprocDeleteClusterOperator (deletes the Dataproc cluster).
In the code below, I'm passing the jars and the files (certs/config files) required to run the PySpark code that reads from and writes to Kafka:
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        "jar_file_uris" : ["gs://dataproc-spark-jars/mongo-spark-connector_2.12-3.0.2.jar",
            'gs://dataproc-spark-jars/bson-4.0.5.jar','gs://dataproc-spark-jars/mongo-spark-connector_2.12-3.0.2.jar','gs://dataproc-spark-jars/mongodb-driver-core-4.0.5.jar',
            'gs://dataproc-spark-jars/mongodb-driver-sync-4.0.5.jar','gs://dataproc-spark-jars/spark-avro_2.12-3.1.2.jar','gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar',
            'gs://dataproc-spark-jars/spark-token-provider-kafka-0-10_2.12-3.2.0.jar','gs://dataproc-spark-jars/htrace-core4-4.1.0-incubating.jar','gs://dataproc-spark-jars/hadoop-client-3.3.1.jar','gs://dataproc-spark-jars/spark-sql-kafka-0-10_2.12-3.2.0.jar','gs://dataproc-spark-jars/hadoop-client-runtime-3.3.1.jar','gs://dataproc-spark-jars/hadoop-client-3.3.1.jar','gs://dataproc-spark-configs/kafka-clients-3.2.0.jar'],
        "file_uris": ['gs://kafka-certs/versa-kafka-gke-ca.p12','gs://kafka-certs/syslog-vani.p12',
            'gs://kafka-certs/alarm-compression-user.p12','gs://kafka-certs/appstats-user.p12',
            'gs://kafka-certs/insights-user.p12','gs://kafka-certs/intfutil-user.p12',
            'gs://kafka-certs/reloadpred-chkpoint-user.p12','gs://kafka-certs/reloadpred-user.p12',
            'gs://dataproc-spark-configs/topic-customer-map.cfg','gs://dataproc-spark-configs/params.cfg','gs://kafka-certs/issues-user.p12','gs://kafka-certs/anomaly-user.p12']
    }
}
path = "gs://dataproc-spark-configs/pip_install.sh"
CLUSTER_GENERATOR_CONFIG = ClusterGenerator(
    project_id=PROJECT_ID,
    zone="us-east1-b",
    master_machine_type="n1-standard-4",
    worker_machine_type="n1-standard-4",
    num_workers=4,
    storage_bucket="dataproc-spark-logs",
    init_actions_uris=[path],
    metadata={'PIP_PACKAGES': 'pyyaml requests pandas openpyxl kafka-python'},
).make()
with models.DAG(
    'UsingComposer2',
    # Continue to run DAG twice per day
    default_args=default_dag_args,
    schedule_interval='0 0/12 * * *',
    catchup=False,
) as dag:
    create_dataproc_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        cluster_name="composer2",
        region=REGION,
        cluster_config=CLUSTER_GENERATOR_CONFIG
    )
    run_dataproc_spark = DataprocSubmitJobOperator(
        task_id="run_dataproc_spark",
        job=PYSPARK_JOB,
        location=REGION,
        project_id=PROJECT_ID,
    )
    delete_dataproc_cluster = DataprocDeleteClusterOperator(
        task_id="delete_dataproc_cluster",
        project_id=PROJECT_ID,
        cluster_name=CLUSTER_NAME,
        region=REGION
    )
    create_dataproc_cluster >> run_dataproc_spark >> delete_dataproc_cluster
The question is: how do I pass a package instead of the individual jars for spark-kafka?
When I do a spark-submit I can pass a package; how do I do the same with Composer/Airflow?
Here is a sample spark-submit command where I pass the spark-sql-kafka and mongo-spark-connector packages:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 /Users/karanalang/PycharmProjects/Kafka/StructuredStreaming-KafkaConsumer-insignts.py
tia!
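For reference: with spark-submit, --packages corresponds to the Spark property spark.jars.packages, so one commonly used way to express the same thing in a Dataproc job spec is through the job's properties map. A minimal sketch, reusing the variable names from the DAG above and not verified against this cluster:
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        # spark.jars.packages takes the same comma-separated Maven coordinates
        # that spark-submit --packages accepts
        "properties": {
            "spark.jars.packages":
                "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,"
                "org.mongodb.spark:mongo-spark-connector_2.12:3.0.2"
        },
    },
}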
Update:
Based on @Anjela B's suggestion, I tried the following, but it does not work.
Changes to PYSPARK_JOB, to pass the package:
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        "properties": {  # you can use this field to pass other properties
            "org.apache.spark": "spark-sql-kafka-0-10_2.12:3.1.3",
            "org.mongodb.spark": "mongo-spark-connector_2.12:3.0.2"
        },
        "file_uris": ['gs://kafka-certs/versa-kafka-gke-ca.p12','gs://kafka-certs/syslog-vani.p12',
            'gs://kafka-certs/alarm-compression-user.p12','gs://kafka-certs/appstats-user.p12',
            'gs://kafka-certs/insights-user.p12','gs://kafka-certs/intfutil-user.p12',
            'gs://kafka-certs/reloadpred-chkpoint-user.p12','gs://kafka-certs/reloadpred-user.p12',
            'gs://dataproc-spark-configs/topic-customer-map.cfg','gs://dataproc-spark-configs/params.cfg','gs://kafka-certs/issues-user.p12','gs://kafka-certs/anomaly-user.p12']
    }
}
Error :
22/06/17 22:57:28 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1655505629376_0004
22/06/17 22:57:29 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at versa-insights2-m/10.142.0.70:8030
22/06/17 22:57:30 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
Traceback (most recent call last):
File "/tmp/8991c714-7036-45ff-b61b-ece54cfffc51/alarm_insights.py", line 442, in <module>
sys.exit(main())
File "/tmp/8991c714-7036-45ff-b61b-ece54cfffc51/alarm_insights.py", line 433, in main
main_proc = insightGen()
File "/tmp/8991c714-7036-45ff-b61b-ece54cfffc51/alarm_insights.py", line 99, in __init__
self.all_DF = self.spark.read \
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 210, in load
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.load.
: java.lang.ClassNotFoundException: Failed to find data source: mongo. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:692)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:746)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:265)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: mongo.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:666)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:666)
at scala.util.Failure.orElse(Try.scala:224)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
... 14 more
You may use the following code to pass the configuration:
import datetime
from airflow import models
from airflow.operators import bash
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
# If you are running Airflow in more than one time zone
# see https://airflow.apache.org/docs/apache-airflow/stable/timezone.html
# for best practices
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
PYSPARK_JOB = {
    "pyspark_job": {
        "main_python_file_uri":
            "gs://<bucket>/20220606.py",  # this field is for .py packages
        "properties": {  # you can use this field to pass other properties
            "org.apache.spark": "spark-sql-kafka-0-10_2.12:3.2.0",
            "org.mongodb.spark": "mongo-spark-connector_2.12:3.0.2"
        },
        "python_file_uris": ["gs://<bucket>/20220606.py"]
    },
    "reference": {
        "project_id": "<project_id>"
    },
    "placement": {
        "cluster_name": "<cluster_name>"
    }
}
REGION = "us-central1"
PROJECT_ID = "<project_id>"
default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}
with models.DAG(
        'composer_quickstart',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:
    # Print the dag_run id from the Airflow logs
    print_dag_run_conf = bash.BashOperator(
        task_id='print_dag_run_conf', bash_command='echo {{ dag_run.id }}')
    run_dataproc_spark = DataprocSubmitJobOperator(
        task_id="run_dataproc_spark",
        job=PYSPARK_JOB,
        location=REGION,
        project_id=PROJECT_ID,
    )
    print_dag_run_conf >> run_dataproc_spark
I followed this PySpark Job Documentation to work out which field to use to pass the required packages.
AirFlow DAG logs:
*** Reading remote log from gs://us-central1-case-20220331-fde8f6be-bucket/logs/composer_quickstart/run_dataproc_spark/2022-06-06T06:53:24.637504+00:00/1.log.
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1033} INFO - Dependencies all met for <TaskInstance: composer_quickstart.run_dataproc_spark manual__2022-06-06T06:53:24.637504+00:00 [queued]>
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1033} INFO - Dependencies all met for <TaskInstance: composer_quickstart.run_dataproc_spark manual__2022-06-06T06:53:24.637504+00:00 [queued]>
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1239} INFO -
--------------------------------------------------------------------------------
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1240} INFO - Starting attempt 1 of 2
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1241} INFO -
--------------------------------------------------------------------------------
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1260} INFO - Executing <Task(DataprocSubmitJobOperator): run_dataproc_spark> on 2022-06-06 06:53:24.637504+00:00
[2022-06-06, 06:53:39 UTC] {standard_task_runner.py:52} INFO - Started process 65510 to run task
[2022-06-06, 06:53:39 UTC] {standard_task_runner.py:76} INFO - Running: ['airflow', 'tasks', 'run', 'composer_quickstart', 'run_dataproc_spark', 'manual__2022-06-06T06:53:24.637504+00:00', '--job-id', '21439', '--raw', '--subdir', 'DAGS_FOLDER/20220606_1.py', '--cfg-path', '/tmp/tmp7p1eyqqm', '--error-file', '/tmp/tmpdr2m4rwe']
[2022-06-06, 06:53:39 UTC] {standard_task_runner.py:77} INFO - Job 21439: Subtask run_dataproc_spark
[2022-06-06, 06:53:41 UTC] {logging_mixin.py:109} INFO - Running <TaskInstance: composer_quickstart.run_dataproc_spark manual__2022-06-06T06:53:24.637504+00:00 [running]> on host airflow-worker-7b5f8fc749-pd8f9
[2022-06-06, 06:53:44 UTC] {taskinstance.py:1426} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_EMAIL=
AIRFLOW_CTX_DAG_OWNER=Composer Example
AIRFLOW_CTX_DAG_ID=composer_quickstart
AIRFLOW_CTX_TASK_ID=run_dataproc_spark
AIRFLOW_CTX_EXECUTION_DATE=2022-06-06T06:53:24.637504+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-06-06T06:53:24.637504+00:00
[2022-06-06, 06:53:44 UTC] {dataproc.py:1878} INFO - Submitting job
[2022-06-06, 06:53:44 UTC] {credentials_provider.py:312} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2022-06-06, 06:53:45 UTC] {dataproc.py:1890} INFO - Job e7e800e7-fbfd-45e0-8021-eca4e2a7a377 submitted successfully.
[2022-06-06, 06:53:45 UTC] {dataproc.py:1903} INFO - Waiting for job e7e800e7-fbfd-45e0-8021-eca4e2a7a377 to complete
[2022-06-06, 06:54:16 UTC] {dataproc.py:1907} INFO - Job e7e800e7-fbfd-45e0-8021-eca4e2a7a377 completed successfully.
[2022-06-06, 06:54:16 UTC] {taskinstance.py:1268} INFO - Marking task as SUCCESS. dag_id=composer_quickstart, task_id=run_dataproc_spark, execution_date=20220606T065324, start_date=20220606T065339, end_date=20220606T065416
[2022-06-06, 06:54:16 UTC] {local_task_job.py:154} INFO - Task exited with return code 0
[2022-06-06, 06:54:16 UTC] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check
Submitted Job:

bash: spark-submit: command not found while executing DAG in AWS Managed Apache Airflow

I have to run a Spark job (I am new to Spark) and am getting the following error:
[2022-02-16 14:47:45,415] {{bash.py:135}} INFO - Tmp dir root location: /tmp
[2022-02-16 14:47:45,416] {{bash.py:158}} INFO - Running command: spark-submit --class org.xyz.practice.driver.PractitionerDriver s3://pfdt-poc-temp/xyz_test/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar
[2022-02-16 14:47:45,422] {{bash.py:169}} INFO - Output:
[2022-02-16 14:47:45,423] {{bash.py:173}} INFO - bash: spark-submit: command not found
[2022-02-16 14:47:45,423] {{bash.py:177}} INFO - Command exited with return code 127
[2022-02-16 14:47:45,437] {{taskinstance.py:1482}} ERROR - Task failed with exception
What has to be done? Here is my code:
def run_spark(**kwargs):
    import pyspark
    sc = pyspark.SparkContext()
    df = sc.textFile('s3://demoairflowpawan/people.txt')
    logging.info('Number of lines in people.txt = {0}'.format(df.count()))
    sc.stop()

spark_task = BashOperator(
    task_id='spark_java',
    bash_command='spark-submit --class {{ params.class }} {{ params.jar }}',
    params={'class': 'org.xyz.practice.driver.PractitionerDriver', 'jar': 's3://pfdt-poc-temp/xyz_test/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar'},
    dag=dag
)
The question is: why do you expect spark-submit to be there?
If you created the default Airflow pods, then they come with the Airflow code only.
You can check this example for Spark and Airflow - https://medium.com/codex/executing-spark-jobs-with-apache-airflow-3596717bbbe3 - where they state specifically that "Spark binaries must be added and mapped".
So you need to figure out how to get the Spark binaries onto the existing Airflow pod.
Alternatively, you can create another Kubernetes job which will do the spark-submit, and have your DAG trigger that job - see the sketch below.
Sorry for the high-level answer...
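As a rough illustration of that second option, assuming a Kubernetes cluster is reachable from Airflow and the cncf.kubernetes provider is installed (neither is shown in the question), a KubernetesPodOperator task could run spark-submit inside a container image that ships the Spark binaries and the job jar. Image name, namespace and jar path below are placeholders, and the import path varies by provider version:
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

# Launch a pod from an image that already contains Spark and the job jar,
# and run spark-submit inside that pod instead of on the Airflow worker.
submit_spark_job = KubernetesPodOperator(
    task_id="spark_submit_in_pod",
    name="spark-submit-pod",
    namespace="default",
    image="your-registry/spark-with-job:latest",  # hypothetical image with Spark + job jar baked in
    cmds=["spark-submit"],
    arguments=[
        "--class", "org.xyz.practice.driver.PractitionerDriver",
        "local:///opt/job/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar",  # placeholder path inside the image
    ],
    get_logs=True,
    dag=dag,
)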

Copy data from HDFS to S3 using sys.process._

I need to copy files from HDFS to S3 using the command below in a Spark job.
I am using the steps below, but I get the following error. What am I missing?
import sys.process._
"s3-dist-cp --src /user/data-store/ --dest s3:/parquet2/ --s3ServerSideEncryption --groupBy='.*(.parquet).*'".!
17/08/09 07:56:00 INFO s3distcp.S3DistCp: Running with args: -libjars /usr/share/aws/emr/s3-dist-cp/lib/guava-15.0.jar,/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp-2.5.0.jar,/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar --src hdfs:///user/data-store/session/country=us/traffic/dt=2016-01-03/ --dest s3://digitas-dct-qa/development/vendors/comscore/clean/analytical-data/sessions-sad/temp4/country=us/traffic/dt=2016-01-03/parquet4/ --s3ServerSideEncryption --groupBy='.*(.parquet).*'
17/08/09 07:56:01 INFO s3distcp.S3DistCp: S3DistCp args: --src hdfs:///user/data-store/session/country=us/traffic/dt=2016-01-03/ --dest s3://digitas-dct-qa/development/vendors/comscore/clean/analytical-data/sessions-sad/temp4/country=us/traffic/dt=2016-01-03/parquet4/ --s3ServerSideEncryption --groupBy='.*(.parquet).*'
17/08/09 07:56:01 INFO s3distcp.S3DistCp: Using output path 'hdfs:/tmp/2581fcfc-c4fd-485a-9159-f34e7392cafa/output'
17/08/09 07:56:01 INFO s3distcp.S3DistCp: GET http://169.254.169.254/latest/meta-data/placement/availability-zone result: us-east-1a
17/08/09 07:56:02 INFO s3distcp.S3DistCp: Created 0 files to copy 0 files
17/08/09 07:56:04 INFO s3distcp.S3DistCp: Reducer number: 31
17/08/09 07:56:05 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-1-55-195.ec2.internal:8188/ws/v1/timeline/
17/08/09 07:56:05 INFO client.RMProxy: Connecting to ResourceManager at ip-10-1-55-195.ec2.internal/10.1.55.195:8032
17/08/09 07:56:05 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1502254855242_0014
17/08/09 07:56:05 INFO s3distcp.S3DistCp: Try to recursively delete hdfs:/tmp/2581fcfc-c4fd-485a-9159-f34e7392cafa/tempspace
Exception in thread "main" java.lang.RuntimeException: Error running job
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:927)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:705)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-10-1-55-195.ec2.internal:8020/tmp/2581fcfc-c4fd-485a-9159-f34e7392cafa/files

Logs from only some files showing up in AWS CloudWatch

I configured the AWS CloudWatch Logs service on my Linux instance. In the config file I set it to keep track of 3 log files:
[general]
state_file = /var/lib/awslogs/agent-state
[plugins]
cwlogs = cwlogs
[default]
region = us-west-1
[/var/log/cron]
file = /var/log/cron
log_group_name = /var/log/cron
log_stream_name = {instance_id}
datetime_format = %b %d %H:%M:%S
[/var/log/messages]
file = /var/log/messages
log_group_name = /var/log/messages
log_stream_name = {instance_id}
datetime_format = %b %d %H:%M:%S
[/var/log/test.log]
file = /var/log/test.log
log_group_name = /var/log/test.log
log_stream_name = {instance_id}
datetime_format = %b %d %H:%M:%S
However, in my console I'm only seeing logs showing up from messages. The permissions for the 3 files I'm trying to keep track of are -rw-------.
Does anybody know why this might be happening? I'm echoing test logs into each individual file and only the ones inserted into messages are showing up.
EDIT: Here is my awslogs.log:
2016-08-25 17:58:31,227 - cwlogs.push - INFO - 631 - MainThread - Missing or invalid value for use_gzip_http_content_encoding config. Defaulting to using gzip encoding.
2016-08-25 17:58:31,228 - cwlogs.push - INFO - 631 - MainThread - Using default logging configuration.
2016-08-25 17:58:31,234 - cwlogs.push.stream - INFO - 631 - Thread-1 - Starting publisher for [d4a8beb9b6b4535cac41dc75f252df59, /var/log/messages]
2016-08-25 17:58:31,234 - cwlogs.push.stream - INFO - 631 - Thread-1 - Starting reader for [d4a8beb9b6b4535cac41dc75f252df59, /var/log/messages]
2016-08-25 17:58:31,235 - cwlogs.push.reader - INFO - 631 - Thread-4 - Replay events end at 52578.
2016-08-25 17:58:31,235 - cwlogs.push.reader - INFO - 631 - Thread-4 - Start reading file from 52284.
2016-08-25 17:58:32,308 - cwlogs.push.publisher - WARNING - 631 - Thread-2 - Caught exception: An error occurred (DataAlreadyAcceptedException) when calling the PutLogEvents operation: The given batch of log events has already been accepted. The next batch can be sent with sequenceToken: 49561203985967314162297491311273568778757530964511949634
It's possible your agent state file is corrupted because you kept making changes to the configuration. There are two ways to fix this:
Option 1: Use a new name for your configuration block header.
That is, change [/var/log/cron] to [/something/else].
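For example, the cron block from the question could be renamed while keeping the same file path (the new section name below is arbitrary):
[/var/log/cron_v2]
file = /var/log/cron
log_group_name = /var/log/cron
log_stream_name = {instance_id}
datetime_format = %b %d %H:%M:%S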
Option 2: Delete the agent state file after stopping the service.
sudo service awslogs stop
sudo rm /var/lib/awslogs/agent-state
sudo service awslogs start
Please note that Option 2 may initially cause duplicate logs to be pushed to CloudWatch as a new state file is created.

Druid not storing to AWS S3

I am trying to push the data to AWS S3. I used the example in http://druid.io/docs/0.7.0/Tutorial:-The-Druid-Cluster.html but modified common.runtime.properties as below:
druid.storage.type=s3
druid.s3.accessKey=AKIAJWTETHZDEQLHQ7AQ
druid.s3.secretKey=tcTtvGXcqLmmMbo2hRunzlSA1P2X0O0bjVf537Nt
druid.storage.bucket=testfeed
druid.storage.baseKey=sample
Below are the logs for the realtime node:
2015-03-02T15:03:44,809 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.query.QueryConfig] from props[druid.query.] as [io.druid.query.QueryConfig#2edcd9d]
2015-03-02T15:03:44,843 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.query.search.search.SearchQueryConfig] from props[druid.query.search.] as [io.druid.query.search.search.SearchQueryConfig#7939de8b]
2015-03-02T15:03:44,861 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.query.groupby.GroupByQueryConfig] from props[druid.query.groupBy.] as [io.druid.query.groupby.GroupByQueryConfig#bea8209]
2015-03-02T15:03:44,874 INFO [main] org.skife.config.ConfigurationObjectFactory - Assigning value [100000000] for [druid.processing.buffer.sizeBytes] on [io.druid.query.DruidProcessingConfig#intermediateComputeSizeBytes()]
2015-03-02T15:03:44,878 INFO [main] org.skife.config.ConfigurationObjectFactory - Assigning value [2] for [druid.processing.numThreads] on [io.druid.query.DruidProcessingConfig#getNumThreads()]
2015-03-02T15:03:44,878 INFO [main] org.skife.config.ConfigurationObjectFactory - Using method itself for [${base_path}.columnCache.sizeBytes] on [io.druid.query.DruidProcessingConfig#columnCacheSizeBytes()]
2015-03-02T15:03:44,880 INFO [main] org.skife.config.ConfigurationObjectFactory - Assigning default value [processing-%s] for [${base_path}.formatString] on [com.metamx.common.concurrent.ExecutorServiceConfig#getFormatString()]
2015-03-02T15:03:44,956 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.query.topn.TopNQueryConfig] from props[druid.query.topN.] as [io.druid.query.topn.TopNQueryConfig#276503c4]
2015-03-02T15:03:44,960 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.segment.loading.LocalDataSegmentPusherConfig] from props[druid.storage.] as [io.druid.segment.loading.LocalDataSegmentPusherConfig#360548eb]
2015-03-02T15:03:44,967 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.client.DruidServerConfig] from props[druid.server.] as [io.druid.client.DruidServerConfig#75ba7964]
2015-03-02T15:03:44,971 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.server.initialization.BatchDataSegmentAnnouncerConfig] from props[druid.announcer.] as [io.druid.server.initialization.BatchDataSegmentAnnouncerConfig#1ff2a544]
2015-03-02T15:03:44,984 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.server.initialization.ZkPathsConfig] from props[druid.zk.paths.] as [io.druid.server.initialization.ZkPathsConfig#58d3f4be]
2015-03-02T15:03:44,990 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.curator.CuratorConfig] from props[druid.zk.service.] as [io.druid.curator.CuratorConfig#5fd11499]
I found the issue. I had missed the S3 extension in common.runtime.properties. Once that was added, data started getting pushed to S3.
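For anyone hitting the same symptom (note the LocalDataSegmentPusherConfig being loaded for druid.storage. in the log above): in the 0.7.x-era configuration this typically means loading the druid-s3-extensions module in common.runtime.properties, roughly like the line below. The exact property name and version coordinate depend on the Druid release, so treat this as a sketch rather than a verified setting:
# Sketch only: pull in the S3 extension so druid.storage.type=s3 can be honoured
druid.extensions.coordinates=["io.druid.extensions:druid-s3-extensions:0.7.0"]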