Hive map-reduce query is failing

I am trying to run my first Hive query that launches a map-reduce job. I have followed all the steps given at "http://doc.mapr.com/display/MapR/Hive".
The "web_log" table has been created and the data loaded with no errors.
But when I try to execute "SELECT web_log. FROM web_log WHERE web_log.url LIKE '%doc'*" I get the following exception.
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1429420954627_0002, Tracking URL = http://yarn-training:8088/proxy/application_1429420954627_0002/
Kill Command = /opt/mapr/bin/hadoop job -kill job_1429420954627_0002
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2015-04-19 00:19:15,690 Stage-1 map = 0%, reduce = 0%
Ended Job = job_1429420954627_0002 with errors
Error during job, obtaining debugging information...
FAILED:
Could someone please guide me?

The number of reducers is 0 because your job is a map-only job. But is that really your query? I think something is missing from your SELECT statement:
SELECT web_log.* FROM web_log WHERE web_log.url LIKE '%doc';

How can I run specific task/s from the Airflow dag

Current State of airflow dag:
ml_processors = [a, b, c, d, e]
abc_task >> ml_processors (all ml models from a to e run in parallel after abc task is successfully completed)
ml_processors >> xyz_task (once a to e all are successful xyz task runs)
Problem statement: Sometimes one of the machine learning models (a task in Airflow) gets a new version with better accuracy, and we want to reprocess our data. Say c_processor gets a new version and only the data for this processor needs reprocessing. In that case I would like to run only c_processor >> xyz_task.
What I know/tried
I know that I can go back to successful DAG runs and clear a task for a certain period of time to rerun only that task. But this might not be very efficient when, say, both c_processor and d_processor have to be rerun. I would end up doing two steps here:
c_processor >> xyz_task
d_processor >> xyz_task
which I would like to avoid.
I read about backfill in Airflow, but it looks like it is meant for the whole DAG rather than for specific tasks selected from a DAG.
Environment/setup
Using a Google Cloud Composer environment.
The DAG is triggered on file upload to GCP storage.
I would like to know whether there are any other ways to rerun only specific tasks from an Airflow DAG.
The "clear" command also lets you clear specific tasks in a DAG with the --task-regex flag. In this case, you can run airflow tasks clear --task-regex "[c|d]_processor" --downstream -s 2021-03-22 -e 2021-03-23 <dag_id>, which clears the states of the c and d processors together with their downstream tasks.
One caveat, though: this also wipes the states of the original task runs.
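To check which tasks a pattern would pick up before clearing anything, the regex can be tried out locally. The sketch below is only an illustration: the task ids are hypothetical names modelled on the DAG in the question, and it assumes the pattern is applied as an unanchored search against each task id. Note that "[c|d]" is a character class, so it also accepts a literal "|"; "[cd]" says the same thing more tightly.

```python
import re

# Hypothetical task ids modelled on the DAG described in the question.
task_ids = ["abc_task", "a_processor", "b_processor", "c_processor",
            "d_processor", "e_processor", "xyz_task"]


def select_tasks(pattern, ids):
    """Return the ids in which the pattern is found anywhere (search, not full match)."""
    return [task_id for task_id in ids if re.search(pattern, task_id)]


# Pattern exactly as written in the answer.
original = select_tasks(r"[c|d]_processor", task_ids)
# Tighter, anchored variant that cannot match a stray "|".
tighter = select_tasks(r"^[cd]_processor$", task_ids)

print(original)
print(tighter)
```

For this set of task ids both patterns select the same two tasks; the downstream xyz_task is added by the --downstream flag, not by the regex.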

Splunk - To search for concurrent run of processes

I want to check whether multiple instances of a job/process are running at the same time.
Example: my Splunk search
index=abc <jobname> | stats earliest(_time) AS earliest_time, latest(_time) AS latest_time count by source | convert ctime(earliest_time), ctime(latest_time) | sort - count
returns:
source  earliest_time        latest_time          count
logA    06/06/2020 15:24:09  06/06/2020 15:24:59  1
logB    06/06/2020 15:24:24  06/06/2020 15:25:12  2
In the above, logB shows the job starting before the completion time in logA, which indicates that the process ran concurrently. I would like to generate a list of all such jobs, if that is possible. Any help is appreciated.
Thank you.
There is a built-in Splunk command for this: concurrency. It requires an event start time and a duration, which we can calculate as the difference between the earliest and latest times. The command creates a new field called concurrency, which measures the total number of events in progress at the time each particular event started, including the event itself.
index=abc <jobname> | stats earliest(_time) as et latest(_time) as lt count by source | eval duration=lt-et | concurrency start=et duration=duration | where concurrency>1
Docs for concurrency can be found at https://docs.splunk.com/Documentation/Splunk/8.0.4/SearchReference/Concurrency
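To see what the concurrency field would contain for the two rows in the question, the counting rule can be reproduced outside Splunk. This is a rough sketch of the rule as described above, not Splunk's implementation; it assumes a run counts as "in progress" from its start time up to, but not including, its end time.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"


def parse(ts):
    return datetime.strptime(ts, FMT)


# (source, earliest_time, latest_time) taken from the table in the question.
runs = [
    ("logA", parse("06/06/2020 15:24:09"), parse("06/06/2020 15:24:59")),
    ("logB", parse("06/06/2020 15:24:24"), parse("06/06/2020 15:25:12")),
]


def concurrency(runs):
    """Count, for each run, how many runs are in progress when it starts (itself included)."""
    counts = {}
    for name, start, _end in runs:
        others = sum(
            1
            for other_name, other_start, other_end in runs
            if other_name != name and other_start <= start < other_end
        )
        counts[name] = others + 1
    return counts


counts = concurrency(runs)
# Equivalent of the final "where concurrency>1" filter.
overlapping = [name for name, n in counts.items() if n > 1]

print(counts)
print(overlapping)
```

Only logB is flagged, because the count is taken at each run's start time: nothing else was running when logA started, while logA was still running when logB started.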

alpakka cassandrasource read data from cassandra continuously

We are doing a POC to read a Cassandra table continuously using Alpakka CassandraSource. The following is the sample code:
final Statement stmt = new SimpleStatement("SELECT * FROM testdb.emp1").setFetchSize(20);
final CompletionStage<List<Row>> rows = CassandraSource.create(stmt, session).runWith(Sink.seq(), materializer);
rows.thenAcceptAsync( e -> e.forEach(System.out::println));
The above code fetches the rows from the emp1 table. Since this table grows continuously, we need to keep reading as soon as new data is available. Is there any way to set up a continuous read in CassandraSource?
There is currently no support for continuously reading a table in the Alpakka Cassandra connector. However, you can make it work by wrapping CassandraSource.create in a RestartSource.withBackoff, which will restart the Cassandra source after it completes. More about restarting sources in the documentation.
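The restart-after-completion idea can be sketched independently of Akka. The generator below is a simplified illustration, not the Alpakka or Akka Streams API: restart_with_backoff, make_source, and the fake table are all hypothetical names, and the backoff simply doubles on each restart up to a cap. It also makes one consequence visible: every restart runs the query again from the beginning, so rows that were already read come back and would need filtering or de-duplication downstream.

```python
import time


def restart_with_backoff(make_source, min_backoff=1.0, max_backoff=30.0,
                         max_restarts=None, sleep=time.sleep):
    """Yield everything from make_source(), then re-create the source after a delay."""
    delay = min_backoff
    restarts = 0
    while True:
        # A finite source (one query) is drained completely on each pass.
        for item in make_source():
            yield item
        if max_restarts is not None and restarts >= max_restarts:
            return
        sleep(delay)
        delay = min(delay * 2, max_backoff)  # exponential backoff with a cap
        restarts += 1


# Demo with a fake, growing table instead of Cassandra.
table = [1, 2]
sleeps = []


def fake_sleep(seconds):
    sleeps.append(seconds)
    table.append(len(table) + 1)  # simulate a new row arriving while we wait


rows = list(restart_with_backoff(lambda: iter(list(table)),
                                 min_backoff=1, max_backoff=4,
                                 max_restarts=2, sleep=fake_sleep))
print(rows)
print(sleeps)
```

In the demo the earlier rows are emitted again on every pass, which is why a real pipeline built this way would typically track the last key or timestamp seen and query only for newer rows.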

Inserting rows on BigQuery: InsertAllRequest Vs BigQueryIO.writeTableRows()

When I insert rows into BigQuery using writeTableRows, performance is really bad compared to InsertAllRequest. Clearly, something is not set up correctly.
Use case 1: I wrote a Java program to process 'sample' Twitter stream using Twitter4j. When a tweet comes in I write it to BigQuery using this:
insertAllRequestBuilder.addRow(rowContent);
When I run this program from my Mac, it inserts about 1000 rows per minute directly into BigQuery table. I thought I could do better by running a Dataflow job on the cluster.
Use case 2: When a tweet comes in, I write it to a topic of Google's PubSub. I run this from my Mac which sends about 1000 messages every minute.
I wrote a Dataflow job that reads this topic and writes to BigQuery using BigQueryIO.writeTableRows(). I have an 8-machine Dataproc cluster. I started this job on the master node of this cluster with DataflowRunner. It's unbelievably slow! Like 100 rows every 5 minutes or so. Here's a snippet of the relevant code:
statuses.apply("ToBQRow", ParDo.of(new DoFn<Status, TableRow>() {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        TableRow row = new TableRow();
        Status status = c.element();
        row.set("Id", status.getId());
        row.set("Text", status.getText());
        row.set("RetweetCount", status.getRetweetCount());
        row.set("FavoriteCount", status.getFavoriteCount());
        row.set("Language", status.getLang());
        row.set("ReceivedAt", null);
        row.set("UserId", status.getUser().getId());
        row.set("CountryCode", status.getPlace().getCountryCode());
        row.set("Country", status.getPlace().getCountry());
        c.output(row);
    }
}))
    .apply("WriteTableRows", BigQueryIO.writeTableRows().to(tweetsTable)//
        .withSchema(schema)
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
        .withNumFileShards(1000)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
What am I doing wrong? Should I use a 'SparkRunner'? How do I confirm that it's running on all nodes of my cluster?
With BigQuery you can either:
Stream data in. Low latency, up to 100k rows per second, has a cost.
Batch data in. Way higher latency, incredible throughput, totally free.
That's the difference you are experiencing. If you only want to ingest 1000 rows, batching will be noticeably slower. The same with 10 billion rows will be way faster through batching, and at no cost.
Dataflow/Beam's BigQueryIO.writeTableRows can either stream or batch data in.
With BigQueryIO.Write.Method.FILE_LOADS, the pasted code is choosing batch; BigQueryIO.Write.Method.STREAMING_INSERTS would stream instead.

'Premature end of Content-Length' with Spark Application using s3a

I'm writing a Spark-based application that works on a very large dataset stored on S3. It's about 15 TB in size uncompressed. The data is spread across many small LZO-compressed files, varying from 10 to 100 MB.
By default the job spawns 130k tasks while reading the dataset and mapping it to the schema.
It then fails after around 70k tasks have completed and ~20 tasks have failed.
Exception:
WARN lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body
Looks like the s3 connection is getting closed prematurely.
I have tried nearly 40 different combos of configurations.
To summarize them:
1 executor to 3 executors per node
18GB to 42GB --executor-memory
3-5 --executor-cores
1.8GB-4.0GB spark.yarn.executor.memoryOverhead
Both Kryo and default Java serializers
0.5 to 0.35 spark.memory.storageFraction
default, 130000 to 200000 partitions for the bigger dataset
default, 200 to 2001 spark.sql.shuffle.partitions
And most importantly: 100 to 2048 fs.s3a.connection.maximum property.
[This seems to be most relevant property to exception.]
[In all cases, driver was set to memory = 51GB, cores = 12, MEMORY_AND_DISK_SER level for caching]
Nothing worked!
If I run the program with half of the bigger dataset size (7.5TB), it finishes successfully in 1.5 hr.
What could I be doing wrong?
How do I determine the optimal value for fs.s3a.connection.maximum?
Is it possible that the s3 clients are getting GCed?
Any help will be appreciated!
Environment:
AWS EMR 5.7.0, 60 x i2.2xlarge SPOT Instances (16 vCPU, 61GB RAM, 2 x 800GB SSD), Spark 2.1.0
YARN is used as resource manager.
Code:
It's a fairly simple job, doing something like this:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.count
import org.apache.spark.storage.StorageLevel

val sl = StorageLevel.MEMORY_AND_DISK_SER
// Register the LZO codec and the s3a filesystem, and size the S3 connection pool.
sparkSession.sparkContext.hadoopConfiguration.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sparkSession.sparkContext.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 1200)
val dataset_1: DataFrame = sparkSession
  .read
  .format("csv")
  .option("delimiter", ",")
  .schema(<schema: StructType>)
  .csv("s3a://...")
  .select("ID") // 15 TB
dataset_1.persist(sl)
print(dataset_1.count())
// Count rows per ID, then count how many IDs share each row count.
val tmp = dataset_1.groupBy("ID").agg(count("*").alias("count_id"))
val tmp2 = tmp.groupBy("count_id").agg(count("*").alias("count_count_id"))
tmp2.write.csv(…)
dataset_1.unpersist()
Full Stacktrace:
17/08/21 20:02:36 INFO compress.CodecPool: Got brand-new decompressor [.lzo]
17/08/21 20:06:18 WARN lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 79627927; received: 19388396
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:108)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:160)
at java.io.DataInputStream.read(DataInputStream.java:149)
at com.hadoop.compression.lzo.LzopInputStream.readFully(LzopInputStream.java:73)
at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:321)
at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:261)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:186)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:99)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:91)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:364)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1021)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
EDIT: We have another service that consumes exactly the same logs, and it works just fine. But it uses the old "s3://" scheme and is based on Spark 1.6. I'll try using "s3://" instead of "s3a://".