os.environ['mapreduce_map_input_file'] doesn't work - mapreduce

I created a simple map reduce in Python, just to test the os.environ['mapreduce_map_input_file'] call, as you can see below:
map.py
#!/usr/bin/python
import sys
# input comes from STDIN (stream data that goes to the program)
for line in sys.stdin:
l = line.strip().split()
for word in l:
# output goes to STDOUT (stream data that the program writes)
print "%s\t%d" %( word, 1 )
reduce.py
#!/usr/bin/python
import sys
import os
current_word = None
current_sum = 0
# input comes from STDIN (stream data that goes to the program)
for line in sys.stdin:
word, count = line.strip().split("\t", 1)
try:
count = int(count)
except ValueError:
continue
if word == current_word:
current_sum += count
else:
if current_word:
# output goes to STDOUT (stream data that the program writes)
print "%s\t%d" %( current_word, current_sum )
print (os.environ['mapreduce_map_input_file'])
current_word = word
current_sum = count
Error message:
Traceback (most recent call last):
File "/Users/brunomacedo/Desktop/Big-Data/./reduce.py", line 25, in <module>
print (os.environ['mapreduce_map_input_file'])
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/UserDict.py", line 23, in __getitem__
raise KeyError(key)
KeyError: 'mapreduce_map_input_file'
15/03/06 17:50:26 INFO streaming.PipeMapRed: Records R/W=16127/1
15/03/06 17:50:26 INFO streaming.PipeMapRed: MRErrorThread done
15/03/06 17:50:26 WARN streaming.PipeMapRed: java.io.IOException: Stream closed
15/03/06 17:50:26 INFO streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:128)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/03/06 17:50:26 WARN streaming.PipeMapRed: java.io.IOException: Stream closed
15/03/06 17:50:26 INFO streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/03/06 17:50:26 INFO mapred.LocalJobRunner: reduce task executor complete.
15/03/06 17:50:26 WARN mapred.LocalJobRunner: job_local1265836882_0001
java.lang.Exception: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/03/06 17:50:27 INFO mapreduce.Job: Job job_local1265836882_0001 failed with state FAILED due to: NA
15/03/06 17:50:27 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=181735210
FILE: Number of bytes written=292351104
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=0
HDFS: Number of bytes written=0
HDFS: Number of read operations=0
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Map-Reduce Framework
Map input records=100
Map output records=334328
Map output bytes=2758691
Map output materialized bytes=3427947
Input split bytes=14100
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=3427947
Reduce input records=0
Reduce output records=0
Spilled Records=334328
Shuffled Maps =100
Failed Shuffles=0
Merged Map outputs=100
GC time elapsed (ms)=1224
Total committed heap usage (bytes)=49956257792
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=2090035
File Output Format Counters
Bytes Written=0
15/03/06 17:50:27 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!
Command I use to run it:
hadoop jar /usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-D mapreduce.job.reduces=1 \
-file map.py \
-mapper map.py \
-file reduce.py \
-reducer reduce.py \
-input file:///Users/brunomacedo/Desktop/Big-Data/articles \
-output file:///Users/brunomacedo/Desktop/Big-Data/output
If I take out the line print (os.environ['mapreduce_map_input_file']) it works perfectly. The purpose of this line is to print the name of the input file where the word count came from. Anyway, I did this just to test this command, because I need to use it in a more complicated project.
Can someone please help me what is wrong with this call? Thank you very much!
Edit:
I am using Hadoop 2.6.0

It didn't work when I called os.environ['mapreduce_map_input_file'] or os.environ['map_input_file'] in the reduce file. But it works in the map file. It makes sense, as the reducer is not able to know from which input file your mapper output comes from (unless you send that information directly from the mapper).
I was unable to run just os.environ['mapreduce_map_input_file'] or os.environ['map_input_file'] directly. But using this in the mapper seems to consistently work:
try:
input_file = os.environ['mapreduce_map_input_file']
except KeyError:
input_file = os.environ['map_input_file']

Related

Python 2.7 : retrieve netCDF4 file with ftplib

Here is my issue :
I download some netCDF4 files from an FTP server, in two different ways: via FileZilla and via a Python 2.7 script using ftplib.
Code of Python script (running on Windows) :
# download the file
try:
ftp = FTP(server_address)
ftp.login(server_login, server_pass)
filepath = 'the_remote_rep/myNetCDF4File.nc'
filename = 'myNetCDF4File.nc'
local_dir = 'toto'
new_file = open('%s/%s' % (local_dir, filename), "w")
ftp.retrbinary('RETR %s' % filepath, new_file.write)
ftp.close()
new_file.close()
except Exception as e:
print("Error FTP : '" + str(e) + "'")
# update title into the file
try:
fname = 'toto/myNetCDF4File.nc'
dataset = netCDF4.Dataset(fname, mode='a')
setattr(dataset, 'title', 'In Situ Observation Re-Analysis')
dataset.close()
except Exception as e:
print("Error netCDF4 : '" + str(e) + "'")
Then, I get this message :
Error netCDF4 : '[Errno 22] Invalid argument: 'toto/myNetCDF4File.nc''
When I try the second block of code with a netCDF4 file downloaded via FileZilla (the same file for example), there is no error.
Also, when I try to get the netCDF version of the file using "ncdump -k", here is the response (OK with the other file) :
ncdump: myNetCDF4File.nc: Invalid argument
In addition, files do not have the same size depending on the method :
FileZilla : 22 972 Ko
Python ftplib : 23 005 Ko
Is it a problem from ftplib when writing the retrieved file? Or did I miss some parameters to correctly encode the file?
Thanks in advance.
EDIT : verbose messages from FileZilla :
...
Response: 230 Login successful.
Trace: CFtpLogonOpData::ParseResponse() in state 5
Trace: CControlSocket::SendNextCommand()
Trace: CFtpLogonOpData::Send() in state 9
Command: OPTS UTF8 ON
Trace: CFtpControlSocket::OnReceive()
Response: 200 Always in UTF8 mode.
Trace: CFtpLogonOpData::ParseResponse() in state 9
Status: Logged in
Trace: Measured latency of 114 ms
Trace: CFtpControlSocket::ResetOperation(0)
Trace: CControlSocket::ResetOperation(0)
Trace: CFtpLogonOpData::Reset(0) in state 14
Trace: CFtpControlSocket::FileTransfer()
Trace: CControlSocket::SendNextCommand()
Trace: CFtpFileTransferOpData::Send() in state 0
Status: Starting download of /INSITU_OBSERVATIONS/myNetCDF4File.nc
Trace: CFtpChangeDirOpData::Send() in state 0
Trace: CFtpControlSocket::ResetOperation(0)
Trace: CControlSocket::ResetOperation(0)
Trace: CFtpChangeDirOpData::Reset(0) in state 0
Trace: CFtpFileTransferOpData::SubcommandResult(0) in state 1
Trace: CControlSocket::SendNextCommand()
Trace: CFtpFileTransferOpData::Send() in state 5
Trace: CFtpRawTransferOpData::Send() in state 2
Command: PASV
Trace: CFtpControlSocket::OnReceive()
Response: 227 Entering Passive Mode (193,68,190,45,179,16).
Trace: CFtpRawTransferOpData::ParseResponse() in state 2
Trace: CControlSocket::SendNextCommand()
Trace: CFtpRawTransferOpData::Send() in state 4
Trace: Binding data connection source IP to control connection source IP 134.xx.xx.xx
Command: RETR myNetCDF4File.nc
Trace: CTransferSocket::OnConnect
Trace: CFtpControlSocket::OnReceive()
Response: 150 Opening BINARY mode data connection for myNetCDF4File.nc (9411620 bytes).
Trace: CFtpRawTransferOpData::ParseResponse() in state 4
Trace: CControlSocket::SendNextCommand()
Trace: CFtpRawTransferOpData::Send() in state 5
Trace: CTransferSocket::TransferEnd(1)
Trace: CFtpControlSocket::TransferEnd()
Trace: CFtpControlSocket::OnReceive()
Response: 226 Transfer complete.
Trace: CFtpRawTransferOpData::ParseResponse() in state 7
Trace: CFtpControlSocket::ResetOperation(0)
Trace: CControlSocket::ResetOperation(0)
Trace: CFtpRawTransferOpData::Reset(0) in state 7
Trace: CFtpFileTransferOpData::SubcommandResult(0) in state 7
Trace: CFtpControlSocket::ResetOperation(0)
Trace: CControlSocket::ResetOperation(0)
Trace: CFtpFileTransferOpData::Reset(0) in state 7
Status: File transfer successful, transferred 9 411 620 bytes in 89 seconds
Status: Disconnected from server
Trace: CFtpControlSocket::ResetOperation(66)
Trace: CControlSocket::ResetOperation(66)
In fact, this is a problem of binary configuration (thanks to your questions).
I added ftp.voidcmd('TYPE I') before retrieving file with ftplib, then I modified writing parameter of local file as new_file = open('%s/%s' % (local_ftp_path, filename), "wb") to specify that's a binary file.
Now the file is readable after download via ftplib and has same size as downloaded from FileZilla.
Thanks to your contribution.

AWS Glue job is failing for large input csv data on s3

For small s3 input files (~10GB), glue ETL job works fine but for the larger dataset (~200GB), the job is failing.
Adding a part of ETL code.
# Converting Dynamic frame to dataframe
df = dropnullfields3.toDF()
# create new partition column
partitioned_dataframe = df.withColumn('part_date', df['timestamp_utc'].cast('date'))
# store the data in parquet format on s3
partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append")
Job executed for 4 hours and threw error.
File "script_2017-11-23-15-07-32.py", line 49, in
partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir,
mode="append") File
"/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/readwriter.py",
line 550, in save File
"/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py",
line 1133, in call File
"/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/utils.py",
line 63, in deco File
"/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py",
line 319, in get_return_value py4j.protocol.Py4JJavaError: An error
occurred while calling o172.save. : org.apache.spark.SparkException:
Job aborted. at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
py4j.Gateway.invoke(Gateway.java:280) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:214) at
java.lang.Thread.run(Thread.java:748) Caused by:
org.apache.spark.SparkException: Job aborted due to stage failure:
Total size of serialized results of 3385 tasks (1024.1 MB) is bigger
than spark.driver.maxResultSize (1024.0 MB) at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257) at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:1951) at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127)
... 30 more
End of LogType:stdout
I would appreciate it if you could provide any guidance to resolve this issue.
You can only set configurable options like maxResultSize during context instantiation, and glue provides you with a context (from memory you can't instantiate a new context). I don't think you will be able to change the value of this property.
You'll normally get this error if you collect results to the driver which exceed the specified size. You aren't doing that in this case so the error is confusing.
It seems like you are spawning 3385 tasks, which are presumably related to the dates in your input file (3385 dates, ~9 years?). You might try writing this file in batches, e.g.
partitioned_dataframe = df.withColumn('part_date', df['timestamp_utc'].cast('date'))
for year in range(2000,2018):
partitioned_dataframe = partitioned_dateframe.where(year(part_date) = year)
partitioned_dataframe.write.partitionBy(['part_date'])
.format("parquet")
.save(output_lg_partitioned_dir, mode="append")
I haven't checked this code; you'll at least need to import pyspark.sql.functions.year for it to work.
When I've done data processing with Glue I simply found that batching the work was more effective than trying to get large datasets be completed successfully. The system is good but hard to debug; the stability on large data doesn't come easily.

Start token not found error while using JsonSerDe

I am trying to import a JSON data from S3, and after making some queries, export the output as JSON format to S3 again. However, I get the "org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected" error at hive step on EMR cluster. In order to understand what the problem is, I simplify the Hive script and JSON data, but it keeps giving the same error. How can I solve this problem?
Cluster configuration:
Release: emr-5.3.1
Hive version: 2.1.1
Hadoop distribution: Amazon 2.7.3
Service Role: EMR_DefaultRole
MasterInstanceType: m4.large
The content of the simplifed JSON data:
[{"MyID":"FOO123","MyField":"FOO"},{"MyID":"BAR123","MyField":"BAR"}]
Hive script:
DROP TABLE IF EXISTS SOURCE;
DROP TABLE IF EXISTS DESTINATION;
CREATE EXTERNAL TABLE SOURCE(MyID STRING, MyField STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://myPath/subPath/';
CREATE EXTERNAL TABLE DESTINATION(MyID STRING, MyField STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://anotherPath/subPath/';
INSERT OVERWRITE TABLE DESTINATION SELECT MyID, MyField FROM SOURCE;
And here is the stack trace:
Vertex failed, vertexName=Map 4, vertexId=vertex_1278452616863_0001_1_00, diagnostics=[Task failed, taskId=task_1278452616863, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1278452616863:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable [{"MyID":"FOO123","MyField":"FOO"},{"MyID":"BAR123","MyField":"BAR"}]
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable [{"MyID":"FOO123","MyField":"FOO"},{"MyID":"BAR123","MyField":"BAR"}]
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:95)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:70)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:383)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable [{"MyID":"FOO123","MyField":"FOO"},{"MyID":"BAR123","MyField":"BAR"}]
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:86)
... 17 more
Caused by: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected
at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:183)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:128)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(MapOperator.java:92)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:488)
... 18 more
Caused by: java.io.IOException: Start token not found where expected
at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:169)
... 21 more
Thanks.
JSON should start with { and not with array ([)
I tried with this approach updated my JSON file with structure as
{"MyID":"FOO123","MyField":"FOO"},
{"MyID":"BAR123","MyField":"BAR"}
but after done, I noticed only the first object is being inserted into the table.

java.io.IOException: Broken pipe on increasing number of mappers/reducers, a lot

I am running MapReduce job on a hadoop cluster of 6 nodes with 4 map tasks and 10 reduce tasks configured.
Mapper/Reducer fails a lot on increasing number of map/reduce tasks as below,
I encounter the following error:
stderr logs
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:130)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
and this:
syslog logs
2014-03-01 15:11:30,118 WARN org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:260)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:569)
at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:130)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
2014-03-01 15:11:30,118 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
2014-03-01 15:11:30,121 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2014-03-01 15:11:30,146 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2014-03-01 15:11:30,146 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName hduser for UID 1001 from the native implementation
2014-03-01 15:11:30,147 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:130)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
2014-03-01 15:11:30,149 WARN org.apache.hadoop.mapred.Task: Parent died. Exiting attempt_201402281751_0042_r_000004_0
2014-03-01 15:11:31,252 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=983976/1957694
Even the simplest program is creating this problem:
mapper.py
#!/usr/bin/env python
import sys
for line in sys.stdin:
if line:
print "%s\n%s"%(line, line)
reducer.py
#!/usr/bin/env python
import sys
for line in sys.stdin:
if line:
print "%s"%(line)
I am using the following command to run hadoop.
hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -D mapred.reduce.tasks=10 -file /home/hduser/code/K1D/code1/mapper2.py -mapper mapper2.py -file /home/hduser/code/K1D/code1/reducer2.py -reducer reducer2.py -input /user/hduser/data-out/part-00000 -output /user/hduser/data-out1 -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
Can you suggest anything?
I ran into something similar and realized that I didn't have the exact version of python installed into one of my data nodes. I had
#!/usr/bin/env python
which I changed to
#!/usr/bin/env python2.7

pig REPLACE gives error

Let's assume that my file is named 'data' and looks like this:
2343234 {23.8375,-2.339921102} {(343.34333,-2.0000022)} 5-23-2013-11-am
I need to convert the 2nd field to a pair of coordinate numbers. So I wrote the follwoing code and called it basic.pig:
A = LOAD 'data' AS (f1:int, f2:chararray, f3:chararray. f4:chararray);
B = foreach A generate STRSPLIT(f2,',').$0 as f5, STRSPLIT(f2,',').$1 as f6;
C = foreach B generate REPLACE(f5,'{',' ') as f7, REPLACE(f6,'}',' ') as f8;
and then used (float) to convert the string to a float. But, the command 'REPLACE' fails to work and I get the following error:
-bash-3.2$ pig -x local basic.pig
2013-06-24 16:38:45,030 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled
Mar 22 2013, 02:13:53 2013-06-24 16:38:45,031 [main] INFO org.apache.pig.Main - Logging error messages to: /home/--/p/--test/pig_1372117125028.log
2013-06-24 16:38:45,321 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/isl/pmahboubi/.pigbootup not found
2013-06-24 16:38:45,425 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2013-06-24 16:38:46,069 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 7, column 0. Encountered: <EOF> after : ""
Details at logfile: /home/--/p/--test/pig_1372117125028.log
And this is the details of the pig_137..log
Pig Stack Trace
---------------
ERROR 1000: Error during parsing. Lexical error at line 7, column 0. Encountered: <EOF> after : ""
org.apache.pig.tools.pigscript.parser.TokenMgrError: Lexical error at line 7, column 0. Encountered: <EOF> after : ""
at org.apache.pig.tools.pigscript.parser.PigScriptParserTokenManager.getNextToken(PigScriptParserTokenManager.java:3266)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.jj_ntk(PigScriptParser.java:1134)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:104)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:604)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
================================================================================
I've got data like this:
2724 1919 2012-11-18T23:57:56.000Z {(33.80981975),(-118.105289)}
2703 6401 2012-11-18T23:57:56.000Z {(55.83525609),(-4.07733138)}
1200 4015 2012-11-18T23:57:56.000Z {(41.49609152),(13.8411998)}
7104 9227 2012-11-18T23:57:56.000Z {(-24.95351118),(-53.46538723)}
and I can do this:
A = LOAD 'my_tsv_data' USING PigStorage('\t') AS (id1:int, id2:int, date:chararray, loc:chararray);
B = FOREACH A GENERATE REPLACE(loc,'\\{|\\}|\\(|\\)','');
C = LIMIT B 10;
DUMP C;
This error
ERROR 1000: Error during parsing. Lexical error at line 7, column 0. Encountered: <EOF> after : ""
came to me because I had used different types of quotation marks. I started with ' and ended with ´ or `, and it took quite a while to find what went wrong. So it had nothing to do with line 7 (my script was not so long, and I shortened data to four lines which naturally did not help), nothing to do with column 0, nothing to do with EOF of data, and hardly anything to do with " marks which I didn't use. So quite misleading error message.
I found the cause by using grunt - pig command shell.