My apologies if this question is trivial, but I haven't been able to solve it despite spending a few hours on Google…
I have compiled and installed Hadoop 2.5.2 from source on OS X 10.8. I think that all went well, even though getting the native libraries compiled was a bit of a pain.
To verify my installation, I tried the examples that ship with Hadoop, like this:
$ hadoop jar /Users/per/source/hadoop-2.5.2-src/hadoop-dist/target/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar pi 2 5
which gives me back
Job Finished in 1.774 seconds
Estimated value of Pi is 3.60000000000000000000
So that seems to indicate that Hadoop is at least working, I think.
Since my end goal is to get this working with file I/O from a C++ program using pipes, I also tried the following as an intermediate step:
$ hdfs dfs -put etc/hadoop input
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep input output 'dfs[a-z.]+'
This also seems to work (as in producing the correct output).
Then, finally, I tried the wordcount.cpp example, and this is not working, although I fail to understand how to fix it. The error message is quite clear:
2015-10-21 13:38:33,539 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user per
2015-10-21 13:38:33,678 INFO org.apache.hadoop.security.JniBasedUnixGroupsMapping: Error getting groups for per: getgrouplist: error looking up group. 5 (Input/output error)
So obviously there is something I don't get about the file permissions, but what? My hdfs-site.xml file looks like this; note that I have tried to turn off permission checking altogether.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.web.ugi</name>
    <value>per</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
But I am confused, since I at least seem to be able to use grep on files in HDFS (per the built-in example above).
Any feedback would be greatly appreciated. I am running all of this as myself; I haven't created a separate user just for Hadoop, since I am only running it locally on my laptop for now.
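Since the warning above is about group lookup, I assume the first thing to rule out (a quick sketch using standard OS X tools) is whether the OS itself can resolve the user and its groups; if these look sane, the problem is presumably in Hadoop's group mapping rather than in the OS:
$ id per                             # should list uid, gid and the groups the OS knows about
$ dscacheutil -q user -a name per    # should print the Directory Services record for the user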
Edit: Update with some additional output from my terminal window, following the discussion below:
<snip>
15/10/21 14:47:41 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/10/21 14:47:41 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/10/21 14:47:41 INFO mapred.LocalJobRunner: map task executor complete.
15/10/21 14:47:41 WARN mapred.LocalJobRunner: job_local676909674_0001
java.lang.Exception: java.lang.NullPointerException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:104)
at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:69)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
<snip>
I can add more if that is relevant.
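For context, the standard way to submit a Hadoop Pipes job (this is the form from the Hadoop documentation; the HDFS paths and the binary name below are placeholders, not necessarily my exact command) looks roughly like this:
# -program points at the compiled C++ binary previously copied into HDFS;
# -input and -output are HDFS paths.
$ hadoop pipes \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -program bin/wordcount \
    -input input \
    -output output-pipes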
I am following the Center for High Throughput Computing tutorial and the Introduction to Configuration on the HTCondor website to set up a partitionable slot. Before any configuration, I run
condor_status
and get the following output.
I update the file 00-minicondor in /etc/condor/config.d by adding the following lines at the end of the file.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=4
SLOT_TYPE_1_PARTITIONABLE = TRUE
and reconfigure
sudo condor_reconfig
Now with
condor_status
I get this output, as expected. Now I run the following command to check that everything is fine
condor_status -af Name Slotype Cpus
and find slot1@ip-172-31-54-214.ec2.internal undefined 1 instead of slot1@ip-172-31-54-214.ec2.internal Partitionable 4 61295, which is what I would expect. Moreover, when I try to submit a job that asks for more than 1 CPU, it does not get the resources it should; it just stays waiting forever.
I don't know if I made some mistake during the installation process or what could be happening. I would really appreciate any help!
EXTRA INFO: If it can be of any help, I have installed HTCondor with the command
curl -fsSL https://get.htcondor.org | sudo /bin/bash -s -- --no-dry-run
on Ubuntu 18.04 running on an old p2.xlarge instance (it has 4 cores).
UPDATE: After rebooting the whole thing, it seems to be working. I can now submit jobs with different CPU requests and it starts them properly.
The only issue I would say persists is that the memory allocation is not displayed properly, for example:
But in reality it is allocating enough memory for the job (in this case around 12 GB).
If I run again
condor_status -af Name Slotype Cpus
I still get something I am not supposed to:
But at least it is showing the correct number of CPUs (even if it just says undefined).
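For completeness, one more check I can think of (a sketch; the parameter names are simply the ones from my 00-minicondor edit above) is to ask the running startd which slot settings it actually loaded:
# Query the startd's view of the slot configuration
condor_config_val -startd SLOT_TYPE_1 SLOT_TYPE_1_PARTITIONABLE NUM_SLOTS_TYPE_1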
What is the output of condor_q -better when the job is idle?
Good day. I apologize for asking about obvious things: I'm writing in PHP and know Python at the level of "I started learning this yesterday". I've already spent a few days on this, but to no avail.
I downloaded the Twisted SSH server example for version 20.3 from https://docs.twistedmatrix.com/en/twisted-20.3.0/conch/examples/. Line 162 has an execCommand method that I need to implement to make it work. Then I noticed a comment in this method: "We don't support command execution sessions". Hence the question: does this comment apply only to the example, or to the Twisted library as a whole? I.e., is it possible to implement this method so that the example server works the way I need?
More information (I don't think it is required to answer the question above).
Why do I need it? I'm trying to put together an environment for writing functional (!) tests (there would be no such problems with unit tests, I guess). Our API uses the SSH client (phpseclib/SSH2) in 30%+ of its endpoints. Whatever I did, I got only three kinds of results, depending on how I implemented this method: (result: success, response: "" i.e. empty; result: success, response: "1"; result: failed, response: "Unable to fulfill channel request at… SSH2.php:3853"). Those were for the SSH2 client. If the error occurs (the third case), the server logs this in the terminal:
[SSHServerTransport, 0,127.0.0.1] Got remote error, code 11 reason: ""
[SSHServerTransport, 0,127.0.0.1] connection lost
I just found that this works:
def execCommand(self, protocol, cmd):
protocol.write('Some text to return')
protocol.session.conn.sendEOF(protocol.session)
If I don't send an EOF, the client throws a timeout error.
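For what it's worth, the same code path can be exercised without PHP at all: passing a command to a plain ssh client sends an "exec" channel request, which is exactly what execCommand() handles on the server side. A sketch (the port, user and password are assumptions based on the example file; adjust them to whatever your copy uses):
# Triggers execCommand() on the Twisted example server
$ ssh -p 5022 user@localhost 'echo hello'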
I'm trying to stream MutationGroups into Spanner with SpannerIO.
The goal is to write new MutationGroups every 10 seconds, as we will use Spanner to query near-real-time KPIs.
When I don't use any windows, I get the following error:
Exception in thread "main" java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.
at org.apache.beam.sdk.transforms.GroupByKey.applicableTo(GroupByKey.java:173)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:204)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:120)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at org.apache.beam.sdk.transforms.Combine$PerKey.expand(Combine.java:1585)
at org.apache.beam.sdk.transforms.Combine$PerKey.expand(Combine.java:1470)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at org.apache.beam.sdk.io.gcp.spanner.SpannerIO$WriteGrouped.expand(SpannerIO.java:868)
at org.apache.beam.sdk.io.gcp.spanner.SpannerIO$WriteGrouped.expand(SpannerIO.java:823)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at quantum.base.transform.entity.spanner.SpannerProtoWrite.expand(SpannerProtoWrite.java:52)
at quantum.base.transform.entity.spanner.SpannerProtoWrite.expand(SpannerProtoWrite.java:20)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline$Write$SpannerWrite.expand(EntityBuilderPipeline.java:388)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline$Write$SpannerWrite.expand(EntityBuilderPipeline.java:372)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline.main(EntityBuilderPipeline.java:122)
:entityBuilder FAILED
Because of the error above, I assume the input collection needs to be windowed and triggered, as SpannerIO uses a GroupByKey (this is also what I need for my use case):
...
.apply("1-minute windows", Window.<MutationGroup>into(FixedWindows.of(Duration.standardMinutes(1)))
.triggering(Repeatedly.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10))
).orFinally(AfterWatermark.pastEndOfWindow()))
.discardingFiredPanes()
.withAllowedLateness(Duration.ZERO))
.apply(SpannerIO.write()
.withProjectId(entityConfig.getSpannerProject())
.withInstanceId(entityConfig.getSpannerInstance())
.withDatabaseId(entityConfig.getSpannerDb())
.grouped());
When I do this, I get the following exceptions during runtime:
java.lang.IllegalArgumentException: Attempted to get side input window for GlobalWindow from non-global WindowFn
org.apache.beam.sdk.transforms.windowing.PartitioningWindowFn$1.getSideInputWindow(PartitioningWindowFn.java:49)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.issueSideInputFetch(StreamingModeExecutionContext.java:631)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$UserStepContext.issueSideInputFetch(StreamingModeExecutionContext.java:683)
com.google.cloud.dataflow.worker.StreamingSideInputFetcher.storeIfBlocked(StreamingSideInputFetcher.java:182)
com.google.cloud.dataflow.worker.StreamingSideInputDoFnRunner.processElement(StreamingSideInputDoFnRunner.java:71)
com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)
com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:271)
org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:219)
org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:69)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:517)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:505)
org.apache.beam.sdk.values.ValueWithRecordId$StripIdsDoFn.processElement(ValueWithRecordId.java:145)
After investigating further, it appears to be due to the .apply(Wait.on(input)) in SpannerIO: it has a global side input, which does not seem to work with my fixed windows, as the docs of Wait.java state:
If signal is globally windowed, main input must also be. This typically would be useful only in a batch pipeline, because the global window of an infinite PCollection never closes, so the wait signal will never be ready.
As a temporary workaround, I tried the following:
1. Add a GlobalWindow with triggers instead of fixed windows:
.apply("globalwindow", Window.<MutationGroup>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10))
).orFinally(AfterWatermark.pastEndOfWindow()))
.discardingFiredPanes()
.withAllowedLateness(Duration.ZERO))
This results in writes to Spanner only when I drain my pipeline. I have the impression the Wait.on() signal is only triggered when the global window closes, and doesn't work with triggers.
2. Disable the .apply(Wait.on(input)) in SpannerIO:
This results in the pipeline getting stuck on the view creation, which is described in this SO post: SpannerIO Dataflow 2.3.0 stuck in CreateDataflowView.
When I check the worker logs for clues, I do get the following warnings:
logger: "org.apache.beam.sdk.coders.SerializableCoder"
message: "Can't verify serialized elements of type SpannerSchema have well defined equals method. This may produce incorrect results on some PipelineRunner
logger: "org.apache.beam.sdk.coders.SerializableCoder"
message: "Can't verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner"
Note that everything works with the DirectRunner, and that it's the DataflowRunner I'm trying to use.
Does anyone have any other suggestions for things I can try to get this running? I can hardly imagine that I'm the only one trying to stream MutationGroups into Spanner.
Thanks in advance!
Currently, the SpannerIO connector is not supported with Beam streaming. Please follow this pull request, which adds streaming support for the SpannerIO connector.
I have configured 3 JournalNodes, let's say JN1, JN2, and JN3. Each of them saves the edit log under /tmp/hadoop/journalnode/mycluster...
Based on that, I started my namenode, secondary namenode, and a bunch of datanodes. The system ran well until one day JN2 and JN3 died. Furthermore, their disks were corrupted.
Then I purchased new disks and restarted JN2 and JN3. The bad thing is that it didn't work anymore.
They keep complaining:
org.apache.hadoop.hdfs.qjournal.protocol.JournalNotFormattedException: Journal Storage Directory /tmp/hadoop/dfs/journalnode/mycluster not formatted
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkFormatted(Journal.java:457)
at org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:640)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:185)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
Is there any way to recover JN2 and JN3 from the only living node, JN1?
Really appreciate all the possible solutions!
Thanks,
Miles
I was able to fix the issue by creating the missing directory on the JournalNode host where the namenode writes its edit files.
Make sure the VERSION file is created, otherwise you will get org.apache.hadoop.hdfs.qjournal.protocol.JournalNotFormattedException.
Alternatively, copy the VERSION file into that directory.
The issue went away after I duplicated the only existing /tmp/hadoop/journalnode/mycluster from JN1 to JN2 and JN3.
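A sketch of that copy (assumptions: JN1 is reachable over SSH as host jn1, the journal directory path is the same on all three nodes, and the Hadoop sbin scripts are on the PATH):
# Run on JN2 and JN3 while the local JournalNode is stopped: pull the surviving
# journal directory, including the VERSION file, from JN1, then start it again.
$ rsync -a jn1:/tmp/hadoop/journalnode/mycluster/ /tmp/hadoop/journalnode/mycluster/
$ hadoop-daemon.sh start journalnode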
I'm using OSMF's Strobe Media Playback player to try to play files from AWS CloudFront/S3.
The bucket is called ct.recorder. The CloudFront distribution is called 1dm7svtk8jb00c.cloudfront.net, and its origin is ct.recorder.
The video within the bucket is called vid_test001.
I've tried initializing the player with rtmp://s34osaecrafusl.cloudfront.net/cfx/st/vid_test001
But that doesn't work.
I get "Connection attempt rejected by FMS server. Connection failed."
I've also tried it with .flv at the end, but that doesn't work either.
Am I not linking to the file properly, or is it my player?
Well, I had an entire answer written up, speculating that it was related to bucket permissions, and now I'm scratching that answer and posting this, instead. :)
$ rtmpdump -r rtmp://s34osaecrafusl.cloudfront.net/cfx/st/vid_test001.flv -o testfile.flv
RTMPDump v2.4
(c) 2010 Andrej Stepanchuk, Howard Chu, The Flvstreamer Team; license: GPL
Connecting ...
WARNING: HandShake: client signature does not match!
INFO: Connected...
Starting download at: 0.000 kB
INFO: Metadata:
INFO: duration 13.82
INFO: videocodecid 2.00
INFO: audiocodecid 6.00
INFO: canSeekToEnd FALSE
INFO: createdby AMS 5
INFO: creationdate Tue Dec 03 13:41:46 2013
1190.238 kB / 13.82 sec (100.0%)
Download complete
This actually works for me... both with and without the .flv on the end, and the resulting file is a 7-second video of a guy looking at a webcam.
Using "smplayer" for Windows, I can connect to CloudFront with the rtmp:// URL and stream the video, but it only works without the .flv on the end, using:
MPlayer Redxii-SVN-r36243-4.6.3 (C) 2000-2013 MPlayer Team
Custom build by Redxii, http://smplayer.sourceforge.net
Compiled against FFmpeg version N-52798-gf5846dc
Build date: Sun May 5 23:51:25 EDT 2013
This doesn't quite answer your question of why it isn't working, except to say that your player seems to be lying to you about "Connection attempt rejected by FMS server", because, at least from here, it's good, except for this part, which I don't know the meaning of:
WARNING: HandShake: client signature does not match!
However, that could just be a distraction.
It looks as if it's going to be your player... so trying other players would be worthwhile.
It is, of course, possible that there's a regional issue involving the particular CloudFront edge location that you access from your location, which could be significantly different from the one I'm hitting, since it's geographically... but if another player works where you are, then you may have the answer you're looking for. Firing up Wireshark and analyzing the protocol exchange could be an interesting exercise also.
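If you do try the Wireshark route, a minimal capture would be something like the following, assuming the stream really is plain RTMP on the default port 1935 (the interface name and output file are placeholders):
# Capture the RTMP exchange for later inspection in Wireshark
$ sudo tcpdump -i en0 -s 0 -w rtmp_capture.pcap 'tcp port 1935'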
Afterthought: the extra slash in your path could also be blowing something's mind, since an RTMP URL apparently consists of two distinct components, "application"/"stream_name", and the point of delineation may be ambiguous at some level to some component in the chain. If CloudFront thinks the "application" is "cfx" and the stream is "st/vid_test001", but the client assumes the "application" is "cfx/st" with stream name "vid_test001", it seems like there could be some potential for interoperability trouble there. This is wild speculation, but perhaps worth experimenting with, too.
The embed parameter urlIncludesFMSApplicationInstance needs to be set to true.