MapReduce without Hadoop - mapreduce

I am new to MapReduce programming. I want to know if I can run a MapReduce program as a normal Java program without using Hadoop. Which libraries would I need to include? Is it possible?

It is possible, but in that case you need to write every code block yourself, from map through shuffle/sort to reduce. To put it simply, Hadoop is a framework that provides a lot of APIs to run MapReduce jobs: it takes care of reading the input from files, doing the shuffle and sort, and then invoking the reduce function. You just need to understand the various Hadoop APIs and the flow of data, that's it.
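For example, a minimal sketch of that whole flow in plain Java (no Hadoop on the classpath; assumes Java 9+ for List.of) might look like this:

    import java.util.*;
    import java.util.stream.*;

    public class LocalWordCount {
        public static void main(String[] args) {
            // Stand-in for reading input splits from files.
            List<String> lines = List.of("hello world", "hello map reduce");

            Map<String, Integer> counts = lines.stream()
                    // map phase: emit one word per input record
                    .flatMap(line -> Arrays.stream(line.split("\\s+")))
                    // shuffle & sort: group by key in sorted order (what Hadoop
                    // does between map and reduce), then reduce: sum per key
                    .collect(Collectors.groupingBy(
                            word -> word,
                            TreeMap::new,
                            Collectors.summingInt(word -> 1)));

            counts.forEach((word, n) -> System.out.println(word + "\t" + n));
        }
    }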

Related

Informatica Cloud Incremental Load

Hi guys,
I need to create a mapping to do incremental loads in Informatica Cloud. I know that I can do that with parameter files and the $LastRunTime variable. But if I use flat files as parameter files, those parameters can be deleted, and using $LastRunTime I could end up with temporal gaps in the target.
Are there other ways to do incremental loads? Maybe using a lookup, or a way to use two sources in the same mapping, one reading the last written data and the second reading the source data; after that, compare both and take the latest.
Any mechanism that reliably allows you to identify which records in your source need to be loaded into your target could be used to build an incremental ETL load - but without knowing your data it is impossible for anyone to tell you what would work for you.
You also need to distinguish between what would work in principle and what would work in practice. For example, comparing your source and target datasets might work with small datasets but would quickly become impractical as the size of either dataset grew.
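To make that last point concrete, here is a hypothetical sketch (Java, all names made up) of the naive comparison approach; it works, but holding every target key in memory is exactly what stops scaling as the tables grow:

    import java.util.*;

    public class NaiveIncrementalCompare {
        // Returns the source keys not yet present in the target.
        public static List<String> newKeys(List<String> sourceKeys,
                                           List<String> targetKeys) {
            // The whole target keyset must fit in memory - fine for small
            // tables, impractical once either dataset grows large.
            Set<String> seen = new HashSet<>(targetKeys);
            List<String> delta = new ArrayList<>();
            for (String key : sourceKeys) {
                if (!seen.contains(key)) {
                    delta.add(key);
                }
            }
            return delta;
        }
    }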

When does an action not run on the driver in Apache Spark?

I have just started with Spark and am struggling with the concept of tasks.
Can anyone please help me understand when an action (say, reduce) does not run in the driver program?
From the Spark tutorial:
"Aggregate the elements of the dataset using a function func (which
takes two arguments and returns one). The function should be
commutative and associative so that it can be computed correctly in
parallel. "
I'm currently experimenting with an application which reads a directory of 'n' files and counts the number of words.
From the web UI, the number of tasks is equal to the number of files, and all the reduce functions are taking place on the driver node.
Can you please describe a scenario where the reduce function won't execute at the driver? Does a task always include "transformation + action", or only "transformation"?
All the actions are performed on the cluster, and the results of the actions may end up on the driver (depending on the action).
Generally speaking, the Spark code you write around your business logic is not the program that actually runs - rather, Spark uses it to create a plan which will execute your code on the cluster. The plan creates a task out of all the operations that can be done on a partition without the need to shuffle data around. Every time Spark needs the data arranged differently (e.g. after sorting), it will create a new task and a shuffle between the first and the latter tasks.
I'll take a stab at this, although I may be missing part of the question. A task is indeed always transformation(s) plus an action. The transformations are lazy and would not submit anything, hence the need for an action. You can always call .toDebugString on your RDD to see where each job split will be; each level of indentation is a new stage. I think the reduce function showing up on the driver is a bit of a misnomer, as it will first run in parallel on the workers and then merge the results at the driver. So I would expect that the task does indeed run on the workers as far as it can.
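To illustrate, here is a hypothetical word-count sketch using the Java API (assuming Spark 2.x; the input path is made up). The reduceByKey forces a shuffle, so toDebugString prints two indentation levels, i.e. two stages:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.*;
    import scala.Tuple2;

    public class WordCountStages {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("wordcount").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaPairRDD<String, Integer> counts = sc
                    .textFile("input/")                  // one task per split
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);          // wide: needs a shuffle

            // Each level of indentation marks a stage (shuffle boundary).
            System.out.println(counts.toDebugString());

            counts.collect();  // the action that actually triggers the job
            sc.close();
        }
    }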

HBase BulkLoad without MapReduce

I'm wondering if it is possible to write a Java program that does a BulkLoad on HBase. I'm on a Hadoop cluster, but for certain reasons I can't write a MapReduce job.
Thanks
BulkLoad works with HFiles. So if you already have HFiles, you can directly use LoadIncrementalHFiles to handle the bulk load.
Generally we use MapReduce, which can convert the data into the above format, and then perform the bulk load.
If you have a CSV file, you can use the ImportTsv utility to process your data into HFiles; use this link for more information.
It depends on which format your data is currently in.
One point to note: bulk load does not use the write-ahead log (WAL). It skips that step and adds data at a faster rate. If you have any other framework depending on the WAL, consider other options for adding data to HBase. Happy coding.
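If the HFiles already exist, a minimal sketch of the no-MapReduce path could look like this (classic HBase 1.x API; the table name and directory are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class BulkLoadWithoutMR {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Directory of prepared HFiles, one subdirectory per column family
            Path hfileDir = new Path("/tmp/hfiles");      // hypothetical path
            HTable table = new HTable(conf, "my_table");  // hypothetical table
            new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
            table.close();
        }
    }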

Hadoop Streaming C++ getTaskId

I've been trying to find a way to get (or pass) the taskId to my mapper in C++. I'm using Hadoop Streaming. So far I have only found out how to get it in Java. I need the task ID because I'm trying to write a file to HDFS using libhdfs (C), but when I try to append concurrently it fails because of the lease. Otherwise I'll have to change all my code to Java.
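For reference, this is roughly how I get it in Java (a minimal sketch with the mapreduce API; the class and types are just my example):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TaskIdMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // e.g. attempt_201401011200_0001_m_000000_0
            String taskId = context.getTaskAttemptID().toString();
            context.write(new Text(taskId), value);
        }
    }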
Thanks for your attention.
I figured out that instead of using Hadoop Streaming, I could use Hadoop Pipes to get the taskID. However, I was not able to print to HDFS that way, so I changed my InputFormat/RecordReader and used the key received in the mapper to create files with different names.

How to measure the amount of data transmitted by my MPI program?

I'm experimenting with my distributed clustering algorithm (implemented with MPI) on 24 computers that I set up as a cluster using BCCD (Bootable Cluster CD), which can be downloaded at http://bccd.net/.
I've written a batch program to run my experiment, which consists of running my algorithm several times while varying the number of nodes and the size of the input data.
I want to know the amount of data used in the MPI communications for each run of my algorithm, so I can see how the amount of data changes when varying the previously mentioned parameters. And I want to do all this automatically using a batch program.
Someone told me to use tcpdump, but I ran into some difficulties with this approach.
First, I don't know how to call tcpdump from my batch program (which is written in C++ and uses the system() call to run external commands) before each run of my algorithm, since tcpdump requires another terminal to run in parallel with my application. And I can't run tcpdump on another computer, since the network uses a switch, so I need to run it on the master node.
Second, I watched the traffic with tcpdump while my experiment was going on, and I couldn't figure out which port MPI was using; it seems to use many ports. I wanted to know that in order to filter the packets.
Third, I tried capturing whole packets and saving them to a file using tcpdump, and within a few seconds the file was 3.5 MB. But my whole experiment takes 2 days, so the final log file would be huge if I followed this approach.
The ideal approach would be to capture just the size field in the header of each packet and sum these up to obtain the total amount of data transmitted. That way the log file would be much smaller than if I captured whole packets. But I don't know how to do it.
Another restriction is that I don't have access to the computers' disks, so I only have the RAM and my 4 GB USB flash drive; I can't keep huge log files.
I have already thought about using an MPI tracing or profiling tool such as those mentioned at http://www.open-mpi.org/faq/?category=perftools. So far I have only tested Sun Performance Analyzer. The problem is that I guess it will be difficult, maybe even impossible, to install those tools on BCCD. In addition, such a tool would make my experiment take longer to finish, since it adds overhead. But if someone is familiar with BCCD and thinks one of those tools is a good choice, please let me know.
I hope someone has a solution.
Implementations like tcpdump won't work anyway if there are multi-core nodes which use shared memory to communicate.
Using something like MPE is almost certainly the way to go. Those tools add very little overhead, and some overhead is always going to be necessary if you want to count messages. You can use mpitrace to write out every MPI call and parse the resulting text file yourself. By the way, note that MPE is explicitly discussed on the BCCD website. MPICH2 comes with MPE built in, but it can be compiled for any implementation. I've only found a very modest overhead for MPE.
IPM is another nice tool that counts messages and their sizes; you should be able to either parse the XML output or use the postprocessing tools and just manually integrate the graphs (say, either bytes_rx/bytes_tx by rank, or the message buffer size/count graph). The overhead for IPM is even less than for MPE, and it mostly comes after the program has finished running, to do the file I/O.
If you were really worried about the overhead of either of these approaches, you could always write your own MPI wrappers using the profiling interface, wrapping MPI_Send, MPI_Recv, etc., counting the number of bytes sent and received by each process, and outputting only that total at the end.