No such file exists while running Hadoop Pipes using C++

While running a Hadoop MapReduce program using Hadoop Pipes, a file that is present in HDFS is not found by the MapReduce job. If the program is executed without Hadoop Pipes, the file is found by the libhdfs library without trouble, but when running the program with the
hadoop pipes -input i -output o -program p
command, the file is not found by libhdfs and a java.io exception is thrown. I have tried to include the -fs parameter in the command, but with the same result. I have also prefixed the files with hdfs://localhost:9000/, still with no result. The file parameter is set inside the C code as:
const char *file = "/path/to/file/in/hdfs";   /* or "hdfs://localhost:9000/path/to/file" */
hdfsFS fs = hdfsConnect("localhost", 9000);
hdfsFile input = hdfsOpenFile(fs, file, O_RDONLY, 0, 0, 0);

Found the problem: the files in HDFS are not available to the MapReduce task nodes. Instead, I had to pass the files to the distributed cache through the archive option, after compressing them into a single tar file. This can also be achieved by writing a custom InputFormat class and providing the files in the input parameter.
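For illustration, assuming the archive is shipped to the distributed cache with something like -archives hdfs://localhost:9000/cache/mydata.tar#mydata (the path and the #mydata link name are made up), it is unpacked on each task node and the side data can then be read with ordinary local I/O from the task's working directory; a minimal sketch:

// Minimal sketch: read a file out of the unpacked distributed-cache archive
// via the "mydata" symlink assumed to be created in the task's working directory.
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream in("mydata/lookup.txt");   // hypothetical file inside the archive
    if (!in) {
        std::cerr << "side file not found in the task working directory\n";
        return 1;
    }
    std::string line;
    while (std::getline(in, line)) {
        std::cout << line << '\n';           // use the side data here instead of printing it
    }
    return 0;
}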

Related

How to tar a folder in HDFS?

Just like the Unix command tar -czf xxx.tgz xxx/, is there a way to do the same thing in HDFS? I have a folder in HDFS with over 100k small files, and I want to download it to the local file system as fast as possible. hadoop fs -get is too slow, and I know hadoop archive can output a har, but it doesn't seem to solve my problem.
From what I see here,
https://issues.apache.org/jira/browse/HADOOP-7519
it is not possible to perform a tar operation using hadoop commands. This has been filed as an improvement, as linked above, but it has not been resolved and is not available to use yet.
Hope this answers your question.
Regarding your scenario: having 100k small files in HDFS is not good practice. You could find a way to merge them all (perhaps by creating tables through Hive or Impala from this data), or move all the small files to a single folder in HDFS and use hadoop fs -copyToLocal <HDFS_FOLDER_PATH> to get the whole folder, along with all the files in it, to your local machine.

Getting status of gsutil cp command in parallel mode

This command copies a huge number of files from Google Cloud storage to my local server.
gsutil -m cp -r gs://my-bucket/files/ .
There are 200+ files, each of which is over 5GB in size.
Once all files are downloaded, another process kicks in, starts reading the files one by one, and extracts the info needed.
The problem is that even though the gsutil copy process is fast and downloads files in parallel at a very high speed, I still need to wait until all the files are downloaded before starting to process them.
Ideally I would like to start processing the first file as soon as it is downloaded. But in parallel cp mode there seems to be no way of knowing when an individual file has been downloaded (or is there?).
From the Google docs, this can be done in individual file copy mode:
if ! gsutil cp ./local-file gs://your-bucket/your-object; then
<< Code that handles failures >>
fi
That means that if I run cp without the -m flag, I get a boolean success indicator for each file and can kick off the processing of that file.
The problem with this approach is that the overall download will take much longer, as the files are now downloaded one by one.
Any insight?
One thing you could do is have a separate process that periodically lists the directory, filters out the files that are incompletely downloaded (they are downloaded to a filename ending with '.gstmp' and then renamed after the download completes), and keeps track of the files you haven't yet processed. You could terminate this periodic listing process when the gsutil cp process completes, or just leave it running so that it processes downloads the next time you download all the files.
Two potential complications with doing that are:
If the number of files being downloaded is very large, the periodic directory listings could be slow. How big "very large" is depends on the type of file system you're using. You could experiment by creating a directory with the approximate number of files you expect to download, and seeing how long it takes to list. Another option would be to use the gsutil cp -L option, which builds a manifest showing what files have been downloaded. You could then have a loop reading through the manifest, looking for files that have downloaded successfully.
If the multi-file download fails partway through (e.g., due to a network connection that drops for longer than gsutil will retry), you'll end up with a partial set of files. For this case you might consider using gsutil rsync, which can be restarted and will pick up where you left off.
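A rough sketch of the periodic-listing approach (C++17 std::filesystem; process_file(), the directory name, and the poll interval are all placeholders):

// Poll the download directory, skip files still ending in ".gstmp"
// (still being written by gsutil), and hand each finished file to
// process_file() exactly once.
#include <chrono>
#include <filesystem>
#include <iostream>
#include <set>
#include <thread>

namespace fs = std::filesystem;

void process_file(const fs::path &p)        // placeholder for the real extraction step
{
    std::cout << "processing " << p << "\n";
}

int main()
{
    const fs::path dir = "./files";          // assumed local download directory
    std::set<fs::path> done;

    while (true) {                           // stop this loop once gsutil cp has exited
        for (const auto &entry : fs::directory_iterator(dir)) {
            const fs::path p = entry.path();
            if (p.extension() == ".gstmp")   // incomplete download
                continue;
            if (done.insert(p).second)       // newly completed file
                process_file(p);
        }
        std::this_thread::sleep_for(std::chrono::seconds(10));
    }
}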

Is there any way to open and read a file over an SSH connection?

I have access to a server where there is a lot of data. I can't copy all of the data to my computer.
I can't compile the program I want on the server because the server doesn't have all the libs I need.
I don't think the server admin would be very happy to see me coming and asking him to install some libs just for me...
So I am trying to figure out whether there is a way to open a file, as with
FILE *fopen(const char *filename, const char *mode);
or
void std::ifstream::open(const char* filename, ios_base::openmode mode = ios_base::in);
but over an SSH connection, and then read the file as I would in a usual program.
Both my computer and the server are running Linux.
I assume you are working on your Linux laptop and the remote machine is some supercomputer.
First, some non-technical advice: ask for permission to access the data remotely first. In some workplaces you are not allowed to do that, even if it is technically possible.
You could sort of use libssh for that purpose, but you'll need some coding and to read its documentation (a rough sketch is given at the end of this answer).
You could consider using some FUSE file system (on your laptop), e.g. sshfs; you would then be able to access the supercomputer's files as e.g. /sshfilesystem/foo.bar. It is probably the slowest solution, and probably not a very reliable one. I don't really recommend it.
You could ask permission to use NFS mounts.
Maybe you could consider HTTPS access (if the remote computer offers it for your files) using an HTTP/HTTPS client library like libcurl (or, the other way round, an HTTP/HTTPS server library like libonion).
And you might (but ask permission first!) use a TLS connection (e.g. manually start a server-like program on the remote supercomputer), perhaps through OpenSSL or libgnutls.
Finally, you could consider installing (i.e. politely asking for the installation of) some database software (e.g. a PostgreSQL, MariaDB, Redis, or MongoDB server) on the remote computer and making your program a database client application...
BTW, things might be different if you access a few dozen terabyte-sized files randomly (each run reading a few kilobytes inside them), versus a million files of which a given run accesses only a dozen with sequential reads, each file of a reasonable size (a few megabytes). In other words, DNA data, video films, HTML documents, source code, ... are all different cases!
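As for the libssh route mentioned above, here is a rough, untested sketch (it assumes key-based authentication, that the server allows SFTP, and hypothetical host/user/path values; link with -lssh, and see the libssh documentation for proper host-key verification and error checking):

#include <libssh/libssh.h>
#include <libssh/sftp.h>
#include <fcntl.h>
#include <cstdio>

int main()
{
    ssh_session ssh = ssh_new();
    ssh_options_set(ssh, SSH_OPTIONS_HOST, "supercomputer.example.org");  // hypothetical host
    ssh_options_set(ssh, SSH_OPTIONS_USER, "me");                         // hypothetical user

    if (ssh_connect(ssh) != SSH_OK ||
        ssh_userauth_publickey_auto(ssh, NULL, NULL) != SSH_AUTH_SUCCESS) {
        fprintf(stderr, "ssh error: %s\n", ssh_get_error(ssh));
        return 1;
    }

    sftp_session sftp = sftp_new(ssh);
    sftp_init(sftp);

    sftp_file f = sftp_open(sftp, "/data/big/file.bin", O_RDONLY, 0);     // hypothetical remote path
    char buf[4096];
    ssize_t n;
    while ((n = sftp_read(f, buf, sizeof buf)) > 0) {
        fwrite(buf, 1, (size_t)n, stdout);   // process each chunk instead of dumping it
    }

    sftp_close(f);
    sftp_free(sftp);
    ssh_disconnect(ssh);
    ssh_free(ssh);
    return 0;
}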
Well, the answer to your question is no, as already stated several times (unless you think about implementing ssh yourself, which is outside the scope of sanity).
But as you also describe your real problem, you're probably just asking the wrong question, so -- looking for alternatives:
Alternative 1
Link the library you want to use statically into your binary. Say you want to link libfoo statically:
Make sure you have libfoo.a (the object archive of your library) in your library search path. Often the development packages for a library provided by your distribution already contain it; if not, compile the library yourself with options that enable the creation of the static library.
Assuming the GNU toolchain, build your program with the following flags: -Wl,-Bstatic -lfoo -Wl,-Bdynamic (instead of just -lfoo).
Alternative 2
Create your binary as usual (linked against the dynamic library) and put that library (libfoo.so) e.g. in ~/lib on the server. Then run your binary there with LD_LIBRARY_PATH=~/lib ./a.out.
You can copy parts of the file to your computer over an SSH connection:
copy part of the source file to a temporary file using the dd command
copy the temporary file to your local box using scp or rsync
You can create a shell script to automate this if you need to do it multiple times.
Instead of fopen on a path, you can use popen on an ssh command. (Don't forget that FILE * streams obtained from popen are closed with pclose and not fclose).
You can simplify the interface by writing a function which wraps popen. The function accepts just the remote file name, and then generates the ssh command to fetch that file, properly escaping everything, like spaces in the file name, shell meta-characters and whatnot.
FILE *stream = popen("ssh user@host cat /path/to/remote/file", "r");
if (stream != 0) {
/* ... */
pclose(stream);
}
popen has some drawbacks because it runs a shell command. And because the argument to ssh is itself a shell command that is processed on the remote end, it raises issues of double escaping: passing a shell command through another shell command.
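A sketch of the wrapper described above, quoting the path once for the remote shell and once more for the local shell that popen() invokes (remote_fopen() and shell_quote() are hypothetical names):

#include <cstdio>
#include <string>

// Single-quote a string for a POSIX shell: wrap it in '...' and turn every
// embedded ' into the '\'' idiom.
static std::string shell_quote(const std::string &s)
{
    std::string out = "'";
    for (char c : s) {
        if (c == '\'')
            out += "'\\''";
        else
            out += c;
    }
    out += "'";
    return out;
}

// Open a remote file for reading via "ssh <host> cat <path>". The path is
// quoted for the remote shell, and the whole remote command is quoted again
// for the local shell started by popen(). Close the result with pclose().
FILE *remote_fopen(const std::string &host, const std::string &path)
{
    std::string remote_cmd = "cat " + shell_quote(path);
    std::string local_cmd  = "ssh " + shell_quote(host) + " " + shell_quote(remote_cmd);
    return popen(local_cmd.c_str(), "r");
}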
To do something more robust, you can create a pipe using pipe, then fork and exec* the ssh process, installing the write end of the pipe as its stdout, and use fdopen to create a FILE * stream on the read end of the pipe in the parent process. This way, you have accurate control over the arguments handed to the process: at least locally, you're not running a shell command.
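A minimal sketch of that approach (hypothetical ssh_cat() helper; error handling and reaping of the child process are abbreviated):

#include <unistd.h>
#include <cstdio>

// Run "ssh <host> cat <remote_path>" without a local shell and return a FILE *
// connected to its stdout. The caller reads from it, fclose()s it, and should
// reap the child (e.g. with waitpid).
FILE *ssh_cat(const char *host, const char *remote_path)
{
    int fd[2];
    if (pipe(fd) != 0)
        return NULL;

    pid_t pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return NULL;
    }
    if (pid == 0) {                      // child: become the ssh process
        dup2(fd[1], STDOUT_FILENO);      // write end of the pipe becomes its stdout
        close(fd[0]);
        close(fd[1]);
        execlp("ssh", "ssh", host, "cat", remote_path, (char *)NULL);
        _exit(127);                      // exec failed
    }
    close(fd[1]);                        // parent: keep only the read end
    return fdopen(fd[0], "r");
}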
You can't directly(1) open a file over ssh with fopen() or ifstream::open. But you can leverage the existing ssh binary. Simply have your program read from stdin, and pipe the file to it via ssh:
ssh that_server cat /path/to/largefile | ./yourprogram
(1) Well, if you mount the remote system using sshfs you can access the files over ssh as if they were local.
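For completeness, the ./yourprogram side of that pipeline can be as simple as a loop over standard input; a minimal sketch:

#include <iostream>
#include <string>

int main()
{
    std::string line;
    std::size_t count = 0;
    while (std::getline(std::cin, line)) {
        ++count;                         // replace with real processing of each line
    }
    std::cerr << "read " << count << " lines\n";
    return 0;
}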

SSH command gets terminated while compressing a directory

I am compressing a big directory (about 50 GB of files and folders) over SSH using the PuTTY command line. I am using this command:
tar czspf file.tar.gz directory/
It starts out fine, but after some time it gets terminated with the single-word message "Terminated", and compression stops when the tar archive reaches about 16 GB.
Is there any way to avoid this termination or to deal with the problem, or any other method to tar the directory that avoids the error? Thanks
You are probably hitting some kind of file size limit; not all file systems support very big files. In that case you could pipe the output of tar into a split command like this:
tar czspf - directory/ | split -b 4G - fileprefix-

Chaining Hadoop MapReduce with Pipes (C++)

Does anyone know how to chain two MapReduce jobs with the Pipes API?
I already chained two MapReduce jobs in a previous project in Java, but today I need to use C++. Unfortunately, I haven't seen any examples in C++.
Has someone already done it? Is it impossible?
Use an Oozie workflow. It allows you to use Pipes jobs along with regular MapReduce jobs.
I finally managed to make Hadoop Pipes work. Here are some steps to make the wordcount examples available in src/examples/pipes/impl/ work.
I have a working Hadoop 1.0.4 cluster, configured following the steps described in the documentation.
To write a Pipes job I had to include the Pipes library that is already compiled in the initial package. It can be found in the c++ folder for both the 32-bit and 64-bit architectures. However, I had to recompile it, which can be done with the following steps:
# cd /src/c++/utils
# ./configure
# make install
# cd /src/c++/pipes
# ./configure
# make install
Those commands compile the library for your architecture and create an 'install' directory in /src/c++ containing the compiled files.
Moreover, I had to add the -lssl and -lcrypto link flags to compile my program. Without them I ran into an authentication exception at run time.
Thanks to those steps I was able to run the wordcount-simple example that can be found in the src/examples/pipes/impl/ directory.
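For reference, the heart of such a Pipes program looks roughly like the following (reconstructed from memory of the shipped wordcount-simple example, so names and headers may differ slightly from the actual source):

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

#include <string>
#include <vector>

// Mapper: emit ("word", "1") for every whitespace-separated token in the input value.
class WordCountMap : public HadoopPipes::Mapper {
public:
  WordCountMap(HadoopPipes::TaskContext &) {}
  void map(HadoopPipes::MapContext &context) {
    std::vector<std::string> words =
        HadoopUtils::splitString(context.getInputValue(), " ");
    for (size_t i = 0; i < words.size(); ++i)
      context.emit(words[i], "1");
  }
};

// Reducer: sum the counts emitted for each word.
class WordCountReduce : public HadoopPipes::Reducer {
public:
  WordCountReduce(HadoopPipes::TaskContext &) {}
  void reduce(HadoopPipes::ReduceContext &context) {
    int sum = 0;
    while (context.nextValue())
      sum += HadoopUtils::toInt(context.getInputValue());
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

int main() {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce>());
}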
However, to run the more complex example wordcount-nopipe, I had to do a few more things. Due to the implementation of the record reader and record writer, we are reading from and writing to the local file system directly. That's why we have to specify our input and output paths with file://. Moreover, we have to use a dedicated InputFormat component. Thus, to launch this job I had to use the following command:
# bin/hadoop pipes -D hadoop.pipes.java.recordreader=false -D hadoop.pipes.java.recordwriter=false -libjars hadoop-1.0.4/build/hadoop-test-1.0.4.jar -inputformat org.apache.hadoop.mapred.pipes.WordCountInputFormat -input file:///input/file -output file:///tmp/output -program wordcount-nopipe
Furthermore, if we look at org.apache.hadoop.mapred.pipes.Submitter.java in the 1.0.4 version, the current implementation disables the ability to specify a non-Java record reader if you use the InputFormat option.
Thus you have to comment out the line setIsJavaRecordReader(job, true); to make it possible, and recompile the core sources to take this change into account (http://web.archiveorange.com/archive/v/RNVYmvP08OiqufSh0cjR).
if (results.hasOption("-inputformat")) {
    setIsJavaRecordReader(job, true);
    job.setInputFormat(getClass(results, "-inputformat", job, InputFormat.class));
}