Flink read data from HDFS

I'm new to Flink and I am wondering how to read data from HDFS. Can anybody give me some advice or some simple examples? Thank you all.

If your files are in plain text format, you can use the readTextFile method of the ExecutionEnvironment object.
Here is an overview of the various data sources: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/batch/index.html#data-sources
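For example, a minimal sketch of that approach with the Scala DataSet API (the HDFS host, port, and path are assumptions, not from the original question):

import org.apache.flink.api.scala._

object ReadHdfsText {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // readTextFile returns a DataSet[String], one element per line of the file
    val lines: DataSet[String] = env.readTextFile("hdfs://namenode:9000/path/to/input.txt") // assumed path
    lines.print()
  }
}

Note that resolving hdfs:// URLs requires the Hadoop classes on Flink's classpath (see the HADOOP_CLASSPATH remark in the last answer below).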

Flink can read HDFS data in any of the common formats, such as text, JSON, or Avro.
Support for Hadoop input/output formats is part of the flink-java Maven module, which is required anyway when writing Flink jobs.
Sample 1: read a text file named JsonSeries and print it to the console
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> lines = env.readTextFile("hdfs://localhost:9000/user/hadoop/input/JsonSeries.txt")
        .name("HDFS File read");
lines.print();
Sample 2: read the same file using a Hadoop input format
DataSet<Tuple2<LongWritable, Text>> inputHadoop =
        env.createInput(HadoopInputs.readHadoopFile(new TextInputFormat(),
                LongWritable.class, Text.class, "hdfs://localhost:9000/user/hadoop/input/JsonSeries.txt"));
inputHadoop.print();

With Flink 1.13, Hadoop 3.1.2, and Java 1.8.0 on a CentOS 7 machine, I was able to read from HDFS.
HADOOP_HOME and HADOOP_CLASSPATH were already exported. I think something changed starting with version 1.11, and I couldn't find even a simple example, so I am sharing mine.
I added the following dependency to pom.xml:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.1.2</version>
</dependency>
My Scala code:
package com.vbo.datastreamapi

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object ReadWriteHDFS extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val stream = env.readTextFile("hdfs://localhost:9000/user/train/datasets/Advertising.csv")
  stream.print()
  env.execute("Read Write HDFS")
}
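Since the object is named ReadWriteHDFS, here is a sketch of the write direction as well, using writeAsText (the output path below is an assumption; writeAsText still works in 1.13, but recent Flink releases recommend the FileSink connector instead):

import org.apache.flink.streaming.api.scala._

object WriteBackHDFS extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  // read the same dataset and write each line back to HDFS as plain text
  val stream = env.readTextFile("hdfs://localhost:9000/user/train/datasets/Advertising.csv")
  stream.writeAsText("hdfs://localhost:9000/user/train/output/advertising_copy") // assumed output path
  env.execute("Write back to HDFS")
}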

Related

Where is the output file created after running FileWriter in AWS EMR?

This is how I am writing to a file (Scala code):
import java.io.FileWriter
val fw = new FileWriter("my_output_filename.txt", true)
fw.write("something to write into output file")
fw.close()
This is part of a Spark job I'm running on AWS EMR. The job runs and finishes successfully. The problem is that I'm unable to locate my_output_filename.txt anywhere once it's done.
A bit more context:
What I'm trying to do: some processing on each row of a dataframe, writing the result into a file, so it looks something like this:
myDF.collect().foreach( row => {
  import java.io.FileWriter
  val fw = new FileWriter("my_output_filename.txt", true)
  fw.write("row data to be written into file")
  fw.close()
})
How I checked:
When I ran it locally, I found the newly created file in the same directory as the code, but I couldn't find it on the remote node.
I ran find / -name "my_output_filename.txt".
I also checked in HDFS: hdfs dfs -find / -name "my_output_filename.txt"
Where can I find the output file?
Is there a better way to do this?
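Since collect().foreach runs in the driver JVM, the FileWriter output most likely ends up in the driver's working directory on whichever node runs the driver, not in HDFS. For comparison, a sketch of letting Spark write the rows itself to a distributed path (the object name, input and output paths are assumptions, not from the post):

import org.apache.spark.sql.SparkSession

object RowDump {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RowDump").getOrCreate()
    import spark.implicits._

    // stand-in for the existing myDF from the post (assumed input path)
    val myDF = spark.read.option("header", "true").csv("hdfs:///user/hadoop/input/data.csv")

    myDF
      .map(row => row.mkString(","))              // build the per-row text on the executors
      .write
      .mode("overwrite")
      .text("hdfs:///user/hadoop/output/my_output") // or an s3:// path on EMR
  }
}

With this approach the result shows up as part files under the output directory, which you can list with hdfs dfs -ls.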

Increase HADOOP_HEAPSIZE in Amazon EMR to run a job with a few million input files

I am running into an issue with my EMR jobs where too many input files cause out-of-memory errors. Doing some research, I think changing the HADOOP_HEAPSIZE config parameter is the solution. Old Amazon forums from 2010 say it cannot be done.
Can we do that now, in 2018?
I run my jobs using the C# API for EMR, and normally I set configurations using statements like the ones below. Can I set HADOOP_HEAPSIZE using similar commands?
config.Args.Insert(2, "-D");
config.Args.Insert(3, "mapreduce.output.fileoutputformat.compress=true");
config.Args.Insert(4, "-D");
config.Args.Insert(5, "mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec");
config.Args.Insert(6, "-D");
config.Args.Insert(7, "mapreduce.map.output.compress=true");
config.Args.Insert(8, "-D");
config.Args.Insert(9, "mapreduce.task.timeout=18000000");
If I need to bootstrap using a file, I can do that too, if someone can show me the contents of the file for the config change.
Thanks
I figured it out...
I created a shell script to increase the memory size on the master machine (code at the end)...
I run a bootstrap action like this:
ScriptBootstrapActionConfig bootstrapActionScriptForHeapSizeIncrease = new ScriptBootstrapActionConfig
{
    Path = "s3://elasticmapreduce/bootstrap-actions/run-if",
    Args = new List<string> { "instance.isMaster=true", "<s3 path to my shell script>" },
};
The shell script code is this:
#!/bin/bash
SIZE=8192
if ! [ -z "$1" ]; then
  SIZE=$1
fi
echo "HADOOP_HEAPSIZE=${SIZE}" >> /home/hadoop/conf/hadoop-user-env.sh
Now I am able to run an EMR job with the master machine type as r3.xlarge and process 31 million input files.

Hadoop MapReduce job is not running

I have created a basic MapReduce program and built a jar file out of it. When I try to run it from the console like this:
[cloudera@localhost ~]$ hadoop jar /home/cloudera/Desktop/csvjar.jar testpackage.Mapreduce /import/climate /output5
nothing happens: there is no error and no MapReduce status. It just displays
[cloudera@localhost ~]$
Mapreduce is the class where the map, reduce, and main functions reside. The jar file is kept both on the local machine and on HDFS; I have tried both paths, and nothing happened in either case. The output5 folder does not exist in HDFS.
I also ran into the same issue. In my code, I had missed the closing brace of the argument check in the driver code. I am attaching that part of the code, with the "}", for reference.
if (otherArgs.length != 3) {
    System.err.println("Number of argument passed is not 3");
    System.exit(1);
}
I hope this helps.

Weka: how to generate libsvm training parameters

I am running libsvm through Weka. Its output accuracy looks good to me, so I am planning to write an SVM model myself. However, Weka didn't generate any training parameters, such as the number of support vectors, so I cannot do anything with it. Searching the web, I found somebody saying that it should generate some parameters like the following:
optimization finished, #iter = 27
nu = 0.058475864943863545
obj = -1.871013102744184, rho = -0.19357337828800944
nSV = 9, nBSV = 0
Total nSV = 9
But why didn't I see any of them? Is there a step I missed? Please help me. Thanks a lot.
Weka writes the output you mentioned to stderr.
So if you have started weka.sh or weka.bat from a terminal (or "command window" if you are on Windows), you should see that output appear in your terminal window after clicking "classify".
If you want to have access to this information via scripts, you can
redirect the output to a file and read in that file.
Here is how to edit the startup file weka.sh / weka.bat.
Edit this line (it is probably the last line) in order to write log info to a file instead of the terminal window:
java -cp $CP -Xmx8092m weka.gui.GUIChooser 2>>/opt/weka-stable/weka.log &
You can also add a properties file to your home directory to add more fine-grained behaviour.
https://weka.wikispaces.com/Properties+file
(You can probably also access this information via the Weka Java API, but you did not ask for that.)

Libtorrent - Given a magnet link, how do you generate a torrent file?

I have read through the manual and I cannot find the answer. Given a magnet link, I would like to generate a torrent file so that it can be loaded on the next startup to avoid redownloading the metadata. I have tried the fast resume feature, but I still have to fetch the metadata when I do it, and that can take quite a bit of time. The examples I have seen are for creating torrent files for a new torrent, whereas I would like to create one matching a magnet URI.
Solution found here:
http://code.google.com/p/libtorrent/issues/detail?id=165#c5
See creating torrent:
http://www.rasterbar.com/products/libtorrent/make_torrent.html
Modify the first lines:
file_storage fs;
// recursively adds files in directories
add_files(fs, "./my_torrent");
create_torrent t(fs);
To this:
torrent_info ti = handle.get_torrent_info();
create_torrent t(ti);
"handle" is from here:
torrent_handle add_magnet_uri(session& ses, std::string const& uri, add_torrent_params p);
Also, before creating the torrent, you have to make sure that the metadata has been downloaded; do this by calling handle.has_metadata().
UPDATE
It seems the libtorrent Python API is missing some of the important C++ API required to create a torrent from a magnet: the example above won't work in Python because the create_torrent Python class does not accept torrent_info as a parameter (C++ has it available).
So I tried it another way, but also hit a brick wall that makes it impossible. Here is the code:
if handle.has_metadata():
    torinfo = handle.get_torrent_info()
    fs = libtorrent.file_storage()
    for file in torinfo.files():
        fs.add_file(file)
    torfile = libtorrent.create_torrent(fs)
    torfile.set_comment(torinfo.comment())
    torfile.set_creator(torinfo.creator())
    for i in xrange(0, torinfo.num_pieces()):
        hash = torinfo.hash_for_piece(i)
        torfile.set_hash(i, hash)
    for url_seed in torinfo.url_seeds():
        torfile.add_url_seed(url_seed)
    for http_seed in torinfo.http_seeds():
        torfile.add_http_seed(http_seed)
    for node in torinfo.nodes():
        torfile.add_node(node)
    for tracker in torinfo.trackers():
        torfile.add_tracker(tracker)
    torfile.set_priv(torinfo.priv())
    f = open(magnet_torrent, "wb")
    f.write(libtorrent.bencode(torfile.generate()))
    f.close()
There is an error thrown on this line:
torfile.set_hash(i, hash)
It expects hash to be a const char*, but torrent_info.hash_for_piece(int) returns the big_number class, which has no API to convert it back to a const char*.
When I find some time I will report this missing API to the libtorrent developers, as it is currently impossible to create a .torrent file from a magnet URI when using the Python bindings.
torrent_info.orig_files() is also missing from the Python bindings; I'm not sure whether torrent_info.files() is sufficient.
UPDATE 2
I've created an issue on this, see it here:
http://code.google.com/p/libtorrent/issues/detail?id=294
Star it so they fix it fast.
UPDATE 3
It is fixed now; there is a 0.16.0 release. Binaries for Windows are also available.
Just wanted to provide a quick update using the modern libtorrent Python package: libtorrent now has the parse_magnet_uri method, which you can use to generate a torrent handle:
import libtorrent, os, time

def magnet_to_torrent(magnet_uri, dst):
    """
    Args:
        magnet_uri (str): magnet link to convert to torrent file
        dst (str): path to the destination folder where the torrent will be saved
    """
    # Parse magnet URI parameters
    params = libtorrent.parse_magnet_uri(magnet_uri)

    # Download torrent info
    session = libtorrent.session()
    handle = session.add_torrent(params)
    print("Downloading metadata...")
    while not handle.has_metadata():
        time.sleep(0.1)

    # Create torrent and save to file
    torrent_info = handle.get_torrent_info()
    torrent_file = libtorrent.create_torrent(torrent_info)
    torrent_path = os.path.join(dst, torrent_info.name() + ".torrent")
    with open(torrent_path, "wb") as f:
        f.write(libtorrent.bencode(torrent_file.generate()))
    print("Torrent saved to %s" % torrent_path)
If saving the resume data didn't work for you, you can generate a new torrent file using the information from the existing connection.
fs = libtorrent.file_storage()
libtorrent.add_files(fs, "somefiles")
t = libtorrent.create_torrent(fs)
t.add_tracker("http://10.0.0.1:312/announce")
t.set_creator("My Torrent")
t.set_comment("Some comments")
t.set_priv(True)
libtorrent.set_piece_hashes(t, "C:\\", lambda x: 0)
f = open("mytorrent.torrent", "wb")
f.write(libtorrent.bencode(t.generate()))
f.close()
I doubt that it'll make the resume faster than the function built specifically for this purpose.
Take a look at this code: http://code.google.com/p/libtorrent/issues/attachmentText?id=165&aid=-5595452662388837431&name=java_client.cpp&token=km_XkD5NBdXitTaBwtCir8bN-1U%3A1327784186190
It uses add_magnet_uri, which I think is what you need.