Running Nutch crawls on EMR (newbie) - amazon-web-services

I'm a first-time EMR/Hadoop user and a first-time Apache Nutch user. I'm trying to use Apache Nutch 2.1 to do some screen scraping. I'd like to run it on Hadoop, but I don't want to set up my own cluster (one learning curve at a time), so I'm using EMR. And I'd like S3 to be used for output (and whatever input I need).
I've been reading the setup wikis for Nutch:
http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/NutchHadoopTutorial
And they've been very helpful in getting me up to speed on the very basics of Nutch. I realize I can build Nutch from source, preconfigure some regexes, and then be left with a Hadoop-friendly job file:
$NUTCH_HOME/runtime/deploy/apache-nutch-2.1.job
Most of the tutorials culminate in a crawl command being run. In the Hadoop examples, it's:
hadoop jar nutch-${version}.jar org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 5
And in the local deployment example it's something like:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
My question is as follows. What do I have to do to get my apache-nutch-2.1.job to run on EMR? What arguments do I pass it? For the Hadoop crawl example above, the "urls" file is already on HDFS with the seed URLs. How do I do this on EMR? Also, what do I specify on the command line to have my final output go to S3 instead of HDFS?

To start off, this cannot really be done using the GUI. Instead, I got Nutch working using the AWS Java API.
I have my seed files located in S3, and I transfer my output back to S3.
I use the s3distcp jar to copy the data from S3 to HDFS.
Here is my basic step config. MAINCLASS is the fully qualified class name of your Nutch crawl entry point, something like org.apache.nutch.crawl.Crawl (the class passed as the first argument of the "Run Main Job" step below).
// Working directory on the cluster's HDFS; s3distcp stages the S3 data here.
String HDFSDIR = "/user/hadoop/data/";

// Step 1: copy the input data from S3 to HDFS with s3distcp.
stepconfigs.add(new StepConfig()
    .withName("Add Data")
    .withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
    .withHadoopJarStep(new HadoopJarStepConfig(prop.getProperty("S3DISTCP"))
        .withArgs("--src", prop.getProperty("DATAFOLDER"), "--dest", HDFSDIR)));

// Step 2: run the Nutch crawl itself from the job jar.
stepconfigs.add(new StepConfig()
    .withName("Run Main Job")
    .withActionOnFailure(ActionOnFailure.CONTINUE)
    .withHadoopJarStep(new HadoopJarStepConfig("nutch-1.7.jar")
        .withArgs("org.apache.nutch.crawl.Crawl", prop.getProperty("CONF"), prop.getProperty("STEPS"), "-id=" + jobId)));

// Step 3: copy the crawl output back from HDFS to S3.
stepconfigs.add(new StepConfig()
    .withName("Pull Output")
    .withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
    .withHadoopJarStep(new HadoopJarStepConfig(prop.getProperty("S3DISTCP"))
        .withArgs("--src", HDFSDIR, "--dest", prop.getProperty("DATAFOLDER"))));

// Submit the job flow; the proxy settings are only needed behind a corporate proxy.
new AmazonElasticMapReduceClient(
        new PropertiesCredentials(new File("AwsCredentials.properties")),
        proxy ? new ClientConfiguration().withProxyHost("dawebproxy00.americas.nokia.com").withProxyPort(8080) : null)
    .runJobFlow(new RunJobFlowRequest()
        .withName("Job: " + jobId)
        .withLogUri(prop.getProperty("LOGDIR"))
        .withAmiVersion(prop.getProperty("AMIVERSION"))
        .withSteps(getStepConfig(prop, jobId))
        .withBootstrapActions(getBootStrap(prop))
        .withInstances(getJobFlowInstancesConfig(prop)));
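If you would rather stay in Python than use the AWS Java API, the same three steps can be sketched with boto3. This is only a hedged sketch: the bucket paths, jar locations, release label, instance sizes, and the crawl arguments (which follow the tutorial-style invocation from the question) are placeholders, not values from the answer above.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

hdfs_dir = "/user/hadoop/data/"
s3_data = "s3://my-bucket/nutch/data/"                               # placeholder for DATAFOLDER
s3distcp_jar = "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar"    # placeholder jar path
nutch_jar = "s3://my-bucket/jars/apache-nutch-2.1.job"               # placeholder jar path

steps = [
    {"Name": "Add Data",
     "ActionOnFailure": "TERMINATE_CLUSTER",
     "HadoopJarStep": {"Jar": s3distcp_jar,
                       "Args": ["--src", s3_data, "--dest", hdfs_dir]}},
    {"Name": "Run Main Job",
     "ActionOnFailure": "CONTINUE",
     "HadoopJarStep": {"Jar": nutch_jar,
                       # Tutorial-style crawl arguments; adjust to your own build.
                       "Args": ["org.apache.nutch.crawl.Crawl", hdfs_dir + "urls",
                                "-dir", hdfs_dir + "crawl", "-depth", "3", "-topN", "5"]}},
    {"Name": "Pull Output",
     "ActionOnFailure": "TERMINATE_CLUSTER",
     "HadoopJarStep": {"Jar": s3distcp_jar,
                       "Args": ["--src", hdfs_dir, "--dest", s3_data]}},
]

emr.run_job_flow(
    Name="nutch-crawl",
    LogUri="s3://my-bucket/emr-logs/",
    ReleaseLabel="emr-5.36.0",                     # placeholder; use whatever release you run
    Instances={"MasterInstanceType": "m5.xlarge",
               "SlaveInstanceType": "m5.xlarge",
               "InstanceCount": 3,
               "KeepJobFlowAliveWhenNoSteps": False},
    Steps=steps,
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)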

Related

How to configure Spark / Glue to avoid creation of empty $_folder_$ after Glue job successful execution

I have a simple Glue ETL job which is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back into an S3 bucket. The job completes successfully. However, the empty "$folder$" objects that Spark generates remain in S3. They do not look nice in the hierarchy and cause confusion. Is there any way to configure Spark or the Glue context to hide/remove these folders after successful completion of the job?
(screenshot: the empty $folder$ entries as they appear in the S3 console)
OK, finally, after a few days of testing I found the solution. Before pasting the code, let me summarize what I found:
Those $folder$ entries are created by Hadoop. Apache Hadoop creates these files when it creates a folder in an S3 bucket. (Source 1)
They are actually directory markers, written as path + /. (Source 2)
To change the behavior, you need to change the Hadoop S3 write configuration in the Spark context. Read this, this, and this.
Read about S3, S3A, and S3N here and here.
Thanks to @stevel's comment here.
Now, the solution is to set the following Hadoop configuration on the Spark context.
from pyspark import SparkContext

sc = SparkContext()
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
To avoid creation of _SUCCESS files, you need to set the following configuration as well:
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Make sure you use the s3:// URI when writing to the S3 bucket, for example:
myDF.write.mode("overwrite").parquet("s3://XXX/YY", partitionBy=["DDD"])
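Putting it together inside a Glue job script, the settings sit next to the usual Glue boilerplate. This is a minimal sketch assuming the standard awsglue setup; the bucket, sample data, and partition column are placeholders.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()

# Route s3:// writes through the S3A filesystem and skip the _SUCCESS marker.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Example write; bucket name and partition column are placeholders.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "category"])
df.write.mode("overwrite").parquet("s3://my-bucket/output/", partitionBy=["category"])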

Sagemaker, get spark dataframe from data image url on S3

I am trying to obtain a Spark dataframe which contains the paths and image data for all images in my data. The data is stored as follows:
folder/image_category/image_n.jpg
I worked in a local Jupyter notebook and had no problem using the following code:
dataframe = spark.read.format("image").load(path)
I need to do the same exercise using AWS SageMaker and S3. I created a bucket following the same pattern:
s3://my_bucket/folder/image_category/image_n.jpg
I've tried a lot of possible solutions I found online, based on boto3, s3fs and other things, but unfortunately I am still unable to make it work (and I am starting to lose faith...).
Would anyone have something reliable I could base my work on?
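Not a verified answer, but one sketch worth trying is to point the image reader at an s3a:// path and let the notebook's IAM role (or explicit keys) handle authentication. The bucket name below is a placeholder, the credential lines are optional assumptions, and this presumes the Spark installation has the S3A connector jars available.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-image-load").getOrCreate()

# Optional: supply credentials explicitly if the instance role is not picked up.
# hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
# hadoop_conf.set("fs.s3a.access.key", "<ACCESS_KEY>")
# hadoop_conf.set("fs.s3a.secret.key", "<SECRET_KEY>")

# Glob over the image_category subfolders; the layout mirrors the question.
dataframe = spark.read.format("image").load("s3a://my_bucket/folder/*")
dataframe.select("image.origin", "image.height", "image.width").show(truncate=False)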

Transferring Pdf files from Local folder to AWS

I have a monthly activity where I get hundreds of PDF files in a folder and I need to transfer those to an AWS server. Currently I do this manually, but I need to automate the transfer of all PDF files from my local folder to a specific folder in AWS.
Also, this process takes a lot of time (approx. 5 hours for 500 PDF files). Is there a way to speed up the process?
While doing the copy from local to AWS you must be using some tool like WinSCP or an SSH client, so you could automate the same using a script.
scp [-r] /your/pdf/dir youruser@awshost:/home/user/path/
If you want more speed, you could run multiple scp commands in parallel in multiple terminals, and you may split the files into logically grouped directories as they are created.
You can zip the files and transfer them, then unzip after the transfer.
Or else write a program which iterates over all files in your folder and uploads them to S3 using the S3 API methods (see the sketch below).
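A minimal sketch of that last suggestion with boto3; the bucket name, key prefix, and local folder are placeholders.

import os
import boto3

s3 = boto3.client("s3")

LOCAL_DIR = "/path/to/pdf/folder"   # placeholder local folder
BUCKET = "my-bucket"                # placeholder bucket
PREFIX = "monthly-pdfs/"            # placeholder destination "folder" in S3

for name in os.listdir(LOCAL_DIR):
    if name.lower().endswith(".pdf"):
        local_path = os.path.join(LOCAL_DIR, name)
        # upload_file uses multipart uploads and parallel threads under the hood,
        # which is usually much faster than a single serial scp stream.
        s3.upload_file(local_path, BUCKET, PREFIX + name)
        print("uploaded", name)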

Cloudera 5.4.2: Avro block size is invalid or too large when using Flume and Twitter streaming

There is a tiny problem when I try Cloudera 5.4.2. It is based on this article:
Apache Flume - Fetching Twitter Data
http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm
It fetches tweets using Flume and the Twitter streaming API for data analysis. Everything goes well: create the Twitter app, create a directory on HDFS, configure Flume, start fetching data, and create a Hive schema on top of the tweets.
Then, here is the problem. The Twitter source converts tweets to Avro format and sends Avro events to the downstream HDFS sink. When the Avro-backed Hive table loads the data, I get the error message "Avro block size is invalid or too large".
So, what is an Avro block, and what is the limit on its size? Can I change it? What does this message mean? Is it the file's fault, or some records' fault? If the Twitter stream had hit bad data, it should have crashed. And if converting the tweets to Avro succeeded, then, conversely, the Avro data should read back correctly, right?
I also tried avro-tools-1.7.7.jar:
java -jar avro-tools-1.7.7.jar tojson FlumeData.1458090051232
{"id":"710300089206611968","user_friends_count":{"int":1527},"user_location":{"string":"1633"},"user_description":{"string":"Steady Building an Empire..... #UGA"},"user_statuses_count":{"int":44471},"user_followers_count":{"int":2170},"user_name":{"string":"Esquire Shakur"},"user_screen_name":{"string":"Esquire_Bowtie"},"created_at":{"string":"2016-03-16T23:01:52Z"},"text":{"string":"RT #ugaunion: .#ugasga is hosting a debate between the three SGA executive tickets. Learn more about their plans to serve you https://t.co/…"},"retweet_count":{"long":0},"retweeted":{"boolean":true},"in_reply_to_user_id":{"long":-1},"source":{"string":"Twitter for iPhone"},"in_reply_to_status_id":{"long":-1},"media_url_https":null,"expanded_url":null}
{"id":"710300089198088196","user_friends_count":{"int":100},"user_location":{"string":"DM開放してます(`・ω・´)"},"user_description":{"string":"Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:77)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
... 4 more
The same problem. I have googled a lot, with no answers at all.
Could anyone give me a solution if you have hit this problem too? Or could somebody who fully understands the Avro internals or the Twitter streaming source underneath give a clue?
It is a really interesting problem. Think about it.
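(As an aside on the diagnosis: the avro-tools check above can also be reproduced with the Python Avro bindings, which should stop at the same corrupt block. This is only a sketch, assuming the avro package is installed and using the FlumeData file name from above.)

from avro.datafile import DataFileReader
from avro.io import DatumReader

# Iterate the Avro container file record by record; a corrupt block should
# raise an error at the same point where avro-tools tojson fails.
with open("FlumeData.1458090051232", "rb") as f:
    reader = DataFileReader(f, DatumReader())
    for i, record in enumerate(reader):
        print(i, record["id"])
    reader.close()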
Use the Cloudera TwitterSource. Otherwise you will meet this problem:
Unable to correctly load twitter avro data into hive table
In the article, the source used is the Apache TwitterSource:
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
Twitter 1% Firehose Source
This source is highly experimental. It connects to the 1% sample Twitter Firehose using streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.
But it should be the Cloudera TwitterSource:
https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
And don't just download the pre-built jar, because our Cloudera version is 5.4.2; otherwise you will get this error:
Cannot run Flume because of JAR conflict
You should compile it yourself using Maven:
https://github.com/cloudera/cdh-twitter-example
Download and compile flume-sources-1.0-SNAPSHOT.jar. This jar contains the implementation of the Cloudera TwitterSource.
Steps:
wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip
sudo yum install apache-maven
mvn package
Then put the built jar into the Flume plugins directory:
/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar
Note: run yum update to the latest version first, otherwise the compile (mvn package) fails due to a security problem.

AWS Elastic Mapreduce optimizing Pig job

I am using boto 2.8.0 to create EMR jobflows over large log files stored in S3. I am relatively new to Elastic MapReduce and am getting a feel for how to properly handle jobflows with this issue.
The logfiles in question are stored in S3 with keys that correspond to the dates they are emitted from the logging server, e.g. /2013/03/01/access.log. These files are very, very large. My mapreduce job runs an Apache Pig script that simply examines some of the URI paths stored in the log files and outputs generalized counts that correspond to our business logic.
My client code in boto takes datetimes as input on the CLI and schedules a jobflow with a PigStep instance for every date needed. Thus, passing something like python script.py 2013-02-01 2013-03-01 would iterate over 29 days' worth of datetime objects and create PigSteps with the respective S3 input keys. This means that the resulting jobflow could have many, many steps, one for each day in the timedelta between the from_date and to_date.
My problem is that my EMR jobflow is exceedingly slow, almost absurdly so. It has been running all night and has not made it even halfway through that example set. Is there something wrong with creating many jobflow steps like this? Should I instead generalize the Pig script to handle the different keys, rather than preprocessing in the client code and creating a step for each date? Is this a feasible place to look for an optimization on Elastic MapReduce? It is worth mentioning that a similar job for a month's worth of comparable data, passed to the AWS elastic-mapreduce Ruby CLI client, took about 15 minutes to execute (that job was driven by the same Pig script).
EDIT
Neglected to mention: the job was scheduled for two instances of type m1.small, which admittedly may in itself be the problem.
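To illustrate the "generalize the Pig script" option from the question, a single step can cover the whole date range by passing a key glob as a Pig parameter instead of creating one step per day. This is a hedged sketch with boto 2.x: the bucket names, script path, parameter names (INPUT/OUTPUT), and instance settings are hypothetical and would need to match how the Pig script is actually parameterized.

from boto.emr.connection import EmrConnection
from boto.emr.step import InstallPigStep, PigStep

# Credentials are picked up from the environment or the boto config file.
conn = EmrConnection()

# One PigStep whose INPUT globs over all of February 2013,
# instead of one PigStep per day.
pig_step = PigStep(
    name="monthly-report",
    pig_file="s3://my-bucket/scripts/report.pig",   # hypothetical script location
    pig_args=["-p", "INPUT=s3://my-log-bucket/2013/02/*/access.log",
              "-p", "OUTPUT=s3://my-bucket/reports/2013-02/"],
)

jobflow_id = conn.run_jobflow(
    name="pig-monthly",
    log_uri="s3://my-bucket/emr-logs/",
    steps=[InstallPigStep(), pig_step],   # InstallPigStep may be needed on older AMIs
    num_instances=4,                      # more capacity than two m1.smalls
    master_instance_type="m1.large",
    slave_instance_type="m1.large",
    ami_version="latest",
)
print(jobflow_id)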