Hadoop MapReduce Program for sending Emails - mapreduce

I want to write a MapReduce program that picks a file from HDFS and sends it as an attachment in an email.
Can you please help me with the structure of the code? Since it's not a typical file-processing job, should I have a Mapper and a Reducer?
Ideally I wanted to use the Oozie SMTP Action, but it doesn't support attachments in the email.

I'm not familiar with the Oozie SMTP Action part of the question, but I can answer the email part.
One of the main ideas of MapReduce is that you can implement any algorithm from a normal Java program in MapReduce. The difference from a normal Java program is that your code runs on Hadoop, so you always need to emit a key-value pair from the mapper and reducer.
Therefore you can do whatever you want: send an email, put the data into a database, literally anything, as long as you emit a key-value pair from the mapper and reducer.
From what it sounds like, you have a large data file with a lot of email addresses you want to send something to. Since all you're doing is finding an address and sending an email, there's no analysis or grouping for a reducer to do, so you should be able to use a map-only job.

In this case, the whole content length (LongWritable type) is taken as the key for the Mapper, and the whole email content (Text type) as the value. Writing that key and value to the context should solve your problem.
Number of Mappers = 1
Number of Reducers = 0
Hope it works for you!
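Both answers describe this in terms of the Java mapreduce API. Purely as an illustrative sketch (and because several of the related questions below are C++-based), this is roughly what the same map-only shape looks like with Hadoop Pipes in C++; the class names and the sendEmail() helper are made up, and the actual SMTP/attachment handling is left out.

// Map-only Hadoop Pipes (C++) sketch: each input record is handled entirely in
// map(); sendEmail() is a placeholder for your real email/attachment code.
#include <string>
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"

// Hypothetical helper; replace with your SMTP client of choice.
void sendEmail(const std::string& recipient) { /* ... */ }

class EmailMapper : public HadoopPipes::Mapper {
public:
  EmailMapper(HadoopPipes::TaskContext&) {}
  void map(HadoopPipes::MapContext& context) {
    const std::string recipient = context.getInputValue();
    sendEmail(recipient);
    // Always emit a key-value pair, even in a map-only job.
    context.emit(recipient, "sent");
  }
};

// Never instantiated when the job runs with zero reduce tasks, but the
// factory still needs a reducer type.
class NoopReducer : public HadoopPipes::Reducer {
public:
  NoopReducer(HadoopPipes::TaskContext&) {}
  void reduce(HadoopPipes::ReduceContext&) {}
};

int main() {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<EmailMapper, NoopReducer>());
}

Run it with the reduce task count set to 0 (the "Number of Reducers = 0" above) so the job stays map-only.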

Related

What is the design pattern to store JSON objects in C++?

A co-worker and I have been discussing the best way to store data in memory within our C++ server. Basically, we need to store all requisitions made by clients. Those requisitions come as JSON objects, so each requisition may have a different number of parameters. Later, clients can ask the server for a list of those requisitions.
The total number of requisitions is small (order of 10^3). Clients ask for the list of requisitions using pagination.
So my question is what is the standard way of doing that?
1) Create a class that stores every JSON and then, when requested, send the list of those JSONs.
2) Deserialize the JSON, store it in a class then serialize the data again when requested.
If 2, what is the best way of doing that in modern C++?
3) Another option?
Thank you.
If the client asks you to support JSON, there are only two steps you need to do:
Add some JSON library (e.g. this) with a suitable license to the project.
Use it.
If the implementation of JSON is not the main goal of the project, this should work.
Note: you can also get a lot of design hints by inspecting the aforementioned repo.
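In case a concrete sketch of option 1 helps: the snippet below uses nlohmann/json purely as an example of such a header-only library (it may or may not be the one linked above), and RequisitionStore is a made-up name.

// Option 1 sketch: keep each requisition as a parsed JSON value in memory and
// serialize a requested page back out. Assumes the header-only nlohmann/json.
#include <nlohmann/json.hpp>
#include <cstddef>
#include <string>
#include <vector>

using json = nlohmann::json;

class RequisitionStore {
public:
    // Parse and keep the incoming requisition as-is (each requisition can
    // have a different set of parameters, which JSON handles naturally).
    void add(const std::string& rawJson) {
        store_.push_back(json::parse(rawJson));
    }

    // Serialize one page of requisitions for the client (pagination).
    std::string page(std::size_t offset, std::size_t count) const {
        json out = json::array();
        for (std::size_t i = offset; i < store_.size() && i < offset + count; ++i) {
            out.push_back(store_[i]);
        }
        return out.dump();
    }

private:
    std::vector<json> store_;  // ~10^3 entries, so a flat vector is fine
};

Option 2 (deserializing into your own classes) only pays off if the server actually needs to work with the fields; for "store and send back a page", keeping the parsed JSON as-is is usually enough.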

How to feed the bokeh streaming interface from a C++ application

I want to use bokeh to display a time series and provide the data updates via source.stream(someDict). The data is, however, generated by a C++ application (server) that may run on the same machine or on another machine in the network. I was looking into transmitting the updated data (only the newly added lines of the time series) via ZMQ to the Python program (client).
The transmission of the message seems easy enough to implement, but:
The dictionary is column based. Wouldn't it be more efficient to append lines, i.e. one line per point in time, and send those?
If there is no good way to do the first, what kind of object should I send? Do I need to marshal the information, or is it sufficient to build a long string like {col1:[a,b,c,...], col2:[...],...} and send that to the client? I expect to send no more than a few hundred lines with 10 floats per second.
Thanks for all helpful answers.
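For concreteness, the "long string" variant described in the question could look roughly like this on the C++ (server) side; it assumes the cppzmq header (zmq.hpp), and the endpoint and column names are made up for the example.

// Sketch: push one update (a few new rows, column-oriented to match bokeh's
// source.stream(someDict)) as a JSON-like string over a ZMQ PUSH socket.
#include <zmq.hpp>
#include <string>

int main() {
    zmq::context_t ctx(1);
    zmq::socket_t sock(ctx, zmq::socket_type::push);
    sock.bind("tcp://*:5556");  // the Python/bokeh side would connect a PULL socket

    // Column-based payload: one list per column, a few points per update.
    std::string update = R"({"time": [1.0, 2.0, 3.0], "value": [0.5, 0.7, 0.6]})";
    zmq::send_result_t res = sock.send(zmq::buffer(update), zmq::send_flags::none);
    (void)res;  // blocking send; an empty result only occurs with non-blocking flags
    return 0;
}

At a few hundred rows of ~10 floats per second, a plain string like this is tiny; proper marshalling (msgpack, protobuf, ...) only becomes interesting if the volume grows a lot.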

How to load data into a CrowdFlower job using GATE's crowdsourcing plugin?

I am trying to create a job on CrowdFlower using the GATE crowdsourcing plugin. My problem is that I cannot load the data into the job at all. What I have done so far in creating the job is:
Create the job builder as a processing resource (PR).
Right-click on the job builder and choose "create a new CrowdFlower job". The job appeared in my job list on CrowdFlower.
Populate a corpus with some documents, pre-processing them with some of ANNIE's applications, e.g. the tokenizer and sentence splitter.
Add the job builder to a corpus pipeline and edit some parameters so they match the initial annotations (tokens and sentences).
Run the pipeline. (Of course I make sure the Job IDs match.)
After I did all of this, the job still has 0 rows of data. I am wondering if I have done something wrong, because I am sure I followed all the instructions in this tutorial, specifically pages 28 to 35. Any advice on this?
I bet you have a typo in one of the job builder runtime parameters :)
Double-check the names of the annotations and annotation sets, and make sure all of them exist in your documents. If they exist and the builder found them, a cf_..._id feature should appear on each entity annotation.
If the job builder found any annotations, it would call the CrowdFlower API and throw an exception if it failed to upload the data. It really sounds like it's not sending any requests, and the only reason I can see is that it can't find the annotations.

Retrieve data from Account Server

I'm trying to make a game launcher in C++ and I was wondering if I could get some guidance on how to carry out this task. Basically I've created an API which outputs the account data in JSON format.
i.e. {"success":true,"errorCode":0,"reason":"You're now logged in!"}
http_fetch("http://www.example.com/api/login.php?username="+username+"&password="+password+"");
How am I able to retrieve the data?
Sorry if you don't understand. English isn't my first language :)
-brownzilla
Look for a library that allows you to parse JSON. Some examples:
Picojson
Rapidjson
Both are quite easy to use and allow you to turn JSON into a model that you can then map to your own model. Alternatively, you could write your own JSON parser, though that would be a bit of work (reinventing the wheel, perhaps).
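A small sketch of what that looks like with picojson (the first library mentioned above), parsing the exact response shape shown in the question; error handling is kept minimal.

// Parse the launcher's login response with picojson and pull out the fields.
#include <picojson.h>
#include <iostream>
#include <string>

int main() {
    std::string body = R"({"success":true,"errorCode":0,"reason":"You're now logged in!"})";

    picojson::value v;
    std::string err = picojson::parse(v, body);
    if (!err.empty()) {
        std::cerr << "JSON parse error: " << err << std::endl;
        return 1;
    }

    const picojson::object& obj = v.get<picojson::object>();
    bool success = obj.at("success").get<bool>();
    double errorCode = obj.at("errorCode").get<double>();  // picojson stores numbers as double
    std::string reason = obj.at("reason").get<std::string>();

    std::cout << (success ? "Logged in: " : "Failed: ")
              << reason << " (code " << errorCode << ")" << std::endl;
    return 0;
}

In the launcher you would of course feed it the string returned by your http_fetch call instead of a hard-coded body.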

Hadoop streaming C++ getTaskId

I've been trying to find a way to get (or pass) the taskId to my mapper in C++. I'm using Hadoop streaming. So far I have only found how to get it in Java. I need the task ID because I'm trying to write a file to HDFS using the libhdfs C API, but when I try to append concurrently it fails because of the lease. Otherwise I'll have to change all my code to Java.
Thanks for your attention.
I figured that instead of using Hadoop Streaming, I could use Hadoop Pipes to get the taskID. However, I was not able to print to HDFS, so I changed my InputFormat/RecordReader and used the key received in the mapper to create files with different names.
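In case it helps anyone following the same route, the Pipes side can look roughly like this; it assumes the task's job configuration exposes "mapred.task.id" (as it does for Java tasks), so treat that property name as something to double-check on your Hadoop version.

// Hadoop Pipes (C++) sketch: read the task ID from the job configuration in
// the mapper's constructor and use it, e.g. to build a per-task output name.
#include <string>
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"

class TaskIdMapper : public HadoopPipes::Mapper {
public:
  TaskIdMapper(HadoopPipes::TaskContext& context) {
    const HadoopPipes::JobConf* conf = context.getJobConf();
    if (conf->hasKey("mapred.task.id")) {
      taskId_ = conf->get("mapred.task.id");
    }
  }
  void map(HadoopPipes::MapContext& context) {
    // With the task ID available, each task can write to its own file
    // (e.g. via libhdfs) instead of appending to one shared file.
    context.emit(taskId_, context.getInputValue());
  }
private:
  std::string taskId_;
};

class NoopReducer : public HadoopPipes::Reducer {
public:
  NoopReducer(HadoopPipes::TaskContext&) {}
  void reduce(HadoopPipes::ReduceContext&) {}
};

int main() {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<TaskIdMapper, NoopReducer>());
}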