Distributed Cache (Map-side Joins) - mapreduce

I would like to know more about the DistributedCache concept in MapReduce.
In my Mapper class below, I wrote logic to read a file that is available in the cache.
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    // Local paths (on the task node) of the files added to the distributed cache
    localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    for (Path myfile : localFiles) {
        // The cache file is already local, so it can be read with plain java.io
        File file = new File(myfile.toString());
        BufferedReader br = new BufferedReader(new FileReader(file));
        try {
            String line = br.readLine();
            while (line != null) {
                String[] arr = line.split("\t");
                myMap.put(arr[0], arr[1]);
                line = br.readLine();
            }
        } finally {
            br.close();
        }
    }
}
Can someone tell me when the above setup(context) method gets called? Is the setup(context) method called only once, or will it run once for every map task?

It is called only once per Mapper or Reducer task. So if 10 mappers or reducers are spawned for a job, it will be called once for each of them.
As a general guideline, any work that needs to be done only once per task belongs in this method, e.g. getting the path of a distributed cache file, or reading parameters that were passed to the mappers and reducers.
The same applies to the cleanup() method.
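For reference, here is a minimal driver sketch (class and file names such as MyJoinMapper and /user/hadoop/lookup.txt are assumptions, not taken from the question) showing how a file ends up in the distributed cache that setup() reads, and how a parameter can be passed to the tasks, using the same old-style DistributedCache API as the question:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A parameter the tasks can read back in setup() via
        // context.getConfiguration().get("join.type")
        conf.set("join.type", "map-side");

        Job job = Job.getInstance(conf, "map-side-join");
        job.setJarByClass(JoinDriver.class);
        job.setMapperClass(MyJoinMapper.class);   // the Mapper shown above (assumed name)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Register the lookup file; it is copied to every task node and is what
        // DistributedCache.getLocalCacheFiles() returns inside setup()
        DistributedCache.addCacheFile(new URI("/user/hadoop/lookup.txt"),
                job.getConfiguration());

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

On newer Hadoop versions the equivalent call is job.addCacheFile(uri), but the lookup in setup() works the same way.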

Related

Using a lock in C++ across multiple tasks

I am not really seeking code examples, but I'm hoping someone can review my program design and provide feedback. I am trying to figure out how to ensure that only one instance of my "workflow" runs at a time.
I am working in C++.
This is my workflow:
1. I read rows off of a Postgres database.
2. If the table has any records, I do the following:
   - Read the records and transform them to JSON.
   - Send the JSON document to a remote web service.
   - Parse the response from the service. The service tells me which records were saved or not saved, based on their primary key.
   - Delete the successfully saved records.
   - Log the unsuccessful records (there's another process that consumes the logs, so my work is done).
I want to perform all of this on a separate thread (or "task", whatever higher-level abstraction is available in C++), and I want to make sure that if my function for [1] gets called multiple times, the additional calls basically get "dropped" if step 1 is already in flight.
In C++, I believe I can use a flag and a mutex. I use something like std::lock_guard<std::mutex> at the top of my method. Then the next line checks a flag.
// MyWorkflow.cpp
std::mutex myMutex;
bool inFlight = false;

void process() {
    std::lock_guard<std::mutex> guard(myMutex);
    if (inFlight) {
        return; // a previous run is still in flight, drop this call
    }
    inFlight = true;
    std::vector<Widget> widgets = readFromMyTable();
    std::string json = getJson(&widgets);
    // ... Send the json to the remote service and handle the response
}
Okay, let me explain my confusion. I want to use Curl to perform the HTTP request. But Curl works asynchronously. And so if I make the asynchronous HTTP call via Curl, my process() function will just return and myMutex will be released, right?
I think in my asynchronous response handler, I need to call a second function that's in MyWorkflow.cpp
void markCompletion() {
    std::lock_guard<std::mutex> guard(myMutex);
    inFlight = false; // reset the in-flight flag here
}
Is this the right approach? I am worried that if an exception is thrown anywhere before I call markCompletion(), I will block all future callers. I think I need to ensure I have proper exception handling and always call markCompletion().
I am terribly sorry for asking such a noob question, but I really want to learn to do this the right way.

Is there a way to programmatically invoke a WebJob function from another function within the WebJob and still have the output binding work?

Here's the code:
[return: Table("AnalysisDataTable", Connection = "TableStorageConnection")]
public static async Task<OrchestrationManagerAnalysisData> InputQueueTriggerHandler(
[QueueTrigger("DtOrchestrnRequestQueueName",
Connection = "StorageQueueConnection")] string queueMsg,
[OrchestrationClient] DurableOrchestrationClient client, ILogger logger)
{
logger.LogInformation($"***DtOrchestrationRequestQueueTriggerHandler(): queueMsg = {queueMsg}");
await ProcessInputMessage(queueMsg, client, logger);
// Below is where the code goes to format the TableStorage Entity called analysisData.
// This return causes the above output binding to be executed, saving analysis data to
// Table Storage.
return analysisData;
}
The above code works fine and saves analysisData to Table Storage.
However, when I put the output binding attribute on ProcessInputMessage(), which is invoked programmatically rather than as the result of a trigger, everything works OK except that no data is written to Table Storage.
[return: Table("AnalysisDataTable", Connection = "TableStorageConnectionName")]
public static async Task<OrchestrationManagerAnalysisData>
ProcessInputMessage(string queueMsg, DurableOrchestrationClient client, ILogger logger)
{
// Do the processing of the input message.
// Below is where the code goes to format the TableStorage Entity called analysisData.
// This return causes the above output binding to be executed, saving analysis data to
// Table Storage.
return analysisData;
}
QUESTION: Is there a way to cause an output binding to "trigger" when the function is invoked programmatically from another function within the WebJob?
I like the labor-saving characteristics of output bindings and want to leverage them as much as possible, while also having well-factored code, i.e. tight cohesion in each method.
Thanks,
George
Is there a way to cause an output binding to "trigger" when invoked programmatically from another function within the WebJob?
In short, no.
Data is sent via the return value of the function that carries the output binding attribute, and the binding is only processed when the runtime itself invokes that function as the result of a trigger. So if you invoke another function directly from your own code, its return-value binding will not write anything to Table Storage.
If you want to achieve this, you would have to replicate what the return-value binding does yourself. Since that handling lives inside the SDK package, I suggest you instead use a TableOperation object that inserts the entity (customer1 in the sample below) into Table Storage directly:
// Get a reference to the target table beforehand (e.g. via
// CloudTableClient.GetTableReference), then insert the entity directly.
TableOperation insertOperation = TableOperation.Insert(customer1);
// Execute the insert operation against the table.
table.Execute(insertOperation);
For more details, you could refer to this article.

Aggregating a huge list from reducer input without running out of memory

At the reduce stage (67% of the reduce progress), my code gets stuck and fails after hours of attempting to complete. I found out that the issue is that the reducer receives a huge amount of data that it can't handle and ends up running out of memory, which leads to the reducer getting stuck.
Now, I am trying to find a way around this. Currently, I am assembling a list from the values received by the reducer for each key. At the end of the reduce phase, I try to write the key and all of the values in the list. So my question is: how can I get the same functionality of having the key and the list of values related to that key, without running out of memory?
public class XMLReducer extends Reducer<Text, Text, Text, TextArrayWritable> {
    private final Logger logger = Logger.getLogger(XMLReducer.class);

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Collect every distinct value seen for this key; this set is what grows
        // without bound and exhausts memory for heavy keys
        Set<String> filesFinal = new HashSet<>();
        for (Text value : values) {
            filesFinal.add(value.toString());
        }
        Text[] tempText = new Text[filesFinal.size()];
        int i = 0;
        for (String file : filesFinal) {
            tempText[i++] = new Text(file);
        }
        // Write the key together with the collected values
        // (assuming TextArrayWritable has a Text[] constructor)
        context.write(key, new TextArrayWritable(tempText));
    }
}
and TextArrayWritable is just a way to write an array to file
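For completeness, such a TextArrayWritable is typically just a thin subclass of Hadoop's ArrayWritable; a minimal sketch, assuming that is what is meant here:

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

public class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() {
        super(Text.class);            // no-arg constructor required for Hadoop serialization
    }

    public TextArrayWritable(Text[] values) {
        super(Text.class, values);    // wrap an existing Text[] for writing
    }
}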
You can try to reduce the amount of data read by a single reducer by writing a custom Partitioner.
HashPartitioner is the default partitioner used by a MapReduce job. While it usually spreads keys fairly evenly, in some cases it is quite possible that many keys get hashed to a single reducer. As a result, that reducer ends up with far more data than the others. In your case, I think this is the issue.
To resolve this:
- Analyze your data and the key on which you are grouping.
- Try to come up with a partitioning function for your custom Partitioner based on your group-by key, and try to limit the number of keys per partition, as in the sketch below.
You would see an increase in the number of reduce tasks in your job. If the issue is related to uneven key distribution, this should resolve it.
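A minimal sketch of such a custom Partitioner, assuming composite keys of the form "group#item" where the part before '#' is the group-by key (the key format is an assumption, not taken from the question):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class GroupPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Partition on the group prefix only, so each partition sees a bounded,
        // predictable set of groups instead of whatever the full-key hash produces
        String group = key.toString().split("#", 2)[0];
        return (group.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It is registered on the job with job.setPartitionerClass(GroupPartitioner.class), and job.setNumReduceTasks(...) controls how many partitions there are.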
You could also try increasing reducer memory.
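For example, reducer memory can be raised from the driver configuration (the values below are placeholders; tune them to your cluster):

Configuration conf = new Configuration();
// Container size for each reduce task, in MB
conf.set("mapreduce.reduce.memory.mb", "4096");
// JVM heap for the reduce task; keep it below the container size
conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");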

Spring Batch process multiple files concurrently

I'm using Spring Batch to process a large XML file (~2 million entities) and update a database. The process is quite time-consuming, so I tried to use partitioning to speed up the processing.
The approach I'm pursuing is to split the large XML file into smaller files (say, 500 entities each) and then use Spring Batch to process each file in parallel.
I'm struggling with the Java configuration needed to process multiple XML files in parallel. These are the relevant beans of my configuration:
@Bean
public Partitioner partitioner() {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    Resource[] resources;
    try {
        resources = resourcePatternResolver.getResources("file:/tmp/test/*.xml");
    } catch (IOException e) {
        throw new RuntimeException("I/O problems when resolving the input file pattern.", e);
    }
    partitioner.setResources(resources);
    return partitioner;
}

@Bean
public Step partitionStep() {
    return stepBuilderFactory.get("test-partitionStep")
            .partitioner(personStep())
            .partitioner("personStep", partitioner())
            .taskExecutor(taskExecutor())
            .build();
}

@Bean
public Step personStep() {
    return stepBuilderFactory.get("personStep")
            .<Person, Person>chunk(100)
            .reader(personReader())
            .processor(personProcessor())
            .writer(personWriter)
            .build();
}

@Bean
public TaskExecutor taskExecutor() {
    SimpleAsyncTaskExecutor asyncTaskExecutor = new SimpleAsyncTaskExecutor("spring_batch");
    asyncTaskExecutor.setConcurrencyLimit(10);
    return asyncTaskExecutor;
}
When I execute the job, I get different XML parsing errors (every time a different one). If I remove all the XML files but one from the folder, then the processing works as expected.
I'm not sure I understand 100% the concept of Spring Batch partitioning, especially the "slave" part.
Thanks!
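For context, with MultiResourcePartitioner the "slave" step normally obtains its input file from the stepExecutionContext through a step-scoped reader, so each partition reads only its own resource. A minimal sketch of how such a reader bean is usually declared (the "person" fragment name and the personUnmarshaller() bean are assumptions, not taken from the question):

@Bean
@StepScope
public StaxEventItemReader<Person> personReader(
        @Value("#{stepExecutionContext['fileName']}") String fileName) throws MalformedURLException {
    // MultiResourcePartitioner stores each partition's resource URL under the
    // "fileName" key by default, so every partition gets its own file here
    StaxEventItemReader<Person> reader = new StaxEventItemReader<>();
    reader.setResource(new UrlResource(fileName));
    reader.setFragmentRootElementName("person");     // assumed XML fragment root
    reader.setUnmarshaller(personUnmarshaller());     // assumed existing unmarshaller bean
    return reader;
}

If the reader is a plain singleton bean instead, all partitions share one reader instance, which would be consistent with random parsing errors that disappear when only one file is present.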

Large file upload with Spark framework

I'm trying to upload large files to a web application using the Spark framework, but I'm running into out-of-memory errors. It appears that Spark is caching the request body in memory. I'd like either to cache file uploads on disk or to read the request as a stream.
I've tried using the streaming support of Apache Commons FileUpload, but it appears that calling request.raw().getInputStream() causes Spark to read the entire body into memory and return an InputStream view of that chunk of memory, as done by this code. Based on the comment in the file, this is so that getInputStream can be called multiple times. Is there any way to change this behavior?
I recently had the same problem and I figured out that you could bypass the caching. I do so with the following function:
public ServletInputStream getInputStream(Request request) throws IOException {
    final HttpServletRequest raw = request.raw();
    if (raw instanceof ServletRequestWrapper) {
        // Unwrap Spark's wrapper to reach the underlying, non-cached request stream
        return ((ServletRequestWrapper) raw).getRequest().getInputStream();
    }
    return raw.getInputStream();
}
This has been tested with Spark 2.4.
I'm not familiar with the inner workings of Spark, so one potential, minor downside of this function is that you don't know whether you get the cached InputStream or not; the cached version is reusable, while the non-cached one is not.
To get around this downside I suppose you could implement a function similar to the following:
public boolean hasCachedInputStream(Request request) {
    return !(request.raw() instanceof ServletRequestWrapper);
}
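As a usage sketch (the unwrapped request being handed straight to Commons FileUpload, and the /tmp target path, are assumptions), a Spark route that streams each uploaded part to disk instead of buffering the body in memory might look like this:

import javax.servlet.ServletRequestWrapper;
import javax.servlet.http.HttpServletRequest;
import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.apache.commons.fileupload.util.Streams;
import static spark.Spark.post;

public class UploadRoute {
    public static void main(String[] args) {
        post("/upload", (request, response) -> {
            // Unwrap Spark's request so Commons FileUpload reads the raw body
            // as a stream instead of triggering Spark's in-memory caching
            HttpServletRequest raw = request.raw();
            if (raw instanceof ServletRequestWrapper) {
                raw = (HttpServletRequest) ((ServletRequestWrapper) raw).getRequest();
            }
            FileItemIterator items = new ServletFileUpload().getItemIterator(raw);
            while (items.hasNext()) {
                FileItemStream item = items.next();
                if (!item.isFormField()) {
                    // Copy each uploaded part to disk; the path is a placeholder
                    Streams.copy(item.openStream(),
                            new java.io.FileOutputStream("/tmp/" + item.getName()), true);
                }
            }
            return "upload complete";
        });
    }
}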
Short answer: not that I can see.
SparkServerFactory builds the JettyHandler, which has a private static class HttpRequestWrapper that reads the InputStream into memory.
All of that private static plumbing means there is no way to extend it.