I'm building a Spark application that will run some topic extraction algorithms. As a first step I need to do some preprocessing, ending up with the document-term matrix. I got that working, but for a (not really that big) collection of documents (only 2 thousand, 5 MB), the process takes forever.
While debugging, I found where the program gets stuck: a reduce operation. In this part of the code I'm counting how many times each term occurs in the collection, so first I do a "map", counting it for each RDD element, and then I "reduce", saving the result in a HashMap. The map operation is very fast, but the reduce splits the work into 40 blocks, and each block takes 5-10 minutes to process.
So I'm trying to figure out what I'm doing wrong, or whether reduce operations really are that costly.
SparkConf: standalone mode, using local[2]. I've also tried "spark://master:7077"; it worked, but with the same slowness.
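Roughly, the setup looks like this (simplified; the app name is just illustrative):
// Simplified version of the configuration described above (app name is illustrative).
SparkConf conf = new SparkConf()
        .setAppName("TopicExtraction")
        .setMaster("local[2]");   // or "spark://master:7077" for the standalone cluster
JavaSparkContext sc = new JavaSparkContext(conf);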
Code:
"filesIn" is a JavaPairRDD where the key is the file path and the value is the content of the file.
So first the map, where I take "filesIn", split the words, and count their frequency (in this case it doesn't matter which document a word comes from):
And then the reduce, where I create a HashMap of (term, freq).
JavaRDD<HashMap<String, Integer>> termDF_ = filesIn.map(new Function<Tuple2<String, String>, HashMap<String, Integer>>() {
    @Override
    public HashMap<String, Integer> call(Tuple2<String, String> t) throws Exception {
        String[] allWords = t._2.split(" ");
        HashMap<String, Double> hashTermFreq = new HashMap<String, Double>();
        ArrayList<String> words = new ArrayList<String>();
        ArrayList<String> terms = new ArrayList<String>();
        HashMap<String, Integer> termDF = new HashMap<String, Integer>();
        for (String term : allWords) {
            if (hashTermFreq.containsKey(term)) {
                Double freq = hashTermFreq.get(term);
                hashTermFreq.put(term, freq + 1);
            } else {
                if (term.length() > 1) {
                    hashTermFreq.put(term, 1.0);
                    if (!terms.contains(term)) {
                        terms.add(term);
                    }
                    if (!words.contains(term)) {
                        words.add(term);
                        if (termDF.containsKey(term)) {
                            int value = termDF.get(term);
                            value++;
                            termDF.put(term, value);
                        } else {
                            termDF.put(term, 1);
                        }
                    }
                }
            }
        }
        return termDF;
    }
});
HashMap<String, Integer> termDF = termDF_.reduce(new Function2<HashMap<String, Integer>, HashMap<String, Integer>, HashMap<String, Integer>>() {
    @Override
    public HashMap<String, Integer> call(HashMap<String, Integer> t1, HashMap<String, Integer> t2) throws Exception {
        HashMap<String, Integer> result = new HashMap<String, Integer>();
        Iterator iterator = t1.keySet().iterator();
        while (iterator.hasNext()) {
            String key = (String) iterator.next();
            if (result.containsKey(key) == false) {
                result.put(key, t1.get(key));
            } else {
                result.put(key, result.get(key) + 1);
            }
        }
        iterator = t2.keySet().iterator();
        while (iterator.hasNext()) {
            String key = (String) iterator.next();
            if (result.containsKey(key) == false) {
                result.put(key, t2.get(key));
            } else {
                result.put(key, result.get(key) + 1);
            }
        }
        return result;
    }
});
Thanks!
OK, so just off the top of my head:
Spark transformations are lazy. That means the map is not executed until you call the subsequent reduce action, so what you describe as a slow reduce is most likely slow map + reduce.
ArrayList.contains is O(N), so all these words.contains and terms.contains calls are extremely inefficient.
The map logic smells fishy. In particular:
if a term has already been seen, you never get into the else branch;
at first glance, words and terms should have exactly the same content and should be equivalent to the hashTermFreq keys or termDF keys;
it looks like values in termDF can only take the value 1. If this is what you want and you ignore frequencies, what is the point of creating hashTermFreq?
The reduce phase as implemented here means an inefficient linear scan over the data with a growing object, while what you really want is reduceByKey.
Using Scala as pseudocode, your whole code can be efficiently expressed as follows:
val termDF = filesIn
  .flatMap { case (_, text) =>
    text.split(" ")            // Split
      .toSet                   // Take unique terms
      .filter(_.size > 1)      // Remove single characters
      .map(term => (term, 1))  // Map to pairs
  }
  .reduceByKey(_ + _)          // Reduce by key

termDF.collectAsMap // Optionally
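If you prefer to stay in Java, roughly the same idea might look like this (just a sketch, assuming Spark 2.x with Java 8 lambdas; filesIn is the JavaPairRDD<String, String> from the question):
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// One (term, 1) pair per distinct term per document, then sum per term.
JavaPairRDD<String, Integer> termDFRdd = filesIn
        .flatMapToPair(file -> new HashSet<>(Arrays.asList(file._2.split(" "))).stream()
                .filter(term -> term.length() > 1)    // drop single characters
                .map(term -> new Tuple2<>(term, 1))   // (term, 1) once per document
                .iterator())
        .reduceByKey((a, b) -> a + b);                // document frequency per term

Map<String, Integer> termDF = termDFRdd.collectAsMap(); // only if the vocabulary fits in memory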
Finally, it looks like you're reinventing the wheel: at least some of the tools you need are already implemented in mllib.feature or ml.feature.
I'm looking to extract form data using Textract. I've tested with a PDF in the demo and the results are great. Results using the SDK, however, are far from optimal; actually, completely inaccurate. If I use StartDocumentAnalysisRequest/StartDocumentAnalysisResult (asynchronous), I only get 1 block returned, of type PAGE, never KEY_VALUE_SET. If I convert my PDF to an image and use the synchronous methods, I do get KEY_VALUE_SET back, but the results are completely inaccurate.
Does anyone know how I can use the asynchronous analysis functionality to retrieve form values as the documentation indicates?
Sample Code below:
StartDocumentAnalysisResult startDocumentAnalysisResult = amazonTextract.startDocumentAnalysis(req);
String startJobId = startDocumentAnalysisResult.getJobId();

GetDocumentAnalysisResult documentAnalysisResult = null;
String jobStatus = "IN_PROGRESS";
while (jobStatus.equals("IN_PROGRESS")) {
    try {
        TimeUnit.SECONDS.sleep(10);
        GetDocumentAnalysisRequest documentAnalysisRequest = new GetDocumentAnalysisRequest()
                .withJobId(startJobId)
                .withMaxResults(1);
        documentAnalysisResult = amazonTextract.getDocumentAnalysis(documentAnalysisRequest);
        jobStatus = documentAnalysisResult.getJobStatus();
    } catch (Exception e) {
        logger.error(e);
    }
}

if (!jobStatus.equals("IN_PROGRESS")) {
    List<Block> blocks = documentAnalysisResult.getBlocks();
    logger.error("block list size " + blocks.size());

    Map<String, Map<String, Block>> keyValueBlockMap = new HashMap<>();
    Map<String, Block> keyMap = new HashMap<>();
    Map<String, Block> valueMap = new HashMap<>();
    Map<String, Block> blockMap = new HashMap<>();

    for (Block block : blocks) {
        logger.error("Block Type:" + block.getBlockType());
        String blockId = block.getId();
        blockMap.put(blockId, block);
        if (block.getBlockType().equals("KEY_VALUE_SET")) {
            if (block.getEntityTypes().contains("KEY")) {
                keyMap.put(blockId, block);
            } else {
                valueMap.put(blockId, block);
            }
        }
    }

    keyValueBlockMap.put("keyMap", keyMap);
    keyValueBlockMap.put("valueMap", valueMap);
    keyValueBlockMap.put("blockMap", blockMap);

    Map<String, String> keyValueRelationShip = getKeyValueRelationShip(keyValueBlockMap);
    for (String key : keyValueRelationShip.keySet()) {
        logger.error("Key: " + key);
        logger.error("Value: " + keyValueRelationShip.get(key));
    }
}
Synchronous path, which gives completely horrible results:
AnalyzeDocumentRequest request = new AnalyzeDocumentRequest()
        .withFeatureTypes(FeatureType.FORMS)
        .withDocument(new Document()
                .withS3Object(new com.amazonaws.services.textract.model.S3Object()
                        .withName(objectName)
                        .withBucket(awsHelper.getS3BucketName())));
AnalyzeDocumentResult result = amazonTextract.analyzeDocument(request);
You are not using the recommended version of the AWS SDK for Java; you are on an old version.
I have tested the AWS SDK for Java V2 and I am able to get lines and text that line up with what the AWS Management Console shows.
You can find Textract V2 examples in the repo linked above.
I am able to get the lines and the corresponding text by using software.amazon.awssdk.services.textract.TextractClient.
For example, when I debug through the code using the same PNG as I used in the console, I get the proper result.
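For reference, a minimal synchronous FORMS call with the V2 client could look roughly like this (a sketch; the region, bucket and object names are placeholders, and error handling is omitted):
import java.util.List;

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.textract.TextractClient;
import software.amazon.awssdk.services.textract.model.AnalyzeDocumentRequest;
import software.amazon.awssdk.services.textract.model.AnalyzeDocumentResponse;
import software.amazon.awssdk.services.textract.model.Block;
import software.amazon.awssdk.services.textract.model.BlockType;
import software.amazon.awssdk.services.textract.model.Document;
import software.amazon.awssdk.services.textract.model.FeatureType;
import software.amazon.awssdk.services.textract.model.S3Object;

TextractClient textract = TextractClient.builder()
        .region(Region.US_EAST_1)            // placeholder region
        .build();

AnalyzeDocumentRequest request = AnalyzeDocumentRequest.builder()
        .featureTypes(FeatureType.FORMS)
        .document(Document.builder()
                .s3Object(S3Object.builder()
                        .bucket("my-bucket")  // placeholder bucket
                        .name("my-form.png")  // placeholder object key
                        .build())
                .build())
        .build();

AnalyzeDocumentResponse response = textract.analyzeDocument(request);
List<Block> blocks = response.blocks();
for (Block block : blocks) {
    if (block.blockType() == BlockType.KEY_VALUE_SET) {
        System.out.println(block.entityTypes() + " " + block.id());
    }
}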
Suppose I have two classes:
class Key {
    private Integer id;
    private String key;
}

class Value {
    private Integer id;
    private Integer key_id;
    private String value;
}
Now I fill the first list as follows:
List<Key> keys = new ArrayList<>();
keys.add(new Key(1, "Name"));
keys.add(new Key(2, "Surname"));
keys.add(new Key(3, "Address"));
And the second one:
List<Value> values = new ArrayList<>();
values.add(new Value(1, 1, "Mark"));
values.add(new Value(2, 3, "Fifth Avenue"));
values.add(new Value(3, 2, "Fischer"));
Can you please tell me how I can rewrite the following code:
for (Key k : keys) {
    for (Value v : values) {
        if (k.getId().equals(v.getKey_Id())) {
            map.put(k.getKey(), v.getValue());
            break;
        }
    }
}
Using Lambdas?
Thank you!
------- UPDATE -------
Yes, sure, it works; I forgot to say "using Lambdas" in the first post (now I've added it). I would like to rewrite the two nested for loops with lambdas.
Here is how you would do it using streams:
Stream the key list.
Stream an index for indexing the value list.
Filter matching ids.
Package the key instance's key and the value instance's value into a SimpleEntry.
Then add that to a map.
Map<String, String> results = keys.stream()
        .flatMap(k -> IntStream.range(0, values.size())
                .filter(i -> k.getId().equals(values.get(i).getKey_id()))
                .mapToObj(i -> new AbstractMap.SimpleEntry<>(
                        k.getKey(), values.get(i).getValue())))
        .collect(Collectors.toMap(Entry::getKey, Entry::getValue));

results.entrySet().forEach(System.out::println);
prints
Address=Fifth Avenue
Surname=Fischer
Name=Mark
IMO, your way is much clearer and easier to understand. Streams with lambdas or method references are not always the best approach.
A hybrid approach might also be considered:
Allocate a map.
Iterate over the keys.
Stream the values, trying to find a match on key_id, and return the first one found.
If the value was found (isPresent), add it to the map.
Map<String, String> map = new HashMap<>();
for (Key k : keys) {
    Optional<Value> opt = values.stream()
            .filter(v -> k.getId().equals(v.getKey_id()))
            .findFirst();
    if (opt.isPresent()) {
        map.put(k.getKey(), opt.get().getValue());
    }
}
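If the value list is large, another option (just a sketch, not from the answers above; it assumes the getKey_id() getter used there) is to index the values by key_id once, avoiding a linear scan per key:
// Build a lookup once: key_id -> value string (first match wins).
Map<Integer, String> byKeyId = values.stream()
        .collect(Collectors.toMap(Value::getKey_id, Value::getValue, (a, b) -> a));

Map<String, String> map = new HashMap<>();
for (Key k : keys) {
    String v = byKeyId.get(k.getId());
    if (v != null) {
        map.put(k.getKey(), v);
    }
}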
I'm implementing the connected components algorithm using Flink's DataStream API, since there is no implementation of it for this API yet.
For this algorithm, I'm separating the data into tumbling windows, so for each window I'm trying to compute the algorithm independently.
My problem comes from the iterative nature of the algorithm. I implemented the data pipeline I wanted for the iterations (the step pipeline), which consists of FlatMaps, 1 Join, 1 ProcessWindow and 1 Filter. However, it seems the stream I wanted to feed back into the loop is not actually being fed back to the beginning of the loop, because the algorithm does not iterate. I suspect this is not possible if the original iteration datastream was joined with another stream (even though the latter originated from a flatMap on the former).
The code I'm using is as follows:
// neigborsList = DataStream of <Vertex, [List of neighbors], label>
IterativeStream<Tuple3<Integer, ArrayList<Integer>, Integer>> beginning_loop = neigborsList.iterate(maxTimeout);

// Emits (vertex, label) tuples for every vertex and its neighbors
DataStream<Tuple2<Integer, Integer>> labels = beginning_loop
        // DataStream of <Vertex, label> for every neigborsList.f0 and element in neigborsList.f1
        .flatMap(new EmitVertexLabel())
        .keyBy(0)
        .window(TumblingEventTimeWindows.of(Time.milliseconds(windowSize)))
        .minBy(1);
DataStream<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>> updatedVertex = beginning_loop
        // Update vertex label with the results from the labels reduction
        .join(labels)
        .where("vertex")
        .equalTo("vertex")
        .window(TumblingEventTimeWindows.of(Time.milliseconds(windowSize)))
        .apply(new JoinFunction<Tuple3<Integer, ArrayList<Integer>, Integer>, Tuple2<Integer, Integer>, Tuple4<Integer, ArrayList<Integer>, Integer, Integer>>() {
            @Override
            public Tuple4<Integer, ArrayList<Integer>, Integer, Integer> join(
                    Tuple3<Integer, ArrayList<Integer>, Integer> arg0, Tuple2<Integer, Integer> arg1)
                    throws Exception {
                int hasConverged = 1;
                if (arg1.f1.intValue() < arg0.f2.intValue()) {
                    arg0.f2 = arg1.f1;
                    hasConverged = 0;
                }
                return new Tuple4<>(arg0.f0, arg0.f1, arg0.f2, new Integer(hasConverged));
            }
        })
        // Disseminates the convergence flag if a change was made in the window
        .windowAll(TumblingEventTimeWindows.of(Time.milliseconds(windowSize)))
        .process(new ProcessAllWindowFunction<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>, Tuple4<Integer, ArrayList<Integer>, Integer, Integer>, TimeWindow>() {
            @Override
            public void process(
                    ProcessAllWindowFunction<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>, Tuple4<Integer, ArrayList<Integer>, Integer, Integer>, TimeWindow>.Context ctx,
                    Iterable<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>> values,
                    Collector<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>> out) throws Exception {
                Iterator<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>> iterator = values.iterator();
                Tuple4<Integer, ArrayList<Integer>, Integer, Integer> element;
                int hasConverged = 1;
                while (iterator.hasNext()) {
                    element = iterator.next();
                    if (element.f3.intValue() > 0) {
                        hasConverged = 0;
                        break;
                    }
                }
                // Re-iterate and emit the values on the correct output
                iterator = values.iterator();
                Integer converged = new Integer(hasConverged);
                while (iterator.hasNext()) {
                    element = iterator.next();
                    element.f3 = converged;
                    out.collect(element);
                }
            }
        });
DataStream<Tuple3<Integer, ArrayList<Integer>, Integer>> feed_back = updatedVertex
        .filter(new NotConvergedFilter())
        // Remove the finished-convergence flag:
        // transforms the Tuple4s to Tuple3s so they become compatible with beginning_loop
        .map(new RemoveConvergeceFlag());

beginning_loop.closeWith(feed_back);

// Selects the windows that have already converged
DataStream<?> convergedWindows = updatedVertex
        .filter(new ConvergedFilter());

convergedWindows.print()
        .setParallelism(1)
        .name("Sink to stdout");
At the end of the execution, convergedWindows does not receive any tuples (because the algorithm cannot converge with only 1 iteration).
If I print beginning_loop, I see the initial tuples and the tuples from feed_back resulting from the first iteration, but nothing beyond that.
So, summarizing my question: could this be a limitation of Flink? If so, do you know of a different way of updating the vertex labels after the initial reduction, one that is not based on joins?
PS: I'm using Flink 1.3.3.
I want to build a JSON from two lists. I need to use the corresponding elements from both lists to create a single JSON object.
My problem could be solved with an ordinary loop like this:
List<Class1> items = baseManager.findObjectsByNamedQuery(Class1.class, "Class1.findAll", new Object[]{});
for (int i = 0; i < items.size(); i++) {
    List<Class2> items2 = baseManager.findObjectsByNamedQuery(Class2.class, "Class2.findByCreatedBy", new Object[]{items.get(i).getCreatedBy()});
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
    JsonObjectBuilder jpb = Json.createObjectBuilder()
            .add("createdBy", items.get(i).getCreatedBy())
            .add("phone", items2.get(0).getPhone());
    groupsBuilder.add(jpb);
}
Is it possible to solve this using the Java 8 Stream API?
There are still some things that are unclear, like why you insist on creating that SimpleDateFormat instance that you are not using anywhere, or whether there is any significance in calling getCreatedBy() multiple times. Assuming there is not, the following code is equivalent:
baseManager.findObjectsByNamedQuery(Class1.class, "Class1.findAll", new Object[]{})
        .stream()
        .map(item -> item.getCreatedBy())
        .map(createdBy -> Json.createObjectBuilder()
                .add("createdBy", createdBy)
                .add("phone", baseManager.findObjectsByNamedQuery(
                        Class2.class, "Class2.findByCreatedBy", new Object[]{createdBy})
                        .get(0).getPhone())
        )
        .forEach(jpb -> groupsBuilder.add(jpb));
It's still unclear to me whether (or why) findObjectsByNamedQuery is not a varargs method. It would be quite natural for it to be a varargs method, not requiring these explicit new Object[] { … } allocations.
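For illustration, a hypothetical varargs declaration (not the actual signature of your API) would keep all existing call sites compiling:
// Hypothetical varargs signature; existing calls with explicit new Object[]{...} still work.
public <T> List<T> findObjectsByNamedQuery(Class<T> type, String queryName, Object... params) {
    // ... delegate to the existing implementation ...
}

// Callers could then simply write:
// baseManager.findObjectsByNamedQuery(Class2.class, "Class2.findByCreatedBy", createdBy);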
With the pure Java 8 Stream API:
public void convertItemsToJSon(List<Item> items) {
    ...
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
    Map<Item, List<Class2>> sqlItems = items
            .stream()
            .collect(Collectors.toMap(
                    Function.identity(),
                    item -> baseManager.findObjectsByNamedQuery(
                            Class2.class, "Class2.findByCreatedBy", new Object[]{item.getCreatedBy()})));
    sqlItems.entrySet()
            .stream()
            .map(sqlItem -> buildJson(sqlItem.getKey(), sqlItem.getValue()))
            .forEach(groupsBuilder::add);
    ...
}

private JsonObjectBuilder buildJson(Item item, List<Class2> class2Items) {
    return Json.createObjectBuilder().add("createdBy", item.getCreatedBy());
}
With the StreamEx library:
public void convertItemsToJSonStreamEx(List<Item> items) {
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
    ...
    StreamEx.of(items)
            .cross(item -> baseManager.findObjectsByNamedQuery(
                    Class2.class, "Class2.findByCreatedBy", new Object[]{item.getCreatedBy()}).stream())
            .mapKeys(item -> Json.createObjectBuilder().add("createdBy", item.getCreatedBy()))
            .mapKeyValue(this::addField)
            .forEach(groupsBuilder::add);
    ...
}

private JsonObjectBuilder addField(JsonObjectBuilder json, Class2 class2) {
    // Your logic for converting class2 into a field in the JSON
    return json;
}
Thanks for your help and solutions. The most helpful was the first response, from Vlad Bochenin. The code is here:
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
items.stream().map(item -> {
    List<Class2> items2 = baseManager.findObjectsByNamedQuery(Class2.class, "Class2.findByCreatedBy", new Object[]{item.getCreatedBy()});
    JsonObjectBuilder jpb = Json.createObjectBuilder()
            .add("createdBy", item.getCreatedBy())
            .add("phone", items2.get(0).getPhone());
    return jpb;
}).forEach(groupsBuilder::add);
Using Java 8, I want to create a new collection from a list and accumulate a sum along the way.
The source list consists of objects that look something like this:
class Event {
    String description;
    double sum;
}
With an example list like this:
{ { "desc1", 10.0 }, {"desc2", 14.0 }, {"desc3", 5.0 } }
The resulting list should look like this:
desc1, 10.0, 10.0
desc2, 14.0, 24.0
desc3, 5.0, 29.0
I know how to sum up to get the final sum, in this case 29.0, but I want to create the result list and accumulate the sum along the way at the same time.
How can I do this with Java 8?
You could do this by implementing your own collector to perform the mapping and summing together. Your streaming code would look like this:
List<SummedEvent> summedEvents = events.stream()
        .collect(EventConsumer::new, EventConsumer::accept, EventConsumer::combine)
        .summedEvents();

summedEvents.forEach((se) ->
        System.out.println(String.format("%s, %2f, %2f", se.description, se.sum, se.runningTotal)));
For this I've assumed a new class SummedEvent which also holds the running total. Your collector class would then be implemented something like this:
class EventConsumer {
    private List<SummedEvent> summedEvents = new ArrayList<>();
    private double runningTotal = 0;

    public void accept(Event event) {
        runningTotal += event.sum;
        summedEvents.add(new SummedEvent(event.description, event.sum, runningTotal));
    }

    public void combine(EventConsumer other) {
        this.summedEvents.addAll(other.summedEvents);
        this.runningTotal += other.runningTotal;
    }

    public List<SummedEvent> summedEvents() {
        return summedEvents;
    }
}
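For completeness, a minimal SummedEvent holder consistent with the code above could look like this (just a sketch; the fields are inferred from how the collector uses them):
// Simple value holder assumed by the collector above.
class SummedEvent {
    final String description;
    final double sum;
    final double runningTotal;

    SummedEvent(String description, double sum, double runningTotal) {
        this.description = description;
        this.sum = sum;
        this.runningTotal = runningTotal;
    }
}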
If you run your pipeline sequentially, you can use this little hack with peek:
double[] acc = {0};
List<CustomEvent> list = originalList.stream()
        .peek(e -> acc[0] += e.sum)
        .map(e -> new CustomEvent(e, acc[0]))
        .collect(toList());
Be aware that you'll get wrong results if the stream is run in parallel.
However, I'm not sure whether the pipeline can be run in parallel in one pass, but assuming the underlying list has fast access to the element at index i, you can do it like this:
double[] acc = originalList.stream().mapToDouble(e -> e.sum).toArray();
Arrays.parallelPrefix(acc, Double::sum);
List<CustomEvent> lx = IntStream.range(0, originalList.size())
        .parallel()
        .mapToObj(i -> new CustomEvent(originalList.get(i), acc[i]))
        .collect(toList());
parallelPrefix will apply the reduction you are looking for over the sums. Then you just have to stream the indices and map each event to its corresponding accumulated sum.
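Both snippets assume a CustomEvent holder along these lines (just a sketch inferred from the constructor calls above):
// Wraps the original event together with the running total at its position.
class CustomEvent {
    final Event event;
    final double runningTotal;

    CustomEvent(Event event, double runningTotal) {
        this.event = event;
        this.runningTotal = runningTotal;
    }
}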