I'm comparing two ways to filter lists, with and without streams. The version without streams turns out to be much faster for a list of 10,000 items. I'd like to understand why. Can anyone explain the results?
public static int countLongWordsWithoutUsingStreams(
final List<String> words, final int longWordMinLength) {
words.removeIf(word -> word.length() <= longWordMinLength);
return words.size();
}
public static int countLongWordsUsingStreams(final List<String> words, final int longWordMinLength) {
return (int) words.stream().filter(w -> w.length() > longWordMinLength).count();
}
Microbenchmark using JMH:
@Benchmark
@BenchmarkMode(Throughput)
@OutputTimeUnit(MILLISECONDS)
public void benchmarkCountLongWordsWithoutUsingStreams() {
countLongWordsWithoutUsingStreams(nCopies(10000, "IAmALongWord"), 3);
}
@Benchmark
@BenchmarkMode(Throughput)
@OutputTimeUnit(MILLISECONDS)
public void benchmarkCountLongWordsUsingStreams() {
countLongWordsUsingStreams(nCopies(10000, "IAmALongWord"), 3);
}
public static void main(String[] args) throws RunnerException {
final Options opts = new OptionsBuilder()
.include(PracticeQuestionsCh8Benchmark.class.getSimpleName())
.warmupIterations(5).measurementIterations(5).forks(1).build();
new Runner(opts).run();
}
java -jar target/benchmarks.jar -wi 5 -i 5 -f 1
Benchmark Mode Cnt Score Error Units
PracticeQuestionsCh8Benchmark.benchmarkCountLongWordsUsingStreams thrpt 5 10.219 ± 0.408 ops/ms
PracticeQuestionsCh8Benchmark.benchmarkCountLongWordsWithoutUsingStreams thrpt 5 910.785 ± 21.215 ops/ms
Edit: (as someone deleted the update posted as an answer)
public class PracticeQuestionsCh8Benchmark {
private static final int NUM_WORDS = 10000;
private static final int LONG_WORD_MIN_LEN = 10;
private final List<String> words = makeUpWords();
public List<String> makeUpWords() {
List<String> words = new ArrayList<>();
final Random random = new Random();
for (int i = 0; i < NUM_WORDS; i++) {
if (random.nextBoolean()) {
/*
* Do this to avoid string interning. c.f.
* http://en.wikipedia.org/wiki/String_interning
*/
words.add(String.format("%" + LONG_WORD_MIN_LEN + "s", i));
} else {
words.add(String.valueOf(i));
}
}
return words;
}
@Benchmark
@BenchmarkMode(AverageTime)
@OutputTimeUnit(MILLISECONDS)
public int benchmarkCountLongWordsWithoutUsingStreams() {
return countLongWordsWithoutUsingStreams(words, LONG_WORD_MIN_LEN);
}
@Benchmark
@BenchmarkMode(AverageTime)
@OutputTimeUnit(MILLISECONDS)
public int benchmarkCountLongWordsUsingStreams() {
return countLongWordsUsingStreams(words, LONG_WORD_MIN_LEN);
}
}
public static int countLongWordsWithoutUsingStreams(
final List<String> words, final int longWordMinLength) {
final Predicate<String> p = s -> s.length() >= longWordMinLength;
int count = 0;
for (String aWord : words) {
if (p.test(aWord)) {
++count;
}
}
return count;
}
public static int countLongWordsUsingStreams(final List<String> words,
final int longWordMinLength) {
return (int) words.stream()
.filter(w -> w.length() >= longWordMinLength).count();
}
Whenever your benchmark says that some operation over 10000 elements takes 1ns (edit: 1µs), you have probably found a case of the JVM cleverly figuring out that your code doesn't actually do anything.
Collections.nCopies doesn't actually make a list of 10000 elements. It makes a sort of a fake list with 1 element and a count of how many times it's supposedly there. That list is also immutable, so your countLongWordsWithoutUsingStreams would throw an exception if there was something for removeIf to do.
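The behaviour described above can be checked with a small sketch. The class and method names (NCopiesDemo, tryRemoveIf) are made up for illustration; only Collections.nCopies and removeIf come from the question. It runs removeIf once with a predicate that matches nothing and once with a predicate that matches every element:

```java
import java.util.Collections;
import java.util.List;

class NCopiesDemo {
    /** Describes what removeIf does on an nCopies list for a given length threshold. */
    static String tryRemoveIf(int threshold) {
        // Same list as in the benchmark: one element plus a repeat count.
        List<String> words = Collections.nCopies(10_000, "IAmALongWord");
        try {
            boolean changed = words.removeIf(w -> w.length() <= threshold);
            return "no exception, changed=" + changed;
        } catch (UnsupportedOperationException e) {
            // Reached as soon as removeIf actually has to remove an element.
            return "UnsupportedOperationException";
        }
    }

    public static void main(String[] args) {
        System.out.println(tryRemoveIf(3));   // predicate matches nothing
        System.out.println(tryRemoveIf(100)); // predicate matches every element
    }
}
```

With the benchmark's threshold of 3 nothing is ever removed, which is why the original benchmark did not fail.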
You do not return any values from your benchmark methods, thus, JMH has no chance to escape the computed values and your benchmark suffers dead code elimination. You compute how long it takes to do nothing. See the JMH page for further guidance.
That said, streams can be slower in some cases; see: Java 8: performance of Streams vs Collections
With the following code:
public class Main {
public static void main(String[] args) {
final List<Integer> items =
IntStream.rangeClosed(0, 23).boxed().collect(Collectors.toList());
final String s = items
.stream()
.map(Object::toString)
.collect(Collectors.joining(","))
.toString()
.concat(".");
System.out.println(s);
}
}
I get:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23.
What I would like to do is break the line every 10 items, in order to get:
0,1,2,3,4,5,6,7,8,9,
10,11,12,13,14,15,16,17,18,19,
20,21,22,23.
I have tried a lot of things after googling, without any success.
Can you help me?
Thanks,
Olivier.
If you're open to using a third-party library, the following will work using Eclipse Collections Collectors2.chunk(int).
String s = IntStream.rangeClosed(0, 23)
.boxed()
.collect(Collectors2.chunk(10))
.collectWith(MutableList::makeString, ",")
.makeString("", ",\n", ".");
The result of Collectors2.chunk(10) will be a MutableList<MutableList<Integer>>. At this point I switch from the Streams APIs to using native Eclipse Collections APIs which are available directly on the collections. The method makeString is similar to Collectors.joining(). The method collectWith is like Stream.map() with the difference that a Function2 and an extra parameter are passed to the method. This allows a method reference to be used here instead of a lambda. The equivalent lambda would be list -> list.makeString(",").
If you use just Eclipse Collections APIs, this problem can be simplified as follows:
String s = Interval.zeroTo(23)
.chunk(10)
.collectWith(RichIterable::makeString, ",")
.makeString("", ",\n", ".");
Note: I am a committer for Eclipse Collections.
If all you want to do is process these ascending numbers, you can do it like this:
String s = IntStream.rangeClosed(0, 23).boxed()
.collect(Collectors.groupingBy(i -> i/10, LinkedHashMap::new,
Collectors.mapping(Object::toString, Collectors.joining(","))))
.values().stream()
.collect(Collectors.joining(",\n", "", "."));
This solution can be adapted to work on an arbitrary random access list as well, e.g.
List<Integer> items = IntStream.rangeClosed(0, 23).boxed().collect(Collectors.toList());
String s = IntStream.range(0, items.size()).boxed()
.collect(Collectors.groupingBy(i -> i/10, LinkedHashMap::new,
Collectors.mapping(ix -> items.get(ix).toString(), Collectors.joining(","))))
.values().stream()
.collect(Collectors.joining(",\n", "", "."));
However, there is no simple and elegant solution for arbitrary streams, a limitation which applies to all kinds of tasks that depend on the element's position.
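For the random access list case, the same idea can also be sketched with subList instead of groupingBy. This is only an illustration under that assumption (a List, not an arbitrary stream); ChunkedJoin and joinInChunks are hypothetical names:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ChunkedJoin {
    /** Joins items with ",", breaking the line after every chunkSize items, ending with ".". */
    static <T> String joinInChunks(List<T> items, int chunkSize) {
        int chunks = (items.size() + chunkSize - 1) / chunkSize; // ceiling division
        return IntStream.range(0, chunks)
            // view of the c-th chunk; the last one may be shorter
            .mapToObj(c -> items.subList(c * chunkSize,
                                         Math.min(items.size(), (c + 1) * chunkSize)))
            .map(chunk -> chunk.stream().map(Object::toString)
                               .collect(Collectors.joining(",")))
            .collect(Collectors.joining(",\n", "", "."));
    }

    public static void main(String[] args) {
        List<Integer> items =
            IntStream.rangeClosed(0, 23).boxed().collect(Collectors.toList());
        System.out.println(joinInChunks(items, 10));
    }
}
```

subList returns views rather than copies, so no intermediate map has to be built.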
Here is an adaptation of the Collector already linked in the comments:
private static Collector<String, ?, String> partitioning(int size) {
class Acc {
int count = 0;
List<List<String>> list = new ArrayList<>();
void add(String elem) {
int index = count++ / size;
if (index == list.size()) {
list.add(new ArrayList<>());
}
list.get(index).add(elem);
}
Acc merge(Acc right) {
// an empty side contributes nothing (this can happen in parallel streams)
if (right.list.isEmpty()) {
return this;
}
if (list.isEmpty()) {
return right;
}
List<String> lastLeftList = list.get(list.size() - 1);
List<String> firstRightList = right.list.get(0);
int lastLeftSize = lastLeftList.size();
int firstRightSize = firstRightList.size();
// both chunks are full, so simply appending the right chunks works
if (lastLeftSize + firstRightSize == 2 * size) {
System.out.println("Perfect!");
list.addAll(right.list);
count += right.count; // keep the running count in sync for later add() calls
return this;
}
// the last left chunk and the first right chunk together fill exactly one chunk
if (lastLeftSize + firstRightSize == size) {
System.out.println("Almost perfect");
lastLeftList.addAll(firstRightList);
right.list.remove(0);
list.addAll(right.list);
count += right.count; // keep the running count in sync for later add() calls
return this;
}
// general case: re-add every element of the right side one by one
right.list.stream().flatMap(List::stream).forEach(this::add);
return this;
}
public String finisher() {
return list.stream().map(x -> x.stream().collect(Collectors.joining(",")))
.collect(Collectors.collectingAndThen(Collectors.joining(",\n"), x -> x + "."));
}
}
return Collector.of(Acc::new, Acc::add, Acc::merge, Acc::finisher);
}
And usage would be:
String result = IntStream.rangeClosed(0, 24)
.mapToObj(String::valueOf)
.collect(partitioning(10));
Hello, I wrote a program that processes a config file line by line, checking each line and extracting the config blocks. I first wrote it in PHP, and the speed was amazing. We have blocks like this:
Block {
}
The PHP program can read each line and detect about 50,000 of these blocks in just 1 second. I then rewrote the program in C++ and ran into a serious problem: it was far too slow (50,000 of these blocks took 55 seconds), even though the PHP code and the C++ code do exactly the same thing. PHP was 55x faster than C++ with the same logic.
This is my code in PHP:
const PATH = "conf.txt";
if(!file_exists(PATH)) die("path_not_found");
if(!is_readable((PATH))) die("path_not_readable");
$Lines = explode("\r\n", file_get_contents(PATH));
class Block
{
public $Name;
public $Keys = array();
public $Blocks = array();
}
function Handle(& $Lines, $Start, & $Return_block, & $End_on)
{
for ($i = $Start; $i < count($Lines); $i++)
{
while (trim($Lines[$i]) != "")
{
$Pos1 = strpos($Lines[$i], "{");
$Pos2 = strpos($Lines[$i], "}");
if($Pos1 !== false && ($Pos2 === false || $Pos2 > $Pos1)) // Detect { in less position
{
$thisBlock = new Block();
$thisBlock->Name = trim(substr($Lines[$i], 0, $Pos1));
$Lines[$i] = substr($Lines[$i], $Pos1 + 1);
Handle($Lines, $i, $thisBlock, $i);
$Return_block->Blocks[] = $thisBlock;
}
else { // Detect } in less position than {
$Lines[$i] = substr($Lines[$i], $Pos2 + 1);
$End_on = $i;
return;
}
}
}
}
$DefaultBlock = new Block();
Handle($Lines, 0, $DefaultBlock, $NullValue);
$OutsideKeys = $DefaultBlock->Keys;
$Blocks = $DefaultBlock->Blocks;
echo "Found (".count($OutsideKeys).") keys and (".count($Blocks).") blocks.<br><br>";
And this is my code in C++:
string Trim(string & s)
{
auto wsfront = std::find_if_not(s.begin(), s.end(), [](int c) {return std::isspace(c); });
auto wsback = std::find_if_not(s.rbegin(), s.rend(), [](int c) {return std::isspace(c); }).base();
return (wsback <= wsfront ? std::string() : std::string(wsfront, wsback));
}
class Block
{
private:
string Name;
vector <Block> Blocks;
public:
void Add(Block & thisBlock) { Blocks.push_back(thisBlock); }
Block(string Getname = string()) { Name = Getname; }
int Count() { return Blocks.size(); }
};
void Handle(vector <string> & Lines, size_t Start, Block & Return, size_t & LastPoint, bool CheckEnd = true)
{
for (size_t i = Start; i < Lines.size(); i++)
{
while (Trim(Lines[i]) != "")
{
size_t Pos1 = Lines[i].find("{");
size_t Pos2 = Lines[i].find("}");
if (Pos1 != string::npos && (Pos2 == string::npos || Pos1 < Pos2)) // Found {
{
string Name = Trim(Lines[i].substr(0, Pos1));
Block newBlock = Block(Name);
Lines[i] = Lines[i].substr(Pos1 + 1);
Handle(Lines, i, newBlock, i);
Return.Add(newBlock);
}
else { // Found }
Lines[i] = Lines[i].substr(Pos2 + 1);
return;
}
}
}
}
int main()
{
string Cont;
___PATH::GetFileContent("D:\\conf.txt", Cont);
vector <string> Lines = ___String::StringSplit(Cont, "\r\n");
Block Return;
size_t Temp;
// The problem (low handle speed) start from here not from including or split
Handle(Lines, 0, Return, Temp);
cout << "Is(" << Return.Count() << ")" << endl;
return 0;
}
As you can see, the two programs do exactly the same thing, but I don't know why the PHP version is 55x faster than my C++ code. You can create a text file containing about 50,000 of these blocks:
Block {
}
and test it yourself. Please help me fix this; I am really confused (same code, but not the same performance):
PHP: 50,000 blocks detected in 1 second
C++: 50,000 blocks detected in 55 seconds (and maybe more)!
There is no problem with my program design, because I get the performance I want in PHP. My problem is that the C++ version is 55x slower than PHP for the same logic.
I am using Visual Studio 2017 to compile the C++ program.
First, "code" is singular, not plural.
C++ is a very different language than php. It is not "the same code", and it is nowhere near the same in action.
For example, these two lines:
Block newBlock = Block(Name);
Return.Add(newBlock);
First create a Block on the stack, and then call Block's copy constructor to make another one inside the vector. You then throw away the stack object.
Also, vectors guarantee that their storage is contiguous, so as you add new Blocks via your Add method, the vector will occasionally stop, allocate a larger chunk of memory (the growth factor is implementation-defined, typically 1.5x or 2x), copy everything over to that new chunk, and then free the old one. Either preallocate the vector (via vector::reserve()), or, if you don't need contiguous storage, consider using something like a deque, which doesn't guarantee it.
I also don't know what ___String::StringSplit does, but you are almost certain to have the same vector growth problem in reading your file.
The culprit is in these two lines:
Handle(Lines, i, newBlock, i);
Return.Add(newBlock);
Let's say you have 5 nested levels with 1 block each. What happens at the bottom level? You copy one block. What happens at level 4? You copy 2 blocks (the parent and its child). So for 5 levels you make 15 copies in total: 1+2+3+4+5. Look at this diagram:
Handle level1 copies 5 blocks (`Return`->level2->level3->level4->level5)
Handle level2 copies 4 blocks (`Return`->level3->level4->level5)
Handle level3 copies 3 blocks (`Return`->level4->level5)
Handle level4 copies 2 blocks (`Return`->level5)
Handle level5 copies 1 block (`Return`)
Formula is:
S = ( N + N^2 ) / 2
So for 20 levels you would make 210 copies, and so on.
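The formula is simply the sum of the per-level copy counts from the diagram. Written out as a worked check, with N standing for the nesting depth:

```latex
S = \sum_{k=1}^{N} k = \frac{N(N+1)}{2} = \frac{N + N^2}{2},
\qquad
S\big|_{N=5} = \frac{5 \cdot 6}{2} = 15,
\qquad
S\big|_{N=20} = \frac{20 \cdot 21}{2} = 210
```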
The suggestion is to use move semantics to avoid this copy:
// change method Add to this
void Add(Block thisBlock) { Blocks.push_back(std::move(thisBlock)); }
// and change this call
Return.Add( std::move( newBlock ) );
Or allocate blocks dynamically using smart pointers
Out of simple curiosity, try this Trim implementation instead:
#include <cctype>
#include <string>
void TrimInto(std::string& result, const std::string& s) {
const auto* ptr = s.data();
const auto* left = ptr;
const auto* end = s.data() + s.size();
// cast to unsigned char: std::isspace is undefined for negative char values
while (ptr < end && std::isspace(static_cast<unsigned char>(*ptr))) {
++ptr;
}
if (ptr == end) {
result.clear();
return;
}
left = ptr;
while (end > left && std::isspace(static_cast<unsigned char>(*(end - 1)))) {
--end;
}
result.assign(left, end);
}
std::string Trim(const std::string& s) {
// Not sure if RVO would fire for direct implementation of TrimInto here
std::string result;
TrimInto(result, s);
return result;
}
And another optimization:
void Add(Block& thisBlock) {
Blocks.push_back(std::move(thisBlock));
}
// Don't use thisBlock after call to this function. It is
// far from being pretty but it should avoid *lots* of copies.
I wonder whether you'll get a better result. Please let me know.
After watching some videos on the Rust language, I'm increasingly interested in examining my coding decisions based on mitigating the complexity of shared mutable state. Functional programming/Lambda Calculus seems to be the most popular standard to overcome the problem of shared mutable state. Are there alternatives though? Is there a consensus now that functional programming is a reasonable default approach to solve the problem?
Disclaimer:
I am aware that this post might not directly answer your question.
However, many programmers still overlook that they can sometimes avoid shared mutability altogether. I want to show how with an example, and I hope it helps.
TL;DR: Ask yourself whether unshared mutability or shared immutability can also be options.
Have you questioned whether you really need shared mutability?
If you negate either of the two terms, you get two useful alternatives:
unshared mutability
shared immutability
Let's have an example in Java 8 to illustrate what I mean.
This example of shared mutability uses synchronized to avoid visibility issues and race conditions:
public class MutablePoint {
private int x, y;
void move(int dx, int dy) {
x += dx;
y += dy;
}
@Override
public String toString() {
return "MutablePoint{x=" + x + ", y=" + y + '}';
}
}
public class SharedMutability {
public static void main(String[] args) {
final MutablePoint mutablePoint = new MutablePoint();
final Thread moveRightThread = new Thread(() -> {
for (int i = 0; i < 1000; i++) {
synchronized (mutablePoint) {
mutablePoint.move(1, 0);
}
Thread.yield();
}
}, "moveRight");
final Thread moveDownThread = new Thread(() -> {
for (int i = 0; i < 1000; i++) {
synchronized (mutablePoint) {
mutablePoint.move(0, 1);
}
Thread.yield();
}
}, "moveDown");
final Thread displayThread = new Thread(() -> {
for (int i = 0; i < 1000; i++) {
synchronized (mutablePoint) {
System.out.println(mutablePoint);
}
Thread.yield();
}
}, "display");
moveRightThread.start();
moveDownThread.start();
displayThread.start();
}
}
Explanation:
We have got 3 threads. While the two threads moveRight and moveDown write on the mutable point, the one thread display reads from it. All 3 threads must synchronize on the mutable point to avoid visibility issues and race conditions.
How can you apply unshared mutability?
Unshared means "only one thread reading and writing on a mutable object".
You don't need much for that. It's quite easy: you always access a given mutable object from the same ONE thread. Therefore you need neither the synchronized keyword, nor locks, nor the volatile keyword. Moreover, this one thread can be very fast, since it pays for neither locks nor memory barriers, if it only focuses on reading and writing values in the mutable object.
However you are limited to that one thread. That's usually no problem unless you block that one thread with tasks like I/O (don't do that!). Furthermore, you must ensure that the mutable object doesn't "escape" somehow by being assigned to a variable or field outside the one thread and accessed from there.
If you apply unshared mutability to the example, it could look like this:
public class UnsharedMutability {
private static final ExecutorService accessorService = Executors.newSingleThreadExecutor(); // only ONE thread!
private static final MutablePoint mutablePoint = new MutablePoint();
public static void main(String[] args) {
final Thread moveRightThread = new Thread(() -> {
for (int i = 0; i < 1000; i++) {
accessorService.submit(() -> {
mutablePoint.move(1, 0);
});
Thread.yield();
}
}, "moveRight");
final Thread moveDownThread = new Thread(() -> {
for (int i = 0; i < 1000; i++) {
accessorService.submit(() -> {
mutablePoint.move(0, 1);
});
Thread.yield();
}
}, "moveDown");
final Thread displayThread = new Thread(() -> {
for (int i = 0; i < 1000; i++) {
accessorService.submit(() -> {
System.out.println(mutablePoint);
});
Thread.yield();
}
}, "display");
moveRightThread.start();
moveDownThread.start();
displayThread.start();
}
}
Explanation:
We have all 3 threads again. However, none of them needs to synchronize on the mutable point, because they only access it on the one thread that runs inside the single-threaded ExecutorService accessorService.
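A value can be read back from the confined state by submitting a task that returns it, since submit returns a Future. The following is a minimal sketch of that idea, not part of the original example: ConfinedPointDemo, run and the int[] stand-in for MutablePoint are assumptions made for illustration. It also shuts the executor down. As far as I can tell, the example above never does, so its worker thread would keep the JVM alive.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ConfinedPointDemo {
    /** Runs two producer threads and returns the final "x,y" as read on the owner thread. */
    static String run() {
        // Hypothetical stand-in for MutablePoint: {x, y}, touched only by the owner thread.
        final int[] point = new int[2];
        final ExecutorService owner = Executors.newSingleThreadExecutor(); // only ONE thread
        final Thread moveRight = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                owner.submit(() -> { point[0]++; });
            }
        }, "moveRight");
        final Thread moveDown = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                owner.submit(() -> { point[1]++; });
            }
        }, "moveDown");
        try {
            moveRight.start();
            moveDown.start();
            moveRight.join();
            moveDown.join();
            // Queued after all moves, so it observes the final state on the owner thread.
            return owner.submit(() -> point[0] + "," + point[1]).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            owner.shutdown(); // lets the JVM exit once the queue is drained
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The producer threads never touch point directly; they only enqueue work for the owner thread, which is what keeps the state unshared.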
How can you apply shared immutability?
Immutability means "no ability to change the state of the object after its creation". Immutable objects always have only one state. Therefore they are always threadsafe. Immutable objects can create new immutable objects when you want to change them though.
However, creating too many objects too fast can cause a high memory consumption and lead to a higher GC activity. Sometimes you can deduplicate immutable objects if you have many duplicates of them.
If you apply shared immutability to the example, it could look like this:
public class ImmutablePoint {
private final int x;
private final int y;
public ImmutablePoint(int x, int y) {
this.x = x;
this.y = y;
}
ImmutablePoint move(int dx, int dy) {
return new ImmutablePoint(x+dx, y+dy);
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
ImmutablePoint that = (ImmutablePoint) o;
return x == that.x && y == that.y;
}
@Override
public int hashCode() {
return Objects.hash(x, y);
}
@Override
public String toString() {
return "ImmutablePoint{x=" + x + ", y=" + y + '}';
}
}
public class SharedImmutability {
private static AtomicReference<ImmutablePoint> pointReference = new AtomicReference<>(new ImmutablePoint(0, 0));
public static void main(String[] args) {
final Thread moveRightThread = new Thread(() -> {
for (int i = 0; i < 1000; i++) {
pointReference.updateAndGet(point -> point.move(1, 0));
Thread.yield();
}
}, "moveRight");
final Thread moveDownThread = new Thread(() -> {
for (int i = 0; i < 1000; i++) {
pointReference.updateAndGet(point -> point.move(0, 1));
Thread.yield();
}
}, "moveDown");
final Thread displayThread = new Thread(() -> {
for (int i = 0; i < 1000; i++) {
System.out.println(pointReference.get());
Thread.yield();
}
}, "display");
moveRightThread.start();
moveDownThread.start();
displayThread.start();
}
}
Explanation:
We have got all 3 threads again. However, we use an immutable point instead of a mutable point. While the two threads moveRight and moveDown replace the older instance of the immutable point by a newer one in the atomic reference pointReference, the thread display can get the current instance from pointReference and display it (whenever this thread wants because the instance of immutable point is independent of older and newer ones).
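To see that no update is lost with this approach, the final state can be checked once both writer threads are done. A minimal sketch, assuming a hypothetical nested Point class in place of ImmutablePoint (LostUpdateCheck and run are illustrative names as well):

```java
import java.util.concurrent.atomic.AtomicReference;

class LostUpdateCheck {
    // Hypothetical minimal immutable point, standing in for ImmutablePoint.
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
        Point move(int dx, int dy) { return new Point(x + dx, y + dy); }
    }

    /** Runs both writer threads to completion and returns the final {x, y}. */
    static int[] run() {
        final AtomicReference<Point> ref = new AtomicReference<>(new Point(0, 0));
        // updateAndGet may re-apply the function when threads contend,
        // so the function must be free of side effects.
        final Thread moveRight = new Thread(() -> {
            for (int i = 0; i < 1000; i++) ref.updateAndGet(p -> p.move(1, 0));
        }, "moveRight");
        final Thread moveDown = new Thread(() -> {
            for (int i = 0; i < 1000; i++) ref.updateAndGet(p -> p.move(0, 1));
        }, "moveDown");
        moveRight.start();
        moveDown.start();
        try {
            moveRight.join();
            moveDown.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        final Point last = ref.get();
        return new int[] { last.x, last.y };
    }

    public static void main(String[] args) {
        int[] xy = run();
        System.out.println(xy[0] + "," + xy[1]);
    }
}
```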
Remark:
The calls to yield() are meant to encourage thread switches (Thread.yield() is only a hint to the scheduler), because a loop with only 1000 iterations is just too small; most CPUs execute such a loop within a single time slice.
I am trying to get all 10 results for the 10 cases in the for loop, but when I run it, it only returns the first result. Can anyone help with this? This is my whole code; it consists of
2 files, and I have tried many times to fix it.
//file BankAccount.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Bank //just want to demo this thing, it hasn't completed
{
namespace BankAccountNS
{
public class BankAccount
{
private double m_balance;
public BankAccount(double balance)
{
m_balance = balance;
}
public bool getMoney(double amount) //function to get money from the account
{
if (amount > m_balance || amount < 0) //check money
{
return false;
}
return true;
}
}
}
}
//file BankAccountTests.cs
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Bank.BankAccountNS;
namespace BankTest
{
[TestClass]
public class BankAccountTests
{
[TestMethod]
public void TestEveryDebit(BankAccount Ba) //test every case from TestAll
{
Assert.IsTrue(Ba.getMoney(24000));
}
[TestMethod]
public void TestAll() //create all cases
{
for(int i = 0; i < 10; i++)
{
BankAccount Ba = new BankAccount(23996 + i);
TestEveryDebit(Ba);
}
}
}
}
I'm not really clear on what your (attempted) loop asserts would be accomplishing, but the method getMoney seemingly has 2 (or 3) useful unit tests:
Is the amount greater than the balance I have? - return false
Is my account balance less than zero - return false
Is my amount less than or equal to my balance? - return true
In your current setup (if it were to work) you're simply testing that getMoney returns true even for amounts greater than the balance; this is incorrect and does not match the logic you have coded.
I see your unit tests looking like:
private double _balance = 50;
private BankAccount _unitTestObject;
[TestMethod]
public void getMoney_returnsFalseWithInsufficientFunds()
{
_unitTestObject = new BankAccount(_balance );
var results = _unitTestObject.getMoney(_balance+1);
Assert.IsFalse(results);
}
[TestMethod]
public void getMoney_returnsFalseWhenAccountHasLessThanZero() //create all cases
{
_unitTestObject = new BankAccount(-1);
var results = _unitTestObject.getMoney(1);
Assert.IsFalse(results);
}
[TestMethod]
public void getMoney_returnsTrueWhenAccountSufficientBalance() //create all cases
{
_unitTestObject = new BankAccount(_balance);
var results = _unitTestObject.getMoney(_balance);
Assert.IsTrue(results);
}
As I stated in the comments, MSTest (before MSTest V2 added [DataTestMethod] and [DataRow]) can't do parameterized tests, so a [TestMethod] cannot take a parameter. What it looks like you're attempting to do (assert specific logic 10 times) could be done like this:
[TestClass]
public class BankAccountTests
{
[TestMethod]
public void TestAll() //create all cases
{
for(int i = 0; i < 10; i++)
{
BankAccount Ba = new BankAccount(23996 + i);
TestEveryDebit(Ba);
}
}
private void TestEveryDebit(BankAccount Ba) //test every case from TestAll
{
Assert.IsTrue(Ba.getMoney(24000));
}
}
But the test TestAll will always fail, because at some point in your loop (here, already on the first iteration) you try to withdraw more than the account's balance.
When Asserting based on a loop, the "success or failure" of the test is based on the whole, not each individual assert. So even though a few runs of your loop will "pass", the test will fail as a whole.
First, how does D implement parallel foreach (what is the underlying logic)?
int main(string[] args)
{
int[] arr;
arr.length = 100000000;
/* Why does this work? It is a plain foreach that works with a
reference to each int in arr. The parallel function returns ParallelForeach!R
(ParallelForeach!(int[])), but I don't know what that is.
parallel is part of the Phobos library, not a D built-in, so what
kind of magic is used here? */
foreach (ref e;parallel(arr))
{
e = 100;
}
foreach (ref e;parallel(arr))
{
e *= e;
}
return 0;
}
And second, why is it slower than a simple foreach?
Finally, if I create my own taskPool (instead of using the global taskPool object), the program never ends. Why?
parallel returns a struct (of type ParallelForeach) that implements the opApply(int delegate(...)) foreach overload.
When called, the struct submits a parallel function to the private submitAndExecute, which submits the same task to all threads in the pool.
Each thread then does:
scope(failure)
{
// If an exception is thrown, all threads should bail.
atomicStore(shouldContinue, false);
}
while (atomicLoad(shouldContinue))
{
immutable myUnitIndex = atomicOp!"+="(workUnitIndex, 1);
immutable start = workUnitSize * myUnitIndex;
if(start >= len)
{
atomicStore(shouldContinue, false);
break;
}
immutable end = min(len, start + workUnitSize);
foreach(i; start..end)
{
static if(withIndex)
{
if(dg(i, range[i])) foreachErr();
}
else
{
if(dg(range[i])) foreachErr();
}
}
}
Here workUnitIndex and shouldContinue are shared variables, and dg is the foreach delegate.
The reason it is slower is simply because of the overhead required to pass the function to the threads in the pool and atomically accessing the shared variables.
The reason your custom pool doesn't shut down is most likely that you never shut down the thread pool with finish.
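The work-unit loop quoted above can be sketched in Java to make the shared-counter overhead more visible. This is only an illustration of the same scheme, not Phobos code: WorkUnitSketch, fillParallel and its parameters are hypothetical, and an AtomicInteger plays the role of workUnitIndex.

```java
import java.util.concurrent.atomic.AtomicInteger;

class WorkUnitSketch {
    /** Fills an array with 100, letting each thread claim work units from a shared counter. */
    static int[] fillParallel(int len, int workUnitSize, int threads) {
        final int[] data = new int[len];
        final AtomicInteger nextUnit = new AtomicInteger(0); // shared between all workers
        final Runnable worker = () -> {
            while (true) {
                // One atomic operation per work unit: this is the coordination cost.
                int start = nextUnit.getAndIncrement() * workUnitSize;
                if (start >= len) {
                    break; // no work units left
                }
                int end = Math.min(len, start + workUnitSize);
                for (int i = start; i < end; i++) {
                    data[i] = 100; // the "loop body", trivially cheap as in the question
                }
            }
        };
        final Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(worker);
            pool[t].start();
        }
        for (Thread t : pool) {
            try {
                t.join(); // also makes the workers' writes visible to the caller
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return data;
    }

    public static void main(String[] args) {
        int[] out = fillParallel(1_000_000, 100, 4);
        System.out.println(out[0] + " ... " + out[out.length - 1]);
    }
}
```

When the loop body is as cheap as a single assignment, starting the workers and claiming units atomically can easily cost more than the work itself, which matches the explanation above.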