In Python 2.7 I am trying to distribute the computation of a two-dimensional array across all of the cores.
For that I have two arrays at global scope, one to read from and one to write to.
import itertools as it
import multiprocessing as mp
import numpy as np

temp_env = 20
c = 0.25
a = 0.02

arr = np.ones((100,100))
x = arr.shape[0]
y = arr.shape[1]
new_arr = np.zeros((x,y))

def calc_inside(idx):
    new_arr[idx[0],idx[1]] = ( arr[idx[0], idx[1] ]
                               + c * ( arr[idx[0]+1,idx[1] ]
                                     + arr[idx[0]-1,idx[1] ]
                                     + arr[idx[0], idx[1]+1]
                                     + arr[idx[0], idx[1]-1]
                                     - arr[idx[0], idx[1] ]*4
                                     )
                               - 2 * a
                                   * ( arr[idx[0], idx[1] ]
                                     - temp_env
                                     )
                               )

inputs = it.product( range( 1, x-1 ),
                     range( 1, y-1 )
                     )

p = mp.Pool()
p.map( calc_inside, inputs )

#for i in inputs:
#    calc_inside(i)
#plot arrays as surface plot to check values
Assume there is some additional initialization of the array arr with values other than the exemplary ones, so that the computation (an iterative calculation of the temperature) actually makes sense.
When I use the commented-out for-loop instead of the Pool.map() method, everything works fine and the array actually contains values. When using the Pool(), the variable new_arr just stays in its initialized state (meaning it contains only the zeros it was originally initialised with).
Q1 : Does that mean that Pool() prevents writing to global variables?
Q2 : Is there any other way to tackle this problem with parallelization?
A1 :
Your code actually does not declare any variable with the global <variable> syntax. More importantly, each Pool worker is a separate process with its own copy of the module-level arrays, so whatever a worker writes into new_arr never reaches the parent process. Nevertheless, do not try to go down the road of using globals, and even less so when going into distributed processing.
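(A side note, not part of the original answer: if one still wants to use Pool.map() despite the costs discussed in A2 below, a workable pattern is to have each worker return its computed value and let the parent process write it into new_arr. A minimal sketch, mirroring the question's variable names:)

import itertools as it
import multiprocessing as mp
import numpy as np

temp_env = 20
c = 0.25
a = 0.02
arr = np.ones((100, 100))
x, y = arr.shape

def calc_inside(idx):
    i, j = idx
    # return the index together with the computed value instead of writing a global
    return (i, j, (arr[i, j]
                   + c * (arr[i + 1, j] + arr[i - 1, j]
                          + arr[i, j + 1] + arr[i, j - 1]
                          - arr[i, j] * 4)
                   - 2 * a * (arr[i, j] - temp_env)))

if __name__ == '__main__':
    new_arr = np.zeros((x, y))
    inputs = it.product(range(1, x - 1), range(1, y - 1))
    p = mp.Pool()
    for i, j, val in p.map(calc_inside, inputs):
        new_arr[i, j] = val      # assembled in the parent, so the writes are visible
    p.close()
    p.join()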
A2 :
Yes, going parallel is possible, but first get a thorough sense of the costs of doing so, before spending (well, wasting) effort that the achievable benefits will never justify.
Why start with costs?
Would you pay your bank clerk $2.00 just to receive a $1.00 banknote in exchange?
Guess no one would ever do that.
The same goes for trying to go parallel.
The syntax is "free" and "promising";
the actual cost of executing that simple and nice-looking syntax constructor is not. Expect rather shocking surprises instead of any free dinner.
What are the actual costs? Benchmark. Benchmark. Benchmark!
The useful work
Your code actually does just a few memory accesses and a few floating-point operations "inside" the block and exits. These FLOPs take at most a few tens of [ns], up to a few [us], at recent CPU frequencies of ~2.6-3.4 [GHz]. Do benchmark it:
from zmq import Stopwatch; aClk = Stopwatch()

temp_env = 20
c = 0.25
a = 0.02

aClk.start()
_ = ( 1.0
      + c * ( 1.0
            + 1.0
            + 1.0
            + 1.0
            - 1.0 * 4
            )
      - 2 * a
          * ( 1.0
            - temp_env
            )
      )
T = aClk.stop()
So a pure [SERIAL]-process execution, even on a quad-core CPU, will not be worse than about T+T+T+T (the blocks having been executed one after another).
+-+-+-+--------------------------------- on CpuCore[0]
: : : :
<start>|
       |T|
       .  |T|
       .     |T|
       .        |T|
       .           |<stop>
       |<--------->|
              = 4.T in a pure [SERIAL]-process schedule on CpuCore[0]
What will happen if the same amount of useful work is enforced to happen inside some form of the now [CONCURRENT]-process execution (using multiprocessing.Pool's methods), potentially using more CPU-cores for the same purpose?
The actual compute phase will not last less than T again, right? Why would it ever? Yes, it never will.
A.......................B...............C.D                                  Cpu[0]
<start>|
       |<setup a subprocess[A]>|
       .                       |<setup a subprocess[B]>|
       .                       |   +............+........................... Cpu[!0]
       .                       |   :            :                       |
       .                       |   |T|          :                       |
       .                       |   |<ret val(s) |                       |   +............+... Cpu[!0]
       .                       |   |   [A]->main|                       |   :            :
       .                                                                |   |T|          :
       .                                                                |   |<ret val(s) |
       .                                                                |   |   [B]->main|
       .
       .
       .                                                                          .. |<stop>
       |<-------------------------------------------------------------------------- .. ->|
                      i.e. >> 4.T ( while simplified, yet the message is clear )
The overheads, i.e. the costs you will always pay per call (and that is indeed many times)
Subprocess setup + termination costs (benchmark these to learn the scale of this cost). Memory access costs (latency, where zero cache re-use happens to help, as you exit right away).
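(Again a sketch I am adding, not from the original answer: one crude way to see those per-call overheads is to time the Pool setup, a map() over a do-nothing worker, and the teardown separately. The sizes and names are illustrative.)

import multiprocessing as mp
import time

def noop(x):
    return x    # no useful work at all: any time measured here is pure overhead

if __name__ == '__main__':
    t0 = time.time()
    p = mp.Pool()
    t1 = time.time()
    p.map(noop, range(10000))      # task dispatch + IPC + result pickling overheads
    t2 = time.time()
    p.close(); p.join()
    t3 = time.time()
    print("Pool setup      : %.6f [s]" % (t1 - t0))
    print("map() of no-ops : %.6f [s]" % (t2 - t1))
    print("Pool termination: %.6f [s]" % (t3 - t2))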
Epilogue
Hope the visual message is clear enough to ALWAYS start accounting for the accrued costs before deciding about any achievable benefits from re-engineering code into a distributed process flow, even with a multi-core or many-core [CONCURRENT] fabric available.
Related
Let's say I have a set of symbols s = { a, b, c, ... } and a corpus1:
a b g k.
o p a r b.
......
By simple counting I can calculate the probabilities p(sym), p(sym1,sym2), p(sym1|sym2).
I will use upper case for the set S, for CORPUS2, and for the probabilities related to them.
Now I create the set S of all combinations of s, S = { ab, ac, ad, bc, bd, ... }, such that I create CORPUS2 from corpus1 in the following manner:
ab ag ak, bg bk, gk.
op oa or ob, pa pr pb, ar ab, rb.
......
i.e. all pair combinations, where order does not matter (ab == ba). The commas are only for visual purposes.
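For illustration, the CORPUS2 construction described above could be sketched like this (my own snippet, with made-up variable names; sorting each pair just canonicalises ab == ba, and single-character symbols are assumed as in the example):

# Illustrative sketch of the CORPUS2 construction described above.
# corpus1 is assumed to be a list of symbol sequences (one list per sentence).
from itertools import combinations

corpus1 = [['a', 'b', 'g', 'k'],
           ['o', 'p', 'a', 'r', 'b']]

# every unordered pair of symbols within a sequence becomes one SYM of CORPUS2
CORPUS2 = [[''.join(sorted(pair)) for pair in combinations(seq, 2)]
           for seq in corpus1]
print(CORPUS2)
# [['ab', 'ag', 'ak', 'bg', 'bk', 'gk'],
#  ['op', 'ao', 'or', 'bo', 'ap', 'pr', 'bp', 'ar', 'ab', 'br']]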
My question: Is it possible to express the probabilities P(SYM), P(SYM1,SYM2), P(SYM1|SYM2) via p(sym), p(sym1,sym2), p(sym1|sym2), i.e. is there a formula?
PS> In my thinking I'm stuck at the following dilemma ...
p(sym) = count(sym) / n
but to calculate P(SYM) without materializing CORPUS2 there seems to be no way, because it depends on p(sub-sym1), p(sub-sym2) multiplied by the length of the sequences they participate in, where SYM = sub-sym1:sub-sym2.
Maybe: ~P(SYM) = p(sub-sym1,sub-sym2) * p(sub-sym1) * p(sub-sym2) * avg-seq-len
P(SYM) = for seq in corpus1 :
             total += ( len(seq) * (len(seq)+1) ) / 2
             for sub-sym1 and sub-sym2 in combinations(seq,2) :
                 if sub-sym1 and sub-sym2 == SYM :
                     count += 1
         return count/total
There is a condition and a hidden/random parameter (the sequence length) involved ...
What about P(SYM1,SYM2) and P(SYM1|SYM2)?
Probabilities are defined/calculated in the usual way by counting: for lower-case symbols using corpus1 and for upper-case symbols using CORPUS2.
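To make the counting idea above concrete, here is a sketch of estimating P(SYM) directly from corpus1 without materializing CORPUS2. It is my own illustration of the pseudocode, using n*(n-1)/2 as the number of unordered pairs per sequence (which matches the 6 and 10 pairs in the example) rather than the n*(n+1)/2 written above; it only counts, it does not settle whether a closed-form formula in terms of the lower-case probabilities exists:

# Sketch of the counting approach: estimate P(SYM) directly from corpus1 by
# counting unordered pairs per sequence, without materializing CORPUS2.
from itertools import combinations
from collections import Counter

corpus1 = [['a', 'b', 'g', 'k'],
           ['o', 'p', 'a', 'r', 'b']]

pair_counts = Counter()
total_pairs = 0
for seq in corpus1:
    total_pairs += len(seq) * (len(seq) - 1) // 2   # number of unordered pairs in seq
    for s1, s2 in combinations(seq, 2):
        pair_counts[''.join(sorted((s1, s2)))] += 1  # canonicalise so ab == ba

def P(SYM):
    return pair_counts[SYM] / float(total_pairs)

print(P('ab'))   # 'ab' occurs in both sequences -> 2 / (6 + 10)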
I'm learning ANTLR. At this point, I'm writing a little stack-based language as part of my learning process -- think PostScript or Forth. An RPN language. For instance:
10 20 mul
This would push 10 and 20 on the stack and then perform a multiply, which pops two values, multiplies them, and pushes 200. I'm using the visitor pattern. And I find myself writing some code that's kind of insane. There has to be a better way.
Here's a section of my WaveParser.g4 file:
any_operator:
    value_operator |
    stack_operator |
    logic_operator |
    math_operator |
    flow_control_operator;

value_operator:
    BIND | DEF
    ;

stack_operator:
    DUP |
    EXCH |
    POP |
    COPY |
    ROLL |
    INDEX |
    CLEAR |
    COUNT
    ;
BIND is just the bind keyword, etc. So my visitor has this method:
antlrcpp::Any WaveVisitor::visitAny_operator(Parser::Any_operatorContext *ctx);
And now here's where I'm getting to the very ugly code I'm writing, which leads to the question.
Value::Operator op = Value::Operator::NO_OP;

WaveParser::Value_operatorContext * valueOp = ctx->value_operator();
WaveParser::Stack_operatorContext * stackOp = ctx->stack_operator();
WaveParser::Logic_operatorContext * logicOp = ctx->logic_operator();
WaveParser::Math_operatorContext * mathOp = ctx->math_operator();
WaveParser::Flow_control_operatorContext * flowOp = ctx->flow_control_operator();

if (valueOp) {
    if (valueOp->BIND()) {
        op = Value::Operator::BIND;
    }
    else if (valueOp->DEF()) {
        op = Value::Operator::DEF;
    }
}
else if (stackOp) {
    if (stackOp->DUP()) {
        op = Value::Operator::DUP;
    }
    ...
}
...
I'm supporting approximately 50 operators, and it's insane that I'm going to need this series of if statements to figure out which operator this is. There must be a better way to do this. I couldn't find a field on the context that maps to something I could use as a key in a hash map.
I don't know whether I should make every one of my operators a separate rule and use the corresponding method in my visitor, or what else I'm missing.
Is there a better way?
With ANTLR, it's usually very helpful to label components of your rules, as well as the high-level alternatives.
If part of a parser rule can only be one thing with a single type, the default accessors are usually just fine. But if you have several alternatives that are essentially alternatives for the "same thing", or you reference the same sub-rule more than once in a parser rule and want to differentiate them, it's pretty handy to give them names. (Once you start doing this and see the impact on the Context classes, it'll become pretty obvious where they provide value.)
Also, when rules have multiple top-level alternatives, it's very handy to give each of them a label. This will cause ANTLR to generate a separate Context class for each alternative, instead of dumping everything from every alternative into a single class. With the op token labels in the grammar below, each generated Context also exposes an op Token member, so your visitor can read op->getType() and, for example, switch on it or look it up in a single map from token type to Value::Operator, instead of chaining if statements.
(making some stuff up just to get a valid compile)
grammar WaveParser;

any_operator
    : value_operator        # val_op
    | stack_operator        # stack_op
    | logic_operator        # logic_op
    | math_operator         # math_op
    | flow_control_operator # flow_op
    ;

value_operator: op = (BIND | DEF);

stack_operator
    : op = (
        DUP
        | EXCH
        | POP
        | COPY
        | ROLL
        | INDEX
        | CLEAR
        | COUNT
    )
    ;

logic_operator: op = (AND | OR);

math_operator: op = (ADD | SUB);

flow_control_operator: op = (FLOW1 | FLOW2);
AND: 'and';
OR: 'or';
ADD: '+';
SUB: '-';
FLOW1: '>>';
FLOW2: '<<';
BIND: 'bind';
DEF: 'def';
DUP: 'dup';
EXCH: 'exch';
POP: 'pop';
COPY: 'copy';
ROLL: 'roll';
INDEX: 'index';
CLEAR: 'clear';
COUNT: 'count';
Here is my input data.
ㅡ. Input(Local)
'Iot,c c++ python,2015',
'Web,java spring,2016',
'Iot,c c++ spring,2017',
'Iot,c c++ spring,2017',
This is the result of running apache-beam in a local environment.
ㅡ. Output(Local)
Iot,2015,c,1
Iot,2015,c++,1
Iot,2015,python,1
Iot,2017,c,2
Iot,2017,c++,2
Iot,2017,spring,2
Web,2016,java,1
Web,2016,spring,1
However, when I run it on Google Cloud Platform Dataflow and write the output to a bucket, the results are different.
ㅡ. Storage(Bucket)
Web,2016,java,1
Web,2016,spring,1
Iot,2015,c,1
Iot,2015,c++,1
Iot,2015,python,1
Iot,2017,c,1
Iot,2017,c++,1
Iot,2017,spring,1
Iot,2017,c,1
Iot,2017,c++,1
Iot,2017,spring,1
Here is my code.
ㅡ. Code
#apache_beam
from apache_beam.options.pipeline_options import PipelineOptions
import apache_beam as beam

pipeline_options = PipelineOptions(
    project='project-id',
    runner='dataflow',
    temp_location='bucket-location'
)

def pardo_dofn_methods(test=None):
    import apache_beam as beam

    class split_category_advanced(beam.DoFn):
        def __init__(self, delimiter=','):
            self.delimiter = delimiter
            self.k = 1
            self.pre_processing = []
            self.window = beam.window.GlobalWindow()
            self.year_dict = {}
            self.category_index = 0
            self.language_index = 1
            self.year_index = 2
            self.result = []

        def setup(self):
            print('setup')

        def start_bundle(self):
            print('start_bundle')

        def finish_bundle(self):
            print('finish_bundle')
            for ppc_index in range(len(self.pre_processing)):
                if self.category_index == 0 or self.category_index % 3 == 0:
                    if self.pre_processing[self.category_index] not in self.year_dict:
                        self.year_dict[self.pre_processing[self.category_index]] = {}
                if ppc_index + 2 == 2 or ppc_index + 2 == self.year_index:
                    # { category : { year : {} } }
                    if self.pre_processing[self.year_index] not in self.year_dict[self.pre_processing[self.category_index]]:
                        self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]] = {}
                    # { category : { year : { c : { }, c++ : { }, java : { } } } }
                    language = self.pre_processing[self.year_index - 1].split(' ')
                    for lang_index in range(len(language)):
                        if language[lang_index] not in self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]]:
                            self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]][language[lang_index]] = 1
                        else:
                            self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]][language[lang_index]] += 1
                    self.year_index = self.year_index + 3
                self.category_index = self.category_index + 1

            csvFormat = ''
            for category, nested in self.year_dict.items():
                for year in nested:
                    for language in nested[year]:
                        csvFormat += (category + "," + str(year) + "," + language + "," + str(nested[year][language])) + "\n"
            print(csvFormat)

            yield beam.utils.windowed_value.WindowedValue(
                value=csvFormat,
                # value=self.pre_processing,
                timestamp=0,
                windows=[self.window],
            )

        def process(self, text):
            for word in text.split(self.delimiter):
                self.pre_processing.append(word)
            print(self.pre_processing)

    # with beam.Pipeline(options=pipeline_options) as pipeline:
    with beam.Pipeline() as pipeline:
        results = (
            pipeline
            | 'Gardening plants' >> beam.Create([
                'Iot,c c++ python,2015',
                'Web,java spring,2016',
                'Iot,c c++ spring,2017',
                'Iot,c c++ spring,2017',
            ])
            | 'Split category advanced' >> beam.ParDo(split_category_advanced(','))
            | 'Save' >> beam.io.textio.WriteToText("bucket-location")
            | beam.Map(print)
        )
    if test:
        return test(results)

if __name__ == '__main__':
    pardo_dofn_methods()
This is code for executing a simple word count.
The CSV columns are [ category, year, language, count ],
e.g. IoT, 2015, c, 1.
Thank you for reading.
The most likely reason you are getting different output is parallelism. When using the DataflowRunner, operations run in parallel as much as possible. Since you are counting inside a ParDo, when the two Iot,c c++ spring,2017 elements go to two different workers, the count doesn't happen as you want (you are counting in the ParDo).
You need to use Combiners (4.2.4)
Here you have an easy example of what you want to do:
import apache_beam as beam
from apache_beam import Create, Map

def generate_kvs(element, csv_delimiter=',', field_delimiter=' '):
    splitted = element.split(csv_delimiter)
    fields = splitted[1].split(field_delimiter)
    # final key to count is (Source, year, language)
    return [(f"{splitted[0]}, {splitted[2]}, {x}", 1) for x in fields]

p = beam.Pipeline()

elements = ['Iot,c c++ python,2015',
            'Web,java spring,2016',
            'Iot,c c++ spring,2017',
            'Iot,c c++ spring,2017']

(p | Create(elements)
   | beam.ParDo(generate_kvs)
   | beam.combiners.Count.PerKey()
   | "Format" >> Map(lambda x: f"{x[0]}, {x[1]}")
   | Map(print))

p.run()
This would output the result you want no matter how the elements are distributed across workers.
Note that the idea of Apache Beam is to parallelise as much as possible, and in order to aggregate you need Combiners.
I would recommend checking some wordcount examples so you get the hang of combiners.
EDIT
Clarification on Combiners:
ParDo is an operation that happens on an element-by-element basis. It takes one element, performs some operations, and sends the output to the next PTransform. When you need to aggregate data (count elements, sum values, join sentences...), element-wise operations don't work; you need something that takes a PCollection (i.e., many elements with some logic) and outputs something. This is where combiners come in: they perform operations on a per-PCollection basis, and these can be spread across workers (part of the Map/Reduce style of operation).
In your example, you were using a class attribute to store the count in the ParDo, so when an element went through it, it changed the state within the class. This works when all elements go through the same worker, since the class is instantiated on a per-worker basis (i.e., workers don't share state), but when there are more workers, the count (with the ParDo) happens in each worker separately.
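To make that concrete, here is a minimal sketch (mine, not from the original answer) of a custom CombineFn; beam.combiners.Count.PerKey() used above is built on the same idea of per-worker accumulators that get merged afterwards:

# Minimal sketch of a custom CombineFn: combiners expose per-accumulator hooks so the
# runner can pre-aggregate on each worker and merge the partial results afterwards.
import apache_beam as beam

class SumFn(beam.CombineFn):
    def create_accumulator(self):
        return 0                         # one accumulator per worker/bundle
    def add_input(self, accumulator, element):
        return accumulator + element     # runs element by element, like a ParDo would
    def merge_accumulators(self, accumulators):
        return sum(accumulators)         # partial results from different workers meet here
    def extract_output(self, accumulator):
        return accumulator

with beam.Pipeline() as p:
    (p
     | beam.Create([('Iot, 2017, c', 1), ('Iot, 2017, c', 1), ('Web, 2016, java', 1)])
     | beam.CombinePerKey(SumFn())
     | beam.Map(print))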
I'm trying to write some code that (pseudo-randomly) generates a list of 7 numbers. I have it working for a single run. I'd like to be able to loop this code to generate multiple lists, which I can output to a txt file (I don't need help with this part; I'm quite comfortable working with I/O and files).
I'm now using this code (thanks Jason for getting it this far):
import random
pool = []
original_pool = list( range( 1,60))
def selectAndPrune(x):
    pool = []
    list1 = []
    random.shuffle(pool)
    pool = original_pool.copy()
    current_choice = random.choice(pool)
    list1.append(current_choice)
    pool.remove(current_choice)
    random.shuffle(pool)
    print(list1)

def repeater():
    for i in range(19):
        pool_list = []
        pool = original_pool.copy()
        a = [ selectAndPrune(pool) for x in range(7)]
        pool_list.append(a)

repeater()
This is giving output of single-value lists like:
[21]
[1]
[54]
[48]
[4]
[32]
[15]
etc.
The output I want is 19 lists, all containing 7 random ints:
[1,4,17,23,45,51,3]
[10,2,9,38,4,1,24]
[15,42,35,54,43,28,14]
etc
If I am understanding the question correctly, the objective is to repeat a function 19 times. However, this function slowly removes items from the list at each call, making it impossible to run past the size of the pool as currently written in the question. I suspect that the solution is something like this:
import random
def spinAndPrune():
    random.shuffle( pool )
    current_choice = random.choice( pool )
    pool.remove( current_choice )
    random.shuffle( pool )
    return current_choice
First, I added a return command at the end of the function. Next, you can make a copy of the original pool, so that it is possible to re-run it as many times as desired. Also, you need to store the lists you want to keep:
# create an original pool of values
original_pool = list( range( 1, 60 ) )
# initialize a variable that stores previous runs
pool_list = []
# repeat 19 times
for i in range( 19 ):
    # create a copy of the original pool to a temporary pool
    pool = original_pool.copy()
    # run seven times, storing the current choice in variable a
    a = [ spinAndPrune() for x in range( 7 ) ]
    # keep track of variable a in the pool_list
    pool_list.append( a )
    print( a )
Note the .copy() function to make a copy of the list. As an aside, the range() makes it easy to create lists containing integers 1-59.
If you need to extract a specific list, you can do something along the lines of this:
# print first list
print( pool_list[ 0 ] )
# print fifth list
print( pool_list[ 4 ] )
# print all the lists
print( pool_list )
In my other answer, I approached it by modifying the code in the original question. However, if the point is just to extract X values without repeats from a collection/list, it is probably easiest to just use the random.sample() function. The code can be something along the lines of this:
import random
pool = list( range( 1, 60 ) )
pool_list = []
# sample 19 times and store to the pool_list
for i in range( 19 ):
    sample = random.sample( pool, 7 )
    pool_list.append( sample )
    print( sample )
I've been using gsub extensively lately, and I noticed that short patterns run faster than long ones, which is not surprising. Here's fully reproducible code:
library(microbenchmark)
set.seed(12345)
n = 0
rpt = seq(20, 1461, 20)
msecFF = numeric(length(rpt))
msecFT = numeric(length(rpt))
inp = rep("aaaaaaaaaa",15000)
for (i in rpt) {
  n = n + 1
  print(n)
  patt = paste(rep("a", rpt[n]), collapse = "")
  #time = microbenchmark(func(count[1:10000,12], patt, "b"), times = 10)
  timeFF = microbenchmark(gsub(patt, "b", inp, fixed=F), times = 10)
  msecFF[n] = mean(timeFF$time)/1000000.
  timeFT = microbenchmark(gsub(patt, "b", inp, fixed=T), times = 10)
  msecFT[n] = mean(timeFT$time)/1000000.
}
library(ggplot2)
library(grid)
library(gridExtra)
#axis(1,at=seq(0,1000,200),labels=T)  # stray base-graphics call; errors without an open plot and is not needed for the ggplot figures below
p1 = qplot(rpt, msecFT, xlab="pattern length, characters", ylab="time, msec",main="fixed = TRUE" )
p2 = qplot(rpt, msecFF, xlab="pattern length, characters", ylab="time, msec",main="fixed = FALSE")
grid.arrange(p1, p2, nrow = 2)
As you can see, I'm searching for a pattern that contains "a" replicated rpt[n] times. The slope is positive, as expected. However, I noticed a kink at 300 characters with fixed=T and at 600 characters with fixed=F, after which the slope seems to be approximately the same as before (see plot below).
I suppose it is due to memory, object size, etc. I also noticed that the longest allowed pattern is 1463 characters, with an object size of 1552 bytes.
Can someone explain the kinks better, and why they occur at 300 and 600 characters?
Added: it is worth mentioning that most of my patterns are 5-10 characters long, which gives me the following timings on my real data (not the mock-up inp in the example above):
gsub, fixed = TRUE: ~50 msec per one pattern
gsub, fixed = FALSE: ~190 msec per one pattern
stringi, fixed = FALSE: ~55 msec per one pattern
gsub, fixed = FALSE, perl = TRUE: ~95 msec per one pattern
(I have 4k patterns, so the total timing of my module is roughly 200 sec, which is exactly 0.05 x 4000 with gsub and fixed = TRUE. It is the fastest method for my data and patterns.)
The kinks might be related to the bits required to hold patterns of that length.
There is another solution that scales much better: use the repetition operator {} to specify how many repeats you want to find. In order to find more than 255 repeats (the 8-bit integer max), you'll have to specify perl = TRUE.
patt2 <- paste0('a{',rpt[n],'}')
timeRF <- microbenchmark(gsub(patt2, "b", inp, perl = T), times = 10)
I get speeds of around 2.1 ms per search with no penalty for pattern length. That's about 8x faster than fixed = FALSE for small pattern lengths and about 60x faster for large pattern lengths.