Groovy list of maps not showing a loop count properly

I have this code in Groovy:
def execution = []
def executor = [:]
for (loopcount = 1; loopcount < 4; loopcount++) {
    executor.executor = 'jmeter'
    executor.scenario = 'scenario' + loopcount
    println executor.scenario
    executor.concurrency = 2
    execution.add(executor)
}
execution.each {
    println executor.scenario
}
It is a list of three maps, all the same apart from the incrementing scenario suffix. I am expecting:
scenario1
scenario2
scenario3
scenario1
scenario2
scenario3
But I get:
scenario1
scenario2
scenario3
scenario3
scenario3
scenario3
It's definitely adding three maps to the list, because the .each call prints three values. And executor.scenario definitely holds different values on each iteration, because the println inside the loop gives the correct 1, 2, 3 count. So why don't they stay different in the list?
I've also tried execution.push(executor), but that gives the same result. For context, this YAML is what I'm aiming for eventually:
---
execution:
- executor: "jmeter"
  scenario: "scenario1"
  concurrency: 2
- executor: "jmeter"
  scenario: "scenario2"
  concurrency: 2
- executor: "jmeter"
  scenario: "scenario3"
  concurrency: 2
And apart from the scenario count the rest of it works fine.

The problem:
def execution = []
def executor = [:]
for (loopcount = 1; loopcount < 4; loopcount++) {
    execution.add(executor) // <<-- this adds the same map instance to the list three times
}
To fix this, declare executor inside the for loop:
def execution = []
for (loopcount = 1; loopcount < 4; loopcount++) {
    def executor = [:] // <<-- creates a new map on each iteration
    execution.add(executor) // <<-- adds the new map to the list
}
To make it clearer, here is what [] and [:] mean:
def execution = new ArrayList()
for (loopcount = 1; loopcount < 4; loopcount++) {
    def executor = new LinkedHashMap()
    execution.add(executor)
}
However, you could also declare the variable before the loop, as long as you assign a new object to it inside the loop:
def execution = []
def executor
for (loopcount = 1; loopcount < 4; loopcount++) {
    executor = [:]
    execution.add(executor)
}
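Putting it all together, here is a minimal sketch of the corrected loop (nothing version-specific is assumed). Note that the final .each has to use the closure parameter it (or a named parameter) rather than executor, because executor is no longer in scope outside the loop:
def execution = []
for (loopcount = 1; loopcount < 4; loopcount++) {
    def executor = [:]                    // fresh map on every iteration
    executor.executor = 'jmeter'
    executor.scenario = 'scenario' + loopcount
    executor.concurrency = 2
    println executor.scenario             // scenario1, scenario2, scenario3
    execution.add(executor)
}
execution.each { println it.scenario }    // scenario1, scenario2, scenario3 again
From there, a YAML library (for example Groovy 3's groovy.yaml.YamlBuilder, or SnakeYAML) can serialize execution into the structure shown in the question.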

Related

The result values in the local environment and the result values from Dataflow are different

Here is my input data.
Input (local):
'Iot,c c++ python,2015',
'Web,java spring,2016',
'Iot,c c++ spring,2017',
'Iot,c c++ spring,2017',
This is the result of running Apache Beam in a local environment.
Output (local):
Iot,2015,c,1
Iot,2015,c++,1
Iot,2015,python,1
Iot,2017,c,2
Iot,2017,c++,2
Iot,2017,spring,2
Web,2016,java,1
Web,2016,spring,1
However, when I run it on Google Cloud Dataflow and write the output to a bucket, the results are different.
Output in storage (bucket):
Web,2016,java,1
Web,2016,spring,1
Iot,2015,c,1
Iot,2015,c++,1
Iot,2015,python,1
Iot,2017,c,1
Iot,2017,c++,1
Iot,2017,spring,1
Iot,2017,c,1
Iot,2017,c++,1
Iot,2017,spring,1
Here is my code.
Code:
# apache_beam
from apache_beam.options.pipeline_options import PipelineOptions
import apache_beam as beam

pipeline_options = PipelineOptions(
    project='project-id',
    runner='dataflow',
    temp_location='bucket-location'
)

def pardo_dofn_methods(test=None):
    import apache_beam as beam

    class split_category_advanced(beam.DoFn):
        def __init__(self, delimiter=','):
            self.delimiter = delimiter
            self.k = 1
            self.pre_processing = []
            self.window = beam.window.GlobalWindow()
            self.year_dict = {}
            self.category_index = 0
            self.language_index = 1
            self.year_index = 2
            self.result = []

        def setup(self):
            print('setup')

        def start_bundle(self):
            print('start_bundle')

        def finish_bundle(self):
            print('finish_bundle')
            for ppc_index in range(len(self.pre_processing)):
                if self.category_index == 0 or self.category_index % 3 == 0:
                    if self.pre_processing[self.category_index] not in self.year_dict:
                        self.year_dict[self.pre_processing[self.category_index]] = {}
                if ppc_index + 2 == 2 or ppc_index + 2 == self.year_index:
                    # { category : { year : {} } }
                    if self.pre_processing[self.year_index] not in self.year_dict[self.pre_processing[self.category_index]]:
                        self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]] = {}
                    # { category : { year : { c : {}, c++ : {}, java : {} } } }
                    language = self.pre_processing[self.year_index - 1].split(' ')
                    for lang_index in range(len(language)):
                        if language[lang_index] not in self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]]:
                            self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]][language[lang_index]] = 1
                        else:
                            self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]][language[lang_index]] += 1
                    self.year_index = self.year_index + 3
                self.category_index = self.category_index + 1
            csvFormat = ''
            for category, nested in self.year_dict.items():
                for year in nested:
                    for language in nested[year]:
                        csvFormat += category + "," + str(year) + "," + language + "," + str(nested[year][language]) + "\n"
            print(csvFormat)
            yield beam.utils.windowed_value.WindowedValue(
                value=csvFormat,
                # value=self.pre_processing,
                timestamp=0,
                windows=[self.window],
            )

        def process(self, text):
            for word in text.split(self.delimiter):
                self.pre_processing.append(word)
            print(self.pre_processing)

    # with beam.Pipeline(options=pipeline_options) as pipeline:
    with beam.Pipeline() as pipeline:
        results = (
            pipeline
            | 'Gardening plants' >> beam.Create([
                'Iot,c c++ python,2015',
                'Web,java spring,2016',
                'Iot,c c++ spring,2017',
                'Iot,c c++ spring,2017',
            ])
            | 'Split category advanced' >> beam.ParDo(split_category_advanced(','))
            | 'Save' >> beam.io.textio.WriteToText("bucket-location")
            | beam.Map(print)
        )
        if test:
            return test(results)

if __name__ == '__main__':
    pardo_dofn_methods()
The code does a simple word count. The CSV columns are [category, year, language, count], e.g. Iot, 2015, c, 1.
Thank you for reading.
The most likely reason you are getting different output is parallelism. When using the DataflowRunner, operations run in parallel as much as possible. Since you are counting inside a ParDo, when the two Iot,c c++ spring,2017 elements go to different workers, the count doesn't happen as you want.
You need to use Combiners (4.2.4 in the Beam programming guide).
Here is a simple example of what you want to do:
import apache_beam as beam
from apache_beam import Create, Map

def generate_kvs(element, csv_delimiter=',', field_delimiter=' '):
    splitted = element.split(csv_delimiter)
    fields = splitted[1].split(field_delimiter)
    # final key to count is (Source, year, language)
    return [(f"{splitted[0]}, {splitted[2]}, {x}", 1) for x in fields]

p = beam.Pipeline()

elements = ['Iot,c c++ python,2015',
            'Web,java spring,2016',
            'Iot,c c++ spring,2017',
            'Iot,c c++ spring,2017']

(p | Create(elements)
   | beam.ParDo(generate_kvs)
   | beam.combiners.Count.PerKey()
   | "Format" >> Map(lambda x: f"{x[0]}, {x[1]}")
   | Map(print))

p.run()
This will output the result you want no matter how the elements are distributed across workers.
Note that the idea of Apache Beam is to parallelise as much as possible, and in order to aggregate you need Combiners.
I would recommend checking some word-count examples to get the hang of Combiners.
EDIT
Clarification on Combiners:
ParDo is an operation that happens on an element-by-element basis. It takes one element, applies some operations, and sends the output to the next PTransform. When you need to aggregate data (count elements, sum values, join sentences...), element-wise operations don't work; you need something that takes a whole PCollection and outputs something. This is where Combiners come in: they perform operations on a PCollection basis, and those operations can be spread across workers (part of the Map/Reduce model).
In your example, you were using a class field to store the count inside the ParDo, so every element that went through it changed that field. This works while all elements go through the same worker, since the DoFn instance is created on a per-worker basis (i.e., instances don't share state), but as soon as there are more workers, the ParDo count happens separately in each worker.
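To make that concrete, here is a minimal sketch of a hand-written combiner (the CountFn class and the wiring comment are illustrative, not from the original code); beam.combiners.Count.PerKey() used above is essentially a built-in version of this:
import apache_beam as beam

class CountFn(beam.CombineFn):
    def create_accumulator(self):
        return 0                            # one partial count per bundle/worker
    def add_input(self, accumulator, element):
        return accumulator + 1              # inputs may be added on different workers
    def merge_accumulators(self, accumulators):
        return sum(accumulators)            # partial counts from all workers are merged here
    def extract_output(self, accumulator):
        return accumulator

# in the pipeline above, beam.CombinePerKey(CountFn()) could replace beam.combiners.Count.PerKey()
Because the per-worker partial counts are merged in merge_accumulators, the result does not depend on how the elements were distributed.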

Reuse data in a Tensorflow graph (using Queues and tf.cond())

I'm building a resample layer in TensorFlow that is meant to reuse data resample_n times before getting new data from net_data_layer() (a queue reading from some HDF5 data sources). I have a Network class and store the tensors returned by the data layer in it (self) for "global" access (e.g. for building the network graph, but also for my train_op()):
self.batch_img, self.batch_label, self.batch_weights = net_data_layer()
I added some logic so that in the first iteration resample == False, so that the data-layer queues are called, and then resample == True for the next resample_n iterations. The relevant part of the graph boils down to:
self.batch_img, self.batch_label, self.batch_weights = \
    tf.cond(resample,
            lambda: resample_data(),
            lambda: net_data_layer())
Here resample_data() just returns the previously stored data, doing nothing but forwarding the already stored tensors back to the graph:
def resample_data(): return self.batch_img, self.batch_label, self.batch_weights
But I'm getting an error:
ValueError: Operation 'ResampleLayer/cond/DataLayer_hdf5_dset/test_data/fifo_queue_enqueue' has been marked as not fetchable.
This led me to this discussion, where the issue was solved by "putting the QueueRunner logic outside the tf.cond() context". However, having the queues IN there is the whole point of my condition, since whenever I run them (specifically the dequeue op), they jump to the next sample in line.
Is there any way to make the graph reuse its already loaded data (or would that be "cyclic" and therefore unsupported)? I'm out of my depth... any ideas?
For reference the whole "ResampleLayer" code block:
with tf.variable_scope('ResampleLayer'):
    resample_n = args.resample_n  # user-defined
    resample_i = 0  # resample counter, when 0 it generates new data, then counts up to resample_n

    # dummy placeholders, needed for the first run since tf.cond() doesn't do lazy evaluation
    self.batch_img = tf.constant(0., dtype=tf.float32, shape=[batch_size] + shape_img)
    self.batch_label = tf.constant(0., dtype=tf.float32, shape=[batch_size] + shape_label)
    self.batch_weights = tf.constant(0., dtype=tf.float32, shape=[batch_size] + shape_weights)

    def net_data_layer():
        # run the data layer to generate a new batch
        self.batch_img, self.batch_label, self.batch_weights = \
            data_layer.data_TFRqueue(dataset_pth, batch_size=batch_size)
        return self.batch_img, self.batch_label, self.batch_weights

    def resample_data():
        # just forward the existing batch
        return self.batch_img, self.batch_label, self.batch_weights

    with tf.variable_scope('ResampleLogic'):
        t_resample_n = tf.Variable(self.resample_n, tf.int8)
        t_resample_i = tf.Variable(self.resample_i, tf.int8)

        # Determine when to resample. Pure Python equivalent:
        #   resample = False if resample_i == 0 else True
        resample = tf.cond(tf.equal(t_resample_i, tf.constant(0)),
                           lambda: tf.constant(False),
                           lambda: tf.constant(True))

        # "Loop" the resample counter. Pure Python equivalent:
        #   resample_i = resample_i + 1 if resample_i < resample_n else 0
        tf.cond(t_resample_i < t_resample_n, lambda: t_resample_i + 1, lambda: tf.constant(0))

        # the actual intended branching of the graph
        self.batch_img, self.batch_label, self.batch_weights = \
            tf.cond(resample,
                    lambda: resample_data(),
                    lambda: net_data_layer())

    ## NETWORK
    self.output_mask = self._build_net(self.batch_img, ...)

Scala - Future List first completed with condition

I have a list of Futures and I want to get the first one completed with a certain condition.
Here is an example of a possible code:
val futureList: Future[List[T]] = Future.sequence(l map (c => c.functionWithFuture()))
val data = for {
  c <- futureList
} yield c
data onSuccess {
  case x => (x filter (d => d.condition)).head
}
But it's not efficient, because I only take one element of the list, yet all of them are computed, with a lot of latency.
I know about firstCompletedOf, but it's not what I'm looking for.
(Sorry for my bad English.)
Try using a Promise and calling trySuccess on it as soon as a future that satisfies the condition completes. The first future to call trySuccess will complete the promise; the following ones will have no effect (unlike success, which throws if the promise is already completed).
Keep in mind that if no future in the list satisfies the condition, you will never get a result, i.e. the promise's future will never complete.
import scala.concurrent.{ Await, Future, Promise }
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Random

def condition(x: Int) = x > 50

val futures = (1 to 10) map (x => Future {
  // wait a random number of ms between 0 and 1000
  Thread.sleep(Random.nextInt(1000))
  // return a random number < 100
  Random.nextInt(100)
})

val p = Promise[Int]()

// the first one that satisfies the condition completes the promise
futures foreach { _ filter condition foreach p.trySuccess }

val result = p.future

// Watch out: the promise may never be fulfilled if all the futures are <= 50
println("The first completed future that satisfies x > 50 is: " + Await.result(result, 10.seconds))

Insertion order of a list based on order of another list

I have a sorting problem in Scala that I could certainly solve with brute-force, but I'm hopeful there is a more clever/elegant solution available. Suppose I have a list of strings in no particular order:
val keys = List("john", "jill", "ganesh", "wei", "bruce", "123", "Pantera")
Then I receive the values for these keys in random order (full disclosure: I'm experiencing this problem in an Akka actor, so events do not arrive in order):
def receive:Receive = {
case Value(key, otherStuff) => // key is an element in keys ...
And I want to store these results in a List where the Value objects appear in the same order as their key fields in the keys list. For instance, I may have this list after receiving the first two Value messages:
List(Value("ganesh", stuff1), Value("bruce", stuff2))
ganesh appears before bruce merely because he appears earlier in the keys list. Once the third message is received, I should insert it into this list in the correct location per the ordering established by keys. For instance, on receiving wei I should insert him into the middle:
List(Value("ganesh", stuff1), Value("wei", stuff3), Value("bruce", stuff2))
At any point during this process, my list may be incomplete but in the expected order. Since the keys are redundant with my Value data, I throw them away once the list of values is complete.
Show me what you've got!
I assume you want no worse than O(n log n) performance. So:
val order = keys.zipWithIndex.toMap
var part = collection.immutable.TreeSet.empty[Value](
  math.Ordering.by(v => order(v.key))
)
Then you just add your items.
scala> part = part + Value("ganesh", 0.1)
part: scala.collection.immutable.TreeSet[Value] =
TreeSet(Value(ganesh,0.1))
scala> part = part + Value("bruce", 0.2)
part: scala.collection.immutable.TreeSet[Value] =
TreeSet(Value(ganesh,0.1), Value(bruce,0.2))
scala> part = part + Value("wei", 0.3)
part: scala.collection.immutable.TreeSet[Value] =
TreeSet(Value(ganesh,0.1), Value(wei,0.3), Value(bruce,0.2))
When you're done, you can .toList it. While you're building it, you probably don't want to, since updating a list in random order so that it is in a desired sorted order is an obligatory O(n^2) cost.
Edit: with your example of seven items, my solution takes about 1/3 the time of Jean-Philippe's. For 25 items, it's 1/10th the time. 1/30th for 200 (which is the difference between 6 ms and 0.2 ms on my machine).
If you can use a ListMap instead of a list of tuples to store values while they're gathered, this could work. ListMap preserves insertion order.
import akka.actor.Actor
import scala.collection.immutable.ListMap

class MyActor(keys: List[String]) extends Actor {
  def initial(values: ListMap[String, Option[Value]]): Receive = {
    case v @ Value(key, otherStuff) =>
      val updated = values.updated(key, Some(v))
      if (updated.forall(_._2.isDefined))
        context.become(valuesReceived(updated.collect { case (_, Some(value)) => value }.toSeq))
      else
        context.become(initial(updated))
  }
  def valuesReceived(values: Seq[Value]): Receive = { case _ => } // whatever you need
  def receive = initial(ListMap(keys.map(k => k -> (None: Option[Value])): _*))
}
(warning: not compiled)

Scala objects not changing their internal state

I am seeing a problem with some Scala 2.7.7 code I'm working on, which would not happen if the equivalent were written in Java. Loosely, the code creates a bunch of card players and assigns them to tables.
class Player(val playerNumber: Int)

class Table(val tableNumber: Int) {
  var players: List[Player] = List()

  def registerPlayer(player: Player) {
    println("Registering player " + player.playerNumber + " on table " + tableNumber)
    players = player :: players
  }
}

object PlayerRegistrar {
  def assignPlayersToTables(playSamplesToExecute: Int, playersPerTable: Int) = {
    val numTables = playSamplesToExecute / playersPerTable
    val tables = (1 to numTables).map(new Table(_))
    assert(tables.size == numTables)

    (0 until playSamplesToExecute).foreach { playSample =>
      val tableNumber: Int = playSample % numTables
      tables(tableNumber).registerPlayer(new Player(playSample))
    }
    tables
  }
}
The PlayerRegistrar assigns a number of players between tables. First, it works out how many tables it will need to break up the players between and creates a List of them.
Then in the second part of the code, it works out which table a player should be assigned to, pulls that table from the list and registers a new player on that table.
The list of players on a table is a var, and is overwritten each time registerPlayer() is called. I have checked that this works correctly through a simple TestNG test:
@Test def testRegisterPlayer_multiplePlayers() {
  val table = new Table(1)
  (1 to 10).foreach { playerNumber =>
    val player = new Player(playerNumber)
    table.registerPlayer(player)
    assert(table.players.contains(player))
    assert(table.players.length == playerNumber)
  }
}
I then test the table assignment:
@Test def testAssignPlayerToTables_1table() = {
  val tables = PlayerRegistrar.assignPlayersToTables(10, 10)
  assertEquals(tables.length, 1)
  assertEquals(tables(0).players.length, 10)
}
The test fails with "expected:<10> but was:<0>". I've been scratching my head, but can't work out why registerPlayer() isn't mutating the table in the list. Any help would be appreciated.
The reason is that in the assignPlayersToTables method, you are creating a new Table object every time you access tables(tableNumber). You can confirm this by adding some debugging into the loop:
val tableNumber : Int = playSample % numTables
println(tables(tableNumber))
tables(tableNumber).registerPlayer(new Player(playSample))
Yielding something like:
Main$$anon$1$Table@5c73a7ab
Registering player 0 on table 1
Main$$anon$1$Table@21f8c6df
Registering player 1 on table 1
Main$$anon$1$Table@53c86be5
Registering player 2 on table 1
Note how the memory address of the table is different for each call.
The reason for this behaviour is that mapping over a Range is non-strict in Scala (until Scala 2.8, anyway). This means the map is not evaluated until its result is needed. So you think you're getting back a list of Table objects, but you're actually getting back a lazy projection that is re-evaluated (instantiating a new Table object) each time you access it. Again, you can confirm this by adding some debugging:
val tables = (1 to numTables).map(new Table(_))
println(tables)
Which gives you:
RangeM(Main$$anon$1$Table@5492bbba)
To do what you want, add a toList to the end:
val tables = (1 to numTables).map(new Table(_)).toList
val tables = (1 to numTables).map(new Table(_))
This line seems to be causing all the trouble: mapping over 1 to n gives you a RandomAccessSeq.Projection, and to be honest, I don't know exactly how they work, but a less clever initialisation technique does the job.
var tables: Array[Table] = new Array(numTables)
for (i <- 0 until numTables) tables(i) = new Table(i + 1)
Using the first initialisation method I wasn't able to change the objects (just like you), but using a simple array everything seems to be working.