Here is my input data.
Input (local):
'Iot,c c++ python,2015',
'Web,java spring,2016',
'Iot,c c++ spring,2017',
'Iot,c c++ spring,2017',
This is the result of running apache-beam in a local environment.
Output (local):
Iot,2015,c,1
Iot,2015,c++,1
Iot,2015,python,1
Iot,2017,c,2
Iot,2017,c++,2
Iot,2017,spring,2
Web,2016,java,1
Web,2016,spring,1
However, when I run the same pipeline on Google Cloud Platform Dataflow and write the output to a bucket, the results are different.
Storage (bucket):
Web,2016,java,1
Web,2016,spring,1
Iot,2015,c,1
Iot,2015,c++,1
Iot,2015,python,1
Iot,2017,c,1
Iot,2017,c++,1
Iot,2017,spring,1
Iot,2017,c,1
Iot,2017,c++,1
Iot,2017,spring,1
Here is my code.
Code:
#apache_beam
from apache_beam.options.pipeline_options import PipelineOptions
import apache_beam as beam

pipeline_options = PipelineOptions(
    project='project-id',
    runner='dataflow',
    temp_location='bucket-location'
)

def pardo_dofn_methods(test=None):
    import apache_beam as beam

    class split_category_advanced(beam.DoFn):
        def __init__(self, delimiter=','):
            self.delimiter = delimiter
            self.k = 1
            self.pre_processing = []
            self.window = beam.window.GlobalWindow()
            self.year_dict = {}
            self.category_index = 0
            self.language_index = 1
            self.year_index = 2
            self.result = []

        def setup(self):
            print('setup')

        def start_bundle(self):
            print('start_bundle')

        def finish_bundle(self):
            print('finish_bundle')
            for ppc_index in range(len(self.pre_processing)):
                if self.category_index == 0 or self.category_index % 3 == 0:
                    if self.pre_processing[self.category_index] not in self.year_dict:
                        self.year_dict[self.pre_processing[self.category_index]] = {}
                if ppc_index + 2 == 2 or ppc_index + 2 == self.year_index:
                    # { category : { year : {} } }
                    if self.pre_processing[self.year_index] not in self.year_dict[self.pre_processing[self.category_index]]:
                        self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]] = {}
                    # { category : { year : c : { }, c++ : { }, java : { }}}
                    language = self.pre_processing[self.year_index - 1].split(' ')
                    for lang_index in range(len(language)):
                        if language[lang_index] not in self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]]:
                            self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]][language[lang_index]] = 1
                        else:
                            self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]][
                                language[lang_index]] += 1
                    self.year_index = self.year_index + 3
                self.category_index = self.category_index + 1
            csvFormat = ''
            for category, nested in self.year_dict.items():
                for year in nested:
                    for language in nested[year]:
                        csvFormat += (category + "," + str(year) + "," + language + "," + str(nested[year][language])) + "\n"
            print(csvFormat)
            yield beam.utils.windowed_value.WindowedValue(
                value=csvFormat,
                #value = self.pre_processing,
                timestamp=0,
                windows=[self.window],
            )

        def process(self, text):
            for word in text.split(self.delimiter):
                self.pre_processing.append(word)
            print(self.pre_processing)

    #with beam.Pipeline(options=pipeline_options) as pipeline:
    with beam.Pipeline() as pipeline:
        results = (
            pipeline
            | 'Gardening plants' >> beam.Create([
                'Iot,c c++ python,2015',
                'Web,java spring,2016',
                'Iot,c c++ spring,2017',
                'Iot,c c++ spring,2017',
            ])
            | 'Split category advanced' >> beam.ParDo(split_category_advanced(','))
            | 'Save' >> beam.io.textio.WriteToText("bucket-location")
            | beam.Map(print)
        )
        if test:
            return test(results)

if __name__ == '__main__':
    # The function defined above is pardo_dofn_methods; pardo_dofn_methods_basic does not exist.
    pardo_dofn_methods()
The code does a simple word count.
The CSV columns are [category, year, language, count],
e.g. IoT, 2015, c, 1
Thank you for reading.
The most likely reason you are getting different output is parallelism. With DataflowRunner, operations run with as much parallelism as possible. Since you are counting inside a ParDo, when the two Iot,c c++ spring,2017 elements go to two different workers, each worker counts only the element it sees and the totals are never combined.
You need to use Combiners (section 4.2.4 of the Beam programming guide).
Here is a simple example of what you want to do:
import apache_beam as beam

def generate_kvs(element, csv_delimiter=',', field_delimiter=' '):
    splitted = element.split(csv_delimiter)
    fields = splitted[1].split(field_delimiter)
    # final key to count is (Source, year, language)
    return [(f"{splitted[0]}, {splitted[2]}, {x}", 1) for x in fields]

p = beam.Pipeline()
elements = ['Iot,c c++ python,2015',
            'Web,java spring,2016',
            'Iot,c c++ spring,2017',
            'Iot,c c++ spring,2017']

(p | beam.Create(elements)
   # FlatMap emits each (key, 1) pair of the returned list as its own element
   | beam.FlatMap(generate_kvs)
   | beam.combiners.Count.PerKey()
   | "Format" >> beam.Map(lambda x: f"{x[0]}, {x[1]}")
   | beam.Map(print))

p.run()
This outputs the result you want no matter how the elements are distributed across workers.
Note that the idea of Apache Beam is to parallelise as much as possible, so in order to aggregate you need Combiners.
I would recommend looking at some of the wordcount examples to get the hang of combiners.
EDIT
Clarification on Combiners:
ParDo operates element by element: it takes one element, applies some operations, and sends the output to the next PTransform. When you need to aggregate data (count elements, sum values, join sentences...), element-wise operations don't work; you need something that takes a whole PCollection (i.e., many elements that belong together) and outputs a result. This is where combiners come in: they operate on a whole PCollection, and the work can be spread across workers (the reduce part of Map-Reduce).
In your example, you stored the count in instance attributes of the DoFn, so each element passing through changed that state. This works when all elements go through the same worker, since the DoFn instance is created per worker (i.e., instances don't share state), but with more than one worker each instance counts only the elements it sees, so the counts are produced separately per worker.
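To make the difference concrete, here is a minimal sketch in plain Python (no Beam involved) that simulates two workers. The split of the input into two bundles and the helper name explode are assumptions made for illustration; a real runner decides the distribution itself.

```python
from collections import Counter

def explode(line):
    """Turn 'Iot,c c++ spring,2017' into keys like ('Iot', '2017', 'c')."""
    category, languages, year = line.split(",")
    return [(category, year, lang) for lang in languages.split(" ")]

lines = [
    "Iot,c c++ python,2015",
    "Web,java spring,2016",
    "Iot,c c++ spring,2017",
    "Iot,c c++ spring,2017",
]

# Hypothetical distribution: the last element lands on a second worker.
bundles = [lines[:3], lines[3:]]

# Stateful-DoFn style: each worker counts only the elements it sees.
per_worker = [
    Counter(key for line in bundle for key in explode(line))
    for bundle in bundles
]

# Combiner style: the partial counts are merged per key across workers.
merged = sum(per_worker, Counter())

key = ("Iot", "2017", "c")
print(per_worker[0][key], per_worker[1][key])  # 1 1
print(merged[key])                             # 2
```

The per-worker counters reproduce the duplicated Iot,2017 rows with count 1 seen in the bucket, while the merged counter gives the expected count of 2.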
I'm trying to convert the following ES6 script to BuckleScript, and I cannot for the life of me figure out how to create a "closure" in BuckleScript:
import {Socket, Presence} from "phoenix"
let socket = new Socket("/socket", {
params: {user_id: window.location.search.split("=")[1]}
})
let channel = socket.channel("room:lobby", {})
let presence = new Presence(channel)
function renderOnlineUsers(presence) {
let response = ""
presence.list((id, {metas: [first, ...rest]}) => {
let count = rest.length + 1
response += `<br>${id} (count: ${count})</br>`
})
document.querySelector("main[role=main]").innerHTML = response
}
socket.connect()
presence.onSync(() => renderOnlineUsers(presence))
channel.join()
The part I can't figure out specifically is let response = "" (or var in this case, as BuckleScript always uses vars):
function renderOnlineUsers(presence) {
let response = ""
presence.list((id, {metas: [first, ...rest]}) => {
let count = rest.length + 1
response += `<br>${id} (count: ${count})</br>`
})
document.querySelector("main[role=main]").innerHTML = response
}
The closest I've gotten so far drops the result declaration from the output:
...
...
let onPresenceSync ev =
let result = "" in
let listFunc = [%raw begin
{|
(id, {metas: [first, ...rest]}) => {
let count = rest.length + 1
result += `${id} (count: ${count})\n`
}
|}
end
] in
let _ =
presence |. listPresence (listFunc) in
[%raw {| console.log(result) |} ]
...
...
compiles to:
function onPresenceSync(ev) {
var listFunc = (
(id, {metas: [first, ...rest]}) => {
let count = rest.length + 1
result += `${id} (count: ${count})\n`
}
);
presence.list(listFunc);
return ( console.log(result) );
}
result is removed as an optimization because it is considered unused. It is generally not a good idea to use raw code that depends on code generated by BuckleScript, as there are quite a few surprises you can encounter in the generated code.
It is also not a great idea to mutate variables that the compiler considers immutable, as it will perform optimizations based on the assumption that the value never changes.
The simplest fix here is to replace [%raw {| console.log(result) |} ] with Js.log result, but it might be enlightening to see how listFunc could be written in OCaml:
let onPresenceSync ev =
  let result = ref "" in
  let listFunc = fun [@bs] id item ->
    let count = Js.Array.length item##metas in
    (* append to the accumulated string, like `response +=` in the JS version *)
    result := !result ^ {j|$id (count: $count)\n|j}
  in
  let _ = presence |. (listPresence listFunc) in
  Js.log !result
Note that result is now a ref cell, which is how you specify a mutable variable in OCaml. A ref cell is updated using := and the value it contains is retrieved using !. Note also the [@bs] annotation, which marks the function as uncurried; this is needed on functions passed to external higher-order functions. Finally, note the string interpolation syntax: {j| ... |j}
My regex is being greedy and not working as it is supposed to. I need to extract each message with its timestamp and the user who created it. Also, if a user has two or more consecutive messages, they should go inside one match / block / group. How can I solve this?
https://regex101.com/r/zD5bR6/1
val pattern = "((a\.b|c\.d)\n(.+\n)+)+?".r
for(m <- pattern.findAllIn(str).matchData; e <- m.subgroups) println(e)
UPDATE
ndn's solution throws a StackOverflowError when executed:
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4708)
.......
Code:
val pattern = "(?:.+(?:\\Z|\\n))+?(?=\\Z|\\w\\.\\w)".r
val array = (pattern findAllIn str).toArray.reverse foreach{println _}
for(m <- pattern.findAllIn(str).matchData; e <- m.subgroups) println(e)
I don't think a regular expression is the right tool for this job. My solution below uses a (tail) recursive function to loop over the lines, keep the current username and create a Message for every timestamp / message pair.
import java.time.LocalTime

case class Message(user: String, timestamp: LocalTime, message: String)

val Timestamp = """\[(\d{2})\:(\d{2})\:(\d{2})\]""".r

def parseMessages(lines: List[String], usernames: Set[String]) = {
  @scala.annotation.tailrec
  def go(
    lines: List[String], currentUser: Option[String], messages: List[Message]
  ): List[Message] = lines match {
    // no more lines -> return parsed messages
    case Nil => messages.reverse
    // found a user -> keep as currentUser
    case user :: tail if usernames.contains(user) =>
      go(tail, Some(user), messages)
    // timestamp and message on next line -> create a Message
    case Timestamp(h, m, s) :: msg :: tail if currentUser.isDefined =>
      val time = LocalTime.of(h.toInt, m.toInt, s.toInt)
      val newMsg = Message(currentUser.get, time, msg)
      go(tail, currentUser, newMsg :: messages)
    // invalid line -> ignore
    case _ =>
      go(lines.tail, currentUser, messages)
  }
  go(lines, None, Nil)
}
We can use it like this:
val input = """
a.b
[10:12:03]
you can also get commands
[10:11:26]
from the console
[10:11:21]
can you check if has been resolved
[10:10:47]
ah, okay
c.d
[10:10:39]
anyways startsLevel is still 4
a.b
[10:09:25]
might be a dead end
[10:08:56]
that need to be started early as well
"""
val lines = input.split('\n').toList
val users = Set("a.b", "c.d")
parseMessages(lines, users).foreach(println)
// Message(a.b,10:12:03,you can also get commands)
// Message(a.b,10:11:26,from the console)
// Message(a.b,10:11:21,can you check if has been resolved)
// Message(a.b,10:10:47,ah, okay)
// Message(c.d,10:10:39,anyways startsLevel is still 4)
// Message(a.b,10:09:25,might be a dead end)
// Message(a.b,10:08:56,that need to be started early as well)
The idea is to match as few lines as possible, stopping as soon as the match is followed by a username or the end of the string:
(?:.+(?:\Z|\n))+?(?=\Z|\w\.\w)
See it in action
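As a quick way to try the pattern outside the JVM, here is a minimal sketch using Python's re module, which accepts the same syntax. The sample text is a shortened version of the chat log from the question, and the name BLOCK is only illustrative.

```python
import re

# Same pattern as above: lazily consume whole lines until the next
# thing in the input is a username (\w\.\w) or the end of the string.
BLOCK = re.compile(r"(?:.+(?:\Z|\n))+?(?=\Z|\w\.\w)")

text = "\n".join([
    "a.b",
    "[10:12:03]",
    "you can also get commands",
    "[10:11:26]",
    "from the console",
    "c.d",
    "[10:10:39]",
    "anyways startsLevel is still 4",
    "a.b",
    "[10:09:25]",
    "might be a dead end",
])

blocks = BLOCK.findall(text)
for block in blocks:
    user, *rest = block.splitlines()
    # Each message is a timestamp line followed by a text line.
    print(user, len(rest) // 2, "message(s)")
```

One caveat of the lookahead: a message line that itself starts with something shaped like x.y would be treated as the start of a new block.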
I have the following list:
List(List(
List(((groupName,group1),(tagMember,["192.168.20.30","192.168.20.20","192.168.20.21"]))),
List(((groupName,group1),(tagMember,["192.168.20.30"]))),
List(((groupName,group1),(tagMember,["192.168.20.30","192.168.20.20"])))))
I want to convert it to:
List((groupName, group1),(tagMember,["192.168.20.30","192.168.20.20","192.168.20.21"]))
I tried to use .flatten but was unable to produce the desired output.
How do I get the output above using Scala?
I had to make some changes to your input to make it valid.
Input List:
val ll = List(List(
List((("groupName","group1"),("tagMember", List("192.168.20.30","192.168.20.20","192.168.20.21")))),
List((("groupName","group1"),("tagMember",List("192.168.20.30")))),
List((("groupName","group1"),("tagMember",List("192.168.20.30","192.168.20.20"))))
))
The code below works if the group and tagMember are the same across all the elements in the list:
def getUniqueIpsConstantGroupTagMember(inputList: List[List[List[((String, String), (String, List[String]))]]]) = {
  // List[((String, String), (String, List[String]))]
  // flatten the parameter rather than the outer value `ll`
  val flattenedList = inputList.flatten.flatten
  if (flattenedList.nonEmpty) {
    val group = flattenedList(0)._1
    val tagMember = flattenedList(0)._2._1
    val ips = flattenedList flatMap (_._2._2)
    ((group), (tagMember, ips.distinct))
  }
  else List()
}
println(getUniqueIpsConstantGroupTagMember(ll))
Output:
((groupName,group1),(tagMember,List(192.168.20.30, 192.168.20.20, 192.168.20.21)))
Now, let's assume you could have different groupNames.
Sample input:
val listWithVariableGroups = List(List(
List((("groupName","group1"),("tagMember",List("192.168.20.30","192.168.20.20","192.168.20.21")))),
List((("groupName","group1"),("tagMember",List("192.168.20.30")))),
List((("groupName","group1"),("tagMember",List("192.168.20.30","192.168.20.20")))),
List((("groupName","group2"),("tagMember",List("192.168.20.30","192.168.20.10"))))
))
The following code should work.
def getUniqueIpsForMultipleGroups(inputList: List[List[List[((String, String), (String, List[String]))]]]) = {
  val flattenedList = inputList.flatten.flatten
  // Map[(String, String),List[(String, List[String])]]
  val groupedByGroupNameId = flattenedList.groupBy(p => p._1) map {
    case (key, value) => (key, ("tagMember", extractUniqueTagIps(value)))
  }
  groupedByGroupNameId
}

def extractUniqueTagIps(list: List[((String, String), (String, List[String]))]) = {
  val ips = list flatMap (_._2._2)
  ips.distinct
}
getUniqueIpsForMultipleGroups(listWithVariableGroups).foreach(println)
Output:
((groupName,group1),(tagMember,List(192.168.20.30, 192.168.20.20, 192.168.20.21)))
((groupName,group2),(tagMember,List(192.168.20.30, 192.168.20.10)))
I've got the following class and I want to write some Specs test cases for it, but I am really new to this and I don't know how to start. My class looks like this:
class Board{
  val array = Array.fill(7)(Array.fill(6)(None:Option[Coin]))

  def move(x:Int, coin:Coin) {
    val y = array(x).indexOf(None)
    require(y >= 0)
    array(x)(y) = Some(coin)
  }

  def apply(x: Int, y: Int):Option[Coin] =
    if (0 <= x && x < 7 && 0 <= y && y < 6) array(x)(y)
    else None

  def winner: Option[Coin] = winner(Cross).orElse(winner(Naught))

  private def winner(coin:Coin):Option[Coin] = {
    val rows = (0 until 6).map(y => (0 until 7).map( x => apply(x,y)))
    val cols = (0 until 7).map(x => (0 until 6).map( y => apply(x,y)))
    val dia1 = (0 until 4).map(x => (0 until 6).map( y => apply(x+y,y)))
    val dia2 = (3 until 7).map(x => (0 until 6).map( y => apply(x-y,y)))
    val slice = List.fill(4)(Some(coin))
    if((rows ++ cols ++ dia1 ++ dia2).exists(_.containsSlice(slice)))
      Some(coin)
    else None
  }

  override def toString = {
    val string = new StringBuilder
    for(y <- 5 to 0 by -1; x <- 0 to 6){
      string.append(apply(x, y).getOrElse("_"))
      if (x == 6) string.append ("\n")
      else string.append("|")
    }
    string.append("0 1 2 3 4 5 6\n").toString
  }
}
Thank you!
I can only second Daniel's suggestion, because you'll end up with a more practical API by using TDD.
I also think that your application could be nicely tested with a mix of specs2 and ScalaCheck. Here is the draft of a Specification to get you started:
import org.specs2._
import org.scalacheck.{Arbitrary, Gen}

class TestSpec extends Specification with ScalaCheck { def is =
  "moving a coin in a column moves the coin to the nearest empty slot" ! e1^
  "a coin wins if" ^
    "a row contains 4 consecutive coins" ! e2^
    "a column contains 4 consecutive coins" ! e3^
    "a diagonal contains 4 consecutive coins" ! e4^
  end

  def e1 = check { (b: Board, x: Int, c: Coin) =>
    try { b.move(x, c) } catch { case e => () }
    // either there was a coin before somewhere in that column
    // or there is now after the move
    (0 until 6).exists(y => b(x, y).isDefined)
  }

  def e2 = pending
  def e3 = pending
  def e4 = pending

  /**
   * Random data for Coins, x position and Board
   */
  implicit def arbitraryCoin: Arbitrary[Coin] = Arbitrary { Gen.oneOf(Cross, Naught) }
  implicit def arbitraryXPosition: Arbitrary[Int] = Arbitrary { Gen.choose(0, 6) }
  implicit def arbitraryBoardMove: Arbitrary[(Int, Coin)] = Arbitrary {
    for {
      coin <- arbitraryCoin.arbitrary
      x <- arbitraryXPosition.arbitrary
    } yield (x, coin)
  }
  implicit def arbitraryBoard: Arbitrary[Board] = Arbitrary {
    for {
      moves <- Gen.listOf1(arbitraryBoardMove.arbitrary)
    } yield {
      val board = new Board
      moves.foreach { case (x, coin) =>
        try { board.move(x, coin) } catch { case e => () }}
      board
    }
  }
}

object Cross extends Coin {
  override def toString = "x"
}
object Naught extends Coin {
  override def toString = "o"
}
sealed trait Coin
The e1 property I've implemented is not the real thing, because it doesn't actually check that the coin moved to the nearest empty slot, which is what your code and your API suggest. You will also want to change the generated data so that the Boards are generated with alternating x and o moves. That should be a great way to learn how to use ScalaCheck!
I suggest you throw all that code out -- well, save it somewhere, but start from zero using TDD.
The Specs2 site has plenty of examples of how to write tests, but use TDD (test-driven design) to do it. Adding tests after the fact is suboptimal, to say the least.
So, think of the most simple case you want to handle of the most simple feature, write a test for that, see it fail, write the code to fix it. Refactor if necessary, and repeat for the next most simple case.
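As an illustration of that first step, here is a minimal sketch of what the simplest test might look like, written in Python's unittest for brevity rather than in Specs2. The Board class shown is a hypothetical stand-in with only enough code to make the one test pass; in a TDD session the test is written first and the class grows from there.

```python
import unittest

class Board:
    """Hypothetical minimal board: 7 columns, 6 rows, None marks an empty slot."""

    def __init__(self):
        self.columns = [[None] * 6 for _ in range(7)]

    def move(self, x, coin):
        # Drop the coin into the lowest empty slot of column x.
        # list.index raises ValueError when the column is already full.
        y = self.columns[x].index(None)
        self.columns[x][y] = coin

    def at(self, x, y):
        return self.columns[x][y]

class BoardTest(unittest.TestCase):
    def test_first_coin_lands_on_bottom_row(self):
        board = Board()
        board.move(3, "x")
        self.assertEqual(board.at(3, 0), "x")
        self.assertIsNone(board.at(3, 1))
```

Saved as a module, this can be run with python -m unittest. The next simplest case might be a second coin in the same column, then a full column, and so on.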
If you want help with how to do TDD in general, I heartily endorse the videos about TDD available on Clean Coders. At the very least, watch the second part where Bob Martin writes a whole class TDD-style, from design to end.
If you know how to do testing in general but are confused about Scala or Specs, please be much more specific about what your questions are.