Apache Flink - Sum and keep grouped

Suppose I have records like this:
("a-b", "data1", 1)
("a-c", "data2", 1)
("a-b", "data3", 1)
How can I group and sum in Apache Flink, such that I have the following results?
("a-b", ["data1", "data3"], 2)
("a-c", ["data2"], 1)
Regards,
Kevin

I've achieved this in the Flink Scala shell ($FLINK_HOME/bin/start-scala-shell.sh local) with the following code:
import org.apache.flink.util.Collector
benv.
  fromElements(("a-b", "data1", 1), ("a-c", "data2", 1), ("a-b", "data3", 1)).
  groupBy(0).
  reduceGroup {
    (it: Iterator[(String, String, Int)], out: Collector[(String, List[String], Int)]) => {
      // Watch out: if the group is _very_ large this can lead to OOM errors
      val group = it.toList
      // Guard against an empty group before reading its first element
      if (group.nonEmpty)
        // Emit the key, all second-column values, and the sum of the third column
        out.collect((group.head._1, group.map(_._2), group.map(_._3).sum))
    }
  }.print()
This produces the following output:
(a-b,List(data1, data3),2)
(a-c,List(data2),1)
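
To check the expected result outside Flink, the same group, collect and sum step can be sketched in plain Python. This is only an illustration of the logic, not Flink API; the function name group_collect_sum is hypothetical, and the records are the ones from the question.

```python
from collections import defaultdict

def group_collect_sum(records):
    """Group (key, value, count) records by key, collecting values and summing counts."""
    values = defaultdict(list)
    totals = defaultdict(int)
    for key, value, count in records:
        values[key].append(value)
        totals[key] += count
    # dicts preserve insertion order, so keys come out in first-seen order
    return [(key, values[key], totals[key]) for key in values]

records = [("a-b", "data1", 1), ("a-c", "data2", 1), ("a-b", "data3", 1)]
result = group_collect_sum(records)
print(result)
```

Like the reduceGroup version, this keeps every value of a group in memory, so the same caveat about very large groups applies.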

Related

How to find a list of distinct dataset from a particular value in Scala?

There is one case class, "companyDetails":
case class companyDetails(id: Int, name: String, overview: Int)
A list of company details looks something like this:
val listOFData: List[companyDetails] = List(
  companyDetails(1, "google", 95),
  companyDetails(2, "Accenture", 78),
  companyDetails(3, "google", 90),
  companyDetails(4, "facebook", 78),
  companyDetails(5, "City", 89))
So I want a list of company details distinct by company name irrespective of the other parameters.
So my expected output should be:
{(1,"google",95),(2,"Accenture",78),(4,"facebook",78),(5,"City", 89)}
Please help me solve this problem in Scala. Thank you in advance.

Check the Scaladoc for List, which documents many useful methods, such as distinctBy (available since Scala 2.13):
def distinctBy[B](f: (A) => B): List[A]
Selects all the elements of this immutable sequence ignoring the duplicates as determined by == after applying the transforming function f.
case class companyDetails(id: Int, name: String, overview: Int)
val listOFData: List[companyDetails] = List(
  (1, "google", 95),
  (2, "Accenture", 78),
  (3, "google", 90),
  (4, "facebook", 78),
  (5, "City", 89)
).map { case (id, name, overview) => companyDetails(id, name, overview) }
// Keeps the first element seen for each distinct name
val result = listOFData.distinctBy(_.name)

What GROUPBY aggregator can I use to test if grouped values are equal to a constant?

Situation: I have table Bob where each row has a bunch of columns, including a Result, SessionID1, SessionID2.
Goal: I want to GroupBy SessionID1 and SessionID2 and see if any Results in the group are 0; I expect multiple rows to have the same ID1 and ID2 values. I then want to divide the count of groups with 0 results / the count of all groups.
Questions: I think I want something like:
GROUPBY (
Bob,
SessionID1,
SessionID2,
"Has at least 1 success",
???)
But what aggregator can I use for ??? to get a boolean indicating if any result in the group equals 0?
Also, if I want a count of groups with successes, do I just wrap the GROUPBY in a COUNT?

You can try the following DAX to create a new summary table:
Summary = GROUPBY(Bob, Bob[SessionID1], Bob[SessionID2],
"Number of rows", COUNTX(CURRENTGROUP(), Bob[Result]),
"Number of successes", SUMX(CURRENTGROUP(), IF(Bob[Result] = 0, 1, 0)))
Then you can add a calculated column for the success ratio:
Success ratio = Summary[Number of successes] / Summary[Number of rows]
EDIT:
If what you want to calculate is something like "Any success", then SUMMARIZE may be a better option than GROUPBY because of how the two functions evaluate their expressions.
Summary2 = SUMMARIZE(Bob, Bob[SessionID1], Bob[SessionID2],
"Any success", IF(COUNTROWS(FILTER(Bob, Bob[Result] = 0)) > 0, 1, 0),
"Number of rows", COUNTROWS(Bob))
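
The ratio asked for in the question (groups containing at least one Result of 0, divided by all groups) can be sanity-checked outside DAX with a small Python sketch. The function name success_ratio and the sample rows are hypothetical, chosen only to illustrate the logic.

```python
from collections import defaultdict

def success_ratio(rows):
    """rows: iterable of (session_id1, session_id2, result) tuples.

    Returns (groups with at least one result == 0) / (all groups).
    """
    has_zero = defaultdict(bool)
    for sid1, sid2, result in rows:
        # A group is flagged once any of its rows has result == 0
        has_zero[(sid1, sid2)] |= (result == 0)
    if not has_zero:
        return 0.0
    return sum(has_zero.values()) / len(has_zero)

# Hypothetical sample data: three groups, two of which contain a 0
rows = [(1, 1, 0), (1, 1, 5), (1, 2, 3), (2, 1, 0)]
ratio = success_ratio(rows)
print(ratio)
```

Note that this counts groups, not rows, which is what the question describes.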

Search for item in a list of tuples

I have a list of tuples structured as per the below (data is example only):
[('aaa', 10), ('bbb', 10), ('ccc', 12), ('ddd', 12), ('eee', 14)]
I need to search the second item in each of the tuples (the number) to see if it exists in the list (e.g. search for 12 = found, search for 5 = not found).
Currently I am using the below, which works but may not be the best way in Python:
not_there = True
for a in final_set:
    # compare the second item of the current tuple, not of the whole list
    if a[1] == episode_id:
        not_there = False
        break
What is the best / most efficient way in Python to do this?

Maybe you can try something like this:
test = [('aaa', 10), ('bbb', 10), ('ccc', 12), ('ddd', 12), ('eee', 14)]
number = 10
for i in test:
    if number in i:
        print("Number {} found.".format(number))
        break  # stop at the first match
else:
    # runs only if the loop finished without hitting break
    print("Number {} not found".format(number))
That should work regardless of whether you're searching for element 1 in the tuple (the 'aaa') or element 2 (the number).
Hope this helps.

What about this:
is_there = (len([item for item in final_set if item[1] == episode_id]) > 0)
Basically, [item for item in final_set if item[1] == episode_id] is a list comprehension expression which creates a list of the items in final_set such that item[1] == episode_id.
Then you can check the length of the resulting list: if it is greater than 0, then something has been found.
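
A variation on the same idea is the built-in any() with a generator expression, which stops at the first match instead of building the whole list. The sketch below reuses final_set from the question; the helper name has_episode and the set episode_ids are hypothetical.

```python
final_set = [('aaa', 10), ('bbb', 10), ('ccc', 12), ('ddd', 12), ('eee', 14)]

def has_episode(pairs, episode_id):
    """Return True if any tuple's second item equals episode_id."""
    # any() short-circuits: it stops iterating at the first match
    return any(number == episode_id for _, number in pairs)

print(has_episode(final_set, 12))
print(has_episode(final_set, 5))

# For many lookups against the same data, build a set of the numbers once
episode_ids = {number for _, number in final_set}
print(12 in episode_ids)
```

The set version pays a one-time cost to build, after which each membership test no longer scans the list.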

Matching two lists in excel

I am trying to compare two months' sales to each other in Excel in the most automated way possible (so that it will be quicker for future months).
This month's values are all worked out through formulae, and last month's will be copied and pasted into D:E. However, there are some customers that made purchases last month and not this month (and vice versa). I basically need all CustomerIDs to match row by row, so that each customer appears on the same row in both lists.
Can anyone think of a good way of doing this without having to do it all manually? Thanks

Use the SUMIFS function or VLOOKUP. Like this:
http://screencast.com/t/VTBZrfHjo8tk
You should just have your entire customer list on one sheet and then add up the values associated with them month over month. The design you are describing is going to be a nightmare to maintain over time and serves no purpose. I can understand you would like to see the customers in a row like that, which is why I suggest SUMIFS.
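
The idea behind this suggestion (one master list of customers, with each month's value looked up against it) can be sketched outside Excel in Python. The function name align_months and the sample IDs and amounts are hypothetical; a missing customer is shown as 0, which is roughly what a SUMIFS lookup would return.

```python
def align_months(this_month, last_month):
    """Return (customer_id, this_value, last_value) rows for every customer in either month."""
    # The union of both ID sets plays the role of the master customer list
    customer_ids = sorted(set(this_month) | set(last_month))
    return [
        (cid, this_month.get(cid, 0), last_month.get(cid, 0))
        for cid in customer_ids
    ]

# Hypothetical sample data: customer ID -> sales value
this_month = {"C001": 120, "C003": 75}
last_month = {"C001": 100, "C002": 40}
rows = align_months(this_month, last_month)
for row in rows:
    print(row)
```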

This option compares only two columns; I think you should approach it another way.
First, I would add a date/month column, and then you can append the next month's values below the current ones.
Then you can use a simple pivot table to see several months at the same time.
In any case, if you want to align your two columns, you can use this code (you will need to update the references; I used the data from your example image):
Sub OrderMachColumns()
    Dim lastRow As Long
    Dim sortarray(1 To 2, 1 To 2) As String
    Dim x As Long, y As Long, Z As Long
    Dim TempTxt11 As String
    Dim TempTxt12 As String
    Dim TempTxt21 As String
    Dim TempTxt22 As String
    lastRow = Range("A3").End(xlDown).Row ' I use column A, same as your example
    For x = 3 To lastRow * 2
        Cells(x, 1).Select
        If Cells(x, 1) = "" Then GoTo B
        If Cells(x, 4) = "" Then GoTo A
        If Cells(x, 1) = Cells(x, 4) Then
            ' IDs already match on this row: nothing to do
        Else
            If Cells(x, 1).Value = Cells(x - 1, 4).Value Then
                Range(Cells(x - 1, 4), Cells(x - 1, 5)).Select
                Selection.Insert Shift:=xlDown, CopyOrigin:=xlFormatFromLeftOrAbove
            ElseIf Cells(x, 1).Value = Cells(x + 1, 4).Value Then
                Range(Cells(x, 1), Cells(x, 2)).Select
                Selection.Insert Shift:=xlDown, CopyOrigin:=xlFormatFromLeftOrAbove
            Else
                ' Sort the two IDs to decide which side needs a blank row inserted
                sortarray(1, 1) = Cells(x, 1).Value
                sortarray(1, 2) = "Cells(" & x & ", 1)"
                sortarray(2, 1) = Cells(x, 4).Value
                sortarray(2, 2) = "Cells(" & x & ", 4)"
                For Z = LBound(sortarray) To UBound(sortarray)
                    For y = Z To UBound(sortarray)
                        If UCase(sortarray(y, 1)) > UCase(sortarray(Z, 1)) Then
                            TempTxt11 = sortarray(Z, 1)
                            TempTxt12 = sortarray(Z, 2)
                            TempTxt21 = sortarray(y, 1)
                            TempTxt22 = sortarray(y, 2)
                            sortarray(Z, 1) = TempTxt21
                            sortarray(y, 1) = TempTxt11
                            sortarray(Z, 2) = TempTxt22
                            sortarray(y, 2) = TempTxt12
                        End If
                    Next y
                Next Z
                Select Case sortarray(1, 2)
                    Case "Cells(" & x & ", 1)"
                        Range(Cells(x, 1), Cells(x, 2)).Select
                    Case "Cells(" & x & ", 4)"
                        Range(Cells(x, 4), Cells(x, 5)).Select
                End Select
                Selection.Insert Shift:=xlDown, CopyOrigin:=xlFormatFromLeftOrAbove
            End If
        End If
A:
    Next x
B:
End Sub

Gatling, append a value to an existing list attribute in Session

I am using Gatling and trying to append a value to an existing list attribute in the Session. For example, let's suppose that the current Session has a list attribute as follows.
List(1, 2, 3)
Then, after running the code below,
exec(
  http("Create_New_Lists")
    .post("/api/v1/lists/sync")
    .basicAuth("${email}", "test")
    .body(StringBody("""{ "productListDto":{"id":"0","active":"true","items":[],"name":"""" + listName + """"},"token":"" }""")).asJSON
    .check(jsonPath("""$..id""").saveAs("value_to_be_appended"))
)
I want to add "value_to_be_appended" to the list so that the list would be
List(1, 2, 3, 4) (if value_to_be_appended is 4)
How can I do this?
I would appreciate your help!

Write an exec block in which you manipulate the session: fetch the existing list and replace it with the extended one:
.exec { session =>
  for {
    existingList <- session("existingList").validate[List[Int]]
    // the value you extracted is a String, not an Int
    value_to_be_appended <- session("value_to_be_appended").validate[String]
    // convert it so the list stays a List[Int]; toInt assumes the id is numeric
  } yield session.set("existingList", existingList :+ value_to_be_appended.toInt)
}
