Apache Flink - Sum and Group in a DataStream - mapreduce

Suppose I have records like this:
("a-b", "data1", 1)
("a-c", "data2", 1)
("a-b", "data3", 1)
How can I group and sum in Apache Flink, such that I have the following results when the input is a DataStream?
("a-b", ["data1", "data3"], 2)
("a-c", ["data2"], 1)
I can only think of either a solution using {Time, Count}Windows and then grouping and applying transformations, or keeping the elements in an iteration stream (but this would need a lot of memory).
Regards, Kevin
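A minimal sketch of the usual keyBy/reduce approach, shown here in PyFlink (an assumption; the Java/Scala DataStream API has the same keyBy(...).reduce(...) shape). Each record carries its payload as a one-element list so the reduce can concatenate:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# One-element lists make the payloads concatenable inside the reduce.
records = env.from_collection([
    ("a-b", ["data1"], 1),
    ("a-c", ["data2"], 1),
    ("a-b", ["data3"], 1),
])
result = (records
          .key_by(lambda r: r[0])
          .reduce(lambda a, b: (a[0], a[1] + b[1], a[2] + b[2])))
# On an unbounded stream this emits a rolling update per input record;
# the last value emitted per key is the full aggregate.
result.print()
env.execute("group-and-sum")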

Related

SubString match in Google Sheets using Regex and Query

This is the data in Google Sheets:
Account Number     Names
7728550,543216     Govt Req
772855,65432       Vodafone
I am trying to do a lookup of the account numbers with the formula
=QUERY(Sheet1!B$3:C$4,"Select C where B matches '^.*(" & B2 & ").*$' limit 1")
Looking up 772855 returns "Govt Req" (the regex substring-matches the 7728550 in the first row) instead of "Vodafone".
How do I solve this? There is a large chunk of data, so I can't paste the values into different rows.
use:
=ARRAYFORMULA(IFNA(VLOOKUP(B2:B,
SPLIT(FLATTEN(SPLIT(Sheet1!F2:F, ",")&"×"&Sheet1!G2:G), "×"), 2, )))
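What the formula does, sketched in Python (sample data from the question): SPLIT and FLATTEN turn each comma-separated account list into one (number, name) pair per account number, so the VLOOKUP becomes an exact match instead of a substring match:

rows = [
    ("7728550,543216", "Govt Req"),
    ("772855,65432", "Vodafone"),
]
# One (number, name) entry per account number.
lookup = {}
for numbers, name in rows:
    for n in numbers.split(","):
        lookup[n.strip()] = name
print(lookup["772855"])  # Vodafone, not Govt Req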

Summing up number values extracted from one cell using regexextract or regexreplace

I have numbers like the sample below stored in one cell:
First:
[9miles 12lbs weight 1g Raw]
Second:
[1miles 3lbs weight 7g Raw]
Third:
[20miles 6lbs weight 3g Raw]
I'd like to extract the numbers, sum them up, and place them in another cell in the same row. So far I can only manage to extract the first instance with the REGEXEXTRACT formula. Is this even possible?
Desired outcome:
[30miles 21lbs weight 11g Raw]
try:
=INDEX(QUERY(IFERROR(REGEXEXTRACT(SPLIT(
FLATTEN(SPLIT(A1, ":")), " "), "\d+")*1, 0),
"select sum(Col1),sum(Col2),sum(Col4)"), 2)

How to get the latest result set based on the timestamp in Amazon QLDB?

I have many IonStructs like the following:
{
revenueId: "0dcb7eb6-8cec-4af1-babe-7292618b9c69",
ownerId: "u102john2021",
revenueAddedTime: 2020-06-20T19:31:31.000Z,
}
I want to write a query to select the latest set of records within a given year.
For example, suppose I have a set of timestamps like this:
A - 2019-06-20T19:31:31.000Z
B - 2020-06-20T19:31:31.000Z
C - 2020-06-20T19:31:31.000Z
D - 2021-07-20T19:31:31.000Z
E - 2020-09-20T19:31:31.000Z
F - 2020-09-20T19:31:31.000Z
If the selected year is between 2020 and 2021, I want to return the records that have the latest timestamp.
In this case: E and F.
I tried many ways, like:
"SELECT * FROM REVENUES AS r WHERE r.ownerId = ? AND r.revenueAddedTime >= ? AND r.revenueAddedTime < ?"
Can anyone help me here?
Although I have no experience with QLDB syntax, it seems to have similar properties to other DB syntaxes, in that you can format your timestamps using these docs:
https://docs.aws.amazon.com/qldb/latest/developerguide/ql-functions.timestamp-format.html
https://docs.aws.amazon.com/qldb/latest/developerguide/ql-functions.to_timestamp.html
Once you format the timestamps, you should be able to use the > and < query syntax.
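A hedged sketch in Python of how that could look: the range filter uses the TO_TIMESTAMP function from the linked docs (table and field names follow the question), and because QLDB's PartiQL support is limited, keeping only the rows that share the latest timestamp is done client-side:

from datetime import datetime, timezone

# Assumed parameterized statement; bounds passed as ISO-8601 strings.
query = """
    SELECT * FROM REVENUES AS r
    WHERE r.ownerId = ?
      AND r.revenueAddedTime >= TO_TIMESTAMP(?)
      AND r.revenueAddedTime <  TO_TIMESTAMP(?)
"""
# Hypothetical rows returned by the range query above.
rows = [
    {"revenueId": "B", "revenueAddedTime": datetime(2020, 6, 20, 19, 31, 31, tzinfo=timezone.utc)},
    {"revenueId": "E", "revenueAddedTime": datetime(2020, 9, 20, 19, 31, 31, tzinfo=timezone.utc)},
    {"revenueId": "F", "revenueAddedTime": datetime(2020, 9, 20, 19, 31, 31, tzinfo=timezone.utc)},
]
latest = max(r["revenueAddedTime"] for r in rows)
print([r["revenueId"] for r in rows if r["revenueAddedTime"] == latest])  # ['E', 'F']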

OpenCV python vstack changes width

I'm using OpenCV 3.0.0 with Python 2.7 and trying something that ought to be simple.
I want to stack images vertically.
This simple example:
import cv2
import numpy as np
row_0 = cv2.imread('row_0.png')  # the two source images
row_1 = cv2.imread('row_1.png')
comb = np.vstack((row_0, row_1))
cv2.imwrite('foo.png', comb)
consistently produces a foo.png that is drastically narrower (in the browser) than row_0 and row_1.
Details:
row_0.shape
(1074, 785, 3)
row_1.shape
(1187, 785, 3)
comb.shape
(2261, 785, 3)
If I look at row_0.png in the browser, it is WAY wider than foo.png.
Question
How can I alter my code so row_0.png is the same width as foo.png in the browser?
np.vstack does two things: it makes sure the inputs are at least 2-D (here they are 3-D), and it joins them on axis=0 (rows). In other words, it is equivalent to:
np.concatenate((row_0, row_1), axis=0)
That's what I see happening: two dimensions are the same, and the first is the sum of the two inputs:
(1074, 785, 3) + (1187, 785, 3) = (2261, 785, 3)
If comb looks narrower, it is probably because of display scaling. The ratio of the 2nd dimension (width) to the 1st (height) has gotten smaller; that's to be expected if you join two arrays this way, and given the dimensions it's the only possible result.
Viewed as arrays, comb has more rows and the same number of columns, so its pixel width is unchanged at 785. But because comb is much taller, a browser that scales it down to fit will display it narrower than row_0.png.
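A quick check of the shape arithmetic with dummy arrays:

import numpy as np

# Dummy images with the shapes from the question.
row_0 = np.zeros((1074, 785, 3), dtype=np.uint8)
row_1 = np.zeros((1187, 785, 3), dtype=np.uint8)
comb = np.vstack((row_0, row_1))  # same as np.concatenate((row_0, row_1), axis=0)
print(comb.shape)                 # (2261, 785, 3): taller, same 785-pixel width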

How to match Amazon / CJ / Linkshare Products

I need to create a database with the Amazon, Commission Junction, and LinkShare APIs and data feeds, and then match the same products to create comparisons of product information.
My problem is related to the matching process.
I start by matching products via SKU/UPC/ASIN, but this does not perform well because many of the products don't contain this information.
I did some research, and the most popular techniques I found are:
- Measuring cosine similarity via TF-IDF
- Measuring edit distance: Levenshtein / Jaro-Winkler
In my approach I used cosine similarity and Jaro-Winkler.
How I do the matching:
Step 1: Preprocessing
Preprocessing transforms strings into a normal form:
- Lowercase
- Filter stop words (new, by, the …)
- Strip whitespace
- Replace all whitespace runs with a single space character
Step 2: Indexing
Index Amazon products in one Solr core [core A] and CJ/LinkShare products in another [core B]. The goal of indexing is to limit the number of string comparisons (via TF-IDF and Jaro-Winkler).
Step 3: Matching
I start by retrieving a product title from core B, run a Solr search in core A with this title, and take the top 30 results.
I measure similarity via TF-IDF between the product I want to match (the query) and the 30 results retrieved by the Solr search, and keep the products with similarity > 80%.
I then sort the tokens from each product alphabetically and compare the transformed strings with the Jaro-Winkler distance, keeping the products with similarity > 80% (this performs a Jaro-Winkler similarity between phrases).
Here I tokenize both strings (query and product to match) and perform a comparison between tokens.
But this technique also doesn't perform well. Example:
Product 1 : Orange by Hugo Boss, 3 Ounce Eau de toilette Spray
Product 2 : In Motion Orange By Hugo Boss Eau De Toilette Spray 3 Ounces
Products 1 and 2 come out as similar under these techniques, but they are actually different.
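A small sketch of the normalize-and-compare pipeline described above, showing why these two titles score as near-duplicates. difflib's ratio stands in here for Jaro-Winkler (a dedicated implementation, e.g. the jellyfish package, would be used in practice):

import difflib
import re

STOP_WORDS = {"new", "by", "the", "in"}  # assumed stop-word list

def normalize(title: str) -> str:
    # Step 1: lowercase, drop punctuation and stop words,
    # collapse whitespace; then token-sort as in step 3.
    text = re.sub(r"[^\w\s]", " ", title.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(sorted(tokens))

def similarity(a: str, b: str) -> float:
    # Stand-in for the Jaro-Winkler phrase comparison.
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

p1 = "Orange by Hugo Boss, 3 Ounce Eau de toilette Spray"
p2 = "In Motion Orange By Hugo Boss Eau De Toilette Spray 3 Ounces"
print(round(similarity(p1, p2), 2))  # high score despite different products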
How can I improve this algorithm? Is this the right way to match products?
What if I train a classifier on token weights (using Jaro-Winkler), with training data from products already matched via UPC, and use that classifier to match products in a final step?
PS: I have products from different categories (health, beauty, electronics, books, movies...) and the data is very unstructured and incomplete.
Any advice would be helpful.
Thanks
Smail