Implement A-B functionality using Cascading with LEFT JOIN - mapreduce

There are two datasets, A and B, each with a single column (ID).
Cat A
cat B
Cat A-B
This subtraction is done in 2 steps.
Step 1: LEFT JOIN A with B. This gives a dataset with 2 columns, where the 1st column has all the entries from dataset A and the 2nd column has only the matching entries from B.
2 2
4 4
5 5
Step 2: Filter the data set from step1 by records where 2nd field is null
Thus we have implemented A-B by using LEFT JOIN.
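The two steps can be sketched outside Cascading to make the intent concrete. This is only an illustration in plain Python, not Cascading code: the function names and the sample IDs are assumptions (the original listings of A and B are not shown), and IDs are assumed to be unique within each dataset.

```python
def left_join(a_ids, b_ids):
    """Step 1: pair every A id with its match in B, or None when B has no match."""
    b_set = set(b_ids)
    return [(a, a if a in b_set else None) for a in a_ids]


def a_minus_b(a_ids, b_ids):
    """Step 2: keep only the joined rows whose B column is null."""
    return [a for a, b in left_join(a_ids, b_ids) if b is None]


# Hypothetical sample data, chosen only for illustration.
a = [1, 2, 3, 4, 5]
b = [2, 4, 5, 6]

print(left_join(a, b))   # [(1, None), (2, 2), (3, None), (4, 4), (5, 5)]
print(a_minus_b(a, b))   # [1, 3]
```

The important detail is that step 2 keeps the rows with a null second field; a filter that drops null rows would return the intersection instead.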
I am able to execute Step 1 but I am unable to implement step 2.
Below is the source code for step 1
import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.FlowDef;
import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;
import cascading.pipe.HashJoin;
import cascading.pipe.Pipe;
import cascading.pipe.assembly.Retain;
import cascading.pipe.assembly.Unique;
import cascading.pipe.joiner.LeftJoin;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class AMinusB {
    public static FlowDef createWorkflowLeftJoin(Tap aTap, Tap bTap,
            Tap outputTap) {
        Pipe bpipe = new Pipe("b_pipe");
        Pipe apipe = new Pipe("a_pipe");
        Fields b_user_id = new Fields("B_id");
        Fields a_user_id = new Fields("A_id");
        // Step 1: A LEFT JOIN B on the id column
        Pipe joinPipe = new HashJoin(apipe, a_user_id, bpipe, b_user_id,
                new LeftJoin());
        Pipe retainPipe = new Pipe("retain", joinPipe);
        retainPipe = new Retain(retainPipe, new Fields("A_id", "B_id"));
        // Remove duplicate (A_id, B_id) pairs
        Pipe cdistPipe = new Pipe("UniquePipe", retainPipe);
        Fields selector = new Fields("A_id", "B_id");
        cdistPipe = new Unique(cdistPipe, selector);
        FlowDef flowDef = FlowDef.flowDef().addSource(apipe, aTap)
                .addSource(bpipe, bTap).addTailSink(cdistPipe, outputTap)
                .setName("A-B using left outer join");
        return flowDef;
    }

    public static void main(String[] args) {
        String Apath = "path to data set A";
        String Bpath = "path to data set B";
        String outputPath = "path to output";
        Properties properties = new Properties();
        FlowConnector flowConnector = new Hadoop2MR1FlowConnector(properties);
        Fields A = new Fields("A_id");
        Tap ATap = new Hfs(new TextDelimited(A, false, "\t"), Apath);
        Fields B = new Fields("B_id");
        Tap BTap = new Hfs(new TextDelimited(B, false, "\t"), Bpath);
        Tap outputTap = new Hfs(new TextDelimited(false, "\t"), outputPath);
        FlowDef flowDefLeftJoin = createWorkflowLeftJoin(ATap, BTap, outputTap);
        // Connect and run the flow
        flowConnector.connect(flowDefLeftJoin).complete();
    }
}

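Check the FilterNull and FilterNotNull operations. FilterNull discards every tuple that contains a null argument value, which would leave only the matched rows. To keep only the rows whose B_id is null, apply FilterNotNull to that field:
cdistPipe = new Each(cdistPipe, new Fields("B_id"), new FilterNotNull());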


Copy certain number of Columns in a row

I update sheets on a weekly basis. I import an external file (the starting point) and use the code below to copy selected columns to Sheet2. All is good.
Sub Copy_Specific_Columns_ToAnother_Sheet()
Sheets("Data").Range("C:C").Copy Sheets("Sheet2").Range("B:B")
Sheets("Data").Range("D:D").Copy Sheets("Sheet2").Range("C:C")
Sheets("Data").Range("B:B").Copy Sheets("Sheet2").Range("D:D")
Sheets("Data").Range("I:I").Copy Sheets("Sheet2").Range("E:E")
Sheets("Data").Range("K:K").Copy Sheets("Sheet2").Range("F:F")
Sheets("Data").Range("H:H").Copy Sheets("Sheet2").Range("G:G")
Sheets("Data").Range("J:J").Copy Sheets("Sheet2").Range("H:H")
Sheets("Data").Range("AF:AF").Copy Sheets("Sheet2").Range("I:I")
'Clean up sheet with formatting
ActiveSheet.UsedRange.Font.Size = 12
ActiveSheet.UsedRange.Font.Name = "Calibri"
Range("a1").EntireRow.RowHeight = 45
Range("a2").EntireRow.RowHeight = 30
End Sub
On Sheet2 I create a unique identifier (without a macro) and import the Sheet2 details into a Master sheet, where I make edits from column K onwards.
The code below looks for the unique identifier and pulls in new rows.
However, when new rows are added, my notes from column L onwards are deleted every time I update.
Can the code below be modified so that new rows only update up to a specific column (say, up to K), leaving my additional notes and entries untouched? (Column A has the unique identifier; the macro looks for changes and pulls in new rows.)
Sub Update_Data()
    Dim wsSource As Worksheet
    Dim wsDest As Worksheet
    Dim recRow As Long
    Dim lastRow As Long
    Dim fCell As Range
    Dim i As Long
    'Define our worksheets
    Set wsSource = Worksheets("Sheet2")
    Set wsDest = Worksheets("Master")
    Application.ScreenUpdating = False
    recRow = 1
    With wsSource
        lastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
        For i = 2 To lastRow
            'See if item is in Master sheet
            Set fCell = wsDest.Range("A:A").Find(what:=.Cells(i, "A").Value, lookat:=xlWhole, MatchCase:=False)
            If Not fCell Is Nothing Then
                'Record is already in master sheet
                recRow = fCell.Row
            Else
                'Need to move this to master sheet after last found record
                .Cells(i, "A").EntireRow.Copy
                wsDest.Cells(recRow + 1, "A").EntireRow.Insert
                recRow = recRow + 1
            End If
        Next i
    End With
    'Code clean up
    Application.CutCopyMode = False
    Application.ScreenUpdating = True
    'Clean up sheet with formatting
    ActiveSheet.UsedRange.Font.Size = 12
    ActiveSheet.UsedRange.Font.Name = "Calibri"
    Range("a1").EntireRow.RowHeight = 45
    Range("a2").EntireRow.RowHeight = 30
End Sub

Extract the digits and append it in a different cell?

I am trying to automatically extract (with a RegExp) the digits (the AREA number) from column 3 and append them, together with the text 'A', to the Date INDEX in column 1.
The problem is that I'm not yet familiar with Google Sheets Apps Script.
I looked for solutions to similar situations, but to no avail.
I don't know how to convert VBA to Apps Script.
I tried some code, but I still can't make it work.
Can anyone point me in the right direction?
Thanks in advance for any help.
The constraint is that at the office I can't add a column for a formula.
It must happen "behind the scenes".
My Google Sheets script:
function onEdit(e) {
  var rg=e.range;
  var sh=e.range.getSheet();
  var area=sh.getName();
  var regExp = new RegExp("\d*"); // Extract the digits
  var dataIndex = regExp.exec(area)[1];
  if(rg.columnStart==3) { // Observe column 3
    var vA=rg.getValues();
    for(var i=0;i<vA.length;i++){
      if(vA[i][0]) {
        sh.getRange(rg.rowStart + i,1).appendText((dataIndex) +'A'); // append to column 1 with 'A' and extracted digits
      }
    }
  }
}
This answer extends your approach of using a script with an onEdit trigger, but there are a number of differences between the two sets of code.
The most significant difference is that I have used the JavaScript split method (var fields = value.split(' ');) to get distinct values from the data entry.
Most of the other differences are error checking:
if(rg.columnStart === 3 && area === "work") {: test for sheet = "work" as well as an edit in Column C.
var value = e.value.toUpperCase();: anticipate that the text might be in lower case.
if (fields.length !=2){: test that there are two elements in the data entry.
if (fields[0] != "AREA"){: test that the first element of the entry is the word 'AREA'.
if (num !=0 && numtype ==="number"){: test that the second element is a number, and that it is NOT zero.
if (colA.length !=0){: test that Column A is not empty.
var newColA = colA+"A"+num;: construct the new value for Column A; num has already been converted to a number with the unary operator '+' (num =+num).
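The chain of checks listed above can be read as one small pure function. The sketch below is a Python illustration rather than Apps Script; `build_index` and the sample values are hypothetical, and the numeric test is deliberately stricter than the unary-plus conversion (it accepts whole numbers only).

```python
def build_index(col_a, entry):
    """Return the new column A value, or None when any check fails."""
    fields = entry.upper().split(" ")      # tolerate lower-case input
    if len(fields) != 2:                   # expect exactly two parts
        return None
    if fields[0] != "AREA":                # first part must be the word AREA
        return None
    if not fields[1].isdecimal() or int(fields[1]) == 0:
        return None                        # second part must be a non-zero number
    if not col_a:                          # column A must not be empty
        return None
    return f"{col_a}A{int(fields[1])}"


print(build_index("20191127", "area 5"))   # 20191127A5
print(build_index("20191127", "zone 5"))   # None
```

Keeping the checks in one function that returns early makes each rejection reason easy to test separately from the spreadsheet.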
function onEdit(e){
  // so5911459101
  // test for edit in column C and sheet = work
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  // get Event Objects
  var rg=e.range;
  var sh=e.range.getSheet();
  var area=sh.getName();
  var row = rg.getRow();
  // test if the edit is in Column C of sheet = work
  if(rg.columnStart === 3 && area === "work") { // Observe column 3 and sheet = work
    //Logger.log("DEBUG: the edit is in Column C of 'Work'")
    // get the edited value
    var value = e.value.toUpperCase();
    //Logger.log("DEBUG: the value = "+value+", length = "+value.length+", uppercase = "+value.toUpperCase());
    // use Javascript split on the value
    var fields = value.split(' ');
    // Logger.log("DEBUG: number of fields = "+fields.length)
    // test if there are two fields in the value
    if (fields.length !=2){
      // Logger.log("DEBUG: the value doesn't have two fields")
    }
    else{
      // Logger.log("DEBUG: the value has two fields")
      // test if the first field = 'AREA'
      if (fields[0] != "AREA"){
        // Logger.log("DEBUG: do nothing because the value doesn't include area")
      }
      else{
        // Logger.log("DEBUG: do something because the value does include area")
        // get the second field - it should be a value
        var num = fields[1];
        num = +num;
        var numtype = typeof num;
        // Logger.log("DEBUG: num= "+num+" type = "+numtype); //number
        // test type of second field
        if (num !=0 && numtype ==="number"){
          // Logger.log("DEBUG: the second field IS a number")
          // get the range for the cell in Column A
          var colARange = sh.getRange(row,1);
          // Logger.log("DEBUG: the ColA range = "+colARange.getA1Notation());
          // get the value of Column A
          var colA = colARange.getValue();
          // Logger.log("DEBUG: Col A = "+colA+", length = "+colA.length);
          // test if Column A is empty
          if (colA.length !=0){
            var newColA = colA+"A"+num;
            // Logger.log("DEBUG: the new cola = "+newColA);
            // update the value in Column A
            colARange.setValue(newColA);
          }
          else{
            // Logger.log("DEBUG: do nothing because column A is empty")
          }
        }
        else{
          // Logger.log("DEBUG: the second field isn't a number")
        }
      }
    }
  }
  else{
    //Logger.log("DEBUG: the edit is NOT in Column C of 'Work'")
  }
}
If the value in Column C is sourced from data validation, then there is no need for any testing except that the edit was in Column C and the sheet = "work".
The revised code includes two additional lines:
var colAfields = colA.split('-');
var colAdate = colAfields[0];
This has the effect of discarding any existing characters after the hyphen, and re-establishing the hyphen, the row number, "A" and the AREA numeral.
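The effect of splitting on the hyphen is that the rebuild is repeatable: editing the same row twice does not keep appending suffixes. A minimal Python sketch of that string logic follows; `rebuild_index` and the sample values are hypothetical, and it assumes the date part of column A contains no hyphen itself.

```python
def rebuild_index(col_a, row, entry):
    """Rebuild column A as <date>-<row>A<area number>, dropping any old suffix."""
    date_part = str(col_a).split("-")[0]   # keep only the text before the first hyphen
    num = entry.split(" ")[1]              # second word of the entry, e.g. "AREA 7"
    return f"{date_part}-{row}A{num}"


first = rebuild_index("20191127", 4, "AREA 7")
print(first)                                  # 20191127-4A7
print(rebuild_index(first, 4, "AREA 9"))      # 20191127-4A9
```

If the date in column A is stored with hyphens (for example 2019-11-27), the split would cut it short, so a different separator would be needed in that case.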
function onEdit(e){
  // so5911459101 revised
  // only one test - check for ColumnC and sheet="work"
  // test for edit in column C and sheet = work
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  // get Event Objects
  var rg=e.range;
  var sh=e.range.getSheet();
  var area=sh.getName();
  var row = rg.getRow();
  // test if the edit is in Column C of sheet = work
  if(rg.columnStart === 3 && area === "work") { // Observe column 3 and sheet = work
    Logger.log("DEBUG: the edit is in Column C of 'Work'")
    // get the edited value
    var value = e.value
    //Logger.log("DEBUG: the value = "+value+", length = "+value.length);
    // use Javascript split on the value
    var fields = value.split(' ');
    // get the second field - it should be a value
    var num = fields[1];
    // get the range for the cell in Column A
    var colARange = sh.getRange(row,1);
    // Logger.log("DEBUG: the ColA range = "+colARange.getA1Notation());
    // get the value of Column A
    var colA = colARange.getValue();
    // Logger.log("DEBUG: Col A = "+colA+", length = "+colA.length);
    // use Javascript split on Column A in case of existing value
    var colAfields = colA.split('-');
    var colAdate = colAfields[0];
    // build new value
    var newColA = colAdate+"-"+row+"A"+num;
    // Logger.log("DEBUG: the new cola = "+newColA);
    // update the value in Column A
    colARange.setValue(newColA);
  }
  else{
    Logger.log("DEBUG: the edit is NOT in Column C of 'Work'")
  }
}

Unique list from dynamic range table with possible blanks

I have an Excel table in Sheet1 with the following in column A:
Name of company
Company 1
Company 2
Company 3
Company 1
Company 4
Company 1
Company
I want to extract a unique list of company names to column A of Sheet2. I can only do this with a helper column, and only if there are no blanks between the company names; when there are blanks, I get one extra "company" that is blank.
Also, the example I found while researching was for non-dynamic tables, so it doesn't work for me because I don't know the length of my column.
I want in Sheet2, column A:
Name of company
Company 1
Company 2
Company 3
Company 4
I am looking for the solution that requires the least computational power, in Excel or Excel VBA. The final order in which the names appear in Sheet2 doesn't really matter.
Using a slight modification to Recorder-generated code:
Sub Macro1()
    Sheets("Sheet1").Range("A:A").Copy Sheets("Sheet2").Range("A1")
    Sheets("Sheet2").Range("A:A").RemoveDuplicates Columns:=1, Header:=xlYes
    With Sheets("Sheet2").Sort
        .SortFields.Clear
        .SortFields.Add Key:=Sheets("Sheet2").Range("A2:A" & Rows.Count) _
            , SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
        .SetRange Sheets("Sheet2").Range("A2:A" & Rows.Count)
        .Header = xlGuess
        .MatchCase = False
        .Orientation = xlTopToBottom
        .SortMethod = xlPinYin
        .Apply
    End With
End Sub
Sample Sheet1:
Sample Sheet2:
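The ascending sort pushes the one blank entry that survives RemoveDuplicates to the bottom, so the list itself contains no gaps.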
If the original data in Sheet1 was derived from formulas, then using PasteSpecial avoids copying the formulas themselves. There is also a final sweep for empty cells:
Sub Macro1_The_Sequel()
    Dim rng As Range
    Sheets("Sheet1").Range("A:A").Copy
    Sheets("Sheet2").Range("A1").PasteSpecial Paste:=xlPasteValues
    Sheets("Sheet2").Range("A:A").RemoveDuplicates Columns:=1, Header:=xlYes
    Set rng = Sheets("Sheet2").Range("A2:A" & Rows.Count)
    With Sheets("Sheet2").Sort
        .SortFields.Clear
        .SortFields.Add Key:=rng, SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
        .SetRange rng
        .Header = xlGuess
        .MatchCase = False
        .Orientation = xlTopToBottom
        .SortMethod = xlPinYin
        .Apply
    End With
    Application.CutCopyMode = False
    Call Kleanup
End Sub
Sub Kleanup()
    Dim N As Long, i As Long
    With Sheets("Sheet2")
        N = .Cells(Rows.Count, "A").End(xlUp).Row
        'Walk upwards so deleting a cell does not skip the next one
        For i = N To 1 Step -1
            If .Cells(i, "A").Value = "" Then
                .Cells(i, "A").Delete shift:=xlUp
            End If
        Next i
    End With
End Sub
All of these answers use VBA. The easiest way to do this is to use a pivot table.
First, select your data, including the header row, and go to Insert -> PivotTable:
Then you will get a dialog box. You don't need to select any of the options here, just click OK. This will create a new sheet with a blank pivot table. You then need to tell Excel what data you're looking for. In this case, you only want the Name of company in the Rows section. On the right-hand side of Excel you will see a new section named PivotTable Fields. In this section, simply click and drag the header to the Rows section:
This will give a result with just the unique names and an entry with (blank) at the bottom:
If you don't want to use the Pivot Table further, simply copy and paste the result rows you're interested in (in this case, the unique company names) into a new column or sheet to get just those without the pivot table attached. If you want to keep the pivot table, you can right click on Grand Total and remove that, as well as filter the list to remove the (blank) entry.
Either way, you now have your list of unique results without blanks and it didn't require any formulas or VBA, and it took relatively few resources to complete (far fewer than any VBA or formula solution).
Here's another method using Excel's built-in Remove Duplicates feature, and a programmed method to remove the blank lines:
I have deleted the code using the above methodology as it takes too long to run. I have replaced it with a method that uses VBA's collection object to compile a unique list of companies.
The first method, on my machine, took about two seconds to run; the method below: about 0.02 seconds.
Sub RemoveDups()
Dim wsSrc As Worksheet, wsDest As Worksheet
Dim rRes As Range
Dim I As Long, S As String
Dim vSrc As Variant, vRes() As Variant, COL As Collection
Set wsSrc = Worksheets("sheet1")
Set wsDest = Worksheets("sheet2")
Set rRes = wsDest.Cells(1, 1)
'Get the source data
With wsSrc
vSrc = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
End With
'Collect unique list of companies
Set COL = New Collection
On Error Resume Next
For I = 2 To UBound(vSrc, 1) 'Assume Row 1 is the header
S = CStr(Trim(vSrc(I, 1)))
If Len(S) > 0 Then COL.Add S, S
Next I
On Error GoTo 0
'Populate results array
ReDim vRes(0 To COL.Count, 1 To 1)
vRes(0, 1) = vSrc(1, 1)
For I = 1 To COL.Count
vRes(I, 1) = COL(I)
Next I
'set results range
Set rRes = rRes.Resize(UBound(vRes, 1) + 1)
'Write the results
With rRes
.Value = vRes
'Uncomment the below line if you want
'.Sort key1:=.Columns(1), order1:=xlAscending, MatchCase:=False, Header:=xlYes
End With
End Sub
NOTE: You wrote you didn't care about the order, but if you want to Sort the results, that added about 0.03 seconds to the routine.
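The idea behind the collection-based routine (read the column once into memory, use a keyed container to reject duplicates, skip blanks, write once) can be sketched in a few lines. This is a Python illustration only; `unique_companies` is a hypothetical name, the sample column is taken from the example table in this thread, and unlike a VBA Collection key the dict lookup here is case-sensitive.

```python
def unique_companies(column):
    """Return the header plus each non-blank name once, in first-seen order."""
    header, *rows = column                 # assume row 1 is the header
    seen = {}                              # dicts keep insertion order
    for cell in rows:
        name = "" if cell is None else str(cell).strip()
        if name and name not in seen:      # skip blanks and repeats
            seen[name] = None
    return [header, *seen]


sheet1 = ["Name of company", "Company 1", "Company 2", "", "Company 3",
          "Company 1", "", "Company 4", "Company 1", "Company 3"]
print(unique_companies(sheet1))
```

Each cell is visited once and each lookup is a hash lookup, which is the same reason the in-memory routine is so much faster than deleting rows on the sheet.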
With two sheets named 1 and 2
Inside sheet named: 1
| | A |
| 1 | Name of company |
| 2 | Company 1 |
| 3 | Company 2 |
| 4 | |
| 5 | Company 3 |
| 6 | Company 1 |
| 7 | |
| 8 | Company 4 |
| 9 | Company 1 |
| 10 | Company 3 |
Result in sheet named: 2
| | A |
| 1 | Name of company |
| 2 | Company 1 |
| 3 | Company 2 |
| 4 | Company 3 |
| 5 | Company 4 |
Use this code in a regular module:
Sub extractUni()
    Dim objDic
    Dim Cell
    Dim Area As Range
    Dim i
    Dim Value
    Set Area = Sheets("1").Range("A2:A10") 'this is where your data is located
    Set objDic = CreateObject("Scripting.Dictionary") 'use a Dictionary!
    For Each Cell In Area
        If Not objDic.Exists(Cell.Value) Then
            objDic.Add Cell.Value, Cell.Address
        End If
    Next Cell
    i = 2 '2 because of the heading
    For Each Value In objDic.Keys
        If Not Value = Empty Then
            Sheets("2").Cells(i, 1).Value = Value 'Store the data in column A below the heading
            i = i + 1
        End If
    Next Value
End Sub
The code returns the data unsorted, in the order in which it appears.
If you want a sorted list, just add this code before the last line:
Dim sht As Worksheet
Set sht = Sheets("2")
With sht.Sort
    .SortFields.Clear
    .SortFields.Add Key:=sht.Range("A1"), Order:=xlAscending
    .SetRange sht.Range("A:A")
    .Header = xlYes
    .MatchCase = False
    .Orientation = xlTopToBottom
    .SortMethod = xlPinYin
    .Apply
End With
This way the result will always be sorted.
(The subroutine would then look like this.)
Sub extractUni()
    Dim objDic
    Dim Cell
    Dim Area As Range
    Dim i
    Dim Value
    Set Area = Sheets("1").Range("A2:A10") 'this is where your data is located
    Set objDic = CreateObject("Scripting.Dictionary") 'use a Dictionary!
    For Each Cell In Area
        If Not objDic.Exists(Cell.Value) Then
            objDic.Add Cell.Value, Cell.Address
        End If
    Next Cell
    i = 2 '2 because of the heading
    For Each Value In objDic.Keys
        If Not Value = Empty Then
            Sheets("2").Cells(i, 1).Value = Value 'Store the data in column A below the heading
            i = i + 1
        End If
    Next Value
    Dim sht As Worksheet
    Set sht = Sheets("2")
    With sht.Sort
        .SortFields.Clear
        .SortFields.Add Key:=sht.Range("A1"), Order:=xlAscending
        .SetRange sht.Range("A:A")
        .Header = xlYes
        .MatchCase = False
        .Orientation = xlTopToBottom
        .SortMethod = xlPinYin
        .Apply
    End With
End Sub
If you have any questions about the code, I will be glad to explain.

How to do a cross join / cartesian product in RavenDB?

I have a web application that uses RavenDB on the backend and allows the user to keep track of inventory. The three entities in my domain are:
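public class Location
{
    public string Id { get; set; }
    public string Name { get; set; }
}
public class ItemType
{
    public string Id { get; set; }
    public string Name { get; set; }
}
public class Item
{
    public string Id { get; set; }
    public DenormalizedRef<Location> Location { get; set; }
    public DenormalizedRef<ItemType> ItemType { get; set; }
}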
On my web app, there is a page for the user to see a summary breakdown of the inventory they have at the various locations. Specifically, it shows the location name, item type name, and then a count of items.
The first approach I took was a map/reduce index on InventoryItems:
this.Map = inventoryItems =>
    from inventoryItem in inventoryItems
    select new
    {
        LocationName = inventoryItem.Location.Name,
        ItemTypeName = inventoryItem.ItemType.Name,
        Count = 1
    };
this.Reduce = indexEntries =>
    from indexEntry in indexEntries
    group indexEntry by new
    {
        indexEntry.LocationName,
        indexEntry.ItemTypeName
    } into g
    select new
    {
        g.Key.LocationName,
        g.Key.ItemTypeName,
        Count = g.Sum(entry => entry.Count),
    };
That is working fine but it only displays rows for Location/ItemType pairs that have a non-zero count of items. I need to have it show all Locations and for each location, all item types even those that don't have any items associated with them.
I've tried a few different approaches but no success so far. My thought was to turn the above into a Multi-Map/Reduce index and just add another map that would give me the cartesian product of Locations and ItemTypes but with a Count of 0. Then I could feed that into the reduce and would always have a record for every location/itemtype pair.
this.AddMap<object>(docs =>
    from itemType in docs.WhereEntityIs<ItemType>("ItemTypes")
    from location in docs.WhereEntityIs<Location>("Locations")
    select new
    {
        LocationName = location.Name,
        ItemTypeName = itemType.Name,
        Count = 0
    });
This isn't working though so I'm thinking RavenDB doesn't like this kind of mapping. Is there a way to get a cross join / cartesian product from RavenDB? Alternatively, any other way to accomplish what I'm trying to do?
EDIT: To clarify, Locations, ItemTypes, and Items are documents in the system that the user of the app creates. Without any Items in the system, if the user enters three Locations "London", "Paris", and "Berlin" along with two ItemTypes "Desktop" and "Laptop", the expected result is that when they look at the inventory summary, they see a table like so:
| Location | Item Type | Count |
| London | Desktop | 0 |
| London | Laptop | 0 |
| Paris | Desktop | 0 |
| Paris | Laptop | 0 |
| Berlin | Desktop | 0 |
| Berlin | Laptop | 0 |
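To make the expected result above concrete, here is a minimal sketch of the zero-filled summary computed in plain Python on the client side. It is not a RavenDB index and does not use any RavenDB API; `inventory_summary` and the representation of items as (location, item type) pairs are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def inventory_summary(locations, item_types, items):
    """Return one (location, item_type, count) row per pair, including zero counts."""
    counts = Counter(items)                # items: iterable of (location, item_type)
    return [(loc, typ, counts[(loc, typ)])
            for loc, typ in product(locations, item_types)]


locations = ["London", "Paris", "Berlin"]
item_types = ["Desktop", "Laptop"]

# With no items every pair is still listed, each with a count of 0.
for row in inventory_summary(locations, item_types, []):
    print(row)
```

The cross product supplies the rows and the counter supplies the numbers, which is the same split the zero-count map is trying to achieve inside the index.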
Here is how you can do this with all the empty locations as well:
AddMap<InventoryItem>(inventoryItems =>
    from inventoryItem in inventoryItems
    select new
    {
        LocationName = inventoryItem.Location.Name,
        Items = new[]{
            new
            {
                ItemTypeName = inventoryItem.ItemType.Name,
                Count = 1}
        }
    });
AddMap<Location>(locations =>
    from location in locations
    select new
    {
        LocationName = location.Name,
        Items = new object[0]
    });
this.Reduce = results =>
    from result in results
    group result by result.LocationName into g
    select new
    {
        LocationName = g.Key,
        Items = from item in g.SelectMany(x=>x.Items)
                group item by item.ItemTypeName into gi
                select new
                {
                    ItemTypeName = gi.Key,
                    Count = gi.Sum(x=>x.Count)
                }
    };

Pre-increment assignment as Row Number to List

I am trying to assign a row number and a set number to each item in a List, but the sets end up containing the wrong number of rows.
var objx = new List<x>();
var i = 0;
var r = 1;
objY.ForEach(x => objx.Add(new x
{
    RowNumber = ++i,
    DatabaseID = x.QuestionID,
    SetID = i == 5 ? r++ : i % 5 == 0 ? r += 1 : r
}));
For the code above, suppose objY contains 23 rows and I want to break those 23 rows into sets of 5.
The code above gives this sequence (considering only RowNumber):
[1 2 3 4 5][6 7 8 9][ 10 11 12 13 14 ].......
which is valid according to the logic.
If I change the logic for SetID to
SetID= i % 5 == 0 ? r += 1 : r
the result is
[1 2 3 4 ][5 6 7 8 9][10 11 12 13 14].
Again, that is the correct output for the code,
but I expected sets of 5:
[1 2 3 4 5][ 6 7 8 9 10].........
What am I missing?
I should have taken my maths class more seriously.
I think you want something like this:
var objX = objY.Select((x, i) => new { ObjX = x, Index = i })
    .GroupBy(x => x.Index / 5)
    .Select((g, i) =>
        g.Select(x => new objx
        {
            RowNumber = x.Index + 1,
            DatabaseID = x.ObjX.QuestionID,
            SetID = i + 1
        }).ToList())
    .ToList();
Note that I'm grouping by x.Index / 5 to ensure that every group has 5 items (the last group may have fewer).
Here's a demo.
It would be very helpful if you could explain your logic.
Where should I start? I'm using LINQ methods to select and group the original list to create a new List<List<ObjX>> where every inner list has a maximum of 5 elements (fewer in the last one if the total count is not divisible by 5).
Enumerable.Select lets you project each element of the input sequence into something new. It is comparable to the loop variable in a loop. In this case I project an anonymous type containing the original object and its index in the list (Select has an overload that incorporates the index). I create this anonymous type to simplify the query and because I need the index later in the GroupBy.
Enumerable.GroupBy lets you group the elements in a sequence by a specified key. This key can be anything that is derivable from the element. Here I'm using the index to build groups with a maximum size of 5:
.GroupBy(x => x.Index / 5)
That works because integer division in C# (or C) always results in an int, with the fractional part truncated (unlike VB.NET, by the way), so 3/4 results in 0. You can use this fact to build groups of the specified size.
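The same arithmetic can be checked quickly in Python, where `//` is the integer-division operator (for the non-negative indexes used here it gives the same result as C#'s truncating division). This is only a sketch: `assign_sets`, the dictionary keys and the sample IDs are hypothetical stand-ins for the classes in the question.

```python
def assign_sets(question_ids, size=5):
    """Give each row a 1-based row number and a 1-based set number of `size` rows."""
    return [
        {"RowNumber": i + 1, "DatabaseID": qid, "SetID": i // size + 1}
        for i, qid in enumerate(question_ids)   # i is the zero-based index
    ]


rows = assign_sets(list(range(101, 124)))       # 23 hypothetical question IDs
print([r["SetID"] for r in rows])
# [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5]
```

Because the set number is derived from the zero-based index rather than from a counter that is mutated while the rows are built, rows 1 to 5 land in set 1, rows 6 to 10 in set 2, and the last three rows in set 5.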
Then I use Select on the groups to create the inner lists, again using the index overload to be able to set the SetID of the group:
.Select((g, i) =>
    g.Select(x => new objx
    {
        RowNumber = x.Index + 1,
        DatabaseID = x.ObjX.QuestionID,
        SetID = i + 1
    }).ToList())
The last step is using ToList on the IEnumerable<List<ObjX>> to create the final List<List<ObjX>>. That also "materializes" the query. Have a look at deferred execution, and especially Jon Skeet's blog, to learn more.