Hadoop sort and group keys differently

I am doing a word count, so the mapper returns key and value pairs
zz 1
zz 1
b 1
c 1
and my reducer adds them up together
b 1
c 1
zz 2
but I want the keys to be sorted by length (decreasing)
zz 2
b 1
c 1
So I create a Comparator for the word length
public static class LengthComparator extends WritableComparator {
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String w1 = a.toString();
        String w2 = b.toString();
        if (w1.length() == w2.length()) return 0;
        return w1.length() > w2.length() ? -1 : 1;
    }
}
then set with
job.setSortComparatorClass(LengthComparator.class);
now my output is this
zz 2
b 2
So I tried this:
public static class LengthComparator extends WritableComparator {
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String w1 = a.toString();
        String w2 = b.toString();
        return w1.length() > w2.length() ? -1 : 1;
    }
}
now my output is
zz 1
zz 1
b 1
c 1
How do I make it such that the keys are grouped by the key but the output is sorted by the length of the word?
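One way to think about the two attempts above: the first comparator returns 0 for any two words of the same length, so different words are treated as the same key, while the second never returns 0, so even identical words are never grouped. A possible fix is to compare by length first and fall back to the natural word order, so that 0 is returned only for identical words. The sketch below shows that idea outside Hadoop, using a plain `java.util.Comparator` and a `TreeMap`, which also treats keys that compare as 0 as one key. The class name `LengthThenAlpha` and the helper `count` are hypothetical, and the assumption that the job groups with the sort comparator when no grouping comparator is set should be checked against your Hadoop version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LengthThenAlpha {
    // Longer words first; words of equal length fall back to natural order,
    // so the comparator returns 0 only for identical words.
    static final Comparator<String> BY_LENGTH_DESC = (a, b) -> {
        int byLength = Integer.compare(b.length(), a.length());
        return byLength != 0 ? byLength : a.compareTo(b);
    };

    // Hypothetical stand-in for shuffle + reduce: sorts and groups the keys
    // with the given comparator and sums a count of 1 per occurrence.
    static List<String> count(List<String> words, Comparator<String> cmp) {
        Map<String, Integer> totals = new TreeMap<>(cmp);
        for (String w : words) {
            totals.merge(w, 1, Integer::sum);
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            lines.add(e.getKey() + " " + e.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints "zz 2", "b 1", "c 1" on separate lines.
        count(Arrays.asList("zz", "zz", "b", "c"), BY_LENGTH_DESC)
                .forEach(System.out::println);
    }
}
```

In a real job the same two-step comparison would go into the `compare` method of the `WritableComparator` subclass.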

Related

Flutter sort List of Objects after two values

I have a List of CustomObjects that I need to sort. The objects should be sorted by their dateTime property first and, if that is the same, by a second property, the compare-property.
I searched for multisort and found this:
medcimentNotificationListData.sort((med1, med2) {
var r = med1.datetime.compareTo(med2.datetime);
if (r != 0) return r;
return med1.mealTimeDescription.compareValue
.compareTo(med2.mealTimeDescription.compareValue);
});
But when I print the list right after sorting, it is not sorted.
medcimentNotificationListData.forEach((medicamentNotificationData) {
print(
'${medicamentNotificationData.title}, order: ${medicamentNotificationData.mealTimeDescription.compareValue}');
});
What am I missing here? Is there an easy way to multisort?
Let me know if you need any more info!

When you call sort(), it repeatedly invokes your comparison function (a, b) { /* your function */ }, which should return a negative number, zero, or a positive number. The function is applied to the list in its existing order.
On the first call, a is the first element of the list and b is the second.
If your function returns a negative value, a belongs before b. The pair stays in that order, and the function is called again with the old b as the new a and the element after the old b as the new b.
If your function returns 0, a and b are considered equal. a stays before b, and the function is called again with the old b as the new a.
If your function returns a positive value, a belongs after b. The function is then called again with the element before the old a as the new a, until the right place for b is found.
The following code shows how this works:
final List<int> list = [1, 0, 3, 4, 2 , 6, 8, 2 , 5];
list.sort((a,b) {
print("a : $a, b : $b");
int result = a.compareTo(b);
print('result : $result \n');
return result;
});
output
a : 1, b : 0
result : 1
a : 1, b : 3
result : -1
a : 3, b : 4
result : -1
a : 4, b : 2
result : 1
a : 3, b : 2
result : 1
a : 1, b : 2
result : -1
a : 4, b : 6
result : -1
a : 6, b : 8
result : -1
a : 8, b : 2
result : 1
a : 6, b : 2
result : 1
a : 4, b : 2
result : 1
a : 3, b : 2
result : 1
a : 2, b : 2
result : 0
a : 8, b : 5
result : 1
a : 6, b : 5
result : 1
a : 4, b : 5
result : -1
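For the two-key ordering from the question, the comparison logic can also be sketched in Java. This is only an illustration of "compare by the first key, fall back to the second on a tie"; the `Reminder` class, its fields and the sample data are hypothetical stand-ins for the question's objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MultiSortSketch {
    // Hypothetical stand-in for the question's notification object.
    static final class Reminder {
        final String title;
        final long dateTime;    // e.g. milliseconds since the epoch
        final int compareValue; // tie-breaker when dateTime is equal

        Reminder(String title, long dateTime, int compareValue) {
            this.title = title;
            this.dateTime = dateTime;
            this.compareValue = compareValue;
        }
    }

    // Primary key: dateTime. Secondary key: compareValue.
    static final Comparator<Reminder> BY_TIME_THEN_VALUE = (a, b) -> {
        int r = Long.compare(a.dateTime, b.dateTime);
        if (r != 0) return r;
        return Integer.compare(a.compareValue, b.compareValue);
    };

    // Sorts a copy of the input and returns the titles in sorted order.
    static List<String> sortedTitles(List<Reminder> input) {
        List<Reminder> copy = new ArrayList<>(input);
        copy.sort(BY_TIME_THEN_VALUE);
        List<String> titles = new ArrayList<>();
        for (Reminder item : copy) {
            titles.add(item.title);
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Reminder> reminders = Arrays.asList(
                new Reminder("lunch", 200L, 2),
                new Reminder("breakfast", 100L, 1),
                new Reminder("snack", 200L, 1));
        System.out.println(sortedTitles(reminders)); // [breakfast, snack, lunch]
    }
}
```

"snack" and "lunch" share the same dateTime, so only the second key decides their order.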

Find position of first non-zero decimal

Suppose I have the following local macro:
loc a = 12.000923
I would like to get the decimal position of the first non-zero decimal (4 in this example).
There are many ways to achieve this. One is to treat a as a string and to find the position of .:
loc a = 12.000923
loc b = strpos(string(`a'), ".")
di "`b'"
From here one could loop through the decimals and count until the first non-zero digit is reached. Of course this doesn't seem to be a very elegant approach.
Can you suggest a better way to deal with this? Regular expressions perhaps?

Well, I don't know Stata, but according to the documentation, \.(0+)? is supported, and it shouldn't be hard to convert this two-line JavaScript function to Stata.
It returns the position of the first nonzero decimal, or -1 if there is no decimal point.
function getNonZeroDecimalPosition(v) {
var v2 = v.replace(/\.(0+)?/, "")
return v2.length !== v.length ? v.length - v2.length : -1
}
Explanation
We remove from input string a dot followed by optional consecutive zeros.
The difference between the lengths of original input string and this new string gives the position of the first nonzero decimal
Demo
Sample Snippet
function getNonZeroDecimalPosition(v) {
var v2 = v.replace(/\.(0+)?/, "")
return v2.length !== v.length ? v.length - v2.length : -1
}
var samples = [
"loc a = 12.00012",
"loc b = 12",
"loc c = 12.012",
"loc d = 1.000012",
"loc e = -10.00012",
"loc f = -10.05012",
"loc g = 0.0012"
]
samples.forEach(function(sample) {
console.log(getNonZeroDecimalPosition(sample))
})
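The same idea (count the zeros that directly follow the dot) can be sketched in Java. Note one deliberate difference from the JavaScript function, which is an assumption of this sketch: the pattern also requires a non-zero digit after the zeros, so an input such as "12.000" yields -1. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstNonZeroDecimal {
    // A dot, then any run of zeros (captured), then the first non-zero digit.
    private static final Pattern LEADING_ZEROS = Pattern.compile("\\.(0*)[1-9]");

    // Returns the 1-based position of the first non-zero decimal digit,
    // or -1 when the input has no non-zero decimal digit.
    static int position(String number) {
        Matcher m = LEADING_ZEROS.matcher(number);
        return m.find() ? m.group(1).length() + 1 : -1;
    }

    public static void main(String[] args) {
        System.out.println(position("12.000923")); // 4
        System.out.println(position("12"));        // -1
        System.out.println(position("0.0012"));    // 3
    }
}
```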

You can do this in mata in one line and without using regular expressions:
foreach x in 124.000923 65.020923 1.000022030 0.0090843 .00000425 {
mata: selectindex(tokens(tokens(st_local("x"), ".")[selectindex(tokens(st_local("x"), ".") :== ".") + 1], "0") :!= "0")[1]
}
4
2
5
3
6
Below, you can see the steps in detail:
. local x = 124.000823
. mata:
: /* Step 1: break Stata's local macro x in tokens using . as a parsing char */
: a = tokens(st_local("x"), ".")
: a
1 2 3
+----------------------------+
1 | 124 . 000823 |
+----------------------------+
: /* Step 2: tokenize the string in a[1,3] using 0 as a parsing char */
: b = tokens(a[3], "0")
: b
1 2 3 4
+-------------------------+
1 | 0 0 0 823 |
+-------------------------+
: /* Step 3: find which values are different from zero */
: c = b :!= "0"
: c
1 2 3 4
+-----------------+
1 | 0 0 0 1 |
+-----------------+
: /* Step 4: find the first index position where this is true */
: d = selectindex(c :!= 0)[1]
: d
4
: end
You can also find the position of the string of interest in Step 2 using the
same logic.
This is the index value after the one for .:
. mata:
: k = selectindex(a :== ".") + 1
: k
3
: end
In which case, Step 2 becomes:
. mata:
:
: b = tokens(a[k], "0")
: b
1 2 3 4
+-------------------------+
1 | 0 0 0 823 |
+-------------------------+
: end
For unexpected cases without decimal:
foreach x in 124.000923 65.020923 1.000022030 12 0.0090843 .00000425 {
if strmatch("`x'", "*.*") mata: selectindex(tokens(tokens(st_local("x"), ".")[selectindex(tokens(st_local("x"), ".") :== ".") + 1], "0") :!= "0")[1]
else display " 0"
}
4
2
5
0
3
6

A straightforward answer uses regular expressions and string functions.
One can select all decimals, find the first non 0 decimal, and finally find its position:
loc v = "123.000923"
loc v2 = regexr("`v'", "^[0-9]*[/.]", "") // 000923
loc v3 = regexr("`v'", "^[0-9]*[/.][0]*", "") // 923
loc first = substr("`v3'", 1, 1) // 9
loc first_pos = strpos("`v2'", "`first'") // 4: position of 9 in 000923
di "`v2'"
di "`v3'"
di "`first'"
di "`first_pos'"
Which in one step is equivalent to:
loc first_pos2 = strpos(regexr("`v'", "^[0-9]*[/.]", ""), substr(regexr("`v'", "^[0-9]*[/.][0]*", ""), 1, 1))
di "`first_pos2'"
An alternative suggested in another answer is to compare the length of the decimals block with its leading 0s stripped against the length of the unstripped block.
In one step this is:
loc first_pos3 = strlen(regexr("`v'", "^[0-9]*[/.]", "")) - strlen(regexr("`v'", "^[0-9]*[/.][0]*", "")) + 1
di "`first_pos3'"

Instead of a regex, this uses log10 (which treats a number like a number). The function will:
For numbers >= 1 or numbers <= -1, return a positive number: the count of digits to the left of the decimal point.
Or (and more specifically to what you were asking), for numbers between -1 and 1, return zero or a negative number whose magnitude is the count of zeros between the decimal point and the first non-zero digit, so the first non-zero digit sits one position further to the right.
const digitsFromDecimal = (n) => {
  // "| 0" truncates toward zero (and turns -Infinity for n = 0 into 0)
  let dFD = Math.log10(Math.abs(n)) | 0;
  if (n >= 1 || n <= -1) { dFD++; }
  return dFD;
}
var x = [118.8161330, 11.10501660, 9.254180571, -1.245501523, 1, 0, 0.864931613, 0.097007836, -0.010880074, 0.009066729];
x.forEach(element => {
console.log(`${element}, Digits from Decimal: ${digitsFromDecimal(element)}`);
});
// Output
// 118.816133, Digits from Decimal: 3
// 11.1050166, Digits from Decimal: 2
// 9.254180571, Digits from Decimal: 1
// -1.245501523, Digits from Decimal: 1
// 1, Digits from Decimal: 1
// 0, Digits from Decimal: 0
// 0.864931613, Digits from Decimal: 0
// 0.097007836, Digits from Decimal: -1
// -0.010880074, Digits from Decimal: -1
// 0.009066729, Digits from Decimal: -2

Pearly's Mata solution is very appealing, but attention should be paid to the "unexpected" case of no decimals at all.
Besides, a regular expression is not a bad choice when it can be written as a memorable one-liner.
loc v = "123.000923"
capture local x = regexm("`v'","(\.0*)")*length(regexs(0))
The code below tests more values of v.
foreach v in 124.000923 605.20923 1.10022030 0.0090843 .00000425 12 .000125 {
capture local x = regexm("`v'","(\.0*)")*length(regexs(0))
di "`v': The wanted number = `x'"
}

Split string of digits into individual cells, including digits within parentheses/brackets

I have a column where each cell has a string of digits, ?, -, and digits in parentheses/brackets/curly brackets. A good example would be something like the following:
3????0{1012}?121-2[101]--01221111(01)1
How do I separate the string into different cells by characters, where a 'character' in this case refers to any number, ?, -, and value within the parentheses/brackets/curly brackets (including said parentheses/brackets/curly brackets)?
In essence, the string above would turn into the following (spaced apart to denote a separate cell):
3 ? ? ? ? 0 {1012} ? 1 2 1 - 2 [101] - - 0 1 2 2 1 1 1 1 (01) 1
The amount of numbers within the parentheses/brackets/curly brackets vary. There are no letters in any of the strings.

Here you are!
RegEx method:
Sub Test_RegEx()
Dim s, col, m
s = "3????0{1012}?121-2[101]--01221111(01)1"
Set col = CreateObject("Scripting.Dictionary")
With CreateObject("VBScript.RegExp")
.Global = True
.Pattern = "(?:\d|-|\?|\(\d+\)|\[\d+\]|\{\d+\})"
For Each m In .Execute(s)
col(col.Count) = m
Next
End With
MsgBox Join(col.items) ' 3 ? ? ? ? 0 {1012} ? 1 2 1 - 2 [101] - - 0 1 2 2 1 1 1 1 (01) 1
End Sub
Loop method:
Sub Test_Loop()
Dim s, col, q, t, k, i
s = "3????0{1012}?121-2[101]--01221111(01)1"
Set col = CreateObject("Scripting.Dictionary")
q = "_"
t = True
k = 0
For i = 1 To Len(s)
t = (t Or InStr(1, ")]}", q) > 0) And InStr(1, "([{", q) = 0
q = Mid(s, i, 1)
If t Then k = k + 1
col(k) = col(k) & q
Next
MsgBox Join(col.items) ' 3 ? ? ? ? 0 {1012} ? 1 2 1 - 2 [101] - - 0 1 2 2 1 1 1 1 (01) 1
End Sub
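For comparison, the pattern from the RegEx method can be sketched in Java with `java.util.regex`. This is only an illustration of the tokenising step; writing the tokens into worksheet cells is left out, and the class name `CellSplitter` is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CellSplitter {
    // One token is a single digit, "-", "?", or a bracketed run of digits.
    private static final Pattern TOKEN =
            Pattern.compile("\\d|-|\\?|\\(\\d+\\)|\\[\\d+\\]|\\{\\d+\\}");

    // Returns the tokens of s in order of appearance.
    static List<String> split(String s) {
        List<String> cells = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            cells.add(m.group());
        }
        return cells;
    }

    public static void main(String[] args) {
        String s = "3????0{1012}?121-2[101]--01221111(01)1";
        // 3 ? ? ? ? 0 {1012} ? 1 2 1 - 2 [101] - - 0 1 2 2 1 1 1 1 (01) 1
        System.out.println(String.join(" ", split(s)));
    }
}
```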

Something else to look at :)
Sub test()
'String to parse through
Dim aStr As String
'final string to print
Dim finalString As String
aStr = "3????0{1012}?121-2[101]--01221111(01)1"
'Loop through string
For i = 1 To Len(aStr)
'The character to look at
char = Mid(aStr, i, 1)
'Check if the character is an opening brace, curly brace, or parenthesis
Dim result As String
Select Case char
Case "["
result = loop_until_end(Mid(aStr, i + 1), "]")
i = i + Len(result)
result = char & result
Case "("
result = loop_until_end(Mid(aStr, i + 1), ")")
i = i + Len(result)
result = char & result
Case "{"
result = loop_until_end(Mid(aStr, i + 1), "}")
i = i + Len(result)
result = char & result
Case Else
result = Mid(aStr, i, 1)
End Select
finalString = finalString & result & " "
Next
Debug.Print (finalString)
End Sub
'Loops through and concatenate to a final string until the end_char is found
'Returns a substring starting from the character after
Function loop_until_end(aStr, end_char)
idx = 1
If (Len(aStr) <= 1) Then
loop_until_end = aStr
Else
char = Mid(aStr, idx, 1)
Do Until (char = end_char)
idx = idx + 1
char = Mid(aStr, idx, 1)
Loop
End If
loop_until_end = Mid(aStr, 1, idx)
End Function

Assuming the data is in column A starting in row 1, and that you want the results to start in column B and go right for each row of data in column A, here is an alternate method using only worksheet formulas.
In cell B1 use this formula:
=IF(OR(LEFT(A1,1)={"(","[","{"}),LEFT(A1,MIN(FIND({")","]","}"},A1&")]}"))),IFERROR(--LEFT(A1,1),LEFT(A1,1)))
In cell C1 use this formula:
=IF(OR(MID($A1,SUMPRODUCT(LEN($B1:B1))+1,1)={"(","[","{"}),MID($A1,SUMPRODUCT(LEN($B1:B1))+1,MIN(FIND({")","]","}"},$A1&")]}",SUMPRODUCT(LEN($B1:B1))+1))-SUMPRODUCT(LEN($B1:B1))),IFERROR(--MID($A1,SUMPRODUCT(LEN($B1:B1))+1,1),MID($A1,SUMPRODUCT(LEN($B1:B1))+1,1)))
Copy the C1 formula right until it starts giving you blanks (there are no more items left to split out from the string in the A cell). In your example, you need to copy it right to column AA. Then you can copy the formulas down for the rest of your column A data.

How can I have all the integers in a string that has a combination of alphanumeric characters using RegEx

For example I have: 1|2|3,4|5|6,7|8|10;
How can I output it like this:
A: 1 2 3
B: 4 5 6
C: 7 8 10
And how can I do this:
Array A = {1,2,3}
Array B = {4,5,6}
Array C = {7,8,10}

var reg = /(\d+)\|(\d+)\|(\d+)[,;]/g;
var str = "1|2|3,4|5|6,7|8|10;";
var index = 0;
// replace() returns a new string, so keep the result
var out = str.replace(reg, function myfun(g, g1, g2, g3) {
  var ch = String.fromCharCode(65 + (index++)); // 65 is "A"
  return ch + ": " + g1 + " " + g2 + " " + g3 + "\n";
});
console.log(out);
For the second case, change the string returned by myfun to:
"Array " + ch + " = {" + g1 + "," + g2 + "," + g3 + "}\n";

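If actual arrays are wanted rather than formatted text, one possible approach is to split the input instead of using capture groups. The sketch below is written in Java and assumes the input always has the shape shown in the question (groups separated by "," or ";", values separated by "|"); the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

class GroupParser {
    // Turns "1|2|3,4|5|6,7|8|10;" into one int[] per group.
    static List<int[]> parse(String input) {
        List<int[]> groups = new ArrayList<>();
        for (String group : input.split("[,;]")) {
            if (group.isEmpty()) continue;
            String[] parts = group.split("\\|");
            int[] values = new int[parts.length];
            for (int i = 0; i < parts.length; i++) {
                values[i] = Integer.parseInt(parts[i].trim());
            }
            groups.add(values);
        }
        return groups;
    }

    // Formats the groups as "A: 1 2 3", "B: 4 5 6", ...
    static List<String> label(List<int[]> groups) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < groups.size(); i++) {
            StringBuilder sb = new StringBuilder();
            sb.append((char) ('A' + i)).append(':');
            for (int v : groups.get(i)) {
                sb.append(' ').append(v);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        label(parse("1|2|3,4|5|6,7|8|10;")).forEach(System.out::println);
    }
}
```
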
Returning list in ANTLR for type checking, language java

I am working with ANTLR to support type checking, and I am stuck at one point. I will try to explain it with an example grammar. Suppose that I have the following:
@members {
private java.util.HashMap<String, String> mapping = new java.util.HashMap<String, String>();
}
var_dec
: type_specifiers d=dec_list? SEMICOLON
{
mapping.put($d.ids.get(0).toString(), $type_specifiers.type_name);
System.out.println("identext = " + $d.ids.get(0).toString() + " - " + $type_specifiers.type_name);
};
type_specifiers returns [String type_name]
: 'int' { $type_name = "int";}
| 'float' {$type_name = "float"; }
;
dec_list returns [List ids]
: ( a += ID brackets*) (COMMA ( a += ID brackets* ) )*
{$ids = $a;}
;
brackets : LBRACKET (ICONST | ID) RBRACKET;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
LBRACKET : '[';
RBRACKET : ']';
In rule dec_list, you will see that I am returning List with ids. However, in var_dec when I try to put the first element of the list (I am using only get(0) just to see the return value from dec_list rule, I can iterate it later, that's not my point) into mapping I get a whole string like
[@4,6:6='a',<17>,1:6]
for an input
int a, b;
What I am trying to do is to get text of each ID, in this case a and b in the list of index 0 and 1, respectively.
Does anyone have any idea?

The += operator creates a List of Tokens, not just the text these Tokens match. You'll need to initialize the List in the @init{...} block of the rule and add the inner text of the tokens yourself.
Also, you don't need to do this:
type_specifiers returns [String type_name]
: 'int' { $type_name = "int";}
| ...
;
simply access type_specifiers's text attribute from the rule you use it in and remove the returns statement, like this:
var_dec
: t=type_specifiers ... {System.out.println($t.text);}
;
type_specifiers
: 'int'
| ...
;
Try something like this:
grammar T;
var_dec
: type dec_list? ';'
{
System.out.println("type = " + $type.text);
System.out.println("ids = " + $dec_list.ids);
}
;
type
: Int
| Float
;
dec_list returns [List ids]
@init{$ids = new ArrayList();}
: a=ID {$ids.add($a.text);} (',' b=ID {$ids.add($b.text);})*
;
Int : 'int';
Float : 'float';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
Space : ' ' {skip();};
If you run the class below, it will print the following to the console:
type = int
ids = [a, b, foo]
The test class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRStringStream("int a, b, foo;"));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.var_dec();
}
}
