Pseudocode to Calculate average using MapReduce

Pseudocode to Calculate average using MapReduce - mapreduce

Hi I want to write a MapReduce algorithm in pseudo code to solve the following problem:
Given input records in the following format:
address, zip, city, house_value,
please calculate the average house value for each zip code.
I would really appreciate if you could help me with this..

The easiest would be to use Apache Pig, here is an example of finding an average:
inpt = load 'data.txt' as (address:chararray, zip:chararray, city:chararray, house_value:long);
grp = group inpt by zip;
average = foreach grp generate FLATTEN(group) as (zip), AVG(inpt.house_value) as average_price;
dump average;
For Pseudo Map Reduce code you would need one MAPPER, COMBINER and a REDUCER
MAPPER(record):
zip_code_key = record['zip'];
value = {1, record['house_value']};
emit(zip_code_key, value);
COMBINER(zip_code_key, value_list):
record_num = 0;
value_sum = 0;
foreach (value : value_list) {
record_num += value[0];
value_sum += value[1];
}
value_out = {record_num, value_sum};
emit(zip_code_key, value_out);
REDUCER(zip_code_key, value_list):
record_num = 0;
value_sum = 0;
foreach (value : value_list) {
record_num += value[0];
value_sum += value[1];
}
avg = value_sum / record_num;
emit(zip_code_key, avg);

Related

How to set proper back test range

I do not know how to code and I am trying to learn Pinescript but it really makes no sense to me so i googled how to set a backtest range and used some code someone else wrote but it doesn't seem to be actually testing the area i would like, it tests the entirety of the chart. I'd like to test from 1/1/2018 to present. I'm trying to do this for multiple strategies so I can better tailor them to the current market. here is wat I have for one of them and if you are willing to help with the others I would very much appreciate it!!! feel free to DM me.
//#version=5
strategy("Bollinger Bands BACKTEST", overlay=true)
source = close
length = input.int(20, minval=1)
mult = input.float(2.0, minval=0.001, maxval=50)
basis = ta.sma(source, length)
dev = mult * ta.stdev(source, length)
upper = basis + dev
lower = basis - dev
buyEntry = ta.crossover(source, lower)
sellEntry = ta.crossunder(source, upper)
if (ta.crossover(source, lower))
strategy.entry("BBandLE", strategy.long, stop=lower, oca_name="BollingerBands", oca_type=strategy.oca.cancel, comment="BBandLE")
else
strategy.cancel(id="BBandLE")
if (ta.crossunder(source, upper))
strategy.entry("BBandSE", strategy.short, stop=upper, oca_name="BollingerBands", oca_type=strategy.oca.cancel, comment="BBandSE")
else
strategy.cancel(id="BBandSE")
//plot(strategy.equity, title="equity", color=color.red, linewidth=2, style=plot.style_areabr)
// === INPUT BACKTEST RANGE ===
fromMonth = input.int(defval = 1, title = "From Month", minval = 1, maxval = 12)
fromDay = input.int(defval = 1, title = "From Day", minval = 1, maxval = 31)
fromYear = input.int(defval = 2018, title = "From Year", minval = 1970)

Fuzzy matching in Google Sheets

Trying to compare two columns in GoogleSheets with this formula in Column C:
=if(A1=B1,"","Mismatch")
Works fine, but I'm getting a lot of false positives:
A.
B
C
MARY JO
Mary Jo
JAY, TIM
TIM JAY
Mismatch
Sam Ron
Sam Ron
Mismatch
Jack *Ma
Jack MA
Mismatch
Any ideas how to work this?

This uses a score based approach to determine a match. You can determine what is/isn't a match based on that score:
Score Formula = getMatchScore(A1,B1)
Match Formula = if(C1<.7,"mismatch",)
function getMatchScore(strA, strB, ignoreCase=true) {
strA = String(strA);
strB = String(strB)
const toLowerCase = ignoreCase ? str => str.toLowerCase() : str => str;
const splitWords = str => str.split(/\b/);
let [maxLenStr, minLenStr] = strA.length > strB.length ? [strA, strB] : [strB, strA];
maxLenStr = toLowerCase(maxLenStr);
minLenStr = toLowerCase(minLenStr);
const maxLength = maxLenStr.length;
const minLength = minLenStr.length;
const lenScore = minLength / maxLength;
const orderScore = Array.from(maxLenStr).reduce(
(oldItem, nItem, index) => nItem === minLenStr[index] ? oldItem + 1 : oldItem, 0
) / maxLength;
const maxKeyWords = splitWords(maxLenStr);
const minKeyWords = splitWords(minLenStr);
const keywordScore = minKeyWords.reduce(({ score, searchWord }, nItem) => {
const newSearchWord = searchWord?.replace(new RegExp(nItem, ignoreCase ? 'i' : ''), '');
score += searchWord.length != newSearchWord.length ? 1: 0;
return { score, searchWord: newSearchWord };
}, { score: 0, searchWord: maxLenStr }).score / minKeyWords.length;
const sortedMaxLenStr = Array.from(maxKeyWords.sort().join(''));
const sortedMinLenStr = Array.from(minKeyWords.sort().join(''));
const charScore = sortedMaxLenStr.reduce((oldItem, nItem, index) => {
const surroundingChars = [sortedMinLenStr[index-1], sortedMinLenStr[index], sortedMinLenStr[index+1]]
.filter(char => char != undefined);
return surroundingChars.includes(nItem)? oldItem + 1 : oldItem
}, 0) / maxLength;
const score = (lenScore * .15) + (orderScore * .25) + (charScore * .25) + (keywordScore * .35);
return score;
}

try:
=ARRAYFORMULA(IFERROR(IF(LEN(
REGEXREPLACE(REGEXREPLACE(LOWER(A1:A), "[^a-z ]", ),
LOWER("["&B1:B&"]"), ))>0, "mismatch", )))

Implementing fuzzy matching via Google Sheets formula would be difficult. I would recommend using a custom formula for this one or a full blown script (both via Google Apps Script) if you want to populate all rows at once.
Custom Formula:
function fuzzyMatch(string1, string2) {
string1 = string1.toLowerCase()
string2 = string2.toLowerCase();
var n = -1;
for(i = 0; char = string2[i]; i++)
if (!~(n = string1.indexOf(char, n + 1)))
return 'Mismatch';
};
What this does is compare if the 2nd string's characters order is found in the same order as the first string. See sample data below for the case where it will return mismatch.
Output:
Note:
Last row is a mismatch as 2nd string have r in it that isn't found at the first string thus correct order is not met.
If this didn't meet your test cases, add a more definitive list that will show the expected output of the formula/function so this can be adjusted, or see player0's answer which solely uses Google Sheets formula and is less stricter with the conditions.
Reference:
https://stackoverflow.com/a/15252131/17842569

The main limitation of traditional fuzzy matching is that it doesn’t take into consideration similarities outside of the strings. Topic clustering requires semantic understanding. Goodlookup is a smart function for spreadsheet users that gets very close to semantic understanding. It’s a pre-trained model that has the intuition of GPT-3 and the join capabilities of fuzzy matching. Use it like vlookup or index match to speed up your topic clustering work in google sheets.
https://www.goodlookup.com/

Looping in Mata with OLS

I need help with looping in Mata. I have to write a code for Beta coefficients for OLS in Mata using a loop. I am not sure how to call for the variables and create the code. Here is what I have so far.
foreach j of local X {
if { //for X'X
matrix XX = [mata:XX = cross(X,1 , X,1)]
XX
}
else {
mata:Xy = cross(X,1 , y,0)
Xy
}
I am getting an error message "invalid syntax".

I'm not sure what you need the loop for. Perhaps you can provide more information about that. However the following example may help you implement OLS in mata.
Load example data from bcuse:
ssc install bcuse
clear
bcuse bwght
mata
x = st_data(., ("male", "parity","lfaminc","packs"))
cons = J(rows(x), 1, 1)
X = (x, cons)
y = st_data(., ("lbwght"))
beta_hat = (invsym(X'*X))*(X'*y)
e_hat = y - X * beta_hat
s2 = (1 / (rows(X) - cols(X))) * (e_hat' * e_hat)
B = J(cols(X), cols(X), 0)
n = rows(X)
for (i=1; i<=n; i++) {
B =B+(e_hat[i,1]*X[i,.])'*(e_hat[i,1]*X[i,.])
}
V_robust = (n/(n-cols(X)))*invsym(X'*X)*B*invsym(X'*X)
se_robust = sqrt(diagonal(V_robust))
V_ols = s2 * invsym(X'*X)
se_ols = sqrt(diagonal(V_ols))
beta_hat
se_robust
end
This is far from the only way to implement OLS using mata. See the Stata Blog for another example using quadcross, I like my example because it preserves a little more of the matrix algebra in the code.

Multiplication of two variables for linear problems in glpk (gusek)

I'm trying to implemenet an assignment problem. I have the following problem when trying to multiply two variables in linear programming (using glpk gusek) in my goal function:
minimize PATH_COST: sum{k in Rodzaj_Transportu}(sum{z in numery_Zlecen}Koszty_Suma[k,z])*y[k,z]; #y is a binary variable; Koszty_Suma is total cost for ordez z and car type k
The following error is arising: "model.mod:47: multiplication of linear forms not allowed".
Code (.dat file):
data;
set numery_Zlecen := 1, 2, 3; #order numbers
set Miasta := '*some data: *' #cities.
#order numer (from city to city)
set Zlecenie[1] := Warszawa Paris;
set Zlecenie[2] := Berlin Praha;
set Zlecenie[3] := Praha Amsterdam;
#number of packages for transport for a particular order
param Ilosc_Wyrobow :=
1 10
2 50
3 110;
param Godziny_Pracy := 9; #number of working hours during the day
param Pojemnosc_Samochodu := 35; #capacity of the car (how many packages it can take)
param Srednia_Predkosc := 80; #average car speed
param Spalenie_Paliwa := 0.25; #fuel combustion
param Wynagrodzenie_za_Godzine := 20; #salary for one working hour
param Cena_Noclegu := 100; #price of accommodation
param Dystans: '*some data: *' #km between cities.
param Koszt_Paliwa : '*some data: *' #fuel consumption depends on country.
end;
Code (.mod file):
#INDEXY
#=====================================================================
set Miasta; #i,j
set numery_Zlecen; #z
set Zlecenie{numery_Zlecen} dimen 2; #p,q
set Rodzaj_Transportu; #k
#PARAMETRY
#=====================================================================
param Dystans {Miasta,Miasta};
param Ilosc_Wyrobow{numery_Zlecen};
param Godziny_Pracy >= 0;
param Pojemnosc_Samochodu {Rodzaj_Transportu}>= 0;
param Srednia_Predkosc >=0;
param Spalenie_Paliwa >=0;
param Koszt_Paliwa {Miasta,Miasta};
param Wynagrodzenie_za_Godzine >= 0;
param Cena_Noclegu >= 0;
#ZMIENE
#=====================================================================
var x{Miasta,Miasta,numery_Zlecen} <= 1, >= 0; #variable x equal 1 when we're going the path from city A to city B; otherwise it equals 0
var y{Rodzaj_Transportu,numery_Zlecen} binary <=1, >=0; #variable that shows what types of car/s we are using for order (can be 0 or 1)
var Koszty_Suma{Rodzaj_Transportu,numery_Zlecen}; #total costs
var Koszty_Transportu{numery_Zlecen}; #transport costs
var Koszty_Odpoczynku{numery_Zlecen}; #rest costs
var Koszty_Wynagrodzenia{numery_Zlecen}; #salary costs
#FUNKCjA CELU
#=====================================================================
minimize PATH_COST: sum{k in Rodzaj_Transportu}(sum{z in numery_Zlecen}Koszty_Suma[k,z])*y[k,z];
#OGRANICZENIA (constraints)
#=====================================================================
s.t. SOURCE{z in numery_Zlecen, (p,q) in Zlecenie[z], i in Miasta: i = p && p != q}:
sum {j in Miasta} (x[i ,j ,z ]) - sum {j in Miasta}( x[j ,i ,z ]) = 1;
s.t. INTERNAL {z in numery_Zlecen, (p,q) in Zlecenie[z],i in Miasta: i != p && i != q && p != q }:
sum {j in Miasta} (x[i ,j ,z ]) - sum {j in Miasta}( x[j ,i ,z ]) = 0;
s.t. OGR_KM_DZIEN{z in numery_Zlecen,(p,q) in Zlecenie[z], j in Miasta, i in Miasta: i != q}:
if (Dystans[i,j] > (Godziny_Pracy*Srednia_Predkosc)) and i != q then x[i,j,z] = 0;
s.t. OGR_KOSZTY_SUMA{z in numery_Zlecen, k in Rodzaj_Transportu}:
Koszty_Suma[k,z] = (Koszty_Transportu[z] + Koszty_Odpoczynku[z] + Koszty_Wynagrodzenia[z])*ceil(Ilosc_Wyrobow[z]/Pojemnosc_Samochodu[k]);
s.t. OGR_KOSZTY_TRANSPORTU{z in numery_Zlecen}:
Koszty_Transportu[z] = (sum{i in Miasta} (sum{j in Miasta} ( Dystans[i,j]*x[i,j, z]*Koszt_Paliwa[i,j] ) ))*Spalenie_Paliwa;
s.t. OGR_KOSZTY_ODPOCZYNKU{z in numery_Zlecen}:
Koszty_Odpoczynku[z] =
(sum{i in Miasta} (sum{j in Miasta} ( Dystans[i,j]*x[i,j, z] ) ))/(Godziny_Pracy*Srednia_Predkosc) * Cena_Noclegu;
s.t. OGR_KOSZTY_WYNAGRODZENIA{z in numery_Zlecen}:
Koszty_Wynagrodzenia[z] =
((sum{i in Miasta} (sum{j in Miasta} ( Dystans[i,j]*x[i,j, z] ) ))/(Srednia_Predkosc)) * Wynagrodzenie_za_Godzine;
s.t. OGR_Y_JEDEN{z in numery_Zlecen}:
sum{k in Rodzaj_Transportu}(y[k,z]) = 1;
solve;
How is it possible to get rid of this error? Any hints how to solve this kind of problem are welcome.

First I think the parentheses are incorrect (note that y[k,z] depends on z). The expression
sum{k in Rodzaj_Transportu}(sum{z in numery_Zlecen}Koszty_Suma[k,z])*y[k,z];
is not mathematically correct. So, I assume what you meant is:
sum{k in Rodzaj_Transportu}(sum{z in numery_Zlecen}Koszty_Suma[k,z]*y[k,z]);
Let me restate the problem a little bit. I assume we can write this as:
sum((i,j), x[i,j]*y[i,j])
with y a binary variable and x a continuous variable. I also assume 0 <= x[i,j] <= U[i,j]. (U is an upper bound).
Here is a way to linearize this quadratic term. We can introduce a variable z[i,j]=x[i,j]*y[i,j] using the following inequalities:
z[i,j] <= U[i,j]*y[i,j]
z[i,j] <= x[i,j]
z[i,j] >= x[i,j]-U[i,j]*(1-y[i,j])
0 <= z[i,j] <= U[i,j]
Now you just can minimize sum((i,j),z[i,j]). For a similar linearization see link.

Calculating value weighted returns in SAS

I have some data in the following format:
COMPNAME DATA CAP RETURN
I have found some code that will construct and calculate the value-weighted return based on the data.
This works great and is below:
PROC SUMMARY NWAY DATA = Data1 ; CLASS DATE ;
VAR RETURN / WEIGHT = CAP ;
OUTPUT
OUT = MKTRET
MEAN (RETURN) = MONTHLYRETURN
RUN;
The extension that I would like to make is in my head a little bit complicated.
I want to make the weights based on the market capitalization in June.
So this will be a buy and hold portfolios. The actual data has 100's of companies but to give a representative example for two companies with the sole explanation of how the weights will evolve...
Say for example I have two companies, A and B.
The CAP of A is £100m and B is £100m.
In July of one year, I would invest 50% in A and 50% in B.
The returns in July are 10% and -10%.
Therefore I would invest 55% and 45%.
It will go on like this until next June when I will re-balance again based on the market capitalisation...

10% monthly return is pretty speculative!
When the two companies differ by more than 200 you will need to also sell and buy to equalize the companies.
Presume the rates per month are simulated and stored in a data set. You can generate a simulated ledger as follows
add returns
compare balances
equalize by splitting 200 investment if balances are close enough
equalize by investing all 200 in one and selling and buying
Of course, a portfolio with more than 2 companies becomes a more complicated balancing act to achieve mathematical balance.
data simurate(label="Future expectation is not an indicator of past performance :)");
do month = 1 to 60;
do company = 1 to 2;
return = round (sin(company+month/4) / 12, 0.001); %* random return rate for month;
output;
end;
end;
run;
data want;
if 0 then set simurate;
declare hash lookup (dataset:'simurate');
lookup.defineKey ('company', 'month');
lookup.defineData('return');
lookup.defineDone();
month = 0;
bal1 = 0; bal2 = 0;
output;
do month = 1 to 60;
lookup.find(key:1, key:month); rate1 = return;
ret1 = round(bal1 * rate1, 0.0001);
lookup.find(key:2, key:month); rate2 = return;
ret2 = round(bal1 * rate2, 0.0001);
bal1 + ret1;
bal2 + ret2;
goal = mean(bal1,bal2) + 100;
sel1 = 0; buy1 = 0;
sel2 = 0; buy2 = 0;
if abs(bal1-bal2) <= 200 then do;
* difference between balances after returns is < 200;
* balances can be equalized simple investment split;
inv1 = goal - bal1;
inv2 = goal - bal2;
end;
else if bal1 < bal2 then do;
* sell bal2 as needed to equalize;
inv1 = 200;
inv2 = 0;
buy1 = goal - 200 - bal1;
sel2 = bal2 - goal;
end;
else do;
inv2 = 200;
inv1 = 0;
buy2 = goal - 200 - bal2;
sel1 = bal1 - goal;
end;
bal1 + (buy1 - sel1 + inv1);
bal2 + (buy2 - sel2 + inv2);
output;
end;
stop;
drop company return ;
format bal: 10.4 rate: 5.3;
run;

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Pseudocode to Calculate average using MapReduce - mapreduce

Hi I want to write a MapReduce algorithm in pseudo code to solve the following problem: Given input records in the following format: address, zip, city, house_value, please calculate the average house value for each zip code. I would really appreciate if you could help me with this..

Related

How to set proper back test range

Fuzzy matching in Google Sheets

Looping in Mata with OLS

Multiplication of two variables for linear problems in glpk (gusek)

Calculating value weighted returns in SAS

Categories

Resources