CouchDB unexpected reduce/rereduce behaviour - mapreduce

I'm assuming I'm looking at the problem incorrectly, but I seem to be hitting rereduce unexpectedly.
A simplified example of my problem would be a student report card
Data (in table format for easy reading)
Entered Student Assig Grade
2019-02-01 Alice 1 0.80
2019-03-01 Alice 2 0.50
2019-04-01 Alice 2 0.80
2019-04-01 Alice 3 0.80
The story that goes with the data is that Alice is a good student, but the instructor fat-fingered the data entry. Alice pointed it out to the instructor, who then put an amendment into the grade report. For accounting reasons, entries may not be deleted, but updated entries may be added.
The Map:
function(doc){
var key = [doc.student,doc.assig];
var val = {
grade:doc.grade,
entered: doc.entered
};
emit(key,val);
}
There are then two objectives when reducing:
Take the most recently entered grade
Gather up stats on the grades
function(keys,values,rereduce){
if(rereduce){
return sum(values);
}
else{
value = values.pop();
for(var v in values){
v = values[v];
if(v.entered > value.entered){
value = v;
}
}
return value.grade;
}
}
The minute I apply group=true, I'm expecting to get a list of the most recent grades for each assignment. Instead, I'm getting sums of all the grades.
Key Actual Expected
["Alice",1] 0.8 0.8
["Alice",2] 1.3 0.8
["Alice",3] 0.8 0.8
Oddly, further reducing:
Key Actual Expected
["Alice"] 2.4 2.4
I'm confused. How have I confused myself?
(CouchDB v2.3)
EDIT
OK, so I understand how I've got it wrong (overflowing indexes), but now I'm wondering how to make it right...
http://guide.couchdb.org/draft/views.html#reduce
I'm also still very confused about the group_level 1 & 2 behaviour.

I suspect what is going on is that I am using the wrong tool for the job. Reduces in CouchDB work best when they have one job.
The solution used in the end was to create an update function that simply prepends the new grade to an array on the existing document.
Heavily based on an RTFM moment: https://docs.couchdb.org/en/stable/ddocs/views/collation.html
Student Assig Grades
Alice 1 [{e:'2019-02-01',g:0.80}]
Alice 2 [{e:'2019-04-01',g:0.80},{e:'2019-03-01',g:0.50}]
Alice 3 [{e:'2019-04-01',g:0.80}]
The Map changes to...
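function(doc){
var key = [doc.student,doc.assig];
// grades[0] is the most recent entry, because new grades are prepended;
// reading it by index avoids mutating doc the way shift() would
var val = doc.grades[0].g;
emit(key,val);
}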
... and the reduce can use the built-in reduce functions.
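As a rough illustration of this design, here is a sketch in Python rather than CouchDB JavaScript. The names add_grade, map_latest and stats are hypothetical, and stats only mimics the fields that CouchDB's built-in _stats reduce reports. Because every amendment is prepended, the map only ever has to look at the first array element:

```python
def add_grade(doc, entered, grade):
    """Mimic the update function: prepend the newest entry to doc['grades']."""
    doc.setdefault("grades", []).insert(0, {"e": entered, "g": grade})
    return doc

def map_latest(doc):
    """Mimic the map: emit ([student, assig], most recent grade)."""
    return ([doc["student"], doc["assig"]], doc["grades"][0]["g"])

def stats(values):
    """Mimic the fields reported by the built-in _stats reduce."""
    return {
        "sum": sum(values),
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "sumsqr": sum(v * v for v in values),
    }

docs = [
    {"student": "Alice", "assig": 1},
    {"student": "Alice", "assig": 2},
    {"student": "Alice", "assig": 3},
]
add_grade(docs[0], "2019-02-01", 0.80)
add_grade(docs[1], "2019-03-01", 0.50)   # the mistyped grade
add_grade(docs[1], "2019-04-01", 0.80)   # the amendment, prepended
add_grade(docs[2], "2019-04-01", 0.80)

rows = [map_latest(d) for d in docs]
summary = stats([value for _, value in rows])
```

The mistyped 0.50 stays in the document as the second array element, so the history is kept while the aggregate only sees the latest grade.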
The only thing lost is that modifications to the data are not as apparent; however, it is simple to create a view to report on that:
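function(doc){
// iterate by index: for...in over an array yields indexes, not entries
for(var i = 0; i < doc.grades.length; i++){
var entry = doc.grades[i];
var key = [doc.student,doc.assig,entry.e];
emit(key,entry.g);
}
}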
Now the users have access to both the detailed listing of changes, as well as the aggregated results.
The code above is untested, as it's a trivialized example of a problem. Conceptually, it's correct; syntactically, I expect issues.

Number of unique students that completed all 10 courses

I have a little formula problem that I would really appreciate some help with.
The list has columns with Student names that repeat, Course names that repeat, and course status that can be passed, not passed, or not started.
I would like to count the number of unique students that passed all 10 courses that are available.
I tried different variations of Calculate and COUNTROWS.
This is the formula I have at the moment that doesn't work
PassedAll =CALCULATE(DISTINCTCOUNT(Progress[Student]),Progress[Mark]="Passed",Progress[Course]="Course1"&&Progress[Course]="Course2")
I understand that && doesn't work in this scenario because in a single row it cannot be both courses. And I don't want to replace it with an OR, || operator because I want to count students that have Passed marks on each of these courses.
Can someone please recommend how to somehow replace the course section of the filter with something that will include all 10 courses?
If you want only number to show in "Card Visualization" then:
StudentPassed = countrows(filter(GENERATE(VALUES(Sheet1[Student]), ROW("CoursCompleted", CALCULATE( DISTINCTCOUNT(Sheet1[Course]), Sheet1[Mark] ="Passed"))), [CoursCompleted]= 10))
In my sample data, 1 student passed all courses, 1 student passed 9 courses, and 1 student passed 8 (with no record for the other 2 courses).
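The logic of that measure (count the distinct passed courses per student, then keep the students whose count is 10) can be sketched outside DAX. The Python below is only an illustration: the student names and rows are made up, and TOTAL_COURSES = 10 is an assumption taken from the question.

```python
from collections import defaultdict

TOTAL_COURSES = 10  # assumption: exactly ten courses are on offer

def count_students_passed_all(rows, total=TOTAL_COURSES):
    """rows: iterable of (student, course, mark) tuples."""
    passed = defaultdict(set)
    for student, course, mark in rows:
        if mark == "Passed":
            passed[student].add(course)   # a set ignores repeated rows
    # keep only students whose distinct passed-course count equals the total
    return sum(1 for courses in passed.values() if len(courses) == total)

courses = [f"Course{i}" for i in range(1, 11)]
rows = [("Ann", c, "Passed") for c in courses]              # passed all ten
rows += [("Bob", c, "Passed") for c in courses[:9]]         # passed nine
rows += [("Bob", "Course10", "Not passed")]
rows += [("Cid", c, "Passed") for c in courses[:8]]         # no record for two
result = count_students_passed_all(rows)
```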

How to Properly Sort Flutter List with numbers and alphabet

I have a list in flutter which I want to sort like this
List<SearchUserResult> userSearchList = [
SearchUserResult(name: "Bright Isaac", age: 27, price: "10000"),
SearchUserResult(name: "Bright John", age: 7, price: "5000"),
SearchUserResult(name: "Phil Isaac", age: 20, price: "Negotiable"),
SearchUserResult(name: "Sunday", age: 16, price: "400")
];
What I am trying to do is to get the highest price and the lowest price:
userSearchList.sort((a, b) => a.price.compareTo(b.price)); //Get highest price
userSearchList.sort((b, a) => a.price.compareTo(b.price)); // Get lowest price
But when I sort for the lowest price, I want only numeric prices such as 10000 to be considered. As you can see, the list contains a Negotiable price, and I don't want that entry to appear when sorting by number. If possible, I would also like to be able to show only the Negotiable entries, without the numeric ones. How do I go about this?
Assuming you don't have the capability to do this on the server-side and just want to do this in client-side code, it shouldn't be too difficult.
If you really don't care about doing it in an optimal way, you could write something like this:
class PriceSearchUser {
// parsed price; the constructor itself must not be marked final
final num price;
final SearchUserResult sur;
PriceSearchUser(this.price, this.sur);
}
...
final priceSearchUsers = userSearchList
// create new list of price-UserSearchResult, with price in a sortable value (num)
.map((sur) => PriceSearchUser(num.tryParse(sur.price), sur))
// filter out any where the price wasn't able to be parsed into a number
.where((psu) => psu.price != null)
// make it in to a list so you can sort it
.toList();
final sortedSearchUserResults = (priceSearchUsers
// sort that list! (note the double '.' - the cascade makes the expression
// evaluate to the list itself; the parentheses end the cascade so that the
// next call applies to the sorted list rather than to sort's void result)
..sort((a,b) => a.price.compareTo(b.price)))
// get back to your original objects
.map((psu) => psu.sur)
// (optional) make it back into a list rather than an iterable
.toList();
If you're only doing this on a few users that should be adequate; if you're doing it on a whole bunch then you might want to think about optimizing or doing it in an isolate.
You can do something very similar with a "where" clause to get only the "Negotiable" prices.
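The same split can be sketched in a language-neutral way. The Python below is an illustration, not Dart: parse_price is a hypothetical helper, it assumes prices are whole numbers stored as strings, and it treats every price that does not parse as "Negotiable".

```python
def parse_price(text):
    """Return the price as an int, or None if it is not a plain number."""
    try:
        return int(text)
    except ValueError:
        return None

users = [
    {"name": "Bright Isaac", "age": 27, "price": "10000"},
    {"name": "Bright John", "age": 7, "price": "5000"},
    {"name": "Phil Isaac", "age": 20, "price": "Negotiable"},
    {"name": "Sunday", "age": 16, "price": "400"},
]

# pair every user with a sortable price (or None)
priced = [(parse_price(u["price"]), u) for u in users]

# numeric prices only, lowest first
numeric = sorted((p for p in priced if p[0] is not None), key=lambda p: p[0])
lowest_first = [u for _, u in numeric]

# the complementary filter: only the entries without a numeric price
negotiable = [u for price, u in priced if price is None]
```

Reversing lowest_first gives the highest price first without parsing again.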
However, as one of the comments mentioned, you should probably think about how exactly you do this. You say that you have a lot of code now but I guarantee you'll have more code later the next time this becomes a problem.
Also, this doesn't seem like a super-serious money app, but if it were you always need to be very careful about how you treat your monetary values. Doubles sometimes round to weird things! You could use something like this package.
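To make the rounding concern concrete, here is a small Python illustration (not Dart, and not tied to any particular package). It compares binary floating point with a decimal type and with integer minor units such as cents; the amounts are arbitrary.

```python
from decimal import Decimal

float_total = 0.1 + 0.2                               # binary floating point
decimal_total = Decimal("0.10") + Decimal("0.20")     # exact decimal arithmetic
cents_total = 10 + 20                                 # integer minor units (cents)

float_is_exact = float_total == 0.3                   # False: rounding error
decimal_is_exact = decimal_total == Decimal("0.30")   # True
cents_is_exact = cents_total == 30                    # True
```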

How to write loop across Hierarchical Data (household-individual) in stata?

I'm now working on a household survey data set and I'd like to give certain members extra IDs according to their relationship to the household head. More specifically, I need to identify the adult children of household head and his/her spouse, if married, and assign them "sub-household IDs".
The variables are: hhid - household ID; pid -individual ID; relhead - relationship with head.
Regarding relhead, a 1 represents the head, a 6 represents a child, and a 7 represents a child-in-law. Below some example data, including in the last column the desired outcome. I assume that whenever a 6 is followed by a 7, they constitute a couple and belong to the same sub-household.
hhid pid relhead sub_hhid(desired)
50 1 1 1
50 2 3 1
50 3 6 2
50 4 6 3
50 5 7 3
-----------------------------------------------
67 1 1 1
67 3 6 2
67 4 7 2
Here are some thoughts:
There may be married and unmarried adult children within one household, the family structure is a little bit complicated, so I want to write some loop across the members in a household.
The basic idea is that in the outer loop we identify the children staying at home and then check whether a spouse is present. If there is, we give the couple an indicator; if not, we continue and give the single stay-at-home child another indicator. After walking through all the possible members within a household, we get a series of within-household IDs. To facilitate further analysis, I need some kind of external ID variable to separate the sub-families.
* Define N as the total number of household, n as number of individual household size
* sty_chil is indicator for adult child who living with parents(head)
* sty_chil_sp is adult child's spouse
* "hid" and "ind_id" are local macros
forvalue hid=1/N {
forvalue ind_id= 1/n {
if sty_chil[`ind_id']==1 {
check if sty_chil_sp[`ind_id+1']==1 {
if yes then assign sub_hhid to this couples *a 6-7 pairs,identifid as couple
}
else { * single 6 identifid as single child
assign sub_hhid to this child
}
else { *Other relationships rather than 6, move forward
++ind_id the members within a household
}
++hid *move forward across households
}
The built-in Stata by, sort: is pretty powerful, but here I want to treat only the family members who meet a certain criterion and leave the others untouched, so an if-else type loop is more natural for me (even if by: may achieve my goal, it's always too tactful when the situation becomes not so simple, and we cannot exhaust all the possible household patterns).
An immediate problem is that I don't know how to write a loop across household IDs and individual IDs, because I used to obtain the household size (the increment of the outer loop) using the by command (I'm not sure whether in this case it is 1 or the number of family members), and I'm not sure whether mixing by and if loops is good programming practice, so I favor writing a "full loop" in this case. Please give me some clues on how to achieve my goal and provide (illustrative) pseudocode.
An extra question is I cannot find the ado file which contains the content of by command, does it exist?
I will abstract from the issue of whether the assumption used to create matches is a sensible one or not. Rather, let this be an example of reaching the desired results without using explicit loops. Some logic and the use of subscripting (see help subscripting) can get you far.
clear
set more off
*----- example data -----
input ///
hhid pid relhead sub_hhid
50 1 1 1
50 3 6 2
50 4 6 3
50 5 7 3
67 1 1 1
67 3 6 2
67 4 7 2
67 5 6 3
end
list, sepby(hhid)
*----- what you want -----
bysort hhid (pid): gen hhid2 = sum( !(relhead == 7 & relhead[_n-1] == 6) )
list, sepby(hhid)
As you can see, one line of code gets you there. The reasoning is the following:
sum() gives the running sum. The arguments to sum(), being conditions, can either be True or False. The ! denotes the logical not (see help operators).
If it is not the case that the relationship is daughter/son-in-law AND the previous relationship is daughter/son, the condition evaluates to True and takes on the value of 1, increasing the running sum by 1. If it evaluates to False, meaning that the relationship is daughter/son-in-law AND the previous relationship is daughter/son, then it takes on the value of 0 and the running sum will not increase. This gives the result you seek.
You do this using the by: prefix, since you want to check each original household independently, so to speak.
For the first observation of each original household, the condition always evaluates to True. This is because there exists no "previous" observation (relationship), and Stata considers relhead to be missing (., a very large number) and therefore, not equal to 6. This takes the running sum from 0 to 1 for the first observation of each sub-group, and so on.
Bottom line: learn how to use by: and take advantage of the features offered by Stata. Do not swim against the current; not here.
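For readers who think in explicit loops, the running-sum logic above can be spelled out in Python. This is only a sketch of what the by: line computes, not a replacement for it; sub_household_ids is a hypothetical name, and None stands in for Stata's missing value.

```python
from itertools import groupby

def sub_household_ids(rows):
    """rows: (hhid, pid, relhead) tuples. Returns one ID per row, in sorted order."""
    result = []
    ordered = sorted(rows)                        # like: bysort hhid (pid)
    for _, members in groupby(ordered, key=lambda r: r[0]):
        running, previous = 0, None               # no previous member at the start
        for _, _, relhead in members:
            is_spouse_of_previous = relhead == 7 and previous == 6
            running += not is_spouse_of_previous  # True adds 1, False adds 0
            result.append(running)
            previous = relhead
    return result

data = [
    (50, 1, 1), (50, 3, 6), (50, 4, 6), (50, 5, 7),
    (67, 1, 1), (67, 3, 6), (67, 4, 7), (67, 5, 6),
]
ids = sub_household_ids(data)
```

The counter restarts for every household, which is what the by: prefix does for the running sum.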
Edit
Please note that instead of progressively changing your example data set, you should provide a representative example from the beginning. Not doing so can render answers that are initially OK, completely inadequate.
For your modified example, add:
replace hhid2 = 1 if !inlist(relhead,6,7)
That will simply assign anyone not 6 or 7 to the same household as the head. The head is assumed to always have hhid2 == 1. If the head can have hhid2 != 1, then
bysort hhid (relhead): replace hhid2 = hhid2[1] if !inlist(relhead,6,7)
should work.
You can follow with:
bysort hhid (pid): replace hhid2 = hhid2[_n-1] + 1 if hhid2 != hhid2[_n-1] & _n > 1
but because they are IDs, it's not really necessary.
Finally, use:
gen hhid3 = string(hhid) + "_" + string(hhid2)
to create IDs with the form 50_1, 50_2, 50_3, etc.
Like I said before, if your data presents more complications, you should present a relevant example.

Generating rolling z-scores of panel data in Stata

I have an unbalanced panel data set (countries and years). For simplicity let's say I have one variable, x, that I am measuring. The panel data are sorted first by country (a 3-digit numeric country-code) and then by year. I would like to write a .do file that generates a new variable, z_x, containing the standardized values of the variable x. The variables should be standardized by subtracting the mean from the preceding (exclusive) m time periods, and then dividing by the standard deviation from those same time periods. If this is not possible, return a missing value.
Currently, the code I am using to accomplish this is the following (edited now for clarity)
xtset weocountrycode year
sort weocountrycode year
local win_len = 5 // Defining rolling window length.
quietly: rolling sd_x=r(sd) mean_x=r(mean), window(`win_len') saving(stats_x, replace): sum x
use stats_x, clear
rename end year
save, replace
use all_data_PROCESSED_FINAL.dta, clear
quietly: merge 1:1 (weocountrycode year) using stats_x
replace sd_x = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n] // This and next line are for deleting values that rolling calculates when I actually want missing values.
replace mean_`x' = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n]
gen z_`x' = (`x' - mean_`x'[_n-1])/sd_`x'[_n-1] // calculate z-score
UPDATE:
My struggle with rolling is that when rolling is set up to use a window length 5 rolling mean, it automatically does window length 1,2,3,4 means for the first, second, third and fourth entries (when there are not 5 preceding entries available to average out). In fact, it does this in general - if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ..... and then finally start doing length 5 moving averages on entry 9. My issue is that I do not want this, so I would like to avoid performing these calculations. Until now, I have only been able to figure out how to delete them after they are done, which is both inefficient and bothersome.
I tried adding an if clause to the -rolling- statement:
quietly: rolling sd_x=r(sd) mean_x=r(mean) if x[_n-`win_len'+1] != . & weocountrycode[_n-`win_len'+1] != weocountrycode[_n], window(`win_len') saving(stats_x, replace): sum x
But it did not fix the problem and the output is "weird" in the sense that
1) If `win_len' is equal to, say, 10, there are 15 missing values in the resulting z_x variable, instead of 9.
2) Even though there are "extra" missing values in z_x, the observations still start out as window length 1 means, then window length 2 means, etc. which makes no sense to me.
Which leads me to believe I fundamentally don't understand 1) what -rolling- is doing and 2) how an if clause works in the context of -rolling-.
Does this help?
Thanks!
I'm not sure I understand completely but I'll try to answer based on what I think your problem is, and based on a comment by @NickCox.
You say:
... when rolling is set up to use a window length 5 rolling mean...
if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ...
This is expected. help rolling states:
The window size refers to calendar periods, not the number of observations. If there are missing data (for example, because of weekends), the actual number of observations used by command may be less than window(#).
It's not actually doing a "length 1 rolling average", but I get to that later.
Below some examples to see what rolling does:
clear all
set more off
*-------------------------- example data -----------------------------
set obs 92
gen dat = _n - 1
format dat %tq
egen seq = fill(1 1 1 1 2 2 2 2)
tsset dat
tempfile main
save "`main'"
list in 1/12, separator(4)
*------------------- Example 1. None missing ------------------------
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------- Example 2. All but one value, missing in first window ------
use "`main'", clear
replace seq = . in 1/3
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------------- Example 3. All missing in first window --------------
use "`main'", clear
replace seq = . in 1/4
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
Note I use the stepsize option to make things much easier to follow. Because the date variable is in quarters, I set window(4) and stepsize(4) so rolling is just computing averages by year. I hope that's easy to see.
Example 1 does as expected. No problem here.
Example 2 on the other hand, should be more interesting for you. We've said that what matters are calendar periods, so the mean is computed for the whole year (four quarters), even though it contains missings. There are three missings and one non-missing. summarize is computing the mean over the whole year, but summarize ignores missings, so it just outputs the mean of non-missings, which in this case is just one value.
Example 3 has missings for all four quarters of the year. Therefore, summarize outputs . (missing).
Your problem, as I understand it, is that when you face a situation like Example 2, you'd like the output to be missing. This is where I think Nick Cox's advice comes in. You could try something like:
rolling mean=r(mean) N=r(N), window(4) stepsize(4) clear: summarize seq, detail
replace mean = . if N != 4
list in 1/12, separator(0)
This says: if the number of non-missings for the window (r(N), also computed by summarize), is not the same as the window size, then replace it with missing.
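The same "only accept complete windows" rule can be applied to the z-score from the question. The Python below is a sketch of the intended calculation, not of what rolling does: rolling_z is a hypothetical name, it handles a single country's series with consecutive periods, None stands for missing, and it needs m of at least 2 for a standard deviation to exist.

```python
from statistics import mean, stdev

def rolling_z(values, m):
    """z-score each value against the m preceding values (current value excluded).

    Returns None where the value itself or any of the m preceding values is
    missing (None), or where the preceding standard deviation is zero.
    """
    out = []
    for t, x in enumerate(values):
        window = values[t - m:t] if t >= m else []
        complete = len(window) == m and all(v is not None for v in window)
        if x is None or not complete:
            out.append(None)        # same idea as: replace mean = . if N != window
            continue
        sd = stdev(window)          # sample SD (n - 1 denominator)
        out.append((x - mean(window)) / sd if sd > 0 else None)
    return out

series = [1.0, 2.0, 3.0, None, 5.0, 6.0, 7.0, 10.0]
z = rolling_z(series, 3)
```

Only the last entry has three non-missing predecessors (5, 6, 7), so it is the only one that gets a z-score.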

Counting unique login using Map Reduce

Let's say I have a very big log file with this kind of format (based on where a user logs in):
UserId1 , New York
UserId1 , New Jersey
UserId2 , Oklahoma
UserId3 , Washington DC
....
userId999999999, London
Note that UserId1 logged in from New York first, then flew to New Jersey and logged in again from there.
If I need to get the number of unique user logins (meaning 2 logins with the same userid are considered 1 login), how should I map and reduce it?
My initial plan is that I want to map it first to this kind of format :
UserId1, 1
UserId1, 1
UserId2, 1
UserId3, 1
And then reduce it to
UserId1, 2
UserId2, 1
UserId3, 1
But would this cause the output to be still big in number (Especially if common behaviour of user is to login 1 or 2 times a day ). Or is there a better way to implement this?
Do map-reduce.
For example, you have 10000 lines of data, but you can only process 1000 lines of data at a time.
Then, process 1000 lines of data for 10 times.
If the sum of lines of the 10 processing's result > 1000:
do the above step again.
else:
use set directly.
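A rough Python sketch of those steps follows. The names chunks and unique_users are hypothetical, the chunk size stands in for "what fits in one pass", and the "no progress" guard is an addition of this sketch so that the loop cannot run forever when duplicates sit in different chunks.

```python
def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def unique_users(lines, chunk_size=1000):
    """Count distinct user IDs by deduplicating chunk by chunk."""
    ids = [line.split(",")[0].strip() for line in lines]
    while len(ids) > chunk_size:
        reduced = []
        for chunk in chunks(ids, chunk_size):
            reduced.extend(set(chunk))   # dedupe within each chunk
        if len(reduced) == len(ids):     # no progress: stop repeating the step
            break
        ids = reduced
    return len(set(ids))                 # small enough: use a set directly

log = [
    "UserId1 , New York",
    "UserId1 , New Jersey",
    "UserId2 , Oklahoma",
    "UserId3 , Washington DC",
]
count = unique_users(log, chunk_size=2)
```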
I recommend making use of a custom key in the map phase. You can refer to the tutorial here for writing and using custom keys. The custom key should have two parts: 1) userid and 2) placeid. So essentially, in the mapper phase, you are doing this:
emit(<userid, place>, 1)
In the reduce phase, you just have to access the key and emit the two parts of the key separately.
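The flow can be simulated in plain Python to see what comes out. This is a sketch, not Hadoop code: map_phase, shuffle and reduce_phase are hypothetical names, shuffle stands in for the grouping the framework does between the two phases, and the final distinct-user count is an extra step added here to answer the original question.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit ((userid, place), 1) for every log line."""
    for line in lines:
        userid, place = (part.strip() for part in line.split(",", 1))
        yield (userid, place), 1

def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Emit the two parts of each key separately, once per distinct key."""
    for (userid, place), _counts in grouped.items():
        yield userid, place

log = [
    "UserId1 , New York",
    "UserId1 , New York",      # repeated login from the same place
    "UserId1 , New Jersey",
    "UserId2 , Oklahoma",
    "UserId3 , Washington DC",
]
pairs = list(reduce_phase(shuffle(map_phase(log))))

# extra step (not part of the answer above): count distinct users
unique_user_count = len({userid for userid, _ in pairs})
```

Note that the reduce output has one row per distinct (userid, place) pair, so a user who logged in from two places still appears twice until the last step.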