I ran into a design problem. Brace yourself, as this is a long question, and it may be too complex for SO.
For people familiar with R this might help:
I created a package in R for creating correlations between variables (for the purpose of simulations). I created a multithreaded version to allow parallelism. While R makes the problem seem embarrassingly parallel, I am close to reconsidering that statement in C++.
Sample a 100000 by 2 random data matrix.
Split it into a list of (100,000 / 200) = 500 splits of 200 rows by 2 columns data matrices.
Using a multi-core list apply function (mclapply, which can handle matrices as "item" in a list), I was able to send these smaller data matrices with 2 dimensions to the same function, though in different threads.
The function randomly rearranged the first column conditionally on the second column to produce a certain correlation. It then returned the rearranged matrix. I am not asking about this part; I only mention it to rule out solutions that skip steps by creating x and y independently.
The list apply function automatically glues together these 500 splits of 200 rows by 2 columns into one big matrix (100,000 by 2).
listdata:
[[1]]
[1] [2]
... ...
... ...
[[2]]
[1] [2]
... ...
... ...
Applying:
mclapply(listdata, myFunction, cores=4)
Result:
[1] [2]
... ...
... ...
... ...
... ...
Randomness
I want a thread-safe, statistically sound function (the results should not be an artifact of how the pseudo-random numbers are generated) that yields the aggregated, dependent x and y vectors.
Randomness is difficult because a random generator usually has global state, which cannot safely be shared between threads.
Some of my solutions showed that the threads all used similar values, which does not happen using R.
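One common way to avoid shared generator state is to give every split its own generator, seeded from a base seed plus the split index. The following is only a minimal sketch using the standard <random> header, not Qt; the names make_split_rng and sample_split are hypothetical, and std::seed_seq gives distinct streams per index but no formal guarantee of statistical independence between them.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: one generator per split, derived from a base seed
// and the split index, so no generator state is shared between threads.
std::mt19937_64 make_split_rng(std::uint32_t base_seed, std::uint32_t split_index)
{
    std::seed_seq seq{base_seed, split_index};
    std::mt19937_64 rng(seq);
    return rng;
}

// Hypothetical worker: fills one split with standard-normal draws using
// only its own generator. Safe to call concurrently for different splits.
std::vector<double> sample_split(std::uint32_t base_seed,
                                 std::uint32_t split_index,
                                 std::size_t rows)
{
    std::mt19937_64 rng = make_split_rng(base_seed, split_index);
    std::normal_distribution<double> normal(0.0, 1.0);
    std::vector<double> out(rows);
    for (double &value : out)
        value = normal(rng);
    return out;
}
```

Because each call depends only on (base_seed, split_index), the result for a given split is reproducible no matter which thread runs it or in what order the splits are processed.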
Qt
I am using Qt, and I am trying to use QtConcurrent to create the right (convenient for the user) number of threads, given a variable number of splits.
I'll first say that I did not find a way to pass multiple arguments to a function in a QtConcurrent call. The functions all seem to be focused on mapping, without allowing extra arguments.
One attempt was the following:
QList< QVector<double> > list_of_a_split;
QList< QList< QVector<double> > > list_containing_all_splits;
QVector<double> vector_of_doubles(10);
qFill(vector_of_doubles.begin(), vector_of_doubles.end(), 1); // not random
list_of_a_split.append(vector_of_doubles); // x variable split
list_of_a_split.append(vector_of_doubles); // y variable split
list_containing_all_splits.append(list_of_a_split); // first split
list_containing_all_splits.append(list_of_a_split); // second split
QFuture<void> res = QtConcurrent::map(list_containing_all_splits.begin(),
                                      list_containing_all_splits.end(),
                                      mapFunction);
QList< QVector<double> > mapFunction(QList< QVector<double> > &list_of_a_split)
{
// random arranging one column on other
return list_of_a_split;
}
For some reason, this return value did not seem to affect the original list.
Another approach I tried was to generate y within the mapper function, giving it x as input, and to return y into a new variable using QtConcurrent::mappedReduced(): map over x in split form, produce y, return the y split, and use reduce to merge the splits into a QVector<double> y. This has serious randomness issues, as I am not able to send a generator as an extra argument.
I am just showing my efforts here, but feel free to approach the problem from any way (though same results) you like.
The goal in the end:
// Whether or not packed in a QList< >:
QVector<double> x = {..., ..., ..., ...};
QVector<double> y = {..., ..., ..., ...}; // dependent on x
In a cell, is it possible to do if(x=y, z, x) without having to repeat x in the value_if_false argument? Whether there is a way of using if() to make this work or another function doesn't matter, and there isn't a specific formula I'm struggling with as I come across this blocker quite often (hence posting).
To help illustrate the need, if we take x as a complex or more advanced formula, such as
ARRAYFORMULA(IF(E$6:Q$6 < EoMONTH($P$4,0), "Not Active", IF(E$6:Q$6<$Q$4 + ISBLANK($Q$4) > 0,
COUNTIF({'Data'!$B$3:$B&'Data'!$I$3:$I&'Data'!$K$3:$K},$B$4&$C9&E$6:Q$6), "Not Active")))
and I wanted to put an if statement in there that changed the result only if a condition was true, the formula would more than double in size due to having to reference x twice:
=ARRAYFORMULA(IF(IF(E$6:Q$6 < EoMONTH($P$4,0), "Not Active", IF(E$6:Q$6<$Q$4 + ISBLANK($Q$4) > 0,
COUNTIF({'Data'!$B$3:$B&'Data'!$I$3:$I&'Data'!$K$3:$K},$B$4&$C9&E$6:Q$6), "Not Active"))) = 0, "No data", IF(E$6:Q$6 < EoMONTH($P$4,0), "Not Active", IF(E$6:Q$6<$Q$4 + ISBLANK($Q$4) > 0,
COUNTIF({'Data'!$B$3:$B&'Data'!$I$3:$I&'Data'!$K$3:$K},$B$4&$C9&E$6:Q$6), "Not Active"))))
This is just an example (the code is irrelevant). I'm trying to keep my formulas neat, tidy and efficient so that handing off to others is easier. I'm also mindful that this calculates the same complex formula twice, which would probably slow the spreadsheet down, especially when repeated throughout a spreadsheet.
Interested to hear the community thoughts and suggestions on this, hopefully I was clear in explaining it. :)
The only simple way to achieve this would be with the use of helper columns. They don't need to be in the same sheet as your main equation, but they do need to be within the same spreadsheet as a whole (i.e. you could have a sheet named "calc" that's specifically used to calculate intermediate steps and set "variables" by referencing those cells).
The only other option (which gets a bit complicated) is to create a custom function within Google Apps Script. For example, if you wanted to calculate (B1*A4)/C5 in multiple places, you could create a custom function like this:
/**
* Returns a calculation using cells A4, B1, and C5.
* @return A calculation using cells A4, B1, and C5.
* @customfunction
*/
function x() {
var ss = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('MainSheet');
var val1 = ss.getRange('B1').getValue();
var val2 = ss.getRange('A4').getValue();
var val3 = ss.getRange('C5').getValue();
return (val1*val2)/val3;
}
Then in your sheet, you could use this within a formula like this:
=if(A1="yes", x(), "no")
This custom function could obviously be altered to fit one's needs (e.g. taking in arguments to define the cells that the calculations should be done on instead of hard-coding them).
Other than this, there is currently no way to define variables within a formula itself.
This is possible to a certain extent, using TEXT's Meta Instructions, if you're using numbers and simple math conditions.
x     y     z     output
10    10    5     10

Formula in the output cell:
=TEXT(A3,"[="&B3&"]0;"&C3&"")

x     y     z     output
11    10    5     5
As long as your complex formula returns a number for x (or the output can be coerced to a number), this should be possible, and it avoids repetition.
I agree, I would love if there was like a DECODE or NVL type function you could use so that you didn't need to repeat the original statement multiple times.
However, in many cases, when I encounter this, I can often reference another cell. Not in the way that has been suggested already, where the formula exists in another cell, but rather that the decision to perform the formula is based on another cell.
For example, using your values, let's assume the formula (if(x=y, z, x)) only gets calculated when column 'w' is populated. Maybe column 'w' is a key component of the formula. Then you can write the formula as: if(w="",z,x). It's not exactly the same as testing the answer to the equation first, and it doesn't work in all situations, but in many cases I can find another field of key relevance to the formula that lets me get around this.
Suppose I am using some twoway graph command in Stata. Without any action on my part Stata will choose some reasonable values for the ranges of both y and x axes, based both upon the minimum and maximum y and x values in my data, but also upon some algorithm that decides when it would be prettier for the range to extend instead to a number like '0' instead of '0.0139'. Wonderful! Great.
Now suppose that after (or while) I draw my graph, I want to slap some very important text onto it, and I want to be choosy about precisely where the text appears. Having the minimum and maximum values of the displayed axes would be useful: how can I get these min and max numbers? (Either before or while calling the graph command.)
NB: I am not asking how to set the y or x axis ranges.
Since this issue has been a bit of a headache for me for quite some time and I believe there is no good solution out there yet I wanted to write up two ways in which I was able to solve a similar problem to the one described in the post. Specifically, I was able to solve the issue of gray shading for part of the graph using these.
1. Define a global macro in the code generating the axis labels. This is the less elegant way to do it, but it works well. Locate the tickset_g.class file in your ado path. The graph twoway command uses this to draw the axes of any graph. There, I defined a global macro in the draw program that takes the value of the omin and omax locals after they have been set to the minimum between the axis range and data range (the command that does this is local omin = min(.scale.min,omin), and analogously for the max), since the latter sometimes exceeds the former. You could also define the global further up in that code block to get only the axis extent. You can then access the axis range using the globals after the graph command (and use something like addplot to add to the previously drawn graph). Two caveats for this approach: using global macros is, as far as I understand, bad practice and can be dangerous, so I used names with the prefix userwritten that I was sure wouldn't be used by any program. Also, depending on your organization's policies, you may not have the administrator privileges needed to alter this file. However, it is the simpler way. If you prefer a more elegant approach along the lines of what Nick Cox suggested, then you can:
2. Use the undocumented gdi natscale command to define your own axis labels. The gdi commands are the internal commands that are used to generate what you see as graph output (cf. https://www.stata.com/meeting/dcconf09/dc09_radyakin.pdf). The tickset_g.class uses the gdi natscale command to generate the nice numbers of the axes. Basic documentation is available with help _natscale. Basically, you enter the minimum and maximum (e.g. from a summarize return) and a suggested number of steps, and the command returns a min, max, and delta to be used in the x|ylabel option (there are several possible ways, all rather straightforward once you have those numbers, so I won't spell them out for brevity). You'd have to adjust this approach if you use a scale transformation.
Hope this helps!
I like Nick's suggestion, but if you're really determined, it seems that you can find these values by inspecting the output after you set trace on. Here's some inefficient code that seems to do exactly what you want. Three notes:
when I import the log file I get this message:
Note: Unmatched quote while processing row XXXX; this can be due to a formatting problem in the file or because a quoted data element spans multiple lines. You should carefully inspect your data after importing. Consider using option bindquote(strict) if quoted data spans multiple lines or option bindquote(nobind) if quotes are not used for binding data.
Sometimes the data fall outside of the min and max range values that are chosen for the graph's axis labels (but you can easily test for this).
The log linesize is actually important to my code below because the key values must fall on the same line as the strings that I use to identify the helpful rows.
* start a log (critical step for my solution)
cap log close _all
set linesize 255
log using "log", replace text
* make up some data:
clear
set obs 3
gen xvar = rnormal(0,10)
gen yvar = rnormal(0,.01)
* turn trace on, run the -twoway- call, and then turn trace off
set trace on
twoway scatter yvar xvar
set trace off
cap log close _all
* now read the log file in and find the desired info
import delimited "log.log", clear
egen my_string = concat(v*)
keep if regexm(my_string,"forvalues yf") | regexm(my_string,"forvalues xf")
drop if regexm(my_string,"delta")
split my_string, parse("=") gen(new)
gen axis = "vertical" if regexm(my_string,"yf")
replace axis = "horizontal" if regexm(my_string,"xf")
keep axis new*
duplicates drop
loc my_regex = "(.*[0-9]+)\((.*[0-9]+)\)(.*[0-9]+)"
gen min = regexs(1) if regexm(new3,"`my_regex'")
gen delta = regexs(2) if regexm(new3,"`my_regex'")
gen max_temp= regexs(3) if regexm(new3,"`my_regex'")
destring min max_temp delta, replace
gen max = min + delta* int((max_temp-min)/delta)
*here is the info you want:
list axis min delta max
Assume I have the following array of objects:
Object 0:
[0]=1.1344
[1]=2.18
...
[N]=1.86
-----------
Object 1 :
[0]=1.1231
[1]=2.16781
...
[N]=1.8765
-------------
Object 2 :
[0]=1.2311
[1]=2.14781
...
[N]=1.5465
--------
Object 17:
[0]=1.31
[1]=2.55
...
[N]=0.75
How can I compare those objects?
You can see that object 0 and object 1 are very similar, but object 17 is not like either of them.
I would like an algorithm that will give me all the similar objects in my array.
You tagged this question with Algorithm (and I am not an expert in C++), so let me give pseudocode.
First, you should set a threshold that defines two values as similar when they differ by no more than that threshold. The second step is to loop over all pairs of elements and check for similarity.
Let A be an array of n objects and m the number of fields in each object.
threshold = 0.1
for i in (0, n):
    for j in (i+1, n):
        flag = true
        for k in (0, m):
            if (abs(A[i][k] - A[j][k]) > threshold)
                flag = false  // the absolute difference exceeds the threshold, so the objects are not similar
                break         // no need to continue checking
        if (flag)
            print: elements i and j are similar  // and do whatever else is needed
Time complexity is O(m * n^2).
Note that you can reuse the same idea to sort the array of objects: define the comparison function as the maximum difference between fields and sort accordingly.
Hope that helps!
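The pairwise threshold check described above could be sketched in C++ roughly as follows. This is only an illustration under stated assumptions: each object is represented as a std::vector<double>, and the names is_similar and similar_pairs are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// True when every field of a and b differs by at most `threshold`.
bool is_similar(const std::vector<double> &a,
                const std::vector<double> &b,
                double threshold)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t k = 0; k < a.size(); ++k)
        if (std::fabs(a[k] - b[k]) > threshold)
            return false; // one field too far apart is enough to reject
    return true;
}

// Returns all index pairs (i, j) with i < j whose objects are similar.
// Cost is O(m * n^2) for n objects with m fields each.
std::vector<std::pair<std::size_t, std::size_t>>
similar_pairs(const std::vector<std::vector<double>> &objects, double threshold)
{
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < objects.size(); ++i)
        for (std::size_t j = i + 1; j < objects.size(); ++j)
            if (is_similar(objects[i], objects[j], threshold))
                pairs.emplace_back(i, j);
    return pairs;
}
```

With the three fields shown for objects 0, 1 and 17 in the question and a threshold of 0.1, only objects 0 and 1 are reported as a similar pair.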
Your problem essentially boils down to nearest neighbor search which is a well researched problem in data mining.
There are different approaches to this problem.
I would suggest first deciding either how many similar elements you want or what threshold to use for the similarity. Then you have to iterate through all the vectors and compute a distance function between the query vector and each vector in the database.
I would suggest using Euclidean distance in your case, since you have real-valued numeric data.
It is worth reading up on nearest neighbor search and Euclidean distance. Good luck!
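A brute-force version of this threshold search with Euclidean distance might look like the sketch below. It assumes all vectors have the same length; the names euclidean_distance and neighbours_within are hypothetical, and a real nearest neighbor library would use an index structure rather than a linear scan.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance between two vectors of equal length (assumed).
double euclidean_distance(const std::vector<double> &a,
                          const std::vector<double> &b)
{
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        const double diff = a[k] - b[k];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// Indices of all database vectors within `radius` of the vector at
// index `query`, excluding the query itself. Linear scan, O(m * n).
std::vector<std::size_t>
neighbours_within(const std::vector<std::vector<double>> &database,
                  std::size_t query,
                  double radius)
{
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < database.size(); ++i) {
        if (i == query)
            continue;
        if (euclidean_distance(database[query], database[i]) <= radius)
            result.push_back(i);
    }
    return result;
}
```

Unlike a per-field threshold, the radius here bounds the combined difference over all fields, so a single threshold value does not mean the same thing in the two approaches.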
What you need is a classifier. For your problem there are two algorithms, depending on what you want.
If you need to find which object is most similar to a chosen object m, you can use the nearest neighbor algorithm; if you need to find sets of similar objects, you can use the k-means algorithm to find k sets.
I have a cyclical signal I would like to model. I would like to allow the signal to be able to stretch and compress in time, and I do not know the exact profile.
At the moment, I am modelling the phase progression as a random walk, and capturing the cyclical nature by defining the mean of the likelihood as a sum of sines and cosines of the phase, where the weights on the sines and cosines are parameters to be fitted.
i.e.
y = N(f(phase),sigma) = N(sum_i(a_i*sin(phase) + b_i*cos(phase)),sigma)
This seems to work to some extent, but I would like to change the definition of f so that it does not rely on sums of sin and cos.
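For reference, the current mean function f described above could be evaluated as in the sketch below. It is written in plain C++ rather than pymc3, and it makes one loud assumption: term i uses the harmonic (i + 1) * phase, which is the usual Fourier-style reading, whereas the formula as posted writes sin(phase) and cos(phase) without a harmonic index. The name periodic_mean is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean function f(phase) as a truncated Fourier-style sum.
// Assumption: term i uses harmonic (i + 1) * phase; the formula in the
// post does not show a harmonic index.
double periodic_mean(double phase,
                     const std::vector<double> &a,  // sine weights a_i
                     const std::vector<double> &b)  // cosine weights b_i
{
    const std::size_t terms = std::min(a.size(), b.size());
    double total = 0.0;
    for (std::size_t i = 0; i < terms; ++i) {
        const double harmonic = static_cast<double>(i + 1) * phase;
        total += a[i] * std::sin(harmonic) + b[i] * std::cos(harmonic);
    }
    return total;
}
```

Whatever replaces this f needs to keep the property that it repeats every 2*pi in phase, which is what makes the signal cyclical.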
I was looking at Gaussian Processes, and thinking that there could be a solution to this there - but I can't figure out how (if it's possible) to define the y in terms of phase when using GP.
There is an example on the pymc github site:
y_obs = pm.gp.GP('y_obs', cov_func=f_cov, sigma=s2_n, observed={'X':X, 'Y':y})
The problem here is that X is defined as observed, while I need to model it as a random variable.
I tried this form:
y_obs = pm.gp.GP('y_obs', X = phase , cov_func=f_cov, sigma=s2_n, observed={ 'Y':y})
But that leads to an error:
File "/home/person/.conda/envs/mcmcx/lib/python3.6/site-packages/pymc3/distributions/distribution.py", line 56, in __init__
raise TypeError("Expected int elements in shape")
I am new to HB/GP/pymc3... and even stackoverflow. Apologies if the question is off.
Here is a recursive function I'm trying to write that finds all the subsets of a given size in an STL set. The two parameters are an STL set to search for subsets, and a number i >= 0 which specifies how big the subsets should be. If the integer is bigger than the set, it should return an empty subset.
I don't think I'm doing this correctly. Sometimes it's right, sometimes it's not. The STL set gets passed in fine.
list<set<int> > findSub(set<int>& inset, int i)
{
    list<set<int> > the_list;
    list<set<int> >::iterator el = the_list.begin();
    if(inset.size()>i)
    {
        set<int> tmp_set;
        for(int j(0); j<=i;j++)
        {
            set<int>::iterator first = inset.begin();
            tmp_set.insert(*(first));
            the_list.push_back(tmp_set);
            inset.erase(first);
        }
        the_list.splice(el,findSub(inset,i));
    }
    return the_list;
}
From what I understand, you are actually trying to generate all subsets of 'i' elements from a given set, right?
Modifying the input set is going to get you into trouble; you'd be better off not modifying it.
I think the idea is simple enough, though I would say that you got it backwards. Since it looks like homework, I won't give you a C++ algorithm ;)
generate_subsets(set, sizeOfSubsets) # I assume sizeOfSubsets cannot be negative
                                     # use a type that enforces this for god's sake!
    if sizeOfSubsets is 0 then return {}
    else if sizeOfSubsets is 1 then
        result = []
        for each element in set do result <- result + {element}
        return result
    else
        result = []
        baseSubsets = generate_subsets(set, sizeOfSubsets - 1)
        for each subset in baseSubsets
            for each element in set
                if element not in subset then result <- result + { subset + element }
        return result  # drop duplicates: {a, b} is produced from both {a} and {b}
The key points are:
generate the subsets of lower rank first, as you'll have to iterate over them
don't try to insert an element in a subset if it already is, it would give you a subset of incorrect size
Now, you'll have to understand this and transpose it to 'real' code.
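For readers who want to check their own transposition, one possible rendering of the pseudocode is sketched below. It is not the answerer's code: the result is held in a std::set of sets so duplicates collapse automatically, and, as an assumption that differs slightly from the pseudocode, size 0 returns the set containing only the empty subset, which makes the separate size 1 case unnecessary.

```cpp
#include <cstddef>
#include <set>

using Subset = std::set<int>;

// All subsets of `input` that have exactly `size` elements.
// The input set is taken by const reference and never modified.
std::set<Subset> generate_subsets(const Subset &input, std::size_t size)
{
    std::set<Subset> result;
    if (size == 0) {
        result.insert(Subset{}); // only the empty subset has size 0
        return result;
    }
    // Build the subsets of lower rank first, then grow each by one element.
    const std::set<Subset> smaller = generate_subsets(input, size - 1);
    for (const Subset &subset : smaller) {
        for (int element : input) {
            if (subset.count(element) == 0) {
                Subset grown = subset;
                grown.insert(element);
                result.insert(grown); // duplicates such as {1,2} are stored once
            }
        }
    }
    return result;
}
```

If the requested size is larger than the input set, the recursion eventually finds no element left to add, so the result is empty.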
I have been staring at this for several minutes and I can't figure out what your train of thought is for thinking that it would work. You are permanently removing several members of the input list before exploring every possible subset that they could participate in.
Try working out the solution you intend in pseudo-code and see if you can see the problem without the stl interfering.
It seems (I'm not a native English speaker) that you could compute the power set (the set of all subsets) and then select only the subsets that match the condition.
You can find methods for calculating the power set on the Wikipedia Power set page and on Math Is Fun (the link, titled "Power Set from Math Is Fun", is in the External links section of that Wikipedia page; I cannot post it here directly because of the spam prevention mechanism). On Math Is Fun, see mainly the section "It's binary".
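The binary approach mentioned above can be sketched as follows: every subset corresponds to a bit mask over the elements, and only masks that select the requested number of elements are kept. This is an illustrative sketch, assuming the set has fewer than 64 elements so a mask fits in one 64-bit integer; the name subsets_of_size is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Enumerates the power set of `input` with bit masks and keeps only the
// subsets that contain exactly `size` elements.
// Assumes input.size() < 64 so each subset fits in one 64-bit mask.
std::vector<std::set<int>> subsets_of_size(const std::set<int> &input,
                                           std::size_t size)
{
    const std::vector<int> items(input.begin(), input.end());
    std::vector<std::set<int>> result;
    const std::uint64_t total = std::uint64_t{1} << items.size();
    for (std::uint64_t mask = 0; mask < total; ++mask) {
        std::set<int> subset;
        for (std::size_t bit = 0; bit < items.size(); ++bit)
            if (mask & (std::uint64_t{1} << bit))
                subset.insert(items[bit]); // bit set means element chosen
        if (subset.size() == size)
            result.push_back(subset);
    }
    return result;
}
```

This visits all 2^n subsets regardless of the requested size, so it is simple but only practical for small sets.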
I also can't see what this is supposed to achieve.
If this isn't homework with specific restrictions, I'd simply suggest testing against a temporary std::set with std::includes().
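A minimal sketch of that std::includes() test is shown below; the wrapper name is_subset_of is hypothetical. std::includes requires both ranges to be sorted, which iteration over a std::set already guarantees.

```cpp
#include <algorithm>
#include <set>

// True when every element of `candidate` also appears in `whole`.
// Both ranges are sorted because std::set iterates in ascending order.
bool is_subset_of(const std::set<int> &candidate, const std::set<int> &whole)
{
    return std::includes(whole.begin(), whole.end(),
                         candidate.begin(), candidate.end());
}
```

Note the argument order: the larger range comes first and the candidate subset second.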