Getting all combinations of splitting an array into two equally sized groups in Julia

Given an array of 20 numbers, I would like to extract all possible combinations of two groups, with ten numbers in each, order is not important.
combinations([1, 2, 3], 2)
in Julia will give me all possible combinations of two numbers drawn from the array, but I also need the ones that were not drawn...

You can use setdiff to determine the items missing from any vector, e.g.,
y = setdiff(1:5, [2,4])
yields [1,3,5].
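The setdiff idea also works on positions instead of values, which avoids trouble when the array contains duplicates. A minimal sketch, written in Python rather than Julia purely for illustration (the name `split_pairs` is made up for this example):

```python
from itertools import combinations

def split_pairs(items, k):
    """Return every (chosen, rest) split of items with k elements chosen.

    Works on positions, so duplicate values in items are handled correctly.
    """
    positions = range(len(items))
    result = []
    for chosen in combinations(positions, k):
        chosen_set = set(chosen)
        # The complement is every position that was not drawn.
        rest = [items[i] for i in positions if i not in chosen_set]
        result.append(([items[i] for i in chosen], rest))
    return result

pairs = split_pairs([1, 2, 3, 4], 2)
print(len(pairs))  # 6
print(pairs[0])    # ([1, 2], [3, 4])
```

For 20 numbers split 10/10 this produces each partition twice (once per side), so half of the results can be dropped if the two groups are interchangeable.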

After playing around for a bit, I came up with this code, which seems to work. I'm sure it could be written much more elegantly, etc.
function removeall!(remove::Array, a::Array)
for i in remove
if in(i, a)
splice!(a, indexin([i], a)[1])
end
end
end
function combinationgroups(a::Array, count::Integer)
result = {}
for i in combinations(a, count)
all = copy(a)
removeall!(i, all)
push!(result, { i; all } )
end
result
end
combinationgroups([1,2,3,4],2)
6-element Array{Any,1}:
{[1,2],[3,4]}
{[1,3],[2,4]}
{[1,4],[2,3]}
{[2,3],[1,4]}
{[2,4],[1,3]}
{[3,4],[1,2]}

Based on @tholy's comment that, instead of the actual numbers, I could use positions (to avoid problems with numbers not being unique) and setdiff to get the "other group" (the non-selected numbers), I came up with the following. The first function grabs values out of an array based on indices (i.e., arraybyindex([11,12,13,14,15], [2,4]) => [12,14]). This seems like it could be part of the standard library (I did look for it, but might have missed it).
The second function does what combinationgroups was doing above: it creates all groups of a certain size, together with their complements. It can be called by itself, or through the third function, which extracts groups of all possible sizes. It's possible that this could all be written to run much faster, and more idiomatically.
function arraybyindex(a::Array, indx::Array)
res = {}
for e in indx
push!(res, a[e])
end
res
end
function combinationsbypos(a::Array, n::Integer)
res = {}
positions = 1:length(a)
for e in combinations(positions, n)
push!(res, { arraybyindex(a, e) ; arraybyindex(a, setdiff(positions, e)) })
end
res
end
function allcombinationgroups(a::Array)
maxsplit = floor(length(a) / 2)
res = {}
for e in 1:maxsplit
println("Calculating for $e, so far $(length(res)) groups calculated")
append!(res, combinationsbypos(a, e)) # append the groups themselves so res stays flat
end
res
end
Running this in IJulia on a 3 year old MacBook pro gives
@time c=allcombinationgroups([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
println(length(c))
c
Calculating for 1, so far 0 groups calculated
Calculating for 2, so far 20 groups calculated
Calculating for 3, so far 210 groups calculated
Calculating for 4, so far 1350 groups calculated
Calculating for 5, so far 6195 groups calculated
Calculating for 6, so far 21699 groups calculated
Calculating for 7, so far 60459 groups calculated
Calculating for 8, so far 137979 groups calculated
Calculating for 9, so far 263949 groups calculated
Calculating for 10, so far 431909 groups calculated
elapsed time: 11.565218719 seconds (1894698956 bytes allocated)
Out[49]:
616665
616665-element Array{Any,1}:
{{1},{2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}}
{{2},{1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}}
⋮
{{10,12,13,14,15,16,17,18,19,20},{1,2,3,4,5,6,7,8,9,11}}
{{11,12,13,14,15,16,17,18,19,20},{1,2,3,4,5,6,7,8,9,10}}
That is roughly 53,300 groups calculated per second.
By contrast, using the same outer allcombinationgroups function but replacing the call to combinationsbypos with a call to combinationgroups (see previous answer) is 10x slower.
I then rewrote the array-by-index step to use true/false flags, as suggested by @tholy. (I couldn't figure out how to get it to work using [], so I used setindex! explicitly.) I also moved it into one function. Another 10x speedup! 616,665 groups in 1 second!
Final code (so far):
function combinationsbypos(a::Array, n::Integer)
res = {}
positions = 1:length(a)
emptyflags = falses(length(a))
for e in combinations(positions, n)
flag = copy(emptyflags)
setindex!(flag, true, e)
push!(res, {a[flag] ; a[!flag]} )
end
res
end
function allcombinationgroups(a::Array)
maxsplit = floor(length(a) / 2)
res = {}
for e in 1:maxsplit
res = vcat(res, combinationsbypos(a, e))
end
res
end
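For readers who want to see the flag approach outside Julia, here is a rough Python sketch of the same idea: one boolean per position, set to true for the drawn positions, then used to pick the group and its complement. The function names are made up for this example, and it makes no claim about matching the timings above.

```python
from itertools import combinations

def splits_by_mask(items, k):
    """Yield (selected, complement) pairs using one boolean flag per position."""
    n = len(items)
    for chosen in combinations(range(n), k):
        flags = [False] * n
        for i in chosen:
            flags[i] = True
        selected = [x for x, f in zip(items, flags) if f]
        complement = [x for x, f in zip(items, flags) if not f]
        yield selected, complement

def all_splits(items):
    """Collect splits for every group size from 1 up to half the length."""
    result = []
    for k in range(1, len(items) // 2 + 1):
        result.extend(splits_by_mask(items, k))
    return result

groups = all_splits([1, 2, 3, 4])
print(len(groups))  # 10  (4 splits of size 1 plus 6 of size 2)
print(groups[0])    # ([1], [2, 3, 4])
```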

Related

Extracting numbers using Regex in Matlab

I would like to extract integers from strings from a cell array in Matlab. Each string contains 1 or 2 integers formatted as shown below. Each number can be one or two digits. I would like to convert each string to a 1x2 array. If there is only one number in the string, the second column should be -1. If there are two numbers then the first entry should be the first number, and the second entry should be the second number.
'[1, 2]'
'[3]'
'[10, 3]'
'[1, 12]'
'[11, 12]'
Thank you very much!
I have tried a few different methods that did not work out. I think that I need to use regex and am having difficulty finding the proper expression.
You can use str2num to convert well formatted chars (which you appear to have) to the correct arrays/scalars. Then simply pad from the end+1 element to the 2nd element (note this is nothing in the case there's already two elements) with the value -1.
This is most clearly done in a small loop, see the comments for details:
% Set up the input
c = { ...
'[1, 2]'
'[3]'
'[10, 3]'
'[1, 12]'
'[11, 12]'
};
n = cell(size(c)); % Initialise output
for ii = 1:numel(n) % Loop over chars in 'c'
n{ii} = str2num(c{ii}); % convert char to numeric array
n{ii}(end+1:2) = -1; % Extend (if needed) to 2 elements = -1
end
% (Optional) Convert from a cell to an Nx2 array
n = cell2mat(n);
If you really wanted to use regex, you could replace the loop part with something similar:
n = regexp( c, '\d{1,2}', 'match' ); % Match between one and two digits
for ii = 1:numel(n)
n{ii} = str2double(n{ii}); % Convert cellstr of chars to arrays
n{ii}(end+1:2) = -1; % Pad to be at least 2 elements
end
But there are lots of ways to do this without touching regex, for example you could erase the square brackets, split on a comma, and pad with -1 according to whether or not there's a comma in each row. Wrap it all in a much harder to read (vs a loop) cellfun and ta-dah you get a one-liner:
n = cellfun( @(x) [str2double( strsplit( erase(x,{'[',']'}), ',' ) ), -1*ones(1,1-nnz(x==','))], c, 'uni', 0 );
I'd recommend one of the loops for ease of reading and debugging.
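For comparison, the same match-then-pad logic can be sketched in Python (not Matlab). The pattern mirrors the `\d{1,2}` expression above, and the helper name `parse_pair` is invented for this example:

```python
import re

def parse_pair(text, fill=-1):
    """Extract up to two integers from a string like '[10, 3]', padding with fill."""
    numbers = [int(m) for m in re.findall(r"\d{1,2}", text)]
    # Append two fill values, then keep only the first two entries.
    return (numbers + [fill, fill])[:2]

c = ['[1, 2]', '[3]', '[10, 3]', '[1, 12]', '[11, 12]']
n = [parse_pair(s) for s in c]
print(n)  # [[1, 2], [3, -1], [10, 3], [1, 12], [11, 12]]
```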

How to perform rolling window calculations without SSC packages

Goal: perform rolling window calculations on panel data in Stata with variables PanelVar, TimeVar, and Var1, where the window can change within a loop over different window sizes.
Problem: no access to SSC for the packages that would take care of this (like rangestat)
I know that
by PanelVar: gen Var1_1 = Var1[_n]
produces a copy of Var1 in Var1_1. So I thought it would make sense to try
by PanelVar: gen Var1SumLag = sum(Var1[(_n-3)/_n])
to produce a rolling window calculation for _n-3 to _n for the whole variable. But it fails to produce the results I want, it just produces zeros.
You could use sum(Var1) - sum(Var1[_n-3]), but I also want to be able to make the rolling window left justified (summing future observations) as well as right justified (summing past observations).
Essentially I would like to replicate Python's ".rolling().agg()" functionality.
In Stata _n is the index of the current observation. The expression (_n - 3) / _n yields -2 when _n is 1 and increases slowly with _n but is always less than 1. As a subscript applied to extract values from observations of a variable it always yields missing values given an extra rule that Stata rounds down expressions so supplied. Hence it reduces to -2, -1 or 0: in each case it yields missing values when given as a subscript. Experiment will show you that given any numeric variable say numvar references to numvar[-2] or numvar[-1] or numvar[0] all yield missing values. Otherwise put, you seem to be hoping that the / yields a set of subscripts that return a sequence you can sum over, but that is a long way from what Stata will do in that context: the / is just interpreted as division. (The running sum of missings is always returned as 0, which is an expression of missings being ignored in that calculation: just as 2 + 3 + . + 4 is returned as 9 so also . + . + . + . is returned as 0.)
A fairly general way to do what you want is to use time series operators, and this is strongly preferable to subscripts as (1) doing the right thing with gaps (2) automatically working for panels too. Thus after a tsset or xtset
L0.numvar + L1.numvar + L2.numvar + L3.numvar
yields the sum of the current value and the three previous and
L0.numvar + F1.numvar + F2.numvar + F3.numvar
yields the sum of the current value and the three next. If any of these terms is missing, the sum will be too; a work-around for that is to return say
cond(missing(L3.numvar), 0, L3.numvar)
More general code will require some kind of loop.
Given a desire to loop over lags (negative) and leads (positive) some code might look like this, given a range of subscripts as local macros i <= j
* example i and j
local i = -3
local j = 0
gen double wanted = 0
forval k = `i'/`j' {
if `k' < 0 {
local k1 = -(`k')
replace wanted = wanted + L`k1'.numvar
}
else replace wanted = wanted + F`k'.numvar
}
Alternatively, use Mata.
EDIT There's a simpler method, to use tssmooth ma to get moving averages and then multiply up by the number of terms.
tssmooth ma wanted1=numvar, w(3 1)
tssmooth ma wanted2=numvar, w(0 1 3)
replace wanted1 = 4 * wanted1
replace wanted2 = 4 * wanted2
Note that in contrast to the method above tssmooth ma uses whatever is available at the beginning and end of each panel. So, the first moving average, the average of the first value and the three previous, is returned as just the first value at the beginning of each panel (when the three previous values are unknown).
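Since the question mentions Python's rolling functionality, here is a small plain-Python sketch of the lag/lead window logic described above. It assumes each panel is a regularly spaced series with no gaps and no missing values, and it returns None wherever the window runs off the panel, mirroring the missing result from the L. and F. operators. The function names are made up for this example.

```python
def rolling_sum(values, lo, hi):
    """Sum values[t+lo .. t+hi] for each t; None where the window leaves the series.

    lo and hi are offsets relative to the current observation: negative
    offsets are lags, positive offsets are leads (so lo=-3, hi=0 is the
    current value plus the three previous ones).
    """
    n = len(values)
    out = []
    for t in range(n):
        if t + lo < 0 or t + hi >= n:
            out.append(None)  # window is incomplete at the panel edge
        else:
            out.append(sum(values[t + lo : t + hi + 1]))
    return out

def rolling_sum_by_panel(panels, lo, hi):
    """Apply rolling_sum separately to each panel (dict: panel id -> series)."""
    return {pid: rolling_sum(series, lo, hi) for pid, series in panels.items()}

panels = {"A": [1, 2, 3, 4, 5], "B": [10, 20, 30, 40, 50]}
print(rolling_sum_by_panel(panels, -3, 0))
# {'A': [None, None, None, 10, 14], 'B': [None, None, None, 100, 140]}
print(rolling_sum(panels["A"], 0, 3))
# [10, 14, None, None, None]
```

Changing lo and hi inside a loop gives the different window sizes, and windows never cross from one panel into the next because each panel is processed on its own.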

Regular expression in Matlab to leave only specified items of an array

I have a list of channels:
channels = {'1LT1', '1LT2', '1LT3', '1LT4', '1LT5', '2LA1', '2LA2', '2LA3', '3LH1', '3LH5', '4LT1', '4LT2', '4LT3', '5LH1', '5LH2', '4LT10'}
I need to write an algorithm that keeps only the distal channels. That is, for each type of channel ('1LT', '2LA', '3LH', '4LT', etc.) I need only the channel with the highest trailing number. Ideally it would return the indices of these channels. For example, for the list above the result should be:
[5, 8, 10, 15, 16]
I think I can do it with regexp by splitting like that:
row_i = 1;
for ch_i=[1:length(channels)]
try
[n(row_i,:), ch_type(row_i,:)] = strsplit(channels{ch_i},'\d+[A-Z]', 'DelimiterType','RegularExpression');
row_i = row_i + 1;
catch
continue
end
end
But then I am really stuck. Can somebody give me some tips for creating a good algorithm?
I am thankful for any idea!
You can use regexp to break each string into a channel and number, create numeric labels for the channels using findgroups, convert the number string into an actual number with str2double, then splitapply to find the max for each group. Here's the code, although I can't test it right now so it may need some tweaks:
tokens = regexp(channels, '(\d+[A-Z]+)(\d+)', 'tokens');
tokens = vertcat(tokens{:});
[grps, channelID] = findgroups(tokens(:, 1));
nums = str2double(tokens(:, 2));
channelMax = splitapply(@max, nums, grps);
Using the channelID and channelMax values, you can then reconstruct the distal channel names and find their indices in the channel list using sprintf, strsplit, and ismember:
distal = strsplit(sprintf('%s%d\n', channelID, channelMax));
index = find(ismember(channels, distal));
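As a cross-check of the grouping logic, here is a short Python sketch (not Matlab) that splits each name into a channel prefix and a number, keeps the largest number per prefix, and returns 1-based indices as in the question. The function name `distal_indices` is made up for this example.

```python
import re

def distal_indices(channels):
    """Return 1-based indices of the highest-numbered channel in each group."""
    best = {}  # group prefix -> (number, 1-based index)
    for idx, name in enumerate(channels, start=1):
        m = re.fullmatch(r"(\d+[A-Z]+)(\d+)", name)
        if m is None:
            continue  # skip names that do not follow the pattern
        group, number = m.group(1), int(m.group(2))
        if group not in best or number > best[group][0]:
            best[group] = (number, idx)
    return sorted(idx for _, idx in best.values())

channels = ['1LT1', '1LT2', '1LT3', '1LT4', '1LT5', '2LA1', '2LA2', '2LA3',
            '3LH1', '3LH5', '4LT1', '4LT2', '4LT3', '5LH1', '5LH2', '4LT10']
print(distal_indices(channels))  # [5, 8, 10, 15, 16]
```

Comparing the numbers as integers matters here: as strings, '4LT3' would sort after '4LT10'.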

Mathematica solution?

I am new with Mathematica and I have one more task to figure out, but I can't find the answer. I have two lists of numbers ("b","u"):
b = {8.734059001373602`, 8.330508824111284`, 5.620669156438947`,
1.4722145583571766`, 1.797504620275392`, 7.045821078656974`,
2.1437334927375247`, 2.295629405840401`, 9.749038328921163`,
5.9928406294151095`, 5.710839663259195`, 7.6983109942364365`,
1.02781847368645`, 4.909108426318685`, 2.5860897177525572`,
9.56334726886076`, 5.661774934433563`, 3.4927397824800384`,
0.4570000499566351`, 6.240122061193738`, 8.371962670138991`,
4.593105388706549`, 7.653068139076581`, 2.2715973346475877`,
7.6234743784167875`, 0.9177107503732636`, 3.182296027902268`,
6.196168580445633`, 0.1486794884986935`, 1.2920960388213274`,
7.478757220079665`, 9.610332785387424`, 0.05088141346751485`,
3.940557901075696`, 5.21881311050797`, 7.489624788199514`,
8.773397599406234`, 3.397275198258715`, 1.4847171141876618`,
0.06574278834161795`, 0.620801320529969`, 2.075457888143216`,
5.244608900551409`, 4.54384757203616`, 7.114276285060143`,
2.8878711430358344`, 5.70657733453041`, 8.759173986432632`,
1.9392596667256967`, 7.419234634325729`, 8.258205508179927`,
1.185315253730261`, 3.907753644335596`, 7.168561412289151`,
9.919881985898002`, 3.169835543867407`, 8.352858871046699`,
7.959492335118693`, 7.772764587074317`, 7.091413185764939`,
1.433673058797801`};
and
u={5.1929, 3.95756, 5.55276, 3.97068, 5.67986, 4.57951, 4.12308,
2.52284, 6.58678, 4.32735, 7.08465, 4.65308, 3.82025, 5.01325,
1.17007, 6.43412, 4.67273, 3.7701, 4.10398, 2.90585, 3.75596,
5.12365, 4.78612, 7.20375, 3.19926, 8.10662};
This is the LinePlot of "b" and "u";
I need to compare first 5 numbers from "b" to 1st number in "u" and always leave the maximum (replace "b"<"u" with "u"). Then I need to shift by 2 numbers and compare 3rd, 4th, 5th, 6th and 7th "b" with 2nd "u" and so on (shift always => 2 steps). But the overlapping numbers need to be "remembered" and compared in the next step, so that always the maximum is picked (e.g. 3rd, 4th and 5th "b" has to be > than 1st and 2nd "u").
Possibly the easiest way would be to cover the maximums shown in the image throughout the whole function, but I am new to this software and I don't have the experience to do that. Still, it would be awesome if someone could figure out how to do this with a function that does what I have described above.
I believe this does what you want:
With[{n = Length @ u},
Array[b[[#]] ~Max~ Take[u, ⌊{#-2, #+1}/2⌋ ~Clip~ {1, n}] &, 2 n + 3]
]
{8.73406, 8.33051, 5.62067, 5.1929, 5.55276, 7.04582, 5.55276, 5.55276, 9.74904,--
Or if the lengths of u and b are appropriately matched:
With[{n = Length @ u},
MapIndexed[# ~Max~ Take[u, ⌊(#2[[1]] + {-2, 1})/2⌋ ~Clip~ {1, n}] &, b]
]
These are quite a lot faster than Mark's solution. With the following data:
u = RandomReal[{1, 1000}, 1500];
b = RandomReal[{1, 1000}, 3004];
Mark's code takes 2.8 seconds, while mine take 0.014 and 0.015 seconds.
Please ask your future questions on the dedicated Mathematica StackExchange site.
I think that there's a small problem with your data, u doesn't have as many elements as Partition[b,5,2]. Leaving that to one side, the best I could do was:
Max /@ Transpose[
Table[Map[If[# > 0, Max[#, u[[i]]], 0] &,
RotateRight[PadRight[Partition[b, 5, 2][[i]], Length[b]],
2 (i - 1)]], {i, 1, Length[u]}]]
which starts producing the same numbers as in your comment.
As ever, pick this apart from the innermost expression and work outwards.
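The index arithmetic behind the Array/Take version can be spelled out step by step. Below is a Python sketch (not Mathematica) under the assumption that window i, counting from 1, covers positions 2i-1 to 2i+3 of b, as described in the question. Positions of b that no window covers are left unchanged, which is a choice made for this sketch, and the name `windowed_max` is invented here.

```python
def windowed_max(b, u):
    """Raise each b[k] to the largest u value whose 5-wide window covers it.

    Window i (1-based) covers positions 2i-1 .. 2i+3 of b, so position k is
    covered by u[floor((k-2)/2)] .. u[floor((k+1)/2)], clipped to 1..len(u).
    """
    n = len(u)
    covered = min(len(b), 2 * n + 3)  # positions past this lie in no window
    out = list(b)
    for k in range(1, covered + 1):
        lo = min(max((k - 2) // 2, 1), n)
        hi = min(max((k + 1) // 2, 1), n)
        out[k - 1] = max(b[k - 1], *u[lo - 1 : hi])
    return out

# First few values from the question, with b rounded to three decimals.
b = [8.734, 8.331, 5.621, 1.472, 1.798, 7.046, 2.144, 2.296, 9.749]
u = [5.1929, 3.95756, 5.55276, 3.97068]
print(windowed_max(b, u))
# [8.734, 8.331, 5.621, 5.1929, 5.55276, 7.046, 5.55276, 5.55276, 9.749]
```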

Assigning list elements to variables

The output from Mathematica with the following operation FactorInteger[28851680048402838857] is as follows:
{{3897424303, 1}, {7402755719, 1}}
My question is: how could I go about extracting the two prime numbers (without the exponents) and assign them to an arbitrary variable?
I basically want to retrieve two primes, whatever they may be, and assign them some variables.
Ex: x0 = 3897424303 and x1 = 7402755719
Thanks!
The output is a list and you can use list manipulating functions like Part ([[ ]]) to pick the pieces you want, e.g.,
{x0, x1} = FactorInteger[28851680048402838857][[All, 1]]
or, without Part:
{{x0,dummy}, {x1,dummy}} = FactorInteger[28851680048402838857];
Implicit in your question is the issue of handling parts of the expression that is returned as output from functions such as FactorInteger. Allow me to suggest alternatives.
1. Keep all of the values in a {list} and access each element with Part:
x = First /@ FactorInteger[7813426]
{2, 31, 126023}
x[[1]]
x[[3]]
2
126023
2. Store factors as values of the function x, mimicking indexation of an array:
(This code uses MapIndexed, Function.)
Clear[x]
MapIndexed[
(x[First@#2] = First@#1) &,
FactorInteger[7813426]
];
x[1]
x[3]
2
126023
You can see all the values using ? or ?? (see Information):
?x
Global`x
x[1]=2
x[2]=31
x[3]=126023
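The three patterns above (unpacking directly, keeping a list, and storing indexed values) have close analogues in other languages. A small Python sketch, with the factor list copied from the question as (prime, exponent) pairs instead of being computed:

```python
# (prime, exponent) pairs as returned in the question; not computed here
factors = [(3897424303, 1), (7402755719, 1)]

# Option 1: unpack the primes directly, ignoring the exponents
(x0, _), (x1, _) = factors

# Option 2: keep all primes in one list and index into it
primes = [p for p, _ in factors]

# Option 3: a dict keyed by position, similar to defining x[1], x[2], ...
x = {i: p for i, (p, _) in enumerate(factors, start=1)}

print(x0, x1)     # 3897424303 7402755719
print(primes[0])  # 3897424303
print(x[2])       # 7402755719
```

Option 1 only works when the number of factors is known in advance; the list and dict versions work for any number of factors.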