Matlab: find small islands of numbers surrounded by NaN - regex

I have a lengthy vector of numeric data, with some sequences of NaNs here and there. Most of the NaNs come in large chunks, but sometimes the segments of NaNs are close together, creating islands of numbers surrounded by NaNs like this:
...NaN 1 2 3 5 ... 9 4 2 NaN...
I would like to find all islands of data that are between 1 and 15000 elements in size and replace them with a solid block of NaNs.
I've tried a few things, but there are some problems. The data set is HUGE, so converting it to a string and using a regular expression such as:
[found start end] = regexp(num2str(isnan(data)),'10{1,7}1','match','start','end')
is out of the question, because num2str(isnan(data)) takes prohibitively long. So I need a numeric way to find all NaN-numbers-NaN sequences where the number of numbers is between 1 and 15000.

Here is an example of how you can do that:
% generate random data
data = rand(1,20)
data(data>0.5) = NaN
% add NaN before and after the original array
% for easier detection of the numerical block
% starting at 1st element and finishing at the last one
datatotest = [ NaN data NaN ];
NumBlockStart = find( ~isnan(datatotest(2:end)) & isnan(datatotest(1:end-1)) )+0
NumBlockEnd = find( ~isnan(datatotest(1:end-1)) & isnan(datatotest(2:end)) )-1
NumBlockLength = NumBlockEnd - NumBlockStart + 1
In this example NumBlockStart contains the start index of each numeric block, NumBlockEnd contains the last index of each block, and NumBlockLength contains the length of each block.
Now you can do whatever you want with them :)
Here is a possible output:
data =
0.0382 0.3767 0.8597 0.2743 0.6276 0.2974 0.2587 0.8577 0.8319 0.1408 0.9288 0.0990 0.7653 0.7806 0.8576 0.8032 0.8340 0.1600 0.4937 0.7784
data =
0.0382 0.3767 NaN 0.2743 NaN 0.2974 0.2587 NaN NaN 0.1408 NaN 0.0990 NaN NaN NaN NaN NaN 0.1600 0.4937 NaN
NumBlockStart =
1 4 6 10 12 18
NumBlockEnd =
2 4 7 10 12 19
NumBlockLength =
2 1 2 1 1 2
UPDATE1
This is a more efficient version:
data = rand(1,19)
data(data>0.5) = NaN
test2 = diff( ~isnan([ NaN data NaN ]) );
NumBlockStart = find( test2>0 )-0
NumBlockEnd = find( test2<0 )-1
NumBlockLength = NumBlockEnd - NumBlockStart + 1
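Neither snippet performs the replacement the question asks for. As a rough sketch in plain Python (not MATLAB, and not the answerer's code), the same pad-and-diff idea can be extended to blank out short islands. The function names and the max_len parameter are made up for illustration, indices are 0-based, and runs touching either end of the data count as islands because of the padding.

```python
import math

def find_numeric_blocks(data):
    """Return (start, end) pairs, 0-based and inclusive, for every run of non-NaN values."""
    # 1 marks a number, 0 marks a NaN; pad with 0 on both sides,
    # which plays the role of the NaN padding in the MATLAB answer.
    valid = [0] + [0 if math.isnan(x) else 1 for x in data] + [0]
    diff = [valid[k + 1] - valid[k] for k in range(len(valid) - 1)]
    starts = [k for k, d in enumerate(diff) if d > 0]    # NaN -> number
    ends = [k - 1 for k, d in enumerate(diff) if d < 0]  # number -> NaN
    return list(zip(starts, ends))

def blank_small_islands(data, max_len):
    """Return a copy of data in which every numeric run of at most max_len elements is NaN."""
    out = list(data)
    for start, end in find_numeric_blocks(data):
        length = end - start + 1
        if length <= max_len:
            out[start:end + 1] = [math.nan] * length
    return out

nan = math.nan
example = [nan, 1.0, 2.0, nan, 3.0, 4.0, 5.0, 6.0, nan]
cleaned = blank_small_islands(example, max_len=2)
print(find_numeric_blocks(example))
print(cleaned)
```

With max_len set to 15000 this matches the size limit in the question; the two-element island is removed and the four-element block is kept.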

Related

Using AND bitwise operator between a number, and its negative counterpart

I stumbled upon this simple line of code, and I cannot figure out what it does. I understand what it does in separate parts, but I don't really understand it as a whole.
// We have an integer(32 bit signed) called i
// The following code snippet is inside a for loop declaration
// in place of a simple incrementor like i++
// for(;;HERE){}
i += (i&(-i))
If I understand correctly, it applies the bitwise AND operator to i and negative i and then adds the result to i. I first thought that this would be an optimized way of calculating the absolute value of an integer; however, as far as I know, C++ does not store negative integers simply by flipping a bit. Please correct me if I'm wrong.
Assuming two's complement representation, and assuming i is not INT_MIN, the expression i & -i results in the value of the lowest bit set in i.
If we look at the value of this expression for various values of i:
0 00000000: i&(-i) = 0
1 00000001: i&(-i) = 1
2 00000010: i&(-i) = 2
3 00000011: i&(-i) = 1
4 00000100: i&(-i) = 4
5 00000101: i&(-i) = 1
6 00000110: i&(-i) = 2
7 00000111: i&(-i) = 1
8 00001000: i&(-i) = 8
9 00001001: i&(-i) = 1
10 00001010: i&(-i) = 2
11 00001011: i&(-i) = 1
12 00001100: i&(-i) = 4
13 00001101: i&(-i) = 1
14 00001110: i&(-i) = 2
15 00001111: i&(-i) = 1
16 00010000: i&(-i) = 16
The pattern is visible in the table: the result is always the value of the rightmost 1 bit of i.
Extrapolating that to i += (i&(-i)), assuming i is positive, it adds the value of the lowest set bit to i. For values that are a power of two, this just doubles the number.
For other values, it rounds the number up by the value of that lowest bit. Repeating this in a loop, you eventually end up with a power of 2. As for what such an increment could be used for, that depends on the context of where this expression was used.
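The behaviour described above can be checked with a small Python sketch. Python integers act like infinitely wide two's complement numbers under &, so the same expression carries over; the function names are invented for this illustration, and the loop stops at the first power of two instead of running forever.

```python
def lowest_set_bit(i):
    """Value of the lowest set bit of i (0 when i is 0)."""
    return i & -i

def climb_to_power_of_two(i):
    """Repeat i += i & -i until i is a power of two; return every value visited."""
    if i <= 0:
        raise ValueError("expects a positive integer")
    visited = [i]
    # A power of two is exactly a number that equals its own lowest set bit.
    while i != lowest_set_bit(i):
        i += lowest_set_bit(i)
        visited.append(i)
    return visited

print(climb_to_power_of_two(5))   # 5 -> 6 -> 8
print(climb_to_power_of_two(12))  # 12 -> 16
```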

Optimal way to compress 60 bit string

Given 15 random hexadecimal digits (60 bits) where there is always at least 1 duplicate in every 20-bit run (5 hexadecimal digits), what is the optimal way to compress the bytes?
Here are some examples:
01230 45647 789AA
D8D9F 8AAAF 21052
20D22 8CC56 AA53A
AECAB 3BB95 E1E6D
9993F C9F29 B3130
Initially I tried Huffman encoding on just 20 bits, because Huffman coding can go from 20 bits down to ~10 bits, but storing the table takes more than 9 bits.
Here is the breakdown showing 20 bits -> 10 bits for 01230
Character Frequency Assignment Space Savings
0 2 0 2×4 - 2×1 = 6 bits
2 1 10 1×4 - 1×2 = 2 bits
1 1 110 1×4 - 1×3 = 1 bits
3 1 111 1×4 - 1×3 = 1 bits
I then tried Huffman encoding on all 300 bits (five 60-bit runs), and here is the mapping for the examples above:
Character Frequency Assignment Space Savings
---------------------------------------------------------
a 10 101 10×4 - 10×3 = 10 bits
9 8 000 8×4 - 8×3 = 8 bits
2 7 1111 7×4 - 7×4 = 0 bits
3 6 1101 6×4 - 6×4 = 0 bits
0 5 1100 5×4 - 5×4 = 0 bits
5 5 1001 5×4 - 5×4 = 0 bits
1 4 0010 4×4 - 4×4 = 0 bits
8 4 0111 4×4 - 4×4 = 0 bits
d 4 0101 4×4 - 4×4 = 0 bits
f 4 0110 4×4 - 4×4 = 0 bits
c 4 1000 4×4 - 4×4 = 0 bits
b 4 0011 4×4 - 4×4 = 0 bits
6 3 11100 3×4 - 3×5 = -3 bits
e 3 11101 3×4 - 3×5 = -3 bits
4 2 01000 2×4 - 2×5 = -2 bits
7 2 01001 2×4 - 2×5 = -2 bits
This yields a saving of 8 bits overall, but 8 bits isn't enough to store the Huffman table. It seems that, because of the randomness of the data, the more bits you try to encode with Huffman, the less effective it becomes. Huffman encoding seemed to work best with 20 bits (50% reduction), but storing the table in 9 or fewer bits isn't possible AFAIK.
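The two totals above (10 bits for 01230, and 300 - 8 = 292 bits for all five examples) can be reproduced with a short Python sketch. It computes only the size of the Huffman-coded payload, not the table, and the function name is an assumption made for this illustration.

```python
import heapq
from collections import Counter

def huffman_total_bits(text):
    """Size in bits of the Huffman-coded payload for text (code table not included)."""
    heap = list(Counter(text).values())
    if len(heap) == 1:
        return len(text)  # a lone symbol still needs one bit per occurrence
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        # Merging the two rarest subtrees adds one bit to every symbol below them.
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

EXAMPLES = [
    "01230 45647 789AA",
    "D8D9F 8AAAF 21052",
    "20D22 8CC56 AA53A",
    "AECAB 3BB95 E1E6D",
    "9993F C9F29 B3130",
]
all_digits = "".join(EXAMPLES).replace(" ", "")

print(huffman_total_bits("01230"))      # 20 raw bits
print(huffman_total_bits(all_digits))   # 300 raw bits
```

Any valid Huffman tree gives the same total, so the figures do not depend on how ties between equal frequencies are broken.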
In the worst case for a 60-bit string there are still at least 3 duplicates; in the average case there are more than 3 (my assumption). With at least 3 duplicates, the most distinct symbols you can have in a run of 60 bits is 12.
Because of the duplicates plus the fewer than 16 symbols, I can't help but feel that some type of compression can be used.
If I simply count the number of 20-bit values with at least two hexadecimal digits equal, there are 524,416 of them, a smidge more than 2^19 (524,288). So the most you could possibly save is a little less than one bit out of the 20.
Hardly seems worth it.
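The count can be checked by subtracting the strings with five distinct digits from all five-digit hex strings. A minimal Python sketch (variable names are illustrative; math.perm needs Python 3.8 or later):

```python
import math

total = 16 ** 5                     # every 5-digit hexadecimal string
all_distinct = math.perm(16, 5)     # 16*15*14*13*12 strings with no repeated digit
with_duplicate = total - all_distinct
bits_needed = math.log2(with_duplicate)

print(with_duplicate)               # 524416
print(bits_needed)                  # just above 19
```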
If I split your question in two parts:
How do I compress (perfect) random data: You can't. Every bit is some new entropy which can't be "guessed" by a compression algorithm.
How to compress "one duplicate in five characters": There are exactly 10 options where the duplicate can be (see table below). This is basically the entropy. Just store which option it is (maybe grouped for the whole line).
These are the options:
AAbcd = 1 AbAcd = 2 AbcAd = 3 AbcdA = 4 (<-- cases where first character is duplicated somewhere)
aBBcd = 5 aBcBd = 6 aBcdB = 7 (<-- cases where second character is duplicated somewhere)
abCCd = 8 abCdC = 9 (<-- cases where third character is duplicated somewhere)
abcDD = 0 (<-- cases where last characters are duplicated)
So for your first example:
01230 45647 789AA
The first one (01230) is option 4, the second 3 and the third option 0.
You can compress this by multiplying each consecutive by 10: (4*10 + 3)*10 + 0 = 430
And uncompress it by using divide and modulo: 430%10=0, (430/10)%10=3, (430/10/10)%10=4. So you could store your number like that:
1AE 0123 4567 789A
^^^ this is 430 in hex and requires only 10 bits
The three options combined give a number between 0 and 999, so 10 bits are enough.
Compared to storing these 3 characters normally you save 2 bits. As someone else already commented, this is probably not worth it. For the whole line it's even less: 2 bits / 60 bits = 3.3% saved.
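As a sketch of the option numbering and the base-10 packing described above, in Python: the names PAIRS, option_code, pack and unpack are invented here, and when a group has more than one duplicated pair the first match in table order is used, which is an assumption the answer does not spell out.

```python
from itertools import combinations

# Position pairs in the same order as the table: codes 1..9, then 0 for the last pair.
PAIRS = list(combinations(range(5), 2))

def option_code(group):
    """Code of the first duplicated position pair in a 5-character group."""
    for n, (a, b) in enumerate(PAIRS, start=1):
        if group[a] == group[b]:
            return n % 10  # the tenth pair (last two characters) is coded as 0
    raise ValueError("group has no duplicate: " + group)

def pack(codes):
    """Combine option codes into one integer, one decimal digit per code."""
    value = 0
    for code in codes:
        value = value * 10 + code
    return value

def unpack(value, count):
    """Recover count option codes using divide and modulo."""
    codes = []
    for _ in range(count):
        codes.append(value % 10)
        value //= 10
    return codes[::-1]

groups = "01230 45647 789AA".split()
codes = [option_code(g) for g in groups]
packed = pack(codes)
print(codes, packed, format(packed, "X"))
```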
If you want to get rid of the duplicates first, do this, then look at the links at the bottom of the page. If you don't want to get rid of the duplicates, then still look at the links at the bottom of the page:
Array.prototype.contains = function(v) {
for (var i = 0; i < this.length; i++) {
if (this[i] === v) return true;
}
return false;
};
Array.prototype.unique = function() {
var arr = [];
for (var i = 0; i < this.length; i++) {
if (!arr.contains(this[i])) {
arr.push(this[i]);
}
}
return arr;
}
var duplicates = [1, 3, 4, 2, 1, 2, 3, 8];
var uniques = duplicates.unique(); // result = [1,3,4,2,8]
console.log(uniques);
Then you would have shortened your code that you have to deal with. Then you might want to check out Smaz
Smaz is a simple compression library suitable for compressing strings.
If that doesn't work, then you could take a look at this:
http://ed-von-schleck.github.io/shoco/
Shoco is a C library to compress and decompress short strings. It is very fast and easy to use. The default compression model is optimized for english words, but you can generate your own compression model based on your specific input data.
Let me know if it works!

iterate over Dataframe row by index value and find max

I need to iterate over df rows based on its index. I need to find the max in the column p1 and fill it in the output dataframe (along with the max p1), the same for the column p2. In each range of my row indexes (sub_1_ica_1---> sub_1_ica_n), there must be only one 1 and one 2 and I need to fill the remaining with zeros. That's why I need to do the operation range by range.
I tried to split the index name and make a counter for each subject to use when iterating over the rows, but I feel that I am going about it the wrong way!
from collections import Counter
a = df.id.tolist()
indlist = []
for x in a:
i = x.split('_')
b = int(i[1])
indlist.insert(-1,b)
c=Counter(indlist)
keyInd = c.keys()
Any ideas?
EDIT: according to Jerazel example my desired output would look like this.
First I find the max for p1 and p2 columns which will be translated in the new df into 1 and 2, and the remaining fields will be zeros
I think you need numpy.argmax together with max; if you also need the column names, use idxmax:
import pandas as pd
idx = ['sub_1_ICA_0','sub_1_ICA_1','sub_1_ICA_2','sub_2_ICA_0','sub_2_ICA_1','sub_2_ICA_2']
df = pd.DataFrame({'p0':[7,8,9,4,2,3],
'p1':[1,3,5,7,1,0],
'p2':[5,9,6,1,2,4]}, index=idx)
print (df)
cols = ['p0','p1','p2']
df['a'] = df[cols].values.argmax(axis=1)
df['b'] = df[cols].max(axis=1)
df['c'] = df[cols].idxmax(axis=1)
print (df)
p0 p1 p2 a b c
sub_1_ICA_0 7 1 5 0 7 p0
sub_1_ICA_1 8 3 9 2 9 p2
sub_1_ICA_2 9 5 6 0 9 p0
sub_2_ICA_0 4 7 1 1 7 p1
sub_2_ICA_1 2 1 2 0 2 p0
sub_2_ICA_2 3 0 4 2 4 p2
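To make explicit what the three derived columns hold, here is a plain-Python sketch that reproduces them row by row without pandas. It is only an illustration of the semantics (position of the maximum, its value, its column name); the summarize helper is hypothetical, and on ties the first maximum wins, which matches the sub_2_ICA_1 row above.

```python
rows = {
    'sub_1_ICA_0': {'p0': 7, 'p1': 1, 'p2': 5},
    'sub_1_ICA_1': {'p0': 8, 'p1': 3, 'p2': 9},
    'sub_1_ICA_2': {'p0': 9, 'p1': 5, 'p2': 6},
    'sub_2_ICA_0': {'p0': 4, 'p1': 7, 'p2': 1},
    'sub_2_ICA_1': {'p0': 2, 'p1': 1, 'p2': 2},
    'sub_2_ICA_2': {'p0': 3, 'p1': 0, 'p2': 4},
}
cols = ['p0', 'p1', 'p2']

def summarize(row, cols):
    """Return (position of max, max value, name of max column) for one row."""
    values = [row[c] for c in cols]
    # max() returns the first maximal position when several values tie.
    position = max(range(len(values)), key=values.__getitem__)
    return position, values[position], cols[position]

summary = {label: summarize(row, cols) for label, row in rows.items()}
for label, (a, b, c) in summary.items():
    print(label, a, b, c)
```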

MATLAB: How to read PRE tag and create cellarray with NaN

I am trying to read data from an HTML file.
The data are delimited by <PRE></PRE> tags,
e.g.:
<pre>
12.0 29132 -60.3 -91.4 1 0.01 260 753.2 753.3 753.2
10.0 30260 -57.9 1 0.01 260 58 802.4 802.5 802.4
9.8 30387 -57.7 -89.7 1 0.01 261 61 807.8 807.9 807.8
6.0 33631 -40.4 -77.4 1 0.17 260 88 1004.0 1006.5 1004.1
5.9 33746 -40.3 -77.3 1 0.17 1009.2 1011.8 1009.3
</pre>
t = regexp(html, '<PRE[^>]*>(.*?)</PRE>', 'tokens');
where t is a cell array of char.
Now I would like to replace the blank fields with NaN, to obtain:
12.0 29132 -60.3 -91.4 1 0.01 260 NaN 753.2 753.3 753.2
10.0 30260 -57.9 NaN 1 0.01 260 58 802.4 802.5 802.4
9.8 30387 -57.7 -89.7 1 0.01 261 61 807.8 807.9 807.8
6.0 33631 -40.4 -77.4 1 0.17 260 88 1004.0 1006.5 1004.1
5.9 33746 -40.3 -77.3 1 0.17 NaN NaN 1009.2 1011.8 1009.3
This data will be saved to the file mydata.dat.
If you have the HTML file hosted somewhere, then:
url = 'http://www.myDomain.com/myFile.html';
html = urlread(url);
% Use regular expressions to remove undesired HTML markup.
txt = regexprep(html,'<script.*?/script>','');
txt = regexprep(txt,'<style.*?/style>','');
% Do not strip the whole <pre>...</pre> block: it holds the data. Only its tags are removed below.
txt = regexprep(txt,'<.*?>','')
Now you should have the data in text format in the txt variable. You can use textscan to parse txt, scanning either for the whitespace or for the numbers.
More Info:
- urlread
- regexprep
This isn't the perfect solution but it seems to get you there.
Assuming t is one long string, the delimiter is white space, and you know the number of columns:
numcols = 7;
sample = '1  2  3  4  5    7  1    3    5    7'; % two spaces between values, four spaces where a value is missing
test = textscan(sample,'%f','delimiter',' ','MultipleDelimsAsOne',false);
test = test{:}; % Pull the double out of the cell array
test(2:2:end) = []; % Dump out extra NaNs
test2 = reshape(test,numcols,length(test)/numcols)'; % Have to mess with it a little to reshape rowwise instead of columnwise
Returns:
test2 =
1 2 3 4 5 NaN 7
1 NaN 3 NaN 5 NaN 7
This is assuming the delimiter is white space and constant. Textscan doesn't allow you to stack whitespace as a delimiter, so it throws a NaN after each white space character if there isn't data present. In your example data there are two white space characters between each data point, so every other NaN (or, more generically, n_whitespace - 1) can be thrown out, leaving you with the NaNs you actually want.
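The same trick can be sketched in Python, where splitting on a single space also yields an empty field for every extra space. This assumes, as the answer does, exactly two spaces between values and a missing value represented by nothing at all (so four spaces in a row); the sample string and the function name are made up for the illustration.

```python
import math

def parse_double_spaced(sample, numcols):
    """Parse a double-space-delimited string into rows, with NaN for missing values."""
    fields = sample.split(' ')  # one field per single space; '' means nothing was there
    values = [float(f) if f else math.nan for f in fields]
    values = values[0::2]       # drop the extra NaN produced by each doubled space
    if len(values) % numcols:
        raise ValueError("value count is not a multiple of numcols")
    return [values[r:r + numcols] for r in range(0, len(values), numcols)]

# Two spaces between values, four spaces where a value is missing.
sample = '1  2  3  4  5    7  1    3    5    7'
rows = parse_double_spaced(sample, numcols=7)
for row in rows:
    print(row)
```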

Matlab: Read in ascii with varying column numbers & different formats

I have been working on a puzzling question that involves reading an ascii file into matlab that contains 2 parts of different formats, the first part also including different column numbers.
MESH2D
MESHNAME "XXX"
E3T 1 1 29 30 1
E4Q 2 2 31 29 1 1
E4Q 3 31 2 3 32 1
...
...
...
ND 120450 5.28760039e+004 7.49260000e+004 8.05500000e+002
ND 120451 5.30560039e+004 7.49260000e+004 6.84126709e+002
ND 120452 5.32360039e+004 7.49260000e+004 6.97750000e+002
ND 120453 5.34010039e+004 7.49110000e+004 7.67000000e+002
NS 1 2 3 4 5 6 7 8 9 10
NS 11 12 13 14 15 16 17 18 19 20
NS 21 22 23 24 25 26 27 -28
BEGPARAMDEF
GM "Mesh"
I am only interested in the lines that contain the triangles and start with E3T/E4Q, and in the corresponding lines that hold the coordinates of the triangle nodes and start with ND. For the triangles (E3T/E4Q lines) I am only interested in the first 4 numbers, so I was trying to do something like this:
fileID = fopen(test);
t1 = textscan(fileID, '%s',3);
t2 = textscan(fileID, '%s %d %d %d*[^\n]');
fclose(fileID);
The idea is to read in the header to jump to the data, then read the first string and the following 4 numbers, then jump to the end of the line and repeat. But this does not work: I only get a single line of data and not the rest of the file. Also, I do not know how to treat the second part of the file, which starts after an arbitrary number of lines (which I could of course look up manually and feed into MATLAB, but I would prefer MATLAB to find this change in format automatically).
Do you have any suggestions?
Cheers!
I suggest that you first read all the lines of the file with textscan as strings, and then filter out whatever you need:
fid = fopen(filename, 'r');
C = textscan(fid, '%s', 'delimiter', '');
fclose(fid);
Then parse only the E3T/E4Q/ND lines using regexp:
C = regexp(C{1}, '(\w*)(.*)', 'tokens'); % textscan returns a 1x1 cell; C{1} holds the lines
C = cellfun(@(x){x{1}{1}, str2num(x{1}{2})}, C, 'UniformOutput', false);
C = vertcat(C{:});
And then group corresponding E3T/E4Q and ND lines:
idx1 = strcmp(C(:, 1), 'E3T') | strcmp(C(:, 1), 'E4Q');
idx2 = strcmp(C(:, 1), 'ND');
N = max(nnz(idx1), nnz(idx2));
indices = cellfun(@(x)x(1:4), C(idx1, 2), 'UniformOutput', false);
S = struct('tag', [C(idx1, 1); cell(N - nnz(idx1), 1)], ...
'indices', [indices; cell(N - nnz(idx1), 1)], ...
'nodes', [C(idx2, 2); cell(N - nnz(idx2), 1)]);
I named the E3T/E4Q values "indices" and ND values "nodes". The resulting array S contains structs, each having three fields: tag (either E3T or E4Q), indices and nodes. Note that if you have more "indices" than "nodes" or vice versa, the missing values are indicated by an empty matrix.
I know this is not perfect, but if your files are not too big you can do something like this:
fileID = fopen(test,'r');
while ~feof(fileID)
FileLine = fgetl(fileID);
[LineHead,Rem] = strtok(FileLine); % separated string header and numbers
switch LineHead
case 'MESH2D'
% do something here
case 'MESHNAME'
% do something here
case 'E3T'
% parse integer numbers
[Num,NumCount] = sscanf(Rem, '%d');
case 'E4Q'
% parse integer numbers
[Num,NumCount] = sscanf(Rem, '%d');
case 'ND'
% parse the node number and the floating-point coordinates
[Num,NumCount] = sscanf(Rem, '%g');
% or if you prefer to parse first number separately
[strFirst,strOthers] = strtok(Rem);
FirstInteger = str2num(strFirst);
[Floats,FloatsCount] = sscanf(strOthers, '%g');
% and so on...
end
end
fclose(fileID);
Of course, you have to handle lines starting with MESH2D, MESHNAME, or GM separately.
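For comparison, the same read-a-line-and-dispatch-on-its-tag approach can be sketched in Python. This is a minimal illustration, not a port of either answer: the parse_mesh name and the returned structures are assumptions, and every tag other than E3T, E4Q and ND is simply skipped.

```python
def parse_mesh(lines):
    """Collect element and node records from the lines of a mesh file."""
    elements = []  # (tag, first four integers on the line)
    nodes = {}     # node number -> (x, y, z)
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        tag = parts[0]
        if tag in ('E3T', 'E4Q'):
            elements.append((tag, [int(p) for p in parts[1:5]]))
        elif tag == 'ND':
            nodes[int(parts[1])] = tuple(float(p) for p in parts[2:5])
        # MESH2D, MESHNAME, NS, BEGPARAMDEF, GM and anything else are ignored here.
    return elements, nodes

sample_lines = [
    'MESH2D',
    'MESHNAME "XXX"',
    'E3T 1 1 29 30 1',
    'E4Q 2 2 31 29 1 1',
    'ND 120450 5.28760039e+004 7.49260000e+004 8.05500000e+002',
    'NS 1 2 3 4 5 6 7 8 9 10',
]
elements, nodes = parse_mesh(sample_lines)
print(elements)
print(nodes)
```

To read from disk, pass an open file object instead of the list, since iterating over a file yields its lines one at a time.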