Matlab: Read in ascii with varying column numbers & different formats - regex

I have been working on a puzzling problem: reading an ASCII file into MATLAB that contains two parts with different formats, where the first part also has a varying number of columns.
MESH2D
MESHNAME "XXX"
E3T 1 1 29 30 1
E4Q 2 2 31 29 1 1
E4Q 3 31 2 3 32 1
...
...
...
ND 120450 5.28760039e+004 7.49260000e+004 8.05500000e+002
ND 120451 5.30560039e+004 7.49260000e+004 6.84126709e+002
ND 120452 5.32360039e+004 7.49260000e+004 6.97750000e+002
ND 120453 5.34010039e+004 7.49110000e+004 7.67000000e+002
NS 1 2 3 4 5 6 7 8 9 10
NS 11 12 13 14 15 16 17 18 19 20
NS 21 22 23 24 25 26 27 -28
BEGPARAMDEF
GM "Mesh"
I am only interested in the lines that contain the triangles and start with E3T/E4Q, and in the corresponding lines that hold the coordinates of the nodes of the triangles and start with ND. For the triangles (E3T/E4Q lines) I am only interested in the first 4 numbers, therefore I was trying to do something like this:
fileID = fopen(test);
t1 = textscan(fileID, '%s',3);
t2 = textscan(fileID, '%s %d %d %d*[^\n]');
fclose(fileID);
So I read in the header to jump to the data, then read the first string and the following 4 numbers, then jump to the end of the line and restart. But this does not work: I only get a single line with data and not the rest of the file. Also, I do not know how to treat the second part of the file, which starts at an arbitrary line number (which I could of course look up manually and feed into MATLAB, but I would prefer MATLAB to find this change in format automatically).
Do you have any suggestions?
Cheers!

I suggest that you first read all the lines of the file with textscan as strings, and then filter out whatever you need:
fid = fopen(filename, 'r');
C = textscan(fid, '%s', 'delimiter', '');
fclose(fid);
Then parse only the E3T/E4Q/ND lines using regexp:
C = regexp(C{1}, '(\w*)(.*)', 'tokens'); % textscan returns a 1x1 cell holding the cell array of lines
C = cellfun(@(x){x{1}{1}, str2num(x{1}{2})}, C, 'UniformOutput', false);
C = vertcat(C{:});
And then group corresponding E3T/E4Q and ND lines:
idx1 = strcmp(C(:, 1), 'E3T') | strcmp(C(:, 1), 'E4Q');
idx2 = strcmp(C(:, 1), 'ND');
N = max(nnz(idx1), nnz(idx2));
indices = cellfun(@(x)x(1:4), C(idx1, 2), 'UniformOutput', false);
S = struct('tag', [C(idx1, 1); cell(N - nnz(idx1), 1)], ...
'indices', [indices; cell(N - nnz(idx1), 1)], ...
'nodes', [C(idx2, 2); cell(N - nnz(idx2), 1)]);
I named the E3T/E4Q values "indices" and ND values "nodes". The resulting array S contains structs, each having three fields: tag (either E3T or E4Q), indices and nodes. Note that if you have more "indices" than "nodes" or vice versa, the missing values are indicated by an empty matrix.

I know this is not perfect, but if your files are not too big you can do something like this:
fileID = fopen(test,'r');
while ~feof(fileID)
FileLine = fgetl(fileID);
[LineHead,Rem] = strtok(FileLine); % separated string header and numbers
switch LineHead
case 'MESH2D'
% do something here
case 'MESHNAME'
% do something here
case 'E3T'
% parse integer numbers
[Num,NumCount] = sscanf(Rem, '%d');
case 'E4Q'
% parse integer numbers
[Num,NumCount] = sscanf(Rem, '%d');
case 'ND'
% parse numbers (node id followed by floating-point coordinates)
[Num,NumCount] = sscanf(Rem, '%g');
% or if you prefer to parse first number separately
[strFirst,strOthers] = strtok(Rem);
FirstInteger = str2num(strFirst);
[Floats,FloatsCount] = sscanf(strOthers, '%g');
% and so on...
end
end
fclose(fileID);
Of course, you have to handle strings starting with MESH2D, MESHNAME, or GM separately.
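For readers who want to cross-check the parse outside MATLAB, the same line-by-line dispatch can be sketched in Python. This is only an illustration under assumptions: the function name parse_mesh and the returned containers are hypothetical, and it keeps just the first four integers of each E3T/E4Q line and the three coordinates of each ND line, ignoring every other tag.

```python
def parse_mesh(lines):
    """Collect element and node records from an iterable of text lines."""
    elements = []   # list of (tag, [first four integers])
    nodes = {}      # node id -> (x, y, z)
    for line in lines:
        parts = line.split()
        if not parts:
            continue                      # skip blank lines
        tag = parts[0]
        if tag in ("E3T", "E4Q"):
            elements.append((tag, [int(p) for p in parts[1:5]]))
        elif tag == "ND":
            nodes[int(parts[1])] = tuple(float(p) for p in parts[2:5])
        # any other tag (MESH2D, MESHNAME, NS, GM, ...) is ignored here
    return elements, nodes


sample = [
    "MESH2D",
    'MESHNAME "XXX"',
    "E3T 1 1 29 30 1",
    "E4Q 2 2 31 29 1 1",
    "ND 120450 5.28760039e+004 7.49260000e+004 8.05500000e+002",
    "NS 1 2 3 4 5 6 7 8 9 10",
]
elements, nodes = parse_mesh(sample)
print(elements)
print(nodes)
```

Because each line is dispatched on its leading tag, the change of format between the element block and the node block does not have to be located in advance.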

Related

Optimal way to compress 60 bit string

I have 15 random hexadecimal digits (60 bits), where every 20-bit run (5 hexadecimal digits) always contains at least one duplicated digit.
What is the optimal way to compress the bytes?
Here are some examples:
01230 45647 789AA
D8D9F 8AAAF 21052
20D22 8CC56 AA53A
AECAB 3BB95 E1E6D
9993F C9F29 B3130
Initially I've been trying to use Huffman encoding on just 20 bits because huffman coding can go from 20 bits down to ~10 bits but storing the table takes more than 9 bits.
Here is the breakdown showing 20 bits -> 10 bits for 01230
Character Frequency Assignment Space Savings
0 2 0 2×4 - 2×1 = 6 bits
2 1 10 1×4 - 1×2 = 2 bits
1 1 110 1×4 - 1×3 = 1 bits
3 1 111 1×4 - 1×3 = 1 bits
I then tried to do huffman encoding on all 300 bits (five 60bit runs) and here is the mapping given the above example:
Character Frequency Assignment Space Savings
---------------------------------------------------------
a 10 101 10×4 - 10×3 = 10 bits
9 8 000 8×4 - 8×3 = 8 bits
2 7 1111 7×4 - 7×4 = 0 bits
3 6 1101 6×4 - 6×4 = 0 bits
0 5 1100 5×4 - 5×4 = 0 bits
5 5 1001 5×4 - 5×4 = 0 bits
1 4 0010 4×4 - 4×4 = 0 bits
8 4 0111 4×4 - 4×4 = 0 bits
d 4 0101 4×4 - 4×4 = 0 bits
f 4 0110 4×4 - 4×4 = 0 bits
c 4 1000 4×4 - 4×4 = 0 bits
b 4 0011 4×4 - 4×4 = 0 bits
6 3 11100 3×4 - 3×5 = -3 bits
e 3 11101 3×4 - 3×5 = -3 bits
4 2 01000 2×4 - 2×5 = -2 bits
7 2 01001 2×4 - 2×5 = -2 bits
This yields a saving of 8 bits overall, but 8 bits isn't enough to store the Huffman table. It seems that, because of the randomness of the data, the more bits you try to encode with Huffman the less effective it becomes. Huffman encoding seemed to work best with 20 bits (50% reduction), but storing the table in 9 or fewer bits isn't possible AFAIK.
In the worst case for a 60-bit string there are still at least 3 duplicates; in the average case there are more than 3 duplicates (my assumption). Because there are at least 3 duplicates, the most distinct symbols you can have in a run of 60 bits is just 12.
Because of the duplicates plus the fewer than 16 symbols, I can't help but feel that there is some type of compression that can be used.
If I simply count the number of 20-bit values with at least two hexadecimal digits equal, there are 524,416 of them. A smidge more than 2^19. So the most you could possibly save is a little less than one bit out of the 20.
Hardly seems worth it.
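The count can be checked by brute force. The following Python sketch (the function name is made up for illustration) enumerates all 16^5 five-digit hexadecimal strings and counts those with a repeated digit:

```python
from itertools import product
from math import log2


def count_groups_with_duplicate(length=5, alphabet="0123456789ABCDEF"):
    """Count strings of the given length in which at least one symbol repeats."""
    return sum(1 for s in product(alphabet, repeat=length) if len(set(s)) < length)


count = count_groups_with_duplicate()
print(count)        # 524416
print(log2(count))  # just above 19, so about 19 bits are still needed per group
```

The same number follows from the closed form 16^5 - 16*15*14*13*12, i.e. all strings minus those with five distinct digits.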
If I split your question in two parts:
How do I compress (perfect) random data: You can't. Every bit is some new entropy which can't be "guessed" by a compression algorithm.
How to compress "one duplicate in five characters": There are exactly 10 options where the duplicate can be (see table below). This is basically the entropy. Just store which option it is (maybe grouped for the whole line).
These are the options:
AAbcd = 1 AbAcd = 2 AbcAd = 3 AbcdA = 4 (<-- cases where first character is duplicated somewhere)
aBBcd = 5 aBcBd = 6 aBcdB = 7 (<-- cases where second character is duplicated somewhere)
abCCd = 8 abCdC = 9 (<-- cases where third character is duplicated somewhere)
abcDD = 0 (<-- cases where last characters are duplicated)
So for your first example:
01230 45647 789AA
The first one (01230) is option 4, the second 3 and the third option 0.
You can compress this by multiplying each consecutive by 10: (4*10 + 3)*10 + 0 = 430
And uncompress it by using divide and modulo: 430%10=0, (430/10)%10=3, (430/10/10)%10=4. So you could store your number like that:
1AE 0123 4567 789A
^^^ this is 430 in hex and requires only 10 bit
The three options combined give 1000 possible values (0 to 999), so 10 bits are enough.
Compared to storing these 3 characters normally you save 2 bit. As someone else already commented - this is probably not worth it. For the whole line it's even less: 2 bit / 60 bit = 3.3% saved.
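The option scheme can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes, as the table does, that each 5-digit group contains at least one duplicated pair; the first matching pair in table order is used.

```python
from itertools import combinations

# Position pairs in the order of the table: options 1..9, then 0 for the last pair.
PAIRS = list(combinations(range(5), 2))
OPTION_OF = {pair: (k + 1) % 10 for k, pair in enumerate(PAIRS)}
PAIR_OF = {option: pair for pair, option in OPTION_OF.items()}


def encode_group(group):
    """Return (option, four remaining digits) for a 5-digit group."""
    for i, j in PAIRS:
        if group[i] == group[j]:
            return OPTION_OF[(i, j)], group[:j] + group[j + 1:]
    raise ValueError("group has no duplicated digit")


def decode_group(option, rest):
    """Re-insert the dropped duplicate at its original position."""
    i, j = PAIR_OF[option]
    return rest[:j] + rest[i] + rest[j:]


def pack(options):
    """Combine decimal options into one integer: (4*10 + 3)*10 + 0 = 430."""
    n = 0
    for option in options:
        n = n * 10 + option
    return n


def unpack(n, count=3):
    """Recover the options with divide and modulo."""
    options = []
    for _ in range(count):
        n, option = divmod(n, 10)
        options.append(option)
    return options[::-1]


encoded = [encode_group(g) for g in "01230 45647 789AA".split()]
packed = pack([option for option, _ in encoded])
print(format(packed, "X"), [rest for _, rest in encoded])  # 1AE ['0123', '4567', '789A']
```

Decoding reverses the steps: unpack the three options, then call decode_group on each option and its four stored digits.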
If you want to get rid of the duplicates first, do this, then look at the links at the bottom of the page. If you don't want to get rid of the duplicates, then still look at the links at the bottom of the page:
Array.prototype.contains = function(v) {
for (var i = 0; i < this.length; i++) {
if (this[i] === v) return true;
}
return false;
};
Array.prototype.unique = function() {
var arr = [];
for (var i = 0; i < this.length; i++) {
if (!arr.contains(this[i])) {
arr.push(this[i]);
}
}
return arr;
}
var duplicates = [1, 3, 4, 2, 1, 2, 3, 8];
var uniques = duplicates.unique(); // result = [1,3,4,2,8]
console.log(uniques);
Then you would have shortened your code that you have to deal with. Then you might want to check out Smaz
Smaz is a simple compression library suitable for compressing strings.
If that doesn't work, then you could take a look at this:
http://ed-von-schleck.github.io/shoco/
Shoco is a C library to compress and decompress short strings. It is very fast and easy to use. The default compression model is optimized for english words, but you can generate your own compression model based on your specific input data.
Let me know if it works!

Extract values from a .dat file with Fortran, with lines and specific variables

I need to read the values of NPNOD, NELEM and the other variables, and then read the values of the matrix that follows.
$DIMENSIONES DEL PROBLEMA
DIMENSIONES : NPNOD= 27 , NELEM= 8 , NMATS= 1 , \
NNODE= 8 , NDIME= 3 , \
NCARG= 1 , NGDLN= 3, NPROP= 5, \
NGAUS= 1 , NTIPO= 1 , IWRIT= 1 ,\
INDSO= 10 , NPRES= 9
$---------------------------------------------------------
GEOMETRIA
$ CONECTIVIDADES ELEMENTALES
$ ELEM. MATER. SECUENCIA DE CONECTIVIDADES
1 1 8 6 12 20 18 15 23 25
2 1 19 8 20 24 26 18 25 27
3 1 5 2 6 8 14 11 15 18
4 1 17 5 8 19 21 14 18 26
5 1 7 4 9 13 8 6 12 20
6 1 16 7 13 22 19 8 20 24
7 1 3 1 4 7 5 2 6 8
8 1 10 3 7 16 17 5 8 19
Reading a mixture of characters and numbers in Fortran is best done by first reading the whole line into a character string and then reading the respective numerical values from this string. The details will depend a lot on the flexibility you need to deal with changing input formats. The more you can rely on the assumption that an input file will always have an identical structure, the easier things get.
You did not specify in your question the details of the 10 numbers in the rows numbered 1 to 8. Let's assume that the first number is the row of the matrix, the second number is the number of the current matrix, and the remaining eight numbers are the elements. Let's further assume that the elements in one row of the matrix will always be listed in one single input line.
character(len=5), dimension(10) :: fields
character(len=80) :: string
character(len=1024) :: grand
integer, dimension(10) :: values
fields(1) = 'NPNOD' ! and so on ...
grand = ' ' ! blank the buffer so that len_trim works on it
read(unit=ird, fmt='(a)', iostat=ios) string ! skip the comment line '$DIMENSIONES ...'
read(unit=ird, fmt='(a)', iostat=ios) string ! read line 'DIMENSIONES : NPNOD= ...'
grand(1:80) = string ! place into grand total
read(unit=ird, fmt='(a)', iostat=ios) string ! read the next continuation line
grand(81:160) = string ! append to grand total
...! repeat for three more lines
grand(len_trim(grand)+1:len_trim(grand)+1) = ',' ! Append a final comma
do i=1,10 ! Loop over all field names
  ilen = len_trim(fields(i)) ! store length of field name, may vary?
  ipos = index(grand, fields(i)) ! Find start of field name
  icom = index(grand(ipos+ilen+1:len(grand)), ',') ! locate trailing comma
  read(grand(ipos+ilen+1:ipos+ilen+icom-1),*) values(i) ! Read numerical value
enddo
read(unit=ird, fmt='(a)', iostat=ios) string ! skip the '$-----' separator
read(unit=ird, fmt='(a)', iostat=ios) string ! skip 'GEOMETRIA'
read(unit=ird, fmt='(a)', iostat=ios) string ! skip comment line
read(unit=ird, fmt='(a)', iostat=ios) string ! skip column header line
do i= 1, values(2) ! if NELEM is in fields(2)
  read(ird, *) irow, imat, (array(i,j),j=1, values(4)) ! leading two numbers, then NNODE elements if NNODE is in fields(4)
enddo
You still have to define the integer variables, ird, ilen, ipos, icom, irow, imat, ios, i, j, the matrix array or the many matrices you need to read.
Upon a read the value of the status variable ios should be inspected...
Essentially I do:
define a character array fields with all the names 'NPNOD' ...
read and concatenate the lines 'dimensiones' until 'INDSO' into the string "grand"
Loop over all field and
detect position of the field name and the position of the trailing comma
read from the grand string the subsection that contains only the numerical value
Loop over the rows with matrix elements to read the two first numbers and the matrix elements.
As I appended a final comma, that is missing in the line 'INDSO...', the loop over the field names does not have to bother with the special case of 'NPRES', which does not have a trailing comma in the original input file.
Your actual code should check:
whether there can ever be more than 10 fields,
whether 80 characters is the maximum length of any individual input line,
whether 1024 characters will be enough for the concatenated list.
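To prototype the name/value extraction before writing the Fortran, the idea can be sketched in Python. Note that this swaps in a regular expression instead of the index/comma search described above, and the function name read_dimensions is hypothetical; it assumes every value is a non-negative integer written as NAME= value.

```python
import re

SAMPLE = r"""$DIMENSIONES DEL PROBLEMA
DIMENSIONES : NPNOD= 27 , NELEM= 8 , NMATS= 1 , \
NNODE= 8 , NDIME= 3 , \
NCARG= 1 , NGDLN= 3, NPROP= 5, \
NGAUS= 1 , NTIPO= 1 , IWRIT= 1 ,\
INDSO= 10 , NPRES= 9
"""


def read_dimensions(text):
    """Return a dict of NAME -> integer for every 'NAME= value' pair in text."""
    # Join continuation lines first, mirroring the concatenation into one string.
    joined = text.replace("\\\n", " ")
    return {name: int(value)
            for name, value in re.findall(r"([A-Z]+)\s*=\s*(\d+)", joined)}


dims = read_dimensions(SAMPLE)
print(dims["NPNOD"], dims["NELEM"], dims["NPRES"])  # 27 8 9
print(len(dims))                                    # 13
```

The sample header yields 13 names, which is worth keeping in mind when sizing the fields and values arrays.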

Python multidimensional array

I'm a bit of a beginner in Python, but I think I have a simple question. I am using image processing to detect lines in an image
lines = cv2.HoughLinesP(edges,1,np.pi/180,50,minLineLength,maxLineGap)
lines.shape is (151, 1, 4), meaning that I've detected 151 lines, each with 4 parameters x1, y1, x2, y2.
What I want to do is add another factor to lines, called slope, thus increasing lines.shape to (151, 1, 5). I know I can concatenate an empty array of zeros at the end of lines, but how do I make it so I can call it in a for loop or the like?
For example I want to be able to say
for slope in lines:
    # do stuff
Unfortunately, the HoughLinesP function returns a numpy array of type int32. I stayed up past my bedtime to figure this out, though, so I'm going to post it anyways. I just multiplied the slopes by 1000 and put them in the array like that. Hopefully, it's still useful to you.
slopes = []
for row in lines:
    slopes.append((row[0][1] - row[0][3]) / float(row[0][0] - row[0][2]) * 1000)
new_column = []
for slope in slopes:
    new_column.append([slope])
new_array = np.insert(lines, 4, new_column, axis=2)
print(lines)
print()
print(new_array)
Sample output:
[[[14 66 24 66]]
[[37 23 54 56]]
[[ 7 62 28 21]]
[[70 61 81 61]]
[[24 64 42 64]]]
[[[ 14 66 24 66 0]]
[[ 37 23 54 56 1941]]
[[ 7 62 28 21 -1952]]
[[ 70 61 81 61 0]]
[[ 24 64 42 64 0]]]
Edit: Better (and full) code with same output
import cv2
import numpy as np
img = cv2.imread('cmake_logo-main.png')
gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(img,50,150,apertureSize = 3)
lines = cv2.HoughLinesP(edges,1,np.pi/180,50,3,10)
def slope(xy):
    return (xy[1] - xy[3]) / float(xy[0] - xy[2]) * 1000
new_column = [[slope(row[0])] for row in lines]
new_array = np.insert(lines, 4, new_column, axis=2)
print(lines)
print()
print(new_array)
Based on your comments, here's my guess as to what you should do:
lines = np.squeeze(lines)
# remove the unneeded middle dim, a convenience, but not required
slope = <some calculation> # expect (151,) array of floats
mask = np.ones((151,),dtype=bool) # boolean mask
<assign False to mask for all lines you want to delete>
<alt start with False, and set True to keepers>
lines = lines[mask]
slope = slope[mask]
Alternatively you could extend lines with np.hstack([lines, np.zeros((151,1))]) (or concatenate on axis 1). But if as Jason thinks, lines is dtype int, and slope must be float, that won't work. You'd have to use his scaling solution.
You could also use a structured array to combine the ints and float columns into one array. Why do that if it is just as easy to keep slope as separate variable?
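A minimal sketch of both ideas, using a small hand-made array in place of the real HoughLinesP result. The sample values, the near-horizontal filter and the field names "xy" and "slope" are assumptions made for illustration only.

```python
import numpy as np

# Stand-in for the HoughLinesP output: shape (N, 1, 4), dtype int32.
lines = np.array([[[14, 66, 24, 66]],
                  [[37, 23, 54, 56]],
                  [[7, 62, 28, 21]]], dtype=np.int32)

xy = np.squeeze(lines, axis=1).astype(float)   # shape (N, 4): x1, y1, x2, y2
dx = xy[:, 2] - xy[:, 0]
dy = xy[:, 3] - xy[:, 1]
# Vertical segments (dx == 0) get an infinite slope instead of a division error.
slope = np.divide(dy, dx, out=np.full(len(xy), np.inf), where=dx != 0)

# Option 1: keep slope as a separate float array and filter both with one mask.
keep = np.abs(slope) < 1.0                      # hypothetical filter
lines_kept, slope_kept = lines[keep], slope[keep]

# Option 2: a structured array holding the int coordinates and the float slope.
combined = np.zeros(len(xy), dtype=[("xy", np.int32, (4,)), ("slope", np.float64)])
combined["xy"] = np.squeeze(lines, axis=1)
combined["slope"] = slope

for segment, s in zip(combined["xy"], combined["slope"]):
    print(segment, s)
```

With separate arrays, iterating with zip(lines, slope) gives each line together with its slope without changing the dtype of lines.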

Counting ways of breaking up a string of digits into numbers under 26

Given a string of digits, I wish to find the number of ways of breaking up the string into individual numbers so that each number is under 26.
For example, "8888888" can only be broken up as "8 8 8 8 8 8 8". Whereas "1234567" can be broken up as "1 2 3 4 5 6 7", "12 3 4 5 6 7" and "1 23 4 5 6 7".
I'd like both a recurrence relation for the solution, and some code that uses dynamic programming.
This is what I've got so far. It only covers the base cases: an empty string should return 1, a string of one digit should return 1, and a string in which every digit is larger than 2 should return 1.
int countPerms(vector<int> number, int currentPermCount)
{
vector< vector<int> > permsOfNumber;
vector<int> working;
int totalPerms=0, size=number.size();
bool areAllOverTwo=true, forLoop = true;
if (number.size() <=1)
{
//TODO: print out permutations
return 1;
}
for (int i = 0; i < number.size()-1; i++) //minus one here because we don't care what the last digit is: if all the digits before it are over 2, there is only one way to decode them
{
if (number.at(i) <= 2)
{
areAllOverTwo = false;
}
}
if (areAllOverTwo) //if all the numbers are over 2 then there is only one possible combination, e.g. 3456676546 has only one combination.
{
permsOfNumber.push_back(number);
//TODO: write function to print out the permutations
return 1;
}
do
{
//TODO find all the permutations here
} while (forLoop);
return totalPerms;
}
Assuming you either don't have zeros, or you disallow numbers with leading zeros, the recurrence relations are:
N(1aS) = N(S) + N(aS)
N(2aS) = N(S) + N(aS) if a < 6.
N(a) = 1
N(aS) = N(S) otherwise
Here, a refers to a single digit, and S to a number. The first line of the recurrence relation says that if your string starts with a 1, then you can either have it on its own, or join it with the next digit. The second line says that if you start with a 2 you can either have it on its own, or join it with the next digit assuming that gives a number less than 26. The third line is the termination condition: when you're down to 1 digit, the result is 1. The final line says if you haven't been able to match one of the previous rules, then the first digit can't be joined to the second, so it must stand on its own.
The recurrence relations can be implemented fairly directly as an iterative dynamic programming solution. Here's code in Python, but it's easy to translate into other languages.
def N(S):
    a1, a2 = 1, 1
    for i in range(len(S) - 2, -1, -1):
        if S[i] == '1' or S[i] == '2' and S[i+1] < '6':
            a1, a2 = a1 + a2, a1
        else:
            a1, a2 = a1, a1
    return a1
print(N('88888888'))
print(N('12345678'))
Output:
1
3
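For comparison, the same recurrence can also be written top-down with memoisation. This is a sketch, not a replacement for the iterative version: the name count_splits is made up here, it assumes the input contains no zeros, and it treats the empty suffix as having exactly one split.

```python
from functools import lru_cache


def count_splits(digits):
    """Number of ways to split a digit string into numbers under 26."""

    @lru_cache(maxsize=None)
    def ways(i):
        # ways(i) counts the splits of the suffix digits[i:].
        if i >= len(digits) - 1:
            return 1                      # empty suffix or a single digit
        first, second = digits[i], digits[i + 1]
        if first == "1" or (first == "2" and second < "6"):
            # The first digit stands alone, or joins the next digit.
            return ways(i + 1) + ways(i + 2)
        return ways(i + 1)                # the first digit must stand alone

    return ways(0)


print(count_splits("8888888"))  # 1
print(count_splits("1234567"))  # 3
```

Because it recurses once per digit, this form is only suitable for strings shorter than Python's recursion limit; the iterative version has no such restriction.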
An interesting observation is that N('1' * n) is the (n+1)-th Fibonacci number:
for i in range(1, 10):
    print(i, N('1' * i))
Output:
1 1
2 2
3 3
4 5
5 8
6 13
7 21
8 34
9 55
If I understand correctly, there are only 25 possibilities. My first crack at this would be to initialize an array of 25 ints all to zero and when I find a number less than 25, set that index to 1. Then I would count up all the 1's in the array when I was finished looking at the string.
What do you mean by recurrence? If you're looking for a recursive function, you would need to find a good way to break the string of numbers down recursively. I'm not sure that's the best approach here. I would just go through digit by digit and as you said if the digit is 2 or less, then store it and test appending the next digit... i.e. 10*digit + next. I hope that helped! Good luck.
Another way to think about it is that, after the initial single digit possibility, for every sequence of contiguous possible pairs of digits (e.g., 111 or 12223) of length n we multiply the result by:
1 + sum, i=1 to floor (n/2), of (n-i) choose i
For example, with a sequence of 11111, we can have
i=1, 1 1 1 11 => 5 - 1 = 4 choose 1 (possibilities with one pair)
i=2, 1 11 11 => 5 - 2 = 3 choose 2 (possibilities with two pairs)
This seems directly related to Wikipedia's description of Fibonacci numbers' "Use in Mathematics," for example, in counting "the number of compositions of 1s and 2s that sum to a given total n" (http://en.wikipedia.org/wiki/Fibonacci_number).
Using the combinatorial method (or other fast Fibonacci's) could be suitable for strings with very long sequences.
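The formula for a single run can be checked numerically. The sketch below (function names are illustrative, and math.comb needs Python 3.8 or later) evaluates 1 + sum of (n-i) choose i and compares it with Fibonacci numbers computed iteratively:

```python
from math import comb


def ways_for_run(n):
    """1 + sum over i = 1..floor(n/2) of C(n - i, i), for a run of n joinable digits."""
    return 1 + sum(comb(n - i, i) for i in range(1, n // 2 + 1))


def fibonacci(k):
    """k-th Fibonacci number with F(1) = F(2) = 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a


for n in range(1, 10):
    print(n, ways_for_run(n), fibonacci(n + 1))
```

For n = 5 the sum is 1 + 4 + 3 = 8, matching the count for 11111.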

Ternary Numbers, regex

I'm looking for some regex/automata help. I'm limited to + or the Kleene star. Parsing through a string representing a ternary number (like binary, but base 3), I need to be able to know if the result is 1 less than a multiple of 4.
So, for example 120 = 0*1+2*3+1*9 = 9+6 = 15 = 16-1 = 4(n)-1.
Even a pointer to the pattern would be really helpful!
You can generate a series of values to do some observation with bc in bash:
for n in {1..40}; do v=$((4*n-1)); echo -en $v"\t"; echo "ibase=10;obase=3;$v" | bc ; done
3 10
7 21
11 102
15 120
19 201
23 212
27 1000
31 1011
...
Notice that each digit's value (in decimal) is either 1 more or 1 less than something divisible by 4, alternately. So the 1 (lsb) digit is one more than 0, the 3 (2nd) digit is one less than 4, the 9 (3rd) digit is 1 more than 8, the 27 (4th) digit is one less than 28, etc.
If you sum up all the odd-placed digits and all the even-placed digits (counting places from 1 at the least significant digit), then add 1 to the odd-placed sum, the two results should be equal modulo 4.
In your example: odd: (0+1)+1 = 2, even: 2. They are equal, so the number is of the form 4n-1.
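The digit-sum rule can be sketched in Python, together with the equivalent four-state automaton that tracks the value modulo 4 while reading digits from the most significant end. Both function names are made up for illustration, and the automaton is one possible starting point for deriving a regular expression rather than the expression itself.

```python
def digit_sum_rule(ternary):
    """True if the ternary string denotes a number of the form 4n - 1."""
    digits = [int(c) for c in reversed(ternary)]   # least significant digit first
    odd_placed = sum(digits[0::2])    # places 1, 3, 5, ... (weights 1, 9, 81, ...)
    even_placed = sum(digits[1::2])   # places 2, 4, 6, ... (weights 3, 27, ...)
    return (odd_placed + 1 - even_placed) % 4 == 0


def dfa_accepts(ternary):
    """Four-state automaton: the state is the value read so far, modulo 4."""
    state = 0
    for c in ternary:
        state = (3 * state + int(c)) % 4
    return state == 3                 # accepting state: value is 3 modulo 4


for s in ["10", "21", "102", "120", "212", "1000", "11", "100"]:
    print(s, int(s, 3), digit_sum_rule(s), dfa_accepts(s))
```

The transition table of dfa_accepts has only four states and three input symbols, so it can be converted to a regular expression by state elimination.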