Python multidimensional array - python-2.7

I'm a bit of a beginner in Python, but I think I have a simple question. I am using image processing to detect lines in an image
lines = cv2.HoughLinesP(edges,1,np.pi/180,50,minLineLength,maxLineGap)
lines.shape is (151, 1, 4), meaning that I've detected 151 lines, each with 4 parameters: x1, y1, x2, y2.
What I want to do is add another factor to lines, called slope, thus increasing lines.shape to (151, 1, 5). I know I can concatenate an empty array of zeros at the end of lines, but how do I make it so I can call it in a for loop or the like?
For example I want to be able to say
for slope in lines:
    # do stuff

Unfortunately, the HoughLinesP function returns a numpy array of type int32. I stayed up past my bedtime to figure this out, though, so I'm going to post it anyways. I just multiplied the slopes by 1000 and put them in the array like that. Hopefully, it's still useful to you.
slopes = []
for row in lines:
    slopes.append((row[0][1] - row[0][3]) / float(row[0][0] - row[0][2]) * 1000)
new_column = []
for slope in slopes:
    new_column.append([slope])
new_array = np.insert(lines, 4, new_column, axis=2)
print lines
print
print new_array
Sample output:
[[[14 66 24 66]]
[[37 23 54 56]]
[[ 7 62 28 21]]
[[70 61 81 61]]
[[24 64 42 64]]]
[[[ 14 66 24 66 0]]
[[ 37 23 54 56 1941]]
[[ 7 62 28 21 -1952]]
[[ 70 61 81 61 0]]
[[ 24 64 42 64 0]]]
Edit: Better (and full) code with same output
import cv2
import numpy as np

img = cv2.imread('cmake_logo-main.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(img, 50, 150, apertureSize=3)
lines = cv2.HoughLinesP(edges, 1, np.pi/180, 50, 3, 10)

def slope(xy):
    return (xy[1] - xy[3]) / float(xy[0] - xy[2]) * 1000

new_column = [[slope(row[0])] for row in lines]
new_array = np.insert(lines, 4, new_column, axis=2)
print lines
print
print new_array
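As an aside, the slope column can also be built without a Python loop by using numpy's vectorized arithmetic; a quick sketch, assuming the same x1000 integer scaling as above:
pts = lines[:, 0, :].astype(float)                                  # shape (N, 4): x1, y1, x2, y2
slopes = (pts[:, 1] - pts[:, 3]) / (pts[:, 0] - pts[:, 2]) * 1000   # gives inf for vertical lines (x1 == x2)
new_array = np.insert(lines, 4, slopes[:, np.newaxis], axis=2)      # values are cast back to int32 on insert
This produces the same (N, 1, 5) array as the loop version, just in three lines.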

Based on your comments, here's my guess as to what you should do:
lines = np.squeeze(lines)   # remove the unneeded middle dim; a convenience, but not required
slope = <some calculation>  # expect a (151,) array of floats
mask = np.ones((151,), dtype=bool)  # boolean mask
<assign False to mask for all lines you want to delete>
<alternatively, start with False and set True for the keepers>
lines = lines[mask]
slope = slope[mask]
Alternatively, you could extend lines with np.hstack([lines, np.zeros((151,1))]) (or concatenate on axis 1). But if, as Jason thinks, lines has an int dtype and slope must be float, that won't work directly; you'd have to use his scaling solution, or make a float copy first.
You could also use a structured array to combine the int and float columns into one array. But why do that if it is just as easy to keep slope as a separate variable?
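For completeness, a rough sketch of the float-copy route, assuming lines has already been squeezed to shape (N, 4) with columns x1, y1, x2, y2:
lines_f = lines.astype(float)                 # float copy so real (unscaled) slopes fit
dx = lines_f[:, 0] - lines_f[:, 2]
dy = lines_f[:, 1] - lines_f[:, 3]
slope = dy / dx                               # inf where dx == 0 (vertical lines)
lines_with_slope = np.hstack([lines_f, slope[:, np.newaxis]])   # shape (N, 5), all float
Whether you keep the combined array or a separate slope vector is mostly a matter of taste; the masking shown above works the same either way.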

Related

Bubble-sorting rows of Fortran 2D array

I am working on the second part of an assignment which asks me to reorder a matrix such that each row is in monotonically increasing order and so that the first element of each row is monotonically increasing. If two rows have the same initial value, the rows should be ordered by the second element in the row. If those are both the same, it should be the third element, continuing through the last element.
I have written a bubble sort that works fine for the first part (reordering each row). I have written a bubble sort for the second part (making sure that the first element of each row is monotonically increasing). However, I am running into an infinite loop and I do not understand why.
I do understand that the issue is that my "inorder" variable is not eventually getting set to true (which would end the while loop). However, I do not understand why inorder is not getting set to true. My logic is the following: once the following code has swapped rows to the point that the rows are all in order, we will pass through the while loop one more time (and inorder will get set to true), which will cause the while loop to end. I am stumped as to why this isn't happening.
inorder = .false.
loopA: do while ( .not. inorder )            ! While the rows are not ordered
    inorder = .true.
    loopB: do i = 1, rows-1                  ! Iterate through the first column of the array
        if (arr(i,1) > arr(i+1,1)) then      ! If we find a row that is out of order
            inorder = .false.
            tempArr = arr(i+1,:)             ! Swap the corresponding rows
            arr(i+1,:) = arr(i,:)
            arr(i,:) = tempArr
        end if
        if (arr(i,1) == arr(i+1,1)) then     ! The first elements of the rows are the same
            loopC: do j = 2, cols            ! Iterate through the rest of the row to find the first element that is not the same
                if (arr(i,j) > arr(i+1,j)) then  ! Found elements that are not the same and that are out of order
                    inorder = .false.
                    tempArr = arr(i+1,:)     ! Swap the corresponding rows
                    arr(i+1,:) = arr(i,:)
                    arr(i,:) = tempArr
                end if
            end do loopC
        end if
    end do loopB
end do loopA
Example input:
6 3 9 23 80
7 54 78 87 87
83 5 67 8 23
102 1 67 54 34
78 3 45 67 28
14 33 24 34 9
Example (correct) output (that my code is not generating):
1 34 54 67 102
3 6 9 23 80
3 28 45 67 78
5 8 23 67 83
7 54 78 87 87
9 14 24 33 34
It is also possible that staring at this for hours has made me miss something stupid, so I appreciate any pointers.
When you get to comparing rows whose first elements are identical, you then go through the whole row and compare every single element.
So if you have two rows like this:
1 5 3
1 2 4
then the first elements are the same, so it enters the second part of your code.
At the second position, 5 > 2, so it swaps the rows:
1 2 4
1 5 3
But then it doesn't stop. At the third position, 4 > 3, so it swaps them back:
1 5 3
1 2 4
And now you're back to where you were.
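In other words, the inner comparison has to stop at the first position where the rows differ. As a language-neutral sketch of that logic (shown in Python rather than Fortran, purely for illustration):
def row_le(a, b):
    # lexicographic comparison: the first differing element decides the order
    for x, y in zip(a, b):
        if x != y:
            return x < y
    return True  # rows are identical

def sort_rows(arr):
    inorder = False
    while not inorder:
        inorder = True
        for i in range(len(arr) - 1):
            if not row_le(arr[i], arr[i + 1]):
                arr[i], arr[i + 1] = arr[i + 1], arr[i]
                inorder = False
    return arr
In the Fortran version, the same effect comes from leaving loopC as soon as arr(i,j) and arr(i+1,j) differ.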
Cheers

Is there a better option than map?

Well, I am writing a C++ program that goes through long streams of symbols, and for further analysis I need to record where in the stream symbol sequences of a certain length appear. For instance, in the binary stream
100110010101
I have sequences, for example of length 6, like this:
100110 starting on position 0
001100 starting on position 1
011001 starting on position 2
etc.
What I need to store are vectors of all positions where I can find the one certain sequence. So the result should be something like a table, maybe resembling a hash table that look like this:
sequence/ positions
10010101 | 1 13 147 515
01011011 | 67 212 314 571
00101010 | 2 32 148 322 384 419 455
etc.
Now, I figured mapping strings to integers is slow, so because I have information about the symbols in the stream up front, I can use it to map these fixed-length sequences to integers.
The next step was to create a map that maps these "representing integers" to the corresponding index in the table, where I add the next occurrence of the sequence. However, this is slow, much slower than I can afford. I tried both the ordered and unordered maps of both the std and boost libraries, and neither is efficient enough. And I tested it: the map is the real bottleneck here.
And here is the loop in pseudocode:
for (int i = seqleng-1; i < stream.size(); i++) {
    // compute the characteristic value for the sequence by adding one symbol
    charval *= symb_count;
    charval += sdata[j][i] - '0';
    // sampspacesize is the number of all possible sequences with this symbol count and this length
    charval %= sampspacesize;
    map<uint64,uint64>::iterator it = map.find(charval);
    // if the index exists, add the starting position of the sequence to the table
    if (it != map.end()) {
        table[it->second].add(i-seqleng+1);
    }
    // if the current sequence is found for the first time, extend the table and add the index
    else {
        table.add_row();
        map[charval] = table.last_index;
        table[table.last_index].add(i-seqleng+1);
    }
}
So the question is: can I use something better than a map to keep track of the corresponding indices in the table, or is this the best way possible?
NOTE: I know there is a fast way here, and that is creating storage large enough for every possible symbol sequence (meaning if I have sequences of length 10 and 4 symbols, I reserve 4^10 slots and can omit the mapping). But I am going to need to work with lengths and symbol counts that would require reserving an amount of memory way beyond the computer's capacity. The actual number of used slots, however, will not exceed 100 million (which is guaranteed by the maximal stream length), and that can be stored in a computer just fine.
Please ask anything if there is something unclear, this is my first large question here, so I lack experience to express myself the way others would understand.
An unordered map with pre-allocated space is usually the fastest way to store any kind of sparse data.
Given that std::string has SSO I can't see why something like this won't be about as fast as it gets:
(I have used an unordered_multimap but I may have misunderstood the requirements)
#include <unordered_map>
#include <string>
#include <iostream>

using sequence = std::string; /// @todo - perhaps replace with something faster if necessary
using sequence_position_map = std::unordered_multimap<sequence, std::size_t>;

int main()
{
    auto constexpr sequence_size = std::size_t(6);
    sequence_position_map sequences;

    std::string input = "11000111010110100011110110111000001111010101010101111010";

    if (sequence_size <= input.size()) {
        sequences.reserve(input.size() - sequence_size);
        auto first = std::size_t(0);
        auto last = input.size();
        while (first + sequence_size < last) {
            sequences.emplace(input.substr(first, sequence_size), first);
            ++first;
        }
    }

    std::cout << "results:\n";
    auto first = sequences.begin();
    auto last = sequences.end();
    while (first != last) {
        auto range = sequences.equal_range(first->first);
        std::cout << "sequence: " << first->first;
        std::cout << " at positions: ";
        const char* sep = "";
        while (first != range.second) {
            std::cout << sep << first->second;
            sep = ", ";
            ++first;
        }
        std::cout << "\n";
    }
}
output:
results:
sequence: 010101 at positions: 38, 40, 42, 44
sequence: 000011 at positions: 30
sequence: 000001 at positions: 29
sequence: 110000 at positions: 27
sequence: 011100 at positions: 25
sequence: 101110 at positions: 24
sequence: 010111 at positions: 46
sequence: 110111 at positions: 23
sequence: 011011 at positions: 22
sequence: 111011 at positions: 19
sequence: 111000 at positions: 26
sequence: 111101 at positions: 18, 34, 49
sequence: 011110 at positions: 17, 33, 48
sequence: 001111 at positions: 16, 32
sequence: 110110 at positions: 20
sequence: 101010 at positions: 37, 39, 41, 43
sequence: 010001 at positions: 13
sequence: 101000 at positions: 12
sequence: 101111 at positions: 47
sequence: 110100 at positions: 11
sequence: 011010 at positions: 10
sequence: 101101 at positions: 9, 21
sequence: 010110 at positions: 8
sequence: 101011 at positions: 7, 45
sequence: 111010 at positions: 5, 35
sequence: 011101 at positions: 4
sequence: 001110 at positions: 3
sequence: 100000 at positions: 28
sequence: 000111 at positions: 2, 15, 31
sequence: 100011 at positions: 1, 14
sequence: 110001 at positions: 0
sequence: 110101 at positions: 6, 36
After many suggestions in the comments and answers, I tested most of them and picked the fastest option, reducing the bottleneck caused by the mapping to almost the same time the loop took without any "map" at all (which produced incorrect data, but told me the minimum time the mapping step could possibly be reduced to).
This was achieved by replacing the unordered_map<uint64,uint> plus vector<vector<uint>> with a single unordered_map<uint64, vector<uint>>, more precisely boost::unordered_map. I also tested unordered_map<string, vector<uint>>, and it surprised me that it was not as much slower as I expected. However, it was still slower.
Also, probably because an ordered map moves nodes around to keep its internal tree balanced, ordered_map<uint64, vector<uint>> was a bit slower than ordered_map<uint64,uint> together with vector<vector<uint>>. But since an unordered map does not move its internal data during the computation, this seems to be the fastest configuration one can use.
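As a rough sketch of that final layout (a rolling integer code mapping directly to a list of positions), illustrated here in Python for brevity rather than the actual C++:
from collections import defaultdict

def index_sequences(stream, seq_len, symb_count=2):
    # plays the role of unordered_map<uint64, vector<uint>>: code -> list of start positions
    positions = defaultdict(list)
    space = symb_count ** seq_len          # number of possible sequences of this length
    code = 0
    for i, ch in enumerate(stream):
        # rolling code over the last seq_len symbols, as in the charval computation above
        code = (code * symb_count + int(ch)) % space
        if i >= seq_len - 1:
            positions[code].append(i - seq_len + 1)
    return positions

# e.g. index_sequences("100110010101", 6) maps the code of "100110" to [0], of "001100" to [1], etc.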

Slicing a box of certain width along arbitrary line through 3d array

I have a big (600,600,600) numpy array filled with my data. Now I would like to extract regions from this with a given width around an arbitrary line through the box.
For the line I have the x, y and z coordinates of every point in separate numpy arrays. So let's say the line has 35 points in the data box, then the x, y and z arrays each have lengths of 35 as well. I can extract the points along the line itself by using indexing like this
extraction = data[z,y,x]
Now ideally I'd like to extract a box around it by doing something like the following
extraction = data[z-3:z+3, y-3:y+3, x-3:x+3]
but because x, y and z are arrays, this is not possible. The only way I could think of doing this is with a for-loop over each point, so
extraction = np.array([])
for i in range(len(x)):
    extraction = np.append(extraction, data[z[i]-3:z[i]+3, y[i]-3:y[i]+3, x[i]-3:x[i]+3])
and then reshaping the extraction array afterwards. However, this is very slow, and there is some overlap between the slices in this for-loop that I'd like to prevent.
Is there a simple way to do this directly without a for-loop?
EDIT:
Let me rephrase the question through another idea I came up with that is also slow. I have a line running through the datacube, and lists of x, y and z coordinates (the coordinates being the indices in the datacube array) for all the points that define the line.
As an example these lists look like this:
x_line: [345 345 345 345 342 342 342 342 342 342 342 342 342 342 342 342]
y_line: [540 540 540 543 543 543 543 546 546 546 549 549 549 549 552 552]
z_line: [84 84 84 87 87 87 87 87 90 90 90 90 90 93 93 93]
As you can see, some of these coordinates are identical, because the line is defined in different coordinates and then binned to the resolution of the data box.
Now I want to mask out all cells in the datacube that are farther than 3 cells from the line.
For a single point along the line (x_line[i], y_line[i], z_line[i]) this is relatively easy. I created a meshgrid for the coordinates in the datacube, then create a mask array of zeros and set everything satisfying the condition to 1:
data = np.random.rand(600,600,600)
x_box, y_box, z_box = np.meshgrid(np.arange(600), np.arange(600), np.arange(600))
mask = np.zeros(np.shape(data))
for i in range(len(x_line)):
    distance = np.sqrt((x_box-x_line[i])**2 + (y_box-y_line[i])**2 + (z_box-z_line[i])**2)
    mask[distance <= 3] = 1.
extraction = data[mask == 1.]
The advantage of this is that the mask array removes the problem of having duplicate extractions. However, both the meshgrid and distance calculations are very slow. So is it possible to do the calculation of the distance directly on the entire line without having to do a for-loop over each line point, so that it directly masks all cells that are within a distance of 3 cells from ANY of the line points?
How about this?
# .shape = (N,)
x, y, z = ...
# offsets in [-3, 3), .shape = (6, 6, 6)
xo, yo, zo = np.indices((6, 6, 6)) - 3
# box indices, .shape = (6, 6, 6, N)
xb, yb, zb = x + xo[...,np.newaxis], y + yo[...,np.newaxis], z + zo[...,np.newaxis]
# .shape = (6, 6, 6, N)
extractions = data[xb, yb, zb]
This extracts a series of 6x6x6 cubes, each "centered" on the coordinates in x, y, and z.
It will produce duplicate coordinates, and it will fail for points near the borders.
If you keep your xyz in one array, this gets a little less verbose, and you can remove the duplicates:
# .shape = (N,3)
xyz = ...
# offsets in [-3, 3), .shape = (6, 6, 6, 3)
xyz_offset = np.moveaxis(np.indices((6, 6, 6)) - 3, 0, -1)
# box indices, .shape = (6, 6, 6, N, 3)
xyz_box = xyz + xyz_offset[...,np.newaxis,:]
if remove_duplicates:
    # shape (M, 3)
    xyz_box = xyz_box.reshape(-1, 3)
    xyz_box = np.unique(xyz_box, axis=0)
xb, yb, zb = np.moveaxis(xyz_box, -1, 0)
extractions = data[xb, yb, zb]
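If the border cases matter, one simple (if approximate) workaround is to clip the box indices to the valid range before indexing; note that this repeats edge cells rather than dropping them:
xb = np.clip(xb, 0, data.shape[0] - 1)
yb = np.clip(yb, 0, data.shape[1] - 1)
zb = np.clip(zb, 0, data.shape[2] - 1)
extractions = data[xb, yb, zb]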

Powerball number generator

To win the Powerball lottery (an extremely unlikely event so don't waste your time) you have to pick six numbers correctly. The first five numbers are drawn from a drum containing 53 balls and the sixth is drawn from a drum containing 42 balls. The chances of doing this are 1 in 120,526,770.
The output needs to be in the form:
Official (but fruitless) Powerball number generator
How many sets of numbers? 3
Your numbers: 3 12 14 26 47 Powerball: 2
Your numbers: 1 4 31 34 51 Powerball: 17
Your numbers: 10 12 49 50 53 Powerball: 35
import random

# Powerball
print "Offical Powerball number generaor"
x = int(raw_input("How many sets of numbers? "))
z = range(1,42)
z1 = random.choice(z)

def list1():
    l1 = []
    n = 1
    while n <= 5:
        y = range(1,53)
        y1 = random.choice(y)
        l1.append(y1)
        n += 1
    print sorted(l1)

i = 1
while i <= x:
    # print "Your numbers: " + list1() + "Powerball: "+ str(z1)
    print list1()
raw_input("Press<enter>")
My code's output goes into an infinite loop; I have to kill it. The output looks like this:
None
[2, 7, 22, 33, 42]
None
[15, 19, 19, 26, 48]
None
[1, 5, 7, 26, 41]
None
[7, 42, 42, 42, 51]
None
..... etc ....
while i<=x: - you never increment i, so it is stuck in your last loop...
To avoid such things, and to remove the noise of i += 1 lines in your code, I suggest using for loops: for i in range(x) and for n in range(5).
Better yet, the following expression can replace list1:
[random.choice(range(1,53)) for x in xrange(5)]
At least, that does the same as your code. But what you probably really want (to avoid the same ball being chosen twice) is:
random.sample( range(1,53), 5 )
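Putting those pieces together, a minimal version of the whole generator might look like this (Python 2 to match the question, using the 53- and 42-ball counts from the problem statement, and drawing a fresh Powerball for each set):
import random

print "Official (but fruitless) Powerball number generator"
x = int(raw_input("How many sets of numbers? "))
for i in range(x):
    numbers = sorted(random.sample(range(1, 54), 5))  # 5 distinct balls from 1-53
    powerball = random.randint(1, 42)                 # 1 ball from 1-42
    print "Your numbers: " + " ".join(str(n) for n in numbers) + " Powerball: " + str(powerball)
Note that the original code's range(1,53) and range(1,42) actually stop at 52 and 41, one short of the stated ball counts.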

Matlab: Read in ascii with varying column numbers & different formats

I have been working on a puzzling problem that involves reading an ASCII file into Matlab; the file contains two parts with different formats, and the first part also has varying numbers of columns.
MESH2D
MESHNAME "XXX"
E3T 1 1 29 30 1
E4Q 2 2 31 29 1 1
E4Q 3 31 2 3 32 1
...
...
...
ND 120450 5.28760039e+004 7.49260000e+004 8.05500000e+002
ND 120451 5.30560039e+004 7.49260000e+004 6.84126709e+002
ND 120452 5.32360039e+004 7.49260000e+004 6.97750000e+002
ND 120453 5.34010039e+004 7.49110000e+004 7.67000000e+002
NS 1 2 3 4 5 6 7 8 9 10
NS 11 12 13 14 15 16 17 18 19 20
NS 21 22 23 24 25 26 27 -28
BEGPARAMDEF
GM "Mesh"
I am only interested in the lines that contain the triangles and start with E3T/E4Q, and in the corresponding lines that hold the coordinates of the triangle nodes and start with ND. For the triangles (E3T/E4Q lines) I am only interested in the first 4 numbers, so I was trying to do something like this:
fileID = fopen(test);
t1 = textscan(fileID, '%s',3);
t2 = textscan(fileID, '%s %d %d %d*[^\n]');
fclose(fileID);
So read in the header to jump to the data, then read the first string and the following 4 numbers, then jump to the end of the line and restart. But this does not work: I only get a single line with data and not the rest of the file. Also, I do not know how to treat the second part of the file, which starts at an arbitrary line (which I could of course look up manually and feed into Matlab, but I would prefer Matlab to find this change in format automatically).
Do you have any suggestions?
Cheers!
I suggest that you first read all the lines of the file with textscan as strings, and then filter out whatever you need:
fid = fopen(filename, 'r');
C = textscan(fid, '%s', 'delimiter', '');
fclose(fid);
Then parse only the E3T/E4Q/ND lines using regexp:
C = regexp(C, '(\w*)(.*)', 'tokens');
C = cellfun(@(x){x{1}{1}, str2num(x{1}{2})}, C, 'UniformOutput', false);
C = vertcat(C{:});
And then group corresponding E3T/E4Q and ND lines:
idx1 = strcmp(C(:, 1), 'E3T') | strcmp(C(:, 1), 'E4Q');
idx2 = strcmp(C(:, 1), 'ND');
N = max(nnz(idx1), nnz(idx2));
indices = cellfun(@(x)x(1:4), C(idx1, 2), 'UniformOutput', false);
S = struct('tag',     [C(idx1, 1); cell(N - nnz(idx1), 1)], ...
           'indices', [indices;    cell(N - nnz(idx1), 1)], ...
           'nodes',   [C(idx2, 2); cell(N - nnz(idx2), 1)]);
I named the E3T/E4Q values "indices" and ND values "nodes". The resulting array S contains structs, each having three fields: tag (either E3T or E4Q), indices and nodes. Note that if you have more "indices" than "nodes" or vice versa, the missing values are indicated by an empty matrix.
I know this is not perfect, but if your files are not too big you can do something like this:
fileID = fopen(test,'r');
while ~feof(fileID)
    FileLine = fgetl(fileID);
    [LineHead,Rem] = strtok(FileLine); % separated string header and numbers
    switch LineHead
        case 'MESH2D'
            % do something here
        case 'MESHNAME'
            % do something here
        case 'E3T'
            % parse integer numbers
            [Num,NumCount] = sscanf(Rem, '%d');
        case 'E4Q'
            % parse integer numbers
            [Num,NumCount] = sscanf(Rem, '%d');
        case 'ND'
            % parse integer numbers
            [Num,NumCount] = sscanf(Rem, '%d');
            % or if you prefer to parse the first number separately
            [strFirst,strOthers] = strtok(Rem);
            FirstInteger = str2num(strFirst);
            [Floats,FloatsCount] = sscanf(strOthers, '%g');
            % and so on...
    end
end
fclose(fileID);
Of course, you have to handle lines starting with MESH2D, MESHNAME, or GM separately.