I'm currently working on a program (bioinformatics project) that involves reading multiple files, including a matrix, and outputting the results onto another file. What I'm having the most trouble with is how I would go about reading the matrix file like a coordinate system (for lack of a better term)? Is there a simple way to do this without using 2D arrays? For example, if I have the following amino acids in:
fileA: CTTNCLAPLA
fileB: CTTNSITPVA
The program would then read the two files, compare each letter, and refer to the matrix to find the number corresponding to the two letters, which in turn determines the probability of a letter in fileA mutating to a letter in fileB.
Since the first letter in each file is C, the program would read the matrix and output in a separate file:
C T T N C L A P L A
| | | | . : : | : |
C T T N S I T P V A
The "." means that the number according to the matrix was 0 but not the same letter, "|" means that the letter is the same, and the ":" means that the number was greater than zero but not the same letter.
Here is part of the matrix (the rest wouldn't fit):
NOTE: The matrix I must use is in a .csv file, and does not include spaces.
_, A, R, N, D, C
A, 2,-2, 0, 0,-2
R,-2, 6, 0,-1,-2
N, 0, 0, 2, 2,-4
D, 0,-1, 2, 4,-5
C,-2,-4,-4,-5,12
I apologize if my explanation is confusing. Please let me know if you need any clarification. Any help is greatly appreciated. Thanks in advance!
To avoid 2D arrays, you can use a 1D array with a linear index and write a small helper function that converts a 2D coordinate into the linear array index, as described in "Linear indexing in symmetric matrices".
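A minimal sketch of that idea in Python, assuming the matrix has already been read into one flat list in row order. The names `letters`, `flat`, `position` and `score` are hypothetical, and the values are only the 5x5 excerpt shown in the question:

```python
# Letters in the order they appear in the matrix header (excerpt from the question).
letters = "ARNDC"

# The 5x5 excerpt stored row by row in one flat list.
flat = [
     2, -2,  0,  0, -2,
    -2,  6,  0, -1, -2,
     0,  0,  2,  2, -4,
     0, -1,  2,  4, -5,
    -2, -4, -4, -5, 12,
]

# Map each letter to its row/column number.
position = {letter: i for i, letter in enumerate(letters)}

def score(a, b):
    """Look up the score for letters a (row) and b (column) via a linear index."""
    return flat[position[a] * len(letters) + position[b]]

print(score("C", "C"))
print(score("N", "D"))
```

The helper hides the index arithmetic, so the rest of the program can ask for a score by letter pair without ever seeing a 2D structure.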
I would just create a class / struct and then create an array of objects. This should eliminate your need for a 2D array.
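Another option, swapped in here in place of the class-based idea, is a dictionary keyed by letter pairs. The sketch below assumes the .csv has a header row and a label column as in the excerpt; `load_scores` and `match_line` are hypothetical names, and because the question only defines "." for a score of 0, treating negative scores as "." as well is an assumption:

```python
import csv
import io

def load_scores(csv_file):
    """Read a score matrix with a header row and a label column into a dict."""
    rows = list(csv.reader(csv_file))
    columns = [name.strip() for name in rows[0][1:]]
    scores = {}
    for row in rows[1:]:
        row_letter = row[0].strip()
        for column_letter, value in zip(columns, row[1:]):
            scores[(row_letter, column_letter)] = int(value)
    return scores

def match_line(seq_a, seq_b, scores):
    """Build the middle line: '|' same letter, ':' score > 0, '.' otherwise."""
    symbols = []
    for a, b in zip(seq_a, seq_b):
        if a == b:
            symbols.append("|")
        elif scores[(a, b)] > 0:
            symbols.append(":")
        else:
            symbols.append(".")  # assumption: negative scores are shown like 0
    return " ".join(symbols)

# The excerpt from the question stands in for the real .csv file.
excerpt = io.StringIO(
    "_,A,R,N,D,C\n"
    "A,2,-2,0,0,-2\n"
    "R,-2,6,0,-1,-2\n"
    "N,0,0,2,2,-4\n"
    "D,0,-1,2,4,-5\n"
    "C,-2,-4,-4,-5,12\n"
)
scores = load_scores(excerpt)
print(" ".join("ANDC"))
print(match_line("ANDC", "ARNC", scores))
print(" ".join("ARNC"))
```

With the real file, `open("matrix.csv", newline="")` (file name assumed) would replace the `io.StringIO` object.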
I'm trying to find the indices of non-zero elements in a 3x3 integer matrix using numpy, as part of the tic-tac-toe game problem. I realize that np.where is a good option for this case and tried it out, but the output I get still doesn't look right. Can you please help me code this part? I have given my partial code below.
input: s, an integer matrix of dimension 3*3
example:
output: m, a list of possible next moves, where each next move is an (r,c) tuple in which r denotes the row number and c denotes the column number.
example:
m = np.where(s==0)
Here's a quick solution:
import numpy as np
# A plain 2D array is used here; np.matrix is discouraged in the NumPy documentation.
s = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
m = np.where(s == 0)
m = list(zip(m[0], m[1]))
print(m)
Here s is the input matrix, with the middle square already taken. np.where() is used just as in your attempt and returns two arrays (row indices and column indices); zip() pairs them up into (r, c) tuples, and list() turns the result into a list of valid moves.
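As a variation on the same idea, np.argwhere returns the coordinates already grouped per cell, so no zip() is needed. This is a small sketch with the same example board; the name `moves` is made up for the illustration:

```python
import numpy as np

s = np.array([[0, 0, 0],
              [0, 1, 0],
              [0, 0, 0]])

# np.argwhere returns one [row, column] pair per cell where the condition holds.
moves = [(int(r), int(c)) for r, c in np.argwhere(s == 0)]
print(moves)
```

Converting to int gives plain Python tuples rather than NumPy scalar types, which is convenient if the moves are compared or printed later.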
I have two 2d arrays.
a=['the flower is red','butterflies are pretty','dog is a man best friend']
b=['231','01','034']
Array a contains sentences, while each string in b holds the indices of the words that I would like to extract from the corresponding sentence in a.
For example, going through the individual digits of b[0], which is 231, I would like to extract is, red, flower, whereas for b[2] I would like to extract dog, man, best.
To do that, I have to read the elements of a[] word by word and compare each word's index with the individual digits in b[] (for example, read 2, 3, 1 individually and compare them with the index in a[i][j]).
Hence, I would need to loop over both 2D arrays and compare them (4 for loops, I think).
for i in a:
    x = i.split()
    #x=one word
    for idx, word in enumerate(x):
        #idx= index of one word, word=one word
        for i in b:
            for y in i:
                if y == idx: #comparing y which is a number with the index in a[]
                    print(word)
The code above is incorrect somehow, and I don't know what went wrong or where.
So, what is the code to get the wanted result?
for idx, s in enumerate(b):
    r = []
    for c in s:
        # each character of s is a digit; convert it to an int before indexing
        r.append(a[idx].split()[int(c)])
    print(r)
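The same approach can be written a little more compactly with zip() and a list comprehension. This is a sketch that assumes every index is a single digit, as in the example data; `pick_words` is a hypothetical helper name:

```python
a = ['the flower is red', 'butterflies are pretty', 'dog is a man best friend']
b = ['231', '01', '034']

def pick_words(sentences, index_strings):
    """For each sentence, pick the words at the positions given by the digit string."""
    result = []
    for sentence, digits in zip(sentences, index_strings):
        words = sentence.split()
        # int(d) is needed because the digits are characters, not numbers
        result.append([words[int(d)] for d in digits])
    return result

print(pick_words(a, b))
# [['is', 'red', 'flower'], ['butterflies', 'are'], ['dog', 'man', 'best']]
```

The int() conversion is also what the original attempt was missing: comparing the character '2' with the integer 2 is always False in Python.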
I have a 1D NumPy array, and for each of its values I want to find the index of the interval, defined by another 1D array, that contains it. To be concrete, here is an example:
A= np.array([ 0.69452994, 3.4132039 , 6.46148658, 17.85754453,
21.33296454, 1.62110662, 8.02040621, 14.05814177,
23.32640469, 21.12391059])
b = np.array([ 0. , 3.5, 9.8, 19.8 , 50.0])
For each value in A, I want the index in b of the closed interval that contains it (b is always sorted, starting at 0 and ending at the largest value A can ever take).
In this specific example, my output will be
indx = [0,0,1,2,3,0,1,2,3,3]
How can I do this? I tried np.where without any success.
Because b is sorted, np.searchsorted or np.digitize can report, for each element of A, the position at which it would have to be inserted into b to keep b sorted. That position is the index of the upper edge of the interval containing the element, so subtracting 1 gives the index of the interval itself.
If each interval is open on the left and closed on the right, i.e. (b[i], b[i+1]], use:
np.searchsorted(b,A)-1
np.digitize(A,b,right=True)-1
If each interval is closed on the left and open on the right, i.e. [b[i], b[i+1]), use:
np.searchsorted(b,A,'right')-1
np.digitize(A,b,right=False)-1
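Run on the example data from the question, both variants give the expected indices, and a value that lies exactly on an edge shows how they differ. This is a small check; the variable names `idx_left`, `idx_right` and `edge` are made up for the illustration:

```python
import numpy as np

A = np.array([0.69452994, 3.4132039, 6.46148658, 17.85754453,
              21.33296454, 1.62110662, 8.02040621, 14.05814177,
              23.32640469, 21.12391059])
b = np.array([0.0, 3.5, 9.8, 19.8, 50.0])

# side='left': a value equal to an edge is assigned to the interval below that edge.
idx_left = np.searchsorted(b, A, side="left") - 1
# side='right': a value equal to an edge is assigned to the interval above that edge.
idx_right = np.searchsorted(b, A, side="right") - 1

print(idx_left.tolist())   # [0, 0, 1, 2, 3, 0, 1, 2, 3, 3]
print(idx_right.tolist())  # [0, 0, 1, 2, 3, 0, 1, 2, 3, 3]

# The two variants only disagree for values that sit exactly on an edge.
edge = np.array([3.5])
print(np.searchsorted(b, edge, side="left") - 1)   # [0]
print(np.searchsorted(b, edge, side="right") - 1)  # [1]
```

Note that with side="left" a value of exactly 0 would map to -1, so the choice of variant matters if A can contain the lowest edge.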
I do not use any matrix library, but instead plain std::vector for my matrix data.
To fill it with 2D data I use this code:
data[iy + dataPointsY * ix] = value;
I would like to know if this is correct or if it must be the other way around (ix first).
To my understanding fftw needs 'Row-major Format'. Since I use it the formula should be according to row-major format.
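Assuming you want row-major format for fftw and iy is the row index, what you want is:
data[ix + iy*dataPointsX]
Here dataPointsX stands for the number of elements per row (a name assumed here; the question only shows dataPointsY). The stride must always be the length of the fast-running dimension, so the formula in the question, iy + dataPointsY*ix, is also row-major if ix is treated as the row index. The point of row-major is that when the combined index increases by 1, the column index advances by 1 while the row index stays the same (unless the step wraps around to the next row).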
double m[4][4];
double *mp = (double*)m;
// compare addresses; each row of m holds 4 elements, so the stride is 4
&mp[1+2*4] == &m[2][1]; //true
&mp[2+2*4] == &m[2][2]; //true
&mp[2+2*4] == &m[3][1]; //false
In general, there's no "right" way to store a matrix. Row-major format is also called a "C-style" matrix, while column-major is called a "Fortran-style" matrix. The names come from the different multidimensional array indexing schemes of the two languages.
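The two layouts can be compared side by side in a short Python sketch using NumPy, whose arrays are row-major by default. The helper names `row_major_index` and `column_major_index` are hypothetical:

```python
import numpy as np

rows, cols = 3, 4
m = np.arange(rows * cols).reshape(rows, cols)  # row-major (C order) by default

def row_major_index(row, col, n_cols):
    """Linear index when the column index runs fastest (C style)."""
    return row * n_cols + col

def column_major_index(row, col, n_rows):
    """Linear index when the row index runs fastest (Fortran style)."""
    return row + col * n_rows

c_flat = m.ravel(order="C")  # elements laid out row by row
f_flat = m.ravel(order="F")  # elements laid out column by column

print(c_flat[row_major_index(2, 1, cols)] == m[2, 1])
print(f_flat[column_major_index(2, 1, rows)] == m[2, 1])
```

In both cases the stride is the length of the dimension that runs fastest: the number of columns for row-major, the number of rows for column-major.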
I've successfully implemented a BWT stage (using regular string sorting) for a compression testbed I'm writing. I can apply the BWT and then the inverse BWT, and the output matches the input. Now I want to speed up creation of the BW index table using suffix arrays. I have found two relatively simple, supposedly fast O(n) algorithms for suffix array creation, DC3 and SA-IS, which both come with C++/C source code. I tried using the sources (out-of-the-box compiling SA-IS source can also be found here), but failed to get a proper suffix array / BWT index table out. Here's what I've done:
T=input data, SA=output suffix array, n=size of T, K=alphabet size, BWT=BWT index table
I work on 8-bit bytes, but both algorithms need a unique sentinel / EOF marker in form of a zero byte (DC3 needs 3, SA-IS needs one), thus I convert all my input data to 32-bit integers, increase all symbols by 1 and append the sentinel zero bytes. This is T.
I create an integer output array SA (of size n for DC3, n+1 for SA-IS) and apply the algorithms. I get results similar to my sorting BWT transform, but some values are odd (see UPDATE 1). Also, the results of the two algorithms differ slightly. The SA-IS algorithm produces an excess index value at the front, so all results need to be shifted left by one index (SA[i]=SA[i+1]).
To convert the suffix array to the proper BWT indices, I subtract 1 from the suffix array values, do a modulo and should have the BWT indices (according to this): BWT[i]=(SA[i]-1)%n.
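As a small illustration of this conversion, here is a sketch that assumes a naive sort-based suffix array rather than DC3 or SA-IS, and a text ending in a unique sentinel '$' that sorts before every other symbol. The function names are made up for the example:

```python
def suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_suffix_array(text, sa):
    """Take the symbol preceding each suffix: index (SA[i] - 1) mod n."""
    n = len(text)
    return "".join(text[(i - 1) % n] for i in sa)

text = "banana$"  # '$' is the unique, smallest sentinel
sa = suffix_array(text)
print(sa)                               # [6, 5, 3, 1, 0, 4, 2]
print(bwt_from_suffix_array(text, sa))  # annb$aa
```

The modulo only matters for the suffix that starts at position 0, whose preceding symbol wraps around to the sentinel at the end.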
This is my code to feed the SA algorithms and convert to BWT. You should be able to more or less just plug in the SA construction code from the papers:
std::vector<int32_t> SuffixArray::generate(const std::vector<uint8_t> & data)
{
    std::vector<int32_t> SA;
    if (data.size() >= 2)
    {
        //copy data over. we need to append 3 zero bytes,
        //as the algorithm expects T[n]=T[n+1]=T[n+2]=0
        //also increase the symbol value by 1, because the algorithm alphabet is [1,K]
        //(0 is used as an EOF marker)
        std::vector<int32_t> T(data.size() + 3, 0);
        std::copy(data.cbegin(), data.cend(), T.begin());
        std::for_each(T.begin(), std::prev(T.end(), 3), [](int32_t & n){ n++; });
        SA.resize(data.size());
        SA_DC3(T.data(), SA.data(), data.size(), 256);
        // OR (alternative using SA-IS):
        //copy data over. we need to append a zero byte,
        //as the algorithm expects T[n-1]=0 (where n is the size of its input data)
        //also increase the symbol value by 1, because the algorithm alphabet is [1,K]
        //(0 is used as an EOF marker)
        std::vector<int32_t> T(data.size() + 1, 0);
        std::copy(data.cbegin(), data.cend(), T.begin());
        std::for_each(T.begin(), std::prev(T.end(), 1), [](int32_t & n){ n++; });
        SA.resize(data.size() + 1); //crashes if not one extra byte at the end
        SA_IS((unsigned char *)T.data(), SA.data(), data.size() + 1, 256, 4); //algorithm expects size including sentinel
        std::rotate(SA.begin(), std::next(SA.begin()), SA.end()); //rotate left by one to get same result as DC3
        SA.resize(data.size());
    }
    else
    {
        SA.push_back(0);
    }
    return SA;
}
void SuffixArray::toBWT(std::vector<int32_t> & SA)
{
    std::for_each(SA.begin(), SA.end(), [SA](int32_t & n){ n = ((n - 1) < 0) ? (n + SA.size() - 1) : (n - 1); });
}
What am I doing wrong?
UPDATE 1
When applying the algorithms to short amounts of test text data like "yabbadabbado" / "this is a test." / "abaaba" or a big text file (alice29.txt from the Canterbury corpus) they work fine. Actually the toBWT() function isn't even necessary.
When applying the algorithms to binary data from a file containing the full 8-bit byte alphabet (an executable etc.), they don't seem to work correctly. Comparing the results of the algorithms to the regular BWT indices, I notice erroneous indices (4 in my case) at the front. The number of erroneous indices (incidentally?) corresponds to the recursion depth of the algorithms. The indices point to where the original source data had the last occurrences of 0s (before I converted them to 1s when building T)...
UPDATE 2
There are more differing values when I do a binary comparison of the regular BWT array and the suffix array. This might be expected since, as far as I remember, the sort order need not be the same as with a standard sort, BUT the resulting data transformed by the arrays should be the same. It is not.
UPDATE 3
I tried modifying a simple input string until both algorithms "failed". After changing two bytes of the string "this is a test." to 255 or 0 (from 74686973206973206120746573742Eh to e.g. 746869732069732061FF74657374FFh; the last byte has to be changed!), the indices and the transformed string are not correct anymore. It also seems to be enough to change the last character of the string to a character already occurring in the string, e.g. "this is a tests" 746869732069732061207465737473h. Then two indices and two characters of the transformed strings are swapped (comparing the regular sorting BWT and the BWT that uses SAs).
I find the whole process of having to convert the data to 32-bit a bit awkward. If somebody has a better solution (a paper or, better yet, some source code) to generate a suffix array DIRECTLY from a string with a 256-character alphabet, I'd be happy.
I have now figured this out. My solution was two-fold. Some people suggested using a library, which I did: SAIS-lite by Yuta Mori.
The real solution was to duplicate and concatenate the input string and run the SA-generation on this string. When saving the output string you need to filter out all SA indices above the original data size. This is not an ideal solution, because you need to allocate twice as much memory, copy twice and do the transform on the double amount of data, but it is still 50-70% faster than std::sort. If you have a better solution, I'd love to hear it.
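The doubling trick can be sketched in Python as follows. This is only an illustration of the idea: it uses a naive sort-based suffix array in place of SAIS-lite, and both function names are made up for the example. The rotation-sorting version is included purely as a reference to compare against:

```python
def bwt_by_rotations(data: bytes) -> bytes:
    """Reference BWT: sort all cyclic rotations and take the last column."""
    n = len(data)
    order = sorted(range(n), key=lambda i: data[i:] + data[:i])
    return bytes(data[(i - 1) % n] for i in order)

def bwt_by_doubled_suffix_array(data: bytes) -> bytes:
    """BWT via the suffix array of data+data, keeping only indices < n."""
    n = len(data)
    doubled = data + data
    # Naive suffix array of the doubled input (stand-in for a fast SA library).
    sa = sorted(range(2 * n), key=lambda i: doubled[i:])
    # A suffix starting at i < n begins with the full rotation starting at i,
    # so the filtered suffix order matches the rotation order.
    return bytes(data[(i - 1) % n] for i in sa if i < n)

sample = b"this is a tests"
print(bwt_by_doubled_suffix_array(sample) == bwt_by_rotations(sample))
```

No sentinel and no symbol shifting is needed in this form, since every suffix of interest already contains a complete rotation of the original data.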
You can find the updated code here.