Why so many 0s in the answer? - clojure

(take 100 (iterate rand-int 300))
evaluates differently, of course, each time... but usually with a ton of zeros. The result always leads with a 300. For example:
(300 93 59 58 25 14 9 4 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
I would have expected 100 random integers between 0 and 300.
What am I not understanding?

See docs for iterate:
Returns a lazy sequence of x, (f x), (f (f x)) etc. f must be free of side-effects
So, that's the reason your sequence always starts with 300.
And why are there so many zeros? When you use iterate like this, rand-int takes the previous result and uses it as the new upper limit (exclusive) for the next random number. So, your results can look like this:
=> 300
(rand-int *1)
=> 174
(rand-int *1)
=> 124
(rand-int *1)
=> 29
(rand-int *1)
=> 17
(rand-int *1)
=> 16
(rand-int *1)
=> 7
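You can check for yourself that such a sequence ends up at zero: each call returns a number below the previous one, and once it reaches 0, (rand-int 0) can only return 0.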
If you really want to get 100 random integers between 0 and 300, use repeatedly instead:
(repeatedly 100 #(rand-int 300))
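For comparison, here is a rough Python analogue of the two behaviours. It is only a sketch: rand_int and iterate are hypothetical helpers written to mirror the Clojure functions, not part of any library.

```python
import random
from itertools import islice


def rand_int(n):
    """Random integer in [0, n); gives 0 when n is 0, like Clojure's rand-int."""
    return int(random.random() * n)


def iterate(f, x):
    """Yield x, f(x), f(f(x)), ... in the manner of Clojure's iterate."""
    while True:
        yield x
        x = f(x)


# Each result becomes the next upper bound, so the values can only shrink.
shrinking = list(islice(iterate(rand_int, 300), 100))

# Independent draws, every one with the same upper bound of 300.
independent = [rand_int(300) for _ in range(100)]

print(shrinking[:10])
print(independent[:10])
```

The first list starts at 300 and never increases; the second is what the question was actually after.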


What is the 5x5 equivalent of the 3x3 emboss kernel?

-2 -1 0
-1 1 1
0 1 2
This is 3x3 emboss kernel. How should I write this in 5x5?
As I understand, these filters take directional differences (see the Wikipedia page).
We can decompose your filter into directions
 0 -1  0       0  0  0      -2  0  0
 0  0  0      -1  0  1       0  0  0
 0  1  0       0  0  0       0  0  2
So, I think you can expand it over these 3 directions giving emphasis
 0  0 -1  0  0       0  0  0  0  0      -2  0  0  0  0
 0  0 -1  0  0       0  0  0  0  0       0 -2  0  0  0
 0  0  0  0  0      -1 -1  0  1  1       0  0  0  0  0
 0  0  1  0  0       0  0  0  0  0       0  0  0  2  0
 0  0  1  0  0       0  0  0  0  0       0  0  0  0  2
So, the final kernel would be
-2 0 -1 0 0
0 -2 -1 0 0
-1 -1 1 1 1
0 0 1 2 0
0 0 1 0 2
Maybe you can also try interpolating the filter coefficients marked as x
-2 x -1 0 0
x -2 -1 0 0
-1 -1 1 1 1
0 0 1 2 x
0 0 1 x 2
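The directional construction can be sketched in Python as follows. The function name emboss_kernel is hypothetical, and the weights (1 along the axes, 2 along the diagonal, 1 in the centre) are simply the ones from the 3x3 kernel in the question.

```python
def emboss_kernel(size=5):
    """Extend the 3x3 emboss kernel along its vertical, horizontal and
    main-diagonal directions to an odd size x size kernel."""
    if size < 3 or size % 2 == 0:
        raise ValueError("size must be an odd number >= 3")
    c = size // 2
    k = [[0] * size for _ in range(size)]
    for i in range(size):
        if i == c:
            continue
        sign = -1 if i < c else 1
        k[i][c] += sign * 1  # vertical direction
        k[c][i] += sign * 1  # horizontal direction
        k[i][i] += sign * 2  # diagonal direction
    k[c][c] = 1  # centre weight, as in the 3x3 kernel
    return k


for row in emboss_kernel(5):
    print(row)
```

With size=3 this reproduces the original kernel, which is a quick sanity check on the construction.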
The simplest way to fit a smaller convolution kernel into a larger matrix of the same rank is to surround it with zero weights. This is especially true when you're dealing with a concept like embossing, which is arguably more interested in the immediate direction of change than in the rate at which it is changing. That is, for this embossing matrix,
-2 -1 0
-1 1 1
0 1 2
you could equivalently use this in 5 x 5:
0 0 0 0 0
0 -2 -1 0 0
0 -1 1 1 0
0 0 1 2 0
0 0 0 0 0
Granted, this will get you a different visual effect than anything with any part of the matrix filled in; but sometimes, especially with edge detection, immediate clarity is more important. We aren't always displaying it. If this were something like a Gaussian blur kernel, having a greater range could improve the effect, but embossing isn't that different conceptually from Sobel-Feldman and it may be better to keep it tight.
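Zero padding is easy to do programmatically. A minimal sketch in plain Python, where zero_pad is a hypothetical helper name and the kernel is the 3x3 one from the question:

```python
def zero_pad(kernel, size):
    """Centre a square kernel inside a size x size matrix of zeros."""
    n = len(kernel)
    if size < n or (size - n) % 2:
        raise ValueError("size must be >= kernel size and differ by an even amount")
    off = (size - n) // 2
    out = [[0] * size for _ in range(size)]
    for r, row in enumerate(kernel):
        for c, value in enumerate(row):
            out[r + off][c + off] = value
    return out


emboss3 = [[-2, -1, 0],
           [-1, 1, 1],
           [0, 1, 2]]

for row in zero_pad(emboss3, 5):
    print(row)
```

Because every added weight is zero, convolving with the padded kernel gives the same result as convolving with the 3x3 one.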

time series sliding window with occurrence counts

I am trying to get a count between two timestamped values:
for example:
time letter
1 A
4 B
5 C
9 C
18 B
30 A
30 B
I am dividing the time range into windows of size (1 + 30) / 30,
then I want to know how many A, B and C there are in each time window of size 1:
timeseries A B C
1 1 0 0
2 0 0 0
30 1 1 0
This should give me a table of 30 rows and 3 columns (A, B, C) of occurrences.
The problem is that the data takes too long to break down, because the code iterates through the whole master table every time to slice the data, even though the data is already sorted.
master = mytable
minimum = master.timestamp.min()
maximum = master.timestamp.max()
window = (minimum + maximum) / maximum
wstart = minimum
wend = minimum + window
concurrent_tasks = []
while wstart <= maximum:
    As = 0
    Bs = 0
    Cs = 0
    # every window rescans the whole table, which is the slow part
    for d, row in master.iterrows():
        ttime = row.timestamp
        if (ttime >= wstart) & (ttime < wend):
            # print(row.channel)
            if row.channel == 'A':
                As = As + 1
            elif row.channel == 'B':
                Bs = Bs + 1
            elif row.channel == 'C':
                Cs = Cs + 1
    concurrent_tasks.append([wstart, As, Bs, Cs])
    wstart = wstart + window
    wend = wend + window
Could you help me make this perform better? I want to use the map function, and I want to prevent Python from looping through the whole table every time.
This is part of a big data set and it is taking days to finish.
Thank you.
There is a faster approach - pd.get_dummies():
In [116]: pd.get_dummies(df.set_index('time')['letter'])
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 0 0
30 0 1 0
If you want to "compress" (group) it by time:
In [146]: pd.get_dummies(df.set_index('time')['letter']).groupby(level=0).sum()
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
or using sklearn.feature_extraction.text.CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(token_pattern=r"\b\w+\b", stop_words=None)
r = pd.SparseDataFrame(cv.fit_transform(df.groupby('time')['letter'].agg(' '.join)),
In [143]: r
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
If we want to list all times from 1 to 30:
In [153]: r.reindex(np.arange(r.index.min(), r.index.max()+1)).fillna(0).astype(np.int8)
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 1 0
19 0 0 0
20 0 0 0
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
or using Pandas approach:
In [159]: pd.get_dummies(df.set_index('time')['letter']) \
...: .groupby(level=0) \
...: .sum() \
...: .reindex(np.arange(r.index.min(), r.index.max()+1), fill_value=0)
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
... .. .. ..
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
[30 rows x 3 columns]
In [163]: df = pd.concat([df] * 10**4, ignore_index=True)
In [164]: %timeit pd.get_dummies(df.set_index('time')['letter'])
100 loops, best of 3: 10.9 ms per loop
In [165]: %timeit df.set_index('time').letter.str.get_dummies()
1 loop, best of 3: 914 ms per loop
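For readers without pandas at hand, the same idea (count once, then fill in the missing times with zeros) can be sketched with the standard library only. This is not the pandas approach above, just a dependency-free illustration; window_counts is a hypothetical helper name and the rows are the sample data from the question.

```python
from collections import Counter


def window_counts(rows, labels=("A", "B", "C")):
    """rows: iterable of (time, letter) pairs with integer times.
    Returns {time: [count per label]} for every time from min to max."""
    counts = Counter(rows)  # single pass: (time, letter) -> occurrences
    times = [t for t, _ in counts]
    lo, hi = min(times), max(times)
    return {t: [counts[(t, label)] for label in labels]
            for t in range(lo, hi + 1)}


rows = [(1, "A"), (4, "B"), (5, "C"), (9, "C"),
        (18, "B"), (30, "A"), (30, "B")]
table = window_counts(rows)

for t in (1, 2, 30):
    print(t, table[t])
```

The table is read once, so the cost grows with the number of rows plus the number of windows rather than with their product.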

sed to remove certain spaces in the middle

I have a text file like this:
    6.2341   -0.4024   -2.0936 Cl 0 0 0 0 0 0 0 0 0 0 0 0
    0.1148   -3.7525    1.0392 S 0 0 0 0 0 0 0 0 0 0 0 0
   -2.5441   -0.8745    1.3714 F 0 0 0 0 0 0 0 0 0 0 0 0
The format is: columns 1 to 10, 11 to 20, 21 to 30 are x,y,z coordinates in (10.4) format, i.e. length=10, 4 digits after the decimal point; column 31 is always a space; columns 32 to 32 are the atom type; the remaining columns are not important.
However, for some unknown reason, the atom type field is right-shifted by two columns, like this:
    6.2341   -0.4024   -2.0936   Cl 0 0 0 0 0 0 0 0 0 0 0 0
    0.1148   -3.7525    1.0392   S 0 0 0 0 0 0 0 0 0 0 0 0
   -2.5441   -0.8745    1.3714   F 0 0 0 0 0 0 0 0 0 0 0 0
How to use the sed command and regular expression to match these lines and delete the two extra spaces?
sed -r 's/(.{30})  /\1/' will do the trick.
Group the first 30 characters, match two additional spaces, replace the whole with the grouped characters.
If you don't need sed or regular expressions, you can just use cut (the --complement option is a GNU extension) to remove the 2 offending characters:
$ cut --complement -c31,32 file
    6.2341   -0.4024   -2.0936 Cl 0 0 0 0 0 0 0 0 0 0 0 0
    0.1148   -3.7525    1.0392 S 0 0 0 0 0 0 0 0 0 0 0 0
   -2.5441   -0.8745    1.3714 F 0 0 0 0 0 0 0 0 0 0 0 0
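The same fix can be sketched in Python, in case the file is being processed in a script anyway. The helper name fix_line is hypothetical; the pattern is the sed one with an added ^ anchor so that only columns 31 and 32 are ever removed, and lines that are already correct pass through unchanged.

```python
import re


def fix_line(line):
    """Drop two extra spaces in columns 31-32, if they are present."""
    return re.sub(r"^(.{30})  ", r"\1", line, count=1)


# Build a sample line in the (10.4) format described in the question.
coords = f"{6.2341:10.4f}{-0.4024:10.4f}{-2.0936:10.4f}"
shifted = coords + "   Cl 0 0 0"  # atom type shifted right by two columns
correct = coords + " Cl 0 0 0"

print(repr(fix_line(shifted)))
print(repr(fix_line(correct)))
```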

False Acceptance Rate and False Rejection Rate calculation using an n*n confusion matrix

FAR and FRR are used to express the results of biometric devices. Below is the confusion matrix for biometric data, produced in Weka. I couldn't find any resources explaining the procedure to calculate FAR and FRR using an n*n confusion matrix. Any help explaining the procedure would be greatly appreciated. Thanks in advance!
Weka also gives these values, TP Rate, FP Rate, Precision, Recall, F-Measure and ROC Area. Please suggest if the required values can be calculated using these.
=== Confusion Matrix ===
a b c d e f g h i j k l m n o <-- classified as
1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 | a = user1
0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 | b = user2
0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 | c = user3
0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 | d = user4
0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 | e = user5
0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 | f = user6
0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 | g = user7
0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 | h = user9
1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 | i = user10
0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 | j = user11
0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 | k = user14
0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 | l = user15
0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 | m = user16
0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 | n = user17
0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 | o = user19
The accepted answer here by user "chl" has a reference to the biometrics literature: https://stats.stackexchange.com/questions/3489/calculating-false-acceptance-rate-for-a-gaussian-distribution-of-scores .
He says,
[the ROC curve] is a plot of (TAR=1-FRR, the false rejection rate) against false acceptance rate (FAR).
More commonly, the ROC curve is described as a plot of TP Rate as a function of False Positive Rate (FP Rate).
Comparing the two descriptions, it seems you can use the values Weka reports: FAR corresponds to FP Rate, and FRR to 1 - TP Rate.
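If that reading is right, the per-class rates can be computed directly from the matrix. The sketch below assumes a one-vs-rest interpretation for each user, with rows as actual classes and columns as "classified as" (the layout Weka prints); the function name far_frr and the small 3x3 matrix are hypothetical and used only to keep the example short.

```python
def far_frr(matrix):
    """Return a list of (FAR, FRR) pairs, one per class, treating each
    class one-vs-rest: FAR = FP / negatives, FRR = FN / positives."""
    total = sum(sum(row) for row in matrix)
    rates = []
    for i, row in enumerate(matrix):
        tp = row[i]
        actual = sum(row)                      # samples that truly belong to class i
        predicted = sum(r[i] for r in matrix)  # samples classified as class i
        fn = actual - tp
        fp = predicted - tp
        negatives = total - actual
        far = fp / negatives if negatives else 0.0
        frr = fn / actual if actual else 0.0
        rates.append((far, frr))
    return rates


# Hypothetical 3-class matrix, not the data from the question.
example = [[2, 0, 0],
           [1, 1, 0],
           [0, 0, 2]]

for label, (far, frr) in zip("abc", far_frr(example)):
    print(label, far, frr)
```

Whether a single overall FAR/FRR should then be an average over users, or be derived from match scores rather than class labels, depends on the evaluation protocol and is not settled by the confusion matrix alone.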

Matlab: how to work with sparse keys to access sparse data?

I am trying to access the sparse array mlf with keys such as BEpos and BEneg, with one key per row. The problem is that most commands are not meant to deal with such large input: bin2dec requires clean binary strings without spaces, but the regexp hack fails when there are too many rows, and so on.
How to work with sparse keys to access sparse data?
K>> mlf=sparse([],[],[],2^31,1);
BEpos =
(1,1) 1
(2,3) 1
(2,4) 1
K>> mlf(bin2dec(num2str(BEpos)))=1
Error using bin2dec (line 36)
Binary string must be 52 bits or less.
K>> num2str(BEpos)
ans =
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
K>> bin2dec(num2str('1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'))
Error using bin2dec (line 36)
Binary string must be 52 bits or less.
K>> regexprep(num2str(BEpos),'[^\w'']','')
Error using regexprep
The 'STRING' input must be a one-dimensional array
of char or cell arrays of strings.
Doing it manually works:
K>> mlf(bin2dec('1000000000000000000000000000000'))
ans =
All zero sparse: 1-by-1
Consider a different approach using manual binary to decimal conversions:
pows = pow2(size(BEpos,2)-1 : -1 : 0);
inds = uint32(BEpos*pows.')
I haven't benchmarked this, but it might work faster than bin2dec and cell arrays.
How it works
This is pretty simple: the powers of 2 are calculated and stored in pows (assuming the MSB is in the leftmost position). Then they are multiplied by the bits in the matching positions and summed to produce the corresponding decimal values.
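The same weighted-sum idea, sketched in Python for illustration only (bits_to_ints is a hypothetical helper; the two rows mirror the 31-bit keys shown in the question, with the MSB leftmost):

```python
def bits_to_ints(rows):
    """Convert rows of 0/1 bits (MSB first) to integers by summing
    each bit multiplied by its power of two."""
    result = []
    for bits in rows:
        width = len(bits)
        pows = [2 ** (width - 1 - i) for i in range(width)]
        result.append(sum(b * p for b, p in zip(bits, pows)))
    return result


rows = [
    [1] + [0] * 30,           # 1 followed by 30 zeros
    [0, 0, 1, 1] + [0] * 27,  # bits set in positions 3 and 4
]

print(bits_to_ints(rows))
```

No string conversion is involved, so the length limit of bin2dec and the one-row restriction of regexprep do not come into play.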
Try to index with this:
inds = uint32( bin2dec(cellstr(num2str(BEpos,'%d'))) );