I have a regular expression for capturing repeating numerical patterns in a string of numbers. However, it cannot distinguish between single-digit and multi-digit numbers.
Given a string:
0 5 0 0 0 16 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1 11 1 1 1 11 1 1 1 1 1 1 1 2 11 1 4 4 4 16
and regular expression
(\d+)( \1)+
the match result is
0 5 0 0 0 16 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1 11 1 1 1 11 1 1 1 1 1 1 1 2 1 1 1 4 4 4 16
The regex is not able to distinguish between 1 and 11.
(Note: 11 could also be a repeating number, and a number can have at most 3 digits.)
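You need to add word boundaries to the regex, so that the captured number and each repetition must be whole numbers rather than prefixes of longer ones. For example: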
(\b\d+)( \1\b)+
See https://regex101.com/r/ZSCMjF/1
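To see the effect of the word boundaries in isolation, here is a minimal sketch using Python's `re` module on a shortened input. The sample string and the variable names are made up for illustration; only the two patterns come from the posts above.

```python
import re

# Shortened, hypothetical input: two single 1s followed by two 11s.
text = "1 1 11 11 4"

loose = re.compile(r"(\d+)( \1)+")        # pattern from the question
strict = re.compile(r"(\b\d+)( \1\b)+")   # pattern with word boundaries

loose_matches = [m.group(0) for m in loose.finditer(text)]
strict_matches = [m.group(0) for m in strict.finditer(text)]

# Without \b the first digit of "11" is swallowed as another "1".
print(loose_matches)   # ['1 1 1', '1 1']
# With \b the runs of 1 and of 11 are kept apart.
print(strict_matches)  # ['1 1', '11 11']
```

The trailing `\b` after `\1` is what stops a repetition from ending in the middle of a longer number.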
I have the following data structure:
Y  cum_sum
1  1
1  1
1  1
0  1
0  1
1  1
0  1
1  1
1  1
I would like cum_sum to accumulate as long as Y stays the same as in the previous row and to restart whenever Y changes, so that the data becomes:
Y  cum_sum
1  1
1  2
1  3
0  1
0  2
1  1
0  1
1  1
1  2
I'm not sure how to do it, and I've tried searching, but the phrasing I'm using leads me to different questions.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(y cum_sum)
1 1
1 1
1 1
0 1
0 1
1 1
0 1
1 1
1 1
end
replace cum_sum = cum_sum + cum_sum[_n-1] if y == y[_n-1]
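For readers who want to check the expected column outside Stata, the same "restart on change" running count can be sketched in Python. This is an illustration only, not part of the Stata solution; the function name `run_counts` is hypothetical.

```python
from itertools import groupby


def run_counts(y):
    """Count 1, 2, 3, ... within each run of equal consecutive values."""
    out = []
    for _, run in groupby(y):           # groupby splits y into consecutive runs
        run_length = len(list(run))
        out.extend(range(1, run_length + 1))
    return out


y = [1, 1, 1, 0, 0, 1, 0, 1, 1]         # the y column from the dataex block
print(run_counts(y))                    # [1, 2, 3, 1, 2, 1, 1, 1, 2]
```

Because every row starts with cum_sum equal to 1, adding the previous row's cum_sum within a run produces exactly these counts.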
The problem seems quite suited for a GPU, FPGA, etc. (because it's quite parallel); but I'm looking for a CPU-based and somewhat architecture independent solution right now. I think a good answer could be just some unimplemented pseudo-code, but my program is in pure C++20, so the answer should be relevant in that context (e.g., don't assume something very-high-level like Python, don't use compiler specific intrinsics or assembly). I'm not expecting mind-blowing performance, but I do want the answer to be significantly faster than the three implementations I already have (in this file): a very naive approach without generator matrices, and a naive "multiply input vector with dense generator matrix" approach. The answer should work for arbitrary code word and input lengths, but the important code word lengths are under, say, 2000 bits, and small input lengths are not important.
Some preliminaries: the binary numbers in question have addition and multiplication defined as, respectively, the "exclusive-or" (XOR) and "and" logical/bitwise operations. This extends to binary matrix multiplication.
Hamming codes are old and well-known binary linear block error detecting/correcting codes. Each code word is a string of bits where some bit-positions are designated as parity bits, used for error detection and correction, while the rest of the bits are data bits, which are just copies of the input bits if there was no error. We are considering only Hamming codes where the parity bits are at traditional power-of-two positions (i.e., with 1-based numbering: bit 1, bit 2, bit 4, bit 8, ...). Thus each possible code can be determined using its length n (number of bits in a code word) or its rank k (number of data bits in a code word). A Hamming code can be referred to as (n, k), e.g., (7, 4) or (40, 34).
Each code has a generator matrix, a binary matrix with which an input vector can be multiplied to obtain a code word. Thus the set of the code words of a certain code is exactly the set of linear combinations of the rows of the generator matrix.
The desired program is basically a coder: it takes as input an (n, k) pair to select the code (yes, this is redundant, since only one of the pair is needed) and an arbitrary binary message, divides the message into k-bit sub-messages, and outputs a sequence of n-bit code words, each encoding one sub-message.
I'm hoping for an answer leveraging properties specific to our generator matrices here (e.g., special representation for the generator matrix and special vector-matrix multiplication algorithm), so here are examples of generator matrices for some codes:
Hamming code (3, 1) (has only the code words 000 and 111):
111
Hamming code (5, 2) (has only the code words 00000, 11100, 10011 and 01111):
11100
10011
Hamming code (6, 3):
111000
100110
010101
(Notice how each generator matrix contains the generator matrices of all the smaller codes.)
Hamming code (150, 142) (all zeros left blank so the ones would stand out more):
[142 rows of 150 columns: the row for the data bit at 1-based position p has a one at position p and at every power-of-two position 2^j for which bit j of p is set, so each row carries between 3 and 8 ones.]
Notice how there are relatively few ones among all the zeros in most generator matrices, and there's definitely a pattern, shape even, to the matrices.
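The pattern in the small matrices above can be written down directly. The following is a rough Python sketch meant only to illustrate the structure (the program in question is C++20, and the same bit operations carry over); the function names are hypothetical. It assumes what the (3, 1), (5, 2) and (6, 3) examples show: the row for the data bit at 1-based position p has ones at p and at each power-of-two position 2^j where bit j of p is set. Under that assumption, XOR-ing rows means the parity bits are just the bits of the XOR of the positions of the set data bits, so no dense matrix is needed.

```python
def data_positions(n):
    """1-based positions in 1..n that are not powers of two (the data bits)."""
    return [p for p in range(1, n + 1) if (p & (p - 1)) != 0]


def generator_rows(n):
    """Generator matrix rows as strings; the leftmost character is bit 1."""
    rows = []
    for p in data_positions(n):
        ones = {p}                       # the data bit itself
        j = 0
        while (1 << j) <= p:
            if (p >> j) & 1:
                ones.add(1 << j)         # parity bit 2^j covers position p
            j += 1
        rows.append("".join("1" if i in ones else "0" for i in range(1, n + 1)))
    return rows


def encode(data_bits, n):
    """Encode one k-bit sub-message without building the dense matrix."""
    positions = data_positions(n)
    assert len(data_bits) == len(positions)
    word = [0] * (n + 1)                 # index 0 unused, indices are 1-based
    acc = 0                              # XOR of positions of the set data bits
    for bit, p in zip(data_bits, positions):
        if bit:
            word[p] = 1
            acc ^= p
    j = 0
    while (1 << j) <= n:
        word[1 << j] = (acc >> j) & 1    # parity bit 2^j is bit j of acc
        j += 1
    return "".join(map(str, word[1:]))


print(generator_rows(6))       # ['111000', '100110', '010101']
print(encode([1, 1], 5))       # 01111
```

The per-word cost is one XOR per set data bit plus one write per parity bit, which is where a word-parallel C++ version would start from.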
I'm weak in all relevant areas here, so please try to correct any possible mistakes I made.
I have a dataframe that looks like:
subgroup value
0 1 0
1 1 1
2 1 1
3 1 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 1
9 3 0
10 3 0
I need to add a column that increases by 1 whenever a subgroup contains at least one value different from 0. Please note that if the value 1 appears more than once in the same subgroup, it doesn't affect the count.
The result should be:
subgroup value count
0 1 0 1
1 1 1 1
2 1 1 1
3 1 0 1
4 2 0 1
5 2 0 1
6 2 0 1
7 3 0 2
8 3 1 2
9 3 0 2
10 3 0 2
Thank you in advance for your help!
Using shift with -1 and 1, then cumsum on the result:
mask=(df.value.ne(df.value.shift()))&(df.value.ne(df.value.shift(-1)))
mask.cumsum()
Out[18]:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 2
9 2
10 2
Name: value, dtype: int32
Using merge and groupby
df.merge(df.groupby('subgroup').value.sum().gt(0).cumsum().reset_index(name='out'))
subgroup value out
0 1 0 1
1 1 1 1
2 1 1 1
3 1 0 1
4 2 0 1
5 2 0 1
6 2 0 1
7 3 0 2
8 3 1 2
9 3 0 2
10 3 0 2
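What the groupby/sum/gt/cumsum chain computes can be spelled out step by step. This is a plain-Python sketch of the logic only, not a replacement for the pandas one-liner; the function name `subgroup_counts` is hypothetical.

```python
def subgroup_counts(subgroups, values):
    # Step 1: flag each subgroup that contains at least one non-zero value
    # (the role of groupby('subgroup').value.sum().gt(0)).
    has_nonzero = {}
    for g, v in zip(subgroups, values):
        has_nonzero[g] = has_nonzero.get(g, False) or v != 0

    # Step 2: running total of the flags in sorted subgroup order
    # (the role of cumsum on the grouped result).
    running = 0
    count_for = {}
    for g in sorted(has_nonzero):
        running += has_nonzero[g]
        count_for[g] = running

    # Step 3: broadcast each subgroup's count back to its rows
    # (the role of merge).
    return [count_for[g] for g in subgroups]


subgroups = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
values = [0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
print(subgroup_counts(subgroups, values))
# [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
```

A subgroup with several non-zero values still contributes only 1, because the flag is a yes/no per subgroup.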
After I use clojure.jdbc/insert! to insert some data, it prints many "1"s, so I am wondering whether the insert is done in batch mode, which has better performance, or one row at a time, which is slow. We would like it to run like a Java JDBC batch insert.
clojurewerkz.testcom.core=> (time (apply (partial j/insert! postgres-db 'test_clojure [:a :b :c :d :e]) (map #(process-row % constraints) (repeat 10000 row))))
"Elapsed time: 540.111482 msecs"
(1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
clojure.jdbc/insert! eventually calls
(apply db-do-prepared db transaction? (first stmts) (rest stmts))
which calls db-do-execute-prepared-statement, so it appears the answer is yes: it does them in batches.
I'm looking for J code to do the following.
Suppose I have a sorted list of random integers,
2 3 4 5 7 21 45 49 61
I want to start with the first element and remove any multiples of that element from the list, then move on to the next remaining element and remove its multiples, and so on.
Thus the output I'm looking for is 2 3 5 7 61. Basically a Sieve of Eratosthenes. I would appreciate it if someone could explain the code as well, since I'm learning J and find most code difficult to follow :(
Regards,
babsdoc
It's not exactly what you ask but here is a more idiomatic (and much faster) version of the Sieve.
Basically, what you need is to check which number is a multiple of which. You can get this from the table of modulos: |/~
l =: 2 3 4 5 7 21 45 49 61
|/~ l
0 1 0 1 1 1 1 1 1
2 0 1 2 1 0 0 1 1
2 3 0 1 3 1 1 1 1
2 3 4 0 2 1 0 4 1
2 3 4 5 0 0 3 0 5
2 3 4 5 7 0 3 7 19
2 3 4 5 7 21 0 4 16
2 3 4 5 7 21 45 0 12
2 3 4 5 7 21 45 49 0
Every pair of multiples gives a 0 in the table. Now, we are not interested in the 0s that correspond to self-modulos (2 mod 2, 3 mod 3, etc.; the 0s on the diagonal), so we have to remove them. One way to do this is to add 1s in their place, like so:
=/~ l
1 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0
0 0 0 0 1 0 0 0 0
0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1
(=/~l) + (|/~l)
1 1 0 1 1 1 1 1 1
2 1 1 2 1 0 0 1 1
2 3 1 1 3 1 1 1 1
2 3 4 1 2 1 0 4 1
2 3 4 5 1 0 3 0 5
2 3 4 5 7 1 3 7 19
2 3 4 5 7 21 1 4 16
2 3 4 5 7 21 45 1 12
2 3 4 5 7 21 45 49 1
This can be also written as (=/~ + |/~) l.
From this table we get the final list of numbers: every number whose column contains a 0 is excluded.
We build this list of exclusions simply by multiplying down each column. If a column contains a 0, its product is 0; otherwise it's a positive number:
*/ (=/~ + |/~) l
256 2187 0 6250 14406 0 0 0 18240
Before doing the last step, we'll have to improve this a little. There is no reason to perform long multiplications, since we are only interested in 0s and non-0s. So, when building the table, we'll keep only 0s and 1s by taking the "sign" of each number (this is signum: *):
* (=/~ + |/~) l
1 1 0 1 1 1 1 1 1
1 1 1 1 1 0 0 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1
1 1 1 1 1 0 1 0 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
so,
*/ * (=/~ + |/~) l
1 1 0 1 1 0 0 0 1
Using the exclusion list, you just copy (#) the numbers into your final list:
l #~ */ * (=/~ + |/~) l
2 3 5 7 61
or,
(]#~[:*/[:*=/~+|/~) l
2 3 5 7 61
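If the J notation is still hard to read, the same table-based steps can be mirrored in Python. This is a loose transliteration for explanation only (the function name `sieve_by_table` is hypothetical), with each line marked with the J fragment it imitates.

```python
def sieve_by_table(numbers):
    # numbers[j] % numbers[i] for every pair, like  |/~ l
    mod_table = [[y % x for y in numbers] for x in numbers]

    # Add 1 wherever the two numbers are equal, like  (=/~ + |/~) l
    masked = [
        [(x == y) + r for y, r in zip(numbers, row)]
        for x, row in zip(numbers, mod_table)
    ]

    # A column survives only if it contains no 0, like  */ * ...
    keep = [all(column) for column in zip(*masked)]

    # Copy the survivors, like  l #~ ...
    return [n for n, k in zip(numbers, keep) if k]


l = [2, 3, 4, 5, 7, 21, 45, 49, 61]
print(sieve_by_table(l))   # [2, 3, 5, 7, 61]
```

Like the J version, this does work proportional to the square of the list length, since it builds the full table.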
Tacit iteration is usually done with the conjunction Power. When the test for completion needs to be something other than hitting a fixpoint, the Do While construction works well.
In this solution filterMultiplesOfHead is applied repeatedly until every number has been either applied or filtered out. Numbers already applied are accumulated in a partial answer. When the list still to be processed is empty, the partial answer is the result, after stripping off the boxing used to separate processed from unprocessed data.
filterMultiplesOfHead=: {. (((~: >.)@ %~) # ]) }.
appendHead=: (>@[ , {.@>@])/
pass=: appendHead ; filterMultiplesOfHead@>@{:
prep=: a: , <
unfinished=: [: -. a: -: {:
sieve=: [: ; [: pass^:unfinished^:_ prep
sieve 2 3 4 5 7 21 45 49 61
2 3 5 7 61
prep 2 3 4 7 9 10
┌┬────────────┐
││2 3 4 7 9 10│
└┴────────────┘
appendHead prep 2 3 4 7 9 10
2
filterMultiplesOfHead 2 3 4 7 9 10
3 7 9
pass^:2 prep 2 3 4 7 9 10
┌───┬─┐
│2 3│7│
└───┴─┘
sieve 1-.~/:~~.>:?.$~100
2 3 7 11 29 31 41 53 67 73 83 95 97
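The iteration itself (take the head, keep it, drop its multiples from the rest, repeat until nothing is left) can be sketched in Python as follows. This only illustrates the control flow, not the tacit J style; the names `sieve`, `done` and `todo` are hypothetical stand-ins for the two boxes.

```python
def sieve(numbers):
    done = []                  # heads already applied (the first box)
    todo = list(numbers)       # numbers still to be processed (the second box)
    while todo:                # plays the role of the "unfinished" test
        head, tail = todo[0], todo[1:]
        done.append(head)                              # like appendHead
        todo = [n for n in tail if n % head != 0]      # like filterMultiplesOfHead
    return done


print(sieve([2, 3, 4, 5, 7, 21, 45, 49, 61]))   # [2, 3, 5, 7, 61]
print(sieve([2, 3, 4, 7, 9, 10]))               # [2, 3, 7]
```

After two passes on 2 3 4 7 9 10 the state is done = [2, 3] and todo = [7], which matches the boxed pair shown for pass^:2.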