Find RSTART for multiple matches in a line - regex

I am writing a gawk program that examines STRING to find where it matches SUBSTRING. One problem I have run into is that the match function only gives the leftmost match in the string. My current thought is to use gsub to count how many times SUBSTRING is present and then call match repeatedly on substr(STRING,RSTART+1) to find the true start position of each match, with some edits to the code of course. I am wondering if there is an easier way than this, or a built-in function that gives all the RSTART values.
Example:
STRING=DDDADDCDFFDFGSDD
SUBSTRING=D
EDIT:
I looked at the array form of match (thanks for pointing me to more up-to-date documentation than I had been reading). This still doesn't work: it lets you capture multiple subexpressions in the same string, but still only gives the leftmost location of each of them.
For example:
$ echo DDDADDCDFFDFGSDD | gawk '{match($0,/D/,a); for (i in a) print i,a[i]}'
0start 1
0length 1
0 D
It works to find the leftmost occurrence of multiple things:
echo gDDDADDCDFFDFGSDD | gawk '{match($0,/(D)(A)/,a); for (i in a) print i,a[i]}'
0start 4
0length 2
1start 4
2start 5
2length 1
1length 1
0 DA
1 D
2 A
So we are still finding only the leftmost match (which is what the documentation says it will do).

I have not found a native way to deal with this, so I wrote this function to do it. It only works with versions of gawk that support arrays of arrays (true multidimensional arrays); making it work with older versions of awk would be simple, though parsing the results afterwards would be more difficult.
The function searches through the string for the regex and populates an array MM. It returns -1 if there was an error, 0 if there were no matches found, else it returns the number of matches found.
function multiMatch(string, subs,    t, s, n, found) {
    split("", MM, "")    # clear the global result array
    RLENGTH = 0
    RSTART = 0
    t = 0                # offset of the current search window within string
    n = 0                # number of matches stored so far
    if (length(string) == 0 || length(subs) == 0) {
        print "Must have string and Regex to look for"
        return -1
    }
    while (1) {
        t = RSTART + t   # resume one character past the previous match start
        s = substr(string, t + 1)
        if (length(s) == 0) {
            break
        }
        match(s, subs)
        if (RLENGTH == -1) {
            break
        }
        # show the match as prefix-match-suffix
        found = substr(string, 1, t + RSTART - 1) "-" substr(string, t + RSTART, RLENGTH) "-" substr(string, t + RSTART + RLENGTH)
        MM[n]["RSTART"] = RSTART    # relative to s; the position in string is t + RSTART
        MM[n]["RLENGTH"] = RLENGTH
        MM[n]["STR"] = found
        n++
    }
    return n
}
Example
echo doogggogogggggggooogggogggggooogoooggoooo g*o | gawk '
BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"}
{
    print "Found "multiMatch($1,$2)" Matches"
    for (x in MM) {
        print x,MM[x]["RSTART"],MM[x]["RLENGTH"],MM[x]["STR"]
    }
}'
OUTPUT
Found 40 Matches
0 2 1 d-o-ogggogogggggggooogggogggggooogoooggoooo
1 1 1 do-o-gggogogggggggooogggogggggooogoooggoooo
2 1 4 doo-gggo-gogggggggooogggogggggooogoooggoooo
3 1 3 doog-ggo-gogggggggooogggogggggooogoooggoooo
4 1 2 doogg-go-gogggggggooogggogggggooogoooggoooo
5 1 1 dooggg-o-gogggggggooogggogggggooogoooggoooo
6 1 2 doogggo-go-gggggggooogggogggggooogoooggoooo
7 1 1 doogggog-o-gggggggooogggogggggooogoooggoooo
8 1 8 doogggogo-gggggggo-oogggogggggooogoooggoooo
9 1 7 doogggogog-ggggggo-oogggogggggooogoooggoooo
10 1 6 doogggogogg-gggggo-oogggogggggooogoooggoooo
11 1 5 doogggogoggg-ggggo-oogggogggggooogoooggoooo
12 1 4 doogggogogggg-gggo-oogggogggggooogoooggoooo
13 1 3 doogggogoggggg-ggo-oogggogggggooogoooggoooo
14 1 2 doogggogogggggg-go-oogggogggggooogoooggoooo
15 1 1 doogggogoggggggg-o-oogggogggggooogoooggoooo
16 1 1 doogggogogggggggo-o-ogggogggggooogoooggoooo
17 1 1 doogggogogggggggoo-o-gggogggggooogoooggoooo
18 1 4 doogggogogggggggooo-gggo-gggggooogoooggoooo
19 1 3 doogggogogggggggooog-ggo-gggggooogoooggoooo
20 1 2 doogggogogggggggooogg-go-gggggooogoooggoooo
21 1 1 doogggogogggggggoooggg-o-gggggooogoooggoooo
22 1 6 doogggogogggggggooogggo-gggggo-oogoooggoooo
23 1 5 doogggogogggggggooogggog-ggggo-oogoooggoooo
24 1 4 doogggogogggggggooogggogg-gggo-oogoooggoooo
25 1 3 doogggogogggggggooogggoggg-ggo-oogoooggoooo
26 1 2 doogggogogggggggooogggogggg-go-oogoooggoooo
27 1 1 doogggogogggggggooogggoggggg-o-oogoooggoooo
28 1 1 doogggogogggggggooogggogggggo-o-ogoooggoooo
29 1 1 doogggogogggggggooogggogggggoo-o-goooggoooo
30 1 2 doogggogogggggggooogggogggggooo-go-ooggoooo
31 1 1 doogggogogggggggooogggogggggooog-o-ooggoooo
32 1 1 doogggogogggggggooogggogggggooogo-o-oggoooo
33 1 1 doogggogogggggggooogggogggggooogoo-o-ggoooo
34 1 3 doogggogogggggggooogggogggggooogooo-ggo-ooo
35 1 2 doogggogogggggggooogggogggggooogooog-go-ooo
36 1 1 doogggogogggggggooogggogggggooogooogg-o-ooo
37 1 1 doogggogogggggggooogggogggggooogoooggo-o-oo
38 1 1 doogggogogggggggooogggogggggooogoooggoo-o-o
39 1 1 doogggogogggggggooogggogggggooogoooggooo-o-
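For comparison, the same scan-and-resume idea can be sketched in Python. This is only an illustration, not part of the gawk answer: the function name all_matches is hypothetical, and Python's re module uses its own matching rules rather than awk's regex engine, so results can differ for some patterns. Like multiMatch, it resumes one character past the previous match start, so overlapping matches are reported, and it returns absolute 1-based start positions.

```python
import re

def all_matches(text, pattern):
    """Return (start, length, matched_text) for each match found by repeated searching.

    Starts are 1-based, like awk's RSTART. After each hit the search resumes
    one character past the previous match start, so overlaps are reported.
    """
    regex = re.compile(pattern)
    results = []
    pos = 0
    while pos < len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        results.append((m.start() + 1, len(m.group(0)), m.group(0)))
        pos = m.start() + 1
    return results

# Start positions of every D in the string from the question
print([start for start, _, _ in all_matches("DDDADDCDFFDFGSDD", "D")])
# [1, 2, 3, 5, 6, 8, 11, 15, 16]
```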

One variable in kg and grams, another indicates which unit; how can I get a new variable in kg?

In Stata, quantity holds values in both kg and grams; unit = 1 indicates kg and unit = 2 indicates grams. How can I generate a new variable quantity_kg that converts all gram values into kg?
My existing dataset-
clear
input double(hhid quantity unit unit_price)
1 24 1 .
1 4 1 .
1 350 2 50
1 550 2 90
1 2 1 65
1 3.5 1 85
1 1 1 20
1 4 1 25
1 2 1 .
2 1 1 30
2 2 1 15
2 1 1 20
2 250 2 10
2 2 1 20
2 400 2 10
2 100 2 60
2 1 1 20
My expected dataset
input double(hhid quantity unit unit_price quantity_kg)
1 24 1 . 24
1 4 1 . 4
1 350 2 50 .35
1 550 2 90 .55
1 2 1 65 2
1 3.5 1 85 3.5
1 1 1 20 1
1 4 1 25 4
1 2 1 . 2
2 1 1 30 1
2 2 1 15 2
2 1 1 20 1
2 250 2 10 .25
2 2 1 20 2
2 400 2 10 .40
2 100 2 60 .10
2 1 1 20 1
The code below does what you want.
This looks like household data where one typically has to do a lot of unit conversions. They are also a common source of error so I have included the best practice of defining conversion rates and unit codes in locals. If you define this at one place, then you can reuse these locals in multiple places where you convert units. It is easy to spot typos in the rows with replace as you would notice if one row said kilo_rate but then gram_unit. In this simple example it might be overkill, but if you have many units and rates, then this is a neat way to avoid errors.
clear
input double(hhid quantity unit unit_price)
1 24 1 .
1 4 1 .
1 350 2 50
1 550 2 90
1 2 1 65
1 3.5 1 85
1 1 1 20
1 4 1 25
1 2 1 .
2 1 1 30
2 2 1 15
2 1 1 20
2 250 2 10
2 2 1 20
2 400 2 10
2 100 2 60
2 1 1 20
end
*Define conversion rates and unit codes
local kilo_rate = 1
local kilo_unit = 1
local gram_rate = 0.001
local gram_unit = 2
*Create the standardized variable
gen quantity_kg = .
replace quantity_kg = quantity * `kilo_rate' if unit == `kilo_unit'
replace quantity_kg = quantity * `gram_rate' if unit == `gram_unit'
// unit 1 means kg, unit 2 means g, and 1000 g = 1 kg
generate quantity_kg = cond(unit == 1, quantity, cond(unit == 2, quantity/1000, .))
Your example doesn't have any missing values on unit, but it does no harm to imagine that they might occur.
Providing a comment by way of explanation could be anywhere between redundant and essential for third parties.

Pandas grouped differences with variable lags

I have a pandas data frame with three variables. The first is a grouping variable, the second a within group "scenario" and the third an outcome. I would like to calculate the within group difference between the null condition, scenario zero, and the other scenarios within the group. The number of scenarios varies between the different groups. My data looks like:
ipdb> aDf
FieldId Scenario TN_load
0 0 0 134.922952
1 0 1 111.787326
2 0 2 104.805951
3 1 0 17.743467
4 1 1 13.411849
5 1 2 13.944552
6 1 3 17.499152
7 1 4 17.640090
8 1 5 14.220673
9 1 6 14.912306
10 1 7 17.233862
11 1 8 13.313953
12 1 9 17.967438
13 1 10 14.051882
14 1 11 16.307317
15 1 12 12.506358
16 1 13 16.266233
17 1 14 12.913150
18 1 15 18.149811
19 1 16 12.337736
20 1 17 12.008868
21 1 18 13.434605
22 2 0 454.857959
23 2 1 414.372215
24 2 2 478.371387
25 2 3 385.973388
26 2 4 487.293966
27 2 5 481.280175
28 2 6 403.285123
29 3 0 30.718375
... ... ...
29173 4997 3 53.193992
29174 4997 4 45.800968
I will also have to write functions to get percentage differences etc. but this has me stumped. Any help greatly appreciated.
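You can get the difference from scenario 0 within each group using groupby and transform. This assumes the scenario 0 row comes first within each FieldId, as it does in your data, because x.iloc[0] takes the first value of the group: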
df['TN_load_0'] = df['TN_load'].groupby(df['FieldId']).transform(lambda x: x - x.iloc[0])
df
FieldId Scenario TN_load TN_load_0
0 0 0 134.922952 0.000000
1 0 1 111.787326 -23.135626
2 0 2 104.805951 -30.117001
3 1 0 17.743467 0.000000
4 1 1 13.411849 -4.331618
5 1 2 13.944552 -3.798915
6 1 3 17.499152 -0.244315

Keep first record when event occurs

I have the following data in Stata:
clear
* Input data
input grade id exit time
1 1 . 10
2 1 . 20
3 1 2 30
4 1 0 40
5 1 . 50
1 2 0 10
2 2 0 20
3 2 0 30
4 2 0 40
5 2 0 50
1 3 1 10
2 3 1 20
3 3 0 30
4 3 . 40
5 3 . 50
1 4 . 10
2 4 . 20
3 4 . 30
4 4 . 40
5 4 . 50
1 5 1 10
2 5 2 20
3 5 1 30
4 5 1 40
5 5 1 50
end
The objective is to keep, for each id, the first row in which an event occurs; if no event occurs, keep the last row for that id. Here is an example of the data I hope to obtain:
* Input data
input grade id exit time
3 1 2 30
5 2 0 50
1 3 1 10
5 4 . 50
1 5 1 10
end
The definition of an event appears to be that exit is not zero or missing. If so, then all you need to do is tweak the code in my previous answer:
bysort id (time): egen when_first_e = min(cond(exit > 0 & exit < ., time, .))
by id: gen tokeep = cond(when_first_e == ., time == time[_N], time == when_first_e)
Previous thread was here.

Regular expression to distinguish between single and multiple digit numbers

I have a regular expression for capturing repeating numerical patterns in a string of numbers. However, it is not able to distinguish between single and multiple digits within a number.
Given a string:
0 5 0 0 0 16 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1 11 1 1 1 11 1 1 1 1 1 1 1 2 11 1 4 4 4 16
and regular expression
(\d+)( \1)+
the match result is
0 5 0 0 0 16 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1 11 1 1 1 11 1 1 1 1 1 1 1 2 1 1 1 4 4 4 16
The regex is not able to distinguish between 1 and 11.
(Note: 11 could also be a repeating number and maximum 3 digits are possible in a number)
You need to add word boundaries to the regex, so that a backreference to 1 cannot match the first digit of 11. For example:
(\b\d+)( \1\b)+
See https://regex101.com/r/ZSCMjF/1
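A quick way to see the effect is to run both patterns over a short string. The snippet below is a minimal sketch using Python's re module; the sample string and the helper name runs are made up for illustration and are not from the question.

```python
import re

text = "1 1 11 4 4 4 16"  # hypothetical sample, shorter than the original string

without_boundary = re.compile(r"(\d+)( \1)+")
with_boundary = re.compile(r"(\b\d+)( \1\b)+")

def runs(regex, s):
    """Return the full text of each repeated-number run found in s."""
    return [m.group(0) for m in regex.finditer(s)]

print(runs(without_boundary, text))  # ['1 1 1', '4 4 4']
print(runs(with_boundary, text))     # ['1 1', '4 4 4']
```

Without the boundaries, the first run swallows the leading digit of 11; with them, the run stops at the last standalone 1.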

Regular Expression to match few characters from a string

I am trying to find a string within another string. However, I am trying to match even if one or more character is not matching.
Let me explain with an example :
Let's say I have a string 'abcdefghij'. Now if the string to match is 'abcd',
I could write strfind('abcdefghij', 'abc')
Now, I have a string 'adcf'. Notice that there is a mismatch in two characters; I would still consider it a match.
Any idea how to do it ?
I know, this is not the most optimal code.
Example :
a='abcdefghijk';
b='xbcx'
c='abxx'
d='axxd'
e='abcx'
f='xabc'
g='axcd'
h='abxd'
i ='abcd'
All these strings should match with a. I hope this example makes it more clear. The idea is, if there is a mismatch of 1 or 2 characters also, it should be considered as a match.
You could do it like this:
A = 'abcdefghij'; % Main string
B = 'adcf'; % String to be found
tolerance = 2; % Maximum number of different characters to tolerate
nA = numel(A);
nB = numel(B);
pos = find(sum(A(mod(cumsum([(1:nA)' ones(nA, nB - 1)], 2) - 1, nA) + 1) == repmat(B, nA, 1), 2) >= nB - tolerance);
In this case it will return pos = [1 3]'; because "adcf" can be matched on the first position (matching "a?c?") and on the third position (matching "?d?f")
Explanation:
First, we take the sizes of A and B
Then, we create the matrix [(1:nA)' ones(nA, nB - 1)], which gives us this:
Output:
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
7 1 1 1
8 1 1 1
9 1 1 1
10 1 1 1
We perform a cumulative sum to the right, using cumsum, to achieve this:
Output:
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
8 9 10 11
9 10 11 12
10 11 12 13
And use the mod function so each number is between 1 and nA, like this:
Output:
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
8 9 10 1
9 10 1 2
10 1 2 3
We then use that matrix as an index for the A matrix.
Output:
abcd
bcde
cdef
defg
efgh
fghi
ghij
hija
ijab
jabc
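Note that this matrix has all substrings of A of size nB. Because of the mod wrap-around, the last nB - 1 rows (hija, ijab, jabc) are circular: they run off the end of A and continue from its start.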
Now we use repmat to replicate B down, 'nA rows'.
Output:
adcf
adcf
adcf
adcf
adcf
adcf
adcf
adcf
adcf
adcf
And perform a direct comparison:
Output:
1 0 1 0
0 0 0 0
0 1 0 1
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
Summing to the right gives us this:
Output:
2
0
2
0
0
0
0
0
0
0
This is the number of character matches for each possible substring.
To finish, we use find to select the indexes of the matches within our tolerance.
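The same sliding-window comparison can be written as an explicit loop. The following is a minimal sketch in Python rather than MATLAB, with a hypothetical function name approx_find. Unlike the vectorized version above, it does not wrap around the end of the main string, so only true substrings are compared.

```python
def approx_find(haystack, needle, tolerance):
    """Return 1-based start positions where needle matches a window of
    haystack with at most `tolerance` mismatching characters."""
    n = len(needle)
    positions = []
    for start in range(len(haystack) - n + 1):
        window = haystack[start:start + n]
        mismatches = sum(1 for a, b in zip(window, needle) if a != b)
        if mismatches <= tolerance:
            positions.append(start + 1)  # report 1-based, as MATLAB does
    return positions

print(approx_find("abcdefghij", "adcf", 2))  # [1, 3]
```

Counting mismatches directly is equivalent to the check above that the number of matches is at least nB - tolerance.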
In your code
c=a-b is not valid (Matrix dimensions not same)
If you need at least one matching character, in any order (as your example suggests), you can use something like this:
>> a='abcdefgh';
>> b='adcf';
>> sum(ismember(a,b)) ~= 0
ans =
1