R: Get value in between two patterns

R: Get value in between two patterns - regex

I have a string representing a command line where a binary file and a series of arguments are given.
string = "./bin -m A 4 -n 12 --LongName1 12 --LongName2 45 -t Hello -l 0.002 "
I'd like to extract the numerical value that is associated to --LongName1. How can I do that? Note that --LongName2 does not necessarily follow LongName1. Anything could follow LongName1 including the end of the string.
I found a solution and it seems to work fine but it is really ugly:
re = regexpr("LongName1", string)
start = attr(re, "match.length") + re[1] + 1
nbdigits = which(is.na(sapply(strsplit(substr(string, start, nchar(string)), ""), as.numeric)))[1] - 1
as.numeric(substr(string, start, start + nbdigits - 1))
# 12

Use a regex with look-behind:
string = "./bin -m A 4 -n 12 --LongName1 12 --LongName2 45 -t Hello -l 0.002 "
pattern <- "(?<=--LongName1 )\\d*"
m <- regexpr(pattern, string, perl = TRUE)
regmatches(string, m)
#[1] "12"

Related

Find value of a string given a superstring regex

How can I match for a string that is a substring of a given input string, preferable with regex?
Given a value: A789Lfu891MatchMe2ENOTSTH, construct a regex that would match a string where the string is a substring of the given value.
Expected matches:
MatchMe
ENOTST
891
Expected Non Match
foo
A789L<fu891MatchMe2ENOTSTH_extra
extra_A789L<fu891MatchMe2ENOTSTH
extra_A789L<fu891MatchMe2ENOTSTH_extra
It seems easier for me to do the reverse: /\w*MatchMe\w*/, but I can't wrap my head around this problem.
Something like how SQL would do it:
SELECT * FROM my_table mt WHERE 'A789Lfu891MatchMe2ENOTSTH' LIKE '%' || mt.foo || '%';

Prefix suffixes
You could alternate prefix suffixes, like turn the superstring abcd into a pattern like ^(a|(a)?b|((a)?b)?c|(((a)?b)?c)?d)$. For your example, the pattern has 1253 characters (exactly 2000 fewer than #tobias_k's).
Python code to produce the regex, can then be tested with tobias_k's code (try it online):
from itertools import accumulate
t = "A789Lfu891MatchMe2ENOTSTH"
p = '^(' + '|'.join(accumulate(t, '({})?{}'.format)) + ')$'
Suffix prefixes
Suffix prefixes look nicer and match faster: ^(a(b(c(d)?)?)?|b(c(d)?)?|c(d)?|d)$. Sadly the generating code is less elegant.
Divide and conquer
For a shorter regex, we can use divide and conquer. For example for the superstring abcdefg, every substring falls into one of three cases:
Contains the middle character (the d). Pattern for that: ((a?b)?c)?d(e(fg?)?)?
Is left of that middle character. So recursively build a regex for the superstring abc: a|a?bc?|c.
Is right of that middle character. So recursively build a regex for the superstring efg: e|e?fg?|g.
And then make an alternation of those three cases:
a|a?bc?|c|((a?b)?c)?d(e(fg?)?)?|e|e?fg?|g
Regex length will be Θ(n log n) instead of our previous Θ(n2).
The result for your superstring example of 25 characters is this regex with 301 characters:
^(A|A?78?|8|((A?7)?8)?9(Lf?)?|Lf?|f|(((((A?7)?8)?9)?L)?f)?u(8(9(1(Ma?)?)?)?)?|89?|9|(8?9)?1(Ma?)?|Ma?|a|(((((((((((A?7)?8)?9)?L)?f)?u)?8)?9)?1)?M)?a)?t(c(h(M(e(2(E(N(O(T(S(TH?)?)?)?)?)?)?)?)?)?)?)?|c|c?hM?|M|((c?h)?M)?e(2E?)?|2E?|E|(((((c?h)?M)?e)?2)?E)?N(O(T(S(TH?)?)?)?)?|OT?|T|(O?T)?S(TH?)?|TH?|H)$
Benchmark
Speed benchmarks don't really make that much sense, as in reality we'd just do a regular substring check (in Python s in t), but let's do one anyway.
Results for matching all substrings of your superstring, using Python 3.9.6 on my PC:
1.09 ms just_all_substrings
25.18 ms prefix_suffixes
3.47 ms suffix_prefixes
3.46 ms divide_and_conquer
And on TIO and their "Python 3.8 (pre-release)" with quite different results:
0.30 ms just_all_substrings
46.90 ms prefix_suffixes
2.24 ms suffix_prefixes
2.95 ms divide_and_conquer
Regex lengths (also printed by the below benchmark code):
3253 characters - just_all_substrings
1253 characters - prefix_suffixes
1253 characters - suffix_prefixes
301 characters - divide_and_conquer
Benchmark code (Try it online!):
from timeit import repeat
import re
from itertools import accumulate
def just_all_substrings(t):
return "^(" + '|'.join(t[i:k] for i in range(0, len(t))
for k in range(i+1, len(t)+1)) + ")$"
def prefix_suffixes(t):
return '^(' + '|'.join(accumulate(t, '({})?{}'.format)) + ')$'
def suffix_prefixes(t):
return '^(' + '|'.join(list(accumulate(t[::-1], '{1}({0})?'.format))[::-1]) + ')$'
def divide_and_conquer(t):
def suffixes(t):
# Example: abc => ((a?b)?c)?
regex = f'{t[0]}?'
for c in t[1:]:
regex = f'({regex}{c})?'
return regex
def prefixes(t):
# Example: efg => (e(fg?)?)?
regex = f'{t[-1]}?'
for c in t[-2::-1]:
regex = f'({c}{regex})?'
return regex
def superegex(t):
n = len(t)
if n == 1:
return t
if n == 2:
return f'{t}?|{t[1]}'
mid = n // 2
contain = suffixes(t[:mid]) + t[mid] + prefixes(t[mid+1:])
left = superegex(t[:mid])
right = superegex(t[mid+1:])
return '|'.join([left, contain, right])
return '^(' + superegex(t) + ')$'
creators = just_all_substrings, prefix_suffixes, suffix_prefixes, divide_and_conquer,
t = "A789Lfu891MatchMe2ENOTSTH"
substrings = [t[start:stop]
for start in range(len(t))
for stop in range(start+1, len(t)+1)]
def test(p):
match = re.compile(p).match
return all(map(re.compile(p).match, substrings))
for creator in creators:
print(test(creator(t)), creator.__name__)
print()
print('Regex lengths:')
for creator in creators:
print('%5d characters -' % len(creator(t)), creator.__name__)
print()
for _ in range(3):
for creator in creators:
p = creator(t)
number = 10
time = min(repeat(lambda: test(p), number=number)) / number
print('%5.2f ms ' % (time * 1e3), creator.__name__)
print()

One way to "construct" such a regex would be to build a disjunction of all possible substrings of the original value. Example in Python:
import re
t = "A789Lfu891MatchMe2ENOTSTH"
p = "^(" + '|'.join(t[i:k] for i in range(0, len(t))
for k in range(i+1, len(t)+1)) + ")$"
good = ["MatchMe", "ENOTST", "891"]
bad = ["foo", "A789L<fu891MatchMe2ENOTSTH_extra",
"extra_A789L<fu891MatchMe2ENOTSTH",
"extra_A789L<fu891MatchMe2ENOTSTH_extra"]
assert all(re.match(p, s) is not None for s in good)
assert all(re.match(p, s) is None for s in bad)
For the value "abcd", this would e.g. be "^(a|ab|abc|abcd|b|bc|bcd|c|cd|d)$"; for the given example it's a bit longer, with 3253 characters...

Stata Regex for 'standalone' numbers in string

I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any pattern of numbers that are not bounded by a character (other than whitespace), or a letter. For example, if the string contained t370 or 6-test I would want those to remain. It's only when I have numbers next to each other.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
ID string
1 7-test
2 67-tty
3 j37b2 3hty
I've tried different regex statements to find when numbers are wrapped in a word boundary: regexr(string, "\b[0-9]+\b", ""); in addition to manually adding the white space " [0-9]+" which will only replace if the pattern occurs in the middle, not at the start of a string. If it's easier to do this without regex expressions that's fine, I was just trying to become more familiar.

Following up on the loop suggesting from the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings
split string, gen(part) parse(" ") // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
* add in parts that contain at least one letter
replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
+----------------------------------------+
| id string string2 |
|----------------------------------------|
1. | 1 9884 7-test 58 - 489 7-test |
2. | 2 67-tty 783 444 67-tty |
3. | 3 j3782 3hty j3782 3hty |
+----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.

Regex to match despite some of the characters not matching pattern?

I'm working with some bioinformatics data, and I've got this sed expression:
sed -n 'N;/.*:\(.*\)\n.*\1/{p;n;p;n;p};D' file.txt
It currently takes a file that is structured such as:
#E00378:1485 1:N:0:ABC
ABCDEF ##should match, all characters present
+
#
#E00378:1485 1:N:1:ABC
XYZABX ##should match, with permutation
+
#
#E00378:1485 1:N:1:ABCDE
ZABCDXFGH ##should match, with permutation
+
#
#E00378:1485 1:N:1:CBA
ABC ##should not match, order not preserved
+
#
Then it returns 4 lines if the sequence after : is found in the second line, so in this case I would get:
#E00378:1485 1:N:0:ABC
ABCDEF
+
#
However, I am looking to expand my search a little, by adding the possibility of searching for any single permutation of the letters, while maintaining the order, such that ABX, ZBC, AHC, ABO would all match the search criteria ABC.
Is a search like this possible to construct as a one-liner? Or should I write a script?
I was thinking it should be possible to programmatically change one of the letters to a * in the pattern space.
I am trying to make something along the lines of an AWK pattern that has a match defined as:
p = "";
p = p "."a[2]a[3]a[4]a[5]a[6]a[7]a[8]"|";
p = p a[1]"."a[3]a[4]a[5]a[6]a[7]a[8]"|";
p = p a[1]a[2]"."a[4]a[5]a[6]a[7]a[8]"|";
p = p a[1]a[2]a[3]"."a[5]a[6]a[7]a[8]"|";
p = p a[1]a[2]a[3]a[4]"."a[6]a[7]a[8]"|";
p = p a[1]a[2]a[3]a[4]a[5]"."a[7]a[8]"|";
p = p a[1]a[2]a[3]a[4]a[5]a[6]"."a[8]"|";
p = p a[1]a[2]a[3]a[4]a[5]a[6]a[7]".";
m = p;
But I can't seem to figure out how to make it programmatically for n numbers.

Okay, check this out where fuzzy is your input above:
£ perl -0043 -MText::Fuzzy -ne 'if (/.*:(.*?)\n(.*?)\n/) {my ($offset, $edits, $distance) = Text::Fuzzy::fuzzy_index ($1, $2); print "$offset $edits $distance\n";}' fuzzy
3 kkk 0
5 kkd 1
5 kkkkd 1
Since you haven't been 100% clear on your "fuzziness" criteria (and can't be until you have a measurement tool), I'll explain this first. Reference here:
http://search.cpan.org/~bkb/Text-Fuzzy-0.27/lib/Text/Fuzzy.pod
Basically, for each record (which I've assumed are split on # which is the -0043 bit), the output is an offset, how the 1st string can become the 2nd string, and lastly the "distance" (Levenshtein, I would assume) between the two strings.
So..
£ perl -0043 -MText::Fuzzy -ne 'if (/.*:(.*?)\n(.*?)\n/) {my ($offset, $edits, $distance) = Text::Fuzzy::fuzzy_index ($1, $2); print "$_\n" if $distance < 2;}' fuzzy
#E00378:1485 1:N:0:ABC
ABCDEF
+
#
#E00378:1485 1:N:1:ABC
XYZABX
+
#
#E00378:1485 1:N:1:ABCDE
ZABCDXFGH
+
#
See here for installing perl modules like Text::Fuzzy
https://www.thegeekstuff.com/2008/09/how-to-install-perl-modules-manually-and-using-cpan-command/
Example input/output for a record that wouldn't be printed (distance is 3):
#E00378:1485 1:N:1:ABCDE
ZDEFDXFGH
+
#
gives us this (or simply doesn't print with the second perl command)
3 dddkk 3

Awk doesn't have sed back-references, but has more expressiveness to make up the difference. The following script composes the pattern for matching from the final field of the lead line, then applies the pattern to the subsequent line.
#! /usr/bin/awk -f
BEGIN {
FS = ":"
}
# Lead Line has 5 fields
NF == 5 {
line0 = $0
seq = $NF
getline
if (seq != "") {
n = length(seq)
if (n == 1) {
pat = seq
} else {
# ABC -> /.BC|A.C|AB./
pat = "." substr(seq, 2, n - 1)
for (i = 2; i < n; ++i)
pat = pat "|" substr(seq, 1, i - 1) "." substr(seq, i + 1, n - i)
pat = pat "|" substr(seq, 1, n - 1) "."
}
if ($0 ~ pat) {
print line0
print
getline; print
getline; print
next
}
}
getline
getline
}
If the above needs some work to form a different matching pattern, we mostly limit our modification to the lines of pattern composition. By the way... I noticed that sequences repeat -- to make this faster we can implement caching:
#! /usr/bin/awk -f
BEGIN {
FS = ":"
# Noticed that sequences repeat
# -- implement caching of patterns
split("", cache)
}
# Lead Line has 5 fields
NF == 5 {
line0 = $0
seq = $NF
getline
if (seq != "") {
if (seq in cache) {
pat = cache[seq]
} else {
n = length(seq)
if (n == 1) {
pat = seq
} else {
# ABC -> /.BC|A.C|AB./
pat = "." substr(seq, 2, n - 1)
for (i = 2; i < n; ++i)
pat = pat "|" substr(seq, 1, i - 1) "." substr(seq, i + 1, n - i)
pat = pat "|" substr(seq, 1, n - 1) "."
}
cache[seq] = pat
}
if ($0 ~ pat) {
print line0
print
getline; print
getline; print
next
}
}
getline
getline
}

Add two different numbers in a single text field space separated in Access VBA

I am using Access and VBA to tidy up a database before a migration. One field is going from text to an INT. So I need to convert and possibly add some numbers which exist in a singular field.
Examples:
F/C 3 other 8 should become 11
Calender-7 should become 7
21 F/C and 1 other should become 22
29 (natural ways) should become 29
The second and fourth line are simple enough, just use the following regex in VBA
Dim rgx As New RegExp
Dim inputText As String
Dim outputText As String
rgx.Pattern = "[^0-9]*"
rgx.Global = True
inputText = "29 (natural ways)"
outputText = rgx.Replace(inputText, "")
The downside is if I use it on option 1 or 3:
F/C 3 other 8 will become 38
Calender-7 will become 7
21 F/C and 1 other will become 211
29 (natural ways) will become 29
This is simple enough in bash, I can just keep the spaces by adding one to [^0-9 ]* and then piping it into awk which will add every field using a space as a delimiter like so:
sed 's/[^0-9 ]*//g' | awk -F' ' 's=0; {for (i=1; i<=NF; i++) s=s+$i; print s}'
F/C 3 other 8 will become 11
21 F/C and 1 other will become 22
The problem is I cannot use bash, and there are far too many values to do it by hand. Is there any way to use VBA to accomplish this?

Instead of using the replace method, just capture and then add up all the numbers. For example:
Option Explicit
Function outputText(inputText)
Dim rgx As RegExp
Dim mc As MatchCollection, m As Match
Dim I As Integer
Set rgx = New RegExp
rgx.Pattern = "[0-9]+"
rgx.Global = True
Set mc = rgx.Execute(inputText)
For Each m In mc
I = I + CInt(m) 'may Need to be cast as an int in Access VBA; not required in Excel VBA
Next m
outputText = I
End Function

I'm not sure if there are any easier way for your question. Here I've wrote small function for you.
Requirement: add all numbers in a string, identify "consecutive" digits as one number.
pseudo:
Loop through given text
find the first number and check/loop if following chars are numbers
if following chars are numbers treat as one number else pass the
result
continue searching from last point and add the result to the total
in code:
Public Function ADD_NUMB(iText As String) As Long
Dim I, J As Integer
Dim T As Long
Dim TM As String
For I = 1 To Len(iText)
If (InStr(1, "12346567890", Mid$(iText, I, 1)) >= 1) Then
TM = Mid(iText, I, 1)
For J = I + 1 To Len(iText)
If (InStr(1, "12346567890", Mid$(iText, J, 1)) >= 1) Then
TM = TM & Mid$(iText, J, 1)
Else
Exit For
End If
Next J
T = T + Val(Nz(TM, 0))
I = J
End If
Next I
ADD_NUMB = T
End Function
usage:
dim total as integer
total = ADD_NUMB("21 F/C and 1 other")
not sure about performance but it will get you what you need :)

How do I capture all occurences in a string in Vim?

I want to capture all certain occurrences in a string in Vimscript.
example:
let my_calculation = '200/3 + 23 + 100.5/3 -2 + 4*(200/2)'
How can I capture all numbers (including dots if there are) before and after the '/'? in 2 different variables:
- output before_slash: 200100.5200
- output after slash 332
How can I replace them if a condition occurs?
p.e. if after a single '/' there is no '.' add '.0' after this number
I tried to use matchstring and regex but after trying and trying I couldn't resolve it.

A useful feature that can be taken advantage of in this case is substitution
with an expression (see :help sub-replace-\=).
let [a; b] = [[]]
call substitute(s, '\(\d*\.\?\d\+\)/\(\d*\.\?\d\+\)\zs',
\ '\=add(a,submatch(1))[1:0]+add(b,submatch(2))[1:0]', 'g')

To answer the second part of the question:
let my_calculation = '200/3 + 23 + 100.5/3 -2 + 4*(200/2)'
echo substitute(my_calculation, '\(\/[0-9]\+\)\([^0-9.]\|$\)', '\1.0\2', 'g')
The above outputs:
200/3.0 + 23 + 100.5/3.0 -2 + 4*(200/2.0)

Give this a try:
function! GetNumbers(string)
let pairs = filter(split(a:string, '[^0-9/.]\+'), 'v:val =~ "/"')
let den = join(map(copy(pairs), 'matchstr(v:val, ''/\zs\d\+\(\.\d\+\)\?'')'), '')
let num = join(map(pairs, 'matchstr(v:val, ''\d\+\(\.\d\+\)\?\ze/'')'), '')
return [num, den]
endfunction
let my_calculation = '200/3 + 23 + 100.5/3 -2 + 4*(200/2)'
let [a,b] = GetNumbers(my_calculation)
echo a
echo b

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

R: Get value in between two patterns - regex

Use a regex with look-behind: string = "./bin -m A 4 -n 12 --LongName1 12 --LongName2 45 -t Hello -l 0.002 " pattern <- "(?<=--LongName1 )\\d*" m <- regexpr(pattern, string, perl = TRUE) regmatches(string, m) #[1] "12"

Related

Find value of a string given a superstring regex

Stata Regex for 'standalone' numbers in string

Regex to match despite some of the characters not matching pattern?

Add two different numbers in a single text field space separated in Access VBA

How do I capture all occurences in a string in Vim?

Categories

Resources