How to match any pattern to any string using Ruby? - regex

I would like to create a function in Ruby which accepts the following parameters:
A pattern string (e.g. "abab", "aabb", "aaaa", etc.)
An input string (e.g. "dogcatdogcat", "carcarhousehouse", etc.)
The return of the function should be "true" if the string matches the pattern and "false" if not.
My approach for the first step:
Use regex in order to separate the input string into an array of words (e.g. ["dog", "cat", "dog", "cat"]).
My regex expertise is not good enough to be able to find the right regex for this problem.
Does anyone know how to perform the appropriate regex so that recurring words get separated assuming the input string is always some form of pattern?

You can use capture groups and backreferences to match the same substring multiple times, e.g.:
abab = /\A(.+)(.+)\1\2\z/
aabb = /\A(.+)\1(.+)\2\z/
aaaa = /\A(.+)\1\1\1\z/
'dogcatdogcat'.match?(abab) #=> true
'dogcatdogcat'.match?(aabb) #=> false
'dogcatdogcat'.match?(aaaa) #=> false
'carcarhousehouse'.match?(abab) #=> false
'carcarhousehouse'.match?(aabb) #=> true
'carcarhousehouse'.match?(aaaa) #=> false
In the above patterns, (.+) defines a capture group that matches one or more characters. \1 then refers to the 1st capture group and matches the same substring again (\2 refers to the 2nd group, and so on).
\A and \z are anchors to match the beginning and end of the string.
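To get the function asked for in the question, the regex can be generated from the pattern string instead of being written by hand. The following is a minimal sketch, not part of the original answer: pattern_to_regexp and matches_pattern? are hypothetical helper names, and named groups are used instead of \1, \2 only to avoid ambiguity once a pattern has ten or more distinct letters. Like the hand-written regexes above, it does not require different letters to stand for different words.

```ruby
# Hypothetical helper: turn a pattern string such as "abab" into an anchored
# regex with one named capture group per distinct letter.
def pattern_to_regexp(pattern)
  groups = {} # letter => group index, in order of first appearance
  body = pattern.each_char.map do |letter|
    if groups.key?(letter)
      "\\k<g#{groups[letter]}>"     # repeated letter: backreference to its group
    else
      groups[letter] = groups.size + 1
      "(?<g#{groups[letter]}>.+)"   # first occurrence: capture one or more chars
    end
  end.join
  Regexp.new("\\A#{body}\\z")
end

# Hypothetical helper: true if the input can be split according to the pattern.
def matches_pattern?(pattern, input)
  input.match?(pattern_to_regexp(pattern))
end

matches_pattern?('abab', 'dogcatdogcat')     #=> true
matches_pattern?('aabb', 'dogcatdogcat')     #=> false
matches_pattern?('aabb', 'carcarhousehouse') #=> true
```

Because the regex engine backtracks over every possible split, no separate word-splitting step is needed.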

Related

I have a regex that returns nil values for excluded words. How do I return nothing instead?

Given the following test string:
{{one}}
<content>{{two}}</content>
{{three}}
I only want to match {{one}} and {{three}}. I have the following regex:
{{((?!#)(?!\/).*?)}}|(?:<content\b[^>]*>[^<>]*<\/content>)
That matches {{one}} and {{three}}, but also matches a nil value (see: https://rubular.com/r/E4faa6Tze04WnG). How do I only match {{one}} and {{three}} and NOT the nil value?
(that is, the regex should only return two matches instead of three)
Taken from your comment:
I have a large body of text and I want to use ruby's gsub method to replace {{tags}} that are outside of the <content> tags.
This regex should do what you need:
(^{{(?!#|\/).*}}$)
This matches both {{one}} and {{three}}, and similar interpolations à la {{tag}}, except those wrapped in <content> tags, e.g. <content>{{tag}}</content>.
Can I ignore only <content> tags specifically and not other tags? For example, I tried it with other tags here: rubular.com/r/jTKxwjNuKoSjgN, which I don't want to ignore.
Sure thing. Try this one:
(?!<content>)({{(?!#|\/).*?}})(?!<\/content>)
If you need an explanation of how and why this regex works, you can take a look at the explanation section here: https://regex101.com/r/d4DEK1/1
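Since the stated goal is a gsub over a large body of text, another option is to match <content> blocks and {{tags}} with a single alternation and decide in the gsub block what to do with each match. This is a sketch of a different technique from the lookaround regex above, under the assumption that <content> blocks are not nested; CONTENT_OR_TAG and replace_outer_tags are hypothetical names.

```ruby
# Matches either a whole <content>...</content> block (nothing captured) or a
# {{tag}} that does not start with '#' or '/' (tag name in capture group 1).
CONTENT_OR_TAG = /<content\b[^>]*>.*?<\/content>|\{\{(?!#|\/)(.*?)\}\}/m

# Hypothetical helper: yields each tag name found outside <content> blocks and
# substitutes the block's return value; <content> blocks are copied unchanged.
def replace_outer_tags(text)
  text.gsub(CONTENT_OR_TAG) do
    name = Regexp.last_match(1)
    name ? yield(name) : Regexp.last_match(0)
  end
end

str = "{{one}}\n<content>{{two}}</content>\n{{three}}\n"
replace_outer_tags(str) { |name| name.upcase }
#=> "ONE\n<content>{{two}}</content>\nTHREE\n"
```

The first alternative consumes an entire <content> block, so any {{tag}} inside it is never seen by the second alternative.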
I suggest doing it in two steps to accommodate more complex strings. I have assumed that the strings "one" and "three" are to be extracted from the following string.
str = <<-_
{{one}}
<content>cats {{two}} and <content2>{{four}}</content2> dogs</content>
{{three}}
_
r0 = /
<
([^>]+) # match >= 1 characters other than '>' in capture group 1
>
.+? # match one or more characters lazily
<\/ # match '<' then forward slash
\1 # match the contents of capture group 1
>
/x # free-spacing regex definition mode
r1 = /
(?<=\{\{) # match '{{' in a positive lookbehind
[^\}]+ # match one or more characters other than '}'
(?=\}\}) # match '}}' in a positive lookahead
/x # free-spacing regex definition mode
str.gsub(r0, '').scan(r1)
#=> ["one", "three"]
The first step is:
str.gsub(r0, '')
#=> "{{one}}\n\n{{three}}\n"
This of course also works if the second line of the string is simply
"<content>{{two}}</content>\n"
The two regular expressions are conventionally written as follows.
r0 = /<([^>]+)>.+?<\/\1>/
r1 = /(?<=\{\{)[^\}]+(?=\}\})/

Regex (ruby) to remove all instances of a set of characters EXCEPT when they are at the start of a string

I want to remove all instances of these characters ["+", "-", "~"] from a string, except when they occur at the start of the string.
For example:
"abc" => "abc"
"ab+c" => "abc"
"+abc" => "+abc"
"-+abc" => "-abc"
"ab+-c" => "abc"
Note with the fourth one that the + is removed, because it wasn't the first character. So, if there are multiple "unwanted" characters at the start of a string, we only keep the first one.
I can't quite figure out the regex syntax for this. Can anyone help? I'm using Ruby but regex syntax tends to be the same across languages.
The ^(![\+\-\~] pattern matches the start of a line and then captures into Group 1 a ! char followed by a +, - or ~ char, so it removes only !+, !~ or !- at the start of a line.
You may use
/(?!\A)[+~-]/
It matches any +, ~ or - char ([+~-]) that is not at the start of the string ((?!\A)). The (?!\A) is a negative lookahead that fails the match if its pattern matches immediately to the right of the current location. If the location is the start of the string (\A asserts this very position), the match fails. Since \A is an anchor that does not consume any text (a so-called zero-length pattern), it makes no difference whether you use a lookahead or a lookbehind, (?<!\A).
Make sure - is either at the start or end of the character class, and you won't have to escape it.
Ruby demo:
strs = ["abc", "ab+c", "+abc", "-+abc", "ab+-c"]
strs.each { |x| p x.gsub(/(?!\A)[-+~]/, "") }
Output:
"abc"
"abc"
"+abc"
"-abc"
"abc"
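As a cross-check, the same result can be obtained without a regex by leaving the first character alone and deleting the unwanted characters from the remainder. This is a sketch of an alternative technique, not the approach from the answer above; strip_inner_signs is a hypothetical name.

```ruby
# Hypothetical helper: keep the first character untouched and delete
# '+', '~' and '-' from the rest of the string.
def strip_inner_signs(str)
  # '-' is placed last in the delete spec so it is taken literally, not as a range.
  str[0, 1] + str[1..-1].to_s.delete('+~-')
end

strs = ["abc", "ab+c", "+abc", "-+abc", "ab+-c"]
strs.map { |s| strip_inner_signs(s) }
#=> ["abc", "abc", "+abc", "-abc", "abc"]
strs.all? { |s| strip_inner_signs(s) == s.gsub(/(?!\A)[+~-]/, "") }
#=> true
```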

regexp - find numbers in a string in any order

I need to find a regexp that allows me to find strings in which I have all the required numbers, but each only once.
For example:
a <- c("12","13","112","123","113","1123","23","212","223","213","2123","312","323","313","3123","1223","1213","12123","2313","23123","13123")
I want to get:
"123" "213" "312"
The pattern 123 only once and in any order and in any position of the string
I tried a lot of things, and this seemed to be the closest, although it's still very far from what I want:
grep('[1:3][1:3][1:3]', a, value=TRUE)
[1] "113" "313" "2313" "13123"
What I need exactly is to find all 3-digit numbers containing the digits 1, 2 AND 3.
Then you can safely use
grep('^[123]{3}$', a, value=TRUE)
##=> [1] "112" "123" "113" "212" "223" "213" "312" "323" "313"
The regex matches:
^ - start of string
[123]{3} - Exactly 3 characters that are either 1, or 2 or 3
$ - assert the position at the end of string.
Also, if you only need unique values, use unique.
If you do not want to allow the same digit more than once, you need a Perl-based regex:
grep('^(?!.*(.).*\\1)[123]{3}$', a, value=TRUE, perl=TRUE)
## => [1] "123" "213" "312"
Note the double escaped back-reference. The (?!.*(.).*\\1) negative look-ahead will check if the string has no repeated symbols with the help of a capturing group (.) and a back-reference that forces the same captured text to appear in the string. If the same characters are found, there will be no match. See IDEONE demo.
The (?!.*(.).*\\1) is a negative look-ahead. It only asserts the absence of some pattern after the current regex engine position, i.e. it succeeds if its pattern does not match and fails otherwise. Thus, it does not "consume" characters: the regex engine stays at the same location in the input string. In this regex, that location is the beginning of the string (^). So, right at the beginning of the string, the regex engine starts looking for .* (any character but a newline, 0 or more repetitions), then captures 1 character (.) into group 1, again matches 0 or more characters with .*, and then tries to match the same text as in group 1 with \\1. Thus, if the string is 121, there will be no match, since the look-ahead fails as soon as it finds the two 1s.
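The same look-ahead carries over to other engines that support backreferences. As an illustration only (the question is about R), here is a sketch of the pattern in Ruby, where the backslash is not doubled inside a regex literal; the variable name no_repeats is an assumption.

```ruby
a = %w[12 13 112 123 113 1123 23 212 223 213 2123 312 323 313 3123
       1223 1213 12123 2313 23123 13123]

# (?!.*(.).*\1) rejects any string in which some character occurs twice;
# [123]{3} then requires exactly three characters drawn from 1, 2 and 3.
no_repeats = /\A(?!.*(.).*\1)[123]{3}\z/

a.grep(no_repeats)
#=> ["123", "213", "312"]
```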
You can also use this, which checks each digit against the ones captured before it:
grep('^([123])((?!\\1)[123])(?!\\1|\\2)[123]$', a, value=TRUE, perl=TRUE)

Regex lookahead to match everything prior to 1st OR 2nd group of digits

Regex in VBA.
I am using the following regex to match the second occurrence of a 4-digit group, or the first group if there is only one group:
\b\d{4}\b(?!.+\b\d{4}\b)
Now I need to do kind of the opposite: I need to match everything up until the second occurrence of a 4-digit group, or up until the first group if there is only one. If there are no 4-digit groups, capture the entire string.
This would be sufficient.
But there is also a preferable "bonus" route: If there exists a way to match everything up until a 4-digit group that is optionally followed by some random text, but only if there is no other 4-digit group following it. If there exists a second group of 4 digits, capture everything up until that group (including the first group and periods, but not commas). If there are no groups, capture everything. If the line starts with a 4-digit group, capture nothing.
I understand that this, too, could (should?) be done with a lookahead, but I am not having any luck figuring out how lookaheads work for this purpose.
Examples:
Input: String.String String 4444
Capture: String.String String 4444
Input: String4444 8888 String
Capture: String4444
Input: String String 444 . B, 8888
Capture: String String 444 . B
Bonus case:
Input: 8888 String
Capture:
To match everything up until the second occurrence of a 4-digit group, or up until the first group if there is only one, use this pattern:
^((?:.*?\d{4})?.*?)(?=\s*\b\d{4}\b)
Per the comment below, use this pattern:
^((?:.*?\d{4})?.*?(?=\s*\b\d{4}\b)|.*)
You can use this regex in VBA to capture lines with 4-digit numbers, or those that do not have 4-digit numbers in them:
^((?:.*?[0-9]{4})?.*?(?=\s*?[0-9]{4})|(?!.*[0-9]{4}).*)
See demo, it should work the same in VBA.
The regex consists of 2 alternatives: (?:.*?[0-9]{4})?.*?(?=\s*?[0-9]{4}) and (?!.*[0-9]{4}).*.
(?:.*?[0-9]{4})?.*?(?=\s*?[0-9]{4}) matches 0 or more (as few as possible) characters that are preceded by 0 or 1 sequence of characters followed by a 4-digit number, and are followed by optional space(s) and 4 digit number.
(?!.*[0-9]{4}).* matches any number of any characters that do not have a 4-digit number inside.
Note that to only match whole numbers (not part of other words) you need to add \b around the [0-9]{4} patterns (i.e. \b[0-9]{4}\b).
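To see what the two alternatives capture, the pattern can be tried on a few sample lines. The sketch below uses Ruby rather than VBA purely so that it can be run directly; UP_TO_GROUP and up_to_group are hypothetical names, and the pattern itself is copied unchanged from above.

```ruby
# The pattern from the answer above, written as a Ruby regex literal.
UP_TO_GROUP = /^((?:.*?[0-9]{4})?.*?(?=\s*?[0-9]{4})|(?!.*[0-9]{4}).*)/

# Hypothetical helper: returns capture group 1, or nil if nothing matched.
def up_to_group(line)
  line[UP_TO_GROUP, 1]
end

up_to_group('String4444 8888 String')    #=> "String4444"
up_to_group('8888 String')               #=> ""
up_to_group('no digits here')            #=> "no digits here"
up_to_group('String.String String 4444') #=> "String.String String"
```

Note that with a single 4-digit group the capture stops before that group, because the first alternative always requires a 4-digit group in the lookahead.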
This matches everything except spaces up to the last occurrence of a 4-digit word.
You can use the following:
(?:(?! ).)+(?=.*\b\d{4}\b)
For your basic case (marked by you as sufficient), this will work:
((?:(?!\d{4}).)*(?:\d{4})?(?:(?!\d{4}).)*)(?=\d{4})
You can pad every \d{4} internally with \b if you need to.
If anyone is interested, I cheated to fully solve my problem.
Building on this answer, which solves the vast majority of my data set, I used program logic to catch some rarely seen use-cases. It seemed difficult to get a single regex to cover all the situations, so this seems like a viable alternative.
Problem is illustrated here.
The code isn't bulletproof yet, but this is the gist:
' str is passed ByVal so that trimming it below does not modify the caller's variable.
Function cRegEx(ByVal str As String) As String
    Dim rExp As Object, rMatch As Object, regP As String, strL() As String
    regP = "^((?:.*?[0-9]{4})?.*?(?:(?=\s*[0-9]{4})|(?:(?!\d{4}).)*)|(?!.*[0-9]{4}).*)"
    ' Encountered two use-cases that weren't easily solvable with regex, due to the already complex pattern(s).
    ' Split str if we encounter a comma and only keep the first part - this way we don't have to solve this case in the regex.
    If InStr(str, ",") <> 0 Then
        strL = Split(str, ",")
        str = strL(0)
    End If
    ' If str starts with a 4-digit group, return an empty string.
    If cRegExNum(str) = False Then
        Set rExp = CreateObject("vbscript.regexp")
        With rExp
            .Global = False
            .MultiLine = False
            .IgnoreCase = True
            .Pattern = regP
        End With
        Set rMatch = rExp.Execute(str)
        If rMatch.Count > 0 Then
            cRegEx = rMatch(0).Value
        Else
            cRegEx = ""
        End If
    Else
        cRegEx = ""
    End If
End Function
Function cRegExNum(ByVal str As String) As Boolean
    ' Does the string start with four consecutive digits?
    ' Return True if it does.
    Dim rExp As Object, rMatch As Object, regP As String
    regP = "^\d{4}"
    Set rExp = CreateObject("vbscript.regexp")
    With rExp
        .Global = False
        .MultiLine = False
        .IgnoreCase = True
        .Pattern = regP
    End With
    Set rMatch = rExp.Execute(str)
    If rMatch.Count > 0 Then
        cRegExNum = True
    Else
        cRegExNum = False
    End If
End Function

Regex in PHP: take all the words after the first one in string and truncate all of them to the first character

I'm quite terrible at regexes.
I have a string that may have 1 or more words in it (generally 2 or 3), usually a person name, for example:
$str1 = 'John Smith';
$str2 = 'John Doe';
$str3 = 'David X. Cohen';
$str4 = 'Kim Jong Un';
$str5 = 'Bob';
I'd like to convert each as follows:
$str1 = 'John S.';
$str2 = 'John D.';
$str3 = 'David X. C.';
$str4 = 'Kim J. U.';
$str5 = 'Bob';
My guess is that I should first match the first word, like so:
preg_match( "^([\w\-]+)", $str1, $first_word )
Then I'd match all the words after the first one... but how do I match those? Should I use preg_match again with offset = 1 in the arguments? But that offset is in characters or bytes, right?
Anyway, after I have matched the words following the first, if they exist, should I do something like this for each of them:
$second_word = substr( $following_word, 1 ) . '. ';
Or is my approach completely wrong?
Thanks
PS: it would be a boon if the regex could keep the first two words whole when the string contains three or more words (e.g. 'Kim Jong U.').
It can be done in single preg_replace using a regex.
You can search using this regex:
^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+
And replace by:
$1.
Code:
$name = preg_replace('/^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+/', '$1.', $name);
Explanation:
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
Together, (*SKIP)(*FAIL) provide a neat workaround for the restriction that a variable-length lookbehind cannot be used in the above regex.
^\w+(?:$| +)(*SKIP)(*F) matches the first word in a name and skips it (does nothing)
(\w)\w+ matches each of the other words, which is then replaced with its first letter and a dot.
You could use a positive lookbehind assertion.
(?<=\h)([A-Z])\w+
OR
Use this regex if you want to turn Bob F to Bob F.
(?<=\h)([A-Z])\w*(?!\.)
Then replace the matched characters with \1.
Code would be like,
preg_replace('~(?<=\h)([A-Z])\w+~', '\1.', $string);
(?<=\h)([A-Z]) captures each uppercase letter that is preceded by a horizontal space character.
\w+ matches one or more word characters.
Replacing the matched characters with the contents of group 1 (\1) plus a dot gives you the desired output.
A simple solution with only look-ahead and word boundary check:
preg_replace('~(?!^)\b(\w)\w+~', '$1.', $string);
(\w)\w+ is a word in the name, with the first character captured
(?!^)\b performs a word boundary check \b, and makes sure the match is not at the start of the string (?!^).
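The look-ahead and word boundary approach is not specific to PCRE. As an illustration only (the question is about PHP), here is a sketch of the same idea in Ruby, using \A for the start of the string since ^ matches at the start of every line in Ruby; abbreviate_name is a hypothetical name.

```ruby
# Hypothetical helper: abbreviate every word after the first to its initial.
def abbreviate_name(name)
  # (?!\A) rules out a match at the start of the string, \b requires a word
  # boundary, (\w) captures the initial and \w+ consumes the rest of the word.
  name.gsub(/(?!\A)\b(\w)\w+/, '\1.')
end

abbreviate_name('John Smith')     #=> "John S."
abbreviate_name('David X. Cohen') #=> "David X. C."
abbreviate_name('Kim Jong Un')    #=> "Kim J. U."
abbreviate_name('Bob')            #=> "Bob"
```

A one-letter word such as "X" is left alone because \w+ needs at least one more word character after the captured initial.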