Need to match any string between two delimiters - regex

I need a regex to match parts of a string. For example, in the following string
Fault,10.224.2.3:4450,XX_XXX0039_XX.XX/0,AA,BBBBBB
I want to match the entire string and extract Fault,10.224.2.3:4450 and AA,BBBBBB. However, I want to ignore ,XX_XXX0039_XX.XX/0,.
Note that the string to ignore includes the delimiters, the commas (,). The string to ignore may contain the following characters:
./_0-9A-Za-z
The position of the period (.) is not fixed. Other examples of the pattern I want to ignore are:
,XX_XXX0039_XX.XX/0,
,XX_XX0039_XXXXX/1,
,X_XX0039_X/4,
I am using the regex in Simple Event Correlator (SEC).

(\w+,\d+\.\d+\.\d+\.\d+:\d+).*?,(\w+,\w+)
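As a rough illustration of the pattern above, here is a minimal Python sketch. SEC itself uses Perl regular expressions, so Python and the helper name split_event are assumptions made only for this demo. The sketch escapes the dots in the IP address and limits the ignored field to the characters listed in the question.

```python
import re

# Group 1: "word,IPv4:port". Ignored field: only ./_0-9A-Za-z. Group 2: "word,word".
EVENT = re.compile(
    r"^(\w+,\d+\.\d+\.\d+\.\d+:\d+),[./\w]+,(\w+,\w+)$",
    re.ASCII,
)

def split_event(line):
    """Return (head, tail), or None if the line does not match."""
    m = EVENT.match(line)
    return (m.group(1), m.group(2)) if m else None

print(split_event("Fault,10.224.2.3:4450,XX_XXX0039_XX.XX/0,AA,BBBBBB"))
```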

The best approach is to match everything except your delimiter, the comma (,).
Regex:
[^,]+
Result:
[0] => XX_XXX0039_XX.XX/0
[1] =>
[2] => XX_XX0039_XXXXX/1
[3] =>
[4] => X_XX0039_X/4
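The same idea can be sketched in Python, under the assumption that the field to ignore is always the third comma-separated field (the helper name keep_fields is hypothetical):

```python
import re

def keep_fields(line):
    # [^,]+ matches each run of non-comma characters, i.e. each field.
    fields = re.findall(r"[^,]+", line)
    # Drop the third field (index 2) and rebuild the two wanted parts.
    return ",".join(fields[:2]), ",".join(fields[3:])

print(keep_fields("Fault,10.224.2.3:4450,XX_XXX0039_XX.XX/0,AA,BBBBBB"))
```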


R regex to remove all except letters, apostrophes and specified multi-character strings

Is there an R regex to remove all except letters, apostrophes and specified multi-character strings? The "specified multi-character strings" are arbitrary and of arbitrary length. Let's say "~~" and "&&" in this case (so a single ~ or & should be removed, but not ~~ or &&).
Here I have:
gsub("[^ a-zA-Z']", "", "I like~~cake~too&&much&now.")
Which gives:
## [1] "I like~~cake~toomuchnow"
And...
gsub("[^ a-zA-Z'~&]", "", "I like~~cake~too&&much&now.")
gives...
## "I like~~cake~too&&much&now"
How can I write an R regex to give:
"I like~~caketoo&&muchnow"
EDIT Corner cases from Casimir and BrodieG...
I'd expect this behavior:
x <- c("I like~~cake~too&&much&now.", "a~~~b", "a~~~~b", "a~~~~~b", "a~&a")
## [1] "I like~~caketoo&&muchnow." "a~~b"
## [3] "a~~~~b" "a~~~~b"
## [5] "aa"
Neither of the current approaches gives this.
One way, match/capture the "specified multi-character strings" while replacing the others.
gsub("(~~|&&)|[^a-zA-Z' ]", "\\1", x)
# [1] "I like~~caketoo&&muchnow" "a~~b"
# [3] "a~~~~b" "a~~~~b"
# [5] "aa"
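For readers outside R, the same capture-and-reinsert trick can be sketched in Python. The function name clean is made up for the example, and a replacement function is used instead of a "\1" template so that an unmatched group is simply treated as empty.

```python
import re

KEEP_OR_DROP = re.compile(r"(~~|&&)|[^a-zA-Z' ]")

def clean(text):
    # Group 1 is set only when a protected token matched; put it back.
    # Any other matched character is replaced by the empty string.
    return KEEP_OR_DROP.sub(lambda m: m.group(1) or "", text)

x = ["I like~~cake~too&&much&now.", "a~~~b", "a~~~~b", "a~~~~~b", "a~&a"]
print([clean(s) for s in x])
```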
(?<![&~])[^ a-zA-Z'](?![&~])
Try this. See the demo. Use it with the perl=TRUE option.
https://regex101.com/r/wU7sQ0/25
You can use this pattern:
gsub("[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*\\K(?:[^A-Za-z ']|\\z)", "", x, perl=TRUE)
The idea is to build an always true pattern that is the translation of this sentence:
substrings I want to keep are always followed by a character I want to remove or the end of the string
So, all you need to do is to describe the substring you want to keep:
[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*
Note that, since this subpattern is optional (it can match the empty string) and greedy, the whole pattern never fails at any position in the string, so all matches are consecutive from the beginning to the end (no need to add a \G anchor).
For the same reason, there is no need for possessive quantifiers or atomic groups to prevent catastrophic backtracking, because (?:[^A-Za-z ']|\\z) can't fail.
This pattern replaces a string in few steps, but you can improve it further:
You can avoid the last match (which is useless, since it matches only characters you want to keep or the empty string before the end) with the backtracking control verb (*COMMIT).
It forces the regex engine to stop the search once the end of the string is reached:
[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*\\K(?:[^A-Za-z ']|\\z(*COMMIT).)
You can also make the pattern match several special characters in one match (unless they are ~ or &):
[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*\\K(?:[^A-Za-z '][^A-Za-z '~&]*|\\z(*COMMIT).)
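Python's built-in re module has neither \K nor (*COMMIT), so the following is only an approximation of the basic pattern: the "substring to keep" is captured and written back instead of being dropped from the match with \K. The names KEEP, STRIP and clean_consecutive are invented for this sketch.

```python
import re

# Description of the substring to keep, copied from the answer.
KEEP = r"[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*"
# Kept prefix (group 1), then one character to remove or the end of the string.
STRIP = re.compile("(" + KEEP + r")(?:[^A-Za-z ']|\Z)")

def clean_consecutive(text):
    # Each match is replaced by its kept prefix only.
    return STRIP.sub(lambda m: m.group(1), text)

x = ["I like~~cake~too&&much&now.", "a~~~b", "a~~~~b", "a~~~~~b", "a~&a"]
print([clean_consecutive(s) for s in x])
```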

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just codes
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match (however that may be referenced in 'r') :-)
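A minimal Python sketch of the non-greedy idea, in case it helps to see it outside R. The function name root is hypothetical, and the root is restricted to [A-Z0-9] as described in the question.

```python
import re

# Lazy root, then an optional expiration code; fullmatch anchors both ends.
SYMBOL = re.compile(r"([A-Z0-9]+?)(?:[FGHJKMNQUVXZ][0-9]{1,2})?")

def root(symbol):
    m = SYMBOL.fullmatch(symbol)
    return m.group(1) if m else None

for s in ["ZNH3", "ZN", "CLZ4", "ZZZ33", "ABF"]:
    print(s, "->", root(s))
```

Because the first group is lazy, the engine only extends the root when the optional suffix cannot reach the end of the string, which is why "ZN" comes back unchanged.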

Optionally prevent a string at the end of a wildcard from being matched

I have the following string:
12345 This could be anythingREMOVE
I need to match 12345 and This could be anything. Unfortunately, the format I need to parse also has a string at the end of the line that isn't always present (REMOVE in this example). How can I match what I'm looking for without REMOVE? I've tried the following pattern:
^(\d{5}) (.*)(?:REMOVE|$)
Unfortunately, REMOVE is picked up by the wildcard:
(
[0] => Array
(
[0] => 12345 This could be anythingREMOVE
)
[1] => Array
(
[0] => 12345
)
[2] => Array
(
[0] => This could be anythingREMOVE
)
)
If the trailing REMOVE is optional, why not just use this regex:
"/^(\d{5}) /"
However if you really want to avoid REMOVE in matching pattern then use this:
$s = '12345 This could be anythingREMOVE';
if (preg_match("/^(\d{5}) (.*?)(?:REMOVE|)$/", $s, $arr))
var_dump($arr);
Output:
array(3) {
[0]=>
string(34) "12345 This could be anythingREMOVE"
[1]=>
string(5) "12345"
[2]=>
string(22) "This could be anything"
}
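The same lazy-wildcard pattern, sketched in Python for comparison (the function name parse is an assumption of the example):

```python
import re

# Lazy (.*?) gives characters back so the optional REMOVE can match before $.
LINE = re.compile(r"^(\d{5}) (.*?)(?:REMOVE)?$")

def parse(line):
    m = LINE.match(line)
    return m.groups() if m else None

print(parse("12345 This could be anythingREMOVE"))
print(parse("12345 This could be anything"))
```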
You can try this regex:
^(\d{5})((?:.(?!REMOVE))+.)
How It Works
^(\d{5}) -- Matches the start of the string, followed by five digits [0-9]. The parentheses capture the matched text.
((?:.(?!REMOVE))+ -- Matches, one or more times, any character that is not immediately followed by the sequence REMOVE. It stops at the n in anything; it can't match the g because that is followed by REMOVE.
.) -- Allows the g to match.
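A closely related variant, shown here in Python as a sketch rather than as the exact pattern above, puts the negative lookahead in front of each character. That removes the need for the extra trailing dot. In this version the separating space is kept outside group 2, and the name parse_tempered is hypothetical.

```python
import re

# Each character is consumed only if REMOVE does not start at that position.
TEMPERED = re.compile(r"^(\d{5}) ((?:(?!REMOVE).)*)")

def parse_tempered(line):
    m = TEMPERED.match(line)
    return m.groups() if m else None

print(parse_tempered("12345 This could be anythingREMOVE"))
```

Note that this stops at the first REMOVE anywhere in the line, not only at one that ends the line.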

RegExp pattern to capture around two-characters delimiter

I have a string which is something like:
prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::
I want to retrieve the value associated to a key (say, key1). The following pattern:
::key1==([^:]*)
...will work only if there is no ':' character in the value, so I want to make sure the pattern matching stops only at the substring ::, but I can't find how to do that, as most examples I see are about single-character matching.
How do I modify the regexp pattern to match all characters between "::key1==" and the next "::" ?
Thanks!
Could you use something like ::key1==(.*?):: instead? Assuming the language supports the lazy quantifier (?), this should work.
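Since no language was given, here is a small Python sketch of the lazy approach. The helper name value_of and the sample value containing a single ':' are made up for the demonstration.

```python
import re

def value_of(key, text):
    # (.*?) expands one character at a time until the next "::" is found.
    m = re.search("::" + re.escape(key) + "==(.*?)::", text)
    return m.group(1) if m else None

s = "prefix::key0==value0::key1==val:ue1::key2==value2::"
print(value_of("key1", s))
```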
As mentioned in my comment to your question, if the entirety of your string is
prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::
I would suggest exploding/splitting the string at :: instead of using regex, as it will usually be faster. You didn't specify a language, but here is a PHP example:
// string
$string = "prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::";
// explode using :: as delimiter
$string = explode('::',$string);
// for each element...
foreach ($string as $value) {
// check if it has == in it
if (strpos($value,'==')!==false) $matches[] = $value;
}
// output
echo "<pre>";print_r($matches);
output:
Array
(
[0] => key0==value0
[1] => key1==value1
[2] => key2==value2
[3] => key3==value3
[4] => key4==value4
)
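A comparable split-based sketch in Python, assuming the same input layout (parse_pairs is a hypothetical helper that also separates each key from its value):

```python
def parse_pairs(text):
    pairs = {}
    # Split on the two-character delimiter, then on the first "==".
    for chunk in text.split("::"):
        key, sep, value = chunk.partition("==")
        if sep:  # skip chunks without "==", such as "prefix" and the empty tail
            pairs[key] = value
    return pairs

s = "prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::"
print(parse_pairs(s)["key1"])
```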
However, if you insist on the regex approach, here is a negative look-ahead alternative:
::((?:(?!::).)+)
php example
// string
$string = "prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::";
preg_match_all('~::((?:(?!::).)+)~',$string,$matches);
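// output: print only the captured group (index 1); index 0 holds the full matches, including the leading ::
echo "<pre>";print_r($matches[1]);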
output
Array
(
[0] => key0==value0
[1] => key1==value1
[2] => key2==value2
[3] => key3==value3
[4] => key4==value4
)
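The same negative look-ahead pattern behaves the same way in Python. This is only a sketch, and find_pairs is an invented name.

```python
import re

def find_pairs(text):
    # After "::", consume characters one by one as long as "::" does not start there.
    return re.findall(r"::((?:(?!::).)+)", text)

s = "prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::"
print(find_pairs(s))
```

With a single capturing group, re.findall returns just the captured text, so the leading :: is not part of the results.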
I think you're looking for a positive look-ahead:
::key0==(.*?)(?=::\w+==)
With the following:
prefix::key0==val::ue0::key1==value1::key2==value2::key3==value3::key4==value4::
It correctly finds val::ue0. This also assumes the keys conform to \w ([0-9A-Za-z_]).
Also, a positive look-ahead may be a bit of overkill, but it will work if the value contains ::, too.
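One caveat: as written, the look-ahead requires another key to follow, so the last value (key4 in the sample) would not be found. The Python sketch below adds an alternative for a trailing :: at the end of the string. That alternative and the helper name lookup are additions for illustration and are not part of the pattern above.

```python
import re

def lookup(key, text):
    # Stop before "::nextkey==" or before the final "::" of the string.
    pattern = "::" + re.escape(key) + r"==(.*?)(?=::\w+==|::$)"
    m = re.search(pattern, text)
    return m.group(1) if m else None

s = "prefix::key0==val::ue0::key1==value1::key2==value2::key3==value3::key4==value4::"
print(lookup("key0", s), lookup("key4", s))
```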

Regex for capturing numbered text list

I have a text list that I am trying to capture data from using a regex.
Here is a sample of the text format:
(1) this is a sample string /(2) something strange /(3) another bit of text /(4) the last one/ something!/
I have a Regex that currently captures this correctly, but I am having some difficulty with making it work under outlier conditions.
Here is my regex
/\(?\d\d?\)([^\)]+)(\/|\z)/
Unfortunately some of the data contains parentheses like this:
(1) this is a sample string (1998-1999) /(2) something strange (blah) /(3) another bit of text /(4) the last one/ something!/
The substrings '(1998-1999)' and '(blah)' make it fail!
Anyone care to have a crack at this one?
Thank you :D
I would try this:
\((\d+)\)\s+(.*?)(?=/(?:\(\d+\)|\z))
This rather scary looking regex does the following:
It looks for one or more digits wrapped in parentheses and captures them;
There must be at least one white space character after the digits in parentheses. This white space is ignored (not captured);
A non-greedy wildcard expression is used. This is (imho) preferable to using negated character classes (e.g. [^/]+) for this kind of problem;
The positive lookahead ((?=...)) says the expression must be followed by a forward slash and then one of:
one or more digits wrapped in parentheses; or
the string terminator.
To give you an example in PHP (you don't specify your language):
$s = '(1) this is a sample string (1998-1999) /(2) something strange (blah) /(3) another bit of text /(4) the last one/ something!/';
preg_match_all('!\((\d+)\)\s+(.*?)(?=/(?:\(\d+\)|\z))!', $s, $matches);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => (1) this is a sample string (1998-1999)
[1] => (2) something strange (blah)
[2] => (3) another bit of text
[3] => (4) the last one/ something!
)
[1] => Array
(
[0] => 1
[1] => 2
[2] => 3
[3] => 4
)
[2] => Array
(
[0] => this is a sample string (1998-1999)
[1] => something strange (blah)
[2] => another bit of text
[3] => the last one/ something!
)
)
Some notes:
You don't specify what you want to capture. I've assumed the list item number and the text. This could be wrong in which case just drop those capturing parentheses. Either way you can get the whole match;
I've dropped the trailing slash from the match. This may not be your intent. Again just change the capturing to suit;
I've allowed any number of digits for the item number. Your version allowed only two. If you prefer it that way replace \d+ with \d\d?.
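For comparison, a Python sketch of the same expression. Python spells the end-of-string anchor \Z, and the helper name parse_items and the whitespace stripping are assumptions of this example.

```python
import re

# Item number, whitespace, then lazy text up to "/" followed by "(n)" or the end.
ITEM = re.compile(r"\((\d+)\)\s+(.*?)(?=/(?:\(\d+\)|\Z))")

def parse_items(text):
    # Strip the space that precedes each separating slash in the sample data.
    return [(num, body.strip()) for num, body in ITEM.findall(text)]

s = ("(1) this is a sample string (1998-1999) /(2) something strange (blah) "
     "/(3) another bit of text /(4) the last one/ something!/")
for num, body in parse_items(s):
    print(num, body)
```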
Prepend a / to the beginning of string, append a (0) to the end of the string, then split the whole string with the pattern \/\(\d+\), and discard the first and last empty elements.
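A rough Python sketch of that split-based recipe, assuming (as in the sample) that the input ends with a slash. The function name split_items is hypothetical.

```python
import re

def split_items(text):
    # Pad so that every item, including the first and last, is bounded by "/(n)".
    padded = "/" + text + "(0)"
    parts = re.split(r"/\(\d+\)", padded)
    # The first and last pieces are empty by construction; drop them.
    return [p.strip() for p in parts[1:-1]]

s = ("(1) this is a sample string (1998-1999) /(2) something strange (blah) "
     "/(3) another bit of text /(4) the last one/ something!/")
print(split_items(s))
```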
As long as / cannot appear in the text...
\(?\d?\d[^/]+