Related
I know that the following regex will match "red", "green", or "blue".
red|green|blue
Is there a straightforward way of making it match everything except several specified strings?
If you want to make sure that the string is neither red, green nor blue, caskey's answer is it. What is often wanted, however, is to make sure that the line does not contain red, green or blue anywhere in it. For that, anchor the regular expression with ^ and include .* in the negative lookahead:
^(?!.*(red|green|blue))
Also, suppose that you want lines containing the word "engine" but without any of those colors:
^(?!.*(red|green|blue)).*engine
You might think you can factor the .* to the head of the regular expression:
^.*(?!red|green|blue)engine # Does not work
but you cannot. You have to have both instances of .* for it to work.
Depends on the language, but there are generally negative-assertions you can put in like so:
(?!red|green|blue)
(Thanks for the syntax fix, the above is valid Java and Perl, YMMV)
Matching Anything but Given Strings
If you want to match the entire string where you want to match everything but certain strings you can do it like this:
^(?!(red|green|blue)$).*$
This says, start the match from the beginning of the string where it cannot start and end with red, green, or blue and match anything else to the end of the string.
You can try it here: https://regex101.com/r/rMbYHz/2
Note that this only works with regex engines that support a negative lookahead.
You don't need negative lookahead. There is working example:
/([\s\S]*?)(red|green|blue|)/g
Description:
[\s\S] - match any character
* - match from 0 to unlimited from previous group
? - match as less as possible
(red|green|blue|) - match one of this words or nothing
g - repeat pattern
Example:
whiteredwhiteredgreenbluewhiteredgreenbluewhiteredgreenbluewhiteredgreenbluewhiteredgreenbluewhiteredgreenbluewhiteredgreenbluewhiteredwhiteredwhiteredwhiteredwhiteredwhiteredgreenbluewhiteredwhiteredwhiteredwhiteredwhiteredredgreenredgreenredgreenredgreenredgreenbluewhiteredbluewhiteredbluewhiteredbluewhiteredbluewhiteredwhite
Will be:
whitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhite
Test it: regex101.com
I had the same question, the solutions proposed were almost working but they had some issue. In the end the regex I used is:
^(?!red|green|blue).*
I tested it in Javascript and .NET.
.* should't be placed inside the negative lookahead like this: ^(?!.*red|green|blue) or it would make the first element behave different from the rest (i.e. "anotherred" wouldn't be matched while "anothergreen" would)
Matching any text but those matching a pattern is usually achieved with splitting the string with the regex pattern.
Examples:
c# - Regex.Split(text, #"red|green|blue") or, to get rid of empty values, Regex.Split(text, #"red|green|blue").Where(x => !string.IsNullOrEmpty(x)) (see demo)
vb.net - Regex.Split(text, "red|green|blue") or, to remove empty items, Regex.Split(text, "red|green|blue").Where(Function(s) Not String.IsNullOrWhitespace(s)) (see demo, or this demo where LINQ is supported)
javascript - text.split(/red|green|blue/) (no need to use g modifier here!) (to get rid of empty values, use text.split(/red|green|blue/).filter(Boolean)), see demo
java - text.split("red|green|blue"), or - to keep all trailing empty items - use text.split("red|green|blue", -1), or to remove all empty items use more code to remove them (see demo)
groovy - Similar to Java, text.split(/red|green|blue/), to get all trailing items use text.split(/red|green|blue/, -1) and to remove all empty items use text.split(/red|green|blue/).findAll {it != ""}) (see demo)
kotlin - text.split(Regex("red|green|blue")) or, to remove blank items, use text.split(Regex("red|green|blue")).filter{ !it.isBlank() }, see demo
scala - text.split("red|green|blue"), or to keep all trailing empty items, use text.split("red|green|blue", -1) and to remove all empty items, use text.split("red|green|blue").filter(_.nonEmpty) (see demo)
ruby - text.split(/red|green|blue/), to get rid of empty values use .split(/red|green|blue/).reject(&:empty?) (and to get both leading and trailing empty items, use -1 as the second argument, .split(/red|green|blue/, -1)) (see demo)
perl - my #result1 = split /red|green|blue/, $text;, or with all trailing empty items, my #result2 = split /red|green|blue/, $text, -1;, or without any empty items, my #result3 = grep { /\S/ } split /red|green|blue/, $text; (see demo)
php - preg_split('~red|green|blue~', $text) or preg_split('~red|green|blue~', $text, -1, PREG_SPLIT_NO_EMPTY) to output no empty items (see demo)
python - re.split(r'red|green|blue', text) or, to remove empty items, list(filter(None, re.split(r'red|green|blue', text))) (see demo)
go - Use regexp.MustCompile("red|green|blue").Split(text, -1), and if you need to remove empty items, use this code. See Go demo.
NOTE: If you patterns contain capturing groups, regex split functions/methods may behave differently, also depending on additional options. Please refer to the appropriate split method documentation then.
All except word "red"
var href = '(text-1) (red) (text-3) (text-4) (text-5)';
var test = href.replace(/\((\b(?!red\b)[\s\S]*?)\)/g, testF);
function testF(match, p1, p2, offset, str_full) {
p1 = "-"+p1+"-";
return p1;
}
console.log(test);
All except word "red"
var href = '(text-1) (frede) (text-3) (text-4) (text-5)';
var test = href.replace(/\(([\s\S]*?)\)/g, testF);
function testF(match, p1, p2, offset, str_full) {
p1 = p1.replace(/red/g, '');
p1 = "-"+p1+"-";
return p1;
}
console.log(test);
I know that the following regex will match "red", "green", or "blue".
red|green|blue
Is there a straightforward way of making it match everything except several specified strings?
If you want to make sure that the string is neither red, green nor blue, caskey's answer is it. What is often wanted, however, is to make sure that the line does not contain red, green or blue anywhere in it. For that, anchor the regular expression with ^ and include .* in the negative lookahead:
^(?!.*(red|green|blue))
Also, suppose that you want lines containing the word "engine" but without any of those colors:
^(?!.*(red|green|blue)).*engine
You might think you can factor the .* to the head of the regular expression:
^.*(?!red|green|blue)engine # Does not work
but you cannot. You have to have both instances of .* for it to work.
Depends on the language, but there are generally negative-assertions you can put in like so:
(?!red|green|blue)
(Thanks for the syntax fix, the above is valid Java and Perl, YMMV)
Matching Anything but Given Strings
If you want to match the entire string where you want to match everything but certain strings you can do it like this:
^(?!(red|green|blue)$).*$
This says, start the match from the beginning of the string where it cannot start and end with red, green, or blue and match anything else to the end of the string.
You can try it here: https://regex101.com/r/rMbYHz/2
Note that this only works with regex engines that support a negative lookahead.
You don't need negative lookahead. There is working example:
/([\s\S]*?)(red|green|blue|)/g
Description:
[\s\S] - match any character
* - match from 0 to unlimited from previous group
? - match as less as possible
(red|green|blue|) - match one of this words or nothing
g - repeat pattern
Example:
whiteredwhiteredgreenbluewhiteredgreenbluewhiteredgreenbluewhiteredgreenbluewhiteredgreenbluewhiteredgreenbluewhiteredgreenbluewhiteredwhiteredwhiteredwhiteredwhiteredwhiteredgreenbluewhiteredwhiteredwhiteredwhiteredwhiteredredgreenredgreenredgreenredgreenredgreenbluewhiteredbluewhiteredbluewhiteredbluewhiteredbluewhiteredwhite
Will be:
whitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhitewhite
Test it: regex101.com
I had the same question, the solutions proposed were almost working but they had some issue. In the end the regex I used is:
^(?!red|green|blue).*
I tested it in Javascript and .NET.
.* should't be placed inside the negative lookahead like this: ^(?!.*red|green|blue) or it would make the first element behave different from the rest (i.e. "anotherred" wouldn't be matched while "anothergreen" would)
Matching any text but those matching a pattern is usually achieved with splitting the string with the regex pattern.
Examples:
c# - Regex.Split(text, #"red|green|blue") or, to get rid of empty values, Regex.Split(text, #"red|green|blue").Where(x => !string.IsNullOrEmpty(x)) (see demo)
vb.net - Regex.Split(text, "red|green|blue") or, to remove empty items, Regex.Split(text, "red|green|blue").Where(Function(s) Not String.IsNullOrWhitespace(s)) (see demo, or this demo where LINQ is supported)
javascript - text.split(/red|green|blue/) (no need to use g modifier here!) (to get rid of empty values, use text.split(/red|green|blue/).filter(Boolean)), see demo
java - text.split("red|green|blue"), or - to keep all trailing empty items - use text.split("red|green|blue", -1), or to remove all empty items use more code to remove them (see demo)
groovy - Similar to Java, text.split(/red|green|blue/), to get all trailing items use text.split(/red|green|blue/, -1) and to remove all empty items use text.split(/red|green|blue/).findAll {it != ""}) (see demo)
kotlin - text.split(Regex("red|green|blue")) or, to remove blank items, use text.split(Regex("red|green|blue")).filter{ !it.isBlank() }, see demo
scala - text.split("red|green|blue"), or to keep all trailing empty items, use text.split("red|green|blue", -1) and to remove all empty items, use text.split("red|green|blue").filter(_.nonEmpty) (see demo)
ruby - text.split(/red|green|blue/), to get rid of empty values use .split(/red|green|blue/).reject(&:empty?) (and to get both leading and trailing empty items, use -1 as the second argument, .split(/red|green|blue/, -1)) (see demo)
perl - my #result1 = split /red|green|blue/, $text;, or with all trailing empty items, my #result2 = split /red|green|blue/, $text, -1;, or without any empty items, my #result3 = grep { /\S/ } split /red|green|blue/, $text; (see demo)
php - preg_split('~red|green|blue~', $text) or preg_split('~red|green|blue~', $text, -1, PREG_SPLIT_NO_EMPTY) to output no empty items (see demo)
python - re.split(r'red|green|blue', text) or, to remove empty items, list(filter(None, re.split(r'red|green|blue', text))) (see demo)
go - Use regexp.MustCompile("red|green|blue").Split(text, -1), and if you need to remove empty items, use this code. See Go demo.
NOTE: If you patterns contain capturing groups, regex split functions/methods may behave differently, also depending on additional options. Please refer to the appropriate split method documentation then.
All except word "red"
var href = '(text-1) (red) (text-3) (text-4) (text-5)';
var test = href.replace(/\((\b(?!red\b)[\s\S]*?)\)/g, testF);
function testF(match, p1, p2, offset, str_full) {
p1 = "-"+p1+"-";
return p1;
}
console.log(test);
All except word "red"
var href = '(text-1) (frede) (text-3) (text-4) (text-5)';
var test = href.replace(/\(([\s\S]*?)\)/g, testF);
function testF(match, p1, p2, offset, str_full) {
p1 = p1.replace(/red/g, '');
p1 = "-"+p1+"-";
return p1;
}
console.log(test);
I would expect this line of JavaScript:
"foo bar baz".match(/^(\s*\w+)+$/)
to return something like:
["foo bar baz", "foo", " bar", " baz"]
but instead it returns only the last captured match:
["foo bar baz", " baz"]
Is there a way to get all the captured matches?
When you repeat a capturing group, in most flavors, only the last capture is kept; any previous capture is overwritten. In some flavor, e.g. .NET, you can get all intermediate captures, but this is not the case with Javascript.
That is, in Javascript, if you have a pattern with N capturing groups, you can only capture exactly N strings per match, even if some of those groups were repeated.
So generally speaking, depending on what you need to do:
If it's an option, split on delimiters instead
Instead of matching /(pattern)+/, maybe match /pattern/g, perhaps in an exec loop
Do note that these two aren't exactly equivalent, but it may be an option
Do multilevel matching:
Capture the repeated group in one match
Then run another regex to break that match apart
References
regular-expressions.info/Repeating a Capturing Group vs Capturing a Repeating Group
Javascript flavor notes
Example
Here's an example of matching <some;words;here> in a text, using an exec loop, and then splitting on ; to get individual words (see also on ideone.com):
var text = "a;b;<c;d;e;f>;g;h;i;<no no no>;j;k;<xx;yy;zz>";
var r = /<(\w+(;\w+)*)>/g;
var match;
while ((match = r.exec(text)) != null) {
print(match[1].split(";"));
}
// c,d,e,f
// xx,yy,zz
The pattern used is:
_2__
/ \
<(\w+(;\w+)*)>
\__________/
1
This matches <word>, <word;another>, <word;another;please>, etc. Group 2 is repeated to capture any number of words, but it can only keep the last capture. The entire list of words is captured by group 1; this string is then split on the semicolon delimiter.
Related questions
How do you access the matched groups in a javascript regex?
How's about this? "foo bar baz".match(/(\w+)+/g)
Unless you have a more complicated requirement for how you're splitting your strings, you can split them, and then return the initial string with them:
var data = "foo bar baz";
var pieces = data.split(' ');
pieces.unshift(data);
try using 'g':
"foo bar baz".match(/\w+/g)
You can use LAZY evaluation.
So, instead of using * (GREEDY), try using ? (LAZY)
REGEX: (\s*\w+)?
RESULT:
Match 1: foo
Match 2: bar
Match 3: baz
I've been working on this for a while, but it's beyond my understanding of regex.
I'm using Yahoo Pipes on an RSS, and I want to create hashtags from titles; so, I'd like to remove space from everything between quotes, but, if there's a colon within the quotes, I only want the space removed between the words before the colon.
And, it would be great if I could also capture the unspaced words as a group, to be able to use: #$1 to output the hashtag in one step.
So, something like:
"The New Apple: Worlds Within Worlds" Before We Begin...
Could be substituted like #$1 - with this result:
"#TheNewApple: Worlds Within Worlds" Before We Begin...
After some work, I was able to come up with, this regex:
\s(?=\s)?|(‘|’|(Review)|:.*)
("Review" was a word that often came before colons and wouldn't be stripped, if it were later in the title; that's what that's for, but I would like to not require that, to be more universal)
But, it has two problems:
I have to use multiple steps. The result of that regex would be:
"TheNewApple: Worlds Within Worlds" Before We Begin...
And I could then add another regex step, to put the hash # in front
But, it only works if the quotes are first, and I don't know how to fix that...
You can do this all in one step with regex, with a caveat. You run into problems with a repeated capturing group because only the last iteration is available in the replacement string. Searching for ( (\w+))+ and replacing with $2 will replace all the words with just the last match - not what we want.
The way around this is to repeat the pattern an arbitrary number of times that will suffice for your use. Each separate group can be referenced.
Search: "(\w+)(?: (\w+))?(?: (\w+))?(?: (\w+))?(?: (\w+))?(?: (\w+))?
Replace: "#$1$2$3$4$5$6
This will replace up to 6-word titles, exactly as you need them. First, "(\w+) matches any word following a quote. In the replacement string, it is put back as "#$1, adding the hashtag. The rest is a repeated list of (?: (\w+))? matches, each matching a possible space and word. Notice the space is part of a non-capturing group; only the word is part of the inner capture group. In the replacement string, I have $1$2$3$4$5$6, which puts back the words, without the spaces. Notice that a colon will not match any part of this, so it will stop once it hits a colon.
Examples:
"The New Apple: Worlds Within Worlds" Before We Begin...
"The New Apple" Before We Begin...
"One: Two"
only "One" word
this has "Two Words"
"The Great Big Apple Dumpling"
"The Great Big Apple Dumpling Again: Part 2"
Results:
"#TheNewApple: Worlds Within Worlds" Before We Begin...
"#TheNewApple" Before We Begin...
"#One: Two"
only "#One" word
this has "#TwoWords"
"#TheGreatBigAppleDumpling"
"#TheGreatBigAppleDumplingAgain: Part 2"
You can match the text with
"([^:]*)(.*?)"(.*)
then use some programming language to output the result like this:
'"#' + removeSpace($1) + $2 + '"' + $3
I have no idea what language you're using, but this seems like a poor choice for regex. In Python I'd do this:
# Python 3
import re
titles = ['''"The New Apple: Worlds Within Worlds" Before We Begin...''',
'''"Made Up Title: For Example Only" So We Can Continue...''']
hashtagged_titles = list()
for title in titles:
hashtagme, *restofstring = title.split(":")
hashtag = '"#'+hashtagme[1:].translate(str.maketrans('', '', " "))
result = "{}:{}".format(hashtag, restofstring)
hashtagged_titles.append(result)
Do a global search for
\ (?=.*:)
Replaced with nothing. Example
You'll need a second search on the results of that if you want to capture "TheNewApple" as a single word.
Beginner RegExp question. I have lines of JSON in a textfile, each with slightly different Fields, but there are 3 fields I want to extract for each line if it has it, ignoring everything else. How would I use a regex (in editpad or anywhere else) to do this?
Example:
"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24
I want to extract URL,TITLE,TAGS,
/"(url|title|tags)":"((\\"|[^"])*)"/i
I think this is what you're asking for. I'll provide an explanation momentarily. This regular expression (delimited by / - you probably won't have to put those in editpad) matches:
"
A literal ".
(url|title|tags)
Any of the three literal strings "url", "title" or "tags" - in Regular Expressions, by default Parentheses are used to create groups, and the pipe character is used to alternate - like a logical 'or'. To match these literal characters, you'd have to escape them.
":"
Another literal string.
(
The beginning of another group. (Group 2)
(
Another group (3)
\\"
The literal string \" - you have to escape the backslash because otherwise it will be interpreted as escaping the next character, and you never know what that'll do.
|
or...
[^"]
Any single character except a double quote The brackets denote a Character Class/Set, or a list of characters to match. Any given class matches exactly one character in the string. Using a carat (^) at the beginning of a class negates it, causing the matcher to match anything that's not contained in the class.
)
End of group 3...
*
The asterisk causes the previous regular expression (in this case, group 3), to be repeated zero or more times, In this case causing the matcher to match anything that could be inside the double quotes of a JSON string.
)"
The end of group 2, and a literal ".
I've done a few non-obvious things here, that may come in handy:
Group 2 - when dereferenced using Backreferences - will be the actual string assigned to the field. This is useful when getting the actual value.
The i at the end of the expression makes it case insensitive.
Group 1 contains the name of the captured field.
EDIT: So I see that the tags are an array. I'll update the regular expression here in a second when I've had a chance to think about it.
Your new Regex is:
/"(url|title|tags)":("(\\"|[^"])*"|\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\])/i
All I've done here is alternate the string regular expression I had been using ("((\\"|[^"])*)"), with a regular expression for finding arrays (\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\]). No so easy to Read, is it? Well, substituting our String Regex out for the letter S, we can rewrite it as:
\[(S(,S)*)?\]
Which matches a literal opening bracket (hence the backslashes), optionally followed by a comma separated list of strings, and a closing bracket. The only new concept I've introduced here is the question mark (?), which is itself a type of repetition. Commonly referred to as 'making the previous expression optional', it can also be thought of as exactly 0 or 1 matches.
With our same S Notation, here's the whole dirty Regular Expression:
/"(url|title|tags)":(S|\[(S(,S)*)?\])/i
If it helps to see it in action, here's a view of it in action.
This question is a bit older, but I have had browsed a bit on my PC and found that expression. I passed him as GIST, could be useful to others.
EDIT:
# Expression was tested with PHP and Ruby
# This regular expression finds a key-value pair in JSON formatted strings
# Match 1: Key
# Match 2: Value
# https://regex101.com/r/zR2vU9/4
# http://rubular.com/r/KpF3suIL10
(?:\"|\')(?<key>[^"]*)(?:\"|\')(?=:)(?:\:\s*)(?:\"|\')?(?<value>true|false|[0-9a-zA-Z\+\-\,\.\$]*)
# test document
[
{
"_id": "56af331efbeca6240c61b2ca",
"index": 120000,
"guid": "bedb2018-c017-429E-b520-696ea3666692",
"isActive": false,
"balance": "$2,202,350",
"object": {
"name": "am",
"lastname": "lang"
}
}
]
the json string you'd like to extract field value from
{"fid":"321","otherAttribute":"value"}
the following regex expression extract exactly the "fid" field value "321"
(?<=\"fid\":\")[^\"]*
Please try below expression:
/"(url|title|tags)":("([^""]+)"|\[[^[]+])/gm
Explanation:
1st Capturing Group (url|title|tags): This is alternatively capturing the characters 'url','title' and 'tags' literally (case sensitive).
2nd Capturing Group ("([^""]+)"|[[^[]+]):
1st Alternative "([^""]+)" is matches all words within " and " including " and "
2nd Alternative [[^[]+] is matches all words within [ and ] including [ and ]
I have tested here
I adapted regex to work with JSON in my own library. I've detailed algorithm behavior below.
First, stringify the JSON object. Then, you need to store the starts and lengths of the matched substrings. For example:
"matched".search("ch") // yields 3
For a JSON string, this works exactly the same (unless you are searching explicitly for commas and curly brackets in which case I'd recommend some prior transform of your JSON object before performing regex (i.e. think :, {, }).
Next, you need to reconstruct the JSON object. The algorithm I authored does this by detecting JSON syntax by recursively going backwards from the match index. For instance, the pseudo code might look as follows:
find the next key preceding the match index, call this theKey
then find the number of all occurrences of this key preceding theKey, call this theNumber
using the number of occurrences of all keys with same name as theKey up to position of theKey, traverse the object until keys named theKey has been discovered theNumber times
return this object called parentChain
With this information, it is possible to use regex to filter a JSON object to return the key, the value, and the parent object chain.
You can see the library and code I authored at http://json.spiritway.co/
if your json is
{"key1":"abc","key2":"xyz"}
then below regex will extract key1 or key2 based on a key that you pass in regex
"key2(.*?)(?=,|}|$)
you can verify it here - regex101.com
Why does it have to be a Regular Expression object?
Here we can just use a Hash object first and then go search it.
mh = {"url":"http://www.netcharles.com/orwell/essays.htm","domain":"netcharles.com","title":"Orwell Essays & Journalism Section - Charles' George Orwell Links","tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],"index":2931,"time_created":1345419323,"num_saves":24}
The output of which would be
=> {:url=>"http://www.netcharles.com/orwell/essays.htm", :domain=>"netcharles.com", :title=>"Orwell Essays & Journalism Section - Charles' George Orwell Links", :tags=>["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"], :index=>2931, :time_created=>1345419323, :num_saves=>24}
Not that I want to avoid using Regexp but don't you think it would be easier to take it a step at a time until your getting the data you want to further search through? Just MHO.
mh.values_at(:url, :title, :tags)
The output:
["http://www.netcharles.com/orwell/essays.htm", "Orwell Essays & Journalism Section - Charles' George Orwell Links", ["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"]]
Taking the pattern that FrankieTheKneeman gave you:
pattern = /"(url|title|tags)":"((\\"|[^"])*)"/i
we can search the mh hash by converting it to a json object.
/#{pattern}/.match(mh.to_json)
The output:
=> #<MatchData "\"url\":\"http://www.netcharles.com/orwell/essays.htm\"" 1:"url" 2:"http://www.netcharles.com/orwell/essays.htm" 3:"m">
Of course this is all done in Ruby which is not a tag that you have but relates I hope.
But oops! Looks like we can't do all three at once with that pattern so I will do them one at a time just for sake.
pattern = /"(title)":"((\\"|[^"])*)"/i
/#{pattern}/.match(mh.to_json)
#<MatchData "\"title\":\"Orwell Essays & Journalism Section - Charles' George Orwell Links\"" 1:"title" 2:"Orwell Essays & Journalism Section - Charles' George Orwell Links" 3:"s">
pattern = /"(tags)":"((\\"|[^"])*)"/i
/#{pattern}/.match(mh.to_json)
=> nil
Sorry about that last one. It will have to be handled differently.