I am having problems constructing a regex that will allow the full range of UTF-8 characters with the exception of 2 characters: '_' and '%'
So the whitelist is: ^[\u0000-\uFFFF]
and the blacklist is: ^[^_%]
I need to combine these into one expression.
I have tried the following code, but it does not work the way I had hoped:
String input = "this";
Pattern p = Pattern
.compile("^[\u0000-\uFFFF]+$ | ^[^_%]");
Matcher m = p.matcher(input);
boolean result = m.matches();
System.out.println(result);
input: this
actual output: false
desired output: true
You can use character class intersections/subtractions in Java regex to restrict a "generic" character class.
The character class [a-z&&[^aeiuo]] matches a single letter that is not a vowel. In other words: it matches a single consonant.
Use
"^[\u0000-\uFFFF&&[^_%]]+$"
to match all the Unicode characters except _ and %.
For more about the character class intersections/subtractions available in Java regex, see The Java™ Tutorials: Character Classes.
A test at the OCPSoft Visual Regex Tester shows there is no match when a % is added to the string.
And the Java demo:
String input = "this";
Pattern p = Pattern.compile("[\u0000-\uFFFF&&[^_%]]+"); // No anchors because `matches()` is used
Matcher m = p.matcher(input);
boolean result = m.matches();
System.out.println(result); // => true
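To make the effect of the intersection easier to see, here is a minimal sketch that runs the same character class against a few inputs. The class name IntersectionDemo and the helper isAllowed are hypothetical names chosen for illustration, and the range is written with doubled backslashes so that the regex engine, rather than the Java compiler, interprets the escapes.

```java
import java.util.regex.Pattern;

class IntersectionDemo {
    // Any character in the BMP range except '_' and '%', one or more times.
    static final Pattern ALLOWED = Pattern.compile("[\\u0000-\\uFFFF&&[^_%]]+");

    // Hypothetical helper: true when the whole input uses only allowed characters.
    static boolean isAllowed(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"this", "th_is", "100%"}) {
            System.out.println(s + " -> " + isAllowed(s));
        }
    }
}
```

With these inputs, only "this" should be accepted; the other two contain one of the excluded characters.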
Here is some sample code that excludes certain characters from a range using lookahead and lookbehind zero-length assertions, which do not consume characters in the string but only assert whether a match is possible.
Sample code: (exclude m and n from range a-z)
String str = "abcdmnxyz";
Pattern p=Pattern.compile("(?![mn])[a-z]");
Matcher m=p.matcher(str);
while(m.find()){
System.out.println(m.group());
}
output:
a b c d x y z
You can apply the same approach to your case.
Regex explanation (?![mn])[a-z]
(?! look ahead to see if there is not:
[mn] any character of: 'm', 'n'
) end of look-ahead
[a-z] any character of: 'a' to 'z'
You can also split the whole range into sub-ranges and solve the problem with ([a-l]|[o-z]) or [a-lo-z].
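As a quick check that the lookahead form and the sub-range form behave the same, the following sketch runs both over the sample string, together with a Java character class intersection for comparison. The class name ExcludeFromRangeDemo and the helper findAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExcludeFromRangeDemo {
    // Hypothetical helper: collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String str = "abcdmnxyz";
        System.out.println(findAll("(?![mn])[a-z]", str)); // negative lookahead
        System.out.println(findAll("[a-lo-z]", str));      // explicit sub-ranges
        System.out.println(findAll("[a-z&&[^mn]]", str));  // class intersection (Java syntax)
    }
}
```

All three calls should print [a, b, c, d, x, y, z]. The sub-range form is the most portable; the intersection syntax is specific to Java-style engines.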
Your problem is the spaces either side of the pipe.
Neither of
" ^.*"
".*$ "
will match anything, because nothing appears before start or after end.
This has a chance:
^[\u0000-\uFFFF]+$|^[^_%]
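The effect of the stray spaces can be checked with a small sketch like the one below. The class name PipeSpacesDemo and the helper matches are hypothetical, and the range is written with doubled backslashes so the regex engine parses the escapes.

```java
import java.util.regex.Pattern;

class PipeSpacesDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String withSpaces = "^[\\u0000-\\uFFFF]+$ | ^[^_%]";
        String withoutSpaces = "^[\\u0000-\\uFFFF]+$|^[^_%]";

        // A space is required after the end or before the start, so nothing can match.
        System.out.println(matches(withSpaces, "this"));     // false
        System.out.println(matches(withoutSpaces, "this"));  // true
        // The first alternative accepts any character, including '_'.
        System.out.println(matches(withoutSpaces, "th_is")); // true
    }
}
```

Removing the spaces makes the pattern match again, but the alternation on its own still does not exclude '_' or '%', because the first alternative accepts every character.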
I have a code segment that reads lines from a file and I want to filter certain lines out. Basically, I want to filter out every line that does not have exactly three tab-separated columns, where the first column is a number and the other two columns can contain any character except tab and newline (DOS & Unix).
I already checked my regex on http://www.regexr.com/ and there it works.
scala> val mystr = """123456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0#\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
scala> val myreg = "^[0-9]+(\t[^\t\r\n]+){2}(\n|\r\n)$"
scala> mystr.matches(myreg)
res2: Boolean = false
What I found out is that the problem is related to special characters. Here is a simple example:
scala> val tabstr = """123456\t123456"""
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res3: Boolean = false
scala> val tabstr = "123456\t123456"
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res4: Boolean = true
It seems I mustn't use a raw string for my line (see mystr in the first code block). But if I don't use a raw string, Scala complains:
error: invalid escape character
So how can I deal with this messy input and still use my regex to filter out some lines?
You are using raw string literals. Inside a raw string literal, \ does not introduce escape sequences such as tab \t or newline \n; the \n in a raw string literal is just 2 separate characters, \ and n.
In a regex, to match a literal \, you need to use 2 backslashes in a raw-string literal based regex, and 4 backslashes in a regular string.
So, to match all your inputs, you need to use the following regexps:
val mystr = """23456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0#\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
val myreg = """[0-9]+(?:\\t(?:(?!\\[trn]).)*){2}(?:\\r)?(?:\\n)"""
println(mystr.matches(myreg)) // => true
val tabstr = """123456\t123456"""
println(tabstr.matches("""[0-9]+\\t[0-9]+""")) // => true
val tabstr2 = "123456\t123456"
println(tabstr2.matches("""^[0-9]+(?:\\t|\t)[0-9]+$""")) // => true
The non-capturing groups are not essential here: you only need to check whether the string matches, so capturing groups would work just as well. For the same reason, ^ and $ are not needed, because matches requires the whole input string to match. If you later need to extract matches or captured groups, non-capturing groups give you a cleaner output structure.
The last two regexps are straightforward: (?:\\t|\t) matches either a literal backslash followed by t, or a tab character, while \t alone matches just a tab.
The first one has a tempered greedy token (this is a simplified regex, a better one can be used with unrolling the loop method: [0-9]+(?:\\t[^\\]*(?:\\(?![trn])[^\\]*)*){2}(?:\\r)?(?:\\n)).
[0-9]+ - 1 or more digits
(?:\\t(?:(?!\\[trn]).)*){2} - tempered greedy token: 2 occurrences of the literal two-character string \t, each followed by any characters other than a newline, stopping before the next two-character combination \t, \r or \n.
(?:\\r)? - 1 or 0 occurrences of \r
(?:\\n) - one occurrence of a literal combination of \ and n.
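The difference between a real tab and the two characters backslash plus t can be reproduced in Java, which is what Scala's String.matches runs on. Java has no raw string literals, so this sketch shows the "4 backslashes in a regular string" case mentioned above. The class name BackslashTDemo and its helpers are hypothetical.

```java
import java.util.regex.Pattern;

class BackslashTDemo {
    // Digits, a literal backslash followed by 't', digits (4 backslashes in a regular string).
    static final Pattern LITERAL = Pattern.compile("[0-9]+\\\\t[0-9]+");
    // Digits, a real tab character, digits.
    static final Pattern TAB = Pattern.compile("[0-9]+\\t[0-9]+");

    static boolean matchesLiteral(String s) {
        return LITERAL.matcher(s).matches();
    }

    static boolean matchesTab(String s) {
        return TAB.matcher(s).matches();
    }

    public static void main(String[] args) {
        String twoChars = "123456\\t123456"; // backslash + 't', like the Scala raw string
        String realTab = "123456\t123456";   // an actual tab character

        System.out.println(matchesLiteral(twoChars)); // true
        System.out.println(matchesLiteral(realTab));  // false
        System.out.println(matchesTab(realTab));      // true
        System.out.println(matchesTab(twoChars));     // false
    }
}
```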
I'm trying to get at the contents of a string like this (2.2,3.4) with a scala regular expression to obtain a string like the following 2.2,3.4
This will get me the string with parenthesis and all from a line of other text:
"""\(.*?\)"""
But I can't seem to find a way to get just the contents of the parenthesis.
I've tried: """\((.*?)\)""" """((.*?))""" and some other combinations, without luck.
I've used this one in the past in other Java apps: \\((.*?)\\), which is why I thought the first attempt in the line above """\((.*?)\)""" would work.
For my purposes, this looks something like:
var points = "pointA: (2.12, -3.48), pointB: (2.12, -3.48)"
var parenth_contents = """\((.*?)\)""".r;
val center = parenth_contents.findAllIn(points(0));
var cxy = center.next();
val cx = cxy.split(",")(0).toDouble;
Use Lookahead and Lookbehind
You can use this regex:
(?<=\()\d+\.\d+,\d+\.\d+(?=\))
Or, if you don't need precision inside the parentheses:
(?<=\()[^)]+(?=\))
See demo 1 and demo 2
Explanation
The lookbehind (?<=\() asserts that what precedes is a (
\d+\.\d+,\d+\.\d+ matches the string
or, in Option 2, [^)]+ matches any chars that are not a closing parenthesis
The lookahead (?=\)) asserts that what follows is a )
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
Maybe try this:
val parenth_contents = "\\(([^)]+)\\)".r
parenth_contents: scala.util.matching.Regex = \(([^)]+)\)
val parenth_contents(r) = "(123, abc)"
r: String = 123, abc
Here is another sample regex that matches every occurrence of the parentheses together with the content inside them:
(\([^)]+\)+)
1st capturing group (\([^)]+\)+):
\( matches the character ( literally
[^)]+ matches a single character that is not ), between one and unlimited times, as many times as possible, giving back as needed (greedy)
\)+ matches the character ) literally, between one and unlimited times, as many times as possible, giving back as needed (greedy)
Global pattern flags:
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
https://regex101.com/r/MMNRRo/1
\((.*?)\) works - you just need to extract the matched group. The easiest way to do that is to use the unapplySeq method of scala.util.matching.Regex:
scala> val wrapped = raw"\((.*?)\)".r
wrapped: scala.util.matching.Regex = \((.*?)\)
val wrapped(r) = "(123,abc)"
r: String = 123,abc
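For comparison, here is a sketch of the two approaches from this thread in plain Java: reading capture group 1 of \((.*?)\), and using lookarounds so that the whole match is already the content. The class name ParenContentsDemo and both helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenContentsDemo {
    // Capture-group approach: the parentheses are matched, group 1 holds the content.
    static List<String> withGroup(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\((.*?)\\)").matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // Lookaround approach: the parentheses are asserted but not part of the match.
    static List<String> withLookarounds(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("(?<=\\()[^)]+(?=\\))").matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String points = "pointA: (2.12, -3.48), pointB: (2.12, -3.48)";
        for (String s : withGroup(points)) {
            System.out.println(s);
        }
        for (String s : withLookarounds(points)) {
            System.out.println(s);
        }
    }
}
```

Both helpers should return the text between each pair of parentheses, so the coordinates can then be split on the comma.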
I want to implement a regular expression check for a string with a character which repeats itself more than twice.
I am using ActionScript 3.
For example:
koby = true
kobyy = true
kobyyy = false
I tried using
/((\w)\2?(?!\2))+/
but it does not seem to work (using RegExp.test())
If you want to invalidate the complete string, when there is a character repeated 3 times, you can use a negative lookahead assertion:
^(?!.*(\w)\1{2}).*
See it here on Regexr.
The group starting with (?! is a negated lookahead assertion. That means the whole regex (.* to match the complete string) will fail, when there is a word character that is repeated 3 times in the string.
^ is an anchor for the start of the string.
^ # match the start of the string
(?!.* # fail when there is anywhere in the string
(\w) # a word character
\1{2} # followed by that same character two more times
)
.* # match the string
I also tried this one:
var regExp:RegExp = new RegExp('(\\w)\\1{2}');
trace(!regExp.test('koby'));
trace(!regExp.test('kobyy'));
trace(!regExp.test('kobyyy'));
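The same two checks can be sketched in Java: a search for three identical word characters in a row, and the negative-lookahead version that validates the whole string. The class name TripleCharDemo and its helper names are hypothetical.

```java
import java.util.regex.Pattern;

class TripleCharDemo {
    // A word character followed by two more copies of itself.
    static final Pattern TRIPLE = Pattern.compile("(\\w)\\1{2}");
    // Whole-string version: fails if a triple occurs anywhere.
    static final Pattern NO_TRIPLE = Pattern.compile("^(?!.*(\\w)\\1{2}).*");

    static boolean isValidBySearch(String s) {
        return !TRIPLE.matcher(s).find();
    }

    static boolean isValidByLookahead(String s) {
        return NO_TRIPLE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"koby", "kobyy", "kobyyy"}) {
            System.out.println(s + " = " + isValidBySearch(s) + " / " + isValidByLookahead(s));
        }
    }
}
```

Both helpers should report true for koby and kobyy and false for kobyyy.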
I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?
Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here
I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regular expression:
\b : A word boundary at the start of the word
\w+ : One or more word characters
(\s+\1\b)+ : One or more whitespace characters followed by the same word as the previous one, ending at a word boundary. The + on the group lets it match more than one repetition.
Grouping:
m.group(0) : Contains the whole match, in the above case Goodbye goodbye GooDbYe
m.group(1) : Contains the first word of the match, in the above case Goodbye
The replace call substitutes every run of consecutive repeated words with the first instance of the word.
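As an alternative to the explicit loop, the same regex can be applied with Matcher.replaceAll and a "$1" replacement, which substitutes each run with its first word in a single pass. This is only a sketch; the class name CollapseRepeatsDemo and the helper collapse are hypothetical.

```java
import java.util.regex.Pattern;

class CollapseRepeatsDemo {
    // A word followed by one or more whitespace-separated copies of itself.
    static final Pattern REPEATS =
        Pattern.compile("\\b(\\w+)(\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: replaces each run of repeats with capture group 1.
    static String collapse(String input) {
        return REPEATS.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("Goodbye goodbye GooDbYe"));  // Goodbye
        System.out.println(collapse("Paris in the the spring.")); // Paris in the spring.
    }
}
```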
Try the regex below:
\b word boundary at the start of a word
\W+ one or more non-word characters
\1 the same word that was already matched
\b word boundary at the end of the word
()* repeats the group zero or more times
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateWords {
    public static void main(String[] args) {
        // A word, then any number of repeats separated by non-word characters.
        String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Scanner in = new Scanner(System.in);
        int numSentences = Integer.parseInt(in.nextLine());
        while (numSentences-- > 0) {
            String input = in.nextLine();
            Matcher m = p.matcher(input);
            // Check for subsequences of input that match the compiled pattern
            while (m.find()) {
                // Quote the matched text so characters such as '?' are not read as regex syntax.
                input = input.replaceAll(Pattern.quote(m.group(0)), Matcher.quoteReplacement(m.group(1)));
            }
            // Prints the modified sentence.
            System.out.println(input);
        }
        in.close();
    }
}
Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
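To see which words this lookahead pattern picks out, the sketch below lists every word that occurs again later in the same line. The class name EarlierDuplicatesDemo and the helper earlierDuplicates are hypothetical, and the i flag is expressed as Pattern.CASE_INSENSITIVE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EarlierDuplicatesDemo {
    // A whole word that appears again, as a whole word, somewhere later on the line.
    static final Pattern EARLIER =
        Pattern.compile("\\b(\\w+)\\b(?=.*?\\b\\1\\b)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the occurrences that would be stripped.
    static List<String> earlierDuplicates(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = EARLIER.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(earlierDuplicates("the cat saw the other cat")); // [the, cat]
    }
}
```

Only the earlier occurrences match, so replacing them with an empty string keeps the last occurrence of each word; the surrounding whitespace would still need separate cleanup.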
The widely-used PCRE library can handle such situations (you won't achieve the the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1
Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.
This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any string of non-whitespace characters, followed by optional whitespace.
\1{2,} then requires 2 or more further copies of that phrase. In other words, if there are 3 or more identical phrases in a row, it matches.
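A small sketch of how this pattern behaves in Java is shown below. The class name TriplePhraseDemo and the helper hasTriple are hypothetical.

```java
import java.util.regex.Pattern;

class TriplePhraseDemo {
    // A non-whitespace run plus optional whitespace, then at least two more copies.
    static final Pattern TRIPLE = Pattern.compile("(\\S+\\s*)\\1{2,}");

    // Hypothetical helper: true when some phrase occurs three or more times in a row.
    static boolean hasTriple(String input) {
        return TRIPLE.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasTriple("spam spam spam eggs")); // true
        System.out.println(hasTriple("spam spam eggs"));      // false
    }
}
```

Because the captured group includes its trailing whitespace, every copy has to be followed by the same whitespace; in this sketch a run that ends exactly at the end of the string, such as "ha ha ha", is not reported.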
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
The example in JavaScript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result
Try this regular expression; it covers all cases of repeated words:
\b(\w+)\s+\1(?:\s+\1)*\b
I think another solution would be to use named capture groups and backreferences like this:
.* (?<mytoken>\w+)\s+\k<mytoken> .*
OR
.*(?<mytoken>\w{3,}).+\k<mytoken>.*
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const input = "This is a test test data";
const result = [...input.matchAll(regex)];
console.log(result[0].groups.myToken);
All of the above detect "test" as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.
You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample utility function written in Java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
    var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
    var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    // Replace each run of repeats with its first word; "$1" refers to capture group 1.
    // This avoids reusing the matched text as a regex, which breaks on characters such as '?'.
    return pattern.matcher(input).replaceAll("$1");
}
As far as I can see, none of these would match:
London in the
the winter (with "the winter" on a new line)
Although matching duplicates on the same line is fairly straightforward, I haven't been able to come up with a solution for the situation in which they stretch over two lines (with Perl).
To find duplicate words that are not directly preceded or followed by any non-whitespace character, you can use whitespace boundaries on the left and on the right by means of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
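The whitespace-boundary pattern for 2 or more duplicates can be tried out with a sketch like this one, which also includes a pair split across a line break, since \s+ matches a newline. The class name WhitespaceBoundaryDemo and the helper repeatedWords are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceBoundaryDemo {
    // A word with whitespace (or the string edge) on both sides, repeated one or more times.
    static final Pattern REPEATED = Pattern.compile("(?<!\\S)(\\w+)(?:\\s+\\1)+(?!\\S)");

    // Hypothetical helper: returns the word of each repeated run.
    static List<String> repeatedWords(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = REPEATED.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(repeatedWords("Paris in the the spring."));   // [the]
        System.out.println(repeatedWords("This is $word word"));         // []
        System.out.println(repeatedWords("London in the\nthe winter"));  // [the]
    }
}
```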
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.
Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b
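The effect of the inline (?i) flag can be seen by comparing the pattern with and without it. This is a minimal sketch; the class name InlineFlagDemo and its helper names are hypothetical.

```java
import java.util.regex.Pattern;

class InlineFlagDemo {
    // (?i) makes the whole pattern, including the backreference, case-insensitive.
    static final Pattern INSENSITIVE = Pattern.compile("(?i)\\b(\\w+)\\s+\\1\\b");
    static final Pattern SENSITIVE = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    static boolean hasRepeatIgnoringCase(String s) {
        return INSENSITIVE.matcher(s).find();
    }

    static boolean hasRepeatExactCase(String s) {
        return SENSITIVE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasRepeatIgnoringCase("The the cat")); // true
        System.out.println(hasRepeatExactCase("The the cat"));    // false
    }
}
```

The inline flag has the same effect here as passing Pattern.CASE_INSENSITIVE to Pattern.compile.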