I want to implement a regular expression check for a string with a character which repeats itself more than twice.
I am using ActionScript 3.
i.e.:
koby = true
kobyy = true
kobyyy = false
I tried using
/((\w)\2?(?!\2))+/
but it does not seem to work (using RegExp.test())
If you want to invalidate the complete string, when there is a character repeated 3 times, you can use a negative lookahead assertion:
^(?!.*(\w)\1{2}).*
See it here on Regexr.
The group starting with (?! is a negated lookahead assertion. That means the whole regex (.* to match the complete string) will fail, when there is a word character that is repeated 3 times in the string.
^ is an anchor for the start of the string.
^ # match the start of the string
(?!.* # fail when there is anywhere in the string
(\w) # a word character
\1{2} # that is repeated two times
)
.* # match the string
I also tried this one:
var regExp:RegExp = new RegExp('(\\w)\\1{2}');
trace(!regExp.test('koby'));
trace(!regExp.test('kobyy'));
trace(!regExp.test('kobyyy'));
Related
PCRE Regex: Is it possible for Regex to check for a pattern match within only the first X characters of a string, ignoring other parts of the string beyond that point?
My Regex:
I have a Regex:
/\S+V\s*/
This checks the string for non-whitespace characters whoich have a trailing 'V' and then a whitespace character or the end of the string.
This works. For example:
Example A:
SEBSTI FMDE OPORV AWEN STEM students into STEM
// Match found in 'OPORV' (correct)
Example B:
ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event
//Match not found (correct).
Re: The capitalised text each letter and the letters placement has a meaning in the source data. This is followed by generic info for humans to read ("Academy Networking Event", etc.)
My Issue:
It can theoretically occur that sometimes there are names that involve roman numerals such as:
Example C:
ARKFE SSETE BLME CARFR Academy IV Networking Event
//Match found (incorrect).
I would like my Regex above to only check the first X characters of the string.
Can this be done in PCRE Regex itself? I can't find any reference to length counting in Regex and I suspect this can't easily be achieved. String lengths are completely arbitary. (We have no control over the source data).
Intention:
/\S+V\s*/{check within first 25 characters only}
ARKFE SSETE BLME CARFR Academy IV Networking Event
^
\- Cut off point. Not found so far so stop.
//Match not found (correct).
Workaround:
The Regex is in PHP and my current solution is to cut the string in PHP, to only check the first X characters, typically the first 20 characters, but I was curious if there was a way of doing this within the Regex without needing to manipulate the string directly in PHP?
$valueSubstring = substr($coreRow['value'],0,20); /* first 20 characters only */
$virtualCount = preg_match_all('/\S+V\s*/',$valueSubstring);
The trick is to capture the end of the line after the first 25 characters in a lookahead and to check if it follows the eventual match of your subpattern:
$pattern = '~^(?=.{0,25}(.*)).*?\K\S+V\b(?=.*\1)~m';
demo
details:
^ # start of the line
(?= # open a lookahead assertion
.{0,25} # the twenty first chararcters
(.*) # capture the end of the line
) # close the lookahead
.*? # consume lazily the characters
\K # the match result starts here
\S+V # your pattern
\b # a word boundary (that matches between a letter and a white-space
# or the end of the string)
(?=.*\1) # check that the end of the line follows with a reference to
# the capture group 1 content.
Note that you can also write the pattern in a more readable way like this:
$pattern = '~^
(*positive_lookahead: .{0,20} (?<line_end> .* ) )
.*? \K \S+ V \b
(*positive_lookahead: .*? \g{line_end} ) ~xm';
(The alternative syntax (*positive_lookahead: ...) is available since PHP 7.3)
You can find your pattern after X chars and skip the whole string, else, match your pattern. So, if X=25:
^.{25,}\S+V.*(*SKIP)(*F)|\S+V\s*
See the regex demo. Details:
^.{25,}\S+V.*(*SKIP)(*F) - start of string, 25 or more chars other than line break chars, as many as possible, then one or more non-whitespaces and V, and then the rest of the string, the match is failed and skipped
| - or
\S+V\s* - match one or more non-whitespaces, V and zero or more whitespace chars.
Any V ending in the first 25 positions
^.{1,24}V\s
See regex
Any word ending in V in the first 25 positions
^.{1,23}[A-Z]V\s
So I need to match the following:
1.2.
3.4.5.
5.6.7.10
((\d+)\.(\d+)\.((\d+)\.)*) will do fine for the very first line, but the problem is: there could be many lines: could be one or more than one.
\n will only appear if there are more than one lines.
In string version, I get it like this: "1.2.\n3.4.5.\n1.2."
So my issue is: if there is only one line, \n needs not to be at the end, but if there are more than one lines, \n needs be there at the end for each line except the very last.
Here is the pattern I suggest:
^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$
Demo
Here is a brief explanation of the pattern:
^ from the start of the string
\d+ match a number
(?:\.\d+)* followed by dot, and another number, zero or more times
\.? followed by an optional trailing dot
(?:\n followed by a newline
\d+(?:\.\d+)*\.?)* and another path sequence, zero or more times
$ end of the string
You might check if there is a newline at the end using a positive lookahead (?=.*\n):
(?=.*\n)(\d+)\.(\d+)\.((\d+)\.)*
See a regex demo
Edit
You could use an alternation to either match when on the next line there is the same pattern following, or match the pattern when not followed by a newline.
^(?:\d+\.\d+\.(?:\d+\.)*(?=.*\n\d+\.\d+\.)|\d+\.\d+\.(?:\d+\.)*(?!.*\n))
Regex demo
^ Start of string
(?: Non capturing group
\d+\.\d+\. Match 2 times a digit and a dot
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?=.*\n\d+\.\d+\.) Positive lookahead, assert what follows a a newline starting with the pattern
| Or
\d+\.\d+\. Match 2 times a digit and a dot
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
*(?!.*\n) Negative lookahead, assert what follows is not a newline
) Close non capturing group
(\d+\.*)+\n* will match the text you provided. If you need to make sure the final line also ends with a . then (\d+\.)+\n* will work.
Most programming languages offer the m flag. Which is the multiline modifier. Enabling this would let $ match at the end of lines and end of string.
The solution below only appends the $ to your current regex and sets the m flag. This may vary depending on your programming language.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
regex = /((\d+)\.(\d+)\.((\d+)\.)*)$/gm,
match;
while (match = regex.exec(text)) {
console.log(match);
}
You could simplify the regex to /(\d+\.){2,}$/gm, then split the full match based on the dot character to get all the different numbers. I've given a JavaScript example below, but getting a substring and splitting a string are pretty basic operations in most languages.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
regex = /(\d+\.){2,}$/gm;
/* Slice is used to drop the dot at the end, otherwise resulting in
* an empty string on split.
*
* "1.2.3.".split(".") //=> ["1", "2", "3", ""]
* "1.2.3.".slice(0, -1) //=> "1.2.3"
* "1.2.3".split(".") //=> ["1", "2", "3"]
*/
console.log(
text.match(regex)
.map(match => match.slice(0, -1).split("."))
);
For more info about regex flags/modifiers have a look at: Regular Expression Reference: Mode Modifiers
Basically I have a string and I want to find the shortest sub-string (including the beginning) that matches the repetition of a character N times, it doesn't matter if consecutive or not. I want to use it in Javascript.
Example:
Let's figure out the character is '/' and we want it to match 5 repetitions.
For this string:
http://remote-computer.example.local/home/dev/proj/sdk/docs/index.html#/api
The matching string would be:
http://remote-computer.example.local/home/dev/
For this string:
////remote-computer/example/local/home
The matching string would be:
////remote-computer/
How about this regex:
^((?:[^/]*/){5})
The sub-string you want will be catched in group 1.
In javascript you could do:
var re = new RegExp("^((?:[^/]*/){5})); // excape the slashes is not mandatory
or
var re = /^((?:[^\/]*\/){5})/; // here you have to excape the slashes
Explanation:
^ : begining of the string
( : start capture group 1
(?: : start non capture group
[^/]*/ : 0 or more any character that is not a slash, followed by a slash
){5} : the non capture group occurs 5 times
) : end of group 1
You can use (?:.*?/){5}. See a demo here.
This matches the exact same substrings as Toto’s regexp but is shorter:
There’s no need to use ^; regexps start matching at the beginning by default.
You don’t need a capture group because you want the whole match.
.*?/ matches "everything until the next /, including it", which can also be written as [^/]*/ like Toto did.
I have tried this:
[a]?[b]?[c]?[d]?[e]?[f]?[g]?[h]?[i]?[j]?[k]?[l]?[m]?[n]?[o]?[p]?[q]?[r]?[s]?[t]?[u]?[v]?[w]?[x]?[y]?[z]?
But this RegEx rejects string where the order in not alphabetical, like these:
"zabc"
"azb"
I want patterns like these two to be accepted too. How could I do that?
EDIT 1
I don't want letter repetitions, i.e., I want the following strings to be rejected:
aazb
ozob
Thanks.
You can use a negative lookahead assertion to make sure no two characters are the same:
^(?!.*(.).*\1)[a-z]*$
Explanation:
^ # Start of string
(?! # Assert that it's impossible to match the following:
.* # any number of characters
(.) # followed by one character (capture that in group 1)
.* # followed by any number of characters
\1 # followed by the same character as the one captured before
) # End of lookahead
[a-z]* # Match any number of ASCII lowercase letters
$ # End of string
Test it live on regex101.com.
Note: This regex needs to brute-force check all possible character pairs, so performance may be a problem with larger strings. If you can use anything besides regex, you're going to be happier. For example, in Python:
if re.search("^[a-z]*$", mystring) and len(mystring) == len(set(mystring)):
# valid string
Trying to learn regular expressions. As a practice, I'm trying to find every word that appears exactly one time in my document -- in linguistics this is a hapax legemenon (http://en.wikipedia.org/wiki/Hapax_legomenon)
So I thought the following expression give me the desired result:
\w{1}
But this doesn't work. The \w returns a character not a whole word. Also it does not appear to be giving me characters that appear only once (it actually returns 25873 matches -- which I assume are all alphanumeric characters). Can someone give me an example of how to find "hapax legemenon" with a regular expression?
If you're trying to do this as a learning exercise, you picked a very hard problem :)
First of all, here is the solution:
\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)
Now, here is the explanation:
We want to match a word. This is \b\w+\b - a run of one or more (+) word characters (\w), with a 'word break' (\b) on either side. A word break happens between a word character and a non-word character, so this will match between (e.g.) a word character and a space, or at the beginning and the end of the string. We also capture the word into a backreference by using parentheses ((...)). This means we can refer to the match itself later on.
Next, we want to exclude the possibility that this word has already appeared in the string. This is done by using a negative lookbehind - (?<! ... ). A negative lookbehind doesn't match if its contents match the string up to this point. So we want to not match if the word we have matched has already appeared. We do this by using a backreference (\1) to the already captured word. The final match here is \b\1\b.*\b\1\b - two copies of the current match, separated by any amount of string (.*).
Finally, we don't want to match if there is another copy of this word anywhere in the rest of the string. We do this by using negative lookahead - (?! ... ). Negative lookaheads don't match if their contents match at this point in the string. We want to match the current word after any amount of string, so we use (.*\b\1\b).
Here is an example (using C#):
var s = "goat goat leopard bird leopard horse";
foreach (Match m in Regex.Matches(s, #"\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)"))
Console.WriteLine(m.Value);
Output:
bird
horse
It can be done in a single regex if your regex engine supports infinite repetition inside lookbehind assertions (e. g. .NET):
Regex regexObj = new Regex(
#"( # Match and capture into backreference no. 1:
\b # (from the start of the word)
\p{L}+ # a succession of letters
\b # (to the end of a word).
) # End of capturing group.
(?<= # Now assert that the preceding text contains:
^ # (from the start of the string)
(?: # (Start of non-capturing group)
(?! # Assert that we can't match...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
\1 # we reach the word we've just matched.
) # End of lookbehind assertion.
# We now know that we have just matched the first instance of that word.
(?= # Now look ahead to assert that we can match the following:
(?: # (Start of non-capturing group)
(?! # Assert that we can't match again...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
$ # the end of the string.
) # End of lookahead assertion.",
RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
// matched text: matchResults.Value
// match start: matchResults.Index
// match length: matchResults.Length
matchResults = matchResults.NextMatch();
}
If you are trying to match an English word, the best form is:
[a-zA-Z]+
The problem with \w is that it also includes _ and numeric digits 0-9.
If you need to include other characters, you can append them after the Z but before the ]. Or, you might need to normalize the input text first.
Now, if you want a count of all words, or just to see words that don't appear more than once, you can't do that with a single regex. You'll need to invest some time in programming more complex logic. It may very well need to be backed by a database or some sort of memory structure to keep track of the count. After you parse and count the whole text, you can search for words that have a count of 1.
(\w+){1} will match each word.
After that you could always perfrom the count on the matches....
Higher level solution:
Create an array of your matches:
preg_match_all("/([a-zA-Z]+)/", $text, $matches, PREG_PATTERN_ORDER);
Let PHP count your array elements:
$tmp_array = array_count_values($matches[1]);
Iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
echo $word . ' ' . $count;
}
Low level but does what you want:
Pass your text in an array using split:
$array = split('\s+', $text);
Iterate over that array:
foreach ($array as $word) { ... }
Check each word if it is a word:
if (!preg_match('/[^a-zA-Z]/', $word) continue;
Add the word to a temporary array as key:
if (!$tmp_array[$word]) $tmp_array[$word] = 0;
$tmp_array[$word]++;
After the loop. Iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
echo $word . ' ' . $count;
}