I need a regular expression that finds a character which is not followed by the same character, i.e. one that excludes doubled or repeated occurrences of that character.
For example, when I look for the 'e' character in the string "need only single characteeer", it should behave as follows for each word:
"need" > not match because it has double 'e'
"only" > not match because no 'e'
"single" > match because has only single 'e'
"characteeer" > not match because has multiple of 'e'
Not sure whether it is possible or not. Any answer or comment will be highly appreciated. Thanks in advance.
UPDATE
Maybe my question above is still ambiguous. I actually need to find only the 'e' character itself, not the words. I am going to replace it with a doubled character, so an 'e' that is already doubled should not be replaced.
The main purpose is to replace 'e' with 'ee', for example. But anything that is already 'ee' or 'eee', or an even longer run of 'e', should be left untouched.
UPDATE:
(?<!e)e(?!e)
This matches an e, using a negative lookbehind to prevent a preceding e and a negative lookahead to prevent a following e.
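As a quick check of the replacement described in the question, here is a minimal sketch in Python. Python is only my choice for illustration, since the asker did not name a language, and the helper name double_single_e is made up.

```python
import re

# Match an "e" that has no "e" directly before or after it.
SINGLE_E = re.compile(r"(?<!e)e(?!e)")

def double_single_e(text: str) -> str:
    """Replace every isolated 'e' with 'ee'; runs of 'e' stay untouched."""
    return SINGLE_E.sub("ee", text)

print(double_single_e("need only single characteeer"))
```

Only the e in "single" is isolated, so that is the only one that gets doubled.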
\b(([A-Za-z])(?!\2))+\b
This matches a word (a sequence of one or more characters in A-Za-z), with a negative lookahead that prevents the following character from being the same as the last match, group 2. Or, using a non-capturing group:
/\b(?:([A-Za-z])(?!\1))+\b/g
However, in the sample string this matches only the words only and single, because they don't contain a repeated character.
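To see which words of the sample survive the "no repeated character" pattern, here is a small Python sketch (Python is an assumption on my part, and the variable names are made up). It uses finditer rather than findall, because findall would return only the contents of group 1.

```python
import re

# One or more letters, each of which must not be followed by itself.
NO_REPEAT_WORD = re.compile(r"\b(?:([A-Za-z])(?!\1))+\b")

sample = "need only single characteeer"
words = [m.group(0) for m in NO_REPEAT_WORD.finditer(sample)]
print(words)
```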
To match a word containing e but no ee:
/(?<![a-z])(?=[a-z]*e)(?![a-z]*ee)[a-z]+/gi
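A minimal Python sketch of the same pattern, assuming the re module's syntax is close enough to the flavour used above (the /gi flags become findall plus re.IGNORECASE):

```python
import re

# A run of letters that contains an "e" somewhere but never "ee".
WORD_WITH_E_NO_EE = re.compile(
    r"(?<![a-z])(?=[a-z]*e)(?![a-z]*ee)[a-z]+", re.IGNORECASE
)

matches = WORD_WITH_E_NO_EE.findall("need only single characteeer")
print(matches)
```

Note that this accepts any word with an e and no ee, so a word such as "here" with two separate e's also matches.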
/\b([a-df-z]*e[a-df-z]*)\b\s*/g
You could add the case-insensitive flag /i if needed.
Explanation:
/ : regex delimiter
\b : word boundary
( : start group 1
[a-df-z]* : 0 or more letters that are not "e"
e : 1 letter "e"
[a-df-z]* : 0 or more letters that are not "e"
) : end group 1
\b : word boundary
\s* : 0 or more spaces
/g : regex delimiter, global flag
As you didn't say which language you're using, here is a Perl script:
use feature 'say';
use Data::Dumper;
my $str = "need only single characteeer";
my @list = $str =~ /\b([a-df-z]*e[a-df-z]*)\b\s*/g;
say Dumper \@list;
Output:
$VAR1 = [
'single'
];
And a PHP script:
$str = "need only single characteeer";
preg_match_all("/\b([a-df-z]*e[a-df-z]*)\b\s*/", $str, $match);
print_r($match);
Output:
Array
(
    [0] => Array
        (
            [0] => single
        )
    [1] => Array
        (
            [0] => single
        )
)
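For comparison, the same character-class pattern can be tried in Python. This is only a sketch under the assumption that Python is acceptable here, and the constant name is made up.

```python
import re

# Letters other than "e", exactly one "e", then letters other than "e".
EXACTLY_ONE_E = re.compile(r"\b([a-df-z]*e[a-df-z]*)\b")

found = EXACTLY_ONE_E.findall("need only single characteeer")
print(found)
```

Unlike the lookahead version, this one also rejects a word with two separate e's, such as "here".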
I am looking for Perl regular expression syntax for some requirements I have in a project.
First, I want to exclude strings that contain words from a txt file (a dictionary).
For example, if my file has these strings:
path.../Document.txt |
tree
car
ship
then, using a regular expression:
a1testtre -- match
orangesh1 -- match
apleship3 -- not match [contains word from file ]
I also have one more requirement that I couldn't solve. I have to create a regex that does not allow a string to contain a doubled character (two identical chars in a row) 3 or more times.
For example:
adminnisstrator21 -- match (has 2 doubled chars)
kkeeykloakk -- not match (has 3 doubled chars)
stack22ooverflow -- match (has 2 doubled chars)
For this I have tried
\b(?:([a-z])(?!\1))+\b
but it only works for the first repeated char.
Any idea how to solve these two?
To not match a word from a file you might check whether a string contains a substring or use a negative lookahead and an alternation:
^(?!.*(?:tree|car|ship)).*$
^ Assert start of string
(?! negative lookahead, assert what is on the right is not
.*(?:tree|car|ship) Match 0+ times any char except a newline and match either tree car or ship
) Close negative lookahead
.* Match 0+ times any char except a newline
$ Assert end of string
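The lookahead can be tried against the three sample strings from the question. The sketch below uses Python rather than Perl, which is my own choice for illustration; the pattern itself is unchanged.

```python
import re

# Negative lookahead: fail if any listed word occurs anywhere in the string.
NO_LISTED_WORD = re.compile(r"^(?!.*(?:tree|car|ship)).*$")

samples = ["a1testtre", "orangesh1", "apleship3"]
allowed = [s for s in samples if NO_LISTED_WORD.match(s)]
print(allowed)
```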
To not allow a string to have over 3 times a char repeat you could use:
\b(?!(?:\w*(\w)\1){3})\w+\b
\b Word boundary
(?! Negative lookahead, assert what is on the right is not
(?: Non capturing group
\w*(\w)\1 Match 0+ times a word character followed by capturing a word char in a group followed by a backreference using \1 to that group
){3} Close non capturing group and repeat 3 times
) close negative lookahead
\w+ Match 1+ word characters
\b word boundary
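A short Python sketch of the same check, run against the three examples from the question. Python and the helper name has_fewer_than_three_pairs are my own assumptions.

```python
import re

# Reject a word as soon as three doubled characters can be found in it.
FEWER_THAN_THREE_PAIRS = re.compile(r"\b(?!(?:\w*(\w)\1){3})\w+\b")

def has_fewer_than_three_pairs(word: str) -> bool:
    """True if the whole word matches, i.e. it has at most two doubled chars."""
    return FEWER_THAN_THREE_PAIRS.fullmatch(word) is not None

for word in ["adminnisstrator21", "kkeeykloakk", "stack22ooverflow"]:
    print(word, has_fewer_than_three_pairs(word))
```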
Update
According to this posted answer (which you might add to the question instead) you have 2 patterns that you want to combine but it does not work:
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
In those 2 patterns you use 2 capturing groups, so the second pattern has to point to the second capturing group \2.
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\2){4}))*$)
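To check the combined pattern with the corrected backreference, here is a hedged Python sketch (Python is my choice, and is_allowed is a made-up helper). The second lookahead rejects any character that is followed by four more copies of itself.

```python
import re

COMBINED = re.compile(
    r"(?=^(?!(?:\w*(.)\1){3}).+$)"      # fewer than 3 doubled characters
    r"(?=^(?:(.)(?!(?:.*?\2){4}))*$)"   # no character occurs 5 or more times
)

def is_allowed(s: str) -> bool:
    """True only if both lookahead conditions hold for the whole string."""
    return COMBINED.match(s) is not None

for s in ["adminnisstrator21", "aabbcc", "abababababa"]:
    print(s, is_allowed(s))
```

Here "aabbcc" fails only the first condition and "abababababa" fails only the second, so both halves are exercised.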
One way to exclude strings that contain words from a given list is to form a pattern with an alternation of the words and use that in a regex, and exclude strings for which it matches.
use warnings;
use strict;
use feature qw(say);
use Path::Tiny;
my $file = shift // die "Usage: $0 file\n"; #/
my @words = split ' ', path($file)->slurp;
my $exclude = join '|', map { quotemeta } @words;
foreach my $string (qw(a1testtre orangesh1 apleship3))
{
    if ($string !~ /$exclude/) {
        say "OK: $string";
    }
}
I use Path::Tiny to read the file into a string ("slurp"), which is then split on whitespace into the words to use for exclusion. The quotemeta escapes non-"word" characters, should any occur in your words, which are then joined by | to form a string with a regex pattern. (With complex patterns use qr.)
This may be possible to tweak and improve, depending on your use cases, for one with regard to the order of patterns with common parts in the alternation.†
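The same idea, sketched in Python for readers who want to try it without Perl. As a simplifying assumption the word list is given inline instead of being read from a file, and build_exclusion is a made-up helper; re.escape plays the role of quotemeta.

```python
import re

def build_exclusion(words):
    """Join the words into one alternation, escaping regex metacharacters."""
    return re.compile("|".join(re.escape(w) for w in words))

# Assumption: these stand in for the words read from the dictionary file.
exclude = build_exclusion(["tree", "car", "ship"])

candidates = ["a1testtre", "orangesh1", "apleship3"]
ok = [s for s in candidates if not exclude.search(s)]
print(ok)
```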
The check that successive duplicate characters do not occur more than three times:
foreach my $string (qw(adminnisstrator21 kkeeykloakk stack22ooverflow))
{
    my @chars_that_repeat = $string =~ /(.)\1+/g;
    if (@chars_that_repeat < 3) {
        say "OK: $string";
    }
}
A long string of repeated chars (aaaa) counts as one instance, due to the + quantifier in the regex; if you'd rather count all pairs, remove the + and four a's will count as two pairs. The same char repeated at various places in the string counts every time, so aaXaa counts as two pairs.
This snippet can be just added to the above program, which is invoked with the name of the file with words to use for exclusion. They both print what is expected from provided samples.
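The counting behaviour described above (runs versus pairs) can be seen in a small Python sketch. Python and the helper name repeated_chars are my own choices here.

```python
import re

def repeated_chars(s: str):
    """Return the character of each run of two or more identical characters."""
    return re.findall(r"(.)\1+", s)

for s in ["adminnisstrator21", "kkeeykloakk", "stack22ooverflow"]:
    runs = repeated_chars(s)
    print(s, runs, "OK" if len(runs) < 3 else "rejected")

# Without the "+", a run of four counts as two separate pairs.
pairs_in_aaaa = re.findall(r"(.)\1", "aaaa")
```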
† Consider an example with exclusion-words: so, sole, and solely. If you only need to check whether any one of these matches then you'd want shorter ones first in the alternation
my $exclude = join '|', map { quotemeta } sort { length $a <=> length $b } @words;
#==> so|sole|solely
for a quicker match (so matches all three). This, by all means, appears to be the case here.
But, if you wanted to correctly identify which word matched then you must have longer words first,
solely|sole|so
so that a string solely is correctly matched by its word before it can be "stolen" by so. Then in this case you'd want it the other way round,
sort { length $b <=> length $a }
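The effect of the ordering is easy to see in a small sketch. It is written in Python as an assumption; like Perl, Python's re tries alternatives from left to right.

```python
import re

words = ["so", "sole", "solely"]

# Shortest first: the quickest yes/no answer, but "so" wins every time.
shortest_first = "|".join(sorted(map(re.escape, words), key=len))
# Longest first: needed when you care which word actually matched.
longest_first = "|".join(sorted(map(re.escape, words), key=len, reverse=True))

matched_short = re.search(shortest_first, "solely").group()
matched_long = re.search(longest_first, "solely").group()
print(shortest_first, "->", matched_short)
print(longest_first, "->", matched_long)
```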
I hope someone else will come with a better solution, but this seems to do what you want:
\b Match word boundary
(?: Start non-capturing group
(?:([a-z0-9])(?!\1))* Match all characters until it encounters a double
(?:([a-z0-9])\2)+ Match all repeated characters until a different one is reached
){0,2} Repeat the non-capturing group 0 to 2 times
(?:([a-z0-9])(?!\3))+ Match all characters until it encounters a double
\b Match end of word
I changed the [a-z] to also match numbers, since the examples you gave seem to also include numbers. Perl regex also has the \w shorthand, which is equivalent to [A-Za-z0-9_], which could be handy if you want to match any character in a word.
My problem is that I have 2 regexes that work on their own:
Do not allow over 3 pairs of chars:
(?=^(?!(?:\w*(.)\1){3}).+$)
Do not allow a char to repeat over 4 times:
(?=^(?:(.)(?!(?:.*?\1){4}))*$)
Now I want to combine them into one line, like:
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
but only the first regex takes effect, not both of them.
As mentioned in a comment to @zdim's answer, take it a bit further by making sure that the order in which your words are assembled into the match pattern doesn't trip you up. If the words in the file are not very carefully ordered to start with, I use a subroutine like this when building the match string:
# Returns a list of alternative match patterns in tight matching order.
# E.g., TRUSTEES before TRUSTEE before TRUST
# TRUSTEES|TRUSTEE|TRUST
sub tight_match_order {
    return @_ unless @_ > 1;
    my (@alts, @ordered_alts, %alts_seen);
    @alts = map { $alts_seen{$_}++ ? () : $_ } @_;
    TEST: {
        my $alt = shift @alts;
        if (grep m#$alt#, @alts) {
            push @alts => $alt;
        } else {
            push @ordered_alts => $alt;
        }
        redo TEST if @alts;
    }
    @ordered_alts
}
So following @zdim's answer:
...
my @words = split ' ', path($file)->slurp;
@words = tight_match_order(@words); # add this line
my $exclude = join '|', map { quotemeta } @words;
...
HTH
I want to substitute variables marked by a "#" and terminated by a dot or a non-alphanumeric character.
Example: Variable #name should be substituted by "Peter"
abc#name.def => abcPeterdef
abc#namedef => abc#namedef
abc#name-def => abcPeter-def
So if the variable is terminated with a dot, it is replaced and the dot is removed. If it is terminated by any other non-alphanumeric character, it is replaced as well and that character is kept.
I use the following:
s/#name\./Peter/i
s/#name(\W)/Peter$1/i
This works but is it possible to merge it into one expression?
There are several possible approaches.
s/#name(\W)/"Peter" . ($1 eq "." ? "" : $1)/e
Here we use /e to turn the replacement part into an expression, so we can inspect $1 and choose the replacement string dynamically.
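The same "compute the replacement from $1" idea exists in other languages as a callback. A minimal Python sketch, where Python and the helper name substitute_name are my own assumptions rather than anything from the question:

```python
import re

def substitute_name(text: str, value: str = "Peter") -> str:
    """Replace #name plus its terminator; keep the terminator unless it is a dot."""
    return re.sub(
        r"#name(\W)",
        lambda m: value + ("" if m.group(1) == "." else m.group(1)),
        text,
        flags=re.IGNORECASE,
    )

for s in ["abc#name.def", "abc#namedef", "abc#name-def"]:
    print(s, "=>", substitute_name(s))
```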
s/#name(?|\.()|([^.\w]))/Peter$1/
Here we use (?| ) to reset the numbering of capture groups between branches, so both \.() and ([^.\w]) set $1. If a . is matched, $1 becomes the empty string; otherwise it contains the matched character.
You may use
s/#name(?|\.()|(\W))/Peter$1/i
Details
#name - matches the literal substring
(?|\.()|(\W)) - a branch reset group matching either of the two alternatives:
\.() - a dot and then captures an empty string into $1
| - or
(\W) - any non-word char captured into $1.
So, upon a match, $1 placeholder is either empty or contains any non-word char other than a dot.
You can do this by using either a literal dot or a word boundary for the terminator
Like this
s/#name(?:\.|\b)/Peter/i
Here's a complete program that reproduces the required output shown in your question
use strict;
use warnings 'all';
for my $s ( 'abc#name.def', 'abc#namedef', 'abc#name-def' ) {
( my $s2 = $s ) =~ s/#name(?:\.|\b)/Peter/i;
printf "%-12s => %-s\n", $s, $s2;
}
output
abc#name.def => abcPeterdef
abc#namedef => abc#namedef
abc#name-def => abcPeter-def
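For readers without Perl at hand, the dot-or-word-boundary pattern behaves the same way in Python. This is a sketch under that assumption, with made-up names.

```python
import re

# "#name" followed by a literal dot (consumed) or by a word boundary (not consumed).
TERMINATED_NAME = re.compile(r"#name(?:\.|\b)", re.IGNORECASE)

samples = ["abc#name.def", "abc#namedef", "abc#name-def", "abc#name"]
results = {s: TERMINATED_NAME.sub("Peter", s) for s in samples}
for original, replaced in results.items():
    print(f"{original:<12} => {replaced}")
```

One difference from the (\W) variants: a word boundary also matches at the very end of the string, so a trailing #name is replaced as well.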
I'm quite terrible at regexes.
I have a string that may have 1 or more words in it (generally 2 or 3), usually a person's name, for example:
$str1 = 'John Smith';
$str2 = 'John Doe';
$str3 = 'David X. Cohen';
$str4 = 'Kim Jong Un';
$str5 = 'Bob';
I'd like to convert each as follows:
$str1 = 'John S.';
$str2 = 'John D.';
$str3 = 'David X. C.';
$str4 = 'Kim J. U.';
$str5 = 'Bob';
My guess is that I should first match the first word, like so:
preg_match( "^([\w\-]+)", $str1, $first_word )
then all the words after the first one... but how do I match those? Should I use preg_match again with offset = 1 in the arguments? But that offset is in characters or bytes, right?
Anyway, after I have matched the words following the first, if they exist, should I do something like this for each of them:
$second_word = substr( $following_word, 1 ) . '. ';
Or is my approach completely wrong?
Thanks
PS - it would be a boon if the regex could keep the first two words whole when the string contains three or more words... (e.g. 'Kim Jong U.').
It can be done in single preg_replace using a regex.
You can search using this regex:
^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+
And replace by:
$1.
Code:
$name = preg_replace('/^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+/', '$1.', $name);
Explanation:
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice workaround for the restriction that you cannot have a variable-length lookbehind in the above regex.
^\w+(?:$| +)(*SKIP)(*F) matches the first word in a name and skips it (does nothing)
(\w)\w+ matches every other word, which is replaced with its first letter and a dot.
You could use a positive lookbehind assertion.
(?<=\h)([A-Z])\w+
OR
Use this regex if you want to turn Bob F to Bob F.
(?<=\h)([A-Z])\w*(?!\.)
Then replace the matched characters with \1.
Code would be like,
preg_replace('~(?<=\h)([A-Z])\w+~', '\1.', $string);
(?<=\h)([A-Z]) Captures all the uppercase letters which are preceded by a horizontal space character.
\w+ matches one or more word characters.
Replace the matched chars with the chars inside the group index 1 \1 plus a dot will give you the desired output.
A simple solution with only look-ahead and word boundary check:
preg_replace('~(?!^)\b(\w)\w+~', '$1.', $string);
(\w)\w+ is a word in the name, with the first character captured
(?!^)\b performs a word boundary check \b, and makes sure the match is not at the start of the string (?!^).
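This last pattern uses nothing PCRE-specific, so it can be checked against all five names from the question in other engines too. A minimal Python sketch, where Python and the function name abbreviate are my own assumptions:

```python
import re

def abbreviate(name: str) -> str:
    """Keep the first word; shorten every later word of 2+ characters to 'X.'."""
    return re.sub(r"(?!^)\b(\w)\w+", r"\1.", name)

for name in ["John Smith", "John Doe", "David X. Cohen", "Kim Jong Un", "Bob"]:
    print(name, "=>", abbreviate(name))
```

The existing initial in "David X. Cohen" is left alone because (\w)\w+ needs at least two word characters.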
I'm trying to learn regular expressions. As practice, I'm trying to find every word that appears exactly one time in my document. In linguistics this is a hapax legomenon (http://en.wikipedia.org/wiki/Hapax_legomenon).
So I thought the following expression would give me the desired result:
\w{1}
But this doesn't work. The \w returns a character, not a whole word. Also, it does not appear to be giving me characters that appear only once (it actually returns 25873 matches, which I assume are all the alphanumeric characters). Can someone give me an example of how to find a hapax legomenon with a regular expression?
If you're trying to do this as a learning exercise, you picked a very hard problem :)
First of all, here is the solution:
\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)
Now, here is the explanation:
We want to match a word. This is \b\w+\b - a run of one or more (+) word characters (\w), with a 'word break' (\b) on either side. A word break happens between a word character and a non-word character, so this will match between (e.g.) a word character and a space, or at the beginning and the end of the string. We also capture the word into a backreference by using parentheses ((...)). This means we can refer to the match itself later on.
Next, we want to exclude the possibility that this word has already appeared in the string. This is done by using a negative lookbehind - (?<! ... ). A negative lookbehind doesn't match if its contents match the string up to this point. So we want to not match if the word we have matched has already appeared. We do this by using a backreference (\1) to the already captured word. The final match here is \b\1\b.*\b\1\b - two copies of the current match, separated by any amount of string (.*).
Finally, we don't want to match if there is another copy of this word anywhere in the rest of the string. We do this by using negative lookahead - (?! ... ). Negative lookaheads don't match if their contents match at this point in the string. We want to match the current word after any amount of string, so we use (.*\b\1\b).
Here is an example (using C#):
var s = "goat goat leopard bird leopard horse";
foreach (Match m in Regex.Matches(s, @"\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)"))
    Console.WriteLine(m.Value);
Output:
bird
horse
It can be done in a single regex if your regex engine supports infinite repetition inside lookbehind assertions (e. g. .NET):
Regex regexObj = new Regex(
@"( # Match and capture into backreference no. 1:
\b # (from the start of the word)
\p{L}+ # a succession of letters
\b # (to the end of a word).
) # End of capturing group.
(?<= # Now assert that the preceding text contains:
^ # (from the start of the string)
(?: # (Start of non-capturing group)
(?! # Assert that we can't match...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
\1 # we reach the word we've just matched.
) # End of lookbehind assertion.
# We now know that we have just matched the first instance of that word.
(?= # Now look ahead to assert that we can match the following:
(?: # (Start of non-capturing group)
(?! # Assert that we can't match again...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
$ # the end of the string.
) # End of lookahead assertion.",
RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
// matched text: matchResults.Value
// match start: matchResults.Index
// match length: matchResults.Length
matchResults = matchResults.NextMatch();
}
If you are trying to match an English word, the best form is:
[a-zA-Z]+
The problem with \w is that it also includes _ and numeric digits 0-9.
If you need to include other characters, you can append them after the Z but before the ]. Or, you might need to normalize the input text first.
Now, if you want a count of all words, or just to see words that don't appear more than once, you can't do that with a single regex. You'll need to invest some time in programming more complex logic. It may very well need to be backed by a database or some sort of memory structure to keep track of the count. After you parse and count the whole text, you can search for words that have a count of 1.
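The parse-count-filter approach described above needs no database for a document that fits in memory. A minimal Python sketch, where Python and the function name hapax_legomena are my own assumptions:

```python
import re
from collections import Counter

def hapax_legomena(text: str):
    """Return the words that occur exactly once, in order of first appearance."""
    words = re.findall(r"[a-zA-Z]+", text)
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]

print(hapax_legomena("goat goat leopard bird leopard horse"))
```

The comparison is case-sensitive as written; lower-casing the text first would treat "Goat" and "goat" as the same word.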
(\w+){1} will match each word.
After that you could always perform the count on the matches...
Higher level solution:
Create an array of your matches:
preg_match_all("/([a-zA-Z]+)/", $text, $matches, PREG_PATTERN_ORDER);
Let PHP count your array elements:
$tmp_array = array_count_values($matches[1]);
Iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
echo $word . ' ' . $count;
}
Low level but does what you want:
Split your text into an array using preg_split (the old split() function was removed in PHP 7):
$array = preg_split('/\s+/', $text);
Iterate over that array:
foreach ($array as $word) { ... }
Check whether each word really is a word (skip it if it contains a non-letter):
if (preg_match('/[^a-zA-Z]/', $word)) continue;
Add the word to a temporary array as key:
if (!isset($tmp_array[$word])) $tmp_array[$word] = 0;
$tmp_array[$word]++;
After the loop. Iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
echo $word . ' ' . $count;
}