How to check number of different characters using regex? - regex

I'm trying to create regex to find all inputs containing max three different characters. It doesn't matter how long the input is.
Example of cases:
"32 32 32 32 34" --> match
"MM" --> match
" " --> match
"1234" --> no match
I've done regex to find inputs of four or more different chars, but now I need it in opposite way...
(.).*(?\1)(.).*(?\1)(?\2)(.).*(?\1)(?\2)(?\3)(.)
Main question is: How to check number of different characters?

The following will match a string with a maximum of three different non-space characters
^\s*(\S)?(?:\s|\1)*(\S)?(?:\s|\1|\2)*(\S)?(?:\s|\1|\2|\3)*$
(\S) matches one non-space character and captures it so it can then be referenced later in the regex using a back-reference e.g. \1. The ? in the (\S)? are used so the string can contain zero, one, two or three types of non-space characters.
The ?: make a group non-capturing.
The first part of the regex captures up to three different non-space characters \1, \2, \3, and then (?:\s|\1|\2|\3)* ensures only those characters or space \s can then appear before the end of the string $.
One way, in Javascript, to count the number of different non-space characters in a string "using regex":
var str = 'ABC ABC';
var chars = '';
str.replace( /\S/g, function ( m ) {
if ( chars.indexOf(m) == -1 ) chars += m;
});
chars.length; // 3

Good q. Here's the simplest I could come up:
^\s*([^\s]{1,3}\s+)*[^\s]{0,3}$
Explanation:
^\s* matches any amount of whitespace at the start.
([^\s]{1,3}\s+)* matches repeating groups of between one and three
non-whitespace characters followed by at least one whitespace character. Consider putting ?: after ( to make this a non-capturing group.
The final [^\s]{0,3} allows the string to end with up to three non-whitespace characters (so it doesn't have to end with whitespace as enforced by 2.)
Visualisation:
Demo:
Test it here: Debuggex Demo

Related

Regex ignore special character with greedy

I used the following regex to catch 10 numbers and letters:
/[a-zA-Z0-9]{10}/g
It works fine if the 10 characters are only numbers and letters.
e.g. input: 12345xcdw034342
it catches 12345xcdw0
But in this case with special characters or space, it doesn't catch it.
123}456712234324Zz3 or 123}45 71223AB3
It should catch 10 numbers and letters regardness of characters.
Any help would be gratefully appreciated.
You can do it but not without any extra processing
As you have not spetified what language you're using Ill use Javascript for being quite universal but the same logic must apply in any language.
Here are the options I can think of
if I have testString = "12#34{56A789BDE"
Match the all until the first ten alphanumeric caracters, and then remove the spetial characters in the resulting string
testString.match(/(\w.*?){10}/)[0].replaceAll(/\W/g, '')
// results '123456A789'
// explanation: we take the first \w and use .*? to indicate that we dont care if the alphanumeric has a non-alphanumeric right next to it, then we clean the result by removing \W which means non-alphanumeric
Match only the first ten alphanumeric caracters and then join them to make a result string
testString.match(/\w/g).splice(0,10).join('')
// results '123456A789'
// explanation: we match 10 groups of aphanumeric characters represented by \w (note the lowercase) and we join the first 10 (using splice to get them) as each group "()" is in the case of javascript returned as an element of an array of matches
Remove the spetial characters from your string and then take the first ten
testString.replaceAll(/\W/g,'').match(/\w{10}/)[0]
// results '123456A789'
// explanation: we replace \W which means non alpha numeric characters, with '' to delete them then we match the first ten
You can use
/[a-zA-Z0-9](?:[^a-zA-Z0-9]*[a-zA-Z0-9]){9}/g
See the regex demo. Details:
[a-zA-Z0-9] - an alphanumeric
(?:[^a-zA-Z0-9]*[a-zA-Z0-9]){9} - nine occurrences of any zero or more chars other than an alphanumeric char and then an alphanumeric char.

A regular expression for matching a group followed by a specific character

So I need to match the following:
1.2.
3.4.5.
5.6.7.10
((\d+)\.(\d+)\.((\d+)\.)*) will do fine for the very first line, but the problem is: there could be many lines: could be one or more than one.
\n will only appear if there are more than one lines.
In string version, I get it like this: "1.2.\n3.4.5.\n1.2."
So my issue is: if there is only one line, \n needs not to be at the end, but if there are more than one lines, \n needs be there at the end for each line except the very last.
Here is the pattern I suggest:
^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$
Demo
Here is a brief explanation of the pattern:
^ from the start of the string
\d+ match a number
(?:\.\d+)* followed by dot, and another number, zero or more times
\.? followed by an optional trailing dot
(?:\n followed by a newline
\d+(?:\.\d+)*\.?)* and another path sequence, zero or more times
$ end of the string
You might check if there is a newline at the end using a positive lookahead (?=.*\n):
(?=.*\n)(\d+)\.(\d+)\.((\d+)\.)*
See a regex demo
Edit
You could use an alternation to either match when on the next line there is the same pattern following, or match the pattern when not followed by a newline.
^(?:\d+\.\d+\.(?:\d+\.)*(?=.*\n\d+\.\d+\.)|\d+\.\d+\.(?:\d+\.)*(?!.*\n))
Regex demo
^ Start of string
(?: Non capturing group
\d+\.\d+\. Match 2 times a digit and a dot
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?=.*\n\d+\.\d+\.) Positive lookahead, assert what follows a a newline starting with the pattern
| Or
\d+\.\d+\. Match 2 times a digit and a dot
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
*(?!.*\n) Negative lookahead, assert what follows is not a newline
) Close non capturing group
(\d+\.*)+\n* will match the text you provided. If you need to make sure the final line also ends with a . then (\d+\.)+\n* will work.
Most programming languages offer the m flag. Which is the multiline modifier. Enabling this would let $ match at the end of lines and end of string.
The solution below only appends the $ to your current regex and sets the m flag. This may vary depending on your programming language.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
regex = /((\d+)\.(\d+)\.((\d+)\.)*)$/gm,
match;
while (match = regex.exec(text)) {
console.log(match);
}
You could simplify the regex to /(\d+\.){2,}$/gm, then split the full match based on the dot character to get all the different numbers. I've given a JavaScript example below, but getting a substring and splitting a string are pretty basic operations in most languages.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
regex = /(\d+\.){2,}$/gm;
/* Slice is used to drop the dot at the end, otherwise resulting in
* an empty string on split.
*
* "1.2.3.".split(".") //=> ["1", "2", "3", ""]
* "1.2.3.".slice(0, -1) //=> "1.2.3"
* "1.2.3".split(".") //=> ["1", "2", "3"]
*/
console.log(
text.match(regex)
.map(match => match.slice(0, -1).split("."))
);
For more info about regex flags/modifiers have a look at: Regular Expression Reference: Mode Modifiers

Regular Expression that matches when any string has minimum three characters, and + signs are surrounded by minimum three characters

I wanted to create regex expression that only matches when any string has three or more character and if any + sign in the string then after and before + sign it must be minimum three characters required,
I have created one regex it fulfills me all requirement except one that before first + sign must be minimum three characters but it matches with less character
this is my current regex: (\+[a-z0-9]{3}|[a-z0-9]{0,3})$
ab+abx this string should not match but it matched in my regex
Example:
Valid Strings:
sss
sdfsgdf
4534534
dfs34543
sdafds+3232+sfdsafd
qwe+sdf
234+567
cvb+243
Invalid Strings:
a
aa
a+
aa+
+aa
+a
a+a
aa+aa
aaa+a
You can use this regex,
^[^+\n]{3,}(?:\+[^+\n]{3,})*$
Explanation:
^ - Start of string
[^+\n]{3,} - This ensures it matches any characters except + and newline, \n you can actually remove if the input you're trying to match doesn't contain any newlines and {3,} allows it to match at least three and more characters
(?:\+[^+\n]{3,})* - This part further allows matching of a + character then further separated by at least three or more characters and whole of it zero or more times to keep appearance of + character optional
$ - End of input
Demo
Edit: Updating solution where a space does not participate in counting the number of characters in either side of + where minimum number of character required were three
You can use this regex to ignore counting spaces within the text,
^(?:[^+\n ] *){3,}(?:\+ *(?:[^+\n ] *){3,})*$
Demo
Also, in case you're dealing with only alphanumeric text, you can use this simpler and easier to maintain regex,
^(?:[a-z0-9] *){3,}(?:\+ *(?:[a-z0-9] *){3,})*$
Demo
You could repeat 0+ times matching 3 or more times what is listed in the character class [a-z0-9] preceded by a plus sign:
^[a-z0-9]{3,}(?:\+[a-z0-9]{3,})*$
That will match:
^ Start of string
[a-z0-9]{3,} Match 3+ times what is listed in the character class
(?: Non capturing group
\+[a-z0-9]{3,} Match + sign followed by matching 3+ times what is listed in the character class
)* Close group and repeat 0+ times
$ End of string

Regex perl with letters and numbers

I need to extract a strings from a text file that contains both letters and numbers. The lines start like this
Report filename: ABCL00-67900010079415.rpt ______________________
All I need is the last 8 numbers so in this example that would be 10079415
while(<DATA>){
if (/Report filename/) {
my ($bagID) = ( m/(\d{8}+)./ );
print $bagID;
}
Right now this prints out the first 8 but I want the last 8.
You just need to escape the dot, so that it would match the 8 digit characters which exists before the dot charcater.
my ($bagID) = ( m/(\d{8}+)\./ );
. is a special character in regex which matches any character. In-order to match a literal dot, you must need to escape that.
To match the last of anything, just precede it with a wildcard that will match as many characters as possible
my ($bag_id) = / .* (\d{8}) /x
Note that I have also use the /x modifier so that the regex can contain insignificant whitespace for readability. Also, your \d{8}+ is what is called a possessive quantifier; it is used for optimising some regex constructions and makes no difference at the end of the pattern

regex: find one-digit number

I need to find the text of all the one-digit number.
My code:
$string = 'text 4 78 text 558 my.name#gmail.com 5 text 78998 text';
$pattern = '/ [\d]{1} /';
(result: 4 and 5)
Everything works perfectly, just wanted to ask it is correct to use spaces?
Maybe there is some other way to distinguish one-digit number.
Thanks
First of all, [\d]{1} is equivalent to \d.
As for your question, it would be better to use a zero width assertion like a lookbehind/lookahead or word boundary (\b). Otherwise you will not match consecutive single digits because the leading space of the second digit will be matched as the trailing space of the first digit (and overlapping matches won't be found).
Here is how I would write this:
(?<!\S)\d(?!\S)
This means "match a digit only if there is not a non-whitespace character before it, and there is not a non-whitespace character after it".
I used the double negative like (?!\S) instead of (?=\s) so that you will also match single digits that are at the beginning or end of the string.
I prefer this over \b\d\b for your example because it looks like you really only want to match when the digit is surrounded by spaces, and \b\d\b would match the 4 and the 5 in a string like 192.168.4.5
To allow punctuation at the end, you could use the following:
(?<!\S)\d(?![^\s.,?!])
Add any additional punctuation characters that you want to allow after the digit to the character class (inside of the square brackets, but make sure it is after the ^).
Use word boundaries. Note that the range quantifier {1} (a single \d will only match one digit) and the character class [] is redundant because it only consists of one character.
\b\d\b
Search around word boundaries:
\b\d\b
As explained by the others, this will extract single digits meaning that some special characters might not be respected like "." in an ip address. To address that, see F.J and Mike Brant's answer(s).
It really depends on where the numbers can appear and whether you care if they are adjacent to other characters (like . at the end of a sentence). At the very least, I would use word boundaries so that you can get numbers at the beginning and end of the input string:
$pattern = '/\b\d\b/';
But you might consider punctuation at the end like:
$pattern = '/\b\d(\b|\.|\?|\!)/';
If one-digit numbers can be preceded or followed by characters other than digits (e.g., "a1 cat" or "Call agent 7, pronto!") use
(?<!\d)\d(?!\d)
Demo
The regular expression reads, match a digit (\d) that is neither preceded nor followed by digit, (?<!\d) being a negative lookbehind and (?!\d) being a negative lookahead.