Regular Expression - Only match Alphanumerics and a SINGLE whitespace between words - regex

I'm new to Regular Expressions...
I've been asked a regular expression that accepts Alphanumerics, a few characters more, and only ONE whitespace between words.
For example :
This should match :
"Hello world"
This shouldn't :
"Hello world"
Any ideas?
This was my expression:
[\w':''.'')''(''\[''\]''{''}''-''_']+$
I already tried the \s? (the space character once or never - right? ) but I didn't get it to work.

Using Oniguruma regex syntax, you could do something like:
^[\w\.:\(\)\[\]{}\-_](?: ?[\w\.:\(\)\[\]{}\-_])*$
Assuming that the 'other characters' are . : () [] {} - _
This regex will match a string that must begin and end with a word character or one of the other allowed characters and cannot have more than one space in a row.
If you're using the x flag (ignore whitespace in regular expression) you'll need to do this instead:
^[\w\.:\(\)\[\]{}\-_](?:\ ?[\w\.:\(\)\[\]{}\-_])*$
The only difference is the \ in front of the space.

What about:
^[\w\.:\(\)\[\]{}\-]+( [\w\.:\(\)\[\]{}\-]+)*$
Matches:
^[\w\.:\(\)\[\]{}\-]+: line begins with 1 or more acceptable characters (underscore is included in \w).
( [\w\.:\(\)\[\]{}\-]+): look to include a single separator character and 1 or more acceptable characters.
*$: repeat single separator and word 0 or more times.
Tested:
Hello(space)World: TRUE
Hello(space)(space)World: FALSE
Hello: TRUE
Hello(space): FALSE
Hello(tab)World: FALSE

Related

Regex to extract the characters [duplicate]

I have a text like this;
[Some Text][1][Some Text][2][Some Text][3][Some Text][4]
I want to match [Some Text][2] with this regex;
/\[.*?\]\[2\]/
But it returns [Some Text][1][Some Text][2]
How can i match only [Some Text][2]?
Note : There can be any character in Some Text including [ and ] And the numbers in square brackets can be any number not only 1 and 2. The Some Text that i want to match can be at the beginning of the line and there can be multiple Some Texts
JSFiddle
The \[.*?\]\[2\] pattern works like this:
\[ - finds the leftmost [ (as the regex engine processes the string input from left to right)
.*? - matches any 0+ chars other than line break chars, as few as possible, but as many as needed for a successful match, as there are subsequent patterns, see below
\]\[2\] - ][2] substring.
So, the .*? gets expanded upon each failure until it finds the leftmost ][2]. Note the lazy quantifiers do not guarantee the "shortest" matches.
Solution
Instead of a .*? (or .*) use negated character classes that match any char but the boundary char.
\[[^\]\[]*\]\[2\]
See this regex demo.
Here, .*? is replaced with [^\]\[]* - 0 or more chars other than ] and [.
Other examples:
Strings between angle brackets: <[^<>]*> matches <...> with no < and > inside
Strings between parentheses: \([^()]*\) matches (...) with no ( and ) inside
Strings between double quotation marks: "[^"]*" matches "..." with no " inside
Strings between curly braces: \{[^{}]*} matches "..." with no " inside
In other situations, when the starting pattern is a multichar string or complex pattern, use a tempered greedy token, (?:(?!start).)*?. To match abc 1 def in abc 0 abc 1 def, use abc(?:(?!abc).)*?def.
You could try the below regex,
(?!^)(\[[A-Z].*?\]\[\d+\])
DEMO

A regular expression for matching a group followed by a specific character

So I need to match the following:
1.2.
3.4.5.
5.6.7.10
((\d+)\.(\d+)\.((\d+)\.)*) will do fine for the very first line, but the problem is: there could be many lines: could be one or more than one.
\n will only appear if there are more than one lines.
In string version, I get it like this: "1.2.\n3.4.5.\n1.2."
So my issue is: if there is only one line, \n needs not to be at the end, but if there are more than one lines, \n needs be there at the end for each line except the very last.
Here is the pattern I suggest:
^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$
Demo
Here is a brief explanation of the pattern:
^ from the start of the string
\d+ match a number
(?:\.\d+)* followed by dot, and another number, zero or more times
\.? followed by an optional trailing dot
(?:\n followed by a newline
\d+(?:\.\d+)*\.?)* and another path sequence, zero or more times
$ end of the string
You might check if there is a newline at the end using a positive lookahead (?=.*\n):
(?=.*\n)(\d+)\.(\d+)\.((\d+)\.)*
See a regex demo
Edit
You could use an alternation to either match when on the next line there is the same pattern following, or match the pattern when not followed by a newline.
^(?:\d+\.\d+\.(?:\d+\.)*(?=.*\n\d+\.\d+\.)|\d+\.\d+\.(?:\d+\.)*(?!.*\n))
Regex demo
^ Start of string
(?: Non capturing group
\d+\.\d+\. Match 2 times a digit and a dot
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?=.*\n\d+\.\d+\.) Positive lookahead, assert what follows a a newline starting with the pattern
| Or
\d+\.\d+\. Match 2 times a digit and a dot
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
*(?!.*\n) Negative lookahead, assert what follows is not a newline
) Close non capturing group
(\d+\.*)+\n* will match the text you provided. If you need to make sure the final line also ends with a . then (\d+\.)+\n* will work.
Most programming languages offer the m flag. Which is the multiline modifier. Enabling this would let $ match at the end of lines and end of string.
The solution below only appends the $ to your current regex and sets the m flag. This may vary depending on your programming language.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
regex = /((\d+)\.(\d+)\.((\d+)\.)*)$/gm,
match;
while (match = regex.exec(text)) {
console.log(match);
}
You could simplify the regex to /(\d+\.){2,}$/gm, then split the full match based on the dot character to get all the different numbers. I've given a JavaScript example below, but getting a substring and splitting a string are pretty basic operations in most languages.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
regex = /(\d+\.){2,}$/gm;
/* Slice is used to drop the dot at the end, otherwise resulting in
* an empty string on split.
*
* "1.2.3.".split(".") //=> ["1", "2", "3", ""]
* "1.2.3.".slice(0, -1) //=> "1.2.3"
* "1.2.3".split(".") //=> ["1", "2", "3"]
*/
console.log(
text.match(regex)
.map(match => match.slice(0, -1).split("."))
);
For more info about regex flags/modifiers have a look at: Regular Expression Reference: Mode Modifiers

How can I check it with regular Expression?

I have a long input string that contains certain field names in-bedded in it. For instance:
SELECT some-name, some-name FROM [some-table] WHERE [some-column] = 'some-value'
The actual field name may change, but it is always in the form of word-word. I need to perform a regex replace on the string so that the output will look like this:
SELECT some - name, some - name FROM [some-table] WHERE [some-column] = 'some - value'
In other words, when the field name is enclosed in square-brackets, it should be left untouched, but when it is not, spaces should be inserted on either side of the dash. There are no nested square brackets and the reserved word could be one or more in the string.
You can do this:
Regex.Replace(input, "(?<!\[[^-\]]*)(\w+)-(\w+)(?![^-\]]*\])", "$1 - $2")
Here's an explanation of the pattern:
(?<!\[[^-\]]*) - This is a negative look-behind. It asserts that matches cannot be immediately preceded by text that matches the sub-pattern \[[^-\]]*. In other words, the matches we are looking for cannot be preceded by a [ character followed by any number of characters that are not a - or a ].
(\w+)-(\w+) - Matches one or more word-characters, then a dash, and then one or more word characters following the dash. By enclosing the sub-patterns on either side of the dash in capturing groups, we can then refer to their values as $1 and $2 in the replacement pattern.
(?![^-\]]*\]) - This is a negative look-ahead. Similar to the negative look-behind, it asserts that matches cannot be immediately followed by text which matches the sub pattern [^-\]]*\]. In other words, a match cannot be followed by any number of characters that are not a - or a ] and then a closing ].
See a demo.
At first glance, you might assume that you could simply assert that is must not be immediately preceded by a [ character and that it must not be immediately followed by a ] character. In other words, (?<!\[)(\w+)-(\w+)(?!\]). However, that pattern would still match the text ome-nam in the input [some-name] because the text ome-nam is not immediately preceded or followed by the brackets.
Dim regex As Regex = New Regex("\[[^-]*-[^-]*\]")
Dim match As Match = regex.Match("A long string containing square brackets [some-name]")
If match.Success Then
Console.WriteLine(match.Value)
End If
Or you could use Regex.IsMatch:
Return Regex.IsMatch("A long string containing square brackets [some-name]",
"\[[^-]*-[^-]*\]")
You may match and capture the [...] substrings and then only match hyphens that are not surrounded with hyphens to replace them:
Dim nStr As String = "SELECT 'some-name' FROM [some-name]"
Dim nResult = Regex.Replace(nStr, "(\[.+?])|\s*-\s*", New MatchEvaluator(Function(m As Match)
If m.Groups(1).Success Then
Return m.Groups(1).Value
Else
Return " - "
End If
End Function))
So, what is happening is:
(\[[^]]+]) - matches and stores the value of [...] substring inside the Group(1) buffer (or \[.+?] can be used here to match a [, then 1 or more any characters and then ] - with RegexOptions.Singleline flag so that . could match a newline, too)
(?<!\s)-(?!\s) - matches any hyphen not preceded ((?<!\s)) or followed ((?!\s)) with whitespace (\s). Actually, we may even use \s*-\s* (where \s* stands for zero or more whitespaces as many as possible since * is a greedy quantifier matching zero or more occurrences of the quantified subpattern) here to remove any whitespace there is to make sure we just insert 1 space before and after -.
If Group 1 matches, then we just re-insert it (Return m.Groups(1).Value), else we insert the space-enclosed hyphen Return " - ".
Just to check if it exists, you could try
\[[^\]]+-[^\]]+\]
It matches a literal [ and then any characters, except ], up to (including) a hyphen. Then again any characters, except ], up to a literal ].
See it here at regex101.
Actually I don't know the vb.net syntax but you can use regex as
/[\s\'](\w+)\-(\w+)/g
find the (\w+)-(\w+) which is followed by space or ' and replace your string with capture group 1st - 2nd
See the sample here

regex to match entire words containing only certain characters

I want to match entire words (or strings really) that containing only defined characters.
For example if the letters are d, o, g:
dog = match
god = match
ogd = match
dogs = no match (because the string also has an "s" which is not defined)
gods = no match
doog = match
gd = match
In this sentence:
dog god ogd, dogs o
...I would expect to match on dog, god, and o (not ogd, because of the comma or dogs due to the s)
This should work for you
\b[dog]+\b(?![,])
Explanation
r"""
\b # Assert position at a word boundary
[dog] # Match a single character present in the list “dog”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
[,] # Match the character “,”
)
"""
The following regex represents one or more occurrences of the three characters you're looking for:
[dog]+
Explanation:
The square brackets mean: "any of the enclosed characters".
The plus sign means: "one or more occurrences of the previous expression"
This would be the exact same thing:
[ogd]+
Which regex flavor/tool are you using? (e.g. JavaScript, .NET, Notepad++, etc.) If it's one that supports lookahead and lookbehind, you can do this:
(?<!\S)[dog]+(?!\S)
This way, you'll only get matches that are either at the beginning of the string or preceded by whitespace, or at the end of the string or followed by whitespace. If you can't use lookbehind (for example, if you're using JavaScript) you can spell out the leading condition:
(?:^|\s)([dog]+)(?!\S)
In this case you would retrieve the matched word from group #1. But don't take the next step and try to replace the lookahead with (?:$|\s). If you did that, the first hit ("dog") would consume the trailing space, and the regex wouldn't be able to use it to match the next word ("god").
Depending on the language, this should do what you need it to do. It will only match what you said above;
this regex:
[dog]+(?![\w,])
in a string of ..
dog god ogd, dogs o
will only match..
dog, god, and o
Example in javascript
Example in php
Anything between two [](brackets) is a character class.. it will match any character between the brackets. You can also use ranges.. [0-9], [a-z], etc, but it will only match 1 character. The + and * are quantifiers.. the + searches for 1 or more characters, while the * searches for zero or more characters. You can specify an explicit character range with curly brackets({}), putting a digit or multiple digits in-between: {2} will match only 2 characters, while {1,3} will match 1 or 3.
Anything between () parenthesis can be used for callbacks, say you want to return or use the values returned as replacements in the string. The ?! is a negative lookahead, it won't match the character class after it, in order to ensure that strings with the characters are not matched when the characters are present.

Reg Ex question

What does the following reg ex code mean?
'/^\w{4,20}$/'
It means that string should contain from 4 to 20 word characters (letters, digits, and underscores). Here:
^ (caret) matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well
$ (dollar) matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break
\w shorthand character class matching word characters (letters, digits, and underscores). Can be used inside and outside character classes.
{n,m} where n >= 0 and m >= n Repeats the previous item between n and m times. Greedy, so repeating m times is tried before reducing the repetition to n times
Let me show you a usage example. Say, we have the file with the following contents:
[spongebob#conductor /tmp]$ cat file.txt
between4and20
therearetoomanyalphanumcharacters
foo
okay
Now you want to get only those strings which match your pattern '/^\w{4,20}$/':
[spongebob#conductor /tmp]$ grep -E '^\w{4,20}$' blah
between4and20
okay
On output you see only those lines, which fulfil your regular expression.
Ah, also, don't confuse ^ (caret) with ^ immediately after the opening [, the latter negates the character class, causing it to match a single character not listed in the character class. (Specifies a caret if placed anywhere except after the opening [), for example [^a-d] matches x (any character except a, b, c or d).
It means:
^ Between the beginning,
$ and the end of a given string,
\w{4,20} there should be only 4-20 Alphanumeric characters (like
a,b,c,d,1,2,3...etc, and also _)
I think you'll find Wikipedia's page on Regular Expressions a big, big help while learning regexes.
And just so there is no confusion, ^ and $ don't necessarily need each other,
If the regex was:
'/^\w{4,20}/'
That'd mean: The match should be at the start of the string, followed by 4-20 alphanumeric characters.
Example (match in bold): Foobar baz
And if the regex pattern was:
'/\w{4,20}$/'
That'd mean: The match should be at the end of the string, proceeded by 4-20 alpha-numeric characters
Example (match in bold): Foo barbaz
/ opening delimiter
^ = start of sting
\w = word character
{x,y} min max
$ = end of string
/end delimiter