Limitation of spaces in regex string - regex

Total regex newbie here, and I have been all over the place trying to find an answer. I need to match exactly 1 space followed by a string of letter characters (min 3, max 30). I have the following, but it accepts more than 1 space, which is the problem:
^[:blank:][A-z]{3,30}$
Any help with this would be great

[A-z] will also capture [, \, ], ^, _, `.
Use this regex to allow exactly 1 space in the beginning and then 3 to 30 English letters:
^[[:blank:]][a-zA-Z]{3,30}$
See demo.
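A minimal Python sketch of the difference between [A-z] and [a-zA-Z]. Python's re module has no POSIX classes such as [[:blank:]], so the sketch assumes [ \t] (space or tab) as the equivalent. The names LOOSE, STRICT and accepts are illustrative only.

```python
import re

# [A-z] spans ASCII 65-122, which includes [ \ ] ^ _ ` between 'Z' and 'a'.
LOOSE = re.compile(r"[ \t][A-z]{3,30}")
# Assumed equivalent of ^[[:blank:]][a-zA-Z]{3,30}$ (re has no POSIX classes).
STRICT = re.compile(r"[ \t][a-zA-Z]{3,30}")


def accepts(pattern: re.Pattern, text: str) -> bool:
    """Return True when the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


print(accepts(LOOSE, " ab_c"))   # True: '_' slips through [A-z]
print(accepts(STRICT, " ab_c"))  # False
print(accepts(STRICT, " abc"))   # True
print(accepts(STRICT, "  abc"))  # False: two leading spaces
```

fullmatch stands in for the ^ and $ anchors, so a trailing newline is not silently accepted.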

To be unicode compatible:
^\p{Zs}\p{L}{3,30}$
Where \p{Zs} stands for a space character
and \p{L} stands for a letter.
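Python's built-in re module does not understand \p{Zs} or \p{L}, so the following is only a sketch of the same check done with unicodedata instead of a regex. The function name and its parameters are hypothetical.

```python
import unicodedata


def space_then_letters(text: str, min_len: int = 3, max_len: int = 30) -> bool:
    """Check for one Zs character followed by min_len..max_len letters."""
    if not text:
        return False
    first, rest = text[0], text[1:]
    # "Zs" is the Unicode general category that \p{Zs} refers to.
    if unicodedata.category(first) != "Zs":
        return False
    if not (min_len <= len(rest) <= max_len):
        return False
    # General categories starting with "L" are the ones \p{L} covers.
    return all(unicodedata.category(ch).startswith("L") for ch in rest)


print(space_then_letters(" \u00e9t\u00e9"))  # True: space + three letters
print(space_then_letters("\u00a0abc"))       # True: no-break space is Zs
print(space_then_letters("\tabc"))           # False: tab is not Zs
```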


Remove any non-digit only in first N characters

I'm looking for a regular expression to catch all digits in the first 7 characters in a string.
This string has 12 characters:
A12B345CD678
I would like to remove A and B only since they are within the first 7 chars (A12B345) and get
12345CD678
So, the CD678 should not be touched. My current solution in R:
paste(paste(str_extract_all(substr("A12B345CD678",1,7), "[0-9]+")[[1]],collapse=""),substr("A12B345CD678",8,nchar("A12B345CD678")),sep="")
It seems too complicated. I split the string at position 7 as described, match any digits in the first 7 characters, and bind the result with the rest of the string.
I'm looking for a more general answer.
Any help appreciated.
You can use the known SKIP-FAIL regex trick to match all the rest of the string beginning with the 8th character, and only match non-digit characters within the first 7 with a lookbehind:
s <- "A12B345CD678"
gsub("(?<=.{7}).*$(*SKIP)(*F)|\\D", "", s, perl=T)
## => [1] "12345CD678"
The perl=T is required for this regex to work. The regex breakdown:
(?<=.{7}).*$(*SKIP)(*F) - matches any character but a newline (add (?s) at the beginning if you have newline symbols in the input), as many as possible (.*) up to the end ($, also \\z might be required to remove final newlines), but only if preceded with 7 characters (this is set by the lookbehind (?<=.{7})). The (*SKIP)(*F) verbs make the engine omit the whole matched text and advance the regex index to the position at the end of that text.
| - or...
\\D - a non-digit character.
The regex solution is cool, but I'd use something easier to read for maintainability. E.g.
library(stringr)
s <- "A12B345CD678"
# Remove every non-digit from the first 7 characters only
str_sub(s, 1, 7) <- gsub("\\D", "", str_sub(s, 1, 7))
You can also use a simple negative lookbehind:
s <- "A12B345CD678"
gsub("(?<!.{7})\\D", "", s, perl=T)
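For comparison, a rough Python sketch of the same two ideas: slicing off the first N characters, and a fixed-width negative lookbehind. Python's re module has no (*SKIP)(*F) verbs, so that variant is not reproduced. The function names are hypothetical, and the lookbehind version assumes no newline within the first N characters, because . does not match a newline by default.

```python
import re


def strip_nondigits_prefix(text: str, n: int = 7) -> str:
    """Remove non-digits from the first n characters only (slicing version)."""
    return re.sub(r"\D", "", text[:n]) + text[n:]


def strip_nondigits_lookbehind(text: str, n: int = 7) -> str:
    """Same result with a fixed-width negative lookbehind."""
    # (?<!.{n}) stops succeeding once n characters precede the current position.
    return re.sub(r"(?<!.{%d})\D" % n, "", text)


print(strip_nondigits_prefix("A12B345CD678"))      # 12345CD678
print(strip_nondigits_lookbehind("A12B345CD678"))  # 12345CD678
```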

Regex for {!Customobject_relateobject.name}

I don't know regex. Can you please help me write a regex for
{!Customobject_relateobject.name}
The string "Customobject_relateobject.name" can contain "_" and "." only in the middle of a word, never as the first or last character.
"{!" and "}" are mandatory.
Thanks in advance.
You can use the following regex:
\{![a-zA-Z0-9_.]*}
The regex means:
\{! - matches {! literally
[a-zA-Z0-9_.]* - 0 or more (due to *) characters that are lower- or uppercase Latin letters, digits from 0 to 9, underscore or dot
} - literal }.
Use ^\{![a-zA-Z0-9](?:[a-zA-Z0-9._]*[a-zA-Z0-9])?\}$ if empty strings like {!} are not allowed and the text inside the braces must start and end with a Latin letter or digit.
I guess the word can't end with '.' or '_' or have any digit in it. So this regex will give you what you want:
\{!(([a-zA-Z]+(_|\.)?)+[a-zA-Z]+)\}
If you want to allow digits, use this regex:
\{!(([a-zA-Z0-9]+(_|\.)?)+[a-zA-Z0-9]+)\}
Don't use '\w', because it also matches '_', so two separators could end up next to each other.
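A small Python sketch of one way to express the rule, under the assumptions that digits are allowed and that "_" and "." may only appear between alphanumeric runs (so never first, last, or doubled). The pattern avoids nested optional quantifiers; MERGE_FIELD and is_merge_field are made-up names.

```python
import re

# Assumption: letters and digits are allowed; '_' and '.' only between them.
MERGE_FIELD = re.compile(r"\{![A-Za-z0-9]+(?:[._][A-Za-z0-9]+)*\}")


def is_merge_field(text: str) -> bool:
    """Return True when the whole string has the {!name} shape."""
    return MERGE_FIELD.fullmatch(text) is not None


print(is_merge_field("{!Customobject_relateobject.name}"))  # True
print(is_merge_field("{!_name}"))                            # False
print(is_merge_field("{!name.}"))                            # False
print(is_merge_field("{!}"))                                 # False
```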

RegEx - 1 to 10 Alphanumeric Spaces Okay

New to Regular Expressions. Thanks in advance!
Need to validate field is 1-10 mixed-case alphanumeric and spaces are allowed. First character must be alphanumeric, not space.
Good Examples:
"Larry King"
"L King1"
"1larryking"
"L"
Bad Example:
" LarryKing"
This is what I have and it does work as long as the data is exactly 10 characters. The problem is that it does not allow less than 10 characters.
[0-9a-zA-Z][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ]
I've read and tried many different things but am just not getting it.
Thank you,
Justin
I don't know what environment and engine you are using, so I assume PCRE (typical for PHP).
This small regex does exactly what you want: ^(?i)(?!\s)[a-z\d ]{1,10}$
What's going on?!
the ^ marks the start of the string (delete it if the expression need not match the whole string)
the (?i) tells the engine to be case insensitive, so there's no need to write every letter in both lower and upper case later in the expression
the (?!\s) ensures the following char won't be a white space (\s) (it's a so-called negative lookahead)
the [a-z\d ]{1,10} matches any letter (a-z), any digit (\d) and spaces ( ) in a row with min 1 and max 10 occurrences ({1,10})
the $ at the end marks the end of the string (delete it if the expression need not match the whole string)
Try this: [0-9a-zA-Z][0-9a-zA-Z ]{0,9}
The {x,y} syntax means between x and y times inclusive. {x,} means at least x times.
You want something like this.
[a-zA-Z0-9][a-zA-Z0-9 ]{0,9}
The first part ensures that the first character is alphanumeric. The second part matches an alphanumeric character or a space, and {0,9} allows anywhere from 0 to 9 occurrences of it. Together that gives you 1-10 characters.
Try this: ^[a-zA-Z0-9][a-zA-Z0-9 ]*
An alphanumeric (and therefore non-space) first character, then zero or more alphanumeric characters or spaces. It doesn't cap the length at 10 characters, so it matches any string of 1-10 characters but also longer ones.
The below is probably most semantically correct:
(?=^[0-9a-zA-Z])(?=.*[0-9a-zA-Z]$)^[0-9a-zA-Z ]{1,10}$
It asserts that the first and last characters are alphanumeric and that the entire string is 1 to 10 characters in length (including spaces).
I assume that the space is not allowed at the end too.
^[a-zA-Z0-9](?:[a-zA-Z0-9 ]{0,8}[a-zA-Z0-9])?$
or with posix character classes:
^[[:alnum:]](?:[[:alnum:] ]{0,8}[[:alnum:]])?$
I think the simplest way is to go with \w[\s\w]{0,9}
Note that \w stands for [A-Za-z0-9_], so replace it with [A-Za-z0-9] if you don't want _
Note that \s stands for any whitespace character, so replace it with a literal space if you don't want the others
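To compare the variants above against the question's examples, here is a minimal Python sketch. It assumes the whole field must match (hence fullmatch rather than explicit anchors). FIRST_ALNUM, NO_EDGE_SPACE and is_valid are illustrative names.

```python
import re

# First character alphanumeric, then up to 9 alphanumerics or spaces.
FIRST_ALNUM = re.compile(r"[0-9A-Za-z][0-9A-Za-z ]{0,9}")
# Stricter variant: the last character must be alphanumeric as well.
NO_EDGE_SPACE = re.compile(r"[0-9A-Za-z](?:[0-9A-Za-z ]{0,8}[0-9A-Za-z])?")


def is_valid(pattern: re.Pattern, text: str) -> bool:
    """Return True when the entire string matches the pattern."""
    return pattern.fullmatch(text) is not None


for sample in ["Larry King", "L King1", "1larryking", "L", " LarryKing"]:
    print(repr(sample), is_valid(FIRST_ALNUM, sample))

# A trailing space separates the two variants.
print(is_valid(FIRST_ALNUM, "L King "), is_valid(NO_EDGE_SPACE, "L King "))
```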

Match substring from beginning that must end with a certain string and has a certain maximum length

Let's assume a valid string consists of some sentences and has a maximum length of 10. A sentence ends with a dot and at least one whitespace character.
lol. omg rofl. => lol.
lol. omg. rofl. => lol. omg.
lol. => lol.
lol omg rofl. => no match
Any ideas?
/^.{0,8}\. /
Explanation:
^ matches start of string
.{0,8} matches up to 8 chars (10 minus the 2 specified chars); the lower bound is written out because some engines treat {,8} as literal text
\. matches literal dot and a space
Edit: Oh, I missed the sentence contains at least 1 space. Hmm, let me think …
By looking at https://stackoverflow.com/a/1839379/498634 I think the following might work:
/^(?!.{11,}).* .*\. /
^ start of string
(?!.{11,}) negative look-ahead to exclude strings longer than 10
.* .* any sequence with at least one space
\. literal dot and space
What about this? (Assuming your regex engine does support look aheads)
^.{0,9}\.(?= |$)
Matches 0 to 9 characters from the start of the string with a . as last character and requires a space or the end of the row/string to follow the dot.
I assumed the space after the dot does not count into the sentence length.
The (?= |$) is a positive lookahead, it ensures that a space or the end of the row/string is following, but does not match it.
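The lookahead version can be tried against the question's examples with a short Python sketch. It keeps the answer's assumption that the space after the dot does not count toward the length. leading_sentences is a hypothetical helper name.

```python
import re
from typing import Optional

# Up to 9 characters, then a dot that is followed by a space or the end.
LEADING_SENTENCES = re.compile(r".{0,9}\.(?= |$)")


def leading_sentences(text: str) -> Optional[str]:
    """Return the longest leading run of sentences within the limit, or None."""
    match = LEADING_SENTENCES.match(text)  # match() anchors at the start
    return match.group(0) if match else None


print(leading_sentences("lol. omg rofl."))   # lol.
print(leading_sentences("lol. omg. rofl."))  # lol. omg.
print(leading_sentences("lol."))             # lol.
print(leading_sentences("lol omg rofl."))    # None
```

Because .{0,9} is greedy, the engine backtracks from the longest candidate, so the last dot that fits within the limit wins.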

Regex, every non-alphanumeric character except white space or colon

How can I do this one anywhere?
Basically, I am trying to match all kinds of miscellaneous characters such as ampersands, semicolons, dollar signs, etc.
[^a-zA-Z\d\s:]
\d - numeric class
\s - whitespace
a-zA-Z - matches all the letters
^ - negates them all, so you get characters that are not letters, not digits, not whitespace and not colons
This should do it:
[^a-zA-Z\d\s:]
If you want to treat accented latin characters (eg. à Ñ) as normal letters (ie. avoid matching them too), you'll also need to include the appropriate Unicode range (\u00C0-\u00FF) in your regex, so it would look like this:
/[^a-zA-Z\d\s:\u00C0-\u00FF]/g
^ negates what follows
a-zA-Z matches upper and lower case letters
\d matches digits
\s matches white space (if you only want to match spaces, replace this with a space)
: matches a colon
\u00C0-\u00FF matches the Unicode range for accented latin characters.
nb. Unicode range matching might not work for all regex engines, but the above certainly works in JavaScript (as seen in this pen on Codepen).
nb2. If you're not bothered about matching underscores, you could replace a-zA-Z\d with \w, which matches letters, digits, and underscores.
Try this:
[^a-zA-Z0-9 :]
JavaScript example:
"!##$%* ABC def:123".replace(/[^a-zA-Z0-9 :]/g, ".")
See an online example:
http://jsfiddle.net/vhMy8/
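The same replacement can be sketched in Python, assuming an ASCII-only definition of "alphanumeric" and \s for whitespace. MISC is an illustrative name.

```python
import re

# Anything that is not an ASCII letter, a digit, whitespace, or a colon.
MISC = re.compile(r"[^a-zA-Z0-9\s:]")

print(MISC.sub(".", "!##$%* ABC def:123"))  # ...... ABC def:123
print(MISC.findall("a&b; c$d: e_f"))        # ['&', ';', '$', '_']
```

Note that the underscore is reported here, since it is neither a letter nor a digit.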
In JavaScript:
/[^\w_]/g
^ negation, i.e. select anything not in the following set
\w any word character (i.e. any alphanumeric character, plus underscore)
_ the underscore, listed explicitly; since \w already counts it as a 'word' character, it is redundant here
Usage example - const nonAlphaNumericChars = /[^\w_]/g;
Matches anything that is not alphanumeric or white space, plus '_':
var reg = /[^\w\s]|_/g;
If you mean "non-alphanumeric characters", try to use this:
var reg =/[^a-zA-Z0-9]/g //[^abc]
This regex works for C#, PCRE and Go to name a few.
It doesn't work for JavaScript on Chrome from what RegexBuddy says. But there's already an example for that here.
The main part of this is:
\p{L}
which (also written \p{Letter}) represents any kind of letter from any language.
The full regex itself: [^\w\d\s:\p{L}]
Example: https://regex101.com/r/K59PrA/2
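Python's built-in re module does not support \p{L}, but for str patterns its \w already covers Unicode letters and digits (plus the underscore). A rough sketch of a comparable class, with the underscore added back as a match on the assumption that it should count as non-alphanumeric:

```python
import re

# \w on str patterns is Unicode-aware in Python 3; '_' is handled separately.
NON_ALNUM = re.compile(r"[^\w\s:]|_")

# "\u00e0" is a-grave and "\u00d1" is N-tilde; neither is reported.
print(NON_ALNUM.findall("\u00e0 \u00d1: a_b & c;"))  # ['_', '&', ';']
```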
Previous solutions only seem reasonable for English or other Latin-based languages without accents, etc. Those answers are for that reason not generalised to answer the question.
According to the Whitespace character article on Wikipedia, these are all the whitespace characters in Unicode:
U+0009, U+000A, U+000B, U+000C, U+000D, U+0020, U+0085, U+00A0, U+1680, U+180E, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2007, U+2008, U+2009, U+200A, U+200B, U+200C, U+200D, U+2028, U+2029, U+202F, U+205F, U+2060, U+3000, U+FEFF
So in my opinion, the most inclusive solution would be (might be slow, but this is about accuracy):
\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF
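Thus, to answer OP's question about "every non-alphanumeric character except white space or colon", prepend a hat ^ to exclude the characters above, add the colon, and surround everything with [ and ] to mean 'any character not in this set'. As written below, the class still matches letters and digits; add the alphanumeric ranges you need (for example a-zA-Z0-9) inside the brackets to exclude them too: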
"[^:\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF]"
Bonus: solution for R
trimws2 <- function(..., whitespace = "[\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF]") {
  trimws(..., whitespace = whitespace)
}
This is even faster than trimws() itself which sets " \t\n\r".
microbenchmark::microbenchmark(trimws2(" \t\r\n"), trimws(" \t\r\n"))
#> Unit: microseconds
#>                   expr    min     lq     mean  median      uq     max neval cld
#> trimws2(" \\t\\r\\n") 29.177 29.875 31.94345 30.4990 31.3895 105.642   100   a
#>  trimws(" \\t\\r\\n") 45.811 46.630 48.25076 47.2545 48.2765 116.571   100   b
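The explicit whitespace list from this answer can also be turned into a negated class in Python. This is only a sketch: the code points are copied from the list in the answer, and ASCII letters and digits are added to the class as an assumption, so that only "miscellaneous" characters match. All names are illustrative.

```python
import re

# Code points copied from the whitespace list quoted in the answer.
WHITESPACE_CODEPOINTS = [
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x0085, 0x00A0,
    0x1680, 0x180E, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x200B, 0x200C, 0x200D,
    0x2028, 0x2029, 0x202F, 0x205F, 0x2060, 0x3000, 0xFEFF,
]
WHITESPACE = "".join(chr(cp) for cp in WHITESPACE_CODEPOINTS)

# Assumption: ASCII alphanumerics are excluded too, along with the colon.
# None of the whitespace characters is special inside a character class.
MISC = re.compile("[^0-9A-Za-z:" + WHITESPACE + "]")

print(MISC.findall("a b:\u00a0c&d;\u3000e"))  # ['&', ';']
```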
Try to add this:
^[^a-zA-Z\d\s:]*$
This has worked for me... :)
[^\w\s:]
A character set matching any character that is not:
Alphanumeric (or underscore, which \w also covers)
Whitespace
Colon