I need to remove all non-numeric characters from a string. Because people use different thousands separators - and some people thought using commas for this purpose was perfectly fine and not at all easily confused with a decimal point (especially since some countries use commas for decimal points), the regex is shaping up to be annoying.
Here's my current attempt:
[^\d,.]|[.,](?=.*[.,])|(?<=,.*),
[^\d,.] match all non-numeric characters that aren't commas or dots (because they also use ' and ’ as separators, not just dots and commas)
[.,](?=.*[.,]) match all commas and dots that are still followed by a dot or comma
(?<=,.*), match all commas if we've previously seen a comma, already
(I'll probably have to split (2) into two cases later, but that's not the issue of this question.)
The purpose of (3) is that if the string contains multiple commas, we can safely assume that it's used as a thousands separator and not as a decimal point.
I.e.
123,456 should be interpreted as 123.456 (and therefore the , not match the regex)
123,456,789 should be interpreted as 123456789 (and therefore both commas match the regex)
Of course (?<=,.*), is not valid, because look-behinds need to be of fixed length and .* is not.
How do I match these pesky commas?
(The intention is to eventually feed the regex to a Java string replacement method.)
var sanitisedInput = rawInput.replaceAll(<regex>, "")
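Until a single pattern turns up, the three rules above can also be applied as separate steps. This is only a sketch, written in Python rather than Java. The function name `sanitise` and the last step, which turns a surviving comma into a dot, are assumptions made for illustration and not part of the original attempt.

```python
import re

def sanitise(raw: str) -> str:
    """Hypothetical helper applying rules (1) to (3) one after another."""
    # Rule (1): drop everything that is not a digit, comma or dot.
    s = re.sub(r"[^\d,.]", "", raw)
    # Rule (3): more than one comma means commas are thousands separators.
    if s.count(",") > 1:
        s = s.replace(",", "")
    # Rule (2): drop every separator that is still followed by another one.
    s = re.sub(r"[.,](?=.*[.,])", "", s)
    # Assumption: a single surviving comma is a decimal point.
    return s.replace(",", ".")

print(sanitise("123,456"))      # 123.456
print(sanitise("123,456,789"))  # 123456789
print(sanitise("1'234.56"))     # 1234.56
```

Counting the commas first sidesteps the need for a variable-length look-behind.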
The below regex pattern might help.
Pattern: (?:(?<=\.)|,)(\d+)(?:(?=\.)|,)
Replacement: \1
Demo: https://regex101.com/r/KtFX8S/2/
Explanation:
(?:(?<=\.)|,) - Match either a comma or a position that is preceded by a dot.
(?:(?=\.)|,) - Similarly, at the end, match either a comma or a position that is followed by a dot.
Use the captured group 1 in the replacement
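To see how the pattern from this answer behaves, here is a small Python sketch (the helper name `strip_separators` is made up for the example). Note that Python writes the group reference as `\1`, whereas Java's `replaceAll` expects `$1`.

```python
import re

# Pattern from the answer above, unchanged.
pattern = re.compile(r"(?:(?<=\.)|,)(\d+)(?:(?=\.)|,)")

def strip_separators(text: str) -> str:
    # Keep only the digits captured in group 1.
    return pattern.sub(r"\1", text)

print(strip_separators("123,456,789"))    # 123456789
print(strip_separators("123,456"))        # 123,456 (left alone)
print(strip_separators("1,234,567,890"))  # 1234567,890
```

The last line suggests that neighbouring groups compete for the same comma, so inputs with three or more separators may need a second pass.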
Need help coming up with a regex that only allows numbers, letters, empty string, or spaces.
^[_A-z0-9]*((-|\s)*[_A-z0-9])*$
This one is the closest I've found but it allows underscores and hyphen.
Only letters, numbers, space, or empty string?
Then 1 character class will do.
^[A-Za-z0-9 ]*$
^ : start of the string or line (depending on the flag)
[A-Za-z0-9 ]* : zero or more upper-case or lower-case letters, or digits, or spaces.
$ : end of the string or line (depending on the flag)
The A-z range contains more than just letters.
You can see that in the ASCII table.
And \s for whitespace also includes tabs or linebreaks (depending on the flag).
But if you also want those, then just use that instead of the space.
^[A-Za-z0-9\s]*$
Also, depending on the regex engine/dialect that your language/tool uses, you could use \p{L} for any unicode letter.
Since [A-Za-z] only includes the normal ascii letters.
Reference here
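A quick way to check both points is to list what the A-z range really covers and then try the corrected class. This is a sketch in Python; the variable names are illustrative only.

```python
import re

# Characters that the range A-z covers besides letters.
extras = [chr(c) for c in range(ord("A"), ord("z") + 1) if not chr(c).isalpha()]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

letters_digits_spaces = re.compile(r"^[A-Za-z0-9 ]*$")
too_wide = re.compile(r"^[A-z0-9 ]*$")

print(bool(letters_digits_spaces.match("Hello World 42")))  # True
print(bool(letters_digits_spaces.match("")))                # True
print(bool(letters_digits_spaces.match("snake_case")))      # False
print(bool(too_wide.match("snake_case")))                   # True
```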
Your regex is too complicated for what you need.
The first part is fine: you are allowing letters and numbers, and you could simply add the space character to it.
Then, if you use the * quantifier, which means zero or more, you also take care of your empty-string problem.
See here.
/^[a-z0-9 ]*$/gmi
Notice that I'm not using A-z like you were, because that translates to any character between A (octal 101) and z (octal 172) in ASCII. This means it also matches the characters in between (octal 133 to 140: [, \, ], ^, _ and `), which are neither digits nor letters. I've instead used a-z, which allows lowercase letters, together with the i flag, which tells the regex to ignore case.
You can also test more cases in this regex101
Matching only certain characters is equivalent to not matching any other character, so you could use the regex r = /[^a-z\d ]/i to determine if the string contains any character other than the ones permitted. In Ruby that would be implemented as follows.
"aBc d01e e$9" !~ r #=> false
"aBc d01e ex9" !~ r #=> true
In this situation there may not be much to choose between this approach and attempting to match /\A[a-z\d ]+\z/i, but in other situations the use of a negative match can simplify the regex considerably.
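For comparison, the same negative-match idea might look like this in Python (the function name `only_permitted` is made up for the example):

```python
import re

# Any character that is not a letter, a digit or a space.
forbidden = re.compile(r"[^a-z\d ]", re.IGNORECASE)

def only_permitted(text: str) -> bool:
    # True when no forbidden character is found anywhere in the text.
    return forbidden.search(text) is None

print(only_permitted("aBc d01e e$9"))  # False
print(only_permitted("aBc d01e ex9"))  # True
```

As with the Ruby version, an empty string passes, because it contains no forbidden character.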
I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result in That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature, which discards from the match everything that the regex has matched up to that point and starts a new match that will be returned. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
Also note that the pattern \w by default matches only ASCII letters, digits and the underscore, so if your documents contain text in a different script, those characters will be consumed by the \W, which is supposed to match the separators. If you want to capture text in non-Latin scripts, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text on one language and not on another language, you can modify it accordingly, by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}" which is what RapidMiner supports.
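Python's built-in re module has no \K, so the two-stage idea described in this answer can only be approximated there. The sketch below collects each pair directly instead of splitting on the separators that follow it; it illustrates the grouping and is not RapidMiner configuration.

```python
import re

text = "That was a good movie."

# Stage 1: take two "words" plus the separators between them as one token.
# (With \K the separators after each pair would be the split point instead.)
pairs = re.findall(r"\w+\W+\w+", text)
print(pairs)  # ['That was', 'a good']

# Stage 2: split every token on the separators to get the individual words.
words = [re.split(r"\W+", pair) for pair in pairs]
print(words)  # [['That', 'was'], ['a', 'good']]
```

The pairs do not overlap, and a leftover single word ("movie" here) is not returned by findall.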
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
This pattern matches each pair of adjacent words (note that the last word gets no match of its own, since nothing follows it to pair it with). The first word is in the first capture group, (\w+). Then a positive lookahead matches a sequence of non-word characters, \W+, followed by another run of word characters, \w+. Because this happens inside the lookahead (?=...), the second word is not "consumed" and can start the next match.
Here is a link to a demo on Regex101
Note that for each match, the first word is in capture group 1, and the separator plus the second word are in capture group 2.
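As a rough illustration in Python (a tokenizer that splits on a delimiter may behave differently), gluing the two groups back together yields the overlapping pairs from the question:

```python
import re

text = "That was a good movie."

# Group 1 is the first word; group 2 (inside the lookahead) is the
# separator plus the following word, which is not consumed.
bigrams = [m.group(1) + m.group(2)
           for m in re.finditer(r"(\w+)(?=(\W+\w+))", text)]
print(bigrams)  # ['That was', 'was a', 'a good', 'good movie']
```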
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)) inspired from this SO question.
My question sounds wrong once you understand that it is a problem of overlapping regex matches.
The main thing I am trying to do here is learn regex so that I have a better understanding of it. What I am trying to do is a find and replace using regex to remove only the commas that are within the numbers.
I can do this using multiple find/replace patterns, and I can also do this using a brute force method of matching a large number and ignoring commas, however I am wondering if there is some way to place the numbers and comma into a capture group but ignore the commas from output.
Here is an example of a list of numbers:
"7,033.00","0.00","7,033.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00",1,1,1,!!$,,"123,123,123.00","123,444,38.01"
So my 'brute-force' method is the following:
\"([0-9]+)[,]?([0-9]*)[,]?([0-9]*)[,]?([0-9]*[.]+[0-9]+)\"
This would account for any number up to 999,999,999,999.00. It contains the four capture groups $1$2$3$4 and will output any number I would expect in the format that I want.
Example of wanted output using a replace of $1$2$3$4:
7033.00,0.00,7033.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1,1,1,!!$,,123123123.00,12344438.01
What I would like to do is something like this (pseudo code):
[\"]([0-9]+)([(?:,)[0-9]*][.]+[0-9]+)[\"]
The idea behind this is:
Match the first quotation mark but ignore it
Match a group of numbers and place in capture group $1
Match either a number or comma followed by a period and one or more numbers and store in a capture group, but leave the commas out of the capture group.
Match the last quotation mark but ignore it
I've been reading and reading but can't seem to find a way to ignore part of a capture group the way I want to do it. Any suggestions or can it not be done?
A two step method would be to match the commas first then remove the quotes, which might work too:
(,)(?=([0-9]{2,3}[.,]))
Well, regexr uses ECMAScript regex, so you might use something like
"|([0-9]),(?=[0-9])(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
And replace with $1.
regexr demo
Otherwise, with PCRE, you might use something like:
"|(?<=[0-9]),(?=[0-9])(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
And replace with nothing. This version uses lookarounds to make sure that the comma in question is surrounded by [0-9] (ECMAScript did not support lookbehinds when this was written; they were added in ES2018).
regex101 demo
" matches a literal quote character.
| means OR, so the regex matches a " or a ([0-9]),(?=[0-9]) (or (?<=[0-9]),(?=[0-9]))
([0-9]) is a capture group to get one digit.
, matches a literal comma.
(?=[0-9]) is a positive lookahead and ensures that the comma is followed by a digit, without matching the digit itself.
(?<=[0-9]) is a positive lookbehind and ensures that the comma is preceded by a digit, again without matching the digit itself.
(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$) ensures that there are an odd number of quotes ahead, and this in turn means that this will match a comma only within quotes, assuming that there are no unbalanced or escaped quotes.
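Python's re module accepts the lookaround version as well, so the breakdown above can be tried on a shortened row. This is a sketch; the sample row is a cut-down version of the one in the question.

```python
import re

row = '"7,033.00","0.00",1,1,!!$,,"123,123,123.00"'

# Lookaround version from the answer: match every quote, plus every comma
# that sits between two digits and has an odd number of quotes after it.
pattern = re.compile(r'"|(?<=[0-9]),(?=[0-9])(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)')

print(pattern.sub("", row))  # 7033.00,0.00,1,1,!!$,,123123123.00
```

The unquoted 1,1 keeps its comma because an even number of quotes follows it.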
In two steps:
First remove all commas within quotes (i.e. commas that are followed by an odd number of quotes). This even works with escaped quotes, since in CSV files quotes are escaped by doubling:
>>> import re
>>> s = '"7,033.00","0.00","7,033.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00",1,1,1,!!$,,"123,123,123.00","123,444,38.01"'
>>> s = re.sub(r',(?!(?:[^"]*"[^"]*")*[^"]*$)', '', s)
>>> s
'"7033.00","0.00","7033.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00",1,1,1,!!$,,"123123123.00","12344438.01"'
Then remove all the quotes:
>>> s.replace('"', '')
'7033.00,0.00,7033.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1,1,1,!!$,,123123123.00,12344438.01'
I want a regular expression that prevents symbols and only allows letters and numbers. The regex below works great, but it doesn't allow for spaces between words.
^[a-zA-Z0-9_]*$
For example, when using this regular expression "HelloWorld" is fine, but "Hello World" does not match.
How can I tweak it to allow spaces?
tl;dr
Just add a space in your character class.
^[a-zA-Z0-9_ ]*$
Now, if you want to be strict...
The above isn't exactly correct. Due to the fact that * means zero or more, it would match all of the following cases that one would not usually mean to match:
An empty string, "".
A string comprised entirely of spaces, "      ".
A string that leads and / or trails with spaces, "   Hello World  ".
A string that contains multiple spaces in between words, "Hello   World".
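These cases are easy to confirm. A small Python check, with sample strings invented for the purpose:

```python
import re

loose = re.compile(r"^[a-zA-Z0-9_ ]*$")

# The four edge cases from the list above; all of them are accepted.
for candidate in ["", "   ", "  Hello World  ", "Hello    World"]:
    print(repr(candidate), bool(loose.match(candidate)))  # True every time
```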
Originally I didn't think such details were worth going into, as OP was asking such a basic question that it seemed strictness wasn't a concern. Now that the question's gained some popularity however, I want to say...
...use @stema's answer.
Which, in my flavor (without using \w) translates to:
^[a-zA-Z0-9_]+( [a-zA-Z0-9_]+)*$
(Please upvote @stema regardless.)
Some things to note about this (and @stema's) answer:
If you want to allow multiple spaces between words (say, if you'd like to allow accidental double-spaces, or if you're working with copy-pasted text from a PDF), then add a + after the space:
^\w+( +\w+)*$
If you want to allow tabs and newlines (whitespace characters), then replace the space with a \s+:
^\w+(\s+\w+)*$
Here I suggest the + by default because, for example, Windows linebreaks consist of two whitespace characters in sequence, \r\n, so you'll need the + to catch both.
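The difference between the three variants shows up on a handful of inputs. Below is a Python sketch; re.ASCII is added here so that \w means exactly [a-zA-Z0-9_], which is an assumption about the intended character set.

```python
import re

# re.ASCII keeps \w equal to [a-zA-Z0-9_]; without it Python's \w also
# matches non-ASCII letters and digits.
single_space = re.compile(r"^\w+( \w+)*$", re.ASCII)
many_spaces = re.compile(r"^\w+( +\w+)*$", re.ASCII)
any_whitespace = re.compile(r"^\w+(\s+\w+)*$", re.ASCII)

print(bool(single_space.match("Hello World")))       # True
print(bool(single_space.match("Hello  World")))      # False
print(bool(single_space.match(" Hello World ")))     # False
print(bool(many_spaces.match("Hello  World")))       # True
print(bool(many_spaces.match("Hello\tWorld")))       # False
print(bool(any_whitespace.match("Hello\r\nWorld")))  # True
```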
Still not working?
Check what dialect of regular expressions you're using.* In languages like Java you'll have to escape your backslashes, i.e. \\w and \\s. In older or more basic languages and utilities, like sed, \w and \s may not be defined, so write them out with character classes, e.g. [a-zA-Z0-9_] and [ \f\n\r\t\v], respectively.
* I know this question is tagged vb.net, but based on 25,000+ views, I'm guessing it's not only those folks who are coming across this question. Currently it's the first hit on google for the search phrase, regular expression space word.
One possibility would be to just add the space to your character class, as acheong87 suggested. Whether that is enough depends on how strict you want your pattern to be, because it would also allow a string starting with 5 spaces, or strings consisting only of spaces.
The other possibility is to define a pattern:
I will use \w, which in most regex flavours is the same as [a-zA-Z0-9_] (in some it is Unicode based):
^\w+( \w+)*$
This will allow a series of at least one word and the words are divided by spaces.
^ Match the start of the string
\w+ Match a series of at least one word character
( \w+)* is a group that is repeated 0 or more times. In the group it expects a space followed by a series of at least one word character
$ matches the end of the string
This one worked for me
([\w ]+)
Try with:
^(\w+ ?)*$
Explanation:
\w - alias for [a-zA-Z_0-9]
(space)? - allows a single space after each word, and makes it optional
I assume you don't want leading/trailing space. This means you have to split the regex into "first character", "stuff in the middle" and "last character":
^[a-zA-Z0-9_][a-zA-Z0-9_ ]*[a-zA-Z0-9_]$
or if you use a perl-like syntax:
^\w[\w ]*\w$
Also: if you intentionally worded your regex so that it also allows empty strings, you have to make the entire thing optional:
^(\w[\w ]*\w)?$
If you want to only allow single space chars, it looks a bit different:
^((\w+ )*\w+)?$
This matches 0..n words followed by a single space, plus one word without space. And makes the entire thing optional to allow empty strings.
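A few sample inputs make the differences between these patterns visible. This is a Python sketch with illustrative variable names; note that the first two patterns need at least two word characters, so a single-character string is rejected.

```python
import re

no_edge_spaces = re.compile(r"^\w[\w ]*\w$")
or_empty = re.compile(r"^(\w[\w ]*\w)?$")
single_spaces = re.compile(r"^((\w+ )*\w+)?$")

print(bool(no_edge_spaces.match("Hello World")))   # True
print(bool(no_edge_spaces.match(" Hello World")))  # False
print(bool(no_edge_spaces.match("a")))             # False: needs two word characters
print(bool(or_empty.match("")))                    # True
print(bool(single_spaces.match("a")))              # True
print(bool(single_spaces.match("Hello  World")))   # False: double space
```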
This regular expression
^\w+(\s\w+)*$
will only allow a single whitespace character between words and no leading or trailing spaces.
Below is the explanation of the regular expression:
^ Assert position at start of the string
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
1st Capturing group (\s\w+)*
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\s Match any white space character [\r\n\t\f ]
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
$ Assert position at end of the string
Just add a space to the end of your character class as follows:
[a-zA-Z0-9_ ]
This does not allow a space at the beginning, but allows spaces between words. It also allows special characters between words. A good regex for FirstName and LastName fields.
\w+.*$
For letters only:
^([a-zA-Z])+(\s)+[a-zA-Z]+$
For alphanumeric value and _:
^(\w)+(\s)+\w+$
If you are using JavaScript then you can use this regex:
/^[a-z0-9_.\s-]+$/i
For example:
/^[a-z0-9_.\s-]+$/i.test("") //false
/^[a-z0-9_.\s-]+$/i.test("helloworld") //true
/^[a-z0-9_.\s-]+$/i.test("hello world") //true
/^[a-z0-9_.\s-]+$/i.test("none alpha: ɹqɯ") //false
The only drawback with this regex is a string comprised entirely of spaces. " " will also show as true.
It was my regex: @"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)*$"
I just added ([\w ]+) at the end of my regex, before the final *:
@"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)([\w ]+)*$"
Now string is allowed to have spaces.
This regex allows only letters and spaces:
^[a-zA-Z ]*$
Try with this one:
result = re.search(r"\w+( )\w+", text)
I need a regex that will match strings of letters that do not contain two consecutive dashes.
I came close with this regex that uses lookaround (I see no alternative):
([-a-z](?<!--))+
Which given the following as input:
qsdsdqf--sqdfqsdfazer--azerzaer-azerzear
Produces three matches:
qsdsdqf-
sqdfqsdfazer-
azerzaer-azerzear
What I want however is:
qsdsdqf-
-sqdfqsdfazer-
-azerzaer-azerzear
So my regex loses the first dash, which I don't want.
Who can give me a hint or a regex that can do this?
This should work:
-?([^-]-?)*
It makes sure that there is at least one non-dash character between every two dashes.
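Tried in Python, the pattern above can serve both as a validator and as an extractor. This is a sketch; because the pattern can match the empty string, empty matches are filtered out.

```python
import re

no_double_dash = re.compile(r"-?([^-]-?)*")

# As a validator: the whole string must match.
print(bool(no_double_dash.fullmatch("azerzaer-azerzear")))  # True
print(bool(no_double_dash.fullmatch("qsdsdqf--sqdf")))      # False

# As an extractor: collect the non-empty matches.
text = "qsdsdqf--sqdfqsdfazer--azerzaer-azerzear"
pieces = [m.group(0) for m in no_double_dash.finditer(text) if m.group(0)]
print(pieces)  # ['qsdsdqf-', '-sqdfqsdfazer-', '-azerzaer-azerzear']
```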
Looks to me like you do want to match strings that contain double hyphens, but you want to break them into substrings that don't. Have you considered splitting it between pairs of hyphens? In other words, split on:
(?<=-)(?=-)
As for your regex, I think this is what you were getting at:
(?:[^-]+|-(?<!--)|\G-)+
The -(?<!--) will match one hyphen, but if the next character is also a hyphen the match ends. Next time around, \G- picks up the second hyphen because it's the next character; the only way that can happen (except at the beginning of the string) is if a previous match broke off at that point.
Be aware that this regex is more flavor dependent than most; I tested it in Java, but not all flavors support \G and lookbehinds.
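Python's re module is one of the flavors without \G, but the split between pairs of hyphens suggested earlier in this answer works there. A minimal sketch, assuming Python 3.7 or later, where re.split accepts patterns that can match an empty string:

```python
import re

text = "qsdsdqf--sqdfqsdfazer--azerzaer-azerzear"

# Split at the empty position that has a hyphen on both sides.
pieces = re.split(r"(?<=-)(?=-)", text)
print(pieces)  # ['qsdsdqf-', '-sqdfqsdfazer-', '-azerzaer-azerzear']
```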