I want to create a regex to allow space between characters of a specific string.
The context is that we have an unclean database, with strings that sometimes contain spaces where they shouldn't. I'm not yet allowed to remove the spaces in the database (replace(' ', '')).
I would like a regex that matches a string even if the string is broken up by spaces.
ex:
obama would match "obama", "ob ama", " obama", " ob ama", "obam a", but not "obamaa", "ocama", " ".
Is it possible? If yes, how?
Thanks.
Just add <space>* between each pair of characters.
\bo *b *a *m *a\b
or use [ \t]* in the above instead of a space.
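Writing the pattern by hand gets tedious for longer words, so it can also be generated. The following is a minimal Python sketch of that idea; the helper name `spaced_pattern` is made up for illustration, and it assumes the word starts and ends with a word character so that `\b` behaves as expected.

```python
import re

def spaced_pattern(word, gap=" *"):
    """Build a regex that matches `word` with optional spaces between its characters.

    Hypothetical helper: pass gap="[ \\t]*" to allow tabs as well.
    """
    body = gap.join(re.escape(ch) for ch in word)
    return re.compile(r"\b" + body + r"\b")

pattern = spaced_pattern("obama")

# The examples from the question, checked with search() because the
# candidates may carry leading spaces.
for candidate in ["obama", "ob ama", " obama", " ob ama", "obam a",
                  "obamaa", "ocama", " "]:
    print(repr(candidate), bool(pattern.search(candidate)))
```

The first five candidates are reported as matches and the last three are not.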
You can also do this without a regular expression:
>>> a = "ob ama"
>>> ''.join(a.split(' ')) == 'obama'
True
This should work:
(\S? )*\S
If leading spaces are a problem, this should be modified. Also, this allows multiple spaces but you didn't really say anything about that. And if you need to allow other kinds of whitespace characters besides regular space, this needs some more modification. This should handle other whitespace characters:
(\S?\s)*\S
Related
I have this string:
"Common Waxbill - Estrilda astrild"
How can I write 2 separate regexes for the words before and after the hyphen? The output I would want is:
"Common Waxbill"
and
"Estrilda astrild"
This is quite simple:
.*(?= - ) # matches everything before " - "
(?<= - ).* # matches everything after " - "
See this tutorial on lookaround assertions.
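As a rough illustration, here is how the two lookaround patterns could be applied in Python (the language is an assumption, since the question does not name one). Python requires fixed-width lookbehinds, which `(?<= - )` satisfies.

```python
import re

text = "Common Waxbill - Estrilda astrild"

# Lookahead: everything that is followed by " - ".
before = re.search(r".*(?= - )", text).group()

# Lookbehind: everything that is preceded by " - ".
after = re.search(r"(?<= - ).*", text).group()

print(before)
print(after)
```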
If you cannot use look-behinds, but your string is always in the same format and cannot contain more than the single hyphen, you could use
^[^-]*[^ -] for the first one and \w[^-]*$ for the second one (or [^ -][^-]*$ if the first non-space after the hyphen is not necessarily a word character).
A little bit of explanation:
^[^-]*[^ -] matches the start of the string (anchor ^), followed by any number of characters that are not a hyphen, and finally a character that's neither a hyphen nor a space (just to exclude the last space from the match).
[^ -][^-]*$ takes the same approach, but the other way around: it first matches a character that's neither a space nor a hyphen, followed by any number of characters that are not a hyphen, and finally the end of the string (anchor $). \w[^-]*$ is basically the same, but uses a stricter \w instead of the [^ -]. This is again used to exclude the whitespace after the hyphen from the match.
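To see the two lookbehind-free patterns at work, here is a small Python sketch (Python is an assumption; any engine with negated character classes should behave the same way on this input).

```python
import re

text = "Common Waxbill - Estrilda astrild"

# Start of string, non-hyphen characters, ending on a character that is
# neither a space nor a hyphen.
first = re.search(r"^[^-]*[^ -]", text).group()

# A character that is neither a space nor a hyphen, then non-hyphen
# characters up to the end of the string.
second = re.search(r"[^ -][^-]*$", text).group()

print(first)
print(second)
```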
Another solution is to string split on the hyphen and remove white space.
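A sketch of the split-and-trim idea, written in Python as an assumption about the language in use:

```python
text = "Common Waxbill - Estrilda astrild"

# Split on the hyphen, then strip the surrounding whitespace from each part.
parts = [part.strip() for part in text.split("-")]

print(parts)
```

Like the regex variants, this only works as intended when the names themselves contain no hyphen.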
Two alternate methods
The main challenge of your Question is that you want two separate items. This means that your process is dependent on another language. RegEx itself does not parse or separate a string; it only explains what we are looking for. The language you are using will make the actual separation. My answer gets your results in PHP, but other languages should have comparable solutions.
If you want to just do the job in your Question, and if you're using PHP...
Method 1: explode("-", $list); -> $array[]
This is useful if your list is longer than two items:
<?php
// Generate our list
$list = "Common Waxbill - Estrilda astrild";
$item_arr = explode("-", $list);
// Iterate each
foreach($item_arr as $item) {
echo $item.'<br>';
}
// See what we have
echo '
<pre>Access array directly:</pre>'.
'<pre>'.$item_arr[0].'x <--notice the trailing space</pre>'.
'<pre>'.$item_arr[1].' <--notice the preceding space</pre>';
...You could clean up each item and reassign them to a new array with trim(). This would get the text your Question asked for (no extra spaces before or after)...
// Create a workable array
$i=0; // Start our array key counter
foreach($item_arr as $item) {
$clean_arr[$i++] = trim($item);
}
// See what we have
echo '
<pre>Access after cleaning:</pre>'.
'<pre>'.$clean_arr[0].'x <--no space</pre>'.
'<pre>'.$clean_arr[1].' <--no space</pre>';
?>
Output:
Common Waxbill
Estrilda astrild
Access array directly:
Common Waxbill x <--notice the trailing space
Estrilda astrild <--notice the preceding space
Access after cleaning:
Common Waxbillx <--no space
Estrilda astrild <--no space
Method 2: substr(strrpos()) & substr(strpos())
This is useful if your list will only have two items:
<?php
// Generate our list
$list = "Common Waxbill - Estrilda astrild";
// Start splitting
$first_item = trim(substr($list, 0, strpos($list, '-')));
$second_item = trim(substr($list, strrpos($list, '-') + 1));
// See what we have
echo "<pre>substr():</pre>
<pre>$first_item</pre>
<pre>$second_item</pre>
";
?>
Output:
substr():
Common Waxbill
Estrilda astrild
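Note that strpos() finds the first occurrence of the hyphen and strrpos() finds the last one; they take the same arguments, and with a single hyphen in the string both return the same position.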
If you're not using PHP, but you want to do the job in some other language without depending on RegEx, knowing the language would be helpful.
Generally, programming languages come with tools for jobs like this out of box, which is part of why people choose the languages they do.
I'm trying to use a regex for testing for only letters and spaces allowed in an input field. I tried using /^[a-zA-Z]*$/ but this doesn't allow spaces.
Updated this question. I'm looking to only have one space between words.
So for example I have a city name like New Orleans so only letters and one space between words should be allowed.
The [a-zA-Z] and \s character classes have the following caveats.
a-zA-Z may be too restrictive if you wish to accept Unicode, in which case, \p{Alpha} would be a better character class.
\s may be either too permissive or too restrictive depending on the Unicode or ASCII semantics under which your regex is operating. You may only want to just put a single space in the character class.
If we knew what language and whether or not you intend for Unicode semantics, that might help. Things aren't as simple as they were back in the ASCII-only days.
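To make the caveats concrete, here is a small Python sketch. Python is an assumption, since the question does not name a language. Python's `re` module has no `\p{Alpha}`, so `[^\W\d_]` is used as an approximation of "any Unicode letter"; it is not an exact equivalent.

```python
import re

ascii_letters = re.compile(r"[a-zA-Z]+")
# Assumption: [^\W\d_] ("word character that is not a digit or underscore")
# stands in for \p{Alpha}, which Python's re module does not support.
unicode_letters = re.compile(r"[^\W\d_]+")

city = "Z\u00fcrich"  # "Zürich"
ascii_ok = bool(ascii_letters.fullmatch(city))      # too restrictive here
unicode_ok = bool(unicode_letters.fullmatch(city))

# Under Unicode semantics \s also accepts a no-break space;
# the re.ASCII flag narrows it to ASCII whitespace.
nbsp = "\u00a0"
space_unicode = bool(re.fullmatch(r"\s", nbsp))
space_ascii = bool(re.fullmatch(r"\s", nbsp, re.ASCII))

print(ascii_ok, unicode_ok, space_unicode, space_ascii)
```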
Update:
Here's a more robust solution that still has the original Unicode caveats, but allows for a single space character between words, and nowhere else.
use strict;
use warnings;
my @cities = (
'New Orleans',
'New Jersey',
'Salt Lake City',
'Sacramento',
' Leading space',
'Trailing space ',
'Two  consecutive spaces',
'  Two leading spaces',
'Two trailing spaces  ',
);
foreach my $city ( @cities ) {
print "<<$city>>: ";
if( $city =~ m/^[a-zA-Z]+(?:\s[a-zA-Z]+)*$/ ) {
print "match\n";
}
else {
print "reject\n";
}
}
Matches one or more groups of word characters separated by single spaces (note that \w also accepts digits and underscores).
^(\w+ )*\w+$
I want to split a text into its single words using regular expressions. The obvious solution would be to use the regex \\b; unfortunately, this also splits words on the hyphen.
So I am searching for an expression that does exactly the same as \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String [] b = s.split("\\b+");
for (int i = 0; i < b.length; i++){
System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
@Matmarbon's solution is already quite close, but not a 100% fit; it gives me
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
Also, not for you but for somebody who needs this for another purpose (e.g. inserting something), this is closer to an equivalent of the \b solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
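For comparison with the Java snippet in the question, here is a rough Python sketch of splitting on `[^\w\-]+`. Note that, unlike a split on `\b`, this discards punctuation and spaces instead of returning them as separate tokens.

```python
import re

s = ("This is my text! It uses some odd words like user-generated "
     "and need therefore a special regex.")

# Split on runs of characters that are neither word characters nor hyphens.
# The trailing "." leaves an empty last element, which is filtered out.
words = [w for w in re.split(r"[^\w\-]+", s) if w]

# Equivalent view: collect the runs of word characters and hyphens directly.
same_words = re.findall(r"[\w\-]+", s)

print(words)
```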
You can use this:
(?<!-)\\b(?!-)
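The double backslash above is Java string escaping; the regex itself is `(?<!-)\b(?!-)`. As a hedged sketch, the same zero-width split can be tried in Python, where splitting on empty matches requires version 3.7 or later. The sample sentence is shortened from the question.

```python
import re

s = "odd words like user-generated and need"

# Zero-width split: a word boundary that is neither preceded
# nor followed by a hyphen.
pieces = re.split(r"(?<!-)\b(?!-)", s)

# The split also returns the separators (spaces) and empty edge strings,
# so keep only the pieces that contain something other than whitespace.
tokens = [p for p in pieces if p.strip()]

print(tokens)
```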
I'm trying to check whether a string is present among a list of strings using regex.
I tried using the following...
(?!MyDisk1$|MyDisk2$)
But this isn't working for scenarios like
(?!My disk1$|My Disk2$)
Can you suggest a better approach to deal with such situations..
I get the list of strings from an sql query... So I am not sure where the spaces are present. The list of Strings vary like My Disk1, MyDisk2, My_Disk3, ABCD123, XYZ_123, MNP 123 etc.... or any other String with [a-zA-Z0-9_ ]
You can make the spaces optional using a zero-or-one quantifier (?):
(?!My ?disk1$|My ?Disk2$)
This assertion will reject substrings like MyDisk2 or My Disk2. Or to handle potentially many spaces, use a zero-or-more quantifier (*):
(?!My *disk1$|My *Disk2$)
Note that if you're running this in an engine which ignores whitespace in the pattern you may need to use a character class, like this:
(?!My[ ]*disk1$|My[ ]*Disk2$)
Or to handle spaces or underscores:
(?!My[ _]*disk1$|My[ _]*Disk2$)
Unfortunately if the spaces can be anywhere in the string, (but you still care about matching the other letters in order), you'd have to do something like this:
(?! *M *y *d *i *s *k *1$| *M *y *D *i *s *k *2$)
Or to handle spaces or underscores:
(?![ _]*M[ _]*y[ _]*d[ _]*i[ _]*s[ _]*k[ _]*1$|[ _]*M[ _]*y[ _]*D[ _]*i[ _]*s[ _]*k[ _]*2$)
But to be honest, at that point, you may be better off preprocessing your data before you try to use your regex with it.
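One way to read "preprocessing" is to normalize both the incoming string and the list before comparing, so no per-character pattern is needed. The following Python sketch assumes that spaces and underscores are the only separators to ignore and that case should not matter; both are assumptions, and the names `normalize` and `is_listed` are made up for illustration.

```python
import re

def normalize(name):
    """Drop spaces and underscores and lowercase the rest (assumed rules)."""
    return re.sub(r"[ _]+", "", name).lower()

# Hypothetical list, as it might come back from the SQL query.
listed = {normalize(n) for n in ["MyDisk1", "MyDisk2"]}

def is_listed(name):
    """True if `name` equals a listed string once separators are ignored."""
    return normalize(name) in listed

for candidate in ["My Disk1", "My_Disk 2", "M y D i s k 2", "ABCD123"]:
    print(repr(candidate), is_listed(candidate))
```

A plain set lookup like this also avoids rebuilding a long alternation every time the list changes.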
Use this regex, appending i at the end to make it case-insensitive:
/my\s?disk[12]$/i
This will match all of these scenarios.
You can do this:
/(?[^\s_-]+(\s|_|-)?[^\s_-]*?$)/i
'?' quantifier means 0 or 1 of the preceding pattern.
/i is for case-insensitive matching. The separator can be a space, an underscore, or a dash. I have replaced My and disk with a string of length 1 or more which does not contain a space, underscore, or dash. Now it will match "Shikhar Subedi", "dprpradeep", or "MyDisk 54".
The + quantifier means 1 or more. ^ inside a character class means not. * means 0 or more. So the string after the separator is optional.
How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';
/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);
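The same pattern can be checked outside JavaScript as well. Here is a minimal Python sketch, using raw strings so that the backslashes in the sample reach the regex engine unchanged:

```python
import re

# Raw string: the sample really contains a backslash before the inner quote.
s = r'''function(){ return " It's big \"problem "; }'''

# A quote, then any run of "ordinary character" or "backslash plus any
# character", then the closing quote.
m = re.search(r'"(?:[^"\\]|\\.)*"', s)

print(m.group())
```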
This one comes from nanorc.sample, available in many Linux distros. It is used for syntax highlighting of C-style strings.
\"(\\.|[^\"])*\"
As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
Most of the solutions provided here use alternative repetition paths i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some pattern compilers implement this using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.
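As a sanity check that an unrolled pattern finds the same strings as the `(A|B)*` form, here is a small Python comparison. It uses the `[^"\\]*(\\.[^"\\]*)*` shape quoted earlier in this thread. Python's engine is not the Java one, so this only illustrates that the results agree, not the stack behaviour.

```python
import re

alternation = re.compile(r'"(?:[^"\\]|\\.)*"')
unrolled = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

samples = [
    r'say "hi" now',
    r'x = "a \"quoted\" word"',
    r'"" and "trailing backslash \\"',
    '"' + "a" * 10000 + '"',  # one long string literal
]

# Both patterns should report exactly the same quoted strings.
agree = all(alternation.findall(t) == unrolled.findall(t) for t in samples)
print(agree)
```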
/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string
"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, for it is known beforehand that a string without a closing quote wouldn't match in any case.
This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
Every quoted string starts with Char: " ;
It may contain any number of any characters: .*? {Lazy match}; ending with non escape character [^\\];
Statement (2) is Lazy(!) optional because string can be empty(""). So: (.*?[^\\])??
Finally, every quoted string ends with Char("), but it can be preceded by an even number of escape sign pairs (\\\\)+; and this part is possessive (greedy without backtracking) and optional: ((\\\\)+)?+, because the string can be empty or have no ending pairs!
An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Let's say you had the following string: String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
// Match close
'([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
'((?:' +
// Match escaped close quote
'(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
// Match everything thats not the close quote
'(?:(?!\\1).)' +
'){0,})' +
// Match open
'(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
'g'
);
// Reverse the matched strings.
const matches = myString
// Reverse the string.
.split('').reverse().join('')
// '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
// Match the quoted
.match(regExp)
// ['"hctam "\dluohs"\ siht"', '"dluohs"']
// Reverse the matches
.map(x => x.split('').reverse().join(''))
// ['"this \"should\" match"', '"should"']
// Re order the matches
.reverse();
// ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
The regexp can easily be broken into three pieces, as follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # A pair of escape characters
\\ # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form: generated using Jex's Regulex
Image on github (JavaScript Regular Expression Visualizer.)
Sorry, I don't have a high enough reputation to include images, so, it's just a link for now.
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js
Here is one that works with both " and ', and you can easily add others at the start.
("|')(?:\\\1|[^\1])*?\1
It uses the backreference (\1) to match exactly what is in the first group (" or ').
http://www.regular-expressions.info/backref.html
One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and linear, manual seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
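A cursor-based scan along those lines might look like the following Python sketch. The function name `find_quoted` is made up for illustration, and the sketch assumes that a backslash only acts as an escape inside a quoted string and that unterminated strings are simply ignored.

```python
def find_quoted(text, quote='"', escape="\\"):
    """Return all quoted substrings, honouring backslash escapes inside them."""
    results = []
    start = None  # index of the opening quote, or None when outside a string
    i = 0
    while i < len(text):
        ch = text[i]
        if start is None:
            if ch == quote:
                start = i          # opening quote found
        elif ch == escape:
            i += 1                 # skip the escaped character
        elif ch == quote:
            results.append(text[start:i + 1])  # closing quote found
            start = None
        i += 1
    return results

sample = r'''return " It's big \"problem "; x = "a"'''
print(find_quoted(sample))
```

The loop visits each character at most once, so there is no backtracking to worry about.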
A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)
If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"
I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
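For readers outside Java, the same two steps can be sketched in Python. The function name `mask_quoted` is invented for the example; the replacement text `"x"` is taken from the Java lines above.

```python
import re

def mask_quoted(line):
    """Replace every double-quoted string in `line` with the placeholder "x"."""
    # Step 1: turn escaped quotes (\") into a harmless character.
    line = line.replace('\\"', "'")
    # Step 2: with no escaped quotes left, a plain negated class is enough.
    return re.sub(r'"[^"]*"', '"x"', line)

masked = mask_quoted(r'a = "He said \"hi\""; b = "ok"')
print(masked)
```

One caveat of this shortcut: an escaped backslash right before a closing quote (`\\"`) would be misread by step 1, so it fits best when such input is not expected.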
If your IDE is IntelliJ IDEA, you can forget all these headaches: store your regex in a String variable, and as you copy-paste it inside the double quotes it will automatically be escaped into a regex-acceptable format.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.
(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "
Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"