Regex for letters plus space - regex

I'm trying to use a regex for testing for only letters and spaces allowed in an input field. I tried using /^[a-zA-Z]*$/ but this doesn't allow spaces.
Updated this question. I'm looking to only have one space between words.
So for example I have a city name like New Orleans so only letters and one space between words should be allowed.

The [a-zA-Z] and \s character classes have the following caveats.
a-zA-Z may be too restrictive if you wish to accept Unicode, in which case, \p{Alpha} would be a better character class.
\s may be either too permissive or too restrictive depending on the Unicode or ASCII semantics under which your regex is operating. You may want to put just a literal space in the character class instead.
If we knew what language and whether or not you intend for Unicode semantics, that might help. Things aren't as simple as they were back in the ASCII-only days.
Update:
Here's a more robust solution that still has the original Unicode caveats, but allows for a single space character between words, and nowhere else.
use strict;
use warnings;
my @cities = (
    'New Orleans',
    'New Jersey',
    'Salt Lake City',
    'Sacramento',
    ' Leading space',
    'Trailing space ',
    'Two  consecutive spaces',
    '  Two leading spaces',
    'Two trailing spaces  ',
);
foreach my $city ( @cities ) {
    print "<<$city>>: ";
    if( $city =~ m/^[a-zA-Z]+(?:\s[a-zA-Z]+)*$/ ) {
        print "match\n";
    }
    else {
        print "reject\n";
    }
}
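For comparison, here is a minimal sketch of the same check in Python. The function name is_valid_city and the unicode_ok flag are made up for illustration. The Unicode variant uses [^\W\d_] as an approximation of \p{Alpha}, since Python's built-in re module has no \p{...} classes.

```python
import re

# ASCII-only: one or more letters, then zero or more groups of
# "single space + letters". No leading, trailing, or doubled spaces.
ASCII_WORDS = re.compile(r"[A-Za-z]+(?: [A-Za-z]+)*")

# Unicode-aware variant (an approximation): [^\W\d_] means "a word
# character that is neither a decimal digit nor an underscore".
UNICODE_WORDS = re.compile(r"[^\W\d_]+(?: [^\W\d_]+)*")


def is_valid_city(name, unicode_ok=False):
    """Return True if name consists of letter groups separated by single spaces."""
    pattern = UNICODE_WORDS if unicode_ok else ASCII_WORDS
    return pattern.fullmatch(name) is not None


for city in ["New Orleans", "Salt Lake City", " Leading space", "Two  spaces"]:
    print(f"<<{city}>>:", "match" if is_valid_city(city) else "reject")
```

Using a literal space instead of \s sidesteps the question of what \s means under Unicode semantics.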

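Matches multiple groups of word characters separated by single spaces. Note that \w also accepts digits and underscores, and that this pattern requires at least one space.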
(\w+ )+\w

Related

Regex get text before and after a hyphen

I have this string:
"Common Waxbill - Estrilda astrild"
How can I write 2 separate regexes for the words before and after the hyphen? The output I would want is:
"Common Waxbill"
and
"Estrilda astrild"
This is quite simple:
.*(?= - ) # matches everything before " - "
(?<= - ).* # matches everything after " - "
See this tutorial on lookaround assertions.
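As a sketch of how the two lookaround patterns behave, assuming Python's re module (the helper names text_before and text_after are mine, not part of any library):

```python
import re


def text_before(line):
    """Everything before " - ", or None if the separator is absent."""
    m = re.search(r".*(?= - )", line)
    return m.group(0) if m else None


def text_after(line):
    """Everything after " - ", or None if the separator is absent."""
    m = re.search(r"(?<= - ).*", line)
    return m.group(0) if m else None


line = "Common Waxbill - Estrilda astrild"
print(text_before(line))
print(text_after(line))
```

The look-behind works here because " - " has a fixed width, which Python's re requires.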
If you cannot use look-behinds, but your string is always in the same format and cannot contain more than a single hyphen, you could use
^[^-]*[^ -] for the first one and \w[^-]*$ for the second one (or [^ -][^-]*$ if the first non-space after the hyphen is not necessarily a word character).
A little bit of explanation:
^[^-]*[^ -] matches the start of the string (anchor ^), followed by any number of characters that are not a hyphen, and finally a character that is neither a hyphen nor a space (just to exclude the last space from the match).
[^ -][^-]*$ takes the same approach, but the other way around: it first matches a character that is neither a space nor a hyphen, followed by any number of characters that are not a hyphen, and finally the end of the string (anchor $). \w[^-]*$ is basically the same, but it uses a stricter \w instead of the [^ -]. This is again used to exclude the whitespace after the hyphen from the match.
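A quick sketch of the two lookaround-free patterns, again assuming Python's re module and a string with exactly one hyphen (split_without_lookbehind is an illustrative name):

```python
import re


def split_without_lookbehind(line):
    """Return (before, after) using only negated character classes."""
    first = re.search(r"^[^-]*[^ -]", line)   # up to the last non-space before "-"
    second = re.search(r"[^ -][^-]*$", line)  # from the first non-space after "-"
    return (
        first.group(0) if first else None,
        second.group(0) if second else None,
    )


print(split_without_lookbehind("Common Waxbill - Estrilda astrild"))
```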
Another solution is to string split on the hyphen and remove white space.
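The split-and-trim idea, sketched in Python under the assumption that the hyphen never appears inside either name (split_on_hyphen is a made-up helper):

```python
def split_on_hyphen(line):
    """Split on "-" and strip surrounding whitespace from each part."""
    return [part.strip() for part in line.split("-")]


print(split_on_hyphen("Common Waxbill - Estrilda astrild"))
```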
Two alternate methods
The main challenge of your Question is that you want two separate items. This means that your process is dependent on another language. RegEx itself does not parse or separate a string; it only explains what we are looking for. The language you are using will make the actual separation. My answer gets your results in PHP, but other languages should have comparable solutions.
If you want to just do the job in your Question, and if you're using PHP...
Method 1: explode("-", $list); -> $array[]
This is useful if your list is longer than two items:
<?php
// Generate our list
$list = "Common Waxbill - Estrilda astrild";
$item_arr = explode("-", $list);
// Iterate each
foreach($item_arr as $item) {
echo $item.'<br>';
}
// See what we have
echo '
<pre>Access array directly:</pre>'.
'<pre>'.$item_arr[0].'x <--notice the trailing space</pre>'.
'<pre>'.$item_arr[1].' <--notice the preceding space</pre>';
...You could clean up each item and reassign them to a new array with trim(). This would get the text your Question asked for (no extra spaces before or after)...
// Create a workable array
$i=0; // Start our array key counter
foreach($item_arr as $item) {
$clean_arr[$i++] = trim($item);
}
// See what we have
echo '
<pre>Access after cleaning:</pre>'.
'<pre>'.$clean_arr[0].'x <--no space</pre>'.
'<pre>'.$clean_arr[1].' <--no space</pre>';
?>
Output:
Common Waxbill
Estrilda astrild
Access array directly:
Common Waxbill x <--notice the trailing space
Estrilda astrild <--notice the preceding space
Access after cleaning:
Common Waxbillx <--no space
Estrilda astrild <--no space
Method 2: substr(strrpos()) & substr(strpos())
This is useful if your list will only have two items:
<?php
// Generate our list
$list = "Common Waxbill - Estrilda astrild";
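// Start splitting
// Note: despite its name, $first_item gets the text after the last hyphen...
$first_item = trim(substr($list, strrpos($list, '-') + 1));
// ...and $second_item gets the text before the first hyphen
$second_item = trim(substr($list, 0, strpos($list, '-')));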
// See what we have
echo "<pre>substr():</pre>
<pre>$first_item</pre>
<pre>$second_item</pre>
";
?>
Output:
substr():
Estrilda astrild
Common Waxbill
Note that strrpos() and strpos() are different functions: strrpos() finds the last occurrence of the needle, while strpos() finds the first.
If you're not using PHP, but you want to do the job in some other language without depending on RegEx, knowing the language would be helpful.
Generally, programming languages come with tools for jobs like this out of the box, which is part of why people choose the languages they do.

Regex to allow space between char of a specific string

I want to create a regex to allow space between characters of a specific string.
The context is that we have an unclean database, with strings that sometimes contain spaces where they shouldn't. I'm not yet allowed to remove the spaces in the database (replace(' ', '')).
I would like to have a regex that can match a string even if it is broken up by spaces.
ex:
obama would match "obama", "ob ama", " obama", " ob ama", "obam a", but not "obamaa", "ocama", " ".
Is it possible? If yes, how?
Thanks.
Just add <space>* in between each character.
\bo *b *a *m *a\b
or use [ \t]* in the above instead of a space.
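Rather than typing the pattern out by hand, it could be generated from the word. A minimal sketch in Python, where spaced_pattern is a hypothetical helper name and the test strings are the ones from the question:

```python
import re


def spaced_pattern(word):
    """Build \\b o *b *a *m *a \\b style patterns: optional spaces between letters."""
    body = " *".join(re.escape(ch) for ch in word)
    return re.compile(r"\b" + body + r"\b")


OBAMA = spaced_pattern("obama")

for candidate in ["obama", "ob ama", " ob ama", "obam a", "obamaa", "ocama", " "]:
    print(repr(candidate), bool(OBAMA.search(candidate)))
```

re.escape keeps the helper safe for words that contain regex metacharacters.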
You can do this without regular expression also
>>> a = "ob ama"
>>> ''.join(a.split(' ')) == 'obama'
True
This should work:
(\S? )*\S
If leading spaces are a problem, this should be modified. Also, this allows multiple spaces but you didn't really say anything about that. And if you need to allow other kinds of whitespace characters besides regular space, this needs some more modification. This should handle other whitespace characters:
(\S?\s)*\S

Use Perl to check if a string has only English characters

I have a file with submissions like this
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
I am stripping everything but the song name by using this regex.
$line =~ s/.*>|([([\/\_\-:"``+=*].*)|(feat.*)|[?¿!¡\.;&\$#%#\\|]//g;
I want to make sure that the only strings printed are ones that contain only English characters, so in this case it would be the first song title, Ai Wo Qing shut up, and not the next one because of the è.
I have tried this
if ( $line =~ m/[^a-zA-z0-9_]*$/ ) {
    print $line;
}
else {
    print "Non-english\n";
}
I thought this would match just the English characters, but it always prints Non-english. I feel this is me being rusty with regex, but I cannot find my answer.
Following from the comments, your problem would appear to be:
$line =~ m/[^a-zA-z0-9_]*$/
Specifically, the ^ is inside the brackets, which means that it's not acting as an anchor. It's actually a negation operator.
See: http://perldoc.perl.org/perlrecharclass.html#Negation
It is also possible to instead list the characters you do not want to match. You can do so by using a caret (^) as the first character in the character class. For instance, [^a-z] matches any character that is not a lowercase ASCII letter, which therefore includes more than a million Unicode code points. The class is said to be "negated" or "inverted".
But the important part is - that without the 'start of line' anchor, your regular expression is zero-or-more instances (of whatever), so will match pretty much anything - because it can freely ignore the line content.
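To make the difference concrete, here is a small sketch in Python (the same reasoning applies to Perl). UNANCHORED is the pattern from the question; ANCHORED is only my guess at the intent, with a space added to the class so that multi-word titles pass:

```python
import re

# The question's pattern: "^" inside the brackets negates the class, and "*"
# allows zero characters, so an empty match at the end always succeeds.
UNANCHORED = re.compile(r"[^a-zA-z0-9_]*$")

# Anchored alternative (an assumption about the intent): every character
# must be an ASCII letter, digit, underscore, or space.
ANCHORED = re.compile(r"^[a-zA-Z0-9_ ]*$")

for title in ["Ai Wo Qing shut up", "Ach\u00e8te-moi"]:
    print(title, bool(UNANCHORED.search(title)), bool(ANCHORED.search(title)))
```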
(Borodin's answer covers some of the other options for this sort of pattern match, so I shan't reproduce).
It's not clear exactly what you need, so here are a couple of observations that speak to what you have written.
It is probably best if you use split to divide each line of data on <SEP>, which I presume is a separator. Your question asks for the fourth such field, like this
use strict;
use warnings;
use 5.010;
while ( <DATA> ) {
    chomp;
    my @fields = split /<SEP>/;
    say $fields[3];
}
__DATA__
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
output
Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
Achète-moi
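Also, for ASCII data the word character class \w matches exactly [a-zA-Z0-9_] (and \W matches the complement). Since \W would also reject the spaces in a title, you can rewrite your if statement to look for any character that is neither a word character nor whitespace, like this
if ( $line =~ /[^\w\s]/ ) {
    print "Non-English\n";
}
else {
    print $line;
}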
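Putting the two observations together, a rough Python sketch might look like this. It assumes that "English" simply means ASCII-only, which is my reading of the question, and the names title_of and ascii_titles are illustrative:

```python
SAMPLE = [
    "%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>"
    "Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))",
    "%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Ach\u00e8te-moi",
]


def title_of(line):
    """Return the fourth <SEP>-delimited field (the song title)."""
    return line.rstrip("\n").split("<SEP>")[3]


def ascii_titles(lines):
    """Keep only titles made entirely of ASCII characters."""
    return [t for t in map(title_of, lines) if t.isascii()]


print(ascii_titles(SAMPLE))
```

str.isascii() needs Python 3.7 or later.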

Splitting Two Characters In a String - Perl

I'm trying to split this string. Here's the code:
my $string = "585|487|314|1|1,651|365|302|1|1,585|487|314|1|1,651|365|302|1|1,656|432|289|1|1,136|206|327|1|1,585|487|314|1|1,651|365|302|1|1,585|487|314|1|1,651|365|302|1|1%656|432|289|1|1%136|206|327|1|1%654|404|411|1|1";
my @ids = split(",", $string);
What I want is to split only on % and , in the string. I was told that I could use a pattern, something like this: /[^a-zA-Z0-9_]/
Character classes can be used to represent a group of possible single characters that can match. And the ^ symbol at the beginning of a character class negates the class, saying "Anything matches except for ...." In the context of split, whatever matches is considered the delimiter.
That being the case, [^a-zA-Z0-9_] would match any character except for the ASCII letters 'a' through 'z', 'A' through 'Z', and the numeric digits '0' through '9', plus underscore. In your case, while this would correctly split on "," and "%" (since they're not included in a-z, A-Z, 0-9, or _), it would mistakenly also split on "|", as well as any other character not included in the character class you attempted.
In your case it makes a lot more sense to be specific as to what delimiters to use, and to not use a negated class; you want to specify the exact delimiters rather than the entire set of characters that delimiters cannot be. So as mpapec stated in his comment, a better choice would be [%,].
So your solution would look like this:
my @ids = split /[%,]/, $string;
Once you split on '%' and ',', you'll be left with a bunch of substrings that look like this: 585|487|314|1|1 (or some variation on those numbers). In each case, it's five positive integers separated by '|' characters. It seems possible to me that you'll end up wanting to break those down as well by splitting on '|'.
You could build a single data structure represented by list of lists, where each top level element represents a [,%] delimited field, and consists of a reference to an anonymous array consisting of the pipe-delimited fields. The following code will build that structure:
my @ids = map { [ split /\|/, $_ ] } split /[%,]/, $string;
When that is run, you will end up with something like this:
@ids = (
    [ '585', '487', '314', '1', '1' ],
    [ '651', '365', '302', '1', '1' ],
    # ...
);
Now each field within an ID can be inspected and manipulated individually.
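For comparison, a sketch of the same list-of-lists structure in Python (parse_ids is an illustrative name, and the sample string is a shortened version of the one in the question):

```python
import re


def parse_ids(string):
    """Split on "%" or ",", then split each record on "|"."""
    return [record.split("|") for record in re.split(r"[%,]", string)]


sample = "585|487|314|1|1,651|365|302|1|1%656|432|289|1|1"
for record in parse_ids(sample):
    print(record)
```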
To understand more about how character classes work, you could check perlrequick, which has a nice introduction to character classes. And for more information on split, there's always perldoc -f split (as mentioned by mpapec). split is also discussed in chapter nine of the O'Reilly book, Learning Perl, 6th Edition.

regex to match a maximum of 4 spaces

I have a regular expression to match a person's name.
So far I have ^([a-zA-Z\'\s]+)$ but I'd like to add a check to allow for a maximum of 4 spaces. How do I amend it to do this?
Edit: What I meant was 4 spaces anywhere in the string.
Don't attempt to regex validate a name. People are allowed to call themselves whatever they like. This can include ANY character. Just because you live somewhere that only uses English doesn't mean that all the people who use your system will have English names. We have even had to make the name field in our system Unicode. It is the only Unicode type in the database.
If you care, we actually split the name at " " and store each name part as a separate record, but we have some very specific requirements that mean this is a good idea.
PS. My step mum has 5 spaces in her name.
^ # Start of string
(?!\S*(?:\s\S*){5}) # Negative look-ahead for five spaces.
([a-zA-Z\'\s]+)$ # Original regex
Or in one line:
^(?!(?:\S*\s){5})([a-zA-Z\'\s]+)$
If there are five or more spaces in the string, five will be matched by the negative lookahead, and the whole match will fail. If there are four or fewer, the original regex will be matched.
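A quick way to check the one-line version is to run it against a few names. This is a sketch in Python; the pattern is the one given above, while is_valid_name and the sample names are made up for illustration:

```python
import re

# Negative lookahead rejects any string containing five whitespace characters;
# the rest is the original "letters, apostrophes, whitespace" pattern.
NAME = re.compile(r"^(?!(?:\S*\s){5})([a-zA-Z'\s]+)$")


def is_valid_name(name):
    return NAME.match(name) is not None


for name in ["Ana Maria de la Cruz", "Ana Maria de la Cruz Lopez", "O'Brien"]:
    print(repr(name), is_valid_name(name))
```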
Screw the regex.
Using a regex here seems to be creating a problem for a solution instead of just solving a problem.
This task should be 'easy' for even a novice programmer, and the novel idea of regex has polluted our minds!
1: Get Input
2: Trim White Space
3: If this makes sense, trim out any 'bad' characters.
4: Use the "split" utility provided by your language to break it into words
5: Return the first 5 Words.
ROCKET SCIENCE.
replies
what do you mean screw the regex? you're obviously a VB programmer.
Regex is the most efficient way to work with strings. Learn them.
No. Php, toyed a bit with ruby, now going manically into perl.
There are some things (like this case) where the regex-based alternative is computationally and logically far more complex than the task warrants.
I've parsed entire PHP source files with regex; I'm not exactly a novice in their use.
But there are many cases, such as this, where you're employing a logging company to prune your rose bush.
I could do all steps 2 to 5 with regex of course, but they would be simple and atomic regex, with no weird backtracking syntax or potential for recursive searching.
The steps 1 to 5 I list above have a known scope, known range of input, and there's no ambiguity to how it functions. As to your regex, the fact you have to get contributions of others to write something so simple is proving the point.
I see somebody marked my post as offensive, I am somewhat unhappy I can't mark this fact as offensive to me. ;)
Proof Of Pudding:
sub getNames{
    my @args = @_;
    my $text = shift @args;
    my $num = shift @args;
    # Trim Whitespace from Head/End
    $text =~ s/^\s*//;
    $text =~ s/\s*$//;
    # Trim Bad Characters (??)
    $text =~ s/[^a-zA-Z\'\s]//g;
    # Tokenise By Space
    my @words = split( /\s+/, $text );
    #return 0..n
    return @words[ 0 .. $num - 1 ];
} ## end sub getNames
print join ",", getNames " Hello world this is a good test", 5;
>> Hello,world,this,is,a
If there is anything ambiguous to anybody how that works, I'll be glad to explain it to them. Noted that I'm still doing it with regexps. Other languages I would have used their native "trim" functions provided where possible.
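For what it's worth, the same steps can be sketched in Python, where the native trim and split functions do most of the work. get_names is just a transliteration of the Perl sub name:

```python
import re


def get_names(text, num):
    """Trim, drop disallowed characters, split on whitespace, keep num words."""
    text = text.strip()                        # trim whitespace from both ends
    text = re.sub(r"[^a-zA-Z'\s]", "", text)   # remove 'bad' characters
    return text.split()[:num]                  # tokenise and keep the first num


print(",".join(get_names("  Hello world this is a good test", 5)))
```

Unlike the Perl slice, the Python slice simply returns fewer items when the input has fewer than num words.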
Bollocks -->
I first tried this approach. This is your brain on regex. Kids, don't do regex.
This might be a good start
/([^\s]+
(\s[^\s]+
(\s[^\s]+
(\s[^\s]+
(\s[^\s]+|)
|)
|)
|)
)/
( Linebroken for clarity )
/([^\s]+(\s[^\s]+(\s[^\s]+(\s[^\s]+(\s[^\s]+|)|)|)|))/
( Actual )
I've used [^\s]+ here instead of your A-Z combo for succinctness, but the point here is the nested optional groups
ie:
(Hello( this( is( example))))
(Hello( this( is( example( two)))))
(Hello( this( is( better( example))))) three
(Hello( this( is())))
(Hello( this()))
(Hello())
( Note: this, while being convoluted, has the benefit that it will match each name into its own group )
If you want readable code:
$word = '[^\s]+';
$regex = "/($word(\s$word(\s$word(\s$word(\s$word|)|)|)|)|)/";
( it anchors around the (capture|) mantra of "get this, or get nothing" )
@Sir Psycho: Be careful about your assumptions here. What about hyphenated names? Dotted names (e.g. Brian R. Bondy) and so on?
Here's the answer that you're most likely looking for:
^[a-zA-Z']+(\s[a-zA-Z']+){0,4}$
That says (in English): "From start to finish, match one or more letters, there can also be a space followed by another 'name' up to four times."
BTW: Why do you want them to have apostrophes anywhere in the name?
^([a-zA-Z']+\s){0,4}[a-zA-Z']+$
This assumes you want 4 spaces inside this string (i.e. you have trimmed it)
Edit: If you want 4 spaces anywhere I'd recommend not using regex - you'd be better off using a substr_count (or the equivalent in your language).
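In Python, for instance, the counting approach might look like the sketch below. It assumes only literal spaces count (the original pattern used \s), and has_at_most_spaces is a made-up name:

```python
import re

# Same allowed characters as the original pattern, with a literal space
# in place of \s (an assumption, so that str.count(" ") sees every separator).
ALLOWED = re.compile(r"[a-zA-Z' ]+")


def has_at_most_spaces(name, max_spaces=4):
    """True if name uses only allowed characters and at most max_spaces spaces."""
    return ALLOWED.fullmatch(name) is not None and name.count(" ") <= max_spaces


print(has_at_most_spaces("a b c d e"))
print(has_at_most_spaces("a b c d e f"))
```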
I also agree with pipTheGeek that there are so many different ways of writing names that you're probably best off trusting the user to get their name right (although I have found that a lot of people don't bother using capital letters on ecommerce checkouts).
Match multiple whitespace followed by two characters at the end of the line.
Related problem ----
From a string, remove the trailing 2 characters when they are preceded by whitespace. For example, if the column contains this string -
"This is a long string with 2 chars at the end AB"
then AB should be removed while retaining the sentence.
Solution ----
select 'This is a long string with 2 chars at the end AB' as "C1",
       regexp_replace('This is a long string with 2 chars at the end AB',
                      '[[:space:]]+[a-zA-Z][a-zA-Z]$') as "C2" from dual;
Output ----
C1
This is a long string with 2 chars at the end AB
C2
This is a long string with 2 chars at the end
Analysis ----
The regular expression says: match and remove one or more occurrences (+) of a whitespace character ([[:space:]]) followed by two letters ([a-zA-Z][a-zA-Z]) at the end of the line ($).
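Outside the database, the same idea could be sketched in Python. strip_trailing_code is an illustrative name, and the trailing \s* is my addition so that whitespace after the two letters is tolerated:

```python
import re


def strip_trailing_code(text):
    """Remove whitespace plus a two-letter code from the end of text."""
    return re.sub(r"\s+[A-Za-z]{2}\s*$", "", text)


print(strip_trailing_code("This is a long string with 2 chars at the end   AB"))
```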
Hope this is useful.