Regex get text before and after a hyphen - regex

I have this string:
"Common Waxbill - Estrilda astrild"
How can I write 2 separate regexes for the words before and after the hyphen? The output I would want is:
"Common Waxbill"
"Estrilda astrild"

This is quite simple:
.*(?= - ) # matches everything before " - "
(?<= - ).* # matches everything after " - "
See this tutorial on lookaround assertions.

If you cannot use look-behinds, but your string is always in the same format and cannot contain more than the single hyphen, you could use
^[^-]*[^ -] for the first one and \w[^-]*$ for the second one (or [^ -][^-]*$ if the first non-space after the hyphen is not necessarily a word character).
A little bit of explanation:
^[^-]*[^ -] matches the start of the string (anchor ^), followed by any number of characters that are not a hyphen, and finally a character that is neither a hyphen nor a space (just to exclude the last space from the match).
[^ -][^-]*$ takes the same approach the other way around: it first matches a character that is neither a space nor a hyphen, followed by any number of characters that are not a hyphen, and finally the end of the string (anchor $). \w[^-]*$ is basically the same, but uses the stricter \w instead of [^ -]. This again excludes the whitespace after the hyphen from the match.
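As a quick check of the patterns above, here is a minimal sketch in Python (the question does not name a language, so Python's re module is an assumption). It runs both the lookaround versions and the versions without look-behind against the sample string.

```python
import re

text = "Common Waxbill - Estrilda astrild"

# Lookaround versions: match on either side of " - " without consuming it.
before = re.search(r".*(?= - )", text).group(0)
after = re.search(r"(?<= - ).*", text).group(0)

# Versions without look-behind, assuming exactly one hyphen in the string.
before_plain = re.search(r"^[^-]*[^ -]", text).group(0)
after_plain = re.search(r"[^ -][^-]*$", text).group(0)

print(before, "|", after)
print(before_plain, "|", after_plain)
```

Both pairs give the same two parts for this input; the second pair breaks as soon as either name contains a hyphen of its own.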

Another solution is to string split on the hyphen and remove white space.
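A minimal sketch of that idea, again assuming Python since no language was given: split on the hyphen, then strip the surrounding whitespace from each piece.

```python
text = "Common Waxbill - Estrilda astrild"

# Split on the hyphen, then trim the spaces left on either side of it.
parts = [part.strip() for part in text.split("-")]

print(parts)
```

If a name may itself contain a hyphen, splitting on the full " - " separator instead would be the safer choice.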

Two alternate methods
The main challenge of your Question is that you want two separate items. This means that your process is dependent on another language. RegEx itself does not parse or separate a string; it only explains what we are looking for. The language you are using will make the actual separation. My answer gets your results in PHP, but other languages should have comparable solutions.
If you want to just do the job in your Question, and if you're using PHP...
Method 1: explode("-", $list); -> $array[]
This is useful if your list is longer than two items:
// Generate our list
$list = "Common Waxbill - Estrilda astrild";
$item_arr = explode("-", $list);
// Iterate each
foreach($item_arr as $item) {
    echo $item.'<br>';
}
// See what we have
echo '
<pre>Access array directly:</pre>'.
'<pre>'.$item_arr[0].'x <--notice the trailing space</pre>'.
'<pre>'.$item_arr[1].' <--notice the preceding space</pre>';
...You could clean up each item and reassign them to a new array with trim(). This would get the text your Question asked for (no extra spaces before or after)...
// Create a workable array
$i=0; // Start our array key counter
foreach($item_arr as $item) {
    $clean_arr[$i++] = trim($item);
}
// See what we have
echo '
<pre>Access after cleaning:</pre>'.
'<pre>'.$clean_arr[0].'x <--no space</pre>'.
'<pre>'.$clean_arr[1].' <--no space</pre>';
Common Waxbill
Estrilda astrild
Access array directly:
Common Waxbill x <--notice the trailing space
Estrilda astrild <--notice the preceding space
Access after cleaning:
Common Waxbillx <--no space
Estrilda astrild <--no space
Method 2: substr(strrpos()) & substr(strpos())
This is useful if your list will only have two items:
// Generate our list
$list = "Common Waxbill - Estrilda astrild";
// Start splitting
$first_item = trim(substr($list, strrpos($list, '-') + 1));
$second_item = trim(substr($list, 0, strpos($list, '-')));
// See what we have
echo "<pre>substr():</pre>" . $first_item . "<br>" . $second_item;
Estrilda astrild
Common Waxbill
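Note that strrpos() and strpos() are different: strrpos() finds the last occurrence of the hyphen and strpos() finds the first, so the two substr() calls take different arguments.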
If you're not using PHP, but you want to do the job in some other language without depending on RegEx, knowing the language would be helpful.
Generally, programming languages come with tools for jobs like this out of box, which is part of why people choose the languages they do.


TCL_REGEXP:: How to grep a line from variable that looks similar in TCL

My TCL script:
set test {
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
}
set lines [split $test \n]
set data [join $lines :]
if { [regexp {number n1.*(numbers .*)} $data x y]} {
    puts "numbers are : $y"
}
Current output if I run the above script:
C:\Documents and Settings\Owner\Desktop>tclsh stack.tcl
numbers are : numbers 56,4,5,5:
C:\Documents and Settings\Owner\Desktop>
Expected output:
In the script's regexp, if I specify "number n1", it should print "numbers are : numbers 2,3,4,5,6".
If I specify "number n2", it should print "numbers are : numbers 56,4,5,5:".
Right now it always prints the final line (numbers 56,4,5,5:). How can I resolve this?
Try using
regexp {number n1.*?(numbers .*)\n} $test x y
(note that I'm matching against test. There is no need to replace the newlines.)
There are two differences from your pattern.
The question mark behind the first star makes the match non-greedy.
There is a newline character behind the capturing parentheses.
Your pattern told regexp to match from the first occurrence of number n1 up to the last occurrence of numbers, and it did. This is because the .* match between them was greedy, i.e. it matched as many characters as it could, which meant it went past the first numbers.
Making the match non-greedy means that the pattern will match from the first occurrence of number n1 up to the following occurrence of numbers, which was what you wanted.
After numbers, there is another .* match which is a bit troublesome. If it were greedy, it would match everything up to the end of the variable content. If it were non-greedy, it wouldn't match any characters, since matching a zero-length string satisfies the match. Another problem is that the Tcl RE engine doesn't really allow for switching back from non-greedy mode.
You can fix this by forcing the pattern to match one character past the text that you want the .* to match, making the zero-length match invalid. Matching a newline (\n) or space (\s) character should work. (This of course means that there must be a newline / other space character after every data field: if a numbers field is the last character range in the variable that field can't be located.)
Documentation: regular expression syntax, regexp
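The greedy versus non-greedy difference described above can be sketched in Python. This is an illustration only: Python's re engine is not Tcl's, and unlike Tcl it allows greedy and non-greedy quantifiers to be mixed in one pattern, so the capture here uses [^\n]* rather than the trailing \n trick.

```python
import re

test = """
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
"""

# Greedy: .* runs to the end and backtracks to the LAST "numbers " line.
greedy = re.search(r"number n1.*(numbers [^\n]*)", test, re.DOTALL)

# Non-greedy: .*? stops at the FIRST "numbers " line after "number n1".
lazy = re.search(r"number n1.*?(numbers [^\n]*)", test, re.DOTALL)

print("greedy:", greedy.group(1))
print("lazy:  ", lazy.group(1))
```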
To use a Tcl variable in a regular expression is easy. On one level anyway: you put the regular expression in double quotes so that you have standard Tcl variable substitution inside it prior to it being passed to the RE engine:
# ...
set target "n1"
if { [regexp "number $target.*(numbers .*)" $data x y]} {
# ...
The hard part is that you've got to remember that switching to "…" from {…} will affect the whole of that word, and that the substitutions are of regular expression fragments. We usually recommend using {…} because that's easier to get consistently and unconfusingly right in the majority of cases.
Let's illustrate how this can get annoying. In your specific case, you may want to actually use this:
if { [regexp "number $target\[^:\]*:(numbers \[^:\]*)" $data x y]} {
The character sets here exclude the : (which you've — unnecessarily — used as a newline replacement) but because […] is also standard Tcl metasyntax, you have to backslash-quote it. (Things get even more annoying when you want to always use the contents of the variable as a literal even though they might include RE metasyntax characters; you need a regsub call to tidy things up. And you start to potentially make Tcl's RE cache less efficient too.)
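The same two concerns (building the pattern from a variable, and treating the variable's contents as a literal) look like this in Python. It is a sketch for comparison rather than Tcl code; numbers_for is a hypothetical helper name, and re.escape plays the role that the regsub clean-up plays in Tcl.

```python
import re

# The colon-joined data from the question.
data = "number n1:numbers 2,3,4,5,6:d for doctor:number n2:numbers 56,4,5,5:"

def numbers_for(target, data):
    """Return the 'numbers ...' field that follows 'number <target>', or None."""
    # re.escape makes any RE metacharacters in target match literally.
    pattern = r"number " + re.escape(target) + r"[^:]*:(numbers [^:]*)"
    match = re.search(pattern, data)
    return match.group(1) if match else None

print(numbers_for("n1", data))
print(numbers_for("n2", data))
```

Without the escaping step, a target such as "n." would be read as "n followed by any character" and would match n1.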

Regex: Removing Space Between Quotes, And Stopping Before a Colon (With Yahoo Pipes)

I've been working on this for a while, but it's beyond my understanding of regex.
I'm using Yahoo Pipes on an RSS, and I want to create hashtags from titles; so, I'd like to remove space from everything between quotes, but, if there's a colon within the quotes, I only want the space removed between the words before the colon.
And, it would be great if I could also capture the unspaced words as a group, to be able to use: #$1 to output the hashtag in one step.
So, something like:
"The New Apple: Worlds Within Worlds" Before We Begin...
Could be substituted like #$1 - with this result:
"#TheNewApple: Worlds Within Worlds" Before We Begin...
After some work, I was able to come up with this regex:
("Review" was a word that often came before colons and wouldn't be stripped, if it were later in the title; that's what that's for, but I would like to not require that, to be more universal)
But, it has two problems:
I have to use multiple steps. The result of that regex would be:
"TheNewApple: Worlds Within Worlds" Before We Begin...
And I could then add another regex step, to put the hash # in front
But, it only works if the quotes are first, and I don't know how to fix that...
You can do this all in one step with regex, with a caveat. You run into problems with a repeated capturing group because only the last iteration is available in the replacement string. Searching for ( (\w+))+ and replacing with $2 will replace all the words with just the last match - not what we want.
The way around this is to repeat the pattern an arbitrary number of times that will suffice for your use. Each separate group can be referenced.
Search: "(\w+)(?: (\w+))?(?: (\w+))?(?: (\w+))?(?: (\w+))?(?: (\w+))?
Replace: "#$1$2$3$4$5$6
This will replace up to 6-word titles, exactly as you need them. First, "(\w+) matches any word following a quote. In the replacement string, it is put back as "#$1, adding the hashtag. The rest is a repeated list of (?: (\w+))? matches, each matching a possible space and word. Notice the space is part of a non-capturing group; only the word is part of the inner capture group. In the replacement string, I have $1$2$3$4$5$6, which puts back the words, without the spaces. Notice that a colon will not match any part of this, so it will stop once it hits a colon.
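Yahoo Pipes cannot run code, but the search pattern itself can be tried out elsewhere. The sketch below uses Python's re module as a stand-in; PATTERN and hashtag are hypothetical names. A replacement function is used so that optional groups that did not take part in the match are simply skipped.

```python
import re

# The quote and first word, then up to five optional " word" groups.
PATTERN = re.compile(r'"(\w+)' + r'(?: (\w+))?' * 5)

def hashtag(title):
    """Join the captured words without spaces and put '"#' in front."""
    return PATTERN.sub(
        lambda m: '"#' + "".join(word for word in m.groups() if word),
        title,
    )

print(hashtag('"The New Apple: Worlds Within Worlds" Before We Begin...'))
print(hashtag('this has "Two Words"'))
```

Titles with more than six words before the colon would only have their first six words joined, which is the limit the answer describes.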
"The New Apple: Worlds Within Worlds" Before We Begin...
"The New Apple" Before We Begin...
"One: Two"
only "One" word
this has "Two Words"
"The Great Big Apple Dumpling"
"The Great Big Apple Dumpling Again: Part 2"
Result:
"#TheNewApple: Worlds Within Worlds" Before We Begin...
"#TheNewApple" Before We Begin...
"#One: Two"
only "#One" word
this has "#TwoWords"
"#TheGreatBigAppleDumpling"
"#TheGreatBigAppleDumplingAgain: Part 2"
You can match the text with
then use some programming language to output the result like this:
'"#' + removeSpace($1) + $2 + '"' + $3
I have no idea what language you're using, but this seems like a poor choice for regex. In Python I'd do this:
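# Python 3
titles = ['''"The New Apple: Worlds Within Worlds" Before We Begin...''',
          '''"Made Up Title: For Example Only" So We Can Continue...''']
hashtagged_titles = []
for title in titles:
    # Everything before the first colon becomes the hashtag
    hashtagme, *restofstring = title.split(":")
    # Drop the opening quote, strip the spaces, and put '"#' in front
    hashtag = '"#' + hashtagme[1:].replace(" ", "")
    # Re-join the remainder in case the title held more than one colon
    result = "{}:{}".format(hashtag, ":".join(restofstring))
    hashtagged_titles.append(result)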
Do a global search for
\ (?=.*:)
and replace each match with nothing.
You'll need a second search on the results of that if you want to capture "TheNewApple" as a single word.

Regex to create url friendly string

I want to create a url friendly string (one that will only contain letters, numbers and hyphens) from a user input to :
remove all characters which are not a-z, 0-9, space or hyphens
replace all spaces with hyphens
replace multiple hyphens with a single hyphen
Expected outputs :
my project -> my-project
test project -> test-project
this is # long str!ng with spaces and symbo!s -> this-is-long-strng-with-spaces-and-symbos
Currently i'm doing this in 3 steps :
$identifier = preg_replace('/[^a-zA-Z0-9\-\s]+/','',strtolower($project_name)); // remove all characters which are not a-z, 0-9, space or hyphens
$identifier = preg_replace('/(\s)+/','-',strtolower($identifier)); // replace all spaces with hyphens
$identifier = preg_replace('/(\-)+/','-',strtolower($identifier)); // replace all hyphens with single hyphen
Is there a way to do this with one single regex ?
Yeah, @Jerry is correct in saying that you can't do this in one replacement, as you are trying to replace a particular string with two different items (nothing or a dash, depending on context). I think Jerry's answer is the best way to go about this, but something else you can do is use preg_replace_callback. This allows you to evaluate an expression and act on it according to what the match was.
$string = 'my project
test project
this is # long str!ng with spaces and symbo!s';
$string = preg_replace_callback('/([^A-Z0-9]+|\s+|-+)/i', function($m){$a = '';if(preg_match('/(\s+|-+)/i', $m[1])){$a = '-';}return $a;}, $string);
print $string;
Here is what this means:
/([^A-Z0-9]+|\s+|-+)/i This looks for any one of three alternatives (a run of characters that are not a number or letter, a run of one or more whitespace characters, a run of one or more hyphens) and if it matches any of them, it passes the match along to the function for evaluation.
function($m){ ... } This is the function that will evaluate the matches. $m will hold the matches that it found.
$a = ''; Set a default of an empty string for the replacement
if(preg_match('/(\s+|-+)/i', $m[1])){$a = '-';} If our match (the value stored in $m[1]) contains any whitespace or hyphens, then set $a to a dash instead of an empty string.
return $a; Since this is a function, we will return the value and that value will be plopped into the string wherever it found a match.
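For comparison, the same callback idea can be sketched in Python, where re.sub accepts a function as the replacement. This is a rough equivalent under that assumption, not a translation of the PHP above; slugify is a hypothetical name.

```python
import re

def slugify(name):
    """Lowercase the name, drop symbols, and turn space/hyphen runs into one hyphen."""
    def replace(match):
        run = match.group(0)
        # A run containing whitespace or a hyphen becomes a single hyphen;
        # a run of other symbols is removed.
        return "-" if re.search(r"[\s-]", run) else ""
    return re.sub(r"[^a-z0-9]+", replace, name.lower())

print(slugify("my project"))
print(slugify("this is # long str!ng with spaces and symbo!s"))
```

A run such as " # " counts as one match, so it collapses to a single hyphen without needing a separate step for repeated hyphens.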
Here is a working demo
I don't think there's one way of doing that, but you could reduce the number of replaces and in an extreme case, use a one liner like that:
It first removes all non-alphanumeric/space/dash with nothing, then replaces all spaces and multiple dashes with a single one.
Since you want to replace each thing with something different, you will have to do this in multiple iterations.
Sorry D:

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
    console.log(m[0]);
This one comes from nanorc.sample available in many linux distros. It is used for syntax highlighting of C style strings
As provided by ePharaoh, the answer is
To have the above apply to either single quoted or double quoted strings, use
Most of the solutions provided here use alternative repetition paths, i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some pattern compilers implement this using recursion.
Java for instance:
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.
should work with any quoted string
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
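Two of the shapes discussed above can be compared side by side. The sketch below uses Python's re module; the alternation pattern is the one from the JavaScript test earlier, while the unrolled pattern is the commonly used "unrolled loop" form and is an assumption here, not necessarily the exact pattern any of the answers posted.

```python
import re

# The string from the question, kept raw so the backslashes survive.
s = r''' function(){ return " It\'s big \"problem "; }'''

# Alternation form: an ordinary character OR a backslash-escaped pair.
alternation = re.compile(r'"(?:[^"\\]|\\.)*"')

# Unrolled form: runs of ordinary characters separated by escaped pairs.
unrolled = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

print(alternation.search(s).group(0))
print(unrolled.search(s).group(0))
```

Both find the same substring here; the unrolled form avoids the (A|B)* shape that some engines implement with recursion.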
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, for it is known beforehand that a string without a closing quote wouldn't match in any case.
This one works perfectly on PCRE and does not fail with a StackOverflow.
Every quoted string starts with Char: " ;
It may contain any number of any characters: .*? {Lazy match}; ending with non escape character [^\\];
Statement (2) is Lazy(!) optional because string can be empty(""). So: (.*?[^\\])??
Finally, every quoted string ends with Char("), but it can be preceded by an even number of escape sign pairs (\\\\)+; and it is Greedy(!) optional: ((\\\\)+)?+ {Greedy matching}, because the string can be empty or without ending pairs!
An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Lets say you had the following string; String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
    // Match close
    '([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
    '((?:' +
        // Match escaped close quote
        '(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
        // Match everything that's not the close quote
        '(?:(?!\\1).)' +
    '){0,})' +
    // Match open
    '(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
    'g'
);
// Reverse the matched strings.
const matches = myString
    // Reverse the string.
    .split('').reverse().join('')
    // '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
    // Match the quoted
    .match(regExp)
    // ['"hctam "\dluohs"\ siht"', '"dluohs"']
    // Reverse the matches
    .map(x => x.split('').reverse().join(''))
    // ['"this \"should\" match"', '"should"']
    // Re order the matches
    .reverse();
    // ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
The regexp can easily be broken into three pieces, as follows:
# Part 1
(['"])         # Match a closing quotation mark " or '
(?!            # As long as it's not followed by
  (?:[\\]{2})* # any number of pairs of escape characters
  [\\]         # and a single escape
  (?![\\])     # As long as that's not followed by an escape
)
# Part 2
((?:           # Match inside the quotes
(?:            # Match option 1:
  \1           # Match the closing quote
  (?=          # As long as it's followed by
    (?:\\\\)*  # any number of pairs of escape characters
    \\         # and a single escape
    (?![\\])   # As long as that's not followed by an escape
  )
)|             # OR
(?:            # Match option 2:
  (?!\1).      # Any character that isn't the closing quote
)
)*)            # Match the group 0 or more times
# Part 3
(\1)           # Match an open quotation mark that is the same as the closing one
(?!            # As long as it's not followed by
  (?:[\\]{2})* # any number of pairs of escape characters
  [\\]         # and a single escape
  (?![\\])     # As long as that's not followed by an escape
)
This is probably a lot clearer in image form: generated using Jex's Regulex
Image on github (JavaScript Regular Expression Visualizer.)
Here is a gist of an example function using this concept that's a little more advanced:
Here is one that works with both " and ', and you can easily add others at the start.
It uses the backreference (\1) to match exactly what is in the first group (" or ').
One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and linear, manual seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
A more extensive version of
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)
If it is searched from the beginning, maybe this can work?
I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
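The same two steps can be sketched in Python to show what each one does. The sample line is made up for illustration; the replacements mirror the Java calls above.

```python
import re

# A made-up input line containing escaped quotes inside a quoted string.
line = r'print("say \"hi\"", "x, y") # "comment"'

# Step 1: replace escaped quotes with something easier to handle.
step1 = line.replace('\\"', "'")

# Step 2: every remaining quoted string is now free of inner quotes.
step2 = re.sub(r'"[^"]*"', '"x"', step1)

print(step1)
print(step2)
```

Note that step 1 changes the text, so this suits cases like the one described (neutralising quoted strings before further parsing) rather than extracting the original string contents.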
If your IDE is IntelliJ Idea, you can forget all these headaches and store your regex into a String variable and as you copy-paste it inside the double-quote it will automatically change to a regex acceptable format.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.
" It\'s big \"problem "
match result:
It\'s big \"problem
" It\'s big \"problem "
match result:
" It\'s big \"problem "
Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)

regex to match a maximum of 4 spaces

I have a regular expression to match a persons name.
So far I have ^([a-zA-Z\'\s]+)$ but I'd like to add a check to allow for a maximum of 4 spaces. How do I amend it to do this?
Edit: What I meant was 4 spaces anywhere in the string.
Don't attempt to regex validate a name. People are allowed to call themselves what ever they like. This can include ANY character. Just because you live somewhere that only uses English doesn't mean that all the people who use your system will have English names. We have even had to make the name field in our system Unicode. It is the only Unicode type in the database.
If you care, we actually split the name at " " and store each name part as a separate record, but we have some very specific requirements that mean this is a good idea.
PS. My step mum has 5 spaces in her name.
^ # Start of string
(?!\S*(?:\s\S*){5}) # Negative look-ahead for five spaces.
([a-zA-Z\'\s]+)$ # Original regex
Or in one line:
^(?!\S*(?:\s\S*){5})([a-zA-Z\'\s]+)$
If there are five or more spaces in the string, five will be matched by the negative lookahead, and the whole match will fail. If there are four or fewer, the original regex will be matched.
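A short sketch of that lookahead in action, using Python's re module as an assumed test bed (the question does not say which language is in use); NAME and is_valid are hypothetical names.

```python
import re

# Negative lookahead rejects five whitespace characters anywhere in the string;
# the rest is the original character-class check.
NAME = re.compile(r"^(?!\S*(?:\s\S*){5})([a-zA-Z'\s]+)$")

def is_valid(name):
    return NAME.match(name) is not None

for candidate in ["Mary Ann O'Neil", "a b c d e", "a b c d e f"]:
    print(repr(candidate), is_valid(candidate))
```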
Screw the regex.
Using a regex here seems to be creating a problem for a solution instead of just solving a problem.
This task should be 'easy' for even a novice programmer, and the novel idea of regex has polluted our minds!.
1: Get Input
2: Trim White Space
3: If this makes sense, trim out any 'bad' characters.
4: Use the "split" utility provided by your language to break it into words
5: Return the first 5 Words.
What do you mean, screw the regex? You're obviously a VB programmer.
Regex is the most efficient way to work with strings. Learn them.
No. Php, toyed a bit with ruby, now going manically into perl.
There are some things (like this case) where the regex-based alternative is computationally and logically far too complex for the task.
I've parse entire php source files with regex, I'm not exactly a novice in their use.
But there are many cases, such as this, where you're employing a logging company to prune your rose bush.
I could do all steps 2 to 5 with regex of course, but they would be simple and atomic regex, with no weird backtracking syntax or potential for recursive searching.
The steps 1 to 5 I list above have a known scope, known range of input, and there's no ambiguity to how it functions. As to your regex, the fact you have to get contributions of others to write something so simple is proving the point.
I see somebody marked my post as offensive, I am somewhat unhappy I can't mark this fact as offensive to me. ;)
Proof Of Pudding:
sub getNames{
    my @args = @_;
    my $text = shift @args;
    my $num = shift @args;
    # Trim Whitespace from Head/End
    $text =~ s/^\s*//;
    $text =~ s/\s*$//;
    # Trim Bad Characters (??)
    $text =~ s/[^a-zA-Z\'\s]//g;
    # Tokenise By Space
    my @words = split( /\s+/, $text );
    #return 0..n
    return @words[ 0 .. $num - 1 ];
} ## end sub getNames
print join ",", getNames " Hello world this is a good test", 5;
>> Hello,world,this,is,a
If there is anything ambiguous to anybody how that works, I'll be glad to explain it to them. Noted that I'm still doing it with regexps. Other languages I would have used their native "trim" functions provided where possible.
Bollocks -->
I first tried this approach. This is your brain on regex. Kids, don't do regex.
This might be a good start
( Linebroken for clarity )
( Actual )
I've used [^\s]+ here instead of your A-Z combo for succinctness, but the point here is the nested optional groups
(Hello( this( is( example))))
(Hello( this( is( example( two)))))
(Hello( this( is( better( example))))) three
(Hello( this( is())))
(Hello( this()))
( Note: this, while being convoluted, has the benefit that it will match each name into its own group )
If you want readable code:
$word = '[^\s]+';
$regex = "/($word(\s$word(\s$word(\s$word(\s$word|)|)|)|)|)/";
( it anchors around the (capture|) mantra of "get this, or get nothing" )
@Sir Psycho : Be careful about your assumptions here. What about hyphenated names? Dotted names (e.g. Brian R. Bondy) and so on?
Here's the answer that you're most likely looking for:
That says (in English): "From start to finish, match one or more letters, there can also be a space followed by another 'name' up to four times."
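The pattern this answer refers to did not survive in the text, so the sketch below is only a guess at a pattern that fits the English description: one or more letters, then a space plus another name, up to four times. It uses Python's re module, and NAME and is_valid are hypothetical names.

```python
import re

# Assumed reconstruction: a name, then up to four more names,
# each preceded by exactly one space.
NAME = re.compile(r"^[a-zA-Z]+(?: [a-zA-Z]+){0,4}$")

def is_valid(name):
    return NAME.match(name) is not None

for candidate in ["Anna", "Anna Maria", "a b c d e", "a b c d e f"]:
    print(repr(candidate), is_valid(candidate))
```

Unlike the lookahead version, this form also rejects leading, trailing, and doubled spaces, because each space must sit between two names.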
BTW: Why do you want them to have apostrophes anywhere in the name?
This assumes you want 4 spaces inside this string (i.e. you have trimmed it)
Edit: If you want 4 spaces anywhere I'd recommend not using regex - you'd be better off using a substr_count (or the equivalent in your language).
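In Python, for instance, the counterpart of substr_count is str.count. A minimal sketch, with has_at_most_four_spaces as a hypothetical helper name:

```python
def has_at_most_four_spaces(name):
    # str.count counts non-overlapping occurrences of the substring.
    return name.count(" ") <= 4

print(has_at_most_four_spaces("Juan Carlos de la Cruz"))
print(has_at_most_four_spaces("one two three four five six"))
```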
I also agree with pipTheGeek that there are so many different ways of writing names that you're probably best off trusting the user to get their name right (although I have found that a lot of people don't bother using capital letters on ecommerce checkouts).
Match multiple whitespace followed by two characters at the end of the line.
Related problem ----
From a string, remove trailing 2 characters preceded by multiple white spaces... For example, if the column contains this string -
" 'This is a long string with 2 chars at the end AB "
then, AB should be removed while retaining the sentence.
Solution ----
select 'This is a long string with 2 chars at the end AB' as "C1",
regexp_replace('This is a long string with 2 chars at the end AB',
'[[:space:]]+[a-zA-Z]{2}$') as "C2" from dual;
Output ----
This is a long string with 2 chars at the end AB
This is a long string with 2 chars at the end
Analysis ----
The regular expression matches one or more whitespace characters ([[:space:]]+) followed by exactly two letters ([a-zA-Z]{2}) at the end of the line ($). Because regexp_replace is called without a replacement string, the matched text is removed.
Hope this is useful.
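The same clean-up can be tried outside the database. Here is a minimal sketch with Python's re module, assuming the column value has already been read into a string; the optional \s* also tolerates trailing whitespace after the two characters.

```python
import re

text = "This is a long string with 2 chars at the end    AB"

# One or more whitespace characters, two letters, optional trailing
# whitespace, all anchored at the end of the string.
cleaned = re.sub(r"\s+[a-zA-Z]{2}\s*$", "", text)

print(repr(cleaned))
```

Be aware that any string ending in an ordinary two-letter word (for example "where to go") would lose that word as well, so the pattern only suits columns where the trailing pair is known to be a code.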