Split string on un-escaped character in D - regex

What is the best way to split a string on an un-escaped character?
E.g. split this (raw) string
`example string\! it is!split in two parts`
on '!', so that it produces this array:
["example string! it is", "split in two parts"]
std.regex.split seems to be almost the right tool. There is a problem, though: the code below matches the correct split character, but it also consumes the last character of the left part.
auto text = `example string\! it is!split in two parts`;
return text.split(regex(`[^\\]!`)).map!`a.replace("\\!", "!")`.array;
The whole regex match is removed on split, so this array is the result:
["example string! it i", "split in two parts"]
What is the best way to get to the first array without iterating the string myself?

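Use a negative lookbehind, so that '!' only matches when it is not preceded by a backslash (the lookbehind itself consumes nothing):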
(?<!\\)\!
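For readers who want to try the idea outside D, here is a minimal sketch of the same lookbehind in Python. The function name split_unescaped is made up for this illustration, and the sketch assumes a backslash only ever escapes the separator (an escaped backslash such as `\\!` is not handled).

```python
import re

def split_unescaped(text, sep="!"):
    """Split on `sep` unless it is directly preceded by a backslash.

    Hypothetical helper for illustration; escaped backslashes are not handled.
    """
    # (?<!\\) is a zero-width negative lookbehind: it checks the previous
    # character without consuming it, so nothing extra is cut off.
    pattern = r"(?<!\\)" + re.escape(sep)
    parts = re.split(pattern, text)
    # Turn the remaining escaped separators back into plain ones.
    return [p.replace("\\" + sep, sep) for p in parts]

print(split_unescaped(r"example string\! it is!split in two parts"))
```

Because the lookbehind is zero-width, the last character of the left part is kept, which was the problem with `[^\\]!`.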

Related

End a regular expression pattern with a string

I have spent some time learning regular expressions, but there is a problem I cannot solve properly.
Let's assume the following 'string' (html-extract):
"{'2018-05-02', '2018-01-05', r, '2018-07-01', '2017-07-02', '2016-07-31' random_text XYCCC Letters and 55565798 ]}"
My intention is to extract all values from '2018-05-02' up to (and excluding) random_text. I tried to do this with the "anything but" construct [^a] (not a):
\'[^random]*
The above does not do the job, because random is not a string, but a set of characters, hence the 'r' in the string will split my extracted value.
If there is no r in the text before the word random_text, this would work fine:
\'[^r]*
Is there any way to use a specific string as the end of my sequence? E.g.:
start: \'
repeated characters unlike string: [^{my_string}]*
Appreciate any insight :)
This regex will do the job:
'.+'(?= random)
Just replace random with the string you want to exclude at the end.
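As a rough sketch of how this lookahead behaves, here it is in Python. The helper name extract_until and its stop parameter are assumptions made for this example; the pattern is the one from the answer, with the stop word escaped.

```python
import re

def extract_until(text, stop):
    """Return the quoted run that ends right before ' <stop>', or None.

    Hypothetical helper for illustration only.
    """
    # '.+' is greedy, so it backtracks to the last quote that is followed
    # by a space and the stop word; the lookahead (?= ...) consumes nothing.
    match = re.search(r"'.+'(?= " + re.escape(stop) + ")", text)
    return match.group(0) if match else None

s = ("{'2018-05-02', '2018-01-05', r, '2018-07-01', '2017-07-02', "
     "'2016-07-31' random_text XYCCC Letters and 55565798 ]}")
print(extract_until(s, "random"))
```

The stray r in the middle of the list no longer matters, because the end is defined by the whole word rather than by a character class.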

Why one word breaks all right output in regex (perl)?

I want to understand the situation with regular expression in Perl.
$str = "123-abc 23-rr";
I need to capture the words on both sides of each minus sign.
The regular expression is:
@mas=$str=~/(?:([\d\w]+)\-([\d\w]+))/gx;
It gives the right output: 123, abc, 23, rr.
But if I change the string a little and put one word at the start:
$str = "word 123-abc 23-rr";
and change my regexp to take this first word into account:
@mas=$str=~/\w+\s(?:\s*([\d\w]+)\-([\d\w]+))*/gx;
The output should be the same, but I only get: 23, rr. If I remove \s* or * the output is 123, abc, which is still not right. Does anyone know why?
Rather than making an ever more specific regex for an ever more specific string, consider taking advantage of the overall pattern.
Each piece is separated by whitespace.
The first piece is a word.
The rest are pairs separated by dashes.
First split the pieces on whitespace.
my @pieces = split /\s+/, $str;
Then remove the first piece; it doesn't have to be split.
my $word = shift @pieces;
Then split each remaining piece on - into pairs.
my %pairs = map { split /-/, $_ } @pieces;
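The same three steps can be sketched in Python for comparison. The function name parse_line is invented for this example, and it assumes every piece after the first contains a dash.

```python
def parse_line(line):
    """Split 'word a-b c-d' into the leading word and a dict of pairs.

    Hypothetical helper; assumes each piece after the first contains '-'.
    """
    pieces = line.split()          # split on runs of whitespace
    word = pieces.pop(0)           # the first piece is the plain word
    # Split each remaining piece once on '-' to get (left, right) pairs.
    pairs = dict(piece.split("-", 1) for piece in pieces)
    return word, pairs

print(parse_line("word 123-abc 23-rr"))
```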
For each match, each capture is returned.
In the first snippet, the pattern matches twice.
123-abc 23-rr
\_____/ \___/
There are two captures, so four (2*2=4) values are returned.
In the second snippet, the pattern matches once.
word 123-abc 23-rr
\________________/
There are two captures, so two (2*1=2) values are returned.
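The same effect can be reproduced in Python, where re.findall returns one tuple of captures per match. This is only an illustration of the counting argument above, not Perl itself; the variable names are made up. A capture group inside a repeated group keeps only the text from its last repetition, which is why 23, rr survives and 123, abc is lost.

```python
import re

# Pattern without the leading word: it matches twice, two captures each.
per_pair = re.findall(r"(\w+)-(\w+)", "123-abc 23-rr")

# Pattern with the leading word and a repeated group: it matches once,
# and each capture group only remembers its last repetition.
repeated = re.findall(r"\w+\s(?:\s*(\w+)-(\w+))*", "word 123-abc 23-rr")

print(per_pair)
print(repeated)
```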

Generalized Regex from a set of String

I need to find a way to automatically generate a regex that matches a set of strings.
For example, given this set of strings as input:
S = ["Casino Royale (1928)", "Mission Goldfinger", "A view to a kill"]
the idea is to iterate, starting with a regex that matches the first string:
regex1 = "\w{6}\s\w{6}\s\(\d{4}\)"
then generalize regex1 so that it also matches the second string:
regex2 = "\w{6,7}\s\w{6,10}(\s\(\d{4}\))?"
and then do the same with the last string, so the final output is:
regex_output = "\w{1,7}\s\w{4,10}(\s\w{2}\s\w\s\w{4}|\s\(\d{4}\))?"
I would like to know whether this is possible. Maybe it is a complexity theory problem.
Thanks in advance.
Use an alternation of literals:
^(?:\QCasino Royale (1928)\E|\QMission Goldfinger\E|\QA view to a kill\E)$
\Q...\E means the enclosed characters are matched literally. The group makes ^ and $ apply to every alternative, not only to the first and the last one.
This approach can of course handle an arbitrarily large list of strings.
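A minimal sketch of building such an alternation programmatically, in Python. Python's re module has no \Q...\E, so re.escape is used instead; the function name literal_alternation is made up for this example.

```python
import re

def literal_alternation(strings):
    """Compile a regex that matches any one of `strings` literally.

    Hypothetical helper for illustration only.
    """
    # re.escape plays the role of \Q...\E: metacharacters such as
    # '(' and ')' are matched as plain characters.
    body = "|".join(re.escape(s) for s in strings)
    return re.compile("(?:" + body + ")")

S = ["Casino Royale (1928)", "Mission Goldfinger", "A view to a kill"]
rx = literal_alternation(S)

# fullmatch anchors the whole alternation at both ends.
print([bool(rx.fullmatch(s)) for s in S])
print(bool(rx.fullmatch("Casino Royale")))
```

Note that this matches exactly the given strings and nothing else, so it does not generalize the way the regexes in the question do.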

R - regexp in each string of a table of char

I would like to apply a regex operation to each string of an array.
For instance, take the characters of each string that come before a '-'. The results will be stored in another array.
('Hello-1','Hi-2','Hola-3')
will give
('Hello','Hi','Hola')
Is there a way to do it in R without a loop?
Thanks!
Based on the updated question, we can match the character '-' followed by one or more characters until the end of the string and replace with ''.
sub('-.*$', '', test)
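For comparison only, the same substitution sketched in Python, applied to each element with a list comprehension. The variable names are made up for this example.

```python
import re

items = ["Hello-1", "Hi-2", "Hola-3"]

# Remove the first '-' and everything after it from each string.
prefixes = [re.sub(r"-.*$", "", item) for item in items]

print(prefixes)
```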

split text into words and exclude hyphens

I want to split a text into its individual words using regular expressions. The obvious solution would be to use the regex \\b; unfortunately, that also splits words on the hyphen.
So I am looking for an expression that does exactly the same as \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String [] b = s.split("\\b+");
for (int i = 0; i < b.length; i++) {
    System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
@Matmarbon's solution is already quite close, but it does not fit 100%. It gives me:
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
For anyone who needs this for another purpose (e.g. inserting something at the boundaries), the following is closer to an equivalent of the \b-based solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
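To see what splitting on `[^\w\-]+` does to the sentence from the question, here is a small Python sketch. The function name split_words is made up, and the sketch drops empty strings that appear when the text starts or ends with a separator. Unlike the \b split, punctuation and spaces are discarded rather than returned as tokens.

```python
import re

def split_words(text):
    """Split on runs of characters that are neither word characters nor '-'.

    Hypothetical helper for illustration only.
    """
    # Drop the empty strings produced at the start or end of the text.
    return [w for w in re.split(r"[^\w\-]+", text) if w]

s = ("This is my text! It uses some odd words like user-generated "
     "and need therefore a special regex.")
print(split_words(s))
```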
You can use this:
(?<!-)\\b(?!-)
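A sketch of the lookaround variant in Python, assuming Python 3.7 or later (earlier versions do not split on empty matches). The function name split_keep_hyphens is invented for this example, and whitespace-only pieces are filtered out to keep the output short.

```python
import re

def split_keep_hyphens(text):
    """Split at word boundaries that do not touch a hyphen.

    Hypothetical helper; relies on re.split handling empty matches
    (Python 3.7+).
    """
    # \b is a zero-width word boundary; the lookbehind and lookahead
    # reject boundaries directly after or before a '-'.
    tokens = re.split(r"(?<!-)\b(?!-)", text)
    # Drop empty and whitespace-only pieces.
    return [t for t in tokens if t.strip()]

print(split_keep_hyphens("like user-generated and"))
```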