Splitting Two Characters In a String - Perl - regex

I'm trying to split this string. Here's the code:
my $string = "585|487|314|1|1,651|365|302|1|1,585|487|314|1|1,651|365|302|1|1,656|432|289|1|1,136|206|327|1|1,585|487|314|1|1,651|365|302|1|1,585|487|314|1|1,651|365|302|1|1%656|432|289|1|1%136|206|327|1|1%654|404|411|1|1";
my @ids = split(",", $string);
What I want is to split only on % and , in the string. I was told that I could use a pattern, something like this: /[^a-zA-Z0-9_]/

Character classes can be used to represent a group of possible single characters that can match. And the ^ symbol at the beginning of a character class negates the class, saying "Anything matches except for ...." In the context of split, whatever matches is considered the delimiter.
That being the case, [^a-zA-Z0-9_] would match any character except for the ASCII letters 'a' through 'z', 'A' through 'Z', and the numeric digits '0' through '9', plus underscore. In your case, while this would correctly split on "," and "%" (since they're not included in a-z, A-Z, 0-9, or _), it would mistakenly also split on "|", as well as any other character not included in the character class you attempted.
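To see the over-splitting concretely, here is a small sketch in Python rather than Perl (used only for illustration; the character-class syntax is the same for this pattern). The shortened sample string is an assumption modeled on the one in the question.

```python
import re

# Shortened sample in the same shape as the question's string (assumption).
sample = "585|487|314|1|1,651|365|302|1|1%656|432|289|1|1"

# The negated class treats every character outside [a-zA-Z0-9_] as a
# delimiter, so it splits on "|" as well as on "," and "%".
pieces = re.split(r"[^a-zA-Z0-9_]", sample)
print(pieces)
```

Every number ends up as its own element, so the grouping into five-number records is lost.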
In your case it makes a lot more sense to be specific as to what delimiters to use, and to not use a negated class; you want to specify the exact delimiters rather than the entire set of characters that delimiters cannot be. So as mpapec stated in his comment, a better choice would be [%,].
So your solution would look like this:
my @ids = split /[%,]/, $string;
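For comparison, the same explicit-delimiter split can be sketched in Python (illustration only; the shortened sample string is an assumption, and re.split stands in for Perl's split):

```python
import re

# Shortened sample in the same shape as the question's string (assumption).
sample = "585|487|314|1|1,651|365|302|1|1%656|432|289|1|1"

# Split only on "%" or ",", leaving the pipe-delimited records intact.
ids = re.split(r"[%,]", sample)
print(ids)  # ['585|487|314|1|1', '651|365|302|1|1', '656|432|289|1|1']
```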
Once you split on '%' and ',', you'll be left with a bunch of substrings that look like this: 585|487|314|1|1 (or some variation on those numbers). In each case, it's five positive integers separated by '|' characters. It seems possible to me that you'll end up wanting to break those down as well by splitting on '|'.
You could build a single data structure in the form of a list of lists, where each top-level element represents one [,%]-delimited field and is a reference to an anonymous array holding that field's pipe-delimited values. The following code will build that structure:
my @ids = map { [ split /\|/, $_ ] } split /[%,]/, $string;
When that is run, you will end up with something like this:
@ids = (
[ '585', '487', '314', '1', '1' ],
[ '651', '365', '302', '1', '1' ],
# ...
);
Now each field within an ID can be inspected and manipulated individually.
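As a rough Python equivalent of the map/split combination (a sketch for illustration; the shortened sample string and the variable names are assumptions), the nested structure can be built with a list comprehension:

```python
import re

# Shortened sample in the same shape as the question's string (assumption).
sample = "585|487|314|1|1,651|365|302|1|1%656|432|289|1|1"

# Outer split on "%" or ",", inner split on "|": a list of lists of strings.
ids = [record.split("|") for record in re.split(r"[%,]", sample)]

# Individual fields can now be addressed by position, e.g. the third
# field of every record converted to an integer.
third_fields = [int(record[2]) for record in ids]
print(ids[0])        # ['585', '487', '314', '1', '1']
print(third_fields)  # [314, 302, 289]
```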
To understand more about how character classes work, you could check perlrequick, which has a nice introduction to character classes. And for more information on split, there's always perldoc -f split (as mentioned by mpapec). split is also discussed in chapter nine of the O'Reilly book, Learning Perl, 6th Edition.

Related

Split by regex with capturing groups in lookahead produces repeating fragments in results

I was hoping for a one-liner to insert thousands separators into string of digits with decimal separator (example: 78912345.12). My first attempt was to split the string in places where there is either 3 or 6 digits left until decimal separator:
console.log("5789123.45".split(/(?=([0-9]{3}\.|[0-9]{6}\.))/));
which gave me the following result (notice how fragments of original string are repeated):
[ '5', '789123.', '789', '123.', '123.45' ]
I found out that "problem" (please read problem here as my obvious misunderstanding) comes from using a group within lookahead expression. This simple expression works "correctly":
console.log("abcXdeYfgh".split(/(?=X|Y)/));
when executed prints:
[ 'abc', 'Xde', 'Yfgh' ]
But the moment I surround X|Y with parentheses:
console.log("abcXdeYfgh".split(/(?=(X|Y))/));
the resulting array looks like:
[ 'abc', 'X', 'Xde', 'Y', 'Yfgh' ]
Moreover, when I change the group to a non-capturing one, everything comes back to "normal":
console.log("abcXdeYfgh".split(/(?=(?:X|Y))/));
this yields again:
[ 'abc', 'Xde', 'Yfgh' ]
So, I could do the same trick (changing to a non-capturing group) in the original expression (and it indeed works), but I was hoping for an explanation of this behavior, which I cannot understand. I get identical results when trying the same thing in .NET, so it seems to be something fundamental about how regular expression lookaheads work. This is my question: why does a lookahead with capturing groups produce those "strange" results?
Capturing groups inside a regex pattern inside a regex split method/function make the captured texts appear as separate elements in the resulting array (for most of the major languages).
Here is C#/.NET reference:
If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
Here is JavaScript reference:
If separator is a regular expression that contains capturing parentheses, then each time separator is matched, the results (including any undefined results) of the capturing parentheses are spliced into the output array. However, not all browsers support this capability.
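Python's re.split documents the same rule, so the "plum-pear" example from the .NET reference can be sketched there as well (Python is used only for illustration; the variable names are arbitrary):

```python
import re

# Without a capturing group the delimiter is discarded.
without_group = re.split(r"-", "plum-pear")

# With capturing parentheses the matched delimiter is kept as an element.
with_group = re.split(r"(-)", "plum-pear")

print(without_group)  # ['plum', 'pear']
print(with_group)     # ['plum', '-', 'pear']
```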
Just a note: the same behavior is observed with
PHP (with preg_split and PREG_SPLIT_DELIM_CAPTURE flag):
print_r(preg_split("/(?<=(X))/","XYZ",-1,PREG_SPLIT_DELIM_CAPTURE));
// --> [0] => X, [1] => X, [2] => YZ
Ruby (with string.split):
"XYZ".split(/(?<=(X))/) # => X, X, YZ
But it is the opposite in Java, the captured text is not part of the resulting array:
System.out.println(Arrays.toString("XYZ".split("(?<=(X))"))); // => [X, YZ]
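And in Python, older versions of the re module (before 3.7) cannot split on a zero-width assertion, so the string does not get split at all there; from Python 3.7 on, re.split does split on empty matches and includes the captured text, like most of the languages above:
print(re.split(r"(?<=(X))","XXYZ")) # => ['XXYZ'] on old versions, ['X', 'X', 'X', 'X', 'YZ'] on 3.7+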
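The question's X|Y experiment can be reproduced in Python as a sketch, assuming Python 3.7 or later (where re.split splits on zero-width matches):

```python
import re

text = "abcXdeYfgh"

# Capturing group inside the lookahead: the split positions are the same,
# but each captured "X" or "Y" is spliced into the result as an extra element.
capturing = re.split(r"(?=(X|Y))", text)

# Non-capturing group: nothing is captured, so only the pieces remain.
non_capturing = re.split(r"(?=(?:X|Y))", text)

print(capturing)      # ['abc', 'X', 'Xde', 'Y', 'Yfgh']
print(non_capturing)  # ['abc', 'Xde', 'Yfgh']
```

The lookahead consumes nothing, so the "repeated" fragments are simply the captures placed next to the untouched text.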
Here is a simple way to do it in Javascript
number.toString().replace(/\B(?=(\d{3})+(?!\d))/g, ",")
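The same replace-based idea, sketched in Python with re.sub (the helper name add_thousands is hypothetical, and the sketch only deals with plain integer strings):

```python
import re

def add_thousands(digits: str) -> str:
    """Insert a comma at every non-boundary position that is followed by
    a whole number of three-digit groups and then no further digit."""
    return re.sub(r"\B(?=(\d{3})+(?!\d))", ",", digits)

print(add_thousands("1234567"))  # 1,234,567
```

Because a replacement is used instead of a split, the capturing group inside the lookahead does no harm here.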
Capture groups inside the split pattern add the captured text to the
result as extra elements, and that includes groups inside lookaheads.
You are on the right track but didn't have a natural anchor.
When every character in the string is of the same type (digits, in
your case) and you use lookaheads, it's not good enough to split
incrementally based on a run length of common characters.
The engine just bumps along one character at a time, splitting at each
position where the lookahead matches and including the captures as elements.
You could try consuming the capture in the process,
like (?=(\d{3}))\1 but that not only splits at the wrong place, it also
injects an empty element in the array.
The solution is to use the natural anchor, the dot, and split at
multiples of 3 digits before that anchor.
This forces the engine to seek to the points that are a multiple of
3 digits away from the anchor.
Then your problem is solved: no need for captures, and the split is perfect.
Regex: (?=(?:[0-9]{3})+\.)
Formatted:
(?=
(?: [0-9]{3} )+
\.
)
C#:
string[] ary = Regex.Split("51234555632454789123.45", @"(?=(?:[0-9]{3})+\.)");
int size = ary.Length;
for (int i = 0; i < size; i++)
    Console.WriteLine(" {0} = '{1}' ", i, ary[i]);
Output:
0 = '51'
1 = '234'
2 = '555'
3 = '632'
4 = '454'
5 = '789'
6 = '123.45'
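The dot-anchored split can also be tried in Python as a sketch, assuming Python 3.7 or later (earlier versions do not split on zero-width matches); joining the pieces gives the thousands-separated string the question was after:

```python
import re

number = "51234555632454789123.45"

# Split at every position followed by a whole number of three-digit
# groups and then the decimal point. The group is non-capturing, so no
# extra elements appear in the result.
parts = re.split(r"(?=(?:[0-9]{3})+\.)", number)
grouped = ",".join(parts)

print(parts)    # ['51', '234', '555', '632', '454', '789', '123.45']
print(grouped)  # 51,234,555,632,454,789,123.45
```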

Regex for letters plus space

I'm trying to use a regex for testing for only letters and spaces allowed in an input field. I tried using /^[a-zA-Z]*$/ but this doesn't allow spaces.
Updated this question. I'm looking to only have one space between words.
So for example I have a city name like New Orleans so only letters and one space between words should be allowed.
The [a-zA-Z] and \s character classes have the following caveats.
a-zA-Z may be too restrictive if you wish to accept Unicode, in which case, \p{Alpha} would be a better character class.
\s may be either too permissive or too restrictive depending on the Unicode or ASCII semantics under which your regex is operating. You may only want to just put a single space in the character class.
If we knew what language and whether or not you intend for Unicode semantics, that might help. Things aren't as simple as they were back in the ASCII-only days.
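As one illustration of the Unicode side of this, here is a sketch in Python that swaps the regex for str.isalpha(), which accepts Unicode letters. This is a substitute technique, not the \p{Alpha} class itself (Python's built-in re module does not support \p{...}), and the function name is hypothetical.

```python
def is_city_name(name: str) -> bool:
    """Accept words made of Unicode letters separated by exactly one space.

    Splitting on a literal single space turns leading, trailing or doubled
    spaces into empty strings, and "".isalpha() is False, so those inputs
    are rejected.
    """
    return all(word.isalpha() for word in name.split(" "))

print(is_city_name("New Orleans"))   # True
print(is_city_name("São Paulo"))     # True
print(is_city_name("New  Orleans"))  # False
```

Hyphenated or apostrophe-containing names would be rejected by this sketch, just as they would by the ASCII class.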
Update:
Here's a more robust solution that still has the original Unicode caveats, but allows for a single space character between words, and nowhere else.
use strict;
use warnings;
my @cities = (
'New Orleans',
'New Jersey',
'Salt Lake City',
'Sacramento',
' Leading space',
'Trailing space ',
'Two  consecutive spaces',
'  Two leading spaces',
'Two trailing spaces  ',
);
foreach my $city ( @cities ) {
print "<<$city>>: ";
if( $city =~ m/^[a-zA-Z]+(?:\s[a-zA-Z]+)*$/ ) {
print "match\n";
}
else {
print "reject\n";
}
}
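A rough Python translation of the same test, for readers who want to try the pattern outside Perl. Two deliberate differences, both assumptions of this sketch: a literal space replaces \s so that tabs and newlines are not accepted, and fullmatch replaces the ^...$ anchors.

```python
import re

# One or more letters, then any number of (single space + letters) groups.
CITY = re.compile(r"[a-zA-Z]+(?: [a-zA-Z]+)*")

cities = [
    "New Orleans",
    "Salt Lake City",
    "Sacramento",
    " Leading space",
    "Trailing space ",
    "Two  consecutive spaces",
]

results = {city: bool(CITY.fullmatch(city)) for city in cities}
for city, ok in results.items():
    print(f"<<{city}>>: {'match' if ok else 'reject'}")
```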
Matches multiple groups of letters separated by single spaces.
(\w+ )+\w

Regex to create url friendly string

I want to create a url friendly string (one that will only contain letters, numbers and hyphens) from a user input to :
remove all characters which are not a-z, 0-9, space or hyphens
replace all spaces with hyphens
replace multiple hyphens with a single hyphen
Expected outputs :
my project -> my-project
test project -> test-project
this is # long str!ng with spaces and symbo!s -> this-is-long-strng-with-spaces-and-symbos
Currently i'm doing this in 3 steps :
$identifier = preg_replace('/[^a-zA-Z0-9\-\s]+/','',strtolower($project_name)); // remove all characters which are not a-z, 0-9, space or hyphens
$identifier = preg_replace('/(\s)+/','-',strtolower($identifier)); // replace all spaces with hyphens
$identifier = preg_replace('/(\-)+/','-',strtolower($identifier)); // replace all hyphens with single hyphen
Is there a way to do this with one single regex ?
Yeah, @Jerry is correct in saying that you can't do this in one replacement as you are trying to replace a particular string with two different items (a space or dash, depending on context). I think Jerry's answer is the best way to go about this, but something else you can do is use preg_replace_callback. This allows you to evaluate an expression and act on it according to what the match was.
$string = 'my project
test project
this is # long str!ng with spaces and symbo!s';
$string = preg_replace_callback('/([^A-Z0-9]+|\s+|-+)/i', function($m){$a = '';if(preg_match('/(\s+|-+)/i', $m[1])){$a = '-';}return $a;}, $string);
print $string;
Here is what this means:
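/([^A-Z0-9]+|\s+|-+)/i This looks for any one of three alternatives (a run of characters that are not letters or numbers, a run of whitespace, a run of hyphens) and if it matches any of them, it passes the match along to the function for evaluation. Note that the first alternative already covers whitespace and hyphens, so a run such as " # " arrives as one match.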
function($m){ ... } This is the function that will evaluate the matches. $m will hold the matches that it found.
$a = ''; Set a default of an empty string for the replacement
if(preg_match('/(\s+|-+)/i', $m[1])){$a = '-';} If our match (the value stored in $m[1]) contains any whitespace or hyphens, then set $a to a dash instead of an empty string.
return $a; Since this is a function, we will return the value and that value will be plopped into the string wherever it found a match.
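The callback idea carries over to other regex libraries. Here is a sketch in Python, where re.sub accepts a function as the replacement (the function names are hypothetical, and the input is lowercased first as in the question's code):

```python
import re

def slugify(text: str) -> str:
    def replace(match: re.Match) -> str:
        # A run that contains whitespace or a hyphen becomes one dash;
        # a run of other symbols only is dropped.
        return "-" if re.search(r"[\s-]", match.group(0)) else ""

    # Each run of characters that are not a-z or 0-9 is handed to replace().
    return re.sub(r"[^a-z0-9]+", replace, text.lower())

print(slugify("this is # long str!ng with spaces and symbo!s"))
# this-is-long-strng-with-spaces-and-symbos
```

Leading or trailing whitespace in the input would still produce a leading or trailing dash; this sketch does not trim it.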
I don't think there's a way of doing that in a single replacement, but you could reduce the number of replacements and, in the extreme case, use a one-liner like this:
$text=preg_replace("/[\s-]+/",'-',preg_replace("/[^a-zA-Z0-9\s-]+/",'',$text));
It first replaces everything that is not alphanumeric, whitespace or a dash with nothing, then replaces each run of whitespace and dashes with a single dash.
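The two nested replacements look like this as a Python sketch (the function name is hypothetical; lowercasing is added because the question's expected output is lowercase):

```python
import re

def slugify_two_step(text: str) -> str:
    # Step 1: drop everything that is not a letter, digit, whitespace or dash.
    cleaned = re.sub(r"[^a-z0-9\s-]+", "", text.lower())
    # Step 2: collapse each run of whitespace and dashes into a single dash.
    return re.sub(r"[\s-]+", "-", cleaned)

print(slugify_two_step("this is # long str!ng with spaces and symbo!s"))
# this-is-long-strng-with-spaces-and-symbos
```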
Since you want to replace each thing with something different, you will have to do this in multiple iterations.
Sorry D:

split text into words and exclude hyphens

I want to split a text into its single words using regular expressions. The obvious solution would be to use the regex \\b, but unfortunately this one also splits words on the hyphen.
So I am searching for an expression that does exactly the same as \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String [] b = s.split("\\b+");
for (int i = 0; i < b.length; i++){
System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
@Matmarbon's solution is already quite close, but it does not fit 100%; it gives me
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
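A quick way to try this pattern is the following Python sketch (used for illustration; the question is about Java). Empty strings are filtered out because the text ends with a delimiter.

```python
import re

s = ("This is my text! It uses some odd words like user-generated "
     "and need therefore a special regex.")

# Split on runs of characters that are neither word characters nor hyphens.
# The trailing "." leaves an empty string at the end, which is dropped.
words = [w for w in re.split(r"[^\w\-]+", s) if w]
print(words)
```

Unlike the \b split, punctuation such as "!" and "." is treated as a delimiter here and does not show up as separate elements.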
For anyone who needs this for another purpose (e.g. inserting something at each boundary), this is closer to an equivalent of the \b-based solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
You can use this:
(?<!-)\\b(?!-)
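The lookaround version can be checked with a Python sketch, assuming Python 3.7 or later so that re.split accepts zero-width matches. The short sample sentence is an assumption, and whitespace-only pieces are filtered out.

```python
import re

s = "words like user-generated and more."

# Split at word boundaries, except those directly before or after a hyphen.
pieces = re.split(r"(?<!-)\b(?!-)", s)

# Drop the empty and whitespace-only pieces that boundary splitting leaves.
tokens = [p for p in pieces if p.strip()]
print(tokens)  # ['words', 'like', 'user-generated', 'and', 'more', '.']
```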

c# regex @"[;]+"

In c#, there is a line of code such as:
string[] values = Regex.Split(fielddata, @"[;]+");
On what values does this split? I'm getting a bit confused by the mixture of the literal introduced by the @ sign and what the square brackets and + mean here. Any ideas?
@ introduces a verbatim string literal, meaning you don't have to escape special characters such as backslashes. As Asad already said, it splits on one or more consecutive semicolons, where + stands for 1 or more (regex grammar)
Here's a runnable example: http://ideone.com/whLqUe
string input = "a;b; ;c;;;d";
string[] values = Regex.Split(input, @";+");
foreach (var value in values)
Console.WriteLine(value);
outputting (the third element is the single space between two semicolons, so its line looks empty)
a
b
 
c
d
[...] is a character class matching any single character inside the square brackets. In this case it is redundant, just writing @";+" would mean exactly the same.
+ repeats the previous character or pattern 1 or more times.
So this splits on consecutive ; (as many as possible).
The verbatim string (@"...") is used simply as a matter of good practice. Once you need to escape things inside regular expressions, it gets ugly if you use a normal string. Again, in this particular example, it would not make a difference to leave out the @. But it's something worth getting used to.
Those brackets are unnecessary. That regex is equivalent to the following:
string[] values = Regex.Split(fielddata, @";+");
It'll split on any run of one or more semicolons, so that "1;2;;3;;4;;;5;;6;7" would return an array:
['1', '2', '3', '4', '5', '6', '7']
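The equivalence of the two patterns is easy to check in any regex flavor. Here is a sketch in Python (used only for illustration; re.split plays the role of Regex.Split):

```python
import re

data = "1;2;;3;;4;;;5;;6;7"

# A character class with a single member matches the same thing as the
# bare character, so both patterns split on runs of semicolons.
with_class = re.split(r"[;]+", data)
without_class = re.split(r";+", data)

# Anything between two runs, even a lone space, is kept as an element.
spaced = re.split(r";+", "a;b; ;c;;;d")

print(with_class)  # ['1', '2', '3', '4', '5', '6', '7']
print(spaced)      # ['a', 'b', ' ', 'c', 'd']
```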
The split method will split fielddata on 1 or more semicolons. The @ symbol means that you do not have to escape characters and the string is verbatim what is between the double quotes.
if fielddata = "a;b;c;;d;e;;;f"
then
values = ["a","b","c","d","e","f"]