perl split strange behavior - regex

I apologize in advance: this is probably a very basic question with an obvious solution that I, as a Perl beginner, am missing. It may already have been answered on Stack Overflow, but I do not know what exactly to search for.
I have a string like:
$s = FOO: < single blankspace> BAR <some whitespace character> some more text with whitespace that can span over multiple lines, i.e. has \n in them ;
#please excuse the lack of quotes and the descriptions in angle brackets; in my code the string is defined correctly, and in place of <blankspace> I have the actual ASCII 32 character, etc.
Now I want to split the $s, in this way:
($instType, $inst, $trailing) = split(/\s*/, $s, 3);
#please note that i do not use the my keyword as it is not in a subroutine
#but i tested with my, it does not change the behavior
I would expect $instType to take the value FOO: without any surrounding space (the actual test string contains a colon and, to the best of my knowledge, it will remain in $instType). Similarly, I would expect $inst to take the value BAR, without any surrounding spaces, and $trailing to take the rest of the string.
However, I am getting:
$instType takes F, that is, just the single character,
$inst takes O, the single character in the 2nd position of the string,
$trailing takes O: BAR and the rest.
How do I address the issue?
PS perl is 5.18.0

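The problem is the quantifier *, which means zero or more and therefore also matches zero whitespace characters. Use + instead, which means one or more: split(/\s+/, $s, 3).
Note that there are exactly zero whitespace characters between F and O, so \s* matches there and split breaks the string at that point.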
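To see the difference between the two quantifiers outside Perl, here is a minimal sketch in Python (my choice for illustration; the question itself is about Perl). The sample string and variable names are made up, and maxsplit=2 plays the role of the limit 3 in the Perl call.

```python
import re

# Hypothetical sample string in the shape described in the question.
s = "FOO: BAR\tsome more text\nspanning lines"

# \s* can match the empty string, so it "succeeds" even at a position
# with no whitespace at all, such as the very start of the string.
zero_width = re.match(r"\s*", s)
print(repr(zero_width.group(0)))

# \s+ needs at least one whitespace character, so fields break only there.
# maxsplit=2 yields at most three fields.
inst_type, inst, trailing = re.split(r"\s+", s, maxsplit=2)
print(inst_type, inst, repr(trailing))
```

The newline inside the trailing text is left alone because splitting stops after the second separator.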

You wrote:
#please note that i do not use the my keyword as it is not in a subroutine
#but i tested with my, it does not change the behavior
You can, and should, use my outside of subroutines, too. Using that in conjunction with use strict prevents silly errors like this:
$some_field = 'bar';
if ( $some_feild ) { ... }
If those statements were separated, it could be awfully hard to track down that bug.

Title Case Conversion in SPARQL Query with REGEX

I am one inch away from the solution to my problem. I am attempting title case conversion of strings retrieved via SPARQL. I am using the REPLACE function in combination with LCASE and REGEX:
BIND (replace(lcase(?label), "(\\b[a-z](?!\\s))", ucase("$1") ) as ?title_case)
lcase(?label): all characters in the string becomes lowercase
(\\b[a-z](?!\\s)): matches the first letter of each word in the string
ucase($1): the backreference to the first letter matched, which acts as the replacement after being turned into upper case.
Expected Result: animal husbandry methods becomes Animal Husbandry Methods
That solution is working almost right, but not quite, for reasons beyond my comprehension; check here an example at work.
When you run the query you won't notice anything different in the ?title_case, but if you edit the ucase("$1") for ucase("aaa") you see it magically replacing correctly the first letter of each word:
Result: animal husbandry methods becomes AAAnimal AAAusbandry AAAethods
It seems to me the UCASE function does not have any effect on the backreference $1.
Can anyone explain why, and what can be done to rectify this behavior?
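You can use the SUBSTR() function to work around the issue.
Eg: BIND (REPLACE(LCASE(?label), "(\\b[a-z](?!\\s))", UCASE(SUBSTR(?label, 1, 1)) ) as ?title_case)
Note that this inserts the first character of the whole label at every match, so it only gives the right result for single-word labels.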
Function calls in SPARQL follow the conventions of most programming languages: the inner functions are evaluated first, and their return values are then passed as arguments to the outer function. replace here takes 3 strings: the input string, the pattern, and the replacement. ucase is evaluated independently of how its result is used; it simply converts its argument to uppercase and, surprisingly, the uppercase of $1 is $1!
In other languages, what you'd usually do is use some overload of the function that accepts a function/expression instead of the string as the replacement, so that you could call anything from within. That is not possible in SPARQL, all the replace function can do is insert the capture unmodified.
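For comparison, this is roughly what the function-as-replacement approach looks like in Python (chosen only for illustration; it is not SPARQL and would apply only if the results are post-processed outside the query). The function name title_case is made up.

```python
import re

def title_case(label: str) -> str:
    """Lowercase the label, then uppercase the first letter of each word."""
    # Same pattern as in the question. The callback receives each match,
    # so the uppercase conversion is applied to the matched letter itself
    # rather than to the literal text "$1".
    return re.sub(r"\b[a-z](?!\s)", lambda m: m.group(0).upper(), label.lower())

print(title_case("animal husbandry methods"))
```

Because of the (?!\s) lookahead taken over from the question, a one-letter word followed by a space is left in lower case.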
I am afraid what you want to do is not perfectly achievable in SPARQL alone. Your options are:
Use a SPARQL extension that contains a function that makes it possible, if supported by the endpoint.
If your query is a part of a larger pipeline, convert the results in another way, for example using XSLT.
Since you only care about [a-z], you can simply expand out all the letters and replace them one by one: replace(replace(lcase(?label), "(\\ba(?!\\s))", "A" ), "(\\bb(?!\\s))", "B" ) and so on. Not a very elegant or performant solution, but it gets the job done.
A shorter option is to use a pattern like ^(.*?)(\b[a-z](?!\s))(.*)$ to split the string into 3 parts, which you can extract with replacements to $1, $2 and $3, respectively. Concatenate the first part with the uppercase of the second part, and repeat the whole process for the last part. You will again have to repeat the patterns, but this time it is the same pattern so there is a potential for optimization. A downside is that you have to end this "recursion" somewhere, so you can only replace a fixed number of words.

Regex to match hexadecimal and integer numbers [duplicate]

In a regular expression, I need to know how to match one thing or another, or both (in order). But at least one of the things needs to be there.
For example, the following regular expression
/^([0-9]+|\.[0-9]+)$/
will match
234
and
.56
but not
234.56
While the following regular expression
/^([0-9]+)?(\.[0-9]+)?$/
will match all three of the strings above, but it will also match the empty string, which we do not want.
I need something that will match all three of the strings above, but not the empty string. Is there an easy way to do that?
UPDATE:
Both Andrew's and Justin's below work for the simplified example I provided, but they don't (unless I'm mistaken) work for the actual use case that I was hoping to solve, so I should probably put that in now. Here's the actual regexp I'm using:
/^\s*-?0*(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})+)(?:\.[0-9]*)?(\s*|[A-Za-z_]*)*$/
This will match
45
45.988
45,689
34,569,098,233
567,900.90
-9
-34 banana fries
0.56 points
but it WON'T match
.56
and I need it to do this.
The fully general method, given regexes /^A$/ and /^B$/ is:
/^(A|B|AB)$/
i.e.
/^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$/
Note the others have used the structure of your example to make a simplification. Specifically, they (implicitly) factorised it, to pull out the common [0-9]* and [0-9]+ factors on the left and right.
The working for this is:
All the elements of the alternation end in [0-9]+, so pull that out: /^(|\.|[0-9]+\.)[0-9]+$/
Now we have the possibility of the empty string in the alternation, so rewrite it using ? (i.e. use the equivalence (|a|b) = (a|b)?): /^(\.|[0-9]+\.)?[0-9]+$/
Again, an alternation with a common suffix (\. this time): /^((|[0-9]+)\.)?[0-9]+$/
The pattern (|a+) is the same as a*, so, finally: /^([0-9]*\.)?[0-9]+$/
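A quick way to convince yourself that each rewriting step preserves the language is to run all five forms against the same inputs. This is a small sketch in Python (my choice of language; the sample strings are made up), not a proof.

```python
import re

# The forms from the derivation above, from the general one to the final one.
forms = [
    r"^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$",
    r"^(|\.|[0-9]+\.)[0-9]+$",
    r"^(\.|[0-9]+\.)?[0-9]+$",
    r"^((|[0-9]+)\.)?[0-9]+$",
    r"^([0-9]*\.)?[0-9]+$",
]
samples = ["234", ".56", "234.56", "", ".", "234.", "1.2.3"]

# One row of True/False per form; all rows should be identical.
results = [[re.search(f, s) is not None for s in samples] for f in forms]
for row in results:
    print(row)
```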
Nice answer by huon (and a bit of brain-twister to follow it along to the end). For anyone looking for a quick and simple answer to the title of this question, 'In a regular expression, match one thing or another, or both', it's worth mentioning that even (A|B|AB) can be simplified to:
A|A?B
Handy if B is a bit more complex.
Now, as c0d3rman's observed, this, in itself, will never match AB. It will only match A and B. (A|B|AB has the same issue.) What I left out was the all-important context of the original question, where the start and end of the string are also being matched. Here it is, written out fully:
^(A|A?B)$
Better still, just switch the order as c0d3rman recommended, and you can use it anywhere:
A?B|A
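The effect of the ordering is easy to check. A minimal sketch in Python (used here only for illustration), with A and B taken as literal letters:

```python
import re

text = "AB"

# Unanchored: the first alternative that succeeds wins, so A alone is taken.
first = re.search(r"A|A?B", text).group(0)

# With the longer alternative first, the whole of AB is matched.
second = re.search(r"A?B|A", text).group(0)

# Anchored: the $ forces backtracking into the second alternative.
anchored = re.search(r"^(A|A?B)$", text).group(0)

print(first, second, anchored)
```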
Yes, you can match all of these with such an expression:
/^[0-9]*\.?[0-9]+$/
Note, it also doesn't match the empty string (your last condition).
Sure. You want the optional quantifier, ?.
/^(?=.)([0-9]+)?(\.[0-9]+)?$/
The above is slightly awkward-looking, but I wanted to show you your exact pattern with some ?s thrown in. In this version, (?=.) makes sure it doesn't accept an empty string, since I've made both clauses optional. A simpler version would be this:
/^\d*\.?\d+$/
This satisfies your requirements, including preventing an empty string.
Note that there are many ways to express this. Some are long and some are very terse, but they become more complex depending on what you're trying to allow/disallow.
Edit:
If you want to match this inside a larger string, I recommend splitting on spaces and testing the results with /^\d*\.?\d+$/. Otherwise, you'll risk either matching stuff like aaa.123.456.bbb or missing matches (trust me, you will. JavaScript's lack of lookbehind support ensures that it will be possible to break any pattern I can think of).
If you know for a fact that you won't get strings like the above, you can use word breaks instead of ^$ anchors, but it will get complicated because there's no word break between . and (a space).
/(\b\d+|\B\.)?\d*\b/g
That ought to do it. It will block stuff like aaa123.456bbb, but it will allow 123, 456, or 123.456. It will allow aaa.123.456.bbb, but as I've said, you'll need two steps if you want to comprehensively handle that.
Edit 2: Your use case
If you want to allow whitespace at the beginning, negative/positive marks, and words at the end, those are actually fairly strict rules. That's a good thing. You can just add them on to the simplest pattern above:
/^\s*[-+]?\d*\.?\d+[a-z_\s]*$/i
Allowing thousands groups complicates things greatly, and I suggest you take a look at the answer I linked to. Here's the resulting pattern:
/^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$/i
The \b ensures that the numeric part ends with a digit, and is followed by at least one whitespace.
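Here is a small check of that last pattern against the strings from the question, sketched in Python rather than JavaScript (an assumption on my part; the pattern itself is unchanged and the /i flag becomes re.IGNORECASE). The rejected list is made up for illustration.

```python
import re

NUMBER = re.compile(
    r"^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$",
    re.IGNORECASE,
)

# Strings from the question, including the .56 case that used to fail.
accepted = ["45", "45.988", "45,689", "34,569,098,233", "567,900.90",
            "-9", "-34 banana fries", "0.56 points", ".56"]
# Hypothetical inputs that should be refused.
rejected = ["", "banana", "12abc"]

for s in accepted + rejected:
    print(repr(s), bool(NUMBER.search(s)))
```

The empty string is refused because \b cannot match where there is no word character at all.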
Maybe this helps (to give you the general idea):
(?:((?(digits).^|[A-Za-z]+)|(?<digits>\d+))){1,2}
This pattern matches characters, digits, or digits following characters, but not characters following digits.
The pattern matches aa, aa11, and 11, but not 11aa, aa11aa, or the empty string.
Don't be puzzled by the ".^", which means "any character followed by the start of a line"; it is intended to prevent any match at all.
Be warned that this does not work with all flavors of regex, your version of regex must support (?(named group)true|false).

VB.NET Use Regex to split string on single commas, but not on double commas

There's a file I'm trying to use for a list of strings that has the following rules:
Cannot begin or end with an unescaped comma.
A comma is escaped by a preceding comma.
Strings are separated by unescaped commas.
Everything else is absolutely face-value.
I've been fiddling around with some VB.NET code to parse a file like this and split it up into either a String() or a List(Of String), but it's gotten to be a little annoying. It's not that I can't figure this out; it's that I don't want to write crap code. If it's unnecessarily confusing, unnecessarily slow, or anything else like that, it's not good enough.
Now, I know this almost starts to sound a little like a Code Review question, but I'm really starting to think that maybe a good regex would work better than trying to do this programmatically. Unfortunately regexes are not easy to work with, and while using one to tell it to escape on a comma may be a trivial matter, getting it to also ignore double commas and such is a bit more of an issue, at least for somebody who's not used to regexes.
How do you do this (properly) in VB.NET? In particular, I'm having a little bit of trouble putting together a wild card that'll match anything at all but a comma. It's also taking me a little bit to find out whether #1 has to be verified programmatically, or whether it can be done in the regex itself at the same time as the split operation.
EDIT
I just "woke up" and realized that this syntax is ambiguous, since in an odd-numbered series of three or more commas, you don't know what's escaped and what isn't. I'm just going to accept the current answer and move on.
Haven't used VB.NET in a long time ... but I wouldn't go the RegEx way.
What about splitting the string by "," ...
Dim parts As String() = s.Split(New Char() {","c})
You will get a list of items, now you only need to take care of the empty items (escaped commas) and join them with the correct preceding item.
PS: not sure if split gives you empty items in case of ",,"
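For what it's worth, the regex route the question asks about can be sketched with two lookarounds. This is in Python rather than VB.NET (my assumption, for illustration only), the function name split_fields is made up, and runs of three or more commas are deliberately not handled, since the question's edit notes that they are ambiguous.

```python
import re

def split_fields(line):
    """Split on commas that are neither preceded nor followed by a comma."""
    parts = re.split(r"(?<!,),(?!,)", line)
    # Un-escape: each doubled comma stands for one literal comma.
    return [part.replace(",,", ",") for part in parts]

print(split_fields("a,,b,c,d,,e"))
```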

What do we need Lookahead/Lookbehind Zero Width Assertions for?

I've just learned about these two concepts in more detail. I've always been good with RegEx, and it seems I've never seen the need for these 2 zero width assertions.
I'm pretty sure I'm wrong, but I do not see why these constructs are needed. Consider this example:
Match a 'q' which is not followed by a 'u'.
2 strings will be the input:
Iraq
quit
With negative lookahead, the regex looks like this:
q(?!u)
Without it, it looks like this:
q[^u]
For the given input, both of these regex give the same results (i.e. matching Iraq but not quit) (tested with perl). The same idea applies to lookbehinds.
Am I missing a crucial feature that makes these assertions more valuable than the classic syntax?
Why your test probably worked (and why it shouldn't)
The reason you were able to match Iraq in your test might be that your string contained a \n at the end (for instance, if you read it from the shell). If you have a string that ends in q, then q[^u] cannot match it as the others said, because [^u] matches a non-u character - but the point is there has to be a character.
What do we actually need lookarounds for?
Obviously in the above case, lookaheads are not vital. You could work around this by using q(?:[^u]|$), so that we match only if q is followed by a non-u character or the end of the string. There are much more sophisticated uses for lookaheads though, which become a pain if you do them without lookaheads.
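The trailing-newline effect and the workaround can be checked directly. A minimal sketch in Python (the question used Perl; the variable names are made up):

```python
import re

# No character follows the q, so [^u] has nothing to match.
no_newline = re.search(r"q[^u]", "Iraq")

# A trailing newline is a non-u character, so this one succeeds.
with_newline = re.search(r"q[^u]", "Iraq\n")

# The lookahead only asserts that no u follows; it needs no character.
lookahead = re.search(r"q(?!u)", "Iraq")

# The workaround without lookahead: a non-u character or the end of the string.
workaround = re.search(r"q(?:[^u]|$)", "Iraq")
quit_case = re.search(r"q(?:[^u]|$)", "quit")

print(no_newline, with_newline, lookahead, workaround, quit_case)
```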
This answer tries to give an overview of some important standard situations which are best solved with lookarounds.
Let's start with looking at quoted strings. The usual way to match them is with something like "[^"]*" (not with ".*?"). After the opening ", we simply repeat as many non-quote characters as possible and then match the closing quote. Again, a negated character class is perfectly fine. But there are cases, where a negated character class doesn't cut it:
Multi-character delimiters
Now what if we don't have double-quotes to delimit our substring of interest, but a multi-character delimiter. For instance, we are looking for ---sometext---, where single and double - are allowed within sometext. Now you can't just use [^-]*, because that would forbid single -. The standard technique is to use a negative lookahead at every position, and only consume the next character, if it is not the beginning of ---. Like so:
---(?:(?!---).)*---
This might look a bit complicated if you haven't seen it before, but it's certainly nicer (and usually more efficient) than the alternatives.
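A short sketch of that pattern in action, in Python (chosen for illustration; the sample text is made up):

```python
import re

text = "before ---some-text--with dashes--- after ---x---"

# At every position, consume one character only if "---" does not start there.
matches = re.findall(r"---(?:(?!---).)*---", text)
print(matches)
```

Single and double dashes inside the delimited text are kept, and each match stops at the first closing ---.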
Different delimiters
You get a similar case, where your delimiter is only one character but could be one of two (or more) different characters. For instance, say in our initial example, we want to allow for both single- and double-quoted strings. Of course, you could use '[^']*'|"[^"]*", but it would be nice to treat both cases without an alternative. The surrounding quotes can easily be taken care of with a backreference: (['"])[^'"]*\1. This makes sure that the match ends with the same character it began with. But now we're too restrictive - we'd like to allow " in single-quoted and ' in double-quoted strings. Something like [^\1] doesn't work, because a backreference will in general contain more than one character. So we use the same technique as above:
(['"])(?:(?!\1).)*\1
That is after the opening quote, before consuming each character we make sure that it is not the same as the opening character. We do that as long as possible, and then match the opening character again.
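Sketched in Python (my choice of language; the sample text is made up), the backreference version picks up both kinds of quoted string and leaves the other kind of quote alone inside each:

```python
import re

# Group 1 remembers which quote opened the string; (?!\1) refuses to
# consume that same quote, and the final \1 matches it as the closer.
QUOTED = re.compile(r"""(['"])(?:(?!\1).)*\1""")

text = """say "it's" and 'a "b" c' here"""
quoted = [m.group(0) for m in QUOTED.finditer(text)]
print(quoted)
```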
Overlapping matches
This is a (completely different) problem that can usually not be solved at all without lookarounds. If you search for a match globally (or want to regex-replace something globally), you may have noticed that matches can never overlap. I.e. if you search for ... in abcdefghi you get abc, def, ghi and not bcd, cde and so on. This can be a problem if you want to make sure that your match is preceded (or surrounded) by something else.
Say you have a CSV file like
aaa,111,bbb,222,333,ccc
and you want to extract only fields that are entirely numerical. For simplicity, I'll assume that there is no leading or trailing whitespace anywhere. Without lookarounds, we might go with capturing and try:
(?:^|,)(\d+)(?:,|$)
So we make sure that we have the start of a field (start of string or ,), then only digits, and then the end of a field (, or end of string). Between that we capture the digits into group 1. Unfortunately, this will not give us 333 in the above example, because the , that precedes it was already part of the match ,222, - and matches cannot overlap. Lookarounds solve the problem:
(?<=^|,)\d+(?=,|$)
Or if you prefer double negation over alternation, this is equivalent to
(?<![^,])\d+(?![^,])
In addition to being able to get all matches, we get rid of the capturing which can generally improve performance. (Thanks to Adrian Pronk for this example.)
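The difference shows up immediately on the sample row. This sketch is in Python (an assumption for illustration); Python's re module only accepts fixed-width lookbehind, so it uses the double-negation form rather than (?<=^|,).

```python
import re

row = "aaa,111,bbb,222,333,ccc"

# Consuming version: the comma after 222 is used up, so 333 is never seen.
consuming = re.findall(r"(?:^|,)(\d+)(?:,|$)", row)

# Lookaround version: the commas are only inspected, never consumed.
lookaround = re.findall(r"(?<![^,])\d+(?![^,])", row)

print(consuming)
print(lookaround)
```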
Multiple independent conditions
Another very classic example of when to use lookarounds (in particular lookaheads) is when we want to check multiple conditions on an input at the same time. Say we want to write a single regex that makes sure our input contains a digit, a lower case letter, an upper case letter, a character that is none of those, and no whitespace (say, for password security). Without lookarounds you'd have to consider all permutations of digit, lower case/upper case letter, and symbol. Like:
\S*\d\S*[a-z]\S*[A-Z]\S*[^0-9a-zA-Z]\S*|\S*\d\S*[A-Z]\S*[a-z]\S*[^0-9a-zA-Z]\S*|...
Those are only two of the 24 necessary permutations. If you also want to ensure a minimum string length in the same regex, you'd have to distribute those in all possible combinations of the \S* - it simply becomes impossible to do in a single regex.
Lookahead to the rescue! We can simply use several lookaheads at the beginning of the string to check all of these conditions:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)
Because the lookaheads don't actually consume anything, after checking each condition the engine resets to the beginning of the string and can start looking at the next one. If we wanted to add a minimum string length (say 8), we could simply append (?=.{8}). Much simpler, much more readable, much more maintainable.
Important note: This is not the best general approach to check these conditions in any real setting. If you are making the check programmatically, it's usually better to have one regex for each condition, and check them separately - this lets you return a much more useful error message. However, the above is sometimes necessary, if you have some fixed framework that lets you do validation only by supplying a single regex. In addition, it's worth knowing the general technique, if you ever have independent criteria for a string to match.
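Both variants can be put side by side. The following is only a sketch in Python (my choice of language); the rule messages, the names SINGLE, RULES and problems, and the minimum length of 8 are illustrative assumptions, not a recommended password policy.

```python
import re

# Single-regex variant: every condition is a lookahead at the start.
SINGLE = re.compile(
    r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)(?=.{8})"
)

# One regex per condition, each paired with a message for the user.
RULES = [
    (re.compile(r"\d"), "needs a digit"),
    (re.compile(r"[a-z]"), "needs a lower-case letter"),
    (re.compile(r"[A-Z]"), "needs an upper-case letter"),
    (re.compile(r"[^0-9a-zA-Z]"), "needs a symbol"),
    (re.compile(r"\A\S*\Z"), "must not contain whitespace"),
    (re.compile(r".{8}"), "needs at least 8 characters"),
]

def problems(password):
    """Return the messages of all rules the password fails."""
    return [msg for rx, msg in RULES if not rx.search(password)]

print(bool(SINGLE.search("Passw0rd!")), problems("Passw0rd!"))
print(bool(SINGLE.search("password")), problems("password"))
```

The single regex can only say yes or no, while the per-rule list says which conditions failed.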
I hope these examples give you a better idea of why people would like to use lookarounds. There are a lot more applications (another classic is inserting commas into numbers), but it's important that you realise that there is a difference between (?!u) and [^u] and that there are cases where negated character classes are not powerful enough at all.
q[^u] will not match "Iraq" because it will look for another symbol.
q(?!u) however, will match "Iraq":
regex = /q[^u]/
/q[^u]/
regex.test("Iraq")
false
regex.test("Iraqf")
true
regex = /q(?!u)/
/q(?!u)/
regex.test("Iraq")
true
Another thing, along with what others mentioned: with a negative lookahead you can negate a sequence of consecutive characters. For example, you can negate ui, whereas with [^...] you cannot negate ui, only u or i individually; and if you try [^ui]{2}, you will also negate uu, ii and iu.
The whole point is to not "consume" the next character(s), so that it can be e.g. captured by another expression that comes afterwards.
If they're the last expression in the regex, then what you've shown are equivalent.
But e.g. q(?!u)([a-z]) would let the non-u character be part of the next group.

Can I use a boolean AND condition in a regular expression?

Say, if I have a DN string, something like this:
OU=Karen,OU=Office,OU=admin,DC=corp,DC=Fabrikam,DC=COM
How to make a regular expression to pick only DNs that have both OU=Karen and OU=admin?
For reference, this is the regex lookahead solution, which matches the whole string if it contains the required parts in any order. If you do not store the pattern in some sort of configurable variable, I'd stick with nhahtdh's solution, though.
/^(?=.*OU=Karen)(?=.*OU=admin).*$/
^ - line start
(?= - start zero-width positive lookahead
.* - anything or nothing
OU=Karen - literal
) - end zero-width positive lookahead
- place as many positive or negative look-aheads as required
.* - the whole line
$ - line end
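A quick sketch of the lookahead pattern applied to a few DNs, in Python (chosen for illustration; the second and third DNs are made up). Note that, as written, OU=admin would also match inside a longer value such as OU=administrator.

```python
import re

BOTH = re.compile(r"^(?=.*OU=Karen)(?=.*OU=admin).*$")

dns = [
    "OU=Karen,OU=Office,OU=admin,DC=corp,DC=Fabrikam,DC=COM",
    "OU=admin,OU=Karen,DC=corp",   # order does not matter
    "OU=Karen,OU=Office,DC=corp",  # no OU=admin
]
selected = [dn for dn in dns if BOTH.search(dn)]
print(selected)
```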
You realise you don't have to do everything with a single regex, or even one regex.
Regular expressions are very good for catching classes of input but, if you have two totally fixed strings, you can just use a contains()-type method for both of them and then AND the results.
Alternatively, if you need to use regexes, you can do that twice (once per string) and AND the results together.
If you need to do it with a single regex, you could try something like:
,OU=Karen,.*,OU=admin,|,OU=admin,.*,OU=Karen,
but you'll then have to also worry about when those stanzas appear at the start or end of the line, and all sorts of other edge cases (one or both at start or end, both next to each other, names like Karen7 or administrator-lesser, and so on).
Having to allow for all possibilities will probably end up with something monstrous like:
^OU=Karen(,[^,]*)*,OU=admin,|
^OU=Karen(,[^,]*)*,OU=admin$|
,OU=Karen(,[^,]*)*,OU=admin,|
,OU=Karen(,[^,]*)*,OU=admin$|
^OU=admin(,[^,]*)*,OU=Karen,|
^OU=admin(,[^,]*)*,OU=Karen$|
,OU=admin(,[^,]*)*,OU=Karen,|
,OU=admin(,[^,]*)*,OU=Karen$
although, with an advanced enough regex engine, this may be reducible to something smaller (although it would be unlikely to be any faster, simply because of all the forward-looking/back-tracking).
One way that could be improved without a complex regex is to massage your string slightly before-hand so that boundary checks aren't needed:
newString = "," + origString.replace (",", ",,") + ","
so that it starts and ends with a comma and all commas within it are duplicated:
,OU=Karen,,OU=Office,,OU=admin,,DC=corp,,DC=Fabrikam,,DC=COM,
Then you need only check for the much simpler:
,OU=Karen,.*,OU=admin,|,OU=admin,.*,OU=Karen,
and this removes all the potential problems mentioned:
either at start of string.
either at end of string.
both abutting each other.
extended names like Karen2 being matched accidentally.
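The padding trick can be sketched in a few lines of Python (my choice of language; the names PADDED and has_both are made up):

```python
import re

PADDED = re.compile(r",OU=Karen,.*,OU=admin,|,OU=admin,.*,OU=Karen,")

def has_both(dn):
    """Pad with commas and double the inner ones, then run the simple check."""
    padded = "," + dn.replace(",", ",,") + ","
    return PADDED.search(padded) is not None

print(has_both("OU=Karen,OU=Office,OU=admin,DC=corp,DC=Fabrikam,DC=COM"))
print(has_both("OU=admin,OU=Karen"))   # abutting, at start and end
print(has_both("OU=Karen7,OU=admin"))  # extended name
```

Doubling the commas is what lets two adjacent elements each have their own leading and trailing comma.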
Probably the best way to do this (if your language allows) is to simply split the string on commas and examine them, something like:
str = "OU=Karen,OU=Office,OU=admin,DC=corp,DC=Fabrikam,DC=COM"
elems[] = str.splitOn(",")
gotKaren = false
gotAdmin = false
for each elem in elems:
    if elem = "OU=Karen": gotKaren = true
    if elem = "OU=admin": gotAdmin = true
if gotKaren and gotAdmin:
    weaveYourMagicHere()
This both ignores the order in which they may appear and bypasses any regex "gymnastics" that may be required to detect the edge cases.
It also has the advantage of probably being more readable than the equivalent regex :-)
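In a real language this comes out even shorter. A minimal sketch in Python (chosen for illustration; the function name is made up), comparing whole elements so that Karen7 or administrator cannot match by accident:

```python
def has_karen_and_admin(dn):
    """Split on commas and check that both elements are present, in any order."""
    elems = set(dn.split(","))
    return {"OU=Karen", "OU=admin"} <= elems

print(has_karen_and_admin("OU=Karen,OU=Office,OU=admin,DC=corp,DC=Fabrikam,DC=COM"))
print(has_karen_and_admin("OU=Karen7,OU=administrator"))
```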
If you must use a regex, you can use
/OU=Karen.*?OU=admin|OU=admin.*?OU=Karen/
You can contains(), or indexOf() as many times as the number of conditions to check the exact string. No need for regex.
Extensible regex (as in it can support more conditions) may be possible with look ahead, but I doubt it will perform any better.
If you want to perform this type of action multiple times on the same string, and there are many tokens on the string, then you may consider parsing the string and store in some data structure.
No, not unless you're using vi: it has an \& operator
/(OU=Karen.*OU=admin|OU=admin.*OU=Karen)/
This might be close enough, though, or similar.
You can use something like (OU\=Karen