Title Case Conversion in SPARQL Query with REGEX

I am one inch away from the solution to my problem. I am attempting title case conversion of strings retrieved via SPARQL. I am using the REPLACE function in combination with LCASE and REGEX:
BIND (replace(lcase(?label), "(\\b[a-z](?!\\s))", ucase("$1") ) as ?title_case)
lcase(?label): all characters in the string become lowercase
(\\b[a-z](?!\\s)): matches the first letter of each word in the string
ucase($1): the backreference to the matched first letter, which should act as the replacement after being converted to upper case.
Expected Result: animal husbandry methods becomes Animal Husbandry Methods
That solution almost works, but not quite, for reasons beyond my comprehension; check here an example at work.
When you run the query you won't notice any change in ?title_case, but if you replace ucase("$1") with ucase("aaa") you will see it correctly replace the first letter of each word:
Result: animal husbandry methods becomes AAAnimal AAAusbandry AAAethods
It seems to me that the UCASE function does not have any effect on the backreference $1.
Can anyone explain why, and what I can do to rectify this behavior?

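You can use the SUBSTR() function to solve the issue.
Eg: BIND (REPLACE(LCASE(?label), "(\\b[a-z](?!\\s))", UCASE(SUBSTR(?label, 1, 1)) ) as ?title_case)
Note that this inserts the uppercased first character of ?label at every match, so it only gives the right result for single-word labels.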

Function calls in SPARQL follow the conventions of most programming languages: inner functions are evaluated first, and their return values are then passed as arguments to the outer function. Here, replace takes 3 strings: the input string, the pattern, and the replacement. ucase is evaluated independently of how its result is used; it simply converts its argument to uppercase and, perhaps surprisingly, the uppercase of $1 is $1!
In other languages, you would usually use an overload of the function that accepts a function/expression instead of a string as the replacement, so that you could call anything from within it. That is not possible in SPARQL; all the replace function can do is insert the capture unmodified.
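To illustrate the point outside SPARQL, here is a minimal Python sketch. Python's re module stands in for the SPARQL engine, and \1 is Python's spelling of the $1 backreference; neither is part of the original query. It shows that uppercasing the replacement string leaves the backreference untouched, while a callable replacement (the kind of overload SPARQL lacks) can transform each match:

```python
import re

label = "animal husbandry methods"
pattern = r"(\b[a-z](?!\s))"

# The replacement argument is evaluated before re.sub runs, and
# uppercasing a backreference token leaves it unchanged.
eager_replacement = r"\1".upper()
unchanged = re.sub(pattern, eager_replacement, label.lower())

# A callable replacement receives each match and can transform it.
title_case = re.sub(pattern, lambda m: m.group(1).upper(), label.lower())

print(unchanged)
print(title_case)
```

The first call is the Python analogue of replace(..., ucase("$1")); the second is what the asker actually wants, but it relies on a feature the SPARQL REPLACE function does not offer.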
I am afraid what you want to do is not perfectly achievable in SPARQL alone. Your options are:
Use a SPARQL extension that contains a function that makes it possible, if supported by the endpoint.
If your query is a part of a larger pipeline, convert the results in another way, for example using XSLT.
Since you only care about [a-z], you can simply expand out all the letters and replace them one by one: replace(replace(lcase(?label), "(\\ba(?!\\s))", "A" ), "(\\bb(?!\\s))", "B" ) and so on. Not a very elegant or performant solution, but it gets the job done.
A shorter option is to use a pattern like ^(.*?)(\b[a-z](?!\s))(.*)$ to split the string into 3 parts, which you can extract with replacements to $1, $2 and $3, respectively. Concatenate the first part with the uppercase of the second part, and repeat the whole process for the last part. You will again have to repeat the patterns, but this time it is the same pattern so there is a potential for optimization. A downside is that you have to end this "recursion" somewhere, so you can only replace a fixed number of words.

Related

Dynamic class operations within a regular expression

I am trying to write a regex which excludes certain characters from a class based on the current content of capturing groups. The specific task that made me look for such a thing was to match lowercase letters in alphabetical order.
I searched through Rex's page (https://www.rexegg.com/regex-class-operations.html) to see if there was any way to change the class' content, but was unable to find anything.
Take the following attempt as a brief example: ([a-z])[a-z--[\1]]
Though it's not a correct regular expression, it demonstrates the concept. The idea is that it would match two letters that are not the same.
Note: the expression shown follows a Python-like syntax, and can also be written as:
([a-z])[a-z&&[^\1]] or ([a-z])(?![\1])[a-z]
But I am going to use the Python syntax.
In the examples above the nested brackets are optional (in certain engines), but for the ultimate goal they are necessary. The pattern I am trying to match the ordered letters with would be something like this:
^(?:([a-z])([a-z--[a-(?(2)\2|\1)]])*+)?$
The first character class matches a letter, which is immediately captured by the first group, meaning that this letter will be excluded from the class containing the conditional. The first time the second group tries to match, the condition inside the conditional evaluates to false, since there has not been a second capture yet, so it "matches" the first group's content, which should result in the exclusion of the first letter from the class. In later steps the second group will be set, meaning that all the letters between 'a' and the most recently captured letter will be excluded.
I know, it seems complicated. Maybe refactoring the pattern will help, take a look at this one:
^(?:([a-z])([(?(2)\2|\1)-z])*+)?$
This example makes no use of set operations, but the idea is roughly the same. The first group matches a letter, then the class inside the second group matches anything between the captured letter and 'z', which is noted by the [(?(2)\2|\1)-z] part. The conditional is there to ensure that the lower boundary of the character interval is the most recently captured character.
This could also be written using subroutine calls, but I doubt it would solve the problem. The issue might be that the classes are precompiled (and so are subroutines), so they cannot change during the matching process.
Are you guys aware of a workaround or an engine that supports such operations? I am interested in the dynamic class operation itself rather than a different way to match alphabetically ordered letters.
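For reference, the lookahead spelling mentioned earlier, ([a-z])(?!\1)[a-z], does run in engines without set operations. The sketch below tries it with Python's re module (the choice of engine and the helper name is_two_different_letters are assumptions for illustration). It does not make the class itself dynamic; it only emulates "any letter except the captured one" for the two-letter case:

```python
import re

# Capture one lowercase letter, then refuse to continue if the next
# character repeats it; otherwise accept any lowercase letter.
two_different = re.compile(r"([a-z])(?!\1)[a-z]")

def is_two_different_letters(text: str) -> bool:
    """Return True if text is exactly two distinct lowercase letters."""
    return two_different.fullmatch(text) is not None

print(is_two_different_letters("ab"))
print(is_two_different_letters("aa"))
```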

Regex to match hexadecimal and integer numbers [duplicate]

In a regular expression, I need to know how to match one thing or another, or both (in order). But at least one of the things needs to be there.
For example, the following regular expression
/^([0-9]+|\.[0-9]+)$/
will match
234
and
.56
but not
234.56
While the following regular expression
/^([0-9]+)?(\.[0-9]+)?$/
will match all three of the strings above, but it will also match the empty string, which we do not want.
I need something that will match all three of the strings above, but not the empty string. Is there an easy way to do that?
UPDATE:
Both Andrew's and Justin's below work for the simplified example I provided, but they don't (unless I'm mistaken) work for the actual use case that I was hoping to solve, so I should probably put that in now. Here's the actual regexp I'm using:
/^\s*-?0*(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})+)(?:\.[0-9]*)?(\s*|[A-Za-z_]*)*$/
This will match
45
45.988
45,689
34,569,098,233
567,900.90
-9
-34 banana fries
0.56 points
but it WON'T match
.56
and I need it to do this.
The fully general method, given regexes /^A$/ and /^B$/ is:
/^(A|B|AB)$/
i.e.
/^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$/
Note the others have used the structure of your example to make a simplification. Specifically, they (implicitly) factorised it, to pull out the common [0-9]* and [0-9]+ factors on the left and right.
The working for this is:
all the elements of the alternation end in [0-9]+, so pull that out: /^(|\.|[0-9]+\.)[0-9]+$/
Now we have the possibility of the empty string in the alternation, so rewrite it using ? (i.e. use the equivalence (|a|b) = (a|b)?): /^(\.|[0-9]+\.)?[0-9]+$/
Again, an alternation with a common suffix (\. this time): /^((|[0-9]+)\.)?[0-9]+$/
the pattern (|a+) is the same as a*, so, finally: /^([0-9]*\.)?[0-9]+$/
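The factorisation above can be sanity-checked mechanically. The following Python sketch (the sample strings are my own choice, not from the question) runs every intermediate pattern against the same inputs and confirms that they all accept and reject the same strings:

```python
import re

# Each step of the derivation, from A|B|AB down to the final form.
steps = [
    r"^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$",
    r"^(|\.|[0-9]+\.)[0-9]+$",
    r"^(\.|[0-9]+\.)?[0-9]+$",
    r"^((|[0-9]+)\.)?[0-9]+$",
    r"^([0-9]*\.)?[0-9]+$",
]
samples = ["234", ".56", "234.56", "", ".", "234.", "1.2.3", "abc"]

def verdicts(pattern):
    """Return one True/False per sample for the given pattern."""
    return [re.search(pattern, s) is not None for s in samples]

results = [verdicts(p) for p in steps]
all_agree = all(r == results[0] for r in results)
print(all_agree)
```

This only checks the listed samples, so it is a spot check rather than a proof of equivalence.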
Nice answer by huon (and a bit of brain-twister to follow it along to the end). For anyone looking for a quick and simple answer to the title of this question, 'In a regular expression, match one thing or another, or both', it's worth mentioning that even (A|B|AB) can be simplified to:
A|A?B
Handy if B is a bit more complex.
Now, as c0d3rman's observed, this, in itself, will never match AB. It will only match A and B. (A|B|AB has the same issue.) What I left out was the all-important context of the original question, where the start and end of the string are also being matched. Here it is, written out fully:
^(A|A?B)$
Better still, just switch the order as c0d3rman recommended, and you can use it anywhere:
A?B|A
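The effect of the alternation order is easy to see with literal A and B. A small Python sketch (Python's re is used here as an assumption; the ordered, leftmost-first behaviour of alternation is the same in most backtracking engines):

```python
import re

text = "AB"

# Unanchored, the first alternative wins as soon as it matches.
first_try = re.search(r"A|A?B", text).group()

# Putting the longer alternative first lets it consume both parts.
reordered = re.search(r"A?B|A", text).group()

# With anchors, the engine backtracks into the second alternative.
anchored = re.search(r"^(A|A?B)$", text).group()

print(first_try, reordered, anchored)
```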
Yes, you can match all of these with such an expression:
/^[0-9]*\.?[0-9]+$/
Note, it also doesn't match the empty string (your last condition).
Sure. You want the optional quantifier, ?.
/^(?=.)([0-9]+)?(\.[0-9]+)?$/
The above is slightly awkward-looking, but I wanted to show you your exact pattern with some ?s thrown in. In this version, (?=.) makes sure it doesn't accept an empty string, since I've made both clauses optional. A simpler version would be this:
/^\d*\.?\d+$/
This satisfies your requirements, including preventing an empty string.
Note that there are many ways to express this. Some are long and some are very terse, but they become more complex depending on what you're trying to allow/disallow.
Edit:
If you want to match this inside a larger string, I recommend splitting on spaces and testing the results with /^\d*\.?\d+$/. Otherwise, you'll risk either matching stuff like aaa.123.456.bbb or missing matches (trust me, you will. JavaScript's lack of lookbehind support ensures that it will be possible to break any pattern I can think of).
If you know for a fact that you won't get strings like the above, you can use word breaks instead of ^$ anchors, but it will get complicated because there's no word break between . and (a space).
/(\b\d+|\B\.)?\d*\b/g
That ought to do it. It will block stuff like aaa123.456bbb, but it will allow 123, 456, or 123.456. It will allow aaa.123.456.bbb, but as I've said, you'll need two steps if you want to comprehensively handle that.
Edit 2: Your use case
If you want to allow whitespace at the beginning, negative/positive marks, and words at the end, those are actually fairly strict rules. That's a good thing. You can just add them on to the simplest pattern above:
/^\s*[-+]?\d*\.?\d+[a-z_\s]*$/i
Allowing thousands groups complicates things greatly, and I suggest you take a look at the answer I linked to. Here's the resulting pattern:
/^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$/i
The \b ensures that the numeric part ends with a digit, and is followed by at least one whitespace.
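As a quick check, the thousands-group pattern can be run against the sample inputs from the question. This is a sketch in Python rather than JavaScript (an assumption on my part; the pattern uses no syntax specific to either engine), with re.IGNORECASE standing in for the /i flag:

```python
import re

number_with_words = re.compile(
    r"^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$",
    re.IGNORECASE,
)

# Inputs listed in the question, plus the .56 case that was missing.
should_match = [
    "45", "45.988", "45,689", "34,569,098,233", "567,900.90",
    "-9", "-34 banana fries", "0.56 points", ".56",
]

matched = [s for s in should_match if number_with_words.search(s)]
empty_matches = number_with_words.search("") is not None

print(matched == should_match, empty_matches)
```

The empty string is rejected because \b cannot match when there is no word character on either side.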
Maybe this helps (to give you the general idea):
(?:((?(digits).^|[A-Za-z]+)|(?<digits>\d+))){1,2}
This pattern matches characters, digits, or digits following characters, but not characters following digits.
The pattern matches aa, aa11, and 11, but not 11aa, aa11aa, or the empty string.
Don't be puzzled by the ".^", which means "a character followed by line start"; it is intended to prevent any match at all.
Be warned that this does not work with all flavors of regex; your version of regex must support (?(named group)true|false).

Regex matching order numbers e.g DE + 10 numbers or AT +10 numbers

I'm actually trying to match some order numbers in a string.
The string could look like this:
SDSFwsfcwqrewrPL0000018604ergerzergdsfa
or
FwsfcwqrewrAT0000018604ergerzergdsfaD
and I need to match the "PL0000018604" or "AT0000018604".
Actually I'm using something like that and it works:
.?(AT[0-9]{10})|(BE[0-9]{10})|(FR[0-9]{10})|(IT[0-9]{10})
But the more order prefixes we get, the longer the expression will be.
It is always 2 uppercase chars followed by 10 digits and I want to specify the different uppercase chars.
Is there any shorter version?
Thanks for your help :)
If the prefixes must be specific, there's not much of a way to make the pattern shorter. Though, you can collect all the prefixes at the front of the expression so you only have to have the numeric part once.
for example:
(AT|BE|FR|IT)[0-9]{10}
Depending on how you call it, if you need the whole expression to be captured as a group (versus simply matching, which is what the question asked about), you can add parentheses around the whole expression. That doesn't change what is matched, but it will change what is returned by whatever function uses the expression.
((AT|BE|FR|IT)[0-9]{10})
And, of course, if you also want the number part to be captured as a separate group, you can add more parentheses:
((AT|BE|FR|IT)([0-9]{10}))
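If the list of prefixes keeps growing, the alternation can be generated instead of typed by hand. A minimal Python sketch, assuming the prefixes live in a list (the names prefixes and find_order_numbers are hypothetical, and PL is added because it appears in the question's sample string):

```python
import re

# Hypothetical prefix list; extend it as new prefixes appear.
prefixes = ["AT", "BE", "FR", "IT", "PL"]

# Non-capturing group so findall returns the whole order number.
order_number = re.compile(
    "(?:" + "|".join(map(re.escape, prefixes)) + ")[0-9]{10}"
)

def find_order_numbers(text):
    """Return every order number found in text, in order of appearance."""
    return order_number.findall(text)

print(find_order_numbers("SDSFwsfcwqrewrPL0000018604ergerzergdsfa"))
print(find_order_numbers("FwsfcwqrewrAT0000018604ergerzergdsfaD"))
```

The non-capturing group (?:...) is used here only because Python's findall returns group contents when capturing groups are present; the capturing variants above work the same way for matching.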

Negating Regular Expression for Price

I have a regular expression for matching price where decimals are optional like so,
/[0-9]+(\.[0-9]{1,2})?/
Now what I would like to do is get the inverse of the expression, but having trouble doing so. I came up with something simple like,
/[^0-9.]/g
But this allows for multiple '.' characters and more than 2 numbers after the decimal. I am using jQuery replace function on blur to correct an input price field. So if a user types in something like,
"S$4sd3.24151 . x45 blah blah text blah" or "!#%!$43.24.234asdf blah blah text blah"
it will return
43.24
Can anyone offer any suggestions for doing this?
I would do it in two steps. First, replace every character that is neither a digit nor a dot with nothing.
/[^0-9.]//g
This will yield 43.24151.45 and 43.24.234 for the first and second example respectively.
Then you can use your first regex to match the first occurrence of a valid price.
/\d+(\.\d{1,2})?/
Doing this will give you 43.24 for both examples.
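The two steps can be sketched as follows. This uses Python rather than the asker's JavaScript, and the function name extract_price is hypothetical; the second step uses the asker's original price pattern, [0-9]+(\.[0-9]{1,2})?:

```python
import re

def extract_price(raw):
    """Strip everything but digits and dots, then take the first price."""
    # Step 1: drop every character that is not a digit or a dot.
    digits_and_dots = re.sub(r"[^0-9.]", "", raw)
    # Step 2: the first run of digits, optionally followed by a dot
    # and one or two decimals.
    match = re.search(r"[0-9]+(\.[0-9]{1,2})?", digits_and_dots)
    return match.group() if match else None

print(extract_price("S$4sd3.24151 . x45 blah blah text blah"))
print(extract_price("!#%!$43.24.234asdf blah blah text blah"))
```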
I suppose in programming, it is not always clear what "inverse" means.
To suggest a solution exclusively based on the example that you presented, I will present one that is very similar to what Vince presented. I am having difficulty composing a Regular Expression that both matches the pattern that you need and captures a potentially arbitrary number of digits, through repeating capture groups. And I am not sure whether this would be doable in some reasonable way (perhaps someone else does). But a two step approach should be straightforward.
To note, I suspect that you are referring to JavaScript's replace function, which is a member of the String Object, and not jQuery replaceWith and replaceAll functions, in referring to 'jQuery replace function.' The latter are 'Dom manipulation' functions. But, correct me if I misunderstood.
As an example, based on some hypothetical input, you could use
var numeric_raw = jQuery('input.textbox').attr('value').replace(/[^0-9.]/g, "")
to remove all characters from a value entered in a text field that are not digits or periods;
then you could use
var numeric_str = numeric_raw.replace(/^[0]*(\d+\.\d{1,2}).*$/, "$1")
The difference between the classes specified here and in Vince's answer are in that I am including filtering for leading 0s.
To note, in Vince's first reg ex, there might be an extra '/' -- but perhaps it has a purpose that I didn't catch.
With respect to "inverse," one way to understand your initial inquiry is that you are looking for an expression that does the opposite of the one that you provided.
To note, while the expression that you provided (/[0-9]+(\.[0-9]{1,2})?/) does match both whole numbers and decimal numbers with up to two fractional digits, it also matches any single digit -- so, it may identify a match where one might not be envisioned, for a given input string. The expression does not have anchors ('^', '$'), and so might allow multiple possible matches. For example, in the String "1.111", both "1.11" and "1" match the pattern that you provided.
It appears to me that one pattern that matches any string that does not match your pattern, or at least does so for most cases, is the following:
/^(?:(?!.*[0-9]+(\.[0-9]{1,2})?).*)*$/
-- if someone could identify a precisely 'inverse' pattern, please feel free -- I am having some trouble understanding how lookaheads are interpreted at least for some nuances.
This relies on "negative lookahead" functionality, which JavaScript these days supports. You could refer to several stackoverflow postings for more information (eg. Regular Expressions and negating a whole character group), and there are multiple resources that could be found on the Internet that discuss "lookahead" and "lookbehind."
I suppose this answer carries some redundancy with respect to the one already given -- I might have commented on the Original Poster's post or on Vince's answer (instead of writing at least parts of my answer), but I am not yet able to make comments!

How to create regular expression to get all functions from code

I have a problem with my regular expression. I need to find all functions in a text. I have this regular expression: \w*\([^(]*\). It works fine as long as the text does not contain brackets without a function name. For example, for the string 'hello world () testFunction()' it returns () and testFunction(), but I need only testFunction(). I want to use it in my C# application to parse a string passed to my method. Can anybody help me?
Thanks!
Programming languages have a hierarchical structure, which means that they cannot be parsed by simple regular expressions in the general case. If you want to write correct code that always works, you need to use an LR-parser. If you simply want to apply a hack that will pick up most functions, use something like:
\w+\([^)]*\)
But keep in mind that this will fail in some cases. E.g. it cannot differentiate between a function definition (signature) and a function call, because it does not look at the context.
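To make the difference concrete, here is a short sketch using Python's re module (an assumption; the asker is on C#, but these patterns use only syntax common to both engines). It compares the asker's original pattern with the \w+ variant on the string from the question, and shows one case where the hack falls short:

```python
import re

text = "hello world () testFunction()"

# Original pattern: \w* allows an empty name, so bare "()" is matched too.
original = re.findall(r"\w*\([^(]*\)", text)

# Requiring at least one word character drops the bare brackets.
with_name = re.findall(r"\w+\([^)]*\)", text)

# Limitation: a nested call is cut off at the first closing parenthesis.
nested = re.findall(r"\w+\([^)]*\)", "outer(inner(1), 2)")

print(original)
print(with_name)
print(nested)
```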
Try \w+\([^(]*\)
Here I have changed \w* to \w+. This means that the match will need to contain at least one word character.
Hope that helps
Change the * to + (if it exists in your regex implementation, otherwise do \w\w*). This will ensure that \w is matched one or more times (rather than the zero or more that you currently have).
It largely depends on the definition of "function name". For example, based on your description you only want to filter out the "empty" names, rather than find all valid names.
If your current solution is otherwise good enough and your only problem is these empty names, then try changing the * to a +, requiring at least one word character right before the bracket.
\w+\([^(]*\)
OR
\w\w*\([^(]*\)
Depending on your regexp application's syntax.
(\w+)\(
The regex groups would hold the names without any parentheses; you can add them back later if you want. I assumed you don't need the parameters.
If you do need the parameters then use:
\w+\(.*\)
for a greedy regex (it would match nested functions calls)
or...
\w+\([^)]*\)
for a non-greedy regex (it stops at the first closing parenthesis, so for a nested call such as f(g(x)) it matches only f(g(x) and leaves the final parenthesis out)