PHP and XSD Regular expression for currency and location names - regex

I am trying to find a suitable regular expression that allows letters, digits, spaces and other characters that can be used in names, such as . ' - _. I need to implement the expression as a PHP preg_match pattern and as an XSD pattern. Currently I have the following for PHP:
/^[a-zA-Z0-9 '-.]
Which allows the characters I want (unless there are any other special characters you could kindly recommend I use). The issue with this is that it allows special characters to be used one after the other, allowing values such as .-- . I need it so that this can't happen, only allowing a special character if a letter or digit comes before it.
I would also like the equivalent for an XSD pattern but everything I have tried so far has been inadequate. I am currently using
[\w\d '-.']*[\w\d][\w\d '-.]*
In addition, the length must be between 3 and 50 characters (which works in all cases currently).
Any guidance would be fantastic as I have searched high and low for an answer.
Valid names could be:
Netherlands Antillean guilde
Timor-Leste
Cote d'Ivoire
Including letters with accents.
Invalid names could be:
Tes''t
Test (space before)
_test (special character before)
-'.
...
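One possible direction, offered as a minimal sketch rather than a tested PHP or XSD answer: require every special character to sit between two letters or digits. Python is used here only for illustration, and the names NAME_RE and is_valid_name are made up for this example. The length check is done outside the pattern, which mirrors what XSD would need anyway, since XSD patterns have no lookahead and the limit would go into minLength/maxLength facets.

```python
import re

# Hypothetical pattern: runs of letters/digits joined by exactly one
# separator (space, apostrophe, dot, underscore, or hyphen).
# [^\W_] means "a word character other than underscore", which in
# Python 3 also covers accented letters.
NAME_RE = re.compile(r"[^\W_]+(?:[ '._-][^\W_]+)*")


def is_valid_name(name: str) -> bool:
    """Return True if name is 3-50 characters long and fully matches NAME_RE."""
    return 3 <= len(name) <= 50 and NAME_RE.fullmatch(name) is not None


for candidate in ["Timor-Leste", "Cote d'Ivoire", "Tes''t", "_test", "..."]:
    print(repr(candidate), is_valid_name(candidate))
```

Note that this sketch also rejects a trailing special character (for example a name ending in a dot), which may or may not be what you want.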

Related

regex to highlight sentences longer than n words

I am trying to write a regex that can be used to identify long sentences in a document, in my case a scientific manuscript. I aim to do that either in LibreOffice or in any text editor with regex search.
So far I got the following expression to work on most occasions:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+){24,}?(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
btw, I got inspired from this post
It contains:
group1:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+)
a repetition element (stating how many words n - 1):
{24,}?
group2:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
The basic functioning is:
group1 matches any number of word characters OR other characters that are present in the text followed by one or more spaces
group1 has to be repeated 24 times (or as many as you want the sentences to be long)
group2 matches any number of word characters OR other characters that are present in the text followed by a full stop, exclamation mark, question mark or paragraph break.
Any string that fulfills all the above would then be highlighted.
What I can't solve so far is to make it work when a dot appears in the text with another meaning than a full stop. Things like: i.e., e.g., et al., Fig., 1.89, etc....
Also I don't like that I had to manually adjust it to be able to handle sentences that contain non-word characters such as , [ ( % - # µ " ' and so on. I would have to extend the expression every time I come across some other uncommon character.
I'd be happy for any help or suggestions of other ways to solve this.
You can do a lot with the swiss-army-knife that is regular expressions, but the problem you've presented approaches regex's limits. Some of the things you want to detect can probably be handled with really small changes, while others are a bit harder. If your goal is to have some kind of tool that accurately measures sentence length for every possible mutation of characters, you'll probably need to move outside LibreOffice to a dedicated custom piece of software or a third-party tool.
But, that said, there are a few tricks you can worm into your existing regex to make it work better, if you want to avoid programming or another tool. Let's look at a few of the techniques that might be useful to you:
You can probably tweak your regex for a few special cases, like Fig. and Mr., by including them directly. Where you currently have [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, which is basically [\w]+ with a bunch of other "special" characters, you could use something like ([\w|...]+|Mr\.|Mrs\.|Miss\.|Fig\.) (substituting in all the special characters where I wrote ..., of course). Regexes are "greedy" algorithms, and will try to consume as much of the text as possible, so by including special "dot words" directly, you can make the regex "skip over" certain period characters that are problematic in your text. Make sure that when you want to add a "period to skip" that you always precede it with a backslash, like in i\.e\., so that it doesn't get treated as the special "any" character.
A similar trick can capture numbers better by assuming that digits followed by a period followed by more digits are supposed to "eat" the period: ([\w|...]+|\d+\.\d+|...) That doesn't handle everything, and if your document authors are writing stuff like 0. in the middle of sentences then you have a tough problem, but it can at least handle pi and e properly.
Also, right now, your regex consumes characters until it reaches any terminating punctuation character: a ., or !, or ?, or the end of the document. That's a problem for things like i.e., and 3.14, since as far as your regex is concerned, the sentence stops at the first dot. You could require your regex to only stop the sentence once ._ is reached, meaning a period followed by a space (written here as _). That wouldn't fix mismatches for words like Mr., but it would treat "words" like 3.14 as a word instead of as the end of a sentence, which is closer than you currently are. To do this, you'll have to include an odd sequence as part of the "word" regex, something like (\.[^ ]), which says "dot followed by not-a-space" is part of the word; and then you'll have to change the terminating sequence to (\. |!|\?|$), with the ? escaped so that it is treated as a literal question mark. Repeat the changes similarly for ! and ?.
Another useful trick is to take advantage of character-code ranges, instead of encoding each special character directly. Right now, you're doing it the hard way, by spelling out every accented character and digraph and diacritic in the universe. Instead, you could just say that everything that's a "special character" is considered to be part of the "word": Instead of [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, write [\w|\-|\/|\u0080-\uFFFF]+, which captures every character except emoji and a few from really obscure dead languages. LibreOffice seems to have Unicode support, so using \uXXXX patterns should work inside [ character ranges ].
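To see what the character-range idea does, here is a small sketch in Python rather than LibreOffice (the regex dialects differ, so treat it as an illustration only). The sample string and the name TOKEN_RE are made up for the example.

```python
import re

# A "word" is any run of word characters, hyphens, slashes, or anything
# in the range U+0080 to U+FFFF (so symbols like ≥ or curly quotes are
# accepted without being listed one by one).
TOKEN_RE = re.compile(r"[\w\-/\u0080-\uFFFF]+")

sample = "values ≥ 5 μm were “rare”"
tokens = TOKEN_RE.findall(sample)
print(tokens)
```

Every whitespace-separated chunk of the sample comes back as one token, including the ones made of characters that were never spelled out in the pattern.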
This is probably enough to make your regex somewhat acceptable in LibreOffice, and might even be enough to answer your question. But if you're really intent on doing more complex document analysis like this, you may be better off exporting the document as plain text and then running a specialized tool on it.
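If you do go the plain-text route, a rough sketch of such a tool in Python could look like the following. Everything here is an assumption for illustration: the abbreviation list, the function name long_sentences, and the rule that a sentence ends at ., ! or ? followed by whitespace or the end of the text.

```python
import re

# Hypothetical abbreviation list; extend it for your own manuscript.
ABBREVIATIONS = ("i.e.", "e.g.", "et al.", "Fig.", "Mr.", "Mrs.")

# A sentence ends at ., ! or ? followed by whitespace or end of text.
# The dot inside "1.89" is followed by a digit, so it is skipped.
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def long_sentences(text: str, min_words: int = 25) -> list[str]:
    """Return the sentences with at least min_words whitespace-separated words."""
    # Hide the dots of known abbreviations behind a placeholder character.
    placeholder = "\u0000"
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.replace(".", placeholder))

    sentences, start = [], 0
    for match in SENTENCE_END.finditer(text):
        sentences.append(text[start:match.end()])
        start = match.end()
    if text[start:].strip():
        sentences.append(text[start:])  # trailing text without a terminator

    # Restore the hidden dots and keep only the long sentences.
    restored = (s.replace(placeholder, ".").strip() for s in sentences)
    return [s for s in restored if len(s.split()) >= min_words]


print(long_sentences("See Fig. 2 for the value 1.89 here. Short one.", min_words=5))
```

One known weakness of this sketch: a sentence that ends with a listed abbreviation (for example "... Smith et al.") will be merged with the next sentence.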

Regex for matching any URL character

I have come across a specification that described a field as:
Any URL char
And I wanted to validate it on my side via a REGEX.
I searched a bit and, even if I found this great SO question that contains every piece of information I needed, I found it too bad not to have a question asking precisely for the regex, so here I am.
What would be a proper regex matching any URL character ?
Edit
I extracted the following regex from what I understood from the specification :
[\w\-.~:/?#\[\]#!$&'()*+,;=%]
So, is this REGEX right and exhaustive or did I miss anything ?
After reading the specification, I guess it is simply "all ASCII characters".
See the Characters section:
A URI is composed from a limited set of characters consisting of
digits, letters, and a few graphic symbols. A reserved subset of
those characters may be used to delimit syntax components within a
URI while the remaining characters, including both the unreserved set
and those reserved characters not acting as delimiters, define each
component's identifying data.
Although there is an indication that only digits, letters and some symbols are supported, you may see a suggested regex to parse a URI in Appendix B, "Parsing a URI Reference with a Regular Expression", that may actually match pretty much every char:
The following line is the regular expression for breaking-down a
well-formed URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
What you collected as a [\w.~:/?#\[\]#!$&'()*+,;=%-] pattern is too restrictive unless \w is Unicode-aware (a URI may contain any Unicode letters); if it is, the pattern might work more or less for you.
If you plan to match just ASCII URLs, use ^[\x00-\x7F]+$ (any 1+ ASCII symbols) or ^[!-~]+$ (only visible ASCII).
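As a quick illustration in Python (the sample URL and the constant names are made up), the two ASCII patterns and the Appendix B regex quoted above can be tried like this:

```python
import re

VISIBLE_ASCII = re.compile(r"[!-~]+")     # printable ASCII, space excluded
ANY_ASCII = re.compile(r"[\x00-\x7F]+")   # every ASCII code point

# The Appendix B regex, copied as quoted above.
URI_PARTS = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

url = "http://www.example.com/path/page?q=1#top"  # hypothetical sample

# Groups 2, 4, 5, 7 and 9 carry the five URI components.
m = URI_PARTS.match(url)
scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
print(scheme, authority, path, query, fragment)

print(VISIBLE_ASCII.fullmatch(url) is not None)  # whole string is visible ASCII
print(ANY_ASCII.fullmatch("café") is not None)   # non-ASCII letter is rejected
```

fullmatch is used instead of writing ^ and $ so that a trailing newline is not silently accepted.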

Valid name cannot accept only hyphen

I am weak in regex but I am learning. Currently I have a requirement to validate name and I am not able to write a valid regex for it. A valid name would contain alphabet only or alphabet with hyphens or spaces.
Example of valid name would be
jones
jones-smiht
a loreal jones
but if the name contains digits it's an invalid name. The following regex
^[-\\sa-zA-Z]+$ works fine but only - is also considered as a valid name.
How do I modify it so that a valid name must contain letters, regardless of whether it contains hyphens and spaces?
I think you're looking for this regex:
^[a-zA-Z][-\\sa-zA-Z]*$
This will make sure your name always starts with a letter instead of starting with hyphen or space.
Note: In Java you can also make use of (?i) for ignore case and shorten your regex as follows:
(?i)^[a-z][-\\sa-z]*$
The literal answer for you would be ^[a-zA-Z][-\sa-zA-Z]*$.
There are better answers: for instance,
([a-zA-Z]+)([-\s][a-zA-Z]+)*
will allow any number of words separated by single space or dash, allowing for simon peyton-jones, but disallowing silliness like --jumbo-spaz--.
And copied from the response I tried to publish on the deleted answer:
Regexp is single-backslash. However, since regexps are constructed from strings in Java, you need to escape the backslash; but it is the feature of strings, not of regexps. So, regexp is \s, but you need to write Pattern.compile("\\s") in Java. Not all languages have this twist, so keeping rules of strings separate from what Regexp is is useful.
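For comparison, here is the word-by-word pattern from above tried out in Python, as a small sketch (the function name matches_name_words is made up). Python's raw strings sidestep the doubled backslash: r"\s" and "\\s" are the same two-character string, and the regex engine only ever sees \s.

```python
import re

# Raw string: no doubling of the backslash is needed.
WORDS_RE = re.compile(r"([a-zA-Z]+)([-\s][a-zA-Z]+)*")


def matches_name_words(name: str) -> bool:
    """True if name is letter-only words joined by single spaces or hyphens."""
    return WORDS_RE.fullmatch(name) is not None


for candidate in ["jones", "a loreal jones", "simon peyton-jones", "-", "--jumbo-spaz--"]:
    print(repr(candidate), matches_name_words(candidate))

# The string-escaping point: both spellings produce the same pattern text.
print("\\s" == r"\s")
```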

I need a regular expression: first name must start with a minimum of 2 characters

I need a regular expression where the first name must start with a minimum of 2 characters.
I am using following
/^[A-Z a-z]{2,25}$/
Your regex appears to allow spaces so I'm not sure if that's valid.
If you want a regex that dictates at least two alpha characters at the start, you can just use:
/^[A-Za-z]{2}/
This will force the first two characters to be alpha with no restriction whatsoever on the rest of the string. If you want to (as it looks) allow those same characters plus spaces for up to another 23 characters after that, use:
/^[A-Za-z]{2}[A-Za-z ]{0,23}$/
Otherwise, if your definition of characters includes a space even for the first two characters, your current regex should be fine.
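A short Python sketch of the second pattern, just to show which inputs it accepts (the constant name and the sample names are made up for the example):

```python
import re

# Two letters first, then up to 23 more letters or spaces (25 in total).
FIRST_NAME_RE = re.compile(r"^[A-Za-z]{2}[A-Za-z ]{0,23}$")

for candidate in ["Al", "A", " Al", "A l", "Al Pacino", "x" * 26]:
    accepted = FIRST_NAME_RE.fullmatch(candidate) is not None
    print(repr(candidate), accepted)
```

Only "Al" and "Al Pacino" are accepted here; the others fail because they are too short, have a space within the first two characters, or exceed 25 characters.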
The question is which regexp are you using? I assume POSIX RE.
Your solution is fairly good, but if you are checking a real first name, the space is useless. (Maybe you are checking for a middle name as well?) And (I suppose) the first character should be a capital letter and the rest lower case, shouldn't it? If so, I suggest:
/^[[:upper:]][[:lower:]]{1,24}$/
The advantage of using built-in character classes is that they can work in other languages as well.
Example:
$ echo -e "Alpha\nalpha\naLpha\nThisisaverylongfirstname\nThisfirstnameismuchlongertobesure"\
|egrep '^[[:upper:]][[:lower:]]{1,24}$'
Alpha
Thisisaverylongfirstname

Preparing number using abbreviations

RegEx for BMHT in a sequence is my previous post.
I'm looking to build a number using abbreviations, and of course using regex.
Now I know how to validate a number with BMTH abbreviations.
Now my next and final target is to build a number using the abbreviations.
e.g. -2T2H22.55 should be displayed as -2,222.55
-2M2H22.63 should be displayed as -2,000,222.63
Help appreciated.
Flex's scripting language, ActionScript, is an ECMAScript implementation like JavaScript, so regex literals have to be delimited with slashes, for example: /^(?:\d+B)?(?:\d{1,3}M)?(?:\d{1,3}T)?(?:\d{1}H)?(\.[0-9]*)?/.
But that regex still has some problems. For one thing, you don't account for the minus sign or the two digits after the hundreds place. And, while the decimal point may be optional, if it is present you should require it to be followed by at least one digit (so +, not * in that last group).
Finally, you'll need to capture the various components so you can use them to construct the number. Here's my result:
/^(-?)(?:(\d+)B)?(?:(\d{1,3})M)?(?:(\d{1,3})T)?(?:(\d)H)?(\d{0,2})(\.\d+)?$/
The minus sign, if present, will be captured in group $1. The rest of the components will be in groups $2 through $7. You can use them in a callback function to construct the number. Also, notice that everything in this regex is optional; it will match an empty string or just a hyphen, so you'll need to check for that.
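To make the "construct the number from the groups" step concrete, here is a minimal sketch in Python rather than ActionScript. It assumes B, M, T and H stand for billions, millions, thousands and hundreds (consistent with the two examples in the question); the function name expand and the error handling are made up for illustration.

```python
import re
from decimal import Decimal

ABBREV_RE = re.compile(
    r"^(-?)(?:(\d+)B)?(?:(\d{1,3})M)?(?:(\d{1,3})T)?(?:(\d)H)?(\d{0,2})(\.\d+)?$"
)


def expand(text: str) -> str:
    """Turn e.g. '-2T2H22.55' into '-2,222.55'."""
    m = ABBREV_RE.fullmatch(text)
    # Everything in the regex is optional, so reject "" and "-" explicitly.
    if m is None or not any(m.groups()[1:]):
        raise ValueError(f"not an abbreviated number: {text!r}")
    sign, billions, millions, thousands, hundreds, rest, fraction = m.groups()
    whole = (
        int(billions or 0) * 10**9
        + int(millions or 0) * 10**6
        + int(thousands or 0) * 10**3
        + int(hundreds or 0) * 100
        + int(rest or 0)
    )
    # Decimal keeps the fractional digits exactly as they were typed.
    value = Decimal(whole) + Decimal(fraction or "0")
    return f"{sign}{value:,}"


print(expand("-2T2H22.55"))
print(expand("-2M2H22.63"))
```

Decimal is used instead of float so that a fraction like .63 is not altered by binary rounding.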