Perl Regex valid and invalid Roman numerals VIII and VIH [duplicate] - regex

This question already has answers here:
Regex match entire words only
(7 answers)
Closed 21 hours ago.
I'd like to detect Roman numerals within scanned text. The text quality is poor and numbers such as VIII will sometimes come out as VIH, so I'd also like to check for H. I can't restrict my search to only valid numbers, so I just want a simple test for the uppercase letters IVXLC and H in any sequence. There are only small numbers, no thousands etc. I tried [IVXLCH]+ (which I think is a character class) but it does not detect VIII. I must have the code wrong and am looking for help. Thanks in advance.
$phr =~ s/(\s[IVXLCH]+\s)/Roman$1/g;

The expression [IVXLCH]+ does match VIII, but the surrounding \s requires whitespace on either side, so VIII won't match if it appears at the start or end of the input, or next to punctuation.
Use \b ("word boundary") instead:
$phr =~ s/(\b[IVXLCH]+\b)/Roman$1/g;
btw D and M are also Roman numerals.
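To make the difference between \s and \b concrete, here is a minimal sketch in Python rather than Perl (the function names and the sample string are made up for illustration). It applies the same two patterns to a string where the numerals sit at the very start, at the very end, and next to punctuation.

```python
import re

# Original approach: whitespace is required (and captured) on both sides.
WHITESPACE_DELIMITED = re.compile(r"(\s[IVXLCH]+\s)")
# Suggested fix: zero-width word boundaries instead of whitespace.
WORD_BOUNDED = re.compile(r"\b([IVXLCH]+)\b")


def tag_with_whitespace(text: str) -> str:
    """Prefix matches with 'Roman' using the \\s-delimited pattern."""
    return WHITESPACE_DELIMITED.sub(r"Roman\1", text)


def tag_with_boundaries(text: str) -> str:
    """Prefix matches with 'Roman' using the \\b-delimited pattern."""
    return WORD_BOUNDED.sub(r"Roman\1", text)


sample = "VIH. See chapter VIII"
print(tag_with_whitespace(sample))  # nothing is tagged
print(tag_with_boundaries(sample))  # both numerals are tagged
```

Note that with either pattern a standalone capital such as the pronoun "I" or the letter "C" is tagged too, so some post-filtering may be needed on real text.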

Related

How to get the reversed result of the following regex? [duplicate]

This question already has answers here:
How can I "inverse match" with regex?
(10 answers)
Closed 6 months ago.
Regex: /^[0-9\p{L}.,\s]+$/u
I would like to replace the characters not matching with the regex with "".
As I understand, you simply want to drop all chars not matching your regex. So the idea is to invert the class of chars:
/^[0-9\p{L}.,\s]+$/u should become /[^\d\p{L}.,\s]+/gu (I added the ^ after the [ to say "not in this list of chars" and replaced 0-9 by \d for digits). Use the g modifier (= global) to match multiple times.
Running it: https://regex101.com/r/IQz6K5/1
I'm not sure that the comma, the dot and the space will be enough punctuation. It would be interesting to have a complete example of what you are trying to achieve. You could use another Unicode character class for punctuation if needed, typically \p{P}. See more info about Unicode classes here: https://www.regular-expressions.info/unicode.html#category
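For readers who want to try the "drop everything outside the allowed set" idea without a \p{L}-capable engine, here is a rough Python sketch. Python's built-in re module does not support \p{L}, so this version is an assumption-laden stand-in: it filters character by character and uses unicodedata to test for the letter categories. The names is_allowed and strip_disallowed are hypothetical.

```python
import unicodedata

# The two punctuation characters allowed by the original class [0-9\p{L}.,\s].
ALLOWED_PUNCTUATION = {".", ","}


def is_allowed(ch: str) -> bool:
    """Return True for ASCII digits, letters of any script, '.', ',' and whitespace."""
    return (
        ch in "0123456789"
        or ch in ALLOWED_PUNCTUATION
        or ch.isspace()  # assumption: any Unicode whitespace counts as \s
        or unicodedata.category(ch).startswith("L")  # L* = letter categories
    )


def strip_disallowed(text: str) -> str:
    """Remove every character that the original regex would not accept."""
    return "".join(ch for ch in text if is_allowed(ch))


print(strip_disallowed("H\u00e9llo, w\u00f6rld! #42."))
```

Adding more punctuation is then a matter of extending ALLOWED_PUNCTUATION, or of testing for the "P" categories in the same way as the letter test.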

Regex matching exact pattern (full word with or without apostrophe) and not partial components of a word [duplicate]

This question already has answers here:
Matching Unicode letter characters in PCRE/PHP
(5 answers)
Closed 2 years ago.
I am trying to write a regex to match full words with or without an apostrophe.
I did this:
\b[a-zA-Z']+\b
However, it partially matches Jönas (the letters on either side of the ö), whereas I want Jönas not to match at all because of the ö in it.
The right matches should go for anything in a-zA-Z'
Thus following cases should match in full:
Jonas
Don't
hasn'T
But not for:
Jönas
Dön't
Hélló
demo here: https://regex101.com/r/2sVN5S/1/ (where Jönas and Hélton should not be matched at all not even partially)
How to fix the regex, to follow this exact match?
UPDATE. Anubhava and Wiktor Stribiżew pointed out that using \b[a-zA-Z']+\b in Unicode mode is enough (fiddle 1 and fiddle 2).
As Wiktor said, there is no use case where the answer below is relevant (no engine supports lookaround groups without also supporting Unicode mode), so this answer is no longer relevant.
You can use this regex:
\b(?<![\x80-\xFF])[a-zA-Z']+(?![\x80-\xFF])\b
Here, [\x80-\xFF] stands for the range of character codes above the 7-bit ASCII set (where non-English letters lie). Basically, it looks for:
a sequence of English letters, with or without apostrophes ...
not preceded by a non-English letter (negative lookbehind group (?<!...))
not followed by a non-English letter (negative lookahead group (?!...))
Working Regex101.com sample.
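The effect of Unicode mode on \b can be checked with a short Python sketch (Python is used here only for illustration; the helper names are made up). In Python, re.ASCII makes \w and \b ASCII-only, which mimics the non-Unicode behaviour from the question, while the default for str patterns is Unicode-aware. The STRICT pattern is an extra variant that is not taken from the answers: it also treats the apostrophe as part of the word when checking the surroundings.

```python
import re

WORD = r"\b[a-zA-Z']+\b"
# Hypothetical stricter variant: no word character or apostrophe on either side.
STRICT = re.compile(r"(?<![\w'])[a-zA-Z']+(?![\w'])")


def ascii_mode(text: str) -> list[str]:
    """\\b treats only ASCII letters, digits and _ as word characters."""
    return re.findall(WORD, text, flags=re.ASCII)


def unicode_mode(text: str) -> list[str]:
    """\\b treats letters of any script as word characters (Python default)."""
    return re.findall(WORD, text)


def strict(text: str) -> list[str]:
    """Lookaround-based variant that rejects partially matching words."""
    return STRICT.findall(text)


print(ascii_mode("Jönas"))    # partial matches around the ö
print(unicode_mode("Jönas"))  # no match at all
print(unicode_mode("Dön't"))  # the trailing 't still matches
print(strict("Jonas Don't hasn'T Jönas Dön't Hélló"))
```

One caveat shows up with Dön't: because the apostrophe is not a word character, \b finds a boundary between n and ', so the Unicode-mode pattern still picks up 't. The lookaround variant avoids this, at the cost of being a little harder to read.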

Does not match when the string does not have a dot but it will match multiple dots [duplicate]

This question already has answers here:
Regex to allow alphanumeric and dot
(3 answers)
Closed 4 years ago.
I am trying to match strings that contain zero or multiple dots. The regex I have only matches strings with dots, not strings with zero dots.
(\w*)((\w*\.)+\w*)
These are the test strings I am using
dial.check.Catch.Url
dial.check.Catch.Url.Dial.check.Catch.Url
32443.324342.23423424.23.423.423.42.34.234.32.4..2..2.342.4
234dfasfd2aa4234234.234aa341.4.123daaadf.df.af....
12fd.dafd
.
abc
The Regex will match these
dial.check.Catch.Url
dial.check.Catch.Url.Dial.check.Catch.Url
32443.324342.23423424.23.423.423.42.34.234.32.4..2..2.342.4
234dfasfd2aa4234234.234aa341.4.123daaadf.df.af....
12fd.dafd
.
But not this one:
abc
https://regexr.com/?38ed7
If you really must use a regex, here is one (but it is inefficient):
/^(?![^.]*\.[^.]*$).*$/
It says:
Match a string so that the beginning of the string is not followed by a whole string with a single dot.
It does some backtracking when parsing the negative lookahead.
As mentioned in the comments to the question, I do think, unless you must have a regex, that a simple function might be better. But if you like the conciseness of a regex and performance is not a huge concern, you can go with the one I gave above. Regexes with "nots" in them are generally a tad messy, but once you understand lookarounds they do become doable. Cheers.
/\..*\.|^[^.]*$/
Or, in plain English:
Match EITHER a dot, then any number of characters, then another dot; OR the beginning of the string, then any number of non-dots, then the end of the string.
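Both answers read "0 or multiple dots" as "zero dots, or at least two", which means a string with exactly one dot is rejected. A small Python sketch (helper names are hypothetical) makes it easy to check how the two patterns treat some of the test strings from the question:

```python
import re

# Answer 1: reject strings that contain exactly one dot.
NOT_EXACTLY_ONE_DOT = re.compile(r"^(?![^.]*\.[^.]*$).*$")
# Answer 2: two dots somewhere, or no dots at all.
ZERO_OR_TWO_PLUS = re.compile(r"\..*\.|^[^.]*$")


def accepted_by_lookahead(s: str) -> bool:
    return NOT_EXACTLY_ONE_DOT.search(s) is not None


def accepted_by_alternation(s: str) -> bool:
    return ZERO_OR_TWO_PLUS.search(s) is not None


for sample in ["abc", "dial.check.Catch.Url", "12fd.dafd", "."]:
    print(repr(sample), accepted_by_lookahead(sample), accepted_by_alternation(sample))
```

If single-dot strings such as 12fd.dafd should be accepted as well, as the question's own list of matches suggests, neither pattern is what you want, and a plain "only word characters and dots" test would be closer.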

Matching a substring of n numbers, but not if there are any numbers after that [duplicate]

This question already has answers here:
Java RegEx that matches exactly 8 digits
(3 answers)
Closed 5 years ago.
Basically I'm looking for a regex that matches some simple phone numbers.
I want to match numbers in a longer string of text like 123 4567, 891-0111, or 21314151, something that is (hopefully) identified by (\d{3,4}[- ]\d{3,4}|\d{4,8}), but I don't want to match them if they're part of a longer number like 3919503570275.
If I require the next character to be a non-digit or the end of a line, then that next character is also included in the match, which I don't want.
Surround your regex with a lookahead and a lookbehind to reject \d on both sides:
(?<!\d)(\d{3,4}[- ]\d{3,4}|\d{4,8})(?!\d)
Demo.
Note that this would accept a string that looks like a phone number preceded or followed by letters.
Depending on what programming language you use, I suggest either using a negative lookahead or using groups to extract the number.
See https://www.regular-expressions.info/lookaround.html for information about lookaround patterns.
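As a quick check of the lookaround version, here is a minimal Python sketch (the question does not name a language, so Python and the function name find_numbers are assumptions). The lookbehind and lookahead are zero-width, so the neighbouring characters are tested but never become part of the match.

```python
import re

# Same pattern as above: no digit immediately before or after the number.
PHONE = re.compile(r"(?<!\d)(\d{3,4}[- ]\d{3,4}|\d{4,8})(?!\d)")


def find_numbers(text: str) -> list[str]:
    """Return every phone-like number that is not part of a longer digit run."""
    return [m.group(1) for m in PHONE.finditer(text)]


text = "call 123 4567, 891-0111, or 21314151 but not 3919503570275"
print(find_numbers(text))

# Letters directly next to the digits do not block a match.
print(find_numbers("abc1234def"))
```

The second call illustrates the caveat mentioned above: only digits are rejected as neighbours, so a number glued to letters is still accepted.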

Regex, two uppercase characters in a string [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I'd like to use Regex to ensure that there are at least two upper case characters in a string, in any position, together or not.
The following gives me two together:
([A-Z]){2}
Environment - Classic ASP VB.
You can use the simple regex
[A-Z].*[A-Z]
which matches an upper case letter, followed by any number of anything (except linefeed), and another uppercase letter.
If you need it to allow linefeeds between the letters, you'll have to set the single-line flag. If you're using JavaScript (you should always include a flavor/language tag when asking regex-related questions), that flag was not available before ES2018. In that case, the solution suggested by Wiktor S in a comment to another answer should work.
[A-Z].*[A-Z]
A to Z, any symbols in between, then A to Z again
update
As Wiktor mentioned in comments:
This regex will check for 2 letters on a line (in most regex flavors), not in a string.
So
[A-Z][^A-Z]*[A-Z]
Should do the thing (In most regex flavors/tools)
I believe what you're looking for is something like this:
.*([A-Z]).*([A-Z]).*
Broken out into segments, that's:
.* //Any number of characters (including zero)
([A-Z]) //A capital letter
.* //Any number of characters (including zero)
([A-Z]) //A second capital letter
.* //Any number of characters (including zero)
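The difference between the dot-based patterns and the negated-class pattern only shows up when the two capitals are on different lines. The following Python sketch (Python is an assumption here, since the question targets Classic ASP; the names are made up) compares them on a string containing a linefeed:

```python
import re

SAME_LINE = re.compile(r"[A-Z].*[A-Z]")                   # . stops at a linefeed
SAME_LINE_DOTALL = re.compile(r"[A-Z].*[A-Z]", re.DOTALL)  # single-line flag: . matches linefeed
ANY_GAP = re.compile(r"[A-Z][^A-Z]*[A-Z]")                 # negated class matches linefeed too


def has_two_uppercase(text: str) -> bool:
    """True if the text contains at least two uppercase ASCII letters anywhere."""
    return ANY_GAP.search(text) is not None


print(has_two_uppercase("aBcD"))                      # two capitals on one line
print(has_two_uppercase("abcD"))                      # only one capital
print(SAME_LINE.search("A\nB") is not None)           # capitals on different lines, no flag
print(SAME_LINE_DOTALL.search("A\nB") is not None)    # same input with the flag set
print(has_two_uppercase("A\nB"))                      # negated class needs no flag
```

The negated-class form has the practical advantage of not depending on any flag, which matters in engines that have no single-line option.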