Regex: Match only characters with preceding lowercase letter(s) - regex

I'd like to clean up a subtitle file that has many errors because of OCR. On of the errors is that the l is displayed as I. Of course sometimes the I is really a I, mainly in the case of:
The beginning of a sentence: I'm Ieaving... or - I'm Ieaving....
In names: IsabeIIe.
Maybe a few weird cases.
Since names are difficult to detect, I figured it would be best to replace only the I's with one or more directly preceding lowercase letters and check the rest manually. So after the conversion I get I'm Ieaving and Isabelle. This is the most 'barebone' automated solution I can think of since there are not that many words that have a lowercase letter directly preceding an uppercase letter.
How can I do this in Regex? Thanks in advance.

If your regex engine supports lookbehind, you can find all I's preceded by a lowercase letter like this:
(?<=[a-z])I
Otherwise, you could match both characters, and the second one will be the I.
[a-z]I

Either one of these, and if your engine supports modifier groups.
(?-i:(?<=[a-z])I)
or
(?-i:[a-z]I)
For Unicode, you will want to use properties.

/([a-z])I/ would capture upper case I's preceded by any lowercase letter a-z.

Related

Regex to replace first lowercase character in a line into uppercase

I have a very large file containing thousands of sentences. In all of them, the first word of each sentence begins with lowercase, but I need them to begin with uppercase.
I looked through the site trying to find a regex to do this but I was unable to. I learned a lot about regex in the process, which is always a plus for my job, but I was unable to find specifically what I am looking for.
I tried to find a way of compiling the code from several answers, including the following:
Convert first lowercase to uppercase and uppercase to lowercase (regex?)
how to change first two uppercase character to lowercase character on each line in vim
Regex, two uppercase characters in a string
Convert a char to upper case using regular expressions (EditPad Pro)
But for different reasons none of them served my purpose.
I am working with a translation-specific application which accepts regex.
Do you think this is possible at all? It would save me hours of tedious work.
You can use this regex to search for the first letters of sentences:
(?<=[\.!?]\s)([a-z])
It matches a lowercase letter [a-z], following the end of a previous sentence (which might end with one of the following: [\.!?]) and a space character \s.
Then make a substitution with \U$1.
It doesn't work only for the very first sentence. I intentionally kept the regex simple, because it's easy to capitalize the very first letter manually.
Working example: https://regex101.com/r/hqwK26/1
UPD: If your software doesn't support \U, you might want to copy your text to Notepad++ and make a replacement there. The \U is fully supported, just checked.
UPD2: According to the comments, the task is slightly different, and just the first letters of each line should be capitalized.
There is a simple regex for that: ^([a-z]), with the same substitution pattern.
Here is a working example: https://regex101.com/r/hqwK26/2
Taking Ildar's answer and combining both of his patterns should work with no compromises.
(?<=[\.!?]\s)([a-z])|^([a-z])
This is basically saying, if first pattern OR second pattern. But because you're now technically extracting 2 groups instead of one, you'll have to refer to group 2 as $2. Which should be fine because only one of the patterns should be matched.
So your substitution pattern would then be as follows...
\U$1$2
Here's a working example, again based on Ildar's answer...
https://regex101.com/r/hqwK26/13

Regular Expression for Password strength with one special characters except Underscore

I have the following regular expression:
^.*(?=^.{8,}$)(?=.*\d)(?=.*[!##$%^&*-])(?=.*[A-Z])(?=.*[a-z]).*$
I am using it to validate for
At least one letter
least one capital letter
least one number
least one special characters
least 8 characters
But along with this I need to restrict the underscore (_).
If I enter password Pa$sw0rd, this is validating correctly, which is true.
If I enter Pa$_sw0rd this is also validating correctly, which is wrong.
The thing is the regex is passing when all the rules are satisfied. I want a rule to restrict underscore along with above.
Any help will be very appreciable.
I think you can use a negated character class [^_]* to add this restriction (also, remove the initial .*, it is redundant, and the first look-ahead is already at the beginning of the pattern, no need to duplicate ^, and it is totally redundant since the total length limit can be checked at the end):
^(?=.*\d)(?=.*[!##$%^&*-])(?=.*[A-Z])(?=.*[a-z])[^_]{8,}$
See demo
^(?=.*?\d)(?=.*?[!##$%^&*-])(?=.*?[A-Z])(?=.*?[a-z])(?!.*_).{8,}$
You can try this..* at start is of no use.See demo.
https://regex101.com/r/pG1kU1/34

Here a word is a string of letters, preceded and followed by nonletters

I asked his question earlier but none of the responses solved the problem. Here is the full question:
Give a single UNIX pipeline that will create a file file1 containing all the words in file2, one word per line.Here a word is a string of letters, preceded and followed by nonletters.
I tried every single example that was given below, but i get "syntax error"s when using them.
Does anyone know how I can solve this??
Thanks
if your regex flavor support it you can use lookarounds:
(?<![a-zA-Z])[a-zA-Z]+(?![a-zA-Z])
(?<!..): not preceded by
(?!..): not followed by
If it is not the case you can use capturing groups and negated character classes:
(^|[^a-zA-Z])([a-zA-Z]+)($|[^a-zA-Z])
where the result is in group 2
^|[^a-zA-Z]: start of the string or a non letter characters (all character except letters)
$: end of the string
or the same with one capturing group and two non capturing groups:
(?:^|[^a-zA-Z])([a-zA-Z]+)(?:$|[^a-zA-Z])
(result in group 1)
In order to be unicode compatible, you could use:
(?:^|\PL)\pL+(?:\PL|$)
\pL stands for any letter in any language
\PL is the opposite of \pL
When your objective is to actually find words, the most natural way would be
\b[A-Za-z]+\b
However, this assumes normal word boundaries, like whitespaces, certain punctuations or terminal positions. Your requirement suggests you want to count things like the "example" in "1example2".
In that case, I would suggest using
[A-Za-z]+
Note that you don't actually need to look for what precedes or follows the alphabets. This already captures all alphabets and only alphabets. The greedy requirement (+) ensures that nothing is left out from a capture.
Lookarounds etc should not be necessary because what you want to capture and what you want to exclude are exact inverses of each other.
[Edit: Given the new information in comments]
The methods below are similar to Casimir's, except that we exclude words at terminals (which we were explicitly trying to capture, because of your original description).
Lookarounds
(?<=[^A-Za-z])[A-Za-z]+(?=[^A-Za-z])
Test here. Note that this uses negated positive lookarounds, and not Negative lookarounds as they would end up matching at the string terminals (which are, to the regex engine as much as to me, non-alphabets).
If lookarounds don't work for you, you'd need capturing groups.
Search as below, then take the first captured group.
[^A-Za-z]([A-Za-z]+)[^A-Za-z]
When talking about regex, you need to be extremely specific and accurate in your requirements.

Block all caps sentences

I'm trying to use a regex expression to block all caps sentences (sentences with only capital letters) but I can't succeed at finding the pattern. I was thinking about ^[a-z] but this doesn't work at all.
Any suggestion?
You can perhaps use something like this to make sure there's at least one lowercase character (note that's this is some kind of reverse logic):
^.*[a-z].*$
(Unless the function you're using uses regex against the whole pattern by default, you can drop the beginning and end of line anchors)
If you want the regex to be more strict (though I don't think that's very practical here), you can perhaps use something of the sort...
^[A-Z.,:;/() -]*[A-Z]+[A-Z.,:;/() -]*$
To allow only uppercase letters, and some potential punctuations (you can add or remove them from the character classes as you need) and spaces.
Simply look for [a-z]... If that matches, your sentence passes. If not, it is all caps (or punctuation).
It depends on what flavour of regex you're using but if you have one that supports lookaheads then you can use the following expression:
(?-i)^(?!(?=.*?[A-Z])(?:[A-Z]|(?i)[^a-z])*$)
It won't capture anything but will return false if the letters used are all in caps, and return true if any of the letters used are lower case.
Can't ^[A-Z]+$ simply suit your needs? If it matches, it means that the input string contains only capital letters.
Demo on RegExr.
The following regex
(^|\.)[[:space:]A-Z]+\.
will find any line containing only uppercase letters and whitespace between either start of line, or the preceding full stop.
It appears that you want to detect sentences that have words that have upper case nested inside the word, ex: hEllo, gOODbye, worD; that is any word that has an uppercase after a lower case, or any word with two or more uppercase beside each other.
uppercase after lowercase
[a-z][A-Z]
two or more paired uppercase
[A-Z][A-Z]
Combined them with alternation,
/*([a-z][A-Z]|[A-Z][A-Z])/

Regular expression (alphanumeric)

I need a regular expression to allow the user to enter an alphanumeric string that starts with a letter (not a digit).
This should work in any of the Regular Expression (RE) engines. There is a nicer syntax in the PCRE world but I prefer mine to be able to run anywhere:
^[A-Za-z][A-Za-z0-9]*$
Basically, the first character must be alpha, followed by zero or more alpha-numerics. The start and end tags are there to ensure that the whole line is matched. Without those, you may match the AB12 of the "###AB12!!!" string.
Full explanation:
^ start tag.
[A-Za-z] any one of the upper/lower case letters.
[A-Za-z0-9] any one of the upper/lower case letters or digits,
* repeated zero or more times.
$ end tag
Update:
As Richard Szalay rightly points out, this is ASCII only (or, more correctly, any encoding scheme where the A-Z, a-z and 0-9 groups are contiguous) and only for the "English" letters.
If you want true internationalized REs (only you know whether that is a requirement), you'll need to use one of the more appropriate RE engines, such as the PCRE mentioned above, and ensure it's compiled for Unicode mode. Then you can use "characters" such as \p{L} and \p{N} for letters and numerics respectively. I think the RE in that case would be:
^\p{L}[\pL\pN]*$
but I'm not certain. I've never used REs for our internationalized software. See here for more than you ever wanted to know about PCRE.
I think this should do the work:
^[A-Za-z][A-Za-z0-9]*$
You're looking for a pattern like this:
^[a-zA-Z][a-zA-Z0-9]*$
That one requires one letter and any number of letters/numbers after that. You may want to adjust the allowed lengths.