Regex to replace first lowercase character in a line into uppercase - regex

I have a very large file containing thousands of sentences. In all of them, the first word of each sentence begins with lowercase, but I need them to begin with uppercase.
I looked through the site trying to find a regex to do this but I was unable to. I learned a lot about regex in the process, which is always a plus for my job, but I was unable to find specifically what I am looking for.
I tried to find a way of compiling the code from several answers, including the following:
Convert first lowercase to uppercase and uppercase to lowercase (regex?)
how to change first two uppercase character to lowercase character on each line in vim
Regex, two uppercase characters in a string
Convert a char to upper case using regular expressions (EditPad Pro)
But for different reasons none of them served my purpose.
I am working with a translation-specific application which accepts regex.
Do you think this is possible at all? It would save me hours of tedious work.

You can use this regex to search for the first letters of sentences:
(?<=[\.!?]\s)([a-z])
It matches a lowercase letter [a-z], following the end of a previous sentence (which might end with one of the following: [\.!?]) and a space character \s.
Then make a substitution with \U$1.
The only case it misses is the very first sentence, because nothing precedes it. I intentionally kept the regex simple, since it's easy to capitalize the very first letter manually.
Working example: https://regex101.com/r/hqwK26/1
UPD: If your software doesn't support \U, you might want to copy your text to Notepad++ and make a replacement there. The \U is fully supported, just checked.
UPD2: According to the comments, the task is slightly different, and just the first letters of each line should be capitalized.
There is a simple regex for that: ^([a-z]), with the same substitution pattern.
Here is a working example: https://regex101.com/r/hqwK26/2
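If your application turns out not to support \U, the same per-line replacement can be scripted. Here is a minimal Python sketch, assuming the text can be loaded as a string; the function name capitalize_line_starts is made up for illustration. Python's re module has no \U in replacement strings, so the uppercasing is done in a callback instead.

```python
import re

def capitalize_line_starts(text: str) -> str:
    """Uppercase a lowercase ASCII letter at the start of each line."""
    # re.MULTILINE makes ^ match at the start of every line, not just the string.
    # Python has no \U replacement escape, so a callback does the uppercasing.
    return re.sub(r"^([a-z])", lambda m: m.group(1).upper(), text, flags=re.MULTILINE)

if __name__ == "__main__":
    print(capitalize_line_starts("first line.\nsecond line."))
```

Lines that already start with an uppercase letter, a digit, or whitespace are left untouched, because [a-z] does not match them.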

Taking Ildar's answer and combining both of his patterns should work with no compromises.
(?<=[\.!?]\s)([a-z])|^([a-z])
This basically says: match the first pattern OR the second pattern. Because the expression now has two capture groups instead of one, the letter matched by the second alternative ends up in group 2, so the replacement has to refer to $2 as well. That is fine, because only one of the alternatives matches at a time and the other group is empty.
So your substitution pattern would then be as follows...
\U$1$2
Here's a working example, again based on Ildar's answer...
https://regex101.com/r/hqwK26/13
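For anyone who wants to run the combined pattern outside of a regex-aware editor, here is a rough Python sketch. The name capitalize_sentences is hypothetical. Since only one alternative matches at a time, the callback simply uppercases the whole match instead of juggling $1 and $2.

```python
import re

# Combined pattern from above: a lowercase letter after sentence-ending
# punctuation plus one whitespace character, OR a lowercase letter at line start.
PATTERN = re.compile(r"(?<=[\.!?]\s)([a-z])|^([a-z])", re.MULTILINE)

def capitalize_sentences(text: str) -> str:
    # The match is always a single letter, whichever alternative fired,
    # so uppercasing group 0 covers both capture groups.
    return PATTERN.sub(lambda m: m.group(0).upper(), text)

if __name__ == "__main__":
    print(capitalize_sentences("hello there. how are you? fine!\nnext line. ok"))
```

The lookbehind is fixed-width (two characters), which is what Python's re module requires.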

Related

Finding all possible Acronyms

I have created a script using VBA that goes through a Word document to find all words that could possibly be acronyms, but I found that my regEx pattern does not find all of them.
The regEx pattern I am using is "([A-Z]{2,})(-([A-Z]{2,})[A-Za-z0-9])"
With this pattern I am able to find
AA
AAA
AA-BB
AA-BBB
AAA-BB
AAA-BBB
AAA-1234
AAA-BBB-1234
but it does not find these words
B2B
B2B-1234
B2B-A1A-1234
The expectation of the word match should be that the first character is a letter and that the word must contain at least two uppercase letters and at least one number. In addition, if there are dashes in the word, then the characters before the dash must match the expectation of the word match.
Is there a way to use the regEx pattern above to also include the letter-digit-letter acronyms?
Milco, welcome to StackOverflow. I think that the following regex will work for you:
([A-Z][A-Z0-9]+)(-[A-Z0-9]{2,})*
This regex accommodates digits and an optional number of hyphenated terms and matches each of your cases above. I tested it out at regextesteronline.com - I'm assuming that VB.net regexes are the same as VBA, which they should be, at least for basic regexes.
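To double-check the pattern against the cases listed in the question, here is a small Python sketch (Python rather than VBA, so treat it as a check of the regex itself, not of the VBA engine). The \b word boundaries in FIND_IN_TEXT and the sample sentence are my own additions for illustration.

```python
import re

# The suggested pattern, used with fullmatch to test whole words.
ACRONYM = re.compile(r"([A-Z][A-Z0-9]+)(-[A-Z0-9]{2,})*")

# Assumption: when scanning running text, word boundaries keep the pattern
# from matching the uppercase tail of a longer mixed-case word.
FIND_IN_TEXT = re.compile(r"\b([A-Z][A-Z0-9]+)(-[A-Z0-9]{2,})*\b")

SAMPLES = [
    "AA", "AAA", "AA-BB", "AA-BBB", "AAA-BB", "AAA-BBB",
    "AAA-1234", "AAA-BBB-1234", "B2B", "B2B-1234", "B2B-A1A-1234",
]

def find_acronyms(text: str) -> list[str]:
    return [m.group(0) for m in FIND_IN_TEXT.finditer(text)]

if __name__ == "__main__":
    for word in SAMPLES:
        print(word, bool(ACRONYM.fullmatch(word)))
    print(find_acronyms("Our B2B-1234 portal uses the AAA-BBB API."))
```

Note that the pattern is more permissive than the stated rule: it also accepts something like A1, which has only one uppercase letter.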

RegExp space character

I have this regular expression: ^[a-zA-Z]\s{3,16}$
What I want is to match any name that may contain spaces, for example John Smith, and that is 3 to 16 characters long.
What am I doing wrong?
Background
There are a couple of things to note here. First, a quantifier (in this case, {3,16}) only applies to the last regex token. So what your current regex really is saying is to "Match any string that has a single alphabetical character (case-insensitive) followed by 3 to 16 whitespace characters (e.g. spaces, tabs, etc.)."
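A quick way to see this for yourself is to run the original pattern against a couple of strings. The sketch below uses Python instead of JavaScript, but the quantifier behaves the same way here.

```python
import re

# The pattern from the question: {3,16} applies only to \s, not to [a-zA-Z].
ORIGINAL = re.compile(r"^[a-zA-Z]\s{3,16}$")

if __name__ == "__main__":
    # One letter followed by three spaces is accepted...
    print(bool(ORIGINAL.match("a   ")))
    # ...while an actual name is rejected.
    print(bool(ORIGINAL.match("John Smith")))
```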
Second, a name can have more than 2 parts (a middle name, certain ethnic names like "De La Cruz") or include special characters such as accented vowels. You should consider if this is something you need to account for in your program. These things are important and should be considered for any real application.
Assumptions and Answer
Now, let's just assume you only want a certain format for names that consists of a first name, a last name, and a space. Let's also assume you only want simple ASCII characters (i.e. no special characters or accented characters). Furthermore, both the first and last names should start with a capital character followed by only lower-case characters. Other than that, there are no restrictions on the length of the individual parts of the name. In this case, the following regex would do the trick:
^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$
Notes
The first token after the ^ character is what is called a positive lookahead. Basically a positive look ahead will match the regex between the opening (?= and closing ) without actually moving the position of the cursor that is matching the string.
Notice I removed the \s token, since you usually want only a (space). The space can be replaced with the \s token, if tabs and other whitespace is desired there.
I also added a restriction that a name must start with a capital letter followed by only lower-case letters.
Crude English Translation
To help your understanding, here is a simple English translation of what the regex is really doing. The description of the last name just repeats the description of the first name.
"Match any string that has 3-16 characters and starts with a capital alphabetical character followed by one or more (+) lowercase alphabetical characters, followed by a single space, followed by a capital alphabetical character followed by one or more (+) lowercase alphabetical characters, with nothing after the final lowercase letter."
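Here is a minimal sketch that exercises the regex under the assumptions listed above. It is written in Python purely for illustration, and is_valid_name is a made-up helper name.

```python
import re

# Lookahead enforces the overall 3-16 length; the rest enforces the
# "Capitalized Capitalized" shape with exactly one space in between.
NAME = re.compile(r"^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$")

def is_valid_name(name: str) -> bool:
    return NAME.match(name) is not None

if __name__ == "__main__":
    for candidate in ["John Smith", "john smith", "John", "Christopher Robinson", "John  Smith"]:
        print(repr(candidate), is_valid_name(candidate))
```

"Christopher Robinson" has the right shape but is 20 characters long, so the lookahead rejects it before the rest of the pattern is even tried.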
Tools
There are a couple of tools I like to use when I am trying to tackle a challenging regex. They are listed below in no particular order:
https://regex101.com/ - Allows you to test regex expressions in real time. It also has a nifty little library to help you along.
http://www.regular-expressions.info/ - Basically a repository of knowledge on regex.
Edit/Update
You mentioned in your comments that you are using your regex in JavaScript. JavaScript delimits a regex literal with forward slashes. For this simple case, there are 2 options for using a regex to match a string.
First, use String's match method as follows
"John Smith".match(/^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$/);
Second, create a regex and use its test() method. For example,
/^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$/.test("John Smith");
The latter is probably what you want as it simply returns true or false depending on whether the regex actually matches the string or not.

regex to match first instance of a word but only when preceeded by match from another pattern

I've found some info on finding the first instance of a word in a string, but I'm trying to find the first instance of a word (two, actually, but in separate calls) only when it is preceded by some very specific text (an IP address delimited by underscores) that varies slightly. Also, these words are separated by underscores, so for some reason \b isn't working for me.
Here are some example strings to test against, one line at a time. Only the word card that is immediately followed by digits should be matched, so nothing in the third line should match.
192_168_10_2_card02_port01_other_text_with_card_or_port
10_22_1_200_card4_port5_another_string_with_port_or_card
something_else_with_card_or_port_in_it
And in a second call, I'd like to match a different word in these strings: the word port that is immediately followed by digits.
192_168_10_2_card02_port01_other_text_with_card_or_port
10_22_1_200_card4_port5_another_string_with_port_or_card
something_else_with_card_or_port_in_it
My regex flavor is POSIX regex (for PostgreSQL 9.4). I've been able to run with anything that works in here http://regexpal.com/ so far.
Even if it can't solve for all 3 examples at once, if it could just solve for the first two, that would be very helpful.
Edit: To be absolutely clear, my intent is to replace the first string 'card' with the character 'c' and then to replace the first string 'port' with the letter 'p' without affecting any instance of 'card' or 'port' that are not immediately followed by numbers. This is why my match needs to include just those first words without their corresponding numbers.
If you can use negative lookahead, you can use card((?!port).)*port to match a string with card, then any number of characters that do not start an occurrence of port, then port.
EDIT:
if the input is always in the same format, then you can be more specific by using card[0-9]{1,2}_port. This will keep it from matching any other extraneous instances of card and port
EDIT2:
To match only the word in the first case you can use a positive lookahead: card(?=[0-9]{1,2}_port). I'm not sure if your flavor allows positive lookbehind (the tester doesn't, but that is in JS), but give (?<=card[0-9]{1,2}_)port a shot. If positive lookbehind doesn't work you may need to look into alternatives.
The \b assertion is not working in this case because _ is considered a word character.
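This is easy to confirm with a quick experiment. The sketch below uses Python's re module as a stand-in (an assumption, since the question is about PostgreSQL), where \w likewise includes the underscore.

```python
import re

LINE = "192_168_10_2_card02_port01_other_text_with_card_or_port"

# "_" is a word character, so there is no word boundary between "_" and "card".
underscore_is_word_char = re.fullmatch(r"\w", "_") is not None
boundary_match = re.search(r"\bcard", LINE)

# Treating "_" as the delimiter explicitly does find the word.
lookbehind_match = re.search(r"(?<=_)card", LINE)

if __name__ == "__main__":
    print(underscore_is_word_char)
    print(boundary_match)
    print(lookbehind_match.start())
```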
You can use a look behind:
(?<=_)(card).*?(?<=_)(port)
To be even more specific, use the IP address pattern:
(^(?:\d+_){4})(card\d+)_(port\d+)
I had to solve this in two steps. In the first, I matched only lines with the IP string in the beginning (this excludes lines like my 3rd example). In the second step, I used regexp_replace to replace the first match of each word.
I had completely missed the fact that regexp_replace only replaces the first match unless told otherwise with the 'g' flag, which is exactly the behavior needed here:
WHEN (SELECT regexp_matches(mystring, '^1(?:[0-9]{1,3}_){4}card[0-9]{1,2}_port[0-9]{1,2}')) IS NOT NULL
THEN regexp_replace(regexp_replace(mystring, 'card', 'c'), 'port', 'p')
Though I still wish I could figure out how to match one of those words in a single expression, and I would accept any answer that could achieve that.
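One way to get both replacements in a single expression is to capture the parts you want to keep and rebuild the string around them, rather than trying to match the bare words. The following is only a Python sketch of that idea, not PostgreSQL code; the function name abbreviate is hypothetical, and whether the same capture-group replacement carries over to regexp_replace would need to be checked against the PostgreSQL documentation.

```python
import re

# Group 1: the underscore-delimited IP prefix; groups 2 and 3: the digits
# after "card" and "port". The literal words themselves are not captured.
PATTERN = re.compile(r"^((?:[0-9]{1,3}_){4})card([0-9]{1,2})_port([0-9]{1,2})")

def abbreviate(line: str) -> str:
    # Rebuild the prefix with "c" and "p" in place of "card" and "port".
    # Lines without the IP prefix do not match and are returned unchanged.
    return PATTERN.sub(r"\g<1>c\g<2>_p\g<3>", line)

if __name__ == "__main__":
    for line in [
        "192_168_10_2_card02_port01_other_text_with_card_or_port",
        "10_22_1_200_card4_port5_another_string_with_port_or_card",
        "something_else_with_card_or_port_in_it",
    ]:
        print(abbreviate(line))
```

Because the pattern is anchored at the start of the line, later occurrences of card or port are never touched.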

Regex for capitalizing first letter in a tag, alt=", etc

I've found regular expressions that capitalize the first letter in a sentence. But does anyone know a regex that capitalizes the first letter inside a tag, including URL and image attributes (e.g. title="antelope" or alt="antelope")?
I used another regex to change all my image paths to lower case, and it zapped a bunch of my tags as well (alt, title, h2, etc.). So now I'd like to get a head start fixing them by capitalizing the first letters.
I'm working on a Mac, using Dreamweaver and TextWrangler as my text editors.
Before...
alt="antelope" title="antelope" <h2>antelope
After...
alt="Antelope" title="Antelope" <h2>Antelope
Regex
(="\w|>\w)
Replace Regex
\U$1\E
Description: This will work for your example, depending on the regex engine you are using.
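If your editor's engine does not support \U and \E, the same replacement can be scripted. Below is a minimal Python sketch under that assumption; capitalize_values is a made-up name, and the uppercasing happens in a callback because Python's re has no \U escape.

```python
import re

def capitalize_values(html: str) -> str:
    """Uppercase the first word character after =" or after >."""
    # The match includes the =" or > prefix; upper() leaves those
    # punctuation characters unchanged and only affects the letter.
    return re.sub(r'(="\w|>\w)', lambda m: m.group(1).upper(), html)

if __name__ == "__main__":
    print(capitalize_values('alt="antelope" title="antelope" <h2>antelope'))
```

Be aware that this touches every attribute value, so something like src="images/antelope.jpg" would also be capitalized; you may want to restrict the pattern to specific attributes before running it on real files.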
This replaces the value in parameters in a url. NOT in html, as I now see that is what you mean. Oh well.
Find what: (\?|\&)([a-z_]+=)([a-z])([^&]+)
Replace (all) with: $1$2\u$3$4
Free spaced:
(\?|\&)
Capture group 1: Either the literal question mark or ampersand.
([a-z_]+=)
Capture group 2: One or more of any lowercase letter or underscore, followed by the equals sign.
([a-z])
Capture group 3: The first letter in the value of the url parameter. Note this does not even notice parameters whose values don't start with a letter.
([^&]+)
Capture group 4: Every other character in the value. Or more specifically, one or more of any character as long as it's not an ampersand. This is a negated character class.
The \u in the replace-with is an option in TextWrangler (and in TextPad, which is what I use...so TextWrangler might also use the Boost regex engine) replacement that uppercases the immediately-following character. I'm not sure if this would work if capture groups 3 and 4 were merged.
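For reference, here is roughly what the same find/replace looks like as a Python sketch. The function name and the example.com URL are invented for illustration, and since Python's re has no \u replacement escape, group 3 is uppercased in a callback.

```python
import re

# Same four capture groups as described above.
PARAM = re.compile(r"(\?|\&)([a-z_]+=)([a-z])([^&]+)")

def capitalize_param_values(url: str) -> str:
    def repl(m: re.Match) -> str:
        # Equivalent of the replacement $1$2\u$3$4.
        return m.group(1) + m.group(2) + m.group(3).upper() + m.group(4)
    return PARAM.sub(repl, url)

if __name__ == "__main__":
    print(capitalize_param_values("http://example.com/?name=antelope&kind=deer"))
```

Because group 4 requires at least one character, a single-letter value such as ?x=a is left unchanged.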
Try it (although it doesn't have the \u option.)
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. There's a lot of helpful information in it, including a list of online regex testers (in the bottom section), so you can try things out yourself. All the links in this answer come from the FAQ.

Password validation regex

I am trying to get one regular expression that does the following:
makes sure there are no white-space characters
minimum length of 8
makes sure there is at least:
one non-alpha character
one upper case character
one lower case character
I found this regular expression:
((?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])(?!\s).{8,})
which takes care of points 2 and 3 above, but how do I add the first requirement to the above regex expression?
I know I can do two expressions the one above and then
\s
but I'd like to have it all in one, I tried doing something like ?!\s but I couldn't get it to work. Any ideas?
^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])\S{8,}$
should do. Be aware, though, that you're only validating ASCII letters. Is Ä not a letter for your requirements?
\S means "any character except whitespace", so by using this instead of the dot, and by anchoring the regex at the start and end of the string, we make sure that the string doesn't contain any whitespace.
I also removed the unnecessary parentheses around the entire expression.
Tim's answer works well, and is a good reminder that there are many ways to solve the same problem with regexes, but you were on the right track to finding a solution yourself. If you had changed (?!\s) to (?!.*\s) and added the ^ and $ anchors at the start and end, it would work.
^((?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])(?!.*\s).{8,})$
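To compare the two variants side by side, here is a small Python sketch. The constant and function names are my own, and the sample passwords are made up; it simply checks that both patterns agree on a handful of inputs.

```python
import re

# Variant 1: \S{8,} between anchors rules out whitespace directly.
NON_WHITESPACE = re.compile(r"^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])\S{8,}$")

# Variant 2: a negative lookahead rejects any whitespace, then .{8,} checks length.
NEG_LOOKAHEAD = re.compile(r"^((?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])(?!.*\s).{8,})$")

def check(password: str) -> tuple[bool, bool]:
    """Return the verdict of each variant for one candidate password."""
    return (
        NON_WHITESPACE.match(password) is not None,
        NEG_LOOKAHEAD.match(password) is not None,
    )

if __name__ == "__main__":
    for candidate in ["Passw0rd", "password1", "PASSWORD1", "Password", "Pass word1", "Pw1"]:
        print(repr(candidate), check(candidate))
```

Only "Passw0rd" satisfies every requirement; each of the others fails exactly one rule (no uppercase, no lowercase, no non-alpha character, contains a space, too short).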