Regex to match exact string - regex

I am having trouble figuring out how to detect a prefix on a message when the prefix must have an exact number of characters.
For example, I tried the regex ^[!]{1} and the messages I was testing with were:
!test
!!!test
But that regex would mark both strings as starting with the prefix.
I have saved both prefixes and I want to detect which one is being parsed.

You need to tell the regex what an invalid match is as well. One way to do this is to look for an exclamation mark and then anything but an exclamation mark.
^![^!]
or you could look for an ! then an alpha character:
^![a-zA-Z]
Which one to use depends on what should follow the prefix.
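As a rough illustration of the idea above, here is a minimal Python sketch. The helper name prefix_of and the choice of "!" and "!!!" as the two saved prefixes are assumptions made for the example; the patterns simply extend ^![^!] to the three-character prefix.

```python
import re

# One "!" followed by a character that is not "!" (pattern from the answer).
SINGLE = re.compile(r"^![^!]")
# Same idea for a three-character prefix (assumed second prefix).
TRIPLE = re.compile(r"^!!![^!]")

def prefix_of(message):
    """Return the prefix the message starts with, or None (hypothetical helper)."""
    if SINGLE.match(message):
        return "!"
    if TRIPLE.match(message):
        return "!!!"
    return None

# The original pattern accepts both messages, which is the reported problem.
both_match = all(re.match(r"^[!]{1}", m) for m in ("!test", "!!!test"))
```

Note that a message consisting of the prefix alone (just "!") is not matched by these patterns, because [^!] requires one more character.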

Related

Regex for fixing YAML strings

I am trying to create a bunch of YAML files, mostly composed of strings of text. Now when using apostrophes in words, they must be escaped by typing a double apostrophe, because I’m using apostrophes to wrap the strings.
I want to create a regex that will check for apostrophes in the text that aren’t double. What I have is this:
^([^'\n]*?)'(([^'\n]*?)'(?!')([^'\n]+?))*?'$\n
https://regex101.com/r/v4nUTn/3
My issue is that as soon as my string has a double apostrophe, but also has an apostrophe which isn’t a double apostrophe, it doesn’t match because my negative lookahead doesn’t match as soon as it sees the double apostrophe. (for example the string t''e'st won’t match even though it is missing a double apostrophe after the e)
How can I make it so that my negative lookahead will not fail as soon as it sees one double apostrophe?
This regex should work:
\w'\w
Test here.
My guess is that maybe an expression similar to
('[^'\r\n]*'|[^\r\n\w']+)|([\w']*)
would be an option to look into.
If the second capturing group returns true, then the string is undesired.
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
One suggestion would be to do this in two steps.
For example, if every 'candidate' value looks like this: - 'something here' (where you want to test the apostrophes in the something here content of the string), then first isolate that content via:
/^\s*- '(.+)'$/im
And then make sure all apostrophes appear as you want them to within match group 1 of the result.
Then, replace the original match with your 'sanitised' match.
Doing this means you don't have to be concerned with the bounding apostrophes causing complications to the check for apostrophes in the value.
Note: there may well be a perfect one-step regex to do this, but understanding that you can break tasks into several steps is useful if you spend a lot of time with regular expressions, and can help you sidestep 'perfect regex paralysis'.
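The two-step idea could look something like the following Python sketch. The function name and the way step two is done (dropping every doubled apostrophe and checking whether any apostrophe is left) are my own assumptions, not part of the answer above; only the isolating regex comes from it.

```python
import re

# Step 1: isolate the content between the bounding apostrophes
# (same pattern as above, written for Python's re module).
CANDIDATE = re.compile(r"^\s*- '(.+)'$", re.IGNORECASE | re.MULTILINE)

def has_unescaped_apostrophe(line):
    """Hypothetical helper: True if the quoted value contains a lone apostrophe."""
    match = CANDIDATE.match(line)
    if match is None:
        return False  # not a candidate line at all
    content = match.group(1)
    # Step 2: remove every escaped pair; anything left over is unescaped.
    return "'" in content.replace("''", "")
```

Because the bounding apostrophes are stripped in step one, the check in step two never has to reason about them.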
If you want your string to match when there is at least one lone single quote between the enclosing single quotes, then the repeated part of your regex should consume either text that contains no single quote or a pair of single quotes. To do that, add |'' as an alternative in your regex, so that it consumes either non-quote text or a doubled quote.
Try this updated regex demo and see whether it works the way you wanted:
https://regex101.com/r/v4nUTn/4

Regex to replace first lowercase character in a line into uppercase

I have a very large file containing thousands of sentences. In all of them, the first word of each sentence begins with lowercase, but I need them to begin with uppercase.
I looked through the site trying to find a regex to do this but I was unable to. I learned a lot about regex in the process, which is always a plus for my job, but I was unable to find specifically what I am looking for.
I tried to find a way of compiling the code from several answers, including the following:
Convert first lowercase to uppercase and uppercase to lowercase (regex?)
how to change first two uppercase character to lowercase character on each line in vim
Regex, two uppercase characters in a string
Convert a char to upper case using regular expressions (EditPad Pro)
But for different reasons none of them served my purpose.
I am working with a translation-specific application which accepts regex.
Do you think this is possible at all? It would save me hours of tedious work.
You can use this regex to search for the first letters of sentences:
(?<=[\.!?]\s)([a-z])
It matches a lowercase letter [a-z], following the end of a previous sentence (which might end with one of the following: [\.!?]) and a space character \s.
Then make a substitution with \U$1.
The only case it doesn't handle is the very first sentence. I intentionally kept the regex simple, because it's easy to capitalize the very first letter manually.
Working example: https://regex101.com/r/hqwK26/1
UPD: If your software doesn't support \U, you might want to copy your text to Notepad++ and make the replacement there. \U is fully supported there; I just checked.
UPD2: According to the comments, the task is slightly different, and just the first letters of each line should be capitalized.
There is a simple regex for that: ^([a-z]), with the same substitution pattern.
Here is a working example: https://regex101.com/r/hqwK26/2
Taking Ildar's answer and combining both of his patterns should work with no compromises.
(?<=[\.!?]\s)([a-z])|^([a-z])
This basically says: match the first pattern OR the second pattern. Because you're now technically capturing two groups instead of one, you'll have to refer to the second group as $2. That is fine, because only one of the two patterns can match at a time.
So your substitution pattern would then be as follows...
\U$1$2
Here's a working example, again based on Ildar's answer...
https://regex101.com/r/hqwK26/13
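If the replacement has to be done in a tool without \U support, the same combined pattern can be applied from a script. Here is a minimal sketch in Python, whose re module does not support \U in replacement strings, so a replacement function is used instead. The function name capitalize_sentences is made up for the example.

```python
import re

# Combined pattern from above; MULTILINE makes ^ match at every line start.
FIRST_LETTER = re.compile(r"(?<=[.!?]\s)([a-z])|^([a-z])", re.MULTILINE)

def capitalize_sentences(text):
    """Uppercase the first lowercase letter of each sentence and each line."""
    # Whichever alternative matched, the whole match is a single letter.
    return FIRST_LETTER.sub(lambda m: m.group(0).upper(), text)
```

Since only one letter is ever matched, using the whole match avoids having to decide between group 1 and group 2.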

RegEx: Searching for numbers (int, float) that are NOT part of a word

I'm hoping we have some regular expression gurus here who might be able to help me - a regex newbie - solve a problem.
I know some people will want to know some background info on this issue:
Regex Flavor: Basic Regex, being used in a Vertica Database using the REGEXP_REPLACE function.
The regex I am using is working great with one exception.
I have a rule that I'm trying to implement, related to stripping the numbers from text, where any number that is part of a word, e.g. table5, go2market, 33monroe, room222, etc. is ignored and NOT filtered.
Here is what I started with for detecting numbers:
[-+]?[0-9]*\.?[0-9]
That seems to work pretty well, including handling directly adjacent commas and parentheses for example.
But all cases where a number is part of alphabetic text are also being detected, which breaks the rule that the number cannot be part of a word; by word, I mean any alphabetic text.
So, in searching for solutions, I happened upon this regex that seems to work well detecting those specific cases where numbers appear next to, or in, any string of characters:
((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
My thought was that maybe I could add this as an INVERTED match to my original regex, to allow it to still select standalone numbers while ignoring those that were a part of a word, like so:
[-+]?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)*\.?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
Unfortunately however, it breaks the original detection of standalone numbers.
:(
I'm hoping there is someone here that can spot what I'm doing wrong, and help me identify the right solution?
Thanks in advance!
According to Vertica documentation, the regex flavour seems to follow the Perl syntax. In this case you can use negative lookarounds and in particular a negative lookbehind: (?<!\w) (not preceded with a word character.)
Lookarounds are only tests and don't consume characters.
You can also use a negative lookahead to test the right side, (?!\w) (not followed by a word character), but it's simpler to use a word boundary since the pattern ends with a digit (which is also a word character):
(?<!\w)[-+]?\d*\.?\d+\b
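To see how this pattern behaves on the examples from the question, here is a small sketch using Python's re module rather than Vertica's REGEXP_REPLACE. It assumes the lookbehind and word boundary behave the same way in both engines, which I have not verified for Vertica.

```python
import re

# Pattern from above: not preceded by a word character, ending at a word boundary.
STANDALONE_NUMBER = re.compile(r"(?<!\w)[-+]?\d*\.?\d+\b")

sample = "table5 go2market 33monroe room222 costs 42 or -3.5 (7)"

# Numbers that would be stripped; digits glued to letters are left alone.
found = STANDALONE_NUMBER.findall(sample)

# Equivalent of the replace step, with an empty replacement string.
stripped = STANDALONE_NUMBER.sub("", sample)
```

Here found contains only 42, -3.5 and 7, while table5, go2market, 33monroe and room222 are untouched.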
In the worst case, if you have something like v1.0 in your string and you want to avoid it, you can try the backtracking control verbs (*SKIP) and (*FAIL). (*FAIL) forces the pattern to fail and (*SKIP) skips all the positions already matched before it. I hope Vertica supports these Perl regex features.
Something like:
\p{L}+[-+]?\d*\.?\d+(*SKIP)(*FAIL)|[-+]?\d*\.?\d+(*SKIP)(?!\p{L})

Regex taking too many characters

I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The italicized parts (date, time, name) are variable; they might contain spaces, or a new line might start at any of these points.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, the [\s\S]+? is supposed to match any letter, number, space or line break, as I am unable to predict what could be between the fixed words that I am sure will always be there.
Now comes the hard part, when I would add the word "On" somewhere in the text above the sentence that I want to match, the regex now matches a much bigger text than I want. This is due to the use of [\s\S]+.
How can I make my regex match as few characters as possible? Using "?" after the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more realistic example, with HTML tags, can be found here. In it, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex in the text above. In HTML this is barely possible, as the user would never write the matching tag himself, but I do want to make sure my regex is correct just in case.
These things are certain: I know what the part of text starts with, no matter where it sits in the entire text; I know what the part of text ends with; and there are specific fixed words that might make the regex more reliable, but they can be omitted. Any text below the searched part may also be matched, but no text above it may be matched at all.
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go with in this text, so the regex has less tokens to work with. Adding just the first < already makes the regex take too much.
From my own experience, most problems are avoided when I make sure not to use any [\s\S]+? before a (\r)?(\n)? has come first.
[\s\S] matches every character because it is the union of two complementary sets; it is like . with the /s option (dot matches newlines). Quantifiers are greedy by default, so the largest match is returned.
In the "Correct" link, the token right after the shortest match must be geschreven. Another, more flexible way to write this without lazy quantifiers is to put a negative lookahead in front of the repeated character set, inside the loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is for non capturing the group which just encapsulates the negative lookahead and the . (which can be replaced by [\s\S])
(?! ) inside is the negative lookahead, which ensures that the current position, before the next character is consumed, is not the beginning of the end token.
Following the comments, you can state explicitly what must not appear in the repeated sequence:
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
These variants show what the technique (?:(?!tokens)[\s\S])+ does:
in the first, this can't appear between From and this;
in the second, From or this can't appear between From and this;
in the third, this or point can't appear between this and point;
etc.
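To compare the lazy version with the lookahead version, here is a small Python sketch (the question targets VB.NET, so this is only an illustration; the sample text is invented). It uses the third pattern above with case ignored, as in the question.

```python
import re

# Lazy version from the question: starts at the first "From" it can find.
LAZY = re.compile(
    r"From[\s\S]+?this[\s\S]+?point[\s\S]+?everything:", re.IGNORECASE
)

# Lookahead version: forbidden tokens may not appear inside each gap.
TEMPERED = re.compile(
    r"From(?:(?!From|this)[\s\S])+this"
    r"(?:(?!this|point)[\s\S])+point"
    r"(?:(?!everything)[\s\S])+everything:",
    re.IGNORECASE,
)

# Invented sample: an extra "from" appears above the part we want.
text = "Note from me.\nFrom A this B point C everything: rest"

lazy_match = LAZY.search(text).group(0)
tempered_match = TEMPERED.search(text).group(0)
```

The lazy pattern still begins at the earlier "from", because laziness only shortens the match from the right. The lookahead version cannot cross the second "From", so the attempt starting at the first one fails and the match begins where intended.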

What is wrong with my simple regex that accepts empty strings and apartment numbers?

So I wanted to limit a textbox which contains an apartment number which is optional.
Here is the regex in question:
([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)
Simple enough eh?
I'm using these tools to test my regex:
Regex Analyzer
Regex Validator
Here are the expected results:
Valid
"1234A"
"Z"
"(Empty string)"
Invalid
"A1234"
"fhfdsahds527523832dvhsfdg"
Obviously, if I'm here, the invalid ones are accepted by the regex. The goal of this regex is to accept either 1 to 4 digits with an optional letter, or a single letter, or an empty string.
I just can't seem to figure out what's not working, I mean it is a simple enough regex we have here. I'm probably missing something as I'm not very good with regexes, but this syntax seems ok to my eyes. Hopefully someone here can point to my error.
Thanks for all help, it is greatly appreciated.
You need to use the ^ and $ anchors for your first two options as well. Also you can include the second option into the first one (which immediately matches the third variant as well):
^[0-9]{0,4}[A-Z]?$
Without the anchors your regular expression matches because it will just pick a single letter from anywhere within your string.
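A quick way to check the anchored pattern against the expected results from the question is a sketch like this one in Python. The function name is made up, and fullmatch is used as an extra safeguard because $ in Python also matches just before a trailing newline.

```python
import re

# Anchored pattern from above: up to four digits, then an optional capital letter.
APARTMENT = re.compile(r"^[0-9]{0,4}[A-Z]?$")

def is_valid_apartment(value):
    """Hypothetical validator for the optional apartment-number field."""
    return APARTMENT.fullmatch(value) is not None

# The original unanchored pattern finds a match somewhere inside invalid input.
ORIGINAL = re.compile(r"([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)")
original_accepts_invalid = ORIGINAL.search("A1234") is not None
```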
Depending on the language, you can also use a negative look ahead.
^[0-9]{0,4}[A-Za-z](?!.*[0-9])
Breakdown:
^[0-9]{0,4} = This looks for a digit, 0 to 4 times, at the beginning of the string
[A-Za-z] = This looks for a single letter (either case)
(?!.*[0-9]) = This will only allow the letter if there are no digits anywhere after it.
I haven't quite figured out how to validate an empty string, but that might be easier to do with tools from whatever language you are using. Something along this logic:
if String doesn't equal $null then check the regex
Something along those lines, just adjusted for however you would do it in your language.
I used RegEx Skinner to validate the answers.
Edit: Fixed error from comments