Putting a group within a group [123[a-u]] - regex

I'm having a lot more difficulty than I anticipated in creating a simple regex to match any specific characters, including a range of characters from the alphabet.
I've been playing with regex101 for a while now, but every combination seems to result in no matches.
Example expression:
[\n\r\t\s\(\)-]
Preferred expression:
[[a-z][a-Z]\n\r\t\s\(\)-]
Example input:
(123) 241()-127()()() abc ((((((((
Ideally the expression will capture every character except the digits
I know I could always manually input "abcdefgh".... but there has to be an easier way. I also know there are easier ways to capture numbers only, but there are some special characters and letters which I may eventually need to include as well.

A character class can contain a range, like the [a-z] in your example, which matches any single lowercase letter from a to z. To match more than one character, add a "+" quantifier; to match an exact number of characters, use {n}, where n is the count you want. So [a-z]+ matches one or more letters, and [a-z]{4} matches exactly four consecutive letters between a and z.

You can use partial intervals. For example, [a-j] will match any character from a to j. So [a-j]{2} applied to the string a6b7cd will match only cd. You can also use several intervals within the same character class, like this: [a-j4-6]{4}. This regex will match ab44 but not ab47.
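As a rough sketch of the point above, here is how several ranges and literal characters can sit in one character class, using Python's re module (my choice for illustration; the question itself used regex101). The class [a-zA-Z\s()-] is an assumption about what the asker wanted: ASCII letters, whitespace, parentheses and the hyphen.

```python
import re

# Two ranges plus literal characters in a single character class.
# Assumption: ASCII letters only; \s already covers \n, \r, \t and space.
NON_DIGIT = re.compile(r"[a-zA-Z\s()-]")

sample = "(123) 241()-127()()() abc (((((((("
kept = "".join(NON_DIGIT.findall(sample))
print(kept)  # the sample with every digit dropped

# Quantified ranges from the answer above.
cd_only = re.findall(r"[a-j]{2}", "a6b7cd")
ab44_ok = bool(re.fullmatch(r"[a-j4-6]{4}", "ab44"))
ab47_ok = bool(re.fullmatch(r"[a-j4-6]{4}", "ab47"))
print(cd_only, ab44_ok, ab47_ok)  # ['cd'] True False
```

The hyphen is placed last in the class so that it is read as a literal character rather than as part of a range.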

I overlooked a pretty small character. The term I was looking for was apparently "alternation".
[\r\t\n]|[a-z] with the missing element being the | character. This allows the expression to match a character from either the first class or the second one.
At least that's my conclusion when testing this specific example.
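For what it's worth, a small check in Python (assumed here only for illustration) suggests that the alternation of two single-character classes behaves the same as one merged class, so the | is not strictly required for this case:

```python
import re

text = "ab\tCD\n12"

# Alternation between two classes versus one class holding both sets.
alternation = re.findall(r"[\r\t\n]|[a-z]", text)
merged = re.findall(r"[\r\t\na-z]", text)

print(alternation)            # ['a', 'b', '\t', '\n']
print(alternation == merged)  # True
```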

Related

Regex Match 6 Letter String With Chars and Number and No positive look around

I know there are several similar answers, but I am struggling to find one that fits my use case.
I need a regex to extract IDs that are 6 characters long and have a mix of numbers and characters.
The IDs will start with one of the following chars [eEdDwWaA]
I have had some solutions that have nearly worked, but the tool I want to plug this regex into does NOT support positive look around and every answer seems to use this.
The string I need to find can be anywhere in text and will either be preceded by a whitespace or a backslash.
Example of what I would want to match is eh3geh (case insensitive)
Here is what I have so far [eEdDwWaA](?:[0-9]+[a-z]|[a-z]+[0-9],{5})[a-z0-9]*
This works for the most part but it is not consistently matching and I'm not sure why.
If you can't use a lookahead an idea is to capture using The Trick.
The trick is that we match what we don't want on the left side of the alternation (the |), then we capture what we do want on the right side....
[\\ ](?:.[a-z]{5}|([eEdDwWaA][a-z0-9]{5}))\b
.[a-z]{5} matches what we don't want: a candidate made up of letters only (left side)
|(...) but captures what we need into group one (right side)
Here is the demo at regex101
Get the captures of group 1 on the program side (where the group is not null/empty).
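A minimal sketch of the program-side step, assuming Python and a made-up sample string (the helper name extract_ids is hypothetical). The IGNORECASE flag is an assumption based on the asker's "case insensitive" remark:

```python
import re

# Pattern from the answer; re.IGNORECASE is an assumption taken from the
# question's "case insensitive" requirement.
ID_PATTERN = re.compile(
    r"[\\ ](?:.[a-z]{5}|([eEdDwWaA][a-z0-9]{5}))\b",
    re.IGNORECASE,
)

def extract_ids(text):
    """Keep group 1 only where the right-hand branch matched."""
    return [m.group(1) for m in ID_PATTERN.finditer(text) if m.group(1)]

# Hypothetical input: one ID after a space, one after a backslash,
# and one letters-only word that the left branch swallows.
sample = r"id eh3geh seen at \w1abc2 but not editor"
ids = extract_ids(sample)
print(ids)  # ['eh3geh', 'w1abc2']
```

The word "editor" is still matched by the overall pattern, but through the left branch, so group 1 stays empty and it is filtered out.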

Regex - Alternate between letters and numbers

I am wondering how to build a regex that would match a string like "D1B2C4Q3", however long it is, but not "DDA1Q3" nor "D$1A2B".
That is, a number must always follow a letter and vice versa. I've been working on this for a while and my current expression ^([A-Z0-9])(?!)+$ clearly does not work.
^([A-Z][0-9])+$
Because your expression combines the letters and digits into a single character class, it matches either kind of character in any order. You need to separate the classes and place them sequentially within a group.
I might actually use a simple regex pattern with a negative lookahead to prevent duplicate letters/numbers from occurring:
^(?!.*(?:[A-Z]{2,}|[0-9]{2,}))[A-Z0-9]+$
Demo
The reason I chose this approach, rather than a non lookaround one, is that we don't know a priori whether the input would start or end with a number or letter. There are actually four possible combinations of start/end, and this could make for a messy pattern.
I'm guessing maybe,
^(?!.*\d{2}|.*[A-Z]{2})[A-Z0-9]+$
might work OK, or maybe not.
Demo 1
A better approach would be:
^(?:[A-Z]\d|\d[A-Z])+$
Demo 2
Or
^(?:[A-Z]\d|\d[A-Z])*$
Or
^(?:[A-Z]\d|\d[A-Z]){1,}$
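Which one to use depends on whether you'd like an empty string to be valid. Note that each repetition can pick either alternative independently, so these patterns also accept an input such as A11A, where two digits are adjacent.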
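A quick check of how the paired alternation behaves on a few inputs, sketched in Python (an assumption; any engine with the same syntax should behave alike). The sample strings other than D1B2C4Q3 are my own:

```python
import re

# The "+" variant from above: one or more letter/digit pairs in either order.
PAIRED = re.compile(r"^(?:[A-Z]\d|\d[A-Z])+$")

checks = {s: bool(PAIRED.match(s)) for s in ["D1B2C4Q3", "1A2B", "A1A", "A11A"]}
for s, ok in checks.items():
    print(s, ok)
# D1B2C4Q3 True
# 1A2B True
# A1A False   (odd length, cannot be split into pairs)
# A11A True   (pairs "A1" and "1A", so two digits end up adjacent)
```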
Another idea that will match A, A1, 1A, A1A, ...
^\b\d?(?:[A-Z]\d)*[A-Z]?$
See this demo at regex101
\b the word boundary at ^ start requires at least one char (remove, if empty string valid)
\d? followed by an optional digit
(?:[A-Z]\d)* followed by any amount of (?: upper alpha followed by digit )
[A-Z]?$ ending in an optional upper alpha
If you want to accept lower alphas as well, use i flag.
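To compare the approaches side by side, here is a small sketch in Python (my assumption for the language). The labels and the extra samples A1A and 1A are made up for illustration; the patterns are copied from the answers above:

```python
import re

PATTERNS = {
    "paired": re.compile(r"^([A-Z][0-9])+$"),
    "lookahead": re.compile(r"^(?!.*(?:[A-Z]{2,}|[0-9]{2,}))[A-Z0-9]+$"),
    "optional_ends": re.compile(r"^\b\d?(?:[A-Z]\d)*[A-Z]?$"),
}
SAMPLES = ["D1B2C4Q3", "DDA1Q3", "D$1A2B", "A1A", "1A"]

# For each pattern, list the samples it accepts.
results = {
    name: [s for s in SAMPLES if pattern.match(s)]
    for name, pattern in PATTERNS.items()
}
for name, accepted in results.items():
    print(name, accepted)
# paired ['D1B2C4Q3']
# lookahead ['D1B2C4Q3', 'A1A', '1A']
# optional_ends ['D1B2C4Q3', 'A1A', '1A']
```

The difference comes down to whether the input must start with a letter and end with a digit, which the question does not state explicitly.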

Regex how to match two similar numbers in separate match groups?

I got the following string:
[13:49:38 INFO]: Overall : Mean tick time: 4.126 ms. Mean TPS: 20.000
The two numbers, 4.126 and 20.000, should be matched, each into its own capture group.
My current expression is (\d+.\d{3}), which matches 4.126. How can I match 20.000 into a second capture group? Adding the same capture group again makes it find nothing. So what I basically need is: "search for the first number, then ignore everything until you find the next digit."
You could use something like so: (\d+\.\d{3}).+?(\d+\.\d{3})$ (example here) which essentially is your regex (plus a minor fix) twice, with the difference that it will also look for the same pattern again at the end of the string.
Another minor note: your regex contains a potential issue, in that you are matching the decimal point with the period character. In regular expressions, the period means any character, so your expression would also match 4s222. Adding a \ in front makes the regex engine treat it as a literal character, and not a special one.
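A short sketch of both points in Python (assumed for illustration), with the log entry taken as a single line. The findall variant at the end is an alternative of my own, not part of the answer:

```python
import re

line = "[13:49:38 INFO]: Overall : Mean tick time: 4.126 ms. Mean TPS: 20.000"

# Same number pattern twice, with a lazy gap in between and an end anchor.
m = re.search(r"(\d+\.\d{3}).+?(\d+\.\d{3})$", line)
tick, tps = m.groups()
print(tick, tps)  # 4.126 20.000

# An unescaped dot matches any character; an escaped one only a period.
unescaped = bool(re.search(r"\d+.\d{3}", "4s222"))
escaped = bool(re.search(r"\d+\.\d{3}", "4s222"))
print(unescaped, escaped)  # True False

# Alternative: collect every occurrence instead of using two groups.
all_numbers = re.findall(r"\d+\.\d{3}", line)
print(all_numbers)  # ['4.126', '20.000']
```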

Here a word is a string of letters, preceded and followed by nonletters

I asked this question earlier but none of the responses solved the problem. Here is the full question:
Give a single UNIX pipeline that will create a file file1 containing all the words in file2, one word per line. Here a word is a string of letters, preceded and followed by nonletters.
I tried every single example that was given below, but I get "syntax error"s when using them.
Does anyone know how I can solve this??
Thanks
If your regex flavor supports it, you can use lookarounds:
(?<![a-zA-Z])[a-zA-Z]+(?![a-zA-Z])
(?<!..): not preceded by
(?!..): not followed by
If it is not the case you can use capturing groups and negated character classes:
(^|[^a-zA-Z])([a-zA-Z]+)($|[^a-zA-Z])
where the result is in group 2
^|[^a-zA-Z]: start of the string or a non-letter character (any character except a letter)
$: end of the string
or the same with one capturing group and two non capturing groups:
(?:^|[^a-zA-Z])([a-zA-Z]+)(?:$|[^a-zA-Z])
(result in group 1)
In order to be unicode compatible, you could use:
(?:^|\PL)\pL+(?:\PL|$)
\pL stands for any letter in any language
\PL is the opposite of \pL
When your objective is to actually find words, the most natural way would be
\b[A-Za-z]+\b
However, this assumes normal word boundaries, like whitespaces, certain punctuations or terminal positions. Your requirement suggests you want to count things like the "example" in "1example2".
In that case, I would suggest using
[A-Za-z]+
Note that you don't actually need to look for what precedes or follows the alphabets. This already captures all alphabets and only alphabets. The greedy requirement (+) ensures that nothing is left out from a capture.
Lookarounds etc should not be necessary because what you want to capture and what you want to exclude are exact inverses of each other.
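A minimal check of that claim, sketched in Python (an assumption, since the question asks for a UNIX pipeline) with a made-up input string. It compares the plain greedy pattern with the negative-lookaround pattern from the earlier answer:

```python
import re

text = "1example2 foo,bar end"

# Plain greedy run of letters.
greedy = re.findall(r"[A-Za-z]+", text)

# Negative lookarounds: not preceded and not followed by a letter.
negative = re.findall(r"(?<![a-zA-Z])[a-zA-Z]+(?![a-zA-Z])", text)

print("\n".join(greedy))   # one word per line
print(greedy == negative)  # True
```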
[Edit: Given the new information in comments]
The methods below are similar to Casimir's, except that we exclude words at terminals (which we were explicitly trying to capture, because of your original description).
Lookarounds
(?<=[^A-Za-z])[A-Za-z]+(?=[^A-Za-z])
Test here. Note that this uses positive lookarounds with negated character classes, not negative lookarounds, since those would also match at the string terminals (which are, to the regex engine as much as to me, non-alphabets).
If lookarounds don't work for you, you'd need capturing groups.
Search as below, then take the first captured group.
[^A-Za-z]([A-Za-z]+)[^A-Za-z]
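Here is a rough comparison of the two variants in Python (assumed for illustration), again on a made-up string. One thing I noticed while testing: the capturing version consumes the delimiter characters, so two words separated by a single non-letter cannot both be found in one pass.

```python
import re

text = "1example2 foo,bar end"

# Positive lookarounds with negated classes: delimiters are checked,
# not consumed, and a word at the end of the string is excluded.
lookaround = re.findall(r"(?<=[^A-Za-z])[A-Za-z]+(?=[^A-Za-z])", text)

# Capturing-group version: the surrounding non-letters are consumed.
captured = re.findall(r"[^A-Za-z]([A-Za-z]+)[^A-Za-z]", text)

print(lookaround)  # ['example', 'foo', 'bar']
print(captured)    # ['example', 'foo']
```

In the second result "bar" is missing because the comma before it was already used up as the trailing delimiter of "foo".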
When talking about regex, you need to be extremely specific and accurate in your requirements.

How to optimise this regex to match string (1234-12345-1)

I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick
Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a look-ahead to validate the information before attempting the match. It looks for between 3 and 10 characters in the class [\d-], followed by a dash and a digit. After that comes the actual match, which confirms that the format of your string is digits(dash)digits(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1
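The list above can be reproduced with a short script. This is only a sketch in Python (my assumption; the original was tested on RegExr), run against the sample lines from the question:

```python
import re

# Look-ahead limits the length, the rest checks the digits-dash layout.
PATTERN = re.compile(r"(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b")

SAMPLES = """22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22"""

matches = PATTERN.findall(SAMPLES)
print(matches)

# The two extra strings mentioned above.
extras = [s for s in ["22-7777777-1", "1-88888888-1"] if PATTERN.fullmatch(s)]
print(extras)
```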
Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
(RegExr)
You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design these two patterns match at least three digits separated by hyphens (so no need to put a {5,n} quantifier somewhere, it's useless).
The patterns are also built to fail faster:
I have chosen to start them with a digit \d; this way each beginning of a line or word boundary not followed by a digit is immediately discarded. Also, having consumed exactly one digit, I know the maximum remaining string length.
Then I test the upper limit of the string length with a negative lookahead that tests whether there is one more character than the maximum length (if there are 12 characters at this position, there are at least 13 characters in the string). There is no need to use anything more descriptive than the dot meta-character here; the goal is to quickly test the length.
Finally, I describe the end of the string without doing anything particular. That is probably the slowest part of the pattern, but it doesn't matter since the overwhelming majority of unnecessary positions have already been discarded.
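A small sketch of both patterns in Python (assumed for illustration; re.MULTILINE plays the role of the m modifier). The sample lines come from the question, and the last three checks use strings of my own plus the two mentioned in an earlier answer:

```python
import re

# With the m modifier: anchors apply per line.
MULTILINE = re.compile(r"^\d(?!.{12})\d*-\d+-\d$", re.MULTILINE)
# Without it: word boundaries instead of anchors.
BOUNDED = re.compile(r"\b\d(?!.{12})\d*-\d+-\d\b")

SAMPLES = """22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22"""

by_line = MULTILINE.findall(SAMPLES)
by_boundary = BOUNDED.findall(SAMPLES)
print(by_line)
print(by_line == by_boundary)  # True

# 12 characters (10 digits, 2 hyphens) pass; 14 characters are rejected.
print(bool(MULTILINE.search("1-12345678-1")))    # True
print(bool(MULTILINE.search("123456-1-1")))      # True
print(bool(MULTILINE.search("12345678-123-1")))  # False
```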