Regex to capture Letters & Spaces - regex

I have a spec that says a particular field will be alpha-text, right-padded with spaces to be 10 characters long, and I want to capture the alpha-part of the match.
This expression captures the entire section:
"([[:alpha:][:s:]]{10})"
However, I only want to capture the alpha part while still matching (but not capturing) the remaining white-space. So if the alpha part is 3 characters long, it must be followed by exactly 7 white-space characters.
How can I do this?

I would say your best bet is to use 2 regular expressions. Regex doesn't really have support for what you're trying to do.
The first regular expression matches all strings of length 10 made up of letters and white-space:
([a-zA-Z\s]{10})
After that, just capture the word part. We know each string is only 10 characters at this point.
(\w+)\s*
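A minimal Python sketch of the two-pass idea, assuming Python's re module (which has no POSIX classes, so explicit ranges are used). The name extract_alpha is hypothetical, and [A-Za-z]+ stands in for \w+ so that digits and underscores are not captured.

```python
import re

# Pass 1: the whole field must be exactly 10 letters/white-space characters.
FIELD = re.compile(r"[A-Za-z\s]{10}")
# Pass 2: the leading run of letters is the part we want.
ALPHA = re.compile(r"[A-Za-z]+")

def extract_alpha(field):
    """Return the leading letters of a 10-character field, or None if it is malformed."""
    if FIELD.fullmatch(field) is None:
        return None
    m = ALPHA.match(field)
    return m.group(0) if m else ""

print(extract_alpha("abc" + " " * 7))  # abc
print(extract_alpha("abc"))            # None
```

Note that pass 1 does not check that the spaces come only at the end, so a field such as "ab cd" followed by 5 spaces is accepted and yields "ab".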

This regex pattern matches a string that starts with (optional) [A-Za-z] characters and ends with up to 10 spaces; by itself it does not yet enforce a total length of 10.
"^([A-Za-z]+)?\\ {0,10}"
Then, I added a positive lookahead to ensure the pattern only matches when the string length is 10.
"^(?=.{10}$)([A-Za-z]+)?\\ {0,10}$"
Edit: Try this using [[:alpha:]] and [[:space:]] (POSIX classes only work inside a bracket expression):
"^(?=.{10}$)([[:alpha:]]+)?[[:space:]]{0,10}$"

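The lookahead approach from the last answer can be sketched in Python as follows. This is an illustration under assumptions: Python's re module has no POSIX classes, so [A-Za-z] and a literal space are used, and the name alpha_part is hypothetical.

```python
import re

# The lookahead pins the total length to 10; the group captures only the letters.
# \Z (end of string) is used instead of $ so a trailing newline is not ignored.
PADDED = re.compile(r"^(?=.{10}\Z)([A-Za-z]*) *\Z")

def alpha_part(field):
    """Return the letters of a right-padded 10-character field, or None if no match."""
    m = PADDED.match(field)
    return m.group(1) if m else None

print(alpha_part("abc" + " " * 7))    # abc
print(alpha_part("abc" + " " * 6))    # None (only 9 characters)
print(alpha_part("ab cd" + " " * 5))  # None (space inside the text)
```

Unlike the optional group ([A-Za-z]+)? in the answer, the sketch uses ([A-Za-z]*), so an all-space field yields an empty string rather than an unset group.
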
Related

regular expression to find a string that always contains 6 characters, CAPITALS and numbers only

I've been trying to catch a string (6 characters) like ABC123 (or any combination of Capitals and numbers) using a regular expression. I can catch ABCDE1 or 1ABCDE or even AC34FG. As long as the string contains at least 1 CAPITAL and 1 number the regular expression works just fine. But something like ABCDEF or 123456 does not! What am I missing? The regular expression I use is:
(?<=\t)([0-9]+[A-Z]+|[A-Z]+[0-9]+)[0-9A-Z]*(?=\t)
Any help would be appreciated! Thanks!
In your (?<=\t)([0-9]+[A-Z]+|[A-Z]+[0-9]+)[0-9A-Z]*(?=\t) pattern, you explicitly require at least 1 digit to be followed with at least 1 letter (with [0-9]+[A-Z]+) (and vice versa with [A-Z]+[0-9]+) only in between tab chars.
To just match any 6 char substring in between tabs that consists of uppercase ASCII letters or digits, you may use
(?<=\t)[A-Z0-9]{6}(?=\t)
Or, to also match at the start/end of string:
(?<![^\t])[A-Z0-9]{6}(?![^\t])
If I understand you correctly, your approach is way too complicated.
/\b[A-Z0-9]{6}\b/
Catches any (exact) 6 character string, as long as either capitals or numbers or both are present.
Note the \b part as a word boundary, you could change these delimiters to whatever fits your need.
Another word of warning: A-Z covers only the 26 ASCII uppercase letters, so umlauts and accented characters will not be caught. Use something like \p{L} if your engine supports it and your data requires it. See https://www.regular-expressions.info/unicode.html for more details.
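To compare the patterns discussed above side by side, here is a small Python sketch. The sample row is made up for illustration; it is not from the question.

```python
import re

# Hypothetical tab-separated row with valid and invalid fields.
row = "ABC123\tABCDEF\t123456\tAB12\tABCDEFG\tabc123"

original = re.compile(r"(?<=\t)([0-9]+[A-Z]+|[A-Z]+[0-9]+)[0-9A-Z]*(?=\t)")
tab_delimited = re.compile(r"(?<![^\t])[A-Z0-9]{6}(?![^\t])")
word_bounded = re.compile(r"\b[A-Z0-9]{6}\b")

original_hits = [m.group(0) for m in original.finditer(row)]
print(original_hits)               # ['AB12']
print(tab_delimited.findall(row))  # ['ABC123', 'ABCDEF', '123456']
print(word_bounded.findall(row))   # ['ABC123', 'ABCDEF', '123456']
```

On this input the original pattern skips the all-letter and all-digit fields and also accepts the 4-character AB12, because it never constrains the length to 6.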

How to find words that contain string with a limited size

I need to find all the words in an inputted text that have (?i:val) in them and are no longer than 5 characters.
So far I got: \b([a-zA-Z]*(?i:val)[a-zA-Z]*){1,4}\b
If we take this sample text to look in: In computer science, a value is an expression which cannot be evaluated any further (a normal form). Val is also a match
I get 3 matches (value, evaluated and Val), however evaluated should not match the pattern, as it is too long. What is the right way to get this straight?
Your pattern does not account for the length of the words matched.
Use word boundaries and a lookahead like this:
(?i)\b(?=\w*val)\w{1,5}\b
The regex matches:
\b - a leading word boundary since the next pattern is \w
(?=\w*val) - a lookahead making sure there is a val substring after zero or more word characters
\w{1,5} - matches 1 to 5 word characters
\b - trailing word boundary that stops words of more than 5 characters long from matching
You may use an ASCII JS version of the regex:
/\b(?=[a-z]*val)[a-z]{1,5}\b/i
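As a quick check, the lookahead pattern can be run against the sample text from the question. This sketch assumes Python's re module; the variable names are illustrative.

```python
import re

text = ("In computer science, a value is an expression which cannot be "
        "evaluated any further (a normal form). Val is also a match")

# \b...\b forces a whole word of 1 to 5 characters; the lookahead requires "val" in it.
short_val_words = re.findall(r"(?i)\b(?=\w*val)\w{1,5}\b", text)
print(short_val_words)  # ['value', 'Val']
```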
It's important to understand why the "evaluated" was matched. Note:
[a-zA-Z]* matches the "e"
(?i:val) matches "val"
[a-zA-Z]* matches "uated"
So there is actually no repetition here: the whole word was matched in a single iteration of the group.
You can achieve what you want using lookarounds, but I think regex is not the best tool for this task. I highly recommend using other functions, depending on what your language offers.
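One possible non-regex alternative, sketched in Python under the assumption that words are separated by white-space and may carry surrounding punctuation. The function name short_words_containing and its parameters are hypothetical.

```python
import string

text = ("In computer science, a value is an expression which cannot be "
        "evaluated any further (a normal form). Val is also a match")

def short_words_containing(text, needle="val", max_len=5):
    """Return words of at most max_len characters that contain needle, ignoring case."""
    # Split on white-space, then trim punctuation such as "," or "(" from each token.
    words = (token.strip(string.punctuation) for token in text.split())
    return [w for w in words if len(w) <= max_len and needle in w.lower()]

print(short_words_containing(text))  # ['value', 'Val']
```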

Match only exact numbers, not prefixed or suffixed with slash/dash etc

I need a regular expression that matches only numbers of length 7 (they can have leading zeros). I used the following super easy regex: \b[0-9]{7}\b. However, this regex also matches numbers in e.g. 5254-6408499 and (0241)4013999 (see https://regex101.com/r/zF5hV7/1).
How can I prevent them from being matched? I only want numbers of length 7 having leading and/or trailing spaces.
Depending on the regular expression flavor, you could create your own boundaries:
(?<=^| )\d{7}(?= |$)
This asserts that the beginning of the string or a space precedes the match, then matches exactly 7 digits, and finally asserts that a space or the end of the string follows.
You can use this regex:
(?:^|\s)([0-9]{7})(?:\s|$)
and grab captured group #1
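A hedged Python sketch comparing the capture-group pattern with a lookaround variant. The variant (?<!\S)...(?!\S) is an assumption of mine, not from the answers; it is used because Python's re module generally rejects variable-width lookbehinds such as (?<=^| ). The sample text is made up.

```python
import re

text = "5254-6408499 (0241)4013999 0123456 7654321"

# Capture-group version: the surrounding white-space is consumed by the match.
consuming = re.findall(r"(?:^|\s)([0-9]{7})(?:\s|$)", text)
# Lookaround version: "not preceded and not followed by a non-space character".
lookaround = re.findall(r"(?<!\S)[0-9]{7}(?!\S)", text)

print(consuming)   # ['0123456']
print(lookaround)  # ['0123456', '7654321']
```

The difference shows up when two numbers are separated by a single space: the first match consumes that space, so the capture-group version misses the second number.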

How can I match the last 4 characters of a word beginning &c using PCRE regex?

I'm trying to match the last four characters (alphanumeric) of all words beginning with the sequence &c.
For instance, in the string below, I'd like to match the pieces in bold:
Colour one is &cFF2AC3 and colour two is &c22DE4A.
Can anybody help me with the correct regex expression? I've spent hours on this great resource to no avail.
These look like hexadecimal numbers, so use this pattern:
&c[0-9A-F]{2}\K([0-9A-F]{4})
This:
/(?i)\s*&c(?:[a-z0-9]{2})([a-z0-9]{4})\b/
Append a g to the end of it if you want it to find all matches in a given text.
Try this
/(?:^| )&c\w*(\w{4})\b/
If you want to try it in the regex tester you linked to, make sure to use the g modifier to see all matches.
Explanation: (?:^| ) matches either a space or the start of the string, &c\w* matches the ampersand, the c, and however many characters of the word come before the last four, and then (\w{4}) captures the last 4 characters. \b at the end asserts a word boundary (the next character is a "non-word" character or the end of the string).
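For engines without \K, a capture group gives the same result. A minimal Python sketch, assuming the codes are always 6 hexadecimal digits as in the sample string:

```python
import re

text = "Colour one is &cFF2AC3 and colour two is &c22DE4A."

# Skip the first two hex digits after "&c" and capture the last four.
tails = re.findall(r"&c[0-9A-Fa-f]{2}([0-9A-Fa-f]{4})\b", text)
print(tails)  # ['2AC3', 'DE4A']
```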

Regular expression to match last number in a string

I need to extract the last number that is inside a string. I'm trying to do this with regex and negative lookaheads, but it's not working. This is the regex that I have:
\d+(?!\d+)
And these are some strings, just to give you an idea, and what the regex should match:
ARRAY[123] matches 123
ARRAY[123].ITEM[4] matches 4
B:1000 matches 1000
B:1000.10 matches 10
And so on. The regex matches the numbers, but all of them. I don't get why the negative lookahead is not working. Any one care to explain?
Your regex \d+(?!\d+) says
match any number if it is not immediately followed by a number.
which is incorrect. A number is last if it is not followed (following it anywhere, not just immediately) by any other number.
When translated to regex we have:
(\d+)(?!.*\d)
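Checked against the strings from the question, a Python sketch of this pattern might look like the following. The helper name last_number is hypothetical.

```python
import re

# A run of digits that has no further digit anywhere after it.
LAST_NUMBER = re.compile(r"(\d+)(?!.*\d)")

def last_number(s):
    """Return the last run of digits in s as a string, or None if there is none."""
    m = LAST_NUMBER.search(s)
    return m.group(1) if m else None

for sample in ["ARRAY[123]", "ARRAY[123].ITEM[4]", "B:1000", "B:1000.10"]:
    print(sample, "->", last_number(sample))
```

Because . does not match a newline by default, this sketch assumes single-line input.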
I took it this way: you need to make sure the match is close enough to the end of the string; close enough in the sense that only non-digits may intervene. What I suggest is the following:
/(\d+)\D*\z/
\z at the end means that that is the end of the string.
\D* before that means that an arbitrary number of non-digits can intervene between the match and the end of the string.
(\d+) is the matching part. It is in parentheses so that you can pick it up, as was pointed out by Cameron.
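The same idea in Python, as a sketch: Python's re module has long spelled the absolute end-of-string anchor \Z, so it is used here in place of \z. The helper name last_number_anchored is hypothetical.

```python
import re

# Digits, then only non-digits up to the very end of the string.
ANCHORED = re.compile(r"(\d+)\D*\Z")

def last_number_anchored(s):
    """Return the digits closest to the end of s, or None if s has no digits."""
    m = ANCHORED.search(s)
    return m.group(1) if m else None

print(last_number_anchored("ARRAY[123].ITEM[4]"))  # 4
print(last_number_anchored("B:1000.10"))           # 10
```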
You can use
.*(?:\D|^)(\d+)
to get the last number; this is because the matcher will gobble up all the characters with .*, then backtrack to the first non-digit character or the start of the string, then match the final group of digits.
Your negative lookahead isn't working because on the string "1 3", for example, the 1 is matched by the \d+, then the space matches the negative lookahead (since it's not a sequence of one or more digits). The 3 is never even looked at.
Note that your example regex doesn't have any groups in it, so I'm not sure how you were extracting the number.
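Both points can be illustrated with a short Python sketch. The helper name last_number_greedy is hypothetical.

```python
import re

# The original pattern accepts every number: after "1" comes a space, which is
# not a digit, so the negative lookahead succeeds.
all_numbers = re.findall(r"\d+(?!\d+)", "1 3")
print(all_numbers)  # ['1', '3']

# Greedy .* runs to the end, then backtracks until a non-digit (or the start of
# the string) is followed by the final run of digits.
GREEDY = re.compile(r".*(?:\D|^)(\d+)")

def last_number_greedy(s):
    """Return the last run of digits in s, or None if there is none."""
    m = GREEDY.match(s)
    return m.group(1) if m else None

print(last_number_greedy("1 3"))        # 3
print(last_number_greedy("B:1000.10"))  # 10
```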
I still had issues with managing the capture groups
(for example, if using Inline Modifiers (?imsxXU)).
This worked for my purposes -
.(?:\D|^)\d(\D)