Regex: remove all except first character and last number

I know that ^. is first character and (\d+)(?!.*\d) is last number. I've tried using | between these and have been trying to find code for the second character, but with no success.
This is in R.
Take for example:
'ABCD some random words and spaces 1234' should output 'A4' when I do
sub([regex here], "", 'ABCD some random words and spaces 1234')

If you used ^.|(\d+)(?!.*\d), the pattern would only match the first char and remove it with sub, and would remove the first char and the last 1+ digits if used with gsub without backreferences in the replacement pattern. See this pattern demo.
You can use
sub("^(.).*(\\d).*$", "\\1\\2", "ABCD some random words and spaces 1234")
See the R demo and the regex demo.
This TRE regex pattern matches:
^ - start of string
(.) - Group 1 capturing any char
.* - 0+ any chars as many as possible up to the last...
(\\d) - Group 2 capturing a digit
.* - the rest of the string
$ - end of string.
The \\1\\2 replacement pattern re-inserts the values captured with Group 1 and Group 2 back to the result.
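For readers working outside R, the same substitution can be sketched in Python. The helper name first_char_last_digit is only illustrative, and Python's re module stands in for R's TRE engine here, so treat this as a rough equivalent rather than the answer's own code.

```python
import re

def first_char_last_digit(text: str) -> str:
    # Group 1 is the first char; the greedy .* backtracks until
    # Group 2 can capture the last digit in the string.
    return re.sub(r"^(.).*(\d).*$", r"\1\2", text)

print(first_char_last_digit("ABCD some random words and spaces 1234"))  # A4
```

If the input contains no digit, the pattern does not match and the string is returned unchanged.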

Related

regular expression with If condition question

I have the following regular expression that extracts everything after the first two letters:
^[A-Za-z]{2})(\w+)($) $2
Now I want to extract nothing if the data doesn't start with letters.
Example:
AA123 -> 123
123 -> ""
Can this be accomplished by regex?
Introduce an alternative to match any one or more chars from start to end of string if your regex does not match:
^(?:([A-Za-z]{2})(\w+)|.+)$
See the regex demo. Details:
^ - start of string
(?: - start of a container non-capturing group:
([A-Za-z]{2})(\w+) - Group 1: two ASCII letters, Group 2: one or more word chars
| - or
.+ - one or more chars other than line break chars, as many as possible (use [\w\W]+ to match any chars including line break chars)
) - end of a container non-capturing group
$ - end of string.
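A minimal sketch of how this alternation behaves when the match is replaced with Group 2, written in Python as an assumption (the question does not say which tool is used). The name after_two_letters is hypothetical. It relies on Python 3.5+ expanding an unmatched group to an empty string in the replacement.

```python
import re

PATTERN = re.compile(r"^(?:([A-Za-z]{2})(\w+)|.+)$")

def after_two_letters(text: str) -> str:
    # When only the .+ branch matches, Group 2 did not participate
    # and expands to an empty string in the replacement.
    return PATTERN.sub(r"\2", text)

print(repr(after_two_letters("AA123")))  # '123'
print(repr(after_two_letters("123")))    # ''
```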
Your pattern already captures 1 or more word characters after matching 2 letters. The $ does not have to be in a group, and this $2 should not be in the pattern.
^([A-Za-z]{2})(\w+)$
See a regex demo.
Another option could be a pattern with a conditional, capturing data in group 2 only if group 1 exists.
^([A-Z]{2})?(?(1)(\w+)|.+)$
^ Start of string
([A-Z]{2})? Capture 2 uppercase chars in optional group 1
(? Conditional
(1)(\w+) If we have group 1, capture 1+ word chars in group 2
| Or
.+ Match the whole line with at least 1 char to not match an empty string
) Close conditional
$ End of string
Regex demo
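The conditional above can be tried in Python, whose re module supports the (?(1)yes|no) syntax. This is only a sketch: the function name group2_or_empty is hypothetical, and returning an empty string when Group 2 is unset is an assumption based on the question's desired output. Note that the pattern as written uses [A-Z], so only uppercase prefixes take the yes-branch.

```python
import re

PATTERN = re.compile(r"^([A-Z]{2})?(?(1)(\w+)|.+)$")

def group2_or_empty(text: str) -> str:
    m = PATTERN.match(text)
    if m is None:
        return ""
    # Group 2 is None when the no-branch (.+) matched instead.
    return m.group(2) or ""

print(repr(group2_or_empty("AA123")))  # '123'
print(repr(group2_or_empty("123")))    # ''
```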
For a match only, you could use other variations: using \K, like ^[A-Za-z]{2}\K\w+$, or with a lookbehind assertion, like (?<=^[A-Za-z]{2})\w+$

Regex to match the letter string group between 2 numbers

Is it possible to match only the letters from the following string?
RO41 RNCB 0089 0957 6044 0001 FPS21098343
What I want: FPS
What I'm trying: [0-9]{4}\s*\S+\s+(\S+)
What I get: FPS21098343
Any help is much appreciated! Thanks.
You can try with this:
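// Sample input: 4-digit groups, then letters followed by digits
const input = "0258 6044 0001 FPS21098343";
// Group 1 captures the letters that follow the 4-digit groups
const re = /^(?:\d{4} )+ *([a-zA-Z]+)(?:\d+)$/;
const match = re.exec(input);
console.log(match); // match array, or null when nothing matches
if (match) console.log(match[1]); // "FPS"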
You can match up to the first one or more letters in the following way:
^[^a-zA-Z]*([A-Za-z]+)
^.*?([A-Za-z]+)
^[\w\W]*?([A-Za-z]+)
(?s)^.*?([A-Za-z]+)
If the tool treats ^ as the start of a line, replace it with \A that always matches the start of string.
The point is to match
^ / \A - start of string
[^a-zA-Z]* - zero or more chars other than letters
([A-Za-z]+) - capture one or more letters into Group 1.
The .*? part matches any text (as short as possible) before the subsequent pattern(s). (?s) makes . match line break chars.
Replace A-Za-z in all the patterns with \p{L} to match any Unicode letters. Also, note that [^\p{L}] = \P{L}.
To match every run of consecutive letters anywhere in the string, you can simply use:
([a-zA-Z]+)
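A short Python sketch of the two ideas above, assuming Python's re module; the function names first_letters and all_letter_runs are made up for illustration. On the question's full sample the first run of letters is RO, so the first helper suits inputs where the wanted letters are the first ones in the string.

```python
import re

def first_letters(text: str):
    # Skip any non-letters from the start, then capture the first letter run.
    m = re.search(r"^[^a-zA-Z]*([A-Za-z]+)", text)
    return m.group(1) if m else None

def all_letter_runs(text: str):
    # Every run of consecutive ASCII letters, anywhere in the string.
    return re.findall(r"[a-zA-Z]+", text)

print(first_letters("0258 6044 0001 FPS21098343"))
print(all_letter_runs("RO41 RNCB 0089 0957 6044 0001 FPS21098343"))
```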
You could use a capture group to get FPS:
\b[0-9]{4}\s+\S+\s+([A-Z]+)
The pattern matches:
\b[0-9]{4} A word boundary to prevent a partial match, and match 4 digits
\s+\S+\s+ Match 1+ non-whitespace chars between runs of whitespace chars
([A-Z]+) Capture group 1, match 1+ chars A-Z
Regex demo
If the chars have to be followed by digits till the end of the string, you can add \d+$ to the pattern:
\b[0-9]{4}\s+\S+\s+([A-Z]+)\d+$
Regex demo
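To check the anchored variant against the question's sample, here is a minimal Python sketch. The helper name letters_before_trailing_digits is hypothetical, and Python is an assumption since the question does not name a language.

```python
import re

SAMPLE = "RO41 RNCB 0089 0957 6044 0001 FPS21098343"

def letters_before_trailing_digits(text: str):
    # 4 digits, one whitespace-delimited token, then the captured
    # uppercase letters followed by digits up to the end of the string.
    m = re.search(r"\b[0-9]{4}\s+\S+\s+([A-Z]+)\d+$", text)
    return m.group(1) if m else None

print(letters_before_trailing_digits(SAMPLE))  # FPS
```

The engine tries 0089 and 0957 as the 4-digit start first; both attempts fail because the third token does not start with a letter, so the match begins at 6044.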

Regex capture string up to character or end of line

I need one regex to capture a string up to a :, but the problem is that the : is not always there.
At the moment I am able to capture the groups when I have the : but not when I don't.
Not sure what I am doing wrong.
strings to capture
XXX 1 A:B (working)
XXX 1 A: (working)
XXX A (not working)
My regex:
^(?P<grp1>[A-Z]{3,10})\s(?P<grp2>.*)(?=\:)(?:.)*$
You can use
^(?P<grp1>[A-Z]{3,10})\s(?P<grp2>.*?)(?::.*)?$
See the regex demo. Details:
^ - start of string
(?P<grp1>[A-Z]{3,10}) - Group "grp1": three to ten uppercase letters
\s - a whitespace
(?P<grp2>.*?) - Group "grp2": any zero or more chars other than line break chars, as few as possible
(?::.*)? - an optional non-capturing group matching a : and then any zero or more chars other than line break chars, as many as possible
$ - end of string.
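The (?P<name>...) syntax suggests Python, so here is a small sketch that runs the pattern over the three sample strings from the question. The function name parse is illustrative only.

```python
import re

PATTERN = re.compile(r"^(?P<grp1>[A-Z]{3,10})\s(?P<grp2>.*?)(?::.*)?$")

def parse(line: str):
    m = PATTERN.match(line)
    # Return both named groups, or None when the line does not match.
    return (m.group("grp1"), m.group("grp2")) if m else None

for line in ("XXX 1 A:B", "XXX 1 A:", "XXX A"):
    print(line, "->", parse(line))
```

The lazy .*? only grows until the optional colon part and $ can match, so grp2 stops at the first colon when one is present and runs to the end of the line otherwise.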
Optionally match a single : after it
^(?P<grp1>[A-Z]{3,10})\s(?P<grp2>[^:\r\n]*)(?::[^:\r\n]*)?$
^ Start of string
(?P<grp1>[A-Z]{3,10}) Group grp1
\s Match a whitespace char
(?P<grp2>[^:\r\n]*) Group grp2, match zero or more chars other than : or a newline
(?::[^:\r\n]*)? Optionally match a single : between optional chars other than : or a newline
$ End of string
Regex demo
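A hedged Python sketch of this stricter variant, with the hypothetical helper name parse_single_colon. It behaves the same on the question's samples but rejects a line with a second colon, because the negated character classes cannot cross one.

```python
import re

STRICT = re.compile(
    r"^(?P<grp1>[A-Z]{3,10})\s(?P<grp2>[^:\r\n]*)(?::[^:\r\n]*)?$"
)

def parse_single_colon(line: str):
    m = STRICT.match(line)
    # None signals that the line has more than one colon or a bad prefix.
    return (m.group("grp1"), m.group("grp2")) if m else None

print(parse_single_colon("XXX 1 A:B"))
print(parse_single_colon("XXX 1 A:B:C"))
```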

Regex to exclude text between first occurrence of underline and last occurrence of hyphen

I am searching for a regex to match
the text until (and including) the first occurrence of an underline
and
the text after the last occurrence of a hyphen, but cutting all
leading zeros
Sample input:
Text_un_important-0011
Desired result (by concatenating all matches):
Text_11
I have come up with: (?:^|(?:00))(.+?)(?:$|_) but it has some flaws: It works only if there is exactly one hyphen at the end and two leading zeros. Unfortunately I cannot fix it.
If you follow your current logic, you may use
^[^_]*_|[1-9]\d*$
See the regex demo. Details:
^ - start of string
[^_]* - 0 or more chars other than _ and then
_ - a _ char
| - or
[1-9] - a non-zero digit
\d* - 0 or more digits
$ - end of string
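As a quick check of the match-and-concatenate approach, here is a Python sketch (the question does not name a language, so Python and the helper name keep_prefix_and_number are assumptions).

```python
import re

def keep_prefix_and_number(text: str) -> str:
    # findall returns the leading "..._" part and the trailing number
    # without its leading zeros; joining them gives the desired result.
    parts = re.findall(r"^[^_]*_|[1-9]\d*$", text)
    return "".join(parts)

print(keep_prefix_and_number("Text_un_important-0011"))  # Text_11
```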
You may also use a replace logic:
Find: _.*-0*
Replace: _
See the regex demo. The _.*-0* pattern matches
_ - an underscore
.* - any 0 or more chars other than line break chars, as many as possible
- - a hyphen
0* - zero or more 0 chars.
Since the first _ is consumed, the replacement pattern should be _.
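The find/replace logic can be sketched in Python as follows; the function name drop_middle is hypothetical and Python itself is an assumption, since any engine with greedy .* should behave the same way here.

```python
import re

def drop_middle(text: str) -> str:
    # The match starts at the first "_"; the greedy .* runs to the last "-",
    # and 0* eats the leading zeros. The consumed "_" is put back.
    return re.sub(r"_.*-0*", "_", text)

print(drop_middle("Text_un_important-0011"))  # Text_11
```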
So, you want to capture the first word and the final digits after any leading zeros:
([^\W_]+)(_[^\W_]+)*(-0*)(\d+)
Use the first and last captured groups.
([^\W_]+) captures a word that may have a number (ex. word1); [^\W_] is used instead of \w because \w also matches _ and would swallow the whole Text_un_important part
(_[^\W_]+)* captures repetitions of _word (ex. _word1_anotherword2_third3)
(-0*) captures a hyphen followed by consecutive zeros (ex. -00000000)
(\d+) captures all the digits following the consecutive zeros

Why does (?:\s)\w{2}(?:\s) not match only a 2 letter sub string with spaces around it not the spaces as well?

I am trying to make a regex that matches all substrings of at most 2 characters with a space on either side. What did I do wrong?
For example, I want it to match %2 or q, but not 123 and not the spaces.
Update: \b\w{2}\b would work if it also matched one-letter substrings and did not ignore special characters like - or #.
You should use
(^|\s)\S{1,2}(?=\s)
Since you cannot use a look-behind, you can use a capture group and if you replace text, you can then restore the captured part with $1.
See regex demo here
Regex breakdown:
(^|\s) - Group 1 - either a start of string or a whitespace
\S{1,2} - 1 or 2 non-whitespace characters
(?=\s) - check if after 1 or 2 non-whitespace characters we have a whitespace. If not, fail.
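A minimal Python sketch of using the capture group to drop the leading whitespace from each match. The helper name short_tokens and the sample string are invented for illustration; the question does not say which regex flavor is in use.

```python
import re

PATTERN = re.compile(r"(^|\s)\S{1,2}(?=\s)")

def short_tokens(text: str):
    # Group 1 holds the leading whitespace (or nothing at string start);
    # slicing it off leaves only the 1-2 char token.
    return [m.group(0)[len(m.group(1)):] for m in PATTERN.finditer(text)]

print(short_tokens("%2 q 123 ab end"))
```

Because the trailing whitespace is only checked by the lookahead and not consumed, it is still available as the leading whitespace of the next match, so adjacent short tokens are all found. A short token at the very end of the string is not matched, since the lookahead requires a following whitespace character.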