Separate string by recognizing first digit with regex

I'm using ([^\d]+)\s?(.+) to split a string at the first digit that appears in it.
Example: Test123 --> Group 1: Test, Group 2: 123 # that works
but
Example: Test --> Group 1: Tes, Group 2: t # I expect: Group 1: Test, Group 2: [empty]
How can I edit the regex so it fits my expectation?

If you need to match up to the first digit if there is one, you may use
^(.*?)\s*(\d.*)?$
See the regex demo
^ - start of string
(.*?) - Group 1: any 0+ chars other than line break chars, as few as possible (since *? is a lazy quantifier)
\s* - 0+ whitespaces
(\d.*)? - Group 2: an optional capturing group matching 1 or 0 occurrences of a digit and then any 0+ chars other than line break chars, as many as possible (* is a greedy quantifier)
$ - end of string.
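As a quick check of the pattern above, here is a minimal Python sketch. The helper name split_at_first_digit is hypothetical, and the sketch assumes single-line input (the pattern has no match when the text contains an inner line break).

```python
import re

# Pattern from the answer above: lazy prefix, optional whitespace,
# then an optional group that has to start with a digit.
FIRST_DIGIT = re.compile(r"^(.*?)\s*(\d.*)?$")


def split_at_first_digit(text):
    """Return (prefix, rest); rest is None when the text has no digit.

    Assumes single-line input; otherwise the pattern may not match.
    """
    match = FIRST_DIGIT.match(text)
    if match is None:
        return text, None
    return match.group(1), match.group(2)


print(split_at_first_digit("Test123"))   # ('Test', '123')
print(split_at_first_digit("Test 123"))  # ('Test', '123')
print(split_at_first_digit("Test"))      # ('Test', None)
```

Note that in Python an optional group that did not take part in the match is reported as None rather than as an empty string.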

Your regex almost works.
Problem: The problem lies in your second capturing group, (.+), which requires at least one character. Since it cannot be empty, the engine backtracks and gives it the 't' at the end of 'Test' in order to make a match.
Solution: Replace your second capturing group with (.*), which means zero or more of any character (i.e., it does not need to contain any characters to make a match, and it will grab any number of characters after 'Test').
Here is your new working regex:
([^\d]+)\s?(.*)
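The difference between the two quantifiers can be seen side by side in Python (a small sketch; the variable names are only for illustration):

```python
import re

text = "Test"

# (.+) needs at least one character, so [^\d]+ has to give back the "t".
with_plus = re.match(r"([^\d]+)\s?(.+)", text).groups()

# (.*) may be empty, so [^\d]+ keeps the whole word.
with_star = re.match(r"([^\d]+)\s?(.*)", text).groups()

print(with_plus)  # ('Tes', 't')
print(with_star)  # ('Test', '')
```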

Related

moving characters by using regex

I'm trying to move matched characters to the end of sentence.
from
300p apple in house
orange 200p in school
to
apple in house 300p
orange in school 200p
So I matched (.+)([\d]+p)(.+)$ and substituted with \1 \3 \2.
But the result is like
30 apple in house 0p
orange 20 in school 0p
I also read about greedy matching, but I don't see what the problem is. How can I fix this?
You can use
^(.*?)(\d+p) *(.+)
Replace with \1\3 \2.
See the regex demo. Details:
^ - start of string (or line if you use a multiline mode)
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
(\d+p) - Group 2: one or more digits, and then a p char
 * - zero or more spaces (a literal space followed by *)
(.+) - Group 3: any one or more chars other than line break chars as many as possible (since it is a greedy subpattern, no $ anchor is required, the match will go up to the end of string (or line if you use a multiline mode)).
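To see why the original pattern loses digits and how the lazy version behaves, here is a short Python sketch using the two sample lines from the question (GREEDY and LAZY are illustrative names):

```python
import re

lines = ["300p apple in house", "orange 200p in school"]

# Pattern from the question: the greedy (.+) swallows as much as it can and
# leaves only the last digit for [\d]+p.
GREEDY = re.compile(r"(.+)([\d]+p)(.+)$")

# Pattern from the answer: the lazy (.*?) stops before the first digit run.
LAZY = re.compile(r"^(.*?)(\d+p) *(.+)")

greedy_groups = [GREEDY.match(line).groups() for line in lines]
moved = [LAZY.sub(r"\1\3 \2", line) for line in lines]

print(greedy_groups[0])  # ('30', '0p', ' apple in house')
print(moved)             # ['apple in house 300p', 'orange in school 200p']
```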
Based only on the samples shown, you could try the following regex.
^(\D+)?(\d+p)\s*(.+)$
Online demo for above regex
Explanation:
^(\D+)? ##Matching from starting and creating 1st capturing group which has all non-digits in it and keeping it as optional.
(\d+p) ##Creating 2nd capturing group which matches 1 or more digits followed by p here.
\s* ##Matching 0 or more occurrences of spaces here.
(.+)$ ##Creating 3rd capturing group here which has everything in it.
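This answer does not state a replacement string, so the sketch below assumes the same reordering as in the question (group 1, then group 3, then group 2). The function name move_price_to_end is hypothetical. Because group 1 is optional, it is None when the line starts with the digits, which the code handles explicitly.

```python
import re

PRICE = re.compile(r"^(\D+)?(\d+p)\s*(.+)$")


def move_price_to_end(line):
    """Move the '<digits>p' token to the end of the line (assumed ordering)."""
    match = PRICE.match(line)
    if match is None:
        return line  # no '<digits>p' token followed by text: leave unchanged
    prefix = match.group(1) or ""  # optional group may not participate
    return f"{prefix}{match.group(3)} {match.group(2)}"


print(move_price_to_end("300p apple in house"))    # apple in house 300p
print(move_price_to_end("orange 200p in school"))  # orange in school 200p
```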

regular expression with If condition question

I have the following regular expression that extracts everything after the first two letters:
^([A-Za-z]{2})(\w+)($) with the replacement $2
Now I want to extract nothing if the data doesn't start with letters.
Example:
AA123 -> 123
123 -> ""
Can this be accomplished by regex?
Introduce an alternative to match any one or more chars from start to end of string if your regex does not match:
^(?:([A-Za-z]{2})(\w+)|.+)$
See the regex demo. Details:
^ - start of string
(?: - start of a container non-capturing group:
([A-Za-z]{2})(\w+) - Group 1: two ASCII letters, Group 2: one or more word chars
| - or
.+ - one or more chars other than line break chars, as many as possible (use [\w\W]+ to match any chars including line break chars)
) - end of a container non-capturing group
$ - end of string.
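A minimal Python sketch of this alternation approach follows. The helper name after_two_letters is hypothetical, and group 2 is read directly instead of using a $2 replacement, so that a non-participating group is turned into an empty string explicitly.

```python
import re

LETTERS_THEN_REST = re.compile(r"^(?:([A-Za-z]{2})(\w+)|.+)$")


def after_two_letters(value):
    """Return what follows two leading ASCII letters, or '' otherwise."""
    match = LETTERS_THEN_REST.match(value)
    if match is None:  # e.g. an empty string: .+ needs at least one char
        return ""
    # Group 2 is None when the second alternative (.+) matched.
    return match.group(2) or ""


print(repr(after_two_letters("AA123")))  # '123'
print(repr(after_two_letters("123")))    # ''
```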
Your pattern already captures 1 or more word characters after matching 2 ASCII letters. The $ does not have to be in a group, and the $2 should not be part of the pattern.
^([A-Za-z]{2})(\w+)$
See a regex demo.
Another option could be a pattern with a conditional, capturing data in group 2 only if group 1 exist.
^([A-Z]{2})?(?(1)(\w+)|.+)$
^ Start of string
([A-Z]{2})? Capture 2 uppercase chars in optional group 1
(? Conditional
(1)(\w+) If we have group 1, capture 1+ word chars in group 2
| Or
.+ Match the whole line with at least 1 char to not match an empty string
) Close conditional
$ End of string
Regex demo
For a match only, you could use other variations where the regex engine supports them: \K, as in ^[A-Za-z]{2}\K\w+$, or a lookbehind assertion, as in (?<=^[A-Za-z]{2})\w+$
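Python's re module also accepts the (?(1)...|...) conditional syntax, so the conditional pattern above can be tried as written. This is only a sketch; groups_for is an illustrative helper. Note that [A-Z] is uppercase only, so lowercase input falls through to the .+ branch.

```python
import re

CONDITIONAL = re.compile(r"^([A-Z]{2})?(?(1)(\w+)|.+)$")


def groups_for(value):
    """Return the (group 1, group 2) tuple, or None if nothing matches."""
    match = CONDITIONAL.match(value)
    return None if match is None else match.groups()


print(groups_for("AA123"))  # ('AA', '123')
print(groups_for("123"))    # (None, None)
print(groups_for("aa123"))  # (None, None)
```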

How to regex string that can end with a number and group each part

I have the following test strings:
Battery Bank 1
Dummy 32 Segment 12
System
Modbus 192.168.0.1 Group
I need a regex that can match and group these as follows:
Group 1: Battery Bank
Group 2: 1
Group 1: Dummy 32 Segment
Group 2: 12
Group 1: System
Group 2: null
Group 1: Modbus 192.168.0.1 Group
Group 2: null
Basically, capture everything (including numbers) into group 1 unless the string ends with a whitespace followed by 1 or more digits. If it does, capture this number into group 2.
This regex is not doing what I need as everything is captured into the first group.
([\w ]+)( \d+)?
https://regex101.com/r/GEtb5G/1/
You may use this group that allows an empty match in 2nd capture group:
^(.+?) *(\d+|)$
Updated RegEx Demo
RegEx Details:
^: Start
(.+?): Match 1+ of any character (lazy) in capture group #1
 *: Match 0 or more spaces (a literal space followed by *)
(\d+|): Match 1+ digits or nothing in 2nd capture group
$: End
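Running this pattern over the four test strings from the question in Python gives the grouping below (a small sketch; the names are illustrative). With this pattern, group 2 is an empty string rather than null when there is no trailing number.

```python
import re

TRAILING_NUMBER = re.compile(r"^(.+?) *(\d+|)$")

samples = [
    "Battery Bank 1",
    "Dummy 32 Segment 12",
    "System",
    "Modbus 192.168.0.1 Group",
]

results = {s: TRAILING_NUMBER.match(s).groups() for s in samples}

for sample, groups in results.items():
    print(sample, "->", groups)
```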
You can use
^\s*(.*[^\d\s])(?:\s*(\d+))?\s*$
See the regex demo (note \s are replaced with spaces since the test string in the demo is a single multiline string).
If the regex is to be used with a multiline flag to match lines in a longer multiline text, you can use
^[^\S\r\n]*(.*[^\d\s])(?:[^\S\r\n]*(\d+))?[^\S\r\n]*$
See the regex demo.
Details:
^ - start of a string
\s* - zero or more whitespaces
(.*[^\d\s]) - Group 1: any zero or more chars other than line break chars as many as possible and then a char other than a digit and whitespace
(?:\s*(\d+))? - an optional sequence of
\s* - zero or more whitespaces
(\d+) - Group 2: one or more digits
\s* - zero or more whitespaces
$ - end of string.
In the second regex, [^\S\r\n]* matches any zero or more whitespaces other than LF and CR chars.
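The multiline variant can be exercised in Python with re.MULTILINE, as sketched below. The sample text reuses the question's strings, with extra spaces added around "System" purely to show the trimming. Here group 2 is None, not an empty string, when there is no trailing number.

```python
import re

LINE_PATTERN = re.compile(
    r"^[^\S\r\n]*(.*[^\d\s])(?:[^\S\r\n]*(\d+))?[^\S\r\n]*$",
    re.MULTILINE,
)

text = (
    "Battery Bank 1\n"
    "Dummy 32 Segment 12\n"
    "  System  \n"
    "Modbus 192.168.0.1 Group"
)

# One (group 1, group 2) tuple per line of the input.
pairs = [m.groups() for m in LINE_PATTERN.finditer(text)]

for pair in pairs:
    print(pair)
```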

Regex - match until group that may or may not occur

I have following text:
:3:Start!##$%^&*():31:Start!##$%^&*():31:End!##$%^&*():3:End
and with following regex:
(:3:Start)(.*)(:31:Start.*:31:End)?(.*)(:3:End)
Why group3 is not found even though it exists. Even if I set group2 as not greedy:
(:3:Start)(.*?)(:31:Start.*:31:End)?(.*)(:3:End)
How Can I capture group with optional subgroup if it occurs in the middle of the text
You can achieve what you need by enclosing the (.*?) and (:31:Start.*:31:End) groups in an optional non-capturing group (quantified with a greedy ? quantifier), which makes (:31:Start.*:31:End) obligatory inside that group:
(:3:Start)(?:(.*?)(:31:Start.*:31:End))?(.*)(:3:End)
          |____________________________|
See the regex demo. It will work like this:
(:3:Start) - will capture the :3:Start string into Group 1
(?:(.*?)(:31:Start.*:31:End))? - will attempt to match once a sequence of patterns:
(.*?) - Group 2: any 0 or more chars other than line break chars as few as possible
(:31:Start.*:31:End) - Group 3: :31:Start.*:31:End string
(.*) - Group 4: any 0 or more chars other than line break chars as many as possible
(:3:End) - captures into Group 5 :3:End string
Why doesn't your pattern work?
See your pattern demo: the !##$%^&*():31:Start!##$%^&*():31:End!##$%^&*() substring is captured into Group 4, matched with the (.*) pattern. This happens because, in (.*?)(:31:Start.*:31:End)?, the lazy .*? is skipped at first: the engine does not try to consume anything with it on the first pass, moves on to the subsequent patterns, and only comes back to expand it when those fail. The optional (:31:Start.*:31:End)? then matches an empty string right after the :3:Start substring, and the rest of the pattern finds a match, so the optional text never lands in your expected group.
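The two behaviours can be compared directly in Python, as in this short sketch (ORIGINAL and FIXED are illustrative names; the text is the one from the question):

```python
import re

text = ":3:Start!##$%^&*():31:Start!##$%^&*():31:End!##$%^&*():3:End"

# Pattern from the question (lazy variant): Group 3 is optional on its own.
ORIGINAL = re.compile(r"(:3:Start)(.*?)(:31:Start.*:31:End)?(.*)(:3:End)")

# Pattern from the answer: Groups 2 and 3 sit together in one optional group.
FIXED = re.compile(r"(:3:Start)(?:(.*?)(:31:Start.*:31:End))?(.*)(:3:End)")

original_groups = ORIGINAL.match(text).groups()
fixed_groups = FIXED.match(text).groups()

print(original_groups[2])  # None: the optional group matched nothing
print(fixed_groups[2])     # :31:Start!##$%^&*():31:End
```

When the :31: block is absent, the whole optional group is skipped and Groups 2 and 3 are both None, while Group 4 still receives the text between the outer markers.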

Regex: remove all except first character and last number

I know that ^. is first character and (\d+)(?!.*\d) is last number. I've tried using | between these and have been trying to find code for the second character, but with no success.
This is in R.
Take for example:
'ABCD some random words and spaces 1234' should output 'A4' when I do
sub([regex here], "", 'ABCD some random words and spaces 1234')
If you used ^.|(\d+)(?!.*\d), the pattern would only match the first char and remove it with sub, and would remove the first char and the last 1+ digits if used with gsub without backreferences in the replacement pattern. See this pattern demo.
You can use
sub("^(.).*(\\d).*$", "\\1\\2", "ABCD some random words and spaces 1234")
See the R demo and the regex demo.
This TRE regex pattern matches:
^ - start of string
(.) - Group 1 capturing any char
.* - 0+ any chars as many as possible up to the last...
(\\d) - Group 2 capturing a digit
.* - the rest of the string
$ - end of string.
The \\1\\2 replacement pattern re-inserts the values captured with Group 1 and Group 2 back to the result.
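For readers who want to try the same pattern outside R, a Python analogue might look like the sketch below. This is an assumption-laden port, not part of the original answer: it uses Python's re engine instead of TRE, and the function name is hypothetical.

```python
import re


def first_char_and_last_digit(text):
    """Keep only the first character and the last digit of the text.

    Returns the text unchanged when the pattern does not match
    (for example, when the text contains no digit after its first char).
    """
    return re.sub(r"^(.).*(\d).*$", r"\1\2", text)


print(first_char_and_last_digit("ABCD some random words and spaces 1234"))  # A4
```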