Regex match all characters up to number if there is one [duplicate] - regex

I'm trying to figure out how to write a regex that will match every character up to, but not including, the first number in the character sequence, if there is one.
Ex:
Input: abc123
Output: abc
Input: #$%##<>#<123
Output: #$%##<>#<
Input: abc
Output: abc
Input: abc #####-122
Output: abc #####-

You can use:
/^([^\d\n]+)\d*.*$/gm
This will also handle scenarios where you have multiple sets of numbers in a string. Example here.
Explanation:
^ # define the start of the string
( # open capture group
[^\d\n]+ # match anything that isn't a digit or a newline that occurs once or more
) # close capture group
\d* # zero or more digits
.* # anything zero or more times
$ # define the end of the string
g # global
m # multi line
Because the matching is greedy, the capture group consumes as many characters as it can and stops as soon as it encounters a digit, a newline, or the end of the string.
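A rough Python equivalent of the pattern above, offered as a sketch rather than the answerer's own code (the function name leading_non_digits is made up for illustration). It assumes re.MULTILINE stands in for the m flag and re.findall for the g flag, returning the capture group for each matching line.

```python
import re

# Same pattern as above; re.MULTILINE corresponds to the "m" flag.
PATTERN = re.compile(r'^([^\d\n]+)\d*.*$', re.MULTILINE)

def leading_non_digits(text):
    """Return the captured prefix of every line that starts with a non-digit."""
    return PATTERN.findall(text)

sample = "abc123\n#$%##<>#<123\nabc\nabc #####-122\n123abc"
print(leading_non_digits(sample))
```

A line that begins with a digit, like the last one in the sample, produces no match at all, because the + quantifier needs at least one non-digit character.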

[Update] Try this regex:
([^0-9\n]+)[0-9]?.*
Regex explanation:
( capturing group starts
[^0-9\n] match a single character other than numbers and new line
+ match one or more times
) capturing group ends
[0-9] match a single digit number (0-9)
? match zero or one time
.* match any remaining characters other than a newline
Thanks @Robbie Averill for clarifying OP's requirement. Here is the demo.

I did not select a correct answer because the correct answer was left in the comments. "^\D+"
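I am working in Java, so putting it all together I got:
Pattern p = Pattern.compile("^\\D+");
Matcher m = p.matcher("Testing123Testing");
// The pattern has no capture group, so call find() first and read the whole match with group().
String extracted = m.find() ? m.group() : "";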

Use the character class feature: [...]
Identify numbers: [0-9]
Negate the class: [^0-9]
Allow as many as you like: [^0-9]*
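The steps above can be tried out with a short Python sketch (Python is an assumption here, and prefix_before_digit is a hypothetical helper name). It applies the final class [^0-9]* to the four inputs from the question.

```python
import re

def prefix_before_digit(s):
    # [^0-9]* can match the empty string, so re.match always succeeds here.
    return re.match(r'[^0-9]*', s).group()

for s in ["abc123", "#$%##<>#<123", "abc", "abc #####-122"]:
    print(repr(prefix_before_digit(s)))
```

Because the * quantifier allows zero characters, a string that starts with a digit gives an empty result instead of a failed match.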

Related

Need Regex to validate 11-digit phone number without plus sign [duplicate]

I need a regex to validate a phone number without a plus (+) sign, for example
46123456789,46-123-456-789,46-123-456-789
The number should be 11 digits; the rest should be ignored.
I am currently using this regex: /([+]?\d{1,2}[.-\s]?)?(\d{3}[.-]?){2}\d{4}/g
It's not correct at all.
About the pattern you tried:
The part [+]? in your pattern optionally matches a plus sign. It is wrapped in an optional group ([+]?\d{1,2}[.-\s]?)?, so the pattern can also match 12 digits in total.
The character class [.-\s] matches 1 of the listed characters, allowing for mixed delimiters like 333-333.3333
You are not using anchors, so you can also get partial matches.
You could use an alternation | to match either the pattern with the hyphens and digits or match only 11 digits.
^(?:\d{2}-\d{3}-\d{3}-\d{3}|\d{11})$
^ Start of string
(?: Non capture group for the alternation
\d{2}-\d{3}-\d{3}-\d{3} Match the digit groups (2, 3, 3 and 3 digits) separated by hyphens
| Or
\d{11} Match 11 digits
) Close group
$ End of string.
Regex demo
If you want multiple delimiters which have to be consistent, you could use a capturing group with a backreference \1
^(?:\d{2}([-.])\d{3}\1\d{3}\1\d{3}|\d{11})$
Regex demo
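As a quick check of the backreference version, here is a minimal Python sketch (the answer itself targets JavaScript-style regex; is_phone_valid is a hypothetical name). The \1 forces every delimiter to repeat whatever the first one was.

```python
import re

# Group 1 captures the first delimiter; \1 requires the same one afterwards.
PHONE = re.compile(r'^(?:\d{2}([-.])\d{3}\1\d{3}\1\d{3}|\d{11})$')

def is_phone_valid(phone):
    """True for 11 plain digits or a 2-3-3-3 grouping with one consistent delimiter."""
    return PHONE.match(phone) is not None

for candidate in ["46123456789", "46-123-456-789", "46.123.456.789",
                  "46-123.456-789", "+46123456789"]:
    print(candidate, is_phone_valid(candidate))
```

The mixed-delimiter and plus-prefixed inputs are rejected, while the three consistent forms are accepted.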
I would have this function return true or false and use as is.
function isPhoneValid(phone) {
  let onlyNumbers = phone.replace(/[^0-9]/g, "");
  if (onlyNumbers.length != 11) console.log(phone + ' is invalid');
  else console.log(phone + ' is valid');
}
isPhoneValid('1 (888) 555-1234');
isPhoneValid('(888) 555-1234');
I am not sure what the input looks like, but based on your question I suppose you want to trim it and then match it with a regex.
Trim your input:
string.split(/[^0-9.]/).join('');
Then you can match it with this regex:
((\([0-9]{3}\))|[0-9]{3})[\s\-]?[0-9]{3}[\s\-]?[0-9]{4}$

Is there a regex for adding the first 4 characters to end of string and the last 4 characters to start of string?

I have some lines which I need to alter. They are protein sequences. How would I copy the first 4 characters of the line to the end of the line, and also copy the last 4 characters to the beginning of the line?
The strings are variable which complicates it, for example:
>X
LTGLGIGTGMAATIINAISVGLSAATILSLISGVASGGAWVLAGAKQALKEGGKKAGIAF
>Y
LVATGMAAGVAKTIVNAVSAGMDIATALSLFSGAFTAAGGIMALIKKYAQKKLWKQLIAA
Moreover, how could I exclude lines with a '>' at the beginning (these are names of the corresponding sequence)?
Does anyone know a regex which will allow this to work?
I've already tried some regex solutions but I'm not very experienced with this sort of thing and I can find the end string but can't get it to replace:
Find:
(...)$
Replace:
^$2$1"
An example of what I want to achieve is:
>1
ABCDEFGHIJKLMNOPQRSTUVWXYZ
becomes:
>1
WXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCD
Thanks
Try doing a find, in regex mode, on the following pattern:
^([A-Z]{4}).*([A-Z]{4})$
Then replace with the last four characters, the entire match ($0), and the first four characters:
$2$0$1
Demo
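For anyone doing this outside an editor, the same find and replace might look like the following in Python. This is a sketch under assumptions: wrap_ends is a made-up name, re.MULTILINE is used so each line is handled on its own, and Python spells $0 as \g<0>.

```python
import re

# ^ and $ match at every line because of re.MULTILINE.
PATTERN = re.compile(r'^([A-Z]{4}).*([A-Z]{4})$', re.MULTILINE)

def wrap_ends(text):
    # \2 = last four characters, \g<0> = the whole line, \1 = first four.
    return PATTERN.sub(r'\2\g<0>\1', text)

print(wrap_ends(">1\nABCDEFGHIJKLMNOPQRSTUVWXYZ"))
```

Name lines such as >1 are left alone here simply because > is not in [A-Z], so the pattern cannot match them.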
You can use the regex below.
^(([A-Z]{4})([A-Z]*)([A-Z]{4}))$
^ asserts the position at the start of the line, so nothing can come before it.
( is the start of a capture group, this is group 1.
( is the start of a capture group, this is group 2. This group is inside group 1.
[A-Z]{4} means exactly 4 capital characters from A to Z.
) is the end of capture group 2.
( is the start of a capture group, this is group 3.
[A-Z]* matches capital characters from A to Z between zero and infinite times.
) is the end of capture group 3.
( is the start of a capture group, this is group 4.
[A-Z]{4} means exactly 4 capital characters from A to Z.
) is the end of capture group 4.
$ asserts the position at the end of the line, so nothing can come after it.
See how it works with a replace here: https://regex101.com/r/W786uL/3.
$4$1$2
$4 means put capture group 4 here. Which is the last 4 characters.
$1 means put capture group 1 here. Which is everything in the entire string.
$2 means put capture group 2 here. Which is the first 4 characters.
You can use
^(.{4})(.*?)(.{4})$
^ - start of string
(.{4}) - Match any four characters except newline
(.*?) - Match any character zero or more times (lazy mode)
$ - End of string
Demo
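The answer above gives no replacement string, so the following Python sketch fills that in under assumptions: the replacement \3\1\2\3\1 (last four, whole line, first four) and the (?!>) lookahead that skips name lines are additions for illustration, not part of the original answer, and wrap_ends_skip_headers is a hypothetical name.

```python
import re

# (?!>) rejects lines that start with '>', since '.' would otherwise match them.
PATTERN = re.compile(r'^(?!>)(.{4})(.*?)(.{4})$', re.MULTILINE)

def wrap_ends_skip_headers(text):
    # Groups 1, 2 and 3 together are the whole line, so \1\2\3 rebuilds it.
    return PATTERN.sub(r'\3\1\2\3\1', text)

print(wrap_ends_skip_headers(">sequence_X\nABCDEFGHIJKL"))
```

Lines shorter than eight characters do not match and are left unchanged.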

How to use regular expression to use as few groups as possible to match as long string as possible

For example, this is the regular expression
([a]{2,3})
This is the string
aaaa // 1 match "(aaa)a" but I want "(aa)(aa)"
aaaaa // 2 match "(aaa)(aa)"
aaaaaa // 2 match "(aaa)(aaa)"
However, if I change the regular expression
([a]{2,3}?)
Then the results are
aaaa // 2 match "(aa)(aa)"
aaaaa // 2 match "(aa)(aa)a" but I want "(aaa)(aa)"
aaaaaa // 3 match "(aa)(aa)(aa)" but I want "(aaa)(aaa)"
My question is that is it possible to use as few groups as possible to match as long string as possible?
How about something like this:
(a{3}(?!a(?:[^a]|$))|a{2})
This looks for either the character a three times (not followed by a single a and a different character) or the character a two times.
Breakdown:
( # Start of the capturing group.
a{3} # Matches the character 'a' exactly three times.
(?! # Start of a negative Lookahead.
a # Matches the character 'a' literally.
(?: # Start of the non-capturing group.
[^a] # Matches any character except for 'a'.
| # Alternation (OR).
$ # Asserts position at the end of the line/string.
) # End of the non-capturing group.
) # End of the negative Lookahead.
| # Alternation (OR).
a{2} # Matches the character 'a' exactly two times.
) # End of the capturing group.
Here's a demo.
Note that if you don't need the capturing group, you can actually use the whole match instead by converting the capturing group into a non-capturing one:
(?:a{3}(?!a(?:[^a]|$))|a{2})
Which would look like this.
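To see how the pattern splits the three strings from the question, here is a small Python sketch (split_runs is a hypothetical helper; the non-capturing variant is used so that re.findall returns the whole matches).

```python
import re

# Prefer three a's unless that would leave exactly one a stranded behind them.
PATTERN = re.compile(r'(?:a{3}(?!a(?:[^a]|$))|a{2})')

def split_runs(s):
    """Return the successive matches found in s."""
    return PATTERN.findall(s)

for s in ["aaaa", "aaaaa", "aaaaaa"]:
    print(s, split_runs(s))
```

The three inputs split into (aa)(aa), (aaa)(aa) and (aaa)(aaa), which are the groupings the question asks for.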
Try this Regex:
^(?:(a{3})*|(a{2,3})*)$
Click for Demo
Explanation:
^ - asserts the start of the line
(?:(a{3})*|(a{2,3})*) - a non-capturing group containing 2 sub-sequences separated by OR operator
(a{3})* - The first subsequence tries to match 3 occurrences of a. The * at the end allows this subsequence to match 0 or 3 or 6 or 9.... occurrences of a before the end of the line
| - OR
(a{2,3})* - matches 2 to 3 occurrences of a, as many as possible. The * at the end would repeat it 0+ times before the end of the line
$ - asserts the end of the line
Try this short regex:
a{2,3}(?!a([^a]|$))
Demo
How it's made:
I started with this simple regex: a{2}a?. It looks for 2 consecutive a's that may be followed by another a. If the 2 a's are followed by another a, it matches all three a's.
This worked for most cases, such as aaaaa and aaaaaa.
However, it failed in cases like aaaa, where it matched (aaa) and left the last a unmatched.
So now, I knew I had to modify my regex in such a way that it would match the third a only if the third a is not followed by a([^a]|$). So now, my regex looked like a{2}a?(?!a([^a]|$)), and it worked for all cases. Then I just simplified it to a{2,3}(?!a([^a]|$)).
That's it.
EDIT
If you want the capturing behavior, then add parentheses around the regex, like:
(a{2,3}(?!a([^a]|$)))

R- regex extracting a string between a dash and a period

First of all I apologize if this question is too naive or has been repeated earlier. I tried to find it in the forum but I'm posting it as a question because I failed to find an answer.
I have a data frame with row names as follows:
head(rownames(u))
[1] "A17-R-Null-C-3.AT2G41240" "A18-R-Null-C-3.AT2G41240" "B19-R-Null-C-3.AT2G41240"
[4] "B20-R-Null-C-3.AT2G41240" "A21-R-Transgenic-C-3.AT2G41240" "A22-R-Transgenic-C-3.AT2G41240"
What I want is to use regex in R to extract the string in between the first dash and the last period.
Anticipated results are,
[1] "R-Null-C-3" "R-Null-C-3" "R-Null-C-3"
[4] "R-Null-C-3" "R-Transgenic-C-3" "R-Transgenic-C-3"
I tried the following with no luck:
gsub("^[^-]*-|.+\\.","\\2", rownames(u))
gsub("^.+-","", rownames(u))
sub("^[^-]*.|\\..","", rownames(u))
Would someone be able to help me with this problem?
Thanks a lot in advance.
Shani.
Here is a solution to be used with gsub:
v <- c("A17-R-Null-C-3.AT2G41240", "A18-R-Null-C-3.AT2G41240", "B19-R-Null-C-3.AT2G41240", "B20-R-Null-C-3.AT2G41240", "A21-R-Transgenic-C-3.AT2G41240", "A22-R-Transgenic-C-3.AT2G41240")
gsub("^[^-]*-([^.]+).*", "\\1", v)
See IDEONE demo
The regex matches:
^[^-]* - zero or more characters other than -
- - a hyphen
([^.]+) - Group 1 matching and capturing one or more characters other than a dot
.* - any characters (even including a newline since perl=T is not used), any number of occurrences up to the end of the string.
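The same substitution can be sketched in Python for comparison (an analogue, not R code; between_dash_and_dot is a made-up name). re.sub with \1 plays the role of gsub with "\\1".

```python
import re

rows = ["A17-R-Null-C-3.AT2G41240", "A21-R-Transgenic-C-3.AT2G41240"]

def between_dash_and_dot(name):
    # Keep only group 1: everything after the first '-' up to the first '.'.
    return re.sub(r'^[^-]*-([^.]+).*', r'\1', name)

print([between_dash_and_dot(r) for r in rows])
```

Since [^.]+ stops at the first dot, this relies on the wanted part containing no period, which holds for the names in the question.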
This can easily be achieved with the following regex:
-([^.]+)
# look for a dash
# then match everything that is not a dot
# and save it to the first group
See a demo on regex101.com. Outputs are:
R-Null-C-3
R-Null-C-3
R-Null-C-3
R-Null-C-3
R-Transgenic-C-3
R-Transgenic-C-3
Regex
-([^.]+)\\.
Description
- matches the character - literally
1st Capturing group ([^\\.]+)
[^\.]+ match a single character not present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
. matches the character . literally
\\. matches the character . literally
Debuggex Demo
Output
MATCH 1
1. [4-14] `R-Null-C-3`
MATCH 2
1. [29-39] `R-Null-C-3`
MATCH 3
1. [54-64] `R-Null-C-3`
MATCH 4
1. [85-95] `R-Null-C-3`
MATCH 5
1. [110-126] `R-Transgenic-C-3`
MATCH 6
1. [141-157] `R-Transgenic-C-3`
This seems an appropriate case for lookarounds:
library(stringr)
str_extract(v, '(?<=-).*(?=\\.)')
where
(?<= ... ) is a positive lookbehind, i.e. it looks for a - immediately before the next captured group;
.* is any character . repeated 0 or more times *;
(?= ... ) is a positive lookahead, i.e. it looks for a period (escaped as \\.) following what is actually captured.
I used stringr::str_extract above because it's more direct in terms of what you're trying to do. It is possible to do the same thing with sub (or gsub), but the regex has to be uglier:
sub('.*?(?<=-)(.*)(?=\\.).*', '\\1', v, perl = TRUE)
.*? looks for any character . from 0 to as few as possible times *? (lazy evaluation);
the lookbehind (?<=-) is the same as above;
now the part we want .* is put in a captured group (...), which we'll need later;
the lookahead (?=\\.) is the same;
.* captures any character, repeated 0 to as many as possible times (here the end of the string).
The replacement is \\1, which refers to the first captured group from the pattern regex.
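For readers outside R, the lookaround idea carries over to Python's re module roughly as follows (a sketch; extract_middle is a hypothetical name, and re.search stands in for str_extract).

```python
import re

def extract_middle(name):
    # (?<=-) anchors just after the first dash; greedy .* then backs off
    # until (?=\.) sees a period, which ends up being the last one.
    m = re.search(r'(?<=-).*(?=\.)', name)
    return m.group() if m else None

print(extract_middle("A17-R-Null-C-3.AT2G41240"))
print(extract_middle("A21-R-Transgenic-C-3.AT2G41240"))
```

Unlike the [^.]+ approach, this version runs up to the last period, so it also works when the middle part itself contains dots.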

Pipe separated values in groups of 3 regex

I have the following string
abc|ghy|33d
The regex below matches it fine
^([\d\w]{3}[|]{1})+[\d\w]{3}$
The string changes but the characters separated by the pipe are always in 3's ... so we can have
krr|455
we can also have
ddc
Here's where the problem happens: The regex explained above doesn't match the string if there is only one set of letters ... i.e. "dcc"
Let's do this step by step.
Your regex :
^([\d\w]{3}[|]{1})+[\d\w]{3}$
We can already see some changes. [|]{1} is equivalent to \|.
Then, we see that you require the first part (aaa|) at least once, because the + operator matches one or more times. Also, \w already matches digits, so [\d\w] can be shortened to \w.
The * operator matches 0 or more. So :
^(?:\w{3}\|)*\w{3}$
works.
See here.
Explanation
^ Matches beginning of string
(?:something)* matches something zero or more times; the group is non-capturing as you won't need to capture it
\w{3} matches 3 alphanumeric characters
\| matches |
$ matches end of string.
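A short Python sketch of this pattern against the strings from the question (is_valid is a hypothetical helper name, and Python is used only for illustration):

```python
import re

# Zero or more "xxx|" prefixes followed by one final three-character segment.
PATTERN = re.compile(r'^(?:\w{3}\|)*\w{3}$')

def is_valid(s):
    return PATTERN.match(s) is not None

for s in ["abc|ghy|33d", "krr|455", "ddc", "dcc|", "ab|cde"]:
    print(s, is_valid(s))
```

The first three strings are accepted, while a trailing pipe or a segment of the wrong length is rejected.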
^[\d\w]{3}(?:[|][\d\w]{3}){0,2}$
You simply quantify the variable part. See demo.
https://regex101.com/r/tS1hW2/18
You can modify your regex as below:
^([\d\w]{3})(\|[\d\w]{3})*$
Here, first match 3 alphanumeric characters, and then any number of groups of 3 alphanumeric characters prefixed with |.
Demo
Your description is a little awkward, but I'm guessing you want to be able to match
abc
abc|def
abc|def|ghi
You can do that with
/^\w{3}(?:\|\w{3}){0,2}$/
Visualization
Explanation
^ - match beginning of string
\w{3} - match any 3 of [A-Za-z0-9_]
(?: ... ){0,2} - non-capturing group, 0 to 2 matches
\| - literal | character
$ - end of string
If the goal is to match any number of 3-character segments, you can use
/^(?:\w{3}(?:\||$))+$/