How to extract a certain portion of a String using a regex?

I would like to extract certain values from the given String
String --> 1ABCDE23
I need only ABCDE from the above string value. Always skip the first value(1) and get next 5 characters(ABCDE) and skip the rest(23)
Appreciate your help

Always skip the first value(1) and get next 5 characters(ABCDE) and skip the rest(23)
This is just extracting a substring from a string! You don't need a regex for this - there has to be a faster function for that in the language you're using, assuming you're using a sane language.
Here are some examples:
Java: String abcde = "1ABCDE23".substring(1, 6);
JavaScript: var abcde = "1ABCDE23".substring(1, 6);
C++: std::string abcde = std::string("1ABCDE23").substr(1, 5);
Python: abcde = "1ABCDE23"[1:6]
PHP: $abcde = substr("1ABCDE23", 1, 5);
C#: string abcde = "1ABCDE23".Substring(1, 5);
Perl: $abcde = substr "1ABCDE23", 1, 5;
Ruby: abcde = "1ABCDE23"[1...6]
If you're using an insane language that has a regex engine with support for capturing groups but not a facility for extracting substrings from strings, you can run this regex (which is suggested by sln) and take the first capturing group:
^.(.{5}).*
^ the match must be at the beginning of the string
. match any character
( ) put what's matched by the parenthesized expression into the 1st capturing group
.{5} match 5 characters; any character goes
.* match 0 or more characters; any character goes
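For comparison, here is a minimal Python sketch of both approaches. The function names are made up for illustration; the pattern is the one explained above.

```python
import re

def middle_by_slice(s):
    # Skip the first character and take the next five.
    return s[1:6]

def middle_by_regex(s):
    # Same pattern as above: group 1 holds the five characters after the first one.
    m = re.match(r"^.(.{5}).*", s)
    return m.group(1) if m else None

print(middle_by_slice("1ABCDE23"))  # ABCDE
print(middle_by_regex("1ABCDE23"))  # ABCDE
```

One difference worth knowing: on an input shorter than six characters the slice quietly returns whatever is there, while the regex does not match at all.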

Related

Regex to replace all non numbers but allow a '+' prefix

I want to delete all invalid letters from a string which should represent a phone number. Only a '+' prefix and numbers are allowed.
I tried in Kotlin with
"+1234abc567+".replace("[^+0-9]".toRegex(), "")
It works nearly perfect, but it does not replace the last '+'.
How can I modify the regex to only allow the first '+'?
You could do a regex replacement on the following pattern:
(?<=.)\+|[^0-9+]+
Sample script:
String input = "+1234abc567+";
String output = input.replaceAll("(?<=.)\\+|[^0-9+]+", "");
System.out.println(input); // +1234abc567+
System.out.println(output); // +1234567
Here is an explanation of the regex pattern:
(?<=.)\+ match a literal + which is NOT first (i.e. preceded by >= 1 character)
| OR
[^0-9+]+ match one or more non digit characters, excluding +
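The same pattern can be tried in Python, whose re module also supports this fixed-width lookbehind. This is only a sketch; PHONE_JUNK and clean_phone are names chosen here for illustration.

```python
import re

# A '+' that has any character before it, or a run of characters
# that are neither digits nor '+'.
PHONE_JUNK = re.compile(r"(?<=.)\+|[^0-9+]+")

def clean_phone(raw):
    # Delete every match, which leaves digits and a '+' at position 0.
    return PHONE_JUNK.sub("", raw)

print(clean_phone("+1234abc567+"))  # +1234567
```

Note that the lookbehind inspects the original text, so a '+' that follows leading junk (as in "abc+123") is removed as well.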
You can use
^(\+)|\D+
Replace with the backreference to the first group, $1.
Details:
^(\+) - a + at the start of string captured into Group 1
| - or
\D+ - one or more non-digit chars.
NOTE: a raw string literal delimited with """ allows the use of a single backslash to form regex escapes, such as \D, \d, etc. Using this type of string literals greatly simplifies regex definitions inside code.
See the Kotlin demo:
val s = "+1234abc567+"
val regex = """^(\+)|\D+""".toRegex()
println(s.replace(regex, "$1"))
// => +1234567
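The same replacement can be sketched in Python, assuming Python 3.5 or later, where a group that did not take part in the match expands to an empty string in the replacement. The function name is hypothetical.

```python
import re

def clean_phone_keep_prefix(raw):
    # Group 1 is set only when the match is a '+' at the very start.
    # For \D+ matches it is unset, so r"\1" expands to "" and the
    # matched text is simply dropped.
    return re.sub(r"^(\+)|\D+", r"\1", raw)

print(clean_phone_keep_prefix("+1234abc567+"))  # +1234567
```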

Ruby Regex - If the string is more than 10 characters, remove the first character if it is a "1"

Without using a gem, I just want to write a simple regex formula to remove the first character from strings if it's a 1, and, if there are more than 10 total characters in the string. I never expect more than 11 characters, 11 should be the max. But in the case there are 10 characters and the string begins with "1", I don't want to remove it.
str = "19097147835"
str&.remove(/\D/).sub(/^1\d{10}$/, "\1").to_i
Returns 0
I'm looking for it to return "9097147835"
You could use your pattern, but add a capture group around the 10 digits to use the group in the replacement.
\A1(\d{10})\z
For example
str = "19097147835"
puts str.gsub(/\D/, '').sub(/\A1(\d{10})\z/, '\1').to_i
Output
9097147835
Another option is to remove all the non-digits and then match only the 10 digits that follow the leading 1:
\A1\K\d{10}\z
\A Start of string
1\K Match 1 and forget what is matched so far
\d{10} Match 10 digits
\z End of string
str = "19097147835"
str.gsub(/\D/, '').match(/\A1\K\d{10}\z/) do |match|
puts match[0].to_i
end
Output
9097147835
You can use
str.gsub(/\D/, '').sub(/\A1(?=\d{10})/, '').to_i
The regex matches
\A - start of string
1 - a 1
(?=\d{10}) - immediately to the right of the current location, there must be 10 digits.
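For readers who want to try the lookahead pattern outside Ruby, here is a rough Python equivalent. The function name is invented for this sketch, and it returns a string rather than an integer.

```python
import re

def strip_leading_one(raw):
    # Drop everything that is not a digit first.
    digits = re.sub(r"\D", "", raw)
    # Remove a leading 1 only when at least 10 more digits follow it.
    return re.sub(r"\A1(?=\d{10})", "", digits)

print(strip_leading_one("19097147835"))  # 9097147835
print(strip_leading_one("1909714783"))   # 1909714783 (only 10 digits, kept as is)
```

Because the pattern has no end-of-string anchor, a leading 1 would also be stripped from inputs longer than 11 digits; the question says those are not expected.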
Non regex example:
str = str[1..] if (str.start_with?("1") and str.size > 10)
Regexes are powerful, but not easy to maintain.

Make regex quantifier length depend on previous capture group

I'm hoping to use a regex to parse strings which begin with an integer n. After a space, there are n characters, after which there may be more text. I'm hoping to capture n and the n characters that follow. There are no constraints on these n characters. In other words, 5 hello world should match with the capture groups 5 and hello.
I tried this regex, but it wouldn't compile because its structure depends on the input: (\d+) .{\1}.
Is there a way to get the regex compiler to do what I want, or do I have to parse this myself?
I'm using Rust's regex crate, if that matters. And if it's not possible with regex, is it possible with another, more sophisticated regex engine?
Thanks!
As @Cary Swoveland said in the comments, this is not possible in regex in one step without hard-coding the various possible lengths.
However, it is not too difficult to take a substring of the matched string with length from the matched digit:
use regex::Regex;
fn main() {
let re = Regex::new(r"(\d+) (.+)").unwrap();
let test_str = "5 hello world";
for cap in re.captures_iter(test_str) {
let length: usize = cap[1].parse().unwrap_or(0);
let short_match: String = cap[2].chars().take(length).collect();
println!("{}", short_match); // hello
}
}
If you know you'll only be dealing with ASCII characters (no Unicode, accent marks, etc.) then you can use the simpler slice syntax let short_match = &cap[2][..length];.
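The same two-step idea (match the number, then cut the rest to length) looks like this in Python. This is a sketch with a made-up function name. Unlike the Rust code above, it returns None when fewer characters follow than the prefix promises, which is a choice made here, not something the question requires.

```python
import re

def parse_length_prefixed(s):
    # Capture the integer, a single space, then everything after it.
    m = re.match(r"(\d+) (.*)", s, re.DOTALL)
    if m is None:
        return None
    n = int(m.group(1))
    rest = m.group(2)
    if n > len(rest):
        return None  # the input is shorter than the prefix claims
    # Python slices count code points, so non-ASCII text is safe here.
    return n, rest[:n]

print(parse_length_prefixed("5 hello world"))  # (5, 'hello')
```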
If Perl is an option, you could try:
perl -e '
$str = "5 abcdefgh";
$str =~ /(\d+) ((??{".{".($^N)."}"}))/;
print "1st capture group = $1\n";
print "2nd capture group = $2\n";
print "whole capture group = $&\n";
'
Output:
1st capture group = 5
2nd capture group = abcde
whole capture group = 5 abcde
[Explanation]
When the regex engine reaches a (??{...}) block, it runs the enclosed
Perl code and uses the resulting string as a pattern at that point.
The special variable $^N holds the text of the most recently closed
capture group, which is 5 in this case.
So the code (??{".{".($^N)."}"}) evaluates to .{5}, that is, a dot
followed by the quantifier {5}: any five characters.
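Python's re module has no counterpart to (??{...}), but the effect can be approximated in two passes: read the number first, then build the second pattern at run time. A sketch under that assumption, with a hypothetical function name:

```python
import re

def match_with_dynamic_quantifier(s):
    # Pass 1: read the leading integer.
    head = re.match(r"(\d+) ", s)
    if head is None:
        return None
    n = int(head.group(1))
    # Pass 2: build the quantifier from n; for n == 5 the pattern
    # becomes r"(\d+) (.{5})".
    pattern = re.compile(r"(\d+) (.{%d})" % n, re.DOTALL)
    m = pattern.match(s)
    return m.groups() if m else None

print(match_with_dynamic_quantifier("5 abcdefgh"))  # ('5', 'abcde')
```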

Use regex to extract substrings delimited by another substring

I'm trying to use a regular expression to capture substrings delimited by another substring. For example, if I had the sentence
My cat is a cat.
and the delimiter I wanted to use was "cat", the output should be
My
is a
.
I've been unable to find a solution where the delimiter isn't a single character.
Edit: I'm writing this in Java, and the output represents groups returned by Java's Matcher class in a call like "myMatcher.group()". Sorry for the confusion.
What you need is String#split as Tushar pointed out in the comment.
String s = "My cat is a cat.";
String[] res = s.split("cat");
System.out.println(Arrays.toString(res));
This is the only correct way to do it.
Now, you want to know how to match any text other than cat with the Matcher.
DISCLAIMER: do not use this in Java; it is highly impractical and performs poorly.
You may match the cat and capture it into a Group, and add another alternative to the pattern that will match any text other than cat.
String s = "My cat is a cat.";
Pattern pattern = Pattern.compile("(?i)(cat)|[^c]*(?:c(?!at)[^c]*)*");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    if (matcher.group(1) == null) { // Group 1 is unset, so this match is NOT "cat"
        if (!matcher.group(0).isEmpty()) { // Is the match text NOT empty?
            System.out.println(matcher.group(0)); // Great, print it
        }
    }
}
Pattern details:
(?i) - case insensitive inline modifier
(cat) - Group 1 capturing a substring cat
| - or
[^c]*(?:c(?!at)[^c]*)* - a substring that is not a starting point for a cat substring. It is an unrolled (?s)(?:(?!cat).)* tempered greedy token.
[^c]* - 0+ chars other than c or C
(?:c(?!at)[^c]*)* - zero or more sequences of:
c(?!at) - c or C not followed with at, At, AT, aT
[^c]* - 0+ chars other than c or C
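The pattern is not Java-specific. As a rough check, the Python sketch below runs it with finditer, keeps only the non-empty matches where group 1 is unset, and compares the result with a plain split. TOKEN and pieces_between are names invented for this example.

```python
import re

TOKEN = re.compile(r"(?i)(cat)|[^c]*(?:c(?!at)[^c]*)*")

def pieces_between(text):
    out = []
    for m in TOKEN.finditer(text):
        # Group 1 is set only when the delimiter itself was matched;
        # the second alternative can also match the empty string.
        if m.group(1) is None and m.group(0):
            out.append(m.group(0))
    return out

print(pieces_between("My cat is a cat."))  # ['My ', ' is a ', '.']
print("My cat is a cat.".split("cat"))     # ['My ', ' is a ', '.']
```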

Regex in PHP: take all the words after the first one in string and truncate all of them to the first character

I'm quite terrible at regexes.
I have a string that may have 1 or more words in it (generally 2 or 3), usually a person name, for example:
$str1 = 'John Smith';
$str2 = 'John Doe';
$str3 = 'David X. Cohen';
$str4 = 'Kim Jong Un';
$str5 = 'Bob';
I'd like to convert each as follows:
$str1 = 'John S.';
$str2 = 'John D.';
$str3 = 'David X. C.';
$str4 = 'Kim J. U.';
$str5 = 'Bob';
My guess is that I should first match the first word, like so:
preg_match( "^([\w\-]+)", $str1, $first_word )
then all the words after the first one... but how do I match those? should I use again preg_match and use offset = 1 in the arguments? but that offset is in characters or bytes right?
Anyway, after I have matched the words following the first, if they exist, should I do something like this for each of them:
$second_word = substr( $following_word, 1 ) . '. ';
Or my approach is completely wrong?
Thanks
ps - it would be a boon if the regex could maintain the whole first two words when the string contain three or more words... (e.g. 'Kim Jong U.').
It can be done in single preg_replace using a regex.
You can search using this regex:
^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+
And replace by:
$1.
Code:
$name = preg_replace('/^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+/', '$1.', $name);
Explanation:
(*FAIL), abbreviated (*F) in the regex above, behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
Together, (*SKIP)(*FAIL) provide a neat workaround for the restriction that a lookbehind cannot be variable-length here.
^\w+(?:$| +)(*SKIP)(*F) matches the first word in a name and skips it (does nothing)
(\w)\w+ matches every other word, which is then replaced with its first letter and a dot.
You could use a positive lookbehind assertion.
(?<=\h)([A-Z])\w+
OR
Use this regex if you also want to turn "Bob F" into "Bob F."
(?<=\h)([A-Z])\w*(?!\.)
Then replace the matched characters with \1.
Code would be like,
preg_replace('~(?<=\h)([A-Z])\w+~', '\1.', $string);
(?<=\h)([A-Z]) captures an uppercase letter that is preceded by a horizontal whitespace character.
\w+ matches one or more word characters.
Replacing the match with the contents of group 1 (\1) plus a dot gives the desired output.
A simple solution with only look-ahead and word boundary check:
preg_replace('~(?!^)\b(\w)\w+~', '$1.', $string);
(\w)\w+ is a word in the name, with the first character captured
(?!^)\b performs a word boundary check \b, and makes sure the match is not at the start of the string (?!^).
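This last pattern uses nothing PCRE-specific, so it can be checked quickly in Python against the names from the question. A small sketch, with an illustrative function name:

```python
import re

def abbreviate(name):
    # Every word except the one at the start of the string is reduced
    # to its first character plus a dot. Single-letter words such as
    # the "X" in "X." are left alone because \w+ needs a second character.
    return re.sub(r"(?!^)\b(\w)\w+", r"\1.", name)

for n in ["John Smith", "John Doe", "David X. Cohen", "Kim Jong Un", "Bob"]:
    print(abbreviate(n))
# John S.
# John D.
# David X. C.
# Kim J. U.
# Bob
```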