How to use regex to extract text, including new lines, up to the next match - regex

I have used a regex to successfully extract everything right after "Abc 123", but it doesn't extract anything from the following lines.
Is there any way I can use regex to extract the following:
"Abc 123 def
ghi
jkl"
"Abc 123 def ghi jkl mno"
"Abc 123 def ghi jkl
mno"
I am using Regex in Talend.

I think you want to extract substrings that start at the beginning of a line with 1+ word chars, then a whitespace, then 1 or more digits, and that span multiple lines up to the next occurrence of the same pattern.
You may use the following regex (note the flags and notation may differ depending on the language you are using):
/^(\w+)\s(\d+)(.*(?:\r?\n(?!\w+\s\d).*)*)/gm
See the regex demo.
Details:
^ - start of a line
(\w+) - Group 1: one or more word chars
\s - 1 whitespace
(\d+) - Group 2: one or more digits
(.*(?:\r?\n(?!\w+\s\d).*)*) - Group 3:
.* - any 0+ chars other than line break chars
(?:\r?\n(?!\w+\s\d).*)* - zero or more sequences of:
\r?\n - a line break...
(?!\w+\s\d) - that is not followed with 1+ word chars, a whitespace and a digit
.* - any 0+ chars other than line break chars
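As a rough illustration of the pattern above, here is a minimal Python sketch. Python's re module, the extract_records helper and the sample text are assumptions made for this example (the question itself uses Talend, where flags and escaping differ). The /gm flags are approximated with re.MULTILINE and finditer.

```python
import re

# Same pattern as above; re.MULTILINE plays the role of the /m flag.
RECORD = re.compile(r"^(\w+)\s(\d+)(.*(?:\r?\n(?!\w+\s\d).*)*)", re.MULTILINE)


def extract_records(text):
    """Return (group 1, group 2, group 3) for every record found in text."""
    # finditer walks over all non-overlapping matches, like the /g flag.
    return [m.groups() for m in RECORD.finditer(text)]


# Hypothetical input built from the three samples in the question.
sample = "\n".join([
    "Abc 123 def",
    "ghi",
    "jkl",
    "Abc 123 def ghi jkl mno",
    "Abc 123 def ghi jkl",
    "mno",
])

records = extract_records(sample)
for word, number, rest in records:
    # repr() makes the captured line breaks visible.
    print(repr(word), repr(number), repr(rest))
```

Group 3 keeps the line breaks that belong to a record, so the first and third records come back as multi-line strings.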

(\w)+\s(\d+)((.|\R)+) is what you want, so after escaping the backslashes for a Java string it will be:
(\\w)+\\s(\\d+)((.|\\R)+).
\R is a linebreak matcher available in Java regex since Java 8. It matches any line break sequence, including both \r\n and \n.
If you only allow a single linebreak:
(\w)+\s(\d+)((.+)(\R.+){0,1})
I think you should specify your desired output more precisely, but this answer shows how to include either any number of lines or at most two lines.

Related

Regex Word Boundary including hyphen

With the regular expression \s{2}\b I'm trying to replace the last two whitespaces before a word with a pipe (|) using a word boundary, but it skips the number -18.055,81. What kind of expression would also match the last two whitespaces before -18.055,81?
Example:
1234 - This is a test 18.055,81 -18.055,81 0,00 0,00 18.055,81
Results in:
1234 - This is a test |18.055,81 -18.055,81 |0,00 |0,00 |18.055,81
What I want:
1234 - This is a test |18.055,81 |-18.055,81 |0,00 |0,00 |18.055,81
You can use either of the following:
\s{2}(?=-?\d)
\s{2}(?=-?\w)
See the regex demo. Details:
\s{2} - two whitespace chars
(?=-?\d) - that are immediately followed with an optional hyphen and then a digit.
If there are any other word chars expected (letters, or underscores), replace \d with \w.
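A small Python sketch of the lookahead version, next to the \s{2}\b pattern from the question. The add_pipes helper is hypothetical, and the input spacing is an assumption: the sample line above appears with its whitespace collapsed, so the sketch puts three spaces in front of every amount.

```python
import re

# Two whitespace chars immediately followed by an optional hyphen and a digit.
GAP = re.compile(r"\s{2}(?=-?\d)")
# The pattern from the question, kept for comparison.
GAP_WORD_BOUNDARY = re.compile(r"\s{2}\b")


def add_pipes(line, pattern=GAP):
    """Replace each matched pair of whitespace chars with a pipe."""
    return pattern.sub("|", line)


# Assumed input: three spaces in front of every amount.
amounts = ["18.055,81", "-18.055,81", "0,00", "0,00", "18.055,81"]
line = "1234 - This is a test" + "".join("   " + a for a in amounts)

print(add_pipes(line))
print(add_pipes(line, GAP_WORD_BOUNDARY))
```

The word-boundary pattern leaves the gap before -18.055,81 alone because there is no word boundary between a space and a hyphen. The lookahead does not rely on \b, so it handles the negative number too.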

Separate string by recognizing first digit with regex

I'm using ([^\d]+)\s?(.+) to split a string at the first digit that appears in it.
Exp.: Test123 --> Group1: Test, Group2: 123 # that works
but
Exp.: Test --> Group1: Tes, Group2: t # I expect: Group1: Test, Group 2: [empty]
How can I edit the regex so that it meets my expectation?
If you need to match up to the first digit if there is one, you may use
^(.*?)\s*(\d.*)?$
See the regex demo
^ - start of string
(.*?) - Group 1: any 0+ chars other than line break chars, as few as possible (since *? is a lazy quantifier)
\s* - 0+ whitespaces
(\d.*)? - Group 2: an optional capturing group matching 1 or 0 occurrences of a digit and then any 0+ chars other than line break chars, as many as possible (* is a greedy quantifier)
$ - end of string.
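To see how the optional second group behaves, here is a minimal Python sketch. The split_at_first_digit helper is hypothetical, and using Python's re module is an assumption (the question does not name a language).

```python
import re

SPLIT_AT_DIGIT = re.compile(r"^(.*?)\s*(\d.*)?$")


def split_at_first_digit(text):
    """Return (text before the first digit, text from the first digit on)."""
    m = SPLIT_AT_DIGIT.match(text)
    if m is None:
        # Can happen for multi-line input, since . does not match line breaks here.
        return text, ""
    # Group 2 is None when the optional group did not take part in the match.
    return m.group(1), m.group(2) or ""


for value in ["Test123", "Test", "Test 123"]:
    print(value, "->", split_at_first_digit(value))
```

Because Group 2 is optional, a string without digits still matches and Group 1 keeps the whole text.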
Your regex almost works.
Problem: the second capturing group, (.+), requires at least one character. When the string contains no digit, ([^\d]+) first grabs all of 'Test' and then has to give back the final 't' so that (.+) has something to match.
Solution: replace the second capturing group with (.*), which matches zero or more characters. It no longer needs a character to succeed, and it still captures everything after 'Test' when there is something there.
Here is the working regex:
([^\d]+)\s?(.*)
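The difference between the two quantifiers can be checked with a short Python sketch (the groups helper is hypothetical, and Python's re module is an assumption):

```python
import re

ORIGINAL = re.compile(r"([^\d]+)\s?(.+)")  # second group needs 1+ chars
FIXED = re.compile(r"([^\d]+)\s?(.*)")     # second group may be empty


def groups(pattern, text):
    """Return the captured groups, or None when the pattern does not match."""
    m = pattern.match(text)
    return m.groups() if m else None


for value in ["Test123", "Test"]:
    print(value, groups(ORIGINAL, value), groups(FIXED, value))
```

Note that with an input such as 'Test 123' the first group keeps the trailing space, because a space is also a non-digit character.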

Tokenizing a string with a regular expression

Suppose I have a string like this: abc def ghi jkl (I put a space at the end for the sake of simplicity but it doesn't really matter for me) and I want to capture its "chunks" as follows:
abc
def
ghi
jkl
if and only if there are 1-4 "chunks" in the string. I have already tried the following regex:
^([^ ]+ ){1,4}$
at Regex101.com but it only captures the last occurrence. A warning about it is issued:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
How to correct the regular expression to achieve my goal?
Since you have no access to the code, the only solution you might use is a regex based on the \G operator that will only allow consecutive matches and a lookahead anchored at the start that will require 1 to 4 non-whitespace chunks in the string.
(?:^(?=\s*\S+(?:\s+\S+){0,3}\s*$)|\G(?!^))\s*\K\S+
See the regex demo
Details:
(?:^(?=\s*\S+(?:\s+\S+){0,3}\s*$)|\G(?!^)) - a custom boundary that checks if:
^(?=\s*\S+(?:\s+\S+){0,3}\s*$) - the string start position (^) that is followed with 1 to 4 non-whitespace chunks, separated with 1+ whitespaces, and trailing/leading whitespaces are allowed, too
| - or
\G(?!^) - the current position at the end of the previous successful match (\G also matches the start of a string, thus we have to use the negative lookahead to exclude that matching position, since there is a separate check performed)
\s* - zero or more whitespaces
\K - a match reset operator discarding all the text matched so far
\S+ - 1 or more characters other than whitespace
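When the matching can be done from code, the two jobs of that pattern (checking the chunk count, then returning the chunks one by one) can be split into two steps. The sketch below is a different technique from the \G and \K regex above, not a port of it: Python's built-in re module supports neither operator. The tokenize helper is hypothetical.

```python
import re

# Same check as the lookahead above: 1 to 4 non-whitespace chunks,
# with optional leading and trailing whitespace.
ONE_TO_FOUR_CHUNKS = re.compile(r"\s*\S+(?:\s+\S+){0,3}\s*")


def tokenize(text):
    """Return the chunks of text, or an empty list unless there are 1 to 4."""
    # fullmatch requires the whole string to fit the pattern.
    if ONE_TO_FOUR_CHUNKS.fullmatch(text) is None:
        return []
    return re.findall(r"\S+", text)


print(tokenize("abc def ghi jkl "))
print(tokenize("a b c d e"))
```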
It can be done on Linux using tr:
tr -sc 'a-zA-Z' '\n' < text.txt > out_text.txt
where text.txt contains the string to be tokenized. Note that this splits on every non-letter character and does not check that there are only 1-4 chunks.

regex to get the last item after space

How to get the last item by regex?
"Read the information failed.
111 a bcd
SAM Error Log not up supported"
I did this
111\s(.*)$
but it gives me
output = "a bcd sam"
But I want the regex output for the line that starts with 111 to be
output = "sam" // for the line starts with 111
Also, how can I change it if there is a space before 111?
You can test this at https://regex-golang.appspot.com/assets/html/index.html
Note that 111\s(.*)$ matches 111 anywhere inside the string (the first occurrence) and then captures into Submatch 1 any 0+ characters up to the end of the string.
If there is a space before the last sam, you may use
^111.*\s(\S+)$
Pattern explanation:
^ - start of string
111 - a literal substring 111
.* - any characters, 0 or more, as many as possible up to the last...
\s - whitespace
(\S+) - Submatch 1 capturing one or more non-whitespace characters
$ - end of string.
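A quick way to check this single-line pattern is a small Python sketch. The last_item helper is hypothetical, and Python's re module is used here only for illustration (the question targets Go):

```python
import re

LAST_ITEM = re.compile(r"^111.*\s(\S+)$")


def last_item(line):
    """Return the last whitespace-separated item of a line starting with 111."""
    m = LAST_ITEM.search(line)
    return m.group(1) if m else None


print(last_item("111 a bcd sam"))
print(last_item(" 111 a bcd sam"))  # leading space, so ^111 does not match
```

The greedy .* runs to the end of the string and backtracks to the last whitespace, which is why only the final item lands in the capturing group.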
If you want to get the line that starts with 111 (and any leading whitespace is allowed) and has some whitespace after which your submatch is located, you may consider either
(?m)^\s*111[^\r\n]*\s(\S+)$
(a . is replaced with [^\r\n] because in Go regex a dot . matches any character except a line feed by default, so it would still match a carriage return), or - to make sure you only match horizontal whitespace:
(?m)^[^\S\r\n]*111[^\r\n]*[^\S\r\n](\S+)$
Explanation:
(?m)^ - start of the line (due to the (?m) MULTILINE modifier, the ^ now matches a line start and $ will match the line end)
[^\S\r\n]* - zero or more whitespaces except LF and CR (=horizontal whitespace)
111 - a literal 111
[^\r\n]* - any 0+ characters other than CR and LF as many as possible up to the last....
[^\S\r\n] - horizontal space
(\S+) - Submatch 1 capturing 1+ non-whitespace chars
$ - end of line (prepend with [^\S\r\n]* or [^\S\n]* to allow trailing horizontal whitespace)
Here is a Go demo:
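package main
import (
	"fmt"
	"regexp"
)
func main() {
	// The two "111" lines start with a space, as the output below shows.
	s := `Read SMART Log Directory failed.
 111 a bcd sam
 111 sam
SMART Error Log not supported`
	// (?m) makes ^ and $ match at line boundaries; (\S+) captures the last item of each matching line.
	rex := regexp.MustCompile(`(?m)^[^\S\r\n]*111[^\r\n]*[^\S\r\n](\S+)$`)
	// Print every match together with its submatch.
	fmt.Printf("%#v\n", rex.FindAllStringSubmatch(s, -1))
}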
Output: [][]string{[]string{" 111 a bcd sam", "sam"}, []string{" 111 sam", "sam"}}
This should work for you:
\s(\w+)$ // The output will be `sam`
This captures the last run of word characters that follows a whitespace and sits at the end of the string ($).
You can use this:
^\s*1{3}.*\s(\S+)$
^ start of line
\s* 0 or more occurrences of space in the beginning
1{3} followed by three ones (i.e. 111)
.* followed by anything
\s followed by space
(\S+)$ ending with non-space characters. First capture group.

Count number of words in a string

How can I use a regex to match a string only if it contains at least 5 words?
Input1: stack over flow => the regex will not match anything
Input2: stack over flow stack over => the regex will match this string
I have tried counting the spaces with /\s/ but that didn't really help, because I need to match only strings with at least 5 words.
Also I don't want to use split by spaces.
I would rely on a whitespace/non-whitespace patterns and allow trailing/leading whitespace:
^\s*\S+(?:\s+\S+){4,}\s*$
See demo
Explanation:
^ - start of string
\s* - optional any number of whitespace symbols
\S+ - one or more non-whitespace symbols
(?:\s+\S+){4,} - 4 or more sequences of one or more whitespace symbols followed with one or more non-whitespace symbols
\s* - zero or more (optional) trailing whitespace symbols
$ - end of string
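Here is a minimal Python sketch of this check, run against the two inputs from the question. The has_five_or_more_words helper is hypothetical, and Python's re module is an assumption:

```python
import re

AT_LEAST_FIVE_WORDS = re.compile(r"^\s*\S+(?:\s+\S+){4,}\s*$")


def has_five_or_more_words(text):
    """True when text holds 5 or more whitespace-separated chunks."""
    return AT_LEAST_FIVE_WORDS.search(text) is not None


print(has_five_or_more_words("stack over flow"))
print(has_five_or_more_words("stack over flow stack over"))
```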
^ *\w+(?: +\w+){4,}$
You can use this regex. See demo.
https://regex101.com/r/cZ0sD2/15
To check if there are at least 5 words in string:
(?:\w+\W+){4}\b
(?:\w+\W+){4} 4 words separated by non word characters
\b followed by a word boundary -> requires a 5th word
See demo at regex101
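One thing worth checking before picking a pattern is what counts as a word. A hedged Python sketch (helper names and the hyphenated sample are made up for illustration) comparing this \w-based pattern with the whitespace-based ^\s*\S+(?:\s+\S+){4,}\s*$ from the first answer:

```python
import re

FIVE_WORD_RUNS = re.compile(r"(?:\w+\W+){4}\b")
FIVE_CHUNKS = re.compile(r"^\s*\S+(?:\s+\S+){4,}\s*$")


def matches_word_runs(text):
    """True when text contains 5 runs of word chars split by non-word chars."""
    return FIVE_WORD_RUNS.search(text) is not None


def matches_chunks(text):
    """True when text holds 5 or more whitespace-separated chunks."""
    return FIVE_CHUNKS.search(text) is not None


for value in ["stack over flow", "stack over flow stack over", "one-two three-four five"]:
    print(repr(value), matches_word_runs(value), matches_chunks(value))
```

The two patterns agree on the inputs from the question, but the hyphenated sample only matches the \w-based one, since a hyphen is a non-word character and therefore splits words there.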