Removing leading and trailing white spaces as well as leading zeroes - regex

I am trying to remove the leading zeroes from a value such as 0011223344, and also its leading and trailing white spaces. I have these two patterns:
/^0+(?=[0-9])/
s/^\s+|\s+$//
How can I combine the two to get the following output?
11223344

You may use this regex in perl:
s/^\h*0*(?=\d)|\h+$//g
RegEx Details:
\h matches a horizontal whitespace
^\h*0*(?=\d): At the start it will match 0 or more leading whitespaces followed by 0 or more leading zeroes as long as there is at least one digit ahead
| OR
\h+$: At the end it will match 1+ horizontal whitespaces
Examples:
perl -pe 's/^\h*0*(?=\d)|\h+$//g' <<< ' 001 '
1
perl -pe 's/^\h*0*(?=\d)|\h+$//g' <<< ' 000 '
0
perl -pe 's/^\h*0*(?=\d)|\h+$//g' <<< ' 0000000123 '
123
perl -pe 's/^\h*0*(?=\d)|\h+$//g' <<< ' 123 '
123
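For readers working outside Perl, the same substitution can be sketched in Python. This is only an illustration, not part of the answer above: Python's re module has no \h, so [ \t] is assumed here as a stand-in for horizontal whitespace, and the helper name strip_zeros_and_spaces is hypothetical.

```python
import re

# [ \t] stands in for Perl's \h (assumption: only spaces and tabs matter here).
PATTERN = re.compile(r"^[ \t]*0*(?=\d)|[ \t]+$")


def strip_zeros_and_spaces(value):
    """Drop leading blanks and zeroes (keeping the last digit) and trailing blanks."""
    return PATTERN.sub("", value)


for sample in (" 001 ", " 000 ", " 0000000123 ", " 123 "):
    print(repr(strip_zeros_and_spaces(sample)))
```

The lookahead (?=\d) is what keeps a lone 0 when the value consists of zeroes only.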

You may use
s/^[\s0]+|\s+$//g
Or, for a corner case like ' 0000 ' where you would still like to keep one zero:
s/^(?:\s*(0)+\s*$|[\s0]+)|\s+$/$1/g
See the regex demo #1 and regex demo #2.
Details
^[\s0]+ - matches one or more zeros or whitespace at the start of the string
^(?:\s*(0)+\s*$|[\s0]+) - matches
^ - start of string
(?:\s*(0)+\s*$|[\s0]+) - either of
\s*(0)+\s*$ - 0+ whitespaces, 1 or more zeros each time captured into Group 1, and then 0+ whitespaces till end of string
|- or
[\s0]+ - 1 or more whitespaces or zeros
| - or
\s+$ - matches one or more whitespace chars at the end of string.
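The corner-case pattern can also be tried in Python, which accepts the same syntax for it. The sketch below is an assumption-laden illustration rather than the answer's own code: trim_keep_one_zero is a hypothetical name, and a replacement function is used in place of $1 so that a non-participating group becomes an empty string explicitly.

```python
import re

# Same alternation as the second substitution above; group 1 only participates
# when the whole string is zeros, optionally surrounded by whitespace.
PATTERN = re.compile(r"^(?:\s*(0)+\s*$|[\s0]+)|\s+$")


def trim_keep_one_zero(value):
    # group(1) is None unless the all-zeros branch matched, so fall back to "".
    return PATTERN.sub(lambda m: m.group(1) or "", value)


print(repr(trim_keep_one_zero(" 0000 ")))
print(repr(trim_keep_one_zero(" 0011223344 ")))
```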

Related

Extract Values Between Pattern Match

I'm trying to extract any numerical values between a pattern match in a text file.
Parsed Log File Text
> GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2
I want to pull the 25 from f25 in nmmb_2p5km.f25.conus.grib2
Attempted Code
sed -e 's/nmmb_2p5km\(.*\)grib2/\1/'
You may use
log="GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2"
sed 's/.*nmmb_2p5km[^0-9]*\([0-9]*\)[^0-9]*grib2.*/\1/' <<< "$log"
The .*nmmb_2p5km[^0-9]*\([0-9]*\)[^0-9]*grib2.* pattern matches
.* - any 0+ chars
nmmb_2p5km - a literal substring
[^0-9]* - 0+ non-digit chars
\([0-9]*\) - Capturing group 1 (later referred to with \1 from the replacement pattern): 0+ digits
[^0-9]* - 0+ non-digit chars
grib2.* - grib2 and any 0+ chars.
Alternatively, you may use grep with a PCRE pattern like
grep -Po 'nmmb_2p5km\D*\K\d+' <<< "$log"
Details
nmmb_2p5km - a literal substring
\D* - 0+ non-digit chars
\K - match reset operator discarding all text matched so far
\d+ - 1+ digits.
See the online sed and grep demo.
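The grep approach carries over to Python with one change: Python's re module does not support \K, so a capturing group is assumed in its place. A minimal sketch, with forecast_hour as a hypothetical helper name:

```python
import re

log = ("GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/"
       "hiresw.t00z.nmmb_2p5km.f25.conus.grib2")


def forecast_hour(line):
    """Return the first run of digits after the literal nmmb_2p5km, or None."""
    match = re.search(r"nmmb_2p5km\D*(\d+)", line)
    return match.group(1) if match else None


print(forecast_hour(log))
```

Returning None when nothing matches avoids an AttributeError on log lines that do not contain the substring.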
Using perl one-liner
> export log="GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2"
> perl -ne ' BEGIN { $x=$ENV{log};$x=~s/(.+?)(\d+)\.conus\.(.+)/$2/g; print "$x\n"; exit } '
25
>

Regex to accept special characters only with alpha numeric

I created the following regex:
^[a-zA-Z0-9 ']{1,24}\r?$
It should accept alphanumeric characters, spaces and apostrophes, and the input should be between 1 and 24 characters long. But it also accepts inputs made up only of apostrophes and spaces (e.g. " ' ' "). I expect apostrophes and spaces to be accepted only alongside some alphanumeric characters. So the test cases below should behave as follows:
Pass
Test
Test'My Regex
Test' 123' Regex '
Fail
''
You may use
^(?=.{1,24}$)[a-zA-Z0-9 ']*[A-Za-z][a-zA-Z0-9 ']*$
Or, if strings with just a single digit already make them valid:
^(?=.{1,24}$)[a-zA-Z0-9 ']*[A-Za-z0-9][a-zA-Z0-9 ']*$
See the regex demo
Details
^ - start of string
(?=.{1,24}$) - the whole string must contain 1 to 24 chars
[a-zA-Z0-9 ']* - 0+ alphanumeric, space or ' chars
[A-Za-z] - an alpha char (NOTE replace with [A-Za-z0-9] to also allow strings with just a digit)
[a-zA-Z0-9 ']* - 0+ alphanumeric, space or ' chars
$ - end of string.
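Both variants can be checked against the test cases from the question. The following Python sketch assumes the patterns are used unchanged; the names LETTER_REQUIRED, ALNUM_REQUIRED and is_valid are made up for illustration.

```python
import re

# At least one letter somewhere in the 1-24 character string.
LETTER_REQUIRED = re.compile(r"^(?=.{1,24}$)[a-zA-Z0-9 ']*[A-Za-z][a-zA-Z0-9 ']*$")
# Variant where a single digit is enough.
ALNUM_REQUIRED = re.compile(r"^(?=.{1,24}$)[a-zA-Z0-9 ']*[A-Za-z0-9][a-zA-Z0-9 ']*$")


def is_valid(text, pattern=LETTER_REQUIRED):
    return pattern.match(text) is not None


for candidate in ("Test", "Test'My Regex", "Test' 123' Regex '", "''", " ' ' "):
    print(repr(candidate), is_valid(candidate))
```

Note that in Python $ also matches just before a trailing newline; \Z can be used instead if that matters for the input.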

Capture word between optional hyphens regex

I have the following types of strings:
abc - xyz
abc - pqr - xyz
abc - - xyz
abc - pqr uvw - xyz
I want to retrieve the text xyz from the 1st string, pqr from the 2nd, an empty string from the 3rd, and pqr uvw from the 4th. The 2nd hyphen is optional. abc is a static string; it has to be there. I've tried the following regex:
/^(?:abc) - (.*)[^ -]?/
But it gives me following output,
xyz
pqr - xyz
- xyz
pqr uvw - xyz
I don't need the last part in the second string. I'm using perl for scripting. Can it be done via regex?
Note that the (.*) part is a greedily quantified dot: it grabs any 0+ chars other than line break chars, as many as possible, up to the end of the line. The [^ -]? that follows can match an empty string because of the ? quantifier (1 or 0 repetitions), so it matches the empty string at the end of the line. Thus, the pqr - xyz output for abc - pqr - xyz is exactly what the regex engine is expected to return.
You need to use a more restrictive pattern here. E.g.
/^abc\h*-\h*((?:[^\s-]+(?:\h+[^\s-]+)*)?)/
See the regex demo.
Details
^ - start of a string
abc - an abc
\h*-\h* - a hyphen enclosed with 0+ horizontal whitespaces
((?:[^\s-]+(?:\h+[^\s-]+)*)?) - Group 1 capturing an optional occurrence of
[^\s-]+ - 1 or more chars other than whitespace and -
(?:\h+[^\s-]+)* - zero or more repetitions of
\h+ - 1+ horizontal whitespaces
[^\s-]+ - 1 or more chars other than whitespace and -
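The restrictive pattern can be exercised against the four sample strings in Python. This is a sketch under the assumption that [ \t] is an acceptable replacement for \h, which Python's re module lacks; middle_text is a hypothetical helper name.

```python
import re

# [ \t] replaces Perl's \h; otherwise the pattern is the one described above.
PATTERN = re.compile(r"^abc[ \t]*-[ \t]*((?:[^\s-]+(?:[ \t]+[^\s-]+)*)?)")


def middle_text(line):
    """Return the words between the first hyphen and the next one (may be empty)."""
    match = PATTERN.match(line)
    return match.group(1) if match else None


for line in ("abc - xyz", "abc - pqr - xyz", "abc - - xyz", "abc - pqr uvw - xyz"):
    print(repr(middle_text(line)))
```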
You could use ^[^-]*-\s*\K[^\s-]*.
Here's how it works:
^ # Matches at the beginning of the line (in multiline mode)
[^-]* # Matches every non - characters
- # Followed by -
\s* # Matches every spacing characters
\K # Reset match at current position
[^\s-]* # Matches every non-spacing or - characters
Update for multiple enclosed words: ^[^-]*-\s*\K[^\s-]*(?:\s*[^\s-]+)*
Last part (?:\s*[^\s-]+)* checks for existence of any other word preceded by space(s).
You could use split:
$answer = (split / \- /, $t)[1];
Where $t is the text string and you want the 2nd field of the split (i.e. index [1], since indexing starts from 0). This works for everything except abc - - xyz: if the separator is " - ", the string needs 2 spaces between the hyphens for the middle field to come back empty. If abc - - xyz is the correct input, you can run this before the split so that all cases work:
$t =~ s/\- \-/-  -/;
It simply inserts an extra space so it'll match " - " twice with nothing in-between.
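The same two steps, widening "- -" and then splitting on " - ", look like this in Python. It is a rough equivalent for illustration only; second_field is a hypothetical name, and the function assumes the separator is present (otherwise indexing [1] raises IndexError).

```python
def second_field(text):
    # Widen "- -" to "-  -" so that " - " occurs twice with nothing in between.
    normalized = text.replace("- -", "-  -")
    return normalized.split(" - ")[1]


for line in ("abc - xyz", "abc - pqr - xyz", "abc - - xyz", "abc - pqr uvw - xyz"):
    print(repr(second_field(line)))
```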
Can it be done via regex?
Yes, with three simple regexes: - and ^\s+ and \s+$.
use strict;
use warnings;
use 5.020;
use autodie;
use Data::Dumper;
open my $INFILE, '<', 'data.txt';
my @results = map {
    (undef, my $target) = split /-/, $_, 3;
    $target =~ s/^\s+//; #remove leading spaces
    $target =~ s/\s+$//; #remove trailing spaces
    $target;
} <$INFILE>;
close $INFILE;
say Dumper \@results;
--output:--
$VAR1 = [
'xyz',
'pqr',
'',
'pqr uvw'
];
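A Python counterpart of the split-then-trim approach is shown below as a sketch. The function name field_after_first_hyphen is hypothetical, the lines are given inline instead of being read from data.txt, and maxsplit=2 is assumed to mirror Perl's limit of 3 fields.

```python
def field_after_first_hyphen(line):
    # maxsplit=2 yields at most three parts, like split /-/, $_, 3 in Perl.
    parts = line.split("-", 2)
    # strip() removes leading and trailing whitespace, including the newline.
    return parts[1].strip()


lines = ["abc - xyz\n", "abc - pqr - xyz\n", "abc - - xyz\n", "abc - pqr uvw - xyz\n"]
print([field_after_first_hyphen(line) for line in lines])
```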

How to Regex and extract even new line until a match

I have used a regex to successfully extract everything right after "Abc 123", but it doesn't extract anything from the following lines.
Is there any way I can use regex to extract the following:
"Abc 123 def
ghi
jkl"
"Abc 123 def ghi jkl mno"
"Abc 123 def ghi jkl
mno"
I am using Regex in Talend.
I think you want to extract substrings that start at the beginning of a line with 1+ word chars, then a whitespace, then 1 or more digits, and that span across multiple lines up to the next occurrence of the same pattern.
You may use the following regex (note the flags and notation may differ depending on the language you are using):
/^(\w+)\s(\d+)(.*(?:\r?\n(?!\w+\s\d).*)*)/gm
See the regex demo.
Details:
^ - start of a line
(\w+) - Group 1: one or more word chars
\s - 1 whitespace
(\d+) - Group 2: one or more digits
(.*(?:\r?\n(?!\w+\s\d).*)*) - Group 3:
.* - any 0+ chars other than line break chars
(?:\r?\n(?!\w+\s\d).*)* - zero or more sequences of:
\r?\n - a line break...
(?!\w+\s\d) - that is not followed with 1+ word chars, whitespace, 1+ digits
.* - any 0+ chars other than line break chars
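To see how the negative lookahead stops each record at the next "word, whitespace, digit" line, the pattern can be run in Python with the MULTILINE flag. This is a sketch: the sample text is assembled from the strings in the question, and the names RECORD and records are made up.

```python
import re

# re.MULTILINE plays the role of the m flag; finditer replaces the g flag.
RECORD = re.compile(r"^(\w+)\s(\d+)(.*(?:\r?\n(?!\w+\s\d).*)*)", re.MULTILINE)

text = (
    "Abc 123 def\nghi\njkl\n"
    "Abc 123 def ghi jkl mno\n"
    "Abc 123 def ghi jkl\nmno"
)


def records(source):
    """Return (word, number, rest) tuples, where rest may span several lines."""
    return [m.groups() for m in RECORD.finditer(source)]


for word, number, rest in records(text):
    print(word, number, repr(rest))
```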
(\w)+\s(\d+)((.|\R)+) is what you want, so after escaping it for a Java string literal it will be:
(\\w)+\\s(\\d+)((.|\\R)+).
\R is a linebreak matcher available in Java regex since Java 8. It matches any line break sequence, including both \r\n and \n.
If you only allow a single linebreak:
(\w)+\s(\d+)((.+)(\R.+){0,1})
I think you should specify your desired output in more detail, but from this answer you can learn how to match across multiple lines or across at most 2 lines.

regex to get the last item after space

How to get the last item by regex?
"Read the information failed.
111 a bcd
SAM Error Log not up supported"
I did this
111\s(.*)$
but it gives me
output = "a bcd sam"
But I want the regex output for the line that starts with 111 to be
output = "sam" // for the line starts with 111
Also, how can I change it if there is a space before 111?
you can test this at https://regex-golang.appspot.com/assets/html/index.html
Note that 111\s(.*)$ matches 111 anywhere inside the string (the first occurrence) and then captures into Submatch 1 any 0+ characters up to the end of the string.
If there is a space before the last sam, you may use
^111.*\s(\S+)$
Pattern explanation:
^ - start of string
111 - a literal substring 111
.* - any characters, 0 or more, as many as possible up to the last...
\s - whitespace
(\S+) - Submatch 1 capturing one or more non-whitespace characters
$ - end of string.
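The single-line pattern behaves the same way in Python, which makes it easy to try out. A minimal sketch, assuming each line is tested on its own; last_token is a hypothetical helper name.

```python
import re


def last_token(line):
    """Return the last whitespace-separated token of a line starting with 111."""
    match = re.search(r"^111.*\s(\S+)$", line)
    return match.group(1) if match else None


print(last_token("111 a bcd sam"))
print(last_token("Read the information failed."))
```

Because .* is greedy, the engine backtracks from the end of the line to the last whitespace, so only the final token lands in the group.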
If you want to get the line that starts with 111 (and any leading whitespace is allowed) and has some whitespace after which your submatch is located, you may consider either
(?m)^\s*111[^\r\n]*\s(\S+)$
(a . is replaced with [^\r\n] so that the match can never run across a CR or LF; in Go's regexp package a dot does not match \n unless the s flag is set, but it does match \r), or - to make sure you only match horizontal whitespace:
(?m)^[^\S\r\n]*111[^\r\n]*[^\S\r\n](\S+)$
Explanation:
(?m)^ - start of the line (due to the (?m) MULTILINE modifier, the ^ now matches a line start and $ will match the line end)
[^\S\r\n]* - zero or more whitespaces except LF and CR (=horizontal whitespace)
111 - a literal 111
[^\r\n]* - any 0+ characters other than CR and LF as many as possible up to the last....
[^\S\r\n] - horizontal space
(\S+) - Submatch 1 capturing 1+ non-whitespace chars
$ - end of line (prepend with [^\S\r\n]* or [^\S\n]* to allow trailing horizontal whitespace)
Here is a Go demo:
package main
import (
	"fmt"
	"regexp"
)
func main() {
	// The two 111 lines start with a space to show that leading whitespace is allowed.
	s := `Read SMART Log Directory failed.
 111 a bcd sam
 111 sam
SMART Error Log not supported`
	rex := regexp.MustCompile(`(?m)^[^\S\r\n]*111[^\r\n]*[^\S\r\n](\S+)$`)
	fmt.Printf("%#v\n", rex.FindAllStringSubmatch(s, -1))
}
Output: [][]string{[]string{" 111 a bcd sam", "sam"}, []string{" 111 sam", "sam"}}
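The multiline pattern is not specific to Go; Python's re module accepts it unchanged, including the inline (?m) flag and the [^\S\r\n] class. The sketch below reuses the sample text with a leading space on the 111 lines; LINE and last_tokens are hypothetical names.

```python
import re

LINE = re.compile(r"(?m)^[^\S\r\n]*111[^\r\n]*[^\S\r\n](\S+)$")

text = (
    "Read SMART Log Directory failed.\n"
    " 111 a bcd sam\n"
    " 111 sam\n"
    "SMART Error Log not supported"
)


def last_tokens(source):
    # With a single capturing group, findall returns the captured strings.
    return LINE.findall(source)


print(last_tokens(text))
```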
This should work for you:
\s(\w+)$ // The output will be `sam`
This means capture the last string ($) after a whitespace.
You can use this:
^\s*1{3}.*\s(\S+)$
^ start of line
\s* 0 or more occurrences of space in the beginning
1{3} followed by three ones (i.e. 111)
.* followed by anything
\s followed by space
(\S+)$ ending with non-space characters. First capture group.