Linux shell extracting substring between matching patterns

Let's say I have a string poskek|gfgfd|XLSE|a1768|d234|uijjk and I want to extract just the LSE part.
I only know that |X comes directly before LSE, and | comes directly after it.

The other answer using sed should work, but I always find sed to be a bit awkward for regex selection, as it's really intended for replacement (hence why either side of the pattern needs to be flanked with .* and the part you actually want needs to be in parentheses). Here's a solution using grep:
grep -Po '\|X\K[^|]+'
-P signals grep to use Perl's regex engine which is more advanced
-o only prints the matching part of the line
\|X match a literal vertical bar and a capital X
\K forget what has currently been matched (do not include it in the final output)
[^|]+ one or more characters other than vertical bars
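To see the command in action, here is a small sketch that wraps it in a function. The name extract_after_x is hypothetical, and -P assumes GNU grep built with PCRE support. If a line contains several |X fields, -o prints each match on its own line.

```shell
# Hypothetical helper: print whatever sits between "|X" and the next "|".
# Assumes GNU grep with PCRE support (-P).
extract_after_x() {
    printf '%s\n' "$1" | grep -Po '\|X\K[^|]+'
}

extract_after_x 'poskek|gfgfd|XLSE|a1768|d234|uijjk'   # prints LSE
```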

As a pure bash solution, please try:
str='poskek|gfgfd|XLSE|a1768|d234|uijjk'
ext=${str#*|X}
ext=${ext%%|*}
echo "$ext"
If regex matching with [[ ... =~ ... ]] is available, the following also works:
if [[ $str =~ .*\|X([^|]+) ]]; then
echo "${BASH_REMATCH[1]}"
fi
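As a rough sketch of how the two expansions cooperate, the function below (between_x is a hypothetical name) first checks that the marker exists, then applies the same # and %% steps. Note that # removes the shortest matching prefix, so this picks the first |X field, whereas the =~ version with its leading .* picks the last one.

```shell
# Hypothetical helper: print the text between the first "|X" and the next "|".
# Returns 1 when the input has no "|X" at all.
between_x() {
    case $1 in
        *'|X'*) ;;          # marker present, carry on
        *) return 1 ;;      # no "|X" in the input
    esac
    rest=${1#*|X}           # drop the shortest prefix ending in "|X"
    rest=${rest%%|*}        # drop the longest suffix starting at "|"
    printf '%s\n' "$rest"
}

between_x 'poskek|gfgfd|XLSE|a1768|d234|uijjk'   # prints LSE
```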

echo 'poskek|gfgfd|XLSE|a1768|d234|uijjk' | sed -n 's/.*|X\([^|]\+\).*/\1/p'
That ought to do the trick.
Explained:
sed -n will not print anything unless specified
s/ - search and replace
.*|X - match everything up to and including |X
\([^|]\+\) - capture multiple (at least one) character that isn't a |
.* - match the rest of the text (just to "eat it up")
/\1/p - Replace all matched text with the first capture, and print
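The combination of -n and the p flag also means that lines without a match are dropped entirely. A minimal sketch (x_fields is a hypothetical name; [^|][^|]* is used here as a portable spelling of the GNU-specific [^|]\+):

```shell
# Hypothetical helper: read lines on stdin, print the field after "|X"
# for lines that have one, and nothing for lines that do not.
x_fields() {
    sed -n 's/.*|X\([^|][^|]*\).*/\1/p'
}

printf '%s\n' 'no marker on this line' 'poskek|gfgfd|XLSE|a1768' | x_fields
# prints only: LSE
```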

For this particular case, you could do the rather unconventional:
awk '$1=="X"{$1="";print}' FS= OFS= RS=\|

try this
echo 'poskek|gfgfd|XLSE|a1768|d234|uijjk' |
awk -F "|" '{for(i=1;i<=NF;++i) printf "%s", (substr($i,1,1)=="X"?substr($i,2):"")}'
where
-F sets the field separator to '|'
NF is number of fields

Related

find recurring pattern with `sed`

I am using GNU bash 4.3.48
I expected that
echo "23S62M1I19M2D" | sed 's/.*\([0-9]*M\).*/\1/g'
would output 62M19M... But it doesn't.
sed 's/\([0-9]*M\)//g' deletes ALL [0-9]*M and retrieves 23S1I2D. but the group \1 is not working as I thought it would.
sed 's/.*\([0-9]*M\).*/ \1 /g', retrieves M...
What am I doing wrong?
Thank you!

With your shown samples and with awk you could try following program.
echo "23S62M1I19M2D" |
awk '
{
val=""
while(match($0,/[0-9]+M/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
print val
}
'
Explanation: echo prints the value and pipes it to awk as standard input. Inside awk, match() looks for the regex /[0-9]+M/; the while loop appends each match to val and then cuts $0 down to the text after the match, so the next iteration finds the next occurrence. Once no match is left, val holds all matches concatenated and is printed once per input line.

This might work for you (GNU sed):
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//gp}' file
Surround the match by newlines and then remove non-matching parts.
Alternative, using grep and tr:
grep -o '[0-9]*M' file | tr -d '\n'
N.B. tr removes all newlines (including the last one). To restore the last newline, use:
grep -o '[0-9]*M' file | tr -d '\n' | paste
The alternate solution will concatenate all results into a single line. To achieve the same result with the first solution use:
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//g;H};${x;s/\n//gp}' file

The problem is that .* is greedy. Since only the M is mandatory ([0-9]* can match nothing), the first .* consumes everything up to the last M, so the whole string is matched, only M is captured, and M is all that remains after replacing with the \1 backreference.
That means you can't easily do this with sed. It is much easier with Perl, which supports matching and skipping a pattern:
#!/bin/bash
perl -pe 's/\d+M(*SKIP)(*F)|.//g' <<< "23S62M1I19M2D"
The pattern matches:
\d+M(*SKIP)(*F) - one or more digits, M, and then the match is omitted and the next match is searched for from the failure position
|. - or matches any char other than a line break char.
Or simply match all occurrences and concatenate them:
perl -lane 'BEGIN{$a="";} while (/\d+M/g) {$a .= $&} END{print $a;}' <<< "23S62M1I19M2D"
All \d+M matches are appended to the $a variable which is printed at the end of processing the string.

Your substitution is probably working, but not substituting what you think it is.
In the substitution s/\(foo...\)/\1/, the \1 matches whatever \(...\) matches and captures, so your substitution is replacing foo... by foo...!
% echo "1234ABC" | sed 's/\([A-Z]\)/-\1-/'g
1234-A--B--C-
So you'll need to match more, but capture only a portion of the match. For example:
echo "23S62M1I19M2D" | sed 's/[0-9]*[A-LN-Z]*\([0-9]*M\)/\1/g'
62M19M2D
In the case of sed 's/.*\([0-9]*M\).*/\1/g' (did that appear in an edit to the question, or did I just miss it?), the .* matches ‘greedily’: it matches as much as it possibly can, thus including the digits before the M. In the example above, whenever the uncaptured part ends in a letter from [A-LN-Z], the digits that follow it can only be matched by the [0-9]* inside the capture.
Getting a clear idea of what ‘greedy’ means is really important when writing or interpreting regexps.
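A quick way to watch greediness at work is to bracket the capture and compare two patterns on the sample string. This is only an illustrative sketch; the variable names are made up for the demonstration.

```shell
s='23S62M1I19M2D'

# Greedy .* runs up to the last M; [0-9]* is then left matching nothing.
greedy=$(printf '%s\n' "$s" | sed 's/.*\([0-9]*M\).*/[\1]/')

# Requiring a non-digit right before the group leaves the digits to the group.
anchored=$(printf '%s\n' "$s" | sed 's/.*[^0-9]\([0-9]*M\).*/[\1]/')

printf '%s %s\n' "$greedy" "$anchored"   # prints [M] [19M]
```

Both commands still keep only one capture per line, which is why the other answers loop over the matches or delete the unwanted parts instead.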

If you know you will only encounter the suffixes S, M, I and D, an alternative approach would be explicitly deleting the combinations you don't want:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[SID]//g'
This gives the expected:
62M19M
Update: This variant produces the same output, but rejects all non-numeric, non-M suffixes:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[^0-9M]//g'

Excluding the first 3 characters of a string using regex

Given any string in bash, e.g flaccid, I want to match all characters in the string but the first 3 (in this case I want to exclude "fla" and match only "ccid"). The regex also needs to work in sed.
I have tried positive look behind and the following regex expressions (as well as various other unsuccessful ones):
^.{3}+([a-z,A-Z]+)
sed -r 's/(?<=^....)(.[A-Z]*)/,/g'
Google hasn't been very helpful as it only produce results like "get first 3 characters .."
Thanks in advance!

If you want to get all characters but the first 3 from a string, you can use cut:
str="flaccid"
cut -c 4- <<< "$str"
or bash substring expansion:
str="flaccid"
echo "${str:3}"
That will strip the first 3 characters out of your string.

You may just use a capturing group within an expression like ^.{3}(.*) / ^.{3}([a-zA-Z]+) and grab the ${BASH_REMATCH[1]} contents:
#!/bin/bash
text="flaccid"
rx="^.{3}(.*)"
if [[ $text =~ $rx ]]; then
echo "${BASH_REMATCH[1]}"
fi
In sed, you can also use capturing groups / backreferences to get what you need. To just remove the first 3 chars, a simple substitution is enough:
echo "flaccid" | sed 's/.\{3\}//'
The .\{3\} matches any 3 chars and removes them from the beginning only, since the g modifier is not used.
Both solutions above output ccid, i.e. everything except the first 3 chars.

Using sed, just remove them
echo string | sed 's/^...//g'

How is it that no-one has named the most simple and portable solution:
shell "Parameter expansions":
str="flaccid"
echo "${str#???}"
For a regex (bash):
$ str="flaccid"
$ regex='^.{3}(.*)$'
$ [[ $str =~ $regex ]] && echo "${BASH_REMATCH[1]}"
ccid
Same regex in sed:
$ echo "flaccid" | sed -E "s/$regex/\1/"
ccid
Or sed (Basic Regex):
$ echo "flaccid" | sed 's/^.\{3\}\(.*\)$/\1/'
ccid
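One detail worth knowing: the approaches differ on inputs shorter than three characters. ${str#???} leaves such a string untouched because the pattern does not match, while cut -c 4- prints an empty line. A small sketch (drop3 is a hypothetical name) that makes the short case explicit:

```shell
# Hypothetical helper: print the argument without its first 3 characters,
# or an empty line when the argument is shorter than that.
drop3() {
    case $1 in
        ???*) printf '%s\n' "${1#???}" ;;   # at least 3 chars: strip them
        *)    printf '\n' ;;                # shorter input: nothing is left
    esac
}

drop3 flaccid                  # prints ccid
short=ab
printf '%s\n' "${short#???}"   # prints ab, because the pattern does not match
```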

replace more than one special character with sed

I'm a newbie with regex, so sed gives me a headache.
I need help to replace all special characters from the given company names with "-".
So this is the given string:
FML Finanzierungs- und Mobilien Leasing GmbH & Co. KG
I want the result:
FML-Finanzierungs-und-Mobilien-Leasing-GmbH-&-Co-KG
I tried the following:
nr=$(echo "$name" | sed -e 's/ /-/g')
This replaces all spaces with -, but what is the right expression to replace the others? My own Google searches were not very successful.

That depends on what you consider to be a special character -- I say this because you appear to consider & a regular character but not ., which seems a bit odd. Anyway, I imagine something of the form
nr=$(echo "$name" | sed 's/[^[:alnum:]&]\+/-/g')
would serve you best. Here [^[:alnum:]&] matches any character that is not alphanumeric or &, and [^[:alnum:]&]\+ matches a sequence of one or more such characters, so the sed call replaces all such sequences in $name with a hyphen. If there are other characters that you consider regular, add them to the set. Note that the handling of umlauts and suchlike depends on your locale.
Also note that echo may cause trouble if $name begins with a hyphen (it could be parsed as options for echo), so if you can tether yourself to bash,
nr=$(sed 's/[^[:alnum:]&]\+/-/g' <<< "$name")
might be more robust.
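Put together as a sketch (slugify is a hypothetical name): printf sidesteps the leading-hyphen problem with echo, the doubled bracket expression is a portable spelling of \+, and the two extra expressions trim a hyphen left at either end, which is an addition not in the answer above.

```shell
# Hypothetical helper: collapse every run of characters that are neither
# alphanumeric nor "&" into one hyphen, then trim hyphens at the ends.
slugify() {
    printf '%s\n' "$1" |
        sed -e 's/[^[:alnum:]&][^[:alnum:]&]*/-/g' -e 's/^-//' -e 's/-$//'
}

slugify 'FML Finanzierungs- und Mobilien Leasing GmbH & Co. KG'
# prints FML-Finanzierungs-und-Mobilien-Leasing-GmbH-&-Co-KG
```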

Apparently you want to remove - and . and then replace spaces with -.
This would do it, by saying sed -e 'one thing' -e 'another thing':
$ echo "$name" | sed -e 's/[-.]//g' -e 's/ /-/g'
FML-Finanzierungs-und-Mobilien-Leasing-GmbH-&-Co-KG
Note we enclose within square brackets all the characters that we want to treat equally: [-.] means either - or . (inside brackets the dot is literal, so it needs no escaping; the hyphen goes first so it is not read as a range).

Does this help you:
awk -vOFS=- '{gsub(/[.-]/,"");$1=$1}1' <<< "$name"
FML-Finanzierungs-und-Mobilien-Leasing-GmbH-&-Co-KG
gsub(/[.-]/,"") removes . and -
-vOFS=- sets the output field separator to -
$1=$1 reconstruct the line so it uses new field separator
1 print the line.
To get it to a variable
nr=$(awk -vOFS=- '{gsub(/[.-]/,"");$1=$1}1' <<< "$name")

Try this way also
echo "$name" | sed 's/ \|- \|\. /-/g'
Output:
FML-Finanzierungs-und-Mobilien-Leasing-GmbH-&-Co-KG

pipe sed command to create multiple files

I need to get X to Y in the file with multiple occurrences, each time it matches an occurrence it will save to a file.
Here is an example file (demo.txt):
\x00START how are you? END\x00
\x00START good thanks END\x00
sometimes random things\x00\x00 inbetween it (ignore this text)
\x00START thats nice END\x00
And now after running a command each file (/folder/demo1.txt, /folder/demo2.txt, etc) should have the contents between \x00START and END\x00 (\x00 is null) in addition to 'START' but not 'END'.
/folder/demo1.txt should say "START how are you? ", /folder/demo2.txt should say "START good thanks".
So basically it should pipe "how are you?" and using 'echo' I can prepend the 'START'.
It's worth keeping in mind that I am dealing with a very large binary file.
I am currently using
sed -n -e '/\x00START/,/END\x00/ p' demo.txt > demo1.txt
but that's not working as expected (it's getting lines before the '\x00START' and doesn't stop at the first 'END\x00').

If you have GNU awk, try:
awk -v RS='\0START|END\0' '
length($0) {printf "START%s\n", $0 > ("folder/demo"++i".txt")}
' demo.txt
RS='\0START|END\0' defines a regular expression acting as the [input] Record Separator, which breaks the input file into records: the strings (byte sequences) between \0START and END\0 (\0 represents NUL (null char.) here).
Using a multi-character, regex-based record separator is NOT POSIX-compliant; GNU awk supports it (as does mawk in general, but seemingly not with NUL chars.).
Pattern length($0) ensures that the associated action ({...}) is only executed if the record is nonempty.
{printf "START%s\n", $0 > ("folder/demo"++i".txt")} outputs each nonempty record, preceded by "START", into file folder/demo{n}.txt, where {n} represents a sequence number starting with 1.

You can use grep for that:
grep -Po "START\s+\K.*?(?=END)" file
how are you?
good thanks
thats nice
Explanation:
-P To allow Perl regex
-o To extract only matched pattern
\K Discards everything matched so far from the reported match (used here in place of a positive lookbehind)
(?=something) Positive lookahead
EDIT: To match \00 as START and END may appear in between:
echo -e '\00START hi how are you END\00' | grep -aPo '\00START\K.*?(?=END\00)'
hi how are you
EDIT2: The grep solution only matches within a single line; for multi-line matches it's better to use perl instead. The syntax is very similar:
echo -e '\00START hi \n how\n are\n you END\00' | perl -ne 'BEGIN{undef $/ } /\A.*?\00START\K((.|\n)*?)(?=END)/gm; print $1'
hi
how
are
you
What's new here:
undef $/ Undefine the input record separator $/, which defaults to '\n'
(.|\n)* Dot matches almost any character, but it does not match \n, so we need to add it here.
/gm Modifiers: g for global, m for multi-line

I would translate the nulls into newlines so that grep can find your wanted text on a clean line by itself:
tr '\000' '\n' < yourfile.bin | grep "^START"
from there you can take it into sed as before.
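A minimal sketch of that idea carried through to numbered files. It assumes that no record contains a newline of its own and that START directly follows a NUL; split_records and the file names are hypothetical.

```shell
# Hypothetical helper: split_records INPUT OUTDIR
# Turns NULs into newlines, keeps lines of the form "START...END",
# drops the trailing END and writes record n to OUTDIR/demo<n>.txt.
split_records() {
    tr '\000' '\n' < "$1" |
        sed -n 's/^\(START.*\)END$/\1/p' |
        awk -v dir="$2" '{ out = dir "/demo" NR ".txt"; print > out; close(out) }'
}

# Build a small sample file and split it.
tmp=$(mktemp -d)
printf '\000START how are you? END\000\n\000START good thanks END\000\n' > "$tmp/demo.bin"
split_records "$tmp/demo.bin" "$tmp"
cat "$tmp/demo1.txt"   # prints "START how are you? " (with the trailing space)
```

For records that do span several lines, the GNU awk and perl answers above are the better fit, since they do not depend on line boundaries.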

How to use grep to get anything just after `name=`?

I’m stuck in trying to grep anything just after name=, include only spaces and alphanumeric.
e.g.:
name=some value here
I get
some value here
I’m totally newb in this, the following grep match everything including the name=.
grep 'name=.*' filename
Any help is much appreciated.

You want a positive lookbehind clause, such as:
grep -P '(?<=name=)[ A-Za-z0-9]*' filename
The -P makes grep use the Perl dialect, which is what provides lookbehind. You can also, as noted elsewhere, append the -o parameter to print out only what is matched. The part in brackets specifies that you want alphanumerics and spaces.
The advantage of using a positive lookbehind clause is that the "name=" text is not part of the match. If grep highlights matched text, it will only highlight the alphanumeric (and spaces) part. The -o parameter will also not display the "name=" part. And, if you transition this to another program like sed that might capture the text and do something with it, you won't be capturing the "name=" part, although you can also do that by using capturing parentheses.
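For illustration, here is the lookbehind combined with -o on two sample lines. This is a sketch assuming GNU grep with PCRE support; name_values is a hypothetical name. The match stops at the first character that is neither a space nor alphanumeric.

```shell
# Hypothetical helper: read lines on stdin and print only the value
# that follows "name=". Assumes GNU grep with PCRE support (-P).
name_values() {
    grep -Po '(?<=name=)[ A-Za-z0-9]*'
}

printf '%s\n' 'name=some value here' 'other=ignored' | name_values
# prints: some value here
```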

Try this:
sed -n 's/^name=//p' filename
It tells sed to print nothing (-n) by default, substitute your prefix with nothing, and print if the substitution occurs.
Bonus: if you really need it to only match entries with only spaces and alphanumerics, you can do that too:
sed -n 's/^name=\([ 0-9a-zA-Z]*$\)/\1/p' filename
Here we've added a pattern to match spaces and alphanumerics only until the end of the line ($), and if we match we substitute the group in parentheses and print.

gawk
echo "name=some value here" | awk -F"=" '/name=/ { print $2}'
or with bash
str="name=some value here"
IFS="="
set -- $str
echo "$2"
unset IFS
or
str="name=some value here"
str=${str/name=/}
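Note that ${str/name=/} is a bash extension and replaces the first name= found anywhere in the string. A hedged alternative is the POSIX prefix removal ${str#name=}, which only strips a leading name= and, unlike splitting on =, keeps any further = signs in the value. The sample string below is made up for the comparison.

```shell
str='name=key=some value'

# Prefix removal keeps everything after the leading "name=".
value=${str#name=}
printf '%s\n' "$value"                                    # prints key=some value

# Splitting on "=" only returns the second field.
second=$(printf '%s\n' "$str" | awk -F= '/^name=/ { print $2 }')
printf '%s\n' "$second"                                   # prints key
```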

grep does not extract like you expect. What you need is
grep "name=" file.txt | cut -d'=' -f2-

grep will print the entire line where it matches the pattern. To print only the pattern matched, use the grep -o option. You'll probably also need to use sed to remove the name= part of the match.
grep -o 'name=[0-9a-zA-Z ]*' myfile | sed 's/^name=//'

We Keep Coding
