Replacing newline after a pattern in unix - regex

I want to replace a newline with space after a pattern. For example my text is:
1.
good movie
(2006)
This is a world class movie for music.
Dir:
abc
With:
lan
,
cer
,
cro
Comedy
|
Drama
|
Family
|
Musical
|
Romance
120 mins.
53,097
I want above text to become something like this
1. good movie (2006) This is a world class movie for music. Dir: abc With: lan, cer, cro comedy | Drama | Family | Musical | Romance 120 mins

After the question update, the requirements for the solution changed:
cat test.txt | tr '\n' ' ' | perl -ne 's/(?<!\|) ([A-Z])/\n\1/g; print' | sed 's/ ,/,/g' | sed 's/ \([0-9]\+\)/\n\1/g'; echo
output:
1. good movie (2006)
This is a world class movie for music.
Dir: abc
With: lan, cer, cro
Comedy | Drama | Family | Musical | Romance
120 mins.
Explanation:
First I replace all newline characters using tr.
Second, I replace the space before every capital letter with a newline,
unless that space is preceded by a pipe symbol ("|").
The third one corrects the comma spacings.
The last moves the duration declaration to a new line
The echo at the very end is to append a 'newline' to the output.
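The first and third steps can be tried in isolation on a shortened sample. This is only a minimal sketch: it swaps in paste -s for tr (my own substitution, chosen because it leaves no trailing space), and the sample lines are a subset of the question's input.

```shell
# Shortened sample taken from the question's input.
sample=$(printf '%s\n' 'Dir:' 'abc' 'With:' 'lan' ',' 'cer')

# paste -s joins all lines into one, using a space as the delimiter;
# sed then removes the space left in front of each comma.
joined=$(printf '%s\n' "$sample" | paste -sd ' ' - | sed 's/ ,/,/g')

printf '%s\n' "$joined"
```

The line-splitting steps (perl and the last sed) would then be applied to the joined line as in the pipeline above.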
Deprecated:
Building on kpie's comment, I suggest the following solution:
cat test.txt | sed ':a;N;$!ba;s/\n//g' | sed 's/\([A-Z]\)/\n\1/g'
I pasted your input into test.txt.
The first sed replacement is explained here: https://stackoverflow.com/a/1252191/1863086
The second one inserts a newline before every capital letter.
EDIT:
Another possibility using tr:
cat test.txt | tr -d '\n' | sed 's/\([A-Z]\)/\n\1/g'; echo

Related

Filtering matched content

I want to Filter all content after match with the content and bring the first value after the "."
I have an output something like this:
Output:
product: 13.6.0.35_0
More specifically, I need only the first two digits and the first digit after the dot. The values here are just an example, so the answer should focus on the method of filtering the content rather than on these specific values.
Expected:
13.6
I tried something like:
echo "product: 13.6.0.35_0" | grep -ow '\w*13\w*'
If you need to use grep with the current logic, you can use
echo "product: 13.6.0.35_0" | grep -ow '13\.[0-9]*' | head -1
where 13\.[0-9]* matches 13, . and zero or more digits (as whole word due to w option) and head -1 gets the first match.
You may also use sed or awk:
sed -En 's/.* ([0-9]+\.[0-9]+).*/\1/p' <<< "product: 13.6.0.35_0"
awk -F'[[:space:].]' '{print $2"."$3}' <<< "product: 13.6.0.35_0"
See the online demo.
The sed command matches any text up to space, then matches the space and captures the two subsequent dot-separated numbers into Group 1 (\1) and then the rest of the line is matched and replaced with Group 1 value that is printed (as the default line output is suppressed with -n).
In the awk command, -F'[[:space:].]' sets the field separator to whitespace or a dot, and the {print $2"."$3} part prints the second and third field values joined with a dot.
A pure shell solution using the builtin read, parameter expansion, and curly braces for command grouping:
echo "product: 13.6.0.35_0" | { read -r _ value; echo "${value%.*.*}" ; }
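The expansion above removes exactly two trailing dot-separated components, so it relies on the version having four parts. A hedged variant that keeps the first two components regardless of how many follow is sketched below. The function name major_minor is hypothetical, and the sketch assumes the input always contains ": " and at least one dot.

```shell
# Hypothetical helper: print the first two dot-separated components
# of the value that follows "label: ".
major_minor() {
  local value="${1#*: }"      # drop everything up to and including ": "
  local major="${value%%.*}"  # text before the first dot
  local rest="${value#*.}"    # text after the first dot
  local minor="${rest%%.*}"   # text before the next dot, if any
  printf '%s.%s\n' "$major" "$minor"
}

major_minor 'product: 13.6.0.35_0'
```

Only parameter expansion is used, so no external process is started.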
You can also use cut:
echo 'product: 13.6.0.35_0' | cut -d ' ' -f2 | cut -d '.' -f1-2
13.6
I reached the expected output, it's simple but it works:
var=$(echo "product: 13.6.0.35_0" | grep -Eo '[[:digit:]]+' | sed -n 1,2p)
echo ${var} | sed 's/ /./g'

How to shorten a regular expression?

First off, I'm relatively new to regular expressions: I've built a regex that I'm using with sed that works fine for me, it looks like:
sed 's/^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9][0-9][0-9] | info | tst.33.12.carmen | !: //g' but I'm pretty sure all the repetitive character occurrences could be simplified. How would I do this?
I want to replace:
20180630 180212.407107 | info | tst.33.12.carmen | !: from a line of text (timestamp in the front could be any numbers, strings behind the first '|' are constant)
EDIT: Since OP has put sample of input now so adding this solution.
sed -E 's/^[0-9]{8} [0-9]{6}\.[0-9]{6} \| info \| tst\.[0-9]{2}\.[0-9]{2}\.carmen \| \!:$//' Input_file
Test of code's working:
Let's say following is the Input_file:
cat Input_file
20180630 180212.407107 | info | tst.33.12.carmen | !:
fdfjwhfwifrwvf
vwkdnvkwkvwnvwv
20180630 180212.407107 | info | tst.33.12.carmen | !:
dwbvwbvwvbb
After running the above command, the output is as follows:
sed -E 's/^[0-9]{8} [0-9]{6}\.[0-9]{6} \| info \| tst\.[0-9]{2}\.[0-9]{2}\.carmen \| \!:$//' Input_file
fdfjwhfwifrwvf
vwkdnvkwkvwnvwv
dwbvwbvwvbb
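With sed's -E option you could use something like the following. Fair warning: it is adapted from your solution and was never tested, since no samples were provided in your post. Note that with -E a literal | must be escaped (otherwise it means alternation), the literal dots should be escaped too, and each fractional group has six digits, as in your original.
sed -E 's/^[0-9]{8} [0-9]{6}\.[0-9]{6} \| info \| tst\.33\.12\.carmen \| !: //g'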
If you don't care about matching the exact format of your prefix, but just want to accept some combination of digits, dots and spaces, you can simplify the first part to:
[ .0-9]*
The complete sed expression then looks like:
sed 's/^[ .0-9]*| info | tst\.[0-9]*\.[0-9]*\.carmen | !:$//' file
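For the case described in the question, where the prefix is followed by more text on the same line, a minimal sketch with interval expressions might look like this. The trailing "payload text" is a made-up sample, not data from the question.

```shell
# Made-up line: the constant prefix from the question followed by sample text.
line='20180630 180212.407107 | info | tst.33.12.carmen | !: payload text'

# {8} and {6} replace the repeated [0-9] atoms; with -E the literal
# pipes and dots are escaped so they are not treated as operators.
stripped=$(printf '%s\n' "$line" |
  sed -E 's/^[0-9]{8} [0-9]{6}\.[0-9]{6} \| info \| tst\.33\.12\.carmen \| !: //')

printf '%s\n' "$stripped"
```

There is no $ anchor here, so only the prefix is removed and the rest of the line is kept.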

Matching end of line in GREP

I have this piece of bash script which is supposed to match words that end with an 'a'. However, when I run this, I get no output, despite the fact that my text file has words that end with 'a'.
cat $1 | cut -d'|' -f3 | cut -d',' -f2 | sed 's/^ //' | egrep -i "a$"
If I remove the '$' it shows output, but with '$' it returns nothing. It still works, just doesn't match.
Would appreciate some help with this. Thanks.
A sample of the file
MTNG1511|5013566|Xin, Mackenzie Darren
MTNG9902|5079970|Park, Xue Hannah Vanessa
MTNG1511|5059072|Chung, Michael Jia Tianyu
MTNG1521|5060774|Lim, Stephanie Lauren
MTNG1531|5060774|Lim, Stephanie Lauren
MTNG2521|5060774|Lim, Stephanie Lauren
MTNG9020|5060538|Bi, Samuel Shiyu
MTNG9021|5060538|Bi, Samuel Shiyu
MTNG9902|5072116|Hu, Kai Zhi Patrick
Output should be
Park, Xue Hannah Vanessa
Since it ends with an 'a'
There's probably extra whitespace at the end of your word.
Try adding
sed 's/[ \t]*$//'
to remove the whitespace -- or else change your grep to allow for whitespace at the end.
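The effect can be reproduced on two made-up lines, one of which carries trailing spaces. This is a sketch under the assumption that trailing whitespace is indeed the cause; [[:space:]] is used instead of [ \t] so that a stray carriage return would be removed as well.

```shell
# Two sample names; the first one ends in "a" followed by two spaces.
input=$(printf '%s\n' 'Park, Xue Hannah Vanessa  ' 'Hu, Kai Zhi Patrick')

# Count matches of 'a$' before and after trimming trailing whitespace.
# "|| true" keeps the script going when grep finds nothing.
without_trim=$(printf '%s\n' "$input" | grep -c 'a$' || true)
with_trim=$(printf '%s\n' "$input" | sed 's/[[:space:]]*$//' | grep -c 'a$' || true)

printf 'without trim: %s, with trim: %s\n' "$without_trim" "$with_trim"
```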
This is very simple using grep
grep -o '[^\|]\+$' < "$1" | grep 'a\s*$'
Output
$ bash example file.txt
Park, Xue Hannah Vanessa
$
[^\|]\+$ matches one or more characters that aren't | up to the end of the line.
a\s*$ matches an a as the last character, allowing optional whitespace before the end of the line.

Match last name using awk

Say there's a file like this
1 | John Smith | 70000
2 | Al McSmith | 60000
If I use
awk -F"|" '$2~/Smith/' file
both rows are matched.
Is there a way to only match John Smith? (USING AWK ONLY)
EDIT: I'm trying to match the people that have Smith as their last name, without matching McSmith, or O'Smith, etc.
this may work for you:
awk -F'|' '$2~/ Smith\s*$/' file
it won't match:
fooSmith
Smithfoo
foo Smith is middlename
Just stick a Space before Smith:
awk -F'|' '$2~/ Smith/' testfile
If there is a name like John Smitherton in there, then stick a space after as well (since it looks like you have <space><delim><space> between each field). Otherwise you can get a little fancier with the regex, but your space padding is pretty useful here.
Another solution using grep
grep -E "[^|]*\|[^|]*\<Smith\>"
explanation
[^|] match any character except |
\| match with |
\< \> start and end of word
I ran a test. I created a file, test.in, with your content:
1 | John Smith | 70000
2 | Al McSmith | 60000
Then tried another expression:
awk -F'|' '{print $2~/\sSmith\s/}' test.in
It prints:
1
0
So, 1 for Smith, 0 for McSmith.
[UPD] \s is a gawk-specific extension that matches a whitespace character.
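If gawk's \s is not available, a more portable sketch is to trim the padding from the second field and compare its last word exactly. The function name match_last_name is hypothetical, and the sketch assumes the last name is the final space-separated word of the field.

```shell
# Hypothetical helper: print rows whose name field ends in exactly the given word.
match_last_name() {
  awk -F'|' -v name="$1" '{
    field = $2
    gsub(/^[ \t]+|[ \t]+$/, "", field)  # trim the padding around the field
    n = split(field, words, " ")        # split the name into words
    if (words[n] == name) print         # exact match on the last word
  }'
}

result=$(printf '%s\n' '1 | John Smith | 70000' '2 | Al McSmith | 60000' |
  match_last_name Smith)
printf '%s\n' "$result"
```

Because the comparison is a string equality rather than a regex, McSmith and Smitherton are both rejected.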

How can I output only captured groups with sed?

Is there a way to tell sed to output only captured groups?
For example, given the input:
This is a sample 123 text and some 987 numbers
And pattern:
/([\d]+)/
Could I get only 123 and 987 output in the way formatted by back references?
The key to getting this to work is to tell sed to exclude what you don't want to be output as well as specifying what you do want. This technique depends on knowing how many matches you're looking for. The grep command below works for an unspecified number of matches.
string='This is a sample 123 text and some 987 numbers'
echo "$string" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
This says:
don't default to printing each line (-n)
exclude zero or more non-digits
include one or more digits
exclude one or more non-digits
include one or more digits
exclude zero or more non-digits
print the substitution (p) (on one line)
In general, in sed you capture groups using parentheses and output what you capture using a back reference:
echo "foobarbaz" | sed 's/^foo\(.*\)baz$/\1/'
will output "bar". If you use -r (-E for OS X) for extended regex, you don't need to escape the parentheses:
echo "foobarbaz" | sed -r 's/^foo(.*)baz$/\1/'
There can be up to 9 capture groups and their back references. The back references are numbered in the order the groups appear, but they can be used in any order and can be repeated:
echo "foobarbaz" | sed -r 's/^foo(.*)b(.)z$/\2 \1 \2/'
outputs "a bar a".
If you have GNU grep:
echo "$string" | grep -Po '\d+'
It may also work in BSD, including OS X:
echo "$string" | grep -Eo '\d+'
These commands will match any number of digit sequences. The output will be on multiple lines.
or variations such as:
echo "$string" | grep -Po '(?<=\D )(\d+)'
The -P option enables Perl Compatible Regular Expressions. See man 3 pcrepattern or man 3 pcresyntax.
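Since grep -o prints each match on its own line, the matches can be joined back onto one line if needed. A minimal sketch, using [0-9] rather than \d so that it does not depend on -P, and paste -s as one possible way to join:

```shell
string='This is a sample 123 text and some 987 numbers'

# grep -Eo prints every run of digits on its own line;
# paste -s joins those lines with a single space.
numbers=$(printf '%s\n' "$string" | grep -Eo '[0-9]+' | paste -sd ' ' -)

printf '%s\n' "$numbers"
```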
Sed has up to nine remembered patterns but you need to use escaped parentheses to remember portions of the regular expression.
See here for examples and more detail
you can use grep
grep -Eow "[0-9]+" file
run(s) of digits
This answer works with any count of digit groups. Example:
$ echo 'Num123that456are7899900contained0018166intext' \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Expanded answer.
Is there any way to tell sed to output only captured groups?
Yes, replace all the text with the capture group:
$ echo 'Number 123 inside text' \
| sed 's/[^0-9]*\([0-9]\{1,\}\)[^0-9]*/\1/'
123
s/[^0-9]* # several non-digits
\([0-9]\{1,\}\) # followed by one or more digits
[^0-9]* # and followed by more non-digits.
/\1/ # gets replaced only by the digits.
Or with extended syntax (fewer backslashes, and it allows the use of +):
$ echo 'Number 123 in text' \
| sed -E 's/[^0-9]*([0-9]+)[^0-9]*/\1/'
123
To avoid printing the original text when there is no number, use:
$ echo 'Number xxx in text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1/p'
(-n) Do not print the input by default.
(/p) print only if a replacement was done.
And to match several numbers (and also print them):
$ echo 'N 123 in 456 text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1 /gp'
123 456
That works for any count of digit runs:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Which is very similar to the grep command:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" | grep -Po '\d+'
123
456
7899900
0018166
About \d
and pattern: /([\d]+)/
Sed does not recognize the '\d' shortcut syntax. The ASCII range [0-9] used above is not exactly equivalent. The only alternative is to use a character class: '[[:digit:]]'.
The selected answer uses such character classes to build a solution:
$ str='This is a sample 123 text and some 987 numbers'
$ echo "$str" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
That solution only works for (exactly) two runs of digits.
Of course, as the answer is being executed inside the shell, we can define a couple of variables to make such answer shorter:
$ str='This is a sample 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D+($d+)$D*/\1 \2/p"
But, as has been already explained, using a s/…/…/gp command is better:
$ str='This is 75577 a sam33ple 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D*/\1 /gp"
75577 33 123 987
That will cover both repeated runs of digits and writing a short(er) command.
Give up and use Perl
Since sed does not cut it, let's just throw in the towel and use Perl; at least it is LSB, while grep GNU extensions are not :-)
Print the entire matching part, no matching groups or lookbehind needed:
cat <<EOS | perl -lane 'print m/\d+/g'
a1 b2
a34 b56
EOS
Output:
12
3456
Single match per line, often structured data fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*/$1/g'
a1 b2
a34 b56
EOS
Output:
1
34
With lookbehind:
cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/'
a1 b2
a34 b56
EOS
Multiple fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*?b(\d+).*/$1 $2/g'
a1 c0 b2 c0
a34 c0 b56 c0
EOS
Output:
1 2
34 56
Multiple matches per line, often unstructured data:
cat <<EOS | perl -lape 's/.*?a(\d+)|.*/$1 /g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
34 78
With lookbehind:
cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
3478
I believe the pattern given in the question was by way of example only, and the goal was to match any pattern.
If you have a sed with the GNU extension allowing insertion of a newline in the pattern space, one suggestion is:
> set string = "This is a sample 123 text and some 987 numbers"
>
> set pattern = "[0-9][0-9]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
123
987
> set pattern = "[a-z][a-z]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
his
is
a
sample
text
and
some
numbers
These examples are with tcsh (yes, I know it's the wrong shell) with CYGWIN. (Edit: For bash, remove set, and the spaces around =.)
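A bash rendering of the first example might look like the sketch below. It assumes GNU sed, since it relies on \n in the replacement producing a newline; the variable names follow the tcsh version.

```shell
string='This is a sample 123 text and some 987 numbers'
pattern='[0-9][0-9]*'

# First sed: put every match on its own line (GNU sed turns \n into a newline).
# Second sed: print only the lines that contain a match.
matches=$(printf '%s\n' "$string" |
  sed "s/$pattern/\n&\n/g" |
  sed -n "/$pattern/p")

printf '%s\n' "$matches"
```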
Try
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
I got this under cygwin:
$ (echo "asdf"; \
echo "1234"; \
echo "asdf1234adsf1234asdf"; \
echo "1m2m3m4m5m6m7m8m9m0m1m2m3m4m5m6m7m8m9") | \
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
1234
1234 1234
1 2 3 4 5 6 7 8 9
$
You need to match the whole line in order to print only the group, which you're doing in the second command, but you don't need to capture the leading wildcard. This will work as well:
echo "/home/me/myfile-99" | sed -r 's/.*myfile-(.*)$/\1/'
It's not what the OP asked for (capturing groups) but you can extract the numbers using:
S='This is a sample 123 text and some 987 numbers'
echo "$S" | sed 's/ /\n/g' | sed -r '/([0-9]+)/ !d'
Gives the following:
123
987
I want to give a simpler example on "output only captured groups with sed"
I have /home/me/myfile-99 and wish to output the serial number of the file: 99
My first try, which didn't work was:
echo "/home/me/myfile-99" | sed -r 's/myfile-(.*)$/\1/'
# output: /home/me/99
To make this work, we need to capture the unwanted portion in capture group as well:
echo "/home/me/myfile-99" | sed -r 's/^(.*)myfile-(.*)$/\2/'
# output: 99
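Combining this with -n and the p flag, as other answers here do, also suppresses lines that have no serial number at all. A small sketch; the second path is a made-up example, and [0-9]+ is used instead of .* on the assumption that the serial is numeric:

```shell
# One path with a serial number and one (made-up) path without.
serials=$(printf '%s\n' '/home/me/myfile-99' '/home/me/notes.txt' |
  sed -En 's/.*myfile-([0-9]+)$/\1/p')

printf '%s\n' "$serials"
```

Without -n and p, the non-matching path would be printed unchanged.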
*) Note that sed doesn't have \d
You can use ripgrep, which also seems to be a sed replacement for simple substitutions, like this
rg '(\d+)' -or '$1'
where ripgrep uses -o (--only-matching) and -r (--replace) to output only the first capture group with $1 (quoted to avoid interpretation as a variable by the shell). The output appears two times because there are two matches.