awk/sed regex, extract a column that has the delimiter - regex

I have a file with this format:
two columns of numbers in the beginning and two columns of number in the end and one column in the middle which is the name but the name has a delimiter of space which mess things up.
Is there any kind of regex that I can take out the name column correctly. Is there anyway that i can use sed to replace (or remove) the space in that column so that I can take that out column out easily?
Example:
1 2 name 3 4
12 12 name1 name2 3 4
12 12 name1 name2 name3 name4 3 4
3 4 name 3 4
--
The output that I want to have is:
name
name1_name2
name1_name2_name3_name4
name
Thanks,
Amir,

One solution using awk is:
cat foo | awk '{ for(i=3; i<=NF-3; i++) { printf $i "_"; } printf $i "\n"; }'
Here is the same thing using sed:
cat foo | sed -e 's/^[0-9 ]*//g' -e 's/ [0-9 ]*$//g' -e 's/ /_/g'
POSIX compliant for clarity:
cat foo | sed -e 's/^[[:digit:][:space:]]*//g' -e 's/[[:space:]]*[[:digit:][:space:]]*$//g' -e 's/ /_/g'

sed 's/^[0-9]\+ [0-9]\+ \(.*\) [0-9]\+ [0-9]\+$/\1/;s/ /_/g'

another awk way without looping
awk 'BEGIN{OFS="_"}{$1=$2=$NF=$(NF-1)="";gsub(/__/,"")}1' yourFile
test:
kent$ cat t
1 2 name 3 4
12 12 name1 name2 3 4
12 12 name1 name2 name3 name4 3 4
3 4 name 3 4
kent$ awk 'BEGIN{OFS="_"}{$1=$2=$NF=$(NF-1)="";gsub(/__/,"")}1' t
name
name1_name2
name1_name2_name3_name4
name

Couple of Perl options
perl -lne '/\d+ \d+ (.+) \d+ \d+/ and do {($_ = $1) =~ s/ /_/g; print}'
perl -lape 'for (1..2) {shift #F; pop #F}; $_ = join "_", #F'

Related

Replace first four characters of a string

I would need to mask, with Regular Expression in Sed Editor, a string after 4 characters.
Example: 1234567890
Result: 1234XXXXXX
Can you help me?
In GNU sed:
echo "1234567890" | sed "s/./X/5g"
Following awk may help you here.
echo "1234567890" | awk '{for(i=5;i<=length($0);i++){val=val?val "X":"X"};print substr($0,1,4) val;val=""}'
Output will be as follows.
1234XXXXXX
$echo "1234567890"|awk '{for(i=5;i<=NF;i++)$i="X"}7' FS="" OFS=""
1234XXXXXX
With perl, you can put arbitrary perl code in the replacement text:
perl -pe 's/^(....)(.*)/$1 . "X" x length($2)/e' <<END
1
12
123
1234
12345
123456
1234567
12345678
123456789
1234567890
END
1
12
123
1234
1234X
1234XX
1234XXX
1234XXXX
1234XXXXX
1234XXXXXX
With sed
count=4
echo "1234567890" | \
sed ':A;s/\(\(.\)\{'"$count"'\}\)\(X*\)[^X]/\1\3X/;tA'
or
sed -E ':A;s/((.){'"$count"'})(X*)[^X]/\1\3X/;tA'

How to use a regex for the field separator in AWK?

I read another answer that show how one can set the field separator using the -F flag:
awk -F 'INFORMATION DATA ' '{print $2}' t
Now I'm curious how I can use a regex for the field separator. My attempt can be seen below:
$ echo "1 2 foo\n2 3 bar\n42 2 baz"
1 2 foo
2 3 bar
42 2 baz
$ echo "1 2 foo\n2 3 bar\n42 2 baz" | awk -F '\d+ \d+ ' '{ print $2 }'
# 3 blank lines
I was expecting to get the following output:
foo
bar
baz
This is because my regex \d+ \d+ matches "the first 2 numbers separated by a space, followed by a space". But I'm printing the second record. As shown on rubular:
How do I use a regex as the awk field separator?
First of all echo doesn't auto escape and outputs a literal \n. So you'll need to add -e to enable escapes. Second of all awk doesn't support \d so you have to use [0-9] or [[:digit:]].
echo -e "1 2 foo\n2 3 bar\n42 2 baz" | awk -F '[0-9]+ [0-9]+ ' '{ print $2 }'
or
echo -e "1 2 foo\n2 3 bar\n42 2 baz" | awk -F '[[:digit:]]+ [[:digit:]]+ ' '{ print $2 }'
Both outputs:
foo
bar
baz
Just replace \d with [0-9]:
With this you can print all the fields and you can see the fields immediatelly:
$ echo -e "1 2 foo\n2 3 bar\n42 2 baz" |awk -v FS="[0-9]+ [0-9]+" '{for (k=1;k<=NF;k++) print k,$k}'
1
2 foo
1
2 bar
1
2 baz
So just use [0-9] in your command:
$ echo -e "1 2 foo\n2 3 bar\n42 2 baz" |awk -v FS="[0-9]+ [0-9]+" '{print $2}'
foo
bar
baz

How to delete lines before a match perserving it?

I have the following script to remove all lines before a line which matches with a word:
str='
1
2
3
banana
4
5
6
banana
8
9
10
'
echo "$str" | awk -v pattern=banana '
print_it {print}
$0 ~ pattern {print_it = 1}
'
It returns:
4
5
6
banana
8
9
10
But I want to include the first match too. This is the desired output:
banana
4
5
6
banana
8
9
10
How could I do this? Do you have any better idea with another command?
I've also tried sed '0,/^banana$/d', but seems it only works with files, and I want to use it with a variable.
And how could I get all lines before a match using awk?
I mean. With banana in the regex this would be the output:
1
2
3
This awk should do:
echo "$str" | awk '/banana/ {f=1} f'
banana
4
5
6
banana
8
9
10
sed -n '/^banana$/,$p'
Should do what you want. -n instructs sed to print nothing by default, and the p command specifies that all addressed lines should be printed. This will work on a stream, and is different than the awk solution since this requires the entire line to match 'banana' exactly whereas your awk solution merely requires 'banana' to be in the string, but I'm copying your sed example. Not sure what you mean by "use it with a variable". If you mean that you want the string 'banana' to be in a variable, you can easily do sed -n "/$variable/,\$p" (note the double quotes and the escaped $) or sed -n "/^$variable\$/,\$p" or sed -n "/^$variable"'$/,$p'. You can also echo "$str" | sed -n '/banana/,$p' just like you do with awk.
Just invert the commands in the awk:
echo "$str" | awk -v pattern=banana '
$0 ~ pattern {print_it = 1} <--- if line matches, activate the flag
print_it {print} <--- if the flag is active, print the line
'
The print_it flag is activated when pattern is found. From that moment on (inclusive that line), you print lines when the flag is ON. Previously the print was done before the checking.
cat in.txt | awk "/banana/,0"
In case you don't want to preserve the matched line then you can use
cat in.txt | sed "0,/banana/d"

complex text replace from the command line

A simplified example of what I want to do:
I have a file: input.txt which looks like
a 2 4 b
a 3 8 b
c 9 4 d
a 3 4 8 b
and a script: add.sh which takes command-line parameters and returns their sum
I want to search input.txt for all instances of the pattern 'a (.*) b' where I pass the (.*) part as a command line parameter to add.sh.
For example, I want to do something like sed 's/a \(.*\) b/a {add.sh \1} b/g' input.txt
(that of course doesn't work).
So the output should look like
a 6 b
a 11 b
c 9 4 d
a 15 b
What would be the easiest way to do this?
Thanks
perl -pe 's/a (.*) b/"a ".`add.sh $1`." b"/eg' input.txt
Just make sure that add.sh doesn't output a newline.
And if perl isn't an option, you could
script it something like this:
grep -e '^a .* b$' input.txt | sed -e 's/a \(.*\) b/\1/g' | while read LINE; do ./add.sh $LINE; done
I realized the above doesn't solve your problem, I just focused on your sed expression.
However, if you are keen on solving this problem using another shell script, it would probably look something like this:
cat input.txt | while read LINE; do
if [[ "$LINE" =~ ^a (.*) b$ ]]; then
echo -n "a "
add.sh ${BASH_REMATCH[1]}
echo " b"
else
echo $LINE
fi
done
If add.sh is:
#!/bin/sh
arg1=$1
nums=$2
shift 2
for i in $nums
do
sum=$((sum+i))
done
echo "$arg1 $sum $#"
then you could do:
sed 's/^\([^ ]* \)\(.*\)\( [^ ]*\)$/\1\"\2\"\3/' input.txt | xargs -L 1 ./add.sh
which would add the numbers on every line. To add them only for lines that start with "a" and end with "b" use this:
sed 's/^a \(.*\) b$/a \"\1\" b/' input.txt | xargs -L 1 ./add.sh
The "c 9 4 d" line is still processed by add.sh but the sed command doesn't add any quotes, so the script sees only "9" as $2 and so the sum is only done once with the result as "9". The "4" is seen as part of the remainder of $#.

How can I output only captured groups with sed?

Is there a way to tell sed to output only captured groups?
For example, given the input:
This is a sample 123 text and some 987 numbers
And pattern:
/([\d]+)/
Could I get only 123 and 987 output in the way formatted by back references?
The key to getting this to work is to tell sed to exclude what you don't want to be output as well as specifying what you do want. This technique depends on knowing how many matches you're looking for. The grep command below works for an unspecified number of matches.
string='This is a sample 123 text and some 987 numbers'
echo "$string" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
This says:
don't default to printing each line (-n)
exclude zero or more non-digits
include one or more digits
exclude one or more non-digits
include one or more digits
exclude zero or more non-digits
print the substitution (p) (on one line)
In general, in sed you capture groups using parentheses and output what you capture using a back reference:
echo "foobarbaz" | sed 's/^foo\(.*\)baz$/\1/'
will output "bar". If you use -r (-E for OS X) for extended regex, you don't need to escape the parentheses:
echo "foobarbaz" | sed -r 's/^foo(.*)baz$/\1/'
There can be up to 9 capture groups and their back references. The back references are numbered in the order the groups appear, but they can be used in any order and can be repeated:
echo "foobarbaz" | sed -r 's/^foo(.*)b(.)z$/\2 \1 \2/'
outputs "a bar a".
If you have GNU grep:
echo "$string" | grep -Po '\d+'
It may also work in BSD, including OS X:
echo "$string" | grep -Eo '\d+'
These commands will match any number of digit sequences. The output will be on multiple lines.
or variations such as:
echo "$string" | grep -Po '(?<=\D )(\d+)'
The -P option enables Perl Compatible Regular Expressions. See man 3 pcrepattern or man 3 pcresyntax.
Sed has up to nine remembered patterns but you need to use escaped parentheses to remember portions of the regular expression.
See here for examples and more detail
you can use grep
grep -Eow "[0-9]+" file
run(s) of digits
This answer works with any count of digit groups. Example:
$ echo 'Num123that456are7899900contained0018166intext' \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Expanded answer.
Is there any way to tell sed to output only captured groups?
Yes. replace all text by the capture group:
$ echo 'Number 123 inside text' \
| sed 's/[^0-9]*\([0-9]\{1,\}\)[^0-9]*/\1/'
123
s/[^0-9]* # several non-digits
\([0-9]\{1,\}\) # followed by one or more digits
[^0-9]* # and followed by more non-digits.
/\1/ # gets replaced only by the digits.
Or with extended syntax (less backquotes and allow the use of +):
$ echo 'Number 123 in text' \
| sed -E 's/[^0-9]*([0-9]+)[^0-9]*/\1/'
123
To avoid printing the original text when there is no number, use:
$ echo 'Number xxx in text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1/p'
(-n) Do not print the input by default.
(/p) print only if a replacement was done.
And to match several numbers (and also print them):
$ echo 'N 123 in 456 text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1 /gp'
123 456
That works for any count of digit runs:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Which is very similar to the grep command:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" | grep -Po '\d+'
123
456
7899900
0018166
About \d
and pattern: /([\d]+)/
Sed does not recognize the '\d' (shortcut) syntax. The ascii equivalent used above [0-9] is not exactly equivalent. The only alternative solution is to use a character class: '[[:digit:]]`.
The selected answer use such "character classes" to build a solution:
$ str='This is a sample 123 text and some 987 numbers'
$ echo "$str" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
That solution only works for (exactly) two runs of digits.
Of course, as the answer is being executed inside the shell, we can define a couple of variables to make such answer shorter:
$ str='This is a sample 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D+($d+)$D*/\1 \2/p"
But, as has been already explained, using a s/…/…/gp command is better:
$ str='This is 75577 a sam33ple 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D*/\1 /gp"
75577 33 123 987
That will cover both repeated runs of digits and writing a short(er) command.
Give up and use Perl
Since sed does not cut it, let's just throw the towel and use Perl, at least it is LSB while grep GNU extensions are not :-)
Print the entire matching part, no matching groups or lookbehind needed:
cat <<EOS | perl -lane 'print m/\d+/g'
a1 b2
a34 b56
EOS
Output:
12
3456
Single match per line, often structured data fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*/$1/g'
a1 b2
a34 b56
EOS
Output:
1
34
With lookbehind:
cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/'
a1 b2
a34 b56
EOS
Multiple fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*?b(\d+).*/$1 $2/g'
a1 c0 b2 c0
a34 c0 b56 c0
EOS
Output:
1 2
34 56
Multiple matches per line, often unstructured data:
cat <<EOS | perl -lape 's/.*?a(\d+)|.*/$1 /g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
34 78
With lookbehind:
cat EOS<< | perl -lane 'print m/(?<=a)(\d+)/g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
3478
I believe the pattern given in the question was by way of example only, and the goal was to match any pattern.
If you have a sed with the GNU extension allowing insertion of a newline in the pattern space, one suggestion is:
> set string = "This is a sample 123 text and some 987 numbers"
>
> set pattern = "[0-9][0-9]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
123
987
> set pattern = "[a-z][a-z]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
his
is
a
sample
text
and
some
numbers
These examples are with tcsh (yes, I know its the wrong shell) with CYGWIN. (Edit: For bash, remove set, and the spaces around =.)
Try
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
I got this under cygwin:
$ (echo "asdf"; \
echo "1234"; \
echo "asdf1234adsf1234asdf"; \
echo "1m2m3m4m5m6m7m8m9m0m1m2m3m4m5m6m7m8m9") | \
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
1234
1234 1234
1 2 3 4 5 6 7 8 9
$
You need include whole line to print group, which you're doing at the second command but you don't need to group the first wildcard. This will work as well:
echo "/home/me/myfile-99" | sed -r 's/.*myfile-(.*)$/\1/'
It's not what the OP asked for (capturing groups) but you can extract the numbers using:
S='This is a sample 123 text and some 987 numbers'
echo "$S" | sed 's/ /\n/g' | sed -r '/([0-9]+)/ !d'
Gives the following:
123
987
I want to give a simpler example on "output only captured groups with sed"
I have /home/me/myfile-99 and wish to output the serial number of the file: 99
My first try, which didn't work was:
echo "/home/me/myfile-99" | sed -r 's/myfile-(.*)$/\1/'
# output: /home/me/99
To make this work, we need to capture the unwanted portion in capture group as well:
echo "/home/me/myfile-99" | sed -r 's/^(.*)myfile-(.*)$/\2/'
# output: 99
*) Note that sed doesn't have \d
You can use ripgrep, which also seems to be a sed replacement for simple substitutions, like this
rg '(\d+)' -or '$1'
where ripgrep uses -o or --only matching and -r or --replace to output only the first capture group with $1 (quoted to be avoid intepretation as a variable by the shell) two times due to two matches.