sed - Replace new line characters not followed by 5-digit number - regex

I have a csv file with some (dirty) DB schema.
example:
10391,0,3,4,12,44 --ok
10391,0,3,4, --not ok
12,44 --not ok
10391,0,3,4,12,44 --ok
I want to write a sed script that replaces newline characters (those not followed by a 5-digit number) with spaces.
I wrote this one, but it does not work correctly for me:
sed 's/\n\([0-9]{1,4}\)/ \1/g'
running on this sample
11111 sss
22222 aaa
3333 aaa
333 sss
22 sss
1 sss
should produce
11111 sss
22222 aaa 3333 aaa 333 sss 22 sss 1 sss
thanks to anyone who will be able to help
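One likely reason the attempt above does nothing is that sed reads one line at a time and strips the trailing newline, so \n never appears in the pattern space unless a command such as N pulls in the next line. Here is a minimal sketch of that idea. It assumes a sed that supports -E (GNU or BSD), and join_continuations is only a made-up helper name:

```shell
# Join a line onto the previous one unless it starts with exactly five digits.
# join_continuations is a hypothetical name used for illustration only.
join_continuations() {
  sed -E \
    -e ':a' \
    -e '$!N' \
    -e '/\n[0-9]{5}([^0-9]|$)/!s/\n/ /' \
    -e 'ta' \
    -e 'P' \
    -e 'D' "$@"
}

# Sample from the question
printf '11111 sss\n22222 aaa\n3333 aaa\n333 sss\n22 sss\n1 sss\n' | join_continuations
```

N appends the next line, the s command joins the two lines only when the new line does not start with a 5-digit number, and P;D print and drop the finished line while keeping the rest for the next round.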

Or use a Perl One-Liner
perl -0777 -pe 's/\n(?!\d{5}\b|\z)/ /g' yourfile
Explanation
\n matches the newline
(?!\d{5}\b|\z) asserts that what follows is neither five digits and a word boundary nor the end of the input (so the final newline is kept)
we insert a space

Using awk:
awk -v ORS= 'NR > 1 { printf /^[0-9]{5} / ? "\n" : " " } 1
END { if (NR) printf "\n" }' file
Output:
11111 sss
22222 aaa 3333 aaa 333 sss 22 sss 1 sss

awk '{printf "%s%s", (NR==1 ? "" : (/^[0-9]{5} / ? "\n" : " ")), $0} END{print ""}'
should work for your example:
kent$ echo "11111 sss
22222 aaa
3333 aaa
333 sss
22 sss
1 sss"|awk '{printf "%s%s", (NR==1 ? "" : (/^[0-9]{5} / ? "\n" : " ")), $0} END{print ""}'
11111 sss
22222 aaa 3333 aaa 333 sss 22 sss 1 sss

Related

Parse log with mixed single-line and multi-line content

I need to extract messages from a log file. Messages are logged in two different ways: in a single line, like this:
2018-09-21 10:03:54,145 <message-content>
2018-09-21 10:05:02,008 <next-message-content>
or in several lines like this:
2018-09-21 10:03:54,145 <message-content-part 1>
<message-content-part 2>
...
<message-content-part n>
2018-09-21 10:04:12,198 <next-message-content>
Each message starts with header \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}.
There is no specific ending tag in a message.
I want to extract all messages, both single- and multi-line, with specific text.
For example, the output of search for "XYZ" could be like this:
2018-09-21 10:03:54,145 AAA BBB XYZ CCC
2018-09-21 10:10:55,347 BBB
CCC XYZW
DDD
2018-09-21 10:12:56,060 EEE XYZFFF
GGG
You may use
cat file | \
sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/\n\n&/' | \
awk 'BEGIN { RS = "\n\n"; ORS=""} /XYZ/ {print}'
Details
sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/\n\n&/' - This sed command finds lines starting with datetime format and prepends them with double newline
awk 'BEGIN { RS = "\n\n"; ORS=""} /XYZ/ {print}' - This awk command splits the input into records at "\n\n" (RS is the record separator) and prints only the records that contain the XYZ substring. Because ORS (the output record separator) is "", no extra separator is printed after each record. Note that a multi-character RS is an extension (supported by gawk and mawk, among others), not POSIX awk.
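If you would rather not depend on \n in the sed replacement or on a multi-character RS, the same grouping can be done in a single awk pass. This is only a sketch: extract_messages is a made-up name, the search text is treated as a fixed string rather than a regex, and the header pattern is spelled out without {n} intervals so that older awks accept it.

```shell
# Hypothetical helper: extract_messages TEXT [FILE...]
# Prints every message (header line plus continuation lines) that contains TEXT.
extract_messages() {
  pattern=$1; shift
  awk -v pat="$pattern" '
    # Print the buffered message if it contains the search text.
    function emit() { if (msg != "" && index(msg, pat)) print msg }
    # A timestamp header starts a new message.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]/ {
      emit(); msg = $0; next
    }
    # Any other line continues the current message.
    { msg = (msg == "" ? $0 : msg "\n" $0) }
    END { emit() }
  ' "$@"
}

# Usage: extract_messages XYZ file
```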
Using perl. I added 2 more messages in the sample input, which should not appear in the output.
> cat pattern_xyz.dat
2018-09-21 10:03:54,145 AAA BBB XYZ CCC
2018-09-21 10:03:54,145 AAA BBB PPP CCC
2018-09-21 10:10:55,347 BBB
CCC XYZW
DDD
2018-09-21 10:12:56,060 EEE XYZFFF
GGG
2018-09-21 10:10:55,347 BBB
CCC QQQW
DDD
>
> cat pattern_xyz.pl
#!/usr/bin/perl
$file=$ARGV[0];
$x=`cat $file`;
while($x=~m/(^\d{4}-\d{2}-\d{2})(.+?)(\d{4}-\d{2}-\d{2})(.*)/osm)
{
$content="$1$2";
$x="$3$4";
if( $content=~/XYZ/ ) { print "$content"; }
}
> pattern_xyz.pl pattern_xyz.dat #executing script
2018-09-21 10:03:54,145 AAA BBB XYZ CCC
2018-09-21 10:10:55,347 BBB
CCC XYZW
DDD
2018-09-21 10:12:56,060 EEE XYZFFF
GGG
>
>

grep a range of N to N tokens

I would like to grep (I can accept non-grep answers, but grep is what I am most used to for this) lines which have a range of tokens delimited by whitespace, with the ability to ignore punctuation marks. This means that if I want three to five tokens, I would get lines with three, four or five tokens, but not one, two, six or twenty tokens. I have periods at the end and sometimes commas in the middle, which are things I would like to account for if possible. Also, the real data is actually words, so I would like an answer with clear instructions for allowing characters which are not necessarily a-zA-Z, for example the word "can't".
My data is like this:
aa .
aa bb'b , c ddd e f gg .
aa bb .
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aa bb'b cc dd e f .
aaaaa bb'b c .
I tried this:
grep -e "[a-zA-Z']* ,*\{3,5\}"
What I expected to get was this:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .
With GNU grep:
grep -E "^([a-zA-Z']+ *,* ){3,5}\.$" file
Output:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .
I think awk can make this task simple, because it has a variable NF that counts number of fields (separated by blanks) in each line, so:
awk 'NF >= 4 && NF <= 6' infile
I incremented its value to take into account last period. It yields:
a b c d e .
a b c d .
a b c .
EDIT: To ignore commas, use the FS variable (Field Separator) with a regular expression:
awk 'BEGIN { FS = "[[:blank:],]+" } NF >= 4 && NF <= 6' infile
It yields:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .
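A variation on the NF idea, in case you do not want to adjust the limits for the trailing period: count only the fields that contain at least one letter or digit, so pure punctuation such as "." or "," is never counted. This is a rough sketch, filter_by_token_count is a made-up name, and the [A-Za-z0-9] class is an assumption you would widen for real words with other characters.

```shell
# Hypothetical helper: filter_by_token_count MIN MAX [FILE...]
# Prints lines whose number of "real" tokens is between MIN and MAX.
filter_by_token_count() {
  min=$1; max=$2; shift 2
  awk -v min="$min" -v max="$max" '{
    n = 0
    # A field counts as a token only if it holds a letter or digit,
    # so "can'\''t" counts once and "." or "," do not count at all.
    for (i = 1; i <= NF; i++) if ($i ~ /[A-Za-z0-9]/) n++
    if (n >= min && n <= max) print
  }' "$@"
}

# Usage: filter_by_token_count 3 5 file
```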
Here's a sed example to add to the mix:
sed -n "/^\([a-zA-Z',]* \)\{3,5\}\.$/p"
Output:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .
Another possibility:
awk '/aaa+/' file
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .

awk remove substring using regex

I have a pipe-delimited file that looks like this:
34ab1 | aaa bbb ccc fff vf | 2015-01-01
35ab1 | aaa bbb ccc dddefd ddff ssss fff vi | 2015-01-01
I want to replace everything that starts with bbb and ends with fff.
I used this:
BEGIN {
FS = OFS = "|"
}
{
sub(/[0-9].*[0-9]/, "", $2); sub(/bbb.*fff/, "", $2);
print
}
The regex part for the numbers worked, but the second part of the regex didn't.
Output I want:
34ab1 | aaa vf | 2015-01-01
35ab1 | aaa vi | 2015-01-01
Use a single gsub function for both.
BEGIN {
FS = OFS = "|"
}
{
gsub(/[0-9].*[0-9]|bbb.*fff/, "", $2);
print
}
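For a quick check from the command line, here is a sketch that applies the bbb...fff removal to the second field of the two sample lines. It additionally swallows the spaces after fff (my own tweak, not part of the question) so that the result has the single space shown in the wanted output. strip_bbb_to_fff is a made-up name.

```shell
# Hypothetical helper: removes "bbb ... fff" (plus trailing spaces) from field 2.
strip_bbb_to_fff() {
  awk 'BEGIN { FS = OFS = "|" }
       {
         # .* is greedy, so the match runs up to the last fff in the field.
         gsub(/bbb.*fff */, "", $2)
         print
       }' "$@"
}

printf '%s\n' \
  '34ab1 | aaa bbb ccc fff vf | 2015-01-01' \
  '35ab1 | aaa bbb ccc dddefd ddff ssss fff vi | 2015-01-01' | strip_bbb_to_fff
```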

grep, cut, sed, awk a file for 3rd column, n lines at a time, then paste into repeated columns of n rows?

I have a file of the form:
#some header text
a 1 1234
b 2 3333
c 2 1357
#some header text
a 4 8765
b 1 1212
c 7 9999
...
with repeated data in n-row chunks separated by a blank line (with possibly some other header text). I'm only interested in the third column, and would like to do some grep, cut, awk, sed, paste magic to turn it in to this:
a 1234 8765 ...
b 3333 1212
c 1357 9999
where the third column of each subsequent n-row chunk is tacked on as a new column. I guess you could call it a transpose, just n-lines at a time, and only a specific column. The leading (a b c) column label isn't essential... I'd be happy if I could just grab the data in the third column
Is this even possible? It must be. I can get things chopped down to only the interesting columns with grep and cut:
cat myfile | grep -A2 ^a\ | cut -c13-15
but I can't figure out how to take these n-row chunks and sed/paste/whatever them into repeated n-row columns.
Any ideas?
This awk does the job:
awk 'NF<3 || /^(#|[[:blank:]]*$)/{next} !a[$1]{b[++k]=$1; a[$1]=$3; next}
{a[$1] = a[$1] OFS $3} END{for(i=1; i<=k; i++) print b[i], a[b[i]]}' file
a 1234 8765
b 3333 1212
c 1357 9999
awk '/#/{next}{a[$1] = a[$1] $3 "\t"}END{for(i in a){print i, a[i]}}' file
Would produce
a 1234 8765
b 3333 1212
c 1357 9999
You can change "\t" to a different output separator like " " if you like.
sub(/\t$/, "", a[i]); may be inserted before print if you don't like having a trailing tab. Another solution is to check whether a[$1] already has a value and only then append to the previous value. It complicates the code a bit though.
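The "check whether the key already has a value" variant could look roughly like the sketch below. It also records the order in which keys first appear, because for (i in a) does not guarantee any particular order. collect_third_column is a made-up name, and the tab separator is just one choice.

```shell
# Hypothetical helper: gathers column 3 per key (column 1), tab-separated,
# without a trailing separator and in first-seen key order.
collect_third_column() {
  awk '
    /^#/ || NF < 3 { next }                      # skip headers and blank lines
    !($1 in seen) { seen[$1] = 1; order[++n] = $1; row[$1] = $3; next }
    { row[$1] = row[$1] "\t" $3 }                # key seen before: append
    END { for (i = 1; i <= n; i++) print order[i] "\t" row[order[i]] }
  ' "$@"
}

# Usage: collect_third_column file
```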
Using bash > 4.0:
declare -A array
while read line
do
if [[ $line && $line != \#* ]];then
c=$( echo $line | cut -f 1 -d ' ')
value=$( echo $line | cut -f 3 -d ' ')
array[$c]="${array[$c]} $value"
fi
done < myFile.txt
for k in "${!array[@]}"
do
echo "$k ${array[$k]}"
done
Will produce:
a 1234 8765
b 3333 1212
c 1357 9999
It stores the letter as the key of the associative array and, in each iteration, appends the corresponding value to it.
$ awk -v RS= -F'\n' '{ for (i=2;i<=NF;i++) {split($i,f,/[[:space:]]+/); map[f[1]] = map[f[1]] " " f[3]} } END{ for (key in map) print key map[key]}' file
a 1234 8765
b 3333 1212
c 1357 9999

Find specific pattern and print complete text block using awk or sed

How can I find a specific number in a text block and print the complete text block, beginning with the keyword "BEGIN" and ending with "END"? Basically this is what my file looks like:
BEGIN
A: abc
B: 12345
C: def
END
BEGIN
A: xyz
B: 56789
C: abc
END
BEGIN
A: ghi
B: 56712
C: pqr
END
[...]
If I was looking for '^B: 567', I would like to get this output:
BEGIN
A: xyz
B: 56789
C: abc
END
BEGIN
A: ghi
B: 56712
C: pqr
END
I could use grep here (grep -E -B2 -A2 "^B: 567" file), but I would like to get a more general solution. I guess awk or sed might be able to do this!?
Thanks! :)
$ awk -v RS= -v ORS='\n\n' '/\nB: 567/' file
BEGIN
A: xyz
B: 56789
C: abc
END
BEGIN
A: ghi
B: 56712
C: pqr
END
Note the \n before B to ensure it occurs at the start of a line. This is in place of the ^ start-of-string character you had originally, since now each line isn't its own string. You need to set ORS as above to re-insert the blank line between records.
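To make the search text a parameter, the same paragraph-mode idea can be wrapped as in the sketch below. print_matching_blocks is a made-up name, the pattern is treated as a regex that must match at the start of a line, and, as above, the blocks are assumed to be separated by blank lines in the file.

```shell
# Hypothetical helper: print_matching_blocks REGEX [FILE...]
# Prints each blank-line-separated block in which some line starts with REGEX.
print_matching_blocks() {
  pattern=$1; shift
  # RS= selects paragraph mode. A newline is prepended to both the record
  # and the pattern, so a match on the first line of a block works too.
  awk -v RS= -v ORS='\n\n' -v pat="$pattern" '("\n" $0) ~ ("\n" pat)' "$@"
}

# Usage: print_matching_blocks 'B: 567' file
```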
This might work for you (GNU sed):
sed -n '/^BEGIN/{x;d};H;/^END/{x;s/^B: 567/&/mp}' file
or this:
sed -n '/^BEGIN/!b;:a;$!{N;/\nEND/!ba};/\nB: 567/p' file
A bit lengthy, but the RS trick was already posted :-)
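BEGIN {found=0;start=0;n=0}
# a new block starts: reset the buffer
/BEGIN/ {
start=1
n=0
delete a
}
/567/ {found=1}
{
if (start==1) {
a[n++]=$0
}
}
/END/ {
if (found) {
# print by index, since "for (i in a)" has no guaranteed order
for (j=0;j<n;j++)
print a[j]
}
found=0
start=0
n=0
delete a
}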
Output:
$ awk -f s.awk input
BEGIN
A: xyz
B: 56789
C: abc
END
BEGIN
A: ghi
B: 56712
C: pqr
END
This awk should work:
awk -v s='B: 567' '$0~s' RS= file
BEGIN
A: xyz
B: 56789
C: abc
END
BEGIN
A: ghi
B: 56712
C: pqr
END
You can set RS to the empty string so that records are split at blank lines, and then check whether the pattern matches anywhere in the whole block:
awk 'BEGIN { RS = "" } /\nB:[[:space:]]+567/ { print $0 ORS }' infile
It yields:
BEGIN
A: xyz
B: 56789
C: abc
END
BEGIN
A: ghi
B: 56712
C: pqr
END
perl -lne 'if(/56789/){$f=1}
push @a,$_;
if(/END/){
if($f){print join "\n",@a}
undef @a;$f=0}' your_file
undef #a;$f=0}' your_file