grep lines with a range of N to M tokens - regex

I would like to grep (I can accept non-grep answers, but grep is what I am most used to) lines that contain a given range of whitespace-delimited tokens, ignoring punctuation marks. For example, if I want three to five tokens, I should get lines with three, four or five tokens, but not one, two, six or twenty. Lines end with a period and sometimes have commas in the middle, which I would like to account for if possible. Also, the real data consists of actual words, so I would like an answer with clear instructions for allowing characters that are not necessarily a-zA-Z, for example the word "can't".
My data is like this:
aa .
aa bb'b , c ddd e f gg .
aa bb .
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aa bb'b cc dd e f .
aaaaa bb'b c .
I tried this:
grep -e "[a-zA-Z']* ,*\{3,5\}"
What I expected to get was this:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .

With GNU grep:
grep -E "^([a-zA-Z']+ *,* ){3,5}\.$" file
Output:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .

I think awk can make this task simple, because it has a variable NF that holds the number of fields (separated by blanks) in each line, so:
awk 'NF >= 4 && NF <= 6' infile
I incremented both bounds by one to account for the final period. It yields:
a b c d e .
a b c d .
a b c .
EDIT: To ignore commas, use the FS variable (Field Separator) with a regular expression:
awk 'BEGIN { FS = "[[:blank:],]+" } NF >= 4 && NF <= 6' infile
It yields:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .
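The same idea can be pushed a little further so that punctuation never has to be folded into the bounds. What follows is only a sketch, not part of the answer above: the function name filter_by_token_count and the punctuation set [.,;:!?] are assumptions made for illustration. A field counts as a word unless it consists solely of punctuation, so words such as "can't" are counted without listing their characters.

```shell
# filter_by_token_count MIN MAX (hypothetical helper): read lines on stdin and
# print those whose number of word tokens lies between MIN and MAX inclusive.
filter_by_token_count() {
    awk -v min="$1" -v max="$2" '
    {
        words = 0
        for (i = 1; i <= NF; i++)
            # A field made only of punctuation ("," or ".") is not a word.
            if ($i !~ /^[.,;:!?]+$/) words++
        if (words >= min && words <= max) print
    }'
}

# Usage with three of the sample lines from the question:
printf '%s\n' 'aa bb .' "aaaa bb'b cccc , dddd ." "aa bb'b cc dd e f ." |
    filter_by_token_count 3 5
```

Because the bounds are passed as arguments, the period and commas no longer shift the numbers: 3 and 5 mean three to five words.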

Here's a sed example to add to the mix (note that it counts a standalone comma as a token):
sed -n "/^\([a-zA-Z',]* \)\{3,5\}\.$/p" file
Output:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .

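Another possibility, which works only because the wanted lines in this sample happen to start with three or more a's (it does not count tokens):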
awk '/aaa+/' file
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .

Related

sed command to change c-style multiline comments to c++ style multiline comments

This is the script:
sed 's\[/][*]\//\g ; s/[*][/]\s\+/\n/g ; s/[*][/]/\n/g' inputFile > outputFile
This is input file:
aaa /* bbb */ ccc /* ddd */ eee /* fff
ggg */ hhh /* iii */ jjj
kkk
/* lll
mmm
nnn */
ooo
This is output file:
aaa // bbb
ccc // ddd
eee // fff
ggg
hhh // iii
jjj
kkk
// lll
mmm
nnn
ooo
Expected output:
aaa // bbb
ccc // ddd
eee // fff
// ggg
hhh // iii
jjj
kkk
// lll
// mmm
// nnn
ooo
The script I am currently using cannot handle multiline comments. Is there any way to achieve this with sed?
If Perl is an option, would you please try the following:
perl -0777pe 's#/\*\s*(.+?)\s*\*/\s*#join("\n", map {"// " . $_ } split("\n", $1)) . "\n"#sge' inputFile
Output:
aaa // bbb
ccc // ddd
eee // fff
// ggg
hhh // iii
jjj
kkk
// lll
// mmm
// nnn
ooo
The -0777 option tells perl to slurp the whole file at once.
The -p option loops over the input and prints the result; -e supplies the code on the command line.
The s switch on the s/pattern/replacement/ operator makes a dot match a newline character.
The g switch applies the substitution to every comment, not just the first one.
The e switch on the s/pattern/replacement/ operator allows the replacement to be a perl expression.
The join .. map .. split() chain prefixes every line of a multiline comment with "// ".
This might work for you (GNU sed):
sed -E ':a;\#/\*.*\*/#{s#/[*] ?#// #;s# \*/ ?#\n#;P;D};\#/\*#{N;s#(.* ?)\n#\1\n// #;ba}' file
If the current line contains both a starting and ending comment delimiter: replace the starting delimiter by // and the ending delimiter by \n, print the first line in the pattern space, remove it and then repeat.
Otherwise, if the current line contains a starting comment delimiter: append the following line, append // to the introduced newline and repeat (see above).
If there are no comment delimiters in the current line, print the line as normal.
N.B. The alternate matching delimiter \#...# (and likewise s#...#...# for substitution) is used to avoid confusion, because the comment delimiters contain /'s. Also, for formatting purposes, spaces have been removed from and added to the result as required.

Parse log with mixed single-line and multi-line content

I need to extract messages from a log file. Messages are logged in two different ways: in a single line, like this:
2018-09-21 10:03:54,145 <message-content>
2018-09-21 10:05:02,008 <next-message-content>
or in several lines like this:
2018-09-21 10:03:54,145 <message-content-part 1>
<message-content-part 2>
...
<message-content-part n>
2018-09-21 10:04:12,198 <next-message-content>
Each message starts with header \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}.
There is no specific ending tag in any message.
I want to extract all messages, both single- and multi-line, with specific text.
For example, the output of search for "XYZ" could be like this:
2018-09-21 10:03:54,145 AAA BBB XYZ CCC
2018-09-21 10:10:55,347 BBB
CCC XYZW
DDD
2018-09-21 10:12:56,060 EEE XYZFFF
GGG
You may use
cat file | \
sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/\n\n&/' | \
awk 'BEGIN { RS = "\n\n"; ORS=""} /XYZ/ {print}'
Details
sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/\n\n&/' - This sed command finds lines starting with the datetime header and prepends a double newline to them (\n in the replacement is a GNU sed feature).
awk 'BEGIN { RS = "\n\n"; ORS=""} /XYZ/ {print}' - This awk command splits the input into records on "\n\n" (RS is the record separator; a multi-character RS is an extension supported by gawk and mawk) and prints only those records that contain the XYZ substring. ORS="" (ORS is the output record separator) keeps the "\n\n" from being added back.
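The same grouping can also be done in a single awk pass that buffers lines until the next header arrives. This is a rough sketch rather than the answer's method: grep_messages is a hypothetical name, the search is a fixed-string match via index(), and the header pattern is spelled out with repeated [0-9] so that it does not depend on interval-expression support.

```shell
# grep_messages TEXT (hypothetical helper): read a log on stdin and print every
# message, single- or multi-line, that contains the fixed string TEXT.
grep_messages() {
    awk -v text="$1" '
    # Print the buffered message if it contains the search text.
    function emit() { if (have && index(msg, text)) print msg }
    # A timestamp header starts a new message.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]/ {
        emit(); msg = $0; have = 1; next
    }
    # Any other line continues the current message.
    { msg = (have ? msg "\n" $0 : $0); have = 1 }
    END { emit() }'
}

# Usage: grep_messages XYZ < file
printf '%s\n' '2018-09-21 10:03:54,145 AAA' 'BBB XYZ' '2018-09-21 10:04:12,198 CCC' |
    grep_messages XYZ
```

The END rule flushes the buffer, so the last message in the file is checked as well.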
Using Perl. I added two more messages to the sample input; they should not appear in the output.
> cat pattern_xyz.dat
2018-09-21 10:03:54,145 AAA BBB XYZ CCC
2018-09-21 10:03:54,145 AAA BBB PPP CCC
2018-09-21 10:10:55,347 BBB
CCC XYZW
DDD
2018-09-21 10:12:56,060 EEE XYZFFF
GGG
2018-09-21 10:10:55,347 BBB
CCC QQQW
DDD
>
> cat pattern_xyz.pl
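#!/usr/bin/perl
use strict;
use warnings;

# Slurp the whole log file named on the command line.
my $log = do { local $/; <> };

# Split just before every line that starts with a date, so that each chunk
# is one complete message (including the last one in the file).
for my $msg (split /^(?=\d{4}-\d{2}-\d{2} )/m, $log) {
    print $msg if $msg =~ /XYZ/;
}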
> pattern_xyz.pl pattern_xyz.dat #executing script
2018-09-21 10:03:54,145 AAA BBB XYZ CCC
2018-09-21 10:10:55,347 BBB
CCC XYZW
DDD
2018-09-21 10:12:56,060 EEE XYZFFF
GGG

Perl: Find line with matching string(str1) and only in those matched lines, replace one string(str2) with specific string(str3)

I have one file a.txt with contents as follow:
a aa aaa
b bb value = 11 xyz
c cc ccc
b bb value = 222 abc
d dd ddd
I have to search for the string "bb". Once a matching line is found, I have to replace "value = xxx" with "value = 77".
Here, xxx is an integer with any number of digits (11 and 222 in the case above).
I have tried below perl command:
perl -n -e 'print; if (m/\bbb\b/) { s/value = (\d+)/value = 77/g; print; }' < a.txt
It gives me output as:
a aa aaa
b bb value = 11 xyz
b bb value = 77 xyz
c cc ccc
b bb value = 222 abc
b bb value = 77 abc
d dd ddd
Here I am looking for an in-place replacement rather than an additional line with the changes.
Basically, I am expecting the following output:
a aa aaa
b bb value = 77 xyz
c cc ccc
b bb value = 77 abc
d dd ddd
Can anyone help me update my command?
One more quick question: can I change the command so that it searches for the string "bb" and, only on matching lines, removes the string "value = xxx" completely,
where xxx is an integer with any number of digits?
You print twice when you have a match. If you don't want to do that, don't do that :)
perl -n -e 'if (m/\bbb\b/) { s/value = (\d+)/value = 77/g; } print' < a.txt
Cleaned up:
perl -pe's/value = \K\d+/77/g if /\bbb\b/' a.txt
Based on the sample data, you might even be able to use
perl -pe's/\bbb\b.*value = \K\d+/77/' a.txt
This works:
perl -n -e 'if (m/\bbb\b/) { s/value = (\d+)/value = 77/g; print; } else {print}' < a.txt
Put one print in the if branch and one in the else branch.
Output:
$ perl -n -e 'if (m/\bbb\b/) { s/value = (\d+)/value = 77/g; print; } else {print}' < a.txt
a aa aaa
b bb value = 77 xyz
c cc ccc
b bb value = 77 abc
d dd ddd

How to use a regular expression in awk or sed to find all homopolymers in a DNA sequence?

Background
Homopolymers are sub-sequences of DNA made of consecutive identical bases, like AAAAAAA. An example in Python that extracts them:
import re
DNA = "ACCCGGGTTTAACCGGACCCAA"
homopolymers = re.findall('A+|T+|C+|G+', DNA)
print(homopolymers)
['A', 'CCC', 'GGG', 'TTT', 'AA', 'CC', 'GG', 'A', 'CCC', 'AA']
my effort
I made a gawk script that solves the problem, but without using regular expressions:
echo "ACCCGGGTTTAACCGGACCCAA" | gawk '
BEGIN{
FS=""
}
{
homopolymer = $1;
base = $1;
for(i=2; i<=NF; i++){
if($i == base){
homopolymer = homopolymer""base;
}else{
print homopolymer;
homopolymer = $i;
base = $i;
}
}
print homopolymer;
}'
output
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
question
How can I use regular expressions in awk or sed to get the same result?
grep -o will get you that in one line:
echo "ACCCGGGTTTAACCGGACCCAA"| grep -ioE '([A-Z])\1*'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
Explanation:
([A-Z]) # matches and captures a letter in matched group #1
\1* # matches 0 or more of captured group #1 using back-reference \1
sed is not the best tool for this but since OP has asked for it:
echo "ACCCGGGTTTAACCGGACCCAA" | sed -r 's/([A-Z])\1*/&\n/g'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
PS: This is gnu-sed.
Try using split and just comparing.
echo "ACCCGGGTTTAACCGGACCCAA" | awk '{ split($0, chars, "")
for (i=1; i <= length($0); i++) {
if (chars[i]!=chars[i+1])
{
printf("%s\n", chars[i])
}
else
{
printf("%s", chars[i])
}
}
}'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
EXPLANATION
split($0, chars, "") divides the line sent to awk into single characters stored in the array chars[] (splitting on an empty separator is a gawk extension). We then walk through the array and compare each character with the next one: if (chars[i]!=chars[i+1]). If they are equal, we print the character without a newline and move on. If they differ, the run has ended, so we print the character followed by \n, which is a newline.
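Since the question asks for a regular expression in awk, here is one possible sketch using match(). awk regexes have no back-references, so the four bases are listed explicitly; the function name homopolymers and the uppercase A/C/G/T alphabet are assumptions for illustration.

```shell
# homopolymers (hypothetical helper): read sequences on stdin and print each
# run of identical bases on its own line.
homopolymers() {
    awk '{
        s = $0
        # match() sets RSTART and RLENGTH to the leftmost-longest run.
        while (match(s, /A+|C+|G+|T+/)) {
            print substr(s, RSTART, RLENGTH)
            # Continue searching after the run that was just printed.
            s = substr(s, RSTART + RLENGTH)
        }
    }'
}

echo "ACCCGGGTTTAACCGGACCCAA" | homopolymers
```

Characters outside the four bases are simply skipped, since match() jumps to the next run it can find.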

sed - Replace new line characters not followed by 5-digit number

I have a csv file with some (dirty) DB schema.
example:
10391,0,3,4,12,44 --ok
10391,0,3,4, --not ok
12,44 --not ok
10391,0,3,4,12,44 --ok
I want to write a sed script that replaces newline characters (those not followed by a 5-digit number) with spaces.
I wrote this one, but it does not work correctly for me:
sed 's/\n\([0-9]{1,4}\)/ \1/g'
Running it on this sample
11111 sss
22222 aaa
3333 aaa
333 sss
22 sss
1 sss
should produce
11111 sss
22222 aaa 3333 aaa 333 sss 22 sss 1 sss
Thanks to anyone who is able to help.
Or use a Perl one-liner:
perl -0777 -pe 's/\n(?!\d{5}\b|\z)/ /g' yourfile
Explanation
\n matches the newline
(?!\d{5}\b|\z) asserts that what follows is neither five digits and a word boundary nor the end of the input, so the final newline is kept
the matched newline is replaced with a space
Using awk:
awk -v ORS= 'NR > 1 { printf /^[0-9]{5} / ? "\n" : " " } 1
END { if (NR) printf "\n" }' file
Output:
11111 sss
22222 aaa 3333 aaa 333 sss 22 sss 1 sss
awk '{printf "%s%s", (NR==1 ? "" : ($0~/^[0-9]{5} /?"\n":" ")), $0} END{print ""}'
should work for your example:
kent$ echo "11111 sss
22222 aaa
3333 aaa
333 sss
22 sss
1 sss"|awk '{printf "%s%s", (NR==1 ? "" : ($0~/^[0-9]{5} /?"\n":" ")), $0} END{print ""}'
11111 sss
22222 aaa 3333 aaa 333 sss 22 sss 1 sss