grep a range of N to N tokens - regex

I would like to grep (I can accept non-grep answers, but grep is what I am most used to) lines that contain a given range of whitespace-delimited tokens, ignoring punctuation marks. This means that if I want three to five tokens, I should get lines with three, four or five tokens but not one, two, six or twenty tokens. There are periods at the end of lines and sometimes commas in the middle, which I would like to account for if possible. Also, the real data is actual words, so I would like an answer with clear instructions for allowing characters that are not necessarily a-zA-Z, for example the word "can't".
My data is like this:
aa .
aa bb'b , c ddd e f gg .
aa bb .
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aa bb'b cc dd e f .
aaaaa bb'b c .
I tried this:
grep -e "[a-zA-Z']* ,*\{3,5\}"
What I expected to get was this:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .

With GNU grep:
grep -E "^([a-zA-Z']+ *,* ){3,5}\.$" file
Output:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .
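For readers who prefer a script, the same pattern can be tried in Python. This is only a sketch: the helper name lines_with_token_range and its parameters are made up for illustration, and the character class [a-zA-Z'] is the part to widen if the real words contain other characters.

```python
import re

# Hypothetical helper (not from the answer above): keep the lines that have
# between min_tokens and max_tokens word tokens followed by a final period.
def lines_with_token_range(lines, min_tokens=3, max_tokens=5):
    # Same shape as the grep pattern: a word, optional blanks and commas,
    # then one mandatory space, repeated min..max times, then the period.
    pattern = re.compile(
        r"^(?:[a-zA-Z']+ *,* ){%d,%d}\.$" % (min_tokens, max_tokens)
    )
    return [line for line in lines if pattern.match(line)]

sample = [
    "aa .",
    "aa bb'b , c ddd e f gg .",
    "aa bb .",
    "aaa bb'b cccc dddd e .",
    "aaaa bb'b cccc , dddd .",
    "aa bb'b cc dd e f .",
    "aaaaa bb'b c .",
]

for line in lines_with_token_range(sample):
    print(line)
```

With the sample data from the question, this prints the same three lines as the grep command.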

I think awk can make this task simple, because it has a variable NF that counts number of fields (separated by blanks) in each line, so:
awk 'NF >= 4 && NF <= 6' infile
I incremented its value to take into account last period. It yields:
a b c d e .
a b c d .
a b c .
EDIT: To ignore commas, use the FS variable (Field Separator) with a regular expression:
awk 'BEGIN { FS = "[[:blank:],]+" } NF >= 4 && NF <= 6' infile
It yields:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .
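The field-counting idea can also be sketched in Python. The function names below are hypothetical. Instead of adding one to the bounds for the final period, this version drops the period token before counting, so the limits are the word counts themselves.

```python
import re

def count_word_tokens(line):
    # Split on runs of blanks and commas, much like FS = "[[:blank:],]+",
    # then ignore empty strings and the lone "." token at the end.
    fields = [f for f in re.split(r"[ \t,]+", line.strip()) if f]
    return sum(1 for f in fields if f != ".")

def has_token_range(line, low=3, high=5):
    # True when the number of word tokens lies in the inclusive range.
    return low <= count_word_tokens(line) <= high

for line in ["aa bb .", "aaaa bb'b cccc , dddd .", "aa bb'b cc dd e f ."]:
    print(has_token_range(line), line)
```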

Here's a sed example to add to the mix:
sed -n "/^\([a-zA-Z',]* \)\{3,5\}\.$/p"
Output:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .

Another possibility:
awk '/aaa+/' file
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .

Related

sed command to change c-style multiline comments to c++ style multiline comments

This is the script:
sed 's\[/][*]\//\g ; s/[*][/]\s\+/\n/g ; s/[*][/]/\n/g' inputFile > outputFile
This is input file:
aaa /* bbb */ ccc /* ddd */ eee /* fff
ggg */ hhh /* iii */ jjj
kkk
/* lll
mmm
nnn */
ooo
This is output file:
aaa // bbb
ccc // ddd
eee // fff
ggg
hhh // iii
jjj
kkk
// lll
mmm
nnn
ooo
Expected output:
aaa // bbb
ccc // ddd
eee // fff
// ggg
hhh // iii
jjj
kkk
// lll
// mmm
// nnn
ooo
The script I am currently using cannot handle multiline comments. Is there any way to achieve this with sed?

If perl is your option, would you please try the following:
perl -0777pe 's#/\*\s*(.+?)\s*\*/\s*#join("\n", map {"// " . $_ } split("\n", $1)) . "\n"#sge' inputFile
Output:
aaa // bbb
ccc // ddd
eee // fff
// ggg
hhh // iii
jjj
kkk
// lll
// mmm
// nnn
ooo
The -0777 option tells perl to slurp the whole file at once.
The -p option loops over the input and prints the result; -e supplies the code on the command line.
The s switch to the s/pattern/replacement/ operator makes a dot match a newline character.
The g switch applies the substitution to every comment, not just the first one.
The e switch to the s/pattern/replacement/ operator enables the replacement to be a perl expression.
The join .. map .. split() functions prefix every line of a multiline comment with //.
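The same approach (match each comment across newlines, then rebuild it line by line) might look like this in Python. The function name c_to_cpp_comments is an assumption for illustration; the regular expression is carried over from the perl one-liner.

```python
import re

def c_to_cpp_comments(text):
    def convert(match):
        # Prefix every line of the comment body with "// " and end the
        # comment with a newline, as join/map/split do in the perl version.
        body = match.group(1)
        return "\n".join("// " + line for line in body.split("\n")) + "\n"

    # re.DOTALL plays the role of perl's s switch: "." also matches newlines.
    return re.sub(r"/\*\s*(.+?)\s*\*/\s*", convert, text, flags=re.DOTALL)

sample = (
    "aaa /* bbb */ ccc /* ddd */ eee /* fff\n"
    "ggg */ hhh /* iii */ jjj\n"
    "kkk\n"
    "/* lll\n"
    "mmm\n"
    "nnn */\n"
    "ooo\n"
)

print(c_to_cpp_comments(sample), end="")
```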

This might work for you (GNU sed):
sed -E ':a;\#/\*.*\*/#{s#/[*] ?#// #;s# \*/ ?#\n#;P;D};\#/\*#{N;s#(.* ?)\n#\1\n// #;ba}' file
If the current line contains both a starting and an ending comment delimiter: replace the starting delimiter with // and the ending delimiter with \n, print the first line in the pattern space, remove it and then repeat.
Otherwise, if the current line contains only a starting comment delimiter: append the following line, append // after the introduced newline and repeat (see above).
If there are no comment delimiters in the current line, print the line as normal.
N.B. The alternate matching delimiter \#...#, and likewise s#...#...# for substitution, avoids confusion because the comment delimiters contain /'s. The markdown may be acting up with the above solution owing to the nature of the *'s in the above text. Also, for formatting purposes, spaces have been removed from and added to the result as required.

Parse log with mixed single-line and multi-line content

I need to extract messages from a log file. Messages are logged in two different ways: in a single line, like this:
2018-09-21 10:03:54,145 <message-content>
2018-09-21 10:05:02,008 <next-message-content>
or in several lines like this:
2018-09-21 10:03:54,145 <message-content-part 1>
<message-content-part 2>
...
<message-content-part n>
2018-09-21 10:04:12,198 <next-message-content>
Each message starts with header \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}.
There is no specific ending tag for a message.
I want to extract all messages, both single- and multi-line, with specific text.
For example, the output of search for "XYZ" could be like this:
2018-09-21 10:03:54,145 AAA BBB XYZ CCC
2018-09-21 10:10:55,347 BBB
CCC XYZW
DDD
2018-09-21 10:12:56,060 EEE XYZFFF
GGG

You may use
cat file | \
sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/\n\n&/' | \
awk 'BEGIN { RS = "\n\n"; ORS=""} /XYZ/ {print}'
Details
sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/\n\n&/' - This sed command finds lines that start with the datetime header and prepends a double newline to them.
awk 'BEGIN { RS = "\n\n"; ORS=""} /XYZ/ {print}' - This awk command splits the input into records at "\n\n" (RS is the record separator) and prints only the records that contain the substring XYZ. Because ORS (the output record separator) is empty, the inserted \n\n is not printed back. Note that a multi-character RS is supported by awks such as GNU awk, while POSIX leaves the result unspecified.
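As an alternative to the two-stage pipeline, the messages can be cut out in one pass with a lookahead. The following Python sketch assumes the whole log fits in memory; the names HEADER, MESSAGE and messages_containing are made up for this example.

```python
import re

# Timestamp header as described in the question.
HEADER = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}"

# One message = a header at the start of a line, then everything up to the
# next header at the start of a line, or up to the end of the text.
MESSAGE = re.compile(
    r"^" + HEADER + r".*?(?=^" + HEADER + r"|\Z)",
    re.MULTILINE | re.DOTALL,
)

def messages_containing(log_text, needle):
    # Keep only the messages (single- or multi-line) that contain needle.
    return [m.group(0) for m in MESSAGE.finditer(log_text)
            if needle in m.group(0)]

log = (
    "2018-09-21 10:03:54,145 AAA BBB XYZ CCC\n"
    "2018-09-21 10:03:54,145 AAA BBB PPP CCC\n"
    "2018-09-21 10:10:55,347 BBB\n"
    "CCC XYZW\n"
    "DDD\n"
    "2018-09-21 10:12:56,060 EEE XYZFFF\n"
    "GGG\n"
)

print("".join(messages_containing(log, "XYZ")), end="")
```

Because the lookahead also accepts the end of the text, the last message in the file is examined like any other.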

Using perl. I added 2 more messages in the sample input, which should not appear in the output.
> cat pattern_xyz.dat
2018-09-21 10:03:54,145 AAA BBB XYZ CCC
2018-09-21 10:03:54,145 AAA BBB PPP CCC
2018-09-21 10:10:55,347 BBB
CCC XYZW
DDD
2018-09-21 10:12:56,060 EEE XYZFFF
GGG
2018-09-21 10:10:55,347 BBB
CCC QQQW
DDD
>
> cat pattern_xyz.pl
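#!/usr/bin/perl
use strict;
use warnings;

# Slurp the whole log file named on the command line.
my $file = $ARGV[0];
open(my $fh, '<', $file) or die "Cannot open $file: $!";
my $x = do { local $/; <$fh> };
close($fh);

# Split before every line that starts with a date, so each chunk is one
# complete message (including the last one), then print the matching chunks.
foreach my $content (split /^(?=\d{4}-\d{2}-\d{2} )/m, $x) {
    print $content if $content =~ /XYZ/;
}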
> pattern_xyz.pl pattern_xyz.dat #executing script
2018-09-21 10:03:54,145 AAA BBB XYZ CCC
2018-09-21 10:10:55,347 BBB
CCC XYZW
DDD
2018-09-21 10:12:56,060 EEE XYZFFF
GGG

Perl: Find line with matching string(str1) and only in those matched lines, replace one string(str2) with specific string(str3)

I have one file a.txt with contents as follow:
a aa aaa
b bb value = 11 xyz
c cc ccc
b bb value = 222 abc
d dd ddd
I have to find the string "bb". Once a matching line is found, I have to replace "value = xxx" with "value = 77".
Here, xxx is an integer with any number of digits (11 and 222 in the case above).
I have tried below perl command:
perl -n -e 'print; if (m/\bbb\b/) { s/value = (\d+)/value = 77/g; print; }' < a.txt
It gives me output as:
a aa aaa
b bb value = 11 xyz
b bb value = 77 xyz
c cc ccc
b bb value = 222 abc
b bb value = 77 abc
d dd ddd
Here I am looking for an in-place replacement instead of an additional line with the required changes.
Basically, I am expecting the following output:
a aa aaa
b bb value = 77 xyz
c cc ccc
b bb value = 77 abc
d dd ddd
Can anyone help me update my command?
One more quick question: can I change the above command so that it searches for the string "bb" and, only on the matching lines, removes the string "value = xxx" completely,
where xxx is an integer with any number of digits?

You print twice when you have a match. If you don't want to do that, don't do that :)
perl -n -e 'if (m/\bbb\b/) { s/value = (\d+)/value = 77/g; } print' < a.txt
Cleaned up:
perl -pe's/value = \K\d+/77/g if /\bbb\b/' a.txt
Based on the sample data, you might even be able to use
perl -pe's/\bbb\b.*value = \K\d+/77/' a.txt
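For comparison, here is a small Python sketch of the same "edit only the matching lines" idea. All function names are hypothetical. It also covers the follow-up question about removing "value = xxx"; that part assumes one trailing space should go away together with the removed text.

```python
import re

def _edit_lines(text, word, pattern, repl):
    # Apply the substitution only to lines that contain word as a whole word.
    word_re = re.compile(r"\b%s\b" % re.escape(word))
    out = []
    for line in text.splitlines(keepends=True):
        if word_re.search(line):
            line = re.sub(pattern, repl, line)
        out.append(line)
    return "".join(out)

def set_value(text, word="bb", new_value="77"):
    # Keep "value = " and swap only the digits, like \K in the perl version.
    return _edit_lines(text, word, r"(value = )\d+",
                       lambda m: m.group(1) + new_value)

def drop_value(text, word="bb"):
    # Remove "value = <digits>" plus one following space, if present.
    return _edit_lines(text, word, r"value = \d+ ?", "")

sample = (
    "a aa aaa\n"
    "b bb value = 11 xyz\n"
    "c cc ccc\n"
    "b bb value = 222 abc\n"
    "d dd ddd\n"
)

print(set_value(sample), end="")
print(drop_value(sample), end="")
```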

This works:
perl -n -e 'if (m/\bbb\b/) { s/value = (\d+)/value = 77/g; print; } else {print}' < a.txt
Put one print in the if branch and one in the else branch.
Output:
$ perl -n -e 'if (m/\bbb\b/) { s/value = (\d+)/value = 77/g; print; } else {print}' < a.txt
a aa aaa
b bb value = 77 xyz
c cc ccc
b bb value = 77 abc
d dd ddd

how to use regular expression in awk or sed, for find all homopolymers in DNA sequence?

Background
Homopolymers are sub-sequences of DNA made of consecutive identical bases, like AAAAAAA. An example in Python that extracts them:
import re
DNA = "ACCCGGGTTTAACCGGACCCAA"
homopolymers = re.findall('A+|T+|C+|G+', DNA)
print(homopolymers)
['A', 'CCC', 'GGG', 'TTT', 'AA', 'CC', 'GG', 'A', 'CCC', 'AA']
my effort
I wrote a gawk script that solves the problem, but without using regular expressions:
echo "ACCCGGGTTTAACCGGACCCAA" | gawk '
BEGIN{
FS=""
}
{
homopolymer = $1;
base = $1;
for(i=2; i<=NF; i++){
if($i == base){
homopolymer = homopolymer""base;
}else{
print homopolymer;
homopolymer = $i;
base = $i;
}
}
print homopolymer;
}'
output
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
question
How can I use regular expressions in awk or sed to get the same result?

grep -o will get you that in one-line:
echo "ACCCGGGTTTAACCGGACCCAA"| grep -ioE '([A-Z])\1*'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
Explanation:
([A-Z]) # matches and captures a letter in matched group #1
\1* # matches 0 or more of captured group #1 using back-reference \1
sed is not the best tool for this but since OP has asked for it:
echo "ACCCGGGTTTAACCGGACCCAA" | sed -r 's/([A-Z])\1*/&\n/g'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
PS: This is gnu-sed.
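The back-reference pattern carries over to Python as well. A minimal sketch, with a made-up function name; it uses re.finditer because re.findall would return only the captured first letter of each run when the pattern contains a group. Unlike grep -i, this version is case-sensitive.

```python
import re

def homopolymers(dna):
    # ([A-Za-z]) captures one letter; \1* extends the match for as long as
    # that same letter repeats. group(0) is the whole run.
    return [m.group(0) for m in re.finditer(r"([A-Za-z])\1*", dna)]

print(homopolymers("ACCCGGGTTTAACCGGACCCAA"))
```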

Try using split and just comparing.
echo "ACCCGGGTTTAACCGGACCCAA" | awk '{ split($0, chars, "")
for (i=1; i <= length($0); i++) {
if (chars[i]!=chars[i+1])
{
printf("%s\n", chars[i])
}
else
{
printf("%s", chars[i])
}
}
}'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
EXPLANATION
split divides the one-line string passed to awk and stores each character in the array chars[]. We then walk through the array and compare each character with the next one: if (chars[i]!=chars[i+1]). If the next character is different, the run ends, so we print the character followed by \n (a newline). If it is the same, we print the character without a newline and move on to the next one.

sed - Replace new line characters not followed by 5-digit number

I have a csv file with some (dirty) DB schema.
example:
10391,0,3,4,12,44 --ok
10391,0,3,4, --not ok
12,44 --not ok
10391,0,3,4,12,44 --ok
I want to write a sed script that replaces newline characters (those not followed by a 5-digit number) with spaces.
I wrote this one, but it does not work correctly for me:
sed 's/\n\([0-9]{1,4}\)/ \1/g'
running on this sample
11111 sss
22222 aaa
3333 aaa
333 sss
22 sss
1 sss
should produce
11111 sss
22222 aaa 3333 aaa 333 sss 22 sss 1 sss
thanks to anyone who will be able to help

Or use a Perl One-Liner
perl -0777 -pe 's/\n(?!\d{5}\b)/ /g' yourfile
Explanation
\n matches the newline
(?!\d{5}\b) asserts that what follows is not five digits and a word boundary
we insert a space
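The negative lookahead works the same way in Python's re module. In this sketch the function name join_broken_rows is an assumption, and one extra lookahead, (?!\Z), is added on top of the perl pattern so that the final newline of the file is kept rather than turned into a trailing space.

```python
import re

def join_broken_rows(text):
    # Replace a newline with a space unless it is followed by a 5-digit
    # number (start of a valid row) or it is the last character of the text.
    return re.sub(r"\n(?!\d{5}\b)(?!\Z)", " ", text)

sample = (
    "11111 sss\n"
    "22222 aaa\n"
    "3333 aaa\n"
    "333 sss\n"
    "22 sss\n"
    "1 sss\n"
)

print(join_broken_rows(sample), end="")
```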

Using awk:
awk -v ORS= 'NR > 1 { printf /^[0-9]{5} / ? "\n" : " " } 1
END { if (NR) printf "\n" }' file
Output:
11111 sss
22222 aaa 3333 aaa 333 sss 22 sss 1 sss

awk '{printf "%s%s", (NR==1 ? "" : ($0 ~ /^[0-9]{5} / ? "\n" : " ")), $0} END {print ""}'
should work for your example:
kent$ echo "11111 sss
22222 aaa
3333 aaa
333 sss
22 sss
1 sss"|awk '{printf "%s%s", (NR==1 ? "" : ($0 ~ /^[0-9]{5} / ? "\n" : " ")), $0} END {print ""}'
11111 sss
22222 aaa 3333 aaa 333 sss 22 sss 1 sss

We Keep Coding
