I have errors in log file like this:
[11:16:16 31/10] 2428 ERROR: Wide character in subroutine entry at /home/site/site/app/lib/SC/Contro
ller/Client/Sites.pm line 1584.
Stack:
[/xxx:1584]
[/xxx:70]
[/xxx:133]
I want to put this error to some file like:
cat apache.error.log | grep "query NS" > apache.error.log-NS
But how may I do that for multiline log message?
I have found this solution:
cat apache.error.log | grep -Pzo '^.*?Wide character.*?\nStack.*?(\n(?=\s).*?)*$'
Where (\n(?=\s).*?)* means:
(...)* Find multiple times
\n next lines
(?=\s) which starts from whitespace character
.*? until end of that line (notice $ character at whole regex)
No offense, but I think #Eugen's solution is too generic and since this is a log file it might be possible that the reg ex matches unwanted lines too.
So, make sure that we fetch the exact line, This is what i think should be the reg ex. Feel free to comment!
^\[\d{2}\:\d{2}\:\d{2}\s\d{2}\/\d{2}\]\s\d+\sERROR\:\sWide\scharacter\sin\ssubroutine\sentry\sat\s[a-zA-Z\/\.]+\s.*?\nStack.*?(\n(?=\s).*?)*$
Related
I have some 'fastq' format DNA sequence files (basically just text files) like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#
+
#
+
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
My ultimate goal is to turn these into 'fasta' format files, but to do that I need to get rid of the two empty sequences in the middle.
EDIT
The desired output would look like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
All of the dedicated software I tried (Biopython, stand alone programs, perl scripts posted by others) crash at the empty sequences. This is really just a problem of searching for the string #\n+ and replacing it with nothing. I googled this and read several posts and tried about a million options with sed and couldn't figure it out. Here are some things that didn't work:
sed s/'#'/,/'+'// test.fastq > test.fasta
sed s/'#,+'// test.fastq > test.fasta
Any insights would be greatly appreciated.
PS. I've got a Mac.
Try:
sed "/^[#+]*$/d" test.fastq > test.fasta
The /d option tells sed to "delete" the matching line (i.e. not print it).
^ and $ mean "start of string" and "end of string" respectively, i.e. the line must be an exact match.
So, the above command basically says:
Print all lines that do not only contain # or +, and write the result to test.fasta.
Edit: I misunderstood the question slightly, sorry. If you want to only remove pairs of consecutive lines like
#
+
then you need to perform a multi-line search and replace.
Although this can be done with sed, it's perhaps easier to use something like a perl script instead:
perl -0pe 's/^#\n\+\n//gm' test.fastq > test.fasta
The -0 option turns Perl into "file slurp" mode, where Perl reads the entire input file in one shot (instead of line by line). This enables multi-line search and replace.
The -pe option allows you to run Perl code (pattern matching and replacement in this case) and display output from the command line.
^#\n\+\n is the pattern to match, which we are replacing with nothing (i.e. deleting).
/gm makes the substitution multiline and global.
You could also instead pass -i as the first parameter to perl, to edit the file inline.
This may not be the most elegant solution in the world, but you can use tr to replace the \n with a null character and back.
cat test.fastq | tr '\n' '\0' | sed 's/#\x0+\x0//g' | tr '\0' '\n' > test.fasta
Try this:
sed '/^#$/{N;/\n+$/d}' file
When # is found, next line is appended to the pattern space with N.
If $ is found in next line, the d command deletes both lines.
I want to print lines that contains single word only.
For example:
this is a line
another line
one
more
line
last one
I want to get the ones with single word only
one
more
line
EDIT: Guys, thank you for answers. Almost all of the answers work for my test file. However I wanted to list single lines in bash history. When I try your answers like
history | your posted commands
all of them below fails. Some only prints some numbers (might line numbers?)
You want to get all those commands in history that contain just one word. Considering that history prints the number of the command as a first column, you need to match those lines consisting in two words.
For this, you can say:
history | awk 'NF==2'
If you just want to print the command itself, say:
history | awk 'NF==2 {print $2}'
To rehash your problem, any line containing a space or nothing should be removed.
grep -Ev '^$| ' file
Your problem statement is unspecific on whether lines containing only punctuation might also occur. Maybe try
grep -Ex '[A-Za-z]+' file
to only match lines containing only one or more alphabetics. (The -x option implicitly anchors the pattern -- it requires the entire line to match.)
In Bash, the output from history is decorated with line numbers; maybe try
history | grep -E '^ *[0-9]+ [A-Za-z]+$'
to match lines where the line number is followed by a single alphanumeric token. Notice that there will be two spaces between the line number and the command.
In all cases above, the -E selects extended regular expression matching, aka egrep (basic RE aka traditional grep does not support e.g. the + operator, though it's available as \+).
Try this:
grep -E '^\s*\S+\s*$' file
With the above input, it will output:
one
more
line
If your test strings are in a file called in.txt, you can try the following:
grep -E "^\w+$" in.txt
What it means is:
^ starting the line with
\w any word character [a-zA-Z0-9]
+ there should be at least 1 of those characters or more
$ line end
And output would be
one
more
line
Assuming your file as texts.txt and if grep is not the only criteria; then
awk '{ if ( NF == 1 ) print }' texts.txt
If your single worded lines don't have a space at the end you can also search for lines without an empty space :
grep -v " "
I think that what you're looking for could be best described as a newline followed by a word with a negative lookahead for a space,
/\n\w+\b(?! )/g
example
I have many files from which I need to get information.
Example of my files:
first file content:
"test This info i need grep</singleline>"
and
second file content (with two lines):
"test This info=
i need grep too</singleline>"
in results I need grep this text: from first file - "This info i need grep" and from second file - "This info= i need grep too"
In first file I use:
grep -o 'test .*</singleline>' * | sed -e 's/test \(.*\)<\/singleline>/\1/'
and successfully get "This info i need grep" but I can not get the information from the second file by using the same command.
Please help rewrite the command or write what the other.
Or, if you insist to use grep, you can:
grep -Pzo 'test(\n|.)*(?=</singleline>)' test.txt
To understand the meaning of each flag, use grep --help:
-P, --perl-regexp
PATTERN is a Perl regular expression
-o, --only-matching
show only the part of a line matching PATTERN
-z, --null-data
a data line ends in 0 byte, not newline
I'd use pcregrep, which can match multiline regexes:
pcregrep -Mo 'test \K((?s).)*?(?=</singleline>)' filename
The tricks are:
-M allows pcregrep to match on more than one line,
-o makes it print only the match,
\K throws away the part of the match that comes before it,
(?=</singleline>) is a lookahead term that matches an empty string if (and only if) it is followed by </singleline>, and
((?s).)*? to match any characters non-greedily, which is to say that if you have several occurrences of </singleline> in the file, it will match until the closest rather than the furthest. If this is not desired, remove the ?. (?s) enables the s option locally for the term to make . match newlines in it; it wouldn't do that by default.
Thanks to #CasimiretHippolyte for pointing out the ((?s).) alternative to (.|\n).
It looks like you're parsing quoted-printable encoded text, where a "soft" line break (one that is an artifact from fixed-line-width formatting) is indicated with a line-terminating = (directly before the \n).
Since in a later comment you also expressed the desire to print each match as a single line, I suggest the following 2-pass appraoch:
use awk to remove the soft line breaks
then use grep on the result
awk '/=$/ { printf "%s", substr($0, 1, length($0)-2); next } 1' file |
grep -Po 'test .*?(?=</singleline>)'
Tip of the hat to Wintermute's helpful answer for the non-greedy quantifier, *?, and both Wintermute's and Maroun Maroun's helpful answer for the positive look-ahead assertion, (?=...).
Not that the awk command removes the line-ending = (along with the newline); replace the substr call with just $0 to retain it.
Since strings of interest are first converted back their original single-line representations:
The matches are printed in their original form.
You can use regular (GNU) grep with line-by-line matching; contrast this with
needing to read the entire file at once, as in Maroun Maroun's helpful answer.
Note that, as of this writing, * must be replaced with *? in his answer to work correctly work in files with multiple matches.
needing to install another utility, pcregrep, as in Wintermute's helpful answer.
additionally, the matches would have to be cleaned up to be single-line (something you didn't originally state as a requirement).
I am trying to extract a single string out of a line having many segments in a key-value order, but I don't get it as it matches much more than I want to.
This is my example line:
|SEGA~1~MAGIC~DESCRIPTION~~~M~TEST~|SEGB~34~12.11.2011~3~M~O~|SEGC~HELLO~WORLD~|
This lines is a kind concatenation of many segments into one line. Now I want to extract the the string at index 2 in the segment starting with SEGA.
So what I do is grep for this:
egrep -o 'SEGA(.*?)\~\|'
But it gives me the whole line, sometimes it gives me only the segment I am looking for. With the match I would split that segment by using the ~ character and take the third one.
Since I use .*? with the question mark I expected egrep to only match the content between SEGA and the very first occurrence of ~| which is right before SEGB and not the one at the end of SEGC or SEGB.
How can I tell grep to search for SEGA and give the whole content starting right after SEGA until THE VERY FIRST occurrence of ~|
You can use the -P(--perl-regexp) option in grep:
grep -oP '(?<=SEGA).*?(?=~\|)' file
If you want to include the trailing ~|, please remove the lookahead (?=...).
I think .*? (lazy) does not exit in egrep.
I'd suggest you break the line into lines on | and then grep from those:
$ echo "|SEGA~1~MAGIC~DESCRIPTION~~~M~TEST~|SEGB~34~12.11.2011~3~M~O~|SEGC~HELLO~WORLD~|" | sed -e 's/|/\n/g' | grep ^SEGA
SEGA~1~MAGIC~DESCRIPTION~~~M~TEST~
Suppose I had a file that had a lot of text in it. At some point I wanted to search for the following text in that file. But not exactly this text. In fact I wanted to find any text "paragraph," that could be multiple lines, that started with a line that started with "service surfaceflinger" then had a line that ended with "zygote".
What's the Linux regex for a search that would encounter this?
service surfaceflinger /system/bin/surfaceflinger
class main
user system
group graphics
onrestart restart zygote
Using awk:
awk '/^service surfaceflinger/,/zygote$/' file_with_a_lot_of_text
Using sed:
sed -n '/^service surfaceflinger/,/zygote$/p' file_with_a_lot_of_text
Recommended reading: sed & awk.
Try something like this:
(?ms)^service surfaceflinger.*?zygote$
where:
(?ms) turns on multi-line and dot-all mode (. matches any character)
^ matches start of line
.*? matches any number of characters lazily
$ matches end of line