Ignore long lines in silversearcher - ag

Right now I am using:
ag sessions --color | cut -b1-130
But this will cause color artifacts if the search match is cut by the cut command.
Silversearcher has this in the docs:
--print-long-lines
Print matches on very long lines (> 2k characters by default).
Can I change 2k to something else? (120 for me, because honestly, none of the code I work with has real lines longer than that.)

Very strangely, the documented --print-long-lines actually does nothing at all, yet there is a working switch for this: -W NUM / --width NUM, which is not documented at all. See https://github.com/ggreer/the_silver_searcher/pull/720

I can think of three options:
1. Just print the match itself instead of the whole line, using the -o option: ag --color -o <pattern>
2. Use less instead of cut; it nicely chops long lines at the screen width with the -S option (chop long lines) and handles the color escape sequences with the -R option: ag --color <pattern> | less -R -S
3. Use something like sed or awk instead of cut: ag --color <pattern> | sed -E "s/(.{$COLUMNS}).*$/\1/"
which will cut the returned line at the limit of your screen width. Of course, if you're determined to chop at 120 columns, you can: ag --color <pattern> | sed -E "s/(.{120}).*$/\1/"
This last option doesn't prevent chopping in the middle of a color escape sequence; if you're really hellbent, you can modify the sed search pattern to skip over color escape sequences -- already answered on SO. That said, I don't see the point of doing so, given how easy and correct option 1 is.
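As a plain-text sanity check of the sed truncation approach (a minimal sketch without any color codes; the 120-column figure is just the value discussed above):

```shell
# Build a 200-character line, then chop it to 120 columns with sed,
# mimicking what option 3 does to each output line of ag.
long=$(printf 'a%.0s' $(seq 200))
echo "$long" | sed -E 's/(.{120}).*$/\1/'
```

The result is the first 120 characters of the line; everything past the match of the capture group is discarded.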

ag --width 400 string dir/
# In .bash_aliases (s is for short)
alias ags='ag --width 400'
Truncates lines longer than 400 chars (the rest of the line is replaced with [...]).

Related

Grep to select the searched-for regexp surrounded on either/both sides by a certain number of characters? [duplicate]

I want to run ack or grep on HTML files that often have very long lines. I don't want to see very long lines that wrap repeatedly. But I do want to see just that portion of a long line that surrounds a string that matches the regular expression. How can I get this using any combination of Unix tools?
You could use the grep options -oE, possibly in combination with changing your pattern to ".{0,10}<original pattern>.{0,10}" in order to see some context around it:
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e., force grep to behave as egrep).
For example (from @Renaud's comment):
grep -oE ".{0,10}mysearchstring.{0,10}" myfile.txt
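To see the effect on made-up sample text (the search string is the hypothetical one from above; the text is piped in here rather than read from a file):

```shell
# Print the match plus up to 10 characters of context on each side.
printf '%s\n' 'aaaaaaaaaaaaaaaaaaaa mysearchstring bbbbbbbbbbbbbbbbbbbb' |
  grep -oE ".{0,10}mysearchstring.{0,10}"
```

The output is a 34-character snippet: 10 characters of context, the 14-character match, and 10 more characters of context.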
Alternatively, you could try -c:
-c, --count
Suppress normal output; instead print a count of matching lines
for each input file. With the -v, --invert-match option (see
below), count non-matching lines.
Pipe your results through cut. I'm also considering adding a --cut switch so you could say --cut=80 and only get 80 columns.
You could use less as a pager for ack and chop long lines: ack --pager="less -S" This retains the long line but leaves it on one line instead of wrapping. To see more of the line, scroll left/right in less with the arrow keys.
I have the following alias setup for ack to do this:
alias ick='ack -i --pager="less -R -S"'
grep -oE ".{0,10}error.{0,10}" mylogfile.txt
In the unusual situation where you cannot use -E, use basic grep with escaped braces instead: grep -o ".\{0,10\}error.\{0,10\}" mylogfile.txt
Explanation:
cut -c 1-100
gets characters from 1 to 100.
The Silver Searcher (ag) supports this natively via the --width NUM option. It replaces the rest of any longer line with [...].
Example (truncate after 120 characters):
$ ag --width 120 '@patternfly'
...
1:{"version":3,"file":"react-icons.js","sources":["../../node_modules/@patternfly/ [...]
In ack3, a similar feature is planned but currently not implemented.
Taken from: http://www.topbug.net/blog/2016/08/18/truncate-long-matching-lines-of-grep-a-solution-that-preserves-color/
The suggested approach ".{0,10}<original pattern>.{0,10}" is perfectly good, except that the highlighting color often gets messed up. I've created a script with similar output, but the color is also preserved:
#!/bin/bash
# Usage:
# grepl PATTERN [FILE]
# how many characters around the searching keyword should be shown?
context_length=10
# What is the length of the control character for the color before and after the
# matching string?
# This is mostly determined by the environment variable GREP_COLORS.
control_length_before=$(($(echo a | grep --color=always a | cut -d a -f '1' | wc -c)-1))
control_length_after=$(($(echo a | grep --color=always a | cut -d a -f '2' | wc -c)-1))
grep -E --color=always "$1" $2 |
grep --color=none -oE \
".{0,$(($control_length_before + $context_length))}$1.{0,$(($control_length_after + $context_length))}"
Assuming the script is saved as grepl, then grepl pattern file_with_long_lines should display the matching lines but with only 10 characters around the matching string.
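The two control-length computations can be sanity-checked on their own (this assumes GNU grep; the exact byte counts depend on your GREP_COLORS setting):

```shell
# How many bytes does grep emit before and after a colored match?
# Searching for "a" in the line "a" means field 1 (before the first "a")
# is exactly the color prefix and field 2 is exactly the color suffix.
before=$(($(echo a | grep --color=always a | cut -d a -f 1 | wc -c) - 1))
after=$(($(echo a | grep --color=always a | cut -d a -f 2 | wc -c) - 1))
echo "prefix: $before bytes, suffix: $after bytes"
```

Both values should be non-zero whenever coloring is active.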
I put the following into my .bashrc:
grepl() {
$(which grep) --color=always "$@" | less -RS
}
You can then use grepl on the command line with any arguments that are available for grep. Use the arrow keys to see the tail of longer lines. Use q to quit.
Explanation:
grepl() {: Define a new function that will be available in every (new) bash console.
$(which grep): Get the full path of grep. (Ubuntu defines an alias for grep that is equivalent to grep --color=auto. We don't want that alias but the original grep.)
--color=always: Colorize the output. (--color=auto from the alias won't work since grep detects that the output is put into a pipe and won't color it then.)
"$@": Put all arguments given to the grepl function here.
less: Display the lines using less
-R: Show colors
-S: Don't break long lines
Here's what I do:
function grep () {
tput rmam;
command grep "$@";
tput smam;
}
In my .bash_profile, I override grep so that it automatically runs tput rmam before and tput smam after, which disables wrapping and then re-enables it.
ag can also take the regex trick, if you prefer it:
ag --column -o ".{0,20}error.{0,20}"

Line-insensitive pattern-matching – How can some context be displayed?

I'm looking for a technique to search a file for a pattern (typically a phrase) that may span multiple lines, and print the match with some surrounding context on one line. The file's lines may be too long or too short for a sensible amount of context; my concern is not to print a single line of the file, as you might do with grep, but rather to fit the output onto a single line of my terminal.
Basic requirements
Show a specified number of characters before and after the match, even if it straddles lines.
Show newlines as ‘\n’ to prevent flooding the terminal with whitespace if there are many short lines.
Prefix output line with line and column number of the start of the match.
Preferably a sed oneliner.
So far, I'm assuming that the pattern has a constant length shorter than the width of the terminal, which is okay and very useful for most phrases I might want to search for.
Further considerations
I would be interested to see how the following could also be achieved using sed or the likes:
Prefix output line with line and column number range of the match.
Generalise for variable length patterns, truncating the middle of the match to ‘[…]’ if too long.
Can I avoid using something like ‘[ \n]’ between words in a phrase regex on a file that has been ‘hard-wrapped’ using newlines, without altering what's printed?
Using the output of stty size to dynamically determine the terminal width may be useful, though I'd probably prefer to leave it static in case I want to resize the terminal or use it from screen attached from terminals of different sizes.
Examples
The basic idea for 10 characters of context would be something like:
‘excessively long line with match in the middle\n’ → ‘line with match in the mi’
‘short\nlines\n\nmatch\nlots\nof\nshort\nlines\n’ → ‘rt\nlines\n\nmatch\nlots\nof\ns’
Here's a command to return the 20 characters surrounding a pattern, spanning newlines and including them as a character:
$ input="test.txt"
$ pattern="match"
$ tr '\n' '~' < "$input" | grep -o ".\{10\}${pattern}.\{10\}" | sed 's/~/\\n/g'
line with match in the mi
rt\nlines\n\nmatch\nlots\nof\ns
With row number of the match as well:
$ paste <(grep -n "${pattern}" "$input" | cut -d: -f1) \
<(tr '\n' '~' < "$input" | grep -o ".\{10\}${pattern}.\{10\}" | sed 's/~/\\n/g')
1 line with match in the mi
5 rt\nlines\n\nmatch\nlots\nof\ns
I realise this doesn't quite fulfill all of your basic requirements, but am not good enough with awk to do better (guess this is technically possible in sed, but I don't want to think about what it would look like).
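Putting the pipeline together as a runnable check (test.txt is created here with the two example inputs from above):

```shell
# Two test inputs: one long line, and many short lines around the match.
printf 'excessively long line with match in the middle\nshort\nlines\n\nmatch\nlots\nof\nshort\nlines\n' > test.txt
# Flatten newlines to ~, grab 10 chars of context each side, restore \n markers.
tr '\n' '~' < test.txt | grep -o ".\{10\}match.\{10\}" | sed 's/~/\\n/g'
# line with match in the mi
# rt\nlines\n\nmatch\nlots\nof\ns
```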

Regular expressions with grep

So I have a bunch of data that all looks like this:
janitor#1/2 of dorm#1/1
president#4/1 of class#2/2
hunting#1/1 hat#1/2
side#1/2 of hotel#1/1
side#1/2 of hotel#1/1
king#1/2 of hotel#1/1
address#2/2 of girl#1/1
one#2/1 in family#2/2
dance#3/1 floor#1/2
movie#1/2 stars#5/1
movie#1/2 stars#5/1
insurance#1/1 office#1/2
side#1/1 of floor#1/2
middle#4/1 of December#1/2
movie#1/2 stars#5/1
one#2/1 of tables#2/2
people#1/2 at table#2/1
Some lines have prepositions, others don't so I thought I could use regular expressions to clean it up. What I need is each noun, the # sign and the following number on its own line. So for example, the first lines of output should look like this in the final file:
janitor#1
dorm#1
president#4
etc...
The list is stored in a file called NPs. My code to do this is:
cat NPs | grep -E '\b(\w*[#][1-9]).' >> test
When I open test, however, it's the exact same as the input file. Any input as to what I'm missing? It doesn't seem like it should be a hard operation, so maybe I'm missing something about syntax? I'm using this command from a shell script that is called in bash.
Thanks in advance!
This should do what you need.
The -o option will show only the part of a matching line that matches the PATTERN.
grep -Eo '[a-z#]+[1-9]' NPs > test
or even the -P option, which interprets the PATTERN as a Perl regular expression
grep -Po '[\w#]*(?=/)' NPs > test
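For instance, the -Eo variant on the first two sample lines (piped in here rather than read from the NPs file):

```shell
# Extract each noun#number token; / and spaces are not in the character
# class, so they terminate a match.
printf 'janitor#1/2 of dorm#1/1\npresident#4/1 of class#2/2\n' |
  grep -Eo '[a-z#]+[1-9]'
# janitor#1
# dorm#1
# president#4
# class#2
```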
Using grep:
$ grep -o "\w*[#]\w*" inputfile
janitor#1
dorm#1
president#4
class#2
hunting#1
hat#1
side#1
hotel#1
side#1
hotel#1
king#1
hotel#1
address#2
girl#1
one#2
family#2
dance#3
floor#1
movie#1
stars#5
movie#1
stars#5
insurance#1
office#1
side#1
floor#1
middle#4
December#1
movie#1
stars#5
one#2
tables#2
people#1
table#2
grep variants extract entire lines from the text if they match the pattern. If you need to modify the lines, you should use sed instead, like
cat NPs | sed 's/^\(\b\w*[#][1-9]\).*$/\1/g'
You need sed, not grep. (Or awk, or perl.) It looks like this would do what you want:
cat NPs | sed 's?/.*??'
or simply
sed 's?/.*??' NPs
s means "substitute". The next character is the delimiter between regular expressions. Usually it's "/", but since you need to search for "/", I used "?" instead. "." refers to any character, and "*" says "zero or more of what preceded me". Whatever is between the last two delimiters is the replacement string. In this case it's empty, so you're replacing "/" followed by zero or more of any character, with the empty string.
EDIT: Oh, I see now that you wanted to extract the last item on the line, too. Well, I'm sure that others' suggested regexps would work. If it were my problem, I'd probably filter the file in two steps, perhaps piping the results from one step to the next, or using multiple substitutions with sed: First delete the "of"s and middle spaces, and add newlines, and then run sed as above. It's not as cool as doing it all in one regexp, but each step is easier to understand. For even more simplicity and uncoolness, use three steps, replacing " of " with space in the first step. Since others have provided complete solutions, I won't work out the details.
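For example, applied to the first sample line:

```shell
# Delete everything from the first "/" to the end of the line.
echo 'janitor#1/2 of dorm#1/1' | sed 's?/.*??'
# janitor#1
```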
Grep by default just searches for the text, so in your case it is printing the lines that match. I think you want to investigate sed instead to perform the replacement. (And you don't need to cat the file, just grep PATTERN filename)
To get your output on separate lines, this worked for me:
sed 's|/.||g' NPs | sed 's/ .. /=/' | tr "=" "\n"
This uses two seds in a row to do different substitutions, and tr to insert line feeds.
The -o option in grep, which causes it to print out only the matching text, as described in another answer, is probably even simpler!
An awk version:
awk '/#/ {print $NF}' RS="/" NPs
janitor#1
dorm#1
president#4
class#2
hunting#1
hat#1
side#1
hotel#1
side#1
hotel#1
king#1
hotel#1
address#2
girl#1
one#2
family#2
dance#3
floor#1
movie#1
stars#5
movie#1
stars#5
insurance#1
office#1
side#1
floor#1
middle#4
December#1
movie#1
stars#5
one#2
tables#2
people#1
table#2
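The record-separator trick can be checked against a short sample on stdin:

```shell
# With RS="/", each record is the text between slashes; printing $NF
# (the last whitespace-separated field) of records containing "#"
# yields exactly the noun#number tokens.
printf 'janitor#1/2 of dorm#1/1\npresident#4/1 of class#2/2\n' |
  awk '/#/ {print $NF}' RS="/"
# janitor#1
# dorm#1
# president#4
# class#2
```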

Ignoring base64 encoded attachments when grepping through .eml files

I've got a huge pile of exported emails in .eml format that I'm grepping through for keywords with something like this:
egrep -iR "keyword|list|foo|bar" *
This results in a number of false positives when using relatively short keywords due to base64 encoded email attachments like this:
Inbox/Email Subject.eml:rcX2aiCZBfoogjNUShcWC64U7buTJE3rC5CeShpo/Uhz0SeGz290rljsr6woPNt3DQ0iFGzixrdj
Inbox/Email Subject.eml:3qHXNEj5sKXUa3LxfkmEAEWOpW301Pbarq2Jr2IswluaeKqCgeHIEFmFQLeY4HIcTBe3wCf6HzPL
Is there a regex I can write that will identify and exclude these matches, or can I tell grep to stop reading a file once it gets to a line that says "Content-Transfer-Encoding: base64"?
If you exclude any matches consisting entirely of base64, you should be left with only the interesting matches. As an approximation, excluding any line consisting entirely of base64 with a length longer than, say, 60 characters is probably good enough for immediate human consumption.
egrep -iR "keyword|list|foo|bar" . |
egrep -v ':[0-9A-Za-z+/]{60,}$' |
less
If you need improved accuracy, maybe prefilter the messages to exclude any attachments. You might also want to check that the excluded lines are an even multiple of 4 characters long, although it's unlikely that you have a lot of false positives for that particular criterion.
You might find the -w grep option useful (match only complete words), although it will only reduce and not eliminate false positives since there is roughly a 1/1024 chance that a string in a base-64 encoded file will be surrounded by non-alphanumeric characters.
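As a quick check of the exclusion filter, using the first false-positive line from the question plus a made-up real match (the second line's content is hypothetical):

```shell
# The first line is pure base64 after the colon and gets filtered out;
# the second contains spaces, so no 60+ run of base64 reaches end-of-line.
printf '%s\n%s\n' \
  'Inbox/Email Subject.eml:rcX2aiCZBfoogjNUShcWC64U7buTJE3rC5CeShpo/Uhz0SeGz290rljsr6woPNt3DQ0iFGzixrdj' \
  'Inbox/Email Subject.eml:Please review the foo report today' |
  grep -Ev ':[0-9A-Za-z+/]{60,}$'
# Inbox/Email Subject.eml:Please review the foo report today
```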
You can get grep to stop matching when it finds a given string, such as Content-Transfer-Encoding: base64 but only at the cost of always stopping at the first match, by also matching that string and setting the maximum match count to 1. However, you then have to filter the matches:
grep -EiR -m 1 -e "Content-Transfer-Encoding: base64" -e "foo|bar" * |
grep -v -i "Content-Transfer-Encoding: base64"
You could do this more easily and more precisely with gawk:
awk 'BEGIN {IGNORECASE=1}
/Content-Transfer-Encoding: base64/ {nextfile}
/foo|bar/ {print FILENAME":"$0}' *
(Note: nextfile is a gawk extension. There are other ways to do this, but not as convenient.)
That's a bit much to type every time you want to do this, so you'd be better-off making it a shell function (or script, but I personally prefer functions.)

Replace CR/LF in a text file only after a certain column

I have a large text file I would like to put on my ebook-reader, but the formatting becomes all wrong because all lines are hard wrapped at or before column 80 with CR/LF, and paragraphs/headers are not marked differently, only a single CR/LF there too.
What I would like is to replace all CR/LF's after column 75 with a space. That would make most paragraphs continuous. (Not a perfect solution, but a lot better to read.)
Is it possible to do this with a regex? Preferably a (linux) perl or sed oneliner, alternatively a Notepad++ regex.
perl -p -e 's/\s+$//; $_ .= length() <= 75 ? qq{\n} : q{ }' book.txt
Perl's -p option means: for each input line, process and print. The processing code is supplied with the -e option. In this case: remove trailing whitespace and then attach either a newline or a space, depending on line length.
Not really answering your question, but you can achieve this result in vim using this global join command. The v expands tabs into whitespace when determining line length, a feature that might be useful depending on your source text.
:g/\%>74v$\n/j
This seems to get pretty close:
sed '/^$/! {:a;N;s/\(.\{75\}[^\n]*\)\n\(.\{75\}\)/\1 \2/;ta}' ebook.txt
It doesn't get the last line of a paragraph if it's shorter than 75 characters.
Edit:
This version should do it all:
sed '/^.\{0,74\}$/ b; :a;N;s/\(.\{75\}[^\n]*\)\n\(.\{75\}\)/\1 \2/;ta; s/\n/ /g' ebook.txt
Edit 2:
If you want to re-wrap at word/sentence boundaries at a different width (here 65, but choose any value) to prevent words from being broken at the margin (or long lines from being truncated):
sed 's/^.\{0,74\}$/&\n/' ebook.txt | fmt -w 65 | sed '/^$/{N;s/\n//}'
To change from DOS to Unix line endings, just add dos2unix to the beginning of any of the pipes above:
dos2unix < ebook.txt | sed '/^.\{0,74\}$/ b; :a;N;s/\(.\{75\}[^\n]*\)\n\(.\{75\}\)/\1 \2/;ta; s/\n/ /g'
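A smoke test of the main sed command with a fabricated paragraph (each long line is 80 characters, followed by a short closing line; this assumes GNU sed, which accepts labels and branches separated by semicolons):

```shell
# Two 80-character lines plus a short tail should collapse to one line.
long=$(printf 'x%.0s' $(seq 80))
printf '%s\n%s\n%s\n' "$long" "$long" 'short tail.' |
  sed '/^.\{0,74\}$/ b; :a;N;s/\(.\{75\}[^\n]*\)\n\(.\{75\}\)/\1 \2/;ta; s/\n/ /g'
```

The whole paragraph comes back as a single space-joined line.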
The less fancy option would be to replace the CR/LFs that appear by themselves on a line with a single LF or CR, then remove all the remaining CR/LFs. No need for fancy/complicated stuff.
regex 1:
^\r\n$
finds lone cr/lf's. It is then trivial to replace the remaining ones. See this question for help finding cr/lf's in np++.