Egrep command hangs when passed a file for Regex patterns

NB: I'm using Cygwin.
Passing a file of patterns to the egrep command runs incredibly slowly (to the point where, after the 4th word match, more than 5 minutes passed before I gave up).
The command I'm trying to run is:
cat words.txt | egrep ^"[A-Z]" | egrep -f words9.txt
words.txt is a dictionary (390K words), and words9.txt is a file (36,148 words) I created that contains all lowercase 9-letter words from words.txt.
This command should find any 10+ letter words that contain a 9-letter word from words9.txt.
I am new to regex and shell commands, so it may simply be that this pattern-file approach is an incredibly inefficient method (having to try 36,148 patterns against every word in words.txt). Is there a better way of tackling this?

If words9.txt doesn't contain regexes, try using a fixed-string search (fgrep or grep -F) instead of an extended-regex search (egrep).
cat words.txt | egrep "^[A-Z]" | fgrep -f words9.txt
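If you want to verify the difference on your machine, timing both pipelines is an easy check (a rough sketch; timings will vary, and the egrep version may need to be interrupted, given the original problem):
time egrep "^[A-Z]" words.txt | egrep -f words9.txt > /dev/null
time egrep "^[A-Z]" words.txt | grep -F -f words9.txt > /dev/null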

So you want to improve on egrep ^"[A-Z]" words.txt | egrep -f words9.txt
Your words9.txt is not a file of regex patterns, it contains only fixed strings, so treating it as such (grep -F) will generally be much faster, as @KurzedMetal said.
Mind you, if its contents had a lot of overlapping near-duplicates, you could manually merge them by constructing regexes. Here's how you'd do that:
Get a list of all 9-letter words (using the Unix built-in word dict):
awk 'length($0)==9' /usr/share/dict/words
Now say you wanted to merge all 9-letter words starting with the five characters 'inter' into one regex. First, let's get them as a list by piping the output above through grep "^inter" | paste -sd ',' -, which gives:
interalar,interally,interarch,interarmy,interaxal,interaxis,interbank,interbody,intercale,intercalm,intercede,intercept,intercity,interclub,intercome,intercrop,intercurl,interdash,interdict,interdine,interdome,interface,interfere,interflow,interflux,interfold,interfret,interfuse,intergilt,intergrow,interhyal,interject,interjoin,interknit,interknot,interknow,interlace,interlaid,interlake,interlard,interleaf,interline,interlink,interloan,interlock,interloop,interlope,interlude,intermaze,intermeet,intermelt,interment,intermesh,intermine,internals,internist,internode,interpage,interpave,interpeal,interplay,interplea,interpole,interpone,interpose,interpour,interpret,interrace,interroad,interroom,interrule,interrupt,intersale,intersect,intershop,intersole,intertalk,interteam,intertill,intertone,intertown,intertwin,intervale,intervary,intervein,intervene,intervert,interview,interweld,interwind,interwish,interword,interwork,interwove,interwrap,interzone
The regex would start with: inter(a(l(ar|ly)|r(ch|my)|x(al|is))|b(...)|c(...)|...). We're implementing a tree structure left-to-right (there are other ways, but this is the obvious one).
Testing it: grep "^inter" words9.txt | egrep '^intera(l(ar|ly)|r(ch|my)|x(al|is))'
interalar
interally
interarch
interarmy
interaxal
interaxis
Yay! But it may still be faster to just have a plain list of fixed strings. Also, this regex will be harder to maintain and more brittle; it's impossible to easily filter out or remove specific strings. Anyway, you get the point. P.S. I'm sure there are automated tools out there that construct regexes for such wordlists.
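A crude version of that merging can even be scripted. Here's a minimal sketch (my own, not a polished tool) that groups the 9-letter words by their first five letters and emits one alternation per group, assuming /usr/share/dict/words as above:
awk 'length($0) == 9 {
    p = substr($0, 1, 5)                        # 5-letter prefix
    s = substr($0, 6)                           # 4-letter suffix
    suf[p] = (p in suf) ? suf[p] "|" s : s      # collect suffixes per prefix
}
END {
    for (p in suf) print p "(" suf[p] ")"       # one egrep pattern per group
}' /usr/share/dict/words > merged-patterns.txt
Each output line is an extended regex usable with egrep -f merged-patterns.txt; whether that actually beats the plain fixed-string list is worth measuring rather than assuming.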

Related

Get list of strings between certain strings in bash

Given a text file (.tex) which may contain strings of the form "\cite{alice}", "\cite{bob}", and so on, I would like to write a bash script that stores the content within brackets of each such string ("alice" and "bob") in a new text file (say, .txt).
In the output file I would like to have one line for each such content, and I would also like to avoid repetitions.
Attempts:
I thought about combining grep and cut.
From other questions and answers that I have seen on Stack Exchange, I think that (modulo reading up on cut a bit more) I could manage to get at least one such content per line, but I do not know how to get all occurrences on a single line when it contains several such strings, and I have not seen any question or answer giving hints in this direction.
I have tried using sed as well. Yesterday I read this guide to see if I was missing some basic sed command, but I did not see any straightforward way to do what I want (the guide did mention that sed is Turing complete, so I am sure there is a way to do this only with sed, but I do not see how).
What about:
grep -oP '(?<=\\cite{)[^}]+(?=})' sample.tex | sort -u > cites.txt
-P with GNU grep interprets the regexp as a Perl-compatible one (for lookbehind and lookahead groups)
-o "prints only the matched (non-empty) parts of a matching line, with each such part on a separate output line" (see manual)
The regexp matches curly-brace-free text preceded by \cite{ (positive lookbehind group (?<=\\cite{)) and followed by a right curly brace (positive lookahead group (?=})).
sort -u sorts and removes duplicates.
For more details about lookahead and lookbehind groups, see Regular-Expressions.info dedicated page.
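To illustrate with a throwaway input (the file and citation keys here are made up for the demonstration):
printf 'See \\cite{alice} and \\cite{bob}.\nAgain \\cite{alice}.\n' > sample.tex
grep -oP '(?<=\\cite{)[^}]+(?=})' sample.tex | sort -u
alice
bob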
You can use grep -o and postprocess its output:
grep -o '\\cite{[^{}]*}' file.tex |
sed 's/\\cite{\([^{}]*\)}/\1/'
If there can only ever be a single \cite on an input line, just a sed script suffices.
sed -n 's/.*\\cite{\([^{}]*\)}.*/\1/p' file.tex
(It's by no means impossible to refactor this into a script which extracts multiple occurrences per line; but good luck understanding your code six weeks from now.)
As usual, add sort -u to remove any repetitions.
Here's a brief Awk attempt:
awk -v RS='\\' '/^cite\{/ {
    split($0, g, /[{}]/)      # split the record on braces; g[2] is the citation key
    cite[g[2]]++              # count keys, which deduplicates them
}
END { for (cit in cite) print cit }' file.tex
This conveniently does not print any duplicates, and trivially handles multiple citations per line.

Use of grep + sed based on a pattern file?

Here's the problem: I have ~35k files that might or might not contain one or more of the strings in a list of 300 lines, each containing a regex.
If I run grep -rnwl 'C:\out\' --include=*.txt -E --file='comp.log', I see there are a few thousand files that contain a match.
Now how do I get sed to delete, in each of those files, every line matching one of the patterns in comp.log?
Edit: comp.log contains a simple regex on each line, but for the most part each string to be matched is unique.
This is an example of how it is structured:
server[0-9]\/files\/bobba fett.stw
[a-z]+ mochaccino
[2-9] CheeseCakes
...
etc. Silly examples aside, it goes to show each line is unique save for a few variations, so it shouldn't affect what I really want: to see if any of these lines match the lines in the file being worked on. It's no different from 's/pattern/replacement/' except that I want to use the patterns in the file instead of inline.
OK, here's an update (S.O. gets impatient if I don't declare the question answered after a few days).
After MUCH fiddling with the @Kenavoz/@Fischer approach, I found a totally different solution, but first things first.
Creating a modified pattern list for sed to work with does work.
As does @werkritter's approach of dropping sed altogether. (This one I find the most... err... "least convoluted" way around the problem.)
I couldn't make @Mklement's answer work under Windows/Cygwin (it did work under Ubuntu, so... not sure what that means. Figures.)
What ended up solving the problem in a more long-term, reusable form was a wonderful program pointed out by a colleague, called PowerGrep. It really blows every other option out of the water. Unfortunately it's Windows-only AND it's not free. (Not advertising here; the thing is not cheap, but it does solve the problem.)
So, considering @werkritter's reply was not a "proper" answer and I can't choose both @Lars Fischer's and @Kenavoz's answers as a solution (they complement each other), I am awarding @Kenavoz the tickmark for being first.
Final thoughts: I was hoping for a simpler, universal, and free solution, but apparently there isn't one.
You can try this:
sed -E -f <(sed 's/^/\//g;s/$/\/d/g' comp.log) file > outputfile
Each regex in comp.log is formatted into a sed address with a d command: /regex/d. This command deletes lines matching the pattern. (The -E is needed because the patterns are extended regexes.)
This inner sed's output is supplied as a file (via process substitution) to the -f option of the outer sed, which is applied to file.
To delete just the strings matching the patterns (not the whole line):
sed -E -f <(sed 's/^/s\//g;s/$/\/\/g/g' comp.log) file > outputfile
Update:
The command output is redirected to outputfile.
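To see what the outer sed actually receives, you can run the inner sed on its own; with the question's sample comp.log it prints one delete command per pattern:
sed 's/^/\//g;s/$/\/d/g' comp.log
/server[0-9]\/files\/bobba fett.stw/d
/[a-z]+ mochaccino/d
/[2-9] CheeseCakes/d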
Some ideas, but not a complete solution, as it requires some adapting to your script (not shown in the question).
I would convert comp.log into a sed script containing the necessary deletes:
cat comp.log | sed -r "s+(.*)+/\1/ d;+" > comp.sed
That would make your example comp.sed look like:
/server[0-9]\/files\/bobba fett.stw/ d;
/[a-z]+ mochaccino/ d;
/[2-9] CheeseCakes/ d;
Then I would apply the comp.sed script to each file reported by grep (with your -rnwl, that would require some filtering to get just the filename):
sed -r -i.bak -f comp.sed "$AFileReportedByGrep"
If you have GNU sed, you can use -i.bak for in-place replacement that creates a .bak backup; otherwise, pipe to a temporary file. (The -r is needed here too, since the addresses in comp.sed are extended regexes.)
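One hedged way to wire the two steps together, assuming GNU tools and filenames without embedded newlines (the path style follows the question's example):
grep -rlw -E --include='*.txt' --file=comp.log 'C:/out/' |
while IFS= read -r f; do
    sed -r -i.bak -f comp.sed "$f"    # comp.sed as generated above
done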
Both Kenavoz's answer and Lars Fischer's answer use the same ingenious approach:
transform the list of input regexes into a list of sed match-and-delete commands, passed as a file acting as the script to sed via -f.
To complement these answers with a single command that puts it all together, assuming you have GNU sed and your shell is bash, ksh, or zsh (to support <(...)):
find 'c:/out' -name '*.txt' -exec sed -i -r -f <(sed 's#.*#/\\<&\\>/d#' comp.log) {} +
find 'c:/out' -name '*.txt' matches all *.txt files in the subtree of directory c:/out.
-exec ... + passes as many matching files as will fit on a single command line to the specified command, typically resulting only in a single invocation.
sed -i updates the input files in-place (conceptually speaking - there are caveats); append a suffix (e.g., -i.bak) to save backups of the original files with that suffix.
sed -r activates support for extended regular expressions, which is what the input regexes are.
sed -f reads the script to execute from the specified filename, which in this case, as explained in Kenavoz's answer, uses a process substitution (<(...)) to make the enclosed sed command's output act like a [transient] file.
The s/// sed command - which uses the alternative delimiter # to facilitate use of literal / - encloses each line from comp.log in /\<...\>/d to yield the desired deletion command; the enclosing of the input regex in \<...\> ensures matching as a word, as grep -w does.
This is the primary reason why GNU sed is required, because neither POSIX EREs (extended regular expressions) nor BSD/OSX sed support \< and \>.
However, you could make it work with BSD/OSX sed by replacing -r with -E, and \< / \> with [[:<:]] / [[:>:]].
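Putting those two substitutions together, a hedged BSD/macOS version of the one-liner might look like this, assuming a shell with process substitution (note that BSD sed's -i requires an explicit, possibly empty, backup suffix):
find 'c:/out' -name '*.txt' -exec sed -i '' -E \
  -f <(sed 's#.*#/[[:<:]]&[[:>:]]/d#' comp.log) {} +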

Get all Commands without arguments from history (with Regex)

I have just started learning shell commands and how to script in bash.
Now I would like to solve the task mentioned in the title.
What I get from the history command (without line numbers):
ls [options/arguments] | grep [options/arguments]
find [...] -exec sed [...]
du [...]; awk [...] file
And how my output should look like:
ls
grep
find
sed
du
awk
I already found a solution, but it doesn't really satisfy me. So far I declared three arrays and used readarray -t < <(...) twice: once to save the content of my history, and once (in combination with compgen -ac) to get all commands I could possibly run. Then I compared the contents of both with loops, saving a command every time it matched a line in the "history" array. A lot of effort for a simple exercise, I guess.
Another solution I thought of, is to do it with regex pattern matching.
A command usually starts at the beginning of the line, after a pipe, after an exec, or after a semicolon. And maybe in more places I just don't know about yet.
So I need a regex which gives me only the next word after it matches one of these conditions. That's the command I found, and it seems to work:
grep -oP '(?<=|\s/)\w+'
Here it uses the pipe | as a condition. But I need to insert the others too. So I have put the pattern in double quotes, created an array with all conditions, and tried it as recommended:
grep -oP "(?<=$condition\s/)\w+"
But no matter how I insert the variable, it fails. To keep it short, I couldn't figure out how the command works, especially not the regex part.
So, how can I solve it using regular expressions? Or is there a better approach than mine?
Thank you in advance! :-)
This is simple and works quite well:
history -w /dev/stdout | cut -f1 -d ' '
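cut -f1 only sees the first word of each line; to also pick up commands after a pipe, semicolon, or &&, as the question asks, here's a hedged sketch using GNU grep's PCRE mode (it assumes simple one-liners and does no real shell parsing, so quoted separators will fool it):
history -w /dev/stdout | grep -oP '(?:^|[|;&]\s*)\K[a-zA-Z_][\w.-]*'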
You can use awk with the fc command:
awk '{print $1}' <(fc -nl)
find
mkdir
find
touch
tty
printf
find
ls
fc -nl lists entries from history without the line numbers.

Using fgrep to find multiple words (korn shell)

Say I have a text file with multiple lines, but I only want fgrep to list those lines which have certain words in the same line. So, for example, if I'm looking for the words "cat" and "dog", how would I supply that information to fgrep?
I understand for one argument it would simply be:
fgrep cat text.txt
but I want to look for lines that contain "dog" as well as "cat" in the same line. How would I go about doing this?
This will work:
fgrep cat text.txt | fgrep dog
You can also use one regex with grep -E, something like:
grep -E "cat.*?dog|dog.*?cat" text.txt
But it is typically too much of brainpower to spend for simple task like that, and I choose first method instead.
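If you'd rather keep it to one process, a small awk equivalent is a hedged alternative (awk simply prints lines matching both patterns, in either order):
awk '/cat/ && /dog/' text.txt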

grep egrep multiple-strings

Suppose I have several strings: str1 and str2 and str3.
How to find lines that have all the strings?
How to find lines that can have any of them?
And how to find lines that have str1 and either of str2 and str3 [but not both?]?
This looks like three questions. The easiest way to put these sorts of expressions together is with multiple pipes. There's no shame in that, particularly because a regular expression (using egrep) would be ungainly since you seem to imply you want order independence.
So, in order,
grep str1 | grep str2 | grep str3
egrep '(str1|str2|str3)'
grep str1 | egrep '(str2|str3)'
you can do the "and" form in an order independent way using egrep, but I think you'll find it easier to remember to do order independent ands using piped greps and order independent or's using regular expressions.
You can't reasonably do the "all" or "this plus either of those" cases with plain grep/egrep, because they don't support lookahead. Use Perl. For the "any" case, it's egrep '(str1|str2|str3)' file.
The unreasonable way to do the "all" case is:
egrep '(str1.*str2.*str3|str1.*str3.*str2|str2.*str1.*str3|str2.*str3.*str1|str3.*str1.*str2|str3.*str2.*str1)' file
i.e. you build out all six permutations. This is, of course, a ridiculous thing to do.
For the "this plus either of those", similarly:
egrep '(str1.*(str2|str3)|(str2|str3).*str1)' file
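That said, if your grep has PCRE support (GNU grep's -P option), lookaheads do give an order-independent "all" test; a hedged alternative to the permutation regex:
grep -P '^(?=.*str1)(?=.*str2)(?=.*str3)' file
The same idea covers "str1 plus either of the others": grep -P '^(?=.*str1)(?=.*(str2|str3))' file.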
grep -E --color "string1|string2|string3...."
For example, to find out whether our system has an AMD (svm) or Intel (vmx) processor, and whether it is 64-bit (lm stands for "long mode", which means 64-bit)...
command example:
grep -E --color "lm|svm|vmx" /proc/cpuinfo
-E is required to match multiple strings (alternatives).
Personally, I do this in perl rather than trying to cobble together something with grep.
For instance, for the first one:
while (<>) {                  # read lines from the file(s) named on the command line, or stdin
    next if !/pattern1/;      # skip lines missing any one of the patterns...
    next if !/pattern2/;
    next if !/pattern3/;
    print;                    # ...so only lines containing all three are printed
}
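For what it's worth, the same filter fits on one line if you prefer not to keep a script file around (a hedged equivalent of the loop above):
perl -ne 'print if /pattern1/ && /pattern2/ && /pattern3/' file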