grep for X or Y in unix? - regex

How can I capture all lines from a text file that begin with the character "X" or contain the word "foo"?
This works:
cat text | grep '^#' # begins with #
but I tried:
cat text | grep '^#|[foo]'
and variations, but I cannot find the right syntax anywhere. How can this be done?
thanks.

If your grep implementation isn't POSIX compliant, you can use egrep instead of grep:
egrep '^#|foo' text

cat text | grep -E '^#|foo'
does this. Plain grep treats | as a literal character, so the alternation needs -E (or \| in GNU grep). [foo] matches one character that's either an f or an o.
If you don't want to match parts of words like the foo in foobar, use word boundary anchors:
cat text | grep -E '^#|\bfoo\b'
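The difference between the alternation and the bracket expression can be checked with a small sketch. The sample lines and the variable names (sample, either, bracket_count) are made up for illustration and are not from the question:

```shell
# Hypothetical sample input: six lines, one per argument.
sample=$(printf '%s\n' '# a comment' 'some foo here' 'foobar' 'only f' 'nothing' 'bar')

# ERE alternation: lines that start with '#' OR contain "foo".
either=$(printf '%s\n' "$sample" | grep -E '^#|foo')

# [foo] is a bracket expression: any single 'f' or 'o' is enough to match,
# so many more lines are selected.
bracket_count=$(printf '%s\n' "$sample" | grep -cE '^#|[foo]')

printf '%s\n' "$either"
printf 'lines matched with [foo]: %s\n' "$bracket_count"
```

With this input the alternation should select three lines, while the bracket version selects every line except "bar".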

"Contains the word foo" can be written as (.*foo.*), although grep already matches anywhere in the line, so plain foo is equivalent. Your regex would become:
cat yourFilePath | grep -E '^#|(.*foo.*)'

Related

How can I get a list of the words that have six or more consonants in a row using the grep command?

I want to find a list of words that contain six or more consonants in a row from a number of text files.
I'm pretty new to the Unix terminal, but this is what I have tried:
cat *.txt | grep -Eo "\w+" | grep -i "[^AEOUIaeoui]{6}"
I use the cat command here because it will otherwise include the file names in the next pipe. I use the second pipe to get a list of all the words in the text files.
The problem is the last pipe: I want it to grep for 6 consonants in a row, and they don't need to be the same one. I know one way of solving the problem, but that would create a command longer than this entire post.
For the last grep you also need the -E switch - or you need to escape the curly braces:
cat *.txt | grep -Eo "\w+" | grep -Ei "[^AEOUIaeoui]{6}"
cat *.txt | grep -Eo "\w+" | grep -i "[^AEOUIaeoui]\{6\}"
"I use the cat command here because it will otherwise include the file names in the next pipe"
You can disable this using the -h flag:
grep -hEo "\w+" *.txt | grep -Ei "[^AEOUIaeoui]{6}"
You can use
grep -hEio '[[:alpha:]]*[b-df-hj-np-tv-z]{6}[[:alpha:]]*' *.txt
Regex details
[[:alpha:]]* - zero or more letters
[b-df-hj-np-tv-z]{6} - six English consonant letters in a row
[[:alpha:]]* - zero or more letters.
The grep options make the regex search case insensitive (i), show only the matched text (o), and suppress the filenames (h). The -E option enables POSIX ERE syntax; without it, you would need to escape {6} as \{6\}.
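As a rough check of the pattern above, here is a sketch that runs it over two throwaway files. The file names, their contents, and the variables tmpdir and words are assumptions for illustration; LC_ALL=C is added so the letter ranges behave as plain ASCII ranges:

```shell
# Create two hypothetical text files in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'The rhythms and strengths' 'Latchstring, catchphrase!' > "$tmpdir/a.txt"
printf '%s\n' 'plain words only' > "$tmpdir/b.txt"

# -h: no filenames, -E: ERE, -i: ignore case, -o: print only the matched words.
words=$(LC_ALL=C grep -hEio '[[:alpha:]]*[b-df-hj-np-tv-z]{6}[[:alpha:]]*' "$tmpdir"/*.txt)

printf '%s\n' "$words"
rm -r "$tmpdir"
```

Note that the consonant class includes y, which is why "rhythms" is reported while "strengths" (at most five consonants in a row) is not.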
Use this Perl one-liner:
perl -lne 'print for grep { /[^aeoui]{6}/i } /\b([a-z]+)\b/ig' in_file.txt
Example:
cat > in_file.txt <<EOF
the abcdfghi aBcdfghi.
ABCDFGHI234
abcdEfgh
EOF
perl -lne 'print for grep { /[^aeoui]{6}/i } /\b([a-z]+)\b/ig' in_file.txt
Output:
abcdfghi
aBcdfghi
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
The regex uses these modifiers:
/g : Multiple matches.
/i : Case-insensitive matches.
/\b([a-z]+)\b/ig : Match words that consist of 1 or more letters only ([a-z]+), with a word boundary \b on both sides. This way, ABCDFGHI234 does not match, but all 3 words in line 1 (the, abcdfghi, aBcdfghi) match. This may be important for some applications. Note that not all answers in this thread use word boundaries around the letters, and thus do not make the distinction shown in this example.
/[^aeoui]{6}/i : Match 6 or more consecutive non-vowels. Non-vowels here resolve exactly to consonants, because the previous regex selected for words made of letters only, that is, vowels and consonants.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start
Get all words containing 6 or more consonants in a row in a given directory
cat *.txt | grep -Eo "\w+" | grep -E "[^AEOUIaeoui]{6,}"
We can use grep -Eo (-E Extended regex, -o output ONLY matching)
cat *.txt will output all of the data from all txt files in the current directory
grep -Eo "\w+" will output all of the words from an input in the form of one word per line
We can use Regex to search for strings that contain a pattern:
[^LISTOFCHARACTERS] Any character but LISTOFCHARACTERS
{6,} 6 or more
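One caveat with the negated class is that \w also matches digits and underscores, and [^AEOUIaeoui] accepts them as "non-vowels". A small sketch of the difference, using an invented input line and the hypothetical variable names words, loose and strict:

```shell
# Split a made-up line into words, one per line (\w is a GNU/BSD grep extension).
words=$(printf '%s\n' 'id_123456 rhythms audio' | grep -Eo '\w+')

# Negated class: anything that is not a vowel counts, including digits and '_'.
loose=$(printf '%s\n' "$words" | grep -E '[^AEOUIaeoui]{6,}')

# Explicit consonant ranges: only letters can satisfy the run of six.
strict=$(printf '%s\n' "$words" | LC_ALL=C grep -Ei '[b-df-hj-np-tv-z]{6,}')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

If the input is known to contain only letters, the two variants should agree.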

Printing only text from group

I have working example of substitution in online regex tester https://regex101.com/r/3FKdLL/1 and I want to use it as a substitution in sed editor.
echo "repo-2019-12-31-14-30-11.gz" | sed -r 's/^([\w-]+)-\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}.gz$.*/\1/p'
It always prints whole string: repo-2019-12-31-14-30-11.gz, but not matched group [\w-]+.
I expect to get only text from group which is repo string in this example.
Try this:
echo "repo-2019-12-31-14-30-11.gz" |
sed -rn 's/^([A-Za-z]+)-[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}\.gz.*$/\1/p'
Explanations:
\w will work (not [\w], which matches either a backslash or a w), but you should use [[:alnum:]], which is POSIX
For sed, \d isn't a digit class (GNU sed reads \dNNN as the character with decimal value NNN), so use [[:digit:]] or [0-9] instead
Add -n to suppress sed's automatic printing, and the p flag to print only the lines where the substitution succeeded
Additionally, you could refactor your regex by removing duplication:
echo "repo-2019-12-31-14-30-11.gz" |
sed -rn 's/^([[:alnum:]]+)-[[:digit:]]{4}(-[[:digit:]]{2}){5}\.gz.*$/\1/p'
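The refactored substitution could be wrapped in a small helper, sketched below. The function name repo_prefix is hypothetical, and -E is used in place of -r on the assumption that your sed accepts it (GNU and BSD sed both do):

```shell
# Hypothetical helper: print the name part of "<name>-YYYY-MM-DD-hh-mm-ss.gz",
# or nothing at all when the input does not have that shape.
repo_prefix() {
    printf '%s\n' "$1" |
        sed -nE 's/^([[:alnum:]]+)-[[:digit:]]{4}(-[[:digit:]]{2}){5}\.gz$/\1/p'
}

prefix=$(repo_prefix 'repo-2019-12-31-14-30-11.gz')
other=$(repo_prefix 'notes.txt')

printf 'prefix: %s\n' "$prefix"
printf 'other: [%s]\n' "$other"
```

Because of -n and the p flag, a non-matching name produces empty output instead of being echoed back unchanged.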
Looks like a job for GNU grep :
echo "repo-2019-12-31-14-30-11.gz" | grep -oP '^\K[[:alpha:]-]+'
Displays :
repo-
On this example :
echo "repo-repo-2019-12-31-14-30-11.gz" | grep -oP '^\K[[:alpha:]-]+'
Displays :
repo-repo-
Which I think is what you want because you tried with [\w-]+ on your regex.
If I'm wrong, just replace the grep command with : grep -oP '^\K\w+'

How do I grep a list of words with a specific letter listed twice or more?

If I'm supposed to grep out every word that contains the letter 'w' twice, how am I supposed to do this?
When I try all I get is the words with two 'w's next to each other.
I've tried:
grep -P "(?=.*w)(?=.*w)" /usr/share/dict/words
egrep "(?=.*w)(?=.*w)" /usr/share/dict/words
cat /usr/share/dict/words | grep 'w' | grep 'w'
but nothing gives me the results that I want. How can I do that?
You can use this grep:
grep -Eo '\b\w*w\w*w\w*\b'
Example:
echo 'abcw wowed drew won now wow watch' | grep -Eo '\b\w*w\w*w\w*\b'
wowed
wow
The grep below grabs the words that have at least two w's.
$ echo 'foo bar wow bar work wallewow' | grep -oP '\S*w\S*w\S*'
wow
wallewow
To search /usr/share/dict/words for words containing the character w twice with arbitrary characters between them,
grep 'w.*w' /usr/share/dict/words
The problem with the zero-width assertion you used is that it will not skip forward. So (?=.*w)(?=.*w) will find the first w character twice. Similarly, grep 'w' | grep 'w' will find the first w character and pipe the line containing it to another instance of grep which does the same thing.
(The standard file /usr/share/dict/words contains one word per line, so we can use "word" above when we really mean "line", which is what grep and friends operate on. Really grepping words out of free-form text is somewhat more involved.)
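To see why chaining two greps does not help, here is a sketch on a short made-up word list (one word per line, like /usr/share/dict/words). The variable names wordlist, twice and chained_count are illustrative:

```shell
# Hypothetical word list, one word per line.
wordlist=$(printf '%s\n' 'wow' 'swallow' 'window' 'water' 'away' 'bowwow')

# 'w.*w' needs two separate w characters on the same line.
twice=$(printf '%s\n' "$wordlist" | grep 'w.*w')

# grep 'w' | grep 'w' only tests "contains a w" twice, so nothing is filtered out.
chained_count=$(printf '%s\n' "$wordlist" | grep 'w' | grep -c 'w')

printf '%s\n' "$twice"
printf 'lines surviving the chained grep: %s\n' "$chained_count"
```

Here "water" and "away" pass the chained version but are correctly rejected by 'w.*w'.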

bash sed/grep extract text between 2 words

My problem is the same as it's here, except I only want the first occurrence, ignore all the rest:
How to use sed/grep to extract text between two words?
In his example if it would be:
input: "Here is a String Here is a String"
But I only care about the first "is"
echo "Here is a String Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
output: "is a String Here is a"
Is this even possible with grep? I could use sed as well for the job.
Thanks
Your regexp matches the longest string that sits between "Here " and " String". That is, indeed, "is a String Here is a". This greedy matching is the default behaviour of the * quantifier.
$ echo "Here is a String Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a String Here is a
If you want to match the shortest, you may put a ? just after the * quantifier, which makes it lazy (non-greedy):
$ echo "Here is a String Here is a String" | grep -Po '(?<=(Here )).*?(?= String)'
is a
is a
To get the first word you can use grep -o '^[^ ]*':
echo "Here is a String Here is a String" | grep -Po '(?<=(Here )).*(?= String)' | grep -o '^[^ ]*'
And you can pipe grep to grep multiple times to compose simple commands into complex ones.
sed 's/ String.*//;s/.*Here //'
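For only the first occurrence, a sketch that avoids grep -P altogether: the sed command above, next to plain POSIX parameter expansion. The variable names line, first_sed, rest and first_sh are illustrative:

```shell
line='Here is a String Here is a String'

# sed: cut everything from the first " String" onward, then everything up to "Here ".
first_sed=$(printf '%s\n' "$line" | sed 's/ String.*//;s/.*Here //')

# Parameter expansion: drop the shortest prefix ending in "Here ",
# then the longest suffix starting at " String".
rest=${line#*Here }
first_sh=${rest%% String*}

printf '%s\n' "$first_sed"
printf '%s\n' "$first_sh"
```

If you prefer the non-greedy grep -Po variant, piping its output through head -n 1 is another way to keep only the first match.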

grep to select strings that contains certain words

I have a list:
/device1/element1/CmdDiscovery
/device1/element1/CmdReaction
/device1/element1/Direction
/device1/element1/MS-E2E003-COM14/Field2
/device1/element1/MS-E2E003-COM14/Field3
/device1/element1/MS-E2E003-COM14/NRepeatLeft
How can I grep so that only the strings ending in "Field" followed by digits, or in NRepeatLeft, are returned (in my example, the last three strings)?
Expected output:
/device1/element1/MS-E2E003-COM14/Field2
/device1/element1/MS-E2E003-COM14/Field3
/device1/element1/MS-E2E003-COM14/NRepeatLeft
Try doing this :
grep -E "(Field[0-9]*|NRepeatLeft$)" file.txt
      |  |           |           ||
      |  |           OR end_line  |
      | opening_choice   closing_choice
 extended_grep
If you don't have the -E switch (it stands for ERE: Extended Regular Expression):
grep "\(Field[0-9]*\|NRepeatLeft$\)" file.txt
OUTPUT
/device1/element1/MS-E2E003-COM14/Field2
/device1/element1/MS-E2E003-COM14/Field3
/device1/element1/MS-E2E003-COM14/NRepeatLeft
That will grep for lines containing Field followed by optional digits, or lines with NRepeatLeft at the end. Is that what you expect?
I am not sure how to use grep for your purpose. You might prefer Perl for this:
perl -lne 'if(/Field[\d]+/ or /NRepeatLeft/){print}' your_file
$ grep -E '(Field[0-9]*|NRepeatLeft)$' file.txt
Output:
/device1/element1/MS-E2E003-COM14/Field2
/device1/element1/MS-E2E003-COM14/Field3
/device1/element1/MS-E2E003-COM14/NRepeatLeft
Explanation:
Field # Match the literal word
[0-9]* # Followed by any number of digits
| # Or
NRepeatLeft # Match the literal word
$ # Match the end of the string
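The placement of $ matters: inside the group it anchors only NRepeatLeft, outside it anchors both alternatives. A sketch using the list from the question plus one extra path (/device1/element1/Field9/Other) that is invented here to show the difference:

```shell
# The six paths from the question, plus one hypothetical path (last line).
paths=$(printf '%s\n' \
    '/device1/element1/CmdDiscovery' \
    '/device1/element1/CmdReaction' \
    '/device1/element1/Direction' \
    '/device1/element1/MS-E2E003-COM14/Field2' \
    '/device1/element1/MS-E2E003-COM14/Field3' \
    '/device1/element1/MS-E2E003-COM14/NRepeatLeft' \
    '/device1/element1/Field9/Other')

# $ outside the group: both alternatives must sit at the end of the line.
anchored=$(printf '%s\n' "$paths" | grep -E '(Field[0-9]*|NRepeatLeft)$')

# $ inside the group: "Field" may appear anywhere in the line.
loose_count=$(printf '%s\n' "$paths" | grep -cE '(Field[0-9]*|NRepeatLeft$)')

printf '%s\n' "$anchored"
printf 'lines matched with $ inside the group: %s\n' "$loose_count"
```

On the original six lines both patterns give the same three results; they only diverge when "Field" appears in the middle of a path.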