Grep Regex - Words in brackets? - regex

I want to know the regex in grep to match everything that isn't a specific word. I know how to match everything that isn't a single character:
gibberish blah[^.]*jack
That would match blah, jack and everything in between as long as the in between didn't contain a period. But is it possible to do something like this?
gibberish blah[^joe]*jack
Match blah, jack and everything in between as long as the in between didn't contain the word "joe"?
UPDATE:
I can also use AWK if that would better suit this purpose.
So basically, I just want to get the sentence "gibberish blah other words jack", as long as "joe" isn't in the other words.
Update 2 (The Answer, to a different question):
Sorry, I am tired. The sentence actually can contain the word "joe", but not two of them. So "gibberish blah jill joe moo jack" would be accepted, but "gibberish blah jill joe moo joe jack" wouldn't.
Anyway, I figured out the solution to my problem. Just grep for "gibberish.*jack" and then do a word count (wc) to see how many "joes" are in that sentence. If wc comes back with 1, then it's ok, but if it comes back with 2 or more, the sentence is wrong.
So, sorry for asking a question that wouldn't even solve my problem. I will mark sputnick's answer as the right one, since his answer looks like it would solve the original post's problem.
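The counting approach from Update 2 can be sketched roughly as follows. This is only an illustration: check_sentence is a made-up helper name, and it assumes a grep that supports -o (GNU and BSD grep do).

```shell
# Accept a sentence only if the span from "gibberish" to "jack"
# contains "joe" at most once.
# check_sentence is a hypothetical helper name, not from the original post.
check_sentence() {
    # grep -o prints only the matched span; give up if there is none.
    span=$(printf '%s\n' "$1" | grep -o 'gibberish.*jack') || return 1
    # grep -o prints one "joe" per line, so wc -l is the number of occurrences.
    count=$(printf '%s\n' "$span" | grep -o 'joe' | wc -l)
    [ "$count" -le 1 ]
}

for s in 'gibberish blah jill joe moo jack' \
         'gibberish blah jill joe moo joe jack'; do
    if check_sentence "$s"; then
        echo "ok:       $s"
    else
        echo "rejected: $s"
    fi
done
```

Note that 'joe' here also counts substrings such as "joey"; adding -w to the second grep would restrict it to whole words, if your grep has that option.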

What you're looking for is called lookaround; it's an advanced regex technique from PCRE and Perl, and it's supported by most modern languages. grep can handle these expressions if it has the -P switch. If you don't have -P, try pcregrep instead (or any modern language).
See
http://www.perlmonks.org/?node_id=518444
http://www.regular-expressions.info/lookaround.html
NOTE
If you just want to negate a regex, maybe a simple grep -v "regex" will be sufficient (it depends on your needs):
$ echo 'gibberish blah other words jack' | grep -v 'joe'
gibberish blah other words jack
$ echo 'gibberish blah joe other words jack' | grep -v 'joe'
$
See
man grep | less +/invert-match

Try the negative lookahead syntax (PCRE, so it needs grep -P):
gibberish blah((?!joe).)*jack
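If grep -P is not available, the same filter can be approximated in POSIX awk, which the question says is acceptable. This is a sketch under assumptions: no_joe_between is a made-up helper name, and because .* is greedy it looks at the text from the first "blah" to the last "jack" on each line.

```shell
# no_joe_between is a hypothetical helper; it reads lines on stdin and prints
# those where the text between "blah" and "jack" does not contain "joe".
no_joe_between() {
    awk '
        match($0, /blah.*jack/) {
            # Text strictly between "blah" (4 chars) and "jack" (4 chars).
            middle = substr($0, RSTART + 4, RLENGTH - 8)
            if (index(middle, "joe") == 0) print
        }
    '
}

printf '%s\n' 'gibberish blah other words jack' \
              'gibberish blah joe other words jack' | no_joe_between
```

Only the first sample line is printed; the second is dropped because "joe" appears between "blah" and "jack".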

Related

Find the first name that starts with any letter other than S using regex

I am new to regex. From a text file, I am trying to find the lines where the last name starts with S, is followed by a comma and a space, and the first name doesn't start with S.
I am using the terminal on a MacBook.
This is my regex
^[S\w][,]?[' ']?[A-RT-Z]?
My full command
cat People.txt | grep -E ^[S\w][,]?[' ']?[A-RT-Z]?
The first name is the second word and the last name is the first word on each line.
The results I get:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
What I am expecting to get
Schmidt, Paul
Smith, Peter
The first rule of writing regular expressions in a shell script (or at the terminal) is "enclose the regular expression in single quotes" so that the shell doesn't try to interpret the metacharacters in the regex. You might sometimes use double quotes instead of single quotes if you need to match single quotes but not double quotes or if you need to interpolate a variable, but aim to use single quotes. Also, avoid UUoC — Useless Use of cat.
Your question currently shows two regular expressions:
^[S\w][,]?[' ']?[A-RT-Z]?
cat People.txt | grep -E ^[S\w][,]?[' ']?[P\w+]?
If written as suggested, these would become:
grep -E -e '^[Sw],? ?[A-RT-Z]?' People.txt
grep -E -e '^[Sw],? ?[Pw+]?' People.txt
The shell removes the backslashes in your rendition. The + in the character class matches a plus sign. You don't need square brackets around the comma (though they do no major harm). I use the -e option for explicitness, and so I can add extra arguments after the regex (-w or -l or -n or …) when editing commands via history. (I also dislike having options recognized after non-option arguments; I often run with $POSIXLY_CORRECT set in my environment. That's a personal quirk.)
The first of the two commands looks for a line starting S or w, followed by an optional comma, an optional blank, and an optional upper-case letter other than S. The second is similar except that it looks for an optional P or w. None of this bears much relationship to the question.
You need an expression more like one of these:
grep -E -e '^[S][[:alpha:]]*, [^S]' People.txt
grep -E -e '^[S][a-zA-Z]*, [^S]' People.txt
These allow single-character names — just S — but you can use + instead of * to require one or more letters.
There are lots of refinements possible, depending on how much you want to work, but this does the primary job of finding 'first word on the line starts with S, and is followed by a comma, a blank, and the second word does not start with S'.
Given a file People.txt containing:
Randall, Steven
Rogers, Timothy
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Titus, Persephone
Williams, Shirley
Someone
S
Your regular expressions produce the output:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Someone
S
My commands produce:
Schmidt, Paul
Smith, Peter
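The difference between * and + mentioned above can be seen with a small sketch. The sample data is hypothetical: a few of the People.txt lines plus an invented single-letter surname line ("S, Paul") that exists only to show the difference.

```shell
# Hypothetical sample data: some People.txt lines plus "S, Paul".
people=$(printf '%s\n' 'Schmidt, Paul' 'Sells, Simon' 'Smith, Peter' \
                       'S, Paul' 'Someone' 'S')

# "*" allows zero letters after the S, so "S, Paul" is matched.
star=$(printf '%s\n' "$people" | grep -E -e '^S[[:alpha:]]*, [^S]')

# "+" requires at least one letter after the S, so "S, Paul" is dropped.
plus=$(printf '%s\n' "$people" | grep -E -e '^S[[:alpha:]]+, [^S]')

printf 'with *:\n%s\n\nwith +:\n%s\n' "$star" "$plus"
```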
Something like this seems to work fine:
^S.*, [^S].*$
^S.* - must start with S and start capturing everything
, [^S] - leading up to a comma, space, not S
.*$ - capture the rest of the string
https://regex101.com/r/76bfji/1

Grep pattern between quotes

I'm trying to grep a code base to find alpha numeric codes between quotes. So, for example my code base might contain the line
some stuff "A234DG3" maybe more stuff
And I'd like to output: A234DG3
I'm lucky in that I know my string is 7 long and only integers and the letters A-Z, a-z.
After a bit of playing I've come up with the following, but it's just not coming out with what I'd like
grep -ro '".*"' . | grep [A-Za-z0-9]{7} | less
Where am I going wrong here? It feels like grep should give me what I want, but am I better off using something else? Cheers!
The problem is that an RE is pretty much required to match the longest sequence it can. So, given something like:
a "bcd" efg "hij" klm "nop" q
A pattern of ".*" should match: "bcd" efg "hij" klm "nop" (everything from the first quote to the last quote), not just "bcd".
You probably want a pattern more like "[^"]*" to match the open-quote, an arbitrary number of other things, then a close quote.
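A quick way to see the difference between the two patterns, assuming a grep that supports -o (GNU and BSD grep do):

```shell
line='a "bcd" efg "hij" klm "nop" q'

# Greedy: one match spanning the first quote to the last quote.
greedy=$(printf '%s\n' "$line" | grep -o '".*"')

# Negated class: each quoted string is matched on its own.
separate=$(printf '%s\n' "$line" | grep -o '"[^"]*"')

printf 'greedy:\n%s\nseparate:\n%s\n' "$greedy" "$separate"
```

The first pattern yields a single match, "bcd" efg "hij" klm "nop"; the second yields "bcd", "hij" and "nop" on separate lines.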
Using basic or extended POSIX regular expressions, there is no way to extract just the value between the quotes with grep alone. Because of that, I would use sed for a portable solution:
sed -n 's/.*"\([^"][^"]*\)".*/\1/p' <<< 'some stuff "A234DG3" maybe more stuff'
However, having GNU goodies, GNU grep will support PCRE expressions with the -P command line option. You can use this:
grep -oP '.*?"\K[^"]+(?=")' <<< 'some stuff "A234DG3" maybe more stuff'
.*?" matches everything up to and including the first quote. The \K escape resets the start of the match and therefore works like a handy, variable-length lookbehind assertion. (I could have used a real lookbehind, but I like \K.) [^"]+ matches the text between the quotes. (?=") is a lookahead assertion that ensures a " follows the match, without including it in the match.
So after more playing about I've come up with this which gives me what I'm after:
grep -r -E -o '"[A-Za-z0-9]{7}"' . | less
With the -E allowing the use of the {7} length matcher
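The command above still prints the surrounding quotes. One possible way to get the bare code, as the question asks, is to strip them afterwards with tr. In this sketch extract_codes is a made-up helper name and the second input line is invented test data; it reads stdin rather than searching a directory tree.

```shell
# extract_codes is a hypothetical helper: print every 7-character
# alphanumeric code that appears between double quotes on stdin.
extract_codes() {
    grep -E -o '"[A-Za-z0-9]{7}"' | tr -d '"'
}

printf '%s\n' 'some stuff "A234DG3" maybe more stuff' \
              'x = "short"; y = "toolongvalue"; z = "Zz99aa1"' | extract_codes
```

This prints A234DG3 and Zz99aa1. When searching recursively with -r, grep prefixes each match with the file name, so the pipeline would need adjusting for that case.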

Why does this working regex not work with sed?

I have this type of text:
Song of Solomon 1:1: The song of songs, which is Solomon’s.
John 3:16:For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
III John 1:8: We therefore ought to receive such, that we might be fellowhelpers to the truth.
I am trying to remove the verse reference (or metadata, if you will) and get just the plain text content. The example text shows three different types of references (multi-word, single-word, and Roman numeral + word). I thought it would be easiest to match, from the beginning of each line, anything up to "number:number:", and then substitute it with "" (the empty string).
I tested a regex that seems to work (as I described):
First find until "number:number:" excluding it [or: .+?(?=(\s+)(\d+)(:)(\d+)(:))],
And then include the "number:number:" pattern [or: (\s+)(\d+)(:)(\d+)(:)]
Which leads to the following regex:
.+?(?=(\s+)(\d+)(:)(\d+)(:))(\s+)(\d+)(:)(\d+)(:)
The regex seems to work fine, you can try it here, the problem is that when I try to use the regex with sed it just does not work:
$ sed 's/.+?(?=(\s+)(\d+)(:)(\d+)(:))(\s+)(\d+)(:)(\d+)(:)//g' testcase.txt
It will produce the same text as the input, when it should produce:
The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
We therefore ought to receive such, that we might be fellowhelpers to the truth.
Any help please?
Thank you very much!
This awk should do:
awk -F": *" '{print $3}' file
The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
We therefore ought to receive such, that we might be fellowhelpers to the truth.
To make it more secure to the number:number: use this:
awk -F"[0-9]+:[0-9]+: *" '{print $2}' file
The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
We therefore ought to receive such, that we might be fellowhelpers to the truth.
This will also prevent problems with : within the text.
Using Adam's regex, we can shorten it some.
awk -F"([0-9]+:){2} ?" '{print $2}' file
or
awk -F"([0-9]+:){2} ?" '{$0=$2}1' file
You can use the following sed command:
sed 's/.*[0-9]\+:[0-9]\+: *//' file.txt
If you have only basic posix regexes available, you need to use the following command:
sed 's/.*[0-9]\{1,\}:[0-9]\{1,\}: \{0,\}//' file.txt
I need to use \{1,\} since the \+ operator is not part of the basic POSIX regex specification (it is a GNU extension). The * operator is part of it, so " \{0,\}" could equally be written as " *".
Btw, if you have GNU goodies, you can also use grep:
grep -oP '.*([0-9]+:){2} *\K.*' file.txt
I'm using the \K option here. \K clears the current match until this point which can be used like a lookbehind assertion - but with a variable length.
This:
sed -r 's/.*([0-9]+:){2} ?//' testcase.txt
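For context, the command in the question leaves the input unchanged because sed uses POSIX regular expressions: there is no lookahead (?=...), no lazy +?, and \d is not understood as a digit class. A sketch of the same idea with -E instead of -r, assuming a sed that accepts -E (current GNU and BSD versions do); strip_reference is a made-up helper name and the sample verses are shortened:

```shell
# strip_reference is a hypothetical helper: remove everything up to and
# including the "number:number:" reference and any spaces after it.
strip_reference() {
    sed -E 's/^.*[0-9]+:[0-9]+: *//'
}

printf '%s\n' 'John 3:16:For God so loved the world.' \
              'III John 1:8: We therefore ought to receive such.' | strip_reference
```

This prints the two verse texts without their references.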
This is the job cut was invented to do:
$ cut -d: -f3- file
The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
We therefore ought to receive such, that we might be fellowhelpers to the truth.
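One detail worth checking with cut: when a space follows the second colon, it stays at the start of the output. A small sketch that trims it, using a shortened sample verse; the variable name trimmed is only for illustration:

```shell
# cut keeps whatever follows the second colon, including a leading space,
# and -f3- also keeps any later colons that are part of the verse text.
trimmed=$(printf '%s\n' 'III John 1:8: We therefore ought: to receive such.' \
    | cut -d: -f3- \
    | sed 's/^ *//')    # drop the leading blanks cut leaves behind

printf '%s\n' "$trimmed"
```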

GREP expression

I need help with a GREP expression to find and replace a variable group of words.
The sentence always starts with the same two words (Bold italicized) and always ends with a (colon), but the bit in the middle varies.
So I need to search for:
Bold italicized then any string of words then :
ie. starts with "Bold italicized", then any group of words, ends with ":"
For example:
Bold italicized May 6, 2010:
I will then apply some formatting to that text.
Thank you.
The right tool to do this is not grep but sed:
EXAMPLE in a shell :
$ cat file.txt
Bold italicized foo bar:
Bold italicized qux:
$ sed 's/^Bold italicized\(.*\):/do something with "\1"/g' file.txt
do something with " foo bar"
do something with " qux"
$
NOTE
you will find tons of examples and documentation here or here
the basic sed substitution command is s/regex/substitution/modifier
the regex part uses ^, which means beginning of line, and \( \) to make a capture group
This should do it, although this is a pretty simple one, so it seems like you should have been able to come up with this yourself, even as a beginner.
^Bold italicized.+?:
If you want to learn a little bit more about how to use GREP, I would recommend the InDesign GREP reference.

SED: Inserting an existing pattern, to several other places on the same line

Again a SED question from me :)
So, same as last time, I'm wrestling with phone numbers. This time the problem is a bit different.
I currently have this kind of organization in my text file:
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555
Now, every areacode can have unknown number of numbers, and also the phone numbers are not fixed in length.
What I would like to know, is how could I combine areacode and phone number, to look something like this:
4444-111111, 4444-2222222, 4444-33333333
My first idea was to add again a line break before each phone number and to match these sections with regex, and then just add the first remembered item to second, and first to third:
\1-\2, \1-\3, etc
But of course since sed can only remember 9 arguments, and there can be more than 10 numbers in one line this doesn't work. Moreover, also non-fixed list of phone numbers made this a no go.
I'm again looking primarily the SED option, as I've been trying to get proficient with it - but more efficient solutions with other tools are of course definitely welcome!
$ cat input.txt | sed '1d;s/NUM:/ /g' | awk '{for(i=2;i<=NF;i++)printf("%s-%s%s", $1, $i, i==NF?"\n":",")}'
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
This might work for you:
sed '1d;:a;s/^\(\S*\)\(.*\)NUM:/\1\2,\1-/;ta;s/[^,]*,//;s/ //g' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
or:
awk 'NR>1{gsub(/NUM:/,","$1"-");sub(/[^,]*,/,"");gsub(/ /,"");print}' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
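The answers above join the numbers with a bare comma, while the question asks for ", " as the separator. A possible awk variant that produces that format is sketched below; join_numbers is a made-up helper name, and the sketch assumes every field after the first starts with NUM:.

```shell
# join_numbers is a hypothetical helper: skip the header line, then prefix
# every NUM: entry with the area code and join the results with ", ".
join_numbers() {
    awk 'NR > 1 {
        out = ""
        for (i = 2; i <= NF; i++) {
            num = $i
            sub(/^NUM:/, "", num)            # strip the NUM: label
            sep = (i > 2) ? ", " : ""        # no separator before the first
            out = out sep $1 "-" num
        }
        print out
    }'
}

join_numbers <<'EOF'
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555
EOF
```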
TXR:
@(collect)
@area @(coll :mintimes 1)NUM:@{num /[0-9]+/}@(end)
@(output)
@(rep)@area-@num, @(last)@area-@num@(end)
@(end)
@(end)
Run:
$ txr phone.txr phone.txt
4444-111111, 4444-2222222, 4444-33333333
5555-1111111, 5555-2222, 5555-3333333, 5555-44444444, 5555-5555555
$ cat phone.txt
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555