I want to write a flexible regex for grep that will return search terms found within a certain distance from each other.
The ideal behavior is something like research databases, where you can search for articles that have capital and GDP within 15 words of each other. That would include articles where the strings capital and GDP are separated by five, six, seven, etc., alphanumeric strings of unspecified length. The regex should tolerate punctuation (e.g., commas, periods, hyphens) as well as accent marks and diacritics, so it would also return results where chechè and lavi are no more than five strings apart.
I imagine the statement will involve lookaheads and quantifiers like {1,15}, or maybe piping one grep through another grep, but that loses the benefit of GREP_OPTIONS='--color=auto'. Constructing it is really beyond my skill set. I have a set of .txt documents that I want to run the search over, but making the regex flexible enough to change the distance between strings or to truncate the terms would also be useful for others who have things like fieldnotes or reading notes in a standard format.
EDIT
Below is a sample of passages taken from the Bible.
Ye shall buy meat of them for money, that ye may eat; and ye shall also buy water of them for money, that ye may drink. For the Lord thy God hath blessed thee in all the works of thy hand: he knoweth thy walking through this great wilderness: these forty years the Lord thy God hath been with thee; thou hast lacked nothing... Thou shalt sell me meat for money, that I may eat; and give me water for money, that I may drink: only I will pass through on my feet: (as the children of Esau which dwell in Seir, and the Moabites which dwell in Ar, did unto me:) until I shall pass over Jordan into the land which the Lord our God giveth us. But Sihon king of Heshbon would not let us pass by him: for the Lord thy God hardened his spirit, and made his heart obstinate, that he might deliver him into thy hand, as appeareth this day. And the Lord said unto me, Behold, I have begun to give Sihon and his land before thee: begin to possess, that thou mayest inherit his land. Then Sihon came out against us, he and all his people, to fight at Jahaz. And the Lord our God delivered him before us; and we smote him, and his sons, and all his people. And if the way be too long for thee, so that thou art not able to carry it; or if the place be too far from thee, which the Lord thy God shall choose to set his name there, when the Lord thy God hath blessed thee: then shalt thou turn it into money, and bind up the money in thine hand, and shalt go unto the place which the Lord thy God shall choose: and thou shalt bestow that money for whatsoever thy soul lusteth after, for oxen, or for sheep, or for wine, or for strong drink, or for whatsoever thy soul desireth: and thou shalt eat there before the Lord thy God, and thou shalt rejoice, thou, and thine household, and the Levite that is within thy gates; thou shalt not forsake him: for he hath no part nor inheritance with thee... 
Now it came to pass, that at what time the chest was brought unto the king’s office by the hand of the Levites, and when they saw that there was much money, the king’s scribe and the high priest’s officer came and emptied the chest, and took it, and carried it to his place again. Thus they did day by day, and gathered money in abundance. And when they had finished it, they brought the rest of the money before the king and Jehoiada, whereof were made vessels for the house of the Lord , even vessels to minister, and to offer withal, and spoons, and vessels of gold and silver. And they offered burnt offerings in the house of the Lord continually all the days of Jehoiada. Thou hast bought me no sweet cane with money, neither hast thou filled me with the fat of thy sacrifices; but thou hast made me to serve with thy sins, thou hast wearied me with thine iniquities... Howbeit there were not made for the house of the Lord bowls of silver, snuffers, basins, trumpets, any vessels of gold, or vessels of silver, of the money that was brought into the house of the Lord: but they gave that to the workmen, and repaired therewith the house of the Lord. Moreover they reckoned not with the men, into whose hand they delivered the money to be bestowed on workmen: for they dealt faithfully. The trespass money and sin money was not brought into the house of the Lord: it was the priests’.
If I wanted to grep for instances where shalt and money are co-present within five words (including punctuation), how would I write that regex?
I'm not sure how to give expected results since grep --context=1 would include more than just the strings with 0-5 strings in between, but I imagine the results would identify:
shalt sell me meat for money
shalt thou turn it into money
money in thine hand, and shalt
shalt bestow that money
But would not return shall buy meat of them for money, since 'money' appears as the sixth string.
Well, it's not grep but this seems to do what you asked for using GNU awk for multi-char RS and word boundaries:
$ cat tst.awk
BEGIN {
    RS="^$"
    split(words,word)
}
{
    gsub(/#/,"#A"); gsub(/{/,"#B"); gsub(/}/,"#C")
    gsub("\\<"word[1]"\\>","{")
    gsub("\\<"word[2]"\\>","}")
    while ( match($0,/{[^{}]+}|}[^{}]+{/) ) {
        tgt = substr($0,RSTART,RLENGTH)
        gsub(/}/,word[2],tgt)
        gsub(/{/,word[1],tgt)
        gsub(/#C/,"}",tgt); gsub(/#B/,"{",tgt); gsub(/#A/,"#",tgt)
        if ( gsub(/[[:space:]]+/,"&",tgt) <= range ) {
            print tgt
        }
        $0 = substr($0,RSTART+length(word[1]))
    }
}
$ awk -v words='money shalt' -v range=5 -f tst.awk file
shalt sell me meat for money
shalt thou turn it into money
money in thine hand, and shalt
shalt bestow that money
$ awk -v words='and him' -v range=10 -f tst.awk file
him: for the Lord thy God hardened his spirit, and
and made his heart obstinate, that he might deliver him
him before us; and
and we smote him
him, and
Note that the above works even with input like shalt sell me meat for money in thine hand, and shalt where one of the words (money) appears 5 words after the first occurrence of the other word (shalt) AND 5 words before a second occurrence of that first word (again, shalt):
$ echo 'shalt sell me meat for money in thine hand, and shalt' |
awk -v words='shalt money' -v range=5 -f tst.awk
shalt sell me meat for money
money in thine hand, and shalt
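For comparison, here is a minimal sketch of the same idea done by counting tokens instead of rewriting the text with placeholder characters. It is not the script above: the function name near_words, its argument order, and the set of punctuation stripped before comparing words are all assumptions made for illustration. It splits the input on whitespace, remembers every token, and prints each span where one word is followed by the other within RANGE tokens.

```shell
# Hypothetical helper: near_words WORD1 WORD2 RANGE < text
near_words() {
    awk -v w1="$1" -v w2="$2" -v range="$3" '
    {
        for (i = 1; i <= NF; i++) {
            tok[++n] = $i                     # token as it appears in the text
            bare[n] = $i
            gsub(/[.,;:!?()]/, "", bare[n])   # assumed punctuation set, stripped for comparison only
        }
    }
    END {
        for (i = 1; i <= n; i++) {
            if (bare[i] != w1 && bare[i] != w2) continue
            other = (bare[i] == w1) ? w2 : w1
            for (j = i + 1; j <= n && j - i <= range; j++) {
                if (bare[j] == bare[i]) break   # a nearer start exists; it is handled when i reaches j
                if (bare[j] == other) {
                    out = tok[i]
                    for (k = i + 1; k <= j; k++) out = out " " tok[k]
                    print out
                    break
                }
            }
        }
    }'
}

echo 'shalt sell me meat for money in thine hand, and shalt' | near_words shalt money 5
```

Because accented letters are left alone (only the listed punctuation is removed), words such as chechè should compare correctly as long as awk runs in a suitable locale. Matched spans are printed with single spaces between tokens, so original line breaks are not preserved.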
For colors, file names, and line numbers:
Do this to see the colors available to you in your terminal (each line will be output in a different color):
$ for ((c=0; c<$(tput colors); c++)); do tput setaf "$c"; tput setaf "$c" | cat -v; echo "=$c"; done; tput setaf 0
^[[30m=0
^[[31m=1
^[[32m=2
^[[33m=3
^[[34m=4
^[[35m=5
^[[36m=6
^[[37m=7
Now that you can see what those escape sequences and numbers mean, update the awk script to (\033 = ^[ = Esc):
$ cat tst.awk
BEGIN {
    RS="^$"
    split(words,word)
    c["black"] = "\033[30m"
    c["red"] = "\033[31m"
    c["green"] = "\033[32m"
    c["yellow"] = "\033[33m"
    c["blue"] = "\033[34m"
    c["pink"] = "\033[35m"
    c["teal"] = "\033[36m"
    c["grey"] = "\033[37m"
    for (color in c) {
        print c[color] color c["black"]
    }
}
{
    gsub(/#/,"#A"); gsub(/{/,"#B"); gsub(/}/,"#C")
    gsub("\\<"word[1]"\\>","{")
    gsub("\\<"word[2]"\\>","}")
    while ( match($0,/{[^{}]+}|}[^{}]+{/) ) {
        tgt = substr($0,RSTART,RLENGTH)
        gsub(/}/,word[2],tgt)
        gsub(/{/,word[1],tgt)
        gsub(/#C/,"}",tgt); gsub(/#B/,"{",tgt); gsub(/#A/,"#",tgt)
        if ( gsub(/[[:space:]]+/,"&",tgt) <= range ) {
            print FILENAME, FNR, c["red"] tgt c["black"]
        }
        $0 = substr($0,RSTART+length(word[1]))
    }
}
When you run it you'll see a dump of all available colors, and each piece of target text will be preceded by the file name and the line number within that file, with the text itself colored red.
Short answer:
grep 'shalt\W\+\(\w\+\W\+\)\{0,5\}money'
Maybe in both directions:
grep 'shalt\W\+\(\w\+\W\+\)\{0,5\}money\|money\W\+\(\w\+\W\+\)\{0,5\}shalt'
https://www.gnu.org/software/grep/manual/grep.html:
‘\w’
Match word constituent, it is a synonym for ‘[_[:alnum:]]’.
‘\W’
Match non-word constituent, it is a synonym for ‘[^_[:alnum:]]’.
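To make the pieces concrete, here is a small sketch that runs the two-direction pattern over a few lines. The sample text and the variable names sample and pattern are made up for illustration; the pattern itself is the one shown above and relies on GNU grep extensions (\w, \W, \+ and \| in basic regex syntax).

```shell
# Sample lines; the last one has six words between the two terms.
sample='shalt sell me meat for money
shalt bestow that money
money in thine hand, and shalt
shalt one two three four five six money'

# "shalt", then 0 to 5 intervening words, then "money" (or the reverse order).
pattern='shalt\W\+\(\w\+\W\+\)\{0,5\}money\|money\W\+\(\w\+\W\+\)\{0,5\}shalt'

# Prints the first three lines; the fourth has too many words in between.
printf '%s\n' "$sample" | grep "$pattern"
```

Note that {0,5} counts the words between the two terms, so the second term can be up to the sixth word after the first; lower the upper bound by one if the distance should be measured the other way.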
Generic answer to construct the grep dynamically, in this case with a shell function:
find_adjacent() {
    dist="$1"; shift
    grep1="$1"; shift
    grep2="$1"; shift
    between='\W\+\(\w\+\W\+\)\{0,'"$dist"'\}'
    regex="$grep1$between$grep2\|$grep2$between$grep1"
    printf 'Using the regex: %s\n' "$regex" 1>&2
    # Pass all remaining arguments (options and file names) through to grep.
    grep "$regex" "$@"
}
Example usage:
echo 'shalt sell me meat for money
shalt thou turn it into money
money in thine hand, and shalt
shalt bestow that money
capital and GDP' | find_adjacent 3 shalt money -i --color=auto
or to match across lines:
find_adjacent 5 shalt money -z file_with_the_bible_passages.txt
Edit
As pointed out by EdMorton, this only finds the first part of a continuous match. It would still match the right line, but the color highlighting would be off a bit.
To fix this the regex gets more complicated, because it has to match any continuous "shalt ... money ... shalt" chain in 4 cases:
"shalt ... money ... shalt"
"shalt ... money ... shalt ... money"
"money ... shalt ... money"
"money ... shalt ... money ... shalt"
This can be done by replacing the regex=... line with:
regex1="$grep1\($between$grep2$between$grep1\)\+"
regex2="$grep1$between$grep2\($between$grep1$between$grep2\)*"
regex3="$grep2\($between$grep1$between$grep2\)\+"
regex4="$grep2$between$grep1\($between$grep2$between$grep1\)*"
regex="$regex1\|$regex2\|$regex3\|$regex4"
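A quick way to see the difference is to print only the matched text with grep -o. The sketch below rebuilds both regexes outside the function; the variable names (w1, w2, simple, chained, line) are made up for illustration, and GNU grep is assumed.

```shell
w1=shalt
w2=money
dist=5
between='\W\+\(\w\+\W\+\)\{0,'"$dist"'\}'

# The original two-direction regex.
simple="$w1$between$w2\|$w2$between$w1"

# The four chained cases from the edit above.
chain1="$w1\($between$w2$between$w1\)\+"
chain2="$w1$between$w2\($between$w1$between$w2\)*"
chain3="$w2\($between$w1$between$w2\)\+"
chain4="$w2$between$w1\($between$w2$between$w1\)*"
chained="$chain1\|$chain2\|$chain3\|$chain4"

line='shalt sell me meat for money in thine hand, and shalt'

# -o prints only the part of the line that matched.
printf '%s\n' "$line" | grep -o "$simple"
printf '%s\n' "$line" | grep -o "$chained"
```

With the simple regex the match should stop at the first money; with the chained regex, grep picks the longest alternative starting at the same position, so the whole line should be reported (and highlighted under --color).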
Additionally it might be mixed up like this:
"shalt xxx shalt xxx money xxx money"
With a distance of max 3 words between, the above regex still would only find:
"shalt xxx shalt xxx money"
To match those cases the only viable solution is to match only the words themselves and use look-aheads/look-behinds (this needs a more advanced regex implementation, e.g. GNU grep's -P for Perl regular expressions):
find_adjacent() {
    dist="$1"; shift
    word1="$1"; shift
    word2="$1"; shift
    ahead='\W+(\w+\W+){0,'"$dist"'}'
    behind='(\W+\w+){0,'"$dist"'}\W+'
    regex="$word1(?=$ahead$word2)|(?<=$word2)$behind\K$word1|$word2(?=$ahead$word1)|(?<=$word1)$behind\K$word2"
    printf 'Using the regex: %s\n' "$regex" 1>&2
    # Pass all remaining arguments (options and file names) through to grep.
    grep -P "$regex" "$@"
}
Another example usage (search case insensitive, display filename and line, highlight the words found, search all files in a directory):
find_adjacent 15 capital GDP -i -Hn --color=auto -r folder_to_search
This is the address format.
The street number might be different (12 or 412, for example), and so might the number of words in the street name (here, Finch Ave East).
1460 Finch Ave East, Toronto, Ontario, A1A1A1
So I tried this:
^[0-9]+\s+[a-zA-Z]+\s+[a-zA-Z]+\s+[a-zA-Z]+[,]{1}+\s[a-zA-Z]+[,]{1}+\s+[a-zA-Z]+[,]{1}+\s[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d$
I usually recommend using regex capture groups, so you can break your matching problem down into smaller, simpler parts. For most cases I use \d, \w and \s for matching digits, word characters and whitespace.
I usually experiment on https://regex101.com before I put it into code, because it provides a nice interactive way to play with expressions and samples.
Regarding your question, the expression that I came up with is:
$regexp = "^(\d+)\s*((\w+\s*)+),\s*(\w+),\s*(\w+),\s*((\w\d)*)$"
In PowerShell I like to use the direct regex class, because it offers more granularity than the standard -match operator.
# Example match and results
$sample = "1460 Finch Ave East, Toronto, Ontario, A1A1A1"
$match = [regex]::Match($sample, $regexp)
$match.Success
$match | Select -ExpandProperty groups | Format-Table Name, Value
# Constructed fields
@{
    number = $match.Groups[1]
    street = $match.Groups[2]
    city = $match.Groups[4]
    state = $match.Groups[5]
    areacode = $match.Groups[6]
}
This will result in $match.Success being $true, and the following numbered capture groups will be presented in the Groups list:
Name Value
---- -----
0 1460 Finch Ave East, Toronto, Ontario, A1A1A1
1 1460
2 Finch Ave East
3 East
4 Toronto
5 Ontario
6 A1A1A1
7 A1
For constructing the fields, you can ignore 3 and 7 as those are partial-groups:
Name Value
---- -----
areacode A1A1A1
street Finch Ave East
city Toronto
state Ontario
number 1460
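The same capture-group idea can be sketched outside PowerShell, for example with sed. This is only an illustration under assumptions: the function name split_address, the | field separator, and the stricter letter-digit-letter digit-letter-digit shape for the last field are choices made here, not part of the answer above.

```shell
# Hypothetical helper: prints number|street|city|state|areacode for lines that match.
split_address() {
    sed -nE 's/^([0-9]+) +([^,]+), *([^,]+), *([^,]+), *([A-Za-z][0-9][A-Za-z][ -]?[0-9][A-Za-z][0-9])$/\1|\2|\3|\4|\5/p'
}

echo '1460 Finch Ave East, Toronto, Ontario, A1A1A1' | split_address
```

Each field is "anything up to the next comma", so multi-word street, city and state names are captured without nested groups. Lines that do not fit the pattern print nothing.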
To add to mákos' excellent answer, I would suggest using named capture groups and the $Matches automatic variable. This makes it super easy to grab the individual fields and turn them into objects for multiple input strings:
function Split-CanadianAddress {
    param(
        [Parameter(Mandatory,ValueFromPipeline)]
        [string[]]$InputString
    )
    $Pattern = "^(?<Number>\d+)\s*(?<Street>(\w+\s*)+),\s*(?<City>(\w+\s*)+),\s*(?<State>(\w+\s*)+),\s*(?<AreaCode>(\w\d)*)$"
    foreach($String in $InputString){
        if($String -match $Pattern){
            $Fields = @{}
            $Matches.Keys |Where-Object {$_ -isnot [int]} |ForEach-Object {
                $Fields.Add($_,$Matches[$_])
            }
            [pscustomobject]$Fields
        }
    }
}
The $Matches hashtable will contain both the numbered and named capture groups, which is why I copy only the named entries to the $Fields variable before creating the pscustomobject.
Now you can use it like:
PS C:\> $sample |Split-CanadianAddress
Street : Finch Ave East
State : Ontario
AreaCode : A1A1A1
Number : 1460
City : Toronto
I've updated the pattern to allow for spaces in city and state names as well (think "New Westminster, British Columbia").
I'm trying to handle a file containing currencies with sed but can't figure out where my error is.
This is an extract from the file:
AED: United Arab Emirates DirhamAFN: Afghan AfghaniALL: Albanian LekAMD: Armenian DramANG: Netherlands Antillean GuldenAOA: Angolan KwanzaARS: Argentine PesoAUD: Australian DollarAWG: Aruban FlorinAZN: Azerbaijani ManatBAM: Bosnia & Herzegovina Convertible MarkBBD: Barbadian DollarBDT: Bangladeshi TakaBGN: Bulgarian LevBIF: Burundian FrancBMD: Bermudian DollarBND: Brunei DollarBOB: Bolivian BolivianoBRL: Brazilian Real*BSD: Bahamian DollarBWP: Botswana PulaBZD: Belize DollarCAD: Canadian Dollar[...]
I want to add a newline before each group of three uppercase letters followed by the character ":".
What I tried was sed -e 's/\([A-Z]{3}:)/\n\1/g list1.txt > list2.txt, but nothing is changed. In fact, when I just try /[A-Z]{3}/blabla/ nothing happens.
I am puzzled.
sed -r 's/([A-Z]{3}:)/\n\1/g' list1.txt
# or
# sed -e 's/\([A-Z]\{3\}:\)/\n\1/g' list1.txt
returns:
AED: United Arab Emirates Dirham
AFN: Afghan Afghani
ALL: Albanian Lek
AMD: Armenian Dram
ANG: Netherlands Antillean Gulden
AOA: Angolan Kwanza
ARS: Argentine Peso
AUD: Australian Dollar
AWG: Aruban Florin
AZN: Azerbaijani Manat
BAM: Bosnia & Herzegovina Convertible Mark
BBD: Barbadian Dollar
BDT: Bangladeshi Taka
BGN: Bulgarian Lev
BIF: Burundian Franc
BMD: Bermudian Dollar
BND: Brunei Dollar
BOB: Bolivian Boliviano
BRL: Brazilian Real*
BSD: Bahamian Dollar
BWP: Botswana Pula
BZD: Belize Dollar
CAD: Canadian Dollar
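The reason the original attempt did nothing is that, in sed's default basic syntax, { and } are literal characters unless escaped, while -r (or -E) switches to extended syntax where they act as a repetition count. A minimal sketch on a two-entry sample follows; the function name split_codes is made up, GNU sed is assumed (for \n in the replacement), and the second substitution is an addition that drops the empty first line the plain command would otherwise leave.

```shell
# Hypothetical helper: put every "XXX:" currency code at the start of its own line.
split_codes() {
    # 1st s///: newline before each three-capital code followed by a colon.
    # 2nd s///: remove the newline that ends up in front of the very first code.
    sed -E 's/([A-Z]{3}:)/\n\1/g; s/^\n//'
}

printf '%s\n' 'AED: United Arab Emirates DirhamAFN: Afghan AfghaniALL: Albanian Lek' | split_codes
```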
I am trying to match patterns in perl and need some help.
I need to delete from a string anything that matches [xxxx] i.e. opening bracket-things inside it-first closing bracket that occurs.
So I am trying to substitute with space the opening bracket, things inside, first closing bracket with the following code :
if($_ =~ /[/)
{
print "In here!\n";
$_ =~ s/[(.*?)]/ /ig;
}
Similarly I need to match i.e. angular bracket-things inside it-first closing angular bracket.
I am doing that using the following code :
if($_ =~ /</)
{
print "In here!\n";
$_ =~ s/<(.*?)>/ /ig;
}
This somehow does not seem to work. My sample data is as below:
'Joanne' <!--Her name does NOT contain "Kathleen"; see the section "Name"--> "'Jo'" 'Rowling', OBE [http://news bbc co uk/1/hi/uk/793844 stm Caine heads birthday honours list] BBC News 17 June 2000 Retrieved 25 October 2000 , [http://content scholastic com/browse/contributor jsp?id=3578 JK Rowling Biography] Scholastic com Retrieved 20 October 2007 better known as 'J K Rowling' ,<ref name=telegraph>[http://www telegraph co uk/news/uknews/1531779/BBCs-secret-guide-to-avoid-tripping-over-your-tongue html Daily Telegraph, BBC's secret guide to avoid tripping over your tongue, 19 October 2006] is a British <!--do not change to "English" or "Scottish" until issue is resolved --> author best known as the creator of the [[Harry Potter]] fantasy series, the idea for which was conceived whilst on a train trip from Manchester to London in 1990 The Potter books have gained worldwide attention, won multiple awards, sold more than 400 million copies and been the basis for a popular series of films, in which Rowling had creative control serving as a producer in two of the seven installments [http://www businesswire com/news/home/20100920005538/en/Warner-Bros -Pictures-Worldwide-Satellite-Trailer-Debut%C2%A0Harry Business Wire - Warner Bros Pictures mentions J K Rowling as producer ]
Any help would be appreciated. Thanks!
You need to use this:
1 while s/\[[^\[\]]*\]/ /g;
Demo:
% echo "i have [some [square] brackets] in [here] and [here] today."| perl -pe '1 while s/\[[^\[\]]*\]/NADA/g'
i have NADA in NADA and NADA today.
Versus the failing:
% echo "i have [some [square] brackets] in [here] and [here] today." | perl -pe 's/\[.*?\]/NADA/g'
i have NADA brackets] in NADA and NADA today.
The recursive regular expression I leave as an exercise for the reader. :)
EDIT: Eric Strom kindly provided a recursive solution, so you don’t have to use 1 while:
% echo "i have [some [square] brackets] in [here] and [here] today." | perl -pe 's/\[(?:[^\[\]]*|(?R))*\]/NADA/g'
i have NADA in NADA and NADA today.
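The "1 while" trick (keep replacing innermost brackets until nothing changes) also carries over to sed, using a label and the t command, which branches only if a substitution succeeded. This is a sketch rather than part of the answers here; the function name strip_brackets is made up, and NADA is kept as the replacement text only to mirror the demo above.

```shell
# Hypothetical helper: repeatedly replace innermost [...] groups until none remain.
strip_brackets() {
    # [^][]* means "any run of characters that are neither ] nor [".
    sed -E -e ':again' -e 's/\[[^][]*\]/NADA/g' -e 't again'
}

echo 'i have [some [square] brackets] in [here] and [here] today.' | strip_brackets
```

The first pass replaces [square] and both [here] groups; the second pass replaces the now-flat [some NADA brackets]; the third pass changes nothing, so the loop ends. The same shape with <[^<>]*> would handle angle brackets.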
$_ =~ /someregex/ will not modify $_
Just a note, $_ =~ /someregex/ and /someregex/ do the same thing.
Also, you don't need to check for the existence of [ or < or the grouping parenthesis:
s/\[.*?\]/ /g;
s/<.*?>/ /g;
will do the job you want.
Edit: changed code to match the fact you're modifying $_
Square brackets have special meaning in the regex syntax, so escape them: /\[.*?\]/. (You also don't need the parentheses here, and doing case-insensitive matching is pointless.)
It's been a long time since I had to wrestle with Perl, but testing $_ with a plain match does not modify $_ (only s/// does). You don't need the test anyway; just run the replacement, and if the pattern doesn't match anywhere, then it won't do anything.
I'm writing a bash script that will show me what TV programs to watch today. It will get this information from a text file.
The text is in the following format:
Monday:
Family Guy (2nd May)
Tuesday:
House
The Big Bang Theory (3rd May)
Wednesday:
The Bill
NCIS
NCIS LA (27th April)
Thursday:
South Park
Friday:
FlashForward
Saturday:
Sunday:
HIGNFY
Underbelly
I'm planning to use 'date +%A' to work out the day of the week and use the output in a grep regex to return the appropriate lines from my text file.
If someone can help me with the regex I should be using, I would be eternally grateful.
Incidentally, this bash script will be used in a Conky dock, so if anyone knows of a better way to achieve this then I'd like to hear about it.
Perl solution:
#!/usr/bin/perl
my $today=`date +%A`;
$today=~s/^\s*(\w*)\s*(?:$|\Z)/$1/gsm;
my $tv=join('',(<DATA>));
for my $t (qw(Monday Tuesday Wednesday Thursday Friday Saturday Sunday)) {
print "$1\n" if $tv=~/($t:.*?)(?:^$|\Z)/sm;
}
print "Today, $1\n" if $tv=~/($today:.*?)(?:^$|\Z)/sm;
__DATA__
Monday:
Family Guy (2nd May)
Tuesday:
House
The Big Bang Theory (3rd May)
Wednesday:
The Bill
NCIS
NCIS LA (27th April)
Thursday:
South Park
Friday:
FlashForward
Saturday:
Sunday:
HIGNFY
Underbelly
sed -n '/^Tuesday:/,/^$/p' list.txt
grep -B10000 -m1 ^$ list.txt
-B10000: print 10000 lines before the match
-m1: match at most once
^$: match an empty line
Alternatively, you can use this:
awk '/^'`date +%A`':$/,/^$/ {if ($0~/[^:]$/) print $0}' guide.txt
This awk script matches a consecutive group of lines which starts with /^Day:$/ and ends with a blank line. It only prints a line if the line ends with a character that is not a colon. So it won't print "Sunday:" or the blank line.
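A variation on the same idea, sketched as a shell function so the day name is passed to awk as a variable instead of being spliced into the program text. The function name shows_for_day and the file name guide.txt are assumptions for illustration, and the day headers are assumed to be in English exactly as in the sample file.

```shell
# Hypothetical helper: shows_for_day DAY FILE
shows_for_day() {
    awk -v day="$1" '
        $0 == day ":"  { in_day = 1; next }   # header line such as "Tuesday:"; skip it
        in_day && /^$/ { exit }               # a blank line ends the block
        in_day         { print }              # everything in between is a show
    ' "$2"
}

# Typical call; LC_ALL=C keeps date from printing a localized day name.
# shows_for_day "$(LC_ALL=C date +%A)" guide.txt
```

A day with no entries (Saturday in the sample) simply prints nothing.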