RegEx for matching words preceding commas, with exceptions - regex

The section of text I'm targeting always begins with “Also there is” and ends with a period. The single names between the commas are what I'm trying to target (i.e. "randomperson" in the example below). These names will always be different. It gets tricky because there are other things present that are not single-word “names”. Maybe I can match everything in between the commas ONLY IF it’s a single word/name, but I can't seem to figure that one out. The list of names could be much longer or even shorter, so the expression must be dynamic and not just match a set number of names.
Targeted Text:
Also there is a reinforced stone wall, a wooden wall, a stone wall,
randomperson, a lumbering earth elemental, randomperson, randomperson,
randomperson.
(broken over multiple lines for readability)
How do I solve this problem?

Code
sed -r ':a
s/, ([a-zA-Z]*)([,\.])/\n##\1\n\2/
ta
' | sed -n 's/##//gp'
Output
randomperson
randomperson
randomperson
randomperson
Explanation:
Start a loop
sed -r ':a
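Find the next occurrence of ', oneword,' or ', oneword.' and move it onto its own line as ##oneword, keeping the trailing comma or period on the following line. The s command has no g flag because adjacent matches share a comma; the loop picks up the remaining ones. The ## is a magic marker to identify the extracted names later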
s/, ([a-zA-Z]*)([,\.])/\n##\1\n\2/
End loop
ta
Filter lines based on ## to extract only oneword
' | sed -n 's/##//gp'

In a program
my $text = "Also there is a reinforced stone wall, a wooden wall, a stone wall, "
. "randomperson, a lumbering earth elemental, randomperson, "
. "randomperson, randomperson.";
my @single_words =
grep { split == 1 }
split /\s*,|\.|\!|;\s*/,
($text =~ /Also there is (.*)/)[0];
The regex on $text gets text after that initial phrase, then split
returns the list of strings between commas (or other punctuation), and grep filters out strings that have more than one word†.
On the command line
echo "Also there is a reinforced stone wall, a wooden wall,..., randomperson,..." |
perl -wnE'say for
grep { split == 1 }
split /\s*,|\.|\!|;\s*/, (/Also there is (.*)/)[0]'
The same as above.
Please show us what you have tried for additional explanations and commentary.
†  A lone split uses defaults, split ' ', $_, where ' ' is a special pattern that splits on \s+ and discards leading and trailing space. But in the expression split == 1 the split is in a scalar context (imposed by the operator == which needs a single value on both sides) and so it returns the number of elements in the list, then compared against 1.
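For comparison, the same split-then-filter idea can be sketched in Python. This is only an illustration under the assumptions stated in the question (the list starts after "Also there is" and ends at the first period); the function name single_word_items is made up for this example.

```python
import re

def single_word_items(text):
    """Return the one-word items listed after 'Also there is', up to the first period."""
    # DOTALL lets the list span several lines, as in the sample text.
    match = re.search(r"Also there is (.*?)\.", text, flags=re.DOTALL)
    if match is None:
        return []
    items = (item.strip() for item in match.group(1).split(","))
    # Keep an item only when it consists of exactly one whitespace-free word.
    return [item for item in items if len(item.split()) == 1]

text = ("Also there is a reinforced stone wall, a wooden wall, a stone wall,\n"
        "randomperson, a lumbering earth elemental, randomperson, randomperson,\n"
        "randomperson.")
print(single_word_items(text))
```

Counting words per item, rather than matching names directly, keeps the pattern independent of how many names the list contains.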

Related

Bash Script for Concatenating Broken Dashed Words

I've scraped a large amount (10GB) of PDFs and converted them to text files, but due to the format of the original PDFs, there is an issue:
Many of the words which break across lines have a dash in them that artificially breaks up the word, like this:
You can see that this happened because the original PDF files have breaks:
What would be the cleanest and fastest way to "join" every word instance that matches this pattern inside of a .txt file?
Perhaps some sort of Regex search, like for a [a-z]\-\s \w of some kind (word character followed by dash followed by space) would work?
Or would some sort of sed replacement work better?
Currently, I'm trying to get a sed regex to work, but I'm not sure how to translate this to use capture groups to replace the selected text:
sed -n '\%\w\- [a-z]%p' Filename.txt
My input text would look like this:
The dog rolled down the st- eep hill and pl- ayed outside.
And the output would be:
The dog rolled down the steep hill and played outside.
Ideally, the expression would also work for words split up by a newline, like this:
The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a
To this:
The rule which provided for the consideration
of the resolution, was agreed to earlier by a
It's straightforward in sed:
sed -e ':a' -e '/-$/{N;s/-\n//;ba
}' -e 's/- //g' filename
This translates roughly as "if the line ends with a dash, read in the next line as well (so that you have a line with a newline in the middle), then excise the dash and the newline, and loop back to the beginning just in case this new line also ends with a dash. Then remove any instances of '- '".
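The same two steps can be sketched in Python for anyone who prefers a script over sed. This is a minimal illustration, not a tuned solution for 10GB of text; the function name join_broken_words is hypothetical, and requiring a word character on both sides of the dash is an assumption added here to avoid touching free-standing dashes.

```python
import re

def join_broken_words(text):
    # Join words split by a dash at the end of a line:
    # "consid-\neration" becomes "consideration" (the line break is dropped).
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Join words split by a dash followed by spaces or tabs on one line:
    # "st- eep" becomes "steep".
    return re.sub(r"(\w)-[ \t]+(\w)", r"\1\2", text)

sample = ("The dog rolled down the st- eep hill and pl- ayed outside.\n"
          "The rule which provided for the consid-\n"
          "eration of the resolution, was agreed to earlier by a\n")
print(join_broken_words(sample), end="")
```

Like the sed version, this merges the two physical lines of a broken word into one, so line lengths in the output will differ from the input.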
You may use this gnu-awk code:
cat file
The dog rolled down the st- eep hill and pl- ayed outside.
The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a
Then use awk like this:
awk 'p != "" {
w = $1
$1 = ""
sub(/^[[:blank:]]+/, ORS)
$0 = p w $0
p = ""
}
{
$0 = gensub(/([_[:alnum:]])-[[:blank:]]+([_[:alnum:]])/, "\\1\\2", "g")
}
/-$/ {
p = $0
sub(/-$/, "", p)
}
p == ""' file
The dog rolled down the steep hill and played outside.
The rule which provided for the consideration
of the resolution, was agreed to earlier by a
If you can consider perl, then this may also work for you:
perl -0777 -pe 's/(\w)-\h+(\w)/$1$2/g; s/(\w)-\R(\w+)\s+/$1$2\n/g' file
You simply add backslash-parentheses (or use the -r or -E option if available to do away with the requirement to put backslashes before capturing parentheses) and recall the matched text with \1 for the first capturing parenthesis, \2 for the second, etc.
sed 's/\(\w\)\- \([a-z]\)/\1\2/g' Filename.txt
The \w escape is not standard sed but if it works for you, feel free to use it. Otherwise, it is easy to replace with [A-Za-z0-9_#] or whatever else you want to call "word characters".
I'm guessing not all of the matches will be hyphenated words so perhaps run the result through a spelling checker or something to verify whether the result is an English word. (I would probably switch to a more capable scripting language like Python for that, though.)

Getting rid of all words that contain a special character in a textfile

I'm trying to filter out all the words that contain any character other than a letter from a text file. I've looked around stackoverflow, and other websites, but all the answers I found were very specific to a different scenario and I wasn't able to replicate them for my purposes; I've only recently started learning about Unix tools.
Here's an example of what I want to do:
Input:
#derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag
Output:
I was there and it was awesome!
So words with punctuation can stay in the file (in fact I need them to stay) but any substring with special characters (including those of punctuation) needs to be trimmed away. This can probably be done with sed, but I just can't figure out the regex. Help.
Thanks!
Here is how it could be done using Perl:
perl -ane 'for $f (@F) {print "$f " if $f =~ /^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$/} print "\n"' file
I am using this input text as my test case:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag
output:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
I was there; it was awesome!
Command-line options:
-n loop around every line of the input file, do not automatically print it
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
The perl code splits each input line into the @F array, then loops over every field $f and decides whether or not to print it.
At the end of each line, print a newline character.
The regular expression ^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$ is used on each whitespace-delimited word
^ starts with
[a-zA-Z-\x27]+ one or more lowercase or capital letters or a dash or a single quote (\x27)
[?!;:,.]? zero or one of the following punctuation: ?!;:,.
(|) alternately match
[\d.]+ one or more numbers or .
$ end
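The breakdown above can be tried out with a short Python sketch. It reuses the same pattern on each whitespace-delimited word; the names WORD and keep_plain_words are invented for this illustration, and fullmatch stands in for the ^ and $ anchors.

```python
import re

# Letters, dashes or single quotes, optionally followed by one punctuation
# mark; or a run of digits and dots.
WORD = re.compile(r"(?:[a-zA-Z\x27-]+[?!;:,.]?|[\d.]+)")

def keep_plain_words(line):
    """Keep only the whitespace-delimited words that match WORD entirely."""
    return " ".join(word for word in line.split() if WORD.fullmatch(word))

print(keep_plain_words(
    "#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag"))
```

Filtering whole words, instead of deleting characters with a substitution, is what lets "awesome!" stay while "!!" and the URL go.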
Your requirements aren't clear at all but this MAY be what you want:
$ awk '{rec=sep=""; for (i=1;i<=NF;i++) if ($i~/^[[:alpha:]]+[[:punct:]]?$/) { rec = rec sep $i; sep=" "} print rec}' file
I was there and it was awesome!
sed -E 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g' will get rid of any words starting with punctuation. Which will get you half way there.
[[:space:]] is any whitespace character
[^a-zA-Z0-9[:space:]] is any special character
[^[:space:]]* is any number of non whitespace characters
Do it again with a ^ in place of the first [[:space:]] to remove those same words at the start of the line.

Greedy regex behavior not wanted. usual cures don't work

I have a large document that I needed to put anchors in. I appended a number to the end of the line. The format was " Area 1" This list goes on for hundreds of entries.
I tried to awk out the slice I wanted with the anchor but this is what I get.
cat file | awk '/Area 5/{print $0}'
Area 5
Area 50
Area 51
Area 52
Area 53
Area 54
Area 55
Area 56
Area 57
Area 58
Area 59
As you can see I wanted just "Area 5" but the regex engine matched it with 5 and 5x. Yes, I know it is being greedy. I tried to limit that behavior with:
/Area 5{1}/
and I still had this problem. I also tried {0} and {0,1} to no effect.
Question 1: What can I do to force awk (and grep as well) to limit it to the requested number?
Question 2: I used awk '/pattern/ { $0=$0 "" ++i }1' to append the number. It leaves "Area 1" I would like it to be Area1. Any ideas?
Thanks for the help.
B
To avoid matching longer numbers such as '50', you can use a word boundary.
In GNU awk, a word boundary is matched using \y.
To eliminate the space between 'Area' and the number, I simply print the two fields, $1 and $2, with nothing between them.
In my tests, the following worked:
cat test.txt | awk '/Area 5\y/{print $1 $2}'
Output
Area5
/Area 5([^0-9]|$)/ would account for the end of the line, as well as anything but a digit.
But a more awk way of doing things, would be:
awk '/^Area/ && $2==5' file
If the '5' is the end of the line, you can use /Area 5$/. The $ matches end-of-line.
If it's followed by further text, /Area 5[^0-9]/ should work. The [^0-9] matches one character that is anything except a digit.
Good luck!
Some proposals.
awk '$2==5' file
Area 5
awk '$2 ~ /^[5]$/' file
Area 5
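The behaviour in the question is not really greed: an unanchored pattern matches anywhere in the line, so "Area 5" is found inside "Area 50". The two fixes suggested in the answers, a word boundary and an exact comparison of the second field, can be sketched in Python as follows; the function names are hypothetical.

```python
import re

def matches_area(line, number):
    """True when 'Area <number>' occurs and the number is not followed by more digits."""
    # \b is Python's word boundary, playing the role of \y in GNU awk.
    return re.search(rf"Area {number}\b", line) is not None

def second_field_is(line, number):
    """Field-based check, in the spirit of awk '/^Area/ && $2==5'."""
    fields = line.split()
    return len(fields) >= 2 and fields[0] == "Area" and fields[1] == str(number)

lines = ["Area 5", "Area 50", "Area 51", "Area 59"]
print([line for line in lines if matches_area(line, 5)])
print([line for line in lines if second_field_is(line, 5)])
```

The field comparison is usually the sturdier choice when the number always sits in the same column.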

A regular expression mystery

I am investigating a regexp mystery. I am tired so I may be missing
something obvious - but I can't see any reason for this.
In the examples below, I use perl - but I first saw this in VIM,
so I am guessing it is something related to more than one regexp-engines.
Assume we have this file:
$ cat data
1 =2 3 =4
5 =6 7 =8
We can then delete the whitespace in front of the '=' with...
$ cat data | perl -ne 's,(.)\s+=(.),\1=\2,g; print;'
1=2 3=4
5=6 7=8
Notice that in every line, all instances of the match are replaced ;
we used the /g search modifier, which doesn't stop at the first replace,
and instead goes on replacing till the end of the line.
For example, both the space before the '=2' and the space before
the '=4' were removed ; in the same line.
Why not use simpler constructs like 's, =,=,g'? Well, we were
preparing for more difficult scenarios... where the right-hand side
of the assignments are quoted strings, and can be either
single or double-quoted:
$ cat data2
1 ="2" 3 ='4 ='
5 ='6' 7 ="8"
To do the same work (remove the whitespace before the equal sign),
we have to be careful, since the strings may contain the equal
sign - so we mark the first quote we see, and look for it
via back-references:
$ cat data2 | perl -ne 's,(.)\s+=(.)([^\2]*)\2,\1=\2\3\2,g; print;'
1="2" 3='4 ='
5='6' 7="8"
We used the back-reference \2 to search for anything that is not
the same quote as the one we first saw, any number of times ([^\2]*).
We then searched for the original quote itself (\2). If found,
we used back references to refer to the matched parts in the replace
target.
Now look at this:
$ cat data3
posAndWidth ="40:5 =" height ="1"
posAndWidth ="-1:8 ='" textAlignment ="Right"
What we want here, is to drop the last space character that exists
before all the instances of '=' in every line. Like before, we can't use
a simple 's, =",=",g', because the strings themselves may contain
the equal sign.
So we follow the same pattern as we did above, and use back-references:
$ cat data3 | perl -ne "s,(\w+)(\s*) =(['\"])([^\3]*)\3,\1\2=\3\4\3,g; print;"
posAndWidth="40:5 =" height ="1"
posAndWidth="-1:8 ='" textAlignment ="Right"
It works... but only on the first match of the line!
The space following 'textAlignment' was not removed, and neither was the one
on top of it (the 'height' one).
Basically, it seems that /g is not functional anymore: running the same
replace command without /g produces exactly the same output:
$ cat data3 | perl -ne "s,(\w+)(\s*) =(['\"])([^\3]*)\3,\1\2=\3\4\3,; print;"
posAndWidth="40:5 =" height ="1"
posAndWidth="-1:8 ='" textAlignment ="Right"
It appears that in this regexp, the /g is ignored.
Any ideas why?
Inserting some debug characters in your substitution sheds some light on the issue:
use strict;
use warnings;
while (<DATA>) {
s,(\w+)(\s*) =(['"])([^\3]*)\3,$1$2=$3<$4>$3,g;
print; # here -^ -^
}
__DATA__
posAndWidth ="40:5 =" height ="1"
posAndWidth ="-1:8 ='" textAlignment ="Right"
Output:
posAndWidth="<40:5 =" height ="1>"
posAndWidth="<-1:8 ='" textAlignment ="Right>"
# ^--------- match ---------------^
Note that the match goes through both quotes at once. It would seem that [^\3]* does not do what you think it does.
Regex is not the best tool here. Use a parser that can handle quoted strings, such as Text::ParseWords:
use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;
while (<DATA>) {
chomp;
my @a = quotewords('\s+', 1, $_);
print Dumper \@a;
print "@a\n";
}
__DATA__
posAndWidth ="40:5 =" height ="1"
posAndWidth ="-1:8 ='" textAlignment ="Right"
Output:
$VAR1 = [
'posAndWidth',
'="40:5 ="',
'height',
'="1"'
];
posAndWidth ="40:5 =" height ="1"
$VAR1 = [
'posAndWidth',
'="-1:8 =\'"',
'textAlignment',
'="Right"'
];
posAndWidth ="-1:8 ='" textAlignment ="Right"
I included the Dumper output so you can see how the strings are split.
I will elaborate on my comment to TLP's answer:
ttsiodras you are asking two questions:
1- why does your regex not produce the desired result? why does the g flag not work?
The answer is because your regular expression contains this part [^\3] which is not handled correctly: \3 is not recognised as a back reference. I looked for it but could not find a way to have a back reference in character class.
2- how do you remove the space preceding an equal sign and leave alone the part that comes after and is between quotes?
This would be a way to do it (see this reference):
$ cat data3 | perl -pe "s,(([\"']).*?\2)| (=),\1\3,g"
posAndWidth="40:5 =" height="1"
posAndWidth="-1:8 ='" textAlignment="Right"
The 1st part of the regex catches whatever is between quotes (single or double) and is replaced by the match, the second part corresponds to the equal sign preceded by a space that you are looking for.
Please note that this solution is only a work around the "interesting" part about the complement character class operator with back reference [^\3] by using the non-greedy operator *?
Finally, if you want to pursue the negative lookahead solution:
$ cat data3 | perl -pe 's,(\w+)(\s*) =(["'"'"'])((?:(?!\3).)*)\3,\1\2=\3\4\3,g'
posAndWidth="40:5 =" height="1"
posAndWidth="-1:8 ='" textAlignment="Right"
The part with the quotes between square brackets still means "[\"']" but I had to use single quotes around the whole perl command otherwise the negative lookahead (?!...) syntax returns an error in bash.
EDIT Corrected the regex with negative lookahead: notice the non-greedy operator *? again and the g flag.
EDIT Took ttsiodras's comment into account: removed the non-greedy operator.
EDIT Took TLP's comment into account
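The difference between the two patterns can be reproduced with Python's re module, which, like the Perl output shown in the question, does not treat \3 inside square brackets as a backreference: numeric escapes in a character class are read as characters. The sketch below uses the first line of data3; the variable names are made up for illustration.

```python
import re

line = 'posAndWidth ="40:5 =" height ="1"'

# Inside [...], \3 is an octal escape (the character with code 3), not a
# backreference, so [^\3]* runs past the inner quotes up to the last quote
# on the line and the whole line is consumed as a single match.
greedy = re.sub(r'''(\w+)(\s*) =(['"])([^\3]*)\3''', r'\1\2=\3\4\3', line)

# The lookahead (?!\3) checks, before each character is consumed, that it is
# not the captured quote, so every quoted value is matched separately.
tempered = re.sub(r'''(\w+)(\s*) =(['"])((?:(?!\3).)*)\3''', r'\1\2=\3\4\3', line)

print(greedy)
print(tempered)
```

With the first pattern only one substitution happens per line, which is why the g flag appears to be ignored; with the second, both assignments are rewritten.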

RegEx, colon separated list

I am trying to match a list of semicolon-separated emails. For the sake of keeping things simple, I am going to leave the email expression out of the mix and match it with any number of characters with no spaces in between them.
The following will be matched...
somevalues ;somevalues; somevalues;
or
somevalues; somevalues ;somevalues
The ending ; shouldn't be necessary.
The following would not be matched.
somevalues ; some values somevalues;
or
some values; somevalues some values
I have gotten this so far, but it doesn't work. Since I allow spaces around the semicolons, the expression can't tell whether a space is inside a word or next to a semicolon.
([a-zA-Z]*\s*\;?\s*)*
The following is matched (which it shouldn't be)
somevalue ; somevalues some values;
How do I make the expression only allow spaces if there is a ; to the left or right of it?
Why not just split on semi colon and then regex out the email addresses?
This following PCRE Expression should work.
\w+\s*(?:(?:;(?:\s*\w+\s*)?)+)?
However, if you add email address validation to this, you will need to
replace \w+ with (?:<your email validation regex>)
This is probably what you want; tested on http://regexr.com?2rnce
EDIT: Depending on the language, you might need to escape ; as \;
The problem comes from the ? in \;?
[a-zA-Z]*(\s*;\s*[a-zA-Z]*)*
should work.
Try
([a-zA-Z]+\s*;\s*)*([a-zA-Z]+\s*)?
Note that I changed * to + on the e-mail pattern since I assume you don't want strings like ; to match.
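To check the four sample strings from the question, here is a small Python sketch along the same lines as the expressions above. It is one possible formulation, not the only one: ITEM is a stand-in for the real email pattern, the names are hypothetical, and fullmatch is used so the whole string has to fit the pattern.

```python
import re

ITEM = r"[a-zA-Z]+"  # placeholder for the real email expression
# One item, then any number of "; item" groups, then an optional trailing ";".
# Whitespace is only allowed next to a semicolon or at the ends.
LIST = re.compile(rf"\s*{ITEM}(?:\s*;\s*{ITEM})*\s*;?\s*")

def is_valid_list(value):
    """True when value is a semicolon-separated list of single items."""
    return LIST.fullmatch(value) is not None

for sample in ["somevalues ;somevalues; somevalues;",
               "somevalues; somevalues ;somevalues",
               "somevalues ; some values somevalues;",
               "some values; somevalues some values"]:
    print(repr(sample), is_valid_list(sample))
```

Making the semicolon mandatory between items is what stops a space from acting as a separator on its own.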
to solve this with regex, you must prepend + append the delimiter to your input lines, otherwise you cannot easily detect the first and last item
#!/bin/bash
input=a:aa:aaa:aaaa
needle=aa
if [[ ":$input:" =~ ":$needle:" ]]
then
echo found
else
echo not found
fi
# -> found
.. this takes 45 nanoseconds
bash globbing is faster with 35 nanoseconds
input=a:aa:aaa:aaaa
needle=aa
if [[ ":$input:" == *":$needle:"* ]]
then
echo found
else
echo not found
fi
# -> found
stupid solution: split by delimiter and match whole lines. this one is really slow, with 5100 nanoseconds
echo a:aa:aaa:aaaa | tr ':' $'\n' | grep "^aa$"
# -> aa
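The delimiter-wrapping trick and the split approach carry over directly to other languages. A minimal Python sketch, with hypothetical function names and no timing claims attached:

```python
def contains_item_wrapped(joined, needle, sep=":"):
    """Wrap both strings in the delimiter so the first and last items match too."""
    return f"{sep}{needle}{sep}" in f"{sep}{joined}{sep}"

def contains_item_split(joined, needle, sep=":"):
    """Split on the delimiter and compare whole items."""
    return needle in joined.split(sep)

print(contains_item_wrapped("a:aa:aaa:aaaa", "aa"))
print(contains_item_split("a:aa:aaa:aaaa", "aa"))
```

Both functions compare complete items, so "aa" is not reported as present when the list only contains "aaa".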