Extract words containing question marks - regex

I have dozens of long text files (10k - 100k records each) where some characters were lost through careless handling and got replaced with question marks. I need to build a list of corrupted words.
I'm sure the most effective approach would be to run a regex over the files with sed, awk or some other bash tool, but I'm unable to compose a regex that does the trick.
Here are a couple of sample records for processing:
?ilkin, Aleksandr, Zahhar, isa
?igadlo-?van, Maria, Karl, abikaasa, 27.10.45, Veli?anõ raj.
Desired output would be:
?ilkin
?igadlo-?van
Veli?anõ
My best result so far seems to retrieve only words from the beginning of records:
awk '$1 ~/\?/ {print $1}' test.txt
->
?ilkin,
?igadlo-?van,

I need to build a list of corrupted words
If the aim is only to search for matches, grep would be the fastest and most powerful tool:
grep -Po '(^|)([^?\s]*?\?[^\s,]*?)(?=\s|,|$)' test.txt
The output:
?ilkin
?igadlo-?van
Veli?anõ
Explanation:
-P option, allows Perl regular expressions
-o option, tells to print only matched substrings
(^|) - matches the start of the string or an empty value (we can't use the word boundary anchor \b in this case because the question mark ? is a non-word character, so \b would not match in front of a leading ?)
[^?\s]*? - matches any character except ? and whitespace \s if occurs
\?[^\s,]*? - matches a question mark ? followed by any character except whitespace \s and ,(which can be at right word boundary)
(?=\s|,|$) - lookahead positive assertion, ensures that a needed substring is followed by either whitespace \s, comma , or placed at the end of the string
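Since the goal is a list of corrupted words across many files, the same idea can be sketched without -P. This is only a sketch under a few assumptions: the helper name corrupted_words is made up, words are taken to be delimited by whitespace or commas as in the sample records, and grep is assumed to support -o and -h (GNU and BSD grep do). In a basic regular expression the ? is a literal character, so no escaping is needed.

```shell
# Hypothetical helper: print each distinct word containing a "?" once.
# A "word" here is any run of characters that are neither whitespace nor commas.
# Usage: corrupted_words file1.txt file2.txt ...
corrupted_words() {
    # -o prints only the matched part, -h suppresses file name prefixes
    grep -oh '[^[:space:],]*?[^[:space:],]*' "$@" | sort -u
}
```

Piping through sort -u collapses repeats, which matters when the same damaged name occurs in many records.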

Related

Regular expression to match string in line between single ":" field delimiters and exclude them, when the string also contains "::" field delimiters

Using a regular expression, I need to match only the IPv4 subnet mask from the given input string:
ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
For testing this input string is contained in a text file called file.txt, however the actual use case will be to parse /proc/cmdline, and I will need a solution that starts parsing, counting fields, and matching after encountering "ip=" until the next white space character.
I'm using bash 4.2.46 with GNU grep 2.20 on an EL 7.9 workstation, x86_64 to test the expression.
Based on examples I've seen looking at other questions, I've come up with the following grep command and PCRE regular expression which gives output that is very close to what I need.
[user@ws01 ~]$ grep -o -P '(?<!:)(?:\:[0-9])(.*?)(?=:)' file.txt
:255.255.254.0
My understanding of what I've done here is that, I've started with a negative lookbehind with a ":" character to try and exclude the first "::" field, followed by a non capturing group to match on an escaped ":" character, followed by a number, [0-9], then a capturing group with .*?, for the actual match of the string itself, and finally a look ahead for the next ":" character.
The problem is that this gives the desired string, but includes an extra : character at the beginning of the string.
Expected output should look like this:
255.255.254.0
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters. The reason for this is because a field can have an empty value. For example
:<null>:ip:gw:netmask:hostname:<null>:off
Null is shown here to indicate an omitted value not passed by the user, that the user does not need to provide for the intended purpose.
I've tried a few different expressions as suggested in other answers that use negative look behinds and look aheads to not start matching at a : which is neighbored by another :
For example, see this question:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
If I can start matching at the first single colon, by itself, which is not followed by or preceded by another : character, while excluding the colon character as the delimiter, and continue matching until the next single colon which is also not neighboring another : and without including the colon character, that should match the desired string.
I'm able to match the exact string by including "255" in an expression like this: (Which will work for all of our present use cases)
[user@ws01 ~]$ grep -o -P '(?:)255.*?(?=:)' file.txt
255.255.254.0
The logic problem here is that the subnet mask itself, may not always start with "255", but it should be a number, [0-9] which is why I'm attempting to use that in the expression above. For the sake of simplicity, I don't need to validate that it's not greater than 255.
Using gnu-grep you could write the pattern as:
grep -oP '(?<!:):\K\d{1,3}(?:\.\d{1,3}){3}(?=:(?!:))' file.txt
Output
255.255.254.0
Explanation
(?<!:): Negative lookbehind, assert not : to the left and then match :
\K Forget what is matched until now
\d{1,3}(?:\.\d{1,3}){3} Match 4 times 1-3 digits separated by .
(?=:(?!:)) Positive lookahead, assert : that is not followed by :
See a regex demo.
Using grep
$ grep -oP '(?<!:)?:\K([0-9.]+)(?=:[[:alpha:]])' file.txt
View Demo here
or
$ grep -oP '[^:]*:\K[^:[:alpha:]]*' file.txt
Output
255.255.254.0
If these are delimiters, your value should be in a clearly predictable place.
Just treat every colon as a delimiter and select the 4th field.
$: awk -F: '{print $4}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
I'm not sure what you mean by
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters.
If your delimiters aren't predictable and parse-able, they are useless. If you mean the fields can have or not have quotes, but you need to exclude quotes, we can do that. If double colons are one delimiter and single colons are another that's horrible design, but we can probably handle that, too.
$: awk -F'::' '{ split($2,x,":"); print x[2];}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
For quotes, you need to provide an example.
Since the number of fields is always the same, simply separated by ":", you can use cut.
That solution will also work if you have empty fields.
cut -d":" -f4
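For the real use case described in the question (picking the ip= argument out of a whole kernel command line first), the field-counting approach could be combined with a small pre-filter. The following is a rough sketch, not a tested production script: the function name netmask_from_cmdline is hypothetical, and it assumes arguments are separated by whitespace and that the netmask is always the 4th colon-separated field, as in the sample string.

```shell
# Hypothetical helper: print the 4th colon-separated field of the first
# ip= argument found in the given command-line text.
# Usage: netmask_from_cmdline "$(cat /proc/cmdline)"
netmask_from_cmdline() {
    printf '%s\n' "$1" |
        tr -s '[:space:]' '\n' |   # one argument per line
        sed -n 's/^ip=//p' |       # keep only the ip= argument, minus the prefix
        head -n 1 |
        cut -d: -f4                # empty fields still count as fields
}
```

Because cut counts every colon, an omitted value such as the empty second field does not shift the position of the netmask.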

search string preceded by either a space or a slash

I have files with the following content:
76a6f0f631888fbd359420796093d19a3928123d remotes/origin/feature/ASC-122356
417435aceb671e41213697055b86d860d9a9a61c remotes/origin/feature/ASC-122356-3762
ae863a41fef068215be1529216e9dbba1314fa6f remotes/origin/master
I want to check whether the pattern origin/master is present in the file.
I'm currently using grep -e '^\S\+ origin/master$' but it doesn't work. How can I do it?
Following would work with grep. Positive number of non-space characters, followed by a space, followed by a possibly empty sequence of non-space characters and followed by the expected string.
grep -P '\S+ \S*origin/master$' test
This can be improved to make sure that origin is either at the beginning of the second column or preceded by a /, which eliminates strings like remotes/backup-origin/master:
grep -P '^\S+ (|\S*/)origin/master$' test
Note those expressions require (-P) - perl compatible regexes.
Your pattern uses '^\S\+ ' to require that ALL characters before origin/master, apart from the single space, be non-space (because of the '^').
Consider using a similar version, which asks only for ONE space somewhere before the path:
grep -e ' \S\+origin/master$'
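If the check is meant to drive a script, the same "space or slash before origin" rule can be written in plain extended regex syntax and used for its exit status. This is just a sketch: has_origin_master is a made-up name, and the two-column layout (hash, one space, ref name) is assumed from the sample.

```shell
# Hypothetical helper: succeed if some line's second column is origin/master
# or ends in /origin/master.
# Usage: has_origin_master branches.txt && echo "found"
has_origin_master() {
    # -q prints nothing; only the exit status is used
    grep -Eq '^[^[:space:]]+ ([^[:space:]]*/)?origin/master$' "$1"
}
```

The optional group must end in a slash, so a ref like remotes/backup-origin/master is rejected.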

How can I find repeated words in a file using grep/egrep?

I need to find repeated words in a file using egrep (or grep -e) in unix (bash)
I tried:
egrep "(\<[a-zA-Z]+\>) \1" file.txt
and
egrep "(\b[a-zA-Z]+\b) \1" file.txt
but for some reason these consider things to be repeats that aren't!
for example, it thinks the string "word words" meets the criteria despite the word boundary condition \> or \b.
\1 matches whatever string was matched by the first capture. That is not the same as matching the same pattern as was matched by the first capture. So the fact that the first capture matched on a word boundary is no longer relevant, even though the \b is inside the capture parentheses.
If you want the second instance to also be on a word boundary, you need to say so:
egrep "(\b[a-zA-Z]+) \1\b" file.txt
That is no different from:
egrep "\b([a-zA-Z]+) \1\b" file.txt
The space in the pattern forces a word boundary, so I removed the redundant \bs. If you wanted to be more explicit, you could put them in:
egrep "\<([a-zA-Z]+)\> \<\1\>" file.txt
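To check the behaviour on a small sample, the pattern can be wrapped in a function and fed a few lines. This is a sketch only: repeated_words is a hypothetical name, and back-references in extended regular expressions as well as \< and \> are assumed to be available, as they are in GNU grep.

```shell
# Hypothetical helper: print lines that contain a whole word immediately
# repeated after a single space. Reads the named files, or stdin if none.
# Usage: repeated_words file.txt
repeated_words() {
    # \< anchors the first copy at a word start, \> anchors the second at a word end
    grep -E '\<([[:alpha:]]+) \1\>' "$@"
}
```

With the input lines "hello and and bye" and "word words", only the first is printed, because the second copy in "word words" is not followed by a word end.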
I use
pcregrep -M '(\b[a-zA-Z]+)\s+\1\b' *
to check my documents for such errors. This also works if there is a line break between the duplicated words.
Explanation:
-M, --multiline: run in multiline mode (important if there is a line break between the duplicated words).
[a-zA-Z]+: Match words
\b: Word boundary, see tutorial
(\b[a-zA-Z]+) group it
\s+ match at least one (but as many more as necessary) whitespace characters. This includes newline.
\1: Match whatever was in the first group
This is the expected behaviour. See what man grep says:
The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the
beginning and end of a word. The symbol \b matches the empty string at
the edge of a word, and \B matches the empty string provided it's not
at the edge of a word. The symbol \w is a synonym for [[:alnum:]] and
\W is a synonym for [^[:alnum:]].
and then in another place we see what "word" is:
Matching Control
Word-constituent characters are letters, digits, and the underscore.
So this is what it will produce:
$ cat a
hello bye
hello and and bye
words words
this are words words
"words words"
$ egrep "(\b[a-zA-Z]+\b) \1" a
hello and and bye
words words
this are words words
"words words"
$ egrep "(\<[a-zA-Z]+\>) \1" a
hello and and bye
words words
this are words words
"words words"
egrep "(\<[a-zA-Z]+\>) \<\1\>" file.txt
fixes the problem.
Basically, you have to tell \1 that it needs to stay within word boundaries too.

Whole-word matching on a body of text, given a list of words

Note:
Before I get down to business, I'd like to point out some other SO posts that didn't quite answer my question and are not duplicates of this one:
How to grep with a list of words
How to make grep only match if the entire line matches?
how to grep for the whole word
Grep extract only whole word
Background:
I have a list of words in a file called words.txt (one word per line). I would like to find all lines from a different, much larger file called file.txt that contain any of the words from words.txt. However, I only want whole-word matches. This means that a match should be made when a line from file.txt contains at least one instance where a word from words.txt is found "all by itself" (I know this is vague, so allow me to explain).
In other words, a match should be made when:
The word is all by itself on a line
The word is surrounded by non-alphanumeric/non-hyphen characters
The word is at the beginning of a line and followed by a non-alphanumeric/non-hyphen character
The word is at the end of a line and preceded by a non-alphanumeric/non-hyphen character
For example, if one of the words in words.txt is cat, I would like it to behave as follows:
cat #=> match
cat cat cat #=> match
the cat is gray #=> match
mouse,cat,dog #=> match
caterpillar cat #=> match
caterpillar #=> no match
concatenate #=> no match
bobcat #=> no match
catcat #=> no match
cat100 #=> no match
cat-in-law #=> no match
Previous research:
There's a grep command that almost suits my needs. It is as follows:
grep -wf words.txt file.txt
where the options are:
-w, --word-regexp
Select only those lines containing matches that form whole words.
The test is that the matching substring must either be at the beginning
of the line, or preceded by a non-word constituent character.
Similarly, it must be either at the end of the line or followed by a
non-word constituent character. Word-constituent characters are
letters, digits, and the underscore.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains
zero patterns, and therefore matches nothing.
The big problem I'm having with this is that it treats a hyphen (i.e. -) as a "non-word constituent character". Therefore (based on the example above) doing a whole-word search for cat will return cat-in-law, which is not what I want.
I realize that the -w option probably achieves the desired effect for many people. However, in my particular case, if a word (e.g. cat) is followed/preceded by a hyphen, then I need to treat it as if it's part of a larger word (e.g. cat-in-law) and not an instance of the word by itself.
Additionally, I know I could alter words.txt to contain regular expressions instead of fixed strings and then use:
grep -Ef words.txt file.txt
where
-E, --extended-regexp
Interpret PATTERN as an extended regular expression
However, I would like to avoid altering words.txt and keep it free of regex patterns.
Question:
Is there a simple bash command that will allow me to give it a list of words and perform whole-word matching on a body of text?
I finally came up with a solution:
grep -Ef <(awk '{print "([^a-zA-Z0-9-]|^)"$0"([^a-zA-Z0-9-]|$)"}' words.txt) file.txt
Explanation:
words.txt is my list of words (one per line).
file.txt is the body of text that I would like to search.
The awk command will preprocess words.txt on-the-fly, wrapping each word in a special regular expression to define its official beginning and ending (based on the specifications posted in my question above).
The awk command is surrounded by <( and ) so that its output is used as the input for the -f option.
I'm using the -E option because I'm now inputting a list of regular expressions instead of fixed strings from words.txt.
The nice thing here is that words.txt can remain human-readable and doesn't have to contain a bunch of regex patterns.
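One caveat with wrapping each word directly is that a word containing regex metacharacters (a dot, a plus sign, parentheses) would be interpreted as a pattern. A possible variation, sketched below with sed instead of awk, escapes those characters first. The function name whole_word_grep is hypothetical, and the sketch assumes a grep that reads patterns from standard input with -f -, as GNU grep does.

```shell
# Hypothetical helper: whole-word search where hyphens count as word characters.
# $1: word list (one literal word per line), $2: text file to search
# Usage: whole_word_grep words.txt file.txt
whole_word_grep() {
    sed -e '/^$/d' \
        -e 's/[][\.*^$+?(){}|]/\\&/g' \
        -e 's/.*/(^|[^[:alnum:]-])&([^[:alnum:]-]|$)/' "$1" |
        grep -E -f - "$2"
}
```

The first sed expression drops empty lines (which would otherwise turn into a pattern matching almost everything), the second backslash-escapes ERE metacharacters, and the third adds the same boundary groups as the awk version.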

Extract strings between two separators using regex in perl

I have a file which looks like:
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
and I wish to extract strings between : and | separators, the output should be:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
tab delimited between the two columns.
I wrote in unix a perl command:
perl -l -ne '/:([^|]*)?[^:]*:([^|]*)/ and print($1,"\t",$2)' <file>
the output that I got is:
Q9VNB0 EBI-102551 uniprotkb:A1ZBG6
P91682 EBI-142245 uniprotkb:Q24117
P92177-3 EBI-204491 uniprotkb:Q9VDK2
I wish to know what am I doing wrong and how can I fix the problem.
I don't wish to use split function.
Thanks,
Tom.
The expression you give is too greedy and thus consumes more characters than you wanted. The following expression works on your sample data set:
perl -l -ne '/:([^|]*)\|.*:([^|]*)\|/ and print($1,"\t",$2)'
It anchors the search with explicit matches for something between a ":" and "|" pair. If your data doesn't match exactly, it should ignore the input line, but I have not tested this. I.e., this regex assumes exactly two entries between ":" and "|" will exist per line.
Try m/: ( [^:|]+ ) \| .+ : ( [^:|]+ ) \| /x instead.
A fix could be to use a greedy expression between the first string and the second one. With .* it goes to the end of the line and then backtracks, searching for the last colon that is followed by non-pipe characters and a pipe.
perl -l -ne '/:([^|]*).*:([^|]*)\|/ and print($1,"\t",$2)' <file>
Output:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
See it in action:
:([\w\-]*?)\|
Another method:
:(\S*?)\|
The way you've specified it, it has to match that way. You want a single colon
followed by any number of non-pipe, followed by any number of non-colon.
single colon -> :
non-pipe -> Q9VNB0
non-colon -> |intact
colon -> :
non-pipe -> EBI-102551 uniprotkb:A1ZBG6
Instead I make a space the end-of-contract, and require all my patterns to begin
with a colon, end with a pipe and consist of non-space/non-pipe characters.
perl -M5.010 -lne 'say join( "\t", m/[:]([^\s|]+)[|]/g )';
perl -nle'print "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Or with 5.10+:
perl -nE'say "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Explanation:
: Matches the start of the first "word".
([^|]*) Matches the desired part of the first "word".
\S* Matches the end of the first "word".
\s Matches the "word" separator.
[^:]*: Matches the start of the second "word".
([^|]*) Matches the desired part of the second "word".
This isn't the shortest answer (although it's close) because each part is quite independent of the others. This makes it more robust, less error-prone, and easier to maintain.
Why do you not want to use the split function? On the face of it this would be easily solved by writing
my @fields = map /:([^|]+)/, split;
I am not sure how your regex is supposed to work. Using the /x modifier to allow non-significant whitespace it looks like this
/ : ([^|]*)? [^:]* : ([^|]*) /x
which finds a colon and optionally captures as many non-pipe characters as possible. Then it skips over as many non-colon characters as possible to the next colon. Then it captures zero or as many non-pipe characters as possible. Because all of your matches are greedy, any one of them is allowed to consume all of the rest of the string as long as the characters match the character class. Note that a ? that indicates an optional sequence will first of all match all that it can, and the option to skip the sequence will be taken only if the rest of the pattern cannot then be made to match.
It is hard to judge from your examples the precise criteria for a field, but this code should do the trick. It finds sequences of characters that are neither a colon nor a pipe that are preceded by a colon and terminated by a pipe
use strict;
use warnings;
while (<DATA>) {
    my @fields = /:([^:|]+)\|/g;
    print join("\t", @fields), "\n";
}
__DATA__
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
output
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
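For comparison, the same "preceded by a colon, terminated by a pipe, containing neither" rule can be sketched as a shell function around awk rather than Perl. This swaps in a different tool from the one asked about, so treat it as an illustration of the rule only; the name extract_ids is made up.

```shell
# Hypothetical helper: for each input line, print every run of characters that
# sits between a ":" and the next "|" (containing neither), tab-separated.
# Usage: extract_ids file.txt
extract_ids() {
    awk '{
        out = ""
        rest = $0
        # repeatedly find the leftmost ":...|" and strip the two delimiters
        while (match(rest, /:[^:|]+[|]/)) {
            id = substr(rest, RSTART + 1, RLENGTH - 2)
            out = (out == "" ? id : out "\t" id)
            rest = substr(rest, RSTART + RLENGTH)
        }
        if (out != "") print out
    }' "$@"
}
```

The intact:EBI-... parts are skipped because the text after their colon runs into another colon before it reaches a pipe.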