I am attempting to create a regex that will find any line that contains exactly one word. A word joined by a hyphen or other symbol (e.g. test-word), or preceded by leading whitespace, should still be treated as a single word.
$cat file1
this line has many words
hello
test-hi
this does aswell
Using the regular expression
'/^\s*(\w+)\s$/GM'
Returns only "hello" and ignores "test-hi"
I am able to capture all single words but not ones with hyphens etc!
This is easier to do with awk. By default it splits each record into fields on runs of one or more whitespace characters, and whitespace at the start or end of the line does not count towards the fields.
$ awk 'NF==1' ip.txt
hello
test-hi
$ awk 'NF>1' ip.txt
this line has many words
this does aswell
NF is a built-in variable that indicates number of fields in the input record
You can use
^\s*([\w-]+)\s*$
which adds support for hyphens and makes the second \s match zero or more whitespace characters. Keep your GM flags.
Try using \S to match any non-whitespace character, with \s* so that trailing whitespace is optional:
'/^\s*(\S+)\s*$/GM'
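For comparison, the same idea can be sketched with plain grep and POSIX character classes, which avoids relying on \s and \S support. The function name single_word_lines is made up for this illustration, and the sample input adds one padded line to the question's file to show that surrounding whitespace is tolerated.

```shell
# Hypothetical helper: print lines that contain exactly one run of
# non-whitespace characters, with optional whitespace on either side.
single_word_lines() {
    grep -E '^[[:space:]]*[^[:space:]]+[[:space:]]*$' "$@"
}

# Sample input based on the question, plus one padded line.
printf 'this line has many words\nhello\ntest-hi\n  padded  \nthis does aswell\n' |
    single_word_lines
```

Because the word is defined as "anything that is not whitespace", hyphens and other symbols need no special handling.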
Using a regular expression, I need to match only the IPv4 subnet mask from the given input string:
ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
For testing, this input string is stored in a text file called file.txt. The actual use case, however, will be to parse /proc/cmdline, so I need a solution that starts parsing, counting fields, and matching after encountering "ip=" and stops at the next whitespace character.
I'm using bash 4.2.46 with GNU grep 2.20 on an EL 7.9 workstation, x86_64 to test the expression.
Based on examples I've seen looking at other questions, I've come up with the following grep command and PCRE regular expression which gives output that is very close to what I need.
[user#ws01 ~]$ grep -o -P '(?<!:)(?:\:[0-9])(.*?)(?=:)' file.txt
:255.255.254.0
My understanding of what I've done here is that, I've started with a negative lookbehind with a ":" character to try and exclude the first "::" field, followed by a non capturing group to match on an escaped ":" character, followed by a number, [0-9], then a capturing group with .*?, for the actual match of the string itself, and finally a look ahead for the next ":" character.
The problem is that this gives the desired string, but includes an extra : character at the beginning of the string.
Expected output should look like this:
255.255.254.0
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters. The reason for this is because a field can have an empty value. For example
:<null>:ip:gw:netmask:hostname:<null>:off
Null is shown here to indicate an omitted value not passed by the user, that the user does not need to provide for the intended purpose.
I've tried a few different expressions as suggested in other answers that use negative look behinds and look aheads to not start matching at a : which is neighbored by another :
For example, see this question:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
If I can start matching at the first single colon, by itself, which is not followed by or preceded by another : character, while excluding the colon character as the delimiter, and continue matching until the next single colon which is also not neighboring another : and without including the colon character, that should match the desired string.
I'm able to match the exact string by including "255" in an expression like this: (Which will work for all of our present use cases)
[user#ws01 ~]$ grep -o -P '(?:)255.*?(?=:)' file.txt
255.255.254.0
The logic problem here is that the subnet mask itself, may not always start with "255", but it should be a number, [0-9] which is why I'm attempting to use that in the expression above. For the sake of simplicity, I don't need to validate that it's not greater than 255.
Using gnu-grep you could write the pattern as:
grep -oP '(?<!:):\K\d{1,3}(?:\.\d{1,3}){3}(?=:(?!:))' file.txt
Output
255.255.254.0
Explanation
(?<!:): Negative lookbehind, assert not : to the left and then match :
\K Forget what is matched until now
\d{1,3}(?:\.\d{1,3}){3} Match 4 times 1-3 digits separated by .
(?=:(?!:)) Positive lookahead, assert : that is not followed by :
Using grep
$ grep -oP '(?<!:)?:\K([0-9.]+)(?=:[[:alpha:]])' file.txt
or
$ grep -oP '[^:]*:\K[^:[:alpha:]]*' file.txt
Output
255.255.254.0
If these are delimiters, your value should be in a clearly predictable place.
Just treat every colon as a delimiter and select the 4th field.
$: awk -F: '{print $4}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
I'm not sure what you mean by
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters.
If your delimiters aren't predictable and parse-able, they are useless. If you mean the fields can have or not have quotes, but you need to exclude quotes, we can do that. If double colons are one delimiter and single colons are another that's horrible design, but we can probably handle that, too.
$: awk -F'::' '{ split($2,x,":"); print x[2];}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
For quotes, you need to provide an example.
Since the number of fields is always the same, simply separated by ":", you can use cut.
That solution will also work if you have empty fields.
cut -d":" -f4
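To tie the field-based answers back to the /proc/cmdline use case, here is a minimal sketch that first isolates the ip= parameter and then takes its 4th colon-separated field. The function name netmask_from_cmdline and the extra parameters in the sample command line (BOOT_IMAGE, ro, quiet) are assumptions made for illustration; only the ip= value comes from the question.

```shell
# Sample kernel command line; everything except the ip= value is made up.
cmdline='BOOT_IMAGE=/vmlinuz ro ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off quiet'

# Hypothetical helper: print the 4th colon-separated field of the first
# ip= parameter found in the string passed as $1.
netmask_from_cmdline() {
    # 1. put each whitespace-separated parameter on its own line
    # 2. keep only the ip= parameter and strip the "ip=" key
    # 3. take the first such parameter
    # 4. empty fields still count for cut, so the netmask is field 4
    printf '%s\n' "$1" |
        tr -s '[:space:]' '\n' |
        sed -n 's/^ip=//p' |
        head -n 1 |
        cut -d: -f4
}

netmask_from_cmdline "$cmdline"
```

On a real system the argument would presumably be "$(cat /proc/cmdline)". No validation of the netmask itself is attempted here.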
I have tens of long text files (10k - 100k records each) where some characters were lost by careless handling and got replaced with question marks. I need to build a list of corrupted words.
I'm sure the most effective approach would be to regex the file with sed or awk or some other bash tools, but I'm unable to compose regex that would do the trick.
Here are a couple of sample records for processing:
?ilkin, Aleksandr, Zahhar, isa
?igadlo-?van, Maria, Karl, abikaasa, 27.10.45, Veli?anõ raj.
Desired output would be:
?ilkin
?igadlo-?van
Veli?anõ
My best result so far seems to retrieve only words from the beginning of records:
awk '$1 ~/\?/ {print $1}' test.txt
->
?ilkin,
?igadlo-?van,
I need to build a list of corrupted words
If the aim is only to search for matches, grep would be the fastest and most powerful tool:
grep -Po '(^|)([^?\s]*?\?[^\s,]*?)(?=\s|,|$)' test.txt
The output:
?ilkin
?igadlo-?van
Veli?anõ
Explanation:
-P option enables Perl-compatible regular expressions
-o option prints only the matched substrings
(^|) - matches the start of the string or the empty string (the word boundary anchor \b can't be used here because the question mark ? is not a word character, so \b does not match in front of a leading ?)
[^?\s]*? - lazily matches any characters other than ? and whitespace \s, if present
\?[^\s,]*? - matches a question mark ? followed, lazily, by any characters other than whitespace \s and comma , (which can sit at the right-hand word boundary)
(?=\s|,|$) - positive lookahead, ensures that the matched substring is followed by whitespace \s, a comma , or the end of the string
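The awk attempt in the question only inspects $1, which is why it misses words later in the record. A minimal sketch that checks every field instead is shown below. It assumes words are separated by whitespace and carry at most one trailing comma, as in the sample records; the function name corrupted_words is hypothetical.

```shell
# Hypothetical helper: print every whitespace-separated word that
# contains a question mark, without its trailing comma.
corrupted_words() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if (index($i, "?")) {        # field contains a literal ?
                word = $i
                sub(/,$/, "", word)      # drop one trailing comma, if any
                print word
            }
        }
    }' "$@"
}

printf '%s\n' \
    '?ilkin, Aleksandr, Zahhar, isa' \
    '?igadlo-?van, Maria, Karl, abikaasa, 27.10.45, Veli?anõ raj.' |
    corrupted_words
```

Piping the result through sort -u would give a list of distinct corrupted words across several files.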
I have the following line:
    given some books I've given to my son.
Notice the four spaces in front of the first "given". I want to match the "given" that follows the whitespace at the beginning of the line with a regex. I do not want the second "given" to match.
If I use \s*given it will match both words. If I add the ^ for the beginning of the line (^\s*given) it does not match either.
Try to enter \s*The and ^\s*The on this RegexOne example to understand the problem.
Edit
For some reason, the fox example works now and the regex works on another site, so here's my full example:
given an egg
and some milk
and the ingredient flour
when the cook mangles everything to a dough
and the cook fries the dough in a pan
then the resulting meal is a pan cake
And my awk expressions that all don't match:
/^\s*given/ { print "given()."}
/^[\s]*and/ { print "and()."}
/^\s*when/ { print "when()."}
/^\s*then/ { print "then()."}
They all match once I remove the ^.
As Ed Morton mentioned, some Awks (such as The One True Awk) only support POSIX character classes, so \s is not matching whitespace, it's matching the letter s.
Since you're using * to match zero or more occurrences:
awk '/\s*given/' file
matches because there are zero occurrences of s at the beginning of the line, whereas:
awk '/^\s*given/' file
will never match because there are unmatched characters (whitespace) between ^ (start of line) and the word given.
If you were to use + to match one or more occurrences, you'd see that this does not work either:
awk '/\s+given/' file
so the obvious solution is to use [[:space:]]:
awk '/^[[:space:]]*given/' file
But since Awk's default is to split fields by whitespace, if you wish to match a word against the first set of non-whitespace characters, it's more straightforward to compare the word with the first field $1.
awk '$1 == "given"' file
to match completely, or:
awk '$1 ~ /^given/' file
to match against the beginning of the first field.
As an aside, if you want to test your regex against a set of words and print them appended with ()., as is shown in your example, you could use the string functions match and substr like this:
awk '{
    # match against $1, which already has the leading whitespace stripped
    if (match($1, /^(given|and|when|then)/))
        print substr($1, RSTART, RLENGTH) "()."
}' file
output:
given().
and().
and().
when().
and().
then().
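The $1 comparison suggested earlier in this answer can also replace the four separate rules from the question. The sketch below assumes the keyword must be the whole first word rather than just a prefix; the function name steps and the indentation in the sample input are made up for illustration.

```shell
# Hypothetical helper: print "keyword()." for each line whose first
# whitespace-separated word is one of the four keywords.
steps() {
    awk '$1 == "given" || $1 == "and" || $1 == "when" || $1 == "then" {
        print $1 "()."
    }' "$@"
}

# Indentation varies on purpose; field splitting ignores it.
printf '%s\n' \
    'given an egg' \
    '    and some milk' \
    '    and the ingredient flour' \
    'when the cook mangles everything to a dough' \
    '    and the cook fries the dough in a pan' \
    'then the resulting meal is a pan cake' |
    steps
```

Since no regular expression is involved, this behaves the same whether or not the Awk in use understands \s.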
This regex can match what you're looking for:
^[[:space:]]*given
It matches any whitespace at the start of the line, followed by the first "given".
You can play with it here:
https://regex101.com/r/yA5dV0/1
Edit: Changed it to Ed Morton's suggestion.
I'm trying to replace multiple spaces with a single one, but not at the start of the line.
Example:
___abc___def__
___ghi___jkl__
should turn to
___abc_def__
___ghi_jkl__
Note that I've replaced space with underscore
A simple search using the following pattern:
([^\s])\s+
matches the space at the end of the first line up to the space at the beginning of the next one.
So, if I replace with \1_, I get the following:
___abc_def_ghi_jkl
That is absolutely not what I expect, and other regex engines, e.g. PowerGREP or the one in Visual Studio, don't behave that way.
If you want to match only horizontal spaces, use \h:
Find what: (?<=\S)\h+(?=\S)
Replace with: (a space)
There are several possible interpretations of the question. For each of them the replacement will be a single space character.
If spaces is plural and means space characters but not tabs then use
a find string of (^ {2,})|( {2,}$).
If spaces is plural and should include tabs then use a find string
of (^[ \t]{2,})|([ \t]{2,}$).
If any leading or trailing spaces and tabs (one or more) is to be
replaced with a space then use a find string of (^[ \t]+)|([ \t]+$).
The general form of each of these is (^...)|(...$). The | means an alternation so either the preceding or the following bracketed expression can match. Hence the find what text can match either at the beginning or the end of a line. The ... varies depending on exactly what needs to be matched. Specifying [ \t] means only the two characters space and tab, whereas \s includes the line-end characters.
Ok, so the intention was to replace this:
Hey diddle diddle, \n<br/>
The Cat and the fiddle,\n
with this:
Hey diddle diddle,\n<br/>
The Cat and the fiddle,\n
A slightly modified version of Toto's answer did the trick:
(?<=\S)\h+(?=\S)|\s+$
This finds any run of horizontal whitespace between non-whitespace characters, as well as trailing whitespace at the end of the line.
Let's say I have
def
abc
xyz
abc
And I want to match
xyz
abc
as a whole
Is this possible using the most generic RegEx possible?
That is, not Perl or .NET regex, which have multiline flags.
I guess it would be BNF to match this.
Many regex implementations allow explicit line terminators. If \n is the line separator, then just search for xyz\nabc.
Regexes work on whatever text you give them, multiline or otherwise. If it happens to contain linefeeds, then it's nominally "multiline" text, but you don't have to do anything special to match it with regexes. Linefeed is just another character.
The name "multiline flag" (or "multiline mode") confuses many people. All that flag does is change the meaning of the ^ and $ anchors, allowing them to match at the beginning and end of logical lines as well as the beginning and end of the whole text.
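As a rough illustration of "linefeed is just another character", the sketch below searches for xyz\nabc with no multiline flag at all. It assumes GNU grep built with -P support; -z is used only so that grep hands the whole input to the regex as one record instead of splitting it at newlines. The function name match_across_lines is made up.

```shell
# Hypothetical helper: print every occurrence of "xyz" immediately
# followed by a newline and "abc".
match_across_lines() {
    # -z: records end at NUL, so the input stays in one piece
    # -o: print only the matched text (NUL-terminated under -z)
    # tr turns the NUL terminator back into a newline for display
    grep -zoP 'xyz\nabc' "$@" | tr '\0' '\n'
}

printf 'def\nabc\nxyz\nabc\n' | match_across_lines
```

The pattern contains no ^ or $ anchors, so the multiline flag would not change anything here.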
grep -A2 "xyz" <file_name>
from https://stackoverflow.com/a/34808071/5556553