Matching WORD pattern through regex

Assume I have a big paragraph containing words like found, field, failed, fired, killed (so many negative words, I know!!).
Now, I want to fetch the lines that have words starting with fi, hi or k and ending with eld or ed.
How would I go about searching for this word pattern in a string?
Keep in mind that I am asking about a word pattern within a string, not a pattern for the whole string.
These two surely didn't work:
egrep "^(f[ai]|k)+(eld|ed)$"
and
egrep "\<(f|k)+(eld|ed)$\>"
I'll admit I am no hulk of regex; I'm working from a basic understanding, so anyone willing to suggest a better way (with some description) is most welcome too!! :)

The regex you are probably looking for would be
"\b([fh]i|k)\w*(eld|ed)\b"
The \w* should be equivalent to [a-zA-Z0-9_]*, so it allows any word characters between the requested prefix and suffix.
The \b ensures that the word really starts and ends with the letters you want. Otherwise you might, for example, match part of a longer word such as Unkilled.
Also, you need to remove $ and ^ from the regex, because $ means end of line and ^ means beginning of line.
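As a rough illustration in Python rather than egrep (the sample text and the name WORD are made up for this sketch, and the groups are written as non-capturing so that findall returns whole words):

```python
import re

# Same pattern as above; (?:...) instead of (...) so findall returns whole matches.
WORD = re.compile(r"\b(?:[fh]i|k)\w*(?:eld|ed)\b")

# Hypothetical sample text, not taken from the question.
text = "found field failed fired killed hired unkilled"

words = WORD.findall(text)
print(words)
```

Here found and failed are skipped because they start with fo and fa, and unkilled is skipped because \b requires the k to sit at the start of a word.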

I'd use
\<(fi|hi|k)[a-zA-Z]*?(eld|ed)\>
to match the words you want.
demo # regex101
(When you look at the demo: \b there plays the same role as \< and \>.)
Explanation:
\< #beginning of word
(fi|hi|k) #either fi or hi or k
[a-zA-Z]*? #zero to unlimited of a-z and A-Z
(eld|ed) #either eld or ed
\> #end of word
If you want to allow numbers, dashes, underscores, ... within your words, simply add them to the character-class, for example: [a-zA-Z$_] if you want to allow $ and _, too.

You can use word boundary \b.
^.*\b(fi|hi|k)\w*(eld|ed)\b.*$
------------------------
This pattern selects the lines that contain those words.
NOTE: You need to use the multiline modifier m and the global modifier g.
Try it here
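For what it's worth, here is a minimal sketch of the same line-selecting idea in Python, assuming re.MULTILINE stands in for the m modifier and findall stands in for g. The sample text and variable names are invented for the example:

```python
import re

# ^ and $ anchor at every line break because of re.MULTILINE.
LINE = re.compile(r"^.*\b(?:fi|hi|k)\w*(?:eld|ed)\b.*$", re.MULTILINE)

# Hypothetical multi-line input.
text = "the plan failed\nhe was fired today\nnothing here\nkilled it"

# No capturing groups, so findall returns each matching line in full.
selected = LINE.findall(text)
print(selected)
```

Since . does not match a newline by default, each match stays within a single line.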

Related

Regex Match - Only if 4 characters long and does not contain a specific word

I am currently attempting to create a program that matches words that are a specific length, or more, that do not contain a specific word.
Currently I have the Regex : \S{4,}(?!\w*apple\w*)
When used on the test : I love these delicious applestoo
The regex will still match 'applestoo', which I do not want.
I can see that this is a logic error, but I do not understand how else to format this regex. If you have a solution, please do tell me. Thank you in advance.
Edit:
This code now works for my example: (?!\w*apple\w*)\b\S{4,}\b However, it still fails on this new example: 'logigng some testing data _______-----apple-###zx'
I have attempted to amend this by using: (?!\w*(apple|_)\w*)\b\S{4,}\b but this does not seem to be working.
You're looking for \b(?![^\W_]*apple)[^\W_]{4,}\b (explained at regex101)
This uses [^\W_] as the character matcher, which will match any character that is not a non-word character and not an underscore. This leaves the non-underscore word characters, making it similar to [[:alnum:]] (assuming POSIX named character class support) or [0-9A-Za-z] … if you just want letters, consider [[:alpha:]] or, for just ASCII letters, [A-Za-z].
The negative lookahead, which follows the \b word boundary marker for performance reasons, states that we can't have "apple" follow zero or more of these characters (regardless of what may follow it). We then ask to match four or more of these characters and then another word boundary marker.
In the following command-line demonstration, I've used grep -Po to demonstrate this. -P causes grep to use its PCRE interpreter (from libpcre) and -o causes it to show only matches, with each match on its own line:
$ echo 'logigng some testing data _______-----apple-###zx' \
|grep -Po '\b(?![^\W_]*apple)[^\W_]{4,}\b'
logigng
some
testing
data
$
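The same pattern can be tried from Python's re module, which also supports lookaheads. This is only a sketch; the names WORD, samples and results are made up here, and both test strings come from the question:

```python
import re

# Word boundary, then "no apple ahead within this run of letters/digits",
# then four or more letters/digits, then another word boundary.
WORD = re.compile(r"\b(?![^\W_]*apple)[^\W_]{4,}\b")

samples = [
    "I love these delicious applestoo",
    "logigng some testing data _______-----apple-###zx",
]

results = [WORD.findall(s) for s in samples]
for s, found in zip(samples, results):
    print(s, "->", found)
```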
The regular expression to match a word with 4 characters only is "\b\w{4}\b".
The "\b" is a word boundary that matches the position between a word character (as defined by the \w character class) and a non-word character.
The "\w{4}" matches any four word characters, and the final "\b" is a word boundary again.
let word = "word";
let pattern = /\b\w{4}\b/;
if (pattern.test(word)) {
  console.log("match");
} else {
  console.log("no match");
}

Regex with start and end match

I'm having trouble matching the start and end of a regex in Python.
Essentially I'm confused about when to use word boundaries \b and when to use the start/end anchors ^ $
My regex of
^[A-Z]{2}\d{2}
matches 4 characters (two uppercase letters, two digits), which is what I'm after
Matches AJ99, RD22, CP44 etc
However, I also noted that AJAJAJAJAJAJAJAJAJSJHS99 could be matched as well. I've tried using ^ and $ together to match the whole string. This doesn't work
^[A-Z]{2}\d{2}$ # this doesn't work
but
^[A-Z]{2}\d{2} # this is fine
[A-Z]{2}\d{2}$ # this is fine
The string I'm matching against is 4 characters long, but in the first two examples the regex could pick the start and end of a longer string respectively.
s = "NZ43" # 4 characters, match perfect! However....
s = "AM27272727" # matches the first example
s = "HAHSHSHSHDS57" # matches the second example
The position anchors ^ and $ place a restriction on the position of your matched chars:
Analyzing your complete regex:
^[A-Z]{2}\d{2}$
^ matches only at the beginning of the text
[A-Z]{2} exactly 2 uppercase Ascii alphabetic characters
\d{2} exactly 2 digits (equivalent to [0-9]{2})
$ matches only at the end of the text
If you remove one or both of the 2 position anchors (^ or $) you can match a substring starting from the beginning or the end as you stated above.
If you want to match exactly a word without using the start/end of the string use the \b anchor, like this:
``\b[A-Z]{2}\d{2}\b``
\b matches between a word character (in regex a word character \w is one of [a-zA-Z0-9_]) and a non-word character (available as \W), and also at the start or end of the text when the adjacent character is a word character.
The regex above matches WS24 in all the next strings:
WS24 alone
before WS24
WS24 after
before WS24 after
NZ43
It doesn't match:
AM27272727 (it will if the string is AM27 272727 or AM27"272727)
HAHSHSHSHDS57 (it will if it is HAHSHSHSH DS57 or... you get it)
A demo online (the site will be useful to you also to experiment with regex).
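Since the question is about Python, here is a small sketch contrasting the two approaches. It assumes re.fullmatch as the whole-string check (equivalent in spirit to wrapping the pattern in ^ and $); the names CODE, CODE_WORD, whole and embedded are invented for the example:

```python
import re

CODE = re.compile(r"[A-Z]{2}\d{2}")
CODE_WORD = re.compile(r"\b[A-Z]{2}\d{2}\b")

strings = ["NZ43", "AM27272727", "HAHSHSHSHDS57", "before WS24 after", "AM27 272727"]

# fullmatch succeeds only if the entire string is two letters plus two digits.
whole = {s: bool(CODE.fullmatch(s)) for s in strings}

# \b...\b finds the code as a standalone word anywhere in the string.
embedded = {s: CODE_WORD.findall(s) for s in strings}

for s in strings:
    print(repr(s), whole[s], embedded[s])
```

One detail worth knowing: in Python, $ also matches just before a newline that ends the string, while fullmatch does not tolerate that newline.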
Since the behaviour you show is exactly what it is supposed to be, your question suggests that you may not have fully understood how regular expressions work.
As an addition to the very good and informative answer of GsusRecovery, here's a site that guides you through the concepts of regular expressions and tries to teach you the basics with a lesson-based system. To be clear, I do not want to tout this website, as there are plenty of those, but I could really make use of this one, so it's the one I'm suggesting.

Ignore specific lines when matching with a regex

I'm trying to make a regex that matches a specific pattern, but I want to ignore lines starting with a #. How do I do it?
Let's say I have the pattern (?i)(^|\W)[a-z]($|\W)
It matches all lines with an occurrence of a single standalone letter. It matches these lines for instance:
asdf e asdf
j
kke o
Now I want to override this so that it does not match lines starting with a #
EDIT:
I was not specific enough. My real pattern is more complicated. It looks a bit like this: (?i)(^|\W)([a-hj-z]|lala|bwaaa|foo)($|\W)
It is meant to block offensive language, unless the line starts with a hash, in which case the line should be let through.
This is what you are looking for
^(?!#).+$
^ marks the beginning of line and $ marks the end of line(in multiline mode)
.+ would match 1 to many characters
(?!#) is a lookahead which would match further only if the line doesn't start with #
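This regex is meant to match word characters \w not preceded by a #:
^(?<!#)\w+$
It performs a negative lookbehind at the start of the string and then follows it with 1 or more word characters. Note, however, that a lookbehind placed right after ^ inspects the character before the start of the line, so it never sees a # that begins the line; the lookahead form ^(?!#) is the one that tests the first character.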
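One way the lookahead could be combined with the single-letter pattern from the question, sketched in Python. The combined regex and the sample lines are assumptions made for illustration, not something either answer spelled out:

```python
import re

# ^(?!#) rejects lines that begin with '#'; the rest is the question's
# single-letter pattern, with non-capturing groups.
SINGLE_LETTER = re.compile(
    r"^(?!#).*(?:^|\W)[a-z](?:$|\W)",
    re.IGNORECASE | re.MULTILINE,
)

# The first three lines come from the question; the last two are made up.
lines = ["asdf e asdf", "j", "kke o", "# asdf e asdf", "asdf"]

hits = [ln for ln in lines if SINGLE_LETTER.search(ln)]
print(hits)
```

The commented line is skipped even though it contains a standalone e, and asdf is skipped because it has no standalone letter at all.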

Regular expression find matched between a string and a space

I need to find out all words in a sentence that are between a $ and a space like this
this is $abc $cde any $ety.
The result should be abc, cde and ety.
I tried this
'(?<=$$)(.*)(?=)'
but it shows some error. What is wrong in this or any new suggestions?
You can try this:
\$(\w+)
As capturing groups, you'll get each of the words.
\w will match a-z, A-Z, 0-9 and _. If you want to match only letters, for instance, you can change it to: \$([a-zA-Z]+)
Try this RegEx:
(?<=\$)([^\s]*)(?=\s)
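To compare the two suggestions so far on the sentence from the question, here is a quick Python sketch (the variable names are invented for the example):

```python
import re

sentence = "this is $abc $cde any $ety."

# Capture group after a literal dollar sign.
by_capture = re.findall(r"\$(\w+)", sentence)

# Lookbehind for '$' plus a lookahead that insists on trailing whitespace.
by_lookaround = re.findall(r"(?<=\$)([^\s]*)(?=\s)", sentence)

print(by_capture)
print(by_lookaround)
```

The lookaround version misses ety in this sentence, because that word is followed by a full stop and the end of the string rather than by whitespace.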
Assuming from the question, each word (word contains only chars A-Za-z) must begin with $ and have a space at the end.
The following regex will match such words -- \$([A-Za-z]+) (there is a space at the end, which is hard to see due to the formatting here). If there can be multiple spaces, you can put + after that space (again hard to see due to formatting) at the end of the regex.
Then you can extract the first matching group (i.e. $1) as your matching word, and you need to do this in a loop until there are no more matches to extract. That is something like --
while ($x =~ /\$([A-Za-z]+) /g) {
    # $1 is your match
}
If your word contains more than just chars, then you can use \w as mentioned by pcalcao, which will include both 0-9 and _

Regex to match whole word with a particular definition of a word

I am doing a file search and replace for occurrences of specific words in perl. I'm not usually much of a perl or regex user. I have searched for other regex questions here but I couldn't find one which was quite right so I'm asking for help. My search and replace currently looks like this:
s/originalword/originalword_suffix/g
This matches cases of originalword that appear in the middle of another word, which I don't want. In my application of search and replace, a whole word can be defined as having the letters of the latin alphabet in lowercase or capital letters and the digits 0-9 and the symbol _ in any uninterrupted sequence. Anything else besides these characters, including any other symbols or any form of whitespace including line breaks or tabs, indicate operations or separators of some kind so they are outside the word boundaries. How do I modify my search and replace to only match whole words as I've defined them, without matching substrings?
Examples:
in the case that originalword = cat and originalword_suffix = cat_tastic
:cat { --> :cat_tastic {
:catalog { --> no change
Use the \b anchor to match only on a word boundary:
s/\bcat\b/cat_tastic/g
Note, though, that Perl's definition of a "word" may differ slightly from yours. Reading the perlre reference guide a couple of times might help you understand regexps a bit better.
Running perl -pe 'YOUR_REGEXP' in a terminal and entering lines of text can help you understand and debug what a particular regexp is doing.
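For comparison, the same whole-word replacement can be sketched in Python. The helper name add_suffix is hypothetical, and re.ASCII is an assumption made here so that \w and \b use exactly the [A-Za-z0-9_] definition of a word given in the question:

```python
import re

def add_suffix(text, word="cat", replacement="cat_tastic"):
    """Replace `word` only where it stands alone as a whole word."""
    # re.escape guards against regex metacharacters in the word itself.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags=re.ASCII)
    return pattern.sub(replacement, text)

print(add_suffix(":cat {"))
print(add_suffix(":catalog {"))
```

Because _ counts as a word character, running the replacement a second time leaves cat_tastic untouched.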
You could try:
s/([^0-9a-z_])([0-9a-z_]+)([^0-9a-z_])/$1$2_tastic$3/gi
Basically: a non-word character, then a run of word characters, followed by a non-word character. $1, $2 and $3 represent the captured groups, and you replace $2 with $2_suffix.
Hope that helps; I'm not a Perl guy but pretty regex-savvy. Note that the above will fail if the word is the very first or very last thing in a string. I'm not sure whether Perl regexes allow the syntax, but if so, the first/last issue could be fixed with:
s/(^|[^0-9a-z_])([0-9a-z_]+)([^0-9a-z_]|$)/$1$2_tastic$3/gi
Using ^ and $ to match beginning/end of string.
See the example on this page which explains boundary matchers
Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.