How to remove some special character from word using regular expression? - regex

I am splitting a file into words. The splitting itself works, but some words contain a special character sequence such as '___'. I want to drop that sequence and also split the word at that point.
The file contains data like this:
Yahoo$$$Yahoo OK : ___GET
Gmail$$$Gmail Ok:___GET
google_data$$$Google.com.in___POST
using ((?!:)[.0-9a-zA-Z\s]\w+)+ gives me
Yahoo
Yahoo OK
___GET
Gmail
Gmail Ok
GET
google_data
Google.com.in___POST
I don't want the '___' in the output, and the following string:
Google.com.in___POST
has to be split in two words, like:
Google.com.in
POST
Can anyone help me with this?

Using \w will also match an underscore. Looking at the example data, you want to match characters a-z or a digit, and in between there can be a space, dot or underscore.
Instead of splitting, you might match the values:
[0-9a-zA-Z]+(?:[._ ][0-9a-zA-Z]+)*
Explanation
[0-9a-zA-Z]+ Match a digit or a-z in lower or uppercase 1+ times
(?: Non capturing group
[._ ] Match a . _ or space
[0-9a-zA-Z]+ Match a digit or a-z in lower or uppercase 1+ times
)* Close non capturing group and repeat 0+ times
Regex demo
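As a rough illustration of the matching approach above, here is a small Python sketch that runs the pattern over the three sample lines with re.findall. The choice of Python and the names TOKEN and extract_tokens are assumptions made for this example; they are not part of the original answer.

```python
import re

# Pattern from the answer: alphanumeric runs, optionally joined by a single
# dot, underscore, or space.
TOKEN = re.compile(r"[0-9a-zA-Z]+(?:[._ ][0-9a-zA-Z]+)*")


def extract_tokens(line):
    """Return every token the pattern finds in one line (hypothetical helper)."""
    return TOKEN.findall(line)


lines = [
    "Yahoo$$$Yahoo OK : ___GET",
    "Gmail$$$Gmail Ok:___GET",
    "google_data$$$Google.com.in___POST",
]
for line in lines:
    print(extract_tokens(line))
```

Because a separator is only consumed when an alphanumeric character follows it, the triple underscore never becomes part of a token, so the third line yields google_data, Google.com.in and POST separately.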

Related

Regular expression that matches at least 4 words starting with the same letter?

I've been trying to solve this problem for a few hours, but with no luck. The task is to write a regular expression that matches at least four words starting with the same letter. But these words do not have to come one after another.
This regex should be able to match a line like this:
cat color coral chat
but also one like this:
cat take boom candle creepy drum cheek
Thank you!
So far I have got this regex, but it only matches when the words directly follow one another.
(\w)\w+\s+\1\w+\s+\1\w+\s+\1
If you have only words in the line that can be matched with \w:
\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}
Explanation
\b A word boundary to prevent a partial word match
(\w)\w* Capture a single word character in group 1 followed by matching optional word characters
(?: Non capture group to repeat as a whole part
(?:\s+\w+)*? Optionally repeat, as few times as possible, 1+ whitespace chars followed by 1+ word chars, to skip over words in between that do not start with the character captured in group 1
\s+\1\w* Match 1+ whitespace chars, a backreference to the same captured character and optional word characters
){3} Close the non capture group and repeat 3 times
See a regex demo
Note that \s can also match a newline.
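A minimal Python sketch of the pattern above, tried against the two lines from the question plus one line that should fail. The helper name has_four_same_initial and the third sample line are assumptions added for illustration.

```python
import re

# Pattern from the answer: a word, then three more words starting with the
# same character, with any number of other words allowed in between.
FOUR_SAME_INITIAL = re.compile(r"\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}")


def has_four_same_initial(line):
    """True if the line holds at least four words sharing a first character."""
    return FOUR_SAME_INITIAL.search(line) is not None


print(has_four_same_initial("cat color coral chat"))
print(has_four_same_initial("cat take boom candle creepy drum cheek"))
# Hypothetical negative case: only two words start with "c".
print(has_four_same_initial("cat take boom candle drum"))
```

The backreference \1 is case sensitive, so "Cat" and "coral" would not count as sharing a first letter unless a case-insensitive flag is used.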
If the words that start with the same character should be at least 2 characters long (as (\w)\w+ matches 2 or more characters):
\b(\w)\w+(?:(?:\s+\w+)*?\s+\1\w+){3}
See another regex demo.
Another idea to match lines with at least 4 words starting with the same letter:
\b(\w)(?:.*?\b\1){3}
See this demo at regex101
This is not very accurate: after capturing the first character of a word with \b(\w), it only checks that a word boundary \b followed by the same character \1 occurs three more times to the right, with any characters .*? in between.
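To make the "not very accurate" caveat concrete, here is a small Python comparison of this looser pattern with the whitespace-based one from the earlier answer. The sample string is a made-up assumption: it contains only three whitespace-separated words, but the hyphen introduces an extra word boundary.

```python
import re

# Looser pattern: any word boundary followed by the same character counts.
LOOSE = re.compile(r"\b(\w)(?:.*?\b\1){3}")
# Stricter pattern from the earlier answer: words must be whitespace separated.
STRICT = re.compile(r"\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}")

# Hypothetical input: three words, one of them hyphenated.
sample = "co-captain cat cow"

loose_hit = LOOSE.search(sample) is not None
strict_hit = STRICT.search(sample) is not None
print(loose_hit, strict_hit)
```

The looser pattern reports a match here because "co", "captain", "cat" and "cow" each begin at a word boundary, while the stricter pattern does not.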

Regex to match numbers, special chars, spaces and a specific whole word?

I'm trying to create a Regex to match numbers, special chars, spaces and a specific whole word ("ICT").
Example for the string:
[Columbia (ICT-59)]
Currently I have this regex to match the numbers, special chars and spaces:
[\W\s\d]
And this one for the word "ICT":
(ICT)
How can I match both of these in one regular expression?
use regex:
/(?<=\[)[a-zA-Z]+\s\(ICT-\d+\)(?=\])/g
or
/^\[[a-zA-Z]+\s\(ICT-\d+\)\]$/g
You can use \w+ in place of [a-zA-Z]+ if you also want to allow digits and underscores in the location name.
demo:
https://regex101.com/r/ZS0jeO/1
https://regex101.com/r/hpQok3/1
You could capture the part that you want in a capture group right after the opening [ and match the rest of the format.
\[([^()]+?)\s*\(ICT-\d+\)]
\[ Match [
([^()]+?) Capture group 1, match 1+ chars other than ( or ), as few as possible
\s* Match optional whitespace chars
\(ICT-\d+\) Match (ICT- 1+ digits and )
] Match ] literally
Regex demo
Or matching just a single word using \w+
\[(\w+)\s*\(ICT-\d+\)]
Regex demo
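As a quick sketch of the capture-group approach, the Python snippet below pulls the location out of the sample string. The helper name location_of and the second, non-matching input are assumptions for illustration only.

```python
import re

# Pattern from the answer: capture the text after "[" up to the "(ICT-<digits>)" part.
ICT_ENTRY = re.compile(r"\[([^()]+?)\s*\(ICT-\d+\)]")


def location_of(text):
    """Return the captured location, or None when the format does not match."""
    match = ICT_ENTRY.search(text)
    return match.group(1) if match else None


print(location_of("[Columbia (ICT-59)]"))
# Hypothetical input with a different tag; no match is expected here.
print(location_of("[Columbia (ABC-59)]"))
```

Since [^()]+? is lazy and \s* follows it, the space before the parenthesis ends up outside group 1.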

Regex match pattern, space and character

^([a-zA-Z0-9_-]+)$ matches:
BAP-78810
BAP-148080
But does not match:
B8241066 C
Q2111999 A
Q2111999 B
How can I modify the regex pattern so that it also matches a space and/or a special character?
For the example data, you can write the pattern as:
^[a-zA-Z0-9_-]+(?: [A-Z])?$
^ Start of string
[a-zA-Z0-9_-]+ Match 1+ chars listed in the character class
(?: [A-Z])? Optionally match a space and a char A-Z
$ End of string
Regex demo
Or a more exact match:
^[A-Z]+-?\d+(?: [A-Z])?$
^ Start of string
[A-Z]+-? Match 1+ chars A-Z and optional -
\d+(?: [A-Z])? Match 1+ digits and optional space and char A-Z
$ End of string
Regex demo
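Here is a short Python sketch that checks both patterns from this answer against the sample data. The last two inputs ("B8241066 CC" and "bap_1") are invented for this example to show where the patterns reject input or differ from each other.

```python
import re

# The two patterns from the answer above.
LOOSE_CODE = re.compile(r"^[a-zA-Z0-9_-]+(?: [A-Z])?$")
EXACT_CODE = re.compile(r"^[A-Z]+-?\d+(?: [A-Z])?$")

samples = [
    "BAP-78810",
    "BAP-148080",
    "B8241066 C",
    "Q2111999 A",
    "B8241066 CC",  # hypothetical: two trailing letters
    "bap_1",        # hypothetical: lowercase with underscore
]

results = {s: (bool(LOOSE_CODE.match(s)), bool(EXACT_CODE.match(s))) for s in samples}
for sample, (loose, exact) in results.items():
    print(f"{sample!r}: loose={loose} exact={exact}")
```

Both patterns accept the four values from the question and reject the double-letter suffix; only the first pattern accepts "bap_1", which is why the second one is described as the more exact match.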
Whenever you want to match something that can be either a space or a special character, you can use the dot symbol (.). Your regex pattern would then become:
^([a-zA-Z0-9_-])+.$
The dot matches a space or any other character (except a newline by default). To match the examples provided, where exactly one alphanumeric character follows the space, you could add \w:
^([a-zA-Z0-9_-])+.\w$
Note that \w is equivalent to [A-Za-z0-9_]
Further, be careful when you use . as it makes your pattern less specific and therefore more prone to false positives.
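A small Python sketch of the false-positive risk mentioned above. The input "B8241066#C" and the space-only variant SPACE_PATTERN are assumptions introduced for comparison; they do not come from the question.

```python
import re

# Pattern from this answer: any single character, then one word character.
DOT_PATTERN = re.compile(r"^([a-zA-Z0-9_-])+.\w$")
# Hypothetical stricter variant that only allows a literal space.
SPACE_PATTERN = re.compile(r"^([a-zA-Z0-9_-])+ \w$")

intended = "B8241066 C"
unintended = "B8241066#C"  # made-up input with "#" instead of a space

dot_results = (bool(DOT_PATTERN.match(intended)), bool(DOT_PATTERN.match(unintended)))
space_results = (bool(SPACE_PATTERN.match(intended)), bool(SPACE_PATTERN.match(unintended)))
print(dot_results, space_results)
```

The dot version accepts both strings, whereas the space-only variant accepts just the intended one.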
I suggest using this approach
^[A-Z][A-Z\d -]{6,}$
The first character must be an uppercase letter, followed by at least 6 uppercase letters, digits, spaces or -.
I removed the group because there was only one group and it was the entire regex.
You can also use \w - which includes A-Z,a-z and 0-9, as well as _ (underscore). To make it case-insensitive, without explicitly adding a-z or using \w, you can use a flag - often an i.

how to match a list of fixed length words separated by space or comma?

Each word can be 2 or 6-10 characters long, and the words are separated by a space or a comma. Words contain only letters and are not case sensitive.
Here is a group of words that should be matched:
RE,re,rereRE
Not matching groups:
RE,rere,rel
RE,RERE
Here is the pattern that I have tried
((([a-zA-Z]{2})|([a-zA-Z]{6,10}))(,|\s+)?)
But unfortunately this pattern also matches a string like this: RE,RERE
It looks like the word boundary has not been set.
You could match chars a-z either 2 or 6-10 times using an alternation.
Then repeat that pattern 0+ times, preceded by a comma or a space [ ,].
^(?:[A-Za-z]{6,10}|[A-Za-z]{2})(?:[, ](?:[A-Za-z]{6,10}|[A-Za-z]{2}))*$
Explanation
^ Start of string
(?:[A-Za-z]{6,10}|[A-Za-z]{2}) Match chars a-z 6 -10 or 2 times
(?: Non capturing group
[, ](?:[A-Za-z]{6,10}|[A-Za-z]{2}) Match comma or space and repeat previous pattern
)* Close non capturing group and repeat 0+ times
$ End of string
Regex demo
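A minimal Python sketch of the anchored pattern above, checked against the three strings from the question. Building the pattern from a WORD fragment and the helper name is_valid_list are choices made for this example; the resulting expression is the same as the one in the answer.

```python
import re

# One word: 6-10 letters, or exactly 2 letters.
WORD = r"(?:[A-Za-z]{6,10}|[A-Za-z]{2})"
# A word followed by any number of ", word" or " word" repetitions.
WORD_LIST = re.compile(rf"^{WORD}(?:[, ]{WORD})*$")


def is_valid_list(text):
    """True if the whole string is a comma/space separated list of valid words."""
    return WORD_LIST.match(text) is not None


for candidate in ["RE,re,rereRE", "RE,rere,rel", "RE,RERE"]:
    print(candidate, is_valid_list(candidate))
```

Putting the {6,10} alternative first is not required for correctness here, because the anchors force the engine to backtrack until every word fits one of the two lengths.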
If lookarounds are supported, you might also assert that what is directly to the left and to the right is not a non-whitespace character \S.
(?<!\S)(?:[A-Za-z]{6,10}|[A-Za-z]{2})(?:[ ,](?:[A-Za-z]{6,10}|[A-Za-z]{2}))*(?!\S)
Regex demo
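The lookaround variant is useful when the list sits inside a longer line. The Python sketch below shows that use; the surrounding words "foo" and "bar" are invented sample text.

```python
import re

# Lookaround variant from the answer: the list must be delimited by
# whitespace or the string edges instead of ^ and $.
EMBEDDED_LIST = re.compile(
    r"(?<!\S)(?:[A-Za-z]{6,10}|[A-Za-z]{2})"
    r"(?:[ ,](?:[A-Za-z]{6,10}|[A-Za-z]{2}))*(?!\S)"
)

# Hypothetical line with three-letter words around the list.
found = EMBEDDED_LIST.findall("foo RE,re,rereRE bar")
rejected = EMBEDDED_LIST.findall("RE,RERE")
print(found, rejected)
```

Note that a space is also a valid separator in this pattern, so neighbouring words of length 2 or 6-10 would be pulled into the same match.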
([a-zA-Z]{2}(,|\s)|[a-zA-Z]{6,10}|(,|\s))
This one matches only the words that have 2 letters, or between 6 and 10.
\b,?([a-zA-Z]{6,10}|[a-zA-Z]{2}),?\b
You can use this (with the case-insensitive flag, since the pattern only lists a-z):
^(?!.*\b[a-z]{4}\b)(?:(?:[a-z]{2}|[a-z]{6,10})(?:,|[ ]+)?)+$
Regex Demo
This regex will match your first case, but neither of your two other cases:
^((([a-zA-Z]{2})|([a-zA-Z]{6,10}))(,|[ ]+|$))+$
I'm making the assumption here that each line should be a single match.
Here it is in action.

Trying to create a regular expression for a string

I have a string like a Taxi:[(h19){h12}], HeavyTruck :[(h19){h12}], from which I want to keep the information before the ":", that is, Taxi or HeavyTruck. Can somebody help me with this?
This will capture a single word if it's followed by :[ allowing spaces before and after :.
[A-Za-z]+(?=\s*:\s*\[)
You'll need to set the regex global flag to capture all occurrences.
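In Python there is no global flag; re.findall plays that role by returning every non-overlapping match. A minimal sketch, assuming Python and the input string from the question without its leading article:

```python
import re

# Pattern from the answer: letters that are followed by ":[" with optional
# whitespace around the colon. The lookahead keeps ":[" out of the match.
LABEL = re.compile(r"[A-Za-z]+(?=\s*:\s*\[)")

text = "Taxi:[(h19){h12}], HeavyTruck :[(h19){h12}]"
labels = LABEL.findall(text)
print(labels)
```

The "h" characters inside the brackets are letters too, but they are followed by digits rather than a colon, so the lookahead rejects them.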
I think this will do the trick in your case: (?=\s)*\w+(?=\s*:)
Explanation:
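(?=\s)* - Intended to allow 0 or more spaces at the beginning of the word without including them in the selection. Note that a lookahead consumes no characters and * permits zero repetitions, so this part does not restrict the match.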
\w+ - Selects one or more word characters.
(?=\s*:) - Searches for 0 or more whitespace characters after the word, followed by a colon, without including them in the selection.
To match the information in your provided data before the : you could try [A-Za-z]+(?= ?:) which matches upper or lowercase characters one or more times and uses a positive lookahead to assert that what follows is an optional whitespace and a :.
If the pattern after the colon should match as well, you could try: [A-Za-z]+(?= ?:\[\(h\d+\){h\d+}])
Explanation
Match one or more upper or lowercase characters [A-Za-z]+
A positive lookahead (?= which asserts that what follows
An optional white space ?
Is a colon with the pattern after the colon using \d+ to match one or more digits (if you want to match a valid time you could update this with a pattern that matches your time format) :\[\(h\d+\){h\d+}]
Close the positive lookahead )
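A rough Python sketch of the stricter lookahead described above. The curly braces are escaped here, which is an assumption made for portability across regex engines, and the trailing "Bus:[(h19)]" entry is invented to show a label whose bracket content does not fit the format.

```python
import re

# Stricter variant: the label must be followed by ":[(h<digits>){h<digits>}]",
# with an optional single space before the colon. Braces are escaped here.
STRICT_LABEL = re.compile(r"[A-Za-z]+(?= ?:\[\(h\d+\)\{h\d+\}\])")

# The last entry is a hypothetical one that lacks the {h..} part.
text = "Taxi:[(h19){h12}], HeavyTruck :[(h19){h12}], Bus:[(h19)]"
strict_labels = STRICT_LABEL.findall(text)
print(strict_labels)
```

Compared with the shorter lookahead, this version skips "Bus" because the text after its colon does not have the full expected shape.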