Regex to match everything except sequences starting with # - regex

Example: "this is the #example te#s5t #abc17$ ++ 123"
Will match: "this is the te ++ 123"

You could use this regex to match the things you don't want, and then replace any matches with an empty string:
\s*#\S*
The regex matches zero or more whitespace characters, followed by a #, then zero or more non-space characters. For your sample data, this gives "this is the te ++ 123" as the output.
Demo on regex101
Note that the leading \s* removes the spaces in front of the # so that when a whole word is removed (e.g. #example in your sample data), you are not left with two adjacent spaces in the output string.
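As a minimal sketch of the replacement described above, here it is in Python (my choice of language, since the question does not name one). The function name strip_hash_runs is hypothetical.

```python
import re

def strip_hash_runs(text):
    """Remove every '#...' run together with any whitespace right before it."""
    # \s*  zero or more whitespace characters in front of the '#'
    # #    the literal hash sign
    # \S*  everything up to the next whitespace character
    return re.sub(r"\s*#\S*", "", text)

sample = "this is the #example te#s5t #abc17$ ++ 123"
print(strip_hash_runs(sample))  # this is the te ++ 123
```

Because \s* may match nothing, a # in the middle of a word (te#s5t) is also handled: only the part from the # onward is removed.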

Related

Regular Expression to match first word with a character in each line

I am trying to write a regex that finds the first word in each line that contains the character a.
For a string like:
The cat ate the dog
and the mouse
The expression should find cat and
So far, I have:
/\b\w*a\w*\b/g
However this will return every match in each line, not just the first match (cat ate and).
What is the easiest way to only return the first occurrence?
Assuming you are only looking for words without numbers and underscores (\w would include those), I'd suggest using:
(?i)^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)
And use whatever is in the 1st capture group. See an online demo. Or, if supported:
(?i)^.*?\K(?<!\S)[b-z]*a[a-z]*(?!\S)
See an online demo.
Please note that I used lookarounds to assert that the word is surrounded only by whitespace (or by the start or end of the string). You may also use word boundaries if you prefer and swap those lookarounds for \b. Also, depending on your application, you can probably replace the inline case-insensitive switch with a flag. For example, if you happen to use JavaScript, /^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)/gmi should probably be your option. See for example:
var myString = "The cat ate the dog\nand the mouse";
// Use a regex literal: in a plain string passed to new RegExp, "\S" loses its backslash.
var myRegexp = /^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)/gmi;
var m = myRegexp.exec(myString);
while (m != null) {
  console.log(m[1]); // logs "cat", then "and"
  m = myRegexp.exec(myString);
}
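For readers not using JavaScript, the same pattern can be sketched in Python. This is only an illustration under my own assumptions: the inline (?i) switch is replaced by re.IGNORECASE, re.MULTILINE makes ^ match at every line start, and the names FIRST_A_WORD and first_words_with_a are hypothetical.

```python
import re

# Same pattern as in the answer, with flags instead of the inline (?i) switch.
FIRST_A_WORD = re.compile(
    r"^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)",
    re.MULTILINE | re.IGNORECASE,
)

def first_words_with_a(text):
    """Return, per line, the first whitespace-delimited word containing 'a'."""
    return [m.group(1) for m in FIRST_A_WORD.finditer(text)]

print(first_words_with_a("The cat ate the dog\nand the mouse"))  # ['cat', 'and']
```

Lines without a matching word simply contribute nothing to the result.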
If you want to match a word using \w, you might also use a negated character class that matches any character except a or a newline.
Then match a word that contains at least one a, using a word boundary \b:
^[^a\n\r]*\b([^\Wa]*a\w*)
The pattern matches:
^ Start of string (or start of each line when the multiline flag is set)
[^a\n\r]*\b Optionally match any character except a or a newline
( Capture group 1
[^\Wa]*a\w* Optionally match a word character without a, then match a and optional word characters
) Close group 1
Regex demo
Using whitespace boundaries on the left and right:
^[^a\n\r]*(?<!\S)([^\Wa]*a\w*)(?!\S)
Regex demo
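The two variants above can be compared with a short Python sketch. Python is my choice here, re.MULTILINE is added so that ^ matches on every line, and the names WORD_BOUNDARY and SPACE_BOUNDARY are hypothetical.

```python
import re

# \b variant and whitespace-boundary variant from the answer above.
WORD_BOUNDARY = re.compile(r"^[^a\n\r]*\b([^\Wa]*a\w*)", re.MULTILINE)
SPACE_BOUNDARY = re.compile(r"^[^a\n\r]*(?<!\S)([^\Wa]*a\w*)(?!\S)", re.MULTILINE)

text = "The cat ate the dog\nand the mouse"
print(WORD_BOUNDARY.findall(text))   # ['cat', 'and']
print(SPACE_BOUNDARY.findall(text))  # ['cat', 'and']

# The variants differ once punctuation is attached to the first word with an "a".
hyphenated = "well-made plan"
print(WORD_BOUNDARY.findall(hyphenated))   # ['made']
print(SPACE_BOUNDARY.findall(hyphenated))  # []
```

In the hyphenated case the whitespace-boundary variant finds nothing, because the prefix [^a\n\r]* cannot move past the first a on the line, so "plan" is never reached.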
The text could be matched with the regular expression
(?=(\b[a-z]*a[a-z]*\b)).*\r?\n
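with the multiline and case-insensitive flags set. For each match, capture group 1 contains the first word (composed only of letters) in a line that contains an "a". There are no matches in lines that do not contain an "a". Note that a final line without a line terminator is not matched either, because the pattern requires \r?\n at the end.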
Demo
The expression can be broken down as follows.
(?= # begin a positive lookahead
( # begin capture group 1
\b[a-z]*a[a-z]*\b # match a whole word containing an "a"
) # end capture group 1
) # end the positive lookahead
.*\r?\n # match the remainder of the line including the
# line terminator
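A small Python sketch of the lookahead approach broken down above; the name LOOKAHEAD is hypothetical. Only the case-insensitive flag is passed here, since the pattern contains no ^ or $ for a multiline flag to affect.

```python
import re

# The capture group sits inside the lookahead, so findall returns group 1.
LOOKAHEAD = re.compile(r"(?=(\b[a-z]*a[a-z]*\b)).*\r?\n", re.IGNORECASE)

with_final_newline = "The cat ate the dog\nand the mouse\n"
without_final_newline = "The cat ate the dog\nand the mouse"

print(LOOKAHEAD.findall(with_final_newline))     # ['cat', 'and']
print(LOOKAHEAD.findall(without_final_newline))  # ['cat']
```

The second call shows the line-terminator requirement: the last line is skipped when the text does not end with a newline.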

Regex to find every second new line (match only new line characters)

Input:
LINE1
LINE2
LINE3
LINE4
LINE5
LINE6
Output:
LINE1LINE2
LINE3LINE4
LINE5LINE6
I have tried \n[^\n]*\n but it matches text as well for replacement and does not give desired output.
I am having issues in matching every second new line character only.
Thanks in advance!
You could use the regular expression
^(.*)\n(.*\n)
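and replace each match with $1$2 (the two groups concatenated, which drops the newline between them). In flavors where ^ only matches at the start of the string by default, set the multiline flag.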
Demo
Alternatively, you could simply match each pair of lines and remove the first newline character. That requires a bit of code, of course. As you have not indicated which language you are using I will illustrate that with some Ruby code, which readers should find easy to translate to any high-level language. Suppose str is a variable holding your multi-line string. Then:
r = /^(?:.*\n){2}/                             # two consecutive lines
s = str.gsub(r) { |pair| pair.sub(/\n/, '') }  # drop the first newline of each pair
puts s
LINE1LINE2
LINE3LINE4
LINE5LINE6
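The two-group replacement from this answer can also be sketched in Python, where the groups are written \1 and \2 and are placed directly next to each other. The function name join_line_pairs is hypothetical, and re.MULTILINE is needed in Python so that ^ matches at each line start.

```python
import re

def join_line_pairs(text):
    """Join each pair of lines by dropping the newline between them."""
    # Group 1 is the first line without its newline; group 2 is the second
    # line including its newline, so only the newline in between is removed.
    return re.sub(r"^(.*)\n(.*\n)", r"\1\2", text, flags=re.MULTILINE)

lines = "LINE1\nLINE2\nLINE3\nLINE4\nLINE5\nLINE6\n"
print(join_line_pairs(lines), end="")
```

The pattern expects every second line to end with a newline, so a trailing unpaired line, or a final line without a terminator, is left as it is.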
For an even number of lines, you could use a positive lookahead to assert that what follows the newline is zero or more repetitions of two lines, each ending with a newline, followed by the last line and the end of the string.
In the replacement, use an empty string.
\n(?=(?:.+\n.+\n)*.+$)
Explanation
\n Match a newline
(?= Positive lookahead, assert what is on the right is
(?:.+\n.+\n)* Match 0+ times 2 lines followed by a newline
.+$ Match any char except a newline 1+ times and assert end of string
) Close lookahead
Regex demo
Output
LINE1LINE2
LINE3LINE4
LINE5LINE6
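A minimal Python sketch of this lookahead-based removal, assuming the input has an even number of lines and no trailing newline; the names NEWLINE_TO_DROP and text are hypothetical.

```python
import re

# A newline is removed only when an even number of complete lines plus the
# final line remain after it, i.e. it is the 1st, 3rd, 5th, ... newline.
NEWLINE_TO_DROP = re.compile(r"\n(?=(?:.+\n.+\n)*.+$)")

text = "LINE1\nLINE2\nLINE3\nLINE4\nLINE5\nLINE6"
result = NEWLINE_TO_DROP.sub("", text)
print(result)
```

Unlike the two-group approach, this pattern matches only the newline characters themselves, which is what the question asked for.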

Regex - Match multiple instances of a specific character that is not followed by a numeric value

I need a regex query to match the following:
hello%20world //Won't match
hello % dog //Will match
hello %20 world //Won't match
hello %% world //Will match twice (not "%%" as a whole, but the first "%" and then the second "%")
I am using regex to replace any "%" that is not followed by a number. If the string contains "%%", I also want to replace both of those with, let's say, 'A', so I get "AA", not "A".
My regex attempt:
%[^0-9]
https://regex101.com/r/Z9N7QJ/1
The issue is that it also matches the next character, so in my string "Hello % world" it matches "% ". And my "%%" is matched as a pair, not as two single characters.
Thanks.
You may use a negative lookahead here:
/%(?!\d)/
Updated RegEx Demo
A lookahead is a zero-width assertion, whereas your regex %[^0-9] also consumes the following non-numeric character.
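The difference between the two patterns can be seen in a short Python sketch. The function names replace_bare_percent and replace_consuming are hypothetical, and 'A' is the placeholder replacement from the question.

```python
import re

def replace_bare_percent(text, replacement="A"):
    """Replace each '%' that is not followed by a digit (lookahead version)."""
    return re.sub(r"%(?!\d)", replacement, text)

def replace_consuming(text, replacement="A"):
    """The original attempt: the character after '%' is consumed as well."""
    return re.sub(r"%[^0-9]", replacement, text)

for sample in ["hello%20world", "hello % dog", "hello %20 world", "hello %% world"]:
    print(repr(sample), "->", repr(replace_bare_percent(sample)))

print(replace_consuming("hello %% world"))  # hello A world
```

The lookahead version also handles a % at the very end of the string, where %[^0-9] cannot match because no character follows.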

Find the first letter and sign of a sentence with Regex

At the beginning of the sentence can sometimes be letters and sometimes numbers.
15. Lorem ipsum is placeholder text
B. Lorem ipsum is placeholder text
C.Lorem ipsum is placeholder text
D . Lorem ipsum is placeholder text
E,Lorem ipsum is placeholder text
I wrote something like this:
[\dga-zA-Z.]{1\s}
Demo with regex101
But it doesn't work right for every sentence. Moreover, it does not detect if there is a space between the first letter/digit and the sign with the sentence.
Where am I making a mistake?
Also, in terms of performance, does it make more sense to use regex or PHP for such scenarios?
Hello, this matched all of your provided examples:
([A-Za-z\d ]+)(\.|,)
What this does is the following:
It matches lowercase letters, uppercase letters, digits, or spaces. It must find one or more of those (the + sign).
The match must end with a dot or a comma (\.|,). Note: in regex, the dot must be escaped to match a literal dot.
If that doesn't do the trick, comment below.
Edit: demo here: click
The following regex will match a single letter or multiple digits placed at the beginning of a sentence and followed by either a single period or a comma:
^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$
This is the breakdown:
^ # Asserts position at start of the line
[a-zA-Z]{1}|[0-9]+ # Match a single alphabetic character or one or more digits
\s* # Matches whitespace characters between 0 and unlimited times
[.,]{1} # Matches a single period or comma character literal
.* # Matches the rest of the text
$ # Asserts position at end of the line
Group 1 - will return both the letter/numbers and the period/comma (including potential spaces). This is in case you need to get both for some reason.
Group 2 - will return only letter or numbers at the start of the sentence, which I assume you'll actually be looking for most of the times.
Group 3 - will return the rest of the text.
The regex will need to be modified depending on what you want, for example if you don't want a match when there are spaces after the letter/digits at the start of the sentence, or if you want to allow more separator characters. Let me know if you have any additional constraints you'd like this regex to conform to.
See the DEMO
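To see what the three groups return, here is a hedged Python sketch that runs the pattern over the sample lines from the question. Python is my choice of language (the question mentions PHP), and the names LEADING_MARKER and samples are hypothetical.

```python
import re

# Group 1: marker plus sign, group 2: marker only, group 3: rest of the line.
LEADING_MARKER = re.compile(r"^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$")

samples = [
    "15. Lorem ipsum is placeholder text",
    "B. Lorem ipsum is placeholder text",
    "C.Lorem ipsum is placeholder text",
    "D . Lorem ipsum is placeholder text",
    "E,Lorem ipsum is placeholder text",
]

for line in samples:
    m = LEADING_MARKER.match(line)
    if m:
        marker_with_sign, marker, rest = m.groups()
        print(repr(marker_with_sign), repr(marker), repr(rest))
```

A line that starts with an ordinary word, such as "Lorem ipsum", produces no match, because only a single letter is allowed before the sign.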
Use: ^[\da-zA-Z]+\h*[.,]
Demo
Explanation:
^ # beginning of line
[\da-zA-Z]+ # 1 or more letter or digit
\h* # 0 or more horizontal spaces
[.,] # a dot or a comma
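Note that \h is a PCRE feature (available in PHP) and is not supported by every engine. As an illustration, Python's re module rejects \h, so the sketch below substitutes [ \t], which is my own approximation: it covers spaces and tabs but not the other horizontal whitespace characters that \h includes. The names MARKER and text are hypothetical.

```python
import re

# [ \t]* stands in for PCRE's \h* (an approximation: space and tab only).
MARKER = re.compile(r"^[\da-zA-Z]+[ \t]*[.,]", re.MULTILINE)

text = "\n".join([
    "15. Lorem ipsum is placeholder text",
    "B. Lorem ipsum is placeholder text",
    "C.Lorem ipsum is placeholder text",
    "D . Lorem ipsum is placeholder text",
    "E,Lorem ipsum is placeholder text",
])

print(MARKER.findall(text))  # ['15.', 'B.', 'C.', 'D .', 'E,']
```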

Regex leading space/add trailing space before/to punctuation

To better clean my forum message corpus, I would like to remove the leading spaces before punctuation and add one after if needed, using two regular expressions. The latter was no problem ((?<=[.,!?()])(?! )), but I have some problems with the former.
I used this expression: \s([?.!,;:"](?:\s|$))
But it's by far not flexible enough:
It matches even if there's already a space(or more) before the punctuation character
It doesn't match if there's not a space after the punctuation character
It doesn't match any unlisted punctuation character (but I guess I can use [:punct:] for that, at the end of the day)
Finally, both expressions match decimal points (which they should not)
How can I eventually rewrite the expression to meet my needs?
Example Strings and expected output
This is the end .Hello world! # This is the end. Hello world! (remove the leading, add the trailing)
This is the end, Hello world! # This is the end, Hello world! (ok!)
This is the end . Hello world! # This is the end. Hello world! (remove the leading, ok the trailing)
This is a .15mm tube # This is a .15 mm tube (ok since it's a decimal point)
Use \p{P} to match any punctuation character. Use \h* instead of \s* because \s would also match newline characters.
(?<!\d)\h*(\p{P}+)\h*(?!\d)
Replace each match with \1 followed by a space (\1<space>).
DEMO
> x <- c('This is the end .Stuff', 'This is the end, Stuff', 'This is the end . Stuff', 'This is a .15mm tube')
> gsub("(?<!\\d)\\h*(\\p{P}+)\\h*(?!\\d)", "\\1 ", x, perl=T)
[1] "This is the end. Stuff" "This is the end, Stuff" "This is the end. Stuff"
[4] "This is a .15mm tube"
Here's an expression that detects the substrings that need to be replaced:
\s*\.\s*(?!\d)
Replace each match with ". " (a dot followed by a space).
Here's a demo link of how this works: http://regex101.com/r/zB2bY3/1
Explanation of the regex:
\s* - matches whitespace, any number of chars (0 - unbounded)
\. - matches a dot
\s* - same as above
(?!\d) - negative lookahead. It means that the string, in order to be matched, must not be followed by a digit (this handles your last test case).
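Here is a minimal Python sketch of this dot-only replacement, run against the example strings from the question. The function name normalize_dot_spacing is hypothetical, and the sketch handles only the dot, as the expression above does.

```python
import re

def normalize_dot_spacing(text):
    """Remove whitespace before a dot and leave exactly one space after it,
    unless the dot is directly followed by a digit (decimal point)."""
    return re.sub(r"\s*\.\s*(?!\d)", ". ", text)

examples = [
    "This is the end .Hello world!",
    "This is the end, Hello world!",
    "This is the end . Hello world!",
    "This is a .15mm tube",
]
for line in examples:
    print(normalize_dot_spacing(line))
```

One side effect to be aware of: a dot at the very end of the string also gets a space appended, so a final strip of trailing whitespace may be wanted.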