Regex to extract only text after string and before space - regex

I want to match the text after a given string. In this case, the text on lines starting with "BookTitle:", up to the first space:
BookTitle:HarryPotter JK Rowling
BookTitle:HungerGames Suzanne Collins
Author:StephenieMeyer BookTitle:Twilight
Desired output is:
HarryPotter
HungerGames
I tried "^BookTitle(.*)", but it gives me matches where BookTitle: is in the middle of a line, and it also includes everything after the whitespace. Can anyone help?

You can use a positive lookbehind in your pattern:
(?<=BookTitle:).*?(?=\s)
For more info: Lookahead and Lookbehind Zero-Width Assertions
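A minimal Python sketch of how this lookbehind behaves on the sample lines. The anchored variant is my own assumption, not part of the answer above: it puts ^ inside the lookbehind and relies on multi-line mode, which Python's re module accepts because the lookbehind still has a fixed width.

```python
import re

text = (
    "BookTitle:HarryPotter JK Rowling\n"
    "BookTitle:HungerGames Suzanne Collins\n"
    "Author:StephenieMeyer BookTitle:Twilight\n"
)

# Lookbehind exactly as given above: it is not tied to the start of a line,
# so it also picks up the title that follows a mid-line "BookTitle:".
unanchored = re.findall(r"(?<=BookTitle:).*?(?=\s)", text)

# Assumed variant: "^" inside the lookbehind, with re.MULTILINE so that "^"
# matches at the start of every line.
anchored = re.findall(r"(?<=^BookTitle:)\S+", text, flags=re.MULTILINE)

print(unanchored)  # ['HarryPotter', 'HungerGames', 'Twilight']
print(anchored)    # ['HarryPotter', 'HungerGames']
```

Other engines have different rules about what is allowed inside a lookbehind, so treat the anchored form as Python-specific.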

What language is this?
And please provide some code; with the ^ anchor you should definitely only be matching strings that begin with BookTitle, so something else is wrong.
If you can guarantee that all whitespace is stripped from the titles, as in your examples, then ^BookTitle:(\S+) should work in many languages.
Explanation:
^ requires the match to start at the beginning of the string, as you know.
\s - *lower*case means: match on white*s*pace (space, tab, etc.)
\S - *upper*case means the inverse: match on anything BUT whitespace.
\w is another possibility: match on *w*ord character (alphanumeric plus underscore) - but that will fail you if, for example, there's an apostrophe in the title.
+, as you know, is a quantifier meaning "at least one of".
Hope that helps.
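To illustrate the difference between \S and \w described above, here is a small Python sketch. The title with an apostrophe is a made-up example line, not one from the question.

```python
import re

non_space = re.compile(r"^BookTitle:(\S+)")  # anything but whitespace
word_only = re.compile(r"^BookTitle:(\w+)")  # letters, digits, underscore

# Hypothetical line, chosen only because the title contains an apostrophe.
line = "BookTitle:Ender'sGame Orson Scott Card"

title_S = non_space.match(line).group(1)
title_w = word_only.match(line).group(1)

# "^" keeps the pattern from matching a "BookTitle:" in the middle of a line.
mid_line = non_space.match("Author:StephenieMeyer BookTitle:Twilight")

print(title_S)   # Ender'sGame
print(title_w)   # Ender
print(mid_line)  # None
```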

With the 'multi-line' regex option use something like this:
^BookTitle:([^\s]+)
Without multi-line option, this:
(?:^|\n)BookTitle:([^\s]+)
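A quick sketch of both variants in Python, assuming the whole input is held in one string. In Python the multi-line option is the re.MULTILINE flag; other languages name it differently.

```python
import re

text = (
    "BookTitle:HarryPotter JK Rowling\n"
    "BookTitle:HungerGames Suzanne Collins\n"
    "Author:StephenieMeyer BookTitle:Twilight"
)

# With the multi-line option, "^" matches at the start of every line.
with_flag = re.findall(r"^BookTitle:([^\s]+)", text, flags=re.MULTILINE)

# Without it, "^" only matches at the very start of the string ...
plain_anchor = re.findall(r"^BookTitle:([^\s]+)", text)

# ... so the second pattern accepts a preceding newline as an alternative.
without_flag = re.findall(r"(?:^|\n)BookTitle:([^\s]+)", text)

print(with_flag)     # ['HarryPotter', 'HungerGames']
print(plain_anchor)  # ['HarryPotter']
print(without_flag)  # ['HarryPotter', 'HungerGames']
```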

Related

How do I match these text lines in regex?

I'm trying to match the first three text lines with a regex, i.e. the ones ending with form.
value="something form"
value="Second cool form"
value="another silly old form"
value="blabla"
How can I do that?
I don't know what tool you are using, but the following pattern should match the first three lines:
.*form"$
You could simply use:
.*form"$
For this to work, you have to turn on multiline mode.
The dot (.) matches any character except a newline, and the asterisk (*) repeats it zero or more times; after that must come the literal text form". The dollar sign ($) anchors the match to the end of the string or, in multiline mode, to the end of each line.
Take a look at demo. You should learn more about regular expressions here, this is basic regex matching.
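Since the question does not name a tool, here is a sketch in Python (my choice, not the asker's) showing why multiline mode matters for this pattern.

```python
import re

text = "\n".join([
    'value="something form"',
    'value="Second cool form"',
    'value="another silly old form"',
    'value="blabla"',
])

# In multiline mode "$" matches at the end of each line.
multiline = re.findall(r'.*form"$', text, flags=re.MULTILINE)

# Without it "$" only matches at the end of the whole string, and the
# last line does not end with form", so nothing is found.
single = re.findall(r'.*form"$', text)

print(len(multiline))  # 3
print(single)          # []
```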
You can try using this:
\w*form\b
\w*: Allows characters in front of form
\b: A word boundary; makes sure that form is at the end of a word.
Actually if you want to match the 'form' as a separate word, you need something like this:
\Wform\W
\W (capital W) is any character which does not represent a word character, at least in perl-like regex.
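A short Python sketch comparing the two patterns. The input "platform reform formal" is a made-up string used only to show the word-boundary behaviour.

```python
import re

# \w*form\b: any word that ends in "form".
ends_in_form = re.findall(r"\w*form\b", "platform reform formal")

# \Wform\W: "form" as a separate word, but the match also includes the
# non-word character on each side (here a space and a double quote).
separate_word = re.findall(r"\Wform\W", 'value="Second cool form"')

print(ends_in_form)   # ['platform', 'reform']
print(separate_word)  # [' form"']
```

Note that \Wform\W needs an actual character on both sides, so it would not match form at the very start or end of the input.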

regex npp - search string must be followed by specific chars, but not include those chars

In the text below, I need to join these two lines into one single line by replacing the newline and empty space with nothing.
Provisioned Links : 2/14, 2/24, 7/10, 7/12,
7/25, 7/31, 7/32
Therefore I have this regex (in Notepad++):
(\r\n|\n)\s+[0-9]\/[0-9]*
Problem: the match includes the 7/25 - I need it to look for the #/## but not include it.
If I use this lookaround pattern:
(\r\n|\n)\s+(q=[0-9]\/[0-9])*
all lines beginning with newline + spaces are matched, whether or not they end with #/##.
What am I doing wrong?
regex101 fiddle to play with
Be careful:
The lookahead is constructed incorrectly; the syntax is (?=...).
Lookarounds should not be quantified.
So what you really need is [\r\n]\s+(?=[0-9]\/[0-9]*).
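The same idea sketched in Python rather than Notepad++, so flags and escaping may differ slightly in the editor. I replace the match with a single space instead of nothing, which is my own assumption to keep the list readable.

```python
import re

text = (
    "Provisioned Links : 2/14, 2/24, 7/10, 7/12,\n"
    "                    7/25, 7/31, 7/32"
)

# Original pattern: the digits are part of the match, so they get replaced too.
consuming = re.sub(r"(\r\n|\n)\s+[0-9]/[0-9]*", " ", text)

# Lookahead version: the digits must follow, but they are not consumed.
joined = re.sub(r"[\r\n]\s+(?=[0-9]/[0-9]*)", " ", text)

print(consuming)  # Provisioned Links : 2/14, 2/24, 7/10, 7/12, , 7/31, 7/32
print(joined)     # Provisioned Links : 2/14, 2/24, 7/10, 7/12, 7/25, 7/31, 7/32
```

With Windows line endings, [\r\n] matches the \r and the following \s+ swallows the \n along with the indentation.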
To normalize whitespace, why not simply replace "comma with additional space after it" with "comma plus one tab character" ?
You don't need that complicated pattern at all, because \s matches spaces, newlines, and tabs all at the same time:
Pattern: ,\s*
Replacement string: ,\t
https://regex101.com/r/T0QJnq/1
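A sketch of this simpler replacement in Python, under the assumption that every comma in the text is a separator.

```python
import re

text = (
    "Provisioned Links : 2/14, 2/24, 7/10, 7/12,\n"
    "                    7/25, 7/31, 7/32"
)

# \s* swallows spaces, tabs and newlines after each comma in one go.
normalized = re.sub(r",\s*", ",\t", text)

print(repr(normalized))
# 'Provisioned Links : 2/14,\t2/24,\t7/10,\t7/12,\t7/25,\t7/31,\t7/32'
```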

Regex to match a word or a dot

This should be a fairly trivial question, but I have spent quite some time on it and I'm unable to do it -
If this is my string -
"this/DT word/NN is/VBZ a/DT dot/NN ./."
I want to extract the immediate neighbors of /, be it a word, a comma or a full stop.
(\\w+)/(\\w+) gives the words before and after / but not the full stops etc.
I tried "\\.\\/\\.|(\\w+)/(\\w+)" to grab the full stops, but it doesn't seem to work.
Can someone help, please? (I am trying this in R.)
Thanks!
Note that \w only matches letters, digits and an underscore. A dot/period belongs to punctuation and can be captured with Perl-like \p{P} or POSIX class [:punct:]. Thus, theoretically, you could use something like ([\\w[:punct:]]+)/([\\w[:punct:]]+) (or even a more POSIXish ([[:alpha:][:punct:]]+)/([[:alpha:][:punct:]]+)), but I guess matching non-whitespace characters on both sides of / suits your purpose best.
Here is an alternative to the (\\S+)/(\\S+) regex:
([^\\s]+)/([^\\s]+)
See regex demo
The [^\s] means any symbol other than whitespace. Note that \S means the same thing: any non-whitespace character.
If you can have no non-whitespace characters on either side of /, I believe
([^\\s]*)/([^\\s]*)
or
(\\S*)/(\\S*)
will work better for you since * will match 0 or more characters.
See another demo
You can use this regex
"(\\S+)/(\\S+)"
i.e. grab each non-space text before and after /.
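The question is about R, but the pattern itself is not R-specific. Here is the same idea sketched in Python, where a raw string means the backslashes do not need doubling.

```python
import re

s = "this/DT word/NN is/VBZ a/DT dot/NN ./."

# Each match is a (before, after) pair of non-whitespace runs around "/".
pairs = re.findall(r"(\S+)/(\S+)", s)

print(pairs)
# [('this', 'DT'), ('word', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('dot', 'NN'), ('.', '.')]

# For comparison, \w+ misses the final "./." token.
words_only = re.findall(r"(\w+)/(\w+)", s)
print(len(words_only))  # 5
```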

regex to catch dollar amount catching # symbol too

I'm trying to match simple dollar strings ($34.21). My regex is as follows.
\$\d+.([0-9][0-9])
I'm catching the following string though and I don't know why.
#$23.23
Does the # symbol have some kind of special meaning? I don't see it on my regex cheat sheet and this is bugging me.
Thanks, mj
You should escape the dot (\.), and you probably need to add start (^) and end ($) anchors around the pattern:
^\$\d+\.([0-9][0-9])$
The anchors are used to ensure that no other characters are allowed in the input string before or after the matched string.
Also, depending on the exact language / platform you're using*, this can probably be further simplified to:
^\$\d+\.(\d\d)$
* Some regex engines treat \d as equivalent to [0-9], while on others it will match any Unicode digit, including those from other numeral systems.
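A small Python sketch of why the original pattern catches #$23.23: nothing special is going on with #, the unanchored search simply finds $23.23 inside the longer string. The input "$23x23" is a made-up example to show the effect of the unescaped dot.

```python
import re

loose = re.compile(r"\$\d+.([0-9][0-9])")       # pattern from the question
strict = re.compile(r"^\$\d+\.([0-9][0-9])$")   # escaped dot plus anchors

# Unanchored: the match starts at "$" and ignores the leading "#".
loose_hash = loose.search("#$23.23")

# Unescaped dot: "." accepts any character, here the "x".
loose_x = loose.search("$23x23")

strict_hash = strict.search("#$23.23")
strict_x = strict.search("$23x23")
strict_ok = strict.search("$34.21")

print(loose_hash.group(0))     # $23.23
print(loose_x.group(0))        # $23x23
print(strict_hash, strict_x)   # None None
print(strict_ok.group(1))      # 21
```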
Use line start and line end anchors to make sure you don't match unwanted input:
^\$\d+\.([0-9][0-9])$
OR
^\$\d+\.\d{1,2}$

regular expression no characters

I have this regular expression
([A-Z], )*
which should match something like
test, (with a space after the comma)
How do I change the regular expression so that if there are any characters after the space then it doesn't match?
For example if I had:
test, test
I'm looking to do something similar to
([A-Z], ~[A-Z])*
Cheers
Use the following regular expression:
^[A-Za-z]*, $
Explanation:
^ matches the start of the string.
[A-Za-z]* matches 0 or more letters (case-insensitive) -- replace * with + to require 1 or more letters.
, matches a comma followed by a space.
$ matches the end of the string, so if there's anything after the comma and space then the match will fail.
As has been mentioned, you should specify which language you're using when you ask a Regex question, since there are many different varieties that have their own idiosyncrasies.
^([A-Z]+, )?$
The difference between mine and Donut's is that his will match ", " (a comma and a space on their own) and fail for the empty string, while mine will match the empty string and fail for ", ". (Also, his is case-insensitive and mine is not. With mine you'll have to add case-insensitivity to the options of your regex function, but it follows your example.)
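To make the difference between the two answers concrete, here is a sketch in Python. Python is my assumption since the question names no language, and re.IGNORECASE stands in for "the options of your regex function".

```python
import re

donut = re.compile(r"^[A-Za-z]*, $")
optional_group = re.compile(r"^([A-Z]+, )?$", re.IGNORECASE)

inputs = ["test, ", "test, test", ", ", ""]

# Map each input to whether the pattern matches it.
donut_results = {s: bool(donut.search(s)) for s in inputs}
group_results = {s: bool(optional_group.search(s)) for s in inputs}

print(donut_results)
# {'test, ': True, 'test, test': False, ', ': True, '': False}
print(group_results)
# {'test, ': True, 'test, test': False, ', ': False, '': True}
```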
I am not sure which regex engine/language you are using, but there is often something like a negated character class, e.g. [^a-z], meaning "any character other than a lowercase letter".