Non-greedy matching - regex

In Geany, I want to match the titles of books. One example:
Michael Lewis, Liar's Poker, Hodder & Stoughton Ltd, London, 1989
I try to do so with this regex code:
,\s.*?,
This regex matches too much: it matches both [, Liar's Poker,] and [, London,].
I want to have a regex that only matches the title.

I think you need this regex without the global modifier. If you set the global modifier (g), it will return further matches, as you have experienced.
,\s*([^,]+)
Since you want to ignore further matches, you may try this too:
^.*?,\s*([^,]+).*$
You will get Liar's Poker in group 1.
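For illustration, here is a minimal sketch of both patterns in Python. The question is about Geany, so using Python's re module is only an assumption made to get something runnable. re.search stops at the first match, which is the behaviour you get without the global modifier.

```python
import re

line = "Michael Lewis, Liar's Poker, Hodder & Stoughton Ltd, London, 1989"

# re.search returns only the first match, like a regex without the g modifier.
first = re.search(r",\s*([^,]+)", line)
title_a = first.group(1)

# The anchored variant consumes the whole line and keeps the title in group 1.
whole = re.match(r"^.*?,\s*([^,]+).*$", line)
title_b = whole.group(1)

print(title_a)
print(title_b)
```

Both variables should hold the text between the first and second comma of the line.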

/(, \w+[']?\w? \w+,)/g
This regex will give you:
[", Liar's Poker,"]
You will have to do additional processing to remove the leading and trailing commas. Try it out and see if this works for you.
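The "additional processing" could look roughly like this in Python (an assumed language, since the question does not name one). Note that the pattern only fits titles made of exactly two words, so treat this as a sketch for the example line rather than a general solution.

```python
import re

line = "Michael Lewis, Liar's Poker, Hodder & Stoughton Ltd, London, 1989"

# findall plays the role of the g flag and returns every match of the group.
raw_matches = re.findall(r"(, \w+[']?\w? \w+,)", line)

# Extra processing step: strip the surrounding commas and spaces.
titles = [m.strip(", ") for m in raw_matches]

print(raw_matches)
print(titles)
```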

Related

How to remove everything except specific set of characters/word using regex on vscode?

I have a huge dump which I downloaded from IMDb, and here's a tiny example from it.
nm0000006 Ingrid Bergman 1915 1982 actress,soundtrack,producer tt0036855,tt0077711,tt0038109,tt0034583
nm0000007 Humphrey Bogart 1899 1957 actor,soundtrack,producer tt0033870,tt0034583,tt0037382,tt0043265
nm0000008 Marlon Brando 1924 2004 actor,soundtrack,director tt0078788
nm0000009 Richard Burton 1925 1984 actor,soundtrack,producer tt0061184,tt0059749,tt0057877,tt0087803
nm0000010 James Cagney 1899 1986 actor,soundtrack,director tt0031867,tt0042041
Codes like "tt0029870" are the only things I need.
What regex should I use to remove everything except those tt0031867-type codes?
I need the dump to look like this: tt0036855tt0077711tt0038109tt0034583tt0036855tt0077711tt0038109tt0034583tt0036855tt0077711tt0038109tt0034583
I will use VS Code to find and replace/remove it using regex.
It obviously depends on the regex flavor you use.
Here is a solution.
The regex is .*?(tt\d+):
.* will match any number of any characters, but the ? modifier tells it to match as few as possible;
( and ) capture some matched text, the one we want to preserve in the substitution;
tt matches literal tt;
\d matches a digit, but + tells it to match 1 or more (and, without ?, it matches as many as possible).
The modifiers applied to the regex are g to repeat the matching over and over on the lines, and s to make . match the newline character too.
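As a rough sketch of this substitution in Python (an assumption; the question targets VS Code): re.sub already replaces every match, which covers the g modifier, and re.DOTALL corresponds to s. The two sample lines are taken from the question.

```python
import re

dump = (
    "nm0000008 Marlon Brando 1924 2004 actor,soundtrack,director tt0078788\n"
    "nm0000010 James Cagney 1899 1986 actor,soundtrack,director tt0031867,tt0042041"
)

# Each match is "as little text as possible, then one code";
# the replacement keeps only the captured code (group 1).
codes_only = re.sub(r".*?(tt\d+)", r"\1", dump, flags=re.DOTALL)

print(codes_only)
```

Any text after the last code (a trailing newline, for instance) is not matched and therefore stays in the result.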
You have to do it in a few steps:
Remove all of the line up to the first tt0123456
search: ^.*?(?=tt\d{7})
replace: empty
Remove all ,
search: ,
replace: empty
Remove all newlines
search: \n
replace: empty
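The same three steps can be sketched in Python, assuming the dump is available as one string. re.MULTILINE is needed here so that ^ matches at the start of every line, which an editor's find and replace typically does on its own.

```python
import re

dump = (
    "nm0000008 Marlon Brando 1924 2004 actor,soundtrack,director tt0078788\n"
    "nm0000010 James Cagney 1899 1986 actor,soundtrack,director tt0031867,tt0042041"
)

# Step 1: on each line, drop everything before the first tt code.
step1 = re.sub(r"^.*?(?=tt\d{7})", "", dump, flags=re.MULTILINE)
# Step 2: remove the commas between codes.
step2 = step1.replace(",", "")
# Step 3: remove the newlines.
result = step2.replace("\n", "")

print(result)
```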
Find: (\btt\d+\b)|\n|.
Replace: $1
Just this one operation is sufficient. Tested in vscode.
See regex101 demo
Frequently an alternation of what you want and don't want is the easiest way. Order of the alternates is important. Generally put what you want first and match the other stuff afterwards.
Find: (tt\d+)|[\s\S]*? also works but might be a more expensive operation.
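A small Python sketch of the alternation idea, under the assumption of Python 3.5 or later, where a reference to a group that did not take part in the match is substituted as an empty string. In Python the reference to group 1 is written \1.

```python
import re

dump = (
    "nm0000008 Marlon Brando 1924 2004 actor,soundtrack,director tt0078788\n"
    "nm0000010 James Cagney 1899 1986 actor,soundtrack,director tt0031867,tt0042041"
)

# First alternative: a wanted code, captured in group 1.
# Other alternatives: a newline or any single unwanted character.
# Every match is replaced by group 1, which is empty for unwanted text.
result = re.sub(r"(\btt\d+\b)|\n|.", r"\1", dump)

print(result)
```

Because the wanted alternative comes first, the engine tries to keep a code before it falls back to discarding a single character.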
/tt0029870/ will work in your case.
In a database you can always use LIKE:
SELECT * FROM YOURTABLE WHERE code LIKE '%tt0029870%'

Need regex help for matching names

Let's say I have these three names
John Doe (p45643)
Le'anne Frank
Molly-Mae Edwards
I want to match
1) John Doe
2) Le'anne Frank
3) Molly-Mae Edwards
The regex I have tried is
(^[a-zA-Z-'^\d]$)+
but it isn't working as I am expecting.
I would like help creating a regex pattern that:
Matches a name from start to finish and cannot contain a number. The only permitted characters in each "name" are [a-zA-Z'-], so a name such as J0hn shouldn't match.
If I understood your question correctly, you have a minor error in your regex:
(^[a-zA-Z-'^\d]$)+
         ^-------^------Here
The - pointed to above should be escaped or moved to the end of the character class, since it works as a range character there. The + only marks the whole group as repeated.
You can use this regex instead (following your previous pattern):
(^[a-zA-Z'^\d -]+$)
Update: for your comment. If you want to match separately, then you can use:
(\b[a-zA-Z'^\d-]+\b)
And if you only want to match words made of letters (no numbers), then you can use:
(\b[a-zA-Z'-]+\b)
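To see what the last pattern does with the names from the question, here is a short sketch in Python (an assumed language; the question does not state the regex flavor). The J0hn entry is added to cover the case that should not match.

```python
import re

names = ["John Doe (p45643)", "Le'anne Frank", "Molly-Mae Edwards", "J0hn"]

# Whole words made only of letters, apostrophes and hyphens.
word = re.compile(r"\b[a-zA-Z'-]+\b")

found = {name: word.findall(name) for name in names}

for name, parts in found.items():
    print(name, "->", parts)
```

The word boundaries are what reject p45643 and J0hn: a letter directly followed by a digit has no boundary between the two characters, so the match cannot end there.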
You are using the anchors incorrectly. Depending on the modifier, ^ and $ match at the ends of the whole string or of each single line.
Try
/^[a-zA-Z'-]+$/
Thanks to @Djory Krache.
The query I was looking for was
(\b[a-zA-Z'-]+\b)

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work off is the linked image. The creators can't even confirm which flavour of regex it is; they believe it's Python, but I'm unsure.
Regex Image
Update
Unfortunately, none of the patterns below have worked so far. I tested each one live (rather than only in a regex tester, thinking it might behave differently), but they still didn't work.
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
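Assuming the tool really is Python and a replace step is available (neither is confirmed in the question), the two operations could be sketched like this. The helper name find_code is made up for the example.

```python
import re

lines = ["T123?214567", "T?211234567", "T0000001"]

def find_code(line):
    # Operation 1: delete every positioner (a ? followed by two digits).
    cleaned = re.sub(r"\?[0-9]{2}", "", line)
    # Operation 2: match the code in the cleaned text.
    m = re.search(r"T[0-9]{7}", cleaned)
    return m.group(0) if m else None

codes = [find_code(line) for line in lines]
print(codes)
```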
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
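The concatenation step might look like this in Python, again only as a sketch under the assumption that the environment is Python. join_groups is a hypothetical helper name.

```python
import re

pattern = re.compile(r"(T.*?)\?\d{2}(.*)")

def join_groups(line):
    m = pattern.search(line)
    if m is None:
        # No positioner on this line, so this pattern finds nothing.
        return None
    # Glue the text before and after the positioner back together.
    return m.group(1) + m.group(2)

print(join_groups("T123?214567"))
print(join_groups("T?211234567"))
```

Lines without a positioner return None here, so they would still need the plain T[0-9]{7} match.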
First, match the ?21 and replace it with a distinctive character such as #:
\?21
Then you may try this regex to find what you want:
(T(?:\d{7}|[\#\d]{8}))\s
The target string is captured in group 1 (\1).
Finally, replace # with ?21 or something else you like.
A Python script may look like this:
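import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""

# Step 1: swap each ?21 positioner for a placeholder character.
rexpre = re.compile(r'\?21')
# Step 2: T plus 7 digits, or 8 characters when a placeholder is among them.
# The trailing \s means the last line (nothing after it) is not matched.
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')

for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Step 3: put the positioner back into each match.
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))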
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group 1 followed by group 2: \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
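A minimal sketch of this replacement in Python, assuming a replace function is available (the question says it may not be). The sample values come from the question.

```python
import re

# Group 1: T and the digits before the optional positioner.
# Group 2: the digits after it.
pattern = re.compile(r"\b(T\d*)(?:\?\d{2})?(\d+)\b")

samples = ["T0000001", "T123?214567", "T?211234567"]

# Keep group 1 and group 2, dropping the positioner between them.
normalized = [pattern.sub(r"\1\2", s) for s in samples]
print(normalized)
```

For a value without a positioner, the greedy \d* gives back one digit so that \d+ can still match, and the value comes out unchanged.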
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

regex to select only the zipcode

,Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,
I want to select only 13732 from the line. I came up with this regex:
(\d)(\s*\d+)*(\,y,,)
But it also selects the ,y,, part. If I remove that part from the regex, the regex also matches the date. Please help me with this.
Generally, if you want to require some text without including it in the match, use a zero-length lookaround (lookahead or lookbehind). In your case, you can use a lookahead:
(\d)(\s*\d+)*(?=\,y,,)
The syntax (?=<stuff>) means "followed by <stuff>, without matching it".
More information on lookarounds can be found in this tutorial.
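To check the lookahead version quickly, here is a short Python sketch (an assumed language, as the question does not name one). The whole match, group 0, is the zip code, because the lookahead only asserts that ,y,, follows and does not consume it.

```python
import re

line = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"

# The lookahead (?=\,y,,) must succeed right after the digits,
# but its text is not part of the match.
m = re.search(r"(\d)(\s*\d+)*(?=\,y,,)", line)
zipcode = m.group(0)

print(zipcode)
```

The date and the street number are skipped because neither is directly followed by ,y,,.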
Regex: \D*(\d{5})\D*
Explanation: match 5 digits surrounded by zero or more non-digits on both sides. Then you can extract group containing the match.
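Here's code in Python:
import re

string = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"
# Raw string so the backslashes reach the regex engine unchanged.
search = re.search(r"\D*(\d{5})\D*", string)
# Group 1 holds the five digits.
print(search.group(1))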
Output:
13732

Regex for UK postcode

I have this regex that I want to use to parse a UK postcode, but it doesn't work when a postcode is entered without spaces.
^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]?) {1,2}([0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$
What change(s) do I need to make to this regex so it'll work without spaces correctly?
If I supply LS28NG I would expect the regex to return two matches for LS2 and 8NG.
This worked for me, at least for your example of LS28NG:
^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]?) {0,2}([0-9][ABD-HJLN-UW-Z]{2}|GIR ?0AA)$
I changed the repetitions after the space to 0-2 instead of 1-2, and made the space in GIR 0AA optional.
Try adding \s{0,2} and putting brackets around the first and second part of the expression:
^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]?)\s{0,2}([0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$
For:
LS2 8NG
LS28NG
It will match:
LS2 and 8NG
LS2 and 8NG
See it in action.
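A quick way to check the \s{0,2} variant is a small Python sketch like the one below. Python is an assumption here, and split_postcode is a made-up helper name; the pattern itself is the one from this answer, split over three string literals for readability.

```python
import re

POSTCODE = re.compile(
    r"^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]?)"
    r"\s{0,2}"
    r"([0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$"
)

def split_postcode(text):
    # Returns (outward, inward) or None when the text does not match.
    m = POSTCODE.match(text)
    return m.groups() if m else None

print(split_postcode("LS2 8NG"))
print(split_postcode("LS28NG"))
```

For LS28NG the first group initially grabs LS28, then backtracks to LS2 so that the second group can match 8NG.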
Just add an optional space \s? at the end of the first group:
^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]?\s?){1,2}([0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$