I am trying to parse a GEDCOM file using regular expressions and am almost there, but the expression grabs the next line of the text for lines where there is optional text at the end of line. Each record should be a single line.
This is an extract from the file:
0 HEAD
1 CHAR UTF-8
1 SOUR Ancestry.com Family Trees
2 VERS (2010.3)
2 NAME Ancestry.com Family Trees
2 CORP Ancestry.com
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
0 #P6# INDI
1 BIRT
And this is the regular expression I am using:
(\d+)\s+(#\S+#)?\s*(\S+)\s+(.*)
This works for all lines except those that do not contain any text at the end, such as the first one. For instance, the last capture group for the first record contains the '1 CHAR UTF-8'.
I have tried using the $ anchor to limit the .* to the end of the line, but this fails because the end of the second line is also a line end.
The \s pattern also matches newline characters, so the \s+ after the tag can run onto the next line. Replace it with a literal space, or [^\S\r\n], or \h if the flavor is PCRE, or [\p{Zs}\t].
(\d+) +(#\S+#)? *(\S+) +(.*)
If you need to match whole lines, you may add the multiline option and anchors (^ at the start and $ at the end of the pattern).
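To illustrate the difference, here is a minimal Python sketch that runs both patterns over a shortened copy of the extract. The names `leaky` and `line_bound` are made up for this example, and making the trailing text optional with `(?: +(.*))?` is an assumption of the sketch (the pattern above requires at least one space after the tag), so that tag-only lines such as `0 HEAD` still match.

```python
import re

sample = "0 HEAD\n1 CHAR UTF-8\n2 VERS (2010.3)\n0 #P6# INDI\n1 BIRT\n"

# Original pattern: \s also matches "\n", so a tag-only line swallows the next line.
leaky = re.compile(r"(\d+)\s+(#\S+#)?\s*(\S+)\s+(.*)")

# Line-bound variant: literal spaces only, anchored per line with re.MULTILINE.
# The optional trailing group is an assumption of this sketch.
line_bound = re.compile(r"^(\d+) +(#\S+#)? *(\S+)(?: +(.*))?$", re.MULTILINE)

leaky_first = leaky.search(sample).groups()
records = [m.groups() for m in line_bound.finditer(sample)]

print(leaky_first)
for record in records:
    print(record)
```

With the line-bound variant, each record stays on its own line, and the last group is None when a line has no trailing text.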
I have a text file where almost all the lines start with the letter N followed by 3 or 4 numbers as below
N970 G2 X-1.0591 Y-1.7454 I0. J-.04
N980 G1 Y-1.7554
N990 X-1.0594 Y-1.7666
N1000 Z-.2187
N1010 Y-1.7566
How can I remove the N followed by the 3 or 4 numbers in Notepad++ so that it looks like this? If I need to search twice (once for N### and then again for N####), that is fine too.
G2 X-1.0591 Y-1.7454 I0. J-.04
G1 Y-1.7554
X-1.0594 Y-1.7666
Z-.2187
Y-1.7566
The numbers go from 100 to 9990 in increments of 10, if that helps.
You can use the following regex that should work for your case:
^N[0-9]+\s*(.*)
It will match every line that starts with a capital letter N immediately followed by one or more digits. Matched results will include a single group which will contain the text you are looking for.
Note that the whitespace between the N tag and the actual text is matched by \s* but is not part of the captured group, so it is dropped from the result.
Breakdown
^ # Assert position at the start of the line
N # Matches capital letter 'N' literally
[0-9]+ # Matches a digit, one or more times
\s* # Matches whitespace, zero or more times
(.*) # Captures the rest of the line, which is the text you are looking for
Find/Replace
The regex will match each individual line so you can either select Find Next and then Replace and process your file one line at a time or you can choose Replace All to process the whole file at once.
The substitution (the Replace with: field) should contain just the first group ($1), which holds the rest of the line with the N-prefix tag trimmed.
Make sure that the Search Mode is set to Regular expression.
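Outside Notepad++, the same find/replace can be sketched in Python. This is only an illustration of the pattern above: the variable names are made up, `re.MULTILINE` stands in for Notepad++ treating ^ as a line start, and Python writes the replacement as `\1` where Notepad++ uses `$1`.

```python
import re

gcode = (
    "N970 G2 X-1.0591 Y-1.7454 I0. J-.04\n"
    "N980 G1 Y-1.7554\n"
    "N1000 Z-.2187\n"
)

# Strip the leading N-number (and the whitespace after it) from every line,
# keeping only the captured remainder of the line.
cleaned = re.sub(r"^N[0-9]+\s*(.*)", r"\1", gcode, flags=re.MULTILINE)

print(cleaned)
```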
Problem
I have a long unstructured text which I need to extract groups of text out.
I have an ideal start and end.
This is an example of the unstructured text truncated:
more useless gibberish at the begininng...
separated by new lines...
START Fund Class Fund Number Fund Currency
XYZ XYZ XYZ USD
bunch of text with lots of newlines in between... Closing 11.11 1,111.11 111,111.11
more useless gibberish between the groups...
separated by new lines...
START Fund Class Fund Number Fund Currency
XYZ XYZ XYZ USD
The word START appears in the middle sometimes multiple times, but it's fine bunch of text with lots of newlines in between... Closing 22.22 2,222.22 222,222.22
more useless gibberish at the end...
separated by new lines...
What I have tried
In the example above, I want to extract out 2 groups of text that lie between START and Closing
I have successfully done so using regex
/(?<=START)(?s)(.*?)(?=Closing)/g
This is the result https://regex101.com/r/vo7CLx/1/
What's wrong?
Unfortunately, I also need to extract the rest of the line that contains the Closing string.
As you can see in the regex101 link, the first block ends with Closing 11.11 1,111.11 111,111.11 and the second with Closing 22.22 2,222.22 222,222.22.
The regex does not include these in the matches.
Is there a way to do this in a single regex, so that the ending tag and its numbers are included?
Try this Regex:
(?s)(?<=START)(.*?Closing(?:\s*[\d.,])+)
Explanation:
(?s) - single line modifier which means a . in the regex will match a newline
(?<=START) - Positive lookbehind to find the position immediately preceded by a START
(.*?Closing(?:\s*[\d.,])+) - matches 0+ occurrences of any character lazily until the next occurrence of the word Closing which is followed by a sequence (?:\s*[\d.,])+
(?:\s*[\d.,])+ - matches 0+ occurrences of a whitespace followed by a digit or a . or a ,. The + at the end means we have to match this sub-pattern 1 or more times
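As a rough check of the explanation above, the following Python sketch applies the same pattern to a shortened stand-in for the sample text. The text and variable names are invented for illustration. Note that the sketch relies on the line after the Closing numbers not starting with a digit, dot, or comma.

```python
import re

text = (
    "gibberish at the beginning\n"
    "START Fund Class\n"
    "XYZ USD\n"
    "some text\n"
    "Closing 11.11 1,111.11 111,111.11\n"
    "gibberish between the groups\n"
    "START Fund Class\n"
    "The word START appears again\n"
    "Closing 22.22 2,222.22 222,222.22\n"
    "gibberish at the end\n"
)

# (?s) lets . cross newlines; the look-behind keeps START out of the match;
# (?:\s*[\d.,])+ extends the match over the numbers that follow Closing.
pattern = re.compile(r"(?s)(?<=START)(.*?Closing(?:\s*[\d.,])+)")
blocks = [m.group(1) for m in pattern.finditer(text)]

for block in blocks:
    print(repr(block))
```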
(START)(?s)(.*?)(Closing)(\s+((,?\d{1,3})+.\d+))+ should match everything you want.
You can try this regex,
START(.*)Closing(.*)(((.?\d{1,3})+.\d+)+.\d+.\d+.\d)\d
I have the following text:
#Header
my header text
##SubHeader
my sub header text
###Sub3Header
my sub 3 text
#Header2
my header2 text
I need to select the text from "#Header" to "#Header2".
I tried to write a regexp (http://regexr.com/3ffva), but it does not match what I need.
^#[^#\n]+([\W\w]*?)^#[^#\n]+
Basic idea: find first level-1 heading, find any text until... second level-1 heading.
^#[^#\n]+ first level-1 heading
^ start of line (because of multi-line flag)
[^#\n]+ Any character that isn't # or a newline character. Repeat 1 or more times.
([\W\w]*?) any text until next matching part
^#[^#\n]+ second level-1 heading (see above)
Flags: multiline.
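A small Python sketch of this pattern, assuming the text from the question with a newline after every line (variable names are illustrative only):

```python
import re

doc = (
    "#Header\n"
    "my header text\n"
    "##SubHeader\n"
    "my sub header text\n"
    "###Sub3Header\n"
    "my sub 3 text\n"
    "#Header2\n"
    "my header2 text\n"
)

# ^#[^#\n]+ only matches level-1 headings, because the character after the
# first # must not be another #. re.MULTILINE makes ^ match at each line start.
pattern = re.compile(r"^#[^#\n]+([\W\w]*?)^#[^#\n]+", re.MULTILINE)
match = pattern.search(doc)

between = match.group(1)   # everything between the two level-1 headings
whole = match.group(0)     # from "#Header" through "#Header2"

print(repr(between))
```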
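A variant that also captures the section body and uses a lookahead to stop before the next heading (note that [^], which matches any character including newlines, is JavaScript syntax):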
1- without multi-line flag
(^|\n)#([^#]+?)\n([^]+?)(?=\n#[^#]|$)
Description:
Group 1 captures the start of the string or the newline that precedes a single # (one not followed by another #), which is where a new heading starts.
Group 2 captures the heading title.
Group 3 captures everything up to the next heading or the end of the string.
The trailing (?=...) is a lookahead, not a capture group: it checks for the next heading or the end of the text without consuming it.
2- with multi-line flag
^#([^#]+?)\n([^]+?)(?=^#[^#])
Description:
First, append #-- to the end of the text so that this regex can also match the last heading.
The match starts at the beginning of a line (^) with a single #, and the heading text contains no #. Group 1 captures the heading text before the \n.
Group 2 captures the text up to the start of the next heading, which is a line starting with exactly one #.
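The multi-line variant can be tried in Python with one substitution: Python's re module does not accept [^], so this sketch swaps in [\s\S], which likewise matches any character including newlines. The sample text and names are illustrative, and the #-- sentinel is appended as described above.

```python
import re

doc = (
    "#Header\n"
    "my header text\n"
    "##SubHeader\n"
    "my sub header text\n"
    "#Header2\n"
    "my header2 text\n"
)

# [\s\S] replaces the JavaScript-only [^]; the look-ahead stops each section
# right before the next line that starts with exactly one #.
section = re.compile(r"^#([^#]+?)\n([\s\S]+?)(?=^#[^#])", re.MULTILINE)

# Append the sentinel so the last section also has a "next heading" to stop at.
sections = [(m.group(1), m.group(2)) for m in section.finditer(doc + "#--")]

for title, body in sections:
    print(title, repr(body))
```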
Depending on your regex flavor you can use:
(^#{1}.+)(.*\n)*
As shown here: http://regexr.com/3fg08
Alternately, you can use Vim's very magic mode:
\v(^#{1}.+)(.*\n)*(^#{1}\w+)
I am using the regex (?<=^\d+\s*).*?\t to extract just the Resources\blahblah part from the following text:
10 _Resources\index.test FAIL
11 _Resources\index.test FAIL
12 Resources\index.test FAIL
13set\Relicensing Statement.test FAIL
but it captures the following text:
0 _Resources\index.test
1 _Resources\index.test
2 Resources\index.test
3set\Relicensing Statement.test
I just want the part like Resources\index.test, without the leading number or spaces. Why is it failing? If I run just ^\d+\s* it matches any number of digits and spaces, but it does not work as a lookbehind prefix.
Since you commented you were using Notepad++, how about matching ^\d+\s*([^\t]*).*$ and replacing by \1 ?
From NSRegularExpression (I saw it was tagged):
Look-behind assertion. True if the parenthesized pattern matches text
preceding the current input position, with the last character of the
match being the input character just before the current position. Does
not alter the input position. The length of possible strings matched
by the look-behind pattern must not be unbounded (no * or +
operators.)
The same restriction holds in most languages.
Can't you extract $1 from (?:^\d+\s*)(.*?\t)?
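The capture-group approach suggested in these answers can be sketched in Python, whose re module also rejects a variable-width look-behind. The sample assumes a tab separates the file name from FAIL, as the \t in the pattern implies. The variable names are made up for the example.

```python
import re

report = (
    "10 _Resources\\index.test\tFAIL\n"
    "11 _Resources\\index.test\tFAIL\n"
    "12 Resources\\index.test\tFAIL\n"
)

# A look-behind containing + or * has no fixed width, so compiling it fails.
try:
    re.compile(r"(?<=^\d+\s*).*?\t")
    lookbehind_accepted = True
except re.error:
    lookbehind_accepted = False

# Instead, consume the numeric prefix normally and capture only the name.
names = re.findall(r"^\d+\s*([^\t]*)\t", report, flags=re.MULTILINE)

print(lookbehind_accepted)
print(names)
```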
I have a tab-separated file that looks like this:
Something 1 Text...
Something 2 Text...
Something 2001 Text...
Something 1 Text
I want to match all lines that do not have 1 in the second to last column. So I tried this:
\t[^1][^\t]*\t[^\t]*$
But for some reason this does not work. Any hints?
Thanks!
You can use this regex:
/^\S+\s+(?!.*1).*$/gm
Or else if you want 1 to be a complete word then use:
/^\S+\s+(?!.*\b1\b).*$/gm
EDIT:
To check for presence of 1 in last 2 columns only:
/\t(?!.*1)\S+\t+\S+$/gm
Your regex \t[^1][^\t]*\t[^\t]*$ does not work because it matches a tab, then any character other than 1, 1 time, then 0 or more characters other than tabs, a tab, and 0 or more characters other than a tab before the end of line (if you are using m mode).
I suggest reading everything in the first column, then a tab, and then set a check so that we do not have "1":
^[^\t]*\t(?!.*1).*$
Pay attention to the multiline m flag.
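The effect of the negative lookahead, and of the word-boundary variant from the earlier answer, can be compared with a short Python sketch. The rows are invented stand-ins for the tab-separated file, and the variable names are illustrative.

```python
import re

rows = "A\t1\tfoo\nB\t2\tbar\nC\t2001\tbaz\n"

# Reject a line if any 1 appears after the first column.
no_one_at_all = re.compile(r"^[^\t]*\t(?!.*1).*$", re.MULTILINE)

# Reject a line only if 1 appears as a whole word after the first column.
no_standalone_one = re.compile(r"^\S+\s+(?!.*\b1\b).*$", re.MULTILINE)

strict_hits = [m.group(0) for m in no_one_at_all.finditer(rows)]
word_hits = [m.group(0) for m in no_standalone_one.finditer(rows)]

print(strict_hits)
print(word_hits)
```

The first pattern also drops the 2001 row because it contains the digit 1, while the \b1\b variant keeps it.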
EDIT:
If you need to only make sure there is no 1 in the last 2 columns, use this regex:
^.*(?!.*1)[^\t]+\t[^\t]+$
EXPLANATION:
^ - Start of line
.* - Consume any characters from the start
(?!.*1) - Set a check for 1 - it should not appear before the end of line from here!
[^\t]+ 1 or more characters other than a tab
\t - a tab
[^\t]+ - 1 or more characters other than a tab
$ - End of line.
You can use following regex :
/^[^\t]*\t((?!1).)*$/gm
(?!1) is a negative lookahead that asserts the next character is not 1, so ((?!1).)* matches any run of characters that are not 1.
If you want to match only lines without the character 1 from the second column until the end of the line, you can use this pattern:
^[^\t]*\t[^1]*$