How to select parts of text from markdown with regexp? - regex

I have the following text:
#Header
my header text
##SubHeader
my sub header text
###Sub3Header
my sub 3 text
#Header2
my header2 text
I need to select the text from "#Header" to "#Header2".
I tried to write a regexp (http://regexr.com/3ffva), but it does not match what I need.

^#[^#\n]+([\W\w]*?)^#[^#\n]+
Basic idea: find first level-1 heading, find any text until... second level-1 heading.
^#[^#\n]+ first level-1 heading
^ start of line (because of multi-line flag)
[^#\n]+ Any character that isn't # or a newline character. Repeat 1 or more times.
([\W\w]*?) any text until next matching part
^#[^#\n]+ second level-1 heading (see above)
Flags: multiline.
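As a quick check, the pattern can be run in Python, whose re module treats ^ the same way once re.MULTILINE is set. This is only a sketch: the variable names are made up for illustration, and the sample text is the one from the question.

```python
import re

text = """#Header
my header text
##SubHeader
my sub header text
###Sub3Header
my sub 3 text
#Header2
my header2 text"""

# With re.MULTILINE, ^ matches at the start of every line.
pattern = re.compile(r"^#[^#\n]+([\W\w]*?)^#[^#\n]+", re.MULTILINE)

match = pattern.search(text)
# Group 1 holds everything between the two level-1 heading lines.
section_body = match.group(1).strip("\n")
print(section_body)
```

Note that the whole match also includes both heading lines; only group 1 is the text in between.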

With a lookahead that stops the capture just before the next heading:
1- without multi-line flag
(^|\n)#([^#]+?)\n([^]+?)(?=\n#[^#]|$)
Description:
Group 1 captures the start of the string or a newline; it must be followed by a single # (not ##), which means a new level-1 heading starts there.
Group 2 captures the heading title.
Group 3 captures everything up to the next heading or the end of the string ([^] is JavaScript syntax for any character, including newlines).
The trailing (?=...) is a lookahead, not a capture group: it checks for the next heading, or the end of the text, without consuming it.
2- with multi-line flag
^#([^#]+?)\n([^]+?)(?=^#[^#])
Description:
First, append #-- to the end of the text so that this regex can also match the last heading.
Matching starts at the beginning of a line (^) with a single #, and the heading text must not contain #. Group 1 captures the heading text before \n.
Group 2 captures the text up to the start of the next heading, which is identified by a single # at the start of a line.
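The lookahead idea can be sketched in Python as well. This is an adaptation, not the exact pattern above: Python does not accept [^], so [\s\S] is used instead, the heading class excludes \n, and a \Z alternative is added to the lookahead (an assumption on my part) so the last section matches without appending #--.

```python
import re

text = """#Header
my header text
##SubHeader
my sub header text
###Sub3Header
my sub 3 text
#Header2
my header2 text"""

# Group 1: heading title. Group 2: body, stopped by the lookahead at the
# next level-1 heading (a line starting with exactly one #) or at end of text.
section_re = re.compile(r"^#([^#\n]+)\n([\s\S]+?)(?=^#[^#]|\Z)", re.MULTILINE)

sections = [(m.group(1), m.group(2).strip("\n")) for m in section_re.finditer(text)]

for title, body in sections:
    print(title, "->", repr(body))
```

Because the lookahead does not consume the next heading, finditer can start the following match right on it.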

Depending on your regex flavor you can use:
(^#{1}.+)(.*\n)*
As shown here: http://regexr.com/3fg08
Alternately, you can use Vim's very magic mode:
\v(^#{1}.+)(.*\n)*(^#{1}\w+)

Related

Split comma separated list on separate line (notepad++ / regex)

I have a few files where each file has some text which has a description and list of tags.
I would like to manipulate the tags in the text with notepad++ and regular expressions in each file.
I could easily replace the commas with \r\n, but that would also affect the description part, which contains commas too, and I want to keep that intact. I only need to manipulate the tag part.
Also, the number of tags is not always the same (sometimes there are 4, sometimes more).
Original input text:
Description: blah, blah, blah, slsls,
tag:
- hello, bye, Thanks, etc, Notepad
Desired output text:
Description: blah, blah, blah, slsls,
tag:
- hello
- bye
- thanks
- etc
- notepad
Any idea how I could achieve this? Thanks very much.
Ctrl+H
Find what: (^tag:\s+|\G)[,-]\h*(\w+)
Replace with: $1\t- $2\n
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
( # start group 1
^ # beginning of line
tag: # literally
\s+ # 1 or more whitespace characters (including line breaks)
| # OR
\G # restart from last match position
) # end group
[,-] # comma or hyphen
\h* # 0 or more horizontal spaces
(\w+) # group 2, 1 or more word characters (you can use [^\s,]+ instead)
Replacement:
$1 # content of group 1
\t # a tabulation
- # a hyphen followed by a space
$2 # content of group 2
\n # linefeed
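Outside Notepad++, the same transformation can be sketched in Python. Python's re module has neither \G nor \h, so this uses a different technique: match the whole tag line once and split it in a callback. The function name split_tags is hypothetical, and the sketch assumes the tag list sits on the single line right after "tag:".

```python
import re

# Matches "tag:" on its own line followed by one "- a, b, c" line.
TAG_LINE = re.compile(r"^(tag:[ \t]*\n)-[ \t]*(.+)$", re.MULTILINE)

def split_tags(text: str) -> str:
    """Put each comma-separated tag on its own '- ' line; leave the rest alone."""
    def expand(match: re.Match) -> str:
        tags = [t.strip() for t in match.group(2).split(",") if t.strip()]
        return match.group(1) + "\n".join(f"- {t}" for t in tags)
    return TAG_LINE.sub(expand, text)

sample = "Description: blah, blah, blah, slsls,\ntag:\n- hello, bye, Thanks, etc, Notepad"
print(split_tags(sample))
```

The commas in the Description line are untouched because the pattern only matches the line that follows "tag:". Letter case is left as it is in the input.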

Regular expression for end of line

I am trying to parse a GEDCOM file using regular expressions and am almost there, but the expression grabs the next line of the text for lines where there is optional text at the end of line. Each record should be a single line.
This is an extract from the file:
0 HEAD
1 CHAR UTF-8
1 SOUR Ancestry.com Family Trees
2 VERS (2010.3)
2 NAME Ancestry.com Family Trees
2 CORP Ancestry.com
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
0 #P6# INDI
1 BIRT
And this is the regular expression I am using:
(\d+)\s+(#\S+#)?\s*(\S+)\s+(.*)
This works for all lines except those that do not contain any text at the end, such as the first one. For instance, the last capture group for the first record contains the '1 CHAR UTF-8'.
I have tried using the $ anchor to limit the .* to the end of the line, but this fails because the end of the second line is also a line end.
The \s pattern matches newline symbols. Replace it with a regular space, or [^\S\r\n], or \h if it is PCRE, or [\p{Zs}\t].
(\d+) +(#\S+#)? *(\S+) *(.*)
The last separator is " *" rather than " +" so that a line with no trailing text, such as 0 HEAD, still matches.
If you need to match whole lines, add the multiline option and anchor the pattern on both sides (^ at the start and $ at the end).
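A minimal Python sketch of the anchored, multiline version follows. It uses plain spaces instead of \s, and the final separator is written as " *", an adjustment assumed here so that lines without trailing text (for example 0 HEAD) still match. The sample is a shortened extract of the file in the question.

```python
import re

gedcom = """0 HEAD
1 CHAR UTF-8
1 SOUR Ancestry.com Family Trees
0 #P6# INDI
1 BIRT"""

# Plain spaces cannot cross a line break, so each match stays on one line.
line_re = re.compile(r"^(\d+) +(#\S+#)? *(\S+) *(.*)$", re.MULTILINE)

records = [m.groups() for m in line_re.finditer(gedcom)]
for level, xref, tag, value in records:
    print(level, xref, tag, repr(value))
```

Here the optional cross-reference group is None when absent, and the value group is an empty string for lines such as 0 HEAD.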

Multiple search & replace - Notepad ++ (regex)

I have a list of words, for example:
Good -> Bad
Sky -> Blue
Gray -> Black
etc...
What is the best way to do find & replace in Notepad++?
I tried this:
FIND: (Good)|(Sky)|(Gray)
Replace: (?1Bad)(?2Blue)(?3Black)
but it doesn't work :(
Any ideas or suggestions?
There is however a workaround if you add this new line at the end of your text (it must be the last line, so don't press enter at the end):
#Good:Bad#Sky:Blue#Gray:Black#
and if you use this pattern:
(Good|Sky|Gray)(?=(?:.*\R)++#(?>[^#]+#)*?\1:([^#]+))|\R.++(?!\R)
with this replacement:
$2
pattern details:
(Good|Sky|Gray) # this part captures the word to replace in group 1
(?= # then we reach the last line in a lookahead
(?:.*\R)++ # match all the lines until the last line
#(?>[^#]+#)*? # advance until the right key is found
\1 # the key (backreference to capture group 1)
:([^#]+) # capture the replacement in group 2
) # close the lookahead
| # OR
\R.++(?!\R) # match the last line (to remove it)
Note: to make the pattern more efficient, you can put it in a non-capturing group and add a lookahead at the beginning with all the possible first characters, to quickly discard useless positions in the string:
(?=[GS\r\n])(?:\b(Good|Sky|Gray)\b(?=(?:.*\R)++#(?>[^#]+#)*?\1:([^#]+))|\R.++(?!\R))
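If the text can be processed outside Notepad++, a simpler route is a lookup table plus a replacement callback. The sketch below is in Python and is a different technique from the lookahead workaround above; the names replacements and replace_words are hypothetical, and the word list is the one from the question.

```python
import re

# Word -> replacement pairs taken from the question.
replacements = {"Good": "Bad", "Sky": "Blue", "Gray": "Black"}

# One alternation of all keys, escaped and limited to whole words.
word_re = re.compile(r"\b(?:" + "|".join(map(re.escape, replacements)) + r")\b")

def replace_words(text: str) -> str:
    """Replace every listed word in a single pass over the text."""
    return word_re.sub(lambda m: replacements[m.group(0)], text)

print(replace_words("Good Sky, Gray day"))
```

Because everything happens in one pass, a replacement that happens to equal another key is not replaced a second time.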

Replacing specifics spaces for tab

I have a big text file with addresses, and I want to split the data into 3 variables. Example:
NM_LOGRADO
Street BLA BLA BLA 340
Av BLE BLI 318
Road BLI 48 Block 4
I want to transform it into:
NM_LOGRADO
Street(TAB)BLA BLA BLA(TAB)340
Av(TAB)BLE BLI(TAB)318
Road(TAB)BLI(TAB)48 Block 4
Basically, replace the first space, and the space before the first number, with a tab.
I'm using Notepad++. For the second replacement I tried replacing ' (?=[0-9])(?<=)' with '(TAB)', but it replaced every space before a number (in the third line I got Road(TAB)BLI(TAB)48 Block(TAB)4). For the first replacement I have no idea :(
Go to Search > Replace menu (shortcut CTRL+H) and do the following:
Find what:
(?:^.+?\K | (?=[0-9]+.+))
Replace:
\t
Select radio button "Regular Expression"
Then press Replace All
You can test it with your example at regex101.
Update1:
Based on your updated sample, try this:
Find:
^([^ ]+) ([^0-9]+) (.+)
Replace:
$1\t$2\t$3
Test it at regex101.
Update2:
Based on your updated sample, try this:
Find:
(?:^[^ ]+\K |(?<!Block|Ap) (?=[0-9]))
Replace:
\t
Test it at regex101.
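The Update1 pattern can also be tried outside Notepad++. The Python sketch below applies it one line at a time, an assumption made here because [^ ] and [^0-9] would otherwise be free to match across line breaks; the function name to_tabs is hypothetical.

```python
import re

# Group 1: first word. Group 2: everything up to the space before the
# first digit. Group 3: the rest of the line.
FIELDS = re.compile(r"^([^ ]+) ([^0-9]+) (.+)")

def to_tabs(line: str) -> str:
    """Replace the two field-separating spaces with tabs; other lines pass through."""
    return FIELDS.sub(r"\1\t\2\t\3", line)

sample = "NM_LOGRADO\nStreet BLA BLA BLA 340\nAv BLE BLI 318\nRoad BLI 48 Block 4"
print("\n".join(to_tabs(line) for line in sample.splitlines()))
```

The header line has no spaces, so it does not match and is returned unchanged.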
I'm assuming that (TAB) refers to a tab character rather than a literal string.
Find what: ^(\w*) ((([A-Z]{3})( )?)+) (\d.*)$
Replace with: \1\t\2\t\6
(If my assumption was incorrect, replace \t with \(TAB\))
The key is the optional space: ( )?. It lets the last three-letter block end without a space, so the spaces before and after group 2 stay outside the capture groups and are replaced by the tab characters.
Explanation of regular expressions:
^ Beginning of line
(\w*) Any number of word characters (letters, digits, underscore), i.e. "Street", "Av", "Road"
((([A-Z]{3})( )?)+) 3 uppercase letters, followed by an optional space, once or more, i.e. "BLA BLA BLA", "BLE BLI", "BLI"
(\d.*) A digit, followed by any number of any characters, i.e. "340", "318", "48 Block 4"
$ End of line
\1 First capture group, "(\w*)"
\t Tab character
\2 Second capture group, "((([A-Z]{3})( )?)+)"
\t Tab character
\6 Sixth capture group, "(\d.*)"
As you're using Notepad++, the easiest way is not to bother with regex but rather to use a macro. Simply record one and play it until the end of the file. You'll want to:
put your cursor at the first character of the file
Macros > Start Recording
find a space and convert it to tab (this will replace the first space of the row)
press END to go to the end of the line
use "find previous" command to find the last space of the line
replace that space with tab
go to the next line
Macros > Stop Recording
Run your macro till the end of the file

select underlined text from rtf using regex

I want to select the next piece of text which is underlined. The RTF of a RichTextBox contains the following code for underlined text:
\ul\i0 hello friend\ulnone\i
In the box itself, the text simply appears underlined. What I want is that, on a button click, the RichTextBox selects the next piece of underlined text. An example piece of text is:
hello [friend your] house [looks] amazing.
Imagine the words within square brackets are underlined. When I first click button1, "friend your" should be selected, and on the next click "looks" should be selected, so each click moves forward to the next underlined run. I know this can be done using regex, but I can't work out the logic.
Any help will be appreciated. Thanks a lot :D
The regex would be
Dim pattern As String = "\\ul\\i0\s*((?:(?!\\ulnone\\i).)+)\\ulnone\\i"
Explanation
\\ul\\i0 # the sequence "\ul\i0"
\s* # any number of white space
( # begin group 1:
(?: # non-capturing group:
(?! # negative look-ahead ("not followed by..."):
\\ulnone\\i # the sequence "\ulnone\i"
) # end negative look-ahead
. # match next character (it is underlined)
)+ # end non-capturing group, repeat
) # end group 1 (it will contain all underlined characters)
\\ulnone\\i # the sequence "\ulnone\i"
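The "next match on each click" behaviour comes down to searching again from the end of the previous match. Here is a minimal sketch of that idea in Python rather than VB.NET, using the same pattern; the function name next_underlined and the sample RTF string are assumptions for illustration, and mapping the match back to a selection in the control is not covered.

```python
import re

UNDERLINED = re.compile(r"\\ul\\i0\s*((?:(?!\\ulnone\\i).)+)\\ulnone\\i")

def next_underlined(rtf: str, start: int = 0):
    """Return (underlined_text, end_offset) for the next run at or after start, or None."""
    match = UNDERLINED.search(rtf, start)
    if match is None:
        return None
    return match.group(1), match.end()

rtf = r"hello \ul\i0 friend your\ulnone\i house \ul\i0 looks\ulnone\i amazing."

first = next_underlined(rtf)             # first "click"
second = next_underlined(rtf, first[1])  # second "click" resumes after the first run
print(first[0], "|", second[0])
```

Keeping the returned end offset between clicks is what makes each search move forward instead of finding the same run again.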