Replace character between two strings in PCRE (Perl) syntax - regex

How can I replace a special character between two special strings?
I have something like this:
"start 1
2-
G
23
end"
I want to have the following:
"start 1 2- G 23 end"
Only the \n between "start" and "end" should be replaced with a space:
Test1;Hello;"Text with more words";123
Test2;want;"start
1-
76 end";123
Test3;Test;"It's a test";123
Test4;Hellp;"start
1234
good-
the end";1234
Test5;Test;"It's a test";123
Is it possible in notepad++?

You can use this pattern:
(?:\G(?!\A)|\bstart\b)(?:(?!\bend\b).)*\K\R
details:
(?:
\G(?!\A) # contiguous to a previous match
|
\bstart\b # this is the first branch that matches
)
(?:(?!\bend\b).)* # zero or more chars that are not a newline nor the start of the word "end"
\K # remove all on the left from the match result
\R # any newline sequence (\n or \r\n or \r)
Note: (?:(?!\bend\b).)* isn't very efficient, feel free to replace it by something better for your particular case.
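For readers outside Notepad++/PCRE: Python's built-in re module does not support \G or \K, so the pattern above cannot be used there as-is. A minimal sketch of a swapped-in technique (not the answer's method) is to match each whole start...end span and rewrite its newlines in a callback. The function name join_lines_between is hypothetical.

```python
import re

def join_lines_between(text, start="start", end="end"):
    """Replace newlines with spaces, but only inside start...end spans."""
    # Lazily match from the word `start` up to the next word `end`,
    # letting the dot cross line breaks (DOTALL).
    span = re.compile(
        r"\b%s\b.*?\b%s\b" % (re.escape(start), re.escape(end)),
        re.DOTALL,
    )
    # Within each matched span, turn every newline sequence into one space.
    return span.sub(lambda m: re.sub(r"\r\n|\r|\n", " ", m.group(0)), text)

sample = 'Test2;want;"start\n1-\n76 end";123\nTest3;Test;"It\'s a test";123'
result = join_lines_between(sample)
print(result)
```

The newline that separates the two records is left alone because it lies outside any start...end span.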

The magic words are lazy quantifier, lookahead and single-line mode.
A solution for PHP (uses PCRE) would be:
<?php
$string = __your_string_here__;
$regex = '~(?s)(?:start)(?<content>.*?)(?=end)(?-s)~';
# ~ delimiter
# (?s) starts single line mode - aka dot matches everything
# (?:start) matches start literally (non-capturing group)
# .*? matches everything lazily
# (?=end) positive lookahead
# (?-s) turn single line mode off
# ~ delimiter
preg_match_all($regex, $string, $matches);
$content = str_replace("\n", '', $matches["content"][1]);
echo $content; // 1234good-the
?>

Related

Find :: outside of markdown code formatting

I have a bunch of markdown files, where I want to search for Ruby's double colon :: outside of some code formatting (e.g. where I forgot to apply proper markdown). For example
`foo::bar`
hello `foo::bar` test
` example::with::whitespace `
```
Proper::Formatted
```
```
Module::WithIndendation
```
```
Some::Nested::Modules
```
```ruby
CodeBlock::WithSyntax
```
# Some::Class
## Another::Class Heading
some text
The regex only should match Some::Class and Another::Class, because they miss the surrounding backticks, and are also not within a multiline code fence block.
I have this regex, but it also matches the multi line block
[\s]+[^`]+(::)[^`]+[\s]?
Any idea, how to exclude this?
EDIT:
It would be great, if the regex would work in Ruby, JS and on the command line for grep.
For the original input, you may use this regex in Ruby to match a :: string that is
not preceded by a ` and
not preceded by a ` followed by whitespace:
Regex:
(?<!`\s)(?<!`)\b\w+::\w+
RegEx Breakup:
(?<!`\s): Negative lookbehind to assert that a backtick followed by whitespace is not at the preceding position
(?<!`): Negative lookbehind to assert that a backtick is not at the preceding position
\b: Match word boundary
\w+: Match 1+ word characters
::: Match a ::
\w+: Match 1+ word characters
You can use this regex in Javascript:
(?<!`\w*\s*|::)\b\w+(?:::\w+)+
For gnu-grep, consider this command:
grep -ZzoP '`\w*\s*\b\w+::\w+(*SKIP)(*F)|\b\w+::\w+' file |
xargs -0 printf '%s\n'
Some::Class
Another::Class
One can use the regular expression
rgx = /`[^`]*`|([^`\r\n]*::[^`\r\n]*)/
with the form of String#gsub that takes one argument and no block, and therefore returns an enumerator (str holding the example string given in the question):
str.gsub(rgx).select { $1 }
#=> ["# Some::Class", "## Another::Class Heading"]
The idea is that the first part of the regex's alternation, `[^`]*`, matches, but does not capture, all strings delimited by backticks (including ``), whereas the second part, ([^`\r\n]*::[^`\r\n]*), matches and captures all strings on a single line that contain '::' but no backticks. We therefore concern ourselves with captures only, by invoking select { $1 } on the enumerator returned by gsub.
The regular expression can be made self-documenting by writing it in free-spacing mode.
rgx = /
` # match a backtick
[^`]* # match zero or more characters other than backticks
` # match a backtick
| # or
( # begin capture group 1
[^`\r\n]* # match zero or more characters other than backticks and
# line terminators
:: # match two colons
[^`\r\n]* # ditto line before previous
) # end capture group 1
/x # invoke free-spacing regex definition mode
[^`\r\n] contains \r (carriage return) in the event that the file was created with Windows. If desired, [^`]* can be replaced with .*? (match zero or more characters, as few as possible).
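The same "match the unwanted spans, capture only the wanted ones" idea carries over to other engines. Below is a rough Python sketch using the answer's regex unchanged. The sample text is a shortened version of the question's input, and the variable names are only illustrative.

```python
import re

# First alternative: backtick-delimited spans are matched but not captured.
# Second alternative: runs on a single line that contain '::' and no
# backticks are captured in group 1.
rgx = re.compile(r"`[^`]*`|([^`\r\n]*::[^`\r\n]*)")

text = (
    "`foo::bar`\n"
    "hello `foo::bar` test\n"
    "```\n"
    "Proper::Formatted\n"
    "```\n"
    "# Some::Class\n"
    "## Another::Class Heading\n"
    "some text"
)

# Keep only the matches where group 1 took part, i.e. '::' outside backticks.
hits = [m.group(1) for m in rgx.finditer(text) if m.group(1)]
print(hits)
```

A code fence is consumed as three backtick-delimited spans in a row: an empty pair, the fenced body enclosed by one backtick on each side, and another empty pair. That is why Proper::Formatted never reaches the capturing branch.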

Regex: Matches a multi-line pattern until the same one occurs again

I need to match 3 parts in the following bit:
# [1.3.3] (2019-04-16)
### Blah
* Loreum ipsum
# [1.3.0] (2019-04-01)
### Foo
* Loreum ipsum
# [1.2.0] (2019-03-05)
### Foo
* Loreum ipsum
Basically the first one would be
# [1.3.3] (2019-04-16)
### Blah
* Loreum ipsum
and so on.
I tried the following:
(# \[.*\] \([0-9\-]{10}\)(\n|.)*)
But that basically goes on to match the whole document. I need to tell it to stop matching when a new line starts with (# \[) (which would be ^(?!(# \[)).*$)
You could use the first part of your pattern to match the first line and then use a negative lookahead (?!# ) to match the following lines if they don't start with # followed by a space:
^# \[[^]]+\] \([\d-]{10}\)\n(?:(?!# ).*(?:\n|$))*
About the pattern
^# Start of the line (with the multiline flag) followed by # and a space
\[[^]]+\] Match from opening till closing square bracket using a negated character class
\([\d-]{10}\)\n Match opening parenthesis then match 10 times what is listed in the character class followed by a closing parenthesis and a newline
(?: Non capturing group
(?!# ) Negative lookahead, assert what is on the right is not # and a space
.*(?:\n|$) Match any char except newline and match either a newline or assert end of the string
)* Close non capturing group and repeat 0+ times
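As a quick check of the pattern just described, here is a small Python sketch run against the changelog from the question. It assumes the multiline flag (re.MULTILINE) so that ^ and $ work per line. The variable names are made up for the example.

```python
import re

changelog = (
    "# [1.3.3] (2019-04-16)\n### Blah\n* Loreum ipsum\n"
    "# [1.3.0] (2019-04-01)\n### Foo\n* Loreum ipsum\n"
    "# [1.2.0] (2019-03-05)\n### Foo\n* Loreum ipsum"
)

# Header line first, then every following line that does not start with "# ".
pattern = re.compile(
    r"^# \[[^]]+\] \([\d-]{10}\)\n(?:(?!# ).*(?:\n|$))*",
    re.MULTILINE,
)

# No capturing groups, so findall returns the full text of each match.
sections = pattern.findall(changelog)
for section in sections:
    print(repr(section))
```

Lines such as "### Blah" pass the lookahead because "#" is followed by another "#" there, not by a space.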
You can use the following regex:
(# \[.*\] \([0-9\-]{10}\)(\n|[^#]|###)*)
This will match any text until the next hash (except if that hash is part of a group of three hashes ###).
If you need to modify it for a varying number of hashes (strictly greater than 1), you could use
(# \[.*\] \([0-9\-]{10}\)(\n|[^#]|##+)*)
You may use
^\#\s+\[.+?(?=^\#\s+\[|\Z)
See a demo on regex101.com and mind the modifiers (singleline and multiline, s and m).
Broken down this is
^\#\s+\[ # start of the line, followed by "# ["
.+? # everything else afterwards until ...
(?=
^\#\s+\[ # ... the pattern from above right at the start of a new line
| # or
\Z # the very end of the string
)
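The breakdown above maps fairly directly onto Python's verbose mode. This is a sketch only, assuming the s and m modifiers correspond to re.DOTALL and re.MULTILINE. The sample data is the changelog from the question.

```python
import re

changelog = (
    "# [1.3.3] (2019-04-16)\n### Blah\n* Loreum ipsum\n"
    "# [1.3.0] (2019-04-01)\n### Foo\n* Loreum ipsum\n"
    "# [1.2.0] (2019-03-05)\n### Foo\n* Loreum ipsum"
)

section_re = re.compile(
    r"""
    ^\#\s+\[        # start of a line, followed by "# ["
    .+?             # everything afterwards, as little as possible, until ...
    (?=
        ^\#\s+\[    # ... the same header pattern at the start of a later line
      |             # or
        \Z          # the very end of the string
    )
    """,
    re.VERBOSE | re.DOTALL | re.MULTILINE,
)

blocks = [m.group(0) for m in section_re.finditer(changelog)]
print(len(blocks))
```

In verbose mode the hash must stay escaped (\#), otherwise the rest of the line would be read as a comment.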
The fastest way to go would be:
^#.*(\r?\n(?!# ).*)+
To make it more precise:
^# \[\d.*(?:\r?\n(?!# ).*)+
See live demo here

Regex finding all commas between two words

I am trying to clean up a large .csv file that contains many comma-separated words, parts of which I need to consolidate. So I have a subsection where I want to change all the commas to slashes. Let's say my file contains this text:
Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool
I want to select all commas between the unique words bar and blah. The idea is to then replace the commas with slashes (using find and replace), such that I get this result:
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
As per @EganWolf's input:
How do I include words in the search but exclude them from the selection (for the unique words) and how do I then match only the commas between the words?
Thus far I have only managed to select all the text between the unique words including them:
bar,.*,blah, bar:*, *,blah, (bar:.+?,blah)*,*\2
I experimented with negative lookahead but can't get any search results from my statements.
Using Notepad++, you can do:
Ctrl+H
Find what: (?:\bbar,|\G(?!^))\K([^,]*),(?=.+\bblah\b)
Replace with: $1/
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
(?: # start non capture group
\bbar, # word boundary then bar then a comma
| # OR
\G # restart from last match position
(?!^) # negative lookahead, make sure we are not at the beginning of a line
) # end group
\K # forget all we've seen until this position
([^,]*) # group 1, 0 or more non comma
, # a comma
(?= # positive lookahead
.+ # 1 or more of any character but newline
\bblah\b # word boundary, blah, word boundary
) # end lookahead
Result for given example:
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
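Outside Notepad++, engines without \G and \K (Python's re module is one) need a different route to the same result. One possible sketch, a swapped-in technique and not the answer's pattern, captures the stretch between "bar," and ",blah" and rewrites its commas in a callback:

```python
import re

line = "Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool"

# Group 1 keeps "bar," as-is; group 2 is the stretch whose commas change.
# The lookahead stops the lazy match right before ",blah".
between = re.compile(r"(\bbar,)(.*?)(?=,blah\b)")

result = between.sub(
    lambda m: m.group(1) + m.group(2).replace(",", "/"),
    line,
)
print(result)
```

Only the first comma-separated stretch after each "bar," is touched. Lines without both keywords are returned unchanged.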
The following regex will capture the minimally required text to access the commas you want:
(?<=bar,)(.*?(,))*(?=.*?,blah)
See Regex Demo.
If you want to replace the commas, you will need to replace everything in capture group 2. Capture group 0 has your entire match.
An alternative approach would be to split your string by comma to create an array of words. Then join words between bar and blah using / and append the other words joined by ,.
Here is a PowerShell example of split and join:
$a = "Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool"
$split = $a -split ","
$slashBegin = $split.indexof("bar")+1
$commaEnd = $split.indexof("blah")-1
$str1 = $split[0..($slashbegin-1)] -join ","
$str2 = $split[($slashbegin)..$commaend] -join "/"
$str3 = $split[($commaend+1)..($split.count-1)] -join ","
@($str1,$str2,$str3) -join ","
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
This could easily be made into a function with your entire line and keywords as inputs.
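A sketch of what such a function might look like, written in Python instead of PowerShell. The name slash_between and its parameters are hypothetical, and it assumes both keywords occur in the line, in that order.

```python
def slash_between(line, first, last, sep=",", joiner="/"):
    """Join the words strictly between `first` and `last` with `joiner`."""
    words = line.split(sep)
    begin = words.index(first) + 1     # index of the first word after `first`
    end = words.index(last, begin)     # index of `last`, searched after `first`
    if begin == end:                   # nothing between the two keywords
        return line
    middle = joiner.join(words[begin:end])
    return sep.join(words[:begin] + [middle] + words[end:])

line = "Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool"
print(slash_between(line, "bar", "blah"))
```

list.index raises ValueError when a keyword is missing, so a caller that cannot guarantee both words would need to catch that.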

regular expressions: find every word that appears exactly one time in my document

Trying to learn regular expressions. As practice, I'm trying to find every word that appears exactly one time in my document -- in linguistics this is a hapax legomenon (http://en.wikipedia.org/wiki/Hapax_legomenon)
So I thought the following expression would give me the desired result:
\w{1}
But this doesn't work. The \w returns a character, not a whole word. Also, it does not appear to be giving me characters that appear only once (it actually returns 25873 matches -- which I assume are all alphanumeric characters). Can someone give me an example of how to find a "hapax legomenon" with a regular expression?
If you're trying to do this as a learning exercise, you picked a very hard problem :)
First of all, here is the solution:
\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)
Now, here is the explanation:
We want to match a word. This is \b\w+\b - a run of one or more (+) word characters (\w), with a 'word break' (\b) on either side. A word break happens between a word character and a non-word character, so this will match between (e.g.) a word character and a space, or at the beginning and the end of the string. We also capture the word into a backreference by using parentheses ((...)). This means we can refer to the match itself later on.
Next, we want to exclude the possibility that this word has already appeared in the string. This is done by using a negative lookbehind - (?<! ... ). A negative lookbehind doesn't match if its contents match the string up to this point. So we want to not match if the word we have matched has already appeared. We do this by using a backreference (\1) to the already captured word. The final match here is \b\1\b.*\b\1\b - two copies of the current match, separated by any amount of string (.*).
Finally, we don't want to match if there is another copy of this word anywhere in the rest of the string. We do this by using negative lookahead - (?! ... ). Negative lookaheads don't match if their contents match at this point in the string. We want to match the current word after any amount of string, so we use (.*\b\1\b).
Here is an example (using C#):
var s = "goat goat leopard bird leopard horse";
foreach (Match m in Regex.Matches(s, @"\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)"))
Console.WriteLine(m.Value);
Output:
bird
horse
It can be done in a single regex if your regex engine supports infinite repetition inside lookbehind assertions (e. g. .NET):
Regex regexObj = new Regex(
@"( # Match and capture into backreference no. 1:
\b # (from the start of the word)
\p{L}+ # a succession of letters
\b # (to the end of a word).
) # End of capturing group.
(?<= # Now assert that the preceding text contains:
^ # (from the start of the string)
(?: # (Start of non-capturing group)
(?! # Assert that we can't match...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
\1 # we reach the word we've just matched.
) # End of lookbehind assertion.
# We now know that we have just matched the first instance of that word.
(?= # Now look ahead to assert that we can match the following:
(?: # (Start of non-capturing group)
(?! # Assert that we can't match again...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
$ # the end of the string.
) # End of lookahead assertion.",
RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
// matched text: matchResults.Value
// match start: matchResults.Index
// match length: matchResults.Length
matchResults = matchResults.NextMatch();
}
If you are trying to match an English word, the best form is:
[a-zA-Z]+
The problem with \w is that it also includes _ and numeric digits 0-9.
If you need to include other characters, you can append them after the Z but before the ]. Or, you might need to normalize the input text first.
Now, if you want a count of all words, or just to see words that don't appear more than once, you can't do that with a single regex. You'll need to invest some time in programming more complex logic. It may very well need to be backed by a database or some sort of memory structure to keep track of the count. After you parse and count the whole text, you can search for words that have a count of 1.
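The parse, count, then filter approach described above needs only an in-memory counter for documents of ordinary size. A minimal Python sketch, assuming words are plain runs of ASCII letters and that case matters; the function name is made up for the example:

```python
import re
from collections import Counter

def hapax_legomena(text):
    """Return the words that occur exactly once, in order of appearance."""
    words = re.findall(r"[a-zA-Z]+", text)   # every run of ASCII letters
    counts = Counter(words)                  # word -> number of occurrences
    return [w for w in words if counts[w] == 1]

found = hapax_legomena("goat goat leopard bird leopard horse")
print(found)  # ['bird', 'horse']
```

Lower-casing the text before counting would treat "Goat" and "goat" as the same word, if that is what the document needs.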
(\w+){1} will match each word.
After that you could always perform the count on the matches.
Higher level solution:
Create an array of your matches:
preg_match_all("/([a-zA-Z]+)/", $text, $matches, PREG_PATTERN_ORDER);
Let PHP count your array elements:
$tmp_array = array_count_values($matches[1]);
Iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
echo $word . ' ' . $count;
}
Low level but does what you want:
Split your text into an array with preg_split (split() was removed in PHP 7):
$array = preg_split('/\s+/', $text);
Iterate over that array:
foreach ($array as $word) { ... }
Check whether each entry is a word, and skip it if it contains anything other than letters:
if (preg_match('/[^a-zA-Z]/', $word)) continue;
Add the word to a temporary array as key:
if (!isset($tmp_array[$word])) $tmp_array[$word] = 0;
$tmp_array[$word]++;
After the loop. Iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
echo $word . ' ' . $count;
}

How can I extract substrings from a string in Perl?

Consider the following strings:
1) Scheme ID: abc-456-hu5t10 (High priority) *****
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *****
and so on in the above format - the parts in bold change between strings.
==> Imagine I have many strings in the format shown above.
I want to pick 3 substrings (as shown in bold below) from each of the above strings.
1st substring containing the alphanumeric value (in eg above it's "abc-456-hu5t10")
2nd substring containing the word (in eg above it's "High priority")
3rd substring containing * (IF * is present at the end of the string ELSE leave it )
How do I pick these 3 substrings from each string shown above? I know it can be done using regular expressions in Perl... Can you help with this?
You could do something like this:
my $data = <<END;
1) Scheme ID: abc-456-hu5t10 (High priority) *
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *
END
foreach (split(/\n/,$data)) {
$_ =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/ || next;
my ($id,$word,$star) = ($1,$2,$3);
print "$id $word $star\n";
}
The key thing is the Regular expression:
Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?
Which breaks up as follows.
The fixed String "Scheme ID: ":
Scheme ID:
Followed by one or more of the characters a-z, 0-9 or -. We use the brackets to capture it as $1:
([a-z0-9-]+)
Followed by one or more whitespace characters:
\s+
Followed by an opening bracket (which we escape) followed by any number of characters which aren't a close bracket, and then a closing bracket (escaped). We use unescaped brackets to capture the words as $2:
\(([^)]+)\)
Followed by some spaces and maybe a *, captured as $3:
\s*(\*)?
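For comparison, the same regular expression can be tried outside Perl. The Python sketch below reuses the pattern unchanged; the names SCHEME and rows are only illustrative, and a missing star is stored as an empty string.

```python
import re

# Same pattern as above: id in group 1, description in group 2, star in group 3.
SCHEME = re.compile(r"Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?")

lines = [
    "1) Scheme ID: abc-456-hu5t10 (High priority) *",
    "2) Scheme ID: frt-78f-hj542w (Balanced)",
    "3) Scheme ID: 23f-f974-nm54w (super formula run) *",
]

rows = []
for line in lines:
    m = SCHEME.search(line)
    if m is None:
        continue                      # skip lines that do not fit the format
    # group(3) is None when the optional star did not match
    rows.append((m.group(1), m.group(2), m.group(3) or ""))

for row in rows:
    print(row)
```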
You could use a regular expression such as the following:
/([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/
So for example:
$s = "abc-456-hu5t10 (High priority) *";
$s =~ /([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/;
print "$1\n$2\n$3\n";
prints
abc-456-hu5t10
High priority
*
(\S*)\s*\((.*?)\)\s*(\*?)
(\S*) picks up anything which is NOT whitespace
\s* 0 or more whitespace characters
\( a literal open parenthesis
(.*?) anything, non-greedy so stops on first occurrence of...
\) a literal close parenthesis
\s* 0 or more whitespace characters
(\*?) 0 or 1 occurrences of literal *
Well, a one liner here:
perl -lne 'm|Scheme ID:\s+(.*?)\s+\((.*?)\)\s?(\*)?|g&&print "$1:$2:$3"' file.txt
Expanded to a simple script to explain things a bit better:
#!/usr/bin/perl -ln
#-w : warnings
#-l : print newline after every print
#-n : apply script body to stdin or files listed at commandline, don't print $_
use strict; #always do this.
my $regex = qr{ # precompile regex
Scheme\ ID: # to match beginning of line.
\s+ # 1 or more whitespace
(.*?) # Non greedy match of all characters up to
\s+ # 1 or more whitespace
\( # parenthesis literal
(.*?) # non-greedy match to the next
\) # closing literal parenthesis
\s* # 0 or more whitespace (trailing * is optional)
(\*)? # 0 or 1 literal *s
}x; #x switch allows whitespace in regex to allow documentation.
#values trapped in $1 $2 $3, so do whatever you need to:
#Perl lets you use any characters as delimiters, I like pipes because
#they reduce the amount of escaping when using file paths
m|$regex| && print "$1 : $2 : $3";
#alternatively if(m|$regex|) {doOne($1); doTwo($2) ... }
Though if it were anything other than formatting, I would implement a main loop to handle files and flesh out the body of the script rather than relying on the command-line switches for the looping.
Long time no Perl
while(<STDIN>) {
next unless /:\s*(\S+)\s+\(([^\)]+)\)\s*(\*?)/;
print "|$1|$2|$3|\n";
}
This just requires a small change to my last answer:
my ($guid, $scheme, $star) = $line =~ m{
The [ ] Scheme [ ] GUID: [ ]
([a-zA-Z0-9-]+) #capture the guid
[ ]
\( (.+) \) #capture the scheme
(?:
[ ]
([*]) #capture the star
)? #if it exists
}x;
String 1:
$input =~ /^\S+/;
$s1 = $&;
String 2:
$input =~ /\(.*\)/;
$s2 = $&;
String 3:
$input =~ /\*?$/;
$s3 = $&;