I want to capture a series of file names, each listed on its own line. I have figured out how to capture the file name on the first line, but not how to repeat the capture on the subsequent lines.
# Input
# data/raw/file1
# data/raw/file2
# Output
# data/interim/file1
# data/interim/file2
Current Attempt
The regular expression I currently have is
# Input\n(# (.*))
And my inner capture group properly captures data/raw/file1.
Desired Output
What I want is to grab all of the files in between # Input and # Output, so in this example, data/raw/file1 and data/raw/file2.
Go with \G magic:
(?:^#\s+Input|\G(?!\A))\R*(?!#\s+Output)#\s*(.*)|[\s\S]*
Regex breakdown
(?: # Start of non-capturing group (a)
^#\s+Input # Match a line beginning with `# Input`
| # Or
\G(?!\A) # Continue from previous successful match point
) # End of NCG (a)
\R* # Match any kind of newline characters
(?!#\s+Output) # Which are not followed by an `# Output` line
#\s*(.*) # Start matching a path line and capture path
| # If previous patterns didn't match....
[\s\S]* # Then consume the rest of the input in one go, so the engine stops retrying at every position
PHP code:
$re = '~(?:^#\s+Input|\G(?!\A))\R*(?!#\s+Output)#\s*(.*)|[\s\S]*~m';
$str = '# Input
# data/raw/file1
# data/raw/file2
# Output
# data/interim/file1
# data/interim/file2';
preg_match_all($re, $str, $matches, PREG_PATTERN_ORDER, 0);
// Print the entire match result
print_r(array_filter($matches[1]));
Output:
Array
(
[0] => data/raw/file1
[1] => data/raw/file2
)
Using the s modifier, preg_match, and preg_split you can get each result on its own.
preg_match('/# Input\n(# (?:.*?))# Output/s', '# Input
# data/raw/file1
# data/raw/file2
# Output
# data/interim/file1
# data/interim/file2', $match);
$matched = preg_split('/# /', $match[1], -1, PREG_SPLIT_NO_EMPTY);
print_r($matched);
Demo: https://3v4l.org/dAcRp
Regex demo: https://regex101.com/r/5tfJGM/1/
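For comparison, here is a minimal Python sketch of the same two-step idea (isolate the block between the markers, then pull out each path). Python's built-in re module has no \G anchor, so the first answer's pattern does not port directly. The function name input_files is made up for illustration.

```python
import re

text = """# Input
# data/raw/file1
# data/raw/file2
# Output
# data/interim/file1
# data/interim/file2"""


def input_files(text):
    """Return the paths listed between '# Input' and '# Output'."""
    # Step 1: isolate everything between the two marker lines.
    block = re.search(r"^# Input\n(.*?)^# Output", text, re.S | re.M)
    if block is None:
        return []
    # Step 2: capture the path on each '# ...' line inside that block.
    return re.findall(r"^# (.+)$", block.group(1), re.M)


print(input_files(text))  # ['data/raw/file1', 'data/raw/file2']
```

Unlike the preg_split version, this leaves no trailing newline on each path, because the second pattern stops at the end of each line.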
I am trying to parse all money from a string. For example, I want to extract:
['$250,000', '$3.90', '$250,000', '$500,000']
from:
'Up to $250,000………………………………… $3.90 Over $250,000 to $500,000'
The regex:
\$\ ?(\d+\,)*\d+(\.\d*)?
seems to match all money expressions, as in this link. However, when I try to scan with it in Ruby, it fails to give me the desired result.
s # => "Up to $250,000 $3.90 Over $250,000 to $500,000, add$3.70 Over $500,000 to $1,000,000, add..$3.40 Over $1,000,000 to $2,000,000, add...........$2.25\nOver $2,000,000 add ..$2.00"
r # => /\$\ ?(\d+\,)*\d+\.?\d*/
s.scan(r)
# => [["250,"], [nil], ["250,"], ["500,"], [nil], ["500,"], ["000,"], [nil], ["000,"], ["000,"], [nil], ["000,"], [nil]]
From String#scan docs, it looks like this is because of the group. How can I parse all the money in the string?
Let's look at your regular expression, which I'll write in free-spacing mode so I can document it:
r = /
\$ # match a dollar sign
\ ? # optionally match a space (has no effect)
( # begin capture group 1
\d+ # match one or more digits
, # match a comma (need not be escaped)
)* # end capture group 1 and execute it >= 0 times
\d+ # match one or more digits
\.? # optionally match a period
\d* # match zero or more digits
/x # free-spacing regex definition mode
In non-free-spacing mode this would be written as follows.
r = /\$ ?(\d+,)*\d+\.?\d*/
When a regex is defined in free-spacing mode all spaces are stripped out before the regex is evaluated, which is why I had to escape the space. That's not necessary when the regex is not defined in free-spacing mode.
Nowhere in your string does a space follow a dollar sign, so \ ? can be removed. Suppose now we have
r = /\$\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> ["$2.31", "$44.", "$33.607"]
That works, but it is questionable whether you want to match values that do not have exactly two digits after the decimal point.
Now write
r = /\$(\d+,)*\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> [[nil], [nil], [nil]]
To see why, examine the doc for String#scan, specifically the last sentence of the first paragraph: "If the pattern contains groups, each individual result is itself an array containing one entry per group."
We can avoid that problem by changing the capture group to a non-capture group:
r = /\$(?:\d+,)*\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> ["$2.31", "$44.", "$33.607"]
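Python's re.findall has the same quirk as String#scan, so a rough Python sketch (not part of the original Ruby answer) may help show the difference a non-capturing group makes. One difference I am aware of: where Ruby reports nil for a group that did not participate, findall reports an empty string.

```python
import re

s = "$2.31 cat $44. dog $33.607"

# With a capturing group, findall returns the group's text, not the whole match.
with_group = re.findall(r"\$(\d+,)*\d+\.?\d*", s)

# With a non-capturing group, findall returns the whole match.
without_group = re.findall(r"\$(?:\d+,)*\d+\.?\d*", s)

print(with_group)     # ['', '', '']
print(without_group)  # ['$2.31', '$44.', '$33.607']
```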
Now consider this:
"$2,241.31 cat $1,2345. dog $33.607".scan r
#=> ["$2,241.31", "$1,2345.", "$33.607"]
which is still not quite right. Try the following.
r = /
\$ # match a dollar sign
\d{1,3} # match one to three digits
(?:,\d{3}) # match ',' then 3 digits in a nc group
* # execute the above nc group >=0 times
(?:\.\d{2}) # match '.' then 2 digits in a nc group
? # optionally match the above nc group
(?![\d,.]) # no following digit, ',' or '.'
/x # free-spacing regex definition mode
"$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27".scan r
#=> ["$2,241.31", "$2", "$1,234", "$146.27"]
(?![\d,.]) is a negative lookahead.
In normal mode this regular expression is written as follows.
r = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?![\d,.])/
Without the negative lookahead at the end of the regex, the following erroneous result is obtained.
r = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?/
"$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27".scan r
#=> ["$2,241.31", "$2", "$1,234", "$3,615", "$33.60",
# "$146.27"]
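As a cross-check, the strict pattern can be tried in Python as well. This is only a sketch: the names MONEY, LOOSE and to_decimal are mine, and the Decimal conversion is an extra step that the answer above does not cover.

```python
import re
from decimal import Decimal

# Same pattern as above: 1-3 digits, optional ",ddd" groups, optional ".dd",
# and no digit, comma or period immediately after the match.
MONEY = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?![\d,.])")
# The same pattern without the negative lookahead, for comparison.
LOOSE = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

sample = "$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27"


def to_decimal(amount):
    """Turn a matched string such as '$2,241.31' into Decimal('2241.31')."""
    return Decimal(amount[1:].replace(",", ""))


strict = MONEY.findall(sample)
print(strict)                 # ['$2,241.31', '$2', '$1,234', '$146.27']
print(LOOSE.findall(sample))  # ['$2,241.31', '$2', '$1,234', '$3,615', '$33.60', '$146.27']
print([to_decimal(a) for a in strict])
```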
[3] pry(main)> str = <<EOF
[3] pry(main)* Up to $250,000………………………………… $3.90 Over $250,000 to $500,000, add………………$3.70 Over $500,000 to $1,000,000, add……………..$3.40 Over $1,000,000 to $2,000,000, add……...........$2.25
[3] pry(main)* Over $2,000,000 add …..………………………$2.00
[3] pry(main)* EOF
=> "Up to $250,000………………………………… $3.90 Over $250,000 to $500,000, add………………$3.70 Over $500,000 to $1,000,000, add……………..$3.40 Over $1,000,000 to $2,000,000, add……...........$2.25\nOver $2,000,000 add …..………………………$2.00\n"
[4] pry(main)> str.scan /\$\d+(?:[,.]\d+)*/
=> ["$250,000", "$3.90", "$250,000", "$500,000", "$3.70", "$500,000", "$1,000,000", "$3.40", "$1,000,000", "$2,000,000", "$2.25", "$2,000,000", "$2.00"]
[5] pry(main)>
I'm looking for a search/replace regular expression which will capture tokens and apply them as a prefix to every subsequent line within a document.
So this..
Tokens always start with ##..
Nothing is prefixed until a token is encountered..
##CAT
furball
scratch
##DOG
egg
##MOUSE
wheel
on the stair
Becomes..
Tokens always start with ##..
Nothing is prefixed until a token is captured!
##CAT
CAT furball
CAT scratch
##DOG
DOG egg
##MOUSE
MOUSE wheel
MOUSE on the stair
You can use this pattern:
search: ((?:\A|\n)##([^\r\n]+)(?>\r?\n\2[^\r\n]+)*+\r?\n(?!##))
replace: $1$2 <= with a space at the end
But you must apply the search and replace repeatedly until there are no more matches.
As far as I know, this is impossible. The closest I can get is replacing
^##(.*)\r?\n(.*)
with
##\1\n\1 \2
Output:
Tokens always start with ##..
Nothing is prefixed until a token is encountered..
##CAT
CAT furball
scratch
##DOG
DOG egg
##MOUSE
MOUSE wheel
on the stair
You have the pcre tag and the Notepad++ tag.
I don't think you can actually do this without a callback mechanism.
That being said, you can do it without a callback, but you need to divide
up the functionality.
This is a php example that might give you some ideas.
Note - PHP string concatenation uses '.' (not '+'), as written below.
The usage is multi-line mode //m modifier.
^ # Begin of line
(?-s) # Modifier, No Dot-All
(?:
( # (1 start)
\#\# # Token indicator
( \w+ ) # (2), Token
.* # The rest of line
) # (1 end)
| # or,
( .* ) # (3), Just a non-token line
)
$ # End of line
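# $token = "";
# $str = preg_replace_callback('/^(?-s)(?:(\#\#(\w+).*)|(.*))$/m',
# function ($matches) use (&$token) { // take $token by reference so it persists between calls
# if ($matches[1] != "") {
# $token = $matches[2]; // remember the most recent token
# return $matches[1]; // token lines are returned unchanged
# }
# if ($token != "") {
# return $token . " " . $matches[3]; // prefix a normal line with the current token
# }
# return $matches[3]; // no token seen yet, leave the line alone
# },
# $text);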
#
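The callback idea boils down to carrying one piece of state (the current token) from line to line. A minimal Python sketch of that, written as a plain loop instead of a regex callback, might look like this. The function name prefix_with_tokens is hypothetical, and leaving blank lines unprefixed is my own assumption, since the question does not say what should happen to them.

```python
import re

TOKEN = re.compile(r"##(\w+)")


def prefix_with_tokens(text):
    """Prefix each non-token line with the most recent ##TOKEN seen so far."""
    token = None
    out = []
    for line in text.splitlines():
        m = TOKEN.match(line)      # match() only succeeds at the start of the line
        if m:
            token = m.group(1)     # remember the token, keep the line unchanged
            out.append(line)
        elif token is not None and line:
            out.append(token + " " + line)
        else:
            out.append(line)       # nothing to prefix with yet, or a blank line
    return "\n".join(out)


sample = "\n".join([
    "Tokens always start with ##..",
    "##CAT",
    "furball",
    "scratch",
    "##DOG",
    "egg",
])
print(prefix_with_tokens(sample))
```

The first sample line contains ## but does not start with it, so it is not treated as a token line.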
Team
I have written a Perl program to validate the accuracy of formatting (punctuation and the like) of surnames, forenames, and years.
If a particular entry doesn't follow a specified pattern, that entry is highlighted to be fixed.
For example, my input file has lines of similar text:
<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My program works just fine: if any entry doesn't follow the pattern, the script generates an error. The above input text doesn't generate any error. But the one below is an example of an error, because Rose A. J. is missing a comma after Rose:
NOT FOUND: <bibliomixed id="bkrmbib120">Asher, S. R., & Rose A. J. (1997). Promoting children’s social-emotional adjustment with peers. In P. Salovey & D. Sluyter, (Eds). <emphasis>Emotional development and emotional intelligence: Educational implications.</emphasis> New York: Basic Books.</bibliomixed>
From my regex search pattern, is it possible to capture all the surnames and the year, so I can generate a text prefixed to each line as shown below?
<BIB>Abdo, Afif-Abdo, Otani, Machado, 2008</BIB><bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My regex search script is as follows:
while(<$INPUT_REF_XML_FH>){
$line_count += 1;
chomp;
if(/
# bibliomixed XML ID tag and attribute----<START>
<bibliomixed
\s+
id=".*?">
# bibliomixed XML ID tag and attribute----<END>
# --------2 OR MORE AUTHOR GROUP--------<START>
(?:
(?:
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
,\s)+
#---------------FINAL AUTHOR GROUP SEPARATOR----<START>
&\s
#---------------FINAL AUTHOR GROUP SEPARATOR----<END>
# --------2 OR MORE AUTHOR GROUP--------<END>
)?
# --------LAST AUTHOR GROUP--------<START>
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
(?: # pattern for editor notation----<START>
\s\(Ed(?:s)?\.\)\.
)? # pattern for editor notation----<END>
# --------LAST AUTHOR GROUP--------<END>
\s
\(
# pattern for a year----<START>
(?:[A-Za-z]+,\s)? # July, 1999
(?:[A-Za-z]+\s)? # July 1999
(?:[0-9]{4}\/)? # 1999\/2000
(?:\w+\s\d+,\s)?# August 18, 2003
(?:[0-9]{4}|in\spress|manuscript\sin\spreparation) # (1999) (in press) (manuscript in preparation)----REQUIRED
(?:[A-Za-z])? # 1999a
(?:,\s[A-Za-z]+\s[0-9]+)? # 1999, July 2
(?:,\s[A-Za-z]+\s[0-9]+\x{2013}[0-9]+)? # 2002, June 19–25
(?:,\s[A-Za-z]+)? # 1999, Spring
(?:,\s[A-Za-z]+\/[A-Za-z]+)? # 1999, Spring\/Winter
(?:,\s[A-Za-z]+-[A-Za-z]+)? # 2003, Mid-Winter
(?:,\s[A-Za-z]+\s[A-Za-z]+)? # 2007, Anniversary Issue
# pattern for a year----<END>
\)\.
/six){
print $FOUND_REPORT_FH "$line_count\tFOUND: $&\n";
$found_count += 1;
} else{
print $ERROR_REPORT_FH "$line_count\tNOT FOUND: $_\n";
$not_found_count += 1;
}
}
Thanks for your help,
Prem
Alter this bit
# pattern for surname----<END>
,?\s
This now means an optional comma followed by whitespace. Note that if the person's surname is "Bunga Bunga" it won't work.
All of your subpatterns are non-capturing groups, starting with (?:. A non-capturing group groups without saving the matched text. That spares the engine some bookkeeping, but it also means nothing ends up in $1, $2, and so on.
To capture a pattern you merely need to place parentheses around the part you want to capture. So you could remove the non-capturing marker ?: or place parens () where you need them. http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
I'm not sure, but from your code I think you may be expecting the optional groups to behave like lookahead assertions: for example, you test for surnames with spaces and, if there are none, test for surnames with hyphens. These tests will not all start from the same point. Each optional group either matches and consumes text or matches nothing, and the next group is then tried from wherever the previous one left off. Whether the regex will then test the second name against the first subpattern is what I am unsure of. http://perldoc.perl.org/perlretut.html#Looking-ahead-and-looking-behind
#!/usr/bin/perl
use warnings;
use strict;
my $line = '123 456 7antelope89';
$line =~ /^(\d+\s\d+\s)?(\d+\w+\d+)?/;
my ($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7bealzelope89';
$line =~ /(?:\d+\s\d+\s)?(?:\d+\w+\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7canteloupe89';
# [a-z]+ rather than \w+: \w also matches digits, so (\w+) would capture "canteloupe8"
$line =~ /((?:\d+\s\d+\s))?(?:\d+([a-z]+)\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
exit 0;
If the aim is to capture the whole group, the first pattern of the third example is redundant: ((?:...)) wraps a non-capturing group in a capturing one, which does the same job as a single capturing group. The combination is useful in the second pattern, which is a fine-grained capture: the captured text is only part of what the surrounding non-capturing group matches.
a: 123 456 b: 7antelope89
a: nocapture b: nocapture
a: 123 456 b: canteloupe
One little nitpick
id=".*?"
may be better as
id="\w*?"
since id names have to be alphanumeric (or underscore), IIRC.
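To make the "add capturing parentheses" advice concrete, here is a rough Python sketch of building the <BIB> prefix the question asks for. It is not a port of the Perl validator: ENTRY, SURNAME and bib_prefix are hypothetical names, the patterns are much simpler than the original, and the sketch assumes the line has already passed validation and uses a plain four-digit year such as (2008).

```python
import re

# Capture the author list and the year from one <bibliomixed> line.
ENTRY = re.compile(
    r'<bibliomixed\s+id="[^"]*">(?P<authors>.*?)\s\((?P<year>\d{4})[a-z]?\)\.'
)
# A surname is the text before ", " followed by one or more initials.
SURNAME = re.compile(r"([^\s,&][^,&]*),\s(?:[A-Z]\.[\s-]?)+")


def bib_prefix(line):
    """Return the line prefixed with <BIB>surnames, year</BIB>, or None."""
    m = ENTRY.match(line)
    if m is None:
        return None
    surnames = SURNAME.findall(m.group("authors"))
    return "<BIB>" + ", ".join(surnames + [m.group("year")]) + "</BIB>" + line


sample = (
    '<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., '
    "& Machado, A. (2008). Sexual satisfaction among patients.</bibliomixed>"
)
print(bib_prefix(sample))
```

Running the extraction as a second pass over lines that already validated keeps the big validation regex free of capture groups, at the cost of matching each good line twice.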
I have a CSV file (exported data from iWork Numbers) which contains a list of users with information. What I want to do is to replace ;;;;;;;;; with ; on all lines except "Last login".
By doing so and importing the file into Numbers again, the data will (hopefully) be divided into rows like this:
User 1 | Points: 1 | Registered: 2012-01-01 | Last login 2012-02-02
User 2 | Points: 2 | Registered: 2012-01-01 | Last login 2012-02-02
How the CSV file looks:
;User1;;;;;;;;;
;Points: 1;;;;;;;;;
;Registered: 2012-01-01;;;;;;;;;
;Last login: 2012-02-02;;;;;;;;;
;User2;;;;;;;;;
;Points: 2;;;;;;;;;
;Registered: 2012-01-01;;;;;;;;;
;Last login: 2012-02-02;;;;;;;;;
So my question is what Regex code should I type in the Find and Replace fields?
Thanks in advance!
See the regex in action:
Find : ^(;(?!Last).*)(;{9})
Replace: $1;
Output will be:
;User1;
;Points: 1;
;Registered: 2012-01-01;
;Last login: 2012-02-02;;;;;;;;;
;User2;
;Points: 2;
;Registered: 2012-01-01;
;Last login: 2012-02-02;;;;;;;;;
Explanation
Find:
^ # Match start of the line
( # Start of the 1st capture group
;(?!Last) # Match a semicolon (;), only if not followed by 'Last' word.
.* # Match everything
) # End of the 1st capture group
( # Start of the 2nd capture group
;{9} # Match exactly 9 semicolons
) # End of the 2nd capture group
Replace:
$1; # Leave 1st capture group as is and append a semicolon.
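If the file is processed with a script rather than an editor's Find and Replace, the same pattern can be applied in Python. This is a sketch under the assumption that every line has the shape shown in the question. The name collapse_trailing is mine, and Python writes the replacement as \1; instead of $1;.

```python
import re

# Same Find pattern as above; re.M makes ^ match at the start of every line.
TRAILING = re.compile(r"^(;(?!Last).*)(;{9})", re.M)


def collapse_trailing(csv_text):
    """Replace the nine trailing semicolons with one, except on 'Last' lines."""
    return TRAILING.sub(r"\1;", csv_text)


pad = ";" * 9
sample = "\n".join([
    ";User1" + pad,
    ";Points: 1" + pad,
    ";Registered: 2012-01-01" + pad,
    ";Last login: 2012-02-02" + pad,
])
print(collapse_trailing(sample))
```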
What is the best way to ignore the white space in a target string when searching for matches using a regular expression pattern, but only if the whitespace comes after a newline (\n)? For example, if my search is for "cats", I would want "c\n ats" or "ca\n ts" to match but not "c ats" since the whitespace doesn't come after a newline. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
If the regex engine you're using supports lookaround assertions, use a positive lookbehind assertion to check for the presence of a preceding newline:
(?<=\n)\s
"What is the best way to ignore the white space in a target string when searching for matches using a regular expression pattern"
I would construct a regex dynamically, inserting a (?:\n\s)? between each pair of adjacent characters.
use strict;
use warnings;
my $needed = 'cats';
my $regex = join '(?:\n\s)?' , split ( '',$needed );
print "\nRegex = $regex\n", '-'x40, "\n\n";
my $target = "
cats
c ats
c\n ats
ca ts
ca\n ts
cat s
cat\n s
";
while ( $target =~ /($regex)/g)
{
print "Found - '$1'\n\n";
}
The output:
Regex = c(?:\n\s)?a(?:\n\s)?t(?:\n\s)?s
----------------------------------------
Found - 'cats'
Found - 'c
ats'
Found - 'ca
ts'
Found - 'cat
s'
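Since the question also needs the begin and end index of each match for highlighting, here is a small Python sketch of the same dynamic construction that reports the match span. The function name newline_tolerant is hypothetical. re.escape is added so that search terms containing regex metacharacters are treated literally.

```python
import re


def newline_tolerant(word):
    """Build a pattern that allows an optional newline + one whitespace
    character between any two adjacent characters of `word`."""
    gap = r"(?:\n\s)?"
    return re.compile(gap.join(re.escape(ch) for ch in word))


pattern = newline_tolerant("cats")
for target in ["cats", "c\n ats", "ca\n ts", "c ats"]:
    m = pattern.search(target)
    # span() gives (start, end) including any skipped newline and whitespace
    print(repr(target), m.span() if m else None)
# 'cats' (0, 4)
# 'c\n ats' (0, 6)
# 'ca\n ts' (0, 6)
# 'c ats' None
```

As in the Perl version, only a single whitespace character after the newline is skipped. Allowing a longer indent would mean changing \s to \s+ in the gap.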
I have made a small ruby snippet based on the rules you have listed. Is this what you are looking for?
data = <<DATA
test1c\n atsOKexpected

test2ca\n tsOKexpected

test3catsOKexpected

test5ca tsBADexpected

test6 catsOKexpected

test7cats OKexpected
DATA
tests = data.split(/\n\n/)
regex = /c(\n )?a(\n )?t(\n )?s/
tests.each do |s|
if s =~ regex
puts "OK\n#{s}\n\n"
else
puts "BAD\n#{s}\n\n"
end
end
# RESULTS
# OK
# test1c
# atsOKexpected
#
# OK
# test2ca
# tsOKexpected
#
# OK
# test3catsOKexpected
#
# BAD
# test5ca tsBADexpected
#
# OK
# test6 catsOKexpected
#
# OK
# test7cats OKexpected