Regex to determine volume/chapter - regex

I pull tons of posts from novel sites where they use this abbreviation for the volume and chapter: v5c91. So here, we have Volume 5 and Chapter 91.
Here are some examples of titles:
$string = 'hello v2c19 lorem';
$string = 'hello v2 c19 lorem';
$string = 'hello c19 lorem';
$string = 'v8 hello c19 lorem';
$string = 'hello lorem v01';
What regex can I use to pull the volume and chapter out of those examples? So I end up with something like v8c19.

To avoid matching titles with v{num} and c{num} in them, I think you want something like this:
(\bc\d+)|\bv\d+(c\d+) will catch chapters and (\bv\d+)|\bc\d+(v\d+) will capture volumes
EDIT: To capture partial chapters like c2.5, simply replace \d+ with a slightly modified regex that captures floating-point numbers: (?:[0-9]*[.])?[0-9]+
It looks for a word boundary followed by the letter (c or v) and then digits, OR in the case of v1c3, it looks for the correct prefix followed by the match.
Here are some examples:
const inputs = [
'hello v2c19 lorem',
'hello v2.5 c19 lorem',
'hello c19 lorem',
'v8 hello c19 lorem',
'hello lorem c01',
'novolume nav123',
'hello noch123pter',
];
const find = (str, regex) => {
let res = null;
const match = regex.exec(str);
if (match) {
res = match[1] || match[2];
}
return res;
};
const FLOAT = `(?:[0-9]*[.])?[0-9]+`;
const vRE = new RegExp(`(\\bv${FLOAT})|\\bc${FLOAT}(v${FLOAT})`);
const cRE = new RegExp(`(\\bc${FLOAT})|\\bv${FLOAT}(c${FLOAT})`);
const output = inputs.map((title) => {
const chapter = find(title, cRE);
const volume = find(title, vRE);
return {
title,
chapter,
volume
};
});
console.log(output);
It's possible to combine these into a single pattern covering every combination (only chapter, only volume, chapter then volume, volume then chapter, and so on), but that gets confusing fast, and these regexes are simple enough to do the job.
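For comparison, here is a minimal Python sketch of the same two-pattern idea that also joins the results into the v8c19 form the question asks for. The helper names (first_group, volume_chapter) are made up for illustration, and it assumes lowercase v/c prefixes as in the sample titles.

```python
import re

# Same float-aware number pattern as in the answer above.
NUM = r"(?:[0-9]*[.])?[0-9]+"

# Each pattern matches either a standalone token (v8) or the
# second half of a fused token (the c19 in v2c19).
V_RE = re.compile(rf"\bv({NUM})|\bc{NUM}v({NUM})")
C_RE = re.compile(rf"\bc({NUM})|\bv{NUM}c({NUM})")


def first_group(pattern, title):
    """Return the number captured by whichever alternative matched, or None."""
    m = pattern.search(title)
    if m is None:
        return None
    return m.group(1) or m.group(2)


def volume_chapter(title):
    """Build a string such as 'v8c19'; missing parts are simply left out."""
    volume = first_group(V_RE, title)
    chapter = first_group(C_RE, title)
    return (f"v{volume}" if volume else "") + (f"c{chapter}" if chapter else "")


if __name__ == "__main__":
    for t in ["hello v2c19 lorem", "v8 hello c19 lorem", "hello lorem v01"]:
        print(t, "->", volume_chapter(t))
```

Only the first volume and the first chapter in a title are reported; titles with several of either would need a different approach.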

Find Replace with RegEx failing for string ending in ? Google script

I have a script in Google Sheets.
I am trying to find and replace headers on a sheet using a table of values on a different sheet.
It mostly works as desired, but the replacement fails for any string that ends in ?.
I do not know in advance when a ? will be present.
I am using this:
const regex = new RegExp("(?<![^|])(?:" + search_for.join("|") + ")(?![^|])", "g");
I have tried to work out how to correct my regex, but I am not getting it.
Thanks in advance for your assistance with this.
I have in a sheet:
search_for | replace_with
ABC Joe | MNQ
XYZ car | NNN XXX
DDD foo? | Bob bar
I have for Headers on a different sheet:
Label | Id | ABC Joe | XYZ car | DDD foo?
after running the replacement I want for headers:
Label | Id | MNQ | NNN XXX | Bob bar
what I get is:
Label | Id | MNQ | NNN XXX | DDD foo?
var data = range.getValues();
search_for.forEach(function(item, i) {
pair[item] = replace_with[i];
});
const regex = new RegExp("(?<![^|])(?:" + search_for.join("|") + ")(?![^|])", "g");
//Update Header row
//replace(/^\s+|\s+$|\s+(?=\s)/g, "") - Remove all multiple white-spaces and replaces with a single WS & trim
for(var m = 0; m<= data[0].length - 1; m++){
data[0][m] = data[0][m].replace(/^\s+|\s+$|\s+(?=\s)/g, "").replace(regex,(m) => pair[m])
}
A word of warning: what you're doing is scaring me a bit. I hope you know this is a brittle approach and it can go wrong.
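You're not escaping (quoting) the dynamic parts of the regex. The ? is a special character in regular expressions: unescaped, foo? means "fo" followed by an optional "o", so the pattern never matches the literal ? and the (?![^|]) lookahead then rejects the match. I've written a solution to your problem below. Don't rely on my solution in production.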
//var data = range.getValues();
var data = [
['Label', 'Id', 'ABC Joe', 'XYZ car', 'DDD foo?']
];
var search_for = [
'ABC Joe',
'XYZ car',
'DDD foo?'
];
var replace_with = [
'MNQ',
'NNN XXX',
'Bob bar'
];
var pair = {};
search_for.forEach(function(item, i) {
pair[item] = replace_with[i];
});
const regex = new RegExp("(?<![^|])(?:" + search_for.map((it) => quote(it)).join("|") + ")(?![^|])", "g");
for (var m = 0; m <= data[0].length - 1; m++) {
data[0][m] = data[0][m]
.replace(/^\s+|\s+$|\s+(?=\s)/g, "")
.replace(regex, (m) => pair[m]);
}
// see https://stackoverflow.com/a/3614500/11451
function quote(s) {
var regexpSpecialChars = /([\[\]\^\$\|\(\)\\\+\*\?\{\}\=\!])/gi;
return s.replace(regexpSpecialChars, '\\$1');
}
Can you not do something really simple, like escaping all non-alphanumeric characters? That works with the example data you gave above and seems trustworthy:
function quote(s) {
var regexpSpecialChars = /((?=\W))/gi;
return s.replace(regexpSpecialChars, '\\');
}
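For readers who want to see the escaping idea outside Apps Script, here is a minimal Python sketch. re.escape is the standard-library counterpart of the hand-written quote() helpers above; the function name replace_headers and the longest-first ordering are my own additions, not something from the question.

```python
import re


def replace_headers(headers, search_for, replace_with):
    """Replace whole header values, treating every search term literally."""
    pair = dict(zip(search_for, replace_with))
    # Escape each term so characters like ? lose their regex meaning.
    # Longest terms first, so a short term cannot shadow a longer one.
    terms = sorted(search_for, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in terms)
    # Same boundary idea as the original: the match must be delimited
    # by | or by the start/end of the string.
    pattern = re.compile(rf"(?<![^|])(?:{alternatives})(?![^|])")

    result = []
    for header in headers:
        cleaned = " ".join(header.split())  # trim and collapse whitespace
        result.append(pattern.sub(lambda m: pair[m.group(0)], cleaned))
    return result


if __name__ == "__main__":
    print(replace_headers(
        ["Label", "Id", "ABC Joe", "XYZ car", "DDD foo?"],
        ["ABC Joe", "XYZ car", "DDD foo?"],
        ["MNQ", "NNN XXX", "Bob bar"],
    ))
```

If the headers are always replaced as whole cells, a plain dictionary lookup without any regex would be simpler still.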

return first instance of unmatched regex scala

Is there a way to return the first instance of an unmatched string between 2 strings with Scala's Regex library?
For example:
val a = "some text abc123 some more text"
val b = "some text xyz some more text"
a.firstUnmatched(b) = "abc123"
Regex is good for matching & replacing in strings based on patterns.
But to look for the differences between strings? Not exactly.
However, diff can be used to find differences.
object Main extends App {
val a = "some text abc123 some more text 321abc"
val b = "some text xyz some more text zyx"
val firstdiff = (a.split(" ") diff b.split(" "))(0)
println(firstdiff)
}
prints "abc123"
Is regex desired after all? Then realize that the splits could be replaced by regex matching.
The regex pattern in this example looks for words:
val reg = "\\w+".r
val firstdiff = (reg.findAllIn(a).toList diff reg.findAllIn(b).toList)(0)
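As a cross-check of what diff is doing here, the same multiset difference can be sketched in Python with collections.Counter. This is only an illustration of the idea, and first_unmatched is a made-up name borrowed from the question.

```python
from collections import Counter


def first_unmatched(a, b):
    """Return the first word of `a` that has no remaining partner in `b`.

    Mirrors a multiset difference: each occurrence of a word in `b`
    cancels one (the earliest) occurrence of that word in `a`.
    """
    remaining = Counter(b.split())
    for word in a.split():
        if remaining[word] > 0:
            remaining[word] -= 1  # this occurrence is matched
        else:
            return word
    return None  # every word of `a` was matched


if __name__ == "__main__":
    a = "some text abc123 some more text 321abc"
    b = "some text xyz some more text zyx"
    print(first_unmatched(a, b))
```

Note that this compares words regardless of position, just like the Scala version, so it is not a positional diff.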

Selectively uppercasing a string

I have a string with some XML tags in it, like:
"hello <b>world</b> and <i>everyone</i>"
Is there a good Scala/functional way of uppercasing the words, but not the tags, so that it looks like:
"HELLO <b>WORLD</b> AND <i>EVERYONE</i>"
We can use dustmouse's regex to replace all the text in/outside XML tags with Regex.replaceAllIn. We can get the matched text with Regex.Match.matched which then can easily be uppercased using toUpperCase.
val xmlText = """(?<!<|<\/)\b\w+(?!>)""".r
val string = "hello <b>world</b> and <i>everyone</i>"
xmlText.replaceAllIn(string, _.matched.toUpperCase)
// String = HELLO <b>WORLD</b> AND <i>EVERYONE</i>
val string2 = "<h1>>hello</h1> <span>world</span> and <span><i>everyone</i>"
xmlText.replaceAllIn(string2, _.matched.toUpperCase)
// String = <h1>>HELLO</h1> <span>WORLD</span> AND <span><i>EVERYONE</i>
Using dustmouse's updated regex :
val xmlText = """(?:<[^<>]+>\s*)(\w+)""".r
val string3 = """<h1>>hello</h1> <span id="test">world</span>"""
xmlText.replaceAllIn(string3, m =>
m.group(0).dropRight(m.group(1).length) + m.group(1).toUpperCase)
// String = <h1>>hello</h1> <span id="test">WORLD</span>
Okay, how about this. It just prints the results, and takes into consideration some of the scenarios brought up by others. Not sure how to capitalize the output without mercilessly poaching from Peter's answer:
val string = "<h1 id=\"test\">hello</h1> <span>world</span> and <span><i>everyone</i></span>"
val pattern = """(?:<[^<>]+>\s*)(\w+)""".r
pattern.findAllIn(string).matchData foreach {
m => println(m.group(1))
}
The main thing here is that it is extracting the correct capture group.
Working example: http://ideone.com/2qlwoP
Also need to give credit to the answer here for getting capture groups in scala: Scala capture group using regex
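A different technique from the lookaround patterns above is to split the string on the tags and uppercase only the pieces in between. Here is a minimal Python sketch of that idea; it assumes tags never contain a literal < or > inside attribute values, and upper_outside_tags is a hypothetical name.

```python
import re

# A tag is anything between < and > that contains no other angle bracket.
TAG = re.compile(r"(<[^<>]+>)")


def upper_outside_tags(text):
    """Uppercase everything except the tags themselves."""
    # Because TAG has a capture group, re.split keeps the tags in the result.
    parts = TAG.split(text)
    return "".join(p if TAG.fullmatch(p) else p.upper() for p in parts)


if __name__ == "__main__":
    print(upper_outside_tags("hello <b>world</b> and <i>everyone</i>"))
    print(upper_outside_tags('<h1>>hello</h1> <span id="test">world</span>'))
```

Since tags are kept verbatim, attribute values such as id="test" are left untouched. For real XML or HTML, a proper parser is the safer choice.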

Regular expression to match word pairs joined with colons

I don't know regular expressions at all. Can anybody help me with one very simple regular expression:
extracting 'word:word' from a sentence, e.g. "Java Tutorial Format:Pdf With Location:Tokyo Javascript"?
A small modification:
the first 'word' comes from a list, but the second can be anything: "word1 in [ABC, FGR, HTY]".
The situation demands a little more modification.
The matching form can be "word11:word12 word13 ..." up to the next "word21: ...".
things are becoming complex with sec.....i have to learn reg ex :(
thanks in advance.
You can use the regex:
\w+:\w+
Explanation:
\w - a single character that is a letter (uppercase or lowercase), a digit, or _.
\w+ - one or more of the above characters, basically a word.
So \w+:\w+
would match a pair of words separated by a colon.
Try \b(\S+?):(\S+?)\b. Group 1 will capture "Format" and group 2, "Pdf".
A working example:
<html>
<head>
<script type="text/javascript">
function test() {
var re = /\b(\S+?):(\S+?)\b/g; // without 'g' matches only the first
var text = "Java Tutorial Format:Pdf With Location:Tokyo Javascript";
var match = null;
while ( (match = re.exec(text)) != null) {
alert(match[1] + " -- " + match[2]);
}
}
</script>
</head>
<body onload="test();">
</body>
</html>
A good reference for regexes is https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/RegExp
Use this snippet :
$str=" this is pavun:kumar hello world bk:systesm" ;
if ( preg_match_all ( '/(\w+\:\w+)/',$str ,$val ) )
{
print_r ( $val ) ;
}
else
{
print "Not matched \n";
}
Continuing Jaú's function with your additional requirement:
function test() {
var words = ['Format', 'Location', 'Size'],
text = "Java Tutorial Format:Pdf With Location:Tokyo Language:Javascript",
match = null;
var re = new RegExp( '(' + words.join('|') + '):(\\w+)', 'g');
while ( (match = re.exec(text)) != null) {
alert(match[1] + " = " + match[2]);
}
}
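The last variant in the question (a value that runs on until the next listed key) is not covered by the pattern above. A minimal Python sketch of one way to handle it follows; it assumes single-line input and a known key list, and key_values is a hypothetical name.

```python
import re


def key_values(text, keys):
    """Map each listed key to the text that follows it, up to the next key."""
    key_alt = "|".join(re.escape(k) for k in keys)
    # Lazy value, terminated by whitespace plus another listed key,
    # or by the end of the string.
    pattern = re.compile(rf"\b({key_alt}):(.*?)(?=\s+(?:{key_alt}):|$)")
    return {m.group(1): m.group(2).strip() for m in pattern.finditer(text)}


if __name__ == "__main__":
    text = "Java Tutorial Format:Pdf With Location:Tokyo Javascript"
    print(key_values(text, ["Format", "Location", "Size"]))
```

If a key appears more than once, the later value overwrites the earlier one in this sketch.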
I am currently solving this problem in my Node.js app, and I found that the following seems suitable for colon-separated pairs:
([\w]+:)("(([^"])*)"|'(([^'])*)'|(([^\s])*))
It also matches quoted values, like a:"b" c:'d e' f:g
Example in ES6:
const regex = /([\w]+:)("(([^"])*)"|'(([^'])*)'|(([^\s])*))/g;
const str = `category:"live casino" gsp:S1aik-UBnl aa:"b" c:'d e' f:g`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Example in PHP:
$re = '/([\w]+:)("(([^"])*)"|\'(([^\'])*)\'|(([^\s])*))/';
$str = 'category:"live casino" gsp:S1aik-UBnl aa:"b" c:\'d e\' f:g';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
You can check/test your regex expressions using this online tool: https://regex101.com
Btw, if not deleted by regex101.com, you can browse that example coding here
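The same quoted-value idea can be sketched in Python with fewer capture groups. This is a simplified variant, not a line-for-line port of the pattern above: the key is captured without its colon, and parse_pairs is a hypothetical name.

```python
import re

# key, then a double-quoted, single-quoted, or bare (non-whitespace) value
PAIR = re.compile(r"""(\w+):(?:"([^"]*)"|'([^']*)'|(\S*))""")


def parse_pairs(text):
    """Return (key, value) tuples with surrounding quotes removed."""
    pairs = []
    for m in PAIR.finditer(text):
        # Exactly one of the three value groups takes part in a match.
        value = next(g for g in m.groups()[1:] if g is not None)
        pairs.append((m.group(1), value))
    return pairs


if __name__ == "__main__":
    s = """category:"live casino" gsp:S1aik-UBnl aa:"b" c:'d e' f:g"""
    for key, value in parse_pairs(s):
        print(key, "=", value)
```

Escaped quotes inside a quoted value are not handled here.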
Here's the non-regex way: in your favourite language, split on whitespace, go through the elements, check each for ":", and print it if found. E.g. in Python:
>>> s = "Java Tutorial Format:Pdf With Location:Tokyo Javascript"
>>> for i in s.split():
...     if ":" in i:
...         print(i)
...
Format:Pdf
Location:Tokyo
You can do further checks to make sure it's really "someword:someword" by splitting again on ":" and checking that the resulting list has 2 elements, e.g.
>>> for i in s.split():
...     if ":" in i:
...         a = i.split(":")
...         if len(a) == 2:
...             print(i)
...
Format:Pdf
Location:Tokyo
([^:]+):(.+)
Meaning: (everything except :, one or more times), :, (any character, one or more times)
You'll find good manuals on the net... Maybe it's time for you to learn...

Is it possible to parse this nightmare using Perl?

I'm working on some doc file, that when copied and pasted into a text file, gives me the following sample 'output':
ARTA215 ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.
This advanced study in drawing with the life ....
Prerequisite: ARTA150
Lab Fee Required
ARTA220 CERAMICS II (3 Cr) (2:2) + Studio 1 hr.
This course affords the student the opportunity to ex...
Lab Fee Required
ARTA250 SPECIAL TOPICS IN ART
This course focuses on selected topic....
ARTA260 PORTFOLIO DEVELOPMENT (3 Cr) (3:0)
The purpose of this course is to pre....
BIOS010 INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2)
This course is a preparatory course designed to familiarize the begi....
BIOS101 GENERAL BIOLOGY (4 Cr) (3:3)
This course introduces the student to the principles of mo...
Lab Fee Required
BIOS102 INTRODUCTION TO HUMAN BIOLOGY (4 Cr) (3:3)
This course is an introd....
Lab Fee Required
I want to be able to parse it so that 3 fields are generated and I could output the values into a .csv file.
The line breaks, spacing, etc... is how it could be at any point during this file.
My best guess is for a regex to find 4 capitalized alpha chars followed by 3 num chars, then find out if the next 2 chars are capitalized. (this accounts for the course #, but also excludes the possibility of tripping up during where it might say "prerequisite" as in the first entry). After this, the regex finds the first line break and gets everything after it until it finds the next course #. The 3 fields would be a course number, a course title, and a course description. The course number and title are on the same line always and the description is everything beneath.
Sample end result would contain 3 fields which I'm guessing could be stored into 3 arrays:
"ARTA215","ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.","This advanced study in drawing with the life .... Prerequisite: ARTA150 Lab Fee Required"
Like I said, it's quite a nightmare, but I want to automate this instead of cleaning up after someone each time the file is generated.
Consider the following example that depends on blocks of course descriptions being completely contained within what Perl considers to be paragraphs:
#! /usr/bin/perl
$/ = "";
my $record_start = qr/
^ # anchor at the start of a line
\s* # allow optional leading whitespace
([A-Z]+\d+) # capture course tag, e.g., ARTA215
\s+ # separating whitespace
(.+?) # course title on rest of line
\s*\n # consume trailing whitespace
/mx;
while (<>) {
my($course,$title);
if (s/\A$record_start//) { # fix Stack Overflow highlighting /
($course,$title) = ($1,$2);
}
elsif (s/(?s:^.+?)(?=$record_start)//) { # ditto /
redo;
}
else {
next;
}
my $desc;
die unless s/^(.+?)(?=$record_start|\s*$)//s;
($desc = $1) =~ s/\s*\n\s*/ /g;
for ($course, $title, $desc) {
s/^\s+//; s/\s+$//; s/\s+/ /g;
}
print join("," => map qq{"$_"} => $course, $title, $desc), "\n";
redo if $_;
}
When fed your sample input, it outputs
"ARTA215","ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.","This advanced study in drawing with the life .... Prerequisite: ARTA150 Lab Fee Required"
"ARTA220","CERAMICS II (3 Cr) (2:2) + Studio 1 hr.","This course affords the student the opportunity to ex... Lab Fee Required"
"ARTA250","SPECIAL TOPICS IN ART","This course focuses on selected topic...."
"ARTA260","PORTFOLIO DEVELOPMENT (3 Cr) (3:0)","The purpose of this course is to pre...."
"BIOS010","INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2)","This course is a preparatory course designed to familiarize the begi...."
"BIOS101","GENERAL BIOLOGY (4 Cr) (3:3)","This course introduces the student to the principles of mo... Lab Fee Required"
"BIOS102","INTRODUCTION TO HUMAN BIOLOGY (4 Cr) (3:3)","This course is an introd.... Lab Fee Required"
Try:
my $course;
my @courses;
while ( my $line = <$input_handle> ) {
if ( $line =~ /^([A-Z]{4}\d+)\s+([A-Z]{2}.*)/ ) {
$course = [ "$1", "$2" ];
push @courses, $course;
}
elsif ($course) {
$course->[2] .= $line
}
else {
# garbage before first course in file
next
}
}
This produces an array of arrays, as I understand you want. It would make more sense to me to have an array of hashes or even a hash of hashes.
I had roughly the same idea as Gbacon to use paragraph mode since that will neatly chunk the file into records for you. He typed faster, but I wrote one, so here's my crack at it:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "";
my @items;
while (<>) {
my( $course, $description ) = (split /\n/, $_)[0, 1];
my( $course_id, $name ) = ($course =~ m/^(\w+)\s+(.*)$/);
push @items, [ $course_id, $name, $description ];
}
for my $record (@items) {
print "Course id: ", $record->[0], "\n";
print "Name and credits: ", $record->[1], "\n";
print "Description: ", $record->[2], "\n";
}
As Ysth points out in a comment on Gbacon's answer, paragraph mode may not work here. If not, never mind.
Regex may be overkill for this, as the pattern appears to be simply:
[course]
[description]
{Prerequisites}
{Lab Fee Required}
where [course] is composed of
[course#] [course title] {# Cr} [etc/don't care]
and the course# is just the first 7 characters.
So you can scan the file with a simple state machine, something like:
//NOTE: THIS IS PSEUDOCODE
s = 'parseCourse'
f = openFile(blah)
l = readLine(f)
while (l) {
if (s=='parseCourse') {
if (l.StartsWith('Prerequisite:')) {
extractPrerequisite(l)
}
else if (l.StartsWith('Lab Fee Required')) {
extractLabFeeRequired(l)
}
else {
extractCourseInfo(l)
s='parseDescription'
}
}
else if (s=='parseDescription') {
extractDescription(l)
s='parseCourse'
}
l = readLine(f)
}
close(f)
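A runnable take on the line-by-line idea, sketched in Python rather than Perl. It follows the question's own heuristic (4 capital letters, 3 digits, then at least 2 capitals) to spot a course line and appends every other non-blank line to the current description. The names parse_courses and to_csv are hypothetical, and a description line that itself happens to start with that pattern would be misread as a new course.

```python
import csv
import io
import re

# Course line: 4 capitals + 3 digits, whitespace, title starting with 2 capitals.
HEADER = re.compile(r"^([A-Z]{4}\d{3})\s+([A-Z]{2}.*)$")


def parse_courses(lines):
    """Return a list of [course_number, title, description] records."""
    courses = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information here
        m = HEADER.match(line)
        if m:
            courses.append([m.group(1), m.group(2).strip(), ""])
        elif courses:
            # Anything else belongs to the description of the current course.
            courses[-1][2] = (courses[-1][2] + " " + line).strip()
        # Lines before the first course are ignored.
    return courses


def to_csv(courses):
    """Render the records as CSV text with every field quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(courses)
    return buf.getvalue()


if __name__ == "__main__":
    sample = [
        "ARTA215 ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.",
        "This advanced study in drawing with the life ....",
        "Prerequisite: ARTA150",
        "Lab Fee Required",
        "ARTA250 SPECIAL TOPICS IN ART",
        "This course focuses on selected topic....",
    ]
    print(to_csv(parse_courses(sample)), end="")
```

Unlike the paragraph-mode scripts, this does not depend on blank lines separating the courses.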
#!/usr/bin/perl
$/ = "\n\n";
$FS = "\n";
$, = ',';
while (<>) {
chomp;
@F = split($FS, $_);
print join($,,@F) ."\n";
}