I'm parsing some HTML like this:
<h3>Movie1</h3>
<div class="time"><span>10:00</span><span>12:00</span></div>
<h3>Movie2</h3>
<div class="time"><span>13:00</span><span>15:00</span><span>18:00</span></div>
I'd like to get a result array that looks like this:
0 =>
    0 => Movie1
    1 => Movie2
1 =>
    0 =>
        0 => 10:00
        1 => 12:00
    1 =>
        0 => 13:00
        1 => 15:00
        2 => 18:00
I can do that in two steps:
1) get the movie name and the movie's whole schedule (tags included) with a regexp like this
~<h3>(.*?)</h3>(?:.*?)<div class="time">(.*?)</div>~s
2) get the times with a regexp like this (I run it inside a loop, once for every movie found in step 1)
~<span>([0-9]{2}:[0-9]{2})</span>~s
And it works well.
My question: is there a regular expression that gives me the same result in only one step?
I tried nested groups like this
~<h3>(.*?)</h3>(?:.*?)<div class="time">((<span>(.*?)</span>)*)</div>~s
but I got only the last time of every movie (only 12:00 and 18:00).
With DOMDocument:
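$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select every h3 and every span inside a div with class "time", in document order.
$nodeList = $xpath->query('//h3|//div[@class="time"]/span');
$result = array();
$currentMovie = -1;
foreach ($nodeList as $node) {
    if ($node->nodeName === 'h3') {
        // A new title starts a new movie entry.
        $result[0][++$currentMovie] = $node->nodeValue;
        continue;
    }
    // Any other node is a time that belongs to the current movie.
    $result[1][$currentMovie][] = $node->nodeValue;
}
print_r($result);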
Note: to be more rigorous, you can change the XPath query so that it only selects h3 elements that are followed by a schedule:
//h3[following-sibling::div[@class="time"]] | //div[@class="time"]/span
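For comparison, here is a minimal JavaScript sketch of the regex behaviour described in the question. It is not part of the PHP answer above. It assumes the markup is exactly as regular as the sample, and the variable names (nested, outer, inner, lastOnly) are made up for illustration. It shows that a quantified capture group keeps only its final iteration, and that the two-pass approach produces the requested array.

```javascript
// Sample markup copied from the question.
const html =
  '<h3>Movie1</h3>\n' +
  '<div class="time"><span>10:00</span><span>12:00</span></div>\n' +
  '<h3>Movie2</h3>\n' +
  '<div class="time"><span>13:00</span><span>15:00</span><span>18:00</span></div>';

// One pass with a repeated group: group 2 is overwritten on every
// iteration, so only the last <span> value survives.
const nested = /<h3>(.*?)<\/h3>.*?<div class="time">(?:<span>(.*?)<\/span>)*<\/div>/gs;
const lastOnly = [...html.matchAll(nested)].map((m) => m[2]);

// Two passes: first capture title + raw schedule, then collect every time.
const outer = /<h3>(.*?)<\/h3>.*?<div class="time">(.*?)<\/div>/gs;
const inner = /<span>(\d{2}:\d{2})<\/span>/g;
const result = [[], []];
for (const m of html.matchAll(outer)) {
  result[0].push(m[1]);
  result[1].push([...m[2].matchAll(inner)].map((t) => t[1]));
}

console.log(lastOnly);
console.log(JSON.stringify(result));
```

On real-world HTML a DOM parser remains the safer choice, because the regexes break as soon as attributes or nesting change.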
I have a list of random dates formatted like:
String x ="Text(\"key:[2020-08-23 22:22, 2020-08-22 10:11, 2020-02-22 12:14]\"),"
I can use the \d{4}\-\d{2}\-\d{2}\s\d{2}:\d{2} regex to match all dates in x:
RegExp regExp79 = new RegExp(
r'\d{4}\-\d{2}\-\d{2}\s\d{2}:\d{2}',
);
var match79 = regExp79.allMatches("$x");
var mylistdate = match79;
So, the matches are:
match 1 = 2020-08-23 22:22
match 2 = 2020-08-22 10:11
match 3 = 2020-02-22 12:14
I want to convert the Iterable<RegExpMatch> into a list of strings, so that my list looks like:
[2020-08-23 22:22, 2020-08-22 10:11, 2020-02-22 12:14]
The allMatches method returns an Iterable<RegExpMatch> value. It contains all the RegExpMatch objects that contain some details about the matches. You need to invoke the .group(0) method on each RegExpMatch object to get the string value of the match.
So, you need to .map the results:
your_regex.allMatches(x).map((z) => z.group(0)).toList()
Code:
String x ="Text(\"key:[2020-08-23 22:22, 2020-08-22 10:11, 2020-02-22 12:14]\"),";
RegExp regExp79 = new RegExp(r'\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}');
var mylistdate = regExp79.allMatches(x).map((z) => z.group(0)).toList();
print(mylistdate);
Output:
[2020-08-23 22:22, 2020-08-22 10:11, 2020-02-22 12:14]
I have a script in Google Sheets.
I am trying to find and replace headers on one sheet using a table of values on a different sheet.
It mostly works as desired, but the replacement fails for any string that ends in ?.
I do not know in advance when a ? will be present.
I am using this:
const regex = new RegExp("(?<![^|])(?:" + search_for.join("|") + ")(?![^|])", "g");
I have tried to figure out how to correct my regex, but I'm not getting it.
Thanks in advance for your assistance with this.
I have in a sheet:
search_for    replace_with
ABC Joe       MNQ
XYZ car       NNN XXX
DDD foo?      Bob bar
I have these headers on a different sheet:
Label, Id, ABC Joe, XYZ car, DDD foo?
After running the replacement, I want these headers:
Label, Id, MNQ, NNN XXX, Bob bar
What I get is:
Label, Id, MNQ, NNN XXX, DDD foo?
var data = range.getValues();
search_for.forEach(function(item, i) {
pair[item] = replace_with[i];
});
const regex = new RegExp("(?<![^|])(?:" + search_for.join("|") + ")(?![^|])", "g");
//Update Header row
//replace(/^\s+|\s+$|\s+(?=\s)/g, "") - Remove all multiple white-spaces and replaces with a single WS & trim
for(var m = 0; m<= data[0].length - 1; m++){
data[0][m] = data[0][m].replace(/^\s+|\s+$|\s+(?=\s)/g, "").replace(regex,(m) => pair[m])
}
A word of warning: what you're doing is scaring me a bit. I hope you know this is a brittle approach and it can go wrong.
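You're not quoting (escaping) the dynamic parts of the regex. The ? is a special character in regular expressions: it makes the preceding character optional, so foo? matches "fo" or "foo" rather than the literal text "foo?". I've written a solution to your problem below. Don't rely on my solution in production.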
//var data = range.getValues();
var data = [
['Label', 'Id', 'ABC Joe', 'XYZ car', 'DDD foo?']
];
var search_for = [
'ABC Joe',
'XYZ car',
'DDD foo?'
];
var replace_with = [
'MNQ',
'NNN XXX',
'Bob bar'
];
var pair = {};
search_for.forEach(function(item, i) {
pair[item] = replace_with[i];
});
const regex = new RegExp("(?<![^|])(?:" + search_for.map((it) => quote(it)).join("|") + ")(?![^|])", "g");
for (var m = 0; m <= data[0].length - 1; m++) {
data[0][m] = data[0][m]
.replace(/^\s+|\s+$|\s+(?=\s)/g, "")
.replace(regex, (m) => pair[m]);
}
// see https://stackoverflow.com/a/3614500/11451
function quote(s) {
var regexpSpecialChars = /([\[\]\^\$\|\(\)\\\+\*\?\{\}\=\!])/gi;
return s.replace(regexpSpecialChars, '\\$1');
}
Can you not do something really simple, like escaping all non-alphanumeric characters? That works with the example data you gave above, and it seems trustworthy:
function quote(s) {
var regexpSpecialChars = /((?=\W))/gi;
return s.replace(regexpSpecialChars, '\\');
}
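To make the failure concrete, here is a small self-contained sketch that builds the pattern twice, once from the raw strings and once from escaped strings, and runs both against the sample headers. The helper names escapeRegExp, build and swap are hypothetical. The character class is the commonly used escape set, which also covers "." (the quote function above leaves it unescaped).

```javascript
// Hypothetical helper: backslash-escape every regex metacharacter.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const searchFor = ['ABC Joe', 'XYZ car', 'DDD foo?'];
const replaceWith = ['MNQ', 'NNN XXX', 'Bob bar'];
const pair = Object.fromEntries(searchFor.map((s, i) => [s, replaceWith[i]]));

// Same wrapper as in the question: the match must be delimited by "|"
// or by the start/end of the string.
const build = (parts) =>
  new RegExp('(?<![^|])(?:' + parts.join('|') + ')(?![^|])', 'g');

const raw = build(searchFor);                       // "foo?" means "fo" or "foo"
const escaped = build(searchFor.map(escapeRegExp)); // "foo\?" means literal "foo?"

const headers = ['Label', 'Id', 'ABC Joe', 'XYZ car', 'DDD foo?'];
const swap = (re) => headers.map((h) => h.replace(re, (m) => pair[m]));

console.log(swap(raw));     // last header stays 'DDD foo?'
console.log(swap(escaped)); // last header becomes 'Bob bar'
```

With the raw pattern, "DDD foo" matches but the lookahead then sees the trailing "?" and rejects it, which is why that header is never replaced.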
I'm trying to extract the year from a string with this format:
dataset_name = 'ALTVALLEDAOSTA000020191001.json'
I tried:
dataset_name[/<\b(19|20)\d{2}\b>/, 1]
/\b(19|20)\d{2}\b/.match(dataset_name)
I'm still reading the docs but so far I'm not able to achieve the result I want. I'm really bad at regex.
Since your dataset name always ends in yyyymmdd.json, you can slice out the four characters that start 13 from the end and stop 9 from the end:
irb(main):001:0> dataset_name = 'ALTVALLEDAOSTA000020191001.json'
irb(main):002:0> dataset_name[-13...-9]
=> "2019"
You can also use a regex if you want a bit more precision:
irb(main):003:0> dataset_name =~ /(\d{4})\d{4}\.json$/
=> 18
irb(main):004:0> $1
=> "2019"
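The offsets come from the fixed tail of the name: ".json" is 5 characters and "yyyymmdd" is 8, so the year begins 13 characters from the end and stops 9 from the end. As a cross-check outside Ruby, here is the same arithmetic in a short JavaScript sketch; the variable names are illustrative only.

```javascript
const datasetName = 'ALTVALLEDAOSTA000020191001.json';

// Fixed-offset slice: start 13 from the end, stop 9 from the end.
const bySlice = datasetName.slice(-13, -9);

// Anchored regex: four digits (the year), four more digits, then ".json" at the end.
const m = /(\d{4})\d{4}\.json$/.exec(datasetName);
const byRegex = m ? m[1] : null;

console.log(bySlice);           // "2019"
console.log(byRegex);           // "2019"
console.log(m ? m.index : -1);  // 18, the offset where the year starts
```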
There are many ways to get to Rome.
Starting with:
foo = 'ALTVALLEDAOSTA000020191001.json'
Stripping the extended filename + extension to its basename then using a regex:
ymd = /(\d{4})(\d{2})(\d{2})$/
ext = File.extname(foo)
File.basename(foo, ext) # => "ALTVALLEDAOSTA000020191001"
File.basename(foo, ext)[ymd, 1] # => "2019"
File.basename(foo, ext)[ymd, 2] # => "10"
File.basename(foo, ext)[ymd, 3] # => "01"
Using a regex against the entire filename to grab just the year:
ymd = /^.*(\d{4})\d{4}/
foo[ymd, 1] # => "2019"
or extracting the year, month and day:
ymd = /^.*(\d{4})(\d{2})(\d{2})/
foo[ymd, 1] # => "2019"
foo[ymd, 2] # => "10"
foo[ymd, 3] # => "01"
Using String's unpack:
ymd = '@18A4'
foo.unpack(ymd) # => ["2019"]
or:
ymd = '@18A4A2A2'
foo.unpack(ymd) # => ["2019", "10", "01"]
If the strings are consistent length and format, then I'd work with unpack, because, if I remember right, it is the fastest, followed by String slicing, with anchored, then unanchored regular expressions trailing.
I have the following list.
ID AllStatuses
1001 {failed|processing|success}
1002 {failed}
1003 {success|failed}
1004 {processing|success}
1005 {failed|processing}
My requirement is to display the most optimistic status alone. Like so
ID Best Status
1001 success
1002 failed
1003 success
1004 success
1005 processing
Is there a way to do this with one regex query, rather than checking for each status in order of preference, which in the worst case takes three regex checks?
Regex: \{.*(success).*|\{.*(processing).*|\{.*(failed).* Substitution: $1$2$3
Details:
.* matches any character zero or more times
() Capturing group
| Or
Go code:
var re = regexp.MustCompile(`\{.*(success).*|\{.*(processing).*|\{.*(failed).*`)
s := re.ReplaceAllString(sample, `$1$2$3`)
Output:
ID AllStatuses
1001 success
1002 failed
1003 success
1004 success
1005 processing
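The trick is that the alternatives are tried left to right at each position, so the "success" branch wins whenever that word appears on the line, then "processing", then "failed". Here is the same substitution sketched in JavaScript, assuming the input is one record per line as in the question (sample and best are illustrative names):

```javascript
const sample = [
  '1001 {failed|processing|success}',
  '1002 {failed}',
  '1003 {success|failed}',
  '1004 {processing|success}',
  '1005 {failed|processing}',
].join('\n');

// "." does not match a newline by default, so each match stays on one line.
const re = /\{.*(success).*|\{.*(processing).*|\{.*(failed).*/g;

// Only one of the three groups participates in a match; the other two
// expand to empty strings in the replacement.
const best = sample.replace(re, '$1$2$3');

console.log(best);
```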
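(\d+)\s+{.*(success|processing|failed).*}
Then take the values from the groups:
group 1: ID
group 2: status
Note that the greedy .* makes group 2 capture the last status listed inside the braces, so this returns the most optimistic status only when the statuses are listed from worst to best (for 1003 {success|failed} it yields failed).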
You can do it with one regex, followed by checks for the statuses in whatever order of priority you need.
It is not as short, but I am sure it is more stable, especially if the algorithm changes later.
The example is in JavaScript, but I am sure you can easily implement the idea in your own code:
var obResults = {};
var obStrings = {
1001: "{failed|processing|success}",
1002: "{failed}",
1003: "{success|failed}",
1004: "{processing|success}",
1005: "{failed|processing}",
};
for (var key in obStrings) {
    var stringToCheck = obStrings[key];
    var resultString = "";
    // Collect every status that appears in the string.
    var arMatches = stringToCheck.match( /(failed|processing|success)/ig );
    // Check the statuses from most to least optimistic.
    if (arMatches.indexOf("success") != -1) {
        resultString = "success";
    } else if (arMatches.indexOf("processing") != -1) {
        resultString = "processing";
    } else if (arMatches.indexOf("failed") != -1) {
        resultString = "failed";
    }
    obResults[key] = { result:resultString, check:stringToCheck };
}
console.log(obResults);
I am extracting data from a text stream that is structured like this:
/1-<id>/<recType>-<data>..repeat n times../1-<id>/#-<data>..repeat n times..
In the above, the "/1" field starts a record, and it can be followed by any number of fields, each with a recType from 2 to 9 (every field starts with a "/").
For example:
/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE
So, there are three groups of data above:
1=XXXX 2=YYYY 9=ZZZZ
1=AAAA 3=BBBB 5=CCCC 8=NNNN 9=DDDD
1=QQQQ 2=WWWW 3=PPPP 7=EEEE
The data is simplified here. I know for certain that it only contains [A-Z0-9. ], but it can be of variable length (not just 4 chars as in the example).
Now, the following expression sort of works, but it only captures the first 2 fields of each group and none of the remaining fields...
/1-(?'fld1'[A-Z]+)/((?'fldNo'[2-9])-(?'fldData'[A-Z0-9\. ]+))
I know I need some sort of quantifier in there somewhere, but I do not know which one or where to place it.
You can use a regex to match these blocks using 2 .NET regex features: 1) capture collection and 2) multiple capturing groups with the same name in the pattern. Then, we'll need some Linq magic to combine the captured data into a list of lists:
(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*
Details:
(?<fldNo>1) - Group fldNo matching 1
- - a hyphen
(?'fldData'[^/]+) - Group "fldData" capturing 1+ chars other than /
(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))* - zero or more sequences of:
/ - a literal /
(?<fldNo>[2-9]) - a digit from 2 to 9 (Group "fldNo")
[-=] - a - or =
(?'fldData'[^/]+)- 1+ chars other than / (Group "fldData")
See the regex demo.
See C# demo:
using System;
using System.Linq;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var str = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
var res = Regex.Matches(str, @"(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*")
.Cast<Match>()
.Select(p => p.Groups["fldNo"].Captures.Cast<Capture>().Select(m => m.Value)
.Zip(p.Groups["fldData"].Captures.Cast<Capture>().Select(m => m.Value),
(first, second) => first + "=" + second))
.ToList();
foreach (var t in res)
Console.WriteLine(string.Join(" ", t));
}
}
I would suggest first splitting the string at each /1, then using a pattern along these lines:
\/([1-9])[=-]([A-Z]+)
https://regex101.com/r/0nyzzZ/1
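A minimal sketch of that split-then-match idea in JavaScript. The lookahead split, the widened character class [A-Z0-9. ] (taken from the question's description of the data) and the names records, field and groups are my assumptions, not part of the suggestion above.

```javascript
const stream =
  '/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE';

// Split in front of every "/1-" or "/1=" marker; the lookahead keeps the
// marker attached to the record that follows it.
const records = stream.split(/(?=\/1[-=])/);

// One field: "/", a record type 1-9, "-" or "=", then the data.
const field = /\/([1-9])[=-]([A-Z0-9. ]+)/g;

// For each record, collect all of its fields as "type=data" pairs.
const groups = records.map((rec) =>
  [...rec.matchAll(field)].map((m) => m[1] + '=' + m[2]).join(' ')
);

console.log(groups);
```

Because each record is matched on its own, the global regex returns every field and no quantified capture group is needed.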
A single regex isn't the optimal tool for doing this (at least used in this way). The main reason is that your stream has a variable number of entries, and a variable number of capture groups is not supported. I also noticed that some of the values are separated by "=" rather than a dash, which your current regex doesn't address.
The problem comes when you try and add a quantifier to a capture group - the group will only remember the last thing it captured, so if you add a quantifier, it will end up catching the first and last fields, leaving out all the rest of them. So something like this won't work:
\/1-(?'fld1'[A-Z]+)(?:\/(?'fldNo'[2-9])[-=](?'fldData'[A-Z]+))+
If your streams were all the same length, then a single regex could be used, but there's a way to do it using a foreach loop with a much simpler regex working on each part of your stream (so it verifies your stream as well when it goes along!)
Now I'm not sure what language you're working with when using this, but here is a solution in PHP that I think delivers what you need.
function extractFromStream($str)
{
/*
* Get an array of [num]-[letters] with explode. This will make an array that
* contains [0] => 1-AAAA, [1] => 2-BBBB ... etc
*/
$arr = explode("/", substr($str, 1));
$sorted = array();
$key = 0;
/*
* Sort this data into key->values based on numeric ordering.
* If the next one has a lower or equal starting number than the one before it,
* a new entry will be created. i.e. 2-aaaa => 1-cccc will cause a new
* entry to be made, just in case the stream doesn't always start with 1.
*/
foreach ($arr as $value)
{
// This will get the number at the start, and has the added bonus of making sure
// each bit is in the right format.
if (preg_match("/^([0-9]+)[=-]([A-Z]+)$/", $value, $matches)) {
$newKey = (int)$matches[1];
$match = $matches[2];
} else
throw new Exception("This is not a valid data stream!");
// This bit checks if we've got a lower starting number than last time.
if (isset($lastKey) && is_int($lastKey) && $newKey <= $lastKey)
$key += 1;
// Now sort them..
$sorted[$key][$newKey] = $match;
// This will be compared in the next iteration of the loop.
$lastKey = $newKey;
}
return $sorted;
}
Here's how you can use it...
$full = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
try {
$extracted = extractFromStream($full);
$stream1 = $extracted[0];
$stream2 = $extracted[1];
$stream3 = $extracted[2];
print "<pre>";
echo "Full extraction: \n";
print_r($extracted);
echo "\nFirst Stream:\n";
print_r($stream1);
echo "\nSecond Stream:\n";
print_r($stream2);
echo "\nThird Stream:\n";
print_r($stream3);
print "</pre>";
} catch (Exception $e) {
echo $e->getMessage();
}
This will print
Full extraction:
Array
(
[0] => Array
(
[1] => XXXX
[2] => YYYY
[9] => ZZZZ
)
[1] => Array
(
[1] => AAAA
[3] => BBBB
[5] => CCCC
[8] => NNNN
[9] => DDDD
)
[2] => Array
(
[1] => QQQQ
[2] => WWWW
[3] => PPPP
[7] => EEEE
)
)
First Stream:
Array
(
[1] => XXXX
[2] => YYYY
[9] => ZZZZ
)
Second Stream:
Array
(
[1] => AAAA
[3] => BBBB
[5] => CCCC
[8] => NNNN
[9] => DDDD
)
Third Stream:
Array
(
[1] => QQQQ
[2] => WWWW
[3] => PPPP
[7] => EEEE
)
So you can see you have the numbers as the array keys, and the values they correspond to, which are now readily accessible for further processing. I hope this helps you :)