stop short of multiple strings and characters using '^' - regex

I'm doing a regex operation that needs to stop short of either of the character sequences { or \t\t{.
The first is fine, but the second cannot be achieved using the ^ symbol the way I have been using it.
My current regex is [\t+]?{\d+}[^\{]*
As you can see, I've used ^ effectively with a single character, but I cannot apply it to a string of characters like \t\t\{
How can the current regex be applied to consider both of these possibilities?
Example text:
{1} The words of the blessing of Enoch, wherewith he blessed the elect and righteous, who will be living in the day of tribulation, when all the wicked and godless are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, which the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, [even] on Mount Sinai,
And appear from His camp
And appear in the strength of His might from the heaven of heavens.
{5} And all shall be smitten with fear
And the Watchers shall quake,
And great fear and trembling shall seize them unto the ends of the earth.
{6} And the high mountains shall be shaken,
And the high hills shall be made low,
And shall melt like wax before the flame
When I do this as a multi-line extract, the indentation is not maintained for the first line of each block. Ideally the extract should stop short of the \t\t{, allowing it to be picked up properly in the next extract and creating perfectly indented blocks. The reason for this is that when the blocks are taken from the database, the \t\t should be detected at the first line to allow dynamic formatting.

[\t+]?{\d+}[\s\S]*?(?=\s*{|$)
You can use this. See the demo:
https://regex101.com/r/nNUHJ8/1
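For illustration, here is the same pattern used from Python. This is only a sketch: the shortened sample string and the \t\t indentation are assumptions based on the question.

import re

# Shortened version of the question's text, with \t\t assumed before the
# poetry lines, as the asker describes.
text = (
    "{1} The words of the blessing of Enoch... {2} And he took up his parable and said... {3} Concerning the elect I said:\n"
    "\t\tThe Holy Great One will come forth from His dwelling,\n"
    "\t\t{4} And the eternal God will tread upon the earth, [even] on Mount Sinai,\n"
    "\t\tAnd appear from His camp\n"
)

pattern = re.compile(r"[\t+]?{\d+}[\s\S]*?(?=\s*{|$)")
for match in pattern.finditer(text):
    print(repr(match.group()))   # each numbered block, stopping before the next {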

Related

Alternation in regexes seems to be terribly slow in big files

I am trying to use this regex:
my @vulnerabilities = ($g ~~ m:g/\s+("Low"||"Medium"||"High")\s+/);
On chunks of files such as this one, where each chunk runs from one "sorted" to the next. Each chunk must be a few hundred kilobytes, and all of them together take from 1 to 3 seconds (divided by 32 per iteration).
How can this be sped up?
Inspection of the example file reveals that the strings only occur as a whole line, starting with a tab and a space. From your responses I further gathered that you're really only interested in counts. If that is the case, then I would suggest something like this solution:
my %targets = "\t Low", "Low", "\t Medium", "Medium", "\t High", "High";
my %vulnerabilities is Bag = $g.lines.map: {
    %targets{$_} // Empty
}
dd %vulnerabilities; # ("Low"=>2877,"Medium"=>54).Bag
This runs in about .25 seconds on my machine.
It always pays to look at the problem domain thoroughly!
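For readers more at home in Python, the same look-each-line-up-and-count idea can be sketched like this (illustration only; the file name and the exact line format are assumptions carried over from the answer above):

from collections import Counter

# Assumed: the report is in report.txt and the interesting lines are exactly
# "\t Low", "\t Medium" or "\t High", as in the Raku answer above.
targets = {"\t Low": "Low", "\t Medium": "Medium", "\t High": "High"}

with open("report.txt") as fh:
    counts = Counter(targets[line]
                     for line in (raw.rstrip("\n") for raw in fh)
                     if line in targets)

print(counts)   # e.g. Counter({'Low': 2877, 'Medium': 54})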
This can be simplified a little bit. You use \s+ before and after, but is this necessary? I think you just need to ensure a word boundary or a single whitespace character, so you can use
\s("Low"||"Medium"||"High")\s
or you can use \b instead of \s.
The second step is to avoid the capturing group and use a non-capturing group instead, because the regex engine wastes time and memory "remembering" groups, so you could try:
\s(?:"Low"||"Medium"||"High")\s
TL;DR I've compared solutions on a recent rakudo, using your sample data. The ugly brute-force solution I present here is about twice as fast as the delightfully elegant solution Liz has presented. You could probably improve times another order of magnitude or more by breaking your data up and parallel processing it. I also discuss other options if that's not enough.
Alternations seems like a red herring
When I eliminated the alternation (leaving just "Low") and ran the code on a recent rakudo, the time taken was about the same. So I think that's a red herring and have not studied that aspect further.
Parallel processing looks promising
It's clear from your data that you could break it up, splitting at some arbitrary line, and then pattern match each piece in parallel, and then combine results.
That could net you a substantial win, depending on various factors related to your system and the data you process.
But I haven't explored this option.
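For what it's worth, the split-and-combine idea could be sketched like this in Python rather than Raku (not something the answer benchmarked; the pattern, the chunk count and the use of a process pool are my assumptions):

import re
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

PATTERN = re.compile(r"\t (Low|Medium|High)\n")

def count_chunk(chunk):
    return Counter(PATTERN.findall(chunk))

def parallel_counts(text, n_chunks=8):
    # split at line boundaries so no match straddles two chunks
    lines = text.splitlines(keepends=True)
    step = max(1, len(lines) // n_chunks)
    chunks = ["".join(lines[i:i + step]) for i in range(0, len(lines), step)]
    total = Counter()
    with ProcessPoolExecutor() as pool:
        for partial in pool.map(count_chunk, chunks):
            total += partial
    return total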
The fastest results I've seen
The fastest results I've seen are with this code:
my %counts;
$g ~~ m:g / "\t " [ 'Low' || 'Medium' || 'High' ] \n { %counts{$/}++ } /;
say %counts.map: { .key.trim, .value }
This displays:
((Low 2877) (Medium 54))
This approach incorporates similar changes to those Michał Turczyn discussed, but pushed harder:
I've thrown away all capturing, not only not bothering to capture the 'Low' or whatever, but also throwing away all results of the match.
I've replaced the \s+ patterns with concrete characters rather than character classes. I've done so on the basis that my casual tests with a recent rakudo suggested that's a bit faster.
Going beyond raku's regexes
Raku is designed for full Unicode generality. And its regex engine is extremely powerful. But it looks like your data is just ASCII and your pattern is a typical very simple regex. So you're using a sledgehammer to crack a nut. This shouldn't really matter -- the sledgehammer is supposed to be just fine as a nutcracker too -- but raku's regex engine remains very poorly optimized thus far.
Perhaps this nut is just a simple example and you're just curious about pushing raku's built in regex capabilities to their maximum current performance.
But if not, and you need yet more speed, and the speedups from this or other better solutions in raku, coupled with parallel processing, aren't enough to get you where you need to go, it's worth considering either not using raku or using it with another tool.
One idiomatic way to use raku with another tool is to use an Inline, with the obvious one in this case being Inline::Perl5. Using that you can try perl's fast default built in regex engine or even use its regex plugin capability to plug in a really fast regex engine.
And, given the simplicity of the pattern you're matching, you could even eschew regexes altogether by writing a quick bit of glue to some low-level raw text searching tool (perhaps saving character offsets and then generating corresponding raku match objects from the results).

Select all lines starting with a character and keep only one entry

I'm using a Firefox addon called "rikaisama"; this addon is a pop-up dictionary for Japanese and it accepts EPWING dictionary files. In the addon options we can use one regular expression to remove unnecessary parts of a dictionary entry.
I'm using the "Kenkyusha's New Japanese-English Dictionary" EPWING file but it has way too many examples to be readable.
Example of an entry :
まにあう【間に合う】 ローマ(maniau)
1 〔時間に遅れない〕 be in time 《for…》.
▲7 時の列車に間に合う catch [make] the 7 o'clock train
・締め切りに間に合う meet the deadline
・開演に間に合う arrive before curtain time
▲9 時の札幌行きに間に合うように空港に着いた. I arrived in time for the nine o'clock flight to Sapporo.
・「間に合うかな」「走っても間に合いそうにないね」 "Will we be in time?"―"It doesn't look like we'll be in time even if we run."
2 〔役に立つ〕 answer [serve, suit, meet] the purpose; be useful; be serviceable; be of use [service]; be good enough; 〔十分である〕 be enough; 〔用意ができる〕 be ready; 〔必要をみたす〕 meet the requirements; serve the [one's] turn [need].
▲「費用はどのぐらいかな」「5 万もあれば間に合うよ」 "And what is the expense?"―"Fifty-thousand yen should cover it."
・これだけあれば丸 1 年は間に合う. This will last us [see us through] one whole year. | This will be enough for a whole year.
Where all entries starting with "▲" or "・" are examples and all entries matching this regex are definitions :
\n[″*〖〈《⇒=➡【〔(〜A-Za-z0-9].*
I already managed to come up with this regular expression on my own but it removes all examples:
\n[^″*〖〈《⇒=➡【〔(〜A-Za-z0-9].*
Is it possible to have a regex matching this regex AND the following line of the match?
Wished result :
まにあう【間に合う】 ローマ(maniau)
1 〔時間に遅れない〕 be in time 《for…》.
▲7 時の列車に間に合う catch [make] the 7 o'clock train
2 〔役に立つ〕 answer [serve, suit, meet] the purpose; be useful; be serviceable; be of use [service]; be good enough; 〔十分である〕 be enough; 〔用意ができる〕 be ready; 〔必要をみたす〕 meet the requirements; serve the [one's] turn [need].
▲「費用はどのぐらいかな」「5 万もあれば間に合うよ」 "And what is the expense?"―"Fifty-thousand yen should cover it."
Any help appreciated !
You almost had it I think - just add \n.* so it becomes /\n[″*〖〈《⇒=➡【〔(〜A-Za-z0-9].*\n.*/. That makes it get the next line...
See it in action here: https://regexr.com/442st
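To see what that pattern actually grabs, here is a quick Python check against a shortened version of the entry (illustration only; rikaisama applies the expression internally, so this is just a way to inspect the matches):

import re

entry = (
    "まにあう【間に合う】 ローマ(maniau)\n"
    "1 〔時間に遅れない〕 be in time 《for…》.\n"
    "▲7 時の列車に間に合う catch [make] the 7 o'clock train\n"
    "・締め切りに間に合う meet the deadline\n"
    "2 〔役に立つ〕 answer [serve, suit, meet] the purpose.\n"
    "▲「費用はどのぐらいかな」 \"And what is the expense?\"\n"
)

pattern = re.compile(r"\n[″*〖〈《⇒=➡【〔(〜A-Za-z0-9].*\n.*")
for m in pattern.finditer(entry):
    print(repr(m.group()))   # each definition line plus the single line after it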

Regex to capture transcript speaker names before colon

From text transcripts, I want to capture all names of speakers.
The target names start at the start of a line and should end at a ": " (ie. colon and space).
Optionally, for even finer control, it may be safe to assume the first colon and two spaces.
Example text:
Julian Z.: What's really exciting is the opportunity to be more intelligent about how you approach trying to reach your consumer. In a world where digital and the use of digital has exploded, to be able to have one-on-one conversations in the digital world, and to be able to eventually translate that into the TV space, whether that be addressable or data-driven, is really fantastic. Because at the end of the day, you want your brand, in our case, our networks, to be able to have a relationship with the consumer. Data is a proxy to allow for that to occur.
From an advertiser perspective, obviously now the ability to go to the broadcast networks and have a data-driven buy has absolutely blown up and proliferated. That's with us. That's with some of our competitors. Obviously, we think we're the best at it, but neither here nor there. I think it's a really wonderful foundational approach for advertisers to take. I think it's a great advancement in the market.
As a spender of money, and as somebody who is trying to get people to engage with our brands, the ability to use data to really have, again, these really one-on-one, unique conversations, and to be able to deliver creative content that's relevant for individual consumers, that's driven by what we know about the consumer, now, ultimately, where we can reach them effectively and in environments where we know they're engaged, is really a great, tremendous advancement. You'll see by our ratings numbers, which are on the upswing, that approach has really had a direct impact on what our linear ratings have resulted in.
Speaker 2: Great. Tell us a little bit about Viacom. It's a lot of fans, a lot of passion in people. How do you define the audience in broad strokes? How do they respond to advertising and what are some of the concerns that consumers have around ads?
Julian Z.: Well, I think, again, when you're talking about how we're reaching fans, it is using intelligence, and information, and data, not only to profile who our fans are, but ultimately where they're best reached. Our job is to deliver great, compelling content, which we believe we're really, really good at.
In order to do that, there's the linear side of the equation, but of course we want to make sure that we're reaching our fans in digital as well, and that there's a 360 kind of fan experience. We believe holistically that our fans are really the base of what we're trying to do. We're trying to please and create value for our fans. The more we engage with them, and the more we know about them, the better we're able to deliver customized content that fits their need.
Ultimately, as a content creator, what's more exciting than to delivery really great content to people that they really, really engage with and they build relationships with? That's all you can really hope for is, somebody that creates content, is to be able to develop compelling content and content that your audience really wants to engage with.
Speaker 2: When you look at targeting, is that a cross-platform? Where does that targeting happen?
Julian Z.: It absolutely is cross-platform. Of course, there is natural addressability in the digital market, because it is much more of a one-to-one. But now you see a lot of the MVPDs have obviously opened up addressable inventory. A lot of the MVPDs now have matured their addressable footprint, which allows you now to have a digital-like, not exactly the same obviously, but a digital-like experience in the linear space, to deliver content to the consumer or advertising to the consumer when it's relevant and when it's going to have the most impact for your message.
Ultimately, it's absolutely cross-platform because addressability is all about having that conversation, having that direct one-to-one with your audience. Our partners on the MVPD side have really matured over the last several years as of regard to addressable, and now you can have that 360 experience of having a conversation in linear and in digital that really is addressable.
Example strings to be captured are: Julian Z. and Speaker 2. Names will vary from text to text. I need all/multiple names present. As you see, names may include a mixture of alpha case, punctuation characters and numbers.
I will want to deduplicate names, which are repeated in the text, but believe I should shelve that for now, focusing this question on the capture.
I have tried plenty, for the last day or two.
eg. ^[^:]+\s* with /g comes close, but only captures the first, single Julian Z., whereas I want everything. For now, I am out of ideas and need to learn how to do this.
Regex to match any characters up until the first colon:
/^.*?(?=:)/gm
https://regex101.com/r/3uyXMM/3
^: match from beginning of line
.: match anything
*?: non-greedy search, so it stops at first colon (see next line)
(?=:): positive lookahead meaning next character should be colon but it doesn't capture
g: don't return after first match, returns all matches
m: run regex for each line
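In Python the same pattern could be exercised like this (a sketch; transcript stands in for the full text above):

import re

transcript = (
    "Julian Z.: What's really exciting is the opportunity...\n"
    "Speaker 2: Great. Tell us a little bit about Viacom...\n"
    "Julian Z.: Well, I think, again...\n"
)

names = re.findall(r"^.*?(?=:)", transcript, flags=re.MULTILINE)
print(names)   # ['Julian Z.', 'Speaker 2', 'Julian Z.']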
You can use this regex based on a negated character class:
/^\w[^:\n]*/mg
RegEx Demo 1
RegEx Demo 2
RegEx Breakup:
^\w: Match a word character at the start
[^:\n]*: Match zero or more of any character that is not a colon and not a newline.
Code:
var names = inputData.transcript.match(/^\w[^:\n]*/mg) || [];
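The question also mentions wanting to de-duplicate the names later. In Python (rather than the JavaScript above) that could be done while preserving first-appearance order, for example:

import re

transcript = "Julian Z.: What's exciting...\nSpeaker 2: Great. Tell us...\nJulian Z.: Well, I think...\n"
names = re.findall(r"^\w[^:\n]*", transcript, flags=re.MULTILINE)
unique_names = list(dict.fromkeys(names))
print(unique_names)   # ['Julian Z.', 'Speaker 2']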

Regular expression to expand to sentence

I'm trying to extract regions around keywords from longer passages of text. They should include complete sentences, based on the following conditions:
n=250 characters before / after the keyword should be included if they exist (the keyword can be closer than this to the start / end of the text)
from there it should expand further to include the complete sentence (let's assume here we can define sentence borders with ".?! or :", knowing it's not completely accurate)
I have already achieved expanding to the end of the last sentence, but not to the start of the first, in the following example, where vitamin is the keyword and the italicised part is what the regex captures. However, it should capture from "An extra 24 hours...".
Apparently I can't get the corresponding group up front, neither using lazy matching nor using lookbehind.
/((.{0,250}(vitamin)\b.{0,250})(.+?(\.|\!|\?|\:))?)/ig
Well, this year you’re getting an extra day to get ahead on your taxes or (finally) clean out the garage. (Hey, we’re not trying to tell you what do but you might as well be productive.) February 29 is back on the calendar this year because it’s a leap year. Whether you love or loathe the extra winter day, you’re probably wondering why it happens in the first place. An extra 24 hours — or day — is built into the calendar every four years to ensure it aligns with the Earth’s movement around the sun. There’s 365 days in a calendar year, but it actually takes longer for the Earth’s annual journey — about 365.2421 days — around the star that gives us light, life and vitamin D. The difference may seem like no big deal to us, but over time, it adds up. “To ensure consistency with the true astronomical year, it is necessary to periodically add in an extra day to make up the lost time and get the calendar back in sync with the heavens,” according the history.com.
Acknowledgement of the need for a leap year happened around the time of Julius Caesar. In 46 B.C., Caesar enlisted the help of astronomer Sosigenes to update the calendar so that it had 12 months and 365 days, including a leap year every four years.
You can try something like this:
(([.?!:][^.?!:]*.{250}\bvitamin\b.{250})[^.?!:]*[.?!:])
It works by consuming 250 characters of text before and after the keyword "vitamin". From that point it finds the first punctuation point (.?!:) before/after the 250 characters of text.
Here's a sample of it in action.
You can use extra parentheses () to strategically group exactly the output you want. For example, the above answer includes the ending period from the preceding sentence in the output. So you could use
(([.?!:]([^.?!:]*.{250}\bvitamin\b.{250})[^.?!:]*[.?!:]))
and use group 3 from the result set which doesn't have this ending period.
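For illustration, the same answer rendered in Python (it assumes the passage from the question is saved in passage.txt; note that . does not cross newlines, so the 250-character windows must fit on one line of the file):

import re

with open("passage.txt", encoding="utf-8") as fh:
    text = fh.read()

pattern = re.compile(r"(([.?!:][^.?!:]*.{250}\bvitamin\b.{250})[^.?!:]*[.?!:])")
m = pattern.search(text)
if m:
    print(m.group(1))   # the region, including the previous sentence's closing mark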
I do not see how the specification in the question can be matched by a regex. It boils down to the following logic problem:
to match as many characters as possible but no more than 250 before/after the keyword, .{0,250} needs to be greedy and can neither be lazy .{0,250}? nor possessive .{0,250}+
if this part is greedy, you will miss the occurrences of the keyword that start before the .{0,250} part is matched.
The same logic applies, to my understanding, to matching back to the start of the sentence as well.
I played around with the following more or less meaningful regex:
[.?!:]?([^.?!:]*?(.{0,250}\byear\b.{0,250})[^.?!:]*[.?!:]?) misses first 'year'
[.?!:]?([^.?!:]*?(.{0,250}?\byear\b.{0,250})[^.?!:]*[.?!:]?) gets the first 'year' but fails on others.
I suggest you write your own extraction logic in a function, either using regex or not, to achieve the extraction you want.
You could for example find the index of the start of the keyword \bkeyword\b and the full stops (\.[^\d]|[.?!:]$) and then with this information extract the part of the text you want.
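A sketch of what that hand-rolled logic could look like in Python (my own illustration, not code from the answer; the sentence-boundary rule is deliberately simplified to the [.?!:] set used above):

import re

def extract_region(text, keyword, window=250):
    kw = re.search(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE)
    if kw is None:
        return None
    # take up to `window` characters either side of the keyword...
    start = max(0, kw.start() - window)
    end = min(len(text), kw.end() + window)
    # ...then widen left to just after the previous sentence-ending mark
    left_marks = [m.end() for m in re.finditer(r"[.?!:]", text[:start])]
    start = left_marks[-1] if left_marks else 0
    # ...and widen right to finish the current sentence
    right_mark = re.search(r"[.?!:]", text[end:])
    end = end + right_mark.end() if right_mark else len(text)
    return text[start:end].strip()

So extract_region(passage, "vitamin") would first take the 250-character window on each side of the keyword and then stretch both ends out to whole sentences.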

Regex, writing a toy compiler, parsing, comment remover

I'm currently working my way through this book:
http://www1.idc.ac.il/tecs/
I'm currently on a section where the exercise is to create a compiler for a very simple Java-like language.
The book always states what is required but not the how (which is a good thing). I should also mention that it talks about yacc and lex and specifically says to avoid them for the projects in the book, for the sake of learning on your own.
I'm on chapter 10 and starting to write the tokenizer.
1) Can anyone give me some general advice - are regexes the best approach for tokenizing a source file?
2) I want to remove comments from source files before parsing. This isn't hard, but most compilers tell you the line an error occurs on, and if I just remove comments this will mess up the line count. Are there any simple strategies for preserving the line count while still removing junk?
Thanks in advance!
The tokenizer itself is usually written using a large DFA table that describes all possible valid tokens (stuff like: a token can start with a letter, followed by other letters/numbers, followed by a non-letter; or with a number followed by other numbers and either a non-number/point, or a point followed by at least 1 number and then a non-number; etc.). The way I built mine was to identify all the regular expressions my tokenizer will accept, transform them into DFAs and combine them.
Now to "remove comments", when you're parsing a token you can have a comment token (the regex to parse a comment, too long to describe in words), and when you finish parsing this comment you just parse a new token, thus ignoring it. Alternatively you can pass it to the compiler and let it deal with it (or ignore it as it will). Either aproach will preserve meta-data like line numbers and characters-into-the-line.
edit for DFA theory:
Every regular expression can be converted (and is converted) into a DFA for performance reasons. This removes any backtracking in parsing them. This link gives you an idea of how this is done. You first convert the regular expression into an NFA (a DFA with backtracking), then remove all the backtracking nodes by inflating your finite automata.
Another way you can build your DFA is by hand using some common sense. Take for example a finite automata that can parse either an identifier or a number. This of course isn't enough, since you most likely want to add comments too, but it'll give you an idea of the underlying structures.
          A-Z       space
->(Start)----->(I1)------->((Identifier))
    |           | ^
    |           +-+
    |         A-Z0-9
    |
    |              space
    +---->(N1)---+--->((Number)) <----------+
     0-9   | ^   |                          |
           | |   | .       0-9        space |
           +-+   +--->(N2)----->(N3)--------+
           0-9                   | ^
                                 +-+
                                 0-9
Some notes on the notation used, the DFA starts at the (Start) node and moves through the arrows as input is read from your file. At any one point it can match only ONE path. Any paths missing are defaulted to an "error" node. ((Number)) and ((Identifier)) are your ending, success nodes. Once in those nodes, you return your token.
So from the start, if your token starts with a letter, it HAS to continue with a bunch of letters or numbers and end with a "space" (spaces, new lines, tabs, etc). There is no backtracking; if this fails, the tokenizing process fails and you can report an error. You should read a theory book on error recovery to continue parsing, it's a really huge topic.
If however your token starts with a number, it has to be followed by either a bunch of numbers or one decimal point. If there's no decimal point, a "space" has to follow the numbers; otherwise a digit has to follow, followed by a bunch of numbers, followed by a space. I didn't include scientific notation but it's not hard to add.
Now for parsing speed, this gets transformed into a DFA table, with all nodes on both the vertical and horizontal lines. Something like this:
            I1             Identifier   N1       N2       N3       Number
start       letter         nothing      number   nothing  nothing  nothing
I1          letter+number  space        nothing  nothing  nothing  nothing
Identifier  nothing        SUCCESS      nothing  nothing  nothing  nothing
N1          nothing        nothing      number   dot      nothing  space
N2          nothing        nothing      nothing  nothing  number   nothing
N3          nothing        nothing      nothing  nothing  number   space
Number      nothing        nothing      nothing  nothing  nothing  SUCCESS
The way you'd run this is you store your starting state and move through the table as you read your input character by character. For example an input of "1.2" would parse as start->N1->N2->N3->Number->SUCCESS. If at any point you hit a "nothing" node, you have an error.
edit 2: the table should actually be node->character->node, not node->node->character, but it worked fine in this case regardless. It's been a while since I last wrote a compiler by hand.
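A sketch of that table-driven loop in Python, using the corrected node->character->node orientation (the character classifier and table entries are a simplified version of the example above, not code from the answer):

def char_class(c):
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "number"
    if c == ".":
        return "dot"
    if c.isspace():
        return "space"
    return "other"

TABLE = {
    "start": {"letter": "I1", "number": "N1"},
    "I1":    {"letter": "I1", "number": "I1", "space": "Identifier"},
    "N1":    {"number": "N1", "dot": "N2", "space": "Number"},
    "N2":    {"number": "N3"},
    "N3":    {"number": "N3", "space": "Number"},
}
ACCEPTING = {"Identifier", "Number"}

def run(token):
    state = "start"
    for c in token + " ":                  # trailing space terminates the token
        state = TABLE.get(state, {}).get(char_class(c))
        if state is None:                  # a "nothing" cell in the table
            return "error"
    return "SUCCESS" if state in ACCEPTING else "error"

print(run("1.2"))    # start -> N1 -> N2 -> N3 -> Number -> SUCCESS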
1- Yes, regexes are good for implementing the tokenizer. If using a generated tokenizer like lex, then you describe each token as a regex. See Mark's answer.
2- The lexer is what normally tracks line/column information; as tokens are consumed by the tokenizer, you track the line/column information with the token, or keep it as current state. That way, when a problem is found, the tokenizer knows where you are. When processing comments, as new lines are consumed the tokenizer just increments the line_count.
In Lex you can also have parsing states. Multi-line comments are often implemented using these states, thus allowing simpler regexes. Once you find the match to the start of a comment, e.g. '/*', you change into the comment state, which you can set up to be exclusive from the normal state. Therefore as you consume text looking for the end comment marker '*/' you do not match normal tokens.
This state-based process is also useful for processing string literals that allow nested end markers, e.g. "test\"more text".
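To make the state idea concrete, here is a hand-written sketch in Python (not lex) of a scanner that switches into exclusive COMMENT and STRING states, so the normal token rules never fire inside comments or string literals and line counting keeps working:

def scan(src):
    state, i, line, buf = "NORMAL", 0, 1, []
    while i < len(src):
        c = src[i]
        if c == "\n":
            line += 1                      # lines are counted in every state
        if state == "NORMAL":
            if src.startswith("/*", i):
                state, i = "COMMENT", i + 2
                continue
            if c == '"':
                state, buf, i = "STRING", [], i + 1
                continue
            # ...normal token rules would go here...
        elif state == "COMMENT":
            if src.startswith("*/", i):
                state, i = "NORMAL", i + 2
                continue
        elif state == "STRING":
            if c == "\\":                  # escaped character, e.g. \"
                buf.append(src[i:i + 2])
                i += 2
                continue
            if c == '"':
                yield "string", "".join(buf), line
                state, i = "NORMAL", i + 1
                continue
            buf.append(c)
        i += 1

For example, list(scan('s = "test\\"more text"; /* comment */')) yields one string token containing the escaped quote, and the comment is skipped without losing the line count.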