REGEX - Select multiple lines unless it finds the defined stop character - regex

I have a String
Lorem ipsum dolor sit amet
*consectetur adipiscing elit
sed do eiusmod tempor incididunt*
ut labore et dolore magna aliqua
Ut enim ad minim veniam.
Now I want to select the content between the two * characters.
My current regex is [*](.*?)[*], but it only works when both asterisks are on a single line, e.g.:
*consectetur adipiscing elit*
How do I make it multiline?

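This regex worked for me: [*]([\\s\\S]*?)[*]
The backslashes are doubled because the pattern sits inside a string literal. [\s\S] matches any character, including line breaks, whereas . stops at a line break by default.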
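To illustrate the difference, here is a minimal sketch in Python (the choice of Python is an assumption; the question does not name a language). It compares the single-line pattern, the [\s\S] variant from the answer, and the equivalent re.DOTALL flag.

```python
import re

text = (
    "Lorem ipsum dolor sit amet\n"
    "*consectetur adipiscing elit\n"
    "sed do eiusmod tempor incididunt*\n"
    "ut labore et dolore magna aliqua\n"
    "Ut enim ad minim veniam."
)

# Original pattern: '.' does not match '\n', so it cannot span two lines.
single_line = re.search(r"[*](.*?)[*]", text)

# [\s\S] matches any character at all, newlines included.
with_class = re.search(r"[*]([\s\S]*?)[*]", text)

# Same effect with the DOTALL flag, which lets '.' match newlines.
with_dotall = re.search(r"[*](.*?)[*]", text, re.DOTALL)

inner = with_class.group(1)
print(inner)
```

The lazy quantifier *? matters in both variants: it stops at the first closing asterisk rather than the last one in the text.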

Related

Extract specific string via regex from textfile

I have a text file with a lot of content. I want to extract the following text fragments beginning with TXT_ and ending with a ).
e.g.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. (TXT_I_WANT_TO_EXTRACT_THIS). At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. (TXT_AND_THIS) Lorem ipsum dolor sit amet.
Expected result:
TXT_I_WANT_TO_EXTRACT_THIS
TXT_AND_THIS
I just need the regex for the result.
Thank you so much for your help.
Greetings
Not sure in which language you are working. In R, for example, you can use str_extract_all:
str_extract_all(txt, "TXT\\w+")
[[1]]
[1] "TXT_I_WANT_TO_EXTRACT_THIS" "TXT_AND_THIS"
Even if you don't work in R, the pattern used in the solution will not change greatly; it is in fact simple: supposing that all target strings start with the same literal pattern, say "TXT", and only contain alphabetic characters and the underscore, the parts after "TXT" can conveniently be matched by \\w+, a character class for alphanumeric characters and the underscore.
txt <- "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. (TXT_I_WANT_TO_EXTRACT_THIS). At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. (TXT_AND_THIS) Lorem ipsum dolor sit amet."
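For readers not using R, the same idea can be sketched in Python (an assumption, since the question does not say which language is in use). The second pattern is a stricter variant, not part of the answer above, that only accepts fragments wrapped in parentheses.

```python
import re

txt = (
    "Lorem ipsum dolor sit amet, consetetur sadipscing elitr. "
    "(TXT_I_WANT_TO_EXTRACT_THIS). At vero eos et accusam. "
    "(TXT_AND_THIS) Lorem ipsum dolor sit amet."
)

# \w+ matches letters, digits and underscores, so it stops at ')'.
found = re.findall(r"TXT_\w+", txt)

# Stricter variant: require the surrounding parentheses, capture the inside.
found_strict = re.findall(r"\((TXT_\w+)\)", txt)

print(found)
```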

Find content inside parentheses and the word that comes before it (OpenRefine)

I'm trying to extract some text from a column on a CSV file. Here is an example:
"Lorem ipsum dolor sit amet (2015), consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua (2000)."
I want to get a new column with "amet (2015)" and "aliqua (2000)". This expression gives me the (2015) and (2000): value.find(/\(.*?\)/)
But how can I also get the word before the parentheses?
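Here is the regex you are looking for: /\w* \([^\)]*\)/gm. It matches a word, a space, and then everything from the opening parenthesis up to the next closing one.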
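The pattern can be checked outside OpenRefine. The sketch below uses Python's re module rather than GREL's value.find, so it only demonstrates the regex itself, not the OpenRefine call.

```python
import re

value = (
    "Lorem ipsum dolor sit amet (2015), consectetur adipiscing elit, "
    "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua (2000)."
)

# A word, one space, then a parenthesized group with no ')' inside it.
matches = re.findall(r"\w* \([^\)]*\)", value)

print(matches)
```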

Xml-Fo to keep justified text with same number of words in line as in left

I want justified text in Apache FOP to have the same number of words per line as the same text with left alignment.
The goal is to make the justified text to have exactly the same words on each line, as the one with align="left".
I would try a different formatter. If anything, one would assume that the justified text would yield fewer words per line, or the same number, not more.
I used this sample:
<fo:flow flow-name="xsl-region-body">
<fo:block-container>
<fo:block>Dolores vero dolor sed accusam ipsum tempor justo ipsum tempor lorem vel
ipsum no. Ipsum dolor eleifend lorem ipsum dolor in takimata sit dolor. Sit et
te te duo diam diam stet no assum invidunt illum lorem no. Augue amet clita no
sit tempor nonumy nulla gubergren elitr liber magna ut facilisis eum et eos.
Tempor magna accusam elitr sit voluptua et aliquyam ut consectetuer lorem et
volutpat stet diam dolor voluptua consetetur amet. Eos magna stet. Ea dolore
adipiscing eos et dolor dolore et magna nulla eleifend. Sed est at molestie
dolore amet sit accusam nulla consetetur vero mazim sed kasd lobortis dolore.
Amet ipsum sanctus duo nibh vero invidunt quis clita at dolores nonumy eum kasd
vero illum eum eos est.</fo:block>
</fo:block-container>
<fo:block space-after="12pt"><fo:leader/></fo:block>
<fo:block-container text-align="justify">
<fo:block>Dolores vero dolor sed accusam ipsum tempor justo ipsum tempor lorem vel
ipsum no. Ipsum dolor eleifend lorem ipsum dolor in takimata sit dolor. Sit et
te te duo diam diam stet no assum invidunt illum lorem no. Augue amet clita no
sit tempor nonumy nulla gubergren elitr liber magna ut facilisis eum et eos.
Tempor magna accusam elitr sit voluptua et aliquyam ut consectetuer lorem et
volutpat stet diam dolor voluptua consetetur amet. Eos magna stet. Ea dolore
adipiscing eos et dolor dolore et magna nulla eleifend. Sed est at molestie
dolore amet sit accusam nulla consetetur vero mazim sed kasd lobortis dolore.
Amet ipsum sanctus duo nibh vero invidunt quis clita at dolores nonumy eum kasd
vero illum eum eos est.</fo:block>
</fo:block-container>
</fo:flow>
I format with Apache FOP and it yields this (replicating what you get). Note you can see that justified text actually yields more words per line which means to me that justification is changing letter and word spacing so much so that it yields more words on a line.
I format with RenderX XEP and it yields this, which is what you expect. The justification just increases letter and word spacing in a pleasing manner so that the same number of words are on each line.
I would note that in my (totally biased as I work for RenderX) opinion, the original left-justified text by RenderX is much better. The ragged right edges are much tighter and it is already fitting content even in a left-justification. RenderX is not merely word/space/word/space ... break because a word does not fit on the line. RenderX has an adjustment for line tightness that will already squeeze to fit even with left-justified text if a word's length at line end is within a certain threshold. FOP does not do this or it is not apparent that it does.
I tried modifying things like letter and word spacing with no success. Not sure you can report this as a "bug" because it really isn't. It is merely a difference in the core engine behavior.
FOP may have implemented TeX line breaking too literally. I haven't seen a way to stop it from shrinking white-space.
FOP implements the TeX line-breaking algorithm, as does AH Formatter (https://www.antenna.co.jp/AHF/help/en/ahf-tech.html#line-breaking). The TeX algorithm (http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf) allows spaces to have both 'stretchability' and 'shrinkability'.
word-spacing (https://www.w3.org/TR/xsl11/#word-spacing) is the XSL property that, unsurprisingly, controls the spacing between words. CSS also has word-spacing, but the XSL word-spacing can be a <space>, with .minimum, .optimum, and .maximum components. As I read the word-spacing definition, the default normal value means that a space should be the width of a space.
text-align="justify" is defined as expanding the contents (https://www.w3.org/TR/xsl11/#text-align).
So as I read the spec, text-align="justify" should give the same words per line as text-align="start" (the XSL default value), as in the Antenna House sample below and @KevinBrown's sample. Squeezing in more words per line should require a negative word-spacing.minimum value, as below.

Regex match multiple pattern

Below is my test string:
Object: TLE-234DSDSDS324-234SDF324ER
Page location: SDEWRSD3242SD-234/324/234 (1)
org-chart Lorem ipsum dolor consectetur adipiscing # Colorado
234DSDSDS324-32-4/2/7-page2 (2) loc log Apr 18 21:42:49 2017 1
Page information: 3.32.232.212.23, Error: fatal, Technique: color
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Validation status: Lorem ipsums dolors sits amets, consectetur adipiscing elit
Positive control-export: Validated
Page location: SDEWRSD3242SD-SDF/234/324 (5)
org-chart Lorem ipsum dolor consectetur adipiscin # Arizona
234DSDSDS324-23-11/1/0-page1 (1) loc log Apr 18 21:42:49 2017 1
Page information: 3.32.232.212.23, Error: log, Technique: color
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Validation status: Lorem ipsums dolors sits amets, consectetur adipiscing elit
Positive control-export: Validated
I need to capture strings after the "Page location: ", "Object: " and "Comments: "
For example:
Object: TLE-234DSDSDS324-234SDF324ER - Group 1
Page location: SDEWRSD3242SD-234/324/234 (1) - Group 2
Page location: SDEWRSD3242SD-SDF/234/324 (5) - Group 3
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. - Group 4
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. - Group 5
Here is my regex URL.
I am able to capture the strings, but the regex won't capture them if any one of the strings is repeated.
(See comments below the question for the problem description.)
The data is in a multi-line string, with multiple sections starting with Object:. Within each section there are multiple lines starting with the phrases Page location: and Comments:. The rest of the line after each of these needs to be captured, and everything organized by Object.
Instead of attempting a tortured multi-line "single" regex, break the string into lines and process section by section. This way the problem becomes a very simple one.
The results are stored in an array of hashrefs; the keys of each hashref are the phrases shown above. Since a phrase can appear more than once per section, each value is an arrayref (holding what follows the phrase on each line).
use warnings;
use strict;
use feature 'say';

my $input_string = '...';
my @lines = split /\n/, $input_string;
my $patt = qr/Object|Page location|Comments/;

my @sections;
for (@lines)
{
    # Skip lines that do not start with one of the phrases
    next if not /^\s*($patt):\s*(.*)/;
    # An Object line opens a new section
    push @sections, {} if $1 eq 'Object';
    # Append the rest of the line under its phrase in the current section
    push @{ $sections[-1]->{$1} }, $2;
}

foreach my $sec (@sections) {
    foreach my $key (sort keys %$sec) {
        say "$key:";
        say "\t$_" for @{$sec->{$key}};
    }
}
With the input string copied (suppressed above for brevity), the output is
Comments:
    Lorem ipsum dolor sit amet, [...]
    Lorem ipsum dolor sit amet, [...]
Object:
    TLE-234DSDSDS324-234SDF324ER
Page location:
    SDEWRSD3242SD-234/324/234 (1)
    SDEWRSD3242SD-SDF/234/324 (5)
A few comments.
Once the Object line is found we add a new hashref to @sections. Then the match for a pattern is set as a key and the rest of its line added to its arrayref value. This is done for the current (so last) element of @sections.
This adds an empty string if a pattern had nothing following. To disallow that, add next if not $2;
Note. An easy and common way to print complex data structures is via the core module Data::Dumper. But also see Data::Dump for a much more compact printout.
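For comparison, the same line-by-line approach can be sketched in Python. This is a rough port, not the answer's code: the sample input is shortened (the comment text is abbreviated as in the output above), and skipping lines that appear before the first Object line is an added assumption.

```python
import re

# Shortened sample input; the Comments text is abbreviated.
input_string = """Object: TLE-234DSDSDS324-234SDF324ER
Page location: SDEWRSD3242SD-234/324/234 (1)
Page information: 3.32.232.212.23, Error: fatal, Technique: color
Comments: Lorem ipsum dolor sit amet, [...]
Positive control-export: Validated
Page location: SDEWRSD3242SD-SDF/234/324 (5)
Comments: Lorem ipsum dolor sit amet, [...]
"""

line_re = re.compile(r"^\s*(Object|Page location|Comments):\s*(.*)")

sections = []  # one dict per Object; each value is a list of captured strings
for line in input_string.splitlines():
    m = line_re.match(line)
    if not m:
        continue  # not one of the three phrases
    key, rest = m.groups()
    if key == "Object":
        sections.append({})  # an Object line opens a new section
    if not sections:
        continue  # assumption: ignore anything before the first Object
    sections[-1].setdefault(key, []).append(rest)

for sec in sections:
    for key in sorted(sec):
        print(f"{key}:")
        for item in sec[key]:
            print(f"\t{item}")
```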

Regular Expression for length that validate word with length but not email

I need a regular expression to validate a large text (max 2000 characters), it should work as follows:
Say n=20
If any word in the text is longer than n characters, the whole text should fail validation.
If there is an email in the text, say emailaddress@emailserver.com, then it should ignore that.
If the email address is like 123456789012345678901@email.com (the part before the @ has 21 characters, more than n), the whole text should fail validation.
I'm using ^(?!.*\S{10}).*$ as of now but it does not validate the email.
I racked my brain for a good ten minutes with no luck, I think it must have been some syntax I have yet to learn.
All suggestions would be greatly appreciated, many thanks!
I would use the following pattern:
\b[^\s@]{30,}\b
This pattern matches any word of 30 or more characters, so a match means the text contains a word that is too long.
It handles an e-mail address as two words (one before and one after the @).
var n = 30;
// doubly escaped backslashes and global search (g)
var regex = new RegExp('\\b[^\\s@]{' + n + ',}\\b', 'g');
var text = document.getElementById('text').innerHTML;
var match = text.match(regex);
if(match) {
console.log("Document contains a value with over " + n + " characters.");
console.log(match);
}
else {
console.log("Document does not contain a value with over " + n + " characters.");
}
<div id="text">
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata
sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.
Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos
et accusametjustoduodoloreseterebumstet@clita.kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla
facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet,
</div>
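The same check can be sketched in Python against the question's own examples with n = 20. The function name find_long_words is hypothetical, and the quantifier here is {n+1,} so that only words strictly longer than n are flagged, which is one reading of the requirement; the answer's pattern uses {n,} and therefore also flags words of exactly n characters.

```python
import re


def find_long_words(text, n):
    """Return every run of non-space, non-@ characters longer than n."""
    # '@' is excluded from the class, so an e-mail address is split
    # into the part before the '@' and the part after it.
    pattern = re.compile(r"\b[^\s@]{%d,}\b" % (n + 1))
    return pattern.findall(text)


ok_text = "Please write to emailaddress@emailserver.com today."
bad_text = "Please write to 123456789012345678901@email.com today."

ok_hits = find_long_words(ok_text, 20)
bad_hits = find_long_words(bad_text, 20)

# An empty list means the text passes validation.
print(ok_hits, bad_hits)
```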