Regex match multiple pattern - regex

Below is my test string:
Object: TLE-234DSDSDS324-234SDF324ER
Page location: SDEWRSD3242SD-234/324/234 (1)
org-chart Lorem ipsum dolor consectetur adipiscing # Colorado
234DSDSDS324-32-4/2/7-page2 (2) loc log Apr 18 21:42:49 2017 1
Page information: 3.32.232.212.23, Error: fatal, Technique: color
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Validation status: Lorem ipsums dolors sits amets, consectetur adipiscing elit
Positive control-export: Validated
Page location: SDEWRSD3242SD-SDF/234/324 (5)
org-chart Lorem ipsum dolor consectetur adipiscin # Arizona
234DSDSDS324-23-11/1/0-page1 (1) loc log Apr 18 21:42:49 2017 1
Page information: 3.32.232.212.23, Error: log, Technique: color
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Validation status: Lorem ipsums dolors sits amets, consectetur adipiscing elit
Positive control-export: Validated
I need to capture strings after the "Page location: ", "Object: " and "Comments: "
For example:
Object: TLE-234DSDSDS324-234SDF324ER - Group 1
Page location: SDEWRSD3242SD-234/324/234 (1) - Group 2
Page location: SDEWRSD3242SD-SDF/234/324 (5) - Group 3
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. - Group 4
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. - Group 5
Here is my regex URL.
I am able to capture the strings, but the regex fails to capture them when any one of the strings is repeated.

(See comments below the question for the problem description.)
The data is in a multi-line string, with multiple sections, each starting with Object:. Within each section there are multiple lines starting with the phrases Page location: and Comments:. The rest of the line for each of these needs to be captured, with everything organized by Object.
Instead of attempting a tortured multi-line "single" regex, break the string into lines and process it section by section. This way the problem becomes a very simple one.
The results are stored in an array of hashrefs; each hashref has the phrases shown above as its keys. Since a phrase can appear more than once per section, the values are arrayrefs (holding what follows the phrase on each line).
use warnings;
use strict;
use feature 'say';

my $input_string = '...';
my @lines = split /\n/, $input_string;

# Phrases whose lines we want to capture
my $patt = qr/Object|Page location|Comments/;

my @sections;
for (@lines)
{
    next if not /^\s*($patt):\s*(.*)/;
    # Each Object line opens a new section
    push @sections, {} if $1 eq 'Object';
    # Append the rest of the line to the arrayref for this phrase
    push @{ $sections[-1]->{$1} }, $2;
}

foreach my $sec (@sections) {
    foreach my $key (sort keys %$sec) {
        say "$key:";
        say "\t$_" for @{$sec->{$key}};
    }
}
With the input string copied (suppressed above for brevity), the output is
Comments:
	Lorem ipsum dolor sit amet, [...]
	Lorem ipsum dolor sit amet, [...]
Object:
	TLE-234DSDSDS324-234SDF324ER
Page location:
	SDEWRSD3242SD-234/324/234 (1)
	SDEWRSD3242SD-SDF/234/324 (5)
A few comments.
Once an Object line is found, we add a new hashref to @sections. Then the matched phrase is used as a key and the rest of its line is added to that key's arrayref value. This is done for the current (that is, the last) element of @sections.
This adds an empty string if a phrase has nothing following it. To disallow that, add next if not $2; after the match (note that this also skips a value of 0).
Note. An easy and common way to print complex data structures is via the core module Data::Dumper. But also see Data::Dump for a much more compact printout.
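For comparison, the same line-by-line approach can be sketched in Python. This is only an illustration and not part of the original answer: the names parse_sections and sample are made up here, the sample input is a shortened version of the test string from the question, and skipping lines that appear before the first Object is an assumption.

```python
import re

# Phrases that start a line of interest; "Object" opens a new section.
PATTERN = re.compile(r"^\s*(Object|Page location|Comments):\s*(.*)")


def parse_sections(text):
    """Return a list of dicts, one per Object, mapping phrase -> list of values."""
    sections = []
    for line in text.splitlines():
        match = PATTERN.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key == "Object":
            sections.append({})
        if not sections:
            continue  # assumption: ignore lines that appear before the first Object
        sections[-1].setdefault(key, []).append(value)
    return sections


# Shortened sample input (hypothetical, based on the question's test string).
sample = """Object: TLE-234DSDSDS324-234SDF324ER
Page location: SDEWRSD3242SD-234/324/234 (1)
Comments: Lorem ipsum dolor sit amet
Page location: SDEWRSD3242SD-SDF/234/324 (5)
Comments: Lorem ipsum dolor sit amet
"""

for section in parse_sections(sample):
    for key in sorted(section):
        print(f"{key}:")
        for value in section[key]:
            print(f"\t{value}")
```

As in the Perl version, each phrase maps to a list, so repeated Page location and Comments lines are all kept under their Object.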

Related

Find content inside parentheses and the word that comes before it (OpenRefine)

I'm trying to extract some text from a column on a CSV file. Here is an example:
"Lorem ipsum dolor sit amet (2015), consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua (2000)."
I want to get a new column with "amet (2015)" and "aliqua (2000)". This expression gives me the (2015) and (2000): value.find(/\(.*?\)/)
But how can I also get the word before the parentheses?
Here is the regex you are looking for: /\w* \([^\)]*\)/gm
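As a quick check of the idea outside OpenRefine, here is a minimal Python sketch. It is an illustration only: it uses Python's re module rather than GREL, and it swaps \w* for \w+ so that a word is required before the parentheses.

```python
import re

text = ("Lorem ipsum dolor sit amet (2015), consectetur adipiscing elit, "
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua (2000).")

# \w+      : the word right before the parentheses
# \([^)]*\): a literal "(" and ")" with anything except ")" in between
matches = re.findall(r"\w+ \([^)]*\)", text)
print(matches)  # ['amet (2015)', 'aliqua (2000)']
```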

REGEX - Select multiple lines unless it finds the defined stop character

I have a String
Lorem ipsum dolor sit amet
*consectetur adipiscing elit
sed do eiusmod tempor incididunt*
ut labore et dolore magna aliqua
Ut enim ad minim veniam.
Now I want to select the content between the asterisks.
My current regex is [*](.*?)[*], but it only works when the content is on a single line:
*consectetur adipiscing elit*
How do I make it multiline?
This regex worked for me: [*]([\\s\\S]*?)[*]
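The reason this works is that . does not match a newline by default, while [\s\S] matches any character. A minimal Python sketch of both options follows; it is an illustration only, and in a Python raw string the backslashes are written once rather than doubled.

```python
import re

text = """Lorem ipsum dolor sit amet
*consectetur adipiscing elit
sed do eiusmod tempor incididunt*
ut labore et dolore magna aliqua
Ut enim ad minim veniam."""

# Option 1: [\s\S] matches any character, including a newline.
with_class = re.findall(r"[*]([\s\S]*?)[*]", text)

# Option 2: keep (.*?) and let "." match newlines via the DOTALL flag.
with_flag = re.findall(r"[*](.*?)[*]", text, flags=re.DOTALL)

print(with_class[0])
```

Both variants capture the two lines between the asterisks as a single string.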

Separating words with Regex (Not in specific order)

I am extracting items from text. For example, the following text contains items that each begin with a capital letter followed by a period. How can I separate them?
Text:
A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor
sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor
Goal:
A. lorem ipsum dolor sit
B . 41dipiscing elit sed
C. lorem ipsum dolor sit amet
D. 35 Consectetur adipiscing
E .Sed do eiusmod tempor
What have I done?
^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$
Result:
https://regex101.com/r/4HB0oD/1
But my regex does not detect anything beyond the first sentence. What is the reason for this?
Maybe,
(?=[A-Z]\s*\.)
might work OK.
RegEx Demo
Test
import re
string = '''
A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor
'''
print(re.sub(r'(?=[A-Z]\s*\.)', '\n', string))
Output
A. lorem ipsum dolor sit
B . 41dipiscing elit sed
C. lorem ipsum dolor sit amet
D. 35 Consectetur adipiscing
E .Sed do eiusmod tempor
If you wish to simplify, update, or explore the expression, it is explained in the top right panel of regex101.com. You can watch the matching steps or modify them in this debugger link, if you are interested. The debugger demonstrates how a regex engine might consume some sample input strings step by step and perform the matching process.
This pattern should do what you're looking for:
[A-Z\d] ?\..+?(?=$|[A-Z\d] ?\.)
https://regex101.com/r/i92QR1/1
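A minimal Python sketch for trying this pattern, assuming the text is on a single line (as in the test string used earlier) and that surrounding whitespace may be stripped from each match:

```python
import re

text = ("A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor "
        "sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor")

pattern = re.compile(r"[A-Z\d] ?\..+?(?=$|[A-Z\d] ?\.)")

# Each match runs from one marker up to (not including) the next marker,
# or to the end of the string for the last item.
items = [match.strip() for match in pattern.findall(text)]
for item in items:
    print(item)
```

Because . does not match a newline by default, text that spans several lines would need to be joined first or matched with an appropriate flag.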

how to delete words from a list in a column in R

I have a column of titles in a table and would like to delete all words that are listed in a separate table/vector.
For example, table of titles:
"Lorem ipsum dolor"
"sit amet, consectetur adipiscing"
"elit, sed do eiusmod tempor"
"incididunt ut labore"
"et dolore magna aliqua."
To be deleted: c("Lorem", "dolore", "elit")
output:
"ipsum dolor"
"sit amet, consectetur adipiscing"
", sed do eiusmod tempor"
"incididunt ut labore"
"et magna aliqua."
The blacklisted words can occur multiple times.
The tm package has this functionality, but only as applied to a word cloud. What I need is to leave the column intact rather than joining all the rows into one string of characters. Regex functions (gsub()) don't seem to work when given a set of values as a pattern. An Oracle SQL solution would also be interesting.
lorem <- c("Lorem ipsum dolor",
"sit amet, consectetur adipiscing",
"elit, sed do eiusmod tempor",
"incididunt ut labore",
"et dolore magna aliqua.")
to.delete <- c("Lorem", "dolore", "elit")
output <- lorem
for (i in to.delete) {
output <- gsub(i, "", output)
}
This gives:
[1] " ipsum dolor"                     "sit amet, consectetur adipiscing"
[3] ", sed do eiusmod tempor"          "incididunt ut labore"
[5] "et  magna aliqua."
First read the data:
dat <- c("Lorem ipsum dolor",
"sit amet, consectetur adipiscing",
"elit, sed do eiusmod tempor",
"incididunt ut labore",
"et dolore magna aliqua.")
todelete <- c("Lorem", "dolore", "elit")
We can avoid the loop with a little smart pasting. In a regex, | means "or", so we can paste the words together with it and match all of them in a single call:
gsub(paste0(todelete, collapse = "|"), "", dat)
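You could also use stri_replace_all_fixed. Pass vectorize_all = FALSE so that every word in to.delete is removed from every title; with the default (TRUE) the patterns are paired with the titles element by element and recycled:
library(stringi)
lorem <- c("Lorem ipsum dolor",
           "sit amet, consectetur adipiscing",
           "elit, sed do eiusmod tempor",
           "incididunt ut labore",
           "et dolore magna aliqua.")
to.delete <- c("Lorem", "dolore", "elit")
# just a simple function call
stri_replace_all_fixed(lorem, to.delete, '', vectorize_all = FALSE)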
Output:
[1] " ipsum dolor"                     "sit amet, consectetur adipiscing" ", sed do eiusmod tempor"
[4] "incididunt ut labore"             "et  magna aliqua."
The tm-Package has a function implemented for that:
tm:::removeWords.character
It is implemented as follows:
foo <- function(x, words){
gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),
collapse = "|")), "", x, perl = TRUE)
}
Which gives you
gsub("(*UCP)\\b(Lorem|elit|dolore)\\b","", x, perl = TRUE)
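The same idea, an alternation of the words wrapped in word boundaries, can be sketched in Python. This is an illustration only and not an R or tm API: the variable names are made up here, and re.escape is added as an assumption so that words containing regex metacharacters are treated literally.

```python
import re

titles = [
    "Lorem ipsum dolor",
    "sit amet, consectetur adipiscing",
    "elit, sed do eiusmod tempor",
    "incididunt ut labore",
    "et dolore magna aliqua.",
]
to_delete = ["Lorem", "dolore", "elit"]

# Escape each word and join with "|" so that any of them can match;
# \b keeps "dolore" from matching inside a longer word.
pattern = re.compile(r"\b(?:" + "|".join(re.escape(word) for word in to_delete) + r")\b")

# One result per title, so the column structure is preserved.
cleaned = [pattern.sub("", title) for title in titles]
print(cleaned)
```

Only the words themselves are removed, so the whitespace around them stays (for example, two spaces remain in the last title).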

Print Perl Regex Matches

In the program I'm working on, I'm attempting to take in a txt file, then print the bits of txt contained in a pair of quotation marks.
Assuming I've taken in the txt file and put it into an array with each line as an array element this is what I was assuming would work, but alas no luck:
txt file contents:
Lorem ipsum dolor sit amet
consectetur "adipisicing elit"
sed "do" eiusmod tempor incididunt
ut "labore et dolore" magna aliqua
CODE:
foreach (@arr)
{
    print $1 if /("*")/g;
}
Output:
""
...
foreach (@arr) {
    print $1 for /(".*?")/g;
}
...
#!/usr/bin/perl
use strict;
use warnings;
foreach(<DATA>) {
print $1 if /(".*")/;
}
__DATA__
txt file contents:
Lorem ipsum dolor sit amet
consectetur "adipisicing elit"
sed "do" eiusmod tempor incididunt
ut "labore et dolore" magna aliqua
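The two answers differ in greediness: ".*?" stops at the nearest closing quote, while ".*" runs to the last quote on the line. A minimal Python sketch of the difference follows; it is an illustration only, and the two_pairs line is hypothetical, since the sample file has at most one quoted part per line.

```python
import re

lines = [
    'Lorem ipsum dolor sit amet',
    'consectetur "adipisicing elit"',
    'sed "do" eiusmod tempor incididunt',
    'ut "labore et dolore" magna aliqua',
]

# Collect every quoted part, line by line (lazy match).
quoted = [part for line in lines for part in re.findall(r'".*?"', line)]
for part in quoted:
    print(part)

# Hypothetical line with two quoted parts, not in the original file.
two_pairs = 'say "a" and "b"'
greedy = re.findall(r'".*"', two_pairs)   # one match spanning both pairs
lazy = re.findall(r'".*?"', two_pairs)    # one match per pair
print(greedy, lazy)
```

With one quoted part per line both patterns give the same result; they only diverge when a line contains several quoted parts.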