Separating words with Regex (Not in specific order)


I am extracting items from text. For example, the following sentence contains items that each begin with a capital-letter label (A., B ., C., ...). How can I separate them?
Text:
A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor
sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor
Goal:
A. lorem ipsum dolor sit
B . 41dipiscing elit sed
C. lorem ipsum dolor sit amet
D. 35 Consectetur adipiscing
E .Sed do eiusmod tempor
What I have tried:
^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$
Result:
https://regex101.com/r/4HB0oD/1
But my regex only detects the first sentence. What is the reason for this?

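The original pattern is anchored with ^ and $, so it can only match a label at the start of a line, and the trailing (.*) swallows every other label on that line. A lookahead such as
(?=[A-Z]\s*\.)
might work OK. It matches the empty position in front of each capital letter that is followed by optional whitespace and a dot, so a line break can be inserted there.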
Test
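import re
string = '''
A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor
'''
# Insert a newline at every position that is followed by
# a capital letter, optional whitespace, and a dot.
print(re.sub(r'(?=[A-Z]\s*\.)', '\n', string))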
Output
A. lorem ipsum dolor sit
B . 41dipiscing elit sed
C. lorem ipsum dolor sit amet
D. 35 Consectetur adipiscing
E .Sed do eiusmod tempor
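If the pieces are needed as a list rather than as printed text, the same lookahead can probably be handed to re.split instead. This is only a minimal sketch, assuming Python 3.7 or later (where re.split accepts patterns that can match the empty string); the variable names are illustrative.

```python
import re

text = ("A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor "
        "sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor")

# Split at every position followed by "<capital letter>, optional whitespace, dot".
# The lookahead consumes nothing, so each label stays attached to its own piece.
pieces = re.split(r'(?=[A-Z]\s*\.)', text)

# Drop the empty piece in front of the first label and trim stray spaces.
items = [piece.strip() for piece in pieces if piece.strip()]

for item in items:
    print(item)
```

The strip() calls are an addition here; without them some pieces keep the trailing space that preceded the next label.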
If you wish to simplify, update, or explore the expression, it is explained in the top-right panel of regex101.com. You can also step through the match in the regex101 debugger, which shows how a regex engine might consume sample input strings step by step while matching.

This pattern should do what you're looking for:
[A-Z\d] ?\..+?(?=$|[A-Z\d] ?\.)
https://regex101.com/r/i92QR1/1
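As a rough check of this pattern, here is a small Python sketch that runs it through re.findall on the question's text, written as a single line. The variable names are illustrative, and the strip() call is an addition to trim the space that the lazy match leaves in front of the next label.

```python
import re

text = ("A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor "
        "sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor")

# Label: a capital letter or digit, an optional space, then a dot.
# Body: as little text as possible up to the next label or the end of the string.
pattern = r'[A-Z\d] ?\..+?(?=$|[A-Z\d] ?\.)'

items = [match.strip() for match in re.findall(pattern, text)]

for item in items:
    print(item)
```

Because . does not match a newline by default, input that wraps over several lines would need to be joined first or matched with re.S.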

Related

Regex that matches multiple new lines until finding pattern

I am not very familiar with regex and I am having trouble creating one that solves my problem.
I want to create a regex that finds the following example: (What the regex should match is in bold)
Action type: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam sodales tincidunt ipsum ut ullamcorper
Phasellus rhoncus quam id eros volutpat, ac sodales magna tincidunt Phasellus rhoncus quam id eros volutpat, ac sodales magna
Phasellus rhoncus quam id eros volutpat, ac sodales magna tincidunt
Number Name Degree
11111111 LOREM IPSUM COMPUTER ENGINEERING
31837183 DOLOR IPSUM COMPUTER ENGINEERING
Total: 2
Action type: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Number Name Degree
128172818211 SIT AMET IPSUM COMPUTER ENGINEERING
12183781 CONSECTETUR ELIT COMPUTER ENGINEERING
128172818212 ETIAM SODALES COMPUTER ENGINEERING
128172818213 IPSUM UT COMPUTER ENGINEERING
128172818215 SODALES MAGNA COMPUTER ENGINEERING
Total: 5
What I have accomplished so far is a regex that successfully matches the numbered lines and the first line of each action type, but not the subsequent lines. I would like to match everything that comes after Action type up to the line that contains Number, Name and Degree.
The regex I am currently using is (Action type: .+?\n|[0-9]{8,12} .+?\n). A preview of the current execution on regex101.com is attached.
As you can see, it works well for the second example, but it does not meet my needs for the first one.
Is it possible to adapt my current regex to handle these multi-line blocks?

Try:
^Action type:.*?(?=^Number Name Degree)|^\d{8,12}[^\n]+
^Action type:.*?(?=^Number Name Degree) - this matches all text beginning with Action type: until ^Number Name Degree is found.
^\d{8,12}[^\n]+ - this matches all lines beginning with 8-12 digits.
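Note: the expression needs the (?s) modifier so that . also matches newlines, and the multiline (m) modifier so that ^ matches at the start of every line.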
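For anyone trying this outside regex101, a minimal Python sketch follows. It assumes both re.S (so that . crosses newlines) and re.M (so that ^ matches at each line start) are set, and the sample text is a shortened, made-up version of the question's input.

```python
import re

# Shortened, illustrative stand-in for the question's input.
text = (
    "Action type: Lorem ipsum dolor sit amet.\n"
    "Phasellus rhoncus quam id eros volutpat\n"
    "Number Name Degree\n"
    "11111111 LOREM IPSUM COMPUTER ENGINEERING\n"
    "31837183 DOLOR IPSUM COMPUTER ENGINEERING\n"
    "Total: 2\n"
    "Action type: Sit amet.\n"
    "Number Name Degree\n"
    "128172818211 SIT AMET IPSUM COMPUTER ENGINEERING\n"
    "Total: 1\n"
)

pattern = r'^Action type:.*?(?=^Number Name Degree)|^\d{8,12}[^\n]+'

# re.S lets ".*?" run over several lines; re.M anchors "^" at every line start.
matches = [m.strip() for m in re.findall(pattern, text, flags=re.S | re.M)]

for m in matches:
    print(repr(m))
```

The strip() is only there to drop the newline that the first alternative captures just before the Number Name Degree line.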

Read from text file with delimiter

I have a text file with an index of students that looks something like this:
Anna Baker
Class 1B
Long description text about the student lorem ipsum dolor sit amet, consetetur sadipscing elitr.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr.
#####
Rick Bell
Class 2A
Long description text about the student lorem ipsum dolor sit amet.
At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor.
#####
etc.
I have a class Student and I need to extract the information from the text file and put it in student objects.
class Student {
private:
    string name;
    string className;   // "class" is a reserved word in C++
    string description;
};
Name and class worked fine so far but I'm struggling with extracting the description text. "#####" serves as a delimiter. I use:
while (???){
getline(inFile, word3);
word3=word3.substr(0,word3.find(delimiter));
}
I need a while loop to read all the lines up to the delimiter and I can't find the right statement for it. Can you help me?

A quick look at the C++ reference shows that std::string::find returns std::string::npos when nothing is found. So you can use something like
bool keepReading = true;
while (keepReading && getline(inFile, word3))
{
    // stop once the current line contains the delimiter
    keepReading = word3.find(delimiter) == std::string::npos;
    // do something else
}

You can use ifstream to read the file:
std::ifstream student_file("file.txt");
You can then read the file line by line:
std::string info;
if (student_file.is_open()) {
    while (getline(student_file, info)) {
        if (info == "#####")
            continue; // delimiter line: the next line starts a new student
        else {
            // this is student info
        }
    }
}

Regular expression matching a sequence of words

Let's suppose we have a paragraph like this:
Lorem ipsum, sit amet consectetur adipiscing elit. Lorem - ipsum, sit
amet. Morbi a suscipit sem, quis finibus turpis. Lorem ipsum: sit
amet. Proin suscipit ac arcu pharetra tincidunt. Lorem ipsum. sit
amet. Pellentesque eu lacinia metus. sit amet: Lorem ipsum. Lorem
turpis ipsum, sit amet.
I need a case-insensitive PCRE pattern that selects only the words
1 lorem
2 ipsum
3 sit
4 amet
in that specific order, ignoring punctuation, and that skips occurrences like
Sit amet lorem ipsum
Lorem turpis ipsum, sit amet

A simple, straightforward approach is to allow a fixed set of punctuation characters between the words. You can add any other punctuation character inside the []:
([Ll]orem)[\s,.!:\-()?]+(ipsum)[\s,.!:\-()?]+(sit)[\s,.!:\-()?]+(amet)
or allow any run of non-word characters, that is, whitespace and anything else outside [A-Za-z0-9_]:
([Ll]orem)[\s\W]+(ipsum)[\s\W]+(sit)[\s\W]+(amet)
Case sensitivity can usually be switched off with a flag, depending on the programming language. Otherwise you have to add every relevant variation by hand, like ([Ll]orem)
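In Python, for instance, that switch is the re.IGNORECASE flag. The sketch below is illustrative only: it uses \W+ between the words (equivalent to [\s\W]+, since \W already includes whitespace), adds \b word boundaries as an extra assumption, and runs on a shortened version of the question's paragraph.

```python
import re

# Shortened version of the question's paragraph, including a line break
# between "sit" and "amet" and two sequences that must NOT match.
text = ("Lorem ipsum, sit amet consectetur. Lorem - ipsum, sit\n"
        "amet. Lorem ipsum: sit amet. Sit amet: Lorem ipsum. "
        "Lorem turpis ipsum, sit amet.")

# The four words in order, separated only by non-word characters.
pattern = re.compile(r'\b(lorem)\W+(ipsum)\W+(sit)\W+(amet)\b', re.IGNORECASE)

found = [match.group(0) for match in pattern.finditer(text)]

for hit in found:
    print(repr(hit))
```

"Sit amet: Lorem ipsum" and "Lorem turpis ipsum, sit amet" are skipped because a word in the wrong place, or an extra word, breaks the run of non-word characters between the four targets.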

Regex match multiple pattern

Below is my test string:
Object: TLE-234DSDSDS324-234SDF324ER
Page location: SDEWRSD3242SD-234/324/234 (1)
org-chart Lorem ipsum dolor consectetur adipiscing # Colorado
234DSDSDS324-32-4/2/7-page2 (2) loc log Apr 18 21:42:49 2017 1
Page information: 3.32.232.212.23, Error: fatal, Technique: color
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Validation status: Lorem ipsums dolors sits amets, consectetur adipiscing elit
Positive control-export: Validated
Page location: SDEWRSD3242SD-SDF/234/324 (5)
org-chart Lorem ipsum dolor consectetur adipiscin # Arizona
234DSDSDS324-23-11/1/0-page1 (1) loc log Apr 18 21:42:49 2017 1
Page information: 3.32.232.212.23, Error: log, Technique: color
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Validation status: Lorem ipsums dolors sits amets, consectetur adipiscing elit
Positive control-export: Validated
I need to capture strings after the "Page location: ", "Object: " and "Comments: "
For example:
Object: TLE-234DSDSDS324-234SDF324ER - Group 1
Page location: SDEWRSD3242SD-234/324/234 (1) - Group 2
Page location: SDEWRSD3242SD-SDF/234/324 (5) - Group 3
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. - Group 4
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. - Group 5
Here is my regex URL.
I am able to capture the strings, but the regex won't capture them if any one of the strings is repeated.

(See comments below the question for the problem description.)
The data is in a multi-line string, with multiple sections starting with Object:. Within each there are multiple lines starting with the phrases Page location: and Comments:. The rest of the line needs to be captured for all of these, and everything organized by Object.
Instead of attempting a tortured multi-line "single" regex, break the string into lines and process section by section. This way the problem becomes a very simple one.
The results are stored in an array of hashrefs; each has for keys the shown phrases. Since they can appear more than once per section their values are arrayrefs (with what follows them on the line).
use warnings;
use strict;
use feature 'say';
my $input_string = '...';
my @lines = split /\n/, $input_string;
my $patt = qr/Object|Page location|Comments/;
my @sections;
for (@lines)
{
    next if not /^\s*($patt):\s*(.*)/;
    push @sections, {} if $1 eq 'Object';
    push @{ $sections[-1]->{$1} }, $2;
}
foreach my $sec (@sections) {
    foreach my $key (sort keys %$sec) {
        say "$key:";
        say "\t$_" for @{$sec->{$key}};
    }
}
With the input string copied (suppressed above for brevity), the output is
Comments:
    Lorem ipsum dolor sit amet, [...]
    Lorem ipsum dolor sit amet, [...]
Object:
    TLE-234DSDSDS324-234SDF324ER
Page location:
    SDEWRSD3242SD-234/324/234 (1)
    SDEWRSD3242SD-SDF/234/324 (5)
A few comments.
Once the Object line is found we add a new hashref to @sections. Then the match for a pattern is set as a key and the rest of its line is added to its arrayref value. This is done for the current (so last) element of @sections.
This adds an empty string if a pattern had nothing following it. To disallow that, add next if not $2;
Note. An easy and common way to print complex data structures is via the core module Data::Dumper. But also see Data::Dump for a much more compact printout.

Regular expressions: Finding BB code in a piece of text

I'm trying to match on "url" BB code tag in a random piece of text. Example text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. [url]http://www.google.com[/url] Donec purus nunc, rhoncus vitae tempus vitae, [url=www.facebook.com]facebook[/url] elementum sit amet justo.
I want to find both "url" tags from this text:
[url]http://www.google.com[/url]
[url=www.facebook.com]facebook[/url]
I am not that good with regular expressions so this is as far as I could get:
\[url(=[a-z]*)?\][a-z]*\[/url\]
I think I just need to replace [a-z] with something that matches on anything EXCEPT the characters '[' and ']'. Can anybody help me out with this please?

The following expression should do it for you
\[url(=(.*?))?\](.*?)\[\/url\]

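((\[url\].*?\[/url\])|(\[url=.*\](.*?)\[/url\]))
This will pull both results. Note that the greedy .* in \[url=.*\] can run across several tags when more than one [url=...] tag appears on the same line; .*? avoids that.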
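The question's own idea, matching anything except '[' and ']', can be written with a negated character class. A minimal Python sketch, with illustrative variable names and the example text from the question:

```python
import re

text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
        "[url]http://www.google.com[/url] Donec purus nunc, rhoncus vitae tempus vitae, "
        "[url=www.facebook.com]facebook[/url] elementum sit amet justo.")

# [^\[\]]* matches any run of characters that are neither '[' nor ']'.
# Group 1: the optional target after "url="; group 2: the text between the tags.
pattern = re.compile(r'\[url(?:=([^\[\]]*))?\]([^\[\]]*)\[/url\]')

tags = [match.group(0) for match in pattern.finditer(text)]
details = [match.groups() for match in pattern.finditer(text)]

print(tags)
print(details)
```

Since the class can never cross a bracket, a match cannot spill from one tag into the next, which is the usual risk with .* in this kind of pattern.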
