Generate substrings from a string in Perl - regex

I have a string of characters which I want to break down in its substrings on the spaces between words, but the number of spaces spanning between a substring should not be more than 4.
E.g.: String:
"Baicalein, a specific lipoxygenase (LOX) inhibitor, has anti-inflammatory and antioxidant effects."
The resulting substrings should look like
1. Baicalein,
2. Baicalein, a
3. Baicalein, a specific
4. Baicalein, a specific lipoxygenase
5. Baicalein, a specific lipoxygenase (LOX)
6. a
7. a specific...
I feel there must be some way with Regex, but I'm not sure
EDIT
Code that I have used:
my #arr = split('\s', $line);
for(my $i=0; $i<$#arr; $i++)
{
my $str1 = $arr[$i];
my $str2 = $arr[$i].' '.$arr[$i+1];
my $str3 = $arr[$i].' '.$arr[$i+1].' '.$arr[$i+2];
my $str4 = $arr[$i].' '.$arr[$i+1].' '.$arr[$i+2].' '.$arr[$i+3];
}
I have very long strings and by this approach it takes a lot of time.
Thanks in Advance

You could create an inner loop to avoid the repeated code. Also, repeatedly gluing stuff with the dot operator is less efficient.
my #substrings;
for (my $i=0; $i<=$#arr; ++$i)
{
for (my $j=0; $j<5 && $i+$j<=$#arr; ++$j)
{
push #substrings, join(' ', #arr[$i..$i+$j]);
}
}
You'll notice the additional boundary condition to prevent the inner loop from going past the end of the input array, and the use of a new array #substrings to contain the results. Finally, see how indentation helps you see what goes where.

Related

regex Match a capture group's items only once

So I'm trying to split a string in several options, but those options are allowed to occur only once. I've figured out how to make it match all options, but when an option occurs twice or more it matches every single option.
Example string: --split1 testsplit 1 --split2 test split 2 --split3 t e s t split 3 --split1 split1 again
Regex: /-{1,2}(split1|split2|split3) [\w|\s]+/g
Right now it is matching all cases and I want it to match --split1, --split2 and --split3 only once (so --split1 split1 again will not be matched).
I'm probably missing something really straight forward, but anyone care to help out? :)
Edit:
Decided to handle the extra occurances showing up in a script and not through RegEx, easier error handling. Thanks for the help!
EDIT: Somehow I ended up here from the PHP section, hence the PHP code. The same principles apply to any other language, however.
I realise that OP has said they have found a solution, but I am putting this here for future visitors.
function splitter(string $str, int $splits, $split = "--split")
{
$a = array();
for ($i = $splits; $i > 0; $i--) {
if (strpos($str, "$split{$i} ") !== false) {
$a[] = substr($str, strpos($str, "$split{$i} ") + strlen("$split{$i} "));
$str = substr($str, 0, strpos($str, "$split{$i} "));
}
}
return array_reverse($a);
}
This function will take the string to be split, as well as how many segments there will be. Use it like so:
$array = splitter($str, 3);
It will successfully explode the array around the $split parameter.
The parameters are used as follows:
$str
The string that you want to split. In your instance it is: --split1 testsplit 1 --split2 test split 2 --split3 t e s t split 3 --split1 split1 again.
$splits
This is how many elements of the array you wish to create. In your instance, there are 3 distinct splits.
If a split is not found, then it will be skipped. For instance, if you were to have --split1 and --split3 but no --split2 then the array will only be split twice.
$split
This is the string that will be the delimiter of the array. Note that it must be as specified in the question. This means that if you want to split using --myNewSplit then it will append that string with a number from 1 to $splits.
All elements end with a space since the function looks for $split and you have a space before each split. If you don't want to have the trailing whitespace then you can change the code to this:
$a[] = trim(substr($str, strpos($str, "$split{$i} ") + strlen("$split{$i} ")));
Also, notice that strpos looks for a space after the delimiter. Again, if you don't want the space then remove it from the string.
The reason I have used a function is that it will make it flexible for you in the future if you decide that you want to have four splits or change the delimiter.
Obviously, if you no longer want a numerically changing delimiter then the explode function exists for this purpose.
-{1,2}((split1)|(split2)|(split3)) [\w|\s]+
Something like this? This will, in this case, create 3 arrays which all will have an array of elements of the same name in them. Hope this helps

Perl regex from file.txt, match columns greater than x

I have a file containing several rows of code, like this:
160101, 0100, 58.8,
160101, 0200, 59.3,
160101, 0300, 59.5,
160101, 0400, 59.1,
I'm trying to print out the third column with a regex, like this:
# Read the text file.
open( IN, "file.txt" ) or die "Can't read words file: $!";
# Print out.
while (<IN>) {
print "Number: $1\n"
while s/[^\,]+\,[^\,]+\,([^\,]+)\,/$1/g;
}
And it works fairly well, however, I'm trying to only fetch the numbers that are greater than or equal to 59 (that includes numbers like 59.1 and 59.0). I've tried several numeric regex combinations (the one below will not give me the right number, obviously, but just making a point), including:
while s/[^\,]+\,[^\,]+\,([^\,]+)\,^[0-9]{3}$/$1/g;
but none seem to work. Any ideas?
My first idea would be to split that line and then pick and choose
while (my $line = <IN>) {
my #nums = split ',\s*', $line;
print "$nums[2]\n" if $nums[2] >= $cutoff;
}
If you insist on doing it all in the regex then you may want to use /e modifier, so in the substitution part you can run code. Then you can test the particular match and print it there.
Assuming that the numbers can't reach 100 (three digits) you could use
[^\,]+\,[^\,]+\,\s*(59\.\d+|[6-9]\d\.\d+)\,
which uses your regex except for the capture group which captures the number 59 and it's decimals, or two digit numbers from 60-99 and it's decimals.
Regards
Edit:
To go above 100 you can add another alternative in the capture group:
[^\,]+\,[^\,]+\,\s*(59\.\d+|[6-9]\d\.\d+|[1-9]\d{2,}\.\d+)\,
which allows larger numbers (>=100.0).
Why do you use while? Is it possible to have more than one third column on a line? A simple if will work the same, comunicating the intent more clearly.
Also, if you want to extract, you don't need to substitute. Use m// instead of s///.
Regexes aren't the right tool to do numberic comparisons. Use >= instead:
print "Number: $1\n" if /[^\,]+\,[^\,]+\,([^\,]+)\,/
&& $1 >= 59
Assuming the line ends with a comma :
print foreach map{s/.+?(\d+.\d+),$/$1/;$_} ;
In case there might be someting after the rightmost comma :
print foreach map{s/.+?(\d+.\d+),[^,]*$/$1/;$_} ;
But i wouldn't use regexp in that case :
print foreach map{(split, ',')[-2]} ;
I would suggest not using a regex when split is a better tool for the job. Likewise - regex is very bad at detecting numeric values - it works on text based patterns.
But how about:
while ( <> ) {
print ((split /,\s*/)[2],"\n");
}
If you want to test a conditional:
while ( <> ) {
my #fields = split /,\s*/;
print $fields[2],"\n" if $fields[2] >= 59;
}
Or perhaps:
print join "\n", grep { $_ >= 59 } map { (split /,\s*/)[2] } <>;
map takes your input, and extracts the third field (returning a list). grep then applies a filter condition to every element. And then we print it.
Note - in the above, I use <> which is the magic file handle (reads files specified on command line, or STDIN) but you can use your filehandle.
However it's probably worth noting - 3 argument open with lexical file handles are recommended now.
open ( my $input, '<', 'file.txt' ) or die $!;
It has a number of advantages and is generally good style.

How can I use regex with sed (or equivalent unix command line tool) to fix title case in LaTeX headings?

regular expression attempt
(\\section\{|\\subsection\{|\\subsubsection\{|\\paragraph[^{]*\{)(\w)\w*([ |\}]*)
search text
\section{intro to installation of apps}
\subsection{another heading for \myformatting{special}}
\subsubsection{good morning, San Francisco}
\paragraph{installation of backend services}
desired output
All initial characters are capitalized except prepositions, conjunctions, and the usual parts of speech that are made upper case on titles.
I supposed I should really narrow this down, so let me borrow from the U.S. Government Printing Office Style Manual:
The articles a, an, and the; the prepositions at, by, for, in, of, on, to, and up; the conjunctions and, as, but, if, or, and nor; and the second element of a compound numeral are not capitalized.
Page 41
\subsection{Installation guide for the server-side app \myapp{webgen}}
changes to
\subsection{Installation Guide for the Server-side App \myapp{Webgen}}
OR
\subsection{Installation Guide for the Server-side App \myapp{webgen}}
How would you name this type of string modification?
Applying REGEX to a string between strings?
Applying REGEX to a part of a string when that part falls between two other strings of characters?
Applying REGEX to a substring that occurs between two
other substrings within a string?
<something else>
problem
I match each latex heading command, including the {. This means that my expresion does not match more than the first word in the actually heading text. I cannot surround the whole heading code with an "OR space" because then I will find nearly every word in the document. Also, I have to be careful of formatting commands within the headings themselves.
other helpful related questions
Uppercasing First Letter of Words Using SED
https://superuser.com/questions/749164/how-to-use-regex-to-capitalise-the-first-letter-of-each-word-in-a-sentence
Using Sed to capitalize the first letter of each word
Capitalize first letter of each word in a selection using vim
So it seems to me as if you need to implement pseudo-code like this:
Are we on the first word? If yes, capitalize it and move on.
Is the current word "reserved"? If yes, lower it and move on.
Is the current word a numeral? If yes, lower it and move on.
Are we still in the list? If yes, print the line verbatim and move on.
One other helpful rule might be to leave fully upper-case words as they are, just in case they're acronyms.
The following awk script might do what you need.
#!/usr/bin/awk -f
function toformal(subject) {
return toupper(substr(subject,1,1)) tolower(substr(subject,2))
}
BEGIN {
# Reserved word list gets split into an array for easy matching.
reserved="at by for in of on to up and as but if or nor";
split(reserved,a_reserved," "); for(i in a_reserved) r[a_reserved[i]]=1;
# Same with the list of compound numerals. If this isn't what you mean, say so.
numerals="hundred thousand million billion";
split(numerals,a_numerals," "); for(i in a_numerals) n[a_numerals[i]]=1;
}
# This awk condition matches the lines we're interested in modifying.
/^\\(section|subsection|subsubsection|paragraph)[{]/ {
# Separate the particular section and the text, then split text to an array.
section=$0; sub(/\\/,"",section); sub(/[{].*/,"",section);
text=$0; sub(/^[^{]*[{]/,"",text); sub(/[}].*/,"",text);
size=split(text,atext,/[[:space:]]/);
# First word...
newtext=toformal(atext[1]);
for(i=2; i<=size; i++) {
# Reserved word...
if (r[tolower(atext[i])]) { newtext=newtext " " atext[i]; continue; }
# Compound numerals...
if (n[tolower(atext[i])]) { newtext=newtext " " tolower(atext[i]); continue; }
# # Acronyms maybe...
# if (atext[i] == toupper(atext[i])) { newtext=newtext " " atext[i]; continue; }
# Everything else...
newtext=newtext " " toformal(atext[i]);
}
print newtext;
next;
}
# Print the line if we get this far. This is a non-condition with
# a print-only statement.
1
Here is an example of how you could do it in Perl using the module Lingua::EN::Titlecase and recursive regular expressions :
use strict;
use warnings;
use Lingua::EN::Titlecase;
my $tc = Lingua::EN::Titlecase->new();
my $data = do {local $/; <> };
my ($kw_regex) = map { qr/$_/ }
join '|', qw(section subsection subsubsection paragraph);
$data =~ s/(\\(?: $kw_regex))(\{(?:[^{}]++|(?2))*\})/title_case($tc,$1,$2)/gex;
print $data;
sub title_case {
my ($tc, $p1, $p2) = #_;
$p2 =~ s/^\{//;
$p2 =~ s/\}$//;
if ($p2 =~ /\\/ ) {
while ($p2 =~ /\G(.*?)(\\.*?)(\{(?:[^{}]++|(?3))*\})/ ) {
my $next_pos = $+[0];
substr($p2, $-[1], $+[1] -$-[1], $tc->title($1));
substr($p2, $-[3], $+[3] -$-[3], title_case($tc,'',$3));
pos($p2) = $next_pos;
}
$p2 =~ s/\G(.+)$/$tc->title($1)/e;
}
else {
$p2 = $tc->title($p2);
}
return $p1 . '{' . $p2 . '}';
}

regex maching after new line in perl

i am trying to match with regex in perl different parts of a text which are not in the same line.
I have a file sized 200 mb aprox with all cases similar to the following example:
rewfww
vfresrgt
rter
*** BLOCK 049 Aeee/Ed "ewewew"U 141202 0206
BLAH1
BLAH2
END
and i want to extract all what is in the same line after the "***" in $1, BLAH1 in $2 and BLAH2 in $3.
i have tried the following without success:
open(archive, "C:/Users/g/Desktop/blahs.txt") or die "die\n";
while(< archive>){
if($_ =~ /^\*\*\*(.*)\n(.*)/s){
print $1;
print $2;
}
}
One more complexity: i don´t know how many BLAH´s are in each case. Perhaps one case have only BLAH1, other case with BLAH1, BLAH2 and BLAH3 etc. The only thing thats sure is the final "END" who separates the cases.
Regards
\*\*\*([^\n]*)\n|(?!^)\G\s*(?!\bEND\b)([^\n]+)
Try this.See demo.
https://regex101.com/r/vN3sH3/17
How about:
#!/usr/bin/perl
use strict;
use warnings;
open(my $archive, '<', "C:/Users/g/Desktop/blahs.txt") or die "die: $!";
while(<$archive>){
if (/^\*{3}/ .. /END/) {
s/^\*{3}//;
print unless /END/;
}
}
As far as I understand your question the following works for me. Please update or provide feedback if you are looking for something more or less strict (or spot any mistakes!).
^(\*{3}.*\n{2})(([a-zA-Z])*([0-9]*)\n{2})*(END)$
^(\*{3}\n{2}) - Find line consisting of three *s followed by two newlines - You could repeat this by adding * after the last closing parenthesis if you want/need to check for a "false" start. While it looks like you may have data in the file before this but this is the start of the data you actually care about/want to capture.
(([a-zA-Z])*([0-9]*)\n{2})* -The desired word characters followed by a number (or numbers if your BLAH count >9) and also check for two trailing spaces. The * at the end denotes that this can repeat zero or more times which accounts for the case where you have no data. If you want a fail if there is not data use ? instead of * to denote it must repeat 1 or more times. this segment assumes you wanted to check for data in the format word+number. If that is not the case this part can be easily modified to accept a wider range of data - let me know if you want/need a more or less strict case
(END)$ - The regex ends with sequence "END". If it is permissible for the data to continue and you just want to stop capture at this point do not include the $
I don't have permissions to post pics yet but a great site to check and to see a visual representation of your regex imo is https://www.debuggex.com/

Trying to simplify a Regex

I'm spending my weekend analyzing Campaign Finance Contribution records. Fun!
One of the annoying things I've noticed is that entity names are entered differently:
For example, i see stuff like this: 'llc', 'llc.', 'l l c', 'l.l.c', 'l. l. c.', 'llc,', etc.
I'm trying to catch all these variants.
So it would be something like:
"l([,\.\ ]*)l([,\.\ ]*)c([,\.\ ]*)"
Which isn't so bad... except there are about 40 entity suffixes that I can think of.
The best thing I can think of is programmatically building up this pattern , based on my list of suffixes.
I'm wondering if there's a better way to handle this within a single regex that is human readable/writable.
You could just strip out excess crap. Using Perl:
my $suffix = "l. lc.."; # the worst case imaginable!
$suffix =~ s/[.\s]//g;
# no matter what variation $suffix was, it's now just "llc"
Obviously this may maul your input if you use it on the full company name, but getting too in-depth with how to do that would require knowing what language we're working with. A possible regex solution is to copy the company name and strip out a few common words and any words with more than (about) 4 characters:
my $suffix = $full_name;
$suffix =~ s/\w{4,}//g; # strip words of more than 4 characters
$suffix =~ s/(a|the|an|of)//ig; # strip a few common cases
# now we can mangle $suffix all we want
# and be relatively sure of what we're doing
It's not perfect, but it should be fairly effective, and more readable than using a single "monster regex" to try to match all of them. As a rule, don't use a monster regex to match all cases, use a series of specialized regexes to narrow many cases down to a few. It will be easier to understand.
Regexes (other than relatively simple ones) and readability rarely go hand-in-hand. Don't misunderstand me, I love them for the simplicity they usually bring, but they're not fit for all purposes.
If you want readability, just create an array of possible values and iterate through them, checking your field against them to see if there's a match.
Unless you're doing gene sequencing, the speed difference shouldn't matter. And it will be a lot easier to add a new one when you discover it. Adding an element to an array is substantially easier than reverse-engineering a regex.
The first two "l" parts can be simplified by [the first "l" part here]{2}.
You can squish periods and whitespace first, before matching: for instance, in perl:
while (<>) {
$Sq = $_;
$Sq =~ s/[.\s]//g; # squish away . and " " in the temporary save version
$Sq = lc($Sq);
/^llc$/ and $_ = 'L.L.C.'; # try to match, if so save the canonical version
/^ibm/ and $_ = 'IBM'; # a different match
print $_;
}
Don't use regexes, instead build up a map of all discovered (so far) entries and their 'canonical' (favourite) versions.
Also build a tool to discover possible new variants of postfixes by identifying common prefixes to a certain number of characters and printing them on the screen so you can add new rules.
In Perl you can build up regular expressions inside your program using strings. Here's some example code:
#!/usr/bin/perl
use strict;
use warnings;
my #strings = (
"l.l.c",
"llc",
"LLC",
"lLc",
"l,l,c",
"L . L C ",
"l W c"
);
my #seps = ('.',',','\s');
my $sep_regex = '[' . join('', #seps) . ']*';
my $regex_def = join '', (
'[lL]',
$sep_regex,
'[lL]',
$sep_regex,
'[cC]'
);
print "definition: $regex_def\n";
foreach my $str (#strings) {
if ( $str =~ /$regex_def/ ) {
print "$str matches\n";
} else {
print "$str doesn't match\n";
}
}
This regular expression could also be simplified by using case-insensitive matching (which means $match =~ /$regex/i ). If you run this a few times on the strings that you define, you can easily see cases that don't validate according to your regular expression. Building up your regular expression this way can be useful in only defining your separator symbols once, and I think that people are likely to use the same separators for a wide variety of abbreviations (like IRS, I.R.S, irs, etc).
You also might think about looking into approximate string matching algorithms, which are popular in a large number of areas. The idea behind these is that you define a scoring system for comparing strings, and then you can measure how similar input strings are to your canonical string, so that you can recognize that "LLC" and "lLc" are very similar strings.
Alternatively, as other people have suggested you could write an input sanitizer that removes unwanted characters like whitespace, commas, and periods. In the context of the program above, you could do this:
my $sep_regex = '[' . join('', #seps) . ']*';
foreach my $str (#strings) {
my $copy = $str;
$copy =~ s/$sep_regex//g;
$copy = lc $copy;
print "$str -> $copy\n";
}
If you have control of how the data is entered originally, you could use such a sanitizer to validate input from the users and other programs, which will make your analysis much easier.