I have the following sample text file with all my references, which I use for citations in other software (LaTeX). I want to remove the "abstract" field and its contents to reduce the file size and make the content more relevant.
The sample text is given below:
doi = {10.3389/fsufs.2021.575056},
abstract = {Agriculture has come under pressure to meet global food demands, whilst having to meet economic and ecological targets. This has opened newer avenues for investigation in unconventional protein sources. Current agricultural practises manage marginal lands mostly through animal husbandry, which; although effective in land utilisation for food production, largely contributes to global green-house gas (GHG) emissions. Assessing the revalorisation potential of invasive plant species growing on these lands may help encourage their utilisation as an alternate protein source and partially shift the burden from livestock production; the current dominant source of dietary protein, and offer alternate means of income from such lands. Six globally recognised invasive plant species found extensively on marginal lands; Gorse (
Ulex europaeus
), Vetch (
Vicia sativa
), Broom (
Cytisus scoparius
), Fireweed (
Chamaenerion angustifolium
), Bracken (
Pteridium aquilinum
), and Buddleia (
Buddleja davidii
) were collected and characterised to assess their potential as alternate protein sources. Amino acid profiling revealed appreciable levels of essential amino acids totalling 33.05 ± 0.04 41.43 ± 0.05, 33.05 ± 0.11, 32.63 ± 0.04, 48.71 ± 0.02 and 21.48 ± 0.05 mg/g dry plant mass for Gorse, Vetch, Broom Fireweed, Bracken, and Buddleia, respectively. The availability of essential amino acids was limited by protein solubility, and Gorse was found to have the highest soluble protein content. It was also high in bioactive phenolic compounds including cinnamic- phenyl-, pyruvic-, and benzoic acid derivatives. Databases generated using satellite imagery were used to locate the spread of invasive plants. Total biomass was estimated to be roughly 52 Tg with a protein content of 5.2 Tg with a total essential amino acid content of 1.25 Tg ({\textasciitilde}24\%). Globally, Fabaceae was the second most abundant family of invasive plants. Much of the spread was found within marginal lands and shrublands. Analysis of intrinsic agricultural factors revealed economic status as the emergent factor, driven predominantly by land use allocation, with shrublands playing a pivotal role in the model. Diverting resources from invasive plant removal through herbicides and burning to leaf protein extraction may contribute toward sustainable protein, effective land use, and achieving emission targets, while simultaneously maintaining conservation of native plant species.},
doi = {10.1186/s12864-016-3367-x},
abstract = {Background: Propionibacterium freudenreichii is an Actinobacterium widely used in the dairy industry as a ripening culture for Swiss-type cheeses, for vitamin B12 production and some strains display probiotic properties. It is reportedly a hardy bacterium, able to survive the cheese-making process and digestive stresses.
Results: During this study, P. freudenreichii CIRM-BIA 138 (alias ITG P9), which has a generation time of five hours in Yeast Extract Lactate medium at 30 °C under microaerophilic conditions, was incubated for 11 days (9 days after entry into stationary phase) in a culture medium, without any adjunct during the incubation. The carbon and free amino acids sources available in the medium, and the organic acids produced by the strain, were monitored throughout growth and survival. Although lactate (the preferred carbon source for P. freudenreichii) was exhausted three days after inoculation, the strain sustained a high population level of 9.3 log10 CFU/mL. Its physiological adaptation was investigated by RNA-seq analysis and revealed a complete disruption of metabolism at the entry into stationary phase as compared to exponential phase.
Conclusions: P. freudenreichii adapts its metabolism during entry into stationary phase by down-regulating oxidative phosphorylation, glycolysis, and the Wood-Werkman cycle by exploiting new nitrogen (glutamate, glycine, alanine) sources, by down-regulating the transcription, translation and secretion of protein. Utilization of polyphosphates was suggested.},
language = {en},
I want to prune out the abstract and all its contents. So the corresponding output should look like:
doi = {10.3389/fsufs.2021.575056},
doi = {10.1186/s12864-016-3367-x},
language = {en},
I am trying to achieve this using the following 'sed' command: sed 's/\s*abstract.*(\n*.*)*.*[$}]// gm' Test.txt
But it does not seem to work. I have checked it using online tools such as https://regex101.com/, where it seems to select the relevant text, but when I execute it on my laptop it doesn't work properly.
I am running this on a Lenovo ThinkPad with MX Linux.
Using GNU sed
$ sed -Ez 's/abstract =[^}]*}([^}]*\.})?,\n +?//g' input_file
doi = {10.3389/fsufs.2021.575056},
doi = {10.1186/s12864-016-3367-x},
language = {en},
Enabling extended regular expressions with -E and reading the input as NUL-separated records with -z (so the whole file is treated as one string and the pattern can span newlines), you can then find the match starting from abstract =
[^}]*} - Match up to the next occurrence of } and include the curly brace.
([^}]*\.})? - This part is optional; as above, match up to the next occurrence of the curly brace, but this time ensure there is a full stop immediately before it.
\n - Include the newline in the match to be removed.
 +? - Lazily match the space(s) following the newline so the indentation before the next field is removed as well.
The g flag at the end repeats the removal for every match found.
This might work for you (GNU sed):
sed -n '/abstract = {/{:a;/},$/b;n;ba};p' file
Turn off implicit printing with -n.
If a line contains abstract = {, keep replacing the current line with the next for as long as the current line does not end in },. Once a line does end in },, branch past the print command, which effectively deletes the whole abstract.
Otherwise print all other lines.
In GNU awk you could try the following code (written and tested in GNU awk). It uses GNU awk's support for a regex in the RS variable: RS is set to match the doi and language fields, so the matched record terminator (RT) is exactly the text to keep, and printing RT gives the required output as per the OP's request.
awk -v RS='(^[[:space:]]*|\n[[:space:]]*)doi = {[^}]*},|[[:space:]]+language = {en},' '
RT{ print RT }
' Input_file
Here is the online demo for the above code (NOTE: the online demo uses a non-capturing group, which is not supported by awk; it is there only for understanding purposes).
I am trying to use this regex:
my @vulnerabilities = ($g ~~ m:g/\s+("Low"||"Medium"||"High")\s+/);
I run it on chunks of a file such as this one, each chunk going from one "sorted" to the next. Each chunk is a few hundred kilobytes, and all of them together take from 1 to 3 seconds in total (divided by 32 per iteration).
How can this be sped up?
Inspection of the example file reveals that the strings only occur as a whole line, starting with a tab and a space. From your responses I further gathered that you're really only interested in counts. If that is the case, then I would suggest something like this solution:
my %targets = "\t Low", "Low", "\t Medium", "Medium", "\t High", "High";
my %vulnerabilities is Bag = $g.lines.map: {
    %targets{$_} // Empty
}
dd %vulnerabilities; # ("Low"=>2877,"Medium"=>54).Bag
This runs in about .25 seconds on my machine.
It always pays to look at the problem domain thoroughly!
This can be simplified a little bit. You use \s+ before and after, but is that necessary? I think you only need to assert a word boundary or a single whitespace character, so you can use
\s("Low"||"Medium"||"High")\s
or you can use \b instead of \s.
The second step is not to use a capturing group; use a non-capturing group instead (in Raku regexes that is square brackets), because the regex engine wastes time and memory "remembering" groups. So you could try:
\s["Low"||"Medium"||"High"]\s
TL;DR I've compared solutions on a recent rakudo, using your sample data. The ugly brute-force solution I present here is about twice as fast as the delightfully elegant solution Liz has presented. You could probably improve times another order of magnitude or more by breaking your data up and parallel processing it. I also discuss other options if that's not enough.
Alternation seems like a red herring
When I eliminated the alternation (leaving just "Low") and ran the code on a recent rakudo, the time taken was about the same. So I think that's a red herring and have not studied that aspect further.
Parallel processing looks promising
It's clear from your data that you could break it up, splitting at some arbitrary line, pattern-match each piece in parallel, and then combine the results.
That could net you a substantial win, depending on various factors related to your system and the data you process.
But I haven't explored this option.
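If you do want to experiment with that, here is a very rough, untested sketch of the idea (the chunk size and the race batching are arbitrary choices here, not tuned values):
# Untested sketch: split the text into chunks of whole lines, count matches in
# each chunk in parallel with race, then merge the per-chunk counts.
my @chunks = $g.lines.rotor(50_000, :partial).map({ .join("\n") ~ "\n" });
my @partials = @chunks.race(:batch(1)).map: -> $chunk {
    my %c;
    $chunk ~~ m:g/ "\t " [ 'Low' || 'Medium' || 'High' ] \n { %c{$/}++ } /;
    %c
}
my %counts;
for @partials -> %c {
    %counts{.key} += .value for %c;
}
say %counts.map: { .key.trim, .value }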
The fastest results I've seen
The fastest results I've seen are with this code:
my %counts;
$g ~~ m:g / "\t " [ 'Low' || 'Medium' || 'High' ] \n { %counts{$/}++ } /;
say %counts.map: { .key.trim, .value }
This displays:
((Low 2877) (Medium 54))
This approach incorporates similar changes to those Michał Turczyn discussed, but pushed harder:
I've thrown away all capturing, not only not bothering to capture the 'Low' or whatever, but also throwing away all results of the match.
I've replaced the \s+ patterns with concrete characters rather than character classes. I've done so on the basis that my casual tests with a recent rakudo suggested that's a bit faster.
Going beyond raku's regexes
Raku is designed for full Unicode generality. And its regex engine is extremely powerful. But it looks like your data is just ASCII and your pattern is a typical very simple regex. So you're using a sledgehammer to crack a nut. This shouldn't really matter -- the sledgehammer is supposed to be just fine as a nutcracker too -- but raku's regex engine remains very poorly optimized thus far.
Perhaps this nut is just a simple example and you're just curious about pushing raku's built in regex capabilities to their maximum current performance.
But if not, and you need yet more speed, and the speedups from this or other better solutions in raku, coupled with parallel processing, aren't enough to get you where you need to go, it's worth considering either not using raku or using it with another tool.
One idiomatic way to use raku with another tool is to use an Inline, with the obvious one in this case being Inline::Perl5. Using that you can try perl's fast default built in regex engine or even use its regex plugin capability to plug in a really fast regex engine.
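As a rough illustration (not something I've benchmarked), the sketch below assumes the Inline::Perl5 module is installed and uses its run/call interface to define a Perl 5 sub and then invoke it from raku:
use Inline::Perl5;

my $p5 = Inline::Perl5.new;

# Define a Perl 5 sub so the counting is done by perl's regex engine.
# (Assumes Inline::Perl5's run method for executing Perl 5 code.)
$p5.run(q:to/PERL5/);
    sub count_severities {
        my ($text) = @_;
        my %counts;
        $counts{$1}++ while $text =~ /\t (Low|Medium|High)\n/g;
        return join ', ', map { "$_ $counts{$_}" } sort keys %counts;
    }
    PERL5

# Call the Perl 5 sub from raku, passing the raku string.
# (Assumes Inline::Perl5's call method; check its docs for the exact interface.)
say $p5.call('count_severities', $g);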
And, given the simplicity of the pattern you're matching, you could even eschew regexes altogether by writing a quick bit of glue to some low-level raw text searching tool (perhaps saving character offsets and then generating corresponding raku match objects from the results).
I have some long but variable-length texts that are divided into sections marked by ********************. I need to post those texts into a field that only accepts 2048 characters, so I will need to divide that text into groups of no more than 2048 characters but which do not contain an incomplete section.
My regex so far is ^([\s\S]{1,2048})([\s\S]{1,2048})([\s\S]{1,2048})
However, this has two problems:
1) It divides the text into groups that can include an incomplete section. What I want is a complete section, even if it is not a full 2048 characters. Assume the example below is at the end of 2048 characters.
Here's my actual result. Notice that the "7 Minute Workout" section is cut off mid-section:
********************
Maybe Baby™ Period & Fertility (📱)
Popular app for tracking your periods and predicting times of fertility; recommended; avg 4.5/5 stars (3,500+ ratings); 50% off, $3.99 ↘️ $1.99!
https://example.com/2019/07/29/maybe-baby-period-fertility-7-29-19/
********************
7 Minute Workout: Lose Weight (📱)
Scientifically-proven and featured by the New York Times, a 7-minute high intensity workout proven to lose weig
Here's my desired result. Notice that the "7 Minute Workout" section is entirely omitted because it could not be included in its entirety while staying under the 2048 character limit.
********************
Maybe Baby™ Period & Fertility (📱)
Popular app for tracking your periods and predicting times of fertility; recommended; avg 4.5/5 stars (3,500+ ratings); 50% off, $3.99 ↘️ $1.99!
https://example.com/2019/07/29/maybe-baby-period-fertility-7-29-19/
2) The second problem with this regex is that the text I need to input varies greatly in length; it may be less than 2048 or it could be 10,000+ characters. My regex obviously only works for texts up to 6,144 characters long. Do I just keep duplicating the regex a crazy number of times to get longer than the longest text I could enter, or is there a way to get it to repeat?
Addendum: Several asked about the use case/environment for this question. No, it’s not a spambot 🙂. Rather, I’m trying to use Apple’s Shortcuts app to cross-post items from my website to followers on Kik. Unfortunately, Kik has a 2048 character limit, so I can’t post it all at once. I’m trying to use regex to split the text into appropriate sections so I can copy them from Shortcuts and paste them one at a time into Kik.
A couple of notes:
No need to use groups at all; just use the match results directly, as each match represents one section.
Use a lazy quantifier instead of a greedy one by adding ? after {1,2048}, so the match is cut in the right place.
In my regex, I used only the global flag g, without the multiline flag m.
The code below will work only with sections that have 2048 characters or fewer. If a section has more than 2048 characters, it will be skipped.
The regex below uses a positive lookahead to mark the end of a section without consuming it.
Here is the regex:
^|\*[\s\S]{1,2048}?(?=\n\*|$)
Example: https://regex101.com/r/hezvu5/1/
==== Update ====
To make the match greedy, so that it takes in as many sections as possible without splitting the last section, use this regex:
^|\*[\s\S]{1,2048}(?=\n\*|$)
I'm doing a regex operation that needs to stop short of either of the character sequences { or \t\t{.
The first is OK, but the second cannot be achieved using the ^ symbol the way I have been using it.
My current regex is [\t+]?{\d+}[^\{]*
As you can see, I've used ^ effectively with a single character, but I cannot apply it to a string of characters like \t\t\{
How can the current regex be applied to consider both of these possibilities?
Example text:
{1} The words of the blessing of Enoch, wherewith he blessed the elect and righteous, who will be living in the day of tribulation, when all the wicked and godless are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, which the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, [even] on Mount Sinai,
And appear from His camp
And appear in the strength of His might from the heaven of heavens.
{5} And all shall be smitten with fear
And the Watchers shall quake,
And great fear and trembling shall seize them unto the ends of the earth.
{6} And the high mountains shall be shaken,
And the high hills shall be made low,
And shall melt like wax before the flame
When I do this as a multi-line extract, the indentation is not maintained for the first line of each block. Ideally the extract should stop short of the \t\t{, allowing it to be picked up properly in the next extract and creating perfectly indented blocks. The reason for this is that when the blocks are taken from the database, the \t\t should be detected on the first line to allow dynamic formatting.
[\t+]?{\d+}[\s\S]*?(?=\s*{|$)
You can use this; see the demo. The lookahead (?=\s*{|$) stops the match just before the whitespace (such as \t\t) that precedes the next {, or at the end of the input, so the tabs are left to be picked up by the next extract.
https://regex101.com/r/nNUHJ8/1
I am attempting to locate (then extract) a repeated phrase by using the code below. I require phrases beginning with "approximately" and ending in "closed".
For example "approximately $162.9 million in total assets and $144.5 million in total deposits was closed"
str_locate(x,"(\b[Aa]pproximately\b)(.*)(\b[Cc]losed\b)")
str_extract(x,"(\b[Aa]pproximately\b)(.*)(\b[Cc]losed\b)")
The above code returns NA for phrase start and end points.
Here is a sample of the character vector where the phrases are located (it is from a webpage of publicly available FDIC information).
"206-4662).\r\n\r\nDecember \r\n\r\n\r\n Western National Bank, Phoenix, AZ with approximately $162.9 million in total assets and $144.5 million in total deposits was closed. Washington Federal, Seattle, WA has agreed to assume all deposits excluding certain brokered deposits.\r\n(PR-195-2011) \r\n\r\n\r\n\r\n Premier Community Bank of the Emerald Coast, Crestview, FL with approximately $126.0 million in total assets and $112.1 million in total deposits was closed. Summit Bank, N.A., Panama City, FL has agreed to assume all deposits.\r\n(PR-194-2011)"
I may be using regular expressions incorrectly as I am new to them, so any advice is much appreciated.
In an R string, \b is an ASCII backspace. You need to escape the backslashes if you want the regex engine to see \b and treat it as a word boundary:
str_locate(x,"(\\b[Aa]pproximately\\b)(.*)(\\b[Cc]losed\\b)")
Also, you don't need the parentheses around your keywords, unless you want to check their capitalization later. And you can match case-insensitively with the (?i) modifier when using the perl() function for your regexes.
Lastly, be aware that .* will not match if there are newlines between approximately and closed (this can be fixed with (?s)), and it may yield unwanted results if more than one pair of keywords is present in the string.
Therefore, you should probably change your regex to
str_locate(x, perl("(?is)\\bapproximately\\b(.*?)\\bclosed\\b"))
I have two paragraphs. I want to replace ONLY the first occurrence of a specific word 'acetaminophen' with '{yootooltip :: It is a widely used over-the-counter analgesic (pain reliever) and antipyretic (fever reducer). Excessive use of paracetamol can damage multiple organs, especially the liver and kidney.}acetaminophen{/yootooltip}'
The paragraph is:
Percocet is a painkiller which is partly made from oxycodone and partly made from acetaminophen. It will usually be prescribed for a patient who is suffering from acute severe pain. Because it has oxycodone in it, this substance can create an addiction and is also a dangerous prescription drug to abuse. It is illegal to either sell or use Percocet that has not been prescribed by a licensed professional.
In 2008, drugs like Percocet (which have both oxycodone and acetaminophen as their main ingredients) were the prescription drugs most sold in all of Ontario. The records also show that the rates of death by oxycodone (this includes brand name like Percocet) doubled. That is why it is imperative that the people who are addicted go to an Ontario drug rehab center. Most of the drug rehabs can take care of Percocet addiction.
I am trying to write a regular expression for this. I have tried
\bacetaminophen\b
But it is replacing both occurrences.
Any help would be appreciated.
Thanks
Use the optional $limit parameter of PHP's preg_replace function (http://us2.php.net/manual/en/function.preg-replace.php):
$text = preg_replace('/acetaminophen/i', 'da-da daa', $text, 1);
will replace only the first occurrence.
For the tool you're using, just use this:
(.*?)\bacetaminophen\b