I have the following text layout:
Heading
Chapter 1:1 This is text
2 This is more text
3 This is more text
4 This is more text
5 This is more text
6 This is more text
7 This is more text
8 This is more text
9 This is more text
10 This is more text
11 This is more text
12 This is more text
13 This is more text
14 This is moret text
15 This is more text
Heading
Chapter 2:1 This is text
2 This is more text...
and I am trying to add the first Chapter reference and the last one in that Chapter right after the Heading, written in parentheses. Like so:
Heading (Chapter 1:1-15)
Chapter 1:1 This is text
2 This is more text
3 This is more text
4 This is more text
5 This is more text
6 This is more text
7 This is more text
8 This is more text
9 This is more text
10 This is more text
11 This is more text
12 This is more text
13 This is more text
14 This is moret text
15 This is more text
I've come up with this regular expression so far:
~s/(?s)(Heading)\r(^\d*\w+\s*\d+:\d+|\d+:\d+)(.*?)(\d+)(.*?\r)(?=Heading)/\1 (\2-\4)\r\2\3\4\5/g;
but this is grabbing the first number right after Chapter 1:1 (i.e. "2", "Heading (Chapter 1:1-2)"), instead of the last one ("15" as in "Heading (Chapter 1:1-15)"). Could someone please tell me what's wrong with the regex? Thank you!
Edit for updated question
Here's a regex with explanation that will solve your problem. http://codepad.org/mSIYCw4R
~s/
((?:^|\n)Heading) #Capture Heading into group 1.
#We can't use lookbehind because of (?:^|\n)
(?= #A lookahead, but don't capture.
\nChapter\s #Find the Chapter text.
(\d+:\d+) #Capture the first chapter:verse reference into group 2.
.* #Capture the rest of the Chapter line.
(?:\n(\d+).+)+ #Capture every chapter line.
#The last captured chapter number gets stored into group 3.
)
/$1 (Chapter $2-$3)/gx;
An implementation of @FMc's comment could be something like:
#!/usr/bin/perl
use warnings;
use strict;
my $buffer = '';
while (<DATA>) {
if (/^Heading \d+/) { # process previous buffer, and start new buffer
process_buffer($buffer);
$buffer = $_;
}
else { # add to buffer
$buffer .= $_;
}
}
process_buffer($buffer); # don't forget last buffer's worth...
sub process_buffer {
my($b) = @_;
return unless length $b; # don't bother with an unpopulated buffer
my($last) = $b =~ /(\d+)\s.*$/;
my($chap) = $b =~ /^(Chapter \d+:\d+)/m;
$b =~ s/^(Heading \d+)/$1 ($chap-$last)/;
print $b;
}
__DATA__
Heading 1
Chapter 1:1 This is text
2 This is more text
3 This is more text
4 This is more text
5 This is more text
6 This is more text
7 This is more text
8 This is more text
9 This is more text
10 This is more text
11 This is more text
12 This is more text
13 This is more text
14 This is moret text
15 This is more text
Heading 2
Chapter 2:1 This is text
2 This is more text...
3 This is more text
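The same buffer-per-heading idea can also be sketched in Python. This is only an illustration under assumptions: the function names annotate_block and annotate_headings are made up here, headings are assumed to start with the word "Heading", and verse lines are assumed to start with a number followed by whitespace.

```python
import re

# "Chapter 1:1 ..." line: group 0 is the reference, group 2 is the verse number.
CHAPTER_RE = re.compile(r"^(Chapter \d+:)(\d+)")
# Plain verse line such as "15 This is more text".
VERSE_RE = re.compile(r"^(\d+)\s")


def annotate_block(block):
    """Append "(Chapter c:v-last)" to the heading that opens this block."""
    if not block[0].startswith("Heading"):
        return block  # text before the first heading is left untouched
    first_ref = None
    last_verse = None
    for line in block[1:]:
        chapter = CHAPTER_RE.match(line)
        verse = VERSE_RE.match(line)
        if chapter:
            if first_ref is None:
                first_ref = chapter.group(0)
            last_verse = chapter.group(2)
        elif verse:
            last_verse = verse.group(1)
    if first_ref is None:
        return block  # heading without any chapter line
    return [f"{block[0]} ({first_ref}-{last_verse})"] + block[1:]


def annotate_headings(text):
    """Group lines into blocks that start at each heading, then annotate."""
    blocks = []
    for line in text.splitlines():
        if line.startswith("Heading") or not blocks:
            blocks.append([line])
        else:
            blocks[-1].append(line)
    return "\n".join(line for block in blocks for line in annotate_block(block))
```

Because each block is processed on its own, the last verse number is simply the last one seen before the next heading, so no lookahead or backtracking is needed.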
Here is my simple text file:
1. Text About Question 1
2. Text About Question 2
.
.
20. Text About Question 20
I have 250 text files, each containing exactly 20 questions. I want to convert these files to XML by wrapping every numbered question in a "question" tag, so they will look like:
<question>1. Text About Question 1</question>
<question>2. Text About Question 2</question>
.
.
<question>20. Text About Question 20</question>
I tried this regex: search for (\d{1}.) and replace with <question>\1. It only works for numbers 1 through 9; from 10 onward it splits the number, like this:
1<question>0. Text About Question 10
As a second attempt, I tried (\d{2}.), which only affects numbers 10 through 20. The result looks like:
1. Text About Question 1
2. Text About Question 2
.
.
<question>20. Text About Question 20</question>
I couldn't then follow up with (\d{1}.), because that regex adds the same tags again to the numbers between 10 and 20, so the result looks like:
<question>1. Text About Question 1 </question>
<question>2. Text About Question 2</question>
.
.
<question><question>20. Text About Question 20</question>
Is there a proper way to tag each question from 1 to 20 using a regex?
You want to match all numbers between 1 and 20. Here is the regex for that
^[1-9]\.$|^1[0-9]\.$|^20\.$
Breakdown
^ - Start of line
[1-9] - Any digit between 1 and 9. Note 0 is not included
\. - Escape character before a period. Otherwise it will match any character
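$ - End of line. Note that with $ directly after the period, each alternative only matches a line holding nothing but the number and period; drop the $ to match lines that continue with the question text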
| - Or
^1[0-9]\.$ - Starts with a 1 and is between 10 and 19.
|^20\.$ - Or starts and ends with 20.
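As a rough illustration of how the 1 to 20 alternation can be used to tag whole lines, here is a minimal Python sketch. The function name tag_questions is hypothetical, and the pattern drops the trailing $ after the period so that the rest of the line can be captured.

```python
import re

# A line that starts with a number from 1 to 20 followed by a period.
# The whole line is captured so it can be wrapped in one step.
QUESTION_LINE = re.compile(r"^((?:[1-9]|1[0-9]|20)\..*)$", re.MULTILINE)


def tag_questions(text):
    """Wrap every numbered question line in <question>...</question>."""
    return QUESTION_LINE.sub(r"<question>\1</question>", text)


if __name__ == "__main__":
    sample = "1. Text About Question 1\n10. Text About Question 10\n20. Text About Question 20"
    print(tag_questions(sample))
```

Matching the full number at the start of the line in a single pass avoids both problems from the question: two-digit numbers are not split, and no line is tagged twice.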
I would like to know why the following RegEx's:
\b\w{7}\b\s[1]\s[\S\s]+?(?=WHAT WHERE WHAT WHERE WHAT\,\sWHERE\sWHAT.)
and:
\b\w{7}\b\s[1]\s[\S\s]+?(?=WHAT WHERE WHAT WHERE WHAT\,\sWHERE\sWHAT.|HOW WHO HOW WHO HOW\,\sWHO\sHOW\.)
seem to work perfectly fine on the following test string:
THIS THAT THIS THAT THIS,
THAT
THIS.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
WHAT WHERE WHAT WHERE WHAT,
WHERE
WHAT.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
HOW WHO HOW WHO HOW,
WHO
HOW.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
IF OR IF OR IF.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
TO FOR TO FOR
TO FOR TO FOR.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
IN UNDER IN
UNDER IN UNDER.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
LEFT RIGHT LEFT
RIGHT LEFT.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
UP DOWN UP DOWN UP
DOWN.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
THE END.
But, when I use the same type of Expression on files that exceed 5MB, it fails.
The VBScript that I am using is as follows:
Option Explicit
Dim strPath : strPath = "myFile.txt"
If Instr(1, WScript.FullName, "CScript", vbTextCompare) = 0 Then
With CreateObject("WScript.Shell")
.Run "cmd.exe /k cscript //nologo """ & WScript.ScriptFullName & """", 1, False
WScript.Quit
End With
Else
With CreateObject("Scripting.FileSystemObject")
If .FileExists(strPath) Then
Call Main(strPath)
Else
WScript.Echo "Input file doesn't exist"
End If
End With
End If
Private Sub Main(filePath)
Dim TempDictionary, Books, Book, b
Set TempDictionary = CreateObject("Scripting.Dictionary")
Set Books = RegEx(GetFileContent(filePath),"\b\w{7}\b\s[1]\s[\S\s]+?THE SECOND BOOK OF MOSES")
If Books.Count > 0 Then
For Each Book In Books
WScript.Echo Replace(Left(Book.Value,70),vbCrLf," ")
Next
Else
WScript.Echo "Document didn't contain any valid books"
WScript.Quit
End If
End Sub
Private Function GetFileContent(filePath)
Dim objFS, objFile, objTS
Set objFS = CreateObject("Scripting.FileSystemObject")
Set objFile = objFS.GetFile(filePath)
Set objTS = objFile.OpenAsTextStream(1, 0)
GetFileContent = objTS.Read(objFile.Size)
Set objTS = Nothing
End Function
Private Function RegEx(str,pattern)
Dim objRE, Match, Matches
Set objRE = New RegExp
objRE.Pattern = pattern
objRE.Global = True
Set RegEx = objRE.Execute(str)
WScript.Echo objRE.Test(str)
End Function
Editor that I am using is here: http://www.regexr.com/
Q: What are you trying to do?
A: I want to be able to split any text file into several string chunks, using a regex that captures everything between two strings. The first delimiter is a fixed term, e.g. "CHAPTER 1". The second delimiter varies, but its possible values are known, so they can be placed into an array and then parsed.
The problem that I am having is that the Lookaround (?=) seems to either escape or get stuck in a loop. I have been playing around with the "|" operator, as you can see in the second RegEx at the start of this OP.
The test file that I am working with seems to parse just fine. No problem. But the larger files that I am working with... I don't know. Something just goes wrong.
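To make the goal concrete, here is a minimal sketch in Python (not VBScript) of splitting text between a fixed start pattern and any one of a list of known end markers. The function name split_between and the sample markers are hypothetical, and the sketch does not diagnose why the original script fails on large files; it only shows how the alternation can be built from an array.

```python
import re


def split_between(text, start_pattern, end_markers):
    """Return chunks that begin at start_pattern and stop before any end marker.

    end_markers is a list of literal strings; each one is escaped so that
    characters such as "." or "," are matched literally.
    """
    ends = "|".join(re.escape(marker) for marker in end_markers)
    # Lazy match up to (but not including) the nearest end marker,
    # or up to the end of the text if no marker follows.
    pattern = re.compile(rf"{start_pattern}[\s\S]+?(?={ends}|\Z)")
    return [match.group(0) for match in pattern.finditer(text)]
```

Escaping each marker matters here: in the patterns above, the final "." in "WHERE\sWHAT." is unescaped and therefore matches any character.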
I've a bunch of files named like this:
text 01 (blabla) other text
text 02 (whatever) other text
.
.
text 025 (etc) other text
some text 1 (20031020) other text
some text 2 (20031022) other text
.
.
some text 10 (20031025) other text
some new text 01 other text
.
.
.
some new text 200 other text
and I want to extract from the filename only the words before the first number, so from the example above I want to obtain:
text
some text
some new text
I want to do this so I can move each file into the matching folder based on its name (creating the folder if it does not exist).
I want to do this with bash. I know it can be done using a regex, but I don't know how: I've only seen examples where the fields are delimited by known characters, while in this case the delimiter is a space followed by any number.
Use ${variable%%pattern}, which removes the longest suffix matching pattern. Including the space in the pattern keeps it out of the result.
$ filename='text 01 (blabla) other text'
$ echo "${filename%% [0-9]*}"
text
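For the follow-up step of moving each file into a folder named after that prefix, a minimal Python sketch might look like this. The function names folder_name and move_into_folders are hypothetical, and the sketch assumes the prefix ends at the first whitespace that is followed by a digit.

```python
import re
from pathlib import Path


def folder_name(filename):
    """Return everything before the first whitespace that precedes a digit."""
    return re.split(r"\s+\d", filename, maxsplit=1)[0]


def move_into_folders(directory):
    """Move each file in directory into a subfolder named after its prefix."""
    directory = Path(directory)
    # Snapshot the listing first, since the loop changes the directory.
    for path in list(directory.iterdir()):
        if not path.is_file():
            continue
        prefix = folder_name(path.name)
        if prefix == path.name:
            continue  # no number in the name, nothing to group by
        target = directory / prefix
        target.mkdir(exist_ok=True)
        path.rename(target / path.name)
```

Calling move_into_folders(".") would process the current directory; it is worth trying on a copy of the files first.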
Given a noun list in a .txt file, where nouns are separated by new lines, such as this one:
hooligan
football
brother
bollocks
...and a separate .txt file containing a series of regular expressions separated by new lines, like this:
[a-z]+\tNN(S)?
[a-z]+\tJJ(S)?
...I would like to run the regular expressions against each sentence of a corpus. Every time a regexp matches, if the matched text contains one of the nouns in the list, I would like to print that noun followed (separated by a tab) by the regular expression that matched it. Here is an example of what the resulting output could look like:
football [a-z]+NN(S)?\'s POS[a-z]+NN(S)?
hooligan [a-z]+NN(S)?,,[a-z]+JJ[a-z]+NN(S)?
hooligan [a-z]+NN(S)?,,[a-z]+JJ[a-z]+NN(S)?
football [a-z]+NN(S)?[a-z]+NN(S)?
brother [a-z]+PP$[a-z]+NN(S)?
bollocks [a-z]+DT[a-z]+NN(S)?
football [a-z]+NN(s)?(be)VBZnotRB
The corpus I would use is huge (tens of GB) and has the following format (each sentence is contained in the tag <s>):
<s>
Hooligans hooligan NNS 1 4 NMOD
, , , 2 4 P
unbridled unbridled JJ 3 4 NMOD
passion passion NN 4 0 ROOT
- - : 5 4 P
and and CC 6 4 CC
no no DT 7 9 NMOD
executive executive JJ 8 9 NMOD
boxes box NNS 9 4 COORD
. . SENT 10 0 ROOT
</s>
<s>
Hooligans hooligan NNS 1 4 NMOD
, , , 2 4 P
unbridled unbridled JJ 3 4 NMOD
passion passion NN 4 0 ROOT
- - : 5 4 P
and and CC 6 4 CC
no no DT 7 9 NMOD
executive executive JJ 8 9 NMOD
boxes box NNS 9 4 COORD
. . SENT 10 0 ROOT
</s>
<s>
Portsmouth Portsmouth NP 1 2 SBJ
bring bring VVP 2 0 ROOT
something something NN 3 2 OBJ
entirely entirely RB 4 5 AMOD
different different JJ 5 3 NMOD
to to TO 6 5 AMOD
the the DT 7 12 NMOD
Premiership Premiership NP 8 12 NMOD
: : : 9 12 P
football football NN 10 12 NMOD
's 's POS 11 10 NMOD
past past NN 12 6 PMOD
. . SENT 13 2 P
</s>
<s>
This this DT 1 2 SBJ
is be VBZ 2 0 ROOT
one one CD 3 2 PRD
of of IN 4 3 NMOD
Britain Britain NP 5 10 NMOD
's 's POS 6 5 NMOD
most most RBS 7 8 AMOD
ardent ardent JJ 8 10 NMOD
football football NN 9 10 NMOD
cities city NNS 10 4 PMOD
: : : 11 2 P
think think VVP 12 2 COORD
Liverpool Liverpool NP 13 0 ROOT
or or CC 14 13 CC
Newcastle Newcastle NP 15 19 SBJ
in in IN 16 15 ADV
miniature miniature NN 17 16 PMOD
, , , 18 15 P
wound wind VVD 19 13 COORD
back back RB 20 19 ADV
three three CD 21 22 NMOD
decades decade NNS 22 19 OBJ
. . SENT 23 2 P
</s>
I started working on a Perl script to achieve my goal, and in order not to run out of memory with such a huge dataset I used the module Tie::File so that my script would read one line at a time (instead of trying to load the entire corpus file into memory). This would work perfectly with a corpus where each sentence corresponds to a single line, but not in the current case, where sentences are spread over several lines and delimited by a tag.
Is there a way to achieve what I want using a combination of Unix terminal commands (e.g. cat and grep)? Alternatively, what would be the best solution for this issue? (Some code examples would be great.)
A simple regex alternation is sufficient to extract matching data from the noun list and Regexp::Assemble can handle the requirement for identifying which pattern from the other file matched. And, as Jonathan Leffler mentions in his comment, setting the input record separator allows you to read a single record at a time, even when each record spans multiple lines.
Combining all that into a running example, we get:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use Regexp::Assemble;
my @nouns = qw( hooligan football brother bollocks );
my @patterns = ('[a-z]+\s+NN(S)?', '[a-z]+\s+JJ(S)?');
my $name_re = '(' . join('|', @nouns) . ')'; # Assumes no regex metacharacters
my $ra = Regexp::Assemble->new(track => 1);
$ra->add(@patterns);
local $/ = '<s>';
while (my $line = <DATA>) {
my $match = $ra->match($line);
next unless defined $match;
while ($line =~ /$name_re/g) {
say "$1\t\t$match";
}
}
__DATA__
...
...where the content of the __DATA__ section is the sample corpus provided in the original question. I didn't include it here in the interest of keeping the answer compact. Note also that, in both patterns, I changed \t to \s+; this is because the tabs were not preserved when I copied and pasted your sample corpus.
Running that code, I get the output:
hooligan [a-z]+\s+NN(S)?
hooligan [a-z]+\s+NN(S)?
football [a-z]+\s+NN(S)?
football [a-z]+\s+NN(S)?
football [a-z]+\s+JJ(S)?
football [a-z]+\s+JJ(S)?
Edit: Corrected regexes. I initially replaced \t with \s, causing it to match NN or JJ only when preceded by exactly one space. It now also matches multiple spaces, which better emulates the original \t.
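The record-at-a-time idea can also be sketched in Python with a generator, so that only one sentence is held in memory at a time. This is an illustration under assumptions, not a port of the code above: the names sentences and find_nouns are made up, columns are assumed to be separated by whitespace, and the patterns are assumed to be written against "lemma<TAB>tag" lines as in the question.

```python
import re


def sentences(lines):
    """Yield one sentence at a time as a list of its token lines."""
    current = []
    for line in lines:
        stripped = line.strip()
        if stripped == "<s>":
            current = []
        elif stripped == "</s>":
            yield current
            current = []
        elif stripped:
            current.append(stripped)


def find_nouns(lines, nouns, patterns):
    """Yield (noun, pattern) for each pattern match that starts with a listed noun."""
    compiled = [(source, re.compile(source)) for source in patterns]
    for sentence in sentences(lines):
        rows = []
        for row in sentence:
            cols = row.split()
            if len(cols) >= 3:
                # Keep only the lemma and the tag, joined by a tab.
                rows.append(f"{cols[1]}\t{cols[2]}")
        text = "\n".join(rows)
        for source, regex in compiled:
            for match in regex.finditer(text):
                lemma = match.group(0).split("\t")[0]
                if lemma in nouns:
                    yield lemma, source
```

Passing an open file object as lines (for example, with open(corpus_path) as handle) streams the corpus line by line; nouns works best as a set so that membership checks stay fast.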
I ended up writing a quick script that solves my problem. I used Tie::File to handle huge textual datasets and specified </s> as the record separator, as suggested by Jonathan Leffler (the solution proposed by Dave Sherohman seems very elegant but I couldn't try it).
After the separation of the sentences I isolate the columns that I need (2nd and 3rd) and I run the regular expressions. Before printing the output I check whether the matched word is present in my word list: if not, this is excluded from the output.
I share my code here (comments included) in case someone else needs something similar.
It's a bit dirty and could definitely be optimized, but it works for me and it supports very large corpora (I tested it with a corpus of 10GB: it completed successfully in a few hours).
use strict;
use Tie::File; #This module makes a file look like a Perl array, each array element corresponds to a line of the file.
if ($#ARGV < 0 ) { print "Usage: perl albzcount.pl corpusfile\n"; exit; }
#read nouns list (.txt file with one word per line - line breaks LF)
my $nouns_list = "nouns.txt";
open(DAT, $nouns_list) || die("Could not open the config file $nouns_list or file doesn't exist!");
my @nouns_contained_in_list=<DAT>;
close(DAT);
# Reading regexp list (.txt file with one regexp per line - line breaks LF)
my $regex_list = "regexp.txt";
open(DAT, $regex_list) || die("Could not open the config file $regex_list or file doesn't exist!");
my @regexps_contained_in_list=<DAT>;
close(DAT);
# Reading Corpus File (each sentence is spread on more lines and separated by tag <s>)
my $corpusfile = $ARGV[0]; #Corpus filename (passed as an argument through the command)
# With TIE I don't load the entire file in an array. Perl thinks it's an array but the file is actually read line by line
# This is the key to manipulate huge text files without running out of memory
tie my @raw_corpus_data, 'Tie::File', $corpusfile, recsep => '</s>' or die "Can't read file: $!\n";
#START go through the sentences of the corpus (spread on multiple lines and separated by <s>), one by one
foreach my $corpus_line (@raw_corpus_data){
#take a single sentence (that is spread along different lines).
#NB each line contains "columns" separated by tab
my @corpus_sublines = split('\n', $corpus_line);
#declare variable. Later values will be appended to it
my $corpus_line;
#for each line that composes a sentence
foreach my $sentence_newline(@corpus_sublines){
#explode by tab (column separator)
my @corpus_columns = split('\t', $sentence_newline);
#put together new sentences using just column 2 and 3 (noun and tag) for each original sentence
$corpus_line .= "$corpus_columns[1]\t$corpus_columns[2]\n";
#... Now the corpus has the format I want and can be processed
}
#foreach regex
foreach my $single_regexp(@regexps_contained_in_list){
# Remove the new lines (both \n and \r - depending on the OS) from the regexp present in the file.
# Without this, the regular expressions read from the file don't always work.
$single_regexp =~ s/\r|\n//g;
#if the corpus line analyzed in this cycle matches the regexp
if($corpus_line =~ m/$single_regexp/) {
# explode by tab the matched results so the first word $onematch[0] can be isolated
# $& is the entire matched string
my @onematch = split('\t', $&);
# OUTPUT RESULTS
#if the matched noun is not empty and it is part of the word list
if ($onematch[0] ne "" && grep( /^$onematch[0]$/, @nouns_contained_in_list )) {
print "$onematch[0]\t$single_regexp\n";
} # END OUTPUT RESULTS
} #END if the corpus line analyzed in this cycle matches the regexp
} #END foreach regex
} #END go through the lines of the corpus, one by one
# Untie the source corpus file
untie @raw_corpus_data;
I have a file which is of the following form :
some text
some more text
. . .
. . .
data {
1 2 3 5 yes 10
2 3 4 5 no 11
}
some text
some text
I want to extract the data portion of the file using regular expression using the following procedure:
proc ExtractData {fileName} {
set sgd [open $fileName r]
set sgdContents [read $sgd]
regexp "data \\{(?.*)\\}" $sgdContents -> data
puts $data
}
But this is giving the following error:
couldn't compile regular expression pattern: quantifier operand invalid
I am not able to figure out what is wrong with the regular expression. Any help would be highly appreciated.
Use this regular expression
regexp {data \{(.*)\}} $sgdContents wholematch submatch
puts $submatch
wholematch matches the entire pattern. In your case it is
data {
1 2 3 5 yes 10
2 3 4 5 no 11
}
And submatch matches only the content inside braces like below:
1 2 3 5 yes 10
2 3 4 5 no 11
The following regexp line works
regexp "data \\{\\\n(.*?)\\\n\\s*\\}" $sgdContents -> data
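The main problem with the original regular expression was the misplaced non-greedy indicator (?). In (?.*) the ? comes before .*, where it has nothing to quantify, hence the "quantifier operand invalid" error. Written as (.*?), it makes the match non-greedy, so the engine stops at the first closing brace that lets the pattern match instead of running on to the last one.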
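The effect of the non-greedy quantifier can be illustrated with a small Python sketch. The function name extract_data is hypothetical, and the pattern is an adaptation for Python's re module, not the Tcl syntax used above.

```python
import re


def extract_data(contents):
    """Return the lines between "data {" and the first closing brace, or None."""
    # re.DOTALL lets "." match newlines; ".*?" stops at the first "}" that
    # follows a newline instead of running on to the last one in the file.
    match = re.search(r"data \{\n(.*?)\n\s*\}", contents, re.DOTALL)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = "some text\ndata {\n1 2 3 5 yes 10\n2 3 4 5 no 11\n}\nsome text\n"
    print(extract_data(sample))
```

With a greedy (.*) the same pattern would still work on this sample, but on a file containing a later closing brace it would capture everything up to that last brace.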