How to extract text under specific headings from a pdf? - python-2.7

I want to extract text under specific headings from a pdf using python.
For example, I have a pdf with headings Introduction,Summary,Contents. I need to extract only the text under the heading 'Summary'.
How can I do this?

This scenario is exactly what I am working on at my current company: we need to extract the text lying under a heading. I use a rule-based system, i.e. a regex that identifies all the numbered headings after reading the entire document line by line. Once I have the headings, I enter the name of the heading whose paragraph I want. This input is matched against the list of extracted headings, and I use Universal Sentence Encoder to find the nearest match. After that, I display all the content from that heading up to the next heading.
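A minimal sketch of that workflow, under some assumptions: the heading regex is a guess at what "numbered headings" look like, the function names are hypothetical, and difflib from the standard library stands in for Universal Sentence Encoder (it compares spelling, not meaning).

```python
import difflib
import re

# Assumed heading shape: "1. Title", "1.2 Title", "1.2.3 Title", ...
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S.*$")


def split_sections(lines):
    """Map each numbered heading to the lines that follow it."""
    sections = {}
    current = None
    for line in lines:
        if HEADING_RE.match(line):
            current = line.strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def find_section(sections, query):
    """Return (heading, text) for the heading whose title is closest to query."""
    # Compare on the title only, without the leading numbering.
    titles = {re.sub(r"^[\d.]+\s+", "", h).lower(): h for h in sections}
    best = difflib.get_close_matches(query.lower(), list(titles), n=1, cutoff=0.0)
    if not best:
        return None
    heading = titles[best[0]]
    return heading, "\n".join(sections[heading])


lines = [
    "1. Introduction", "intro text",
    "2. Summary", "summary line 1", "summary line 2",
    "3. Contents", "toc",
]
result = find_section(split_sections(lines), "sumary")  # misspelled on purpose
print(result)
```

A body line that happens to start with a number would be mistaken for a heading here, so the regex usually needs tuning per document.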

PDF is unstructured text, so there are no tags to extract data from directly. Instead, we use a regular expression to find the desired information in the raw text.
Extract the raw page text using the following code (PyMuPDF).
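import re
import fitz  # PyMuPDF

pdf_file = fitz.open("document.pdf")  # assumed file name; use your own PDF path
page = pdf_file.loadPage(0)  # 0 is the page number; valid values are 0 to n-1
# Newer PyMuPDF releases rename these methods to snake_case
# (load_page, get_displaylist, get_textpage).
dl = page.getDisplayList()
tp = dl.getTextPage()
tp_text = tp.extractText()
re.split(r'\n\d+.+[ \t][a-zA-Z].+\n', tp_text)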
Then apply a regular expression that fits your headings (this one worked for me, but you may need to adapt it).
Here is a detailed example of how this works:
re.findall(r'\n\d+.+[ \t][a-zA-Z].+\n', "some text\n1. heading 1\nparagraph 1\n1.2.3 Heading 2\nparagraph 2")
Output: ['\n1. heading 1\n', '\n1.2.3 Heading 2\n']
You can use re.split to split the text at the headings and retrieve the text under the heading you want.
re.split(r'\n\d+.+[ \t][a-zA-Z].+\n', "some text\n1. heading 1\nparagraph 1\n1.2.3 Heading 2\nparagraph 2")
Output: ['some text', 'paragraph 1', 'paragraph 2']
In other words, the text under heading i of the re.findall result (counting from 0) is element i+1 of the re.split result; element 0 is whatever precedes the first heading.
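The pairing between the two results can be sketched as follows. The function name is hypothetical, and the pattern is the one from this answer, so it carries the same limitation: two headings on consecutive lines are not both detected, because each match consumes its trailing newline.

```python
import re

HEADING_RE = r'\n\d+.+[ \t][a-zA-Z].+\n'


def text_under_headings(text):
    """Return a dict mapping each heading to the text that follows it."""
    headings = [h.strip() for h in re.findall(HEADING_RE, text)]
    # Element 0 of the split is the text before the first heading; skip it.
    bodies = re.split(HEADING_RE, text)[1:]
    return dict(zip(headings, bodies))


sample = "some text\n1. heading 1\nparagraph 1\n1.2.3 Heading 2\nparagraph 2"
sections = text_under_headings(sample)
print(sections["1.2.3 Heading 2"])
```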

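The best method I found uses a regular expression that matches each numbered heading together with the text below it, up to the next numbered heading (samplestring is the extracted text):
regex = r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*"
print(re.findall(regex, samplestring, re.M))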
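For illustration, here is that regex applied to a made-up sample (the text and heading names are assumptions, not taken from a real PDF). Note that the pattern expects a space directly after the number, so it matches "2 Summary" or "1.2 Scope" but not "2. Summary".

```python
import re

regex = r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*"

samplestring = (
    "cover page text\n"
    "1 Introduction\n"
    "intro text\n"
    "2 Summary\n"
    "summary text\n"
    "more summary\n"
    "3 Contents\n"
    "toc"
)

# Each match is one heading line plus every following line that is not a heading.
sections = re.findall(regex, samplestring, re.M)

# Pick the section whose heading line mentions the wanted title.
summary = next(s for s in sections if "Summary" in s.splitlines()[0])
print(summary)
```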

Related

Extract a list of unique text characters/ emojis from a cell

I have a text in cell (A1) like this:
✌😋👅👅☝️😉🍌🍪💧💧
I want to extract the unique emojis from this cell into separate cells:
✌😋👅☝️😉🍌🍪💧
Is this possible?
You can put each character of the text into its own cell by splitting it with built-in Google Sheets functions.
Sample formula:
=SPLIT(REGEXREPLACE(A1,"(.)","$1#"),"#")
✌😋👅👅☝️😉🍌🍪💧💧 is put in a cell "A1".
REGEXREPLACE appends # to each character, giving ✌#😋#👅#👅#☝#️#😉#🍌#🍪#💧#💧#.
SPLIT then splits the value at each #.
Note:
Your value contains an invisible character, \ufe0f (a variation selector), right after ☝. It ends up in a cell of its own, so "G1" looks empty even though it holds a value. Please be careful with this. If you want to avoid it, use ✌😋👅👅☝😉🍌🍪💧💧 instead.
References:
REGEXREPLACE
SPLIT
Added:
marikamitsos's comment made me realize that I had misunderstood the question. The final result is as follows; this formula is from marikamitsos.
=TRANSPOSE(UNIQUE(TRANSPOSE(SPLIT(REGEXREPLACE(A1,"(.)","$1#"),"#"))))
or try:
=TRANSPOSE(UNIQUE(TRANSPOSE(REGEXEXTRACT(A1, REPT("(.)", LEN(A1))))))
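To see what these formulas operate on, here is a rough Python 3 equivalent. It assumes that (.) matches one Unicode code point at a time, which is what the \ufe0f note above suggests; the sample value is written with escapes so that the invisible character shows up in the source.

```python
# The sample value from the question, including U+FE0F after the pointing finger.
value = (
    "\u270c\U0001f60b\U0001f445\U0001f445\u261d\ufe0f"
    "\U0001f609\U0001f34c\U0001f36a\U0001f4a7\U0001f4a7"
)

code_points = list(value)                  # one entry per code point, like SPLIT
unique = list(dict.fromkeys(code_points))  # order-preserving de-duplication

print(len(code_points), len(unique))
print(["U+%04X" % ord(ch) for ch in unique])
```

The unique list still contains U+FE0F as a separate entry, which is why one of the result cells looks empty.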
Formula
One of the best formula solutions appears to be:
=SPLIT(REGEXREPLACE(A1,"(.)","$1#"),"#")
You can also add checks for skin-tone modifiers and intermediate characters (zero-width joiner, variation selector):
=TRANSPOSE(SPLIT(REGEXREPLACE(A2,"(.[🏻🏼🏽🏾🏿"&CHAR(8205)&CHAR(65039)&"]*)","#$1"),"#"))
This keeps some multi-part emojis together as a single emoji.
Script
A more precise way is to use a script. Add the code from
https://github.com/orling/grapheme-splitter/blob/master/index.js
to the Script editor, then add a function like this for sample usage:
function splitEmojis(string) {
  var splitter = new GraphemeSplitter();
  // split the string to an array of grapheme clusters (one string each)
  var graphemes = splitter.splitGraphemes(string);
  return graphemes;
}
Tests
The results are not 100% precise.
1
Please note: some emojis are not displayed correctly in Sheets.
🏴󠁧󠁢󠁷󠁬󠁳󠁿🏴󠁧󠁢󠁳󠁣󠁴󠁿🏴󠁧󠁢󠁥󠁮󠁧󠁿🏴
↑ The emojis above:
flag: England
flag: Scotland
flag: Wales
black flag
all look the same in Google Sheets.
2
The VLOOKUP function in Google Sheets and in Excel thinks the characters
#️⃣ and
*️⃣
are the same!

Adding a space within a line in file with a specific pattern

I have a file with some data as follows:
795 0.16254624E+01-0.40318151E-03 0.45064186E+04
I want to add a space before the third number using search and replace, like this:
795 0.16254624E+01 -0.40318151E-03 0.45064186E+04
The regular expression for the search is \d - \d. But what should I write in the replacement to get the output above? I have over 4000 similar lines and cannot do it manually. Also, can I do it in Python?
You could use findall to get the numbers and then join them with a space, which returns a string with the values separated by whitespace.
[+-]?\d+(?:\.\d+E[+-]\d+)?\b
import re
regex = r"[+-]?\d+(?:\.\d+E[+-]\d+)?\b"
test_str = "795 0.16254624E+01-0.40318151E-03 0.45064186E+04"
matches = re.findall(regex, test_str)
print(" ".join(matches))
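Since the question asks for a search and replace, here is an alternative sketch that inserts the missing space in place instead of rebuilding the line. It assumes the only problem is a sign glued directly to the preceding digit; the function names and file paths are hypothetical.

```python
import re

# Zero-width position between a digit and a sign that starts the next number.
# The sign of an exponent ("E+01") is preceded by "E", not a digit, so it is skipped.
GAP_RE = re.compile(r"(?<=\d)(?=[+-]\d)")


def separate_numbers(line):
    """Insert a space wherever two numbers run together."""
    return GAP_RE.sub(" ", line)


def fix_file(src_path, dst_path):
    """Apply the fix to every line of src_path and write the result to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(separate_numbers(line))


fixed = separate_numbers("795 0.16254624E+01-0.40318151E-03 0.45064186E+04")
print(fixed)
```

Lines that are already correctly spaced pass through unchanged, so running it over all 4000 lines should be safe.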
You could do it very easily in MS Excel:
Copy the content of your file into one column of a new Excel sheet.
Select the whole column and, on the Data ribbon, choose Text to Columns.
A wizard dialog appears; select Fixed width, then click Next.
Click at the position where you want the new space, which tells Excel to split the text into a new column at that point, then click Next.
Select each column header, set the column data format to Text to preserve the formatting, and click Finish.
You can then copy the new columns or export them to a new text file.

Regex to select specified paragraphs InDesign

I am trying to find a regex to use in InDesign that could select arbitrary paragraphs in a text box (any paragraphs I choose, not every nth one in sequence).
In the following example, for instance, I would like to be able to select the 2nd, 3rd, and 5th paragraphs by entering 2, 3, and 5 somewhere in a regex.
This needs to be done with a script. See below for an example to get you started. The script assumes that a text frame containing your paragraphs is selected when you run it. Note: it makes no effort to check or handle errors (e.g. a non-numeric input for the paragraph number); you'll need to add that yourself. You could also modify the input to accept a comma-delimited list of paragraph numbers if needed.
var doc = app.activeDocument;
var frame = app.selection[0];
var para = parseInt(prompt("Paragraph:", ''));
//replace TestStyle with your desired style name
var style = app.activeDocument.paragraphStyles.item('TestStyle');
frame.parentStory.paragraphs[para - 1].appliedParagraphStyle = style;
Alternatively, match each paragraph with
/([^\n]+\n)/g
and then use the captured groups to extract the paragraphs you want.
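The idea of matching every paragraph and then picking some by number can be illustrated outside InDesign. This is a plain Python sketch, not InDesign scripting; the function name and sample text are made up, and the trailing \n is dropped from the pattern so that a last paragraph without a line break is still found.

```python
import re


def pick_paragraphs(text, numbers):
    """Return the paragraphs at the given 1-based positions, skipping invalid ones."""
    paragraphs = re.findall(r"[^\n]+", text)
    return [paragraphs[n - 1] for n in numbers if 1 <= n <= len(paragraphs)]


text = "first\nsecond\nthird\nfourth\nfifth"
wanted = [int(part) for part in "2,3,5".split(",")]  # comma-delimited user input
picked = pick_paragraphs(text, wanted)
print(picked)
```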

Regex to remove footer using wildcards

OK, this is well beyond my limited knowledge of regular expressions. We receive a report from a banking entity in a fixed-width text file format. Unfortunately, their system exports page headers with the data file, and these must be removed before processing on our end. The page headers start and end with the same text, but the content changes (dates and page numbers). A typical one looks like:
00007xxxxx LAST1,FIRST1 111111 20120930
ABCD EXPORT RPT 10/04/12 at 10/04/12 16:20 Seq 1501 Page 16
MRK014 Report Date: 10/04/12
Acct# Name SH. Balance QTR (YYYYMMDD)
----------------------------------------------------------------------------------------------------
00007xxxxx LAST2,FIRST2 222222 20120930
So each header starts with "ABCD" (actually the name of the bank, just removed here for privacy) and ends with the row of -------------------.
What I need to get it down to is the customer data on two rows (00007xxxxx - those account numbers change per person).
So I need to select from the " ABCD" to the end of the "---" to remove that block of text.
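Try this regex. This is Java code, but you can use the same pattern in your language.
str = str.replaceAll("ABCD(?:.*[\n\r]+)+?-+[\n\r]*", "");
Here str contains the data above, and I assume lines are separated by \n. The line group is lazy (+?) so that the match stops at the first row of dashes; a greedy repetition followed by an optional run of dashes would swallow everything through the end of the text.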
To make sure you remove the correct part of the report, I would use a more specific regex pattern:
(?<=[\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with an empty string.
However, if your environment does not support regex lookbehind, use this pattern instead:
([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with the first group.
For example, in JavaScript it would be:
str.replace(/([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+/g, "$1")
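For comparison, a Python sketch of the same idea, run against the sample from the question. The pattern is adapted rather than copied: it anchors on the start of a line with re.MULTILINE instead of a lookbehind, and the row of dashes in the sample is shortened. Like the patterns above, it relies on [^-]+, so it assumes the header text itself contains no hyphen.

```python
import re

# "ABCD EXPORT RPT" is the placeholder header text used in the question.
HEADER_RE = re.compile(r"^[ \t]*ABCD\s+EXPORT\s+RPT\s[^-]+\n-+\n?", re.MULTILINE)


def strip_page_headers(report):
    """Remove every page-header block, from the ABCD line through the dash row."""
    return HEADER_RE.sub("", report)


sample = (
    "00007xxxxx LAST1,FIRST1 111111 20120930\n"
    "ABCD EXPORT RPT 10/04/12 at 10/04/12 16:20 Seq 1501 Page 16\n"
    "MRK014 Report Date: 10/04/12\n"
    "Acct# Name SH. Balance QTR (YYYYMMDD)\n"
    "--------------------\n"
    "00007xxxxx LAST2,FIRST2 222222 20120930\n"
)
cleaned = strip_page_headers(sample)
print(cleaned)
```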

Format lists in VIM

I would like to find an easy way to format lists in Vim.
I tried par and Vim's default formatter.
For example:
1. this is my text this is my text this is my text
2. this is my text this is my text this is my text
3. this is my text this is my text this is my text
4. this is my text this is my text this is my text
and this
- this is my text this is my text this is my text
- this is my text this is my text this is my text
- this is my text this is my text this is my text
- this is my text this is my text this is my text
When I select the lines and format them to a width of 42 with par and with Vim, these are the results:
NUMBERED LIST
formatting with par:
par error:
(42) <= (0) + (50)
formatting with vim:
1. this is my text this is my text this is
my text
2. this is my text this is my text this is
my text
3. this is my text this is my text this is
my text
4. this is my text this is my text this is
my text
LIST with '-'
formatting with par:
4 lines filtered (no change)
formatting with vim:
- this is my text this is my text this is
my text
- this is my text this is my text this is
my text
- this is my text this is my text this is
my text
- this is my text this is my text this is
my text
Vim does a better job of formatting lists, but its result for the numbered list is not correct either.
Par has a lot of trouble formatting lists, even when I use the prefix ("p") option like this:
'<,'>!par w42p4dh or '<,'>!par w42p3dh
Does anyone know a good way to format lists without these problems?
Try set fo+=n. From :help fo-table:
n When formatting text, recognize numbered lists. This actually uses
the 'formatlistpat' option, thus any kind of list can be used. The
indent of the text after the number is used for the next line. The
default is to find a number, optionally followed by '.', ':', ')',
']' or '}'. Note that 'autoindent' must be set too. Doesn't work
well together with "2".
Example:
1. the first item
   wraps
2. the second item
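Outside Vim, the intended result (continuation lines aligned with the text after the list marker) can be illustrated with Python's textwrap module. This is only a sketch of the wrapping behaviour, not of how Vim implements it; the marker pattern is an assumption loosely modelled on the default described in the help text, extended with "-" for the dashed list.

```python
import re
import textwrap

# A number optionally followed by one of . : ) ] } or a dash, then whitespace.
MARKER_RE = re.compile(r"^(\d+[.:)\]}]?|-)\s+")


def wrap_list_item(line, width=42):
    """Wrap one list item, indenting continuation lines past the marker."""
    match = MARKER_RE.match(line)
    indent = " " * len(match.group(0)) if match else ""
    return textwrap.fill(line, width=width, subsequent_indent=indent)


numbered = wrap_list_item("1. this is my text this is my text this is my text")
dashed = wrap_list_item("- this is my text this is my text this is my text")
print(numbered)
print(dashed)
```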