Regex to select specified paragraphs InDesign

I am trying to find a regex to use in InDesign that could select arbitrary paragraphs in a text box, picked by number (any paragraphs I choose, not every nth one in sequence).
In the following example, for instance, I would like to be able to select the 2nd, the 3rd and the 5th paragraph by entering 2, 3 and 5 somewhere in a regex.

This needs to be done as a script. See below for an example to get you started. The script assumes that a text frame containing your paragraphs is selected when you run it. Note: it makes no attempt to check or handle errors (e.g. a non-numeric input for the paragraph number); you'll need to add that yourself. You could also modify the input to accept a comma-delimited list of paragraph numbers if needed.
var doc = app.activeDocument;
// the selected text frame that holds the paragraphs
var frame = app.selection[0];
// 1-based paragraph number typed by the user
var para = parseInt(prompt("Paragraph:", ''), 10);
// replace TestStyle with your desired style name
var style = doc.paragraphStyles.item('TestStyle');
frame.parentStory.paragraphs[para - 1].appliedParagraphStyle = style;

Match each paragraph with:
/([^\n]+\n)/g
then use the captured groups to extract the paragraphs you want.
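As a rough illustration of that idea outside InDesign, the following Python sketch splits a block of text into paragraphs with a regex and picks the ones named in a comma-delimited list such as "2,3,5". The function name `select_paragraphs` and the sample text are made up for this example, and the pattern leaves out the trailing \n so that a last paragraph without a line break still matches.

```python
import re


def select_paragraphs(text, spec):
    """Return the paragraphs of text whose 1-based numbers appear in spec."""
    # One match per non-empty line; each line is treated as a paragraph.
    paragraphs = re.findall(r"[^\n]+", text)
    # "2,3,5" -> [2, 3, 5]; int() raises ValueError on non-numeric input.
    wanted = [int(n) for n in spec.split(",") if n.strip()]
    # Numbers outside the available range are silently skipped.
    return [paragraphs[n - 1] for n in wanted if 1 <= n <= len(paragraphs)]


if __name__ == "__main__":
    sample = "one\ntwo\nthree\nfour\nfive"
    print(select_paragraphs(sample, "2,3,5"))
```

Inside InDesign itself, the selected numbers would still have to be applied through a script like the one above.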

Related

Extract a list of unique text characters/ emojis from a cell

I have a text in cell (A1) like this:
✌😋👅👅☝️😉🍌🍪💧💧
I want to extract the unique emojis from this cell into separate cells:
✌😋👅☝️😉🍌🍪💧
Is this possible?

You want to split ✌😋👅👅☝️😉🍌🍪💧💧 so that each character lands in its own cell, using the built-in functions of Google Sheets.
Sample formula:
=SPLIT(REGEXREPLACE(A1,"(.)","$1#"),"#")
The input ✌😋👅👅☝️😉🍌🍪💧💧 is in cell A1.
REGEXREPLACE puts a # after each character, giving ✌#😋#👅#👅#☝#️#😉#🍌#🍪#💧#💧#.
SPLIT then splits that value at each #.
Note:
The value in your question includes \ufe0f (a variation selector), which cannot be displayed on its own. It ends up in cell G1, so that cell looks empty even though it holds a value. Please be careful with this. If you want to get rid of it, use ✌😋👅👅☝😉🍌🍪💧💧 as the input.
References:
REGEXREPLACE
SPLIT
Added:
marikamitsos's comment made me realize that I had misunderstood the question. The final formula, which comes from marikamitsos, is as follows.
=TRANSPOSE(UNIQUE(TRANSPOSE(SPLIT(REGEXREPLACE(A1,"(.)","$1#"),"#"))))

or try:
=TRANSPOSE(UNIQUE(TRANSPOSE(REGEXEXTRACT(A1, REPT("(.)", LEN(A1))))))

Formula
It appears that one of the best formula solutions is:
=SPLIT(REGEXREPLACE(A1,"(.)","$1#"),"#")
You can also add handling for skin tones and joining characters:
=TRANSPOSE(SPLIT(REGEXREPLACE(A2,"(.[🏻🏼🏽🏾🏿"&CHAR(8205)&CHAR(65039)&"]*)","#$1"),"#"))
This keeps some multi-character sequences together so that they count as a single emoji.
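The same regex idea can be sketched in Python to see what it does and where it stops. This is only an approximation of the formula above, not a full grapheme splitter: it attaches skin-tone modifiers (U+1F3FB to U+1F3FF), the zero-width joiner (CHAR(8205)) and the variation selector (CHAR(65039)) to the preceding character, and then removes duplicates while keeping the original order. The names `split_emojis` and `unique_emojis` are made up for this example.

```python
import re

# Characters that should stay attached to the character before them:
# skin tones U+1F3FB..U+1F3FF, zero-width joiner U+200D, variation selector U+FE0F.
MODIFIERS = "\U0001F3FB-\U0001F3FF\u200D\uFE0F"
CLUSTER_RE = re.compile(".[" + MODIFIERS + "]*")


def split_emojis(text):
    """Split text into characters, keeping trailing modifiers attached."""
    return CLUSTER_RE.findall(text)


def unique_emojis(text):
    """Return the distinct clusters in order of first appearance."""
    return list(dict.fromkeys(split_emojis(text)))


if __name__ == "__main__":
    cell = "\u270c\U0001F60B\U0001F445\U0001F445\u261d\ufe0f\U0001F609"
    print(unique_emojis(cell))
```

Like the formula, this leaves the character after a zero-width joiner as a separate item, so joined sequences are still split in two.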
Script
A more precise way is to use a script:
https://github.com/orling/grapheme-splitter/blob/master/index.js
Add the code from that file to the Script editor.
Then add this code as sample usage:
function splitEmojis(string) {
  var splitter = new GraphemeSplitter();
  // split the string to an array of grapheme clusters (one string each)
  var graphemes = splitter.splitGraphemes(string);
  return graphemes;
}
Tests
The results are not 100% precise.
1. Please note: some emojis are not handled correctly in Sheets.
🏴󠁧󠁢󠁷󠁬󠁳󠁿🏴󠁧󠁢󠁳󠁣󠁴󠁿🏴󠁧󠁢󠁥󠁮󠁧󠁿🏴
The emojis above (flag: England, flag: Scotland, flag: Wales, black flag) are treated as the same by Google Sheets.
2. The VLOOKUP function in Google Sheets and in Excel treats the characters
#️⃣ and
*️⃣
are the same!

How to extract text under specific headings from a pdf?

I want to extract text under specific headings from a pdf using python.
For example, I have a pdf with headings Introduction,Summary,Contents. I need to extract only the text under the heading 'Summary'.
How can I do this?

This scenario is exactly what I am working on at my current company: we need to extract the text that sits under a heading. I use a rule-based system. I read the entire document line by line and use a regex to identify all the numbered headings. Once I have the headings, I enter the name of the heading whose paragraph I want. This input is matched against the list of headings found, using the universal sentence encoder to pick the nearest match. After that I simply display all the content from that heading up to the next heading.
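A minimal Python sketch of that workflow follows. It is not the author's implementation: in place of the universal sentence encoder it uses plain string similarity from the standard library's difflib, and the heading pattern, the function name `extract_section` and the sample lines are all assumptions made for illustration.

```python
import difflib
import re

# Assumed heading shape: a section number such as "2", "2." or "1.2.3",
# followed by whitespace and a title.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def extract_section(lines, query):
    """Return the lines under the heading whose title is closest to query."""
    headings = []  # (line index, lower-cased title)
    for i, line in enumerate(lines):
        m = HEADING_RE.match(line.strip())
        if m:
            headings.append((i, m.group(2).lower()))
    if not headings:
        return []
    titles = [title for _, title in headings]
    # cutoff=0.0 always returns the best candidate, however poor the match.
    best = difflib.get_close_matches(query.lower(), titles, n=1, cutoff=0.0)[0]
    pos = titles.index(best)
    start = headings[pos][0] + 1
    end = headings[pos + 1][0] if pos + 1 < len(headings) else len(lines)
    return lines[start:end]


if __name__ == "__main__":
    doc = ["1 Introduction", "intro", "2 Summary", "sum line 1",
           "sum line 2", "3 Contents", "toc"]
    print(extract_section(doc, "Sumary"))
```

A body line that happens to start with a number would be mistaken for a heading here, so real documents usually need a stricter pattern.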

A PDF is unstructured text, so there are no tags from which to extract data directly. Instead, we use regular expressions to find the desired information in the raw text.
Extract the raw page text with the following code, then split it at the headings:
import re
import fitz  # PyMuPDF
pdf_file = fitz.open("document.pdf")  # placeholder path, use your own file
page = pdf_file.load_page(0)  # 0 is the first page; pages run from 0 to n-1
tp_text = page.get_text()  # plain text of the page (older PyMuPDF releases spelled these loadPage / getText)
re.split(r'\n\d+.+[ \t][a-zA-Z].+\n', tp_text)
Adapt the regular expression to your needs (this one worked for me, but you may have to change it).
Here is a detailed example of how this works:
re.findall(r'\n\d+.+[ \t][a-zA-Z].+\n', "some text\n1. heading 1\nparagraph 1\n1.2.3 Heading 2\nparagraph 2")
Output: ['\n1. heading 1\n', '\n1.2.3 Heading 2\n']
You can use re.split to split the text at the headings and retrieve the text under the heading you want.
re.split(r'\n\d+.+[ \t][a-zA-Z].+\n', "some text\n1. heading 1\nparagraph 1\n1.2.3 Heading 2\nparagraph 2")
Output: ['some text', 'paragraph 1', 'paragraph 2']
In short, the text under heading i of the re.findall list (counting from 0) is element i+1 of the re.split list.
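The pairing of headings and text can be sketched as a small helper that zips the two lists together. The function name `sections_by_heading` is made up for this example; the pattern is the one from this answer.

```python
import re

HEADING = r"\n\d+.+[ \t][a-zA-Z].+\n"


def sections_by_heading(text):
    """Map each heading (stripped of newlines) to the text that follows it."""
    headings = re.findall(HEADING, text)
    bodies = re.split(HEADING, text)
    # bodies[0] is whatever precedes the first heading, so skip it.
    return {h.strip(): body for h, body in zip(headings, bodies[1:])}


if __name__ == "__main__":
    sample = "some text\n1. heading 1\nparagraph 1\n1.2.3 Heading 2\nparagraph 2"
    print(sections_by_heading(sample))
```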

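The best method I found uses a single regular expression that captures each numbered heading together with the lines below it:
import re
regex = r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*"
# samplestring holds the text extracted from the PDF
print(re.findall(regex, samplestring, re.M))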
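To get only the text under one heading, such as 'Summary', each matched block can be split into its heading line and its body. The sketch below assumes headings look like "2 Summary" or "1.2.3 Heading", because the pattern expects a space directly after the section number; the function name `section_text` and the sample string are made up.

```python
import re

SECTION_RE = re.compile(
    r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*", re.M
)


def section_text(text, title):
    """Return the body under the first heading containing title, or None."""
    for block in SECTION_RE.findall(text):
        # The first line of each block is the heading, the rest is the body.
        heading, _, body = block.partition("\n")
        if title.lower() in heading.lower():
            return body
    return None


if __name__ == "__main__":
    sample = ("1 Introduction\nintro text\n2 Summary\nsummary line 1\n"
              "summary line 2\n3 Contents\ntoc")
    print(section_text(sample, "Summary"))
```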

How can I normalize / asciify Unicode characters in Google Sheets?

I'm trying to write a formula for Google Sheets which will convert Unicode characters with diacritics to their plain ASCII equivalents.
I see that Google uses RE2 in its "REGEXREPLACE" function. And I see that RE2 offers Unicode character classes.
I tried to write a formula (similar to this one):
REGEXREPLACE("público","(\pL)\pM*","$1")
But Sheets produces the following error:
Function REGEXREPLACE parameter 2 value "\pL" is not a valid regular expression.
I suppose I could write a formula consisting of a long set of nested SUBSTITUTE functions (Like this one), but that seems pretty awful.
Can anyone suggest a better way to normalize Unicode letters with diacritical/accent marks in a Google Sheets formula?

[[:^alpha:]] (a negated ASCII character class) works fine in a REGEXEXTRACT formula.
But =REGEXREPLACE("público","([[:alpha:]])[[:^alpha:]]","$1") returns "pblic". So I guess the formula has no way of knowing which ASCII character should replace "ú".
Workaround
Take the word públicē, in which two characters need replacing. Put this word in cell A1 and this formula in cell B1:
=JOIN("",ArrayFormula(IFERROR(VLOOKUP(SPLIT(REGEXREPLACE(A1,"(.)","$1-"),"-"),D:E,2,0),SPLIT(REGEXREPLACE(A1,"(.)","$1-"),"-"))))
Then build a directory of replacements in range D:E:
    D    E
1   ú    u
2   ē    e
3   ...  ...
This formula is still ugly, but it is more useful because you control the replacements by adding more characters to the table.
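The lookup-directory idea can be sketched in Python with a translation table. This only mirrors the logic of the formula (look each character up, keep it when it is not listed); the two entries are the ones from the table above, and `transliterate` is a made-up name.

```python
# Replacement directory: the same two pairs as columns D and E above.
REPLACEMENTS = {
    "\u00fa": "u",  # ú -> u
    "\u0113": "e",  # ē -> e
}
TABLE = str.maketrans(REPLACEMENTS)


def transliterate(text, table=TABLE):
    """Replace every character listed in the table; leave the rest as-is."""
    return text.translate(table)


if __name__ == "__main__":
    print(transliterate("p\u00fablic\u0113"))
```

Adding a row to the directory corresponds to adding one more entry to `REPLACEMENTS`.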
Or use JavaScript.
I also found a good solution that works in Google Sheets.

This worked for me in Google Sheets, using Google Apps Script (GAS):
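function normalizetext(text) {
  var weird = 'öüóőúéáàűíÖÜÓŐÚÉÁÀŰÍçÇ!#£$%^&*()_+?/*."';
  var normalized = 'ouooueaauiOUOOUEAAUIcC ';
  var str = text.toString();
  var idoff = -1, new_text = '';
  for (var i = 0; i < str.length; i++) {
    // indexOf compares literally; String.search would treat characters such as ( or . as regex syntax
    idoff = weird.indexOf(str.charAt(i));
    if (idoff == -1) {
      // not in the table: keep the character unchanged
      new_text = new_text + str.charAt(i);
    } else {
      // characters past the end of normalized map to '' and are dropped
      new_text = new_text + normalized.charAt(idoff);
    }
  }
  return new_text;
}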
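For comparison, here is a Python sketch of a different technique from the hand-written table above: Unicode decomposition. It normalizes the text to NFD, which separates a base letter from its combining marks, and then drops the marks. This is a swapped-in approach for illustration only, it is not available as a Sheets formula, and `asciify` is a made-up name. Letters that have no decomposition (for example ø or ß) pass through unchanged.

```python
import unicodedata


def asciify(text):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for accent marks such as U+0301 (acute).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(asciify("p\u00fablico"))
    print(asciify("p\u00fablic\u0113"))
```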

This answer doesn't require a Google App Script, and it's still fast, and relatively simple. It builds on Max's answer by providing a full lookup table, and it also allows for case-sensitive transliteration (normally VLOOKUP is NOT case-sensitive).
Here is a link to the Google Spreadsheet if you want to jump right into it. If you want to use your own sheet, you'll need to copy the TRANS_TABLE sheet into your Spreadsheet.
In the code snippet below, the source cell is A2, so you'd place this formula in any column on row 2. Using REGEXREPLACE and SPLIT, we split the string in A2 into an array of characters. Then, using ARRAYFORMULA, we do the following to each character in the array: first, the character is converted to its decimal CODE equivalent; then VLOOKUP matches that number against the table on the TRANS_TABLE sheet and returns the character a given number of columns over (the index value provided, in this case the 3rd column). When all characters in the array have been transliterated, we finally JOIN the array of characters back into a single string. I provided examples with named ranges as well.
=iferror(
  join(
    "",
    ARRAYFORMULA(
      vlookup(
        code(split(REGEXREPLACE($A2,"(.)", "$1;"),";",TRUE)),
        TRANS_TABLE!$A$5:$F,3
      )
    )
  )
,)
You'll note on the TRANS_TABLE sheet I made, I created 4 different transliteration columns, which makes it easy to have a column for each of your transliteration needs. To reference the column, just use a different index number in the VLOOKUP. Each column is simply a replacement character column. In some cases, you don't want any conversion made (A -> A or 3 -> 3), so you just copy the same character from the source Glyph column. Where you DO want to convert characters, you type in whatever character you want replaced (ñ -> n etc). If you want a character removed altogether, you leave the cell blank (? -> ''). You can see examples of the transliteration output on the data sheet in which I created 4 different transliteration columns (A-D) referencing each of the Transliteration tables from the TRANS_TABLE sheet for different use case scenarios.
I hope this finally answers your question in a fashion that isn't so "ugly." Cheers.

How to replace all found occurrences in a Google Doc with a hyperlink

We want to find, for example, Bible verses in the document text and replace each one with a link to that verse on the web.
For example, the text "Jn 3.1" would become a hyperlink like this:
Text= Jn 3.1
Link= https://www.bible.com/1/jn.3.1
We thought of using Body.replaceText(searchPattern, replacement), but you can't use that to insert a hyperlink.
We also have to consider that the length of the verse reference can vary. For example, it can be:
Jn 1.3
which is 6 characters, or it can be
John 10.10
which is 10 characters. I think this can be covered with a regex (if we are able to use one with the solution), so it doesn't matter whether the solution itself covers it.

For this kind of modification you will have to use Apps Script functions. They work the same way as normal JavaScript functions, but they let you work directly with the document text.
In this case the relevant function is replaceText(searchPattern, replacement).
This is how you can search for a word in your document and then replace it:
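function myFunction() {
  var doc = DocumentApp.getActiveDocument();
  var word = 'example';
  var rep = 'replacement';
  // findText returns a RangeElement for the first match, or null if nothing matches
  var match = doc.getBody().editAsText().findText(word);
  if (match === null) return;
  var elem = match.getElement().asText();
  // start offset of the match inside elem (needed for the setLinkUrl variant)
  var idx = match.getStartOffset();
  // replaces every occurrence of word inside this element
  elem.replaceText(word, rep);
}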
So basically you find the element that contains the desired word, get that element, and then edit the text it contains.
Personally, I don't like to put complete URLs in the text; I would rather use an inline link, so that in this case "Jn 1.3" would be the text of the hyperlink.
For that, instead of the replaceText line, you can use:
var result = elem.setLinkUrl(idx, idx + word.length - 1, 'https://www.google.com');
That will be easier to read. I hope it helps.
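The pattern-matching half of the question (references of varying length, such as "Jn 3.1" and "John 10.10") can be sketched in Python. The URL scheme is taken from the single example in the question; lowercasing the book name as written is an assumption and may not produce a valid address for every abbreviation. The function name `find_verse_links` is made up. The returned offsets are inclusive at both ends, which is the form setLinkUrl expects.

```python
import re

# A book name made of letters, a space, then chapter.verse.
VERSE_RE = re.compile(r"\b([A-Za-z]+) (\d+)\.(\d+)\b")


def find_verse_links(text):
    """Return (start, end_inclusive, matched_text, url) for each reference."""
    links = []
    for m in VERSE_RE.finditer(text):
        book, chapter, verse = m.groups()
        # Assumed URL pattern, generalized from https://www.bible.com/1/jn.3.1
        url = "https://www.bible.com/1/%s.%s.%s" % (book.lower(), chapter, verse)
        links.append((m.start(), m.end() - 1, m.group(0), url))
    return links


if __name__ == "__main__":
    for link in find_verse_links("See Jn 3.1 and John 10.10."):
        print(link)
```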

Writing a word macro to organize chat logs

I need some help writing a word macro to organize some chat logs. What I want is to eliminate repeated consecutive occurrences of names, regardless of timestamp. Besides this, each person will be using their own formatting style (font, font color, etc.). Edit: the raw logs have no formatting (i.e. specific fonts, font color ,etc.). I want the macro to automatically add a specific (already existent) word style to each user.
So, what I have is:
[12:40] Steve: this is an example text.
[12:41] Steve: this is another example text.
[12:41] Steve: this is yet another example text.
[12:45] Bob: some more text.
[12:46] Bob: even more text.
[12:47] Steve: yadda yadda yadda.
The expected output would be:
[12:40] Steve: *style1*this is an example text.
this is another example text.
this is yet another example text.*/style1*
[12:45] Bob: *style2*some more text.
even more text.*/style2*
[12:47] Steve: *style1*yadda yadda yadda.*/style1*
As of now, unfortunately, I know next to nothing of VBA for Applications. I was thinking of maybe searching for the names by a regex pattern and assigning them to a variable, comparing each match to the previous and, if they're equal, deleting the latter. The problem is I'm not fluent in VBA, so I don't know how to do what I want.
So far, all I've got is this:
Sub Organize()
    'RegExp needs a reference to "Microsoft VBScript Regular Expressions 5.5"
    Dim re As New RegExp
    Dim names As MatchCollection, name As Match
    re.Pattern = "\[[0-9]{2}:[0-9]{2}\] [a-zA-Z]{1,20}:"
    re.IgnoreCase = True
    re.Global = True
    Set names = re.Execute(ActiveDocument.Range.Text)
    For Each name In names
        'This is where I get lost
    Next name
End Sub
So, in the interest of solving this problem and me learning some VBA, could I get some help?
EDIT: the question has been edited to better reflect what I want the macro to do.

Assuming that each line in your log is a separate paragraph, I would do it without regex, using the .Find object instead. The following code works fine for the sample data you provided.
Sub qTest()
    Dim PAR As Paragraph
    Dim PrevName As String
    For Each PAR In ActiveDocument.Content.Paragraphs
        PAR.Range.Select 'highlight current paragraph
        'find the name in the paragraph: "]" up to the next ":"
        With Selection.Find
            .ClearFormatting
            .MatchWildcards = True 'the pattern below uses Word wildcards
            .Text = "\]*\:"
            .Execute
        End With
        If Selection.Text = PrevName Then
            'same speaker as before: delete from the paragraph start
            'through the name and the space after it
            ActiveDocument.Range(PAR.Range.Start, Selection.End + 1).Delete
        Else
            PrevName = Selection.Text
            Debug.Print PrevName
        End If
    Next
End Sub
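The grouping logic of the macro (drop the timestamp and name whenever the speaker is the same as on the previous line) can be sketched in Python on plain strings. This only shows the comparison step; applying Word styles is left out, the line pattern follows the one in the question, and `collapse_speakers` is a made-up name.

```python
import re

# "[12:40] Steve: text" -> name "Steve", message "text"
LINE_RE = re.compile(r"^\[\d{2}:\d{2}\] ([a-zA-Z]{1,20}): (.*)$")


def collapse_speakers(lines):
    """Keep the "[time] Name:" prefix only on the first line of each run."""
    out = []
    prev_name = None
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(1) == prev_name:
            # Same speaker as the previous line: keep only the message text.
            out.append(m.group(2))
            continue
        if m:
            prev_name = m.group(1)
        out.append(line)
    return out


if __name__ == "__main__":
    log = [
        "[12:40] Steve: this is an example text.",
        "[12:41] Steve: this is another example text.",
        "[12:45] Bob: some more text.",
        "[12:47] Steve: yadda yadda yadda.",
    ]
    print("\n".join(collapse_speakers(log)))
```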
