I am working on raw text data from a scanned catalog.
I only want to keep 2 types of strings:
- beginning with a number (artists' works)
- containing 2 adjacent uppercase letters, possibly **with accents** (artists' names)
I would like an easy way to remove everything else (with a TRUE/FALSE filter?).
My data:
ÁÀDFDS (artist 1 with accents)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
A..gdgdgdg (bad string beginning with an uppercase letter)
7 in commodo enim in laoreet gravida.
Expected results:
ÁÀDFDS
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF
6 Sed cursus augue in tempus scelerisque.
7 in commodo enim in laoreet gravida.
The data is imported into R with:
readLines("clipboard")
I am able to identify lines that include artist names in capital letters with various regexes, e.g.
[A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO][A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO |']
I am able to identify lines that include artworks with:
^[0-9]+[\s]
Any help would be greatly appreciated.
Just a side note: [:upper:] matches uppercase letters in the current locale. Thus, this solution is fine as long as you work with a single locale:
ll <- readLines(textConnection("ÁÀDFDS (artist 1)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
BBDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
...gdgdgdg (bad string)
7 in commodo enim in laoreet gravida."))
ll[grep("^[[:digit:]]+[[:blank:]]|[[:upper:]]['[:upper:]]", ll)]
The regex breakdown:
^ - start of string
[[:digit:]]+ - 1 or more digits
[[:blank:]] - 1 space or tab
| - or
[[:upper:]]['[:upper:]] - an uppercase letter followed by ' or another uppercase letter.
And here is a way to achieve what you need with a Perl-like regex:
ll[grep("^\\d+\\s|\\p{Lu}['\\p{Lu}]", ll, perl=T)]
The regex matches:
^ - start of string
\\d+\\s - 1 or more digits and then a whitespace
| - or...
\\p{Lu}['\\p{Lu}] - an uppercase Unicode letter followed by either an apostrophe or another uppercase Unicode letter.
The output of the sample demo:
[1] "ÁÀDFDS (artist 1)"
[2] "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit."
[3] "AB (artist 2)"
[4] "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis."
[5] "BBDDED (artist 3)"
[6] "3 Nunc et eros eget turpis sollicitudin mollis id et mi."
[7] "4 Mauris condimentum velit eu consequat feugiat."
[8] "5 Suspendisse sit amet metus vitae est eleifend tincidunt."
[9] "ÉÈDFSF (artist 4)"
[10] "6 Sed cursus augue in tempus scelerisque."
[11] "7 in commodo enim in laoreet gravida."
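For readers who want to try the same filter outside R, here is a minimal Python sketch of the rule described above (keep lines that start with digits followed by whitespace, or that contain an uppercase letter followed by an apostrophe or another uppercase letter). Python's built-in re module does not support \p{Lu}, so this sketch swaps in str.isupper(), which is Unicode-aware. The names is_kept and filter_catalog are made up for the illustration.

```python
import re


def is_kept(line: str) -> bool:
    """Return True for artwork lines and artist-name lines."""
    # Artwork line: one or more digits at the start, then whitespace.
    if re.match(r"\d+\s", line):
        return True
    # Artist line: an uppercase letter followed by ' or another uppercase letter.
    # str.isupper() handles accented capitals such as "Á" or "È".
    return any(
        a.isupper() and (b.isupper() or b == "'")
        for a, b in zip(line, line[1:])
    )


def filter_catalog(lines):
    """Keep only the lines accepted by is_kept (hypothetical helper)."""
    return [line for line in lines if is_kept(line)]
```

Calling filter_catalog on the sample lines drops "az*ù*ù*ù (bad string)" and "A..gdgdgdg ..." while keeping "B'BDDED (artist 3)".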
To clean up the beginning of strings, you can use
ll <- gsub("^[\\P{L}\\D]*?([\\p{L}\\d])", "\\1", ll, perl=T)
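The regex ^[\\P{L}\\D]*?([\\p{L}\\d]) lazily matches as few characters as possible from the start of the string up to the first letter or digit, which is captured in group 1; the \\1 backreference in the gsub call then puts that letter or digit back. Note that [\\P{L}\\D] on its own matches any character (nothing is both a letter and a digit), so it is the lazy quantifier that makes the match stop at the first alphanumeric. Use it before grepping.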
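The same cleanup step can be sketched in Python, assuming the goal is simply to drop everything before the first letter or digit of each line. This is not a translation of the R pattern: it uses [\W_]+ (non-word characters plus the underscore) instead, and strip_leading_junk is a made-up name.

```python
import re

# One or more leading characters that are neither letters nor digits.
# \W excludes letters, digits and "_", so "_" is added back explicitly.
LEADING_JUNK = re.compile(r"^[\W_]+")


def strip_leading_junk(line: str) -> str:
    """Remove punctuation and whitespace before the first letter or digit."""
    return LEADING_JUNK.sub("", line)
```

Lines that already start with a letter or digit are returned unchanged.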
You can use grep:
z <- readLines("clipboard")
z[grep("^[0-9]|[[:upper:]]{2,}", z)]
[1] "AADFDS (artist 1)"
[2] "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit."
[3] "AB (artist 2)"
[4] "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis."
[5] "BBDDED (artist 3)"
[6] "3 Nunc et eros eget turpis sollicitudin mollis id et mi."
[7] "4 Mauris condimentum velit eu consequat feugiat."
[8] "5 Suspendisse sit amet metus vitae est eleifend tincidunt."
[9] "CCDDFSF (artist 4)"
[10] "6 Sed cursus augue in tempus scelerisque."
[11] "7 in commodo enim in laoreet gravida."
You can use POSIX character classes if you want. However, their interpretation depends on the current locale and if it's not set properly, it could alter the behavior of the POSIX class.
I'd recommend turning on Perl regular expressions and using Unicode properties.
x <- readLines('clipboard')
r <- x[grepl("^\\pN+|\\p{Lu}[\\p{Lu}']", x, perl=TRUE)]
Another option is to match the accented letters explicitly with a character range, which avoids the POSIX classes altogether.
r <- x[grepl("^\\d+|(?![×Þß÷þø])[A-ZÀ-ÿ][A-ZÀ-ÿ']", x, perl=TRUE)]
I have an OCR text document where paragraphs have been broken into individual lines. I'd like to make them whole paragraphs on a single line again (as per the original PDF).
How can I use regex, or find and replace, to remove the line breaks between two lines of text and replace them with a space?
Eg:
Every line of text is on a newline. I'd like them to be whole paragraphs on a single line.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam vehicula tellus faucibus metus consequat
scelerisque. Maecenas sit amet urna quis ipsum interdum consequat. Praesent elementum libero nec
velit suscipit placerat accumsan vitae lacus. Aliquam erat volutpat. Etiam egestas lectus sed orci
venenatis, ullamcorper gravida elit pulvinar. Pellentesque imperdiet, augue pulvinar sodales dapibus,
tortor magna rutrum nulla, vel ullamcorper mi purus a diam. Ut id odio sed arcu aliquet lobortis.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Donec quam arcu, egestas feugiat eleifend blandit, vulputate non elit. Nulla a erat vel leo maximus
viverra at ac lorem. Nam non imperdiet lorem. Fusce tempor arcu massa, non commodo ligula lobortis
nec. Aliquam sit amet fringilla sapien, non euismod metus. Donec orci mi, sagittis vitae lobortis eu,
aliquet nec libero. Sed sodales magna lacus, pretium lobortis magna varius nec. Pellentesque quis
ipsum viverra orci lobortis egestas. Aliquam porttitor tincidunt ipsum, egestas placerat ante
consectetur in. Morbi porttitor lacus eu augue tincidunt, at aliquet lorem consectetur.
You might be looking for a programmatic/dynamic approach for every new scan generated, so I'm not sure if this answers your question, but since you have Visual Studio Code in your tags I will answer how to do this in VS Code.
Open keyboard shortcuts from File > Preferences > Keyboard shortcuts, and bind editor.action.joinLines to a shortcut of your choice, for example Ctrl + J.
Then open the text you want to fix in VS Code, select it and press that keybinding. Everything in the selection will end up on one line. I hope this helps!
I am using two regular expressions when removing linebreaks from OCR texts.
They can be used in the Find&Replace dialog from VS Code.
Remove linebreaks at lines ending with a hyphen: (?<=\w)- *\n *
Replace remaining linebreaks with whitespace, but keeping blank lines: (?<!\n) *\n *(?!\n).
Note that the " *" (a space followed by an asterisk) on either side of \n trims whitespace at the end and beginning of the lines.
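If you prefer to script it rather than use the Find&Replace dialog, the two substitutions above can be applied in the same order with Python's re module. This is only a sketch: unwrap_paragraphs is a made-up name, and like the regexes themselves it joins every hyphenated line end, so genuine hyphenated compounds split across lines lose their hyphen.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join OCR-broken lines into paragraphs, keeping blank lines."""
    # Step 1: a word character, a hyphen and a line break: glue the halves.
    text = re.sub(r"(?<=\w)- *\n *", "", text)
    # Step 2: any remaining single line break becomes one space.
    # Line breaks next to another line break (blank lines) are left alone.
    text = re.sub(r"(?<!\n) *\n *(?!\n)", " ", text)
    return text
```

Blank lines between paragraphs survive because both lookarounds refuse to match a line break that has another line break next to it.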
There is also a Python tool based on Flair called dehyphen that does the job.
In my experience it produces useful results but may take quite long compared to replacing linebreaks with regular expressions.
I have this pattern in over 10,000 places:
11,1,2,0,0,"Lorem ipsum dolor sit amet, 8 - 14. consectetur adipiscing elit. 6 - 13. Aenean semper fermentum ipsum sed vehicula. In commodo sit amet libero et rhoncus. Cras vitae dapibus nisl. Mauris lacinia dui lacus, ut sodales massa congue vel. Donec at 8 - 11. dapibus mi, ullamcorper porttitor orci. Nullam id dui nibh. Fusce est ante, viverra 4 - 7. et cursus vel, scelerisque imperdiet massa. Donec sit amet nibh porttitor, tincidunt lorem in, maximus elit."
I need to capture all 11,1,2,0,0, patterns at the beginning of the sentence AND ALL the 8 - 14. patterns (they have different numbers between dashes - and before the dot .) throughout the sentence using Regex.
How do I do this?
I have tried (^\d*,\d*,\d*,\d*,\d*)+(\d* - \d*\.)
The desired output is:
11,1,2,0,0, 8 - 14. 6 - 13. 8 - 11. 4 - 7.
You can use regex alternation for 2 patterns:
\b((?:\d+,)+|\d+\s*-\s*\d+)
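Here is a hedged Python sketch of the same alternation idea. It is a variant, not the exact pattern above: the first alternative is anchored with ^ so only the leading comma-separated run is taken, and the second alternative includes the trailing \. so the dot shown in the desired output is captured. extract_markers is a made-up name.

```python
import re

# Alternative 1: the run of "digits," groups at the start of the line.
# Alternative 2: "digits - digits." anywhere in the line.
MARKERS = re.compile(r"^(?:\d+,)+|\d+\s*-\s*\d+\.")


def extract_markers(line: str):
    """Return the leading id run and every 'N - M.' marker, in order."""
    return MARKERS.findall(line)
```

Joining the result with spaces, " ".join(extract_markers(line)), gives a single string in the shape of the desired output.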
I have this result from a SQL query:
| Steven | 149203948 | Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Ut eu lacus gravida, lobortis nulla eu, fringilla erat.
Nullam vulputate vitae lacus quis sagittis.
Donec elementum nisl vitae arcu pulvinar tristique.
| Frank | 993847 |
Morbi bibendum facilisis risus.
Nullam facilisis, lectus nec adipiscing vehicula, urna nisi rhoncus urna,
at egestas purus nisl a sapien.
Suspendisse ac sapien ut eros luctus eleifend ac sit amet odio.
Etc etc
I'm trying to select the "name", "number" and the entire list of "Lorem Text" so that I can create a sql update statement for each lorem phrase by replacing a particular word with another word.
So I could do
UPDATE person p
SET p.lorem_text = '
$3 CHANGED TEXT $4'
WHERE p.name = $1 AND p.favorite_number = $2;
For all the entries I get back from my query.
The problem is the stupid newlines, I tried using
\| (\w*) \| (\d*) \| ((.|\R)*)OLD TEXT((.|\R)*)
But it only gets the first occurrence and leaves the rest (| Frank | 993847 | etc) as the entire tail end of my selection.
\| (\w*) \| (\d*) \| ([^|]*)
Using [^|]* for the text column stops the match at the next | character; the rest of the regex is simplified.
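To see what the [^|]* idea gives you, here is a small Python sketch that splits a dump like the one in the question into (name, number, text) records. It assumes the text column never contains a | character, and it allows optional whitespace around the pipes, which the pattern above does not. ROW and parse_rows are made-up names.

```python
import re

# name | number | text, where the text runs up to the next "|" (or the end).
ROW = re.compile(r"\|\s*(\w+)\s*\|\s*(\d+)\s*\|([^|]*)")


def parse_rows(dump: str):
    """Return a list of (name, number, text) tuples with the text trimmed."""
    return [
        (name, number, text.strip())
        for name, number, text in ROW.findall(dump)
    ]
```

Each tuple can then be used to fill in one UPDATE statement, with the word replacement done on the text element before it is inserted.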
I am trying to write an expression to take a block of text and return everything up to the last full-stop before an ellipsis or three full-stops (... or …). So the idea is that the example text test string:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam felis nisi, eleifend ut quam eget, venenatis vestibulum turpis. Nam dignissim laoreet iaculis. Etiam sit amet rhoncus sem. Duis laoreet justo tellus, at volutpat risus molestie sed. Etiam posuere, arcu vitae faucibus hendrerit, lorem elit consequat urna, id congue eros felis in mauris. Donec non fermentum ipsum. Curabitur nec...
Would become:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam felis nisi, eleifend ut quam eget, venenatis vestibulum turpis. Nam dignissim laoreet iaculis. Etiam sit amet rhoncus sem. Duis laoreet justo tellus, at volutpat risus molestie sed. Etiam posuere, arcu vitae faucibus hendrerit, lorem elit consequat urna, id congue eros felis in mauris. Donec non fermentum ipsum.
So far I have come up with this pathetic attempt. I keep getting right up until the last full-stop (because the quantifier consumes the previous two full-stops so there is nothing for the look ahead to fail on). I just can't seem to wrap my head around it:
Dim testText As String = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam felis nisi, eleifend ut quam eget, venenatis vestibulum turpis. Nam dignissim laoreet iaculis. Etiam sit amet rhoncus sem. Duis laoreet justo tellus, at volutpat risus molestie sed. Etiam posuere, arcu vitae faucibus hendrerit, lorem elit consequat urna, id congue eros felis in mauris. Donec non fermentum ipsum. Curabitur nec..."
Dim ellipsisExpression As String = "(.*\.(?!\.\.))"
Dim ellipsisMatch As Match
ellipsisMatch = Regex.Match(testText, ellipsisExpression)
If ellipsisMatch.Success Then
testText = ellipsisMatch.Groups(1).Value
End If
edit: I also need this expression to take any ... character in the text into account. for example the string:
`begin. this is a test... test complete. beginning shutdown... shutting down... `
should return
`begin. this is a test... test complete.`
The aim of this expression is to find the most flowing text before any truncation has occurred. A block of text with closure so it doesn't confuse readers expecting to be able to 'get more'.
You could replace [^.]*(?:\.{3}|…).* with an empty string to get the desired result.
For example:
result = Regex.Replace(input, "[^.]*(?:\.{3}|…).*", "")
Use this:
result = Regex.Replace(input, "(.+\.).+(?:\.{3}|…)\s*", "$1")
Edit:
Use this regex instead:
(.+[^.]\.)(?:(?:[^.]{2})|$)
You could match that with:
.*(?<!\.)\.(?!\.)(?=(?:[^.]+|\.{3})*(?:\.{3}|…)$)
Or replace
(?<!\.)\.(?!\.)(?:[^.]+|\.{3})*(?:\.{3}|…)$
with a single . character.
I think I have come up with a solution that works for me. Thank you to everyone who answered previously, but this expression seems to do what I need and doesn't execute as slowly as some of the other answers. It also takes other sentence-terminating punctuation into account, such as ! or ?, and not just the full stop (.).
(.*([^\.](?=\.|\?|!)(?!\.\.\.)).)
This gets the last sentence-terminating character (defined with the lookahead). In this case they are ?, ! and a . that isn't followed by .... This also solves the ellipsis character issue since it is effectively a sentence-terminating whitelist. This expression succeeds in finding the largest block of text with closure.
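For comparison, the same whitelist idea can be sketched in Python. This is not a port of the expression above: it treats a full stop as a terminator only when it has no other full stop on either side, accepts ! and ? unconditionally, and returns the text unchanged when nothing qualifies (an assumption that mirrors the If ... Success check in the question). The names are made up.

```python
import re

# Greedy .* pushes the match to the LAST acceptable terminator:
# a lone "." (no "." before or after it), or "!" or "?".
LAST_CLOSED_SENTENCE = re.compile(r".*(?:(?<!\.)\.(?!\.)|[!?])", re.DOTALL)


def trim_to_closed_sentence(text: str) -> str:
    """Cut the text after its last properly terminated sentence."""
    match = LAST_CLOSED_SENTENCE.match(text)
    # No terminator at all: leave the text as it is.
    return match.group(0) if match else text
```

Because the single-character ellipsis … is not a full stop, it never counts as a terminator, so text ending in either form is cut back to the previous closed sentence.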
Let's say there is something like this
Lorem ipsum dolor sit amet, consectetur adipiscing elit. "Vestibulum interdum dolor nec sapien blandit a suscipit arcu fermentum. Nullam lacinia ipsum vitae enim consequat iaculis quis in augue. Phasellus fermentum congue blandit. Donec laoreet, ipsum et vestibulum vulputate, risus augue commodo nisi, vel hendrerit sem justo sed mauris." Phasellus ut nunc neque, id varius nunc. In enim lectus, blandit et dictum at, molestie in nunc. Vivamus eu ligula sed augue pretium tincidunt sit amet ac nisl. "Morbi eu elit diam, sed tristique nunc."
and I want it to become something like this
Lorem ipsum dolor sit amet, consectetur adipiscing elit. "Vestibulum interdum dolor nec sapien blandit a suscipit arcu fermentum[dot] Nullam lacinia ipsum vitae enim consequat iaculis quis in augue[dot] Phasellus fermentum congue blandit[dot] Donec laoreet, ipsum et vestibulum vulputate, risus augue commodo nisi, vel hendrerit sem justo sed mauris[dot]" Phasellus ut nunc neque, id varius nunc. In enim lectus, blandit et dictum at, molestie in nunc. Vivamus eu ligula sed augue pretium tincidunt sit amet ac nisl. "Morbi eu elit diam, sed tristique nunc[dot]"
I somehow found a regex that selects every quoted "{sentence}", "(.)+?", and I can use it like
regex('"(.)+?"','[sentence]')
But can we do something like replace the dots inside a group, so that I get the output shown in the example above?
I'm not sure regexps are able to suit your needs on their own.
You should implement an algorithm that replaces nested dots until the string doesn't contain nested dots anymore.
For example in PHP:
$string = 'He asked "Please." while she answered "No. Or maybe yes."';
var_dump($string);
while(preg_match('/"[^"]*\.[^"]*"/', $string)) {
$string = preg_replace('/("[^"]*)\.([^"]*")/', '$1[dot]$2', $string);
}
var_dump($string);
which prints:
string 'He asked "Please." while she answered "No. Or maybe yes."' (length=57)
string 'He asked "Please[dot]" while she answered "No[dot] Or maybe yes[dot]"' (length=69)
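A different way to get the same result, sketched here in Python, is to match each quoted span once and rewrite the dots inside it with a callback instead of looping. It assumes quotes are plain double quotes, balanced and not escaped; quoted spans are paired left to right, so a dot sitting between a closing quote and the next opening quote is left untouched. dots_in_quotes is a made-up name.

```python
import re

# One quoted span: an opening quote, anything but quotes, a closing quote.
QUOTED = re.compile(r'"[^"]*"')


def dots_in_quotes(text: str, marker: str = "[dot]") -> str:
    """Replace every '.' inside double-quoted spans with the marker."""
    return QUOTED.sub(lambda m: m.group(0).replace(".", marker), text)
```

The callback receives each quoted span exactly once, so a single pass over the string is enough.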
This is what I would do.
echo preg_replace_callback(
    '~(?<!\\\)"(.+?)((?<!\\\)")~',
    /*
    Pattern:
    --------
    (?<!\\\)"    a double quote not preceded by a backslash (escape character)
    (.+?)        at least one character, as few as possible, between the two quotes
    ((?<!\\\)")  a double quote not preceded by a backslash (escape character)
    */
    // For every match of the pattern above, the following function is called.
    // create_function() was removed in PHP 8, so an anonymous function is used.
    function ($m) {
        // Replace each dot in the matched quoted text with [dot] and return it.
        return str_replace('.', '[dot]', $m[0]);
    },
    $str
);
EDIT: Added explanations in comments.
Try this:
(\"[^\.]*)\.([^\"]*) to \1[dot]\2
It works well in my editor, but some tools use $ instead of \ in the replacement (e.g. PHP).
With Javascript I would just do a basic replace:
str = str.replace(/".+?"/g,function(m) {
return m.replace(/\./g,'[dot]');
});