Delete Numeration Lines from Subtitle - regex

I have a subtitle file with many lines. Before each timestamp and its text there is a numeration line (1, 2, 3, 4, 5 ... 111):
Legend:
1 = numeration
2 = numeration
00:14:xx:xx = times
quando a te... = text
text example:
1
00:14:38,511 --> 00:14:45,747
quando a te venne il Salvatore,
2
00:14:55,595 --> 00:15:06,699
...volle da te prendere il battesimo,...
ma il prete rifiuto
10
00:15:16,082 --> 00:15:27,050
e si consacrò al martirio,
213
00:15:34,467 --> 00:15:46,174
ci diede un pegno di salvezza:
ecco! ci siamo andiamo a ubriarci
I want to delete the numeration lines:
1
2
10
213
this should be the end result:
00:14:38,511 --> 00:14:45,747
quando a te venne il Salvatore,
00:14:55,595 --> 00:15:06,699
...volle da te prendere il battesimo,...
ma il prete rifiuto
00:15:16,082 --> 00:15:27,050
e si consacrò al martirio,
00:15:34,467 --> 00:15:46,174
ci diede un pegno di salvezza:
ecco! ci siamo andiamo a ubriarci

Search: (?m)^\d+$[\r\n]+
Replace: empty string
In engines that don't support inline modifiers such as (?m), you'll usually add the m flag at the end of the pattern, like so:
/^\d+$[\r\n]+/m
Explanation
(?m) turns on multi-line mode, allowing ^ and $ to match on each line
The ^ anchor asserts that we are at the beginning of a line (because multi-line mode is on)
\d+ matches one or more digits
The $ anchor asserts that we are at the end of a line (because multi-line mode is on)
[\r\n]+ matches line breaks
We replace with the empty string
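As a rough illustration (not part of the original answer), the same search-and-replace could be run in Python. The helper name strip_numeration and the sample string are made up for this sketch. It assumes \n line endings, because in Python's multi-line mode $ matches only before \n, not before \r.

```python
import re

# Pattern taken from the answer above; the helper name is hypothetical.
NUMERATION = re.compile(r"(?m)^\d+$[\r\n]+")

def strip_numeration(text: str) -> str:
    """Remove lines that consist only of digits, together with their line break."""
    return NUMERATION.sub("", text)

sample = (
    "1\n"
    "00:14:38,511 --> 00:14:45,747\n"
    "quando a te venne il Salvatore,\n"
    "2\n"
    "00:14:55,595 --> 00:15:06,699\n"
    "...volle da te prendere il battesimo,...\n"
    "ma il prete rifiuto\n"
)

print(strip_numeration(sample))
```

Note that a subtitle text line made up only of digits would be removed as well, since the pattern cannot tell it apart from a numeration line.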

You can simply use the following:
Find: ^\d+\s+
Replace: (empty)
Explanation:
^ # the beginning of the line (where ^ matches at each line; otherwise only the beginning of the string)
\d+ # digits (0-9) (1 or more times)
\s+ # whitespace (\n, \r, \t, \f, and " ") (1 or more times)
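One difference between the two patterns is worth checking before using the shorter one: ^\d+\s+ does not require the digits to fill the whole line, so it also trims a leading number from a text line. The sketch below compares both in Python; the sample text and the variable names are invented for the comparison, and re.MULTILINE is passed so that ^ and $ match at each line.

```python
import re

# Invented sample: one numeration line, one timestamp, one text line starting with a number.
text = "1\n00:00:01,000 --> 00:00:02,000\n10 anni fa\n"

# Shorter pattern: digits at the start of a line followed by any whitespace.
loose = re.sub(r"^\d+\s+", "", text, flags=re.MULTILINE)

# Stricter pattern: the line must contain digits only.
strict = re.sub(r"^\d+$[\r\n]+", "", text, flags=re.MULTILINE)

print(repr(loose))   # '00:00:01,000 --> 00:00:02,000\nanni fa\n'
print(repr(strict))  # '00:00:01,000 --> 00:00:02,000\n10 anni fa\n'
```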

Related

Start extracting after a repeated string in regex

How to extract string_with_letters_and_special_caracters in this sequence ?
sequence_one \n sequence_two \n sequence_three \n string_with_letters_and_special_caracters 0000000 \n sequence_four
I can't manage to start the match after the last \n preceding string_with_letters_and_special_caracters.
(Here \n is the repeated string.)
For example, \\n(\D+)\d+ extracts from the first \n.
Example: I want to extract "- Dimensions : L." or "Dimensions" here, which precedes an expression I already have a pattern for:
https://regex101.com/r/jLqxxo/1
Thank you!
You seem to want
-\s*Dimensions\s*:\s*L\.\s*(\d+)\D+(\d+)\D+(\d+)
See the regex demo and the Python demo:
import re
# Raw string: each \n below is a literal backslash followed by n, not a newline
s=r'''FICHE TECHNIQUE\n- Pieds du canapé en bois.\n- Assise et dossier en polyester effet velours.\n- Canapé idéal pour deux personnes.\n\nCARACTERISTIQUES TECHNIQUES\n- Dimensions : L. 128 x l. 71 x H. 80 cm.\n- Hauteur d'assise : H. 47 cm.\n- Poids : 15,14 kg.\n\n'''
m = re.search(r'-\s*Dimensions\s*:\s*L\.\s*(\d+)\D+(\d+)\D+(\d+)',s)
if m:
    print(m.group(1)) # => 128
    print(m.group(2)) # => 71
    print(m.group(3)) # => 80

Remove spaces between words of a certain length

I have strings of the following variety:
A B C Company
XYZ Inc
S & K Co
I would like to remove the spaces in these strings that are only between words of 1 letter length. For example, in the first string I would like to remove the spaces between A B and C but not between C and Company. The result should be:
ABC Company
XYZ Inc
S&K Co
What is the proper regex expression to use in gsub for this?
Here is one way you could do this seeing how & is mixed in and not a word character ...
x <- c('A B C Company', 'XYZ Inc', 'S & K Co', 'A B C D E F G Company')
gsub('(?<!\\S\\S)\\s+(?=\\S(?!\\S))', '', x, perl=TRUE)
# [1] "ABC Company" "XYZ Inc" "S&K Co" "ABCDEFG Company"
Explanation:
First we assert that two non-whitespace characters do not precede back to back. Then we look for and match whitespace "one or more" times. Next we lookahead to assert that a non-whitespace character follows while asserting that the next character is not a non-whitespace character.
(?<! # look behind to see if there is not:
\S # non-whitespace (all but \n, \r, \t, \f, and " ")
\S # non-whitespace (all but \n, \r, \t, \f, and " ")
) # end of look-behind
\s+ # whitespace (\n, \r, \t, \f, and " ") (1 or more times)
(?= # look ahead to see if there is:
\S # non-whitespace (all but \n, \r, \t, \f, and " ")
(?! # look ahead to see if there is not:
\S # non-whitespace (all but \n, \r, \t, \f, and " ")
) # end of look-ahead
) # end of look-ahead
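For readers outside R, here is a minimal sketch of the same lookaround pattern in Python. It relies on Python's re module accepting the fixed-width lookbehind (?<!\S\S); the constant name SINGLE_GAP is made up for this example.

```python
import re

# Same pattern as the gsub call above, written without R's doubled backslashes.
SINGLE_GAP = re.compile(r"(?<!\S\S)\s+(?=\S(?!\S))")

names = ["A B C Company", "XYZ Inc", "S & K Co", "A B C D E F G Company"]

# Remove whitespace that sits between two single-character "words".
joined = [SINGLE_GAP.sub("", name) for name in names]
print(joined)  # ['ABC Company', 'XYZ Inc', 'S&K Co', 'ABCDEFG Company']
```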
Obligatory strsplit / paste answer. This will also get those single characters that might be in the middle or at the end of the string.
x <- c('A B C Company', 'XYZ Inc', 'S & K Co',
'A B C D E F G Company', 'Company A B C', 'Co A B C mpany')
foo <- function(x) {
  x[nchar(x) == 1L] <- paste(x[nchar(x) == 1L], collapse = "")
  paste(unique(x), collapse = " ")
}
vapply(strsplit(x, " "), foo, character(1L))
# [1] "ABC Company" "XYZ Inc" "S&K Co"
# [4] "ABCDEFG Company" "Company ABC" "Co ABC mpany"
Coming late to the game but would this pattern work for you
(?<!\\S\\S)\\s+(?!\\S\\S)
Demo
Another option
(?![ ]+\\S\\S)[ ]+
You could do this also through PCRE verb (*SKIP)(*F)
> x <- c('A B C Company', 'XYZ Inc', 'S & K Co', 'A B C D E F G Company', ' H & K')
> gsub("\\s*\\S\\S+\\s*(*SKIP)(*F)|(?<=\\S)\\s+(?=\\S)", "", x, perl=TRUE)
[1] "ABC Company" "XYZ Inc" "S&K Co" "ABCDEFG Company"
[5] " H&K"
Explanation:
\\s*\\S\\S+\\s* Would match two or more non-space characters along with the preceding and following spaces.
(*SKIP)(*F) Causes the match to fail.
| Now ready to choose the characters from the remaining string.
(?<=\\S)\\s+(?=\\S) one or more spaces which are preceded by a non-space , followed by a non-space character are matched.
Removing the spaces will give you the desired output.
Note: See the last element. This regex won't remove the leading spaces, because the spaces at the start of the string are not preceded by a non-space character.
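Python's built-in re module does not support the (*SKIP)(*F) verbs, but the same "match what you want to keep first, then discard it" idea can be approximated with a capturing group and a replacement callback. This is only an emulation sketched under that assumption, not the PCRE mechanism itself; the function name squeeze_single_letters is hypothetical.

```python
import re

# First alternative: a run of two or more non-space characters with the spaces
# around it. It is captured so that the callback can put it back unchanged.
# Second alternative: spaces between two non-space characters, to be removed.
PATTERN = re.compile(r"(\s*\S\S+\s*)|(?<=\S)\s+(?=\S)")

def squeeze_single_letters(s: str) -> str:
    # group(1) is None when the second alternative matched, so "" is returned.
    return PATTERN.sub(lambda m: m.group(1) or "", s)

x = ["A B C Company", "XYZ Inc", "S & K Co", "A B C D E F G Company", " H & K"]
print([squeeze_single_letters(s) for s in x])
```

As with the R version, the leading space of the last element is left in place.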

Regular expression in R to remove the part of a string after the last space

I would like to have a gsub expression in R to remove everything in a string that occurs after the last space. E.g. string="Da Silva UF" should return me "Da Silva". Any thoughts?
Using $ anchor:
> string = "Da Silva UF"
> gsub(" [^ ]*$", "", string)
[1] "Da Silva"
You can use the following.
string <- 'Da Silva UF'
gsub(' \\S*$', '', string)
[1] "Da Silva"
Explanation:
' ' a literal space
\S* non-whitespace (all but \n, \r, \t, \f, and " ") (0 or more times)
$ before an optional \n, and the end of the string
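The same substitution carries over to other regex flavours. A minimal Python sketch follows; the function name drop_last_word is an assumption made for this example.

```python
import re

def drop_last_word(s: str) -> str:
    """Remove the last space and everything after it; leave the string alone if it has no space."""
    return re.sub(r" \S*$", "", s)

print(drop_last_word("Da Silva UF"))  # Da Silva
print(drop_last_word("Silva"))        # Silva
```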

Regex extract only specific character and EOL

I am trying to extract some text using regex.
I want to extract only those line that contains "pour 1e" or "Pour 1€" and nothing more.
The regex must be case-insensitive.
Here is my regex, which doesn't work the way I want:
/Pour ([0-9.,])(€|e)/im
and this is my text:
Tesseract Open Source OCR Engine v3.01 with Leptonica
CARDEURS
Horaire dejour de flhllll 5 19h00
pour 1€
pour 1€ supplémentaire
pour 1€ supplémentaire
pour 1€ supplémentaire
pour 1€ supplémentaire
par€ supplémentaire
Horaire de nuit de 19h00 5 flhllll
pour 1,50€
pour 1€ supplémentaire + 300 minutes
pour 1€ supplémentaire + 420 minutes
La joumée de 24 heures
35 minutes
+ 30 minutes
+ 35 minutes
+ 40 minutes
+ 45 minutes
+ 50 minutes
60 minutes
15€
Tesseract Open Source OCR Engine v3.01 with Leptonica
TARIFS
PARKING CARNOT
Homim de juur de 8:00 3 19:00 H01-aim de null de 19:00 5 8:00
mains d‘ ggg heme : G1-atuit moins d‘ ggg heure : Gmtuil
Pour 1e
Pour 1e supplémenlaire
Pour 1e suppléulentaire
Pour 1e supplémmmm
Pour 1e supplémmmm
Par e supplémenlaiI€
40 minutes
+ 40 minutes
+ 45 minutes
+ 50 minutes
+ 55 minutes
+ 55 minules
Pour 1e so nzinules
Pour 1e supplémenlaiI€ + 300 minllles
Pour 1e 5upplémenlai1Q + 420 minules
La journée a
e 24 heums 15€
You need to anchor the expression with ^ and $ which match beginning/end of line when /m is active. For example:
/^pour [0-9]+[0-9,.]*[e€]$/im
Use square brackets [] to specify a set of characters to match, a caret ^ to match the beginning of the line, and a dollar sign $ to match the end of the line. Depending on which regex implementation you are using, you may be able to pass the i flag to make it case-insensitive (add the m flag as well when the input spans several lines, so that ^ and $ match at each line):
/^Pour 1[€e]$/i
Or handle case explicitly with character groups
/^[Pp][Oo][Uu][Rr] 1[€e]$/
For matching repetitions, use * to match 0 or more of the previous character, + to match 1 or more, and ? to match 0 or 1.
In place of the 1 in the previous, you could use
[0-9.]+ to match any 1 or more digits or decimal points
[0-9]+\.?[0-9]* to match at least 1 digit followed by an optional decimal point and more digits
[0-9]+[0-9,]*\.?[0-9]* to match at least 1 digit, optionally more digits and commas, followed by an optional decimal point and more digits
You can also use curly braces {} to explicitly specify a number of repetitions (these must be escaped with a backslash \ in some regex engines)
[0-9]{1,3} would match 1, 2, or 3 digits
[0-9]{3} would match exactly 3 digits
You can use parentheses () to group a part of a regex pattern for backreference or repetition.
So to match a line that starts with "Pour " followed by 1 or more digits, then an optional comma or decimal point with 2 digits, then the euro symbol or letter e, and any number of trailing spaces, but no other characters until end of line, and be case-insensitive:
/^Pour [0-9]+([,.][0-9][0-9])?[€e][ ]*$/i
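To see the final pattern at work, here is a small Python sketch run against a few lines taken from the OCR text in the question. Two things are assumptions of this sketch rather than part of the answer: the group is written as non-capturing, (?:...), so that findall returns whole lines, and the name PRICE_LINE is made up.

```python
import re

# Final pattern from the answer, with i and m expressed as Python flags.
PRICE_LINE = re.compile(
    r"^pour [0-9]+(?:[,.][0-9][0-9])?[€e][ ]*$",
    re.IGNORECASE | re.MULTILINE,
)

ocr_text = "\n".join([
    "pour 1€",
    "pour 1€ supplémentaire",
    "par€ supplémentaire",
    "pour 1,50€",
    "Pour 1e",
    "Pour 1e supplémenlaire",
    "15€",
])

matches = PRICE_LINE.findall(ocr_text)
print(matches)  # ['pour 1€', 'pour 1,50€', 'Pour 1e']
```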

Regex capitalize first letter every word, also after a special character like a dash

I use this to capitalize the first letter of every word:
#(\s|^)([a-z0-9-_]+)#i
I want it also to capitalize the letter if it's after a special mark like a dash (-).
Now it shows:
This Is A Test For-stackoverflow
And I want this:
This Is A Test For-Stackoverflow
+1 for word boundaries, and here is a comparable Javascript solution. This accounts for possessives, as well:
var re = /(\b[a-z](?!\s))/g;
var s = "fort collins, croton-on-hudson, harper's ferry, coeur d'alene, o'fallon";
s = s.replace(re, function(x){return x.toUpperCase();});
console.log(s); // "Fort Collins, Croton-On-Hudson, Harper's Ferry, Coeur D'Alene, O'Fallon"
A simple solution is to use word boundaries:
#\b[a-z0-9-_]+#i
Alternatively, you can match for just a few characters:
#([\s\-_]|^)([a-z0-9-_]+)#i
If you want to use pure regular expressions, you can use the \u case-change operator, where the replacement syntax supports it.
To transform this string:
This Is A Test For-stackoverflow
into
This Is A Test For-Stackoverflow
You must put:
(.+)-(.+) to capture the values before and after the "-"
then to replace it you must put:
$1-\u$2
If it is in bash you must put:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
Actually, you don't need to match the full string; just match the first lowercase letter of each word, like this:
'~\b([a-z])~'
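The ~ characters above are only pattern delimiters. As a sketch of how the word-boundary idea behaves, here is the same expression in Python with a replacement callback; the function name capitalize_words is hypothetical. The second call shows a side effect to be aware of: \b also sits after an apostrophe, so the letter following it is capitalized too.

```python
import re

def capitalize_words(s: str) -> str:
    # Upper-case every lowercase ASCII letter that directly follows a word boundary.
    return re.sub(r"\b([a-z])", lambda m: m.group(1).upper(), s)

print(capitalize_words("this is a test for-stackoverflow"))  # This Is A Test For-Stackoverflow
print(capitalize_words("harper's ferry"))                    # Harper'S Ferry
```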
For JavaScript, here’s a solution that works across different languages and alphabets:
const originalString = "this is a test for-stackoverflow"
const processedString = originalString.replace(/(?:^|\s|[-"'([{])+\S/g, (c) => c.toUpperCase())
It matches any non-whitespace character \S that is preceded by the start of the string ^, whitespace \s, or any of the characters -"'([{, and replaces the match with its uppercase variant.
my solution using javascript
function capitalize(str) {
    var reg = /\b([a-zÁ-ú]{3,})/g;
    // Upper-case the first letter of each matched word
    return str.replace(reg, (w) => w.charAt(0).toUpperCase() + w.slice(1));
}
with es6 + javascript
const capitalize = str =>
    str.replace(/\b([a-zÁ-ú]{3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
/<expression-here>/g
[a-zÁ-ú] matches lowercase ASCII letters plus the accented letters (upper and lower case) in the range Á to ú.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
[a-zÁ-ú]{3,} requires at least three such letters, so words that are too short are skipped
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
\b([a-zÁ-ú]{3,}) lastly, \b makes the match start at a word boundary, so only whole words are selected. The () isolate the expression.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
after achieving this, I apply the change to the matched words, which start in lower case
w.charAt(0).toUpperCase() + w.slice(1); // output -> Output
joining the two
str.replace(/\b(([a-zÁ-ú]){3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
result:
Sábado de Janeiro às 19h. Sexta-Feira de Janeiro às 21 e Horas
Python solution:
>>> import re
>>> the_string = 'this is a test for stack-overflow'
>>> re.sub(r'(((?<=\s)|^|-)[a-z])', lambda x: x.group().upper(), the_string)
'This Is A Test For Stack-Overflow'
read about the "positive lookbehind"
While this answer for a pure Regular Expression solution is accurate:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
it should be noted when using any Case-Change Operators:
\l Change case of only the first character to the right to lowercase. (Note: lowercase 'L')
\L Change case of all text to the right to lowercase.
\u Change case of only the first character to the right to uppercase.
\U Change case of all text to the right to uppercase.
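the end delimiter stops a conversion started by \L or \U:
\E
\u and \l affect only a single character and do not need it. With \U, for example, the end result would be:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\U\2\E/'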
this will make
R.E.A.C De Boeremeakers
from
r.e.a.c de boeremeakers
(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])
using
Dim matches As MatchCollection = Regex.Matches(inputText, "(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])")
Dim outputText As New StringBuilder
' Copy any text that precedes the first match unchanged
If matches(0).Index > 0 Then outputText.Append(inputText.Substring(0, matches(0).Index))
For Each Match As Match In matches
    Try
        ' Upper-case the matched letter, then copy the text up to the next match
        outputText.Append(UCase(Match.Value))
        outputText.Append(inputText.Substring(Match.Index + 1, Match.NextMatch.Index - Match.Index - 1))
    Catch ex As Exception
        ' Last match: there is no next match, so copy the rest of the input
        outputText.Append(inputText.Substring(Match.Index + 1, inputText.Length - Match.Index - 1))
    End Try
Next