Start extracting after a repeated string in regex

How can I extract string_with_letters_and_special_characters from this sequence?
sequence_one \n sequence_two \n sequence_three \n string_with_letters_and_special_characters 0000000 \n sequence_four
I can't get the match to begin after the last \n that precedes string_with_letters_and_special_characters.
(Here \n is the repeated string.)
For example, \\n(\D+)\d+ starts extracting from the first \n.
Example: I want to extract "- Dimensions : L." or "Dimensions" here, which precedes an expression I already have a pattern for:
https://regex101.com/r/jLqxxo/1
Thank you!

You seem to want
-\s*Dimensions\s*:\s*L\.\s*(\d+)\D+(\d+)\D+(\d+)
See the regex demo and the Python demo:
import re
s=r'''FICHE TECHNIQUE\n- Pieds du canapé en bois.\n- Assise et dossier en polyester effet velours.\n- Canapé idéal pour deux personnes.\n\nCARACTERISTIQUES TECHNIQUES\n- Dimensions : L. 128 x l. 71 x H. 80 cm.\n- Hauteur d'assise : H. 47 cm.\n- Poids : 15,14 kg.\n\n'''
m = re.search(r'-\s*Dimensions\s*:\s*L\.\s*(\d+)\D+(\d+)\D+(\d+)',s)
if m:
    print(m.group(1)) # => 128
    print(m.group(2)) # => 71
    print(m.group(3)) # => 80

Related

How to find the second number in a string with a regex

I want a regular expression that finds the last match.
The result that I want is "103-220898-44".
The expression that I'm using is ([^\d]|^)\d{3}-\d{6}-\d{2}([^\d]|$). This doesn't work because it matches the first result "100-520006-90" and I want the last "103-220898-44"
Example A
Transferencia exitosa
Comprobante No. 0000065600
26 May 2022 - 03:32 p.m.
Producto origen
Cuenta de Ahorro
Ahorros
100-520006-90
Producto destino
nameeee nene
Ahorros / ala
103-220898-44
Valor enviado
$ 1.000.00
/.*(\d{3}-\d{6}-\d{2})(?:[^\d]|$)/gs
If you add .* to the beginning of your regex, it captures only the last occurrence, because .* is greedy. You also need the single-line (dotall) flag s so that . matches newlines and .* can span several lines.
Note: I replaced some (...) strings with (?:...) since their aim is grouping, not capturing.
Demo: https://regex101.com/r/fygL1X/2
const regex = new RegExp('.*(\\d{3}-\\d{6}-\\d{2})(?:[^\\d]|$)', 'gs')
const str = `Transferencia exitosa
Comprobante No. 0000065600
26 May 2022 - 03:32 p.m.
Producto origen
Cuenta de Ahorro
Ahorros
100-520006-90
Producto destino
nameeee nene
Ahorros / ala
103-220898-44
Valor enviado
\$ 1.000.00`;
let m;
m = regex.exec(str)
console.log(m[1])
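For comparison, the same greedy-prefix idea can be sketched in Python. This is only an illustration and not part of the original answer: the variable names are invented here, re.DOTALL stands in for the s flag, and the findall variant is an alternative that avoids the leading .* altogether.

```python
import re

text = """Transferencia exitosa
Comprobante No. 0000065600
26 May 2022 - 03:32 p.m.
Producto origen
Cuenta de Ahorro
Ahorros
100-520006-90
Producto destino
nameeee nene
Ahorros / ala
103-220898-44
Valor enviado
$ 1.000.00"""

# Greedy .* (with DOTALL so that . matches newlines) consumes as much as
# possible, so the capture group lands on the last account number.
m = re.search(r".*(\d{3}-\d{6}-\d{2})(?:\D|$)", text, re.DOTALL)
last_greedy = m.group(1) if m else None

# Alternative: collect every match and take the final one.
all_numbers = re.findall(r"(?<!\d)\d{3}-\d{6}-\d{2}(?!\d)", text)
last_findall = all_numbers[-1] if all_numbers else None

print(last_greedy, last_findall)
```

Both approaches should agree on this sample; the findall version may be easier to read when the input is long.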

How to find the longest string in a text using regex in R

Given a string x, I can count the number of words in it using gregexpr("[A-Za-z]\w+", x).
> x<-"\n\n\n\n\n\nMasters Publics\n\n\n\n\n\n\n\n\n\n\n\n\nMasters Universitaires et Prives au Maroc\n\n\n\n\n\n\n\n\\n\n\n\n\nMasters Par Ville\n\n\n\n\n\n\n\n\n\n\n\n\n"
> sapply(gregexpr("[A-Za-z]\\w+", x), function(x) sum(x > 0))
[1] 11
However, how can I retrieve the number of words in the longest unbroken string (words separated by spaces, not by \n) using regex in R?
In this example it would be "Masters Universitaires et Prives au Maroc", whose length is 6.
Thanks in advance.
I would solve it with
x <- "\n\n\n\n\n\nMasters Publics\n\n\n\n\n\n\n\n\n\n\n\n\nMasters Universitaires et Prives au Maroc\n\n\n\n\n\n\n\n\\n\n\n\n\nMasters Par Ville\n\n\n\n\n\n\n\n\n\n\n\n\n"
max(nchar(gsub("[^ ]+", "", unlist(strsplit(trimws(x), "\n+"))))) + 1
Split a trimmed string into lines, unlist the result, remove all characters other than a space, get the longest item and add one. The [^ ]+ is a regex that matches one or more (due to the + quantifier) characters other than (as [^...] is a negated character class) a space.
See IDEONE demo.
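The same split-and-count idea can be sketched outside R. The Python version below is an illustration only: the sample string is shortened (fewer newlines than in the question), and the names lines and longest_word_count are made up for this example.

```python
import re

# Shortened stand-in for the string in the question.
x = ("\n\n\n\n\n\nMasters Publics\n\n\n\n"
     "Masters Universitaires et Prives au Maroc\n\n\n\n"
     "Masters Par Ville\n\n\n\n")

# Trim the outer newlines, then split on runs of newlines to get the lines.
lines = re.split(r"\n+", x.strip())

# Words in a line = number of spaces + 1; keep the largest count.
longest_word_count = max(line.count(" ") for line in lines) + 1
print(longest_word_count)
```

Like the R one-liner, this assumes words are separated by single spaces.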
# Load the package
library(stringr)
# Create a new dataset, extracting and splitting the phrases
data <- unlist(str_split(x, pattern="\n", n = Inf))
index <- lapply(data, nchar)
index <- index !=0
# extract the maximum length of the phrase
max(sapply(gregexpr("\\W+", data[index]), length) + 1)
[1] 6
# just checking
data[index]
[1] "Masters Publics"
[2] "Masters Universitaires et Prives au Maroc"
[3] "\\n"
[4] "Masters Par Ville"

Delete Numeration Lines from Subtitle

I have this subtitle text with many, many lines. Before the times and the text I have numeration (the numbers 1, 2, 3, 4, 5 ... 111):
Legend:
1 = numeration
2 = numeration
00:14:xx:xx = times
quando a te... = text
text example:
1
00:14:38,511 --> 00:14:45,747
quando a te venne il Salvatore,
2
00:14:55,595 --> 00:15:06,699
...volle da te prendere il battesimo,...
ma il prete rifiuto
10
00:15:16,082 --> 00:15:27,050
e si consacrò al martirio,
213
00:15:34,467 --> 00:15:46,174
ci diede un pegno di salvezza:
ecco! ci siamo andiamo a ubriarci
I want to delete the numeration lines:
1
2
10
213
this should be the end result:
00:14:38,511 --> 00:14:45,747
quando a te venne il Salvatore,
00:14:55,595 --> 00:15:06,699
...volle da te prendere il battesimo,...
ma il prete rifiuto
00:15:16,082 --> 00:15:27,050
e si consacrò al martirio,
00:15:34,467 --> 00:15:46,174
ci diede un pegno di salvezza:
ecco! ci siamo andiamo a ubriarci
Search: (?m)^\d+$[\r\n]+
Replace: empty string
In engines that don't support inline modifiers such as (?m), you'll usually add the m flag at the end of the pattern, like so:
/^\d+$[\r\n]+/m
Explanation
(?m) turns on multi-line mode, allowing ^ and $ to match on each line
The ^ anchor asserts that we are at the beginning of a line
\d+ matches one or more digits
The $ anchor asserts that we are at the end of a line
[\r\n]+ matches line breaks
We replace with the empty string
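As a rough illustration of the search-and-replace described above, here is a Python sketch. It assumes the subtitle text is already in a string (the name subtitles is made up, and only part of the sample is reproduced), and it passes re.MULTILINE instead of writing (?m) inline.

```python
import re

subtitles = (
    "1\n"
    "00:14:38,511 --> 00:14:45,747\n"
    "quando a te venne il Salvatore,\n"
    "2\n"
    "00:14:55,595 --> 00:15:06,699\n"
    "...volle da te prendere il battesimo,...\n"
    "ma il prete rifiuto\n"
    "10\n"
    "00:15:16,082 --> 00:15:27,050\n"
    "e si consacrò al martirio,\n"
)

# Remove every line that consists only of digits, together with its line break.
cleaned = re.sub(r"^\d+$[\r\n]+", "", subtitles, flags=re.MULTILINE)
print(cleaned)
```

The timestamp lines survive because they contain characters other than digits, so ^\d+$ cannot match them.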
You can simply use the following:
Find: ^\d+\s+
Replace: (empty string)
Explanation:
^ # the beginning of the string (of each line when multi-line mode is on)
\d+ # digits (0-9) (1 or more times)
\s+ # whitespace (\n, \r, \t, \f, and " ") (1 or more times)

Regex extract only specific character and EOL

I am trying to extract some text using regex.
I want to extract only those lines that contain "pour 1e" or "Pour 1€" and nothing more.
The regex must be case-insensitive.
Here is my regex, which doesn't work the way I want:
/Pour ([0-9.,])(€|e)/im
And this is my text:
Tesseract Open Source OCR Engine v3.01 with Leptonica
CARDEURS
Horaire dejour de flhllll 5 19h00
pour 1€
pour 1€ supplémentaire
pour 1€ supplémentaire
pour 1€ supplémentaire
pour 1€ supplémentaire
par€ supplémentaire
Horaire de nuit de 19h00 5 flhllll
pour 1,50€
pour 1€ supplémentaire + 300 minutes
pour 1€ supplémentaire + 420 minutes
La joumée de 24 heures
35 minutes
+ 30 minutes
+ 35 minutes
+ 40 minutes
+ 45 minutes
+ 50 minutes
60 minutes
15€
Tesseract Open Source OCR Engine v3.01 with Leptonica
TARIFS
PARKING CARNOT
Homim de juur de 8:00 3 19:00 H01-aim de null de 19:00 5 8:00
mains d‘ ggg heme : G1-atuit moins d‘ ggg heure : Gmtuil
Pour 1e
Pour 1e supplémenlaire
Pour 1e suppléulentaire
Pour 1e supplémmmm
Pour 1e supplémmmm
Par e supplémenlaiI€
40 minutes
+ 40 minutes
+ 45 minutes
+ 50 minutes
+ 55 minutes
+ 55 minules
Pour 1e so nzinules
Pour 1e supplémenlaiI€ + 300 minllles
Pour 1e 5upplémenlai1Q + 420 minules
La journée a
e 24 heums 15€
You need to anchor the expression with ^ and $, which match the beginning and end of a line when the m flag is active. For example:
/^pour [0-9]+[0-9,.]*[e€]$/im
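To see the effect of the anchors, here is a small Python sketch that applies the same pattern to a handful of lines taken from the OCR text. The names ocr_text, pattern and matches are invented for this example, and re.IGNORECASE | re.MULTILINE play the role of the i and m flags.

```python
import re

# A few lines from the question's OCR output, joined into one text.
ocr_text = "\n".join([
    "pour 1€",
    "pour 1€ supplémentaire",
    "par€ supplémentaire",
    "pour 1,50€",
    "15€",
    "Pour 1e",
    "Pour 1e so nzinules",
])

# ^ and $ anchor the match to a whole line because MULTILINE is set.
pattern = re.compile(r"^pour [0-9]+[0-9,.]*[e€]$", re.IGNORECASE | re.MULTILINE)
matches = pattern.findall(ocr_text)
print(matches)
```

Lines with trailing words such as "supplémentaire" are rejected because $ requires the line to end right after the currency symbol.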
Use square brackets [] to specify a set of characters to match, a caret ^ to match the beginning of the line, and a dollar sign $ to match the end of the line. Depending on which regex implementation you are using, you may be able to pass the i flag to make it case-insensitive:
/^Pour 1[€e]$/i
Or handle case explicitly with character classes:
/^[Pp][Oo][Uu][Rr] 1[€e]$/
For matching repetitions, use * to match 0 or more of the previous character, + to match 1 or more, and ? to match 0 or 1.
In place of the 1 in the previous, you could use
[0-9.]+ to match any 1 or more digits or decimal points
[0-9]+\.?[0-9]* to match at least 1 digit followed by an optional decimal point and more digits
[0-9]+[0-9,]*\.?[0-9]* to match at least 1 digit, optionally more digits and commas, followed by an optional decimal point and more digits
You can also use curly braces {} to explicitly specify a number of repetitions (these must be escaped with a backslash \ in some regex engines)
[0-9]{1,3} would match 1, 2, or 3 digits
[0-9]{3} would match exactly 3 digits
You can use parentheses () to group a part of a regex pattern for backreference or repetition.
So to match a line that starts with "Pour " followed by 1 or more digits, then an optional comma or decimal point with 2 digits, then the euro symbol or letter e, and any number of trailing spaces, but no other characters until end of line, and be case-insensitive:
/^Pour [0-9]+([,.][0-9][0-9])?[€e][ ]*$/i
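A quick way to check this final pattern is to run it line by line. The Python sketch below is only an illustration: the sample lines and the names pattern and kept are made up, and each line is tested separately, so no multi-line flag is needed.

```python
import re

pattern = re.compile(r"^Pour [0-9]+([,.][0-9][0-9])?[€e][ ]*$", re.IGNORECASE)

lines = [
    "Pour 1e",                 # plain amount
    "pour 1,50€  ",            # two decimals and trailing spaces
    "Pour 1e supplémenlaire",  # extra text after the amount
    "15€",                     # no "Pour " prefix
    "pour 1.5€",               # only one decimal digit
]

# Keep only the lines that the pattern accepts in full.
kept = [line for line in lines if pattern.search(line)]
print(kept)
```

The optional group ([,.][0-9][0-9])? is what rejects "pour 1.5€": a separator must be followed by exactly two digits.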

Regex capitalize first letter every word, also after a special character like a dash

I use this to capitalize the first letter of every word:
#(\s|^)([a-z0-9-_]+)#i
I want it also to capitalize the letter if it's after a special mark like a dash (-).
Now it shows:
This Is A Test For-stackoverflow
And I want this:
This Is A Test For-Stackoverflow
+1 for word boundaries, and here is a comparable JavaScript solution. This accounts for possessives as well:
var re = /(\b[a-z](?!\s))/g;
var s = "fort collins, croton-on-hudson, harper's ferry, coeur d'alene, o'fallon";
s = s.replace(re, function(x){return x.toUpperCase();});
console.log(s); // "Fort Collins, Croton-On-Hudson, Harper's Ferry, Coeur D'Alene, O'Fallon"
A simple solution is to use word boundaries:
#\b[a-z0-9-_]+#i
Alternatively, you can match for just a few characters:
#([\s\-_]|^)([a-z0-9-_]+)#i
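The word-boundary idea can be tried out in a few lines of Python. This is a sketch and not the code the answer had in mind: it matches only the first lowercase letter after each \b (rather than the whole word), and the function name capitalize_words is made up.

```python
import re

def capitalize_words(text: str) -> str:
    # \b sits between a non-word character (space, dash, ...) and a letter,
    # so the first letter after a dash is upper-cased as well.
    return re.sub(r"\b[a-z]", lambda m: m.group().upper(), text)

print(capitalize_words("this is a test for-stackoverflow"))
print(capitalize_words("harper's ferry"))
```

One caveat of a bare \b: an apostrophe also creates a boundary, so "harper's" becomes "Harper'S" unless the pattern excludes that case.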
If you want to use pure regular expressions, you must use the \u case-conversion escape in the replacement (where the engine supports it).
To transform this string:
This Is A Test For-stackoverflow
into
This Is A Test For-Stackoverflow
You must put:
(.+)-(.+) to capture the values before and after the "-"
then to replace it you must put:
$1-\u$2
In bash, with GNU sed (\u is a GNU extension), you can use:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
Actually, you don't need to match the full string; just match the first lowercase letter of each word, like this:
'~\b([a-z])~'
For JavaScript, here’s a solution that works across different languages and alphabets:
const originalString = "this is a test for-stackoverflow"
const processedString = originalString.replace(/(?:^|\s|[-"'([{])+\S/g, (c) => c.toUpperCase())
It matches any non-whitespace character \S that is preceded by the start of the string ^, whitespace \s, or any of the characters -"'([{, and replaces the match with its uppercase variant.
My solution using JavaScript:
function capitalize(str) {
  var reg = /\b([a-zÁ-ú]{3,})/g;
  // Upper-case the first letter of each matched word
  return str.replace(reg, (w) => w.charAt(0).toUpperCase() + w.slice(1));
}
With ES6+ JavaScript:
const capitalize = str =>
  str.replace(/\b([a-zÁ-ú]{3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
/<expression-here>/g
[a-zÁ-ú] here I accept the lowercase letters a-z plus the accented letters in the range Á-ú.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
[a-zÁ-ú]{3,} then I skip runs that are shorter than three letters.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
\b([a-zÁ-ú]{3,}) lastly, I keep only the runs that start at a word boundary. The () isolate the expression as a group.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
After that, I upper-case the first letter of each matched word w:
w.charAt(0).toUpperCase() + w.slice(1); // output -> Output
Joining the two:
str.replace(/\b(([a-zÁ-ú]){3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
result:
Sábado de Janeiro às 19h. Sexta-Feira de Janeiro às 21 e Horas
Python solution:
>>> import re
>>> the_string = 'this is a test for stack-overflow'
>>> re.sub(r'(((?<=\s)|^|-)[a-z])', lambda x: x.group().upper(), the_string)
'This Is A Test For Stack-Overflow'
Read about "positive lookbehind" to see how the (?<=\s) part works.
While this answer for a pure Regular Expression solution is accurate:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
it should be noted when using any Case-Change Operators:
\l Change case of only the first character to the right to lowercase. (Note: lowercase 'L')
\L Change case of all text to the right to lowercase.
\u Change case of only the first character to the right to uppercase.
\U Change case of all text to the right to uppercase.
the end delimiter should be used:
\E
so the end result should be:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\E\2/'
This will make
R.E.A.C De Boeremeakers
from
r.e.a.c de boeremeakers
with the pattern
(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])
using
' Assumes Imports System.Text and Imports System.Text.RegularExpressions
Dim matches As MatchCollection = Regex.Matches(inputText, "(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])")
Dim outputText As New StringBuilder
' Copy any text that precedes the first match
If matches(0).Index > 0 Then outputText.Append(inputText.Substring(0, matches(0).Index))
Dim index As Integer = matches(0).Index + matches(0).Length
For Each Match As Match In matches
    Try
        ' Upper-case the matched letter, then copy the text up to the next match
        outputText.Append(UCase(Match.Value))
        outputText.Append(inputText.Substring(Match.Index + 1, Match.NextMatch.Index - Match.Index - 1))
    Catch ex As Exception
        ' After the last match there is no next match; copy the rest of the input
        outputText.Append(inputText.Substring(Match.Index + 1, inputText.Length - Match.Index - 1))
    End Try
Next
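For readers who want to try this pattern without .NET, here is a rough Python equivalent. It is a sketch under two assumptions: Python's re module needs a fixed-width lookbehind, so (?<=\A|[ .]) is rewritten as (?:^|(?<=[ .])), and the function name capitalize_initials is made up.

```python
import re

def capitalize_initials(text: str) -> str:
    # A lowercase letter at the start of the text or right after a space or
    # dot, and followed by a letter, dot, or space.
    pattern = r"(?:^|(?<=[ .]))[a-z](?=[a-z. ])"
    return re.sub(pattern, lambda m: m.group().upper(), text)

print(capitalize_initials("r.e.a.c de boeremeakers"))
```

Using a replacement callback avoids the manual stitching of substrings between matches.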