Get the value in the line above a word - regex

I have the text below and would like to get the value on the line above the words OUTRAS INFORMAÇÕES, in this case the value 8.571.962,06.
I tried the following, but I find it very fragile:
^(.?)\s(?(\d+.\d+.\d+,\d+|\d+.\d+.\d+,\d+|\d+.\d+,\d+|\d+,\d+))\sOUTRAS INFORMA.?.*?ES
OCR:
NOME: TESTE DE SILVA SAURO
CPF: 785.981.970-84
DECLARAÇÃO DE AJUSTE ANUAL
IMPOSTO SOBRE A RENDA - PESSOA FÍSICA
EXERCICIO 2018 ANO-CALENDÁRIO 2017
EVOLUÇÃO PATRIMONIAL
Bens e direitos em 31/12/2016
Bens e direitos em 31/12/2017
Dividas conus rcais em 31/12/2016
Divisas e ônus reais em 31/12/2017
100.580.873.91
100.329. 110,32
9135,456,07
8.571.962,06
OUTRAS INFORMAÇÕES
Rendimentos isentos e não tributáveis
I'm using the site https://regexr.com/ to test.

This should do it (regex101)
^.*(?=\nOUTRAS INFORMAÇÕES)
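Basically, the lookahead requires the matched line to be followed by a newline character and then the OUTRAS INFORMAÇÕES line. The multiline flag (m) has to be on so that ^ matches at the start of every line.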
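A minimal Python sketch of the same lookahead, assuming the OCR output is available as one string (the variable names ocr_text and value are only for illustration, and the sample is shortened from the question):

```python
import re

# Shortened copy of the OCR text from the question.
ocr_text = """Divisas e ônus reais em 31/12/2017
100.580.873.91
100.329. 110,32
9135,456,07
8.571.962,06
OUTRAS INFORMAÇÕES
Rendimentos isentos e não tributáveis"""

# re.MULTILINE lets ^ match at the start of every line. DOTALL is off,
# so .* stays on a single line, and the lookahead only succeeds on the
# line that is directly followed by the heading.
match = re.search(r"^.*(?=\nOUTRAS INFORMAÇÕES)", ocr_text, re.MULTILINE)
value = match.group(0) if match else None
print(value)  # 8.571.962,06
```

Because the pattern does not try to describe the number itself, it keeps working when the OCR garbles the digit separators.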

Related

how to find the second number in a string with Regex expression

I want a regular expression that finds the last match.
The result that I want is "103-220898-44".
The expression that I'm using is ([^\d]|^)\d{3}-\d{6}-\d{2}([^\d]|$). This doesn't work because it matches the first result "100-520006-90" and I want the last "103-220898-44"
Example A
Transferencia exitosa
Comprobante No. 0000065600
26 May 2022 - 03:32 p.m.
Producto origen
Cuenta de Ahorro
Ahorros
100-520006-90
Producto destino
nameeee nene
Ahorros / ala
103-220898-44
Valor enviado
$ 1.000.00
/.*(\d{3}-\d{6}-\d{2})(?:[^\d]|$)/gs
If you add .* to the beginning of your regex, the greedy .* consumes as much text as possible, so the capture group ends up on the last occurrence. You also need the single-line flag (s) so that . matches newlines as well.
Note: I replaced some (...) strings with (?:...) since their aim is grouping, not capturing.
Demo: https://regex101.com/r/fygL1X/2
const regex = new RegExp('.*(\\d{3}-\\d{6}-\\d{2})(?:[^\\d]|$)', 'gs')
const str = `Transferencia exitosa
Comprobante No. 0000065600
26 May 2022 - 03:32 p.m.
Producto origen
Cuenta de Ahorro
Ahorros
100-520006-90
Producto destino
nameeee nene
Ahorros / ala
103-220898-44
Valor enviado
\$ 1.000.00`;
let m;
m = regex.exec(str)
console.log(m[1])
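For comparison, a rough Python sketch of the same greedy approach, next to an alternative that collects every match and takes the last one. The findall variant and the names ACCOUNT, last_greedy and last_found are assumptions for illustration, not part of the answer above; the sample text is shortened.

```python
import re

# Shortened sample from the question.
text = """Producto origen
Ahorros
100-520006-90
Producto destino
Ahorros / ala
103-220898-44
Valor enviado
$ 1.000.00"""

ACCOUNT = r"\d{3}-\d{6}-\d{2}"

# Greedy .* with DOTALL swallows the whole text and backtracks until the
# account pattern fits, so group 1 lands on the last occurrence.
greedy = re.search(r".*(" + ACCOUNT + r")(?:\D|$)", text, re.DOTALL)
last_greedy = greedy.group(1) if greedy else None

# Alternative: find all occurrences (not glued to other digits) and
# keep the final one.
all_accounts = re.findall(r"(?<!\d)" + ACCOUNT + r"(?!\d)", text)
last_found = all_accounts[-1] if all_accounts else None

print(last_greedy, last_found)
```

The second form is easier to adapt if you later need the first match or the n-th match instead of the last.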

Start extracting after a repeated string in regex

How can I extract string_with_letters_and_special_caracters from this sequence?
sequence_one \n sequence_two \n sequence_three \n string_with_letters_and_special_caracters 0000000 \n sequence_four
I can't manage to start the match after the last \n preceding string_with_letters_and_special_caracters.
(Here \n is the repeated string.)
For example, \\n(\D+)\d+ starts extracting from the first \n.
Example: I want to extract "- Dimensions : L." (or just "Dimensions") here, which precedes an expression I already have a pattern for:
https://regex101.com/r/jLqxxo/1
Thank you!
You seem to want
-\s*Dimensions\s*:\s*L\.\s*(\d+)\D+(\d+)\D+(\d+)
See the regex demo and the Python demo:
import re
s=r'''FICHE TECHNIQUE\n- Pieds du canapé en bois.\n- Assise et dossier en polyester effet velours.\n- Canapé idéal pour deux personnes.\n\nCARACTERISTIQUES TECHNIQUES\n- Dimensions : L. 128 x l. 71 x H. 80 cm.\n- Hauteur d'assise : H. 47 cm.\n- Poids : 15,14 kg.\n\n'''
m = re.search(r'-\s*Dimensions\s*:\s*L\.\s*(\d+)\D+(\d+)\D+(\d+)',s)
if m:
    print(m.group(1)) # => 128
    print(m.group(2)) # => 71
    print(m.group(3)) # => 80

RegEx - Extract string from character 91 up to character 180 and delete everything before and after

I am trying to extract character 91 to 180 from this text:
Exosphere -6° Reg. fra Deuter er den perfekte sovepose til dig, der har det med at stritte med arme og ben, når du sover, og føler dig lidt hæmmet i en almindelig mumiesovepose. Den er nemlig fuld af elastikker, som tillader soveposen at blive op til 25% bredere, end den umiddelbart ser ud til at være.
So that the output will look like this:
itte med arme og ben, når du sover, og føler dig lidt hæmmet i en almindelig mumiesovepose
I am using this expression, which I found here on SO in "REGEX to trim a string after 180 characters and before |":
Replace
^([^|]{91,180})[^|]+(.*)$
with
\1\2
It does part of the job. This is the output:
Exosphere -6° Reg. fra Deuter er den perfekte sovepose til dig, der har det med at stritte med arme og ben, når du sover, og føler dig lidt hæmmet i en almindelig mumiesovepose
So now I need to remove everything before character 91.
The point here is that you need to match the first 90 chars, then match and capture the next 90 chars into Group 1, and finally match the rest of the string. Then replace the whole match with a backreference to the Group 1 value.
You may use
^[\s\S]{90}([\s\S]{90})[\s\S]*
Or, if there are no line breaks, the more "regular" pattern
^.{90}(.{90}).*
In both cases, replace with $1.
See the regex demo.
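A small Python sketch of the replacement, using a synthetic string so the character positions are easy to check. The helper name chars_91_to_180 is hypothetical, and Python writes the backreference as \1 rather than $1.

```python
import re

def chars_91_to_180(text: str) -> str:
    """Keep only characters 91..180 (1-based) of text.

    Strings shorter than 180 characters do not match the pattern
    and are returned unchanged.
    """
    # Skip 90 chars, capture the next 90, swallow the rest.
    return re.sub(r"^[\s\S]{90}([\s\S]{90})[\s\S]*", r"\1", text, count=1)

# 90 x "a", then 90 x "b", then some trailing "c" characters.
sample = "a" * 90 + "b" * 90 + "c" * 30
result = chars_91_to_180(sample)
print(result == "b" * 90)  # True
```

If a regex is not a hard requirement, plain slicing (sample[90:180]) gives the same result.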

detect string between brackets using regex

How can I use regex in Python to detect, in a file, only the strings between the brackets?
So If i have some text like this:
</seg>
<src>Sono stati riportati casi di (sgomberi forzati) e violazioni dei diritti umani da parte della polizia, ma su scala minore rispetto agli anni precedenti.</src>
and I want to detect only (sgomberi forzati).
I use the following:
for line in file.readlines():
    m = re.compile('\((.*?)\)', re.DOTALL).findall(line)
    print(m)
but it does not print what I need: it also prints empty brackets, like this
[]
[u'sgomberi forzati']
Escape your () as follows:
m = re.compile(r'\((.*?)\)', re.DOTALL).findall(line)
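The empty [] in the output comes from lines that contain no parentheses at all. A possible sketch, assuming the lines are already in a list (the names PAREN, lines and found are made up for the example, and the sample sentence is shortened), that compiles the pattern once and skips lines without a match:

```python
import re

# Raw string, so the backslashes reach the regex engine unchanged.
PAREN = re.compile(r"\((.*?)\)")

lines = [
    "</seg>",
    "<src>Sono stati riportati casi di (sgomberi forzati) e violazioni "
    "dei diritti umani da parte della polizia.</src>",
]

found = []
for line in lines:
    matches = PAREN.findall(line)
    if matches:  # ignore lines where nothing was found
        found.extend(matches)

print(found)  # ['sgomberi forzati']
```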

Regex extract only specific character and EOL

I am trying to extract some text using regex.
I want to extract only those lines that contain "pour 1e" or "Pour 1€" and nothing more.
The regex must be case-insensitive.
Here is my regex, which doesn't work the way I want:
/Pour ([0-9.,])(€|e)/im
and this is my text:
Tesseract Open Source OCR Engine v3.01 with Leptonica
CARDEURS
Horaire dejour de flhllll 5 19h00
pour 1€
pour 1€ supplémentaire
pour 1€ supplémentaire
pour 1€ supplémentaire
pour 1€ supplémentaire
par€ supplémentaire
Horaire de nuit de 19h00 5 flhllll
pour 1,50€
pour 1€ supplémentaire + 300 minutes
pour 1€ supplémentaire + 420 minutes
La joumée de 24 heures
35 minutes
+ 30 minutes
+ 35 minutes
+ 40 minutes
+ 45 minutes
+ 50 minutes
60 minutes
15€
Tesseract Open Source OCR Engine v3.01 with Leptonica
TARIFS
PARKING CARNOT
Homim de juur de 8:00 3 19:00 H01-aim de null de 19:00 5 8:00
mains d‘ ggg heme : G1-atuit moins d‘ ggg heure : Gmtuil
Pour 1e
Pour 1e supplémenlaire
Pour 1e suppléulentaire
Pour 1e supplémmmm
Pour 1e supplémmmm
Par e supplémenlaiI€
40 minutes
+ 40 minutes
+ 45 minutes
+ 50 minutes
+ 55 minutes
+ 55 minules
Pour 1e so nzinules
Pour 1e supplémenlaiI€ + 300 minllles
Pour 1e 5upplémenlai1Q + 420 minules
La journée a
e 24 heums 15€
You need to anchor the expression with ^ and $ which match beginning/end of line when /m is active. For example:
/^pour [0-9]+[0-9,.]*[e€]$/im
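As a quick check, here is a Python sketch of the anchored pattern run against a few lines taken from the OCR text in the question. re.IGNORECASE and re.MULTILINE stand in for the /i and /m flags; the variable names are only illustrative.

```python
import re

# A handful of lines from the OCR text in the question.
ocr = """pour 1€
pour 1€ supplémentaire
par€ supplémentaire
pour 1,50€
Pour 1e
Pour 1e supplémenlaire
15€"""

# ^ and $ anchor to each line because of re.MULTILINE.
pattern = re.compile(r"^pour [0-9]+[0-9,.]*[e€]$", re.IGNORECASE | re.MULTILINE)
matches = pattern.findall(ocr)
print(matches)  # ['pour 1€', 'pour 1,50€', 'Pour 1e']
```

Lines with anything after the currency symbol, such as "pour 1€ supplémentaire", are rejected by the $ anchor.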
Use square brackets [] to specify a group of characters to match, a caret ^ to match the beginning of the line, and a dollar sign $ to match the end of the line. Depending on which regex implementation you are using, you may be able to pass the i flag to make it case-insensitive:
/^Pour 1[€e]$/i
Or handle case explicitly with character groups
/^[Pp][Oo][Uu][Rr] 1[€e]$/
For matching repetitions, use * to match 0 or more of the previous character, + to match 1 or more, and ? to match 0 or 1.
In place of the 1 in the previous, you could use
[0-9.]+ to match any 1 or more digits or decimal points
[0-9]+\.?[0-9]* to match at least 1 digit followed by an optional decimal point and more digits
[0-9]+[0-9,]*\.?[0-9]* to match at least 1 digit, optionally more digits and commas, followed by an optional decimal point and more digits
You can also use curly braces {} to explicitly specify a number of repetitions (these must be escaped with a backslash \ in some regex engines)
[0-9]{1,3} would match 1, 2 or 3 digits
[0-9]{3} would match exactly 3 digits
You can use parentheses () to group a part of a regex pattern for backreference or repetition.
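To see how the number patterns above differ, here is a short Python sketch using re.fullmatch, which only succeeds when the whole string fits the pattern. The test strings are invented for illustration.

```python
import re

def fits(pattern: str, s: str) -> bool:
    """True when the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# [0-9.]+ is loose: any mix of digits and dots is accepted.
loose_ok = fits(r"[0-9.]+", "1..5")

# [0-9]+\.?[0-9]* allows at most one decimal point.
one_point_ok = fits(r"[0-9]+\.?[0-9]*", "1.50")
two_points_ok = fits(r"[0-9]+\.?[0-9]*", "1..5")

# Commas as thousands separators before the decimal point.
commas_ok = fits(r"[0-9]+[0-9,]*\.?[0-9]*", "1,000.50")

# Explicit repetition counts with {}.
up_to_three_ok = fits(r"[0-9]{1,3}", "123")
four_digits_ok = fits(r"[0-9]{1,3}", "1234")
exactly_three_ok = fits(r"[0-9]{3}", "12")

print(loose_ok, one_point_ok, two_points_ok, commas_ok)
print(up_to_three_ok, four_digits_ok, exactly_three_ok)
```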
So to match a line that starts with "Pour " followed by 1 or more digits, then an optional comma or decimal point with 2 digits, then the euro symbol or letter e, and any number of trailing spaces, but no other characters until end of line, and be case-insensitive:
/^Pour [0-9]+([,.][0-9][0-9])?[€e][ ]*$/i
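A possible way to apply this final pattern in Python is to filter the OCR output line by line. The names FINAL, lines and kept are assumptions for the example, and the sample lines are a small subset of the question's text, with trailing spaces added to one line to exercise the [ ]* part.

```python
import re

# Same pattern as above; re.IGNORECASE plays the role of the /i flag.
FINAL = re.compile(r"^Pour [0-9]+([,.][0-9][0-9])?[€e][ ]*$", re.IGNORECASE)

lines = [
    "pour 1€",
    "pour 1€ supplémentaire",
    "pour 1,50€",
    "Pour 1e  ",            # trailing spaces are allowed by [ ]*
    "Pour 1e so nzinules",
    "15€",
]

# Each line is tested on its own, so the m flag is not needed here.
kept = [line for line in lines if FINAL.match(line)]
print(kept)
```

Only the lines consisting of "pour", an amount and the currency symbol survive the filter; everything with trailing words is dropped.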