How to find the second number in a string with a regular expression - regex

I want a regular expression that finds the last match.
The result that I want is "103-220898-44".
The expression that I'm using is ([^\d]|^)\d{3}-\d{6}-\d{2}([^\d]|$). This doesn't work because it matches the first result, "100-520006-90", and I want the last one, "103-220898-44".
Example A
Transferencia exitosa
Comprobante No. 0000065600
26 May 2022 - 03:32 p.m.
Producto origen
Cuenta de Ahorro
Ahorros
100-520006-90
Producto destino
nameeee nene
Ahorros / ala
103-220898-44
Valor enviado
$ 1.000.00

/.*(\d{3}-\d{6}-\d{2})(?:[^\d]|$)/gs
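If you add .* to the beginning of your regex, it captures only the last occurrence, because .* is greedy: it consumes as much text as possible and then backtracks until the rest of the pattern matches. You also need the single-line flag (s) so that the . in .* matches newline characters too.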
Note: I replaced some (...) strings with (?:...) since their aim is grouping, not capturing.
Demo: https://regex101.com/r/fygL1X/2
const regex = new RegExp('.*(\\d{3}-\\d{6}-\\d{2})(?:[^\\d]|$)', 'gs')
const str = `Transferencia exitosa
Comprobante No. 0000065600
26 May 2022 - 03:32 p.m.
Producto origen
Cuenta de Ahorro
Ahorros
100-520006-90
Producto destino
nameeee nene
Ahorros / ala
103-220898-44
Valor enviado
\$ 1.000.00`;
let m;
m = regex.exec(str)
console.log(m[1])
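For comparison, the same greedy-prefix idea can be sketched in Python. This is only a sketch, not part of the answer above: the function names are made up for illustration, the sample text is abridged from the question, and lookarounds stand in for the ([^\d]|^) and ([^\d]|$) groups so that the neighbouring characters are not consumed. The second helper shows an alternative that collects every match and keeps the last one.

```python
import re

# Abridged sample from the question.
text = """Producto origen
Ahorros
100-520006-90
Producto destino
Ahorros / ala
103-220898-44
Valor enviado
$ 1.000.00"""

ACCOUNT = r"\d{3}-\d{6}-\d{2}"


def last_account_greedy(s):
    # The greedy .* (with DOTALL so it crosses newlines) swallows the whole
    # text and backtracks, so the capture group lands on the last account.
    m = re.search(r".*(?<!\d)(" + ACCOUNT + r")(?!\d)", s, flags=re.DOTALL)
    return m.group(1) if m else None


def last_account_findall(s):
    # Alternative: collect all accounts and keep the final one.
    matches = re.findall(r"(?<!\d)" + ACCOUNT + r"(?!\d)", s)
    return matches[-1] if matches else None


print(last_account_greedy(text))   # 103-220898-44
print(last_account_findall(text))  # 103-220898-44
```

The findall variant is usually easier to read, and it extends naturally if you later need the second-to-last match instead.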

Related

Get the value in the line above a word

I have the text below and would like to get the value of the line above the word OUTRAS INFORMAÇÕES in this case the value 8.571.962,06
I did the following, but I find it very fragile:
^(.?)\s(?(\d+.\d+.\d+,\d+|\d+.\d+.\d+,\d+|\d+.\d+,\d+|\d+,\d+))\sOUTRAS INFORMA.?.*?ES
OCR:
NOME: TESTE DE SILVA SAURO
CPF: 785.981.970-84
DECLARAÇÃO DE AJUSTE ANUAL
IMPOSTO SOBRE A RENDA - PESSOA FÍSICA
EXERCICIO 2018 ANO-CALENDÁRIO 2017
EVOLUÇÃO PATRIMONIAL
Bens e direitos em 31/12/2016
Bens e direitos em 31/12/2017
Dividas conus rcais em 31/12/2016
Divisas e ônus reais em 31/12/2017
100.580.873.91
100.329. 110,32
9135,456,07
8.571.962,06
OUTRAS INFORMAÇÕES
Rendimentos isentos e não tributáveis
I'm using this site to test: https://regexr.com/
This should do it (regex101)
^.*(?=\nOUTRAS INFORMAÇÕES)
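Basically, use a lookahead to require that the matched line is followed by a newline character and then the OUTRAS INFORMAÇÕES line. This relies on the multiline flag, so that ^ matches at the start of every line.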
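A minimal Python sketch of that lookahead, assuming the OCR text is already in a string (the variable names and the abridged sample are illustrative only). re.MULTILINE is what lets ^ anchor at each line start.

```python
import re

# Abridged OCR sample from the question.
ocr = (
    "Divisas e ônus reais em 31/12/2017\n"
    "100.580.873.91\n"
    "100.329. 110,32\n"
    "9135,456,07\n"
    "8.571.962,06\n"
    "OUTRAS INFORMAÇÕES\n"
    "Rendimentos isentos e não tributáveis"
)

# ^.* matches one whole line; the lookahead only accepts the line that is
# immediately followed by the OUTRAS INFORMAÇÕES line.
m = re.search(r"^.*(?=\nOUTRAS INFORMAÇÕES)", ocr, flags=re.MULTILINE)
value = m.group(0) if m else None
print(value)  # 8.571.962,06
```

Because OCR output is noisy, you may want to loosen the keyword (for example, tolerate missing accents), but that is beyond this sketch.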

Find words between comma and list of keywords RegEx

I have a large text. I would like to find the address of the owner. My input is something like...
INPUT: (...) seiscientos catorce guión ocho, domiciliado en calle
Santillana número trescientos sesenta y nueve, Valle Lo Campino,
comuna de Quilicura, Región Metropolitana, constituyeron una sociedad
por acciones (...)
import re

keywords_cap = ['DOMICILIO:', 'Domicilio:', 'Domicilio', 'DOMICILIO', 'domiciliado en', 'domiciliada en',
                'Domiciliado en', 'Domiciliada en']
# Escape the keywords and try the longest first; sorted() also works on Python 3, where map() returns an iterator
keywords_cap = sorted(map(re.escape, keywords_cap), key=len, reverse=True)
obj = re.compile(r'\b(?:{})\s*(.*?)\.'.format('|'.join(keywords_cap)))
obj2 = obj.search(mensaje)
if obj2:
    company_name = obj2.group(1)
else:
    company_name = "None"
OUTPUT: calle Santillana número trescientos sesenta y nueve
Something is wrong, because I would like to extract the text between one of the keywords and the next comma (,) or the next period (.).
Instead, the extraction runs from the keyword only up to the next period (.).
Can someone help me with this?
The (.*?)\. pattern matches any chars other than line break chars, as few as possible before the leftmost . char. It can be "converted" to ([^.]*), a negated character class pattern that matches 0 or more chars other than . (note that the only difference from the original pattern is that negated character classes also match linebreaks, which is a good feature in this case).
The solution will be to just add , into the character class:
obj = re.compile(r'\b(?:{})\s*([^.,]*)'.format('|'.join(keywords_cap)))
                              ^^^^^^^^
The regex will look like
\b(?:DOMICILIO:|Domicilio:|Domicilio|DOMICILIO|domiciliado en|domiciliada en|Domiciliado en|Domiciliada en)\s*([^.,]*)
See the regex demo.
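Put together as a runnable sketch (Python 3), using the sample input from the question. The function name extract_address and the whitespace normalization are additions for illustration; the normalization is there because the negated character class also matches the line break inside the address.

```python
import re

keywords = ['DOMICILIO:', 'Domicilio:', 'Domicilio', 'DOMICILIO',
            'domiciliado en', 'domiciliada en', 'Domiciliado en', 'Domiciliada en']

# Longest keywords first, so "Domicilio:" is tried before "Domicilio".
alternation = '|'.join(sorted(map(re.escape, keywords), key=len, reverse=True))
ADDRESS = re.compile(r'\b(?:{})\s*([^.,]*)'.format(alternation))


def extract_address(text):
    """Return the text between a keyword and the next comma or period."""
    m = ADDRESS.search(text)
    if m is None:
        return None
    # Collapse line breaks and repeated spaces into single spaces.
    return ' '.join(m.group(1).split())


mensaje = ("(...) seiscientos catorce guión ocho, domiciliado en calle\n"
           "Santillana número trescientos sesenta y nueve, Valle Lo Campino,\n"
           "comuna de Quilicura, Región Metropolitana, constituyeron una sociedad\n"
           "por acciones (...)")

print(extract_address(mensaje))  # calle Santillana número trescientos sesenta y nueve
```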

detect string between brackets using regex

How can I use a regex in Python to detect only the strings between brackets in a file?
So if I have some text like this:
</seg>
<src>Sono stati riportati casi di (sgomberi forzati) e violazioni dei diritti umani da parte della polizia, ma su scala minore rispetto agli anni precedenti.</src>
and I want to detect only (sgomberi forzati).
I use this:
for line in file.readlines():
    m = re.compile('\((.*?)\)', re.DOTALL).findall(line)
    print m
but it does not print what I need: it also prints empty brackets, like this:
[]
[u'sgomberi forzati']
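Escape your () as follows, ideally in a raw string so the backslashes reach the regex engine unchanged. Note that findall still returns an empty list for any line that contains no parenthesized text:
m = re.compile(r'\((.*?)\)', re.DOTALL).findall(line)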
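If the empty lists are the actual annoyance, one option is to collect the matches instead of printing one list per line. The following is a minimal Python 3 sketch; the helper name parenthesized and the in-memory list of lines are assumptions standing in for the real file.

```python
import re

# Escaped parentheses match literal brackets; the inner group captures the text.
PAREN = re.compile(r'\((.*?)\)')


def parenthesized(lines):
    """Collect every parenthesized string, ignoring lines without a match."""
    found = []
    for line in lines:
        # findall returns [] for lines without brackets, so extend adds nothing.
        found.extend(PAREN.findall(line))
    return found


lines = [
    "</seg>",
    "<src>Sono stati riportati casi di (sgomberi forzati) e violazioni "
    "dei diritti umani da parte della polizia.</src>",
]
print(parenthesized(lines))  # ['sgomberi forzati']
```

With a real file you would pass the open file object itself, since iterating over it yields lines.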

Parsing spanish family name

A spanish family name consists of three parts:
The paternal name,
The optional maternal name,
The optional spouse's paternal name.
Each of these three parts is one single word that may be preceded by "De", "Del", "De La", "De Los" or "De Las". Each of these prefixes starts with a capital and there may be only one of them for each part. The spouse's paternal name is separated from the rest by the word "de" (no capital).
So valid family names would be:
Pérez
Pérez De León
López de López
De La Oca Ordóñez
Castillo Ramírez de Del Valle
I can parse these names with this regex:
^((?:De |Del |De La |De Los |De Las )?\w+)?( (?:De |Del |De La |De Los |De Las )?\w+)?( de (?:De |Del |De La |De Los |De Las )?\w+)?$
1.) Can this ugly regex be simplified?
2.) When the paternal name is the same as the maternal name the word "y" is inserted between them. So "López y López de De León" and "Pérez y Pérez" are both valid, but "López y Pérez" and "Gómez y de Gómez" are not. How can I capture this case?
Thank you very much.
The exact answer depends on what programming language and/or regex engine you're using, but for most implementations, you should be able to do the following:
(1.) Make a separate regex that matches a single part of a name and then include this in the final regex, e.g., in Perl:
my $name1 = qr/(?:De |Del |De La |De Los |De Las )?\w+/;
my $name2 = qr/^($name1)( $name1)?( de $name1)?$/;
(I assume you don't want the ? after the first capture, as otherwise you'd match the empty string.) $name2 is then the regex to match against.
(2.) Strictly speaking, proper computer-theoretical regular expressions cannot test whether an arbitrary substring that appears at one point in the string also appears at another point. However, most regex implementations (e.g., Perl-compatible "regular expressions") actually support more features than a real regex engine would, so you could use a backreference like:
my $name2 = qr/^(?:($name1)( $name1)?|($name1) y \3)( de $name1)?$/;
In PCREs, the \3 matches the exact same string that the third (...) group matched. If you can't use backreferences for some reason, your only option is to use a regex like:
my $name2 = qr/^(?:($name1)( $name1)?|($name1) y ($name1))( de $name1)?$/;
and then, if $3 and $4 are defined after matching, test to see if they're equal or not. (Note that both of the above will allow names like "López López" without a "y"; if you want to prohibit those, it'll be a bit harder.)
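The backreference idea translates to Python roughly as follows. This is a sketch under stated assumptions rather than a definitive parser: the names PART and FAMILY_NAME are made up, named groups replace the numbered ones, and the (?!y\b) guard is an addition so that the joining word "y" cannot be taken as a maternal name (without it, "Gómez y de Gómez" would slip through).

```python
import re

# One name part: optional capitalized prefix plus a single word.
# (?!y\b) is this sketch's addition: "y" alone is never a name.
PART = r"(?:De |Del |De La |De Los |De Las )?(?!y\b)\w+"

FAMILY_NAME = re.compile(
    r"^(?:(?P<same>{p}) y (?P=same)"              # "X y X": same name twice
    r"|(?P<paternal>{p})(?: (?P<maternal>{p}))?)"  # or paternal [maternal]
    r"(?: de (?P<spouse>{p}))?$".format(p=PART)    # optional " de " + spouse
)

valid = [
    "Pérez",
    "Pérez De León",
    "López de López",
    "De La Oca Ordóñez",
    "Castillo Ramírez de Del Valle",
    "López y López de De León",
    "Pérez y Pérez",
]
invalid = ["López y Pérez", "Gómez y de Gómez"]

for name in valid + invalid:
    print(name, "->", bool(FAMILY_NAME.match(name)))
```

As with the Perl version, "López López" without the "y" is still accepted.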
Here's my attempt. It seems to work with the examples given:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Foo {
public static void main(String[] args) throws Exception {
System.out.println(new SpanishName("Pérez"));
System.out.println(new SpanishName("Pérez De León"));
System.out.println(new SpanishName("López de López"));
System.out.println(new SpanishName("De La Oca Ordóñez"));
System.out.println(new SpanishName("Castillo Ramírez de Del Valle"));
System.out.println(new SpanishName("López y López de De León"));
System.out.println(new SpanishName("Pérez y Pérez"));
// System.out.println(new SpanishName("López y Pérez")); - Throws IAE
// System.out.println(new SpanishName("Gómez y de Gómez")); - Throws IAE
}
public static class SpanishName {
private final String paternal;
private final String maternal;
private final String spousePaternal;
private static final Pattern NAME_REGEX = Pattern
.compile("^([\\p{Ll}\\p{Lu}]+?)(?:\\s([\\p{Ll}\\p{Lu}]+?))?(?:\\s([\\p{Ll}\\p{Lu}]+?))?$");
public SpanishName(String str) {
str = stripJoinWords(str);
str = removeYJoin(str);
final Matcher matcher = NAME_REGEX.matcher(str);
if (str.contains(" y ") || !matcher.matches()) {
throw new IllegalArgumentException(String.format("'%s' is not a valid Spanish name", str));
} else {
paternal = matcher.group(1);
maternal = matcher.group(2);
spousePaternal = matcher.group(3);
}
}
private String removeYJoin(final String str) {
return str.replaceFirst("^([\\p{Ll}\\p{Lu}]+?) y \\1", "$1 $1");
}
private String stripJoinWords(final String str) {
return str.replaceAll("(?<!\\sy\\s)[Dd]e(?:l| La| Los| Las)?\\s", "");
}
@Override
public String toString() {
return String.format("paternal = %s, maternal = %s, spousePaternal = %s", paternal, maternal,
spousePaternal);
}
}
}
Rather than using a regex, there's a service which does a pretty amazing job at this: https://www.nameapi.org/en/demos/name-parser/. It's open source, but instead of using regex it gathers data from phone books as well as a pretty sophisticated set of rules.

Regex capitalize first letter every word, also after a special character like a dash

I use this to capitalize every first letter every word:
#(\s|^)([a-z0-9-_]+)#i
I want it also to capitalize the letter if it's after a special mark like a dash (-).
Now it shows:
This Is A Test For-stackoverflow
And I want this:
This Is A Test For-Stackoverflow
+1 for word boundaries, and here is a comparable Javascript solution. This accounts for possessives, as well:
var re = /(\b[a-z](?!\s))/g;
var s = "fort collins, croton-on-hudson, harper's ferry, coeur d'alene, o'fallon";
s = s.replace(re, function(x){return x.toUpperCase();});
console.log(s); // "Fort Collins, Croton-On-Hudson, Harper's Ferry, Coeur D'Alene, O'Fallon"
A simple solution is to use word boundaries:
#\b[a-z0-9-_]+#i
Alternatively, you can match for just a few characters:
#([\s\-_]|^)([a-z0-9-_]+)#i
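If you want to do this with the regex replacement alone, you can use \u, which some tools (for example GNU sed and Perl) support, although not every regex engine does.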
To transform this string:
This Is A Test For-stackoverflow
into
This Is A Test For-Stackoverflow
Use:
(.+)-(.+) to capture the text before and after the "-"
Then, as the replacement, use:
$1-\u$2
In bash, with sed:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
Actually, you don't need to match the full string; just match the first lowercase letter of each word, like this:
'~\b([a-z])~'
For JavaScript, here’s a solution that works across different languages and alphabets:
const originalString = "this is a test for-stackoverflow"
const processedString = originalString.replace(/(?:^|\s|[-"'([{])+\S/g, (c) => c.toUpperCase())
It matches any non-whitespace character \S that is preceded by the start of the string ^, whitespace \s, or any of the characters -"'([{, and replaces it with its uppercase variant.
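Roughly the same idea in Python, as a sketch: the names WORD_START and capitalize_words are made up, and a fixed-width lookbehind replaces the consumed prefix, so only the letter itself is matched and uppercased. str.upper() is Unicode-aware, so accented letters work as well.

```python
import re

# A non-whitespace character at the start of the string, or right after
# whitespace or one of the characters - " ' ( [ {
WORD_START = re.compile(r"(?:^|(?<=[\s\-\"'(\[{]))\S")


def capitalize_words(text):
    return WORD_START.sub(lambda m: m.group().upper(), text)


print(capitalize_words("this is a test for-stackoverflow"))
# This Is A Test For-Stackoverflow
print(capitalize_words("élan (vital)"))
# Élan (Vital)
```

Note that an apostrophe also counts as a word start here, so "harper's" would become "Harper'S"; drop the ' from the character class if that matters.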
My solution using JavaScript:
function capitalize(str) {
  var reg = /\b([a-zÁ-ú]{3,})/g;
  return str.replace(reg, (w) => w.charAt(0).toUpperCase() + w.slice(1));
}
With ES6+ JavaScript:
const capitalize = str =>
str.replace(/\b([a-zÁ-ú]{3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
/<expression-here>/g
[a-zÁ-ú] here I cover the lowercase letters a-z plus the accented letters in the range Á to ú, both uppercase and lowercase.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
[a-zÁ-ú]{3,} requires at least three such letters, so words that are too short are skipped.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
\b([a-zÁ-ú]{3,}) lastly, the word boundary keeps only the matches that start at the beginning of a word. The () group isolates that last expression.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
After achieving this, I apply the change only to the words that start in lower case:
w.charAt(0).toUpperCase() + w.slice(1); // output -> Output
Joining the two:
str.replace(/\b(([a-zÁ-ú]){3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
result:
Sábado de Janeiro às 19h. Sexta-Feira de Janeiro às 21 e Horas
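For reference, a Python sketch of the same "skip short words" rule. It is not a translation of the exact character range above: [^\W\d_] is used instead as a Unicode-aware way to say "any letter", and the names LONG_WORD and capitalize_long_words are illustrative.

```python
import re

# A word boundary followed by at least three letters.
# [^\W\d_] means: a word character that is neither a digit nor an underscore.
LONG_WORD = re.compile(r"\b[^\W\d_]{3,}")


def capitalize_long_words(text):
    # Uppercase only the first letter; leave the rest of the word untouched.
    return LONG_WORD.sub(lambda m: m.group()[0].upper() + m.group()[1:], text)


sample = "sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas"
print(capitalize_long_words(sample))
# Sábado de Janeiro às 19h. Sexta-Feira de Janeiro às 21 e Horas
```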
Python solution:
>>> import re
>>> the_string = 'this is a test for stack-overflow'
>>> re.sub(r'(((?<=\s)|^|-)[a-z])', lambda x: x.group().upper(), the_string)
'This Is A Test For Stack-Overflow'
read about the "positive lookbehind"
While this answer for a pure Regular Expression solution is accurate:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
it should be noted when using any Case-Change Operators:
\l Change case of only the first character to the right to lowercase. (Note: lowercase 'L')
\L Change case of all text to the right to lowercase.
\u Change case of only the first character to the right to uppercase.
\U Change case of all text to the right to uppercase.
the end delimiter should be used:
\E
so the end result should be:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\E\2/'
This pattern produces
R.E.A.C De Boeremeakers
from
r.e.a.c de boeremeakers
(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])
using this VB.NET code:
Dim matches As MatchCollection = Regex.Matches(inputText, "(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])")
Dim outputText As New StringBuilder
If matches(0).Index > 0 Then outputText.Append(inputText.Substring(0, matches(0).Index))
index = matches(0).Index + matches(0).Length
For Each Match As Match In matches
Try
outputText.Append(UCase(Match.Value))
outputText.Append(inputText.Substring(Match.Index + 1, Match.NextMatch.Index - Match.Index - 1))
Catch ex As Exception
outputText.Append(inputText.Substring(Match.Index + 1, inputText.Length - Match.Index - 1))
End Try
Next
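The same transformation can be sketched more compactly in Python with a replacement callback. One adjustment is needed: Python's re module only accepts fixed-width lookbehinds, so (?<=\A|[ .]) is rewritten as an alternation between ^ and a one-character lookbehind. The names INITIAL and capitalize_initials are illustrative.

```python
import re

# A lowercase letter at the start of the string or after a space/period,
# and followed by another lowercase letter, a period, or a space.
INITIAL = re.compile(r"(?:^|(?<=[ .]))[a-z](?=[a-z. ])")


def capitalize_initials(text):
    return INITIAL.sub(lambda m: m.group().upper(), text)


print(capitalize_initials("r.e.a.c de boeremeakers"))
# R.E.A.C De Boeremeakers
```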