Parsing Spanish family name - regex

A Spanish family name consists of three parts:
The paternal name,
The optional maternal name,
The optional spouse's paternal name.
Each of these three parts is one single word that may be preceded by "De", "Del", "De La", "De Los" or "De Las". Each of these prefixes starts with a capital and there may be only one of them for each part. The spouse's paternal name is separated from the rest by the word "de" (no capital).
So valid family names would be:
Pérez
Pérez De León
López de López
De La Oca Ordóñez
Castillo Ramírez de Del Valle
I can parse these names with this regex:
^((?:De |Del |De La |De Los |De Las )?\w+)?( (?:De |Del |De La |De Los |De Las )?\w+)?( de (?:De |Del |De La |De Los |De Las )?\w+)?$
1.) Can this ugly regex be simplified?
2.) When the paternal name is the same as the maternal name the word "y" is inserted between them. So "López y López de De León" and "Pérez y Pérez" are both valid, but "López y Pérez" and "Gómez y de Gómez" are not. How can I capture this case?
Thank you very much.

The exact answer depends on what programming language and/or regex engine you're using, but for most implementations, you should be able to do the following:
(1.) Make a separate regex that matches a single part of a name and then include this in the final regex, e.g., in Perl:
my $name1 = qr/(?:De |Del |De La |De Los |De Las )?\w+/;
my $name2 = qr/^($name1)( $name1)?( de $name1)?$/;
(I assume you don't want the ? after the first capture, as otherwise you'd match the empty string.) $name2 is then the regex to match against.
(2.) Strictly speaking, proper computer-theoretical regular expressions cannot test whether an arbitrary substring that appears at one point in the string also appears at another point. However, most regex implementations (e.g., Perl-compatible "regular expressions") actually support more features than a real regex engine would, so you could use a backreference like:
my $name2 = qr/^(?:($name1)( $name1)?|($name1) y \3)( de $name1)?$/;
In PCREs, the \3 matches the exact same string that the third (...) group matches. If you can't use backreferences for some reason, your only option is to use a regex like:
my $name2 = qr/^(?:($name1)( $name1)?|($name1) y ($name1))( de $name1)?$/;
and then, if $3 and $4 are defined after matching, test to see if they're equal or not. (Note that both of the above will allow names like "López López" without a "y"; if you want to prohibit those, it'll be a bit harder.)
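The backreference idea carries over to other engines. Here is a minimal Python sketch of it, not a definitive implementation. The names PART, NAME and parse_name are made up for illustration. The (?!(?:y|de)\b) lookahead is an addition that is not in the answer above: without it, \w+ happily accepts "y" as a maternal name, so "Gómez y de Gómez" would slip through.

```python
import re

# One name part: an optional capitalised prefix followed by a single word.
# The lookahead (an assumption, not part of the original answer) stops the
# joining words "y" and "de" from being taken as names themselves.
PART = r"(?:De Las |De Los |De La |Del |De )?(?!(?:y|de)\b)\w+"

# Either "paternal [maternal]" or "X y X" (enforced by the \3 backreference),
# optionally followed by " de " and the spouse's paternal name.
NAME = re.compile(rf"^(?:({PART})( {PART})?|({PART}) y \3)( de {PART})?$")


def parse_name(text):
    """Return (paternal, maternal, spouse), or None if text does not match."""
    m = NAME.match(text)
    if m is None:
        return None
    if m.group(3) is not None:
        # The "X y X" branch matched: both names are the same.
        paternal = maternal = m.group(3)
    else:
        paternal = m.group(1)
        maternal = m.group(2).strip() if m.group(2) else None
    spouse = m.group(4)[len(" de "):] if m.group(4) else None
    return paternal, maternal, spouse
```

As noted above, this still accepts "López López" without a "y".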

Here's my attempt. It seems to work with the examples given:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Foo {
public static void main(String[] args) throws Exception {
System.out.println(new SpanishName("Pérez"));
System.out.println(new SpanishName("Pérez De León"));
System.out.println(new SpanishName("López de López"));
System.out.println(new SpanishName("De La Oca Ordóñez"));
System.out.println(new SpanishName("Castillo Ramírez de Del Valle"));
System.out.println(new SpanishName("López y López de De León"));
System.out.println(new SpanishName("Pérez y Pérez"));
// System.out.println(new SpanishName("López y Pérez")); - Throws IAE
// System.out.println(new SpanishName("Gómez y de Gómez")); - Throws IAE
}
public static class SpanishName {
private final String paternal;
private final String maternal;
private final String spousePaternal;
private static final Pattern NAME_REGEX = Pattern
.compile("^([\\p{Ll}\\p{Lu}]+?)(?:\\s([\\p{Ll}\\p{Lu}]+?))?(?:\\s([\\p{Ll}\\p{Lu}]+?))?$");
public SpanishName(String str) {
str = stripJoinWords(str);
str = removeYJoin(str);
final Matcher matcher = NAME_REGEX.matcher(str);
if (str.contains(" y ") || !matcher.matches()) {
throw new IllegalArgumentException(String.format("'%s' is not a valid Spanish name", str));
} else {
paternal = matcher.group(1);
maternal = matcher.group(2);
spousePaternal = matcher.group(3);
}
}
private String removeYJoin(final String str) {
return str.replaceFirst("^([\\p{Ll}\\p{Lu}]+?) y \\1", "$1 $1");
}
private String stripJoinWords(final String str) {
return str.replaceAll("(?<!\\sy\\s)[Dd]e(?:l| La| Los| Las)?\\s", "");
}
@Override
public String toString() {
return String.format("paternal = %s, maternal = %s, spousePaternal = %s", paternal, maternal,
spousePaternal);
}
}
}

Rather than using a regex, there's a service which does a pretty amazing job at this: https://www.nameapi.org/en/demos/name-parser/. It's open source, but instead of using regex it gathers data from phone books as well as a pretty sophisticated set of rules.

Related

how to find the second number in a string with Regex expression

I want a regular expression that finds the last match.
the result that I want is "103-220898-44"
The expression that I'm using is ([^\d]|^)\d{3}-\d{6}-\d{2}([^\d]|$). This doesn't work because it matches the first result "100-520006-90" and I want the last "103-220898-44"
Example A
Transferencia exitosa
Comprobante No. 0000065600
26 May 2022 - 03:32 p.m.
Producto origen
Cuenta de Ahorro
Ahorros
100-520006-90
Producto destino
nameeee nene
Ahorros / ala
103-220898-44
Valor enviado
$ 1.000.00
/.*(\d{3}-\d{6}-\d{2})(?:[^\d]|$)/gs
If you add .* to the beginning of your regex, the greedy .* consumes as much as possible, so the capture group ends up on the last match. You also need the single-line flag (s) so that . matches newlines as well.
Note: I replaced some (...) strings with (?:...) since their aim is grouping, not capturing.
Demo: https://regex101.com/r/fygL1X/2
const regex = new RegExp('.*(\\d{3}-\\d{6}-\\d{2})(?:[^\\d]|$)', 'gs')
const str = `Transferencia exitosa
Comprobante No. 0000065600
26 May 2022 - 03:32 p.m.
Producto origen
Cuenta de Ahorro
Ahorros
100-520006-90
Producto destino
nameeee nene
Ahorros / ala
103-220898-44
Valor enviado
\$ 1.000.00`;
let m;
m = regex.exec(str)
console.log(m[1])
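For comparison, here is a rough Python sketch of the same greedy trick, plus an alternative that collects every candidate and keeps the last one. The function names and the lookaround-based pattern are illustrative assumptions, not part of the answer above.

```python
import re

# Sample text from the question.
TEXT = """Transferencia exitosa
Comprobante No. 0000065600
26 May 2022 - 03:32 p.m.
Producto origen
Cuenta de Ahorro
Ahorros
100-520006-90
Producto destino
nameeee nene
Ahorros / ala
103-220898-44
Valor enviado
$ 1.000.00"""

# Greedy approach: .* (with DOTALL so it crosses newlines) runs to the end
# of the text and backtracks, so the capture group lands on the last candidate.
GREEDY = re.compile(r".*(\d{3}-\d{6}-\d{2})(?:\D|$)", re.DOTALL)

# Alternative: find every candidate and keep the last one.
# The lookarounds reject matches embedded in longer digit runs.
ACCOUNT = re.compile(r"(?<!\d)\d{3}-\d{6}-\d{2}(?!\d)")


def last_account_greedy(text):
    m = GREEDY.search(text)
    return m.group(1) if m else None


def last_account_findall(text):
    matches = ACCOUNT.findall(text)
    return matches[-1] if matches else None
```

The findall variant is easier to adapt if you later need the second or the n-th match instead of the last one.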

Find words between comma and list of keywords RegEx

I have a large text. I would like to find the address of the owner. My input is something like...
INPUT: (...) seiscientos catorce guión ocho, domiciliado en calle
Santillana número trescientos sesenta y nueve, Valle Lo Campino,
comuna de Quilicura, Región Metropolitana, constituyeron una sociedad
por acciones (...)
keywords_cap = ['DOMICILIO:', 'Domicilio:', 'Domicilio', 'DOMICILIO', 'domiciliado en', 'domiciliada en',
'Domiciliado en', 'Domiciliada en']
keywords_cap = sorted(map(re.escape, keywords_cap), key=len, reverse=True)  # map() returns an iterator in Python 3, so use sorted()
obj = re.compile(r'\b(?:{})\s*(.*?)\.'.format('|'.join(keywords_cap)))
obj2 = obj.search(mensaje)
if obj2:
company_name = obj2.group(1)
else:
company_name = "None"
OUTPUT: calle Santillana número trescientos sesenta y nueve
Something is wrong, because I would like to extract the text between one of the keywords and the next comma (,) or the next period (.).
But the extraction runs from the keyword only up to the next period (.).
Can someone help me with this?
The (.*?)\. pattern matches any chars other than line break chars, as few as possible before the leftmost . char. It can be "converted" to ([^.]*), a negated character class pattern that matches 0 or more chars other than . (note that the only difference from the original pattern is that negated character classes also match linebreaks, which is a good feature in this case).
The solution will be to just add , into the character class:
obj = re.compile(r'\b(?:{})\s*([^.,]*)'.format('|'.join(keywords_cap)))
                              ^^^^^^^^
The regex will look like
\b(?:DOMICILIO:|Domicilio:|Domicilio|DOMICILIO|domiciliado en|domiciliada en|Domiciliado en|Domiciliada en)\s*([^.,]*)
See the regex demo.
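Put together as a self-contained Python 3 sketch, it could look like the following. The names KEYWORDS, ADDRESS, SAMPLE and find_address are made up for illustration, and collapsing line breaks in the result is an extra step that the answer above does not mention.

```python
import re

KEYWORDS = ['DOMICILIO:', 'Domicilio:', 'Domicilio', 'DOMICILIO',
            'domiciliado en', 'domiciliada en',
            'Domiciliado en', 'Domiciliada en']

# Longest keywords first, so 'Domicilio:' is tried before 'Domicilio'.
alternation = '|'.join(sorted(map(re.escape, KEYWORDS), key=len, reverse=True))

# Capture everything after the keyword up to the next comma or period.
ADDRESS = re.compile(r'\b(?:{})\s*([^.,]*)'.format(alternation))

# Sample input from the question, including its line breaks.
SAMPLE = (
    "(...) seiscientos catorce guión ocho, domiciliado en calle\n"
    "Santillana número trescientos sesenta y nueve, Valle Lo Campino,\n"
    "comuna de Quilicura, Región Metropolitana, constituyeron una sociedad\n"
    "por acciones (...)"
)


def find_address(text):
    """Return the text after the first keyword, or None if no keyword occurs."""
    m = ADDRESS.search(text)
    if m is None:
        return None
    # The negated class also matches line breaks; collapse them to spaces.
    return ' '.join(m.group(1).split())
```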

What's an appropriate regex to split this line using scala?

I'm trying to split this line coming from a CSV file, to obtain the different matching groups from this (sample) line (file has around 750k lines):
919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT
As you can see, there are four main parts in the line: id, free text, sentiment, option. The content part (La dama de hierro...) can also contain many special characters, and I don't know how to build a correct regex to obtain it like this: (id, txt, sent, opt).
What I've tried so far:
val fullRegex = """(\d+),(.+?),(N|P|NEU|NONE)(,\W+|;\W+)re?""".r
Works for some lines but fails for others.
Regex is powerful but sometimes it's hard to get right and cover all possible input formats. In this case it might not be needed.
val in = """919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT"""
val inSplit = in.split(",")
val id = inSplit.head // String = 919191911919
val txt = inSplit.tail.init.init.mkString(",") // free form text
val sent = inSplit.init.last // String = P
val opt = inSplit.last // String = AGREEMENT
As Bruno Grieder pointed out in the comments to the question, this can be handled more robustly without using regular expressions.
If this is not a well formatted CSV file (meaning, fields containing commas enclosed in quotation marks, quotation marks in field values escaped etc), an alternative is to realize that the first field does give you the ID and the last two fields do give you the sentiment and the option. Everything else is free text, so the structure of a line is rather simple.
Of course, if the file is indeed well-formatted CSV, use a library built for that purpose.
Assuming this is not well-formatted CSV, first split by a comma, put the first and the last two fields in their respective variables, and join the rest of the fields using a comma to recover the text.
I don't know much Scala, so the code is rather primitive. Improvements welcome:
val line = """919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT"""
val id :: rest = line.split(",").toList
val text = rest.slice(0, rest.size - 2).mkString(",")
val sentiment = rest(rest.size - 2);
val option = rest.last;
for (x <- List(id, text, sentiment, option))
println(x)
Output:
$ scala test.scala
919191911919
"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn
P
AGREEMENT
This will also work with embedded commas in the text (although there is some extra work involved in splitting and recombining the text field). For example, if line is:
val line = "1,this is some text with one, two, three, and four commas (,),7,8"
This is the output you'll get:
1
this is some text with one, two, three, and four commas (,)
7
8
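The same "first field, last two fields, everything else is text" idea can be sketched in Python with one split from the left and one from the right, which avoids rejoining the text field. The function name split_record is hypothetical, and the sketch assumes every line has at least four comma-separated fields.

```python
def split_record(line):
    """Split 'id,free text,sentiment,option' where only the text may contain commas.

    Raises ValueError if the line has fewer than four comma-separated fields.
    """
    # The id is everything before the first comma.
    record_id, rest = line.split(",", 1)
    # Sentiment and option are the last two fields; the remainder is the text.
    text, sentiment, option = rest.rsplit(",", 2)
    return record_id, text, sentiment, option
```

Like the Scala version, this leaves the text untouched, including any surrounding quotes or trailing spaces.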
If you are sure that the text is enclosed in double quotes, you can first replace all commas inside the double quotes, then split at the commas, then put the commas back. The drawback of this solution is that you need to use a Unicode char that is guaranteed to not be present in your file:
object CSVFixer {
def main(args: Array[String]): Unit = {
split(line) foreach println
}
val line = """919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT"""
private val AltSep = '\u0080' // Unicode char that we reasonably expect to not have in the input
val fieldSeparator = ","
private[this] def unSep(s: String) = {
val SepChr = fieldSeparator.charAt(0)
var inQS = false
for (c <- s) yield {
c match {
case '"' =>
inQS = !inQS; c
case SepChr if inQS =>
AltSep
case _ => c
}
}
}
def split(line: String) =
unSep(line).split(fieldSeparator, -1) // do not discard trailing empty strings
.map(_.replace(AltSep, fieldSeparator.charAt(0)))
.map(_.replaceAll("\"", ""))
}

How to delete repeated specific characters

I need to replace the "#" with "-" in a string. This is straightforward, but I also need to replace a run of several "#####" with just one single "-". Any ideas on how to do the latter with ASP?
Here is an example:
input string:
#Introducción a los Esquemas Algorítmicos: Apuntes y colección de problemas. Report LSI-97-6-T########09/30/1997#####TRE#
Desired output:
-Introducción a los Esquemas Algorítmicos: Apuntes y colección de problemas. Report LSI-97-6-T-09/30/1997-TRE-
Thanks.
Try this for classic ASP:
Dim regEx
Set regEx = New RegExp
With regEx
.Pattern = "([\#])\1+|(\#)"
.Global = True
.MultiLine = True
End With
strMessage = regEx.Replace(str, "-")
This will match every occurrence of multiple #### or single occurrences of #
Not sure what language you are using so here's the expression in full with delimiters: /([\#])\1+|(\#)/g
Edit - Even simpler: /#+/g
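To illustrate the simpler pattern outside ASP, here is a small Python sketch (the function name collapse_hashes is made up): one or more consecutive "#" characters are replaced by a single "-".

```python
import re


def collapse_hashes(text):
    """Replace each run of one or more '#' characters with a single '-'."""
    return re.sub(r"#+", "-", text)
```

Because #+ covers both the single and the repeated case, no alternation or backreference is needed.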
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
Console.WriteLine("Hello World");
String input = "#Introducción a los Esquemas Algorítmicos: Apuntes y colección de problemas. Report LSI-97-6-T########09/30/1997#####TRE#";
String output=Regex.Replace(input,@"\#+","-");
Console.WriteLine(output);
}
}

Regex to identify a word containing spaces

I need to identify a string in a text and replace it with an empty string. The problem is that it is not always present as a single word. There may be space characters between each letter or set of letters. For example:
For word "Decent", I may face the following values.
D ec ent,
De ce nt,
De ce n t .
Is there a way to identify these strings using "Decent" word as input with any regular expression?
I am very new to regular expressions. Please help!!
TIA!
\bD\s*e\s*c\s*e\s*n\s*t\s*
This matches D ec ent, De ce nt, De ce n t and Decent (and decent, if you enable case-insensitive matching),
but not blade centimeter
If you use
'D ?e ?c ?e ?n ?t ?'
it will match the word with extra spaces
The expression "D\s*e\s*c\s*e\s*n\s*t" will do it. Each letter except the last is followed by zero or more whitespace characters (\s matches any whitespace, not just spaces). You could replace \s* with " *" (a space followed by an asterisk) if you just want literal spaces.
first a bit of code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class WordsWithSpaces {
public static void main(String[] args) {
String test = "Descent D escent De s cent desce nd";
String word = "descent";
String pattern = "";
for(int i=0; i<word.length();i++) {
pattern = pattern+word.charAt(i)+"\\s*";
}
System.err.println("pattern is: "+pattern);
Pattern p = Pattern.compile(pattern,Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(test);
while(m.find()) {
String found = test.substring(m.start(),m.end());
System.err.println(found+" matches");
}
}
}
Now for the explanation: \s is a character class for whitespace. This includes spaces and tabs and (possibly) line breaks. In this piece of code, I take every character of the word I am looking for and append "\s*", where "*" means 0 or more occurrences.
Also, to avoid it being case sensitive, I set the CASE_INSENSITIVE flag on the pattern.
Character classes may not have the same name in your programming language of choice, but there should be one for whitespace. Check your documentation.
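The pattern-building loop can also be sketched in Python. This is only an illustration under a few assumptions: the helper names spaced_pattern and remove_spaced are made up, the search word is assumed to consist of letters only, and the \b word boundaries (borrowed from the first answer) are added so that "blade centimeter" is not matched.

```python
import re


def spaced_pattern(word):
    """Build a case-insensitive regex matching word with optional whitespace
    between its letters, e.g. 'Decent' also matches 'De ce n t'."""
    # Escape each letter and join the letters with \s* (zero or more whitespace).
    letters = r"\s*".join(re.escape(ch) for ch in word)
    # Word boundaries keep the match from starting or ending inside another word.
    return re.compile(r"\b" + letters + r"\b", re.IGNORECASE)


def remove_spaced(word, text):
    """Replace every spaced-out occurrence of word with the empty string."""
    return spaced_pattern(word).sub("", text)
```

Joining with \s* instead of appending it after every letter also avoids swallowing the whitespace that follows the word.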