What's an appropriate regex to split this line using Scala?

I'm trying to split this line coming from a CSV file, to obtain the different matching groups from this (sample) line (file has around 750k lines):
919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT
As you can see, the line has four main parts: id, free text, sentiment, and option. The free-text part (La dama de hierro...) can contain many different characters, and I don't know how to build a correct regex that yields (id, txt, sent, opt).
What I've tried so far:
val fullRegex = """(\d+),(.+?),(N|P|NEU|NONE)(,\W+|;\W+)re?""".r
It works for some lines but fails for others.

Regex is powerful but sometimes it's hard to get right and cover all possible input formats. In this case it might not be needed.
val in = """919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT"""
val inSplit = in.split(",")
val id = inSplit.head // String = 919191911919
val txt = inSplit.tail.init.init.mkString(",") // free form text
val sent = inSplit.init.last // String = P
val opt = inSplit.last // String = AGREEMENT

As Bruno Grieder pointed out in the comments to the question, this can be handled more robustly without using regular expressions.
If this is not a well-formatted CSV file (that is, one where fields containing commas are enclosed in quotation marks, quotation marks inside field values are escaped, and so on), an alternative is to note that the first field gives you the ID and the last two fields give you the sentiment and the option. Everything else is free text, so the structure of a line is rather simple.
Of course, if the file is indeed well-formatted CSV, use a library built for that purpose.
Assuming this is not well-formatted CSV, first split by a comma, put the first and the last two fields in their respective variables, and join the rest of the fields using a comma to recover the text.
I don't know much Scala, so the code is rather primitive. Improvements welcome:
val line = """919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT"""
val id :: rest = line.split(",").toList
val text = rest.slice(0, rest.size - 2).mkString(",")
val sentiment = rest(rest.size - 2)
val option = rest.last
for (x <- List(id, text, sentiment, option))
  println(x)
Output:
$ scala test.scala
919191911919
"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn
P
AGREEMENT
This will also work with embedded commas in the text (although there is some extra work involved in splitting and recombining the text field). For example, if line is:
val line = "1,this is some text with one, two, three, and four commas (,),7,8"
This is the output you'll get:
1
this is some text with one, two, three, and four commas (,)
7
8
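For comparison, the same "first field, last two fields, everything else is text" idea can be sketched in Python. This is only an illustration of the technique described above, not a translation of the Scala code; the function name split_record is made up for the example, and it assumes every line has at least four comma-separated fields.

```python
def split_record(line):
    """Split 'id,free text,sentiment,option' where only the text may contain commas."""
    # The id is everything before the first comma.
    record_id, rest = line.split(",", 1)
    # The sentiment and the option are the last two comma-separated fields;
    # whatever is left in front of them is the free text, commas included.
    text, sentiment, option = rest.rsplit(",", 2)
    return record_id, text, sentiment, option


print(split_record("1,this is some text with one, two, three, and four commas (,),7,8"))
```

Limiting the number of splits from each end avoids splitting the text field and joining it back together. A line with fewer than four fields raises ValueError, which may or may not be what you want for malformed input.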

If you are sure that the text is enclosed in double quotes, you can first replace all commas inside the double quotes, then split at the commas, then put the commas back. The drawback of this solution is that you need to use a Unicode char that is guaranteed not to be present in your file.
object CSVFixer {
def main(args: Array[String]): Unit = {
split(line) foreach println
}
val line = """919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT"""
private val AltSep = '\u0080' // Unicode char that we reasonably expect to not have in the input
val fieldSeparator = ","
private[this] def unSep(s: String) = {
val SepChr = fieldSeparator.charAt(0)
var inQS = false
for (c <- s) yield {
c match {
case '"' =>
inQS = !inQS; c
case SepChr if inQS =>
AltSep
case _ => c
}
}
}
def split(line: String) =
unSep(line).split(fieldSeparator, -1) // do not discard trailing empty strings
.map(_.replace(AltSep, fieldSeparator.charAt(0)))
.map(_.replaceAll("\"", ""))
}
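If relying on a placeholder character is a concern, the same in-quotes/out-of-quotes toggle can split the line directly. The following Python sketch is an assumption-laden variant of the idea above, not the original code: the name split_outside_quotes is hypothetical, and, like the Scala version, it simply drops the double quotes and does not handle escaped quotes.

```python
def split_outside_quotes(line, sep=","):
    """Split on sep, ignoring separators that appear inside double quotes."""
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes  # toggle the state; the quote itself is dropped
        elif ch == sep and not in_quotes:
            fields.append("".join(current))  # separator outside quotes ends a field
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))  # last field, kept even when it is empty
    return fields


print(split_outside_quotes('1,"a, b".(c) d ,P,AGREEMENT'))
```

Because the split happens while scanning, no substitute separator is needed, so nothing has to be assumed about which characters are absent from the file.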

Related

Find words between comma and list of keywords RegEx

I have a large text. I would like to find the address of the owner. My input is something like...
INPUT: (...) seiscientos catorce guión ocho, domiciliado en calle
Santillana número trescientos sesenta y nueve, Valle Lo Campino,
comuna de Quilicura, Región Metropolitana, constituyeron una sociedad
por acciones (...)
keywords_cap = ['DOMICILIO:', 'Domicilio:', 'Domicilio', 'DOMICILIO', 'domiciliado en', 'domiciliada en',
'Domiciliado en', 'Domiciliada en']
keywords_cap = list(map(re.escape, keywords_cap))  # list() so that .sort() also works on Python 3
keywords_cap.sort(key=len, reverse=True)
obj = re.compile(r'\b(?:{})\s*(.*?)\.'.format('|'.join(keywords_cap)))
obj2 = obj.search(mensaje)
if obj2:
    company_name = obj2.group(1)
else:
    company_name = "None"
OUTPUT: calle Santillana número trescientos sesenta y nueve
Something is wrong: I would like to extract the text between one of the keywords and the next comma (,) or the next period (.).
But the extraction currently runs from the keyword all the way to the next period (.) only.
Can someone help me with this?
The (.*?)\. pattern matches any chars other than line break chars, as few as possible before the leftmost . char. It can be "converted" to ([^.]*), a negated character class pattern that matches 0 or more chars other than . (note that the only difference from the original pattern is that negated character classes also match linebreaks, which is a good feature in this case).
The solution will be to just add , into the character class:
obj = re.compile(r'\b(?:{})\s*([^.,]*)'.format('|'.join(keywords_cap)))
                              ^^^^^^^^
The regex will look like
\b(?:DOMICILIO:|Domicilio:|Domicilio|DOMICILIO|domiciliado en|domiciliada en|Domiciliado en|Domiciliada en)\s*([^.,]*)
See the regex demo.
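A minimal, self-contained Python 3 check of the corrected pattern follows. The sample text is the question's input shortened onto one line, and the helper name extract_address is made up for this sketch.

```python
import re

keywords = ['DOMICILIO:', 'Domicilio:', 'Domicilio', 'DOMICILIO',
            'domiciliado en', 'domiciliada en', 'Domiciliado en', 'Domiciliada en']
# Longest keywords first, so that 'Domicilio:' is tried before 'Domicilio'.
alternation = '|'.join(sorted(map(re.escape, keywords), key=len, reverse=True))
pattern = re.compile(r'\b(?:{})\s*([^.,]*)'.format(alternation))


def extract_address(text):
    """Return the text between a keyword and the next comma or period, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


mensaje = ("seiscientos catorce guión ocho, domiciliado en calle Santillana "
           "número trescientos sesenta y nueve, Valle Lo Campino, comuna de Quilicura.")
print(extract_address(mensaje))
```

If the input keeps its original line breaks, the captured text will contain them too, since the negated character class also matches newlines.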

Remove text between two tags

I'm trying to remove some text between two tags [ & ]
[13:00:00]
I want to remove 13:00:00 from the [] tags.
The number is different each time.
It's always a time of day, so it contains only digits and : symbols.
Can someone help me?
UPDATE:
I forgot to say something. The time (13:00:00) was picked from a log file. Looks like that:
[10:56:49] [Client thread/ERROR]: Item entity 26367127 has no item?!
[10:57:25] [Dbutant] misterflo13 : ils coute chere les enchent aura de feu et T2 du spawn??*
[10:57:35] [Amateur] firebow ?.SkyLegend.? : ouai 0
[10:57:38] [Novice] iPasteque : ils sont gratuit me
[10:57:41] [Novice] iPasteque : ils sont gratuit mec *
[10:57:46] [Dbutant] misterflo13 : on ma dit k'ils etait payent :o
[10:57:57] [Novice] iPasteque : on t'a mytho alors
Ignore the other text; I just want to remove the time between [ and ], so that each line starts with []. The time between [ and ] is updated every second.
It looks like your log has a specific format, and you seem to want to get rid of the time and keep all the other information. OK, read the comments in the code.
I didn't test it, but it should work:
' Read log
Dim logLines() As String = File.ReadAllLines("File_path")
If logLines.Length = 0 Then Return
' prepare array to fill sliced data
Dim lines(logLines.Length - 1) As String
For i As Integer = 0 To logLines.Count - 1
' just cut off time part and add empty brackets for each line
lines(i) = "[]" & logLines(i).Substring(10)
Next
As shown above, if you know that your file comes in a certain format, just use the position in the string where you want to cut it off.
Note: the code above can be done in one line using LINQ.
If you want to actually get the data out of it, use IndexOf. Since you are looking for the first occurrence of "[" or "]", just use start index 0:
' get position of open bracket in string
Dim openBracketPos As Integer = myString.IndexOf("[", 0, StringComparison.OrdinalIgnoreCase)
' get position of close bracket in string
Dim closeBracketPos As Integer = myString.IndexOf("]", 0, StringComparison.OrdinalIgnoreCase)
' get string between open and close bracket (Substring takes a start index and a length)
Dim data As String = myString.Substring(openBracketPos + 1, closeBracketPos - openBracketPos - 1)
This is another possibility using Regex:
Public Function ReplaceTime(ByVal Input As String) As String
Dim m As Match = Regex.Match(Input, "(\[)(\d{1,2}\:\d{1,2}(\:\d{1,2})?)(\])(.+)")
Return m.Groups(1).Value & m.Groups(4).Value & m.Groups(5).Value
End Function
It's more of a readability nightmare but it's efficient and it takes only the brackets containing a time value.
I also took the liberty of making it match for example 13:47 as well as 13:47:12.
Test: http://ideone.com/yogWfD
(EDIT) Multiline example:
You can combine this with File.ReadAllLines() (if that's what you prefer) and a For loop to get the replacement done.
Public Function ReplaceTimeMultiline(ByVal TextLines() As String) As String
For x = 0 To TextLines.Length - 1
TextLines(x) = ReplaceTime(TextLines(x))
Next
Return String.Join(Environment.NewLine, TextLines)
End Function
Above code usage:
Dim FinalT As String = ReplaceTimeMultiline(File.ReadAllLines(<file path here>))
Another multiline example:
Public Function ReplaceTimeMultiline(ByVal Input As String) As String
Dim ReturnString As String = ""
Dim Parts() As String = Input.Split({Environment.NewLine}, StringSplitOptions.None)
For x = 0 To Parts.Length - 1
ReturnString &= ReplaceTime(Parts(x)) & If(x < (Parts.Length - 1), Environment.NewLine, "")
Next
Return ReturnString
End Function
Multiline test: http://ideone.com/nKZQHm
If your problem is to remove numeric strings in the format 99:99:99 that appear inside [], I would use a regex.
The demo removes the whole tag by replacing it with string.Empty; should you want to keep an empty [] instead, replace with "[]".
Here's a demo (in C#, not VB, but you get the point; you need the regex, not the syntax, anyway):
List<string> list = new List<string>
{
"[13:00:00]",
"[4:5:0]",
"[5d2hu2d]",
"[1:1:1000]",
"[1:00:00]",
"[512341]"
};
string s = string.Join("\n", list);
Console.WriteLine("Original input string:");
Console.WriteLine(s);
Regex r = new Regex(@"\[\d{1,2}?:\d{1,2}?:\d{1,2}?\]");
foreach (Match m in r.Matches(s))
{
Console.WriteLine("{0} is a match.", m.Value);
}
Console.WriteLine();
Console.WriteLine("String with occurrences replaced with an empty string:");
Console.WriteLine(r.Replace(s, string.Empty).Trim());
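The regex-based answers above translate almost directly to other languages. As a rough Python sketch (the names TIME_TAG and blank_time_tags are invented for the example, and the optional seconds group follows the earlier answer that also accepts 13:47), replacing each bracketed time with an empty [] could look like this:

```python
import re

# A bracketed time such as [13:00:00] or [13:47]; one or two digits per part.
TIME_TAG = re.compile(r"\[\d{1,2}:\d{1,2}(?::\d{1,2})?\]")


def blank_time_tags(text):
    """Replace every bracketed time with an empty pair of brackets."""
    return TIME_TAG.sub("[]", text)


log_line = "[10:56:49] [Client thread/ERROR]: Item entity 26367127 has no item?!"
print(blank_time_tags(log_line))
```

Other bracketed tags such as [Client thread/ERROR] or [1:1:1000] are left alone because they do not match the time pattern. Passing a whole multi-line log string works as well, since sub replaces every occurrence.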

Regex to match a Spanish word ending with a dot (.) and underscore

This is the regex I'm trying:
([\w\s\/áéíóúüñçåÁÉÍÓÚÜÑÇÐ\-]+)(\.\_)
Here are two examples that it should match against:
EL ROSARIO / ESCUINAPA._ Con poco más de 4 mil pesos...
and
Cuautitlán._ Con poco más de 4 mil pesos...
The expression works for the first example but not for the second, probably because of encoding:
docHtml = urllib.urlopen(link).read()
#using the lxml function html
tree = html.fromstring(docHtml)
newsCity = CSSSelector('#pid p')
try:
city_paragraph = newsCity(tree)
city_match = re.search('([\w\s\/áéíóúüñçåÁÉÍÓÚÜÑÇÐ\-]+\._)',city_paragraph[0].text)
Your regular expression appears to be correct. I suspect that the bug is in how you're reading the strings that you're matching against. You want something like:
import codecs
f = codecs.open('spanish.txt', encoding='utf-8')
for line in f:
    print repr(line)
Finally figured it out:
newsCity = CSSSelector('#tamano5 p')
city_paragraph = newsCity(tree)
city_p = city_paragraph[0].text
city_utf=city_p.encode("utf-8")
city_match = re.search('([\w\s\/áéíóúüñçåÁÉÍÓÚÜÑÇÐ\-]+\._)',city_utf)
This gives me the expected result which in this case was to extract the city string using re.search.
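The code in this thread is Python 2. As a hedged aside, in Python 3 every str is Unicode and \w already matches accented letters, so the explicit list of accented characters and the manual encode step should not be needed. A minimal sketch, with the made-up names CITY and extract_city, assuming the paragraph text has already been decoded to str:

```python
import re

# Word characters (Unicode-aware in Python 3), whitespace, slashes and dashes,
# followed by the literal "._" marker. Only the city part is captured.
CITY = re.compile(r"([\w\s/\-]+)\._")


def extract_city(paragraph):
    """Return the city prefix that precedes '._', or None if there is none."""
    match = CITY.search(paragraph)
    return match.group(1) if match else None


print(extract_city("Cuautitlán._ Con poco más de 4 mil pesos..."))
```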

Parsing Spanish family name

A Spanish family name consists of three parts:
The paternal name,
The optional maternal name,
The optional spouse's paternal name.
Each of these three parts is one single word that may be preceded by "De", "Del", "De La", "De Los" or "De Las". Each of these prefixes starts with a capital and there may be only one of them for each part. The spouse's paternal name is separated from the rest by the word "de" (no capital).
So valid family names would be:
Pérez
Pérez De León
López de López
De La Oca Ordóñez
Castillo Ramírez de Del Valle
I can parse these names with this regex:
^((?:De |Del |De La |De Los |De Las )?\w+)?( (?:De |Del |De La |De Los |De Las )?\w+)?( de (?:De |Del |De La |De Los |De Las )?\w+)?$
1.) Can this ugly regex be simplified?
2.) When the paternal name is the same as the maternal name the word "y" is inserted between them. So "López y López de De León" and "Pérez y Pérez" are both valid, but "López y Pérez" and "Gómez y de Gómez" are not. How can I capture this case?
Thank you very much.
The exact answer depends on what programming language and/or regex engine you're using, but for most implementations, you should be able to do the following:
(1.) Make a separate regex that matches a single part of a name and then include this in the final regex, e.g., in Perl:
my $name1 = qr/(?:De |Del |De La |De Los |De Las )?\w+/;
my $name2 = qr/^($name1)( $name1)?( de $name1)?$/;
(I assume you don't want the ? after the first capture, as otherwise you'd match the empty string.) $name2 is then the regex to match against.
(2.) Strictly speaking, proper computer-theoretical regular expressions cannot test whether an arbitrary substring that appears at one point in the string also appears at another point. However, most regex implementations (e.g., Perl-compatible "regular expressions") actually support more features than a real regex engine would, so you could use a backreference like:
my $name2 = qr/^(?:($name1)( $name1)?|($name1) y \3)( de $name1)?$/;
In PCREs, the \3 matches the exact same string that the third (...) group matches. If you can't use backreferences for some reason, your only option is to use a regex like:
my $name2 = qr/^(?:($name1)( $name1)?|($name1) y ($name1))( de $name1)?$/;
and then, if $3 and $4 are defined after matching, test to see if they're equal or not. (Note that both of the above will allow names like "López López" without a "y"; if you want to prohibit those, it'll be a bit harder.)
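To make the composition and the backreference concrete, here is a rough Python sketch of the same approach. Two things are assumptions of this sketch rather than part of the answer: the names WORD, PART, FAMILY_NAME and parse_family_name are invented, and a negative lookahead is added so that the bare words "y" and "de" are never accepted as a name (without it, \w+ would happily match them and "Gómez y de Gómez" would be accepted).

```python
import re

# One name word; the lookahead (an addition of this sketch) rejects "y" and "de".
WORD = r"(?!(?:y|de)\b)\w+"
# One name part: an optional capitalised prefix followed by a single word.
PART = r"(?:De La |De Los |De Las |Del |De )?" + WORD
# Either "paternal [maternal]" or "X y X" (backreference \3), then optional " de spouse".
FAMILY_NAME = re.compile(
    r"^(?:({p})( {p})?|({p}) y \3)( de {p})?$".format(p=PART)
)


def parse_family_name(name):
    """Return (paternal, maternal, spouse) or None if the name is not valid."""
    m = FAMILY_NAME.match(name)
    if m is None:
        return None
    paternal = m.group(1) or m.group(3)
    # In the "X y X" form the maternal name equals the paternal one.
    maternal = m.group(2).strip() if m.group(2) else m.group(3)
    spouse = m.group(4)[len(" de "):] if m.group(4) else None
    return paternal, maternal, spouse


print(parse_family_name("Castillo Ramírez de Del Valle"))
print(parse_family_name("López y Pérez"))
```

As noted above, this still accepts "López López" without a "y"; rejecting that would need an extra check after matching.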
Here's my attempt. It seems to work with the examples given:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Foo {
public static void main(String[] args) throws Exception {
System.out.println(new SpanishName("Pérez"));
System.out.println(new SpanishName("Pérez De León"));
System.out.println(new SpanishName("López de López"));
System.out.println(new SpanishName("De La Oca Ordóñez"));
System.out.println(new SpanishName("Castillo Ramírez de Del Valle"));
System.out.println(new SpanishName("López y López de De León"));
System.out.println(new SpanishName("Pérez y Pérez"));
// System.out.println(new SpanishName("López y Pérez")); - Throws IAE
// System.out.println(new SpanishName("Gómez y de Gómez")); - Throws IAE
}
public static class SpanishName {
private final String paternal;
private final String maternal;
private final String spousePaternal;
private static final Pattern NAME_REGEX = Pattern
.compile("^([\\p{Ll}\\p{Lu}]+?)(?:\\s([\\p{Ll}\\p{Lu}]+?))?(?:\\s([\\p{Ll}\\p{Lu}]+?))?$");
public SpanishName(String str) {
str = stripJoinWords(str);
str = removeYJoin(str);
final Matcher matcher = NAME_REGEX.matcher(str);
if (str.contains(" y ") || !matcher.matches()) {
throw new IllegalArgumentException(String.format("'%s' is not a valid Spanish name", str));
} else {
paternal = matcher.group(1);
maternal = matcher.group(2);
spousePaternal = matcher.group(3);
}
}
private String removeYJoin(final String str) {
return str.replaceFirst("^([\\p{Ll}\\p{Lu}]+?) y \\1", "$1 $1");
}
private String stripJoinWords(final String str) {
return str.replaceAll("(?<!\\sy\\s)[Dd]e(?:l| La| Los| Las)?\\s", "");
}
@Override
public String toString() {
return String.format("paternal = %s, maternal = %s, spousePaternal = %s", paternal, maternal,
spousePaternal);
}
}
}
Rather than using a regex, there's a service which does a pretty amazing job at this: https://www.nameapi.org/en/demos/name-parser/. It's open source, but instead of using regex it gathers data from phone books as well as a pretty sophisticated set of rules.

Regex capitalize first letter every word, also after a special character like a dash

I use this to capitalize every first letter every word:
#(\s|^)([a-z0-9-_]+)#i
I want it also to capitalize the letter if it's after a special mark like a dash (-).
Now it shows:
This Is A Test For-stackoverflow
And I want this:
This Is A Test For-Stackoverflow
+1 for word boundaries, and here is a comparable Javascript solution. This accounts for possessives, as well:
var re = /(\b[a-z](?!\s))/g;
var s = "fort collins, croton-on-hudson, harper's ferry, coeur d'alene, o'fallon";
s = s.replace(re, function(x){return x.toUpperCase();});
console.log(s); // "Fort Collins, Croton-On-Hudson, Harper's Ferry, Coeur D'Alene, O'Fallon"
A simple solution is to use word boundaries:
#\b[a-z0-9-_]+#i
Alternatively, you can match for just a few characters:
#([\s\-_]|^)([a-z0-9-_]+)#i
If you want to use pure regular expressions you must use the \u.
To transform this string:
This Is A Test For-stackoverflow
into
This Is A Test For-Stackoverflow
You must put:
(.+)-(.+) to capture the values before and after the "-"
then to replace it you must put:
$1-\u$2
If it is in bash you must put:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
Actually, you don't need to match the full string; just match the first lowercase letter of each word, like this:
'~\b([a-z])~'
For JavaScript, here’s a solution that works across different languages and alphabets:
const originalString = "this is a test for-stackoverflow"
const processedString = originalString.replace(/(?:^|\s|[-"'([{])+\S/g, (c) => c.toUpperCase())
It matches any non-whitespace character \S that is preceded by the start of the string ^, whitespace \s, or any of the characters -"'([{, and replaces it with its uppercase variant.
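A similar idea can be sketched in Python 3, where \w is Unicode-aware for str patterns. This is a narrower illustration than the JavaScript above: it only treats the start of the string, whitespace and a dash as word starts, and the function name capitalize_words is made up for the example.

```python
import re


def capitalize_words(text):
    """Uppercase a word character at the string start, after whitespace, or after a dash."""
    # The lookbehind keeps the preceding separator out of the match, so only the
    # letter itself is passed to upper().
    return re.sub(r"(?:^|(?<=[\s\-]))\w", lambda m: m.group().upper(), text)


print(capitalize_words("this is a test for-stackoverflow"))
```

Adding more characters to the [\s\-] class (quotes or brackets, for instance) would extend it towards the behaviour of the JavaScript version.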
my solution using javascript
function capitalize(str) {
var reg = /\b([a-zÁ-ú]{3,})/g;
return str.replace(reg, (w) => w.charAt(0).toUpperCase() + w.slice(1));
}
with es6 + javascript
const capitalize = str =>
str.replace(/\b([a-zÁ-ú]{3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
/<expression-here>/g
[a-zÁ-ú] matches the lowercase ASCII letters plus the accented letters in the range Á to ú, in both cases.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
[a-zÁ-ú]{3,} skips runs of letters that are shorter than three characters.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
\b([a-zÁ-ú]{3,}) finally keeps only the runs that start at a word boundary. The () groups the expression.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
After that, I apply the change only to the words that start in lower case:
w.charAt(0).toUpperCase() + w.slice(1); // output -> Output
Joining the two:
str.replace(/\b(([a-zÁ-ú]){3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
result:
Sábado de Janeiro às 19h. Sexta-Feira de Janeiro às 21 e Horas
Python solution:
>>> import re
>>> the_string = 'this is a test for stack-overflow'
>>> re.sub(r'(((?<=\s)|^|-)[a-z])', lambda x: x.group().upper(), the_string)
'This Is A Test For Stack-Overflow'
read about the "positive lookbehind"
While this answer for a pure Regular Expression solution is accurate:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
note that when using any of the case-change operators:
\l Change only the first character to the right to lowercase. (Note: lowercase 'L')
\L Change all text to the right to lowercase.
\u Change only the first character to the right to uppercase.
\U Change all text to the right to uppercase.
the end delimiter should be used:
\E
so the end result should be:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\E\2/'
The following pattern turns
r.e.a.c de boeremeakers
into
R.E.A.C De Boeremeakers
(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])
when used like this:
Dim matches As MatchCollection = Regex.Matches(inputText, "(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])")
Dim outputText As New StringBuilder
If matches(0).Index > 0 Then outputText.Append(inputText.Substring(0, matches(0).Index))
index = matches(0).Index + matches(0).Length
For Each Match As Match In matches
Try
outputText.Append(UCase(Match.Value))
outputText.Append(inputText.Substring(Match.Index + 1, Match.NextMatch.Index - Match.Index - 1))
Catch ex As Exception
outputText.Append(inputText.Substring(Match.Index + 1, inputText.Length - Match.Index - 1))
End Try
Next