Expanding abbreviations using regex

I have a dictionary of abbreviations that I would like to use to go through a text and expand every abbreviation it contains.
The defined dictionary is as follows:
contractions_dict = {
"kl\.": "klokken",
}
The text I wish to expand is as follows:
text = 'Gl. Syd- og Sønderjyllands Politi er måske kl. 18 kløjes landets mest aktive politikreds på Twitter med over 27.000, som følger med.'
I use the following function:
import re
def expand_contractions(s, contractions_dict, contractions_re):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)
contractions_re = re.compile("(%s)" % "|".join(contractions_dict.keys()))
text = expand_contractions(text, contractions_dict, contractions_re)
print(text)
I have tried a range of different keys in the dictionary to capture the abbreviations, but nothing has worked. Any suggestions?

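Your dictionary keys are regex patterns ("kl\."), but match.group(0) returns the matched text ("kl."), which is not a key, so the lookup raises a KeyError. Keep the keys as plain text and escape them when building the pattern: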
import re
contractions_dict = {
"kl.": "klokken",
}
# Group the alternatives so that \b applies to every key, not only the first one
pat = re.compile(r'\b(?:' + r'|'.join(re.escape(k) for k in contractions_dict) + r')')
text = "Gl. Syd- og Sønderjyllands Politi er måske kl. 18 kløjes landets mest aktive politikreds på Twitter med over 27.000, som følger med."
text = pat.sub(lambda g: contractions_dict[g.group(0)], text)
print(text)
Prints:
Gl. Syd- og Sønderjyllands Politi er måske klokken 18 kløjes landets mest aktive politikreds på Twitter med over 27.000, som følger med.
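If the dictionary grows beyond one entry, the same idea could be sketched as below. The extra entries ("f.eks.", "bl.a.") and the helper names build_pattern and expand are made up for illustration. Sorting the keys longest-first and using (?<!\w) in place of \b are my own assumptions about what is wanted, not something taken from the answer above.

```python
import re

# Hypothetical sample dictionary; only "kl." comes from the question.
abbreviations = {
    "kl.": "klokken",
    "f.eks.": "for eksempel",
    "bl.a.": "blandt andet",
}

def build_pattern(mapping):
    # Longest keys first, so a short key never shadows a longer one
    # that starts with the same characters.
    keys = sorted(mapping, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in keys)
    # (?<!\w) prevents a match in the middle of a word.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")")

def expand(text, mapping):
    pattern = build_pattern(mapping)
    # The matched text is a plain key, so the lookup cannot fail.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

print(expand("Vi ses kl. 18, f.eks. på torvet.", abbreviations))
```

Because the keys are escaped with re.escape, the dots are matched literally and a word such as "kløjes" is left alone.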

Related

RegEx - Extract string from character 91 up to character 180 and delete everything before and after

I am trying to extract character 91 to 180 from this text:
Exosphere -6° Reg. fra Deuter er den perfekte sovepose til dig, der har det med at stritte med arme og ben, når du sover, og føler dig lidt hæmmet i en almindelig mumiesovepose. Den er nemlig fuld af elastikker, som tillader soveposen at blive op til 25% bredere, end den umiddelbart ser ud til at være.
So that the output will look like this:
itte med arme og ben, når du sover, og føler dig lidt hæmmet i en almindelig mumiesovepose
I am using this expression, which I found here on SO in "REGEX to trim a string after 180 characters and before |":
Replace
^([^|]{91,180})[^|]+(.*)$
with
\1\2
It does part of the job. This is the output:
Exosphere -6° Reg. fra Deuter er den perfekte sovepose til dig, der har det med at stritte med arme og ben, når du sover, og føler dig lidt hæmmet i en almindelig mumiesovepose
So now I need to remove everything before character 91.
The point here is that you need to match the first 90 chars, then match and capture another 90 chars into Group 1, and then just match the rest of the string, then replace with a backreference to Group 1 value.
You may use
^[\s\S]{90}([\s\S]{90})[\s\S]*
or, if the text contains no line breaks, the simpler
^.{90}(.{90}).*
In either case, replace the match with $1.
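For illustration, here is roughly how the first pattern behaves in Python, where the backreference is written \1 rather than $1. The function name middle_slice and its parameters are hypothetical, and the input is a synthetic string so the result is easy to check.

```python
import re

def middle_slice(text, skip=90, keep=90):
    # Skip `skip` characters, capture the next `keep`, then consume the rest.
    pattern = re.compile(r"^[\s\S]{%d}([\s\S]{%d})[\s\S]*" % (skip, keep))
    # Replace the whole match with the captured group.
    return pattern.sub(r"\1", text, count=1)

sample = "a" * 90 + "b" * 90 + "c" * 25
print(middle_slice(sample) == "b" * 90)
```

Note that a text shorter than 180 characters does not match at all, so it comes back unchanged. When a regex is not required, plain slicing (sample[90:180]) gives the same result.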

Google's text-to-speech (WaveNet) quality degrades with long texts

Using the API with the Swedish voice sv-SE-Wavenet-A, it seems that the quality of the audio degrades with longer texts.
Short text:
Det ter sig logiskt att man gått över till tvångsfinansiering av en
kanal som under året alltså tappade sex procent av tittartiden. Till
slut kommer ingen titta, men alla kommer ändå tvingas betala.
Long text (bold = short text from above):
SVT backade sex procent - endast en tredjedel tittas - tvingas betala
ändå Preliminära siffror från mätföretaget MMS visar på att
vuxendagiset SVT tappade sex procent av sin tittartid under 2018. Nu
tittas det på endast en dryg tredjedel av tiden på SVT, men alla i
Sverige tvingas ändå betala sedan årsskiftet. SVT. SVT:s tittarsiffror
tappade till 34.9% i så kallad tittartidsandel. Det tvångsfinansierade
vuxendagiset har alltså bara en dryg tredjedel av tittartiden, men
samtliga med inkomst i Sverige måste likväl betala för detta.
Siffrorna från MMS är preliminära och SVT ska ha 34.9% av tittartiden,
TV4-gruppen 31.9%, Discovery Networks-gruppen 11.9%, och Nordic
Entertainment Group 11.6%. Discovery inkluderar Kanal 5 och Nordic
Entertaingment TV3. Det ter sig logiskt att man gått över till
tvångsfinansiering av en kanal som under året alltså tappade sex
procent av tittartiden. Till slut kommer ingen titta, men alla kommer
ändå tvingas betala. Socialism baserar sig på tvång när folk inte
frivilligt gör det som socialisterna vill åstakomma. Det är en ren
skam att de borgerliga partierna var med och drev igenom
tvångsfinansieringen av det konsekvenslösa vuxendagiset. Lämplig
åtgärd är att istället koda SVT, så får de som vill betala för detta
göra det och övriga slipper. Så kan också SVT falla bort i glömskan.
Tills detta sker kommer förstås bloggen bevaka SVT:s felsteg, men kom
ihåg att anmälningar till granskningsnämnden ej ska göras då det
legitimerar ett sjukt och helt konsekvenslöst meningslöst system. SVT
är ett aktiebolag, som besitter beskattningsrätt av svenska folket.
Nedanstående kommentarer är inte en del av det redaktionella
innehållet och användare ansvarar själva för sina kommentarer. Se även
kommentarsreglerna, inklusive listan med kommentatorer som automatiskt
kommer raderas på grund av brott mot dessa. Genom att kommentera
samtycker du till att din kommentar, tidsstämpel, profillänk och
pseudonym sparas av Googles Blogger-system så länge det är relevant,
dvs så länge blogginlägget är publicerat.
API Request
const textToSpeech = require('@google-cloud/text-to-speech')
const client = new textToSpeech.TextToSpeechClient()
client.synthesizeSpeech({
  input: text,
  voice: {
    languageCode: 'sv-SE',
    ssmlGender: 'FEMALE',
    name: 'sv-SE-Wavenet-A',
  },
  audioConfig: {
    audioEncoding: 'MP3',
  },
})
Results from the API
Short text audio
Long text audio
Audio comparison
The audio comparison first plays the result I got when sending the short text. It then plays the same text, but cut out from the result I got when sending the long text. Finally, it plays them both together.
Is this a bug or expected? I haven't noticed any degradation of quality at all when using the en-US or en-GB voices.
I noticed that the Swedish voice uses a different naturalSampleRateHertz than all the other voices, perhaps that might cause this?
This is probably related more to using MP3 as the encoding format than to any sample-rate difference between voices. Since MP3 is a lossy format, some quality loss is expected; the differences between the short file and the longer file probably come from the MP3 encoding algorithm being used.
I checked the Speech Synthesis API on my side, and the "sv-SE-Wavenet-A" voice reports a naturalSampleRateHertz of 24000, like every other WaveNet voice I checked (all en-US-Wavenet voices are at 24000 as well).
I recommend changing the audioEncoding flag to another encoding format, for example "OGG_OPUS", which should yield better audio quality.
audioConfig: {
audioEncoding: 'OGG_OPUS',
},
If the MP3 format is a must, you can do the MP3 encoding on your side instead, which lets you choose the encoding parameters that give the best audio quality while still compressing the file.

What's an appropriate regex to split this line using scala?

I'm trying to split this line coming from a CSV file, to obtain the different matching groups from this (sample) line (file has around 750k lines):
919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT
As you can see, the line has four main parts: id, free text, sentiment, and option. The content part (La dama de hierro...) can contain many different characters, and I don't know how to build a correct regex that returns the parts as (id, txt, sent, opt).
What I've tried so far:
val fullRegex = """(\d+),(.+?),(N|P|NEU|NONE)(,\W+|;\W+)re?""".r
It works for some lines but fails for others.
Regex is powerful but sometimes it's hard to get right and cover all possible input formats. In this case it might not be needed.
val in = """919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT"""
val inSplit = in.split(",")
val id = inSplit.head // String = 919191911919
val txt = inSplit.tail.init.init.mkString(",") // free form text
val sent = inSplit.init.last // String = P
val opt = inSplit.last // String = AGREEMENT
As Bruno Grieder pointed out in the comments to the question, this can be handled more robustly without using regular expressions.
If this is not a well formatted CSV file (meaning, fields containing commas enclosed in quotation marks, quotation marks in field values escaped etc), an alternative is to realize that the first field does give you the ID and the last two fields do give you the sentiment and the option. Everything else is free text, so the structure of a line is rather simple.
Of course, if the file is indeed well-formatted CSV, use a library built for that purpose.
Assuming this is not well-formatted CSV, first split by a comma, put the first and the last two fields in their respective variables, and join the rest of the fields using a comma to recover the text.
I don't know much Scala, so the code is rather primitive. Improvements welcome:
val line = """919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT"""
val id :: rest = line.split(",").toList
val text = rest.slice(0, rest.size - 2).mkString(",")
val sentiment = rest(rest.size - 2)
val option = rest.last
for (x <- List(id, text, sentiment, option))
  println(x)
Output:
$ scala test.scala
919191911919
"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn
P
AGREEMENT
This will also work with embedded commas in the text (although there is some extra work involved in splitting and recombining the text field). For example, if line is:
val line = "1,this is some text with one, two, three, and four commas (,),7,8"
This is the output you'll get:
1
this is some text with one, two, three, and four commas (,)
7
8
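The same "first field, last two fields, everything else is text" idea can be sketched in Python, where limiting the number of splits avoids the recombining step. The function name split_line is hypothetical, and the sketch assumes every line has at least four comma-separated fields.

```python
def split_line(line):
    # Split off the ID at the first comma only.
    record_id, rest = line.split(",", 1)
    # Split the last two fields off from the right; commas inside the
    # free text are left untouched.
    text, sentiment, option = rest.rsplit(",", 2)
    return record_id, text, sentiment, option

line = "1,this is some text with one, two, three, and four commas (,),7,8"
for field in split_line(line):
    print(field)
```

A line with fewer than four fields raises a ValueError at the unpacking step, which may be a convenient way to spot malformed rows in a 750k-line file.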
If you are sure that the text is enclosed in double quotes, you can first replace all commas inside the double quotes, then split at the remaining commas, then put the original commas back. The drawback of this solution is that you need a Unicode char that is guaranteed not to be present in your file.
object CSVFixer {
  def main(args: Array[String]): Unit = {
    split(line) foreach println
  }
  val line = """919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT"""
  private val AltSep = '\u0080' // Unicode char that we reasonably expect to not have in the input
  val fieldSeparator = ","
  // Replace every separator that sits inside a quoted section with AltSep
  private[this] def unSep(s: String) = {
    val SepChr = fieldSeparator.charAt(0)
    var inQS = false
    for (c <- s) yield {
      c match {
        case '"' =>
          inQS = !inQS; c
        case SepChr if inQS =>
          AltSep
        case _ => c
      }
    }
  }
  def split(line: String) =
    unSep(line).split(fieldSeparator, -1) // do not discard trailing empty strings
      .map(_.replace(AltSep, fieldSeparator.charAt(0)))
      .map(_.replaceAll("\"", ""))
}

Regex to match linebreaks not preceded by non-escaped quotes in text file

I have a text file in which strings are enclosed in quotes (" ") and any quotes inside them are escaped with \. I want to remove every line break (\n) in the text, unless it is preceded by an unescaped quote sign ("), since that marks the end of a line.
Here's an example:
"tre miljarder på att modernisera snabbtågen.\"
Dagens mest ironiska nyhet.,Väntar på att alla Summerburst-uppdateringar snart ska dö ut så min ångest kan släppa och jag kan återgå till ett normalt liv.,RT #mapeone: En till hashtag på Facebook och jag badar naken i grisblod.,Dagens biologiska lektion och psykologiska reflektion.
Så förlorade fåglarna sina penisar - DN.SE http://t.co/PFaseQMt8B,Hahaha \"#Aliceyouknow: Hah ironiskt att jag för exakt ett år sen ville gräva ner mig lika mycket som jag vill nu med.\" #livet,Det är bara kvinnor som på riktigt förstår paniken i om Zlatans hår skulle försvinna. #ikon,#nellie_lind ah han har ju rakat sidorna, snart ryker väl hela skiten,Alltså Zlatan ge fan i att mecka med håret.,Jag har ett jobb. Hur tungt är inte det. #tungt"
The regex pattern I've come up with so far looks like this:
[^"]\n+
But it also matches the character before the \n, e.g. the quote at the end of "snabbtågen.\" on line 1 and dot (.) after "reflektion" on line 2.
I want it to match a \n preceded by anything other than an unescaped ", without including the preceding character in the match. How can that be done?
You should use a negative lookbehind assertion (the example below uses ' as the quote character):
>>> import re
>>> print(s)
'first line'
'hello world
again'
>>> s2 = re.sub(r"(?<!')\s+", " ", s)
>>> print(s2)
'first line'
'hello world again'
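The example above does not cover the escaped-quote case from the question. One possible way to handle it, sketched here under the assumptions that records end with a single \n and that a backslash directly before a quote always means the quote is escaped, is to combine a negative and a positive lookbehind. The function name join_broken_lines is hypothetical, and replacing with a space (rather than nothing) simply follows the example above.

```python
import re

# Match a newline that is either not preceded by a quote at all,
# or preceded by an escaped quote (backslash + quote).
BROKEN_NEWLINE = re.compile(r'(?:(?<!")|(?<=\\"))\n')

def join_broken_lines(text, replacement=" "):
    return BROKEN_NEWLINE.sub(replacement, text)

# Three physical lines belong to the first record; the newline after
# the unescaped closing quote must survive.
sample = '"one\\"\ntwo.\nthree"\n"four"'
print(join_broken_lines(sample))
```

Both lookbehinds are zero-width, so the character before the newline is checked but never consumed, which is what the [^"]\n+ attempt was missing.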

Parsing spanish family name

A Spanish family name consists of three parts:
The paternal name,
The optional maternal name,
The optional spouse's paternal name.
Each of these three parts is one single word that may be preceded by "De", "Del", "De La", "De Los" or "De Las". Each of these prefixes starts with a capital and there may be only one of them for each part. The spouse's paternal name is separated from the rest by the word "de" (no capital).
So valid family names would be:
Pérez
Pérez De León
López de López
De La Oca Ordóñez
Castillo Ramírez de Del Valle
I can parse these names with this regex:
^((?:De |Del |De La |De Los |De Las )?\w+)?( (?:De |Del |De La |De Los |De Las )?\w+)?( de (?:De |Del |De La |De Los |De Las )?\w+)?$
1.) Can this ugly regex be simplified?
2.) When the paternal name is the same as the maternal name, the word "y" is inserted between them. So "López y López de De León" and "Pérez y Pérez" are both valid, but "López y Pérez" and "Gómez y de Gómez" are not. How can I capture this case?
Thank you very much.
The exact answer depends on what programming language and/or regex engine you're using, but for most implementations, you should be able to do the following:
(1.) Make a separate regex that matches a single part of a name and then include this in the final regex, e.g., in Perl:
my $name1 = qr/(?:De |Del |De La |De Los |De Las )?\w+/;
my $name2 = qr/^($name1)( $name1)?( de $name1)?$/;
(I assume you don't want the ? after the first capture, as otherwise you'd match the empty string.) $name2 is then the regex to match against.
(2.) Strictly speaking, proper computer-theoretical regular expressions cannot test whether an arbitrary substring that appears at one point in the string also appears at another point. However, most regex implementations (e.g., Perl-compatible "regular expressions") actually support more features than a real regex engine would, so you could use a backreference like:
my $name2 = qr/^(?:($name1)( $name1)?|($name1) y \3)( de $name1)?$/;
In PCREs, the \3 matches the exact same string that the third (...) group matches. If you can't use backreferences for some reason, your only option is to use a regex like:
my $name2 = qr/^(?:($name1)( $name1)?|($name1) y ($name1))( de $name1)?$/;
and then, if $3 and $4 are defined after matching, test to see if they're equal or not. (Note that both of the above will allow names like "López López" without a "y"; if you want to prohibit those, it'll be a bit harder.)
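As a rough Python sketch of the backreference approach (not a drop-in translation of the Perl above): the names PART, NAME and parse are hypothetical, and the lookahead that stops "y" and "de" from being read as name words is my own addition, assumed necessary so that "Gómez y de Gómez" is rejected.

```python
import re

# One name part: an optional capitalised prefix, then a single word that
# is not one of the joining words "y" or "de".
PART = r"(?:De La |De Los |De Las |Del |De )?(?!(?:y|de)\b)\w+"

NAME = re.compile(
    rf"^(?:(?P<same>{PART}) y (?P=same)"                  # "X y X"
    rf"|(?P<paternal>{PART})(?: (?P<maternal>{PART}))?)"  # "X" or "X Z"
    rf"(?: de (?P<spouse>{PART}))?$"                      # optional " de W"
)

def parse(name):
    """Return (paternal, maternal, spouse) or None if the name is invalid."""
    m = NAME.match(name)
    if m is None:
        return None
    if m.group("same") is not None:
        # "X y X": paternal and maternal names are identical.
        return (m.group("same"), m.group("same"), m.group("spouse"))
    return (m.group("paternal"), m.group("maternal"), m.group("spouse"))

for candidate in ["Pérez", "López y López de De León", "López y Pérez"]:
    print(candidate, "->", parse(candidate))
```

Like the Perl versions, this still accepts "López López" without a "y".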
Here's my attempt. It seems to work with the examples given:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Foo {
    public static void main(String[] args) throws Exception {
        System.out.println(new SpanishName("Pérez"));
        System.out.println(new SpanishName("Pérez De León"));
        System.out.println(new SpanishName("López de López"));
        System.out.println(new SpanishName("De La Oca Ordóñez"));
        System.out.println(new SpanishName("Castillo Ramírez de Del Valle"));
        System.out.println(new SpanishName("López y López de De León"));
        System.out.println(new SpanishName("Pérez y Pérez"));
        // System.out.println(new SpanishName("López y Pérez")); - Throws IAE
        // System.out.println(new SpanishName("Gómez y de Gómez")); - Throws IAE
    }
    public static class SpanishName {
        private final String paternal;
        private final String maternal;
        private final String spousePaternal;
        private static final Pattern NAME_REGEX = Pattern
                .compile("^([\\p{Ll}\\p{Lu}]+?)(?:\\s([\\p{Ll}\\p{Lu}]+?))?(?:\\s([\\p{Ll}\\p{Lu}]+?))?$");
        public SpanishName(String str) {
            str = stripJoinWords(str);
            str = removeYJoin(str);
            final Matcher matcher = NAME_REGEX.matcher(str);
            // Any " y " left after removeYJoin means the two names differed
            if (str.contains(" y ") || !matcher.matches()) {
                throw new IllegalArgumentException(String.format("'%s' is not a valid Spanish name", str));
            } else {
                paternal = matcher.group(1);
                maternal = matcher.group(2);
                spousePaternal = matcher.group(3);
            }
        }
        private String removeYJoin(final String str) {
            return str.replaceFirst("^([\\p{Ll}\\p{Lu}]+?) y \\1", "$1 $1");
        }
        private String stripJoinWords(final String str) {
            return str.replaceAll("(?<!\\sy\\s)[Dd]e(?:l| La| Los| Las)?\\s", "");
        }
        @Override
        public String toString() {
            return String.format("paternal = %s, maternal = %s, spousePaternal = %s", paternal, maternal,
                    spousePaternal);
        }
    }
}
Rather than using a regex, there's a service which does a pretty amazing job at this: https://www.nameapi.org/en/demos/name-parser/. It's open source, but instead of using regex it gathers data from phone books as well as a pretty sophisticated set of rules.