Google's text-to-speech (WaveNet) quality degrades with long texts - google-cloud-platform

Using the API with the Swedish voice sv-SE-Wavenet-A, it seems that the quality of the audio degrades with longer texts.
Short text:
Det ter sig logiskt att man gått över till tvångsfinansiering av en
kanal som under året alltså tappade sex procent av tittartiden. Till
slut kommer ingen titta, men alla kommer ändå tvingas betala.
Long text (bold = short text from above):
SVT backade sex procent - endast en tredjedel tittas - tvingas betala
ändå Preliminära siffror från mätföretaget MMS visar på att
vuxendagiset SVT tappade sex procent av sin tittartid under 2018. Nu
tittas det på endast en dryg tredjedel av tiden på SVT, men alla i
Sverige tvingas ändå betala sedan årsskiftet. SVT. SVT:s tittarsiffror
tappade till 34.9% i så kallad tittartidsandel. Det tvångsfinansierade
vuxendagiset har alltså bara en dryg tredjedel av tittartiden, men
samtliga med inkomst i Sverige måste likväl betala för detta.
Siffrorna från MMS är preliminära och SVT ska ha 34.9% av tittartiden,
TV4-gruppen 31.9%, Discovery Networks-gruppen 11.9%, och Nordic
Entertainment Group 11.6%. Discovery inkluderar Kanal 5 och Nordic
Entertaingment TV3. Det ter sig logiskt att man gått över till
tvångsfinansiering av en kanal som under året alltså tappade sex
procent av tittartiden. Till slut kommer ingen titta, men alla kommer
ändå tvingas betala. Socialism baserar sig på tvång när folk inte
frivilligt gör det som socialisterna vill åstakomma. Det är en ren
skam att de borgerliga partierna var med och drev igenom
tvångsfinansieringen av det konsekvenslösa vuxendagiset. Lämplig
åtgärd är att istället koda SVT, så får de som vill betala för detta
göra det och övriga slipper. Så kan också SVT falla bort i glömskan.
Tills detta sker kommer förstås bloggen bevaka SVT:s felsteg, men kom
ihåg att anmälningar till granskningsnämnden ej ska göras då det
legitimerar ett sjukt och helt konsekvenslöst meningslöst system. SVT
är ett aktiebolag, som besitter beskattningsrätt av svenska folket.
Nedanstående kommentarer är inte en del av det redaktionella
innehållet och användare ansvarar själva för sina kommentarer. Se även
kommentarsreglerna, inklusive listan med kommentatorer som automatiskt
kommer raderas på grund av brott mot dessa. Genom att kommentera
samtycker du till att din kommentar, tidsstämpel, profillänk och
pseudonym sparas av Googles Blogger-system så länge det är relevant,
dvs så länge blogginlägget är publicerat.
API Request
const textToSpeech = require('@google-cloud/text-to-speech')
const client = new textToSpeech.TextToSpeechClient()
client.synthesizeSpeech({
  input: { text },
  voice: {
    languageCode: 'sv-SE',
    ssmlGender: 'FEMALE',
    name: 'sv-SE-Wavenet-A',
  },
  audioConfig: {
    audioEncoding: 'MP3',
  },
})
Results from the API
Short text audio
Long text audio
Audio comparison
The audio comparison first plays the result I got when sending the short text. It then plays the same text, but cut out from the result I got when sending the long text. Finally, it plays them both together.
Is this a bug or expected? I haven't noticed any degradation of quality at all when using the en-US or en-GB voices.
I noticed that the Swedish voice uses a different naturalSampleRateHertz than all the other voices, perhaps that might cause this?

This is probably related more to using MP3 as the encoding format than to any sample-rate difference between languages. Since MP3 is a lossy format, some quality loss is expected; the differences between the short file and the longer file are probably down to the MP3 encoding algorithm being used.
I have checked the Speech Synthesis API on my side, and the "sv-SE-Wavenet-A" voice uses a naturalSampleRateHertz of 24000, like every WaveNet voice I have checked (all the en-US-Wavenet voices are at 24000 as well).
I would recommend changing the audioEncoding flag to another encoding format, for example "OGG_OPUS", which will yield better audio quality:
audioConfig: {
  audioEncoding: 'OGG_OPUS',
},
If the MP3 format is a must, you can do the conversion on your side instead, choosing whichever parameters you deem convenient for your MP3 encoding to ensure maximum audio quality while the audio file stays compressed.
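If MP3 output is ultimately needed, one option is to request OGG_OPUS and re-encode locally with parameters you control. A minimal sketch, assuming ffmpeg is installed; the file names and bitrate are placeholders:

```python
def mp3_encode_command(src, dst, bitrate="192k"):
    """Build an ffmpeg argv list that re-encodes `src` to MP3 at `bitrate`.

    ffmpeg is assumed to be installed; file names are hypothetical.
    """
    return ["ffmpeg", "-y", "-i", src,
            "-codec:a", "libmp3lame", "-b:a", bitrate, dst]

cmd = mp3_encode_command("speech.ogg", "speech.mp3")
print(" ".join(cmd))
```

Running the command with subprocess.run(cmd, check=True) then produces an MP3 whose encoder settings you chose yourself.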

Related

Expanding abbreviations using regex

I have a dictionary of abbreviations that I would like to expand. I would like to use it to go through a text and expand all abbreviations.
The defined dictionary is as follows:
contractions_dict = {
    "kl\.": "klokken",
}
The text I wish to expand is as follows:
text = 'Gl. Syd- og Sønderjyllands Politi er måske kl. 18 kløjes landets mest aktive politikreds på Twitter med over 27.000, som følger med.'
I use the following function:
def expand_contractions(s, contractions_dict, contractions_re):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)
contractions_re = re.compile("(%s)"%"|".join(contractions_dict.keys()))
text = expand_contractions(text, contractions_dict, contractions_re)
print(text)
I have tried a range of different keys in the dictionary to capture the abbreviations, but nothing has worked. Any suggestions?
Try:
import re
contractions_dict = {
    "kl.": "klokken",
}
# Group the alternation so \b applies to every key; re.escape keeps the dot literal.
pat = re.compile(r'\b(?:%s)' % '|'.join(re.escape(k) for k in contractions_dict))
text = "Gl. Syd- og Sønderjyllands Politi er måske kl. 18 kløjes landets mest aktive politikreds på Twitter med over 27.000, som følger med."
text = pat.sub(lambda g: contractions_dict[g.group(0)], text)
print(text)
Prints:
Gl. Syd- og Sønderjyllands Politi er måske klokken 18 kløjes landets mest aktive politikreds på Twitter med over 27.000, som følger med.
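A note for dictionaries with more than one entry: a leading \b binds only to the first alternative unless the alternation is wrapped in a group. A sketch along the same lines (the second entry, "ca." → "cirka", is a hypothetical addition for illustration):

```python
import re

contractions_dict = {
    "kl.": "klokken",
    "ca.": "cirka",  # hypothetical extra entry, added for illustration
}

# (?:...) makes \b apply to every alternative; re.escape keeps the dots literal.
pat = re.compile(r"\b(?:%s)" % "|".join(re.escape(k) for k in contractions_dict))

text = "Toget ankommer ca. kl. 18."
print(pat.sub(lambda m: contractions_dict[m.group(0)], text))  # → Toget ankommer cirka klokken 18.
```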

RegEx - Extract string from character 91 up to character 180 and delete everything before and after

I am trying to extract character 91 to 180 from this text:
Exosphere -6° Reg. fra Deuter er den perfekte sovepose til dig, der har det med at stritte med arme og ben, når du sover, og føler dig lidt hæmmet i en almindelig mumiesovepose. Den er nemlig fuld af elastikker, som tillader soveposen at blive op til 25% bredere, end den umiddelbart ser ud til at være.
So that the output will look like this:
itte med arme og ben, når du sover, og føler dig lidt hæmmet i en almindelig mumiesovepose
I am using this expression, which I found here on SO (REGEX to trim a string after 180 characters and before |):
Replace
^([^|]{91,180})[^|]+(.*)$
with
\1\2
It is doing some of the job; this is the output:
Exosphere -6° Reg. fra Deuter er den perfekte sovepose til dig, der har det med at stritte med arme og ben, når du sover, og føler dig lidt hæmmet i en almindelig mumiesovepose
So now I need to remove everything before character 91.
The point here is that you need to match the first 90 chars, then match and capture another 90 chars into Group 1, and then just match the rest of the string, then replace with a backreference to Group 1 value.
You may use
^[\s\S]{90}([\s\S]{90})[\s\S]*
Or, if there are no line breaks, the more "regular" pattern
^.{90}(.{90}).*
In both cases, replace with $1.
See the regex demo
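When the task does not have to stay inside a regex replace tool, plain slicing does the same job; a sketch comparing it with the regex approach above (using a synthetic string so the boundaries are obvious):

```python
import re

# Characters 91..180 (1-indexed) are the 0-indexed slice [90:180].
def middle_slice(s):
    return s[90:180]

# The same extraction with the regex replacement from the answer above.
def middle_regex(s):
    return re.sub(r'(?s)^.{90}(.{90}).*', r'\1', s)

# Synthetic string: 90 a's, 90 b's, 40 c's.
s = "a" * 90 + "b" * 90 + "c" * 40
print(middle_slice(s) == middle_regex(s))  # → True
```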

utf8 encoding lost after tokenising using NLTK and Python 2.7

I ran into an interesting problem when attempting to tokenise a French text using nltk (version 3.1, Python 2.7). Here is the code snippet I am using.
#python 2.7
from __future__ import division
import nltk, re, pprint
f = open('test.txt')
raw = f.read()
print "got it!"
print type(raw)
ucoderaw = raw.decode('utf-8')
print ucoderaw
tokens = nltk.word_tokenize(ucoderaw)
print type(tokens)
words = [w.lower() for w in tokens]
print type(words)
vocab = sorted(set(words))
print "Tokens"
The document contains a French text:
J'ai lieu de croire que Mr. de Voltaire ne sera pas fâché de voir que
son Manuscrit, qu'il a intitulé Abrégé de l'Histoire Universelle
depuis Charlemagne jusqu'à Charles-Quint, et qu'il dit être entre les
mains de trente Particuliers, soit tombé entre les miennes. Il sait
qu'il m'en avait flatté dès l'année 1742, à l'occasion de son Siècle
de Louis XIV, auquel je ne renonçai en 1750, que parce qu'il me dit
alors à Postdam, où j'étais, qu'il l'imprimait lui-même à ses propres
dépens. Ainsi il ne s'agit ici que de dire comment cet Abrégé m'est
tombé entre les mains, le voici.
À mon retour de Paris, en Juin de cette année 1753, je m'arrêtai à
Bruxelles, où j'eus l'honneur de voir une Personne de mérite, qui en
étant le possesseur me le fit voir, et m'en fit aussi tout l'éloge
imaginable, de même que l'histoire du Manuscrit, et de tout ce qui
s'était passé à l'occasion d'un Avertissement qui se trouve inséré
dans le
second Volume du mois de Juin 1752 du Mercure de France, et répété dans l'Épilogueur du 31 Juillet de la même année, avec la Réponse
que l'on y a faite, et qui se trouve dans le même Épilogueur du 7
Août suivant: toutes choses inutiles à relever ici, mais qui m'ont
ensuite déterminé à acheter des mains de ce Galant-Homme le Manuscrit
après avoir été offert à l'Auteur, bien persuadé d'ailleurs qu'il
était effectivement de Mr. de Voltaire; son génie, son style, et
surtout son orthographe s'y trouvant partout. J'ai changé cette
dernière, parce qu'il est notoire que le Public a toutes les peines du
monde à s'y accoutumer; et c'est ce que l'Auteur est prié de vouloir
bien excuser.[1]
Je dois encore faire remarquer que par la dernière période de ce
Livre, il paraît qu'elle fait la clôture de cet Abrégé, qui finit à
Charles VII Roi de France, au lieu que l'Auteur la promet par son Titre jusqu'à l'Empereur Charles-Quint. Ainsi il est à présumer que
ce qui devrait suivre, est cette partie différente d'Histoire qui
concerne les Arts, qu'il serait à souhaiter que Mr. de Voltaire
retrouvât, ou, pour mieux dire, qu'il voulût bien refaire, et la
pousser jusqu'au Siècle de Louis XIV, afin de remplir son plan, et
de nous donner ainsi une suite d'Histoire qui ferait grand plaisir au
Public et aux Libraires.
When I attempt to tokenise that text using tokens = nltk.word_tokenize(ucoderaw)
and then subsequently print out the tokens using sorted(set(words))
I get output with broken utf-8 encoding:
u'autant', u'author', u'autres', u'aux', u'available', u'avait',
u'avant', u'avec', u'avertissement', u'avoir', u'avons', u'away',
u'ayant', u'barbare', u'beaucoup', u'biblioth\xe8que', u'bien',
u'bnf/gallica', u'bornais', u'bruxelles', u'but', u'by', u"c'est",
u'capet', u'carri\xe8re', u'ce', u'cela', u'ces', u'cet', u'cette',
u'ceux', u'chang\xe9', u'chaos', u'character', u'charger',
u'charlemagne', u'charlequint', u'charles-quint_', u'chartes',
u'chez', u'chine', u'choses', u'chronologie', u'chronologiques',
u'chr\xe9tienne', u'cl\xf4ture'
where the correct output should include accents i.e. bibliothèque and not biblioth\xe8que
I've been trying to figure out how to fix this, short of saving the output to a file and writing another program to replace \xe8 with è and so on and so forth.
Is there a simpler method?
EDIT: Not the cleanest solution, but I found that by tokenising and then saving the output to a file with the correct encoding I do (largely) get the output required:
#python 2.7
#-*- coding: utf-8 -*-
from __future__ import division
import nltk, re, pprint

f = open('test.txt')
raw = f.read()
print "got it!"
#print raw
print type(raw)
#decode from utf-8 to unicode before moving on.
ucoderaw = raw.decode('utf-8')
tokens = nltk.word_tokenize(ucoderaw)
print type(tokens)
words = [w.lower() for w in tokens]
print type(words)
vocab = sorted(set(words))
print "encoded raw input is"
print ucoderaw
# GET TOKENS
print vocab
#write to file with the correct encoding to "fix" the problem
output_file = open('output.txt', 'w')
print len(vocab)
for word in vocab:
    output_file.write(word.encode('utf-8') + "\n")
Firstly, see http://nedbatchelder.com/text/unipain.html
Then try to use Python3 instead of Python2 for text processing, it makes your life a lot easier =)
Finally, regardless of Py3 or Py2, using io.open instead of open is a good practice so that your code works across both Py3 and Py2:
import io
from collections import Counter
import nltk

# Try to open files within a context, see
# https://www.python.org/dev/peps/pep-0343/ and
# http://effbot.org/zone/python-with-statement.htm
with io.open('test.txt', 'r', encoding='utf8') as fin:
    word_counts = Counter(nltk.word_tokenize(fin.read()))

# List of top most common words.
print(word_counts.most_common())
# Sorted by counts
print(word_counts.most_common(len(word_counts)))
# Sorted alphabetically
print(sorted(word_counts.keys()))
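For comparison, here is a Python 3 sketch of the same vocabulary build with no decode step at all; io.StringIO stands in for an io.open() file handle, and str.split stands in for nltk.word_tokenize so the snippet has no external dependency:

```python
import io

# French sample text; in the real script this would come from
# io.open('test.txt', encoding='utf8') instead of StringIO.
raw = "J'ai changé cette bibliothèque, dès l'année 1742."
fin = io.StringIO(raw)

# str.split is a crude stand-in for nltk.word_tokenize here.
words = [w.lower().strip(".,") for w in fin.read().split()]
vocab = sorted(set(words))
print(vocab)  # accented strings print as-is on Python 3
```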

String substitution in the shell for LaTeX

I'm trying to adapt an article to include it in a LaTeX document, and I'm using sed to substitute characters. However, I'm running into a problem with some symbols, like quotation marks. For example, take this paragraph:
Los problemas de Europa no son los mismos en todos los países. "Alemania no está creciendo rápidamente, pero consiguió evitar la recaída en la recesión", dice Guillen. "En Irlanda, por ejemplo, la economía cayó un 20%. En Francia, la situación no es desesperada, pero el país tampoco es ninguna Alemania". Mientras, Italia y España han vuelto a caer en la recesión, y Reino Unido acaba de anunciar que está nuevamente en recesión.
The " symbol (double quote in a single character) should be changed to a `` if it's found at the beginning of a word but to a '' (they are 2 characters: \x27\x27) if it's at the end. So the resulting paragraph should be (the % sysmbol was also changed):
Los problemas de Europa no son los mismos en todos los países. ``Alemania no está creciendo rápidamente, pero consiguió evitar la recaída en la recesión'', dice Guillen. ``En Irlanda, por ejemplo, la economía cayó un 20\%. En Francia, la situación no es desesperada, pero el país tampoco es ninguna Alemania''. Mientras, Italia y España han vuelto a caer en la recesión, y Reino Unido acaba de anunciar que está nuevamente en recesión.
I imagine that a regexp combining a space and a word should match the beginning case, with a similar one for the end, but I don't know how to write it.
You could check that the " is either at the beginning of a line or preceded by a whitespace:
sed -r 's/(^| )"/\1``/g' filename
If your version of sed doesn't support extended regular expressions, you could say:
sed 's/\(^\| \)"/\1``/g' filename
For escaping % and possibly other characters like &, $, you could make use of a character class to escape all those in one go:
sed -r 's/([$%])/\\\1/g' filename
The two could be combined too:
sed -r 's/(^| )"/\1``/g; s/([$%])/\\\1/g' filename
EDIT: From your clarification, it seems that you need to say:
sed -r 's/(^| )"/\1``/g;s/"/'"''"'/g' filename
This awk should change the " to `` if it is at the beginning of a word.
awk '{for (i=1;i<=NF;i++) if ($i~/^"/) sub(/"/,"``",$i)}1' file
Los problemas de Europa no son los mismos en todos los países. ``Alemania no está creciendo rápidamente, pero consiguió evitar la recaída en la recesión", dice Guillen. ``En Irlanda, por ejemplo, la economía cayó un 20%. En Francia, la situación no es desesperada, pero el país tampoco es ninguna Alemania". Mientras, Italia y España han vuelto a caer en la recesión, y Reino Unido acaba de anunciar que está nuevamente en recesión.
It tests the words one by one to see whether they start with ", and changes them if so.
For POSIX sed (backticks escaped so the shell does not treat them as command substitution):
sed "s/^\"/\`\`/;s/ \"/ \`\`/g;s/\"\$/''/;s/\" /'' /g" YourFile > TempFile
mv TempFile YourFile
For the GNU sed version (no test machine here to validate):
sed -r "s/(^| )\"/\1\`\`/g;s/\"( |\$)/''\1/g" YourFile

Regex to match linebreaks not preceded by non-escaped quotes in text file

I have a text file where strings are enclosed by quotes " " and any contained quotes are escaped by \. I want to remove any line breaks (\n) in the text, as long as they are not preceded by an un-escaped quote ("), since that marks the end of a line.
Here's an example:
"tre miljarder på att modernisera snabbtågen.\"
Dagens mest ironiska nyhet.,Väntar på att alla Summerburst-uppdateringar snart ska dö ut så min ångest kan släppa och jag kan återgå till ett normalt liv.,RT #mapeone: En till hashtag på Facebook och jag badar naken i grisblod.,Dagens biologiska lektion och psykologiska reflektion.
Så förlorade fåglarna sina penisar - DN.SE http://t.co/PFaseQMt8B,Hahaha \"#Aliceyouknow: Hah ironiskt att jag för exakt ett år sen ville gräva ner mig lika mycket som jag vill nu med.\" #livet,Det är bara kvinnor som på riktigt förstår paniken i om Zlatans hår skulle försvinna. #ikon,#nellie_lind ah han har ju rakat sidorna, snart ryker väl hela skiten,Alltså Zlatan ge fan i att mecka med håret.,Jag har ett jobb. Hur tungt är inte det. #tungt"
The regex pattern I've come up with so far looks like this:
[^"]\n+
But it also matches the character before the \n, e.g. the quote at the end of "snabbtågen.\" on line 1 and the dot (.) after "reflektion" on line 2.
I want it to match a \n preceded by anything other than a non-escaped ", but without including the preceding character in the match. How can that be done?
You should use a negative lookbehind assertion:
>>> print s
'first line'
'hello world
again'
>>> s2 = re.sub(r"(?<!')\s+", " ", s)
>>> print s2
'first line'
'hello world again'
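Applied to the question's actual delimiter (a double quote, where \" is escaped), the lookbehind approach can be sketched like this; the inner lookbehind makes sure an escaped quote does not count as a line end:

```python
import re

# Negative lookbehind: keep \n only when the previous character is a
# quote that is itself NOT escaped by a backslash. Nested lookbehinds
# are allowed here because each one is fixed-width.
line_break = re.compile(r'(?<!(?<!\\)")\n+')

def unwrap(text):
    # Replace the matched line breaks with a single space.
    return line_break.sub(' ', text)

print(unwrap('klipp\nhär'))           # newline removed: 'klipp här'
print(unwrap('slut på raden."\nny'))  # newline kept (un-escaped " before it)
```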