Got an interesting one and can't come up with any solid ideas, so I thought someone else may have done something similar.
I want to be able to identify strings of letters in a longer sentence that are not words and remove them. Essentially things like kuashdixbkjshakd
Annoyingly, everything is in lowercase, which makes it more difficult. Since I only care about English, I'm essentially looking for the opposite of valid consonant clusters: groups of consonants that don't form phonetically pronounceable sounds.
Has anyone heard of/done something like this before?
EDIT: this is what ChatGPT tells me:
It is difficult to provide a comprehensive list of combinations of consonants that have never appeared in a word in the English language. The English language is a dynamic and evolving language, and new words are being created all the time. Additionally, there are many regional and dialectal variations of the language, which can result in different sets of words being used in different parts of the world.
It is also worth noting that the frequency of use of a particular combination of consonants in the English language is difficult to quantify, as the existing literature on the subject is limited. The best way to determine the frequency of use of a particular combination of consonants would be to analyze a large corpus of written or spoken English.
In general, most combinations of consonants are used in some words in the English language, but some combinations of consonants may be relatively rare. Some examples of relatively rare combinations of consonants in English include "xh", "xw", "ckq", and "cqu". However, it is still possible that some words with these combinations of consonants exist.
You could try passing every word in the sentence to a function that checks whether the word is listed in a dictionary. There are a good number of dictionary text files on GitHub. To speed up the lookups, load the dictionary into a hash map :)
You could also use an auto-correction API or library.
Algorithm to combine both methods:
Run sentence through auto correction
Run every word through dictionary
Delete words that aren't listed in the dictionary
This could remove typos and words that are non-existent.
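A minimal sketch of the dictionary half of this in Python, assuming a newline-delimited word list saved as words.txt (a placeholder for whichever GitHub dictionary file you pick); the auto-correction step is left out since no particular API was named:

import re

# Load the word list into a set - a hash-based structure,
# so each membership test is O(1) on average.
with open('words.txt') as f:
    dictionary = set(line.strip().lower() for line in f)

def strip_non_words(sentence):
    # Keep only tokens that appear in the dictionary.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return ' '.join(t for t in tokens if t in dictionary)

print(strip_non_words("the cat kuashdixbkjshakd sat on the mat"))
# -> the cat sat on the mat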
You could train a simple model on sequences of characters which are permitted in the language(s) you want to support, and then flag any which contain sequences which are not in the training data.
The LangId language detector in SpamAssassin implements the Cavnar & Trenkle language-identification algorithm which basically uses a sliding window over the text and examines the adjacent 1 to 5 characters at each position. So from the training data "abracadabra" you would get
a 5
ab 2
abr 2
abra 2
abrac 1
b 2
br 2
bra 2
brac 1
braca 1
:
With enough data, you could build a model which identifies unusual patterns (my suggestion would be to try a window size of 3 or smaller for a start, and train it on several human languages from, say, Wikipedia), but it's hard to predict exactly how precise this will be.
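To make the idea concrete, here is a rough Python sketch of the counting step (not SpamAssassin's actual code), plus a simple scorer that flags a word by the fraction of its 3-grams that never occur in the training profile:

from collections import Counter

def ngram_counts(text, max_n=5):
    # Slide a window of every width from 1 to max_n over the text
    # and count each substring, as in the Cavnar & Trenkle profile.
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

profile = ngram_counts("abracadabra")
print(profile['a'], profile['ab'], profile['abra'])  # 5 2 2

def unseen_fraction(word, profile, n=3):
    # Fraction of the word's n-grams never seen in training;
    # values near 1 suggest gibberish.
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    if not grams:
        return 0.0
    return sum(1 for g in grams if g not in profile) / len(grams)

In practice you would build the profile from a large corpus rather than a single word, and pick the flagging threshold empirically.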
SpamAssassin is written in Perl and it should not be hard to extract the language identification module.
As an alternative, there is a library called libtextcat which you can run standalone from C code if you like. The language identification in LibreOffice uses a fork which they adapted to use Unicode specifically, I believe (though it's been a while since I last looked at that).
Following Cavnar & Trenkle, all of these truncate the collected data to a few hundred patterns; you would probably want to extend this to cover at least all the 3-grams found in your training data.
Perhaps see also Gertjan van Noord's link collection: https://www.let.rug.nl/vannoord/TextCat/
Depending on your test data, you could still get false positives, e.g. on peculiar Internet domain names and long abbreviations. Tweak the limits for what you want to flag - I would think that GmbH should be okay even if you didn't train on German, but anything 7 or more letters long should probably be flagged and manually inspected.
This will match words containing more than 5 consecutive consonants (you probably don't want "y" to be counted as a consonant, but that's up to you):
\b[a-z]*[b-z&&[^aeiouy]]{6}[a-z]*\b
The threshold of 5 was chosen because I believe witchcraft, with five consonants in a row, has the longest consonant chain of any English word. You could dial the "6" in the regex back to 5 or even 4 if you don't mind matching some outliers.
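For reference, a short Python version; note that Python's re module has no character-class intersection, so the Java-style [b-z&&[^aeiouy]] has to be spelled out as an explicit consonant class:

import re

# Six or more consecutive consonants ("y" intentionally excluded).
CONSONANT_RUN = re.compile(r"\b[a-z]*[bcdfghjklmnpqrstvwxz]{6}[a-z]*\b")

for word in ["witchcraft", "kuashdixbkjshakd", "strengths"]:
    print(word, bool(CONSONANT_RUN.search(word)))
# witchcraft False, kuashdixbkjshakd True, strengths False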
I'm using bookdown to write a paper. The knitted Word file automatically assigns numbers to each heading, to form chapters. However, I don't want these, as they're sections of a paper rather than individual chapters.
I have found that if I include {-} next to a heading, it isn't given a number in the Word output. However, the figures are still captioned as if they're numbered by chapter: instead of "Fig 1.2" they become "Fig 0.2", when I actually just want "Fig 2". Does anyone know how to stop it from doing this?
How can I tell SwiftUI that when text wraps, I'd like all the lines to be as close to equal length as possible?
For example, I don't want this:
The quick brown fox jumps over the
lazy dog
Even if there is enough horizontal space to fit everything except "lazy dog" on the first line, I want this instead (or whatever gives the most equal line lengths for the font in use):
The quick brown fox
jumps over the lazy dog
SkParagraph automatically compensates for "ghost" whitespace when shaping a paragraph. I'd like to disable this behaviour and allow a line to be pushed out when trailing whitespace is introduced.
Center alignment with current behaviour:
The quick brown fox
🦊 ate a zesty
hamburgerfons 🍔.
The 👩‍👩‍👧‍👧 laughed.
Now adding loads of spaces after zesty: (desired behaviour)
The quick brown fox
🦊 ate a zesty
hamburgerfons 🍔.
The 👩‍👩‍👧‍👧 laughed.
Notice that the second line is pushed to the left due to all the extra whitespace.
I've modified this CanvasKit fiddle to illustrate. See line 40.
I've also found this Flutter issue that demonstrates the same problem.
I've gone through the Skia / SkParagraph source code many times over and can't find a way to introduce the behaviour I need.
I'm having trouble using NLTK to generate random sentences from a custom corpus.
Before I start, I'd like to mention that I'm using NLTK version 2.x, so the generate function still exists.
Here is my current code:
import nltk

# Read the raw text and tokenize it into words
with open('corpus/romeo and juliet.txt', 'r') as f:
    raw = f.read()

tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
text.generate(length=10)  # generate() prints its output itself and returns None
This runs, but does not create random sentences (I'm going for a horse_ebooks vibe). Instead, it gives me the first 10 words of my corpus every time.
However, if I use NLTK's brown corpus, I get the desired random effect.
text = nltk.Text(nltk.corpus.brown.words())
text.generate(length=10)
Looking at the Brown corpus files, it seems that every word is separated and tagged as a verb, adjective, etc. - something I thought the word_tokenize call in my first block of code would take care of.
Is there a way to generate a corpus like the Brown example - even if it means converting my txt documents into that fashion instead of reading them directly?
Any help would be appreciated - the documents on this are either horribly outdated or just say to use Markov chains (which I have tried, but I want to figure this out!). I understand generate() was removed as of NLTK 3.0 because of bugs.
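Since generate() is gone in NLTK 3, one way to get the same horse_ebooks effect is to build a small trigram Markov model yourself. This is only a sketch in plain Python (it uses nltk.trigrams, not the removed NgramModel API), reusing the tokens variable from the code above:

import random
from collections import defaultdict
import nltk

def build_trigram_model(tokens):
    # Map each pair of consecutive words to every word that
    # followed it; duplicates are kept so that sampling
    # reflects the observed frequencies.
    model = defaultdict(list)
    for w1, w2, w3 in nltk.trigrams(tokens):
        model[(w1, w2)].append(w3)
    return model

def generate_text(model, length=10):
    # Start from a random bigram and walk the chain.
    w1, w2 = random.choice(list(model.keys()))
    out = [w1, w2]
    while len(out) < length:
        followers = model.get((w1, w2))
        if not followers:
            break
        w1, w2 = w2, random.choice(followers)
        out.append(w2)
    return ' '.join(out)

print(generate_text(build_trigram_model(tokens)))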