What is a permuted index? - c++

I am reading Accelerated C++. I don't understand Exercise 5-1:
Design and implement a program to produce a permuted index. A permuted index is one in which each phrase is indexed by every word in the phrase. So, given the following input,
The quick brown fox
jumped over the fence
the output would be:
       The quick   brown fox
 jumped over the   fence
 The quick brown   fox
                   jumped over the fence
          jumped   over the fence
             The   quick brown fox
     jumped over   the fence
                   The quick brown fox
That explanation isn't clear to me. What exactly is a permuted index?

The term permuted index is another name for a KWIC index, referring to the fact that it indexes all cyclic permutations of the headings. Books composed of many short sections with their own descriptive headings, most notably collections of manual pages, often ended with a permuted index section, allowing the reader to easily find a section by any word from its heading. This practice is no longer common.
From: http://en.wikipedia.org/wiki/Key_Word_in_Context
ps: you can access wikipedia via http://www.proxify.com

You can find a 'live' example of a permuted index in the 7th Edition UNIX™ Programmer's Reference Manual, Vol 1 (dating back to 1979). If you look up 'account' in it, you can find a number of related entries together. You probably wouldn't think to look for sa(1) as well as ac(1), not to mention acct(2) or acct(5), unless they were grouped together. This is the benefit of a permuted index: you can look up a key word and see it in a bigger context.
You could also look at the man page entry for the ptx(1) command in the same 7th Edition manual.

A permuted index is an alphabetical list of key words, each shown with its surrounding context. In the output below, observe the words that begin the right-hand column: they are sorted alphabetically and are surrounded by their context. This makes it easy to search for a word and infer its usage directly from the surrounding context, i.e. the rest of the phrase in your case.
       The quick   brown fox
 jumped over the   fence
 The quick brown   fox
                   jumped over the fence
          jumped   over the fence
             The   quick brown fox
     jumped over   the fence
                   The quick brown fox
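For intuition, here is a minimal sketch of one way to produce such an index (only an illustration, not the book's intended solution; the fixed column width is an arbitrary choice):

#include <algorithm>
#include <cctype>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// One index entry: the words before the indexed word, and the
// rotation of the phrase that starts at the indexed word.
struct Entry {
    std::string before;
    std::string after;
};

int main() {
    std::vector<std::string> phrases = {"The quick brown fox",
                                        "jumped over the fence"};

    std::vector<Entry> entries;
    for (const std::string& phrase : phrases) {
        // Split the phrase into words.
        std::istringstream in(phrase);
        std::vector<std::string> words;
        for (std::string w; in >> w;) words.push_back(w);

        // Index the phrase once for every word in it.
        for (std::size_t i = 0; i != words.size(); ++i) {
            Entry e;
            for (std::size_t j = 0; j != words.size(); ++j) {
                std::string& part = (j < i) ? e.before : e.after;
                if (!part.empty()) part += ' ';
                part += words[j];
            }
            entries.push_back(e);
        }
    }

    // Sort by the indexed word, ignoring case so "the" and "The" sort together.
    auto lower = [](std::string s) {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        return s;
    };
    std::sort(entries.begin(), entries.end(),
              [&](const Entry& a, const Entry& b) {
                  return lower(a.after) < lower(b.after);
              });

    // Print two aligned columns: context on the left, indexed rotation on the right.
    for (const Entry& e : entries)
        std::cout << std::setw(16) << e.before << "   " << e.after << '\n';
}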

Related

Regex - How can you identify strings which are not words?

Got an interesting one, and can't come up with any solid ideas, so thought maybe someone else may have done something similar.
I want to be able to identify strings of letters in a longer sentence that are not words and remove them. Essentially things like kuashdixbkjshakd
Everything, annoyingly, is in lowercase, which makes it more difficult, but since I only care about English, I'm essentially looking for the opposite of valid consonant clusters: groups of consonants that don't make phonetically pronounceable sounds.
Has anyone heard of/done something like this before?
EDIT: this is what ChatGPT tells me:
It is difficult to provide a comprehensive list of combinations of consonants that have never appeared in a word in the English language. The English language is a dynamic and evolving language, and new words are being created all the time. Additionally, there are many regional and dialectal variations of the language, which can result in different sets of words being used in different parts of the world.
It is also worth noting that the frequency of use of a particular combination of consonants in the English language is difficult to quantify, as the existing literature on the subject is limited. The best way to determine the frequency of use of a particular combination of consonants would be to analyze a large corpus of written or spoken English.
In general, most combinations of consonants are used in some words in the English language, but some combinations of consonants may be relatively rare. Some examples of relatively rare combinations of consonants in English include "xh", "xw", "ckq", and "cqu". However, it is still possible that some words with these combinations of consonants exist.
You could try passing every single word in the sentence to a function that checks whether the word is listed in a dictionary. There are a good number of dictionary text files on GitHub. To speed up the process, use a hash map :)
You could also use an auto-correction API or library.
Algorithm to combine both methods:
Run sentence through auto correction
Run every word through dictionary
Delete words that aren't listed in the dictionary
This could remove typos and words that are non-existent.
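A minimal sketch of the dictionary-lookup step (the file name words.txt is just a placeholder for whatever word list you download):

#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_set>

int main() {
    // Load a one-word-per-line dictionary into a hash set for fast lookups.
    std::unordered_set<std::string> dictionary;
    std::ifstream dict("words.txt");
    for (std::string w; std::getline(dict, w);) {
        std::transform(w.begin(), w.end(), w.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        dictionary.insert(w);
    }

    // Keep only the tokens that appear in the dictionary
    // (the input is already lowercase, as described in the question).
    std::string sentence = "the quick brown kuashdixbkjshakd fox";
    std::istringstream in(sentence);
    std::string cleaned;
    for (std::string token; in >> token;) {
        if (dictionary.count(token)) {
            if (!cleaned.empty()) cleaned += ' ';
            cleaned += token;
        }
    }
    std::cout << cleaned << '\n';  // gibberish tokens are dropped
}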
You could train a simple model on sequences of characters which are permitted in the language(s) you want to support, and then flag any which contain sequences which are not in the training data.
The LangId language detector in SpamAssassin implements the Cavnar & Trenkle language-identification algorithm which basically uses a sliding window over the text and examines the adjacent 1 to 5 characters at each position. So from the training data "abracadabra" you would get
a 5
ab 2
abr 2
abra 2
abrac 1
b 2
br 2
bra 2
brac 1
braca 1
:
With enough data, you could build a model which identifies unusual patterns (my suggestion would be to try a window size of 3 or smaller for a start, and train it on several human languages from, say, Wikipedia), but it's hard to predict exactly how precise this will be.
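A rough sketch of that counting and flagging with a window of 3 characters (the tiny training string is only a placeholder; in practice you would train on a large corpus such as Wikipedia text):

#include <iostream>
#include <string>
#include <unordered_map>

// Count every substring of length 1..maxN in the training text,
// the same way the "abracadabra" counts above are produced.
std::unordered_map<std::string, int> train(const std::string& text, int maxN) {
    std::unordered_map<std::string, int> counts;
    for (std::size_t i = 0; i < text.size(); ++i)
        for (int n = 1; n <= maxN && i + n <= text.size(); ++n)
            ++counts[text.substr(i, n)];
    return counts;
}

// Flag a word if it contains any n-gram that never occurred in training.
bool looksUnusual(const std::string& word,
                  const std::unordered_map<std::string, int>& counts, int n) {
    if (word.size() < static_cast<std::size_t>(n)) return false;
    for (std::size_t i = 0; i + n <= word.size(); ++i)
        if (!counts.count(word.substr(i, n))) return true;
    return false;
}

int main() {
    auto counts = train("the quick brown fox jumped over the lazy dog", 3);
    for (std::string w : {"quick", "kuashdixbkjshakd"})
        std::cout << w << (looksUnusual(w, counts, 3) ? " -> flagged" : " -> ok") << '\n';
}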
SpamAssassin is written in Perl and it should not be hard to extract the language identification module.
As an alternative, there is a library called libtextcat which you can run standalone from C code if you like. The language identification in LibreOffice uses a fork which they adapted to use Unicode specifically, I believe (though it's been a while since I last looked at that).
Following Cavnar & Trenkle, all of these truncate the collected data to a few hundred patterns; you would probably want to extend this to cover at least all the 3-grams found in your training data.
Perhaps see also Gertjan van Noord's link collection: https://www.let.rug.nl/vannoord/TextCat/
Depending on your test data, you could still get false positives e.g. on peculiar Internet domain names and long abbreviations. Tweak the limits for what you want to flag - I would think that GmbH should be okay even if you didn't train on German, but something like 7 or more letters long should probably be flagged and manually inspected.
This will match words with more than 5 consonants (you probably want "y" to not be considered a consonant, but it's up to you):
\b[a-z]*[b-z&&[^aeiouy]]{6}[a-z]*\b
See live demo.
5 was chosen because I believe witchcraft has the longest chain of consonants of any English word. You could dial back "6" in the regex to say 5 or even 4 if you don't mind matching some outliers.
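That pattern uses Java's character-class intersection (&&), which std::regex does not support, so a C++ equivalent has to spell out the consonant set explicitly; a minimal sketch:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Same idea as the pattern above: flag any word containing a run of six
    // consonants (y is treated as a vowel here, as the answer suggests).
    std::regex suspicious(R"(\b[a-z]*[bcdfghjklmnpqrstvwxz]{6}[a-z]*\b)");

    std::string sentence = "the quick brown fox kuashdixbkjshakd jumped over the fence";
    for (std::sregex_iterator it(sentence.begin(), sentence.end(), suspicious), end;
         it != end; ++it)
        std::cout << it->str() << '\n';  // prints: kuashdixbkjshakd
}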

Removing automatic numbering of chapters and the insert of chapter number into figure captions in bookdown

I'm using bookdown to write a paper. The knitted Word file automatically assigns numbers to each heading, forming chapters. However, I don't want these, as they're sections of a paper rather than individual chapters.
I have found that if I include {-} next to a heading, it isn't given a number in the Word output. However, the figures are still captioned as if they're numbered by chapter: instead of being "Fig 1.2" they are "Fig 0.2", and I actually just want "Fig 2". Does anyone know how to stop it from doing this?

Evenly distribute wrapped text between lines in SwiftUI

How can I tell SwiftUI that when text wraps, I'd like all the lines to be as close to equal length as possible?
For example, I don't want this:
The quick brown fox jumps over the
lazy dog
Even if there is enough horizontal space to fit everything except "lazy dog" on the first line, I want this instead (or whatever gives the most equal line lengths for the font in use):
The quick brown fox
jumps over the lazy dog

Force SkParagraph layout to account for ghost whitespace

SkParagraph automatically compensates for "ghost whitespace" when shaping a paragraph. I'd like to disable this behaviour and allow the line to be pushed out when whitespace is introduced.
Center alignment with current behaviour:
The quick brown fox
   🦊 ate a zesty
 hamburgerfons 🍔.
 The 👩‍👩‍👧‍👧 laughed.
Now adding loads of spaces after zesty: (desired behaviour)
The quick brown fox
🦊 ate a zesty
 hamburgerfons 🍔.
 The 👩‍👩‍👧‍👧 laughed.
Notice second line pushed to the left due to all the extra whitespace.
I've modified this CanvasKit fiddle to illustrate. See line 40.
I've also found a Flutter issue that illustrates the same problem.
I've gone through the Skia / SkParagraph source code many times over and can't find a way to introduce the behaviour I need.

Generating sentences with NLTK and Python

I'm having trouble using NLTK to generate random sentences from a custom corpus.
Before I start, I'd like to mention that I'm using NLTK version 2.x, so the generate function still exists.
Here is my current code:
import nltk

# Read the custom corpus and tokenize it into words.
file = open('corpus/romeo and juliet.txt', 'r')
words = file.read()
tokens = nltk.word_tokenize(words)

# Wrap the tokens in an nltk.Text and generate 10 words from it.
text = nltk.Text(tokens)
print text.generate(length=10)
This runs, but does not create random sentences (I'm going for a horse_ebooks vibe). Instead, it returns me the first 10 words of my corpus source every time.
However, if I use NLTK's brown corpus, I get the desired random effect.
text = nltk.Text(nltk.corpus.brown.words())
print text.generate(length=10)
Going into the Brown corpus files, it seems as though every word is separated and tagged as a verb, adjective, etc. - something I thought would be handled by the word_tokenize function in my first block of code.
Is there a way to generate a corpus like the Brown example - even if it means converting my txt documents into that fashion instead of reading them directly?
Any help would be appreciated - any documents on this are either horribly outdated or just say to use Markov chains (which I have used, but I want to figure this out!). I understand generate() was removed as of NLTK 3.0 because of bugs.