subscript/superscript in Google Visualization API axes labels - google-visualization

How can I include subscripts/superscripts in labels with the Google Visualization API? I need to display a formula with subscripts in my charts.

You can't include markup in labels, so I think your only option is to use Unicode subscripts and superscripts. This should be no problem if you're just working with numbers — just replace the digits 0-9 with the corresponding characters from ⁰¹²³⁴⁵⁶⁷⁸⁹ and ₀₁₂₃₄₅₆₇₈₉.
Most of the alphabet characters are available too, except for q. This is because the team of Unicode experts who decide on these things felt that it was more important to make room for essential glyphs like 🚽 and 💩.
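For example, a small helper along these lines (a sketch; the digit map is the whole trick, and the CO₂ label is only an illustration) converts digits before you hand the label to the chart:

// Sketch: replace ASCII digits with Unicode subscript characters so the
// result can be used as a plain-text chart label.
var SUBSCRIPT_DIGITS = ['₀', '₁', '₂', '₃', '₄', '₅', '₆', '₇', '₈', '₉'];

function toSubscript(text) {
  return text.replace(/[0-9]/g, function (d) {
    return SUBSCRIPT_DIGITS[Number(d)];
  });
}

var axisTitle = 'CO' + toSubscript('2') + ' concentration';  // "CO₂ concentration"

The same idea works for superscripts with the ⁰¹²³⁴⁵⁶⁷⁸⁹ set.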

Related

google-speech-api and overriding phone number recognition

Does anyone know if there is a way to manipulate the recognition of phone numbers when using the Google Speech API? I am trying to implement a transcription scenario where a caller will say a string of letters and numbers, but the logic out of the box seems to be to try to fit any sequence of numbers to a phone number scheme, even if it means rendering letters into numbers they may sound vaguely similar to (or not). I have tried using speech contexts to manipulate the values within the "phone number" by typing out and giving the entire thing as it should be as a speech context ("eight seven seven two bee three seven", for example), but it refuses to override the digits being interpreted as a phone number. Has anyone encountered this issue or is aware of any way in which this could be worked around?
Thanks!
I'm not aware of an easy way to do this. For the Web Speech API for JavaScript, doing the following seems to yield fewer results that are forced into phone-number format:
Set the maxAlternatives = 2, e.g.,
var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
var recognition = new SpeechRecognition();
recognition.maxAlternatives = 2;
Then use the second result offered, e.g.,
const speechToText = event.results[0][1].transcript;
You can get pretty far by processing the result. A remaining challenge is that since the result often clumps digits together, you lose the distinction between a series of single digit numbers and one multi-digit number (e.g., '15' & '1', '5'). The utility of this approach depends on the specifics of the numbers your app is trying to capture.
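Putting those pieces together, a minimal onresult handler might look like the sketch below (the cleanup regex is only an illustration; the right post-processing depends on the strings your app expects):

recognition.onresult = function (event) {
  var result = event.results[0];
  // Prefer the second alternative when one exists; it is less likely to have
  // been collapsed into a phone-number format.
  var transcript = (result.length > 1 ? result[1] : result[0]).transcript;
  // Strip spaces and phone-number punctuation before comparing or storing,
  // so "877-2 B 37" and "8772B37" come out the same.
  var cleaned = transcript.replace(/[\s\-().]/g, '').toUpperCase();
  console.log(cleaned);
};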
In at least one case, setting the language to en-PH (English Philippines) seems to have fixed, or at least notably improved, this problem. Other English language options might work as well.
en-GB comes back as a UK-formatted number, where one digit is put first and then the rest of the number.

How to split emojis that contain flags without the flag breaking into 2 characters in Google Sheets

This is my initial string:
🇦🇺🏅🏉
I used a not so elegant way to break up the emojis.
=if(len(I88) = 4, REGEXEXTRACT(I88,"(.+?)\s*(.+?)"),if(len(I88) = 6, REGEXEXTRACT(I88,"(.+?)\s*(.+?)\s*(.+?)"),if(len(I88) = 8, REGEXEXTRACT(I88,"(.+?)\s*(.+?)\s*(.+?)\s*(.+?)"),if(len(I88) = 10, REGEXEXTRACT(I88,"(.+?)\s*(.+?)\s*(.+?)\s*(.+?)\s*(.+?)"), REGEXEXTRACT(I88,"\s*(.+?)" )))))
The result is 4 columns instead of 3: this is what it looks like
🇦 | 🇺 | 🏅 | 🏉
I left the pipes to indicate a separate column
What I want is this:
🇦🇺 | 🏅 | 🏉
Short answer
To correctly separate the three emoticons we need to use a custom function. Fortunately there are JavaScript libraries that can be used for this, like the one shared in Orlin Giorgiev's answer to Get grapheme character count in javascript strings?
Explanation
The OP's formula returns four elements instead of three because, to the Google Sheets built-in functions, the string contains four "characters" (actually code points): the flag 🇦🇺 is itself made up of two regional indicator code points. Each of these code points needs more than 4 hexadecimal digits to represent, and such code points are called "astral code points".
From https://mathiasbynens.be/notes/javascript-unicode
Astral code points are pretty easy to recognize: if you need more than 4 hexadecimal digits to represent the code point, it’s an astral code point.
Internally, JavaScript [as do Google Sheets built-in functions] represents astral symbols as surrogate pairs, and it exposes the separate surrogate halves as separate “characters”. If you represent the symbols using nothing but ECMAScript 5-compatible escape sequences, you’ll see that two escapes are needed for each astral symbol. This is confusing, because humans generally think in terms of Unicode symbols or graphemes instead.
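You can see the surrogate pairs directly in a JavaScript console (the counts below follow from UTF-16 encoding, not from any library):

'🏅'.length       // 2 – one astral code point stored as a surrogate pair
'🇦🇺'.length      // 4 – two regional indicator code points (🇦 + 🇺)
'🇦🇺🏅🏉'.length   // 8 – why length-based formulas see more than 3 "characters"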
Custom function
function SPLITGRAPHEMES(string) {
  // Relies on the GraphemeSplitter library being included in the Apps Script project.
  var splitter = new GraphemeSplitter();
  return splitter.splitGraphemes(string);
}
NOTE: Don't forget to include the referenced JavaScript library in your Apps Script project (a library-free alternative is sketched at the end of this answer).
Syntax
Assume that A1 contains 🇦🇺🏅🏉. To split the three emoticons into a 1 x 3 array, use the following formula:
=TRANSPOSE(SPLITGRAPHEMES(A1))
Note: On Windows the emoticons (🇦🇺🏅🏉) in this Q&A don't look the same as on Chrome OS, so an image was used in the original post.
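If adding the external library is not an option, a rough library-free sketch is shown below. It only keeps regional-indicator flag pairs together (it does not handle ZWJ sequences or skin-tone modifiers) and needs the V8 Apps Script runtime for Unicode property escapes:

function SPLITFLAGS(text) {
  // A pair of regional indicator symbols forms a flag; match those first,
  // otherwise match any single code point (the u flag keeps surrogate pairs intact).
  return text.match(/\p{Regional_Indicator}{2}|./gsu);
}

Use it from a cell the same way as SPLITGRAPHEMES above.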

CloudSearch wildcard query not working with 2013 API after migration from 2011 API

I've recently upgraded a CloudSearch instance from the 2011 to the 2013 API. Both instances have a field called sid, which is a text field containing a two-letter code followed by some digits e.g. LC12345. With the 2011 API, if I run a search like this:
q=12345*&return-fields=sid,name,desc
...I get back 1 result, which is great. But the sid of the result is LC12345 and that's the way it was indexed. The number 12345 does not appear anywhere else in any of the resulting document fields. I don't understand why it works. I can only assume that this type of query is looking for any terms in any fields that even contain the number 12345.
The reason I'm asking is because this functionality is now broken when I query using the 2013 API. I need to use the structured query parser, but even a comparable wildcard query using the simple parser is not working e.g.
q.parser=simple&q=12345*&return=sid,name,desc
...returns nothing, although the document is definitely there i.e. if I query for LC12345* it finds the document.
If I could figure out how to get the simple query working like it was before, that would at least get me started on how to do the same with the structured syntax.
Why it's not working
CloudSearch v1 (2011) had a different way of tokenizing mixed alpha+numeric strings. Here's the logic as described in the archived docs (emphasis mine).
If a string contains both alphabetic and numeric characters and is at least three and no more than nine characters long, the alphabetic and numeric portions of the string are treated as separate tokens. For example, the string DOC298 is tokenized into two terms: doc 298
CloudSearch v2 (2013) text processing follows Unicode Text Segmentation, which does not specify that behavior:
Do not break within sequences of digits, or digits adjacent to letters (“3a”, or “A3”).
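Concretely, for a sid value like LC12345 (this is an illustration of the two rule sets described above, not output captured from either service):

2011 tokenizer: "LC12345"  ->  lc, 12345    (alphabetic and numeric runs split, so q=12345* matches)
2013 tokenizer: "LC12345"  ->  lc12345      (digits adjacent to letters are kept together, so it doesn't)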
Solution
You should just be able to search *12345 to get back results with any prefix. There may be some edge cases like getting back results you don't want (things with more preceding digits like AB99912345); I don't know enough about your data to say whether those are real concerns.
Another option would be to index the numeric part separately from the alphabetic prefix, but that's additional work that may be unnecessary.
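If you go that route, a sketch of the indexing side might look like this (the sid_prefix/sid_num field names and the batch shape are assumptions; the batch would be uploaded through the 2013 documents/batch endpoint after adding the new fields to your indexing options):

// Sketch: split "LC12345" into its alphabetic prefix and numeric part and
// index them as separate fields, so the number can be searched on its own.
function toBatchEntry(doc) {
  var match = doc.sid.match(/^([A-Za-z]+)([0-9]+)$/);
  return {
    type: 'add',
    id: doc.id,
    fields: {
      sid: doc.sid,
      sid_prefix: match ? match[1] : '',  // e.g. "LC"
      sid_num: match ? match[2] : ''      // e.g. "12345"
    }
  };
}

var batch = [toBatchEntry({ id: 'doc1', sid: 'LC12345' })];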
I'm guessing you are using CloudSearch in English, so maybe this isn't your specific problem, but also watch out for stop words in your search queries:
https://docs.aws.amazon.com/cloudsearch/latest/developerguide/configuring-analysis-schemes.html#stopwords
In your example, the word "jo" is a stop word in Danish and other languages, and each supported language has a dictionary of stop words covering very common terms. If you don't specify a language for your text field, it will be English. You can see them here: https://docs.aws.amazon.com/cloudsearch/latest/developerguide/text-processing.html#text-processing-settings

Ontology-based string classification

I recently started working with ontologies and I am using Protege to build an ontology which I'd also like to use for automatically classifying strings. The following illustrates a very basic class hierarchy:
String
 |_ AlphabeticString
     |_ CountryName
     |_ CityName
 |_ AlphaNumericString
     |_ PrefixedNumericString
 |_ NumericString
Eventually, strings like Spain should be classified as a CountryName, and UE4564 would be a PrefixedNumericString.
However I am not sure how to model this knowledge. Would I have to first define if a character is alphabetic, numeric, etc. and then construct a word from the existing characters or is there a way to use Regexes? So far I only managed to classify strings based on an exact phrase like String and hasString value "UE4565".
Or would it be better to save a regex for each class in the ontology and then classify the string in Java using those regexes?
An approach that might be appropriate here, especially if the ontology is large/complicated or might change in the future, and assuming that some errors are acceptable, is machine learning.
An outline of a process utilizing this approach might be:
Define a feature set you can extract from each string, relating to your ontology (some examples below).
Collect a "train set" of strings and their true matching categories.
Extract features from each string, and train some machine-learning algorithm on this data.
Use the trained model to classify new strings.
Retrain or update your model as needed (e.g. when new categories are added).
To illustrate more concretely, here are some suggestions based on your ontology example.
Some boolean features that might be applicable: does the string match a regexp (e.g. the ones Qtax suggests); does the string exist in a prebuilt list of known city names; does it exist in a list of known country names; does it contain uppercase letters; string length (not boolean), etc.
So if, for instance, you have a total of 8 features (matches against the 4 regular expressions mentioned above plus the additional 4 suggested here), then "Spain" would be represented as (1,1,0,0,0,1,1,5): it matches the first 2 regular expressions but not the last two, is a country name but not a city name, has an uppercase letter, and its length is 5.
This set of features can represent any given string.
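As an illustration, a feature extractor along those lines might look like the sketch below (the two word lists are tiny stand-ins; a real system would load complete gazetteers):

// Sketch: turn a string into the 8-element feature vector described above.
var COUNTRY_NAMES = ['Spain', 'France'];  // stand-in for a real country list
var CITY_NAMES = ['Madrid', 'Paris'];     // stand-in for a real city list

function extractFeatures(s) {
  return [
    /^[A-Za-z]+$/.test(s) ? 1 : 0,            // matches AlphabeticString
    /^[A-Za-z0-9]+$/.test(s) ? 1 : 0,         // matches AlphaNumericString
    /^[A-Za-z]+[0-9]+$/.test(s) ? 1 : 0,      // matches PrefixedNumericString
    /^[0-9]+$/.test(s) ? 1 : 0,               // matches NumericString
    CITY_NAMES.indexOf(s) !== -1 ? 1 : 0,     // known city name
    COUNTRY_NAMES.indexOf(s) !== -1 ? 1 : 0,  // known country name
    /[A-Z]/.test(s) ? 1 : 0,                  // contains an uppercase letter
    s.length                                  // string length
  ];
}

extractFeatures('Spain');  // [1, 1, 0, 0, 0, 1, 1, 5]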
To train and test a machine-learning algorithm you can use WEKA. I would start with rule- or tree-based algorithms, e.g. PART, RIDOR, JRIP or J48.
The trained models can then be used via Weka either from within Java or from the command line.
Obviously, the features I suggest have an almost 1:1 match with your ontology, but assuming your taxonomy is larger and more complex, this approach would probably be one of the best in terms of cost-effectiveness.
I don't know anything about Protege, but you can use regexes to match most of those cases. The only problem would be differentiating between country and city names; I don't see how you could do that without a complete list of either one.
Here are some expressions that you could use:
AlphabeticString:
^[A-Za-z]+\z (ASCII) or ^\p{Alpha}+\z (Unicode)
AlphaNumericString:
^[A-Za-z0-9]+\z (ASCII) or ^\p{Alnum}+\z (Unicode)
PrefixedNumericString:
^[A-Za-z]+[0-9]+\z (ASCII) or ^\p{Alpha}+\p{N}+\z (Unicode)
NumericString:
^[0-9]+\z (ASCII) or ^\p{N}+\z (Unicode)
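If you do the classification in code (the OP mentions Java; the sketch below uses JavaScript purely for illustration), the same patterns apply once \z is replaced by $ and the POSIX-style classes by Unicode property escapes, which need the u flag:

// Sketch: classify a string using Unicode-aware versions of the patterns above.
function classify(s) {
  if (/^\p{Alphabetic}+$/u.test(s)) return 'AlphabeticString';
  if (/^\p{Alphabetic}+\p{N}+$/u.test(s)) return 'PrefixedNumericString';
  if (/^\p{N}+$/u.test(s)) return 'NumericString';
  if (/^[\p{Alphabetic}\p{N}]+$/u.test(s)) return 'AlphaNumericString';
  return 'String';
}

classify('Spain');   // "AlphabeticString"
classify('UE4564');  // "PrefixedNumericString"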
A particular string is an instance, so you'll need some code to make the basic assertions about the particular instance. That code itself might contain the use of regular expressions. Once you've got those assertions, you'll be able to use your ontology to reason about them.
The hard part is that you've got to decide what level you're going to model at. For example, are you going to talk about individual characters? You can, but it's not necessarily sensible. You've also got the challenge that arises from the fact that negative information is awkward (as the basic model of such models is intuitionistic, IIRC) which means (for example) that you'll know that a string contains a numeric character but not that it is purely numeric. Yes, you'd know that you don't have an assertion that the instance contains an alphabetic character, but you wouldn't know whether that's because the string doesn't have one or just because nobody's said so yet. This stuff is hard!
It's far easier to write an ontology if you know exactly what problems you intend to solve with it, as that allows you to at least have a go at working out what facts and relations you need to establish in the first place. After all, there's a whole world of possible things that could be said which are true but irrelevant (“if the sun has got his hat on, he'll be coming out to play”).
Responding directly to your question, you start by checking whether a given token is numeric, alphanumeric or alphabetic (you can use regex here) and then you classify it as such. In general, the approach you're looking for is called generalization hierarchy of tokens or hierarchical feature selection (Google it). The basic idea is that you could treat each token as a separate element, but that's not the best approach since you can't cover them all [*]. Instead, you use common features among tokens (for example, 2000 and 1981 are distinct tokens but they share a common feature of being 4 digit numbers and possibly years). Then you have a class for four digit numbers, another for alphanumeric, and so on. This process of generalization helps you to simplify your classification approach.
Frequently, if you start with a string of tokens, you need to preprocess them (for example, remove punctuation or special symbols, remove words that are not relevant, stemming, etc.). But maybe you can use some symbols (say, the punctuation between cities and countries, e.g. Melbourne, Australia), so you map that set of useful punctuation symbols to another symbol (#) and use it as context (the next time you find an unknown word next to a comma next to a known country, you can use that knowledge to assume that the unknown word is a city).
Anyway, that's the general idea behind classification using an ontology (based on a taxonomy of terms). You may also want to read about part-of-speech tagging.
By the way, if you only want to have 3 categories (numeric, alphanumeric, alphabetic), a viable option would be to use edit distance (what is more likely, that UA4E30 belongs to the alphanumeric or numeric category, considering that it doesn't correspond to the traditional format of prefixed numeric strings?). So, you assume a cost for each operation (insertion, deletion, substitution) that transforms your unknown token into a known one.
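As a sketch of that idea, here is a plain Levenshtein distance with unit costs; weighting the three operations differently, as suggested above, is a straightforward change:

// Sketch: edit distance between an unknown token and a known one.
// Insertion, deletion and substitution all cost 1 here; adjust the costs
// to penalize some transformations more than others.
function editDistance(a, b) {
  var d = [];
  for (var i = 0; i <= a.length; i++) d[i] = [i];
  for (var j = 0; j <= b.length; j++) d[0][j] = j;
  for (i = 1; i <= a.length; i++) {
    for (j = 1; j <= b.length; j++) {
      var cost = a[i - 1] === b[j - 1] ? 0 : 1;    // substitution cost
      d[i][j] = Math.min(d[i - 1][j] + 1,          // deletion
                         d[i][j - 1] + 1,          // insertion
                         d[i - 1][j - 1] + cost);  // substitution or match
    }
  }
  return d[a.length][b.length];
}

editDistance('UA4E30', 'UE4564');  // compare an unknown token against a known one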
Finally, although you said you're using Protege (which I haven't used) to build your ontology, you may want to look at WordNet.
[*] There are probabilistic approaches that help you to determine a probability for an unknown token, so the probability of such event is not zero. Usually, this is done in the context of Hidden Markov Models. Actually, this could be useful to improve the suggestion given by etov.

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for UTF-8 mode in regexes.
On the whole, though, you're better off doing an iconv('utf-8','ascii//TRANSLIT',$string) on both the page name and the search term and comparing those.
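A rough sketch of that approach is below. Whether //TRANSLIT actually maps ā to a depends on the system's iconv implementation and locale, so test it on your platform; the second check avoids that dependency by matching both spellings with a /u pattern.

<?php
// Sketch: compare a page title and a search term after folding macrons away.
function fold_macrons($s) {
    // //TRANSLIT asks iconv to approximate characters that ASCII can't represent;
    // the exact output depends on the system iconv and the current locale.
    return strtolower(iconv('UTF-8', 'ASCII//TRANSLIT', $s));
}

$page_title = 'Māori History';
$search     = 'maori';

if (strpos(fold_macrons($page_title), fold_macrons($search)) !== false) {
    echo "suggest: $page_title\n";
}

// Alternative: match either spelling directly, in UTF-8 mode (/u).
if (preg_match('/m(a|ā)ori/iu', $page_title)) {
    echo "suggest: $page_title\n";
}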
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode string or a dumb byte string. If it's the byte-string option, you're going to have to use two dots there to catch it instead, or {1,4}. And even then you're going to have to verify that the up-to-four bytes you grab between the M and the o form a single valid UTF-8 character. This is all moot if PHP does Unicode right; I haven't used it in years, so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways; one as a single character (U+0101) and one as TWO unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely just only going to ever get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexps.