Length of string that contains CJK characters - ocaml

When given a string containing CJK characters, String.length returns the wrong number of characters in the string because it counts the number of bytes. For example:
# String.length "第1";;
- : int = 4
There are two characters in the string, but String.length returns 4 (which is the number of bytes in the string).
How can I get the real length of a string that contains CJK characters?

If you want to count the number of extended grapheme clusters (aka graphical characters), you can use Uuseg to do the segmentation:
let len = Uuseg_string.fold_utf_8 `Grapheme_cluster (fun x _ -> x + 1) 0
;; len "春"
1
which has the advantage of still being accurate in the presence of non-precomposed characters, such as decomposed jamo in Korean:
;; len "\u{1112}\u{1161}\u{11AB}"
1
which is the correct result, since the previous string should be displayed as 한 even though it is written with 3 Unicode scalar values.

As stated in the comments, OCaml does not have native support for any particular encoding, hence the length being the number of bytes.
Now, assuming you are using UTF-8 encoding (which is the easiest way to mix ASCII and CJK, AFAIK), there are a few ways to calculate that size.
Here is an example using the very lightweight Uutf library. [EDIT] As octachron pointed out, this returns the length in Unicode scalar values, not in characters; you should use octachron's answer.
let utf8_length s = (* returns the number of Unicode scalar values *)
  let decoder = Uutf.decoder ~encoding:`UTF_8 (`String s) in
  let rec loop () = match Uutf.decode decoder with | `End -> () | _ -> loop () in
  loop ();
  Uutf.decoder_count decoder
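To see the three notions of length side by side, here is a small sketch in Python rather than OCaml, chosen only because its standard library exposes the byte and scalar-value counts without extra packages. The helper names are made up for illustration. Python's standard library has no grapheme-cluster segmentation, so the last step uses NFC normalization as a stand-in that happens to work for this particular Hangul example; it is not a general replacement for Uuseg.

```python
import unicodedata

def byte_length(s):
    # What OCaml's String.length reports for a UTF-8 encoded string.
    return len(s.encode("utf-8"))

def scalar_length(s):
    # Number of code points; comparable to the Uutf-based count above.
    return len(s)

cjk = "第1"
jamo = "\u1112\u1161\u11ab"  # decomposed form of 한

print(byte_length(cjk), scalar_length(cjk))    # 4 2
print(byte_length(jamo), scalar_length(jamo))  # 9 3

# Assumption: NFC composes these three jamo into one precomposed syllable.
# That makes the count 1 here, but it is not grapheme segmentation in general.
composed = unicodedata.normalize("NFC", jamo)
print(composed, scalar_length(composed))       # 한 1
```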

Related

How to call characters from first list with second list

I want to input two comma separated strings: the first a set of strings, the second a set of ranges and return substrings based on ranges, for example:
x=input("Input string to search: ")
search=x.split(',')
y=input("Input numbers to locate: ")
numbers=y.split(',')
I would then like to use the second list of ranges to print out specified characters from the first list.
An example:
Input string to search: abcdefffg,aabcdefghi,bbcccdefghi
Input numbers to locate: 1:2,2:3,5:9
I would like the output to look like this:
bc
bcd
defghi
Any suggestions? Thanks in advance!
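split(':') splits a "range" into its two components; judging by the expected output, these are a start index and a character count. map(int, ...) converts them to integers. string[start:start+how_many] takes how_many characters starting at index start.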
zip is an easy way to read from two different lists combined.
Let me know if you have any other questions:
x = "abcdefffg,aabcdefghi,bbcccdefghi"
search = x.split(',')
y = "1:2,2:3,5:9"
numbers = y.split(',')
results = []
for string, rng in zip(search, numbers):
    start, how_many = map(int, rng.split(':'))
    results.append(string[start:start+how_many])
print(" ".join(results))
# Output:
# bc bcd defghi
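If you want the results one per line, as in the question, the same idea can be wrapped in a function. This is only a sketch: the name extract is made up, and it assumes each "a:b" pair means "start index a, then b characters", which is the reading that reproduces the expected output.

```python
def extract(search_csv, ranges_csv):
    """Pair each string with its 'start:count' range and slice it."""
    pieces = []
    for text, rng in zip(search_csv.split(","), ranges_csv.split(",")):
        start, count = map(int, rng.split(":"))
        pieces.append(text[start:start + count])
    return pieces

result = extract("abcdefffg,aabcdefghi,bbcccdefghi", "1:2,2:3,5:9")
print("\n".join(result))
# bc
# bcd
# defghi
```

Note that zip stops at the shorter list, so extra strings or extra ranges are silently ignored.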

Idiomatic way to include a null character / byte in a string in OCaml

I was messing around with the OCaml Unix module to see if it would reject strings containing certain bytes that might have surprising effects in the context of the given system call and throw an exception. E.g. a null byte in the prog argument to Unix.create_process, or a newline in one of the strings in the env : string array argument.
I tried a few ways to include a null byte in my string, such as "/bin/ls\0" (which is an illegal escape sequence in a string literal) and "/bin/ls" ^ string_of_char '\0' (which is an illegal sequence in a character literal). Finally, I converted zero to a character with char_of_int, made a string of length 1 containing that null character, and concatenated it with my string.
module U = Unix;;
let string_of_char ch : string = String.make 1 ch
let sketchy_string = "/bin/ls" ^ string_of_char (char_of_int 0)
let _ = U.create_process sketchy_string [|"ls"|] U.stdin U.stdout U.stderr
What's the right way to add a null byte to an OCaml string?
You can use the generic "hexadecimal code" escape sequence to write the null byte (or any other byte you want):
let null_byte = '\x00';;
let sketchy_string = "/bin/ls\x00";;
For further reference, see the section of the OCaml manual covering escape sequences: http://caml.inria.fr/pub/docs/manual-ocaml/lex.html#escape-sequence

Regex to find a substring with a specific length that contains minimum occurrences of a specific characters

Is there a regex that finds a substring of a specific length containing at least a minimum number of occurrences of a specific character?
For example, take the string AABABAAAAA. It has a substring of length 5 that contains two Bs (AABAB), so the regex should find it.
But in AAAABAAAAB there is no substring of length 5 that contains two Bs.
Suppose our string contains only A and B, and we want to find a substring of length 5 that contains at least two Bs:
AAAABAAAAB -> Invalid
AAAAAAAABB -> Valid
AAAAAAAAAABAABAAAAAA -> Valid
AAAABAAAAAAABAAAAAAA -> Invalid
Brute force:
.B..B|B...B|..BB.|.B.B.|..B.B|BB...|B.B..|...BB|B..B.|.BB..
Well, I know that such a regular expression is not parameterizable. On the other hand, it's possible to generate it programmatically (the example is in Python):
import itertools
def get_regex(char, charnum, strsize):
    chars = char * charnum + "." * (strsize - charnum)
    return "|".join("".join(x) for x in set(itertools.permutations(chars)))
print(get_regex("B", 2, 5))
You can use this regex:
(?=[^B]{0,3}B[^B]{0,3}B).{5}
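As a sanity check, the lookahead regex can be compared against a plain sliding-window count on the four strings from the question. This is a sketch in Python; has_dense_window is a made-up helper, not part of any library. One caveat worth knowing: the lookahead only proves that two Bs lie close together somewhere ahead, so the five characters it actually consumes are not guaranteed to contain both of them, as the last lines show.

```python
import re

LOOKAHEAD = re.compile(r"(?=[^B]{0,3}B[^B]{0,3}B).{5}")

def has_dense_window(s, char="B", min_count=2, size=5):
    # Check every window of `size` characters for enough occurrences of `char`.
    return any(
        s[i:i + size].count(char) >= min_count
        for i in range(len(s) - size + 1)
    )

cases = {
    "AAAABAAAAB": False,
    "AAAAAAAABB": True,
    "AAAAAAAAAABAABAAAAAA": True,
    "AAAABAAAAAAABAAAAAAA": False,
}
for s, expected in cases.items():
    assert has_dense_window(s) == expected
    assert (LOOKAHEAD.search(s) is not None) == expected

# Caveat: the regex reports the right yes/no answer here, but the text it
# returns starts too early and holds only one B.
tricky = "AAABAAAB"
matched = LOOKAHEAD.search(tricky).group()
print(matched, matched.count("B"))  # AAABA 1
```

So the regex is fine as an existence test; if you need the qualifying substring itself, scanning the windows directly is the safer route.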

Map sequences of numbers to single characters in Scala

Given an input string, map three types of digit sequences contained in the string to a single digit, leaving the other elements of the string unchanged:
Single number should be mapped to the char 1: "help3me" -> "help1me"
Two numbers in a row should be mapped to the char 2: "help18me" -> "help2me"
Three or more numbers in a row should be mapped to 3: "test3432help234312me" -> "test3help3me"
Our input strings can contain any number of 1,2,3+ length sequences of digits so that a valid input example is "help3490897test73me23435please5"
What is an effective solution for the above problem in Scala? Does it just involve enumerating the three possible cases as a regex?
Use a regular expression and the replaceAllIn method. The second argument is a function that takes a Match object and transforms it into its length, capped at 3.
val str = "help3me34"
val expr = "(\\d+)".r
expr.replaceAllIn(str, x => (x.group(0).length min 3).toString)
res2: String = help1me2
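For comparison, the same technique (replace each run of digits with its length, capped at 3) looks like this in Python. It is a sketch that mirrors the Scala answer, not a translation anyone in this thread posted; collapse_digit_runs is a made-up name, and [0-9] is used so that only ASCII digits count.

```python
import re

def collapse_digit_runs(s):
    # Each maximal run of digits becomes "1", "2" or "3" (3 means three or more).
    return re.sub(r"[0-9]+", lambda m: str(min(len(m.group()), 3)), s)

print(collapse_digit_runs("help3me34"))             # help1me2
print(collapse_digit_runs("test3432help234312me"))  # test3help3me
```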

R code to check if word matches pattern

I need to validate a string against a character vector pattern. My current code is:
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# valid pattern is lowercase alphabet, '.', '!', and '?' AND
# the string length should be >= 2
my.pattern = c(letters, '!', '.', '?')
check.pattern = function(word, min.size = 2)
{
word = trim(word)
chars = strsplit(word, NULL)[[1]]
all(chars %in% my.pattern) && (length(chars) >= min.size)
}
Example:
w.valid = 'special!'
w.invalid = 'test-me'
check.pattern(w.valid) #TRUE
check.pattern(w.invalid) #FALSE
This is VERY SLOW, I guess... Is there a faster way to do this? Regex maybe?
Thanks!
PS: Thanks everyone for the great answers. My objective was to build a 29 x 29 matrix,
where the row names and column names are the allowed characters. Then I iterate over each word of a huge text file and build a 'letter precedence' matrix. For example, consider the word 'special', starting from the first char:
row s, col p -> increment 1
row p, col e -> increment 1
row e, col c -> increment 1
... and so on.
The bottleneck of my code was the vector allocation: I was 'appending' instead of pre-allocating the final vector, so the code was taking 30 minutes to execute, instead of 20 seconds!
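The 'letter precedence' matrix described above can be sketched as follows. This is Python rather than R, and every name in it is made up for illustration; it assumes each word has already been validated against the 29 allowed characters. The matrix is allocated once up front and only incremented afterwards, which is the pre-allocation point made above.

```python
import string

ALPHABET = list(string.ascii_lowercase) + ["!", ".", "?"]  # 29 allowed characters
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def precedence_matrix(words):
    n = len(ALPHABET)
    matrix = [[0] * n for _ in range(n)]  # allocated once, never appended to
    for word in words:
        # Pair every character with its successor: (s, p), (p, e), (e, c), ...
        for prev, nxt in zip(word, word[1:]):
            matrix[INDEX[prev]][INDEX[nxt]] += 1
    return matrix

counts = precedence_matrix(["special"])
print(counts[INDEX["s"]][INDEX["p"]])  # 1
```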
There are some built-in functions that can clean up your code. And I think you're not leveraging the full power of regular expressions.
The glaring issue here is strsplit. Comparing things character by character is inefficient when you have regular expressions. The pattern here uses the square bracket notation to filter for the characters you want. * is for any number of repeats (including zero), while the ^ and $ symbols represent the beginning and end of the line so that there is nothing else there. nchar(word) is the same as length(chars). Changing && to & makes the function vectorized so you can input a vector of strings and get a logical vector as output.
check.pattern.2 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]*$"),word) & nchar(word) >= min.size
}
check.pattern.2(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Next, using curly braces for number of repetitions and some paste0, the pattern can use your min.size:
check.pattern.3 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]{",min.size,",}$"),word)
}
check.pattern.3(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Finally, you can internalize the regex from trim:
check.pattern.4 = function(word, min.size = 2)
{
grepl(paste0("^\\s*[a-z!.?]{",min.size,",}\\s*$"),word)
}
check.pattern.4(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
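For readers more at home outside R, the idea behind check.pattern.4 (one anchored pattern that absorbs the trimming and the minimum length) can be sketched in Python. The function name is made up, and re.fullmatch plays the role of the ^ and $ anchors.

```python
import re

def check_pattern(word, min_size=2):
    # Optional surrounding whitespace, then at least min_size allowed characters.
    pattern = r"\s*[a-z!.?]{%d,}\s*" % min_size
    return re.fullmatch(pattern, word) is not None

words = [" d ", "!hello ", "nA!", " asdf.!", " d d "]
print([check_pattern(w) for w in words])
# [False, True, False, True, False]
```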
If I understand the pattern you are desiring correctly, you would want a regex of a similar format to:
^\\s*[a-z!\\.\\?]{MIN,MAX}\\s*$
Where MIN is replaced with the minimum length of the string, and MAX is replaced with the maximum length of the string. If there is no maximum length, then MAX can be omitted but the comma must stay ({MIN,}); without the comma, {MIN} means exactly MIN repetitions. Likewise, if there is neither a maximum nor a minimum, everything within the {} including the braces themselves can be replaced with a *, which signifies that the preceding item will be matched zero or more times; this is equivalent to {0,}.
This ensures that the regex only matches strings where every character between any leading and trailing whitespace is one of:
* a lower case letter
* a bang (exclamation point)
* a period
* a question mark
Note that this has been written as a Perl-style regex, as that is what I am more familiar with; most of my research was at this wiki for R text processing.
The reason for the slowness of your function is the extra overhead of splitting the string into a number of smaller strings. This is a lot of overhead in comparison to a regex (or even a manual iteration over the string, comparing each character until the end is reached or an invalid character is found). Also remember that this approach always does work proportional to n, the length of the string, because the split generates n one-character strings. This means that even FAILING strings must do at least n actions before being rejected, whereas a regex or a manual scan can stop at the first invalid character.
Hopefully this clarifies why you were having performance issues.