Map sequences of numbers to single characters in Scala - regex

Given an input string, map each of three kinds of digit sequences in the string to a single character and leave the other elements of the string unchanged:
A single digit should be mapped to the char 1: "help3me" -> "help1me"
Two digits in a row should be mapped to the char 2: "help18me" -> "help2me"
Three or more digits in a row should be mapped to the char 3: "test3432help234312me" -> "test3help3me"
Our input strings can contain any number of digit sequences of length 1, 2, or 3+, so a valid input example is "help3490897test73me23435please5"
What is an effective solution for the above problem in Scala? Does it just involve enumerating the three possible cases as a regex?

Use a regular expression with the replaceAllIn method. Its second argument is a function that takes a Match object and returns the replacement string; here it maps each run of digits to its length, capped at 3.
val str = "help3me34"
val expr = "(\\d+)".r
expr.replaceAllIn(str, x => (x.group(0).length min 3).toString)
res2: String = help1me2
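For comparison, the same callback-based replacement can be sketched in Python with re.sub. This is only an illustration of the technique, not part of the Scala answer; the function name collapse_digit_runs is made up for this example.

```python
import re

def collapse_digit_runs(text):
    # Replace every run of digits with its length, capped at 3.
    return re.sub(r"\d+", lambda m: str(min(len(m.group(0)), 3)), text)

print(collapse_digit_runs("help3me34"))  # help1me2
```

As in the Scala version, a single pattern covers all three cases, so there is no need to enumerate them in the regex.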

Related

Length of string that contains CJK characters

When given a string containing CJK characters, String.length returns the wrong number of characters in the string because it counts the number of bytes. For example:
# String.length "第1";;
- : int = 4
There are two characters in the string, but String.length returns 4 (which is the number of bytes in the string).
How can I get the real length of a string that contains CJK characters?
If you want to count the number of extended grapheme clusters (a.k.a. graphical characters), you can use Uuseg to do the segmentation:
let len = Uuseg_string.fold_utf_8 `Grapheme_cluster (fun x _ -> x + 1) 0
;; len "春"
1
which has the advantage of still being accurate in the presence of non-precomposed characters like decomposed jamo in Korean:
;; len "\u{1112}\u{1161}\u{11AB}"
1
which is the correct result, since the previous string should be displayed as 한 even though it is written with 3 Unicode scalar values.
As stated in the comments, OCaml does not have native support for any particular encoding, hence the length being the number of bytes.
Now, assuming you are using UTF-8 encoding (which is the easiest way to mix ASCII and CJK, AFAIK), there are a few ways to calculate that size.
As an example, here is a version using the very lightweight Uutf library. [EDIT] As octachron pointed out, this returns the length in scalar values and not in characters; you should use octachron's answer.
let utf8_length s = (* returns the number of unicode scalar values *)
  let decoder = Uutf.decoder ~encoding:`UTF_8 (`String s) in
  let rec loop () = match Uutf.decode decoder with | `End -> () | _ -> loop () in
  loop ();
  Uutf.decoder_count decoder
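To make the difference between bytes, scalar values, and displayed characters concrete, here is a small sketch in Python (used here purely for illustration, not OCaml). The helper names are made up. Note that NFC normalization only happens to give one code point for this particular jamo sequence; it is not a general substitute for grapheme cluster segmentation.

```python
import unicodedata

def byte_length(s):
    # Number of bytes in the UTF-8 encoding of s.
    return len(s.encode("utf-8"))

def scalar_length(s):
    # Number of code points; Python's len() on str counts these.
    return len(s)

jamo = "\u1112\u1161\u11AB"                     # decomposed form of 한
composed = unicodedata.normalize("NFC", jamo)   # precomposed form

print(byte_length("第1"), scalar_length("第1"))        # 4 2
print(scalar_length(jamo), scalar_length(composed))   # 3 1
```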

How to call characters from first list with second list

I want to input two comma separated strings: the first a set of strings, the second a set of ranges and return substrings based on ranges, for example:
x=input("Input string to search: ")
search=x.split(',')
y=input("Input numbers to locate: ")
numbers=y.split(',')
I would then like to use the second list of ranges to print out specified characters from the first list.
An example:
Input string to search: abcdefffg,aabcdefghi,bbcccdefghi
Input numbers to locate: 1:2,2:3,5:9
I would like the output to look like this:
bc
bcd
defghi
Any suggestions? Thanks in advance!
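split(':') splits a "range" into its two components. map(int, ...) converts them to integers. string[a:b] takes the characters from index a up to, but not including, index b. To match your expected output, the second number is treated as a count of characters, so the slice is string[start:start+how_many].
zip is an easy way to iterate over two lists in parallel.
Let me know if you have any other questions: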
x = "abcdefffg,aabcdefghi,bbcccdefghi"
search = x.split(',')
y = "1:2,2:3,5:9"
numbers = y.split(',')
results = []
for string, rng in zip(search, numbers):
    start, how_many = map(int, rng.split(':'))
    results.append(string[start:start+how_many])
print(" ".join(results))
# Output:
# bc bcd defghi
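If each substring should appear on its own line, as in the desired output in the question, one possible variation wraps the same logic in a function and joins with newlines. The function name extract is hypothetical, and the "start:count" reading of each range is the same assumption the code above makes.

```python
def extract(strings, ranges):
    # Each range is "start:count"; slicing past the end simply truncates.
    results = []
    for string, rng in zip(strings, ranges):
        start, how_many = map(int, rng.split(':'))
        results.append(string[start:start + how_many])
    return results

search = "abcdefffg,aabcdefghi,bbcccdefghi".split(',')
numbers = "1:2,2:3,5:9".split(',')
print("\n".join(extract(search, numbers)))
```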

Regex to find a substring with a specific length that contains minimum occurrences of a specific characters

Is there any regex to find a substring with a specific length that contains minimum number of a specific char occurrences?
For example I have a string such as: AABABAAAAA for this string we have a substring with length 5 that contains two B => AABAB so regex should find it.
But for the AAAABAAAAB there is not any substring with length of 5 that contains two B.
Suppose our string just contains A and B and we want to find substring with length of 5 that contains at least two B:
AAAABAAAAB -> Invalid
AAAAAAAABB -> Valid
AAAAAAAAAABAABAAAAAA -> Valid
AAAABAAAAAAABAAAAAAA -> Invalid
Brute force:
.B..B|B...B|..BB.|.B.B.|..B.B|BB...|B.B..|...BB|B..B.|.BB..
I know that such a regular expression is not parametrizable. On the other hand, it is possible to generate it programmatically (the example is in Python):
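import itertools

def get_regex(char, charnum, strsize):
    # charnum copies of char, padded with "." wildcards up to strsize
    chars = char * charnum + "." * (strsize - charnum)
    # every distinct arrangement becomes one alternative
    return "|".join("".join(x) for x in set(itertools.permutations(chars)))

print(get_regex("B", 2, 5))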
You can use this regex, whose lookahead requires two Bs within the next five characters:
(?=B.{0,3}B|.B.{0,2}B|..B.?B|...BB).{5}

Why is max number ignoring two-digit numbers?

At the moment I am saving a set of variables to a text file. I am doing the following to check whether my code works, but whenever I use a two-digit number such as 10, it does not print this number as the max number.
Suppose my text file looked like this:
tom:5
tom:10
tom:1
It would output 5 as the max number.
name = input('name')
score = 4
if name == 'tom':
    fo= open('tom.txt','a')
    fo.write('Tom: ')
    fo.write(str(score ))
    fo.write("\n")
    fo.close()
if name == 'wood':
    fo= open('wood.txt','a')
    fo.write('Wood: ')
    fo.write(str(score ))
    fo.write("\n")
    fo.close()
tomL2 = []
woodL2 = []
fo = open('tom.txt','r')
tomL = fo.readlines()
tomLi = tomL2 + tomL
fo.close
tomLL=max(tomLi)
print(tomLL)
fo = open('wood.txt','r')
woodL = fo.readlines()
woodLi = woodL2 + woodL
fo.close
woodLL=max(woodLi)
print(woodLL)
You are comparing strings, not numbers. You need to convert them into numbers before using max. For example, you have:
tomL = fo.readlines()
This contains a list of strings:
['tom:5\n', 'tom:10\n', 'tom:1\n']
Strings are ordered lexicographically (much like how words would be ordered in an English dictionary). If you want to compare numbers, you need to turn them into numbers first:
tomL_scores = [int(s.split(':')[1]) for s in tomL]
The parsing is done in the following way:
….split(':') separates the string into parts using a colon as the delimiter:
'tom:5\n' becomes ['tom', '5\n']
…[1] chooses the second element from the list:
['tom', '5\n'] becomes '5\n'
int(…) converts a string into an integer:
'5\n' becomes 5
The list comprehension [… for s in tomL] applies this sequence of operations to every element of the list.
Note that int (and similarly float) is rather picky about what it accepts: the string must be in the form of a valid numeric literal or it will be rejected with an error (although leading and trailing whitespace is allowed). This is why you need ….split(':')[1] to massage the string into a form that it's willing to accept.
This will yield:
[5, 10, 1]
Now, you can apply max to obtain the largest score.
As a side-note, the statement
fo.close
will not close a file, since it doesn't actually call the function. To call the function you must enclose the arguments in parentheses, even if there are none:
fo.close()
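Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes every non-empty line has the form name:score; the function name max_score and the use of a temporary file are choices made for this example only. Using a with block closes the file automatically, which avoids the fo.close issue.

```python
import os
import tempfile

def max_score(path):
    # Parse the integer after the colon on each non-empty line, then take the max.
    with open(path) as fo:
        scores = [int(line.split(':')[1]) for line in fo if line.strip()]
    return max(scores)

lines = ['tom:5\n', 'tom:10\n', 'tom:1\n']
print(repr(max(lines)))  # 'tom:5\n' because strings compare lexicographically

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'tom.txt')
    with open(path, 'w') as fo:
        fo.writelines(lines)
    best = max_score(path)

print(best)  # 10
```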

regexp with varying integer lengths

I want to split these strings:
CH1Avg
Ch2Avg
Ch3
Ch4Avg
Ch5
Ch6Avg
Chan7
Channel9
Ch010
Ch011Avg
Chann12Average
...up to...
Ch100AVG
I need to split them into their constituent parts
"Ch", ##, "Avg"
The 1st and 3rd components are of variable length and form. I want to split using the 2nd component, which is an integer of varying length from 0 to 100. The integer may or may not be zero-padded.
Any thoughts? I am trying to use () without much success.
To split the string into the constituent parts, I suggest using named tokens for convenience:
strCell = {'CH1Avg'
'Ch2Avg'
'Ch3'
'Ch4Avg'
'Ch5'
'Ch6Avg'
'Chan7'
'Channel9'
'Ch010'
'Ch011Avg'
'Chann12Average'}
out = regexp(strCell,'(?<channelName>\D+)(?<channelNum>\d+)(?<channelType>\w*)','names')
out = [out{:}];
out(end)
ans =
channelName: 'Chann'
channelNum: '12'
channelType: 'Average'
Split on (\d+). The parentheses ensure that the number you're splitting on will also become part of the array.
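The effect of the capturing parentheses can be seen in a short Python sketch (Python is used only to illustrate the idea; it is not the MATLAB syntax from the question). In Python's re.split, text matched by a capturing group is kept in the result list.

```python
import re

names = ["CH1Avg", "Ch3", "Ch011Avg", "Chann12Average"]
for name in names:
    # The capturing group keeps the digits as their own list element.
    print(re.split(r"(\d+)", name))
```

When nothing follows the number, as in Ch3, the last element is an empty string.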