Format long number to shorter version in Lua - regex

I'm trying to figure out how I would go about formatting a large number to the shorter version by appending 'k' or 'm' using Lua. Example:
17478 => 17.5k
2832 => 2.8k
1548034 => 1.55m
I would like to have the rounding in there as well as per the example. I'm not very good at Regex, so I'm not sure where I would begin. Any help would be appreciated. Thanks.

Pattern matching doesn't seem like the right direction for this problem.
Assuming 2 digits after decimal point are kept in the shorter version, try:
function foo(n)
if n >= 10^6 then
return string.format("%.2fm", n / 10^6)
elseif n >= 10^3 then
return string.format("%.2fk", n / 10^3)
else
return tostring(n)
end
end
Test:
print(foo(17478))
print(foo(2832))
print(foo(1548034))
Output:
17.48k
2.83k
1.55m

Here a longer form, which uses the hint from Tom Blodget.
Maybe its not the perfect form, but its a little more specific.
For Lua 5.0, replace #steps with table.getn(steps).
function shortnumberstring(number)
local steps = {
{1,""},
{1e3,"k"},
{1e6,"m"},
{1e9,"g"},
{1e12,"t"},
}
for _,b in ipairs(steps) do
if b[1] <= number+1 then
steps.use = _
end
end
local result = string.format("%.1f", number / steps[steps.use][1])
if tonumber(result) >= 1e3 and steps.use < #steps then
steps.use = steps.use + 1
result = string.format("%.1f", tonumber(result) / 1e3)
end
--result = string.sub(result,0,string.sub(result,-1) == "0" and -3 or -1) -- Remove .0 (just if it is zero!)
return result .. steps[steps.use][2]
end
print(shortnumberstring(100))
print(shortnumberstring(200))
print(shortnumberstring(999))
print(shortnumberstring(1234567))
print(shortnumberstring(999999))
print(shortnumberstring(9999999))
print(shortnumberstring(1345123))
Result:
> dofile"test.lua"
100.0
200.0
1.0k
1.2m
1.0m
10.0m
1.3m
>
And if you want to get rid of the "XX.0", uncomment the line before the return.
Then our result is:
> dofile"test.lua"
100
200
1k
1.2m
1m
10m
1.3m
>

Related

Efficient way for the following code

I saw this problem on hackerrank.com, the problem is to find a 4 letter palindrome from a given string which can be a long string also.
Constraint is as follows:
where, |s| is the length of the string and a,b,c,d are the positions of the corresponding letters in the palindrome.
I found out the solution for this, but it isn't efficient enough, as in during the processing time it gives 'time out' error. The code is as follows:
s='kkkkkkz'
n=0
c_i,c_j,c_k,c_l=0,0,0,0
for i in range(len(s)):
j=0;c_i+=1
while j>=0 and j<len(s):
c_j+=1
if j>i:
k=0
while k>=0 and k<len(s):
c_k+=1
if k>j:
l=0
while l>=0 and l<len(s):
c_l+=1
if l>k:
a=s[i]+s[j]+s[k]+s[l]
if a[0]==a[3] and a[1]==a[2]: n+=1
l+=1
k+=1
j+=1
print n
I thought of noticing the number of times each loop runs, which right now is 7,49,147 and 245.
It is still better than the techniques I followed before, but I am not able to to do better than this.
Suggestions please ?
One way is to use the following, but this will still not be efficient enough. Scores 12/40 ..
import itertools
s=WHATEVERSTRING
n=0
for a in itertools.combinations(s, 4):
n += (a[0] == a[3])*(a[1]==a[2])
print(n)
A working solution is to go down the following route: create a set of unique characters in the string, and map substring pairs to a dictionary. Then count all the occurrences of pairwise pairs.
from collections import defaultdict as di
data = [x for x in s.strip()]
chars = set(data)
sum_a = 0
for c in chars:
a = 0
b = di(int)
double_pairs = 0
for d in data:
if d == c:
sum_a += double_pairs
double_pairs += b[c]
b[c]+=a
a += 1
else:
double_pairs += b[d]
b[d] += a
print(sum_a%(10**9+7))

gsub speed vs pattern length

I've been using gsub extensively lately, and I noticed that short patterns run faster than long ones, which is not surprising. Here's a fully reproducible code:
library(microbenchmark)
set.seed(12345)
n = 0
rpt = seq(20, 1461, 20)
msecFF = numeric(length(rpt))
msecFT = numeric(length(rpt))
inp = rep("aaaaaaaaaa",15000)
for (i in rpt) {
n = n + 1
print(n)
patt = paste(rep("a", rpt[n]), collapse = "")
#time = microbenchmark(func(count[1:10000,12], patt, "b"), times = 10)
timeFF = microbenchmark(gsub(patt, "b", inp, fixed=F), times = 10)
msecFF[n] = mean(timeFF$time)/1000000.
timeFT = microbenchmark(gsub(patt, "b", inp, fixed=T), times = 10)
msecFT[n] = mean(timeFT$time)/1000000.
}
library(ggplot2)
library(grid)
library(gridExtra)
axis(1,at=seq(0,1000,200),labels=T)
p1 = qplot(rpt, msecFT, xlab="pattern length, characters", ylab="time, msec",main="fixed = TRUE" )
p2 = qplot(rpt, msecFF, xlab="pattern length, characters", ylab="time, msec",main="fixed = FALSE")
grid.arrange(p1, p2, nrow = 2)
As you see, I'm looking for a pattern that contains a replicated rpt[n] times. The slope is positive, as expected. However, I noticed a kink at 300 characters with fixed=T and 600 characters with fixed=F and then the slope seems to be approximately as before (see plot below).
I suppose, it is due to memory, object size, etc. I also noticed that the longest allowed pattern is 1463 symbols, with object size of 1552 bytes.
Can someone explain the kink better and why at 300 and 600 characters?
Added: it is worth mentioning, that most of my patterns are 5-10 characters long, which gives me on my real data (not the mock-up inp in the example above) the following timing.
gsub, fixed = TRUE: ~50 msec per one pattern
gsub, fixed = FALSE: ~190 msec per one pattern
stringi, fixed = FALSE: ~55 msec per one pattern
gsub, fixed = FALSE, perl = TRUE: ~95 msec per one pattern
(I have 4k patterns, so total timing of my module is roughly 200 sec, which is exactly 0.05 x 4000 with gsub and fixed = TRUE. It is the fastest method for my data and patterns)
The kinks might be related to the bits required to hold patterns of that length.
There is another solution that scales much better, use the repetition operator {} to specify how many repeats you want to find. In order to find more than 255 (8 bit integer max) you'll have to specify perl = TRUE.
patt2 <- paste0('a{',rpt[n],'}')
timeRF <- microbenchmark(gsub(patt2, "b", inp, perl = T), times = 10)
I get speeds of around 2.1 ms per search with no penalty for pattern length. That's about 8x faster than fixed = FALSE for small pattern lengths and about 60x faster for large pattern lengths.

Loop problems Even Count

I have a beginner question. Loops are extremely hard for me to understand, so it's come to me asking for help.
I am trying to create a function to count the amount of even numbers in a user input list, with a negative at the end to show the end of the list. I know I need to use a while loop, but I am having trouble figuring out how to walk through the indexes of the input list. This is what I have so far, can anyone give me a hand?
def find_even_count(numlist):
count = 0
numlist.split()
while numlist > 0:
if numlist % 2 == 0:
count += 1
return count
numlist = raw_input("Please enter a list of numbers, with a negative at the end: ")
print find_even_count(numlist)
I used the split to separate out the indexes of the list, but I know I am doing something wrong. Can anyone point out what I am doing wrong, or point me to a good step by step explanation of what to do here?
Thank you guys so much, I know you probably have something more on your skill level to do, but appreciate the help!
You were pretty close, just a couple of corrections:
def find_even_count(numlist):
count = 0
lst = numlist.split()
for num in lst:
if int(num) % 2 == 0:
count += 1
return count
numlist = raw_input("Please enter a list of numbers, with a negative at the end: ")
print find_even_count(numlist)
I have used a for loop rather than a while loop, stored the outcome of numlist.split() to a variable (lst) and then just iterated over this.
You have a couple of problems:
You split numlist, but don't assign the resulting list to anything.
You then try to operate on numlist, which is still the string of all numbers.
You never try to convert anything to a number.
Instead, try:
def find_even_count(numlist):
count = 0
for numstr in numlist.split(): # iterate over the list
num = int(numstr) # convert each item to an integer
if num < 0:
break # stop when we hit a negative
elif num % 2 == 0:
count += 1 # increment count for even numbers
return count # return the total
Or, doing the whole thing in one line:
def find_even_count(numlist):
return sum(num % 2 for num in map(int, numlist.split()) if num > 0)
(Note: the one-liner will fail in cases where the user tries to trick you by putting more numbers after the "final" negative number, e.g. with numlist = "1 2 -1 3 4")
If you must use a while loop (which isn't really the best tool for the job), it would look like:
def find_even_count(numlist):
index = count = 0
numlist = list(map(int, numlist.split()))
while numlist[index] > 0:
if numlist[index] % 2 == 0:
count += 1
index += 1
return count

Understanding Recursive Function

I'm working through the book NLP with Python, and I came across this example from an 'advanced' section. I'd appreciate help understanding how it works. The function computes all possibilities of a number of syllables to reach a 'meter' length n. Short syllables "S" take up one unit of length, while long syllables "L" take up two units of length. So, for a meter length of 4, the return statement looks like this:
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
The function:
def virahanka1(n):
if n == 0:
return [""]
elif n == 1:
return ["S"]
else:
s = ["S" + prosody for prosody in virahanka1(n-1)]
l = ["L" + prosody for prosody in virahanka1(n-2)]
return s + l
The part I don't understand is how the 'SSL', 'SLS', and 'LSS' matches are made, if s and l are separate lists. Also in the line "for prosody in virahanka1(n-1)," what is prosody? Is it what the function is returning each time? I'm trying to think through it step by step but I'm not getting anywhere. Thanks in advance for your help!
Adrian
Let's just build the function from scratch. That's a good way to understand it thoroughly.
Suppose then that we want a recursive function to enumerate every combination of Ls and Ss to make a given meter length n. Let's just consider some simple cases:
n = 0: Only way to do this is with an empty string.
n = 1: Only way to do this is with a single S.
n = 2: You can do it with a single L, or two Ss.
n = 3: LS, SL, SSS.
Now, think about how you might build the answer for n = 4 given the above data. Well, the answer would either involve adding an S to a meter length of 3, or adding an L to a meter length of 2. So, the answer in this case would be LL, LSS from n = 2 and SLS, SSL, SSSS from n = 3. You can check that this is all possible combinations. We can also see that n = 2 and n = 3 can be obtained from n = 0,1 and n=1,2 similarly, so we don't need to special-case them.
Generally, then, for n ≥ 2, you can derive the strings for length n by looking at strings of length n-1 and length n-2.
Then, the answer is obvious:
if n = 0, return just an empty string
if n = 1, return a single S
otherwise, return the result of adding an S to all strings of meter length n-1, combined with the result of adding an L to all strings of meter length n-2.
By the way, the function as written is a bit inefficient because it recalculates a lot of values. That would make it very slow if you asked for e.g. n = 30. You can make it faster very easily by using the new lru_cache from Python 3.3:
#lru_cache(maxsize=None)
def virahanka1(n):
...
This caches results for each n, making it much faster.
I tried to melt my brain. I added print statements to explain to me what was happening. I think the most confusing part about recursive calls is that it seems to go into the call forward but come out backwards, as you may see with the prints when you run the following code;
def virahanka1(n):
if n == 4:
print 'Lets Begin for ', n
else:
print 'recursive call for ', n, '\n'
if n == 0:
print 'n = 0 so adding "" to below'
return [""]
elif n == 1:
print 'n = 1 so returning S for below'
return ["S"]
else:
print 'next recursivly call ' + str(n) + '-1 for S'
s = ["S" + prosody for prosody in virahanka1(n-1)]
print '"S" + each string in s equals', s
if n == 4:
print '**Above is the result for s**'
print 'n =',n,'\n', 'next recursivly call ' + str(n) + '-2 for L'
l = ["L" + prosody for prosody in virahanka1(n-2)]
print '\t','what was returned + each string in l now equals', l
if n == 4:
print '**Above is the result for l**','\n','**Below is the end result of s + l**'
print 'returning s + l',s+l,'for below', '\n','='*70
return s + l
virahanka1(4)
Still confusing for me, but with this and Jocke's elegant explanation, I think I can understand what is going on.
How about you?
Below is what the code above produces;
Lets Begin for 4
next recursivly call 4-1 for S
recursive call for 3
next recursivly call 3-1 for S
recursive call for 2
next recursivly call 2-1 for S
recursive call for 1
n = 1 so returning S for below
"S" + each string in s equals ['SS']
n = 2
next recursivly call 2-2 for L
recursive call for 0
n = 0 so adding "" to below
what was returned + each string in l now equals ['L']
returning s + l ['SS', 'L'] for below
======================================================================
"S" + each string in s equals ['SSS', 'SL']
n = 3
next recursivly call 3-2 for L
recursive call for 1
n = 1 so returning S for below
what was returned + each string in l now equals ['LS']
returning s + l ['SSS', 'SL', 'LS'] for below
======================================================================
"S" + each string in s equals ['SSSS', 'SSL', 'SLS']
**Above is the result for s**
n = 4
next recursivly call 4-2 for L
recursive call for 2
next recursivly call 2-1 for S
recursive call for 1
n = 1 so returning S for below
"S" + each string in s equals ['SS']
n = 2
next recursivly call 2-2 for L
recursive call for 0
n = 0 so adding "" to below
what was returned + each string in l now equals ['L']
returning s + l ['SS', 'L'] for below
======================================================================
what was returned + each string in l now equals ['LSS', 'LL']
**Above is the result for l**
**Below is the end result of s + l**
returning s + l ['SSSS', 'SSL', 'SLS', 'LSS', 'LL'] for below
======================================================================
This function says that:
virakhanka1(n) is the same as [""] when n is zero, ["S"] when n is 1, and s + l otherwise.
Where s is the same as the result of "S" prepended to each elements in the resulting list of virahanka1(n - 1), and l the same as "L" prepended to the elements of virahanka1(n - 2).
So the computation would be:
When n is 0:
[""]
When n is 1:
["S"]
When n is 2:
s = ["S" + "S"]
l = ["L" + ""]
s + l = ["SS", "L"]
When n is 3:
s = ["S" + "SS", "S" + "L"]
l = ["L" + "S"]
s + l = ["SSS", "SL", "LS"]
When n is 4:
s = ["S" + "SSS", "S" + "SL", "S" + "LS"]
l = ["L" + "SS", "L" + "L"]
s + l = ['SSSS", "SSL", "SLS", "LSS", "LL"]
And there you have it, step by step.
You need to know the results of the other function calls in order to calculate the final value, which can be pretty messy to do manually as you can see. It is important though that you do not try to think recursively in your head. This would cause your mind to melt. I described the function in words, so that you can see that these kind of functions is are descriptions, and not a sequence of commands.
The prosody you see, that is a part of s and l definitions, are variables. They are used in a list-comprehension, which is a way of building lists. I've described earlier how this list is built.

Regular expression puzzle

This is not homework, but an old exam question. I am curious to see the answer.
We are given an alphabet S={0,1,2,3,4,5,6,7,8,9,+}. Define the language L as the set of strings w from this alphabet such that w is in L if:
a) w is a number such as 42 or w is the (finite) sum of numbers such as 34 + 16 or 34 + 2 + 10
and
b) The number represented by w is divisible by 3.
Write a regular expression (and a DFA) for L.
This should work:
^(?:0|(?:(?:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\
+)*[369]0*)*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:
\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[
258](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0
\+)*[147])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+)(?:\+(?:0|(?:(?
:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:\+?(?:0\+)*
[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[258](?:0*(?
:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])*
(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)
*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+))*$
It works by having three states representing the sum of the digits so far modulo 3. It disallows leading zeros on numbers, and plus signs at the start and end of the string, as well as two consecutive plus signs.
Generation of regular expression and test bed:
a = r'0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*'
b = r'a[147]'
c = r'a[258]'
r1 = '[369]|[147](?:bc)*(?:c|bb)|[258](?:cb)*(?:b|cc)'
r2 = '(?:0|(?:(?:' + r1 + ')0*)+)'
r3 = '^' + r2 + r'(?:\+' + r2 + ')*$'
r = r3.replace('b', b).replace('c', c).replace('a', a)
print r
# Test on 10000 examples.
import random, re
random.seed(1)
r = re.compile(r)
for _ in range(10000):
x = ''.join(random.choice('0123456789+') for j in range(random.randint(1,50)))
if re.search(r'(?:\+|^)(?:\+|0[0-9])|\+$', x):
valid = False
else:
valid = eval(x) % 3 == 0
result = re.match(r, x) is not None
if result != valid:
print 'Failed for ' + x
Note that my memory of DFA syntax is woefully out of date, so my answer is undoubtedly a little broken. Hopefully this gives you a general idea. I've chosen to ignore + completely. As AmirW states, abc+def and abcdef are the same for divisibility purposes.
Accept state is C.
A=1,4,7,BB,AC,CA
B=2,5,8,AA,BC,CB
C=0,3,6,9,AB,BA,CC
Notice that the above language uses all 9 possible ABC pairings. It will always end at either A,B,or C, and the fact that every variable use is paired means that each iteration of processing will shorten the string of variables.
Example:
1490 = AACC = BCC = BC = B (Fail)
1491 = AACA = BCA = BA = C (Success)
Not a full solution, just an idea:
(B) alone: The "plus" signs don't matter here. abc + def is the same as abcdef for the sake of divisibility by 3. For the latter case, there is a regexp here: http://blog.vkistudios.com/index.cfm/2008/12/30/Regular-Expression-to-determine-if-a-base-10-number-is-divisible-by-3
to combine this with requirement (A), we can take the solution of (B) and modify it:
First read character must be in 0..9 (not a plus)
Input must not end with a plus, so: Duplicate each state (will use S for the original state and S' for the duplicate to distinguish between them). If we're in state S and we read a plus we'll move to S'.
When reading a number we'll go to the new state as if we were in S. S' states cannot accept (another) plus.
Also, S' is not "accept state" even if S is. (because input must not end with a plus).