In Go, strings.SplitAfter splits text after each occurrence of a separator and returns the pieces as a slice, but I couldn't find a way to make the Regexp type split text after its matches. Is there a way to do that?
Example :
var text string = "1.2.3.4.5.6.7.8.9"
res := strings.Split(text, ".")
fmt.Println(res) // print [1 2 3 4 5 6 7 8 9]
res = strings.SplitAfter(text, ".")
fmt.Println(res) // print [1. 2. 3. 4. 5. 6. 7. 8. 9]
First of all, your regex "." is wrong for a splitAfter function. You want a number followed by ".", so the regex is "[1-9]".
The function you are looking for might look like this:
func splitAfter(s string, re *regexp.Regexp) (r []string) {
    // Put a "::" marker in front of every match, then split on the marker.
    // This assumes the input itself never contains "::".
    marked := re.ReplaceAllStringFunc(s, func(x string) string {
        return "::" + x
    })
    for _, x := range strings.Split(marked, "::") {
        if x != "" {
            r = append(r, x)
        }
    }
    return
}
Then:
fmt.Println(splitAfter("healthyRecordsMetric",regexp.MustCompile("[A-Z]")))
fmt.Println(splitAfter("healthyrecordsMETetric",regexp.MustCompile("[A-Z]")))
fmt.Println(splitAfter("HealthyHecord Hetrics",regexp.MustCompile("[A-Z]")))
fmt.Println(splitAfter("healthy records metric",regexp.MustCompile("[A-Z]")))
fmt.Println(splitAfter("1.2.3.4.5.6.7.8.9",regexp.MustCompile("[1-9]")))
[Healthy Records Metric]
[healthy Records Metric]
[healthyrecords M E Tetric]
[Healthy Hecord Hetrics]
[healthy records metric]
[1. 2. 3. 4. 5. 6. 7. 8. 9]
Good luck!
The Regexp type itself does not have a method that does exactly that, but it's quite simple to write a function that implements what you're asking for on top of Regexp functionality:
func SplitAfter(s string, re *regexp.Regexp) []string {
    var (
        r []string
        p int
    )
    is := re.FindAllStringIndex(s, -1)
    if is == nil {
        return append(r, s)
    }
    for _, i := range is {
        r = append(r, s[p:i[1]])
        p = i[1]
    }
    return append(r, s[p:])
}
Here I left a program to play with it.
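For comparison, the same index-based idea can be sketched in Python (not Go, and not part of the answer above): walk the end offset of every match and cut the input right after it. The name split_after is made up for this illustration, and empty matches are not treated specially.

```python
import re

def split_after(text, pattern):
    """Split text after every regex match, keeping the match in the piece."""
    parts = []
    pos = 0
    for m in re.finditer(pattern, text):
        # Cut right after the end of the current match.
        parts.append(text[pos:m.end()])
        pos = m.end()
    # Whatever follows the last match (possibly an empty string).
    parts.append(text[pos:])
    return parts

print(split_after("1.2.3.4.5.6.7.8.9", r"\."))
```

As with the Go version, an input without any match comes back as a single-element list.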
I have a string with repeated characters. My job is to find the starting index and ending index of each unique character in that string. Below is my code.
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    mo = re.search(item,x)
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag,m,n)
Output :
a 0 1
b 3 4
c 7 8
Here the end indexes of the characters are not correct. I understand why it's happening, but how can I pass the character to be matched dynamically to the regex search function? For instance, if I hardcode the character in the search function, it matches the whole run:
x = 'aabbbbccc'
xs = set(x)
mo = re.search("[b]+",x)
flag = "b"
m = mo.start()
n = mo.end()
print(flag,m,n)
output:
b 2 6
The code above covers the whole run of the character, but here I can't pass the character to be matched dynamically.
It would be a real help if someone could let me know how to achieve this; any hint will do. Thanks in advance.
String literal formatting to the rescue:
import re
x = "aaabbbbcc"
xs = set(x)
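for item in xs:
    # for patterns better use raw strings - and format the letter into it
    # (re.escape keeps characters such as "." or "+" from being read as regex syntax)
    mo = re.search(rf"{re.escape(item)}+", x)  # fr and rf both work: it's a raw formatted literal
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag, m, n)  # end() is exclusive; use n-1 for the index of the last character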
Output:
a 0 3 # you do see that the upper limit is off by 1?
b 3 7 # see above for fix
c 7 9
Your pattern does not need the [] around the letter - you are matching just one anyhow.
Without regex [1]:
x = "aaabbbbcc"
last_ch = x[0]
start_idx = 0
# process the remainder
for idx, ch in enumerate(x[1:], 1):
    if last_ch == ch:
        continue
    else:
        print(last_ch, start_idx, idx-1)
        last_ch = ch
        start_idx = idx
print(ch, start_idx, idx)
output:
a 0 2 # not off by 1
b 3 6
c 7 8
[1] RegEx: And now you have 2 problems...
Looking at the output, I'm guessing that another option would be,
import re
x = "aaabbbbcc"
xs = re.findall(r"((.)\2*)", x)
start = 0
output = ''
for item in xs:
    end = start + len(item[0])
    output += (f"{item[1]} {start} {end}\n")
    start = end
print(output)
Output
a 0 3
b 3 7
c 7 9
I think it'll be O(N); you can benchmark it, though, if you like.
import re, time
timer_on = time.time()
for i in range(10000000):
    x = "aabbbbccc"
    xs = re.findall(r"((.)\2*)", x)
    start = 0
    output = ''
    for item in xs:
        end = start + len(item[0])
        output += (f"{item[1]} {start} {end}\n")
        start = end
timer_off = time.time()
timer_total = timer_off - timer_on
print(timer_total)
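A small variation on the backreference idea, offered as a sketch rather than a replacement for the answers above: re.finditer yields match objects, so the start and the inclusive end index of each run can be read off directly instead of being accumulated by hand. The helper name char_runs is made up here.

```python
import re

def char_runs(text):
    """Return (char, start, inclusive_end) for each run of identical characters."""
    return [
        (m.group(1), m.start(), m.end() - 1)  # end() is exclusive, hence the -1
        for m in re.finditer(r"(.)\1*", text, flags=re.DOTALL)
    ]

for ch, start, end in char_runs("aaabbbbcc"):
    print(ch, start, end)
```

Unlike iterating over set(x), this reports runs in the order they occur and also handles a character that appears in more than one run.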
I am coming from an R background. I was able to implement the pattern search on a DataFrame column in R, but I am now struggling to do it in Spark Scala. Any help would be appreciated.
The problem statement is broken down into details below to describe it properly.
DF :
Case Freq
135322 265
183201,135322 36
135322,135322 18
135322,121200 11
121200,135322 8
112107,112107 7
183201,135322,135322 4
112107,135322,183201,121200,80000 2
I am looking for a pattern search UDF that returns all the matches of the pattern and the corresponding Freq values from the second column.
Example: for pattern 135322, I would like to find all the matches in the first column, Case. It should return the corresponding Freq numbers from the Freq column,
like 265,36,18,11,8,4,2.
For pattern 112107,112107 it should return just 7 because there is only one matching row.
This is how the end result should look
Case Freq results
135322 265 265+36+18+11+8+4+2
183201,135322 36 36+4+2
135322,135322 18 18+4
135322,121200 11 11+2
121200,135322 8 8+2
112107,112107 7 7
183201,135322,135322 4 4
112107,135322,183201,121200,80000 2 2
what i tried so far:
val text= DF.select("case").collect().map(_.getString(0)).mkString("|")
//search function for pattern search
val valsum = udf((txt: String, pattern : String)=> {
txt.split("\\|").count(_.contains(pattern))
} )
//apply the UDF on the first col
val dfValSum = DF.withColumn("results", valsum( lit(text),DF("case")))
This one works
import common.Spark.sparkSession
import java.util.regex.Pattern
import util.control.Breaks._
object playground extends App {
import org.apache.spark.sql.functions._
val pattern = "135322,121200" // Pattern you want to search for
// udf declaration
val coder: ((String, String) => Boolean) = (caseCol: String, pattern: String) =>
{
var result = true
val splitPattern = pattern.split(",")
val splitCaseCol = caseCol.split(",")
var foundAtIndex = -1
for (i <- 0 to splitPattern.length - 1) {
breakable {
for (j <- 0 to splitCaseCol.length - 1) {
if (j > foundAtIndex) {
println(splitCaseCol(j))
if (splitCaseCol(j) == splitPattern(i)) {
result = true
foundAtIndex = j
break
} else result = false
} else result = false
}
}
}
println(caseCol, result)
(result)
}
// registering the udf
val udfFilter = udf(coder)
//reading the input file
val df = sparkSession.read.option("delimiter", "\t").option("header", "true").csv("output.txt")
//calling the function and aggregating
df.filter(udfFilter(col("Case"), lit(pattern))).agg(lit(pattern), sum("Freq")).toDF("pattern","sum").show
}
if input is
135322,121200
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,121200|13.0|
+-------------+----+
if input is
135322,135322
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,135322|22.0|
+-------------+----+
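To make the matching rule easier to check outside Spark, here is a plain-Python sketch (not Scala, and not the UDF above) of the ordered matching the answer appears to aim for: every item of the pattern must occur in the Case value, in the same order, with other items allowed in between. The names is_subsequence and freq_sum are hypothetical; the rows are copied from the question.

```python
def is_subsequence(pattern, case):
    """True if the comma-separated items of pattern occur in case in order."""
    remaining = iter(case.split(","))
    # "item in iterator" consumes the iterator up to the first hit,
    # so each pattern item has to be found after the previous one.
    return all(item in remaining for item in pattern.split(","))

ROWS = [
    ("135322", 265),
    ("183201,135322", 36),
    ("135322,135322", 18),
    ("135322,121200", 11),
    ("121200,135322", 8),
    ("112107,112107", 7),
    ("183201,135322,135322", 4),
    ("112107,135322,183201,121200,80000", 2),
]

def freq_sum(pattern, rows=ROWS):
    """Sum Freq over all rows whose Case contains the pattern in order."""
    return sum(freq for case, freq in rows if is_subsequence(pattern, case))

print(freq_sum("135322,121200"))  # 11 + 2
print(freq_sum("135322,135322"))  # 18 + 4
```

Note that this is order-sensitive. The expected results in the question (for example 8+2 for 121200,135322) seem to count some rows regardless of order, so a multiset comparison may be closer to what was asked.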
Suppose I have the following local macro:
loc a = 12.000923
I would like to get the decimal position of the first non-zero decimal (4 in this example).
There are many ways to achieve this. One is to treat a as a string and to find the position of .:
loc a = 12.000923
loc b = strpos(string(`a'), ".")
di "`b'"
From here one could loop through the decimals and count until the first non-zero digit is reached. Of course this doesn't seem to be a very elegant approach.
Can you suggest a better way to deal with this? Regular expressions perhaps?
Well, I don't know Stata, but according to the documentation, \.(0+)? is supported and it shouldn't be hard to convert this two-line JavaScript function to Stata.
It returns the position of the first nonzero decimal or -1 if there is no decimal.
function getNonZeroDecimalPosition(v) {
var v2 = v.replace(/\.(0+)?/, "")
return v2.length !== v.length ? v.length - v2.length : -1
}
Explanation
We remove from input string a dot followed by optional consecutive zeros.
The difference between the lengths of original input string and this new string gives the position of the first nonzero decimal
Demo
Sample Snippet
function getNonZeroDecimalPosition(v) {
var v2 = v.replace(/\.(0+)?/, "")
return v2.length !== v.length ? v.length - v2.length : -1
}
var samples = [
"loc a = 12.00012",
"loc b = 12",
"loc c = 12.012",
"loc d = 1.000012",
"loc e = -10.00012",
"loc f = -10.05012",
"loc g = 0.0012"
]
samples.forEach(function(sample) {
console.log(getNonZeroDecimalPosition(sample))
})
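The length-difference idea explained above is not tied to JavaScript. As a rough sketch in Python (the function name is invented for this example, and the input is assumed to be the number written as a string):

```python
import re

def first_nonzero_decimal_position(value):
    """Position of the first non-zero decimal, or -1 if there is no decimal point."""
    # Remove the first dot together with the zeros that directly follow it.
    stripped = re.sub(r"\.0*", "", value, count=1)
    if stripped == value:
        return -1
    # One character for the dot plus one per removed zero.
    return len(value) - len(stripped)

for sample in ["12.000923", "12", "12.012", "0.0012"]:
    print(sample, first_nonzero_decimal_position(sample))
```

Like the original, it does not check that a non-zero digit actually follows, so "12.0" would report 2.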
You can do this in mata in one line and without using regular expressions:
foreach x in 124.000923 65.020923 1.000022030 0.0090843 .00000425 {
mata: selectindex(tokens(tokens(st_local("x"), ".")[selectindex(tokens(st_local("x"), ".") :== ".") + 1], "0") :!= "0")[1]
}
4
2
5
3
6
Below, you can see the steps in detail:
. local x = 124.000823
. mata:
: /* Step 1: break Stata's local macro x in tokens using . as a parsing char */
: a = tokens(st_local("x"), ".")
: a
1 2 3
+----------------------------+
1 | 124 . 000823 |
+----------------------------+
: /* Step 2: tokenize the string in a[1,3] using 0 as a parsing char */
: b = tokens(a[3], "0")
: b
1 2 3 4
+-------------------------+
1 | 0 0 0 823 |
+-------------------------+
: /* Step 3: find which values are different from zero */
: c = b :!= "0"
: c
1 2 3 4
+-----------------+
1 | 0 0 0 1 |
+-----------------+
: /* Step 4: find the first index position where this is true */
: d = selectindex(c :!= 0)[1]
: d
4
: end
You can also find the position of the string of interest in Step 2 using the same logic.
This is the index value after the one for .:
. mata:
: k = selectindex(a :== ".") + 1
: k
3
: end
In which case, Step 2 becomes:
. mata:
:
: b = tokens(a[k], "0")
: b
1 2 3 4
+-------------------------+
1 | 0 0 0 823 |
+-------------------------+
: end
For unexpected cases without decimal:
foreach x in 124.000923 65.020923 1.000022030 12 0.0090843 .00000425 {
if strmatch("`x'", "*.*") mata: selectindex(tokens(tokens(st_local("x"), ".")[selectindex(tokens(st_local("x"), ".") :== ".") + 1], "0") :!= "0")[1]
else display " 0"
}
4
2
5
0
3
6
A straightforward answer uses regular expressions and string functions.
One can select all decimals, find the first non 0 decimal, and finally find its position:
loc v = "123.000923"
loc v2 = regexr("`v'", "^[0-9]*[/.]", "") // 000923
loc v3 = regexr("`v'", "^[0-9]*[/.][0]*", "") // 923
loc first = substr("`v3'", 1, 1) // 9
loc first_pos = strpos("`v2'", "`first'") // 4: position of 9 in 000923
di "`v2'"
di "`v3'"
di "`first'"
di "`first_pos'"
Which in one step is equivalent to:
loc first_pos2 = strpos(regexr("`v'", "^[0-9]*[/.]", ""), substr(regexr("`v'", "^[0-9]*[/.][0]*", ""), 1, 1))
di "`first_pos2'"
An alternative suggested in another answer is to compare the length of the decimals block with its leading 0s stripped against the length of the full block.
In one step this is:
loc first_pos3 = strlen(regexr("`v'", "^[0-9]*[/.]", "")) - strlen(regexr("`v'", "^[0-9]*[/.][0]*", "")) + 1
di "`first_pos3'"
Not using regex but log10 instead (which treats a number like a number), this function will:
For numbers >= 1 or numbers <= -1, return as a positive number the count of digits to the left of the decimal point.
Or (and more specifically to what you were asking), for numbers between -1 and 1, return as zero or a negative number the count of zeros between the decimal point and the first non-zero digit, so the position of that digit is one more than the absolute value returned. Exact powers of ten such as 0.01 are an edge case: log10 is -2 there, so -2 is returned although only one zero precedes the 1.
const digitsFromDecimal = (n) => {
    // truncate log10 of the magnitude toward zero
    let dFD = Math.log10(Math.abs(n)) | 0;
    if (n >= 1 || n <= -1) { dFD++; }
    return dFD;
}
var x = [118.8161330, 11.10501660, 9.254180571, -1.245501523, 1, 0, 0.864931613, 0.097007836, -0.010880074, 0.009066729];
x.forEach(element => {
console.log(`${element}, Digits from Decimal: ${digitsFromDecimal(element)}`);
});
// Output
// 118.816133, Digits from Decimal: 3
// 11.1050166, Digits from Decimal: 2
// 9.254180571, Digits from Decimal: 1
// -1.245501523, Digits from Decimal: 1
// 1, Digits from Decimal: 1
// 0, Digits from Decimal: 0
// 0.864931613, Digits from Decimal: 0
// 0.097007836, Digits from Decimal: -1
// -0.010880074, Digits from Decimal: -1
// 0.009066729, Digits from Decimal: -2
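If the goal is the position of the first non-zero decimal itself (4 for 12.000923), a numeric route that avoids binary floating-point rounding is decimal arithmetic. The following is only a sketch in Python, assuming the value is available as a string; first_nonzero_decimal_pos is a made-up name and 0 is used here to mean "no non-zero decimal".

```python
from decimal import Decimal

def first_nonzero_decimal_pos(text):
    """Position of the first non-zero digit after the decimal point, 0 if none."""
    fraction = abs(Decimal(text)) % 1  # keep only the fractional part
    if fraction.is_zero():
        return 0
    # adjusted() is the exponent of the most significant digit,
    # e.g. -4 for 0.000923, so negating it gives the position.
    return -fraction.adjusted()

for sample in ["12.000923", "0.097007836", "-10.05012", "12"]:
    print(sample, first_nonzero_decimal_pos(sample))
```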
Pearly's Mata solution is very likable, but attention should be paid to "unexpected" cases of "no decimal at all".
Besides, a regular expression is not too bad a choice when it can be written as a memorable one-liner.
loc v = "123.000923"
capture local x = regexm("`v'","(\.0*)")*length(regexs(0))
The code below tests more values of v.
foreach v in 124.000923 605.20923 1.10022030 0.0090843 .00000425 12 .000125 {
capture local x = regexm("`v'","(\.0*)")*length(regexs(0))
di "`v': The wanted number = `x'"
}
I would like to change "I run 12 miles" to "I run 24 miles" using regex, where "12" can be any number. I use the function replace2() to do the math operation, but passing the capture parameter "$1" to it as replace2("$1") does not work. Please help!
func replace( str:String, pattern : String, repl:String)->String?{
let regex = NSRegularExpression .regularExpressionWithPattern(pattern, options: nil,error: nil)
let replacedString = regex.stringByReplacingMatchesInString( str, options: nil,
range: NSMakeRange(0, countElements( str)), withTemplate: repl )
return replacedString
}
replace( "I run 12 miles and walk 12.45 km","\\d+(.\\d+)?", "-")!
func replace2(x: String)->String {
let xx = x.toInt()! * 2 // error : return nil?
return String(format: "\(xx)")
}
replace( "I run 12 miles ","(\\d+)", replace2("$1") )! // error
Try this in a playground:
let str1: NSString = "I run 12 miles"
let str2 = "I run 12 miles"
let match = str1.rangeOfString("\\d+", options: .RegularExpressionSearch)
let finalStr = str1.substringWithRange(match).toInt()
let n: Double = 2.2*Double(finalStr!)
let newStr = str2.stringByReplacingOccurrencesOfString("\\d+", withString: "\(n)", options: NSStringCompareOptions.RegularExpressionSearch, range: nil)
println(newStr) //I run 26.4 miles
//or more simply
let newStr2 = "I run \(n) kilometers"
//yes, I know my conversion is off
If you want to dynamically catch the values and replace them based upon the captured value, you have to use enumerateMatchesInString. You can then write a loop that does the replacements:
func stringWithDoubleNumbers(string: String) -> String {
// build array of ranges that need replacing
var ranges = [NSRange]()
let regex = NSRegularExpression(pattern: "[\\d.]+", options: nil, error: nil) as NSRegularExpression
regex.enumerateMatchesInString(string, options: nil, range: NSMakeRange(0, countElements(string))) {
match, flags, stop in
ranges.append(match.range)
}
var doubledString = NSMutableString(string: string)
// iterate backwards so that location is valid despite other replacements
for range in reverse(ranges) {
let foundString = doubledString.substringWithRange(range)
if let value = foundString.toInt() {
let numericString = "\(value * 2)"
doubledString.replaceCharactersInRange(range, withString: numericString)
} else {
let value = NSString(string: foundString).doubleValue
let numericString = "\(value * 2.0)"
doubledString.replaceCharactersInRange(range, withString: numericString)
}
}
return doubledString
}
I do some additional logic to handle integer values differently than floating point types, but you could simplify this if you don't want to worry about that complication.
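The underlying idea, computing each replacement from the text that was matched, is not specific to Swift. As a rough illustration in Python rather than Swift, re.sub accepts a function as the replacement, which also explains why passing "$1" to an ordinary function cannot work: the function has to be called once per match. The name double_numbers is hypothetical.

```python
import re

def double_numbers(text):
    """Replace every integer or decimal number in text with twice its value."""
    def repl(match):
        token = match.group(0)
        # Keep integers as integers; only decimals go through float.
        if "." in token:
            return str(float(token) * 2)
        return str(int(token) * 2)
    return re.sub(r"\d+(?:\.\d+)?", repl, text)

print(double_numbers("I run 12 miles and walk 12.45 km"))
```

Because the callback receives each match in turn, there is no need to collect ranges first and replace them back to front.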
// I experimented with Steve's and Rob's code above, and I think I can get it to solve the problem for a simple string. I still have problems with complex strings: I am trying to parse a data record with multiple (-20) numbers, either integers or doubles and in no particular order, such as handling both str2 and str3 in a single routine. Anyway, thanks to Steve and Rob.
let str2 = "I run 89 miles and 28.576 km"
let str3 = "I run 89.45 miles and 34 km in 34F degree"
var re = NSRegularExpression(pattern: "(\\d+).* (\\d+\\.\\d+)", options: nil, error: nil)
var match = re.firstMatchInString(str2, options: nil, range: NSMakeRange(0, countElements(str2)))
var startIndex = advance(str2.startIndex, match.rangeAtIndex(1).location)
var endIndex = advance(str2.startIndex, match.rangeAtIndex(1).location + match.rangeAtIndex(1).length)
let x1 = str2.substringWithRange(Range(start: startIndex, end: endIndex)).toInt()!
x1
startIndex = advance(str2.startIndex, match.rangeAtIndex(2).location)
endIndex = advance(str2.startIndex, match.rangeAtIndex(2).location + match.rangeAtIndex(2).length)
let x2 = str2.substringWithRange(Range(start: startIndex, end: endIndex))
let x2x = NSString(string: x2).doubleValue
let n1 = x1 * 2
let n2: Double = 2.0 * x2x
str2.stringByReplacingOccurrencesOfString("(\\d+)(.* )(\\d+\\.\\d+)", withString: "\(n1)$2\(n2)", options: NSStringCompareOptions.RegularExpressionSearch, range: nil)
// I run 178 miles and 57.152 km
If I have a list of items like this:
local items = { "apple", "orange", "pear", "banana" }
how do I check if "orange" is in this list?
In Python I could do:
if "orange" in items:
# do something
Is there an equivalent in Lua?
You could use something like a set from Programming in Lua:
function Set (list)
local set = {}
for _, l in ipairs(list) do set[l] = true end
return set
end
Then you could put your list in the Set and test for membership:
local items = Set { "apple", "orange", "pear", "banana" }
if items["orange"] then
-- do something
end
Or you could iterate over the list directly:
local items = { "apple", "orange", "pear", "banana" }
for _,v in pairs(items) do
if v == "orange" then
-- do something
break
end
end
Use the following representation instead:
local items = { apple=true, orange=true, pear=true, banana=true }
if items.apple then
...
end
You're seeing firsthand one of the cons of Lua having only one data structure---you have to roll your own. If you stick with Lua you will gradually accumulate a library of functions that manipulate tables in the way you like to do things. My library includes a list-to-set conversion and a higher-order list-searching function:
function table.set(t) -- set of list
local u = { }
for _, v in ipairs(t) do u[v] = true end
return u
end
function table.find(f, l) -- find element v of l satisfying f(v)
for _, v in ipairs(l) do
if f(v) then
return v
end
end
return nil
end
Write it however you want, but it's faster to iterate directly over the list, than to generate pairs() or ipairs()
#! /usr/bin/env lua
local items = { 'apple', 'orange', 'pear', 'banana' }
local function locate( table, value )
for i = 1, #table do
if table[i] == value then print( value ..' found' ) return true end
end
print( value ..' not found' ) return false
end
locate( items, 'orange' )
locate( items, 'car' )
orange found
car not found
Lua tables are closer analogs of Python dictionaries than of lists. The table you have created is essentially a 1-based indexed array of strings. Use any standard search algorithm to find out if a value is in the array. Another approach would be to store the values as table keys instead, as shown in the set implementation in Jon Ericson's post.
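Since the question starts from Python's in operator, the two approaches can be put side by side in Python as a rough analogy (this is not Lua code): a linear scan over a list corresponds to looping over the array part of a table, while a dict with the values as keys corresponds to the items = { apple=true, ... } representation. All names below are made up for the illustration.

```python
items = ["apple", "orange", "pear", "banana"]

def contains_linear(seq, value):
    """Scan the sequence element by element, like a loop over a Lua array."""
    for element in seq:
        if element == value:
            return True
    return False

# Values become keys, like the Lua table { apple=true, orange=true, ... }.
item_set = {name: True for name in items}

print(contains_linear(items, "orange"))  # linear scan
print(item_set.get("orange", False))     # single keyed lookup
print(item_set.get("car", False))        # missing key reads as false, like nil in Lua
```

The scan costs time proportional to the list length on every query; the keyed form pays once to build the table and then answers each lookup directly.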
This is a Swiss Army knife function you can use:
function table.find(t, val, recursive, metatables, keys, returnBool)
if (type(t) ~= "table") then
return nil
end
local checked = {}
local _findInTable
local _checkValue
_checkValue = function(v)
if (not checked[v]) then
if (v == val) then
return v
end
if (recursive and type(v) == "table") then
local r = _findInTable(v)
if (r ~= nil) then
return r
end
end
if (metatables) then
local r = _checkValue(getmetatable(v))
if (r ~= nil) then
return r
end
end
checked[v] = true
end
return nil
end
_findInTable = function(t)
for k,v in pairs(t) do
-- _checkValue takes a single argument; report the key under which the match was found
if (_checkValue(v) ~= nil) then
return k
end
if (keys and _checkValue(k) ~= nil) then
return k
end
end
return nil
end
local r = _findInTable(t)
if (returnBool) then
return r ~= nil
end
return r
end
You can use it to check if a value exists:
local myFruit = "apple"
if (table.find({"apple", "pear", "berry"}, myFruit)) then
print(table.find({"apple", "pear", "berry"}, myFruit)) -- 1
end
You can use it to find the key:
local fruits = {
apple = {color="red"},
pear = {color="green"},
}
local myFruit = fruits.apple
local fruitName = table.find(fruits, myFruit)
print(fruitName) -- "apple"
I hope the recursive parameter speaks for itself.
The metatables parameter allows you to search metatables as well.
The keys parameter makes the function look for keys in the list. Of course that would be useless in Lua (you can just do fruits[key]) but together with recursive and metatables, it becomes handy.
The returnBool parameter is a safeguard for when you have tables that have false as a key (yes, that's possible: fruits = {[false]="apple"})
function valid(data, array)
local valid = {}
for i = 1, #array do
valid[array[i]] = true
end
-- true when data is one of the array's values, false otherwise
return valid[data] == true
end
Here's the function I use for checking if data is in an array.
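Sort of a solution using a metatable. As far as I know it relies on table.insert triggering __newindex, which holds in Lua 5.3 and later but not in 5.1/5.2, where table.insert uses raw access: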
local function preparetable(t)
setmetatable(t,{__newindex=function(self,k,v) rawset(self,v,true) end})
end
local workingtable={}
preparetable(workingtable)
table.insert(workingtable,123)
table.insert(workingtable,456)
if workingtable[456] then
...
end
The following representation can be used:
local items = {
["apple"]=true, ["orange"]=true, ["pear"]=true, ["banana"]=true
}
if items["apple"] then print("apple is a true value.") end
if not items["red"] then print("red is a false value.") end
Related output:
apple is a true value.
red is a false value.
You can also use the following code to check boolean validity:
local items = {
["apple"]=true, ["orange"]=true, ["pear"]=true, ["banana"]=true,
["red"]=false, ["blue"]=false, ["green"]=false
}
if items["yellow"] == nil then print("yellow is an inappropriate value.") end
if items["apple"] then print("apple is a true value.") end
if not items["red"] then print("red is a false value.") end
The output is:
yellow is an inappropriate value.
apple is a true value.
red is a false value.
Check Tables Tutorial for additional information.
function table.find(t,value)
if t and type(t)=="table" and value then
for _, v in ipairs (t) do
if v == value then
return true;
end
end
return false;
end
return false;
end
you can use this solution:
items = { 'a', 'b' }
for k,v in pairs(items) do
if v == 'a' then
--do something
else
--do something
end
end
or
items = {'a', 'b'}
for k,v in pairs(items) do
while v do
if v == 'a' then
return found
else
break
end
end
end
return nothing
A simple function can be used that :
returns nil, if the item is not found in table
returns index of item, if item is found in table
local items = { "apple", "orange", "pear", "banana" }
local function search_value (tbl, val)
for i = 1, #tbl do
if tbl[i] == val then
return i
end
end
return nil
end
print(search_value(items, "pear"))
print(search_value(items, "cherry"))
output of above code would be
3
nil