Split line based on regex in Julia - regex

I'm interested in splitting a line using a regular expression in Julia. My input is a corpus in Blei's LDA-C format, where each line consists of a docId followed by wordID:wordCNT pairs. For example, a document with five words is represented as follows:
186 0:1 12:1 15:2 3:1 4:1
I'm looking for a way to aggregate words and their counts into separate arrays, i.e. my desired output:
words = [0, 12, 15, 3, 4]
counts = [1, 1, 2, 1, 1]
I've tried using m = match(r"(\d+):(\d+)",line). However, it only finds the first pair 0:1. I'm looking for something similar to Python's re.compile(r'[ :]').split(line). How would I split a line based on regex in Julia?

There's no need to use regex here; Julia's split function allows using multiple characters to define where the splits should occur:
julia> v = split(line, [':',' '])
11-element Array{SubString{String},1}:
"186"
"0"
"1"
"12"
"1"
"15"
"2"
"3"
"1"
"4"
"1"
julia> words = v[2:2:end]
5-element Array{SubString{String},1}:
"0"
"12"
"15"
"3"
"4"
julia> counts = v[3:2:end]
5-element Array{SubString{String},1}:
"1"
"1"
"2"
"1"
"1"

I discovered the eachmatch method that returns an iterator over the regex matches. An alternative solution is to iterate over each match:
words, counts = Int64[], Int64[]
for m in eachmatch(r"(\d+):(\d+)", line)
    wd, cnt = m.captures
    push!(words, parse(Int64, wd))
    push!(counts, parse(Int64, cnt))
end
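For comparison with the Python snippet mentioned in the question, here is a minimal sketch of the same match-iteration idea in Python. It assumes the sample line from the question; `re.finditer` plays the role that `eachmatch` plays in Julia.

```python
import re

line = "186 0:1 12:1 15:2 3:1 4:1"

words, counts = [], []
# finditer yields one match object per "wordID:count" pair;
# the leading document id has no colon, so it is never matched.
for m in re.finditer(r"(\d+):(\d+)", line):
    wd, cnt = m.groups()
    words.append(int(wd))
    counts.append(int(cnt))

print(words)
print(counts)
```

Converting inside the loop yields integer lists directly, which is what the desired output in the question asks for.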

As Matt B. mentions, there's no need for a Regex here as the Julia lib split() can use an array of chars.
However - when there is a need for Regex - the same split() function just works, similar to what others suggest here:
line = "186 0:1 12:1 15:2 3:1 4:1"
s = split(line, r":| ")
words = s[2:2:end]
counts = s[3:2:end]
I recently had to do exactly that in some Unicode processing code, where the split delimiters were "combined characters" and thus could not be written as Julia single-quoted Char literals:
split_chars = ["bunch","of","random","delims"]
line = "line_with_these_delims_in_the_middle"
r_split = Regex( join(split_chars, "|") )
split( line, r_split )
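One caveat when building a pattern by joining delimiters with "|": any delimiter containing regex metacharacters would be interpreted as a pattern rather than literally. The following Python sketch shows one way to guard against that. The delimiters and the input line are made up for illustration, and `re.escape` stands in for whatever escaping helper your language offers.

```python
import re

# Hypothetical multi-character delimiters; some contain regex metacharacters.
split_delims = ["->", "||", "."]
line = "a->b||c.d"

# Escape each delimiter so it is matched literally, and try longer
# delimiters first so a short one cannot shadow a longer one.
ordered = sorted(split_delims, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(d) for d in ordered))

parts = pattern.split(line)
print(parts)
```

Without the escaping step, "." would match any character and "||" would introduce empty alternatives, so the split points would be wrong.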

Related

Regex to find repeating numbers between other numbers

I have the following array and need two Regex filters that I want to use in PowerShell.
000111
010101
220220
123456
Filter 1: the number 0 occurs three or more times.
I expect the following values after filtering
000111
010101
Filter 2: any digit occurs three or more times.
I should only see these numbers.
000111
010101
220220
With 0{3,} I can only recognize consecutive digits, so I get only the number
000111
Is it possible to find repeating numbers that are between other numbers?
Since you insist on seeing a regex solution, look at this: '(\d).*\1.*\1'
I think this is comprehensible without further explanation, isn't it?
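As a rough illustration of how the backreference pattern behaves, here is a Python sketch that applies it to the four sample values from the question. The variable names are made up; for filter 1 the capture group is assumed to be the literal 0 instead of \d.

```python
import re

values = ["000111", "010101", "220220", "123456"]

# Filter 1: at least three zeros anywhere in the string.
three_zeros = re.compile(r"(0).*\1.*\1")
# Filter 2: any single digit occurring at least three times.
three_same = re.compile(r"(\d).*\1.*\1")

filter1 = [v for v in values if three_zeros.search(v)]
filter2 = [v for v in values if three_same.search(v)]

print(filter1)
print(filter2)
```

The `.*` between the backreferences is what allows other digits to sit between the repeated ones, which `0{3,}` cannot do.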
Armali's helpful answer is short and to the point (use '(0).*\1.*\1' for filter 1), and definitely the best solution for the problem at hand, given that you only need to know in the abstract if a given string has 3 or more zeros / same digits.
The solutions below may be of interest if you need to know the specific count of 0s / digits, which, as far as I know, cannot be handled by a regex (alone)
Occurrence-counting variant of filter 1:
@(
  '000111'
  '010101'
  '220220'
  '123456'
).ForEach({
  $zerosOnly = $_ -replace '[^0]'
  [pscustomobject] @{
    InputString = $_
    CountOfZeros = $zerosOnly.Length
  }
})
That is, each string in the input array (enumerated via the intrinsic ForEach() method), has all chars. that aren't '0' ([^0]) removed via the regex-based -replace operator. The length of the resulting string is therefore equivalent to the count of zeros.
Output:
InputString CountOfZeros
----------- ------------
000111                 3
010101                 3
220220                 2
123456                 0
Occurrence-counting variant of filter 2:
@(
  '000111'
  '010101'
  '220220'
  '123456'
).ForEach({
  $outputObject = [pscustomobject] @{ InputString = $_; DigitCounts = [ordered] @{} }
  ([char[]] $_ | Group-Object).ForEach({
    $outputObject.DigitCounts[$_.Name] = $_.Count
  })
  $outputObject
})
That is, each input string is grouped by its characters using Group-Object, whose output objects reflect the character at hand in the .Name property and the number of members of the group (i.e. the occurrence count for that character) in the .Count property. An ordered hashtable is used to report character-occurrence-count pairs.
Output:
InputString DigitCounts
----------- -----------
000111      {[0, 3], [1, 3]}
010101      {[0, 3], [1, 3]}
220220      {[0, 2], [2, 4]}
123456      {[1, 1], [2, 1], [3, 1], [4, 1]…}
E.g., {[0, 2], [2, 4]} in the output above means that the char. '0' occurs 2 times, and '2' 4 times in input string '220220'.
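The same per-character counting idea can be sketched in Python with `collections.Counter`. This is only an illustration of the grouping technique, not a translation of the PowerShell code; the names `digit_counts` and `repeated` are made up.

```python
from collections import Counter

values = ["000111", "010101", "220220", "123456"]

# Map each input string to a Counter of its characters.
digit_counts = {v: Counter(v) for v in values}

for v, c in digit_counts.items():
    print(v, dict(c))

# Strings in which some digit occurs at least three times (filter 2).
repeated = [v for v, c in digit_counts.items() if max(c.values()) >= 3]
print(repeated)
```

Once the counts are available, either filter reduces to a simple comparison on the counts rather than a regex.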

VBScript if condition specific

I'd like to create a specific "If" condition, but I don't know how.
I need a script that does something only when the user enters specific numbers. For example:
If String = "" or String = 0 and > 5 Then.....
The script should only do something if the user enters 1, 2, 3, or 4.
Does anybody know how to create it?
Here are a couple of ways.
Convert the string to a number and test the bounds:
If IsNumeric(someString) Then
    i = CLng(someString)
    If i >= 1 And i <= 4 Then
        ' Match
    End If
End If
Use Select Case and you can specify multiple values to match:
Select Case someString
    Case "1", "2", "3", "4"
        ' Match
End Select
Or, if you just want to do multiple individual tests, here's the basic If structure:
If someString = "1" Or someString = "2" Or someString = "3" Or someString = "4" Then
    ' Match
End If
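For readers who want to see the "convert, then test the bounds" logic outside VBScript, here is a minimal Python sketch. The function name `is_valid_choice` is hypothetical, and the 1 to 4 range is taken from the question.

```python
def is_valid_choice(text):
    """Return True only if text parses as an integer from 1 to 4."""
    try:
        number = int(text)
    except ValueError:
        # Empty strings and non-numeric input end up here.
        return False
    return 1 <= number <= 4


print(is_valid_choice("3"))
print(is_valid_choice(""))
```

Validating the conversion first avoids comparing an empty or non-numeric string against numbers, which is where the condition in the question goes wrong.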

Pattern matching and replacement in R

I am not familiar at all with regular expressions, and would like to do pattern matching and replacement in R.
I would like to replace the pattern #1, #2 in the vector: original = c("#1", "#2", "#10", "#11") with each value of the vector vec = c(1,2).
The result I am looking for is the following vector: c("1", "2", "#10", "#11")
I am not sure how to do that. I tried doing:
for(i in 1:2) {
  pattern = paste("#", i, sep = "")
  original = gsub(pattern, vec[i], original, fixed = TRUE)
}
but I get :
#> original
#[1] "1" "2" "10" "11"
instead of: "1" "2" "#10" "#11"
I would appreciate any help I can get! Thank you!
Specify that you are matching the entire string from start (^) to end ($).
Here, I've matched exactly the conditions you are looking at in this example, but I'm guessing you'll need to extend it:
> gsub("^#([1-2])$", "\\1", original)
[1] "1" "2" "#10" "#11"
So, that's basically, "from the start, look for a hash symbol followed by either the exact number one or two. The one or two should be just one digit (that's why we don't use * or + or something) and also ends the string. Oh, and capture that one or two because we want to 'backreference' it."
Another option using gsubfn:
library(gsubfn)
gsubfn("^#([1-2])$", I, original) ## Function substituting
[1] "1" "2" "#10" "#11"
Or, if you want to explicitly use the values of your vector vec:
gsubfn("^#[1-2]$", as.list(setNames(vec,c("#1", "#2"))), original)
Or formula notation equivalent to function notation:
gsubfn("^#([1-2])$", ~ x, original) ## formula substituting
Here's a slightly different take that uses a zero-width negative lookahead assertion (what a mouthful!). This is the (?!...) construct; here the pattern matches # at the start of a string as long as it is not followed by whatever is in .... In this case, that is two digits (or equivalently more, as long as they are contiguous). The matched # is replaced with nothing.
gsub( "^#(?![0-9]{2})" , "" , original , perl = TRUE )
[1] "1" "2" "#10" "#11"
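Both regex ideas from the answers (anchoring the whole string, and the negative lookahead) carry over to other regex engines. Here is a small Python sketch using the sample vector from the question; `re.sub` stands in for R's gsub, and the variable names are made up.

```python
import re

original = ["#1", "#2", "#10", "#11"]

# Anchored pattern: only "#1" or "#2" as the whole string is rewritten.
anchored = [re.sub(r"^#([12])$", r"\1", s) for s in original]

# Negative lookahead: drop a leading "#" unless two digits follow it.
lookahead = [re.sub(r"^#(?![0-9]{2})", "", s) for s in original]

print(anchored)
print(lookahead)
```

The two patterns agree on this input but differ in general: the lookahead version would also strip the # from a string like "#9" or "#x", while the anchored version touches only "#1" and "#2".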

Split a string on whitespace in Go?

Given an input string such as " word1 word2 word3 word4 ", what would be the best approach to split this as an array of strings in Go? Note that there can be any number of spaces or unicode-spacing characters between each word.
In Java I would just use someString.trim().split("\\s+").
(Note: possible duplicate Split string using regular expression in Go doesn't give any good quality answer. Please provide an actual example, not just a link to the regexp or strings packages reference.)
The strings package has a Fields function.
someString := "one two three four "
words := strings.Fields(someString)
fmt.Println(words, len(words)) // [one two three four] 4
DEMO: http://play.golang.org/p/et97S90cIH
From the docs:
Fields splits the string s around each instance of one or more consecutive white space characters, as defined by unicode.IsSpace, returning a slice of substrings of s or an empty slice if s contains only white space.
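To illustrate why splitting on runs of whitespace differs from splitting on a single space, here is a Python sketch of the same behavior. Python's argument-less `str.split()` is used as a stand-in for Fields, and the sample strings are made up.

```python
some_string = "  word1   word2\tword3\u00a0word4  "

# With no separator, split() treats any run of whitespace (including
# Unicode spaces) as one delimiter and ignores leading/trailing whitespace.
words = some_string.split()

# Splitting on a literal space instead yields empty strings for repeated spaces.
naive = "a  b".split(" ")

print(words)
print(naive)
```

The empty string in the second result is the reason a single-space separator is a poor fit when the amount of whitespace between words varies.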
If you're using tip: regexp.Split
func (re *Regexp) Split(s string, n int) []string
Split slices s into substrings separated by the expression and returns
a slice of the substrings between those expression matches.
The slice returned by this method consists of all the substrings
of s not contained in the slice returned by FindAllString. When called
on an expression that contains no metacharacters, it is equivalent to strings.SplitN.
Example:
s := regexp.MustCompile("a*").Split("abaabaccadaaae", 5)
// s: ["", "b", "b", "c", "cadaaae"]
The count determines the number of substrings to return:
n > 0: at most n substrings; the last substring will be the unsplit remainder.
n == 0: the result is nil (zero substrings)
n < 0: all substrings
I came up with the following, but that seems a bit too verbose:
import "regexp"
r := regexp.MustCompile("[^\\s]+")
r.FindAllString(" word1 word2 word3 word4 ", -1)
which will evaluate to:
[]string{"word1", "word2", "word3", "word4"}
Is there a more compact or more idiomatic expression?
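You can also use the Split function from the strings package:
strings.Split(someString, " ")
Note that strings.Split splits at every single space, so leading, trailing, and consecutive spaces produce empty strings in the result, and other Unicode whitespace is not treated as a separator. strings.Fields avoids both problems.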

Vim Sublist operations

I'm trying to create a script that counts how many times each distinct character occurs in a selection.
e.g.
a = 4 (the character "a" is 4 times in the selection)
b = 2
e = 10
\ = 2
etc.
To obtain this, I created a list of sublists like this:
[['a', 1], ['b', 1], ['e', 1], ['\', 1]] --> etc
(a = the character // 1 = the number of times the character is found in the text)
What I don't know is:
how to search in a sublist? e.g. can I check whether there is an "e" or "\" in the list?
when there is a match of "e", how can I add 1 to the number after the "e"?
[['e', 1]] --> [['e', 2]]
and how can I search in a sublist with a regex and print the result with an echo command?
e.g. search [a-f] and obtain this output:
a = 1
b = 1
e = 2
c, d, f are not found in the list and have to be skipped.
Btw... does anyone know where I can find good documentation about sublists?
(I can't find much information about sublists in the vim docs).
If I understand your problem correctly, the right data structure is a Dictionary mapping the character to the number of occurrences, not a list.
let occurrences = { 'a': 1, 'b': 1, 'e': 1, '\': 1 }
You can check for containment via has_key(occurrences, 'a'), and increment via let occurrences['a'] += 1. To print the results, use
for char in keys(occurrences)
    echo char occurrences[char] "times"
endfor
And you can use the powerful map() and filter() functions on the Dictionary. For example, to only include characters a-f:
echo filter(copy(occurrences), 'v:key =~# "[a-f]"')
Read more at :help Dictionary.
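The dictionary approach can be sketched end to end in Python to show the counting and filtering steps together. The sample text is hypothetical, and `re.fullmatch` is used here in place of Vim's `=~#` operator.

```python
import re

text = "a\\bee\\e"  # hypothetical selection: a \ b e e \ e

# Count each character: increment if present, start at 1 otherwise.
occurrences = {}
for ch in text:
    occurrences[ch] = occurrences.get(ch, 0) + 1

# Keep only the characters matching [a-f]; absent ones are simply skipped.
selected = {ch: n for ch, n in occurrences.items() if re.fullmatch(r"[a-f]", ch)}
for ch, n in selected.items():
    print(ch, "=", n)
```

A dictionary makes both the lookup and the increment a single operation, which is what the list-of-sublists layout in the question lacks.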