Split a string on whitespace in Go?

Given an input string such as " word1 word2 word3 word4 ", what would be the best approach to split this as an array of strings in Go? Note that there can be any number of spaces or unicode-spacing characters between each word.
In Java I would just use someString.trim().split("\\s+").
(Note: possible duplicate Split string using regular expression in Go doesn't give any good quality answer. Please provide an actual example, not just a link to the regexp or strings packages reference.)

The strings package has a Fields function.
someString := "one two three four "
words := strings.Fields(someString)
fmt.Println(words, len(words)) // [one two three four] 4
DEMO: http://play.golang.org/p/et97S90cIH
From the docs:
Fields splits the string s around each instance of one or more consecutive white space characters, as defined by unicode.IsSpace, returning a slice of substrings of s or an empty slice if s contains only white space.

If you're using tip: regexp.Split
func (re *Regexp) Split(s string, n int) []string
Split slices s into substrings separated by the expression and returns
a slice of the substrings between those expression matches.
The slice returned by this method consists of all the substrings
of s not contained in the slice returned by FindAllString. When called
on an expression that contains no metacharacters, it is equivalent to strings.SplitN.
Example:
s := regexp.MustCompile("a*").Split("abaabaccadaaae", 5)
// s: ["", "b", "b", "c", "cadaaae"]
The count determines the number of substrings to return:
n > 0: at most n substrings; the last substring will be the unsplit remainder.
n == 0: the result is nil (zero substrings)
n < 0: all substrings

I came up with the following, but that seems a bit too verbose:
import "regexp"
r := regexp.MustCompile("[^\\s]+")
r.FindAllString(" word1 word2 word3 word4 ", -1)
which will evaluate to:
[]string{"word1", "word2", "word3", "word4"}
Is there a more compact or more idiomatic expression?

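You can use the Split function from the strings package:
strings.Split(someString, " ")
Note that this splits on every single space, so leading, trailing, or repeated spaces produce empty strings in the result, and other whitespace characters are not treated as separators.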

Related

Shorten Code Logic For String Manipulation

Examples
"123456" would be ["123", "456"].
"1234567891011" would be ["123", "456", "789", "10", "11"].
I have come up with this logic using regex to solve the challenge but I am being asked if there is a way to make the logic shorter.
def ft(str)
end
The result from the scan contains a lot of whitespace, so after the join operation I am left with double or triple dashes; I used .gsub(/-+/, '-') to fix that. I also noticed that there is sometimes a dash at the beginning or the end of the string, so I used .gsub(/^-|-$/, '') to fix that too.
Any Ideas?
Slice the string in chunks of max 3 digits (s.scan(/.{1,3}/)).
Check if the last chunk has only 1 character. If so, take the last char of the chunk before and prepend it to the last.
Glue the chunks together using join(" ")
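The three steps above are language-independent, so here is a minimal sketch of them in Python rather than Ruby. The function name chunk_digits is hypothetical, and the sketch assumes that non-digit characters should simply be discarded first, as the other answers do.

```python
import re

def chunk_digits(s):
    """Split the digits of s into groups of 3, never leaving a 1-digit group."""
    digits = re.sub(r"\D", "", s)  # assumption: drop everything that is not a digit
    # Step 1: slice into chunks of at most 3 digits.
    chunks = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    # Step 2: if the last chunk is a lone digit, borrow the last digit
    # of the chunk before it and prepend it.
    if len(chunks) >= 2 and len(chunks[-1]) == 1:
        chunks[-1] = chunks[-2][-1] + chunks[-1]
        chunks[-2] = chunks[-2][:-1]
    return chunks

print(chunk_digits("123456"))         # ['123', '456']
print(chunk_digits("1234567891011"))  # ['123', '456', '789', '10', '11']
```

Step 3 is then just a join on whatever separator is wanted, for example "-".join(chunk_digits(s)).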
Inspired by @steenslag's recommendation. (There are quite a few other ways to achieve the same with varying levels of verbosity and esotericism)
Here is how I would go about it:
def format_str(str)
numbers = str.delete("^0-9").scan(/.{1,3}/)
# there are a number of ways this operation can be performed
numbers.concat(numbers.pop(2).join.scan(/../)) if numbers.last.length == 1
numbers.join('-')
end
Breakdown:
numbers = str.delete("^0-9") - delete any non numeric character from the string
.scan(/.{1,3}/) - scan them into groups of 1 to 3 characters
numbers.concat(numbers.pop(2).join.scan(/../)) if numbers.last.length == 1 - If the last element has a length of 1 then remove the last 2 elements join them and then scan them into groups of 2 and add these groups back to the Array
numbers.join('-') - join the numbers with a hyphen to return a formatted String
Example:
require 'securerandom'
10.times do
s = SecureRandom.hex
puts "Original: #{s} => Formatted: #{format_str(s)}"
end
# Original: fd1bbce41b1c784ce6ad5303d868bbe9 => Formatted: 141-178-465-303-86-89
# Original: af04bd4d4d6beb5a0412a692d5d3d42d => Formatted: 044-465-041-269-253-42
# Original: 9a1833a43cbef51c3f3c21baa66fe996 => Formatted: 918-334-351-332-166-996
# Original: 4104ae13c998cec896997b9919bdafb3 => Formatted: 410-413-998-896-997-991-93
# Original: 0eb49065472240ba32b3c029f897b30d => Formatted: 049-065-472-240-323-029-897-30
# Original: 4c68f9f68e8f6132c0ed5b966d639cf4 => Formatted: 468-968-861-320-596-663-94
# Original: 65987ee04aea8fb533dbe38c0fea7d63 => Formatted: 659-870-485-333-807-63
# Original: aa8aaf1cf59b52c9ad7db6d4b1ae0cbb => Formatted: 815-952-976-410
# Original: 8eb6b457059f91fd06ccbac272db8f4e => Formatted: 864-570-599-106-272-84
# Original: 1c65825ed59dcdc6ec18af969938ea57 => Formatted: 165-825-596-189-699-38-57
That being said to modify your existing code this will work as well:
def format_str(str)
str
.delete("^0-9")
.scan(/(?=\d{5})\d{3}|(?=\d{3}$)\d{3}|\d{2}/)
.join('-')
end
Here are three more ways to do that.
Use String#scan with a regular expression
def fmt(str)
str.delete("^0-9").scan(/\d{2,3}(?!\d\z)/)
end
The regular expression reads, "match two or three digits provided they are not followed by a single digit at the end of the string". (?!\d\z) is a negative lookahead (which is not part of the match). As matches are greedy by default, the regex engine will always match three digits if possible.
Solve by recursion
def fmt(str)
recurse(str.delete("^0-9"))
end
def recurse(s)
case s.size
when 2,3
[s]
when 4
[s[0,2], s[2,2]]
else
[s[0,3], *recurse(s[3..])]
end
end
Determine the last two matches from the size of the string
def fmt(str)
s = str.delete("^0-9")
if s.size % 3 == 1
s[0..-5].scan(/\d{3}/) << s[-4,2] << s[-2,2]
else
s.scan(/\d{2,3}/)
end
end
All methods exhibit the following behaviour.
["5551231234", "18883319", "123456", "1234567891011"].each do |str|
puts "#{str}: #{fmt(str)}"
end
5551231234: ["555", "123", "12", "34"]
18883319: ["188", "833", "19"]
123456: ["123", "456"]
1234567891011: ["123", "456", "789", "10", "11"]
An approach:
def foo(s)
s.gsub(/\D/, '').scan(/\d{1,3}/).each_with_object([]) do |x, arr|
if x.size == 3 || arr == []
arr << x
else
y = arr.last
arr[-1] = y[0...-1]
arr << "#{y[-1]}#{x}"
end
end
end
Remove all non-digits characters, then scan for 1 to 3 digit chunks. Iterate over them. If it's the first time through or the chunk is three digits, add it to the array. If it isn't, take the last digit from the previous chunk and prepend it to the current chunk and add that to the array.
Alternatively, without generating a new array.
def foo(s)
s.gsub(/\D/, '').scan(/\d{1,3}/).instance_eval do |y|
y.each_with_index do |x, i|
if x.size == 1
y[i] = "#{y[i-1][-1]}#{x}"
y[i-1] = y[i-1][0...-1]
end
end
end
end
Without changing your code too much and without adjusting your actual regex, I might suggest replacing scan with split in order to avoid all the extra nil values; replacing gsub with tr which is much faster; and then using reject(&:empty?) to loop through and remove any blank array elements before joining with whatever character you want:
string = "12345fg\n\t\t\t 67"
string.tr("^0-9", "")
.split(/(?=\d{5})(\d{3})|(?=\d{3}$)(\d{3})|(\d{1,2})/)
.reject(&:empty?)
.join("-")
#=> 123-45-67
Not suggesting this is the best approach, but wanted to offer a little food for thought:
You can basically reduce the logic for your challenge to test for 1 single condition and to use 2 very simple pattern matches:
Condition to test for: Number of characters is more than 3 and has a modulo(3) of 1. This condition will require the use of both pattern matches.
All other conditions will use a single pattern match so no reason to test for those.
This could probably be made a little less verbose but it’s all spelled out pretty well for clarity:
def format(s)
n = s.delete("^0-9")
regex_1 = /.{1,3}/
regex_2 = /../
if [n.length-3, 0].max.modulo(3) == 1
a = n[0..-5].scan(regex_1) + n[-4..-1].scan(regex_2)
else
a = n.scan(regex_1)
end
a.join("-")
end

Comparing filenames and determine their incremental digits

Imagine I have a sequence of files, e.g.:
...
segment8_400_av.ts
segment9_400_av.ts
segment10_400_av.ts
segment11_400_av.ts
segment12_400_av.ts
...
When the filenames are known, I can match against them with a regular expression like:
/segment(\d+)_400_av\.ts/
because I know the incremental pattern.
But what would be a generic approach to this? I mean, how can I take two file names out of the list, compare them, and find out where in the file name the counting part is, taking into account any other digits that can occur in the filename (the 400 in this case)?
Goal: I want to run the script against various file sequences to check, for example, for missing files, so the first step is to find out the numbering scheme. File sequences can occur in many different fashions, e.g.:
test_1.jpg (simple counting suffix)
test_2.jpg
...
or
segment9_400_av.ts (counting part in between, with other static digits)
segment10_400_av.ts
...
or
01_trees_00008.dpx (padded with zeros)
01_trees_00009.dpx
01_trees_00010.dpx
Edit 2: Probably my problem can be described more simply: With a given set of files, I want to:
1. Find out if they are a numbered sequence of files, with the rules below
2. Get the first file number, the last file number, and the file count
3. Detect missing files (gaps in the sequence)
Rules:
As melpomene summarized in his answer, the file names only differ in one substring, which consists only of digits
The counting digits can occur anywhere in the filename
The digits can be padded with 0's (see example above)
I can do #2 and #3; what I am struggling with is #1 as a starting point.
You tagged this question regex, so here's a regex-based solution:
use strict;
use warnings;
my $name1 = 'segment12_400_av.ts';
my $name2 = 'segment10_400_av.ts';
if (
"$name1\0$name2" =~ m{
\A
( \D*+ (?: \d++ \D++ )* ) # prefix
( \d++ ) # numeric segment 1
( [^\0]* ) # suffix
\0 # separator
\1 # prefix
( \d++ ) # numeric segment 2
\3 # suffix
\z
}xa
) {
print <<_EOT_;
Result of comparing "$name1" and "$name2"
Common prefix: $1
Common suffix: $3
Varying numeric parts: $2 / $4
Position of varying numeric part: $-[2]
_EOT_
}
Output:
Result of comparing "segment12_400_av.ts" and "segment10_400_av.ts"
Common prefix: segment
Common suffix: _400_av.ts
Varying numeric parts: 12 / 10
Position of varying numeric part: 7
It assumes that
the strings are different (guard the condition with $name1 ne $name2 && ... if that's not guaranteed)
there's only one substring that's different between the input strings (otherwise it won't find any match)
the differing substring consists of digits only
all digits surrounding the first point of difference are part of the varying increment (e.g. the example above recognizes segment as the common prefix, not segment1)
The idea is to combine the two names into a single string (separated by NUL, which is unambiguous because filenames can't contain \0), then let the regex engine do the hard work of finding the longest common prefix (using greediness and backtracking).
Because we're in a regex, we can get a bit more fancy than just finding the longest common prefix: We can make sure that the prefix doesn't end with a digit (see the segment1 vs. segment case above) and we can verify that the suffix is also the same.
See if this works for you:
use strict;
use warnings;
sub compare {
my ( $f1, $f2 ) = @_;
my @f1 = split /(\d+)/sxm, $f1;
my @f2 = split /(\d+)/sxm, $f2;
my $i = 0;
my $out1 = q{};
my $out2 = q{};
foreach my $p (@f1) {
if ( $p eq $f2[$i] ) {
$out1 .= $p;
$out2 .= $p;
}
else {
$out1 .= sprintf ' ((%s)) ', $p;
$out2 .= sprintf ' ((%s)) ', $f2[$i];
}
$i++;
}
print $out1 . "\n";
print $out2 . "\n";
return;
}
print "Test1:\n";
compare( 'segment8_400_av.ts', 'segment9_400_av.ts' );
print "\n\nTest2:\n";
compare( 'segment999_8_400_av.ts', 'segment999_9_400_av.ts' );
You basically split the strings on runs of digits, then loop through the items and compare each of the 'pieces'. If they are equal, you accumulate. If not, you highlight the differences and accumulate.
Output (I'm using ((number)) for the highlight)
Test1:
segment ((8)) _400_av.ts
segment ((9)) _400_av.ts
Test2:
segment999_ ((8)) _400_av.ts
segment999_ ((9)) _400_av.ts
I assume that only the counter differs across the strings
use warnings;
use strict;
use feature 'say';
my ($fn1, $fn2) = ('segment8_400_av.ts', 'segment12_400_av.ts');
# Collect all numbers from all strings
my @nums = map { [ /([0-9]+)/g ] } ($fn1, $fn2);
my ($n, $pos); # which number in the string, at what position
# Find which differ
NUMS:
for my $j (1..$#nums) { # strings
for my $i (0..$#{$nums[0]}) { # numbers in a string
if ($nums[$j]->[$i] != $nums[0]->[$i]) { # it is i-th number
$n = $i;
$fn1 =~ /($nums[0]->[$i])/g; # to find position
$pos = $-[$i];
say "It is $i-th number in a string. Position: $pos";
last NUMS;
}
}
}
We loop over the array with arrayrefs of numbers found in each string, and over elements of each arrayref (eg [8, 400]). Each number in a string (0th or 1st or ...) is compared to its counterpart in the 0-th string (array element); all other numbers are the same.
The number of interest is the one that differs and we record which number in a string it is ($n-th).
Then its position in the string is found by matching it again and using the @- regex variable with (the just established) index $n, so the offset of the start of the n-th match. This part may be unneeded; while question edits helped I am still unsure whether the position may or may not be useful.
Prints, with position counting from 0
It is 0-th number in a string. Position: 7
Note that, once it is found that it is the $i-th number, we can't use index to find its position; a number earlier in the string may happen to be the same as the $i-th one.
To test, modify input strings by adding the same number to each, before the one of interest.
Per question update, to examine the sequence (for missing files for instance), with the above findings you can collect counters for all strings in an array with hashrefs (num => filename)
use Data::Dump qw(dd);
my @seq = map { { $nums[$_]->[$n] => $fnames[$_] } } 0..$#fnames;
dd \@seq;
where @fnames contains the filenames (like the two picked for the example above, $fn1 and $fn2). This assumes that the file list was sorted to begin with; add a sort if it wasn't:
my @seq =
sort { (keys %$a)[0] <=> (keys %$b)[0] }
map { { $nums[$_]->[$n] => $fnames[$_] } }
0..$#fnames;
The order is maintained by array.
Adding this to the above example (with two strings) adds to the print
[
{ 8 => "segment8_400_av.ts" },
{ 12 => "segment12_400_av.ts" },
]
With this, all goals in "Edit 2" should be straightforward.
I suggest that you build a regex pattern by changing all digit sequences to (\d+) and then see which captured values have changed
For instance, with segment8_400_av.ts and
segment9_400_av.ts you would generate a pattern /segment(\d+)_(\d+)_av\.ts/. Note that s/\d+/(\d+)/g will return the number of numeric fields, which you will need for the subsequent check
The first would capture 8 and 400, while the second would capture 9 and 400. 8 is different from 9, so it is in that region of the string that the number varies.
I can't really write much code as you don't say what sort of result you want from this process
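To make the suggestion above concrete, here is a minimal sketch in Python (not Perl) of turning every digit run into a capture group and then checking which captured field varies. The function name sequence_info and the shape of the returned dictionary are hypothetical, and the sketch assumes the rules from "Edit 2": exactly one numeric field differs across the names and everything else is identical.

```python
import re

def sequence_info(names):
    """Find the single numeric field that varies across names and report gaps."""
    # Build a pattern from the first name: each digit run becomes (\d+),
    # everything else is matched literally.
    parts = re.split(r"(\d+)", names[0])
    pattern = "".join(
        r"(\d+)" if i % 2 else re.escape(p) for i, p in enumerate(parts)
    )
    rows = []
    for name in names:
        m = re.fullmatch(pattern, name)
        if m is None:
            return None  # this name does not fit the scheme
        rows.append(m.groups())
    # Capture groups whose value is not the same in every name.
    varying = [k for k in range(len(rows[0])) if len({r[k] for r in rows}) > 1]
    if len(varying) != 1:
        return None  # no counter found, or more than one candidate
    k = varying[0]
    numbers = sorted(int(r[k]) for r in rows)  # int() ignores zero padding
    missing = sorted(set(range(numbers[0], numbers[-1] + 1)) - set(numbers))
    return {"field": k, "first": numbers[0], "last": numbers[-1],
            "count": len(numbers), "missing": missing}

names = ["segment8_400_av.ts", "segment9_400_av.ts", "segment12_400_av.ts"]
print(sequence_info(names))
# {'field': 0, 'first': 8, 'last': 12, 'count': 3, 'missing': [10, 11]}
```

Returning None when zero or several fields vary is a design choice for this sketch; a real script might prefer to report which names broke the pattern.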

Regular expression to match n times in which n is not fixed

The pattern I want to match is a sequence of length n where n is right before the sequence.
For example, when the input is "1aaaaa", I want to match the single character "a", as the first number specifies only 1 character is matched.
Similarly, when the input is "2aaaaa", I want to match the first two characters "aa", but not the rest, as the number 2 specifies two characters will be matched.
I understand a{1} and a{2} will match "a" one or two times. But how to match a{n} in which n is not fixed?
Is it possible to do this type of match using regular expressions?
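This works for single-digit counts, each followed by a run of lowercase letters.
import re
a = "1aaa2bbbbb1cccccccc4dddddddddddd"
for b in re.findall(r'\d[a-z]+', a):
    n = int(b[0])  # the leading digit is the count
    print(b[1:1 + n])  # the first n letters after the digit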
Output:
a
bb
c
dddd
Although I did this in Java, it should help you get going in your program.
Here you take the first character of the input string as a substring and use it in your regex as the repetition count.
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class DynamicRegex {
public static void main(String args[]){
Scanner scan = new Scanner(System.in);
System.out.println("Enter a string: ");
String str = scan.nextLine();
String testStr = str.substring(0, 1); //Get the first character from the string using sub-string.
String pattern = "a{"+ testStr +"}"; //Use the sub-string in your regex as length of the string to match.
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(str);
if(m.find()){
System.out.println(m.group());
}
}
}
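Both answers above rely on the same two-step idea: a pattern's quantifier cannot take its value from the text being matched, so the count is read first and then applied. A minimal sketch of that idea in Python, generalized to multi-digit counts, might look like this. The function name counted_runs is hypothetical, and the sketch assumes each count is followed by lowercase letters.

```python
import re

def counted_runs(text):
    """Return, for each 'number + letters' group, the first n letters."""
    results = []
    for m in re.finditer(r"(\d+)([a-z]+)", text):
        n = int(m.group(1))              # step 1: read the count
        results.append(m.group(2)[:n])   # step 2: keep only n characters
    return results

print(counted_runs("1aaaaa"))  # ['a']
print(counted_runs("2aaaaa"))  # ['aa']
print(counted_runs("1aaa2bbbbb1cccccccc4dddddddddddd"))  # ['a', 'bb', 'c', 'dddd']
```

If fewer than n letters follow the number, the slice simply returns what is there; whether that should count as a match is left open here.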

Scan a file for a string of words ignoring extra whitespaces using VB.NET

I am searching a file for a string of words. For example "one two three". I have been using:
Dim text As String = File.ReadAllText(filepath)
For each phrase in phrases
index = text.IndexOf(phrase, StringComparison.OrdinalIgnoreCase)
If index >= 0 Then
Exit For
End If
Next
This worked fine, but I have now discovered that some files contain target phrases with more than one whitespace character between words.
For example, my code finds
"one two three" but fails to find "one  two   three".
Is there a way to use regular expressions, or any other technique, to match the phrase even if the gap between words is more than one whitespace character?
I know I could use
Dim text As String = File.ReadAllText(filepath)
For each phrase in phrases
text = text.Replace("  ", " ")
index = text.IndexOf(phrase, StringComparison.OrdinalIgnoreCase)
If index >= 0 Then
Exit For
End If
Next
But I wanted to know if there is a more efficient way to accomplish that
You can make a function to remove any double spaces.
Option Strict On
Option Explicit On
Option Infer Off
Public Class Form1
Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
Dim testString As String = "one  two   three    four five      six"
Dim excessSpacesGone As String = RemoveExcessSpaces(testString)
'one two three four five six
Clipboard.SetText(excessSpacesGone)
MsgBox(excessSpacesGone)
End Sub
Function RemoveExcessSpaces(source As String) As String
Dim result As String = source
Do
result = result.Replace("  ", " ")
Loop Until result.IndexOf("  ") = -1
Return result
End Function
End Class
Comments in the code will explain the code
Dim inputStr As String = "This contains one  Two   three and some other words" '<--- this would be the input from the file
inputStr = Regex.Replace(inputStr, "\s{2,}", " ") '<--- Replace extra white spaces if any
Dim searchStr As String = "one two three" '<--- be the string to be searched
searchStr = Regex.Replace(searchStr, "\s{2,}", " ") '<--- Replace extra white spaces if any
If UCase(inputStr).Contains(UCase(searchStr)) Then '<--- check if input contains search string
MsgBox("contains") '<-- display message if it contains
End If
You could convert your phrases into regular expressions with \s+ between each word, and then check the text for matches against that. e.g.
Dim text = "This contains one Two three"
Dim phrases = {
"one two three"
}
' Splits each phrase into words and create the regex from the words.
For each phrase in phrases.Select(Function(p) String.Join("\s+", p.Split({" "c}, StringSplitOptions.RemoveEmptyEntries)))
If Regex.IsMatch(text, phrase, RegexOptions.IgnoreCase) Then
Console.WriteLine("Found!")
Exit For
End If
Next
Note that this doesn't check for word boundaries at the beginning/end of the phrase, so "This contains someone two threesome" would also match. If you don't want that, add "\b" at both ends of the regex ("\s" would also work, but only when the phrase is not at the very start or end of the text).
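The same approach can be sketched in Python (rather than VB.NET) to show the pattern construction in isolation. The names phrase_pattern and find_phrase are hypothetical. Escaping each word and wrapping the pattern in \b are assumptions added here, and \b only behaves as expected when the phrase starts and ends with a word character.

```python
import re

def phrase_pattern(phrase):
    """Compile a case-insensitive regex that tolerates any whitespace run between words."""
    words = phrase.split()  # split on any whitespace, dropping empty entries
    body = r"\s+".join(re.escape(w) for w in words)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

def find_phrase(text, phrases):
    """Return the index of the first phrase found in text, or -1 if none match."""
    for phrase in phrases:
        m = phrase_pattern(phrase).search(text)
        if m:
            return m.start()
    return -1

print(find_phrase("This contains one   Two \t three", ["one two three"]))     # 14
print(find_phrase("This contains someone two threesome", ["one two three"]))  # -1
```

This leaves the file text untouched, so the returned index refers to the original text and not to a whitespace-normalized copy.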

Regular Expression - All words that begin and end in different letters

I'm having trouble with this regular expression:
Construct a regular expression defining the following language over alphabet
Σ = { a,b }
L6 = {All words that begin and end in different letters}
Here are some examples of regular expressions I was able to solve:
1. L1 = {all words of even length ending in ab}
(aa + ab + ba + bb)*(ab)
2. L2 = {all words that DO NOT have the substring ab}
b*a*
Would this work:
(a.*b)|(b.*a)
Or said in Kleene way:
a(a+b)*b+b(a+b)*a
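One way to gain confidence in an expression like this is to compare it against the definition of the language on every short word. The sketch below does that in Python, writing the union "+" as "|" and (a+b) as the character class [ab]. The names PATTERN, differs, and check, and the length bound of 8, are arbitrary choices for this illustration. A brute-force check like this is evidence, not a proof.

```python
import re
from itertools import product

# a(a+b)*b + b(a+b)*a in programming-regex syntax.
PATTERN = re.compile(r"a[ab]*b|b[ab]*a")

def differs(word):
    """Definition of L6: the word begins and ends in different letters."""
    return len(word) >= 2 and word[0] != word[-1]

def check(max_len=8):
    """Return the first word on which regex and definition disagree, or None."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            word = "".join(letters)
            if bool(PATTERN.fullmatch(word)) != differs(word):
                return word
    return None

print(check())  # None
```

Using fullmatch plays the role of the ^ and $ anchors, so the whole word must be consumed by the expression.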
This should do it:
"^((a.*b)|(b.*a))$"
1- Write a regular expression for each of the following languages: (a) the language of all strings that end with the substring 'ab' and have odd length; (b) the language of all strings that do not contain the substring 'abb'.
2- Construct a deterministic FSA for each of the following languages: (a) the language of all strings in which the second-to-last symbol is 'b'; (b) the language of all strings whose length is odd but which contain an even number of b's.
For 1(a): (aa+ab+ba+bb)*(a+b)ab
The starred group matches any even-length string over {a, b}; one more symbol followed by ab then makes the total length odd and ends the word in ab.