I'm getting a very unexpected result from what should be basic control statement operations. I have the following, a file being read with this sort of data:
1, 51, one , ab
1, 74, two , ab
0, 74, tree , ab
0, 74, for , ab
0, 74, five , ab
My snippet of Lua code that processes it:
if file then
    for line in file:lines() do
        LineArray = line
        CanClaimInfo[LineArray] = {}
        lineData = utils.split(line, ",")
        if lineData[1] == "0" then
            lineData[1] = "CAN A"
        elseif lineData[1] == "1" then
            lineData[1] = "CAN B"
        else
            lineData[1] = lineData[1]
        end
        CanClaimInfo[LineArray]["CANBus"] = lineData[1]
        CanClaimInfo[LineArray]["Address"] = lineData[2]
        CanClaimInfo[LineArray]["Name"] = lineData[3]
    end
and I get this as an output:
CAN A 74 for
CAN A 74 tree
CAN A 74 five
CAN B 74 two
1 51 one
I don't get how it slips through the elseif lineData[1] == "1" then bit. I checked and there are no leading/trailing white spaces or anything like that. Any ideas?
Maybe there are UTF-8 BOM (byte order mark) bytes at the beginning of the file? Try printing lineData[1] before the "if" tests to see what it is, and print(#lineData[1]) to see how many characters it has. It likely has more than one character, so the value falls through to the third branch (else):
lineData = utils.split(line, ",")
print(#lineData[1]) -- likely prints 1 for every line except the first
if lineData[1] == "0" then
To find the extra bytes, try print(string.byte(lineData[1], 1, #lineData[1])).
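If Lua isn't handy for experimenting, here is a quick Python sketch (an illustration only, not part of the Lua code) of how a UTF-8 BOM rides along on the first field and defeats the string comparison:

```python
def first_fields(data: bytes):
    """Split each line on commas and return the first field of every line."""
    return [line.split(",")[0] for line in data.decode("utf-8").splitlines()]

# Simulated file content with a UTF-8 byte-order mark in front
data = b"\xef\xbb\xbf1, 51, one , ab\n1, 74, two , ab\n"
fields = first_fields(data)
# The BOM survives decoding as U+FEFF and rides along on the first field,
# so fields[0] is "\ufeff1", which compares unequal to "1".
print([f == "1" for f in fields])  # [False, True]
```

(In Python, decoding with the "utf-8-sig" codec strips the BOM; the Lua equivalent is to skip the three bytes EF BB BF at the start of the file.)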
Hm, it seems like your utils.split function has some problems. I used a function from http://lua-users.org/wiki/SplitJoin and it works quite well with your code:
utils = {
    split = function(str, pat)
        local t = {} -- NOTE: use {n = 0} in Lua 5.0
        local fpat = "(.-)" .. pat
        local last_end = 1
        local s, e, cap = str:find(fpat, 1)
        while s do
            if s ~= 1 or cap ~= "" then
                table.insert(t, cap)
            end
            last_end = e + 1
            s, e, cap = str:find(fpat, last_end)
        end
        if last_end <= #str then
            cap = str:sub(last_end)
            table.insert(t, cap)
        end
        return t
    end
}
Maybe your function converts the 1 to a number (for whatever reason). In Lua, "1" ~= 1!
I'm trying to convert the word wall into its ascii code list (119, 97, 108, 108) like this:
my @ascii="abcdefghijklmnopqrstuvwxyz";
my @tmp;
map { push @tmp, $_.ord if $_.ord == @ascii.comb.any.ord }, "wall".comb;
say @tmp;
Is there a way to use the @tmp without declaring it on a separate line?
Is there a way to produce the ascii code list in one line instead of 3 lines? If so, how to do it?
Note that I have to use the @ascii variable, i.e. I can't rely on the consecutively increasing ASCII sequence (97, 98, 99 ... 122), because I plan to use this code for non-ASCII languages too.
There are a couple of things we can do here to make it work.
First, let's tackle the @ascii variable. The @ sigil indicates a positional variable, but you assigned a single string to it. This creates a one-element array ['abc...'], which will cause problems down the road. Depending on how general you need this to be, I'd recommend either creating the array directly:
my @ascii = <a b c d e f g h i j k l m n o p q r s t u v w x y z>;
my @ascii = 'a' .. 'z';
my @ascii = 'abcdefghijklmnopqrstuvwxyz'.comb;
or going ahead and handling the any part:
my $ascii-char = any <a b c d e f g h i j k l m n o p q r s t u v w x y z>;
my $ascii-char = any 'a' .. 'z';
my $ascii-char = 'abcdefghijklmnopqrstuvwxyz'.comb.any;
Here I've used the $ sigil, because any really specifies any single value, and so will function as such (which also makes our life easier). I'd personally use $ascii, but I'm using a separate name to make later examples more distinguishable.
Now we can handle the map function. Based on the above two versions of ascii, we can rewrite your map function to either of the following:
{ push @tmp, $_.ord if $_ eq @ascii.any }
{ push @tmp, $_.ord if $_ eq $ascii-char }
Note that if you prefer to use ==, you can go ahead and create the numeric values in the initial ascii creation, and then use $_.ord. As well, personally, I like to name the mapped variable, e.g.:
{ push @tmp, $^char.ord if $^char eq @ascii.any }
{ push @tmp, $^char.ord if $^char eq $ascii-char }
where $^foo replaces $_ (if you use more than one, they map in alphabetical order to @_[0], @_[1], etc.).
But let's get to the more interesting question here. How can we do all of this without needing to predeclare @tmp? Obviously, that just requires creating the array in the map loop. You might think that would be tricky when we don't have an ASCII value, but the fact that an if statement returns Empty (or ()) when it's not run makes life really easy:
my @tmp = map { $^char.ord if $^char eq $ascii-char }, "wall".comb;
my @tmp = map { $^char.ord if $^char eq @ascii.any }, "wall".comb;
If we used "wáll", the list collected by map would be 119, Empty, 108, 108, which is automagically returned as 119, 108, 108. Consequently, @tmp is set to just 119, 108, 108.
Yes, there is a much simpler way.
"wall".ords.grep('az'.ords.minmax);
Of course this relies on a to z being an unbroken sequence. This is because minmax creates a Range object based on the minimum and maximum value in the list.
If they weren't in an unbroken sequence you could use a junction.
"wall".ords.grep( 'az'.ords.minmax | 'AZ'.ords.minmax );
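For readers more at home in Python, here is a rough analogue of the minmax idea (a sketch under the same assumption that a to z is an unbroken sequence of code points):

```python
def ords_in_range(text: str, alphabet: str):
    """Keep the code points of text that fall in alphabet's min..max range."""
    lo, hi = min(map(ord, alphabet)), max(map(ord, alphabet))
    return [ord(c) for c in text if lo <= ord(c) <= hi]

print(ords_in_range("wall", "az"))  # [119, 97, 108, 108]
```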
But you said that you want to match other languages. Which to me screams regex.
"wall".comb.grep( /^ <:Ll> & <:ascii> $/ ).map( *.ord )
This matches Lowercase Letters that are also in ASCII.
Actually we can make it even simpler. comb can take a regex which determines which characters it takes from the input.
"wall".comb( / <:Ll> & <:ascii> / ).map( *.ord )
# (119, 97, 108, 108)
"ΓΔαβγδε".comb( / <:Ll> & <:Greek> / ).map( *.ord )
# (945, 946, 947, 948, 949)
# Does not include Γ or Δ, as they are not lowercase
Note that the above only works with ASCII if you don't have a combining accent.
"de\c[COMBINING ACUTE ACCENT]f".comb( / <:Ll> & <:ascii> / )
# ("d", "f")
The Combining Acute Accent combines with the e which composes to Latin Small Letter E With Acute.
That composed character is not in ASCII so it is skipped.
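The same composition behaviour is easy to verify outside Raku; here is a small Python illustration (an aside, not part of the answer) using the standard unicodedata module:

```python
import unicodedata

s = "de\u0301f"  # 'd', 'e', COMBINING ACUTE ACCENT, 'f'
composed = unicodedata.normalize("NFC", s)
print(composed == "d\u00e9f")  # True: the accent composed into U+00E9
# Keep lowercase ASCII letters only; the composed "é" is filtered out.
print([c for c in composed if c.isascii() and c.islower()])  # ['d', 'f']
```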
It gets even weirder if there isn't a composed value for the character.
"f\c[COMBINING ACUTE ACCENT]".comb( / <:Ll> & <:ascii> / )
# ("f́",)
That is because the f is lowercase and in ASCII. The composing codepoint gets brought along for the ride though.
Basically if your data has, or can have combining accents and if it could break things, then you are better off dealing with it while it is still in binary form.
$buf.grep: {
    .uniprop() eq 'Ll'                  # Lowercase Letter
    && .uniprop('Block') eq 'Basic Latin' # ASCII
}
The above would also work for single character strings because .uniprop works on either integers representing a codepoint, or on the actual character.
"wall".comb.grep: {
    .uniprop() eq 'Ll'                  # Lowercase Letter
    && .uniprop('Block') eq 'Basic Latin' # ASCII
}
Note again that this would have the same issues with composing codepoints since it works with strings.
You may also want to use .uniprop('Script') instead of .uniprop('Block') depending on what you want to do.
Here's a working approach using Raku's trans method (code snippet performed in the Raku REPL):
> my @a = "wall".comb;
[w a l l]
> @a.trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') ).put;
119 97 108 108
Above, we handle an ascii string. Below I add the "é" character, and show a 2-step solution:
> my @a = "wallé".comb;
[w a l l é]
> my @b = @a.trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') );
[119 97 108 108 é]
> @b.trans("é" => ords("é")).put
119 97 108 108 233
Nota bene #1: Although all the code above works fine, when I tried shortening the alphabet to 'a'..'z' I saw erroneous return values, hence the use of the full 'abcdefghijklmnopqrstuvwxyz'.
Nota bene #2: One open question is how to suppress output when trans fails to recognize a character (e.g. how to suppress assignment of "é" as the last element of @b in the second example above). I've tried adding the :delete argument to trans, but no luck.
EDITED: To remove unwanted characters, here's code using grep (à la @Brad Gilbert), followed by trans:
> my @a = "wallé".comb;
[w a l l é]
> @a.grep('a'..'z'.comb.any).trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') ).put
119 97 108 108
Is there a way to reliably get the current line number during a Perl
multiline list assignment without explicitly using __LINE__? I am
storing testcases in a list and would like to tag each with its line
number.* That way I can do (roughly)
ok($_->[1], 'line ' . $_->[0]) for @tests;
And, of course, I would like to save typing compared to
putting __LINE__ at the beginning of each test case :) . I have
not been able to find a way to do so, and I have encountered some
confusing behaviour in the lines reported by caller.
* Possible XY, but I can't find a module to do it.
Update I found a hack and posted it as an answer. Thanks to @zdim for helping me think about the problem a different way!
MCVE
A long one, because I've tried several different options. my_eval,
L(), and L2{} are some I've tried so far — L() was the one
I initially hoped would work. Jump down to my @testcases to see how
I'm using these. When testing, do copy the shebang line.
Here's my non-MCVE use case, if you are interested.
#!perl
use strict; use warnings; use 5.010;

# Modified from https://www.effectiveperlprogramming.com/2011/06/set-the-line-number-and-filename-of-string-evals/#comment-155 by http://sites.google.com/site/shawnhcorey/
sub my_eval {
    my ( $expr ) = @_;
    my ( undef, $file, $line ) = caller;
    my $code = "# line $line \"$file\"\n" . $expr;
    unless(defined wantarray) {
        eval $code; die $@ if $@;
    } elsif(wantarray) {
        my @retval = eval $code; die $@ if $@; return @retval;
    } else {
        my $retval = eval $code; die $@ if $@; return $retval;
    }
}

sub L { # Prepend caller's line number
    my (undef, undef, $line) = caller;
    return ["$line", @_];
} #L

sub L2(&) { # Prepend caller's line number
    my $fn = shift;
    my (undef, undef, $line) = caller;
    return ["$line", &$fn];
} #L2

# List of [line number, item index, expected line number, type]
my @testcases = (
    ([__LINE__,0,32,'LINE']),
    ([__LINE__,1,33,'LINE']),
    (L(2,34,'L()')),
    (L(3,35,'L()')),
    (do { L(4,36,'do {L}') }),
    (do { L(5,37,'do {L}') }),
    (eval { L(6,38,'eval {L}') }),
    (eval { L(7,39,'eval {L}') }),
    (eval "L(8,40,'eval L')"),
    (eval "L(9,41,'eval L')"),
    (my_eval("L(10,42,'my_eval L')")),
    (my_eval("L(11,43,'my_eval L')")),
    (L2{12,44,'L2{}'}),
    (L2{13,45,'L2{}'}),
);

foreach my $idx (0..$#testcases) {
    printf "%2d %-10s line %2d expected %2d %s\n",
        $idx, $testcases[$idx]->[3], $testcases[$idx]->[0],
        $testcases[$idx]->[2],
        ($testcases[$idx]->[0] != $testcases[$idx]->[2]) && '*';
}
Output
With my comments added.
0 LINE line 32 expected 32
1 LINE line 33 expected 33
Using __LINE__ expressly works fine, but I'm looking for an
abbreviation.
2 L() line 45 expected 34 *
3 L() line 45 expected 35 *
L() uses caller to get the line number, and reports a line later
in the file (!).
4 do {L} line 36 expected 36
5 do {L} line 45 expected 37 *
When I wrap the L() call in a do{}, caller returns the correct
line number — but only once (!).
6 eval {L} line 38 expected 38
7 eval {L} line 39 expected 39
Block eval, interestingly, works fine. However, it's no shorter
than __LINE__.
8 eval L line 1 expected 40 *
9 eval L line 1 expected 41 *
String eval gives the line number inside the eval (no surprise)
10 my_eval L line 45 expected 42 *
11 my_eval L line 45 expected 43 *
my_eval() is a string eval plus a #line directive based on
caller. It also gives a line number later in the file (!).
12 L2{} line 45 expected 44 *
13 L2{} line 45 expected 45
L2 is the same as L, but it takes a block that returns a list,
rather than
the list itself. It also uses caller for the line number. And it
is correct once, but not twice (!). (Possibly just because it's the last
item — my_eval reported line 45 also.)
So, what is going on here? I have heard of Deparse and wonder if this is
optimization-related, but I don't know enough about the engine to know
where to start investigating. I also imagine this could be done with source
filters or Devel::Declare, but that is well beyond my
level of experience.
Take 2
@zdim's answer got me started thinking about fluent interfaces, e.g., as in my answer:
$testcases2 # line 26
->add(__LINE__,0,27,'LINE')
->add(__LINE__,1,28,'LINE')
->L(2,29,'L()')
->L(3,30,'L()')
->L(3,31,'L()')
;
However, even those don't work here — I get line 26 for each of the ->L() calls. So it appears that caller sees all of the chained calls as coming from the $testcases2->... line. Oh well. I'm still interested in knowing why, if anyone can enlighten me!
caller can only get the line numbers of statements, which are decided at compilation time.
When I change the code to
my @testcases;
push @testcases, ([__LINE__,0,32,'LINE']);
push @testcases, ([__LINE__,1,33,'LINE']);
push @testcases, (L(2,34,'L()'));
push @testcases, (L(3,35,'L()'));
...
maintaining line numbers, it works (except for string evals).
So, on the practical side, using caller is fine with separate statements for calls.
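As a cross-language aside (a sketch, not part of the Perl answers): CPython records a line number per call site rather than per statement, so the equivalent helper works even inside a multi-line list literal:

```python
import inspect

def L(*args):
    # Prepend the line number of the call site. Unlike Perl's caller,
    # CPython tracks a line number per call, so each element of a
    # multi-line list literal gets its own number.
    return [inspect.stack()[1].lineno, *args]

tests = [
    L(0, 'first'),   # this call and the next are one line apart,
    L(1, 'second'),  # so their recorded line numbers differ by exactly 1
]
print(tests[1][0] - tests[0][0])  # 1
```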
Perl internals
The line numbers are baked into the op-tree at compilation and (my emphasis)
At run-time, only the line numbers of statements are available [...]
from ikegami's post on PerlMonks.
We can see this by running perl -MO=Concise script.pl where the line
2 nextstate(main 25 line_nos.pl:45) v:*,&,{,x*,x&,x$,$,67108864 ->3
is for the nextstate op, which sets the line number for caller (and warnings). See this post, and the nextstate example below.
A way around this would be to try to trick the compilation (somehow) or, better of course, to not assemble information in a list like that. One such approach is in the answer by cxw.
See this post for a related case and more detail.
nextstate example
Here's a multi-line function-call chain run through Deparse (annotated):
$ perl -MO=Concise -e '$x
->foo()
->bar()
->bat()'
d <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 1 -e:1) v:{ ->3 <=== the only nextstate
c <1> entersub[t4] vKRS/TARG ->d
3 <0> pushmark s ->4
a <1> entersub[t3] sKRMS/LVINTRO,TARG,INARGS ->b
4 <0> pushmark s ->5
8 <1> entersub[t2] sKRMS/LVINTRO,TARG,INARGS ->9
5 <0> pushmark s ->6
- <1> ex-rv2sv sKM/1 ->7
6 <#> gvsv[*x] s ->7
7 <.> method_named[PV "foo"] s ->8
9 <.> method_named[PV "bar"] s ->a
b <.> method_named[PV "bat"] ->c
-e syntax OK
Even though successive calls are on separate lines, they are part of the same statement, so are all attached to the same nextstate.
Edit: This answer is now wrapped in a CPAN module (on GitHub)!
@zdim's answer got me thinking about fluent interfaces. Below are two hacks that work for my particular use case, but that don't help me understand the behaviour reported in the question. If you can help, please post another answer!
Hack 2 (newer) (the one now on CPAN)
I think this one is very close to minimal. In Perl, you can call a subroutine through a reference with $ref->(), and you can leave out the second and subsequent -> in a chain of arrows. That means, for example, that you can do:
my $foo; $foo=sub { say shift; return $foo; };
$foo->(1)
(2)
(3);
Looks good, right? So here's the MCVE:
#!perl
use strict; use warnings; use 5.010;

package FluentAutoIncList2 {
    sub new { # call as $class->new(__LINE__); each element is one line
        my $class = shift;
        my $self = bless {lnum => shift // 0, arr => []}, $class;
        # Make a loader that adds an item and returns itself --- not $self
        $self->{loader} = sub { $self->L(@_); return $self->{loader} };
        return $self;
    }
    sub size { return scalar @{ shift->{arr} }; }
    sub last { return shift->size-1; } # $#
    sub load { goto &{ shift->{loader} } } # kick off loading
    sub L { # Push a new record with the next line number on the front
        my $self = shift;
        push @{ $self->{arr} }, [++$self->{lnum}, @_];
        return $self;
    } #L
    sub add { # just add it
        my $self = shift;
        ++$self->{lnum}; # keep it consistent
        push @{ $self->{arr} }, [@_];
        return $self;
    } #add
} #FluentAutoIncList2

# List of [line number, item index, expected line number, type]
my $testcases = FluentAutoIncList2->new(__LINE__) # line 28
    ->add(__LINE__,0,36,'LINE')
    ->add(__LINE__,1,37,'LINE');
$testcases->load(2,38,'load')-> # <== Only need two arrows.
    (3,39,'chain load')         # <== After that, () are enough.
    (4,40,'chain load')
    (5,41,'chain load')
    (6,42,'chain load')
    (7,43,'chain load')
    ;

foreach my $idx (0..$testcases->last) {
    printf "%2d %-10s line %2d expected %2d %s\n",
        $idx, $testcases->{arr}->[$idx]->[3],
        $testcases->{arr}->[$idx]->[0],
        $testcases->{arr}->[$idx]->[2],
        ($testcases->{arr}->[$idx]->[0] !=
         $testcases->{arr}->[$idx]->[2]) && '*';
}
Output:
0 LINE line 36 expected 36
1 LINE line 37 expected 37
2 load line 38 expected 38
3 chain load line 39 expected 39
4 chain load line 40 expected 40
5 chain load line 41 expected 41
6 chain load line 42 expected 42
7 chain load line 43 expected 43
All the chain load lines were loaded with zero extra characters compared to the original [x, y] approach. Some overhead, but not much!
Hack 1
Code:
By starting with __LINE__ and assuming a fixed number of lines per call, a counter will do the trick. This could probably be done more cleanly with a tie.
#!perl
use strict; use warnings; use 5.010;

package FluentAutoIncList {
    sub new { # call as $class->new(__LINE__); each element is one line
        my $class = shift;
        return bless {lnum => shift // 0, arr => []}, $class;
    }
    sub size { return scalar @{ shift->{arr} }; }
    sub last { return shift->size-1; } # $#
    sub L { # Push a new record with the next line number on the front
        my $self = shift;
        push @{ $self->{arr} }, [++$self->{lnum}, @_];
        return $self;
    } #L
    sub add { # just add it
        my $self = shift;
        ++$self->{lnum}; # keep it consistent
        push @{ $self->{arr} }, [@_];
        return $self;
    } #add
} #FluentAutoIncList

# List of [line number, item index, expected line number, type]
my $testcases = FluentAutoIncList->new(__LINE__) # line 28
    ->add(__LINE__,0,29,'LINE')
    ->add(__LINE__,1,30,'LINE')
    ->L(2,31,'L()')
    ->L(3,32,'L()')
    ->L(4,33,'L()')
    ;

foreach my $idx (0..$testcases->last) {
    printf "%2d %-10s line %2d expected %2d %s\n",
        $idx, $testcases->{arr}->[$idx]->[3],
        $testcases->{arr}->[$idx]->[0],
        $testcases->{arr}->[$idx]->[2],
        ($testcases->{arr}->[$idx]->[0] !=
         $testcases->{arr}->[$idx]->[2]) && '*';
}
Output:
0 LINE line 29 expected 29
1 LINE line 30 expected 30
2 L() line 31 expected 31
3 L() line 32 expected 32
4 L() line 33 expected 33
How to get a word inside a file at a given position
def get_word(file, position)
File.each_line(file).with_index do |line, line_number|
if (line_number + 1) == position.line_number
# How to get a word at position.column_number ?
end
end
end
This should work like this:
File: message.md
Dear people:
My name is [Ángeliño](#angelino).
Bye!
Calls: get_word
record Position, line_number : Int32, column_number : Int32
get_word("message.md", Position.new(1, 9)) # => people
get_word("message.md", Position.new(3, 20)) # => Ángeliño
get_word("message.md", Position.new(5, 3)) # => Bye!
Maybe this will give you a hint. Please be advised that this implementation never treats a punctuation mark as part of a word, so the last example returns Bye instead of Bye!.
def get_word_of(line : String, at position : Int)
chunks = line.split(/(\p{P}|\p{Z})/)
edge = 0
hashes = chunks.map do |chunk|
next if chunk.empty?
{chunk => (edge + 1)..(edge += chunk.size)}
end.compact
candidate = hashes.find { |hash| hash.first_value.covers?(position) }
.try &.first_key
candidate unless (candidate =~ /\A(?:\p{P}|\p{Z})+\Z/)
end
p get_word_of("Dear people:", 9) # => people
p get_word_of("My name is [Ángeliño](#angelino).", 20) # => Ángeliño
p get_word_of("Bye!", 3) # => Bye
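For comparison, here is a rough Python sketch of the same idea (an aside; the helper name get_word_of mirrors the Crystal code above), using 1-based columns and \w+ so punctuation is never part of a word:

```python
import re

def get_word_of(line: str, position: int):
    """Return the word whose 1-based column span covers position, else None."""
    for m in re.finditer(r"\w+", line):
        if m.start() + 1 <= position <= m.end():
            return m.group()
    return None

print(get_word_of("Dear people:", 9))                        # people
print(get_word_of("My name is [Ángeliño](#angelino).", 20))  # Ángeliño
print(get_word_of("Bye!", 3))                                # Bye
```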
A handy way to get the numerical position of a section of a string is to use a regex like so:
filename = "/path/to/file"
File.each_line(filename).each_with_index do |line, line_number|
term = "search-term"
column = line =~ %r{#{term}}
p "Found #{term} at line #{line_number}, column #{column}." if column
end
Outputs: "Found search-term at line 38, column 6"
I am working on a little program which is able to write and load .txt files.
I am facing an issue where the .txt file is not created when my save() function is inside an "if" block. I used the computer's search function just in case it was created somewhere else. Nope, the .txt file is nowhere to be found.
For your information, I am coding in Microsoft Visual Studio 2013 (Community) and using Windows 8.
Here are the codes:
def save(dic, filename):
    out_file = open(filename, "wt")
    key_list = []
    for i in dic.keys():
        key_list.append(i)
    key_list.sort()
    for i in range(len(key_list)):
        key = key_list[i]
        out_file.write(key + "," + dic[key] + "\n")
    out_file.close()

filename = "dictionary.txt"
count = input("Save (0) or Load (1): ")
if count == 0:
    dic = {}
    dic["1"] = "11"
    dic["2"] = "22"
    dic["3"] = "33"
    dic["4"] = "44"
    dic["5"] = "55"
    save(dic, filename)
The input function returns a string. Since count is a string, count == 0 is always false. This is because the string '0' and the number 0 are not equal.
You have two options here. You can either convert the count to an integer with count = int(input('...')) and not touch your condition, or you can compare count with the string representation of 0 (count == '0').
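A minimal sketch of the second option (the .strip() call is a small extra robustness tweak, not something the original code had):

```python
def wants_save(count: str) -> bool:
    # input() always returns a string, so compare strings
    # (or convert with int(count) and compare numbers instead).
    return count.strip() == "0"

print(wants_save("0"))  # True
print(wants_save("1"))  # False
```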
The content of the file is here: http://pastebin.com/nAe9q9Kt (as I cannot have multiple blank lines in a question)
Below is a screenshot from my sublime-text.
SPACED INPUT EXAMPLE START
a
b
c
SPACED INPUT EXAMPLE END
You can notice that most of the lines begin with 0 (zero), except the words ENGINEERS and DOESNT, and that the blocks are separated by a single blank line and sometimes by two blank lines.
Basically what I want is this:
List(
List("0MOST PEOPLE", "0BELIEVE", "0THAT"),
List("0IF IT", "0AINT BROKE", "0DONT FIX IT"),
List("0BELIEVE", "0THAT", "0IF", "0IT AINT BROKE"),
List("0IT"),
List("0HAVE", "0ENOUGH", "0FEATURES YET.")
)
I tried to write tail-recursive code and it works well :) But it takes too long (a couple of minutes) to run on a huge file (more than 10K lines).
I thought of using a regex approach, or executing Unix commands like sed or awk from the Scala code to generate a temp file. My guess is that either would run faster than my current approach.
Can somebody please help me with the regex?
Here is my tail-recursive Scala code:
@scala.annotation.tailrec
def inner(remainingLines: List[String], previousLineIsBlank: Boolean, frames: List[List[String]], frame: List[String]): List[List[String]] = {
  remainingLines match {
    case Nil => frame :: frames
    case line :: Nil if !previousLineIsBlank =>
      inner(
        remainingLines = Nil,
        previousLineIsBlank = false,
        frames = frame :: frames,
        frame = line :: frame)
    case line :: tail => {
      line match {
        case "" if previousLineIsBlank => // Current line is blank, previous line is blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = true,
            frames = frame :: frames,
            frame = List.empty[String])
        case "" if !previousLineIsBlank => // Current line is blank, previous line is not blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = true,
            frames = frames,
            frame = frame)
        case line if !line.startsWith("0") && previousLineIsBlank => // Current line is not blank and does not start with 0 (ENGINEERS, DOESNT), previous line is blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = false,
            frames = frames,
            frame = frame)
        case line if previousLineIsBlank => // Current line is not blank and starts with 0, previous line is blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = false,
            frames = frames,
            frame = line :: frame)
        case line if !previousLineIsBlank => // Current line is not blank, previous line is not blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = false,
            frames = frames,
            frame = line :: frame)
        case line => sys.error("Unmatched case = " + line)
      }
    }
  }
}
val source = """0MOST PEOPLE
0BELIEVE
0THAT
0IF IT
0AINT BROKE,
0DONT FIX IT.
ENGINEERS
0BELIEVE
0THAT
0IF
0IT AINT BROKE,
0IT
DOESNT
0HAVE
0ENOUGH
0FEATURES YET."""
val output = (for (s <- source.split("\n\n").toList) yield { // split on empty lines
  s.split("\n").toList                                       // split on new lines
    .filter(_.headOption.getOrElse("") == '0')               // get rid of entries not starting with '0'
}).filter(!_.isEmpty)                                        // get rid of possible empty blocks
//output formatted for readability
scala> output: List[List[String]] = List(List(0MOST PEOPLE, 0BELIEVE, 0THAT),
List(0IF IT, 0AINT BROKE,, 0DONT FIX IT.),
List(0BELIEVE, 0THAT, 0IF, 0IT AINT BROKE,),
List(0IT),
List(0HAVE, 0ENOUGH, 0FEATURES YET.))
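For comparison, the same split-on-blank-lines idea can be sketched in Python with a regex (an aside; the \s* in the separator pattern also absorbs the double-blank-line case):

```python
import re

def frames(text: str):
    """Split on runs of blank lines, keeping only lines that start with '0'."""
    blocks = re.split(r"\n\s*\n", text)
    kept = [[ln for ln in b.splitlines() if ln.startswith("0")] for b in blocks]
    return [b for b in kept if b]  # drop blocks left empty after filtering

sample = "0A\n0B\n\nENGINEERS\n0C\n\n\n0D\n"
print(frames(sample))  # [['0A', '0B'], ['0C'], ['0D']]
```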
UPDATE:
if you are reading the lines from file, then the old imperative approach might work quite well, especially if source file is large:
import scala.collection.mutable.ListBuffer
import scala.io.Source

val lb = ListBuffer[List[String]]()
val ml = ListBuffer[String]()
for (ll <- Source.fromFile(<yourfile>).getLines) {
  if (ll.isEmpty) {
    if (!ml.isEmpty) lb += ml.toList
    ml.clear
  } else if (ll(0) == '0') ml += ll
}
if (!ml.isEmpty) lb += ml.toList // flush the last block if the file does not end with a blank line
val output = lb.toList
Here is a way with awk. You'll probably have to figure out a way to incorporate this in your scala code:
awk '
BEGIN { print "List(" }
/^0/ {
printf " %s", "List("
for(i = 1; i <= NF; i++) {
printf "%s%s" ,q $i q,(i==NF?"":", ")
}
print "),"
}
END { print ")" }' RS= FS='\n' q='"' file
Output with your sample data (from pastebin):
List(
List("0MOST PEOPLE", "0BELIEVE", "0THAT"),
List("0IF IT", "0AINT BROKE,", "0DONT FIX IT."),
List("0BELIEVE", "0THAT", "0IF", "0IT AINT BROKE,"),
List("0IT"),
List("0HAVE", "0ENOUGH", "0FEATURES YET."),
)
Using awk
awk 'BEGIN{print "List(" }
{ s=/^[0-9]/?1:0;i=s?i:i+1}
s{a[i]=a[i]==""?$0:a[i] OFS $0}
END{ for (j=1;j<=i;j++)
if (a[j]!="")
{ gsub(/\|/,"\",\"",a[j])
printf " list(\"%s\")\n", a[j]
}
print ")"
}' OFS="|" file
List(
list("0MOST PEOPLE","0BELIEVE","0THAT")
list("0IF IT","0AINT BROKE,","0DONT FIX IT.")
list("0BELIEVE","0THAT","0IF","0IT AINT BROKE,")
list("0IT")
list("0HAVE","0ENOUGH","0FEATURES YET.")
)
Explanation
s=/^[0-9]/?1:0; i=s?i:i+1 sets the flags s and i, which detect whether the current line continues a record or starts a new one.
s{a[i]=a[i]==""?$0:a[i] OFS $0} saves each record (records are separated by lines that do not start with a number) into the array a.
The END block prints out the result in the expected format.
OFS="|" assumes the character | does not appear in your input file; if it does, change it to another character, such as @ or #.
I'm not too familiar with Scala, but I think this is the regex you're looking for:
([A-Z]+[A-Z ]*)
See it in action: http://regex101.com/r/gY8lX6
Edit: in that case, all you need to do is add a zero to the beginning of the capture group:
(0[A-Z]+[A-Z ]*)