Why can I use the Regex class like a data source when declaring it? - regex

As you can see in the picture above, I can use the Regex class like a data source when declaring it. Why is that?
I also noticed that these LINQ lines start with dots. How is this possible?

Regex.Matches(string, string) returns a MatchCollection instance, which implements ICollection and IEnumerable. You can't use LINQ on it directly, because the LINQ extension methods in System.Linq.Enumerable require IEnumerable<T> (the generic version).
That's why Enumerable.OfType was used. It returns IEnumerable<Match>, so now you can use LINQ. Instead of OfType<Match>, the author could also have used Cast<Match>.
In general you can use LINQ to Objects with any type that implements IEnumerable<T>, even with a String, since it implements IEnumerable<char>. A small example that creates a dictionary of chars and their occurrences:
Dictionary<char, int> charCounts = "Sample Text" // not a great example since most letters occur only once, but you get the idea
.GroupBy(c => c)
.ToDictionary(g => g.Key, g => g.Count());
To answer the dot part of your question: LINQ basically consists of many extension methods, so you call them like any other method. The line breaks before the dots are only formatting; you could write the same thing on one line:
Dictionary<char, int> charCounts = "Sample Text".GroupBy(c => c).ToDictionary(g => g.Key, g => g.Count());

Related

Find maximum w.r.t. substring within each group of formatted strings

I am struggling to find a solution for the following scenario. I have a few files in a directory, let's say:
vbBaselIIIData_201802_3_d.data.20180405.txt.gz
vbBaselIIIData_201802_4_d.data.20180405.txt.gz
vbBaselIIIData_201803_4_d.data.20180405.txt.gz
vbBaselIIIData_201803_5_d.data.20180405.txt.gz
Suppose the single-digit number after the second underscore is called the runnumber. I have to pick only the files with the latest runnumber for each period, so in this case I need to pick only two of the four files and put them in a mutable Scala list. The ListBuffer should contain:
vbBaselIIIData_201802_4_d.data.20180405.txt.gz
vbBaselIIIData_201803_5_d.data.20180405.txt.gz
Can anybody suggest how to implement this? I am using Scala, but an algorithm alone is also appreciated. Which data structures would be the right ones to use? Which functions do we need to implement? Any suggestions are welcome.
Here is a hopefully somewhat inspiring proposal that demonstrates a whole bunch of different language features and useful methods on collections:
val list = List(
"vbBaselIIIData_201802_3_d.data.20180405.txt.gz",
"vbBaselIIIData_201802_4_d.data.20180405.txt.gz",
"vbBaselIIIData_201803_4_d.data.20180405.txt.gz",
"vbBaselIIIData_201803_5_d.data.20180405.txt.gz"
)
val P = """[^_]+_(\d+)_(\d+)_.*""".r
val latest = list
.map { str => {val P(id, run) = str; (str, id, run.toInt) }}
.groupBy(_._2) // group by id
.mapValues(_.maxBy(_._3)._1) // find the last run for each id
.values // throw away the id
.toList
.sorted // restore ordering, mostly for cosmetic purposes
latest foreach println
Brief explanation of the not-entirely-trivial parts that you might have missed when reading an introduction to Scala:
"regex pattern".r converts a string into a compiled regex pattern
A block { stmt1 ; stmt2 ; stmt3 ; ... ; stmtN; result } evaluates to the last expression result
Extractor syntax can be used for compiled regex patterns
val P(id, run) = str matches the second and third _-separated values
_.maxBy(_._3)._1 finds the triple with highest run number, then extracts the first component str again
Output:
vbBaselIIIData_201802_4_d.data.20180405.txt.gz
vbBaselIIIData_201803_5_d.data.20180405.txt.gz
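One thing to keep in mind with the extractor syntax: val P(id, run) = str throws a MatchError when a name does not match the pattern. A minimal sketch of a more forgiving variant, assuming that non-matching names should simply be skipped (the object name LatestRuns and the method latest are made up for illustration):

```scala
object LatestRuns {
  // Same pattern as above: capture the id and the run number.
  private val P = """[^_]+_(\d+)_(\d+)_.*""".r

  /** Keeps, for each id, the file name with the highest run number.
    * Names that do not match the pattern are skipped (an assumption). */
  def latest(names: List[String]): List[String] =
    names
      .collect { case name @ P(id, run) => (name, id, run.toInt) }
      .groupBy(_._2)            // group the triples by id
      .values
      .map(_.maxBy(_._3)._1)    // highest run per id, keep the name
      .toList
      .sorted
}
```

Because run is converted with toInt, a run number of 10 sorts after 9, which a purely textual comparison would get wrong.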
It's not clear what performance needs you have, even though you're mentioning an 'algorithm'.
Provided you don't have more specific needs, something like this is easy to do with Scala's Collection API. Even if you were dealing with huge directories, you could probably achieve some good performance characteristics by moving to Streams (at least in memory usage).
So assuming you have a function like def getFilesFromDir(path: String): List[String] where the List[String] is a list of filenames, you need to do the following:
Group files by date (List[String] => Map[String, List[String]])
Extract the Runnumbers, preserving the original filename (List[String] => List[(String, Int)])
Select the max Runnumber (List[(String, Int)] => (String, Int))
Map to just the filename ((String, Int) => String)
Select just the values of the resulting Map (Map[String, String] => Iterable[String])
(Note: if you want to go the pure functional route, you'll want a function something like def getFilesFromDir(path: String): IO[List[String]])
With Scala's Collections API you can achieve the above with something like this:
def extractDate(fileName: String): String = ???
// returns Int so that maxBy compares run numbers numerically, not as text
def extractRunnumber(fileName: String): Int = ???
def getLatestRunnumbersFromDir(path: String): List[String] =
  getFilesFromDir(path)
    .groupBy(extractDate) // List[String] => Map[String, List[String]]
    .mapValues(selectMaxRunnumber) // Map[String, List[String]] => Map[String, String]
    .values // Map[String, String] => Iterable[String]
    .toList // Iterable[String] => List[String]
def selectMaxRunnumber(fileNames: List[String]): String =
  fileNames.map(f => f -> extractRunnumber(f))
    .maxBy(p => p._2)
    ._1
I've left the extractDate and extractRunnumber implementations blank. These can be done using simple regular expressions — let me know if you're having trouble with that.
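For reference, here is one possible way to fill in the two helpers. This is only a sketch: it assumes every file name has the layout prefix_period_run_rest from the question, it returns the run number as an Int so that it compares numerically, and it throws on names that do not fit. The object name FileNameParts and the pattern name Parts are invented for the example.

```scala
object FileNameParts {
  // Assumed layout: <prefix>_<period>_<run>_<rest>
  private val Parts = """[^_]+_(\d+)_(\d+)_.*""".r

  def extractDate(fileName: String): String = fileName match {
    case Parts(date, _) => date
    case _ => throw new IllegalArgumentException(s"Unexpected file name: $fileName")
  }

  def extractRunnumber(fileName: String): Int = fileName match {
    case Parts(_, run) => run.toInt
    case _ => throw new IllegalArgumentException(s"Unexpected file name: $fileName")
  }
}
```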
If you have the file-names as a list, like:
val list = List("vbBaselIIIData_201802_3_d.data.20180405.txt.gz"
, "vbBaselIIIData_201802_4_d.data.20180405.txt.gz"
, "vbBaselIIIData_201803_4_d.data.20180405.txt.gz"
, "vbBaselIIIData_201803_5_d.data.20180405.txt.gz")
Then you can do:
list.map{f =>
val s = f.split("_").toList
(s(1), f)
}.groupBy(_._1)
.map(_._2.max)
.values
This returns:
MapLike.DefaultValuesIterable(vbBaselIIIData_201803_5_d.data.20180405.txt.gz, vbBaselIIIData_201802_4_d.data.20180405.txt.gz)
as you wanted.

Equivalent of a predicate's ANY for Swift 3 filter expression

When using a predicate, we can have:
filterPredicate = [NSPredicate predicateWithFormat:@"(ANY names.firstName contains[c] %@)", nameToSearchFor];
This use of ANY allows us to find any object where any of the objects in the names collection has a firstName containing the desired text.
Is it possible to do something similar with filter expressions in Swift 3? In other words, something like:
allPeople.filter { $0.(ANY)names.firstName.contains(searchString) };
(The above ANY syntax is made up for illustration).
Perhaps this could be done by nesting a reduce that concatenates all the firstNames, and then checking whether my target string is contained in that?
You can nest contains(where:) inside the filter closure:
allPeople.filter { $0.names.contains(where: { $0.firstName.contains(searchString) }) }

What's the best way to match strings in a file to case class in Scala?

We have a file that contains data that we want to match to a case class. I know enough to brute force it, but I'm looking for an idiomatic way in Scala.
Given File:
#record
name:John Doe
age: 34
#record
name: Smith Holy
age: 33
# some comment
#record
# another comment
name: Martin Fowler
age: 99
(field values on two lines are INVALID, e.g. name:John\n Smith should error)
And the case class
case class Record(name:String, age:Int)
I want to return a Seq type such as Stream:
val records: Stream[Record]
The couple of ideas I'm working with, but so far haven't implemented, are:
Remove all new lines and treat the whole file as one long string. Then grep match on the string "((?!name).)+((?!age).)+age:([\s\d]+)" and create a new object of my case class for each match, but so far my regex-fu is low and I can't match around comments.
Recursive idea: Iterate through each line to find the first line that matches record, then recursively call the function to match name, then age. Tail recursively return Some(new Record(cumulativeMap.get(name), cumulativeMap.get(age))) or None when hitting the next record after name (i.e. age was never encountered)
A better idea?
Thanks for reading! The file is more complicated than above but all rules are equal. For the curious: I'm trying to parse a custom M3U playlist file format.
I'd use kantan.regex for a fairly trivial regex based solution.
Without fancy shapeless derivation, you can write the following:
import kantan.regex._
import kantan.regex.implicits._
case class Record(name:String, age:Int)
implicit val decoder = MatchDecoder.ordered(Record.apply _)
input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList
This yields:
List(Success(Record(John Doe,34)), Success(Record(Smith Holy,33)), Success(Record(Martin Fowler,99)))
Note that this solution requires you to hand-write the decoder, but it can often be automatically derived. If you don't mind a shapeless dependency, you could simply write:
import kantan.regex._
import kantan.regex.implicits._
import kantan.regex.generic._
case class Record(name:String, age:Int)
input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList
And get the exact same result.
Disclaimer: I'm the library's author.
You could use Parser Combinators.
If you have the file format specification in BNF or can write one, then Scala can create a parser for you from those rules. This may be more robust than hand-made regex based parsers. It's certainly more "Scala".
I don't have much experience in Scala, but could these regexes work:
You could use (?<=name:).* to match the name value, and (?<=age:).* to match the age value. If you use these, trim the whitespace from the matches; otherwise name: bob will match bob with a leading space, which you might not want.
If name: or any other tag appears in a comment, or a comment follows a value, it will be matched as well. Please leave a comment if you want to avoid that.
You could try this:
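import java.nio.charset.Charset
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._ // scala.jdk.CollectionConverters._ on Scala 2.13+

val file = Paths.get("file.txt")
val lines = Files.readAllLines(file, Charset.defaultCharset()).asScala.toList
// keep only the name/age lines, then pair them up (assumes name always precedes age)
val records = lines.map(_.trim)
  .filter(s => s.startsWith("age:") || s.startsWith("name:"))
  .grouped(2).toList.map {
    case List(a, b) => Record(a.replaceAll("name:", "").trim,
      b.replaceAll("age:", "").trim.toInt)
  }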
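The recursive idea from the question can also be sketched with the standard library alone. The following is a minimal, hedged example, not a full M3U parser: it assumes each record is exactly a #record marker followed by one name: line and one age: line, that any other line starting with # is a comment, and that errors should be reported as a Left rather than thrown. The names RecordParser and parse are invented for illustration, and it returns a List instead of a Stream for simplicity.

```scala
import scala.util.Try

object RecordParser {
  case class Record(name: String, age: Int)

  /** Returns all records, or Left(message) for the first malformed one. */
  def parse(lines: Seq[String]): Either[String, List[Record]] = {
    // Drop blank lines and comments, but keep the "#record" markers.
    val relevant = lines.map(_.trim)
      .filter(l => l.nonEmpty && (l == "#record" || !l.startsWith("#")))

    @annotation.tailrec
    def loop(rest: List[String], acc: List[Record]): Either[String, List[Record]] =
      rest match {
        case Nil => Right(acc.reverse)
        case "#record" :: n :: a :: tail =>
          if (!n.startsWith("name:")) Left(s"Expected name, found: $n")
          else if (!a.startsWith("age:")) Left(s"Expected age, found: $a")
          else Try(a.stripPrefix("age:").trim.toInt).toOption match {
            case Some(age) => loop(tail, Record(n.stripPrefix("name:").trim, age) :: acc)
            case None      => Left(s"Invalid age: $a")
          }
        case other :: _ => Left(s"Incomplete or unexpected input at: $other")
      }

    loop(relevant.toList, Nil)
  }
}
```

With this shape, a field value that spills onto a second line is rejected, because the continuation line shows up where the age: line is expected.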

Search for an item in a text file using UIMA Ruta

I have been trying to search for an item which is there in a text file.
The text file looks like this, e.g.:
>HEADING
00345
XYZ
MethodName : fdsafk
Date: 23-4-2012
More text and some part containing instances of XYZ
So I did a dictionary search for XYZ initially and found the positions, but I want only the first XYZ and not the rest. The XYZ I want has one distinguishing property: it is always between the 5-digit code and the text MethodName.
I am unable to do that.
WORDLIST ZipList = 'Zipcode.txt';
DECLARE Zip;
Document
Document{-> MARKFAST(Zip, ZipList)};
DECLARE Method;
"MethodName" -> Method;
WORDLIST typelist = 'typelist.txt';
DECLARE type;
Document{-> MARKFAST(type, typelist)};
Also, how do we use regular expressions in UIMA Ruta?
There are many ways to specify this. Here are some examples (not tested):
// just remove the other annotations (assuming type is the one you want)
type{-> UNMARK(type)} ANY{-STARTSWITH(Method)};
// only keep the first one: remove any annotation if there is one somewhere in front of it
// you can also specify this with POSITION or CURRENTCOUNT, but both are slow
type # @type{-> UNMARK(type)};
// just create a new annotation in between
NUM{REGEXP(".....")} #{-> type} @Method;
There are two options to use regex in UIMA Ruta:
(find) simple regex rules like "[A-Za-z]+" -> Type;
(matches) REGEXP conditions for validating the match of a rule element like
ANY{REGEXP("[A-Za-z]+")-> Type};
Let me know if something is not clear. I will extend the description then.
DISCLAIMER: I am a developer of UIMA Ruta

Scala string template

Is there default (in the SDK) Scala support for string templating? Example: "$firstName $lastName" (named, not numbered, parameters) or even constructs like for/if. If there is no such default engine, what is the best Scala library to accomplish this?
If you want a templating engine, I suggest you have a look at scalate. If you just need string interpolation, "%s %s".format(firstName, lastName) is your friend.
Complementing Kim's answer, note that Java's Formatter accepts positional parameters. For example:
"%2$s %1$s".format(firstName, lastName)
Also, there's the Enhanced Strings plugin, which allows one to embed arbitrary expressions on Strings. For example:
@EnhanceStrings // enhance strings in this scope
trait Example1 {
val x = 5
val str = "Inner string arithmetics: #{{ x * x + 12 }}"
}
See also this question for more answers, as this is really a close duplicate.
In Scala 2.10 and up, you can use string interpolation:
val name = "James"
println(s"Hello, $name") // Hello, James
val height = 1.9d
println(f"$name%s is $height%2.2f meters tall") // James is 1.90 meters tall
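The interpolators also accept arbitrary expressions inside ${...}, which covers simple if/for style templating without an external engine. A small sketch under that assumption; the names Templates, Person and render are invented for the example:

```scala
object Templates {
  case class Person(firstName: String, lastName: String, tags: List[String])

  def render(p: Person): String = {
    val tagPart =
      if (p.tags.isEmpty) "no tags"                      // "if" is an expression in Scala
      else p.tags.map(t => s"[$t]").mkString(", ")       // loop over items via map
    s"${p.firstName} ${p.lastName.toUpperCase} ($tagPart)" // ${...} embeds any expression
  }
}
```

For example, Templates.render(Templates.Person("Ada", "Lovelace", List("math", "code"))) yields "Ada LOVELACE ([math], [code])". For larger templates with real control flow, a dedicated engine such as Scalate is likely the better fit.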
This compiler plug-in has provided string interpolation for a while:
http://jrudolph.github.com/scala-enhanced-strings/Overview.scala.html
More recently, the feature seems to be making it into the Scala trunk: https://lampsvn.epfl.ch/trac/scala/browser/scala/trunk/test/files/run/stringInterpolation.scala -- which opens up some interesting possibilities: https://gist.github.com/a69d8ffbfe9f42e65fbf (not sure if these were possible with the plug-in; I doubt it).