Slick: Is there a way to create a WHERE clause with a regex?

I am looking for a Slick equivalent to
select * from users where last_name ~* '[\w]*son';
So for example when having the following names in the database:
first_name | last_name
----------------------
Tore | Isakson
John | Smith
Solveig | Larsson
Marc | Finnigan
The result would be
first_name | last_name
----------------------
Tore | Isakson
Solveig | Larsson
My current solution is to interpolate this with an SQLActionBuilder like
val pattern = "[\\w]*son"
val action = sql""" SELECT * FROM users WHERE last_name ~* ${pattern}; """.as[User]
But this is not the way I would like to have it. I would prefer something like
users.filter(_.last_name matchRegex "[\\w]*son") // <- This does not exist
In case it is relevant: I use Postgres.

(This answer follows on from the question Slick: How can I combine a SQL LIKE statement with a SQL IN statement.)
Although Slick doesn't support the ~* operator out of the box, you can add it yourself. That gives you a way to execute the query using Slick's lifted embedded query style.
To do that, you can use the SimpleExpression builder. There's not much documentation on it, but the jumping-off point would be the Scalar Database Functions page of the reference manual.
What we want to do is write a method along these lines:
def find(names: Seq[String]): DBIO[Seq[String]] = {
  val pattern = names.mkString("|")
  users.filter(_.lastName regexLike pattern).map(_.lastName).result
}
To get regexLike we can enrich (enhance, "pimp") a string column so that it has the regexLike method:
implicit class RegexLikeOps(s: Rep[String]) {
  def regexLike(p: Rep[String]): Rep[Boolean] = {
    val expr = SimpleExpression.binary[String, String, Boolean] { (s, p, qb) =>
      qb.expr(s)
      qb.sqlBuilder += " ~* "
      qb.expr(p)
    }
    expr.apply(s, p)
  }
}
The implicit class part allows the compiler to construct a RegexLikeOps instance any time it has a Rep[String] on which a method is called that Rep[String] doesn't already have (i.e., when regexLike is asked for).
Our regexLike method takes another Rep[String] argument as the pattern, and then uses the SimpleExpression builder to safely construct the SQL we want to use.
Putting it all together we can write:
val program = for {
  _      <- users.schema.create
  _      <- users ++= User("foo") :: User("baz") :: User("bar") :: Nil
  result <- find(Seq("baz", "bar"))
} yield result

println(Await.result(db.run(program), 2.seconds))
The SQL generated (in my test with H2) is:
select "last_name" from "app_user" where "last_name" ~* 'baz|bar'
The full code is: https://github.com/d6y/so46199828

Related

Spark - remove special characters from rows Dataframe with different column types

Assume I have a DataFrame with many columns; some are of type string, others of type int, and others of type map.
e.g.
field/columns types: stringType|intType|mapType<string,int>|...
|--------------------------------------------------------------------------
| myString1 |myInt1| myMap1 |...
|--------------------------------------------------------------------------
|"this_is_#string"| 123 |{"str11_in#map":1,"str21_in#map":2, "str31_in#map": 31}|...
|"this_is_#string"| 456 |{"str12_in#map":1,"str22_in#map":2, "str32_in#map": 32}|...
|"this_is_#string"| 789 |{"str13_in#map":1,"str23_in#map":2, "str33_in#map": 33}|...
|--------------------------------------------------------------------------
I want to remove some characters like '_' and '#' from all columns of String and Map type
so the result Dataframe/RDD will be:
|------------------------------------------------------------------------
|myString1 |myInt1| myMap1|... |
|------------------------------------------------------------------------
|"thisisstring"| 123 |{"str11inmap":1,"str21inmap":2, "str31inmap": 31}|...
|"thisisstring"| 456 |{"str12inmap":1,"str22inmap":2, "str32inmap": 32}|...
|"thisisstring"| 789 |{"str13inmap":1,"str23inmap":2, "str33inmap": 33}|...
|-------------------------------------------------------------------------
I am not sure if it's better to convert the Dataframe into an RDD and work with it or perform the work in the Dataframe.
Also, I am not sure how to best handle the regexp with different column types (I am using Scala).
And I would like to perform this action for all columns of these two types (string and map), trying to avoid using the column names like:
def cleanRows(mytabledata: DataFrame): RDD[String] = {
  //this will do the work for a specific column (myString1) of type string
  val oneColumn_clean = mytabledata.withColumn("myString1", regexp_replace(col("myString1"), "[_#]", ""))
  ...
  //return type can be RDD or Dataframe...
}
Is there any simple solution to perform this?
Thanks
One option is to define two UDFs to handle String type columns and Map type columns separately:
import org.apache.spark.sql.functions.udf
val df = Seq(("this_is#string", 3, Map("str1_in#map" -> 3))).toDF("myString", "myInt", "myMap")
df.show
+--------------+-----+--------------------+
| myString|myInt| myMap|
+--------------+-----+--------------------+
|this_is#string| 3|Map(str1_in#map -...|
+--------------+-----+--------------------+
1) Udf to handle string type columns:
def remove_string: String => String = _.replaceAll("[_#]", "")
def remove_string_udf = udf(remove_string)
2) Udf to handle Map type columns:
def remove_map: Map[String, Int] => Map[String, Int] = _.map{ case (k, v) => k.replaceAll("[_#]", "") -> v }
def remove_map_udf = udf(remove_map)
3) Apply udfs to corresponding columns to clean it up:
df.withColumn("myString", remove_string_udf($"myString")).
withColumn("myMap", remove_map_udf($"myMap")).show
+------------+-----+-------------------+
| myString|myInt| myMap|
+------------+-----+-------------------+
|thisisstring| 3|Map(str1inmap -> 3)|
+------------+-----+-------------------+
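The question also asks how to avoid hard-coding column names. One hypothetical way (a sketch, not a tested snippet) is to drive the choice of UDF from the DataFrame schema, reusing remove_string_udf and remove_map_udf from above and assuming every map column has the Map[String, Int] shape that remove_map_udf expects:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{MapType, StringType}
// Apply the matching UDF to every String or Map column; other types are left as-is.
def cleanAllColumns(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      case StringType => acc.withColumn(field.name, remove_string_udf(col(field.name)))
      case _: MapType => acc.withColumn(field.name, remove_map_udf(col(field.name)))
      case _          => acc
    }
  }
cleanAllColumns(df).show should then give the same cleaned output as above without naming myString or myMap explicitly.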

Pattern matching - spark scala RDD

I am new to Spark and Scala, coming from an R background. After a few transformations of an RDD, I get an RDD of type
Description: RDD[(String, Int)]
Now I want to apply a regular expression to the String part of the RDD, extract substrings from it, and add just the substring as a new column.
Input Data :
BMW 1er Model,278
MINI Cooper Model,248
Output I am looking for :
Input                 | Brand | Series
BMW 1er Model,278     | BMW   | 1er
MINI Cooper Model,248 | MINI  | Cooper
where Brand and Series are newly calculated substrings from the String part of the RDD.
What I have done so far:
I could achieve this for a single String using a regular expression, but I can't apply it to all lines.
val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r //to look for BMW or MINI
Then I can use
brandRegEx.findFirstIn("hello this mini is bmW testing")
But how can I use it for all the lines of the RDD, and apply different regular expressions, to achieve the output above?
I came across this code snippet, but I am not sure how to put it all together.
val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r
def getBrand(col4: String): String = col4 match {
  case brandRegEx(str) => str // intended to return the matched brand substring
  case _ => ""
}
Any help would be appreciated!
Thanks
To apply your regex to each item in the RDD, you should use the RDD map function, which transforms each row of the RDD using some function (in this case, a partial function that extracts the two parts of the tuple making up each row):
import org.apache.spark.{SparkContext, SparkConf}

object Example extends App {

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("Example"))

  val data = Seq(
    ("BMW 1er Model", 278),
    ("MINI Cooper Model", 248))

  val dataRDD = sc.parallelize(data)

  val processedRDD = dataRDD.map {
    case (inString, inInt) =>
      val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r
      val brand = brandRegEx.findFirstIn(inString).getOrElse("NOT FOUND")
      //val seriesRegEx = ...
      //val series = seriesRegEx.findFirstIn(inString).getOrElse("NOT FOUND")
      val series = "foo"
      (inString, inInt, brand, series)
  }

  processedRDD.collect().foreach(println)
  sc.stop()
}
Note that I think you have some problems in your regular expression, and you also need a regular expression for finding the series. This code outputs:
(BMW 1er Model,278,BMW,foo)
(MINI Cooper Model,248,NOT FOUND,foo)
But if you correct your regexes for your needs, this is how you can apply them to each row.
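As a hypothetical follow-up (not part of the original answer), one regex with two capture groups can pull out both brand and series in the same map step; the names brandSeriesRegEx and withBrandAndSeries are made up for this sketch:
// Case-insensitive: group 1 captures the brand, group 2 the token after it (the series).
val brandSeriesRegEx = """(?i)\s*(BMW|MINI)\s+(\S+).*""".r
val withBrandAndSeries = dataRDD.map {
  case (inString, inInt) =>
    inString match {
      case brandSeriesRegEx(brand, series) => (inString, inInt, brand, series)
      case _                               => (inString, inInt, "NOT FOUND", "NOT FOUND")
    }
}
// withBrandAndSeries.collect().foreach(println) prints:
// (BMW 1er Model,278,BMW,1er)
// (MINI Cooper Model,248,MINI,Cooper)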
Hi, I was just looking for another question and came across this one. The above problem can also be done using normal transformations:
val a = sc.parallelize(collection)
a.map { case (x, y) => (x.split(" ")(0) + " " + x.split(" ")(1)) }.collect

Regex / subString to extract all matching patterns / groups

I get this as a response to an API hit.
1735 Queries
Taking 1.001303 to 31.856310 seconds to complete
SET timestamp=XXX;
SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
38 Queries
Taking 1.007646 to 5.284330 seconds to complete
SET timestamp=XXX;
show slave status;
6 Queries
Taking 1.021271 to 1.959838 seconds to complete
SET timestamp=XXX;
SHOW SLAVE STATUS;
2 Queries
Taking 4.825584, 18.947725 seconds to complete
use marketing;
SET timestamp=XXX;
SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I have extracted this out of the response HTML and now have it as a string. I need to retrieve the values as concisely as possible, so that I get a map of the form Map(query -> "T1 to T2 seconds"). Basically, this is the status of all the slow queries running on the MySQL slave server, and I am building an alert system over it. So from this entire paragraph, in the form of a String, I need to separate out the queries and save the corresponding time range with them.
1.001303 to 31.856310 is a time range, and against that time range the corresponding query is:
SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I was hoping to save this information in a Map in Scala, a Map of the form (query: String -> timeRange: String).
Another example:
("use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified xyz ;"->"4.825584 to 18.947725 seconds")
"""###(.)###(.)\n\n(.*)###""".r.findAllIn(reqSlowQueryData).matchData foreach {m => println("group0"+m.group(1)+"next group"+m.group(2)+m.group(3)}
I am using the above statement to extract the repeating cells so I can do my manipulations on them later, but it doesn't seem to be working.
Thanks in advance! I know there are several ways to do this, but all the ones that come to mind are inefficient and tedious, and I need to do this in Scala. Maybe I can extract recursively using the subString method?
If you want to use Scala, try this:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range
val txt = """
|1735 Queries
|
|Taking 1.001303 to 31.856310 seconds to complete
|
|SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
|
|38 Queries
|
|Taking 1.007646 to 5.284330 seconds to complete
|
|SET timestamp=XXX; show slave status;
|
|6 Queries
|
|Taking 1.021271 to 1.959838 seconds to complete
|
|SET timestamp=XXX; SHOW SLAVE STATUS;
|
|2 Queries
|
|Taking 4.825584, 18.947725 seconds to complete
|
|use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
""".stripMargin
def logToMap(txt: String) = {
  val (_, map) = txt.lines.foldLeft[(Option[String], Map[String, String])]((None, Map.empty)) {
    (acc, el) =>
      val (taking, map) = acc // taking contains range
      taking match {
        case Some(range) if el.trim.nonEmpty => // Some contains range
          (None, map + (el -> range)) // add to map
        case None =>
          regex.findFirstIn(el) match { // extract range
            case Some(range) => (Some(range), map)
            case _ => (None, map)
          }
        case _ => (taking, map) // probably empty line
      }
  }
  map
}
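For example, running logToMap(txt) on the sample text above should give a map keyed by each statement line, roughly:
Map(SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX'; -> 1.001303 to 31.856310 seconds, SET timestamp=XXX; show slave status; -> 1.007646 to 5.284330 seconds, ...)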
Modified ajozwik's answer to work for SQL commands spanning multiple lines:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range
def logToMap(txt:String) = {
val (_,map) = txt.lines.foldLeft[(Option[String],Map[String,String])]((None,Map.empty)){
(accumulator,element) =>
val (taking,map) = accumulator
taking match {
case Some(range) if element.trim.nonEmpty=> {
if (element.contains("Queries"))
(None, map)
else
(Some(range),map+(range->(map.getOrElse(range,"")+element)))
}
case None =>
regex.findFirstIn(element) match {
case Some(range) => (Some(range),map)
case _ => (None,map)
}
case _ => (taking,map)
}
}
println(map)
map
}
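Note that this version inverts the mapping: the time range becomes the key and the statement lines for that range are concatenated into the value, so an entry looks roughly like:
1.001303 to 31.856310 seconds -> SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
Accumulation for a range stops as soon as a line containing "Queries" is seen.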

Operating on an F# List of Union Types

This is a continuation of my question at F# List of Union Types. Thanks to the helpful feedback, I was able to create a list of Reports, with Report being either Detail or Summary. Here's the data definition once more:
module Data
type Section = { Header: string;
                 Lines: string list;
                 Total: string }

type Detail = { State: string;
                Divisions: string list;
                Sections: Section list }

type Summary = { State: string;
                 Office: string;
                 Sections: Section list }
type Report = Detail of Detail | Summary of Summary
Now that I've got the list of Reports in a variable called reports, I want to iterate over those Report objects and perform operations based on each one. The operations are the same except for the cases of dealing with either Detail.Divisions or Summary.Office. Obviously, I have to handle those differently. But I don't want to duplicate all the code for handling the similar State and Sections of each.
My first (working) idea is something like the following:
for report in reports do
    let mutable isDetail = false
    let mutable isSummary = false
    match report with
    | Detail _ -> isDetail <- true
    | Summary _ -> isSummary <- true
...
This will give me a way to know when to handle Detail.Divisions rather than Summary.Office. But it doesn't give me an object to work with. I'm still stuck with report, not knowing which it is, Detail or Summary, and also unable to access the attributes. I'd like to convert report to the appropriate Detail or Summary and then use the same code to process either case, with the exception of Detail.Divisions and Summary.Office. Is there a way to do this?
Thanks.
You could do something like this:
for report in reports do
    match report with
    | Detail { State = s; Sections = l }
    | Summary { State = s; Sections = l } ->
        // common processing for state and sections (using bound identifiers s and l)
    match report with
    | Detail { Divisions = l } ->
        // unique processing for divisions
    | Summary { Office = o } ->
        // unique processing for office
The answer by #kvb is probably the approach I would use if I had the data structure you described. However, I think it would make sense to think whether the data types you have are the best possible representation.
The fact that both Detail and Summary share two of the properties (State and Sections) perhaps implies that there is some common part of a Report that is shared regardless of the kind of report (and the report can either add Divisions if it is detailed or just Office if if is summary).
Something like that would be better expressed using the following (Section stays the same, so I did not include it in the snippet):
type ReportInformation =
    | Divisions of string list
    | Office of string

type Report =
    { State : string;
      Sections : Section list
      Information : ReportInformation }
If you use this style, you can just access report.State and report.Sections (to do the common part of the processing) and then you can match on report.Information to do the varying part of the processing.
EDIT - In answer to Jeff's comment - if the data structure is already fixed, but the view has changed, you can use F# active patterns to write an "adaptor" that provides access to the old data structure using the view that I described above:
let (|Report|) = function
    | Detail dt -> dt.State, dt.Sections
    | Summary st -> st.State, st.Sections

let (|Divisions|Office|) = function
    | Detail dt -> Divisions dt.Divisions
    | Summary st -> Office st.Office
The first active pattern always succeeds and extracts the common part. The second allows you to distinguish between the two cases. Then you can write:
let processReport report =
    let (Report(state, sections)) = report
    // Common processing
    match report with
    | Divisions divs -> // Divisions-specific code
    | Office ofc -> // Offices-specific code
This is actually an excellent example of how F# active patterns provide an abstraction that allows you to hide implementation details.
kvb's answer is good, and probably what I would use. But the way you've expressed your problem sounds like you want classic inheritance.
type ReportPart(state, sections) =
    member val State = state
    member val Sections = sections

type Detail(state, sections, divisions) =
    inherit ReportPart(state, sections)
    member val Divisions = divisions

type Summary(state, sections, office) =
    inherit ReportPart(state, sections)
    member val Office = office
Then you can do precisely what you expect:
for report in reports do
    match report with
    | :? Detail as detail -> //use detail.Divisions
    | :? Summary as summary -> //use summary.Office
    //use common properties
You can pattern match on the Detail or Summary record in each of the union cases when you match and handle the Divisions or Office value with a separate function e.g.
let blah =
    for report in reports do
        let out =
            match report with
            | Detail({ State = state; Divisions = divisions; Sections = sections } as d) ->
                Detail({ d with Divisions = (handleDivisions divisions) })
            | Summary({ State = state; Office = office; Sections = sections } as s) ->
                Summary({ s with Office = handleOffice office })
        //process out
You can refactor the code to have a utility function for each common field and use nested pattern matching:
let handleReports reports =
    reports |> List.iter (function
        | Detail { State = s; Sections = ss; Divisions = ds } ->
            handleState s
            handleSections ss
            handleDivisions ds
        | Summary { State = s; Sections = ss; Office = o } ->
            handleState s
            handleSections ss
            handleOffice o)
You can also filter Detail and Summary to process them separately in different functions:
let getDetails reports =
    List.choose (function Detail d -> Some d | _ -> None) reports

let getSummaries reports =
    List.choose (function Summary s -> Some s | _ -> None) reports

Erlang record item list

For example, I have an Erlang record:
-record(state, {clients}).
Can I make the clients field a list, so that I can keep clients in that field as in a normal list? And how can I add some values to this list?
Thank you.
Maybe you mean something like:
-module(reclist).
-export([empty_state/0, some_state/0,
         add_client/1, del_client/1,
         get_clients/1]).

-record(state,
        {
          clients = [] ::[pos_integer()],
          dbname ::char()
        }).

empty_state() ->
    #state{}.

some_state() ->
    #state{
       clients = [1,2,3],
       dbname = "QA"}.

del_client(Client) ->
    S = some_state(),
    C = S#state.clients,
    S#state{clients = lists:delete(Client, C)}.

add_client(Client) ->
    S = some_state(),
    C = S#state.clients,
    S#state{clients = [Client|C]}.

get_clients(#state{clients = C, dbname = _D}) ->
    C.
Test:
1> reclist:empty_state().
{state,[],undefined}
2> reclist:some_state().
{state,[1,2,3],"QA"}
3> reclist:add_client(4).
{state,[4,1,2,3],"QA"}
4> reclist:del_client(2).
{state,[1,3],"QA"}
::[pos_integer()] means that the type of the field is a list of positive integer values, starting from 1; it's a hint for the static analysis tool Dialyzer when it performs type checking.
Erlang also allows you use pattern matching on records:
5> reclist:get_clients(reclist:some_state()).
[1,2,3]
Further reading:
Records
Types and Function Specifications
dialyzer(1)
@JUST MY correct OPINION's answer made me remember that I love how Haskell goes about getting the values of the fields in a data type.
Here's a definition of a data type, stolen from Learn You a Haskell for Great Good!, which leverages record syntax:
data Car = Car { company :: String
               , model :: String
               , year :: Int
               } deriving (Show)
It creates functions company, model and year, that lookup fields in the data type. We first make a new car:
ghci> Car "Toyota" "Supra" 2005
Car {company = "Toyota", model = "Supra", year = 2005}
Or, using record syntax (the order of fields doesn't matter):
ghci> Car {model = "Supra", year = 2005, company = "Toyota"}
Car {company = "Toyota", model = "Supra", year = 2005}
ghci> let supra = Car {model = "Supra", year = 2005, company = "Toyota"}
ghci> year supra
2005
We can even use pattern matching:
ghci> let (Car {company = c, model = m, year = y}) = supra
ghci> "This " ++ c ++ " " ++ m ++ " was made in " ++ show y
"This Toyota Supra was made in 2005"
I remember there were attempts to implement something similar to Haskell's record syntax in Erlang, but not sure if they were successful.
Some posts, concerning these attempts:
In Response to "What Sucks About Erlang"
Geeking out with Lisp Flavoured Erlang. However I would ignore parameterized modules here.
It seems that LFE uses macros, which are similar to what Scheme (Racket, for instance) provides when you want to create a new value of some structure:
> (define-struct car (company model year))
> (define supra (make-car "Toyota" "Supra" 2005))
> (car-model supra)
"Supra"
I hope we'll have something close to Haskell record syntax in the future, that would be really practically useful and handy.
Yasir's answer is the correct one, but I'm going to show you WHY it works the way it works so you can understand records a bit better.
Records in Erlang are a hack (and a pretty ugly one). Using the record definition from Yasir's answer...
-record(state,
{
clients = [] ::[pos_integer()],
dbname ::char()
}).
...when you instantiate this with #state{} (as Yasir did in empty_state/0 function), what you really get back is this:
{state, [], undefined}
That is to say your "record" is just a tuple tagged with the name of the record (state in this case) followed by the record's contents. Inside BEAM itself there is no record. It's just another tuple with Erlang data types contained within it. This is the key to understanding how things work (and the limitations of records to boot).
Now when Yasir did this...
add_client(Client) ->
S = some_state(),
C = S#state.clients,
S#state{clients = [Client|C]}.
...the S#state.clients bit translates into code internally that looks like element(2,S). You're using, in other words, standard tuple manipulation functions. S#state.clients is just a symbolic way of saying the same thing, but in a way that lets you know what element 2 actually is. It's syntactic saccharine that's an improvement over keeping track of individual fields in your tuples in an error-prone way.
Now for that last S#state{clients = [Client|C]} bit, I'm not absolutely positive as to what code is generated behind the scenes, but it is likely just straightforward stuff that does the equivalent of {state, [Client|C], element(3,S)}. It:
tags a new tuple with the name of the record (provided as #state),
copies the elements from S (dictated by the S# portion),
except for the clients piece overridden by {clients = [Client|C]}.
All of this magic is done via a preprocessing hack behind the scenes.
Understanding how records work behind the scenes is beneficial both for understanding code written using records as well as for understanding how to use them yourself (not to mention understanding why things that seem to "make sense" don't work with records -- because they don't actually exist down in the abstract machine...yet).
If you are only adding or removing single items from the clients list in the state you could cut down on typing with a macro.
-record(state, {clients = [] }).
-define(AddClientToState(Client,State),
State#state{clients = lists:append([Client], State#state.clients) } ).
-define(RemoveClientFromState(Client,State),
State#state{clients = lists:delete(Client, State#state.clients) } ).
Here is a test escript that demonstrates:
#!/usr/bin/env escript

-record(state, {clients = [] }).

-define(AddClientToState(Client,State),
        State#state{clients = lists:append([Client], State#state.clients)} ).
-define(RemoveClientFromState(Client,State),
        State#state{clients = lists:delete(Client, State#state.clients)} ).

main(_) ->
    %% Start with a state with an empty list of clients.
    State0 = #state{},
    io:format("Empty State: ~p~n", [State0]),

    %% Add foo to the list.
    State1 = ?AddClientToState(foo, State0),
    io:format("State after adding foo: ~p~n", [State1]),

    %% Add bar to the list.
    State2 = ?AddClientToState(bar, State1),
    io:format("State after adding bar: ~p~n", [State2]),

    %% Add baz to the list.
    State3 = ?AddClientToState(baz, State2),
    io:format("State after adding baz: ~p~n", [State3]),

    %% Remove bar from the list.
    State4 = ?RemoveClientFromState(bar, State3),
    io:format("State after removing bar: ~p~n", [State4]).
Result:
Empty State: {state,[]}
State after adding foo: {state,[foo]}
State after adding bar: {state,[bar,foo]}
State after adding baz: {state,[baz,bar,foo]}
State after removing bar: {state,[baz,foo]}