I'm using ANTLR4 and the CSS grammar from https://github.com/antlr/grammars-v4/tree/master/css3. The grammar defines the following ( pared down a little for brevity )
dimension
: ( Plus | Minus )? Dimension
;
fragment FontRelative
: Number E M
| Number E X
| Number C H
| Number R E M
;
fragment AbsLength
: Number P X
| Number C M
| Number M M
| Number I N
| Number P T
| Number P C
| Number Q
;
fragment Angle
: Number D E G
| Number R A D
| Number G R A D
| Number T U R N
;
fragment Length
: AbsLength
| FontRelative
;
Dimension
: Length
| Angle
;
The matching works fine but I don't see an obvious way to extract the units. The parser creates a DimensionContext which has 3 TerminalNode members - Dimension, Plus and Minus. I'd like to be able to extract the unit during parse without having to do additional string parsing.
I know that one issue that the Length and Angle are fragments. I changed the grammar not use fragments
Unit
: 'em'
| 'ex'
| 'ch'
| 'rem'
| 'vw'
| 'vh'
| 'vmin'
| 'vmax'
| 'px'
| 'cm'
| 'mm'
| 'in'
| 'pt'
| 'q'
| 'deg'
| 'rad'
| 'grad'
| 'turn'
| 'ms'
| 's'
| 'hz'
| 'khz'
;
Dimension : Number Unit;
And things still parse but I don't get any more context about what the units are - the Dimension is still a single TerminalNode. Is there a way to deal with this without having to pull apart the full token string?
You will want to do as little as possible in the lexer:
NUMBER
: Dash? Dot Digit+ { atNumber(); }
| Dash? Digit+ ( Dot Digit* )? { atNumber(); }
;
UNIT
: { aftNumber() }?
( 'px' | 'cm' | 'mm' | 'in'
| 'pt' | 'pc' | 'em' | 'ex'
| 'deg' | 'rad' | 'grad' | '%'
| 'ms' | 's' | 'hz' | 'khz'
)
;
The trick is to produce the NUMBER and UNIT as separate tokens, yet limited to the required ordering. The actions in the NUMBER rule just set a flag and the UNIT predicate ensures that a UNIT can only follow a NUMBER:
protected void atNumber() {
_number = true;
}
protected boolean aftNumber() {
if (_number && Character.isWhitespace(_input.LA(1))) return false;
if (!_number) return false;
_number = false;
return true;
}
The parser rule is trivial, but preserves the detail required:
number
: NUMBER UNIT?
;
Use a tree-walk, parse the NUMBER to a Double and an enum (or equivalent) to provide the semantic UNIT characterization:
public enum Unit {
CM("cm", true, true), // 1cm = 96px/2.54
MM("mm", true, true),
IN("in", true, true), // 1in = 2.54cm = 96px
PX("px", true, true), // 1px = 1/96th
PT("pt", true, true), // 1pt = 1/72th
EM("em", false, true), // element font size
REM("rem", false, true), // root element font size
EX("ex", true, true), // element font x-height
CAP("cap", true, true), // element font nominal capital letters height
PER("%", false, true),
DEG("deg", true, false),
RAD("rad", true, false),
GRAD("grad", true, false),
MS("ms", true, false),
S("s", true, false),
HZ("hz", true, false),
KHZ("khz", true, false),
NONE(Strings.EMPTY, true, false), // 'no unit specified'
INVALID(Strings.UNKNOWN, true, false);
public final String symbol;
public final boolean abs;
public final boolean len;
private Unit(String symbol, boolean abs, boolean len) {
this.symbol = symbol;
this.abs = abs;
this.len = len;
}
public boolean isAbsolute() { return abs; }
public boolean isLengthUnit() { return len; }
// call from the visitor to resolve from `UNIT` to Unit
public static Unit find(TerminalNode node) {
if (node == null) return NONE;
for (Unit unit : values()) {
if (unit.symbol.equalsIgnoreCase(node.getText())) return unit;
}
return INVALID;
}
#Override
public String toString() {
return symbol;
}
}
Related
I am trying to replace all strings in a column that start with 'DEL_' with a NULL value.
I have tried this:
customer_details = customer_details.withColumn("phone_number", F.regexp_replace("phone_number", "DEL_.*", ""))
Which works as expected and the new column now looks like this:
+--------------+
| phone_number|
+--------------+
|00971585059437|
|00971559274811|
|00971559274811|
| |
|00918472847271|
| |
+--------------+
However, if I change the code to:
customer_details = customer_details.withColumn("phone_number", F.regexp_replace("phone_number", "DEL_.*", None))
This now replaces all values in the column:
+------------+
|phone_number|
+------------+
| null|
| null|
| null|
| null|
| null|
| null|
+------------+
Try this-
scala
df.withColumn("phone_number", when(col("phone_number").rlike("^DEL_.*"), null)
.otherwise(col("phone_number"))
)
python
df.withColumn("phone_number", when(col("phone_number").rlike("^DEL_.*"), None)
.otherwise(col("phone_number"))
)
Update
Query-
Can you explain why my original solution doesn't work? customer_details.withColumn("phone_number", F.regexp_replace("phone_number", "DEL_.*", None))
Ans- All the ternary expressions(functions taking 3 arguments) are all null-safe. That means if spark finds any of the arguments null, it will indeed return null without any actual processing (eg. pattern matching for regexp_replace).
you may wanted to look at this piece of spark repo
override def eval(input: InternalRow): Any = {
val exprs = children
val value1 = exprs(0).eval(input)
if (value1 != null) {
val value2 = exprs(1).eval(input)
if (value2 != null) {
val value3 = exprs(2).eval(input)
if (value3 != null) {
return nullSafeEval(value1, value2, value3)
}
}
}
null
}
I have to write a method "all()" which returns a list of tuples; each tuple will contain the row, column and set relevant to a particular given row and column, when the function meets a 0 in the list. I already have written the "hyp" function which returns the set part I need, eg: Set(1,2).
I am using a list of lists:
| 0 | 0 | 9 |
| 0 | x | 0 |
| 7 | 0 | 8 |
If Set (1,2) are referring to the cell marked as x, all() should return: (1,1, Set(1,2))
where 1,1 are the index of the row and column.
I wrote this method by using zipWithIndex but I can't use this function. Is there any simpler way how to access an index as in this case? Thanks in advance
Code:
def all(): List[(Int, Int, Set[Int])] =
{
puzzle.list.zipWithIndex flatMap
{
rowAndIndex =>
rowAndIndex._1.zipWithIndex.withFilter(_._1 == 0) map
{
colAndIndex =>
(rowAndIndex._2, colAndIndex._2, hyp(rowAndIndex._2, colAndIndex._2))
}
}
}
The (_._1 == 0 ) is because the function has to return the (Int,Int, Set()) only when it finds a 0 in the grid
The all function can be simplified as this:
// Given a list of list
puzzle.list = List(List(0, 0, 9), List(0, 5, 0), List(7, 0, 8))
for {
(row, rIndex) <- puzzle.list.zipWithIndex // List of (row, index)
// List( (List(0,0,9), 0) ...
(col, cIndex) <- row.zipWithIndex; // List of (col, index)
// List( (0,0), (0,1), (9, 2) ...
if (col == 0) // keep only col with 0
} yield (rIndex, cIndex, hyp(rIndex, cIndex))
type t = {
dir : [ `Buy | `Sell ];
quantity : int;
price : float;
mutable cancelled : bool;
}
There is a ` before Buy and Sell, what does that mean?
Also what are the type [ | ]?
The ` and [] syntax are to define polymorphic variants. They are similar in spirit to an inline variant definition.
http://caml.inria.fr/pub/docs/manual-ocaml-4.00/manual006.html#toc36
In your case, dir can take the value `Buy or `Sell, and pattern matching works accordingly:
let x = { dir = `Buy, quantity = 5, price = 1.0, cancelled = true }
match x.dir with
| `Buy -> 1
| `Sell -> 2
A premise, I'm not a programmer, I'm a physicist and I use c++ as a tool to analyze data (ROOT package). My knowledge might be limited!
I have this situation, I read data from a file and store them in a vector (no problem with that)
vector<double> data;
with this data I want to plot a correlation plot, so I need to split them up in two different subsets one of which will be the X entries of a 2D histogram and the other the Y entries.
The splitting must be as follow, I have this table (I only copy a small part of it just to explain the problem)
************* LBA - LBC **************
--------------------------------------
Cell Name | Channel | PMT |
D0 | 0 | 1 |
A1-L | 1 | 2 |
BC1-R | 2 | 3 |
BC1-L | 3 | 4 |
A1-R | 4 | 5 |
A2-L | 5 | 6 |
BC2-R | 6 | 7 |
BC2-L | 7 | 8 |
A2-R | 8 | 9 |
A3-L | 9 | 10 |
A3-R | 10 | 11 |
BC3-L | 11 | 12 |
BC3-R | 12 | 13 |
D1-L | 13 | 14 |
D1-R | 14 | 15 |
A4-L | 15 | 16 |
BC4-R | 16 | 17 |
BC4-L | 17 | 18 |
A4-R | 18 | 19 |
A5-L | 19 | 20 |
...
None | 31 | 32 |
as you can see there are entries like A1-L and A1-R which corresponds to the left and right side of one cell, to this left and right side are associated an int that corresponds to a channel, in this case 1 and 4. I wish these left and right side to be on the X and Y axis of my 2D histogram.
The problem is then to associate to the vector of data somehow this table so that I can pick the channels that belongs to the same cell and put them one on the X axis and the other on the Y axis. To complicate the things there are also cells that don't have a partner like in this example D0 and channels that don't have a cell associated like channel 31.
My attempted solution is to create an indexing vector
vector<int> indexing = (0, 1, 4, ....);
and an ordered data vector
vector<double> data_ordered;
and fill the ordered vector with something like
for( vector<int> iterator it = indexing.begin(); it != indexing.end(); ++it)
data_ordered.push_back(data.at(*it));
and then put the even index of data_ordered on the X axis and the odd values on the Y axis but I have the problem of the D0 cell and the empty ones!
Another idea that I had is to create a struct like
struct cell{
string cell_name;
int left_channel;
int right_channel;
double data;
....
other informations
};
and then try to work with that, but there it comes my lack of c++ knowledge! Can someone give me an hint on how to solve this problem? I hope that my question is clear enough and that it respects the rules of this site!
EDIT----------
To clarify the problem I try to explain it with an example
vector<double> data = (data0, data1, data2, data3, data4, ...);
do data0 has index 0 and if I go to the table I see it corresponds to the cell D0 which has no other partner and let's say can be disregarded for now. data1 has index 1 and it corresponds to the left part of the cell A1 (A1-L) so I need to find the right partner which has index 4 in the table and ideally leads me to pick data4 from the vector containing the data.
I hope this clarify the situation at least a little!
Here is an engine that does what you want, roughly:
#include <vector>
#include <map>
#include <string>
#include <iostream>
enum sub_entry { left, right, only };
struct DataType {
std::string cell;
sub_entry sub;
DataType( DataType const& o ): cell(o.cell), sub(o.sub) {};
DataType( const char* c, sub_entry s=only ):
cell( c ),
sub( s )
{}
DataType(): cell("UNUSED"), sub(only) {};
// lexographic weak ordering:
bool operator<( DataType const& o ) const {
if (cell != o.cell)
return cell < o.cell;
return sub < o.sub;
}
};
typedef std::vector< double > RawData;
typedef std::vector< DataType > LookupTable;
typedef std::map< DataType, double > OrganizedData;
OrganizedData organize( RawData const& raw, LookupTable const& table )
{
OrganizedData retval;
for( unsigned i = 0; i < raw.size() && i < table.size(); ++i ) {
DataType d = table[i];
retval[d] = raw[i];
}
return retval;
}
void PrintOrganizedData( OrganizedData const& data ) {
for (OrganizedData::const_iterator it = data.begin(); it != data.end(); ++it ) {
std::cout << (*it).first.cell;
switch( (*it).first.sub ) {
case left: {
std::cout << "-L";
} break;
case right: {
std::cout << "-R";
} break;
case only: {
} break;
}
std::cout << " is " << (*it).second << "\n";
}
}
int main() {
RawData test;
test.push_back(3.14);
test.push_back(2.8);
test.push_back(-1);
LookupTable table;
table.resize(3);
table[0] = DataType("A1", left);
table[1] = "D0";
table[2] = DataType("A1", right);
OrganizedData org = organize( test, table );
PrintOrganizedData( org );
}
The lookup table stores what channel maps to what cell name and side.
Unused entries in the lookup table should be set to DataType(), which will flag their values to be stored in an "UNUSED" location. (It will still be stored, but you can discard it afterwards).
The result of this is a map from (CellName, Side) to the double data. I included a simple printer that just dumps the data. If you have graphing software, you can figure out a way to make a graph from it. Skipping "UNUSED" is an exercise that involves checking (*it).first.cell == "UNUSED" in that printing loop.
I believe everything is C++03 compliant. A bunch of the above becomes prettier if you had a C++11 compiler.
I have following data in my database:
| value1 | value2 |
|----------+----------|
| 1 | a |
| 1 | b |
| 2 | a |
| 3 | c |
| 3 | d |
|----------+----------|
What I want as a output is {"key":1,"value":[a,b]},{"key":2,"value":[a]},{"key":3,"value":[c,d]}
I wrote this map function (but not quiet sure if this is correct)
function(doc) {
emit(doc.value1,doc.value2);
}
...but I am missing the reduce-function. Thanks for your help!
Not sure if this can/should be done with a reduce function.
However, you can reformat the output with lists. Try the following list function:
function (head, req) {
var row,
returnObj = {};
while (row = getRow()) {
if(returnObj[row.key]){
returnObj[row.key].push(row.value);
}else{
returnObj[row.key] = [row.value];
}
}
send(JSON.stringify(returnObj));
};
The output should look like this:
{
1: [
"a",
"b"
],
2: [
"a"
],
3: [
"c",
"d"
]
}
Hope that helps.