Use gazetteer as dictionary within JAPE rule in GATE

I have this scenario:
I have a list of key-value pairs in the form of (for instance)
000.000.0001.000 VALUE1
000.000.0002.000 VALUE2
...
000.010.0001.000 VALUE254
The document presents the information in a table as follows:
SK1 | SK2 | SK3 | SK4
000 | 000 | 0001 | 000
The problem is that when this table is processed, it turns into
000
000
0001
000
So a gazetteer won't match it. I figured I could construct a JAPE rule to match this, and it correctly matches the 4 key parts.
Now I would need to load the gazetteer from within my JAPE rule into a structure (for instance, a HashMap) so I can look up the concatenation of these 4 key parts and get (for example) "VALUE1". Is it possible to load a gazetteer from within a JAPE file and use it as a dictionary?
Is there any other (better) way to do what I need?
Thanks a lot.

I found the solution to my problem using the GazetteerList class with the following snippet:
// Gazetteer object
GazetteerList gazList = new GazetteerList();
// Object to map gazetteer entries to their positions in the list
// i.e.: 000.000.0001.000 -> 1,3
// This is because, in my case, the same key
// can appear more than once in the gazetteer
HashMap<String, ArrayList<Integer>> keyMap =
    new HashMap<String, ArrayList<Integer>>();
try {
    gazList.setMode(GazetteerList.LIST_MODE);
    gazList.setSeparator("\t");
    gazList.setURL(
        new URL("file:/path/to/gazetteer/gazetteer_list_file.lst"));
    gazList.load();
    // Here is the mapping between the keys and their positions
    int pos = 0;
    for (GazetteerNode gazNode : gazList) {
        if (keyMap.get(gazNode.getEntry()) == null)
            keyMap.put(gazNode.getEntry(), new ArrayList<Integer>());
        keyMap.get(gazNode.getEntry()).add(pos);
        pos++;
    }
} catch (MalformedURLException ex) {
    System.out.println(ex);
} catch (ResourceInstantiationException ex) {
    System.out.println(ex);
}
Then you can look up the matched key in the map and get its features:
for (Integer index : keyMap.get(key)) {
    FeatureMap fmap = toFeatureMap(gazList.get(index).getFeatureMap());
    fmap.put("additionalFeature", "feature");
    outputAS.add(startOffset, endOffset, "Lookup", fmap);
}


How to generate random key-value map pairs from a given set of values in Scala and then store the keys and values in separate variables

I am trying to generate 1000 random key-value map pairs from the given (statically defined) map of 2 key-value pairs in Scala. Later I would also want to break up the key and value pairs and store them in separate variables.
What I have tried:
object methodTest extends App {
  val testMap = Map("3875835" -> "ABCDE", "316067107" -> "EFGHI")
  def getRandomElement(seq: Map[String, String]): Map[String, String] = {
    seq(Random.nextInt(seq.length))
  }
  var outList = List.empty[Map[String, String]]
  for (i <- 0 to 1000) {
    outList += getRandomElement(testMap)
  }
  print(outList)
}
The output should contain 1000 key-value pairs of the map, as shown below:
[3875835,ABCDE]
[316067107,EFGHI]
[3875835,ABCDE]
[316067107,EFGHI]
[316067107,EFGHI]
............
............
............. up to 1000 random key-value pairs
Please help me figure out where I am going wrong and how to resolve it. If there is any issue regarding the requirement, please feel free to comment.
You can transform your seed map testMap into a sequence of key/value tuples using .toSeq and then generate the key/value pairs by iterating over the numbers from 0 until 1000, associating each number with a random choice among the elements of the seed:
import scala.util.Random
val testMap = Map("3875835" -> "ABCDE", "316067107" -> "EFGHI")
val seed = testMap.toSeq
val keyValuesList = (0 until 1000).map(index => seed(Random.nextInt(seed.size)))
Note: 0 until 1000 returns all the numbers from 0 up to but excluding 1000, so 1000 numbers. If you use 0 to 1000 you will get all the numbers from 0 to 1000 inclusive, so 1001 numbers.
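A quick illustration of the difference on small bounds (just an illustrative snippet, not part of the original code):
(0 until 3).toList  // List(0, 1, 2)    - upper bound excluded, 3 numbers
(0 to 3).toList     // List(0, 1, 2, 3) - upper bound included, 4 numbers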
If you want to print the resulting list, you can use .foreach method with println function as argument:
keyValuesList.foreach(println)
And you will get:
(3875835,ABCDE)
(316067107,EFGHI)
(3875835,ABCDE)
(3875835,ABCDE)
(316067107,EFGHI)
(3875835,ABCDE)
...
If you want to keep only the keys, you can iterate over the list using the .map method, taking only the first element of each tuple via ._1, which retrieves the first element of a tuple:
val keys = keyValuesList.map(keyValuePair => keyValuePair._1)
And if you want only a list containing all second elements of each pair:
val values = keyValuesList.map(keyValuePair => keyValuePair._2)
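As a side note (not part of the original answer), if you need both lists at once, .unzip splits a sequence of pairs into a pair of sequences:
val (keys, values) = keyValuesList.unzip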

Spark - remove special characters from rows of a DataFrame with different column types

Assume I have a DataFrame with many columns; some are of type string, others of type int, and others of type map.
e.g.
field/columns types: stringType|intType|mapType<string,int>|...
|--------------------------------------------------------------------------
| myString1 |myInt1| myMap1 |...
|--------------------------------------------------------------------------
|"this_is_#string"| 123 |{"str11_in#map":1,"str21_in#map":2, "str31_in#map": 31}|...
|"this_is_#string"| 456 |{"str12_in#map":1,"str22_in#map":2, "str32_in#map": 32}|...
|"this_is_#string"| 789 |{"str13_in#map":1,"str23_in#map":2, "str33_in#map": 33}|...
|--------------------------------------------------------------------------
I want to remove some characters like '_' and '#' from all columns of string and map type, so the resulting DataFrame/RDD will be:
|------------------------------------------------------------------------
|myString1 |myInt1| myMap1|... |
|------------------------------------------------------------------------
|"thisisstring"| 123 |{"str11inmap":1,"str21inmap":2, "str31inmap": 31}|...
|"thisisstring"| 456 |{"str12inmap":1,"str22inmap":2, "str32inmap": 32}|...
|"thisisstring"| 789 |{"str13inmap":1,"str23inmap":2, "str33inmap": 33}|...
|-------------------------------------------------------------------------
I am not sure if it's better to convert the DataFrame into an RDD and work with it, or to perform the work on the DataFrame directly.
Also, I am not sure how to best handle the regexp with different column types (I am using Scala).
And I would like to perform this action for all columns of these two types (string and map), trying to avoid spelling out the column names like:
def cleanRows(mytabledata: DataFrame): RDD[String] = {
  // this will do the work for a specific column (myString1) of type string
  val oneColumn_clean = mytabledata.withColumn("myString1", regexp_replace(col("myString1"), "[_#]", ""))
  ...
  // return type can be RDD or Dataframe...
}
Is there any simple solution to perform this?
Thanks
One option is to define two udfs to handle string-type columns and Map-type columns separately:
import org.apache.spark.sql.functions.udf
val df = Seq(("this_is#string", 3, Map("str1_in#map" -> 3))).toDF("myString", "myInt", "myMap")
df.show
+--------------+-----+--------------------+
| myString|myInt| myMap|
+--------------+-----+--------------------+
|this_is#string| 3|Map(str1_in#map -...|
+--------------+-----+--------------------+
1) Udf to handle string type columns:
def remove_string: String => String = _.replaceAll("[_#]", "")
def remove_string_udf = udf(remove_string)
2) Udf to handle Map type columns:
def remove_map: Map[String, Int] => Map[String, Int] = _.map{ case (k, v) => k.replaceAll("[_#]", "") -> v }
def remove_map_udf = udf(remove_map)
3) Apply the udfs to the corresponding columns to clean them up:
df.withColumn("myString", remove_string_udf($"myString")).
   withColumn("myMap", remove_map_udf($"myMap")).show
+------------+-----+-------------------+
| myString|myInt| myMap|
+------------+-----+-------------------+
|thisisstring| 3|Map(str1inmap -> 3)|
+------------+-----+-------------------+
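Since the question asks to avoid hard-coding column names, one possible extension is to drive the column selection from the DataFrame schema. This is only a sketch and assumes every map column is a map<string,int>, so that it matches the signature of remove_map_udf defined above:
import org.apache.spark.sql.types.{MapType, StringType}
import org.apache.spark.sql.functions.col

// Apply the matching udf to every string or map column; leave all other columns untouched
val cleaned = df.schema.fields.foldLeft(df) { (acc, field) =>
  field.dataType match {
    case StringType => acc.withColumn(field.name, remove_string_udf(col(field.name)))
    case _: MapType => acc.withColumn(field.name, remove_map_udf(col(field.name)))
    case _          => acc
  }
}
cleaned.show
This keeps the int column as it is and avoids listing the myString1/myMap1 style names explicitly.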

Get unique record among duplicates using MapReduce

File.txt
123,abc,4,Mony,Wa
123,abc,4, ,War
234,xyz,5, ,update
234,xyz,5,Rheka,sild
179,ijo,6,all,allSingle
179,ijo,6,ball,ballTwo
1) column1, column2, column3 are primary keys
2) column4, column5 are comparison keys
I have a file with duplicate records like the above. From these duplicates I need to get only one record, based on a sort order.
Expected Output:
123,abc,4, ,War
234,xyz,5, ,update
179,ijo,6,all,allSingle
Please help me. Thanks in advance.
You can try the code below:
data = LOAD 'path/to/file' using PigStorage(',') AS (col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray);
B = group data by (col1, col2, col3);
C = foreach B {
    sorted = order data by col4 desc;
    first = limit sorted 1;
    generate group, flatten(first);
};
In the above code, you can change the sorted variable to choose the column you would like to sort on and the sort direction. Also, in case you require more than one record, you can change the limit to a value greater than 1.
Hope this helps.
The question isn't very clear, but I understand this is what you need:
A = LOAD 'file.txt' using PigStorage(',') as (column1,column2,column3,column4,column5);
B = GROUP A BY (column1,column2,column3);
C = FOREACH B GENERATE FLATTEN(group) as (column1,column2,column3);
DUMP C;
Or
A = LOAD 'file.txt' using PigStorage(',') as (column1,column2,column3,column4,column5);
B = DISTINCT(FOREACH A GENERATE column1,column2,column3);
DUMP B;

Rainbow attack through Python lookup is failing

I have some issues with an assignment I have been given. The gist is that I have to do a rainbow attack on a "car fob".
Given a generated table, the RainbowAttack.py script does the following:
The key broadcasts to the car (in this case the adversary).
The car/Eve responds with a challenge u.
The key then responds with a hash consisting of MD5(s||u).
Eve now uses the rainbow table to crack s.
We use MD5 to hash our response and our keys,
and then we use our reduction function on the hash and take the first 28 bits:
f_i(x) = (f(x) + i) mod 2^28.
My hash and reduction function:
def f(s, i=0):
    """Lowest 28 bits of (MD5(s||u) + i)"""
    digest = '0x' + md5.new(str(s) + str(u)).hexdigest()
    result = hex((int(digest, 16) + i) % 2**BIT_SIZE)[:BIT_SIZE/4+2]
    return result
Anyway, when we run our script we receive the response, calculate all its successors, and compare them to the end points in the rainbow table. If a match is found we get the start point of the collision, and we then check whether the key is in the chain from start point to end point. If one of the keys in that chain is the same as the response we got from the fob, we know that the previous key is the secret that opens the car door.
At the moment we are only able to find the key when it is in the start position or end position of the rainbow table, and not when it is in the middle of a chain.
Here is the code for the loops that compute the successors and check whether any of them appear in the rainbow table. If one does, we check whether the response from the car fob is in the corresponding chain; if it is, we have our key.
The problem might be caused when we calculate our successors, since the reduction function will be different from the one used on the key (i increments, making the reduction function slightly different for every step in a chain).
def find_key(table, r):
    """Search for a matching response in the Rainbow-table"""
    succ = [r]
    print r
    for i in xrange(1, CHAIN_LEN):
        succ.append(f(succ[i-1], i))
    for key, value in table.iteritems():
        if value in succ:
            print "\tCollision: %s -> %s" % (key, value)
            ss = key
            for i in xrange(0, CHAIN_LEN):
                rs = f(ss, i)
                if rs == r:
                    return ss
                ss = rs
    return -1
The rainbow table and the files can be found here (GitHub):
derp.py (the rainbow attack) and table1.csv (rename it to table.csv).

Weka classification and predicted class

I'm trying to classify an unlabelled string using Weka. I'm not an expert in data mining, so I have been struggling with the different terms. What I'm doing is providing the training data and setting the unlabelled string; after running the M5Rules classifier I actually get an output, but I have no idea what it means:
run:
{17 1,35 1,64 1,135 1,205 1,214 1,215 1,284 1,288 1,309 1,343 1,461 1,493 1,500 1,552 1,806 -0.038168} | -0.03816793850062397
-0.03816793850062397 ->
Results
======
Correlation coefficient 0
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 1
BUILD SUCCESSFUL (total time: 1 second)
The source code is as follows:
public Categorizer(){
    try{
        //*** READ ARFF FILES ***
        //BufferedReader trainReader = new BufferedReader(new FileReader("c:/Users/Yehia A.Salam/Desktop/dd/training-data.arff"));//File with text examples
        //BufferedReader classifyReader = new BufferedReader(new FileReader("c:/Users/Yehia A.Salam/Desktop/dd/test-data.arff"));//File with text to classify

        // Create training data instances
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("c:/Users/Yehia A.Salam/Desktop/dd/training-data"));
        Instances dataRaw = loader.getDataSet();
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(dataRaw);
        Instances dataTraining = Filter.useFilter(dataRaw, filter);
        dataTraining.setClassIndex(dataRaw.numAttributes() - 1);

        // Create test data instances
        loader.setDirectory(new File("c:/Users/Yehia A.Salam/Desktop/dd/test-data"));
        dataRaw = loader.getDataSet();
        Instances dataTest = Filter.useFilter(dataRaw, filter);
        dataTest.setClassIndex(dataTest.numAttributes() - 1);

        // Classify
        FilteredClassifier model = new FilteredClassifier();
        model.setFilter(new StringToWordVector());
        model.setClassifier(new M5Rules());
        model.buildClassifier(dataTraining);
        for (int i = 0; i < dataTest.numInstances(); i++) {
            dataTest.instance(i).setClassMissing();
            double cls = model.classifyInstance(dataTest.instance(i));
            dataTest.instance(i).setClassValue(cls);
            System.out.println(dataTest.instance(i).toString() + " | " + cls);
            System.out.println(cls + " -> " + dataTest.instance(i).classAttribute().value((int) cls));

            // evaluate classifier and print some statistics
            Evaluation eval = new Evaluation(dataTraining);
            eval.evaluateModelOnce(cls, dataTest.instance(i));
            System.out.println(eval.toSummaryString("\nResults\n======\n", false));
        }
    }
    catch(FileNotFoundException e){
        System.err.println(e.getMessage());
    }
    catch(IOException i){
        System.err.println(i.getMessage());
    }
    catch(Exception o){
        System.err.println(o.getMessage());
    }
}
And finally, a couple of screenshots in case I made anything wrong in the folder hierarchy:
tl;dr:
You set the class index to a random feature
You have to use a classifier, not a regression algorithm
The problem is how you initialize your data sets. Although Weka usually puts the class in the last column, the TextDirectoryLoader doesn't. In fact, you don't need to set the class index manually; it is already set, so remove the lines
dataTraining.setClassIndex(dataRaw.numAttributes() - 1);
dataTest.setClassIndex(dataTest.numAttributes() - 1);
(The first line is wrong anyway, because you use the number of attributes from the raw data set, but choose the column of the already filtered data set.)
If you then run your code, you will get this:
weka.classifiers.functions.LinearRegression: Cannot handle binary class!
As I already guessed, M5Rules is not a classifier but a regression algorithm. If you use a classifier like J48 or RandomForest, you will get more sensible output. Just change the line
model.setClassifier(new M5Rules());
to
model.setClassifier(new RandomForest());
As for your output, here is what I make of it:
{17 1,35 1,64 1,135 1,205 1,214 1,215 1,284 1,288 1,309 1,343 1,461 1,493 1,500 1,552 1,806 -0.038168} | -0.03816793850062397
-0.03816793850062397 ->
is the result of the lines
System.out.println(dataTest.instance(i).toString() + " | " + cls);
System.out.println(cls + " -> " + dataTest.instance(i).classAttribute().value((int) cls));
So you see the features of your instance serialized as sparse ARFF followed by | and the class.
Usually the class should be an integer, but from the documentation of M5Rules I gather that it is a learner for regression problems, so you won't get discrete classes but continuous values; in your case -0.03816793850062397.
Since you (incorrectly) set a numerical feature as class label, M5Rules didn't complain and gave you an output. If you use an actual classifier, you will get your labels "health" or "travel".
The rest are standard statistics about the classifier's performance, but they are pretty useless for only one test instance. It looks like the one sample was classified correctly, so all errors are zero.
Correlation coefficient 0
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 1
Just in case someone else gets the same error with M5P, check whether the ARFF file is just a header or is empty.
Otherwise try
model.buildClassifier(....)
instead of
model.setClassifier(....);
That solved it for me.