Search for an item in a text file using UIMA Ruta - regex

I have been trying to find one particular item in a text file.
The text file looks like this:
>HEADING
00345
XYZ
MethodName : fdsafk
Date: 23-4-2012
More text and some part containing instances of XYZ
I did a dictionary search for XYZ and found all of its positions, but I want only the first XYZ and not the rest. XYZ has one distinguishing property: it always sits between the 5-digit code and the text MethodName.
I have not been able to express that. Here is what I have so far:
WORDLIST ZipList = 'Zipcode.txt';
DECLARE Zip;
Document{-> MARKFAST(Zip, ZipList)};
DECLARE Method;
"MethodName" -> Method;
WORDLIST typelist = 'typelist.txt';
DECLARE type;
Document{-> MARKFAST(type, typelist)};
Also, how do we use regular expressions in UIMA Ruta?

There are many ways to specify this. Here are some examples (not tested):
// just remove the other annotations (assuming type is the one you want)
type{-> UNMARK(type)} ANY{-STARTSWITH(Method)};
// only keep the first one: remove any annotation if there is one somewhere in front of it
// you can also specify this with POSITION or CURRENTCOUNT, but both are slow
type # #type{-> UNMARK(type)};
// just create a new annotation in between
NUM{REGEXP(".....")} #{-> type} #Method;
There are two ways to use regex in UIMA Ruta:
(find) simple regex rules, which annotate the occurrences of the pattern, like
"[A-Za-z]+" -> Type;
(matches) REGEXP conditions, which validate the text matched by a rule element, like
ANY{REGEXP("[A-Za-z]+")-> Type};
Let me know if something is not clear, and I will extend the description.
DISCLAIMER: I am a developer of UIMA Ruta
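For readers who want to try the "between the 5-digit code and MethodName" idea outside of Ruta, here is a minimal Python sketch. It is not Ruta code and not part of the answer above; the names SAMPLE, BETWEEN, first_item and is_five_digit_code are made up for illustration. It also shows the find/matches distinction with Python's own re.search (find anywhere) and re.fullmatch (the whole text must match).

```python
import re

SAMPLE = """>HEADING
00345
XYZ
MethodName : fdsafk
Date: 23-4-2012
More text and some part containing instances of XYZ"""

# A line of five digits, then the wanted line, then a line starting with "MethodName".
BETWEEN = re.compile(r"^\d{5}\s*\n(.+?)\s*\nMethodName\b", re.MULTILINE)


def first_item(text):
    """Return the line between the 5-digit code and MethodName, or None."""
    found = BETWEEN.search(text)  # "find" semantics: match anywhere in the text
    return found.group(1) if found else None


def is_five_digit_code(token):
    """'matches' semantics: the whole token must consist of exactly five digits."""
    return re.fullmatch(r"\d{5}", token) is not None
```

Only the first XYZ is returned, because the later occurrences are not framed by the code line and the MethodName line.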

Related

Refactoring starting place for regex

I have a function that strips HTML markup so that the text can be displayed inside a text element.
stripChar: function stripChar(string) {
    string = string.replace(/<\/?[^>]+(>|$)/g, "");
    string = string.trim();
    string = string.replace(/(\n{2,})/gm, "\n\n");
    string = string.replace(/&hellip;/g, "...");
    string = string.replace(/&nbsp;/g, "");
    let changeencode = entities.decode(string);
    return changeencode;
}
This has worked great for me, but I have a new requirement and I'm struggling to work out where I should start refactoring the code above. I still need to strip out the markup as before, but with 2 exceptions:
List items, <ul><li>: I need to handle these so that they still appear as bullet points
Hyperlinks: I want to use react-native-hyperlink, so I need to leave the <a> tags intact for me to handle separately
Whilst the function is great for general tag replacement, it's not flexible enough for my needs above.
You may use
stripChar: function stripChar(string) {
    string = string.replace(/&nbsp;|<(?!\/?(?:li|ul|a)\b)\/?[^>]+(?:>|$)/g, "");
    string = string.trim();
    string = string.replace(/\n{2,}/g, "\n\n");
    string = string.replace(/&hellip;/g, "...");
    let changeencode = entities.decode(string);
    return changeencode;
}
The main changes:
.replace(/&nbsp;/g,"") is merged into the first replace as an alternative
The first replace now uses a new regex pattern in which the li, ul and a tags are excluded from the matches by a negative lookahead, (?!\/?(?:li|ul|a)\b).
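To see the negative lookahead in isolation, here is a rough Python port of the same idea. It is a sketch, not the React Native code: it assumes the first alternative in the pattern is the &nbsp; entity, uses html.unescape as a stand-in for entities.decode, and the names STRIP and strip_markup are made up.

```python
import html
import re

# Assumed reconstruction of the pattern: drop "&nbsp;" and every tag
# except <ul>, <li> and <a> (opening or closing).
STRIP = re.compile(r"&nbsp;|<(?!/?(?:li|ul|a)\b)/?[^>]+(?:>|$)")


def strip_markup(text):
    text = STRIP.sub("", text).strip()
    text = re.sub(r"\n{2,}", "\n\n", text)   # collapse runs of blank lines
    text = text.replace("&hellip;", "...")   # plain dots instead of an ellipsis
    return html.unescape(text)               # decode the remaining entities
```

The \b after the tag names matters: without it, a tag such as <article> would be kept because it starts with "a".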

What's the best way to match strings in a file to case class in Scala?

We have a file that contains data that we want to map to a case class. I know enough to brute-force it, but I am looking for an idiomatic way to do it in Scala.
Given File:
#record
name:John Doe
age: 34
#record
name: Smith Holy
age: 33
# some comment
#record
# another comment
name: Martin Fowler
age: 99
(field values on two lines are INVALID, e.g. name:John\n Smith should error)
And the case class
case class Record(name:String, age:Int)
I want to return a Seq type such as Stream:
val records: Stream[Record]
The couple of ideas I'm working with, but haven't implemented so far, are:
Remove all newlines and treat the whole file as one long string. Then match the regex "((?!name).)+((?!age).)+age:([\s\d]+)" against the string and create a new instance of my case class for each match. So far my regex-fu is weak and I can't match around comments.
Recursive idea: iterate through each line to find the first line that matches record, then recursively call the function to match name, then age. Tail-recursively return Some(new Record(cumulativeMap.get(name), cumulativeMap.get(age))), or None when hitting the next record after name (i.e. age was never encountered).
?? Better idea?
Thanks for reading! The file is more complicated than the one above, but the same rules apply. For the curious: I'm trying to parse a custom M3U playlist file format.
I'd use kantan.regex for a fairly trivial regex based solution.
Without fancy shapeless derivation, you can write the following:
import kantan.regex._
import kantan.regex.implicits._
case class Record(name:String, age:Int)
implicit val decoder = MatchDecoder.ordered(Record.apply _)
input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList
This yields:
List(Success(Record(John Doe,34)), Success(Record(Smith Holy,33)), Success(Record(Martin Fowler,99)))
Note that this solution requires you to hand-write the decoder, but it can often be derived automatically. If you don't mind a shapeless dependency, you could simply write:
import kantan.regex._
import kantan.regex.implicits._
import kantan.regex.generic._
case class Record(name:String, age:Int)
input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList
And get the exact same result.
Disclaimer: I'm the library's author.
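For comparison, the same two-capture-group pattern can be tried with a plain regex engine. The sketch below uses Python's re module instead of kantan.regex; Record, PAIR and parse_records are names chosen here for illustration. Comment lines between records are skipped simply because the pattern never matches them.

```python
import re
from typing import List, NamedTuple


class Record(NamedTuple):
    name: str
    age: int


# A "name:" line immediately followed by an "age:" line, one capture group each.
PAIR = re.compile(r"name:\s*([^\n]+)\nage:\s*([0-9]+)")


def parse_records(text: str) -> List[Record]:
    return [Record(m.group(1).strip(), int(m.group(2))) for m in PAIR.finditer(text)]
```

Note that this approach silently ignores malformed records (for example a name spread over two lines) instead of reporting them.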
You could use Parser Combinators.
If you have the file format specification in BNF or can write one, then Scala can create a parser for you from those rules. This may be more robust than hand-made regex based parsers. It's certainly more "Scala".
I don't have much experience in Scala, but these regexes might work:
You could use (?<=name:).* to match the name value and (?<=age:).* to match the age value. If you use these, trim the spaces from the matches; otherwise name: bob will match " bob" with a leading space, which you probably don't want.
If name: or any other tag appears inside a comment, or a comment follows a value, it will be matched as well. Please leave a comment if you want to avoid that.
You could try this:
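import java.nio.charset.Charset
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

val file = Paths.get("file.txt")
// readAllLines returns a java.util.List, so convert it to a Scala List first
val lines = Files.readAllLines(file, Charset.defaultCharset()).asScala.toList
val records = lines.filter(s => s.startsWith("age:") || s.startsWith("name:"))
  .grouped(2).toList.map {
    case List(a, b) => Record(a.replaceAll("name:", "").trim,
      b.replaceAll("age:", "").trim.toInt)
  }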
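The filter-and-group approach above quietly drops anything it does not recognise. If invalid input (such as a field value continued on a second line) has to be reported, a line-by-line pass is one option. The following is only a sketch of that idea in Python, under the assumption that every record starts with a line that is exactly #record and that name and age are the only fields; FIELDS, _finish and iter_records are hypothetical names.

```python
from typing import Dict, Iterable, Iterator, Tuple

FIELDS = ("name", "age")


def _finish(fields: Dict[str, str], number: int) -> Tuple[str, int]:
    missing = [f for f in FIELDS if f not in fields]
    if missing:
        raise ValueError(f"record before line {number} lacks {missing}")
    return fields["name"], int(fields["age"])


def iter_records(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    fields: Dict[str, str] = {}
    in_record = False
    number = 0
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        key, sep, value = line.partition(":")
        if line == "#record":
            if in_record:
                yield _finish(fields, number)  # close the previous record
            fields, in_record = {}, True
        elif not line or line.startswith("#"):
            continue  # blank line or comment
        elif in_record and sep and key in FIELDS:
            fields[key] = value.strip()
        else:
            # e.g. a field value continued on a second line
            raise ValueError(f"line {number}: unexpected content {raw!r}")
    if in_record:
        yield _finish(fields, number + 1)
```

Because it is a generator, records are produced lazily, which is roughly what a Stream would give you on the Scala side.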

PIG regex extract then filter the unnamed regex tuple

I have a string as:
[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]
I want to use REGEX_EXTRACT_ALL to turn it into the elements of a tuple, separated by ",". Then I need to keep only the elements that do not contain structure or location.
However, I get an error saying that the regex result cannot be filtered. Any idea?
By the way, the end goal is to extract the longest hierarchy, like (topic|news|politics|elections|primary).
Update: here is the script:
data = load '/web/visit_log/20160303'
USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as json:map[];
a = foreach data generate json#section as sec_type;
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into '/user/tad/sec_hir';
The syntax of the filter with matches seems incorrect. Also, the data doesn't seem to have () in it.
c = filter b by not extr matches '(structure|location)';
Try changing this to
c = filter b by not (extr matches 'structure|location');
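To make the end goal concrete (drop the structure and location entries, then keep the longest hierarchy), here is a small sketch of the logic in Python rather than Pig. It assumes the field is valid JSON, as in the sample string, and that an element may itself contain several comma-separated paths; longest_hierarchy is a made-up name. One thing to keep in mind when translating this back: as far as I know, Pig's matches has to match the whole string, so a containment test needs a pattern like '.*(structure|location).*'.

```python
import json


def longest_hierarchy(raw, excluded=("structure", "location")):
    """Return the deepest path without excluded words, joined with '|'."""
    paths = []
    for group in json.loads(raw):          # JSON turns the escaped "\/" into "/"
        for item in group:
            paths.extend(p for p in item.split(",") if p)
    kept = [p for p in paths if not any(word in p for word in excluded)]
    if not kept:
        return None
    # Depth is the number of non-empty segments between slashes.
    best = max(kept, key=lambda p: len(p.strip("/").split("/")))
    return "|".join(best.strip("/").split("/"))
```

With a shortened version of the sample data this yields topic|news|politics|elections|primary.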

How to change a node's property based on one of its other properties in Neo4j

I just started using Neo4j server 2.0.1. I am having trouble writing a Cypher script that changes one of a node's properties based on one of its already defined properties.
So if I created these nodes:
CREATE (:Post {uname:'user1', content:'Bought a new pair of pants today', kw:''}),
(:Post {uname:'user2', content:'Catching up on Futurama', kw:''}),
(:Post {uname:'user3', content:'The last episode of Game of Thrones was awesome', kw:''})
I want the script to look at the content property, use a regular expression to pick out the word(s) longer than five characters, and set the kw property to the result. For user1's post that is the word "Bought"; user2's post kw would be "Catching, Futurama" and user3's post kw would be "episode, Thrones, awesome".
Any help would be greatly appreciated.
You could do something like this:
MATCH (p:Post { uname:'user1' })
WHERE p.content =~ "Bought .+"
SET p.kw=filter(w in split(p.content," ") WHERE length(w) > 5)
If you want to do that for all posts, which might not be the fastest operation, drop the uname restriction and the content test:
MATCH (p:Post)
SET p.kw=filter(w in split(p.content," ") WHERE length(w) > 5)
split splits a string into a collection of parts, in this case words separated by spaces.
filter filters a collection by the condition after WHERE; only the elements that fulfill the condition are kept.
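The split-then-filter step is easy to check outside the database. This is the same logic as a Python sketch (keywords is a made-up helper, not a Cypher function), which you can use to confirm what kw should contain for the three example posts:

```python
def keywords(content, min_chars=5):
    """Split on single spaces and keep the words longer than min_chars characters."""
    return [word for word in content.split(" ") if len(word) > min_chars]
```

Like the Cypher version, this produces a collection of words, not a comma-separated string.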
You would probably rather create nodes for those keywords and link each post to its keyword nodes.

How to read semicolon separated certain values from a QString?

I am developing an application using Qt/KDE. While writing code for it, I need to read a QString that contains ;-delimited values like
<http://example.com/example.ext.torrent>; rel=describedby; type="application/x-bittorrent"; name="differentname.ext"
I need to read every attribute like rel, type and name into a different QString. The approach I have taken so far is something like this
if (line.contains("describedby")) {
m_reltype = "describedby" ;
}
if (line.contains("duplicate")) {
m_reltype = "duplicate";
}
That is, when I only care about the presence of an attribute (and not its value), I look for the text manually and set the member if the attribute is present. This approach fails, however, for attributes like type and name, whose actual values need to be stored in a QString. I know this can be done by splitting the entire string at the delimiter ; and then searching for the attribute or its value, but I wanted to know whether there is a cleaner and more efficient way of doing it.
As I understand it, the data is not always a URL.
So:
1: Split the string at the ; delimiter.
2: For each substring, separate the identifier from the value:
QString id = str.mid(0, str.indexOf("=")).trimmed();
QString value = str.mid(str.indexOf("=") + 1).trimmed();
You can also use a QRegExp:
QRegExp rx("^\\s*([a-z]+)\\s*=\\s*(.*)$");
if (rx.indexIn(str) != -1) {
    id = rx.cap(1);     // \1 of the regexp
    value = rx.cap(2);  // \2 of the regexp
}
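The two steps above (split at ;, then split each part at the first =) can be sketched end to end. The version below is Python rather than Qt, purely to show the flow; parse_link_value is a hypothetical helper, and it assumes that quoted values never contain a ; themselves.

```python
def parse_link_value(line):
    """Return (target, attributes) for a '<url>; key=value; ...' string."""
    target = None
    attrs = {}
    for part in (p.strip() for p in line.split(";")):
        if not part:
            continue
        if part.startswith("<") and part.endswith(">"):
            target = part[1:-1]                 # the bracketed URL
        elif "=" in part:
            key, _, value = part.partition("=")  # split at the first "=" only
            attrs[key.strip()] = value.strip().strip('"')
    return target, attrs
```

Each attribute then ends up under its own key, so rel, type and name can be copied into separate strings without searching for fixed text.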
"I need to read every attribute like rel, type and name into a different QString."
Is there a guarantee that this string will always be a URL?
"I wanted to know is there a cleaner and a more efficient way of doing it."
Don't reinvent the wheel! You can use QUrl::queryItems, which parses the query variables and returns a list of name-value pairs.
However, make sure that your string is a well-formed URL (so that QUrl does not reject it).