Regex in Apache Spark

I have a text file that reads like this:
This recipe can be made either with a stand mixer, or by hand with a bowl, a wooden spoon, and strong arms. If you use salted butter, please omit the added salt in this recipe.
Yum
Ingredients
1 1/4 cups all-purpose flour (160 g)
1/4 teaspoon salt
1/2 teaspoon baking powder
1/2 cup unsalted butter (1 stick, or 8 Tbsp, or 112g) at room temperature
1/2 cup white sugar (90 g)
1/2 cup dark brown sugar, packed (85 g)
1 large egg
1 teaspoon vanilla extract
1/2 teaspoon instant coffee granules or instant espresso powder
1/2 cup chopped macadamia nuts (3 1/2 ounces, or 100 g)
1/2 cup white chocolate chips
Method
1 Preheat the oven to 350°F (175°C). Vigorously whisk together the flour, and baking powder in a bowl and set aside.
I want to extract the data between the words Ingredients and Method.
I have written a regex (?s)(?<=\bIngredients\b).*?(?=\bMethod\b)
to extract the data and it's working fine.
But when I try to do that in spark-shell as follows, it doesn't give me anything.
val b = sc.textFile("/home/akshat/file.txt")
val regex = "(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)".r
regex.findAllIn(b).foreach(println)
Please tell me where I am going wrong and what steps I should take to correct this.
Thanks in advance!

What you need to do is:
Read the file using wholeTextFiles (so the file is not split into lines and you read its entire content as one string).
Write a function which takes a string and returns the part matched by that regex.
So it may look like this (in Python):
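import re
# The regex from the question; a raw string keeps \b as a word boundary.
pattern = re.compile(r"(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)")
def getWhatIneed(pair):
    # wholeTextFiles yields (path, content) pairs; search the content.
    m = pattern.search(pair[1])
    return m.group(0) if m else ""
b = sc.wholeTextFiles(...)
c = b.map(getWhatIneed)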
Now c is also an RDD. You need to collect it before you print it. The output of collect is a normal list.
print(c.collect())
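To see why the line-by-line version finds nothing, here is a minimal plain-Python sketch that needs no Spark. It assumes that textFile hands the regex one line at a time, while wholeTextFiles hands it the full content. The sample text and the helper name extract_section are made up for illustration.

```python
import re

# Shortened stand-in for the recipe file in the question (illustrative only).
TEXT = """Yum
Ingredients
1/4 teaspoon salt
1 large egg
Method
1 Preheat the oven."""

# Raw string so \b stays a word boundary rather than a backspace character.
PATTERN = re.compile(r"(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)")

def extract_section(content):
    """Return the text between the two markers, or None if they are not both present."""
    m = PATTERN.search(content)
    return m.group(0).strip() if m else None

# Whole file as one string: both markers are visible to the regex.
whole = extract_section(TEXT)

# One line at a time: no single line contains both markers, so nothing matches.
per_line = [extract_section(line) for line in TEXT.splitlines()]

print(whole)
print(per_line)
```

The same escaping concern applies to the Scala snippet in the question: inside an ordinary double-quoted Scala string, \b is the backspace escape, so a triple-quoted string is one way to keep it as a word boundary.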

Related

Parsing text file into a Data Frame

I have a text file which has information, like so:
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: A1QA985ULVCQOB
review/profileName: Carleen M. Amadio "Lady Dragonfly"
review/helpfulness: 2/2
review/score: 5.0
review/time: 1314057600
review/summary: Fun for adults too!
review/text: I really enjoy these scissors for my inspiration books that I am making (like collage, but in books) and using these different textures these give is just wonderful, makes a great statement with the pictures and sayings. Want more, perfect for any need you have even for gifts as well. Pretty cool!
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: ALCX2ELNHLQA7
review/profileName: Barbara
review/helpfulness: 0/0
review/score: 5.0
review/time: 1328659200
review/summary: Making the cut!
review/text: Looked all over in art supply and other stores for "crazy cutting" scissors for my 4-year old grandson. These are exactly what I was looking for - fun, very well made, metal rather than plastic blades (so they actually do a good job of cutting paper), safe ("blunt") ends, etc. (These really are for age 4 and up, not younger.) Very high quality. Very pleased with the product.
I want to parse this into a data frame with productId, title, price, and so on as columns and the data as rows. How can I do this in R?
A quick and dirty approach:
mytable <- read.table(text=mytxt, sep = ":")
mytable$id <- rep(1:2, each = 10)
res <- reshape(mytable, direction = "wide", timevar = "V1", idvar = "id")
There will be issues if there are other colons in the data. It also assumes that there is an equal number (10) of variables for each case.
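For comparison, the same reshaping idea can be sketched in Python rather than R. This is only an illustration under two assumptions: every record starts with the product/productId field, and the key never contains a colon. Splitting on the first colon only keeps later colons inside the value, and starting a new record on the first field removes the need for a fixed count of 10 fields. The names parse_records and first_key are hypothetical.

```python
def parse_records(text, first_key="product/productId"):
    """Turn 'key: value' lines into a list of dicts, one dict per record."""
    records = []
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # partition splits on the first colon only, so values may contain colons
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # a new record begins whenever the first field shows up again
        if key == first_key or not records:
            records.append({})
        records[-1][key] = value
    return records

sample = (
    "product/productId: B000GKXY4S\n"
    "product/price: unknown\n"
    "review/summary: Fun for adults too!\n"
    "product/productId: B000GKXY4S\n"
    "review/score: 5.0\n"
)
for row in parse_records(sample):
    print(row)
```

Each dict can then be treated as one row, with the union of keys as the columns.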

Gsub to extract relevant content between two lines

I want to extract the content between Preparation and Nutrition\nPer serving.
I am using -
gsub(".*\nPreparation\n\\s*|Tips & Notes*", "", filename)
My filename looks like
\nPreparation\nThinly slice both lemons and one orange. Combine the
sliced fruit with rum in a bowl. Cover and let macerate at least 8 hours
or overnight.\nCombine sugar and water in a small saucepan; bring to a boil. Remove from the heat and stir in tea; let steep for 20 to 30 minutes. Strain into the rum mixture. Cover and chill.\nJust before serving, slice the remaining orange. Strain the rum mixture into a large punch bowl. Add Champagne or sparkling wine and seltzer. Float the orange slices in the punch.\nTips & Notes\nMake Ahead Tip: Prepare through Step 2. Cover and refrigerate for up to 3 days.\nNutrition\nPer serving
If you want to extract content, it's better to use the str_extract or regmatches functions.
> regmatches(x, gregexpr("(?s)Preparation\\s*\\K.*?(?=\\s*\\bNutrition\\nPer serving)", x, perl=T))[[1]]
[1] "Thinly slice both lemons and one orange. Combine the \n sliced fruit with rum in a bowl. Cover and let macerate at least 8 hours \n or overnight.\nCombine sugar and water in a small saucepan; bring to a boil. Remove from the heat and stir in tea; let steep for 20 to 30 minutes. Strain into the rum mixture. Cover and chill.\nJust before serving, slice the remaining orange. Strain the rum mixture into a large punch bowl. Add Champagne or sparkling wine and seltzer. Float the orange slices in the punch.\nTips & Notes\nMake Ahead Tip: Prepare through Step 2. Cover and refrigerate for up to 3 days."
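As a rough cross-check of the pattern outside R, here is a small Python sketch of the same extraction. Python's built-in re module has no \K, so a capture group plays that role here. The helper name between and the shortened sample string are assumptions made for illustration.

```python
import re

def between(text, start, end):
    """Return the text between the start and end markers, or None if not found."""
    # re.DOTALL lets '.' cross newlines, like (?s) in the R pattern.
    pattern = re.compile(
        re.escape(start) + r"\s*(.*?)\s*" + re.escape(end),
        re.DOTALL,
    )
    m = pattern.search(text)
    return m.group(1) if m else None

recipe = (
    "\nPreparation\nSlice the lemons.\n"
    "Tips & Notes\nMake ahead.\nNutrition\nPer serving"
)
print(between(recipe, "Preparation", "Nutrition\nPer serving"))
```

Passing "Tips & Notes" as the end marker instead would stop the extraction before the tips section.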

Neo4j regex string matching not returning expected results

I'm trying to use the Neo4j 2.1.5 regex matching in Cypher and running into problems.
I need to implement a full text search on specific fields that a user has access to. The access requirement is key and is what prevents me from just dumping everything into a Lucene instance and querying that way. The access system is dynamic and so I need to query for the set of nodes that a particular user has access to and then within those nodes perform the search. I would really like to match the set of nodes against a Lucene query, but I can't figure out how to do that so I'm just using basic regex matching for now. My problem is that Neo4j doesn't always return the expected results.
For example, I have about 200 nodes with one of them being the following:
( i:node {name: "Linear Glass Mosaic Tiles", description: "Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!"})
This query produces one result:
MATCH (p)-->(:group)-->(i:node)
WHERE (i.name =~ "(?i).*mosaic.*")
RETURN i
> Returned 1 row in 569 ms
But this query produces zero results even though the description property matches the expression:
MATCH (p)-->(:group)-->(i:node)
WHERE (i.description=~ "(?i).*mosaic.*")
RETURN i
> Returned 0 rows in 601 ms
And this query also produces zero results even though it includes the name property which returned results previously:
MATCH (p)-->(:group)-->(i:node)
WITH i, (p.name + i.name + COALESCE(i.description, "")) AS searchText
WHERE (searchText =~ "(?i).*mosaic.*")
RETURN i
> Returned 0 rows in 487 ms
MATCH (p)-->(:group)-->(i:node)
WITH i, (p.name + i.name + COALESCE(i.description, "")) AS searchText
RETURN searchText
>
...
SotoLinear Glass Mosaic Tiles Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!
...
Even more odd, if I search for a different term, it returns all of the expected results without a problem.
MATCH (p)-->(:group)-->(i:node)
WITH i, (p.name + i.name + COALESCE(i.description, "")) AS searchText
WHERE (searchText =~ "(?i).*plumbing.*")
RETURN i
> Returned 8 rows in 522 ms
I then tried to cache the search text on the nodes and I added an index to see if that would change anything, but it still didn't produce any results.
CREATE INDEX ON :node(searchText)
MATCH (p)-->(:group)-->(i:node)
WHERE (i.searchText =~ "(?i).*mosaic.*")
RETURN i
> Returned 0 rows in 3182 ms
I then tried to simplify the data to reproduce the problem, but in this simple case it works as expected:
MERGE (i:node {name: "Linear Glass Mosaic Tiles", description: "Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!"})
WITH i, (
i.name + " " + COALESCE(i.description, "")
) AS searchText
WHERE searchText =~ "(?i).*mosaic.*"
RETURN i
> Returned 1 rows in 630 ms
I tried using the CYPHER 2.1.EXPERIMENTAL tag as well but that didn't change any of the results. Am I making incorrect assumptions on how the regex support works? Is there something else I should try or some other way to debug the problem?
Additional information
Here is a sample call that I make to the Cypher Transactional Rest API when creating my nodes. This is the actual plain text that is sent (other than some formatting for easier reading) when adding nodes to the database. Any string encoding is just standard URL encoding that is performed by Go when creating a new HTTP request.
{"statements":[
{
"parameters":
{
"p01":"lsF30nP7TsyFh",
"p02":
{
"description":"Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!",
"id":"lsF3BxzFdn0kj",
"name":"Linear Glass Mosaic Tiles",
"object":"material"
}
},
"resultDataContents":["row"],
"statement":
"MATCH (p:project { id: { p01 } })
WITH p
CREATE UNIQUE (p)-[:MATERIAL]->(:materials:group {name: \"Materials\"})-[:MATERIAL]->(m:material { p02 })"
}
]}
If it is an encoding issue, why does a search on name work, description not work, and name + description not work? Is there any way to examine the database to see if/how the data was encoded. When I perform searches, the text returned appears correct.
Just a few notes:
You should probably replace CREATE UNIQUE with MERGE (which works a bit differently).
For your full-text search I would go with the Lucene legacy index for performance, if your group restriction is not limiting enough to keep the response time below a few ms.
I just tried your exact JSON statement, and it works perfectly.
inserted with
curl -H accept:application/json -H content-type:application/json -d @insert.json \
-XPOST http://localhost:7474/db/data/transaction/commit
json:
{"statements":[
{
"parameters":
{
"p01":"lsF30nP7TsyFh",
"p02":
{
"description":"Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!",
"id":"lsF3BxzFdn0kj",
"name":"Linear Glass Mosaic Tiles",
"object":"material"
}
},
"resultDataContents":["row"],
"statement":
"MERGE (p:project { id: { p01 } })
WITH p
CREATE UNIQUE (p)-[:MATERIAL]->(:materials:group {name: \"Materials\"})-[:MATERIAL]->(m:material { p02 }) RETURN m"
}
]}
queried:
MATCH (p)-->(:group)-->(i:material)
WHERE (i.description=~ "(?i).*mosaic.*")
RETURN i
returns:
name: Linear Glass Mosaic Tiles
id: lsF3BxzFdn0kj
description: Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!
object: material
To check your data, you can look at the JSON or CSV dumps that the browser offers (the little download icons on the result and table-result).
Or use neo4j-shell with my shell-import-tools to actually output CSV or GraphML and check those files.
Or use a bit of Java (or Groovy) code to check your data.
There is also the consistency checker that comes with the neo4j-enterprise download. Here is a blog post on how to run it.
java -cp 'lib/*:system/lib/*' org.neo4j.consistency.ConsistencyCheckTool /tmp/foo
I added a groovy test script here: https://gist.github.com/jexp/5a183c3501869ee63d30
One more idea: regexp flags
Sometimes there is a multiline issue going on. There are two more flags:
multiline (?m), which makes ^ and $ match at the start and end of every line, and
dotall (?s), which allows the dot to also match line terminators such as newlines.
So could you try (?ism).*mosaic.*
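The effect of the dotall flag can be tried outside Neo4j. The sketch below uses Python's re.fullmatch as a stand-in, on the assumption that the =~ operator has to match the whole string (which is why the patterns above are wrapped in .*). The two sample strings are invented, and whether the real description property contains a line break is only a guess.

```python
import re

def matches_whole(pattern, text):
    """True if the pattern matches the entire text, not just a part of it."""
    return re.fullmatch(pattern, text) is not None

single_line = "Linear Glass Mosaic Tiles"
# Hypothetical description with an embedded newline before the search term.
multi_line = "Introducing our new tiles.\nThis glass mosaic is back in stock."

print(matches_whole(r"(?i).*mosaic.*", single_line))   # dot never meets a newline
print(matches_whole(r"(?i).*mosaic.*", multi_line))    # dot cannot cross the newline
print(matches_whole(r"(?is).*mosaic.*", multi_line))   # dotall lets it cross
```

If the third call is the only one that succeeds on the real data, a hidden line break in the property is a likely cause.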

LibreCalc Search and Replace, Search for [] and replace it, along with its contents

I am compiling a list of video games.
At this time, I am using Wikipedia to do so.
As I copied PS3 games over to LibreCalc, the copied titles of the video games include citation brackets at the end of the line. Rather than remove these line by line, I am trying to search and replace the brackets and their contents.
I continue to fail in this endeavor. An example is below:
Rune Factory: Tides of Destiny[629]
Fight Night Champion[268]
Dragon Age II[209]
Major League Baseball 2K11[427]
MLB 11: The Show[459]
Warriors: Legends of Troy[817]
Dynasty Warriors 7[222]
Homefront[334]
Top Spin 4[773]
MotorStorm: Apocalypse[474]
Crysis 2[164]
Lego Star Wars III: The Clone Wars
The Tomb Raider Trilogy[765]
NASCAR 2011: The Game[488]
Shift 2: Unleashed[650]
Tiger Woods PGA Tour 12: The Masters[746]
WWE All Stars[839]
Michael Jackson: The Experience[448]
Rio[614]
Mortal Kombat[469]
Portal 2[563]
SOCOM 4: U.S. Navy SEALs[20]
AFL Live[16]
Operation Flashpoint: Red River[542]
Man vs. Wild[430]
Sniper: Ghost Warrior[679]
El Shaddai: Ascension of the Metatron[233]
Virtua Tennis 4[808]
Thor: God of Thunder[740]
MX vs. ATV Alive[478]
Brink[116]
Lego Pirates of the Caribbean: The Video Game[391]
Battle vs. Chess[82]
L.A. Noire[379]
Dirt 3[196]
Kung Fu Panda 2[377]
Hunted: The Demon's Forge[336]
Infamous 2[345]
Red Faction: Armageddon[599]
Yakuza: Dead Souls[849]
Duke Nukem Forever[217]
Alice: Madness Returns[29]
Child of Eden[146]
Transformers: Dark of the Moon[777]
Dungeon Siege III[218]
Cars 2: The Video Game[138]
F.E.A.R. 3[247]
Shadows of the Damned[647]
Atelier Meruru: The Apprentice of Arland[67]
Bleach: Soul Resurrección[108]
Angel Love Online[38]
Angel Senki
Air Conflicts: Secret Wars[24]
Harry Potter and the Deathly Hallows: Part II[322]
NCAA Football 12[511]
Captain America: Super Soldier[137]
Call of Juarez: The Cartel[136]
Phineas and Ferb: Across the 2nd Dimension[558]
Hyperdimension Neptunia Mk2[338]
Deus Ex: Human Revolution[191]
Bodycount[111]
Madden NFL 12[415]
Driver: San Francisco[216]
Dead Island[175]
Resistance 3[609]
Warhammer 40000: Space Marine[815]
NHL 12[526]
Tales of Xillia[718]
God of War: Origins Collection[298]
Tom Clancy's Splinter Cell Classic Trilogy HD[762]
Supremacy MMA[712]
Dark Souls[169]
Ico & Shadow of the Colossus Collection[340]
FIFA 12[263]
PES 2012: Pro Evolution Soccer[557]
Dynasty Warriors 7: Xtreme Legends[223]
Ra.One: The Game[584]
Crysis[163]
Rage[586]
Spider-Man: Edge of Time[692]
NBA 2K12[498]
The Cursed Crusade[733]
Ace Combat: Assault Horizon[12]
Skylanders: Spyro's Adventure[675]
Batman: Arkham City[79]
Ratchet & Clank: All 4 One[591]
Rocksmith[627]
The Sims 3: Pets[658]
The Adventures of Tintin: The Secret of the Unicorn[14]
Back to the Future: The Game[71]
Battlefield 3[83]
Dragon Ball Z: Ultimate Tenkaichi[212]
Puss in Boots[581]
The Idolmaster 2[736]
Uncharted 3: Drake's Deception[795]
GoldenEye 007: Reloaded[301]
The Lord of the Rings: War in the North[401]
Sonic Generations[683]
Call of Duty: Modern Warfare 3[131]
Metal Gear Solid HD Collection[445]
The Elder Scrolls V: Skyrim[236]
Lego Harry Potter: Years 5–7[388]
Assassin's Creed: Revelations[63]
Jurassic Park: The Game[358]
Cartoon Network: Punch Time Explosion XL[141]
Need for Speed: The Run[516]
Saints Row: The Third[633]
Apache: Air Assault[42]
After Hours Athletes[21]
Ni no Kuni[531]
WWE '12[838]
The King of Fighters XIII[371]
Just Dance 3[361]
Order Up![543]
Final Fantasy XIII-2[273]
Zack Zero[853]
Armored Core V[52]
NeverDead[520]
Soulcalibur V[689]
Kingdoms of Amalur: Reckoning[374]
The Darkness II[735]
Grand Slam Tennis 2[306]
Twisted Metal[787]
UFC Undisputed 3[792]
Binary Domain[95]
Asura's Wrath[64]
Syndicate[714]
Gal*Gun[288]
Naruto Shippuden: Ultimate Ninja Storm Generations[485]
SSX[698]
One Piece: Pirate Warriors[539][540]
Blades of Time[101]
Major League Baseball 2K12[428]
Mass Effect 3[436]
MLB 12: The Show[460]
Street Fighter X Tekken[706]
Top Gun: Hard Lock[771]
Mobile Suit Gundam Unicorn[465]
FIFA Street[265]
Silent Hill: Downpour[653]
Silent Hill HD Collection[654]
Ninja Gaiden 3[535]
Resident Evil: Operation Raccoon City[605]
Ridge Racer Unbounded[613]
Battleship[87]
Prototype 2[577]
Max Payne 3[438]
Dragon's Dogma[214]
Tom Clancy's Ghost Recon: Future Soldier[756]
Dirt: Showdown[197]
Inversion[350]
Tokyo Jungle[753]
Part of my problem seems to be that brackets are characters used in regular expressions.
Can someone assist me, or point me in the right direction to solve this problem?
You can escape the brackets with a backslash so they are treated as regular characters. On that basis, you could use the following regex to match all square brackets containing only digits:
\[[:digit:]*\]
When leaving the Replace with box empty, a search/replace run should remove all footnote marks in your example.
Since only the opening bracket is a special character for LO Calc, the following should work, too:
\[[:digit:]*]
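The same escaped-bracket idea can be checked outside LibreOffice. The following Python sketch is only an equivalent for illustration: \d stands in for [:digit:], and the function name strip_citations is made up.

```python
import re

# Escaped brackets with only digits between them, e.g. [629].
CITATION = re.compile(r"\[\d*\]")

def strip_citations(title):
    """Remove every bracketed footnote mark from a title."""
    return CITATION.sub("", title)

for title in [
    "Rune Factory: Tides of Destiny[629]",
    "One Piece: Pirate Warriors[539][540]",
    "Angel Senki",
]:
    print(strip_citations(title))
```

Titles with two footnote marks, or with none, are handled by the same pattern because every match is replaced.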

Help: Extracting data tuples from text... Regex or Machine learning?

I would really appreciate your thoughts on the best approach to the following problem. I am using a Car Classified listing example which is similar in nature to give an idea.
Problem: Extract a data tuple from the given text.
Here are some characteristics of the data.
The vocabulary (words) in the text is limited to a specific domain. Let's assume 100-200 words at the most.
Text that needs to be parsed is a headline like a Car Ad data shown below. So each record corresponds to one tuple (row).
In some cases some of the attributes may be missing. So for example, in raw data row #5 below the year is missing.
Some words go together (bigrams). Like "Low miles".
Historical data available = 10,000 records
Incoming New Data volume = 1000-1500 records / week
The expected output should be in the form of (Year,Make,Model, feature). So the output should look like
1 -> (2009, Ford, Fusion, SE)
2 -> (1997, Ford, Taurus, Wagon)
3 -> (2000, Mitsubishi, Mirage, DE)
4 -> (2007, Ford, Expedition, EL Limited)
5 -> ( , Honda, Accord, EX)
....
....
Raw Headline Data:
1 -> 2009 Ford Fusion SE - $7000
2 -> 1997 Ford Taurus Wagon - $800 (san jose east)
3 -> '00 Mitsubishi Mirage DE - $2499 (saratoga) pic
4 -> 2007 Ford Expedition EL Limited - $7800 (x)
5 -> Honda Accord ex low miles - $2800 (dublin / pleasanton / livermore) pic
6 -> 2004 HONDA ODASSEY LX 68K MILES - $10800 (danville / san ramon)
7 -> 93 LINCOLN MARK - $2000 (oakland east) pic
8 -> #######2006 LEXUS GS 430 BLACK ON BLACK 114KMI ####### - $19700 (san rafael) pic
9 -> 2004 Audi A4 1.8T FWD - $8900 (Sacramento) pic
10 -> #######2003 GMC C2500 HD EX-CAB 6.0 V8 EFI WHITE 4X4 ####### - $10575 (san rafael) pic
11 -> 1990 Toyota Corolla RUNS GOOD! GAS SAVER! 5SPEED CLEAN! REG 2011 O.B.O - $1600 (hayward / castro valley) pic img
12 -> HONDA ACCORD EX 2000 - $4900 (dublin / pleasanton / livermore) pic
13 -> 2009 Chevy Silverado LT Crew Cab - $23900 (dublin / pleasanton / livermore) pic
14 -> 2010 Acura TSX - V6 - TECH - $29900 (dublin / pleasanton / livermore) pic
15 -> 2003 Nissan Altima - $1830 (SF) pic
Possible choices:
A machine learning Text Classifier (Naive Bayes etc)
Regex
What I am trying to figure out is whether regex is too complicated for the job and whether a text classifier is overkill.
If the choice is to go with a text classifier then what would you consider to be the easiest to implement.
Thanks in advance for your kind help.
This is a well-studied problem called information extraction. It is not straightforward to do what you want to do, and it is not as simple as you make it sound (i.e. machine learning is not overkill). There are several techniques; you should read an overview of the research area.
Check this IE library for writing extraction rules. I think it will work best for your problem.
There is also an example of how to create fast dictionary matching.
I think that the ARX or Phoebus systems may suit your needs if you already have annotated data and a list of words associated to each field. Their approach is a mix of information extraction and information integration.
There are a few good entity recognition libraries. Have you taken a look at Apache opennlp?
As a user looking for a specific model of car the task is easier. I'm pretty sure I could classify, say, most Ford Rangers since I know what to look for with regexp.
I think your best bet is to write a function for each car model with type String -> Maybe Tuple. Then run all these on each input and throw away those inputs resulting in zero or too many tuples.
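The one-function-per-model idea can be sketched in Python, with Optional standing in for Maybe. This is a rough illustration and not a complete solution: the names make_extractor and classify are hypothetical, only four-digit years are recognised (so '00 or 93 would be missed), and the feature lists are supplied by hand.

```python
import re
from typing import Callable, List, Optional, Tuple

Car = Tuple[Optional[int], str, str, str]  # (year, make, model, feature)

# Four-digit year only; the lookbehind skips prices such as $2000.
YEAR = re.compile(r"(?<![$\w])(?:19|20)\d{2}\b")

def make_extractor(make: str, model: str, features: List[str]) -> Callable[[str], Optional[Car]]:
    """Build a String -> Optional[Car] function for one make and model."""
    canon = {f.lower(): f for f in features}
    # Longer feature names first, so "EL Limited" wins over a shorter prefix.
    alternatives = "|".join(re.escape(f) for f in sorted(features, key=len, reverse=True))
    pattern = re.compile(
        r"\b" + re.escape(make) + r"\s+" + re.escape(model)
        + r"\b(?:\s+(" + alternatives + r")\b)?",
        re.IGNORECASE,
    )

    def extract(headline: str) -> Optional[Car]:
        m = pattern.search(headline)
        if m is None:
            return None
        year = YEAR.search(headline)
        feature = canon.get((m.group(1) or "").lower(), "")
        return (int(year.group()) if year else None, make, model, feature)

    return extract

def classify(headline: str, extractors) -> Optional[Car]:
    """Run every extractor; keep the result only if exactly one of them fires."""
    hits = [car for car in (e(headline) for e in extractors) if car is not None]
    return hits[0] if len(hits) == 1 else None

extractors = [
    make_extractor("Ford", "Fusion", ["SE"]),
    make_extractor("Honda", "Accord", ["EX", "LX"]),
]
print(classify("2009 Ford Fusion SE - $7000", extractors))
print(classify("Honda Accord ex low miles - $2800 (dublin / pleasanton / livermore) pic", extractors))
print(classify("2003 Nissan Altima - $1830 (SF) pic", extractors))
```

Headlines that no extractor claims, or that more than one claims, come back as None and can be set aside for manual review.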
You should use a tool like Amazon Mechanical Turk for this: human microtasking. Another alternative is to use a data entry freelancer. Upwork is a great place to look. You can get excellent quality results, and the cost is very reasonable for each.