Regular expression with csv not finding blank space - regex

I'm trying to parse a CSV file. I got the following regular expression from Google. It works pretty well, except for one issue: it doesn't parse blank fields.
let arrItem = row.match(/(".*?"|[^",]+)(?=\s*,|\s*$)/g);
arrItem = arrItem || [];
Example row data
9598,"HERE IS LOOKING AT YOU KID, LLC",85647 GOLDEN BLAH BLAH,,ASHBURN,VA,20147,USA,555-555-1511,45-1111111,SOME#GMAIL.COM,9598,,
Here is a screenshot of the arrItem:
I modified the data in the sample and covered it in the screenshot for privacy.
The problem is that in the array, the third item should be blank and then the 4th should be "Ashburn" and so forth. Any ideas on how to fix the expression?
I created the following sample
Thanks
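One possible direction, sketched here in Python rather than JavaScript: the original pattern uses `[^",]+`, which needs at least one character and therefore can never match an empty field. A minimal sketch, assuming each field is preceded by either the start of the row or a comma, lets the unquoted branch match zero characters instead. The names `FIELD` and `split_row` are made up for this illustration, and the behaviour is only checked against the sample row above.

```python
import re

# One field per match: either a quoted string (with "" as an escaped quote)
# or a possibly empty run of characters that are neither quotes nor commas.
# Every field must follow the start of the row or a comma.
FIELD = re.compile(r'(?:^|,)("(?:[^"]|"")*"|[^",]*)')


def split_row(row):
    """Return the fields of one CSV row, keeping empty fields as ''."""
    return [m.group(1) for m in FIELD.finditer(row)]


row = ('9598,"HERE IS LOOKING AT YOU KID, LLC",85647 GOLDEN BLAH BLAH,,'
       'ASHBURN,VA,20147,USA,555-555-1511,45-1111111,SOME#GMAIL.COM,9598,,')
fields = split_row(row)
print(len(fields))
print(repr(fields[3]), fields[4])
```

Quoted fields keep their surrounding quotes in this sketch. For anything beyond a quick script, a real CSV parser (for example Python's standard `csv` module) is likely the safer choice.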

Related

Regex (re2 googlesheets) multiple values in multiline cell

I'm stuck on how to read and tidy up these values from a multiline cell via ARRAYFORMULA.
I'm using regex because the preceding line can vary.
Just formulas please, no custom code.
The first column looks like a set of these:
```
[config]
name = the_name
texture = blah.dds
cost = 1000
[effect0]
value = 1000
type = ATTR_A
[effect1]
value = 8
type = ATTR_B
[feature0]
name = feature_blah
[components]
0 = comp_one,1
[resources]
res_one = 1
res_five = 1
res_four = 1
```
To be useful elsewhere, at minimum each [tag] set ([effect\d], [feature\d], etc.) needs to end up in its own column. For example, the 'effects' column would look like:
ATTR_A:1000,ATTR_B:8
and so on.
Desired output can also be seen in the included spreadsheet
**Here is the example spreadsheet:**
https://docs.google.com/spreadsheets/d/1arMaaT56S_STTvRr2OxCINTyF-VvZ95Pm3mljju8Cxw/edit?usp=sharing
**Current REGEXREPLACE**
This kind of works: it finds each 'type' and 'value' fine, but I can't figure out how to extract just those from the rest. I tried capturing (and non-capturing) groups before and after, but that didn't work.
=ARRAYFORMULA(REGEXREPLACE($A3:$A,"[\n.][effect\d][\n.](.)\n(.)","1:$1 2:$2"))
**Current SUBSTITUTE + REGEXEXTRACT + REGEXREPLACE**
A different approach entirely. It also kind of works, but it is longer and I am left having to parse the values out of the resulting string, which is where I got stuck again. The idea was to use this to simplify the input, then apply REGEXREPLACE as above. I keep getting stuck removing the content around the final matches; if that can be solved, the approach above is fine too.
// First ran a substitute
=ARRAYFORMULA(SUBSTITUTE(SUBSTITUTE($A3:$A,char(10),";"),";;",char(10)))
// Then variation of this (gave up on single line 'effect/d' so broke it up to try and get it working)
=ARRAYFORMULA(IF(A3:A<>"",IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect0]);(.)$")&";;")&""&IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect1]);(.)$")&";;")&""&IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect2]);(.)$")&";;"),""))
// Then use regexreplace like above
=ARRAYFORMULA(REGEXREPLACE($B3:$B,"value = (.);type = (.);;","1:$1 2:$2"))
**--EDIT--**
Also, as my updated 'Desired Output' sheet shows (see timestamped comment below), bonus kudos if you can also extract just the values of matching 'type's into those extra columns (see spreadsheet).
All good if you can't, though; I just realized I would need that too for lookups.
**--END OF EDIT--**
I've tried dozens of things, discarding each in turn. I had a quick look in the version history to pull out two promising attempts and shared them in separate sheets.
One of these also used SUBSTITUTE to simplify the input column; I'm happy with a solution using either the RAW or the SUBSTITUTE results.
**Potentially Useful links:**
https://github.com/google/re2/wiki/Syntax
**Just some more words:**
I have also looked at dozens of Stack Overflow and Google support pages, and tried both REGEXEXTRACT and REGEXREPLACE. Both are promising but missing that final tweak, and I have already tried dozens of tweaks on both.
Any help would be great, and will hopefully help others in future, since examples with spreadsheets are useful and every new REGEX seems to be a new adventure ;)
P.S. If you can think of a better title for this post, please say so in a comment or your answer :)
paste in B3:
=ARRAYFORMULA(SUBSTITUTE(TRIM(TRANSPOSE(QUERY(TRANSPOSE(
IF(C3:E<>"", C2:E2&":"&C3:E, )),,999^99))), " ", ", "))
paste in C3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&C2)))
paste in D3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&D2)))
paste in E3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&E2)))
paste in F3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(A3:A, "\[feature\d+\]\nname = (.*)")))
paste in G3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(A3:A, "\[components\]\n\d+ = (.*)")))
paste in H3:
=ARRAYFORMULA(IFNA(REGEXREPLACE(INDEX(SPLIT(REGEXEXTRACT(
REGEXREPLACE(A3:A, "\n", ", "), "\[resources\], (.*)"), "["),,1), ", , $", )))
spreadsheet demo
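For anyone who wants to sanity-check what these patterns capture outside of Sheets, here is a rough Python sketch. It is not a replacement for the formulas: it only mirrors the `(\d+)\ntype = ` idea on the sample cell from the question. The helper names `value_for_type` and `effects_summary` are made up for this illustration, and it assumes every `value = N` line is immediately followed by its `type = ...` line.

```python
import re

# The sample cell from the question, as one multiline string.
cell = """[config]
name = the_name
texture = blah.dds
cost = 1000
[effect0]
value = 1000
type = ATTR_A
[effect1]
value = 8
type = ATTR_B
[feature0]
name = feature_blah
[components]
0 = comp_one,1
[resources]
res_one = 1
res_five = 1
res_four = 1"""


def value_for_type(text, attr):
    """Mirror of the C3/D3/E3 formulas: digits on the line before 'type = <attr>'."""
    m = re.search(r"(\d+)\ntype = " + re.escape(attr), text)
    return m.group(1) if m else None


def effects_summary(text):
    """Collect every value/type pair and join them as TYPE:VALUE."""
    pairs = re.findall(r"value = (\d+)\ntype = (\w+)", text)
    return ",".join(f"{attr}:{value}" for value, attr in pairs)


print(value_for_type(cell, "ATTR_A"))
print(effects_summary(cell))
```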
This was a fun exercise. :-)
Caveat first: I have added some "input data". Examples:
[feature1]
name = feature_active_spoiler2
[components]
0 = spoiler,1
1 = spoilerA, 2
So the output has "extra" output.
See the tab ADW's Solution.

How to remove tags using RegexTokenizer() in Spark/Scala ML?

I have a feature column that has HTML tags in it. I would like to remove all tags.
An example of one row of data from column "body" is as follows:
"<p>Are questions related to and similar products on-topic?</p>"
I would like the output after using RegexTokenizer() to be as follows:
"are questions related to and similar products on-topic?"
Here is what I have started:
val regexTokenizer = new RegexTokenizer()
.setInputCol("body")
.setOutputCol("removedTags")
.setPattern("")
I think I need to fix the .setPattern() call, but I'm unsure how.
Assuming your strings contain no other < or > characters, replacing every match of
<[^>]+>
with an empty string should work reasonably well; if stray angle brackets do appear in the text, it will fail.
The expression is explained in the top-right panel of regex101.com if you wish to simplify, modify, or explore it; you can also watch there how it matches against sample inputs.
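To make the replacement concrete, here is a small sketch in plain Python (not Spark's RegexTokenizer, whose configuration is not shown here). It applies the same `<[^>]+>` pattern and lowercases the result, since the desired output in the question is lowercase. The function name `strip_tags` is made up for this illustration.

```python
import re

# Anything between a '<' and the next '>' is treated as a tag.
TAG = re.compile(r"<[^>]+>")


def strip_tags(body):
    """Remove tag-like spans and lowercase what is left."""
    return TAG.sub("", body).lower()


print(strip_tags("<p>Are questions related to and similar products on-topic?</p>"))
```

As with the pattern itself, this only holds while the text contains no literal < or > outside of tags.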

web browser innertext data how received in textbox?

I have posted my HTML below. In which I want to get the name value from within my textbox area. I've tried several processes and I'm still not getting any valid solution. Please check my HTML and code snippet, and show me a possible solution.
The name prefix always stays the same when I refresh the page. The name itself changes, but it always starts with the literal "mr." as its first three characters (four if you count the trailing space), which I would match with a regex such as ([mM]r.\ ). Below is my example table.
<table>
<tr><td><b>Your Name is </b> mr. kamrul</td></tr>
<tr><td><b>your age </b> 12</td></tr>
<tr><td><b>Email:</b>kennethdasma30#gmail.com</td></tr>
<tr><td><b>job title</b> sales man</td></tr>
</table>
As shown below I am trying this process using listbox but I am not receiving anything.
HtmlElementCollection bColl =
webBrowser1.Document.GetElementsByTagName("table");
foreach (HtmlElement bEl in bColl)
{
if (bEl.GetAttribute("table") != null)
{
listBox1.Items.Add(bEl.GetAttribute("table"));
}
}
If anyone can give me an idea of how to get everything shown in the browser window as ("mr. " + text) into my list box, I would appreciate it. Also, if you can explain the answer in detail and with good comments, I would appreciate it, as I'd like to understand it more fully.
Here is one simple way using Regex, assuming that the format of your html page doesn't change.
Regex re = new Regex(@"(?<=<tr><td><b>Your\sName\sis\s?</b>\s?)[mM]r\.\s.+?(?=</td></tr>)", RegexOptions.Singleline);
foreach (Match match in re.Matches(webBrowser1.DocumentText))
{
listBox1.Items.Add(match.Value);
}
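The matching logic can also be tried outside the WebBrowser control. The sketch below is Python, not C#, and swaps the lookbehind for a capture group, because Python's `re` module only accepts fixed-width lookbehind and the pattern above contains optional whitespace. The name `find_names` is made up for this illustration, and the HTML is a shortened copy of the table in the question.

```python
import re

html = """<table>
<tr><td><b>Your Name is </b> mr. kamrul</td></tr>
<tr><td><b>your age </b> 12</td></tr>
<tr><td><b>job title</b> sales man</td></tr>
</table>"""

# Same anchors as the C# pattern, but the wanted text is captured in group 1
# instead of being isolated with lookbehind/lookahead.
NAME = re.compile(r"<b>Your\sName\sis\s?</b>\s?([mM]r\.\s.+?)</td>", re.DOTALL)


def find_names(document_text):
    """Return every 'mr. <name>' value found after the 'Your Name is' label."""
    return NAME.findall(document_text)


print(find_names(html))
```

Like the C# version, this assumes the page format does not change; for anything less predictable, walking the table cells through the DOM would be more robust than a regex.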

Using Regex in Pig in hadoop

I have a CSV file containing user (tweetid, tweets, userid).
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
Now I need to write a Pig Query that returns all the tweets that include the word 'favorite', ordered by tweet id.
For this I have the following code:
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;
How does the regular expression work here in splitting the output in desired way?
I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL, as I find it much simpler, but couldn't get the code working except for extracting just the tweets:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,”:-](.*)[“,:-]',1)) AS (msg:chararray);
The above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweet_id, I do not get the desired output:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,”:-]',1)) AS (tweetid:long);
(396124554353197056,"Just saw #samantha0wen and #DakotaFears at the drake concert #waddup")
(396124554172432384,"#Yutika_Diwadkar I'm just so bright 😁")
(396124554609033216,"#TB23GMODE i don't know, i'm just saying, why you in GA though? that's where you from?")
(396124554805776385,"#MichaelThe_Lion me too 😒")
(396124552540852226,"Happy Halloween from us 2 #maddow & #Rev_AlSharpton :) http://t.co/uC35lDFQYn")
grunt>
Please help.
Can't comment, but from looking at this and testing it out, it looks like your quotes in the regex are different from those in the csv.
" in the csv
” in the regex code.
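To get the tweetid, the capture group needs to wrap the leading digits rather than the delimiter, so try something like this:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'^([0-9]+),.*',1)) AS (tweetid:long);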
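On the question of how a three-group pattern splits a line: each pair of parentheses captures one field, and the literal characters between the groups act as the delimiters. The sketch below shows that idea in Python rather than Pig, using straight quotes to match the CSV. The pattern, the helper names, and the two sample rows containing "favorite" are hypothetical and only there to exercise the filter and the ordering.

```python
import re

# Group 1: leading digits (tweet id); group 2: everything inside the outer
# straight quotes (the tweet); group 3: the comma-free tail (user id).
LINE = re.compile(r'^(\d+),"(.*)",([^,]*)$')

# First row is from the question; the other two are made-up examples.
SAMPLE = '''396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
396124436999999999,"my favorite song, again",example_user_b
396124436111111111,"favorite day of the year",example_user_a'''


def parse(line):
    """Split one line into (tweetid, msg, userid), or None if it does not fit."""
    m = LINE.match(line)
    if m is None:
        return None
    return int(m.group(1)), m.group(2), m.group(3)


rows = [r for r in (parse(l) for l in SAMPLE.splitlines()) if r is not None]
# Equivalent of: filter by msg matches '.*favorite.*', then order by tweetid.
favorites = sorted((r for r in rows if "favorite" in r[1]), key=lambda r: r[0])
for row in favorites:
    print(row)
```

The greedy `(.*)` in the middle backtracks to the last `",` in the line, which is what lets commas inside the tweet text survive.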

Mallet in R regex error :java.lang.NoSuchMethodException: No suitable method for the given parameters

I've been following the tutorial on how to use mallet in R to create topic models. My text file has one sentence per line. It looks like this and has about 50 sentences.
Thank you again and have a good day :).
This is an apple.
This is awesome!
LOL!
i need 2.
.
.
.
This is my code:
Sys.setenv(NOAWT=TRUE)
#setup the workspace
# Set working directory
dir<-"/Users/jxn"
Dir <- "~/Desktop/Chat/malletR/text" # adjust to suit
require(mallet)
documents1 <- mallet.read.dir(Dir)
View(documents1)
stoplist1<-mallet.read.dir("~/Desktop/Chat/malletR/stoplists")
View(stoplist1)
mallet.instances <- mallet.import(documents1$id, documents1$text, "~/Desktop/Chat/malletR/stoplists/en.txt", token.regexp ="\\p{L}[\\p{L}\\p{P}]+\\p{L}")
Everything works except for the last line of the code:
mallet.instances <- mallet.import(documents1$id, documents1$text, "~/Desktop/Chat/malletR/stoplists/en.txt", token.regexp ="\\p{L}[\\p{L}\\p{P}]+\\p{L}")
I keep getting this error :
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.NoSuchMethodException: No suitable method for the given parameters
According to the package, this is how the function should be:
mallet.instances <- mallet.import(documents$id, documents$text, "en.txt",
token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
I believe it has something to do with the token.regexp argument, as
documents1 <- mallet.read.dir(Dir) works just fine, which suggests that the first three arguments supplied to mallet.import were correct.
This is a link to the git repo that i was following the tutorial from.
https://github.com/shawngraham/R/blob/master/topicmodel.R
Any help would be much appreciated.
Thanks,
J
I suspect the problem is with your text file. I have encountered the same error and resolved it by using the as.character() function as follows:
mallet.instances <- mallet.import(as.character(documents$id),
as.character(documents$text),
"en.txt",
FALSE,
token.regexp="\\p{L}[\\p{L}\\p{P}]+\\p{L}")
Are you sure you also converted the id field to character? It is easy to overlook this and leave it as an integer.
Also there is a typo in the code sample: the backslashes have to be escaped:
token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}"
This usually occurs because the html text editor eats up one backslash.