XML find and delete all text in doc not within a specified tag - regex

I have an XML doc which is massive - a short example is below to illustrate formatting. What I want to do is find all the text in the doc which is not within a tag and delete it - so I am left with just a list of the data...
So here is the original:
51.639973121-2.161205923
112.0
<time>2017-02-19T11:26:45Z</time>
51.639902964-2.161258059
111.6
<time>2017-02-19T11:26:46Z</time>
51.639834484-2.161310529
111.6
<time>2017-02-19T11:26:47Z</time>
51.639765501-2.161366101
111.6
<time>2017-02-19T11:26:48Z</time>
51.639697859-2.161426451
111.8
<time>2017-02-19T11:26:49Z</time>
And once formatted - it will become:
<time>2017-02-19T11:26:45Z</time>
<time>2017-02-19T11:26:46Z</time>
<time>2017-02-19T11:26:47Z</time>
<time>2017-02-19T11:26:48Z</time>
<time>2017-02-19T11:26:49Z</time>
How can I do this?

The following expression will select all text but time tags:
^(?!<time>[^<]+<\/time>).*\R
It works only if each tag is on its own line, as in your example input.
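As a rough sketch of how the expression behaves in code, here is the same idea in Python. Python's re module does not support \R, so this version assumes \n line endings and uses \n? instead. The sample string and the function name keep_time_tags are made up for illustration.

```python
import re

# Same negative lookahead as the expression above. \R is swapped for \n?
# because Python's re module has no \R (this assumes \n line endings).
NOT_TIME_LINE = re.compile(r"^(?!<time>[^<]+</time>).*\n?", re.MULTILINE)


def keep_time_tags(text: str) -> str:
    """Delete every line that does not start with a complete <time> element."""
    return NOT_TIME_LINE.sub("", text)


# Illustrative input in the same shape as the question's example.
sample = (
    "51.639973121-2.161205923\n"
    "112.0\n"
    "<time>2017-02-19T11:26:45Z</time>\n"
    "51.639902964-2.161258059\n"
    "111.6\n"
    "<time>2017-02-19T11:26:46Z</time>\n"
)

print(keep_time_tags(sample))
```

As with the original expression, a <time> tag that shares a line with other text would not be handled correctly.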

Related

How to remove tags using RegexTokenizer() in Spark/Scala ML?

I have a feature column that has HTML tags in it. I would like to remove all tags.
An example of one row of data from column "body" is as follows:
"<p>Are questions related to and similar products on-topic?</p>"
I would like the output after using RegexTokenizer() to be as follows:
"are questions related to and similar products on-topic?"
Here is what I have started:
val regexTokenizer = new RegexTokenizer()
.setInputCol("body")
.setOutputCol("removedTags")
.setPattern("")
I think I need to fix the .setPattern() call, but I am unsure how.
Assuming your strings contain no other < or > characters, the pattern
<[^>]+>
replaced with an empty string should work reasonably well. If stray angle brackets do appear in the text, it will fail.
If you wish to simplify, modify, or explore the expression, regex101.com explains it in its top-right panel and shows how it matches against sample inputs.
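To make the replacement concrete, here is a minimal sketch in plain Python rather than Spark. The helper name strip_tags is hypothetical, and the lowercasing is added only because the desired output in the question is lowercase. Inside Spark, a substitution like this is usually done with a replace function such as regexp_replace, since a tokenizer splits text rather than rewriting it.

```python
import re

# The pattern from the answer: "<", then one or more non-">" characters, then ">".
TAG = re.compile(r"<[^>]+>")


def strip_tags(body: str) -> str:
    """Remove anything that looks like a tag, then lowercase the result."""
    return TAG.sub("", body).lower()


row = "<p>Are questions related to and similar products on-topic?</p>"
print(strip_tags(row))
```

The same caveat applies: a literal < or > in the text itself would confuse the pattern.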

Regular expression with csv not finding blank space

I'm trying to parse a CSV file. I got the following regular expression from Google. It works pretty well, except for one issue: it doesn't parse blank fields.
let arrItem = row.match(/(".*?"|[^",]+)(?=\s*,|\s*$)/g);
arrItem = arrItem || [];
Example row data
9598,"HERE IS LOOKING AT YOU KID, LLC",85647 GOLDEN BLAH BLAH,,ASHBURN,VA,20147,USA,555-555-1511,45-1111111,SOME#GMAIL.COM,9598,,
(Screenshot of arrItem omitted. I modified the data in the sample and covered it in the screenshot for privacy.)
The problem is that in the array, the item at index 3 should be blank, the item at index 4 should be "ASHBURN", and so forth. Any ideas on how to fix the expression?
Thanks

imacros Extract all text without href

I need help extracting text1, text2, text3 (that is, all the text; a category sometimes goes up to text9).
<h4>Category:</h4>
<p>text1, text2, text3</p>
My iMacros code extracts only text1:
TAG POS=R1 TYPE=A ATTR=TXT:* EXTRACT=TXT
Q: How do I extract all the text in the category?
Thanks
To expand on the JavaScript comment, this is how you could go about it:
ExtractCategory.js content
// Play the macro reading the category data
iimPlay("foo.iim");
// Get the last extracted value, i.e. the p content
var pContent = iimGetExtract();
// Parse the p using regex, first find a tag pairs and then drop the surrounding a tags
var result = pContent.match(/<a(.*?)<\/a>/g).map(function(val){
    return val.replace(/<\/?a>/g,'').replace(/<a.+>/g,'');
});
// Pass the generated String to another macro to work with it
iimSet("passed_var", result);
iimPlay("bar.iim");
Next to ExtractCategory.js, foo.iim content
'Your previous code here, line #2 is just to find the right p in line #3 in a mockup html
TAG POS=1 TYPE=H4 ATTR=*
TAG POS=R1 TYPE=P ATTR=* EXTRACT=HTM
Next to ExtractCategory.js, bar.iim content
'Do whatever with the passed variable containing your formatted String
'This is just an output to show it
PROMPT {{passed_var}}
When you run ExtractCategory.js, it runs your foo.iim code to extract the p content, parses it with regex, and then passes the generated String on to another macro to do with as you please. Be careful with the regex step: depending on the texts you expect, it might break.
Running this, your result is text1,text2,text3, as desired.
Read up on http://wiki.imacros.net/iimSet() and http://wiki.imacros.net/iimPlay() if you need further information on how to use them.
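For anyone who wants to try the regex step outside iMacros, here is a small Python sketch of the same idea: find each a tag pair and keep only the text between the tags. The HTML string and its href values are assumptions (the question's snippet does not show the actual links), and anchor_texts is a made-up helper name.

```python
import re

# Capture the text between each <a ...> and its closing </a>.
ANCHOR = re.compile(r"<a\b[^>]*>(.*?)</a>")


def anchor_texts(p_html: str) -> list[str]:
    """Return the inner text of every a tag pair found in the given HTML."""
    return ANCHOR.findall(p_html)


# Hypothetical p content, in the shape EXTRACT=HTM might return.
p_content = (
    '<p><a href="/c/1">text1</a>, '
    '<a href="/c/2">text2</a>, '
    '<a href="/c/3">text3</a></p>'
)

print(",".join(anchor_texts(p_content)))
```

Like the JavaScript version, this is regex over HTML and will break on nested tags or unusual markup.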
This code will extract the data in all the A tags inside the P tag, but it needs a small setup first: I use XPath to get the path of the A tags.
Please install XPath Checker by Brian Slesinsky, or see "How to find the XPath of an element" (I would recommend the Chrome console method).
With this installed, right-click the a tag and choose View XPath. This gives you an XPath like
/x:html/x:body/x:p/x:a[2]
Once you have this XPath, paste it into the XPATH value of the TAG command, as I have done in the code below. Remove the x: prefixes before pasting. The number in [] is the child number; since we use !LOOP to set that index, we drop the [2].
Note:
1. Please loop the iMacros code according to the number of A tags you want to extract.
2. You also need to update the FOLDER attribute of the SAVEAS line to your desktop path.
Code:
SET !LOOP 1
SET !ERRORIGNORE YES
TAG XPATH=(/html/body//p/a)[{{!LOOP}}] EXTRACT=TXT
SAVEAS TYPE=EXTRACT FOLDER=C:/Users/Test/Desktop/ FILE=output.csv

Search for an item in a text file using UIMA Ruta

I have been trying to search for an item which is there in a text file.
The text file looks like this, for example:
>HEADING
00345
XYZ
MethodName : fdsafk
Date: 23-4-2012
More text and some part containing instances of XYZ
So I initially did a dictionary search for XYZ and found the positions, but I want only the first XYZ and not the rest. The XYZ I want always sits between the 5-digit code and the text MethodName.
I am unable to do that.
WORDLIST ZipList = 'Zipcode.txt';
DECLARE Zip;
Document
Document{-> MARKFAST(Zip, ZipList)};
DECLARE Method;
"MethodName" -> Method;
WORDLIST typelist = 'typelist.txt';
DECLARE type;
Document{-> MARKFAST(type, typelist)};
Also how do we use REGEX in UIMA RUTA?
There are many ways to specify this. Here are some examples (not tested):
// just remove the other annotations (assuming type is the one you want)
type{-> UNMARK(type)} ANY{-STARTSWITH(Method)};
// only keep the first one: remove any annotation if there is one somewhere in front of it
// you can also specify this with POSITION or CURRENTCOUNT, but both are slow
type # #type{-> UNMARK(type)}
// just create a new annotation in between
NUM{REGEXP(".....")} #{-> type} #Method;
There are two options to use regex in UIMA Ruta:
(find) simple regex rules like "[A-Za-z]+" -> Type;
(matches) REGEXP conditions for validating the match of a rule element like
ANY{REGEXP("[A-Za-z]+")-> Type};
Let me know if something is not clear. I will extend the description then.
DISCLAIMER: I am a developer of UIMA Ruta

Regex code how to filter all names that contain only numbers and end with .jpg and/or _number.jpg?

How to filter all names that consist of numbers and end with .jpg and/or _number.jpg?
Background info:
In SSIS 2008 I have a foreach loop that stores the filename of each jpg file into a variable. The enumerator configuration for Files is currently: *.jpg
This will handle all jpg files.
What pattern would make it handle only names like these:
3417761506233.jpg
3417761506233_1.jpg
5414233177487.jpg
5414233177487_1.jpg
5414233177487_14.jpg
but not names like:
abc.jpg
abc123.jpg
def.png
456.png
The numbers represent EAN codes by the way.
I thought about this code:
\d|_|.jpg
but SSIS returns an error stating that no files meet the criteria, even though the files (names) are in the folder.
You could use a Script Task within the loop to do the regex filtering:
http://microsoft-ssis.blogspot.com/2012/04/regex-filter-for-foreach-loop.html
Or you could use a (free) Third Party Enumerator:
http://microsoft-ssis.blogspot.com/2012/04/custom-ssis-component-foreach-file.html
For that, you can use the following regex:
^\d+(_\d+)?\.jpg$
Demo: http://regex101.com/r/qC7oV3
^(\d+(?:_\d+)?\.jpg$)
DEMO --> http://regex101.com/r/dM9rJ7
Matches:
3417761506233.jpg
3417761506233_1.jpg
5414233177487.jpg
5414233177487_1.jpg
5414233177487_14.jpg
Excludes:
abc.jpg
abc123.jpg
def.png
456.png
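If you want to check the pattern outside SSIS, a quick Python sketch like the one below runs it against the file names from the question. The names EAN_JPG and is_ean_jpg are made up for this example, and Python's regex flavour is assumed to be close enough to .NET's for this simple pattern.

```python
import re

# Digits, an optional "_digits" suffix, then a literal ".jpg".
# The dot is escaped; an unescaped "." would also accept names like "123xjpg".
EAN_JPG = re.compile(r"^\d+(?:_\d+)?\.jpg$")


def is_ean_jpg(name: str) -> bool:
    """True if the file name is all digits, optionally _number, ending in .jpg."""
    return EAN_JPG.match(name) is not None


names = [
    "3417761506233.jpg",
    "3417761506233_1.jpg",
    "5414233177487.jpg",
    "5414233177487_1.jpg",
    "5414233177487_14.jpg",
    "abc.jpg",
    "abc123.jpg",
    "def.png",
    "456.png",
]

for name in names:
    print(name, "->", "match" if is_ean_jpg(name) else "skip")
```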