How can I remove junk characters with regex? - regex

I have a web application that reads the contents of a web page and parses the sentences using an NLP algorithm. I have been using regex to split the contents into single sentences and then parsing them.
I would like to remove characters like  from my sentences. These characters, I imagine, are because of the HTML encoding.
I obviously cannot use a regex like [^\w\d]+ or its variations because I need the punctuation intact. Of course, I could add individual exceptions for each punctuation mark, like [^\w\d\.,:]+ and so on, but I would prefer an easier way, such as a character class that knows a character is a... funny character?
Any help will be much appreciated. Thanks.
EDIT: The app is built with PHP and I am using a simple file_get_contents() to fetch the HTML data from the site and reading the contents inside <p> tags.

This was mentioned in the comments by #TheGreatCO: you can create a character class of "special" characters by using hex code values to define a range. For any special character above ASCII 127, that would be:
[\x80-\xFE]
That matches anything but your most basic characters. For reference, here's a list of the ASCII character table with their hex codes.
This page discusses the different ways you can reference special characters in regex.
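
As a rough illustration of the hex-range idea, here is a minimal Python sketch (the question uses PHP; Python is swapped in here only for demonstration). It assumes the text is already a decoded string, so the class works on code points rather than raw bytes, and it uses the complement form [^\x00-\x7F] so that code points above \xFF are caught as well. The name strip_non_ascii is hypothetical.

```python
import re

# Assumption: `text` is a decoded str, so the class is applied to code points.
# [^\x00-\x7F] is the complement of the ASCII range; it also covers code
# points beyond \xFF, which a plain [\x80-\xFE] range would not.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def strip_non_ascii(text: str) -> str:
    """Remove every run of characters above ASCII 127; punctuation is kept."""
    return NON_ASCII.sub("", text)

sample = "Hello,\u00c2 world: it works."
cleaned = strip_non_ascii(sample)
print(cleaned)
```

Because only characters outside the ASCII range are removed, commas, colons, and full stops survive without having to be listed as exceptions.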

I found this regex helpful for identifying junk characters in a file using Atom:
[^(\x20-\x7F\p{Sc})]

Related

Non-ascii characters not getting filtered through regex in java

I am using the regex [^\x00-\x7F] to filter out any non-ASCII characters in my Java application. It filters most characters, but I recently found that it lets the control character called 'Start of Guarded Area' (see https://www.codetable.net/name/start-of-guarded-area) pass through and appear as – in my XML files. This character is non-ASCII, i.e. outside the 0-127 range, so can anybody shed some light on why it is not being filtered, and whether other characters might slip through in the same way? Note that I am using the XStream parser to parse the text. Any suggestions would be much appreciated. Thanks!

Ignoring invisible characters in RegEx

I've run into a bit of a conundrum.
I am currently trying to build a regex to filter out some particularly nasty scam emails. I'm sure you've seen them before, using a data dump from a compromised website to threaten to reveal intimate videos.
That's all well and good, except I noticed while testing the regex that some of these messages insert special invisible characters in the middle of words. Like you might see here (I've found it especially hard to find a place that keeps these special characters):
Regexr link
I find myself looking for a way to create a regex that might ignore these characters all together, as some emails have them and some don't. In the end, I'm trying to create a match with something like
/all (.*)your contacts
If there's a particular string you're trying to flag, you could do something like this:
Detect "email" with optional invis characters: /e[^\w]?m[^\w]?a[^\w]?i[^\w]?l/
[^\w]? matches one optional character that is not a letter, digit, or underscore. You could also use [^\w]* if you're seeing more than one invisible character between letters.
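
The same idea can be generated for any keyword instead of being typed by hand. This is only a sketch in Python; the helper name tolerant_pattern is hypothetical, and \W* (equivalent to [^\w]*) is used so that several invisible characters in a row are tolerated.

```python
import re

def tolerant_pattern(word: str) -> re.Pattern:
    """Build a pattern for `word` that allows non-word characters between letters."""
    # Escape each character, then join with \W* so that any run of characters
    # that are not letters, digits, or underscores may sit between them.
    return re.compile(r"\W*".join(re.escape(ch) for ch in word))

email = tolerant_pattern("email")

# U+200B (zero width space) is invisible and is not a word character.
print(bool(email.search("check your e\u200bm\u200bail now")))
print(bool(email.search("check your email now")))
```

Note that this also matches visible separators, so "e-mail" or "e m a i l" would be flagged too, which may or may not be desirable for a spam filter.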
Most invisible characters are just whitespace. Whichever character set they are rendered in, they are probably invisible.
If you are using a Unicode-aware regex engine, you could probably just put the whitespace class between the characters you're looking for.
If not, you could try using the equivalent explicit character class:
\s =
[\x{9}-\x{D}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200A}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]
Same, but without CRLF's
[^\S\r\n] =
[\x{9}\x{B}-\x{C}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200A}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]
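
To see how the two classes behave in practice, here is a small Python sketch. Python's re module is Unicode-aware for str patterns, so \s can be used directly; the function name normalize_spaces is hypothetical. One caveat the sketch makes visible: the zero width space U+200B is not classified as whitespace, so \s alone does not catch it.

```python
import re

# [^\S\r\n] reads as "whitespace, except CR and LF".
HORIZONTAL_WS = re.compile(r"[^\S\r\n]+")

def normalize_spaces(text: str) -> str:
    """Collapse each run of non-newline whitespace into one ordinary space."""
    return HORIZONTAL_WS.sub(" ", text)

sample = "all\u00a0of\u2003your\tcontacts\r\nnext line"
normalized = normalize_spaces(sample)
print(repr(normalized))

# U+200B is a format character, not whitespace, so \s does not match it.
zero_width_is_space = re.search(r"\s", "\u200b") is not None
print(zero_width_is_space)
```

If the invisible characters in the messages turn out to be zero-width characters rather than spaces, they would need to be added to the class explicitly.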

Are there any characters that are not allowed/used in regex

I have the somehow weird requirement that several regex should be passed as one single string to a jenkins plugin.
They should be entered in one single textfield and I have to split this string in a List of Regex later on.
Now the issue is that I can't think of any way to delimit the regexes in the string so that I can split it later, because a character like , could also be part of a regex itself.
E.g. if I'd use a , for the two regex "(\d+,?\s+\d{1})\.xls" and "\w+\.exe" :
"(\d+,?\s+\d{1})\.xls,\w+\.exe"
would be split into 3 regexes: "(\d+", "?\s+\d{1})\.xls" and "\w+\.exe"
where the first 2 are obviously invalid.
So my actual question is, are there any characters, that can never appear in a regex which I could use to delimit my regexes?
No, any and all characters can appear in a regex. Use any serialisation format to serialise your list of strings into a clearly expressed list format, e.g. JSON:
["(\\d+,?\\s+\\d{1})\\.xls", "\\w+\\.exe"]
Alternatively CSV or anything else that can express a list of things and properly escapes characters used to denote item separators.
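
A minimal sketch of the round trip, written in Python for illustration only (a Jenkins plugin would do the equivalent in Java or Groovy). The variable names are hypothetical; json.dumps and json.loads take care of escaping the backslashes and of keeping the comma inside the first regex intact.

```python
import json
import re

# The two regexes from the question, as raw strings.
patterns = [r"(\d+,?\s+\d{1})\.xls", r"\w+\.exe"]

# Serialise the list into one string suitable for a single text field.
serialized = json.dumps(patterns)
print(serialized)

# Later: parse the string back into a list and compile each entry.
restored = [re.compile(p) for p in json.loads(serialized)]
print(len(restored))
```

The comma inside the first pattern is never mistaken for a separator, because the JSON parser only treats commas outside quoted strings as delimiters.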

Notepad ++: selecting text up to matched characters

In notepad ++, I want to select text up to a certain text match, including the match.
The txt file I am working with contains a lot of text with also white characters, returns and some special characters. In this text, there are characters that mark an end. Let's call these stop characters "ZZ." for now.
Using RegEx, I tried to create an expression that finds the next "ZZ." and selects everything before it. This is what it looks like:
+., \c ZZ.\n
But I seem to have gotten something wrong. As it is similar to this problem, I tried to use their RegEx with slight modification. Here is a picture so you can see what I'd like to accomplish:
Find the next stop marker, select the marker and everything before it.
In the actual file, the stop marker is "გვ."
If I want to use those, maybe I need to change the RegEx even more, as they are not ASCII characters? Like so, as stated in the RegEx Wiki?
\c+ (\x{nnnn}\x{nnnn}.)\n
Not quite sure if the \c works that way. I have seen expressions that use something like (A-Za-z)(0-9) but this is a different alphabet.
To match any text up to and including some pattern, use .*? (which matches any zero or more characters, as few as possible) with the ". matches newline" option ON, and add the გვ after it.
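
Outside Notepad++, the same lazy-match idea can be sketched in Python. This is an illustration under assumptions: re.DOTALL stands in for the ". matches newline" option, the sample text is made up, and re.escape is used so that the dot in the stop marker is treated literally.

```python
import re

STOP = "გვ."  # the stop marker from the question, including its dot

# .*? matches as few characters as possible; DOTALL lets . cross line breaks.
up_to_marker = re.compile(".*?" + re.escape(STOP), re.DOTALL)

text = "first line\nsecond line გვ.\nthird line გვ."
match = up_to_marker.search(text)
selected = match.group(0) if match else ""
print(repr(selected))
```

Because the quantifier is lazy, the match stops at the first stop marker rather than running on to the last one.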

Using an asterisk in a RegExp to extract data that is enclosed by a certain pattern

I have a text that consists of information enclosed by a certain pattern.
The only thing I know is the pattern: "${template.start}" and "${template.end}"
To keep it simple I will substitute ${template.start} and ${template.end} with "a" in the example.
So one entry in the text would be:
aINFORMATIONHEREa
I do not know how many of these entries are concatenated in the text. So the following is correct too:
aFOOOOOOaaASDADaaASDSDADa
I want to write a regular expression to extract the information enclosed by the "a"s.
My first attempt was to do:
a(.*)a
which works as long as there is only one entry in the text. As soon as there is more than one entry it fails, because the .* matches everything. So using a(.*)a on aFOOOOOOaaASDADaaASDSDADa results in only one capturing group, containing everything between the first and the last character of the text, which are both "a":
FOOOOOOaaASDADaaASDSDAD
What I want to get is something like
captureGroup(0): aFOOOOOOaaASDADaaASDSDADa
captureGroup(1): FOOOOOO
captureGroup(2): ASDAD
captureGroup(3): ASDSDAD
It would be great to be able to extract each entry from the text and, from each entry, the information enclosed between the "a"s. By the way, I am using the QRegExp class of Qt4.
Any hints? Thanks!
Markus
Multiple variations of this question have been seen before. Various related discussions:
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Using regular expressions how do I find a pattern surrounded by two other patterns without including the surrounding strings?
Use RegExp to match a parenthetical number then increment it
Regex for splitting a string using space when not surrounded by single or double quotes
What regex will match text excluding what lies within HTML tags?
and probably others...
Simply use non-greedy expressions, namely:
a(.*?)a
You need to match something like:
a[^a]*a
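
Both suggestions can be compared side by side with a short Python sketch (Python is used here for illustration; it is not the asker's Qt4 setup). A capture group is added to the negated-class pattern so that it returns the enclosed text, which is an addition to the pattern as posted. As far as I know, QRegExp does not accept the *? syntax and instead switches the whole pattern to minimal matching with setMinimal(true), so the negated-class form may be the more portable of the two.

```python
import re

text = "aFOOOOOOaaASDADaaASDSDADa"

# Greedy: .* runs to the last "a", so only one oversized entry is found.
greedy = re.findall(r"a(.*)a", text)

# Lazy: .*? stops at the first closing "a" of each entry.
lazy = re.findall(r"a(.*?)a", text)

# Negated class: [^a]* can never run past a delimiter in the first place.
negated = re.findall(r"a([^a]*)a", text)

print(greedy)
print(lazy)
print(negated)
```

The lazy and negated-class versions give the same entries here; the negated class only works when the delimiter is a single character, while the lazy form also handles multi-character delimiters such as the real ${template.start} and ${template.end}.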
You have a couple of working answers already, but I'll add a little gratuitous advice:
Using regular expressions for parsing is a road fraught with danger
Edit: To be less cryptic: for all their power, flexibility and elegance, regular expressions are not sufficiently expressive to describe any but the simplest grammars. They are adequate for the problem asked here, but are not a suitable replacement for state machines or recursive descent parsers if the input language becomes more complicated.
So, choosing to use RE for parsing input streams is a decision that should be made with care and with an eye towards the future.