Having issues with regex sub function in Python 3 - regex

I am trying to remove symbols like \x92, \xa0, etc. from text that I downloaded from a website and parsed using BeautifulSoup. These symbols (encoding artifacts) show up everywhere. I am using re.sub(r'[^\x00-x7F]',' ',txt)
to remove these symbols from my txt, but I noticed that I have lost each occurrence of y. For example: 'Security' became 'Securit' etc.
Any help would be greatly appreciated.
Thanks.

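The (erroneous) regular expression r'[^\x00-x7F]' should probably be r'[^\x00-\x7F]' (note the additional backslash).
As written, the class is read as the range NUL through x plus the literal characters 7 and F. Negated, it matches everything outside that range, so y and all later ASCII characters are replaced along with the non-ASCII symbols.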
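To make the difference concrete, here is a minimal sketch comparing the two patterns. The sample string is invented for illustration; only the two patterns come from the question and the answer.

```python
import re

# Sample text with two non-ASCII characters (\x92 and \xa0); invented for illustration.
txt = "Security\x92s\xa0policy"

# Buggy: 'x7F' lacks a backslash, so the class is NUL..'x' plus the literals '7' and 'F'.
# Negated, it also matches 'y', 'z', '{', '|', '}', '~' and DEL.
buggy = re.sub(r'[^\x00-x7F]', ' ', txt)

# Fixed: the range now ends at \x7F, the last ASCII code point.
fixed = re.sub(r'[^\x00-\x7F]', ' ', txt)

print(repr(buggy))  # every 'y' has been replaced by a space
print(repr(fixed))  # 'Security s policy'
```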

Related

Doxygen parsing ampersands for ascii chars

I've been using Doxygen to document my project, but I've run into some problems.
My documentation is written in a language in which apostrophes are often used. Although my language config parameter is properly set, when Doxygen generates the HTML output it doesn't parse apostrophes correctly, so the character code is shown instead of the character itself.
So, in the HTML documentation:
This should be the text: Vector d'Individus
But instead, it shows this: Vector d&#39;Individus
That's strange, but looking at the HTML file, I found that instead of using an ampersand to write the &#39; code, it uses the ampersand code (&amp;). It's easier to see in the code itself:
<div class="ttdoc">Vector d&amp;#39;Individus ... </div>
One other thing is to note that this only happens with the text inside tooltips...
But not on other places (same code, same class)...
What can I do to solve this?
Thanks!
Apostrophes in code comments must be encoded with the correct glyph for doxygen to parse them correctly. This seems particularly true for the SOURCE_TOOLTIPS popups. The correct glyph is \u2019, RIGHT SINGLE QUOTATION MARK. If your keyboard does not provide this glyph, you can write a temporary symbol (e.g. ') and batch-replace it afterwards with a Unicode-capable auxiliary tool, for example: perl -pC -e "s/'/\x{2019}/g" < infile > outfile. Hope it helps.
Regarding the answer from ramkinobit: this is not necessary. Doxygen can handle, for example, the right single quote ’ (see the doxygen documentation chapter "HTML commands").
For the apostrophe the OP asks about, one can use the doxygen extension &apos; (see also the doxygen documentation chapter "HTML commands").
There was a double HTML escape in doxygen, which caused the observed behavior for the single quote, i.e. displaying &#39; instead of '.
I've just pushed a proposed patch to github (pull request 784, https://github.com/doxygen/doxygen/pull/784).
EDIT 07/07/2018 (alternative) patch has been integrated in main branch on github.
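The double escape described above can be reproduced in a few lines. This is only a sketch using Python's standard html module, not doxygen's own code; note that Python emits &#x27; for the apostrophe where doxygen emitted &#39;, but the effect is the same.

```python
import html

text = "Vector d'Individus"

# First escape: the apostrophe becomes a numeric character reference.
once = html.escape(text)

# Second (unwanted) escape: the ampersand of that reference is escaped again.
twice = html.escape(once)

# A browser decodes entities exactly once, so the reader sees the raw reference.
displayed = html.unescape(twice)

print(once)       # Vector d&#x27;Individus
print(twice)      # Vector d&amp;#x27;Individus
print(displayed)  # Vector d&#x27;Individus
```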

Represent "less-than" symbol in Sublime Text regex in syntax definition

I'm working with an ancient pre-XML markup that uses codes of the form "$=x", where x may be an alphabetic character or a symbol on the keyboard, such as ; (semicolon), ? (question mark), or < (left angle bracket, aka less-than; originally misdescribed here as the right angle bracket, aka greater-than). [Note after editing: the confusion manifested in the question as originally phrased goes to the heart of the problem. See my comment to the accepted answer. RS]
So I've modified a copy of XML.tmLanguage syntax definition file in my User folder to identify the eleven different categories that these codes represent, so I can easily see them in the large text files (which also contain XML markup) I'm working with.
For all the symbols except < I'm able to escape the symbol by preceding it with a backslash. But in the Boost regex engine that ST2 uses, \< is how you indicate that you want to match only at the start of a word. Consequently I've been unable to get this code to be properly recognized and highlighted.
I've looked everywhere for how to escape the < symbol in this circumstance. I've tried preceding it with 0, 1, 2, 3 and 4 back-slashes; and I also tried using the hexadecimal escape code \x{3009}. [Note: this is the code for greater-than instead of less-than.]
All in vain. (A few alternatives didn't generate an error message but also didn't highlight the code.)
Because the codes I'm working with need to be colored differently, I can't use a generic symbol in lieu of <, and I can't specify it either. How do I get this?
The tmLanguage file is written in XML, so Sublime Text feeds it through an XML parser first, before giving pieces to its regex engine's parser.
XML uses < to open tags such as <string>, so you can't use it directly as a character. Instead, there are these standard character references:
&amp; for & (ampersand)
&lt; for < (less than)
&gt; for > (greater than; not required)
&quot; for " (quote mark; only required in attribute values quoted with ")
&apos; for ' (apostrophe; only required in attribute values quoted with ')
So use <string>\$=&lt;</string> in the syntax file. When Sublime Text reads the file, its own XML parser will turn this into \$=< for the regex parser.
Backslash sequences don't help because the XML parser passes them through unchanged to the regex parser, which then sees \< or \\, neither of which are what you want.
\x{3008} is passed by the XML parser to the regex parser, where it's decoded to 〈, a character which looks somewhat similar to < but doesn't match it. \x3C would work though.
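The two-stage decoding can be sketched as follows. The assumptions here are that Python's ElementTree stands in for Sublime Text's XML parser and Python's re for its Boost regex engine, so only the order of decoding is illustrated, not Boost's exact syntax. The sample input string is made up.

```python
import re
import xml.etree.ElementTree as ET

# Stage 1: the XML parser decodes &lt; and leaves the backslash untouched.
element = ET.fromstring(r"<string>\$=&lt;</string>")
pattern = element.text
print(pattern)  # \$=<

# Stage 2: the regex engine receives \$=< and matches the literal code.
match = re.search(pattern, "abc $=< def")
print(match.group())  # $=<

# The hexadecimal escape for U+003C is resolved by the regex engine instead.
hex_match = re.search(r"\$=\x3C", "abc $=< def")
print(hex_match.group())  # $=<
```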
By the way, tmLanguage files use plist (property list) XML, so you can convert it to a format that's easier to edit, or use a plist editor such as http://tustin2121.github.io/jsPlistor/ (from Is there any online .plist editor?).
Try using &gt; in the syntax file.

pattern matching a filename in R

This is probably real simple, but I can't seem to figure out how to do it.
I have an application in R (Shiny) where a user uploads to the application a *.zip file that contains all the components of an ESRI shapefile. I unpack these files into their own directory. This folder then, may or may not, contain a *.shp.xml file. At some point in my R code, I need to find the exact name of the *.shp file that has been unpacked, and distinguish it from the *.shp.xml file. How do I write the expression that will do that? I was thinking to use list.files, but I am unsure how to write the rest of the expression.
thanks!
With R regex patterns, the "$" has a special meaning: it anchors the match at the end of a character element. The dots need to be escaped with \\, so:
shpfils <- list.files(path, pattern="\\.shp$")
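The same anchoring idea, sketched in Python rather than R for comparison. The file names are hypothetical; the pattern is the one from the list.files call above, written with a single backslash because a Python raw string does not need R's doubled escape.

```python
import re

# Hypothetical contents of the unpacked directory.
names = ["parcels.shp", "parcels.shp.xml", "parcels.shx", "parcels.dbf"]

# \. matches a literal dot; $ anchors the match at the end of the name,
# so "parcels.shp.xml" is rejected even though it contains ".shp".
shp_only = [name for name in names if re.search(r"\.shp$", name)]

print(shp_only)  # ['parcels.shp']
```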
This should isolate your file -
Sys.glob("*shp")
as compared to
Sys.glob("*shp*")
which should give both the files
or
Sys.glob("*shp.xml")
which should give the .shp.xml file

regular expressions for selecting multiple lines

I have a text file in a particular format:
!c_xyz|crby=112|crdate=12jun11|mdby=112|mddate=12jun11|Desc=xyz
asdasda........................................................
asddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd
!c_abc|crby=112|crdate=12jun11|mdby=112|mddate=12jun11|Desc=xyz...
I need a regular expression to reformat this file using Find and Replace in Visual Studio. The Desc field value has overflowed onto the following lines, and I need to move it back onto the original line. The final string should look like this:
!c_xyz|crby=112|crdate=12jun11|mdby=112|mddate=12jun11|Desc=xyzsdasda.........asdddddd..
!c_abc|crby=112|crdate=12jun11|mdby=112|mddate=12jun11|Desc=xyz...
I need an RE for "desc=" followed by anything until the next ! symbol
Find Desc=([^\|\r\n]+)[\r\n](([^!\r\n][^\r\n]+[\r\n])*), replace with Desc=\1\2, and repeat until every line starts with ! (you can check this by searching for ^[^!], which should find nothing).
Alternatively, find [\r\n]+ and replace with the empty string, then find ! and replace with \r\n!. This suggestion has two drawbacks: it temporarily produces very long lines, which your editor (notably VS) may or may not handle well, and it processes descriptions containing ! incorrectly.
Addendum:
Your input seems to be fixed-format up to the Desc section. If it is, you can apply step 1 of alternative #2, followed by a search/replace run using (!.{53}\|Desc=)/[\r\n]\1.
As mentioned in the comments by @X3074861X, you can use Notepad++.
Input:
!c_xyz|crby=112|crdate=12jun11|mdby=112|mddate=12jun11|Desc=xyz
asdasda........................................................
asddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd
!c_abc|crby=112|crdate=12jun11|mdby=112|mddate=12jun11|Desc=xyz...
For the find and replace, select the mode as Regular expression with the options as follows:
Find what: \r\n[^!]
Leave Replace with blank.
Output:
!c_xyz|crby=112|crdate=12jun11|mdby=112|mddate=12jun11|Desc=xyzsdasda........................................................sddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd
!c_abc|crby=112|crdate=12jun11|mdby=112|mddate=12jun11|Desc=xyz...
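Note that \r\n[^!] also consumes the first character of each continuation line, which is why the output above reads xyzsdasda rather than xyzasdasda. A minimal sketch of a variant that keeps that character, using a negative lookahead in Python's re (an assumption: the question targets Visual Studio, whose regex dialect may differ). The sample text is shortened from the question's input.

```python
import re

# Shortened version of the question's input, with Windows line endings.
text = (
    "!c_xyz|crby=112|crdate=12jun11|mdby=112|mddate=12jun11|Desc=xyz\r\n"
    "asdasda....\r\n"
    "asdddd\r\n"
    "!c_abc|crby=112|crdate=12jun11|mdby=112|mddate=12jun11|Desc=xyz...\r\n"
)

# Remove a line break unless the next character is '!' or the text ends there.
# The lookahead only inspects the next character, so nothing else is deleted.
joined = re.sub(r"\r?\n(?!!|\Z)", "", text)

for line in joined.splitlines():
    print(line)
```

Each record ends up on a single line, with the continuation text appended in full to the Desc field.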

Can someone explain which characters need to be escaped in tinyxml?

I am using tinyxml to save input from a text ctrl. The user can copy whatever they like into the text box and it gets written to an xml file. I'm finding that the new lines don't get saved and neither do & characters. The weird part is that tinyxml just discards them completely without any warning. If I put a & into the textbox and save, the tag will look like:
<textboxtext></textboxtext>
newlines completely disappear as well. No characters whatsoever are stored. What's going on? Even if I need to escape them with &amp or something, why does it just discard everything? Also, I can't find anything on google regarding this topic. Any help?
EDIT:
I found this topic, which suggests that the discarding of these characters may be a bug:
TinyXML and preserving HTML Entities
It is, apparently, a bug in TinyXml.
The simple workaround is to escape anything that it might not like:
&, ", ', < and > get their regular XML entity encodings
strange characters (read: anything other than alphanumerics and regular punctuation) are best translated to their Unicode code point: &#....;
Remember that TinyXml is above all a lightweight XML library, not a full-fledged beast.
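The escaping workaround above can be sketched in Python to show the intended output; a TinyXml program would do the equivalent in C++ before handing the text to the library. The function name escape_for_xml is hypothetical, and encoding newlines as &#10; and &#13; is an assumption on my part, since the answer mentions only entities and code points.

```python
from xml.sax.saxutils import escape


def escape_for_xml(text: str) -> str:
    """Escape text before storing it as XML element content (hypothetical helper)."""
    # escape() handles &, < and > first; the dict adds the two quote characters.
    escaped = escape(text, {'"': "&quot;", "'": "&apos;"})
    # Assumption: line breaks are written as numeric references so they survive.
    escaped = escaped.replace("\r", "&#13;").replace("\n", "&#10;")
    # Any remaining non-ASCII character becomes &#NNN; (decimal code point).
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(escape_for_xml("a & b\n<é>"))  # a &amp; b&#10;&lt;&#233;&gt;
```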