I have a large text file that I'll be working with programmatically, but I've run into problems with a special character strewn throughout the file. The file is far too large to scan by hand looking for specific characters. I've been able to get rid of most of the other unwanted special characters using regex patterns, but there is a box character similar to "□" that remains. When I try to copy the character from the actual text file and paste it here, I get "�", so the example box above comes from the Windows Character Map, which lists the code 'U+25A1'. I'm not sure how to interpret that code, or whether it's something I could use in a regex search.
Would anyone know how I could search for a box symbol like "□" in a UTF-8 encoded file?
EDIT:
Here is an example from the text file:
"� Prune palms when flower spathes show, or delay pruning until after the palm has finished flowering, to prevent infestation of palm flower caterpillars. Leave the top five rows."
The only problem is that, as mentioned in the original post, the square gets converted into a diamond question mark.
It's unclear where and how you are searching, although you could use the hex equivalent:
\x{25A1}
Example:
https://regex101.com/r/b84oBs/1
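For reference, a minimal sketch of the same search in Python (the filename is hypothetical; this assumes the file really is valid UTF-8 and the box really is U+25A1):

import re

# \u25A1 in a Python string literal is the same code point as
# \x{25A1} in the regex above: WHITE SQUARE
box = re.compile("\u25A1")

with open("large_file.txt", encoding="utf-8") as f:  # hypothetical filename
    for lineno, line in enumerate(f, start=1):
        if box.search(line):
            print(f"line {lineno}: {line.rstrip()}")

If this finds nothing, the boxes are probably not U+25A1 at all but undecodable bytes rendered as a replacement glyph; see below.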
The black diamond with a question mark is not a character in your data, per se. It is what a browser spits out at you when you give it bytes it cannot decode (the glyph for U+FFFD, the Unicode Replacement Character).
Find out where that data is coming from.
Determine its encoding. (Usually UTF-8, but might be something else.)
Be sure the browser is configured to display that encoding. Putting <meta charset="UTF-8"> in the <head> of the page is usually sufficient.
I found a workaround using Notepad++ and this website. It's still not clear what encoding the square originally comes from, but when I paste it into the query field on that website, or into the Notepad++ Conversion Table (Plugins > Converter > Conversion Table), it gives the hex character code for the "Replacement Character", which is the diamond with the question mark.
Searching in Notepad++ with the regex \x{FFFD} then found all the squares, albeit recognized as the Replacement Character rather than as the original box.
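The same workaround can be scripted. A minimal Python sketch (filename hypothetical): decoding with errors="replace" turns every undecodable byte into U+FFFD, so searching for that code point flags the same lines Notepad++ reports:

# Undecodable bytes become U+FFFD when decoding with errors="replace",
# so every "square" line can be reported by number
with open("large_file.txt", encoding="utf-8", errors="replace") as f:
    for lineno, line in enumerate(f, start=1):
        if "\ufffd" in line:
            print(f"line {lineno}: {line.rstrip()}")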
Related
I want to search for all lines that match this regex
^([0-9IVX]\.)*.*\R
and report with the page number they are at. The output would be something like:
1. Heading/page number
1.1 Subheading/page number
1.1.1. Subsubheading/page number
Is this possible to do with a PDF? I suppose it would require Ghostscript, but searching the How to Use Ghostscript page for regex turns up nothing.
I can't think why you would expect Ghostscript to do the searching for you.
I'm not sure if you are hoping to get data typed as 'heading', 'page number', etc. from the PDF file, or if you are going to work that out yourself based on the data you find.
If it's the former then the first problem is that, in general, PDF files don't have the kind of structure information you are looking for. There is nothing in most PDF files which says 'this is a heading', 'this is a page number' etc.
There is such a thing as 'tagged PDF', which adds non-printing elements to a PDF file that carry exactly that kind of data around with them. It is an entirely optional feature, the vast majority of PDF files don't contain it, and Ghostscript completely ignores it.
Since most PDF files don't have that information, you can't rely on it, unless you are in the happy position of knowing where your PDF files are being generated and that they contain this kind of information. In which case there are numerous tools around which will extract it for you, or enable you to write code to do so.
The problem with just searching for the text is, firstly, that the text need not be written as a contiguous stream. So if you are looking for '1.1', it might be written as:
(1.1) Tj
(1) Tj
(.) Tj
(1) Tj
[(1) -0.1 (.) 0.1 (1)] TJ
or any combination of those. The individual character codes need not even appear in order or in the same content stream.
Secondly, the character code in a PDF content stream need not be (and often is not) a Unicode code point, or ASCII, or any other standard coding scheme; it can be totally arbitrary.
Some PDF files carry a ToUnicode CMap around which maps the character codes to Unicode code points, but not all do. Some fonts may use a standard (that's PDF standard) Encoding, in which case it's possible to infer the Unicode code points. Some Encodings may contain glyph names, from which it's again possible to infer Unicode code points.
In the end though, some PDF files are simply impossible to extract text from without using OCR.
Your best bet is probably to write code to extract the text, and Ghostscript will do that. It even goes through the hierarchy of fallbacks listed above to try to find a Unicode code point. If all else fails it just uses the character code and hopes that's good enough.
If you use Ghostscript's txtwrite device it will produce either a faked-up text page (the default), which attempts, as far as possible, to mimic the text layout of the original PDF file, including merging bits of text that aren't contiguous in the PDF file but are next to each other on the page; or an 'XML-like' output which tells you which Unicode code points, or character codes, were encountered and what their positions are on the original page. If you don't like txtwrite's attempts to figure out which text goes with what, you can use the latter to write your own.
I suspect the text page is probably good enough for your purposes. You can have the txtwrite device produce one file per page, so you can get the page number from the filename. Then you can write your own regular expression(s) to search the files and find your matches.
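As a sketch of that workflow (assuming an input named input.pdf; the heading pattern is loosely adapted from the one in the question and will likely need tuning):

# First, one text file per page, via Ghostscript's txtwrite device:
#   gs -sDEVICE=txtwrite -o page_%03d.txt input.pdf
# Then scan the per-page files for heading-like lines:

import glob
import re

heading = re.compile(r"^([0-9IVX]+\.)+")  # loosely based on the question's regex

for path in sorted(glob.glob("page_*.txt")):
    page = re.search(r"\d+", path).group()  # page number from the filename
    with open(path, encoding="utf-8") as f:
        for line in f:
            if heading.match(line):
                print(f"{line.rstrip()} / page {page}")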
I often grab quotes from articles whose citations include superscripted footnotes, which are a pain in the ass when copied: because the text is pasted as plain text and not HTML, the footnote markers show up as ordinary letters in the text.
Is there a way I could run this through a regex to take out these superscripts?
For example
In the abeginning bGod ccreated the dheaven and the eearth.
Should become
In the beginning God created the heaven and the earth.
I can't think of a way to have regex search for misspellings and a corresponding sequential set of numbers and letters.
Any thoughts? I'm using Sublime Text 3 for the majority of my writing, but I wouldn't mind outsourcing this to AppleScript or a text replacement app (aText, TextExpander, etc.).
Matching Code vs. Matching a Screen
It's hard to tell without seeing an example, but this should be doable if you copy the text from the code view, as opposed to the regular browser view (Ctrl- or Cmd-J is your friend). Since writing the rules will take time, this will only be worthwhile for large chunks of text.
In code view, your superscript will be marked up in a way that can be targeted by regex. For instance:
and therefore bananas make you smartera
in the browser view (where the a at the end is a citation note) may look like this in code view:
and therefore bananas make you smarter<span class="mycitations">a</span>
In your editor, using regex, you can process the text to remove all tags, or just certain tags. The rules may not always be easy to write, and of course there are many disclaimers about using regex to parse HTML.
However, if your source is always the same (Wikipedia for instance), then you can create and save rules that should work across many pages.
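A minimal sketch in Python, using the hypothetical mycitations class from the example above (real pages, e.g. Wikipedia's <sup> references, will need their own pattern, and a proper HTML parser is more robust than regex for anything complicated):

import re

html = 'and therefore bananas make you smarter<span class="mycitations">a</span>'

# Remove the citation spans, tag and contents together
clean = re.sub(r'<span class="mycitations">[^<]*</span>', '', html)

# Optionally strip any remaining tags to get plain text
clean = re.sub(r'<[^>]+>', '', clean)
print(clean)  # -> and therefore bananas make you smarter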
I need to replace the following characters with regex (gsub):
ÃÆè -> è
ÃÆÃÂ -> à
ÃÆò -> ò
ÃÆì -> ì
ÃÆÃù -> ù
My strategy is to first remove the first three characters, ÃÆÃ, that are common to all, and then move to the last ones, leaving à for the end since it is basically the lowest common denominator.
Now gsub correctly removes the first three, but then it seems it doesn't see the final ones, like ¨, although I noticed it does see ñ (for ñ).
By copying and pasting the characters into the text editor I noticed they cause weird behaviours (such as moving the cursor forward by a few positions).
My dataset was downloaded from a website that itself has encoding problems on its oldest pages but not on the most recent ones (I think they corrected the encoding problem sometime in the last few years). Visiting the oldest pages you can still see the very same garbled characters in plain sight. So the problem is not (I assume) in the encoding of my file.
That is, the encoding errors are limited to regions of the dataset and are not the result of an encoding issue with the whole text corpus.
The problem when the characters are not displayed correctly is understanding exactly how they are parsed by the regex. In my case, as explained, the encoding errors were limited to a few strings in my dataset, so Encoding() was not applicable.
I solved the problem by displaying the problematic characters directly in the R console. In the console they appear like Ã\u0083Æ\u0092Ã\u0082¨, while in RStudio they were rendered as Ã Æ Ã Â¨. What the console displayed was what I needed for a correct regex match: gsub("Ã\u0083Æ\u0092Ã\u0082¨"...
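The same trick, inspecting what a string actually contains rather than what the editor renders, works in any language. A small Python sketch with an illustrative mojibake string:

# The string as it appeared in the R console (illustrative)
s = "Ã\u0083Æ\u0092Ã\u0082¨"

# Print the true code point of every character, so the regex can be
# written against what is really there rather than what is rendered
print([f"U+{ord(c):04X}" for c in s])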
I am using tinyxml to save input from a text ctrl. The user can copy whatever they like into the text box and it gets written to an XML file. I'm finding that the newlines don't get saved, and neither do & characters. The weird part is that tinyxml just discards them completely, without any warning. If I put a & into the textbox and save, the tag will look like:
<textboxtext></textboxtext>
The newlines completely disappear as well; no characters whatsoever are stored. What's going on? Even if I need to escape them as &amp; or something, why does it just discard everything? Also, I can't find anything on Google regarding this topic. Any help?
EDIT:
I found this topic, which suggests the discarding of these characters may be a bug:
TinyXML and preserving HTML Entities
It is, apparently, a bug in TinyXml.
The simple workaround is to escape anything that it might not like:
&, ", ', < and > got their regular xml entities encoding
strange characters (read non-alphanumerical / regular punctuation) are best translated to their unicode codepoint: &#....;
Remember that TinyXml is, above all, a lightweight XML library, not a full-fledged beast.
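A sketch of that escaping workaround, written in Python for brevity (the real thing would sit in your C++ code just before the string is handed to TinyXml; the boundary chosen here for "strange" characters, anything outside printable ASCII, is an assumption):

def escape_for_xml(text):
    # The five standard XML entities; & must be escaped first
    for raw, ent in [("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"),
                     ('"', "&quot;"), ("'", "&apos;")]:
        text = text.replace(raw, ent)
    # Numeric character references for anything outside printable ASCII
    return "".join(c if " " <= c <= "~" or c == "\n" else f"&#{ord(c)};"
                   for c in text)

print(escape_for_xml('a & b\n"quoted" text'))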
I have a txt file that I'm trying to import as a flat file into SQL Server 2008. It looks like this:
"123456","some text"
"543210","some more text"
"111223","other text"
etc…
The file has more than 300,000 rows and the text fields are long (usually 200-500 chars), so scanning the file by hand is very time consuming and error prone. Other similar (and even more complex) files were imported successfully.
The problem with this one is that some lines contain quotes inside the text (the file came from an export from an old SuperBase DB that didn't let you specify a text qualifier; there's nothing I can do with the file other than clean it up and try to import it).
So the “offending” lines look like this:
"123456","this text "contains" a quote"
"543210","And the "above" text is bad"
etc…
You can see the problem here.
Now, 300,000 lines is not too many if I could perform a search using a text editor that supports regex; I'd manually remove the quotes from each offending line. The problem is not the number of offending lines, but the impossibility of finding them with a simple search. I'm sure there are fewer than 500, but spread those through a 300,000-line txt file and you know what I mean.
Based upon that, what would be the best regex I could use to identify these lines?
My first thought is: Tell me which lines contain more than 4 quotes (").
But I couldn’t come up with anything (I’m not good at Regex beyond the basics).
This pattern, ^("[^"]+){4,}, will match lines containing more than 4 quotes.
You can experiment with replacing 4 with 5 or more, depending on your data.
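If you'd rather count than pattern-match, a small Python sketch (filename hypothetical) flags the same lines:

# A well-formed line has exactly 4 quotes ("id" plus "text");
# anything above that has embedded quotes
with open("export.txt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        if line.count('"') > 4:
            print(f"line {lineno}: {line.rstrip()}")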
I think that you can be more direct with a Regex than you're planning to be. Depending on your dialect of Regex, something like this should do it:
^"\d+",".*".*"
You could also use a regex to remove the outside quotes and use a better delimiter instead. For example, search for ^"([0-9]+)","(.*)"$ and replace it with \1+++++DELIM+++++\2.
Of course, this doesn't directly answer your question, but it might solve the problem.
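A sketch of that search-and-replace in Python (the delimiter string is arbitrary and the filenames are hypothetical):

import re

pattern = re.compile(r'^"([0-9]+)","(.*)"$')

with open("export.txt", encoding="utf-8") as src, \
     open("export_fixed.txt", "w", encoding="utf-8") as dst:
    for line in src:
        # Keep the id and the text, joined by an unambiguous delimiter;
        # inner quotes no longer matter
        dst.write(pattern.sub(r"\1+++++DELIM+++++\2", line))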