Stripping superscript from plaintext - regex

I often grab quotes from articles that include citations that include superscripted footnotes, which when copied are a pain in the ass. They show up as actual letters in the text as they are pasted in plaintext and not in html.
Is there a way I could run this through a regex to take out these superscripts?
For example
In the abeginning bGod ccreated the dheaven and the eearth.
Should become
In the beginning God created the heaven and the earth.
I can't think of a way to have regex search for misspellings and a corresponding sequential set of numbers and letters.
Any thoughts? I'm also using Sublime Text 3 for the majority of my writing, but I wouldn't mind outsourcing this to an AppleScript, or text replacement app (aText, textExpander, etc.).

Matching Code vs. Matching a Screen
It's hard to tell without seeing an example, but this should be doable if you copy the text from code view, as opposed to the regular browser view. (Ctrl or Cmd-J is your friend). Since writing the rules will take time, this will only be worthwhile for large chunks of text.
In code view, your superscript will be marked up in a way that can be targetted by regex. For instance:
and therefore bananas make you smartera
in the browser view (where the a at the end is a citation note) may look like this in code view:
and therefore bananas make you smarter<span class="mycitations">a</span>
In your editor, using regex, you can process the text to remove all tags, or just certain tags. The rules may not always be easy to write, and of course there are many disclaimers about using regex to parse html.
However, if your source is always the same (Wikipedia for instance), then you can create and save rules that should work across many pages.

Related

Is it possible to make an index search by regex in PDF?

I want to search for all lines that match this regex
^([0-9IVX]\.)*.*\R
and report with the page number they are at. The output would be something like:
1. Heading/page number
1.1 Subheading/page number
1.1.1. Subsubheading/page number
Is this possible to do in PDF? I suppose that would require Ghostscript, but searching the How to Use Ghostscript page for regex I find nothing.
I can't think why you would expect Ghostscript to do search for you.
I'm not sure if you are hoping to get the data type 'heading, page number' etc from the PDF file, or if you are going to work that out yourself based on the data you find.
If it's the former then the first problem is that, in general, PDF files don't have the kind of structure information you are looking for. There is nothing in most PDF files which says 'this is a heading', 'this is a page number' etc.
There are such things as 'tagged PDF' which adds non-printing elements to a PDF file which do carry that kind of data around with them. This is an entirely optional feature, the vast majority of PDF files don't contain it, and Ghostscript completely ignores it.
Since most PDF files don't have that information, you can't rely on it, unless you are in the happy position of knowing where your PDF files are being generated and that they contain this kind of information. In which case there are numerous tools around which will extract it for you, or enable you to write code to do so.
The problem with just searching for the text is that firstly the text need not be written as a contiguous stream. So if you are looking for '1.1' that might be written as:
(1.1) Tj
(1) Tj
(.) Tj
(1) Tj
[(1) -0.1 (.) 0.1 (1)] TJ
or any combination of those. The individual character codes need not even appear in order or in the same content stream.
Secondly the character code in a PDF content stream need not be (and often is not) a Unicode code point. Or ASCII, or any other standard coding scheme, it can be totally arbitrary.
Some PDF files carry a ToUnicode CMap around which maps the character codes to Unicode code points, but not all do. Some fonts may use a standard (that's PDF standard) Encoding, in which case it's possible to infer the Unicode code points. Some Encodings may contain glyph names, from which it's again possible to infer Unicode code points.
In the end though, some PDF files are simply impossible to extract text from without using OCR.
Your best bet is probably to write code to extract text, and Ghostscript will do that. It even goes through the heirarchy of fallbacks listed above to try and find a Unicode code point. If all else fails it just uses the character code and hopes that's good enough.
If you use Ghostscript's txtwrite device it will produce either a faked up text page (the default) which attempts, as far as possible, to mimic the text layout in the original PDF file, including merging bits of text that aren't contiguous in the PDF file but are next to each other on the page. Or an 'XML-like' output which will tell you which Unicode code points, or character codes, were encountered and what their position is on the original page. If you don't like txtwrite's attempts to figure out which text goes with what, then you can use this to write your own.
I suspect the text page is probably good enough for your purposes. You can have the txtwrite device produce one file per page, so you can get the page number from the filename. Then you can write your own regex expression(s) to search the files and find your matches.

Using variables in reStructuredText

I'm writing a long HOWTO in reStructuredText format and wondering if there's a way to let user specify values for a couple variables (hostname, ip address) at the top so the rest of the document would be filled with those automatically?
Like me, you are probably looking for substitution. At the bottom of the section you'll find how to replace text.
Substitution Definitions
Doctree element: substitution_definition.
Substitution definitions are indicated by an explicit markup start
(".. ") followed by a vertical bar, the substitution text, another
vertical bar, whitespace, and the definition block. Substitution text
may not begin or end with whitespace. A substitution definition block
contains an embedded inline-compatible directive (without the leading
".. "), such as "image" or "replace".
Specifically about text replacement:
Replacement text
The substitution mechanism may be used for simple macro substitution. This may be appropriate when the replacement text is
repeated many times throughout one or more documents, especially if it
may need to change later. A short example is unavoidably contrived:
|RST|_ is a little annoying to type over and over, especially
when writing about |RST| itself, and spelling out the
bicapitalized word |RST| every time isn't really necessary for
|RST| source readability.
.. |RST| replace:: reStructuredText
.. _RST: http://docutils.sourceforge.net/rst.html
reStructuredText is a markup language to define static content. HTML content (I assumed the desired output format is HTML) is typically generated from reStructuredText on build time and then released/shipped to the user.
To allow users to specify variables, you would need a solution on top of reStructuredText, for example:
Ship the content with a JavaScript plugin that dynamically replaces specific strings in the HTML document with user input.
Generate the documentation on-the-fly after the user has specified the variables.
Note that these examples are not necessarily particularly viable solutions.

Find Lines with N occurrences of a char

I have a txt file that I’m trying to import as flat file into SQL2008 that looks like this:
“123456”,”some text”
“543210”,”some more text”
“111223”,”other text”
etc…
The file has more than 300.000 rows and the text is large (usually 200-500 chars), so scanning the file by hand is very time consuming and prone to error. Other similar (and even more complex files) were successfully imported.
The problem with this one, is that “some lines” contain quotes in the text… (this came from an export from an old SuperBase DB that didn’t let you specify a text quantifier, there’s nothing I can do with the file other than clear it and try to import it).
So the “offending” lines look like this:
“123456”,”this text “contains” a quote”
“543210”,”And the “above” text is bad”
etc…
You can see the problem here.
Now, 300.000 is not too much if I could perform a search using a text editor that can use regex, I’d manually remove the quotes from each line. The problem is not the number of offending lines, but the impossibility to find them with a simple search. I’m sure there are less than 500, but spread those in a 300.000 lines txt file and you know what I mean.
Based upon that, what would be the best regex I could use to identify these lines?
My first thought is: Tell me which lines contain more than 4 quotes (“).
But I couldn’t come up with anything (I’m not good at Regex beyond the basics).
this pattern ^("[^"]+){4,} will match "lines containing more than 4 quotes"
you can experiment with replacing 4 with 5 or more, depending on your data.
I think that you can be more direct with a Regex than you're planning to be. Depending on your dialect of Regex, something like this should do it:
^"\d+",".*".*"
You could also use a regex to remove the outside quotes and use a better delimeter instead. For example, search for ^"([0-9]+)","(.*)"$ and replace it with \1+++++DELIM+++++\2.
Of course, this doesn't directly answer your question, but it might solve the problem.

Regex/Textmate Confusion

I'm trying to create a Textmate snippet, but have run into some difficulties. Basically, I want to type in a Name and split it into its parts.
Example,
Bill Gates: (Bill), (bill), (Gates), (gates), (Bill Gates), (Bill gates), (bill Gates), (bill gates)
EDIT**
So I most certainly can produce these results quite simply if I was using a programming language. For example, I could split the words and then call the uppercase or lowercase functions to produce this output.
But in my situation I am using Textmate and it regular expression capabilities to create a tab snippet. I want to type some trigger key, ie doit, press tab and then type in a username. Then the ouput above will be created. This won't save me that much time, but I feel like I come across this sort of stuff in Textmate quite frequently and want to figure it out.
I have been using this as a reference, but still don't know how use regexps to be selective with the words and upper and lowercase the values (\u \U \l \L)
http://manual.macromates.com/en/snippets
You can use Ruby for textmate snippets. That should make it simpler.

Use cases for regular expression find/replace

I recently discussed editors with a co-worker. He uses one of the less popular editors and I use another (I won't say which ones since it's not relevant and I want to avoid an editor flame war). I was saying that I didn't like his editor as much because it doesn't let you do find/replace with regular expressions.
He said he's never wanted to do that, which was surprising since it's something I find myself doing all the time. However, off the top of my head I wasn't able to come up with more than one or two examples. Can anyone here offer some examples of times when they've found regex find/replace useful in their editor? Here's what I've been able to come up with since then as examples of things that I've actually had to do:
Strip the beginning of a line off of every line in a file that looks like:
Line 25634 :
Line 632157 :
Taking a few dozen files with a standard header which is slightly different for each file and stripping the first 19 lines from all of them all at once.
Piping the result of a MySQL select statement into a text file, then removing all of the formatting junk and reformatting it as a Python dictionary for use in a simple script.
In a CSV file with no escaped commas, replace the first character of the 8th column of each row with a capital A.
Given a bunch of GDB stack traces with lines like
#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) at ../../mvl/src/mvl_serv.c:850
strip out everything from each line except the function names.
Does anyone else have any real-life examples? The next time this comes up, I'd like to be more prepared to list good examples of why this feature is useful.
Just last week, I used regex find/replace to convert a CSV file to an XML file.
Simple enough to do really, just chop up each field (luckily it didn't have any escaped commas) and push it back out with the appropriate tags in place of the commas.
Regex make it easy to replace whole words using word boundaries.
(\b\w+\b)
So you can replace unwanted words in your file without disturbing words like Scunthorpe
Yesterday I took a create table statement I made for an Oracle table and converted the fields to setString() method calls using JDBC and PreparedStatements. The table's field names were mapped to my class properties, so regex search and replace was the perfect fit.
Create Table text:
...
field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,
....
My Regex Search:
/([a-z_])+ .*?,?/
My Replacement:
pstmt.setString(1, \1);
The result:
...
pstmt.setString(1, field_1);
pstmt.setString(1, field_2);
pstmt.setString(1, field_3);
pstmt.setString(1, field_4);
....
I then went through and manually set the position int for each call and changed the method to setInt() (and others) where necessary, but that worked handy for me. I actually used it three or four times for similar field to method call conversions.
I like to use regexps to reformat lists of items like this:
int item1
double item2
to
public void item1(int item1){
}
public void item2(double item2){
}
This can be a big time saver.
I use it all the time when someone sends me a list of patient visit numbers in a column (say 100-200) and I need them in a '0000000444','000000004445' format. works wonders for me!
I also use it to pull out email addresses in an email. I send out group emails often and all the bounced returns come back in one email. So, I regex to pull them all out and then drop them into a string var to remove from the database.
I even wrote a little dialog prog to apply regex to my clipboard. It grabs the contents applies the regex and then loads it back into the clipboard.
One thing I use it for in web development all the time is stripping some text of its HTML tags. This might need to be done to sanitize user input for security, or for displaying a preview of a news article. For example, if you have an article with lots of HTML tags for formatting, you can't just do LEFT(article_text,100) + '...' (plus a "read more" link) and render that on a page at the risk of breaking the page by splitting apart an HTML tag.
Also, I've had to strip img tags in database records that link to images that no longer exist. And let's not forget web form validation. If you want to make a user has entered a correct email address (syntactically speaking) into a web form this is about the only way of checking it thoroughly.
I've just pasted a long character sequence into a string literal, and now I want to break it up into a concatenation of shorter string literals so it doesn't wrap. I also want it to be readable, so I want to break only after spaces. I select the whole string (minus the quotation marks) and do an in-selection-only replace-all with this regex:
/.{20,60} /
...and this replacement:
/$0"¶ + "/
...where the pilcrow is an actual newline, and the number of spaces varies from one incident to the next. Result:
String s = "I recently discussed editors with a co-worker. He uses one "
+ "of the less popular editors and I use another (I won't say "
+ "which ones since it's not relevant and I want to avoid an "
+ "editor flame war). I was saying that I didn't like his "
+ "editor as much because it doesn't let you do find/replace "
+ "with regular expressions.";
The first thing I do with any editor is try to figure out it's Regex oddities. I use it all the time. Nothing really crazy, but it's handy when you've got to copy/paste stuff between different types of text - SQL <-> PHP is the one I do most often - and you don't want to fart around making the same change 500 times.
Regex is very handy any time I am trying to replace a value that spans multiple lines. Or when I want to replace a value with something that contains a line break.
I also like that you can match things in a regular expression and not replace the full match using the $# syntax to output the portion of the match you want to maintain.
I agree with you on points 3, 4, and 5 but not necessarily points 1 and 2.
In some cases 1 and 2 are easier to achieve using a anonymous keyboard macro.
By this I mean doing the following:
Position the cursor on the first line
Start a keyboard macro recording
Modify the first line
Position the cursor on the next line
Stop record.
Now all that is needed to modify the next line is to repeat the macro.
I could live with out support for regex but could not live without anonymous keyboard macros.