regex to parse image

If I have a string in the form:
data:image/x-icon;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAABmJLR0QAAAAAAAD5Q7t/AAAA2UlEQVQ4y8WSvQvCMBDFX2rFUvuFSAUFBQfBwUXQVfFfFpzdRV2c7O5UKmihX9E6RZo2pXbyTbmX3C+5uwD/FskG+76WsvX65n/3Lm0pdU214HOAbHIWwvzeYPL1p4cT4QCi5DIxEINIdWt+Hs9cXAtg3UOkIJAUpT5ADiho8kbD0NG0LB6Q76xIevwCpW+0bBvj7Y5wgCpI148RBxTmYo7Z1RGPkSk/kc4jgme0oHoJlmFUOC+8lUEMN0ASvyBpGha++IXCJrJyKJGhjIalyZVyNqufP9j/9AH0S0vqrU+YMgAAAABJRU5ErkJggg==
What is the best regex I can use to parse these elements into an array (so I can write out the correct image)?
Update: I understand base64 encoding, but the question is really how to parse these kinds of embedded icons in webpages, since I don't know whether people are using e.g. base62 or other encodings, or even other formats, to embed images. I also see examples in pages where the identifier is image/x-icon but the string actually contains a PNG.
Update: Just to give something back, here is the code where I used this: http://plugins.svn.wordpress.org/wp-favicons/trunk/filters/search/filter_extract_from_page.php
I still have some open questions, e.g. whether only base64 is ever used, but time will tell in practice.

Can you see the base64 at the beginning? You don't need a regex. You need to decode this base64 string into a byte stream and then save it as an image.
I have now saved the following text into a file icon.txt:
iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAABmJLR0QAAAAAAAD5Q7t
/AAAA2UlEQVQ4y8WSvQvCMBDFX2rFUvuFSAUFBQfBwUXQVfFfFpzdRV2c7O5UKmihX9E6RZo2pXbyTbmX3C+5uwD
/FskG+76WsvX65n
/3Lm0pdU214HOAbHIWwvzeYPL1p4cT4QCi5DIxEINIdWt+Hs9cXAtg3UOkIJAUpT5ADiho8kbD0NG0LB6Q76xIevwCpW+0bBvj7Y5wgCpI148RBxTmYo7Z1RGPkSk
/kc4jgme0oHoJlmFUOC+8lUEMN0ASvyBpGha++IXCJrJyKJGhjIalyZVyNqufP9j
/9AH0S0vqrU+YMgAAAABJRU5ErkJggg==
And processed:
base64 -d icon.txt > icon.png
and it shows a red heart icon, 16x16 pixels.
This is the way you can decode it in the command line. Most programming languages offer good libraries to decode it directly in your program.
EDIT: If you use PHP, then have a look at base64_decode().
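For example, here is a minimal Python sketch of the same idea (the file name and the truncated URI are placeholders): a regex splits the data URI into its MIME type and payload, the payload is base64-decoded, and the bytes are written out. Since the declared MIME type can lie, sniffing the magic bytes is safer than trusting the label.

import base64
import re

# Illustrative data URI, truncated here for brevity.
uri = "data:image/x-icon;base64,iVBORw0KGgoAAAANSUhEUgAA..."

# Capture the MIME type, the ";base64" flag, and the payload.
m = re.match(r"data:(?P<mime>[\w/.+-]+);base64,(?P<data>.+)", uri)
if m:
    payload = base64.b64decode(m.group("data"))
    # image/x-icon sometimes wraps a PNG, so check the magic bytes.
    ext = "png" if payload.startswith(b"\x89PNG") else "ico"
    with open("icon." + ext, "wb") as f:
        f.write(payload)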

Related

UTF-8 Encoding Regex

Hello, I am banging my head against the wall with regular expressions and UTF-8. I would like to filter emails whose From header field comes with something like this:
=?UTF-8?Q?FirstName=C2=A0LastName?=
I want to be able to check whether the first name and last name are in upper case, lower case, capitalized, etc.
I currently have a regex like \bFirstName+([ ])+LastName\b, and that works like a charm, but when the header has this UTF-8 encoding it does not work.
This isn't "UTF-8 encoding", it's MIME header encoding (using Quoted-Printable). There's only one reasonable way to deal with it: decode it (using an appropriate library for your language, probably found in an email-related package), and run your regex on the decoded result. Trying to match without decoding first, while not actually impossible, will be stupidly complex and error-prone.
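A sketch of that decode-then-match approach in Python (the header value is the one from the question, with the bold markers removed; \s is relied on to match the non-breaking space that =C2=A0 decodes to):

from email.header import decode_header
import re

header = "=?UTF-8?Q?FirstName=C2=A0LastName?="

# decode_header() returns (bytes, charset) chunks; join them into text.
decoded = "".join(
    part.decode(charset or "ascii") if isinstance(part, bytes) else part
    for part, charset in decode_header(header)
)

# An ordinary regex now works on the decoded text; \s also matches
# the non-breaking space (U+00A0) between the two names.
if re.search(r"\bFirstName\s+LastName\b", decoded):
    print("matched:", decoded)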

Is it possible to make an index search by regex in PDF?

I want to search for all lines that match this regex
^([0-9IVX]\.)*.*\R
and report them with the page number they are on. The output would be something like:
1. Heading/page number
1.1 Subheading/page number
1.1.1. Subsubheading/page number
Is this possible to do in PDF? I suppose that would require Ghostscript, but searching the How to Use Ghostscript page for regex I find nothing.
I can't think why you would expect Ghostscript to do the searching for you.
I'm not sure if you are hoping to get the data type 'heading, page number' etc from the PDF file, or if you are going to work that out yourself based on the data you find.
If it's the former then the first problem is that, in general, PDF files don't have the kind of structure information you are looking for. There is nothing in most PDF files which says 'this is a heading', 'this is a page number' etc.
There are such things as 'tagged PDF' which adds non-printing elements to a PDF file which do carry that kind of data around with them. This is an entirely optional feature, the vast majority of PDF files don't contain it, and Ghostscript completely ignores it.
Since most PDF files don't have that information, you can't rely on it, unless you are in the happy position of knowing where your PDF files are being generated and that they contain this kind of information. In which case there are numerous tools around which will extract it for you, or enable you to write code to do so.
The problem with just searching for the text is that firstly the text need not be written as a contiguous stream. So if you are looking for '1.1' that might be written as:
(1.1) Tj
(1) Tj
(.) Tj
(1) Tj
[(1) -0.1 (.) 0.1 (1)] TJ
or any combination of those. The individual character codes need not even appear in order or in the same content stream.
Secondly, the character code in a PDF content stream need not be (and often is not) a Unicode code point, or ASCII, or any other standard coding scheme; it can be totally arbitrary.
Some PDF files carry a ToUnicode CMap around which maps the character codes to Unicode code points, but not all do. Some fonts may use a standard (that's PDF standard) Encoding, in which case it's possible to infer the Unicode code points. Some Encodings may contain glyph names, from which it's again possible to infer Unicode code points.
In the end though, some PDF files are simply impossible to extract text from without using OCR.
Your best bet is probably to write code to extract text, and Ghostscript will do that. It even goes through the hierarchy of fallbacks listed above to try and find a Unicode code point. If all else fails it just uses the character code and hopes that's good enough.
If you use Ghostscript's txtwrite device it will produce either a faked up text page (the default) which attempts, as far as possible, to mimic the text layout in the original PDF file, including merging bits of text that aren't contiguous in the PDF file but are next to each other on the page. Or an 'XML-like' output which will tell you which Unicode code points, or character codes, were encountered and what their position is on the original page. If you don't like txtwrite's attempts to figure out which text goes with what, then you can use this to write your own.
I suspect the text page is probably good enough for your purposes. You can have the txtwrite device produce one file per page, so you can get the page number from the filename. Then you can write your own regex expression(s) to search the files and find your matches.
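A sketch of that workflow in Python (file names are placeholders, and the question's regex is tightened so it doesn't match every line):

import glob
import re
import subprocess

# One text file per page; Ghostscript expands %d to the page number.
subprocess.run(
    ["gs", "-sDEVICE=txtwrite", "-o", "page_%d.txt", "input.pdf"],
    check=True,
)

heading = re.compile(r"^([0-9IVX]+\.)+")  # adapted from ^([0-9IVX]\.)*.*\R
pages = sorted(glob.glob("page_*.txt"), key=lambda p: int(re.search(r"\d+", p).group()))
for path in pages:
    page = re.search(r"\d+", path).group()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if heading.match(line):
                print(line.strip() + "/page " + page)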

Convert \xc3\xd8\xe8\xa7\xc3\xb4\xd to human readable format

I am having trouble converting '\xc3\xd8\xe8\xa7\xc3\xb4\xd' (which is Thai text) to a readable format. I get this value from a smart card, and it was working on Windows but not on Linux.
If I print in my Python console, I get:
����ô
I tried following some Google hints, but I am unable to accomplish my goal.
Any suggestion is appreciated.
Your text does not seem to be Unicode text. Instead, it looks like it is in one of the Thai encodings. Hence, you must know the encoding before printing the text.
For example, if we assume your data is encoded in TIS-620 (and the last character is \xd2 instead of \xd) then it will be "รุ่งรดา".
To work with non-Unicode strings in Python, you may try myString.decode("tis-620"), or even sys.setdefaultencoding("tis-620").
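A minimal Python 3 sketch, assuming the trailing byte really is \xd2 as guessed above:

# Bytes as read from the smart card; the trailing \xd2 is an assumption.
raw = b"\xc3\xd8\xe8\xa7\xc3\xb4\xd2"

print(raw.decode("tis-620"))  # รุ่งรดา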

How to save an image into a string in C++ source code

Is it possible to save an image into C++ source code?
#include <string>

int main() {
    std::string image = "<full code of a .png for example>";
    // ...
}
The problem is that an image contains a lot of characters like '\', and copying and pasting it from a hex editor generates errors.
I don't want to load the image from a .png file; I want to get the image data directly into a string.
This is generally done by saving the image into a base64 encoded string. It requires more bytes to store, but has the advantage of being a string. You can use an online tool to convert your image to a base64 encoded string that you can copy into your source file.
std::string base64 = "copy encoded string here";
See this question for more details on how to decode that string into an image. Hope this helps.
There are tools like xxd which will generate a character array in a header file from a binary input file for you to include in your project, see this answer. This is generally preferable for this use case to using a string since you don't need to worry about base64 encoding to handle special characters.
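If xxd isn't available, the same header is easy to generate yourself; here is a minimal Python sketch (file names are placeholders) that emits a C array the way xxd -i does:

# Stand-in for `xxd -i`: turn a binary file into an includable C array.
with open("icon.png", "rb") as f:
    data = f.read()

lines = ["unsigned char icon_png[] = {"]
for i in range(0, len(data), 12):
    row = ", ".join("0x%02x" % b for b in data[i:i + 12])
    lines.append("    " + row + ",")
lines.append("};")
lines.append("unsigned int icon_png_len = %d;" % len(data))

with open("icon.h", "w") as f:
    f.write("\n".join(lines) + "\n")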
Add the image to your resources if using Windows. Console applications can also have resource files. Then just load the image from the resources.
The other option is to use Base64 encoding on the image, copy the string into your source, recompile, and decode the string at runtime.

Pulling bad chars from an XML feed

I need to figure out a way to remove any non-standard ASCII chars from a feed I am getting from a partner's system.
The issue is that they are sending a mix of data: HTML formatting and hard returns, along with bad chars.
I am already using the UDF DeMoronize(text) to pull out any Microsoft Latin-1 "Extensions" chars.
The feed is coming over in UTF-8, and we have a tag on the page to ensure we are processing the feed in UTF-8.
Is there a simple way to code this to remove any non-UTF-8 chars?
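One common approach, sketched here in Python purely for illustration (the feed text is a placeholder): keep printable ASCII plus tabs and hard returns, and drop everything else.

import re

feed = "Some text\r\nwith stray bytes \x00\x1a and HTML <b>markup</b>"

# Keep printable ASCII (0x20-0x7E) plus tab, CR, and LF; strip the rest.
cleaned = re.sub(r"[^\x20-\x7E\t\r\n]", "", feed)
print(cleaned)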