Create and Convert to PDFs, No Toolkit - C++

Not sure where else to ask this, so I figured I'd give good old stackoverflow a shot.
Let's say, by chance, I would like to write a library or set of libraries that will create PDFs and convert files to PDF, AND I couldn't care less about how long it will take me to complete (3 months - 10 years... whatever). I have absolutely no interest in paying for a toolkit... the point of this would be to learn how to manipulate and create files like PDFs. There's nothing business critical about the project, I just want to learn how to do it. Where do I start? I would imagine something like this would be written in C++, but I'm not sure... maybe high level languages would work as well. I'm not looking for someone to tell me exactly how to do it, but to send me in the right direction, or at least point out the concepts I would need to concretely grasp before proceeding with such a project.
Any advice and help in directing me here is greatly appreciated : )

Well, you will need a very good understanding of the PDF file format. Adobe publishes the standard and you can start at their site. You can start with the base 1.7 standard and then read the cumulative supplements from there. It is a daunting task, but it can be done and you can pretty much use any language you want, because in the end you are just generating bytes that can be saved to a file.
If you want to convert from, let's say, Word documents, it will get a little trickier. Microsoft has published their file formats, which you would have to learn, and then you would have to learn how to translate them into the corresponding PDF formatting. Also note that the .doc and .docx formats are completely separate file formats and would require separate engines to convert them.
With unlimited time, it is definitely doable, you would just need to ask yourself if it is worth it.
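To make the "you are just generating bytes" point concrete, here is a minimal sketch in Python (any language would do) that writes a one-page PDF by hand: a header, five numbered objects, a cross-reference table of byte offsets, and a trailer. The function name `build_minimal_pdf` is made up for this example, and it assumes plain Latin-1 text and the standard Helvetica font; it is nowhere near a full implementation of the standard.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Return the bytes of a one-page PDF showing `text` in Helvetica."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Content stream: select font F1 at 24pt, move to (72, 720), show the text.
    stream = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the reserved object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as handle:
        handle.write(build_minimal_pdf("Hello, PDF"))
```

Most of the early work is bookkeeping like this: objects refer to each other by number, and the cross-reference table has to record the exact byte offset of each one.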

Related

reading mp3 file for game development

I am currently creating a game. My game will use music from an mp3 file that the user supplies in order to make decisions on where to place things, how fast the level moves, etc. I am fairly new at this, and I have been reading information about mp3. Currently I have found all the frames in the mp3 file that I am using. I don't really know where to go from here. What I want to do is measure the frequencies of the sound wave of the music at certain times (like every second) and then, based on that frequency, do what I need to for the game. I don't know whether I should decode the mp3 (that looks like a lot of work and I don't want to do that if I don't have to) or if I can just read the bytes in the frame and convert them without decoding anything. I am developing this in C#, using the game engine FlatRedBall. I am not using any libraries. I am also planning on selling this game, so I would like to avoid using other people's code if I can. Please someone help me, I just need a direction to go from here. I know how to parse the header and calculate the frame length, I just don't know the next step in what I want to do...
Convert your music to the .ogg format, which is free, and use a free library to play it.
Note: I was going to post this as a comment but it quickly grew too big. :)
Writing your own MP3 encoder/decoder is probably going to take a good amount of effort; effort which would probably be better spent on your game itself. Therefore, if possible, I would by all means try to use an open source library.
That said, most good MP3 libraries are LGPL/GPL licensed. For the LGPL ones, this means you can use them in a commercial setting, as long as you dynamically link to them. Also, the SDL Mixer library, as of version 1.2.12, supports MP3s and is under a more permissive zlib license, but since you mention C# I don't know if stable and up-to-date bindings are available. Also, since your project isn't written in SDL to begin with, it might be hard to integrate it.
Also, as @pro_metedor hinted, perhaps using a more open format could help with licensing issues. In general, OGG achieves better compression than MP3, which is a plus for things like download size, bandwidth/resource usage, etc.
Just shop around for a while, and try to be a little flexible. I'm sure you'll find something nice! :)
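On the "measure the frequencies" part of the question: the bytes inside an MP3 frame are entropy-coded, so they cannot be read directly as amplitudes; whichever decoder you use, the analysis runs on the decoded PCM samples. A minimal sketch in Python (the game itself is C#, so treat this as pseudocode for the idea), assuming you already have one window of decoded mono samples as a list of floats. The function name `dominant_frequency` is made up for this example, and a real game would use an FFT rather than this naive DFT.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Return the strongest frequency (Hz) in `samples` using a naive DFT."""
    n = len(samples)
    if n < 2:
        return 0.0
    best_bin, best_power = 0, 0.0
    # For real input only bins 1 .. n/2 are meaningful (bin 0 is the DC offset).
    for k in range(1, n // 2 + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        power = abs(acc) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    # Bin k corresponds to k * sample_rate / n Hz.
    return best_bin * sample_rate / n


# Synthetic check: a pure 440 Hz sine sampled at 8 kHz, 400 samples long.
sample_rate = 8000
window = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(400)]
print(dominant_frequency(window, sample_rate))
```

Running this once per second of audio gives one number per second to drive level decisions. The frequency resolution is `sample_rate / n`, so longer windows give finer resolution.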

psd file format

I am attempting to find documentation of the psd file format so I can read in a .psd and then save out the individual layers as files, along with making other modifications. Does anyone know of any document on the .psd file format? (Just for reference, I will be writing this in C++)
If there are any code examples of loading a .psd file in C++ then I would appreciate them being linked.
(Please do not turn this into a "just use XXX software". This is not homework, or anything related to that. I am doing this because I think it will be a fun project to work on. I will ask for posts to be down voted if this happens.)
There's also some Objective-C code on GitHub (should be easily understandable for anyone with a C++ background), also source of this gem, which appears to sum it up nicely:
At this point, I'd like to take a moment to speak to you about the Adobe PSD format.
PSD is not a good format. PSD is not even a bad format. Calling it such would be an
insult to other bad formats, such as PCX or JPEG. No, PSD is an abysmal format. Having
worked on this code for several weeks now, my hate for PSD has grown to a raging fire
that burns with the fierce passion of a million suns.
If there are two different ways of doing something, PSD will do both, in different
places. It will then make up three more ways no sane human would think of, and do those
too. PSD makes inconsistency an art form. Why, for instance, did it suddenly decide
that these particular chunks should be aligned to four bytes, and that this alignement
should not be included in the size? Other chunks in other places are either unaligned,
or aligned with the alignment included in the size. Here, though, it is not included.
Either one of these three behaviours would be fine. A sane format would pick one. PSD,
of course, uses all three, and more.
Trying to get data out of a PSD file is like trying to find something in the attic of
your eccentric old uncle who died in a freak freshwater shark attack on his 58th
birthday. That last detail may not be important for the purposes of the simile, but
at this point I am spending a lot of time imagining amusing fates for the people
responsible for this Rube Goldberg of a file format.
Earlier, I tried to get a hold of the latest specs for the PSD file format. To do this,
I had to apply to them for permission to apply to them to have them consider sending
me this sacred tome. This would have involved faxing them a copy of some document or
other, probably signed in blood. I can only imagine that they make this process so
difficult because they are intensely ashamed of having created this abomination. I
was naturally not gullible enough to go through with this procedure, but if I had done
so, I would have printed out every single page of the spec, and set them all on fire.
Were it within my power, I would gather every single copy of those specs, and launch
them on a spaceship directly into the sun.
PSD is not my favourite file format.
Just so you are warned. :)
This will not be a fun project, the .psd format is big. It incorporates every feature Adobe has put into Photoshop over many years.
I believe the specification can be had from Adobe, but they don't just hand it out to the public. You'll have to contact them and jump through some hoops first.
The PSD file format specification as written by Adobe is here;
http://www.adobe.com/devnet-apps/photoshop/fileformatashtml/
Last update: June 2012. As far as I know this is the best available source about the PSD file format, even though there are a few mistakes.
First I recommend starting by dividing PSD into blocks.
Enjoy!
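A minimal sketch of that "divide into blocks" step, in Python rather than C++ for brevity. It follows the layout in Adobe's published document: a fixed 26-byte big-endian header, then three length-prefixed blocks, then the image data. The function names are made up for this example, and it assumes a version 1 file (.psd); version 2 (.psb) uses wider length fields in places, which this does not handle.

```python
import struct

HEADER_SIZE = 26


def read_psd_header(data: bytes) -> dict:
    """Parse the fixed 26-byte header at the start of a PSD file."""
    signature, version, _reserved, channels, height, width, depth, mode = \
        struct.unpack(">4sH6sHIIHH", data[:HEADER_SIZE])
    if signature != b"8BPS":
        raise ValueError("not a PSD file")
    return {"version": version, "channels": channels, "height": height,
            "width": width, "depth": depth, "color_mode": mode}


def split_sections(data: bytes) -> dict:
    """Return (start, end) byte ranges of the blocks that follow the header."""
    offset = HEADER_SIZE
    sections = {}
    for name in ("color_mode_data", "image_resources", "layer_and_mask_info"):
        # Each of these blocks starts with a 4-byte big-endian length.
        (length,) = struct.unpack_from(">I", data, offset)
        sections[name] = (offset + 4, offset + 4 + length)
        offset += 4 + length
    # Whatever remains is the merged image data.
    sections["image_data"] = (offset, len(data))
    return sections
```

The layers you want to save out live inside the layer and mask information block, which has its own nested structure.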
MyPSD::CPSD class is a C++ class that can load images saved in Adobe's Photoshop native format.
http://www.codeproject.com/Articles/10398/Import-Adobe-Photoshop-psd-images
MolecularMatters psd_sdk seems like a good library to take inspiration from: https://github.com/MolecularMatters/psd_sdk
It allows you to read layers from a .psd file and much more.

Publishing toolchain

I have a book project which I'd like to start sooner rather than later. This would follow an agile-like publishing workflow, i.e. publish early and often. It is meant to be self-published by me and I'm not really looking to paper-publish it, even though we never know.
If I weren't a geek, I'd probably have already started writing in Word or any other WYSIWYG tool and just exported to PDF. However, we know it is not the best solution, and emacs rules my text-editing life, so the output format should be as simple as possible and be text-based.
I've thought about the following options:
Just use orgmode and export to PDF (orgmode has this feature natively)
Use markdown mode and export to PDF (markdown->LaTeX->PDF should not be hard to setup);
Use something similar to what the guys @ Pragmatic Programmers do: XML + XSLT + LaTeX.
More complex, but much more control over the style.
EDIT: Someone just told me that he uses a combo of Textile + Adobe InDesign and the XTags plugin. Not sure how they are glued together though, gotta do some research.
Any other ideas / references ?
I want to start writing as soon as possible. In fact, I already have a draft in an org-formatted file. However, I do want to have and use the full power of LaTeX later on to format it the way I want and make it look fabulous :)
Thanks in advance,
Marcelo.
I have done a TON of research on this lately, since I'm planning on starting my own small press soon.
It really depends on what you want your final output to be (PDF, HTML, other?), and what the book is about.
Org mode is great, as I'm sure you know, because it expands as you do. I often write my outlines in org mode, then just fill in the body text when I'm really ready to start writing.
IF it's prose, and you just need some simple divisions (chapters and sections and not much else), org mode -> latex should do you just fine. Then you also have the possibility of org mode -> html
IF you need math in it, you can just write the math right in the org mode file.
If it's really really technical information, docbook might be nice (emacs + nxml), then docbook 4.5 -> jade -> jadetex -> pdf.
I'd stay away from docbook 5, because it uses FOP to generate PDFs, and the typesetting is really inferior to latex.
BOTTOM LINE: If you want a PDF, use org -> latex, the path of least resistance ;) -- whatever you do, concentrate on the content of the book first, and worry about what it looks like afterwards.
And why not paper publish? Have you looked at lulu.com? I recently formatted a book with latex, uploaded the pdf to lulu, and had them print it. The quality is pretty good, and definitely worth a look. I have a ton of bookmarks at home about publishing in general, if you're interested.
Typography is hard.
TeX/LaTeX are tools that can get you the best possible results, however they require knowledge about typography to be used correctly--especially with a big document like a book. And I haven't seen any other cheap (=not for professional use) software that would do things correctly automatically. (I haven't seen any professional software, so it is possible they don't do that either)
However, assuming that you'll write your book in some machine-readable format, putting it into TeX/LaTeX should not be very hard: once I had a set of documents in a custom XML format. Proper usage of XSLT, TeXML and LaTeX gave me something I could tweak manually (and this tweaking was necessary!) and get the best possible result.
My advice: prepare content in something that is easy to parse and easy to write in. I'd dismiss XML. Markdown seems to be a good choice. This will also allow you to quickly show your work. Then if you decide to make the result better, write some simple script to translate that to TeX (it is not that hard to get basic functionality) and fix things by hand. This might actually be a good exercise to learn TeX.
Don't try to get everything right from the beginning. Firstly get the content, then play with formatting.
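To give an idea of how small such a "simple script" can be, here is a minimal sketch in Python that handles only a tiny, assumed subset of Markdown: `#` and `##` headings, `*emphasis*`, and `**bold**`. The function names are made up for this example, and it escapes only a few of LaTeX's special characters (not backslashes, braces, `~` or `^`), so expect to extend it and to fix the output by hand.

```python
import re

# A few LaTeX special characters and their escaped forms (not exhaustive).
LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_"}


def escape_latex(text):
    """Escape the characters listed in LATEX_SPECIALS."""
    return re.sub(r"[&%$#_]", lambda m: LATEX_SPECIALS[m.group()], text)


def inline(text):
    """Convert inline markup: **bold** first, then *emphasis*."""
    text = escape_latex(text)
    text = re.sub(r"\*\*(.+?)\*\*", r"\\textbf{\1}", text)
    text = re.sub(r"\*(.+?)\*", r"\\emph{\1}", text)
    return text


def markdown_to_latex(source):
    """Translate headings and inline markup line by line."""
    out = []
    for line in source.splitlines():
        if line.startswith("## "):
            out.append(r"\section{%s}" % inline(line[3:].strip()))
        elif line.startswith("# "):
            out.append(r"\chapter{%s}" % inline(line[2:].strip()))
        else:
            # Blank lines pass through and act as paragraph breaks in LaTeX.
            out.append(inline(line))
    return "\n".join(out)


if __name__ == "__main__":
    print(markdown_to_latex("# Intro\n\nSome *important* text & 100% **bold**."))
```

The output is a document body only; it would still need a preamble with a `\documentclass` and `\begin{document}` around it.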
If you are really wanting to do online only, I would suggest you use org mode and just stay in HTML. Then you can use CSS to style it however you would like.
That being said, if you really want to output to PDF for technical stuff, I would strongly suggest using Docbook (www.docbook.org). It's made for that, it works great with Emacs.
You have already answered yourself. Not to mention that you have already started writing in org-mode. Org-mode is really extremely powerful and will enable you to publish to PDF and HTML eventually with no effort.
In the case of PDF you can take advantage of LaTeX and how org-mode works with exports. You can include any LaTeX code in your org file. Also, IMHO it's way better to write the book/article in org-mode, since some things become even easier than in plain .tex files; take tables, for example.
Regarding publishing, it's the same story: with one single function you can trigger exporting to HTML/PDF and uploading to your server. And notice that you are still using just a plain text file, which is human readable and very clean.
Org-mode really follows the Emacs philosophy: just start using it and it will grow with you.
If you are writing a book, it would certainly be worth the overhead of learning tex.
Even something like,
\documentclass[a4paper,10pt]{book}
\title{SERPA'S BOOK}
\author{SERPA}
\date{\today}
\begin{document}
\maketitle
\tableofcontents
\include{chapterA}
\include{chapterB}
\include{chapterC}
\end{document}
Then, in the same directory have files chapterA.tex, chapterB.tex, chapterC.tex that look like
\chapter{My chapter title}
Lorem ipsum dolor sit amet, consectetur adipiscing elit....
That alone will produce an extremely nice looking document. You can edit each chapter separately and then just compile the main tex file. I think if you try to learn intermediate tools that try to abstract away from tex, you'll only make it more difficult later to do what you actually want, because you will be both fighting tex and an abstraction of tex at the same time.
Best of luck on such an undertaking.
Also, no matter what you do, make sure to use some kind of version control system, such as SVN, to manage your files. It will be worth it.
I would write it in Latex and have an online repository that does nightly compiles to PDF of the 'publish-ready' branch, available to readers.
I would not start with using LaTeX these days. TeX input is unstructured and the only thing you can get out of TeX input is PDF. If you need HTML or anything else, you are screwed.
Use something structured, such as XML (DocBook is a good suggestion) or define your own XML subset as you need it. Use XSLT to transform it into something usable (HTML etc.) That way you are set for the future.
Depending on your typographical needs, you can then use TeX as a backend processor, or XSLT or whatever.
Also, have a look at ConTeXt, it can read XML directly and has great typography!

Looking for Ideas: How would you start to write a geo-coder?

Because the open source geo-coders cannot begin to compare to Google's or even Yahoo's, I would like to start a project to create a good open source geo-coder. Just to clarify, a geo-coder takes some text (usually with some constraints) and returns one or more lat/lon pairs.
I realize that this is a difficult and gargantuan task, so I am wondering how you might get started. What would you read? What algorithms would you familiarize yourself with? What code would you review?
And also, assuming you were going to develop this very agilely, what would you want the first prototype to be able to do?
EDIT: Let's set aside the data question for now. I am going to use OpenStreetMap data, along with a database of waypoints that I have. I would later plan to include other data sets as well, and I realize the geo-coder would be inherently limited by the quality of the original data.
The first (and probably blocking) problem would be: where do you get your data from? (unless you are willing to pay thousands of dollars for proprietary sets).
You could build a geocoding-api on top of OpenStreetMap (they publish their data in dumps on a regular basis) I guess, but that one was still very incomplete last time I checked.
Algorithms are easy. Good mapping data, however, is expensive. Very expensive.
Google drove their cars all over the world, collecting this data among other things.
From a .NET point of view these articles might be interesting for you:
Writing Your Own GPS Applications: Part I
Writing Your Own GPS Applications: Part 2
Writing GIS and Mapping Software for .NET
I've only glanced at the articles but they've been on CodeProject's 'Most Popular' list for a long time.
And maybe this CodePlex project which the author of the articles above made available.
I would start at the absolute beginning by figuring out how you're going to get the data that matches a street address with a geocode. Either Google had people going around with GPS units, OR they got the information from some existing source. That existing source may have been... (all guesses)
The Postal Service
Some existing maps(printed)
A bunch of enthusiastic users that were early adopters of GPS technology who were more than willing to enter in street addresses and GPS coordinates
Some government entity (or entities)
Their own satellites
etc
I guess what I'm getting at is the information was either imported from somewhere or was input by someone via some interface. As my starting point I would look at how to get that information. In an open source situation, you may be able to get a bunch of enthusiastic people to enter information.
So for my first prototype, boring as it would be, I would create a form for entering information.
Then you need to know the math for figuring out the closest distance (as the crow flies). From there, try to figure out how to include roads. (My guess is you would have to have a data point for each and every curve, where you hold the geocode location of the curve, and the angle of the road on a north/south and east/west vector. You'd probably need to take incline into account too, to get accurate road measurements.)
That's just where I'd start.
But in all honesty, I wouldn't even start on this. Other programmers have done it already, I'm more interested in what hasn't already been done.
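For the as-the-crow-flies math mentioned above, the usual starting point is the haversine formula for great-circle distance. A minimal sketch in Python, assuming a spherical Earth with a mean radius of 6371 km (an approximation) and coordinates in decimal degrees; the function names are made up for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the Earth is not a perfect sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearest(point, candidates):
    """Return the candidate (lat, lon) closest to `point` by haversine distance."""
    return min(candidates, key=lambda c: haversine_km(point[0], point[1], c[0], c[1]))
```

A linear scan like `nearest` is fine for a prototype; with a full data set you would want a spatial index instead.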
get my free raw data from somewhere like http://ipinfodb.com/ip_database.php
load it into a database, denormalizing for fast lookups
design my API
build it out as a RESTful web service
return results in varying formats: JSON, XML, CSV, raw text
The first prototype should accept a ZIP code and return lat/lon in raw text.
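The load-and-look-up part of the steps above can be sketched in a few lines of Python using the standard `sqlite3` module. The table name, column names and function names are made up for this example, and the rows in the usage note are obvious placeholders rather than real ZIP code coordinates. The web service layer is left out.

```python
import sqlite3


def build_database(rows):
    """Load (zip, lat, lon) rows into an in-memory table keyed by ZIP code."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE zip_location (zip TEXT PRIMARY KEY, lat REAL, lon REAL)"
    )
    db.executemany("INSERT INTO zip_location VALUES (?, ?, ?)", rows)
    return db


def geocode_zip(db, zip_code):
    """Return 'lat,lon' as raw text, or 'NOT FOUND' for an unknown ZIP code."""
    row = db.execute(
        "SELECT lat, lon FROM zip_location WHERE zip = ?", (zip_code,)
    ).fetchone()
    if row is None:
        return "NOT FOUND"
    return "%.4f,%.4f" % row
```

With placeholder data, `geocode_zip(build_database([("00001", 10.0, 20.0)]), "00001")` returns the text `10.0000,20.0000`. Storing the ZIP code as text rather than as a number keeps leading zeros intact.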

How do I write a Perl script to filter out digital pictures that have been doctored?

Last night before going to bed, I browsed through the Scalar Data section of Learning Perl again and came across the following sentence:
the ability to have any character in a string means you can create, scan, and manipulate raw binary data as strings.
An idea immediately hit me that I could actually let Perl scan the pictures that I have stored on my hard disk to check if they contain the string Adobe. It seems by doing so, I can tell which of them have been photoshopped. So I tried to implement the idea and came up with the following code:
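#!perl
use strict;
use warnings;
use autodie;

my $dir = 'f:/TestPix';
my @pix = glob "$dir/*";
foreach my $file (@pix) {
    next unless -f $file;              # skip directories
    open my $pic, '<:raw', $file;      # read the picture as raw bytes
    local $/;                          # slurp the whole file at once
    my $data = <$pic>;
    close $pic;
    print "$file\n" if $data =~ /Adobe/;   # report each matching file once
}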
Excitingly, the code seems to be really working and it does the job of filtering out the pictures that have been photoshopped. But the problem is that many pictures are edited by other utilities. I think I'm kind of stuck there. Do we have some simple but universal method to tell if a digital picture has been edited or not, something like
if ($data !~ /the original format/) {...}
Or do we simply have to add more conditions? like
if ($data =~ /Adobe|ACDSee|some other picture editors/) {...}
Any ideas on this? Or am I oversimplifying due to my miserably limited programming knowledge?
Thanks, as always, for any guidance.
Your best bet in Perl is probably ExifTool. This gives you access to whatever non-image information is embedded into the image. However, as other people said, it's possible to strip this information out, of course.
I'm not going to say there is absolutely no way to detect alterations in an image, but the problem is extremely difficult.
The only person I know of who claims to have an answer is Dr. Neal Krawetz, who claims that digitally altered parts of an image will have different compression error rates from the original portions. He claims that re-saving a JPEG at different quality levels will highlight these differences.
I have not found this to be the case, in my investigations, but perhaps you might have better results.
No. There is no functional distinction between a perfectly edited image, and one which was the way it is from the start - it's all just a bag of pixels in the end, after all, and any other metadata you can remove or forge all you want.
The name of the graphics program used to edit the image is not part of the image data itself but of something called meta data - which may be stored in the image file but, as others have noted, is neither required (so some programs may not store it, some may allow you an option of not storing it) nor reliable - if you forged an image, you might have forged the meta data as well.
So the answer to your question is "no, there's no way to universally tell if the pic was edited or not", although some image editing software may write its signature into the image file, and it may be left there through the carelessness of the person editing.
If you're inclined to learn more about image processing in Perl, you could take a look at some of the excellent modules CPAN has to offer:
Image::Magick - read, manipulate and write of a large number of image file formats
GD - create colour drawings using a large number of graphics primitives, and emit the drawings in various formats.
GD::Graph - create charts
GD::Graph3d - create 3D Graphs with GD and GD::Graph
However, there are other utilities available for identifying various image formats. It's more of a question for Super User, but for various unix distros you can use file to identify many different types of files, and for MacOSX, Graphic Converter has never let me down. (It was even able to open the bizarre multi-file X-ray of my cat's shattered pelvis that I got on a disc from the vet.)
How would you know what the original format was? I'm pretty sure there's no guaranteed way to tell if an image has been modified.
I can just open the file (with my favourite programming language and filesystem API) and just write whatever I want into that file willy-nilly. As long as I don't screw something up with the file format, you'd never know it happened.
Heck, I could print the image out and then scan it back in; how would you tell it from an original?
As others have stated, there is no way to know if the image was doctored. I'm guessing what you basically want to know is the difference between a realistic photograph and one that has been enhanced or modified.
There's always the option of running some extremely complex image recognition algorithm that would analyze every pixel in your image and do some very complicated stuff to determine if the image was doctored or not. This solution would probably involve AI which would examine millions of photos that are both doctored and those that are not and learn from them. However, this is more of a theoretical solution and isn't very practical... you would probably only see it in movies. It would be extremely complex to develop and probably take years. And even if you did get something like this to work, it probably still wouldn't be 100% correct all the time. I'm guessing AI technology still isn't at that level and could take a while until it is.
A not-commonly-known feature of exiftool allows you to recognize the originating software through an analysis of the JPEG quantization tables (not relying on image metadata). It recognizes tables written by many applications. Note that some cameras may use the same quantization tables as some applications, so this isn't a 100% solution, but it is worth looking into. Here is an example of exiftool run on two images, the first of which was edited by Photoshop.
> exiftool -jpegdigest a.jpg b.jpg
======== a.jpg
JPEG Digest : Adobe Photoshop, Quality 10
======== b.jpg
JPEG Digest : Canon EOS 30D/40D/50D/300D, Normal
2 image files read
This will work even if the metadata has been removed.
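To show what "quantization tables" means at the byte level, here is a minimal sketch in Python that pulls them out of a JPEG file. It is not how exiftool implements its digest; it only walks the marker segments before the scan data and decodes each DQT (0xFFDB) segment. The function name is made up for this example, and it assumes a well-formed file in which every segment before the scan carries a length field.

```python
import struct


def quantization_tables(jpeg: bytes) -> dict:
    """Return {table_id: [64 values in zigzag order]} from the DQT segments."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("missing JPEG SOI marker")
    tables = {}
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = jpeg[pos + 1]
        if marker == 0xFF:            # fill byte before the real marker
            pos += 1
            continue
        if marker in (0xDA, 0xD9):    # start of scan or end of image: stop
            break
        # The length counts its own two bytes but not the two marker bytes.
        (length,) = struct.unpack_from(">H", jpeg, pos + 2)
        if marker == 0xDB:            # DQT: one or more tables in one segment
            seg = jpeg[pos + 4:pos + 2 + length]
            i = 0
            while i < len(seg):
                precision, table_id = seg[i] >> 4, seg[i] & 0x0F
                i += 1
                if precision == 0:    # 8-bit entries
                    values = list(seg[i:i + 64])
                    i += 64
                else:                 # 16-bit big-endian entries
                    values = list(struct.unpack(">64H", seg[i:i + 128]))
                    i += 128
                tables[table_id] = values
        pos += 2 + length
    return tables
```

Comparing the extracted tables against tables known to be written by particular cameras or editors is then a lookup problem, with the caveat already mentioned that different sources can share the same tables.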
There is existing software out there which uses various techniques (compression artifacting, comparison to signature profiles in a database of cameras, etc.) to analyze the actual image data for evidence of alteration. If you have access to such software and the software available to you provides an API for external access to these analysis functions, then there's a decent chance that a Perl module exists which will interface with that API and, if no such module exists, it could probably be created rather quickly.
In theory, it would also be possible to implement the image analysis code directly in native Perl, but I'm not aware of anyone having done so and I expect that you'd be better off writing something that low-level and processor-intensive in a fully-compiled language (e.g., C/C++) rather than in Perl.
http://www.impulseadventure.com/photo/jpeg-snoop.html
is a tool that does the job fairly well.
If there has been any cloning, there is a variation in the pixel density or concentration, which sometimes shows up upon manual inspection.
A Photoshop-cloned area will have even pixel density (by which I mean the variation of pixels with respect to a scanned image).