Why can't itext read this tiff - clojure

I am using iText 7.0.0 to read in pages from a TIFF image file and add those pages to a PDF.
I have found an example TIFF that iText cannot read.
The answerer of the question "Exception when converting tiff file to pdf file with iText" mentioned sharing problem TIFFs, so I am doing that here.
This is a Clojure example that fails to read the first page with both the recoverFromImageError and direct flags set to true. It also fails to read any of the other pages.
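(import '(com.itextpdf.io.image ImageDataFactory)
        '(com.itextpdf.io.util UrlUtil)
        '(com.itextpdf.layout.element Image))
(let [tiff "test-multi-rgb-compression-type-7.tiff"
      url  (UrlUtil/toURL tiff)]
  (Image. (ImageDataFactory/createTiff url true 1 true)))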
This produces the following stacktrace:
com.itextpdf.io.IOException cannot.read.tiff.image
TiffImageHelper.java: 279 com.itextpdf.io.image.TiffImageHelper/processTiffImage
TiffImageHelper.java: 89 com.itextpdf.io.image.TiffImageHelper/processImage
ImageDataFactory.java: 400 com.itextpdf.io.image.ImageDataFactory/createTiff
The TIFF file is accessible here https://drive.google.com/file/d/0B5HypmT13gm-RGFURWZ4SlgxLUk/view?usp=sharing
Thanks for your time.

Related

Determining the format of audio file (MP3) using SAPI

Hi, I'm trying to create a "speech to text" app that can transcribe any audio/video file. I've created an app based on this post and it works great for WAV files. But if I use an MP3 file, the line hr = cpInputStream->BindToFile(wInputFileName.c_str(), SPFM_OPEN_READONLY, &sInputFormat.FormatId(), sInputFormat.WaveFormatExPtr(), SPFEI_ALL_EVENTS); returns
The Parameter is incorrect
The question is: can I use MP3 files as input for SAPI? If yes, how do I determine the correct format for the call to hr = sInputFormat.AssignFormat(SPSF_16kHz16BitStereo)? SPSF_16kHz16BitStereo will certainly not be correct for every file, and I don't think it should be hardcoded.

How to convert ogg file to mp3 using python?

I am trying to convert the Ogg file to mp3/wav formats. I used:
FFmpeg
pyaudio
dlls
But nothing worked out.
Also, I first read the Ogg data from an HTTP URL, then want to convert it to MP3/WAV, and then convert that to text using speech_recognition.
If I skip the conversion, I get the following error:
Error: Audio file could not be read as PCM WAV, AIFF/AIFF-C, or Native FLAC; check if the file is corrupted or in another format.
Please suggest any libraries.
Code snippet:
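import urllib.request

audio_data = Data.get("audio")
# Download only when the URL points at neither a .wav nor an .mp3 file
if '.wav' not in audio_data and '.mp3' not in audio_data:
    file = "newspeech.mp3"
    new_audio = urllib.request.urlretrieve(audio_data, file)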

When I open a PDF in Adobe Acrobat Pro DC the text is messed up (PDF created with iText 5.5.12)

For our application we use iText 5.5.12 for PDF creation, with Ruby 2.0.0 and Rails 4.2.
After downloading the PDF from the application to our local machine and opening it in Adobe Acrobat Pro DC, the text is messed up; see the PDF below.
Please suggest how to fix this problem.
Here is my code:
FileStream = Rjb::import('java.io.FileOutputStream')
LicenseKey = Rjb::import('com.itextpdf.license.LicenseKey')
PdfReader = Rjb::import('com.itextpdf.text.pdf.PdfReader')
Document = Rjb::import('com.itextpdf.text.Document')
PdfCopy = Rjb::import('com.itextpdf.text.pdf.PdfCopy')
def initialize(output_filename, pdf_files)
  itext_key = File.join(Rails.root, '/lib/jars/itextkey.xml')
  LicenseKey.loadLicenseFile(itext_key)
  @output_filename = output_filename
  @pdf_files = pdf_files
end
def bind
  doc = Document.new
  pdf_copy = PdfCopy.new(doc, FileStream.new(@output_filename))
  doc.open
  @pdf_files.each do |pdf|
    reader = PdfReader.new(pdf)
    pages = reader.getNumberOfPages()
    (1..pages).each do |p|
      pdf_copy.addPage(pdf_copy.getImportedPage(reader, p))
    end
    reader.close
  end
  doc.close
end
Your code
First of all, I ported your code to Java, ran it for the files you provided (cf. Merging.java), but the result file was undamaged. Thus, the code you provided most likely is not the issue.
Post processors
Then I analyzed your PDF file and saw that your code was not the final editor of it:
The Producer line says:
iText® 5.5.12 ©2000-2017 iText Group NV (yash; Trial version ka2lzaG9yZS5yYWNoYWtvbmRhQHlhc2guY29t); modified using iText® 7.0.1 ©2000-2016 iText Group NV (AGPL-version)
Thus, another routine using iText 7.0.1 has post-processed the merged PDF.
The PDF is linearized.
As far as I know neither iText 5.x nor iText 7.0.x can create linearized files.
Thus, yet another transformation by an unknown PDF processor took place.
So either the code using iText 7 or the unknown linearizing post processor might have introduced the issue.
The issue itself
I compared the originals you provided and your result. The difference:
The pages in "3 SLC Dec.pdf" reference a font resource for Arial-BoldMT which is not embedded. Thus, if the computer on which the PDF is displayed has that font, the PDF is properly displayed.
The matching pages in your result "39bdd9b1ba6501b44d401ee1b157ddb5631fcf36.pdf", though, reference a font resource for Arial-BoldMT which suddenly does have an embedded font file. But this font file is incomplete, i.e. merely a subset of Arial-BoldMT!
If there is an embedded font file, this embedded font file is used for displaying the text drawn in this font, not a font on the local computer anymore. As this embedded font file here is incomplete, numerous characters don't appear anymore.
Looking further it turns out that this subset font file is from your input "5 Sched of INS.pdf". Indeed, the only page in this PDF has a font resource for Arial-BoldMT with a font file which only contains the characters needed for that page.
So either the code using iText 7 or the unknown linearizing post processor seems to have assumed that the Arial-BoldMT font file from "5 Sched of INS.pdf" is complete, and has added it to Arial-BoldMT font resources on other pages.
Was this an error?
This appears to have been a valid optimization step by the post processor which did this and not an error, because strictly speaking the Arial-BoldMT font resource in "5 Sched of INS.pdf" is broken; the PDF specification requires:
For a font subset, the PostScript name of the font — the value of the font’s BaseFont entry and the font descriptor’s FontName entry — shall begin with a tag followed by a plus sign (+). The tag shall consist of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file shall have different tags.
As the Arial-BoldMT font resource in "5 Sched of INS.pdf" does not have this prefix, any font file embedded in it must be complete. But the embedded font file is not.
Thus, whichever post processor actually added the font file to the other font resources was allowed to do so; the "5 Sched of INS.pdf" file is the culprit and should be repaired before use.
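The naming rule quoted above can be checked mechanically. The following is only a minimal sketch: the function name is_subset_name is made up for illustration, and it looks at nothing but the BaseFont / FontName string, so it cannot tell whether an embedded font file is really complete.

```python
import re

# Per the quoted rule: exactly six uppercase letters, then a plus sign.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")

def is_subset_name(base_font: str) -> bool:
    """Return True if the font name carries a subset tag such as 'ABCDEF+'."""
    return SUBSET_TAG.match(base_font) is not None

# Untagged name: a reader may treat any embedded font file as complete.
print(is_subset_name("Arial-BoldMT"))
# Tagged name: the embedded font file is declared to be a subset.
print(is_subset_name("ABCDEF+Arial-BoldMT"))
```

A font resource whose name fails this check but whose font file contains only a handful of glyphs is the kind of inconsistency described for "5 Sched of INS.pdf".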

Looking for data in EXIF format

I have a problem with my program, which is meant to read the DateTimeOriginal data from a .JPG file. I found a document about it on the internet:
https://ExifTool.org/TagNames/EXIF.html
I see that the data I'm looking for is at 0x9003 address.
So right now what I'm trying to do is:
temp = fopen(name, "rb");
open the file binary
fseek (temp, 0x9003, SEEK_SET);
move the File pointer to the address
fscanf(temp, "%s", str);
and load the data to the char[] structure.
Is at least any of that correct? I still think I have a problem with the address, because after compiling the program I see only garbage from the file.
The EXIF data is embedded into the jpeg tag APP1 (0xE1).
The first thing to do is to find the JPEG tag 0xE1 in the stream; you have to scan all the JPEG tags (each marked by 0xFF followed by the tag, in your case 0xFF, 0xE1). After you find the tag, read its length from the next 2 bytes (stored big-endian), then read the tag's content.
After you get the tag's content, then look in it for the EXIF tag you are interested in (0x9003).
The method readStream in the jpeg class of the open source project Imebra gives you an example on how to parse jpeg tags: https://bitbucket.org/binarno/imebra/src/2eb33b2170e76b5ad2737d1c2d81c1dcaccd19e5/project_files/library/imebra/src/jpegCodec.cpp?at=default#cl-867
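The steps above can be sketched roughly as follows. This is a simplified illustration in Python, not the Imebra code: the function names are made up, it assumes each segment starts directly with 0xFF (no fill bytes), and it assumes DateTimeOriginal is stored as an ASCII string referenced by offset. The TIFF header layout and the Exif sub-directory pointer (tag 0x8769) come from the Exif/TIFF formats rather than from the answer itself.

```python
import struct

def find_app1_exif(data: bytes):
    """Scan the JPEG segments and return the TIFF block of the Exif APP1 segment."""
    if data[:2] != b"\xff\xd8":                  # every JPEG starts with SOI
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None                          # not at a marker: give up
        marker = data[pos + 1]
        if marker == 0xDA:                       # start of scan: headers are over
            return None
        # 2-byte big-endian length; it counts itself but not the marker
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        body = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and body[:6] == b"Exif\x00\x00":
            return body[6:]
        pos += 2 + length
    return None

def read_ifd(tiff: bytes, offset: int, endian: str) -> dict:
    """Map tag -> (type, count, raw 4-byte value field) for one directory."""
    (n,) = struct.unpack_from(endian + "H", tiff, offset)
    entries = {}
    for i in range(n):
        base = offset + 2 + 12 * i               # each entry is 12 bytes
        tag, typ, count = struct.unpack_from(endian + "HHI", tiff, base)
        entries[tag] = (typ, count, tiff[base + 8:base + 12])
    return entries

def date_time_original(data: bytes):
    """Return the DateTimeOriginal (0x9003) string, or None if absent."""
    tiff = find_app1_exif(data)
    if tiff is None:
        return None
    endian = "<" if tiff[:2] == b"II" else ">"   # byte order of the TIFF block
    (ifd0,) = struct.unpack_from(endian + "I", tiff, 4)
    pointer = read_ifd(tiff, ifd0, endian).get(0x8769)   # Exif sub-directory
    if pointer is None:
        return None
    (exif_off,) = struct.unpack(endian + "I", pointer[2])
    entry = read_ifd(tiff, exif_off, endian).get(0x9003)
    if entry is None:
        return None
    _, count, raw = entry
    (value_off,) = struct.unpack(endian + "I", raw)      # offsets are TIFF-relative
    return tiff[value_off:value_off + count].rstrip(b"\x00").decode("ascii")
```

Calling date_time_original(open(name, "rb").read()) would then return the timestamp string when the file has one. The point of the sketch is that 0x9003 is a tag number looked up inside a directory, not a byte position in the file.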
Given the style of programming of the OP, I'd recommend Easyexif at https://github.com/mayanklahiri/easyexif
It's relatively easy to integrate. Note that fseek() goes to a file position; it does not search for a certain number.

Edit the frame rate of an avi file

Is it possible to change the frame rate of an avi file using the Video for windows library? I tried the following steps but did not succeed.
AviFileInit
AviFileOpen(OF_READWRITE)
pavi1 = AviFileGetStream
avi_info = AviStreamInfo
avi_info.dwrate = 15
EditStreamSetInfo(dwrate) returns -2147467262.
I'm pretty sure the AVIFile* APIs don't support this. (Disclaimer: I was the one who defined those APIs, but it was over 15 years ago...)
You can't just call EditStreamSetInfo on a plain AVIStream, only on one returned from CreateEditableStream.
You could use AVISave, then, but that would obviously re-copy the whole file.
So, yes, you would probably want to do this by parsing the AVI file header enough to find the one DWORD you want to change. There are lots of documents on the RIFF and AVI file formats out there, such as http://www.opennet.ru/docs/formats/avi.txt.
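As a rough illustration of that header-parsing approach, the sketch below walks the RIFF chunks, finds the stream header ('strh') of the first video stream and overwrites its dwScale/dwRate pair (frame rate = dwRate / dwScale). The function names are made up, and the field offsets are taken from the standard AVISTREAMHEADER layout rather than from anything in this thread, so treat it as a starting point and try it on a copy of the file.

```python
import struct

def find_chunks(buf, start, end, found):
    """Collect (fourcc, data offset) pairs, descending into LIST chunks."""
    end = min(end, len(buf))                     # tolerate truncated input
    pos = start
    while pos + 8 <= end:
        fourcc = bytes(buf[pos:pos + 4])
        (size,) = struct.unpack_from("<I", buf, pos + 4)
        if fourcc == b"LIST":
            # a LIST holds a 4-byte list type followed by nested chunks
            find_chunks(buf, pos + 12, pos + 8 + size, found)
        else:
            found.append((fourcc, pos + 8))
        pos += 8 + size + (size & 1)             # chunks are padded to even size
    return found

def video_rate_fields(buf):
    """Return (offset of dwScale, dwScale, dwRate) for the first video stream."""
    if bytes(buf[:4]) != b"RIFF" or bytes(buf[8:12]) != b"AVI ":
        raise ValueError("not an AVI file")
    for fourcc, data_off in find_chunks(buf, 12, len(buf), []):
        if fourcc == b"strh" and bytes(buf[data_off:data_off + 4]) == b"vids":
            # dwScale and dwRate sit 20 and 24 bytes into the stream header
            scale, rate = struct.unpack_from("<II", buf, data_off + 20)
            return data_off + 20, scale, rate
    raise ValueError("no video stream header found")

def set_frame_rate(buf: bytearray, scale: int, rate: int) -> None:
    """Overwrite dwScale/dwRate in place, e.g. (1, 15) for 15 frames per second."""
    offset, _, _ = video_rate_fields(buf)
    struct.pack_into("<II", buf, offset, scale, rate)
```

The main header ('avih') also begins with a dwMicroSecPerFrame field, which a thorough patch would probably update to match. Because the headers sit at the start of the file, reading and rewriting only the first part of the file should be enough for large recordings.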
I don't know anything about VfW, but you could always try hex-editing the file. The framerate is probably a field somewhere in the header of the AVI file.
Otherwise, you can script some tool like mencoder[1] to copy the stream to a new file under a different framerate.
[1] http://www.mplayerhq.hu/
HRESULT: 0x80004002 (2147500034)
Name: E_NOINTERFACE
Description: The requested COM interface is not available
Severity code: Failed
Facility Code: FACILITY_NULL (0)
Error Code: 0x4002 (16386)
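The breakdown above follows from the bit layout of an HRESULT: severity in bit 31, facility in bits 16 to 26, error code in the low 16 bits. A small sketch (decode_hresult is a hypothetical helper name) that turns the negative return value from the question into those fields:

```python
def decode_hresult(value: int) -> dict:
    """Split a (possibly negative) 32-bit HRESULT into its bit fields."""
    u = value & 0xFFFFFFFF                 # reinterpret as unsigned 32-bit
    return {
        "hex": f"0x{u:08X}",
        "failed": bool(u >> 31),           # bit 31: severity
        "facility": (u >> 16) & 0x7FF,     # bits 16-26: facility
        "code": u & 0xFFFF,                # bits 0-15: error code
    }

print(decode_hresult(-2147467262))
```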
Does it work if you DON'T call EditStreamSetInfo?
Can you post up the code you use to set the stream info?