How can I split a multipage PDF file into multiple PDF files in Ruby - prawn

Is there a Ruby gem or script that can split a multipage PDF file into individual PDF files, one per page? I tried the pdf-reader and prawn gems but was not able to solve the problem. Any help would be greatly appreciated. Thank you.

The command-line utility PDFtk can do this easily:
pdftk source.pdf cat 1-10 output first10.pdf
pdftk source.pdf cat 10-end output rest.pdf
or
pdftk source.pdf burst
# By default, the output files are named pg_0001.pdf, pg_0002.pdf, etc.
Build a utility method something like this:
def split_pdf(source_file, dest_dir)
  Dir.mkdir(dest_dir.to_s)
  # use system (not exec) so control returns to Ruby once pdftk finishes
  system("pdftk #{source_file} burst output #{dest_dir}/p%02d.pdf")
  Dir.entries(dest_dir.to_s)
     .select { |e| e.end_with?('.pdf') }
     .map { |f| "#{dest_dir}/#{f}" }
end
and call it something like this:
source_file = Rails.root.join('public', 'source.pdf')
dest_dir = Rails.root.join('public', 'docs', doc_id)
page_files = split_pdf(source_file, dest_dir)

Related

how to get text from .doc file using python 2.7

I am extracting text from a .docx file using the following code:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

data = getText(file_path)
Now I also want to extract text from .doc files in my Django REST API hosted on PythonAnywhere. Since the API is on PythonAnywhere, I am unable to install the textract library or antiword. How can I do this?
abiword is installed on PythonAnywhere:
abiword --to=txt myfile.doc
will produce a file called myfile.txt.
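Since the question is about a Django REST API in Python, a minimal sketch of shelling out to abiword from Python might look like the following. It assumes abiword writes the .txt file next to the input, as described above; the helper name and the example path are illustrative, not part of the answer.

import os
import subprocess

def doc_to_text(doc_path):
    # run the conversion described above; abiword produces <basename>.txt
    subprocess.check_call(['abiword', '--to=txt', doc_path])
    txt_path = os.path.splitext(doc_path)[0] + '.txt'
    with open(txt_path) as f:
        return f.read()

text = doc_to_text('/home/yourname/myfile.doc')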

Parse all text files in a directory using Python 2.7 and re module

I have created a script, using the re module in Python 2.7, that parses data out of a text file. I have about 300 text files in a directory (Directory ALL in the same folder as the python script) and wish to have my script parse the data out of each text file in that directory (Please note that the directory only has text files).
import re
f = open ("All/text1.txt", "rb")
readfile = f.read()
ID1 = re.findall(r"ID1:(.*?)ID2", readfile, re.DOTALL)
ID1 = ID1[0]
print ID1
What I want to do is have that script run on all the text files in the directory printing out the ID1 of each file. I am not really certain how to do this. Your help would be greatly appreciated.
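As an illustration only (this question has no accepted answer reproduced here), a minimal sketch of one way to run the same regex over every .txt file in the All directory, using glob to enumerate the files:

import glob
import re

for path in glob.glob("All/*.txt"):
    with open(path, "rb") as f:
        readfile = f.read()
    matches = re.findall(r"ID1:(.*?)ID2", readfile, re.DOTALL)
    if matches:
        # print the filename alongside its ID1 value
        print path, matches[0]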

Zip support in Apache Spark

I have read about Spark's support for gzip-kind input files here, and I wonder if the same support exists for other kinds of compressed files, such as .zip files. So far I have tried reading a file compressed inside a zip archive, but Spark seems unable to read its contents successfully.
I have taken a look at Hadoop's newAPIHadoopFile and newAPIHadoopRDD, but so far I have not been able to get anything working.
In addition, Spark supports creating a partition for every file under a specified folder, like in the example below:
SparkConf SpkCnf = new SparkConf().setAppName("SparkApp")
                                  .setMaster("local[4]");
JavaSparkContext Ctx = new JavaSparkContext(SpkCnf);
JavaRDD<String> FirstRDD = Ctx.textFile("C:\\input\\").cache();
Where C:\input\ points to a directory with multiple files.
In the case computing zipped files would be possible, would it also be possible to pack every file under a single compressed file and follow the same pattern of one partition per file?
Spark supports compressed files by default
According to the Spark Programming Guide:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
This could be expanded by providing information about what compression formats are supported by Hadoop, which basically can be checked by finding all classes extending CompressionCodec (docs)
name | ext | codec class
-------------------------------------------------------------
bzip2 | .bz2 | org.apache.hadoop.io.compress.BZip2Codec
default | .deflate | org.apache.hadoop.io.compress.DefaultCodec
deflate | .deflate | org.apache.hadoop.io.compress.DeflateCodec
gzip | .gz | org.apache.hadoop.io.compress.GzipCodec
lz4 | .lz4 | org.apache.hadoop.io.compress.Lz4Codec
snappy | .snappy | org.apache.hadoop.io.compress.SnappyCodec
Source : List the available hadoop codecs
So the above formats, and many more possibilities, can be achieved simply by calling:
sc.readFile(path)
Reading zip files in Spark
Unfortunately, zip is not on the supported list by default.
I have found a great article: Hadoop: Processing ZIP files in Map/Reduce and some answers (example) explaining how to use imported ZipFileInputFormat together with sc.newAPIHadoopFile API. But this did not work for me.
My solution
Without any external dependencies, you can load your file with sc.binaryFiles and later on decompress the PortableDataStream reading the content. This is the approach I have chosen.
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

    if (path.endsWith(".zip")) {
      sc.binaryFiles(path, minPartitions)
        .flatMap { case (name: String, content: PortableDataStream) =>
          val zis = new ZipInputStream(content.open)
          // this solution works only for a single file in the zip
          val entry = zis.getNextEntry
          val br = new BufferedReader(new InputStreamReader(zis))
          Stream.continually(br.readLine()).takeWhile(_ != null)
        }
    } else {
      sc.textFile(path, minPartitions)
    }
  }
}
To use this implicit class, import it and call the readFile method on SparkContext:
import com.github.atais.spark.Implicits.ZipSparkContext
sc.readFile(path)
And the implicit class will load your zip file properly and return RDD[String] like it used to.
Note: this only works for a single file in the zip archive!
For support of multiple files in your zip, check this answer: https://stackoverflow.com/a/45958458/1549135
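The linked answer is not reproduced here, but as a rough illustration of the multi-entry idea, here is a hedged PySpark (Python) sketch: binaryFiles hands each archive over as raw bytes, which the standard zipfile module can walk entry by entry. The wildcard path and the existing SparkContext sc are assumptions for the example.

import io
import zipfile

def zip_to_lines(kv):
    path, raw = kv                        # (archive path, archive contents as bytes)
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        for name in zf.namelist():        # every entry in the archive
            for line in zf.read(name).decode('utf-8').splitlines():
                yield line

lines = sc.binaryFiles("/path/to/*.zip").flatMap(zip_to_lines)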
Since Apache Spark uses Hadoop input formats, we can look at the Hadoop documentation on how to process zip files and see if there is something that works.
This site gives us an idea of how to use this (namely, we can use the ZipFileInputFormat). That being said, since zip files are not splittable (see this), your request to have a single compressed file isn't really well supported. Instead, if possible, it would be better to have a directory containing many separate zip files.
This question is similar to this other question, however it adds the additional question of whether it would be possible to have a single zip file (which, since zip isn't a splittable format, isn't a good idea).
You can use sc.binaryFiles to open the zip file in binary format, then unzip it into text format. Unfortunately, the zip file is not splittable, so you need to wait for the decompression and then perhaps trigger a shuffle (e.g. repartition) to balance the data in each partition.
Here is an example in Python. More info is in http://gregwiki.duckdns.org/index.php/2016/04/11/read-zip-file-in-spark/
import io
import zipfile

file_RDD = sc.binaryFiles( HDFS_path + data_path )

def Zip_open( binary_stream_string ) : # New version, treat a stream as zipped file
    try :
        pseudo_file = io.BytesIO( binary_stream_string )
        zf = zipfile.ZipFile( pseudo_file )
        return zf
    except :
        return None

def read_zip_lines(zipfile_object) :
    file_iter = zipfile_object.open('diff.txt')
    data = file_iter.readlines()
    return data

My_RDD = file_RDD.map(lambda kv: (kv[0], Zip_open(kv[1])))
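To go from the opened archives to lines, a possible continuation (hedged: it reuses the read_zip_lines helper above, whose 'diff.txt' entry name comes from that code, and the repartition count is only illustrative, following the earlier note about rebalancing after decompression):

Lines_RDD = ( My_RDD.filter(lambda kv: kv[1] is not None)       # drop archives that failed to open
                    .flatMap(lambda kv: read_zip_lines(kv[1]))  # read the 'diff.txt' entry from each
                    .repartition(16) )                          # illustrative partition count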
You can use sc.binaryFiles to read the zip as a binary file:
val rdd = sc.binaryFiles(path).map {
  case (name: String, content: PortableDataStream) => new ZipInputStream(content.open)
} //=> RDD[ZipInputStream]
And then you can map the ZipInputStream to a list of lines:
val zis = rdd.first
val entry = zis.getNextEntry
val br = new BufferedReader(new InputStreamReader(zis, "UTF-8"))
val res = Stream.continually(br.readLine()).takeWhile(_ != null).toList
But the problem remains that the zip file is not splittable.
Below is an example which searches a directory for .zip files and creates an RDD using a custom FileInputFormat called ZipFileInputFormat and the newAPIHadoopFile API on the SparkContext. It then writes those files to an output directory.
allzip.foreach { x =>
  val zipFileRDD = sc.newAPIHadoopFile(
    x.getPath.toString,
    classOf[ZipFileInputFormat],
    classOf[Text],
    classOf[BytesWritable], hadoopConf)

  zipFileRDD.foreach { y =>
    ProcessFile(y._1.toString, y._2)
  }
}
https://github.com/alvinhenrick/apache-spark-examples/blob/master/src/main/scala/com/zip/example/Unzip.scala
The ZipFileInputFormat used in the example can be found here: https://github.com/cotdp/com-cotdp-hadoop/tree/master/src/main/java/com/cotdp/hadoop

Beginner, python - how to read a list from a file

I have a Word document that is literally a list of lists, that is 8 pages long. Eg:
[['WTCS','Dec 21'],['THWD','Mar 22']...]
I am using Linux Mint, Python 3.2 and the IDLE interface, plus my own .py programs. I need to read and reference this list frequently and when I stored it inside .py programs it seemed to slow down the window considerably as I was editing code. How can I store this information in a separate file and read it into python? I have it in a .txt file now and tried the following code:
def readlist():
    f = open(r'/home/file.txt','r')
    info = list(f.read())
    return(info)
but I get each character as an element of a list. I also tried info = f.read() but I get a string. Thanks!
You can convert a Python list that has been read from a text file as a string into an actual list using the ast module:
>>> import ast
>>> s = "[['WTCS','Dec 21'],['THWD','Mar 22']]"
>>> ast.literal_eval(s)
[['WTCS', 'Dec 21'], ['THWD', 'Mar 22']]
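Putting that together with the file from the question, a minimal sketch (the path is the one from the question's readlist function):

import ast

def readlist():
    with open('/home/file.txt') as f:
        return ast.literal_eval(f.read())

info = readlist()   # e.g. [['WTCS', 'Dec 21'], ['THWD', 'Mar 22'], ...]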

How do you shift all pages of a PDF document right by one inch?

I want to shift all the pages of an existing PDF document right by one inch so they can be three-hole punched without hitting the content. The PDF documents will already be generated, so changing the way they are generated is not possible.
It appears iText can do this from a previous question.
What is an equivalent library (or way do this) for C++ or Python?
If it is platform dependent I need one that would work on Linux.
Update: Figured I would post a little script I wrote to do this in case anyone else finds this page and needs it.
Working code thanks to Scott Anderson's suggestion:
rightshift.py
#!/usr/bin/python2
import sys
import os
from pyPdf import PdfFileReader, PdfFileWriter

# default user space units are points (1 inch = 72 points);
# the value below was just tuned until the document I was looking at worked
uToShift = 50

if len(sys.argv) < 3:
    print "Usage rightshift [in_file] [out_file]"
    sys.exit()

if not os.path.exists(sys.argv[1]):
    print "%s does not exist." % sys.argv[1]
    sys.exit()

pdfInput = PdfFileReader(file(sys.argv[1], "rb"))
pdfOutput = PdfFileWriter()

pages = pdfInput.getNumPages()
for i in range(0, pages):
    p = pdfInput.getPage(i)
    # shifting every page box left makes the content sit further right on the page
    for box in (p.mediaBox, p.cropBox, p.bleedBox, p.trimBox, p.artBox):
        box.lowerLeft = (box.getLowerLeft_x() - uToShift, box.getLowerLeft_y())
        box.upperRight = (box.getUpperRight_x() - uToShift, box.getUpperRight_y())
    pdfOutput.addPage(p)

outputStream = file(sys.argv[2], "wb")
pdfOutput.write(outputStream)
outputStream.close()
You can try the pypdf library. In 2022 PyPDF2 was merged back into pypdf.
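For reference, a minimal sketch of the same shift using the current pypdf API (PdfReader, PdfWriter and Transformation are the names assumed from that library; one inch is 72 points in default user space). This variant shifts the drawn content rather than the page boxes:

from pypdf import PdfReader, PdfWriter, Transformation

reader = PdfReader("in.pdf")
writer = PdfWriter()

for page in reader.pages:
    # translate the page contents one inch (72 points) to the right
    page.add_transformation(Transformation().translate(tx=72, ty=0))
    writer.add_page(page)

with open("out.pdf", "wb") as f:
    writer.write(f)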
Two ways to perform this task in Linux:
1: use Ghostscript through gsview
look in your /root or /home for the hidden file .gsview.ini
go to section:
[pdfwrite Options]
Options=
Xoffset=0
Yoffset=0
change the value for the X axis, setting a convenient value (values are in PostScript points, 1 inch = 72 PostScript points)
so:
[pdfwrite Options]
Options=
Xoffset=72
Yoffset=0
close .gsview.ini
open your pdf file with gsview
file / convert / pdfwrite
select first odd pages and print to a new file (you can name this as odd.pdf)
now repeat same steps for even pages
open your pdf file with gsview
[pdfwrite Options]
Options=
Xoffset=-72
Yoffset=0
file / convert / pdfwrite
select first even pages and print to a new file (you can name this as even.pdf)
now you need to mix these two pdf with odd and even pages
you can use:
Pdf Transformer
http://sourceforge.net/projects/pdf-transformer/
java -jar ./pdf-transformer-0.4.0.jar <INPUT_FILE_NAME1> <INPUT_FILE_NAME2> <OUTPUT_FILE_NAME> merge -j
2: use podofobox + pdftk
first step: with pdftk separate whole pdf document in two pdf files with only odd and even pages
pdftk file.pdf cat 1-endodd output odd.pdf && pdftk file.pdf cat 1-endeven output even.pdf
now with podofobox, included into podofo utils
http://podofo.sourceforge.net/about.html
podofobox file.pdf odd.pdf crop -3600 0 width height for odd pages, and
podofobox file.pdf even.pdf crop 3600 0 width height for even pages
width and height are in PostScript points x 100 and can be found with pdfinfo
e.g. if your pdf file has pagesize 482x680, then you enter
./podofobox file.pdf odd.pdf crop -3600 0 48200 68000
./podofobox file.pdf even.pdf crop 3600 0 48200 68000
then you can merge the odd and even pages into a single file with the already cited
Pdf Transformer
http://sourceforge.net/projects/pdf-transformer/
With pdfjam, the command to translate all pages 1 inch to the right is
pdfjam --offset '1in 0in' doc.pdf
The transformed document is saved to doc-pdfjam.pdf. For further options, type pdfjam --help. Currently pdfjam requires a Unix-like command prompt (Linux, Mac, or Cygwin). In Ubuntu, it can be installed with
sudo apt install pdfjam
Not a full answer, but you can use LaTeX with pdfpages:
http://www.ctan.org/tex-archive/macros/latex/contrib/pdfpages/
Multiple command-line Linux tools also use this approach; for instance, pdfjam uses it:
http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic-research/firth/software/pdfjam
Maybe pdfjam can already provide what you need.
Here is a modified version for Python 3.x.
First install PyPDF2 via pip install PyPDF2
import sys
import os
from PyPDF2 import PdfFileReader, PdfFileWriter

uToShift = 40  # amount to shift contents by. +ve shifts right

if len(sys.argv) < 3:
    print("Usage rightshift [in_file] [out_file]")
    sys.exit()

if not os.path.exists(sys.argv[1]):
    print("%s does not exist." % sys.argv[1])
    sys.exit()

path = os.path.dirname(os.path.realpath(__file__))

with open(("%s\\%s" % (path, sys.argv[1])), "rb") as pdfin:
    with open(("%s\\%s" % (path, sys.argv[2])), "wb") as pdfout:
        pdfInput = PdfFileReader(pdfin)
        pdfOutput = PdfFileWriter()
        pages = pdfInput.getNumPages()
        for i in range(0, pages):
            p = pdfInput.getPage(i)
            for box in (p.mediaBox, p.cropBox, p.bleedBox, p.trimBox, p.artBox):
                box.lowerLeft = (box.getLowerLeft_x() - uToShift, box.getLowerLeft_y())
                box.upperRight = (box.getUpperRight_x() - uToShift, box.getUpperRight_y())
            pdfOutput.addPage(p)
        pdfOutput.write(pdfout)