Editing Word Document, add Headers/footers and save it - Python - python-2.7

I want to add headers and footers to every page of a Word document, and I also want to add some pages to the start of the document. How can I do this using Python? I have tried python-docx but it is not working as I expected. Is there any other way to achieve this?

I think the best approach is to look at how python-docx manages the files. Docx is a zipped format (Docx Tag Wiki):
The .docx format is a zipped file that contains the following folders:
+--docProps
| + app.xml
| \ core.xml
+ res.log
+--word //this folder contains most of the files that control the content of the document
| + document.xml //Is the actual content of the document
| + endnotes.xml
| + fontTable.xml
| + footer1.xml //Contains the elements in the footer of the document
| + footnotes.xml
| +--media //This folder contains all images embedded in the document
| | \ image1.jpeg
| + settings.xml
| + styles.xml
| + stylesWithEffects.xml
| +--theme
| | \ theme1.xml
| + webSettings.xml
| \--_rels
| \ document.xml.rels //This file tells Word where the images are located
+ [Content_Types].xml
\--_rels
\ .rels
A library like python-docx first extracts the docx; you can see how in its source: https://github.com/mikemaccana/python-docx/blob/master/docx.py#L65
def opendocx(file):
    '''Open a docx file, return a document XML tree'''
    mydoc = zipfile.ZipFile(file)
    xmlcontent = mydoc.read('word/document.xml')
    document = etree.fromstring(xmlcontent)
    return document
You can get the XML content of 'word/document.xml', which is the main content of the docx, and change it using python-docx (which I recommend, since it is able to add many different elements). If you want to copy the content of another Word document to the beginning of your document, you could try to copy the content of its document.xml into your document.xml, but this will probably give you some errors, especially if you use images or other non-text content.
To add a header or a footer, you will have to create a word/header1.xml or word/footer1.xml file. You could just copy the header1.xml content of a file you created by hand; this should work.
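As a rough sketch of that copy-the-part idea (replace_part is a hypothetical helper, and it assumes the target document's document.xml, document.xml.rels and [Content_Types].xml already reference footer1.xml, for example because both files started from the same template), you can rewrite the zip with one member swapped out:

import zipfile

def replace_part(src_docx, dst_docx, part_name, new_bytes):
    # Copy src_docx to dst_docx, replacing one zip member (e.g. 'word/footer1.xml').
    with zipfile.ZipFile(src_docx) as zin, \
         zipfile.ZipFile(dst_docx, 'w', zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = new_bytes if item.filename == part_name else zin.read(item.filename)
            zout.writestr(item, data)

# Take the footer from a document you styled by hand and patch it into another one.
with zipfile.ZipFile('template.docx') as z:
    footer_xml = z.read('word/footer1.xml')
replace_part('original.docx', 'patched.docx', 'word/footer1.xml', footer_xml)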
Hope this helps


Generate CSV import file for AutoML Vision from an existing bucket

I already have a GCloud bucket divided by label as follows:
gs://my_bucket/dataset/label1/
gs://my_bucket/dataset/label2/
...
Each label folder has photos inside. I would like to generate the required CSV – as explained here – but I don't know how to do it programmatically, considering that I have hundreds of photos in each folder. The CSV file should look like this:
gs://my_bucket/dataset/label1/photo1.jpg,label1
gs://my_bucket/dataset/label1/photo12.jpg,label1
gs://my_bucket/dataset/label2/photo7.jpg,label2
...
You need to list all files inside the dataset folder with their complete path and then parse each path to obtain the name of the folder containing the file, since in your case this is the label you want to use. This can be done in several different ways. I will include two examples on which you can base your code:
Gsutil has a command that lists bucket contents; you can then parse the output with a bash script:
# Create csv file and define bucket path
bucket_path="gs://buckbuckbuckbuck/dataset/"
filename="labels_csv_bash.csv"
touch $filename

IFS=$'\n' # Internal field separator variable has to be set to separate on new lines

# List every .jpg file inside the bucket's folder. ** searches for them recursively.
for i in `gsutil ls $bucket_path**.jpg`
do
    # Cut the path on the / delimiter and take the second item from the end (the label folder).
    label=$(echo $i | rev | cut -d'/' -f2 | rev)
    echo "$i, $label" >> $filename
done
IFS=' ' # Reset to original value

gsutil cp $filename $bucket_path
It can also be done using the Google Cloud client libraries provided for different languages. Here is an example using Python:
# Imports the Google Cloud client library
import os
from google.cloud import storage

# Instantiates a client
storage_client = storage.Client()

# The name of the bucket and the folder inside it
bucket_name = 'my_bucket'
path_in_bucket = 'dataset'
blobs = storage_client.list_blobs(bucket_name, prefix=path_in_bucket)

# Reading blobs, parsing the label from the path and creating the csv file
filename = 'labels_csv_python.csv'
with open(filename, 'w+') as f:
    for blob in blobs:
        if '.jpg' in blob.name:
            bucket_path = 'gs://' + os.path.join(bucket_name, blob.name)
            label = blob.name.split('/')[-2]
            f.write(', '.join([bucket_path, label]))
            f.write("\n")

# Uploading the csv file to the bucket
bucket = storage_client.get_bucket(bucket_name)
destination_blob_name = os.path.join(path_in_bucket, filename)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(filename)
For those, like me, who were looking for a way to create the .csv file for batch processing in Google AutoML but don't need the label column:
# Create csv file and define bucket path
bucket_path="gs://YOUR_BUCKET/FOLDER"
filename="THE_FILENAME_YOU_WANT.csv"
touch $filename

IFS=$'\n' # Internal field separator variable has to be set to separate on new lines

# List every [YOUREXTENSION] file inside the bucket's folder - change the pattern in the next line, i.e. **.png becomes **.your_extension. ** searches for them recursively.
for i in `gsutil ls $bucket_path**.png`
do
    echo "$i" >> $filename
done
IFS=' ' # Reset to original value

gsutil cp $filename $bucket_path

Find and rename all files in directory matching a certain pattern

I'm attempting to write a program that will loop through every subfolder and find and rename all the files whose filenames match a given pattern. The files are all .jpg files and have the following pattern:
[0-9][0-9][0-9]_UsersfirstnameUserslastname[0-9][0-9][0-9].jpg
So, for instance, one folder would have the following:
452_AlexBobenko002.jpg
452_AlexBobenko003.jpg
452_AlexBobenko007.jpg
Then it would go to another folder where the following files exist:
834_CatDonald001.jpg
...
834_CatDonlad126.jpg
I would like to rename the files so that there is an underscore after the last letter and before the last set of 3 digits. So the pattern would go from:
[0-9][0-9][0-9]_UsersfirstnameUserslastname[0-9][0-9][0-9].jpg
to
[0-9][0-9][0-9]_UsersfirstnameUserslastname_[0-9][0-9][0-9].jpg
and from the above example I would have:
452_AlexBobenko002.jpg --> 452_AlexBobenko_002.jpg
452_AlexBobenko003.jpg --> 452_AlexBobenko_003.jpg
452_AlexBobenko007.jpg --> 452_AlexBobenko_007.jpg
and
834_CatDonald001.jpg --> 834_CatDonald_001.jpg
...
834_CatDonlad126.jpg --> 834_CatDonald_126.jpg
So far I have been able to locate the desired files with the following:
path = mydir
folders = [filename for filename in os.listdir(path) if filename.startswith('EMP-')]
subfolders = [[] for i in range(len(folders))]

# Populate the empty sublists of subfolders with the contents of each distinct folder
for i in range(len(folders)):
    subfolders[i] = [subfolder for subfolder in os.listdir(path + '\\%s' % folders[i])]

for z_1 in range(len(folders)):
    for z_2 in range(len(subfolders[z_1])):
        if os.path.isdir(path + '\\%s\\%s' % (folders[z_1], subfolders[z_1][z_2])):
            for file in glob.glob(path + '\\%s\\%s\\[0-9][0-9][0-9]_*.jpg' % (folders[z_1], subfolders[z_1][z_2])):
                pass  # rename(file) goes here
I really have no clue how to rename them.
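For reference, the rename step itself can be done with os.rename; the sketch below is illustrative (the regex and the rename_matching helper are placeholder names). It captures the leading part and the trailing three digits separately and joins them with an underscore:

import os
import re

# Pattern for names like 452_AlexBobenko002.jpg: three digits, underscore, letters, three digits.
pattern = re.compile(r'^(\d{3}_[A-Za-z]+)(\d{3})\.jpg$')

def rename_matching(folder):
    # Rename e.g. 452_AlexBobenko002.jpg to 452_AlexBobenko_002.jpg inside one folder.
    for name in os.listdir(folder):
        match = pattern.match(name)
        if match:
            new_name = '%s_%s.jpg' % (match.group(1), match.group(2))
            os.rename(os.path.join(folder, name), os.path.join(folder, new_name))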

Zip support in Apache Spark

I have read about Spark's support for gzip-kind input files here, and I wonder if the same support exists for different kind of compressed files, such as .zip files. So far I have tried computing a file compressed under a zip file, but Spark seems unable to read its contents successfully.
I have taken a look to Hadoop's newAPIHadoopFile and newAPIHadoopRDD, but so far I have not been able to get anything working.
In addition, Spark supports creating a partition for every file under a specified folder, like in the example below:
SparkConf SpkCnf = new SparkConf().setAppName("SparkApp")
                                  .setMaster("local[4]");
JavaSparkContext Ctx = new JavaSparkContext(SpkCnf);
JavaRDD<String> FirstRDD = Ctx.textFile("C:\\input\\").cache();
Where C:\input\ points to a directory with multiple files.
In case reading zipped files were possible, would it also be possible to pack every file into a single compressed file and keep the same pattern of one partition per file?
Spark's default support for compressed files
According to the Spark Programming Guide:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
This can be expanded with information about which compression formats are supported by Hadoop, which can be checked by finding all classes extending CompressionCodec (docs):
name | ext | codec class
-------------------------------------------------------------
bzip2 | .bz2 | org.apache.hadoop.io.compress.BZip2Codec
default | .deflate | org.apache.hadoop.io.compress.DefaultCodec
deflate | .deflate | org.apache.hadoop.io.compress.DeflateCodec
gzip | .gz | org.apache.hadoop.io.compress.GzipCodec
lz4 | .lz4 | org.apache.hadoop.io.compress.Lz4Codec
snappy | .snappy | org.apache.hadoop.io.compress.SnappyCodec
Source: List the available Hadoop codecs
So the above formats, and many more, can be read simply by calling:
sc.textFile(path)
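For instance, in PySpark the decompression is transparent (the path below is a placeholder):

# gzip files are decompressed on the fly by textFile; note that a single
# .gz file is not splittable, so it ends up in one partition.
rdd = sc.textFile("/my/directory/*.gz")
print(rdd.count())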
Reading zip files in Spark
Unfortunately, zip is not on the supported list by default.
I have found a great article, Hadoop: Processing ZIP files in Map/Reduce, and some answers (example) explaining how to use an imported ZipFileInputFormat together with the sc.newAPIHadoopFile API. But this did not work for me.
My solution
Without any external dependencies, you can load your file with sc.binaryFiles and later decompress the PortableDataStream to read the content. This is the approach I have chosen.
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream

import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

    if (path.endsWith(".zip")) {
      sc.binaryFiles(path, minPartitions)
        .flatMap { case (name: String, content: PortableDataStream) =>
          val zis = new ZipInputStream(content.open)
          // this solution works only for a single file inside the zip
          val entry = zis.getNextEntry
          val br = new BufferedReader(new InputStreamReader(zis))
          Stream.continually(br.readLine()).takeWhile(_ != null)
        }
    } else {
      sc.textFile(path, minPartitions)
    }
  }
}
Using this implicit class, you need to import it and call the readFile method on SparkContext:
import com.github.atais.spark.Implicits.ZipSparkContext
sc.readFile(path)
And the implicit class will load your zip file properly and return an RDD[String], just as textFile does.
Note: This only works for a single file in the zip archive!
For support of multiple files in your zip, check this answer: https://stackoverflow.com/a/45958458/1549135
Since Apache Spark uses Hadoop input formats, we can look at the Hadoop documentation on how to process zip files and see if there is something that works.
This site gives us an idea of how to use this (namely, we can use the ZipFileInputFormat). That being said, since zip files are not splittable (see this), your request to have a single compressed file isn't really well supported. Instead, if possible, it would be better to have a directory containing many separate zip files.
This question is similar to this other question; however, it adds the additional question of whether it would be possible to have a single zip file (which, since zip isn't a splittable format, isn't a good idea).
You can use sc.binaryFiles to open the zip file in binary format and then unzip it into text format. Unfortunately, the zip file is not splittable, so you need to wait for the decompression and then perhaps repartition to balance the data across partitions.
Here is an example in Python; more info is at http://gregwiki.duckdns.org/index.php/2016/04/11/read-zip-file-in-spark/
import io
import zipfile

file_RDD = sc.binaryFiles(HDFS_path + data_path)

def Zip_open(binary_stream_string):
    # Treat a binary stream as a zipped file and return a ZipFile object.
    try:
        pseudo_file = io.BytesIO(binary_stream_string)
        zf = zipfile.ZipFile(pseudo_file)
        return zf
    except Exception:
        return None

def read_zip_lines(zipfile_object):
    # Read the lines of the 'diff.txt' entry inside the archive.
    file_iter = zipfile_object.open('diff.txt')
    data = file_iter.readlines()
    return data

My_RDD = file_RDD.map(lambda kv: (kv[0], Zip_open(kv[1])))
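As a follow-up sketch: read_zip_lines is never called above, and because zip is not splittable each archive lands in a single partition, so a repartition after decompressing can help. Assuming every archive really contains a diff.txt entry, the pieces can be combined like this:

# Flatten each opened archive into the lines of its diff.txt entry,
# skipping archives that failed to open, then rebalance the partitions.
lines_RDD = (My_RDD
             .filter(lambda kv: kv[1] is not None)
             .flatMap(lambda kv: read_zip_lines(kv[1]))
             .repartition(sc.defaultParallelism))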
You can use sc.binaryFiles to read the zip as a binary file:
val rdd = sc.binaryFiles(path).map {
  case (name: String, content: PortableDataStream) => new ZipInputStream(content.open)
} //=> RDD[ZipInputStream]
And then you can map the ZipInputStream to a list of lines:
val zis = rdd.first
val entry = zis.getNextEntry
val br = new BufferedReader(new InputStreamReader(zis, "UTF-8"))
val res = Stream.continually(br.readLine()).takeWhile(_ != null).toList
But the problem remains that the zip file is not splittable.
Below is an example which searches a directory for .zip files and creates an RDD using a custom FileInputFormat called ZipFileInputFormat and the newAPIHadoopFile API on the SparkContext. It then writes those files to an output directory.
allzip.foreach { x =>
  val zipFileRDD = sc.newAPIHadoopFile(
    x.getPath.toString,
    classOf[ZipFileInputFormat],
    classOf[Text],
    classOf[BytesWritable], hadoopConf)

  zipFileRDD.foreach { y =>
    ProcessFile(y._1.toString, y._2)
  }
}
https://github.com/alvinhenrick/apache-spark-examples/blob/master/src/main/scala/com/zip/example/Unzip.scala
The ZipFileInputFormat used in the example can be found here: https://github.com/cotdp/com-cotdp-hadoop/tree/master/src/main/java/com/cotdp/hadoop

PowerShell - Duplicate file using a list txt of names

I am trying to solve the following issue using PowerShell.
Basically, I have set up a file with the needed properties. Let's call it "FileA.xlsx".
I have a text file which contains a list of names, i.e:
FileB.xlsx
DumpA.xlsx
EditC.xlsx
What I am trying to do is duplicate "FileA.xlsx" several times and use all the names from the text file, so in the end I should end up with 4 files (all of them copies of "FileA.xlsx"):
FileA.xlsx
FileB.xlsx
DumpA.xlsx
EditC.xlsx
Assuming that you have a file called files.txt with the following content:
bob2.txt
bob3.txt
bob4.txt
Then the following will do what you want:
$sourceFile = "FileA.xlsx"
gc .\files.txt | % {
    Copy-Item $sourceFile $_
}
This will create copies of FileA.xlsx with the names bob2.txt, bob3.txt and bob4.txt.

django - docfile adding %0D to url?

I have a weird issue where the docfile.url of a file on our server has a %0D (carriage return) added to the end of the URL. This only happens to files that I manually linked. What I mean is, there were about 1,000 files in a directory, so I created a CSV file with the id and filename of each file and added them to the MySQL database with some code. All files uploaded normally through my Django app's interface link normally - clicking their link opens the file properly.
Here's a sample of the CSV file:
792,asbuilts/C0010.pdf
793,asbuilts/C0011.pdf
794,asbuilts/C0012.pdf
795,asbuilts/C0013.pdf
796,asbuilts/C0014.pdf
797,asbuilts/C0015.pdf
798,asbuilts/C0016.pdf
799,asbuilts/C0017.pdf
I have all these asbuilt files in the directory static_media/asbuilts/. In MySQL I ran this command:
load data local infile '/srv/www/cpm/CPM_CSV_Files/comm_asbuilts.csv' into table systems_asbuilt fields terminated by ',' lines terminated by '\n' (id, docFile);
A sample output of select * from systems_asbuilt is like this:
|846 | asbuilts/C0057.pdf
|847 | asbuilts/C0059.pdf
|848 | asbuilts/C0060.pdf
|849 | asbuilts/C0061.pdf
|850 | asbuilts/C0062.pdf
|851 | asbuilts/C0063.pdf
|852 | asbuilts/C0064.pdf
Everything looks good, right?
But when I look at the link created it looks like this:
www.ourdomain.com/static_media/asbuilts/R0546.pdf%0D
If I manually delete the %0D from the link, the file opens as expected. Any idea why there's the extra %0D on there? Where is it coming from?
Thanks
My guess is that this:
lines terminated by '\n'
should be:
lines terminated by '\r\n'
Your result "looks" right because of the client you are using to browse it, but when the record is retrieved, it still has the \r appended to it.
So you can either strip it before you load it into the database, or call .strip() before you generate the link.
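As an illustration of the second option, rows that are already imported can be cleaned up once from a Django shell; the Asbuilt model and docFile field names below are guesses based on the systems_asbuilt table, so adjust them to your actual code:

# One-off cleanup: strip the trailing '\r' that came in with the CSV import.
from systems.models import Asbuilt  # assumed app/model name

for ab in Asbuilt.objects.all():
    cleaned = ab.docFile.name.rstrip('\r\n')
    if cleaned != ab.docFile.name:
        ab.docFile.name = cleaned
        ab.save()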