How to read the cover image from an EPUB file in Python

10 May 2020 · Coding · python epub cover

Introduction

EPUB is the most widely supported, vendor-independent, e-book file format. The term is short for "e(lectronic) pub(lication)" and is sometimes styled as "ePub". The EPUB format is implemented as a single zipped archive file that contains a set of interrelated resources (XML files along with images and other supporting files). Here is an example of its file structure:

--ZIP Container--
mimetype
META-INF/
  container.xml
OEBPS/
  content.opf
  chapter1.xhtml
  ch1-pic.png
  css/
    style.css
    myfont.otf
  toc.ncx
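
Because an EPUB is an ordinary zip archive, its layout can be inspected with Python's standard zipfile module. The following is a minimal sketch and not part of the original script: list_epub_members is a hypothetical helper, and the archive is a tiny stand-in built in memory with placeholder contents so that the example runs without a real book. With a real book, you would pass the path of the .epub file instead of the buffer.

```python
import io
import zipfile

def list_epub_members(epub_file):
    """Return the member names of an EPUB (zip) archive, in archive order."""
    with zipfile.ZipFile(epub_file) as z:
        return z.namelist()

# Build a placeholder archive in memory so the sketch runs without a real book.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("mimetype", "application/epub+zip")
    z.writestr("META-INF/container.xml", "<container/>")
    z.writestr("OEBPS/content.opf", "<package/>")

for name in list_epub_members(buffer):
    print(name)
```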

The "META-INF" directory must be present and must contain a file named "container.xml", which, among other things, points to an OPF file. The OPF file defines the contents of the book. Apart from "mimetype" and "META-INF/container.xml", the other files (such as the OPF, NCX, XHTML, CSS and image files) are traditionally placed in a directory named "OEBPS". Let's have a look at an example of "container.xml":

<?xml version="1.0" encoding="UTF-8" ?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
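
As a quick illustration of how this file points to the OPF, the sketch below extracts the "full-path" attribute using only the standard library. It is an assumption-laden example rather than the code used later in this post (which relies on lxml): find_opf_path is a hypothetical helper name, and xml.etree.ElementTree is swapped in so that the snippet runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# The container file declares this namespace as its default namespace.
CONTAINER_NS = {"u": "urn:oasis:names:tc:opendocument:xmlns:container"}

CONTAINER_XML = b"""<?xml version="1.0" encoding="UTF-8" ?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

def find_opf_path(container_xml):
    """Return the full-path of the first <rootfile>, or None if there is none."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("u:rootfiles/u:rootfile", CONTAINER_NS)
    return None if rootfile is None else rootfile.get("full-path")

print(find_opf_path(CONTAINER_XML))
```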

And here is an example of "OEBPS/content.opf":

<?xml version="1.0"?>
<package version="2.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="BookId">

  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:title>Pride and Prejudice</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="BookId" opf:scheme="ISBN">123456789X</dc:identifier>
    <dc:creator opf:file-as="Austen, Jane" opf:role="aut">Jane Austen</dc:creator>
    <meta content="my-cover-image" name="cover"/>
  </metadata>

  <manifest>
    <item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="appendix" href="appendix.xhtml" media-type="application/xhtml+xml"/>
    <item id="stylesheet" href="style.css" media-type="text/css"/>
    <item id="my-cover-image" href="images/978.jpg" media-type="image/jpeg"/>
    <item id="ch1-pic" href="ch1-pic.png" media-type="image/png"/>
    <item id="myfont" href="css/myfont.otf" media-type="application/x-font-opentype"/>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
  </manifest>

  <spine toc="ncx">
    <itemref idref="chapter1" />
    <itemref idref="appendix" />
  </spine>

  <guide>
    <reference type="loi" title="List Of Illustrations" href="appendix.xhtml#figures" />
  </guide>

</package>
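
Note that the href values in the manifest are relative to the location of the OPF file, not to the root of the archive. The sketch below shows one way to turn such an href into a zip member name. resolve_href is a hypothetical helper, and decoding percent-escapes is an assumption that only matters for books whose hrefs contain escaped characters such as "%20".

```python
import posixpath
from urllib.parse import unquote

def resolve_href(opf_path, href):
    """Resolve a manifest href against the directory of the OPF file.

    Zip member names always use "/" as separator, hence posixpath.
    """
    opf_dir = posixpath.dirname(opf_path)
    # hrefs are URL-style references, so percent-escapes are decoded first.
    return posixpath.normpath(posixpath.join(opf_dir, unquote(href)))

print(resolve_href("OEBPS/content.opf", "images/978.jpg"))
print(resolve_href("OEBPS/content.opf", "../cover%20art.png"))
```

For the example OPF above, the first call yields "OEBPS/images/978.jpg", which is the name under which the cover is stored in the archive.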

Code

Import

Let's see how we can read the cover image. First, we import the necessary Python modules:

import os
import sys
import zipfile
from lxml import etree
from PIL import Image

XML Namespaces

Then, we need to define the required XML namespaces (these namespaces are a way to avoid name conflicts in elements):

namespaces = {
    "calibre": "http://calibre.kovidgoyal.net/2009/metadata",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "opf": "http://www.idpf.org/2007/opf",
    "u": "urn:oasis:names:tc:opendocument:xmlns:container",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}
}
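
To see why the prefix mapping matters, here is a small sketch that queries a namespaced document with and without it. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, and the OPF fragment is a shortened stand-in for the example shown earlier.

```python
import xml.etree.ElementTree as ET

namespaces = {"opf": "http://www.idpf.org/2007/opf"}

OPF_XML = """<package version="2.0" xmlns="http://www.idpf.org/2007/opf">
  <metadata>
    <meta content="my-cover-image" name="cover"/>
  </metadata>
</package>"""

root = ET.fromstring(OPF_XML)

# Without a prefix the query looks for an un-namespaced <metadata> and finds nothing.
print(root.find("metadata"))

# With the "opf" prefix mapped to the package namespace, the element is found.
meta = root.find("opf:metadata/opf:meta[@name='cover']", namespaces)
print(meta.get("content"))
```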

get_epub_cover()

Let's examine the function get_epub_cover(epub_path). The comments inside the code explain each step:

def get_epub_cover(epub_path):
    ''' Return the cover image file from an epub archive. '''
    
    # We open the epub archive using zipfile.ZipFile():
    with zipfile.ZipFile(epub_path) as z:
    
        # We load "META-INF/container.xml" using lxml.etree.fromstring():
        t = etree.fromstring(z.read("META-INF/container.xml"))
        # We use xpath() to find the attribute "full-path":
        '''
        <container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
          <rootfiles>
            <rootfile full-path="OEBPS/content.opf" ... />
          </rootfiles>
        </container>
        '''
        rootfile_path = t.xpath("/u:container/u:rootfiles/u:rootfile",
                                namespaces=namespaces)[0].get("full-path")
        print("Path of root file found: " + rootfile_path)
        
        # We load the "root" file, indicated by the "full-path" attribute in "META-INF/container.xml", using lxml.etree.fromstring():
        t = etree.fromstring(z.read(rootfile_path))
        # We use xpath() to find the attribute "content":
        '''
        <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
          ...
          <meta content="my-cover-image" name="cover"/>
          ...
        </metadata>
        '''
        cover_id = t.xpath("//opf:metadata/opf:meta[@name='cover']",
                           namespaces=namespaces)[0].get("content")
        print("ID of cover image found: " + cover_id)
        
        # We use xpath() to find the attribute "href":
        '''
        <manifest>
            ...
            <item id="my-cover-image" href="images/978.jpg" ... />
            ... 
        </manifest>
        '''
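        cover_href = t.xpath("//opf:manifest/opf:item[@id='" + cover_id + "']",
                             namespaces=namespaces)[0].get("href")
        # The href is relative to the directory of the root file. Zip member names
        # always use "/" as separator, so we join with "/" instead of os.path.join()
        # (which would insert "\" on Windows):
        rootfile_dir = rootfile_path.rpartition("/")[0]
        cover_path = rootfile_dir + "/" + cover_href if rootfile_dir else cover_href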
        print("Path of cover image found: " + cover_path)
        
        # We return the image
        return z.open(cover_path)
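
The function above assumes that the OPF contains a <meta name="cover"> element. If it does not, indexing the empty XPath result with [0] raises an IndexError. EPUB 3 package documents can instead mark the cover with properties="cover-image" on the manifest item. The following is a hedged sketch of a lookup that tries both conventions: find_cover_href is a hypothetical helper name, the sample OPF is made up for illustration, and the standard library's xml.etree.ElementTree is swapped in for lxml so that the snippet runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

def find_cover_href(opf_xml):
    """Return the manifest href of the cover image, or None if no cover is declared."""
    root = ET.fromstring(opf_xml)
    items = root.findall("opf:manifest/opf:item", OPF_NS)

    # Convention 1: <meta name="cover" content="ITEM-ID"/> in the metadata.
    meta = root.find("opf:metadata/opf:meta[@name='cover']", OPF_NS)
    if meta is not None:
        cover_id = meta.get("content")
        for item in items:
            if item.get("id") == cover_id:
                return item.get("href")

    # Convention 2 (EPUB 3): a manifest item whose "properties" include "cover-image".
    for item in items:
        if "cover-image" in (item.get("properties") or "").split():
            return item.get("href")

    return None

# Made-up EPUB 3 style package document, used only to exercise the fallback.
EPUB3_OPF = """<package version="3.0" xmlns="http://www.idpf.org/2007/opf">
  <metadata/>
  <manifest>
    <item id="c1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="img" href="images/cover.jpg" media-type="image/jpeg" properties="cover-image"/>
  </manifest>
</package>"""

print(find_cover_href(EPUB3_OPF))
```

Comparing the ids in Python, rather than concatenating the id into the XPath expression, also avoids a malformed query when an id happens to contain a quote character.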

To display the image, we can pass the file object returned by get_epub_cover() to PIL.Image.open(), where epubfile is the path to an EPUB file:

image = Image.open(get_epub_cover(epubfile))
image.show()
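
If you would rather keep the cover than display it, the returned file object can be copied to disk as-is. The following is only a sketch: save_cover is a hypothetical helper, and an in-memory stand-in replaces the object returned by get_epub_cover() so that the snippet runs on its own.

```python
import io
import os
import shutil
import tempfile

def save_cover(cover_file, dest_path):
    """Copy a binary file-like object to dest_path and return the number of bytes written."""
    with open(dest_path, "wb") as out:
        shutil.copyfileobj(cover_file, out)
    return os.path.getsize(dest_path)

# Stand-in for the object returned by get_epub_cover(); any binary file-like object works.
fake_cover = io.BytesIO(b"\xff\xd8\xff placeholder bytes, not a real JPEG")

with tempfile.TemporaryDirectory() as tmp:
    dest = os.path.join(tmp, "cover.jpg")
    print(save_cover(fake_cover, dest))
```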

Full code

You can download the full code from here: epub-show-cover.py
