
Extra layers of EPUB validation

Author: Keith Fahlgren, Director of Engineering at Safari Books Online

Now that Threepress has joined Safari Books Online, we’ve been given access to (and responsibility for shepherding) a huge number of EPUBs. This has made me painfully aware of the limits of the current suite of tools and techniques available to both publishers and businesses working with EPUBs.

EpubCheck is the best tool for validating EPUB (and now EPUB 3) documents, especially now that the IDPF and DAISY Consortium are actively sponsoring its development. But many businesses have a range of unique preferences and business rules for EPUB documents that go beyond a strict validity test. To help address that, I’ve started work on a project called nort.

Rather than waiting for a polished version (that might never come), I’m releasing the first version of nort while it still does remarkably little: it runs extra layers of validation on the OPF file inside an EPUB and reports the results.
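For readers who have not looked inside an EPUB before, here is a minimal sketch of how the OPF file can be located, using only the Python standard library. This is not nort's code; the function name `read_opf` is made up for illustration. It relies only on the EPUB container convention that META-INF/container.xml names the package document in a `full-path` attribute.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in an EPUB.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def read_opf(epub):
    """Return (path, raw bytes) of the OPF package document in an EPUB.

    `epub` may be a filename or a binary file-like object.
    `read_opf` is a hypothetical helper, not part of nort.
    """
    with zipfile.ZipFile(epub) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml lists no rootfile")
        opf_path = rootfile.get("full-path")
        return opf_path, zf.read(opf_path)
```

Calling `read_opf("yes_cover.epub")` would return the path of the OPF inside the archive along with its contents, which is the document the extra validation rules then run against.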

The extra requirements are specified using ISO Schematron files, which offer a way to codify machine-readable “rules.” The advantage of Schematron is that it combines business rules written in plain, human-readable text with a bit of XPath. Properly written, Schematron can give much more intelligible output than other validation techniques. Writing new rules requires someone comfortable with XML, but having that person sit down with someone who knows the business to translate sentences like “Every EPUB must include a cover image file” into XPath is usually straightforward (and rewarding). In fact, here’s that rule:

<rule context="opf:metadata">
  <assert test="opf:meta[@name='cover']">OPF metadata must include a reference to a cover image file.</assert>
</rule>

The above rule asserts the presence of a particular element in the metadata. In some situations, it makes more sense just to report on what is there:

<rule context="opf:metadata">
  <report test="count(opf:meta)"><value-of select="count(opf:meta)"/> &lt;opf:meta&gt; elements in &lt;opf:metadata&gt;</report>
  <report test="count(*)"><value-of select="count(*)"/> child elements of &lt;opf:metadata&gt;</report>
</rule>

Here is what nort does with the above rules when they are wrapped up in a complete Schematron file:

$ nort -f cover_required.sch yes_cover.epub
yes_cover.epub successfully validates against cover_required.sch
Report: 1 <opf:meta> elements in <opf:metadata>
Report: 4 child elements of <opf:metadata>

# versus

$ nort -f cover_required.sch no_cover.epub
no_cover.epub fails to validate against cover_required.sch
OPF metadata must include a reference to a cover image file.
Report: 3 child elements of <opf:metadata>
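
The difference between the two runs follows from how Schematron treats the two elements: an assert emits its message when the test is false, while a report emits its message when the test is true. Since count() returning zero converts to false in XPath, the opf:meta report stays silent for no_cover.epub. The sketch below mimics both patterns by hand with the Python standard library; it is not how nort works internally, and `check_metadata` and the `NO_COVER` sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

OPF_URI = "http://www.idpf.org/2007/opf"
NS = {"opf": OPF_URI}

# Hypothetical OPF fragment with three metadata children and no cover <meta>.
NO_COVER = """<package xmlns="http://www.idpf.org/2007/opf"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:title>Example</dc:title>
    <dc:identifier>urn:uuid:0</dc:identifier>
    <dc:language>en</dc:language>
  </metadata>
</package>"""


def check_metadata(opf_xml):
    """Hand-rolled equivalent of the two patterns; returns (failures, reports)."""
    root = ET.fromstring(opf_xml)
    failures, reports = [], []
    for metadata in root.iter("{%s}metadata" % OPF_URI):
        # <assert>: the message is emitted when the test is FALSE.
        if not metadata.findall("opf:meta[@name='cover']", NS):
            failures.append(
                "OPF metadata must include a reference to a cover image file."
            )
        # <report>: the message is emitted when the test is TRUE.
        # A count of zero is false, so nothing is reported in that case.
        meta_count = len(metadata.findall("opf:meta", NS))
        if meta_count:
            reports.append(f"{meta_count} <opf:meta> elements in <opf:metadata>")
        child_count = len(list(metadata))
        if child_count:
            reports.append(f"{child_count} child elements of <opf:metadata>")
    return failures, reports


failures, reports = check_metadata(NO_COVER)
print(failures)
print(reports)
```

For the sample above, the cover assertion fails and only the child-element report fires, which matches the shape of the second run.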

nort is not a replacement for EpubCheck. Your EPUB files must be valid according to EpubCheck before using nort. nort is good for specifying or testing the particular preferences you have in addition to basic validity.

Limitations

If you have problems or think the tool would be worth extending in a particular way, please submit an issue or a pull request.

A complete version of the Schematron file from above

<schema xmlns="http://purl.oclc.org/dsdl/schematron"
        queryBinding="xslt1">
  <ns prefix="opf" uri="http://www.idpf.org/2007/opf"/>

  <pattern id="cover_meta_check">
    <rule context="opf:metadata">
      <assert test="opf:meta[@name='cover']">OPF metadata must include a reference to a cover image file.</assert>
    </rule>
  </pattern>
  <pattern id="chatty_counting">
    <rule context="opf:metadata">
      <report test="count(opf:meta)"><value-of select="count(opf:meta)"/> &lt;opf:meta&gt; elements in &lt;opf:metadata&gt;</report>
      <report test="count(*)"><value-of select="count(*)"/> child elements of &lt;opf:metadata&gt;</report>
    </rule>
  </pattern>
</schema>

About the Author

Keith Fahlgren, Director of Engineering, Safari Books Online

Keith Fahlgren has deep experience in publishing technology, particularly in the area of digital content readability. Keith played a lead role in integrating Ibis Reader with existing platforms and helping publishers create digital content more effectively. His varied contributions to the digital publishing ecosystem include a number of open-source EPUB tools. Keith has spoken widely, was the co-founder of Ibis Reader, and was formerly at O’Reilly Media, where he helped design and implement many of their digital publishing workflows.

About Keith Fahlgren

Keith Fahlgren is the Director of Engineering at Safari Books Online. He works on tools that help readers learn and build skills with digital content. He has been involved in a range of projects in the digital publishing ecosystem, including EPUB 3, OPDS, DocBook, and Ibis Reader.
