Safari now has upwards of 25,000 books and 2,400 videos available to our members, a fact that we are very pleased about.
That’s a lot of stuff, but we realize that it means little if you can’t find the content that is valuable to you. One of our goals is to provide more and better ways for our audience to navigate that content, which includes creating more routes into it and also better linking between content items.
Creating new and better ways for you to discover content is part of an ongoing strategy based on semantic analysis of that content, and we have recently launched the first public manifestation of this initiative: the Safari aggregation pages within our new website. This blog post describes some of the technical work underpinning those pages, and how it fits into the wider strategy of improving our content description and interlinking.
Taking an iterative, experimental approach
Early in this project we decided to go with a highly iterative approach. We sought to build small, flexible components, only loosely coupled (at this stage) to the wider systems that power Safari. This approach allowed us to experiment and gave us the freedom to build systems which didn’t have to satisfy every future need.
I think of this approach as a sort of “bootstrapping”—building small systems, just complex enough to help us build the next iteration, and so on. Ideally, each iteration should benefit our customers, and so, as a first step, we decided to build aggregation pages, as they would provide immediate and obvious value to readers.
The idea for these aggregation pages was two-fold: firstly, that they should showcase more of our content; and secondly, that they should break down our existing topics into more fine-grained subtopics. We reasoned that introducing more granularity would help our customers more easily find content on the subjects that they are interested in.
Creating a controlled vocabulary
So our first goal was to specify the substructure underneath the Safari topics, which meant creating a set of terms (a “controlled vocabulary”) that described the topics in a meaningful and intuitive way and, crucially, accurately represented the content.
Ultimately, we sought to create an interlinked vocabulary, so that, where appropriate, the same terms could be used within different topics, as shown in the following illustration.
An interlinked vocabulary would allow us to create a real network of content—a great foundation for building new features that would enable people to move seamlessly between content items that are relevant to them.
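As a rough illustration of what "interlinked" means here, the sketch below inverts a topic-to-subtopic mapping to find the terms shared by more than one topic. The data and the function names are hypothetical: the Python subtopics are taken from later in this post, but the Java entries are invented for the example, and this is not a description of how our vocabulary is actually stored.

```python
from collections import defaultdict

# Hypothetical vocabulary: topic -> subtopic terms (Java entries are invented).
topic_subtopics = {
    "Python": ["Testing", "Threads", "Regular Expressions"],
    "Java": ["Testing", "Threads", "Generics"],
}

def topics_by_term(vocabulary):
    """Invert the mapping so each term points to the topics that use it."""
    index = defaultdict(set)
    for topic, terms in vocabulary.items():
        for term in terms:
            index[term].add(topic)
    return index

def shared_terms(vocabulary):
    """Return only the terms that link two or more topics."""
    return {
        term: topics
        for term, topics in topics_by_term(vocabulary).items()
        if len(topics) > 1
    }

links = shared_terms(topic_subtopics)
```

With this toy data, `links` holds Testing and Threads, each pointing to both Python and Java, which is the kind of cross-topic connection an interlinked vocabulary makes possible.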
Crawling the indexes
While it would have been possible to use a preexisting vocabulary of terms to define a topic such as, say, Python, we felt that adapting it for our content and audience could require significant editorial input.[1] So in keeping with our “bootstrapping” approach, we looked for programmatic ways to extract a vocabulary of terms from the content itself.
Most of the books in Safari already have significant editorial input in marking up their content in the form of back-of-book indexes. Creating an effective index by hand takes a lot of editorial effort, and the terms are carefully curated in order to make the book easy to navigate. A good index is tailored to the topics but also to the reader (unlike some pure classification schemes).
Since these indexes already exist, we thought it would make sense to take advantage of the editorial expertise that went into creating them. We looked at indexes to identify the most frequent terms used in all the books on a particular subject (initially we looked at books on Python and Java). The results gave us a pretty useful list of terms for describing the content. The top index terms for Python, for example, are:
classes, modules, functions, files, objects, methods, strings, Python, os module, performance, dictionaries, attributes, variables, exceptions, sequences, threads, OOP (object-oriented programming), lists, sys module, debugging
However, we noticed that they tended to be a bit too fine-grained. For example, whereas we might want to tag all content related to Network Programming in Python, the indexes in the books referred to socket module and urllib and so on. Whereas we might want to group content about Object Orientation, the indexes would separately use Objects, Classes, and Methods.
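The frequency analysis of index terms described above could be sketched roughly as follows. This is a minimal illustration, assuming each book's index has already been extracted into a list of terms; the `book_indexes` data and the `top_index_terms` function are hypothetical, and counting each term once per book is a choice made for this example.

```python
from collections import Counter

# Hypothetical input: each book's back-of-book index as a list of terms.
book_indexes = {
    "Book A": ["classes", "modules", "functions", "socket module"],
    "Book B": ["Classes", "functions", "strings", "urllib"],
    "Book C": ["classes", "modules", "threads"],
}

def top_index_terms(indexes, n):
    """Count how many books list each term and return the n most common."""
    counts = Counter()
    for terms in indexes.values():
        # Normalise case and count each term at most once per book.
        counts.update({term.lower() for term in terms})
    return counts.most_common(n)

top_terms = top_index_terms(book_indexes, 3)
```

Counting per book rather than per index entry stops one very long index from dominating the list; counting raw occurrences would be an equally reasonable variant.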
Clustering index terms
To get around this over-specificity, we tried clustering the index terms based on how often they occurred in chapters together. Clustering, generally speaking, is a mathematical technique whereby things are grouped algorithmically according to their properties, which are expressed in a vector space. In this case the properties are the index terms themselves, so the vector for a particular index term expresses how often it occurs with the other index terms.
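To make the vector idea concrete, here is a small sketch of building co-occurrence vectors for index terms and comparing them. It assumes the index terms pointing into each chapter are already known; the `chapter_terms` data and both function names are hypothetical, and cosine similarity is just one common way to compare such vectors.

```python
from collections import defaultdict
from itertools import permutations
from math import sqrt

# Hypothetical input: the index terms that point into each chapter.
chapter_terms = [
    {"classes", "objects", "methods"},
    {"classes", "methods", "inheritance"},
    {"socket module", "urllib"},
    {"socket module", "urllib", "threads"},
]

def cooccurrence_vectors(chapters):
    """Map each term to a sparse vector of co-occurrence counts."""
    vectors = defaultdict(lambda: defaultdict(int))
    for terms in chapters:
        # Every ordered pair of distinct terms in a chapter co-occurs once.
        for a, b in permutations(terms, 2):
            vectors[a][b] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = cooccurrence_vectors(chapter_terms)
same_area = cosine(vectors["classes"], vectors["methods"])
different_area = cosine(vectors["classes"], vectors["urllib"])
```

In this toy data, "classes" and "methods" share neighbours such as "objects" and "inheritance", so `same_area` is greater than `different_area`, which is zero because "classes" and "urllib" never share a neighbouring term.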
Clustering can be something of a black art. There are many different clustering algorithms, with different parameters to vary, and there are also many quality metrics to describe how “good” the output clusters are. By trying different algorithms, parameters, and quality metrics we came up with a system that made some very sensible suggestions for groups of index terms—groups which tended to match the granularity at which we wanted to create the topic substructure.
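The sketch below shows how a single parameter can change the granularity of the resulting groups. It is not the algorithm we settled on: it swaps in a simple single-link agglomerative clustering with Jaccard similarity over the sets of chapters each term appears in, and the `term_chapters` data, the function names, and the threshold values are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical input: for each index term, the ids of chapters it appears in.
term_chapters = {
    "classes": {1, 2, 3},
    "objects": {1, 2},
    "methods": {2, 3},
    "socket module": {7, 8},
    "urllib": {8, 9},
}

def jaccard(a, b):
    """Share of chapters two terms have in common."""
    return len(a & b) / len(a | b)

def cluster(items, threshold):
    """Merge clusters while any cross-cluster pair meets the threshold."""
    clusters = [{term} for term in items]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(clusters, 2):
            # Single link: compare the most similar pair across two clusters.
            best = max(jaccard(items[x], items[y]) for x in a for y in b)
            if best >= threshold:
                clusters.remove(a)
                clusters.remove(b)
                clusters.append(a | b)
                merged = True
                break  # Restart the scan with the updated cluster list.
    return clusters

loose = cluster(term_chapters, threshold=0.3)
strict = cluster(term_chapters, threshold=0.5)
```

With the lower threshold the toy data falls into two groups (an object-orientation group and a networking group), while the higher threshold leaves "socket module" and "urllib" as separate clusters. Tuning parameters like this one is the kind of experimentation described above.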
Using this method, we came up with a bunch of subtopics. For example:
Python: Basic Types, Debugging, Decorators, Exceptions, Files, Inheritance, List Comprehensions, Modules, Network Programming, Object-Oriented Programming, Polymorphism, Regular Expressions, Serialization, Strings, Testing, Threads, Timers
Databases: Data Types, Foreign Keys, Indexes, Normalization, Operators, Primary Keys, SQL, Tables, Transactions, Views
Creating these clusters wasn’t a fully automated process. The terms that the analysis suggested still needed some curation, which we managed with a basic admin interface that we built. But the combination of the clustering algorithm and the admin interface allowed us to create a decent substructure for most of our topics (particularly the computer science-oriented topics, which tend to have a fairly standardised vocabulary). In addition, the index terms associated with a particular subtopic could then be used in search queries to discover content with which to populate these subtopics.
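As a simplified picture of populating a subtopic from its index terms, the sketch below ranks chapters by how many of the subtopic's terms they share. Plain set overlap stands in for a real search query here, and the data, the chapter ids, and the `find_content` function are hypothetical.

```python
# Hypothetical subtopic definition: the index terms grouped under it.
subtopic_terms = {"socket module", "urllib", "sockets"}

# Hypothetical content: chapter id -> index terms that point into the chapter.
chapter_index_terms = {
    "book1/ch09": {"socket module", "urllib", "threads"},
    "book1/ch02": {"strings", "lists"},
    "book2/ch14": {"sockets", "exceptions"},
}

def find_content(terms, chapters, min_matches=1):
    """Return chapter ids sharing at least min_matches terms, best first."""
    hits = []
    for chapter_id, index_terms in chapters.items():
        matches = len(terms & index_terms)
        if matches >= min_matches:
            hits.append((matches, chapter_id))
    # Most matching terms first; ties are broken by chapter id.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [chapter_id for _, chapter_id in hits]

network_chapters = find_content(subtopic_terms, chapter_index_terms)
```

Raising `min_matches` trades coverage for precision, which is one way a curator could tighten up a subtopic that is attracting loosely related content.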
And, inevitably, this process was not foolproof. Firstly, some of the indexes were not very well formatted, and so we couldn’t extract the terms in a useful way. Additionally, some topics aren’t amenable to this analysis at all, particularly those that are very broad, such as business. Finally, although the vocabulary for the very technical topics was much more regular than it was in others, there were still some ambiguities—for example, different books have different conventions on abbreviations. We may well address some of these issues in the future, but, with the volume of books we have, we still managed to get decent results—good enough for us to get started.
To resolve this issue, we didn’t need to go to the extent of clustering the index terms as in the first analysis. Since these subtopics are generally important enough to have entire books devoted to them, we could simply search for index terms that were common in the titles of those books. This approach would reveal items that may not appear frequently in indexes but do occur in book titles. It worked very well, and a whole new set of subtopics emerged.
The Database topic highlighted this problem very well, as most of its books will tend to be about a particular type of DB, rather than about DBs in general. Applied to that topic, the title-based analysis very quickly extracted FileMaker Pro, Microsoft Access, MySQL, Oracle, PostgreSQL and SQLite to complement the more general Database subtopics found by the clustering analysis. Some more examples:
New Languages: Clojure, Erlang, F#, Haskell, Scala, Scratch
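The title-based analysis could look something like the sketch below, which counts how many book titles mention each candidate term. The titles are invented for the example, and the `terms_in_titles` function and its `min_books` parameter are hypothetical rather than a record of the code we ran.

```python
import re
from collections import Counter

# Invented titles, used only to illustrate the technique.
titles = [
    "MySQL for Beginners",
    "Advanced MySQL Tuning",
    "PostgreSQL Administration",
    "Database Design Basics",
]

# Candidate terms drawn from the indexes of books in the topic.
index_terms = ["MySQL", "PostgreSQL", "SQLite", "normalization", "transactions"]

def terms_in_titles(terms, titles, min_books=1):
    """Return (term, count) pairs for terms found in at least min_books titles."""
    counts = Counter()
    for term in terms:
        # Match the term as a whole word, ignoring case.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE
        )
        counts[term] = sum(1 for title in titles if pattern.search(title))
    return [(term, n) for term, n in counts.most_common() if n >= min_books]

title_subtopics = terms_in_titles(index_terms, titles)
```

Terms such as "normalization" never appear in these titles and drop out, while product names surface immediately, which mirrors what we saw with the Database topic.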
Aggregation pages and future work
All in all, these analyses (and a couple of others, a bit more ad hoc, to deal with some of the more general topics) have identified around 350 subtopics, which have now been turned into aggregation pages. These pages expose a lot more content to our users and, we hope, help them navigate the breadth of the subjects that Safari covers.
There is still a lot of work to do—as is often said around here, this is not the end, it’s just the beginning. In particular we are very aware that some topics do not yet have subtopics and aggregation pages. And also, we have not yet tagged all our content. These are both high priorities, but we considered that the aggregation pages were useful enough as they were that we should make them available to our users as soon as possible.
In terms of the semantic data strategy, the biggest benefit is probably that we now have a workable—though still fledgling—vocabulary with which to describe our content. It will be refined over the coming weeks and months (for one thing there are some areas where it should be more consistent across different topics), and it will, of course, need to adapt to describe our content as it grows. But for now it gives us an excellent framework on which to build our next set of features—features which are aimed at letting people find the most relevant content as easily and quickly as possible. We’ll let you know how we get on—please let us know your thoughts on our strategy and on the progress so far.
[1] We fully recognise the value of using a standard vocabulary, particularly one from the Linked Data world. For one thing, it would facilitate the automatic linking of our content to all the other relevant content that is out there. At some point we will almost certainly map our vocabulary to an open set of identifiers, such as those provided by DBpedia. However, for the minute we feel there are a number of advantages in using our own “discovered” vocabulary to create the aggregation pages. Again, this is the “bootstrapping” philosophy in action.