Posted by & filed under Content - Highlights and Reviews.

For years we’ve fielded customer requests to add books from the Pragmatic Bookshelf, so I’m thrilled to announce that 19 titles are available right now, with more coming soon.

Pragmatic Bookshelf + Safari

Dave Thomas and Andy Hunt have built a great reputation not only for their books, but for their way of doing business, applying a lot of software development and agile methodologies to running a publishing business. I first learned Ruby from “the Pickaxe Book”, and it’s exciting to make their full list available to Safari members.

Start your free 10-day trial of Safari right now.

Posted by & filed under Aggregation, Categorization, content, publishing, Semantic Analytics.

Safari now has upwards of 25,000 books and 2,400 videos available to our members, a fact that we are very pleased about.

That’s a lot of stuff, but we realize that it means little if you can’t find the content that is valuable to you. One of our goals is to provide more and better ways for our audience to navigate that content, which includes creating more routes into it and also better linking between content items.

Creating new and better ways for you to discover content is part of an ongoing strategy based on semantic analysis of that content, and we have recently launched the first public manifestation of this initiative: the Safari aggregation pages within our new website. This blog post describes some of the technical work underpinning those pages, and how it fits into the wider strategy of improving our content description and interlinking.

Taking an iterative, experimental approach

Early in this project we decided to go with a highly iterative approach. We sought to build small, flexible components, only loosely coupled (at this stage) to the wider systems that power Safari. This approach allowed us to experiment and gave us the freedom to build systems which didn’t have to satisfy every future need.

I think of this approach as a sort of “bootstrapping”—building small systems, just complex enough to help us build the next iteration, and so on. Ideally, each iteration should benefit our customers, and so, as a first step, we decided to build aggregation pages, as they would provide immediate and obvious value to readers.

The idea for these aggregation pages was two-fold: firstly, that they should showcase more of our content; and secondly, that they should break down our existing topics into more fine-grained subtopics. We reasoned that introducing more granularity would help our customers more easily find content on the subjects that they are interested in.

Creating a controlled vocabulary

So our first goal was to specify the substructure underneath the Safari topics, which meant creating a set of terms—a “controlled vocabulary”—that described the topics in a meaningful and intuitive way and, crucially, accurately represented the content.

Ultimately, we sought to create an interlinked vocabulary, so that, where appropriate, the same terms could be used within different topics, as shown in the following illustration.


An interlinked vocabulary would allow us to create a real network of content—a great foundation for building new features that would enable people to move seamlessly between content items that are relevant to them.

Crawling the indexes

While it would have been possible to use a preexisting vocabulary of terms to define a topic such as, say, Python, we felt that adapting it for our content and audience could require significant editorial input.[1] So in keeping with our “bootstrapping” approach, we looked for programmatic ways to extract a vocabulary of terms from the content itself.

Most of the books in Safari already have significant editorial input in marking up their content in the form of back-of-book indexes. Creating an effective index by hand takes a lot of editorial effort, and the terms are carefully curated in order to make the book easy to navigate. A good index is tailored to the topics but also to the reader (unlike some pure classification schemes).

Since these indexes already exist, we thought it would make sense to take advantage of the editorial expertise that went into creating them. We looked at indexes to identify the most frequent terms used in all the books on a particular subject (initially we looked at books on Python and Java). The results gave us a pretty useful list of terms for describing the content. The top index terms for Python, for example, are:

classes, modules, functions, files, objects, methods, strings, Python, os module, performance, dictionaries, attributes, variables, exceptions, sequences, threads, OOP (object-oriented programming), lists, sys module, debugging
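The counting itself is straightforward. Here is a minimal sketch of the idea (not our actual pipeline) using made-up index data, with each book represented as a list of its index terms:

```python
from collections import Counter

# Hypothetical input: one list of index terms per book on a topic.
book_indexes = [
    ["classes", "modules", "functions", "strings"],
    ["classes", "functions", "exceptions", "strings"],
    ["modules", "classes", "threads", "functions"],
]

def top_index_terms(indexes, n=5):
    """Count how many books each index term appears in; most common first."""
    counts = Counter()
    for terms in indexes:
        counts.update(set(terms))  # count each term at most once per book
    # Sort by descending book count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

print(top_index_terms(book_indexes, n=3))  # → ['classes', 'functions', 'modules']
```

Counting a term once per book (rather than once per index entry) keeps a single heavily indexed book from dominating the topic-wide list.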

However, we noticed that they tended to be a bit too fine-grained. For example, whereas we might want to tag all content related to Network Programming in Python, the indexes in the books referred to socket module and urllib and so on. Whereas we might want to group content about Object Orientation, the indexes would separately use Objects, Classes, and Methods.

Clustering index terms

To get around this over-specificity, we tried clustering the index terms based on how often they occurred in chapters together. Clustering, generally speaking, is a mathematical technique whereby things are grouped algorithmically according to their properties, which are expressed in a vector space. In this case the properties are the index terms themselves, so the vector for a particular index term expresses how often it occurs with the other index terms.

Clustering can be something of a black art. There are many different clustering algorithms, with different parameters to vary, and there are also many quality metrics to describe how “good” the output clusters are. By trying different algorithms, parameters, and quality metrics we came up with a system that made some very sensible suggestions for groups of index terms—groups which tended to match the granularity at which we wanted to create the topic substructure.
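To make the idea concrete, here is a small self-contained sketch of the approach, not the system we actually ran: each term gets a vector of how often it shares a chapter with every other term, and terms whose vectors are similar (by cosine similarity) are greedily merged into groups. The chapter data and threshold are invented for illustration.

```python
from itertools import combinations
from math import sqrt

# Hypothetical input: the set of index terms attached to each chapter.
chapters = [
    {"objects", "classes", "methods"},
    {"classes", "methods", "inheritance"},
    {"socket module", "urllib", "threads"},
    {"socket module", "urllib"},
]

terms = sorted(set().union(*chapters))

def vector(term):
    """Co-occurrence vector: how often `term` shares a chapter with each other term."""
    return [sum(1 for ch in chapters if term in ch and other in ch and other != term)
            for other in terms]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster(threshold=0.5):
    """Greedy single-link clustering: merge terms whose vectors are similar enough."""
    groups = {t: {t} for t in terms}
    for a, b in combinations(terms, 2):
        if cosine(vector(a), vector(b)) >= threshold:
            merged = groups[a] | groups[b]
            for t in merged:
                groups[t] = merged
    return {frozenset(g) for g in groups.values()}

for group in cluster():
    print(sorted(group))
```

On this toy data the OO-flavoured terms (classes, methods, objects, inheritance) fall into one group and the networking terms (socket module, urllib, threads) into another—exactly the granularity shift from individual index terms to subtopics described above.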

Using this method, we came up with a bunch of subtopics. For example:

JavaScript: Ajax, Debugging, DOM, Events, Functional Programming, Inheritance, jQuery, JSON, Loops, Object Oriented JavaScript, Regular Expressions, Testing, Variables

Python: Basic Types, Debugging, Decorators, Exceptions, Files, Inheritance, List Comprehensions, Modules, Network Programming, Object-Oriented Programming, Polymorphism, Regular Expressions, Serialization, Strings, Testing, Threads, Timers

Databases: Data Types, Foreign Keys, Indexes, Normalization, Operators, Primary Keys, SQL, Tables, Transactions, Views

Creating these clusters wasn’t a fully automated process. The terms that the analysis suggested still needed some curation, which we managed with a basic admin interface that we built. But the combination of the clustering algorithm and the admin interface allowed us to create a decent substructure for most of our topics (particularly the computer science-oriented topics, which tend to have a fairly standardised vocabulary). In addition, the index terms associated with a particular subtopic could then be used in search queries to discover content with which to populate these subtopics.
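That last step—using a subtopic’s index terms to find the content that belongs under it—amounts to a simple “match any term” query. A minimal sketch, with hypothetical subtopic and book data:

```python
# Hypothetical data: each subtopic's clustered index terms,
# and each book's own index terms.
subtopics = {
    "Network Programming": {"socket module", "urllib", "httplib"},
    "Object-Oriented Programming": {"objects", "classes", "methods"},
}
books = {
    "Programming Python": {"socket module", "classes", "files"},
    "Python Cookbook": {"urllib", "strings"},
    "Learning Python": {"classes", "methods", "modules"},
}

def books_for(subtopic):
    """A book belongs under a subtopic if its index shares any term with it."""
    wanted = subtopics[subtopic]
    return sorted(title for title, terms in books.items() if terms & wanted)

print(books_for("Network Programming"))  # → ['Programming Python', 'Python Cookbook']
```

In a real system the terms would feed a search engine query rather than a set intersection, but the principle is the same.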

Inevitably, this process was not foolproof. Firstly, some of the indexes were not very well formatted, and so we couldn’t extract the terms in a useful way. Additionally, some topics aren’t amenable to this analysis at all, particularly those that are very broad, such as business. Finally, although the vocabulary for the very technical topics was much more regular than it was in others, there were still some ambiguities—for example, different books have different conventions on abbreviations. We may well address some of these issues in the future, but, with the volume of books we have, we still managed to get decent results—good enough for us to get started.

Missing subtopics

Having completed this first analysis, we noticed that some subtopics which we’d expected to see did not emerge. We realised that these “missing” subtopics were usually on subjects which would have a whole book written about them, and so may only be mentioned in passing, if at all, in a more general book. As a result, they weren’t mentioned frequently enough in the indexes to register in this analysis. Good examples of missing topics are Django, Rails, AngularJS, etc. (Interestingly, the one such framework which was extracted by the clustering analysis was jQuery, demonstrating the ubiquity it has now achieved in the JavaScript world.)

To resolve this issue, we didn’t need to go to the extent of clustering the index terms as in the first analysis. Since these subtopics are generally important enough to have entire books devoted to them, we could simply search for index terms that were common in the titles of those books. This approach would reveal items that may not appear frequently in indexes but do occur in book titles. It worked very well, and a whole new set of subtopics emerged.
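As a rough sketch of this second analysis (again with invented data, and simplified to a membership test rather than a frequency count), the idea is just to keep the index terms that also show up as words in book titles:

```python
# Hypothetical data: index terms collected across a topic, plus book titles in it.
index_terms = {"tables", "transactions", "mysql", "sqlite", "joins"}
titles = [
    "Learning MySQL",
    "Using SQLite",
    "SQL in a Nutshell",
]

def terms_in_titles(terms, titles):
    """Keep the index terms that also appear as words in book titles."""
    title_words = {word.lower() for title in titles for word in title.split()}
    return sorted(t for t in terms if t.lower() in title_words)

print(terms_in_titles(index_terms, titles))  # → ['mysql', 'sqlite']
```

Terms like “tables” fall away because no book is titled after them, while the book-sized subjects (MySQL, SQLite) surface immediately.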

The Database topic highlighted this problem very well, as most of its books tend to be about a particular type of DB, rather than about DBs in general. Applied to that topic, the title-based analysis very quickly extracted FileMaker Pro, Microsoft Access, MySQL, Oracle, PostgreSQL, and SQLite to complement the more general Database subtopics found by the clustering analysis. Some more examples:

JavaScript: AngularJS, Backbone, Closure, CoffeeScript, Dojo, Ember.js, Ext JS, Mongoose, Node.js, RequireJS, Sencha Touch, YUI

New Languages: Clojure, Erlang, F#, Haskell, Scala, Scratch

Aggregation pages and future work

All in all these analyses (and a couple of others, a bit more ad hoc, to deal with some of the more general topics) have identified around 350 subtopics, which have now been turned into aggregation pages. These pages expose a lot more content to our users and, we hope, help them navigate the breadth of the subjects which Safari covers.

There is still a lot of work to do—as is often said around here, this is not the end, it’s just the beginning. In particular we are very aware that some topics do not yet have subtopics and aggregation pages, and we have not yet tagged all of our content. These are both high priorities, but we judged the aggregation pages useful enough as they stand to make them available to our users as soon as possible.

In terms of the semantic data strategy, the biggest benefit is probably that we now have a workable—though still fledgling—vocabulary with which to describe our content. It will be refined over the coming weeks and months (for one thing there are some areas where it should be more consistent across different topics), and it will, of course, need to adapt to describe our content as it grows. But for now it gives us an excellent framework on which to build our next set of features—features which are aimed at letting people find the most relevant content as easily and quickly as possible. We’ll let you know how we get on—please let us know your thoughts on our strategy and on the progress so far.


[1] We fully recognise the value of using a standard vocabulary, particularly one from the Linked Data world. For one thing, it would facilitate the automatic linking of our content to all the other relevant content that is out there. At some point we will almost certainly map our vocabulary to an open set of identifiers, such as those provided by DBpedia. However, for the moment we feel there are a number of advantages in using our own “discovered” vocabulary to create the aggregation pages. Again, this is the “bootstrapping” philosophy in action.

Posted by & filed under Devops, Information Technology, infrastructure, IT, Operations.

Last week at Devops Days in Boston, I had the opportunity to attend back-to-back presentations that complemented each other and helped bring into focus an idea that has been hanging around just beyond the horizon of my awareness. Namely, when it comes to infrastructure software, we are working through a time of major transitions that affect both our tools and the structure and processes of our work.

Expecting conflict and adapting

The first presentation that started turning my vague sense of what’s been happening into a revelation was Nikolas Katsimpras presenting on conflict within organizations. He described various types of conflict between: individuals and individuals, individuals and groups, different departments, and so on. Although the talk was nontechnical, it was easy to apply many of the concepts Katsimpras described to infrastructure software and devops, where our work tends to be driven by other departments, whose needs define our priorities. This arrangement causes many organizations to get stuck maintaining the status quo with inefficient compensation patterns rather than changing with time and technology.

Katsimpras emphasized the importance of responsive adaptivity and described Nelson Mandela as brilliantly adaptive in that he was willing and able to adjust himself as circumstances changed. During another portion of the presentation, Katsimpras defined “double-loop learning,” which is a term used to describe the process of questioning initial assumptions when seeking to change outcomes, rather than focusing on refining strategies and goals. This concept strikes me as particularly salient given the rise of automated configuration management and test-driven infrastructure.

Today’s tools are not tomorrow’s

After the constructive conflict presentation, Kelsey Hightower went on to discuss CoreOS. I found myself completely rapt and pondering the near future, in which CoreOS is the solution to all my high-traffic, high-availability web app problems. But then it happened—I felt conflicted as to whether CoreOS would solve problems I’m already solving with Chef. I felt a pang of worry. On the one hand, we are still developing our Chef-based infrastructure: expanding test coverage, updating and standardizing dev tools, among other things. But on the other hand, the purpose of our business is to serve customers, not to commit to a particular system configuration management tool for all time. So maybe I ought not get defensive about keeping today’s infrastructure tools around for tomorrow. Here it was, double-loop learning in my own everyday life!

Three years ago, the tools needed for automatic dependency resolution and local testing in Chef were only starting to manifest in discussions and as ideas. Today, many of these tools—which were often initiated by third-party developers in the user community—are part of what is considered the Chef standard toolkit. By incorporating behavior- and test-driven development practices into infrastructure software, companies are improving the customer experience by separating the customer from the config and deployment errors.

These initiatives in the developer communities have led to an advent of tools that have dramatically improved efficiency and productivity in ways that would have sounded like magic a decade ago. But the outcomes are consistent with the double-loop learning principle of returning to initial assumptions before changing practices. For instance, just five years ago, most reasonable people would have agreed that it was difficult-to-impossible to test infrastructure changes on a local workstation. Now this type of testing is increasingly the norm, thanks once again to developers questioning the original assumptions.

Similarly, today we use all kinds of virtualization infrastructure in order to accommodate and scale complex web software, so our current patterns are generally predicated on using VMs as individual hosts: this reliance on VMs is just one example of a concept that is no longer relevant in CoreOS. During his talk, Hightower mentioned his personal interest in Golang, which made me contemplate how many of today’s tools that are written in C, C++, and even Java will be implemented in Go over the next decade.

While searching to make sure I wasn’t stepping on someone else’s title, one option I tried was “The Future is Adaptive.” When I saw Ian Clatworthy’s essay from seven years ago, I knew I had found mine. In his essay, Clatworthy, a Bazaar dev, discussed the importance of adaptability in the context of version control and described the tensions that arise from diverging priorities. Today we are even further down that road, and we’re going faster. Clatworthy’s ideas are still applicable and interesting today, even though the specific technologies change. It’s easy to forget that we will pass through many moments of now on our way to what lies ahead. Between now and the future we will implement the changes that distinguish the two from each other. While these changes will create situations that are ripe for conflict, we must learn to leverage contention as a constructive force. We must do this because the nature of software engineering and software product development has been and will continue shifting under our feet.

Posted by & filed under programming, publishing, xslt.

Twice in the past couple of weeks I’ve needed to solve this problem:

We have content provided as individual displayable pages, but we need a document for those pages to refer to, with a table of contents so that a user can navigate between them. In one case it was a set of articles in a journal issue, in another it was entries in an encyclopedia. I solved both of these problems with the same general XSLT structure, so I thought I’d write it up for others to use. Read more »

Posted by & filed under news.

Safari was founded in 2001, as a joint venture between O’Reilly Media and Pearson, two publishers that care deeply about learning, innovation, knowledge, and personal growth.

Today I’m thrilled to announce that O’Reilly Media, Inc., has acquired the Pearson stake in Safari, and Safari is now a wholly-owned subsidiary of O’Reilly Media, Inc. Safari will continue to operate as an independent company, and Pearson Education will remain a strategic content partner of Safari.

Safari - O'Reilly Read more »

Posted by & filed under Information Technology, infrastructure, IT, tools.

If you’ve used Chef, you’ve probably used a community cookbook. Community cookbooks are helpful because someone else has figured out how to solve your problem, be it installing nginx or configuring postgresql. While community cookbooks are great, they sometimes don’t include everything that you need. That’s where wrapper cookbooks come in. If you want to change or extend the functionality of a cookbook without having to rewrite it from scratch, a Chef wrapper cookbook is the way to go. Read more »

Posted by & filed under Content - Highlights and Reviews.

You’ve probably noticed we’ve made some changes around here, including a brand new design and sharp new logo.

New Safari Logo

Safari began more than 13 years ago as “Safari Tech Books Online,” with the promise of replacing the collection of IT and programming reference books on your shelf with something online and searchable. (As the saying goes, “You can’t grep dead trees.”) Read more »

Posted by & filed under java, programming, testing.

A few months ago I noticed the JUnit Attachments Plugin and was inspired.  Recording pictures and other files after test failures is such an obviously good idea, especially if you’re the one who has to fix the tests.

Unfortunately, the JUnit Attachments Plugin has some rough edges. I wrote a small library that handles the busywork while keeping tests readable.  Since we are awesome, we open sourced it so you can use it too!

Read more »

Posted by & filed under Tutorials.

Safari is seeking help in usability testing for Safari Tutorials, our curated learning paths based on Safari books and videos. You will be speaking with folks from our product development team, and your input will directly influence how the product evolves in the next few months. That’s pretty cool, don’t you think?

fisheye image of UI books

Photo by loureiro, used under CC-By A / Cropped from original

The Details

Tests will be performed remotely via a Google Hangout video chat, running for 30 to 45 minutes. Session times are available on June 18 and June 19.

If you are interested in participating, we ask that you take a few minutes to complete this short questionnaire.

Those selected for usability testing will receive a $25 gift card (iTunes or Amazon) upon completion of the session.

Thank you for considering participating in our study. Feedback from our community is essential as we build products that help our users learn and grow. Plus we just like speaking with you.