Jim's Thoughtspot: Data vs. Content

It's been a while since I've posted on this blog. A lot has happened in the intervening months (years). Mostly, I've moved forward and backward in the tech world. I've hummed and hawed over the direction of my career. I've also been somewhat distracted by local events that required my attention. You can label it a higher calling; a change in priorities from a completely geeky world that I have embraced as my own to one that encompasses the future of my geeks-in-training.

Needless to say, I haven't abandoned the world of XML, XSLT, XPath, and XQuery entirely. I've evolved. I've had a gap year (or two) and seen the world outside the comfortable confines of angle brackets and FLWOR expressions, and it has changed me - a bit.

For those who've read some of my posts, I drank the Kool-Aid in the 90's, and wanted everyone else to share from the same cup. What the last few years have shown is that Kool-Aid is for kids. It's time to grow up. The technology and content worlds have changed, and I need to change with them.

Primarily, what has changed is my thinking about the role of XML technologies in the landscape. XML has a place, and honestly, it's a very important player in the wide-ranging landscape - just not in the way I perceived it 5, 10, or even 15 years ago. In fact, I'm not sure that Sir Tim Berners-Lee would have envisioned the path that markup languages have taken. Nonetheless, it's time to embrace these changes for what they are.

XML is here to stay. It's mature, it's lived up to its promise of extensibility, and it won't go away.
XML technologies are stable. There is little in the way of implementation variability among different providers now. Whether you are using Java, a classic 'P' (Python, PHP, Perl) language, or any one of the newer languages, they all must honor XML in order to be complete.
Incremental changes in XML technologies are principally to support scale. DOMs are nice, elegant, and easy-to-use structures, but quickly turn into boat-anchors when we attempt to embrace internet-scale data within them. Streaming XML is the new sexy.
Virtually any data model can be represented in XML for a myriad of business purposes, with self-describing semantics and the capability to flex its node hierarchy based on the data. For this reason alone, XML has been, and will continue to be, a workhorse. Think about Spring, one of the web's most successful Java frameworks. XML is the underlying data for nearly every part of it.
As a data persistence layer, XML plays well with tabular, relational, and hierarchical structures. With its rich semantics and vendor-agnostic format, XML technologies are powerful, flexible, and scalable. Yes, it's also a great storage model for pernicious content models - like DITA, DocBook, and, gulp, OOXML (I'll shower later for that).
From XML, I can deliver and/or display content/data in virtually any format imaginable, even to the point of backward compatibility with legacy formats (ask me about HTML 3.2/EDGAR transformations sometime).
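
To make the DOM-versus-streaming point a little more concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree.iterparse`. The `orders`/`order`/`total` element names and the tiny in-memory feed are invented purely for illustration; a real feed would come from a file or network stream. Treat it as one possible shape for streaming processing, not a recommendation for any particular parser.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical feed: element names and values are made up for this sketch.
FEED = b"<orders>" + b"".join(
    b'<order id="%d"><total>%d</total></order>' % (i, i * 10)
    for i in range(1, 6)
) + b"</orders>"


def sum_totals_streaming(source):
    """Sum every <total> while reading the document incrementally."""
    total = 0
    # iterparse hands back each element once its end tag has been read,
    # so we never need the fully built tree before we start working.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "order":
            total += int(elem.findtext("total"))
            # Release this order's children and text now that it is counted.
            # (The emptied <order> shell stays attached to the root.)
            elem.clear()
    return total


STREAMED_TOTAL = sum_totals_streaming(io.BytesIO(FEED))
print(STREAMED_TOTAL)
```

The design choice is simply to do the work per record and let go of each record afterwards, instead of loading everything and then walking it.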

With all that XML has going for it, what can go wrong? Well, depending on who you ask, the answer will vary. Some criticize XML for not living up to the hype of Web 2.0. XML's initial purpose was to be the "SGML for the web." To some degree, it is, but it is far from ubiquitous. That isn't to say that we didn't try. From XML Data Islands to XMLHttpRequest objects in JavaScript, XML was given first-class status on the web. The problem was (and is) that, as a DOM, extracting data often relied on a lot of additional code to recurse through the XML content. For some, the browser's tools felt like a blunt instrument when finer-grained precision was needed. Eventually, JSON became the lingua franca for web data, and rightfully so.
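
As a rough illustration of that "additional code" complaint, here is a small Python sketch (Python's `xml.dom.minidom` standing in for the browser DOM, which is an assumption of this example) that pulls the same titles out of an XML document and a JSON document. The `catalog`/`book`/`title` structure is hypothetical.

```python
import json
from xml.dom import minidom

# Hypothetical sample data carrying the same information in both formats.
XML_DOC = (
    "<catalog>"
    "<book><title>XSLT Basics</title></book>"
    "<book><title>XQuery Notes</title></book>"
    "</catalog>"
)
JSON_DOC = '{"catalog": [{"title": "XSLT Basics"}, {"title": "XQuery Notes"}]}'


def collect_titles(node, found):
    """Walk the DOM recursively, gathering the text of every <title>."""
    if node.nodeType == node.ELEMENT_NODE and node.tagName == "title":
        # The text lives in child text nodes, not on the element itself.
        found.append(
            "".join(c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE)
        )
    for child in node.childNodes:
        collect_titles(child, found)
    return found


XML_TITLES = collect_titles(minidom.parseString(XML_DOC).documentElement, [])

# The JSON version maps straight onto native lists and dicts.
JSON_TITLES = [book["title"] for book in json.loads(JSON_DOC)["catalog"]]

print(XML_TITLES)
print(JSON_TITLES)
```

Query languages like XPath shrink the XML side considerably, but with only the raw DOM available the difference in ceremony is easy to see.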

Perhaps its biggest limitation or failure is the countless attempts to make XML usable for the masses. I'll admit that I was one of the biggest evangelists. I honestly believed that we could build authoring tools that were intuitive and easy-to-use, backed by powerful semantic markup. We would be able to enrich the web (and by proxy, the world) with content that had meaning - it could be searched intelligently, reused, repurposed, translated, and delivered anywhere. As one of my friends and mentors, Eric Severson, said, XML has the capability of making content personal, designed for a wide audience and personalized for an "audience of one."

Intrinsically, I still have some faith in the idea, but the implementation never lived up to the hype. For over twenty years, we've tried to build tools that could manage XML authoring workflows from creation to delivery. Back in the late 90's and early 2000's, I remember evangelizing for XML authoring solutions to a group of technical writers for a big technology firm. I was surprised by the resistance and pushback I got. Despite the benefits of XML authoring, the tools were still too primitive, and instead of making the writers more productive, they slowed them down. Nevertheless, I kept evangelizing like Linus in the pumpkin patch.

Eventually, the tools did improve. They did make authoring easier... for some. What we often glossed over was the level of effort required to make the tools easier to use. Instead of being tools that could be used by virtually anybody who didn't want to see angle brackets (tools for the masses), we made built-for-purpose applications. For folks like me who understood the magical incantations and sorcery behind these applications, they were fantastic. They were powerful. They also came with a hefty price tag. And, because they were often heavily customized, users were locked in to the tools, the content model, and the processes designed to support it.

Even if we attempted to standardize on the grammar to enable greater interchange, it still required high priests and wizards to make it work. The bottom line is that the cost of entry is just too high for many. The net result is that XML authoring is a niche, specialized craft left to highly trained technical writers and the geekiest of authors.

Years ago, I read Thomas Kuhn's The Structure of Scientific Revolutions. The main argument is that we continue to practice our crafts under the premise of well-accepted theory. Over time, through the course of repeated testing, anomalies emerge. Initially, we discard these anomalies, but as they continue to accumulate, we realize that we can't ignore them anymore. New theories emerge. However, we reject these new ideas and vigorously argue that the old theories are still valid, until enough evidence disproves them entirely. At that moment, a new paradigm emerges.

We are at that moment of paradigmatic shift. No longer can XML be thought of as a universal theory of information and interchange. Instead, we need to reshape our thinking to accept that XML solves many difficult problems, and has a place in our toolbox of technology, but other technologies and ideas are emerging that are easier, cheaper, faster methods for content authoring. For many, the answers to "intelligent content" aren't about embedding semantics within, but rather about extending content with rich metadata that lives as a wrapper on the content - metadata that can be dynamic, contextual, and mutable.

Before I'm labeled a heretic, let me be clear. XML isn't going away, nor is it inherently a failed technology. Quite the opposite. Its genius is in its relative simplicity and its flexibility to be used effectively in a vast number of technologies. The difference is that we've learned that we could never get enough momentum behind the idea of XML as a universal data model for content authoring, and it was too cumbersome for web browsers to manipulate. We have other tools for that.

I had an illuminating moment a few weeks ago when I met with some developers at a client about an XML project that I am working on for them. The project involves some pretty sophisticated Java code to execute a transformation of XML content into multiple formats like HTML, PDF and Microsoft HTML Help. Essentially, the work involves extending an existing framework to support transforming any Document Markup Language (DITA, DocBook, TEI, custom DTD/Schema, you name it) instance to whatever the desired output should be.

Yet after a few days with these extremely bright developers, it became quite clear to me that they viewed XML strictly as a data format, something that contained configuration data, or content manifests. They had no experience or context to think about XML as content. And here I was explaining to them that XML is a very rich medium for technical manuals and even commercial publications (this is a whole other topic).

Having worked with DocBook for the last 6 years, and more recently DITA, my world view of XML was focused strongly on XML for documentation. Yes, I've created lots of XML for configuration files and other data (SOAP packets) in my development work. But generally I correlate XML with technical document content.

And then it dawned on me: Within the development community at large, there appears to be a large disconnect about how XML is and can be used. For developers like the ones I spoke about, XML is a rich data format with the ability to hold all kinds of data in a hierarchical way - something that Java .properties files and flat text files can't do easily; for others like myself, XML is a rich content format that provides a way of semantically describing the content. And in fact, the content becomes data, not just text.
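
Here is a small Python sketch of the two views side by side. The configuration document is invented for the example; the paragraph borrows DocBook's `para`, `keycap`, and `command` element names, though the sentence itself is made up. It is only meant to show that the same parser serves both views, and that semantically marked-up content can be queried like data.

```python
import xml.etree.ElementTree as ET

# XML as data: a hypothetical configuration file, one value per element.
CONFIG = "<server><host>localhost</host><port>8080</port></server>"

# XML as content: mixed text and inline markup (DocBook-style element names).
PARA = (
    "<para>Press <keycap>Enter</keycap> to run "
    "<command>build</command> now.</para>"
)

# The data view: flatten the children into a settings dictionary.
SETTINGS = {child.tag: child.text for child in ET.fromstring(CONFIG)}

para = ET.fromstring(PARA)

# The content view: the running text a reader would see...
PLAIN_TEXT = "".join(para.itertext())

# ...and the same content treated as data, queried by its semantics.
COMMANDS = [el.text for el in para.iter("command")]
KEYS = [el.text for el in para.iter("keycap")]

print(SETTINGS)
print(PLAIN_TEXT)
print(COMMANDS, KEYS)
```

A flat text file could hold the sentence, and a properties file could hold the settings, but neither could answer "which commands does this paragraph mention?" without extra conventions layered on top.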

There is a proliferation of XML standards: RSS, WSDL, SMIL and so on. These are powerful data structures that have advanced the Web from serving up static HTML pages to delivering a wide variety of new services and applications. Yet, ironically, all of these XML standards have their origins in SGML (indirectly), some of whose earliest applications were used to mark up content, HTML being perhaps the most well known. DocBook started out as an SGML application in the early 90's and was ported to XML later on.

This goes to show that XML is an immensely powerful technology that lives up to its name: eXtensible Markup Language. It has become firmly entrenched in so many technologies. Interestingly, it's because XML is both data and content that we find two naturally occurring views in the development community. We can't know everything, and as a consequence of that, we tend to focus on specific technologies to become expert in (myself included). Look at Java or C# as examples: there are so many distinct areas in which these languages can be used that we can't possibly be expert in all of them.

This meeting proved to be illuminating in other ways. I began thinking about yesterday's post and Microsoft's decision not to support XSLT 2.0 or XPath 2.0. Perhaps the view of XML as data is much more common than the view of XML as content. And maybe this distinction is an artifact of "conventional" ways of thinking about "data" and "human-readable" content. Data lives in databases and configuration files; content lives in Word and HTML files, and ne'er the two shall meet, and for the most part they didn't.

XML puts data and content on an equal playing field, which opens up a whole variety of possible applications, some of which are only beginning to emerge. It has also opened up existing applications to new ways of sharing information. For example, Microsoft's Office Open XML standard will allow interchange to and from a wide variety of sources, perhaps even from DocBook XML content!

Saturday, January 9, 2016

It's been a while

Saturday, February 3, 2007

XML - It's not just for data. Really.
