Suppose you work for a corporation that has code on hand that, for strategic reasons, it wants to release in open-source form, and you are asked to provide some guidance.
Here are your options:
- 1. Make. Make the community from scratch, providing all the resources to establish the project web site, market the project, attract developers, establish a release model, convince people of your good intentions, convince people you will follow the open-source rules, and so on, and so on.
- 2. Buy. Buy into an existing community, turning the code over to them and providing your developers not as project managers but as assistants to the managers of the existing community.
I’m a strong believer in the second approach. You want to buy into an existing community, the best such community you can find. You want to leverage that community to establish your project.
I say this because I’ve been there, and done that. Let me tell you about it, the story of Cloudscape/Derby.
Go to your favorite search engine and search for “apache database”. Within seconds you will see the word “Derby.” I like being able to say this: for almost four months I gave every spare moment I could to help make it happen.
One day in early April of 2004, as I stepped off the elevator on the way to my office a colleague said to me, “Have you heard the news about Cloudscape?”
I said no. However, I had heard about Cloudscape (CS, which I’ll use hereafter as that’s what I used for months in e-mails, in part to disguise the project’s purpose) before. In early 1999 I was invited to discuss Jikes at a Java user’s group meeting in NYC. One of the speakers who went before me was from a start-up called CS. Their product, CS, was a browser plugin written in Java. In those days before the arrival of the high-speed internet the page update time was significant, and CS was created to speed up the manipulation and analysis of data, as follows.
Data manipulation and analysis were done on the client machine by first downloading the data from the web and caching it in memory. The CS software was a relational database system that allowed the full use of SQL to retrieve data, and it came with various graphical tools to display the results. This greatly sped up analysis.
I thought it a clever idea when I first heard it, and their demo looked promising, though I lost track of CS as I was involved so much in Jikes in those days. (CS is the only presentation I remember from that meeting though several other products were demonstrated.)
I learned a few hours later that my colleague was right. I got an e-mail from the CS team saying they had been told to open-source the code and so wanted some guidance on how to navigate IBM’s open-source process.
I soon learned from the team that the CS startup had been acquired by Informix and that the CS code had come to IBM when IBM acquired Informix. CS had matured into a high-quality piece of code; a “relational database in a Java jar file” was the simplest summary. If you added that jar file to your Java classpath, then you had full access to a relational database that had a full implementation of SQL and also passed the ACID test.
CS had once been a separate product, and had also once been the reference implementation for the corresponding Java database spec. Though it was no longer marketed separately, it was now one of the most widely-used components in IBM software, and was embedded in scores of IBM’s products. It was production-level code, the real deal.
The initial plan of the team was to start a new entity, perhaps based on the eclipse model. Let’s call that approach CS.ORG. The team felt this would allow them to build a community gradually, perhaps even shape the pace of development, develop tie-ins with IBM’s products as the code became more accepted, and so on.
It was a real opportunity: great code that met a need no one had yet filled, that of a relational database in a jar file, as well as skilled developers on hand to show the community how the code worked. There was also backing from management, and so on.
Everything was coming up roses. It was just a matter of execution.
But there was a problem, a big problem. The code was big, very big.
There were almost a million lines of code in Cloudscape. Though perhaps medium-sized in terms of corporate applications, this is a large number in the open-source world. 
When we published Jikes it was just under 90,000 lines of C++. It took two programmers three years, or six programmer-years, to put that code together, at a cost to IBM of at least $1.5M. By those figures, that works out to roughly $17 per line.
I did a study of the size of open-source packages back in the Jikes days. There were very few packages over 100,000 lines. This seems reasonable. It takes at least a few thousand lines of infrastructure for almost any meaningful program, and when you add code that actually does something, you find that most open-source packages are in the range of 10,000 to 30,000 lines. With two programmers you can make it up near a hundred thousand lines or so, though the way in which open-source packages are organized favors componentization.
These estimates are for C or C++. Java programs tend to be larger; for example, there are more statements needed to manage namespaces, and there is a requirement that each class must have its own file.
Suffice it to say that most open-source developers, save those who have spent quite some time working on the Linux kernel, have no experience working with code bases of a million lines or so. This meant that the team’s plan to slowly build a community while IBM retained clear control, hoping that developers willing to take on a million-line code base would come forth, was a dream, most likely a pipe dream.
I also felt that while CS.ORG might seem a good place to start, one should look ahead to the end-game. Where would you most like to be after three to five years?
I don’t know about you, but I’d like to have my code be part of the Apache Software Foundation (ASF). Only the kernel folks have the same level of credibility. And, as with the kernel folks, everyone knows there will be a strong team in place to watch over that code for years, and more likely decades, to come. Torvalds and Behlendorf are in their mid-30s now. Does anyone doubt that the work they started in their twenties will still be around when they retire? (And I’m assuming they will continue working until they are at least 60, as they are not in this for the money, but for the love of the game.)
So within a few days I started sending e-mails and making calls, suggesting that IBM’s best approach would be to hand the code over to ASF.
What made this hard was that I knew IBM couldn’t just make a token donation, but that we would have to hand over the code lock, stock and barrel, making clear we were entrusting the code to Apache’s capable hands going forward, and that while we would be around to help as they deemed fit, we fully appreciated that from the get-go the code would be theirs, and no longer ours.
In my experience this is the most important part of transferring technology from its creators to another group, whether from a research lab to a development group, or from one part of a company to another part, or from a corporation to the open-source world:
To maintain any control at all you must give up all control.
The transfer can succeed only if all accept and understand that the ownership of the code has changed, and that the creators no longer direct the effort, but are now available only to advise and help, as requested.
I was not the only one to argue that giving the code to ASF was the best course. Others also felt the same, including folks with a lot more clout than me. I’m presenting the arguments I made and I can’t claim these arguments were decisive, but I do hope they provide some insight.
In any event, the decision to release via ASF was reached in early July, and the contribution to ASF was publicly announced at LinuxWorld in early August.
However, there is one contribution I can take credit for. I also suggested that to improve the odds IBM should find someone with open-source experience who could serve as an ombudsman or trusted intermediary. Or as I put it:
We need a sheriff, someone trusted on both sides of the firewall. Our developers are good, but they are inexperienced in open-source, so there may be some mistakes, and we need someone on hand who can sort out the mess.
That’s how Ken Coar got his current job.
The rest is history.
For example, it was Ken who came up with the name change from Cloudscape to Derby. I remember some fun instant-message sessions during which he mentioned some of the names he was considering. It was his call to make, but as an Apache person (he is on the ASF board) and not as an IBM person (the IBMers wanted to retain the name, as I recall).
Note that the key decision was not what license to use, but how to build a community. Once you know that, the choice of a license is a tactical decision. For example, with ASF we had to use the Apache license.
1. IBM’s public statements said the code was worth about $70M. I was told the code was about a million lines. I recall doing a count and finding between 800 and 1,000 KLOC. I did a download of Derby just now and found fewer lines, though I can’t say I counted them all correctly. Suffice it to say the code was indeed big by open-source standards.
2. The programmers both had Ph.D.’s in computer science, though not everyone would consider that an advantage.
However you count it, Philippe and I can jointly say we convinced IBM to invest at least $2M or so into open-source by funding our work on Jikes. We are grateful for their trust.