Bit to Fit to Git: An Introduction to Open Source

This is the talk “Bit to Fit to Git: An Introduction to Open Source” first given at the City University of New York (CUNY) Graduate Center on 10 March 2006.

I considered it a great honor to be asked to speak at CUNY. Though I have never had a formal relationship with CUNY, I have a number of colleagues and friends who were part of CUNY, and so I decided to create a fresh presentation.
The result can be found in the next few entries, the ones marked ‘bfg’. This is an attempt both to provide an introduction to OSS for non-technical folks and to give some indication of why it can be fun and rewarding to be part of the OSS community.

I was invited to give this talk as part of a joint IBM/Novell/RedHat presentation to students, faculty and staff at CUNY to discuss a number of topics in the Linux/OSS area. I had an hour.

Bit to fit to git: open source, Linux, innovation

David Shields

shieldsd at us.ibm.com

First presented at the Graduate Center, City University of New York (CUNY), 10 March 2006

Introduction
I’m Dave Shields. I was a programmer from 1965-2002. I no longer write code; I currently work for IBM in a staff position.
What’s a programmer?
A programmer is a writer. Programmers don’t write in natural language; they write in special languages called programming languages. There are hundreds of such languages. I have spent most of my career designing and implementing programming languages.

For example:

Language   Sample text
shell      echo "hello world"
C          printf("hello world\n");
Java       System.out.println("hello world");

This document is written in a programming language called HTML.
A meaningful, useful piece of text in a programming language is called a program (no surprise there!) or in some cases
a package.

Copyright, Licenses, Open-Source Software (OSS)

Programmers are authors, as are poets and dramatists, and their work is protected by copyright.
The copyright owner gets to decide how their work can be used, typically in the form of an assignment of rights, or
by a license.
A license is said to be an “Open Source” license if it meets certain conditions. These conditions are defined by the
Open Source Initiative (OSI), and include the following:

1. Free Redistribution
2. Source Code
3. Derived Works
4. Integrity of The Author’s Source Code
5. No Discrimination Against Persons or Groups
6. No Discrimination Against Fields of Endeavor
7. Distribution of License
8. License Must Not Be Specific to a Product
9. License Must Not Restrict Other Software
10. License Must Be Technology-Neutral

Derived Works is — to a programmer — the key part; you can take the code, change it, and distribute the
result.

Free redistribution is — to a non-programmer — the key part; you don’t have to pay to get the code — it’s
free.

Code made available under a license that meets the OSI terms is said to be “Open Source Software” or OSS.

By the way, OSS is everywhere. You are using some of it every time you use a computer or use the internet. Yahoo runs
its web site using OSS, as does Google. Firefox is a good example of OSS.

Open-Source Licensing 101 (from a non-lawyer)

The inbound license is the license under which you receive the code.

The outbound license is the license under which you distribute the code, either with no change or with your
changes.

Some licenses tell you what the outbound license must be; for example, if GPL inbound, then GPL outbound.

Other licenses let you change the license; for example, MIT or BSD.

Some licenses require that you disclose any changes you make; others don’t.

The copyright owner can license the same code under multiple licenses. Usually just two are used: one that meets the
OSI definition, and another that does not and so is known as a commercial license. For example, MySQL AB makes its
code available under two licenses: GPL or a commercial license; it uses the GPL version to grow a user base and get
developer attention in the hopes that enough folks will choose to buy a commercial license.

Open-Source Licensing Philosophy (from a non-lawyer, former programmer)

Though as a Caltech grad I hate to say it, MIT did get it right when they wrote the MIT license:
Copyright (c)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the “Software”), to deal in the Software without restriction, including without limitation the
rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

This is a liberal license. It says you can do pretty much what you want with the code, including changing the license. It gives you freedom of action. For example, you can take code received under the MIT license and relicense it under the GPL license. It is NOT possible to relicense GPL code under the BSD license (unless you happen to be the copyright owner).
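
The inbound/outbound rules described in this section can be sketched as a small Python function. This is a deliberately simplified model from a non-lawyer: the grouping of licenses into two sets and the name can_relicense are assumptions made for illustration, and only the MIT, BSD and GPL behavior mentioned above is encoded.

```python
# Toy model of inbound/outbound licensing as described in the talk.
# The two sets and the function name are illustrative assumptions.

COPYLEFT = {"GPL"}            # inbound GPL means outbound GPL
PERMISSIVE = {"MIT", "BSD"}   # these let you change the license


def can_relicense(inbound: str, outbound: str,
                  is_copyright_owner: bool = False) -> bool:
    """Return True if code received under `inbound` may go out under `outbound`."""
    if is_copyright_owner:
        return True           # the owner may license the code however they like
    if inbound == outbound:
        return True           # distributing under the same license is always fine
    if inbound in PERMISSIVE:
        return True           # liberal licenses allow a different outbound license
    return False              # copyleft (or unknown) inbound: no license change


print(can_relicense("MIT", "GPL"))                           # True
print(can_relicense("GPL", "BSD"))                           # False
print(can_relicense("GPL", "BSD", is_copyright_owner=True))  # True
```

The last call reflects the dual-licensing case: the copyright owner is not bound by the license they grant to others.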

Open-Source is free, yes. But what good is it?

As a programmer I need few tools to do my work. Indeed, two suffice, and I can get them for about $100/month:

* $50 to buy hardware from Newegg and the electricity to run the computer
* $50 monthly fee for high-speed internet connection

You can build a useful computer for well under $1000. A modern computer is a case study in “open standards” and
componentization in that you need just a few components:

* Case (the outside metal shell, including the power supply)
* Processor, aka CPU (Central Processing Unit)
* Fan to cool the CPU. Sometimes this comes with the CPU (OEM); sometimes it doesn’t (retail).
* Memory chips
* Hard disk(s)
* CD/DVD drive
* Motherboard

The case and fan are simple technologies. Each of the others took hundreds of millions — in some cases tens of billions — of dollars of research and development to bring to their current form. A DVD drive can be had for $30 today — it took years to make that possible.

The really nice part of building your own computer is that you don’t need to spend a penny for software.

That’s because there is a magical piece of software called Linux.

Linux is an Operating System (OS). It figures out what CPU you have, how much memory, which disks, which CD/DVD drive, and so on… and so turns your computer from an expensive room heater into something useful, something so useful that programmers have given it a special name: the Linux platform.

“Platform” is a word much-loved by software executives. Another favorite is “ecosystem”, meaning that a platform is
so compelling and valued that a whole host of companies, organizations and communities have sprung up around it. For
example, Windows is an ecosystem, as is Linux.

How much is that Operating System in the Window?

The current estimate is that it takes $800 million to $1,000 million (one billion) to bring a new operating system into existence.

Recent innovations from IBM:

* Linux Watchpad: Linux on a wrist watch
* BlueGene/L: world’s largest supercomputer, more than $100M investment by IBM
* Cell: new architecture developed by IBM and others, also more than $100M investment

Deltas, diffs, patches

Programmers love to edit files: old-file -> new-file. The difference between the old and new file is called the ‘delta’.

The program ‘diff’ can be used to list the differences in a concise fashion. Doing so requires a little mathematics, finding the longest common subsequence (LCS).
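
To make the “little mathematics” concrete, here is a minimal Python sketch of a line-by-line diff built on the longest common subsequence. It is a textbook dynamic-programming version written for illustration, not the algorithm the real ‘diff’ program uses internally; the function names and sample data are made up for this example.

```python
def lcs_table(old, new):
    """table[i][j] holds the LCS length of old[i:] and new[j:]."""
    rows, cols = len(old), len(new)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def line_diff(old, new):
    """Return (mark, line) pairs: ' ' unchanged, '-' removed, '+' added."""
    table = lcs_table(old, new)
    i = j = 0
    result = []
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            result.append((" ", old[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            result.append(("-", old[i]))   # dropping the old line keeps the LCS
            i += 1
        else:
            result.append(("+", new[j]))   # otherwise the new line was inserted
            j += 1
    result.extend(("-", line) for line in old[i:])
    result.extend(("+", line) for line in new[j:])
    return result


old_file = ["apple", "banana", "cherry"]
new_file = ["apple", "blueberry", "cherry", "date"]
for mark, line in line_diff(old_file, new_file):
    print(mark, line)
```

The lines marked with a space are the common subsequence; everything else is the delta.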

The ‘-u3’ option to ‘diff’ can be used to describe the differences in a form that can be transmitted, so that if you and a colleague each have the old version, then you can send this form to your colleague and they can produce the edited file.
This is done by a program called ‘patch’, developed by Larry Wall, who is author of the Perl programming language,
the “Swiss Army Knife” of a programmer’s toolkit.
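
For a feel of what such a transmittable form looks like, Python’s standard difflib module can produce a unified diff with three lines of context. This is only a sketch: the file names and contents are invented, and difflib uses its own matching heuristic, so its output is not guaranteed to be identical to what ‘diff’ prints for every input.

```python
import difflib

# Each list element is one line of the file, including its newline.
old_lines = ["apple\n", "banana\n", "cherry\n"]
new_lines = ["apple\n", "blueberry\n", "cherry\n", "date\n"]

# n=3 asks for three lines of context around each change.
patch_text = "".join(
    difflib.unified_diff(old_lines, new_lines,
                         fromfile="old-file", tofile="new-file", n=3)
)
print(patch_text)
```

Lines starting with ‘-’ are removed from the old file and lines starting with ‘+’ are added. Python’s standard library has no counterpart to ‘patch’, so a colleague receiving this text would apply it with the ‘patch’ program itself.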

Linux is maintained using patches.

The Apache web server was originally developed using patches. Indeed, “Apache” is derived from “a patchy server”. Nowadays Apache and most other OSS projects use tools that accept the original and revised versions, then compute and maintain the diffs internally. The basic tool for this was CVS. It had the following properties:

* It was ‘good enough’ that most folks didn’t want to write a replacement;
* It was ‘bad enough’ that eventually some folks did write a replacement; it is called Subversion (SVN).

The BitKeeper “fiasco”

The next few pages discuss a topic known to some as the BitKeeper “fiasco”. See BitKeeper (Wikipedia, the free encyclopedia).

The net of this: a fit was thrown, BitKeeper was deemed unfit, and the Linux community created a new tool called “git”. See Git (Wikipedia, the free encyclopedia).

I first became aware of this issue in April 2005 when a number of articles appeared in the OSS trade press. One of them included a link to an early version of the git mailing list. I printed out a document with the postings. It ran about 50 pages.

I read those 50 pages and … was completely blown away by what I saw. There was an incredible amount of true innovation going on, at a pace that was hard to imagine. I have some experience in this area as I once worked on a failed, and now little-remembered, project called Stellation that addressed some of the same issues.

Simply put, git went from concept to prototype to first implementation to suite to platform to ecosystem in just a few weeks. Simply astounding!

Quick history of Linux patch process, BitKeeper

The next section tries to capture some of the info in a novel format: an online conversation I had with one of IBM’s
Linux developers. This occurred over about half an hour, about five hours before I actually delivered the talk. (The
“Resources” section at the end includes some links to articles about this issue; the articles by Joe Barr are
particularly informative.)

The following is a Sametime exchange with Sean Dague, one of the LTC Linux kernel developers:

dave: hi

sean: hey there

dave: i need short course in Linux “patch” process. First, let’s start with pre-BitKeeper days. I’m assuming that in those days folks kept a base version around, applied patches as they were distributed by Alan and Linus. Is that right?

sean: right, there are a number of people with kernel.org accounts that were just for that purpose

sean: Linus’ was the only official version

sean: Alan Cox had the big ac patch

dave: thanks. as for ‘rc’ the release candidates, I know there were rc1, rc2, etc. Were they cumulative? Did rc2 include all the changes in rc1, for example?

sean: IIRC all the patches were against the last release

dave: so rc2 was patch to rc1?

sean: no, there were no tar releases for rcs

dave: maybe I’m confusing ‘rc’ with ‘pre’. How did successive patches get distributed?

sean: though, I’m scratching my head at the moment as if that was true

sean: browsing on http://www.us.kernel.org/pub/linux/kernel/v2.2/testing/ would help

dave: not a big deal. moving on to BitKeeper (BK), is it true that BK provided central repository for patches that allowed kernel hackers to keep their source tree on their own machine?

dave: Unlike CVS, which has central server for the code, BK has central server for the patches, with distributed base code on hacker’s machines.

sean: given that the patches go up and down in size, I think you are right. You had to apply patches in order

sean: well, it is distributed scm

sean: which is different

dave: got it. Just what did Tridge’s tool do?

sean: reversed out the client server protocol

dave: ?

sean: maybe if I step back a minute, it will make sense. Hope you don’t mind the digression

dave: np

sean: SCM systems end up falling into 3 basic camps. 1) Lock -> Checkout -> Change -> Commit -> Unlock (this is what things like CMVC, and most proprietary source management systems do)

sean: 2) Lockless, but still centralized, so Checkout -> Change -> Commit -> Fail if Merge Required -> Commit (this is what subversion, and CVS do)

sean: 3) fully decentralized, where every developer has their own server instance, and Changes are pulled from other repos with a common parent

dave: great stuff … keep it coming

sean: this is what Bitkeeper, Git, Mercurial are

sean: Bitkeeper was one of the first experiments in this space

sean: as a commercial vendor

sean: which meant the client and server were closed source

sean: and had a funky license

sean: Bitmover hosted Linus’s linux kernel tree for free on their servers, as well as other kernel developers code (I can’t remember all of it)

sean: developers could directly expose their bitkeeper trees, but the client to pull those changes was a binary blob that was free “as in beer”, but closed, and had an ever changing license

sean: what Tridge did was reverse out the protocol between the bitkeeper server, and client, so that he could have an open source client to pull changes

dave: got it.

sean: because, among other things, if you accepted the bitkeeper license, you couldn’t work on competing (i.e. open source) SCM systems

dave: And git provides alternate implementation of (3)?

sean: yes

dave: “mercurial” keeps coming up. Is it good?

sean: it is what Xen uses for development (so I use it a lot)

sean: my own personal experience is that mercurial feels like less of a hack than git, but I don’t use git enough to know if that’s really true or not

dave: It’s in Python? Python is one of my favorite PL’s. It’s the one closest to my all-time favorite, SETL.

sean: yeh, mercurial is in python

dave: I understand that Linus kicked this off by proposing a filesystem change, or a new filesystem. Is that right?

sean: I don’t remember that, but I wasn’t watching all of it that closely

dave: ok – i can sort that out later. thanks for all the help — really appreciate. I know I’m going to *love* being part of the LTC.

Git

Linus decides to write a replacement for BitKeeper (BK). It comes to be called ‘git’. (It is pronounced with a “hard” g; that is, “git” rhymes with “bit”.)

An astounding amount of innovation in a very short time.

Current git home site: GIT – Tree History Storage Tool

Note that it is based in ‘cz’, which means the Czech Republic. Not only can OSS be found everywhere; its development takes place everywhere in the world. For example, the 2.4 Linux kernel is maintained by someone in Brazil.

These Folks are Good!

After convincing myself that Git was for real, that an astounding amount of innovation had taken place in a very short time indeed, I said to myself: If the Linux folks can put together a world-class SCM system in a matter of weeks, just to solve a problem with the tool set they use to build and maintain the kernel, then they must be *very* good indeed at their day-to-day job: making Linux better and better.

Why I’m proud to be an OSS programmer

As noted earlier, the Linux kernel alone is worth at least a billion dollars: that is what it would cost if some
company wanted to hire some programmers and build an operating system of equal quality. There is also much more OSS available:

* the gcc tool chain
* Apache HTTP server
* all the other stuff from Apache Software Foundation (ASF): Xerces, Xalan, Geronimo, Jakarta, …
* Eclipse
* Xen
* All the scripting languages: Perl, Python, PHP, Ruby
* All the emerging content management software, including WordPress, Mambo, etc.
* GIMP
* TeX, the software used to typeset almost all the scientific papers published today.
* and so it goes …

Which means that the programming community has collectively produced a software artifact worth at least three billion
dollars, an artifact that is freely available for anyone in the world to use as they see fit, and at no charge.

What other profession has made such a contribution to society?

Resources

Linux: BitKeeper Is A Commercial Product? (5 Oct 2002)

Interview: McVoy on BitKeeper, Torvalds and Perens (14 Feb 2003)

Feature: No More Free BitKeeper (5 Apr 2005)

Linux: Managing the Kernel Source With ‘git’ (11 Apr 2005)

BitKeeper and Linux: The end of the road? (11 Apr 2005)
