Thread started by davidlu now.

What's been wrought using the Piece Table?

A few months before I decided to try to gain employment with Microsoft back in late 1983, I read about Microsoft Word, which was in its first version for MS-DOS back then.

The feature I read about that fascinated me the most, even more than the idea that you could select text that you wanted to operate on with a mouse, or that text onscreen was laid out as it would look on the page, with bold, italics, underline and strikethough visible in the display, was it's ability to undo the last operation that you'd made to change your document. And then to redo the operation, if you decided you really needed to make that change after all.

The first few weeks that I worked at Microsoft before I began to work on Mac Word 1.0 in June 1984, I was asked to review the manual for the soon-to-ship PC Word 2.0 to make sure it was describing what the application was actually doing, and as a way to become familiar with Word's technology in detail.

When I was able to do experiments with Word's Undo on a copy of Word 2.0 that ran on the IBM PC I was issued, I became more intrigued. No matter how large a text file or Word document was, I found that one could select thousands of lines of a newly opened text file, and copy it into an empty document in a fraction of a second.

No matter how enormous the copied text was, you could undo that change in a fraction of a second, and if you did redo the operation, it was equally fast.

With all of the text editors that I had used up to that point, including some that it been my job to modify and maintain, if you were copying an extremely large selection of text, you had to pay for executing that kind of operation by waiting longer than an eyeblink and sometimes for a good number of breaths for such an operation to complete.

In extreme cases, the wait would be so long, there was time to leave the room to get a soda, or a cup of coffee.

When I used Word I never had to pay that cost.

How did the Word team do that? I was really anxious to find out.

It took several months before my curiousity was satisfied.

At a time when we needed to know, Charles Simonyi diagrammed on his white board how Word edits worked for me and Rob Horowitz, another novice Microsoft engineer assigned to the Mac Word 1.0 (Sand Word) project.

Word's editing speed and ability to Undo and Redo depended on a data structure called a piece table.

J Strother Moore, a computer scientist who worked at Xerox PARC when Simonyi was on staff in 1974, had invented the piece table as a side-consequence of work he did on representing logic clauses within software that proved theorems.

He had realized that when one edited a text, it was not necessary to represent the edit result as one long string of characters recorded in a buffer.

Using that naive text representation lead to the big waits that occurred when one inserted hundreds of thousands of characters into the middle of existing text.

In the naive method, you would have to first make room for the newly added characters in a text buffer by moving all characters at the insertion point to the right by the number of characters that were to be added and then copy the new text into the gap that was created.

In 1970's software, a process that copied hundreds of thousands of characters, cost time that could be measured by watching a clock's second hand, whenever such an operation occured during a program's execution. This was especially noticeable when a large copy operation required that the copied bytes be reflected to a disk file as a measure to preserve space within a program's precious and limited-in-size internal memory allocation.

Instead of moving text in a document into one contiguous string, one could, instead, write a small set of records, consisting of only a few bytes of data, that described how a text string was fragmented into pieces as it was edited.

When text existed already in a file, when that file was opened, a type of record, call it a piece description, could be created that showed where text began in the file, and how many characters were contiguously recorded after that point in the file.

If newly typed characters needed to be added to the middle of that text, that single piece description could be fragmented into two pieces. The descriptors for those pieces, would then point to the location of the first character of text up to the insertion point, and then to the location of the character after the insertion up to the end of the original text.

Then a third record could be created that describes the text that was newly typed, which records where that new text was recorded, which could be in memory somelace or in a temporary scratch file rceorded on the computer's file system. That new record could be placed between the beginning and ending piece that bracketed the point of insertion.

This sequence of descriptive records that would summarize the state of an edited document could be called a piece table.

J Moore doesn't use that terminology in his original description of the idea in a Xerox technical report.

It appears that Simonyi and his engineers who wrote the Bravo word processor were the first to coin the phrase 'piece table' to describe Moore's innovation. I saw that phrase already used to describe document editing data structures in the source code files of PC Word 1.0 and Mac Word 1.0.

The resulting three record sequence would describe the new sequence of characters in the document.

Any further editing would fragment the document further, which would require the addition of more piece descriptors to the piece table to describe each fragment that was created.

If one recorded the original piece record list from before an edit, you could undo an edit by blessing the original piece description that existed before the edit, as the new specification of the document's state.

If you also remembered the after-edit piece description, before doing an undo operation, one only needed to move that small piece table version around in computer memory, rather than moving all of the chararacters they point to, to accomplish a redo.

So that was the answer that satisfied my how-did-they-do-that curiosity: shuffle tiny records that describe pieces in a few bytes of memory rather than shuffle actual buffer contents that could contain hundreds of thousand or millions of characters into a new order.

Were there other properties of the piece table that Microsoft developers were able to exploit for the user's benefit?

One of the banes that early word processor users had to endure was the time that they had to wait whenever they decided to save their documents.

When MacWord ran on early Macs that saved fheir files to that slow variable speed floppy disk drive that shipped with those machines, users always had to wait seconds, sometimes minutes, for that operation to complete, while the disk drive sang and groaned to record the new document file that was being written. Unfortunately, in early versions of Word a save operation forced the abandonment of the piece table representation of a document's character order. Save had to follow the piece table entries and re-represent the document text as that long, long string of characters.

This was where the piper had to be paid. Every character in the document needed to be touched to construct that long string, and every character had to make that trip down to the document's disk image. Once a save ended, the previous piece table was thrown away and a new one consisting of a single piece was created to point to the entire extent of that new text string within he saved file.

Starting with Mac Word 3.0, the beginning of Win Word 1.1's lineage, it was possible to fast save a document. A fast save operation, recorded the piece table structure reflecting the document's current editing state into the document file that was being saved. It would locate any pieces that were newly typed or that were copied from another document, and flushed only those characters and their peroperties down to the saved file.

Imagine a million character document, where the user had typed a 'the' string to correct a grammar mistake in their text and copied one sentence into the document to add support to an argument that they were making in their document. If the document could be fast saved, only the new "the" and the copied sentence content plus properties needed to be copied into the saved file. These fast saved additions accumulated in layers at the end of the original Word document file.

Word preferred to save a document via fast save, if the document had already been completely saved once and if there was still space available in Word's internal memory to accomodate a sizeable piece table. If it could be performed, a small edit to an enormous file could be performed in an eye blink, the performance one expected when software used a piece table to keep track of editing state.

If the Word ran out of internal memory space, one of its go-to-the-lifeboat actions was to force a full save of fast saved documents that had accumulated large piece tables. By spending the time to reorganize the document's text stream so that its piece table could be discarded, it beacame possible to free up enough memory so that the user could continue to type and edit for a much longer time.

A flag in a the header of a saved Word doc, fib.fComplex, was set to fTrue whenever that document had been fast saved. The arrangement of text in such documents WAS quite complex, because adjacent text characters recorded in the file may no longer be adjacent in the document's text stream. This caused competitors who wished to read Word documents fits, I had heard, because they had to get used to the unusual geometry of fast saved text.

It was also possible that characters that were still stored in the document file, had been deleted since those characters were first copied into the file. That meant if a user of the program had typed embarassing or incriminating text in a Word document, saved it and then thought better of publishing such thoughts, and erased them, that once saved text still was recorded in a Word file. If you used an operating system utility to dump the raw content of a fast saved Word file, this stuff became visible to the technically minded nosy person.

These were both disadvanatages of providing a fast save facility in Word.

The overriding advantage that made up for these detrments was that the feature could save an intensive user of Word hundreds of hours of otherwise wasted save time per year, which would have interrupted the flow of their writing and document production. I imagine it's effect, as Word gained large market share in the early 90's and before disk drives sped up massively, might been visible as a tiny sliver of a percent of benefit in econometrician's studies of Gross National Product and national productivity.

Even in the most braggadocious mood, you usually couldn't dream to make such a claim for any feature of a software program. Given the wide adoption of Word around the world, the large number of users involved, the necessity for frequent use of the save operation to preserve a user's word processor work, and the slowness of disk drives for an extended period of years, it's pretty certain that fast saving paid off for society and for Microsoft.

As a Microsoft sofware developer, I could view the code that performed so marvelously, and I could make small changes to that process to improve its performance and add to the flexibility of its execution in different ways, but I could only show that code to other Microsoft employees, because the details of its operation was protected as a trade secret.

Microsoft's piece table implementation remained a secret until last week, when the company allowed the Computer History Museum to publish the source code for Windows Word 1.1 on its website for the curious to view.

Despite the fact that Moore's piece table invention was crucial to the development of Bravo and Bravo X at PARC and later Microsoft Word after Simonyi moved to Microsoft, the data structure is little noted in public literature.

It is given a not-very specific, not really adequate, two sentence description in Wikipedia. Moore's invention of the idea, and its use as foundational technology within Bravo and Word is not mentioned there.

It's not Wikipedia's fault. The utility of the piece table could not be verified until instances of these algorithms became available for technical review.

The discussion page beneath this article indicates that it is rated Low Importance by the WikiProject Computing group within Wikipedia, which is tasked with improving the coverage of computers, computing, and information technology on Wikipedia.

Given the importance of this data structure to the performance and flexibility of operation of Microsoft Word, a software application that has been used by billions, perhaps this article has been misrated.

What's been wrought using piece tables? Answer: The Word formatted documents created by the billion or so peple who have used different version of Microsoft Word since it's introduction.

I haven't spoken yet about another aspect of piece tables. There's a field in a piece description, the pcd.prm, that records any formatting change that has been made to the character, paragraph or table properties within the run of text in the document described by that piece description, since the time that piece was first created.

This 16 bit quantity, if it is non-zero, can point to a heap block in Word's internal heap that describes how the properties of that block differ from the default settings of those properties.

Using this scheme, hundreds of bytes of property data for a piece could frequently be specified using only a few bytes of descriptive data.

This whole aspect of Word's design is pretty amazing and deserves study. I promise another article that will describe how this part of the design worked.

A tour through the piece table related code for software developers

To see the definition of a single piece table descriptor in the Word 1.1 archive, look for the definition of PCD (piece descriptor) in Opus/Wordtech/doc.h.

The complete piece table for a document was stored in a structure called a PLCPCD. A PLC was a dynamic generalization of an array (dynamic means its size could grown and shrunk under program control during execution) that packaged into one heap block an array of character positions (CPs) in one-to-one correspondence with an array of PCDs, which was recorded immediately after the array of CPs in the heap block.

One could not write a single C expression which would give you access to the data strored past pcd.rgcp. One is not allowed to declare a variable length array that follows another variable length array inside of a C structure definition.

Instead one could write generic get and set access routines, that would do calculations to reach below the rgcp allocation and grab or stuff a data payload in its own slot within the heap allocation.

You'll find definition of the PLC also in Opus/Wordtech/doc.h. You'll see that a DOD (document descriptor structure defined in doc.h) defines a hplcpcd, which is a document's piece table. PLCs are created by HplcInit() in Opus/wordtech/clsplc.c. GetPlc() and PutPlc() are the getter and setter routines for that retrieve and send a payload data structure to a PLC.

FOpenPlc(), which grows and shrinks PLCs, and FInsertInPlc(), which adds a new payload structure to a plc at a given CP coordinate, are found in Opus/wordtech/clsplc.c

The distance between two adjacent character positions in the array of CPs was the size of a piece.

Because it takes n+1 points to determine the bounds of n intervals, the 1-to-1 correspondence maintained within a PLCPCD was between beginning CPs of pieces and the PCDs in the following PCD array.

The final CP in the CP array was only an end of piece coordinate. Since it did not start a piece, unlike all of the earlier CPs in its array, it did not have a corresponding PCD structure recorded in the array of PCDs.

Most of the primitive routines that acted on a piece table can mostly be found in Opus/edit.c.

The routine that split a piece table at a specified CP position in preparation for insertion or deletion was called IpcdSplit().

The routine that created a new piece to describe text that was newly typed and inserted that new piece into the piece table, in order to replace a target CP range, was called FReplace().

Using FReplace() to insert a zero-length piece to replace a CP range, performed a text deletion operation.

The routine that was used to copy a run of characters from one document to another, perturbing the destination piece table to record the edit, was called FReplaceCps().

The routine that created a single entry piece table that described the entire range of text within a full saved document was FInitPlcpcd() in Opus/create2.c

The routine that performed a fast save operation was FQuickSave(), which can be found in Opus/wordtech/savefast.c

If you have any questions about any of this, leave them in the Disqus comment thread below. I'll answer as I notice them.

XML