
Making Rich Text Rich: Word's Property Storage Scheme

What is a property?

In the everyday life of human beings, a person's property is the collection of objects and assets that belong to them. For a person, a property is one of the things that they own.

The idea of chemical properties stretches the property idea to mean important descriptive characteristics that belong to the different objects of study in the field of chemistry.

In the world of chemistry, a chemical property is one of the characteristics of an atom or a molecule that describes that kind of object, which allows it to be identified, and which predicts how it will act in a chemical reaction or physical action.

In the early days of chemical study, chemists paid attention to a substance's physical characteristics: color, smell, whether it was a gas, liquid, or solid at room temperature, its density, whether the substance was metallic or electrically conductive. If the chemist was brave, how it tasted. After enough experience was accumulated via taste tests, whether a substance was poisonous or not.

As chemists' understanding grew, they began to pay attention to freezing and melting points, a substance's specific heat, the spectrum of colors that samples of the substance emitted when heated to extreme temperatures, which elements and substances easily reacted with one another, and the amount of heat that was liberated by a reaction, or how much heat had to be contributed from an external source to allow a reaction to take place.

Once chemists understood that elements were all made of different kinds of atoms, and that molecules were atoms bound together in unique structures, they began to pay attention to atomic number, atomic mass, the types and strengths of chemical bonds, structure diagrams of molecules, an atom's electron orbital structure, the energy levels of electrons bound to the atom, the atom's ionization energy, the atom's valence electrons, and an element's family identification in the periodic table of elements.

Once these kinds of things were known about atoms, it became possible to predict how atoms and molecules would interact from first principles, not merely recording the results of millions of experiments for later lookup.

All of these characteristics that became important for description and identification and for the prediction of action and reaction are called chemical properties. There's no final bound on the number of properties that belong to chemical objects.

The number grows as chemists' theories grow and complexify as they try to understand what is happening in the phenomena that they study.

What are properties in a software system?

Abstractly, properties are named entities that can be assigned a value for some kind of object. These entities are attached to or associated with object instances in some manner. Software properties are said to belong to a software object, just like an item of personal property belongs to a person.

Concretely, properties are used to record important facts about a particular object instance. This follows the usage of scientists in other fields who named and discussed different kinds of chemical, physical, or biological properties of their objects of study.

Taken together, the set of recorded facts (property values) for objects realized as software can be used to identify object instances, allowing them to become targets that may be searched for or operated upon, and can also be used to set rules that describe and modify how a particular object is allowed to act within a system.

What are the important objects that a word processing system operates upon?

Documents. The characters recorded in a particular order within a text stream of a document, which can be emphasized in different ways. Paragraphs, which partition the text stream into units that have certain presentation rules. Sections, which partition adjacent paragraphs into units that can be grouped into larger structures akin to the chapters of a book, with their own presentation rules. Table rows, which can optionally be imposed on a set of adjacent paragraphs and which define a set of top-aligned, rectangular cells that maintain their own text flows in a horizontally arranged array.

What are properties of those word processing objects?

They are a set of named items that record all of the rules that can be set independently for any character, paragraph, table row with the cells within it, or section within a document.

A document itself can have properties that set rules (encrypt this document or not, use this character set), but the larger portion declares information that establishes the identity, history and ownership of the document.

Character properties describe how a character looks or is presented when it is displayed in its paragraph: its font number, which can be mapped to a unique font name, its font size, the font face (bold, italics, small caps, all caps, underline, strikethrough), inter-character spacing, superscript and subscript, or type of underline.

Paragraph properties set the indentation rules for the paragraph, the text alignment, whether a first line indent will be maintained, absolute or relative line heights, space between lines, the justification rules for text in the paragraph, how much white space to show before or after the paragraph, whether the paragraph will be displayed within a bordered area and with what kind of border, whether the paragraph will be forced to appear at a fixed location on a page and, if so, how the text of neighboring paragraphs will flow around it, how many tabs will be defined for the paragraph and what type they will be, whether text will flow from left to right or right to left, and whether characters will flow down or up. The list is large and ever increasing as new versions of a word processor are published.

Table properties define the number of text cells that will be maintained in a table row, their horizontal extent, the amount of white space that will be maintained between cells, and whether a cell will have a border and, if so, what kind.

Section properties define the margins that will be maintained as section text is laid out on a page, the number of columns of text that will be maintained within the section, whether the text on the page will be laid out as facing pages with a gutter maintained so the printed text may be bound into a book, whether or not header and footer text will be displayed, and the location and dimensions of the areas that will show that material if it is to be displayed.

The great challenge for software developers who create word processors

In a word processor, the properties of any character in a document can be set independently for each character, the properties of any paragraph can be set independently for each paragraph, the properties of a table row can be set independently for each table row, and the properties of a section can be set independently for each section.

This last paragraph is easy to state, but it sets a terrifying challenge for a software designer.

Word processing documents in a well designed system can record tens of millions of characters in a document. If a user records little but paragraph marks in their document, the resulting document can contain tens of millions of paragraphs also. When a software system vomits out an enormous flow of data, that data flow can be presented in a tabular form that could contain millions of rows per table.

There is a limit to how many chapter-like entities are required in a document, but a person who feels like making demonic demands could ask to define tens of thousands of sections in a document.

The user of a word processor can add, delete, copy, paste, and cut text from the document and edit that document text at any instant, with the expectation that those operations take place within the time it takes to press a key on a keyboard, to dispatch a command on a menu, or to submit the contents of a dialog.

And despite the fact that none of the objects in a document have a fixed address - they slide forward and backward in unison past the point of edit as the user types or backspaces, with a steady flow of characters winking into and out of existence - all of the properties they own must stay associated with their owning object, and new characters or paragraphs must inherit their properties from their neighbors so that authors are not interrupted as they work.

What is 'rich text'?

The individual objects in rich text can each carry and own their own very large set of properties that specify requirements that must be enforced by the word processor as the document is typed, formatted, published or printed.

Every time a user sets or changes a property of a word processing object, it's as if Jean-Luc Picard makes a gesture at newly marked up text and says "make it so". Upon the command, the word processor goes to work to make it so.

How does one measure the richness of text? Just as a person's wealth can be measured by their number of possessions or their ability to marshal their assets to make a large number of different things happen easily in sequence or at the same time, text is rich when an arbitrarily large set of requirements can be attached to the various text objects in the document with the expectation that the word processor can satisfy all of those independently set requirements all at once, all of the time.

Design goals for Microsoft Word, a rich text editor

Microsoft Word was designed to allow its users to create rich text documents that could grow as needed, without imposing limits on the user's activities. Word documents could grow to contain several million characters once the Mac Word 3.0 file format was introduced in 1987.

We wanted to give Word users the impression that a document could be expanded along any of its dimensions (number of characters, count of character property changes within the document, number of paragraphs, complexity of paragraph property changes, number of tables, number of rows within tables, number of footnotes, number of annotations, number of fields, number of sections/chapters) without hitting a limit that would prevent them from continuing to write and edit within that document.

There was a hard upper limit for each of these dimensions in Word. Our aim, though, was to set each so high that a user would never be able to reach them without automated assistance.

For the number of footnotes, annotations, sections, bookmarks, fields, and outline levels in a document there was a theoretical maximum of 65535. The data structures that kept track of these objects contained counts that were 16-bit quantities. The largest number that can be stored in 16 bits is 65535. Even for extremely involved documents, this limit for these kinds of structures was likely 20 to 50 times larger than necessary for documents that could be created by a typing human.

The maximum limits for count of characters, character property changes, number of paragraphs, number of tables, number of table rows, needed to be set at least a thousand times larger than the number of possible references one could create within a document.

The largest novel written in European languages topped out at 2,100,000 words. Assuming 6 characters per word, a word processor would need to be able to record approximately 14.7 million characters to represent such a text (the six characters per word plus the space character that delimits words and the commas and periods which delimit phrases and sentences).

The marketplace didn't demand that it be possible for every character, paragraph, table row, or section in a document to own its own properties, independent of any others in the worst case. For simplicity of operation and of explanation, though, those limits needed to be set to extremely large values.

A competitive battle that Word won in the mid to late 80's: the market seemed to award the prize to the team that was able to relax its application's document limits more than the competition could.

In the word processing competitions of the mid to late 1980's that played out on the Macintosh, it was devastating to an application's prospects whenever it was discovered that its maximum document size was set too small due to limitations in its design.

Full Write Professional, a Mac word processor that competed seriously with Mac Word from 1987 through early 1989, was early to provide an editable page view for its documents that always showed how a document would be printed as it was typed, a facility that Word could not provide until its 4.0 release in early 1989.

External observation of Full Write's implementation indicated that the application required that its document data structures reside in internal memory, which meant that when it was run on the Macs of that era, which frequently shipped with only 1 or 2 MB of memory, editing would abruptly come to a halt when one of its internal memory limits was reached. These limitations also affected the number of large documents that could be open and visible at one instant using Full Write.

In the Macs of 1987-88, experiment showed that Full Write seemed to have enough resources to allow a single document to grow to 150-200K in size.

If graphics of any size were used in the document, available space for representing document text evaporated. A single Mac picture file could take 20 to 100K to represent, which would quickly use up space within an in-memory document representation.

Mac Word won this competition. Word's property representation was extremely compact and kept most property descriptions (and picture inclusions) on disk in the document file or a scratch file, until they were needed just for an instant to display text onscreen or to paginate a document.

Word could scroll through a document at a higher speed and could keep multiple large documents open at once, due to this design. In the market of the day, those capabilities of Word's trumped Full Write's early ability to provide an editable page view.

Word's implementation also built a reputation for stability and reliability compared to FullWrite once Mac Word 3.01 was released.

The 3.0 release of Mac Word was shipped with a number of serious bugs. Word's reputation was threatened during the several month interval that elapsed before 3.01 could be shipped.

When Mac Word's property scheme was extended to describe the format of table rows to support the creation and editing of tables, and Word acquired its editable page view for its 4.0 version, Full Write was unable to respond, which ended the competition.

During Word's development, it ended up being easiest to put questions of document size entirely to rest and create a design that imposed no limits along these dimensions. No matter how many characters (or paragraphs or table rows or sections) could be added to a doc, Word's property recording scheme was capable of handling the worst case for that document size (the maximal number of properties changing for every character, paragraph, table row or section in the doc).

Word's scheme was fast enough to record property changes instantaneously as a document was typed. It was able to fetch properties fast enough that a full window slice of a document could be laid out in a fraction of a second, and also fast enough to allow an entire document to be paginated in a small number of seconds.

The magnitude of the storage and retrieval problems that needed to be handled - the worst and then more reasonable cases for character properties

The CHP (pronounced 'chip'), Word's CHaracter Property block, by the time of Win Word 1.1 contained 29 fields packed into a 10-byte block. In the worst case, Word would have to record a different group of CHP settings for every character in the file. The fields of a CHP report the display characteristics of a character (bold, italic, underline, strikethrough, font type, font size, etc.).

Imagine a document created by Word. Now imagine that every character of it was made to have a different look from any of its predecessors.

If your document grows past a few thousand characters in size, it becomes a large task to create a document that follows this rule. Few would have the patience to do such work. The payoff for building such a document is small also: the text that results would be extremely unpleasant to read. Human readers are used to having the look of text change only when the author wants to change emphasis of words. Changing the look of every character is truly excessive.

It's possible to carry this excess to a truly wretched degree: make sure that every possible character property setting changes for every possible character. Boldness, italicness, strikethrough toggle on and off with every character. The font and font size changes with every character. The character spacing changes. Subscript and superscript amounts cycle with every character. Whenever underline changes, the style of underline also changes.

Win Word 1.0, and all versions after that, shipped with a version of Basic that could operate on all details of a document via macro programs. It's not hard to write a program that creates one of these worst case documents of a desired length.

Word should be able to handle these kinds of cases with ease, although you may notice that text drawing slows, as Word's format and display routines are maximally loaded by such examples.

In a more typical case, there might be 10 changes of character look on average within a paragraph, so that important words and phrases in a text can be emphasized.

Let's emulate a Word software designer, and perform mental experiments to find a representation that will allow us to efficiently record the character properties of a document. The goal here is to figure out how large of a document could be grown as the number of property changes gets large, if Word's property information were to be stored entirely within its managed area for allocatable, deletable, growable, and shrinkable data structures, the heap storage within its internal memory allocation.

If it ends up that large enough documents can't be represented in the heap, we'll consider methods for storing that data within files and see how far that idea can take us.

The internal heap used in Mac Word 3.0, the first application that used the Win Word 1.0 internal design, had a by-design limit of 64K. In 3.0, the heap was not even allowed to grow to full size. I believe it maxed out around 48K.

It was a design goal for that app to be able to have three or four document windows open at once. Each of those windows would need to share heap storage, so you would have a maximum budget of 16K for each window's heap usage, assuming three windows were allowed to be open at once.

Let's try a naive, space extravagant character property scheme to represent a worst case document and see how bad it can get.

Assume that we store entire copies of the CHP, for every character, where each character has a different CHP setting. How many copies would fit in that heap allocation?

Answer: approximately 16000/10 = 1600.

This means we could only allow the user's document to grow to 1600 characters before we'd tell them that we have to stop processing, because we can't represent a larger document.

A disturbing result. In the worst case, we would only have room to describe the look of one and a half windows' worth of characters, about half a page of printed text. I can tell you that the CHP is the smallest of Word's property blocks by about a factor of ten. Where will we find room to store all of a Word document's other classes of property storage, when memory is so tight?

I'll translate a byte count into a page count estimate a number of times in this article. For documents written in English, you can estimate that one page is close to full when it contains approximately 3000 characters, including spaces and punctuation.

Let's pay attention to how users use character formatting in a more typical document. Maybe that will give us an insight that will let us represent properties more efficiently.

Perhaps a more usual, and more useful case, won't look so hopeless.

We estimated above that Word users in a typical document might change the look of characters within a paragraph 10 times. This is not an average case: we'd need to collect lots of statistics to estimate that, but that figure does seem typical for a technical document that defines a lot of terms.

Let's guesstimate that an average paragraph might contain 1000 character slots, consisting of printable characters and the spaces between words plus any punctuation characters.

Let's visualize how a doc like that would look. The paragraph would probably start using the normal character look for the paragraph, and then for purposes of emphasis, a single word or a several-word phrase would be given boldness or italicness, or an underline. The emphasized text might vary from 5 characters to perhaps 20 to 25 characters. Assuming our average figure for emphasis, the text would revert back to the normal look and then be emphasized 4 more times. Assuming this density of character property changes, our document would record a character property change every 100 characters on average. In reality, the length of normal text and of emphasized text sections would vary between transitions.

Can we take advantage of this property clumpiness to reduce the number of property blocks we need to store in order to describe what's going on in the paragraph?

Who would have thought that properties 'ran' in word processing documents like luck itself?

We're going to name the segments between character look changes 'runs of text'. This is the same word usage that causes one to call an unbroken sequence of winning bets in a game of chance a run of luck. In the realm of word processing, this characteristic of property settings, that they run on for a number of instances before changing to a new value, is good luck that allows a designer to compress his property representations considerably.

Can we take advantage of the characteristic that property descriptions clump into runs? Yes, very likely. This idea of exploiting the existence of run patterns in data occurs all over computer science.

We'll use the luck of the run to improve our property storage scheme

Let's collect the character position coordinates within a document's stream of text, wherever property settings change, into an array of ascending character positions. Using this idea, we no longer have to record a CHP for every character, just one for every transition coordinate recorded.

Does this buy us much?

First, how large does our character position coordinate have to be? A two byte coordinate would only allow our document to grow to hold 65535 characters. That seems too small. A lot of our customers' documents are much larger than 20 pages, roughly equivalent to 65535 characters. The next larger integer size that our compiler allows is 4 bytes (32 bits). That's a number larger than 4 billion, when the number is an unsigned quantity. When no bit of the 32 is used to represent the sign of a number, we eliminate the possibility of negative numbers so that we are able to double the size that we can count to.

All right, we'll need to accept 4 bytes of additional overhead for a beginning-of-run coordinate for each CHP block stored. But instead of storing 100 CHPs of 10-byte length for every 100 characters, we could instead store one CHP plus one run coordinate of 4 bytes length, since properties change on average every 100 characters.
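
To make the bookkeeping concrete, here's a minimal C sketch of one entry of that transition array. The names and exact layout are invented for illustration and are not Word's actual declarations.

    #include <stdint.h>

    /* One entry per property transition: a 4-byte character position paired with
       the full 10-byte CHP that applies from that position up to the next
       transition. (A compiler may pad this struct to 16 bytes; the information
       content is 14 bytes per transition instead of 10 bytes per character.) */
    typedef struct RUNCHP {
        uint32_t fcFirst;      /* character position where this run of text begins */
        uint8_t  rgbChp[10];   /* the complete CHP in effect for the run */
    } RUNCHP;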

Note that we are making the worst case much worse. If we have to record a property change for every character, we'll be storing 14 bytes for coordinate plus property rather than 10 for the property only. It's a good trade to make the worst worse, if we can make typical cases much better.

Do the math: 100*10 - 1*(10+4) = 986. So over a length of 100 characters stored in the document character stream, we'd save 986 bytes of CHP storage out of the 1000 bytes that would have been used in the previous implementation.

How large of a document can we handle now inside of a character prop allocation that maxes out at 16000? The new max of characters would be (16000 / 14) * 100 = 114,285. This is equivalent to a 38 page document. And remember we haven't made any accommodations for storing paragraph, table or section properties yet. Perhaps our true CHP storage reservation can only be 1/4 of that estimate, which means we have a hope of storing a 9 page document when the other property types are heavily used.

If we had chosen this representation, Word could have been used to prepare large memos but not the book-size documents that we had in mind.

This was a limitation that MacWrite suffered from that was never remedied. If we had stopped here, we would have been able to rival MacWrite but not surpass it in document size.

We need a new idea. Think of the stylesheet structures that Word stores in every document. Does that give us any leverage?

Strunk and White's famous style guide for writing, "The Elements of Style", prescribes for authors the dictum "Omit needless words". That idea has as great a value for software developers creating writing tools as it does for writers who use all of their power and skill to craft their text.

The art of the effort to compress property information lay in the ability to reorganize property data into a new representation that removed the need to record many previously necessary words of storage, words which we could safely omit whenever we saved a document, recorded the properties of newly typed text, or recorded property changes for existing text.

You'll see what effect weaving a stylesheet into Word's document format had on our ability to compress property information.

We had to be very careful to keep Word's user interface simple while we integrated the stylesheet idea into Word's property recording mechanism. We would have lost many customers if it became a requirement that users had to understand and master stylesheets, and their indirect way of specifying formatting, before they could do anything useful with Word.

For that reason, we tried to hide in the user interface the fact that, by default, a Normal style was being set for any paragraphs that they typed in their document.

That Normal style was always present in any new document, could not be deleted, and was set to reasonable default settings. Using that convention, users could start to apply styles to their paragraphs, once they finally understood the concept, if they ever did, but they could produce useful documents from the start without being troubled by the fact that Word was labelling their paragraphs with style codes, and using that to encode and compress the property settings of their document when it was saved.

A stylesheet in Word is a structure that specifies the default paragraph and character properties for a paragraph. One of the properties that belongs to any paragraph in a Word document is an stc, a numerical style code coordinate. Given an stc, there is a function in Word that will return the CHP and PAP that's defined for that slot of the stylesheet.

The stylesheet CHP value seems as though it is perfectly set up to specify how character text looks when it is unemphasized. What differs between a CHP for unemphasized text and one that expresses an emphasis?

Sometimes, the difference is a change of a single bit, or two or three bits. Bold, italics, small caps, and all caps are frequently textual signals for emphasis in text. How are the specifications for those character looks encoded in a CHP?

CHPs have special characteristics that will allow us to use an extremely effective encoding/compression idea

All of those toggling text properties are encoded as single bits recorded in the first byte of the CHP. Is there an idea that we could use that would allow us to carry a CHP's information in a smaller space?

Let's try to see if we can change the meaning of CHP fields so that they can express the difference between two CHPs rather than simply stating the information content of a single CHP.

The solution that Charles Simonyi invented as he designed the Mac Word 3.0 file format was not easily discovered, but is very nice. It allows a CHP structure that expresses the difference between two CHPs to be compressed down to a single byte in many cases.

9 of the CHP fields are stored as single bit values, which can store a 1 or 0 value. When a single property value is set to 1, it means that a character's look will be changed to show that property. A 0 means that the character display is unaffected by that property. (For example, chp.fItalic is set to 1 when a character is to be displayed using italics.)

Eight of these bits reside in byte 0 of the structure and the remaining one is in byte 1. For the purpose of expressing a CHP difference, we'll change the meaning of those bits so that a 1 means that bit setting is the opposite of the value stored in the reference CHP, and a 0 means that the bit setting of the CHP being compressed is the same as the value in the reference CHP.

There are seven other fields in the CHP that require more than 1 bit of storage. Add seven status bits, one corresponding to each of the multi-bit fields.

A status bit will be set to 1 if its multi-bit field, stored later in the CHP, is different from the setting stored in the reference CHP, and 0 if it is unchanged from the reference.

Now make sure that all of the multi-bit fields are initialized to a 0 value. Then, for each multi-bit field of the CHP being compressed that differs from the reference CHP, copy its value into the CHP that is being built to express the difference between the two.

Having built the difference CHP using these rules, we can safely drop the trailing zero bytes and store only the non-zero prefix of bytes at the beginning of the structure to represent the difference CHP.

From now on we're going to give these difference CHPs their own name. We'll call them CHPXs (character property exception blocks). Pronounce CHPX as 'chip-ex'. They are called exception blocks because they record what is different and exceptional between a character's properties and the default character properties for the paragraph it's a part of, which is recorded in the stylesheet.

The way you would reverse the compression and "add water" to reconstitute the compressed CHP from this stored difference CHPX would be to:

1) First, clear a CHP structure to 0 values and then copy the CHPX prefix to the front of that structure. Copy the reference CHP structure from the stylesheet to another CHP instance that will be used to store the reconstituted CHP.

2) Exclusive-OR the first 9 bits of the difference CHPX onto the reconstituted CHP buffer. Exclusive-OR toggles the bits of the reconstituted CHP, when the corresponding bit in the difference CHP is 1. When the bit in the difference CHP is 0, the reconstituted CHP is left unchanged.

3) If a status bit for the multibit fields is set to 1, copy that multibit value from the CHPX to the reconstituted CHP.
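
Here is a minimal C sketch of both halves of that process: building the difference CHP and then "adding water" to reconstitute it. The 10-byte layout, the field names, and the placement of the status bits are assumptions made for illustration; Word's real declarations differ, but the mechanics (XOR the toggle bits, copy only the changed multi-bit fields, keep the non-zero prefix) are the ones described above.

    #include <stdint.h>
    #include <string.h>

    /* Illustrative 10-byte CHP: byte 0 carries eight toggling properties, bit 0
       of byte 1 carries the ninth, and bits 1..7 of byte 1 are used in the
       difference form as "field changed" status bits. */
    typedef struct CHP {
        uint8_t grpfBits0;    /* fBold, fItalic, fStrike, fSmallCaps, ... as single bits */
        uint8_t grpfBits1;    /* bit 0: ninth toggle; bits 1..7: status bits (difference form only) */
        uint8_t ftc;          /* font code */
        uint8_t hps;          /* font size in half points */
        uint8_t kul;          /* kind of underline */
        uint8_t ico;          /* color */
        uint8_t hpsPos;       /* sub/superscript amount */
        uint8_t iss;          /* superscript/subscript state */
        uint8_t dxaSpaceLo;   /* inter-character spacing, low byte */
        uint8_t dxaSpaceHi;   /* inter-character spacing, high byte */
    } CHP;                    /* 10 bytes */

    #define CB_CHP ((int)sizeof(CHP))

    /* The seven multi-bit "fields" of this sketch, as byte ranges within the CHP. */
    static const struct { uint8_t ib, cb; } rgfld[7] = {
        {2, 1}, {3, 1}, {4, 1}, {5, 1}, {6, 1}, {7, 1}, {8, 2}
    };

    /* Build the difference form of pchp relative to the stylesheet CHP pchpRef.
       Returns the length of the non-zero prefix: that prefix, preceded by a
       one-byte count, is what gets stored as the CHPX. */
    int CbBuildChpx(const CHP *pchp, const CHP *pchpRef, uint8_t *rgbChpx)
    {
        const uint8_t *pb    = (const uint8_t *)pchp;
        const uint8_t *pbRef = (const uint8_t *)pchpRef;
        int ifld, cb;

        memset(rgbChpx, 0, CB_CHP);

        /* Toggle bits: a 1 in the difference means "the opposite of the reference". */
        rgbChpx[0] = pb[0] ^ pbRef[0];
        rgbChpx[1] = (pb[1] ^ pbRef[1]) & 0x01;

        /* Multi-bit fields: copy a field only when it differs, and mark its status bit. */
        for (ifld = 0; ifld < 7; ifld++) {
            if (memcmp(pb + rgfld[ifld].ib, pbRef + rgfld[ifld].ib, rgfld[ifld].cb) != 0) {
                memcpy(rgbChpx + rgfld[ifld].ib, pb + rgfld[ifld].ib, rgfld[ifld].cb);
                rgbChpx[1] |= (uint8_t)(1 << (ifld + 1));
            }
        }

        /* Only the non-zero prefix needs to be stored. */
        for (cb = CB_CHP; cb > 0 && rgbChpx[cb - 1] == 0; cb--)
            ;
        return cb;
    }

    /* "Add water": rebuild the full CHP from a stored CHPX prefix of cb bytes
       and the stylesheet CHP for the run's paragraph. */
    void ReconstituteChp(const uint8_t *rgbChpx, int cb, const CHP *pchpRef, CHP *pchpOut)
    {
        uint8_t rgb[CB_CHP];            /* zero-extend the prefix back to full length */
        uint8_t *pbOut = (uint8_t *)pchpOut;
        int ifld;

        memset(rgb, 0, CB_CHP);
        memcpy(rgb, rgbChpx, (size_t)cb);

        *pchpOut = *pchpRef;            /* step 1: start from the stylesheet CHP    */
        pbOut[0] ^= rgb[0];             /* step 2: toggle the single-bit properties */
        pbOut[1] ^= rgb[1] & 0x01;

        for (ifld = 0; ifld < 7; ifld++)        /* step 3: copy fields marked changed */
            if (rgb[1] & (1 << (ifld + 1)))
                memcpy(pbOut + rgfld[ifld].ib, rgb + rgfld[ifld].ib, rgfld[ifld].cb);
    }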

What can we notice about the performance of this compression/decompression idea?

When the properties of a CHP run are exactly the same as the stylesheet CHP for that run's paragraph, not even a bit of the CHPX needs to be stored.

When only some of the single bit properties are set in the difference and none of the multi-bit fields are different, the CHPX prefix can be stored in a single byte.

The most likely to change multi-bit fields are positioned as close to the front of the CHP structure as possible. If a multi-bit field is the only difference from the stylesheet CHP, it's pretty likely that more than half of the CHP space will be left set to zero at the end of the CHP, allowing a CHPX to be less than half the size of a complete CHP.

To take advantage of the idea that we can compress CHPXs down to their non-zero prefix, we will need to keep a single-byte count to track the size of every CHPX prefix we store, kept in correspondence with run beginning coordinates in this new scheme. It's pretty rare that text would be emphasized in a paragraph by changing font, font size, or inter-character spacing. It's somewhat likely, although a rarer occurrence, that some character might be sub- or superscripted or underlined. Let's guess that on average a CHPX is 1/3 the size of a full CHP.

What does that buy us for maximum document size? 1/3 of a 10-byte CHP (approximately 3.33) plus a one byte length count: 4.33 bytes

(16000 / (4+4.33)) * 100 = 192076 (remember 16000 byte heap allocation, assume a CHP transitions every 100 characters, 4 bytes character coordinate plus 4.33 bytes estimated for each CHPX compression)

192076 is a respectable document size, around 64 pages of text, but many customer documents need to be larger than this. The Mac Word 3.0 spec document was larger than this size, if I recall correctly. It was one of the earliest documents we could read with an early version of the software.

If we divide by 4 to allow four documents to be open at once, we are way too small still.

We've come pretty far already and it's not obvious that more space can be squeezed out of the CHP using coding and compression techniques.

We have to conclude that a document's CHPs cannot be recorded within in-memory heap storage, if we wish to allow users to create the documents that they frequently have need to create using their word processor.

Instead we'll try to store CHP info inside of disk files, and bring what we need into memory when we need to operate with it.

The maximum distance from the beginning of a file that Word could access was 16,777,215, a limit set by declarations made in the paging algorithm that Word used to access files. This constant was called fcMax in the Word sources.

Those paging system declarations could have been changed easily to allow larger documents to be created, but this document character size limit seemed to make sense for the Win Word 1.0 release.

Let's see how large of a document we could create, assuming an average of one property change per 100 characters:

(16,777,215 / (4+4.33)) * 100 = 201,407,143

That's larger than Word's paging limit, which means the full 16,777,215 limit can be written. This is ample space, many times beyond our customer's expectation, no matter their endurance as they create their document.

We do need to record properties in the document file, in order to allow users to create large sized documents. By deciding that, we've entered a new land that promises ample storage resources.

This figure is 12 times fcMax, which says that using this newest scheme, we could handle twelve times more property changes on average per 100 document characters, approximately a property change every 8 characters, without affecting the maximum number of characters that a document could record.

Storing properties on disk gives us plenty of headroom to allow Word users to create gargantuan-sized documents.

Having stored properties on disk, we can't bring them back into memory all at once. We can't afford that anymore.

Let's think about the particulars of keeping our property structures on disk. If we allow these structures to grow to be millions of characters in size, there's no way that we can allow Word to read the entire stored set of property information, its index plus its matching compressed block of properties, into memory at once.

In fact, the temporary in-memory storage that is made available to one executing procedure in Word probably shouldn't be allowed to grow larger than a few thousand bytes. This means we need to have a way to access sections of the property information in chunks that are less than that temporary storage limit.

The initial idea we had for recording character properties on disk was to write an array of 32-bit (4 byte) file character positions (one for each property transition), and then after that a group of CHPXs (a grpchpx, in Hungarian notation, pronounced 'group-chpx' or 'group of chpx') to store one CHPX for each of the property transitions recorded.

A group of CHPXs would precede each compressed CHPX with a byte count that records the non-zero prefix size for that CHPX, and then would record the CHPX prefix of that size. Each succeeding CHPX with its preceding byte count would be located immediately past the end of the preceding entry.

There's a problem with this idea that we need to handle: since each CHPX prefix may be recorded with a different size, it's not possible to predict where the nth entry of the grpchpx begins within the structure. To locate it, one needs to traverse each individual prefix by reading its length count and skipping past its body to reach the next one, and to continue the traversal until the nth element is reached. The worst case search through a million-element grpchpx structure could take minutes.
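
A tiny sketch (invented names) of what locating the nth entry requires under that layout; with the grpchpx paged in from disk a piece at a time, this linear walk is exactly what makes the worst case so slow.

    #include <stdint.h>

    /* Each stored CHPX is a one-byte length followed by that many prefix bytes,
       so finding entry n means walking past every entry before it. */
    const uint8_t *PbNthChpxInGrpchpx(const uint8_t *grpchpx, int n)
    {
        const uint8_t *pb = grpchpx;
        int i;
        for (i = 0; i < n; i++)
            pb += 1 + *pb;        /* skip the count byte plus the prefix it counts */
        return pb;                /* points at the count byte of entry n */
    }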

We did trade a lot of time to gain a tremendous amount of storage space, when we moved to representing character props on disk. We used to have instantaneous access to any part of the character property storage whenever we wished.

Maybe we'll be able to focus our attention on only the data we need for a particular instant and ignore the rest without penalty. That would make up for a lot of the apparent time hit we've just taken on.

Remember we are not doing our search through an in-memory data structure anymore. We've accepted a performance penalty by storing our data on disk, which contributes to the slowness of our process. We'll have to read that group of CHPXs in sections from the disk into memory to examine the entries that it stores. For computers of the 1980's, disk fetches were thousands of times slower than processing data directly in memory. So examining all of the CHPX data to locate the nth CHPX is going to be amazingly slow in the worst case.

We'd like to change how we package property information to make the process of finding the property block corresponding to an fc coordinate nearly instantaneous, if that's possible. Word could end up doing this kind of lookup many times a second in many cases. This process needs to perform like lightning.

We need to package our property info into chunks so we're not swamped by a tsunami of data when we try to read property info from disk into memory.

Try this idea: choose a page size that's able to store at least one of the largest property exceptions that we'll ever be forced to store, plus room for at least two FC coordinates (enough to store the fcFirst and fcLim for a single run) plus any other overhead bytes necessary to make the storage scheme work.

I already know that the largest property exception block that we'll ever try to store using this scheme will be 462 bytes long and that only a single count byte suffices to record the number of coordinates plus property blocks within a page. We'd like to use a page size that is a power of 2, so our correct page size choice should be 512 bytes.

We're going to call our page an FKP (a Formatted Disk Page). Here's how we'll use the space. We're going to use the byte that begins at offset 511, the last byte of the structure, to record the count of runs stored in the page. We'll call it fkp.crun. fkp.crun should always be 1 or greater for any FKP recorded in a document. The maximum number of runs that can be stored in an FKP is 102 (512/5).

We'll record the file coordinates (FCs) of whatever runs are recorded in the page as a range of fcs (fkp.rgfc). There will be one more FC recorded than fkp.crun, because we need to know the limit fc of the last run stored in the page. (It takes n+1 boundary coordinates to record the extent of n runs.)

To get instant access to the property blocks stored within our page, we'll store an array of bytes beginning at offset (fkp.crun+1)*4 in the FKP (this array adds one byte of overhead for each run), where each byte stores the word offset that locates the beginning of a property exception stored within the page. For any of the word offsets we store in this table, we'll multiply it by two before we add it to the beginning address of the buffer to calculate the beginning location of a property block. If 254 is the word offset stored, we'll use that to access a property block stored at offset 508 (2*254) in the page. If we've decided that we need to access the information for the nth run on the page, we can look up the word offset of the nth property exception block, double it, and add that to the beginning address of the page, which gives the location of the first byte of the exception block, its length byte.

(Our single-byte word offsets can only count up to 255. By doubling these values we can point up to byte offset 508 inside of our page. This scheme will waste a byte or two of storage space between property blocks occasionally. In earlier versions of Word, the page size was 128 bytes, which allowed the property exceptions to always be tightly packed. We had to give up tight packing to accommodate the girth of a fully sized table property exception, starting with Mac Word 4.0.)

Now as we pack property exception blocks into the page, we will place them in front of the fkp.crun at the end of the page, placing each new block in front of the last one recorded. When we add a new run to an FKP, we'll use up four bytes for a new fc coordinate to mark the beginning of the run, and we'll use up another byte for the word offset that will point to the CHPX. These will both be allocated at the beginning of the FKP. The compressed CHPX we'll add as close to the end of the FKP as we can. The array storage at the front of the FKP will grow to the right, and the assemblage of property exceptions will grow to the left, leaving an island of free space that will decrease with every run added.

We'll continue to add run information to the block until there is no longer room for the coordinate of the new run plus the exception block offset byte plus the length of the exception block. When we run out of space we write the FKP page that has been filled, and prepare a new empty page to record further run properties.
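
As a concrete picture of the layout just described, here is a minimal C sketch of pulling the pieces back out of an FKP image that has been read into a 512-byte buffer. The helper names are invented for illustration, and the refinements described below (sharing CHPXs and the signal offset value) are not included yet.

    #include <stdint.h>
    #include <string.h>

    #define CB_FKP 512

    /* Layout: crun+1 ascending 4-byte FCs at the front, then one offset byte per
       run, CHPXs packed backward from the end, and the run count in the last byte. */

    int CrunOfFkp(const uint8_t *pbFkp)
    {
        return pbFkp[CB_FKP - 1];                  /* fkp.crun lives at offset 511 */
    }

    uint32_t FcOfRun(const uint8_t *pbFkp, int irun)
    {
        uint32_t fc;                               /* irun may range 0..crun; the  */
        memcpy(&fc, pbFkp + 4 * irun, sizeof fc);  /* last entry is the limit fc   */
        return fc;
    }

    /* Return a pointer to the count byte of the CHPX recorded for run irun. */
    const uint8_t *PbChpxOfRun(const uint8_t *pbFkp, int irun)
    {
        int crun = CrunOfFkp(pbFkp);
        uint8_t bOffset = pbFkp[4 * (crun + 1) + irun]; /* offset array follows the fcs */
        return pbFkp + 2 * bOffset;                     /* word offset: double it       */
    }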

The FKP page structure that we've just invented will give us an opportunity to save lots more property storage space within the inside of a page, at the cost of adding one more byte of overhead for each run stored.

That array of offsets that allows us to calculate the locations of property blocks within the FKP can be exploited to allow us to squeeze more property storage savings out of the FKP structure.

Our mental model of how the look of characters usually changes across the length of a paragraph tells us that there are runs of plain text whose properties are the same as the paragraph style's character properties, alternating with words or phrases of emphasized text that differ from the style by only a few properties. Notice that the plain text runs that occur every other run are assigned identical CHP settings. If we used a storage scheme that required us to record one CHPX for each run in the doc, we'd have to record a CHPX that is identical to the CHPX that was two slots back in the table. Seems like a horrible waste of storage space.

First additional payoff for using FKPs: we can notice when the same CHPX is used in multiple places within the span described by the page and share a single CHPX recording.

Notice: if we scan all of the CHPXs that have already been stored in an FKP that we are building, as we prepare to add a new run to the page, we can detect whenever an identical CHPX is already recorded there. That scan will leave us with the offset of the matching CHPX in our hand; we had to use that offset to find the matching block so that we could compare the new CHPX against its value.

There's no requirement that we have to record a redundant CHPX in the FKP. Instead we can record the byte offset of the previously stored CHPX in the byte offset slot for our new character run. By doing this we have shared the same CHPX settings for runs whose properties match, and we have avoided wasting an entire CHPX of space. This means when we have alternating character runs in a paragraph, a situation that we think happens quite often, half of the CHPX storage in an FKP can be saved by sharing one CHPX recording with all of those matching run instances.

If the user usually uses the same pattern of emphasis for text that needs it, it could be that all of the emphasis runs are assigned identical property settings, which can all be shared within one FKP. In this happy case, we will store nearly the maximal number of run transition fc coordinates in the FKP (that would be 102), and perhaps have to store only two reduced-size CHPX blocks for all of those runs.
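
A sketch of that sharing scan, under the same invented page layout as the earlier FKP sketch. It compares the candidate CHPX (its count byte plus body) against every CHPX already referenced by the page's offset array; a return value of 0 stands for "no match", since no CHPX can legitimately begin at word offset 0.

    #include <stdint.h>
    #include <string.h>

    /* Return the word offset of a CHPX already stored in the page that is identical
       to the candidate pbChpxNew (a count byte followed by that many prefix bytes),
       or 0 if no match exists and the new CHPX must be stored. */
    uint8_t BOffsetOfMatchingChpx(const uint8_t *pbFkp, const uint8_t *pbChpxNew)
    {
        int crun = pbFkp[511];
        int irun;

        for (irun = 0; irun < crun; irun++) {
            uint8_t bOffset = pbFkp[4 * (crun + 1) + irun];
            if (pbFkp[2 * bOffset] == pbChpxNew[0] &&
                memcmp(pbFkp + 2 * bOffset + 1, pbChpxNew + 1, pbChpxNew[0]) == 0)
                return bOffset;
        }
        return 0;
    }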

Second additional payoff for using FKPs: when a CHPX compresses down to zero-length we can store a special signal offset value that means we decline to store any CHPX block for that case, saving a byte of FKP space

There's another cool thing we can realize: when a CHPX records that there is no difference between the properties of a text run and its paragraph's character property style definition, we know that the CHPX will be entirely set to zero. When we compress away the bytes that contain zeroes from the end of the structure, we find that the CHPX compresses down to only its length byte, which records a byte length for the CHPX of 0.

If we always have to physically point to some set of bytes recorded in the FKP and record that in the offset slot for a run, for a CHPX of length 0 we'll always store a single byte of 0 in the FKP to satisfy this requirement.

We're going to intentionally burn our ability to record a CHPX that begins at offset 510 (corresponding to an offset array value of 255). It's not much of a waste since we only have a chance to record a single-byte (ie. a fully compressed) CHPX there before we'd overrun the fkp.crun stored at offset 511.

The byte at offset 510 might still be used, if the first CHPX we record in the FKP ends up being three bytes long and is placed to begin at offset 508.

Now that we've reserved 255 so that we'll never record any block that begins at offset 510, we'll use the value of 255 as a signal that means that the CHPX for a run has been entirely compressed away. We'll store 255 as the offset for any CHPX that has been compressed down to a count byte that stores 0, and then not bother to record the fully compressed CHPX.

Result: we've won an extra byte of savings whenever it happens that a run's CHPX compresses down to a zero length structure.

Our code that reconstitutes CHPs from CHPXs will copy the stylesheet CHP value for the paragraph into the buffer where the reconstituted CHP is being built, and just declare its work to be done when it notices the CHPX pointer stored for a run in the FKP is 255.

What was the performance of this idea of storing sections of the run boundary index in FKP pages, together with compressed exception records kept in correspondence with the run index entries by offset pointers?

This was a winning idea, the CHP storage scheme that was used in Mac Word 3.0, Mac Word 4.0, and finally Win Word 1.0. It was market tested in contests with many competitors, for longer than a decade.

Great, characterize its performance for retrieval of a CHP for a particular fc position in a stored file.

There was a small, dynamic array structure, called a bin table (think of FKPs that contain a lot of property information, like a storage bin), kept in heap memory and called the hplcbteChp, that recorded in ascending order the beginning fc (file character position) of every FKP stored, kept in correspondence with a payload that recorded the page number of the FKP that stored the properties for each fc range of the file.

Given a random fc, the fcs of the bin table would be binary searched to locate the page number of the page that contained the properties for that FC. That page would be fetched into memory. Then the index of ascending fcs that was stored at the beginning of the FKP would be binary searched to locate the FKP slot that recorded the properties for that FC.

For a half-million character file that changed character properties once every hundred characters, and used properties that were stored in the CHP's bit fields, with all other properties set via the stylesheet, one FKP would record the properties for roughly 10,000 characters, meaning the bin table would grow to have 50 or so entries. The bin table binary search would identify the page number of the correct FKP in less than 6 table probes, and a second binary search would identify the correct fc slot of the 100 or so recorded in the FKP in less than 7 table probes.

If we continue to fetch properties within the range of FCs covered by the FKP, those two binary searches will be lightning quick as required. If the next time we do a CHPX lookup for a new FC, and the first binary search tells us to load a different page than the one we looked at last time, there will be a gap in execution of some number of milliseconds before we have the new FKP page in hand and can do the second binary search to locate the correct block there.

If we're scanning through the FKP to locate adjacent runs like we do inside of Word's FormatLine() routine, these property lookups do happen in sequence extremely quickly, but there's a slow down for one fetch when the scan crosses page boundaries.

We'd find the location of the beginning of the CHPX in the FKP by looking up the value in the slot identified by this last binary search, fetching the word offset, multiplying that by 2, and adding the result to the beginning address of the FKP. An order(1) operation.
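
Putting the pieces together, here is a minimal C sketch of the whole fc-to-CHPX lookup: a binary search over the bin table, a page fetch, a second binary search over the fcs inside the FKP, and the doubled word offset. Every name here, including PbFromPn(), is invented for illustration; Word's actual routines are different.

    #include <stdint.h>
    #include <string.h>

    /* The bin table is modeled as parallel arrays: rgfcBin[i] is the first fc
       described by FKP page number rgpn[i], with one extra limit fc at the end,
       so rgfcBin holds cbin+1 entries. PbFromPn() stands in for the paging layer,
       which returns a pointer to the requested 512-byte page, reading it from the
       document or scratch file if it isn't already cached. */

    extern const uint8_t *PbFromPn(uint32_t pn);   /* hypothetical page-cache fetch */

    /* Return i such that rgfc[i] <= fc < rgfc[i+1]; rgfc holds centry+1 boundaries. */
    static int IFromFc(const uint32_t *rgfc, int centry, uint32_t fc)
    {
        int iFirst = 0, iLim = centry;
        while (iLim - iFirst > 1) {
            int iMid = (iFirst + iLim) / 2;
            if (fc < rgfc[iMid]) iLim = iMid; else iFirst = iMid;
        }
        return iFirst;
    }

    /* Find the CHPX covering character position fc. Returns NULL when the run's
       CHPX was compressed away entirely (the 255 signal value in the offset slot). */
    const uint8_t *PbChpxFromFc(uint32_t fc, const uint32_t *rgfcBin,
                                const uint32_t *rgpn, int cbin)
    {
        const uint8_t *pbFkp = PbFromPn(rgpn[IFromFc(rgfcBin, cbin, fc)]);
        int crun = pbFkp[511];
        int iFirst = 0, iLim = crun;

        /* Second binary search, over the crun+1 ascending fcs inside the page. */
        while (iLim - iFirst > 1) {
            int iMid = (iFirst + iLim) / 2;
            uint32_t fcMid;
            memcpy(&fcMid, pbFkp + 4 * iMid, sizeof fcMid);
            if (fc < fcMid) iLim = iMid; else iFirst = iMid;
        }

        uint8_t bOffset = pbFkp[4 * (crun + 1) + iFirst];
        return bOffset == 255 ? NULL : pbFkp + 2 * bOffset;
    }

Both searches touch only a handful of entries, which is what makes the lookup feel instantaneous once the page is in memory.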

Then we'd decompress by copying the proper stylesheet CHP into a buffer, XOR-ing the single bit fields of the CHPX, and then individually copying any multi-bit fields that the CHPX showed were different from the stylesheet.

With one page of CHP properties describing the settings of 10,000 characters, that means that a single FKP read from disk suffices to produce the CHP settings for close to 9 full windows of text.

As property change density decreased in the document, one character FKP would describe larger swaths of the document and the number of bin table entries that needed to be searched to find the FKP would be reduced.

What was the absolute best job this compression scheme could accomplish?

We know that whenever a character run's properties exactly matched the CHP stored in the stylesheet that the CHPX calculated for that run would be zero length. So the number of bytes required to describe the CHP properties of a document depended entirely on the number of runs that had to be recorded.

If the user totally depended on the stylesheet style settings to completely describe the character properties of their document text, there might be a visible change of properties whenever they changed the style settings for a paragraph.

If the creation rule for a new run was to record a new one whenever a new character's properties differed from the preceding character's, a new run would be recorded whenever the styles of adjacent paragraphs caused a character property change. Every time that this occurred, when the new CHPX for that run was calculated, it would be entirely compressed away. The result would always be a single byte containing a length of zero.

This tells us that we should use a more subtle rule for declaring run boundaries. Instead, whenever there's a change of properties, we should calculate a new CHPX and compare that with the one we've previously recorded the last time we noticed a property change. We should only declare a new run, when the newly calculated CHPX differs from the last recorded CHPX.
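
A one-function sketch of that rule, assuming CHPXs are held as a count byte followed by the prefix body (names invented): a new run is declared only when the freshly computed CHPX differs from the last one recorded.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Compare the newly computed CHPX against the last CHPX recorded and report
       whether a new run boundary should be declared. */
    bool FStartNewRun(const uint8_t *pbChpxNew, const uint8_t *pbChpxLast)
    {
        return pbChpxNew[0] != pbChpxLast[0] ||
               memcmp(pbChpxNew + 1, pbChpxLast + 1, pbChpxNew[0]) != 0;
    }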

Using this rule, whenever a document's CHPs are completely specified by the stylesheet (meaning the user never made an exception in any part of the document), only a single run need be written to describe the properties of the entire document, no matter how large it grows. As we displayed windows for that document, we would only have to fetch a single FKP over its entire length. That FKP would begin with two FCs, one set to the beginning of characters in the document, and the other set to the fcLim (one past the last character recorded in the document). That would be followed by an offset pointer of 255, to show that the CHPX was completely compressed away, and the fkp.crun in the last byte of the page would be set to 1.

Compressing the PAP, TAP, and SEP - a different compression idea, slightly varied, worked to encode and compress all three structures

The FKP page structure developed for CHPX storage was also used to record paragraph and table properties (the PAP and the TAP) in Word document and scratch files.

A different, more generally applicable encoding and compression scheme was used though to transform PAPs, TAPs, and SEPs (section properties) into PAPXs (paragraph property exceptions), TAPXs (table property exceptions), and SEPXs (section property exceptions).

The CHPX encoding and compression scheme depended on special characteristics of the CHP: the fact that many of the CHP properties could be encoded as bit fields, and the fact that the designer positioned these bits at the very beginning of the CHP and positioned the most likely to be encountered multi-bit fields as close to the front of the CHP as was practical, made it possible to do an encoding of this data which also produced much better compression.

For PAPs, TAPs and SEPs, bit fields were barely used. It was also harder to identify fields that would consistently be used more than any of the other fields in those structures.

So for these beasts, a more general scheme had to be used.

Here are the stats for these structures as defined for Win Word 1.1:

For the PAP - the fixed-length prefix of the PAP contained 27 2-byte words of property data consisting of 33 fields followed by a variable length section that contained up to 50 two-byte tab position fields, followed by up to 50 one-byte tab descriptions. The size of the largest possible PAP was 204 bytes. The fixed-length portion of the PAP was 54 bytes long.

For the TAP - the fixed length prefix of the TAP contained 6 2-byte words consisting of 9 fields, giving a fixed prefix length of 12 bytes. The variable length part that followed contained up to 33 2-byte table position boundaries, followed by up to 32 table cell descriptions of 5 2-byte words, which each could contain 6 fields. The largest possible TAP was 408 bytes long.

For the SEP - the SEP had a fixed length of 16 2-byte words, which contained 20 fields. The SEP was 32 bytes long.

The same pattern that we noticed for the CHP, that a run's properties differed from the settings stored in the stylesheet by only a few property settings, usually held true when a PAP's settings were compared with those recorded in the stylesheet. If we could invent an encoding method that would only record the few fields whose settings differed from the settings recorded in the stylesheet, we would be able to write a PAPX, a paragraph property exception block that omitted nearly the entire redundant content of whatever was stored in the PAP.

Wild/cool idea: design instructions for an imaginary computer. Use those instructions to build a sequence of instructions (a program) via a compression routine, which records step by step how to change a base property setting into the actual properties for a situation. The program that's generated is much smaller than the original property storage. Store that program as your property exception description. Write code for Word that interprets your newly defined instructions. Interpret that stored exception program to reconstitute the compressed property block, which was recorded as a sequence of your newly imagined instructions.

Here's a powerful idea: the instructions that are defined in the instruction set of a computer are operators that can change the contents of any particular field stored in the computer's memory. As a software designer for Word, invent instructions that target all of the possible fields of the PAP, TAP or SEP for an imaginary computer. These made-up-as-you-need-them instructions contain an operation code (this opcode is called a sprm in Word, a single property modifier) that specifies the target field, which is followed by a payload, which will be copied over any existing content recorded in the target field. These instructions provide a vocabulary that allows one to specify a way to establish a new setting for any field of the target data structure. A single instance of one of these instructions consisting of the sprm opcode followed by its payload, is called a PRL, a property list element.

To create an exception block for a PAP, first set up a buffer that will be used to record a PAPX (pronounce that as 'pap-ex'). Unlike CHPXs, which could be entirely compressed away when the CHP settings matched those stored in the stylesheet CHP, PAPs carried two fields of data that had to be recorded to disk and that always guaranteed a difference from the stylesheet PAP.

First, a PHE (a paragraph height structure, pronounced 'pee-aitch-eee') was calculated for every paragraph as its size was determined during page layout. During a full-save that information needed to be recorded on disk, so that we could delete and then reallocate a fresh PLCPHE structure to record the new paragraph heights that would be calculated as the user created new paragraph text or edited existing paragraphs.

The height calculated for a paragraph's PHE was, more often than not, different from the height calculated for any other paragraph that might be recorded in an FKP, so including this information reduced the opportunities for sharing PAPXs as they were packed away into FKPs.

Second, we always needed to know the style's one-byte stc (style code) so that the proper stylesheet slot, and thus the stylesheet PAP, could be located when it was time to reconstitute the PAP.

For these reasons, a five-byte prefix consisting of the stc followed by the PHE value was always included as overhead in the PAPX.
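
As a rough sketch, and assuming the PHE accounts for the remaining four bytes of that five-byte prefix, a PAPX can be pictured like this (invented declarations, not the Win Word 1.1 ones):

    #include <stdint.h>

    /* Hypothetical sketch of the PAPX layout described above: a one-byte
     * stc plus a PHE (assumed here to be four bytes, completing the
     * five-byte prefix), followed by the grpprl that records only the
     * fields which differ from the stylesheet PAP.                      */
    typedef struct PHE { uint8_t rgb[4]; } PHE;   /* paragraph height info */

    typedef struct PAPX {
        uint8_t stc;        /* style code: which stylesheet PAP to start from */
        PHE     phe;        /* paragraph height measured during page layout   */
        uint8_t grpprl[];   /* prl instructions for the differing fields      */
    } PAPX;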

Compare every field in the structure against the settings recorded in the stylesheet PAP. For any field that differs, append to the PAPX buffer the opcode that targets that particular field, followed by the contents of the PAP field that differs from the stylesheet value. After this process completes, the PAPX buffer contains a program, a sequence of instructions that changes the stylesheet PAP into the true PAP for the paragraph that owns it. The sequence of instructions arranged one after the other in execution order in a property exception buffer is called a grpprl, a group of property list elements. grpprl is pronounced 'group-prell' or is called a 'group of prell'.
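
A minimal sketch of that compare-and-emit pass, using an invented four-field miniature PAP, invented sprm codes, and invented helper names (the real PAP has dozens of fields and its own sprm numbering):

    #include <stdint.h>
    #include <string.h>

    /* Invented miniature PAP and sprm codes, purely for illustration. */
    struct PAP { int16_t dxaLeft, dxaRight, dxaFirst; uint8_t jc; };

    enum { sprmPJc = 5, sprmPDxaRight = 14, sprmPDxaLeft = 15, sprmPDxaFirst = 16 };

    /* Append one prl (sprm opcode + payload) to the grpprl buffer. */
    static int CbAppendPrl(uint8_t *grpprl, int cb, uint8_t sprm,
                           const void *val, int cbVal)
    {
        grpprl[cb++] = sprm;
        memcpy(grpprl + cb, val, (size_t)cbVal);
        return cb + cbVal;
    }

    /* Emit only the differences from the stylesheet PAP; an unchanged
     * PAP yields an empty grpprl.  The caller supplies a buffer large
     * enough for the worst case.                                       */
    static int CbGenGrpprlFromPap(const struct PAP *pap,
                                  const struct PAP *papStyle,
                                  uint8_t *grpprl)
    {
        int cb = 0;
        if (pap->jc != papStyle->jc)
            cb = CbAppendPrl(grpprl, cb, sprmPJc, &pap->jc, sizeof pap->jc);
        if (pap->dxaRight != papStyle->dxaRight)
            cb = CbAppendPrl(grpprl, cb, sprmPDxaRight, &pap->dxaRight, sizeof pap->dxaRight);
        if (pap->dxaLeft != papStyle->dxaLeft)
            cb = CbAppendPrl(grpprl, cb, sprmPDxaLeft, &pap->dxaLeft, sizeof pap->dxaLeft);
        if (pap->dxaFirst != papStyle->dxaFirst)
            cb = CbAppendPrl(grpprl, cb, sprmPDxaFirst, &pap->dxaFirst, sizeof pap->dxaFirst);
        return cb;    /* length of the generated grpprl in bytes */
    }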

How do you execute this grpprl program for an imaginary computer? The Word developer writes a small program that identifies the opcode that begins each prl instruction recorded in the PAPX. That small program uses the operation code to look up, in a table, that instruction's characteristics, which include the location and length of the target field within the target data structure and describe what kind of action the operation will perform using the instruction payload that follows the sprm opcode. This interpreting program performs the correct action for each instruction. If a buffer is loaded with the stylesheet PAP and this PAPX's series of instructions is then interpreted, the stylesheet PAP is transformed into the actual paragraph PAP. Executing the PAPX grpprl uncompresses the PAPX, thereby producing the PAP that was originally encoded.
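
And a minimal sketch of the matching interpreter for that same invented miniature PAP. A descriptor table plays the role described above (and that the dnsprm[] array plays in the real sources), mapping each opcode to the offset and width of its target field; applying a prl is then just a copy of the payload over that field. Only replace-type sprms are handled here.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Same invented miniature PAP as in the sketch above. */
    struct PAP { int16_t dxaLeft, dxaRight, dxaFirst; uint8_t jc; };

    /* One table entry per sprm opcode: where its target field lives
     * inside the PAP and how many payload bytes it carries.          */
    struct SPRMDESC { uint16_t offField; uint8_t cbField; };

    static const struct SPRMDESC mpsprmdesc[32] = {
        [5]  = { offsetof(struct PAP, jc),       1 },   /* sprmPJc       */
        [14] = { offsetof(struct PAP, dxaRight), 2 },   /* sprmPDxaRight */
        [15] = { offsetof(struct PAP, dxaLeft),  2 },   /* sprmPDxaLeft  */
        [16] = { offsetof(struct PAP, dxaFirst), 2 },   /* sprmPDxaFirst */
    };

    /* Start from a copy of the stylesheet PAP, then replay the grpprl:
     * each prl overwrites one field with its payload.                  */
    static void ApplyGrpprlToPap(struct PAP *pap, const uint8_t *grpprl, int cb)
    {
        int ib = 0;
        while (ib < cb) {
            const struct SPRMDESC *pd = &mpsprmdesc[grpprl[ib++]];
            memcpy((uint8_t *)pap + pd->offField, grpprl + ib, pd->cbField);
            ib += pd->cbField;
        }
    }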

How the FKP idea was adapted to record paragraph properties

A special character recorded at the end of a paragraph, the paragraph mark, carried the paragraph properties that applied to all of the characters that preceded that mark within the paragraph. The beginning character of the paragraph was the character that followed the paragraph mark that ended the previous paragraph.

When an FKP was used to record PAPXs, the entire extent of the paragraph from the file character position of its first character up to the fcLim that marked the limit of the paragraph, one character past the paragraph mark, was treated as one run of the FKP. The word offset of the PAPX for that paragraph was recorded in an array stored immediately after the array of FCs recorded at the beginning of the FKP. The PAPXs would be placed at the tail of the FKP, with each new PAPX placed in front of previously recorded PAPXs in the buffer, until space was exhausted in the FKP.

Unlike an FKP that records CHPXs, where one run may span many paragraphs, a PAPX FKP records an entry for every paragraph in the document, even when the previous paragraph holds the same properties as the following one. We needed to record a run record for every paragraph so that we could instantly look up a paragraph's boundary rather than locate it by searching for the next paragraph mark. The boundary coordinate could be looked up in an FKP using a binary search, a relatively fast operation, instead of scanning character by character through the document text for a paragraph mark, a quite slow process that might examine hundreds or thousands of characters to locate the paragraph bounds.

This changed policy uses up FKP space at a faster rate and produces a larger number of paragraph bin table entries, but it pays for itself by making it possible to speedily find the boundaries and properties of a paragraph.
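
A minimal sketch of that fast lookup, with invented names: binary-search the sorted array of file positions at the front of the FKP to find the run (paragraph) that contains a given fc.

    #include <stdint.h>

    typedef uint32_t FC;                      /* file character position */

    /* rgfc holds crun+1 sorted positions; run i spans [rgfc[i], rgfc[i+1]).
     * Returns the index of the run containing fc, or -1 if fc falls
     * outside this FKP page.                                            */
    static int IRunFromFc(const FC *rgfc, int crun, FC fc)
    {
        int iLo = 0, iHi = crun;
        if (fc < rgfc[0] || fc >= rgfc[crun])
            return -1;
        while (iHi - iLo > 1) {               /* invariant: rgfc[iLo] <= fc < rgfc[iHi] */
            int iMid = (iLo + iHi) / 2;
            if (fc < rgfc[iMid])
                iHi = iMid;
            else
                iLo = iMid;
        }
        return iLo;
    }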

How the end of table mark was treated as a kind of paragraph mark, which allowed FKPs to carry PAPXs and TAPXs at the same time

How SEPXs were compressed and stored, reusing the compression idea used for PAPXs and TAPXs

SEPs differ from CHPs, PAPs, and TAPs because section property information is never encoded as a SEPX to be packed into an FKP and because a SEP is never stored in the stylesheet.

But it is still possible to encode and compress a SEP, producing a SEPX, by using a set of instructions that target SEP fields. To produce a SEPX, a SEP is compared field by field against a standard SEP definition known by Word. For any field that differs from that standard, an instruction which targets the differing field is written to the end of a SEPX buffer. By the time all fields have been examined, the SEPX contains a small program that shows how to change the field settings of the standard SEP into the actual properties for the section. This reduced-size SEPX is written to free space at the end of the document file or scratch file. The file number and file location that will later be used to retrieve the SEPX block are recorded within an in-memory table, the hplcsed.

SEPXs are not recorded within their own FKP structures because the number of section descriptors for a document never grows very large. Even in very large documents, the number of sections defined (sections are analogues of chapters) rarely grows above a hundred.

How does compression using sprm opcodes perform?

When the property block to be compressed exactly matches the settings of the stylesheet PAP or the reference default SEP, the grpprl that is generated is zero length. When only one property is changed from the reference property, the grpprl consists of the sprm code that targets that property field, followed by the changed value of that field.

If the changed property is two bytes long, the generated grpprl exception will be three bytes long, no matter the original size of the block that was compressed. With the 5 bytes of PAPX overhead for the stc and PHE included before the grpprl, the PAPX would be 9 bytes long. That would result in 22.6 to 1 compression for a PAP. If the PAPX contained TAP information for a table row end, a 9-byte PAPX would also be built, resulting in 45.3 to 1 compression for a TAP. A SEPX would consist of the three-byte grpprl only, which would provide 10.6 to 1 compression for a SEP.

If the changed property value is one byte long, the generated grpprl exception will be two bytes long. Including the 5 bytes of stc and PHE prefix information, a PAPX recording a compressed PAP or TAP would yield 25 to 1 compression for PAPs and 51 to 1 compression for TAPs. A SEPX would carry only the two-byte grpprl, which would result in 16 to 1 compression for SEPs.

A comparison of the compression of grpprl encoding vs. CHPX encoding

If this method were applied to a CHP that had three of the single-bit fields changed along with a 2-byte font code, a 9-byte grpprl would have been generated, consisting of three two-byte prls for the bit fields and a three-byte prl to encode the font code change, which would have reduced the CHP encoding by only 10%. The equivalent CHPX encoding would have taken only 4 bytes, a 60% size reduction.

The improved compaction provided by CHPX encoding was the reason it was used to compress CHP blocks for in-document storage.

Note that when the PAP, TAP, and SEP blocks increased in size as additional fields were added in later versions of Word, which always happened, the compression efficiency of using sprms to encode exception information only improved.

We record property exception blocks to files only when we record properties permanently, at a full-save and at initial type-in. Between full saves, we record property changes in memory.

We record CHPXs, PAPXs, and TAPXs in FKPs, and write SEPXs for storage within a file, only when we permanently record the properties for a document as it is saved and when we record the initial property information for newly typed text within a temporary scratch file.

When we later edit these saved-away property settings during an editing session, we don't use the disk file property representation of CHPXs, PAPXs, TAPXs, and SEPXs to record those changes.

Instead, these changed settings are recorded into in-memory data structures stored in Word's heap. A Word document's piece table, which records the state of an edited document, contains a 2-byte field called the property modifier, the prm, which points to these heap records of property changes.
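
A hypothetical sketch of that linkage; the real piece descriptor has more fields and the real prm encoding is more involved, but the relationship is simply that a non-nil prm leads to the heap record of changes for that piece.

    #include <stdint.h>

    /* Illustrative only: a piece descriptor whose two-byte prm, when
     * non-nil, refers to the heap block holding the grpprl of property
     * edits made to this piece since the last full save.               */
    typedef struct PCD {
        uint32_t fc;     /* file position where this piece's text lives */
        uint16_t prm;    /* property modifier: 0 means no pending edits */
    } PCD;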

When a document is finally full saved, the last saved document properties are fetched, the new property modifier information is applied to the previously saved properties, and the newest versions of the property settings are encoded and compressed for the newly saved document file.

Word used differently packaged grpprls (stored interpreted instruction sequences designed to change the kind of target data structure specified by their opcodes), which were originally invented to encode the PAPX, TAPX and SEPX within a saved file, to encode property changes of all types (CHP, PAP, TAP, and SEP) that the user has made via editing actions since the last time the document was full-saved. These grpprls are recorded within a prl carrier structure, a PRC, that is allocated in Word's heap.

The grpprl idea can also be used to record the results of an edit that changes properties in an open document. The grpprl, a small interpreted list of property change instructions, can record how the last saved properties of selected text should be changed to produce new property settings for the selection.

We also used the grpprl idea to record in-memory edits that affected CHPs. CHPXs, which summarize all of the changes made to a CHP, are overkill when we are trying to record only the single changes that the user has decided to make to character property settings.

When we are recording single-bit CHP property changes, we'll generate a two-byte prl to capture that info. That seems wasteful, but we will try to share grpprls between pieces that execute equivalent property changes, which will tend to reduce the per-piece cost of recording the edit.

As a formatting edit operation completes, a grpprl can be generated to express the change the user desires, which can be merged with any property modifications that may have already been recorded for that region of the document. The result of any such property changing merge is stored in the document's piece table, in its pcd.prm field.

When a user selects a range of text within a document and changes a property setting through a keyboard command, or changes a field in a property dialog and submits that property change, the change request is encoded as a sequence of instructions. For every property that is changed, a prl is constructed, which consists of a single-byte sprm opcode that targets a field in a property block that must be changed, followed by a payload that varies with the type of sprm and records the new value that must be set in the targeted field. All of the prls that are generated are collected in a variable-sized buffer of bytes called a grpprl.
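
As an illustration, here is a minimal sketch, with invented sprm codes and helper names, of the grpprl that a command such as 'make the selection bold and 12 point' might produce: two prls, one with a one-byte payload and one with a two-byte payload.

    #include <stdint.h>
    #include <string.h>

    enum { sprmCFBold = 20, sprmCHps = 40 };   /* invented opcodes */

    /* Append one prl (opcode + payload) to the grpprl being built. */
    static int CbAppendPrl(uint8_t *grpprl, int cb, uint8_t sprm,
                           const void *val, int cbVal)
    {
        grpprl[cb++] = sprm;
        memcpy(grpprl + cb, val, (size_t)cbVal);
        return cb + cbVal;
    }

    /* Encode "bold on" and "font size 12pt" (24 half-points) as a
     * five-byte grpprl: [sprmCFBold][1] [sprmCHps][24 as 2 bytes].   */
    static int CbGrpprlBold12pt(uint8_t *grpprl)
    {
        uint8_t  fBold = 1;
        uint16_t hps   = 24;
        int cb = 0;
        cb = CbAppendPrl(grpprl, cb, sprmCFBold, &fBold, sizeof fBold);
        cb = CbAppendPrl(grpprl, cb, sprmCHps,   &hps,   sizeof hps);
        return cb;
    }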

The document piece table is fragmented at the cpFirst and cpLim coordinates recorded in a selection structure, a SEL. After such a split operation, the piece table would contain a single piece within the selection boundaries if the document had never been edited since its last full save, or a sequence of pieces when several pieces fell within the selection cp range.

The information stored in the new grpprl would be combined with the information stored in any grpprl that the selected piece already pointed to through its pcd.prm structure. The routine MergeGrpprls() (opus/wordtech/prcSubs.c) would be called to merge the new grpprl into any grpprl lists pointed to by the pcd.prm pointers recorded within any pcds whose pieces intersected the selection bounds. The newly merged grpprl would be packaged into a heap block, and the pcd.prm for that piece would be changed to point to the newly merged grpprl.

The merged grpprl that we record to store the results of user edits may contain sprms of any type, describing any edits that the user has made in that piece since the document was last full-saved. When it's time to produce the current CHP, PAP, TAP, or SEP for a character run, a paragraph, a table row, or a section, a sprm group code, an sgc, is passed to specify which kinds of sprms are allowed to execute for the kind of structure being reconstituted.

sgcChp is passed when any CHP sprms stored in a grpprl must be executed to change the last full-saved CHP of a character run to reflect the latest property changes that the user may have made for that run.

sgcPap is passed when PAP sprms stored in a grpprl must be executed to change the last full-saved PAP belonging to a paragraph mark to reflect the latest property changes that the user may have made for that paragraph.

sgcTap is passed when TAP sprms stored in a grpprl must be executed to change the last full-saved TAP belonging to an end-of-table-row mark to reflect the latest property changes that the user may have made for that table row.

sgcSep is passed when SEP sprms stored in a grpprl must be executed to change the last full-saved SEP belonging to a section mark to reflect the latest property changes that the user may have made for that section.
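
A minimal sketch of that filtering, using invented sprm codes and an invented descriptor table: the grpprl is walked prl by prl, and a prl is executed only when its sprm's group matches the requested sgc; otherwise its payload is simply skipped.

    #include <stdint.h>
    #include <stdio.h>

    enum SGC { sgcChp = 1, sgcPap, sgcSep, sgcTap };

    /* For each sprm opcode: which property group it belongs to and how
     * many payload bytes it carries (invented entries for illustration). */
    struct SPRMINFO { uint8_t sgc; uint8_t cbPayload; };

    static const struct SPRMINFO mpsprminfo[256] = {
        [20] = { sgcChp, 1 },    /* a one-byte CHP sprm */
        [15] = { sgcPap, 2 },    /* a two-byte PAP sprm */
    };

    /* Replay a mixed grpprl, executing only the sprms of the requested kind. */
    static void ApplyGrpprlSgc(const uint8_t *grpprl, int cb, enum SGC sgc)
    {
        int ib = 0;
        while (ib < cb) {
            uint8_t sprm = grpprl[ib++];
            const struct SPRMINFO *pi = &mpsprminfo[sprm];
            if (pi->sgc == sgc)                     /* right kind of sprm?          */
                printf("apply sprm %u\n", sprm);    /* stand-in for the real action */
            ib += pi->cbPayload;                    /* skip payload either way      */
        }
    }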

These editing grpprls would be packaged into a carrier data structure recorded in the 64K-maximum heap that Win Word used to store the headers of its important data structures. A pointer to that carrier of grpprls would be stored as a 16-bit pointer in the Word document's piece table, the pcd.prm (the piece table property modifier).

It's possible to accumulate property edit changes as an historical list that records edits that have been made since the last full save, but that leads to trouble if you are trying to be economical with memory usage.

The simplest and most simplistic way to record the set of property changes for a piece is to keep a historical list of the property changes made to it: every time a property is changed for a piece, add a single property-changing instruction at the end of that piece's grpprl.

When a word processor has only 64KB of memory to record all of the data structure headers for its open documents, this idea has a lot of deficiencies. It's too easy to exhaust all available heap space just recording piece property edit histories.

The problem with using an historical property scheme is that many different editing histories of many different lengths accumulate in memory even when they produce exactly identical changes in the destination document.

A more efficient representation would describe a particular set of property changes in exactly one way, in a normal form, so that any piece needing those property settings could point to a single grpprl block rather than to one of many alternate representations expressing identical information.

How we reorganized our property storage so that we could recognize when property changes produced via different histories yield identical results

If an historical recording method was used, setting boldness then italics produced a different list than setting italics then boldness. Despite the order difference, both lists produce the same change in the target property block.

Idea: when a new sprm is merged into a property modifier block, make sure that in the merged list each sprm command is listed in sprm code order (prls with lower-numbered sprm codes are recorded before those with higher sprm codes). If sprmCFBold < sprmCFItalic, then sprmCFBold will appear in the merged grpprl before any sprmCFItalic instances.

If a prl with a particular sprm code is recorded in a grpprl, and a new change is recorded using that sprm and the action of the sprm is to completely replace its property setting, the old prl should be purged from the merged list before the new one is recorded.

eg. If sprmPDxaLeft was set to 720 in a grpprl, and a new sprmPDxaLeft of 1440 is merged into the list, the previously recorded sprmPDxaLeft should be purged, resulting in a single instance of sprmPDxaLeft with value of 1440 being listed.

When the action of a sprm doesn't change a field to a particular setting but instead transforms its previous value, instead of recording more than one of those sprms in the same list, compose the payloads together and record a single sprm that expresses the combined result.

eg. the sprmPIncLvl sprm adds its payload to pap.stc. If a sprmPIncLvl whose value is 1 already exists in the prm, and a sprmPIncLvl with a value of 2 is merged in, the merge routine can record a single sprmPIncLvl prl with a value of 3.

When a sprm has the effect of setting many different standard values, it should be given a lower sprm code by the designer than any of the sprms that set those values individually. Then, when that sprm with wide effect is merged into the list, it can cause any of the sprms that set the individual field values to be purged from the merged list.

eg. sprmCPlain resets the fBold, fItalic, fStrikethrough, and hps fields of the CHP to the field values recorded in the stylesheet. If sprmCPlain is merged into the grpprl, any sprmCFBold, sprmCFItalic, sprmCFStrikethrough, or sprmCHps prls should be purged from the merged grpprl.
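
A minimal sketch of that merge for the simplest case, replace-type sprms with fixed payload widths, using invented sprm codes: the list is kept sorted by sprm code and an older prl with the same code is purged before the new one is recorded, so 'bold then italic' and 'italic then bold' end up byte-identical. The composing rule (sprmPIncLvl) and the wide-effect purge rule (sprmCPlain) would need additional cases that are omitted here.

    #include <stdint.h>
    #include <string.h>

    /* Payload widths for a few invented replace-type sprm codes. */
    static const uint8_t mpsprmcb[256] = { [20] = 1, [21] = 1, [40] = 2 };

    /* Merge one prl into a grpprl kept in normal form (sorted by sprm
     * code, at most one prl per code).  The caller's buffer must have
     * room for the prl being added.  Returns the new grpprl length.   */
    static int CbMergePrl(uint8_t *grpprl, int cb, uint8_t sprm, const void *val)
    {
        uint8_t merged[512];
        int ibSrc = 0, ibDst = 0, fInserted = 0;

        while (ibSrc < cb) {
            uint8_t s = grpprl[ibSrc];
            int cbPrl = 1 + mpsprmcb[s];
            if (!fInserted && s >= sprm) {          /* keep sprm code order */
                merged[ibDst++] = sprm;
                memcpy(merged + ibDst, val, mpsprmcb[sprm]);
                ibDst += mpsprmcb[sprm];
                fInserted = 1;
            }
            if (s != sprm) {                        /* purge any older copy */
                memcpy(merged + ibDst, grpprl + ibSrc, (size_t)cbPrl);
                ibDst += cbPrl;
            }
            ibSrc += cbPrl;
        }
        if (!fInserted) {                           /* highest code so far  */
            merged[ibDst++] = sprm;
            memcpy(merged + ibDst, val, mpsprmcb[sprm]);
            ibDst += mpsprmcb[sprm];
        }
        memcpy(grpprl, merged, (size_t)ibDst);
        return ibDst;
    }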

When these merge rules are used, piece table grpprls can be merged into a normal form, and then stored in a heap block, an hprc. Any document pieces that record identical property changes can point to this shared hprc block rather than waste space by each recording their own description of the change.

We search for sharing opportunities, one last opportunity to omit unnecessary words from our recorded property changes, before we actually store newly produced grpprls in a heap block.

A linked list, accessed via the global list head vhprc, was threaded through all of the currently allocated hprc blocks stored in Word's heap. When a property change was applied to a piece entry and MergeGrpprls() had been used to produce a grpprl recording that change, the vhprc list would be searched to discover whether that particular grpprl was already stored in an hprc.

Since the memory allocation of an hprc structure could sometimes extend across hundreds of bytes, a hash of the hprc contents was always stored inside the hprc as it was added to the hprc list. A hash using the same formula would also be calculated for a new prc before the list was searched. If, during the search, the hash calculated for the new grpprl did not equal the hash stored in an hprc, we knew we could skip that hprc entry and continue to the next entry in the list. Only when the new grpprl's hash matched the hash stored in the hprc and the grpprl lengths matched would we have to compare the two grpprls byte by byte to determine whether they actually matched.
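
A minimal sketch of that search, with invented structure and field names and an arbitrary hash formula: the cached hash and the length act as cheap filters, and the byte-by-byte comparison runs only for true candidates.

    #include <stdint.h>
    #include <string.h>

    /* Illustrative heap-resident carrier: the grpprl bytes plus a cached
     * hash and length, linked onto the list headed by vhprc.             */
    struct HPRC {
        struct HPRC *phprcNext;    /* next block on the vhprc list    */
        uint16_t     hash;         /* cached hash of the grpprl bytes */
        uint16_t     cb;           /* grpprl length                   */
        uint8_t      grpprl[];     /* normalized instruction bytes    */
    };

    static uint16_t HashGrpprl(const uint8_t *grpprl, int cb)
    {
        uint16_t h = 0;
        for (int i = 0; i < cb; i++)        /* any cheap mixing will do */
            h = (uint16_t)(h * 31 + grpprl[i]);
        return h;
    }

    /* Return an existing block holding this grpprl, or NULL if none does. */
    static struct HPRC *PhprcLookup(struct HPRC *vhprc, const uint8_t *grpprl, int cb)
    {
        uint16_t hash = HashGrpprl(grpprl, cb);
        for (struct HPRC *p = vhprc; p != NULL; p = p->phprcNext) {
            if (p->hash != hash || p->cb != cb)
                continue;                            /* cheap rejection            */
            if (memcmp(p->grpprl, grpprl, (size_t)cb) == 0)
                return p;                            /* identical change: share it */
        }
        return NULL;
    }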

If the new grpprl to record matched one already stored, the handle of that hprc would be stored in the pcd.prm (piece property modifier) for that piece.

Whenever Word declared a memory emergency, when available heap storage had grown so small that it might have to suspend further processing, it would run a garbage collection process, which would identify hprc blocks that piece tables no longer referenced so that they could be deleted.

Guide to the Win Word 1.1 sources for software developers who wish to view the app's property storage, edit or fetch routines

CbGenChpxFromChp() (opus/wordtech/inssubs.c) was the routine that generated a CHPX for a given CHP.

FNewChpIns() (opus/wordtech/inssubs.c) called CbGenChpxFromChp() to calculate a CHPX for a newly inserted character, and then called FAddRun() (also in inssubs.c) to record it in the FKP if it differed.

CbGenPapxFromPap() was called by FInsertRgch() (both in opus/wordtech/inssubs.c) when newly added characters included a paragraph mark. If that paragraph also was the end of a table row, CbAppendTapPropsToPapx() would be called additionally to add TAP sprms to the end of the PAPX. Finally, FAddRun() would be called to pack the papx into its FKP.

CmdInsertSect1() (opus/wordtech/editsub.c) calls CbGrpprlProp() (opus/wordtech/inssubs.c) to create the SEPX for a newly created section.

dnsprm[], declared in opus/wordtech/prcSubs.c, is the master array which records the characteristics of all of the sprms defined in Word. 163 slots are allocated in the table, with 95 sprms defined, and 68 sprm definitions reserved.

So many sprm slots were reserved because changing opcode assignments would have required files recorded before such a change to be translated to use the new assignments. We preferred to avoid this if we could.

ApplyGrpprlCa() (opus/wordtech/prcSubs.c) would apply the sprms recorded in the grpprl to every piece that intersects the cp range passed in via its pca parameter.

PrmAppend() (opus/wordtech/prcSubs.c) would be called by ApplyGrpprlCa() to package the results of a call to MergeGrpprls() (also prcSubs.c), which would add the sprms from the new grpprl to those already recorded for the piece.

CachePara() (opus/wordtech/fetch.c) fetches the paragraph props for a given cp. It calls ApplyPrlSgc() with an sgcPap to reconstitute the last saved PAP for a paragraph using the PAPX and stylesheet PAP. Then, if the piece property modifier is non-nil, it applies any stored PAP sprms using DoPrmSgc().

FetchCp() (opus/wordtech/fetch.c) fetches the text and character properties beginning at a specified cp. It calls ApplyPrlSgc() with an sgcChp to reconstitute the last saved CHP for a character run using the correct fetched CHPX and stylesheet CHP. Then, if the piece property modifier is non-nil, it applies any stored CHP sprms using DoPrmSgc().

CacheSect() (opus/wordtech/fetch1.c) fetches the section props for a given cp. It calls ApplyPrlSgc() with an sgcSep to reconstitute the last saved SEP for a section using the correct fetched SEPX and standard SEP. Then, if the piece property modifier is non-nil, it applies any stored SEP sprms using DoPrmSgc().

CacheTap() (opus/wordtech/fetchtb.c) fetches the table props for a given cp. It calls CachePara() to retrieve the last saved TAP. Then if the piece property modifier is non-nil, it applies any stored TAP sprms using DoPrmSgc().

FQuicksaveChps() (opus/wordtech/savefast.c) scans the piece table to find entries that have not been saved in the destination document file. (During a full-save, none of the pieces have been saved there). For each such piece, WriteChps() is called, which for each character run calls CbGenChpxForChp(). If the newly calculated CHPX differs from the last one stored, FAddRun() is called to pack the differing CHPX into the CHP FKP page.

FQuicksavePaps() (opus/wordtech/savefast.c) scans the piece table to find entries that have not been saved in the destination document file. For each such piece, WritePaps() is called, which for each paragraph calls CbGenPapxForPap(). If the paragraph that is having its properties saved is also an end of table row mark, CbAppendTapPropsToPapx() is also called to add TAP sprms to the PAPX. Then FAddRun() is called to record the PAPX in the PAP FKP.

During RetrieveNextNonDestCp() (opus/wordtech/savefast.c), for each SED entry that hasn't been copied to the destination document, a SEPX is generated using CbGrpprlsProp() and written to the destination document.
