Thread started by davidlu on Tuesday, April 01, 2014.

Inside Microsoft Word

Guide to the Winword 1.1 Sources - available for download at the Computer History Museum

Word data objects

Important coordinates (word.h)

CP - document character positions - 32 bit integer coordinates that record locations within a Word document.

cp0 is the coordinate of the 0th character of a document

cpMac is a field that records the coordinate of the current maximum extent (Mac) of a document. This is used to report the maximum size of a document.

cpMax is a constant that reports the maximum size that a Word document may grow to.

FC - file character coordinates - 32 bit integer coordinates that point to locations of interest within a file that has been opened by Word.

fc0 is the coordinate of the 0th byte recorded in an open file.

fcMac is a field that records the current maximum extent of a file that is being written.

fcMax is a constant that reports the coordinate of the maximum size file that Word can open.

For Win Word 1.1, fcMax is declared to be 16,777,215. That value is a consequence of the fact that the PN (page number) values that Word uses are declared to be 16-bit quantities that point to 512 byte sectors within a file (32*1024*512 = 16,777,216 bytes, so the last byte coordinate is 16,777,215).

fcMax should have been declared to be ((64*1024)*512) before Word shipped because the PN declaration had been changed to be unsigned, which would have allowed PNs to count to 65535, doubling its size from its previous signed 16-bit declaration that would have allowed a maximum PN of 32767.
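The arithmetic behind both limits can be checked with a small sketch. The names cbSector and FcLimFromPnCount are hypothetical and do not come from word.h; only the numbers are taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; only the numbers come from the text. */
enum { cbSector = 512 };            /* bytes addressed by one PN */

/* Number of bytes addressable when a PN can take cpn distinct values. */
static int32_t FcLimFromPnCount(int32_t cpn)
{
    assert(cpn > 0 && cpn <= 64 * 1024);
    return cpn * (int32_t)cbSector;
}
```

With a signed 16-bit PN there are 32*1024 usable values, so FcLimFromPnCount(32 * 1024) gives 16,777,216 and the last addressable byte is 16,777,215. With an unsigned PN, FcLimFromPnCount(64 * 1024) gives 33,554,432.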

FCB - file control block (wordtech/file.h)

The FCB (File Control Block) is the data object that describes an open file in Word.

FN is an integer file number, which is used to access the FNth entry of the mpfnhfcb array to acquire its FCB. FCBs are stored in a heap, so they are nearly always accessed via a handle, an hfcb. The handle to the FCB for a particular FN called fn can be found in mpfnhfcb[fn].

The Document Descriptor - DOD (wordtech/doc.h)

A DOD (a document's Document Descriptor) is the data structure which records all the features of a document that is open within Microsoft Word.

A DOC is the integer index which identifies a document recorded in the mpdochdod array. The DOD (document descriptor) for a particular DOC named doc is accessible via a handle to document descriptor, an hdod, recorded in the slot of the mpdochdod array given by its doc number: mpdochdod[doc]
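A minimal sketch of the lookup, assuming a handle is simply a pointer to a pointer. The DOD here is cut down to three fields, and PdodFromDocSketch and docMaxSketch are hypothetical names, not the routines in the Word sources.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in: the real DOD has many more fields, and real
   handles point into Word's own heap. */
struct DOD { int fn; int fMother; int doc; };
typedef struct DOD **HDOD;          /* handle = pointer to a pointer */

enum { docMaxSketch = 8 };          /* hypothetical table size */
static HDOD mpdochdod[docMaxSketch];

/* Return the DOD for doc, or NULL when the slot is unused. */
static struct DOD *PdodFromDocSketch(int doc)
{
    if (doc < 0 || doc >= docMaxSketch || mpdochdod[doc] == NULL)
        return NULL;
    return *mpdochdod[doc];         /* dereference the handle */
}
```

The double indirection is what lets a heap manager move the DOD block: only the inner pointer changes, while the handle stored in mpdochdod[doc] stays valid.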

A doc number is passed as a parameter to routines that act on the contents of a particular document. There are tens of thousands of such references in the Word sources.

When a document's source text is stored in an opened file, dod.fn is its file number.

Subdocuments that are referenced in the DOD

docDot - document type (document template) subdocument. A document type subdoc will have dod.fDot set to fTrue

docFtn - footnote subdocument. A footnote subdoc will have dod.fFtn set to fTrue

docHdr - header/footer subdocument. A header/footer subdoc will have dod.fHdr set to fTrue

docGlsy - glossary document. A glossary subdoc will have dod.fGlsy set to fTrue

docMcr - macro subdocument. A macro subdoc will have dod.fMcr set to fTrue

docAtn - annotation subdocument. An annotation subdoc will have dod.fAtn set to fTrue

The root DOD that points to all of its subdocs has dod.fMother set to fTrue. When dod.fMother is fFalse, dod.doc is used to access a subdoc's mother document.

Many of the DOD fields that are needed to completely represent a mother document and all of its subdocs (its stylesheet, its outline structure, its section table, its bookmarks, its font master table, its key map, and menu descriptions) contribute no useful information inside the subdoc itself, because that information is stored within the mother doc's DOD.

For this reason, most subdocs have dod.fSDoc set to fTrue, which means that only a short prefix of the DOD has been allocated. When subdoc processing requires one of the mother doc fields, it accesses it via mpdochdod[dodOfSubdoc.doc].
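The step from a subdoc to its mother can be sketched as follows. This is an illustration under simplifying assumptions: the DODs live in a plain array rather than behind handles, and DODSketch, rgdodSketch and DocMotherSketch are hypothetical names.

```c
#include <assert.h>

/* Only the three fields discussed above; handles are omitted. */
struct DODSketch { int fMother; int fSDoc; int doc; };

enum { docMaxSketch = 8 };          /* hypothetical table size */
static struct DODSketch rgdodSketch[docMaxSketch];

/* Return doc itself for a mother document, else the mother's doc number. */
static int DocMotherSketch(int doc)
{
    assert(doc >= 0 && doc < docMaxSketch);
    return rgdodSketch[doc].fMother ? doc : rgdodSketch[doc].doc;
}
```

A routine that needs, say, the stylesheet while working on a footnote subdoc would first map the subdoc number through DocMotherSketch and then read the field from the mother's full-length DOD.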

Documents for special processes that are not openable into windows

Such special docs have reserved, globally known document numbers

docScrap

SAB - scrap state (wordtech/doc.h)

Routines that act on docScrap refer to the vsab, a globally accessible SAB instance, which records the scrap state and its selection coordinates from the source document (doc, cpFirst, cpLim).

docUndo

UAB - undo state (wordtech/doc.h)

Routines that act on docUndo refer to the vuab, a globally accessible UAB instance.

mpdochdod - Word's master array that maps from a DOC to a DOD, when a DOC is used as its index.

The Window Descriptor - WWD

NewWindow - the creation routine

mpwwhwwd

hdndl

Plex of CPs - PLC

Generally stored in the heap as an hplc. An hplc is a generalization of a dynamic array. A Word document points to a large number of PLC structures which all record important boundaries of differing classes within a document's character stream or else point to important objects within the stream. Each of these PLCs contains different payloads, which record descriptors that identify these different

An hplc contains a header, which contains declarations that determine the allocation, usage and size of the structure, an index array which contains an ascending range of CPs (Character Positions), and a second payload array which is kept in 1-to-1 correspondence with the range of CPs.

The cp of the 0th character in a document is called cp0. The current maximum number of cps in a doc is recorded in dod.cpMac. The index of the last cp recorded in a doc is dod.cpMac-1.

When it's necessary to record information about subranges of characters within a document (Word programmers called such subranges character runs or runs of characters), a PLC would be established to mark the boundaries of runs within the interval [0,cpMac). Note that it takes n+1 cps to mark the boundaries of n adjacent intervals.

plc.iMac records the number of cp indexes stored in the PLC. plc.rgcp records the ascending array of cp coordinates. plc.rgcp[0] is always set to cp0 and plc.rgcp[plc.iMac] is always set to dod.cpMac. In most PLCs, plc.rgcp[itc] < plc.rgcp[itc+1] always. Translation: in most PLCs, the CPs are recorded in strictly increasing order.

In the rare case where PLCs are used in pairs to record overlapping interval bounds (e.g. the bookmark PLCs hplcbkf and hplcbkl), where one PLC records the interval cpFirsts and the other records the interval cpLims, the correct invariant is plc.rgcp[itc] <= plc.rgcp[itc+1]. Translation: when intervals are allowed to overlap, it's possible for several adjacent intervals to begin or end at the same cp position.

In some kinds of PLCs, called Reference PLCs by Word developers, the intent of the cps of the PLC is to point to the character position of important marks within the document and not to mark the boundaries of intervals within the document.

This distinction needed to be made because a PLC that records interval boundaries would be queried using a CP to discover the CP range that contains that query cp. A reference PLC would be queried with a particular cp to discover if a particular mark existed at that CP position. If there is not a CP recorded at that precise location, such a search routine would have to report "mark not found".
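The two kinds of query can be sketched as two binary searches over an ascending rgcp array. This is a sketch under stated assumptions, not the search code from the Word sources: IInPlcSketch and IRefInPlcSketch are hypothetical names, and the arrays are plain C arrays rather than heap-resident PLCs.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t CP;

/* Interval query: return i such that rgcp[i] <= cp < rgcp[i+1].
   rgcp holds iMac+1 ascending entries; cp must lie in [rgcp[0], rgcp[iMac]). */
static int IInPlcSketch(const CP *rgcp, int iMac, CP cp)
{
    int iMin = 0, iLim = iMac;      /* rgcp[iMin] <= cp < rgcp[iLim] holds */
    assert(iMac > 0 && rgcp[0] <= cp && cp < rgcp[iMac]);
    while (iLim - iMin > 1) {
        int iMid = iMin + (iLim - iMin) / 2;
        if (rgcp[iMid] <= cp)
            iMin = iMid;
        else
            iLim = iMid;
    }
    return iMin;
}

/* Reference query: return the index of the mark exactly at cp,
   or -1 to report "mark not found". rgcp holds cMarks entries. */
static int IRefInPlcSketch(const CP *rgcp, int cMarks, CP cp)
{
    int iMin = 0, iLim = cMarks;
    while (iMin < iLim) {
        int iMid = iMin + (iLim - iMin) / 2;
        if (rgcp[iMid] == cp)
            return iMid;
        if (rgcp[iMid] < cp)
            iMin = iMid + 1;
        else
            iLim = iMid;
    }
    return -1;
}
```

The interval query always succeeds for an in-range cp because the intervals tile [cp0, cpMac), while the reference query succeeds only on an exact match.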

A new PLC would be created by declaring the size of the descriptor data object that will be stored for each interval or reference recorded in the PLC. That size would be recorded in plc.cb.

Payload space for a generic data structure, call it foo, would be reserved in an rgfoo array that follows the last rgcp entry. Its entries are kept in one-to-one correspondence with the interval cpFirsts or marking CPs stored in the PLC.

Because the length of plc.rgcp varied according to the value of the plc.iMax setting, the descriptor payload of a PLC could not be accessed via an ordinary array reference. Instead, the PLC payload was accessed via calls to GetPlc and PutPlc.

If the payload of the PLC was a SED, the plc payload size would be declared to be cbSed (the length of a SED as a count of bytes) and the plc would be called hplcsed.

The ith payload of the hplcsed would be fetched via GetPlc(hplcsed, i, &sed) and the ith payload of the hplcsed would be recorded via PutPlc(hplcsed, i, &sed).
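A minimal sketch of why accessor routines are needed: the payload array starts at an offset that depends on iMax, so its address has to be computed at run time. The header layout, the use of malloc-style allocation in place of Word's heap handles, and every name ending in Sketch are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef int32_t CP;

/* Simplified header; the real PLC header has more fields. */
struct PLCSketch {
    int iMac;       /* intervals in use */
    int iMax;       /* intervals allocated */
    int cb;         /* bytes per payload descriptor */
    CP  rgcp[];     /* iMax+1 CPs, followed by iMax payloads of cb bytes */
};

static struct PLCSketch *PplcAllocSketch(int cb, int iMax)
{
    size_t cbTotal = sizeof(struct PLCSketch)
                   + (size_t)(iMax + 1) * sizeof(CP)
                   + (size_t)iMax * (size_t)cb;
    struct PLCSketch *pplc = calloc(1, cbTotal);
    if (pplc != NULL) {
        pplc->iMax = iMax;
        pplc->cb = cb;
    }
    return pplc;
}

/* Address of the ith payload: just past the last allocated rgcp entry. */
static char *PfooSketch(struct PLCSketch *pplc, int i)
{
    assert(i >= 0 && i < pplc->iMax);
    return (char *)&pplc->rgcp[pplc->iMax + 1] + (size_t)i * (size_t)pplc->cb;
}

static void GetPlcSketch(struct PLCSketch *pplc, int i, void *pfoo)
{
    memcpy(pfoo, PfooSketch(pplc, i), (size_t)pplc->cb);
}

static void PutPlcSketch(struct PLCSketch *pplc, int i, const void *pfoo)
{
    memcpy(PfooSketch(pplc, i), pfoo, (size_t)pplc->cb);
}
```

Copying the descriptor in and out by value, as GetPlc(hplcsed, i, &sed) does in the text, also means the caller never holds a pointer into a heap block that might move.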

Piece Table

PCD (PieCe table Descriptor)


Word programmers called these text segments pieces, and kept track of the location and size of these pieces using a PLC whose payload descriptors were PCDs, called the hplcpcd. A document's hplcpcd could be accessed using the expression dod.hplcpcd.

pcd.fn pointed to the open file that contains the piece that the PCD describes.

pcd.fc pointed to the beginning file coordinate of the piece within its containing file.

If a piece was the ith piece of the document, the piece's length could be calculated via the expression plc.rgcp[i+1] - plc.rgcp[i].
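Put together, these fields are enough to map a document cp to a location in a file. The sketch below assumes one byte per character and uses a linear scan for brevity; PCDSketch and FLocateCpSketch are hypothetical names, and the prm field is omitted.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t CP;
typedef int32_t FC;

struct PCDSketch { int fn; FC fc; };   /* prm omitted */

/* Find the file and file coordinate that hold document position cp.
   rgcp has ipcdMac+1 ascending entries. Returns 0 when cp is out of range.
   Assumes one byte per character. */
static int FLocateCpSketch(const CP *rgcp, const struct PCDSketch *rgpcd,
                           int ipcdMac, CP cp, int *pfn, FC *pfc)
{
    int i;
    if (ipcdMac <= 0 || cp < rgcp[0] || cp >= rgcp[ipcdMac])
        return 0;
    for (i = 0; cp >= rgcp[i + 1]; i++)
        ;                               /* stop at the piece containing cp */
    *pfn = rgpcd[i].fn;
    *pfc = rgpcd[i].fc + (cp - rgcp[i]);
    return 1;
}
```

For example, with piece boundaries {0, 5, 12, 20} and a second piece stored at fc 0 of file 2, document position 7 resolves to fc 2 in file 2.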

PRM (property modifier - recorded in pcd.prm), PRL (property list element), GRPPRL (group of property list elements), PRC (property list carrier) and SPRM (single property modifier) - (wordtech/prm.h)

Whenever character, paragraph or table properties were changed within the range of a piece of text since the last time the document had been fully saved, a 16-bit PRM code would be stored in the piece table to record the changes that had been made.

Each individual property change was encoded as one instruction of a small interpreted bytecode language. The opcode of these instructions, called a SPRM (a single property modifier), identified one field of a CHP, PAP or TAP structure that would be modified by a bytecode interpreter, using whatever data was recorded in the payload bytes that followed the opcode. One of these instructions, whose length could vary from 1 byte through several thousand bytes, was called a PRL.

If more than one prl was necessary to record the property changes accumulated for a run of text, these prl instructions were appended one after another in a byte array called a GRPPRL. In a grpprl, the sprm code of the n+1st prl would be placed in the byte following the last byte of the nth prl.

Word's interpretation routine for grpprls was called ApplyPrlSgc. That routine was passed an in-memory property block, an identifier code called the SGC that described whether that block was a character, paragraph, table or section property block, and a grpprl. It would apply the changes recorded in the grpprl by scanning the grpprl PRL by PRL until its end was reached, executing any of the property changes that had a target in that type of property block.

If the length of a PRL was two bytes or less and its sprm code was small enough to have its leftmost bit turned off, prm.fComplex would be set to fFalse and that single prl could be stored within the last 15 bits of the PRM structure.

When a grpprl could not pass this test, it was stored within a PRC structure (a prl carrier structure). That PRC would be stored in Word's internal heap and would be allocated a 16-bit heap handle, an hprc, to point to it. It was guaranteed that heap blocks were always stored on two-byte boundaries in the internal heap.

When the leftmost bit of the 16-bit PRM, prm.fComplex, was set to 1, the 16-bit heap handle would be divided by 2 and the 15-bit result would be recorded in the field pcd.prm.cfgrPrc.
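The packing can be sketched with plain shifts and masks. The exact bit positions are an assumption: the sketch puts fComplex in the top bit of a 16-bit word, the 7-bit sprm in bits 14..8 and a one-byte operand in bits 7..0. The real PRM is a C bitfield whose layout depends on the compiler, and all names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t PRMSketch;
#define fComplexMaskSketch 0x8000u      /* assumed position of fComplex */

/* Non-complex form: a prl of at most two bytes stored inline. */
static PRMSketch PrmFromSprmValSketch(unsigned sprm, unsigned val)
{
    assert(sprm < 0x80u && val <= 0xFFu);   /* sprm's top bit must be off */
    return (PRMSketch)((sprm << 8) | val);
}

/* Complex form: an even 16-bit heap handle stored as handle/2. */
static PRMSketch PrmFromHprcSketch(uint16_t hprc)
{
    assert((hprc & 1u) == 0);               /* two-byte alignment guarantee */
    return (PRMSketch)(fComplexMaskSketch | (hprc >> 1));
}

static int FComplexPrmSketch(PRMSketch prm)
{
    return (prm & fComplexMaskSketch) != 0;
}

/* Recover the original handle by multiplying the 15-bit field by 2. */
static uint16_t HprcFromPrmSketch(PRMSketch prm)
{
    assert(FComplexPrmSketch(prm));
    return (uint16_t)((prm & 0x7FFFu) << 1);
}
```

The two-byte alignment guarantee is what makes the scheme lossless: the low bit of every handle is known to be zero, so dropping it frees one bit for the fComplex flag.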

The dnsprm was the master array which described the length, format, and action of each sprm known to Word. The ApplyPrlSgc routine, as it identified the sprm code of the next PRL in the GRPPRL, would use that sprm code as an index within dnsprm. The fields within dnsprm[sprm] would include dnsprm[sprm].cch (count of characters of a prl that begins with that sprm code), dnsprm[sprm].b, the byte offset within its property block of the prl's target, dnsprm[sprm].sgc, the sprm group code, which identified the type of property block the sprm could be applied to, and dnsprm[sprm].spra, the sprm action which determined how the payload of a prl would be used to transform the target field in the targeted property block.

In Hungarian notation, an array was given the DN (domain) tag when the position of each array entry, and therefore its index coordinate, had structural significance within the system. In the dnsprm, the domain of sprms, once a sprm code had been assigned to a particular entry of the table, one could not alter that sprm code assignment without planning and forethought. This was because sprm codes were recorded within saved Word documents, .doc files.
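A table-driven interpreter of this shape can be sketched in a few lines. Everything below is a toy under stated assumptions: the property blocks have three fields at most, every prl is exactly two bytes long, there is a single action (store one byte), and all names ending in Sketch or Sk are hypothetical rather than taken from prm.h.

```c
#include <assert.h>
#include <stddef.h>

enum { sgcChpSk, sgcPapSk };        /* property block types */
enum { spraByteSk };                /* action: store one operand byte at b */

struct ESPRMSketch { unsigned char cch, b, sgc, spra; };

struct CHPSketch { unsigned char fBold, fItalic, hps; };
struct PAPSketch { unsigned char jc; };

enum { sprmCFBoldSk, sprmCFItalicSk, sprmCHpsSk, sprmPJcSk, sprmMaxSk };

/* One entry per sprm code, indexed by the code itself. */
static const struct ESPRMSketch dnsprmSketch[sprmMaxSk] = {
    { 2, (unsigned char)offsetof(struct CHPSketch, fBold),   sgcChpSk, spraByteSk },
    { 2, (unsigned char)offsetof(struct CHPSketch, fItalic), sgcChpSk, spraByteSk },
    { 2, (unsigned char)offsetof(struct CHPSketch, hps),     sgcChpSk, spraByteSk },
    { 2, (unsigned char)offsetof(struct PAPSketch, jc),      sgcPapSk, spraByteSk },
};

/* Walk the grpprl prl by prl; apply only prls that target this block type. */
static void ApplyGrpprlSketch(void *pprop, int sgc,
                              const unsigned char *grpprl, int cb)
{
    int ib = 0;
    while (ib < cb) {
        const struct ESPRMSketch *pesprm;
        assert(grpprl[ib] < sprmMaxSk);
        pesprm = &dnsprmSketch[grpprl[ib]];
        assert(pesprm->cch >= 1 && ib + pesprm->cch <= cb);
        if (pesprm->sgc == sgc && pesprm->spra == spraByteSk)
            ((unsigned char *)pprop)[pesprm->b] = grpprl[ib + 1];
        ib += pesprm->cch;          /* skip to the next prl's sprm byte */
    }
}
```

Applying the grpprl {sprmCFBoldSk, 1, sprmPJcSk, 2, sprmCHpsSk, 24} to a CHPSketch sets fBold and hps and skips the paragraph prl. Applying the same bytes to a PAPSketch sets only jc.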

If it was necessary to change the numerical value of sprms that had been saved by older versions of Word, it was necessary to write a translation routine which would map old saved sprm codes to the values used in the new Word version so stored grpprls would be interpreted correctly by the new Word code.

Important PLC structures referenced by DOD

hplcpcd - recorded a document's piece table. Its payload was a PCD.

hplcpgd - recorded a document's page table, which recorded the boundaries and format of each page of the document. Its payload was a PGD (a page descriptor).

hplcphe - recorded the height of each paragraph in the document. This PLC's payload was a PHE, a paragraph height record.

hplcsed - recorded a document's section table. Its payload was an array of SEDs, section descriptors.

hplchdd - recorded a document's header/footer table. Its payload was an array of HDDs, Header/Footer Descriptors.

hplcfld - recorded a document's field parameter table. Its payload was an array of FLDs, Field Descriptors, which recorded the parsed parameters of the document's fields. Fields carried many different types of data that would be expanded and kept current by Word as a user edited their document.

hplcmcr - recorded a dictionary reference for the document's macros. Its payload was an array of MCR structures

hplcglsy - when a document was a glossary, a repository of named text excerpts which could be copied into a document, the GLSY array payload stored here recorded the bounds and characteristics of each entry of the glossary

hplcfrd - a reference plc, which recorded the CP locations of each footnote in a mother document. Its payload was an array of FRDs, footnote reference descriptors.

hplcfnd - for a footnote reference in the main document, a plc whose CPs demarcated the bounds of the footnote text recorded in dod.docFtn, kept in correspondence with the mother doc's footnote references

hplcatrd - a reference plc, which recorded the CP locations of each annotation in a mother document. Its payload was an array of ATRDs, annotation reference descriptors.

hplcand - for an annotation reference in the main document, a plc whose CPs demarcated the bounds of the annotation text recorded in dod.docAtn, kept in correspondence with the mother doc's annotation references

Word kept track of the first and limit CPs of a bookmark in a pair of PLCs, the hplcbkf and hplcbkl.

hplcbkf - the CPs of the hplcbkf recorded the cpFirsts of a document's bookmarks. Its payload was an array of BKF structures, Bookmark First structures

hplcbkl - the CPs of the hplcbkl recorded the cpLims of a document's bookmarks. This PLC carried no descriptor payload.
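A sketch of how a pair of PLCs can describe overlapping bookmarks. The text does not say what a BKF contains, so the sketch assumes it holds the index of the matching cpLim entry in the second PLC; that ibkl field, and the names BKFSketch and CBkmkAtCpSketch, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t CP;

struct BKFSketch { int ibkl; };     /* assumed: index of the matching cpLim */

/* Count the bookmarks whose [cpFirst, cpLim) range covers cp.
   rgcpFirst is ascending (non-strictly), so the scan can stop early. */
static int CBkmkAtCpSketch(const CP *rgcpFirst, const struct BKFSketch *rgbkf,
                           const CP *rgcpLim, int ibkfMac, CP cp)
{
    int ibkf, c = 0;
    for (ibkf = 0; ibkf < ibkfMac && rgcpFirst[ibkf] <= cp; ibkf++)
        if (cp < rgcpLim[rgbkf[ibkf].ibkl])
            c++;
    return c;
}
```

With bookmarks [0,10), [5,8) and [5,20), the cpFirst array is {0, 5, 5} and the cpLim array is {8, 10, 20}. Both arrays stay sorted, which is why the weaker <= invariant applies, and the index stored with each cpFirst ties the two arrays back together.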

hplcddli - recorded a document's DDE Link table. The PLC's payload was an array of DDLIs, DDE Link descriptions, which showed where to transmit the contents of a field beginning at that location to a remote document, or else showed where data being received from a remote machine should be stored.

hplcpad - this PLC recorded the current outline state of a document, if it was open in an outline view. The payload was an array of PADs, outline Paragraph descriptions, which recorded the expand/collapse state of the paragraphs recorded in the doc viewed as an outline structure.

Plex - PL (wordtech/doc.h)

A Plex is a generalized, dynamic array stored in the heap. The Hungarian tag for a Plex is PL.

The allocator for the PL, HplInit(cbPlc, ifooMaxInit), took two parameters: a cbPlc, which specified the size of the payload structure of the PL, and an ifooMaxInit, which specified the allocation size of the PL. pl.iMac, the current maximum (the count of entries actually used in the PL), was initialized to 0 at creation.

The PutPl() routine was used to fill a PL slot with its payload and GetPl() was used to retrieve the PL's payload structure.

FreeHpl() was used to deallocate an hpl.

In the usual case (PLs allocated by HplInit), the array of payload elements would start immediately after the header fields that are located at the beginning of the structure. This structure may be generalized, though, to reserve space between the end of the structure header and the beginning of the variable length payload array allocation. For this purpose pl.brgfoo (a byte offset to the beginning of the range of foo - we use foo to describe the payload structure that the PL records) is added to the structure.

For allocations via HplInit, pl.brgfoo is set to cbPLCBase, the size of the PLC header.

When it was necessary to create a space reservation between the header and the PL's payload array (the rgfoo, if you will), an alternate allocation method, HplInit2 would be called whose calling sequence included an extra parameter brgfoo, that allowed the pl.brgfoo to be set, which would be used to create the requested space reservation.

Word's WWD structure, the Window Descriptor, was an instance of a structure that began with the fields of a PL header, followed by a space reservation for all of the fields that defined a window for Word, followed by a variable length array, the dndl, whose entries described the geometrical characteristics of every dl (display line) shown in the window.

Because of this structural congruence, an hwwd structure was created via a call to HplInit2, whose result was cast to **WWD, with cbWWD passed as the brgfoo parameter, and the iMax parameter set to the initial number of display lines shown by that window.
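The brgfoo idea can be sketched with an ordinary allocation. This is not the HplInit2 calling sequence from the sources: the sketch returns a plain pointer rather than a handle, the header is reduced to four fields, and PLSketch, PplInit2Sketch, PInPlSketch and the WWDSketch fields are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct PLSketch {
    int iMac;       /* entries in use */
    int iMax;       /* entries allocated */
    int cb;         /* bytes per entry */
    int brgfoo;     /* byte offset from the start of the block to entry 0 */
};

/* Allocate a PL whose entry array begins brgfoo bytes into the block. */
static struct PLSketch *PplInit2Sketch(int cb, int iMax, int brgfoo)
{
    struct PLSketch *ppl;
    assert(brgfoo >= (int)sizeof(struct PLSketch));
    ppl = calloc(1, (size_t)brgfoo + (size_t)iMax * (size_t)cb);
    if (ppl != NULL) {
        ppl->iMax = iMax;
        ppl->cb = cb;
        ppl->brgfoo = brgfoo;
    }
    return ppl;
}

/* Address of entry i, wherever the entry array happens to start. */
static void *PInPlSketch(struct PLSketch *ppl, int i)
{
    assert(i >= 0 && i < ppl->iMax);
    return (char *)ppl + ppl->brgfoo + (size_t)i * (size_t)ppl->cb;
}

/* A window-like structure: PL header, fixed fields, then the entry array. */
struct WWDSketch {
    struct PLSketch pl;
    int xwMin, ywMin;   /* hypothetical window fields */
};
```

Passing sizeof(struct WWDSketch) as brgfoo leaves room for the fixed window fields between the header and the entries, so generic PL routines and window-specific code can share one heap block.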

The most important array generalization implemented by a PL r

stsh.plestcp

STTB - dynamic string table

STTBF - string table stored in file

stsh.hsttbName

stsh.hsttbChpx

stsh.hsttbPapx

hsttbBkmk

hsttbGlsy

hsttbAssoc

hsttbChpe

hsttbPape

hsttbFlc

STSH - the stylesheet structure

CHP - Character Property block

PAP - Paragraph Property block

TAP - Table Properties Block

TC - table cell description

tap.itcMac is the current maximum number of cells stored for this table. A particular table cell's description is accessed via tap.rgtc[itc]

SEP - Section Property block

DOP - Document Property block

FKP - Formatted Disk Page

CHPX - character property exception block

PAPX - paragraph property exception block

Important PLC structures referenced by an FCB (file control block)

hplcbteChp - a PLC that records the boundaries of FCs (File Coordinates) within a file that mark a run of text whose Character Property (CHP) specifications can all be found in the single FKP page located by a page number, a PN, recorded in bte.pn, a field within the PLC's payload of Bin Table descriptions (BTEs)

hplcbtePap - a PLC that records the boundaries of FCs (File Coordinates) within a file that mark a run of text whose Paragraph Property (PAP) specifications can all be found in the single FKP page located by a page number, a PN, recorded in bte.pn, a field within the PLC's payload of Bin Table descriptions (BTEs)

SEPX - section property block

BPTB - buffer page table

SEL - Selection descriptor. Records the doc, cpFirst and cpLim of a selection.

SELS
