******************************************************************
* *
* Stanford Data Center *
* Stanford University *
* Stanford, Ca. 94305 *
* *
* (c)Copyright 1994 by the Board of Trustees of the *
* Leland Stanford Junior University *
* All rights reserved *
* Printed in the United States of America *
* *
******************************************************************
SPIRES (TM) is a trademark of Stanford University.
The intent of this manual is to enable anyone with a good knowledge of SPIRES searching and updating to define a functional SPIRES file and manage its contents.
The file definition process for most files is fairly straightforward: analyze the structure of the records you want to have in your data base, define the characteristics of those records in the file definition language, then test your definition with sample data records. Next, by analyzing the requirements for retrieving those records, define the indexes required and the method by which information from the data records will be passed to the indexes. After testing the searching capabilities you may want to define various levels of access and protection for your file.
Before tackling a production file requiring complex techniques, experiment with a file of modest requirements. This manual is meant to accompany you through your first file definition; several appendixes give details of some powerful definition techniques, which rely on a previous mastery of the basics of file definition.
Since a file definition is itself a record in a SPIRES system-owned subfile, a knowledge of SPIRES searching and updating commands is essential for entering a file definition. This knowledge is also essential for the assessment of searching requirements for a file: you must formulate your searching needs in terms of the SPIRES command facilities you understand. For example, you might be likely to overlook compound indexing techniques as a possibility for a file if you had not used the extended capabilities of compound indexes in searching a SPIRES file. An informed file definer is first an experienced SPIRES user; you are encouraged to investigate the SPIRES facilities you will need with the SPIRES consultant.
Many experienced users are not aware of the internal features of SPIRES that provide the command language. Therefore, Part A of this document is devoted to linking conceptually the internal file definition options to the external command language. A cursory view is presented of those facilities of the file definition language which are invisible to the general user.
The section following the overview gives a timetable for the file definition process, outlining the topics to be considered at each stage of the definition. It is tempting to define a file all at once, that is, defining indexes and access privileges while trying to decide the length of certain elements. This approach is haphazard at best; at its worst it is confusing--try to approach the tasks of file definition in the order in which they are presented in this manual.
If you encounter problems in learning file definition, you may contact the SPIRES consultant in SCIP User Services at Polya Hall for assistance. Extensive assistance in defining files is available for a fee through Contract Programming services.
Intimately tied to a file may be various input and output formats, as well as protocols. Formats and protocols, whose command languages are described in other SPIRES documents, can be used to present an attractive, concise and helpful interface between your file and its users.
The original file definition manual was written by J. R. Schroeder; the present document was written by J. R. Sack.
SPIRES, the Stanford Public Information Retrieval System, is a generalized, online data base management system developed at Stanford University in the early 1970's.
The task of the original SPIRES development was to provide a file and file management system for Stanford's library automation project (BALLOTS). The versatility of the present SPIRES design can be gauged by the diversity of applications SPIRES now supports. Since 1972 well over 300 data bases have been implemented, including such applications as BALLOTS, the library automation project; bibliographic citations files, such as PLANTBIO; student record management; document inquiry and preparation systems, using SPIDOC; program library maintenance, MASTERLIST; survey data, both geo-physical and astronomical; inventory and materials tracking systems; directories, catalogs, mailing lists, and others. Present files range in size from a few dozen records to over three-quarters of a million. In sum, SPIRES serves the database needs of a large and diverse computing community.
SPIRES users design and maintain their own data bases; there is no centralized data base administrator. A number of the data base applications noted above were defined by individual users, non-professionals in data base systems, largely without individual programming aid from the data base professional staff.
Files presently in the system vary widely in complexity from unindexed files with two elements (a protocols subfile, for example), through files with ten indexes for as many elements (a personnel file, perhaps), up to files with over a hundred elements and nested data structures (the MARC or FILEDEF subfile, for example). Many of the present files are made up of more than one subfile, with the file definition describing the interrelationships between subfiles in a single file.
The command language available to SPIRES users for manipulation and management of these data bases is described in the SPIRES/370 Searching and Updating Manual; only commands that relate specifically to file definition and management will be discussed in this manual.
The various "languages" that form the language for development, management and use of data bases are:
Most SPIRES users are familiar with the command language, which is used for input and editing of data (TRANSFER, UPDATE, ADD), record retrieval (FIND, AND, OR, ALSO, FOR), and record display (SET FORMAT, SET REPORT, TYPE, OUTPUT, DISPLAY). The prospective file definer and manager should know the capabilities of the index and sequential searching commands.
Protocols are an extension of the command language: a protocol is a set of SPIRES, WYLBUR, MILTEN and ORVYL commands that make up a program that can be executed by users needing guidance in manipulation of a particular data base. Using protocols you can extend the normal SPIRES command language, tailoring the interface of specific users to specific files. This tailoring is particularly valuable when the end user of a production file has no special training in SPIRES commands.
Protocols are developed and tested interactively, and feature string manipulation and arithmetic functions as well as condition testing and branching capabilities for sophisticated interactive dialog and data base manipulation.
Any file user can provide formats for input and/or output of any data base's contents by defining a set of data transformations that map information from a source (a data base or terminal, for example) to a destination (a terminal or data base).
By means of input formats, a user can be prompted for the records' element values, and be given helpful diagnostic messages and reprompts should any error conditions be raised. Also, input formats provide a tool for converting pre-existing machine readable data to SPIRES-suitable input form. Using commands for arithmetic and string operations, condition testing and branching, complex algorithms for data validation can be performed that are unavailable using even elaborate file-definition input editing rule sequences.
By means of output formats, products such as reports, directories and catalogs can be produced by mapping file element values, and any other computed values, onto a two-dimensional array that can be output onto a terminal, line-printer, or full-face CRT. Output formats can make SPIRES data base contents acceptable for use by batch programs, or can arrange the same data so it can be easily understood by an untrained user.
The formats facility should be considered an integral part of the file/user interface. Database contents can be graphically organized so that information structures and hierarchies are easily recognized by any user. For example, a bibliographic entry in the CATALOG file can be output in the form of a library catalog card so that any user would easily recognize which elements are the author, title, and call number by their place on the output screen. Using another format, information in a card catalog file can be selected for printing in the form of a call number on a book's spine label. A data base of computer charges can be used to produce a billing letter for an individual user and reports of system-wide charges for an accountant. Just as in an outline, formats can make use of indentation to highlight hierarchical relationships of elements in a record. Where data is logically understood as a table, table formats can be devised. Text such as catalogs can be output in a form suitable for photocomposition.
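The idea of an output format as a mapping of element values onto a two-dimensional array can be pictured with a small Python sketch. This is not the SPIRES formats language; the function `render`, the layout table `CARD` and the element names are hypothetical, chosen only to illustrate placing values at fixed rows and columns, as on a catalog card.

```python
def render(record, layout, rows=4, cols=40):
    """Place each element value at a (row, col) position on a character grid."""
    grid = [[" "] * cols for _ in range(rows)]
    for element, (row, col) in layout.items():
        text = str(record.get(element, ""))
        # Truncate any value that would run past the right-hand edge.
        for i, ch in enumerate(text[: cols - col]):
            grid[row][col + i] = ch
    return ["".join(line).rstrip() for line in grid]


# Hypothetical layout: element name -> (row, column), indented like an outline.
CARD = {"call_number": (0, 0), "author": (1, 4), "title": (2, 8)}

book = {"call_number": "QA76", "author": "Knuth, D.", "title": "Sorting"}
card_lines = render(book, CARD)
```

The same record could be rendered through a second layout table (a spine label, say) without touching the stored data, which is the point of keeping formats separate from the file.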
There are several SPIRES processors that are important to the file definer and manager.
This is the SPIRES that most users know: the program with which most file searching and updating is done.
This is an offline version of SPIRES that allows users to indicate a series of SPIRES commands that are to be executed during non-prime time blocks. Large searches and reports are also aided by this facility, since they are not hampered by the active file and user core size restrictions that are applied to interactive SPIRES users. Use of this facility is made easier by the OFFLINE file: user commands are added as a record to this file, then executed after file updating is done by JOBGEN (see below).
SPICOMP is an online program that compiles file definitions, format definitions, variable group definitions and protocols, returning error messages if the compilation process is not successful. Users no longer need SPICOMP, since all the compilation functions mentioned are incorporated into SPIRES.
JOBGEN is a batch program run every night against any file that has records in the deferred queue and does not have the NOAUTOGEN option set. It submits a batch SPIBILD job (see below) that passes deferred queue records to goal and index records overnight, during non-prime time blocks; the same passing can be done online by SPIBILD. The file owner can issue online commands that cause JOBGEN to pass over the file on certain nights, or can indicate in the file definition that JOBGEN is not to be run on the file.
SPIBILD can be called in either an online or batch form to pass records in the deferred queue of your file to the goal record and index record data sets.
FASTBILD is a batch program that greatly reduces the CPU time and I/O's necessary for adding a large number of initial records to a new (empty) file. A protocol is available to generate the necessary JCL.
This is a batch program similar to FASTBILD: it provides a facility for adding a large number of records to a file using less CPU time and fewer I/O's than SPIBILD. Unlike FASTBILD, it can be used on files that already contain data. A protocol is available to generate the necessary JCL.
This facility allows access to SPIRES data bases through batch programs in PL/I, COBOL, and FORTRAN. The batch programs can provide input and/or process output from SPIRES files.
A SPIRES user can access any data base permitted to him or her through a powerful set of English-like commands. The same command language applies to all SPIRES files, in keeping with the general-purpose intent of the SPIRES system. Though the number of commands is large, a searcher develops a feel for the grammar implicit in SPIRES commands. These commands allow you to
- obtain online search and updating assistance by using the TUTORIAL, HELP, or EXPLAIN commands;
- select a data base permitted to you, explain its contents and understand its organization by displaying the search terms and data elements, or by browsing its indexes;
- search a data base using indexes defined in the file definition;
- search a data base sequentially using information that is not indexed;
- choose an output or report format;
- manipulate the records retrieved by a search, display them, or put them in the text-editor's active file for manipulation by editing commands;
- choose an input format;
- add to the data base or remove or update existing records.
At Stanford these command facilities are supplemented by the text-editor, WYLBUR, which provides commands for data entry, offline listings, and remote job entry. Terminal communication is the function of another service program called MILTEN. The timesharing monitor, called ORVYL, supports the file system in which SPIRES data bases are stored, and controls virtual memory and resource scheduling for interactive SPIRES users. These three subsystems, WYLBUR, ORVYL and MILTEN, are distinct from SPIRES, and SPIRES does not duplicate their functions. At other installations these companion services are provided by different text editors, communications controllers and timesharing systems.
The SPIRES commands used for searching and updating any data base are covered in a user's manual of less than 170 pages ("SPIRES/370 Searching and Updating"), and can be learned in a four-session course.
The primary data base unit familiar to SPIRES users is the "subfile." The subfile is a set of goal records optionally linked to index records. Speaking generally, we can say that a goal record is what you retrieve from a search request--a book in the CATALOG subfile, a restaurant in the RESTAURANT subfile, a file definition in the FILEDEF subfile.
The subfile name is specified in a section of the file definition, appropriately called the "subfile section," in conjunction with a list of the computer account numbers of users or groups of users for whom you specify various levels of access to your file. There are many levels possible. You may make the entire subfile available to the public for searching and updating (the RESTAURANT subfile is an example of this privilege level). You may make all records searchable and only some updatable (such as the PEOPLE subfile), or make only certain records available for search and update (such as the FILEDEF and FORMATS subfiles). You could even prevent a user from seeing, searching and/or updating certain elements in all of the goal records of a subfile. Of course, a subfile's use can be restricted to only one account (the file owner's) if you wish.
The file definer must be aware of the difference between a file and a subfile. The difference is best shown through examples: the DOCUMENTS, MASTERLIST, and MASTERLIST SHARE CODES subfiles are all part of a SCIP-maintained file of software resources; the MARC and CATALOG subfiles belong to a single file maintained by the BALLOTS Center.
It is often useful to put subfiles that share some common information together in a single file, allowing SPIRES to look up or cross-reference information in one subfile when it is needed for input or output operations in another. Information is not stored redundantly in each subfile, because look-ups between subfiles in the same file can be performed; such a method of linkage also makes file updating more convenient, since common information need only be modified once per file, not once per subfile. It might be useful to link together subfiles of student, teacher, and course goal records in a single file, looking up a student's identification number or name when printing out a teacher's course list, and looking up a course number or title when printing out a student's transcript.
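The student, teacher and course example can be sketched in Python, using one dictionary per subfile. This is only an illustration of looking information up rather than storing it redundantly; the record keys, element names and the `course_list` function are hypothetical and say nothing about how SPIRES performs the look-up.

```python
# Hypothetical subfiles of a single file, each keyed by its record key.
students = {
    "S001": {"name": "Ada Park"},
    "S002": {"name": "Luis Ortega"},
}
teachers = {"T01": {"name": "R. Gomez"}}
courses = {
    "CS101": {
        "title": "Intro to Computing",
        "teacher": "T01",
        "enrolled": ["S001", "S002"],  # student keys only, not copies of names
    },
}


def course_list(teacher_id):
    """Build a teacher's course list, looking up each student's name by key."""
    lines = []
    for number, course in sorted(courses.items()):
        if course["teacher"] != teacher_id:
            continue
        names = [students[sid]["name"] for sid in course["enrolled"]]
        lines.append((number, course["title"], names))
    return lines


# A name is corrected once, in the student subfile only.
students["S002"]["name"] = "Luis Ortega-Diaz"
listing = course_list("T01")
```

Because the course record holds only student keys, the corrected name appears in the course list without any change to the course subfile.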
Search and retrieval of a reasonable number of records (fewer than five million) through an index is accomplished in seconds, with a maximum of five disk accesses to retrieve records in a file of a million records. For most applications SPIRES uses a "B-tree" method of record access; the chapter on SPIRES file structure contains more information on this. [See B.6.] Rapid retrieval is possible if the file definer has specified that selected information in the goal records be passed to index records. If an index was not built when records were first added to the file, it can usually be added at a later date, by using the original goal records to pass the requisite information to that index. In cases where indexes have not been built, perhaps because the frequency of searching for a certain kind of information would not have warranted the cost of building or maintaining an index, the file can be searched sequentially for this information. It is also possible to obtain a subset of the file via an index search, then examine or process that subset sequentially (using global FOR). However, a sequential search is usually slower and more expensive than an index search.
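The difference between an index search and a sequential search can be sketched in Python. The sketch below substitutes a sorted list with binary search for the B-tree the manual mentions, so it shows only the general principle (the index is ordered, so a lookup need not touch every record). The record contents and function names are hypothetical.

```python
import bisect

# Hypothetical goal records keyed by record key.
records = {
    1: {"name": "Chez Anna", "city": "Palo Alto"},
    2: {"name": "Blue Door", "city": "Menlo Park"},
    3: {"name": "Luigi's", "city": "Palo Alto"},
}


def build_index(records, element):
    """Pass one element of every goal record to a sorted list of (value, key)."""
    return sorted((rec[element].upper(), key) for key, rec in records.items())


def index_search(index, value):
    """Binary search of the ordered index; work grows with log(n), not n."""
    value = value.upper()
    pos = bisect.bisect_left(index, (value,))
    keys = []
    while pos < len(index) and index[pos][0] == value:
        keys.append(index[pos][1])
        pos += 1
    return keys


def sequential_search(records, element, value):
    """Examine every goal record in turn; needs no index but touches all records."""
    value = value.upper()
    return [key for key, rec in sorted(records.items())
            if rec[element].upper() == value]


city_index = build_index(records, "city")
```

Note that `build_index` works from the existing goal records, which mirrors the point above that an index can usually be added to a file after the records are already in place.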
When records have been retrieved, they can be manipulated in several ways: they can be displayed at the terminal using any predefined output format; they can be put into a user's work-area or "scratch pad" for manipulation by the text-editor; they can be made available to a batch program; they can become a source for a new series of SPIRES commands, such as a sequencing command; or they can be used to generate reports.
Keeping information in the data base current is done by removing records from, adding records to, or updating records in the file. For adding and updating, records can be moved between a subfile and the text editor's work-area using simple commands. Data is collected or modified using the text-editor's commands. SPIRES also supports use of a CRT in full face mode.
The integrity of the data is constantly verified and protected by SPIRES. Redundant information stored in each file block ensures the validity of the data inside. Modifications made by users to the data base do not take place immediately (though they are immediately visible online). Changes are held until the deferred queue records are processed by JOBGEN overnight; this reduces the chance of accidental loss of data base contents as a result of a system crash or user error.
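The behavior of the deferred queue (changes visible online at once, but applied to the stored records only later) can be pictured with a short Python sketch. This is a conceptual model only; the class `DeferredFile` and its methods are hypothetical, and nothing here describes how SPIRES actually stores or processes the queue.

```python
class DeferredFile:
    """Toy model: stored records plus a queue of pending changes."""

    def __init__(self):
        self.tree = {}      # stand-in for the stored goal records
        self.deferred = {}  # pending changes: key -> record, or None for a removal

    def add_or_update(self, key, record):
        self.deferred[key] = record

    def remove(self, key):
        self.deferred[key] = None

    def get(self, key):
        """Online view: a pending change takes precedence over the stored copy."""
        if key in self.deferred:
            return self.deferred[key]
        return self.tree.get(key)

    def process_deferred_queue(self):
        """Stand-in for the overnight pass that applies the pending changes."""
        for key, record in self.deferred.items():
            if record is None:
                self.tree.pop(key, None)
            else:
                self.tree[key] = record
        self.deferred.clear()


f = DeferredFile()
f.add_or_update(1, {"name": "Chez Anna"})
visible_before = f.get(1)        # already visible to the user
stored_before = dict(f.tree)     # but the stored records are untouched
f.process_deferred_queue()
```

Until the queue is processed, the stored records remain exactly as they were, so a mistaken change still waiting in the queue has not yet altered them.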
Data can also be validated on entry by specifying certain tests that the input values must pass. These tests can be specified in the file definition, or in an input format or protocol used to add records to the file. They may be as simple as tests based on the number of occurrences of an element, an element's length, a required range for element values, or inclusion or exclusion of elements that contain certain characters. Editing of input data can also be specified: values can be converted to binary; personal names can be changed to a canonical form; the date or time of day can be supplied, and many other kinds of editing can take place. Similar to these input processing rules are index passing rules that specify how or what kind of information is passed from an element in the goal record to an element in an index. You can specify that groups of words, individual words or phrases, all words, or all words over a certain length be passed; you can also include or exclude a certain set of words, delete or nullify punctuation, or force capitalization, when an element or elements are passed to an index.
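The kind of work an index passing rule does (breaking a value into words, dropping punctuation, excluding certain words, forcing capitalization) can be sketched in Python. This is not PASSPROC syntax; the function `pass_words`, its parameters and the exclusion list are hypothetical, chosen only to make the options named above concrete.

```python
import string

# Hypothetical exclusion list; a real file would choose its own.
EXCLUDED = {"THE", "AND", "OF"}

# Translation table that turns every punctuation character into a blank.
_PUNCT_TO_BLANK = str.maketrans(string.punctuation, " " * len(string.punctuation))


def pass_words(value, min_length=1, excluded=EXCLUDED):
    """Return the upper-case words of value that would be passed to a word index."""
    words = []
    for word in value.translate(_PUNCT_TO_BLANK).upper().split():
        if len(word) < min_length or word in excluded:
            continue
        if word not in words:  # pass each distinct word once
            words.append(word)
    return words


title_words = pass_words("The Art of Computer Programming, Vol. 1", min_length=3)
```

Forcing capitalization at passing time means a search value can be capitalized the same way before lookup, so that "computer" and "Computer" reach the same index entry.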
A complete file definition is never easy. But if you approach the process in the order outlined below, trying not to define ahead of your comprehension, you will avoid general confusion. If you test your first file definition at the different stages or "test points" indicated, you will be able to locate problems more easily and review the materials in a particular section of this manual or ask the SPIRES consultant in User Services for help.
Experienced file definers go through all of the following steps, and usually in the order presented.
Analyze your data base with respect to the following:
An entry such as a restaurant in the RESTAURANT subfile is usually called a "record." A collection of logically related entries or records that are the goal of search requests in your file are called "goal records." For example, an entry or record for an individual restaurant in the RESTAURANT subfile is a goal record in that subfile.
The elementary parts of a record in the phone book are a name, an address, and a phone number. In a SPIRES file, each of these becomes an "element" and is assigned an "element name" in the file definition.
In most entries in a phone book for example, each name has a single address and phone number. But if we were making a file of students and the courses they take, the "courses" element in the record might occur four, five, or six times, or perhaps not at all.
So, in your file perhaps, some elements must occur once or twice, some must occur at least once but perhaps many times, and some elements may be entirely optional. A file can have elements with any combination of these possibilities.
Elements like a telephone number can be fixed in length, just as a social security number can be. Elements like dates and most numbers are of fixed length if you tell SPIRES to change them to binary (a fixed internal form) before storing them. If some elements only have a limited number of values, you may want to have SPIRES turn the value into a fixed-length code.
The "length" of an element is always the length in bytes (or characters) as the value will be stored on disk. It is cheapest to process and store elements that are either fixed in occurrence (see "b" above) or fixed in length, or fixed in both occurrence and length. Elements that vary in either or both length and occurrence are more expensive to store and process. Optionally-occurring elements that may vary in length and occurrence are the most expensive to process and store.
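The effect of converting values to a fixed internal form can be illustrated with Python's standard `struct` module. The byte layouts below are assumptions made for the illustration (the manual does not give SPIRES's internal formats here), and the code table `STATUS_CODES` is hypothetical.

```python
import struct
from datetime import date

# Hypothetical element with a limited set of values, stored as a one-byte code.
STATUS_CODES = {"ACTIVE": 1, "INACTIVE": 2, "RETIRED": 3}


def store_date(text):
    """Convert 'YYYY-MM-DD' to a fixed four-byte binary value (assumed layout)."""
    d = date.fromisoformat(text)
    return struct.pack(">I", d.year * 10000 + d.month * 100 + d.day)


def store_status(text):
    """Convert a value from a limited set to a fixed one-byte code."""
    return struct.pack(">B", STATUS_CODES[text.upper()])


stored_date = store_date("1994-03-07")
stored_status = store_status("active")
```

Every date occupies the same four bytes and every status one byte, whatever the length of the text that was input, which is what makes such elements cheap to store and process.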
SPIRES can be told to check the validity of input data if you can specify the criteria for validation. Elements like a phone number and a social security number have "-" in certain required places and are of a known length. Zip codes are of a certain length and contain only numerals. Course numbers at certain institutions might always be three letters, a blank, then three numbers. You can tell SPIRES to reject input to an "age" element when a value is greater than 100 or less than 0, or to reject a "number of children" element's input value that is negative or greater than fifteen. You may want to supply a default value automatically if an element is not input, or override an input value if one is supplied: the date and time a record is added to your file can be supplied as input data by SPIRES.
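The validation criteria just described can be written out as a Python sketch. This shows the tests themselves, not INPROC coding; the element names, the exact patterns (for example a five-digit zip code) and the `validate` function are assumptions made for the illustration.

```python
import re
from datetime import date

# Assumed patterns for the examples in the text.
PATTERNS = {
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),   # "-" in required places, known length
    "zip": re.compile(r"\d{5}"),               # fixed length, numerals only
    "course": re.compile(r"[A-Z]{3} \d{3}"),   # three letters, a blank, three numbers
}


def validate(record, today=None):
    """Return (record with supplied values, list of element names that failed)."""
    errors = []
    for name, pattern in PATTERNS.items():
        value = record.get(name)
        if value is not None and not pattern.fullmatch(value):
            errors.append(name)
    age = record.get("age")
    if age is not None and (age > 100 or age < 0):
        errors.append("age")
    children = record.get("children")
    if children is not None and (children < 0 or children > 15):
        errors.append("children")
    clean = dict(record)
    # The date added is supplied by the system, overriding any input value.
    clean["date_added"] = (today or date.today()).isoformat()
    return clean, errors


good, good_errors = validate(
    {"ssn": "123-45-6789", "zip": "94305", "course": "CSC 101", "age": 30},
    today=date(1994, 1, 1),
)
bad, bad_errors = validate({"zip": "9430A", "age": 101, "children": -1})
```

Note that the age test uses "or": a value is rejected if it falls outside the range on either side.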
Elements that are grouped together form a "structure." Taken together, a street address, city, state and zip-code might be called an "address"; in a file in which several addresses occur in each record (perhaps a home address and business address in a mailing list file) there must be a way to associate or bind the first city input with the first state and zip-code input, and the second city with the second state and zip-code. This logical binding of different elements is a structure. The university affiliation of a professor may always be paired with his or her name in a subfile of conference participants, for example. Or, a job-code might always occur with a salary figure in a record of a person's employment history.
Structures can be nested within structures: the record of a student's grades for a single term, making up a course-number/grade structure, could itself be a structure that occurs several times in a goal record that is a student's transcript for several terms of work.
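The transcript example can be pictured as nested data types. The Python sketch below uses dataclasses as a stand-in for structures; the class and element names are hypothetical and are not file definition language.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CourseGrade:
    """Inner structure: binds one course number to its grade."""
    course_number: str
    grade: str


@dataclass
class Term:
    """Outer structure: one term of work, containing the inner structure."""
    term: str
    courses: List[CourseGrade] = field(default_factory=list)


@dataclass
class Transcript:
    """Goal record: a student's transcript for several terms."""
    student_id: str
    terms: List[Term] = field(default_factory=list)


def grades_for(transcript: Transcript, term_name: str) -> List[Tuple[str, str]]:
    """Return the (course, grade) pairs bound together within one term."""
    for term in transcript.terms:
        if term.term == term_name:
            return [(c.course_number, c.grade) for c in term.courses]
    return []


t = Transcript("S001", [
    Term("AUT 1993", [CourseGrade("CSC 101", "A"), CourseGrade("MAT 041", "B")]),
    Term("WIN 1994", [CourseGrade("CSC 102", "A")]),
])
```

Because each grade lives inside the same `CourseGrade` occurrence as its course number, the first grade can never be mistaken for the grade of the second course, which is the binding a structure provides.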
The key of the record is a unique identifier by which one record can be distinguished from all other records. The key must be chosen carefully, since it has many consequences: the element designated as the key must only occur once in each record and the value for that key must be unique among all the records in the subfile.
In a file with a goal record of employee data, the name of a single employee is not likely to be unique, but his or her social security number is unique. So the social security number may be the best choice for the key. In a file whose goal record is comprised of items ordered for a store, you might be tempted to use a purchase order number as the key, but if more than one item were listed on a purchase order then the goal record, which is the result retrieved from a SPIRES search (see "a" above), could not be individual items but would have to be purchase orders.
If you don't have an element that can be the key, that is, an element whose uniqueness could be guaranteed by the nature of your data, you can have SPIRES assign an integer or "slot" key for each added record; this technique is used by the RESTAURANT subfile, a "slot" subfile. SPIRES simply assigns the first record the key "1", the second record the key "2" and so on.
An "augmented key" can also be coded if you must use a non-unique key. SPIRES will simply place a suffix on any non-unique key you enter; this suffix will make the key unique. This technique is useful when personal names are used as keys, or when accession numbers are being assigned.
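Slot keys and augmented keys can both be sketched in a few lines of Python. The classes below are hypothetical, and the suffix form used for the augmented key ("#2", "#3") is an assumption made for the illustration; the text above says only that a suffix is added to make the key unique.

```python
class SlotSubfile:
    """Toy slot subfile: the system assigns consecutive integer keys."""

    def __init__(self):
        self.records = {}
        self.next_slot = 1

    def add(self, record):
        key = self.next_slot
        self.records[key] = record
        self.next_slot += 1
        return key


class AugmentedSubfile:
    """Toy subfile with augmented keys: a suffix makes a repeated key unique."""

    def __init__(self):
        self.records = {}

    def add(self, key, record):
        candidate, n = key, 1
        while candidate in self.records:
            n += 1
            candidate = f"{key}#{n}"  # assumed suffix form, for illustration only
        self.records[candidate] = record
        return candidate


restaurants = SlotSubfile()
first = restaurants.add({"name": "Chez Anna"})
second = restaurants.add({"name": "Blue Door"})

people = AugmentedSubfile()
k1 = people.add("SMITH, J", {"dept": "Physics"})
k2 = people.add("SMITH, J", {"dept": "History"})
k3 = people.add("SMITH, J", {"dept": "Music"})
```

In both cases the person entering a record does not have to invent a unique value; uniqueness is guaranteed at the moment the record is added.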
Define your goal record elements using the file definition language to describe the data characteristics you determined in the above steps. [See A.3.1.] The language for goal record definition is described in the first three chapters of Part B, "Goal Record Concepts and Definition," "Goal Record Keys, Slot and Removed Records," and "Structures." [See B.1, B.2, B.3.]
Add processing rules to the description of each element. These processing rules or "actions" are called INPROCS or OUTPROCS depending upon whether they affect the input or output of an element. Study "Processing Rules: INPROC, INCLOSE, OUTPROC" [See B.4.] and become familiar with the use of the appendices "Processing Rules: Complete Listing by Number," "Quick Reference to Processing Rule Functions by Number," and "Quick Reference to Processing Rules by Function-Keyword." [See D.1, D.2, D.3.]
Test your basic goal record description. To do this you must first study "The FILEDEF Subfile and File Compilation." [See B.5.] Then:
- Identify your file to the system in the file definition language
- Add your file definition to the system file containing file definitions called FILEDEF
- Compile your file definition
- Add, transfer, update, remove and display records of your subfile. Verify that all of the elements you input are processed and output correctly.
Two additional chapters may be of help at this point: "File Definition Syntax and Semantics" and "Recompile of an Existing File's Definition." [See D.4, C.1.]
Study "File Structure: Tree and Slot, Goal and Index Records," [See B.6.] in order to understand the structure of the ORVYL files that SPIRES has created according to your file definition. That study prepares you for defining your file's index records.
Study "Understanding and Coding Index Records" [See B.7.] for an understanding of the various indexing techniques and when to use each: Simple Indexes, Qualifiers, Sub-Indexes and Compound Indexes. In that chapter you will learn how to describe the structure of the "index records" which are associated with the "goal record" you defined and tested previously.
Study "Understanding and Coding the Linkage Section" [See B.8.] so you can use the file definition language to describe:
- a) how information is passed from goal record(s) to index record(s)
- b) how the form and content of this information can be manipulated using PASSPROCS
- c) the rules for searching each index, and what kinds of processing each search request must undergo using SRCPROCS.
Make use of the appendices "A Guide for Coding Index Record Definitions" and "A Guide for Coding the Linkage Section" [See D.5, D.6.] to code the index and linkage sections of your file. These appendices provide (almost) guaranteed recipes for these two difficult-to-code parts of a file definition. You determine the indexing techniques you need, and use the recipes for index and linkage sections that are indicated.
Transfer your goal record file definition and add the index and linkage sections to it, update the definition, then erase and compile your file. (Before this, you may want to save any data that you have entered.) Use the online SPIBILD processor to pass information from the deferred queue and goal records to the index records you just defined, and build the indexes and goal records into a searchable file. Use the SPIRES searching and browsing commands to check your SRCPROCS and PASSPROCS, verifying that the indexes contain the information you intended.
Make use of the file definition language described in "Defining Subfile Privileges" [See B.9.] to specify accounts or groups of accounts that can: use the file, search it, update it, see only some elements, and update only some elements. Modify your file definition, adding this new code, and recompile it. You will be able to verify that you specified the correct privilege codes after the system FILEDEF file has been updated--that is, the day after you make any changes.
Make use of the file-manager commands described in "SPIRES File Management" [See B.10.] to monitor and control the status, activity and processing of your file.
A subsystem of SPIRES, named the File Definer, can simplify the process of file definition. By using a concentrated language, based on a subset of the standard file definition language discussed in this manual, you can specify basic information about the file design, such as element names, which elements should be indexed, etc., and the File Definer will generate a complete file definition for you, saving you the trouble of writing and coding goal and index records and the linkage section.
The File Definer subsystem is available only when you are in SPIRES; to use it, you issue the SPIRES command ENTER FILE DEFINER. Below is a sample File Definer session:
     -? enter file definer
     * ENTERING FILE DEFINER
     :-? input
     1. Input? subfile PHONE-BOOK/ accounts ga.spi
     2. Input? element NAME,NAM/ name/ single/ index
     3. Input? element PHONE,PHO/ occurrence 2
     4. Input? element COMMENT,COM/ single
     5. Input? end
     :-? generate
     :-? return
     -?
The five lines of input shown would be used by File Definer to generate a file definition of about 30 lines, including a goal record definition with the elements NAME, PHONE and COMMENT, an index record definition for the NAME element, a linkage section and a subfile section. That is much simpler than coding the record definitions and linkage sections yourself. [See B.7, B.8.] The file definition generated may then be added to the FILEDEF subfile and compiled.
Most people writing file definitions will want to use the File Definer at some point because it relieves them from the tedious task of coding index record definitions and linkage sections. Even if you cannot code the entire definition using File Definer (it has some limitations, e.g., you cannot directly code SEARCHTERMS, SRCPROC and PASSPROC statements), you can use it to create a file definition ranging from skeletal to almost complete for any file.
Naturally, there is a great deal of educational value in writing an entire file definition (goal records, index records, linkage section and all) yourself, especially if you want to learn and understand SPIRES file structure. However, letting the File Definer do the tedious work and studying the file definition it generates can be educationally rewarding as well, especially if you do so as you read this manual.
The File Definer has its own reference manual, entitled "File Definer", which is written for people already familiar with the concepts and language of file definition as taught in this manual. A primer to the File Definer, aimed at people who primarily want to create a SPIRES file quickly but who are unfamiliar with file definition concepts and language, may be found in the SPIRES primer "A Guide to Data Base Development".
A data "element" is the smallest unit of named data known to SPIRES. Data elements (or "fields" as they are called in other systems) may consist of characters, numbers or bits; they may be fixed in length or varying in length. They may also be required to occur more than once or be completely optional. Elements are things such as a person's name, a social security number, a salary, or an abstract of an article.
A "record" consists of a series of data elements and their values. Usually, the record is a collection of all the data elements that pertain to a single entity in the entire collection of data. Thus, a record could be made up of one person's name, address, social security number, and salary. Another record in the same collection of data would have the same elements, but for a different person.
Within a record, elements may be grouped together in "structures," which are referenced in the same manner as elements. For example, if a person has several offices and phone numbers, the office and phone number elements might be grouped or paired together in a structure to keep the proper phone number associated with an office.
Elements that are not in structures are called "record level" elements. Elements inside a structure are "lower level" elements with respect to the record level elements. Structures that are not inside of other structures are record level structures. Structures inside of another structure are lower level structures with respect to the containing structure.
Each record in a collection of data maintained by SPIRES has a required singularly occurring data element known as the "key." Each key within a collection of goal records must be unique to the goal record--no two records in the same goal record collection can have the same key. The key element in a personnel goal record would probably be the social security number, since it will be unique for each person. If the data records themselves do not contain unique elements that are useable as keys, SPIRES can supply unique consecutive numbers as the values for the keys in a set of goal records.
"Goal Record" is the SPIRES term for a data record that could found as a result of a SPIRES search operation. In a collection of data about restaurants, the goal records would probably be restaurants; in a collection of data about library holdings, a goal record would probably be a book. A single record retrieved from a SPIRES search is a goal record. All of the records that have the same structure as the retrieved record or records, and hence "could" have been retrieved by a search, are referred to collectively as "the goal record" or "the goal record data set."
An "index record" consists of a series of data elements and their values, just as a goal record does. However, in index records one of the data elements contains as its value an internal pointer (or pointers) to a goal record (or records). An index built out of names in the goal record would contain one index record for each name that occurs in the goal record as well as a pointer to the goal record (or records) in which the name occurs. The user has no direct interaction with indexes, though they are used by searching commands.
A "record-type" must be distinguished clearly from a "record." A record-type refers to a collection of records, and may refer to either goal records or index records. There are "goal record record-types" and "index record record-types." The record-type is a collection of records that all have the same structure. In a personnel file, the goal record of social security numbers, names and salaries makes up one record-type, while a name index and a salary index are two other record-types.
An "index" is a collection of index records created and maintained by SPIRES; one usually does not manipulate them directly. Indexes act as a "go between" between a searching command and the goal records. The values in a search request are looked up in the index, and the values in the index point to particular goal records. This is similar to the index one might find at the end of a book: such an index contains words or concepts, and each word or concept has a list of the pages on which it occurs.
There are two types of indexes available: simple and compound.
A "simple index" (or more specifically, a simple index record-type) contains one record for each entry in the index. For each unique name in a personnel file, there is one record in a NAME index that points to each goal record containing that name. The key of a simple index is the thing being indexed, a person's name, for example. Simple indexes are cheaper to search than compound indexes.
A "compound index" may index several elements, and is usually used to index short numeric values or coded elements such as salaries and dates. In a compound index, there is one record for each element being indexed. If a personnel file has the elements salary, job-class, and date-hired all in a compound index, then the compound index will have three records (one for salary, one for job-class, and one for date-hired). Each record in the compound index contains all the values that exist in the goal records for a particular element; this is why compound indexes are not recommended for large files -- the compound index records become too large to be searched quickly. Compound indexes may be searched with all of the relational operators, but are more expensive to search than simple indexes.
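The difference between the two index types can be sketched outside SPIRES. The following Python model is only an illustration, not how SPIRES stores indexes: the sample records, the employee-number keys, and the two `build_*` helpers are all hypothetical. It treats a simple index as one index record per indexed value, and a compound index as one index record per indexed element, each holding every value of that element.

```python
from collections import defaultdict

# Hypothetical goal records, keyed by a made-up employee number.
GOAL_RECORDS = {
    "1001": {"name": "JONES", "salary": 30000, "job-class": "A"},
    "1002": {"name": "SMITH", "salary": 42000, "job-class": "B"},
    "1003": {"name": "JONES", "salary": 42000, "job-class": "A"},
}


def build_simple_index(records, element):
    """One index record per distinct value; each points to goal record keys."""
    index = defaultdict(list)
    for key, record in records.items():
        index[record[element]].append(key)
    return dict(index)


def build_compound_index(records, elements):
    """One index record per element; each holds every value of that element."""
    index = {}
    for element in elements:
        values = defaultdict(list)
        for key, record in records.items():
            values[record[element]].append(key)
        index[element] = dict(values)
    return index


name_index = build_simple_index(GOAL_RECORDS, "name")
compound = build_compound_index(GOAL_RECORDS, ["salary", "job-class"])
print(len(name_index))  # number of simple index records (distinct names)
print(len(compound))    # number of compound index records (indexed elements)
```

In this model each compound index record grows with the number of distinct values in the file, which mirrors the reason given above for avoiding compound indexes on large files.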
Combined record-types are record-types that are stored in the same ORVYL file. The file owner specifies in the file definition which record-types are to be combined together; if no combinations are defined, then each record-type occupies its own ORVYL file. There is a system limit of 13 physical record-types, so if a goal record is to have more than eight indexes, some of them must be combined into the same ORVYL file. SLOT record-types may not be combined with any other record-type. Record-types that are physically combined are kept conceptually separate by the logic built into SPIRES. Record-types that are physically combined with each other are different "logical record-types," even though they may occupy the same "physical record-type" (the same ORVYL file). There is a system limit of 64 logical record-types.
A "removed record-type" has nothing to do with records that have been deleted from the file with the REMOVE command. Removed record-types may provide increased access efficiency for some data. SPIRES access efficiency depends on a large number of records being packed in a single file block. SPIRES provides the file definer with an option of keeping only the key of a record in the file block, plus a pointer to the remainder of the record's data. The remainder of the data is kept in the "residual data set." This is called "record removal" and allows many more (partial) records to be kept in a file block than if whole records were kept intact.
A "subfile" is defined as one set of goal records, the indexes to those goal records, and the access and update restrictions that apply to the data elements. Among the record-types that are brought into association in a subfile, a clear distinction is made between goal records and index records, since the user can only manipulate goal records, not index records. If a goal record has no indexes built for it, then the subfile consists of only the goal record and the access restrictions to it.
Several subfiles may relate to the same data and may be placed in one "file" or data base. A file thus contains all the subfiles that relate to the same data. A user can only work on a single subfile at a time, even though there may be several subfiles defined in one file. It is also possible (and is frequently the case) that only a single subfile is contained in a file.
The following chart shows the relationship of the parts of a file. The chart depicts a single file, with two subfiles. The first subfile has three record-types: one goal record and two index records. The second subfile has only a single record-type, the goal record.
The goal record of the first subfile is composed of a record with a key and several elements. One of the elements is a structure, and is thus itself composed of elements. The first index record of the first subfile is a record with only a key and a pointer.
  FILE
    SUBFILE (first)
      "goal record record-type" or "goal record data set"
        goal record
          KEY
          ELEMENT
          ELEMENT
            :
          STRUCTURE
            ELEMENT
            ELEMENT
              :
          ELEMENT
            :
        goal record
        goal record
          :
      "index record record-type" or "index"
        index record
          KEY
          POINTER
        index record
        index record
        index record
          :
      "index record record-type" or "index"
        (as above for index-record record-type)
    SUBFILE (second)
      "goal record record-type" or "goal record data set"
        (as above for goal record record-type)
To begin our consideration of goal record definition, let's take a telephone directory as our example; the structure of a directory is something with which we are all familiar. Certain assumptions we are going to make for our directory file will simplify its structure.
What information is stored in a telephone directory? Usually, and most simply, a name, address, and telephone number make up each entry.
If we have a single "record" or entry that consists of the three elements, name, address, and phone number, how many times will each element occur in a single record? Let's look at the question this way:
  FILE = DIRECTORY;
    RECORD = PERSON;
      ELEMENT = NAME;          OCCURRENCE = ?
      ELEMENT = ADDRESS;       OCCURRENCE = ?
      ELEMENT = PHONE NUMBER;  OCCURRENCE = ?
Most likely (in the simplest case), the name and address elements will occur once and only once: for each name there is one address. But it would not be unusual for a person to have several phone numbers, so we don't know how many phone numbers to expect or allow for.
Now, what can we say about the length of each of these elements? Here is a review of what we have so far:
  FILE = DIRECTORY;
    RECORD = PERSON;
      ELEMENT = NAME;          OCCURRENCE = 1;  LENGTH = ?
      ELEMENT = ADDRESS;       OCCURRENCE = 1;  LENGTH = ?
      ELEMENT = PHONE NUMBER;                   LENGTH = ?
We really don't know the length of the longest possible name and address. We could probably specify a length that couldn't be exceeded, but SPIRES does not require us to. If you do not specify a length, SPIRES stores only the length of the value input, plus two bytes of information about the length. Let's not specify a length for the NAME and ADDRESS elements.
The question of the length of the phone number requires some decision; let's agree that a phone number is an eight character (or eight "byte") value, such as "497-4420". (If we wanted to include the area code with each number, then the value is thirteen bytes long: "(415)497-4420" for example.) Our "file definition" now looks like this:
  FILE = DIRECTORY;
    RECORD = PERSON;
      ELEMENT = NAME;          OCCURRENCE = 1;
      ELEMENT = ADDRESS;       OCCURRENCE = 1;
      ELEMENT = PHONE NUMBER;  LENGTH = 8;
Let's see what this file will look like in the SPIRES file definition language. The first thing we must "code" or specify is the name of the file. This name is always an alphanumeric string preceded by the file definer's account number in the form GG.UUU. This account becomes the only account that by default can modify or compile the file definition. The file name is coded first in the definition, and looks like this:
FILE = GG.UUU.DIRECTORY;
Here, GG.UUU is the account, and "DIRECTORY" is the name chosen for this particular file. The file name (including the account) may be up to 23 characters long (longer names will be truncated by SPIRES). No one but the file owner need ever see this name. This is not the name used to select the subfile.
A file consists of sets of records; each set is called a "record-type." Most often there is a goal record record-type and several index record record-types per subfile. To simplify our discussion at this point, we will call a record-type a "record." (This is not strictly true: many goal records make up a single goal-record record-type. [See A.4.]) Each of these records has a unique name. The goal record is often called "REC01", and its name is coded:
RECORD-NAME = REC01;
We will see later why this name is most common, and some circumstances in which you might want to choose a different name. [See B.2.2.]
Within each record we define that record's elements. All of the elements must be in one of three categories: FIXED, REQUIRED, or OPTIONAL. Elements are segregated into these categories by their occurrence and length attributes as follows:
  Category    Length              Occurrence
  FIXED       Fixed               Required number of occurrences
  REQUIRED    Varying             Required to occur
  OPTIONAL    Fixed or Varying    Need not occur
Notice that "Fixed" and "Varying" for FIXED and REQUIRED refer to the length attribute of the element being defined, not its occurrence attribute.
The length of an element in the FIXED section of the record definition must be specified. If the occurrence is not specified for an element in the FIXED section, then the number of occurrences is assumed to be one. On the other hand, an element defined in the REQUIRED section of the record definition need not have either length or occurrence attributes specified; the element must occur--but its occurrence and length may vary from entry to entry. Elements in the OPTIONAL section need not have either length or occurrence specified, because that element may or may not occur in a given record.
If you do specify an occurrence for an element in the OPTIONAL section or the REQUIRED section, it has a special meaning, which is different from an occurrence specification for a FIXED element. If the number of occurrences is one, then the element is "singularly occurring": for a REQUIRED element, this means that it must occur once and only once in each record; for an OPTIONAL element, this means that if it occurs at all, it can occur only once. If the number of occurrences is more than one, then the element is multiply occurring: for a multiply-occurring REQUIRED or OPTIONAL element, the number of occurrences is not checked. You may have SPIRES do a minimum and/or maximum occurrence check by specifying certain processing rules, usually A123 and A146. [See B.4.14.]
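The occurrence rules above can be summarized in a small Python model. This is a sketch of the rules as described in this section, not SPIRES code; the function name `occurrence_ok` and its parameters are hypothetical, and the extra minimum or maximum checks available through processing rules are not modeled.

```python
def occurrence_ok(section, count, occ=None):
    """Check an element's occurrence count against its section and OCC value.

    section: "FIXED", "REQUIRED" or "OPTIONAL"
    count:   how many times the element occurs in one record
    occ:     the coded OCC value, or None if no OCC statement was coded
    """
    if section == "FIXED":
        # FIXED elements occur a set number of times; OCC defaults to one.
        return count == (1 if occ is None else occ)
    if section == "REQUIRED":
        # OCC = 1 means exactly once; otherwise at least once, unchecked above that.
        return count == 1 if occ == 1 else count >= 1
    if section == "OPTIONAL":
        # OCC = 1 means at most once; otherwise any number, including none.
        return count <= 1 if occ == 1 else count >= 0
    raise ValueError(f"unknown section: {section}")
```

For example, `occurrence_ok("REQUIRED", 5, occ=2)` is true in this model, because an OCC value greater than one only marks the element as multiply occurring and is not itself checked.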
A small amount of storage space is saved when REQUIRED or OPTIONAL elements are specified as singularly occurring.
Though not required for a valid file definition, an OPTIONAL section with a dummy element should be coded in every file or record definition for which an OPTIONAL section would not otherwise occur. By coding this section you have the flexibility to add elements to the record definition even after data has been stored in the file. Such elements are always added to the OPTIONAL section; the dummy element is never coded with length and occurrence attributes. [See B.1.7.] No more than 254 elements may be coded in an OPTIONAL section.
Remember that we decided "PHONE" is fixed in length, but is not fixed in the number of times it can occur, though it must occur at least once. Elements that must occur, but for which a firm occurrence count can't be specified, are placed in the REQUIRED section, even if they can be fixed in length.
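As a rough aid to memory, the choice of category can be written as a short decision function. The Python sketch below is an illustration only; the function `choose_section` and its three yes/no parameters are hypothetical, and it ignores the readability considerations that may lead you to place a fixed element elsewhere.

```python
def choose_section(must_occur, fixed_length, fixed_occurrence_count):
    """Suggest a record-definition category from an element's attributes."""
    if not must_occur:
        # Elements that need not occur are OPTIONAL, whatever their length.
        return "OPTIONAL"
    if fixed_length and fixed_occurrence_count:
        # Both the length and the number of occurrences are known in advance.
        return "FIXED"
    # The element must occur, but its length or occurrence count varies.
    return "REQUIRED"


# PHONE: fixed in length, must occur, but no firm occurrence count.
print(choose_section(must_occur=True, fixed_length=True,
                     fixed_occurrence_count=False))
```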
We code the category name at the head of the list of elements it describes:
  FILE = GG.UUU.DIRECTORY;
    RECORD-NAME = REC01;
      FIXED;
        :
        elements
        :
      REQUIRED;
        :
        elements
        :
      OPTIONAL;
        :
        elements
        :
Note the order in which the categories must appear: FIXED, then REQUIRED, then OPTIONAL.
If you do not code any categories, all the elements will be OPTIONAL, with the exception of the key, which will be REQUIRED. [See B.1.4.]
In addition to placing each element in an appropriate category, we must choose an element that will be the "key" of the record. A key is required for every record (whether goal or index) defined in the file; it must be unique in value and occur only once in each record.
Now, in our phone directory, we would most likely pick the name as the key. What consequences does this have? The key of a certain record is a unique value for that record; no two records or entries in the file can have the same value for the key element. Thus, no two records in our telephone book could have the same name. (Here is where we allow ourselves to simplify with the assumption that no two people in our phone directory will have the same name. This would not be a realistic assumption for a real phone directory. The solution to this problem is found in the next chapter, "Goal Record Keys, Slot and Removed Records.")
In addition to being unique in value among the goal records, the key must always be singly occurring; that is, the occurrence attribute of the key must be one. For this reason, an occurrence number need not be specified for the key element. A key may be varying in length, such as the name in our phone directory. But a length attribute may be specified if it is known. In our phone directory, NAME would be coded as the key element as follows:
KEY = NAME;
The key element is always coded as the first element in the category in which it is defined. Since the key must be singly occurring, but may be fixed or varying in length, it is coded as the first element in either the FIXED or REQUIRED categories.
Let's review our definition:
  FILE = GG.UUU.DIRECTORY;
    RECORD-NAME = REC01;
      REQUIRED;
        KEY = NAME;
Since we don't have any elements in the FIXED section, we don't code it.
The next element to code is ADDRESS. For this element we can specify that it must occur once and only once, but we can't specify a length attribute. The name of an element is specified in the ELEM statement. (You are allowed to use ELEMENT instead of ELEM; however, SPIRES will change it to ELEM when you add it to the FILEDEF subfile later.) The occurrence attribute of an element is specified in the OCC statement. (Similarly, OCCURS, OCCURRENCE and OCCURRENCES may be used instead, though they will be changed to OCC.) The ADDRESS element would be coded in the REQUIRED section as follows:
ELEM = ADDRESS; OCC = 1;
If we had not specified "OCC = 1", then ADDRESS could occur one or more times. (Since it is coded in the REQUIRED section, it must occur at least once if the occurrence attribute is not coded.)
We now must code the phone number element. For this element we can only say, "it must occur." We don't know how many times. In such a case, "OCC = 1" is not coded since this would limit the element to one and only one occurrence. We have decided that the length of the phone number in bytes (characters) as it will be stored on disk is eight characters. The length attribute of an element is specified in the LEN statement. (LENGTH is also allowed but will be changed to LEN.) Since it may vary in occurrence, the phone number element is coded in the REQUIRED section thus:
ELEM = PHONE-NUMBER; LEN = 8;
Remember that the length attribute, coded by "LEN =", is the length as the value will be stored on disk, which is not necessarily the length of the value as input when the record is added; processing rules, called "actions", can manipulate the input values.
If we specify "LEN = 8" for the PHONE-NUMBER element, then all element values stored on disk will be eight bytes long. If a value is input that is longer than eight characters, the record will be rejected for input, and an error message will be issued. If a value is input that is shorter than eight characters, SPIRES will pad it with blanks to a length of eight bytes. [To allow null values for an element's input, omit the LEN statement; otherwise, SPIRES will fill the entire length with blanks, which is not the same as a null value.] Manipulation of input values can be effected more intelligently when "actions" are coded. [See B.4.]
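The effect of "LEN = 8" on input values can be modeled in a few lines of Python. This is a sketch of the behavior described above, not SPIRES code; the function name `store_fixed` is hypothetical, and padding on the right is an assumption, since the text says only that short values are padded with blanks.

```python
def store_fixed(value, length):
    """Model storing an input value in an element coded with LEN = length."""
    if len(value) > length:
        # A value longer than LEN causes the record to be rejected.
        raise ValueError(f"value {value!r} exceeds LEN = {length}; record rejected")
    # A shorter value is padded with blanks (assumed here to be on the right).
    return value.ljust(length)


stored = store_fixed("4420", 8)
print(repr(stored), len(stored))
```

Note that in this model an empty input becomes eight blanks, which illustrates why a null value cannot be stored in an element that has a LEN statement.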
Embedded blanks are not permitted in element names; the special characters ".", "_", "-", and "$" are allowed, though "-" is not allowed as an element name by itself; it may be embedded within a name, however. [See "SPIRES Searching and Updating", section D.1.3.1, for more information about the "-" or "throw-away" element.] The length of an element's name is limited to sixteen characters.
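The naming rules can be collected into a simple check. The Python sketch below is illustrative only: the function `is_valid_element_name` is hypothetical, and it assumes that letters and digits are the only characters allowed besides the four special characters named above, which the text does not state explicitly.

```python
import re

# Assumed character set: letters, digits, and the special characters . _ - $
_NAME_RE = re.compile(r"[A-Za-z0-9._$-]+")

MAX_NAME_LENGTH = 16  # limit stated in the text


def is_valid_element_name(name):
    """Return True if name satisfies the element naming rules sketched here."""
    if name == "-":
        # "-" alone is reserved and cannot be used as an element name.
        return False
    if len(name) > MAX_NAME_LENGTH:
        return False
    # Blanks (and anything else outside the assumed set) fail the match.
    return _NAME_RE.fullmatch(name) is not None


print(is_valid_element_name("PHONE-NUMBER"))
print(is_valid_element_name("PHONE NUMBER"))
```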
Long element names are often advisable for clarity, since the value coded in the ELEM statement is the name of the element used when records are displayed.
However, it is not convenient or sensible to enter a twelve-character element name for an eight-character value: "PHONE-NUMBER = 497-4420;". SPIRES allows you to give an element a long, descriptive name such as "PHONE-NUMBER" and refer to it by several other names, such as "P" or "PN". The file definer must indicate what these other names can be by coding "aliases" for the element names in the file definition. Here is how aliases are coded:
ELEM = PHONE-NUMBER; LEN = 8; ALIASES = PN, P, NUM;
A phone number can now be entered by "P = 497-4420;" or, more simply, by "P 497-4420;" (since the "=" is optional). We have also allowed the aliases "PN" and "NUM", which are mnemonically more significant than the terse "P". No two elements can have the same alias at the record level or in a structure.
Since we have no OPTIONAL section, we should code an "empty" OPTIONAL section with a single dummy element. (This will allow us to add elements to the record definition at a later date without invalidating data already stored.) This section is coded as follows:
  OPTIONAL;
    ELEM = DUMMY;
      COMMENTS = This element may never be used, but it
        reserves a 'place' for any additional elements we
        may want to define later, after data has been
        added to the file;
An item called "COMMENTS" may be coded for any element you define; no single comment can be longer than 1,024 characters.
Let's look at the record definition we have coded, adding aliases where they are useful:
  FILE = GG.UUU.DIRECTORY;
    RECORD-NAME = REC01;
      REQUIRED;
        KEY = NAME; ALIASES = N;
        ELEM = ADDRESS; OCC = 1; ALIASES = A;
        ELEM = PHONE-NUMBER; LEN = 8; ALIASES = PN, P, NUM;
      OPTIONAL;
        ELEM = DUMMY;
          COMMENTS = This element may never be used, but it
            reserves a 'place' for any additional elements we
            may want to define later, after data has been
            added to the file;
Other statements can be coded that will make the file definition more complete: AUTHOR, MAXVAL, NOAUTOGEN and BIN. They are coded after the FILE statement, which is the first statement in our definition.
It is important to specify the AUTHOR statement in your file definition. In case it is necessary for the data base systems staff to contact you, the AUTHOR statement should supply the necessary information. This element is usually coded after the FILE element, and is a free-form text string:
AUTHOR = JOHN SACK, POLYA 151, 497-4420;
Another file-level element is necessary for some applications, particularly those involving long text strings such as bibliographic and abstract files. If any element values in your file will be longer than 4,096 bytes, you must code the following in your file definition:
MAXVAL = <integral multiple of 8>;
The value specifies the maximum data length for any single occurrence of an element in the file. MAXVAL cannot exceed 32,760. Also, no single record in a SPIRES file can be more than 120,000 characters long.
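The arithmetic behind choosing a MAXVAL value can be sketched as follows. This Python fragment is an illustration, not part of SPIRES; the function `needed_maxval` and the constant names are hypothetical, while the numeric limits are the ones stated above.

```python
DEFAULT_VALUE_LIMIT = 4096  # values up to this length need no MAXVAL statement
MAXVAL_LIMIT = 32760        # largest value MAXVAL may take


def needed_maxval(longest_value_bytes):
    """Return a MAXVAL to code for the longest expected value, or None."""
    if longest_value_bytes <= DEFAULT_VALUE_LIMIT:
        return None  # no MAXVAL statement is required
    # Round up to the next integral multiple of 8.
    rounded = -(-longest_value_bytes // 8) * 8
    if rounded > MAXVAL_LIMIT:
        raise ValueError(f"a value of {longest_value_bytes} bytes exceeds the MAXVAL limit")
    return rounded


print(needed_maxval(5001))
```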
The MAXVAL limit also applies to values processed by actions A44 and A48 and by the SET VALUE Uproc in Userprocs. [See C.11.1.1.]
An optional element "NOAUTOGEN;" may be coded in your file definition if you do not want SPIBILD automatically (i.e., nightly) to pass records from the deferred queue to the goal and index records. Normally, every night that there are records in the deferred queue, JOBGEN will generate a job to build or process them into the goal and index record data sets. Every time this job runs, a certain amount of overhead for job scheduling and initiation is incurred. With only a small number of records to be processed (say, fewer than 5), this overhead is a significant percentage of the job cost.
However, if NOAUTOGEN is coded, you must explicitly cause this job to be submitted by issuing the online SET AUTOGEN command, perhaps after allowing several records to accumulate in the deferred queue. JOBGEN will generate a SPIBILD job that night, and then reset the file to the NOAUTOGEN condition. If NOAUTOGEN is not coded, then you must take specific action to prevent overnight processing; SET NOAUTOGEN can be issued to prevent the generation of this job until you explicitly SET AUTOGEN in SPIRES or PROCESS the file in SPIBILD.
You may code the bin number to which you wish output from SPIRES-generated jobs to be sent. Output from compilations and automatic file building (JOBGEN) will go to the bin specified; if no bin is coded, then such output will be directed to the default bin of the file owner.
If you code PURGE for the bin, then the output will be purged if there were no batch requests processed by SPIBILD and if no errors occurred during SPIBILD processing. Otherwise, the output will be sent to the file owner's default bin. Coding PURGE is recommended because it generates output only in the event of a SPIBILD problem or a batch request, thus saving you printing charges.
If you code HOLD for the bin, the output will be directed to the default bin of the file owner but the output will be held. The file owner can fetch the output and then either purge it or release it for printing.
The bin is coded in your file definition like this:
BIN = nnn;
where "nnn" is the number of the bin or HOLD or PURGE as described above.
Though our goal record definition is now complete, there are several other things that must be coded to complete the definition of the file itself. (Remember that a file definition usually, but not necessarily, contains several record definitions.)
As noted earlier, the file name is almost never seen by the user; what the user sees is the subfile name, which is coded as the first statement in the "subfile section" of the file definition. The subfile section (or sections) follows at the end of the last record description.
SUBFILE-NAME = PHONE DIRECTORY;
Embedded blanks are allowed in the subfile name. Since this name is typed in a SELECT command, it should not be very long or otherwise difficult to type. The maximum length for a subfile name is thirty-two characters, including blanks.
The second statement in the subfile section identifies the record that will be the goal record when the subfile is selected. Because we only have one record name for our single record definition, this may seem redundant. But since most subfiles have multiple record descriptions--usually one goal record and several index records--SPIRES must be told explicitly which record is the goal record. This statement is coded as follows:
GOAL-RECORD = REC01;
Remember that "REC01" was the name of the record we described and named by the statement "RECORD-NAME = REC01;".
Now we must specify what accounts are permitted to select the subfile whose name is given by the "SUBFILE-NAME" value immediately preceding.
ACCOUNTS = D5.M07;
This permits access to the subfile only to the account specified. At a minimum, the file-owner's account should always be specified; if it is not, then the file owner must issue the ATTACH command to use the subfile.
You can permit more than one account by coding other account values:
ACCOUNTS = D5.M07, SN.JRS, GA.SPI;
To permit all group "GG" accounts (but not "GA" accounts), you would include "GG...." in the ACCOUNTS value. To make a subfile public, you specify "PUBLIC" as the ACCOUNTS value. The matter of controlling access to SPIRES subfiles is detailed in "Defining Subfile Privileges." [See B.9.] A complete subfile section can be coded like this:
  SUBFILE-NAME = PHONE DIRECTORY;
    GOAL-RECORD = REC01;
    ACCOUNTS = D5.M07, GG...., GA.JEF, GB....;
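To make the effect of an ACCOUNTS value concrete, here is a small Python model of the matching described above. It is a sketch under stated assumptions, not the SPIRES implementation: the function `account_permitted` is hypothetical, and it assumes that an entry of the form "GG...." matches every account in group GG and nothing else. Other forms of access control covered in "Defining Subfile Privileges" are not modeled.

```python
def account_permitted(account, accounts_value):
    """Return True if account may select a subfile with this ACCOUNTS list."""
    account = account.upper()
    for entry in (e.strip().upper() for e in accounts_value):
        if entry == "PUBLIC":
            return True  # a public subfile is open to every account
        if entry.endswith("...."):
            # Group form such as "GG....": match on the group prefix only.
            if account.startswith(entry[:-4] + "."):
                return True
        elif entry == account:
            return True
    return False


ACCOUNTS = ["D5.M07", "GG....", "GA.JEF", "GB...."]
print(account_permitted("GG.XYZ", ACCOUNTS))
print(account_permitted("GA.SPI", ACCOUNTS))
```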
Here is what our complete phone directory looks like when coded in the file definition language:
  FILE = D5.M07.DIRECTORY;
    AUTHOR = JOHN SACK, POLYA 162, 497-4420;
    BIN = 907;
    MAXVAL = 4096;
      COMMENTS = MAXVAL has been set to 4096 to demonstrate
        how the statement is coded. No value in the file
        will need 4096 bytes.;
    NOAUTOGEN;
    RECORD-NAME = REC01;
      REQUIRED;
        KEY = NAME;
        ELEM = ADDRESS; OCC = 1; ALIASES = A;
        ELEM = PHONE-NUMBER; LEN = 8; ALIASES = P, PN, NUM;
      OPTIONAL;
        ELEM = DUMMY;
          COMMENTS = This element may never be used, but it
            reserves a 'place' for any additional elements we
            may want to define later, after data has been
            added to the file;
    SUBFILE-NAME = PHONE DIRECTORY;
      GOAL-RECORD = REC01;
      ACCOUNTS = D5.M07, GA.JEF, GB....;
The indentation shown is for the sake of clarity; you can use any indentation that is helpful to you. Also, an element's name, occurrence, length and aliases need not be defined on a single line; in fact, when SPIRES displays your file definition, each of these will be on a separate line, with indentation used to structure the definition for easy reading.
Let's consider another way of defining a telephone directory file. Suppose we made the telephone number the key of the record, what would be the impact on the file? Here is a record definition in which the key is the phone number; we have also allowed name and address to occur more than once by not specifying any OCC limits.
  RECORD-NAME = REC01;
    FIXED;
      KEY = PHONE-NUMBER; LEN = 8; ALIASES = PN, P, NUM;
    REQUIRED;
      ELEM = NAME;
      ELEM = ADDRESS;
    OPTIONAL;
      ELEM = DUMMY;
Such a directory would give you access to all the users of a particular phone number; if one person had two different phones, the name would be in two different records, each record's key being one of the phone numbers. A directory keyed on the phone number might not be useful to someone looking for John Jones' phone number, but it would be useful to someone looking for the owner of phone number 497-4420, which has been reported out of order, perhaps.
Notice that this dramatically changes how we look at or use the file. Now, all the people sharing a single office extension can be found, but one person's phone number can't be found as directly as it was in the directory keyed on name. The goal of the search--either names, as in the previous case, or phone numbers as in the present example--determines the choice of key.
Since it is unlikely that one phone number could be at more than one address (though some businesses have "extensions" in several buildings), we will code "OCC = 1" for the occurrence attribute of the ADDRESS element. But it is very likely that more than one person could be listed for each phone. For this reason, we will not code any occurrence attribute for the NAME element: we simply don't know how many times this element will occur. The occurrence of a record key must be one; the length of a record key must never be greater than 240 bytes, whether the length is fixed or not, whether it is the key of a goal record or index record.
The present definition, keyed on phone number, is different in another way from the definition keyed on name: the phone number, which was in the REQUIRED section (varying in length, required to occur) in our record keyed on name, is now in the FIXED section. The phone number was multiply occurring before, though it was fixed in length; now, since it is the key of the record, it is required to occur exactly once. Elements whose occurrence and length attributes both can be fixed are usually coded in the FIXED section.
There may be reasons why you would choose not to put a fixed length and occurrence element in the FIXED section of a record definition. Let's look at two record definitions for a phone directory keyed on phone number; we will add an element for zip code, which seems to belong in the FIXED section, being fixed in both length and occurrence.
RECORD-NAME = REC01;              RECORD-NAME = REC01;
FIXED;                            FIXED;
KEY = PHONE-NUMBER;               KEY = PHONE-NUMBER;
LEN = 8;                          LEN = 8;
ELEM = ZIP-CODE; OCC = 1;
LEN = 5;
REQUIRED;                         REQUIRED;
ELEM = NAME;                      ELEM = NAME;
ELEM = ADDRESS; OCC = 1;          ELEM = ADDRESS; OCC = 1;
                                  ELEM = ZIP-CODE; OCC = 1;
                                  LEN = 5;
OPTIONAL;                         OPTIONAL;
ELEM = DUMMY;                     ELEM = DUMMY;
In the standard SPIRES output format, "element mnemonic = value", the elements in a record are output in the order in which they are defined: FIXED, REQUIRED, then OPTIONAL elements. (If an element occurs more than once, its occurrences are output in the order in which they were input.) Standard record output formats for each of the above definitions might be as follows:
PHONE-NUMBER = 497-4400;          PHONE-NUMBER = 497-4400;
ZIP-CODE = 94305;
NAME = USER SERVICES;             NAME = USER SERVICES;
ADDRESS = POLYA HALL 117;         ADDRESS = POLYA HALL 117;
                                  ZIP-CODE = 94305;
So, for readability, you may want to put ZIP-CODE in the REQUIRED section of the record definition. But if you are certain to define an output format, there is no need to consider this problem.
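The effect of section placement on standard output order can be modeled with a short Python sketch. Nothing here is part of SPIRES: the dictionaries and the function name `standard_output` are hypothetical, and the sketch only mimics the ordering rule described above (FIXED, then REQUIRED, then OPTIONAL, with occurrences in input order).

```python
# Illustrative model only; SPIRES itself is not involved.
SECTION_ORDER = ("FIXED", "REQUIRED", "OPTIONAL")


def standard_output(definition, record):
    """Return 'mnemonic = value;' lines in definition order.

    definition: dict mapping a section name to a list of element names
    record:     dict mapping an element name to its values, in input order
    """
    lines = []
    for section in SECTION_ORDER:
        for element in definition.get(section, []):
            # Elements with no occurrences (such as DUMMY) produce no output.
            for value in record.get(element, []):
                lines.append(f"{element} = {value};")
    return lines


record = {
    "PHONE-NUMBER": ["497-4400"],
    "ZIP-CODE": ["94305"],
    "NAME": ["USER SERVICES"],
    "ADDRESS": ["POLYA HALL 117"],
}
zip_in_fixed = {
    "FIXED": ["PHONE-NUMBER", "ZIP-CODE"],
    "REQUIRED": ["NAME", "ADDRESS"],
    "OPTIONAL": ["DUMMY"],
}
zip_in_required = {
    "FIXED": ["PHONE-NUMBER"],
    "REQUIRED": ["NAME", "ADDRESS", "ZIP-CODE"],
    "OPTIONAL": ["DUMMY"],
}

print("\n".join(standard_output(zip_in_fixed, record)))
print()
print("\n".join(standard_output(zip_in_required, record)))
```

In the first listing ZIP-CODE appears between the phone number and the name; in the second it follows the address.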
We have just discussed the importance of choosing the best element for the key of the record. Let's look at situations in which the choice of unique key may be difficult or impossible.
Suppose our SPIRES file were going to be a collection of abstracts from scientific journals. Our record definition might be as follows (note that the key is not specified):
RECORD-NAME = REC01;
FIXED;
ELEM = YEAR; LEN = 4; OCC = 1;
REQUIRED;
ELEM = AUTHOR; OCC = 1;
ELEM = ARTICLE; OCC = 1;
ELEM = JOURNAL-NAME; OCC = 1;
ELEM = VOLUME; OCC = 1;
ELEM = NUMBER; OCC = 1;
ELEM = PAGE; OCC = 1;
ELEM = ABSTRACT; OCC = 1;
Now, if we wanted a search to retrieve the list of journal abstracts in which the words specified in the search request appeared, the goal record would be "articles." A search request for such a file would look like this:
-? find abstract-keyword light waves
-RESULT: 27 ARTICLE(S)
How would we go about choosing a key for an "article" goal record? None of the elements defined above is very likely to be entirely unique. We could contrive a unique key by concatenating portions of the JOURNAL-NAME, YEAR and PAGE elements: NG.76.202, for example, could signify page 202 of a 1976 issue of National Geographic. However, such a key would not be convenient to enter or use. (See the "Structured Key" processing rule, A33, for one solution.)
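As a rough illustration of such a contrived key, the Python sketch below joins a journal abbreviation, a two-digit year, and a page number. The function name `contrived_key` and the abbreviation argument are hypothetical; SPIRES does not build keys this way on its own.

```python
def contrived_key(journal_code, year, page):
    """Join a journal abbreviation, two-digit year, and page number with dots."""
    return f"{journal_code}.{year % 100:02d}.{page}"


print(contrived_key("NG", 1976, 202))  # NG.76.202
```

Whoever enters a record must know the journal abbreviation and assemble the pieces correctly every time, which is the inconvenience noted above.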
SPIRES has a more elegant solution to the problem of a lack of a natural key. If you specify that a record type is "SLOT", SPIRES will assign a unique integer key to each record added; these keys start at one and will be incremented by one as each record enters the file. SPIRES always stores a slot key as a four byte binary number.
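A toy Python model of slot-key assignment follows. The class `SlotCounter` and the function `pack_slot_key` are hypothetical names. The byte order and unsigned interpretation are assumptions, since the text says only that the key is a four byte binary number.

```python
import struct


class SlotCounter:
    """Toy model: hand out 1, 2, 3, ... as records are added."""

    def __init__(self):
        self.next_slot = 1

    def assign(self):
        key = self.next_slot
        self.next_slot += 1
        return key


def pack_slot_key(key):
    # Assumption: unsigned and big-endian; the manual does not specify either.
    return struct.pack(">I", key)


counter = SlotCounter()
keys = [counter.assign() for _ in range(3)]
print(keys)                           # [1, 2, 3]
print(pack_slot_key(keys[-1]).hex())  # 00000003
```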
This simple solution could lead to problems: suppose you typed a command such as "remove 197" when it was actually record 187 that was to be removed. The file definer can protect against this kind of error in a slot file. SPIRES allows you to specify that a "check digit" be appended to each integer slot number as the record is added to the file. A check digit is a single digit that is appended to the right end of a number; it is computed by performing multiplication and addition operations on each digit of the original number, and then adding and subtracting the resulting sum to yield a single digit. Since this digit is computed from the other digits in the number, the original number's digits can be verified by seeing that the final digit is correct for a number you type at the terminal.
For example, a record in your file may have the key "2757", of which the final digit, "7", is the check digit. The value "2657" would not be a valid key, however, since the first three digits, "265", require (or compute to) a different check digit than "7". Thus, each digit becomes significant in computing the check digit, and most typographical errors in specifying a record's key (such as typing "2657" instead of "2757") will be caught when the system attempts to verify the check digit. Note that the record whose key is "2757" is not the two thousand seven hundred fifty-seventh record in the file, but the two hundred seventy-fifth; the final digit is the check digit. The system does not store the check digit with the key, but computes it each time you display the key--the digit is always shown when the key is displayed. (If you "look up" the key of a record using action 32 [See C.5.] and intend to display it, you must explicitly code a processing rule to have this digit displayed.)
A check digit is requested by coding "SLOTCHECK" on a SLOT record. On all commands requiring a key, the check digit, first computed by SPIRES, is appended to the record key by the user and recomputed and validated by the system. This digit functions similarly to a parity bit in tapes, verifying that the data is valid. The method SPIRES uses in computing the check digit is described in detail in the description of action 27. [See D.1.3.0.2.7.]
The default check-digit formula is called the Mod-11 rule; it can be explicitly requested by giving the value "0" to the SLOTCHECK statement ("SLOTCHECK = 0;"). Other formulas, described in the description of action 27, can be requested by coding different integers on the SLOTCHECK statement:
formula             SLOTCHECK value
----------------    ---------------
Mod-11              0
Student Services    1
LEN                 2
LUHN                3
ABA                 4
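To make the idea of a check digit concrete, here is a minimal Python sketch of the widely published Luhn algorithm. It is not taken from SPIRES: whether the LUHN option (SLOTCHECK = 3) matches it in every detail is an assumption, and the default Mod-11 formula behind the "2757" example is described under action 27 and is not reproduced here. All function names are hypothetical.

```python
def luhn_check_digit(number):
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left, doubling every other digit, starting with the
    # rightmost digit of the unchecked number.
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return str((10 - total % 10) % 10)


def with_check_digit(slot_number):
    """Append the check digit to a slot number, as shown to the user."""
    return f"{slot_number}{luhn_check_digit(str(slot_number))}"


def is_valid_key(key):
    """Recompute the check digit from all but the last digit and compare."""
    return len(key) >= 2 and key.isdigit() and luhn_check_digit(key[:-1]) == key[-1]


print(with_check_digit(275))  # 2758
print(is_valid_key("2758"))   # True
print(is_valid_key("2658"))   # False
```

Under this formula slot 275 would be displayed as 2758 rather than 2757, and mistyping the second digit produces a key that fails validation.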
No KEY statement is coded for a slot record, since the slot number (with a check digit if one is requested) is the key of the record. SLOT and SLOTCHECK are coded as part of the record definition as follows. (REMOVED will be explained in the next section of this chapter.)
FILE = GG.UUU.BIBLIOGRAPHY;
AUTHOR = JOHN SACK, POLYA 151, 497-4420;
BIN = 907;
RECORD-NAME = ENTRY;
REMOVED;
SLOT;
SLOTCHECK;
FIXED;
ELEM = YEAR; OCC = 1; LEN = 4;
REQUIRED;
ELEM = AUTHOR; OCC = 1;
ELEM = ARTICLE; OCC = 1;
ELEM = JOURNAL-NAME; OCC = 1;
ELEM = VOLUME; OCC = 1;
ELEM = NUMBER; OCC = 1;
ELEM = PAGE; OCC = 1;
ELEM = ABSTRACT; OCC = 1;
SUBFILE-NAME = ABSTRACTS;
GOAL-RECORD = ENTRY;
ACCOUNTS = GG.SPI, GA....;
You will notice that "RECORD-NAME = ENTRY" was coded, instead of "RECORD-NAME = REC01" as before. In a slot record the name of the goal record key is determined by the value of the "RECORD-NAME" statement. One caution should be observed: the value of this element should be lower in alphabetical sequence than any other "RECORD-NAME" statements you code since the record definitions are displayed in alphabetical sequence by RECORD-NAME. In general, it is not good to code "RECORD-NAME = GOAL". If you wish to have the goal record key named something other than the RECORD-NAME, then following the "SLOT" statement, code the "SLOT-NAME = name" statement. You may also code an ALIASES statement for the slot key. [See B.1.6.]
The SLOT statement may also have a numeric value, representing its priv-tag number. [See B.9.4.4.]
A file may not have more than eight slot-type record definitions. You may, under certain circumstances, want to define a goal record as "slot" even when a natural key exists. The advantages and disadvantages of such a scheme must be weighed carefully, and are described below.
SPIRES treats slot type goal records in a special way; it keeps them in a data set that is organized sequentially rather than tree structured. If all of the elements in a slot record are fixed required (coded in the FIXED section of the record definition), then the amount of space each record requires is known exactly; when SPIRES goes to retrieve such a record, it "calculates" the record's position and goes directly to that location--it does not have to account for the varying size of each goal record stored. Thus, the major advantage of slot organization of fixed required elements is that record retrieval goes much faster (retrieval is not to be confused with searching, the process that usually precedes retrieval). A second significant advantage is that, since the goal records are structured sequentially, sequential searching by global FOR commands is faster.
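The position "calculation" for fixed-length slot records can be sketched in Python as below. The layout is an assumption made for illustration (records packed back to back, never spanning a block, no block header); the actual SPIRES storage format is not described in this section. The function name `locate_fixed_slot` is hypothetical.

```python
BLOCK_SIZE = 2048  # bytes; the ORVYL block size quoted in this chapter


def locate_fixed_slot(slot_number, record_length, block_size=BLOCK_SIZE):
    """Return (block, byte offset) of a fixed-length record; slots count from 1.

    Assumes records are packed back to back, do not span blocks, and that
    blocks carry no header. These are simplifications, not SPIRES facts.
    """
    per_block = block_size // record_length
    index = slot_number - 1
    return index // per_block, (index % per_block) * record_length


print(locate_fixed_slot(1, 50))   # (0, 0)
print(locate_fixed_slot(42, 50))  # (1, 50)
```

Because every record has the same length, no lookup is needed: the slot number alone determines the block to read.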
The disadvantages of forcing your data base structure into a fixed slot-type record format are actually inconveniences; the file definer must decide if these inconveniences are acceptable.
One inconvenience is that records must be referred to by their slot number in TRANSFER, REMOVE, UPDATE and DISPLAY commands. In a personnel file keyed on social security number, you could remove or update an employee's record by simply giving the person's social security number. If this file were defined as a slot file, you would first retrieve the record by using a FIND command against a social security index, then TRANSFER the record retrieved using a global FOR command. As you can see, the extra expense of building a social security number index would have to be incurred.
A second inconvenience is the loss of verification that the social security number of each person was unique; if social security number were the key of the record, SPIRES would verify that no records had the same value for the key. Generally, when a natural key exists, SLOT organization is not used.
If you expect to do a lot of sequential searching of the data base using global FOR commands, then consider making the record elements FIXED and the record SLOT. How is this done? If most of the elements in a record are of fixed length and fixed occurrence, you can consider having SPIRES store a variable length element such as NAME as a fixed length element; you would choose the largest possible length for the length attribute of each variable length element. But if a record has an element that can vary greatly in length, such as ABSTRACT in our article goal record above, we would not want to waste storage space by fixing the length of this element at its longest possible value.
If not all of a SLOT record's elements are coded in the FIXED section, then the record must be "removed." In the following section we will see what record removal means, and how it is coded.
SLOT-START is a new field in the SLOT structure of record definitions, both in FILEDEF and RECDEF. SLOT-START serves more than a single purpose -- enabling a simple way to generate keys that begin at a particular value.
Suppose you would like to generate Slot records whose first key value is something other than 1 -- say 9000000. You want the first record to have a key of 9000000, the second 9000001, then 9000002, etc. The ORVYL system on the Stanford mainframe stored the key of 9000000 in block 35573 of the record-type data set (assuming the block size is 2048 and the record-type is REMOVED). This situation poses no real disadvantages in mainframe SPIRES because ORVYL only writes a single block 35573 resulting in a data set that has two blocks -- block 0 and block 35573.
But for Unix SPIRES this would be a different matter. If you added the first record of 9000000 to a slot record-type in that system, SPIRES would have to fill in the gap between block 0 and block 35573. Not only does this represent "wasted" space, it also means 35572 extra block read requests should you attempt a sequential scan (e.g., FOR SUBFILE / DISPLAY ALL) of the subfile.
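The block numbers quoted above can be checked with a little arithmetic. The figure of 253 entries per 2048-byte block is an inference from the manual's own numbers (9000000 divided by 253 gives 35573 with a remainder), not a documented constant, and the function `block_for_key` is a hypothetical name. The `slot_start` parameter models the SLOT-START value as an offset subtracted from the key before the block is computed.

```python
ENTRIES_PER_BLOCK = 253  # inferred: 9000000 // 253 == 35573; not stated in the manual


def block_for_key(key, slot_start=0, entries_per_block=ENTRIES_PER_BLOCK):
    """Block that would hold a slot key, measured from the starting key."""
    if key < slot_start:
        raise ValueError("key is below the starting value")
    return (key - slot_start) // entries_per_block


without_start = block_for_key(9000000)
with_start = block_for_key(9000000, slot_start=9000000)

print(without_start)      # 35573
print(with_start)         # 0
print(without_start - 1)  # 35572 blocks lie between block 0 and block 35573
```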
If you wish to take advantage of this option then you should code the "SLOT-START = number" statement immediately following the "SLOT" statement in your record definition. If the subfile is a NEW subfile then your work is done and the first record added to the subfile will have a key of "number".
If the subfile already exists and has SLOT keys that begin from a different value, and you wish to take advantage of this new option, then you should RECOMPILE using the REBALANCE option, following the same recipe that you use to rebalance a Tree data set using the CONVERT option. [EXPLAIN RECOMPILE COMMAND, WITH REBALANCE OPTION.]
Here are some answers to other questions you might ask:
- Your Slot-Start key value will be placed in block 0 of a SLOT record-type. Subsequent key values will occur in that block, or later blocks.
- You cannot recompile a SLOT record-type and alter the Slot-Start value unless you recompile with the REBALANCE option.
- If you attempt to add a record whose key is less than the Slot-Start value you will receive an S521 error diagnostic.
- If you attempt to alter the Next Slot to a value less than the Slot-Start value using the FIX SUBFILE SLOT command you will receive an S521 error diagnostic.
- The SLOT-START value may be coded in a RECDEF definition or for a Pseudo record-type.
- You can expect that Slot-Start processing will work for temporary SPIRES subfiles (e.g., Select @orv.gg.uuu.tempfile).
In our first example of a file, a telephone directory keyed on name, the length of a single record is approximately the sum of the lengths of the individual elements. It would probably not exceed fifty bytes or characters: the NAME and ADDRESS elements may take twenty bytes each, and the PHONE-NUMBER element takes an additional eight bytes.
This means that in one block of ORVYL storage, which is 2048 bytes, approximately forty records could be stored. SPIRES access efficiency depends to a great extent upon the number of records each block contains. If the average number of records per block is eighty, then one record in 512,000 can be retrieved by accessing three blocks or fewer, if the data is structured in tree fashion. If the average number of records in a block is only twenty (each record averaging one hundred bytes in length), then five accesses may be necessary. If the number of records per block drops to sixteen or fewer, then efficiency of record access seriously degenerates, since many file I/O's may be necessary to locate a particular record.
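The access counts above follow from simple powers: with eighty records per block, three levels reach 80 x 80 x 80 = 512,000 records. The Python sketch below reproduces that arithmetic under the simplifying assumption that every level of the tree has a fan-out equal to the number of records per block; the function name `block_accesses` is hypothetical.

```python
def block_accesses(records_per_block, total_records):
    """Smallest number of tree levels whose fan-out covers total_records."""
    levels = 1
    reachable = records_per_block
    while reachable < total_records:
        reachable *= records_per_block
        levels += 1
    return levels


print(2048 // 50)                   # 40 fifty-byte records per block
print(block_accesses(80, 512_000))  # 3
print(block_accesses(20, 512_000))  # 5
```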
In order to keep access efficiency high, SPIRES provides the file definer with the option of removing large record types (remember that a "record type" is our "REC01," a "record" is an entry in the file), say of 60 bytes or more, from the goal record data set to a "residual data set." This removal is done for all records in a record type, and may be specified for any record-type, whether large or small in size. Only a key and a pointer to the record's location in the residual data set remain in the goal record data set when you specify record removal.
Tree (or "non-slot") record types, such as our telephone directory, should usually be removed, since the size of an entry is often over forty bytes. You can specify that a record type's contents are to be removed to a residual data set when you code the record type's name:
RECORD-NAME = REC01; REMOVED;
Slot record types, which use a different access technique from tree structured record types, are always removed unless all elements are fixed in length and occurrence. Slot record types, such as our articles file, can specify removal to a residual data set as follows:
RECORD-NAME = ENTRY; REMOVED;
Slot and tree structured data sets may be mixed in a file or data base.
Some rules-of-thumb can be stated for record removal. Remove record types if any of the following are true:
- The record type is slot and the elements are not all FIXED-REQ.
- Goal records will be retrieved through the SPIRES index searching commands--FIND, AND, etc.--rather than primarily through the DISPLAY command.
- The average record is more than 60 bytes long.
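The three rules of thumb can be written as a small checklist. This Python sketch is only a restatement of the list above; the function `removal_reasons` and its parameters are hypothetical, and the result is advice for the file definer, not something SPIRES computes.

```python
def removal_reasons(is_slot, all_elements_fixed, retrieved_via_index, average_length):
    """Return the rules of thumb that apply; any one suggests coding REMOVED."""
    reasons = []
    if is_slot and not all_elements_fixed:
        reasons.append("slot record type whose elements are not all FIXED")
    if retrieved_via_index:
        reasons.append("records retrieved mainly through index searching")
    if average_length > 60:
        reasons.append("average record longer than 60 bytes")
    return reasons


# The articles file: slot, with a variable-length ABSTRACT element.
# The average length of 400 bytes is an invented figure for illustration.
print(removal_reasons(True, False, True, 400))
```

An empty list means none of the rules applies.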