LANGDEV

LexisML Index Records
1.0 Specification

Introduction

The LexisML Index Record (LREC) serves to index the lexicon of a language and provide links to word definitions. LRECs are designed to be minimalist text files that can store a large number of records in a compact, human-readable format. The purpose of an LREC is to give users and programs a sense of the scope of a lexicon without having to read each individual entry, thus reducing initial load times for displaying an index thereof.

The format of an LREC is based upon that of the IANA Language Subtag Registry, which is in turn derived from the record-jar format described in [ART-UNIX].

Definitions

Begin
Of a sequence, to be first in a sequence's contents.
Blank line
A line which contains zero characters; that is, a single U+000A LINE FEED Unicode character.
Character
When not qualified as a Unicode character, a graphic character, as defined in [UNICODE].
Contents
Of a line, the sequence of zero or more characters which comprise the line. Of another sequence, the sequence itself. A sequence of characters or lines is said to be contained in another sequence if it is in the sequence's contents.
End
Of a sequence of lines, to be the last lines in a sequence's contents which are not blank. Of another sequence, to be last in the sequence's contents.
Field
A sequence of lines as defined in The LREC Format.
Indented contents
The contents of a line, starting with the first non-space character in the line.
Line
A sequence of zero or more characters followed by a U+000A LINE FEED Unicode character. Lines MUST NOT be longer than 72 bytes.
LREC
A file which conforms to this specification.
Non-space character
A character which is not U+0020 SPACE.
Record
A sequence of lines as defined in The LREC Format.
Unicode character
A Unicode scalar value, as defined in [UNICODE].

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in [RFC-2119].

The LREC Format

LRECs are plain-text files, encoded in UTF-8 as specified by [UNICODE]. They consist of a sequence of one or more records, each separated by the two-character sequence U+0025 PERCENT SIGN, U+0025 PERCENT SIGN on its own line. Records, in turn, consist of several name-value pairs known as fields.

A field MUST consist of the following, in order:

  • A sequence of characters which does not include the sequence U+0020 SPACE, U+003A COLON, U+0020 SPACE. These make up the field's name.

  • The character sequence U+0020 SPACE, U+003A COLON, U+0020 SPACE.

  • A sequence of characters, at least one of which is a non-space character. These make up the field's value. Whitespace which begins or ends a field's value MUST be ignored.
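
The field layout above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names SEPARATOR and parse_field are hypothetical, and the sketch assumes that the first occurrence of the separator ends the field's name and that the input is a single line which has already been unfolded.

```python
SEPARATOR = " : "  # U+0020 SPACE, U+003A COLON, U+0020 SPACE


def parse_field(line: str) -> tuple:
    """Split one field line into a (name, value) pair.

    Assumption: the first occurrence of the separator ends the name,
    so the name can never contain the separator itself.
    """
    name, sep, value = line.partition(SEPARATOR)
    if not sep:
        raise ValueError("missing ' : ' separator")
    value = value.strip()  # leading and trailing whitespace is ignored
    if not value:
        raise ValueError("field value has no non-space character")
    return name, value
```

Because only the first separator is significant, a value may itself contain the sequence SPACE, COLON, SPACE.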

Lines MUST NOT be longer than 72 bytes; consequently, field values MAY need to be broken across multiple lines. When a line begins with four U+0020 SPACE characters, its contents from the fifth character onward MUST be considered a continuation of the preceding field's value. Conversely, lines which do not begin with four U+0020 SPACE characters MUST NOT be considered a continuation of the preceding field's value.
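
The length limit and the continuation rule might be handled as in the following Python sketch. Two points are assumptions, since this document does not settle them: the 72-byte limit is counted over the UTF-8 encoding of the line without its terminating LINE FEED, and a continuation is joined to the preceding text with a single space. The function names are hypothetical.

```python
MAX_LINE_BYTES = 72
INDENT = "    "  # four U+0020 SPACE characters


def check_line_length(line: str) -> None:
    """Raise ValueError if the UTF-8 encoding of the line is too long.

    Assumption: the terminating LINE FEED is not counted.
    """
    size = len(line.encode("utf-8"))
    if size > MAX_LINE_BYTES:
        raise ValueError(f"line is {size} bytes; the limit is {MAX_LINE_BYTES}")


def unfold(lines: list) -> list:
    """Join continuation lines onto the field line they continue.

    Expects the lines of one record, with comments already removed.
    Assumption: continuations are joined with a single space.
    """
    logical = []
    for line in lines:
        if line.startswith(INDENT):
            if not logical:
                raise ValueError("continuation line with no preceding field")
            logical[-1] += " " + line[len(INDENT):]
        else:
            logical.append(line)
    return logical
```

Note that the limit is measured in bytes, not characters, so a line of 37 two-byte characters already exceeds it.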

Lines whose contents begin with exactly one (not two) U+0025 PERCENT SIGN character are comments and MUST be ignored. An LREC MUST NOT contain any line which is not part of a field, a comment, or the sequence U+0025 PERCENT SIGN, U+0025 PERCENT SIGN, U+000A LINE FEED.
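
As an illustration of how the line kinds fit together, the following Python sketch classifies lines and splits a document into records. The names classify and split_records are hypothetical, and the sketch assumes that dropping a comment does not interrupt a folded field value, which this document does not state.

```python
def classify(line: str) -> str:
    """Return the kind of a line, given without its LINE FEED."""
    if line == "%%":
        return "separator"
    if line.startswith("%") and not line.startswith("%%"):
        return "comment"
    if line.startswith("    "):
        return "continuation"
    return "field"


def split_records(text: str) -> list:
    """Split a document into records, each a list of its lines."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # the final LINE FEED terminates the last line
    records = [[]]
    for line in lines:
        kind = classify(line)
        if kind == "separator":
            records.append([])
        elif kind != "comment":
            records[-1].append(line)
    return records
```

The text is split on U+000A only, rather than with str.splitlines, because the latter also breaks on other characters that this document does not treat as line terminators.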

Fields are referred to by their names. Except where otherwise specified, more than one field of a particular name MUST NOT appear in a single record. All fields which are not marked as OPTIONAL or RECOMMENDED are REQUIRED to appear in a record's contents. Except where otherwise specified, field names are case-insensitive but field values are case-sensitive.
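
A record can then be assembled from its parsed fields, as in the Python sketch below. The function build_record and its repeatable parameter are hypothetical, and the use of str.casefold for case-insensitive comparison of names is an assumption; this document does not name a case-folding algorithm.

```python
def build_record(fields, repeatable=frozenset()):
    """Collect (name, value) pairs into a dict keyed by case-folded name.

    `repeatable` lists the case-folded names of fields which MAY appear
    more than once in the kind of record being built.
    """
    record = {}
    for name, value in fields:
        key = name.casefold()  # field names are case-insensitive
        if key in record and key not in repeatable:
            raise ValueError(f"duplicate field: {name}")
        record.setdefault(key, []).append(value)  # values keep their case
    return record
```

Every value is stored in a list, so repeatable and non-repeatable fields can be read in the same way.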

Record and field definitions

The following records are defined by this document:

Metadata records

A metadata record MUST begin an LREC, and MUST NOT appear elsewhere in the document. It consists of the following fields, in any order, specified by name:

Title

The title of the LREC.

Subtitle

(OPTIONAL) The subtitle of the LREC.

Author

(RECOMMENDED) The author(s) of the LREC.

Date

(RECOMMENDED) The date upon which the LREC was last modified. The value of a date field SHOULD be in the "full-date" format as specified by [RFC-3339].

Language

(RECOMMENDED) A language tag matching the syntax provided in [BCP-47], describing the language of the LREC as a whole; i.e., that of its intended audience. This SHOULD be a valid (recognized) tag but MAY include private-use components.

Description

(RECOMMENDED) A description of the scope and purpose of the LREC, or similar relevant information.

Splash

(OPTIONAL) Splash text; that is, miscellaneous words and phrases used to celebrate different aspects or achievements of the document. This field MAY appear more than once.

Frontmatter

(OPTIONAL) A URI at which the frontmatter for the dictionary may be retrieved, if available.
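
The following Python sketch shows what a metadata record might look like and how its fields could be checked. The sample values, the helper names, and the split into errors (for REQUIRED fields) and warnings (for RECOMMENDED fields and SHOULD-level rules) are all illustrative assumptions. The parse helper is deliberately simplified and does not handle continuation lines or comments.

```python
import re
from datetime import date

SAMPLE = """\
Title : Example Lexicon
Author : A. N. Author
Date : 2024-02-29
Language : en
Splash : Now with more words!
Splash : Hand-sorted.
"""

REQUIRED = ("title",)
RECOMMENDED = ("author", "date", "language", "description")


def parse(text):
    """Simplified parser: one field per line, no folding, no comments."""
    record = {}
    for line in text.splitlines():
        name, _, value = line.partition(" : ")
        record.setdefault(name.casefold(), []).append(value.strip())
    return record


def is_full_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def check_metadata(record):
    """Return (errors, warnings) for a metadata record."""
    errors = [f"missing required field: {n}" for n in REQUIRED if n not in record]
    warnings = [
        f"missing recommended field: {n}" for n in RECOMMENDED if n not in record
    ]
    for value in record.get("date", []):
        if not is_full_date(value):
            warnings.append(f"date is not a full-date: {value}")
    return errors, warnings
```

For the sample above, check_metadata(parse(SAMPLE)) reports no errors and a single warning about the missing description field.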

Tag-group records

A tag-group record groups together tags for later use in lexeme records. If present, tag-group records MUST appear before any lexeme records. They consist of the following fields, in any order, specified by name:

Group

The name of the tag-group. This MUST be unique across an LREC; i.e., there must not be more than one tag-group record with a given value for its group field. The value of this field is case-insensitive.

Description

(OPTIONAL) A description of the tag-group.

Subgroup

(OPTIONAL) The name of another tag-group which should be considered a part of this one. If present, this MUST match the group field of a tag-group record, which MUST appear before this one. This field MAY appear more than once. The value of this field is case-insensitive.

Tag

(OPTIONAL) The name of a tag which should be considered a part of this tag-group. This field MAY appear more than once. The value of this field is case-insensitive.

Although both tag and subgroup fields are OPTIONAL, at least one of these fields MUST appear in a tag-group record. Furthermore, both fields MUST be unique across an LREC; that is, any two tag or subgroup fields in an LREC MUST NOT have the same value when compared in a case-insensitive manner, regardless of the record they are contained in.
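
The constraints on tag-group records could be checked as in the Python sketch below. It is one possible reading, not a normative algorithm: it assumes that the uniqueness rule covers tag and subgroup values together in a single namespace, and it represents each record as a dict from case-folded field name to a list of values. The function name check_tag_groups is hypothetical.

```python
def check_tag_groups(groups):
    """Return a list of problems found in tag-group records.

    `groups` holds the tag-group records in document order.
    """
    problems = []
    seen_groups = set()   # case-folded group names defined so far
    seen_members = set()  # case-folded tag and subgroup values so far
    for record in groups:
        name = record["group"][0].casefold()
        if name in seen_groups:
            problems.append(f"duplicate group: {name}")
        members = record.get("tag", []) + record.get("subgroup", [])
        if not members:
            problems.append(f"group has no tag or subgroup: {name}")
        for sub in record.get("subgroup", []):
            # A subgroup must name a tag-group which appeared earlier.
            if sub.casefold() not in seen_groups:
                problems.append(f"subgroup not defined earlier: {sub}")
        for member in members:
            key = member.casefold()
            if key in seen_members:
                problems.append(f"duplicate tag or subgroup: {member}")
            seen_members.add(key)
        seen_groups.add(name)
    return problems
```

The group's own name is recorded only after its subgroup fields are checked, so a tag-group cannot list itself as a subgroup.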

Lexeme records

A lexeme record provides information on a lexeme. It consists of the following fields, in any order, specified by name:

Lexeme

The lexeme itself. This MUST be unique across an LREC; i.e., there must not be more than one lexeme record with a given lexeme field.

At

A URI at which information on the lexeme can be retrieved.

Language

(RECOMMENDED) A language tag matching the syntax provided in [BCP-47], describing the language of the lexeme itself. If this field is absent but the language field is present in the LREC's metadata record, the language of the lexeme MUST be inherited from the language of the LREC as a whole.

Pronunciation

(RECOMMENDED) A hint at pronouncing the lexeme. This field MAY appear more than once.

Gloss

(OPTIONAL) A brief gloss of the lexeme.
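
The language inheritance rule and the basic lexeme constraints might be implemented as in the following Python sketch. For brevity each record is a dict from lowercase field name to a single value, so repeatable fields such as pronunciation are left out. The function names and sample values are hypothetical.

```python
from typing import Optional


def effective_language(lexeme: dict, metadata: dict) -> Optional[str]:
    """Return the language that applies to a lexeme record, if any."""
    if "language" in lexeme:
        return lexeme["language"]
    # Fall back to the language of the document as a whole.
    return metadata.get("language")


def check_lexemes(records: list) -> list:
    """Check that lexemes are unique and that each record has an at field."""
    problems = []
    seen = set()
    for record in records:
        value = record["lexeme"]
        if value in seen:  # field values are case-sensitive
            problems.append(f"duplicate lexeme: {value}")
        seen.add(value)
        if "at" not in record:
            problems.append(f"missing at field: {value}")
    return problems
```

Since field values are case-sensitive unless stated otherwise, "cat" and "Cat" count as two distinct lexemes in this sketch.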

Inflection records

An inflection record provides inflected forms for lexemes. It consists of the following fields, in any order, specified by name:

Inflected

The inflected form.

Of

The lexeme that the record provides an inflection of. This MUST match the lexeme field of a lexeme record, which MUST appear before the given inflection record.

Pronunciation

(RECOMMENDED) A hint at pronouncing the inflected form. This field MAY appear more than once.

There MUST NOT be any two inflection records in a single LREC with identical values for both the inflected and of fields.
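
The ordering and uniqueness rules for inflection records can be sketched in Python as follows. Records are given in document order as dicts from lowercase field name to a single value; the function name check_inflections and this simplified representation are assumptions made for illustration.

```python
def check_inflections(records: list) -> list:
    """Check inflection records against the lexemes which precede them."""
    problems = []
    lexemes = set()  # lexeme values seen so far
    pairs = set()    # (inflected, of) pairs seen so far
    for record in records:
        if "lexeme" in record:
            lexemes.add(record["lexeme"])
        elif "inflected" in record:
            pair = (record["inflected"], record["of"])
            if record["of"] not in lexemes:
                problems.append(f"lexeme not defined earlier: {record['of']}")
            if pair in pairs:
                problems.append(f"duplicate inflection: {pair[0]} of {pair[1]}")
            pairs.add(pair)
    return problems
```

The same inflected form may appear under different lexemes; only a repeated pair of inflected and of values is reported.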

Alternate records

An alternate record provides alternate forms for lexemes or inflections. It consists of the following fields, in any order, specified by their field names:

Alternate

The alternate form.

For

The lexeme or inflection that the record provides an alternate for. This MUST match the lexeme field of a lexeme record or the inflected field of an inflection record, either of which MUST appear before the given alternate record.

Of

(OPTIONAL) If the for field refers to an inflection, this field MUST match the of field of the corresponding inflection record. Otherwise, this field MUST NOT be present.

Script

(RECOMMENDED) A script tag as specified in [BCP-47], describing the script that the alternate form is written in. If left unspecified, the script is assumed to be the same as that of the lexeme that the record provides an alternate for.

Pronunciation

(OPTIONAL) If the pronunciation of the alternate differs from the original lexeme, a hint at its pronunciation. This field MAY appear more than once.

Any two alternate records in a single LREC MUST NOT have identical values for both the alternate and for fields, when both records provide an alternate form for a lexeme, or identical values for all three of the alternate, for, and of fields, when both records provide an alternate form for an inflection.
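
The two uniqueness cases for alternate records can be expressed with a single comparison key, as in the Python sketch below. It assumes that the presence of an of field is what marks an alternate as targeting an inflection rather than a lexeme, which is one reading of the text above. Records are dicts from lowercase field name to a single value, in document order, and the function name check_alternates is hypothetical.

```python
def check_alternates(records: list) -> list:
    """Check alternate records against earlier lexemes and inflections."""
    problems = []
    lexemes = set()      # lexeme values seen so far
    inflections = set()  # (inflected, of) pairs seen so far
    keys = set()         # uniqueness keys of alternates seen so far
    for record in records:
        if "lexeme" in record:
            lexemes.add(record["lexeme"])
        elif "inflected" in record:
            inflections.add((record["inflected"], record["of"]))
        elif "alternate" in record:
            if "of" in record:  # assumed to target an inflection
                known = (record["for"], record["of"]) in inflections
                key = (record["alternate"], record["for"], record["of"])
            else:               # assumed to target a lexeme
                known = record["for"] in lexemes
                key = (record["alternate"], record["for"])
            if not known:
                problems.append(f"target not defined earlier: {record['for']}")
            if key in keys:
                problems.append(f"duplicate alternate: {record['alternate']}")
            keys.add(key)
    return problems
```

Keys of length two and three never compare equal, so an alternate for a lexeme cannot collide with an alternate for an inflection of the same spelling.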