Problem with GEDCOM embedded multimedia objects (BLOB). February 16, 2013.

Few GEDCOM applications can handle embedded objects, and even fewer do it correctly. Why?

This short discussion tries to answer this question.

Converting a binary file into a set of printable characters is widespread, e.g. in email (MIME encoding) or XML (http://en.wikipedia.org/wiki/Base64).

The GEDCOM specification belongs to this group of translations, based on a radix-64 representation.

Here is the transformation scheme: 3 bytes (24 bits) are converted to 4 bytes. Each of these 4 bytes always has its two MSB set to 0, and the remaining 6 bits come from the source bytes (4 x 6 = 24).
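The 3-to-4 split can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this note, and the bit order (most significant bits first, as in base64) is an assumption that the text above does not state explicitly.

```python
def split_triplet(b0: int, b1: int, b2: int) -> list[int]:
    """Split three 8-bit bytes into four 6-bit values, most significant bits first."""
    group = (b0 << 16) | (b1 << 8) | b2           # one 24-bit integer
    return [(group >> shift) & 0x3F for shift in (18, 12, 6, 0)]


def join_quad(v0: int, v1: int, v2: int, v3: int) -> bytes:
    """Inverse of split_triplet: rebuild the three source bytes."""
    group = (v0 << 18) | (v1 << 12) | (v2 << 6) | v3
    return group.to_bytes(3, "big")


values = split_triplet(*b"abc")
print(values)               # [24, 22, 9, 35]
print(join_quad(*values))   # b'abc'
```

Every value returned by split_triplet fits in 6 bits, so the two MSB of each output byte are always 0.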

A byte that carries 6 bits can take any value between 0 and 0x3f. The goal is to embed a binary file as text, so the next step is to translate each value from 0 - 0x3f to some printable character.

MIME (base64) and GEDCOM use different translation tables.
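To make the difference concrete, here is a small sketch that translates the same four 6-bit values with both tables. The base64 alphabet is the standard one. The GEDCOM alphabet is written down here as an assumption ('.', '/', digits, upper case, lower case); check it against the Encoding Table in the standard before relying on it. The helper name translate is made up for this note.

```python
import base64

# Standard base64 (MIME) alphabet.
MIME_TABLE = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
# GEDCOM 5.5 alphabet as assumed here: '.', '/', 0-9, A-Z, a-z.
GEDCOM_TABLE = "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def translate(values, table):
    """Map each 6-bit value (0 - 0x3f) to the printable character at that index."""
    return "".join(table[v] for v in values)


six_bit_values = [24, 22, 9, 35]                 # the bytes of b"abc", split 3 to 4
print(translate(six_bit_values, MIME_TABLE))     # YWJj
print(translate(six_bit_values, GEDCOM_TABLE))   # MK7X
print(base64.b64encode(b"abc").decode())         # YWJj, cross-check with the stdlib
```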

[Figure: 3 to 4 bytes. Layer 1: the three source bytes are split into four bytes whose two MSB are 0. Each of the four bytes is then translated from 0x0-0x3f to a printable character.]

The basic requirement is that this encoding process is reversible, i.e.

file == decode( encode( file ) );     for all files.
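A requirement "for all files" cannot be tested exhaustively, but a quick check over random inputs of every length already catches the typical mistakes in the last bytes. The sketch below uses Python's standard base64 codec as a stand-in; is_reversible and lossy_encode are hypothetical names introduced only for this example.

```python
import base64
import os


def is_reversible(encode, decode, max_len=64):
    """Check file == decode(encode(file)) for random inputs of every length up to max_len."""
    for n in range(max_len + 1):
        data = os.urandom(n)
        if decode(encode(data)) != data:
            return False
    return True


def lossy_encode(data: bytes) -> bytes:
    """A deliberately broken encoder that silently drops an incomplete last chunk."""
    return base64.b64encode(data[: len(data) - len(data) % 3])


print(is_reversible(base64.b64encode, base64.b64decode))   # True
print(is_reversible(lossy_encode, base64.b64decode))       # False
```

Testing every length from 0 upwards matters because it covers all three values of size modulo 3.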

Padding

The number of bytes to encode may or may not be divisible by 3: size modulo 3 can be 0, 1 or 2. When (size modulo 3) is 1, we need 2 encoded bytes for the last byte; when (size modulo 3) equals 2, three encoded bytes are needed. Logically no extra information is needed: the decoder can tell all the cases apart (encoded size modulo 4 equals 0, 2 or 3).
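The arithmetic can be checked with a short sketch. The two function names are made up for this note; they only count characters and do not encode anything.

```python
def unpadded_length(n: int) -> int:
    """Number of 6-bit characters needed for n bytes: ceil(8 * n / 6)."""
    return (8 * n + 5) // 6


def original_length(m: int) -> int:
    """Number of bytes recoverable from m unpadded characters: floor(6 * m / 8)."""
    if m % 4 == 1:
        raise ValueError("an unpadded encoding never has length 1 modulo 4")
    return (6 * m) // 8


# Columns: size, size modulo 3, encoded size, encoded size modulo 4.
for n in range(7):
    m = unpadded_length(n)
    print(n, n % 3, m, m % 4)
```

Because size modulo 3 maps one-to-one onto encoded size modulo 4, original_length recovers the exact file size without any padding characters.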

Some implementations use padding, i.e. they append one or two extra characters to make the encoded size divisible by four. It is essential that these padding bytes differ from the regular character set used in the translation table of the given implementation! The decoder must know whether the last bytes are regular encoded bytes or just padding bytes.

One case where padding characters are required is when multiple encoded files are concatenated, i.e. encode(file1) + encode(file2) must still decode to file1 + file2, where + means concatenation.
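The concatenation case can be illustrated with standard base64, whose padding character is "=". This is a minimal sketch: decode_quads is a hypothetical helper that decodes four characters at a time, so that padding in the middle of the stream is handled the same way on every Python version.

```python
import base64


def decode_quads(text: bytes) -> bytes:
    """Decode padded base64 four characters at a time, so '=' may occur mid-stream."""
    return b"".join(base64.b64decode(text[i:i + 4]) for i in range(0, len(text), 4))


file1, file2 = b"a", b"bcde"

padded = base64.b64encode(file1) + base64.b64encode(file2)
print(padded)                 # b'YQ==YmNkZQ=='
print(decode_quads(padded))   # b'abcde'

# Without padding the boundary between the two files is lost.
unpadded = base64.b64encode(file1).rstrip(b"=") + base64.b64encode(file2).rstrip(b"=")
print(unpadded)               # b'YQYmNkZQ'
```

Decoding the unpadded string as one stream no longer gives file1 + file2, because the two leftover characters of file1 are merged with the first characters of file2.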

A conceptual mix-up can happen here. When the size of the file to encode is not divisible by 3, an instruction may say, only for regularity (for an algorithm that takes three bytes at a time): pad the last byte with 2 other bytes ... then encode the last three bytes. This is of course not the same "padding" as the one discussed first.

Verbatim from the GEDCOM standard, release 5.5:

The encoding routine converts a binary multimedia file segment of from 1 to 54 bytes in length into an encoded GEDCOM line value of 2 to 64 bytes in length. This encoded value becomes the <ENCODED_MULTIMEDIA_LINE> used in the MULTIMEDIA_RECORD (see page 26.)

The algorithm accomplishes its goal using the following steps:

  1. Each 3 bytes (24 bits) of the binary 1 to 54 character segment is divided into four (6- bit) values. Each of these (6-bit) values are converted into an (8-bit) character making a character whose hexadecimal representation is between 0x00 and 0x3F (0 to 63 decimal.)
  2. Each of the 4 new characters represents an Encoding key which is used to obtain the new replacement character from an Encoding Table included in this appendix.
  3. Exception processing may be required in processing the last 3 byte chunk of the 1 to 54 character segment, which may consist of 0, 1, or 2 bytes:
        Retrieved  Action
    a. 0 bytes Pad the last 3 characters with 0xFF. The conversion is complete.
    b. 1 byte: Pad last two bytes with 0xFF then complete steps 1 and 2 above.
    c. 2 bytes: Pad last byte with 0xFF then complete steps 1 and 2 above.
  4. Repeat until all characters in the received line value has been substituted. The return value of new encoded characters should contain from 4 to 72 characters. The length of the return value will always be a multiple of 4.

Decoding:

The Decoding routine converts the encoded line value back into the original binary character multimedia file segment.

The decoding algorithm can be accomplished in the following steps:

  1. Each encoded multimedia line segment is divided into sets of 4 (8-bit) characters.
  2. Each of these characters becomes a decoding key used to look up a corresponding character from the Decoding Table. A new (24-bit) group is formed by concatenating the low-order 6 bits from each of the 4 characters obtained from the decoding table.
  3. Divide this new 24 bit group created by step 2 into three (8-bit) characters and concatenate them into the stream of characters being built as the decoded results.
  4. Processing ends when the 0xFF padded bytes are encountered.

The quotation ends here.


Doubts:

  1. The first line of the standard says: "(...) from 1 to 54 bytes in length into an encoded GEDCOM line value of 2 to 64 bytes in length", and then, in step 4 of Encoding: "The length of the return value will always be a multiple of 4."
    Maybe they mean that the encoding algorithm produces 2 to 64 bytes, but the encoded string is then padded to a multiple of 4? This gives a correct procedure, with an odd (non-ASCII) padding byte value of 0xff.
    The strings "abc", "abcd" and "abcde" would be encoded as "MK7X", "MK7XNDÿÿ" and "MK7XN4Lÿ" respectively; "ÿ" is the padding byte, equal to 0xff.
    I have not found an application that encodes this way.
  2. Step 3 (a): "Pad the last 3 characters with 0xFF. The conversion is complete." The last three bytes: so are they talking about the file to encode? But the next sentence is "The conversion is complete.", so are they talking about the encoded string, i.e. pad the last encoded 4 bytes with 0xff? OK, this could work, but then the encoded string will not be a multiple of 4.
    Then (b) and (c): how can the decoding routine tell an original file whose last chunk is 0xff from a file whose last chunk is 0xff 0xff?
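Both doubts can be tried out with a short Python sketch. It is not taken from any application. It assumes that the Encoding Table is laid out as '.', '/', digits, upper case, lower case, and that bits are taken most significant first; all function names are made up for this note. It works on a whole byte string and ignores the split into 54-byte segments and GEDCOM lines.

```python
# Assumed layout of the GEDCOM 5.5 Encoding Table: '.', '/', 0-9, A-Z, a-z.
GEDCOM_TABLE = "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
PAD = b"\xff"


def six_bit_chars(chunk: bytes) -> str:
    """Pad a 1 to 3 byte chunk with 0xFF and translate it to four table characters."""
    group = int.from_bytes(chunk + PAD * (3 - len(chunk)), "big")
    return "".join(GEDCOM_TABLE[(group >> s) & 0x3F] for s in (18, 12, 6, 0))


def encode(data: bytes) -> bytes:
    """Point 1: keep only the characters that carry data, fill the rest with 0xFF."""
    out = bytearray()
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        keep = len(chunk) + 1
        out += six_bit_chars(chunk)[:keep].encode("ascii") + PAD * (4 - keep)
    return bytes(out)


def decode(text: bytes) -> bytes:
    """Inverse of encode: strip the 0xFF padding bytes, then undo the 3 to 4 split."""
    out = bytearray()
    for i in range(0, len(text), 4):
        quad = text[i:i + 4].rstrip(PAD)          # drop the 0xFF padding bytes
        if len(quad) < 2:
            raise ValueError("a group needs at least two encoded characters")
        group = 0
        for ch in quad:
            group = (group << 6) | GEDCOM_TABLE.index(chr(ch))
        group <<= 6 * (4 - len(quad))             # missing characters count as zero bits
        out += group.to_bytes(3, "big")[:len(quad) - 1]
    return bytes(out)


def encode_literal(data: bytes) -> str:
    """Literal reading of step 3: translate all four characters, padding included."""
    return "".join(six_bit_chars(data[i:i + 3]) for i in range(0, len(data), 3))


for s in (b"abc", b"abcd", b"abcde"):
    print(encode(s))   # b'MK7X', b'MK7XND\xff\xff', b'MK7XN4L\xff'

# Under the literal reading three different files share one encoding.
print(encode_literal(b"\xff"), encode_literal(b"\xff\xff"), encode_literal(b"\xff\xff\xff"))
```

With the reading from point 1 the round trip works for every length. With the literal reading the files 0xff, 0xff 0xff and 0xff 0xff 0xff all become "zzzz", so no decoder can recover the original size.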

Conclusions

I can find only one way to interpret the GEDCOM standard so that it works (as in point 1 above), but the instruction as a whole is very ambiguous. Some genealogical applications decided to decode without padding. I know of only one that gives correct results: "The Complete Genealogy Builder" (in fact there were zero such apps until a few days ago; The Complete Genealogy Builder had a small bug, which is now corrected).

Images in "jpg" format seem to be insensitive to errors in the last few bytes; maybe that is why developers do not see or do not care about such errors. But this is not the right way: other formats can have checksums or other integrity controls that will make the decoded file unusable.