***** latin-unity

Last-modified: 2002 October 22

Mule bogusly considers the various ISO 8859 extended character sets as
disjoint, when ISO 8859 itself clearly considers them to be subsets of
a larger character set.  For example, all of the Latin character sets
include NO-BREAK SPACE at code point 32 (i.e., 0xA0 in an 8-bit code),
but the Latin-1 and Latin-2 NO-BREAK SPACE characters are considered
to be different by Mule, an obvious absurdity.
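
The NO-BREAK SPACE example can be checked by hand.  The following is a
minimal sketch, not part of the package; it assumes a Mule-enabled
XEmacs in which the charsets are named `latin-iso8859-1' and
`latin-iso8859-2', and it uses `make-char' with a position code.

```elisp
;; Compare the NO-BREAK SPACE of Latin-1 with that of Latin-2.
;; Position code 32 corresponds to octet 0xA0 in the 8-bit encoding.
;; Per the description above, Mule without unification is expected to
;; treat these as two different characters.
(when (featurep 'mule)
  (eq (make-char 'latin-iso8859-1 32)
      (make-char 'latin-iso8859-2 32)))
```

Evaluate the form with C-x C-e in the *scratch* buffer and look at the
result in the echo area.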

This package provides functions that determine the list of coding
systems that can encode all of the characters in the buffer, and that
translate the buffer to a common coding system if possible.


***** Basic usage:

To set up the package, simply put

(latin-unity-install)

in your init file.
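
If the same init file is shared with an XEmacs that lacks Mule or does
not have this package installed, the call can be guarded.  This is
only a sketch; it assumes that the package autoloads
`latin-unity-install', which the usage above implies but does not
state.

```elisp
;; Call the installer only when this XEmacs has Mule support and the
;; function is known.  `fboundp' is also true for autoloaded functions.
(when (and (featurep 'mule)
           (fboundp 'latin-unity-install))
  (latin-unity-install))
```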


***** Availability:

anonymous CVS:
Get the latin-unity module and build as usual.

WWW:
ftp://ftp.xemacs.org/pub/xemacs/packages/latin-unity-VERSION-pkg.tar.gz


***** Features:

  o If a buffer contains only ASCII and ISO 8859 Latin characters, the
    buffer can be "unified", that is, treated so that all characters
    are translated to one charset that includes them all.  If the
    current buffer coding system is not sufficient, the package will
    suggest alternatives.  It prefers ISO 8859 encodings, but also
    suggests UTF-8 (if available; 21.4+ feature, currently requires
    Mule-UCS for 21.4 versions), ISO 2022 7-bit, or X Compound Text
    if no ISO 8859 coding system is comprehensive enough.

    It allows the user to use other coding systems, and the list of
    suggested coding systems is Customizable.

    latin-unity will automatically adjust buffer-file-coding-system to
    the user's choice, on a Customizable basis.

    Optionally checks -*- coding: codesys -*- cookies for consistency.
    This only works for Emacs coding cookies; it doesn't yet handle
    encoding attributes in XML declarations or HTML META tags.

    This is probably also useful out of the box if the buffer contains
    non-Latin characters in addition to a mixture of Latin characters.
    For example, it would reduce a buffer originally encoded in
    ISO-2022-JP (including Latin-1 characters) to ISO 8859/1 if all
    the Japanese were deleted.  (untested)

  o ISO 8859/13 for XEmacs 21.4 and 21.1 (both untested).
    To get 'iso-8859-13 preferred to 'iso-8859-1 in autodetection, use
    (set-coding-category-system 'iso-8-1 'iso-8859-13).  (untested)
    Alternatively set language environment to Latin-7.

    If all you want is ISO 8859/13 support, you can `(require
    'latin-unity-latin7)' and `(require 'latin-latin7-input)', and not
    do `(latin-unity-install)'.

  o ISO 8859/14 for XEmacs 21.4 and 21.1 (both untested).
    To get 'iso-8859-14 preferred to 'iso-8859-1 in autodetection, use
    (set-coding-category-system 'iso-8-1 'iso-8859-14).  (untested)
    Alternatively set language environment to Latin-8.

    If all you want is ISO 8859/14 support, you can `(require
    'latin-unity-latin8)', and not do `(latin-unity-install)'.
    Latin-8 does not yet have an input method.

  o ISO 8859/15 for XEmacs 21.4 (moderately tested) and 21.1 (lightly
    tested), including binding the EuroSign keysym to ISO 8859/15 0xA4
    (as well as the other "new" keysyms needed for ISO 8859/15).  To
    get 'iso-8859-15 preferred to 'iso-8859-1 in autodetection, use
    (set-coding-category-system 'iso-8-1 'iso-8859-15).  (untested)
    Alternatively set language environment to Latin-9.

    If all you want is ISO 8859/15 support, you can `(require
    'latin-unity-latin9)' and `(require 'latin-euro-input)', and not
    do `(latin-unity-install)'.

  o ISO 8859/16 for XEmacs 21.4 and 21.1 (both untested).
    To get 'iso-8859-16 preferred to 'iso-8859-1 in autodetection, use
    (set-coding-category-system 'iso-8-1 'iso-8859-16).  (untested)
    Alternatively set language environment to Latin-10.

    If all you want is ISO 8859/16 support, you can `(require
    'latin-unity-latin10)', and not do `(latin-unity-install)'.
    Latin-10 does not yet have an input method.

  o Hooks into `write-region' to prevent (or at least drastically
    reduce the probability of) introduction of ISO 2022 escape
    sequences for "foreign" character sets.  This hook is not set by
    default in this package yet; try M-x latin-unity-example RET for a
    short introduction and some useful C-x C-e'able exprs.

    This may permit us to turn off support for those sequences
    entirely in our ISO 8859 coding-systems.

  o Interactive functions to _remap_ a region between character sets
    (preserving character identity) and _recode_ a region (preserving
    the code point).  The former is probably not useful if the
    automatic function is working at all, but provided for
    completeness.  The latter is useful if Mule mistakenly reads an
    ISO 8859/2 file as ISO 8859/1; you can change it without rereading
    the file.  Since it's region-oriented, you can also deal with cut
    and paste from dumb applications that export everything as ISO 8859/1.

  o A nearly comprehensive Texinfo manual contains a discussion of
    why these things happen, why they can't be 100% avoided in an 8-bit
    world, and some defensive measures users can take, as well as the
    usual install, configure, and operating instructions.

  o latin-unity itself depends only on mule-base in operation.  Table
    generation depends on Unicode support such as Mule-UCS or XEmacs
    >= 21.5.6, and the package build currently requires Mule-UCS.  The
    input method depends on LEIM and fsf-compat.
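
As an illustration of the per-charset setup described in the list
above, an init-file fragment for ISO 8859/15 alone might look like the
following.  It uses only the forms quoted above; the spelling of the
language environment name in the final comment is an assumption.

```elisp
;; ISO 8859/15 support only, without calling `latin-unity-install'.
(require 'latin-unity-latin9)   ; defines ISO 8859/15 and Latin-9
(require 'latin-euro-input)     ; Quail input method for Latin-9

;; Prefer ISO 8859/15 to ISO 8859/1 in autodetection
;; (marked as untested in the list above).
(set-coding-category-system 'iso-8-1 'iso-8859-15)

;; Alternative to the line above (environment name is an assumption):
;; (set-language-environment "Latin-9")
```

The same pattern should apply to Latin-7, Latin-8, and Latin-10 with
the corresponding module and coding system names, keeping in mind that
Latin-8 and Latin-10 have no input method to load.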

Current misfeatures:

  o Needs to be refactored so that automatic tests of functionality
    currently buried in interactive functions can be written.  See
    comment on `latin-unity-sanity-check'.

  o Doesn't automatically save pure ASCII files in ASCII superset
    encodings like iso-2022-jp.  Workaround:  put an ISO 8859 coding
    system in `latin-unity-preapproved-coding-system-list'.

  o Needs `(require 'latin-euro-input)' to get Quail support.

  o Possible performance hit on large (> 20kB) buffers with many
    (> 20%) non-ASCII characters.  Partially optimized, but see near
    `latin-unity-region-representations-feasible-region' in
    latin-unity.el for possible further optimizations.

  o Package depends on Mule-UCS, LEIM (Quail), and fsf-compat.

  o This README is too long.

  o Must load latin-unity-vars before reading a file with ISO 8859/15,
    because there is no way to autoload a charset.  (Can't be fixed
    without changing XEmacs core.)
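
Two of the items above translate into init-file forms.  The sketch
below assumes that `latin-unity-preapproved-coding-system-list' is
defined in latin-unity-vars and holds a list of coding system symbols;
the choice of iso-8859-1 is only an example.

```elisp
;; Load the shared definitions early, so that the variable below
;; exists and (per the last item above) the ISO 8859/15 charset is
;; defined before any such file is read.
(require 'latin-unity-vars)

;; Workaround for pure ASCII buffers: preapprove an ISO 8859 coding
;; system.  `add-to-list' leaves existing entries in place.
(add-to-list 'latin-unity-preapproved-coding-system-list 'iso-8859-1)
```

Since the list is Customizable, setting it through M-x customize is
an alternative to editing the init file by hand.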

Planned, sooner or later:

  o Fix the misfeatures.

  o Fix JIS Roman (as an alternative to ASCII) support.

  o More UI features (like highlighting unrepresentable chars in buffer).

  o Integration into the development tree (but probably not 21.4; this
    package should be good enough).

  o Hook into MUAs.

  o GNU Emacs support.

  o Add coding-system and charset widgets for Customization.  The :set
    functions should do sanity and cross checks.


Not planned any time soon:

  o Extend to process buffers in some way, which looks very hard.

  o Han-unity.  This is not entirely analogous to Latin unity, and
    needs to be treated very carefully.


***** Implementation:

latin-unity.el is the main library, providing the detection and translation
functionality, including a hook function to hang on `write-region-pre-hook'.
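
Hanging the hook function by hand might look like the sketch below.
Two things here are assumptions rather than statements from this
README: that the hook function is `latin-unity-sanity-check', and that
latin-unity.el provides the feature `latin-unity'.  Check
latin-unity.el before relying on either.

```elisp
;; Assumed feature name; makes sure the function below is loaded.
(require 'latin-unity)

;; Assumed hook function; add it only if it is actually defined.
(when (fboundp 'latin-unity-sanity-check)
  (add-hook 'write-region-pre-hook 'latin-unity-sanity-check))

;; To undo:
;; (remove-hook 'write-region-pre-hook 'latin-unity-sanity-check)
```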

latin-unity-vars.el contains the definition of variables common to
several modules.

latin-unity-latin7.el defines ISO 8859/13 and the Latin-7 environment.

latin-latin7-input.el contains a Quail input method for Latin 7.

latin-unity-latin8.el defines ISO 8859/14 and the Latin-8 environment.

latin-unity-latin9.el defines ISO 8859/15 and the Latin-9 environment.

latin-euro-input.el contains Dave Love's Quail input method for Latin 9.

latin-unity-latin10.el defines ISO 8859/16 and the Latin-10 environment.

(Latin-8 and Latin-10 do not have input methods yet.)

latin-unity-tables.el contains the table of feasible character sets and
equivalent Mule characters from other character sets for the various Mule
representations of each character.  Automatically generated.  Used only when
Unicode support is not present.

latin-unity-utils.el contains utilities for creating the equivalence table,
and dumping it to a file.  Used in preference to the precomputed table when
Unicode support is available.

latin-unity-tests.el contains a (very incomplete) test suite using Martin
Buchholz's test-harness.el (distributed in the core in tests/automated).