Last-modified: 2002 October 22
Mule bogusly considers the various ISO-8859 extended character sets as
disjoint, when ISO 8859 itself clearly considers them to be subsets of
a larger character set. For example, all of the Latin character sets
include NO-BREAK SPACE at code point 32 (i.e., 0xA0 in an 8-bit code),
but Mule considers the Latin-1 and Latin-2 NO-BREAK SPACE characters
to be different, an obvious absurdity.
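As a rough illustration of the problem, the forms below can be evaluated
with C-x C-e. This is only a sketch: it assumes a Mule-enabled XEmacs 21.4,
and the charset names `latin-iso8859-1' and `latin-iso8859-2' are the usual
Mule names, an assumption not taken from this README.

```elisp
;; Sketch only, assuming XEmacs 21.4 with Mule.  The charset names
;; below are assumed, not taken from this README.

;; Build NO-BREAK SPACE from Latin-1 and from Latin-2 (code point 32 in
;; each 96-character set) and compare the two Mule characters.  If Mule
;; treats the charsets as disjoint, as described above, they will not
;; be the same character.
(let ((nbsp-1 (make-char 'latin-iso8859-1 32))
      (nbsp-2 (make-char 'latin-iso8859-2 32)))
  (eq nbsp-1 nbsp-2))

;; List the Mule charsets that occur in the current buffer.  More than
;; one Latin charset here is the situation latin-unity tries to resolve.
(charsets-in-region (point-min) (point-max))
```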
This package provides functions that determine the list of coding
systems that can encode all of the characters in the buffer, and that
translate the buffer to a common coding system if possible.
To set up the package, simply put `(latin-unity-install)' in your init file.
Get the latin-unity module and build as usual.

  ftp://ftp.xemacs.org/pub/xemacs/packages/latin-unity-VERSION-pkg.tar.gz
o If a buffer contains only ASCII and ISO-8859 Latin characters, the
  buffer can be "unified", that is, treated so that all characters
  are translated to one charset that includes them all. If the
  current buffer coding system is not sufficient, the package will
  suggest alternatives. It prefers ISO-8859 encodings, but also
  suggests UTF-8 (if available; a 21.4+ feature that currently requires
  Mule-UCS for 21.4 versions), ISO 2022 7-bit, or X Compound Text
  if no ISO 8859 coding system is comprehensive enough.

  It allows the user to use other coding systems, and the list of
  suggested coding systems is Customizable.

  latin-unity will automatically adjust buffer-file-coding-system to
  the user's choice, on a Customizable basis.

  Optionally checks -*- coding: codesys -*- cookies for consistency.
  This only works for Emacs coding cookies; it doesn't yet handle
  encoding attributes in XML declarations or HTML META tags.

  This is probably also useful out of the box if the buffer contains
  non-Latin characters in addition to a mixture of Latin characters.
  For example, it would reduce a buffer originally encoded in
  ISO-2022-JP (including Latin-1 characters) to ISO 8859/1 if all
  the Japanese were deleted. (untested)
o ISO 8859/13 for XEmacs 21.4 and 21.1 (both untested).
  To get 'iso-8859-13 preferred to 'iso-8859-1 in autodetection, use
  (set-coding-category-system 'iso-8-1 'iso-8859-13). (untested)
  Alternatively, set the language environment to Latin-7.

  If all you want is ISO 8859/13 support, you can
  `(require 'latin-unity-latin7)' and `(require 'latin-latin7-input)',
  and not do `(latin-unity-install)'.

o ISO 8859/14 for XEmacs 21.4 and 21.1 (both untested).
  To get 'iso-8859-14 preferred to 'iso-8859-1 in autodetection, use
  (set-coding-category-system 'iso-8-1 'iso-8859-14). (untested)
  Alternatively, set the language environment to Latin-8.

  If all you want is ISO 8859/14 support, you can
  `(require 'latin-unity-latin8)', and not do `(latin-unity-install)'.
  Latin-8 does not yet have an input method.

o ISO 8859/15 for XEmacs 21.4 (moderately tested) and 21.1 (lightly
  tested), including binding the EuroSign keysym to ISO 8859/15 0xA4
  (as well as the other "new" keysyms needed for ISO 8859/15). To
  get 'iso-8859-15 preferred to 'iso-8859-1 in autodetection, use
  (set-coding-category-system 'iso-8-1 'iso-8859-15). (untested)
  Alternatively, set the language environment to Latin-9.

  If all you want is ISO 8859/15 support, you can
  `(require 'latin-unity-latin9)' and `(require 'latin-euro-input)',
  and not do `(latin-unity-install)'.

o ISO 8859/16 for XEmacs 21.4 and 21.1 (both untested).
  To get 'iso-8859-16 preferred to 'iso-8859-1 in autodetection, use
  (set-coding-category-system 'iso-8-1 'iso-8859-16). (untested)
  Alternatively, set the language environment to Latin-10.

  If all you want is ISO 8859/16 support, you can
  `(require 'latin-unity-latin10)', and not do `(latin-unity-install)'.
  Latin-10 does not yet have an input method.
o Hooks into `write-region' to prevent (or at least drastically
  reduce the probability of) introduction of ISO 2022 escape
  sequences for "foreign" character sets. This hook is not set by
  default in this package yet; try M-x latin-unity-example RET for a
  short introduction and some useful C-x C-e'able exprs.

  This may permit us to turn off support for those sequences
  entirely in our ISO 8859 coding-systems.

o Interactive functions to _remap_ a region between character sets
  (preserving character identity) and _recode_ a region (preserving
  the code point). The former is probably not useful if the
  automatic function is working at all, but is provided for
  completeness. The latter is useful if Mule mistakenly reads an
  ISO 8859/2 file as ISO 8859/1; you can change it without rereading
  the file. Since it's region-oriented, you can also deal with cut
  and paste from dumb applications that export everything as ISO 8859/1.

o A nearly comprehensive Texinfo manual contains a discussion of
  why these things happen, why they can't be 100% avoided in an 8-bit
  world, and some defensive measures users can take, as well as the
  usual install, configure, and operating instructions.

o latin-unity itself depends only on mule-base in operation. Table
  generation depends on Unicode support such as Mule-UCS or XEmacs
  >= 21.5.6, and the package build currently requires Mule-UCS. The
  input method depends on LEIM and fsf-compat.
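The ISO 8859/15 item above can be put together as an init file fragment.
This is a minimal sketch, not a tested recipe: it assumes the package is
installed and on the load-path, it assumes the language environment is
registered under the string "Latin-9", and the README itself marks the
autodetection line as untested.

```elisp
;; ISO 8859/15 support only, without calling (latin-unity-install).
(require 'latin-unity-latin9)   ; defines ISO 8859/15 and the Latin-9 environment
(require 'latin-euro-input)     ; Quail input method for Latin 9

;; Prefer iso-8859-15 over iso-8859-1 in autodetection.
;; (Marked "untested" in the feature list above.)
(set-coding-category-system 'iso-8-1 'iso-8859-15)

;; Alternative to the line above.  The exact environment name string
;; is an assumption based on the text ("Latin-9").
;; (set-language-environment "Latin-9")
```

The same pattern should apply to the other charsets by substituting the
feature and coding system names listed in the corresponding items.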
o Needs to be refactored so that automatic tests of functionality
  currently buried in interactive functions can be written. See the
  comment on `latin-unity-sanity-check'.

o Doesn't automatically save pure ASCII files in ASCII superset
  encodings like iso-2022-jp. Workaround: put an ISO 8859 coding
  system in `latin-unity-preapproved-coding-system-list'.

o Need `(require 'latin-euro-input)' to get Quail support.

o Possible performance hit on large (> 20kB) buffers with many
  (> 20%) non-ASCII characters. Partially optimized, but see near
  `latin-unity-region-representations-feasible-region' in
  latin-unity.el for possible further optimizations.

o Package depends on Mule-UCS, LEIM (Quail), and fsf-compat.

o This README is too long.

o Must load latin-unity-vars before reading a file with ISO 8859/15,
  because there is no way to autoload a charset. (Can't be fixed
  without changing XEmacs core.)
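Several of the workarounds in the list above amount to a few lines in the
init file. The following is a sketch under assumptions: it assumes that
loading the `latin-unity' feature defines
`latin-unity-preapproved-coding-system-list', and `iso-8859-1' is only an
example of "an ISO 8859 coding system". Since the variable is
Customizable, setting it through Customize may be preferable.

```elisp
;; Load the charset definitions early, so that files containing
;; ISO 8859/15 can be read (a charset cannot be autoloaded).
(require 'latin-unity-vars)

;; Assumption: the main library defines the preapproved list.
(require 'latin-unity)

;; Workaround for pure ASCII files in ASCII superset encodings:
;; preapprove an ISO 8859 coding system (iso-8859-1 is an example).
(add-to-list 'latin-unity-preapproved-coding-system-list 'iso-8859-1)

;; Quail support has to be requested explicitly.
(require 'latin-euro-input)
```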
Planned, sooner or later:

o Fix the misfeatures.

o Fix JIS Roman (as an alternative to ASCII) support.

o More UI features (like highlighting unrepresentable chars in buffer).

o Integration into the development tree (but probably not 21.4; this
  package should be good enough).

o Add coding-system and charset widgets for Customization. The :set
  functions should do sanity and cross checks.

Not planned any time soon:

o Extend to process buffers in some way, which looks very hard.

o Han-unity. This is not entirely analogous to Latin unity, and
  needs to be treated very carefully.
***** Implementation:

latin-unity.el is the main library, providing the detection and translation
functionality, including a hook function to hang on `write-region-pre-hook'.
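Hanging the hook function by hand might look like the sketch below. It
assumes that `latin-unity-sanity-check', mentioned elsewhere in this
README, is the hook function in question; the text above does not say so
explicitly. The normal way to set things up is `(latin-unity-install)'.

```elisp
;; Sketch only.  Assumption: latin-unity-sanity-check is the function
;; intended for write-region-pre-hook.
(require 'latin-unity)
(add-hook 'write-region-pre-hook 'latin-unity-sanity-check)

;; To take it off the hook again, evaluate:
;; (remove-hook 'write-region-pre-hook 'latin-unity-sanity-check)
```

M-x latin-unity-example RET provides a short introduction with further
expressions to evaluate.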
latin-unity-vars.el contains the definition of variables common to

latin-unity-latin7.el defines ISO 8859/13 and the Latin-7 environment.

latin-latin7-input.el contains a Quail input method for Latin 7.

latin-unity-latin8.el defines ISO 8859/14 and the Latin-8 environment.

latin-unity-latin9.el defines ISO 8859/15 and the Latin-9 environment.

latin-euro-input.el contains Dave Love's Quail input method for Latin 9.

latin-unity-latin10.el defines ISO 8859/16 and the Latin-10 environment.

(Latin-8 and Latin-10 do not have input methods yet.)

latin-unity-tables.el contains the table of feasible character sets and
equivalent Mule characters from other character sets for the various Mule
representations of each character. Automatically generated. Used only when
Unicode support is not present.

latin-unity-utils.el contains utilities for creating the equivalence table,
and dumping it to a file. Used in preference to the precomputed table when
Unicode support is available.

latin-unity-tests.el contains a (very incomplete) test suite using Martin
Buchholz's test-harness.el (distributed in the core in tests/automated).