1 \input texinfo @c -*-texinfo-*-
3 @c Manual for the XEmacs latin-unity package.
5 @c ** #### something uses these headers, find out how and explain here
7 @setfilename latin-unity.info
8 @settitle Equivalencing Coded Latin Characters
12 @c Version values, for easy modification
14 @set UPDATED Monday, February 7, 2005
@c ** Many people seem to prefer this footnote style
20 * latin-unity:: Remap Latin characters from a single charset.
23 @c ** It is often convenient to use a macro for names which have unusual
24 @c ** spelling or formatting conventions.
25 @c Macro to make formatting of the package name consistent.
30 @c Copying permissions, et al
31 @c Note that this whole section is repeated thrice, once for the Info
32 @c version, once for TeX, and once for HTML.
33 @c Note that if you combine this document with others' work, you may
34 @c need to change the permissions statement. Make sure you do so in all
36 @c #### Probably not the right place for HTML.
38 This file is part of XEmacs. It documents the @pkgname{} package,
39 which ensures that wherever possible representations of all Latin
40 characters are drawn from the same 8-bit character set.
42 Copyright @copyright{} 2002, 2003 Free Software Foundation, Inc.
44 Permission is granted to make and distribute verbatim or modified copies
45 of this manual provided that the copyright notice above is preserved,
46 and at least one of the numbered paragraphs below up to the phrase "End
47 of permissions notice." is included, under the terms of any of the
48 licenses enumerated below.
50 1. The GNU General Public License, version 2 or any later version
51 published by the Free Software Foundation, Inc. A copy of the GNU
52 General Public License was included with your copy of XEmacs.
53 Alternatively, you may request a copy from the Free Software Foundation,
54 Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
56 2. The license used for XEmacs documentation (see the file
57 man/xemacs/xemacs.texi distributed with XEmacs).
59 3. The GNU Free Documentation License (available from the Free Software
60 Foundation at the address above).
62 End of permissions notice.
64 Because the XEmacs documentation license and the GNU Free Documentation
65 License are not free software licenses, and are mutually incompatible
66 with each other and with the GNU General Public License, it is desirable
67 that the permissions given here be preserved if possible. This normally
68 requires that you write fresh text for any additions or modifications to
69 this file, rather than copying code or comments from other sources. The
70 exceptions are if you are sole copyright holder for the other source, or
if you receive permission from the holders of the copyrights for the
other source.

74 That is, if you merge code or comments from a file licensed to you only
75 under the GNU General Public License you @emph{may not} distribute
76 under the documentation licenses or with a dual license; such a
77 combination may be distributed under the GNU General Public License
78 @emph{only}. Similarly, if you copy text from a file licensed to you
79 only under the XEmacs documentation license or the GNU Free
80 Documentation License you may distribute the result @emph{only} under
81 the appropriate license.
85 This file is part of XEmacs. It documents the @pkgname{} package,
86 which ensures that wherever possible representations of all Latin
87 characters are drawn from the same 8-bit character set.
89 Copyright @copyright{} 2002, 2003 Free Software Foundation, Inc.
91 Permission is granted to make and distribute verbatim or modified copies
92 of this manual provided that the copyright notice above is preserved,
93 and at least one of the numbered paragraphs below up to the phrase "End
94 of permissions notice." is included, under the terms of any of the
95 licenses enumerated below.
98 Permission is granted to process this file through TeX and print the
99 results, provided the printed document carries a copying permission
100 notice identical to this one except for the removal of this paragraph
101 (this paragraph not being relevant to the printed manual).
104 1. The GNU General Public License, version 2 or any later version
105 published by the Free Software Foundation, Inc. A copy of the GNU
106 General Public License was included with your copy of XEmacs.
107 Alternatively, you may request a copy from the Free Software Foundation,
108 Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
110 2. The license used for XEmacs documentation (see the file
111 man/xemacs/xemacs.texi distributed with XEmacs).
113 3. The GNU Free Documentation License (available from the Free Software
114 Foundation at the address above).
116 End of permissions notice.
118 Because the XEmacs documentation license and the GNU Free Documentation
119 License are not free software licenses, and are mutually incompatible
120 with each other and with the GNU General Public License, it is desirable
121 that the permissions given here be preserved if possible. This normally
122 requires that you write fresh text for any additions or modifications to
123 this file, rather than copying code or comments from other sources. The
124 exceptions are if you are sole copyright holder for the other source, or
if you receive permission from the holders of the copyrights for the
other source.

128 That is, if you merge code or comments from a file licensed to you only
129 under the GNU General Public License you @emph{may not} distribute
130 under the documentation licenses or with a dual license; such a
131 combination may be distributed under the GNU General Public License
132 @emph{only}. Similarly, if you copy text from a file licensed to you
133 only under the XEmacs documentation license or the GNU Free
134 Documentation License you may distribute the result @emph{only} under
135 the appropriate license.
141 @title Latin Unity for Emacsen
142 @subtitle Last updated @value{UPDATED}
144 @author Stephen J. Turnbull
147 This file is part of XEmacs. It documents the @pkgname{} package,
148 which ensures that wherever possible representations of all Latin
149 characters are drawn from the same 8-bit character set.
151 Copyright @copyright{} 2002, 2003 Free Software Foundation, Inc.
153 @vskip 0pt plus 1filll
154 Permission is granted to make and distribute verbatim or modified copies
155 of this manual provided that the copyright notice above is preserved,
156 and at least one of the numbered paragraphs below up to the phrase "End
157 of permissions notice." is included, under the terms of any of the
158 licenses enumerated below.
161 Permission is granted to process this file through TeX and print the
162 results, provided the printed document carries a copying permission
163 notice identical to this one except for the removal of this paragraph
164 (this paragraph not being relevant to the printed manual).
167 1. The GNU General Public License, version 2 or any later version
168 published by the Free Software Foundation, Inc. A copy of the GNU
169 General Public License was included with your copy of XEmacs.
170 Alternatively, you may request a copy from the Free Software Foundation,
171 Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
173 2. The license used for XEmacs documentation (see the file
174 man/xemacs/xemacs.texi distributed with XEmacs).
176 3. The GNU Free Documentation License (available from the Free Software
177 Foundation at the address above).
179 End of permissions notice.
181 Because the XEmacs documentation license and the GNU Free Documentation
182 License are not free software licenses, and are mutually incompatible
183 with each other and with the GNU General Public License, it is desirable
184 that the permissions given here be preserved if possible. This normally
185 requires that you write fresh text for any additions or modifications to
186 this file, rather than copying code or comments from other sources. The
187 exceptions are if you are sole copyright holder for the other source, or
if you receive permission from the holders of the copyrights for the
other source.

191 That is, if you merge code or comments from a file licensed to you only
192 under the GNU General Public License you @emph{may not} distribute
193 under the documentation licenses or with a dual license; such a
194 combination may be distributed under the GNU General Public License
195 @emph{only}. Similarly, if you copy text from a file licensed to you
196 only under the XEmacs documentation license or the GNU Free
197 Documentation License you may distribute the result @emph{only} under
198 the appropriate license.
206 @node Top, Copying, (dir), (dir)
207 @top Latin Unity for Emacsen
210 This project kindly hosted by
212 <a href="http://sunsite.dk">
213 <img src="http://sunsite.dk/images/hostedby.png"
214 border="0" alt="sunSITE.dk Logo" /></a>
215 <a href="http://www.tux.org/">
216 <img src="http://www.tux.org/images/minibanner.gif"
217 width="88" height="31" border="0" alt="Tux.Org Logo" /></a>
218 <a href="http://sourceforge.net">
219 <img src="http://sourceforge.net/sflogo.php?group_id=34545"
220 width="88" height="31" border="0" alt="SourceForge Logo" /></a>
223 Mule suffers from a design defect that causes it to consider the ISO
224 Latin character sets to be disjoint. This results in oddities such as
225 files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO
226 2022 control sequences to switch between them, as well as more plausible
227 but often unnecessary combinations like ISO 8859/1 with ISO 8859/2.
228 This can be very annoying when sending messages or even in simple
229 editing on a single host. @pkgname{} works around the problem by
230 converting as many characters as possible to use a single Latin coded
231 character set before saving the buffer.
This is version @value{VERSION} of the @pkgname{} manual, last updated on
@value{UPDATED}.

236 @c ** Currently we provide documentation for the XEmacs core, but not for
237 @c ** individual packages, on the web. Check for latest policy either at
238 @c ** www.xemacs.org or on xemacs-beta@xemacs.org.
240 @c You can find the latest version of this document on the web at
241 @c @uref{http://www.xemacs.org/}.
244 @c ** Mention translations and other online versions here.
@c ** Adjust to taste. You may wish to mention newsgroups such as
247 @c ** comp.emacs.xemacs, comp.emacs, and gnu.emacs.help.
248 Discussion and enhancement of @pkgname{} is conducted on the XEmacs
249 mailing lists, especially the XEmacs Beta list. See
250 @uref{http://www.xemacs.org/Lists/} for more information about the
251 XEmacs mailing lists.
257 * Copying:: @pkgname{} copying conditions.
258 * Overview:: @pkgname{} history and general information.
261 * Usage:: An overview of the operation of @pkgname{}.
262 * Installation:: Installing @pkgname{} with your (X)Emacs.
263 * Configuration:: Configuring @pkgname{} for use.
264 * Bug Reports:: Reporting bugs and problems.
265 * Frequently Asked Questions:: Questions and answers from the mailing list.
266 * Theory of Operation:: How @pkgname{} works.
267 * What latin-unity Cannot Do for You:: Inherent problems of 8-bit charsets.
270 * Interfaces:: Calling @pkgname{} from Lisp code.
271 * Charsets and Coding Systems:: Reference lists with annotations.
274 * Internals:: Utilities and implementation details.
276 @c ** For small packages, with no or few subnodes, a detailmenu is not
279 @c --- The Detailed Node Listing ---
284 @node Copying, Overview, Top, Top
285 @chapter @pkgname{} Copying Conditions
287 @c ** CHECK THE COPYRIGHT DATE(S) AND HOLDER(S)!
289 Copyright (C) 2002 Free Software Foundation, Inc.
291 @pkgname{} is free software; you can redistribute it
292 and/or modify it under the terms of the GNU General Public License as
293 published by the Free Software Foundation; either version 2, or (at your
294 option) any later version.
296 @pkgname{} is distributed in the hope that it will
297 be useful, but WITHOUT ANY WARRANTY; without even the implied warranty
298 of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
299 General Public License for more details.
301 You should have received a copy of the GNU General Public License along
302 with XEmacs; see the file COPYING. If not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.

307 @node Overview, Usage, Copying, Top
308 @chapter An Overview of @pkgname{}
310 Mule suffers from a design defect that causes it to consider the ISO
311 Latin character sets to be disjoint. This manifests itself when a user
312 enters characters using input methods associated with different coded
313 character sets into a single buffer.
315 A very important example involves email. Many sites, especially in the
316 U.S., default to use of the ISO 8859/1 coded character set (also called
317 ``Latin 1,'' though these are somewhat different concepts). However,
318 ISO 8859/1 provides a generic CURRENCY SIGN character. Now that the
319 Euro has become the official currency of most countries in Europe, this
320 is unsatisfactory (and in practice, useless). So Europeans generally
321 use ISO 8859/15, which is nearly identical to ISO 8859/1 for most
322 languages, except that it substitutes EURO SIGN for CURRENCY SIGN.
324 Suppose a European user yanks text from a post encoded in ISO 8859/1
325 into a message composition buffer, and enters some text including the
326 Euro sign. Then Mule will consider the buffer to contain both ISO
327 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
328 programmed) send the message as a multipart mixed MIME body!
330 This is clearly stupid. What is not as obvious is that, just as any
331 European can include American English in their text because ASCII is a
332 subset of ISO 8859/15, most European languages which use Latin
333 characters (eg, German and Polish) can typically be mixed while using
334 only one Latin coded character set (in this case, ISO 8859/2). However,
335 this often depends on exactly what text is to be encoded.
337 @pkgname{} works around the problem by converting as many characters as
possible to use a single Latin coded character set before saving the
buffer.

341 @node Usage, Installation, Overview, Top
342 @chapter Operation of @pkgname{}
344 Normally, @pkgname{} works in the background by installing
345 @code{latin-unity-sanity-check} on @code{write-region-pre-hook}. The
346 user activates this functionality by invoking
347 @code{latin-unity-install}, either interactively or in her init file.
348 @xref{Init File, , , xemacs}. @pkgname{} can be deactivated by
349 invoking @code{latin-unity-uninstall}.
351 @pkgname{} also provides a few functions for remapping or recoding the
352 buffer by hand. To @dfn{remap} a character means to change the buffer
353 representation of the character by using another coded character set.
354 Remapping never changes the identity of the character, but may involve
355 altering the code point of the character. To @dfn{recode} a character
356 means to simply change the coded character set. Recoding never alters
357 the code point of the character, but may change the identity of the
358 character. @xref{Theory of Operation}.
360 There are a few variables which determine which coding systems are
361 always acceptable to @pkgname{}: @code{latin-unity-ucs-list},
362 @code{latin-unity-preferred-coding-system-list}, and
363 @code{latin-unity-preapproved-coding-system-list}. The latter two default
364 to @code{()}, and should probably be avoided because they short-circuit
365 the sanity check. If you find you need to use them, consider reporting
366 it as a bug or request for enhancement. Because they seem unsafe, the
367 recommended interface is likely to change.
370 * Basic Functionality:: User interface and customization.
371 * Interactive Usage:: Treating text by hand.
372 Also documents the hook function(s).
376 @node Basic Functionality, Interactive Usage, , Usage
377 @section Basic Functionality
379 These functions and user options initialize and configure @pkgname{}.
380 In normal use, only a call to @code{latin-unity-install} is needed.
383 @defun latin-unity-install
384 Set up hooks and initialize variables for latin-unity.
386 There are no arguments.
388 This function is idempotent. It will reinitialize any hooks or variables
389 that are not in initial state.
Note: a quirk in XEmacs means that the @file{cl-macs} library cannot be
required; it must be loaded explicitly. This means that if you invoke
@code{latin-unity-install} in your init file, XEmacs will print
396 Loading cl-macs...done
398 on your console, as these messages have not yet been redirected to the
399 @samp{ *Message-Log*} buffer.
403 @defun latin-unity-uninstall
404 There are no arguments.
406 Clean up hooks and void variables used by latin-unity.
410 @defopt latin-unity-ucs-list
411 List of coding systems considered to be universal.
413 The default value is @code{'(utf-8 iso-2022-7 ctext escape-quoted)}.
415 Order matters; coding systems earlier in the list will be preferred when
416 recommending a coding system. These coding systems will not be used
417 without querying the user (unless they are also present in
418 @code{latin-unity-preapproved-coding-system-list}), and follow the
@code{latin-unity-preferred-coding-system-list} in the list of suggested
coding systems.

422 If none of the preferred coding systems are feasible, the first in
423 this list will be the default.
425 Notes on certain coding systems: @code{escape-quoted} is a special
426 coding system used for autosaves and compiled Lisp in Mule. You should
427 not delete this, although it is rare that a user would want to use it
directly. @pkgname{} does not try to be ``smart'' about other general
429 ISO 2022 coding systems, such as ISO-2022-JP. (They are not recognized
430 as equivalent to @code{iso-2022-7}.) If your preferred coding system is
431 one of these, you may consider adding it to @code{latin-unity-ucs-list}.
432 However, this will typically have the side effect that (eg) ISO 8859/1
433 files will be saved in 7-bit form with ISO 2022 escape sequences.
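
As a minimal sketch of the customization described above, the following
init-file fragment appends @code{iso-2022-jp} to the default list. It
assumes that @code{iso-2022-jp} is the coding system you prefer and that
it is defined in your XEmacs. Note that @code{escape-quoted} is kept.

@example
;; Sketch only: extend the default list of universal coding systems.
;; Assumes iso-2022-jp is defined in your XEmacs.
(setq latin-unity-ucs-list
      '(utf-8 iso-2022-7 ctext escape-quoted iso-2022-jp))
@end example

Because order matters, placing the new entry last leaves the coding
system recommended by default unchanged.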
Coding systems which are not Latin and not in
@code{latin-unity-ucs-list} are handled by short-circuiting checks of
the coding system against the next two variables.
441 @defopt latin-unity-preapproved-coding-system-list
442 List of coding systems used without querying the user if feasible.
444 The default value is @samp{(buffer-default preferred)}.
446 The first feasible coding system in this list is used. The special values
447 @samp{preferred} and @samp{buffer-default} may be present:
@table @code
@item buffer-default
Use the coding system used by @samp{write-region}, if feasible.

@item preferred
Use the coding system specified by @samp{prefer-coding-system} if feasible.
@end table

``Feasible'' means that all characters in the buffer can be represented by
458 the coding system. Coding systems in @samp{latin-unity-ucs-list} are
459 always considered feasible. Other feasible coding systems are computed
460 by @samp{latin-unity-representations-feasible-region}.
462 Note that the first universal coding system in this list shadows all
463 other coding systems. In particular, if your preferred coding system is
464 a universal coding system, and @code{preferred} is a member of this
465 list, @pkgname{} will blithely convert all your files to that coding
466 system. This is considered a feature, but it may surprise most users.
467 Users who don't like this behavior should put @code{preferred} in
468 @code{latin-unity-preferred-coding-system-list}.
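
The advice above might be put into practice as follows. This is a
sketch rather than a recommended configuration: it drops
@code{preferred} from the preapproved list and places it at the head of
the preferred list, whose remaining entries are simply the documented
defaults.

@example
;; Sketch only: never convert silently to the preferred coding system;
;; suggest it to the user first instead.
(setq latin-unity-preapproved-coding-system-list '(buffer-default))
(setq latin-unity-preferred-coding-system-list
      '(preferred iso-8859-1 iso-8859-15 iso-8859-2
        iso-8859-3 iso-8859-4 iso-8859-9))
@end example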
472 @defopt latin-unity-preferred-coding-system-list
List of coding systems suggested to the user if feasible.
475 The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3
476 iso-8859-4 iso-8859-9)}.
If none of the coding systems in
@samp{latin-unity-preapproved-coding-system-list} are feasible, this list
will be recommended to the user, followed by the
@samp{latin-unity-ucs-list} (so those coding systems should not be in
this list). The first coding system in this list is the default. The
special values @samp{preferred} and @samp{buffer-default} may be
present:

@table @code
@item buffer-default
Use the coding system used by @samp{write-region}, if feasible.

@item preferred
Use the coding system specified by @samp{prefer-coding-system} if feasible.
@end table

``Feasible'' means that all characters in the buffer can be represented by
495 the coding system. Coding systems in @samp{latin-unity-ucs-list} are
496 always considered feasible. Other feasible coding systems are computed
497 by @samp{latin-unity-representations-feasible-region}.
501 @defvar latin-unity-iso-8859-1-aliases
502 List of coding systems to be treated as aliases of ISO 8859/1.
The default value is @code{'(iso-8859-1)}.

This is not a user variable; to customize input of coding systems or
charsets, use @samp{latin-unity-coding-system-alias-alist} or
@samp{latin-unity-charset-alias-alist}.
512 @node Interactive Usage, , Basic Functionality, Usage
513 @section Interactive Usage
515 First, the hook function @code{latin-unity-sanity-check} is documented.
516 (It is placed here because it is not an interactive function, and there
517 is not yet a programmer's section of the manual.)
519 These functions provide access to internal functionality (such as the
520 remapping function) and to extra functionality (the recoding functions
521 and the test function).
524 @defun latin-unity-sanity-check begin end filename append visit lockname &optional coding-system
526 Check if @var{coding-system} can represent all characters between
527 @var{begin} and @var{end}.
529 For compatibility with old broken versions of @code{write-region},
530 @var{coding-system} defaults to @code{buffer-file-coding-system}.
@var{filename}, @var{append}, @var{visit}, and @var{lockname} are
ignored.

Return nil if @code{buffer-file-coding-system} is not (ISO-2022-compatible)
535 Latin. If @code{buffer-file-coding-system} is safe for the charsets
536 actually present in the buffer, return it. Otherwise, ask the user to
537 choose a coding system, and return that.
539 This function does @emph{not} do the safe thing when
540 @code{buffer-file-coding-system} is nil (aka no-conversion). It
considers that ``non-Latin,'' and passes it on to the Mule detection
mechanism.

544 This function is intended for use as a @code{write-region-pre-hook}. It
545 does nothing except return @var{coding-system} if @code{write-region}
546 handlers are inhibited.
550 @defun latin-unity-buffer-representations-feasible
552 There are no arguments.
Apply @code{latin-unity-region-representations-feasible} to the current buffer.
558 @defun latin-unity-region-representations-feasible begin end &optional buf
560 Return character sets that can represent the text from @var{begin} to @var{end} in @var{buf}.
562 @var{buf} defaults to the current buffer. Called interactively, will be
563 applied to the region. Function assumes @var{begin} <= @var{end}.
565 The return value is a cons. The car is the list of character sets
566 that can individually represent all of the non-ASCII portion of the
567 buffer, and the cdr is the list of character sets that can
568 individually represent all of the ASCII portion.
570 The following is taken from a comment in the source. Please refer to
571 the source to be sure of an accurate description.
573 The basic algorithm is to map over the region, compute the set of
574 charsets that can represent each character (the ``feasible charset''),
575 and take the intersection of those sets.
577 The current implementation takes advantage of the fact that ASCII
578 characters are common and cannot change asciisets. Then using
579 skip-chars-forward makes motion over ASCII subregions very fast.
581 This same strategy could be applied generally by precomputing classes
582 of characters equivalent according to their effect on latinsets, and
adding a whole class to the skip-chars-forward string once a member is
found.

586 Probably efficiency is a function of the number of characters matched,
587 or maybe the length of the match string? With @code{skip-category-forward}
588 over a precomputed category table it should be really fast. In practice
589 for Latin character sets there are only 29 classes.
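
The basic algorithm can be illustrated with a toy sketch. It is not the
package's implementation: the helper name @code{my-intersect-charsets}
is hypothetical, the feasible charsets of each character are given
directly as lists of symbols, and the separate treatment of ASCII is
ignored.

@example
;; Hypothetical helper: intersect the per-character feasible charsets.
(defun my-intersect-charsets (sets)
  "Return the symbols that occur in every list in SETS."
  (let ((result (car sets)))
    (setq sets (cdr sets))
    (while sets
      (let ((keep nil)
            (rest result))
        ;; Keep only the charsets also feasible for the next character.
        (while rest
          (if (memq (car rest) (car sets))
              (setq keep (cons (car rest) keep)))
          (setq rest (cdr rest)))
        (setq result (nreverse keep)))
      (setq sets (cdr sets)))
    result))

;; Three characters, each with its own set of feasible charsets.
(my-intersect-charsets
 '((latin-iso8859-1 latin-iso8859-15 latin-iso8859-2)
   (latin-iso8859-1 latin-iso8859-15)
   (latin-iso8859-15)))
     @result{} (latin-iso8859-15)
@end example

An empty result means that no single charset can represent all of the
characters examined.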
593 @defun latin-unity-remap-region begin end character-set &optional coding-system
595 Remap characters between @var{begin} and @var{end} to equivalents in
596 @var{character-set}. Optional argument @var{coding-system} may be a
597 coding system name (a symbol) or nil. Characters with no equivalent are
600 When called interactively, @var{begin} and @var{end} are set to the
601 beginning and end, respectively, of the active region, and the function
602 prompts for @var{character-set}. The function does completion, knows
603 how to guess a character set name from a coding system name, and also
604 provides some common aliases. See @code{latin-unity-guess-charset}.
605 There is no way to specify @var{coding-system}, as it has no useful
606 function interactively.
608 Return @var{coding-system} if @var{coding-system} can encode all
609 characters in the region, t if @var{coding-system} is nil and the coding
610 system with G0 = 'ascii and G1 = @var{character-set} can encode all
611 characters, and otherwise nil. Note that a non-null return does
612 @emph{not} mean it is safe to write the file, only the specified region.
613 (This behavior is useful for multipart MIME encoding and the like.)
615 Note: by default this function is quite fascist about universal coding
616 systems. It only admits @samp{utf-8}, @samp{iso-2022-7}, and
@samp{ctext}. Customize @code{latin-unity-approved-ucs-list} to change
this.

620 This function remaps characters that are artificially distinguished by Mule
621 internal code. It may change the code point as well as the character set.
622 To recode characters that were decoded in the wrong coding system, use
623 @code{latin-unity-recode-region}.
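
For instance, the return value can be tested from Lisp. The following
is a minimal sketch, assuming that the @code{latin-iso8859-15} charset
exists in your XEmacs. It remaps the whole buffer and reports whether
ASCII plus ISO 8859/15 suffices to encode it.

@example
;; Sketch only: CODING-SYSTEM is omitted, so a non-nil return means that
;; ASCII in G0 plus latin-iso8859-15 in G1 can encode every character.
(if (latin-unity-remap-region (point-min) (point-max) 'latin-iso8859-15)
    (message "The text can be encoded with ASCII plus ISO 8859/15")
  (message "Some characters have no ISO 8859/15 equivalent"))
@end example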
627 @defun latin-unity-recode-region begin end wrong-cs right-cs
Recode characters between @var{begin} and @var{end} from @var{wrong-cs}
to @var{right-cs}.

632 @var{wrong-cs} and @var{right-cs} are character sets. Characters retain
633 the same code point but the character set is changed. Only characters
634 from @var{wrong-cs} are changed to @var{right-cs}. The identity of the
635 character may change. Note that this could be dangerous, if characters
636 whose identities you do not want changed are included in the region.
637 This function cannot guess which characters you want changed, and which
638 should be left alone.
640 When called interactively, @var{begin} and @var{end} are set to the
641 beginning and end, respectively, of the active region, and the function
642 prompts for @var{wrong-cs} and @var{right-cs}. The function does
643 completion, knows how to guess a character set name from a coding system
644 name, and also provides some common aliases. See
645 @code{latin-unity-guess-charset}.
647 Another way to accomplish this, but using coding systems rather than
648 character sets to specify the desired recoding, is
649 @samp{latin-unity-recode-coding-region}. That function may be faster
but is somewhat more dangerous, because it may recode more than one
character set.

653 To change from one Mule representation to another without changing identity
654 of any characters, use @samp{latin-unity-remap-region}.
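
As an illustration, suppose text encoded in ISO 8859/15 was decoded as
ISO 8859/1, so that each EURO SIGN appears as CURRENCY SIGN. The
following sketch, which assumes both charsets exist in your XEmacs,
reinterprets every ISO 8859/1 character in the buffer as the ISO 8859/15
character with the same code point.

@example
;; Sketch only: code points are kept, character identities may change.
(latin-unity-recode-region (point-min) (point-max)
                           'latin-iso8859-1 'latin-iso8859-15)
@end example

Narrow the range passed as @var{begin} and @var{end} if the buffer also
contains genuine ISO 8859/1 text, since that text would be recoded too.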
658 @defun latin-unity-recode-coding-region begin end wrong-cs right-cs
Recode text between @var{begin} and @var{end} from @var{wrong-cs} to
@var{right-cs}.

663 @var{wrong-cs} and @var{right-cs} are coding systems. Characters retain
664 the same code point but the character set is changed. The identity of
665 characters may change. This is an inherently dangerous function;
666 multilingual text may be recoded in unexpected ways. #### It's also
667 dangerous because the coding systems are not sanity-checked in the
668 current implementation.
670 When called interactively, @var{begin} and @var{end} are set to the
671 beginning and end, respectively, of the active region, and the function
672 prompts for @var{wrong-cs} and @var{right-cs}. The function does
673 completion, knows how to guess a coding system name from a character set
674 name, and also provides some common aliases. See
675 @code{latin-unity-guess-coding-system}.
677 Another, safer, way to accomplish this, using character sets rather than coding
systems to specify the desired recoding, is to use @code{latin-unity-recode-region}.
680 To change from one Mule representation to another without changing identity
681 of any characters, use @code{latin-unity-remap-region}.
685 Helper functions for input of coding system and character set names.
687 @defun latin-unity-guess-charset candidate
688 Guess a charset based on the symbol @var{candidate}.
690 @var{candidate} itself is not tried as the value.
692 Uses the natural mapping in @samp{latin-unity-cset-codesys-alist}, and
the values in @samp{latin-unity-charset-alias-alist}.
696 @defun latin-unity-guess-coding-system candidate
697 Guess a coding system based on the symbol @var{candidate}.
699 @var{candidate} itself is not tried as the value.
701 Uses the natural mapping in @samp{latin-unity-cset-codesys-alist}, and
the values in @samp{latin-unity-coding-system-alias-alist}.
706 @defun latin-unity-example
708 A cheesy example for @pkgname{}.
At present it just makes a multilingual buffer. To test, use @code{setq}
to set @code{buffer-file-coding-system} to some value, make the buffer
dirty (eg with @kbd{RET BackSpace}), and save.
715 @defun latin-unity-test
717 A simple automated test suite for latin-unity.
721 @node Installation, Configuration, Usage, Top
722 @chapter Installing @pkgname{} with your (X)Emacs
724 @pkgname{} may be installed from XEmacs via the package user interface
725 (accessible from the @samp{Tools} menu or via @kbd{M-x list-packages}).
727 You can also download the @file{latin-unity-@var{version}-pkg.tar.gz}
728 tarball from @url{ftp://ftp.xemacs.org/pub/xemacs/packages/}, and simply
729 unpack it in the usual place.
731 @pkgname{} sources are available from XEmacs's CVS repository. The
732 module name is @samp{latin-unity}. See
733 @uref{http://www.xemacs.org/Develop/cvsaccess.html} for more
734 information about XEmacs's CVS repository.
737 @node Configuration, Bug Reports, Installation, Top
738 @chapter Configuring @pkgname{} for Use
740 If you want @pkgname{} to be automatically initialized, invoke
741 @samp{latin-unity-install} with no arguments in your init file.
742 @xref{Init File, , , xemacs}. If you are using GNU Emacs or an XEmacs
743 earlier than 21.1, you should also load @file{auto-autoloads} using the
744 full path (@emph{never} @samp{require} @file{auto-autoloads} libraries).
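
A minimal init-file sketch of the steps above follows. The path in the
commented-out @code{load} form is a placeholder, not a real location;
substitute the directory where the package is installed on your system.

@example
;; GNU Emacs, or XEmacs earlier than 21.1, only: load the autoloads by
;; full path first.  The path below is a placeholder.
;; (load "/path/to/latin-unity/auto-autoloads")

;; Install the latin-unity hook.
(latin-unity-install)
@end example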
746 You may wish to define aliases for commonly used character sets and
747 coding systems for convenience in input.
749 @defopt latin-unity-charset-alias-alist
Alist mapping aliases to Mule charset names (symbols).
754 ((latin-1 . latin-iso8859-1)
755 (latin-2 . latin-iso8859-2)
756 (latin-3 . latin-iso8859-3)
757 (latin-4 . latin-iso8859-4)
758 (latin-5 . latin-iso8859-9)
759 (latin-9 . latin-iso8859-15)
760 (latin-10 . latin-iso8859-16))
763 If a charset does not exist on your system, it will not complete and you
764 will not be able to enter it in response to prompts. A real charset
765 with the same name as an alias in this list will shadow the alias.
@defopt latin-unity-coding-system-alias-alist
769 Alist mapping aliases to Mule coding system names (symbols).
771 The default value is @samp{nil}.
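
As a sketch, aliases for coding systems could be defined in the same
shape as the charset aliases shown earlier. The alias names used here
are only illustrative; choose whatever is convenient for input.

@example
;; Sketch only: hypothetical aliases mapped to coding system names.
(setq latin-unity-coding-system-alias-alist
      '((latin-1 . iso-8859-1)
        (latin-9 . iso-8859-15)))
@end example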
775 @node Bug Reports, Frequently Asked Questions, Configuration, Top
776 @chapter Reporting Bugs and Problems
778 Please report bugs to the author, @email{stephen@@xemacs.org,Stephen
779 Turnbull}, or to the developers' mailing list,
780 @email{xemacs-beta@@xemacs.org, XEmacs Beta}.
782 Suggestions for improvement are welcome at the same addresses.
785 @node Frequently Asked Questions, Theory of Operation, Bug Reports, Top
786 @chapter Frequently Asked Questions
790 I'm smarter than latin-unity! How can that be?
792 Don't be surprised. Trust yourself.
794 latin-unity is very young as yet. Teach it what you know by Customizing
795 its variables, and report your changes to the maintainer (@pxref{Bug Reports}).
801 According to ISO 10646, a Universal Coded Character Set. In
802 latin-unity, it's a Universal (Mule) Coding System.
803 @xref{Coding Systems, , , xemacs}.
806 I know utf-16-le-bom is a UCS, but latin-unity won't use it. Why not?
808 There are an awful lot of UCSes in Mule, and you probably never want to
809 use most of them, and definitely do not want to be asked about them. So the
810 default set includes a few that the author thought plausible, but
811 they're surely not comprehensive or optimal.
813 Customize @code{latin-unity-ucs-list} to include the ones you use, and
814 report your favorites to the maintainer for consideration for inclusion
815 in the defaults (@pxref{Bug Reports}). (Note that you @emph{must} include
816 @code{escape-quoted} in this list, because Mule uses it internally as
817 the coding system for auto-save files.)
819 Alternatively, if you just want to use it this one time, simply type it
820 in at the prompt. latin-unity will confirm that it is a real coding
821 system, and then assume that you know what you're doing.
824 This is crazy: I can't quit XEmacs and get queried on autosaves! Why?
826 You probably removed @code{escape-quoted} from
827 @code{latin-unity-ucs-list}. Put it back.
830 latin-unity is really buggy and I can't get any work done.
832 First, use @kbd{M-x latin-unity-uninstall RET}, then report your
833 problems as a bug (@pxref{Bug Reports}).
837 @node Theory of Operation, What latin-unity Cannot Do for You, Frequently Asked Questions, Top
838 @chapter Theory of Operation
840 Standard encodings suffer from the design defect that they do not
841 provide a reliable way to recognize which coded character sets are in use.
842 @xref{What latin-unity Cannot Do for You}. There are scores of
843 character sets which can be represented by a single octet (8-bit byte),
844 whose union contains many hundreds of characters. Obviously this
845 results in great confusion, since you can't tell the players without a
846 scorecard, and there is no scorecard.
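
The claim about the size of the union can be checked with a short
sketch. Python is used here only because its standard codecs make the
tables easy to inspect; it is an assumption of this example and has
nothing to do with Mule, and the name @code{union_size} is hypothetical.

@example
PARTS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16]

def union_size(parts=PARTS):
    """Count distinct characters in the upper halves of ISO 8859-N."""
    seen = set()
    upper_half = bytes(range(0xA0, 0x100))
    for n in parts:
        # Unassigned code points are skipped rather than raising errors.
        seen.update(upper_half.decode("iso8859-%d" % n, errors="ignore"))
    return len(seen)

print(union_size())
@end example

Each part offers at most 96 such code points, so any count above 96
means that a single byte value stands for several different characters.
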
848 There are two ways to solve this problem. The first is to create a
849 universal coded character set. This is the concept behind Unicode.
850 However, although satisfactory (nearly) universal character sets have existed
851 for several decades, even today many Westerners resist using Unicode
852 because they consider its space requirements excessive. On the other
853 hand, Asians dislike Unicode because they consider it to be incomplete.
854 (This is partly, but not entirely, political.)
856 In any case, Unicode only solves the internal representation problem.
857 Many data sets will contain files in ``legacy'' encodings, and Unicode
858 does not help distinguish among them.
860 The second approach is to embed information about the encodings used in
861 a document in its text. This approach is taken by the ISO 2022
862 standard. This would solve the problem completely from the users' point of
863 view, except that ISO 2022 is basically not implemented at all, in the
864 sense that few applications or systems implement more than a small
865 subset of ISO 2022 functionality. This is due to the fact that
866 mono-literate users object to the presence of escape sequences in their
867 texts (which they, with some justification, consider data corruption).
868 Programmers are more than willing to cater to these users, since
869 implementing ISO 2022 is a painstaking task.
871 In fact, Emacs/Mule adopts both of these approaches. Internally it uses
872 a universal character set, @dfn{Mule code}. Externally it uses ISO 2022
873 techniques both to save files in forms robust to encoding issues, and as
874 hints when attempting to ``guess'' an unknown encoding. However, Mule
875 suffers from a design defect: it embeds in each character the character
876 set information that ISO 2022 attaches to runs of characters (by
877 introducing them with a control sequence). That causes Mule to
878 consider the ISO Latin character sets to be disjoint. This manifests
879 itself when a user enters characters using input methods associated with
880 different coded character sets into a single buffer.
882 There are two problems stemming from this design. First, Mule
883 represents the same character in different ways. Abstractly, '@'o'
884 (LATIN SMALL LETTER O WITH ACUTE) can get represented as
885 [latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73]. So what looks like
886 '@'o@'o' in the display might actually be represented [latin-iso8859-1
887 #x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B
888 #xF3 ESC - A] in the file. In some cases this treatment would be
889 appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00
890 (the CJK ideographic character meaning ``one'')), and although arguably
891 incorrect it is convenient when mixing the CJK scripts. But in the case
892 of the Latin scripts this is wrong.
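
The example above can be mimicked with a toy model. The following
Python sketch is only an illustration under stated assumptions: the
(charset, code) pairs stand in for Mule's internal representation, and
the names @code{FINAL_BYTE} and @code{save_iso2022} are hypothetical,
not part of Mule or @pkgname{}. It assumes that ISO 8859/1 is initially
designated to G1 and that the initial state is restored at the end.

@example
# Final bytes that ISO 2022 uses to designate these 96-character sets.
FINAL_BYTE = dict([("latin-iso8859-1", b"A"), ("latin-iso8859-2", b"B")])

def save_iso2022(chars, default="latin-iso8859-1"):
    """Serialize (charset, code) pairs, switching G1 with ESC - F."""
    out = b""
    current = default
    for charset, code in chars:
        if charset != current:
            # Designate the new charset to G1 before using it.
            out += b"\x1b-" + FINAL_BYTE[charset]
            current = charset
        out += bytes([code | 0x80])  # GR form of the 7-bit position code
    if current != default:
        out += b"\x1b-" + FINAL_BYTE[default]  # restore the initial state
    return out

buffer_chars = [("latin-iso8859-1", 0x73), ("latin-iso8859-2", 0x73)]
saved = save_iso2022(buffer_chars)
print(saved)
# The two buffer entries differ, although both bytes mean the same letter.
print(buffer_chars[0] == buffer_chars[1])
print(b"\xf3".decode("iso8859-1") == b"\xf3".decode("iso8859-2"))
@end example

The saved bytes contain two control sequences that a reader expecting
plain 8-bit text would not anticipate.
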
894 Worse yet, it is very likely to occur when mixing ``different'' encodings
895 (such as ISO 8859/1 and ISO 8859/15) that differ only in a few code
896 points that are almost never used. A very important example involves
897 email. Many sites, especially in the U.S., default to use of the ISO
898 8859/1 coded character set (also called ``Latin 1,'' though these are
899 somewhat different concepts). However, ISO 8859/1 provides a generic
900 CURRENCY SIGN character. Now that the Euro has become the official
901 currency of most countries in Europe, this is unsatisfactory (and in
902 practice, useless). So Europeans generally use ISO 8859/15, which is
903 nearly identical to ISO 8859/1 for most languages, except that it
904 substitutes EURO SIGN for CURRENCY SIGN.
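
How little the two sets differ can be seen by comparing their code
tables. This is a sketch in Python, chosen only as a convenient way to
inspect the tables outside of Mule; the variable names are illustrative.

@example
byte = b"\xa4"
as_latin1 = byte.decode("iso8859-1")    # CURRENCY SIGN, U+00A4
as_latin9 = byte.decode("iso8859-15")   # EURO SIGN, U+20AC
print(hex(ord(as_latin1)), hex(ord(as_latin9)))

# Collect the code points at which the two sets disagree.
differing = [b for b in range(256)
             if bytes([b]).decode("iso8859-1")
             != bytes([b]).decode("iso8859-15")]
print([hex(b) for b in differing])
@end example
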
906 Suppose a European user yanks text from a post encoded in ISO 8859/1
907 into a message composition buffer, and enters some text including the
908 Euro sign. Then Mule will consider the buffer to contain both ISO
909 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
910 programmed) send the message as a multipart mixed MIME body!
912 This is clearly stupid. What is not as obvious is that, just as any
913 European can include American English in their text because ASCII is a
914 subset of ISO 8859/15, most European languages which use Latin
915 characters (eg, German and Polish) can typically be mixed while using
916 only one Latin coded character set (in the case of German and Polish,
917 ISO 8859/2). However, this often depends on exactly what text is to be
918 encoded (even for the same pair of languages).
920 @pkgname{} works around the problem by converting as many characters as
921 possible to use a single Latin coded character set before saving the
924 Because the problem is rarely noticeable in editing a buffer, but tends
925 to manifest when that buffer is exported to a file or process, the
926 @pkgname{} package uses the strategy of examining the buffer prior to
927 export. If use of multiple Latin coded character sets is detected,
928 @pkgname{} attempts to unify them by finding a single coded character
929 set which contains all of the Latin characters in the buffer.
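
The search for a single sufficient character set can be sketched as
follows. This is not the algorithm @pkgname{} uses internally (which
works on Mule charsets and precomputed tables); it is a minimal Python
analogue that tests each candidate in turn, and the names
@code{LATIN_CODECS} and @code{feasible_codecs} are hypothetical.

@example
LATIN_CODECS = ["iso8859-1", "iso8859-2", "iso8859-3", "iso8859-4",
                "iso8859-9", "iso8859-13", "iso8859-15", "iso8859-16"]

def feasible_codecs(text, codecs=LATIN_CODECS):
    """Return the codecs able to represent every character of TEXT."""
    result = []
    for name in codecs:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # some character is missing from this set
        result.append(name)
    return result

print(feasible_codecs("\u00f3\u00f3"))   # two o-acute characters
print(feasible_codecs("5 \u20ac"))       # text containing EURO SIGN
@end example

When the returned list is empty, no single Latin set suffices and only
a universal encoding can represent the text.
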
931 The primary purpose of @pkgname{} is to fix the problem by giving the
932 user the choice to change the representation of all characters to one
933 character set, and by giving sensible recommendations based on context. In
934 the '@'o' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and
935 both will be suggested. In the EURO SIGN example, only ISO 8859/15
936 makes sense, and that is what will be recommended. In both cases, the
937 user will be reminded that there are universal encodings available.
939 I call this @dfn{remapping} (from the universal character set to a
940 particular ISO 8859 coded character set). It is mere accident that this
941 letter has the same code point in both character sets. (Not entirely,
942 but there are many examples of Latin characters that have different code
943 points in different Latin-X sets.)
945 Note that, in the '@'o' example, treating the buffer in this way will
946 result in a representation such as [latin-iso8859-2
947 #x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3].
948 This is guaranteed to occasionally result in the second problem you
949 observed, to which we now turn.
951 This problem is that, although the file is intended to be an
952 ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every POSIX
953 compliant program---this is required by the standard, obvious if you
954 think a bit, @pxref{What latin-unity Cannot Do for You}) will read that
955 file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73]. Of course this
956 is no problem if all of the characters in the file are contained in ISO
957 8859/1, but suppose there are some which are not, but are contained in
958 the (intended) ISO 8859/2.
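
The effect is easy to reproduce. The sketch below uses Python rather
than Mule, purely for illustration, and the Polish sample word is an
arbitrary choice.

@example
intended = "\u017c\u00f3\u0142w"           # Polish "zolw" with diacritics
file_bytes = intended.encode("iso8859-2")  # what is written to the file

# A reader that assumes the locale's ISO 8859/1 sees other characters.
misread = file_bytes.decode("iso8859-1")
print(misread == intended)

# Only the code points shared by both sets survive unchanged.
same = [a == b for a, b in zip(intended, misread)]
print(same)
@end example

No error is reported at any point, because every byte is a valid
ISO 8859/1 character.
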
960 You now want to fix this, but not by finding the same character in
961 another set. Instead, you want to simply change the character set that
962 Mule associates with that buffer position without changing the code.
963 (This is conceptually somewhat distinct from the first problem, and
964 logically ought to be handled in the code that defines coding systems.
965 However, @pkgname{} is not an unreasonable place for it.) @pkgname{}
966 provides two functions (one fast and dangerous, the other slow and
967 careful) to handle this. I call this @dfn{recoding}, because the
968 transformation actually involves @emph{encoding} the buffer to file
969 representation, then @emph{decoding} it to buffer representation (in a
970 different character set). This cannot be done automatically because
971 Mule can have no idea what the correct encoding is---after all, it
972 already gave you its best guess. @xref{What latin-unity Cannot Do for
973 You}. So these functions must be invoked by the user. @xref{Interactive
977 @node What latin-unity Cannot Do for You, Interfaces, Theory of Operation, Top
978 @chapter What latin-unity Cannot Do for You
980 @pkgname{} @strong{cannot} save you if you insist on exporting data in
981 8-bit encodings in a multilingual environment. @emph{You will
982 eventually corrupt data if you do this.} It is not Mule's, or any
983 application's, fault. You will have only yourself to blame; consider
984 yourself warned. (It is true that Mule has bugs, which make Mule
985 somewhat more dangerous and inconvenient than some naive applications.
986 We're working to address those, but no application can remedy the
987 inherent defect of 8-bit encodings.)
989 Use standard universal encodings, preferably Unicode (UTF-8) unless
990 applicable standards indicate otherwise. The most important such case
991 is Internet messages, where MIME should be used, whether or not the
992 subordinate encoding is a universal encoding. (Note that since one of
993 the important provisions of MIME is the @samp{Content-Type} header,
994 which has the charset parameter, MIME is to be considered a universal
995 encoding for the purposes of this manual. Of course, technically
996 speaking it's neither a coded character set nor a coding extension
997 technique compliant with ISO 2022.)
999 As mentioned earlier, the problem is that standard encodings suffer from
1000 the design defect that they do not provide a reliable way to recognize
1001 which coded character sets are in use. There are scores of character
1002 sets which can be represented by a single octet (8-bit byte), whose
1003 union contains many hundreds of characters. Thus any 8-bit coded
1004 character set must contain characters that share code points used for
1005 different characters in other coded character sets.
1007 This means that a given file's intended encoding cannot be identified
1008 with 100% reliability unless it contains encoding markers such as those
1009 provided by MIME or ISO 2022.
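
A short check makes the point concrete. It is written in Python only
because the code tables are readily available there; nothing in it is
specific to Mule.

@example
high_bytes = bytes(range(0xA0, 0x100))

# Every such byte is valid in both sets, so decoding never fails ...
as_latin1 = high_bytes.decode("iso8859-1")
as_latin2 = high_bytes.decode("iso8859-2")

# ... yet many code points name different characters.
clashes = sum(1 for a, b in zip(as_latin1, as_latin2) if a != b)
print(clashes, "of", len(high_bytes), "code points differ")
@end example

Since both decodings succeed, a recognizer has nothing but the
plausibility of the resulting text to go on.
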
1011 @pkgname{} actually makes it more likely that you will have problems of
1012 this kind. Traditionally Mule has been ``helpful'' by simply using an
1013 ISO 2022 universal coding system when the current buffer coding system
1014 cannot handle all the characters in the buffer. This has the effect
1015 that, because the file contains control sequences, it is not recognized
1016 as being in the locale's normal 8-bit encoding. It may be annoying if
1017 you are not a Mule expert, but your data is automatically recoverable
1018 with a tool you already have: Mule.
1020 However, with @pkgname{}, Mule converts to a single 8-bit character set
1021 when possible. But typically this will @emph{not} be in your usual
1022 locale. Ie, the times that an ISO 8859/1 user will need @pkgname{} are
1023 when there are ISO 8859/2 characters in the buffer. But then most
1024 likely the file will be saved in a pure 8-bit encoding that is not ISO
1025 8859/1, ie, ISO 8859/2. Mule's autorecognizer (which is probably the
1026 most sophisticated yet available) cannot tell the difference between ISO
1027 8859/1 and ISO 8859/2, and in a Western European locale will choose the
1028 former even though the latter was intended. Even the extension
1029 (``statistical recognition'') planned for XEmacs 22 is unlikely to be at
1030 all accurate in the case of mixed codes.
1032 So now consider adding some additional ISO 8859/1 text to the buffer.
1033 If it includes any ISO 8859/1 codes that are used by different
1034 characters in ISO 8859/2, you now have a file that cannot be
1035 mechanically disentangled. You need a human being who can recognize
1036 that @emph{this is German and Swedish} and stays in Latin-1, while
1037 @emph{that is Polish} and needs to be recoded to Latin-2.
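
The following Python sketch illustrates why this needs human judgment.
It is an analogy, not @pkgname{} code: the helper @code{recode} is
hypothetical and merely mirrors the encode-then-decode idea, and the
sample words are arbitrary.

@example
swedish = "sm\u00e5"                       # intended as ISO 8859/1
polish = "\u017c\u00f3\u0142w"             # intended as ISO 8859/2
file_bytes = swedish.encode("iso8859-1") + b" " + polish.encode("iso8859-2")

# Decoding everything as ISO 8859/1 garbles only the Polish word.
text = file_bytes.decode("iso8859-1")
first, second = text.split(" ")

def recode(s, wrong="iso8859-1", right="iso8859-2"):
    """Re-encode S as it was read, then decode it as intended."""
    return s.encode(wrong).decode(right)

# Which span to recode is a human decision; here it is the second word.
print(first == swedish, recode(second) == polish)
# Recoding the Swedish word as well would damage it instead.
print(recode(first) == swedish)
@end example
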
1039 Moral: switch to a universal coded character set, preferably Unicode
1040 using the UTF-8 transformation format. If you really need the space,
1041 compress your files.
1044 @node Interfaces, Charsets and Coding Systems, What latin-unity Cannot Do for You, Top
1047 Various recent---dating from the end of the nineties---ISO 8859 standard
1048 language environments used to be provided with this package, but they
1049 have been refactored out into their own package,
1050 @i{latin-euro-standards}, on which this one depends. See the
1051 documentation for that package if you're interested in using those from
1054 @node Charsets and Coding Systems, Internals, Interfaces, Top
1055 @chapter Charsets and Coding Systems
1057 This section provides reference lists of Mule charsets and coding
1058 systems. Mule charsets are typically named by character set and
1062 @item ASCII variants
1064 Identification of equivalent characters in these sets is not properly
1065 implemented. @pkgname{} does not distinguish the two charsets.
1067 @samp{ascii} @samp{latin-jisx0201}
1069 @item Extended Latin
1071 Characters from the following ISO 2022 conformant charsets are
1072 identified with equivalents in other charsets in the group by
1075 @samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2}
1076 @samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9}
1077 @samp{latin-iso8859-13} @samp{latin-iso8859-16}
1079 The following charsets are Latin variants which are not understood by
1080 @pkgname{}. In addition, many of the Asian language standards provide
1081 ASCII, at least, and sometimes other Latin characters. None of these
1082 are identified with their ISO 8859 equivalents.
1084 @samp{vietnamese-viscii-lower}
1085 @samp{vietnamese-viscii-upper}
1087 @item Other character sets
1089 @samp{arabic-1-column}
1090 @samp{arabic-2-column}
1092 @samp{arabic-iso8859-6}
1093 @samp{chinese-big5-1}
1094 @samp{chinese-big5-2}
1095 @samp{chinese-cns11643-1}
1096 @samp{chinese-cns11643-2}
1097 @samp{chinese-cns11643-3}
1098 @samp{chinese-cns11643-4}
1099 @samp{chinese-cns11643-5}
1100 @samp{chinese-cns11643-6}
1101 @samp{chinese-cns11643-7}
1102 @samp{chinese-gb2312}
1103 @samp{chinese-isoir165}
1104 @samp{cyrillic-iso8859-5}
1106 @samp{greek-iso8859-7}
1107 @samp{hebrew-iso8859-8}
1109 @samp{japanese-jisx0208}
1110 @samp{japanese-jisx0208-1978}
1111 @samp{japanese-jisx0212}
1112 @samp{katakana-jisx0201}
1113 @samp{korean-ksc5601}
1118 @item Non-graphic charsets
1126 Some of these coding systems may specify EOL conventions. Note that
1127 @samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022
1128 coding system. Although @pkgname{} attempts to compensate for this, it
1129 is possible that the @samp{iso-8859-1} coding system will behave
1130 differently from other ISO 8859 coding systems.
1132 @samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1}
1134 @item Latin coding systems
1136 These coding systems are all single-byte, 8-bit ISO 2022 coding systems,
1137 combining ASCII in the GL register (bytes with high-bit clear) and an
1138 extended Latin character set in the GR register (bytes with high-bit set).
1140 @samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4}
1141 @samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16}
1143 These coding systems are single-byte, 8-bit coding systems that do not
1144 conform to international standards. They should be avoided in all
1145 potentially multilingual contexts, including any text distributed over
1146 the Internet and World Wide Web.
1150 @item Multilingual coding systems
1152 The following ISO-2022-based coding systems are useful for multilingual
1155 @samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit}
1156 @samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2}
1158 XEmacs also supports Unicode with the Mule-UCS package. These are the
1159 preferred coding systems for multilingual use. (There is a possible
1160 exception for texts that mix several Asian ideographic character sets.)
1162 @samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le}
1163 @samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe}
1164 @samp{utf-8} @samp{utf-8-ws}
1166 Development versions of XEmacs (the 21.5 series) support Unicode
1167 internally, with (at least) the following coding systems implemented:
1169 @samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le}
1170 @samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom}
1172 @item Asian ideographic languages
1174 The following coding systems are based on ISO 2022, and are more or less
1175 suitable for encoding multilingual texts. They all can represent ASCII
1176 at least, and sometimes several other foreign character sets, without
1177 resort to arbitrary ISO 2022 designations. However, these subsets are
1178 not identified with the corresponding national standards in XEmacs Mule.
1180 @samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312}
1181 @samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc}
1182 @samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp}
1183 @samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr}
1184 @samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1}
1186 The following coding systems cannot be used for general multilingual
1187 text and do not cooperate well with other coding systems.
1189 @samp{big5} @samp{shift_jis}
1191 @item Other languages
1193 The following coding systems are based on ISO 2022. Though none of them
1194 provides any Latin characters beyond ASCII, XEmacs Mule allows (and up
1195 to 21.4 defaults to) use of ISO 2022 control sequences to designate
1196 other character sets for inclusion in the text.
1198 @samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8}
1201 The following are character sets that do not conform to ISO 2022 and
1202 thus cannot be safely used in a multilingual context.
1204 @samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr}
1205 @samp{viscii} @samp{vscii}
1207 @item Special coding systems
1209 Mule uses the following coding systems for special purposes.
1211 @samp{automatic-conversion} @samp{undecided} @samp{escape-quoted}
1213 @samp{escape-quoted} is especially important, as it is used internally
1214 as the coding system for autosaved data.
1216 The following coding systems are aliases for others, and are used for
1217 communication with the host operating system.
1219 @samp{file-name} @samp{keyboard} @samp{terminal}
1223 Mule detection of coding systems is actually limited to detection of
1224 classes of coding systems called @dfn{coding categories}. These coding
1225 categories are identified by the ISO 2022 control sequences they use, if
1226 any, by their conformance to ISO 2022 restrictions on code points that
1227 may be used, and by characteristic patterns of use of 8-bit code points.
1229 @samp{no-conversion}
1233 @samp{iso-lock-shift}
1236 @samp{iso-8-designate}
1241 @node Internals, , Charsets and Coding Systems, Top
1244 No internals documentation yet.
1246 @file{latin-unity-utils.el} provides one utility function.
1248 @defun latin-unity-dump-tables
1250 Dump the temporary table created by loading @file{latin-unity-utils.el}
1251 to @file{latin-unity-tables.el}. Loading the latter file initializes
1252 @samp{latin-unity-equivalences}.
1255 @c end of latin-unity.texi