   1 \input texinfo   @c -*-texinfo-*-
   2
   3 @c Manual for the XEmacs latin-unity package.
   4
   5 @c ** #### something uses these headers, find out how and explain here
   6 @c %**start of header
   7 @setfilename latin-unity.info
   8 @settitle Equivalencing Coded Latin Characters
   9 @setchapternewpage odd
  10 @c %**end of header
  11
  12 @c Version values, for easy modification
  13 @set VERSION 1.1
  14 @set UPDATED Monday, February 7, 2005
  15
  16 @c ** Many people seem to prefer this footnote style
  17 @footnotestyle end
  18
  19 @direntry
  20 * latin-unity::                Remap Latin characters from a single charset.
  21 @end direntry
  22
  23 @c ** It is often convenient to use a macro for names which have unusual
  24 @c ** spelling or formatting conventions.
  25 @c Macro to make formatting of the package name consistent.
  26 @macro pkgname
  27 @i{latin-unity}
  28 @end macro
  29
  30 @c Copying permissions, et al
  31 @c Note that this whole section is repeated thrice, once for the Info
  32 @c version, once for TeX, and once for HTML.
  33 @c Note that if you combine this document with others' work, you may
  34 @c need to change the permissions statement.  Make sure you do so in all
  35 @c three places.
  36 @c #### Probably not the right place for HTML.
  37 @html
  38 This file is part of XEmacs.  It documents the @pkgname{} package,
  39 which ensures that wherever possible representations of all Latin
  40 characters are drawn from the same 8-bit character set.
  41
  42 Copyright @copyright{} 2002, 2003 Free Software Foundation, Inc.
  43
  44 Permission is granted to make and distribute verbatim or modified copies
  45 of this manual provided that the copyright notice above is preserved,
  46 and at least one of the numbered paragraphs below up to the phrase "End
  47 of permissions notice." is included, under the terms of any of the
  48 licenses enumerated below.
  49
  50 1. The GNU General Public License, version 2 or any later version
  51 published by the Free Software Foundation, Inc.  A copy of the GNU
  52 General Public License was included with your copy of XEmacs.
  53 Alternatively, you may request a copy from the Free Software Foundation,
  54 Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
  55
  56 2. The license used for XEmacs documentation (see the file
  57 man/xemacs/xemacs.texi distributed with XEmacs).
  58
  59 3. The GNU Free Documentation License (available from the Free Software
  60 Foundation at the address above).
  61
  62 End of permissions notice.
  63
  64 Because the XEmacs documentation license and the GNU Free Documentation
  65 License are not free software licenses, and are mutually incompatible
  66 with each other and with the GNU General Public License, it is desirable
  67 that the permissions given here be preserved if possible.  This normally
  68 requires that you write fresh text for any additions or modifications to
  69 this file, rather than copying code or comments from other sources.  The
  70 exceptions are if you are sole copyright holder for the other source, or
  71 if you receive permission from the holders of the copyrights for the
  72 other sources.
  73
  74 That is, if you merge code or comments from a file licensed to you only
  75 under the GNU General Public License you @emph{may not} distribute
  76 under the documentation licenses or with a dual license; such a
  77 combination may be distributed under the GNU General Public License
  78 @emph{only}.  Similarly, if you copy text from a file licensed to you
  79 only under the XEmacs documentation license or the GNU Free
  80 Documentation License you may distribute the result @emph{only} under
  81 the appropriate license.
  82 @end html
  83
  84 @ifinfo
  85 This file is part of XEmacs.  It documents the @pkgname{} package,
  86 which ensures that wherever possible representations of all Latin
  87 characters are drawn from the same 8-bit character set.
  88
  89 Copyright @copyright{} 2002, 2003 Free Software Foundation, Inc.
  90
  91 Permission is granted to make and distribute verbatim or modified copies
  92 of this manual provided that the copyright notice above is preserved,
  93 and at least one of the numbered paragraphs below up to the phrase "End
  94 of permissions notice." is included, under the terms of any of the
  95 licenses enumerated below.
  96
  97 @ignore
  98 Permission is granted to process this file through TeX and print the
  99 results, provided the printed document carries a copying permission
 100 notice identical to this one except for the removal of this paragraph
 101 (this paragraph not being relevant to the printed manual).
 102
 103 @end ignore
 104 1. The GNU General Public License, version 2 or any later version
 105 published by the Free Software Foundation, Inc.  A copy of the GNU
 106 General Public License was included with your copy of XEmacs.
 107 Alternatively, you may request a copy from the Free Software Foundation,
 108 Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 109
 110 2. The license used for XEmacs documentation (see the file
 111 man/xemacs/xemacs.texi distributed with XEmacs).
 112
 113 3. The GNU Free Documentation License (available from the Free Software
 114 Foundation at the address above).
 115
 116 End of permissions notice.
 117
 118 Because the XEmacs documentation license and the GNU Free Documentation
 119 License are not free software licenses, and are mutually incompatible
 120 with each other and with the GNU General Public License, it is desirable
 121 that the permissions given here be preserved if possible.  This normally
 122 requires that you write fresh text for any additions or modifications to
 123 this file, rather than copying code or comments from other sources.  The
 124 exceptions are if you are sole copyright holder for the other source, or
 125 if you receive permission from the holders of the copyrights for the
 126 other sources.
 127
 128 That is, if you merge code or comments from a file licensed to you only
 129 under the GNU General Public License you @emph{may not} distribute
 130 under the documentation licenses or with a dual license; such a
 131 combination may be distributed under the GNU General Public License
 132 @emph{only}.  Similarly, if you copy text from a file licensed to you
 133 only under the XEmacs documentation license or the GNU Free
 134 Documentation License you may distribute the result @emph{only} under
 135 the appropriate license.
 136 @end ifinfo
 137
 138 @tex
 139
 140 @titlepage
 141 @title Latin Unity for Emacsen
 142 @subtitle Last updated @value{UPDATED}
 143
 144 @author Stephen J. Turnbull
 145 @page
 146
 147 This file is part of XEmacs.  It documents the @pkgname{} package,
 148 which ensures that wherever possible representations of all Latin
 149 characters are drawn from the same 8-bit character set.
 150
 151 Copyright @copyright{} 2002, 2003 Free Software Foundation, Inc.
 152
 153 @vskip 0pt plus 1filll
 154 Permission is granted to make and distribute verbatim or modified copies
 155 of this manual provided that the copyright notice above is preserved,
 156 and at least one of the numbered paragraphs below up to the phrase "End
 157 of permissions notice." is included, under the terms of any of the
 158 licenses enumerated below.
 159
 160 @ignore
 161 Permission is granted to process this file through TeX and print the
 162 results, provided the printed document carries a copying permission
 163 notice identical to this one except for the removal of this paragraph
 164 (this paragraph not being relevant to the printed manual).
 165
 166 @end ignore
 167 1. The GNU General Public License, version 2 or any later version
 168 published by the Free Software Foundation, Inc.  A copy of the GNU
 169 General Public License was included with your copy of XEmacs.
 170 Alternatively, you may request a copy from the Free Software Foundation,
 171 Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 172
 173 2. The license used for XEmacs documentation (see the file
 174 man/xemacs/xemacs.texi distributed with XEmacs).
 175
 176 3. The GNU Free Documentation License (available from the Free Software
 177 Foundation at the address above).
 178
 179 End of permissions notice.
 180
 181 Because the XEmacs documentation license and the GNU Free Documentation
 182 License are not free software licenses, and are mutually incompatible
 183 with each other and with the GNU General Public License, it is desirable
 184 that the permissions given here be preserved if possible.  This normally
 185 requires that you write fresh text for any additions or modifications to
 186 this file, rather than copying code or comments from other sources.  The
 187 exceptions are if you are sole copyright holder for the other source, or
 188 if you receive permission from the holders of the copyrights for the
 189 other sources.
 190
 191 That is, if you merge code or comments from a file licensed to you only
 192 under the GNU General Public License you @emph{may not} distribute
 193 under the documentation licenses or with a dual license; such a
 194 combination may be distributed under the GNU General Public License
 195 @emph{only}.  Similarly, if you copy text from a file licensed to you
 196 only under the XEmacs documentation license or the GNU Free
 197 Documentation License you may distribute the result @emph{only} under
 198 the appropriate license.
 199
 200 @end titlepage
 201 @page
 202
 203 @end tex
 204
 205 @ifnottex
 206 @node Top, Copying, (dir), (dir)
 207 @top Latin Unity for Emacsen
 208
 209 @html
 210 This project kindly hosted by
 211 <br />
 212 <a href="http://sunsite.dk">
 213 <img src="http://sunsite.dk/images/hostedby.png"
 214      border="0" alt="sunSITE.dk Logo" /></a>
 215 <a href="http://www.tux.org/">
 216 <img src="http://www.tux.org/images/minibanner.gif"
 217      width="88" height="31" border="0" alt="Tux.Org Logo" /></a>
 218 <a href="http://sourceforge.net">
 219 <img src="http://sourceforge.net/sflogo.php?group_id=34545"
 220      width="88" height="31" border="0" alt="SourceForge Logo" /></a>
 221 @end html
 222
 223 Mule suffers from a design defect that causes it to consider the ISO
 224 Latin character sets to be disjoint.  This results in oddities such as
 225 files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO
 226 2022 control sequences to switch between them, as well as more plausible
 227 but often unnecessary combinations like ISO 8859/1 with ISO 8859/2.
 228 This can be very annoying when sending messages or even in simple
 229 editing on a single host.  @pkgname{} works around the problem by
 230 converting as many characters as possible to use a single Latin coded
 231 character set before saving the buffer.
 232
 233 This is version @value{VERSION} of the @pkgname{} manual, last updated on
 234 @value{UPDATED}.
 235
 236 @c ** Currently we provide documentation for the XEmacs core, but not for
 237 @c ** individual packages, on the web.  Check for latest policy either at
 238 @c ** www.xemacs.org or on xemacs-beta@xemacs.org.
 239
 240 @c You can find the latest version of this document on the web at
 241 @c @uref{http://www.xemacs.org/}.
 242
 243 @ifhtml
 244 @c ** Mention translations and other online versions here.
 245
 246 @c ** Adjust to taste.  You may wish to mention newsgroups such as
 247 @c ** comp.emacs.xemacs, comp.emacs, and gnu.emacs.help.
 248 Discussion and enhancement of @pkgname{} is conducted on the XEmacs
 249 mailing lists, especially the XEmacs Beta list.  See
 250 @uref{http://www.xemacs.org/Lists/} for more information about the
 251 XEmacs mailing lists.
 252 @end ifhtml
 253
 254 @end ifnottex
 255
 256 @menu
 257 * Copying::                     @pkgname{} copying conditions.
 258 * Overview::                    @pkgname{} history and general information.
 259
 260 For general users:
 261 * Usage::                       An overview of the operation of @pkgname{}.
 262 * Installation::                Installing @pkgname{} with your (X)Emacs.
 263 * Configuration::               Configuring @pkgname{} for use.
 264 * Bug Reports::                 Reporting bugs and problems.
 265 * Frequently Asked Questions::  Questions and answers from the mailing list.
 266 * Theory of Operation::         How @pkgname{} works.
 267 * What latin-unity Cannot Do for You::  Inherent problems of 8-bit charsets.
 268
 269 For programmers:
 270 * Interfaces::                  Calling @pkgname{} from Lisp code.
 271 * Charsets and Coding Systems:: Reference lists with annotations.
 272
 273 For maintainers:
 274 * Internals::                   Utilities and implementation details.
 275
 276 @c ** For small packages, with no or few subnodes, a detailmenu is not
 277 @c ** necessary.
 278 @c @detailmenu
 279 @c  --- The Detailed Node Listing ---
 280
 281 @c @end detailmenu
 282 @end menu
 283
 284 @node Copying, Overview, Top, Top
 285 @chapter @pkgname{} Copying Conditions
 286
 287 @c ** CHECK THE COPYRIGHT DATE(S) AND HOLDER(S)!
 288
 289 Copyright (C) 2002 Free Software Foundation, Inc.
 290
 291 @pkgname{} is free software; you can redistribute it
 292 and/or modify it under the terms of the GNU General Public License as
 293 published by the Free Software Foundation; either version 2, or (at your
 294 option) any later version.
 295
 296 @pkgname{} is distributed in the hope that it will
 297 be useful, but WITHOUT ANY WARRANTY; without even the implied warranty
 298 of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 299 General Public License for more details.
 300
 301 You should have received a copy of the GNU General Public License along
 302 with XEmacs; see the file COPYING. If not, write to the Free Software
 303 Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
 304 USA.
 305
 306
 307 @node Overview, Usage, Copying, Top
 308 @chapter An Overview of @pkgname{}
 309
 310 Mule suffers from a design defect that causes it to consider the ISO
 311 Latin character sets to be disjoint.  This manifests itself when a user
 312 enters characters using input methods associated with different coded
 313 character sets into a single buffer.
 314
 315 A very important example involves email.  Many sites, especially in the
 316 U.S., default to use of the ISO 8859/1 coded character set (also called
 317 ``Latin 1,'' though these are somewhat different concepts).  However,
 318 ISO 8859/1 provides a generic CURRENCY SIGN character.  Now that the
 319 Euro has become the official currency of most countries in Europe, this
 320 is unsatisfactory (and in practice, useless).  So Europeans generally
 321 use ISO 8859/15, which is nearly identical to ISO 8859/1 for most
 322 languages, except that it substitutes EURO SIGN for CURRENCY SIGN.
 323
 324 Suppose a European user yanks text from a post encoded in ISO 8859/1
 325 into a message composition buffer, and enters some text including the
 326 Euro sign.  Then Mule will consider the buffer to contain both ISO
 327 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
 328 programmed) send the message as a multipart mixed MIME body!
 329
 330 This is clearly stupid.  What is not as obvious is that, just as any
 331 European can include American English in their text because ASCII is a
 332 subset of ISO 8859/15, most European languages which use Latin
 333 characters (eg, German and Polish) can typically be mixed while using
 334 only one Latin coded character set (in this case, ISO 8859/2).  However,
 335 this often depends on exactly what text is to be encoded.
 336
 337 @pkgname{} works around the problem by converting as many characters as
 338 possible to use a single Latin coded character set before saving the
 339 buffer.
 340
 341 @node Usage, Installation, Overview, Top
 342 @chapter Operation of @pkgname{}
 343
 344 Normally, @pkgname{} works in the background by installing
 345 @code{latin-unity-sanity-check} on @code{write-region-pre-hook}.  The
 346 user activates this functionality by invoking
 347 @code{latin-unity-install}, either interactively or in her init file.
 348 @xref{Init File, , , xemacs}.  @pkgname{} can be deactivated by
 349 invoking @code{latin-unity-uninstall}.
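
As a minimal sketch of the init file approach described above, the
following assumes only that the @pkgname{} package is installed and that
its autoloads are available:

@example
;; Activate latin-unity: installs `latin-unity-sanity-check' on
;; `write-region-pre-hook' for the rest of the session.
(latin-unity-install)

;; To turn the check off again later, evaluate:
;; (latin-unity-uninstall)
@end example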
 350
 351 @pkgname{} also provides a few functions for remapping or recoding the
 352 buffer by hand.  To @dfn{remap} a character means to change the buffer
 353 representation of the character by using another coded character set.
 354 Remapping never changes the identity of the character, but may involve
 355 altering the code point of the character.  To @dfn{recode} a character
 356 means to simply change the coded character set.  Recoding never alters
 357 the code point of the character, but may change the identity of the
 358 character.  @xref{Theory of Operation}.
 359
 360 There are a few variables which determine which coding systems are
 361 always acceptable to @pkgname{}:  @code{latin-unity-ucs-list},
 362 @code{latin-unity-preferred-coding-system-list}, and
 363 @code{latin-unity-preapproved-coding-system-list}.  The latter two default
 364 to @code{()}, and should probably be avoided because they short-circuit
 365 the sanity check.  If you find you need to use them, consider reporting
 366 it as a bug or request for enhancement.  Because they seem unsafe, the
 367 recommended interface is likely to change.
 368
 369 @menu
 370 * Basic Functionality::            User interface and customization.
 371 * Interactive Usage::              Treating text by hand.
 372                                    Also documents the hook function(s).
 373 @end menu
 374
 375
 376 @node Basic Functionality, Interactive Usage, , Usage
 377 @section Basic Functionality
 378
 379 These functions and user options initialize and configure @pkgname{}.
 380 In normal use, only a call to @code{latin-unity-install} is needed.
 381
 382
 383 @defun latin-unity-install
 384 Set up hooks and initialize variables for latin-unity.
 385
 386 There are no arguments.
 387
 388 This function is idempotent.  It will reinitialize any hooks or variables
 389 that are not in initial state.
 390
 391 Note: a quirk in XEmacs means that the @file{cl-macs} library cannot be
 392 required; it must be loaded explicitly.  This means that if you invoke
 393 @samp{latin-unity-install} in your init file, XEmacs will print
 394 @example
 395 Loading cl-macs...
 396 Loading cl-macs...done
 397 @end example
 398 on your console, as these messages have not yet been redirected to the
 399 @samp{ *Message-Log*} buffer.
 400 @end defun
 401
 402
 403 @defun latin-unity-uninstall
 404 There are no arguments.
 405
 406 Clean up hooks and void variables used by latin-unity.
 407 @end defun
 408
 409
 410 @defopt latin-unity-ucs-list
 411 List of coding systems considered to be universal.
 412
 413 The default value is @code{'(utf-8 iso-2022-7 ctext escape-quoted)}.
 414
 415 Order matters; coding systems earlier in the list will be preferred when
 416 recommending a coding system.  These coding systems will not be used
 417 without querying the user (unless they are also present in
 418 @code{latin-unity-preapproved-coding-system-list}), and follow the
 419 @code{latin-unity-preferred-coding-system-list} in the list of suggested
 420 coding systems.
 421
 422 If none of the preferred coding systems are feasible, the first in
 423 this list will be the default.
 424
 425 Notes on certain coding systems:  @code{escape-quoted} is a special
 426 coding system used for autosaves and compiled Lisp in Mule.  You should
 427 not delete this, although it is rare that a user would want to use it
 428 directly.  @pkgname{} does not try to be ``smart'' about other general
 429 ISO 2022 coding systems, such as ISO-2022-JP.  (They are not recognized
 430 as equivalent to @code{iso-2022-7}.)  If your preferred coding system is
 431 one of these, you may consider adding it to @code{latin-unity-ucs-list}.
 432 However, this will typically have the side effect that (eg) ISO 8859/1
 433 files will be saved in 7-bit form with ISO 2022 escape sequences.
 434 @end defopt
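
For instance, a user whose preferred coding system is ISO-2022-JP might
extend the list along the following lines.  This is a sketch only: the
coding system symbol @code{iso-2022-jp} and its position at the head of
the list are assumptions, and the remaining entries simply repeat the
documented default.

@example
;; Put the assumed symbol `iso-2022-jp' first so that it is preferred
;; when a universal coding system is recommended.
(setq latin-unity-ucs-list
      '(iso-2022-jp utf-8 iso-2022-7 ctext escape-quoted))
@end example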
 435
 436
 437 Coding systems which are not Latin and not in
 438 @code{latin-unity-ucs-list} are handled by short circuiting checks of
 439 coding system against the next two variables.
 440
 441 @defopt latin-unity-preapproved-coding-system-list
 442 List of coding systems used without querying the user if feasible.
 443
 444 The default value is @samp{(buffer-default preferred)}.
 445
 446 The first feasible coding system in this list is used.  The special values
 447 @samp{preferred} and @samp{buffer-default} may be present:
 448
 449 @table @code
 450 @item buffer-default
 451 Use the coding system used by @samp{write-region}, if feasible.
 452
 453 @item preferred
 454 Use the coding system specified by @samp{prefer-coding-system} if feasible.
 455 @end table
 456
 457 ``Feasible'' means that all characters in the buffer can be represented by
 458 the coding system.  Coding systems in @samp{latin-unity-ucs-list} are
 459 always considered feasible.  Other feasible coding systems are computed
 460 by @samp{latin-unity-representations-feasible-region}.
 461
 462 Note that the first universal coding system in this list shadows all
 463 other coding systems.  In particular, if your preferred coding system is
 464 a universal coding system, and @code{preferred} is a member of this
 465 list, @pkgname{} will blithely convert all your files to that coding
 466 system.  This is considered a feature, but it may surprise most users.
 467 Users who don't like this behavior should put @code{preferred} in
 468 @code{latin-unity-preferred-coding-system-list}.
 469 @end defopt
 470
 471
 472 @defopt latin-unity-preferred-coding-system-list
 473 List of coding systems suggested to the user if feasible.
 474
 475 The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3
 476 iso-8859-4 iso-8859-9)}.
 477
 478 If none of the coding systems in
 479 @samp{latin-unity-preapproved-coding-system-list} are feasible, this list
 480 will be recommended to the user, followed by the
 481 @samp{latin-unity-ucs-list} (so those coding systems should not be in
 482 this list).  The first coding system in this list is default.  The
 483 special values @samp{preferred} and @samp{buffer-default} may be
 484 present:
 485
 486 @table @code
 487 @item buffer-default
 488 Use the coding system used by @samp{write-region}, if feasible.
 489
 490 @item preferred
 491 Use the coding system specified by @samp{prefer-coding-system} if feasible.
 492 @end table
 493
 494 ``Feasible'' means that all characters in the buffer can be represented by
 495 the coding system.  Coding systems in @samp{latin-unity-ucs-list} are
 496 always considered feasible.  Other feasible coding systems are computed
 497 by @samp{latin-unity-representations-feasible-region}.
 498 @end defopt
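
The advice given for @code{latin-unity-preapproved-coding-system-list}
about moving @code{preferred} could be followed roughly as below.  The
list contents are taken from the defaults documented above; placing
@code{preferred} at the head of the suggestion list is an assumption
made for illustration.

@example
;; Stop using the preferred coding system without asking ...
(setq latin-unity-preapproved-coding-system-list '(buffer-default))

;; ... and offer it as the first suggestion instead.
(setq latin-unity-preferred-coding-system-list
      '(preferred iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3
        iso-8859-4 iso-8859-9))
@end example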
 499
 500
 501 @defvar latin-unity-iso-8859-1-aliases
 502 List of coding systems to be treated as aliases of ISO 8859/1.
 503
 504 The default value is @code{'(iso-8859-1)}.
 505
 506 This is not a user variable; to customize input of coding systems or
 507 charsets, use @samp{latin-unity-coding-system-alias-alist} or
 508 @samp{latin-unity-charset-alias-alist}.
 509 @end defvar
 510
 511
 512 @node Interactive Usage, , Basic Functionality, Usage
 513 @section Interactive Usage
 514
 515 First, the hook function @code{latin-unity-sanity-check} is documented.
 516 (It is placed here because it is not an interactive function, and there
 517 is not yet a programmer's section of the manual.)
 518
 519 These functions provide access to internal functionality (such as the
 520 remapping function) and to extra functionality (the recoding functions
 521 and the test function).
 522
 523
 524 @defun latin-unity-sanity-check begin end filename append visit lockname &optional coding-system
 525
 526 Check if @var{coding-system} can represent all characters between
 527 @var{begin} and @var{end}.
 528
 529 For compatibility with old broken versions of @code{write-region},
 530 @var{coding-system} defaults to @code{buffer-file-coding-system}.
 531 @var{filename}, @var{append}, @var{visit}, and @var{lockname} are
 532 ignored.
 533
 534 Return @code{nil} if @code{buffer-file-coding-system} is not (ISO-2022-compatible)
 535 Latin.  If @code{buffer-file-coding-system} is safe for the charsets
 536 actually present in the buffer, return it.  Otherwise, ask the user to
 537 choose a coding system, and return that.
 538
 539 This function does @emph{not} do the safe thing when
 540 @code{buffer-file-coding-system} is nil (aka no-conversion).  It
 541 considers that ``non-Latin,'' and passes it on to the Mule detection
 542 mechanism.
 543
 544 This function is intended for use as a @code{write-region-pre-hook}.  It
 545 does nothing except return @var{coding-system} if @code{write-region}
 546 handlers are inhibited.
 547 @end defun
 548
 549
 550 @defun latin-unity-buffer-representations-feasible
 551
 552 There are no arguments.
 553
 554 Apply @code{latin-unity-region-representations-feasible} to the current buffer.
 555 @end defun
 556
 557
 558 @defun latin-unity-region-representations-feasible begin end &optional buf
 559
 560 Return character sets that can represent the text from @var{begin} to @var{end} in @var{buf}.
 561
 562 @var{buf} defaults to the current buffer.  When called interactively, the
 563 function is applied to the region.  It assumes @var{begin} <= @var{end}.
 564
 565 The return value is a cons.  The car is the list of character sets
 566 that can individually represent all of the non-ASCII portion of the
 567 buffer, and the cdr is the list of character sets that can
 568 individually represent all of the ASCII portion.
 569
 570 The following is taken from a comment in the source.  Please refer to
 571 the source to be sure of an accurate description.
 572
 573 The basic algorithm is to map over the region, compute the set of
 574 charsets that can represent each character (the ``feasible charset''),
 575 and take the intersection of those sets.
 576
 577 The current implementation takes advantage of the fact that ASCII
 578 characters are common and cannot change asciisets.  Then using
 579 skip-chars-forward makes motion over ASCII subregions very fast.
 580
 581 This same strategy could be applied generally by precomputing classes
 582 of characters equivalent according to their effect on latinsets, and
 583 adding a whole class to the skip-chars-forward string once a member is
 584 found.
 585
 586 Probably efficiency is a function of the number of characters matched,
 587 or maybe the length of the match string?  With @code{skip-category-forward}
 588 over a precomputed category table it should be really fast.  In practice
 589 for Latin character sets there are only 29 classes.
 590 @end defun
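
The basic algorithm described above can be sketched in a few lines of
Lisp.  This is an illustration only, not the package's implementation:
the function names, the table, and the use of symbols in place of real
characters and character sets are all hypothetical.

@example
(defun my-intersection (a b)
  "Return the elements of list A that are also members of list B."
  (let (result)
    (while a
      (if (memq (car a) b)
          (setq result (cons (car a) result)))
      (setq a (cdr a)))
    (nreverse result)))

(defun my-feasible-charsets (chars table candidates)
  "Intersect the feasible charsets of every element of CHARS.
TABLE is an alist mapping each character to the charsets that can
represent it.  CANDIDATES is the initial list of charsets."
  (let ((feasible candidates))
    (while chars
      (setq feasible
            (my-intersection feasible (cdr (assq (car chars) table))))
      (setq chars (cdr chars)))
    feasible))

;; Hypothetical data: e-acute is in all three charsets, the Euro sign
;; only in the third.
(my-feasible-charsets
 '(e-acute euro-sign)
 '((e-acute latin-1 latin-2 latin-9)
   (euro-sign latin-9))
 '(latin-1 latin-2 latin-9))
;; => (latin-9)
@end example

Each character can only shrink the candidate list, which is why a run of
characters that all leave it unchanged (such as ASCII) can be skipped in
one step.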
 591
 592
 593 @defun latin-unity-remap-region begin end character-set &optional coding-system
 594
 595 Remap characters between @var{begin} and @var{end} to equivalents in
 596 @var{character-set}.  Optional argument @var{coding-system} may be a
 597 coding system name (a symbol) or nil.  Characters with no equivalent are
 598 left as-is.
 599
 600 When called interactively, @var{begin} and @var{end} are set to the
 601 beginning and end, respectively, of the active region, and the function
 602 prompts for @var{character-set}.  The function does completion, knows
 603 how to guess a character set name from a coding system name, and also
 604 provides some common aliases.  See @code{latin-unity-guess-charset}.
 605 There is no way to specify @var{coding-system}, as it has no useful
 606 function interactively.
 607
 608 Return @var{coding-system} if @var{coding-system} can encode all
 609 characters in the region, t if @var{coding-system} is nil and the coding
 610 system with G0 = 'ascii and G1 = @var{character-set} can encode all
 611 characters, and otherwise nil.  Note that a non-null return does
 612 @emph{not} mean it is safe to write the file, only the specified region.
 613 (This behavior is useful for multipart MIME encoding and the like.)
 614
 615 Note:  by default this function is quite fascist about universal coding
 616 systems.  It only admits @samp{utf-8}, @samp{iso-2022-7}, and
 617 @samp{ctext}.  Customize @code{latin-unity-approved-ucs-list} to change
 618 this.
 619
 620 This function remaps characters that are artificially distinguished by Mule
 621 internal code.  It may change the code point as well as the character set.
 622 To recode characters that were decoded in the wrong coding system, use
 623 @code{latin-unity-recode-region}.
 624 @end defun
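
A call from Lisp might look like the following sketch.  The character
set symbol @code{latin-iso8859-2} is an assumption about the name Mule
uses for ISO 8859/2; substitute whatever character set name your
XEmacs provides.

@example
;; Remap the whole buffer to ISO 8859/2 where equivalents exist.
;; With no CODING-SYSTEM argument, the result is t when ASCII plus the
;; given character set can encode everything, and nil otherwise.
(if (latin-unity-remap-region (point-min) (point-max) 'latin-iso8859-2)
    (message "Buffer fits in ASCII plus ISO 8859/2")
  (message "Some characters could not be remapped to ISO 8859/2"))
@end example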
 625
 626
 627 @defun latin-unity-recode-region begin end wrong-cs right-cs
 628
 629 Recode characters between @var{begin} and @var{end} from @var{wrong-cs}
 630 to @var{right-cs}.
 631
 632 @var{wrong-cs} and @var{right-cs} are character sets.  Characters retain
 633 the same code point but the character set is changed.  Only characters
 634 from @var{wrong-cs} are changed to @var{right-cs}.  The identity of the
 635 character may change.  Note that this could be dangerous, if characters
 636 whose identities you do not want changed are included in the region.
 637 This function cannot guess which characters you want changed, and which
 638 should be left alone.
 639
 640 When called interactively, @var{begin} and @var{end} are set to the
 641 beginning and end, respectively, of the active region, and the function
 642 prompts for @var{wrong-cs} and @var{right-cs}.  The function does
 643 completion, knows how to guess a character set name from a coding system
 644 name, and also provides some common aliases.  See
 645 @code{latin-unity-guess-charset}.
 646
 647 Another way to accomplish this, but using coding systems rather than
 648 character sets to specify the desired recoding, is
 649 @samp{latin-unity-recode-coding-region}.  That function may be faster
 650 but is somewhat more dangerous, because it may recode more than one
 651 character set.
 652
 653 To change from one Mule representation to another without changing identity
 654 of any characters, use @samp{latin-unity-remap-region}.
 655 @end defun
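
For example, text that really is ISO 8859/2 but was decoded as
ISO 8859/1 could be repaired along these lines.  The character set
symbols are assumptions about the names Mule uses, and the region must
contain only text you actually want reinterpreted.

@example
;; Reinterpret ISO 8859/1 characters in the active region as the
;; ISO 8859/2 characters with the same code points.
(latin-unity-recode-region (region-beginning) (region-end)
                           'latin-iso8859-1 'latin-iso8859-2)
@end example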
 656
 657
 658 @defun latin-unity-recode-coding-region begin end wrong-cs right-cs
 659
 660 Recode text between @var{begin} and @var{end} from @var{wrong-cs} to
 661 @var{right-cs}.
 662
 663 @var{wrong-cs} and @var{right-cs} are coding systems.  Characters retain
 664 the same code point but the character set is changed.  The identity of
 665 characters may change.  This is an inherently dangerous function;
 666 multilingual text may be recoded in unexpected ways.  #### It's also
 667 dangerous because the coding systems are not sanity-checked in the
 668 current implementation.
 669
 670 When called interactively, @var{begin} and @var{end} are set to the
 671 beginning and end, respectively, of the active region, and the function
 672 prompts for @var{wrong-cs} and @var{right-cs}.  The function does
 673 completion, knows how to guess a coding system name from a character set
 674 name, and also provides some common aliases.  See
 675 @code{latin-unity-guess-coding-system}.
 676
 677 Another, safer, way to accomplish this, using character sets rather than coding
 678 systems to specify the desired recoding, is to use @code{latin-unity-recode-region}.
 679
 680 To change from one Mule representation to another without changing identity
 681 of any characters, use @code{latin-unity-remap-region}.
 682 @end defun
 683
 684
 685 Helper functions for input of coding system and character set names.
 686
 687 @defun latin-unity-guess-charset candidate
 688 Guess a charset based on the symbol @var{candidate}.
 689
 690 @var{candidate} itself is not tried as the value.
 691
 692 Uses the natural mapping in @samp{latin-unity-cset-codesys-alist}, and
 693 the values in @samp{latin-unity-charset-alias-alist}.
 694 @end defun
 695
 696 @defun latin-unity-guess-coding-system candidate
 697 Guess a coding system based on the symbol @var{candidate}.
 698
 699 @var{candidate} itself is not tried as the value.
 700
 701 Uses the natural mapping in @samp{latin-unity-cset-codesys-alist}, and
 702 the values in @samp{latin-unity-coding-system-alias-alist}.
 703 @end defun
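
For illustration, these helpers might be called from Lisp as follows.
The alias @code{latin-2} is taken from the default value of
@code{latin-unity-charset-alias-alist}; what each call returns depends
on the alists in effect, so no results are shown.

@example
;; Sketch only: ask for a charset and a coding system by alias.
(latin-unity-guess-charset 'latin-2)
(latin-unity-guess-coding-system 'latin-2)
@end example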
 704
 705
 706 @defun latin-unity-example
 707
 708 A cheesy example for @pkgname{}.
 709
 710 At present it just makes a multilingual buffer.  To test, @code{setq}
 711 @code{buffer-file-coding-system} to some value, make the buffer dirty (eg
 712 with @kbd{RET BackSpace}), and save.
 713 @end defun
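
The test procedure just described could be carried out by evaluating
forms such as these, for instance with @kbd{M-:}.  This is a sketch: it
assumes that the example buffer is current when the second form is
evaluated, and @code{iso-8859-2} is only one possible choice of coding
system.

@example
;; Create the multilingual example buffer.
(latin-unity-example)

;; Then, with that buffer current, pick a coding system that cannot
;; represent everything in it.
(setq buffer-file-coding-system 'iso-8859-2)
@end example

After making the buffer dirty and saving, @pkgname{} should get a chance
to examine the buffer, provided it has been installed
(@pxref{Configuration}).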
 714
 715 @defun latin-unity-test
 716
 717 A simple automated test suite for latin-unity.
 718 @end defun
 719
 720
 721 @node Installation, Configuration, Usage, Top
 722 @chapter Installing @pkgname{} with your (X)Emacs
 723
 724 @pkgname{} may be installed from XEmacs via the package user interface
 725 (accessible from the @samp{Tools} menu or via @kbd{M-x list-packages}).
 726
 727 You can also download the @file{latin-unity-@var{version}-pkg.tar.gz}
 728 tarball from @url{ftp://ftp.xemacs.org/pub/xemacs/packages/}, and simply
 729 unpack it in the usual place.
 730
 731 @pkgname{} sources are available from XEmacs's CVS repository.  The
 732 module name is @samp{latin-unity}.  See
 733 @uref{http://www.xemacs.org/Develop/cvsaccess.html} for more
 734 information about XEmacs's CVS repository.
 735
 736
 737 @node Configuration, Bug Reports, Installation, Top
 738 @chapter Configuring @pkgname{} for Use
 739
 740 If you want @pkgname{} to be automatically initialized, invoke
 741 @samp{latin-unity-install} with no arguments in your init file.
 742 @xref{Init File, , , xemacs}.  If you are using GNU Emacs or an XEmacs
 743 earlier than 21.1, you should also load @file{auto-autoloads} using the
 744 full path (@emph{never} @samp{require} @file{auto-autoloads} libraries).
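
A minimal init file fragment along these lines might be the following.
The path in the commented-out @code{load} form is purely hypothetical;
use the location where the package is actually installed.

@example
;; Only needed for GNU Emacs or XEmacs earlier than 21.1 (hypothetical
;; path; adjust to your installation):
;; (load "/usr/local/lib/xemacs/mule-packages/lisp/latin-unity/auto-autoloads")

;; Initialize latin-unity automatically.
(latin-unity-install)
@end example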
 745
 746 You may wish to define aliases for commonly used character sets and
 747 coding systems for convenience in input.
 748
 749 @defopt latin-unity-charset-alias-alist
 750 Alist mapping aliases to Mule charset names (symbols).
 751
 752 The default value is
 753 @example
 754    ((latin-1 . latin-iso8859-1)
 755     (latin-2 . latin-iso8859-2)
 756     (latin-3 . latin-iso8859-3)
 757     (latin-4 . latin-iso8859-4)
 758     (latin-5 . latin-iso8859-9)
 759     (latin-9 . latin-iso8859-15)
 760     (latin-10 . latin-iso8859-16))
 761 @end example
 762
 763 If a charset does not exist on your system, it will not complete and you
 764 will not be able to enter it in response to prompts.  A real charset
 765 with the same name as an alias in this list will shadow the alias.
 766 @end defopt
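
For example, to add an alias for the ISO 8859/13 charset to the
defaults, something like the following could go in your init file.  The
alias name @code{latin-7} is chosen here for illustration only, and
setting the variable with @code{setq} rather than through Customize is
just one way to do it.

@example
(setq latin-unity-charset-alias-alist
      '((latin-1 . latin-iso8859-1)
        (latin-2 . latin-iso8859-2)
        (latin-3 . latin-iso8859-3)
        (latin-4 . latin-iso8859-4)
        (latin-5 . latin-iso8859-9)
        (latin-9 . latin-iso8859-15)
        (latin-10 . latin-iso8859-16)
        ;; Illustrative addition, not part of the default value:
        (latin-7 . latin-iso8859-13)))
@end example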
 767
 768 @defopt latin-unity-coding-system-alias-alist
 769 Alist mapping aliases to Mule coding system names (symbols).
 770
 771 The default value is @samp{nil}.
 772 @end defopt
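
Since there are no default aliases, a sketch of a setting might be as
follows.  Both alias names here are hypothetical; the coding systems
they map to are listed in @ref{Charsets and Coding Systems}.

@example
;; Hypothetical aliases for two commonly typed coding systems.
(setq latin-unity-coding-system-alias-alist
      '((latin-2 . iso-8859-2)
        (utf8 . utf-8)))
@end example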
 773
 774
 775 @node Bug Reports, Frequently Asked Questions, Configuration, Top
 776 @chapter Reporting Bugs and Problems
 777
 778 Please report bugs to the author, @email{stephen@@xemacs.org,Stephen
 779 Turnbull}, or to the developers' mailing list,
 780 @email{xemacs-beta@@xemacs.org, XEmacs Beta}.
 781
 782 Suggestions for improvement are welcome at the same addresses.
 783
 784
 785 @node Frequently Asked Questions, Theory of Operation, Bug Reports, Top
 786 @chapter Frequently Asked Questions
 787
 788 @enumerate
 789 @item
 790 I'm smarter than latin-unity!  How can that be?
 791
 792 Don't be surprised.  Trust yourself.
 793
 794 latin-unity is very young as yet.  Teach it what you know by Customizing
 795 its variables, and report your changes to the maintainer (@pxref{Bug
 796 Reports}).
 797
 798 @item
 799 What is a UCS?
 800
 801 According to ISO 10646, a Universal Coded character Set.  In
 802 latin-unity, it's a Universal (Mule) Coding System.
 803 @xref{Coding Systems, , , xemacs}.
 804
 805 @item
 806 I know utf-16-le-bom is a UCS, but latin-unity won't use it.  Why not?
 807
 808 There are an awful lot of UCSes in Mule, and you probably do not want to
 809 ever use, and definitely not be asked about, most of them.  So the
 810 default set includes a few that the author thought plausible, but
 811 they're surely not comprehensive or optimal.
 812
 813 Customize @code{latin-unity-ucs-list} to include the ones you use, and
 814 report your favorites to the maintainer for consideration for inclusion
 815 in the defaults (@pxref{Bug Reports}).  (Note that you @emph{must} include
 816 @code{escape-quoted} in this list, because Mule uses it internally as
 817 the coding system for auto-save files.)
 818
 819 Alternatively, if you just want to use it this one time, simply type it
 820 in at the prompt.  latin-unity will confirm that it is a real coding
 821 system, and then assume that you know what you're doing.
 822
 823 @item
 824 This is crazy: I can't quit XEmacs, and I get queried on autosaves!  Why?
 825
 826 You probably removed @code{escape-quoted} from
 827 @code{latin-unity-ucs-list}.  Put it back.
 828
 829 @item
 830 latin-unity is really buggy and I can't get any work done.
 831
 832 First, use @kbd{M-x latin-unity-uninstall RET}, then report your
 833 problems as a bug (@pxref{Bug Reports}).
 834 @end enumerate
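
To make the advice about @code{latin-unity-ucs-list} concrete, a setting
along the following lines could be used.  This is only a sketch: apart
from @code{escape-quoted}, which must be present, and
@code{utf-16-le-bom}, which is the coding system asked about above, the
members shown are assumed to be a plausible list, not necessarily the
default value.

@example
(setq latin-unity-ucs-list
      '(utf-8 iso-2022-7 ctext
        escape-quoted        ; required: used for auto-save files
        utf-16-le-bom))      ; the addition discussed above
@end example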
 835
 836
 837 @node Theory of Operation, What latin-unity Cannot Do for You, Frequently Asked Questions, Top
 838 @chapter Theory of Operation
 839
 840 Standard encodings suffer from the design defect that they do not
 841 provide a reliable way to recognize which coded character sets are in use.
 842 @xref{What latin-unity Cannot Do for You}.  There are scores of
 843 character sets which can be represented by a single octet (8-bit byte),
 844 whose union contains many hundreds of characters.  Obviously this
 845 results in great confusion, since you can't tell the players without a
 846 scorecard, and there is no scorecard.
 847
 848 There are two ways to solve this problem.  The first is to create a
 849 universal coded character set.  This is the concept behind Unicode.
 850 Satisfactory (nearly) universal character sets have been available
 851 for several decades, yet even today many Westerners resist using Unicode
 852 because they consider its space requirements excessive.  On the other
 853 hand, Asians dislike Unicode because they consider it to be incomplete.
 854 (This is partly, but not entirely, political.)
 855
 856 In any case, Unicode only solves the internal representation problem.
 857 Many data sets will contain files in ``legacy'' encodings, and Unicode
 858 does not help distinguish among them.
 859
 860 The second approach is to embed information about the encodings used in
 861 a document in its text.  This approach is taken by the ISO 2022
 862 standard.  This would solve the problem completely from the users' point of
 863 view, except that ISO 2022 is basically not implemented at all, in the
 864 sense that few applications or systems implement more than a small
 865 subset of ISO 2022 functionality.  This is due to the fact that
 866 mono-literate users object to the presence of escape sequences in their
 867 texts (which they, with some justification, consider data corruption).
 868 Programmers are more than willing to cater to these users, since
 869 implementing ISO 2022 is a painstaking task.
 870
 871 In fact, Emacs/Mule adopts both of these approaches.  Internally it uses
 872 a universal character set, @dfn{Mule code}.  Externally it uses ISO 2022
 873 techniques both to save files in forms robust to encoding issues, and as
 874 hints when attempting to ``guess'' an unknown encoding.  However, Mule
 875 suffers from a design defect, namely that it embeds in each character
 876 the character set information that ISO 2022 attaches to runs of
 877 characters by introducing them with a control sequence.  That causes Mule to
 878 consider the ISO Latin character sets to be disjoint.  This manifests
 879 itself when a user enters characters using input methods associated with
 880 different coded character sets into a single buffer.
 881
 882 There are two problems stemming from this design.  First, Mule
 883 represents the same character in different ways.  Abstractly, '@'o'
 884 (LATIN SMALL LETTER O WITH ACUTE) can get represented as
 885 [latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73].  So what looks like
 886 '@'o@'o' in the display might actually be represented [latin-iso8859-1
 887 #x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B
 888 #xF3 ESC - A] in the file.  In some cases this treatment would be
 889 appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00
 890 (the CJK ideographic character meaning ``one'')), and although arguably
 891 incorrect it is convenient when mixing the CJK scripts.  But in the case
 892 of the Latin scripts this is wrong.
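
The first problem can be observed directly from Lisp.  The following
sketch assumes a Mule-enabled XEmacs in which both charsets exist; the
variable names are arbitrary.

@example
(let ((o1 (make-char 'latin-iso8859-1 #x73))
      (o2 (make-char 'latin-iso8859-2 #x73)))
  ;; Both display as LATIN SMALL LETTER O WITH ACUTE, but each carries
  ;; its own charset, so Mule treats them as different characters.
  (list (char-charset o1)
        (char-charset o2)
        (eq o1 o2)))
@end example

If the description above holds, the last element of the result is
@code{nil}.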
 893
 894 Worse yet, it is very likely to occur when mixing ``different'' encodings
 895 (such as ISO 8859/1 and ISO 8859/15) that differ only in a few code
 896 points that are almost never used.  A very important example involves
 897 email.  Many sites, especially in the U.S., default to use of the ISO
 898 8859/1 coded character set (also called ``Latin 1,'' though these are
 899 somewhat different concepts).  However, ISO 8859/1 provides a generic
 900 CURRENCY SIGN character.  Now that the Euro has become the official
 901 currency of most countries in Europe, this is unsatisfactory (and in
 902 practice, useless).  So Europeans generally use ISO 8859/15, which is
 903 nearly identical to ISO 8859/1 for most languages, except that it
 904 substitutes EURO SIGN for CURRENCY SIGN.
 905
 906 Suppose a European user yanks text from a post encoded in ISO 8859/1
 907 into a message composition buffer, and enters some text including the
 908 Euro sign.  Then Mule will consider the buffer to contain both ISO
 909 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
 910 programmed) send the message as a multipart mixed MIME body!
 911
 912 This is clearly stupid.  What is not as obvious is that, just as any
 913 European can include American English in their text because ASCII is a
 914 subset of ISO 8859/15, most European languages which use Latin
 915 characters (eg, German and Polish) can typically be mixed while using
 916 only one Latin coded character set (in the case of German and Polish,
 917 ISO 8859/2).  However, this often depends on exactly what text is to be
 918 encoded (even for the same pair of languages).
 919
 920 @pkgname{} works around the problem by converting as many characters as
 921 possible to use a single Latin coded character set before saving the
 922 buffer.
 923
 924 Because the problem is rarely noticeable in editing a buffer, but tends
 925 to manifest when that buffer is exported to a file or process, the
 926 @pkgname{} package uses the strategy of examining the buffer prior to
 927 export.  If use of multiple Latin coded character sets is detected,
 928 @pkgname{} attempts to unify them by finding a single coded character
 929 set which contains all of the Latin characters in the buffer.
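
The kind of check involved can be approximated by hand.  The sketch
below is not the code @pkgname{} uses; it merely assumes that the Mule
function @code{charsets-in-region} is available and returns charset
names as symbols.  It lists the Latin charsets, from among those
discussed in this manual, that occur in the current buffer.

@example
(let ((latin '(latin-iso8859-1 latin-iso8859-2 latin-iso8859-3
               latin-iso8859-4 latin-iso8859-9 latin-iso8859-13
               latin-iso8859-15 latin-iso8859-16)))
  ;; Keep each charset present in the buffer that is a Latin one.
  ;; More than one element in the result means that unification is
  ;; worth attempting.
  (delq nil
        (mapcar (lambda (cs) (and (memq cs latin) cs))
                (charsets-in-region (point-min) (point-max)))))
@end example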
 930
 931 The primary purpose of @pkgname{} is to fix the problem by giving the
 932 user the choice to change the representation of all characters to one
 933 character set, and by giving sensible recommendations based on context.  In
 934 the '@'o' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and
 935 both will be suggested.  In the EURO SIGN example, only ISO 8859/15
 936 makes sense, and that is what will be recommended.  In both cases, the
 937 user will be reminded that there are universal encodings available.
 938
 939 I call this @dfn{remapping} (from the universal character set to a
 940 particular ISO 8859 coded character set).  It is mere accident that this
 941 letter has the same code point in both character sets.  (Not entirely,
 942 but there are many examples of Latin characters that have different code
 943 points in different Latin-X sets.)
 944
 945 Note that, in the '@'o' example, treating the buffer in this way will
 946 result in a representation such as [latin-iso8859-2
 947 #x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3].
 948 This is guaranteed to occasionally result in the second problem, to
 949 which we now turn.
 950
 951 This problem is that, although the file is intended to be an
 952 ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every POSIX
 953 compliant program---this is required by the standard, obvious if you
 954 think a bit, @pxref{What latin-unity Cannot Do for You}) will read that
 955 file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73].  Of course this
 956 is no problem if all of the characters in the file are contained in ISO
 957 8859/1, but suppose there are some which are not, but are contained in
 958 the (intended) ISO 8859/2.
 959
 960 You now want to fix this, but not by finding the same character in
 961 another set.  Instead, you want to simply change the character set that
 962 Mule associates with that buffer position without changing the code.
 963 (This is conceptually somewhat distinct from the first problem, and
 964 logically ought to be handled in the code that defines coding systems.
 965 However, @pkgname{} is not an unreasonable place for it.)  @pkgname{}
 966 provides two functions (one fast and dangerous, the other slow and
 967 careful) to handle this.  I call this @dfn{recoding}, because the
 968 transformation actually involves @emph{encoding} the buffer to file
 969 representation, then @emph{decoding} it to buffer representation (in a
 970 different character set).  This cannot be done automatically because
 971 Mule can have no idea what the correct encoding is---after all, it
 972 already gave you its best guess.  @xref{What latin-unity Cannot Do for
 973 You}.  So these functions must be invoked by the user.  @xref{Interactive
 974 Usage}.
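
The encode-then-decode idea can be sketched as follows.  This is an
illustration of the principle, not the implementation in @pkgname{}:
the function name is hypothetical, and no checking of the coding
systems is done.

@example
(defun my-recode-sketch (begin end wrong-cs right-cs)
  "Reinterpret the text between BEGIN and END.
The text was decoded as WRONG-CS but is really in RIGHT-CS."
  (save-restriction
    (narrow-to-region begin end)
    ;; Go back to the octets that were in the file ...
    (encode-coding-region (point-min) (point-max) wrong-cs)
    ;; ... then read them again in the intended coding system.
    (decode-coding-region (point-min) (point-max) right-cs)))
@end example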
 975
 976
 977 @node What latin-unity Cannot Do for You, Interfaces, Theory of Operation, Top
 978 @chapter What latin-unity Cannot Do for You
 979
 980 @pkgname{} @strong{cannot} save you if you insist on exporting data in
 981 8-bit encodings in a multilingual environment.  @emph{You will
 982 eventually corrupt data if you do this.}  It is not Mule's, or any
 983 application's, fault.  You will have only yourself to blame; consider
 984 yourself warned.  (It is true that Mule has bugs, which make Mule
 985 somewhat more dangerous and inconvenient than some naive applications.
 986 We're working to address those, but no application can remedy the
 987 inherent defect of 8-bit encodings.)
 988
 989 Use standard universal encodings, preferably Unicode (UTF-8) unless
 990 applicable standards indicate otherwise.  The most important such case
 991 is Internet messages, where MIME should be used, whether or not the
 992 subordinate encoding is a universal encoding.  (Note that since one of
 993 the important provisions of MIME is the @samp{Content-Type} header,
 994 which has the charset parameter, MIME is to be considered a universal
 995 encoding for the purposes of this manual.  Of course, technically
 996 speaking it's neither a coded character set nor a coding extension
 997 technique compliant with ISO 2022.)
 998
 999 As mentioned earlier, the problem is that standard encodings suffer from
1000 the design defect that they do not provide a reliable way to recognize
1001 which coded character sets are in use.  There are scores of character
1002 sets which can be represented by a single octet (8-bit byte), whose
1003 union contains many hundreds of characters.  Thus any 8-bit coded
1004 character set must contain characters that share code points used for
1005 different characters in other coded character sets.
1006
1007 This means that a given file's intended encoding cannot be identified
1008 with 100% reliability unless it contains encoding markers such as those
1009 provided by MIME or ISO 2022.
1010
1011 @pkgname{} actually makes it more likely that you will have problems of
1012 this kind.  Traditionally Mule has been ``helpful'' by simply using an
1013 ISO 2022 universal coding system when the current buffer coding system
1014 cannot handle all the characters in the buffer.  This has the effect
1015 that, because the file contains control sequences, it is not recognized
1016 as being in the locale's normal 8-bit encoding.  It may be annoying if
1017 you are not a Mule expert, but your data is automatically recoverable
1018 with a tool you already have: Mule.
1019
1020 However, with @pkgname{}, Mule converts to a single 8-bit character set
1021 when possible.  But typically this will @emph{not} be in your usual
1022 locale.  Ie, the times that an ISO 8859/1 user will need @pkgname{} are
1023 when there are ISO 8859/2 characters in the buffer.  But then most
1024 likely the file will be saved in a pure 8-bit encoding that is not ISO
1025 8859/1, ie, ISO 8859/2.  Mule's autorecognizer (which is probably the
1026 most sophisticated yet available) cannot tell the difference between ISO
1027 8859/1 and ISO 8859/2, and in a Western European locale will choose the
1028 former even though the latter was intended.  Even the extension
1029 (``statistical recognition'') planned for XEmacs 22 is unlikely to be at
1030 all accurate in the case of mixed codes.
1031
1032 So now consider adding some additional ISO 8859/1 text to the buffer.
1033 If it includes any ISO 8859/1 codes that are used by different
1034 characters in ISO 8859/2, you now have a file that cannot be
1035 mechanically disentangled.  You need a human being who can recognize
1036 that @emph{this is German and Swedish} and stays in Latin-1, while
1037 @emph{that is Polish} and needs to be recoded to Latin-2.
1038
1039 Moral: switch to a universal coded character set, preferably Unicode
1040 using the UTF-8 transformation format.  If you really need the space,
1041 compress your files.
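
For a single buffer, following this advice might be as simple as the
sketch below.  It assumes that a @code{utf-8} coding system is available
in your XEmacs (@pxref{Charsets and Coding Systems}).

@example
;; Save the current buffer as UTF-8 from now on.
(set-buffer-file-coding-system 'utf-8)
@end example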
1042
1043
1044 @node Interfaces, Charsets and Coding Systems, What latin-unity Cannot Do for You, Top
1045 @chapter Interfaces
1046
1047 Various recent---dating from the end of the nineties---ISO 8859 standard
1048 language environments used to be provided with this package, but they
1049 have been refactored out into their own package,
1050 @i{latin-euro-standards}, on which this one depends.  See the
1051 documentation for that package if you're interested in using those from
1052 Lisp.
1053
1054 @node Charsets and Coding Systems, Internals, Interfaces, Top
1055 @chapter Charsets and Coding Systems
1056
1057 This section provides reference lists of Mule charsets and coding
1058 systems.  Mule charsets are typically named by character set and
1059 standard.
1060
1061 @table @strong
1062 @item ASCII variants
1063
1064 Identification of equivalent characters in these sets is not properly
1065 implemented.  @pkgname{} does not distinguish the two charsets.
1066
1067 @samp{ascii} @samp{latin-jisx0201}
1068
1069 @item Extended Latin
1070
1071 Characters from the following ISO 2022 conformant charsets are
1072 identified with equivalents in other charsets in the group by
1073 @pkgname{}.
1074
1075 @samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2}
1076 @samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9}
1077 @samp{latin-iso8859-13} @samp{latin-iso8859-16}
1078
1079 The following charsets are Latin variants which are not understood by
1080 @pkgname{}.  In addition, many of the Asian language standards provide
1081 ASCII, at least, and sometimes other Latin characters.  None of these
1082 are identified with their ISO 8859 equivalents.
1083
1084 @samp{vietnamese-viscii-lower}
1085 @samp{vietnamese-viscii-upper}
1086
1087 @item Other character sets
1088
1089 @samp{arabic-1-column}
1090 @samp{arabic-2-column}
1091 @samp{arabic-digit}
1092 @samp{arabic-iso8859-6}
1093 @samp{chinese-big5-1}
1094 @samp{chinese-big5-2}
1095 @samp{chinese-cns11643-1}
1096 @samp{chinese-cns11643-2}
1097 @samp{chinese-cns11643-3}
1098 @samp{chinese-cns11643-4}
1099 @samp{chinese-cns11643-5}
1100 @samp{chinese-cns11643-6}
1101 @samp{chinese-cns11643-7}
1102 @samp{chinese-gb2312}
1103 @samp{chinese-isoir165}
1104 @samp{cyrillic-iso8859-5}
1105 @samp{ethiopic}
1106 @samp{greek-iso8859-7}
1107 @samp{hebrew-iso8859-8}
1108 @samp{ipa}
1109 @samp{japanese-jisx0208}
1110 @samp{japanese-jisx0208-1978}
1111 @samp{japanese-jisx0212}
1112 @samp{katakana-jisx0201}
1113 @samp{korean-ksc5601}
1114 @samp{sisheng}
1115 @samp{thai-tis620}
1116 @samp{thai-xtis}
1117
1118 @item Non-graphic charsets
1119
1120 @samp{control-1}
1121 @end table
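
Whether a given charset is defined in a running XEmacs can be checked
from Lisp.  The following sketch assumes the Mule function
@code{find-charset}, which is expected to return @code{nil} for an
unknown charset; it returns the subset of the listed names that exist.

@example
(delq nil
      (mapcar (lambda (name) (and (find-charset name) name))
              '(latin-iso8859-1 latin-iso8859-2 latin-iso8859-3
                latin-iso8859-4 latin-iso8859-9 latin-iso8859-13
                latin-iso8859-15 latin-iso8859-16)))
@end example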
1122
1123 @table @strong
1124 @item No conversion
1125
1126 Some of these coding systems may specify EOL conventions.  Note that
1127 @samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022
1128 coding system.  Although @pkgname{} attempts to compensate for this, it
1129 is possible that the @samp{iso-8859-1} coding system will behave
1130 differently from other ISO 8859 coding systems.
1131
1132 @samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1}
1133
1134 @item Latin coding systems
1135
1136 These coding systems are all single-byte, 8-bit ISO 2022 coding systems,
1137 combining ASCII in the GL register (bytes with high-bit clear) and an
1138 extended Latin character set in the GR register (bytes with high-bit set).
1139
1140 @samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4}
1141 @samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16}
1142
1143 These coding systems are single-byte, 8-bit coding systems that do not
1144 conform to international standards.  They should be avoided in all
1145 potentially multilingual contexts, including any text distributed over
1146 the Internet and World Wide Web.
1147
1148 @samp{windows-1251}
1149
1150 @item Multilingual coding systems
1151
1152 The following ISO-2022-based coding systems are useful for multilingual
1153 text.
1154
1155 @samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit}
1156 @samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2}
1157
1158 XEmacs also supports Unicode with the Mule-UCS package.  These are the
1159 preferred coding systems for multilingual use.  (There is a possible
1160 exception for texts that mix several Asian ideographic character sets.)
1161
1162 @samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le}
1163 @samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe}
1164 @samp{utf-8} @samp{utf-8-ws}
1165
1166 Development versions of XEmacs (the 21.5 series) support Unicode
1167 internally, with (at least) the following coding systems implemented:
1168
1169 @samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le}
1170 @samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom}
1171
1172 @item Asian ideographic languages
1173
1174 The following coding systems are based on ISO 2022, and are more or less
1175 suitable for encoding multilingual texts.  They all can represent ASCII
1176 at least, and sometimes several other foreign character sets, without
1177 resort to arbitrary ISO 2022 designations.  However, these subsets are
1178 not identified with the corresponding national standards in XEmacs Mule.
1179
1180 @samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312}
1181 @samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc}
1182 @samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp}
1183 @samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr}
1184 @samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1}
1185
1186 The following coding systems cannot be used for general multilingual
1187 text and do not cooperate well with other coding systems.
1188
1189 @samp{big5} @samp{shift_jis}
1190
1191 @item Other languages
1192
1193 The following coding systems are based on ISO 2022.  Though none of them
1194 provides any Latin characters beyond ASCII, XEmacs Mule allows (and up
1195 to 21.4 defaults to) use of ISO 2022 control sequences to designate
1196 other character sets for inclusion in the text.
1197
1198 @samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8}
1199 @samp{ctext-hebrew}
1200
1201 The following are coding systems that do not conform to ISO 2022 and
1202 thus cannot be safely used in a multilingual context.
1203
1204 @samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr}
1205 @samp{viscii} @samp{vscii}
1206
1207 @item Special coding systems
1208
1209 Mule uses the following coding systems for special purposes.
1210
1211 @samp{automatic-conversion} @samp{undecided} @samp{escape-quoted}
1212
1213 @samp{escape-quoted} is especially important, as it is used internally
1214 as the coding system for autosaved data.
1215
1216 The following coding systems are aliases for others, and are used for
1217 communication with the host operating system.
1218
1219 @samp{file-name} @samp{keyboard} @samp{terminal}
1220
1221 @end table
1222
1223 Mule detection of coding systems is actually limited to detection of
1224 classes of coding systems called @dfn{coding categories}.  These coding
1225 categories are identified by the ISO 2022 control sequences they use, if
1226 any, by their conformance to ISO 2022 restrictions on code points that
1227 may be used, and by characteristic patterns of use of 8-bit code points.
1228
1229 @samp{no-conversion}
1230 @samp{utf-8}
1231 @samp{ucs-4}
1232 @samp{iso-7}
1233 @samp{iso-lock-shift}
1234 @samp{iso-8-1}
1235 @samp{iso-8-2}
1236 @samp{iso-8-designate}
1237 @samp{shift-jis}
1238 @samp{big5}
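
The categories and their current priority order can be inspected from
Lisp.  This sketch assumes the XEmacs Mule functions
@code{coding-category-list} and @code{coding-priority-list}; treat the
names as assumptions if your version differs.

@example
;; All coding categories known to this XEmacs.
(coding-category-list)

;; The order in which categories are tried during detection.
(coding-priority-list)
@end example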
1239
1240
1241 @node Internals, , Charsets and Coding Systems, Top
1242 @chapter Internals
1243
1244 No internals documentation yet.
1245
1246 @file{latin-unity-utils.el} provides one utility function.
1247
1248 @defun latin-unity-dump-tables
1249
1250 Dump the temporary table created by loading @file{latin-unity-utils.el}
1251 to @file{latin-unity-tables.el}.  Loading the latter file initializes
1252 @samp{latin-unity-equivalences}.
1253 @end defun
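
In practice, regenerating the tables might look like this sketch.  It
assumes that @file{latin-unity-utils.el} is on your @code{load-path} and
that the function takes no arguments, as the definition above suggests.

@example
;; Loading the utilities builds the temporary table ...
(load "latin-unity-utils")
;; ... which is then written out to latin-unity-tables.el.
(latin-unity-dump-tables)
@end example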
1254
1255 @c end of latin-unity.texi
1256