Abhijit Dasgupta: Bangla Unicode to `bangtex` converter (perl script)

Typing LaTeX source documents directly in unicode Bangla

A simple solution was to prepare the LaTeX source document using unicode UTF-8 encoded Bangla text, and then use a special script called uni2bangtex.perl to convert it into transliterated ASCII in bangtex format. Since unicode UTF-8 encoding is a superset of ASCII, the ASCII needed to type the LaTeX commands can be freely mixed within the UTF-8 Bangla source text.

This means that I can use any unicode UTF-8 editor to prepare the LaTeX source directly in Bangla, with any appropriate Bangla keyboard input method (phonetic, inscript, etc.) and any Bangla font for UTF-8 (usually truetype or opentype) --- see the screenshot below. If your native language is Bangla, you will probably find this to be a faster, more pleasant, more intuitive, and less error-prone way to type the source LaTeX document than using a specific romanized form (ASCII transliteration) of Bangla.

An Example

Preparing a sample LaTeX Bangla document

Let us go through an example showing how to prepare a sample Bangla LaTeX document named smpldoc:

Step 1 : Prepare the LaTeX source document in unicode UTF-8 Bangla text. Here is a screenshot showing the unicode Bangla LaTeX source text file smpldoc.txt being prepared in the editing window of a unicode editor called yudit using the SolaimanLipi font (you can of course use any unicode UTF-8 editor and Bangla font of your choice).

The source being prepared in a UTF-8 text editor

The editing session displayed above was run by the command
```
        yudit smpldoc.txt
```
where the LaTeX source document smpldoc.txt is the unicode UTF-8 encoded Bangla text file. In the above screenshot you can see that LaTeX commands in ASCII are mixed freely with Bangla characters, a nice feature of the UTF-8 encoding.

Step 2 : Convert the Bangla unicode into bangtex format ASCII transliteration. To convert smpldoc.txt (which is a unicode text file) into bangtex format ASCII transliteration, run the following command:

        uni2bangtex.perl smpldoc.txt > smpldoc.tex

The perl script uni2bangtex.perl parses the unicode UTF-8 encoded Bangla source document, and converts it into transliterated ASCII in bangtex format. (Thus, like seicor, it is a supporting tool for the bangtex package.)

The resulting file smpldoc.tex is a pure ASCII bangtex format LaTeX file with the following contents:

        \documentclass{barticle}
        \title{EkiT pRbn/dh}
        \author{Ajana elkhk}
        \date{pNNicesh {oi}bshakh, 1413}
        \begin{document}
        \bng
        \maketitle
        \section{pRthm pirec/chd}
        rbiin/dRnaethr EkiT kibtar shuru inec ed{O}ya Hl.
        \begin{verse}
        kt Ajanaer jana{I}el tuim \\
        kt gher idel ThNNa{I}. \\
        duurek kirel inkT bn/dhu,\\
        prek kirel bha{I}..
        \end{verse}
        \end{document}

However, I never need to directly view or edit this file, and regard it as an "intermediate machine file".
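
To give a rough feel for what such a conversion involves, here is a toy Python sketch. It is not uni2bangtex.perl, whose internals are not shown on this page, and it handles only a handful of characters. The mapping table is an assumption inferred from two words of the sample output above (একটি কবিতার apparently appearing as `EkiT kibtar`), and the names `toy_transliterate`, `BASE`, `SIGN_AFTER` and `SIGN_BEFORE` are hypothetical. The one point it illustrates is that the i-kar sign follows its consonant in unicode order but comes before it in the transliterated form.

```python
# Toy sketch only: this is NOT uni2bangtex.perl and covers just a few characters.
# The mappings are assumptions inferred from the sample output above.
BASE = {
    "\u098f": "E",  # এ (independent vowel e)
    "\u0995": "k",  # ক
    "\u099f": "T",  # ট
    "\u09a4": "t",  # ত
    "\u09ac": "b",  # ব
    "\u09b0": "r",  # র
}
CONSONANTS = set(BASE) - {"\u098f"}
SIGN_AFTER = {"\u09be": "a"}   # া aa-kar: stays after its consonant
SIGN_BEFORE = {"\u09bf": "i"}  # ি i-kar: moves in front of its consonant


def toy_transliterate(text: str) -> str:
    """Transliterate a tiny subset of Bangla; pass ASCII through unchanged."""
    out = []                    # one output piece per input character
    prev_is_consonant = False
    for ch in text:
        if ch in BASE:
            out.append(BASE[ch])
        elif ch in SIGN_AFTER and prev_is_consonant:
            out.append(SIGN_AFTER[ch])
        elif ch in SIGN_BEFORE and prev_is_consonant:
            # Unicode stores the sign after the consonant; emit it before.
            out.insert(len(out) - 1, SIGN_BEFORE[ch])
        elif ord(ch) < 128:
            out.append(ch)      # LaTeX commands, spaces, digits, ...
        else:
            raise ValueError(f"character {ch!r} is outside this toy table")
        prev_is_consonant = ch in CONSONANTS
    return "".join(out)


# একটি কবিতার, written with escapes so the snippet is itself pure ASCII
print(toy_transliterate("\u098f\u0995\u099f\u09bf \u0995\u09ac\u09bf\u09a4\u09be\u09b0"))
# prints: EkiT kibtar
```

A real converter also has to deal with conjuncts, the other vowel signs, and the braces and slashes visible in the sample output, none of which this sketch attempts.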

Step 3 : Run latex/pdflatex as usual. Finally, run pdflatex (or latex) as usual on the intermediate machine file (the pure ASCII bangtex format LaTeX file) smpldoc.tex:
```
        pdflatex smpldoc.tex
```
This produces the final PDF output smpldoc.pdf which looks like:

The final PDF produced by pdflatex
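
Steps 2 and 3 can also be chained in a small driver. The following Python sketch is one possible way to do so, not part of the script distribution: the function names `build_commands` and `run_pipeline` are hypothetical, and it assumes that `uni2bangtex.perl` and `pdflatex` are executable and on the PATH, exactly as in the command lines shown above.

```python
import subprocess
from pathlib import Path


def build_commands(stem: str, script: str = "uni2bangtex.perl"):
    """Return the external commands of Step 2 and Step 3 as argument lists."""
    return [
        [script, f"{stem}.txt"],       # Step 2: its stdout becomes <stem>.tex
        ["pdflatex", f"{stem}.tex"],   # Step 3: typeset the intermediate file
    ]


def run_pipeline(stem: str) -> Path:
    """Convert <stem>.txt to <stem>.tex, then run pdflatex on the result."""
    convert, typeset = build_commands(stem)
    # Equivalent of the shell redirection "> <stem>.tex" in Step 2.
    with open(f"{stem}.tex", "wb") as tex_file:
        subprocess.run(convert, stdout=tex_file, check=True)
    subprocess.run(typeset, check=True)
    return Path(f"{stem}.pdf")
```

With these assumptions, calling `run_pipeline("smpldoc")` would regenerate smpldoc.tex and then smpldoc.pdf from smpldoc.txt, stopping with an error if either command fails.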

System Requirements

What you need on your computer

For a setup like the one I have described above, you will need to have the following installed on your computer:

- The bangtex package, and the bpsf font package.
- Perl version 5.8.5 or newer, and the script uni2bangtex.perl. The script will not work with old versions of Perl with broken unicode UTF-8 support.
- A UTF-8 Bangla text editing setup, which may very well be present already in your system, as recent versions of most modern operating systems with graphical desktops have UTF-8 based multilingual support. This means any unicode UTF-8 editor, supplied with a font capable of displaying Bangla and a unicode Bangla keyboard input method of your choice (e.g., a phonetic layout such as Probhat). If you are not familiar with this, see the Appendix below for more details.

Download

About unicode and UTF-8

What is unicode and UTF-8?

Unicode is a standard for font-independent and orthographically accurate digital representation of written language using character codes. The role unicode plays for general languages is identical to the role played by the ASCII code for English. In particular, there is a perfect one-to-one correspondence between Bangla unicode and written Bangla which preserves all spellings.

In this way, unicode can be viewed as an extension of ASCII to encode the characters of all other languages. In fact, a specific unicode encoding scheme called UTF-8 is designed in such a way that it is a direct superset of ASCII. Thus a UTF-8 text document can contain ASCII characters, and an ASCII text document is simply a special type of UTF-8 text document.
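
The superset property is easy to check at the byte level. The short Python sketch below is only an illustration (the variable names are arbitrary): it shows that ASCII text is byte-for-byte identical under UTF-8, while a Bangla letter becomes a multi-byte sequence made up entirely of bytes outside the ASCII range.

```python
ascii_text = r"\section{}"          # plain LaTeX markup
ka = "\u0995"                       # ক, BENGALI LETTER KA

# ASCII characters are encoded as the very same single bytes in UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# A Bangla letter becomes a multi-byte sequence whose bytes are all >= 0x80,
# so it can never be mistaken for an ASCII character.
ka_bytes = ka.encode("utf-8")
no_ascii_bytes = all(b >= 0x80 for b in ka_bytes)

# A mixed line: the ASCII markup can be recovered by dropping the high bytes.
mixed = (r"\section{" + ka + "}").encode("utf-8")
ascii_only = bytes(b for b in mixed if b < 0x80).decode("ascii")

print(same_bytes, ka_bytes.hex(), no_ascii_bytes, ascii_only)
# prints: True e0a695 True \section{}
```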

To learn more, see the UTF-8 and Unicode FAQ for Unix/Linux. On recent Linux systems, you can look up the manpages for unicode(7), utf-8(7), and charsets(7).

Why use unicode for encoding Bangla text in this way?

We note the following points in support of unicode for encoding Bangla text.

- We repeat: The role of unicode UTF-8 for encoding Bangla text (and most other languages) is identical to that of ASCII for encoding English text: Unicode provides a standardized, font-independent, and orthographically accurate representation of written language for essentially all languages, making it perfect for digital storage and communication of "pure text" documents.
- In particular, it is as easy and convenient to use a unicode text editor to write Bangla web pages in direct html text or to compose and send text emails in Bangla, as it is to do these things in English. (ASCII had been providing this functionality only for English and a few languages based on the Roman character set.)
- Use of unicode (especially UTF-8) in Bangla documents is now widespread. Websites like the Bangla Wikipedia, the Bangla version of Google's search engine, news media such as Bangla BBC, Bangla Deutsche Welle, and Bangla China Radio International --- all use UTF-8 encoded Bangla web pages as their standard.
- Notably, the entire Rabindra Rachanabali is now on the web thanks to SNLTR, and they plan to add more classic Bangla literature.
- UTF-8 has the remarkable property that written texts of multiple languages (ASCII, Bangla, and most other languages) can be freely mixed in the same text document (while being font independent) and can then be easily separated back again (because different languages have disjoint sets of character codes). This makes it useful not only for TeX/LaTeX, but also for most markup languages (such as xml), and vitally for html. It has been reported that the vast majority of webpages now use UTF-8.
- For bangtex, using unicode for the source text file makes its character encoding standardized, and so the source text file becomes portable for sharing, i.e. readable and usable by most users of Bangla computing such as Bangla magazine editors, even if they are not familiar with TeX/LaTeX (the embedded TeX/LaTeX formatting commands in ASCII do not affect the content too much, and are often self-explanatory).
- Even for TeX/LaTeX users, this provides much better readability of the source text file, since (a) there is no more romanization (transliteration of Bangla into ASCII), and (b) the user can freely choose the (opentype/truetype) Bangla text font for viewing the source text.
- UTF-8 encoded pure Bangla text can now be quickly turned into a TeX/LaTeX source text by inserting some TeX/LaTeX formatting commands into the UTF-8 encoded text itself, without having to first manually convert it into transliterated ASCII in bangtex format (the conversion being done by the script).
- Unicode also makes the "cut and paste" feature of character based documents highly portable. As more and more Bangla references are becoming available on the web, this makes it easy to quote from them in documents. Recently, I needed to quote a large section of an article by Rabindranath in a Bangla article composed in bangtex. I went to the online version of that article on the Rabindra Rachanabali website, and then simply did a "cut and paste" of the relevant section into my bangtex article. (Without the uni2bangtex script, I would have had to manually transliterate the quoted section into romanized bangtex form, spending considerably more time in "data entry".)
- Using unicode allows the user to customize the keyboard input method. Users familiar with typing on the romanized QWERTY keyboard will probably prefer Bangla phonetic keyboard layouts such as Probhat or Baishakhi (as opposed to non-phonetic layouts such as Inscript or Unijoy). Even further customization, including modification of the phonetic transliteration mapping for the keyboard, is possible.
- Using a unicode text editor to compose the source document can be somewhat less error prone and more pleasant for native Bangla users than using a romanized transliteration of Bangla, since the instant rendering of written Bangla directly into a Bangla font constantly provides the same visual feedback loop as the one used when reading or writing native language on paper.
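
The point about disjoint character codes can be made concrete. Here is a minimal Python sketch, under the assumption that "Bangla" means the Unicode Bengali block U+0980 to U+09FF; the helper names `classify` and `split_runs` are hypothetical. It splits a mixed line into runs of ASCII and Bangla just by looking at each character's code.

```python
from itertools import groupby


def classify(ch: str) -> str:
    """Label a character by the code range it falls in."""
    if "\u0980" <= ch <= "\u09ff":   # Unicode Bengali block (assumed range)
        return "bangla"
    if ord(ch) < 128:
        return "ascii"
    return "other"


def split_runs(text: str):
    """Group consecutive characters with the same label into runs."""
    return [(kind, "".join(chars)) for kind, chars in groupby(text, key=classify)]


# \title{কত} with the Bangla letters written as escapes
for kind, run in split_runs("\\title{\u0995\u09a4}"):
    print(kind, run)
```

Note that spaces and punctuation shared with English fall in the "ascii" runs, and that the danda (U+0964), which Bangla text also uses, lies outside the Bengali block and would be labelled "other" here.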

Appendix

Setting up Bangla unicode text support in Linux / X windows

You need three things for using Bangla unicode text on Linux:

- A text editor with UTF-8 support. (This means a simple character based text editor, not a word processor such as OpenOffice or MS Word.)
  This may already be present in your system, as most modern operating systems with graphical desktops have a default GUI editor with this feature. For example, many Linux distributions include either GNOME's default text editor gedit or KDE's default text editors Kate and/or KWrite, MS Windows comes with Notepad, Mac OS X has TextEdit, etc., and all these now support multilingual UTF-8. If your system does not have it, you may want to install and use the simple free classic UTF-8 editor yudit. Other choices are possible, such as the GNU super-editor Emacs. See Wikipedia's Comparison of text editors.
- To display Bangla text, the editor will need a font capable of rendering unicode UTF-8 Bangla. This will usually be an opentype or truetype font.
  Note that this font is only for displaying Bangla in the text editor in which you prepare the LaTeX source document, and has nothing to do with the font of the final document output by bangtex (such as bpsf).
  Again, most modern operating systems now come with default fonts for displaying most of unicode UTF-8, and so it may not be necessary to install any special Bangla UTF-8 font, unless you do not like the system default fonts for displaying Bangla.
  See Bangla script display help at Wikipedia and Bangla Wiktionary for more details.
  If you need to install Bangla fonts:
  - Sample Bangla UTF-8 fonts at the gallery of Bangla unicode fonts, many of which are free.
  - Check out many free unicode Bangla fonts available for download at Ekushey OpenType Bangla Fonts and/or Ubuntu's Indic Fonts.
  - Avoid some old versions of fonts which are still around. E.g., for the Bangla opentype font Likhan, use the latest version (2005-AUG-11).
- A keyboard input method for typing Bangla UTF-8 characters using a romanized keyboard (usually QWERTY).
  Once again, most modern operating systems provide keyboard layouts for various languages and a way to switch between various layouts. The default layout usually is a form of English, which maps keyboard scan codes into ASCII characters. Switching to a different layout will cause this map to change, and a Bangla layout will map keyboard scan codes into Bangla UTF-8 characters instead of ASCII characters.
  See Bangla script input help at Wikipedia and Bangla Wiktionary for more details.
  There are different types of layouts for Bangla available, such as phonetic, non-phonetic, etc.
  If you are used to typing on QWERTY keyboards using primarily a language with essentially Roman script (English, German, French, Spanish, Italian, etc.) and you are new to Bangla typing, then you will probably find a phonetic layout to be the easiest to use. For an X-window based system, a Bangla phonetic layout called probhat (picture of layout) is generally available. I personally use a variant of it, which I call suprobhat. Another possibility for a modern phonetic layout is baisakhi (PDF document, picture of layout), developed by SNLTR.
  Also see this explanation of phonetic Bangla typing.

Low Level Keyboard Layout Switching in X windows

Warning! You should not use this method unless you really know what you are doing, or else it can make your computer unusable. If you use a modern distribution of Linux with a graphical desktop manager such as GNOME or KDE, you will most likely have a way (perhaps a menu in your desktop manager or a graphical applet) to switch your keyboard layout, and you should use it to select a Bangla layout of your choice (e.g., Probhat).

If you really want to use this low level method (bypassing your desktop manager) to switch to a new keyboard layout, use the setxkbmap command to directly instruct the X windows server to select or switch to an xkb keyboard layout. The xkb layouts for X are found in the directory /etc/X11/xkb/symbols/. Look there for a file named in (for India), or bd (for Bangladesh), or ben or bang, which should have an entry for the Probhat layout, named ben_probhat, or simply probhat. (You can also download the layout here. I personally use a variant of probhat which I call suprobhat.)

E.g., if the file /etc/X11/xkb/symbols/in has a layout entry called ben_probhat, you can activate it by a command such as

    setxkbmap -model pc101 -layout "us,in(ben_probhat)" -option "grp:shifts_toggle,grp_led:num"

or

    setxkbmap -model pc101 -layout "us,in(ben_probhat)" -option "grp:shift_toggle,grp_led:num"

depending on the version of your X. This will set things up in xkb so that pressing the two shift keys together will toggle between the standard US (English) and the ben_probhat (Bangla) keyboard layouts.

Typing Bangla TeX/LaTeX Source (`bangtex`) Directly in Native Bangla (Unicode/UTF-8)

Introduction

`bangtex`: Bangla TeX and LaTeX

The Problem

Can we avoid romanization in typing LaTeX source in Bangla?

The Solution: Use This Script to Convert Bangla Unicode into `bangtex`'s Transliterated ASCII Format

Typing LaTeX source documents directly in unicode Bangla

An Example

Preparing a sample LaTeX Bangla document

System Requirements

What you need on your computer

Download

The perl script file and this webpage

About unicode and UTF-8

What is unicode and UTF-8?

Why use unicode for encoding Bangla text in this way?

Appendix

Setting up Bangla unicode text support in Linux / X windows

Low Level Keyboard Layout Switching in X windows

Web Resources

List of useful websites
