HTML txt Manual

By Daniel Hellerstein

=HTML_TXT: An HTML to Text Converter=

Introduction
HTML_TXT, version 1.10, is used to convert an HTML file to a text file. HTML_TXT is written in REXX and is meant to be run under OS/2. However, it also runs under other REXX interpreters, such as Regina REXX for DOS and Regina REXX for WIN95.

HTML_TXT will attempt to maintain the format of the HTML document by using appropriate spacing and ASCII characters. HTML_TXT can use ASCII art (lines and boxes), as well as other high-ASCII characters, to improve the appearance of the output (text) file.

HTML_TXT can be customized in a number of ways. For example, you can:
 * suppress the use of line art and other high-ASCII characters (your output will be rougher, but will suffer from fewer compatibility problems)
 * display tables (including nested tables) in a tabular format with auto-sized columns
 * change the bullet characters used in ordered lists
 * display headings as a hierarchical outline
 * change characters used to signify logical elements (emphasis, anchors, list bullets, etc.)

Installing and Executing HTML_TXT
HTML_TXT is easy to install and run:
 * 1) Copy HTML_TXT.CMD to a directory.
 * 2) Open up an OS/2 prompt, change to the directory containing HTML_TXT.CMD, and type HTML_TXT at the command prompt.
 * 3) Follow the instructions.

No other libraries or support files are needed.

The READ.ME file describes how to install HTML_TXT if you are a Regina REXX user.

Running from the command line
You can also run HTML_TXT from the command line. The syntax is (where x:\HTMLTXT is the directory containing HTML_TXT.CMD):

 x:\HTMLTXT>HTML_TXT file.htm file.txt /VAR var1=val1 ; var2=val2

where:
 * file.htm is the input file (an HTML document)
 * file.txt is the output file (a text document)
 * /VAR var1=val1 ; var2=val2 is an optional list of parameters to modify

Example: D:\HTMLTXT>HTML_TXT foo.htm foo.txt /VAR lineart=0 ; lagul=* $
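
If you call HTML_TXT from a script, you may want to assemble the command line from a set of parameter values. The following Python sketch shows one way to do that, following the syntax above. The function name build_command is hypothetical and is not part of HTML_TXT; the sketch only builds the string and does not run the program.

```python
def build_command(infile, outfile, params=None):
    """Assemble an HTML_TXT command line (illustrative helper, not part of HTML_TXT)."""
    parts = ["HTML_TXT", infile, outfile]
    if params:
        # A single /VAR switch is followed by assignments separated by " ; ".
        assignments = " ; ".join(f"{name}={value}" for name, value in params.items())
        parts += ["/VAR", assignments]
    return " ".join(parts)


if __name__ == "__main__":
    # LINEART and NOANSI are parameters named in this manual.
    print(build_command("foo.htm", "foo.txt", {"lineart": 0, "noansi": 1}))
```

Leaving out the params argument produces the plain two-file form of the command.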

Alternatively, you can run HTML_TXT from an (OS/2) prompt without any arguments; you will then be asked for an input and output file, and will be permitted to change the values of several of the more important parameters.

Features
HTML_TXT attempts to support many HTML options, including nested tables, nested lists, centering, and recognition of FORM elements.

The following summarizes HTML_TXT's capabilities.

This table assumes that you have a basic familiarity with HTML.

Changing Parameters
As noted in the customization column of the above table, HTML_TXT contains a number of user configurable parameters.

Although the default values of these parameters work well in most cases, you can change them by editing HTML_TXT.CMD with your favorite text editor (look for the "user configurable parameters" section).

Alternatively, you can temporarily change values using the /VAR command line option. In fact, by specifying a PLIST=file.ext (in the /VAR section), you can create custom instructions for sets of HTML documents.

The following lists the more important parameters. Of particular interest are the NOANSI, LINEART, TABLEMAXNEST, TABLEMODE2 and TOOLONGWORD parameters.

Table Controls
Display of tables, in a tabular format, can be tricky. In particular, nested tables may tax the resources of your 80 character text display. HTML_TXT allows you to modify table specific display options, and convert tables into lists.

Display Controls
Since it's not possible to use italics, bold, font styles, and other such visual aids in a text file, HTML_TXT uses a few tricks instead.

 * Capitalization can be used - by default, Bold, STRONG and Typewriter (TT) emphasis is indicated with capitalization.
 * Spaces can be replaced with underscores - this is used to indicate Underline emphasis.
 * "Quote strings" can be placed around emphasized strings.

The last trick, the use of "quote strings", is frequently used by HTML_TXT, with different sets of quote strings used for different kinds of emphasis. For example, EM and I emphasis, anchors, submit fields, and in-line images are indicated with unique sets of "quote strings". For detailed descriptions of these parameters, see HTML_TXT.CMD.
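
To make the three tricks concrete, here is a minimal Python sketch of each. It is not HTML_TXT's own code (HTML_TXT is written in REXX). The function names and the particular quote characters are assumptions chosen for illustration; the actual quote strings are set by parameters in HTML_TXT.CMD.

```python
def emphasize_caps(text):
    """Capitalization trick: render the emphasized text in upper case."""
    return text.upper()


def emphasize_underline(text):
    """Underline trick: replace spaces with underscores."""
    return text.replace(" ", "_")


def emphasize_quoted(text, left="{", right="}"):
    """Quote-string trick: wrap the text in a pair of marker strings.

    The default markers here are hypothetical, not HTML_TXT's defaults.
    """
    return f"{left}{text}{right}"


if __name__ == "__main__":
    print(emphasize_caps("very important"))
    print(emphasize_underline("see this note"))
    print(emphasize_quoted("home page"))
```

Using a different marker pair for each kind of element (emphasis, anchors, images) is what lets a reader of the text file tell them apart.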

Troubleshooting HTML_TXT
The following lists possible troubles you might have, and suggested solutions.


 * HTML_TXT displays all kinds of weird garbage (such as $ and [ characters)
 * You don't have ANSI support installed. You should either install ANSI.SYS (for example, include a DEVICE=C:\OS2\MDOS\ANSI.SYS line in your OS/2 CONFIG.SYS file), or set NOANSI=1 (in HTML_TXT.CMD).


 * Nested tables aren't displaying properly (this is especially likely to happen when running under Regina REXX for DOS).
 * You can try using lists instead of tables -- set TABLEMAXNEST=0 (in HTML_TXT.CMD).


 * Tables have unappealing characters used as vertical and horizontal lines
 * Either your output device (say, your printer) does not support high-ASCII characters, or your code page is somewhat unusual. You can use standard characters (- and !) for line borders by setting LINEART=0 (in HTML_TXT.CMD).
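
As an illustration of what a table drawn with the standard characters (- and !) looks like, the following Python sketch renders rows of text with auto-sized columns. It is a simplified stand-in under stated assumptions, not the algorithm HTML_TXT uses: the function name render_table is hypothetical, and the sketch ignores nesting, cell wrapping, and column spans.

```python
def render_table(rows):
    """Render rows (lists of strings) as a text table bordered with - and !."""
    ncols = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [list(row) + [""] * (ncols - len(row)) for row in rows]
    # Auto-size: each column is as wide as its longest cell.
    widths = [max(len(row[c]) for row in padded) for c in range(ncols)]
    # Horizontal rule: dashes under each column, ! at the column boundaries.
    rule = "!" + "!".join("-" * (w + 2) for w in widths) + "!"
    lines = [rule]
    for row in padded:
        cells = " ! ".join(cell.ljust(w) for cell, w in zip(row, widths))
        lines.append("! " + cells + " !")
        lines.append(rule)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table([["Name", "Value"], ["LINEART", "0"], ["NOANSI", "1"]]))
```

Because only - and ! are used, the output does not depend on the code page or on the printer's support for line-drawing characters.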


 * Unappealing characters are being used as bullets and to "quote" text strings
 * This can also occur if your code page is somewhat unusual. You can either change the various "display control parameters" (in HTML_TXT.CMD), or you can set LINEART=-1, in which case some default, standard characters (such as * and @ for bullets) will be used.


 * Long words (such as URLs) are being lost.
 * You can change the "trimming" action to "word wrap", or to "extend beyond margins", by setting the TOOLONGWORD parameter.
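
The difference between the three long-word actions can be sketched in Python as follows. The mode names "trim", "wrap", and "extend" are labels invented for this example; the values that TOOLONGWORD actually accepts are described in HTML_TXT.CMD, and this is not HTML_TXT's own code.

```python
def fit_word(word, width, mode):
    """Return the output line(s) for one word that may be wider than the margin.

    mode is a hypothetical label: "trim", "wrap", or "extend".
    """
    if len(word) <= width or mode == "extend":
        # Short words fit as they are; "extend" lets long words run past the margin.
        return [word]
    if mode == "trim":
        # Characters beyond the margin are dropped (this is how a URL gets "lost").
        return [word[:width]]
    if mode == "wrap":
        # Break the word into width-sized pieces, one per output line.
        return [word[i:i + width] for i in range(0, len(word), width)]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    url = "http://www.srehttp.org/apps/html_txt/"
    for mode in ("trim", "wrap", "extend"):
        print(mode, fit_word(url, 20, mode))
```

Wrapping or extending keeps every character of a URL, at the cost of a less tidy right margin.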


 * The display of headings is not informative
 * You can set HN_OUTLINE=2; headings will then be displayed in an "outline format". You can even change the numbering style (say, 2.a.ii versus II.2.b) by changing the HN_NUMBERS.n parameters.
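
To show how outline labels such as 2.a.ii or II.2.b can be produced from a sequence of heading levels, here is a small Python sketch. It is an illustration only: the style string (one character per level, with "1", "a", "i", and "I" as codes) is an assumption made for this example and is not the format of the HN_NUMBERS.n parameters.

```python
def to_roman(n):
    """Convert a positive integer to upper-case Roman numerals."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = ""
    for value, symbol in pairs:
        while n >= value:
            out += symbol
            n -= value
    return out


def to_alpha(n):
    """1 -> a, 2 -> b, ... (this sketch only handles 1 to 26)."""
    return chr(ord("a") + n - 1)


# Hypothetical style codes, one per heading level.
FORMATTERS = {"1": str, "a": to_alpha, "i": lambda n: to_roman(n).lower(), "I": to_roman}


def outline_labels(levels, styles):
    """Return one label per heading; levels are 1-based and must not skip a level."""
    counters = []
    labels = []
    for level in levels:
        del counters[level:]            # a shallower heading resets deeper counters
        while len(counters) < level:    # a deeper heading starts a new counter
            counters.append(0)
        counters[level - 1] += 1
        labels.append(".".join(FORMATTERS[styles[i]](c) for i, c in enumerate(counters)))
    return labels


if __name__ == "__main__":
    headings = [1, 1, 2, 2, 3, 3]       # for example H1, H1, H2, H2, H3, H3
    print(outline_labels(headings, "1ai"))
    print(outline_labels(headings, "I1a"))
```

The same list of headings yields either numbering scheme; only the style string changes.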

Disclaimer
This is freeware that is to be used at your own risk - the author and any potentially affiliated institutions disclaim all responsibilities for any consequence arising from the use, misuse, or abuse of this software. You may use this, or subsets of this program, as you see fit, including for commercial purposes; so long as proper attribution is made, and so long as such use does not preclude others from making similar use of this code.

Contact Information
Do you have the [http://www.srehttp.org/apps/html_txt/ latest version] of HTML_TXT?

If you find errors in this program, would like to make suggestions, or otherwise wish to comment, please contact [mailto:danielh@econ.ag.gov Daniel Hellerstein].