HTML txt Manual

By Daniel Hellerstein

=HTML_TXT: An HTML to Text Converter=

Introduction
HTML_TXT, version 1.10, is used to convert an HTML file to a text file. HTML_TXT is written in REXX and is meant to be run under OS/2. However, it also runs under other REXX interpreters, such as Regina REXX for DOS and Regina REXX for WIN95.

HTML_TXT will attempt to maintain the format of the HTML document by using appropriate spacing and ASCII characters. HTML_TXT can use ASCII art (lines and boxes), as well as other high-ASCII characters, to improve the appearance of the output (text) file.

HTML_TXT can be customized in a number of ways. For example, you can:


 * suppress the use of line art and other high-ASCII characters (your output will be rougher, but will suffer from fewer compatibility problems)
 * display tables (including nested tables) in a tabular format with auto-sized columns
 * change the bullet characters used in ordered lists
 * display headings as a hierarchical outline
 * change characters used to signify logical elements (emphasis, anchors, list bullets, etc.)

Installing and Executing HTML_TXT
HTML_TXT is easy to install and run:


 * 1) Copy HTML_TXT.CMD to a directory.
 * 2) Open up an OS/2 prompt, change to the directory containing HTML_TXT.CMD, and type HTML_TXT at the command prompt.
 * 3) Follow the instructions.

No other libraries or support files are needed.

"The READ.ME file describes how to install HTML_TXT if you are a Regina REXX user."

Running from the command line
You can also run HTML_TXT from the command line. The syntax is (where x:\HTMLTXT is the directory containing HTML_TXT.CMD):

    x:\HTMLTXT> HTML_TXT file.htm file.txt /var var1=val1 ; var2=val2

where:

 * file.htm is the input file (an HTML document)
 * file.txt is the output file (a text document)
 * /VAR var1=val1 ; var2=val2 is an optional list of parameters to modify
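
To make the shape of the /VAR list concrete, here is a minimal Python sketch that splits such a list into name/value pairs. It only illustrates the syntax described above; it is not how HTML_TXT.CMD (which is written in REXX) reads its arguments. The function name parse_var_list is hypothetical, and treating /var and /VAR as equivalent is an assumption based on both spellings appearing in this manual.

```python
import re

def parse_var_list(arg_string):
    """Return the name=value pairs that follow /VAR as a dict of strings."""
    # Assumption: the switch is matched case-insensitively (/var or /VAR).
    match = re.search(r"/var\b", arg_string, re.IGNORECASE)
    if match is None:
        return {}  # no /VAR section on this command line
    settings = {}
    # Entries are separated by semicolons; spaces around them are ignored.
    for item in arg_string[match.end():].split(";"):
        name, sep, value = item.partition("=")
        if sep and name.strip():
            settings[name.strip()] = value.strip()
    return settings

print(parse_var_list("file.htm file.txt /VAR NOANSI=1 ; LINEART=0"))
```

NOANSI and LINEART are parameters named elsewhere in this manual; the values shown are the ones suggested in the troubleshooting section.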

Alternatively, you can run HTML_TXT from an (OS/2) prompt without any arguments; you will then be asked for an input and output file, and will be permitted to change the values of several of the more important parameters.

Features
HTML_TXT attempts to support many HTML features, including nested tables, nested lists, centering, and recognition of FORM elements.

The following summarizes HTML_TXT's capabilities.

This table assumes that you have a basic familiarity with HTML.

Changing Parameters
As noted in the customization column of the above table, HTML_TXT contains a number of user configurable parameters.

Although the default values of these parameters work well in most cases, you can change them by editing HTML_TXT.CMD with your favorite text editor (look for the "user configurable parameters" section).

Alternatively, you can temporarily change values using the /VAR command line option. In fact, by specifying a PLIST=file.ext (in the /VAR section), you can create custom instructions for sets of HTML documents.

The following lists the more important parameters. Of particular interest are the NOANSI, LINEART, TABLEMAXNEST, TABLEMODE2 and TOOLONGWORD parameters.

"For detailed descriptions of these parameters, see HTML_TXT.CMD."

Troubleshooting HTML_TXT
The following lists possible troubles you might have, and suggested solutions.


 * HTML_TXT displays all kinds of weird garbage (such as $ and [ characters)
 * You don't have ANSI support installed. You should either install ANSI.SYS (for example, include a DEVICE=C:\OS2\MDOS\ANSI.SYS line in your OS/2 CONFIG.SYS file), or set NOANSI=1 (in HTML_TXT.CMD).


 * Nested tables aren't displaying properly (this is especially likely to happen when running under Regina REXX for DOS).
 * You can try using lists instead of tables -- set TABLEMAXNEST=0 (in HTML_TXT.CMD).


 * Tables have unappealing characters used as vertical and horizontal lines
 * Either your output device (say, your printer) does not support high-ASCII characters, or your code page is somewhat unusual. You can use standard characters (- and !) for line borders by setting LINEART=0 (in HTML_TXT.CMD).


 * Unappealing characters are being used as bullets and to "quote" text strings
 * This can also occur if your code page is somewhat unusual. You can either change the various "display control parameters" (in HTML_TXT.CMD), or you can set LINEART=-1, in which case some default, standard characters (such as * and @ for bullets) will be used.


 * Long words (such as URLs) are being lost.
 * You can change the "trimming" action to "word wrap", or to "extend beyond margins", by setting the TOOLONGWORD parameter.
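
The three behaviours named above (trimming, word wrap, and extending beyond the margin) can be pictured with a short Python sketch. This is an illustration only, not HTML_TXT's own code: the function name and the mode names "trim", "wrap" and "extend" are hypothetical and are not the values that TOOLONGWORD accepts (see HTML_TXT.CMD for those).

```python
def fit_long_word(word, width, mode):
    """Return the output line(s) for a single word on a line of `width` columns."""
    if len(word) <= width or mode == "extend":
        # Short enough, or allowed to run past the right margin: keep it whole.
        return [word]
    if mode == "trim":
        # Keep only what fits; the remaining characters are lost.
        return [word[:width]]
    if mode == "wrap":
        # Break the word into consecutive pieces of at most `width` characters.
        return [word[i:i + width] for i in range(0, len(word), width)]
    raise ValueError("unknown mode: " + mode)

url = "http://www.srehttp.org/apps/html_txt/"
for mode in ("trim", "wrap", "extend"):
    print(mode, fit_long_word(url, 20, mode))
```

Only the "trim" behaviour discards text, which is why long URLs can appear to be lost.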


 * The display of headings is not informative
 * You can set HN_OUTLINE=2; headings will then be displayed in an "outline format". You can even change the numbering style (say, 2.a.ii versus II.2.b) by changing the HN_NUMBERS.n parameters.
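
To show what outline-style heading labels such as 2.a.ii and II.2.b look like in practice, here is a small Python sketch that numbers a sequence of heading levels with one numbering style per level. It is an illustration under stated assumptions, not HTML_TXT's implementation: the style codes "1", "a", "i" and "I" are hypothetical and are not the values used by the HN_NUMBERS.n parameters.

```python
def to_roman(n):
    """Convert a positive integer to an upper-case Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = ""
    for value, symbol in pairs:
        while n >= value:
            out += symbol
            n -= value
    return out

def to_letter(n):
    """1 -> a, 2 -> b, ...; wraps around after z (enough for a sketch)."""
    return chr(ord("a") + (n - 1) % 26)

# Hypothetical style codes, one formatter per code.
FORMATTERS = {"1": str, "a": to_letter,
              "i": lambda n: to_roman(n).lower(), "I": to_roman}

def outline_labels(levels, styles):
    """levels: heading levels (1 for H1, 2 for H2, ...) in document order.
    styles: one style code per heading level, outermost first."""
    counters = [0] * len(styles)
    labels = []
    for level in levels:
        counters[level - 1] += 1
        # A new heading restarts the numbering of every deeper level.
        for deeper in range(level, len(styles)):
            counters[deeper] = 0
        # A skipped parent level is shown as its first item.
        parts = [FORMATTERS[styles[i]](max(counters[i], 1)) for i in range(level)]
        labels.append(".".join(parts))
    return labels

print(outline_labels([1, 1, 2, 3, 3], ["1", "a", "i"]))
print(outline_labels([1, 1, 2, 2, 3, 3], ["I", "1", "a"]))
```

The same sequence of headings yields different labels depending only on the per-level style list, which is the kind of choice the HN_NUMBERS.n parameters control.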

Disclaimer
This is freeware that is to be used at your own risk -- the author and any potentially affiliated institutions disclaim all responsibilities for any consequence arising from the use, misuse, or abuse of this software. You may use this, or subsets of this program, as you see fit, including for commercial purposes, so long as proper attribution is made, and so long as such use does not preclude others from making similar use of this code.

Contact Information
Do you have the [http://www.srehttp.org/apps/html_txt/ latest version of HTML_TXT]?

If you find errors in this program, would like to make suggestions, or otherwise wish to comment, please contact [mailto:danielh@econ.ag.gov Daniel Hellerstein].