Yasig -- Yet Another Site Index Generator


  1. Overview
    What is Yasig? -- What Yasig is Not -- Supported Platforms
  2. Installation and Use
    License -- Installation -- How to use Yasig -- HTML Authoring
  3. A Detailed View of Yasig
  4. Useful Links
    Similar Programs -- Related Stuff
  5. Index

Copyright © 2000 Sven Türpe <sven@gaos.org>

Download current version: yasig-0.2.tar.gz (30kB)

By downloading Yasig you accept the license agreement that you find below on this page in section 2.1.


1 Overview

1.1 What is Yasig?

Yasig is yet another site index generator. Yasig takes a collection of HTML files, extracts important words and phrases, forges the corresponding URLs and creates an alphabetically sorted list. You can provide your own script to turn that list into an HTML document or use the default presentation with your own HTML code around it.

The kind of index the author had in mind while writing Yasig was the sort one can find at the end of books. Hence Yasig generates a list of keywords or phrases with references to the locations where more information can be found -- and nothing more. Yasig does not collect page information or context but simply phrases and URLs.

Yasig doesn't use any linguistic methods. Instead, it tries to guess what might be important phrases from HTML markup in a configurable way. You can support the indexing process by providing invisible keywords using the META tag and by invisible markup using the SPAN tag with a certain value of the CLASS attribute. It is also possible to exclude certain phrases or URLs from indexing by regular expression matching.

The intended use of Yasig is on small web sites with handcrafted HTML. What Yasig can do for you is best explained by example. Have a look at the index generated for this document and the site Yasig was originally written for.

1.2 What Yasig is Not

Yasig is not stable. It is the result of three nights of hacking and mostly untested. Expect bugs and problems.

Yasig is neither clickable nor easy to use. You will have to edit files for configuration and possibly even write your own presentation script or hack the one provided with Yasig.

Yasig is not fault-tolerant with respect to your HTML code. It relies on syntactical correctness and good HTML authoring style. If you don't know what a validator is and why to use it, you won't like Yasig. The same holds if you click your HTML documents together with some HTML editor without knowing the concepts behind them.

Yasig is in no way correct. It does not really parse HTML but merely identifies tags and the text between them, and it does not support any character set other than ISO 8859-1. So if the language of your documents is not English or one of the European languages within the scope of ISO 8859-1, you won't like Yasig. (You may still give it a try as long as your HTML documents are stored in some 8-bit character encoding that is a superset of ASCII and do not contain any of the character entity references for ISO 8859-1 characters. In that case you should remove the umlaut handling code from the presentation script or write your own. Unicode characters won't hurt if and only if they are given as HTML character entity references, i.e. &foobar;, but they might confuse sort.)

1.3 Supported Platforms

Yasig is intended (but not yet tested) to run on any GNU/Unix system. It requires only three standard utilities:

You will also need a text editor.

2 Installation and Use

2.1 License

Copyright (c) 2000 Sven Türpe <sven@gaos.org>

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Any use of this software is at the user's own risk.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required. If you use this software on a web site available to the public, a hyperlink to the primary distribution site, http://gaos.org/~sven/yasig/, would be appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

2.2 Installation

  1. Go through the three awk source files, index.awk, cleanup.awk, and present.awk, and adjust the path in the first line to point to wherever your awk interpreter is installed. The default is /usr/bin/awk. Leave the -f option untouched. Do the same for the shell script, yasig.sh. The path defaults to /bin/ksh there.
  2. Go through the global configuration section of yasig.sh and make changes where necessary.
  3. Check the permissions of *.awk and yasig.sh. They must be executable for any user invoking Yasig.
  4. Copy the sample.cfg file to your preferred location and edit it to fit your needs. It is heavily commented.
  5. Use Yasig, remove all bugs you encounter while doing so, increase the version number by 0.1, and send a copy back to the author.
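
Steps 1 and 3 above can be scripted. The following is only a sketch: it assumes that the first line of each awk script has the form "#!/usr/bin/awk -f" and that yasig.sh starts with "#!/bin/ksh". The helper name fix_interpreters and its arguments are made up for illustration and are not part of Yasig.

```shell
#!/bin/sh
# Sketch only: rewrite the interpreter paths and set execute permissions.
# fix_interpreters DIR AWK_PATH SHELL_PATH   (hypothetical helper)
fix_interpreters() {
    dir=$1
    awk_path=$2
    sh_path=$3
    for f in index.awk cleanup.awk present.awk; do
        # Replace only line 1 and keep the -f option, as step 1 requires.
        sed "1s|^#!.*|#!$awk_path -f|" "$dir/$f" > "$dir/$f.tmp" &&
            mv "$dir/$f.tmp" "$dir/$f"
    done
    sed "1s|^#!.*|#!$sh_path|" "$dir/yasig.sh" > "$dir/yasig.sh.tmp" &&
        mv "$dir/yasig.sh.tmp" "$dir/yasig.sh"
    # Step 3: every user who invokes Yasig needs execute permission.
    chmod a+rx "$dir/index.awk" "$dir/cleanup.awk" "$dir/present.awk" "$dir/yasig.sh"
}
```

Called from the unpacked distribution directory, this might look like: fix_interpreters . /usr/local/bin/gawk /bin/ksh (both paths are placeholders for whatever your system uses).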

2.3 How to use Yasig

2.3.1 Configuring Yasig for a Specific Site

Using Yasig mainly means writing a configuration file and then running the yasig.sh shell script with the path to this configuration file as its only argument. You don't have to write the configuration file from scratch: a heavily commented example, sample.cfg, comes with the distribution. If everything works as intended, you will see some messages on stderr while Yasig runs, and once it finishes you can find the newly created index at the location specified in the configuration file.

You should at least set the following configuration variables to values specific to your site:

LC_ALL and LANG
These are the locale settings. They are important for correct sorting, especially with non-English languages. If your documents are written in neither English nor German (i.e., they use special characters from the ISO 8859-1 160-255 range other than German umlauts), you will also have to hack the present.awk script slightly.
DOCROOT
Your document root directory. Does not have to be the same as that of your whole web server. You MUST include the trailing '/'.
URLPATH
The URL that corresponds to the document root directory as specified in DOCROOT. Can be a relative URL with respect to the intended location of the index to be generated or even empty. The value of URLPATH is used in conjunction with the file names (see below) to forge URLs.
SITEINDEX
The value of SITEINDEX specifies where to put the index. Should be an absolute path.
FILES
A space-separated list of file names relative to DOCROOT. For example, if you set DOCROOT=/var/httpd/htdocs/, the file /var/httpd/htdocs/foo/bar.html is specified as foo/bar.html. You may use wildcards and command substitution.
EXCLUDE_FILE and URL_EXCLUDE_FILE
You probably don't want to use the examples that came with the distribution.

Other options only influence details of indexing and presentation. You can leave them untouched when trying Yasig for the first time.
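
To make the variables above concrete, here is a minimal sketch of a site configuration. Configuration files use shell syntax, so plain assignments are enough for the sketch; whether yasig.sh needs the variables exported is not stated here, and sample.cfg remains the authority. Every path and value below is a placeholder.

```shell
# site.cfg -- hypothetical example; every value below is a placeholder.
LC_ALL=C                        # replace with your locale, e.g. a German ISO 8859-1 one
LANG=C
DOCROOT=/var/httpd/htdocs/      # the trailing slash is mandatory
URLPATH=../                     # relative to where the index will live
SITEINDEX=/var/httpd/htdocs/siteindex/index.html
FILES="index.html foo/bar.html" # names relative to DOCROOT
# Command substitution is allowed as well, for instance:
# FILES=$(cd "$DOCROOT" && ls *.html)
EXCLUDE_FILE=/etc/yasig/mysite.exclude
URL_EXCLUDE_FILE=/etc/yasig/mysite.urlexclude
```

With these values the index lives one directory below the document root, so the relative URLPATH "../" combined with foo/bar.html yields the link ../foo/bar.html.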

2.3.2 Adding HTML and Style Sheets

If you use Yasig with the presentation script supplied by the author, it generates only the raw index with some HTML markup. You have to supply the surrounding HTML and possibly a style sheet to change the appearance.

You can do this in at least two ways. First, you could put the generated index and the stuff that goes around it in separate files and use some server side include mechanism to put them together automatically. In that case, set both LEADING_HTML_FILE and TRAILING_HTML_FILE to the empty string. Yasig replaces the empty string by the filename /dev/null, which indicates end of file whenever it is read.

Second, you could split your site index template at the position where you want the actual index to be included, put both parts into separate files, and specify their locations using the LEADING_HTML_FILE and TRAILING_HTML_FILE variables in your configuration file. Then the generated index is the concatenation of the LEADING_HTML_FILE file, the index itself, and the TRAILING_HTML_FILE file.
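
The concatenation described above, including the fallback to /dev/null for empty settings, can be sketched as follows. This is not the code from yasig.sh; the helper name assemble_index and its two arguments are assumptions made for the example.

```shell
#!/bin/sh
# Sketch: wrap a raw index in the two halves of a page template.
# assemble_index RAW_INDEX OUTPUT   (hypothetical helper, not part of Yasig)
assemble_index() {
    # An empty or unset variable falls back to /dev/null, which reads as an
    # empty file, so the raw index is then written out unchanged.
    cat "${LEADING_HTML_FILE:-/dev/null}" \
        "$1" \
        "${TRAILING_HTML_FILE:-/dev/null}" > "$2"
}
```

Using the same function for both approaches keeps the server side include case (both variables empty) from needing special treatment.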

There are currently only a few options that influence the appearance of the index as created by the default presentation script present.awk. You can select the tag that is used to mark the heading that is created for each letter by setting the HEADTAG variable. You can also set a limit on the number of references per phrase. Any keyword or phrase that has more URLs than the numeric value of MAXLINKS associated with it is discarded. The third presentation option is provided by the MENULINE flag. If set to any value, a line with the alphabetic letters from A to Z is created in front of the index and each letter that actually appears in the index gets hyperlinked to the appropriate location within the index.
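
The effect of MAXLINKS can be illustrated with a small filter. This is a sketch of the idea only and may differ from what present.awk really does. It assumes sorted input lines of the form phrase, tab, URL; the function name limit_links is hypothetical.

```shell
#!/bin/sh
# Sketch of a MAXLINKS-style filter; the real present.awk may differ.
# Reads sorted "phrase<TAB>URL" lines and drops every phrase that has
# more than $1 URLs.
limit_links() {
    awk -F '\t' -v max="$1" '
    function flush(   i) {                 # emit the buffered group if small enough
        if (n > 0 && n <= max)
            for (i = 1; i <= n; i++) print phrase "\t" url[i]
        n = 0
    }
    $1 != phrase { flush(); phrase = $1 }  # a new phrase starts a new group
    { url[++n] = $2 }                      # remember each URL of the group
    END { flush() }
    '
}
```

Note that a phrase exceeding the limit disappears entirely rather than being truncated to the first few links, which matches the description above.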

To help you with styling, the whole index is framed in a <DIV CLASS="siteindex">. The headings and the lists (but not the single list items) used in the index also have their class attribute set to siteindex. When using the MENULINE feature, the menu line is marked as a paragraph in its own DIV, both of class siteindexhead.

2.3.3 Getting Rid of Unwanted Phrases and URLs

If your HTML documents are syntactically correct, have some content, and represent the logical structure of that content by useful HTML markup, the first run of Yasig will create an index that is already usable but possibly polluted with phrases or references you would rather not see there.

Two mechanisms can help you to clean up the index and keep it small. First, you can influence how certain tags are treated and whether they are used at all by setting or unsetting the corresponding environment variables within the configuration file. Second, you can provide regular expressions to discard everything that matches them, either based on the phrase or the URL.

Each type of tag has an ignore flag associated with it. By setting it to any value, you tell index.awk not to use this tag. In most cases this flag is called IGNORE_<TAG> where <TAG> is a placeholder for the tag to be ignored. The only exceptions are the IGNORE_KEYWORDS and IGNORE_LOCAL flags. They refer to META tags with the NAME attribute set to keywords and localkeywords respectively.

For some tags (A, H?, EM, STRONG, and DT) there is also a USE_<TAG>_TITLE flag which tells Yasig to prefer the TITLE attributes of these tags over the marked text. That means that those flags won't make any difference as long as you don't use TITLE attributes in your documents.
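
In a configuration file the flags could look like the following fragment. The names are built from the patterns given above and are therefore assumptions; check sample.cfg for the exact spelling before relying on them.

```shell
# Hypothetical additions to a site configuration file.
IGNORE_EM=1          # any value switches EM off as a source of phrases
IGNORE_KEYWORDS=1    # skip META tags with NAME set to keywords
unset IGNORE_STRONG  # make sure STRONG is still indexed
USE_DT_TITLE=1       # prefer TITLE attributes on DT over the marked text
```

Because any value counts as "set", a flag has to be unset (not set to 0 or to the empty string) to be reliably switched off; that reading follows from the wording above and has not been verified against the scripts.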

The exclusion lists specified by the EXCLUDE_FILE and URL_EXCLUDE_FILE variables are files containing regular expressions, each expression on its own line. Each phrase-URL pair is matched against all expressions from both files: the phrase part against the expressions in the EXCLUDE_FILE file and the URL part against the expressions in the URL_EXCLUDE_FILE file.

If you use exclusion lists, try to be as specific as possible. For instance, use ^ and $ to refer to the start and end of a phrase, especially if you put single words into the lists. As everywhere else in Yasig, matching is case insensitive.
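
How such lists act on phrase-URL pairs can be sketched with a stand-alone filter. This is an illustration under stated assumptions, not the code Yasig uses: input lines are taken to be phrase, tab, URL, case is ignored by lower-casing both sides (a crude approach that would break patterns containing case-sensitive escapes), and the name apply_excludes is made up.

```shell
#!/bin/sh
# Sketch of exclusion-list filtering; not the code Yasig itself uses.
# apply_excludes PHRASE_PATTERNS URL_PATTERNS   (hypothetical helper)
# Filters "phrase<TAB>URL" lines from stdin against two files that hold
# one regular expression per line.
apply_excludes() {
    awk -F '\t' -v pfile="$1" -v ufile="$2" '
    BEGIN {
        # Patterns are lower-cased so that matching ignores case.
        while ((getline line < pfile) > 0) if (line != "") pre[++np] = tolower(line)
        while ((getline line < ufile) > 0) if (line != "") ure[++nu] = tolower(line)
    }
    {
        for (i = 1; i <= np; i++) if (tolower($1) ~ pre[i]) next
        for (i = 1; i <= nu; i++) if (tolower($2) ~ ure[i]) next
        print
    }
    '
}
```

With a phrase list containing ^the$ the entry "The" is dropped while "Theory" survives, which is why anchoring single words with ^ and $ matters.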

2.4 Hints and Recommendations for HTML Authoring

Yasig cannot do any magic. It helps you to create an index of your HTML documents, but in turn it needs your help to make that index really useful. All information Yasig uses in indexing comes from you, the HTML author.

2.4.1 Which Tags Get Indexed?

HTML elements suitable for extraction of important words and phrases should have at least some of the following properties:

They ...

Influenced by his own style of HTML writing, the author of Yasig has made this selection:

H1...H6, EM, STRONG, and DT
With the exception of DT, these elements fulfill all of the requirements. The exception with DT is that it does not require an end tag, but ends with the next DD, DT, or /DL if the end tag is omitted.
TITLE
Justification is obvious.
ABBR and ACRONYM
If marked explicitly as such, an acronym or abbreviation is probably important with respect to the document's content. Yasig treats them specially: if a TITLE attribute is provided with them, its value is added to the marked word in brackets.
Anchors
Anchors with an HREF attribute also get special treatment by Yasig. The marked phrase is considered a phrase that describes the document the HREF points to. Absolute URLs are considered remote, and only relative ones are used for indexing. Anchors without an HREF attribute are ignored in the current version.

Besides normal markup Yasig provides means to define additional keywords invisible to the user. Yasig honours the META tag with NAME attribute set to keywords as it is also used by search engines. Set NAME to localkeywords if you want to provide additional keywords for use by Yasig only.

To mark words and phrases within the body of a document, use the SPAN tag and set CLASS="index". If you provide a TITLE attribute, use tindex instead.
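
A small sample page may make the markup described above easier to picture. The sketch below writes such a page from a shell here-document; the helper name write_sample_page, the file name, and all phrases are invented for illustration, and what exactly Yasig extracts from the page has not been verified.

```shell
#!/bin/sh
# Writes a small sample page that uses the markup described above.
# write_sample_page FILE   (hypothetical helper; all content is made up)
write_sample_page() {
    cat > "$1" <<'EOF'
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<HTML>
<HEAD>
<TITLE>Backup Notes</TITLE>
<META NAME="keywords" CONTENT="backup, tape drive">
<META NAME="localkeywords" CONTENT="server room">
</HEAD>
<BODY>
<H1 ID="intro">Nightly Backups</H1>
<P>The <SPAN CLASS="index">rotation scheme</SPAN> uses
<SPAN CLASS="tindex" TITLE="tape labels">seven labelled tapes</SPAN>.</P>
</BODY>
</HTML>
EOF
}
```

The first SPAN marks its own text for the index, the second carries a TITLE attribute and therefore uses the class tindex, and the two META tags provide keywords for search engines and for local use respectively.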

2.4.2 Syntax

First and most important rule: Write syntactically correct HTML. Your WWW browser can render virtually everything into a visible page, be it HTML, HTML with errors, or a purely random series of characters. But Yasig cannot render random data or even HTML with errors into an index. Syntactical correctness (according to the W3C HTML 4.0 Specification) is simply assumed. Even slight errors might lead to confusion.

Due to the pseudo-parsing done by the index.awk script, you should never use the '<' and '>' characters anywhere in your documents where they do not delimit tags. Use the character entity references &lt; and &gt; instead, both in text and in quoted attribute values. The closing '>' of comments should always immediately follow the "--" with no whitespace or line breaks in between. By the way, PHP and ASP code delimiters are also recognized as comments. At least the author hopes they are, since this has not yet been tested.
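
Text that is pasted into a document can be run through a small filter first so that stray delimiters do not reach the indexer. A minimal sketch; the function name escape_html is an assumption, and the filter must only be applied to plain text, never to text that already contains markup or entity references.

```shell
#!/bin/sh
# Escapes the characters that would confuse a tag-splitting indexer.
# The ampersand is handled first so that the ampersands introduced by the
# other two replacements are not escaped again.
escape_html() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}
```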

2.4.3 Style

Make your markup as rich as possible. Use all the tags that are considered for indexing in accordance with their meaning. Write plain structural markup. Do not write HTML with certain visual effects in mind but in a way that it represents the logical structure of your text.

Make use of attributes. By providing document-wide unique identifiers as values of the ID attribute, you allow the index to point directly to a phrase instead of only to the document that contains it. The TITLE attribute, if not used otherwise in your documents, provides a way to let phrases in the index differ from those in the actual text.

Use the SPAN tag to invisibly mark phrases. Set the CLASS attribute to the value of index to put the marked text into the index or use the value tindex and provide a TITLE attribute.

Provide document-wide keywords in <META NAME="keywords" CONTENT="1st keyword, 2nd keyword, ..."> tags. They also help internet search engines to index your page. To provide additional keywords for local use only, set the NAME attribute to the value localkeywords.

When hyperlinking within your document collection, use relative URLs instead of absolute ones. When linking to "directories", always include the trailing slash, e.g. write "../" instead of just "..". This not only allows Yasig to include the destination URL of an internal link together with the anchor text in the index, but also saves you a lot of pain if you ever have to move your pages to a different site or a different part of your site's local namespace.

If your typical use of a certain tag makes that tag unsuitable for indexing, tell Yasig to ignore it.

3 A Detailed View of Yasig

Yasig's functionality is spread over three awk scripts. Each of these does one single task and could also be used on its own. A shell script puts them together in a pipe, provides for configuration, and does some sanity checks. Let's start our journey through Yasig with the shell script.

3.1 The Scripts Explained

3.1.1 yasig.sh

The yasig.sh script is what you run to generate an index of a collection of HTML documents. It takes only one parameter, a path to the configuration file. All options are set there. Once you have figured out the settings that fit your needs, you could easily start Yasig from time to time as a cron job. If you want to redirect Yasig's messages to /dev/null or to some file, mind that it talks to you via stderr. The index is directly written to a file whose name you specify in the configuration file.
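
For unattended runs, the messages on stderr can be appended to a log file. The wrapper below is a generic sketch; the name run_logged and the paths in the example call are placeholders, not part of Yasig.

```shell
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]   (hypothetical wrapper)
# Appends everything the command writes to stderr to LOGFILE and passes
# the exit status of the command through.
run_logged() {
    log=$1
    shift
    "$@" 2>> "$log"
}
# Example call (both paths are placeholders):
#   run_logged /var/log/yasig.log /usr/local/lib/yasig/yasig.sh /etc/yasig/mysite.cfg
```

A script containing such a call can then be started from cron. Since the index itself goes to the file named in the configuration, stdout needs no redirection.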

There is a global configuration section in yasig.sh which you should check and edit at installation time. All settings you make there can be overridden in the configuration files.

Global settings

What follows the global configuration section is rather simple. The configuration file specified on the command line is inlined using . $1 (which implies that configuration files have shell syntax). After some sanity checks, the three scripts and the sort command specified above are executed in a pipe, first the indexer, then the sort command, third the cleanup and last the presentation script. Output from the pipe is redirected to a file specified in the configuration file, where you can also specify files with arbitrary content to be prepended and appended respectively.

Should you want to include your own scripts in that pipe, data are exchanged as follows. The indexer and cleanup scripts each output a series of lines. A line consists of a single word or a space-separated list of words, followed by a horizontal tab (ASCII 09) character and, after the tab, a URL string. The cleanup and presentation scripts expect their input to be sorted alphabetically.
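
An additional stage only has to read and write this line format. As a sketch, the filter below drops very short phrases; the function name drop_short_phrases and the default minimum length of 3 are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical extra pipeline stage: drop phrases shorter than MIN characters.
# Input and output use the documented format, phrase<TAB>URL, one pair per
# line. Line order is preserved, so sorted input stays sorted.
drop_short_phrases() {
    awk -F '\t' -v min="${1:-3}" 'NF == 2 && length($1) >= min'
}
```

Because the stage keeps the order of its input, it could sit either before or after the sort command without upsetting the scripts that expect sorted input.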

3.1.2 index.awk

This is the core part of Yasig where all the interesting things happen. To understand how index.awk works you should be familiar with awk's way of dealing with input. If you aren't, read the overview sections of the awk(1) manual page.

The key idea behind the non-parsing of HTML with awk is to set the record separator, which defaults to the newline character, to a regular expression matching the tag delimiters '<' and '>'. This splits the input into records which are either a tag (including all attributes if it is a start tag) or text outside tags. So the HTML code <EM ID=13>blah</EM> is turned into the sequence

EM ID=13
blah
/EM

of records from awk's point of view. You can easily confuse the program by using the tag delimiters unquoted somewhere else.
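
The splitting can be tried out in isolation. The sketch below is not taken from index.awk; it only demonstrates the record separator trick. A regular expression as record separator is an extension supported by gawk, mawk, and busybox awk, while a strictly POSIX awk may use only the first character. The function name split_records is made up.

```shell
#!/bin/sh
# Splits HTML from stdin into tag and text records, one per output line.
split_records() {
    awk 'BEGIN { RS = "[<>]" }
         $0 !~ /^[ \t\n]*$/ { print }    # skip empty and whitespace-only records
    '
}
# Demonstration with the example from the text:
printf '%s' '<EM ID=13>blah</EM>' | split_records
```

The empty record in front of the first '<' is filtered out here for readability; a real indexer would more likely keep it, since its presence makes text and tag records alternate strictly.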

Whenever the awk script encounters a tag record it is interested in, it extracts some attribute values from it, sets a flag, and then collects all non-tag text until it sees the next end tag of the same type. At that point the collected text is output together with a URL and the flag is unset. Comment handling is a bit more difficult and will not be explained here. Have a look at the source if you are curious.
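
The flag-and-collect loop can be sketched for a single tag type. This is a simplified illustration, not the real index.awk: it handles only EM, ignores comments, assumes well-formed input so that odd-numbered records are text and even-numbered ones are tags, and builds the URL from a caller-supplied string plus the ID attribute. The function name extract_em and its parameter are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of a flag-and-collect loop; not the real index.awk.
# extract_em URL   reads HTML on stdin, prints "phrase<TAB>URL[#id]" lines.
extract_em() {
    awk -v url="$1" '
    BEGIN { RS = "[<>]" }                 # regex RS: needs gawk, mawk or similar
    NR % 2 == 0 {                         # even records are tags
        tag = toupper($1)
        if (tag == "EM") {                # start tag: set flag, remember ID
            collecting = 1; text = ""; id = ""
            for (i = 2; i <= NF; i++)
                if (toupper($i) ~ /^ID=/) {
                    id = $i
                    sub(/^[^=]*=/, "", id)
                    gsub(/"/, "", id)
                }
        } else if (tag == "/EM" && collecting) {
            gsub(/[ \t\n]+/, " ", text)   # normalize whitespace
            printf "%s\t%s%s\n", text, url, (id != "" ? "#" id : "")
            collecting = 0                # end tag: unset flag
        }
        next
    }
    collecting { text = text $0 }         # text record inside the element
    '
}
```

Relying on record parity is the simplest way to tell tags from text, and it also shows why a stray '<' or '>' in running text throws everything after it out of step.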

3.1.3 cleanup.awk

I'm an unwritten section.

3.1.4 present.awk

I'm an unwritten section.

3.2 TODO and Future Plans

I'm an unwritten section.

4 Useful Links

4.1 Similar Programs

This list is based on one collected and posted to de.comm.infosystems.www.authoring.misc by Heiko Schlenker <hschlen(at)gmx.de>.

htmltoc
Htmltoc is a Perl program to generate a table of contents for HTML documents.
http://www.oac.uci.edu/indiv/ehood/htmltoc.html
SiteIndex
SiteIndex.pl is a Perl script which produces an HTML index of HTML documents in a given directory tree. Options allow the script to be set to index an entire directory tree or just part of the tree, to display a user-specified number of items, to include text to act as headers and footers of the finished page (or in place of text, a Perl program may be provided to generate the headers/footers), and to include a variable amount of sample text from each file in the index.
URL unknown
tree
Creates an HTML sitemap.
http://www.ev-stift-gymn.guetersloh.de/server/tree_e.html
HTML-Tree
Generates an HTML tree diagram of a web site's HTML pages.
http://pagesz.net/~scotty/perlscripts/html-tree-desc.html
Sintrasearch
Simple Intranet Search Engine
http://www.geocities.com/necho8/sintra/sintra.html
wsmap
Fully configurable sitemap generator
http://kyd.net/vaga/, currently not available
SiteMap
http://www.klografx.de/software/
sitemapper
ftp://ftp.perl.org/pub/CPAN/authors/id/A/AW/AWRIGLEY/
MiniSearch
http://www.dansteinman.com/minisearch/
Perlfect Search
http://perlfect.com/freescripts/search/

4.2 Related Stuff

The World Wide Web Consortium
World Wide Web technology specifications and other information.
http://www.w3c.org
The GNU Project
Get GNU awk and information about it.
http://www.gnu.org
comp.lang.awk FAQ
Answers to frequently asked questions about the awk language from comp.lang.awk.
http://www.faqs.org/faqs/computer-lang/awk/faq/

Sven Türpe, August 2000. Web site sponsored by iT-Netservice GmbH.