Copyright © 2000 Sven Türpe <sven@gaos.org>
Download current version: yasig-0.2.tar.gz (30kB)
By downloading Yasig you accept the license agreement that you find below on this page in section 2.1.
Yasig is yet another site index generator. Yasig takes a collection of HTML files, extracts important words and phrases, forges the corresponding URLs and creates an alphabetically sorted list. You can provide your own script to turn that list into an HTML document or use the default presentation with your own HTML code around it.
The kind of index the author had in mind while writing Yasig was the sort one can find at the end of books. Hence Yasig generates a list of keywords or phrases with references to the locations where more information can be found -- and nothing more. Yasig does not collect page information or context but simply phrases and URLs.
Yasig doesn't use any linguistic methods. Instead,
it tries to guess what might be important phrases from HTML
markup in a configurable way. You can support the indexing process
by providing invisible keywords using the META tag and by
invisible markup using the SPAN tag with a certain value
of the CLASS attribute. It is also possible to exclude
certain phrases or URLs from indexing by regular expression matching.
The intended use of Yasig is on small web sites with handcrafted HTML. What Yasig can do for you is best explained by example. Have a look at the index generated for this document and the site Yasig was originally written for.
Yasig is not stable. It is the result of three nights of hacking and is mostly untested. Expect bugs and problems.
Yasig is neither clickable nor easy to use. You will have to edit files for configuration and possibly even write your own presentation script or hack the one provided with Yasig.
Yasig is not fault-tolerant with respect to your HTML code. It relies on syntactical correctness and good HTML authoring style. If you don't know what a validator is and why to use it, you won't like Yasig. Likewise, if you click your HTML documents together in some HTML editor without knowing the concepts behind them, you won't like Yasig.
Yasig is in no way correct. It does not really parse HTML but identifies
tags and text between tags, and it does not support any character set
other than ISO 8859-1. So if the language of your documents is
not English or one of the European languages within the scope of
ISO 8859-1, you won't want to use Yasig. (You may still give
it a try as long as your HTML documents are stored in some 8-bit
character encoding that is a superset of ASCII and do not
contain any of the character entity
references for ISO 8859-1 characters. In that case you should
remove the umlaut handling code from the presentation script or write
your own one. Unicode characters won't hurt if and only if they are
given as HTML character entity references, i.e. &foobar;, but
they might confuse sort.)
Yasig is intended (but not yet tested) to run on any GNU/Unix system. It requires only three standard utilities:
- GNU awk. Yasig uses several gawk features: special file names
  (such as /dev/stderr) for I/O redirection, the
  IGNORECASE variable, use of RS as a regular expression,
  and use of delete array. In theory it would be possible to
  make Yasig more portable. In practice, the author does not want to
  do so as long as he does not receive requests for it.
- A Bourne-compatible shell such as /bin/bash, /bin/ksh,
  or /bin/sh. Csh-like shells won't work.
- The sort utility. If you use non-English languages
  with special characters (e.g. umlauts) in your documents, your
  version of the sort utility should honour the
  locale settings (i.e. the LANG and LC_*
  environment variables) and put special characters into the correct
  equivalence classes. Especially when using the GNU textutils
  package it could be necessary to update to a recent version.
You will also need a text editor.
Copyright (c) 2000 Sven Türpe <sven@gaos.org>
This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Any use of this software is at the user's own risk.
Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:
To install Yasig:
1. Edit index.awk,
   cleanup.awk, and present.awk and adjust
   the path in the first line to point to wherever your awk interpreter
   is installed. The default is /usr/bin/awk. Leave the -f
   option untouched. Do the same for the shell script, yasig.sh.
   The path defaults to /bin/ksh here.
2. Check the global configuration section in yasig.sh and
   make changes where necessary.
3. Set the permissions of *.awk and yasig.sh.
   They must be executable for any user invoking Yasig.
4. Copy the sample.cfg file to your preferred location and
   edit it to fit your needs. It is heavily commented.
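The interpreter adjustment can also be scripted. What follows is only a sketch: the function name set_interpreter and the paths in the usage comments are made up for illustration and are not part of the Yasig distribution.

```shell
# set_interpreter FILE INTERPRETER
# Rewrite the interpreter path in the first line of FILE and keep
# whatever follows it (such as awk's -f option) untouched.
# Hypothetical helper, not shipped with Yasig.
set_interpreter() {
    file=$1
    interp=$2
    tmp="$file.tmp.$$"
    sed '1s|^#! *[^ ]*|#!'"$interp"'|' "$file" > "$tmp" &&
        cat "$tmp" > "$file" &&     # overwrite in place, keeps permissions
        rm -f "$tmp"
}

# Possible use (paths are examples only):
#   for f in index.awk cleanup.awk present.awk; do
#       set_interpreter "$f" /usr/local/bin/gawk
#   done
#   set_interpreter yasig.sh /bin/bash
#   chmod a+rx *.awk yasig.sh
```

The helper assumes that the interpreter path contains neither spaces nor the characters | and &.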
Using Yasig mainly means writing a configuration file and then
running the yasig.sh shell script with the path to this
configuration file as its only argument. You don't have to write
the configuration file from scratch. A heavily commented example
(sample.cfg) comes with the distribution of Yasig. If
everything works as intended, you will see some messages on
stderr, and when Yasig finishes you can find the newly created index at
the location specified in the configuration file.
You should at least set the following configuration variables to values specific to your site:
- LC_ALL and LANG: the locale settings for the language of your
  documents. Depending on the language you may have to adapt the
  present.awk script slightly.
- DOCROOT: the directory that contains your HTML documents.
- URLPATH: the URL path that corresponds to DOCROOT. Can be a
  relative URL with respect
  to the intended location of the index to be generated or even
  empty. The value of URLPATH is used in conjunction
  with the file names (see below) to forge URLs.
- SITEINDEX: SITEINDEX specifies where to put
  the index. Should be an absolute path.
- FILES: the files to index, given relative to DOCROOT,
  i.e. if you set DOCROOT=/var/httpd/htdocs, the file
  /var/httpd/htdocs/foo/bar.html is specified as
  foo/bar.html. You may use wildcards and command
  substitution.
- EXCLUDE_FILE and URL_EXCLUDE_FILE: the files that hold the
  exclusion lists for phrases and URLs respectively.

Other options only influence details of indexing and presentation. You can leave them untouched when trying Yasig for the first time.
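For orientation, a minimal configuration file might look like the sketch below. Configuration files have shell syntax, but every value here is an invented example, and the exact way FILES gets expanded is an assumption; sample.cfg remains the authoritative reference.

```shell
# Hypothetical site configuration; all values are examples only.
LC_ALL=de_DE
LANG=de_DE

# Directory that holds the HTML documents.
DOCROOT=/var/httpd/htdocs

# Empty here because the index is meant to live in DOCROOT itself,
# so relative URLs need no prefix.
URLPATH=""

# Where the generated index goes (absolute path).
SITEINDEX=/var/httpd/htdocs/siteindex.html

# Files to index, relative to DOCROOT. The wildcards are quoted so that
# they expand later, not while this file is read (an assumption about
# how Yasig uses the variable).
FILES="*.html */*.html"

# Exclusion lists for phrases and URLs.
EXCLUDE_FILE=/home/www/yasig/exclude.txt
URL_EXCLUDE_FILE=/home/www/yasig/url-exclude.txt
```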
If you use Yasig with the presentation script supplied by the author, it generates only the raw index with some HTML markup. You have to supply the surrounding HTML and possibly a style sheet to change the appearance.
You can do this in at least two ways. First, you could put the generated
index and the stuff that goes around it in separate files and use some server
side include mechanism to put them together automatically. In that case
you set LEADING_HTML_FILE and TRAILING_HTML_FILE
both to the empty string. Yasig replaces the empty string by the filename
/dev/null, which indicates end of file whenever it is read.
Second, you could split your site index template at the position where
you want the actual index to be included, put both parts into separate
files, and specify their locations using the LEADING_HTML_FILE
and TRAILING_HTML_FILE variables in your configuration file.
Then the generated index is the concatenation of the
LEADING_HTML_FILE file, the index itself, and
the TRAILING_HTML_FILE file.
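The effect of both variants can be pictured with a few lines of shell. This is not Yasig's own code, merely a sketch of the concatenation described above; the function name assemble_index is invented.

```shell
# assemble_index LEADING INDEX TRAILING
# Write LEADING, then INDEX, then TRAILING to stdout. An empty LEADING
# or TRAILING argument falls back to /dev/null, which reads as an
# empty file. Hypothetical helper for illustration.
assemble_index() {
    cat "${1:-/dev/null}" "$2" "${3:-/dev/null}"
}
```

Called with two empty strings, the function passes the raw index through unchanged, which corresponds to the server side include variant.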
There are currently only a few options that influence the appearance
of the index as created by the default presentation script
present.awk. You can select the tag that is used to
mark the heading that is created for each letter by setting
the HEADTAG variable.
You can also set a limit on
the number of references per phrase. Any keyword or phrase that has
more URLs associated with it than the numeric value of MAXLINKS
is discarded. The third presentation option is provided by
the MENULINE flag. If set to any value, a line with
the alphabetic letters from A to Z is created in front of the index
and each letter that actually appears in the index gets hyperlinked
to the appropriate location within the index.
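To illustrate what the MAXLINKS limit does, here is a small stand-alone filter. It is a sketch and not taken from present.awk. It assumes sorted input lines of the form phrase, tab, URL, and the name limit_links is invented.

```shell
# limit_links MAX
# Read sorted "phrase<TAB>URL" lines from stdin and drop every phrase
# that has more than MAX URLs. Phrases are compared case-insensitively.
# Illustration only; Yasig's own implementation may differ.
limit_links() {
    awk -F '\t' -v max="$1" '
        function flush(   i) {
            if (n > 0 && n <= max)
                for (i = 1; i <= n; i++) print line[i]
            n = 0
        }
        tolower($1) != key { flush(); key = tolower($1) }
        { line[++n] = $0 }
        END { flush() }
    '
}
```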
To help you with stylesheeting, the whole index is framed in a
<DIV CLASS="siteindex">. The headings and the
lists (but not the single list items) used in the index also have
their class attribute set to siteindex. When using the
MENULINE feature, the menu line is marked as a
paragraph in its own DIV, both of class
siteindexhead.
If your HTML documents are syntactically correct, have some content and represent the logical structure of the content by useful HTML markup, the first run of Yasig will create an index for you that is already usable but possibly polluted with phrases or references you would rather not see there.
Two mechanisms can help you to clean up the index and keep it small. First, you can influence how certain tags are treated and whether they are used at all by setting or unsetting the corresponding environment variables within the configuration file. Second, you can provide regular expressions to discard everything that matches them, either based on the phrase or the URL.
Each type of tag has an ignore flag associated with it. By setting
it to any value, you tell index.awk not to use this tag.
In most cases this flag is called
IGNORE_<TAG>
where <TAG> is a placeholder for the tag to be ignored. The only
exceptions are the IGNORE_KEYWORDS and IGNORE_LOCAL
flags. They refer to META tags with the NAME
attribute set to keywords and localkeywords
respectively.
For some tags (A, H?, EM,
STRONG, and DT) there is also a
USE_<TAG>_TITLE
flag which tells Yasig to
prefer the TITLE attributes of these tags over
the marked text. That means that those flags won't make any
difference as long as you don't use TITLE attributes
in your documents.
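In a configuration file such flags might be used as sketched below. The concrete names are derived from the IGNORE_<TAG> and USE_<TAG>_TITLE patterns and are therefore assumptions; check sample.cfg for the names Yasig actually uses.

```shell
# Hypothetical excerpt from a configuration file.
IGNORE_STRONG=1      # do not index text marked with STRONG
IGNORE_KEYWORDS=1    # skip <META NAME="keywords" ...>
USE_A_TITLE=1        # prefer the TITLE attribute of anchors over the link text

# Unset a flag to switch it off again.
unset IGNORE_EM
```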
The exclusion lists specified by the EXCLUDE_FILE and
URL_EXCLUDE_FILE variables are files containing
regular expressions, one expression per line. Each
phrase-URL pair is matched against all expressions from both
files, the phrase part against the expressions in the
EXCLUDE_FILE file and the URL part against the expressions
in the URL_EXCLUDE_FILE file.
If you use exclusion lists, try to be as specific as possible. For instance use ^ and $ to refer to start of phrase and end of phrase, especially if you put single words into the lists. As everywhere else in Yasig, matching is case insensitive.
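How phrase exclusion behaves can be tried out with the following sketch. It is not the code from cleanup.awk. The function name exclude_phrases is invented, and case-insensitive matching is approximated by lowercasing both the phrase and the expression, which is a simplification.

```shell
# exclude_phrases REGEX_FILE
# Read "phrase<TAB>URL" lines from stdin and drop every line whose
# phrase matches one of the regular expressions in REGEX_FILE
# (one expression per line). Illustration only.
exclude_phrases() {
    awk -F '\t' -v list="$1" '
        BEGIN {
            while ((getline line < list) > 0)
                if (line != "") re[++n] = tolower(line)
            close(list)
        }
        {
            phrase = tolower($1)
            for (i = 1; i <= n; i++)
                if (phrase ~ re[i]) next
            print
        }
    '
}
```

With an exclusion list containing ^the$ the phrase "The" is dropped while "Theory" survives, which is why anchoring single words matters.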
Yasig cannot do any magic. It helps you to create an index of your HTML documents, but in turn it needs your help to make that index really useful. All information Yasig uses in indexing comes from you, the HTML author.
HTML elements suitable for extraction of important words and phrases should have at least some of the following properties:
They ...
... CODE
or KBD do, and ...

Influenced by his own style of HTML writing, the author of Yasig has made this selection:

- H1...H6, EM, STRONG, and DT:
  With the exception of DT, these elements fulfill all
  of the requirements. The exception with DT is
  that it does not require an end tag, but ends with the next
  DD, DT, or /DL if the
  end tag is omitted.
- TITLE
- ABBR and ACRONYM:
  If a TITLE attribute is provided with
  them, its value is added to the marked word in brackets.
- Anchors:
  Anchors with an HREF attribute also get special
  treatment by Yasig. The marked phrase is considered as a
  phrase that describes the document the HREF points
  to. Absolute URLs are considered remote and only relative ones
  are used for indexing. Anchors without an HREF attribute
  are ignored in the current version.
Besides normal markup Yasig provides means to define additional keywords
invisible to the user. Yasig honours the META tag with
NAME attribute set to keywords as it is
also used by search engines. Set NAME to
localkeywords if you want to provide additional
keywords for use by Yasig only.
To mark words and phrases within the body of a document, use the
SPAN tag and set CLASS="index". If you
provide a TITLE attribute, use tindex
instead.
First and most important rule: Write syntactically correct HTML. Your WWW browser can render virtually everything into a visible page, be it HTML, HTML with errors or a purely random series of characters. But Yasig cannot render random data or even HTML with errors into an index. Syntactical correctness (according to the W3C HTML 4.0 Specification) is simply assumed. Even slight errors might lead to confusion.
Due to the pseudo-parsing done by the index.awk script, you
should never use the '<' and '>' characters in any place of
your documents where they do not delimit tags. Use the character
entity references &lt; and &gt; instead in text and
also in quoted attribute values. The closing '>' of
comments
should always immediately follow the "--" with no whitespace or
line breaks in between. By the way, PHP and ASP code delimiters
are also recognized as comments. At least the author hopes they are,
since this has not been tested yet.
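If you paste plain text (program code, for instance) into a document, a small filter can take care of the replacement. The sketch below is not part of Yasig and the name escape_html is invented. Apply it to text fragments only, never to a complete HTML file, or the tags will be escaped as well.

```shell
# escape_html
# Replace &, < and > in the text read from stdin by their character
# entity references. The ampersand must be handled first.
escape_html() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}
```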
Make your markup as rich as possible. Use all the tags that are considered for indexing in accordance with their meaning. Write plain structural markup. Do not write HTML with certain visual effects in mind but in a way that it represents the logical structure of your text.
Make use of attributes. By providing document-wide unique identifiers
as values of the ID attribute you allow the index to directly
point to a phrase instead of only the document the phrase is contained in.
The TITLE attribute if not used otherwise in your documents
provides a way to let phrases in the index differ from those in the actual
text.
Use the SPAN tag to invisibly mark phrases. Set the
CLASS attribute to
the value of index to put the marked text into the index
or use the value tindex and provide a TITLE
attribute.
Provide document-wide keywords in <META NAME="keywords"
CONTENT="1st keyword, 2nd keyword, ..."> tags. They also
help internet search engines to index your page. To provide
additional keywords for local use only, set the NAME attribute
to the value localkeywords.
When hyperlinking within your document collection, use relative URLs instead of absolute ones. When linking to "directories", always include the trailing slash, e.g. write "../" instead of just "..". This not only allows Yasig to include the destination URL of an internal link in conjunction with the anchor text into the index but also saves you a lot of pain if you ever have to move your pages to a different site or a different part of your site's local namespace.
If your typical use of a certain tag makes that tag unsuitable for indexing, tell Yasig to ignore it.
Yasig's functionality is spread over three awk scripts. Each of these does one single task and could also be used on its own. A shell script puts them together in a pipe, provides for configuration, and does some sanity checks. Let's start our journey through Yasig with the latter one.
The yasig.sh script is what you run to generate an index
of a collection of HTML documents. It takes only one parameter, a
path to the configuration file. All options are set there. Once you
have figured out the settings that fit your needs, you could easily
start Yasig from time to time as a cron job. If you want to redirect
Yasig's messages to /dev/null or to some file, mind that
it talks to you via stderr. The index is directly written to a file
whose name you specify in the configuration file.
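A wrapper for unattended runs might look like this sketch. The function name run_yasig and all paths are hypothetical; the only thing taken from the text above is that messages arrive on stderr.

```shell
# run_yasig YASIG_SCRIPT CONFIG LOGFILE
# Run Yasig with the given configuration file and append its messages,
# which arrive on stderr, to LOGFILE. Hypothetical helper.
run_yasig() {
    "$1" "$2" 2>> "$3"
}

# Possible use (example paths):
#   run_yasig /usr/local/yasig/yasig.sh /home/www/site.cfg /var/log/yasig.log
#
# A matching crontab entry without the wrapper, run nightly at 03:15:
#   15 3 * * * /usr/local/yasig/yasig.sh /home/www/site.cfg 2>>/var/log/yasig.log
```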
There is a global configuration section in yasig.sh which
you should check and edit at installation time. All settings you make
there can be overridden in the configuration files.
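Global settings:
- YASIG_DIR: the directory where the Yasig scripts are installed.
- SORT: the sort command. Defaults to /bin/sort -d -f -u
  which means "ignore non-alphanumeric characters, ignore case,
  output only the first of several equal lines".
- INDEX: the indexer script. Defaults to ${YASIG_DIR}/index.awk.
- CLEANUP: the cleanup script. Defaults to ${YASIG_DIR}/cleanup.awk.
- PRESENTATION: the presentation script. Defaults to ${YASIG_DIR}/present.awk.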
What follows the global configuration section is rather simple. The
configuration file specified on the command line is inlined using
. $1 (which implies that configuration files have shell
syntax). After some sanity checks, the three scripts and the sort
command specified above are executed in a pipe, first the indexer,
then the sort command, third the cleanup and last the presentation
script. Output from the pipe is redirected to a file specified in
the configuration file, where you can also specify files with
arbitrary content to be prepended and appended respectively.
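Stripped of configuration and sanity checks, the pipe could be pictured roughly as follows. This is a sketch and not the actual content of yasig.sh. In particular, changing into DOCROOT and passing the files to the indexer as arguments are assumptions, the function name run_pipeline is invented, and the prepended and appended files are left out.

```shell
# run_pipeline
# Chain indexer, sort, cleanup and presentation as described above.
# Uses the variables INDEX, SORT, CLEANUP, PRESENTATION, DOCROOT, FILES
# and SITEINDEX. Illustration only.
run_pipeline() {
    # $FILES is left unquoted on purpose so that wildcards expand
    # relative to DOCROOT (an assumption about how Yasig treats FILES).
    ( cd "$DOCROOT" && $INDEX $FILES ) |
        $SORT |
        $CLEANUP |
        $PRESENTATION > "$SITEINDEX"
}
```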
Should you want to include your own scripts into that pipe, data are exchanged as follows. The indexer and cleanup scripts each output a series of lines. A line consists of a single word or a space-separated list of words. Then a horizontal tab (ASCII 09) character follows and after the tab a URL string. The cleanup and presentation scripts expect their input to be sorted alphabetically.
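A filter of your own only has to read and write this line format and keep the order intact. As a sketch, the following invented function drop_short_phrases removes phrases below a minimum length; it could sit anywhere between the sort command and the presentation script.

```shell
# drop_short_phrases MINLEN
# Pass through "phrase<TAB>URL" lines whose phrase has at least MINLEN
# characters. Malformed lines without exactly one tab are dropped.
# The order of the lines is preserved. Hypothetical example filter.
drop_short_phrases() {
    awk -F '\t' -v min="$1" 'NF == 2 && length($1) >= min'
}
```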
This is the core part of Yasig where all the interesting things
happen. To understand how index.awk works you should
be familiar with awk's way of dealing with input. If you aren't,
read the overview sections of the awk(1) manual page.
The key idea to non-parsing of HTML with awk is to set the record
separator, that defaults to the newline character, to a regular
expression matching the tag delimiters '<' and '>', thus
splitting the input into records which are either a tag (including
all attributes if it is a start tag) or text outside tags. So the
HTML code <EM ID=13>blah</EM> is turned into
the sequence
EM ID=13
blah
/EM
of records from awk's point of view. You can easily confuse the program by using the tag delimiters unquoted somewhere else.
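You can watch the splitting happen with a one-liner. The sketch below is not taken from index.awk. It needs an awk that accepts a regular expression as RS, and the name split_records is invented.

```shell
# split_records
# Split stdin at '<' and '>' and print every record that contains
# something other than whitespace, one per line.
split_records() {
    awk 'BEGIN { RS = "[<>]" } /[^ \t\n]/ { print }'
}

# Possible use:
#   printf '<EM ID=13>blah</EM>\n' | split_records
```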
Whenever the awk script encounters a tag record it is interested in, it extracts some attribute values from it, sets a flag and then collects all non-tag text until it sees the next end tag of the same type, where the collected text is output together with a URL and the flag is unset. Comment handling is a bit more difficult and will not be explained here. Have a look at the source if you are curious.
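The flag-and-collect idea can be shown in miniature for a single tag. The following sketch is not the code from index.awk: it handles only EM, ignores attributes and comments, and tells tags from text by record number (even records are tags), which works only as long as '<' and '>' occur nowhere else. With gawk, the RT variable would be a more robust way to see which delimiter ended a record. The function name extract_em and its URL parameter are invented.

```shell
# extract_em URL
# Read HTML from stdin and print "phrase<TAB>URL" for every EM element.
# Simplified illustration of the technique described above.
extract_em() {
    awk -v url="$1" '
        BEGIN { RS = "[<>]" }
        NR % 2 == 0 {                       # even records are tags
            tag = toupper($1)
            if (tag == "EM") {              # start tag: begin collecting
                collecting = 1
                phrase = ""
            } else if (tag == "/EM" && collecting) {
                gsub(/[ \t\n]+/, " ", phrase)   # normalize whitespace
                sub(/^ /, "", phrase)
                sub(/ $/, "", phrase)
                if (phrase != "") print phrase "\t" url
                collecting = 0              # end tag: unset the flag
            }
            next
        }
        collecting { phrase = phrase $0 }   # text between the tags
    '
}
```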
I'm an unwritten section.
This list is based on one collected and posted to de.comm.infosystems.www.authoring.misc by Heiko Schlenker <hschlen(at)gmx.de>.
Sven Türpe, August 2000. Web site sponsored by iT-Netservice GmbH.