AWK Scripts: Web Pages

This next example program demonstrates using the technique of scripts creating and managing scripts to extract information from Web pages and put it in a format usable by other programs. The specific example was developed to extract share prices from a Johannesburg (South Africa) Stock Exchange page. Since that is of rather limited use, I have prepared a couple of versions of the formatting code: one to insert decimal points three places from the end of the price and one to convert American style numbers and fractions into decimal numbers. Some recoding will still be needed to adapt the program to other uses, but my task is to provide learning tools and prototype programs, not individual solutions to individual problems.

I base the prototypes on the first case of an interesting type brought to my attention, so the person who suggests a program that I decide to use to illustrate one or more interesting points gets a specific solution, but misses the learning experience of adapting the prototype to his/her own needs. This particular case provides opportunities to show how to

In this program, the names of the shares to be processed are passed to the program through the environment: the configuration file is, in fact, just a subroutine batch file that sets environment variables. This is trivial on the batch side, but subtle on the AWK side. The share codes go in a series of variables of the form AWKVARn, where n is a number or letter - it doesn't matter what they are, so long as there are no duplicates and the variable name is in upper case if the OS is NT. The numbers/letters do not have to be sequential or in order: the variable values become string indexes in an array, and the program looks each code up by value. The share codes appear in the output stream in the order in which they occur on the page, whatever order the variables were set in.

Two additional variables, AWKLINES and AWKDATE, are set to tell the program what line the desired price is on relative to the line containing the share code, and to partially define the date string and the portion to remove.

MLTL0080.BAT (SETNAMES.BAT)
 @echo off
 set AWKLINES=4
 set AWKVAR01=ABC
 set AWKVAR0G=CDE
 set AWKVAR03=GHI
 set AWKDATE=on
Note that the entries are out of their final order and are nothing more than SET commands. This code could easily be incorporated in the main batch file, but would lose its identity as a configuration file. The test Web page used for demonstration contains several more fake share entries than are to be processed. To avoid copyright problems, the "share names" used are blocks of three consecutive letters and are not intended to represent real stock market symbols, though some of them are likely, and unavoidably, to duplicate real symbols used somewhere in the world. Several of the share prices were made up to show various features of the way printf() can display numbers. The prices on the real page that was given to me as an example lack a decimal point, but must have one (with three digits after it) in the report - I have made some prices contain decimal points, and others so small that they have nothing to the left of the decimal point.
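The environment technique can be tried on its own before tackling the whole program. The following is a minimal sketch, written for a Unix-style shell rather than a batch file purely so that it runs as one piece; the AWK part is the same either way. The variable names and values mirror SETNAMES.BAT, while the array name Wanted and the code XYZ are made up for the illustration.

```sh
# Set the AWKVAR* variables for this one awk process only (values as in SETNAMES.BAT).
summary=$(AWKVAR01=ABC AWKVAR0G=CDE AWKVAR03=GHI awk 'BEGIN {
    # Every environment variable whose name begins with AWKVAR
    # contributes its *value* as an index of the array.
    for (name in ENVIRON)
        if (name ~ /^AWKVAR/)
            Wanted[ENVIRON[name]] = ""
    # Count the collected codes and test two for membership.
    count = 0
    for (code in Wanted) count++
    hasCDE = ("CDE" in Wanted)   # a code that was set
    hasXYZ = ("XYZ" in Wanted)   # a hypothetical code that was not
    print count, hasCDE, hasXYZ
}')
echo "$summary"
```

The sketch tests membership with "in" rather than printing the array, because the order in which "for( x in array )" visits the indexes is not something to rely on.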

Previous examples of AWK scripts have used the print statement for output, but AWK also provides a clone of the powerful printf() statement from the C language. It is not my intent to provide a tutorial on the secondary languages I use here, rather to introduce some of the tools they provide. I won't go very far into the syntax of printf() - it would take an entire page by itself and would duplicate material readily available in the documentation sources mentioned near the top of this page. See the page on Character Codes in Decimal, Hex, and Octal mentioned above for a few notes on printf().
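As a small taste of printf(), here is a sketch of the kind of output line this program builds: a string, a number shown with exactly three decimal places, and another string. It is written for a Unix-style shell only so that it can be run as it stands, and the raw price of 100 for GHI is an invented value, not one taken from the sample page.

```sh
report=$(awk 'BEGIN {
    DateField = "9/12/1999"
    # %s prints a string as is; %.3f prints a number with three digits
    # after the decimal point, supplying a leading 0 when needed.
    printf("%s,%.3f,%s\n", "ABC", 123000 * 0.001, DateField)
    printf("%s,%.3f,%s\n", "GHI", 100 * 0.001, DateField)
}')
echo "$report"
```

Multiplying by 0.001 is what puts the decimal point three places from the end of the price; the format only controls how many digits are shown.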

Plain old AWK is not the most suitable dialect for extracting data from HTML containers because it would be more efficient to make the tags themselves - or even better, regular expressions describing them - the field separator, rather than single characters. However, that feature is not found in MAWK, so we have to use a somewhat more cumbersome method based on the fact that the "<" character is a forbidden character in the text of HTML pages - it is allowed only as part of a tag or as the escaped entity "&lt;". What we will do here is to separate fields on "<", then separate them again on ">" to isolate the contents of containers. (The script below does both at once by first changing every ">" into "<" and then calling split() on "<".) I have omitted tags that aren't parts of pairs of container tags; this is not important, as they would just complicate field counting. To adapt this program for your own use, you will have to analyse the page you want to process to determine which fields are which. Many lines as well as fields will have to be ignored, and this has to be embedded in the code.

The structure of the sample page is basically a table: the columns in a row are represented on sequential lines in the file, but each block has the same structure - the meaning of the row in a <TR></TR> container matches the heading in the corresponding row in the first such container, the one with its fields in <TH></TH> containers. The data fields themselves have <TD></TD> container tags. The price fields have opening tags that contain alignment information as part of the tag which will cause the field count for the desired data to be different for the name and stock code fields, and the price fields.

There are several possible methods by which the desired data can be found in the file, including some in which the program actually recognizes the data by its nature and others in which the locations of the data are hard coded into the program. This program uses a combination approach based on some assumptions about the data: examination of the sample page and its source (view the page source code to see the source) shows that the closing price is on the fourth line of each block, counting the line that exactly matches the share code as the first, and that there is only one numerical field on each line. The date field needed for every output line appears only once in the entire page and contains only numbers and '/'s, and it occurs before any other data - this allows and requires us to manage the date field as a special case. The line containing the date is
<H3 ALIGN="center">on 9/12/1999</H3>
and a typical data block is
        <TD>ABC</TD>
        <TD>Abend Computers</TD>
        <TD align="right">123456</TD>
        <TD align="right">123000</TD>
        <TD align="right">123412</TD>
After our two-stage splitting process, these produce the fields
H3 ALIGN="center"
on 9/12/1999
/H3
and
(spaces)
TD
ABC
(spaces)
/TD
(spaces)
TD align="right"
123456
(spaces)
/TD
and so forth.
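To check the field numbering for one line, the splitting step can be run by itself. This is a sketch for a Unix-style shell, not a piece of the final program; it feeds one price line from the data block above to AWK and prints the field count followed by fields 2 to 4.

```sh
fields=$(printf '%s\n' '        <TD align="right">123000</TD>' | awk '{
    # Turn every ">" into "<" so that a single split() on "<" does the job.
    gsub(/>/, "<")
    n = split($0, Fields, "<")
    # Field 1 is the leading spaces and the last field is empty, because
    # the line ends with a separator; the useful ones are in between.
    print n ":" Fields[2] ":" Fields[3] ":" Fields[4]
}')
echo "$fields"
```

The opening tag, attributes and all, lands in a single field, and the price is the only field on the line that contains a numeral.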

The date still requires additional processing - we can either split it, or replace the extra stuff with a null substring. The latter is a bit simpler and will be used here, though it must be kept in mind that this requires that the format of the line never change during the life of the script.
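Both treatments of the date can be seen in a few lines. The sketch below, again for a Unix-style shell, removes the marker with sub() and then uses split() to rearrange the parts; the zero padding done by the %02d formats is my addition for the illustration, not something the page supplies.

```sh
date=$(awk 'BEGIN {
    DateMark = "on "              # value of AWKDATE plus a space
    DateField = "on 9/12/1999"    # the field isolated from the H3 line
    # Replace the marker with a null string, leaving only the date.
    sub(DateMark, "", DateField)
    # Split the date on "/" so the parts can be put in another order.
    split(DateField, D, "/")
    printf("%s|%04d-%02d-%02d\n", DateField, D[3], D[2], D[1])
}')
echo "$date"
```

The first half of the output is the date as it appears on the page, the second half is the same date in year-month-day order.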

The closing price, as noted above, is on the fourth line of the block, counting the code line as the first, and by inspection of the list above, the price is the only numerical field in its line, so we need only identify the code line as one we want to process, extract the code field, count lines until we reach the closing price line, extract the numerical field there, and do nothing until the next code match.
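The counting scheme is easier to follow in a cut-down form. This sketch, for a Unix-style shell, handles a single share code passed in with -v instead of through the environment, and it finds the code line with a simple index() test rather than the exact field comparison used in the real script; both are simplifications made for the illustration.

```sh
price=$(printf '%s\n' \
    '<TD>ABC</TD>' \
    '<TD>Abend Computers</TD>' \
    '<TD align="right">123456</TD>' \
    '<TD align="right">123000</TD>' \
    '<TD align="right">123412</TD>' |
awk -v Code=ABC -v MaxLcount=4 '{
    # While a block is being counted, every line advances the counter.
    if (Lcount != 0) Lcount++
    # Not counting yet: a line holding the code starts the count at 1.
    if (Lcount == 0 && index($0, ">" Code "<")) Lcount = 1
    # The counter has reached the price line: report it and start over.
    if (Lcount == MaxLcount) {
        gsub(/[^0-9]/, "")    # keep only the numerals on this line
        print Code "," $0
        Lcount = 0
    }
}')
echo "$price"
```

With the counter starting at 1 on the code line, a value of 4 picks the second of the three prices in the block.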

Now, after all that, we get to the real thing: the script to extract data from the sample Web page and produce one output line for each share listed in the AWKVARn environment variables. You will need to copy/paste MLTL0080.BAT above into a file named SETNAMES.BAT and then run the file from the same command prompt you will run the test script from. Copy/paste the following script into a file named GETSHRS.AWK and issue the command
AWK -fgetshrs.awk shares.htm > result.txt
where shares.htm is a saved copy of the sample page. The RESULT.TXT file will contain
ABC,123.000,1999-12-9
CDE,76.543,1999-12-9
GHI,0.100,1999-12-9
If your version of AWK is called MAWK or GAWK, or whatever, either copy it with the name AWK.EXE or change the command to match whatever you have.

GETSHRS.AWK
BEGIN{
# Explore the entire environment (this copy) for entries having names 
# beginning with "AWKVAR", the marker for variables of interest to
# this program.
    for( i in ENVIRON ){
# The ^ restricts the match to names that really begin with the marker.
        if( i ~ /^AWKVAR/ ){
# When one is found make its *value* an index into an array of null strings.
# Some versions of AWK don't like array indexes that are themselves array
# data, so an intermediate variable will be used.
            s = ENVIRON[ i ]
            Array[ s ] = ""
        }
# There are other variables of interest: the one that specifies the location
# of the desired price field relative to the name code and the one that 
# defines both the date field itself and the string to be removed.
        if( i == "AWKLINES" ) MaxLcount = ENVIRON[ i ] + 0
        if( i == "AWKDATE" ) DateMark = ENVIRON[ i ] " "
# The space was added to DateMark so there would be no invisible characters
# in the batch file that sets the variables.  This might have to be changed
# if the date string format doesn't have a space just before the date.
    }
# Initialize the variable to be used as a line counter.
    Lcount = 0
}
{
# We need to split up the line into fields with either "<" or ">"
# as the field separator to isolate the contents of containers, and 
# also the contents of the tags themselves as an unwanted byproduct. The 
# easiest way to do this in this language is to replace ">" with "<"
# and split() the line into an array.

    gsub(/>/, "<" ) 

# Blank lines are ignored by the "if()" - zero fields means an empty line
    if(split( $0, Fields, "<" ) != 0 ) {
# Increment the line counter if it needs to be.
        if( Lcount != 0 ) Lcount++
# Look for the line with the date - it contains a date in nn/nn/nnnn format,
# where the 'n's are numerals. The contents of the AWKDATE variable followed
# by a numeral identifies the line.
        s1 = DateMark "[0-9]"
        if( $0 ~ s1 ) {
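# Note that a variable can be used as a regular expression.
            for( i in Fields ) {
# Test each field against the same pattern: a bare /[0-9]/ would also match
# the "3" in the H3 tag name and could pick up the wrong field.
                if( Fields[ i ] ~ s1 ) {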
                    DateField = Fields[ i ]
# Remove the value of DateMark from the field. DateMark contains the value
# of the AWKDATE environment variable followed by a space.
                    sub( DateMark, "", DateField )
# Sometimes it is desirable to change the format of the date.  Since this
# is probably to be done only once, we can hard code it.  This assumes you
# want the date elements in the yyyymmdd order, that the existing order is
# ddmmyyyy, that the new separator is '-', and that the existing separator
# is '/' - change it to suit your needs.
                    split( DateField, DateArray, "/" )
                    DateField = DateArray[3] "-" DateArray[2] "-" DateArray[1]
                }
            }
        }
        else {
# All other non-blank lines.

# We have to decide whether to look for a name code or a price.  Since
# Lcount is initially 0 and will be reset to 0 when we find a price, if it
# is 0, look for a code, if it isn't, look for a price.
# If we are looking for a price, we wait for Lcount to reach the required
# value, 4 in the demonstration case, but the value passed in AWKLINES in 
# any case (stored in MaxLcount).
            if( Lcount == 0 ) {
# Looking for a code.
                for( i in Fields ) {
                    for( j in Array ) {
# Note that the following match is equality, not a regular expression match.
# This prevents partial matches from causing trouble
                        if( Fields[ i ] == j ) {
                            Lcount = 1
                            ShareCode = j
                        }
                    }
                }
            }
            if( Lcount == MaxLcount ) {
# The line with the closing price in it.
                for( i in Fields ) {
                    if( Fields[ i ] ~ /[0-9]/ ) {
# When we have found the field with the numbers in it we can output the line,
# and we must reset Lcount.
                        Lcount = 0
                        printf( "%s,%.3f,%s\n", ShareCode, (Fields[ i ] * 0.001), DateField )
                        delete Array[ ShareCode ]
# That line is an additional insurance against false matches, and it speeds up
# later matches because there are fewer comparisons to make.
                    }
                }                
            }
        }
    }
}
This brings us to the batch program to manage what is really conversion of a Web page into a Quicken input file (in the original case that prompted all of this). Wouldn't it be nice if the batch program could get the page as well as process it? It can, but that requires another supplemental language - one that can deal with the Internet. Such a language is available free for 36 different platforms: Rebol. Unfortunately, DOS is not one of the platforms, but Win9x and NT4 are. The original task was to run on Win9x, so it's fair to use Rebol to get the page.

Since the AWK script is a bit long for generation on the fly and it is for repetitive use anyway, it is convenient just to copy/paste it into a file and leave the file around for later use. There doesn't seem much point in doing the Rebol script dynamically either. It is very small - more header than script - but is intended for frequent reuse as well. The example here reads the sample page I used to develop this example program. Note that this is explicitly not for Real DOS or Win3.1 - it is for the 32 bit versions of Windows (Win9x and NT4). From here on I will use Win32 type file names in mixed case without the 8.3 limitation of DOS. Copy/paste the following script into a file with the indicated name and make sure Rebol is in the PATH or the current directory (the same is true for your AWK interpreter, of course).

GetWebPage.r
REBOL [
    Title: "GetWebPage"
    Date:  19-Sept-1999
]
write %shares.htm read /batch/shares.htm
quit
Obviously you can change the URL to whatever you need - the file name too, just make sure to prefix it with '%'. The header is unimportant to us, but essential to the language interpreter, so you might as well leave it as-is or change the title and date - but don't do anything else unless you are familiar with the language.

A word about Rebol (URL above): it's much too complex for this level of introductory material, and very unapproachable, but it does make it easy to get Web pages for use with batch programs. Perl can do all of this stuff, so if you want a single scripting language for everything and are working exclusively in Win32 and/or some flavor of Unix or Linux, it's probably well worth the trouble to master it. It just doesn't seem to fit in with what I'm trying to do here, which is batch related stuff using languages that are either simple in themselves or can be used with very clear and simple canned code.

This batch file includes code to set the AWK and Rebol paths - change it as needed, or omit it entirely if the path already contains these programs.

MLTL0090.BAT (GetSharePrices.bat)
 @echo off
 call setnames
 set oldpath=%path%
 path c:\progra~1\mawk;c:\progra~1\rebol;%path%
 start /w rebol -s GetWebPage.r
 awk -fgetshrs.awk shares.htm > report.txt
 set path=%oldpath%
 type report.txt
 pause
The pause command allows the output to stay on the screen when the program is launched by double clicking on the file or from an icon. If this is the method always used to launch it, then the third line from the end - the one that restores the path - is not really needed.




This stuff has been only partially tested at the time of its initial release, but it is known that the versions of GAWK and MAWK used here do work in Real DOS, Win9x, and NT4. The complete programs have not all been tested under Real DOS.



  ** Copyright 1995, 1996, 1997, 1998, 1999, 2000, 2001 Ted Davis - see License, included by reference. ** 

Input and feedback from readers are welcome. NOTE: the subject of the message must contain the word "batch" for the message to get past the spam filter.
