AWK Scripts: List Processing

The one thing that AWK does better than it does anything else, and better than just about any other language, is process files that are lists. The very nature of AWK's automatic input is to read a file (or files) line by line and apply the same code to each line. If the file is a list, then little need be done except to perform the desired action and ignore blank lines. If the list is only part of the file, then some additional sorting is required to ignore the lines that are not part of the list.
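As a rough illustration of those two cases (a sketch, not one of the scripts from these pages): the example below assumes a hypothetical file, items.txt, in which blank lines and lines beginning with "#" are not part of the list. The file names and the "#" convention are assumptions made only for this example. It is shown with POSIX shell quoting; in a batch file the AWK script would go in its own file or be quoted with backslashes as in the one-liners later on this page.

```shell
# Build a small sample file: two list items, a blank line, and a comment line.
printf 'red\n\n# not part of the list\ngreen\n' > items.txt

# The first two rules discard lines that are not list items;
# the last rule is applied to everything that survives.
awk '
$0 == ""  { next }              # ignore blank lines
/^#/      { next }              # ignore lines that are not part of the list
          { print "item: " $0 } # the desired action for each list item
' items.txt > items.out

cat items.out
```

The last rule has no pattern, so it runs for every line that was not skipped by one of the next statements above it.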

I have written about list processing in batch language in the List Processing and Multi-OS Batch Programs pages - from these and from an approach using PROMPT by Tom Levadas, you can get an idea of the complexity of list processing in pure batch. All of the pure batch approaches have serious limitations - different limitations for each approach (only the second of those was written expressly to work in all three common "DOS" batch languages, and it's extremely ugly). In fact, it was my efforts to write complex batch programs that were OS independent that led to this new set of pages that take advantage of other languages.

We will begin with a trivial (for AWK - very difficult for batch language) task that was presented as a question: given a file that lists color names, one per line, write for each color a file named after it that contains a sentence mentioning the color. In pure batch language, this is at least a dozen lines, and is likely to be either OS or human language version specific and to have a set of characters or words that cannot appear in the file, or cannot appear as the first thing in the line. In GAWK, it's a one-liner; MAWK requires a few more lines to create the script and delete it when done.

In all the AWK languages, the line in work is $0, and since the file contains nothing but the material to use as the file name and to insert into the given string, no processing of the input line is needed except the test for blank lines - we just use $0 where we want the color name to appear.

The basic script is COLORS.AWK
{
	if( $0 != "" ){
		Fname = $0 ".txt"
		String = "This is the " $0 " file."
		print String > Fname
		close( Fname )
	}
}
The color name is concatenated with ".txt" to make the name of the file to write to, then the color name is concatenated with the rest of the string (note the trailing space in the first part and the leading space in the second part of the pattern string), and the result is printed to the file (and the file is closed). The original question used such a short list that it wasn't necessary to close the files as the program progressed, so when all the string substitutions were moved to the places where they are used and the internal quote marks were escaped, a one line script resulted:
gawk "{if($0!=\"\")print \"This is the \" $0 \" file.\" > $0 \".txt\"}" colors.txt
If we make no assumptions about the length of the list, it gets a bit longer:
gawk "{F=$0\".txt\";if($0!=\"\")print \"This is the \" $0 \" file.\" > F; close(F)}" colors.txt
That is three statements delimited with semicolons. Obviously, that need only be inserted into a batch program at the appropriate place.
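For comparison, the same job can be written with the blank-line test as an AWK pattern rather than an if() inside the action. This is only a sketch of an alternative form, shown with POSIX shell quoting rather than the escaped quotes a batch file needs; the sample contents of colors.txt are assumed for the example.

```shell
# Sample list with a blank line in the middle.
printf 'red\n\ngreen\n' > colors.txt

# The pattern $0 != "" replaces the if(); the action (including close)
# now runs only for lines that actually name a file.
awk '$0 != "" {
    F = $0 ".txt"
    print "This is the " $0 " file." > F
    close(F)
}' colors.txt

cat red.txt green.txt
```

Because the whole action is guarded by the pattern, nothing is opened or closed for blank lines.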

To use MAWK, we could either prepare the script in advance or have great difficulty generating it on the fly because of the need to echo a redirection character. This is one of the places where we cannot represent the character with an escaped octal character code - it is an operator, not a string. It could be done this way, though: ASL10010.BAT(COLORS.BAT)
 @echo off
 echo BEGIN{print "{if($0!=\"\")print \"This is the \" $0 \" file.\" \076 $0 \".txt\"}"} > }{.awk
 awk -f}{.awk > colors.awk
 awk -fcolors.awk colors.txt
 del }{.awk
 del colors.awk
which just carries the concept of a batch program creating scripts for other languages a step farther and has one create a script that creates a script. It is still much shorter and more comprehensible than any pure batch technique, once you get used to escaping characters with backslashes (the nested quotes here) and octal codes (the output redirection character).
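The octal trick can be seen in isolation in the sketch below. Inside an AWK string constant, "\076" is the octal escape for ">" (octal 76 is decimal 62, the ASCII code for ">"), so a generating script can print a redirection operator without one ever appearing literally in its own text. The file names gen.awk and out.txt are hypothetical, and the example uses POSIX shell quoting, where the problem with echo does not arise; it only demonstrates the escape itself.

```shell
# Stage 1: an AWK program whose only job is to print another AWK program.
# The \076 becomes ">" in the generated text; the \" become plain quotes.
awk 'BEGIN { print "BEGIN { print \"hello\" \076 \"out.txt\" }" }' > gen.awk

cat gen.awk        # shows the generated script, now containing a real ">"

# Stage 2: run the generated script, which writes to out.txt.
awk -f gen.awk
cat out.txt
```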

A modification is required for use with Real DOS - there is a line that is longer than the allowed command line length. This version should work there, but I don't have Real DOS available for testing. ASL10020.BAT(COLORS.BAT)
 @echo off
 echo BEGIN{print "{if($0!=\"\") > }{.awk
 echo print \"This is the \" $0 \" file.\" \076 $0 \".txt\"}"} >> }{.awk
 awk -f}{.awk > colors.awk
 awk -fcolors.awk colors.txt
 del }{.awk
 del colors.awk





This one is from an e-mail message that arrived out of the blue: given a string of characters, generate all possible combinations of those characters. That's an abstraction of the actual message, but it does provide an opportunity to use recursion in a script. I hope to have a pure batch version someday - it appears to be simple enough, just time consuming.

The approach is to place the individual characters in an array and recursively generate a string to be printed that, for n characters in the string, is composed of n characters from the array chosen by an algorithm that systematically generates a unique combination of indexes. The model is something like an automobile speedometer: one wheel goes around and when it reaches its starting point it advances the next and starts over. This is easily modeled with nested for() counters. Unfortunately, we can't know at the time the script is written how many nested for()s are needed. The solution is to recursively call a function that does the for() stuff and detect when it has recursed deeply enough to equal the number of characters in the string. The deepest nested for() prints the composite string; the others just add to it and call the routine again.
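To make the speedometer model concrete, here is a sketch of the fixed case where the number of wheels is known in advance to be three. It is not the general solution, only the pattern that the recursion generalizes. The array name L and the output file wheels.out are arbitrary, and the characters are split on spaces simply to keep the example portable.

```shell
awk 'BEGIN {
    n = split("a b c", L, " ")      # L[1]="a", L[2]="b", L[3]="c"
    # Three wheels need three nested for() loops; the innermost turns fastest.
    for (i = 1; i <= n; i++)
        for (j = 1; j <= n; j++)
            for (k = 1; k <= n; k++)
                print L[i] L[j] L[k]
}' > wheels.out

head -n 4 wheels.out     # aaa, aab, aac, aba (one per line)
wc -l < wheels.out       # 27 lines: 3 * 3 * 3
```

A four-character string would need a fourth loop written into the script, which is exactly the limitation that recursion removes.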

Assuming the string is "abc", as the program recurses on the very first pass, it will build the string "aaa" in three steps: "a", "aa", and "aaa". The last level will work itself through the list producing "aaa", "aab", and "aac". Then it will return to the second level, which will build "ab" and call the last level again to produce "aba", "abb", and "abc". Eventually the second level will have finished with "acc" and will return to the first level, which will go to its next character and call the second level again to produce "baa" through "bcc"; then everything backs out to the first level for the last character to produce "caa" through "ccc". With no more entries in the array, the outer for() loop will exit and the program will terminate.
BEGIN{
# Split the string into individual characters and put each in an 
# element of the Letters array.
    NumChars = split( ARGV[ 1 ], Letters, "" )

# Outermost level of the recursion - it starts with level 1 and 
# an arbitrary index (all will be used eventually, but not all versions
# of AWK can be counted upon to supply them in order).
    for( i in Letters ) {
# Call the function and pass it the level number and the first character
# of the string that will eventually be printed by the last level of
# recursion.
        Rotate( 1, Letters[ i ] )
    }
# That's all folks, terminate the program
    exit(0)
}

# The Rotate() function adjusts the level counter and repeats the 
# for( i in Letters ) thing, this time calling itself once for each
# character in the pattern string (and element in the Letters
# array).  If it is the last level (the level counter is the same
# as the number of characters), it adds its characters to the string and
# prints the string.  If it isn't the last level, it calls itself for
# the next level of recursion.  The last two variables (following the extra
# space) are variables local to each instance of the function.
function Rotate( Level, String,  i, S ) {
# Increment the level counter to the present position in the string.
    Level++
# For each element in the array, add a different character to the string,
# one at a time.
    for( i in Letters ) {
        S = String Letters[ i ]
# If this is the last level        
        if( Level == NumChars ) {
# print the composite string,
            print S
        }
        else {
# otherwise, call itself and pass itself the current level and that portion
# of the string already built.
            Rotate( Level, S )
        }
    }

}
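The following is a variant sketch, not the script above, that makes three assumptions of its own. It counts levels from zero, so that a one-character string is printed once and the recursion stops (the script above, as written, appears to keep recursing in that case, because Level is already 2 the first time it is compared with NumChars). It uses a numeric for() loop so the output order is the same in every AWK. It fills the array with substr() rather than splitting on "", which not every AWK supports. The file name rotate2.awk is hypothetical.

```shell
cat > rotate2.awk <<'EOF'
BEGIN {
    # Put each character of the argument in its own array element.
    NumChars = length(ARGV[1])
    for (i = 1; i <= NumChars; i++)
        Letters[i] = substr(ARGV[1], i, 1)
    if (NumChars > 0)
        Rotate(0, "")
    exit(0)
}

# Level is the number of characters already in String; i and S are locals.
function Rotate(Level, String,   i, S) {
    for (i = 1; i <= NumChars; i++) {      # numeric order, not for-in order
        S = String Letters[i]
        if (Level + 1 == NumChars)
            print S                        # the string is complete
        else
            Rotate(Level + 1, S)           # add another character
    }
}
EOF

awk -f rotate2.awk abc > abc.out     # 27 strings, "aaa" through "ccc"
awk -f rotate2.awk x   > x.out       # the single string "x"
```

The exit in the BEGIN block keeps AWK from treating the argument as the name of an input file.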
*************** More later ************




This stuff has been only partially tested at the time of its initial release, but it is known that the versions of GAWK and MAWK used here do work in Real DOS, Win9x, and NT4. The complete programs have not all been tested under Real DOS.



  ** Copyright 1995, 1996, 1997, 1998, 1999, 2000, 2001 Ted Davis - see License, included by reference. ** 

Input and feedback from readers are welcome. NOTE: the subject of the message must contain the word "batch" for the message to get past the spam filter.
