AWK Scripts: Disk and File

At the moment, this is a collection of pieces and notes - it hasn't been organized.

How can I create a random file name?
You probably mean a unique file name, but that's not what you asked for. This creates strings of random upper case letters as long as you want to make them - it uses a bang counter to set the length, one bang (!) per character (eight in this case - the actual length is determined by the number of lines in }{.dat):

RANDCHAR.BAT
 @echo off
 if %1!==}{! goto pass2
 if exist }{.dat del }{.dat
 set rand=
 for %%a in (! ! ! ! ! ! ! !) do call %0 }{
 gawk "BEGIN{srand();printf(\"set rand=\045rand\045\")}{printf(\"\045c\",65+int(26 * rand()))}END{print\"\"}" }{.dat > }{.bat
 call }{
 :done
   REM your code to use the string goes here
 echo %rand%
 del }{.?at
 set rand=
 goto end
 :pass2
 echo. >> }{.dat
 goto end
 :end


This is a purer awk/gawk approach - it generates the same sort of SET line, but it returns a unique string of eight mixed letters (upper case) and numbers. The program is passed the name of a file as its argument - this file contains the list of already used strings against which to compare the newly created string for uniqueness. It is the responsibility of the surrounding batch file to append the returned string to the file:

MAKERAND.AWK
{
	sub( / $/, "" )
	Array[ $0 ] = ""
}

END{ 
	srand()
	flag = 0

	while( !flag ){
		string = ""
		for( i = 0; i < 8; ) {
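# Whole-number codes from 48 ("0") through 90 ("Z"); a fractional value
# such as 64.5 would pass the range test but print as "@".
			number = 48 + int( 43 * rand() )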
			if( (number < 58) || ( number > 64) ) {
				string = string sprintf( "%c", number )
				i++
			}
		}
		flag = 1
        if( string in Array ) flag = 0
	}
	print "set string=" string
}
The program first reads the file into an array, then uses the pseudorandom number generator to select characters. While the variable is still a number, it is tested for range to reject the punctuation characters that occur between the numerals and the upper case letters. If the number is in one of the proper ranges, it is converted to a character and appended to the string in work until the limit of eight is reached. The finished string is then compared against the array, and the whole process is repeated if the string has been used before; otherwise the program prints the SET command to STDOUT.
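
The range test can be avoided altogether by picking characters out of a string of allowed characters. This is only a sketch of that alternative, not the method MAKERAND.AWK uses; RandomString and Charset are names made up for the illustration:

```awk
# Sketch: build a random string by indexing into a string of allowed
# characters instead of range-testing character codes.
# RandomString and Charset are illustrative names, not part of MAKERAND.AWK.
function RandomString( Length, Charset,    i, Result ) {
    Result = ""
    for( i = 0; i < Length; i++ ) {
        # rand() is in [0,1), so the index runs from 1 to length(Charset).
        Result = Result substr( Charset, 1 + int( rand() * length( Charset ) ), 1 )
    }
    return Result
}

BEGIN {
    srand()
    print "set string=" RandomString( 8, "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" )
}
```

Changing the allowed characters then means editing one string rather than reworking the numeric ranges. The uniqueness test against the file of used strings would still be needed.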

MAKERAND.BAT
 @echo off
 gawk -fmakerand.awk strings.dat > }{.bat
 call }{
 del }{.bat
 echo %string% >> strings.dat
There is one thing not mentioned so far - the very first line in the script strips a trailing space from each input line - that is to accommodate the space between the '%' and ">>" in the last line of the batch file. The space is there to make the program NT compatible (if the string has a trailing '1' or '2', NT would interpret it as part of the redirection, not as part of the string).

Someone asked about finding out whether there was enough space on a Jaz drive to hold the contents of one or more hard drives - in the Win95 environment (but I had to ask about the OS). There are batch utilities out there that determine the total bytes in groups of files, the free space on disks, and whether one number is larger than another. I've never been able to get them to work in my environments, and neither could he. This is beside the point since byte sizes are only loosely related to the question of drive space as the following experiment will demonstrate (you will have to do the arithmetic yourself):

Do a DIR on an empty directory (not the root) and note the bytes free value at the end
Create a tiny file with ECHO.>foo.txt
Repeat the DIR command and again note the bytes free value as well as the size of the file
Subtract the second bytes free value from the first
Compare that result with the length of the file
Delete the file and remove the directory (if created especially for the test)
Repeat the experiment on drives of different sizes and types, and if possible, under different operating systems (note that sometimes you may get a difference of zero - this is due to the OS not having actually written the file to the disk, but only to the cache).

Expect results between 512 bytes (floppy) and 64 Kilobytes (huge HDD partition with large clusters) for the disk space consumed by a single 2 byte file. Obviously this can be ignored only if the number of files and their total size are small relative to the size of the target disk. Two approaches will be given: one exact but slow, and one faster, but based on an estimate - for large numbers of files of random sizes larger than one allocation unit (not always true), the average waste per file is half an allocation unit (cluster). Large numbers of small files combined with large cluster size invalidate the assumption and require the exact approach.
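
The fast estimate amounts to simple arithmetic. A minimal sketch, assuming the total byte count and the number of files are already known; EstimateBytes and the sample figures are made up for the illustration:

```awk
# Sketch of the fast estimate: assume each file wastes half a cluster.
# EstimateBytes and its parameter names are illustrative only.
function EstimateBytes( TotalBytes, FileCount, ClusterSize ) {
    return TotalBytes + FileCount * ClusterSize / 2
}

BEGIN {
    # 1000 files totalling 50,000,000 bytes on a disk with 2048 byte clusters.
    print EstimateBytes( 50000000, 1000, 2048 )
}
```

With these figures the allowance for waste is 1000 times 1024 bytes, so the estimate comes to 51,024,000 bytes.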

The specific problem to be solved has a couple of constants that need be determined only once: the cluster sizes of the source and target file systems. These are most conveniently determined manually using the method of the above experiment and then hard coded into the batch file. The Win95 and Zip drive file systems I'm using for testing have cluster sizes of 4096 and 2048 bytes respectively - these values will be used as constants in the examples; replace them with the values for the file systems you actually have if you use the programs yourself. As it turns out, the calculation uses only the target drive's cluster size; the source cluster size is not needed.

The algorithm determines the number of bytes needed on the target drive by rounding the sizes of the individual files up to the appropriate exact multiple of the target drive's cluster size and summing the results. The algorithm incorporates somewhat arbitrary, but generous, corrections to allow for directory entries (which also take up disk space). The correction can be calculated exactly, but probably isn't worth the trouble - each directory is treated as a file the size of one cluster or 16 Kb, whichever is greater.
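
The rounding step can be sketched on its own. ClustersFor is a name made up for the illustration (WILLFIT.AWK does the same arithmetic inline), and the 2048 byte cluster size is the Zip drive value used in the examples:

```awk
# Sketch of the exact calculation: round each file up to whole clusters.
# ClustersFor is an illustrative name, not part of the scripts given here.
function ClustersFor( Bytes, ClusterSize,    Clusters ) {
    Clusters = int( Bytes / ClusterSize )
    # Any remainder occupies one more, partially used, cluster.
    if( Bytes % ClusterSize ) Clusters++
    return Clusters
}

BEGIN {
    ClusterSize = 2048
    print ClustersFor( 2, ClusterSize )      # a 2 byte file still needs a whole cluster
    print ClustersFor( 2048, ClusterSize )   # an exact fit
    print ClustersFor( 2049, ClusterSize )   # one byte over spills into a second cluster
}
```

Multiplying the summed cluster count by the cluster size gives the bytes actually consumed on the target.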

We can pass an entire directory listing through the MAWK script and let it extract much of what it needs from the stream.

There is one additional OS dependent complication: the format of the directory listing. DOS, Win9x, and NT produce quite different formats. Win9x conveniently displays the number of directories as a separate total in the summary at the end; DOS and NT don't. The file size is the field just before the date in DOS and Win9x, but the third field under NT. This will be handled by using variables containing code appropriate to the OS - the programs will not determine the format for themselves. For NT, the file size field is $3, for DOS it is $(NF-2), but for Win9x it is more complicated: it is the second field before the one containing a colon (colons cannot occur in file or directory names, only in the time stamp). It may be necessary to modify the latter for other language versions - these examples assume American English versions of all operating systems because those are all I have data for.
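
The Win9x rule is easier to follow in isolation. This is a sketch only: Win9xSizeField is a made-up name, and the sample line is invented in the general layout of a Win9x listing rather than copied from a real one:

```awk
# Sketch: find the size field in a Win9x style entry by locating the field
# that contains a colon (the time stamp) and stepping back two fields.
# Win9xSizeField is an illustrative name; it returns 0 if no colon is found.
function Win9xSizeField( Line,    Parts, Count, i, Found ) {
    Count = split( Line, Parts, " " )
    Found = 0
    for( i = 1; i <= Count; i++ ) {
        if( Parts[ i ] ~ /:/ ) Found = i - 2
    }
    return Found
}

BEGIN {
    # Invented sample entry: short name, extension, size, date, time, long name.
    Sample = "FOO      TXT             2  03-14-99   4:52p foo.txt"
    print Win9xSizeField( Sample )
}
```

For the sample line the colon is in the fifth field, so the size is taken from the third.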

We will count the number of instances of <DIR> that are not single or double dots to determine the number of directories. This works for all three operating systems. We will also count the number of files and sum the cluster counts in all OSs. For each entry we will determine if it is a directory, a file, or something to be ignored, then process it to extract the necessary information. We will need a constant to indicate the proper location of the file size field. The location of the size field for Win9x will be given relative to an internal variable determined by analysing the entry; for the others it can be given relative to the number of fields in the line (DOS) or absolutely (NT). The basic script is

WILLFIT.AWK
BEGIN{
    W9xFlag = 1
    if( "SIZEFIELD" in ENVIRON ) W9xFlag = 0
    ClusterSize = ENVIRON["CLUSTERSIZE"]
    DirSize = ENVIRON["DIRSIZE"]
    SizeField = ENVIRON["SIZEFIELD"]
    DriveSize = ENVIRON["DRIVESIZE"]
}
{
    if( $0 != "" ){
        if( $0 ~ /<DIR>/ ) {
# If the line is a directory entry
            DotFlag = 0
            for( i = 1; i <= NF; i++ ) {
                if( $i == "." ) DotFlag = 1
                if( $i == ".." ) DotFlag = 1
            }
# If the line is a directory but not . or ..
            if( !DotFlag ) TotalClusters += DirSize
        }
        else {
            if( $0 !~ /^ / ){
# If the line does not begin with a space and is not a directory entry, it is a file entry   
                if( W9xFlag ) {
                    for( i = 1; i <= NF; i++ ) {
                        if( $i ~ /:/ ) {
                            SizeField =  i - 2
                        }
                    }
                }
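# SIZEFIELD may be a plain number (NT) or the text (NF-2) (DOS).  AWK cannot
# evaluate text as an expression, so an NF-relative value is converted here.
                Field = SizeField
                if( SizeField ~ /NF/ ) {
                    Offset = SizeField
                    gsub( /[^0-9]/, "", Offset )
                    Field = NF - Offset
                }
# Remove the commas
                gsub( /,/, "", $Field )
                Clusters = int( $Field / ClusterSize )
# If there is a remainder, we have to increment the cluster count to account for the one partially used
                if( $Field % ClusterSize ) Clusters++
                TotalClusters += Clusters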
            }
        }
    }
}

END{
    print "Total Clusters Needed = " TotalClusters, "DriveSize = " DriveSize " bytes"
    ExitCode = 0
# Adding 0 forces a numeric comparison even if DRIVESIZE arrives as a string
    if( ( ClusterSize * TotalClusters ) > DriveSize + 0 ) ExitCode = 1
    exit( ExitCode )
}
(This was written for Netscape 4.6 which incorrectly handles <DIR> inside the PRE/CODE block. I had to change the < and > characters to their escaped equivalents "&lt;" and "&gt;" - if it comes out wrong, just read it as if it were correct - if you copy/paste the code, make sure it looks like the DIR marker in a directory listing in your script.)

Note to AWK programmers: don't bother criticizing my style - this is not my native AWK dialect and there are numerous special considerations involved. Do, however, please point out any actual errors.

Unfortunately, we can't get a batch program to write that script as-is because of the magic characters. You could easily just copy/paste the script into a file and use that with the command
dir | awk -ffoo.awk
where foo is the name you gave the file (use .AWK as the extension). Don't forget to set the environment variables to their proper values.

For a working program, we also need a way to determine the size of the target drive. This can be done with pure batch techniques, but it also provides an opportunity to demonstrate using a MAWK script to write batch files that are then called from the parent program. The algorithm is simplicity itself: extract the "bytes free" value from the last line in a DIR listing, remove the commas, and print the command to set the environment variable to a .BAT file.

TARGETSZ.AWK
{
    if( $0 ~ /bytes free/){
        gsub( /,/, "", $(NF-2) )
        print "set DRIVESIZE=" $(NF-2)
    }
}
Here's a batch program that just sets the DRIVESIZE variable to the size (in bytes) of drive d: (a Zip drive in my case).

MLTL0050.BAT
 @echo off
 echo {if( $0 ~ /bytes free/){ > }{.awk
 echo gsub( /,/, "", $(NF-2) ) >> }{.awk
 echo print "set DRIVESIZE=" $(NF-2)}} >> }{.awk
 dir d: | awk -f}{.awk > }{.bat
 call }{.bat
 del }{.*
I recommend that users copy/paste both scripts as given into the files WILLFIT.AWK and TARGETSZ.AWK, and use this batch program after changing the constants to whatever is appropriate for the target drive.

MLTL0060.BAT
@echo off
 set CLUSTERSIZE=2048
 REM 16 Kb per directory at 2048 bytes per cluster is 8 clusters
 set DIRSIZE=8
 set sizefield=
 dir d: | awk -ftargetsz.awk > }{.bat
 call }{.bat
 del }{.bat
 dir c:\*.* /s | awk -fwillfit.awk
 if errorlevel 1 goto fail
 xcopy c:\*.* d:\ /s
 goto end
 :fail
 echo Not enough room
 :end


Note that setting SIZEFIELD to nothing causes the program to find the field the hard way. This is the Win9x version - set it to 3 for NT and (NF-2) for DOS. Add whatever switches are needed and wanted to the XCOPY command - as given it can't copy the files of a running Win9x system and displays all the file names. If there are any switches for DIR in the environment, add the necessary countermand switches to the DIR commands to cause DIR to generate a completely standard report.




A related problem was presented by a user: given a directory with a large number of fair size files, move groups of the files to sequentially numbered directories in such a way that the contents of the target directories would consume no more than 50 Mb of actual space on some sort of removable media. An additional point of interest is that the numbers in the directory names are not to be reused: each session is to begin with the number following the last one used, regardless of whether the old directory still exists or not. Concepts from the previous example can be reused, but not all that much of the code: rather than seeing if a given batch of files will fit, the task is to build a list of files that will fit.

Before even starting on the main task, bits and pieces of useful code suggested themselves: first a method of dealing with the directory names suggested by a script the user already had, and then a program to discover the available allocation units and their size automatically.

In the directory number program, the number is stored in a file (foo in the example).

DIRNUM.BAT
 @echo off
 awk "{print \"set count=\" ++$0}" foo > }{.bat
 for %%a in (call del) do %%a }{.bat
 md g:\a\Logs_%count%
 echo %count% > foo
That's a throw-away - the code is not reused here.

The automatic size determining program is rather too complex to put in the batch file itself, so it needs to be a separate script:

DISKSIZE.AWK
# DISKSIZE filters the output of a CHKDSK command to 
# extract the size of the allocation units on the disk and the number 
# of allocation units available.  
# These numbers are inserted into a string given in the 
# DSKSZPAT environment variable in place of markers placed in that string.

# The markers are ::size::, ::number::, and ::total::.  The pattern string can
# contain any plain text and the markers \t and \n for Tab and newline.
# \e, \l, \r, and \p can be used to represent "=", "<", ">", and "|" symbols,
# and \a and \h to represent & and ^. If the markers are needed as literal
# strings, use a double backslash (\\t to get \t, for example).
# Upper case versions of the markers will result in any thousands separators
# that are present in the numbers being retained in the output - lower
# case versions will result in removal of the separators.

# The output of the CHKDSK command normally will be piped to this program
# and the output of this filter will be redirected into a file.

# Typical usage would be to generate a batch file to set environment variables
# to the two numbers, so that is the default pattern to be used if the user 
# does not provide one.  If only a screen readout is wanted, a pattern
# string of "Unit size = ::size::\t Number of units = ::number::" would
# generate a single line response with a bit of space between the two reports.

# Note that size is in bytes, and number is the count of units of the given
# size available on the disk - *not* the total number of units on the disk;
# that is obtained with the total marker.
# If total number is needed, a blank disk must be used

BEGIN{
# Define the pattern for the output format - if a user defined pattern is
# available, use it instead of the default.
    OutputFormat = ENVIRON[ "DSKSZPAT" ]
    if( OutputFormat == "" ) 
        OutputFormat = "set DSKSZSIZE=::size::\nset DSKSZNUM=::number::"
# Replace any instances of the Tab and newline markers, and the special 
# symbol markers with the real ones -
    OutputFormat = ReplaceSymbols( OutputFormat )

# Initialize the output variables to nulls in case the input is not a 
# valid CHKDSK output with the expected format. ("AU" stands for "Allocation
# Unit".)
    AUsize = ""
    AUnumber = ""
    AUtotal = ""
    AUsizeRaw = ""
    AUnumberRaw = ""
    AUtotalRaw = ""
}

# Process the input lines.
{
# The line with the allocation unit size is a bunch of space, a number, then
# the string "bytes in each allocation unit.".  We will use "each alloc" as
# the recognition pattern for that line.  The number is the first field found 
# by the automatic line parser.
    if( $0 ~ /each alloc/ ) AUsizeRaw = $1
# The line with the number of available units is a bunch of space, a number, 
# then the string "allocation units available on disk".  We will use "units 
# avai" as the recognition pattern for that line.  The number is the first 
# field found by the automatic line parser.
    if( $0 ~ /units avai/ ) AUnumberRaw = $1
# The line with the total number of units is a bunch of space, a number, 
# then the string "total allocation units on disk".  We will use "total 
# alloc" as the recognition pattern for that line.  The number is the first 
# field found by the automatic line parser.
    if( $0 ~ /total alloc/ ) AUtotalRaw = $1
}

END{
# Make copies of the raw numbers for thousands separator removal.
    AUsize = AUsizeRaw
    AUnumber = AUnumberRaw
    AUtotal = AUtotalRaw

# Remove the thousands separators, if any are present - anything except 
# numbers is treated as a separator.  This is done ahead of the test for
# sanity to catch any errors from matched lines in wrong input that don't
# have numbers in the proper places.
    gsub( /[^0-9]/, "", AUsize )
    gsub( /[^0-9]/, "", AUnumber )
    gsub( /[^0-9]/, "", AUtotal )
# Write the results to STDOUT unless the input was defective and the values
# weren't found or the values aren't numbers.
    if( (AUsize != "") && (AUnumber != "") && ( AUtotal != "") ) {
# Make the substitutions.
        gsub( /::size::/, AUsize, OutputFormat )
        gsub( /::number::/, AUnumber, OutputFormat )
        gsub( /::total::/, AUtotal, OutputFormat )
        gsub( /::SIZE::/, AUsizeRaw, OutputFormat )
        gsub( /::NUMBER::/, AUnumberRaw, OutputFormat )
        gsub( /::TOTAL::/, AUtotalRaw, OutputFormat )
# Print the modified string.
        print OutputFormat
    }
    else {
# Print an error message if the numbers weren't sane.  It would be nice if
# we could force this to STDERR, but the standard method for that doesn't
# exist in the Microsoft world.  We can return an ERRORLEVEL (exit code).
        print "ERROR: invalid input format."
        exit( 1 )
    }
    exit( 0 )
}

function ReplaceSymbols( String ) {
# Replace symbols for newline, Tab, and certain characters that are magic in 
# various batch languages with the real characters.
# In order to provide a method of inserting these symbols as literals,
# provision must be made to escape the backslash - this is done by
# doubling the \ to \\, so a literal \t is represented by \\t.
# The way to deal with that is to replace \\ with something else before
# processing the symbols, then replace the something else with \.  ^Z is 
# used because it cannot appear in the controlling batch files.
    gsub( /\\\\/, "\032", String )
    gsub( /\\t/, "\t", String )
    gsub( /\\n/, "\n", String )
    gsub( /\\e/, "=", String )
    gsub( /\\l/, "<", String )
    gsub( /\\r/, ">", String )
    gsub( /\\p/, "|", String )
    gsub( /\\h/, "^", String )
# In the replacement text of gsub a bare & stands for the matched text, so a
# literal & must reach gsub as \&.  Inside a string constant the backslash
# itself has to be doubled, which is why this is written "\\&".  It works
# with the four versions available to me.
    gsub( /\\a/, "\\&", String )
# Now fix the backslashes by replacing ^Z with \.
    gsub( /\032/, "\\", String )
    return( String ) 
}
That also contains characters that are magic in NT, so there would be problems getting it into a file under that OS anyway. This script is used in conjunction with CHKDSK to generate most any kind of output report you need: to the screen, a batch file, a delimited list (with or without quoting) for a spreadsheet - most anything. See the section on DLIST for a discussion of formats for this type of program - I reused the formatting concept unchanged except for the details of its implementation. This batch file takes a drive letter and colon as its argument and puts the size of the disk's allocation units and the number of available allocation units in the environment after removing any thousands separators that might be present in the reported numbers (NT doesn't use them, the other MS operating systems do).

DSIZE.BAT
 @echo off
 chkdsk %1 | awk -fdisksize.awk > }{.bat
 for %%a in ( call del ) do %%a }{.bat
The main part of the program is a script to read a directory listing and generate a batch file that will create any needed directory entries in the target directory and copy as many files as will fit within the size limit set for the target.

It should be noted that while this code is concerned with sizes in allocation units instead of bytes, if byte sizes are actually wanted, the allocation unit size can be set to 1 byte and the number of units to the number of bytes desired. Otherwise, the allocation unit size is set to the size of the allocation unit on the ultimate target device without regard to the size of the units on the intermediate target device, and the number of units is either the number available on the ultimate target or the size of the ultimate target in bytes divided by the size of its allocation units in bytes.

Certain assumptions must be made and some approximations accepted. In general, they are the same here as in the section above on backing up to a Jaz drive. The most visible one is the assumption that a directory takes up one allocation unit or 16 Kb, whichever is greater - this is for the directory information and has nothing to do with the files in the directory, even though their number is what determines the actual, as distinct from the assumed, size. It is felt that this assumption is reasonably conservative, but the minimum size and multiplier can be changed if experience indicates otherwise. The value for the number of AUs available is passed in the environment variable DSIZEAUNUM, the size of the AU is in DSIZEAUSIZE, and the directory size in DSIZEDIRSIZE (in AUs).
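
Since DSIZEDIRSIZE is given in AUs, the "one allocation unit or 16 Kb, whichever is greater" rule has to be converted from bytes. A minimal sketch of that conversion, assuming the 16 Kb figure is rounded up to whole AUs; DirAUs is a name made up for the illustration:

```awk
# Sketch: express the directory allowance (one AU or 16 Kb, whichever is
# greater) as a count of allocation units.  DirAUs is an illustrative name.
function DirAUs( AUsize,    Minimum ) {
    Minimum = 16 * 1024
    # An AU of 16 Kb or more already covers the minimum by itself.
    if( AUsize >= Minimum ) return 1
    # Otherwise round 16 Kb up to a whole number of AUs.
    return int( ( Minimum + AUsize - 1 ) / AUsize )
}

BEGIN {
    print DirAUs( 2048 )    # 8
    print DirAUs( 4096 )    # 4
    print DirAUs( 32768 )   # 1
}
```

The result is the sort of value that would be placed in DSIZEDIRSIZE for a target with the given AU size.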

DIRSIZE reads a standard directory listing reformatted by the DLIST program into a comma delimited list in the form directory, full long filespec, size without thousands separators using the pattern

::dir::,::dir::::long::,::SIZE::

A comma delimited list is used because we need the fields separated, but spaces won't do since they are valid characters in the directory and filespec fields. The size field has no thousands separators because we are going to use it as a number.

Since the directory given in that listing is from the root of the source drive, it is necessary to replace the starting point portion of the directory in the filespec with the starting point on the target drive, and to make the same substitution in the directory name when it is necessary to generate a MD command in the output. The directory of the first line processed will be taken as the portion to be replaced unless the DSIZEROOTS environment variable directs otherwise. The replacement directory is given in DSIZEROOTT - backslashes must be appended if they are not provided. The munged names must be quoted if they contain spaces, but otherwise not in order to maintain functionality for real DOS (which can't deal with quotes). Existence of the base target directory will be assumed here - it can be tested for in the managing batch file if needed, but in the overall program presented here, it is to be newly created, so the assumption is reasonable.
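
The replacement of the starting directory can also be done by plain string comparison, which avoids having to escape backslashes for use in a pattern. This is a sketch of that alternative, not the way DIRSIZE.AWK does it; ReplaceRoot and the sample paths are made up, and the comparison is case sensitive, which DOS file names are not:

```awk
# Sketch: swap the source root for the target directory by plain string
# comparison, so backslashes and dots in the path need no escaping.
# ReplaceRoot and the sample paths are illustrative only.
function ReplaceRoot( Filespec, SourceRoot, TargetDir ) {
    # Only replace when the filespec actually begins with the source root.
    if( index( Filespec, SourceRoot ) == 1 )
        return TargetDir substr( Filespec, length( SourceRoot ) + 1 )
    return Filespec
}

BEGIN {
    # "\\" inside an AWK string constant is a single backslash.
    print ReplaceRoot( "c:\\logs\\sub\\a.txt", "c:\\logs\\", "g:\\a\\Logs_0007\\" )
}
```

A file in a subdirectory of the source root keeps that subdirectory under the target, which is the behaviour described above.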

The algorithm for selecting files to copy is inefficient in that it does not optimize the output to use as much as possible of the target space; instead it is simple and just uses as many files as will fit in the order they are encountered. It must be complicated a bit so it can reject files that are larger than the target space. These will be listed in an error file specified in the DSIZEERR environment variable or in DSIZEERR.TXT in the default directory by default. Note that environment variable names are case sensitive in some operating systems.
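
Stripped of the file handling, the selection rule looks like this. It is a sketch under simplifying assumptions: sizes are already in allocation units, the allowance for the directory itself is ignored, and Assign, Sizes, and Group are names made up for the illustration:

```awk
# Sketch of the selection rule: take files in the order given, start a new
# group when the next file would overflow the current one, and flag files
# that could never fit.  All names here are illustrative.
function Assign( Sizes, Count, MaxAUs, Group,    i, Used, Current ) {
    Used = 0
    Current = 1
    for( i = 1; i <= Count; i++ ) {
        if( Sizes[ i ] > MaxAUs ) {
            Group[ i ] = 0          # 0 marks a file too large for any group
            continue
        }
        if( Used + Sizes[ i ] > MaxAUs ) {
            # The file goes into a fresh group, so the count restarts.
            Current++
            Used = 0
        }
        Used += Sizes[ i ]
        Group[ i ] = Current
    }
    return Current
}

BEGIN {
    Count = split( "30 30 70 10 20", Sizes, " " )
    Last = Assign( Sizes, Count, 50, Group )
    for( i = 1; i <= Count; i++ ) print Sizes[ i ], Group[ i ]
    print "groups used: " Last
}
```

With a limit of 50 units the five files land in groups 1, 2, 0 (rejected), 2, and 3. The first group is left with 20 units unused, which is the inefficiency accepted in exchange for simplicity.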

The text written for each file in the listing is given by the DSIZEFORMAT environment variable. This string contains replaceable parameters to indicate where to insert the old name, new name, and directory number - ::old::, ::new::, and ::number::.

The directory naming code assumes that the original number is the first one to use. When the program terminates, it writes a final line to the output (presumably a batch file). The format of that line is passed to the program with the DSIZELASTF environment variable. This variable contains the replaceable parameter ::number:: where the number of the first directory of the next pass is to go. Normally, this would be a command to put a SET command into a batch file that would be CALLed the next time the program is run to set the environment variable DSIZENUM which tells the program what number to start with.

Note that if a format variable is missing, nothing will be written for it. If it doesn't contain a replaceable parameter, only the literal text will be written to the output. If DSIZEROOTT is omitted, the program will abort - this is a safety feature to prevent writing to unexpected locations.

Two additional environment variables are optionally used: DSIZEHEAD, if present, will cause its contents to be written to the output file before any normal output (mostly for "@echo off"), and DSIZEMDF will replace the default MD command with whatever string is wanted (usually a test for the directory followed by the MD command). Aside from the usual symbols, the only replaceable parameter is ::dir:: which will be replaced with the spec of the directory to be used for the following block of files (default is "md ::dir::").

DIRSIZE.AWK
BEGIN{
# Define a global exit code variable - this is used as a flag and a value.
    ExitCode = 0
# Read the environment variables into global variables for this program.
    SourceRoot = ENVIRON[ "DSIZEROOTS" ]
    TargetRoot = ENVIRON[ "DSIZEROOTT" ]
    ErrorFile = ENVIRON[ "DSIZEERR" ]
    Format = ENVIRON[ "DSIZEFORMAT" ]
    LastLine = ENVIRON[ "DSIZELASTF" ]
    DirNumber = ENVIRON[ "DSIZENUM" ]
    MaxAUs = ENVIRON[ "DSIZEAUNUM" ]
    AUsize = ENVIRON[ "DSIZEAUSIZE" ]
    DirSize = ENVIRON[ "DSIZEDIRSIZE" ]
    Header = ENVIRON[ "DSIZEHEAD" ]
    MDformat = ENVIRON[ "DSIZEMDF" ]
# Some of those must be initialized to default values if they are blank
# The default error file:
    if( ErrorFile == "" ) ErrorFile = "dsizeerr.txt"
# and the directory number:    
    if( DirNumber == "" ) DirNumber = 1
# The string used to create new directories must exist or receive a default
# string, and also have the special symbols replaced.
    if( MDformat == "" ) MDformat = "MD ::dir::"
    MDformat = ReplaceSymbols( MDformat )
# SourceRoot must be initialized, but that is done at the first input line
# if it wasn't given in the environment.
# For efficiency (numerical comparisons are faster than string comparisons),
# a flag will be used to mark whether it has been initialized or not.  This
# is the only time a string comparison will be used for that purpose.
    SourceRootFlag = 0
    if( SourceRoot != "" ) SourceRootFlag = 1
# If there is no trailing backslash, add one.
    if( SourceRoot !~ /\\$/ ) SourceRoot = SourceRoot "\\"
    
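# DirSize defaults to one AU or 16Kb, whichever is greater.  It is expressed
# in AUs, so 16Kb is divided by the AU size and rounded up.
    Temp = 1
    if( ( AUsize + 0 > 0 ) && ( AUsize + 0 < 1024 * 16 ) )
        Temp = int( ( 1024 * 16 + AUsize - 1 ) / AUsize )
    if( DirSize == "" ) DirSize = Temp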

# If TargetRoot is null, the program must abort.
    if( TargetRoot == "" ) {
        print "ERROR: DSIZEROOTT environment variable missing." > ErrorFile
# The program normally exits with an exit code (ERRORLEVEL) of 0, but if
# it aborts, it exits with a non-zero value, 1 in this case.
        ExitCode += 1
    }
# If the size of the target or the size of the target's AU is missing, it is
# necessary to abort.
    if( MaxAUs == "" ) {
        print "ERROR: DSIZEAUNUM environment variable missing."
        ExitCode += 2
    }
    if( AUsize == "" ) {
        print "ERROR: DSIZEAUSIZE environment variable missing."
        ExitCode += 4
    }
# Note that the exit codes are powers of two - multiple errors can be 
# determined from its final value.
    if( ExitCode ) exit( ExitCode )
# An exit command does not terminate the program in standard AWK versions,
# the END code will still run.

# Both output pattern strings need to have the \n and \t markers replaced
# with the real things, and also the markers for =, <, >, and |.
# \e, \l, \r, and \p can be used to represent "=", "<", ">", and "|" symbols,
# and \a and \h to represent & and ^.  If the markers are needed as strings 
# literal, use a double backslash (\\t to get \t, for example).
    Format = ReplaceSymbols( Format )
    LastLine = ReplaceSymbols( LastLine )
# Print the header, if there is one.  Since no data have been processed
# it makes little sense to have replaceable parameters here (other than 
# the special symbols) - it's just a string literal.  
# It is expected that it will usually be "@echo off".
    if( Header != "" ) print ReplaceSymbols( Header )

# Set the field separator to a single comma so that the comma delimited
# input lines will be properly parsed into fields.
    FS = ","
# A Global variable is needed for the accumulating file sizes.
    TotalAUs = 0
# The actual target directory is the TargetRoot directory with DirNumber
# appended.  If the number has one or more leading zeros, the number will
# be left padded with leading zeros to the same total length if need be.
    DirNumberLength = 0
    if( DirNumber "" ~ /^0/ ) DirNumberLength = length( DirNumber "" )  
# The "" forces it to be treated as a string.
# TargetRoot must not have a trailing backslash - it's not a complete
# directory spec.
    sub( /\\$/, "", TargetRoot )
# Now build the complete directory substitution string from the pieces
# and write a command to create it.
    DirString = NewDirectory( TargetRoot, DirNumber, DirNumberLength )
# Correct total to account for estimated size of the directory.
    TotalAUs += DirSize
}
# Each line is in the format: directory, full long name filespec, size.
# There are no spaces except those in file and directory specs.
{
# Test for and skip any blank lines
    if( $0 == "" ) next
    
# Test for whether SourceRoot is set.  If we could be certain that the first
# record would not be blank, we could just use NR.
    if( !SourceRootFlag ) {
         SourceRoot = $1
         SourceRootFlag++
    }
# $1 is required to have a trailing backslash because the input file is in
# DLIST format, so SourceRoot does not need processing to add one.

# We have to double the backslashes in SourceRoot so it can be used as a
# pattern later.  We need to do this exactly once.
    if( SourceRootFlag == 1 ) {
        gsub( /\\/, "\\\\", SourceRoot )
        SourceRootFlag ++
    }
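# Size computation - $3 is the size in bytes.  A partially used AU still
# occupies a whole AU, so the quotient is rounded up.
# First test to see if the file is larger than the target can ever hold.
    FileAUs = int( $3 / AUsize )
    if( $3 % AUsize ) FileAUs++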
    if( FileAUs > MaxAUs ) {
        print "ERROR: " $2 " is too large." > ErrorFile
        next    # Done with this line, read and process the next one.
    }
# Fall through to here means the file is not larger than the target space.
# The next question is will it fit in the current directory.
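    TotalAUs += FileAUs
    if( TotalAUs > MaxAUs ) {
# If it will not fit, start a new directory,
        DirString = NewDirectory( TargetRoot, ++DirNumber, DirNumberLength )
# and restart the total with the new directory's own allowance plus this
# file, since the file goes into the new directory.
        TotalAUs = DirSize + FileAUs
    }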
# We now have a string containing the fully qualified filespec of the 
# file in work, the name of the target directory it is to go in, and 
# knowledge that the directory will exist and that the file will fit.
# The user has provided a command string that is to move or copy the
# file.  Note that if the file is in a subdirectory of the source root,
# it will be in the same subdirectory under the target root.  There are 
# only two substitutions that need be made: the old filespec and the new.
# The third substitution is included for completeness.
# The new is the old with the root replaced with the target root.  The
# original filespec is the second field in the input line.
    Filespec = $2
    OutString = Format
    sub( SourceRoot, DirString, Filespec )
    gsub( /::old::/, $2, OutString )
    gsub( /::new::/, Filespec, OutString )
    gsub( /::number::/, DirNumber, OutString )
    print OutString
}

END {
# If the program is in abort mode, it is necessary to skip the code here.
    if( !ExitCode ) {
# The only substitution that makes any sense in LastLine is the number
# of the next directory to create, but the other two are included just in
# case someone needs them - after all, both output strings can be multiple
# lines.
        DirNumber = PadDirNumber( ++DirNumber, DirNumberLength )
        gsub( /::number::/, DirNumber, LastLine )
        print LastLine
        
    }
}


function PadDirNumber( Number, TotalLength ) {
# Obviously, if TotalLength is zero (no padding), the while loop will
# not change the number string at all.
    while( length( Number "" ) < TotalLength ) {
        Number = "0" Number ""
    }
    return( Number )
}

function ReplaceSymbols( String ) {
    gsub( /\\t/, "\t", String )
    gsub( /\\n/, "\n", String )
    gsub( /\\e/, "=", String )
    gsub( /\\l/, "<", String )
    gsub( /\\r/, ">", String )
    gsub( /\\p/, "|", String )
    return( String ) 
}

function NewDirectory( Base, Number, Length,   DirString, String ) {
        DirString = Base PadDirNumber( Number, Length )
# The caller increments the directory number counter; that line pads the
# number with leading zeros and rebuilds the string that replaces the
# invariant part of the source filespec.  This string is substituted into
# MDformat and the result printed as the command to create a new directory.
# We have to work on a copy of MDformat because we may have to repeat the
# substitutions later.
        String = MDformat
        gsub( /::dir::/, DirString, String )
        print String
# Add a trailing backslash when returning the directory string.
        return( DirString "\\")
}
This script - and the DLIST.AWK script (below) which is a preprocessor for this script - would be managed with a batch file somewhat like this one (the example is NT specific):

DSIZE.BAT
 @echo off
 set DLISTNOH=foo
 set DLISTL=::dir::,::dir::::long::,::SIZE::
 set DSKSZPAT=set DSIZEAUNUM\e::number::\nset DSIZEAUSIZE\e::size::
 set DSIZEROOTT=d:\foo
 set DSIZEFORMAT=move ::old:: ::new::
 set DSIZENUM=001
 set DSIZEMDF=if not exist ::dir:: md ::dir::
 set DSIZEHEAD=@echo off
 if exist dsizenbr.bat call dsizenbr
 set DSIZELASTF=echo set DSIZENUM=::number:: \r dsizenbr.bat
 chkdsk %1 | awk -fdisksize.awk > }{.bat
 for %%a in ( call del ) do %%a }{.bat
 dir c:\myfiles\*.* /s | awk -fdlist.awk |  awk -fdirsize.awk > }{.bat
 REM for %%a in ( call del ) do %%a }{.bat
Don't uncomment that last line until you are absolutely sure that }{.bat is being properly generated. It won't actually run, or be erased, until the REM is removed. Check it carefully before turning the program loose on your files. This is presented as example code only, and if you use it, you are on your own as far as responsibility goes - there are entirely too many variables for me to write code that is certain to work properly on your machine in your environment with the version of AWK you are using without me being there to test it.
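
The fill-and-roll-over logic can be tried on its own before any files are touched. What follows is only a sketch: Pack(), CeilAUs(), and the sample sizes are hypothetical and not part of the script above, and rounding each file up to whole allocation units is my assumption about how the space should be counted.

```awk
# Sketch only: Pack(), CeilAUs() and the sample sizes are hypothetical names
# and data.  Rounding up to whole allocation units is an assumption.

# Round a byte count up to whole allocation units.
function CeilAUs( Bytes, AUsize,    n ) {
    n = int( Bytes / AUsize )
    if( n * AUsize < Bytes ) n++
    return n
}

# Assign each size in Sizes[1..Count] to a directory number, starting a new
# directory whenever the running total would exceed MaxAUs.  Files that can
# never fit get directory 0.  Results are left in Dir[1..Count]; the number
# of the last directory used is returned.
function Pack( Sizes, Count, AUsize, MaxAUs, Dir,    i, aus, total, dirnum ) {
    dirnum = 1
    total = 0
    for( i = 1; i <= Count; i++ ) {
        aus = CeilAUs( Sizes[ i ], AUsize )
        if( aus > MaxAUs ) { Dir[ i ] = 0; continue }
        if( total + aus > MaxAUs ) {
            dirnum++
            total = 0
        }
        total += aus
        Dir[ i ] = dirnum
    }
    return dirnum
}

BEGIN {
    Count = split( "3000 5000 4096 100000 1", Sizes, " " )
    Last = Pack( Sizes, Count, 4096, 3, Dir )
    for( i = 1; i <= Count; i++ ) print Sizes[ i ], Dir[ i ]
    print "directories used: " Last
}
```

With 4096 byte allocation units and room for three of them per directory, the first two files fill directory 1, the third starts directory 2, and the 100000 byte file is rejected as too large for any directory.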




One point that needs clarification: secondary scripts and utilities can read environment variables that were created before they were invoked, but when an executable program is involved, either as the result of the script or as the interpreter that runs it, that program cannot modify the environment of the program that invoked it. There are exceptions: it is possible to write programs that explore memory, find the parent environment, and modify it, but that is a can of worms that will not be opened here - in general, an executable program gets a copy of the environment belonging to its parent process. Keep in mind that batch files are scripts for a program that is already running (the command processor, usually COMMAND.COM or CMD.EXE, but possibly 4DOS or some other program), so they can modify that program's environment. When the batch program invokes another executable, even another instance of the command processor, the new program is passed a copy of the batch program's environment as it is at the instant the secondary program is launched.

Having said that, here's how to modify the environment of the invoking batch program with a secondary script: have the script write a batch file that the parent batch program then CALLs - since that batch program runs in the same environment and under the same instance of the command processor as the parent program, as a subroutine or subprogram, it can do anything the parent batch program can do.

This example causes the environment variable DDATE to take on the value of the current date in yyyymmdd format (suitable for sorting by date) for use in naming files and directories. It's a bit more complex than is usually needed because it deals with all three operating environments. This code assumes the American English order of the fields (easily changed), that the date is the last field on the first line of the DATE command's report, and that it is delimited by something that is not a numerical character or space.

SETEVNMT.AWK
BEGIN{
    Dflag = 0
}

{
    if( !Dflag ) {
        Dflag = 1
        ThisDate = $NF
        gsub( /[^0-9]/, "+", ThisDate )
        Fields = split( ThisDate, Array, "+" )
        ThisDate = Array[ Fields ] Array[ Fields - 2 ] Array[ Fields -1 ]
        print "@set DDATE=" ThisDate
    }
}
That script doesn't contain any magic characters (except one '^' which is magic in NT), so it can be created on the fly by this batch program which writes a script which writes a batch program:

MLTL0070.BAT
 @echo off
 echo BEGIN{ Dflag = 0 }> }{.awk
 echo {if(!Dflag){Dflag=1;ThisDate=$NF >>}{.awk
 if %OS%!==Windows_NT! echo gsub(/[^^0-9]/," ",ThisDate)>> }{.awk
 if not %OS%!==Windows_NT! echo gsub(/[^0-9]/," ",ThisDate)>> }{.awk
 echo Fields=split(ThisDate,Array," ")>> }{.awk
 echo ThisDate=Array[ Fields ]Array[(Fields-2)]Array[(Fields-1)]>> }{.awk
 echo print"@set DDATE="ThisDate}}>> }{.awk
 echo. | date | awk -f}{.awk > }{.bat
 call }{.bat
 echo %DDATE%
 del }{.*
Note that I have removed all unnecessary spaces and newlines from the script. It could be all one line - or maybe two - but there are line length limits here and in the batch program. Note also that two separate lines, each with an IF test of the OS environment variable, are used to provide a line with '^' escaped ("^^") if the OS is NT and in plain form if it is not (the OS variable doesn't exist except in NT, so its value is an empty string in the other environments).
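
The date rearrangement can be checked without running DATE at all by feeding it sample text. This is only a sketch: DateStamp() is a hypothetical helper, and the two sample lines are assumed shapes of the DATE report rather than captured output. As above, it assumes the American order of the fields and that the date is the last field on the line.

```awk
# Sketch only: DateStamp() is a hypothetical helper, and the sample lines
# are assumed shapes of DATE output, not captured from a real system.
function DateStamp( Line,    n, Words, Parts, Count, Text ) {
    n = split( Line, Words, " " )
    if( n == 0 ) return ""
    Text = Words[ n ]                      # the last field holds the date
    gsub( /[^0-9]/, " ", Text )            # any delimiter becomes a space
    Count = split( Text, Parts, " " )
    if( Count < 3 ) return ""
    # American order is month, day, year; emit year, month, day.
    return Parts[ Count ] Parts[ Count - 2 ] Parts[ Count - 1 ]
}

BEGIN {
    print "@set DDATE=" DateStamp( "Current date is Thu 03-15-2001" )
    print "@set DDATE=" DateStamp( "The current date is: Thu 03/15/2001" )
}
```

Both sample lines produce the same SET line, because anything that is not a digit is treated as a delimiter.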




I'm not sure whether this should be here or under "Text processing".

It's about reformatting directory entries to include fully qualified filespecs along with the usual file data: size, date, time. Almost any reasonable output format is possible.

The hard part is deciphering the DIR listing format in order to extract the drive and directory from the header, and the filename (long form, if available) and other data from the file listing. This is made harder by the need to change the directory string as the directory changes when processing complete branches and complete tree listings.

The directory listings fall into three formats (assuming the /a:-d switch to prevent having to deal with directories as well as files). The format could be guessed from the operating system and a few simple tests, but a more flexible approach ties the format detection into the parsing code: the format can be different on adjacent lines, so the code needs to be applied to each line anyway.
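
Per-line detection can be sketched by looking for the time stamp in each candidate field. This is only an illustration: LineFormat() is a hypothetical helper, and the sample lines are made up to resemble DIR output rather than copied from a real listing.

```awk
# Sketch only: LineFormat() is a hypothetical helper and the sample lines
# are illustrative, not captured DIR output.
function LineFormat( Line,    n, F ) {
    n = split( Line, F, " " )
    # Directory entries are recognized first so they can be skipped.
    if( Line ~ /<DIR>/ ) return "directory"
    # A time stamp in field 2 means the date comes first: NT layout.
    if( n >= 2 && F[ 2 ] ~ /:[0-5][0-9][ap]/ ) return "NT"
    # Otherwise the name comes first; the time is in field 5 if there is
    # an extension, and in field 4 if there is not.
    if( n >= 5 && F[ 5 ] ~ /:[0-5][0-9][ap]$/ ) return "DOS with extension"
    if( n >= 4 && F[ 4 ] ~ /:[0-5][0-9][ap]$/ ) return "DOS without extension"
    return "other"
}

BEGIN {
    print LineFormat( "03/15/01  12:30p                   512 autoexec.bat" )
    print LineFormat( "AUTOEXEC BAT           512  03-15-01  12:30p" )
    print LineFormat( "README                1,024  03-15-01   9:05a" )
    print LineFormat( "DOCS           <DIR>        03-15-01  12:30p" )
}
```

Because the test is applied to each line separately, a listing that mixes layouts needs no special handling.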

Having decided to let wrapping occur, and considering the impossibility of guessing in advance how long the longest line is going to be, it won't be possible to arrange the output in columns, at least not in a single pass program that doesn't store the entire listing in memory. Since there is no way to tell in advance whether the user will use the DOS or Win32 version of the language (or even which port), or how much space will be needed, it cannot be assumed that there will be enough memory available. Columns are right out. It seems to me that a Tab and space delimiter is the best choice for the default.

The user may create an environment variable named DLISTL to contain a pattern for the output string.
::dir:: will be replaced with drive and directory (with a trailing backslash)
::long:: will be replaced with the long file name if there is one, otherwise with the short name
::short:: will be replaced with the short name, if there is one, otherwise with the long name
::date:: will be replaced with the date stamp
::time:: will be replaced with the time stamp
::size:: will be replaced with the file size with original thousands delimiters
::SIZE:: will be replaced with the numerical value of the size without the original delimiters
The default pattern if none is given is ::dir::::long::(Tab)(Space)::date::(Tab)(Space)::time::(Tab)(Space)::size::. Since it can be a bit tricky to get Tabs and newline marks into the environment, these are to be represented by \t and \n respectively. Other symbols causing problems are =, <, >, |, ^, and & - these are marked by \e, \l, \r, \p, \h, and \a respectively. If the user wants a field quoted, it must be quoted in the format string. Arbitrary text is also allowed.
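
The token substitution can be sketched separately from the DIR parsing. This is only an illustration: Literal() and Fill() are hypothetical helpers and the sample values are made up. It swaps in index() and substr() in place of gsub(), so that backslashes and & in directory and file names are copied as they stand instead of being treated as special in a replacement string.

```awk
# Sketch only: Literal() and Fill() are hypothetical helpers, and the
# sample values are made up.

# Replace every occurrence of Token in Text with Value, using index() and
# substr() so that backslashes and & in Value are copied unchanged.
function Literal( Text, Token, Value,    Out, p ) {
    Out = ""
    while( ( p = index( Text, Token ) ) > 0 ) {
        Out = Out substr( Text, 1, p - 1 ) Value
        Text = substr( Text, p + length( Token ) )
    }
    return Out Text
}

# Apply every token in Values to a copy of Pattern and return the result.
function Fill( Pattern, Values,    Key ) {
    for( Key in Values ) Pattern = Literal( Pattern, Key, Values[ Key ] )
    return Pattern
}

BEGIN {
    Values[ "::dir::" ]  = "C:\\DOCS\\"
    Values[ "::long::" ] = "R&D notes.txt"
    Values[ "::SIZE::" ] = "12345"
    print Fill( "::dir::::long::,::SIZE::", Values )
}
```

The pattern used here is the same one shown for DLISTL in DSIZE.BAT: directory and long name run together, then a comma, then the bare size.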

If DLISTNOH even exists, no header or footer will be displayed in the output. Any value at all will do, so long as the variable is not null - it's only a switch.

Well, now that I made all those promises, it's time to write the code to implement them.

The code is not particularly sensitive to which operating system generates the DIR listing, but it is somewhat sensitive to the switches used in the DIR command: specifically those that change the layout of the listing in such a way as to remove the date and time fields (/w and /b), and also the /-n switch in NT. That switch has a feature (MS software doesn't have bugs) that eliminates the space between the base name and the extension if the base name is more than eight characters long: the name is truncated but there is no space between it and the extension, so the algorithm used in the program cannot tell whether the last few characters are the extension or not - I'm not sure that any algorithm can tell the difference.

DLIST.AWK
BEGIN{
# Initialize some variables that may not otherwise receive an initial value.
    HeaderFlag = 0
    FooterFlag = 0
    Footer[1] = ""
    Footer[2] = ""
    LineFlag = 0
    
# If DLISTNOH even exists in the environment, we need to inhibit header code.
# This can be accomplished by setting the header flag to a special value.
    if( ENVIRON["DLISTNOH"] != "" ) HeaderFlag = 3
# Read the other environment variables into variables with easier names and
# assign default values to those that don't exist.
    ListFormat = ENVIRON[ "DLISTL" ]
    if( ListFormat == "" ) ListFormat = "::dir::::long::\t ::date::\t ::time::\t ::size::"
# Replace any literal \t and \n marks with their real values. Also the magic 
# characters for various operating systems: = for \e, < for \l, > for \r, 
# | for \p, ^ for \h, and & for \a.
    ListFormat = ReplaceSymbols( ListFormat )
}

{
# Extract the first header encountered (or none if the header is inhibited).
# There are two lines in the header, so starting with a count of 0 and 
# incrementing twice will pass just the first two lines from the first
# header encountered.  Note that this may be language version sensitive.
# The ordering of the tests combined with the "next" statement prevents
# processing of these header lines as listing lines.
    if( $1 ~ /Volume/ ) {
        if( HeaderFlag < 2 ) {
            if( $0 !~ /:/ ) {
                HeaderFlag++
# Print the lines as encountered and get it over with. (This also facilitates
# testing the code as it is written.)
                print $0
            }
        }
        next
    }
# For each directory in the listing, we need the name of the directory.
# The directory name is in the line in its listing's header that begins
# "Directory of " and consists of the rest of the line.
    if( $0 ~ /Directory of/ ) {
        ThisDir = $0

# We all know there are no bugs in MS software, so this must be a feature:
# if the /s switch is present in Win95 (and possibly other versions), 
# there is no leading space on the line containing the directory listing.
# The substitution regex must have " ?" to test for zero or one space.
        sub(  / ?Directory of /, "", ThisDir )
        Array[ "::dir::" ] = ThisDir "\\"
# Note that the substitution pattern includes the spaces around the pattern -
# this code is highly language version sensitive, but easily modified.
# A "next" statement here prevents processing of this line as a listing line.
        next
    }
# If HeaderFlag equals exactly 2, the header is done and the footer is
# allowed - look for it.  The footer lines are the last two lines having 
# a number followed by exactly one space followed by the string "bytes".
# This definition is valid for all the recent MS operating systems. Since
# not every occurrence of a string that matches this is the real thing, it
# is necessary to reset FooterFlag after each occurrence that isn't.  This
# is done by detecting a directory listing line from the presence of an
# "a" or "p" which exists in all time stamps, but not in the footer lines.
    if( HeaderFlag == 2 ) {
        if( $0 ~ /[ap]/ ) FooterFlag = 0
        if( $0 ~ /[0-9] bytes/ ) Footer[ ++FooterFlag ] = $0
    }
# It would be possible to save a bit of code by rearranging the above
# to put the test for [ap] first and putting the following code in the 
# else block, but it would not be as clear as separately testing for 
# DIR listing lines - the test is repeated here in the opposite sense:
# if the line has a date stamp, it is to be processed.

# Clean up reusable variables - most are elements in an array.  This must
# happen for every line so that no value is carried over from the last file.
    Array[ "::long::" ] = ""
    Array[ "::short::" ] = ""
    Array[ "::date::" ] = ""
    Array[ "::time::" ] = ""
    Array[ "::size::" ] = ""
    Extention = ""

# Now for the listing lines.
# The following code will not assume that the DIR command contained a /a:-d
# switch to suppress directory listings.  Directory lines contain <DIR> and
# are ignored.
    if( $0 ~ /[ap]/ ) if( $0 !~ /<DIR>/ ) {
# Find the time stamp: if it is in field 2, the OS is NT.  The time stamp
# has a colon followed by a two digit number between 00 and 59 and either
# "a" or "p".
        if( $2 ~ /:[0-5][0-9][ap]/ ) {
            Array[ "::date::" ] = $1
            Array[ "::time::" ] = $2
            Array[ "::size::" ] = $3

# The only major restriction on NT DIR formatting is that the /x switch 
# is not allowed - it's just too difficult to deal with.  The long name
# begins with column 40.

            Array[ "::long::" ] = substr( $0, 40 )
        }
        else {
# If it isn't NT, then it's one of the DOS/Win9x formats.
# $1 is the short name base,
# if $5 contains a : followed by 00-59 and an "a" or "p"
#   $2 is the extension
#   $3 is the size
#   $4 is the date
#   $5 is the time
# else
#   $2 is the size
#   $3 is the date
#   $4 is the time
#   and there is no extension
# anything beyond the date field is the long name.
            T = $1
            if( $5 ~ /:[0-5][0-9][ap]$/ ) {
                d = $5
                Array[ "::short::" ] = T "." $2
                Array[ "::size::" ] = $3
                Array[ "::date::" ] = $4
            }
            else {
                d = $4
                Array[ "::short::" ] = T
                Array[ "::size::" ] = $2
                Array[ "::date::" ] = $3
            }
            Array[ "::time::" ] = d
            i = index( $0, d ) + length( d ) + 1
            if( length( $0 ) > i ) {
                Array[ "::long::" ] = substr( $0, i )
            }
        }
# If the short name is null after removing its spaces, there wasn't one,
# and the long name and short name are the same - force the short name
# to the only name.
        gsub( / /, "", Array[ "::short::" ] )
        if( Array[ "::short::" ] == "" ) Array[ "::short::" ] = Array[ "::long::" ]
# Similarly, if the long name is null, then force it to the short name value.
# They can't both be null in the listing.
        if( Array[ "::long::" ] == "" ) Array[ "::long::" ] = Array[ "::short::" ]

# Make a copy of the size field without delimiters - anything not a number
# is a delimiter.
        OutString = ListFormat
        t = Array[ "::size::" ]
        gsub( /[^0-9]/, "", t )
        Array[ "::SIZE::" ] = t
        for( i in Array ) {
            t = Array[ i ]
            gsub( i, t, OutString )   
        }
        print OutString
    }
}

END{
# Print the footer if there is one. (This part was written at the same time
# as the footer code to facilitate testing as code was written.)
# A counted loop is used because "for( i in Footer )" does not guarantee order.
    for( i = 1; i <= FooterFlag; i++ ) print Footer[ i ]
}

function ReplaceSymbols( String ) {
# Replace symbols for newline, Tab, and certain characters that are magic in 
# various batch languages with the real characters.
# In order to provide a method of inserting these symbols as literals,
# provision must be made to escape the backslash - this is done by
# doubling the \ to \\, so a literal \t is represented by \\t.
# The way to deal with that is to replace \\ with something else before
# processing the symbols, then replace the something else with \.  ^Z is 
# used because it cannot appear in the controlling batch files.
    gsub( /\\\\/, "\032", String )
    gsub( /\\t/, "\t", String )
    gsub( /\\n/, "\n", String )
    gsub( /\\e/, "=", String )
    gsub( /\\l/, "<", String )
    gsub( /\\r/, ">", String )
    gsub( /\\p/, "|", String )
    gsub( /\\h/, "^", String )
# The replacement below is scanned twice: the string literal "\\&" reaches
# gsub as \&, which gsub reads as a literal & rather than as the matched
# text.  This may be version sensitive, but it works with the four versions
# available to me.
    gsub( /\\a/, "\\&", String )
# Now fix the backslashes by replacing ^Z with \.
    gsub( /\032/, "\\", String )
    return( String ) 
}


Since that script is so long and complex, it isn't practical to generate it on the fly - it should be saved as a utility script and invoked with a batch file like this one. This is not a working program: you have to insert some user and use specific data.

DLIST.BAT
 @echo off
 set DLISTL=pattern
 cd foo
 dir %1 | gawk -fdlist.awk > dlist.txt
where foo is the directory that contains the script and will receive the output file, and pattern is the output format string. *************** More later ************




This stuff has been only partially tested at the time of its initial release, but it is known that the versions of GAWK and MAWK used here do work in Real DOS, Win9x, and NT4. The complete programs have not all been tested under Real DOS.



  ** Copyright 1995, 1996, 1997, 1998, 1999, 2000, 2001 Ted Davis - see License, included by reference. ** 

Input and feedback from readers are welcome. NOTE: the subject of the message must contain the word "batch" for the message to get past the spam filter.
