Access Archon #144
Introduction
I was recently given an assignment. The client needed to extract information from several log files and put them into a format that would expedite ad-hoc analysis. These log files were in several different folders, and were generated at several times during the day. They wanted to be able to go back over several weeks of log files when performing their analysis.
I knew the client had Microsoft Access and could perform their own ad-hoc queries as needed, so I suggested this as the target for the log records. They agreed. However, since new log files were created at specific times during the day, I wanted to perform the extract off the Windows Scheduler. This required a non-Access solution that updated the Access DB.
We discussed several alternatives, and the client was most comfortable with a Visual Basic solution. The client preferred a scripted solution to a compiled solution, so I suggested writing it in Visual Basic Script, and running it under the Windows Scripting Host (see sidebar). The client agreed.
Sidebar – The Windows Scripting Host
Built into every version of Windows since Windows 98 is a true scripting environment (not just a command line). This actually came about as part of Microsoft’s integration of Internet Explorer into Windows. Since the browser needs to be able to process VB Script and JavaScript, this functionality was built into Internet Explorer, and thus Windows. All MS had to do was create a light-weight hosting environment as an alternative to IE, and the same scripting engine could be used.
Thus the Windows Scripting Host (WSH) was born.
Since it’s the same engine as Internet Explorer, it supports the same languages as Internet Explorer – VB Script (VBS) and JavaScript (or rather, Microsoft’s flavor of JavaScript, Jscript, both abbreviated as JS). So you can write a stand-alone program in one (or both) of these languages using your favorite text editor (even Notepad), and execute the script using the WSH.
Very cool.
Let’s make our Hello World program. Start up Notepad and type in this single line:
wscript.echo “Hello World”
Save this as “c:hello.vbs”. Now open a command line window and enter the command “cscript c:hello.vbs”. Congratulations!
Now try the same program again, except run it using the command “wscript c:hello.vbs”. Notice that the output is in a Window this time.
WSH offers two user interfaces, a command line interface (aka DOS) and a Windowed interface. As you saw, the wscript.echo method acts differently depending on which interface you’re using.
Which brings me to the real power of Windows Script. You can use all “scriptable” objects on your machine, and most of their methods and properties. For most purposes, this means all installed ActiveX and OLE controls. This means you have file access, registry access, access to most Windows administration functions (via WMI), database access, the ability to send e-mail, and the ability to control most programs (including Office). WSH even offers its own objects (thus the echo method we used).
Each of these functions requires knowledge of a particular object model, which sometimes can be a bit daunting, but once you know how to use that object, you can script it using the Windows Scripting Host.
Like a web page, WSH lets you create single file that contains both VBS and JS code. You can even call between the languages. Even cooler, ActiveState’s ActivePerl and ActivePython (free for personal use) both integrate into the Windows Scripting Host, so you can interoperate between all these languages.
Here’s a helpful hint for developing VBS and JS programs. Built into every version of Office is an integrated development environment called Microsoft Script Editor (MSE). An integrated development environment includes at least an editor and a debugger.
For Office XP or 2003, the program is MSE7.EXE. If you used the default installation, you can find it in folder C:Program FilesMicrosoft OfficeOFFICE10 (for Office XP) or C:Program FilesMicrosoft OfficeOFFICE11 (for Office 2003).
If you can’t find it, run the Office Setup and (for Office XP and 2003) select Office Tools, HTML Source Editing, Web Scripting. Once you’ve found it, create a shortcut to it on your Start menu.
This editor is vastly superior to Notepad. It provides some color coding and line numbers (turn line numbers on under Tools, Options, Text Editor, All Languages, General). It also provides something MS calls IntelliSense. This provides prompting and other help for the built-in VBS and JS functions and constants, as well as any object you instantiate in your code using CreateObject (VBS) or new ActiveXObject (JS). See the menu under Edit, IntelliSense.
Helen’s note: I selected MSE7.EXE as the program to use when opening .vbs files (in the Tools|Folder Options|File Types dialog in a Windows Explorer), so I can easily open a script in this editor.
To use the debugging features, you must turn on Script Debugging in Internet Explorer. This can be found under Internet Options, Advanced – Disable Script Debugging must be off. You can start your script in debug mode by adding the //x switch when you run your script, as in “cscript //x c:hello.vbs”. This will open your script in MSE in debug mode and break before the first statement.
Here are some helpful links:
The reference documentation (and links to the downloads) for scripting at MSDN:
http://www.microsoft.com/scripting
The incredibly helpful Script Center (including the Script Repository) at Technet:
http://www.microsoft.com/technet/scriptcenter/default.mspx
ActiveState’s links to their latest versions:
http://www.activestate.com/Products/new_releases.plex
Log Files
I needed a flexible solution that would allow me to find new files in a set of directories. My solution was to create a Monitor table listing which directories to monitor. The code would use this table to find which directories to “walk”.
However, I needed to only process a file if it had not already been processed. I therefore needed to remember what had already been processed. So I created a History table.
An additional complication that came about during testing was that some of the directories contained other, non-log files in addition to the log files. To handle this, I added a field to the Monitor table to specify a file name. Since the log file names included date stamps, the file names had to be generic. I decided to support DOS-type wildcards in the file names since those would be most natural for the client.
Sidebar – WildCards2RegExp
The FileSystemObject works well for most things, but it is deficient in a few areas (see http://support.microsoft.com/?id=189751 for a complete list). We want the ability to search for files using wildcards (like on the DIR command), but this is one of the limitations. So we need to “grow our own”.
My solution was to take advantage of Regular Expressions (see the Regular Expressions sidebar). And since they’re supported in VBS, all I needed to do was to convert the wildcard pattern into a Regular Expression pattern.
This proves to be simple since there are only two possible wildcards – the question mark “?” and the asterisk “*”. Other characters will match only themselves. For example, wildcard pattern “a*” will match any name starting with an “a”, and “*a” will match any name ending with “a”. And wildcard pattern “1?” matches “12” and “1x” but not “1”.
In a Regular Expression pattern, the period “.” matches any character. This matches the function of the question mark “?” wildcard exactly. So we need to replace each “?” with a “.”. Easy.
Also in a Regular Expression pattern, the asterisk “*” is a modifier that means “match zero or more of the previous character”. So “a*” matches “a”, “aa”, “aaa”, or no “a”. An example is the pattern “ca*t” which will match “cat”, “caat”, and “ct”, but not “c”.
So the Regular Expression “.*” will match “zero or more of any character”. Fortunately, this exactly matches the function we need to handle the “*” wildcard.
So WildCards2RegExp just replaces the wildcard “?” with “.” and “*” with “.*”.
There is one caveat, though. There might be other characters in the wildcard pattern that would be taken as a special character when used in a Regular Expression pattern. To prevent that from happening, we need to put a backslash “” in front of any of those characters. This is known as “escaping” that character.
This is easily accomplished using yet another Regular Expression! I create a Regular Expression called Escapes which lists all the characters that must be escaped. See the Regular Expressions sidebar for a deeper explanation of this pattern.
Finally, the resulting Regular Expression pattern includes beginning and ending anchors so that the pattern must match the entire file name and not just part of the name. The caret “^” matches only the beginning of the string, and the dollar sign “$” matches only the end of the string. So by wrapping our pattern with these characters, we force the pattern to have to match the entire string, or it doesn’t match.
Sidebar – Regular Expressions
Regular Expressions are not “Regular” at all, but they are an extremely powerful pattern matching language. Regular Expressions are built into almost all current computer languages, and are available in some form on every major computing platform, but the implementations vary. There are complete books written on Regular Expressions, so this is just a primer. Also, we’re focusing here specifically on the VBS implementation of Regular Expressions.
Regular Expressions are implemented as a type of object in VBS. They are instantiated using the New operator on a RegExp object. For example, Set re = New RegExp.
To use a RegExp after you instantiate it, you set the Pattern property and any options, then you run an Execute method against a string. This performs a match of the pattern against that string and returns another object, a collection called Matches. This is a collection of Match objects, each of which represents one of the matches (duh).
If you’ve asked for all matches, you can iterate through the Matches collection to find them all. If you’ve asked for only the first match, the Matches collection will only contain one Match object. So in this case, you could just refer to Matches(0), since collections are indexed starting at zero.
If you don’t need the results of the match, and you just need to know if a match was found, you can use the Test method. This returns True if the pattern was found in the specified string.
In addition to the Execute and Test methods, there’s a Replace method. This allows you to perform replacement using Regular Expression patterns, which is much more powerful than the VB string function with the same name (Replace).
The power (and complexity) of Regular Expressions comes from the Pattern property. These Patterns describe what you want to match. Patterns are case sensitive unless the IgnoreCase property is set to True. However, even if IgnoreCase is set, special characters in the pattern must still be specified using the proper case.
A character in a pattern matches that character in the string, unless the character is one of the special characters. For example, the pattern “a” will match the character “a”. There are many special characters, but we’ll only cover the most common.
The most common special character is the backslash “”. This is the “escape” character, in that it reverses the meaning of the following character. For example, a period “.” matches any character, but putting a backslash in front of the period “.” makes it just a period. Also, a “d” just matches the letter “d”, but “d” matches any digit. To match a backslash, use “”.
The period “.” matches any character, but it must be a character. For example, “a.c” matches “abc” and “a#c”, even “a c” (that’s a space between the “a” and the “c”), but not “ac”.
“d” matches any digit, while “D” matches any non-digit. “s” matches any white space character (space, form feed, line feed, carriage return, tab, and vertical tab). Conversely, “S” matches a character that’s not white space.
Some special characters are modifiers. The question mark “?” makes the previous character optional. That is, it matches zero or one of those items. So “a?” matches “a”, or nothing.
The plus sign “+” modifies the previous character to mean “one or more of those”. So “a+” matches “a” and “aaa”, but there must be at least one “a”. You use this for a required repeating component in your string.
The asterisk “*” modifies the previous character to mean “zero or more of those”. So it’s just like the “+” except it matches nothing also. For example, “a*” also matches “a” and “aaa”, but it also matches if there’s no “a”. You use this for an optional repeating component in your string.
Curly brackets “{” and “}” say there must be a certain number of the previous character. For example, “a{5}” matches only “aaaaa”, while “a{2,5}” matches between 2 and 5 “a”‘s.
Some special characters are used for grouping. Square brackets “[” and “]” match any of the individual characters inside. So “[abc]” matches an “a”, a “b” or a “c”. This can also be coded as “[a-c]”. Adding a caret after the opening square bracket “[^” means “not these characters”, so “[^0-9]” matches any character that’s not a digit – which is exactly the same as “D”.
The most powerful grouping operator is the parenthesis “(” and “)”. This does two things. It groups together a set of characters, and it also creates submatches. Submatches allow you to retrieve subsets of your string for later reference. Submatches are counted based on left parenthesis, so the first left parenthesis is the first submatch, and so on.
For example, “(.*).(.*)” matches a normal file name (not all valid file names, since the period is not required in a real file name), and remembers the name and extension as separate submatches.
This brings up an important note. Patterns are greedy. That means “.*” matches the most it can. So “(.*).(.*)” matches all characters up to the LAST period (in the first submatch), then any characters following that period (in the second submatch).
Submatches can be accessed in several ways. They are part of the Match object. They can be referenced in a pattern (e.g. “(.)1” matches any two characters in a row). They can also be referenced in the replacement string of the Replace method by using a dollar sign “$”. I use this in the article to put backslashes in front of all escape characters – Escapes.Replace(string, “$1”).
The vertical bar “|” is an “OR” operator. So “ab|cd” matches “ab” or “cd”. Using that with the parenthesis for grouping, you could create something like “(b|t)oo” which would match “boo” or “too”.
There are some special characters that don’t match characters, just conceptual locations. The caret “^” matches beginning of a string, and “$” matches the end of a string. A common use of this is to make sure your pattern matches the entire string. So “^a.c$” matches “abc” but not “abcd”.
Putting this all together, let’s pull out a typical U.S. zip code. This can be five digits, or five digits followed by a “-” and four more digits. We want to capture each item separately. That is, we want the first five digits, the last four digits if they exist, and also the entire zip code. That pattern is:
“((d{5})(-(d{4}))?)”
Submatch 1 will contain the entire matched zip code. Submatch 2 will match the first five digits. Submatch 3 wasn’t required per se, but we needed it to make the dash and last four digits optional. Submatch 3 will contain the dash and the last four digits if they exist. Submatch 4 will contain just the last four digits, without the dash.
Supporting Files
The zip file containing this article, in Word format, plus the supporting file(s), may be downloaded from the Access Archon page of my Web site. It is accarch144.zip, which is the last entry in the table of Access Archon columns for Access Watch.
|
Document Name |
Document Type |
Place in |
|
FX_Log.mdb |
Access 2000 database (can also be used in higher versions of Access) |
Wherever you want |
|
FX_Log Load.vbs |
Windows Script Host script |
Wherever you want |
|
fx_2005_12_23.txt |
Sample source document |
Wherever you want |