ADUG
Home
About Us
Services
Meetings
Fees
Mailing List
Rules
Reference Papers
Downloads
Apply to Join
Links
Delphi Jobs
Special Offers

 
Programming Comp No 2: Substring Searching

 

The Melbourne Sub Committee of ADUG is running a programming competition open to ADUG members in all states. The aim of the competition is to produce a reasonably efficient piece of code which counts all occurrences of any give word in a text file containing a piece of English text up to 1 megabytes in size.

The results will be assessed on accuracy, speed and code style as follows.

If the solution offered does not return the correct number of occurrences in any one of a series of test cases it will be rejected as inaccurate.

Solutions which take longer than four times the time taken by the fastest accurate solution averaged over a series of tests will be rejected as too inefficient.

The code of all accurate, efficient solutions will be perused by a select committee of eminent judges and graded according to their views of neat, well structured code. The judges' decision is final and no correspondence will be entered into.

Competition Specification

1       The solution code must offer a single interface

The solution code must offer a single interface of "function FindWordInFile(const Filename, FindWord: string): Integer;" in a unit named "ADUGStringInFile"

For Example


unit ADUGStringInFile;
interface
function FindWordInFile(const Filename, FindWord: string): Integer;
implementation
...

2       All code must be contained in that unit file.

All solution code must be contained in the unit file. The solution can "Use" any files provided with Delphi professional releases but not third party code.

Only Delphi code can be used in the submission. In line assembler is not permitted.

3       Parameters

FileName may be either a fully specified path or file referenced from the current directory. The code must handle both.
FindWord may be any combination of [a..z] and or [A..Z] of length 1 to 45
Result is the count of occurrences.

4       Test File

The test file to be used will be limited to 1 Megabytes but can contain any valid ASCII Characters   (Char(0) to Char(127)).

5       Definition of a Word Match
The function result must return the count of the whole words of FindWord in that file. A whole word is defined as any case insensitive match both preceded by and followed by any character outside those specified in the parameters section as valid in FindWord. That is, if it is not an alpha, it is a word separator. So a search for ROGER should count Roger's but not count Rogers. The actual start of file and end of file will also be considered as word boundaries.

Submissions

All submissions should be emailed to Richard King after 31 July 2003 and before the closing time of 6:00 pm on the 19 August 2003. We expect results to be announced at the September ADUG meeting. The judges' justification may be some "long" time later.

Submissions should be attached to an email Titled "ADUG String Find Entry From ." The body of the email should provide your name email address and, if you are an ADUG Member, your ADUG Membership number.

Ideally the attachment should be zipped but must include only the single .pas unit. Any entries including sample 1MB text files and/or compiled exe's will be disqualified!

Testing Environment

Entries will be compiled in D7 Professional and run under Windows (XP). This should make no difference to submissions as any version of Delphi can be expected to compile an run in this environment.

Prizes

We were only doing this for fun. One prize only to the overall best entry as judged by the judges: A signed copy of Danny Thorpe's excellent and now out of print "Delphi Component Design" for ADUG memebers or an ADUG membership for non-ADUG members.

Hints

Preliminary investigation for this topic yielded the fast word search algorithm "Boyer-Moore" implemented in Pascal at http://www.bsdg.org/swag/STRINGS/0138.PAS.html you are welcome to use the Pascal implementation of this code but it will need modification to meet the specification and I am not sure about the judges' views on the neat well structured test.

The file size has been limited to 1 Megabyte to enable the buffer required for an implementation similar to the "Boyer-Moore" code above to be allocated on the stack without the added complexity of handling multiple reads.

Eg.
var Buffer: array[1.. 1000000] of char;

begin
  ThisFile := TFileStream.Create(FileName, OpenMode);
  BufCount := ThisFile.Read(Buffer, 1000000);

However better solutions which enable larger files to be handled with indications as to the maximum limits supported may be favorably considered.
  
A big thank you to Roger Connell for the content of this page and for providing the initial testing and timing framework.