This article and download files are at: https://www.codeproject.com/Articles/14728/EDX-Spelling-Checker
EDX Spelling Checker
By David Deley.
Add to your program the ability to spell check a word and perform spell guessing.
(See link at top for original article and download links)
This package gives you the ability to spell check a word, and the ability to suggest some correctly spelled words the user might have meant when a misspelled word is encountered (spell guessing). It also provides support for a user's personal auxiliary dictionary. An American English lexicon is provided, and instructions for creating a lexical database in other languages are given. A guide is given if you wish to port this spelling checker code to another operating system. The code has been around for many years, and has proven itself to be quite fast and stable. I call this spelling checker EDX.
The edxspell.dll contains the following four routines:
edx$dic_lookup_word
Checks the spelling of a word. You hand it a word, and it returns a value indicating whether the word is correctly spelled. It does this by first checking the EDX lexical database (i.e., the "dictionary"), and then checking the user's personal Aux1 dictionary. If the word is found, it returns EDX__WORDFOUND. If it does not find the word, it returns EDX__WORDNOTFOUND.
edx$spell_guess
Guesses what word the user meant to type. If edx$dic_lookup_word returns EDX__WORDNOTFOUND, you may then make repeated calls to edx$spell_guess to get some words the user might have meant to type. Each call to edx$spell_guess returns a correctly spelled word from the dictionary, similar to the misspelled word passed to edx$dic_lookup_word. You may continue making calls to edx$spell_guess until edx$spell_guess returns EDX__WORDNOTFOUND, indicating there are no more suggestions.
edx$add_persdic
Adds a word to the user's personal auxiliary dictionary. A user's personal auxiliary dictionary is a plain text file with one word per line. The user may also edit this file with an ordinary text editor to add or remove words.
edx$dll_version
Returns the version number of edxspell.dll. Also returns information about the EDX dictionary and the user's personal auxiliary dictionary if they are loaded.
You pass a word to edx$dic_lookup_word, and it will return a value indicating if the word is correctly spelled or not.
int edx$dic_lookup_word(char *spellwordptr, char *errbuf, int errbuflen, char *Dic_File_Name, char *Aux1_File_Name);
spellwordptr: The word you want to check the spelling of (pointer to ASCIIZ string). This string should contain just the word. Leading and trailing spaces are not trimmed for you; it's up to you to trim them off. The case of the word (uppercase or lowercase) is not important. (edx$dic_lookup_word makes a lowercase copy of the word before looking it up in the lexical database.)
errbuf, errbuflen: You provide an error buffer where error messages can be written. Provide a pointer to a buffer, and the length of the buffer. One instance where you will get an error message is if the EDX lexical database file (i.e., the "dictionary") can't be found. The error message returned in errbuf, in this case, will be something like:
"Error opening C:\Program Files\Multi-Edit 2006\EDXDIC.DIC
Error is: 2: The system cannot find the file specified."
It is up to you to display the error message to the user.
Dic_File_Name: ASCIIZ string containing the filename of the lexical database file. This is usually the full path/filename, e.g., Dic_File_Name = 'C:\Program Files\Multi-Edit 2006\EDXDIC.DIC'. If you specify just "EDXDIC.DIC", then the current directory is searched. If the file is not found, edx$dic_lookup_word returns EDX__ERROR, and the error message is placed in errbuf.
If you wish, you may rename EDXDIC.DIC to something else (perhaps EDX_DICTIONARY.DAT). Just specify the new name here.
On the first call to edx$dic_lookup_word, the lexical database file specified in Dic_File_Name will be mapped into memory. The value of Dic_File_Name is ignored on all future calls once the lexical database file is loaded.
Aux1_File_Name: ASCIIZ string containing the filename of the user's personal auxiliary dictionary file. You may use a null string ("") here if the user does not have a personal auxiliary dictionary file. This is usually the full path/filename, e.g., Aux1_File_Name = 'C:\Program Files\Multi-Edit 2006\EDXAUX1.TXT'. If you specify just "EDXAUX1.TXT", then the current directory is searched. If a file is specified, and the specified file is not found, it is created.
A user's personal auxiliary dictionary file is a plain text file with one word per line. The user may edit this file with an ordinary text editor to add or remove words. You may also use the edx$add_persdic function to append a word to this file.
On the first call to edx$dic_lookup_word, the user's personal auxiliary dictionary specified in Aux1_File_Name will be loaded into memory (unless a null string is passed, in which case this step is skipped). The value of Aux1_File_Name is ignored on all future calls.
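Because edx$dic_lookup_word does not trim the word you hand it, a small helper along the following lines could prepare each word before the call. This is only a sketch: the name trim_word and its return convention are made up for illustration and are not part of edxspell.dll.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of edxspell.dll): copy src into dst with
   leading and trailing whitespace removed. Returns 0 on success, or -1 if
   the trimmed word plus its terminating NUL does not fit in dst. */
int trim_word(const char *src, char *dst, size_t dstlen)
{
    const char *end;
    size_t len;

    assert(src != NULL && dst != NULL);

    /* Skip leading whitespace. */
    while (*src != '\0' && isspace((unsigned char)*src))
        src++;

    /* Back up over trailing whitespace. */
    end = src + strlen(src);
    while (end > src && isspace((unsigned char)end[-1]))
        end--;

    len = (size_t)(end - src);
    if (len + 1 > dstlen)
        return -1;

    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}
```

The trimmed copy in dst is what you would then pass as spellwordptr. Lowercasing is not needed here, since edx$dic_lookup_word makes its own lowercase copy.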
Add the following three defines to your code:
#define EDX__WORDFOUND 1
#define EDX__WORDNOTFOUND 2
#define EDX__ERROR 4
These are the three possible return values. Note there are two underscore characters after EDX. If the return value is EDX__ERROR, then errbuf will contain further information about the error. It's up to you to display this error message to the user.
On the first call to edx$dic_lookup_word, the main dictionary specified by Dic_File_Name is opened and mapped, and if Aux1_File_Name is not a null string (""), then that file is briefly opened and read into memory. Then the word to spell check in spellwordptr is searched for in the main dictionary and in the user's personal auxiliary dictionary. If the word is found, EDX__WORDFOUND is returned. If the word is not found, EDX__WORDNOTFOUND is returned.
EDX__WORDFOUND
.
edx$dic_lookup_word
.
On the first call, pass the lexical database file name Dic_File_Name and the optional user's personal dictionary file name Aux1_File_Name to edx$dic_lookup_word. You may then use null strings for these two parameters on all future calls to edx$dic_lookup_word, as these two parameters will be ignored on all future calls.
Every call to edx$dic_lookup_word sets up spell guessing. The word passed to edx$dic_lookup_word is saved, and spell guessing is initialized. If that word is misspelled, you may then make repeated calls to edx$spell_guess to get suggested words the user might have meant to type. (You may make these calls even if the word is not misspelled, though I don't know why anyone would bother.) Each call to edx$spell_guess will return a correctly spelled word from the main EDX dictionary or the user's personal auxiliary dictionary which is similar to the misspelled word passed to edx$dic_lookup_word. You may continue making calls to edx$spell_guess until edx$spell_guess returns EDX__WORDNOTFOUND, indicating there are no more suggestions.
int edx$spell_guess(char *guessword, char *errbuf, int errbuflen);
guessword: Pointer to a buffer which you supply to receive the guess-word. The buffer should be large enough to hold a 31 character word (don't forget the terminating NUL byte, so you need 32 bytes).
errbuf, errbuflen: You provide an error buffer where error messages can be written. Provide a pointer to a buffer, and the length of the buffer. (I suggest an error buffer size of around 400 characters.) The only instance I know of where you will get an error message is if the EDX lexical database file (i.e., the "dictionary") is located on a remote computer and the network connection to that remote computer is lost. In this case, the error message returned in errbuf will be:
"EDXspell.dll encountered error EXCEPTION_IN_PAGE_ERROR. This error can occur if the EDX dictionary file is on a remote computer and the network connection to that remote computer is lost."
It is up to you to display the error message to the user. (You don't have to display it. You could just treat this error as if EDX__WORDNOTFOUND were returned, and stop spell guessing. In this case, you may specify errbuflen = 0 and not receive the error message, since you're not going to display it.) Normally, the EDX dictionary file is on the same computer as the program, and this is not a problem.
EDX__WORDFOUND - guessword is filled with another guess word.
EDX__WORDNOTFOUND - all out of guesses.
EDX__ERROR - EXCEPTION_IN_PAGE_ERROR (see above).
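The three return values above suggest a simple collection loop. The sketch below gathers suggestions into an array until the guess routine stops returning EDX__WORDFOUND. It takes the routine as a function pointer with the same parameter list as edx$spell_guess, and uses a stand-in called fake_spell_guess so it can run without edxspell.dll. The names guess_fn, collect_guesses, fake_spell_guess and GUESS_BUF_LEN are invented for this illustration, and the canned words are not real EDX output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EDX__WORDFOUND    1
#define EDX__WORDNOTFOUND 2
#define EDX__ERROR        4

#define GUESS_BUF_LEN 32   /* 31-character word plus terminating NUL */

/* Same parameter list as edx$spell_guess, so the real routine could be passed in. */
typedef int (*guess_fn)(char *guessword, char *errbuf, int errbuflen);

/* Stand-in for edx$spell_guess (an assumption for this sketch only).
   It hands out two canned words, then reports that it is out of guesses. */
int fake_spell_guess(char *guessword, char *errbuf, int errbuflen)
{
    static const char *canned[] = { "receive", "recipe" };
    static int next = 0;

    (void)errbuf;
    (void)errbuflen;
    if (next >= 2) {
        next = 0;
        return EDX__WORDNOTFOUND;
    }
    strcpy(guessword, canned[next++]);
    return EDX__WORDFOUND;
}

/* Collect up to max_guesses suggestions. Returns the number collected,
   or -1 if the guess routine returned EDX__ERROR (details are in errbuf). */
int collect_guesses(guess_fn guess, char guesses[][GUESS_BUF_LEN],
                    int max_guesses, char *errbuf, int errbuflen)
{
    int count = 0;
    int status = EDX__WORDNOTFOUND;

    assert(guess != NULL && max_guesses >= 0);

    while (count < max_guesses &&
           (status = guess(guesses[count], errbuf, errbuflen)) == EDX__WORDFOUND)
        count++;

    return (status == EDX__ERROR) ? -1 : count;
}
```

Capping the loop at max_guesses keeps a suggestion menu to a manageable size. In a real program the function pointer would refer to edx$spell_guess as exported by the DLL, with whatever calling convention the DLL's header declares.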
Here is an outline of how edx$spell_guess goes about spell guessing:
The code takes care not to guess the same word twice.
Adds a word to the user's personal auxiliary dictionary. The user's personal auxiliary dictionary is a plain text file with one word per line. The user may also edit this file with an ordinary text editor.
The word you want to add to the user's personal auxiliary dictionary (pointer to ASCIIZ string). Leading and trailing spaces are trimmed and the word is lowercased before it is added to the file. The resulting word can be no longer than 31 characters.
You provide an error buffer where error messages can be written to. Provide a pointer to a buffer, and provide the length of the buffer. It is up to you to display the error message to the user. (I suggest an error buffer size of around 400 characters.)
EDX__WORDFOUND - Successfully added word to user's personal dictionary.
EDX__ERROR - Error adding word to user's personal dictionary. Error text is in errbuf.
Returns a long string containing information. The string may look something like:
EDX Spelling Checker file edxspell.dll version 7.2 November 26, 2006.
EDX dictionary file is version 5 (Extended ANSI character compatible)
There are no extended ANSI characters in the dictionary.
Extended ANSI Guessing is: OFF.
User's personal auxiliary dictionary file is: EDXMYAUX1DIC.TXT
You provide a buffer where the version message string can be written to. Provide a pointer to a buffer, and provide the length of the buffer. (I suggest a buffer size of around 550 characters.)
To try the demo:
If you are spell checking a buffer, then the general outline of code you would write would be:
Call edx$dic_lookup_word to spell-check the word.
If edx$dic_lookup_word returns EDX__WORDNOTFOUND, then declare the word misspelled, and offer suggestions to the user by making repeated calls to edx$spell_guess until edx$spell_guess returns EDX__WORDNOTFOUND (meaning, no more guess words).
Here is a pseudo code example which spell checks testword:
//edxdic and edxaux1 hold the dictionary and personal dictionary filenames
//(use "" for edxaux1 if the user has no personal dictionary).
status = edx$dic_lookup_word(testword, errbuf, errbuflen, edxdic, edxaux1);
switch (status)
{
    case EDX__WORDFOUND:
        //Good. Word is correctly spelled.
        break;

    case EDX__WORDNOTFOUND:
        //Word misspelled. Let's spell guess.
        while (EDX__WORDFOUND == (guess_status = edx$spell_guess(ResultBuf, errbuf, errbuflen)))
        {
            //ResultBuf contains a guess word.
            <Display guess word to user.>
        }
        //When we drop out here, guess_status is either EDX__WORDNOTFOUND
        //(no more guesses) or EDX__ERROR (which is very unlikely).
        if (guess_status == EDX__WORDNOTFOUND)
        {
            //No more guesses.
        }
        else if (guess_status == EDX__ERROR)
        {
            <Bad. Display error message and stop spell checking.>
        }
        break;

    case EDX__ERROR:
        <Bad. Display error message and stop spell checking.>
        break;
}
For a simple working example, see file "CallEdxSpell.cpp" in the "CallEdxSpell Source" folder of the source download.
It's up to you to provide words to edx$dic_lookup_word, so if you're spell checking a buffer, you must write some code that will parse off the next word to spell check. See the file "Parsing off words.txt" in the "Documentation" folder of the source download for some suggestions.
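As one possible starting point, the sketch below walks a NUL-terminated buffer and hands back one word at a time. It is not taken from "Parsing off words.txt"; the function name next_word and its rules (a word is a run of letters, with an apostrophe kept only when a letter follows it) are assumptions chosen for illustration. It uses isalpha in the default C locale, so extended ANSI letters would need extra handling.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the next word at or after *cursor into word
   and advance *cursor past it. Returns 1 if a word was found, 0 at the
   end of the buffer. Over-long words are truncated to wordlen - 1 chars. */
int next_word(const char **cursor, char *word, size_t wordlen)
{
    const char *p = *cursor;
    size_t n = 0;

    assert(wordlen > 0);

    /* Skip anything that cannot start a word. */
    while (*p != '\0' && !isalpha((unsigned char)*p))
        p++;
    if (*p == '\0') {
        *cursor = p;
        return 0;
    }

    /* Copy letters; keep an apostrophe only when a letter follows it. */
    while (isalpha((unsigned char)*p) ||
           (*p == '\'' && isalpha((unsigned char)p[1]))) {
        if (n + 1 < wordlen)
            word[n++] = *p;
        p++;
    }
    word[n] = '\0';
    *cursor = p;
    return 1;
}
```

A caller would loop on next_word and pass each word to edx$dic_lookup_word. With a 32-byte word buffer, anything longer than 31 characters arrives truncated, so you may prefer to detect and skip such words instead.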
The code now supports an optional user's personal auxiliary dictionary (also called the "User's Aux1 dictionary" or the "User's Aux1 Lexical Database"). This is a plain text file with one word per line. The contents of the file are loaded on the first call to edx$dic_lookup_word. Leading and trailing spaces are trimmed, and the words are lowercased when loaded. Resulting words may be no longer than 31 characters. edx$dic_lookup_word will return an error message if it finds a word longer than 31 characters when loading the user's personal auxiliary dictionary.
Words in the user's personal auxiliary dictionary are checked when spell checking a word, and when spell guessing.
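If you generate or validate a personal auxiliary dictionary file yourself, the loading rules above (trim, lowercase, at most 31 characters) can be mirrored with a check like the following. This is a sketch of those stated rules only, not the DLL's actual loader; the name normalize_aux_line and the treatment of blank lines are assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define AUX_MAX_WORD 31   /* longest word allowed in the personal dictionary */

/* Hypothetical helper: normalize one line of a personal dictionary file.
   Returns 1 and fills out on success, 0 for a blank line, and -1 if the
   trimmed word is longer than AUX_MAX_WORD characters. */
int normalize_aux_line(const char *line, char out[AUX_MAX_WORD + 1])
{
    size_t len, i;

    assert(line != NULL && out != NULL);

    /* Trim leading whitespace, then trailing whitespace (including "\r\n"). */
    while (*line != '\0' && isspace((unsigned char)*line))
        line++;
    len = strlen(line);
    while (len > 0 && isspace((unsigned char)line[len - 1]))
        len--;

    if (len == 0)
        return 0;
    if (len > AUX_MAX_WORD)
        return -1;

    /* Lowercase while copying. */
    for (i = 0; i < len; i++)
        out[i] = (char)tolower((unsigned char)line[i]);
    out[len] = '\0';
    return 1;
}
```

Running each line through a check like this before saving the file avoids the over-long-word error on the next first call to edx$dic_lookup_word.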
A further enhancement would be to keep track of spelling corrections as they are made. If the user misspells a word, and selects a correctly spelled word from your list of guess words, then you can save that correction. If you encounter the same misspelled word again, offer to make the same change. (The EDX spelling checker does not do this for you.)
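A minimal sketch of such a correction memory, assuming a small fixed-size table is enough, might look like this. All names here (correction_memory, remember_correction, recall_correction) and the table size are hypothetical, and nothing in it comes from edxspell.dll.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WORD_LEN        32   /* 31-character word plus terminating NUL */
#define MAX_CORRECTIONS 64   /* arbitrary capacity for this sketch */

struct correction {
    char wrong[WORD_LEN];
    char right[WORD_LEN];
};

struct correction_memory {
    struct correction items[MAX_CORRECTIONS];
    int count;   /* caller sets this to 0 before first use */
};

/* Record that the user replaced `wrong` with `right`. A later correction
   for the same misspelling overwrites the earlier one. Returns 0 on
   success, -1 if a word is too long or the table is full. */
int remember_correction(struct correction_memory *m,
                        const char *wrong, const char *right)
{
    int i;

    assert(m != NULL && wrong != NULL && right != NULL);
    if (strlen(wrong) >= WORD_LEN || strlen(right) >= WORD_LEN)
        return -1;

    for (i = 0; i < m->count; i++) {
        if (strcmp(m->items[i].wrong, wrong) == 0) {
            strcpy(m->items[i].right, right);
            return 0;
        }
    }
    if (m->count >= MAX_CORRECTIONS)
        return -1;

    strcpy(m->items[m->count].wrong, wrong);
    strcpy(m->items[m->count].right, right);
    m->count++;
    return 0;
}

/* Return the remembered replacement for `wrong`, or NULL if there is none. */
const char *recall_correction(const struct correction_memory *m,
                              const char *wrong)
{
    int i;

    for (i = 0; i < m->count; i++)
        if (strcmp(m->items[i].wrong, wrong) == 0)
            return m->items[i].right;
    return NULL;
}
```

When edx$dic_lookup_word reports a misspelling, you would call recall_correction first and offer its result before falling back to edx$spell_guess. A linear search is adequate for the handful of corrections made in one editing session.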
This spelling checker has proven itself to be quite fast. The secret to the speed of the EDX Spelling Checker is to keep page faults down to a minimum. The design of the EDX lexical database file reflects this goal by keeping memory reads near each other. For more information about the layout of the lexical database file and how to optimize it, see the file "Lexical Database File Layout.txt" in the source download. For more information about what page faults are and why a good understanding of them is so crucial to program execution speed, see the file "PAGE_FAULTS_AND_ARRAY_ADDRESSING.TXT" in the "Documentation" folder of the source download.
The other secret to speed is to map the lexical database file ("dictionary") into virtual memory instead of reading it in. The dictionary could be loaded by first allocating enough memory to hold the file, and then reading the entire file into the allocated memory. This would be quite slow due to the large size of the database. Also, a user's page file quota limits the total amount of memory a user may allocate, and the memory required to hold the database file is a considerable amount of memory.
Instead of this, we call the Microsoft Windows functions CreateFileMapping and MapViewOfFile to load the dictionary file. CreateFileMapping and MapViewOfFile accomplish the same result as allocating memory and then reading the file into memory, except they never allocate memory from the system, and never read in the file. Instead, they expand the process region by the size of the file EDXDIC.DIC (the "dictionary", i.e., the lexical database file), thereby instantly making new virtual memory available, and then declare that the physical file EDXDIC.DIC itself is the read-only paging file for that section of memory. The initialization is now complete, with hardly any work having been done.
Now, when the program attempts to read some of the dictionary that's in that memory range, a page fault will occur if that page is not already in memory, and that page is automatically read into memory. And since we're not using the system paging file for this, the user's page file quota is not affected.
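The mapping step can be sketched roughly as follows. This is not the code in edxspell.dll; it is a minimal illustration that uses CreateFileMapping and MapViewOfFile on Windows, with CreateFileA and GetFileSizeEx added as assumptions about how the file is opened and sized. So that the sketch also builds elsewhere, a POSIX mmap branch is swapped in when not compiling for Windows. The function name map_file_readonly and its errbuf convention are made up for this example.

```c
#ifndef _WIN32
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose open/fstat/mmap under strict C modes */
#endif
#endif

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

/* Map a whole file read-only. Returns the base address and stores the file
   size in *size, or returns NULL and writes a short message into errbuf. */
const void *map_file_readonly(const char *path, size_t *size,
                              char *errbuf, size_t errbuflen)
{
    const void *view = NULL;

    assert(path != NULL && size != NULL);
#ifdef _WIN32
    {
        LARGE_INTEGER len;
        HANDLE mapping;
        HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (file != INVALID_HANDLE_VALUE) {
            if (GetFileSizeEx(file, &len)) {
                /* Nothing is read here; pages are faulted in on first touch. */
                mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
                if (mapping != NULL) {
                    view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
                    CloseHandle(mapping);   /* the view keeps the mapping alive */
                }
                if (view != NULL)
                    *size = (size_t)len.QuadPart;
            }
            CloseHandle(file);
        }
    }
#else
    {
        struct stat st;
        int fd = open(path, O_RDONLY);
        if (fd >= 0) {
            if (fstat(fd, &st) == 0 && st.st_size > 0) {
                void *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                               MAP_PRIVATE, fd, 0);
                if (p != MAP_FAILED) {
                    view = p;
                    *size = (size_t)st.st_size;
                }
            }
            close(fd);                      /* the mapping survives the close */
        }
    }
#endif
    if (view == NULL && errbuf != NULL && errbuflen > 0)
        snprintf(errbuf, errbuflen, "Error mapping %s", path);
    return view;
}
```

The returned pointer can be read like an ordinary array, and each page is brought in from the file only when it is first touched. The sketch never unmaps the view, which suits a dictionary that stays loaded for the life of the process; UnmapViewOfFile (or munmap) would release it.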
It also helps if you defragment the file EDXDIC.DIC, since it's being used as a paging file.
An American English lexical database is provided with over 90,000 words. Every effort has been made to ensure all the words are correctly spelled.
Lexicons (files that contain lists of words) for other languages may be found at SourceForge. (See SCOWL - Spelling Checker Oriented Word Lists)
Be forewarned, the lexicons at the above web site contain a lot of words which aren't to be found in any dictionary! (They contain a lot of misspelled words, or words which should actually be hyphenated or two separate words.) A lot of work has gone into ensuring the words in my lexicon, EDX_DICTIONARY.TXT, are correctly spelled.
Another site that has Lexicons in various languages is: WinEdt Dictionaries.
For more info on creating the EDX lexical database file and optimizing your EDX_COMMONWORDS.TXT file, see "0Readme EDXBuildDictionary.txt" in the "Build EDX Dictionary Source" folder of the source download.
Update: The code has been updated to handle all ANSI characters 128 - 255, so it can now handle characters such as: ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ. See the files "EDX Using Extended ANSI Characters.txt" and "EDX lowercasing extended letters.htm" in the "Documentation" folder for more information about this.
See the file "Porting EDXspell to other operating systems.txt" in the "Documentation" folder of the source download if you wish to port this code to another operating system besides Microsoft Windows. (The code was originally written for the VMS operating system, and later modified to work with Microsoft Windows.)