file:///d://format2/ntx.txt

ntx file format

version of this document: 2.0
by cesar a. gil, january 06, 1997
introduction
	the ntx file format is the format of index files used by
clipper. it is much more efficient than ndx files, with respect to
performance and file size (usually more than twice the speed, and
only 80% of the size of an equivalent ndx file). the extension is also
different to indicate that its format is not the same as ndx.
	the ntx file stores a b-tree. a b-tree consists of nodes pointing
to sub-nodes.
	the b-tree allows you to traverse the nodes in a depth-first
order, thus allowing you to retrieve the keys in crescent order. it also
allows you to retrieve a given key or its closest-matching key with an
efficient algorithm, since you must only traverse a few nodes to
locate a specific record. however, as far as theory is concerned,
nothing beyond a diagram of a b-tree will be given here.
	updating an ntx file may be very difficult, since after a
change you must keep the b-tree balanced. however updating the b-tree
is a laborious task, the information given here may easily allow you
to read the ntx file.
diagram of a b-tree
			    root node
-----------------------
| . | 50 | . | 60 | . |
-----------------------
|        |        |
+----------------+        |        +-----------+
|                         |                    |
-----------------------   -----------------------   --------------
| . | 10 | . | 20 | . |   | . | 52 | . | 56 | . |   | . | 62 | . |
-----------------------   -----------------------   --------------
|        |               |        |        |       |        |
///       |              ///      ///      ///     ///      ///
|
-----------------------
| . | 15 | . | 16 | . |
-----------------------
|        |        |
///      ///      ///
   in this diagram we have assumed that each node can hold up to 2 keys,
however a b-tree can be specified to have any maximum number of keys
per node (n >= 2). a node doesn't need to be used up to the maximum. for
example, in this diagram we have a node which is not fully used (the node
which only contains the value 62).
   we also didn't care here to create a balanced tree, as would happen
in a real situation, since we are concerned about giving the reader
a feeling of how it is to traverse a b-tree. the diagram is also
intended to help visualize the structure of the ntx file.
   each node contains pointers between its key values. for example, look
at the root node. the pointer to the left of the value 50 points to a
sub-tree that contains every key lesser than 50. the pointer bewteen the
values 50 and 60 points to a sub-tree containing every value between 50
and 60. the pointer after the value 60 points to a sub-tree containing
all values greater than 60.
   at some given time we will arrive at a point where there will be no more
branching. we indicate that by an empty pointer (represented here by
///). in ntx files, an empty pointer (that does not point to a
sub-tree) contains a 0.
   now it is very easy to see that, if you traverse the tree in a depth-
first search, you'll retrive all the key values in order. it's also
very easy to define an algorithm to locate a given key efficiently
in the tree. the difficult part is updating the b-tree, by inserting,
changing or removing records, because the b-tree must be kept balanced.
the algorithm for doing so is very complex and literature is available
on the subject.
description of an ntx file
    the ntx file consists of pages of 1k.
the first 1k page contains a header.
the other pages contain nodes of the b-tree. (each 1k page will hold
a node).
    header structure:
	offset
00h  unsigned signature - contains 09 if good ntx file
02h  unsigned version   - version of indexing system that the
file was created under
(0x1 is typical but may change)
04h  long root          - file offset (within the ntx file)
where the root page is.
08h  long free_page     - file offset (withing the ntx file)
where the first of a list of free
pages is. a 0 value ends the list.
the clipper index system is intended
to reuse free pages.
0ch  unsigned item_size - the size of the item.
0eh  unsigned key_size  - the size of the index key
10h  unsigned key_decim - number of decimal positions for the
key, if it is a numeric key.
12h  unsigned max_item  - maximum number of items per page.
14h  unsigned min_item  - minimun number of items per page
(max_item/2 is typical)
16h  char key_expr[256] - an array containing the key
expression for the index.
the expression must be terminated by
null (ascii 0) and its length cannot
be greater than 256 bytes.
272h unsigned unique    - contains the value of the unique
option for the index.
(1=index is unique; 0=isn't unique)
    comments:
    (1)	the size of the index key (key_size) is determined when you issue
the command
	    index on key_expr to file.ntx
	the key_expr is evaluated for each record of the data file.
the value returned by evaluating key_expr has its size calculated.
it is assumed that all the evaluations of key_expr will return
values of the same type and length.
if you define a key_expr that returns values of different lengths,
the clipper indexing system will give unpredictable results and,
eventually, an error will happen.
	an example (of what you shouldn't do) is:
	    index on trim(client_name) to clients
	based on the determined key_size, clipper will calculate how
many items a page can hold and will thus determine max_items
and min_items. remember, each item holds a key value, whose size
depends on the key_size. that's why the item size is dynamically
determined.

    node structure:
	offset
00h  unsigned item_count - how many items this particular
page holds.
02h  unsigned item_offset[item_count] - holds the offsets, within
the page, where the items are located.
??   item area
	       the size of each item is: 8 + key_size.
the last item doesn't contain a key value, only a pointer
to a sub-tree.
	       each item has the following structure:
	       long page_offset - file offset (within ntx file) where
sub-tree is. if there's no subtree,
a 0 is stored.
long recno       - record number for this key value
(in the dbf file)
char key_value[key_size]


	   for example, the node:
			    -----------------------
| . | 50 | . | 60 | . |
-----------------------
|        |        |
to page  to page   to page
at       at       at
offset x  offset y  offset z
	   will be represented in a page containing:
	   offset
00h  unsigned item_count = 3
02h  unsigned item_offset[0] = 08h
04h  unsigned item_offset[1] = 12h
06h  unsigned item_offset[2] = 1ch
08h  item[0] :
long page_offset  : x
long recno        : dbf record number pointed by
key value 50
char key_value[2] : '50'
12h  item[1] :
long page_offset  : y
long recno        : dbf record number pointed by
key value 60
char key_value[2] : '60'
1ch  item[2] :
long page_offset  : z
conclusion:
    the ntx file structure is easy to access. the main concern is about
the fact that each ntx file will have items of different sizes and
pages with different number of items.
source code example - ntxview.c :
    author     : cesar a. gil
date       : january 06, 1997
language   : c
use        : ntxview file.ntx
description: lists the contents of the ntx file's header, then
traverses all the tree in depth-first order, listing all
the keys in order.
    ---------------------------------------------------------------
    /*
* ntxdump.h
*/
#define max_key  256
#define buf_size 1024
    typedef struct {
unsigned type;
unsigned version;
long     root;
long     next_page;
unsigned item_size;
unsigned key_size;
unsigned key_dec;
unsigned max_item;
unsigned half_page;
char     key_expr[max_key];
char     unique;
} ntx_header;
    typedef struct
{
long page;
long rec_no;
char key[1];
} ntx_item;
    typedef struct
{
unsigned item_count;
unsigned item_offset[1];
} ntx_buffer;
    ---------------------------------------------------------------
    /*
* ntxview.c
*/
#include "stdio.h"
#include "stdlib.h"
#include "fcntl.h"
#include "io.h"
    #include "ntxview.h"
    ntx_header ntx_header;
int        ntx_handle;
    void view_page(long page_offset);
    void main(int argc, char ** argv)
{
if( argc != 2 ) {
printf( "use: ntxhead arquivo.ntx\n" );
exit(1);
}
	ntx_handle = open(argv[1], 0x8000);
if( ntx_handle == -1 ) {
printf( "open error: %s\n", argv[1] );
exit(1);
}
	/* reads header */
if( read(ntx_handle, &ntx_header, sizeof(ntx_header))
!= sizeof(ntx_header) ) {
printf( "error reading header of %s\n", argv[1] );
exit(1);
}
	/* prints header information */
printf( "***** ntx header *****'n" );
printf( "signature %xh\n", ntx_header.type );
printf( "version %xh\n", ntx_header.version );
printf( "root %lxh\n", ntx_header.root );
printf( "next page %xh\n", ntx_header.next_page );
printf( "item size %xh\n", ntx_header.item_size );
printf( "key size %xh\n", ntx_header.key_size );
printf( "key decimals %xh\n", ntx_header.key_size );
printf( "max item %xh\n", ntx_header.half_page );
printf( "key expr: '%s'\n", ntx_header.key_expr );
printf( "unique: %d\n", ntx_header.unique );
	view_page(ntx_header.root);
	close(ntx_handle);
}
    void view_page(long page_offset)
{
char       * page;
ntx_buffer * buffer;
ntx_item   * item;
unsigned   i;
	/* allocates page */
page = malloc(buf_size);
buffer = (ntx_buffer *) page;
	if( page == null ) {
printf( "error allocating memory.\n" );
exit(1);
}
	/* moves to the given page offset within the file */
lseek(ntx_handle, page_offset, 0);
	/* reads page */
if( read(ntx_handle, buffer, buf_size) != buf_size ) {
printf( "error reading page %lxh\n", page_offset );
exit(1);
}
	/* traverses the vector, item by item */
for( i = 0; i < buffer->item_count; i ++ ) {
item = (ntx_item *) (page + buffer->item_offset[i]);
if( item->page != 0 )
view_page(item->page);
printf("dbf recnum: %5lx   key: %.*s\n", item->rec_no,
ntx_header.key_size, &item->key);
}
	/* handles last pointer */
item = (ntx_item *) (page + buffer->item_offset[buffer->item_count]);
if( item->page )
view_page(item->page);
	/* deallocates page */
free(page);
}