
Non Discriminating Index


A non-discriminating index is one whose key expression matches either very few or nearly all of the records in a table. This is the result of an unbalanced distribution of values for the index key. Depending on the key expression, a non-discriminating index can reduce performance rather than enhance it.

For example, if your 100,000-record table has, say, 100 deleted records, then an index on DELETED() is non-discriminating: virtually every key is .F., and the tag is really only useful for finding deleted records, which is an unlikely query.
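
A minimal sketch of that scenario (the table and tag names here are hypothetical):

USE orders && hypothetical 100,000-record table with ~100 deleted rows
INDEX ON DELETED() TAG deltag && non-discriminating: nearly every key is .F.
* The tag only discriminates in the rare query that hunts for deleted rows:
SELECT * FROM orders WHERE DELETED() INTO CURSOR crsDeleted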

The problem with non-discriminating indexes lies in the way VFP uses its bitmap index methods. Essentially, to use an index in a query, VFP has to bring it over the wire. Actually, it has to bring all matching records over the wire, which in the case of non-discriminating indexes means that most of the index file is transferred. For large tables, the index bitmaps can be large. Non-discriminating indexes brought over the wire take time without really reducing the size of intermediate result sets.

Perhaps the term "index keys" would be better than the term "records" in the above paragraph. Does VFP transfer record data when examining an index for Rushmore Optimization (what I call the Rushmore Effect)? -- Mike Yearwood


Before I begin, let me state that I do not yet have VFP9 (though I will in a few days); however, I believe I have a good understanding of both this issue and a new VFP9 feature, which is listed as "New Binary Index support for new index type to improve performance using Deleted tag."

While I don't have a copy to test with, I am assuming this new feature implements a True/False (binary, no .NULL.) index by doing the obvious, which is to store only the True values. This shrinks the index to next to nothing for indexes that are non-discriminating, eliminating the performance penalty while keeping the performance advantage. That would mean it would almost always be beneficial to have an index tag on DELETED(), unless the table contained almost entirely deleted records, in which case one should add an index tag on !DELETED() instead. I'm still assuming at this point, but if this works as I suspect it does, kudos to the Fox team. -- Peter Crabtree

Since it appears my understanding of this index type was... *cough* limited, is there any reason why an index which only stores one or the other of a True/False condition wouldn't make sense? (Sure, the method Igor describes is better for something like gender, but presumably the deleted records in most tables are not 50%? Or even 3.125% (100/32)?) -- Peter Crabtree
Peter, a Binary index in VFP9 is just a bitmapped index - so it includes all records from the table, but uses exactly 1 bit for each (not counting index page links and other system info that exist in each index page), while a non-bitmapped index uses at least 4 bytes for each entry in the index file (the record number). -- Igor Korolyov
Hmm. So it appears we get about a 32 to 1 reduction in binary index size? That may still be enough to justify a Deleted() index on any table. -- Peter Crabtree
There is only possibly an 8 to 1 reduction in binary index size. -- Mike Yearwood
Peter. I believe an index on !DELETED() cannot be optimized. -- Mike Yearwood
I believe you're correct for versions earlier than VFP9, and perhaps I should refrain from making these comments until I receive my copy of VFP9 final, but according to the VFP9 beta documentation:

INDEX ON NOT(DELETED()) optimizes NOT(DELETED()) and DELETED() query conditions. In versions prior to Visual FoxPro 9.0, only DELETED() was optimized.
-- Peter Crabtree
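
For concreteness, a sketch of what this reportedly looks like in VFP9, going only from the beta documentation quoted above; the table and tag names are hypothetical:

USE orders
* VFP9 binary index: roughly one bit per record; the expression must be
* logical and never .NULL.; usable by Rushmore but not for SET ORDER or SEEK.
INDEX ON DELETED() TAG deltag BINARY
* And per the quote above, in VFP9 a plain tag on NOT(DELETED()) optimizes
* both NOT(DELETED()) and DELETED() query conditions:
INDEX ON NOT(DELETED()) TAG notdel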

From the above you can see that the amount of potential performance degradation depends on the degree of imbalance in the distribution of values for the index key expression. An index on DELETED() that has 99% of the records as .F. and 1% as .T. would create a slowdown in performance with very large numbers of records. However, a distribution like an index on City where 70% of customers are in the home city of the client would also produce a non-discriminating index, but the impact would be much smaller.
I am confused... unless an index is conditional, doesn't it always 'produce' all records? Or do you mean the number of repeating values in an index? For example, the primary key indexes ALL records, but since every record is unique, does it 'produce' few records? Please explain.
Producing records, whatever that means, is beside the point. In a query a condition either matches or it doesn't. A non-discriminating index produces far more matches than non-matches. Normally that's good, except that it may not bring you any closer to the desired query result. Classic example: an index on Deleted(). Far better to let the client resolve the deleted records from a tiny result set than it is to suck the whole non-deleted index from the server as part of a query. -- Steven Black

Another classic case is Gender: you always get half the population as hits. This issue also applies to SQL Server 6.5. One talks about the selectivity of an index; an unselective index will actually degrade performance because it results in more page reads from the hard drive. -- James Beerbower

Although I agree in theory, I found that removing the index on deleted() killed performance on a SELECT COUNT(*) ... query. The query with the index ran in under a second; without the index, over a minute. This is across a 100Mb network. I'd say YMMV! -- Bill Armbrecht

If you have a DELETED() index, you have two branches: one for .T. and one for .F. Under each branch are the record pointers. When you SELECT COUNT(*) WHERE DELETED(), Fox accesses the .T. branch node and gets the count from there. That's why the counts are so fast. -- Mike Yearwood
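
A sketch of the kind of query in question (the table and tag names are hypothetical):

USE bigtable
INDEX ON DELETED() TAG deltag
* With the tag, Rushmore can resolve the count from the index alone,
* without touching the dbf records:
SELECT COUNT(*) FROM bigtable WHERE DELETED() INTO CURSOR crsCount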

What was the ratio of deleted records? That's the issue.
2 or 3 out of 51K records -- Bill Armbrecht
This situation does not present an effect until the size of the index reaches a point where it has a noticeable time cost. With a table of 51K records you would not see this effect. Try your same test with a table of 10M records and see what the results are. The real problem here is that the situation will not show up during normal testing, as test tables are generally small. However, after deployment of the system and accumulation of data, performance will begin to degrade until it becomes unacceptable. -- Jim Booth (http://www.jamesbooth.com)

In addition to what Steve has said, the issue here is not what records are in the index but how the values of the key are distributed. In some indexes the values are distributed fairly evenly, for example an index on Last Name in a customer table. However, other indexes are by nature unevenly distributed (meaning there is an imbalance in the distribution of the values of the key). One index that very often is unevenly distributed is the index on DELETED(), which usually has many, many more records that are .F. than are .T. This situation is in no way limited to DELETED() indexes, though. The potential performance problems affect any index that has a poor distribution of key values. -- Jim Booth (http://www.jamesbooth.com)
[06/29/99] I agree in general, but as far as an index on DELETED() is concerned, I thought that it was desirable to have one (if SET DELETED is ON) regardless of the number of deleted records, as FoxPro is going to have to determine that for every record. -- Mike Collins
That's the point. Better to resolve DELETED() on the client with a 500-record result set than have the client do it on a 100,000+ record remote table. An index on DELETED() works great on a desktop app with local data. That's how this bit of questionable wisdom got perpetuated. Put a wire in the situation, with data on a server, and you've got the client chugging away evaluating DELETED() records (because hey, we have an index) all for the benefit of triage of a very small number of records from the eventual result set. See the article by Chris Probst in a recent FoxPro Advisor, May 1999. -- Steven Black
This brings up a critical distinction: testing performance locally vs. testing over the network. None of this really matters much when you have the data on your local machine with 128MB RAM. You can be loading all the indexes you need and then some, and things will be very snappy. But run the same thing over a shared network connection, and you will start to see the effect of the wasted bandwidth. Now add hundreds of users doing the same thing on that network, and even if you're running gigabit pipes, performance will suffer tremendously. -- Ed Leafe
Assuming SET DELETED is ON, and you don't have the DELETED() tag, if you run a query, SYS(3054) will report partial optimization. This might be confusing to a programmer. Consider 2 options. 1) Drop the tag on DELETED() as a performance enhancement to be done prior to delivery. 2) Don't operate VFP with SET DELETED ON while testing query performance. -- Mike Yearwood
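
A minimal sketch of that kind of check, assuming a hypothetical customers table with no DELETED() tag:

SET DELETED ON
=SYS(3054, 11) && report Rushmore optimization for filters and joins
SELECT * FROM customers WHERE city = "Toronto" INTO CURSOR crsTest
* VFP reports partial optimization: the city condition is optimized, but
* the implicit DELETED() filter added by SET DELETED ON is not.
=SYS(3054, 0) && turn the reporting back off
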
The important thing is not to trust SYS(3054), but to measure the performance under real-world conditions. We found that opening a view with NODATA, which should theoretically only pull a few bytes of structure info over the network, was instead sucking multi-megabytes of data through the wires, slowing performance to a crawl! The reason was that Rushmore doesn't care about NODATA; it still pulls over the indexes it thinks it will need to optimize performance. Removing the tag on DELETED(), as well as a few other non-essential tags, made a huge difference. -- Ed Leafe

I agree with that, but my point was to aim for full optimization during development and to drop tags such as deleted after testing performance in the real world. SYS(3054) is still useful to ensure that you've not made any mistakes in other parts of the query. -- Mike Yearwood
I just had someone tell me they got excellent performance by doing the following: INDEX ON DELETED() TAG DELREC FOR DELETED()
Can anyone confirm or deny the validity of this? I've always been told to NEVER use FOR on an index expression. -- Randy Jean
Of course they get better performance this way, because such an index is not used for Rushmore at all. But without activating it explicitly, it's never used. :) -- Christof Wollenhaupt
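
For illustration, a sketch of activating such a filtered tag explicitly (the table name is hypothetical, and this reflects the pre-VFP9 behavior Christof describes):

USE orders
INDEX ON DELETED() TAG delrec FOR DELETED()
* Pre-VFP9, Rushmore ignores filtered tags entirely; the tag only does
* anything when it is the controlling order:
SET DELETED OFF && otherwise the deleted rows stay filtered out anyway
SET ORDER TO TAG delrec
COUNT TO nDeleted && visits only the records in the filtered tag
? nDeleted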

This has changed in VFP9: an INDEX ON DELETED() TAG DELREC FOR DELETED() can be used for Rushmore optimization.


In addition, there is no way in VFP to recreate a primary key with a filter; at least genDBC cannot do this. -- Vlad Grynchyshyn

OK, so what is the consensus for multiuser network apps: do you index on deleted() if you expect very large numbers of deleted records, do you never index on deleted(), or do you always index on deleted()? -- JR
I'm not sure you'll ever get a consensus from this bunch - what would be the fun in that? - but, as a general rule, a tag on DELETED() for small tables (<50K records) and small groups of users (<50) over a fast network is reasonable. Much larger tables or user counts demand that performance testing dictate what works and what doesn't, and testing is wise under most circumstances. Many of us have had the "Gee, it ran a lot faster when I was developing it" experience. Your mileage *will* vary.
Ok, my turn.

I think this is the easy answer, in a 'perfect world' that doesn't have any of the complicating variables listed further down.

A = cdx bytes per record for an index node
N = number of records
M = number of records that don't meet the filter criteria
R = record size

If M*R < A*N, you do not want the index.

M*R = dbf bytes read that don't meet the filter criteria
A*N = cdx bytes read from disk that are used to figure out that the M*R data isn't needed, so don't get it


N and M are relevant to a subset returned by a query based on some other WHERE clause.
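
To make the perfect-world inequality concrete, a worked example with invented numbers: take N = 1,000,000 records, R = 200 bytes, A = 4 bytes, and a DELETED() filter with M = 1,000 deleted records. Then M*R = 200,000 bytes of unwanted record data, while A*N = 4,000,000 bytes of index would have to be read to avoid it, so the index costs twenty times more than it saves and you do not want it. Raise M to 100,000 deleted records and M*R = 20,000,000 > A*N = 4,000,000, and the index pays for itself.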

Here are some factors that have been ignored in our perfect world.

First, let's list the bottlenecks:
1. HD head movement - The arm in the HD has to overcome inertia, move, overcome inertia again, and stop. It is amazing how fast it goes, but nonetheless it is a major factor. This is measured in seek time.

2. HD disks spin. Once the head is in position, it has to wait for the data to come around. Measured in RPM.

3. Data moves from disk to head to a buffer on the drive, then through a few buses (IDE, SCSI, PCI, ISA, USB...) to a buffer that the OS can get to.

4. If this data is on a network file server, then it needs to be moved from a buffer on the server across the network to a buffer on the client. This is perhaps the most unpredictable factor, because it will vary widely with the amount of traffic and how well it is managed (hardware quality and configuration). When it comes to dbf access, an old 2-node 10Mb network will outperform a 50-node 100Mb network if 20 of the nodes are all sending 100MB movie files to each other.

5. Caching. Some of the above steps will be skipped the 2nd time. The more RAM, the more often they will be skipped. Some may even be skipped the first time due to read-ahead (the system gets data that is close to the current request, just because it can do so quickly).

Now some more applicable factors:

6. Hard drives read and return a whole sector to the OS. So even if you only need 1 byte, the OS is going to end up with a whole sector of bytes in RAM, and it will pluck the byte from the buffer. So even if you only need every other field from a table, you are going to read in the whole record anyway (so 1, 2, and 3 apply). Not sure if 4 applies. VFP may ask the OS for the whole record, and then only use the fields needed; or it may ask the OS for each field. There is some overhead in each request, so I can see why it would ask for the whole record.

7. Reading X bytes from 1 file is faster than reading X/2 bytes from each of 2 files. It may or may not be significant.

8. Windows and VFP aren't Open Source, so we can't be sure that there isn't some totally stupid code that is introducing an unnecessary delay.

I was going to show how the above 8 factors can screw up the simple M*R < A*N comparison, but I need to get back to work.

I think it all comes down to this: the greater the difference between M*R and A*N, the less the other factors matter, so ignore them and go with the 'perfect world'. The closer they are, the less it matters if you have an index or not, so base your decision on some simple tests and don't worry about it until someone pays you to test more.

Carl Karsten
While studying for the MS-SQL exam I came across an interesting formula: how to calculate the 'Selectivity Ratio' of an index. The book's rule of thumb is that an index should be used if the Selectivity Ratio > 95%. (Personally I think that is a bit high; I may go as low as 10%.)

Selectivity_Ratio = ( 100 * ( Total_number_of_distinct_index_rows ) / ( Total_table_rows ) )

Example: Say we have a table of people, 1234 rows.
50 rows are marked for deletion (2 distinct values)
780 are female (2 distinct values)
970 distinct first names
1200 distinct S.S.N.s (34 null)
Address_Zip_Code: 15 distinct values

The Selectivity_Ratio for each of these 'columns':

Deletion:
( 100 * ( 2/1234 ) ) = 0.16%

Sex:
( 100 * ( 2 /1234 ) ) = 0.16%

First_Name
( 100 * ( 970/1234 ) ) = 78.6%

SSN
( 100 * ( 1200/1234 ) ) = 97.2%

Zip_code:
( 100 * ( 15/1234 ) ) = 1.2%

So if one follows the 95% rule, an index would only be added to SSN. I would also add an index on First_Name and most likely on Zip_Code. But I would test with some real queries to see if Zip_Code really did benefit. Indexes on deletion and sex would not seem to be of much benefit.

AnthonyLTesti
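
For what it's worth, a quick sketch of estimating the ratio in VFP itself (the table and column names are hypothetical):

SELECT COUNT(DISTINCT first_name) FROM people INTO ARRAY laDistinct
SELECT COUNT(*) FROM people INTO ARRAY laTotal
? 100 * laDistinct[1] / laTotal[1] && Selectivity_Ratio, in percent
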
There are several reasons for using an index - to retrieve records with a SEEK or SELECT, to enable a relationship, or simply to put the records in order. Does your analysis apply to all of these? I think your penultimate sentence sums it all up - "Try it and see what happens". -- Geoff Franklin
Contributors: Steven Black, Christof Wollenhaupt, Jim Booth (http://www.jamesbooth.com), Mike Yearwood, Ed Leafe, Carl Karsten, AnthonyLTesti, Peter Crabtree
Category Performance Category 3 Star Topics
( Topic last updated: 2014.09.22 09:39:45 AM )