Wiki Home

SEEKvs ASCAN


Namespace: Wiki
SEEK() is fast.

So fast in fact, that it can pay to create a cursor, index it and use SEEK() rather than ASCAN() even if you are starting with an array of data.

It can be a VFP Rookie Mistake to think ASCAN will outperform SEEK() because "it operates in memory". Since VFP caches indexes, SEEK() operates in memory too, and it is a lot smarter than the simple front-to-back search that ASCAN() performs.

Of course, there are special cases where ASCAN may be preferable (low record counts, low number of lookups) but if you have more that 400 records and/or intend to perform more that 100 lookups you should consider using SEEK() even if your starting point is with the array, not a table.

The table below depicts the following situation:

1. Start with an array of random strings (the length of this array is registered on the rows of the table as "records")
2. Select a string at random from the list (so the lookup will not fail)
3. Determine if the string exists in the array by performing one of four alternative methods:
a) SEEK() on an indexed cursor of the strings
b) ASCAN() on the original array
c) BINASCAN() a UDF which performs a Binary Search algorithms on a sorted copy of the array
d) INDEXSEEK() on an indexed cursor of the strings

Very important: the "setup cost" is included in each of these measures. That is, the cost of creating and indexing the cursor is included in the total time for SEEK() and INDEXSEEK(). And, the cost of sorting the array is counted for Bin ASCAN(). There is no setup cost for ASCAN() since our starting point is with the array in hand.


The cells are colored according to the strategy that is fastest in each cell. Move your mouse to a cell of interest to see the raw times. Note, these should be thought of as "relative times" because system specifications will impact the performance. These particular results were obtained on a 500mhz PIII running VFP7.
R
E
C
O
R
D
S
LOOKUPS
0.1K 0.2K 0.3K 0.4K 0.5K 0.6K 0.7K 0.8K 0.9K 1.0K
0.1K
0.2K
0.3K
0.4K
0.5K
0.6K
0.7K
0.8K
0.9K
1.0K
1.1K
1.2K
1.3K
1.4K
1.5K
1.6K
1.7K
1.8K
1.9K
2.0K
2.1K
2.2K
2.3K
2.4K
2.5K

SEEK
ASCAN
BinASCAN
IndexSeek

If you read the table above carefully it is evident that ASCAN out performs SEEK() for every number of records with 0.1K lookups. As most systems don't do a series of 100's of lookups for the same data in the same set of records sequentially, the advantages of SEEK() here is questionable for the real world. It is clear that using this benchmark SEEK() is faster than ASCAN() if the number of operations exceeds 100. --JimBooth

Agreed, you really have to look closely at this and apply it properly in your own domain. I will point out however, that SEEK() actually wins even in the low lookup area if you discount the setup time required. The lion's share of the time in that column goes to creating and indexing the cursor. As for the "real-world", my real world has me in the lookups=500 and records=1000 range for every hit on a web-app. So I'm needing to squeeze this process as much as possible. That said, the real liability of SEEK() from my perspective is the dependence on a table. Needing that lookup-table available from hit to hit can run afoul of other workspace management strategies and inccur more overhead. In our situation, we are actually using an array-based strategy not listed here. It improves on Bin Ascan and offers performance somewhat better than SEEK() in our target record/lookup area. This way we can configure our parsing objects without dependencies on tables being available after init(). - ?lc

A benefit of memory based data structures is their ability to store object references. Lauren, is your Hybrid lookup all VFP code? And since you didn't post your test framework, compare the numbers from this simple HashTable class (HashTable class will be released under its own topic when I have added hash bucket chaining.) - Albert Ballinger

Excellent point Albert. I should probably clarify the reason I started this topic. We were creating a class which is to be configurable from a propritary text file, the contents of which could be easily read into an array. Also, we wanted to avoid the depencency on an external table as there may be multiples of this class instantiated at once in an environment of many private datasessions. So, these were the factors driving us to see if we could at least draw even with the performance we could get if we went to tables and use SEEK() it surprised me how difficult it was to do this. Your point about object references is a good one and is another reason to consider arrays over seek(), hopefully this topic will help people weight the costs of table vs array in order to make this sort of decision. Our hybrid method is all VFP, and is not nearly as flexible as your standard hash approach, however I think it is faster. I'll send you the test framework in a private email and you can double check I implanted your SimpleHash class correctly. At first glance, it seems to be the slowest of all strategies in the test space above. However, it is certainly more flexible than our hybrid method which is really specific to the task of determining if a given string is contained in a set of strings.
- ?lc

lc - I don't know if this belongs here - but I noticed your concern about "external table as there may be multiples of this class instantiated at once in an environment of many private datasessions". One thing I do is have a property called cLocalFile that I stuff sys(2023)+"\tmp"+sys(3) to upon init() or load() of the class. I use it if needed to create a temporary file and/or index. Upon destroy() of the object, I test to see if the cLocalFile .dbf or .idx exist in the sys(2023) directory - and if do, erase them. Just a thought. - Peter Diotte

The SimpleHash class is slowest for setup, the seek times are very competitive for larger datasets and it stores keys and referenced data - including objects. - Albert Ballinger

You can view some code-snippings here: __ SEEKvs ASCANCode

Caveats and Considerations

Some things to note here.

1. The lookups performed here were always successful. That is, there were no searches for strings which were not present. On balance, this will make ASCAN look even better than if we were searching for strings not present in the array.

2. The only reason SEEK() is not the best choice in all of these cases is that the one-time cost of creating and indexing the cursor is included in each test. This setup cost can be seen by hovering over each cell. So, if you already have a table, certainly consider SEEK() over ASCAN().

3. INDEXSEEK() is really equivalent to SEEK() in this context. However, in some situations, since it does not move the record pointer (and subsquently fire triggers etc), INDEXSEEK() can offer better performance.

4. These tests were performed on local data. Writing out the table, indexing etc will be impacted by having to go over the wire, but perhaps not enough to warrant ASCAN(). YMMV.

5. The array-based data can be made faster than SEEK() in this situation, but it requires some special organization of the array. Note that BINASCAN() helps. There are other strategies that can improve the array-based lookups even more.


Note: I don't seem to see this in the documentation: ASCAN() for string searches will only compare up to the number of characters found in the searched_for string, even if searched_in string is longer. So:
dimension a(2)
  a[1] = "zip5"
  a[2] = "zip"
  c = "zip"
  r = ascan( a, c )

r == 1, even though it is not an exact match.

- Neal D.


Also see ATvs $ vs ASCAN
Contributors Lauren Clarke Paul Jordan
Category VFP Functions Category Performance Category VFP Command Comparison Category VFP Commands Category 3 Star Topics
( Topic last updated: 2007.09.24 09:46:32 AM )