Here's a challenge ! Category VFPCorruption
Referring to the Memo File Missing Or Invalid topic of a few days ago, maybe you can also help me on this one.
Below you'll find the problem description.
Referring to that topic, I recognize my problem, though it has nothing to do with memos;
Please note that I'm not interested in salvaging corrupted records, but want to know what originates the problem.
The replies on the before-mentioned topic indicate for sure (to me) that the persons concerned have dived into something similar, so I sure hope that we all may come to the solution of this very serious problem. Though in the text below I ask for no responses with "general" hints, I think it may be useful if you mention any recognition of the topic, whether you solved your (own) problem or not.
In fact I hope for those who fuzzle about with hex bytes as in the memo problem; I myself do nothing but that, but still can't find the solution.
In that respect, the text below has some stuff on that, but since this text was created for another forum, I copied it from there in its original form, and now add some additions hereafter (the original text begins at "============="); the text hereafter may overrule some things from the copied text, because we are a few days further here, and I have some new info.
The text below doesn't say that in fact only one record appears to be destroyed after all, which can only be seen with outside-VFP tools like Ultra Edit (which I mention below). The strange behaviour VFP shows is, in my current opinion, caused by its not being able to deal with the first byte of the record being null, right where the delete mark is expected. (Does "null" = chr(0), instead of the allowed values chr(32)/space or chr(42)/'*' ? Yep. Note : the records NEVER contain rubbish, always nulls = "never been written to".)
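For anyone fuzzling about with hex bytes along with me, here is a minimal sketch of the delete-flag rules just mentioned. The function name is my own invention; only the byte values are DBF facts.

```python
# Sketch of the legal values of a DBF record's first (delete-flag) byte.
# chr(0) is never legal; it is exactly the corruption described above.

def classify_delete_flag(first_byte: int) -> str:
    """Classify the first byte of a DBF record."""
    if first_byte == 0x20:       # chr(32): space = active record
        return "active"
    if first_byte == 0x2A:       # chr(42): '*' = marked deleted
        return "deleted"
    if first_byte == 0x00:       # chr(0): null = "never been written to"
        return "corrupt (null)"
    return "corrupt (other)"

print(classify_delete_flag(0x20))   # active
print(classify_delete_flag(0x00))   # corrupt (null)
```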
A record is always destroyed from the beginning of its formal offset, so it all starts with the Append Blank (see copied text);
After another analysis over the last days, it appears that this only happens when a new data block (in the dbf) has to be created on the server.
All together this implies that it only happens (can happen) in the situation where the last record of a block doesn't fit in the block, and thus a new block has to be created.
Now refer to the memo problem, where (I didn't calculate it) I think the 0800 is a block boundary too, with in addition my info that the "thing" always resets at the block boundary of the new block.
Please note that the block size may differ on different servers / volumes.
At our own site, we have 4K blocks (dbf).
Thus, the record which doesn't fit in the current block gets corrupted, and things start to be right again as soon as the new block starts;
Having dozens of examples at hand right now, there is no indication of any "logic" behind the start point of the corrupted area, except that it is always a record which doesn't fit. And mind you, I'm fairly experienced in dividing by 8, 16, 32 etc., incorporating record length etc., but it all leads nowhere.
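To make the dividing-by-8/16/32 exercise concrete, here is a small sketch of the arithmetic, with numbers of my own choosing : DBF records start right after the header, so record boundaries normally do not line up with the server's allocation blocks, and sooner or later some record must straddle the 4K boundary.

```python
# Sketch of the block-boundary arithmetic, assuming a 4 KB allocation unit
# (as at our own site; it differs per server/volume). Numbers are examples.

BLOCK = 4096  # 4 KB dbf allocation unit

def straddles_block(header_len: int, rec_len: int, rec_no: int) -> bool:
    """True when record rec_no (1-based) does not fit in its current
    block, i.e. its first and last byte fall in different blocks."""
    start = header_len + (rec_no - 1) * rec_len   # file offset of the record
    end = start + rec_len - 1                     # offset of its last byte
    return start // BLOCK != end // BLOCK

# Example: a 265-byte header and 100-byte records.
print(straddles_block(265, 100, 38))   # False: still fits in the first block
print(straddles_block(265, 100, 39))   # True: forces a new block
```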
The most important info of all (also stated in the copied text) is that this can only happen "after a night", for any table. So, in 100 % of cases, one table can only get corrupted once a day;
Now please note that on one day more tables can get corrupted, but I'm fairly sure that the "others" always originate from a first one. I mean that the first corrupted table always fits 100 % with the expectations from my knowledge of this, while a second one never fits that expectation, which leads to my suggestion that once one table gets corrupted in a "logical" way, it takes others with it in a non-logical way. The "others" are one or two, but in 95 % of cases just nothing, and only one table gets corrupted.
I have now been working on this struggle for almost two years, still not knowing the answer;
The copied text implies a $1000 reward, which may sound silly, but that's how I presented things on the other forum.
Of course I won't withdraw this now.
Thank you all very much in advance,
Here is my copied text :
=============
Note : It is not my idea that from now on everyone posts messages with rewards, because this is not the intention of a forum. However, I do so because I need an answer to our serious problem of data corruption, which up till today can't be answered by anyone.
I also know in advance that anyone who is willing to help me (us) seriously may have to spend a great deal of time on this one.
It is my intention that only those respond who seriously recognize the description of the problem hereafter. I mean that I'm not waiting for any "good suggestions", because for 99.99 % sure I've got them all already. You just must recognise the problem (which is very, very specific), and if you do, our problem may be solved by your solution, that is, if you have one. Please also think of the blurring of this forum when you only present your general hints on this one, which may then be hundreds. Thus, please only respond if you recognize the description !
The first person who brings the solution is the winner of the award, ok ?
Heart - Profit is an ERP package, consisting of 5M lines of code, and therefore one of the larger FoxPro applications. The only thing I'm saying on this one is that any respondent may expect that "we know it all already", but unfortunately not the answer to this one.
And yes, we encountered all the data-corruption types to be found on all the forums etc., and all can be explained and/or worked around.
But now this one :
So here we go.
In FoxPro-Dos 2.5 and VFoxPro 5.00 (without SP) we get corrupted tables.
So note the "2.5", which implies that no difficult object coding is applicable here, and that no SQL is used, except for an SQL Insert as a general "Append Blank" feature (note : of course VFP5 is used with full object coding, but since Dos-2.5 isn't, the problem can be narrowed down to simple Dos code).
Now here is my description, which you should recognise if you know about this problem :
Once the corruption has occurred (it is always at the end of a table, followed by one or more non-corrupted records), the table can be closed and renamed to another name so no one can and will access it, and the show starts;
With Set Refresh To 1,1 and a normal Browse window open (no index active !), you see the corrupted records (containing nulls) flickering from the corrupted contents (|||||||| = nulls, best viewable in VFP) to normal content. Thus, the content IS there somewhere, but the network OS somehow gets the data (block) from different places.
This situation can be "kept alive" for as long as the server is up and running, AND as long as the data block is forced to be kept in the server's cache. Mind you, this behaviour can be shown on any PC in the network, and for an infinite period of time.
The corrupted data begins at an unpredictable point in the data block, but always ends at the beginning of a new data block.
With some experience, the data can be recovered as long as the situation is alive as described, by forcing FoxPro to do a re-read of the data block. This can simply be done by pressing a key on one of the fields in the Browse, but it must be done on a record which resides in the same data block as the corrupted records, and not on a corrupted record itself. Once this has been performed correctly, one can go to the end of the table with the down arrow, and all data re-appears but one record. Now a COPY TO can be performed to another table (name), and there the data is in a fixed, recovered form, but for that one record.
Any other way of copying the file (i.e. a Dos copy) results in a copy of the file with corrupted records, in which the corruption is fixed in place.
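Since a Dos copy gives the corruption in its fixed form, one can at least locate the nulled records in such a copy outside of VFP. A sketch of my own (the function name is mine; the header offsets are standard DBF) :

```python
import struct

def find_null_records(dbf_bytes: bytes) -> list[int]:
    """Return the 1-based record numbers whose delete-flag byte is chr(0).
    Standard DBF header offsets: record count at byte 4 (uint32),
    header length at byte 8 (uint16), record length at byte 10 (uint16)."""
    count, = struct.unpack_from("<I", dbf_bytes, 4)
    hdr_len, rec_len = struct.unpack_from("<HH", dbf_bytes, 8)
    return [n + 1
            for n in range(count)
            if dbf_bytes[hdr_len + n * rec_len] == 0]

# Tiny fabricated table: 33-byte header, three 4-byte records,
# the second one nulled the way the corruption leaves it.
hdr = bytearray(33)
struct.pack_into("<I", hdr, 4, 3)          # 3 records
struct.pack_into("<HH", hdr, 8, 33, 4)     # header len 33, record len 4
table = bytes(hdr) + b"\x20AAA" + b"\x00\x00\x00\x00" + b"*BBB"
print(find_null_records(table))            # [2]
```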
So far for the description to be recognized; if you don't recognize it, you may continue reading for interest, but forget about the $1000.
However, please feel free to read further anyhow, because it is possible that you have experienced this problem but did not encounter the behaviour as explained. It already becomes more difficult to see because your Refresh may be set to a higher value on the one hand, and because it's very easy to destroy the behaviour by making amendments in the corrupted data block on the other hand; the latter writes the block again, and makes the situation fixed.
The occurrence of this situation is very rare, by which I mean that it occurs at only one user site, and ... at our own (development) site. But at these two sites, the problem occurs a few times a week !
We ourselves describe this thing as the "F-Syndrom", derived from the name of the user site where it occurred in June 1999 for the first time, which company is named "Fennema" (part of Cargill);
It just started happening over there, where this site hadn't encountered these problems for over 2 years, and nothing recognizable had changed. At our own site it started happening in February 2000;
All our further user sites (approx. 90) don't have the problem.
The only real similarity between the two sites where it is happening is that both use Novell > 3; we use 5 and Fennema uses 4;
More similarities cannot be found, unless it is the usage of Arcserve as the backup software, which somehow could be related; please feel free to read my following analyses of almost two years of experience on this one :
The problem in 100 % of cases occurs when the table (could be any table) is first accessed on a day, and for appending a record.
Also sure is that the previous amendment to the table (which may have been several days ago) was a Delete. Thus, one day the last thing was a Delete, and the next day the first thing is an Append;
In between is always a night, with of course a backup. Thus, thinking of things which make nights different from days, here's something.
Up till now, I think the problem may be related to running out of file handles, which Heart - Profit by itself will not do, but since it can use about 200, the combination with other (Win) tasks may.
In general, all the analysis effort has gone into the caching of FoxPro on the client itself, which is usually the source of these kinds of problems. However, with all the knowledge known, we are not able to emulate the problem on purpose, even knowing FoxPro better than FoxPro itself. Further, all caching facilities have been disabled in whatever sneaky corner they could be found, including of course a Set Refresh To 1,1 in FoxPro itself (which is normally 60,60 in Heart - Profit).
I'm fairly sure this is not a FoxPro problem but a Novell problem, though that's hard to prove; on the one hand the problem is created somewhere which cannot be found, but on the other hand, once the problem is there, it must be the network OS which presents the wrong data.
However, this one too is tricky, because the table may be corrupted in such a form that FoxPro, while viewing it back, deals with it in a random manner, which in my opinion can be caused by the delete indicator residing in the first byte of the record, which now contains a null. But now think of "the show", where any tool shows the corrupted data in its fixed form and thus always the same, while FoxPro is able to show the data after all; that data must thus be retrieved from somewhere, where the normal tool (Ultra Edit or whatever) really doesn't contain (show) the data. And this for months if you like.
Some more detail :
Heart - Profit contains a Transaction module, which shows exactly what happened when, at field level. Though even this doesn't help us tackle this problem, it shows the following :
The corrupted data doesn't come from the memory of the PC, which transfers the proper bytes to the network. But where the data of any transaction (i.e. Append Blank, SQL Insert, Replace, Delete) is immediately read back by the module, right after the Append Blank nulls appear to be in the record, and FoxPro (both 2.5 and 5.0) doesn't report any error on whatever manipulation is performed on such a record. Of course the record can't be found anymore, because the key is nulled too, and depending on when the record is needed for a certain process, the user picks up the phone ...
We've tried everything, up to changing servers, client software and protocols (using only one, etc.), and eliminating anything which might influence this problem. Only Arcserve remained untouched ...
Note : The compress process of Novell (4/5) is also a process which becomes active overnight; I'm not sure whether this can influence things, with my remark that the tables concerned shouldn't be touched by this process.
When you think you can contribute on this problem, you may need to ask me some questions before you come with your solution; if you have one, please also state why you ask the question. I mean that when I already know that your question won't lead to the solution, I would be answering it for nothing. Therefore, if you don't find your question answered, I thought it was inapplicable, ok ?
I realise that I am almost asking not to be given the solution, but please bear in mind that several hundred pages have already been written by me on this case, including communication with MS, who also gave up on me. But therefore, anyone who solves this thing for me will have earned the reward with respect.
Thanks for participating !
By now I've proven that at the moment the thing occurs, this can happen when you're the only one in the system, which implies that this is not some kind of concurrency problem. Though I ran programs on 6 PCs or so, doing all kinds of hard stuff with Appending, Deleting, (not) Locking etc. etc., all remained well;
Having loads of experience right now, and knowing that something seems to happen the previous day (or something's influencing things overnight), a few days ago I started for the first time to reproduce the thing at will, but now not by means of running programs, but by being a normal user. And it worked :
On day-1 I was the only user at the end of the day.
I started to Append records (adding Sales Orders etc.) in a few tables, which were for me three logical ones (I mean, adding a Sales Order = 1, but since I also added some Sales Order Lines this is another table, which I don't count here).
I went home, had some sleep, and the next morning I was the first and only user.
In the same three logical tables I deleted an entry (I'm still not 100 % sure a Delete has to do with it) and further added so many entries that I could expect the data block (dbf) to get full, and thus a new block to be created.
In one of these three tables it happened.
Thus, coincidence or not (I didn't try more "days"), I reproduced the problem at will, being the only user influencing things.
Note that no situation of this corruption ever occurred in the middle of the day, which literally means :
once a new block has been created in the proper way on a certain day, all the other new blocks will be created well too. This is 100 %.
Once the block contains the last area corrupted (= from the beginning of the last record until end-of-block), two interesting things show :
1. When a Replace is performed on a field in the corrupted area (think of one field only), this area will be proper again;
I'm pretty sure this Replace (as it occurs in the live situations) is then performed by the same user who initiated the corruption.
2. When a Replace is performed on any record in the corrupted block but in the NOT-corrupted area, the block becomes corrupted from "a point" to the beginning of the block; I don't know yet where that "a point" exactly is.
Note on this one that where the Replace of no. 1 before MUST be a Replace in a program, the Replace here is only the press of a key on any field (I usually use the left-most one) in the Browse window, where the pressed key should be another character than the one which is already there. Two things happen here : VFP obtains the actual version of the block, internally performing Rlock(), and when the record is left (i.e. DownArrow), the data is written back to the block, which data then gets corrupted too.
So 1 and 2 work kind of opposite.
I should emphasize that where I earlier stated that VFP will have problems with the nulled delete byte (first byte of the record), and therefore shows its strange behaviour once the block is corrupted, I made a mistake and forgot about something :
In all situations, stuff like Ultra Edit shows a "fixed" situation. However, VFP shows "the show" (with Set Refresh To 1,1) for a period of time, which may be days and days;
The show will end by itself, and I can't predict when. Usually somewhere the same day, and for sure when the server is rebooted.
The show can be forced to continue forever, by opening the Browse window on a PC and just leaving it sitting there.
My opinion : Novell's cache has to do with it, and as long as the block is kept in this cache, the show will go on.
Note : The overnight backup will or may overwrite all of the server's cache, where I keep having the feeling this somehow has to do with it (but then, with the initiating of the problem).
However, I have one table right now which has been alive for already 13 days or so, without my having done anything to enforce this (remember, after the thing occurs, I rename the table to another name, so nobody can and will use it evermore, after a COPY TO and a recovery of the corruption at best).
Where the thing happens a few times a week (at our site as well as at the customer's site), only once in a while is a complete analysis performed on that situation (each one a full day of writing stuff like this); so only very few occasions are examined, leaving all the others sitting in the renamed table, where the name now consists of the date it happened;
Very interesting now (well, to me) is that, let's say, 10 % of these tables have another date of last update, in comparison with the filename !!
I am pretty sure that the shown date is the date when the show stopped and the table got "fixed". But what is doing this ?!! For sure not any user, and not even me, because when analysis is performed, it's always immediately after the thing occurred, and never many days later;
In my opinion the updating of the date implies that somewhere a pretty formal process is doing this, but what ?
For this matter, I have from the beginning had in mind that some Ethernet packets are floating around not knowing where to go or so, and therefore we eliminated all protocols but one (having the idea that two can do mixed things).
Thinking of the date strangeness before, I once encountered the situation that where the renamed file was closed and all, the formal date of last update (dir command) showed yesterday ! This, where the user for sure appended records "today".
Everything leads to the network OS (Novell only ?) not only causing the strange behaviour of VFP (and Fox-Dos) after the corruption, but also initiating it.
For this matter : all normal copy commands outside of VFP leave the table fixed, whereas the COPY TO of VFP does too.
Now note that I've never experienced any wrongness in the table header, that is, with respect to the length of the table.
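The header check I mean is simply that the record count in the header still matches the physical file size. A sketch of my own (the function name is mine; the optional trailing 0x1A end-of-file byte is a standard DBF detail) :

```python
import struct

def header_matches_length(dbf_bytes: bytes) -> bool:
    """Check the DBF header against the physical file length:
    expected size = header length + record count * record length,
    optionally plus one 0x1A end-of-file marker."""
    count, = struct.unpack_from("<I", dbf_bytes, 4)
    hdr_len, rec_len = struct.unpack_from("<HH", dbf_bytes, 8)
    expected = hdr_len + count * rec_len
    return len(dbf_bytes) in (expected, expected + 1)

# A fabricated 33-byte header with three 4-byte records passes;
# a truncated copy of the same file fails.
hdr = bytearray(33)
struct.pack_into("<I", hdr, 4, 3)          # 3 records
struct.pack_into("<HH", hdr, 8, 33, 4)     # header len 33, record len 4
table = bytes(hdr) + b"\x20AAA" * 3
print(header_matches_length(table))        # True
print(header_matches_length(table[:-2]))   # False
```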
Please bear in mind that the show is one thing but not the real problem, and the initiating of it is another. However, I'm fairly sure that if the show can be explained, we are nearer to what initiated it.
The first time it happened at the customer's site was 05-1999, while they didn't change anything (recognizably);
At that time, FoxPro 2.5 was used, and for sure (of course) nothing was changed with respect to new versions.
So we must have done it ourselves, by changing something in the system programs; the only thing which could be found from that period was some already existing program (counting the number of licences allowed and running) which was amended to contain a FLUSH, which formally is illegal, since this program is activated at unpredictable intervals, and can therefore run right in the middle of started transactions (with us controlled by Rlock()s and Unlock / Unlock All). Well, I think it's illegal, because somehow this doesn't fit in the logical transaction thing (how to flush all, where before you said "do it at the Unlock").
A few days ago I eliminated this Flush from the program, and since the thing occurs once every few days, I'm still waiting for it to happen right at this moment.
I'll report next time whether it did.
[2001.04.03 06:18:14 AM EST] -> too bad, it didn't help
HELLLLP (please ?)
I have to admit I did not read your post in gory detail; however, I have seen the type of corruption you describe in SBT Accounting system data. In my experience it is almost always (99.9 %) related to a flaky network card or a flaky network connection. If it seems to be happening no matter which workstation the update was coming from, I would look for a bad card on the server, or a bad router, switch, hub etc. If it seems to be coming from one or two workstations, do not hesitate, do not pass go, immediately replace their network cards. Other things to check : network cable runs longer than spec. I once found one that was 3 times the maximum spec length for an ethernet coax network run. We were amazed it worked at all. I should also add that the network people will ALWAYS claim it is not them, they didn't do anything, there is no problem etc. I have even gone to the length of buying the new network cards myself and installing them for free, just to get rid of the problem. (I do draw the line at buying a new hub, however.) Even if nothing has been changed, network cards can go bad because of a power spike that does not noticeably affect anything else.
I fully agree with you, and maybe you're even right; however, in all the cases I've seen myself of network cards and hubs, they shift the data in the record and/or get rubbish into the block. This is not the case here.
But I wish it were ...
Thanks for contributing !
Category Data Corruption Category Needs Refactoring
( Topic last updated: 2001.04.03 06:18:14 AM )