[option 1]
Break each processor core down so that it steps only one cycle at a time, through the use of a finite state machine. Whenever a read/write occurs, make sure all other processors are caught up to this one before allowing the access to proceed. The overhead of breaking instructions down into cycle stepping like this will likely be almost as high as a context switch, and far messier. This is the most obvious method of emulation.
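Roughly, in C++, it would look something like this. This is only a sketch; the class and helper names (sync_others_to, bus_read, etc.) are made up for illustration, not taken from any real emulator:

```cpp
// Hypothetical sketch of option 1: a cycle-stepped core driven by a finite state machine.
#include <cstdint>

struct CycleSteppedCpu {
    enum class Stage { FetchOpcode, FetchOperand, Execute, WriteBack };
    Stage stage = Stage::FetchOpcode;
    uint64_t cycle = 0;                 // position on the master clock

    // Advance exactly one cycle, then return to the scheduler.
    void step() {
        switch(stage) {
        case Stage::FetchOpcode:
            sync_others_to(cycle);      // bus access: everyone else must catch up first
            opcode = bus_read(pc++);
            stage = Stage::FetchOperand;
            break;
        case Stage::FetchOperand:
            sync_others_to(cycle);
            operand = bus_read(pc++);
            stage = Stage::Execute;
            break;
        case Stage::Execute:
            execute(opcode, operand);   // pure ALU work, no bus access, no sync needed
            stage = Stage::WriteBack;
            break;
        case Stage::WriteBack:
            sync_others_to(cycle);
            bus_write(address, result);
            stage = Stage::FetchOpcode;
            break;
        }
        ++cycle;
    }

    // Stubs standing in for the real core:
    uint8_t opcode = 0, operand = 0, result = 0;
    uint16_t pc = 0, address = 0;
    void sync_others_to(uint64_t) {}            // run all other cores up to 'cycle'
    uint8_t bus_read(uint16_t) { return 0; }
    void bus_write(uint16_t, uint8_t) {}
    void execute(uint8_t, uint8_t) {}
};
```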
[option 2]
Use a form of context switching (threading) to run one processor ahead of the other until a read/write is needed, and then switch. This results in far fewer context switches, but is less portable, as it requires a lower-level context switch. It also wreaks havoc on the idea of save states, since the contextual information is now machine-dependent.
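One way to get that lower-level context switch on POSIX systems is ucontext; Windows would need fibers instead. Again, just a sketch under those assumptions, and run_until_bus_access is a hypothetical placeholder. Note that the saved contexts hold raw machine stacks, which is exactly why save states become machine-dependent:

```cpp
// Hypothetical sketch of option 2: cooperative switching between two CPU cores.
#include <ucontext.h>
#include <cstdint>

extern void run_until_bus_access(int cpu);   // hypothetical: run until a shared bus access is pending

static ucontext_t scheduler_ctx, cpu_a_ctx, cpu_b_ctx;
static char cpu_a_stack[64 * 1024], cpu_b_stack[64 * 1024];

static void cpu_a_entry() {
    for(;;) {
        run_until_bus_access(0);               // run A ahead freely...
        swapcontext(&cpu_a_ctx, &cpu_b_ctx);   // ...then switch so B can catch up
    }
}

static void cpu_b_entry() {
    for(;;) {
        run_until_bus_access(1);
        swapcontext(&cpu_b_ctx, &cpu_a_ctx);
    }
}

void start() {
    getcontext(&cpu_a_ctx);
    cpu_a_ctx.uc_stack.ss_sp = cpu_a_stack;
    cpu_a_ctx.uc_stack.ss_size = sizeof cpu_a_stack;
    cpu_a_ctx.uc_link = &scheduler_ctx;
    makecontext(&cpu_a_ctx, cpu_a_entry, 0);

    getcontext(&cpu_b_ctx);
    cpu_b_ctx.uc_stack.ss_sp = cpu_b_stack;
    cpu_b_ctx.uc_stack.ss_size = sizeof cpu_b_stack;
    cpu_b_ctx.uc_link = &scheduler_ctx;
    makecontext(&cpu_b_ctx, cpu_b_entry, 0);

    swapcontext(&scheduler_ctx, &cpu_a_ctx);   // kick off CPU A
}
```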
[option 2a]
Context switches can be eliminated for writes, but not reads. Take, for example: CPU A is ahead of CPU B, and A writes to B. You can log this write with a timestamp, and when B is eventually run, it can read through this buffer to get the correct value. This can be tricky to implement, and may be slower than the context switch. It also does not solve the read problem: if A reads from B while ahead of B, there can be no log to fetch the correct value from. A context switch is required to be absolutely sure B's emulation will not affect A's.
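The write log itself is simple enough; the hard part is everything around it. A sketch, with hypothetical names:

```cpp
// Hypothetical sketch of option 2a: A is ahead of B, so its writes into B's
// address space are queued with timestamps instead of forcing a switch.
#include <cstdint>
#include <deque>

struct PendingWrite {
    uint64_t timestamp;   // master-clock cycle the write occurred on
    uint32_t address;
    uint8_t  value;
};

static std::deque<PendingWrite> write_log;

// Called while A is ahead of B: just log the write.
void cpu_a_write_to_b(uint64_t cycle, uint32_t addr, uint8_t value) {
    write_log.push_back({cycle, addr, value});
}

// Called as B executes: apply any writes whose time has now been reached.
void cpu_b_tick(uint64_t b_cycle, uint8_t* b_memory) {
    while(!write_log.empty() && write_log.front().timestamp <= b_cycle) {
        const PendingWrite& w = write_log.front();
        b_memory[w.address] = w.value;
        write_log.pop_front();
    }
}
```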
[option 3]
Perform periodic hard snapshots and subsequent soft snapshots (e.g. a change list + timestamps) of all changes to one processor. You can now run one processor ahead, and ignore reads and writes. If, when running B, you find out that it would have affected A, you can use this snapshot information to rewind A. This is probably the most complex solution to implement, and will probably be a lot slower than context switches as well.
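Something along these lines, though a real rewind would also have to reconstruct register state by re-running from the hard snapshot, which this sketch omits. All names are hypothetical:

```cpp
// Hypothetical sketch of option 3: a periodic hard snapshot plus a running
// change list, so a processor can be rewound if another turns out to have affected it.
#include <cstdint>
#include <vector>

struct CpuState { /* registers, flags, ... */ uint64_t cycle = 0; };

struct Delta {               // one "soft snapshot" entry
    uint64_t timestamp;      // when the change happened
    uint32_t address;
    uint8_t  old_value;      // what was there before, so we can undo it
};

struct RewindableCpu {
    CpuState state;
    std::vector<uint8_t> memory = std::vector<uint8_t>(0x10000);

    CpuState hard_state;                 // last hard snapshot
    std::vector<uint8_t> hard_memory;
    std::vector<Delta> deltas;           // changes since the hard snapshot

    void hard_snapshot() {
        hard_state = state;
        hard_memory = memory;
        deltas.clear();
    }

    void write(uint64_t cycle, uint32_t addr, uint8_t value) {
        deltas.push_back({cycle, addr, memory[addr]});
        memory[addr] = value;
    }

    // Undo every memory change made after 'cycle', newest first.
    void rewind_to(uint64_t cycle) {
        while(!deltas.empty() && deltas.back().timestamp > cycle) {
            memory[deltas.back().address] = deltas.back().old_value;
            deltas.pop_back();
        }
        // Registers would have to be restored from hard_state and re-run
        // forward up to 'cycle'; omitted here.
        state = hard_state;
    }
};
```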
This is one interpretation of predictive emulation. Many others can be thought up along these lines. Analyzing code to guess which sections are used for processor crosstalk and which aren't, etc. also falls into this category. They all share the same traits: very complex to implement, not entirely reliable, and potentially much slower. We really don't want all of this added complexity when the emulators are already extremely sensitive to errors in the first place.
[option 4]
Emulate threads at a much lower level, via the OS. If it were possible to shunt off and dedicate one core to one processor, then you could emulate the system just like real hardware. Say your CPU has 16 cores. Core 1 could run CPU A, and core 2 could run CPU B. They would have to be absolutely dedicated to this purpose. When A reads a value shared by B while ahead of B, it can simply lock in a loop, waiting on B to catch up. Since each host CPU core is truly parallel, there will be no context switching overhead involved. The downside to this approach is that it will consume 100% of CPU resources for each core used, and presently there does not exist a method to obtain such fine control over CPU cores. Most likely, the OS will shuffle your threads between cores as it sees fit, performing multiple context switches anyway, and the OS will probably interrupt your cores. If one core gets interrupted by the OS, then all threads waiting on that one will be stuck waiting as well. Not necessarily a bad thing, but potentially one.
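The wait lock itself is trivial, assuming each emulated CPU publishes its master-clock position in an atomic counter. read_shared_memory() here is a hypothetical stand-in for the real bus access:

```cpp
// Hypothetical sketch of option 4: spin-waiting on another CPU's clock.
#include <atomic>
#include <cstdint>

extern uint8_t read_shared_memory(uint32_t addr);   // hypothetical bus access

static std::atomic<uint64_t> clock_a{0}, clock_b{0};

// Runs on the host core dedicated to CPU A.
uint8_t cpu_a_read_shared(uint32_t addr) {
    // A has run ahead; it may not observe the shared bus until B catches up.
    while(clock_b.load(std::memory_order_acquire) <
          clock_a.load(std::memory_order_relaxed)) {
        // busy-wait: burns 100% of this core, but involves no context switch
    }
    return read_shared_memory(addr);
}
```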
This would be the fastest and most logical approach. It allows full parallelism, and fast wait locks for processors that have large shared busses, guaranteeing accuracy. We can reduce CPU usage a bit by adding in periodic sleep requests: definitely not every time a CPU is waiting for another to catch up, but ideally only a few times a second, since these sleep requests cause context switches to the kernel, and too many of them will stall out the other processors. The user should probably have control over how often to sleep, all the way down to never.
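The wait loop above could be throttled along these lines, with the interval being the user-tunable knob (zero meaning never sleep); again, only a sketch:

```cpp
// Hypothetical sketch: a spin-wait that yields to the kernel only a few times a second.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

void wait_for_clock(std::atomic<uint64_t>& other_clock, uint64_t target,
                    std::chrono::milliseconds sleep_interval) {   // zero = never sleep
    auto last_sleep = std::chrono::steady_clock::now();
    while(other_clock.load(std::memory_order_acquire) < target) {
        auto now = std::chrono::steady_clock::now();
        if(sleep_interval.count() > 0 && now - last_sleep >= sleep_interval) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            last_sleep = now;
        }
    }
}
```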
Unfortunately, this puts all current PCs out of reach. Even quad cores aren't really suitable for this method when you have two 68ks, an SH-1, a Z80, a VDP and a 2612 all running at the same time. And we cannot be sure the future will continue to offer multiple CPU cores. Most likely it will, but you never know ...
If there are not enough cores (especially with only one), the threads could all stall forever waiting on each other to catch up, unless you force periodic sleeps so that the kernel can give the other threads a chance to run.
[ultimate problem]
There really is no way to predict what one processor will do, other than running that processor up to date. The best you can do is guess and offer ways to minimize the damage, such as the write logger, marking certain sections of the memory map as read-only (hence writes will never affect anything and there's no need to sync up the other processor), implementing rewind capabilities and so on. All of these just complicate emulation further and risk adding in even harder-to-find bugs.
[closing]
The only sane thing I could think of, for those that have only single core processors, is to allow for varying degrees of accuracy to be emulated. Thanks to our use of threading, we have very, very strong control over how many context switches we use. For absolute accuracy, we have to do it always: the second a bus read or write occurs while said CPU is ahead of any others that share that bus, we have to sync up everything else. However, we also have the option of allowing the user to reduce the accuracy of the emulator in return for significant speed gains. Take, for instance, an emulator build with accuracy in mind: it will add #if blocks around each read/write request to perform synchronization. An emulator build with speed in mind could instead perform this sync test (and thus, potential context switch) only between opcode edges, or perhaps only every several opcodes. This is easily controlled, with no speed impact, through the use of #if blocks.
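Something like this, where the macro name, memory array and helper functions are purely illustrative:

```cpp
// Hypothetical sketch of the compile-time accuracy knob.
#include <cstdint>

#define SYNC_EVERY_ACCESS 1   // 1 = accuracy build, 0 = speed build

extern void sync_other_processors();   // catch every other CPU up to this one
extern void execute_next_opcode();
extern uint8_t memory[];

uint8_t bus_read(uint32_t addr) {
#if SYNC_EVERY_ACCESS
    sync_other_processors();            // accuracy build: sync on every bus access
#endif
    return memory[addr];
}

void run_one_opcode() {
#if !SYNC_EVERY_ACCESS
    sync_other_processors();            // speed build: sync only at opcode edges
#endif
    execute_next_opcode();
}
```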
It's a sucky solution, definitely. But it allows users with slower hardware to use our emulators now, and when they upgrade their hardware in the future, they can increase the accuracy setting until they have full hardware accuracy. And best of all, it allows us to write these fully accurate emulators, now, rather than waiting for this future hardware to exist, and long before it's impossible for us to run our own custom hardware tests on real SNES / Genesis hardware.
This probably wasn't what you were wanting to hear, sadly. But I won't rule out the possibility of alternate solutions; I just wasn't able to come up with any other than the above.
I'd definitely be interested if you have more ideas.
Quote from: Piotyr
Wow this is like listening to two people swap gardening or cooking tips... Well it would be if I could understand jack shat if anything you two were saying lol.
Yes, I apologize. Definitely the wrong forum to be discussing this. In fact, forums in general are bad for discussing emulation-related issues. Too much line noise from non-emulator authors.