Author
|
Topic: Mega Drive gets some bsnes style lovin'? (Read 2765 times)
|
deespence2929
Guest
|
|
« Reply #15 on: March 28, 2007, 03:54:54 pm » |
|
I took that quiz long ago, I tried it again recently, didn't pass and didn't bother retrying. Anyhow, that emulator looks good. If it's what I'm thinking it is, this thing could be real good for game genie code hacking. I might have to hit up some old hacking friends and let them know about this.
|
|
|
|
RedComet
Guest
|
|
« Reply #16 on: March 28, 2007, 05:10:01 pm » |
|
Here's a similar thread on a not-so elitist forum. I posted a link to this thread over there.
|
|
|
|
Nemesis
Guest
|
|
« Reply #17 on: March 29, 2007, 07:51:39 am » |
|
[quote]I could be wrong, but it sounds like he is using pre-emptive threads for processor synchronization (eg one thread for 680x0, one for Z80, etc). While this works for rough accuracy pretty well (eg PS2 or Saturn emulators that only need to sync a few hundred times a second), it's absolutely terrible if you're going for perfect accuracy. If you actually sync on every memory access that could affect another thread, you'll spend most of your time in the kernel handling context switches. An older system like the Genesis would benefit a lot more from either cooperative threads or finite state machines. They're also a lot more portable. The former gives you cleaner code, the latter gives you faster speed. True, you only get to take advantage of one core, but you have to look at the platform, too. A Genesis emulator probably won't tax a single core of any dual core processors, so optimizing for those processors probably isn't ideal.
I'm also curious how he plans to use threads, and yet still have savestate support. Threads of any sort are dependent on the host OS, which is a state you cannot reliably save, unlike finite state machines where your functions are sloppy and hideous, but fully reentrant. Lots of small issues like this you don't realize until well into development of an emulator.[/quote]
My current implementation is very much a work in progress, and I've just done a major redesign. I'll tell you basically how it works in theory, but I should add that the implementation I'm about to describe isn't fully implemented and tested yet.
There is a System object, which has a list of Chipset objects that make up the system. While the chipsets may run in separate threads, they do not run of their own accord. Chipsets are allocated timeslices by the system object. Timeslices in this context do not mean physical CPU time, they mean a length of time relative to the physical chipset that is being emulated. A chipset performs its normal execution task until its timeslice expires, then waits for a new timeslice to be allocated. The system essentially allows all the chipsets in the system to execute in parallel for a period of time, until all the chipsets have reached the end of their timeslice. Timeslices can be as broad or as fine grained as the system requires in order to maintain a reasonable level of accuracy.
When all the chipsets have used their timeslice, that is called a timing point, or a single point when all the chipsets are at the correct stage of execution relative to each other. At this point the system is synchronized, and can be safely halted, a savestate taken, etc. While the system is synchronized, the system asks each chipset how long it can execute before it knows it needs to be synchronized again. The system then allocates another timeslice for the length of the smallest interval.
In the case of the Mega Drive, these timing points will mainly be driven by the VDP graphics chipset, indicating the amount of time remaining before events such as hblank and vblank for example, which often impact on code executing on the main 68000 CPU.
This system is far from perfect. The number one problem is unpredictable timing points, such as when the 68000 sends a DMA command to the VDP. If the command is a VRAM fill or copy, the DMA operation will execute in parallel, and the 68000 can query the VDP to see if it has completed. I can simulate the timing for the 68000 CPU to make it appear accurate by recording the timeslice progress of the 68000 core when it initiated the operation, and determining the length of time the operation has been running on subsequent access, but what I can't do is actually perform the DMA operation over that time period. The VDP may report a DMA operation is still in progress, but in actual fact it would fully complete the DMA operation when it was triggered, and merely simulate the progress of the DMA operation in order to make the timing appear accurate to the 68000. If the 68000 were to read the last value from the VRAM immediately after initiating a DMA fill from the start of the VRAM, the value should not yet have been overwritten. It seems to me, the only accurate way to deal with this specific case is to roll back the system to a previous consistent state, and execute from that point with advance knowledge of the upcoming timing point.
Ultimately, timing is the biggest challenge. I underestimated the complexity of the requirements during my initial design. Of course, the best solution is to keep cycle-for-cycle accuracy between all the chipsets in the system, but it's simply not going to run anything close to full speed, and it's certainly not going to be able to make use of multiple cores.
Seeing as this level of timing accuracy is impractical, we have to sacrifice a degree of timing accuracy in order to gain overall performance. The problem of timing control is really about sacrificing timing accuracy when you know you don't need it, without compromising your ability to fully emulate it when you do need it. It's a difficult task, and one I'm still working on. I would be interested in any suggestions you might have to offer.
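To make the DMA fill problem described above concrete, here is a minimal Python sketch of the "complete it instantly, fake the busy flag" approach. It is only an illustration: the class name `LazyFillVdp`, its methods, and the `CYCLES_PER_BYTE` figure are assumptions made up for this example, not Exodus code or real VDP timings.

```python
class LazyFillVdp:
    """Hypothetical VDP model: the fill is applied at once, the busy flag is simulated."""

    CYCLES_PER_BYTE = 2  # illustrative figure, not a measured VDP timing

    def __init__(self, vram_size):
        self.vram = bytearray(vram_size)
        self.dma_end_time = 0  # emulated time at which the fill "finishes"

    def start_fill(self, now, start, length, value):
        # The whole fill happens immediately on the host side...
        self.vram[start:start + length] = bytes([value]) * length
        # ...and only the completion time is kept for later status reads.
        self.dma_end_time = now + length * self.CYCLES_PER_BYTE

    def dma_busy(self, now):
        return now < self.dma_end_time

    def read_vram(self, address):
        return self.vram[address]


vdp = LazyFillVdp(vram_size=16)
vdp.start_fill(now=100, start=0, length=16, value=0xFF)
busy_right_after = vdp.dma_busy(now=101)  # the status flag looks right to the 68000
early_read = vdp.read_vram(address=15)    # but the last byte is already overwritten
busy_later = vdp.dma_busy(now=132)
```

The status flag behaves plausibly, but `early_read` already returns the fill value, which is exactly the visible inaccuracy described in the post.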
|
|
|
|
Nemesis
Guest
|
|
« Reply #18 on: March 29, 2007, 08:00:25 am » |
|
I'm going to post some of the ramblings I wrote to myself when I was breaking down the issues I was having with timing emulation. While it doesn't follow all the way through to some of the decisions I've made of late, it might help you to understand the thought process behind my design.
We are going back to the drawing board with timings here. Inter-chipset timing issues are going to be the Achilles heel of this project unless the requirements are mapped out, and an appropriate framework is designed to manage these timing requirements.
Internal timing within a chipset is not a problem. The current method relies on a chipset being allocated a timeslice and performing its own internal operations until the timeslice elapses. This is easy to manage, such as in the case of cycle counts for a CPU, and this method works. Problems appear where there are timing requirements between two or more separate processors. In this case, the chipsets must be kept in sync to a sufficient degree where they can bounce off each other at the appropriate intervals. It is possible to write code that is extremely timing sensitive. The Exodus platform must easily scale to support these timing requirements, or it is not a viable emulation platform.
There is an additional question of where the responsibility lies to determine the length and distribution of the timeslices that need to be allocated, in order to keep the system in sync with the requirements of all the attached chipsets. This system needs to take into account the additional requirements addon devices may have regarding timing.
It seems the issue of inter-process timing requirements can be solved by the introduction of the concept of a timing point. Inter-processor timings can be represented by a bus network. There are separate buses travelling along their own routes, running their own schedules. Each of these routes has timing points along the way, indicating target times for the bus to arrive. The problem of accurate timing emulation is that of keeping the relative timings between buses accurate at each timing point for each service. If these timing points are respected, a passenger can travel between any two points transferring between any number of buses, and arrive at the predicted time. If timings are off, the passenger may miss a connecting bus service they would have caught, had the buses been running on time.
In terms of emulation, this means each chipset must register with the system how long it can execute before a critical timing point occurs. The system receives timing points from all chipsets, determines the time interval until the next critical timing point, and dispatches a timeslice for the intervening time block.
How the chipsets inform the system object of their timing requirements is a question that needs consideration. One option is that the chipsets register their timing points with the system when they are added or initialized, and from that point on do not generally change their registered timing points. This makes it easiest for the system to interpret and manage timing requirements, as the timing points can be analyzed and sorted during initialization, and do not impact on performance from that point on. This does not easily allow for chipsets like the VDP however, for which the timing points change when mode settings change. This would require such a mode change to be detected by the VDP core, and a specific notification to be sent to the system to register the new timing requirements.
Another option is that the system is not aware of the overall timing requirements of each chipset at all, and simply gets informed of the next timing point each chipset requires. The system takes the closest timing point, and issues a timeslice to reach it. This provides the flexibility to allow each chipset to change their timing requirements constantly, without actively informing the system of the changes, and without timing state information in the system becoming invalidated by state changes to an individual chipset. This is sounding like a better option.
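The second option above can be sketched in a few lines of Python. This is a rough illustration under assumptions: the `Chipset` and `System` classes, the method `time_until_next_timing_point`, and the fixed periods are hypothetical stand-ins, and the chipsets run one after another here rather than in separate threads.

```python
class Chipset:
    """Hypothetical chipset that needs a sync point every `period` time units."""

    def __init__(self, name, period):
        self.name = name
        self.period = period
        self.time = 0  # emulated time this chipset has reached

    def time_until_next_timing_point(self):
        return self.period - (self.time % self.period)

    def run(self, timeslice):
        # A real chipset would execute instructions or render pixels here.
        self.time += timeslice


class System:
    def __init__(self, chipsets):
        self.chipsets = chipsets
        self.timing_points = []  # emulated times at which everything was in sync

    def step(self):
        # Ask every chipset for its next requirement and take the nearest one.
        timeslice = min(c.time_until_next_timing_point() for c in self.chipsets)
        for chipset in self.chipsets:  # could be dispatched to threads instead
            chipset.run(timeslice)
        self.timing_points.append(self.chipsets[0].time)


system = System([Chipset("vdp", period=5), Chipset("m68k", period=8)])
for _ in range(4):
    system.step()
```

Because each chipset is asked again at every timing point, a chipset whose requirements change (a VDP mode change, say) only has to return a different value next time; the system keeps no timing state that could go stale.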
One thing which needs consideration is whether we will encounter problems due to unpredictable timing points, meaning an unexpected change of status in a chipset which forces the need for a new timing point, when a timeslice has already been allocated which passes the target time interval. Two questions need to be answered: 1. Are unpredictable timing points a reality? 2. Are they critical enough to require the ability to invalidate the current time slice, and roll back to the previous timing point and recalculate?
Adding the ability to re-do a currently executing timeslice to account for an upcoming unpredictable timing event is an extreme measure, and an extremely difficult and costly one to implement. It is possible, but should not even be considered unless a specific need can be identified for this ability.
An example of where an unexpected timing point could be used is where the 68000 accesses the VDP and requests time-sensitive information, such as the current state of the HV counter. An upcoming request for the HV counter cannot be reasonably predicted, it must be dealt with when it arises. One way this can be done is to keep the VDP in step with the 68000 to a sufficient degree that the HV counter can be calculated to a reasonable degree of accuracy.
A true example of an unexpected timing point has been noted. In the case of the 68000 initiating a memory copy or fill DMA operation, an additional timing point is introduced to mark the end of the DMA operation. This is a truly unpredictable event, and inaccurate emulation of the timings involved is visible from the 68000 side if it inspects the status flag, informing it whether the DMA operation is complete or still in progress.
|
|
|
|
byuu
Guest
|
|
« Reply #19 on: March 29, 2007, 11:47:00 am » |
|
Hi Nemesis,
First, thank you for registering here to reply. I apologize that I did not have the time or patience to do the same myself on the s2beta forum :/
Now then,
I see what you're saying. If the multiple chipset objects run in parallel, then they are preemptive. This means that the OS determines when each thread should run, but on a multicore, many chips can run at once. I understand you have your own internal variables that tell it how long to run when they are called before they stall / sleep, waiting for you to tell them it's ok to run for another length of time.
This is an innovative approach, but again I really must reiterate the tremendous overhead a preemptive thread switch incurs. For example, using CreateThread / SwitchToThread for sleeping on Windows, it takes ~600x as long to perform 10,000,000 thread switches compared to a standard quick subroutine call (that doesn't preserve all nonvolatile registers), whereas a cooperative thread takes ~6-10x as long, or ~1.25x as long as a safe function call that preserves all nonvolatile registers.
You mention an issue with DMA, where it appears you are able to quickly perform the entire DMA process in one step, presumably a memcpy / memset operation, and then account for the timing as happening all at once. The tricky part of course being when another chipset accesses that data at a time that is supposed to be in the middle of that DMA operation. Unfortunately, as you surmised, the most obvious solution is actually the one I took: perform the operations in lock-step. With my CPU DMA routine, it will transfer one single byte, update the clocks, and repeat. You're right that it's a lot slower than simply simulating the entire transfer at once, but it removes the very, very difficult problem of having to determine when and if another chip will attempt to access the chip performing the DMA so you can set the proper timeslice to get the correct bus values.
I believe that trying to predict future timing events adds a lot of overhead and complexity to the design of the emulator, more chances for bugs and miscalculations that result in the wrong bus values.
Let me go over my method a little more.
I have much the same setup as you, a class/object for the entire system controls internal emulation. This class has an interface between it and the UI for video/audio/input stuff, and directly owns chipset objects. One for CPU, SMP, DSP and PPU. The four major chips that make up the SNES. It also has another object that handles memory mapping. But only the four chipset objects have their own threads.
Now, these are obviously not real threads, they are userspace threads. I create their stacks with a simple malloc(), and I switch between them by saving and restoring all volatile registers and swapping the stack pointer between the main thread pointer and my malloc() pointers. So essentially, I personally control scheduling between all of the threads, and I can switch from one thread to another, but only when needed. So the idea is that I design each of these objects as their own separate programs that emulate only said chips. When that chip tries to access a resource/bus that is shared by another chipset, it must make sure that that other chip is caught up, emulated timewise, to the chip that wants to do the read, that way the correct value is there. If it is not caught up, it will switch threads to the chip that is behind, immediately, right in the middle of the memory read function asking for the value.
Because only bus accesses matter for synchronization between chips, the memory read/write routines for each chip have these checks built into them, thus making the design of the actual chip objects much easier. They actually look like oldschool opcode emulators that simply execute entire opcodes at a time. But they're actually even more powerful. With my design, I can actually execute several thousand opcodes on the CPU without ever running the secondary audio processor, the SMP, at all. Then, when the CPU finally does need to probe the SMP, it will simply switch over to it and catch the SMP up. Since the SMP is behind the CPU, for any reads it performs, it will know that the CPU did not write anything since it was able to execute so far ahead. For any writes, the same applies. The CPU never tried to read this soon, so it can safely write the value and continue running, and it can even now run past the CPU, until it tries to access the CPU after being caught up. So you're really only switching threads a couple of times a second. And when you hit a particularly nasty section of code where the crosstalk is really bad (eg CPU is transferring a program to the SMP), it will work the same, it will just perform a lot more thread switches. Because of the assembler implementation of this switching, the overhead is barely over that of a fully safe function call, so it doesn't really affect speed much at all.
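The catch-up scheme described above can be sketched with Python generators standing in for the hand-written userspace threads. This is not bsnes code: the `Chip` class, `sync`, `run_pair`, the two toy programs, and the cycle costs are all assumptions invented for the illustration.

```python
class Chip:
    """Hypothetical chip: a clock plus the generator that emulates it."""

    def __init__(self, name):
        self.name = name
        self.clock = 0      # emulated cycles this chip has executed
        self.switches = 0   # how often it had to hand over control
        self.done = False
        self.thread = None  # generator, assigned once both chips exist


def sync(me, other):
    """Run before every shared-bus access: wait until `other` has caught up."""
    while other.clock < me.clock:
        me.switches += 1
        yield  # hand control to the scheduler, which resumes `other`


def cpu_program(cpu, smp, port, log):
    for _ in range(1000):  # 1000 opcodes that never touch the shared port
        cpu.clock += 6
    yield from sync(cpu, smp)
    port[0] = 0x42
    log.append(("cpu write", cpu.clock, port[0]))
    cpu.clock += 6


def smp_program(smp, cpu, port, log):
    while True:  # poll the port until the CPU's value shows up
        yield from sync(smp, cpu)
        if port[0] != 0:
            log.append(("smp read", smp.clock, port[0]))
            return
        smp.clock += 7


def run_pair(first, second):
    """Resume one chip until it yields or finishes, then resume the other."""
    current, other = first, second
    while not (current.done and other.done):
        if not current.done:
            try:
                next(current.thread)
            except StopIteration:
                current.done = True
                current.clock = float("inf")  # a finished chip never blocks the other
        current, other = other, current


port, log = [0], []
cpu, smp = Chip("cpu"), Chip("smp")
cpu.thread = cpu_program(cpu, smp, port, log)
smp.thread = smp_program(smp, cpu, port, log)
run_pair(cpu, smp)
```

In this toy run the CPU executes its 1000 opcodes in one go, the SMP then polls hundreds of times without interruption, and each chip only gives up control once, at the moment it would otherwise touch the port ahead of the other chip.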
For my CPU DMA support, I simply transfer one byte at a time, and then update the timing to reflect that one byte has been transferred, and repeat. It is slower than simply doing the entire thing at once, but there are some edge cases of CPU DMA that make it not so bad. Eg the CPU also has HDMA, think of it as interrupt-driven DMA that triggers once every hblank. This is used to write to MMIO registers, and you really couldn't map that to a fast memcpy anyway. You'll probably lose a bit more speed with VDP DMA functions like memory fill, if you have to do it in lockstep like this.
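A lock-step transfer of this kind might look like the following Python sketch. The function name `lockstep_fill` and the two-cycles-per-byte cost are assumptions for the example; the point is only that the clock advances after every byte, so another chip can observe the transfer half finished.

```python
def lockstep_fill(vram, start, length, value, cycles_per_byte=2):
    """Yield the elapsed cycle count after each transferred byte."""
    elapsed = 0
    for offset in range(length):
        vram[start + offset] = value  # transfer exactly one byte
        elapsed += cycles_per_byte    # then account for its cost
        yield elapsed                 # a point where other chips may run


vram = bytearray(16)
last_byte_seen_mid_transfer = None
for elapsed in lockstep_fill(vram, start=0, length=16, value=0xFF):
    if elapsed == 8:  # another chip peeks a quarter of the way through
        last_byte_seen_mid_transfer = vram[15]
```

Unlike a single bulk copy, the mid-transfer read sees the old contents of the last byte, which is the behaviour the one-shot approach cannot reproduce.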
Now, I realize my design does not allow for running multiple processors at the same time. My belief is that by the time you reach a system that has more than one real processor, you'll already have more than enough power to emulate the system, even at the lowest possible level of accuracy, in realtime. And indeed, that has been the case for bsnes, and I haven't really optimized my code all that much. However, by using preemptive threading as you are, this will greatly raise the overhead on single processor machines, making the entry level the dual core processors. Great for thinking ahead, at least. Eventually everyone should have dual core processors.
Oh, and before I forget, there is one major issue with synchronization for both types of threading: shared busses. How much is shared between two chips largely determines how many times you must synchronize on read/write requests to busses.
Eg, let's say you have two CPUs running in your main system RAM. Any time one of those reads or writes to any RAM, the other CPU must catch up. This means that two processors sharing a large memory bus will synchronize a lot more frequently than two processors that only share a small, four-byte communication bus like the SNES CPU and SMP. Well, that or your prediction algorithm to see if the other CPU will access the same value on the bus at this time will be a whole lot more complex. This is the reason I do not emulate the 10.5mhz SA-1 or 11mhz SFX coprocessors in bsnes: they have large shared busses like this, and with my approach I have no way to predict when the two will access the same bus values and am forced to always synchronize on any bus accesses. Even though it's very fast to do this for me, that's still literally millions of thread switches per second. You're really better off taking the accuracy hit and only syncing those once per opcode or so, and dealing with slight bus access misses here and there. This is how other emulators can run these coprocessors on less than Core 2 Extremes overclocked with liquid nitrogen cooling :)
The next problem I have is with savestates. Because I can literally be in the middle of a thread for all four chipset objects, I cannot reliably save the state to a file. You'd have to save the program stack and registers to a file, and that can easily change each time you run the program or run it on different OSes / machines. The only way I can see to reliably save states is to use a finite state machine. Basically, a big switch(state) {} at the top of your thread entry point, and to be able to execute your timeslice and allow emulation to resume right where it left off when you call that entry point again, as you'll need to do when you load a savestate. It's easy to wait until one chipset has reached the end of the chipset object main routine's loop and is about to return to the top and grab a savestate there, but then you have to do the same for the other three chipsets, and those could easily cause the CPU to execute further, and thus you are stuck waiting on the CPU to sync again. I do have a clever workaround courtesy of mozz for this, but it's too complex to be worth implementing in my opinion. I figure, I'm emulating the hardware as faithfully as possible, and real hardware does not have savestates. End users can use other emulators if they insist on that functionality. I believe you will have the same problems by using any type of threading model, and am curious how you did or how you are planning to avoid this issue. Perhaps you have thought of a better solution than me :) If you do in fact have a giant switch() {} at the top of each chipset object, then there's really no reason to have true threads at all, other than being able to run them in parallel, which has all the aforementioned problems. I honestly believe, based on my own benchmarking, that it would be beneficial to drop the threading entirely if you have a giant FSM switch table already implemented. 
My problem with FSMs is that while one of them is most certainly faster than cooperative threading, multiple levels of them are much slower, and the code ends up looking like spaghetti. And you pretty much have to have more than one unless you want your entire chipset implemented in a single function, as you can't break out in the middle of subroutine calls to synchronize unless they also contain FSMs. And to implement an entire CPU emulator inside one big function would be absolutely psychotic.
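For comparison, here is a minimal Python sketch of the finite state machine style and why it makes savestates easy. The two-phase `FsmCpu`, its "add the opcode to an accumulator" behaviour, and the JSON format are assumptions chosen to keep the example tiny; an if/elif chain plays the role of the big switch(state).

```python
import json


class FsmCpu:
    """Hypothetical CPU that does one phase (fetch or execute) per call."""

    def __init__(self, program):
        self.program = program
        self.state = "fetch"
        self.pc = 0
        self.opcode = None
        self.acc = 0

    def step(self):
        # The equivalent of a big switch(state) at the entry point.
        if self.state == "fetch":
            self.opcode = self.program[self.pc]
            self.pc += 1
            self.state = "execute"
        elif self.state == "execute":
            self.acc += self.opcode  # toy semantics: every opcode is "add n"
            self.state = "fetch"

    def save(self):
        fields = {"state": self.state, "pc": self.pc,
                  "opcode": self.opcode, "acc": self.acc}
        return json.dumps(fields)

    def load(self, blob):
        for key, value in json.loads(blob).items():
            setattr(self, key, value)


program = [1, 2, 3]
original = FsmCpu(program)
for _ in range(3):        # stop in the middle of the second instruction
    original.step()
savestate = original.save()
for _ in range(3):
    original.step()

restored = FsmCpu(program)
restored.load(savestate)  # resume mid-instruction from plain data
for _ in range(3):
    restored.step()
```

All of the execution position lives in ordinary fields rather than on a host stack, so it can be written to a file at any step. The cost is the one described above: every nested routine that might need to pause has to be flattened into states as well.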
We can continue the discussion here if you like, or if you'd prefer realtime, e-mail or otherwise, let me know and I will send you my contact info.
Thanks again for registering and posting.
|
|
|
|
Numonohi_Boi
Guest
|
|
« Reply #20 on: March 29, 2007, 01:59:48 pm » |
|
this thread is what I was referring to earlier http://www.romhacking.net/forum/index.php/topic,3020.0.html I know that ACCURATE Megadrive emulation is first priority, but this is certainly a nice feature. Just forget the post if it's simply not compatible with your plans. I just thought it should be brought to everyone's attention again.
|
|
|
|
Piotyr
Guest
|
|
« Reply #21 on: March 29, 2007, 06:59:41 pm » |
|
[quote author=Numonohi_Boi]this thread is what I was referring to earlier http://www.romhacking.net/forum/index.php/topic,3020.0.html I know that ACCURATE Megadrive emulation is first priority, but this is certainly a nice feature. Just forget the post if it's simply not compatible with your plans. I just thought it should be brought to everyone's attention again.[/quote]
That would look really sexy on my big screen, if only it did it with sonic and knuckles as well cause that was awesome too. Does gensplus support this too? lol.
|
|
|
|
Numonohi_Boi
Guest
|
|
« Reply #22 on: March 29, 2007, 08:01:51 pm » |
|
unfortunately with the sonic 3 and sonic 3 and knuckles they didn't squish the picture, they actually created smaller sprites etc. so it only really works with sonic 2.
*I THINK*
|
|
|
|
Talbain
Guest
|
|
« Reply #23 on: March 30, 2007, 06:06:13 am » |
|
Somehow I foresee that quiz setup being attacked by a brute force script. As an aside, the easiest way I've found to kill bots from spamming boards (dunno if that's the intention) is to just not allow them to post topics until they've reached a certain post count. It's worked marvelously for the boards I've been an admin at thus far, plus it prevents stupid people from... well, spreading their stupidity. Back on topic, I'm interested in this for the sake of emulation, so I applaud you Nemesis. Hopefully this knowledge will make it easier to hack Genesis games and thereby increase the work that's able to be done on the system. Looking forward to seeing more of your work, it's already looking great from the screens I've seen. Also, it'd be really great to see a Genesis emulator for Linux distros!
|
|
|
|
deespence2929
Guest
|
|
« Reply #24 on: March 30, 2007, 06:31:28 am » |
|
They already do that too. After 30 posts you can't post anymore until a mod reviews your posts. Really tho, why couldn't they just require the quiz for people who want to post in the sonic portion of the forums? That would make the most sense.
|
|
|
|
Nemesis
Guest
|
|
« Reply #25 on: March 30, 2007, 10:50:17 pm » |
|
I see what you're saying. If the multiple chipset objects run in parallel, then they are preemptive. This means that the OS determines when each thread should run, but on a multicore, many chips can run at once. I understand you have your own internal variables that tell it how long to run when they are called before they stall / sleep, waiting for you to tell them it's ok to run for another length of time.
This is an innovative approach, but again I really must reiterate the tremendous overhead a preemptive thread switch incurs. For example, using CreateThread / SwitchToThread for sleeping on Windows, it takes ~600x as long to perform 10,000,000 thread switches compared to a standard quick subroutine call (that doesn't preserve all nonvolatile registers), whereas a cooperative thread takes ~6-10x as long, or ~1.25x as long as a safe function call that preserves all nonvolatile registers. Now, I realize my design does not allow for running multiple processors at the same time. My belief is that by the time you reach a system that has more than one real processor, you'll already have more than enough power to emulate the system, even at the lowest possible level of accuracy, in realtime. And indeed, that has been the case for bsnes, and I haven't really optimized my code all that much. However, by using preemptive threading as you are, this will greatly raise the overhead on single processor machines, making the entry level the dual core processors. Great for thinking ahead, at least. Eventually everyone should have dual core processors. Context switching is expensive, there's no doubt. At the same time, I see a clear need for multithreaded support. There's no reason why the base Mega Drive itself couldn't be emulated with highly accurate timing in a single thread on a current generation CPU. The Mega Drive has a plethora of addons however. The 32x for example adds two SH2 processors running at 20MHz. I want to design a framework that can scale to meet this requirement. I've sat through presentations from Intel outlining their roadmap for the next few years. They're talking about 16 core processors in the regular consumer market by 2008. In the short term at least, the progression for CPU's has shifted from increasingly faster single-core processors, to massively parallel processors with a moderate clock speed per core. 
In 5 years time, we might all be running 32 core processors, but those individual cores may be no faster than 3-4GHz. I want to build a platform that can move with this change. My target platform is one where each primary chipset can run its own thread on a dedicated core, because that's where we'll be in not too much time if Intel have their way. If this means my emulator runs significantly slower on single core machines in the mean time, so be it. All that said, this is still very much a work in progress. My current approaches may not be viable. I do have a definite goal to make a multithreaded implementation work however. I would see falling back to a single-threaded method as a last resort. Oh, and before I forget, there is one major issue with synchronization for both types of threading: shared busses. How much is shared between two chips largely determines how many times you must synchronize on read/write requests to busses. Eg, let's say you have two CPUs running in your main system RAM. Any time one of those reads or writes to any RAM, the other CPU must catch up. This means that two processors sharing a large memory bus will synchronize a lot more frequently than two processors that only share a small, four-byte communication bus like the SNES CPU and SMP. Well, that or your prediction algorithm to see if the other CPU will access the same value on the bus at this time will be a whole lot more complex. This is the reason I do not emulate the 10.5mhz SA-1 or 11mhz SFX coprocessors in bsnes: they have large shared busses like this, and with my approach I have no way to predict when the two will access the same bus values and am forced to always synchronize on any bus accesses. Even though it's very fast to do this for me, that's still literally millions of thread switches per second. You're really better off taking the accuracy hit and only syncing those once per opcode or so, and dealing with slight bus access misses here and there. 
This is how other emulators can run these coprocessors on less than Core 2 Extremes overclocked with liquid nitrogen cooling.

This is the number one problem I'm still trying to solve. The Z80 for example can read from and write to banked 68000 memory at any time. While very few roms are probably very timing critical when it comes to this kind of access, it's possible to write code that would be. When you start combining addons like the MegaCD and the 32x however, there are multiple primary processors with shared access to memory, and handshaking tests in the bios. A Mega Drive with a MegaCD and a 32x would have two 68000's, two SH2's, a Z80, and multiple video and sound chipsets, all with their own timing requirements.

The next problem I have is with savestates. Because I can literally be in the middle of a thread for all four chipset objects, I cannot reliably save the state to a file. You'd have to save the program stack and registers to a file, and that can easily change each time you run the program or run it on different OSes / machines. The only way I can see to reliably save states is to use a finite state machine. Basically, you put a big switch(state) {} at the top of your thread entry point, so that you can execute your timeslice and have emulation resume right where it left off when you call that entry point again, as you'll need to do when you load a savestate. It's easy to wait until one chipset has reached the end of the chipset object main routine's loop and is about to return to the top and grab a savestate there, but then you have to do the same for the other three chipsets, and those could easily cause the CPU to execute further, and thus you are stuck waiting on the CPU to sync again. I do have a clever workaround courtesy of mozz for this, but it's too complex to be worth implementing in my opinion. I figure, I'm emulating the hardware as faithfully as possible, and real hardware does not have savestates. End users can use other emulators if they insist on that functionality. I believe you will have the same problems by using any type of threading model, and am curious how you did or how you are planning to avoid this issue. Perhaps you have thought of a better solution than me.

In my current implementation, when a timeslice expires, each processor returns to a consistent state. A processor doesn't check how much time is remaining in its allocated timeslice partway through executing an opcode, for example. It will fetch, decode, and execute the next instruction, deduct the execution time, then check if it has any remaining allocated time before it fetches the next opcode. When it has used all the time that has been allocated, the thread signals completion, and waits for another timeslice to be allocated. The system lets every processor reach this wait state before allocating the next timeslice. Once all the processors have reached this wait state, and before the next timeslice is allocated, the entire system is suspended and consistent. It is during this state that system requests will be taken, such as a pause command, or a request to take a savestate.

this thread is what I was referring to earlier http://www.romhacking.net/forum/index.php/topic,3020.0.html
I know that ACCURATE Megadrive emulation is first priority, but this is certainly a nice feature. Just forget the post if it's simply not compatible with your plans. I just thought it should be brought to everyone's attention again.

That's interlace mode 2. The system doesn't actually render a single frame at double the height, it renders two separate frames, the first consisting of all the even lines, the second consisting of all the odd lines. This produces a noticeable flickering effect on the screen. You could combine the two frames into a single frame at double height, but this isn't what the system actually does. I could add a setting to switch to this mode of rendering for the curious. I have numerous alternate modes and outputs planned for the VDP, but things like this will come later.
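
To make the timeslice scheme from this post concrete, here is a minimal sketch, not the actual implementation. It assumes a processor only checks its remaining time between opcodes, so it can overshoot the slice by part of an opcode. The names `Opcode`, `Processor`, `runTimeslice` and `runSlice` are hypothetical, the processors run one after another instead of in separate threads, and carrying the overshoot into the next slice is my own assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical opcode: only its cost in emulated cycles matters here.
struct Opcode { std::int64_t cycles; };

// Hypothetical processor that replays a fixed instruction stream.
struct Processor {
    std::vector<Opcode> program;
    std::size_t pc = 0;
    std::int64_t budget = 0;    // emulated cycles left; negative after an overshoot
    std::int64_t executed = 0;  // total emulated cycles consumed so far

    // Run whole opcodes until the allocated time is used up.
    // The remaining time is only checked between opcodes, never inside one.
    void runTimeslice(std::int64_t cycles) {
        assert(cycles > 0);
        budget += cycles;  // assumption: an overshoot is paid back in the next slice
        while (budget > 0 && pc < program.size()) {
            const Opcode& op = program[pc++];  // stands in for fetch, decode, execute
            budget -= op.cycles;
            executed += op.cycles;
        }
    }
};

// Give every processor the same timeslice. When this returns, all of them are
// waiting between opcodes, which is the point where a pause or savestate
// request could be serviced.
void runSlice(std::vector<Processor>& cpus, std::int64_t cycles) {
    for (Processor& cpu : cpus) {
        cpu.runTimeslice(cycles);
    }
}
```

Note that after a slice the processors are each in a consistent state, but not at exactly the same emulated time: one may have overshot by 2 cycles and another by 4. How much that matters is the accuracy question being argued in this thread.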
|
|
|
|
Numonohi_Boi
Guest
|
|
« Reply #26 on: March 30, 2007, 11:59:07 pm » |
|
this thread is what I was referring to earlier http://www.romhacking.net/forum/index.php/topic,3020.0.html
I know that ACCURATE Megadrive emulation is first priority, but this is certainly a nice feature. Just forget the post if it's simply not compatible with your plans. I just thought it should be brought to everyone's attention again.

That's interlace mode 2. The system doesn't actually render a single frame at double the height, it renders two separate frames, the first consisting of all the even lines, the second consisting of all the odd lines. This produces a noticeable flickering effect on the screen. You could combine the two frames into a single frame at double height, but this isn't what the system actually does. I could add a setting to switch to this mode of rendering for the curious. I have numerous alternate modes and outputs planned for the VDP, but things like this will come later.

That would be excellent. I know it is nowhere near a priority, but it's encouraging that you are considering it for the distant future. Thanks for the rundown of the science behind it too.
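
For the curious, the optional "combine the two frames" mode discussed above amounts to weaving the even and odd fields together. This is a rough sketch only: the pixel type, the `Line` and `Field` aliases and the `weaveFields` function are all hypothetical, and nothing here reflects how the emulator's VDP output is really structured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Line = std::vector<std::uint16_t>;  // one scanline of pixels (pixel format assumed)
using Field = std::vector<Line>;          // one rendered frame in interlace mode 2

// Weave two fields into one double-height frame: the even field supplies
// output lines 0, 2, 4, ... and the odd field supplies lines 1, 3, 5, ...
std::vector<Line> weaveFields(const Field& even, const Field& odd) {
    assert(even.size() == odd.size());
    std::vector<Line> frame;
    frame.reserve(even.size() * 2);
    for (std::size_t i = 0; i < even.size(); ++i) {
        frame.push_back(even[i]);
        frame.push_back(odd[i]);
    }
    return frame;
}
```

Because the two fields are rendered at different moments, a woven frame can show combing on anything that moved between them. That is a side effect of this display mode, not something the real hardware outputs.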
|
|
|
|
byuu
Guest
|
|
« Reply #27 on: April 02, 2007, 12:55:26 am » |
|
I want to design a framework that can scale to meet this requirement. I've sat through presentations from Intel outlining their roadmap for the next few years. They're talking about 16 core processors in the regular consumer market by 2008. In the short term at least, the progression for CPUs has shifted from increasingly faster single-core processors, to massively parallel processors with a moderate clock speed per core. In 5 years' time, we might all be running 32 core processors.

It will be a good 30-40 years until people are all running more than one core, sadly. With some people, we have to wait until the silicon dies completely before they upgrade. But yeah, their loss, not ours. Well, it's good that you're trying this, at least. I've considered it myself, but do not want to take the chance, as I've seen the tremendous overhead preemptive threading can consume. I don't know that this overhead will go down with the addition of more cores, but who knows. If all threads are in their own cores, I can't see why a full context switch would be necessary. It will really then be up to the OS to implement competent threading, with things like "force thread to one dedicated CPU" options and such. I guess we'll see with time.

I have the same problem, but to a lesser extent. There are two >10 MHz coprocessors for the SNES that I could never emulate at full speed on today's processors. So I'll be very interested to see how things go for you.

All that said, this is still very much a work in progress. My current approaches may not be viable. I do have a definite goal to make a multithreaded implementation work however. I would see falling back to a single-threaded method as a last resort.

Please keep me/us up to date on this.

This is the number one problem I'm still trying to solve. The Z80 for example can read from and write to banked 68000 memory at any time.
While very few roms are probably very timing critical when it comes to this kind of access, it's possible to write code that would be. When you start combining addons like the MegaCD and the 32x however, there are multiple primary processors with shared access to memory, and handshaking tests in the bios. A Mega Drive with a MegaCD and a 32x would have two 68000's, two SH2's, a Z80, and multiple video and sound chipsets, all with their own timing requirements. Hmm, I might be able to conjure up some ideas. The problem now is that we need to sync no matter what. So let's say you have CPU A and B. If A writes to something that B might read, we immediately need to sync. But what if A writes to something that B doesn't read? When an entire 64k+ RAM region is shared between A and B, it's unlikely they'll both touch the same value at the same time. The biggest problem right now is that I need to sync both on reads and writes between both processors, because I allow one to run ahead of the other otherwise. So, let's say A runs 300 clocks ahead. If it reads a value, B may have written to that value by now. And if A is ahead and writes a value, similarly B may try and read it. If we could eliminate one, then we could just keep a list of writes and previous values, and when we eventually switch to the other CPU, memory fetches when that processor is behind any others can be read out of that list, which will basically simulate "rolling time backwards". In other words, your list would look like: { offset, time - 16ticks, value }, { offset, time - 10ticks, value }, ... So if B accesses offset at time - 12ticks, we know to fetch the offset, time - 16ticks value. If B accesses an offset not in the list, then we just read it normally. The overhead of these list lookups could be very high, though. I can't see how we could run one processor into the future and handle reads from the other processor that is behind it, as the list won't be updated yet with that information. 
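
Here is a rough sketch of that write list, to show the "rolling time backwards" lookup in code. It is only an illustration under assumptions: `SharedRam`, `WriteRecord`, `readAt` and `discardUpTo` are made-up names, each record stores the value that was in memory before the write, writes are logged in increasing time order, and a write that lands on exactly the reader's tick counts as visible. It only covers the case where the processor that is ahead writes and the one that is behind reads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One logged write by the processor that is running ahead.
struct WriteRecord {
    std::uint32_t offset;
    std::int64_t time;      // emulated tick at which the write happened
    std::uint8_t oldValue;  // what memory held at this offset before the write
};

class SharedRam {
public:
    explicit SharedRam(std::size_t size) : mem_(size, 0) {}

    // Write from the processor that is ahead; remember the overwritten value.
    void write(std::uint32_t offset, std::uint8_t value, std::int64_t time) {
        assert(offset < mem_.size());
        log_.push_back({offset, time, mem_[offset]});
        mem_[offset] = value;
    }

    // Read from a processor whose clock is at `time`, possibly in the past.
    // The earliest logged write after `time` tells us what the byte held then.
    std::uint8_t readAt(std::uint32_t offset, std::int64_t time) const {
        assert(offset < mem_.size());
        for (const WriteRecord& w : log_) {
            if (w.offset == offset && w.time > time) {
                return w.oldValue;
            }
        }
        return mem_[offset];  // offset not in the list: read it normally
    }

    // Once every processor has passed `time`, older records are useless.
    void discardUpTo(std::int64_t time) {
        log_.erase(std::remove_if(log_.begin(), log_.end(),
                                  [time](const WriteRecord& w) { return w.time <= time; }),
                   log_.end());
    }

private:
    std::vector<std::uint8_t> mem_;
    std::vector<WriteRecord> log_;
};
```

With writes logged at time - 16 and time - 10, a read at time - 12 finds the time - 10 record and returns its old value, which is whatever the time - 16 write stored. The linear scan is the lookup overhead mentioned above; an index per offset would cut it down at the cost of more bookkeeping.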
In my current implementation, when a timeslice expires, each processor returns to a consistent state. A processor doesn't check how much time is remaining in its allocated timeslice partway through executing an opcode for example, it will fetch, decode, and execute the next instruction, deduct the execution time, then check if it has any remaining allocated time before it fetches the next opcode. When it has used all the time that has been allocated, the thread signals completion, and waits for another timeslice to be allocated. The system lets every processor reach this wait state before allocating the next timeslice. Once all the processors have reached this wait state, and before the next timeslice is allocated, the entire system is suspended and consistent. It is during this state that system requests will be taken, such as a pause command, or a request to take a savestate. Ah, so basically when an opcode starts with your emulator, it is forced to complete, even if it runs out of time. What happens if another processor changes a memory value in the middle of an opcode, or this processor changes a value that will be used mid-opcode by another processor? Your approach will not be able to break out in the middle of an opcode to allow that other thread/processor to run and catch up, so that you have the correct value for that fetch. I'll be very curious to see if you can get this working with full accuracy when needed. I can't see how it would be possible, myself. Very interesting discussion, it's been a while since I've talked with someone willing to think outside the box in terms of how to emulate a system. I'll try and come up with more ideas for that syncing problem we both share.
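
For contrast with the whole-opcode model, this is a toy version of the switch(state) finite state machine mentioned earlier in the thread. It is a sketch under assumptions, not code from either emulator: `CpuState`, `run`, the made-up "increment mem[0]" opcode and its cycle costs are all hypothetical. The point is that the opcode can stop between bus accesses, and that everything needed to resume is plain data.

```cpp
#include <cassert>
#include <cstdint>

// Everything needed to resume lives in plain data, so copying this struct
// is already a savestate, even in the middle of an opcode.
struct CpuState {
    int step = 0;                        // which part of the opcode runs next
    std::uint8_t latch = 0;              // value held between bus accesses
    std::uint8_t mem[4] = {0, 0, 0, 0};  // toy memory
    std::int64_t clock = 0;              // emulated cycles elapsed
};

// Hypothetical opcode "increment mem[0]", split at each bus access.
// Runs until the clock reaches `until`, then returns. Calling it again
// resumes at the step where it left off.
void run(CpuState& s, std::int64_t until) {
    while (s.clock < until) {
        switch (s.step) {
        case 0:  // read cycle
            s.latch = s.mem[0];
            s.clock += 2;
            s.step = 1;
            break;
        case 1:  // internal modify cycle
            s.latch = static_cast<std::uint8_t>(s.latch + 1);
            s.clock += 1;
            s.step = 2;
            break;
        case 2:  // write cycle
            s.mem[0] = s.latch;
            s.clock += 2;
            s.step = 0;
            break;
        }
    }
}
```

Calling `run(s, 2)` stops right after the read cycle, so another processor could be given time before the write happens, and a copy of `s` taken at that point resumes to the same result. The price is the one already noted: every opcode has to be written as explicit steps like this.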
|
|
|
|
Piotyr
Guest
|
|
« Reply #28 on: April 02, 2007, 02:21:10 am » |
|
Wow this is like listening to two people swap gardening or cooking tips... Well it would be if I could understand jack shat if anything you two were saying lol.
|
|
|
|
Talbain
Guest
|
|
« Reply #29 on: April 02, 2007, 05:01:03 am » |
|
Wow this is like listening to two people swap gardening or cooking tips... Well it would be if I could understand jack shat if anything you two were saying lol.
You don't understand yet Piotyr, but you will.
|
|
|
|
|