Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Python (http://www.velocityreviews.com/forums/f43-python.html)
-   -   Writing an emulator in python - implementation questions (forperformance) (http://www.velocityreviews.com/forums/t705059-writing-an-emulator-in-python-implementation-questions-forperformance.html)

Santiago Romero 11-12-2009 12:35 PM

Writing an emulator in python - implementation questions (forperformance)
 

Hi.

I'm trying to port (just for fun), my old Sinclair Spectrum emulator,
ASpectrum, from C to Python + pygame.

Although the Sinclair Spectrum has a simple Z80 8 bit 3.5Mhz
microprocessor, and no aditional hardware (excluding the +2/+3 model's
AYsound chip), I'm not sure if my loved scripted language, python,
will be fast enought to emulate the Sinclair Spectrum at 100% speed.
There are Java Spectrum emulators available, so it should be possible.

Anyway, this message is directed to prepare the DESIGN so that the
code can run as fast as possible. I mean, I want to know the best way
to do some emulation tasks before starting to write a single line of
code.

My questions:


GLOBAL VARIABLES VS OBJECTS:
==================================

I need the emulation to be as fastest as possible. In my C program I
have an struct called "Z80_Processor" that contains all the registers,
memory structures, and so on. I pass that struct to the Z80Decode(),
Z80Execute() or Z80Dissassemble() functions, i.e.. This allows me (in
my C emulator) to emulate multiple z80 processors If I want.

As Python is scripted and I imagine that emulation will be slower
than the emulator written in C, I've thought of not creating a
Z80_Processor *object* and declare global variables such as reg_A,
reg_B, reg_PC, array main_memory[] and so on, and let the z80
functions to directly access that global variables.

I'm doing this to avoid OOP's extra processing... but this makes the
program less modular. Do you think using processor "objects" would
make the execution slower, or I'm doing right using global variables
and avoiding objects in this type of program?

Should I start writing all the code with a Z80CPU object and if
performance is low, just remove the "object" layer and declare it as
globals, or I should go directly for globals?



HIGH AND LOW PART OF REGISTERS:
=================================

- In C, I have the following structs and code for emulating registers:

typedef union
{
struct
{
unsigned char l, h;
} B;

unsigned short W;
} eword;

eword reg_A;

This means that reg_A is a 16 bit "variable" that I can directly
access with reg_A.w=value, and I can access also the LOW BYTE and HIGH
BYTES with reg_A.B.h and reg_A.B.l. And more importante, changing W
modifies l and h, and changing l or h modifies W.

How can I implement this in Python, I mean, define a 16 byte variable
so that high and low bytes can be accessed separately and changing W,
H or L affects the entire variable? I would like to avoid doing BIT
masks to get or change HIGH or LOW parts of a variable and let the
compiled code to do it by itself.

I know I can write an "object" with set and get methods that
implement that (that could be done in C too), but for emulation, I
need the FASTEST implementation possible (something like the C's Union
trick).


MEMORY (ARRAYS):
===========================

To emulate Spectrum's memory in C, I have a 64KB array like: unsigned
char memory_array[65535]. Then I can access memory_array[reg_PC] to
fetch the next opcode (or data for an opcode) and just increment
reg_PC.

Is python's array module the best (and fastest) implementation to
"emulate" the memory?


MEMORY (PAGES):
=============================

The Sinclair Spectrum 8 bit computer can address 64KB of memory and
that memory is based on 16KB pages (so it can see 4 pages
simultaneously, where page 0 is always ROM). Programs can change
"pages" to point to aditional 16KB pages in 128KB memory models.

I don't know how to emulate paging in python...

My first approach would be to have eight 16KB arrays, and "memcpy()"
memory to the main 64KB array when the user calls page swapping. I
mean (C + pseudocode):


main_memory[65536];
memory_blocks[8][16384];

// Initial settings
current_pages[4] = [0, 1, 2, 3]

// User swaps last memory page (3) to block 7, so I:
page_to_swap_from = 3
page_to_map = 7

// Save the contents of current page (page being unmapped):
memcpy( main_memory, // Source
16384*page_to_swap_from, // Starting
at
memory_blocks[current_pages[page_to_swap_from], // To
16384 ); // 16K

// Now map page 7 to memory block 3:
memcpy( memory_blocks[page_to_map], // Source
0, // Starting
at
main_memory[page_to_swap_from*16384], // To
16384 ); // 16K
current_pages[page_to_swap_from] = page_to_map;



Memcpy is very fast in C, but I don't know if doing the same in
python with arrays would be fast enough, or if there is a better
approach to simulate paging of 16KB blocks in a 64KB memory windows (4
mappable memory blocks).

Maybe another approach based in pointers or something like that?


Carl Banks 11-12-2009 01:58 PM

Re: Writing an emulator in python - implementation questions (forperformance)
 
On Nov 12, 4:35*am, Santiago Romero <srom...@gmail.com> wrote:
> *Hi.
>
> *I'm trying to port (just for fun), my old Sinclair Spectrum emulator,
> ASpectrum, from C to Python + pygame.


The answer to your question is, "Use numpy". More details below.



[snip]
> *Should I start writing all the code with a Z80CPU object and if
> performance is low, just remove the "object" layer and declare it as
> globals,


Yes, but I'd suggest to index a numpy array rather than structures.
See below.



[snip]
> *How can I implement this in Python, I mean, define a 16 byte variable
> so that high and low bytes can be accessed separately and changing W,
> H or L affects the entire variable? I would like to avoid doing BIT
> masks to get or change HIGH or LOW parts of a variable and let the
> compiled code to do it by itself.


You can do clever memory slicing like this with numpy. For instance:

breg = numpy.zeros((16,),numpy.uint8)
wreg = numpy.ndarray((8,),numpy.uint16,breg)

This causes breg and wreg to share the same 16 bytes of memory. You
can define constants to access specific registers:

R1L = 1
R1H = 2
R1 = 1

breg[R1H] = 2
print wreg[R1]


[snip]
> *Is python's array module the best (and fastest) implementation to
> "emulate" the memory?


I'd use numpy for this as well. (I doubt the Z80 had a 16-bit bus,
but if it did you could use the same memory sharing trick I showed you
with the registers to simulate word reads and writes.)

Note that, when performing operations on single values, neither numpy
nor array module are necessarily a lot faster than Python lists, might
even be slower. But they use a LOT less memory, which is important
for largish arrays.


[snip]
> *The Sinclair Spectrum 8 bit computer can address 64KB of memory and
> that memory is based on 16KB pages (so it can see 4 pages
> simultaneously, where page 0 is always ROM). Programs can change
> "pages" to point to aditional 16KB pages in 128KB memory models.
>
> *I don't know how to emulate paging in python...


numpy again. This would mean you'd have to fiddle with addresses a
bit, but that shouldn't be too big a deal. Create the physical
memory:

mem = numpy.zeros((128*1024,),numpy.uint8)

Then create the pages. (This is a regular Python list containing
numpy slices. numpy slices share memory so there is no copying of
underlying data.)

page = [mem[0:16*1024],
mem[16*1024:32*1024],
mem[32*1024:48*1024],
mem[48*1024:64*1024]]

To access the byte at address 42432, you'd have use bit operations to
get a page number and index (2 and 9664 in this case), then you can
access the memory like this:

page[2][9664] = 33
p = page[3][99]

To swap a page, reassign the slice of main memory,

page[2] = mem[96*1024:112*1024]

Now, accessing address 42432 will access a byte from a different page.

If you don't want to fiddle with indirect pages and would just rather
copy memory around when a page swap occurs, you can do that, too.
(Assigning to a slice copies the data rather than shares.) I don't
know if it's as fast as memset but it should be pretty quick.

..

Hope these brief suggestions help. If you don't want third party
libraries, then numpy will be of no use. But I guess if you're using
pygame third party modules are ok. So get numpy, it'll make things a
lot easier. It can be a bit daunting to learn, though.


Carl Banks

Santiago Romero 11-12-2009 02:37 PM

Re: Writing an emulator in python - implementation questions (forperformance)
 
> > *I'm trying to port (just for fun), my old Sinclair Spectrum emulator,
> > ASpectrum, from C to Python + pygame.

>
> The answer to your question is, "Use numpy". *More details below.


Let's see :-)

> > *How can I implement this in Python, I mean, define a 16 byte variable
> > so that high and low bytes can be accessed separately and changing W,
> > H or L affects the entire variable? I would like to avoid doing BIT
> > masks to get or change HIGH or LOW parts of a variable and let the
> > compiled code to do it by itself.

>
> You can do clever memory slicing like this with numpy. *For instance:
>
> breg = numpy.zeros((16,),numpy.uint8)
> wreg = numpy.ndarray((8,),numpy.uint16,breg)
>
> This causes breg and wreg to share the same 16 bytes of memory. *You
> can define constants to access specific registers:
>
> R1L = 1
> R1H = 2
> R1 = 1
>
> breg[R1H] = 2
> print wreg[R1]


And how about speed?

Assuming a 16 bit register named BC which contains 2 8 bit regiters (B
and C)...

Will the above be faster than shifts and bit operations (<<, and,
>> ) with new B and C values to "recalculate" BC when reading or

changing either B, C or BC?

> > *Is python's array module the best (and fastest) implementation to
> > "emulate" the memory?

>
> I'd use numpy for this as well. *(I doubt the Z80 had a 16-bit bus,
> but if it did you could use the same memory sharing trick I showed you
> with the registers to simulate word reads and writes.)


Z80 has a 16 bit ADDRESS bus, 8 bit DATA bus. This means you can
address from 0 to 65535 memory cells of 8 bytes. Z80 has 16 bit bus
operations, but currently in C I write 16 bit values as two 8 bit
consecutive values without using (unsigned short *) pointers. But it
seems that numpy would allow me to do it better than C in this case...

> Note that, when performing operations on single values, neither numpy
> nor array module are necessarily a lot faster than Python lists, might
> even be slower. *But they use a LOT less memory, which is important
> for largish arrays.


Well, we're talking about a 128KB 1-byte array, that's the maximum
memory size a Sinclair Spectrum can have, and always by using page
swapping of 16KB blocks in a 64KB addressable space...

If you really think that python lists can be faster that numpy or
array modules, let me know.

Maybe I'll make a "benchmark test", by making some millions of read /
write operations and timing the results.

I wan't to write the emulator as "pythonic" as possible...

> > *I don't know how to emulate paging in python...

>
> numpy again. *This would mean you'd have to fiddle with addresses a
> bit, but that shouldn't be too big a deal. *Create the physical
> memory:
>
> mem = numpy.zeros((128*1024,),numpy.uint8)


A 128K array of zeroes...

> Then create the pages. *(This is a regular Python list containing
> numpy slices. numpy slices share memory so there is no copying of
> underlying data.)
>
> page = [mem[0:16*1024],
> * * * * mem[16*1024:32*1024],
> * * * * mem[32*1024:48*1024],
> * * * * mem[48*1024:64*1024]]


Those are just like pointers to the "mem" numpy array, pointing to
concrete start indexes, aren't they?

> To access the byte at address 42432, you'd have use bit operations to
> get a page number and index (2 and 9664 in this case), then you can
> access the memory like this:


Do you mean:

page = address / 16384
index = address MOD 16384

?

Or, better, with:

page = address >> 14
index = address & 16383

?


> page[2][9664] = 33
> p = page[3][99]
>
> To swap a page, reassign the slice of main memory,
>
> page[2] = mem[96*1024:112*1024]
>
> Now, accessing address 42432 will access a byte from a different page.


But the above calculations (page, index) wouldn't be slower than a
simple play 64KB numpy array (and make no paging mechanism when
reading or writing) and copy memory slices when page swappings are
requested?

> If you don't want to fiddle with indirect pages and would just rather
> copy memory around when a page swap occurs, you can do that, too.
> (Assigning to a slice copies the data rather than shares.) I don't
> know if it's as fast as memset but it should be pretty quick.


That's what I was talking about.
With "page and index" I'm slowing down EACH memory read and write,
and that includes opcode and data fetching...

With "memory copying in page swaps", memory is always read and write
quickly, and if "slice copies" are fast enough, the emulation will be
>100% of speed (I hope) for a 3.5Mhz system ...


> Hope these brief suggestions help. *If you don't want third party
> libraries, then numpy will be of no use. *But I guess if you're using
> pygame third party modules are ok. *So get numpy, it'll make things a
> lot easier. *It can be a bit daunting to learn, though.


Yes, you've been very helpful!

How about numpy portability? Is array more portable?

And finally, do you think I'm doing right by using global variables
for registers, memory, and so, or should I put ALL into a single
object and pass it to functions?

Ummm ... I think I'm going to do some tests with some "for i in range
(1,100000000000)" + time :-)

Thanks a lot for your help :-)

Any other suggestion is welcome.

Dave Angel 11-12-2009 04:40 PM

Re: Writing an emulator in python - implementation questions (forperformance)
 


Santiago Romero wrote:
>>> I'm trying to port (just for fun), my old Sinclair Spectrum emulator,
>>> A

>> <snip>

> Do you mean:
>
> page =ddress / 16384
> index =ddress MOD 16384
>
> ?
>
> Or, better, with:
>
> page =ddress >> 14
> index =ddress & 16383
>
> ?
> <snip>
>

How about
page, index = divmod(address, 16384)




Santiago Romero 11-12-2009 04:44 PM

Re: Writing an emulator in python - implementation questions (forperformance)
 

> How about
> * * page, index = divmod(address, 16384)


Surely, much better and faster :-)

Thanks a lot.


Santiago Romero 11-12-2009 05:41 PM

Re: Writing an emulator in python - implementation questions (forperformance)
 
> You can do clever memory slicing like this with numpy. *For instance:
>
> breg = numpy.zeros((16,),numpy.uint8)
> wreg = numpy.ndarray((8,),numpy.uint16,breg)
>
> This causes breg and wreg to share the same 16 bytes of memory. *You
> can define constants to access specific registers:


What I'm doing wrong?

[sromero@compiler:~]$ cat test.py
#!/usr/bin/python

import numpy

# Register array
breg = numpy.zeros((16,),numpy.uint8)
wreg = numpy.ndarray((8,), numpy.uint16, breg )

reg_A = 1
reg_F = 2
reg_AF = 1
reg_B = 3
reg_C = 4
reg_BC = 3

breg[reg_B] = 5
breg[reg_C] = 10
print breg[reg_B]
print breg[reg_C]
print wreg[reg_BC]


[sromero@compiler:~]$ python test.py
5
10
0

?

Santiago Romero 11-12-2009 05:51 PM

Re: Writing an emulator in python - implementation questions (forperformance)
 

Oops, numpy arrays start with index=0 :-)



greg 11-13-2009 02:29 AM

Re: Writing an emulator in python - implementation questions (forperformance)
 
Carl Banks wrote:
> You
> can define constants to access specific registers:
>
> R1L = 1
> R1H = 2
> R1 = 1
>
> breg[R1H] = 2
> print wreg[R1]


But keep in mind that named "constants" at the module level
are really global variables, and therefore incur a dictionary
lookup every time they're used.

For maximum speed, nothing beats writing the numeric literals
directly into the code, unfortunately.

Generally, I think you're going to have quite a battle on
your hands to get a pure Python implementation to run as
fast as a real Z80, if it's even possible at all. And if
you do succeed, the code will be pretty awful (due to things
such as not being able to use named constants).

I would be taking a different approach -- develop a prototype
in Python, concentrating on clarity rather than speed, and
later reimplement the core of the emulator as an extension
module, using Pyrex or Cython or otherwise.

--
Greg

Steven D'Aprano 11-13-2009 03:00 AM

Re: Writing an emulator in python - implementation questions (forperformance)
 
On Fri, 13 Nov 2009 15:29:03 +1300, greg wrote:

> Generally, I think you're going to have quite a battle on your hands to
> get a pure Python implementation to run as fast as a real Z80, if it's
> even possible at all. And if you do succeed, the code will be pretty
> awful (due to things such as not being able to use named constants).


I don't know about that...

Python on a 2GHz processor (which is more or less entry-level for desktop
PCs these days), emulating something which used to run at 2.5MHz? Even if
the Python code does 1000 times more work, the modern processor is nearly
1000 times faster, so your Python code won't be much slower than the
first generation Z80.

The latest Z80 processors operate at 50MHz. That still gives you a factor
of 40 to work with. Write your Python carefully, optimizing only the bits
that *need* optimizing, and perhaps using a few C extensions or Psyco,
and I think you have a good chance of being close enough to the speed of
a real Z80.



--
Steven

Steven D'Aprano 11-13-2009 03:12 AM

Re: Writing an emulator in python - implementation questions (forperformance)
 
On Fri, 13 Nov 2009 15:33:53 +1300, greg wrote:

> Santiago Romero wrote:
>>>How about
>>> page, index = divmod(address, 16384)

>>
>> Surely, much better and faster :-)

>
> Not necessarily, because it involves a function call, and constructing
> and deconstructing a result tuple. If you time them, you may well find
> that the explicit shift and mask operations turn out to be faster.


It's easy enough to test:

>>> from timeit import Timer
>>> t1 = Timer('a = n>>14; b = n & 16384', 'n=2137902')
>>> t2 = Timer('a,b = divmod(n, 16384)', 'n=2137902')
>>> min(t1.repeat(repeat=5))

0.32850909233093262
>>> min(t2.repeat(repeat=5))

0.54839301109313965

The shift and mask are a little faster on my machine, but that's
certainly what I would call a micro-optimization. Unless the divmod call
is the bottleneck in your code -- and it almost certainly won't be -- I
don't think it's worth the obfuscation to use shift/mask. What should be
done is write the code in the clearest way you can, then, only if it is
it too slow, profile it to see where it actually needs optimizing.


--
Steven


All times are GMT. The time now is 12:31 PM.

Powered by vBulletin®. Copyright ©2000 - 2013, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.