Topic: Slow? (Read 6167 times)
RogerLevy
Slow?
« on: January 09, 2005, 04:33:27 PM »
so i've been perfecting for quite a while this piece of base code i use for managing moving graphical objects (specifically for games), and i recently "ported" that code to RetroForth, with what I thought were some great new optimizations. but i was very disappointed and surprised that its performance was drastically slower than in Win32Forth (which is what i had been working with before moving to Retro). How can that be? W32F uses threaded code, even for ALL of its primitives (even DUP!). And I've turned all my code primitives into macros plus compiled versions, like ColorForth.
Performance comparison: Win32Forth can process 20000+ "null" objects (that do nothing and draw nothing on the screen) without slowdown. RetroForth can only process 2000-3000 before i already have slowdown.
I can only think of three possible reasons for it. 1) most important to consider is, it's my fault and has to do with the code that is responsible for updating the screen, but it's unlikely because it hasn't changed since W32F. 2) Macros are not searched first (it's not documented in TheRetroBook whether or not they are) so i'm not getting the speed boost of inline code, and getting slower performance because Retro doesn't use the internal "hacks" that W32F does to manipulate the stack in words like ROT and ! 3) Lack of alignment of variables and other things may be making it slow on my particular laptop's architecture. i actually tried aligning my data but couldn't notice much of a difference, so i don't know if this is relevant. Plus, the lack of an address interpreter (2 extra instructions PER WORD) should have made up for a lot.
here's the relevant code. "objects" is a big table of forward-linked objects (that part is shown in "nxt"). also "loop" = "next". hopefully the rest is self-explanatory (please read it, it's not much!)
5000 constant total
variable used
variable me
macro
: dofield $008b 2, $0503 2, me , ;
forth
: ofs create , does> dofield ( @ me @ + ) ;
: field used @ ofs cell used +! ;
field link field en field x field y field vx field vy field hide field disp field beha
: nxt me @ @ me ! ;
: display en @ hide @ xor if x 2@ 2i at disp @ exec then nxt ;
: behave beha @ dup if exec ;; then drop ;
: move en @ if vx 2@ y +! x +! behave then nxt ;
: start objects me ! ;
: all start total for move loop start total for display loop ;
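For anyone without the modified RetroForth at hand, here is a rough sketch of the same field mechanism in standard Forth. It assumes an ANS-style system such as gforth, uses the high-level form given in the comment inside "ofs" (@ me @ +) instead of the inlined machine code, and "obj" is a hypothetical single record made up for the demonstration.

```forth
variable me                   \ address of the current object
variable used   0 used !      \ offset of the next field to be defined

\ Define a named field at the current offset. At run time the field
\ word returns (current object address + offset).
: field ( "name" -- )
  create used @ ,  1 cells used +!
  does> ( -- addr ) @ me @ + ;

field link  field en  field x  field y

create obj  4 cells allot     \ hypothetical record, for illustration only
obj me !
10 x !  20 y !
x @ y @ + .                   \ prints 30
```

Every field access goes through "me", so the cost of "@ me @ +" is paid on each x, y, en and so on inside the loops, which is presumably why the original inlines it as a macro.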
Helmar
Re:Slow?
« Reply #1 on: January 09, 2005, 06:19:34 PM »
Only for interest: did you try Reva? This has a complete asm core.
STC does not need to be faster than the traditional threaded code. Threaded code is always aligned and has everything in data-space: it does not mix data and code.
Bis dann, Helmar
Charles Childers
Re:Slow?
« Reply #2 on: January 09, 2005, 06:36:42 PM »
That code won't run on an unmodified version of RetroForth 7.6:
I need your implementations of: cell +! if 2@ 2i at exec
to test it. +! was in 7.5, so I have that, but exactly how are you doing each of the others?
Charles Childers
Re:Slow?
« Reply #3 on: January 09, 2005, 06:48:07 PM »
I never went out of my way to make RetroForth fast (Reva, the successor to TetraForth, is much faster at running code). I should mention that the experimental 8.0 codebase is slightly faster than the current betas of Reva, but I have no doubt that Reva will ultimately pull ahead of it.
Macros are searched first, so that's one step in the right direction. My words like ROT and ! don't do any tricks: they were written to be easily understood, not fast.
A look at ROT reveals this a bit:
: rot >r swap r> swap ;
: -rot rot rot ;
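To make the cost visible, here is the same definition in standard Forth with the stack traced step by step. The names rot3 and -rot3 are hypothetical, chosen only so the built-in ROT is not redefined; gforth or another ANS-style system is assumed.

```forth
\ ROT built from return-stack shuffling, as in the definition above.
: rot3 ( a b c -- b c a )
  >r        \ a b      (return stack: c)
  swap      \ b a
  r>        \ b a c
  swap ;    \ b c a

: -rot3 ( a b c -- c a b )  rot3 rot3 ;

1 2 3 rot3    \ stack is now 2 3 1
drop drop drop
1 2 3 -rot3   \ stack is now 3 1 2
drop drop drop
```

That is four words per ROT and eight per -ROT. Unless each of them is inlined as a macro, every one is presumably a separate call in a subroutine-threaded system, which is the kind of overhead a hand-coded primitive avoids.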
Try modifying source/blocks/000 a bit. In 8.0, I'm doing:
macro
: swap $0687 2, ;
: drop $ad 1, ;
: nip $04c683 3, ;
forth
: swap swap ;
: drop drop ;
: nip nip ;
: dup [ $fc4689 3, $fc768d 3, ] ;
: and [ $0623 2, ] nip ;
: or [ $060b 2, ] nip ;
: xor [ $0633 2, ] nip ;
: not -1 xor ;
: @ [ $008b 2, ] ;
: ! [ $89adc289 , $ad02 2, ] ;
: c@ @ $ff and ;
: c! [ $adc289 3, $ad0288 3, ] ;
: + [ $0603 2, ] nip ;
: * [ $26f7 2, ] nip ;
: - [ $ad0629 3, ] ;
: / [ $c389 2, $99ad 2, $fbf7 2, ] ;
: mod / [ $d089 2, ] ;
: negate -1 * ;
swap, drop, and nip are made into macros, which yields a pretty big performance boost in my tests. Inlining others could make things faster too. I can post a version of this block with almost all of the words inlined if you'd like. The compiler itself is somewhat slow due to some checks I added; in 7.7 it will be faster as a result of several cleanups I've made while working on 8.0.
RogerLevy
Re:Slow?
« Reply #4 on: January 09, 2005, 07:42:43 PM »
i have modified retroforth to the point of incompatibility. all the words are implemented for speed, and many as macros that weren't before. at least i *thought* they would be faster. how can it be that Win32Forth's threaded code is faster than plain machine code? caching? alignment? separating of data and code?
forth
: - -1 xor 1+ ; ( negate )
macro
: for m: push here ;
: loop $240cff 3, $241c8b 3, $db85 2, $75 1, here - + -1 + 1, $04c483 3, ;
: +! $1e8b 2, $1801 2, m: drop m: drop ;
: if ( f ) $c085 2, m: drop $74 1, here 0 1, ;
forth
: exec push ;
you can just totally cut out the part that says "x 2@ 2i at" - that's not the slow part, because it is not even being executed when the object table is blank, which is the state for testing. the slowest part is the pair of loops in "all".
on a side note, what i would like is for my programs to somewhat rival C in performance, which I believe Forth should and can do and still remain simple to use.
here's the part of the code i left out. it allocates and links the table.
: cell- -4 + ;
10 10 + cells constant maxsize
5000 constant total
create objects total maxsize * cell- allot
objects total maxsize * 0 cfill
-1 ,
: link objects total for dup maxsize + over ! maxsize + loop drop ;
link
you can't really properly test the speed of the code without the entire modified retroforth system i've made. if it becomes necessary i wouldn't mind sharing it with you.
« Last Edit: January 09, 2005, 08:00:37 PM by RogerLevy »
RogerLevy
Re:Slow?
« Reply #6 on: January 09, 2005, 11:15:54 PM »
as i said, i tried aligning. i made my own word too, because i couldn't understand how the original version worked:
: align 0 here 4 mod allot ;
what's the 0 for? won't this leave a 0 on the stack?
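A side note on the word itself: if "mod" returns the remainder, "here 4 mod allot" advances the dictionary by the remainder, not by the distance to the next boundary, so it only lands on a multiple of 4 when the remainder is 0 or 2. A minimal sketch of one way to pad correctly, in standard Forth (gforth assumed, two's-complement arithmetic assumed, and pad4/align4 are hypothetical names):

```forth
\ Bytes needed to bring addr up to the next multiple of 4.
: pad4 ( addr -- n )  negate 3 and ;

\ Pad the dictionary pointer to a 4-byte boundary; leaves nothing on the stack.
: align4 ( -- )  here pad4 allot ;

5 pad4 .     \ prints 3  (5 + 3 = 8)
8 pad4 .     \ prints 0  (already on a boundary)
5 4 mod .    \ prints 1  (allotting 1 byte from 5 would stop at 6)
```

Whether this explains any of the slowdown is a separate question; it only shows that the alignment test may not have been aligning everything it was meant to.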
Charles Childers
Re:Slow?
« Reply #7 on: January 10, 2005, 01:47:43 AM »
So it does; I suppose that's a leftover from an earlier implementation of align; thanks for pointing it out.
Mark Gaspar
Guest
Re:Slow?
« Reply #8 on: January 10, 2005, 02:52:37 AM »
On the alignment performance front, if you are using a Pentium III/IV class processor, I believe that 16-byte increments may be the magic value due to the caching architecture (line fills and so forth).
Also, with Pentium III/IV class processors, there is the performance counter available which ticks at the CPU cycle rate. So... I've always been tempted to define a new Forth word that reads that performance counter to facilitate simple yet useful cycle counts for elements of code that one wants to analyze and optimize.
I realize that not everyone has Pentium III/IV+ processors and RF and other projects may want to keep the standard code base running on processors down to the 386 or even lower (I don't actually know the minimum x86 architecture you may want to support).
If not, I guess I could invest the required large amount of time to actually try to do what I've been blathering about and see if my notions are useful.
I'm butting in on your conversation, so I'm sorry if it's annoying... just trying to be helpful in my own way!
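A timing word along those lines could look roughly like this. Note the substitution: it is a sketch in standard Forth that uses gforth's utime (microseconds as a double-cell number) in place of the CPU cycle counter, because reading the counter itself needs a system-specific code word or inline machine code. usec-of and spin are hypothetical names.

```forth
\ Assumption: gforth, where utime ( -- ud ) returns microseconds.
\ Run xt (which must leave the stack as it found it) and return
\ the elapsed time in microseconds as a double.
: usec-of ( xt -- ud )
  utime 2>r        \ save the start time on the return stack
  execute          \ run the code being measured
  utime 2r> d- ;   \ end time minus start time

: spin ( -- )  1000000 0 do loop ;   \ hypothetical workload

' spin usec-of d.    \ prints elapsed microseconds; value depends on the machine
```

Microsecond resolution is coarse next to a cycle counter, so the code under test has to run long enough (a full pass over the object table, say) for the number to mean anything.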
Mark
Re:Slow?
« Reply #9 on: January 10, 2005, 03:14:15 AM »
Oops... forgot to login before posting.... I'm so fired!
RogerLevy
Re:Slow?
« Reply #11 on: January 10, 2005, 03:21:20 AM »
Mark, I'm afraid we're going to have to let you go.
Mark
Re:Slow?
« Reply #12 on: January 10, 2005, 03:52:28 AM »
Darn! Just when I was close to vesting. Can I tell my family I was just laid off? Or was my position retired? Hum... getting fired from being a volunteer can't be good on the old resume. Doh!
Seriously, when I get time, I'll let you know if Pentium optimizations are significant in any way and whether they may justify an edition that I would be willing to help provide and support if anyone is actually interested.
RogerLevy
Re:Slow?
« Reply #13 on: January 10, 2005, 10:53:32 PM »
alright, i guess you can stay. please do and let me know. to me any low-level optimization is a good optimization.
Charles Childers
Re:Slow?
« Reply #14 on: January 11, 2005, 12:49:19 AM »
I'm wondering if the fact that RetroForth's compiler writes to executable space is slowing it down. It's possible that doing this invalidates the cache, which could require reloading from RAM. (This is pure speculation on my part...)