TM Research Journal

Wednesday, April 12, 2006

Moved

Hi to all my loyal readers,

I've moved to WordPress (cause I was jealous of Nick).

Please update your bookmarks to
http://hstm.wordpress.com/

Cheers,
Fuad

Friday, April 07, 2006

Magic Instructions in Sun

After getting a grasp of how inlining works, I had to figure out how to write the so-called "magic" instructions (as specified here) under Solaris. What this code does is left-shift the service parameter by 16 and load the result into the upper 22 bits of register g0. Register g0 is normally a zero register, but Gems also uses it as a signalling register when it is written to.

Since Sun's inline templates pass arguments in registers, what I did was left-shift the value of register o0 by 26 (16 for the initial shift plus 10 to compensate for sethi) and store the result in g0:

sll %o0, 26, %g0

The problem is that in the GCC code this assembly portion is marked volatile, which means the compiler should neither optimise it out (even if it deems it unnecessary) nor move it around.
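For reference, the whole inline template ends up being just that one instruction wrapped in Sun's .inline/.end directives, something along these lines (the template name and argument size here are my guesses; only the sll line comes from the actual code):

```
! sketch of the inline template -- one int argument (4 bytes)
        .inline magic_call, 4
        sll     %o0, 26, %g0    ! shift the service id into place; the "write" to g0 is the signal
        .end
```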

When I compiled the code, I tried it on a test file along with the -g flag:

cc -g -O new_ruby_magic_call.il testil.c

That was supposed to show me whether the code was being late-inlined or not (which, as I understand it, indicates whether it is being optimised), but apparently it is not.

I have posted a question on the Solaris forums (I had to create a new account since for some reason I could not log in with my old one). Hopefully someone will answer.

Unanswered questions which I have asked Kevin (from U Wisconsin) are why register l0 is listed as clobbered even though it is not, and why a parameter r is being passed as 0.

I think these two things might be just code that was kept there from a previous version or something like that.

Research Proposal

I submitted my research proposal today. To be honest, it is not particularly well written, but I was told not to spend too much time on it. So here it is:-

Hardware Supported Software Transactional Memory

Moore’s law is dead – or at least that is what some experts (including Gordon Moore himself) have been saying for a while. While it is true that many of the claims of the demise of Moore’s law are exaggerated, it does seem that the trend as drawn by Moore’s law is slowing down, and that sooner or later it will not hold any more. [1]

Due to this and other economic considerations, the idea of multiprocessor and multi-threading machines is becoming more attractive to processor manufacturers. That way, instead of trying to speed up a single processor, all that has to be done is to add more processors to a machine in the hope of making it work faster.

The main problem with this scheme is how to get these processors to actually work faster when there are shared resources involved. Like coordinating humans working on similar tasks that involve shared resources, it is difficult to coordinate the work of more than one processor to effectively share resources while avoiding conflicts.

The way conflicts are usually avoided is by designating the portions of the code where conflicts may arise as critical sections. These critical sections are protected by locks, which prevent more than one running process from entering the section at a time, thus eliminating any conflicts that might arise. [2]

While critical sections do eliminate such conflicts, they create problems of their own. These include priority inversion, where a lower-priority thread gains precedence over a higher-priority one simply because it acquired the lock first, forcing the higher-priority thread to wait until it is done. Moreover, with locks, problems such as deadlock are always an issue. To complicate things further, if a thread holding a lock is terminated for some reason, the lock remains held indefinitely, keeping all other threads from ever entering that critical section, or any other critical section guarded by the same lock. [3]

Another consideration when using locks is the trade-off between performance and ease of programming. Writing a multi-threaded program is a difficult task as it is, and having to use locks makes it even more complicated. It would be easier if programmers could just use coarse-grained locks and not worry about potential conflicts within their code, but this results in more serialized code, which runs slowly. Finer-grained locks, on the other hand, make the program more difficult to reason about. [2]

The database community has solved this problem long ago by introducing the concept of a transaction. A transaction is essentially a series of operations that appear to execute instantaneously, as far as the outside world is concerned. If a transaction fails, it appears to the outside world as if it had never executed to begin with – in other words the state of the world (or the resources said transaction touched) is never in an inconsistent state. [4]

The concept of transactional memory was first introduced to the computer architecture community by Herlihy and Moss. Their implementation was done entirely in hardware, and suffered from the need to add new instructions and a special cache or buffer to keep track of all the changes [3]. Rajwar and Goodman further refined the concept of hardware transactions in their work on Speculative Lock Elision (SLE) and Transactional Lock Removal (TLR). Their solution required no new instructions to be added to the architecture, and utilized hardware properties already implemented in most modern high-performance processors to do most of the work [2]. However, a drawback of most hardware transactional memory implementations is that transactions are limited by how much physical memory or buffering space is allocated for their use. Moreover, handling external events such as operating system context switches can prove problematic [5].

On the other hand, transactional memory has also been implemented in software. Such an implementation was first introduced by Shavit and Touitou [6] and then expanded upon by many others, such as Herlihy et al. in their work on Dynamic Software Transactional Memory [7]. Implementing transactions in software removes the spatial and temporal limitations imposed by hardware, but software implementations are inherently much slower than hardware ones.

It has been observed, however, that most critical sections are short and require few resources to run, yet the few that exceed these resources can impose a significant performance penalty or, worse, never succeed at all [8]. Given these observations, what we hope to achieve is a solution that makes the common case fast by running it in hardware; when a transaction fails to run in hardware, it should fall back gracefully and run as a software transaction.

Another area that we might investigate is optimizing the performance of software transactional memory by having it take advantage of certain hardware facilities such as SLE if available, yet run correctly (albeit slower) if not. This would allow transactions to be implemented on all architectures, with the ones supporting such facilities running faster.

“He who every morning plans the transaction of the day and follows out that plan, carries a thread that will guide him through the maze of the most busy life. But where no plan is laid, where the disposal of time is surrendered merely to the chance of incidence, chaos will soon reign.”

Victor Hugo
French dramatist, novelist, & poet (1802 - 1885)

References

[1] M. Dubash, "Moore's Law is dead, says Gordon Moore," in Techworld, 2005.
http://www.techworld.com/opsys/news/index.cfm?NewsID=3477

[2] R. Rajwar and J. R. Goodman, "Transactional lock-free execution of lock-based programs," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems. San Jose, California: ACM Press, 2002.

[3] M. Herlihy and J. E. B. Moss, "Transactional memory: architectural support for lock-free data structures," in Proceedings of the 20th Annual International Symposium on Computer Architecture. San Diego, California: ACM Press, 1993.

[4] T. Kyte, Expert One-on-One Oracle, 1st ed. Apress, 2003.

[5] R. Rajwar, M. Herlihy, and K. Lai, "Virtualizing Transactional Memory," in Proceedings of the 32nd Annual International Symposium on Computer Architecture: IEEE Computer Society, 2005.

[6] N. Shavit and D. Touitou, "Software transactional memory," in Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing. Ottawa, Ontario, Canada: ACM Press, 1995.

[7] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III, "Software transactional memory for dynamic-sized data structures," in Proceedings of the Twenty-Second Annual Symposium on Principles of Distributed Computing. Boston, Massachusetts: ACM Press, 2003.

[8] C. S. Ananian, K. Asanović, B. C. Kuszmaul, C. E. Leiserson, and S. Lie, "Unbounded Transactional Memory," presented at the 11th International Symposium on High-Performance Computer Architecture (HPCA'05), San Francisco, California, 2005.

Inline Assembly

I have to figure out how inline assembly works in Sun's C compiler as opposed to the GNU C compiler. There is some assembly code which has to be ported (basically the magic instructions for things like compare-and-swap, activating SLE, etc.).

The reason I'd like to get it working with Sun's CC is that the environment we're using is a Sun SPARC machine - so it seems only reasonable to use Sun's own C compiler, since it should be optimised for their architecture.

Moreover, I got the Boehm-Weiser garbage collector working with Sun's C compiler, and it seems it would be a hassle to build the collector using GCC under Solaris.

Sun's Inline Assembly Page
http://developers.sun.com/prodtech/cc/articles/inlining.html

GCC's Inline Assembly Tutorial
http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html

Thursday, April 06, 2006

Victor Hugo on Transactions

He who every morning plans the transaction of the day and follows out that plan, carries a thread that will guide him through the maze of the most busy life. But where no plan is laid, where the disposal of time is surrendered merely to the chance of incidence, chaos will soon reign.

Victor Hugo
French dramatist, novelist, & poet (1802 - 1885)

http://www.quotationspage.com/quote/2785.html

Wednesday, April 05, 2006

DSTM C vs Locks C

I just created a test which increments and decrements a shared counter 2 million times each using a single global lock. As expected, given the simplicity of this example, it performed much better: only 0:01 seconds and an RSS of 1240.

DSTM should, however, show a significant performance advantage in more complicated scenarios.

Tuesday, April 04, 2006

DSTM: Java Vs. C

Ok. Today I managed to implement the same test in Java as the one I had in C (increment and decrement a shared counter equally).

First of all, I've never written a Java program before. It was easy enough for me to follow Java programs that have already been written, but writing my own proved to be more difficult. To be honest, I do not like the Java language. But eventually I got it done.

Anyway, as for the results. For 2 million increments and the same number of decrements, here's how three different setups performed (all running the Polite Contention Manager):-

Sun CC (without -fast)
Time 0:16
RSS 1864

Sun CC (with -fast compiler option)
Time 0:12
RSS 1856

Java (default compilation)
Time 0:14
RSS 17960

What this tells us is that, apparently, the Java version performs better than the default C compilation - most probably because Java has better garbage collection than C (the C garbage collector has to be conservative). It could also be due to my own code, but I followed the Java program as closely as possible when porting, so I don't think that's the issue.

I don't know if it's fair to compare the optimized C port with the Java one, since I am not familiar with any optimizations for the Java compiler. I need to ask about this.

Finally, the Java version obviously consumes a lot more memory due to the loading of the JVM.

In conclusion, it seems that the C port is comparable to the Java port in terms of performance, but more tests need to be done.

Monday, April 03, 2006

DSTM Polite and Java

Today I finished the Polite contention manager component for the C port. I tested it, then compared it with the aggressive one on the simple counter increment/decrement test - I got the same runtime for both (8 seconds). Maybe aggressive vs. polite matters with more complex tests.

I also created a makefile to build the DSTM C port, and I built and tested the Java port as well. I should mention that the Java runone.pl benchmark script performed the same with both the Polite and Aggressive contention managers.

To have a viable comparison, I need to implement similar benchmarks for the Java and C versions and compare their performance.

Todo list:-
- Implement common Java and C benchmarks and test them
- Read the garbage collection paper I mentioned in my previous post
- Reread the RSTM paper
- Write research proposal

Useful Command

To find the system specs of a Solaris machine
/usr/platform/`uname -i`/sbin/prtdiag