TM Research Journal

Wednesday, April 12, 2006

Moved

Hi to all my loyal readers,

I've moved to WordPress (cause I was jealous of Nick).

Please update your bookmarks to
http://hstm.wordpress.com/

Cheers,
Fuad

Friday, April 07, 2006

Magic Instructions in Sun

After getting a grasp of how inlining works, I had to figure out how to write the so-called "magic" instructions (as specified here) in Solaris. What this code does is left shift the service parameter by 16 and then load the result into the upper 22 bits of register g0. Register g0 is normally a zero register, but Gems also uses it as a signalling register when it is written to.

Since in Sun's Inline Templates code is passed in registers, what I did was left shift the value of register o0 by 26 (16 for the initial shift plus 10 to compensate for sethi) and store the result in g0.

sll %o0, 26, %g0
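
For reference, the inline template file would look roughly like this (a sketch only - the template name and argument size are my guesses, I haven't checked them against the actual new_ruby_magic_call.il; comments start with !):

! hypothetical contents of an inline template (.il) file
.inline ruby_magic_call,4
        sll     %o0, 26, %g0    ! fold the service code into %g0 (normally a no-op target)
.end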

The problem is that in the GCC code this assembly portion is marked as volatile, which means the compiler should neither optimise it away (even if it deems it unnecessary) nor move it around.
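
For comparison, the GCC side presumably looks something like the following (a sketch based on the description above - the macro name and the exact constraint/encoding are my assumptions, not the real GEMS code):

/* sethi places its 22-bit immediate into the upper bits of %g0, which is
 * normally a no-op; the simulator intercepts it as a magic call.
 * __volatile__ keeps the compiler from deleting or reordering it.
 * service must be a compile-time constant for the "i" constraint. */
#define MAGIC_CALL(service) \
    __asm__ __volatile__ ("sethi %0, %%g0" : : "i" ((service) << 16))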

When I compiled the code, I tried it on a test file along with the -g flag:

cc -g -O new_ruby_magic_call.il testil.c

It was supposed to show me whether the code was being late inlined or not (which, as I understand it, means whether it is being optimised or not), but apparently it is not.

I have posted a question on the Solaris forums (had to create a new account since for some reason I cannot log in using my old one). Hopefully someone will answer.

Unanswered questions which I have asked Kevin (from U Wisconsin) are why register l0 is considered clobbered even though it isn't, and why a parameter r is being passed as 0.

I think these two things might be just code that was kept there from a previous version or something like that.

Research Proposal

I submitted my research proposal today. To be honest, it isn't really well written, but I was told not to spend too much time on it. So here it goes:-

Hardware Supported Software Transactional Memory

Moore’s law is dead – or at least that is what some experts (including Gordon Moore himself) have been saying for a while. While it is true that many of the claims of the demise of Moore’s law are exaggerated, it does seem that the trend as drawn by Moore’s law is slowing down, and that sooner or later it will not hold any more. [1]

Due to this and other economic considerations, the idea of multiprocessor and multi-threaded machines is becoming more attractive to processor manufacturers. That way, instead of trying to speed up a single processor, all that has to be done is to add more processors to a machine in the hope of getting it to work faster.

The main problem with this scheme is how to get these processors to actually work faster when there are shared resources involved. Like coordinating humans working on similar tasks that involve shared resources, it is difficult to coordinate the work of more than one processor to effectively share resources while avoiding conflicts.

The way conflicts are usually avoided is by designating certain portions of the code where conflicts may arise as critical sections. These critical sections are protected using locks, which prohibit more than one running process from entering the section, thus eliminating any conflicts that may arise. [2]

While critical sections do eliminate such conflicts, they create problems of their own. These include priority inversion, where a lower priority thread gains precedence over a higher priority one simply because it acquired the lock first, forcing the higher priority thread to wait until it is done. Moreover, with locks, deadlock is always a concern. To complicate things further, if a thread holding a lock is terminated for some reason, the lock remains held indefinitely, keeping all other threads from ever entering that critical section or accessing the data it guards. [3]

Another consideration when using locks is the trade-off between performance and ease of programming. Writing a multi-threaded program is a difficult task as it is, and having to use locks makes it even more complicated. It would be easier if programmers would just use coarse-grained locks and not have to worry about any potential conflicts within their code, but this would result in more serialized code which runs slowly. If they were to use finer grained locks then that would make the program more difficult to reason about. [2]

The database community has solved this problem long ago by introducing the concept of a transaction. A transaction is essentially a series of operations that appear to execute instantaneously, as far as the outside world is concerned. If a transaction fails, it appears to the outside world as if it had never executed to begin with – in other words the state of the world (or the resources said transaction touched) is never in an inconsistent state. [4]

The concept of transactional memory was first introduced to the computer architecture community by Herlihy and Moss. Their implementation was done completely in hardware and suffered from the need to add new instructions and a special cache or buffer to keep track of all the changes [3]. Rajwar and Goodman further refined the concept of hardware transactions in their work on Speculative Lock Elision (SLE) and Transactional Lock Removal (TLR). Their solution required no new instructions to be added to the architecture, and utilized certain hardware properties that are already implemented within most modern high performance processors to do most of the work [2]. However, a drawback to most hardware transactional memory implementations is the fact that transactions are limited by how much physical memory or buffering space is allocated for their use. Moreover, handling external events such as operating system context switching can prove to be problematic [5].

On the other hand, transactional memory has also been implemented in software. Such an implementation was first introduced by Shavit and Touitou [6] and then expanded upon by many others, such as Herlihy et al. in their work on Dynamic Software Transactional Memory [7]. Implementing transactions in software removed the spatial and temporal limitations imposed by hardware, yet the problem is that software implementations are inherently much slower than hardware ones.

It has been noticed, however, that most critical sections are short and require few resources to run. Yet the few that actually exceed these resources can impose a significant performance penalty or, worse, never succeed at all [8]. Given these observations, what we are hoping to achieve is a solution which makes the common case fast by running it in hardware. When a transaction fails to run in hardware, it should fall back gracefully and run as a software transaction.

Another area that we might investigate is optimizing the performance of software transactional memory by having it take advantage of certain hardware facilities such as SLE if available, yet run correctly (albeit slower) if not. This would make the concept of transactions widely implemented on all architectures, with the ones supporting it running faster.

“He who every morning plans the transaction of the day and follows out that plan, carries a thread that will guide him through the maze of the most busy life. But where no plan is laid, where the disposal of time is surrendered merely to the chance of incidence, chaos will soon reign.”

Victor Hugo
French dramatist, novelist, & poet (1802 - 1885)

References

[1] M. Dubash, "Moore's Law is dead, says Gordon Moore," in Techworld, 2005.
http://www.techworld.com/opsys/news/index.cfm?NewsID=3477

[2] R. Rajwar and J. R. Goodman, "Transactional lock-free execution of lock-based programs," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems. San Jose, California: ACM Press, 2002.

[3] M. Herlihy and J. E. B. Moss, "Transactional memory: architectural support for lock-free data structures," in Proceedings of the 20th annual international symposium on Computer architecture. San Diego, California, United States: ACM Press, 1993.

[4] T. Kyte, Expert One-on-One Oracle, 1st ed. Apress, 2003.

[5] R. Rajwar, M. Herlihy, and K. Lai, "Virtualizing Transactional Memory," in Proceedings of the 32nd Annual International Symposium on Computer Architecture: IEEE Computer Society, 2005.

[6] N. Shavit and D. Touitou, "Software transactional memory," in Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing. Ottawa, Ontario, Canada: ACM Press, 1995.

[7] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III, "Software transactional memory for dynamic-sized data structures," in Proceedings of the Twenty-Second Annual Symposium on Principles of Distributed Computing. Boston, Massachusetts: ACM Press, 2003.

[8] C. S. Ananian, K. Asanović, B. C. Kuszmaul, C. E. Leiserson, and S. Lie, "Unbounded Transactional Memory," presented at the 11th International Symposium on High-Performance Computer Architecture (HPCA'05), San Francisco, California, 2005.

Inline Assembly

I have to figure out how inline assembly works in Sun's C Compiler as opposed to the GNU C compiler. There is some assembly code which has to be ported (basically the magic instructions for stuff like compare-and-swap, activating SLE, etc...).

The reason I'd like to get it working with Sun's CC is that the environment we're using is a Sun SPARC machine - so it seems only reasonable to use Sun's own C compiler, since it should be optimised for their architecture.

Moreover, I got the Boehm-Demers-Weiser garbage collector working with Sun's C compiler, and it seems that it would be a hassle to build the collector using GCC under Solaris.

Sun's Inline Assembly Page
http://developers.sun.com/prodtech/cc/articles/inlining.html

GCC's Inline Assembly Tutorial
http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html

Thursday, April 06, 2006

Victor Hugo on Transactions

He who every morning plans the transaction of the day and follows out that plan, carries a thread that will guide him through the maze of the most busy life. But where no plan is laid, where the disposal of time is surrendered merely to the chance of incidence, chaos will soon reign.

Victor Hugo
French dramatist, novelist, & poet (1802 - 1885)

http://www.quotationspage.com/quote/2785.html

Wednesday, April 05, 2006

DSTM C vs Locks C

I just created a test which increments and decrements a shared counter 2 million times each, using a single global lock. As expected, and due to the simplicity of this example, it performed much better: only 0:01 seconds and an RSS of 1240 kbytes.
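
The locking test is essentially along these lines (a minimal sketch rather than the actual test file - the names and structure here are mine):

#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 2000000

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker adds its delta (+1 or -1) to the shared counter under the
 * single global lock, ITERATIONS times. */
static void *worker(void *arg)
{
    long delta = *(long *) arg;
    long i;
    for (i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter += delta;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t inc, dec;
    long plus = 1, minus = -1;

    pthread_create(&inc, NULL, worker, &plus);
    pthread_create(&dec, NULL, worker, &minus);
    pthread_join(inc, NULL);
    pthread_join(dec, NULL);

    printf("final counter = %ld (should be 0)\n", counter);
    return 0;
}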

DSTM should, however, show a significant performance advantage for more complicated scenarios.

Tuesday, April 04, 2006

DSTM: Java Vs. C

Ok. Today I managed to implement the same test in Java as the one I had in C (increment and decrement a shared counter equally).

First of all, I've never written a Java program before. It was easy enough for me to follow Java programs that have already been written, but writing my own proved to be more difficult. To be honest, I do not like the Java language. But eventually I got it done.

Anyway, as for the results: for 2 million increments and the same number of decrements, here's how three different configurations performed (all running the Polite Contention Manager):-

Sun CC (without -fast)
Time 0:16
RSS 1864

Sun CC (with -fast compiler option)
Time 0:12
RSS 1856

Java (default compilation)
Time 0:14
RSS 17960

What this tells us is that, apparently, the Java version performs better than the default C compilation. This is most probably due to Java having better garbage collection than C (since the C garbage collector has to be conservative). It could also be down to my own code, but I followed the Java program as closely as possible while porting, so I don't think that's the issue.

I don't know if it's fair to compare the optimized C port with the Java one, since I am not familiar with any optimizations for the Java compiler. I need to ask about this.

Finally, the Java version obviously consumes a lot more memory due to the loading of the JVM.

In conclusion, it seems that the C port is comparable to the Java port in terms of performance, but more tests need to be done.

Monday, April 03, 2006

DSTM Polite and Java

Today I finished the Polite contention manager component for the C port. I tested it and then compared it with the aggressive one on the simple counter increment/decrement test - I got the same runtime for both (8 seconds). Maybe aggressive vs. polite matters more with more complex tests.
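
Roughly speaking, the Polite policy boils down to backing off before aborting anyone. A sketch of the general idea (not the code I actually wrote - the types, names and backoff constants are made up):

#include <stdlib.h>
#include <unistd.h>

enum tx_status { TX_ACTIVE, TX_COMMITTED, TX_ABORTED };

/* Placeholder transaction record; the real port's struct is richer. */
struct transaction { volatile enum tx_status status; };

#define MAX_BACKOFF_ROUNDS 8

/* Wait with randomised, exponentially growing delays while the competing
 * transaction is still active; only if it never finishes do we report
 * that it should be aborted. */
static int polite_should_abort_enemy(struct transaction *enemy)
{
    int round;
    for (round = 0; round < MAX_BACKOFF_ROUNDS; round++) {
        if (enemy->status != TX_ACTIVE)
            return 0;                 /* the conflict resolved itself */
        usleep((useconds_t) (rand() % (1 << (round + 4))));
    }
    return 1;                         /* patience exhausted: abort the enemy */
}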

I also created a makefile to build the DSTM C port, and I built and tested the Java port as well. I have to mention that the Java runone.pl benchmark script performed the same with both the Polite and Aggressive contention managers.

To have a viable comparison of both, I need to implement similar benchmarks for the Java and the C version and compare the performance of both kinds of benchmarks.

Todo list:-
- Implement common Java and C benchmarks and test them
- Read the garbage collection paper I mentioned in my previous post
- Reread the RSTM paper
- Write research proposal

Useful Command

To find the system specs of a Solaris machine
/usr/platform/`uname -i`/sbin/prtdiag

Friday, March 31, 2006

DSTM with GC is Working

Yay!

I finally got DSTM working with garbage collection under Solaris. I almost cried tears of joy! Fortunately, since I abided by the ANSI C standard, I didn't have much trouble porting the code to Solaris.

As for the question of whether or not to use garbage collection: what I had in mind was to take the code that runs correctly with garbage collection, compare it to the leaky manual memory allocation code, and see which version performs better.

If the leaky version performed better than the garbage collected one, then it would be worthwhile trying to figure out where the leaks were and fix them. But if the garbage collected one performed better, then it would be pointless to fix the leaks and we should just stick with the garbage collected version.

So, what results did I get from my tests? (I should note that the manual memory allocation was relaxed even a bit more, to allow for more memory leaks in the hope of better performance.)

I used my first DSTM test case, which basically increments and decrements a counter equally (so the counter should end up with the same value it started with). I tried it with 1000 increments and decrements, then again with 991000 increments and decrements. The result was that it performs significantly better (about twice as fast) with garbage collection than with manual memory allocation. I was expecting better performance, but this was way better than I thought.

What this means is that we should definitely proceed with the Garbage Collector based implementation.

Things to do: compare the performance of DSTM in Java vs. C and see if there are any significant differences. I would think that the Java implementation might actually perform better than the C one, simply because of the better garbage collection available there.

Another thing to do: write a makefile to do all the compiling and building.

Results

Quick note of the system I was using: Sun One Studio 10 on a Solaris Sparc machine.

Manual Memory Management

1000 Increments and 990 Decrements
RSS (Memory used): 1064 kbytes
Time: 0:00

991000 Increments and 990990 Decrements
RSS (Memory used): 47736 kbytes
Time: 0:15

Garbage Collection

1000 Increments and 990 Decrements
RSS (Memory used): 1728 kbytes
Time: 0:00

991000 Increments and 990990 Decrements
RSS (Memory used): 1792 kbytes
Time: 0:08

Some useful commands that I used

To find out the running processes' execution times as well as their memory utilization:-
ps -ly

To compile the code (for every file):-
cc -c test.c

To build the project with garbage collection:-
cc atomic.o locator.o readtable.o thread.o transaction.o contention.o readset.o test.o tobject.o utility.o -lgc -z muldefs -o test

The -z muldefs flag allows for the multiple definition of some global constants, and -lgc loads the garbage collector library (this should come AFTER all the object files).

Note: You definitely should create a makefile instead.

Why is Garbage Collection Faster

A couple of articles that might shed some light:-
http://www.memorymanagement.org/faq.html#delete.speed
http://www.memorymanagement.org/faq.html#gc.speed

Good Garbage Collection References:-
http://www.iecc.com/gclist/GC-faq.html
http://www.memorymanagement.org/faq.html
http://www.hpl.hp.co.uk/personal/Hans_Boehm/gc/

Read this paper:-
ftp://ftp.cs.utexas.edu/pub/garbage/gcsurvey.ps

Thursday, March 30, 2006

Garbage Collection in Solaris cc

Since our target environment is Solaris, I thought I should use Solaris's libgc, which provides garbage collection for C.

http://developers.sun.com/prodtech/cc/articles/libgc.html

I had a hard time getting it to work; I was following the instructions and adding -lgc as a compiler flag. Finally, after a lot of blood, sweat and tears, it turns out that I should add the flag AFTER the source file! Go figure!

e.g. cc testmem.c -lgc -o testmem

http://docs.sun.com/source/819-3688/cc_ops.app.html

Eventually I got it working. It is doing garbage collection and all. The only thing I still need to ascertain is whether it is thread-safe or not. If it's not, then I'm back to square one. I posted a question on Sun's forums and am waiting for an answer.

http://forum.sun.com/thread.jspa?threadID=29907

RSTM Paper

Had a quick look at the RSTM paper. Need to go through it again. What I noticed is that they do say that freeing memory that's no longer being used is a hassle - a problem I discovered the hard way.

Fixing the Memory Leaks

I haven't worked on fixing the memory leaks (for the reasons mentioned in my previous post). But I would like to note that adding a reference counter to the user data (or objects) will most probably necessitate adding one more level of indirection, which will generate more cache misses, thus making performance even worse.

That was an argument in favour of using a garbage collector (or an argument in favour of me being lazy and not wanting to fix the leak :) )

Wednesday, March 29, 2006

Compare with and without GC

I just thought of something - I should not waste any more time trying to fix the memory leaks in DSTM until I get the GC version of DSTM working. After doing that, I should compare the performance of the GC version with the current leaky version.

If GC performs better than the current version, then fixing the memory leaks will definitely NOT improve performance, thus would be a moot point.

More reference counting?

After fixing the program as I mentioned in the last post, it worked fine (not immediately, had a couple of bugs to quash).

Yet out of about 500,000 memory allocations I still had about 30 memory leaks. Why?

After working on this problem all of yesterday and today, I realized that it had to do with the actual user objects that are being referenced. This is too complicated to explain here, but in certain cases, when a thread manages to advance a bit further than I'd anticipated, it will not delete an object it should delete (there is no way for it to know that it should).

To make a long story short, it seems that the only way around this problem is to use a reference counter for the objects. This way I would in effect have a reference counter for every data structure used in DSTM. Moreover, I believe there are still a couple of memory leaks that remain unaccounted for.

I am having a hard-time getting the garbage collector to work. Will continue working on that now.

I am not happy :(

Monday, March 27, 2006

I think I can see the light

Ok. It seems that I've almost got it done. My eyes hurt so I won't really write much.

Things I want to say:-

First of all, the solution to the memory management problem: C requires programmers to manage memory themselves - so that is how I should do it. Whenever an object is opened for write, it should be the programmer who is responsible for deallocating the memory for that object when they're done with it.

I started reading one of the papers Jim sent, here's an interesting quote from the abstract of the first one:

Most either employ locks in the underlying implementation or depend on thread-safe general-purpose garbage collection to collect stale data and metadata.


[Source: Lowering the Overhead of Nonblocking Software Transactional Memory, Marathe et al, 2006]

What this shows, I believe, is that it will be really difficult to have an STM implementation that is both lock-free and garbage-collector-free. But remember to investigate the CAS possibility.

To be honest, I haven't read the papers yet. I will read them tomorrow. You hear me Fuad, READ THE PAPERS TOMORROW! :)

DSTM: Found the Problem - Don't know the Solution

I figured out why reference counting the locators isn't working. The problem is, I don't know how to solve this problem.

Basically, when a thread opens an object for writing and manages to compare-and-swap the start pointer, it decrements the number of references to the newly created locator (since the local reference is about to fall out of scope). At that instant the reference count may have just been decremented to 1 (the start pointer still points to the locator), and before the thread actually returns the reference to the object, another transaction might jump in, abort this transaction and swap the pointer. That other transaction sees that the number of references to the locator is one, decrements it (since it switched the start pointer away), and frees the locator because nothing points to it any more - while the first thread is still about to use it.

Hope the above made sense.

How to solve this problem?

The first solution which comes to mind is to add reference counters to the objects. I fear, though, that this would require even more locks and that there would be even more unforeseen problems.

Another one would be to somehow (haven't thought this through yet) invalidate the old object (setting the pointer to null) - don't think this'll work though.

James mentioned that the paper Jim sent today deals with Garbage Collection and this sort of thing, so I guess that this afternoon I'll read those papers and maybe see if it is possible to come up with a better solution.

Peace Out!

Saturday, March 25, 2006

Garbage Collection is Inevitable

This is making me depressed. Today I realized that even if reference counting for Locators were to work properly, there is still a problem. Somehow, malloc sometimes crashes! Malloc! I honestly don't know what to do now other than use a garbage collector...

The fact that a garbage collector is required, to be honest, makes me question DSTM even more.

I would like it if Jim or James were to find out how the C++ version of DSTM (which we don't have access to) deals with garbage collection or memory allocation.

Friday, March 24, 2006

Meeting with Robert

We just met up with Robert Strandh, since he's an expert when it comes to garbage collectors. It seems that the best thing we could do is use a Boehm-Demers-Weiser garbage collector (like the HP one I found the other day). The thing is, these garbage collectors aren't incremental; they actually stop the world when they do their work. Thus a Java GC would actually have been better.

Memory Leaks

Like I anticipated yesterday - there are memory leaks. Plus I discovered yet another portion of the code that needs to be locked.

The cause of the memory leak is obvious, but the solution isn't. What's causing the leak is that when I increment the reference counter for the Locator objects, I do not decrement it when the attempt to acquire the object fails. The simple solution might seem to be to just decrement it, but if I do decrement it later on failure, all hell breaks loose. I don't know exactly why, but judging by previous similar problems, this indicates that I actually need to lock yet another portion of the code.

Another portion which had to be locked was where you get the committed version of the object from the locator. When that happens, you attempt to clear the old object from memory (since if the transaction committed, the old version is no longer valid). However, if two threads enter that portion of code simultaneously and attempt to clear the old object concurrently, then we have a problem - which I'm afraid can only be solved by locking.

Nick Hay pointed out an interesting research paper about multi-threaded garbage collection at http://www.cs.technion.ac.il/~erez/Papers/cgc9.pdf . GC for multithreaded programs is not trivial. But Nick pointed out that transactional memory behaves in a certain predictable pattern, so a specific garbage collector for TM might be written.

Moreover, I was thinking that an interesting area of research would be hardware assisted garbage collectors. Just a thought.

All in all, my current DSTM implementation is looking ugly, and I am not sure whether I should proceed as it is or find a garbage collector.

You know what... I think for the time being I will try to get it working with the minimum number of locks possible; then, when that (hopefully) works, try to figure out a way to elude all these locks.

Moreover, investigating garbage collectors, and maybe finding a simple one, might be a good idea.

special case nick

Thursday, March 23, 2006

DSTM Memory Management

Ok. I've decided against using a garbage collector, since it would involve using external libraries and I'd like the DSTM implementation to be completely independent of any library used.

Anyway, two problems I've had (and since have solved) are below:-

Clearing the Transactions

I have used reference counting to keep track of the transactions like I mentioned in a previous post. But there were still some issues when it came time to clear the transactions. After a lot of work, it turns out that the reference counting is working fine - BUT if two threads get to the point where the number of references to the transaction is 0 at the same time, one will free the transaction and succeed, while the other will fail.

The way around this problem was to implement a check just before you attempt to free the transaction: if someone else has already freed it, then don't try to free it. That simple.

I wish I could say the same for the test though....

Clearing Old Locators

Now this one is bad. Really BAD.

The problem here is this. We have two threads which attempt to open an object for writing. Both point to the old locator, and both create a new locator which they attempt to install. Now let's say transaction A had just read the pointer to the old locator and was switched out, while transaction B managed to switch the old locator to the new one and delete the old locator.

Now the problem is that when transaction A continues working, the locator it is pointing to has had its memory deallocated, and this will obviously lead to a runtime error. The problem is, there is no way to inform the first transaction that the locator has been deallocated.

In Java, what DSTM does is that it doesn't free the memory; in time the transaction will realize that it's not valid any more, and the garbage collector will take care of all the problems.

In C, though, we have to do all the garbage collection manually. The idea in this case would be to implement reference counting again, to keep track of how many transactions are pointing to each object. Which is all well until you consider the following scenario:

Transaction A points to the old locator; it sees that it's there, so it attempts to increment the reference count of the locator. Before it actually does that, transaction B manages to successfully swap the old locator with the new one. It attempts to clear the old locator, but it checks the reference count beforehand to make sure that it should. Having checked the reference count, it finds that it's 0 (since A hasn't actually updated it yet), so it deallocates the locator.

Now transaction A is up and running again; it attempts to increment the reference count (through a pointer that now refers to a random place in memory), and all hell breaks loose.

The problem here is that a compare-and-swap will not work, since you do not know what you should compare with.

After a lot of thought, I realized that the best way to handle it would be, dare I say, a LOCK! Yes, a lock. At first this idea seemed repulsive, since the whole point here is to avoid locks! But as I thought about it more, I realized that if each transactional object (TMObject) implements its own lock, and we only lock incrementing the reference counter and decrementing/deallocating, then it will actually be at the finest level of granularity possible. In other words, such a lock will only serialize what really has to be serialized, and it will not serialize the portions that are fine running together.
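
In code, the idea is something like this (a sketch - the names are invented and the real structs hold more than shown):

#include <pthread.h>
#include <stdlib.h>

struct locator { int ref_count; /* plus old/new object pointers, owner, ... */ };

struct tm_object {
    struct locator *start;      /* current locator */
    pthread_mutex_t ref_lock;   /* guards locator ref_count updates only */
};

static void locator_acquire(struct tm_object *obj, struct locator *loc)
{
    pthread_mutex_lock(&obj->ref_lock);
    loc->ref_count++;
    pthread_mutex_unlock(&obj->ref_lock);
}

static void locator_release(struct tm_object *obj, struct locator *loc)
{
    pthread_mutex_lock(&obj->ref_lock);
    if (--loc->ref_count == 0)
        free(loc);              /* last reference gone: safe to reclaim */
    pthread_mutex_unlock(&obj->ref_lock);
}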

Anyway, that's what I did. James doubts that Jim will be happy with this solution. But in my humble opinion, this solution is better than using an external GC library. Moreover, even if SLE were used for this particular portion, a lock would actually be better, since SLE will try, and in case of contention will fail and fall back to a lock - whereas here we know for a fact that it WILL fail, so we just go ahead and lock it from the start.

Rogue Frees

The remaining problem now is that it seems that somewhere I am freeing something I shouldn't free. I do not know if I'm actually up to investigating this today or tomorrow.

Moreover, I have this nagging feeling that I might have a memory leak somewhere. It might be caused (this will not make any sense to people not familiar with the code) by line 245 in tobject.c. This is just a reminder in case I forget it.

Tuesday, March 21, 2006

DSTM and Memory Management in C

I figured out the problem... the thing is it seems that the solution won't be pretty.

The problem basically is that when I wrote the program to run in C, I had to manually manage memory. I thought about memory management as a single threaded thing, and it was working fine back then.

Now though, when the program is actually multi-threaded, different pointers are mingling and interacting in ways I didn't expect, and to be honest, even now, I cannot see how to foresee every possible interaction. Thus, from the looks of it, it seems that I need to implement garbage collection.

Of course I'm not gonna write my own collector, since that's an involved process. I found one developed by some people at HP which seems to be quite a respectable GC:
http://www.hpl.hp.co.uk/personal/Hans_Boehm/gc/

So, what I'm going to do now is try to get my DSTM working with a garbage collector. Once I do that, I will see if it's possible for me to get rid of the collector.

DSTM Bug Yesterday

I need to write this down before lunch.

James found a bug in my DSTM implementation yesterday. As I mentioned earlier, to manage the memory of the transactions, each transaction has a counter of the number of objects pointing to it.

Now, as each transaction acquires an object, this counter is incremented; and as each object is disassociated from the transaction, the counter is decremented.

Now this counter can only be incremented by one thread (since only one thread can associate objects with a transaction). What I didn't think of is that more than one thread can actually decrement the counter. This is how that could happen:

Let's say a transaction has more than one object associated with it, then commits. Now two different new transactions, running under different threads, could associate with themselves different objects that were associated with the old transaction, thus decrementing the counter simultaneously. If the decrement isn't atomic, this could cause problems.

This is easily solved by either using an atomic decrement instruction or an atomic compare and swap. (or maybe SLE)
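
With the Solaris atomic routines, the fix would look roughly like this (assuming atomic_ops(3C) is available on our build machine; the struct and field names are invented):

#include <inttypes.h>
#include <atomic.h>     /* Solaris atomic_ops(3C) */

struct transaction_rec { volatile uint32_t obj_count; };

/* atomic_dec_32_nv() returns the new value, so only the one thread that
 * takes the counter to zero ever sees 0 and may reclaim the record. */
static int detach_object(struct transaction_rec *tx)
{
    return atomic_dec_32_nv(&tx->obj_count) == 0;   /* 1 => caller may free tx */
}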

DSTM - You, you, you make a grown man cry

I feel like I'm going to break down and cry!

I was really careful in coding my C DSTM implementation - I made sure to follow proper programming practices and tested each component individually.

Now when I try to run it as a multi-threaded program - it fails! I have no idea where to begin in debugging the program! I think I will go for lunch now, clear my head a bit and then come back with an attack plan...

Monday, March 20, 2006

DSTM Testing

Ok, so far it seems that my C DSTM is working. I need to devise some test scenarios and double check that it works on a single thread:-

Committed transaction with:
- a read and a write
- a read
- a write
- a read upgraded to a write
- a write upgraded to a read
- Nothing
- Several read and written objects with a mix of upgrades

Aborted transactions with the same.

Run more than one transaction in a row and see what values you get.

Verify that all the values are what you expect them to be.

Saturday, March 18, 2006

DSTM Obstacle Solved

I finally managed to solve that problem. The way I went about it was similar to reference counting: I just keep a counter of the number of objects referencing each transaction, and whenever that number reaches zero AND the transaction is committed or aborted, all the memory used by it is freed.

There are no issues in terms of contention since each transaction is associated with only one thread.

The next step would be to find a proper memory leak tester and perform such a test. The one with Visual Studio isn't working properly for some reason (keeps crashing)!

Friday, March 17, 2006

DSTM C Coding Obstacle

I have encountered an obstacle in the coding of DSTM in C. I thought that maybe writing it down would help me see the big picture and hopefully get around it.

In DSTM, when a transaction begins, each thread creates an object of type Transaction. This transaction object keeps track of the current status of the transaction and whether it's Active, Committed or Aborted.

Now all objects point to the last transaction which has accessed them. When an object is opened for writing, if it is pointing to a committed or aborted transaction then access is granted.

The problem is that in Java, whenever a transaction is created, the programmer doesn't worry about deallocating the memory it is using, since the garbage collector is responsible for doing that when no more objects or threads are pointing to that transaction. In C, on the other hand, we have to deallocate the transaction manually.

The question is, when do we deallocate it? If we deallocate it when the transaction commits, then all the objects that are committed and pointing to this transaction will point to some unknown place in memory. We cannot deallocate it when an object stops pointing to it, since we don't know how many objects are pointing to it.

The first idea I had in mind was to create a write list, but after thinking about it more, I thought that a counter for every transaction might be better.

Let's have an object counter for every transaction. Whenever an object points to it, we increment the counter. Whenever an object stops pointing to it, we decrement it. When the counter reaches zero, we deallocate the transaction. A new transaction is always allocated by every thread at the start of a transaction.

I just realized that this is actually a method of garbage collection known as reference counting. In this case, though, it will be more efficient than a generally garbage-collected environment, since it is limited to a certain portion of the code.
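
A minimal sketch of the scheme (the names are invented, and the real transaction struct obviously holds much more):

#include <stdlib.h>

enum tx_status { TX_ACTIVE, TX_COMMITTED, TX_ABORTED };

struct transaction_rec {
    enum tx_status status;
    unsigned       pointing_objects;  /* objects currently referencing us */
};

/* Called when an object starts pointing at the transaction. */
static void tx_attach(struct transaction_rec *tx)
{
    tx->pointing_objects++;
}

/* Called when an object stops pointing at the transaction; the last
 * detach of a finished transaction reclaims it. */
static void tx_detach(struct transaction_rec *tx)
{
    if (--tx->pointing_objects == 0 && tx->status != TX_ACTIVE)
        free(tx);
}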

Thursday, March 16, 2006

DSTM Coding Done

I have finished the code for DSTM in C. Now it is time to test it. I hope there aren't any serious bugs; if testing goes well, it should be done by tomorrow.

Monday, March 13, 2006

DSTM Contention Manager

I figured out the contention manager today - it was easier than I expected. I have to say things are progressing quite smoothly with DSTM. Of course, I have only tested the modules and haven't actually tested the thing as a whole yet. So it will get exciting when I have to do that.

I need to find a tool which checks for memory leaks and stuff like that. I am kind of uncomfortable with the fact that C doesn't perform any garbage collection (I'm so spoilt; I used to be able to live without garbage collection, but not any more). Anyway, I have to make sure that it's 100% accurate in that sense.

Other than that, I had a look at the Hybrid Transactional Memory paper that James sent me, I need to reread it.

I attended Clark's Master's orientation - it was quite informative.

Saturday, March 11, 2006

Nested Transactions for LogTM and DSTM Progress

I think it would be easy to extend LogTM to support a practically unlimited number of nested transactions - where if a nested transaction were to fail, you'd only roll back that one. Basically, implement some sort of stack and markers in the nested transaction mechanism.

Expert one-on-one Oracle

Today I read certain chapters from Thomas Kyte's Expert One-on-One Oracle. A good book which talks about how Oracle works - with a chapter on transactions.

DSTM Progress
Code is going well. I need to understand the contention manager, since that part hasn't been explained in the paper. I had a look at William's port to C; I believe that the main problem he had was (from what I could tell) a lack of experience in C programming. I am not a C expert by any means, but C was the first programming language I worked with at uni, so I believe that gives me an advantage.

Tomorrow (Sunday) or Monday I will work on finishing the code and understanding the contention manager. I believe that if the contention manager takes me a day to figure out, I should finish the port in a week.

Thursday, March 09, 2006

DSTM in C Tables

Yesterday and today I continued working on DSTM in C. The implementation is progressing slowly yet surely. Getting back to programming in C after working with C# for over 2 months full-time is more difficult than I thought. Having to worry about pointers, memory allocation, etc....

Anyway, to keep the read set, instead of using a hashtable I created a normal array-based table. I did it that way because if the table is small then, thanks to better memory locality, arrays with linear search perform better than hashtables (due to the cache misses hashtables generate). The implementation is flexible enough, though, to allow it to be changed to a hashtable later.
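
In rough shape it's something like the following (a sketch - the entry fields and the fixed capacity are placeholders, not what the port actually stores):

#include <stddef.h>

#define READSET_CAPACITY 64

struct readset_entry { void *object; void *version_seen; };

struct readset {
    struct readset_entry entries[READSET_CAPACITY];
    size_t count;
};

/* Returns the existing entry for obj, or NULL if it is not present. */
static struct readset_entry *readset_find(struct readset *rs, void *obj)
{
    size_t i;
    for (i = 0; i < rs->count; i++)       /* linear scan: cheap while small */
        if (rs->entries[i].object == obj)
            return &rs->entries[i];
    return NULL;
}

/* Appends a new entry; a real table would grow (or switch to hashing)
 * instead of failing when full. */
static int readset_add(struct readset *rs, void *obj, void *version)
{
    if (rs->count == READSET_CAPACITY)
        return -1;
    rs->entries[rs->count].object = obj;
    rs->entries[rs->count].version_seen = version;
    rs->count++;
    return 0;
}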

If things continue going the way they are, I hope to get this done in about a week from now. I am not really sure, I should (and will) do a better guesstimate tomorrow.

As a sidenote, fire alarm went off twice today - once in the morning and once in the afternoon. Had to leave our desks and all... very disruptive.

Tuesday, March 07, 2006

pthreads and DSTM in C

Not a very productive day. My computer broke down this morning and it took them a while to fix it.

Anyway, I managed to get pthreads working under Windows using MS Visual Studio (http://sourceware.org/pthreads-win32/). Why? To be honest, I have been using Visual Studio for about 6 years now and I am comfortable with it. Too lazy to learn how to use emacs or vi and gcc unless I have to. So the plan is to develop DSTM using MS VS and then port it to SPARC from there.

Still don't know how to handle the versioning of old and new objects in DSTM. The problem is that in Java, DSTM just compares objects - which isn't that straightforward in C. Maybe I'll implement a version number which gets incremented.

PhD

Talked to Cris Calude about the PhD program - need to work on that ASAP and need to prove that I have material worthy of a PhD. Will talk to Jim about that when he's back.

Monday, March 06, 2006

Logging Cache Overflows

The more I think of it, the more I feel that using SLE and logging cache overflows is the way to go...

DSTM in C

Today I started implementing DSTM in C. From the looks of it, it doesn't seem so difficult. But a really brilliant student tried this before me and faced some problems, so I am trying to be really careful about it and do things methodically.

What I did today was basically re-read the paper, focusing on the data structures used. I drew a diagram of the whole thing on my whiteboard (I hope no one erases it :) ), then I went through the Java DSTM code.

The question is, should I do a direct translation, or just re-implement DSTM with the help of the Java code?

If I translate, I have to figure out how to translate exceptions and things like constructors. If I re-implement it, I could be reinventing the wheel and might fall into problems which the people who implemented DSTM have already encountered and solved.

I have a feeling my approach is gonna be a mix of both. I need a better understanding of pthreads though. But I am planning on having this implementation be independent of what kind of threading is used (forks, pthreads, or whatever it is that MS does...).

Thursday, March 02, 2006

Virtual Copies

An idea just inspired by James - how about we log the changes to the object (to be done by the hardware), and from the software's perspective have it appear as if you had two copies, an old one and a new one?

Milestones in Program Execution

A good thing to do, if we're mixing hardware and software TMs, would be to have the hardware TM, after it fails (due to a context switch or an abort), hand the work it has done so far over to the software TM.

The problem is, how should the software know where the hardware is at?

My half-baked idea, which I plan on pursuing later, is to have milestones - like a statement which saves a milestone (a number or something like that) every N statements. Then, when the work is handed to the software, we know where we should proceed from.

This half-baked idea also involves a log somehow which rolls back the last few statements to the milestone.

LogTM and SLE

I came up with this idea today - which apparently James had actually thought of first - to implement a LogTM and SLE hybrid. Basically, the processor would run with SLE as long as there are sufficient resources. When it runs out of resources, it only logs the memory values of the data items it evicts.

This scheme, as I first thought, would save a lot on aborts, since all an abort would have to do is roll back the (hopefully) few log entries it created (which should be significantly fewer than with LogTM) and clear the cache. Moreover, it should reduce the overhead of logging.

The problem, though, is that the way LogTM does it, the values it saves in the log will most probably already be in the cache, while the scheme I proposed here will definitely have to go to memory to fetch the old value to log. Thus it would end up being slower than LogTM.

Moreover, aborts are very infrequent, so there is no real need for optimising them.

The problem with LogTM is essentially that it cannot handle context switches. Plus, James thinks that even though the idea is cool, the actual implementation is not elegant. I am not familiar enough with LogTM yet to make any judgements personally...

Monday, February 27, 2006

Reinventing the Wheel and Other Stuff

Good but old idea

The idea I had the other day - about using SLE and TLR at first, and when they fill up just dumping the state into memory - was actually reinventing VTM. It was kind of embarrassing, actually, since it showed that I didn't quite understand VTM at the time...

I still think that the concept of database checkpoints could be helpful though. I will investigate this later.

Visible vs Invisible Reads

An idea came to mind today when James and I were talking about STM and visible vs. invisible reads. The problem with visible reads is that they aren't wait-free; invisible reads are wait-free, but they are not elegant and are potentially more resource intensive. One way would be to combine the two by starting off with visible reads, keeping a counter which is incremented every time a transaction fails because a variable it read has been changed, and if that counter exceeds a certain threshold, moving to invisible reads.

Future Work

I am now going to take over William's work, and continue with implementing DSTM tomorrow. It should be interesting and hopefully not difficult...

Wednesday, February 22, 2006

Kernel TM and Checkpoints

Kernel STM

I was thinking: from what I've read so far, it seems that transactional memory has been implemented either purely at the hardware level or high up at the software level with objects - there is word-based STM, but it doesn't really look practical.

Why not have it at the kernel level, and apply the same concepts as SLE and TLR there?

Checkpoints

One limitation in HTM is the lack of resources and what to do when they run out. Why not implement a checkpoint there, similar to database checkpoints?

Or maybe it's called a rollback point? Will check it out...

Research Journal

Not sure if this is the best idea - I think it is, since having my research journal online will enable me to access it from anywhere. So here goes nothing...