User talk:Philosopher/Nasal introspection


Logging overhead observations

FYI: I set max_threads to 0 and disabled all logging output, and ended up with a pretty steady frame latency of 125ms here, which suggests that there's also a fair overhead due to console I/O (which is a known issue in FG, due to FG's sequential main loop). It would sometimes spike to ~500ms, which is in sync with the GC stats (which end up not being purged due to the lack of processing threads running in that mode).

Like you mentioned previously, for some weird reason, performance stats are massively affected once the aircraft is moving, even when just running the "zero eye candy" profile.

I did get nice GC stats when exiting (after profiling just the test func):

**** Nasal garbage collection statistics: objects: 837843, references: 1915555
Exit
Waiting
**** Nasal garbage collect: { objects: 289897, references: 658582, n: 1 }
**** Nasal garbage collect: { objects: 498513, references: 1129243, n: 1 }
Done
_dump
With dummy callbacks (extras[i] = func{}), I see ~30 ms latencies (spikes up to 50), so it looks like this is mostly Nasal-space complications. —Philosopher (talk) 16:12, 17 August 2013 (UTC)
After just disabling LOCAL/SETLOCAL, it's about 240 ms with spikes up to ~270 ms (or something like that), so about 50% better. —Philosopher (talk) 16:21, 17 August 2013 (UTC)

Profiling individual threads via call()

# The extras hash can be set in several ways: as an extra argument (arg[4]) to call,
#  which does it locally to that context, or passed to debug.setDefaultExtra(extras),
#  which sets it for all contexts created in the C code (excluding threads). One can
#  get the current extras for a context with debug.getCurrentExtra.

Is that even relevant regarding threads? Can't we simply invoke call() inside a thread (or outside the newthread() invocation) in order to profile individual threads?

It was just to point out that setDefaultExtra() won't enable profiling of threads other than the main one. I'm undecided as to whether or not to do it, but for now I thought it was safer to disable it. —Philosopher (talk) 18:55, 16 August 2013 (UTC)
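For reference, assuming the patched call() really takes the extras hash as arg[4] (as the comment above says – in stock Nasal that slot is the error vector), profiling just the body of a single worker thread could look roughly like this (a sketch; debug.getCurrentExtra() is only used here to reuse the current extras):

# sketch only: profile one thread's body by handing the extras to the patched
# call() directly, instead of relying on setDefaultExtra(), which only covers
# contexts created on the C side (and excludes threads).
var my_extras = debug.getCurrentExtra();   # or a freshly prepared extras hash

thread.newthread(func {
    call(func {
        # ... the work we actually want profiled ...
    }, [], nil, nil, my_extras);           # extras as arg[4], per the patched call()
});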

August 14

okay, had a chance to rebuild everything, including your latest changes – some observations:

  • using the “minimum startup profile” (wiki), frame rates and spacing were “bearable”
  • frame latency would sometimes spike up to 550 ms here though
  • also, even after completion of the profiling run, it would sometimes spike up to 220 ms (which it usually doesn’t)
  • I didn’t seem to get a success/completion message here?

will do some more testing and report back …

Here’s another question for you: since we wanted to restrict profiling to a single call, and the extras are being passed to the call in line 347, I thought I could get rid of the previous setDefaultExtra() call, which – according to my understanding – would make the whole Nasal main thread subject to profiling (?). However, once I remove that call, there’s no profiling going on at all?

Hooray

This stuff is only printed out to the log file ($FG_HOME/nasal_test_data.log) on simulator exit, which means that you have to either use the timer, the menu, or escape to exit (i.e. use the fgcommand directly/indirectly). Frame rates are terrible here too, especially once one starts moving… —Philosopher (talk) 18:31, 14 August 2013 (UTC)


Regarding your changes in gc.c @ 230+: I think we can use the same method to count active/reachable references per func, just by modifying mark() to increment a counter once it is invoked on a func, which would give us a rough idea of how heavy some functions are. We could also distinguish between references to the active/lower scope and outer-scope references (globals etc.). —Hooray



Also, regarding the total number of references, it is probably my earlier suggestion that isn't helping here: adding tons of naFunc callbacks to a queue will drastically increase the number of temporary references to live objects. Thus, it would seem better to merely copy the data to the queue and use one function object per thread instead, because otherwise each pushed function will end up adding a full closure with tons of references to external objects.

Hooray

Yeah, it could be closure-optimized, but I was thinking that a closure is the safest way – how else would it be done? —Philosopher
Ideally, in a "share-nothing" fashion (if possible), where you'd have a pool of worker threads that merely receive temporary copies of data (i.e. not to be modified) which aren't referenced otherwise, and so are easy to mark/reap (recursion depth-wise). This is what "message passing" is all about: reduce the amount of necessary synchronization to an absolute minimum, and otherwise have a pool of "agent" threads that process messages and only ever communicate with other agents through messages.
I just pushed another threading approach; it's something in between what you just described and what I had before. I'm still getting huge amounts of GC-findable objects, and I have a feeling that's coming from my script, but where?? The queues should be thrown out at some point, as once the worker thread is done with one it should be garbage, and I can't think of anything else that is more than a "temp". Anyways, it sorta works now: frame latencies of 100-200 ms when I don't garbage collect, but that happens every other frame, so really it is in the thousands and going up :P. —Philosopher (talk) 20:29, 14 August 2013 (UTC)
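A minimal sketch of the share-nothing variant described above (illustrative names like job_queue/handle_job – this is not the code that was just pushed): plain data hashes go into a locked queue, and one reusable function object per worker consumes them, so no per-callback closures pile up references:

var queue_lock = thread.newlock();
var job_queue  = [];

var push_job = func(data) {
    thread.lock(queue_lock);
    append(job_queue, data);          # data only - no naFunc callback, no closure
    thread.unlock(queue_lock);
}

# one function object per worker thread, reused for every job
var worker = func {
    while (1) {
        thread.lock(queue_lock);
        var job = size(job_queue) ? job_queue[0] : nil;
        if (job != nil) job_queue = subvec(job_queue, 1);
        thread.unlock(queue_lock);
        if (job == nil) break;        # no blocking/waiting in this sketch
        handle_job(job);              # placeholder for the actual per-job work
    }
}
thread.newthread(worker);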

Dealing with collected data

As an aside, I'd use another thread to sort the data prior to writing it to a file - I am getting a bunch of stuff at the beginning of the file that simply doesn't seem very relevant in comparison to the rest.

Would my explore.py script help here? You can move it to $FG_HOME (or link -s), run it, and type "list by number of calls / 20" to get the top 20 most-called functions. It's quite helpful, actually ^.^ —Philosopher (talk) 21:14, 14 August 2013 (UTC)
Tried running it, but getting syntax errors - in a bunch of places, probably due to a different python version here. Also, looking at it, it would seem like another candidate to be eventually reimplemented as a Nasal background task once performance has improved, because we'd ideally show canvas-dialogs in FG with profiling results.--Hooray (talk) 21:23, 14 August 2013 (UTC)
Disregard that idea: looking at the log file and the type of info written there, it would actually be simpler to directly use Andy's SQLite bindings and dump each profiling run to a real DB that supports key/value lookups and searches – and it would take less space, too.
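Either way, the sorting step itself is cheap to prototype in Nasal – a minimal sketch (hypothetical entry layout, not the actual log format), ordering entries by call count the way explore.py's "list by number of calls" does:

# hypothetical data layout, just to show the sort-before-dump idea
var entries = [
    { name: "aar.update",     calls: 1200 },
    { name: "view.update",    calls:  300 },
    { name: "gui.showDialog", calls:    4 },
];
var by_calls = sort(entries, func(a, b) b.calls - a.calls);   # descending by call count
foreach (var e; subvec(by_calls, 0, 2))                       # "top N", here N=2
    print(e.calls, "x  ", e.name);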

Whole session getting profiled ??

It's confirmed: as previously suggested, your setupDebugExtras() call results in the entire sim session being profiled (see my earlier comments). Ideally, just passing extras to call() should MERELY profile the specified naFunc, nothing else. According to the profile, I am seeing the entire sim session profiled during the run, not just the test func – so that's where all the data and refs are coming from.

Oh yeah, sorry, I keep forgetting to take that out... Well, I do (occasionally), but I put it back in. The question is: will developers want to profile everything, or use call() to just profile their own stuff? (Which could end up being hard if there are a lot of settimers/setlisteners involved, since each would need to be touched.) I do think that profiling everything gives a sort of context and ranking – for example, I was able to tell that aar.nas was particularly bad, whereas otherwise I really wouldn't know what "bad" is. I would also like to see if we could just profile a whole session. BTW: I can confirm that stats aren't getting printed for the single call(). —Philosopher (talk) 21:23, 14 August 2013 (UTC)
Yeah, completely agreed – ideally, we really WANT the full session to be profiled, but obviously there are still some issues, so we should slow down for a second and fix the call() implementation first, to be able to profile individual scripts. That would give us an idea about what's going on – profiling the whole session is still too massive timing/memory-wise, so we need to use call() as a testbed to understand what's happening before we can look again at profiling the entire sim session, which I also consider a primary goal. Regarding listeners and timers, I think we could patch the FG side of things to "inherit" extras based on the initial call().--Hooray (talk) 21:29, 14 August 2013 (UTC)
It would be really interesting if we could call() the test.nas infrastructure to get ITSELF profiled when profiling the sim session, to identify which parts of test.nas are heavy (i.e. responsible for 200ms frame spacing), and why (GC overhead, callgraph, data structures). Need to do some more thinking to figure out if/how that could be made to work...
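A rough sketch of the "inherit extras" idea for timers (not what test.nas currently does – it assumes debug.getCurrentExtra() plus the patched call()/arg[4] mechanism discussed above, and uses a simplified settimer() signature; listeners could be wrapped the same way):

var _settimer = settimer;                      # keep a handle on the original
var settimer = func(callback, delay) {
    var extras = debug.getCurrentExtra();      # extras active when the timer was registered
    _settimer(func call(callback, [], nil, nil, extras), delay);   # callback runs under the same extras
}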


My suggestion would be to add a global thread counter and increment/decrement it whenever you create/finish a thread – which will tell you what's going on threading-wise.

You could use something like this (crude and just briefly tested):

var synced = {_LOCK:nil};
synced.new = func {
 var temp = { parents:[synced] };
 temp._LOCK = thread.newlock();
 return temp;
}
synced.atomic_do = func (what, thread_id=nil) {
 var start = systime();
 thread.lock( me._LOCK);
 if (thread_id != nil)
  print("Thread: ", thread_id, " spent waiting for lock:", systime()-start );
 what();
 thread.unlock( me._LOCK);
}
 
var counter=synced.new();
counter._value = 0;
counter.get = func return me._value;


var newthread = func(code, thread_id) {
 thread.newthread(func {
  counter.atomic_do( func counter._value += 1, thread_id ); # increment thread counter
  code();                                                   # run the callback in the new thread
  counter.atomic_do( func counter._value -= 1, thread_id ); # decrement thread counter
 });
}
 
var stats = func {
  print("Threads:", counter.get() );
}

print("***");
for(var i=0; i<15; i+=1)
 newthread( stats,i );

You could also register a GC stats callback that dumps stats at the beginning/end of the thread, to see if objects/references roughly scale with the number of threads here.

Hooray

September 14

I pushed lots of changes yesterday and some more today. I would appreciate thoughts on performance – I basically fixed all errors and most warnings; I get these three on exit:

Stack processing error: warning: no call stack/return for function test.listener:
 at /Users/philosopher/Documents/FlightGear/fgdata/Nasal/test.nas, line 468
Stack processing error: warning: no call stack/return for function __gcsave.89:
 at /Users/philosopher/Documents/FlightGear/fgdata/Nasal/test.nas, line 468
Stack processing error: warning: no call stack/return for function __gcsave.90:
 at /Users/philosopher/Documents/FlightGear/fgdata/Nasal/test.nas, line 468

This is related to the fact that I turn on profiling halfway through a function call stack ;-). (I set a timer, which probably shows up as __gcsave.90, which, before the timer exits, triggers a listener, say __gcsave.89, which calls the actual listener body, test.listener, due to the wrapping in globals.nas.)

I also get frequent timing warnings:

warning: long time for add_func call: 5.640983581542969ms with a size of 10 for entry func:0x7fef4909aa70
warning: long time for add_func call: 15.3660774230957ms with a size of 11 for entry func:0x7fef4909aa00
warning: long time for add_func call: 8.067131042480469ms with a size of 40 for entry func:0x7fef49154ff0
Queue_up: Waiting time: 14.64819908142089ms
warning: long time for add_func call: 15.04707336425781ms with a size of 44 for entry func:0x7fef49112090
warning: long time for add_func call: 4.133224487304688ms with a size of 53 for entry func:0x7fef4a8e8250
warning: long time for add_func call: 4.251003265380859ms with a size of 54 for entry func:0x7fef4a8ad070
warning: long time for add_func call: 5.972862243652344ms with a size of 63 for entry func:0x7fef49154d88
warning: long time for add_func call: 5.639076232910156ms with a size of 70 for entry func:0x7fef49111a70
do_worker_thread/get: Waiting time: 5.290031433105469ms
warning: long time for add_func call: 4.57000732421875ms with a size of 79 for entry func:0x7fef4990c4e0
warning: long time for add_func call: 4.376173019409179ms with a size of 83 for entry func:0x7fef4b135d10
warning: long time for add_func call: 4.052877426147461ms with a size of 84 for entry func:0x7fef4b149ff8
warning: long time for add_func call: 4.475116729736328ms with a size of 87 for entry func:0x7fef4a9e8670
warning: long time for add_func call: 4.47392463684082ms with a size of 89 for entry func:0x7fef49908ca8
warning: long time for add_func call: 4.024982452392578ms with a size of 90 for entry func:0x7fef49909338
Queue_up: Waiting time: 17.55404472351074ms
warning: long time for add_func call: 4.741907119750977ms with a size of 91 for entry func:0x7fef49110d88
warning: long time for add_func call: 5.394935607910156ms with a size of 94 for entry func:0x7fef4aa0c488
warning: long time for add_func call: 4.699945449829102ms with a size of 99 for entry func:0x7fef4b1496c8
Queue_up: Waiting time: 17.0600414276123ms
warning: long time for add_func call: 4.919052124023438ms with a size of 100 for entry func:0x7fef4aa0a690
warning: long time for add_func call: 10.23697853088379ms with a size of 103 for entry func:0x7fef49155338
warning: long time for add_func call: 4.286050796508789ms with a size of 105 for entry func:0x7fef49111db8
warning: long time for add_func call: 5.666971206665039ms with a size of 106 for entry func:0x7fef49111df0
warning: long time for add_func call: 5.033016204833984ms with a size of 107 for entry func:0x7fef490b9e38
warning: long time for add_func call: 5.100965499877929ms with a size of 108 for entry func:0x7fef49154c70
warning: long time for add_func call: 27.99296379089355ms with a size of 112 for entry func:0x7fef491551b0
Queue_up: Waiting time: 6.907939910888672ms
Queue_up: Waiting time: 13.18097114562988ms
warning: long time for add_func call: 5.21397590637207ms with a size of 116 for entry func:0x7fef49290a60

I would postulate that the locking warnings happen when a GC event occurs in between a lock and its corresponding unlock – otherwise things don't typically hold locks for more than a handful of lines, and that goes zippity-fast in Nasal. You can experiment with the add_func call level to get a sense of what is normal and what isn't. I should probably delegate some of the more expensive tasks (~cough~, searching for a global lvalue) to the background thread, but it really doesn't take up time worth optimizing yet. I will see what queue_up timing statistics look like tomorrow, but that still doesn't tell me what takes up so much time.

Hm, looking at the code, maybe calling queue_up recursively for allocations isn't such a good idea...? Removing the 1 on line 323 speeds things up some, but I'm still concerned about this statistic:

Took 182.8384615384615 times as long w/ extras
Took 167.578231292517 times as long w/ extras
Took 213.0176991150443 times as long w/ extras
Took 246.7391304347826 times as long w/ extras
Just so that you know, I have been following your progress (subscribed via gitorious), but I am currently not able to build from source and actually run FG – I still need to find some time to catch up with things, will keep ya posted! Thanks for the update. --Hooray (talk) 06:17, 15 September 2013 (UTC)

Debugging Threads

Referring to:

268    DBG_LOCKS and print("Do_thread try lock");
269    thread.lock(queue_lock);
270    DBG_LOCKS and print("Do_thread holds lock");

274    thread.unlock(queue_lock);
275    DBG_LOCKS and print("Do_thread release lock");

Consider wrapping the low-level APIs in classes with built-in debugging support, which can also be extended (via inheritance) to add support for queues, stacks or other data structures:

var Lock = {};
Lock.new = func(name) {
 var temp = {parents:[Lock], name:name };
 temp.LOCK=thread.newlock();
 return temp;
}

Lock.lock = func {
 DBG_LOCKS and print("try lock:",me.name);
 thread.lock(me.LOCK);
 DBG_LOCKS and print("holds lock:",me.name);
}
Lock.unlock = func {
 thread.unlock(me.LOCK);
 DBG_LOCKS and print("lock released:",me.name);
}

var queue = Lock.new("queue lock");

Manual locking/synchronization is tedious, fragile and often error-prone. I still have to check what exactly your code is doing, but wrapping everything in a single object is definitely helpful – for example, run this snippet of code in your Nasal console and see for yourself:

var DBG_LOCKS = 1;


##
# a simple Lockable class that wraps lock/unlock functionality
#
var Lockable = {};
Lockable.new = func(name) {
  var l = thread.newlock();
  return { parents:[Lockable],LOCK:l,name:name };
}
Lockable.lock = func thread.lock(me.LOCK);
Lockable.unlock = func thread.unlock(me.LOCK);

##
# also print some stats to the console
#
Lockable.do_atomic = func(what, location="(none)" ) {
 DBG_LOCKS and print(location,":Trying to lock:", me.name);
 var start=systime();
 me.lock();
 DBG_LOCKS and print(location,":Lock held:", me.name);
 DBG_LOCKS and print(location,":Waiting time:", systime() - start);
 what();
 me.unlock();
 DBG_LOCKS and print(location,":Lock released:", me.name);
}

##
# extend the Lockable class with counter functionality
var counter = Lockable.new("counter");
counter.value = 0;
counter.get = func me.value;
counter.inc = func(meta) me.do_atomic( func me.value +=1, meta );
counter.dec = func(meta) me.do_atomic( func me.value -=1, meta );

##
# we'll run this as the "task" in our worker threads
var stress_test = func(thread_id) {
 for (var i=0; i < 50; i+=1) {
  counter.inc( thread_id );
  counter.dec( thread_id );
 }
}

# start a bunch of threads, each incrementing/decrementing the counter concurrently (depending on your hardware and OS)
for (var i=0; i < 180; i+=1) {
 func {
  var c = i;
  thread.newthread(func stress_test("Thread #" ~ c));
 }()
}

Similarly, a wrapper to abstract threading and thread pools would make things more intuitive. For example, we don't need a "busy-wait" loop to wait for all threads to complete – you can simply use semaphores here, and also use them to send a worker thread to sleep when no work is available; in general, it is more efficient to reuse previously allocated threads than to spawn/terminate and re-create new ones all the time.
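For illustration, a minimal semaphore sketch using thread.newsem()/semup()/semdown() (the names and the fixed job count are made up): the worker sleeps on a semaphore until work is signalled, and the main thread blocks on a "done" semaphore instead of busy-waiting:

var work_sem = thread.newsem();       # signalled once per queued job
var done_sem = thread.newsem();       # signalled once when the worker is finished

var worker = func {
    for (var i = 0; i < 10; i += 1) {
        thread.semdown(work_sem);     # sleep here until work is available
        # ... process one job ...
    }
    thread.semup(done_sem);           # tell the waiting thread we are done
}
thread.newthread(worker);

for (var i = 0; i < 10; i += 1)
    thread.semup(work_sem);           # hand the worker one "job" at a time

thread.semdown(done_sem);             # wait for completion without busy-waiting
print("worker finished");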

GC Stats

At the very least, we should also dump stats for each naPool – so that we know how objects are distributed (scalars, vectors, hashes, ghosts). Afterwards, we should look into exposing a function that dumps stats per namespace, so that we can use an extension function that dumps stats gathered via mark() – i.e. to tell how heavy a certain namespace is.
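To make that concrete, a purely hypothetical sketch of how such per-pool stats could be consumed from Nasal, assuming an extension function (here called debug.dumpPoolStats(), which does not exist yet) that returns one object count per naPool:

# debug.dumpPoolStats() is hypothetical; the point is just that the C side
# would report one count per naPool so the object distribution becomes visible,
# e.g. { str: 12000, vec: 3400, hash: 2100, code: 800, ghost: 40 }
var stats = debug.dumpPoolStats();
foreach (var pool; keys(stats))
    print("pool ", pool, ": ", stats[pool], " objects");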