Re: [linuxtools-dev] TMF indexing performance

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]

Re: [linuxtools-dev] TMF indexing performance

From: Francois Chouinard <fchouinard@xxxxxxxxx>
Date: Tue, 20 Oct 2009 13:08:26 -0400
Delivered-to: linuxtools-dev@xxxxxxxxxxx
Domainkey-signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=I8nyi5uThrVfktOip4wBu6A3lGzKW1yTdRNMKLT2c4NVLT0kU8q647A7KJR7Gbu1al i0aVX1yYastDLcGuaYqs3daDdlRbcWXXIudTXRoOzz49CVIHUQ+F75jLebqDkNAgm2c5 Ixy0VKarqT7rXBWE6s7Syw0lrRV02kCWxNlmM=
List-archive: <https://dev.eclipse.org/mailman/private/linuxtools-dev>
List-help: <mailto:linuxtools-dev-request@eclipse.org?subject=help>
List-subscribe: <https://dev.eclipse.org/mailman/listinfo/linuxtools-dev>, <mailto:linuxtools-dev-request@eclipse.org?subject=subscribe>
List-unsubscribe: <https://dev.eclipse.org/mailman/listinfo/linuxtools-dev>, <mailto:linuxtools-dev-request@eclipse.org?subject=unsubscribe>

Fundamentally, I think we are interested in having the experiments/traces answer efficiently the following requests:
- What is the Nth event?
- What is the event that occurred at time T?

And their variants:
- What are the events ranked N to M?
- What are the events that occurred in time range [T1,T2]?

The timestamp criteria is useful for time-based analysis while the rank criteria is useful for tabular display.

The whole purpose of indexing is to have a scheme to access rapidly any event, in the set of traces, based on these criterias (timestamp and rank).

Is the rank criteria really needed? It is currently used to populate the Events View which is implemented as a virtual table and requires access to the Nth event. It could also be re-used for other purposes (I certainly hope so).

An alternative to "pure" rank (used in LTTv) is to estimate the timestamp of that Nth event by linear interpolation. This can work reasonably well if the event distribution is quite even (a fairly arguable assumption). I am not too warm for that solution because I think that, without implementing a Table widget from scratch :-(, it would be more trouble to ensure the correctness of scrolling through events than it is to simply build an index.

Anyway, this is quita academic since the rank index doesn't even exist: since the timestamp index keeps track of every Nth event, it is an implicit by-product of building the indexes.

There are actually 2 types of indexes built: per trace and per experiment.

[1] Trace index

This index keeps track of every Nth event in the trace by recording its location and timestamp. Its rank is implicit (it is the Nth event...).

I see a number of valid concerns about the building of this index:
a) Can we get away with *not* building an index for the individual traces?
b) How to optimize the index building performance?
c) How do we make it scale?

As stated above, we want fast access to a specific event (based on timestamp or rank). Since the events are of variable length (in the general case), I think we agree that it would not be practical to scan the traces from the beginning every time we want to answer one of the above request. So I believe it is useful to build this index of trace contexts (rank-timestamps-locations tuples) to try to improve the retrieval time.

Also, I like the idea of building an index per trace because it can potentially be re-used for multiple experiments.

Who should build it? When? How?

As of today, the TmfTrace base class constructor initiates the index building. It can be built synchronously or not. To give the extending class some flexibility, I am looking into making the indexing function protected in order to give the extending class some latitude. I am also looking into adding a new TmfTrace constructor so the application can decide to perform (or not) indexing (this would have impacts on the internal structure).

The building of the index could be performed as a low priority job or "just-in-time" (i.e. as the experiment requests events) or both.

As the number of events grows, the simple approach of keeping the location of every Nth event does not scale very well (the size of the index is O(n)). An alternative would be to keep the size of the index constant and to dynamically adjust N as the number of events increases. This would require some pruning of the index.

[2] Experiment index

The purpose of this index is to determine the Nth event in an experiment with multiple, heterogeneous traces. Essentially, this is a snapshot of the traces contexts at every Nth event in the experiment. These contexts are then use to re-position the traces so event N (overall) will be the next to be parsed.

The comments above about who, when, how and scalability also apply.

[3] Optimizations

I really like your proposition of adding a new getNextEvent() that will skip N events. This could definitely speed up the building of the trace indexes by having the trace handle efficiently the skipping..

So, for the individual traces, I would see something like this: getEvent(context, numberOfEventsToSkip) where the extending class can overload that function with a more efficient scheme than simply performing successive calls to getNextEvent().

However, at the experiment level, I'm not sure it would work. Traces could be heterogeneous and I don't see a single function handling that case. Unless the experiment itself becomes extendable...

Best Regards,
/fc

References:
- [linuxtools-dev] TMF indexing performance
  - From: Polzer, Franz

Prev by Date: Re: [linuxtools-dev] TMF indexing performance
Next by Date: [linuxtools-dev] Re: TMF indexing performance
Previous by thread: Re: [linuxtools-dev] TMF indexing performance
Next by thread: [linuxtools-dev] Re: TMF indexing performance
Index(es):
- Date
- Thread

Breadcrumbs