Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [linuxtools-dev] Re: TMF indexing performance

> For time based access, LTTng keeps a sorted array of fixed size blocks
> (in memory I presume). If I understand well, these blocks correspond
> (more or less) to disk blocks which are read in one shot.

Currently the files are divided into large blocks (e.g. 1MiB). These
logically represent a sorted array on disk. Their size is naturally a
multiple of the disks block size (1/2 to 4KiB typically). Anyway, disk
controllers now always read ahead much more than these tiny so called
"disk blocks".

To perform a binary search, the first few bytes of the block in the
middle of the file is first read, then the beginning of other blocks
depending on the result of the comparison in the search. Obviously, the
block in the middle of the file is likely to be kept in the OS buffer
cache.

There is a problem coming up for that approach however. Indeed, in
streaming mode, we will need to flush regularly the buffers to get
information even from low volume channels. Otherwise, some buffers would
take hours to fill and that would pose a problem to show live all the
recent events, from all channels, in sorted time order. As a result, we
may need to go for variable sized blocks, blocks that were far from full
when periodically flushed. Another possibility would be to pack several
mostly empty blocks in a single block or to waste space when blocks are
more than half full.

If we accept that blocks are variable size, random access is compromised
and we would need to generate a small index, an array with an entry
offset/time for each block ((offset o0, time t0), (offset o1, time
t1)...). That array would be small enough to fit in memory and thus
might be even faster than the binary search currently performed directly
on file blocks. It would even be possible to add the "rank" of the event
at the beginning of each block, either counting events as we trace or
after the fact.

>  Looking for a particular event, based on a timestamp, boils down to:
> [1] Perform a binary search in the sorted array of blocks
> [2] Read the corresponding disk block
> [3] Do a linear search (in memory) for the lucky event.

Yes.

> This is very similar to what is proposed in TMF at the trace level:
> [0] Build an index of timestamps spaced at regular intervals
> [1] Perform a binary search for the requested timestamp in the index
> [2] Read the corresponding event
> [3] Perform successive reads until you hit the wanted event

Yes, rank values are easy at the single file level. They pose problem
when logically merging or filtering files.

> Why build an index in the FW? Because, in the general case, traces are
> not structured anywhere as well as LTTng traces are. There is likely
> no index to just read off the disk. So it has to be built.

That does not change the problem of merging files based on the time axis
which, as far as I know, is the use case ultimately.

> What about the successive reads to find the wanted event in the block?
> The TMF solution (coming soon :-) is to define cache pages in
> TmfTrace, where each page holds N events, N being the number of events
> between 2 successive entries in the index. When a read request comes
> in, the whole page (N events) is read in cache.

The OS already caches the file data. Here you would copy data from the
files into a time sorted cache, eventually competing for memory. If the
time merging process is expensive, this may be the right place to
"cache". Otherwise, it is better to use the filesystem cache already
provided by the OS.

> Currently, TMF takes advantage of the index to quickly locate an event
> by rank. Since each index entry corresponds to N events, it is obvious
> that simple modulo arithmetics gets us to the "right" block (in a
> trace) even faster than time based event seeking.

I completely agree that for a single file both are equivalent. When
merging files (almost always since there are per cpu files), since in my
understanding the merging is always based on the time axis (for any type
of log/trace), then the rank based access paradigm is problematic.

For a single trace, where the associated tracefiles are fixed, it could
be worthwhile to build a rank index. However, when defining an
"experiment" where we examine several traces together, the rank would
have to be recomputed for each grouping of traces used as an experiment.
Similarly, when defining filters, rank indexes again lose their meaning
since some events are "filtered out" from the display. 

> Here, I would like to point out that finding the Nth event of a trace

Is the user really ever asking for the Nth event, independent of time?

> Again, maybe not all tracing tools are as friendly as LTTng as to
> provide that time range up front (specially in streaming), in which
> case, the trace has to be somehow scanned to determine it. It follows
> that if the trace has to be scanned anyway (in the general case), it
> would be advantageous to build that index at the same time. BTW, I am
> toying with the idea of indexing individual traces as they are
> imported and store the index (and time range, why not?) with the
> trace. Then, TMF will become eerily similar to LTTng :-)

All logs/traces contain timestamps. Otherwise they cannot be merged on
the time axis and the problem becomes simple anyway. If a log cannot be
accessed randomly or is not in sorted time order, then binary search is
not possible and an indexing pass may be required. Many text logs
(e.g. /var/log/*) can be accessed randomly just as LTTng binary traces,
seeking to a position and then looking for a newline indicating the
start of an event with its timestamp.

If an index needs to be built, whether it is for a strange log format or
streaming LTTng, so be it. However, this does not change the problem
that rank based indexes are hardly usable when merging/filtering.

> I totally agree that the composition of traces at the experiment level
> is a potential nightmare. The LTTng solution seems to be to recompute
> the ranking. Same for TMF :-(

No, you can seek to a certain time effeciently in each trace and then
you merge events from there to fill your display table, implicitly
getting a local ranking, no nightmare. This is where I dont see how you
can proceed if you base everything on ranking?




Back to the top