RE: [cdt-dev] decoupled preprocessor

Just to give a little history here, since I originally wrote the Scanner
mess (warning: scanner/lexer used interchangeably here). The focus of the
scanner has always been on performance. The fact that it processes every
character of source that we parse scared the heck out of me, so I tried to
make it as fast as I could.

One thing I figured would help was combining the preprocessor into the
lexer. That way we didn't create an extra character stream, which in theory
would be faster and require less memory. I also did things like using char[]s
instead of Strings (which was the famous 2.0.2 bug fix, a rewrite of the
Scanner into Scanner2). In the end, 2.0.2 was much faster.

In retrospect, John Camelon, who inherited my mess, and I agree with
Markus's point: combining the preprocessor and lexer probably wasn't a good
idea. And what probably makes it worse is that we have the BaseScanner base
class that serves both the old parser and the DOM parsers, making it even
harder to maintain.

There are some fundamental things there I would like to preserve, such as the
buffer stack and the location map. But separating out the code that creates
the tokens and making the preprocessor look like a character stream would be
a much cleaner architecture. We should also really get rid of the old parser
and clean things up while we're at it.
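
To illustrate the buffer stack idea, here is a minimal sketch in Java. It is
not the CDT code: the class BufferStackStream and its methods push, read and
currentSource are hypothetical names invented for this example. It only shows
how includes or macro expansions pushed onto a stack can be read back as one
flat character stream; a real scanner would also record offsets in a location
map as buffers are pushed and popped.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: a stack of char[] buffers read as one character stream. */
class BufferStackStream {
    /** One stack entry: a named buffer plus the read position within it. */
    private static final class Frame {
        final String name;   // file or macro name, for illustration only
        final char[] chars;
        int pos;

        Frame(String name, char[] chars) {
            this.name = name;
            this.chars = chars;
        }
    }

    private final Deque<Frame> stack = new ArrayDeque<>();

    /** Push an include or macro expansion; reading continues from this buffer. */
    void push(String name, char[] chars) {
        stack.push(new Frame(name, chars));
    }

    /** Name of the innermost buffer still on the stack, or null if none is left. */
    String currentSource() {
        return stack.isEmpty() ? null : stack.peek().name;
    }

    /** Return the next character, or -1 once every buffer is exhausted. */
    int read() {
        while (!stack.isEmpty()) {
            Frame top = stack.peek();
            if (top.pos < top.chars.length) {
                return top.chars[top.pos++];
            }
            stack.pop(); // this buffer is done; resume the one underneath
        }
        return -1;
    }
}
```

For example, pushing "main.c" with the text "ab", reading one character,
pushing an expansion "XY" and then reading to the end yields a, X, Y, b.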

From my quick look at Mike's preprocessor, it looks like he's taken the
spec's description of preprocessor tokens to heart. The best architecture
for ANTLR, though, would be to introduce a character stream that handles the
text replacement and conditionals and let the lexer focus on creating the
tokens. ANTLR actually generates most of the lexer for you, which I wouldn't
want to mess with. We could probably reuse that in the current parser as
well by redoing the DOMScanner. We should probably create language-specific
scanners while we're at it too.
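
As a rough sketch of what "a character stream that handles the text
replacement and conditionals" could look like, the Java class below filters
source text before any lexer sees it. Everything in it is an assumption made
for illustration: the class name CharStreamPreprocessor and its process
method are hypothetical, it works line by line, and it only understands
#define for object-like macros plus #ifdef, #else and #endif. It does not
rescan expansions, does not protect string literals or comments, keeps no
location map, and throws on an unmatched #else or #endif.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: preprocess characters first, lex the result afterwards. */
class CharStreamPreprocessor {
    private static final Pattern IDENT = Pattern.compile("[A-Za-z_][A-Za-z_0-9]*");

    private final Map<String, String> macros = new HashMap<>();
    // One entry per open #ifdef; each value already accounts for enclosing blocks.
    private final Deque<Boolean> active = new ArrayDeque<>();

    private boolean isActive() {
        return active.isEmpty() || active.peek();
    }

    /** Return the source with directives applied and removed. */
    String process(String source) {
        StringBuilder out = new StringBuilder();
        for (String line : source.split("\n")) {
            String t = line.trim();
            if (t.startsWith("#ifdef ")) {
                active.push(isActive() && macros.containsKey(t.substring(7).trim()));
            } else if (t.equals("#else")) {
                boolean taken = active.pop();
                active.push(isActive() && !taken); // isActive() now reflects the parent
            } else if (t.equals("#endif")) {
                active.pop();
            } else if (t.startsWith("#define ")) {
                if (isActive()) {
                    String[] parts = t.substring(8).trim().split("\\s+", 2);
                    macros.put(parts[0], parts.length > 1 ? parts[1] : "");
                }
            } else if (isActive()) {
                out.append(expand(line)).append('\n');
            }
        }
        return out.toString();
    }

    /** Replace each identifier that names a macro with its replacement text. */
    private String expand(String line) {
        Matcher m = IDENT.matcher(line);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(line, last, m.start());
            String id = m.group();
            sb.append(macros.getOrDefault(id, id));
            last = m.end();
        }
        sb.append(line, last, line.length());
        return sb.toString();
    }
}
```

The returned String could be wrapped in a java.io.StringReader and handed to
whatever lexer is in use. Note that once directives and macro names are gone
from the text, the original offsets are gone too, which is why a location map
would have to be maintained alongside a stream like this.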

Anyway, maybe topics for CDT Ganymede...

Doug Schaefer, QNX Software Systems
Eclipse CDT Project Lead, http://cdtdoug.blogspot.com


> -----Original Message-----
> From: cdt-dev-bounces@xxxxxxxxxxx [mailto:cdt-dev-bounces@xxxxxxxxxxx] On
> Behalf Of Schorn, Markus
> Sent: Wednesday, June 20, 2007 2:58 AM
> To: CDT General developers list.
> Subject: RE: [cdt-dev] decoupled preprocessor
> 
> Mike,
> is there a chance that we can use your decoupled preprocessor for the
> current C- and C++-parsers? The DOM-Scanner really is a nightmare to
> maintain.
> 
> Markus.
> 
> > -----Original Message-----
> > From: cdt-dev-bounces@xxxxxxxxxxx
> > [mailto:cdt-dev-bounces@xxxxxxxxxxx] On Behalf Of Mike Kucera
> > Sent: Tuesday, June 19, 2007 11:40 PM
> > To: CDT General developers list.
> > Subject: RE: [cdt-dev] decoupled preprocessor
> >
> > It looks like you are planning to do preprocessing on the raw character
> > stream and then feed the result to your ANTLR lexer.
> >
> > The C99 preprocessor works differently: it processes a token stream, not
> > a character stream. It creates a CodeReader for each include, passes it
> > to the lexer and expects a token stream as the result. It then adds the
> > token stream to its own input and continues processing.
> >
> > I don't know which approach makes more sense with ANTLR. With LPG I was
> > able to separate the lexer and parser and stick the preprocessor
> > in-between.
> >
> > I believe that doing lexing before preprocessing makes the preprocessing
> > phase much easier to write and maintain. For example, the C99 preprocessor
> > doesn't need to deal with comments, which, judging from the bug reports,
> > have created many issues in the DOM scanner. Also, the code is cleaner
> > because it is processing a token stream instead of a raw character stream
> > (for example, compare Macro.invoke() to
> > BaseScanner.expandFunctionStyleMacro()).
> >
> > Also, if you return raw characters from the preprocessor, then how will
> > you calculate the offsets on the AST nodes? The offsets are normally
> > contained in the tokens.
> >
> > > But if you already have everything we've done
> > > there, then that might be the better approach.
> >
> > Well, I hope so :) It's pretty new and I'm still working out the bugs. It
> > does have a few features the DOM scanner doesn't, like support for
> > trigraphs.
> >
> > I hope you do decide to give it a try. I'll decouple it soon.
> >
> >
> > Mike Kucera
> > Software Developer
> > IBM CDT Team, Toronto
> > mkucera@xxxxxxxxxx
> >
> >
> > From: Doug Schaefer <DSchaefer@xxxxxxm>
> > Sent by: cdt-dev-bounces@eclipse.org
> > Date: 06/19/2007 03:53 PM
> > To: "CDT General developers list." <cdt-dev@xxxxxxxxxxx>
> > Subject: RE: [cdt-dev] decoupled preprocessor
> > Please respond to: "CDT General developers list." <cdt-dev@eclipse.org>
> >
> > Yes, it is definitely something I'll need. I'll need to take a look at
> > what you've done. ANTLR uses its own character stream interface to feed
> > characters to the lexer. It provides implementations that can pull that
> > out of Readers and InputStreams. I will likely want to create a new one
> > that doesn't try to load it all into a char[] at startup like the
> > built-in ones do. We can then hook that up to the preprocessor.
> >
> > I'm not sure how you built yours, but the easiest path I can see is to
> > take our current scanner and replace nextToken with getChar and strip out
> > anything that creates a token. But if you already have everything we've
> > done there, then that might be the better approach.
> >
> > Anyway, another shiny object flew by called CDT user docs, so I'll get
> > back to ANTLR in a few days :).
> >
> > Cheers,
> > Doug Schaefer, QNX Software Systems
> > Eclipse CDT Project Lead, http://cdtdoug.blogspot.com
> >
> >
> > > -----Original Message-----
> > > From: cdt-dev-bounces@xxxxxxxxxxx [mailto:cdt-dev-bounces@xxxxxxxxxxx] On
> > > Behalf Of Mike Kucera
> > > Sent: Tuesday, June 19, 2007 3:43 PM
> > > To: CDT General developers list.
> > > Subject: [cdt-dev] decoupled preprocessor
> > >
> > >
> > > Hi Doug,
> > >
> > > I take it from your latest blog post that you are going to be in need
> > > of a preprocessor for your ANTLR C++ experiment. I was planning on
> > > decoupling the preprocessor that I wrote for the C99 parser so that it
> > > can be used with any parser. If you are interested in picking this up,
> > > when would you need it?
> > >
> > > Mike Kucera
> > > Software Developer
> > > IBM CDT Team, Toronto
> > > mkucera@xxxxxxxxxx
> > >
> > > _______________________________________________
> > > cdt-dev mailing list
> > > cdt-dev@xxxxxxxxxxx
> > > https://dev.eclipse.org/mailman/listinfo/cdt-dev

