Add pdf files to search index [message #484994] |
Thu, 10 September 2009 07:49 |
Mohamed Hussein Messages: 76 Registered: July 2009 |
Member |
Hello,
I would like to add PDF files to the help search index of my RCP product.
I understand that the Lucene library can index PDF files (and Word documents),
but it seems that Eclipse only supports HTML files.
Do I need to add a new LuceneSearchProvider and associate it with the pdf
extension?
Can I then just delegate adding to the index to the default indexer, or do I
need to use the Lucene APIs directly to parse and index the documents?
Thanks in advance for your help,
Mohamed.
Best Regards,
Mohamed.
Re: Add pdf files to search index [message #485978 is a reply to message #485412] |
Tue, 15 September 2009 19:13 |
Lee Anne Kowalski Messages: 54 Registered: July 2009 |
Member |
Mohamed Hussein wrote:
> Can I then just delegate adding to the index to the default indexer,
> or do I need to use Lucene APIs directly to parse and index the documents?
I think the Eclipse help also parses the HTML files, though I'm not
sure at which point in the indexing process. I do know that there is an
HTMLParser.java class in the org.eclipse.help.base.source JAR, and I
presume it is there because the HTML files have to be parsed at some
point in the process.
So I would imagine that to get PDF files into the Lucene index, the PDF
files would have to be parsed as well.
Chris Goldthorpe wrote:
> This sounds like the sort of feature that others in the community may
> have already implemented, and if so it would be great to get this
> contributed to Eclipse. I'll ask around at IBM to see if we have anyone
> looking into this. Meanwhile if anyone else on the newsgroup has
> implemented this or thought about implementing this I'd be interested to
> know what approach you used.
One thought is that the PDF document would need to be parsed. I just
went over to lucene.apache.org and the FAQ has this about indexing PDF:
http://wiki.apache.org/lucene-java/LuceneFAQ#head-c45f8b25d786f4e384936fa93ce1137a23b7e422
"In order to index PDF documents you need to first parse them to extract
text that you want to index from them. Here are some PDF parsers that
can help you with that:
PDFBox is a Java API from Ben Litchfield that will let you access the
contents of a PDF document. It comes with integration classes for Lucene
to translate a PDF into a Lucene document.
XPDF is an open source tool that is licensed under the GPL. It's not a
Java tool, but there is a utility called pdftotext that can translate
PDF files into text files on most platforms from the command line.
Based on xpdf, there is a utility called pdftohtml that can translate
PDF files into HTML files. This is also not a Java application.
JPedal is a Java API for extracting text and images from PDF documents."
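To make the "parse first, then index" step concrete, here is a minimal sketch of the pdftotext route from the FAQ, using only the Java standard library. It assumes that the pdftotext utility is installed and on the PATH, and that passing "-" as the output file sends the text to standard output. The class and method names (PdfTextExtractor, buildCommand, extractText) are made up for illustration and are not part of Eclipse, Lucene, or XPDF. How the returned text then gets into the Eclipse help index is a separate question that this does not answer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper: extracts plain text from a PDF by shelling out to pdftotext. */
class PdfTextExtractor {

    /** Builds the command line; "-" as the output file makes pdftotext write to stdout. */
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    /** Runs pdftotext (assumed to be on the PATH) and returns the extracted text. */
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                // Discard stderr so warnings from pdftotext cannot fill a pipe and stall the process.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        // Read stdout fully before waiting, so a large document cannot block on a full pipe.
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode + " for " + pdf);
        }
        return text;
    }
}
```

Calling PdfTextExtractor.extractText(Path.of("guide.pdf")) would give you a String that could be handed to whatever code builds the Lucene document. A pure-Java parser such as PDFBox would avoid the dependency on an external program.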
----------------------------------------
A link about PDFBox to extract the text from a PDF:
http://www.pdfbox.org/userguide/text_extraction.html
The PDFBox site says that it is licensed under the BSD License. I don't
know if that is compatible with the Eclipse license, such that PDFBox
would be a viable solution to ship with the Eclipse Platform itself.
XPDF and JPedal seem to be GPL or LGPL.
Hope that helps,
Lee Anne