search inside pdf
I've got a new install of Nuxeo (I downloaded the 5.7.1 VMware image). It seems to launch and work fine.
I uploaded 2 different PDF files, but the search does not find the text inside one of them. Why?
That PDF is not an image: its text is selectable and searchable in Adobe Reader.
From your traces above, you're the victim of a PDFBox bug: PDFBOX-1512.
A workaround is to add this to JAVA_OPTS inside bin/nuxeo.conf:
-Djava.util.Arrays.useLegacyMergeSort=true
See the other places where JAVA_OPTS is used in this file.
08/05/2013
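As a sketch of the workaround above: assuming your bin/nuxeo.conf appends options by referencing $JAVA_OPTS (check how the existing JAVA_OPTS lines in your copy are written and follow the same pattern), the added line could look like this:

```
# Workaround for PDFBOX-1512: make java.util.Arrays.sort use the legacy merge sort
JAVA_OPTS=$JAVA_OPTS -Djava.util.Arrays.useLegacyMergeSort=true
```

The server has to be restarted for a new JVM option to take effect. The file that failed will probably also need to be uploaded again before its text shows up in search.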
Maybe the PDF library didn't manage to parse this file in particular. You can check whether that's the case by looking for errors in the Nuxeo logs (look for "logs/server.log" in the Nuxeo folder; I'm not sure where it is on the VM, though).
2013-08-02 13:06:56,525 WARN [org.nuxeo.ecm.core.storage.sql.FulltextExtractorWork] Could not extract fulltext of file 'eli.pdf' for document: 1189530d-0b13-4ce6-93e8-8834a759ff88: org.nuxeo.ecm.core.convert.api.ConversionException: Error during text extraction with PDFBox
This:
2013-08-05 14:33:37,425 WARN [org.nuxeo.ecm.core.storage.sql.FulltextExtractorWork] Could not extract fulltext of file 'eli3.pdf' for document: f8af5c77-d464-430d-b7f4-e54f1dfecd2d: org.nuxeo.ecm.core.convert.api.ConversionException: Error during text extraction with PDFBox
2013-08-05 14:33:37,425 DEBUG [org.nuxeo.ecm.core.storage.sql.FulltextExtractorWork] Could not extract fulltext of file 'eli3.pdf' for document: f8af5c77-d464-430d-b7f4-e54f1dfecd2d: org.nuxeo.ecm.core.convert.api.ConversionException: Error during text extraction with PDFBox
org.nuxeo.ecm.core.convert.api.ConversionException: Error during text extraction with PDFBox
Caused by: java.lang.IllegalArgumentException: Comparison method violates its general contract!