The WorkerThread Blog

we know stuff so you don't have to

Indexing PDFs – Can Someone Make it Easier Please?

Posted by workerthread on October 14, 2008

The two posts on this blog with the greatest number of hits are the ones on configuring a PDF iFilter with WSS 3.0 and on using Adobe Reader 9 with SharePoint. Almost every SharePoint implementation I’ve been involved in has required setting up a PDF iFilter, and I would say that after standard Office documents (mostly Word, but some PowerPoint and Excel), PDFs are the file type most commonly uploaded to SharePoint document libraries.

So please, could someone somewhere make it easier for SharePoint Admins to set up their servers for crawling and indexing PDF documents!  I really would like to see the day when I don’t have to mess with registry settings and XML files to get this to work!

Sadalit Van Buren has a post on her wishlist for the next version of SharePoint. As she says, “forget the relationship with Adobe already, so that the Acrobat Filter is out of the box!” I also spotted a post on the Res Cogitans blog with a speculative SharePoint v14 Feature List, which mentions PDF support as a “Probably”. I really, really hope so.

If this does happen, I would really like to get the metadata captured as well, in the same way as Office documents.  PDF document properties generally look like this:

[Screenshot: PDF document properties in Adobe Reader]

So of course I would like to get them automatically mapped to document library columns on upload.  Bamboo Solutions are moving towards a solution with their pre-release PDF Document Parser, so maybe as this progresses at least one of my wishes will come true…


6 Responses to “Indexing PDFs – Can Someone Make it Easier Please?”

  1. […] Indexing PDFs – Can Someone Make it Easier Please? (Worker Thread Blog) […]

  2. Leonard Rosenthol said

    Be aware that the screen shot you showed from Adobe Reader only lists the OLDER PDF properties and metadata. Since Acrobat/Reader 5, Adobe uses an XML-based metadata schema called XMP.

    Leonard Rosenthol
    Adobe Systems

  3. Hi Leonard

    Thanks for the XMP note. It's also worth bearing in mind, though, that there is a HUGE number of OLDER format PDF files out there which people want to store, index and search for in products like SharePoint. Frankly, if I could automatically map Title, Author, Subject and Keywords, as well as get full-text indexing with the iFilter, I would be a happy bunny.

    Derek

  4. Asher said

    How to index PDF files

    By default, PDF files will not be indexed (and therefore searchable) by MOSS. Here is a guide to make this possible.

    1. Download the Adobe PDF iFilter version 6.0, available from Adobe here
    http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611

    2. Download an icon for PDFs, also available from Adobe here

    3. Install the iFilter on the SharePoint search server(s) and restart IIS

    4. Add a registry entry for the .pdf extension in the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\\Gather\Search\Extensions\ExtensionList.

    To do this, open the registry editor and navigate to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\\Gather\Search\Extensions\ExtensionList\. Find the highest-numbered value in the key. On a default installation of WSS, the highest entry is 37.

    Create a string value named with the next number (e.g. 38) by choosing New String Value. Double-click the value you just created and, in the Value Data box, type: pdf. Note there is no dot preceding the extension.

    5. Add 2 registry values in the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\.pdf

    – Value Name: FileTypeBucket; Type: REG_DWORD; Data: 0x00000001 (1)
    – Value Name: MimeTypes; Type: REG_SZ; Data: application/pdf

    6. Restart the Windows SharePoint Services Search service. Open a command prompt and type:

    net stop spsearch
    net start spsearch

    7. Rebuild the WSS search index. Open a command prompt, navigate to Program Files\Common Files\Microsoft Shared\web server extensions\12\BIN and type the following commands:

    stsadm.exe -o spsearch -action fullcrawlstop
    stsadm.exe -o spsearch -action fullcrawlstart
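
As a rough illustration of the two registry edits in steps 4 and 5, the Python sketch below picks the next free value name for ExtensionList and renders .reg file text for the .pdf filter key. The function names are made up for this example, and the fallback to "1" for an empty list is an assumption. The existing value names would still have to be read from the registry by hand or with a tool of your choice. Nothing here touches the registry itself.

```python
# Sketch only: helper names are hypothetical and nothing here writes to the registry.

# Key from step 5, copied from the guide above.
PDF_FILTER_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools"
    r"\Web Server Extensions\12.0\Search\Setup"
    r"\ContentIndexCommon\Filters\Extension\.pdf"
)


def next_extension_value_name(existing_names):
    """Return the next numeric value name, e.g. '38' when the highest is '37'."""
    # Ignore non-numeric names such as "(Default)".
    numbers = [int(name) for name in existing_names if name.isdecimal()]
    # Assumption: start at 1 if the key has no numbered values yet.
    return str(max(numbers) + 1) if numbers else "1"


def render_pdf_filter_reg(key_path=PDF_FILTER_KEY):
    """Render .reg file text holding the two values from step 5."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        '"FileTypeBucket"=dword:00000001',   # REG_DWORD, data 1
        '"MimeTypes"="application/pdf"',     # REG_SZ
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # Value names 1..37, as on the default WSS installation described above.
    print(next_extension_value_name([str(n) for n in range(1, 38)]))  # prints 38
    print(render_pdf_filter_reg())
```

The rendered text could be saved as a .reg file, reviewed, and then imported with the registry editor. Back up the registry before importing anything.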

    Any existing PDFs will appear in search results once they have been indexed, but they will still not have the correct icon. So, while the site is being indexed, add the icon to SharePoint.

    1. Open the folder Program Files\Common Files\Microsoft Shared\Web Server Extensions\12\Template\Images.
    2. Copy the icon gif you downloaded earlier (step 2 above) into the folder.
    3. Open the folder Program Files\Common Files\Microsoft Shared\Web server extensions\12\Template\Xml.
    4. Right-click the file docicon.xml and choose Open With and select Notepad.
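    5. In the ByExtension element, you’ll see a number of Mapping elements. You will add one for pdf. It does not have to be in alphabetical order. The element you need to add is a Mapping element with its Key attribute set to pdf and its Value attribute set to the file name of the gif you copied into the Images folder.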

    6. Save that file and close Notepad.
    7. Restart IIS
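
The docicon.xml edit in step 5 can also be sketched in Python. This is a minimal example, not a tested deployment script: the sample XML is a trimmed-down stand-in for the real file, the icon name pdficon.gif is a placeholder for whatever gif you copied, and add_icon_mapping is a name invented for this sketch. It assumes the Mapping entries sit inside a ByExtension element, as they do in a stock docicon.xml.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for docicon.xml (the real file has many more entries).
SAMPLE_DOCICON = """<DocIcons>
  <ByExtension>
    <Mapping Key="doc" Value="icdoc.gif"/>
  </ByExtension>
</DocIcons>"""


def add_icon_mapping(docicon_xml, extension, icon_file):
    """Return docicon XML text with a Mapping for `extension` added if missing."""
    root = ET.fromstring(docicon_xml)
    by_extension = root.find("ByExtension")
    if by_extension is None:
        raise ValueError("no <ByExtension> element found")
    # Leave the document untouched if the extension is already mapped.
    for mapping in by_extension.findall("Mapping"):
        if mapping.get("Key", "").lower() == extension.lower():
            return docicon_xml
    ET.SubElement(by_extension, "Mapping", {"Key": extension, "Value": icon_file})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "pdficon.gif" is a placeholder; use the name of the gif you copied.
    print(add_icon_mapping(SAMPLE_DOCICON, "pdf", "pdficon.gif"))
```

Keep a backup copy of docicon.xml before changing it, whether by hand in Notepad or with a script.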

    Previously added PDFs and new PDFs will now appear in the search results.

  5. Asher said

    Now if someone could add a guide so that PDF file properties can be searched, that would be great 😀

  6. Asher: if you want to crawl PDF metadata so that it can be searched, you may want to look at something like the Bamboo PDF Document Parser which allows you to import PDF metadata and form data into a SharePoint list – then it can become searchable. More details here http://community.bamboosolutions.com/blogs/bambooteamblog/archive/2008/09/12/announcing-the-pdf-document-parser-in-bamboo-labs.aspx

    Derek
