Request for Proposal
Status: RFP is Closed

Intelligent Chart Recognition and Data Extrapolation from PDF Documents

Request Number
RFP_2018_3920
Due Date
Jan 18
Opportunity

Product Acquisition, Supplier Agreement

Timeline

Phase 1 – Proof of concept in 6-12 months

Phase 2 – Commercial application in 6-12 months

Financials

Financial support for the proof of concept phase will be negotiated based on specific performance targets agreed between both parties

RFP was closed on
Jan 2019

RFP Description

NineSigma, representing a major Oil & Gas company, invites proposals for an intelligent data analysis and image recognition system that can process PDF files, correctly recognize the figures they contain, and extract the underlying data in Excel format.

Background

NineSigma’s client has a large number of articles and documents in PDF format. These PDF documents contain a mix of text, charts and other graphical information like photos.

While character recognition routines work well for extracting text, the analysis of charts is still a challenging problem. Such charts are currently processed manually to extract the underlying data.

 

The aim of this request is to find a technology that can solve one or more of the following challenges:

  • Recognize from a PDF file all charts belonging to one or more of the following categories:
    • Histograms
    • Pie Charts
    • XY plots (either scatter plots or straight lines)
    • Basin Depth Maps (geographic maps with coordinates)
    • Core Gamma Ray Curves (multiple depth-value curves plotted side by side in the vertical direction)
  • Extrapolate, by analyzing the chart, the original underlying data table:
    • For Histograms, Pie Charts, XY Plots and Core Gamma Ray Curves: extract a data table that can be used to accurately reconstruct the original chart
    • For Basin Depth Maps: extract the coordinates of the contours in a format that can be imported into a GIS (Geographic Information System).

 

An efficient document management system able to process multiple PDF files automatically will save time and money by reducing manual data entry.

 

Anticipated Project Phases or Project Plan

Phase 1 – Proof of concept

 

Histograms, Pie charts and XY Plots:

  • Prove the demo version of the technology on a single PDF chart at a time and write the output, i.e. the inferred underlying table, to an Excel file (see examples below)
  • The Excel chart that is rebuilt from that table should match the original one (see examples below)
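
For XY plots, one possible way to turn detected chart points into a data table is a linear axis calibration: two labelled ticks per axis fix the mapping from pixel position to data value. The following is a minimal sketch under stated assumptions, not a prescribed method. It assumes linear (not logarithmic) axes and that point and tick pixel positions have already been detected by some earlier step. The names `AxisCalibration`, `points_to_table` and `table_to_csv` are hypothetical, and CSV is used here only as a stand-in that Excel can open.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisCalibration:
    """Linear pixel-to-value mapping defined by two labelled axis ticks (assumed linear axis)."""
    pixel_a: float  # pixel position of the first labelled tick
    value_a: float  # data value printed at that tick
    pixel_b: float  # pixel position of the second labelled tick
    value_b: float  # data value printed at that tick

    def to_value(self, pixel: float) -> float:
        # Linear interpolation (or extrapolation) between the two ticks.
        span = self.pixel_b - self.pixel_a
        return self.value_a + (pixel - self.pixel_a) * (self.value_b - self.value_a) / span


def points_to_table(pixel_points, x_axis, y_axis):
    """Convert detected (pixel_x, pixel_y) points into (x, y) data rows."""
    return [(x_axis.to_value(px), y_axis.to_value(py)) for px, py in pixel_points]


def table_to_csv(rows, header=("x", "y")):
    """Serialize the inferred table as CSV text, which Excel can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative values only: image rows grow downwards, so the y axis is inverted.
x_axis = AxisCalibration(pixel_a=100, value_a=0.0, pixel_b=500, value_b=10.0)
y_axis = AxisCalibration(pixel_a=400, value_a=0.0, pixel_b=100, value_b=30.0)
table = points_to_table([(100, 400), (300, 250), (500, 100)], x_axis, y_axis)
csv_text = table_to_csv(table)
```

Plotting the resulting table in Excel and comparing it with the original figure is then a direct check of the matching requirement above.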

 

 

 

Basin Depth Maps

  • Prove the demo version of the technology on a single PDF figure containing a Basin Depth Map at a time
  • The focus will be restricted to maps that explicitly indicate the coordinates (see figure below)

 


Credit: U.S. Geological Survey
Department of the Interior/USGS
U.S. Geological Survey/URL: https://pubs.usgs.gov/of/2002/0353/depth.html

 

 

  • For each contour within the map, the technology should be able to derive the coordinates of the points on the contour with enough accuracy to reconstruct the same shape within a GIS
  • The contour coordinates will be written in a format suitable for import into a GIS
  • The contours displayed by the GIS should match the contours displayed in the original picture
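
As an illustration of what a GIS-importable output might look like, the sketch below converts one traced contour from pixel coordinates to longitude/latitude and wraps it in a GeoJSON `LineString` feature. GeoJSON is only one possible interchange format and is an assumption here, as are the function names, the `depth_m` property and all numeric values. The sketch also assumes a north-up map whose printed coordinates vary linearly with pixel position, which will not hold for every projection.

```python
import json


def linear_map(pixel, pixel_a, value_a, pixel_b, value_b):
    """Interpolate a coordinate from two labelled ticks (assumes a linear, unrotated axis)."""
    return value_a + (pixel - pixel_a) * (value_b - value_a) / (pixel_b - pixel_a)


def contour_to_feature(contour_px, lon_ticks, lat_ticks, depth_m):
    """Build a GeoJSON Feature from a contour given as (pixel_x, pixel_y) points.

    lon_ticks and lat_ticks are each ((pixel_a, value_a), (pixel_b, value_b)),
    read from the coordinate labels printed on the map frame.
    """
    (xa, lon_a), (xb, lon_b) = lon_ticks
    (ya, lat_a), (yb, lat_b) = lat_ticks
    coordinates = [
        # GeoJSON positions are ordered [longitude, latitude].
        [linear_map(px, xa, lon_a, xb, lon_b), linear_map(py, ya, lat_a, yb, lat_b)]
        for px, py in contour_px
    ]
    return {
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": coordinates},
        "properties": {"depth_m": depth_m},  # hypothetical attribute name
    }


# Illustrative values only.
feature = contour_to_feature(
    contour_px=[(0, 0), (500, 400), (1000, 800)],
    lon_ticks=((0, -120.0), (1000, -119.0)),
    lat_ticks=((0, 35.0), (800, 34.0)),
    depth_m=2000,
)
geojson_text = json.dumps({"type": "FeatureCollection", "features": [feature]}, indent=2)
```

Loading such a file into a GIS and overlaying it on the georeferenced original image would be one way to check that the displayed contours match.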

 

Core Gamma Ray Curves

  • Prove the demo version of the technology on a single PDF figure containing a Core Gamma Ray Curve at a time
  • Core Gamma Ray Curves are typically displayed as shown in the figure below. Several depth-value curves are displayed side by side in the vertical direction. Each curve has a header which specifies the value ranges
  • For each curve, generate an Excel file containing the name of the curve and the depth-value pairs that can be used to reconstruct the chart in Excel
  • The Excel charts that are rebuilt from the depth-value pairs should match the original
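
The role of the curve header can be sketched as follows: the value range printed in the header is assumed to span the full width of that curve's track, while the vertical position maps to depth through two labelled depth ticks. This is a hedged illustration only. The function name, the track geometry, the 0 to 150 header range and the depth values are all hypothetical, and a linear value scale is assumed.

```python
def trace_to_depth_value_pairs(trace_px, track_left_px, track_right_px,
                               header_min, header_max,
                               depth_tick_a, depth_tick_b):
    """Convert a traced curve, given as (pixel_column, pixel_row) points, into (depth, value) pairs.

    header_min / header_max: value range read from the curve header, assumed to
        correspond to the left and right edges of the track.
    depth_tick_a / depth_tick_b: (pixel_row, depth) for two labelled depth ticks.
    """
    (row_a, depth_a), (row_b, depth_b) = depth_tick_a, depth_tick_b
    track_width = track_right_px - track_left_px
    pairs = []
    for column, row in trace_px:
        # Horizontal position within the track selects the curve value.
        fraction = (column - track_left_px) / track_width
        value = header_min + fraction * (header_max - header_min)
        # Vertical position selects the depth.
        depth = depth_a + (row - row_a) * (depth_b - depth_a) / (row_b - row_a)
        pairs.append((depth, value))
    return sorted(pairs)  # ordered by increasing depth


# Illustrative values only.
curve_name = "Gamma Ray"
pairs = trace_to_depth_value_pairs(
    trace_px=[(300, 350), (250, 100), (400, 600)],
    track_left_px=200, track_right_px=400,
    header_min=0.0, header_max=150.0,
    depth_tick_a=(100, 1000.0), depth_tick_b=(600, 1050.0),
)
```

Each curve's name and its depth-value pairs could then be written to a separate file, one per curve, as required above.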

 

 

Phase 2 – Commercial application of one or more of the proofs of concept demonstrated above

Key Success Criteria

The successful technology will:

  • Be able to process PDF documents (including scanned ones) in different formats, sizes and resolutions.
    • Documents can be processed one by one or in batch mode.
  • Give as output a structured database containing, for each chart:
    • References to the extracted chart (path/name of the PDF, page, chart name, etc.)
    • Classification of the chart among the given/supported categories
    • The underlying data table extrapolated by analyzing the original chart
  • Recognize and analyze one or more of the chart types described in the previous sections.
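
To make the structured output concrete, the sketch below stores one record per chart with its source reference, its category and its inferred table. It is an assumption-laden illustration, not a required schema: SQLite is chosen only because it ships with Python, and the table name, column names, category labels and sample values are all hypothetical.

```python
import json
import sqlite3

# Hypothetical labels for the five chart categories named in this request.
SUPPORTED_CATEGORIES = (
    "histogram",
    "pie_chart",
    "xy_plot",
    "basin_depth_map",
    "core_gamma_ray_curve",
)


def create_store(connection):
    """Create the table holding one row per extracted chart."""
    connection.execute(
        """CREATE TABLE IF NOT EXISTS charts (
               id INTEGER PRIMARY KEY,
               pdf_path TEXT NOT NULL,
               page INTEGER NOT NULL,
               chart_name TEXT NOT NULL,
               category TEXT NOT NULL,
               data_json TEXT NOT NULL
           )"""
    )


def add_chart(connection, pdf_path, page, chart_name, category, rows):
    """Insert one chart record; the inferred table is stored as JSON text."""
    if category not in SUPPORTED_CATEGORIES:
        raise ValueError(f"unsupported chart category: {category}")
    cursor = connection.execute(
        "INSERT INTO charts (pdf_path, page, chart_name, category, data_json) "
        "VALUES (?, ?, ?, ?, ?)",
        (pdf_path, page, chart_name, category, json.dumps(rows)),
    )
    return cursor.lastrowid


# Illustrative values only.
connection = sqlite3.connect(":memory:")
create_store(connection)
chart_id = add_chart(
    connection, "reports/example.pdf", 3, "Figure 2", "pie_chart",
    [["Sandstone", 60.0], ["Shale", 40.0]],
)
stored = connection.execute(
    "SELECT category, data_json FROM charts WHERE id = ?", (chart_id,)
).fetchone()
```

Batch processing would amount to calling `add_chart` once per chart found across the input PDFs.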
Possible Approaches

Possible approaches might include, but are not limited to:

  • Novel OCR methods
  • Advanced image processing and feature recognition technologies
  • Technologies that make use of machine learning approaches
Approaches not of Interest

The following approaches are not of interest:

  • Technologies or approaches that are not supported by relevant data or experience in design and development of recognition systems or software
  • Projects that start from a concept for a PDF data analysis platform that might meet the requirements only at some undefined future point
Items to be Submitted

Your response should address the following:

  • A non-confidential description of the proposed approach
  • List of the challenges that will be addressed (one or more of: Histograms, Pie Charts, XY Plots, Basin Depth Maps, Core Gamma Ray Curves)
  • High level description of proposed analysis technology including:
    • Working principle of the analysis method
    • If available, examples of analysis on similar image/data sets
    • Availability of technical features including output type of analysis report, quality control, flexibility, ease of use
  • Description of deliverables, timing and budget to evaluate a sample data set
  • Technical maturity of the approach (concept, prototype, ready to commercialize, commercialized)
  • Any missing features and development plan to include these
  • Desired relationship with client
  • IP situation
  • Team description

 

Appropriate responses to this Request

Responses from companies (small to large), researchers, consultants, venture capitalists, entrepreneurs, or inventors are welcome. For example:

 

You represent a company that has a demonstrated track record in advanced image processing technologies suitable for this request

You represent a university research department that is working on new concepts in image processing technologies that could be suitable for this request
