NineSigma’s client has a large number of articles and documents in PDF format. These PDF documents contain a mix of text, charts and other graphical information like photos.
While character recognition routines work well for extracting text, the analysis of charts remains a challenging problem. Such charts are currently processed manually to extract the underlying data.
The aim of this request is to find a technology that can solve one or more of the following challenges:
- Recognize, within a PDF file, all charts belonging to one or more of the following categories:
  - Histograms
  - Pie Charts
  - XY Plots (either scatter plots or straight lines)
  - Basin Depth Maps (geographic maps with coordinates)
  - Core Gamma Ray Curves (multiple depth-value curves plotted side by side in the vertical direction)
- Infer, by analyzing the chart, the original underlying data table:
  - For Histograms, Pie Charts, XY Plots and Core Gamma Ray Curves: extract a data table that can be used to accurately reconstruct the original chart
  - For Basin Depth Maps: extract the coordinates of the contours in a format that can be imported into a GIS (Geographic Information System).
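As an illustration of what inferring the table can involve in the simplest case, a pie chart: once the wedge boundaries have been detected, each wedge's share follows from its angular span. The sketch below assumes the boundary angles are already available in degrees; the function name `wedge_shares` is hypothetical, and the detection step itself is not shown.

```python
def wedge_shares(boundary_angles_deg):
    """Convert wedge boundary angles (degrees, any order) into fractional shares.

    Assumption: each angle marks the start of one wedge, and the wedge runs
    to the next boundary in increasing-angle order, wrapping at 360 degrees.
    """
    angles = sorted(a % 360.0 for a in boundary_angles_deg)
    if len(angles) < 2:
        # A single boundary (or none) means one wedge covering the whole pie.
        return [1.0]
    shares = []
    for i, start in enumerate(angles):
        end = angles[(i + 1) % len(angles)]
        span = (end - start) % 360.0  # wraps the last wedge past 360
        shares.append(span / 360.0)
    return shares


# Three boundaries at 0, 90 and 180 degrees: two quarter wedges and one half.
shares = wedge_shares([0, 90, 180])
```

Multiplying each share by a known total, where the chart states one, would give the absolute values for the table.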
An efficient document management system able to process multiple PDF files automatically will save time and money by reducing manual data entry.
Anticipated Project Phases or Project Plan
Phase 1 – Proof of concept
Histograms, Pie charts and XY Plots:
- Demonstrate the demo version of the technology on one PDF chart at a time and write the output, i.e. the inferred underlying table, to an Excel file (see examples below)
- The Excel chart rebuilt from that table should match the original one (see examples below)
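For XY plots, one common way to recover data values is to calibrate each axis from two labelled tick marks and map detected point positions linearly from pixels to data units. The following is a minimal sketch under that assumption (linear axes, tick positions and point positions already detected). All function names are hypothetical, and CSV is used here as a stand-in for the Excel output because Excel can open it directly.

```python
import csv
import io


def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value.

    Assumption: the axis is linear and two tick marks with known values
    have been located at pixel_a and pixel_b.
    """
    if pixel_a == pixel_b:
        raise ValueError("calibration ticks must be at different pixels")
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


def pixels_to_table(points_px, x_map, y_map):
    """Convert detected (x, y) pixel positions into data-unit rows."""
    return [(x_map(px), y_map(py)) for px, py in points_px]


def table_to_csv(rows, header=("x", "y")):
    """Serialize the inferred table as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Image rows grow downward, so the y calibration runs from a larger pixel
# value (axis origin) to a smaller one (top of the axis).
x_map = make_axis_map(100, 0.0, 500, 100.0)
y_map = make_axis_map(300, 0.0, 100, 50.0)
rows = pixels_to_table([(300, 200), (500, 100)], x_map, y_map)
print(table_to_csv(rows))
```

Logarithmic axes would need the same mapping applied to the logarithm of the tick values; that case is not covered here.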
Basin Depth Maps
- Demonstrate the demo version of the technology on one PDF figure containing a Basin Depth Map at a time
- The focus will be restricted to maps that explicitly indicate their coordinates (see figure below)
Figure credit: U.S. Geological Survey, Department of the Interior/USGS. URL: https://pubs.usgs.gov/of/2002/0353/depth.html
- For each contour within the map, the technology should be able to derive the coordinates of the points on the contour with enough accuracy to reconstruct the same shape within a GIS
- The contour coordinates will be written in a format suitable for import into a GIS
- The contours displayed by the GIS should match the contours displayed in the original picture
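One GIS-importable format is GeoJSON, which stores each contour as a LineString of [longitude, latitude] positions. The sketch below assumes the contours have already been traced in pixel coordinates, that the map is not rotated, and that longitude and latitude vary linearly with pixel position over the map extent (calibrated from two labelled graticule ticks per axis). The function names and the `depth` property are hypothetical choices for illustration.

```python
import json


def linear_map(pixel_a, value_a, pixel_b, value_b):
    """Linear pixel-to-coordinate mapping from two labelled ticks."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


def contours_to_geojson(contours_px, lon_map, lat_map):
    """Build a GeoJSON FeatureCollection from traced contours.

    contours_px: list of (depth, [(px, py), ...]) pairs, with at least two
    points per contour, as a LineString requires.
    """
    features = []
    for depth, points in contours_px:
        # GeoJSON positions are ordered [longitude, latitude].
        coords = [[lon_map(px), lat_map(py)] for px, py in points]
        features.append({
            "type": "Feature",
            "properties": {"depth": depth},
            "geometry": {"type": "LineString", "coordinates": coords},
        })
    return {"type": "FeatureCollection", "features": features}


# Illustrative calibration values only, not taken from any real map.
lon_map = linear_map(0, -72.0, 400, -70.0)
lat_map = linear_map(0, 42.0, 200, 41.0)
collection = contours_to_geojson(
    [(100, [(0, 0), (200, 100), (400, 200)])], lon_map, lat_map
)
print(json.dumps(collection, indent=2))
```

Maps drawn in a projected coordinate system would need a proper reprojection step instead of the linear mapping assumed here.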
Core Gamma Ray Curves
- Demonstrate the demo version of the technology on one PDF figure containing a Core Gamma Ray Curve at a time
- Core Gamma Ray Curves are typically displayed as shown in the figure below. Several depth-value curves are displayed side by side in the vertical direction. Each curve has a header which specifies the value ranges
- For each curve, generate an Excel file containing the name of the curve and the depth-value pairs that can be used to reconstruct the chart in Excel
- The Excel charts that are rebuilt from the depth-value pairs should match the original
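Because each curve's header gives its value range, a traced curve can be rescaled by treating the left and right edges of its track as the header minimum and maximum. The sketch below assumes a linear value scale, a linear depth axis, and a curve already traced as pixel positions. The names are hypothetical, and CSV again stands in for the Excel file.

```python
import csv
import io


def digitize_track(trace_px, left_px, right_px, value_min, value_max, depth_map):
    """Convert a traced curve into (depth, value) pairs.

    trace_px: [(px, py), ...] pixel positions along the curve.
    left_px/right_px: pixel columns of the track edges, assumed to
    correspond to value_min and value_max from the curve header.
    depth_map: function from pixel row to depth.
    """
    width = right_px - left_px
    if width == 0:
        raise ValueError("track edges must be at different pixel columns")
    pairs = []
    # Sort by pixel row so the output runs from shallow to deep.
    for px, py in sorted(trace_px, key=lambda p: p[1]):
        fraction = (px - left_px) / width
        value = value_min + fraction * (value_max - value_min)
        pairs.append((depth_map(py), value))
    return pairs


def curve_to_csv(name, pairs):
    """Write the curve name followed by its depth-value pairs as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["curve", name])
    writer.writerow(["depth", "value"])
    writer.writerows(pairs)
    return buf.getvalue()


# Illustrative numbers only: a track spanning pixel columns 100 to 300 with
# a header range of 0 to 150, and 0.25 depth units per pixel row.
pairs = digitize_track(
    [(200, 200), (100, 0)], 100, 300, 0.0, 150.0,
    lambda py: 1000.0 + py * 0.25,
)
print(curve_to_csv("GR", pairs))
```

Curves that wrap around the track edge when they exceed the header range would need extra handling that this sketch leaves out.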
Phase 2 – Commercial application of one or more of the proofs of concept demonstrated above