Howto:Processing d-tpp using Python

From FlightGear wiki
Revision as of 18:30, 28 November 2017 by Hooray (talk | contribs) (→‎Motivation)
This article is a stub. You can help the wiki by expanding it.


Idea

If processing actual PDFs to "retrieve" such navigational data procedurally is ever supposed to "fly", I think it would have to be done using OpenCV running in a background thread (actually a bunch of threads in a separate process), using machine learning: feeding it a set of manually annotated PDFs, segmenting each PDF into sub-areas (horizontal/vertical profile, frequencies, identifier etc.) and running neural networks on those segments.

Basically, such a thing would need to be very modular to be feasible - i.e. parallel processing of the rasterized image on the GPU, splitting the chart into known components and retrieving the identifiers, frequencies, bearings etc. that way.

It is an interesting problem, and it would also address a bunch of legal issues - downloading such data from the web only works for a reason. But it would definitely be a rather complex piece of software, and we would want to get people involved who know machine learning and computer vision (OpenCV). It is kind of a superset of doing OCR on approach charts: not just recognizing a character set, but the actual document structure and the "iconography" for airports, navaids, route markers and so on.

Motivation

Screenshot showing scrapy scraping d-TPPs

Come up with the Python machinery to automatically download aviation charts and classify them for further processing/parsing (data extraction): http://155.178.201.160/d-tpp/

We will be downloading two different AIRAC cycles, i.e. at the time of writing 1712 & 1713:

Each directory contains a set of charts that will be post-processed by converting them to raster images.
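The per-cycle directory URLs can be derived from the base URL above; a minimal sketch (the helper name `cycle_url` is ours, the base URL and cycle numbers are from the article):

```python
# Build the chart directory URL for each AIRAC cycle we want to mirror.
BASE_URL = "http://155.178.201.160/d-tpp/"

def cycle_url(cycle):
    """Directory URL for one AIRAC cycle, e.g. '1712'."""
    return BASE_URL + cycle + "/"

urls = [cycle_url(c) for c in ("1712", "1713")]
```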

Data sources

Chart Classification

  • STARs - Standard Terminal Arrivals
  • IAPs - Instrument Approach Procedures
  • DPs - Departure Procedures
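The metafile (see below) tags each chart with a chart code; a hypothetical mapping from those codes to the three categories above could look like this (codes beyond these three exist, e.g. for minimums and airport diagrams, and are grouped as "OTHER" here):

```python
# Assumed chart_code values -> human-readable categories; anything
# we don't recognize falls through to "OTHER".
CHART_TYPES = {
    "STAR": "Standard Terminal Arrival",
    "IAP": "Instrument Approach Procedure",
    "DP": "Departure Procedure",
}

def classify(chart_code):
    """Map a metafile chart_code to one of the categories above."""
    return CHART_TYPES.get(chart_code, "OTHER")
```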

Modules

XML Processing

http://155.178.201.160/d-tpp/1712/xml_data/d-TPP_Metafile.xml
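A sketch of pulling chart records out of the metafile with the standard library's ElementTree. The element names used here (record, chart_code, pdf_name) are assumptions about the metafile layout and should be verified against the downloaded file; the inline sample stands in for the real download:

```python
import xml.etree.ElementTree as ET

# Stand-in for the real d-TPP_Metafile.xml; element names are assumed.
SAMPLE = """<digital_tpp>
  <record>
    <chart_code>IAP</chart_code>
    <pdf_name>00468IL23.PDF</pdf_name>
  </record>
</digital_tpp>"""

def extract_charts(xml_text):
    """Return (chart_code, pdf_name) pairs for every record element."""
    root = ET.fromstring(xml_text)
    return [(r.findtext("chart_code"), r.findtext("pdf_name"))
            for r in root.iter("record")]
```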

Scraping

Alternatively, use a media pipeline [1]

import os

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request


def createFolder(directory):
    """Create the download directory if it does not exist yet."""
    try:
        if not os.path.exists(directory):
            os.makedirs(directory)
    except OSError:
        print('Error: creating directory ' + directory)


class dTPPSpider(scrapy.Spider):
    name = "dtpp"

    allowed_domains = ["155.178.201.160"]
    start_urls = ["http://155.178.201.160/d-tpp/1712/"]

    def parse(self, response):
        # Follow every link in the cycle's directory listing.
        for href in response.css('a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        # Use the last URL component as the local filename.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open('./PDF/' + path, 'wb') as f:
            f.write(response.body)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

createFolder('./PDF/')
process.crawl(dTPPSpider)
process.start()  # the script will block here until the crawling is finished
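For the media-pipeline alternative mentioned above, scrapy's FilesPipeline downloads files itself when the spider yields items carrying a "file_urls" list; ITEM_PIPELINES and FILES_STORE are standard scrapy setting names, while the two helpers below are our own sketch:

```python
# Settings that enable scrapy's built-in file download pipeline.
PIPELINE_SETTINGS = {
    "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
    "FILES_STORE": "./PDF",
}

def make_item(url):
    """Item in the shape FilesPipeline expects: a 'file_urls' list."""
    return {"file_urls": [url]}

def is_chart_pdf(href):
    """Keep only links that point at chart PDFs (case-insensitive)."""
    return href.lower().endswith(".pdf")
```

In the spider's parse() method, one would then yield make_item(url) for every link where is_chart_pdf(url) is true instead of saving the response body by hand.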

Converting to images

pip3 install pdf2image [2] (pdf2image wraps poppler's pdftoppm, so poppler needs to be installed as well)

from pdf2image import convert_from_path
import tempfile

# Rasterize each page of a chart PDF into a list of PIL images.
with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/folder/example.pdf', output_folder=path)
    # Do something with the page images here

Image Randomization

Since we only have very little data, we need to come up with artificial data for training purposes - we can do so by randomizing our existing image set to create all sorts of "charts":
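A minimal sketch of the randomization step: derive random perturbation parameters per source chart, which an imaging library such as Pillow would then apply (rotate/scale/shift). The parameter ranges below are arbitrary choices, not taken from the article:

```python
import random

def random_params(rng):
    """Random perturbation parameters for one synthetic chart variant."""
    return {
        "angle": rng.uniform(-3.0, 3.0),   # slight rotation, degrees
        "scale": rng.uniform(0.9, 1.1),    # zoom in/out
        "dx": rng.randint(-20, 20),        # horizontal shift, pixels
        "dy": rng.randint(-20, 20),        # vertical shift, pixels
    }

rng = random.Random(42)  # fixed seed so runs are reproducible
params = [random_params(rng) for _ in range(5)]  # 5 variants per chart
```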

Uploading to the GPU

Classification

OCR

We don't just need to do character recognition, but also deal with aviation-specific symbology/iconography. Once again, we can refer to PDF files for the specific symbols. [3]
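Downstream of the character recognition itself, simple patterns can already pull structured data out of raw OCR text. The two regular expressions below are illustrative assumptions, not a complete grammar for approach charts:

```python
import re

# Rough patterns for the data we want to extract from OCR output.
FREQ_RE = re.compile(r"\b1[1-3]\d\.\d{1,3}\b")  # VHF nav/com frequencies
IDENT_RE = re.compile(r"\b[A-Z]{3,5}\b")        # navaid/fix identifiers

def extract_tokens(ocr_text):
    """Group candidate frequencies and identifiers found in OCR text."""
    return {
        "frequencies": FREQ_RE.findall(ocr_text),
        "identifiers": IDENT_RE.findall(ocr_text),
    }
```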

Prerequisites

pip install --user

  • requests
  • scrapy
  • pdf2image

See also


Related