Howto:Processing d-tpp using Python

From FlightGear wiki
Revision as of 18:12, 28 November 2017

This article is a stub. You can help the wiki by expanding it.


Motivation

Screenshot showing scrapy scraping d-TPPs

The goal is to come up with the Python machinery to automatically download aviation charts and classify them for further processing and parsing (data extraction). The charts are available from http://155.178.201.160/d-tpp/

We will be downloading two different AIRAC cycles; at the time of writing, these are 1712 and 1713.

Each directory contains a set of charts that will be post-processed by converting them to raster images.
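The cycle directory names do not have to be hard-coded. The following is a minimal sketch, assuming that the server keeps the d-tpp/<cycle>/ layout of the URL given earlier and that the directories follow the standard 28-day AIRAC numbering (two-digit year plus the ordinal of the cycle within that year). The helper names airac_cycle and cycle_url are hypothetical.

```python
from datetime import date, timedelta

# AIRAC cycle 1701 became effective on 5 January 2017; every cycle lasts 28 days.
AIRAC_EPOCH = date(2017, 1, 5)
CYCLE_LENGTH = timedelta(days=28)
BASE_URL = "http://155.178.201.160/d-tpp/"


def airac_cycle(day):
    """Return the YYNN identifier of the AIRAC cycle in effect on `day`."""
    cycles_since_epoch = (day - AIRAC_EPOCH) // CYCLE_LENGTH
    start = AIRAC_EPOCH + cycles_since_epoch * CYCLE_LENGTH
    # Ordinal of the cycle within the year of its effective date.
    ordinal = (start.timetuple().tm_yday - 1) // 28 + 1
    return "{:02d}{:02d}".format(start.year % 100, ordinal)


def cycle_url(cycle):
    """Build the directory URL for a cycle (assumed server layout)."""
    return "{}{}/".format(BASE_URL, cycle)


if __name__ == "__main__":
    today = date(2017, 11, 28)
    # Current cycle and the one that follows it.
    for day in (today, today + CYCLE_LENGTH):
        print(cycle_url(airac_cycle(day)))
```

Deriving the identifier from a date keeps the scraper usable after the cycles named in this article have expired, provided the numbering assumption holds.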

Data sources

Chart Classification

  • STARs - Standard Terminal Arrivals
  • IAPs - Instrument Approach Procedures
  • DPs - Departure Procedures
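The three chart classes above can be represented in code as follows. This is only a sketch: it assumes that each chart comes with a short type code such as "STAR", "IAP" or "DP" (for example from the d-TPP metafile), which should be checked against a real cycle. ChartClass and classify are hypothetical names.

```python
from enum import Enum


class ChartClass(Enum):
    STAR = "Standard Terminal Arrival"
    IAP = "Instrument Approach Procedure"
    DP = "Departure Procedure"


# Assumed mapping from chart type codes to the classes of interest.
CODE_TO_CLASS = {
    "STAR": ChartClass.STAR,
    "IAP": ChartClass.IAP,
    "DP": ChartClass.DP,
}


def classify(code):
    """Return the ChartClass for a type code, or None for any other chart."""
    return CODE_TO_CLASS.get(code.strip().upper())


if __name__ == "__main__":
    for code in ("IAP", "star", "APD"):
        print(code, classify(code))
```

Charts with any other code are mapped to None so that they can be skipped or handled separately.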

Modules

XML Processing

http://155.178.201.160/d-tpp/1712/xml_data/d-TPP_Metafile.xml
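The metafile can be read with the standard library. The sketch below assumes a structure in which airport_name elements (with an apt_ident attribute) contain record elements with chart_code, chart_name and pdf_name children. These element names are an assumption and should be verified against the downloaded file; the sample document and all of its values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample that mimics the assumed metafile structure.
SAMPLE_XML = """<digital_tpp cycle="1712">
  <state_code ID="ZZ">
    <city_name ID="EXAMPLE CITY">
      <airport_name ID="EXAMPLE FIELD" apt_ident="XXX">
        <record>
          <chart_code>IAP</chart_code>
          <chart_name>EXAMPLE APPROACH</chart_name>
          <pdf_name>EXAMPLE1.PDF</pdf_name>
        </record>
        <record>
          <chart_code>DP</chart_code>
          <chart_name>EXAMPLE DEPARTURE</chart_name>
          <pdf_name>EXAMPLE2.PDF</pdf_name>
        </record>
      </airport_name>
    </city_name>
  </state_code>
</digital_tpp>"""


def iter_charts(xml_text):
    """Yield one dict per chart record found in the metafile text."""
    root = ET.fromstring(xml_text)
    for airport in root.iter("airport_name"):
        for record in airport.iter("record"):
            yield {
                "airport": airport.get("apt_ident"),
                "code": record.findtext("chart_code"),
                "name": record.findtext("chart_name"),
                "pdf": record.findtext("pdf_name"),
            }


if __name__ == "__main__":
    for chart in iter_charts(SAMPLE_XML):
        print(chart)
```

For the real metafile, the downloaded bytes can be passed to ET.fromstring in the same way, or ET.parse can be used on the saved file.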

Scraping

The spider below writes each PDF to disk itself. Alternatively, use Scrapy's media pipeline [1]

import os
import scrapy

from scrapy.crawler import CrawlerProcess
from scrapy.http import Request

# To use the media pipeline instead of save_pdf(), enable it in the settings:
# ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1} plus FILES_STORE.

PDF_DIR = './PDF/'


def create_folder(directory):
    """Create the output directory if it does not exist yet."""
    try:
        os.makedirs(directory, exist_ok=True)
    except OSError:
        print('Error: Creating directory. ' + directory)


class dTPPSpider(scrapy.Spider):
    name = "dtpp"

    allowed_domains = ["155.178.201.160"]
    start_urls = ["http://155.178.201.160/d-tpp/1712/"]

    def parse(self, response):
        # Only follow links that point to PDF charts; directory links would
        # otherwise produce an empty file name in save_pdf().
        for href in response.css('a::attr(href)').extract():
            if href.lower().endswith('.pdf'):
                yield Request(
                    url=response.urljoin(href),
                    callback=self.save_pdf
                )

    def save_pdf(self, response):
        # The last URL component becomes the local file name.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(os.path.join(PDF_DIR, path), 'wb') as f:
            f.write(response.body)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

create_folder(PDF_DIR)
process.crawl(dTPPSpider)
process.start()  # the script will block here until the crawling is finished

Converting to images

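pip3 install pdf2image [2]

pdf2image wraps the poppler command-line utilities, which have to be installed separately.

from pdf2image import convert_from_path
import tempfile

# Render each page of the PDF to a PIL image. Use the images inside the
# with block, because the temporary directory is deleted when it exits.
with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/folder/example.pdf', output_folder=path)
    # Do something here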
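To convert a whole directory of downloaded charts, the same call can be wrapped in a loop. This is a sketch under assumptions: the directory names, the output naming scheme and the helpers png_path and convert_folder are hypothetical, and the dpi value is an arbitrary choice. Only convert_from_path and the save method of the returned PIL images are taken from the libraries.

```python
from pathlib import Path


def png_path(pdf_path, page_number, out_dir):
    """Build the output file name for one page, e.g. CHART.PDF page 1 -> CHART_p1.png."""
    return Path(out_dir) / "{}_p{}.png".format(Path(pdf_path).stem, page_number)


def convert_folder(pdf_dir, out_dir, dpi=150):
    """Rasterize every PDF in pdf_dir and save one PNG per page into out_dir."""
    # Imported here so that png_path stays usable without pdf2image/poppler.
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(pdf_dir).iterdir()):
        # Match the extension case-insensitively (.pdf as well as .PDF).
        if pdf_path.suffix.lower() != ".pdf":
            continue
        pages = convert_from_path(str(pdf_path), dpi=dpi)
        for number, page in enumerate(pages, start=1):
            target = png_path(pdf_path, number, out_dir)
            page.save(str(target), "PNG")
            written.append(target)
    return written


if __name__ == "__main__":
    for path in convert_folder("./PDF/", "./PNG/"):
        print(path)
```

Saving the pages as PNG files keeps the raster images available for the later steps without having to rerun the conversion.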

Image Randomization

Since we have very little data, we need to come up with artificial data for training purposes. We can do so by randomizing our existing image set to create all sorts of "charts".
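One possible way to randomize the images is to apply small random rotations, shifts and brightness changes. The sketch below assumes Pillow (which pdf2image already depends on); the parameter ranges and the helper names random_params, randomize and variants are hypothetical and would need tuning for real charts.

```python
import random


def random_params(rng, max_angle=5.0, max_shift=20, brightness=(0.8, 1.2)):
    """Draw one set of transformation parameters from the given ranges."""
    return {
        "angle": rng.uniform(-max_angle, max_angle),       # degrees
        "shift": (rng.randint(-max_shift, max_shift),      # pixels in x
                  rng.randint(-max_shift, max_shift)),     # pixels in y
        "brightness": rng.uniform(*brightness),            # 1.0 = unchanged
    }


def randomize(image, params):
    """Apply rotation, translation and a brightness change to a PIL image."""
    # Imported here so that random_params stays usable without Pillow.
    from PIL import Image, ImageEnhance

    rotated = image.convert("RGB").rotate(
        params["angle"],
        resample=Image.BICUBIC,
        translate=params["shift"],
        fillcolor="white",  # fill uncovered corners with paper color
    )
    return ImageEnhance.Brightness(rotated).enhance(params["brightness"])


def variants(image, count, seed=0):
    """Return `count` randomized copies of `image`, reproducible via `seed`."""
    rng = random.Random(seed)
    return [randomize(image, random_params(rng)) for _ in range(count)]
```

Seeding the generator makes the artificial data set reproducible, so a training run can be repeated on exactly the same images.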

Uploading to the GPU

Classification

OCR

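Character recognition alone is not enough: we also need to deal with aviation-specific symbology and iconography. Once again, we can refer to PDF files for the specific symbols [3] (https://www.icao.int/safety/ais-aimsg/AISAIM%20Meeting%20MetaData/AIS-AIMSG%204/SN%208%20Att%20B.pdf)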

Prerequisites

Install the required Python packages with pip install --user:

  • requests
  • scrapy
  • pdf2image

See also


Related