Howto:Processing d-tpp using Python

From FlightGear wiki
Revision as of 17:35, 28 November 2017 by Hooray (talk | contribs) (→‎Motivation)
This article is a stub. You can help the wiki by expanding it.


Motivation

Screenshot showing scrapy scraping d-TPPs

The goal is to come up with the Python machinery to automatically download aviation charts from http://155.178.201.160/d-tpp/ and classify them for further processing and parsing (data extraction).

We will be downloading two different AIRAC cycles (at the time of writing, 1712 and 1713).

Each directory contains a set of charts that will be post-processed by converting them to raster images.
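The per-cycle URLs can be built up front; a minimal sketch (the base URL and metafile path are taken from the links in this article, while the helper name is ours):

```python
BASE = "http://155.178.201.160/d-tpp"

def cycle_urls(cycles=("1712", "1713")):
    """Return the chart directory and XML metafile URL for each AIRAC cycle."""
    return {
        cycle: {
            "charts": f"{BASE}/{cycle}/",
            "metafile": f"{BASE}/{cycle}/xml_data/d-TPP_Metafile.xml",
        }
        for cycle in cycles
    }
```

Adding a new cycle later then only means extending the `cycles` tuple.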

Data sources

Chart Classification

  • STARs - Standard Terminal Arrivals
  • IAPs - Instrument Approach Procedures
  • DPs - Departure Procedures
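These three categories correspond to the chart_code abbreviations used by the d-TPP metafile; a minimal lookup sketch (the "other" fallback and the helper name are ours, and the metafile also uses further codes such as MIN and APD that this article does not cover):

```python
# Chart categories from the list above, keyed by the metafile's chart_code value.
CHART_CODES = {
    "STAR": "Standard Terminal Arrival",
    "IAP": "Instrument Approach Procedure",
    "DP": "Departure Procedure",
}

def classify(chart_code):
    # Anything outside the three documented categories falls through to "other".
    return CHART_CODES.get(chart_code, "other")
```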

Modules

XML Processing

http://155.178.201.160/d-tpp/1712/xml_data/d-TPP_Metafile.xml
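The metafile groups charts by state, city, and airport, with one record element per chart. A parsing sketch using xml.etree.ElementTree follows; the trimmed inline sample reflects that layout, but the exact element names are an assumption and should be checked against the cycle you download:

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the style of d-TPP_Metafile.xml (element names assumed).
SAMPLE = """\
<digital_tpp cycle="1712">
 <state_code ID="AK">
  <city_name ID="ADAK ISLAND">
   <airport_name ID="ADAK" icao_ident="PADK">
    <record>
     <chart_code>IAP</chart_code>
     <chart_name>RNAV (GPS) RWY 23</chart_name>
     <pdf_name>00916R23.PDF</pdf_name>
    </record>
   </airport_name>
  </city_name>
 </state_code>
</digital_tpp>
"""

def iter_charts(xml_text):
    """Yield (chart_code, chart_name, pdf_name) for every record in the metafile."""
    root = ET.fromstring(xml_text)
    for rec in root.iter("record"):
        yield (rec.findtext("chart_code"),
               rec.findtext("chart_name"),
               rec.findtext("pdf_name"))

charts = list(iter_charts(SAMPLE))
```

The pdf_name values extracted this way give the exact file names to fetch from the cycle directory.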

Scraping

Alternatively, use a media pipeline [1]

import os
import scrapy

from scrapy.crawler import CrawlerProcess
from scrapy.http import Request

# Settings for the FilesPipeline alternative mentioned above; pass them to
# CrawlerProcess instead of saving responses by hand in save_pdf().
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}

def create_folder(directory):
    try:
        if not os.path.exists(directory):
            os.makedirs(directory)
    except OSError:
        print('Error: creating directory ' + directory)


class dTPPSpider(scrapy.Spider):
    name = "d_tpp"

    allowed_domains = ["155.178.201.160"]
    start_urls = ["http://155.178.201.160/d-tpp/1712/"]

    def parse(self, response):
        # Follow every link in the cycle's directory listing.
        for href in response.css('a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open('./PDF/' + path, 'wb') as f:
            f.write(response.body)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

create_folder('./PDF/')
process.crawl(dTPPSpider)
process.start()  # the script will block here until the crawling is finished

Converting to images

pip3 install pdf2image [2]

from pdf2image import convert_from_path
import tempfile

with tempfile.TemporaryDirectory() as path:
    # Each PDF page becomes one PIL image, written into the temporary directory.
    images_from_path = convert_from_path('/folder/example.pdf', output_folder=path)
    # Do something with the images here
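To batch-convert everything the spider saved, the same call can be wrapped in a loop. A sketch follows; the directory names, the 150 dpi setting, the uppercase .PDF glob, and both helper names are assumptions (pdf2image is imported lazily so the naming helper works without it installed):

```python
from pathlib import Path

def raster_name(pdf_path, page, out_dir="./PNG"):
    # One PNG per chart page, named after the source PDF plus a page index.
    return str(Path(out_dir) / f"{Path(pdf_path).stem}-{page}.png")

def convert_all(pdf_dir="./PDF", out_dir="./PNG", dpi=150):
    from pdf2image import convert_from_path  # lazy import; needs poppler installed
    Path(out_dir).mkdir(exist_ok=True)
    # d-TPP chart files use an uppercase .PDF extension.
    for pdf in sorted(Path(pdf_dir).glob("*.PDF")):
        for i, page in enumerate(convert_from_path(str(pdf), dpi=dpi), start=1):
            page.save(raster_name(pdf, i, out_dir))
```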

Uploading to the GPU

Classification

OCR

Prerequisites

pip install --user requests pdf2image

See also


Related