Howto:Processing d-tpp using Python
This article is a stub. You can help the wiki by expanding it.
Motivation
The goal is to build the Python machinery to automatically download aviation charts and classify them for further processing and parsing (data extraction). The charts are served at http://155.178.201.160/d-tpp/
We will be downloading two different AIRAC cycles, which at the time of writing are 1712 and 1713.
Each cycle has its own directory containing a set of PDF charts, which are post-processed by converting them to raster images.
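The per-cycle directory URLs can be derived from the base address and the cycle number. This is a minimal sketch, assuming the server keeps the `<base>/<cycle>/` layout seen in the URLs on this page. The names `BASE_URL`, `CYCLES` and `cycle_url` are illustrative only and not part of any library.

```python
# Base address of the d-TPP server, as given in the Motivation section.
BASE_URL = "http://155.178.201.160/d-tpp/"

# The two AIRAC cycles mentioned in the text (valid at the time of writing).
CYCLES = ["1712", "1713"]


def cycle_url(cycle, base=BASE_URL):
    """Return the directory URL for one cycle (assumed layout: <base>/<cycle>/)."""
    return "{}/{}/".format(base.rstrip("/"), cycle)


for cycle in CYCLES:
    print(cycle_url(cycle))
```

Keeping the cycle list in one place makes it easy to switch to newer cycles once these two are no longer published.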
Data sources
Chart Classification
- STARs - Standard Terminal Arrivals
- IAPs - Instrument Approach Procedures
- DPs - Departure Procedures
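The three chart classes above could be represented as a simple lookup table. This is a rough sketch, assuming each chart comes with a short type code such as `STAR`, `IAP` or `DP`. The code values, and the helper names `classify` and `group_by_class`, are assumptions to be checked against the real metadata.

```python
from collections import defaultdict

# Assumed type codes mapped to the chart classes listed above.
CHART_CLASSES = {
    "STAR": "Standard Terminal Arrival",
    "IAP": "Instrument Approach Procedure",
    "DP": "Departure Procedure",
}


def classify(chart_code):
    """Map a chart type code to a class name; unknown codes become 'Other'."""
    return CHART_CLASSES.get(chart_code.strip().upper(), "Other")


def group_by_class(charts):
    """Group (chart_code, pdf_name) pairs by their chart class."""
    groups = defaultdict(list)
    for chart_code, pdf_name in charts:
        groups[classify(chart_code)].append(pdf_name)
    return dict(groups)
```

Charts with a code outside the table end up under "Other", so nothing is silently dropped while the list of classes is still incomplete.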
Modules
XML Processing
http://155.178.201.160/d-tpp/1712/xml_data/d-TPP_Metafile.xml
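The metafile can be read with the standard library's `xml.etree.ElementTree`. The sketch below runs on a small inline sample instead of the real file. The element and attribute names (`airport_name`, `apt_ident`, `record`, `chart_code`, `chart_name`, `pdf_name`) are assumptions about the metafile layout and should be verified against the downloaded XML. The sample values are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the assumed metafile structure.
SAMPLE = """<digital_tpp cycle="1712">
  <state_code ID="XX">
    <city_name ID="EXAMPLE CITY">
      <airport_name ID="EXAMPLE FIELD" apt_ident="XXX">
        <record>
          <chart_code>IAP</chart_code>
          <chart_name>EXAMPLE APPROACH</chart_name>
          <pdf_name>EXAMPLE1.PDF</pdf_name>
        </record>
        <record>
          <chart_code>DP</chart_code>
          <chart_name>EXAMPLE DEPARTURE</chart_name>
          <pdf_name>EXAMPLE2.PDF</pdf_name>
        </record>
      </airport_name>
    </city_name>
  </state_code>
</digital_tpp>"""


def iter_charts(root):
    """Yield one dict per chart record, tagged with its airport identifier."""
    for airport in root.iter("airport_name"):
        for record in airport.iter("record"):
            yield {
                "airport": airport.get("apt_ident"),
                "code": record.findtext("chart_code"),
                "name": record.findtext("chart_name"),
                "pdf": record.findtext("pdf_name"),
            }


charts = list(iter_charts(ET.fromstring(SAMPLE)))
for chart in charts:
    print(chart["airport"], chart["code"], chart["pdf"])
```

For the real metafile, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)` after the XML has been saved locally.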
Scraping
The spider below writes each downloaded response to disk itself. Alternatively, use Scrapy's media (files) pipeline [1].
import os

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request

# Only takes effect when passed to Scrapy as a setting (media pipeline
# approach); this script writes the files itself instead.
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}


def createFolder(directory):
    """Create the output directory if it does not exist yet."""
    try:
        if not os.path.exists(directory):
            os.makedirs(directory)
    except OSError:
        print('Error: Creating directory. ' + directory)


class dTPPSpider(scrapy.Spider):
    name = "dtpp"
    allowed_domains = ["155.178.201.160"]
    start_urls = ["http://155.178.201.160/d-tpp/1712/"]

    def parse(self, response):
        # Follow only links to PDF charts; skip sub-directories and other files.
        for href in response.css('a::attr(href)').extract():
            if href.lower().endswith('.pdf'):
                yield Request(
                    url=response.urljoin(href),
                    callback=self.save_pdf
                )

    def save_pdf(self, response):
        # Name the local file after the last segment of the URL path.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open('./PDF/' + path, 'wb') as f:
            f.write(response.body)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
createFolder('./PDF/')
process.crawl(dTPPSpider)
process.start()  # the script will block here until the crawling is finished
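The media pipeline alternative could look roughly like the sketch below. Scrapy's built-in `FilesPipeline` downloads every URL listed in an item's `file_urls` field and stores the result under the directory given by the `FILES_STORE` setting. The names `SETTINGS`, `pdf_item`, `run` and `DtppFilesSpider`, and the `./PDF` output directory, are illustrative assumptions. Scrapy is imported inside `run()` so that the link-filtering helper can be tried without it.

```python
from urllib.parse import urljoin

# Settings that enable Scrapy's built-in FilesPipeline.
SETTINGS = {
    "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
    "FILES_STORE": "./PDF",  # assumed output directory
}


def pdf_item(base_url, hrefs):
    """Build an item whose file_urls lists every PDF link found on a page."""
    urls = [urljoin(base_url, h) for h in hrefs if h.lower().endswith(".pdf")]
    return {"file_urls": urls}


def run(start_url="http://155.178.201.160/d-tpp/1712/"):
    """Crawl one directory listing; requires Scrapy and network access."""
    import scrapy
    from scrapy.crawler import CrawlerProcess

    class DtppFilesSpider(scrapy.Spider):
        name = "dtpp_files"
        start_urls = [start_url]

        def parse(self, response):
            # The pipeline downloads everything listed in file_urls.
            hrefs = response.css("a::attr(href)").extract()
            yield pdf_item(response.url, hrefs)

    process = CrawlerProcess(SETTINGS)
    process.crawl(DtppFilesSpider)
    process.start()  # blocks until the crawl is finished
```

Calling `run()` starts the crawl. Note that `FilesPipeline` stores the files under hashed names in a `full/` sub-directory of `FILES_STORE`, so the original chart filenames have to be recovered from the item's `files` field if they are needed.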
Converting to images
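pip3 install pdf2image [2]
pdf2image calls the poppler command-line utilities, so poppler must be installed as well.
from pdf2image import convert_from_path
import tempfile

with tempfile.TemporaryDirectory() as path:
    # Returns one PIL image per page; the page files are written to `path`.
    images_from_path = convert_from_path('/folder/example.pdf', output_folder=path)
    # Do something here (the temporary directory is deleted when this block exits)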
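To keep the raster images instead of discarding them with the temporary directory, each page can be saved as a PNG. This is a sketch only: the helper names `page_image_path` and `pdf_to_pngs`, the `<name>_pNNN.png` naming scheme and the 150 dpi default are assumptions. pdf2image is imported inside the conversion function because it needs poppler at run time.

```python
import os


def page_image_path(pdf_path, page_number, out_dir):
    """Return the PNG path for one page, e.g. out_dir/example_p001.png."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, "{}_p{:03d}.png".format(stem, page_number))


def pdf_to_pngs(pdf_path, out_dir, dpi=150):
    """Convert every page of a PDF to a PNG file and return the file paths."""
    from pdf2image import convert_from_path  # requires poppler to be installed

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target, "PNG")  # each page is a PIL image
        paths.append(target)
    return paths
```

A call such as `pdf_to_pngs('/folder/example.pdf', './PNG')` would then write one image per page. A higher `dpi` gives sharper text for later OCR at the cost of larger files.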
Uploading to the GPU
Classification
OCR
Prerequisites
Install the following packages with pip3 install --user:
- scrapy (needed by the Scraping example)
- requests
- pdf2image
See also
- https://github.com/euske/pdfminer
- https://dzone.com/articles/pdf-reading
- https://automatetheboringstuff.com/chapter13/
- https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167
- https://github.com/pmaupin/pdfrw