Howto:Processing d-tpp using Python: Difference between revisions

From FlightGear wiki
(Partial copy editing; +-cat: Python Software → Python software)
 
(38 intermediate revisions by 2 users not shown)
{{Stub}}


[[File:KSFO-28RILS.png|thumb|Screenshot showing the scraped, converted and transformed approach chart for KSFO 28R (aspect ratio doesn't matter for machine learning purposes, i.e. we can use random scaling/ratios here to come up with artificial training data).]]
[[File:Merged-charts.png|thumb|Simplified navigation charts merged into a single texture]]
If processing actual PDFs to "retrieve" such navigational data procedurally is ever going to "fly", I think it would have to be done using OpenCV running in background threads (most likely a pool of threads in a separate process). In essence, this means machine learning: feeding the system a set of manually annotated PDFs, segmenting each PDF into sub-areas (horizontal/vertical profile, frequencies, identifier, etc.) and running neural networks on them.
Such a system would need to be very modular to be feasible: the rasterized image would be processed in parallel on the GPU to split the chart into known components and to retrieve the identifiers, frequencies, bearings, etc. from them (which requires an OCR stage, too).
It is an interesting problem, and solving it would address a number of legal issues, too, much as downloading such data from the web works for a reason. However, I believe it would be a rather complex piece of software, and we would want to involve people familiar with machine learning and computer vision (OpenCV). It is essentially a superset of doing OCR on approach charts: we are not just looking for a character set, but for an actual document structure and the "iconography" for airports, navaids, route markers and so on.


== Motivation ==
[[File:Chart-scraping.png|thumb|Screenshot showing scrapy scraping d-TPPs]]
Come up with the Python machinery to automatically download aviation charts and classify them for further processing/parsing (data extraction): http://155.178.201.160/d-tpp/


We will be downloading two different AIRAC cycles; at the time of writing, for example, these are cycles '''1712''' and '''1713''':
* http://155.178.201.160/d-tpp/1712/
* http://155.178.201.160/d-tpp/1713/


Each directory contains a set of charts that will be post-processed by converting them to raster images.


== Data sources ==
* d-TPP
* [http://www.eurocontrol.int/articles/eurocontrol-regional-charts-erc EuroControl]
* [https://www.vatsim.net/charts/ VATSIM Charts]
* [https://xn.ivao.aero/pilot/xn/charts IVAO Charts]
 
== Chart classification ==
; STARs:  Standard Terminal Arrivals
; IAPs:  Instrument Approach Procedures
; DPs:  Departure Procedures
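
The three categories above can be turned into a small lookup for sorting charts. The following is a minimal sketch only: the chart code strings (STAR, IAP, DP) are an assumption about how the d-TPP data labels its charts, and the sample chart names are hypothetical.

```python
from collections import defaultdict

# Assumed mapping from d-TPP chart codes to the categories listed above.
# The code strings are an assumption, not taken from this article;
# adjust them to whatever the downloaded data actually contains.
CATEGORY_BY_CODE = {
    "STAR": "STARs",
    "IAP": "IAPs",
    "DP": "DPs",
}


def classify(chart_code):
    """Return the category for a chart code, or 'other' if unknown."""
    return CATEGORY_BY_CODE.get(chart_code.strip().upper(), "other")


def group_charts(charts):
    """Group (chart_code, chart_name) pairs by category."""
    groups = defaultdict(list)
    for code, name in charts:
        groups[classify(code)].append(name)
    return dict(groups)


# Hypothetical sample records for illustration
sample = [
    ("IAP", "ILS OR LOC RWY 28R"),
    ("STAR", "EXAMPLE ARRIVAL"),
    ("DP", "EXAMPLE DEPARTURE"),
    ("APD", "AIRPORT DIAGRAM"),
]
print(group_charts(sample))
```

Anything that does not match a known code ends up in "other", so unexpected chart types are kept rather than silently dropped.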


== Modules ==


=== Scraping ===
{{Note|This will download roughly 4 GB of data in ~17'000 files for each AIRAC cycle!}}
* This should support caching
* And interrupting/resuming scraping
* Alternatively use a media pipeline <ref>http://sergeis.com/web-scraping/downloading-files-scrapy-mediapipeline/</ref>
<syntaxhighlight lang="python">
import os

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request

# Root folder for downloaded charts; one sub-folder per AIRAC cycle
PDF_ROOT = './PDF'


def createFolder(directory):
    """Create the directory (and any parents) if it does not exist yet."""
    try:
        os.makedirs(directory, exist_ok=True)
    except OSError:
        print('Error: Creating directory. ' + directory)


class dTPPSpider(scrapy.Spider):
    name = 'dTPPSpider'
    # https://doc.scrapy.org/en/latest/topics/settings.html
    custom_settings = {
        # Cache responses on disk so that repeated runs do not re-download
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_STORAGE': 'scrapy.extensions.httpcache.FilesystemCacheStorage',
        'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.RFC2616Policy',
    }

    allowed_domains = ["155.178.201.160"]
    start_urls = [
        "http://155.178.201.160/d-tpp/1712/",
        "http://155.178.201.160/d-tpp/1713/",
    ]

    def parse(self, response):
        # Follow only the links to PDF files in the directory listing
        for href in response.css('a::attr(href)').getall():
            if href.lower().endswith('.pdf'):
                yield Request(
                    url=response.urljoin(href),
                    callback=self.save_pdf
                )

    def save_pdf(self, response):
        # URL layout: .../d-tpp/<cycle>/<file>.pdf
        path = response.url.split('/')[-1]
        cycle = response.url.split('/')[-2]
        target_dir = os.path.join(PDF_ROOT, cycle)
        createFolder(target_dir)
        self.logger.info('Saving PDF %s (cycle:%s)', path, cycle)
        with open(os.path.join(target_dir, path), 'wb') as f:
            f.write(response.body)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(dTPPSpider)
process.start()  # the script will block here until the crawling is finished
</syntaxhighlight>
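
For the interrupting/resuming requirement listed above, one simple option is to skip every chart that already exists on disk. The sketch below assumes the ./PDF/<cycle>/<file> layout used by the spider; the helper names and the example file name are hypothetical. Scrapy also has a JOBDIR setting for persisting crawl state between runs, which may be the better fit for larger crawls.

```python
import os
from urllib.parse import urlparse


def local_path_for(url, root="./PDF"):
    """Map .../d-tpp/<cycle>/<file> to <root>/<cycle>/<file>."""
    parts = urlparse(url).path.rstrip("/").split("/")
    cycle, filename = parts[-2], parts[-1]
    return os.path.join(root, cycle, filename)


def needs_download(url, root="./PDF"):
    """Return True unless a non-empty local copy already exists."""
    path = local_path_for(url, root)
    return not (os.path.isfile(path) and os.path.getsize(path) > 0)


# Hypothetical file name, for illustration only
print(local_path_for("http://155.178.201.160/d-tpp/1712/example.pdf"))
```

A spider could call needs_download() before yielding a request, so that an interrupted run continues with the charts that are still missing. Checking for a non-empty file is a rough heuristic; it does not detect partially written files.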


=== Converting to images ===
{{Note|By default, all PDF files will be 387 x 594 pts (use pdfinfo to see for yourself).<ref>https://github.com/Belval/pdf2image</ref>}}
<syntaxhighlight lang="bash">
pip3 install pdf2image
</syntaxhighlight>
<syntaxhighlight lang="python">
from pdf2image import convert_from_path, convert_from_bytes
import tempfile
with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/folder/example.pdf', output_folder=path)
    # Do something here
</syntaxhighlight>
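
The size of the resulting raster image follows from the page size in points and the rendering resolution, since one point is 1/72 inch. The sketch below works this out for the 387 x 594 pt pages mentioned in the note; the function name is hypothetical, and it assumes the resolution is chosen through the dpi argument that pdf2image's convert_from_path accepts.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 pt = 1/72 inch


def raster_size(width_pt, height_pt, dpi):
    """Return the (width, height) in pixels of a page rendered at dpi."""
    return (round(width_pt * dpi / POINTS_PER_INCH),
            round(height_pt * dpi / POINTS_PER_INCH))


# Page size from the note above, at a few resolutions
for dpi in (72, 144, 200):
    print(dpi, raster_size(387, 594, dpi))
```

At 200 dpi, for instance, a 387 x 594 pt page comes out at 1075 x 1650 pixels.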
=== Simplification / Feature Extraction ===
[[File:Random-charts.png|thumb|screenshot showing simplified charts based on creating thumbnails that are added to a new image.|right ]]
We can easily simplify our charts by creating thumbnails for each chart and merging all files into a larger image/texture.
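
Merging thumbnails into one texture mostly comes down to computing where each thumbnail goes. The following is a minimal sketch of that layout step, assuming equally sized thumbnails placed row by row; the function name is hypothetical. With Pillow, the thumbnails could be created with Image.thumbnail() and pasted at these offsets with Image.paste().

```python
import math


def atlas_layout(count, thumb_w, thumb_h, columns):
    """Return ((atlas_w, atlas_h), offsets) for equal-size thumbnails.

    offsets[i] is the top-left (x, y) pixel position of thumbnail i,
    filled row by row.
    """
    if count < 1 or columns < 1:
        raise ValueError("count and columns must be positive")
    columns = min(columns, count)
    rows = math.ceil(count / columns)
    offsets = [((i % columns) * thumb_w, (i // columns) * thumb_h)
               for i in range(count)]
    return (columns * thumb_w, rows * thumb_h), offsets


# Five 64 x 98 pixel thumbnails in rows of three
size, offsets = atlas_layout(5, 64, 98, 3)
print(size, offsets)
```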
=== Image Randomization ===
Since we only have very little data, we need to come up with artificial data for training purposes. We can do so by randomizing our existing image set to create all sorts of "charts", for example by transforming/re-scaling our images or changing their aspect ratio:
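
One way to approach this is to first generate random target sizes and then resize each rasterized chart to them. The sketch below covers only the first step; the function name, the scale/stretch ranges and the default values are assumptions chosen for illustration. The actual resizing could then be done with an imaging library such as Pillow (Image.resize()).

```python
import random


def random_sizes(base_w, base_h, count, scale=(0.5, 1.5),
                 stretch=(0.75, 1.25), seed=None):
    """Return `count` random (width, height) targets for one base image.

    `scale` resizes both axes together; `stretch` then changes the
    aspect ratio by scaling the width only.
    """
    rng = random.Random(seed)
    sizes = []
    for _ in range(count):
        s = rng.uniform(*scale)
        a = rng.uniform(*stretch)
        sizes.append((max(1, round(base_w * s * a)),
                      max(1, round(base_h * s))))
    return sizes


# Ten variants of a 387 x 594 pixel chart; a fixed seed makes runs repeatable
for size in random_sizes(387, 594, 10, seed=1):
    print(size)
```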
=== Uploading to the GPU ===
=== Classification ===


=== OCR ===
We don't just need to do character recognition, but also have to deal with aviation-specific symbology/iconography.
Once again, we can refer to PDF files for the specific symbols.<ref>https://www.icao.int/safety/ais-aimsg/AISAIM%20Meeting%20MetaData/AIS-AIMSG%204/SN%208%20Att%20B.pdf</ref>


== Prerequisites ==
<syntaxhighlight lang="bash">
pip install --user scrapy requests pdf2image
</syntaxhighlight>

* scrapy (used by the scraping module)
* requests
* pdf2image


== References ==
 
<references />


== External links ==
=== Python resources ===
* http://www.pythonware.com/products/pil/
* http://effbot.org/imagingbook/introduction.htm


=== See also ===
* https://github.com/euske/pdfminer
* https://dzone.com/articles/pdf-reading
* https://github.com/pmaupin/pdfrw


 
== Related ==
* https://en.wikipedia.org/wiki/AIXM
* https://aeronavdata.com/what-we-do/axim-5/
* http://ww1.jeppesen.com/industry-solutions/aviation/government/arinc-424-navigational-data-service.jsp
[[Category:Python software]]

Latest revision as of 09:46, 25 March 2020


XML Processing

http://155.178.201.160/d-tpp/1712/xml_data/d-TPP_Metafile.xml
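
The metafile can serve as an index of all charts in a cycle, which avoids guessing from file names. The following is a minimal sketch using the standard library's ElementTree; the element and attribute names in the sample (airport_name, icao_ident, record, chart_code, chart_name, pdf_name) are assumptions about the metafile layout and need to be checked against the real file, and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; element and attribute names are assumptions
# about the metafile layout and must be checked against the real file.
SAMPLE = """\
<digital_tpp cycle="1712">
  <state_code ID="CA">
    <city_name ID="SAN FRANCISCO">
      <airport_name ID="SAN FRANCISCO INTL" apt_ident="SFO" icao_ident="KSFO">
        <record>
          <chart_code>IAP</chart_code>
          <chart_name>ILS OR LOC RWY 28R</chart_name>
          <pdf_name>EXAMPLE.PDF</pdf_name>
        </record>
      </airport_name>
    </city_name>
  </state_code>
</digital_tpp>
"""


def list_charts(xml_text):
    """Return one dict per <record>, tagged with its airport's ICAO ident."""
    root = ET.fromstring(xml_text)
    charts = []
    for airport in root.iter("airport_name"):
        for record in airport.iter("record"):
            charts.append({
                "icao": airport.get("icao_ident"),
                "code": record.findtext("chart_code"),
                "name": record.findtext("chart_name"),
                "pdf": record.findtext("pdf_name"),
            })
    return charts


for chart in list_charts(SAMPLE):
    print(chart)
```

For the real file, which is considerably larger, ET.parse() on the downloaded file (or ET.iterparse() for streaming) would replace ET.fromstring().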
