Howto:Processing d-tpp using Python: Difference between revisions

Partial copy editing; +-cat: Python Software → Python software
m (cat: Python Software)
(Partial copy editing; +-cat: Python Software → Python software)
 
Line 1: Line 1:
{{Stub}}
{{Stub}}


[[File:KSFO-28RILS.png|thumb|Screenshot showing the scraped, converted and transformed approach chart for KSFO 28R (aspect ratio doesn't matter for machine learning purposes, i.e. we can use random scaling/ratios here to come up with artificial training data).]]
[[File:KSFO-28RILS.png|thumb|Screenshot showing the scraped, converted and transformed approach chart for KSFO 28R (aspect ratio doesn't matter for machine learning purposes, i.e. we can use random scaling/ratios here to come up with artificial training data).]]
[[File:Merged-charts.png|thumb|Simplified navigation charts merged into a single texture]]


[[File:Merged-charts.png|thumb|Simplified navigation charts merged into a single texture]]
If processing actual PDFs to "retrieve" such navigational data procedurally is ever supposed to "fly", I think it would have to be done using OpenCV running in a background thread (actually a bunch of threads in a separate process). In essence, this means using machine learning: feeding it a bunch of manually annotated PDFs, segmenting each PDF into sub-areas (horizontal/vertical profile, frequencies, identifier, etc.) and running neural networks.
== Idea ==
if processing actual PDFs to "retrieve" such navigational data procedurally is ever supposed to "fly", I think it would have to be done using OpenCV runnning in a background thread (actually a bunch of threads in a separate process), i.e. using machine learning - basically, feeding it a bunch of manually-annotated PDFs, segmenting each PDF into sub-areas (horizontal/vertical profile, frequencies, identifier etc) and running neural networks.


Basically, such a thing would need to be very modular to be feasible - i.e. parallel processing of the rasterized image on the GPU, to split the chart into known components and retrieve the identifiers, frequencies, bearings etc that way (i.e. would require an OCR stage, too).
Such a thing would need to be very modular to be feasible: parallel processing of the rasterized image on the GPU would split the chart into known components, from which the identifiers, frequencies, bearings, etc. are then retrieved (which requires an OCR stage, too).


It is kind of an interesting problem and it would address a bunch of legal issues, too - just like downloading such data from the web works for a reason, but it would definitely be a rather complex piece of software I believe, and we would want to get people involved familiar with machine learning and computer vision (OpenCV) - it is kinda a superset of doing OCR on approach charts, i.e. not just looking for a character set, but actual document structure and "iconography" for airports, navaids, route markers and so on.
It is an interesting problem, and it would address a bunch of legal issues, too, just like downloading such data from the web works for a reason. However, it would definitely be a rather complex piece of software, I believe, and we would want to get people involved who are familiar with machine learning and computer vision (OpenCV). It is kind of a superset of doing OCR on approach charts: not just looking for a character set, but for the actual document structure and "iconography" for airports, navaids, route markers and so on.


== Motivation ==
== Motivation ==
Line 17: Line 14:
Come up with the Python machinery to automatically download aviation charts and classify them for further processing/parsing (data extraction): http://155.178.201.160/d-tpp/
Come up with the Python machinery to automatically download aviation charts and classify them for further processing/parsing (data extraction): http://155.178.201.160/d-tpp/


We will be downloading two different AIRAC cycles, i.e. at the time of writing '''1712''' & '''1713''':
We will be downloading two different AIRAC cycles; at the time of writing, these are cycles '''1712''' and '''1713''':
* http://155.178.201.160/d-tpp/1712/
* http://155.178.201.160/d-tpp/1712/
* http://155.178.201.160/d-tpp/1713/
* http://155.178.201.160/d-tpp/1713/
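A minimal sketch of how the per-cycle URLs listed above might be built. The names <code>BASE_URL</code>, <code>CYCLES</code> and <code>cycle_url</code> are illustrative assumptions and not part of any existing script; only the server address and the two cycle numbers are taken from this page.

```python
# Hypothetical helper: build the download URL for one AIRAC cycle.
BASE_URL = "http://155.178.201.160/d-tpp/"  # server address from this page
CYCLES = ["1712", "1713"]                   # cycles named at the time of writing


def cycle_url(cycle: str) -> str:
    """Return the d-TPP directory URL for a four-digit cycle such as '1712'."""
    if len(cycle) != 4 or not (cycle.isascii() and cycle.isdigit()):
        raise ValueError(f"expected a four-digit cycle, got {cycle!r}")
    return f"{BASE_URL}{cycle}/"


if __name__ == "__main__":
    for cycle in CYCLES:
        print(cycle_url(cycle))
```

Validating the cycle string early keeps a typo from turning into thousands of failed requests later in the scraping stage.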
Line 25: Line 22:
== Data sources ==
== Data sources ==
* d-TPP
* d-TPP
* EuroControl [http://www.eurocontrol.int/articles/eurocontrol-regional-charts-erc]
* [http://www.eurocontrol.int/articles/eurocontrol-regional-charts-erc EuroControl]
* VATSIM Charts [https://www.vatsim.net/charts/] [https://www.vatsim.net/charts/]
* [https://www.vatsim.net/charts/ VATSIM Charts]
* IVAO Charts [https://xn.ivao.aero/pilot/xn/charts]
* [https://xn.ivao.aero/pilot/xn/charts IVAO Charts]


== Chart Classification ==
== Chart classification ==
* '''STARs''' - Standard Terminal Arrivals  
; STARs : Standard Terminal Arrivals
* '''IAPs''' - Instrument Approach Procedures
; IAPs : Instrument Approach Procedures
* '''DPs''' - Departure Procedures
; DPs : Departure Procedures
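The three chart classes above could be represented as a small enumeration, so later stages can branch on the chart type. This is only a sketch: it assumes each downloaded chart comes with a short type code (for example from an index file), and <code>ChartType</code> and <code>classify</code> are hypothetical names.

```python
from enum import Enum
from typing import Optional


class ChartType(Enum):
    """Chart classes listed on this page, keyed by their abbreviation."""
    STAR = "Standard Terminal Arrival"
    IAP = "Instrument Approach Procedure"
    DP = "Departure Procedure"


def classify(code: str) -> Optional[ChartType]:
    """Map a type code such as 'iap' to a ChartType; None if unknown."""
    try:
        return ChartType[code.strip().upper()]
    except KeyError:
        return None
```

Returning <code>None</code> for unknown codes lets the caller decide whether to skip such charts or to log them for manual review.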


== Modules ==
== Modules ==
Line 40: Line 37:


=== Scraping ===
=== Scraping ===
{{Note|This will download roughly 4gb of data in ~17k files, for each AIRAC cycle!}}  
{{Note|This will download roughly 4 GB of data in ~17,000 files for each AIRAC cycle!}}
* this should support caching
* This should support caching
* and interrupting/resuming scraping  
* It should also support interrupting/resuming scraping
* Alternatively, use a media pipeline <ref>http://sergeis.com/web-scraping/downloading-files-scrapy-mediapipeline/</ref>
* Alternatively, use a media pipeline <ref>http://sergeis.com/web-scraping/downloading-files-scrapy-mediapipeline/</ref>
 
<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
import os
import os
Line 102: Line 100:
process.crawl(dTPPSpider)
process.crawl(dTPPSpider)
process.start() # the script will block here until the crawling is finished
process.start() # the script will block here until the crawling is finished
</syntaxhighlight>
</syntaxhighlight>
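The caching and resume requirements listed above could be sketched with the standard library alone, independently of the Scrapy spider. Files already on disk are skipped, and each download is written to a temporary <code>.part</code> file first, so an interrupted transfer is never mistaken for a complete chart. The helper names (<code>local_path</code>, <code>is_cached</code>, <code>fetch</code>) and the flat cache directory layout are assumptions made for illustration.

```python
import os
import shutil
import urllib.request


def local_path(url: str, cache_dir: str) -> str:
    """Map a chart URL to a file name inside cache_dir (assumed flat layout)."""
    return os.path.join(cache_dir, url.rstrip("/").rsplit("/", 1)[-1])


def is_cached(url: str, cache_dir: str) -> bool:
    """A chart counts as cached if a non-empty file already exists for it."""
    path = local_path(url, cache_dir)
    return os.path.isfile(path) and os.path.getsize(path) > 0


def fetch(url: str, cache_dir: str) -> str:
    """Download url unless it is cached; return the local file path."""
    os.makedirs(cache_dir, exist_ok=True)
    dest = local_path(url, cache_dir)
    if is_cached(url, cache_dir):
        return dest  # resume: nothing to do for this file
    tmp = dest + ".part"
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    os.replace(tmp, dest)  # rename only after the download completed
    return dest
```

Re-running the same list of URLs after an interruption then only downloads what is still missing; a leftover <code>.part</code> file is simply overwritten on the next attempt.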


=== Converting to images ===
=== Converting to images ===
{{Note|By default, all PDF files will be  387 x 594 pts (use pdfinfo to see for yourself)}}
{{Note|By default, all PDF files will be 387 x 594 pts (use pdfinfo to see for yourself).<ref>https://github.com/Belval/pdf2image</ref>}}
pip3 install pdf2image <ref>https://github.com/Belval/pdf2image</ref>
<syntaxhighlight lang="bash">
pip3 install pdf2image
</syntaxhighlight>
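Once pdf2image is installed, a conversion step might look like the sketch below. It assumes the poppler utilities that pdf2image relies on are available; <code>pixel_size</code> and <code>pdf_to_png</code> are hypothetical helper names, and the default of 144 dpi is an arbitrary choice. Since a PDF point is 1/72 inch, a 387 x 594 pt page rendered at 144 dpi comes out at 774 x 1188 pixels.

```python
import os


def pixel_size(width_pts: float, height_pts: float, dpi: int) -> tuple:
    """Pixel dimensions of a page given in PDF points (1 pt = 1/72 inch)."""
    return (round(width_pts * dpi / 72), round(height_pts * dpi / 72))


def pdf_to_png(pdf_path: str, out_dir: str, dpi: int = 144) -> list:
    """Rasterize every page of pdf_path into out_dir; return the PNG paths."""
    # Imported here so the size helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    written = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = os.path.join(out_dir, f"{stem}-{number}.png")
        page.save(target, "PNG")  # each page is a PIL image
        written.append(target)
    return written


if __name__ == "__main__":
    print(pixel_size(387, 594, 144))
```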


<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
Line 120: Line 118:


=== Simplification / Feature Extraction ===
=== Simplification / Feature Extraction ===
[[File:Random-charts.png|thumb|screenshot showing simplified charts, based on creating thumbnails that are added to a new image.|right ]]
[[File:Random-charts.png|thumb|right|Screenshot showing simplified charts based on creating thumbnails that are added to a new image.]]
We can easily simplify our charts by creating thumbnails for each chart and merging all files into a larger image/texture:
We can easily simplify our charts by creating thumbnails for each chart and merging all files into a larger image/texture.
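The thumbnail merge described above can be sketched with Pillow as follows. The grid layout helpers are plain Python; the thumbnail size of 97 x 149 pixels (roughly a quarter of the 387 x 594 pt page), the eight-column grid and all function names are assumptions chosen for illustration.

```python
import math


def grid_positions(count: int, thumb_w: int, thumb_h: int, columns: int) -> list:
    """Top-left pixel offsets for count thumbnails, filled row by row."""
    return [((i % columns) * thumb_w, (i // columns) * thumb_h)
            for i in range(count)]


def atlas_size(count: int, thumb_w: int, thumb_h: int, columns: int) -> tuple:
    """Size of the merged image that holds count thumbnails."""
    rows = math.ceil(count / columns)
    return (columns * thumb_w, rows * thumb_h)


def merge_charts(image_paths: list, out_path: str,
                 thumb: tuple = (97, 149), columns: int = 8) -> None:
    """Paste a thumbnail of every chart image into one large texture."""
    from PIL import Image  # Pillow, imported lazily

    if not image_paths:
        raise ValueError("no images to merge")
    width, height = thumb
    atlas = Image.new("L", atlas_size(len(image_paths), width, height, columns),
                      color=255)  # white greyscale background
    offsets = grid_positions(len(image_paths), width, height, columns)
    for path, offset in zip(image_paths, offsets):
        with Image.open(path) as chart:
            small = chart.convert("L")
            small.thumbnail(thumb)  # shrinks in place, keeps aspect ratio
            atlas.paste(small, offset)
    atlas.save(out_path)
```

Because <code>thumbnail</code> preserves the aspect ratio, a chart may not fill its grid cell completely; the white background shows through in that case.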


=== Image Randomization ===
=== Image Randomization ===
Since we only have very little data, we need to come up with artifical data fo training purposes - we can do so by randomizing our existing image set to create all sorts of "charts". e.g. by transforming/re-scaling our images or changing their aspect ratio:
Since we only have very little data, we need to come up with artificial data for training purposes. We can do so by randomizing our existing image set to create all sorts of "charts", for example by transforming/re-scaling our images or changing their aspect ratio:
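One way to sketch this kind of randomization is to scale width and height independently, which changes both size and aspect ratio. The scale range of 0.5 to 1.5 and the names <code>random_size</code> and <code>randomized_copies</code> are assumptions; a seeded generator keeps the artificial data set reproducible.

```python
import random


def random_size(width: int, height: int, rng: random.Random,
                min_scale: float = 0.5, max_scale: float = 1.5) -> tuple:
    """Pick a new size by scaling each axis independently."""
    scale_x = rng.uniform(min_scale, max_scale)
    scale_y = rng.uniform(min_scale, max_scale)
    return (max(1, round(width * scale_x)), max(1, round(height * scale_y)))


def randomized_copies(image_path: str, count: int, seed: int = 0) -> list:
    """Return count randomly rescaled copies of one chart image."""
    from PIL import Image  # Pillow, imported lazily

    rng = random.Random(seed)
    with Image.open(image_path) as chart:
        return [chart.resize(random_size(chart.width, chart.height, rng))
                for _ in range(count)]
```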


<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
Line 134: Line 132:


=== OCR ===
=== OCR ===
We don't just need to do character recognition, but also deal with aviation specific symbology/iconography.
We don't just need to do character recognition, but also deal with aviation-specific symbology/iconography.
Once again, we can refer to PDF files for the specific symbols <ref>https://www.icao.int/safety/ais-aimsg/AISAIM%20Meeting%20MetaData/AIS-AIMSG%204/SN%208%20Att%20B.pdf</ref>
Once again, we can refer to PDF files for the specific symbols.<ref>https://www.icao.int/safety/ais-aimsg/AISAIM%20Meeting%20MetaData/AIS-AIMSG%204/SN%208%20Att%20B.pdf</ref>


== Prerequisites ==
== Prerequisites ==
pip install --user  
<syntaxhighlight lang="bash">
pip install --user requests pdf2image
</syntaxhighlight>


* requests
* requests
* pdf2image
* pdf2image


== Python resources ==
== References ==
<references />
 
== External links ==
=== Python resources ===
* http://www.pythonware.com/products/pil/
* http://www.pythonware.com/products/pil/
* http://effbot.org/imagingbook/introduction.htm
* http://effbot.org/imagingbook/introduction.htm


== See also ==
=== See also ===
* https://github.com/euske/pdfminer
* https://github.com/euske/pdfminer
* https://dzone.com/articles/pdf-reading
* https://dzone.com/articles/pdf-reading
Line 154: Line 158:
* https://github.com/pmaupin/pdfrw
* https://github.com/pmaupin/pdfrw


 
=== Related ===
== Related ==
* https://en.wikipedia.org/wiki/AIXM
* https://en.wikipedia.org/wiki/AIXM
* https://aeronavdata.com/what-we-do/axim-5/
* https://aeronavdata.com/what-we-do/axim-5/
* http://ww1.jeppesen.com/industry-solutions/aviation/government/arinc-424-navigational-data-service.jsp
* http://ww1.jeppesen.com/industry-solutions/aviation/government/arinc-424-navigational-data-service.jsp


[[Category:Python Software]]
[[Category:Python software]]
