Collecting and filtering PowerPoint slides

An experiment in collecting and filtering Microsoft PowerPoint slides.

I wrote a custom automated tool that scrapes Google results for .ppt files related to any given keyword. After downloading 1000 randomly chosen slideshows, I extracted their text to learn more about the data.
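The link-filtering and random-selection step could look roughly like the sketch below. It is not the original tool: it uses only Ruby's standard library rather than Mechanize, it assumes the result URLs have already been collected into an array, and the helper names `ppt_links` and `pick_random` are made up for illustration.

```ruby
require "uri"

# Hypothetical helper: keep only URLs whose path ends in ".ppt"
# (case-insensitive), dropping duplicates and unparseable strings.
def ppt_links(urls)
  urls.select do |url|
    begin
      File.extname(URI.parse(url).path.to_s).downcase == ".ppt"
    rescue URI::InvalidURIError
      false
    end
  end.uniq
end

# Hypothetical helper: pick `count` URLs at random. The seed is an
# assumption added here so that a run can be repeated.
def pick_random(urls, count, seed: 42)
  urls.sample(count, random: Random.new(seed))
end
```

Calling `pick_random(ppt_links(urls), 1000)` would then give the list of files to download. If fewer than 1000 links are available, `sample` simply returns all of them in random order.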

Based on the text content of each slide, I composed mosaics of the most text-heavy slides and of the slides with little or no text. I also assembled a mosaic of the backgrounds extracted from all collected slides, sorted by average brightness.
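The two orderings behind these mosaics can be sketched in plain Ruby. This is an illustration under assumptions, not the pipeline actually used (which relied on ImageMagick and the other tools listed below): slides are assumed to be hashes with a `:text` key, backgrounds are assumed to be hashes with a `:pixels` array of `[r, g, b]` values in 0..255, "text-heavy" is taken to mean word count, and brightness is taken to be the plain mean of the three channels.

```ruby
# Number of whitespace-separated words in a slide's extracted text.
def word_count(text)
  text.to_s.split.size
end

# Slides ordered from most to least text (assumed metric: word count).
def rank_by_text(slides)
  slides.sort_by { |slide| -word_count(slide[:text]) }
end

# Mean brightness of an image given as [r, g, b] pixels in 0..255.
# Assumption: brightness is the unweighted average of the channels.
def average_brightness(pixels)
  return 0.0 if pixels.empty?
  pixels.sum { |r, g, b| (r + g + b) / 3.0 } / pixels.size
end

# Backgrounds ordered from darkest to brightest.
def sort_by_brightness(backgrounds)
  backgrounds.sort_by { |bg| average_brightness(bg[:pixels]) }
end
```

The first entries of `rank_by_text(slides)` would feed the text-heavy mosaic and the last entries the nearly empty one, while `sort_by_brightness` gives the order for the background mosaic.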

Tools used: Ruby (with Mechanize), ImageMagick, IrfanView, Processing, Intelligent Contact Sheet Maker.