A couple of weekends ago I put together a Python script that uses sumy (https://github.com/miso-belica/sumy and https://pypi.python.org/pypi/sumy/0.3.0) and pdfminer (http://www.unixuser.org/~euske/python/pdfminer/ and https://pypi.python.org/pypi/pdfminer/) to summarize all PDF, .docx (Word), and .txt files in a user-specified directory, including its sub-directories. In addition, it lists (but doesn’t summarize) any PowerPoint files (.ppt and .pptx).
I had recently returned from the Intelligent Transportation Society of America’s Annual Meeting with a USB drive containing the conference proceedings. The problem is that the files are organized into folders only by session code (e.g., TS-3), and each session can contain quite a diverse range of papers. I wanted a way to quickly scan the proceedings to identify items that might be worth my while to read, and I thought the tool might serve a similar purpose for others.
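For readers curious how the directory scan might look, here is a minimal sketch using only the standard library. It is not the actual Directory-Summarizer code; the function name `scan_directory` and the two extension sets are assumptions made for illustration.

```python
import os

# Extensions the post says are summarized vs. merely listed.
SUMMARIZABLE = {".pdf", ".docx", ".txt"}
LIST_ONLY = {".ppt", ".pptx"}


def scan_directory(root):
    """Walk root and its sub-directories, sorting files into two lists.

    Hypothetical helper: returns (paths_to_summarize, paths_to_list).
    Files with any other extension are ignored.
    """
    to_summarize, to_list = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Compare extensions case-insensitively (e.g. ".PDF").
            ext = os.path.splitext(name)[1].lower()
            path = os.path.join(dirpath, name)
            if ext in SUMMARIZABLE:
                to_summarize.append(path)
            elif ext in LIST_ONLY:
                to_list.append(path)
    return to_summarize, to_list
```

Calling `scan_directory` on the root of the USB drive would return every document under every session folder, regardless of how deeply the folders are nested.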
The user may also specify how many sentences to include in the summary of each document, as well as which of the summarization algorithms included in sumy to use.
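As a rough illustration of those two options, the sketch below maps a short algorithm name to one of sumy's summarizer classes and passes the sentence count through. The `summarize` function, its parameter names, and the short keys in `SUMMARIZERS` are hypothetical; the module and class names are those of the sumy package as I understand it, and the tokenizer relies on NLTK data that may need to be downloaded separately.

```python
import importlib

# Short keys are labels invented for this sketch; the values are
# (module, class) pairs from the sumy package.
SUMMARIZERS = {
    "lsa": ("sumy.summarizers.lsa", "LsaSummarizer"),
    "luhn": ("sumy.summarizers.luhn", "LuhnSummarizer"),
    "lex-rank": ("sumy.summarizers.lex_rank", "LexRankSummarizer"),
    "text-rank": ("sumy.summarizers.text_rank", "TextRankSummarizer"),
}


def summarize(text, algorithm="lsa", sentence_count=6, language="english"):
    """Return sentence_count summary sentences from text as strings."""
    if algorithm not in SUMMARIZERS:
        raise ValueError("unknown algorithm: %r" % (algorithm,))

    # Imported here so the lookup table can be inspected without sumy installed.
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.parsers.plaintext import PlaintextParser

    module_name, class_name = SUMMARIZERS[algorithm]
    summarizer_cls = getattr(importlib.import_module(module_name), class_name)

    parser = PlaintextParser.from_string(text, Tokenizer(language))
    summarizer = summarizer_cls()
    return [str(sentence) for sentence in summarizer(parser.document, sentence_count)]
```

For example, `summarize(report_text, algorithm="luhn", sentence_count=6)` would ask for a six-sentence summary, where `report_text` stands for text already extracted from a document.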
Summarizers generally attempt to determine which sentences in a document are most important for describing its content, and then present those sentences. They do not really understand a report, and can’t write a new abstract the way a human could. So the sentences in the summary do not flow together, but they typically do capture the content of the document. In addition to the summary, I pull out the first line of each report, as this is often the title, or the first part of the title, of the report.
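To make the idea of "picking the most important sentences" concrete, here is a toy extractive summarizer. It is a deliberately simplified word-frequency scheme, not one of sumy's algorithms and not the code in the tool; the function names and the crude "words longer than three letters" filter are assumptions for illustration only.

```python
import re
from collections import Counter


def possible_title(text):
    """Return the first non-empty line, which is often the title."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def _content_words(text):
    # Crude stand-in for stop-word removal: keep words longer than 3 letters.
    return [w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > 3]


def extractive_summary(text, sentence_count=3):
    """Pick the highest-scoring sentences and return them in document order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(_content_words(text))

    def score(sentence):
        # Average document-wide frequency of the sentence's content words.
        tokens = _content_words(sentence)
        return sum(freq[w] for w in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentence_count])  # restore original ordering
    return [sentences[i] for i in keep]
```

Because the selected sentences are copied verbatim from different parts of the document, the result reads as a list of highlights rather than as a flowing abstract.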
Here’s an example 6-sentence summary the tool produced for one of the papers in the proceedings, related to semi-automated platooning of trucks to reduce fuel consumption. I think it captures the scope of the paper:
POSSIBLE TITLE: EVALUATION AND TESTING OF DRIVER ASSISTIVE TRUCK PLATOONING:
This paper provides selected final results from Phase One, which is explored a range of technical and non-technical challenges, including assessing feasible real-world business models within the trucking industry.
Testing in past FHWA EAR research and by project partner Peloton has shown that, due to aerodynamic drafting effects, DATP has the potential to significantly reduce fuel use.
The premise of this research is that taking this technology to full commercialization requires a simpler technical approach (compared to fully automated platooning) which bridges from current trucking operations to DATP.
Data was taken in order to compare the relative distance measurements provided by Dynamic Based Real Time Kinematic (DRTK) and a Delphi automotive RADAR.
This particular road segment was chosen for the initial analysis due to its relatively low traffic volumes (resulting in a data set of manageable size) and limited ingress/egress points (allowing the consideration of trucks that remained on the corridor for an extended distance).
ATA Trucking Trends 2013) indicate that over-the-road operations, with an emphasis on truckload (TL) and line-haul less-than-truckload (LTL) sectors would experience the highest likelihood of encountering the desired DATP attributes.
File Path: E:TS01\2_14620_abstract_2183_0.pdf
The Directory-Summarizer can be used to generate summaries for any collection of documents stored in a master directory, and the code is available on GitHub.
P.S.: I understand that there is a Python port of tika that, once the bugs are worked out, could be dropped in so the summarizer could handle even more file types. Alternatively, the code could be modified to use a tika service instance to do the same. If anyone does that, let me know how it goes.