Directory-Summarizer: A Tool for Summarizing Conference Proceedings and Other Document Collections

A couple of weekends ago I put together a Python script that uses sumy (https://github.com/miso-belica/sumy and https://pypi.python.org/pypi/sumy/0.3.0) and pdfminer (http://www.unixuser.org/~euske/python/pdfminer/ and https://pypi.python.org/pypi/pdfminer/) to summarize all PDF, Word (.docx), and .txt files in a user-specified directory and its sub-directories. It also lists, but doesn’t summarize, PowerPoint files (.ppt and .pptx). I had recently returned from the Intelligent Transportation Society of America’s Annual Meeting with a USB drive of the conference proceedings. The problem is that the files are organized only into folders by session code (e.g., TS-3), and each session can have quite a diverse range of papers. I wanted a way to quickly scan the proceedings to identify items that might be worth my while to read, and I thought the tool might serve a similar purpose for others.

The user may also specify how many sentences to include in the summary of each document, as well as which of the summarization algorithms included in sumy to use.
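As a rough illustration of the directory scan described above, here is a minimal sketch using only the Python standard library. It is not the actual Directory-Summarizer code; the function name `scan_directory` and the two extension sets are assumptions made for this example. It walks a directory tree and separates the files that would be summarized from the PowerPoint files that would only be listed.

```python
import os

# Extensions as described in the post; these set names are illustrative only.
SUMMARIZE_EXTS = {".pdf", ".docx", ".txt"}
LIST_ONLY_EXTS = {".ppt", ".pptx"}


def scan_directory(root):
    """Walk root and its sub-directories and return two lists of paths:
    files to summarize, and files to list without summarizing."""
    to_summarize, to_list = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Compare extensions case-insensitively (.PDF and .pdf both match).
            ext = os.path.splitext(name)[1].lower()
            path = os.path.join(dirpath, name)
            if ext in SUMMARIZE_EXTS:
                to_summarize.append(path)
            elif ext in LIST_ONLY_EXTS:
                to_list.append(path)
    return to_summarize, to_list
```

Each path in the first list would then be handed to a text extractor and a summarizer, along with the sentence count the user asked for.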

Summarizers generally attempt to determine which sentences in a document are most important for describing its content, and present those. They do not really understand a report, and can’t write a new abstract like a human could. So the sentences in the summary do not flow together, but they typically do capture the content of the document. In addition to the summary, I pull out the first line of each report, as this is often the title, or the first part of the title.
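To make the idea of extractive summarization concrete, here is a small sketch in plain Python. It is not one of sumy's algorithms and not the code in the tool; it is a simple word-frequency heuristic standing in for them, and the names `possible_title` and `summarize` are made up for this example. It scores each sentence by how common its words are in the document, keeps the top few, and returns them in their original order.

```python
import re
from collections import Counter


def possible_title(text):
    """Return the first non-blank line, which is often the title."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def summarize(text, sentence_count=3):
    """Return the highest-scoring sentences, in document order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stop-word filter: ignore words of three letters or fewer.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored.
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentence_count])
    return [sentences[i] for i in keep]
```

Because the sentences are lifted verbatim from different parts of the document, the result reads as a set of highlights rather than a flowing abstract, which matches the behavior described above.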

Here’s an example 6-sentence summary the tool produced for one of the papers in the proceedings, related to semi-automated platooning of trucks to reduce fuel consumption. I think it captures the scope of the paper:

POSSIBLE TITLE: EVALUATION AND TESTING OF DRIVER ASSISTIVE TRUCK PLATOONING:

This paper provides selected final results from Phase One, which is explored a range of technical and non-technical challenges, including assessing feasible real-world business models within the trucking industry.

Testing in past FHWA EAR research and by project partner Peloton has shown that, due to aerodynamic drafting effects, DATP has the potential to significantly reduce fuel use.

The premise of this research is that taking this technology to full commercialization requires a simpler technical approach (compared to fully automated platooning) which bridges from current trucking operations to DATP.

Data was taken in order to compare the relative distance measurements provided by Dynamic Based Real Time Kinematic (DRTK) and a Delphi automotive RADAR.

This particular road segment was chosen for the initial analysis due to its relatively low traffic volumes (resulting in a data set of manageable size) and limited ingress/egress points (allowing the consideration of trucks that remained on the corridor for an extended distance).

ATA Trucking Trends 2013) indicate that over-the-road operations, with an emphasis on truckload (TL) and line-haul less-than-truckload (LTL) sectors would experience the highest likelihood of encountering the desired DATP attributes.

File Path: E:TS01\2_14620_abstract_2183_0.pdf

The Directory-Summarizer can be used to generate summaries for any collection of documents stored under a master directory, and the code is available on GitHub.

P.S.: I understand that there is a Python port of Tika that, when the bugs are out, could be dropped in so the summarizer could handle even more file types, or the code could be modified to use a Tika service instance to do the same. If anyone does that, let me know how it goes.

 

Struct and functions when using the Arduino IDE

As anyone reading this blog probably knows, the Arduino IDE simplifies a number of aspects of programming for an embedded environment and hides some of the required C/C++ material. This can make life a lot easier, but it can also cause problems, especially when you step out to do more complex things. I got bitten by one of those earlier today. Since I eventually found a post describing the workaround, I thought I’d post it here.

In my robot code, I’ve defined a struct called coord that holds two doubles, which are the x and y coordinates for whatever I need (e.g., the position of the robot, the next waypoint, etc.).

Today, I wanted to compute the distance between the current position of the vehicle and the ray defined by the previous and next waypoints, so that the error could be fed into a PID controller. I figured it would be easy to pass the parameters as coord types. BUT, this turns out to be trickier than it should be with the Arduino. Unless the structs are defined in a .h file, there are problems with their scope: the IDE automatically generates prototypes for the functions in a sketch and places them ahead of the struct definition, so the compiler does not yet know the type when it reaches them. A workaround is documented by Alexander Brevig on the Arduino Playground: Struct Resource for Arduino.
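The geometry itself is independent of the Arduino issue. As a minimal sketch, written in Python like the script in the previous post rather than as the actual robot code, the error can be computed as a signed perpendicular distance. The `Coord` namedtuple stands in for the coord struct, and the function name `cross_track_error` is an assumption for this example. It also treats the two waypoints as defining an infinite line rather than a ray.

```python
import math
from collections import namedtuple

# Python stand-in for the coord struct (two doubles: x and y).
Coord = namedtuple("Coord", ["x", "y"])


def cross_track_error(prev_wp, next_wp, position):
    """Signed perpendicular distance from position to the line through
    prev_wp and next_wp. Positive means left of the direction of travel."""
    dx = next_wp.x - prev_wp.x
    dy = next_wp.y - prev_wp.y
    length = math.hypot(dx, dy)
    if length == 0.0:
        # Degenerate case (both waypoints coincide): use the plain distance.
        return math.hypot(position.x - prev_wp.x, position.y - prev_wp.y)
    # z-component of the 2D cross product (next - prev) x (position - prev).
    cross = dx * (position.y - prev_wp.y) - dy * (position.x - prev_wp.x)
    return cross / length
```

The sign tells the controller which way to steer, which is usually more useful as a PID input than an unsigned distance.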