Last week was another winter meeting of the American Astronomical Society (AAS227) and another installment of the AAS Hack Day. Earlier in the week, I gave a short 5-minute talk on software testing, participated in a panel on better practices for scientific programming, and then gave my dissertation talk on the density evolution of stellar streams formed around chaotic orbits. At the hack day, I worked with Scott Idem (AAS) and David W. Hogg (NYU) to continue the theme of hacks started two years ago by Dan Foreman-Mackey (UW), Dylan Gregersen (UU), and me: working with all of the titles and abstracts presented at the AAS meeting.

The idea for this year was to work towards automating the scheduling process for the AAS meetings. Right now (as far as I know), submitted abstracts are sorted manually by volunteers and then organized into sessions. I'm not sure how much thought goes into scheduling simultaneous sessions, but I know that for my own interests there are often multiple relevant sessions happening at the same time (leading to "session jumping"). Maybe computers can do better?

The AAS (really: Scott Idem, facilitated by Kelle Cruz - thanks!) gave us access to a database containing session and presentation info for AAS227 and past meetings. This was a huge improvement over previous hack days, where Foreman-Mackey spent most of his time parsing the abstract info from a PDF (if I remember correctly). With access to the database, we set off to do some proof-of-concept hacks with the text data. Below I've included cells from the Jupyter notebook I worked in during the hack day:

We'll start by seeing what session types are in the database for the AAS227 meeting. (Note: I've already established a connection to the database using sqlalchemy, but I've left that code out of the post.)
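For reference, a minimal version of that setup might look something like the sketch below. The connection string is just a placeholder (the real credentials and database location came from the AAS), and the other imports are the ones assumed by the rest of the cells in this post:

import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from sqlalchemy import create_engine

# placeholder URL -- the real AAS database has its own dialect, host, and credentials
engine = create_engine("postgresql://user:password@hostname/aas_meetings")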

In [4]:
res = engine.execute("""
SELECT DISTINCT(session.type) FROM session 
WHERE session.meeting_code = 'aas227';
""")
res.fetchall()
Out[4]:
[('Special Session',),
 ('Workshop',),
 ('Splinter Meeting',),
 ('Private Splinter Meeting',),
 ('Attendee Event',),
 ('Invitation-Only Event',),
 ('Town Hall',),
 ('Plenary Session',),
 ('Poster Session',),
 ('Oral Session',),
 ('Open Event',),
 ('Public Event',)]

We're going to limit this hack to the usual oral sessions, special sessions, and splinter meetings -- we are ignoring the plenary sessions, town halls, poster sessions, and any other specialized session types. Let's start by querying the database to get all of the sessions and presentations of these types from AAS227:

In [5]:
query = """
SELECT session.title, session.start_date_time, session.end_date_time, session.so_id
FROM session
WHERE session.meeting_code = 'aas227'
    AND session.type IN (
        'Oral Session',
        'Special Session',
        'Splinter Meeting'
    )
ORDER BY session.so_id;
"""
session_results = engine.execute(query).fetchall()

# load the presentation data into a Pandas DataFrame
session_df = pd.DataFrame(session_results, columns=session_results[0].keys())

# turn the timestamps into datetime
session_df['start_date_time'] = pd.to_datetime(session_df['start_date_time']) 
session_df['end_date_time'] = pd.to_datetime(session_df['end_date_time'])
session_df = session_df[1:] # zero-th entry has a corrupt date
In [6]:
query = """
SELECT presentation.title, presentation.abstract, presentation.id, session.so_id
FROM session, presentation
WHERE session.meeting_code = 'aas227'
    AND session.so_id = presentation.session_so_id
    AND presentation.status IN ('Sessioned', '')
    AND session.type IN (
        'Oral Session', 
        'Special Session',
        'Splinter Meeting'
    )
ORDER BY presentation.id;
"""
presentation_results = engine.execute(query).fetchall()

# sort the presentations by session
presentation_results = sorted(presentation_results, key=lambda x: x['so_id'])

# load the presentation data into a Pandas DataFrame and clean out HTML tags
presentation_df = pd.DataFrame(presentation_results, columns=presentation_results[0].keys())
presentation_df['abstract'] = presentation_df['abstract'].str.replace('<[^<]+?>', '')
presentation_df['title'] = presentation_df['title'].str.replace('<[^<]+?>', '')

Let's look at the first few entries for the sessions:

In [7]:
session_df[:3]
Out[7]:
title start_date_time end_date_time so_id
1 Lectures in AstroStatistics 2016-01-06 10:00:00 2016-01-06 11:30:00 212363
2 Hubble Space Telescope: a Vision to 2020 and B... 2016-01-06 14:00:00 2016-01-06 15:30:00 212364
3 The Astrophysics of Exoplanet Orbital Phase Cu... 2016-01-06 14:00:00 2016-01-06 15:30:00 212365

And for the presentations:

In [8]:
presentation_df[:3]
Out[8]:
title abstract id so_id
0 A Plan for Astrophysical Constraints of Dark M... We present conclusions and challenges for a co... 21704 212362
1 The Likelihood Function and Likelihood Statistics The likelihood function is a necessary compone... 21706 212363
2 From least squares to multilevel modeling: A g... This tutorial presentation will introduce some... 21707 212363
In [9]:
nsessions = len(session_df)
npresentations = len(presentation_df)
print(nsessions, npresentations)
139 675

There were 139 sessions and 675 presentations (again, this excludes plenary sessions, town halls, etc.).

Now we have local access to the session and presentation information in Pandas DataFrame objects. We're going to use a combination of nltk and scikit-learn to process and work with this text data (both are pip installable). The first thing we can look at is what the most popular words are in all AAS abstracts. To do this, we first have to split up the abstracts (jargon: "tokenize" the text) and remove any inflection from the tokens (jargon: "stem" the tokens) -- e.g., we want to count "galaxy", "Galaxy", and "galaxies" as equivalent. We'll use tools from nltk to define our own function to tokenize a block of text:

In [10]:
import re  # used by tokenize() below
import nltk
from nltk.stem.porter import PorterStemmer
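# note: nltk.word_tokenize (used below) relies on the 'punkt' tokenizer models;
# if you haven't downloaded them before, run nltk.download('punkt') once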
In [11]:
def tokenize(text, stemmer=PorterStemmer()):
    # remove non letters
    text = re.sub("[^a-zA-Z]", " ", text)
    
    # tokenize
    tokens = nltk.word_tokenize(text)
    
    # stem
    stems = [stemmer.stem(token) for token in tokens]
    
    return stems

The process of tokenizing and stemming a block of text will take, for example, a sentence or paragraph like "This image contains one foreground galaxy and many background galaxies" and convert this to a Python list of word stems:

In [12]:
tokenize("This image contains one foreground galaxy and many background galaxies")
Out[12]:
['Thi',
 'imag',
 'contain',
 'one',
 'foreground',
 'galaxi',
 'and',
 'mani',
 'background',
 'galaxi']

We're going to use a class from scikit-learn to automatically count instances of word stems in a given block of text or a list of blocks of text. To do this right, we also want to remove common words like "as" or "the" from the token list (jargon: "stop words"). We can do this by specifying stop_words='english' when initializing this class. We'll also pass in our custom tokenizer function and specify that we want to convert all text to lowercase before processing (lowercase=True):

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    analyzer='word',
    tokenizer=tokenize,
    lowercase=True,
    stop_words='english',
)

We can now process all of the presentation titles and abstracts and count word stem occurrences using the fit_transform() method of the CountVectorizer class. This returns a (sparse) matrix with one row per presentation and one column per unique word stem, where each entry counts how many times that stem appears in that presentation. Let's break down what that means with a very simple example. Imagine there were only three presentations with titles:

  • Exoplanets, exoplanets, exoplanets
  • I found an exoplanet
  • Stuff that explodes in space

The unique word stems from this imaginary AAS would be (with stop words removed) "exoplanet", "stuff", "explod", "space":

In [14]:
example_titles = ["Exoplanets, exoplanets, exoplanets", "I found an exoplanet", "Stuff that explodes in space"]
example_count_matrix = vectorizer.fit_transform(example_titles).toarray()
vectorizer.get_feature_names()
Out[14]:
['exoplanet', 'explod', 'space', 'stuff']

The count matrix would then have shape (3, 4). The 0th row specifies how many times each of these unique word stems appears in the first title:

In [15]:
example_count_matrix[0]
Out[15]:
array([3, 0, 0, 0], dtype=int64)

The only word stem that appears in the first title is "exoplanet", which occurs 3 times.

Let's start with the real data by analyzing the presentation titles:

In [16]:
title_count_matrix = vectorizer.fit_transform(presentation_df['title']).toarray()
title_count_matrix.shape
Out[16]:
(675, 1498)

This matrix is quite a bit bigger. From this, we can see there are 1498 unique word stems across all presentation titles. If we sum the counts over all presentations (i.e., sum down each column), we can sort and find the most common word stems in titles:

In [17]:
title_counts = title_count_matrix.sum(axis=0)
sort_by_count_idx = title_counts.argsort()[::-1] # reverse sort
words = np.array(vectorizer.get_feature_names())

# print the word stem and the number of occurrences
for idx in sort_by_count_idx[:10]:
    print(words[idx], title_counts[idx])
galaxi 101
star 82
survey 66
ray 58
mass 50
format 45
observ 44
stellar 36
cluster 34
disk 33

Let's repeat this for the abstracts themselves:

In [18]:
abs_count_matrix = vectorizer.fit_transform(presentation_df['abstract']).toarray()
abs_count_matrix.shape
Out[18]:
(675, 5568)
In [19]:
abs_counts = abs_count_matrix.sum(axis=0)
sort_by_count_idx = abs_counts.argsort()[::-1] # reverse sort
words = np.array(vectorizer.get_feature_names())

# print the word stem and the number of occurrences
for idx in sort_by_count_idx[:10]:
    print(words[idx], abs_counts[idx])
galaxi 1042
star 948
thi 850
observ 798
mass 686
use 672
model 441
survey 437
format 422
present 387

Galaxies beat stars by a narrow margin! (Stems like "thi" sneak through because the stop-word filter is applied after stemming, and "thi" -- the stem of "this" -- no longer matches the stop word "this".)


Now that we have the matrix of word (stem) counts for the presentation abstracts, we can define a metric to assess how similar two presentations are. We'll use the "cosine similarity", which is essentially just the cosine of the angle between two word count vectors (i.e., two rows of the count matrix): the dot product of the two vectors divided by the product of their norms.

In [20]:
def similarity(v1, v2):
    numer = v1.dot(v2)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)

    if numer == 0: # no shared word stems: the vectors are orthogonal (this also avoids dividing by zero)
        return 0.
    else:
        return numer / denom

To explain this in a bit more detail, let's return to our example above with only three presentation titles. Let's compute the similarity between all pairs of titles:

In [21]:
example_count_matrix
Out[21]:
array([[3, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 1, 1, 1]], dtype=int64)
In [22]:
sim_01 = similarity(example_count_matrix[0], example_count_matrix[1])
sim_02 = similarity(example_count_matrix[0], example_count_matrix[2])
sim_12 = similarity(example_count_matrix[1], example_count_matrix[2])

sim_01, sim_02, sim_12
Out[22]:
(1.0, 0.0, 0.0)

This makes sense: after stop word removal, the first two titles only contain the word stem "exoplanet", so their count vectors point in the same direction. The other pairs are orthogonal because the third title shares no word stems with the first two. We'll now compute the cosine similarity between all pairs of abstracts, which will give us a 675 by 675 matrix of values:

In [23]:
similarity_matrix = np.zeros((npresentations, npresentations))
for ix1 in range(npresentations):
    for ix2 in range(npresentations):
        similarity_matrix[ix1,ix2] = similarity(abs_count_matrix[ix1], abs_count_matrix[ix2])
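
As an aside, this double loop is the most transparent way to write the computation, but the same matrix can be computed in a single call with scikit-learn's built-in cosine_similarity -- not something we used during the hack, just a possible cross-check:

from sklearn.metrics.pairwise import cosine_similarity

# should agree with the loop above (up to floating point round-off)
vectorized_similarity_matrix = cosine_similarity(abs_count_matrix)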

Here's a visualization of the similarity matrix -- remember that we sorted the presentations by session, so presentations in the same session sit next to each other along the diagonal. Large values (bright pixels) far from the diagonal are therefore an indication of sessions with overlapping topics:

In [24]:
pl.figure(figsize=(8,8))
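# clip the color scale just below 1 so the diagonal (each abstract compared with itself) doesn't wash out the off-diagonal structure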
vmax = similarity_matrix[similarity_matrix < 0.99].max()
pl.imshow(similarity_matrix, cmap='magma', interpolation='nearest',
          vmax=vmax)
Out[24]:
<matplotlib.image.AxesImage at 0x10e38b048>