© 2014 by the Seismological Society of America
INTRODUCTION AND MOTIVATION
The Data Management Problem
Research in seismology commonly involves three main tasks: data acquisition, data management, and data analysis. Acquisition has been greatly advanced by institutions such as the Incorporated Research Institutions for Seismology (IRIS) and the Network of Research Infrastructures for European Seismology (NERIES), accelerated by facilities such as the USArray Transportable Array (TA), and facilitated by tools such as Standing Order for Data (Owens et al., 2004) and ObsPyLoad (Scheingraber et al., 2013). The availability of abundant broadband data has fueled novel analysis tools (Lou et al., 2013), techniques (Levander and Nolet, 2005), and data products (Trabant et al., 2012) that were not possible with earlier, more limited datasets. It has also highlighted, however, the long‐standing data management problem for many researchers in seismology.
In the face of the data flood, traditional workflows, such as a combination of shell scripts, text files, miniSEED or Seismic Analysis Code (SAC) files, and C, Fortran, or MATLAB analysis codes, can quickly become untenable. This is partly because, for each ad hoc storage scheme, the researcher must hand code what is essentially a database query/command in order to select (meta)data or to store measurements. This makes obtaining basic information such as event times, station availability, and phase arrival times burdensome and error prone. In this paper, we present Pisces, a data management library that utilizes databases, leverages existing free and open‐software tools and standards, and frees the user to both manage and analyze data with a single easy‐to‐use language, Python.
Relational databases have long existed to help with the data management problem. Versions of the Center for Seismic Studies (CSS 3.0) seismic database schema (Anderson et al., 1990), a set of standard tables holding seismological data, have been adopted by United States national laboratories, national and international data centers, and in some commonly used software, such as the Antelope Datascope system (http://www.brtt.com/software.html, last accessed December, 2013). There are several problems, however, that prevent more widespread adoption of these tools among researchers:
(1) Researchers do not want to learn another computer language for managing data. Database backends like Oracle or PostgreSQL use structured query language (SQL), and perhaps another scripting language like Perl, to interact with a database. A researcher is more likely to string together tools that they already know, such as shell and MATLAB scripts, than learn a single‐use data management language. Some analysis environments can interact with databases, such as C/Fortran, Antelope, MATLAB, and Python. More and more, however, analysis is done with interpreted languages like MATLAB and Python.
(2) Many existing data management solutions are difficult to learn, whether because of a steep learning curve or a narrow user base. A wide user base is essential because it lowers barriers to adoption and productivity, for example, when bugs have already been cataloged or solved, and examples or recipes have been shared online. Software licensing fees and restrictions can also limit the user base, in addition to introducing an operational cost. A visible and diverse user community is itself a software problem‐solving tool.
To address some of these problems, we introduce Pisces, a practical and extensible data management library in Python. The name Pisces is not an acronym, but was chosen for its etymological association with waves, and because its first syllable sounds like “py,” conventionally used in naming Python modules. Pisces was designed with the following goals in mind:
(1) Take advantage of existing widely used open‐source technologies. In addition to gaining open‐source benefits such as freedom and reproducibility (Lees, 2012), we use well‐known software tools and standards to take advantage of existing documentation, code examples, and rich user communities. Pisces employs Python (http://python.org, last accessed December 2013), SQLAlchemy (Bayer, 2012), NumPy (Oliphant, 2007), and SQL relational databases using the CSS 3.0 schema. Through these technologies, and the accompanying large and diverse user communities, researchers can leverage a wide knowledge base for writing and troubleshooting data management code.
(2) Let users manage and analyze data in a single language: Python. Thanks to data acquisition tools like ObsPyLoad and wavesdownloader (Bernardi and Michelini, 2013) and analysis tools like ObsPy (Beyreuther et al., 2010; Megies et al., 2011), AIMBAT (Lou et al., 2013), MSNoise and pyTDMT (see Data and Resources), and the SciPy Stack (e.g., Jones et al., 2001; Hunter, 2007; Oliphant, 2007; Pérez and Granger, 2007), the Python ecosystem is a desirable research environment. Many of these high‐level numerical libraries are available in consistent prepackaged free distributions, such as Enthought Canopy, Python(x,y), and Anaconda (see Data and Resources). In addition, through its use of SQLAlchemy, Pisces users can just write Python code to interact with the underlying SQL database without having to learn SQL (though literal SQL queries can still be used). This frees the researcher from having to invest in learning a separate data management language or how to glue multiple languages together.
(3) Make data management code extensible, portable, and scalable. All components of Pisces are free, open source, and cross platform. This allows the user to inspect and change source code, to install on multiple architectures, and to use without paying expensive fees. Python and SQLAlchemy can be used on different systems and with different database backends, so code written for a small project that uses SQLite on a Mac will also work for a large project using a remote Oracle database on a Linux system. Although some enterprise‐level database backends are not free, such as Oracle, there are many free alternatives such as MySQL and PostgreSQL. In addition, SQLite is part of the Python standard library, providing Pisces users an immediate and capable database backend with no configuration required.
Even in the span of just a few years, a researcher may encounter many different projects, system architectures, or budgetary and licensing concerns. The ultimate goal of Pisces is to allow the user to write code that will not eventually have to be abandoned due to any one of these concerns. Furthermore, we wish to integrate with existing data acquisition and analysis tools and to employ widely used open‐source technologies, in order to bridge the gap between data acquisition and data analysis (Fig. 1). The following sections detail the functionality of the Pisces package. We will demonstrate several use cases, such as beginning a new database, using an existing database, and importing/exporting data. We will also briefly discuss the Pisces SQL database and its performance, and compare Pisces with several existing data management solutions.
OVERVIEW AND USE CASES
Pisces is a minimal library that largely exposes SQLAlchemy’s Object Relational Mapper (ORM). Simply, the ORM links regular Python classes to SQL database tables, and instances of these classes to rows in a table. Column values for a database row are attributes on the class instance. This mapping of database entities to familiar Python constructs (classes, instances, and attributes) isolates the user from having to manually build and manipulate SQL query strings and facilitates more native‐looking code. Pisces comes with a set of classes that represent core CSS 3.0 seismic database tables, such as those containing station, phase arrival, and event information, but exposes relatively few specialized functions on top of this framework. It is written as a library, not a more rigid application, with the recognition that nearly all processing is custom, and for this reason, any data management solution must be exposed at a relatively low level and be easily customizable.
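The ORM idea can be illustrated with a minimal, hand‐rolled sketch using only the standard library (the class and column names here are illustrative; in Pisces the mapping is generated by SQLAlchemy rather than written by hand):

```python
import sqlite3

# A plain Python class standing in for an ORM-mapped table: one class per
# table, one instance per row, one attribute per column.
class Site:
    def __init__(self, sta, lat, lon, elev):
        self.sta, self.lat, self.lon, self.elev = sta, lat, lon, elev

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE site (sta TEXT, lat REAL, lon REAL, elev REAL)")
conn.execute("INSERT INTO site VALUES ('ANMO', 34.9459, -106.4572, 1.85)")

# What an ORM does under the hood: run SQL, then wrap each returned row in a
# class instance so column values become attributes.
sites = [Site(*row) for row in conn.execute("SELECT * FROM site")]
print(sites[0].sta, sites[0].lat)  # attribute access instead of row indexing
```

With the ORM, the user works with `sites[0].lat` rather than building the `SELECT` string and indexing into raw tuples, which is what makes database code look like ordinary Python.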
We do, however, leverage the fact that database tables are represented as Python classes to add some useful functionality in Python. Pisces classes are aware of the default values for all fields in a table row (class instance), whether or not the underlying database defines default values, and they are also aware of their correctly formatted text file (“flat file”) representation. In addition, through Python’s class inheritance, this functionality is easily added to new tables. Pisces contains submodules that handle reading flat files, reading waveform data of various formats, converting waveform file (e.g., SAC) headers into table values, and building common queries. The remainder of this section will demonstrate how Pisces may be applied in some common use cases.
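The flat‐file awareness described above amounts to knowing, for each column, its width, type, and position in a fixed‐width line. A stdlib sketch of that parsing step (field widths here are illustrative, not the exact CSS 3.0 site specification):

```python
# Per-column metadata: (name, character width, type cast).  Widths are
# illustrative; a real CSS 3.0 table defines its own exact widths.
FIELDS = [("sta", 6, str), ("lat", 9, float), ("lon", 9, float), ("elev", 9, float)]

def parse_line(line):
    """Slice a fixed-width flat-file line into typed column values."""
    values, pos = {}, 0
    for name, width, cast in FIELDS:
        values[name] = cast(line[pos:pos + width].strip())
        pos += width + 1  # skip the single space separating fields
    return values

# Build a sample line with the same widths, then round-trip it.
line = f"{'ANMO':<6} {34.9459:9.4f} {-106.4572:9.4f} {1.85:9.4f}"
row = parse_line(line)
```

In Pisces this metadata lives on the class itself, so a method like `from_string` can reconstruct a row instance from any correctly formatted line.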
Declaring, Loading, and Querying Database Tables
SQLAlchemy and Pisces provide convenient functions to declare or load existing database tables and build common basic queries, but querying and editing tables are generally done using SQLAlchemy constructs. Figure 2 demonstrates declaring and loading tables. This is done once for each table name, after which they can then be used or imported for use. Here, a “site” and “origin” table are declared, gaining all columns and constraints from the corresponding CSS 3.0 prototype tables that come with Pisces. A site table contains station names and locations, and an origin table contains locations of seismic sources. Arbitrary tables can also be loaded from an existing database. In Figure 2, an “affiliation” table, which associates stations with networks, is also declared, but its database representation (columns and constraints) will later be loaded directly from a database in a script (see Fig. 3).
Figure 3 demonstrates querying and editing the declared tables. To use the declared/defined tables, one can just import them from the file declaring them, as in line 3. The db_connect function accepts a standard database URL or a series of keyword arguments, and returns a session object, which manages interactions with the target database. In line 9, the Affiliation class’s prepare method loads the internals of the class directly from an existing affiliation table in the database. Lines 12–16 demonstrate basic queries on the tables using both SQLAlchemy and one of the Pisces query builders in pisces.request, and lines 20–27 show how to add, update, and delete entries.
Import and Export Flat Files
Standard text file tables (flat files) can easily be read into or written from a database, as well. In Figure 4, rows from an origin flat file “TA.origin” are added to a “TA.sqlite” SQLite database. Each line in the text file is passed to Origin.from_string, which is a Pisces‐specific method that uses the information in the underlying class declaration to interpret the line and populate an origin instance. Finally, session.add(iorigin) and session.commit() add and write rows to the database.
Database records can just as easily be exported to flat files, using natural Python syntax. Beginning at line 18, 10 entries of our TA database site table, ordered by longitude, are loaded and written to a text file. Each Site instance returned from the database query is converted to a string by str(isite), also using information in the class definition, and written to a site flat file with a system‐appropriate newline character (e.g., Fig. 5).
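The export pattern described here (query, order, format each row as a line, write with a system‐appropriate newline) can be sketched with the standard library alone; the table and column names are illustrative stand‐ins for a Pisces Site table:

```python
import os
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE site (sta TEXT, lat REAL, lon REAL, elev REAL)")
conn.executemany("INSERT INTO site VALUES (?, ?, ?, ?)",
                 [("ANMO", 34.9459, -106.4572, 1.850),
                  ("TUC", 32.3098, -110.7847, 0.909)])

# Query rows ordered by longitude, render each as a fixed-width flat-file
# line, and join with the platform's newline (os.linesep), as in the paper.
rows = conn.execute("SELECT sta, lat, lon, elev FROM site ORDER BY lon")
lines = [f"{sta:<6} {lat:9.4f} {lon:9.4f} {elev:9.4f}"
         for sta, lat, lon, elev in rows]
flatfile = os.linesep.join(lines)
```

In Pisces the formatting step is simply `str(isite)`, because each mapped class already carries its flat‐file representation.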
Extend the Schema with a Custom Table
The CSS 3.0 table set that comes with Pisces can easily be extended to accommodate custom tables, using SQLAlchemy syntax. In Figure 6, we reproduce the ambient cross‐correlation descriptor table, “ccwfdisc,” from Barmine and Ritzwoller (2012), which describes stacked noise cross‐correlation waveforms. In this example, the table was formed from a combination of earlier‐defined columns, which were imported from a known schema, and new columns. The info= dictionary argument in a column definition describes the default value, string width and parsing function, and string output format for each field on the table. The parse key is simply a function that accepts the correct substring of a flat file line and returns the correct value and type in Python.
The table class inherits from css.Base, which is a class that recognizes and applies the info= arguments for handling default values and flat file reading/writing. The Ccwfdisc class can then be imported and used in scripts, as in previous examples. With small changes to this syntax, the table can be declared in a way that makes it reusable as a generic table type via inheritance.
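The role of the info= metadata can be sketched in plain Python (the dictionary keys and column names below are illustrative, not the exact Pisces API): a per‐column table of defaults, parsers, and output formats drives both reading and writing.

```python
# Illustrative per-column metadata in the spirit of Pisces' info= dicts.
COLUMNS = {
    "sta": {"default": "-",  "width": 6, "parse": str,   "format": "{:<6}"},
    "snr": {"default": -1.0, "width": 9, "parse": float, "format": "{:9.4f}"},
}

def format_row(**values):
    """Render a row as a flat-file line, filling missing fields with defaults."""
    parts = []
    for name, meta in COLUMNS.items():
        value = values.get(name, meta["default"])
        parts.append(meta["format"].format(value))
    return " ".join(parts)

line = format_row(sta="ANMO")  # snr is unspecified and falls back to -1.0
```

Because the metadata is attached to the class rather than scattered through scripts, a new table declared with the same conventions inherits flat‐file reading and writing for free.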
Load Event‐Windowed Waveforms
Event‐windowed waveforms are commonly desired for processing. The SQL databases underlying Pisces generally cannot perform specialized operations such as geographic calculations, travel‐time predictions, or waveform file reading without special configuration. Pisces uses NumPy, ObsPy, and data conversion libraries written in C for this out‐of‐database functionality.
Figure 7 demonstrates how to use the ObsPy‐style Client class for retrieving event‐based waveform segments. The Client provides a convenient way to manage basic queries. First, the client is instantiated with a database connection URL, as before. Next, the required station, event, and data tables are loaded with load_tables. The get_events method is called to request events in 2010 with mb≥5.4 that are 30–90 epicentral degrees from latitude 40, longitude −105, returning a list of Origin class instances (table rows). As the distance calculation is done out‐of‐database, all events that match the other criteria are loaded first before applying the distance filter. For large result sets, this can be memory intensive without an additional in‐database filter, such as a region box or smaller magnitude range. The get_stations method collects broadband vertical channels in the western United States and returns a list of Site instances. Finally, we loop through events and stations to predict a pre‐P and post‐surface‐wave phase arrival‐time window at each station using ObsPy distance calculators and travel_times, request the corresponding waveforms (as a collection of ObsPy Trace objects, called a Stream) using get_waveform, and write each trace to disk as a SAC file using the ObsPy Trace object's write method, which supports several output formats, including miniSEED.
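The out‐of‐database distance filter described above can be sketched in pure Python: compute great‐circle separation for each candidate origin and keep those in the requested epicentral‐degree window. (The function below uses the spherical law of cosines; the origin coordinates are made up for illustration, and in practice ObsPy's geodetic routines would be used.)

```python
from math import acos, cos, degrees, radians, sin

def epicentral_degrees(lat1, lon1, lat2, lon2):
    """Great-circle separation in degrees via the spherical law of cosines."""
    p1, p2 = radians(lat1), radians(lat2)
    dlon = radians(lon2 - lon1)
    # Clamp to [-1, 1] to guard against floating-point drift at 0/180 degrees.
    c = sin(p1) * sin(p2) + cos(p1) * cos(p2) * cos(dlon)
    return degrees(acos(min(1.0, max(-1.0, c))))

# Candidate origins fetched first (in-database filters applied), then the
# 30-90 degree window is applied in Python, as get_events does.
origins = [(61.0, -150.0), (35.0, -100.0), (-20.0, 150.0)]  # (lat, lon)
ref_lat, ref_lon = 40.0, -105.0
kept = [o for o in origins
        if 30.0 <= epicentral_degrees(ref_lat, ref_lon, *o) <= 90.0]
```

This also makes the memory caveat in the text concrete: every candidate row must be materialized in Python before the distance test runs, so narrowing the in‐database query first pays off for large catalogs.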
Waveforms retrieved from the database can be stored in any number of binary or text formats recognized by the CSS 3.0 or National Nuclear Security Administration KB Core schema (Carr, 2002), including the “e1” variable‐difference compression format. NumPy is used for reading many of the supported binary formats, ObsPy is used to read SAC, SEGY, and SEED/miniSEED formats, and “e1” and “s3” (24 bit integer) formats are read using included libraries written in C.
Comparison to Existing Data‐Management Solutions
Pisces differs from existing data‐management software packages in that it is a library for leveraging SQL databases to facilitate geophysical research. It is not an integrated data acquisition and analysis application like EMERALD (West and Fouch, 2012) or IGeoS (Morozov and Pavlis, 2011a,b), nor is it a real‐time acquisition and processing system like SeisComp3 from GFZ‐Potsdam (http://www.seiscomp3.org, last accessed December 2013), Earthworm (Johnson et al., 1995), Winston (http://www.avo.alaska.edu/Software/winston/W_Manual_TUT.html, last accessed December 2013) from the Alaska Volcano Observatory, Hydra (Buland et al., 2009) from the U.S. Geological Survey, or Antelope Real‐Time Systems from Boulder Real Time Technologies. SeisHub (Barsch, 2009) is a seismological database that also utilizes an underlying SQL database, but its emphasis is on integrating XML standards and web‐based technologies. Pisces is designed to support traditional research, using tools and data structures that are actively used in the wider scientific and engineering communities so that documentation and help are easy to find.
Pisces SQL Database
The database tables that come with Pisces can be used to initialize an SQL database. They follow the CSS 3.0 schema and are otherwise designed to minimize constraints. Value types are enforced (e.g., strings versus floats), but there are no value check constraints. This keeps table editing fast but requires users to check that their data make sense. Also, there are no “foreign key” constraints or enforced relationships between columns in tables. For example, a seismic origin id (an integer identifying a unique row) in one table should correspond to the origin id in a related table; in Pisces, no error will be raised if the ids do not match. This means that changes to one table must be manually propagated to related tables, and joins between tables must always be explicit (no “natural joins”). It is assumed that, for a given project, a researcher may only be interested in a subset of tables and may not care about consistency with irrelevant tables. This allows the researcher to choose the level of rigor/consistency with which a database is maintained. Unique value and primary key constraints are enforced, however, so that a minimum level of physical meaning can be attributed to each table entry. Finally, there are no indexes, which are internal database structures that speed up searches on a specific set of columns (at the expense of memory).
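Both consequences of the missing foreign keys, dangling ids inserting silently and joins having to name the shared column explicitly, can be shown with a small sqlite3 sketch (table and column names follow CSS 3.0 conventions, but the rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE origin (orid INTEGER PRIMARY KEY, lat REAL, lon REAL)")
conn.execute("CREATE TABLE assoc (arid INTEGER PRIMARY KEY, orid INTEGER, phase TEXT)")
conn.execute("INSERT INTO origin VALUES (1, 40.0, -105.0)")
conn.execute("INSERT INTO assoc VALUES (10, 1, 'P')")

# With no foreign-key constraint, a dangling orid inserts without error:
conn.execute("INSERT INTO assoc VALUES (11, 999, 'P')")

# Joins must therefore be explicit on the shared id column; the dangling
# row simply drops out of the join result.
matched = conn.execute(
    "SELECT count(*) FROM origin JOIN assoc ON origin.orid = assoc.orid"
).fetchone()[0]
```

Here `matched` counts only the consistent association; the orphaned row with orid 999 is silently excluded, which is exactly the behavior the user must keep in mind when editing related tables.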
Operating on a database using Python is generally slower than using a raw SQL environment. Speeds may be even slower when using SQLAlchemy and its ORM, which introduces another layer of abstraction. Here, we will provide an example profiling of a basic table join and data selection using SQLite and SQLAlchemy.
We create a database of 54,849 distinct western United States seismic events (origins) and associated global phase arrivals for each event, totaling 950,586 arrivals. The resulting binary SQLite database file is 340 MB on disk, and the equivalent flat file representation (four text files) is a total of 373 MB. In both SQLAlchemy and SQLite directly (outside of Python), we join the origins to their associated arrivals in the binary database and count the P arrivals. On a 2×2.66 GHz six‐core Xeon Mac Pro with 16 GB RAM, this operation takes 6.7 s in SQLAlchemy, compared with 0.6 s in SQLite. Similar speed differences (10–20 times) have been noted on the web for inserts and other operations. Although this is a significant increase in execution time compared with pure SQL, we posit that using Python and SQLAlchemy is still likely faster for overall management and analysis, as Pisces/SQLAlchemy results are immediately usable inside the Python environment (i.e., there is no intermediate representation). Furthermore, databases are generally well optimized compared with non‐SQL, script, and file‐based data management solutions, offering additional time savings.
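The shape of the join‐and‐count benchmark can be sketched with the standard library on a small synthetic dataset (row counts here are tiny and the timings are not comparable to the figures above; this only shows how such a measurement might be taken):

```python
import sqlite3
import timeit

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE origin (orid INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE arrival (arid INTEGER PRIMARY KEY, orid INTEGER, phase TEXT)")
conn.executemany("INSERT INTO origin VALUES (?)", [(i,) for i in range(1000)])
conn.executemany("INSERT INTO arrival VALUES (?, ?, ?)",
                 [(i, i % 1000, "P" if i % 2 == 0 else "S") for i in range(10000)])

def count_p():
    """Join origins to arrivals and count the P arrivals."""
    return conn.execute(
        "SELECT count(*) FROM origin JOIN arrival ON origin.orid = arrival.orid "
        "WHERE arrival.phase = 'P'").fetchone()[0]

elapsed = timeit.timeit(count_p, number=10)  # total seconds for 10 runs
n_p = count_p()
```

The same query issued through the ORM carries the extra cost of constructing mapped instances, which is the overhead the 6.7 s versus 0.6 s comparison reflects.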
Thus far, development of Pisces has focused on integrating SQL database tables, text flat files, and waveform data. This integration will continue to improve, but we will also expand development to additional important areas. Command‐line tools are in development to build a database from directories of SAC or miniSEED files, or from flat files. Other command‐line tools for basic queries could further strengthen integration with the system shell. These tools will make it easier for users to try Pisces with existing datasets and in existing workflows. Also, Pisces will provide better management of instrument responses, including a uniform interface for reading and writing of various instrument response formats, such as resp, SAC pole‐zero, fap, pazfir, and others. Finally, we will continue to write a robust suite of tests for each module in the Pisces package to ensure that code works as intended, that future development does not break functionality, and to facilitate the incorporation of any contributed code.
Pisces is a practical and extensible data management library that leverages existing widely used free and open‐source technologies, such as SQL databases and Python, in order to provide a seismological data management solution that: (1) allows the user to both manage and analyze data with a single easy‐to‐learn language, Python; (2) leverages large existing user communities to facilitate adoption and problem solving in code development; and (3) imposes no expensive or restrictive licensing constraints on users. Pisces aims to bridge the gap between seismological data acquisition and data analysis, with extensible, portable, and scalable tools, so that researchers can be confident that the code they write will work wherever they work.
Data and Resources
Excellent documentation and tutorials for Python and its scientific libraries can be found from many online sources. The Python 2.7 standard library can be found at http://docs.python.org/2/library/ (last accessed April 2014) and examples thereof at http://pymotw.com/2/ (last accessed April 2014). Documentation for the core scientific libraries (SciPy Stack) can be accessed from http://www.scipy.org/about.html (last accessed April 2014), and a comprehensive set of notes and examples can be found at http://scipy-lectures.github.io (last accessed April 2014). The Enthought Canopy, Python(x,y), and Anaconda Python distributions can be found online at https://www.enthought.com/products/canopy/, https://code.google.com/p/pythonxy/, and https://store.continuum.io/cshop/anaconda/ (last accessed April 2014). Information about ObsPy and packages using it, such as the pyTDMT time‐domain moment tensor inversion package, can be found at http://www.obspy.org (last accessed April 2014), and the MSNoise ambient seismic noise package can be found at www.msnoise.org (last accessed April 2014). Oracle SQL Developer can be found at http://www.oracle.com/technetwork/developer-tools/sql-developer/overview/index.html (last accessed April 2014), and SQLite browser at http://sourceforge.net/projects/sqlitebrowser/ (last accessed April 2014).
This work was supported by the Institute of Geophysics, Planetary Physics, and Signatures (IGPPS) at Los Alamos National Laboratory. This document is Los Alamos Unlimited Release LA‐UR‐14‐20002, and the Pisces code is Los Alamos Computer Code LA‐CC‐13‐122, released under a Massachusetts Institute of Technology (MIT)‐style license. The Pisces documentation and code repository can be found at http://jkmacc-lanl.github.io/pisces and https://github.com/jkmacc-LANL/pisces (last accessed April 2014). This software was inspired by discussions at the EarthScope USArray Data Processing and Analysis Short Course 2009 and from conversations with and code contributions from Richard Stead. Thanks to George Randall and Stephen Arrowsmith for providing feedback on an early version of this manuscript.