- © 2016 by the Seismological Society of America
This article presents a new MATLAB software, DigitSeis, that converts digital, raster images of analog seismograms to readily usable, digital time series using image‐processing techniques. It classifies important features of analog seismograms, such as time marks, to simplify the generation of continuous time series and uses some of this information to assign accurate timing. Geometrical distortions associated with various stages of seismogram creation, storage, and digitization are detected automatically and corrected. Although the software is written to minimize human input, DigitSeis provides interactive tools for challenging situations such as when traces cross. Finally, the effectiveness of the software is demonstrated with digitization of a long‐period seismogram from the Harvard‐Adam Dziewoński observatory recorded in November 1938. At least five significant teleseismic events are identified, and its spectral analysis shows no spurious features.
Instrumental seismology is over a century old, and the first seismographic stations were installed in North America at Mount Hamilton and Berkeley in 1887 (Litehiser, 1989). These stations produced analog recordings, and digitally recorded seismograms became broadly available only three decades ago. As a consequence, a vast amount of seismic data is in analog form, either on paper or film. Extracting seismic information from such archives is vital for modern analysis, ranging from improved earthquake catalog completeness to examining time variability of subsurface features. Efforts to preserve and restore these valuable data typically produce digital images of the seismograms through the use of scanners or digital cameras (e.g., Okal and Kirby, 2014; Okal, 2015). These images are in raster format and require additional processing to extract time‐series data. Over the years, a few attempts have been made to develop software to obtain digital seismograms from raster images (e.g., Liu et al., 2001; Bromirski and Chuang, 2003; Pintore et al., 2005; Church et al., 2013).
This article describes a new software, DigitSeis, for converting the raster images into discretized time series. It is written in MATLAB (see Data and Resources), and, unlike most previous software, it digitizes both the main trace as well as the time marks, after correcting for their offset, yielding continuous time‐series data. Furthermore, it considers various types of geometrical distortions that are present in images of analog seismograms and corrects for errors in the timing and amplitude that arise from such distortions. Another distinction of DigitSeis is that it synchronizes traces using the time marks to obtain as accurate time information as possible. The results are saved in MATLAB and Seismic Analysis Code (SAC) formats.
Some features of DigitSeis rely upon routines or standard functions supported by MATLAB version 220.127.116.117246/R2015b. The program takes common digital high‐resolution raster image format (e.g., tif, tiff, gif, png, jpg, and jpeg) that can be readily analyzed by MATLAB or the “.mat” format native to MATLAB. Typical analog seismograms generated through the use of galvanometers show the waveform with dark colors on a lighter background (e.g., Fig. 1), which is the default input type for DigitSeis. It does include an option to change the polarity of the image so that seismograms recorded on dark background with white traces can be processed.
If a seismogram is scanned in gray scale, raster images are stored as 2D arrays or matrices, in which each element corresponds to a pixel and its value of intensity, which is a positive integer. For example, in a 1‐bit (binary) image, the intensity is either 0 (black) or 21−1=1 (white). Similarly, in an 8‐bit (gray scale) image, the possible values span from 0 (black) to 28−1=255 (white). Colored images, on the other hand, are represented as a combination of gray‐scale images. For example, in the case of red–green–blue (RGB) color space, a color is depicted as three gray‐scale images that correspond to the intensities of the red, green, and blue channels, and the number of possible intensity values defines the color depth. If the 8‐bit scale is used for each red–green–blue color array, this constitutes a color depth of 8‐bits. Color depth is important for seismogram digitization, because the deeper it is, the better the faint traces are captured. In DigitSeis, 8‐bit color depth is used whenever possible, because it provides an adequate balance between efficiency and preservation of information of the scanned images. If seismograms are scanned in higher color depth (e.g., 24‐bit), the program downsamples the images into the 8‐bit format.
DigitSeis also remaps the color images into hue, saturation, value (HSV) color space. This is convenient, because almost all the useful information can now be obtained from one gray scale that corresponds to the value (i.e., brightness) channel (Fig. 2). Hue and saturation values are irrelevant and discarded. Furthermore, this reduction removes some unwanted features from the scanned image such as chromatic aberration and pseudocolorization that may introduce difficulties during later tracing. The single array containing the value intensity is inverted to produce a negative image (i.e., each pixel value is subtracted from the maximum value of 255), allowing the background to be represented with null values and the signal (i.e., traces) with nonzeros (Fig. 3). In the ideal case, a seismogram on the image is represented in bright white and the background in black. This negative‐image setup is compatible with most image‐processing algorithms, and allows sparse representation of the image, reducing required memory and processing time.
To suppress anomalously bright or dark spots, the distribution of the intensity values is adjusted (Fig. 3). The intensities corresponding to 1%–99% of the available values are automatically selected, and this range is linearly stretched to values between 1 and 254. The pixels with initial intensity that is outside of the bounds are given intensity of either 0 or 255 (Fig. 3). The user has the option to further remap the intensity to increase the contrast of the traces or reduce unwanted signals (Fig. 3).
Depending upon the image, additional corrections may be needed. The cropping tool allows users to exclude empty regions at the edges of the scanned image, and the rotation tool enables image rotation. Generally, the traces are expected to be slanted with respect to the horizontal, due to the spiral recording around a rotating drum, also known as helicorder drum (e.g., Havskov and Alguacil, 2004). However, if the seismogram is significantly misaligned when it is fed into the scanner, the image acquires additional tilt, and rotation correction may be desirable. The later processing to remove these types of rotation (along with image distortion) is often sufficient, but an option allows applying the rotation correction using Radon transform (e.g., Stanley, 2007). Based upon the Radon transform, DigitSeis software suggests an angle for rotation, but the user can input a different angle. The rotation is applied using bilinear interpolation. Finally, the user may select and remove, with a lasso‐type tool, any spurious features that should not be included in the digitization procedure. Such features include stains, handwritten notes, and/or holes (e.g., Fig. 1). Although some of these features are automatically identified and rejected in a subsequent classification stage, it is a good practice, whenever possible, to remove at least the large and dominant objects.
Throughout the preprocessing stage, the user has the option to bypass the automatic adjustments and perform the functions manually (or not at all). DigitSeis is not intended to be an image‐processing software and, therefore, does not include all possible methods by which the image could be improved. The user can preprocess the image using other software and then load the processed image into DigitSeis for subsequent procedures.
Identification of Traces and Time Marks
In this phase, the program establishes the position and the number of seismic traces within an image and classifies them into time marks and main traces. The image is divided into vertical strips for which the number of strips is set such that it is enough to resolve the curvature of the traces but small enough to ensure clear distinction between the trace and the background. For example, ten vertical strips are defined for a 36‐inch wide seismogram. For every strip, the pixel values are summed in the horizontal direction to identify peaks that correspond to the vertical position of each trace, and the number of the traces is estimated by counting these peaks. In general, this approach returns a correct number of traces, but if traces are too short, such as at the beginning or end of a seismogram, the estimated number of traces may be incorrect. For such cases, the user has the option to input the number and the position of traces through an interactive menu.
The zero position of a trace (i.e., trace‐zero line) is defined as the line if no seismic signal or noise is present. Theoretically, this line should be horizontal, with uniform spacing between time markers (e.g., minute marks), but in practice, images are often distorted (Fig. 4a; Ishii et al., 2015), and the positions of the traces determined from each strip are offset. In general, this distortion is dominated by parabolic curvature, and DigitSeis corrects for this basic distortion using positions of the traces from each strip. The vertical shift required to align the traces of the neighboring strips is obtained through a cross‐correlation analysis, and the main image distortion is corrected by fitting a second‐degree polynomial (i.e., assuming a quadratic or parabolic distortion model).
In the next step, the program classifies main traces and short time marks based on their geometrical characteristics. For this classification purpose, the program creates a 1‐bit version of the image by converting all pixels above a given user‐selected threshold to 1 (white) and all other pixels to 0 (black). Depending upon the image, the user can explore different tolerance values. This conversion to a binary image is advantageous in increasing the computational speed and reducing the memory requirement, and the clear distinction between 1 (visible features) and 0 (background) allows easy definition of objects: a single white pixel or a cluster of connected white pixels form an object. Two pixels are considered connected if they have adjacent sides or corners. The program finds all different objects and stores their positions, the indices of the corresponding pixels, and their dimensions.
Once the objects are defined, the DigitSeis program requires the user to provide an approximate horizontal extent of time marks in pixel units (measured with an interactive tool). The program uses the length provided by the user and classifies the objects as time marks if the lengths are similar. Objects of shorter lengths are classified as noise (to be ignored in the digitization procedure), and those that are longer are classified as parts of traces (Fig. 5). This classification scheme works well in most cases and classifies the majority of objects, especially if the seismograms are of good quality (e.g., without extensive damage, stains, or faint traces). However, it is not perfect, and inspection by a human analyst is necessary. If an object is misclassified, the user must manually change the type of the object. Correct classification of the objects is crucial for subsequent steps, and it could be the most time‐consuming component of the digitization procedure. When this classification step is finished, each object is stored in a structure that contains the object and its identity (i.e., trace, time mark, noise).
Reliable timing of the seismograms is essential for scientific research, yet significant difficulties arise in identifying the correct time for a given x position. First of all, there is an ambiguity about the exact starting and ending pixel of each trace, because the traces start and finish with a gradual increase or decrease in intensity. Moreover, as noted elsewhere, the digital images of seismograms are distorted in both horizontal and vertical directions, affecting the timing of the traces systematically and arbitrarily. All of these issues are in addition to the limitations in the accuracy of the time measurement when the instruments were being operated.
The time determination process in DigitSeis is a semiautomatic procedure that takes advantage of the presence of time marks. Once classification of the objects is completed, the analysis returns to the gray‐scale image, and the user initially selects the beginning of the first trace. For the remaining traces, the sum of the intensity values in the vertical direction is calculated, and the sudden increase is taken as the starting position of the trace (Fig. 6). This approach is also applied at the end of each trace to detect the end positions. The user has the option to change these positions by moving the markers one at a time or simultaneously. In addition, two columns containing time marks near the beginning and end of the image are selected through a similar procedure. Ideally, these columns should be composed of the earliest and the latest time marks, but the user should ensure that the selected columns include as many usable time marks as possible. If the selected column includes many problematic time marks, such as those that are washed out, the user should select the next (or previous) set of time marks.
For the time marks within these two columns, the user is required to identify the pixel that represents the round minute (i.e., the beginning of the minute that corresponds to zero second), with the assumption that the minute marks begin or end at the beginning of a minute. The user also provides the number of time marks N that exist in each trace, and the exact time and date associated with the first time mark. Finally, the program asks for the duration for a given line of trace in hours, the time between successive time marks in seconds (τ), and the approximate number of pixels between time marks (p0) that can be measured using an interactive tool. Next, the program predicts the position of the time marks and attempts to associate these predictions with each time mark. This task is formulated as an optimization problem where the initial second‐degree polynomial model for distortion is improved by comparing the predicted and the real positions of the time marks. The time T between two successive pixels can be modeled using the quadratic distortion model as (1)in which α is the optimization variable, f(x) is the predicted difference between the y coordinates of the trace‐zero line at points x and x+1 based upon the quadratic distortion model, and τ is the time between two time marks. Because the quadratic model fits the trend of each trace relatively well (Fig. 4a), α is expected to take values close to 1, and it is bound to be between 0.9 and 1.1.
Based upon this distortion model, the cost function to be minimized for each line of trace is (2)in which N is the user‐defined number of time marks for a given line of seismogram, na is the estimated number of time marks based upon the quadratic distortion model (equation 1), j represents the user‐provided horizontal coordinate of the end reference time mark, and ja is the corresponding position of this time mark as it is predicted from the quadratic distortion model. Both na and ja depend upon the parameter α, and they can be readily derived using equation (1), starting from the location of the starting reference time mark. Although this correction improves the relative locations of actual and predicted time marks, residuals that vary arbitrarily may remain (Fig. 7). This suggests that there are higher‐order distortions that are difficult to model deterministically (e.g., Fig. 4b). To manage this arbitrary distortion, an irregularly spaced vector containing the times of time‐mark positions is generated. This vector, along with the corresponding pixel‐location information, is interpolated using cubic spline to identify pixels that are spaced evenly in time. This procedure of determining time tolerates missing or nonidentifiable time marks, because the interpolation is performed using a full line of trace. For the portion of the traces at the edge of seismograms that occur before or after the time marks, linear extrapolation is applied under the assumption that these segments are sufficiently short. In the case of short traces where no time marks are available, the algorithm performs the timing analysis based upon the time marks of the adjacent traces. The sampling interval is automatically determined as the evenly spaced sampling interval that can be obtained from the unevenly spaced time vector.
One feature of DigitSeis that does not exist in previous digitization software is the ability to combine the main traces and time marks automatically. For each line of seismogram consisting of the main trace and time marks, two horizontal strips are created using the classification information of the objects. The user has the option to change the lower and the upper extent of the strip. Using these limits, the program designates all objects with centroids within the defined areas to the specific trace. Centroids that are outside are ignored, regardless of pixels inside the region. This procedure results in one strip containing only the trace objects and another containing only the time marks of the corresponding trace.
Once the two strips are defined, DigitSeis uses a simple nonlinear optimization scheme that utilizes golden section search and parabolic interpolation (e.g., Press et al., 1997) and minimizes the amplitude of the first derivative of the combined trace as a function of the relative vertical shift between the main trace and the time marks. More specifically, for each vertical shift, a fused image of the main and time marks is created (Fig. 8). Commonly, the beginning and ending pixels from the main trace objects overlap with the beginning and ending pixels of time marks, due to the finite thickness of the light beam. The overlapping pixel inherits the larger of the two pixel‐intensity values. For each horizontal position j of the fused image, the amplitude yj of the seismogram is calculated as the weighted average of the vertical coordinates i of the pixels with intensity Iij that exceeds a predefined tolerance I0, that is, (3)
The weights are set as the squared intensities of the pixels (i.e., high‐intensity values correspond to brighter pixels; Fig. 9a). If a column does not include any pixels with intensity Iij that exceeds the predefined tolerance I0, the algorithm returns the IEEE arithmetic representation for a Not‐a‐Number (i.e., NaN) data‐type value. Once the fused image is converted to j (or x) and yj values, the first derivative (dy/dx) is calculated using the forward differences approach, ignoring the NaN values. If the vertical shift used to fuse the images is insufficient, that is, an offset remains between the main trace and the time marks, the derivative is expected to be large at the edges of the time marks.
The automatic fusing and tracing scheme works well if the traces do not cross one another, but often, especially at times of interest (e.g., seismic‐phase arrivals), large amplitudes extend beyond traces above and below. The automatic algorithm results in truncated waveforms, and the user must process the trace manually. DigitSeis provides a set of interactive tools that allow the user to select the trace, define its vertical extent, and then isolate it from adjacent traces by manually removing the pixels that do not belong to the target trace. Once the trace is isolated, the classification, timing, and digitization processes are repeated (Fig. 9b).
The main output format of the digitized traces is SAC (Goldstein and Snoke, 2005), and one SAC file is created for each line of trace within an image. The DigitSeis program automatically completes the following fields of the SAC header: FILENAME, DELTA, B, E, NPTS, LEVEN, DATA1 DEPMIN, IFTYPE, NZYEAR, NZJDAY, NZHOUR, NZMIN, NZSEC, and DEPMAX. In addition, the software provides an interactive window in which the user can provide other header information, such as the station latitude, longitude, elevation, network code, station code, station name, and so on. The data are also saved in MATLAB’s native “.mat” format, and this file also contains all information required to recreate the digitization procedure. This includes the processed image, the classification structure, digitization and timing results, and current values of the graphical user‐interface controls. The digitization procedure can be easily reviewed and reproduced, and the .mat file can also be used to resume incomplete digitization of an image.
AN EXAMPLE FROM THE HRV ANALOG SEISMOGRAM COLLECTION
The Harvard‐Adam Dziewoński observatory (HRV) is located in the town of Harvard, Massachusetts, and the seismic instruments have been deployed in the underground vault since 1933 (Leet, 1934). Seismic activities have been recorded on photographic paper from 1933 to 1953, and a large number of these seismograms (∼9000) have been cleaned and scanned in color with 24‐bit depth and 1200‐dpi resolution (Ishii et al., 2015; see Data and Resources). As an example of the DigitSeis software application, a north–south long‐period seismogram from 1938 is digitized (Fig. 1). The recording contains time marks at every minute, and each line roughly corresponds to one hour. The seismogram starts at approximately 13:56 (UTC) on 13 November and ends at 13:55 on 15 November. The seismogram is in good condition with minor problems such as stains and yellow pseudocolorization, hence no user‐specified processing is performed, except for cropping and removal of handwritten notes and stains.
All traces are identified successfully by the software, except for the first line that is only a few minutes long. This trace is added manually, and the start and end positions of each trace are defined. The automatic classification algorithm successfully identifies the vast majority of the objects, and only a handful of objects require specification by a user. From a total of 49 traces, 46 are digitized automatically, but due to crossings between the 10th, 11th, and 12th traces, these traces are processed separately. The 0th and 57th minute marks of each hour are chosen as the left and right columns used as references for time determination. The approximate distance between time marks is 445 pixels, and the approximate length of a time mark is 25 pixels. The resulting SAC files are visualized in Figure 10. The sampling interval of the digitized time series is 0.1273 s. When the power spectral density of the digitized trace is calculated, no distinct peaks corresponding to 1 min period appear, implying that the digitization procedure effectively combines the time marks with the main trace to generate a seamless recording.
Within the time period covered by the seismogram shown in Figure 10a (also in Fig. 1a), the International Seismological Center (see Data and Resources) reports occurrences of eight earthquakes with magnitudes up to 7.0 (Table 1). The beginning of the digitized traces shows arrival of some waves that correspond to an Ms 6.9 event along the Kuril Islands, and a clear earthquake signal around 22:30 UTC on 13 November is from an Ms 7.0 earthquake off the east coast of Honshu, Japan (Fig. 10a). The power spectral density of the digitized traces brings out additional earthquake signals that are difficult to discern in the time series. The signals associated with earthquakes of unknown magnitude off the east coast of Honshu, Japan (event 3; Table 1), Papua New Guinea (event 5; Table 1), and along the Aleutian island arc (event 7; Table 1) are clearly captured (Fig. 10a).
Another interesting feature of this seismogram is the increase in the noise level during the morning of 14 November that lasts until the end of the seismogram (Fig. 10a). An HRV recording from the same days of the year but in 2014 (Fig. 10b; BH1 component, location code 00, instrument type Streckeisen STS‐1VBB_w/E300; see Data and Resources) contains the same noise that is characterized by two peaks at about 0.14 and 0.25 Hz. The good agreement between the noise spectra of the analog and digital recordings suggests that no spurious spectral signals have been introduced by the DigitSeis processing. The double peaks have been argued to result from interaction of ocean waves of similar wavenumber, generated from different cyclonic storm systems (e.g., Haubrich et al., 1963; Oliver and Page, 1963; Cessaro, 1994). The generation and the propagation of these microseisms are topics of considerable interest (e.g., Fukao et al., 2010; Bromirski et al., 2013; Ardhuin et al., 2015; Tian and Ritzwoller, 2015). The digitized noise spectrum from 1938 indicates that analog seismograms can extend the time window for studies of storms and their evolution over the years.
DISCUSSION AND CONCLUSIONS
The DigitSeis software is a new tool for supervised digitization of analog seismograms that includes analysis for image distortion, combines time marks and main traces, and provides as accurate timing information as possible. A typical seismogram, such as those from the Harvard‐Adam Dziewoński observatory (Ishii et al., 2015), requires a few hours to process; however, this depends strongly upon the experience of the user, condition of the seismograms, and complexities in the recorded signal. The most time‐consuming part of the digitization, not surprisingly, is manual interaction. Correcting object classification and separating overlapping/crossing traces are the most demanding tasks, and they are difficult to automate. For example, identifying a single trace when it crosses over others requires subjective decisions, and it can be difficult even for a human to perform. Future versions of DigitSeis will aim to improve the efficiency of classification and identification of potentially problematic regions. Another future direction is toward extending the applicability of DigitSeis to other types of recordings. For example, a seismometer using a pen that swings around the motor axis produces circular arcs when large displacements occur, and this can be incorporated into the digitization analysis.
A digitized trace of a seismogram recorded in 1938 at the HRV station contains no spurious features that are visible in the time domain or with spectral decomposition. The comparison with modern digital recordings from the same station shows that the digitized seismograms provide high‐quality data for studies of earthquakes and of the background noise. However, for research that requires absolute amplitude information or conversion to physical quantity (e.g., velocity or displacement), significant challenges remain. The instrument response is often poorly known, and the only information available may be the type of the instrument. In addition, the digitization procedure can introduce arbitrary features. For example, the quadratic correction applied to each line of trace is required to remove distortion of the recording, but it will also remove corresponding low‐frequency signals. The details of digitization transfer function will be investigated in future studies.
The analog seismograms from the predigital era contain a wealth of information, which extends the data window available for seismological research. The software, DigitSeis, presented in this article, provides a means to convert analog seismograms to time series. It identifies and corrects for time‐mark offsets semiautomatically, with a minimal amount of user confirmation and input. Particular care has also been taken to obtain accurate timing of the traces, that is, conversion of the horizontal position to time. The time series obtained using DigitSeis compare well with modern recordings and present an opportunity for digitization of a collection of analog seismograms for modern analyses. For example, digitization of the HRV station archive (Ishii et al., 2015) will make data available for analyses of large global earthquakes, local sequences at the northeastern United States (e.g., Collins, 1937), and severe weather phenomena (e.g., Leet, 1947).
DATA AND RESOURCES
Earthquake information for 13–15 November 1938 is obtained from the International Seismological Centre online bulletin (http://www.isc.ac.uk, last accessed March 2015). The Harvard‐Adam Dziewoński observatory (HRV) archive is available online at http://www.seismology.harvard.edu/HRV/archive.html (last accessed February 2016), and the modern digitized recording from the same station is obtained from the Data Management Center of the Incorporated Research Institutions for Seismology (IRIS). The station is currently part of the IU network (doi: 10.7914/SN/IU) and of the Global Seismic Network (GSN) installed, and operated by the U.S. Geological Survey Albuquerque Seismological Laboratory. MATLAB software can be found at http://www.mathworks.com/products/matlab/ (last accessed February 2015). DigitSeis source code and its manual can be downloaded from http://www.seismology.harvard.edu/research/DigitSeis.html (last accessed February 2016).
The authors thank Associate Editor J. Douglas Zechar, Charles R. Hutt, and an anonymous reviewer for constructive comments and suggestions that helped to improve the manuscript. This project was supported by the U.S. Geological Survey Earthquake Hazard Program G14AP00016.