© 2013 by the Seismological Society of America
Large stores of seismic information currently exist on paper, film, and other media, containing unique records of earthquakes, geological exploration, and nuclear testing events. To be useful, the seismic data in the traces on these seismograms must be extracted and output as computer‐readable digital data, a process referred to in this paper as digitization.
Individual paper or microfilm seismograms can be digitized by hand, and this tedious process has been ongoing since the 1950s. A few software packages were developed for semi‐automatic digitization, for example, the Teseo software for digitizing scanned seismogram images by Pintore et al. (2005). However, there is a pressing need to automate this process as much as possible, in order to realize the possibility of extracting the vast stores of information that reside in large archives worldwide. Although seismograms created since the 1980s are now readily available online in digital format, digitizing old seismograms remains necessary in order to apply modern techniques to the study of pre‐1980 earthquakes (e.g., see Kanamori et al., 2010, 2012) and nuclear explosions. We have made significant progress on this problem, working under a DOE SBIR Phase I grant, No. DE‐SC002717. The objective of our Phase I work was to create computer code that automatically extracts x–y digital data files of the seismic traces in seismogram image files, while addressing some of the problems associated with noise, distortions, and image discontinuities. In accomplishing this goal, we have developed a preliminary start‐to‐finish digitization process that begins with the input raster image and ends with a digital data file representing trace center‐line coordinates. This process applies corrections to the images via our filters and, if desired, outputs image overlays to allow for direct visual checking of the accuracy of the work.
In this work, we used RaveGrid (Prasad and Skourikhine, 2006), licensable software available from Los Alamos National Laboratory. RaveGrid takes raster images as input and, using specialized algorithms, separates features into polygonal elements based on size, gray levels, and other morphological attributes. We then used the software package MATLAB (http://www.mathworks.com/products/matlab/) to manipulate these features and produce the final digital trace data.
For our Phase I work we have chosen to work with World Wide Standardized Seismographic Network (WWSSN) seismograms (Oliver and Murphy, 1971; Lee and Benson, 2008). Because of their importance and relevance to the problem, and because we have access to a large number of scanned WWSSN images, these seismograms provided a large and representative data set for our efforts. It is important to keep in mind that our results have been designed to approach general problems and features, and that a major goal of our ongoing work will be to apply the analysis to a broad range of other seismogram types.
The following steps summarize our automated start‐to‐finish process. The subsequent sections detail each step with supporting images and data.
The images are scanned from paper into .tif format.
We vectorize the images in RaveGrid and apply image‐specific algorithms/filters based on size and color to refine the initial vectorized output, extract the traces, and reject background and noise.
Next, using MATLAB for this and all subsequent steps, we automatically determine the number of traces in an image, and their individual start and end points.
With the traces defined as polygonal elements, we locate the centerline, also referred to in this document as the medial axis, running down the middle of each trace.
An eigenvalue analysis is applied to remove any tilt in the image.
The traces are unrolled, that is, the discrete, periodically organized traces on the seismogram image are concatenated end to end in order to create a continuous x–y data file.
A Fast Fourier Transform analysis is applied to identify the frequency of the seismogram’s discontinuous timing marks.
The timing marks are moved per a simple prescription.
An x–y data file is output and saved to disk.
In addition to successfully running this start‐to‐finish procedure, we have created and tested some preliminary filters that identify and connect trace gaps, which occur due to either trace disappearances or trace crossings, and have developed an initial set of hierarchical rules to automatically make the proper trace connection across these gaps.
As needed, we overlaid the data, including the polygons and the medial axes, on top of the original image in order to visually check the accuracy of our work. Finally, we have also compared some of the data with digitized seismic data provided by USGS and found excellent agreement.
IMAGE VECTORIZATION USING RAVEGRID
RaveGrid is employed as a preliminary morphological process to identify the traces as distinct features. It uses Canny edge‐detection operators (Canny, 1986) to find high‐contrast separations that define the edges of the seismic traces. The Canny operator and its variants are some of the most fundamental and successful feature identifiers available for automated image analysis (Gonzalez and Woods, 2008), and they have worked very well for this project. For trace detection, the operator locates both the trace‐to‐background boundary and the texture and intensity gradients within the trace itself.
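RaveGrid's internal implementation is not ours to reproduce, but the gradient step at the heart of edge detection can be illustrated with a minimal Python sketch. This is a Sobel gradient‐magnitude threshold only; full Canny additionally applies Gaussian smoothing, non‐maximum suppression, and hysteresis thresholding. The function name and threshold are illustrative, not part of RaveGrid.

```python
def sobel_edges(img, thresh):
    """Gradient-magnitude edge map, a simplified stand-in for Canny.

    img: 2D list of grayscale values; only interior pixels are evaluated.
    A pixel is marked as an edge when the squared Sobel gradient
    magnitude meets the squared threshold.
    """
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Sobel kernels for horizontal (gx) and vertical (gy) gradients
            gx = (img[y-1][x+1] + 2*img[y][x+1] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y][x-1] - img[y+1][x-1])
            gy = (img[y+1][x-1] + 2*img[y+1][x] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y-1][x] - img[y-1][x+1])
            if gx*gx + gy*gy >= thresh*thresh:
                edges[y][x] = 1
    return edges
```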
RaveGrid then uses a combination of Delaunay triangulation (Delaunay, 1934) and rules‐based polygon merging to convert the initial rasterized image into a vectorized image, that is, an image composed of polygons for which the size, location, and color descriptors are compactly described as vector elements. These vector elements are well suited to further manipulation using the software program MATLAB, described later. Figure 1 shows a basic intensity histogram used for some feature discrimination.
An example of the original and vectorized image is shown in Figure 2. The black background color represents non‐data background and is not included in further analysis. Differences in polygon gray levels represent variations in intensity and texture of the seismic traces. Although not visible in this image, the timing marks were preserved due to their intensity and size values, whereas noise (such as scratches) was rejected. Ongoing work uses more sophisticated filters to reject unwanted features.
RaveGrid has proved relatively well suited for preservation of relevant image features while being relatively insensitive to unwanted noise. It preserves grayscale information, which can be used in further analysis. However, the presence of confounding features, such as spurious polygons at trace intersections, and the availability of raster‐based software, such as MATLAB, which is more appropriate and customizable for the initial steps of feature identification, has led to the elimination of RaveGrid use for future work. Software development efforts for the specific task of trace tracking will continue to proceed as before using MATLAB.
LOCATING AND COUNTING THE TRACES
We have written code that locates traces, provides a total trace count, and indicates their individual start and end points without any initial user input. The code proceeds by sampling downward (the y direction, which is time forward) and essentially counting the number of polygons for a given y value. A peak in polygon counts indicates a trace candidate. We find the baseline of each trace with a zeroth‐order fit using MATLAB’s polyfit( ), assuming that most of the medial‐axis segments sit in the quiescent baseline part of the trace. This makes the procedure relatively robust to some seismic activity in a few of the traces. The code contains some hard‐wired values, including the ratio of the standard deviation of the peak‐center spacings to their mean, that check for uniformity of the spacing and reject unlikely candidates. Once the y position and total count of the traces are determined, those values are used to search across x to locate and categorize the beginning and end of each trace. Finally, the median of the spacing among these lines is calculated and reserved for later analysis. This technique has been successfully applied to relatively clean traces that do not overlap at their beginnings or ends.
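As an illustration of the trace‐counting logic described above, the following Python sketch (a hypothetical reimplementation; our actual code is written in MATLAB) histograms polygon coverage by image row, takes local maxima as trace candidates, and rejects the candidate set when the peak spacing is too non‐uniform. The uniformity threshold is an assumed value standing in for the hard‐wired ratios in our code.

```python
from statistics import mean, pstdev

def count_traces(poly_y_spans, height, uniformity=0.25):
    """Sketch of automatic trace counting.

    poly_y_spans: (ymin, ymax) bounding rows of each polygon.
    height: image height in pixels.
    Returns candidate trace rows, or [] if the spacing is too irregular,
    measured as std(spacings) / mean(spacings).
    """
    # 1. Histogram how many polygons cover each image row y.
    counts = [0] * height
    for ymin, ymax in poly_y_spans:
        for y in range(ymin, ymax + 1):
            counts[y] += 1
    # 2. Local maxima of the histogram are trace candidates.
    peaks = [y for y in range(1, height - 1)
             if counts[y] > 0
             and counts[y] >= counts[y - 1] and counts[y] > counts[y + 1]]
    if len(peaks) < 2:
        return peaks
    # 3. Reject the set if candidate spacings are too non-uniform.
    spacings = [b - a for a, b in zip(peaks, peaks[1:])]
    if pstdev(spacings) / mean(spacings) > uniformity:
        return []
    return peaks
```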
LOCATING TRACE CENTERLINES
After polygonization of the traces, medial axes were drawn through the center of the traces. Different methodologies were used in order to assess their effectiveness and also to remedy some of the shortcomings of the various techniques. Two of these methodologies are described below.
Delaunay Triangle Method
We utilized the MATLAB function DelaunayTri( ) to extract medial axes from the individual polygons comprising the traces. This function creates a bounded set of triangles within the trace polygons. We then use the MATLAB function Circumcenters( ) to identify the center of each triangle; the centers of neighboring triangles are then indexed and connected to create a medial‐axis segment that is as long as that trace polygon. Because a trace can comprise multiple polygons (as in Fig. 2), these line segments need to be concatenated and identified as complete traces. We accomplished this by moving left to right in x along the line segments, allowing only one y value per x, that is, the first one encountered. Every non‐integer pixel value in x, which arises from the routine that connects triangle centers, is forced to take exactly one integer value. The y values are taken from the segments, as returned by MATLAB’s interp1( ). Single‐pixel gaps in x between segments are filled in later with a last‐occurrence‐shifted‐forward protocol. The end result is an initial x–y trace.
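The one‐y‐per‐integer‐x rule and the single‐pixel gap fill can be sketched as follows (hypothetical Python; our actual code uses MATLAB's interp1( ) and a last‐occurrence‐shifted‐forward protocol):

```python
def flatten_segments(points):
    """Collapse medial-axis samples to one y per integer x.

    points: (x, y) pairs from all segments of one trace, possibly with
    non-integer x from the circumcenter-connection step.  Scanning left
    to right, only the first y seen at each integer x is kept; single-
    pixel gaps are filled by carrying the last y forward.
    """
    by_x = {}
    for x, y in sorted(points):
        xi = int(round(x))          # force each sample onto an integer pixel
        if xi not in by_x:          # keep only the first value encountered
            by_x[xi] = y
    xs = sorted(by_x)
    out = []
    for x in range(xs[0], xs[-1] + 1):
        if x in by_x:
            out.append((x, by_x[x]))
        elif (x - 1) in by_x and (x + 1) in by_x:
            # single-pixel gap: carry the previous y forward
            out.append((x, by_x[x - 1]))
    return out
```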
Examples of the output data are shown in Figure 3. The main feature of note is that, although the method tracks the centerlines very accurately, it produces spurious features off of the main centerline. We have further investigated means of pruning these extraneous features and have had generally good success with published techniques (Sinha and Suresh, 2005; Bai et al., 2007). As our work progressed, we found that, regardless of the extraneous structure, these data gave us the most accurate information for use in subsequent tilt‐correction operations, due to their highly sampled nature.
We investigated other techniques in order to eliminate the need for pruning. The next section outlines the XScan method, which produces good data without extra spurs.
XScan Method
Our XScan.m code steps across the trace in x, polygon by polygon, and for each x coordinate of a polygon vertex finds the average of all intersections that a vertical line makes with the polygon. This produces a relatively finely sampled trace and provides an intuitive way to capture each potential change in curvature. Despite the high sample rate, the code is still quite fast. The resulting medial axis is accurate, eliminates the spurs of the previous method, and achieves a high sampling rate without excessive CPU time. Figure 4 shows the centerlines plotted for this method. To highlight features, the seismogram images have been stretched in y in Figures 3 and 4.
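The vertical‐scan averaging can be sketched as follows (a hypothetical Python reimplementation of the XScan.m idea; the production code is MATLAB):

```python
def xscan_medial_axis(polygon):
    """Medial axis by vertical scanning.

    At each x where the polygon has a vertex, intersect a vertical line
    with every non-vertical polygon edge and average the intersection
    heights to obtain the centerline point.
    polygon: (x, y) vertices in order, implicitly closed.
    Returns (x, y_center) pairs sorted by x.
    """
    n = len(polygon)
    axis = []
    for x in sorted({vx for vx, _ in polygon}):
        ys = []
        for i in range(n):
            (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
            if min(x1, x2) <= x <= max(x1, x2) and x1 != x2:
                t = (x - x1) / (x2 - x1)      # parametric intersection
                ys.append(y1 + t * (y2 - y1))
        if ys:
            axis.append((x, sum(ys) / len(ys)))
    return axis
```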
As noted, the x–y data from the medial axes are output as a .dat file, demonstrating the final desired digital output. Figure 5 shows the discrete nature of the data points of this file, as obtained from the above image.
Our experience with hand‐digitized data (generously provided by the USGS) indicates that we obtain a sampling rate approximately 5× higher than that of comparable USGS data. These more precisely defined data have the potential to allow for more accurate analysis and curve fitting.
The XScan method, combined with polygonized traces, often leads to multiple solutions at trace intersections; that is, a multiple‐trace crossing can look like a single forked polygon. In this case, multiple medial‐axis values occur at forks in the polygonized traces. We have treated this by implementing a series of rules that eliminate points lying outside of polygons, flag a large standard deviation of the medial‐axis points (as would be found for forked axes), and compute the multiple y values for a single x coordinate. This particular problem has been eliminated in ongoing work by working with traces as single‐pixel‐wide entities rather than as bulk polygons.
We have written a filter to automatically identify and correct the inherent tilt in the scanned images. The code snips one of the traces out of the many on the image and uses those x–y data to perform a minimization of the moment of inertia in two dimensions: a principal‐components analysis. The MATLAB function eig( ) accomplishes this. The eigenvectors that are returned define the new, rotation‐corrected coordinate system. Experience has shown this procedure to be more effective, and just as accurate, when the eigenvectors are calculated for a single trace rather than multiple traces; we then apply the tilt correction to all traces in the image.
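The tilt estimate can be sketched in pure Python (a stand‐in for MATLAB's eig( ); the closed‐form angle below is equivalent to taking the dominant eigenvector of the 2 × 2 covariance matrix of the trace points):

```python
from math import atan2, cos, sin

def tilt_angle(points):
    """Estimate image tilt from one trace's (x, y) samples via 2-D PCA.

    Returns the rotation angle (radians) of the dominant principal axis
    relative to the x axis.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # 2x2 covariance (moment-of-inertia) matrix entries
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed form for the principal-axis angle of a symmetric 2x2 matrix
    return 0.5 * atan2(2 * sxy, sxx - syy)

def rotate(points, theta):
    """Apply the inverse rotation so the dominant axis becomes horizontal."""
    c, s = cos(theta), sin(theta)
    return [(x * c + y * s, -x * s + y * c) for x, y in points]
```

Once the angle is found from a single trace, the same rotation is applied to every trace in the image.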
This code shows that it is feasible and straightforward to analyze the entire image for a simple rotational distortion. Ongoing work looks to improve the code by analyzing multiple lines and creating a best‐fit rotation. In addition, extreme seismic activity in all of the traces would render this code less than optimal and would require refinement for such a situation. Ongoing work is also investigating whether non‐seismic features, such as headers and image borders, can be used as tilt correctors.
We have written code to connect the separate traces in our WWSSN images into a single, continuous image file. This concatenation allows us to achieve our goal of outputting the final .dat file as a single record. Using the previously defined x‐start and ‐stop points of each trace, and the y‐trace spacing, we unwrap and connect the traces into a single trace that spans the full time span of our sample. At this point we have continuous medial axis data, but we also have information that is used in subsequent steps for extraction of the timing‐mark frequency.
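The unrolling step can be sketched as follows (a hypothetical Python helper; our code is in MATLAB). Each trace's x values are shifted by the accumulated span of the preceding traces, using each trace's own measured start and end points:

```python
def unroll_traces(traces):
    """Concatenate per-line traces end to end into one continuous record.

    traces: list of traces, each a list of (x, y) points in pixel
    coordinates.  Each trace's x values are offset by the accumulated
    length of the traces before it, using that trace's own measured end
    point rather than an assumed common line length.
    """
    unrolled, offset = [], 0.0
    for trace in traces:
        x_start = trace[0][0]
        for x, y in trace:
            unrolled.append((offset + (x - x_start), y))
        offset += trace[-1][0] - x_start   # this trace's actual span
    return unrolled
```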
An outcome of the concatenation process is the discovery that the start and end points of different traces do not typically sit on the same x coordinate. Although this is not surprising, it shows that we can automatically assign start and end points to individual traces rather than, as is the case with existing software packages (Bromirski and Chuang, 2003), selecting end points that are an average of the actual values.
Identifying Timing Trace Marks
The images used to investigate timing‐mark discontinuities were WWSSN long‐period seismograms. These data have timing marks that occur every minute, manifested as a discontinuous shift in the trace for several seconds (depending on when the mark was made). In order to generalize our work as much as possible, however, we incorporated code that identifies these timing marks a priori, and moves them back to their appropriate location in the trace prior to the extraction of the trace’s x–y coordinates.
Locating Timing‐Mark Frequency
We use the MATLAB function fft( ) applied to the data (in this case all the vertices of the polygons) to find and extract the periodic frequencies in an image. A typical result of the FFT analysis is shown in Figure 6.
The dominant timing‐mark frequency of 0.0044 cycles per pixel, which for this image corresponds to a period of about 227 pixels (one minute on the image), and many of its harmonics are immediately discernible in the frequency plot. Applying rules, such as enforcing signal‐to‐noise ratios for the candidate mode and its harmonics, gives the timing‐mark frequency with high confidence. It is apparent that we can use this type of procedure to search for any other known‐to‐exist modes in a given seismogram.
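The frequency search can be illustrated with a naive discrete Fourier transform in Python (an O(n²) stand‐in for MATLAB's fft( ); the peak‐picking rule here is simplified to "largest non‐DC amplitude," without the signal‐to‐noise and harmonic checks described above):

```python
from cmath import exp, pi

def dominant_frequency(signal):
    """Find the dominant periodic component of a real-valued signal.

    Computes a naive DFT and returns (frequency in cycles/sample,
    amplitude) of the strongest non-DC bin.
    """
    n = len(signal)
    m = sum(signal) / n
    centered = [s - m for s in signal]        # remove the DC offset
    best_k, best_amp = 0, 0.0
    for k in range(1, n // 2 + 1):            # positive frequencies only
        coeff = sum(centered[t] * exp(-2j * pi * k * t / n) for t in range(n))
        if abs(coeff) > best_amp:
            best_k, best_amp = k, abs(coeff)
    return best_k / n, best_amp
```

A cosine with a 25-sample period, for example, yields a dominant frequency of 0.04 cycles per sample.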
With the frequency known, we next explicitly identify the timing marks using MATLAB’s polygeom( ) function to extract timing‐mark centroids. We use this information, in which the timing‐mark size and location are specified within a range, to define those polygons that are identified as timing marks.
MOVING TIMING MARKS
We have so far employed a simple scheme to move the timing‐mark polygons. We find the neighboring polygon to the left of each timing mark and locate its right‐most vertex. We mark the left‐most vertex of the timing mark and then move the entire timing‐mark polygon down to the location of the neighbor, matching up these vertices. Figure 7 shows a typical result of this movement. It can be seen that not all timing marks are properly moved. Nevertheless, we investigated this technique as a means by which any periodic feature could be identified and dealt with, and we have demonstrated that we can create the tools to do this successfully. In the case of WWSSN timing marks, there is some question whether these marks should be moved or simply ignored, as has been done with the hand‐digitized images we have seen. We can create code to handle either situation.
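The vertex‐matching move can be sketched as follows (a hypothetical Python helper; picking the extreme vertices by lexicographic min/max is a simplification of the actual vertex selection):

```python
def move_timing_mark(mark, neighbor):
    """Translate a timing-mark polygon onto its left neighbor.

    The mark is shifted so that its left-most vertex lands on the
    right-most vertex of the neighboring polygon to its left.
    mark, neighbor: lists of (x, y) vertices.
    Returns the translated vertex list.
    """
    left = min(mark)        # left-most vertex of the timing mark
    anchor = max(neighbor)  # right-most vertex of the left neighbor
    dx, dy = anchor[0] - left[0], anchor[1] - left[1]
    return [(x + dx, y + dy) for x, y in mark]
```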
It is worth noting an unexpected result from the analysis of the timing marks. Figure 8 shows timing‐mark positions in y for the first dozen or so traces of a typical seismogram. The periodic bowing of these lines shows that the images are slightly distorted in y, by approximately 30 vertical pixels over the 14,000‐pixel width of each trace, or about 0.2%. We did not quantify and correct for this distortion; regardless, it was easily picked up by our analysis, and it indicates how distortion corrections could readily be applied in further iterations of the digitization routine.
OUTPUTTING x–y DATA
We output the data in a pairwise fashion, with four values defining each medial‐axis segment. This (x1,y1):(x2,y2) format is sufficient because each segment is a linear representation of the midpoint of sequential polygon vertices. The data are currently saved in simple ASCII format as comma‐separated values, but other data types, such as SEED, can be used.
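The output step can be sketched as follows (a hypothetical Python helper; the header row and field order are assumptions, while the comma‐separated ASCII format is as described above):

```python
import csv

def write_segments(path, axis_points):
    """Write consecutive medial-axis points as (x1, y1):(x2, y2) segment
    rows in plain comma-separated ASCII."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["x1", "y1", "x2", "y2"])          # assumed header
        for (x1, y1), (x2, y2) in zip(axis_points, axis_points[1:]):
            w.writerow([x1, y1, x2, y2])
```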
TRACE CROSSINGS AND DISCONTINUITIES
We have investigated examples of trace crossings and trace gaps in order to begin developing logical algorithms to automatically identify and connect these discontinuities. The following examples contain initial results. Following those is a summary of the current rules hierarchy and paths for future development.
An example of a simple trace gap is shown in Figure 9. The original image has a discontinuity due to the trace disappearing on the original image. To connect the gap, we have developed some general rules that first find and catalog the medial axis for every polygon in the vicinity of the gap.
A third‐degree polynomial is fitted to the last N points at each end of the axes adjoining the gap. The default value of N is 8, but we have experimented with other values. Next, we take the left fit for each segment and evaluate the polynomial's derivative at the endpoint. We calculate the corresponding slope for the right fit of every other segment within a defined distance, and then point the segments at each other. If this pointing from one polygon medial‐axis end to another intersects the endpoint of the other polygon to within the pointing error of 1 σ (one standard deviation), as dictated by the third‐order polynomial fit, a potential match is declared. Figure 9 shows the declared gap connector for this image.
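The pointing test can be illustrated with a simplified Python sketch. To keep the example short, a straight‐line fit replaces the cubic fit, and a fixed pixel tolerance stands in for the 1 σ pointing error derived from the fit; both substitutions are simplifications of the scheme described above.

```python
def gap_match(left_tail, right_head, n=8, tol=1.0):
    """Simplified gap-pointing test.

    Fit the last n points of the left segment with a least-squares line,
    extrapolate across the gap, and declare a potential match if the
    prediction lands within tol pixels of the right segment's first
    point.  left_tail, right_head: lists of (x, y) points.
    """
    tail = left_tail[-n:]
    xs = [p[0] for p in tail]
    ys = [p[1] for p in tail]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    # least-squares slope of the tail points
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    x0, y0 = right_head[0]
    predicted = my + slope * (x0 - mx)     # extrapolate across the gap
    return abs(predicted - y0) <= tol
```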
Similar logic is used when the discontinuity is the result of a trace crossing, as shown in Figure 10. In this particular example we increased the value of N to 10, and had to increase the pointing‐error constraint to a value within 0–1.9 σ. It can be seen that the presence of multiple polygons can lead to ambiguity in the results.
Once a potential connection is identified, an initial set of hierarchical rules assigns a merit score to each candidate fit.
A high score is given when the combined pointing errors are minimized with respect to the other choices.
One dead‐on pointing connection will increase the score.
Higher scores are given to the more vertical gap connection (this assumes for now that high‐amplitude events are more likely to lead to gaps and trace crossings).
Ties in merit scores are broken by minimizing the gap distance.
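The rules hierarchy above can be sketched as a simple scoring function (hypothetical weights and field names; our actual scores and data structures differ):

```python
def merit_score(cand, all_cands):
    """Score one gap-connection candidate against the full candidate set.

    cand fields (all hypothetical): 'pointing_error' (combined, sigma
    units), 'dead_on' (one end points exactly at the other),
    'verticality' (0..1, 1 = vertical connection), 'gap' (pixel
    distance, used only as a tie-breaker).
    """
    best_err = min(c["pointing_error"] for c in all_cands)
    score = 0.0
    if cand["pointing_error"] == best_err:
        score += 3.0              # smallest combined pointing error
    if cand["dead_on"]:
        score += 1.0              # a dead-on pointing connection
    score += cand["verticality"]  # prefer more-vertical connections
    return score

def best_connection(cands):
    """Pick the highest merit score; break ties with the smaller gap."""
    return max(cands, key=lambda c: (merit_score(c, cands), -c["gap"]))
```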
These rules have shown that we can write code that identifies and connects trace discontinuities in a way that utilizes trace features, such as slope and proximity, and applies initial merit‐based connection likelihoods based on a sum of several scores. This Phase I work has helped to identify general classes of trace discontinuities, and given us good information on the path that our ongoing work will take to expand and generalize the code to deal with many types of discontinuities, including various combinations of multiple‐trace intersections, trace gaps, and background noise.
Our Phase I work has demonstrated the feasibility of our approach to the automatic digitization of seismograms. We are currently working under a new SBIR grant, No. DE‐SC0008219, to build on these efforts.
We will be working exclusively in the MATLAB environment for ongoing work. The experience gained using foundational edge‐detection algorithms, together with the unexpected features created by RaveGrid (in particular, spurious polygons at trace intersections), will allow us to proceed with purely rasterized images. This work will include:
A variety of image‐preprocessing techniques, including frequency‐domain filtering, mean filters, and other well‐known and effective image‐processing methods.
Advanced edge detection.
Automatic discontinuity identification to direct further analysis.
Trace crossing and dropout work.
This last effort will be our prime focus, and will incorporate a large range of merit and fit techniques, with an emphasis on accurate, flexible, and robust trace identification.
The ultimate goal of start‐to‐finish automatic digitization is achievable. For the most‐difficult‐to‐analyze seismograms, it is likely that some user input would be required to verify trace paths, but the vast majority of the work has been shown to be amenable to automatic methods.
We gratefully acknowledge the assistance from the DOE Office of Science in providing the funding for this work. We would also like to thank Bob Hutt at the USGS for making the WWSSN seismograms and some digitized data available for this work. The seismogram archives at http://www.iris.edu/seismo/ are an invaluable resource for this and other archival efforts.