Mining for Metadata
By JP Morgenthal and Priscilla Walmsley
Captured inside computer systems and applications are valuable assets that have a tremendous ability to ease the foray into electronic commerce and business-to-business application integration. These valuable assets are called metadata--data about data--and describe critical factors about your systems and applications, such as where a particular data source is located and the data types that are used by these systems and applications.
Metadata plays a key role in reacting quickly to new technologies, and thus in using your current systems and applications to remain competitive.
Of note, the term "metadata" is employed heavily in today's technical literature, for a simple reason. The more product vendors and companies realize the importance of metadata to their products and organizations, the greater the chance that they will expose this metadata publicly. Making this metadata public then becomes a feature of the application, which makes the application more appealing to other groups and customers.
What is Corporate Metadata?
While this is a seemingly simple question, it does not have a simple answer. Metadata is the set of data that describes locations for data sources, data types used within applications, and dictionary-like descriptions of the data being used (for example, product number represents the unique indicator for products produced by a manufacturer). But it also includes information such as the author of a word processing document, the elements and attributes of an XML (eXtensible Markup Language) document, and the names and phone numbers in the corporate directory.
Metadata can sometimes be described reasonably as data that tells us about the data we use, but in many cases the data itself can become metadata, such as names and phone numbers in a corporate directory. If you're looking to call someone within the company, then the name is the data for that particular use. However, the same piece of data can be identified as metadata if the information we are looking for is the person's security code. In this case, the name describes the owner of the security code.
This dual role, in which the same item can be data for one use and metadata for another, causes great difficulties for companies when they attempt to identify and categorize their corporate metadata. If they take a cursory approach to collecting metadata, the size of the data set can become overwhelming and difficult to control or use. If they overanalyze, important metadata may be discarded because it is viewed as data, not metadata. These hurdles can be overcome, however, as discussed later in this article.
Of note, most metadata does not stand alone. That is, there are few pieces of metadata within an organization that do not require association with other metadata components in order to provide contextual understanding. For example, how useful is "account number" as a piece of metadata without understanding what type of account it relates to, such as checking or investment? This need for context requires metadata to be collected in batches, and for the relationships between the metadata to be captured as part of the overall metadata environment.
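To make the need for context concrete, here is a minimal Python sketch of a metadata element that records its relationships to other elements. The class name, the field names, and the "qualified_by" relationship label are all hypothetical, chosen only for illustration; they do not come from any particular repository product.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataElement:
    """One piece of metadata plus the relationships that give it context."""
    name: str
    description: str
    # Maps a relationship label to the name of another element (hypothetical scheme).
    related: dict = field(default_factory=dict)


def describe(element, registry):
    """Render an element together with the related elements that explain it."""
    parts = [f"{element.name}: {element.description}"]
    for label, other_name in sorted(element.related.items()):
        other = registry[other_name]
        parts.append(f"  {label} -> {other.name}: {other.description}")
    return "\n".join(parts)


registry = {
    "account_type": MetadataElement(
        "account_type", "kind of account, such as checking or investment"),
    "account_number": MetadataElement(
        "account_number", "unique identifier of an account",
        {"qualified_by": "account_type"}),
}

text = describe(registry["account_number"], registry)
print(text)
```

Collected on its own, "account_number" would carry only its one-line description; the relationship entry is what ties it to the type of account it identifies.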
Why is Metadata Valuable?
This question is extremely timely given the recent focus on Y2K. The Y2K problem was originally caused by technical limitations of our systems that required us to limit the amount of memory and disk space we used to represent dates. However, correcting the problem became significantly more difficult because of the lack of available metadata to support those systems.
Metadata helps us understand our data and our systems, but more than documentation about how the system runs, it tells us where the system is running and where the physical resources being used by the system are located. Even systems with thorough documentation still require the implementers to define this level of detail once installation is complete. With available metadata, applications become easier to maintain and, if necessary, replace. Additionally, metadata helps us to spot potential pitfalls and errors, such as a date field that cannot support a change in century.
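As a rough illustration of spotting such pitfalls, the following Python sketch scans a list of field descriptions for date fields whose format has no four-digit year. The record layout, the field names, and the "YYMMDD" style format strings are assumptions made for this example, not the output of any real catalog.

```python
# Hypothetical field metadata, as it might be gathered from copybooks or a catalog.
fields = [
    {"name": "open_date", "kind": "date", "format": "YYMMDD"},
    {"name": "close_date", "kind": "date", "format": "YYYYMMDD"},
    {"name": "balance", "kind": "decimal", "format": "9(7)V99"},
]


def century_unsafe(field_metadata):
    """Return the names of date fields whose format lacks a four-digit year."""
    return [
        f["name"]
        for f in field_metadata
        if f["kind"] == "date" and "YYYY" not in f["format"]
    ]


unsafe = century_unsafe(fields)
print(unsafe)
```

The check is only as good as the metadata behind it: a date stored in a field that is not described as a date would go unnoticed.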
Moving forward, metadata can significantly increase our ability to deliver personalized data to customers and business partners. In the age of e-commerce, clearly one of the defining factors is the ability to customize delivery of a single set of information to multiple recipients in a variety of formats. For example, for a large bank to integrate its investment systems with its retail banking systems, it is necessary to understand the data, data types, and data sources for both systems. Through integration of these systems, customers can be provided with a consolidated statement, instead of two separate statements from the same bank. Or, in the case of a Web interface, the consolidated statement can simplify navigation by not requiring users to view their checking and investment account information separately. In both of these cases, it is the underlying metadata that will drive the integration that facilitates the personalized delivery of information and thus gives the customer a more professional impression of the bank.
Additionally, the same metadata that drives personalization of information can drive data reuse. When a company has a thorough understanding of the data it has, it can then intelligently decide that data's overall benefit to the company. More importantly, when the metadata is made available to all corporate personnel, new and innovative ways to use that data can emerge. For example, if the IT department is the only group that has access to the metadata, certain integrations and reports are possible. However, when the business manager for new account development has access to a source of well-defined metadata, then that person is empowered to devise a new campaign for attracting new customers.
Of note, reuse also leads to lower costs for software development, implementation, and maintenance, and increases the opportunity for standardization of information across the company. The latter point is extremely important for companies looking to optimize their internal processes or to create straight-through processing.
Where is Corporate Metadata Today?
Here are some of the common places to mine for metadata:
Legacy mainframe systems
- COBOL and PL/I copybooks (the definitions of the data records used by COBOL and PL/I programs)
- Source code
- IMS and CICS screens
- Job Control Language (JCL)
Relational Databases
- Database catalog
- Database design models and documents
Hierarchical Databases
- IMS segments
- Document Type Definitions (DTDs)
Object-Oriented Applications
- Interface Definition Language (IDL)
- Class definitions
- Source code management tools
- Object modeling tools
Logical Models
- Entity-Relationship diagrams
- CASE tools
- Unified Modeling Language (UML) tools
Enterprise Resource Planning Software
- Data and object models
- Schemas
XML
- Document Type Definitions
- XML documents
Office Automation Documents
- Word processing files
- Spreadsheets
The above list comprises the most common sources of metadata in use today, but it is far from complete. However, the diversity of the list identifies the complexity associated with extracting metadata, and illustrates the importance of capturing metadata at the time of definition. The skills required to mine for metadata from these sources are extremely costly to obtain and, more importantly, rare and difficult to acquire even when money is not an issue.
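Of the sources listed above, the database catalog is usually the easiest to mine, because the database itself can be asked to describe its tables. The Python sketch below does this for SQLite, using the standard sqlite3 module, the sqlite_master table, and PRAGMA table_info. SQLite and the sample "product" table are assumptions chosen to keep the example self-contained; other database products expose their catalogs through different interfaces.

```python
import sqlite3


def catalog_metadata(conn):
    """Return {table name: [(column name, declared type), ...]} from the catalog."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    ]
    result = {}
    for table in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        result[table] = [(col[1], col[2]) for col in columns]
    return result


# Hypothetical table, created in memory only so the example can run.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    "product_number TEXT PRIMARY KEY, description TEXT, units_in_stock INTEGER)")

meta = catalog_metadata(conn)
print(meta)
```

Sources such as copybooks or screen definitions offer no comparable query interface, which is part of why mining them requires the specialized skills described above.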
Additionally, mining this metadata requires a carefully thought-out approach that identifies which metadata components are critical for a particular goal. For example, if a system needs to be integrated with a new e-commerce system, then the metadata gathered should pertain to the components used for procurement and sales. Identifying the total number of accounts that are delinquent more than 30 days is not as important as identifying the number of units in stock for a particular product. Prioritizing the metadata being identified will limit the costs associated with gathering and managing the base of corporate metadata.
This brings us to the next important point regarding metadata: what to do with it once it is defined or extracted.
Making Metadata Accessible
Part of a company's commitment to capturing metadata requires two additional decisions. The first is where the metadata will be stored, and the second is how the metadata will be made available to those who need it.
In terms of storage, the most obvious answer is to use a metadata repository. This is a specialized database application designed to provide the infrastructure and support for storage of interrelated components of information. As stated earlier, very few, if any, metadata components stand on their own. Metadata repositories not only help capture information about individual metadata components, but also about the relationships between them. Metadata repositories also provide important functionality for searching and browsing the available metadata, and they deliver one of the more important functions: producing impact analyses.
Impact analyses identify all the resources that rely on a particular piece of metadata and, therefore, assist in defining all the resources that would be impacted by a change in the location or type of data associated with a metadata component. Producing these types of reports, however, requires a dedication to inputting and updating all the information in the repository.
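At its core, an impact analysis is a walk over the "used by" relationships held in the repository. The Python sketch below shows the idea with a breadth-first traversal over a small in-memory map. The component names and the used_by structure are hypothetical and stand in for whatever a real repository stores.

```python
from collections import deque

# Hypothetical repository content: each component maps to the resources that use it.
used_by = {
    "customer.account_number": ["statement_report", "web_banking_view"],
    "statement_report": ["monthly_mailing_job"],
    "web_banking_view": [],
    "monthly_mailing_job": [],
}


def impact_analysis(component, relationships):
    """Return every resource that relies, directly or indirectly, on a component."""
    seen = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for dependent in relationships.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)


impacted = impact_analysis("customer.account_number", used_by)
print(impacted)
```

The report includes the mailing job even though that job never touches the account number directly, which is the kind of indirect dependency that is easy to miss without recorded relationships. A relationship missing from the map simply does not appear in the result, which is why the repository has to be kept current.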
For the second decision, how to make the metadata available, there is no single answer. Indeed, the answers to this question will be defined by the uses for the metadata itself. For example, simple browsing for informational purposes can be provided in HTML for use by a Web browser. A direct application programming interface may be provided for use by a proprietary client that offers more advanced querying and metadata management. However, perhaps the most innovative and novel method for distributing metadata is to provide it as an XML document.
XML is a powerful language for representing both metadata and data combined in the same document. XML is a tag-based language that allows users to demarcate items of data through the use of named tags, called elements. In addition, elements can contain supplemental information, called attributes, that assign values to uniquely named keys. The following sample XML document illustrates these points:
<Invoice>
  <Bill_To>
    <Clear_Name>Jin Wook Lim</Clear_Name>
    <Electronic_Address>jin@xmls.com</Electronic_Address>
    <Postal_Address>
      <Street>7929 Westpark Drive, Suite 100</Street>
      <City>McLean</City>
      <State abbrev="VA">Virginia</State>
      <Zip>22102</Zip>
    </Postal_Address>
  </Bill_To>
  <Items>
    <Item>
      <Description SKU="1223123">Shirt</Description>
      <Price currency="USD">12.50</Price>
      <Quantity>1</Quantity>
      <Total_Price currency="USD">12.50</Total_Price>
    </Item>
    <Total_Cost currency="USD">12.50</Total_Cost>
  </Items>
</Invoice>
From this example we can see how these elements clearly define the data that they are demarcating, and provide additional metadata that helps to clarify the information. For example, on Total_Price, identifying the currency for the amount ensures that processing will occur in U.S. dollars.
Whether metadata is captured with the data in the document, as illustrated here, or the document is the metadata itself, XML is a simple and platform-neutral data format that can easily be processed by a number of tools and products.
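As one example of such processing, the Python sketch below uses the standard xml.etree.ElementTree module to read a trimmed, well-formed version of the invoice and list each data value next to the element name and attributes that describe it. The function name and the shape of the returned tuples are choices made for this illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed version of the sample invoice.
INVOICE = """<Invoice>
  <Items>
    <Item>
      <Description SKU="1223123">Shirt</Description>
      <Price currency="USD">12.50</Price>
      <Quantity>1</Quantity>
      <Total_Price currency="USD">12.50</Total_Price>
    </Item>
    <Total_Cost currency="USD">12.50</Total_Cost>
  </Items>
</Invoice>"""


def split_data_and_metadata(xml_text):
    """Return (element name, data value, attributes) for every element holding text."""
    root = ET.fromstring(xml_text)
    rows = []
    for element in root.iter():  # visits elements in document order
        value = (element.text or "").strip()
        if value:
            rows.append((element.tag, value, dict(element.attrib)))
    return rows


rows = split_data_and_metadata(INVOICE)
for tag, value, attributes in rows:
    print(tag, value, attributes)
```

Container elements such as Items carry no text of their own and are skipped; their role is purely structural.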
Using Metadata for Application Integration
An emerging area that is heavily reliant upon the availability of metadata is application integration. Interest is rising because integration engines require metadata to automate the extraction of data.
Integration engines simplify the extraction and aggregation of data from disparate sources for the purpose of supplying data to other applications. However, these integration engines only operate if they can access the metadata for an existing application and expose it to the user. For example, many integration engines can extract the metadata from database systems because there are well-defined interfaces for extracting this type of information. The schemas and data type information can then be provided to the integration engine user, who can decide which fields need to be input to another system (using more metadata). From this definition, the integration engine will then automate the extraction and update process.
Considering the complexity associated with writing the software to perform the actual extraction and update, these engines offer significant assistance in this process. However, without knowing where the data is located and what the data means, the integration engine is not a very useful tool.
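The idea of a user-defined mapping driving the extraction can be sketched in a few lines of Python. Here the mapping is itself metadata: it names a source field, the target field it feeds, and the type the target expects. The field names and the record are hypothetical, and a real integration engine would add connectivity, error handling, and scheduling around this core.

```python
# Hypothetical mapping defined by the integration engine user.
mapping = [
    {"source": "ACCT_NO", "target": "account_number", "type": str},
    {"source": "BAL", "target": "balance", "type": float},
]


def apply_mapping(record, field_mapping):
    """Build a target record from a source record, converting each mapped field."""
    return {
        entry["target"]: entry["type"](record[entry["source"]])
        for entry in field_mapping
    }


# A source record as it might be extracted; BRANCH is not mapped, so it is dropped.
source_record = {"ACCT_NO": 1001, "BAL": "250.75", "BRANCH": "012"}

target_record = apply_mapping(source_record, mapping)
print(target_record)
```

Changing what flows between the two systems then means editing the mapping, not rewriting the extraction code.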
Metadata is one of those concepts that have been discussed for more than a decade, but it is only recently that we have seen the emergence of tools that support and are driven by metadata. Many companies have spent a significant amount of money chasing their tails trying to gather up their metadata, much like the way a squirrel would collect nuts for winter. However, once the tiller comes and tills the field, those nuts are lost. The same can be said of using a metadata repository without a complete plan for distribution, access, and maintenance of the data in the repository.
JP Morgenthal is CTO of XMLSolutions Corp., McLean, Va., and a leading expert in the area of enterprise application integration (EAI) and business-to-business e-commerce. Morgenthal is also co-author of Manager's Guide to Distributed Environments (J.Wiley & Sons, 1998) and the forthcoming Enterprise Application Integration with XML and Java (Prentice-Hall, 2000).
Priscilla Walmsley is VP of Development for XMLSolutions. She is a leading authority on metadata and repositories. Walmsley was directly involved in the development of Platinum Software's Metadata Repository product and Microsoft's repository.