Privacy and XML, Part I
by Paul Madsen, Carlisle Adams
April 17, 2002
Overview
The widespread uptake of e-commerce has been stalled as much by the
inability of businesses to guarantee that they will honor their
customers' privacy preferences for the personal data entrusted to them
as by any other single factor. Of those who are connected but do not purchase online
-- which is over half of all Internet users -- over half say their
reluctance is due to fear that their personal information will be
stolen or misused. In a sense, XML, through the smart data transfer
it enables, contributes to the problem. However, a number of
XML-based efforts are emerging that offer solutions to some of the
major technology issues for privacy.
Introduction
Privacy, in the context of this article, may be understood as the
ability of individuals to control the collection, use, and
dissemination of personal information that is held by others. Privacy
is a major issue these days. Corporations are appointing Chief Privacy
Officers (CPOs), and governments around the world are creating
legislation which forces companies to satisfy requirements on how they
collect, secure, and use customer data.
But businesses have always collected data about their customers. In
a trivial sense, without certain information (e.g., a shipping address),
a company simply could not do business with its customers. More
interesting than this trivial case, however, is how, with the
appropriate information, a given service or product can be optimized
or tailored to specific customers: the more a company knows, the
greater are the opportunities for real personalization, potentially
benefitting both company and customer.
If businesses collecting information on customers is not new, why
then is privacy receiving such attention lately? Reasons include the
following:
- The unfamiliarity most people have with the technologies
that make up the Web. Users are asked to make decisions and state
their preferences on issues for which they have no expertise. For
instance, browser cookies, if implemented in a responsible manner, can
allow businesses to maintain a relationship with a customer between
browser visits, sparing the customer from re-entering
contact information on every visit. However, the vast majority of users
perceive cookies as giving the Web business an unacceptable
access point into their computers and their lives.
- The interconnectedness of networks enables faster and easier
information flow -- both authorized and unauthorized. With more
and more customer data being moved online, the opportunities for
illicit access increase. Ten years ago, a company might have
maintained information on its customers on an off-line
mainframe. Now it is likely that the database will be connected to a
Web server -- both to customize and simplify the customer's browsing
and shopping experience, and to allow customers to self-manage their
data. The price to be paid for these advantages is of course that a
channel now exists for the unauthorized access of that data. As an
example, a hacker recently penetrated the computer network of a
hospital in Seattle and was able to extract files containing
information on more than 5000 patients.
- In the past, public records were most likely kept on
paper or magnetic tape in physical filing cabinets at offices of
various levels of government throughout a country. Thus, even though
the information may have been freely available (in a legal sense), the
realities of the storage medium prevented access on a large
scale. Internet technology enables the easy distribution of the
information and consequently raises people's sensitivity to this
access.
- The emergence of mobile technologies. Smart phone use in
the United States and Europe is predicted to grow dramatically in the
next few years and the technology enables scenarios unimaginable in
the past. For instance, it is possible to determine the user's exact
location through the signals emitted by mobile devices. While this
sort of ability could be of great benefit in an emergency -- cutting
ambulance response times or facilitating the location of missing
children -- it is also feared in some quarters that this tracking
ability, through its linkage of identity and location, could be
misused.
- Federated identity. Both Microsoft .NET My Services and the
Liberty Alliance are designing architectures for federated
authentication and identity -- in which an individual will be able to
create an identity at one Web site and be able to use that identity in
order to access the services at another. Federation, with the
attendant sharing of user information between Web sites, amplifies
concerns about the misuse of that information.
If access to information is part of the problem, it would seem that
XML, with its logically identified and structured information objects,
would only add fuel to the fire. Imagine how much easier a hacker's
'job' would be if she knew that all banks kept the credit card numbers
of their customers in an XML Schema that specified a
<creditCardNumber> element. No longer would hackers need to scan
multiple tables in multiple databases; rather, they would simply let
loose a "bot" that read every file it came across, looking
for the appropriate tags and, once they were found, retrieving the number
as well as the card owner and expiry date (these also conveniently captured
in their own elements).
Fortunately, XML has more to offer the privacy issue than simply a
mechanism to allow hackers to automate their efforts and spend more
time dreaming up increasingly elaborate hacks. This two-part article
will discuss the existing and potential applications of XML to
privacy. Before doing so, we provide an overview of privacy-related
concepts in the following section.
Privacy Issues/Concepts
The remainder of this article provides overviews of some of the
concepts and issues central to understanding privacy. The second
installment, to be published next week, will highlight some of the
XML-based initiatives underway to enhance Internet privacy.
Personally Identifiable Information
Personally Identifiable Information (PII) is information that is
unique to an individual and, as such, can serve as a locator for that
individual, or at least as a way to distinguish that individual from
many (or all) others. Examples of personally identifiable information
are a social security number, a telephone number, a home address, and
possibly an email address. Data such as age, gender, and salary do not
uniquely identify the bearer, and so are typically not considered
PII. Although it is therefore somewhat less of a concern than PII, such
anonymous data becomes very relevant to privacy if it can be linked to
PII. As an example, in 1999 DoubleClick was planning on combining the
anonymous browser click-stream data it collected with the database of
PII it acquired through the purchase of Abacus Direct. DoubleClick
dropped these plans under pressure from privacy groups and the
media.
Opt-in versus Opt-out
Opt-in and Opt-out refer to the models by which businesses get
approval from consumers for sharing of their information; they differ
in the assumptions they make about the value of the data and what the
appropriate default rule for sharing should be.
The "opt-in" model assumes that consumer information has
high value, and as such, the consumers should be given an explicit
choice for approval as each opportunity to share their data
arises. With opt-in, the default is not to share consumer information,
consistent with the assumption of a high value for the information. If
a consumer is willing to share her information, she must affirmatively
"opt in".
The "opt-out" model places less value on the consumer
information; it assumes that information is insensitive and can be
shared unless a consumer explicitly requests otherwise. The default
operation in this model is to share information. If a consumer is not
willing to share his information, he must affirmatively "opt
out".
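The difference between the two models comes down to what happens when the consumer has said nothing. The following is a minimal Python sketch of that default rule, not a description of any real system; the names `ConsentModel` and `may_share` are hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Optional


class ConsentModel(Enum):
    """The two approval models described in the text (names are illustrative)."""
    OPT_IN = "opt-in"    # default: do not share
    OPT_OUT = "opt-out"  # default: share


def may_share(model: ConsentModel, explicit_choice: Optional[bool]) -> bool:
    """Return True if the consumer's information may be shared.

    explicit_choice is None when the consumer has stated no preference,
    True when they have agreed to sharing, and False when they have refused.
    """
    if explicit_choice is not None:
        # An affirmative opt-in or opt-out always overrides the default.
        return explicit_choice
    # No stated preference: fall back to the model's default rule.
    return model is ConsentModel.OPT_OUT


if __name__ == "__main__":
    for model in ConsentModel:
        print(model.value, "-> shared when consumer is silent:",
              may_share(model, None))
```

In this sketch an explicit choice is honored under either model; only the treatment of silence differs, which is exactly where the two models disagree about the value of the data.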
Privacy policy
For consumers to trust a company sufficiently to do business with
it, the business must ensure that the consumer can, if they so wish,
read and understand the company's policies for data protection and
secondary data sharing. Even though most consumers may not avail
themselves of the opportunity, not presenting the option could be
enough to convince the consumer to do business elsewhere. Privacy
policies typically detail what level of protection and what mechanisms
for protection are employed by the company, as well as how, when, why,
and with whom data in their possession will be shared. Several
interactive, web-based tools have been developed to help organizations
create a privacy policy step-by-step.
Transparency
Another crucial consideration for companies is transparency or
consumer accessibility to data collected. The trend, motivated both by
legislation and by the desire to maintain a friendly and trusting
relationship with consumers, is toward allowing consumers on-line
access to their own data. Significantly opening up on-line access to
data inevitably raises security issues, since it may increase the risk
of unauthorized access by third parties to an individual's personal
information.
If adequate steps are in place to authenticate a person's request
to view their information, such as a user name and password or other
techniques, this mechanism can benefit both sides. The consumer is
reassured as to the nature of the information maintained about them
and the openness of their relationship with the business; furthermore,
the company minimizes its costs by placing some of the onus of keeping
information up-to-date on the consumer.
Exposure and Disclosure
The concepts of "exposure" and "disclosure" are
distinct, but both are related to privacy. Exposure has to do with
identity: am I willing to reveal who I am to one or more other
entities within the context of this transaction? Disclosure has
to do with other information about me: am I willing to reveal this
personal or sensitive information to other entities for some
particular purpose? It is sometimes argued that these two
concepts collapse into one if "identity" is simply
considered to be one type of personal information that may be
disclosed. However, in many environments it is useful to keep the
concepts separate because an authentication step (which may expose
identity) occurs prior to the remainder of any transaction or set of
transactions that discloses additional information.
Exposure may be further categorized into techniques providing
anonymity, pseudonymity, or veronymity.
- Anonymity ("no name") refers to the use of no name
whatsoever or to the use of a name that was never used before a given
transaction and will never be used again. The defining property of
anonymity is that no linkage is possible between this transaction and
the actual, real-life entity performing the transaction, and no
linkage is possible between two different transactions (i.e., it
cannot be known that they were both performed by the same actual,
real-life entity).
- Pseudonymity ("false name") refers to the use of a
particular name for multiple transactions, but that name is different
from the identity of the actual, real-life entity performing the
transactions. The defining property of pseudonymity is that no
explicit linkage is given between this transaction and the actual,
real-life entity performing the transaction, but a linkage is possible
between different transactions. In this way, a server can know that
the same entity is visiting again (and personalize accordingly), but
it does not know who this entity actually is. (Care must always be
taken with pseudonymity, however, because multiple transactions all
known to be performed by the same entity can sometimes allow an
observer to derive clues about the actual identity, thereby weakening
the property of pseudonymity.)
- Finally, veronymity ("true name") refers to the use of
the actual, real-life identity of the entity performing the
transaction within the transaction context. Linkage both from a given
transaction to the actual identity, and between two transactions
performed by the same identity, is obviously possible.
Therefore, identity information may be not exposed at all (in
anonymous transactions), may be partially exposed (in pseudonymous
transactions), or may be fully exposed (in veronymous transactions).
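The three levels of exposure can be told apart by what an observer is able to link. The Python sketch below illustrates this under stated assumptions: the function names are hypothetical, and deriving a pseudonym from a user-held secret with an HMAC is just one possible technique, not something prescribed by the systems discussed in this article.

```python
import hashlib
import hmac
import secrets


def anonymous_id() -> str:
    """Return a fresh random identifier meant for a single transaction.

    Two calls give unrelated values, so transactions cannot be linked
    to each other or to the real-life entity.
    """
    return secrets.token_hex(8)


def pseudonymous_id(real_name: str, user_secret: bytes) -> str:
    """Return a stable identifier that does not reveal the real name.

    The same (real_name, user_secret) pair always yields the same value,
    so a server can recognize a returning visitor without learning who
    that visitor actually is.
    """
    digest = hmac.new(user_secret, real_name.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def veronymous_id(real_name: str) -> str:
    """Return the real-life identity itself; every transaction is linkable."""
    return real_name


if __name__ == "__main__":
    secret = secrets.token_bytes(32)  # assumed to be held only by the user
    print("anonymous:   ", anonymous_id(), anonymous_id())
    print("pseudonymous:", pseudonymous_id("Alice Example", secret),
          pseudonymous_id("Alice Example", secret))
    print("veronymous:  ", veronymous_id("Alice Example"))
```

As the text cautions, a stable pseudonym still accumulates a history, so an observer who sees enough transactions under one pseudonym may be able to infer the real identity.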
Information use
There are three categories of use to which collected information may be put.
- Approved Intended uses. These are uses for which the
company has notified the customer and received approval. An example
might be collecting and storing a customer's shipping information to
streamline future purchasing.
- Non-Approved Intended
uses. These are uses for which the company has either not notified
the customer or has notified the customer but has not received
approval. The intent is on the company's side, not the consumer's
(i.e., the company intends to use the data in a particular way but the
consumer has not (yet) given explicit approval for this use). An
example would be selling a customer's purchasing history to another
company.
- Unintended uses. These are uses that
neither the company nor the customer anticipated or approved. An
example would be a hacker gaining access to a back-end database of
credit card numbers and posting them to the Web.
The goal of most privacy legislation and technology is to protect
consumers by allowing them access to a company's list of
"non-approved intended" uses so that informed choices can be
made. Implicit, as well, is a recognition that protection against
"unintended uses" must be provided.
Security
Security is very closely related to privacy. A privacy policy is
worthless if there are no security mechanisms in place to enforce
it. Most of the current privacy debate focuses on giving consumers
notice and choice (i.e., addressing the intended uses of consumer
information). Unfortunately, a consumer's choice on how her
information is to be used is unlikely to deter any hackers that may
get into the back-end systems. As such, consumer choice and notice
won't solve the public's top online privacy concerns: identity theft,
computer credit card fraud, and digital hijacking of personal
information. Until the security issues are addressed and systems guard
against unauthorized (i.e., unintended) uses, there can be no true online
privacy.
Security often encompasses such concepts as confidentiality,
authorization, authentication, and non-repudiation; each of these is
relevant in some way to privacy.
Confidentiality refers to keeping sensitive information
secret and protected from inappropriate viewing. Privacy requires that
the confidentiality of user information is protected both in transit
and in storage.
Authorization refers to the process of determining what an
individual or business entity is allowed to do; for instance, a user
may allow one company only to view their online calendar while
another is authorized to write to it.
Non-repudiation refers to mechanisms that prevent
individuals and business entities from denying an action of
theirs. Such functionality is relevant to privacy because it would
prevent a business from denying that it made a claim for how user data
would be used if the business was later found to have broken this
policy.
Authentication refers to proving that individuals or
businesses are indeed who they claim to be.
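The calendar example given for authorization (one company may only view, another may also write) can be sketched as a simple grant table. This is an illustrative Python sketch only; the requester names, the `EXAMPLE_GRANTS` table, and the `is_authorized` function are hypothetical and do not correspond to any particular product's API. It assumes the requester has already been authenticated.

```python
from typing import Dict, Set

# Hypothetical grant table: authenticated requester -> actions the user
# has permitted on their online calendar.
EXAMPLE_GRANTS: Dict[str, Set[str]] = {
    "company-a.example": {"view"},
    "company-b.example": {"view", "write"},
}


def is_authorized(grants: Dict[str, Set[str]], requester: str, action: str) -> bool:
    """Deny by default: an action is allowed only if explicitly granted."""
    return action in grants.get(requester, set())


if __name__ == "__main__":
    for requester in ("company-a.example", "company-b.example", "unknown.example"):
        for action in ("view", "write"):
            print(requester, action, is_authorized(EXAMPLE_GRANTS, requester, action))
```

Denying anything not explicitly granted mirrors the opt-in stance toward personal data: an unknown requester gets no access at all.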
Information Sharing
Currently, a Web user will likely maintain separate collections of
their personal information with multiple businesses, with resulting
duplication and administrative burden. For example, they will likely
have provided their shipping address to every Web site from which they
ever made a purchase. Privacy will become even more of an issue in the
future as these existing islands of customer information are
connected to each other to create a virtual whole (as in Microsoft's
.NET My Services initiatives and the evolving Liberty Alliance).
The power of such aggregation is obvious, from the perhaps mundane
scenario of auto form-filling to new and exciting scenarios of
applications providing a holistic experience for a user (e.g., an
online grocery service that is able to access a filtered view of the
user's agenda to determine the best time to deliver their
order). This sort of concentration of data, either physical or
virtual, has obvious implications for privacy. If nothing else, it
would seem to present an incredibly attractive target for hackers
wishing to concentrate their efforts where there is the greatest
potential for reward.
Privacy of user information in this information sharing model requires:
- Protected data storage
- Authentication and authorization of requesting applications
- Confidentiality of transmitted data
Another privacy aspect of the model described above, quite separate
from the issue of controlling access to the user's personal
information stored in the information repository, is that the
authentication service, through its central role in the authentication
process, will have access to a vast store of click-stream data: the
record of sites a user visits. Such data could enable powerfully
targeted marketing. For instance, if a user were seen to visit the Web
sites of high-end furniture and antique stores, then a displayed
banner ad for cigars would presumably enjoy greater success with this
user than with a member of the public chosen at random. Privacy experts
have expressed concerns about a single corporation (Microsoft or any
other) playing such a central role in e-commerce transactions.
Microsoft has promised that they will neither use Passport data in
this way themselves, nor sell it to others. An organization wishing to
participate in a Liberty community will necessarily make the same
commitment.
.NET My Services will make the user's information available through
a published XML API; Microsoft is calling this the "XML Message
Interfaces" (XMI). XMI will simplify for application developers
both the retrieval of this information and its integration into their
applications (browser-based and non-browser-based). The initial .NET
My Services roll-out will include core services like .NET
Profile (nicknames, picture, etc.) and .NET Calendar (time
and task management), each of which will have an appropriately defined
XML Schema. The following shows an example of the stored XML.
<c:contact xmlns:c="http://schemas.microsoft.com/hs/2002/10/myContacts"
           xmlns:p="http://schemas.microsoft.com/hs/2002/10/myProfile">
  <c:firstName xml:lang="en-us">Bill G.</c:firstName>
  <c:lastName xml:lang="en-us">Ates</c:lastName>
  <c:emailAddress>
    <p:address>billg.ates@microsoft.com</p:address>
  </c:emailAddress>
</c:contact>
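To give a sense of how an authorized application might consume such a record, here is a minimal Python sketch that uses only the standard library's `xml.etree.ElementTree`. It is not the .NET My Services API: the embedded string is a well-formed copy of the contact example, the two namespace URIs are taken from that example, and the `read_contact` function and its returned field names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the contact record from the example.
CONTACT_XML = """\
<c:contact xmlns:c="http://schemas.microsoft.com/hs/2002/10/myContacts"
           xmlns:p="http://schemas.microsoft.com/hs/2002/10/myProfile">
  <c:firstName xml:lang="en-us">Bill G.</c:firstName>
  <c:lastName xml:lang="en-us">Ates</c:lastName>
  <c:emailAddress>
    <p:address>billg.ates@microsoft.com</p:address>
  </c:emailAddress>
</c:contact>
"""

# Prefix-to-URI map used when searching; the prefixes here are local to
# this script and only the URIs have to match the document.
NS = {
    "c": "http://schemas.microsoft.com/hs/2002/10/myContacts",
    "p": "http://schemas.microsoft.com/hs/2002/10/myProfile",
}


def read_contact(xml_text: str) -> dict:
    """Extract a few fields from a contact record (illustrative only)."""
    root = ET.fromstring(xml_text)
    return {
        "firstName": root.findtext("c:firstName", namespaces=NS),
        "lastName": root.findtext("c:lastName", namespaces=NS),
        "email": root.findtext("c:emailAddress/p:address", namespaces=NS),
    }


if __name__ == "__main__":
    print(read_contact(CONTACT_XML))
```

The ease with which a few lines of code can pull named fields out of the record is the same property that makes logically tagged data convenient for legitimate applications and attractive to attackers.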
Passport, .NET My Services, the Liberty Alliance, and
similar architectures built around Web protocols and services have
heightened public awareness of privacy issues with regard to the
Internet. The next article will highlight some XML-based efforts to
enhance Internet privacy.