Index
1. XML Fundamentals
2. XML Markup
3. Well-formed and Valid Documents
4. XML Information Models
5. DTDs
6. XML Schemas
7. XSLT
8. XML Processing
1. XML Fundamentals
XML is structured data. It is not a programming language but a data format.
XML is extensible because it is not a fixed format like HTML. XML uses tags to describe information content. It has a hierarchical data structure.
XML is a simplified subset of SGML.
Programming languages specify calculations, actions and decisions to be carried out, whereas, Markup languages (like XML) describe information for storage, transmission, or processing by a program.
XML files cannot be run on their own. They need programs to create, display, or process them.
XML removes the dependence on a single, inflexible document type (HTML) and also removes the complexity of SGML.
XML does not replace HTML. HTML has been redefined as an XML vocabulary instead of an SGML vocabulary; this reformulation is known as XHTML. XHTML 1.0 is HTML 4 rewritten as an XML application. Existing HTML files can be processed as XML only if they are well-formed.
XML Family:
• Display: XHTML, XSLT, XSL
• Modeling: DTD, XML Schema
• Manipulating: DOM, SAX
• Querying: XLink, XQL, XPath
Strengths of XML
Robust Data Representation
Easily mutable (APIs are easily accessed by existing code)
Platform Independent
Works with existing technologies
The path of a standard at W3C:
Notes -> Working Drafts -> Candidate Recommendations -> Technical Recommendations
2. XML Markup
The prolog provides initial parameters to the XML processor:
<?xml version="1.0" encoding="UTF-16" standalone="yes" ?>
Default value for standalone is "no"
Elements
Types are Document (Root) Element, Other Elements and Empty Elements.
Empty Elements are a common way of including multimedia files in a document.
Attributes provide additional information about an element.
PCDATA Vs CDATA
PCDATA is parsed character data and CDATA is unparsed character data. A CDATA section is used to pass data that contains characters reserved for markup, such as < and &. This technique is useful when embedding data from a legacy system.
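The difference between escaping and a CDATA section can be seen with a small sketch. It uses Python's standard xml.etree.ElementTree module; the element name code and the sample text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The same character data written two ways (hypothetical sample content).
with_cdata = "<code><![CDATA[if (a < b && b > c) { swap(); }]]></code>"
with_escapes = "<code>if (a &lt; b &amp;&amp; b &gt; c) { swap(); }</code>"

# Inside CDATA the parser does not treat < and & as markup;
# outside CDATA they must be written as entity references.
code_text = ET.fromstring(with_cdata).text
escaped_text = ET.fromstring(with_escapes).text

print(code_text)
print(code_text == escaped_text)
```

Both forms give the application the same text; CDATA only saves the author from escaping every reserved character by hand.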
Processing instructions are passed on to the processing application.
<?TargetName any sequence of characters?>
Comments are often stripped by the parser and not passed on to the application. The processor ignores them.
<!-- Here are comments -->
Namespaces allow us to disambiguate names in our document.
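As a rough illustration of namespaces disambiguating names, the sketch below parses a document in which two different title elements coexist. The namespace URIs, prefixes, and element names are invented for the example; the parsing uses Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Two vocabularies both define <title>; the namespace tells them apart.
# The URIs below are placeholders, not real vocabularies.
doc = (
    '<order xmlns:bk="http://example.com/book" '
    'xmlns:p="http://example.com/person">'
    "<bk:title>XML Basics</bk:title>"
    "<p:title>Dr.</p:title>"
    "</order>"
)

root = ET.fromstring(doc)
ns = {"bk": "http://example.com/book", "p": "http://example.com/person"}

book_title = root.find("bk:title", ns).text
person_title = root.find("p:title", ns).text

# ElementTree exposes each name as {namespace-uri}local-name.
tags = [child.tag for child in root]
print(book_title, person_title, tags)
```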
3. Well-formed and Valid documents
XML Constraints
• XML prolog at the top
• Only one root element
• Elements must nest properly
• Attribute values must be quoted
• Every start tag must have a matching end tag (names are case sensitive)
Well-formed XML documents must obey the basic XML constraints. If a document is not well-formed, it is not an XML document.
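A non-validating parser is enough to test these constraints. The sketch below is one possible check using Python's standard xml.etree.ElementTree; the helper name is_well_formed is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<a><b/></a>"))      # properly nested
print(is_well_formed("<a><b></a></b>"))   # improper nesting
print(is_well_formed("<a x=1/>"))         # unquoted attribute value
print(is_well_formed("<a></A>"))          # end tag differs in case
print(is_well_formed("<a/><b/>"))         # more than one root element
```

Only the first document passes; each of the others breaks one of the constraints in the list.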
Structural/Semantic Constraints are defined in the information model (DTD, XML Schema).
What element (tag) names are allowed
What attributes are used with each element
Which child elements belong to which parent elements
What order child elements can appear in
If a document's structure and tag names match the information model, it is "valid". Validation is optional.
A valid document is always well-formed.
4. XML Information Models
Purpose
Data Control: ensure that elements in a document follow the prescribed order
Processing: check that a document matches a prescribed schema
Definition: provide a way to define a schema
Authoring: assist authors in creating valid documents
Information models enforce rules. Rules allow standardized documents. Standardized documents allow companies to exchange data.
Types of Information Models
• DTDs
• XML Schemas
Schemas Vs DTDs
• Schemas support namespaces
• Schemas are written in XML syntax
• Schemas provide extensive datatype support, whereas DTDs have very limited datatype support
• Schemas have full object-oriented extensibility, whereas DTDs are extended via string substitutions
• Schemas support open, closed, or refinable content models, whereas DTDs support closed models only
5. DTDs
A Document Type Definition is a series of statements that define document component names and the relationships between them.
DTDs can be internal or external.
The file dtdname.dtd is a DTD definition, and
<!DOCTYPE rootelement SYSTEM "dtdname.dtd"> is a DTD declaration.
An internal DTD is defined and declared within the XML document. An external DTD is defined in a .dtd file and then declared in the XML document.
Identifiers for external DTDs are SYSTEM (a location) or PUBLIC (a publicly registered identifier).
Element Names
Must start with a letter, "_" (underscore), or ":" (colon).
Allowed following characters include letters, digits, "_" (underscore), "-" (hyphen), ":" (colon), and "." (dot).
Naming Tips :
Do not use cryptic names
Avoid unwieldy names
Keep consistent naming scheme
Do not use numbers
Do not append the name of the parent to an element name.
Element Content :
DTD syntax allows the control of element content:
Type: for example, EMPTY
Order: components separated by commas must each appear once, in the listed order
Choice: car|truck|bike means car or truck or bike
Multiplicity:
* Zero or more
? Zero or one
+ One or more
(no symbol) Once and once only
When mixing component types, separate the components with pipes; #PCDATA must be declared first.
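The order, choice, and multiplicity symbols behave much like a regular expression over the sequence of child element names. The sketch below shows that analogy only and is not a DTD validator: it ignores #PCDATA, EMPTY, and ANY, and the helper names and sample models are assumptions made for illustration.

```python
import re

# An XML name: starts with a letter, "_" or ":", followed by name characters.
NAME = re.compile(r"[A-Za-z_:][\w.:\-]*")

def content_model_to_regex(model):
    """Translate a DTD element-content model into a regular expression."""
    compact = re.sub(r"\s+", "", model)
    # Each element name becomes a group that matches "name ".
    with_names = NAME.sub(lambda m: "(?:" + re.escape(m.group()) + " )", compact)
    # A comma means "followed by", which is plain concatenation in a regex.
    # The symbols |, *, + and ? already mean the same thing in both notations.
    return with_names.replace(",", "")

def children_match(model, child_names):
    """Check a list of child element names against a content model."""
    sequence = "".join(name + " " for name in child_names)
    return re.fullmatch(content_model_to_regex(model), sequence) is not None

book_model = "(title, author+, chapter*, index?)"
print(children_match(book_model, ["title", "author", "author", "chapter"]))
print(children_match(book_model, ["author", "title"]))
print(children_match("(car|truck|bike)", ["truck"]))
```

The first call succeeds because the children appear in the listed order with allowed repetition; the second fails on order; the third shows a choice.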
Attributes
Sub elements and machine-readable codes end up as attributes.
Attribute types: String, Tokenized (varying lexical and semantic constraints), Enumerated (list of valid values).
Attribute Qualifiers:
#FIXED (must have a default value, which cannot be overridden)
#IMPLIED (no default value is supplied; the attribute is optional in the XML document)
#REQUIRED (default not allowed in the DTD; a value is required in the XML document)
Enumerated (optional default value)
Tokenized attributes :
An ID attribute value is unique within the document. Similar to a primary key.
IDREF is a pointer to an ID. Similar to a foreign key.
IDREFS functions as a pointer to multiple IDs separated by spaces.
ENTITY is a pointer to an external unparsed entity
ENTITIES is a list of entity names separated by white space
NMTOKEN (name token) contains a single token made up of name characters
NMTOKENS is a list of NMTOKENs separated by white space
Entities
Entities are storage units or storage objects. They can be parsed or unparsed.
Parsed entities are used as replacement text and invoked by name (e.g. &abc;).
Unparsed entities are non-XML resources and are invoked by name via an ENTITY attribute. The entity declaration must reference a notation, which specifies the format of the information and which application should handle it.
Types of Entities:
Pre-defined by the parser (e.g. &lt;)
Internal General Entity (text substitution)
External General Entity (uses a system identifier)
Internal Parameter Entity (used within DTDs; uses the % sign)
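Replacement of parsed entities can be observed with a short sketch. It declares one internal general entity in an internal DTD and mixes it with predefined entities; the entity name company and its value are invented for the example, and the parsing relies on Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# "company" is a hypothetical internal general entity declared in an
# internal DTD; &lt;, &gt; and &amp; are predefined by the parser.
doc = """<!DOCTYPE note [
  <!ENTITY company "Example Corp">
]>
<note>&company; &lt;sales&gt; &amp; support</note>"""

note_text = ET.fromstring(doc).text
print(note_text)
```

The application never sees the entity references, only their replacement text.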
DTD weaknesses
• Cryptic syntax
• Everything is treated as text
• Performance impact (validation requires extra time)
• Inconsistent parser support for entities
• Limited capacity for data validation
• Can define only hierarchical relationships
• Poorly suited for automation
Elements Vs Attributes
• It is easier to edit and display element content than attribute values.
• Processors can check attribute values more easily than element content.
• It is easier to extract information from attributes than from sub-elements.
• Attributes can have default values; elements cannot.
• Elements define content; attributes describe content.
6. XML Schemas
XML Schema is a specification defined and maintained by the W3C.
It provides a syntax and a model for describing the structure of XML documents.
XML Schemas: Highlights
Enhanced Datatypes
Written in XML
Object Oriented
Can express sets (child elements can occur in any order)
Can specify element content as being unique (Keys)
Can define multiple elements with same name but different content
Can define elements with null content
Can create equivalent elements (subway element equals train element)
Open, closed and refinable content models
Namespace support
Grouping (attributes etc.)
XML Schema Components
XML schema has 3 components : Declarations, Types and Type Definitions.
Element Types can be classified as :
Simple Types and Complex Types
Simple types have no element children and no attributes.
Complex types may have element children and attributes.
Users can build new simple and complex types. Some simple types, such as boolean, string, and decimal, are built in.
Named Types and Anonymous Types
An anonymous type does not have a name and so can be used in one element only.
A named type has a name and can be referenced.
Inline Declarations and out-of-line Declarations
Inline declarations are written top to bottom, and out-of-line declarations are written bottom to top.
In inline declarations, the tree is declared first with a ref to the branch; the branch is then declared with a ref to the leaf.
Custom Data Types
New Data types can be created from an existing data type (called the base type).
For example:
<simpleType name="name">
<restriction base="source">
<facet value="value"/>
<facet value="value"/>
</restriction>
</simpleType>
source can be any one of string, boolean, float, double, decimal, duration, anyURI ...
facet can be any one of
pattern, enumeration, length, maxLength, minLength ...
The default value of both minOccurs and maxOccurs is 1.
7. XSLT
XSL Vs XSLT
XSL is a styling language and XSLT is a specification used for transformation.
The conventional namespace prefix for XSL formatting objects is fo: and for XSLT is xsl:
CSS Vs XSLT
CSS can only style XML documents, but XSLT can do the styling as well as:
Reorder nodes in the input document
Transform nodes in the input document
Sort nodes in the input document
Add, remove nodes in the input document
Transform both attributes and elements
An .xsl file is actually an XSLT transform file.
NameSpaces
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
xsl is the namespace prefix conventionally used for stylesheets. Namespaces may or may not be present in the source and target documents.
XSLT uses namespaces in the transformation file to differentiate between XSLT instructions and literal result elements.
XSLT Components
Top level element xsl:stylesheet can be replaced by xsl:transform
xsl:apply-templates selects nodes to be processed, and processes them.
For example, the XSLT instructions below select all child nodes of the root and process them:
<xsl:template match="/" >
<xsl:apply-templates />
</xsl:template>
XSLT Elements
XSLT Elements can be classified into top-level elements and Instruction elements.
Examples of top-level elements are <xsl:template>; <xsl:param>; <xsl:output>
Instruction elements are only found inside the template body.
For example: <xsl:apply-templates>; <xsl:for-each>; <xsl:element>; <xsl:value-of>
XSLT Patterns
Patterns are used for node processing, template matching, etc. XSLT patterns are described in the XPath spec.
Patterns are used to define a condition that a node must satisfy in order to be selected.
Examples of a pattern are "/course/topic/slide[2]"; "book/@isbn"
Patterns are used in the match attribute of xsl:template, the select attribute of xsl:for-each, ...
The general form of XSLT patterns is /step/step/step.
The Expanded form is /axis::nodetest[predicate]/axis::nodetest[predicate]
axis can be parent, child, self etc.
nodetest can be done on nodename or nodetype.
Predicates potentially reduce the node list; zero or more are allowed, and they are evaluated from left to right.
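Steps and predicates can be tried out without an XSLT engine. The sketch below uses the limited XPath subset in Python's standard xml.etree.ElementTree, which accepts only paths relative to an element, so the leading /course step is implied by starting from the root element. The sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical course document used only for this example.
course = ET.fromstring(
    "<course>"
    '<topic name="Markup">'
    "<slide>Prolog</slide><slide>Elements</slide><slide>Attributes</slide>"
    "</topic>"
    '<topic name="DTDs">'
    "<slide>Entities</slide><slide>Notations</slide>"
    "</topic>"
    "</course>"
)

# Positional predicate: the second slide of every topic.
second_slides = [s.text for s in course.findall("topic/slide[2]")]

# Attribute predicate on one step, then a child step.
dtd_slides = [s.text for s in course.findall("topic[@name='DTDs']/slide")]

# Two steps with predicates, applied left to right along the path.
first_dtd_slide = course.find("topic[@name='DTDs']/slide[1]").text

print(second_slides, dtd_slides, first_dtd_slide)
```

Each predicate narrows the node list produced by its step before the next step runs.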
XML Nodes
XML trees are composed of nodes, which have a type and may have a name.
Nodes may have child nodes, a parent node, etc.
XSLT Extensions
XSLT engines may offer custom extensions to the basic W3C XSLT spec.
XSLT instructions
<xsl:for-each> is similar to a "for" loop.
<xsl:sort> sorts the elements.
<xsl:if test="something"> is similar to an "if" statement.
There is no <xsl:else> in XSLT; <xsl:choose> is used instead.
<xsl:choose> is similar to "switch", <xsl:when> is similar to "case", and <xsl:otherwise> is similar to "default".
<xsl:variable> is similar to a "final" variable. A variable has a scope limited to the element it is defined in.
<xsl:param> declares a parameter; it can appear at the top level of the stylesheet or at the start of a template.
<xsl:call-template> calls the named template and passes parameters using <xsl:with-param>
<xsl:element> and <xsl:attribute> are used to create elements and attributes.
<xsl:comment> emits comments
<xsl:processing-instruction> emits processing instructions
<xsl:text> emits text.
<xsl:copy> does a shallow copy.
<xsl:copy-of> does a deep copy.
<xsl:number> is used to apply numbers to the output document.
count and level attributes of xsl:number are used to calculate node numbers.
<xsl:output method="html"> is used to tell the XSL processor the type of output. The legal values of method are html, text and xml. Attributes of xsl:output other than method include version, encoding, indent etc.
<xsl:strip-space> and <xsl:preserve-space> are used to tell the XSL processor whether to strip or preserve white space.
XSLT Templates
Examples of template modes are "debug" and "production".
If more than one template can handle a given node, there is a conflict. The XSLT engine resolves this by assigning a priority to each template. The user can also assign a priority to a template. System-assigned priorities range from -0.5 to +0.5. User-assigned priorities are usually greater than +1.0. If two priorities match, the XSLT engine will pick the last one.
External Stylesheets
Breaking up of stylesheets into separate modules will provide reuse.
Stylesheets can be included or imported.
Importing a stylesheet is similar to subclassing.
XSLT Functions
XSLT datatypes are String, Number, Boolean, Node-set, Tree ...
XSLT Functions are called inside of XSLT elements.
String Functions
string() : Conversion to string
concat() : Concatenates two or more strings
starts-with() : Takes two strings, returns true if first string starts with second string
substring() : note that character positions start at 1, not 0
string-length(): returns the length of the string
name() : returns name of the node
contains() : takes two strings and returns true if string1 contains string2.
document() : to access other XML documents. Takes url as argument.
id() : returns the element with the given ID.
key() : returns the nodes that match a value in a key defined with xsl:key.
translate() :
Takes three arguments, all strings. Each character of the first argument that occurs in the second argument is replaced by the character at the corresponding position in the third argument.
normalize-space() :
returns the string with leading and trailing white space removed and each internal sequence of white space replaced by a single space.
substring-before() :
Takes two arguments, returns the substring of the first argument that precedes the first occurrence of the second argument. substring-after() works the same way for the text that follows.
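The behaviour of a few of these string functions can be approximated in plain Python. The helpers below are hypothetical stand-ins written for this example, not part of any XSLT library, and xpath_substring handles integer arguments only.

```python
import re

def xpath_translate(s, src, dst):
    """Map each character in src to the character at the same position in dst.

    Characters of src with no counterpart in dst are deleted.
    """
    table = {}
    for i, ch in enumerate(src):
        if ord(ch) not in table:  # the first occurrence in src wins
            table[ord(ch)] = dst[i] if i < len(dst) else None
    return s.translate(table)

def substring_before(s, sep):
    idx = s.find(sep)
    return s[:idx] if idx >= 0 else ""

def substring_after(s, sep):
    idx = s.find(sep)
    return s[idx + len(sep):] if idx >= 0 else ""

def normalize_space(s):
    # Collapse runs of space, tab, CR and LF, then trim both ends.
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

def xpath_substring(s, start, length=None):
    """Substring with 1-based positions (integer arguments only)."""
    end = len(s) + 1 if length is None else start + length
    first = max(start, 1)
    return s[first - 1:max(end - 1, first - 1)]

print(xpath_translate("bar", "abc", "ABC"))
print(substring_before("1999/04/01", "/"), substring_after("1999/04/01", "/"))
print(normalize_space("  a \n b\t c "))
print(xpath_substring("12345", 2, 3))
```

Note that translate works character by character, so it is closer to a lookup table than to a search-and-replace on substrings.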
Boolean Functions : All functions return boolean.
boolean()
not()
true()
false()
lang()
Number Functions : All functions return number
number()
sum()
floor()
ceiling()
round()
last() : returns last position in the current node list
position() : returns position of current node in the list
count() : returns the number of nodes in the node-set passed as its argument
Arithmetic operators available are +, -, *, div, mod
8. XML Processing
Interface Vs Implementation
Interface specifies "what" and implementation specifies "how".
Interface is the "spec" and implementation is the "code".
Different providers can provide different implementations for the same interface.
Parser
A set of software components designed for reading, processing and creating XML documents.
Parsers expose the structures and tags within an XML document, thus making it easy to process XML documents.
Types of Parsers
SAX (Simple API for XML) Parsers: event based
DOM (Document Object Model) Parsers: tree (object) based
Validating Parsers
Non-Validating Parsers - faster
SAX Vs DOM
SAX is event-based and DOM is tree-based
SAX was developed by members of the XML-DEV mailing list and DOM is a W3C recommendation
DOM constructs a tree in memory and SAX does not
SAX fires events (streaming) and DOM reads the entire document.
DOM is harder to use than SAX, but is flexible
SAX is read-only and DOM is read-write.
SAX uses less memory and is fast & efficient
SAX is preferable for large documents.
DOM is preferable for non-sequential processing
DOM maintains history. SAX does not.
DOM spec is written in CORBA IDL, SAX spec is written in Java.
DOM spec is 500 pages whereas the SAX spec is only 20 pages.
SAX gives control to user during parsing, DOM gives control only after parse.
DOM provides range support, traversal support, HTML DOM support and CSS/Stylesheet support.
Parser errors are of three types
warning : Problems that are not errors as defined by the XML specification.
error : Errors defined by the XML specification. Recoverable.
fatalError : Defined by XML specification. Non-recoverable.
SAX Parsing
SAX has 5 interfaces and ~30 methods.
SAX Interfaces
ContentHandler
LexicalHandler
DTDHandler
DeclHandler
ErrorHandler
Some of the methods which handle events are :
ContentHandler Interface is the main SAX interface.
public void startDocument()
public void endDocument()
public void setDocumentLocator(Locator locator)
public void startElement(String uri, String localName, String qName, Attributes atts)
public void endElement(String uri, String localName, String qName)
public void characters(char ch[], int start, int length)
public void ignorableWhitespace(char ch[], int start, int length)
public void processingInstruction(String target, String data)
public void skippedEntity(String name)
public void startPrefixMapping(String prefix, String uri)
public void endPrefixMapping(String prefix)
LexicalHandler Interface (for Entities and CDATA)
public void startDTD(String name, String publicId, String systemId)
public void endDTD()
public void startEntity(String name)
public void endEntity(String name)
public void startCDATA()
public void endCDATA()
public void comment(char ch[], int start, int length)
DTDHandler Interface (DTD Processing)
public void notationDecl(String name, String publicId, String systemId)
public void unparsedEntityDecl(String name, String publicId, String systemId, String notationName)
DeclHandler Interface
ErrorHandler Interface
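The event-driven style can be sketched with Python's standard xml.sax module, which mirrors the SAX interface names. With namespace processing off (the default), Python's startElement receives (name, attrs) rather than the four-argument Java form listed above. The class name TitleCollector and the sample document are assumptions made for this example.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Records parse events and collects the text of <title> elements."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append("start:" + name)
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # The parser may deliver one text run in several chunks.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        self.events.append("end:" + name)
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False

handler = TitleCollector()
xml.sax.parseString(
    b"<library><book><title>XML Basics</title></book>"
    b"<book><title>XSLT Notes</title></book></library>",
    handler,
)
print(handler.titles)
print(handler.events[:4])
```

No tree is built: the handler keeps only the state it needs, which is why this style suits large documents.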
DOM Parsing
Provides two complementary views of the parse tree
Flat View : Everything is a node
Object-Oriented View : Objects
DOMImplementation Methods
createDocument()
createDocumentType()
hasFeature()
Document Methods
Node Types
Node Methods
NodeList Interface
NamedNodeMap Interface
CharacterData Interface
Element Interface (Manages Attributes)
Attr Interface
DocumentFragment is a lightweight implementation of the document object which does not require a root element and will be inserted into a larger document.
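A small sketch of tree-based processing, using Python's standard xml.dom.minidom as one DOM implementation. It reads a document, checks node types, and inserts new elements through a DocumentFragment. The sample document and the ids are invented for illustration.

```python
from xml.dom import Node, minidom

doc = minidom.parseString(
    "<library><book id='b1'><title>XML Basics</title></book></library>"
)
root = doc.documentElement

# Flat view: elements and text are all nodes with a nodeType.
title = root.firstChild.firstChild
root_is_element = root.nodeType == Node.ELEMENT_NODE
text_is_text_node = title.firstChild.nodeType == Node.TEXT_NODE

# Build two books inside a fragment, which needs no root element of its own.
fragment = doc.createDocumentFragment()
for book_id, name in [("b2", "XSLT Notes"), ("b3", "SAX and DOM")]:
    book = doc.createElement("book")
    book.setAttribute("id", book_id)
    new_title = doc.createElement("title")
    new_title.appendChild(doc.createTextNode(name))
    book.appendChild(new_title)
    fragment.appendChild(book)

# Appending the fragment inserts its children, not the fragment itself.
root.appendChild(fragment)

ids = [b.getAttribute("id") for b in root.getElementsByTagName("book")]
titles = [t.firstChild.data for t in root.getElementsByTagName("title")]
print(ids, titles)
```

Because the whole tree is in memory, nodes can be visited in any order and modified, which is the read-write, non-sequential access that SAX does not offer.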