Publication - Guidance

Open Data resource pack: version 2

Published: 30 Aug 2016

Resource pack to help public authorities develop and implement plans for open data. This is the second version of the document.

90 page PDF


8. Create a dataset

After selecting the information you wish to publish you need to organise it so it can be made available for download in bulk and in machine readable formats. This is called creating a dataset. Creating a dataset is a quick and easy process. A dataset is a structured presentation of data, such as a spreadsheet or table.

The steps for creating your dataset are set out below. Annex A has a checklist to help you make sure you cover each of the steps.

Step 1: Apply an open format

One of the most common questions asked is 'what format should I use?' Open data should be in an open format and machine readable.

Open Formats are non-proprietary and platform independent. They can be accessed by anyone and do not require access to licensed software. For example, Microsoft formats are not open as they rely on proprietary software.

Machine Readable formats allow a computer to read the data. Machine readable data is structured and easy to query using code.
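As a rough illustration of what "easy to query using code" means, the sketch below filters and totals a small CSV document using only Python's standard library. The sample data and the column names (council, year, visits) are invented for this example and do not come from any real dataset.

```python
import csv
import io

# Hypothetical sample data; the column names are illustrative only.
SAMPLE_CSV = """council,year,visits
Council A,2015,1200
Council B,2015,950
Council A,2016,1350
"""


def rows_for_year(csv_text, year):
    """Return the rows of a CSV document that match a given year."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["year"] == str(year)]


def total_visits(rows):
    """Sum the 'visits' column across the given rows."""
    return sum(int(row["visits"]) for row in rows)


rows_2015 = rows_for_year(SAMPLE_CSV, 2015)
print(len(rows_2015), total_visits(rows_2015))
```

The same question asked of a PDF or a scanned image would need manual copying or error-prone extraction first.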

The most appropriate format will depend on the type of data. Any type of data can be stored in an open format, but it is likely you will have to transform the data from its original format. Open, machine readable formats allow the data to be used and edited easily. They also allow for interoperability between different datasets. By contrast, a PDF publication may look nice but it severely limits the user's ability to re-use the information.
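Transforming data into an open format is often a small task. The following is a minimal sketch, assuming the data has already been read into a list of records (for example from a spreadsheet export); the records and field names are hypothetical. It writes the same rows as CSV and as JSON using Python's standard library.

```python
import csv
import io
import json

# Hypothetical records, standing in for data exported from a spreadsheet.
records = [
    {"site": "Library A", "postcode": "AB1 2CD", "visits": 1200},
    {"site": "Library B", "postcode": "EF3 4GH", "visits": 950},
]


def to_csv(rows):
    """Serialise a list of dictionaries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialise the same rows as a JSON array of objects."""
    return json.dumps(rows, indent=2)


csv_text = to_csv(records)
json_text = to_json(records)
print(csv_text)
print(json_text)
```

CSV suits flat tables like this one; JSON becomes the better fit once records contain nested or repeating values.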

You should aim to select a format which satisfies 3 star publication requirements. Below is a table of common open data formats which satisfy 3 star release.

Examples of Common Open Formats

| Format Name | Definition | Type of data to use this for |
|---|---|---|
| Comma Separated Values (CSV) | CSV is a great way of storing large amounts of data with just commas separating the data values. Often the CSV file will contain a header with names describing what data is populating the file. | Tabular data (e.g. use instead of Excel) |
| Tab-Separated Values (TSV) | TSV is a very common form of text file format for sharing tabular data and is highly machine readable. | Tabular data (use instead of Excel) |
| JavaScript Object Notation (JSON) | JSON uses human-readable text to transmit data objects consisting of attribute-value pairs. It is used primarily to transmit data between a server and web application, as an alternative to XML. The file size will be smaller than XML. | Complex structured data; multidimensional data; tabular data |
| Extensible Markup Language (XML) | XML is a widely known markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. Users create and define their own tags. | Complex structured data; multidimensional data; tabular data |
| Rich Site Summary (RSS) | RSS (originally RDF Site Summary), often dubbed Really Simple Syndication, uses a family of standard web feed formats to publish frequently updated information: blog entries, news headlines, audio, video. An RSS document (called "feed", "web feed" or "channel") includes full or summarised text, and metadata, like publishing date and author's name. | Announcements or events, e.g. on websites |
| Atom | The Atom Syndication Format is an XML language used for web feeds. The Atom format was developed as an alternative to RSS. Note RSS is the preferred standard. | Announcements or events, e.g. on websites |
| Open Document Format for Office Applications (ODF) | ODF, also known as OpenDocument, is an XML-based file format for spreadsheets, charts, presentations and word processing documents. It was developed with the aim of providing an open XML-based file format specification for office applications. | Non-system generated metadata or additional information you release with your dataset (replaces Excel, Word, PDF) |
| HTML | Used for formatting information on the web. | Non-system generated metadata or additional information you release (replaces PDF, Word) |
| Keyhole Markup Language (KML) | KML is an XML language focused on geographic visualization, including annotation of maps and images. | Spatial/location data |
| Geography Markup Language (GML) | GML is the XML grammar defined by the Open Geospatial Consortium (OGC) to express geographical features. GML serves as a modelling language for geographic systems as well as an open interchange format for geographic transactions on the internet. | Spatial/location data |
| GeoJSON | GeoJSON is an open standard format for encoding collections of simple geographical features along with their non-spatial attributes using JavaScript Object Notation. | Spatial/location data |

Table taken from Government of South Australia Open Data Process Guide
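For spatial data, the GeoJSON row in the table can be illustrated with a short sketch. It assumes a list of point locations with a name, longitude and latitude; the site names and coordinates are made up for the example. GeoJSON positions are written longitude first, then latitude.

```python
import json

# Hypothetical locations; names and coordinates are illustrative only.
sites = [
    {"name": "Site A", "longitude": -3.19, "latitude": 55.95},
    {"name": "Site B", "longitude": -4.25, "latitude": 55.86},
]


def to_feature_collection(rows):
    """Wrap point locations in a GeoJSON FeatureCollection."""
    features = []
    for row in rows:
        features.append({
            "type": "Feature",
            # GeoJSON order is [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [row["longitude"], row["latitude"]],
            },
            # Non-spatial attributes travel in "properties".
            "properties": {"name": row["name"]},
        })
    return {"type": "FeatureCollection", "features": features}


collection = to_feature_collection(sites)
print(json.dumps(collection, indent=2))
```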

Useful reading

Government Service Design Manual - Choosing appropriate formats

Open Knowledge - Open Format Definition

Sunlight Foundation Open Data Guidelines

W3C Best Practice - Make the data available in a language people want

Step 2: Capture Metadata

Your data can only be used effectively if you also provide some metadata. Metadata is descriptive information about the data. It can describe the content, format, currency, limitations and frequency of updates. Metadata provides the user context about the data and good metadata will allow interoperability with other datasets.

Your publishing portal may allow metadata to be displayed below your data, or you could create an accompanying file for the metadata.
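An accompanying metadata file could look something like the sketch below. It uses element names from the Dublin Core Metadata Element Set; the choice of which elements are required is an assumption made for this example (each organisation would decide its own minimum), and the dataset described is hypothetical.

```python
import json

# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# Assumption: a minimum set chosen for this example, not a mandated list.
REQUIRED = ["title", "description", "publisher", "date", "format", "rights"]

# Hypothetical dataset description.
metadata = {
    "title": "Library visits by site",
    "description": "Annual visit counts for each library site.",
    "publisher": "Example Council",
    "date": "2016-08-30",
    "format": "text/csv",
    "rights": "Open Government Licence v3.0",
}


def validate(record):
    """Return (unknown element names, missing required elements)."""
    unknown = sorted(set(record) - set(DUBLIN_CORE_ELEMENTS))
    missing = [name for name in REQUIRED if not record.get(name)]
    return unknown, missing


unknown, missing = validate(metadata)
metadata_json = json.dumps(metadata, indent=2)
print(unknown, missing)
print(metadata_json)
```

Saving the JSON text next to the data file (for example with the same file name and a different extension) keeps the two together when the dataset is downloaded.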

What standard should be used

Suggested progression

Over time it is expected that all public authorities will progress towards the Data Catalog Vocabulary (DCAT) standard. DCAT will be used to describe all public data in Europe. Its use will make public data searchable across borders and sectors, enabling discovery by automated systems including aggregators and search engines.

Progressing towards DCAT is an ambitious aim. DCAT is a high standard and captures much more than your organisation may have considered. It is recommended that you build your metadata catalogue slowly, embedding it within processes and ensuring that the metadata is recorded consistently. The use of intermediate standards, such as Dublin Core, is recommended. Dublin Core provides a robust metadata standard which can then be built upon as your organisation progresses towards DCAT.
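To show how Dublin Core metadata can be built upon, the sketch below maps a simple Dublin Core style record onto a DCAT dataset description expressed as JSON-LD. The record, the download address and the function name are hypothetical, and the mapping is deliberately minimal: a full DCAT description would normally carry more properties than are shown here.

```python
import json

# Hypothetical Dublin Core style record for one dataset.
record = {
    "title": "Library visits by site",
    "description": "Annual visit counts for each library site.",
    "publisher": "Example Council",
    "date": "2016-08-30",
    "format": "text/csv",
    "rights": "http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/",
    "download_url": "https://example.org/data/library-visits.csv",  # hypothetical
}


def to_dcat(rec):
    """Map a simple record onto a minimal DCAT dataset in JSON-LD."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
            "foaf": "http://xmlns.com/foaf/0.1/",
        },
        "@type": "dcat:Dataset",
        "dct:title": rec["title"],
        "dct:description": rec["description"],
        "dct:issued": rec["date"],
        "dct:publisher": {"@type": "foaf:Agent", "foaf:name": rec["publisher"]},
        # Each downloadable file is described as a distribution.
        "dcat:distribution": [{
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": rec["download_url"]},
            "dcat:mediaType": rec["format"],
            "dct:license": {"@id": rec["rights"]},
        }],
    }


dataset = to_dcat(record)
print(json.dumps(dataset, indent=2))
```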

Current metadata themes

Feedback has indicated that public authorities are capturing limited metadata and that it is often being captured in an ad-hoc manner. In order to build capability and increase metadata maturity, you should decide within your organisation what metadata should be captured and begin to record it in a consistent manner for all data.

It is useful to consider embedding metadata collection into your data governance processes from the outset. Increased open data publication should be a long term organisational goal and there is a benefit in adopting consistent metadata standards early to facilitate further re-use and demand for publication.

As sharing data and making it open becomes the norm, the adoption of metadata use is also expected to grow.

Marine Scotland: Marine Portal Metadata Collection

Marine Scotland have now mandated the adoption of the MEDIN Discovery Metadata Standard. This means that any dataset to be published on the Marine Scotland portal must meet this standard prior to publication. MEDIN satisfies and exceeds the suggested Dublin Core standard.

While Marine Scotland acknowledge that this requirement can occasionally slow down dataset publication, they are confident that this approach offers significant benefits. In particular, it ensures that consistent metadata collection will become a routine business practice.

You can read more about the ongoing work in Marine Scotland in Annex B.

Useful Reading

DCMI Metadata Basics

Dublin Core Elements

NISO Understanding Metadata

The ODI - Marking up your dataset with DCAT

W3C Best Practice - Metadata

Step 3: Apply an open licence

A key requirement of making data open is applying an open licence. Prior to publication, all datasets should have an open licence. The following applies to data and information for which you hold the intellectual property rights. If the data you wish to release includes third party IP rights, please read the Licensing and Third Party Rights section.

Why is licensing important?

Licensing data is essential to provide potential users with clarity and certainty. When you create something, original works or photographs for example, you automatically obtain rights over the work and can determine how the work is used. Applying a licence to your work explicitly tells users what they can and cannot do with it.

Applying an open licence to your content or data should allow people and organisations to re-use, modify and share content in any way. It should allow others to use the data for commercial purposes. It is generally accepted that only two restrictions may be attached to an open licence:

  • attribution - users must acknowledge the source of the data
  • share-alike - users must publish any derived data under the same licence

Open licences can have no restrictions (public domain - all rights waived), attribution only, or attribution and share-alike.

How to select a licence

The chosen licence should support your organisation's open data strategy. You need to think about what you want to achieve by releasing your data. Requiring attribution will normally help promote your open data initiative as users have to link back to your original work. Share-alike restrictions will require users of the data to publish their work openly. This may deter commercial businesses and people who want to make a profit from their use of the data, resulting in reduced innovation and use.

Whilst it is possible to create your own unique licence, it is advisable to use a standard re-usable licence. Standard licences provide greater recognition amongst users, increased interoperability due to the use of standard terms, and increased user compliance.

There are two instances when you cannot choose your own licence:

  • Crown Bodies - if your organisation is a Crown Body, which covers most government departments and arms-length bodies, then any information you have gathered or created is owned by the Crown. This information must be published under the Open Government Licence.
  • Derived data - if you are publishing data that has been derived from data published under a share-alike licence, you must publish that data under the same licence as the original data.
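The two rules above can be sketched as a small decision helper. This is an illustration of the logic only, not legal guidance: the function name and licence labels are hypothetical, the set of share-alike licences is limited to the two mentioned in this section, and raising an error when both rules apply at once is an assumption made for the example, since the text does not say which rule would take precedence.

```python
# Share-alike licences named in this section; not an exhaustive list.
SHARE_ALIKE = {"CC-by-sa", "ODbL"}
OGL = "OGL 3.0"


def required_licence(is_crown_body, source_licence=None):
    """Return the licence that must be used, or None if you are free to choose.

    source_licence is the licence of any data your dataset was derived from.
    """
    derived_share_alike = source_licence in SHARE_ALIKE
    if is_crown_body and derived_share_alike:
        # Assumption: the guidance does not cover this case, so flag it.
        raise ValueError("Both rules apply; seek advice from your legal department.")
    if is_crown_body:
        return OGL
    if derived_share_alike:
        return source_licence
    return None


print(required_licence(True))
print(required_licence(False, "ODbL"))
print(required_licence(False, "CC-by"))
```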

Open Government Licence 3.0

The Open Government Licence 3.0 (OGL) allows anyone to publish, distribute, transmit and adapt the licensed work, and to exploit it both commercially and non-commercially. The user must acknowledge the source of the work and where possible provide a link to the OGL. The OGL was developed to be used by public sector bodies.

There can be no charge for data licensed under the OGL. The OGL is compatible with the latest versions of the Creative Commons Attribution Licence (CC-by) and the Open Data Commons Attribution Licence (ODC-by).

Other popular licences

Open Knowledge provides an extensive list of the licences which conform to the open definition. The most popular are:

| Level of Licence | Creative Commons Licence | Open Data Commons Licence |
|---|---|---|
| Public domain (all rights waived) | CC0 | PDDL |
| Attribution | CC-by | ODC-by |
| Attribution & share-alike | CC-by-sa | ODbL |

How to write your attribution statement

If your licence requires attribution, you must state how your work should be attributed. As your work may be combined with the work of others who also require attribution, you should keep the statement to a minimum. For example, give only your organisation's link to the data that is covered by the licence and a link to the licence.

You can also prescribe how the attribution should be presented (size, location, format etc.). You will need to consider the users of the data and make sure any requirements are not too onerous.

Successfully apply your licence

You must signpost users to your licence by using both human-readable and machine-readable descriptions. Your descriptions should be displayed prominently with your data so that users know they can use the data you are licensing.

The common standard licences (OGL, CC and ODC) all provide machine and human readable descriptions and logos that you should use.
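One common way to make a licence statement readable by both people and machines on a web page is a link carrying the standard HTML rel="license" attribute. The sketch below builds such a line for the OGL. The publisher name, dataset address and function name are hypothetical, and the wording of the sentence is only an example; check the licence's own guidance for any recommended attribution text.

```python
from html import escape

# Address of the Open Government Licence v3.0.
OGL_URL = "http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/"


def licence_notice(publisher, dataset_url, licence_name, licence_url):
    """Build an HTML attribution line with a machine-readable licence link."""
    return (
        '<p>Data published by <a href="{data}">{pub}</a> under the '
        '<a rel="license" href="{url}">{name}</a>.</p>'
    ).format(
        data=escape(dataset_url),
        pub=escape(publisher),
        url=escape(licence_url),
        name=escape(licence_name),
    )


notice = licence_notice(
    "Example Council",                          # hypothetical publisher
    "https://example.org/data/library-visits",  # hypothetical dataset page
    "Open Government Licence v3.0",
    OGL_URL,
)
print(notice)
```

A person sees an ordinary sentence with two links, while software can find the licence by looking for the rel="license" link.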

Licensing and Third Party Rights

You can only apply a licence to data which:

  • you own the copyright and/or database right for; or
  • the owner has given permission to be licensed.

If you do not own the intellectual property or do not have the owner's authority, you cannot release the data openly.

Public sector organisations engage with and contract with many third parties in the course of their daily activities. Many of those contracts will grant third party rights. It would be an inefficient and costly use of public resources if all of those contracts were to be renegotiated. In the future we want to limit the existence of third party rights in public data. We expect that, when future contracts are negotiated and put out to tender, it will be made explicit in the contract that any data resulting from the contract will be subject to open data principles and may be released for free to the public for onward use.

If you require further guidance on any matters relating to third party rights you should speak to your organisation's legal department.

Useful Reading

Step 4: Review datasets

Every dataset will vary in completeness and quality. Before releasing the data you should strive to ensure it is as complete, accurate and up to date as possible. However, there is no such thing as "perfect" data. Scotland's Open Data Strategy emphasises both quality and quantity of data.

Imperfections should not deter you from releasing data. When you publish your datasets, be explicit about any limitations and add caveats which will help any re-user understand the limitations of the data. The clearer you are about the limitations, the more usable your data will be as re-users will have a greater understanding about what the data represents. Additionally, re-users will provide feedback on data quality and mistakes, which will help improve the quality of your data.
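A simple completeness check can help you write those caveats. The sketch below counts empty values per column in a CSV document and turns the counts into plain-language statements that could be published alongside the data. The sample data, column names and function names are hypothetical, and completeness is only one aspect of quality.

```python
import csv
import io

# Hypothetical sample with two gaps; column names are illustrative only.
SAMPLE_CSV = """site,postcode,visits
Library A,AB1 2CD,1200
Library B,,950
Library C,IJ5 6KL,
"""


def missing_counts(csv_text):
    """Return (number of rows, count of empty values per column)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in reader.fieldnames:
            # Treat absent, empty and whitespace-only values as missing.
            if not (row[name] or "").strip():
                counts[name] += 1
    return total, counts


def caveats(total, counts):
    """Turn the counts into caveat statements for publication."""
    return [
        "Missing '{}' in {} of {} rows.".format(name, n, total)
        for name, n in counts.items()
        if n
    ]


total, counts = missing_counts(SAMPLE_CSV)
for line in caveats(total, counts):
    print(line)
```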

ODI certificates are a great tool to assess how open your dataset is. The tool also provides tips and information about how to improve the openness of your dataset.


Email: Stuart Law,

Kyle Malcolm,