
Some D-I-K-W about open data



data is the new oil – Clive Humby

data is the new soil – David McCandless

We live in the information society, or the knowledge society; the two terms are often used interchangeably. Both information and knowledge, however, are the refined result of processing raw data, the real oil (and soil) of our modern society. Information and Communication Technologies, and the users interacting with them, generate huge amounts of data that can be captured, stored, processed, analyzed and visualized to extract information and knowledge useful for a wide range of purposes. Most of these data, though, are managed by large corporations and public administrations and are not truly accessible to citizens. The Open Data movement tries to establish the basis for creating and sharing data that can be of interest to citizens, taking into account technological, legal, ethical and other deeply interconnected aspects. We would like to discuss the origins of the Open Data movement and its pioneers, some basic definitions of “open” and of “data”, and the most important of the aspects mentioned above. We also want to introduce the concept of Big Data, one of the biggest hypes nowadays, as the result of multiplying several factors (space, time, number of elements, …) to create data sets so large that they are beyond comprehension.


WHAT is Open Data?

  • What is OPEN?
  • What is DATA?
    • Example: 42
    • Data – Information – Knowledge – Wisdom: the D-I-K-W pyramid
    • Structured:
      • Flat:
        • Sequences (1D)
        • Tables (2D, 3D, …)
        • Images (2D/3D x 1..N channels)
      • Hierarchical:
      • Relations: RDF
    • Semi-structured:
      • Text documents: characters, words, lines, paragraphs, pages, chapters, …
      • Web pages: HTML
  • What is BIG (Open) Data?
    • The result of multiplying three factors (the 3 V’s):
      • Volume: how many samples?
      • Variety: how many variables?
      • Velocity: how many changes?
    • Where, and by whom, big data sets are generated:
      • Citizens in social networks: Facebook, Twitter, …
      • Citizens in real life: debit/credit cards, telecommunications companies, transport, …
      • Administration gathering data from users
      • Sensor networks: temperature, traffic, pollution, …
    • Examples:
      • Wal-mart: 8500 stores, thousands of goods, 10^8 consumers / week
      • CERN’s Large Hadron Collider (LHC): 25 petabytes / year
      • Google: 1.17 * 10^9 users, 1.29 * 10^10 searches / month, 24 petabytes / day
  • What is LINKED (Open) Data?
    • Based on URIs + HTTP + RDF
    • Basic idea (Sir Tim Berners-Lee’s TED talk):
      • Everything is accessed through URIs
      • Everything is described so each element (part of everything) can be “understood” 
      • Everything is composed of elements and their relationships
    • Tools:
      • SPARQL (SPARQL Protocol and RDF Query Language)
      • Yahoo! Query Language (YQL)
    • Examples:
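The basic idea above (elements identified by URIs and described through their relationships) can be sketched with plain Python tuples standing in for RDF triples. This is only an illustration: every URI under example.org, the property names and the helper functions `match` and `datasets_published_by` are hypothetical, and the wildcard matching merely imitates the flavor of a SPARQL basic pattern rather than implementing SPARQL.

```python
# Hypothetical namespace: none of these URIs point to real data sets.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple, as in RDF.
TRIPLES = [
    (EX + "dataset/air-quality", EX + "prop/publishedBy", EX + "org/city-council"),
    (EX + "dataset/air-quality", EX + "prop/format", "CSV"),
    (EX + "dataset/traffic", EX + "prop/publishedBy", EX + "org/city-council"),
    (EX + "org/city-council", EX + "prop/locatedIn", EX + "place/some-city"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def datasets_published_by(triples, org):
    """List the subjects linked to `org` through the publishedBy property."""
    found = match(triples, p=EX + "prop/publishedBy", o=org)
    return sorted(t[0] for t in found)


if __name__ == "__main__":
    for uri in datasets_published_by(TRIPLES, EX + "org/city-council"):
        print(uri)
```

Because every element is a URI, the result of one query (an organization, say) can be used directly as the subject of the next one, which is what lets separate data sets be linked together.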

WHY Open Data?

  • Because…
    • …it belongs to everybody
    • …in most cases it has been paid for with public money
    • …it generates economic value
    • …it creates better citizens
    • …it promotes transparency (governments, science, corporations, …)

WHO is promoting/using/producing Open Data?

WHERE is Open Data used?

  • Educational Data Mining / Learning Analytics
  • Social apps:

HOW to use/produce Open Data?

  • Technological aspects:
    • Open formats
      • Manipulable
      • No proprietary software needed to use/edit it
    • Data for humans and machines: the 5-star model
      • * PDF, TIFF, …
      • ** XLS, SPSS, …
      • *** CSV (flat), JSON (hierarchical)
      • **** RDF (e.g. serialized as XML), using URIs to identify things
      • ***** RDF using URIs and linked to other data sets (Linked Open Data)
    • Data Life Cycle:
      • Capture:
        • Goal: to obtain the desired data
          • Static data (files)
          • Dynamic data (web APIs):
            • Use of web services for accessing data
            • Well-formed, validated (and authenticated) queries
            • Some limits might apply (number of queries, number of results, …)
          • Server Log files
          • Web scraping
          • Crowdsourcing
        • Tools:
          • Use existing APIs: Intel Mashery, …
          • Scrapy
          • Tabula
          • HTML / URL manual inspection + scripting
          • Forms
      • (Pre)Process:
        • Goal: prepare data for its manipulation
          • Joining several sources
          • Aggregating / summarizing data
          • Selection of samples (filtering)
          • Transformation of variables (e.g. units)
          • Computing new variables
        • Tools:
      • Analyze:
        • Goal: extract information and knowledge from (pre-)processed data
          • Pattern detection
          • Modeling:
          • Interpretation:
            • Classification / prediction / regression
            • Variable importance
            • Characterization
        • Tools:
          • OpenOffice
          • R
          • Gephi
          • Tableau
          • Online tools:
            • SOCR (UCLA)
            • StatPages
      • Visualize:
      • Publish:
  • Legal aspects:
  • Other important aspects:
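As a rough illustration of the Data Life Cycle steps listed under the technological aspects (capture, pre-process, analyze), here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example: the readings in `RAW_CSV` are made up, the column names and the 15 °C threshold are arbitrary, and the functions `capture`, `preprocess` and `aggregate` are hypothetical names, not part of any tool mentioned above.

```python
import csv
import io
import statistics

# Made-up readings standing in for a static open data file (3-star CSV).
RAW_CSV = """station,date,temp_f
A,2013-01-01,50.0
A,2013-01-02,59.0
B,2013-01-01,41.0
B,2013-01-02,
B,2013-01-03,77.0
"""


def capture(text):
    """Capture: parse the CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def preprocess(rows):
    """Pre-process: filter samples, transform units, compute a new variable."""
    clean = []
    for row in rows:
        if not row["temp_f"]:  # selection of samples: drop missing values
            continue
        temp_c = (float(row["temp_f"]) - 32) * 5 / 9  # Fahrenheit to Celsius
        clean.append({
            "station": row["station"],
            "date": row["date"],
            "temp_c": temp_c,
            "is_warm": temp_c >= 15.0,  # new variable (arbitrary threshold)
        })
    return clean


def aggregate(rows):
    """Analyze: summarize the mean temperature per station."""
    by_station = {}
    for row in rows:
        by_station.setdefault(row["station"], []).append(row["temp_c"])
    return {name: statistics.mean(temps) for name, temps in by_station.items()}


if __name__ == "__main__":
    print(aggregate(preprocess(capture(RAW_CSV))))
```

Keeping each step as a separate function mirrors the life cycle itself: the capture step could be swapped for a web API call or a scraper without touching the pre-processing or the analysis.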

WANT to learn more?


Feel free to make any comment, idea or suggestion and I’ll try to incorporate it into this open data summary!
