msc-computer-science-notes

MSc Big Data Analytics Lesson Notes

This module holds the lesson notes for MSc Big Data Analytics.

Overview

WEEK 1

Main Topics

After completing this week you should be able to:

Sub titles:

Reasons to ask question of data

Research Questions

Data science

Data Gathering

History of Data analysis

Where Is Data Science Used?

Sales and marketing

Governments using data science

Data science in professional sports

Myths about Data Science

Possible uses of data science

Five projects that are harnessing big data for good

1. Finding humanitarian hot spots

2. Improving fire safety in homes

3. Mapping police violence in the US

4. Optimising waste management

5. Identifying hotbeds of street harassment

Data science can help us fight human trafficking

Big data’s arrival in sport is changing the rules of the game

Explainer: what is big data?

Big data: The next frontier for innovation, competition, and productivity

Describing a reasonable data science process

What are data and what is a data set?

The CRISP-DM Process

Describe the way data is structured in analytics

Input: Concepts, instance, attributes

Concept

Example (aka Instances)

Relations

Attribute

Statistics

Intro

Data basics

Types of variables

Relationships between variables

Explanatory and response variables

Observational studies and Experiments

Examining numerical data

Data Mining and Ethics

Smartphone data tracking is more than creepy

With smart cities, your every step will be recorded

TODO:

* 1.8.1 Activity -> Diez 1.1-1.8 do it.
* 1.8.3 Activity -> Diez 2.1-2.1.8 do it.
* 1.8.4 Activity 
* 1.9.1 Discussion
* Week 1: Activity Forum: https://onlinestudy.york.ac.uk/courses/577/discussion_topics/19736?module_item_id=48187

WEEK 2

Main Topics

Sub titles:

Key concepts in machine learning

FIELDED APPLICATIONS

MACHINE LEARNING AND STATISTICS

Machine Learning 101

Supervised versus Unsupervised Learning

Standard Data Science Tasks

Clustering (Who Are Our Customers?)

Anomaly Detection or outlier analysis (Is This Fraud?)

Association-Rule Mining (Do You Want Fries with That?)

Classification (Churn or No Churn, That Is the Question)

Regression (How Much Will It Cost?)

Linear regression
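
As a rough illustration of this topic, the sketch below fits a straight line to paired observations by ordinary least squares. The function name `fit_line` and the sample numbers are assumptions made for this example, not material from the course.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

With real, noisy data the fitted line will not pass through every point; the slope then describes the average change in the response per unit change in the explanatory variable.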

Correlations

Correlations & Regression

Knowledge presentation

Decision trees

Evaluating learned models

Predict Performance

Cross Validation
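
A minimal sketch of how k-fold cross-validation partitions a data set, assuming the instances are addressed by index and are already in random order (no shuffling is done here). The helper name `k_fold_indices` is hypothetical.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) once for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Each instance appears in exactly one test fold across the k rounds.
folds = list(k_fold_indices(10, 3))
```

A model would be trained on each `train` split and scored on the matching `test` split, and the k scores averaged to estimate performance on unseen data.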

Evaluating Numeric Prediction

Confusion matrices and accuracy scores
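
The following sketch shows one way to tabulate a confusion matrix and derive accuracy from it. The function names and the "churn"/"stay" labels are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def confusion_matrix(actual: Sequence[str], predicted: Sequence[str]) -> Dict[Tuple[str, str], int]:
    """Count how often each (actual, predicted) label pair occurs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return dict(Counter(zip(actual, predicted)))


def accuracy(actual: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of instances whose predicted label equals the actual label."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels for five customers.
actual = ["churn", "churn", "stay", "stay", "stay"]
predicted = ["churn", "stay", "stay", "stay", "churn"]
matrix = confusion_matrix(actual, predicted)
score = accuracy(actual, predicted)
```

The diagonal entries, where the actual and predicted labels agree, are the correct predictions; accuracy is their total divided by the number of instances.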

Receiver Operating Characteristic (ROC) Curve

TODO:

WEEK 3

Main Topics

After completing this week you should be able to:

Sub titles:

Evaluating the suitability of given data for use in analysis

ARFF Format (attribute-relation file format)

@attribute bag relational
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {true, false}
@end bag

Sparse Data

Attribute Types

Missing Values

Inaccurate values

Unbalanced Data

Getting to know your data

Cleaning Data

Improving Decision Tree

Robust regression

Detecting Anomalies

One Class Learning

Outlier Detection

Generating Artificial Data

Attribute Selection

Scheme-Independent Selection

Searching the Attribute Space

Scheme-Specific Selection

Data Preparation and integration

Creating Analytics Base Table (ABT)

WEEK 4

Main Topics

After completing this week, you will have made significant steps towards achieving the following module learning outcomes:

Sub titles:

Alternative techniques for regression

Numeric Prediction: Linear Regression

Linear Classification: Logistic Regression

Linear Classification using the perceptron

Linear Classification using Winnow

Extending Linear Models

The Maximum Margin Hyperplane

Nonlinear Class Boundaries

Support Vector Regression

Alternative Techniques for classification

Naive Bayes for Document classification

Choosing a learning technique for a given analysis

The Model Selection and Tuning Problem

Model Selection and Management Systems (MSMS)

Model Selection Triple

Three-Phase Iteration

Technique comparisons

Comparing Data mining schemes

TODO:

WEEK 5

Main Topics

After completing this week you should be able to:

After completing this week, you will have made significant steps towards achieving the following module learning outcomes:

Sub titles:

Why databases?

Intro

Elements of Database System

Database User

Database language

Advantages of Database Systems and Database Management

Data independence

Database Modeling

Managing Structured, Semi-Structured, and Unstructured Data

Managing Data Redundancy

Specifying Integrity Rules

Concurrency Control (ACID)

Backup and Recovery Facilities

Data Security

Performance Utilities

Data and quality management

Data Management

Data Quality

Data Governance

Roles in Data Management

Modelling data and the Relational Database model

Phases of Database Design

The Entity Relationship Model

Entity Type

Attribute Type

Basic concepts

Relationship Types

Relational Database Keys

Database Normalization

Questions

WEEK 6

Main Topics

After completing this week you should be able to:

After completing this week, you will have made significant steps towards achieving the following module learning outcomes:

Sub titles:

Constructing and querying a simple database

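-- Identifiers are quoted with backticks (MySQL style); string values use single quotes.
CREATE TABLE `schooldata`.`STUDENT` (
  `studentId` INT NOT NULL,
  `firstName` VARCHAR(45) NULL,
  `lastName` VARCHAR(45) NULL,
  `Gender` VARCHAR(1) NULL,
  `Dob` VARCHAR(20) NULL,
  `auth` VARCHAR(45) NULL,
  PRIMARY KEY (`studentId`)
);

-- Dob is declared as VARCHAR above, so the date is stored as text.
INSERT INTO STUDENT
VALUES (1, 'SAM', 'Williams', 'm', '1979/5/9', 'Herts');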
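
The statements above create and populate the table but stop short of querying it. The sketch below rebuilds the same STUDENT table in an in-memory SQLite database using Python's standard `sqlite3` module and runs a simple SELECT. SQLite is an assumption made so the example is self-contained (the notes do not say which database engine is used), and the helper names `build_student_db` and `students_by_auth` are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def build_student_db() -> sqlite3.Connection:
    """Create an in-memory database holding the STUDENT table with one row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE STUDENT (
            studentId INTEGER NOT NULL PRIMARY KEY,
            firstName VARCHAR(45),
            lastName  VARCHAR(45),
            Gender    VARCHAR(1),
            Dob       VARCHAR(20),
            auth      VARCHAR(45)
        )
        """
    )
    # Placeholders (?) keep the values separate from the SQL text.
    conn.execute(
        "INSERT INTO STUDENT VALUES (?, ?, ?, ?, ?, ?)",
        (1, "SAM", "Williams", "m", "1979/5/9", "Herts"),
    )
    conn.commit()
    return conn


def students_by_auth(conn: sqlite3.Connection, auth: str) -> List[Tuple[str, str]]:
    """Return (firstName, lastName) for every student with the given auth value."""
    cursor = conn.execute(
        "SELECT firstName, lastName FROM STUDENT WHERE auth = ? ORDER BY studentId",
        (auth,),
    )
    return cursor.fetchall()


conn = build_student_db()
rows = students_by_auth(conn, "Herts")
```

The same SELECT statement should work largely unchanged on other relational engines; only the connection setup would differ.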

Next generation of databases

Big Data and the problems for analytics

The 5 Vs of Big Data

Hadoop and distributed computation

Hadoop components

Hadoop Distributed File System (HDFS)

Hadoop MapReduce
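
To make the MapReduce idea concrete, here is a single-process word-count sketch in plain Python that mimics the map, shuffle, and reduce phases. It does not use the Hadoop API; the function names are assumptions for illustration, and a real Hadoop job would distribute these phases across cluster nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big ideas"])
```

Because each map call sees only one line and each reduce call sees only one key, both phases can run in parallel on separate machines, which is the property that lets the model scale.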

Hadoop YARN (Yet Another Resource Negotiator)

When to use Hadoop (and when not to!)

Different types of databases

Big Data Architectures – NoSQL Use Cases for Key Value Databases

How Can Graph Analytics Uncover Valuable Insights About Data?

Graph Databases for Beginners: ACID vs. BASE Explained

ACID vs BASE consistency models

WEEK 7

Main Topics

Sub titles: