MSC Big Data Analytics Lesson Notes
This module contains the MSc Big Data Analytics lesson notes.
Overview
WEEK 1
Main Topics
After completing this Week you should be able to:
- Explaining reasons why you might want to ask questions of data
- Explaining the way data is structured in analytics
- Describing a reasonable data science process, and explaining why each step is important
- Applying some basic descriptive statistics
- Explaining why validity and privacy are important concerns in big data analytics
Sub titles:
Reasons to ask questions of data
Research Questions
- A research question describes something about the world that you don’t know and that you intend to find out.
- For example, what is the population of York?
- Criteria for research questions:
- Criterion 1: It needs to be in the form of a question.
- Criterion 2: It needs to be a complete question
- For example : are videos a better way to learn? -> Are videos a better way to learn than text? => That’s
better, it’s a bit more specific
- Criterion 3: It needs to be narrow enough to answer using data we can collect, and the tools we have available.
- For example: “Are beans healthy?” -> “Is there a correlation - a statistical term - between average bean consumption - a particular measure which will need some operationalising, but it’s a measure you could take - and self-reported sick days?”
- Likewise, it’s not trivial to measure, although the fact that it’s self-reported makes it a bit easier (it also makes it less reliable, but that’s by the by here). Now you have something where you can really start to think about measurement and statistical analysis.
- Finding RQ’s is a craft skill:
- What questions matter to you?
- What data do you have? or extension, “what data can you easily get?”
- Write down the questions that come up in your head.
- Always iterate
Data science
Data Gathering
- Transactional or operational data : Transactional data include event information such as the sale of an item, the
issuing of an invoice, the delivery of goods, credit card payment, insurance claims, and so on.
- Non-transactional data, such as demographic data, also have a long history.
- In 1970, Edgar F. Codd published a paper on the relational data model, which gave rise to relational databases.
- Relational data model: revolutionary in terms of setting out how data were (at the time) stored, indexed, and retrieved from databases. The relational data model enabled users to extract data from a database using simple queries that defined what data the user wanted, without requiring the user to worry about the underlying structure of the data or where they were physically stored.
- Relational databases store data in tables with a structure of one row per instance and one column per attribute.
This structure is ideal for storing data because it can be decomposed into natural attributes.
- The development of Structured Query Language (SQL) provided a standard way to write such queries.
- Data warehouses: the business challenge of bringing together and reconciling data from disparate databases led to the development of data warehouses, which facilitated more complex analytical operations on the combined data.
- Metadata: describe the structure and properties of the raw data.
- NoSQL databases :
- A NoSQL database stores data as objects with attributes, using an object notation language such as the JavaScript
Object Notation (JSON).
- The advantage of using an object representation of data (in contrast to a relational table-based model) is that
the set of attributes for each object is encapsulated within the object, which results in a flexible
representation.
- MapReduce is a framework on Hadoop. In the MapReduce framework, the data and queries are mapped onto (or distributed across) multiple servers, and the partial results calculated on each server are then reduced (merged) together (a small sketch appears at the end of this section).
- Big Data is often defined in terms of the three Vs:
- Volume: the extreme volume of data
- Variety: the variety of the data types
- Velocity: the velocity at which the data must be processed
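To make the MapReduce idea above concrete, here is a minimal single-machine sketch in Python of a word-count job. The function names and the tiny document list are illustrative assumptions; a real Hadoop job would distribute the map and reduce work across many servers rather than running it in one process.

```python
from collections import defaultdict

def map_step(document):
    # Map: emit (key, value) pairs; here, (word, 1) for every word in a document.
    return [(word, 1) for word in document.split()]

def reduce_step(key, values):
    # Reduce: merge all the partial values emitted for the same key.
    return key, sum(values)

documents = ["big data big analytics", "big data velocity"]

# "Map" phase: on a cluster, these calls would run on different servers.
mapped = [pair for doc in documents for pair in map_step(doc)]

# Shuffle: group the intermediate values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# "Reduce" phase: merge the partial results for each key.
counts = dict(reduce_step(k, v) for k, v in grouped.items())
print(counts)  # {'big': 3, 'data': 2, 'analytics': 1, 'velocity': 1}
```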
History of Data analysis
- When are rigorous processes likely to be useful?
- When we have a large volume of data examples
- When we are trying to find patterns over many attributes of entities
- What are the most important reasons computer science skills are valuable for a data scientist?
- Understanding and developing machine learning algorithms
- Using high-performance computing techniques for data processing
- Why have increases in networked app use stimulated data science?
- The volume and variety of available data have increased
Where Is Data Science Used?
- Data science drives decision making in nearly all parts of modern societies.
- Cases that show the impact of data science:
- consumer companies using data science for sales and marketing;
- governments using data science to improve health, criminal justice, and urban planning;
- professional sporting franchises using data science in player recruitment.
Sales and marketing
- By tracking customers’ past activities and behavior, analyzing social media trends, and analyzing credit card activity, companies recommend products to customers.
- The equivalent of up-selling and cross-selling in the online world is the “recommender system.”
- Chris Anderson’s book The Long Tail (2008) argues that as production and distribution get less expensive, markets shift from selling large amounts of a small number of hit items to selling smaller amounts of a larger number of niche items.
Governments using data science
- Governments have recognized the advantages of adopting data science and use it in areas ranging from health and targeted drug development to tracking, analyzing, and controlling environmental, energy, and transport systems, and to informing long-term urban planning.
- Data science is also being used to predict crime hot spots and recidivism.
Data science in professional sports
- The movie Moneyball tells the true story of how the Oakland A’s baseball team used data science to improve its player recruitment.
- The Moneyball story is a very clear example of how data science can give an organization an advantage in a competitive
market space. However, from a pure data science perspective, perhaps the most important aspect of the Moneyball story
is that it highlights that sometimes the primary value of data science is the identification of informative
attributes.
Myths about Data Science
- One of the biggest myths is the belief that data science is an autonomous process that we can let loose on our data to find the answers to our problems.
- In reality, data science requires skilled human oversight throughout the different stages of the process. Human analysts are needed to frame the problem, to design and prepare the data, to select which ML algorithms are most
appropriate, to critically interpret the results of the analysis, and to plan the appropriate action to take based
on the insight(s) the analysis has revealed.
- The second big myth of data science is that every data science project needs big data and needs to use deep learning.
- In general, having more data helps, but having the right data is the more important requirement.
- A third data science myth is that modern data science software is easy to use, and so data science is easy to do.
- The danger with data science is that people can be intimidated by the technology and believe whatever results the
software presents to them. They may, however, have unwittingly framed the problem in the wrong way, entered the
wrong data, or used analysis techniques with inappropriate assumptions.
- Doing data science properly requires both appropriate domain knowledge and expertise regarding the properties of the data and the assumptions underpinning the different ML algorithms.
- The last myth about data science we want to mention here is the belief that data science pays for itself quickly.
- Depends on the context of the organization.
- Data science will not give positive results on every project. Sometimes there is no hidden gem of insight in the data, and sometimes the organization is not in a position to act on the insight the analysis has revealed.
Possible uses of data science
Five projects that are harnessing big data for good
1. Finding humanitarian hot spots
- This project was carried out with the Australian Red Cross to figure out where the humanitarian hot spots are in Victoria.
- They use social media data to map everyday humanitarian activity to specific locations and found that the hot spots of
volunteering and charity activity are located in and around Melbourne CBD and the eastern suburbs
- These kinds of insights can help local aid organisations channel volunteering activity in times of acute need.
2. Improving fire safety in homes
- In the United States, Enigma Labs built open data tools to model and map risk at the level of individual
neighbourhoods.
- Their model combines national census data with a geocoder tool (TIGER), as well as analytics based on local fire
incident data, to provide a risk score.
3. Mapping police violence in the US
- The Mapping Police Violence project in the US monitors, makes sense of, and visualises police violence.
- It draws on three crowdsourced databases, but also fills in the gaps using a mix of social media, obituaries, criminal
records databases, police reports and other sources of information
4. Optimising waste management
- These smart bins have solar-powered trash compactors that regularly compress the garbage inside throughout the day.
- This eliminates waste overflow and reduces unnecessary carbon emissions, with an 80% reduction in waste collection.
5. Identifying hotbeds of street harassment
- A group of four women – and many volunteer supporters – in Egypt developed HarassMap to engage with, and inform, the
community in an effort to reduce sexual harassment.
- The platform they built uses anonymised, crowdsourced data to map harassment incidents that occur in the street in
order to alert its users of potentially unsafe areas.
Data science can help us fight human trafficking
- Millions of individuals worldwide are trapped in some form of modern-day slavery.
- Human trafficking occurs in every country in the world; it is one of the largest sources of profit for global organized crime, second only to illicit drugs.
- There are several problems to solve:
- Finding people at risk: data are collected from governments and organizations to understand who is at risk.
- Victim identification and location: Trafficking networks are dynamic. Traffickers are likely to frequently change
distribution and transportation routes to avoid detection, leaving law enforcement and analysts with incomplete
information as they attempt to identify and dismantle trafficking networks.
- Network disruption: Interrupting the flow of people, money and other components of trafficking is critical to
identifying trafficking networks, disrupting their infrastructure at the source and eliminating them.
- In operations research, scientists apply mathematical methods to answer complex questions about patterns in data and
predict future trends or behaviors.
Big data’s arrival in sport is changing the rules of the game
- Using data analysis in sport is complex – not just because of the sheer volume of it, but in finding ways to structure
and relay many highly dynamic pieces of information to a coach, manager or athlete in order to make quick strategic
decisions.
- Wearables can collect health data that can be used to increase the performance of athletes.
- Data can be used not only for athletes but for fans too.
- By understanding how fans engage with the sport or a team’s brand, decisions can be made about tailored sports
advertising or broadcast content
Explainer: what is big data?
- Big Data, as the name implies, relates to very large sets of data collected through free or commercial services on the
internet.
- It is impossible to remove/withdraw information from big data - information once added will persist indefinitely in
the cloud
- virtually any information that is stored electronically, including information within personal devices, offline data
storage, even information thought to be deleted, has the potential to be included in big data.
- Data mining is the process of analysing inter-data relationships – connecting the dots and finding hidden meanings and
relationships that can provide startling new insights.
- It may be argued that the capability to fully analyse internet-scale data will be key to nations in maintaining their
prosperity and perhaps even security. The future may indeed rest with those with the best big data technologies.
Big data: The next frontier for innovation, competition, and productivity
- The amount of data in our world has been exploding, and analyzing large data sets—so-called big data—will become a key
basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus.
- MGI studied big data in five domains—healthcare in the United States, the public sector in Europe, retail in the United States, and manufacturing and personal-location data globally.
- There are five ways big data can create value:
- Big data can unlock significant value by making information transparent and usable at much higher frequency.
- As organizations create and store more transactional data in digital form, they can collect more accurate and
detailed performance information on everything from product inventories to sick days, and therefore expose
variability and boost performance.
- Big data allows ever-narrower segmentation of customers and therefore much more precisely tailored products and
services.
- Sophisticated analytics can improve decision making.
- Big data can be used to improve the development of the next generation of products and services.
- The use of big data will become a key basis of competition growth for individual firms.
- The use of big data will underpin new waves of productivity growth and consumer surplus.
- Some sectors are set for greater gains.
- There will be a shortage of talent necessary for organizations to take advantage of big data
- Several issues will have to be addressed to capture the full potential of big data. Policies related to privacy,
security, intellectual property, and even liability will need to be addressed in a big data world.
Describing a reasonable data science process
What are data and what is a data set ?
- Data science is fundamentally dependent on data. In its most basic form, a datum or a piece of information is an abstraction of a real-world entity (person, object, or event).
- The terms variable, feature, and attribute are often used interchangeably to denote an individual abstraction
- Each entity is typically described by a number of attributes.
- A data set consists of the data relating to a collection of entities, with each entity described in terms of a set
of attributes.
- A data set is organized in an n × m data matrix called the analytics record, where n is the number of entities (rows) and m is the number of attributes (columns).
- The terms data set and analytics record are often used interchangeably, with the analytics record being a particular representation of a data set.
- The terms instance, example, entity, object, case, individual, and record are used in data science literature to refer
to a row.
- The construction of the analytics record is a prerequisite of doing data science.
- Understanding and recognizing different attribute types is a fundamental skill for a data scientist. The standard
types are numeric, nominal, and ordinal.
- Numeric attributes describe measurable quantities that are represented using integer or real values.
- Numeric attributes can be measured on either an interval scale or a ratio scale.
- Interval attributes are measured on a scale with a fixed but arbitrary interval and arbitrary origin
- e.g., date and time measurements.
- It is appropriate to apply ordering and subtraction operations to interval attributes, but other arithmetic operations (such as multiplication and division) are not appropriate.
- Ratio scales are similar to interval scales, but the scale of measurement possesses a true-zero origin.
- A value of zero indicates that none of the quantity is being measured.
- Celsius is an interval scale; Kelvin is a ratio scale.
- Nominal (also known as categorical) attributes take values from a finite set. These values are names (hence
“nominal”)
for categories, classes, or states of things.
- e.g., marital status (single, married, divorced) and beer type (ale, pale ale, pils, porter, stout, etc.).
- A binary attribute is a special case of a nominal attribute where the set of possible values is restricted to just two values; such attributes are also called Boolean.
- Nominal attributes cannot have ordering or arithmetic operations applied to them.
- Ordinal attributes are similar to nominal attributes, with the difference that it is possible to apply a rank
order over the categories of ordinal attributes.
- e.g., “strongly dislike, dislike, neutral, like, and strongly like.”
- there is no notion of equal distance between these values. As a result, it is not appropriate to apply arithmetic
operations (such as averaging) on ordinal attributes
- The data type of an attribute (numeric, ordinal, nominal) affects the methods we can use to analyze and understand the data, including both the basic statistics we can use to describe the distribution of values that an attribute takes and the more complex algorithms we use to identify the patterns of relationships between attributes (a small illustration appears at the end of this section).
- 2 characteristics of data science cannot be overemphasized:
- for data science to be successful, we need to pay a great deal of attention to how we create our data
- we also need to “sense check” the results of a data science process. The pattern may simply be based on the biases
in our data design and capture.
- Structured data are data that can be stored in a table, and every instance in the table has the same structure (
i.e., set of attributes).
- Structured data can be easily stored, organized, searched, reordered, and merged with other structured data.
- Unstructured data are data where each instance in the data set may have its own internal structure, and this
structure is not necessarily the same in every instance
- e.g., collections of human text (emails, tweets, text messages, posts, novels, etc.) can be considered unstructured data, as can collections of sound, image, music, video, and multimedia files.
- We can often extract structured data from unstructured data using techniques from artificial intelligence (such as natural-language processing and ML), digital signal processing, and computer vision. However, this process is expensive and time-consuming and can add significant financial overhead and time delays to a data science project.
- Data can also be derived from other pieces of data.
- e.g., the average salary in a company or the variance in the temperature of a room across a period of time.
- It is frequently the case that the real value of a data science project is the identification of one or more important
derived attributes that provide insight into a problem
- There are generally two terms for gathered raw data: captured data and exhaust data (Kitchin 2014a).
- Captured data are collected through a direct measurement or observation that is designed to gather the data.
- By contrast, exhaust data are a by-product of a process whose primary purpose is something other than data
capture.
- primary purpose of the Amazon website is to enable users to make purchases from the site. However, each
purchase generates volumes of exhaust data: what items the user put into her basket, how long she stayed on
the site, what other items she viewed, and so on.
- One of the most common types of exhaust data is metadata—that is, data that describe other data.
- DIKW pyramid: Eliot’s hierarchy mirrors the standard model of the structural relationships between wisdom, knowledge, information, and data, known as the DIKW pyramid.
- Data are created through abstractions or measurements taken from the world.
- Information is data that have been processed, structured, or contextualized so that it is meaningful to humans.
- Knowledge is information that has been interpreted and understood by a human so that she can act on it if
required.
- Wisdom is acting on knowledge in an appropriate way.
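To make the attribute types above (numeric, nominal, ordinal) concrete, here is a small hedged sketch using pandas. The column names and values are invented for illustration only.

```python
import pandas as pd

# A tiny made-up analytics record: one row per entity, one column per attribute.
df = pd.DataFrame({
    "temperature_c": [18.5, 21.0, 19.2],                    # numeric, interval scale
    "height_cm": [172, 165, 180],                            # numeric, ratio scale (true zero)
    "marital_status": ["single", "married", "single"],       # nominal
    "satisfaction": ["dislike", "like", "strongly like"],    # ordinal
})

# Nominal: unordered categories.
df["marital_status"] = pd.Categorical(df["marital_status"])

# Ordinal: categories with a rank order but no notion of distance between them.
df["satisfaction"] = pd.Categorical(
    df["satisfaction"],
    categories=["strongly dislike", "dislike", "neutral", "like", "strongly like"],
    ordered=True,
)

print(df.dtypes)
print(df["satisfaction"].min())  # ranking is meaningful for ordinal attributes
print(df["height_cm"].mean())    # arithmetic is meaningful for numeric attributes
# Averaging the ordinal "satisfaction" values would not be appropriate.
```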
The CRISP-DM Process
- Cross Industry Standard Process for Data Mining (CRISP-DM)
- Most companies follow this process.
- The main advantage of CRISP-DM is that it is designed to be independent of any software, vendor, or data-analysis
technique.
- CRISP-DM was originally developed by a consortium of organizations consisting of leading data science vendors, end
users, consultancy companies, and researchers.
- CRISP-DM lifecycle consists of 6 stages:
- Business understanding
- investigating the business objectives and requirements, deciding whether data mining can be applied to meet
them, and determining what kind of data can be collected to build a deployable model.
- Data understanding
- In the first two steps, the data scientist is trying to define the goals of the project by understanding the business needs and the data that the business has available to it.
- The process typically involves identifying a business problem and then exploring if the appropriate data are
available to develop a data-driven solution to the problem
- an initial dataset is established and studied to see whether it is suitable for further processing. If the
data quality is poor, it may be necessary to collect new data based on more stringent criteria.
- Data preparation
- The focus of the data-preparation stage is the creation of a data set that can be used for the data analysis
- This involves creating the data set by integrating data from different sources, also identifying outliers and
missing values and cleaning data.
- Preparation involves preprocessing the raw data so that machine learning algorithms can produce a
model—ideally, a structural description of the information that is implicit in the data.
- Modeling
- This is the stage where automatic algorithms are used to extract useful patterns from the data and to create models that encode these patterns; ML is used for this.
- Often the initial model test results will uncover problems in the data. This helps the data scientists to investigate the data errors first hand, revisit the data, and fix them.
- Preprocessing may include model building activities as well, because many preprocessing tools build an
internal model of the data to transform it. In fact, data preparation and modeling usually go hand in hand. It
is almost always necessary to iterate: results obtained during modeling provide new insights that affect the
choice of preprocessing techniques.
- Evaluation
- Do the structural descriptions inferred from the data have any predictive value, or do they simply reflect
spurious regularities?
- If the evaluation step shows that the model is poor, you may need to reconsider the entire project and return
to the business understanding step to identify more fruitful business objectives or avenues for data
collection. If, on the other hand, the model’s accuracy is sufficiently high, the next step is to deploy it in
practice.
- Deployment
- The last two stages (evaluation and deployment) focus on how the models fit the business and its processes.
- The evaluation phase involves assessing the models in the broader context defined by the business needs. This also helps the data scientists to validate the models against the business needs, improve them, and perform final checks.
- The deployment phase involves examining how to deploy the selected models into the business environment.
- This involves planning how to integrate the models into the organization’s technical infrastructure and business processes.
- The outer circle means that the whole process is iterated periodically.
- There are many reasons why a data-driven model can become obsolete:
- the business’s needs might have changed;
- the process the model emulates and provides insight into might have changed
- the data streams the model uses might have changed
- According to a survey, data scientists spend about 79% of their time on data preparation (19% collecting data and 60% cleaning and reshaping it), about 16% on building training sets, mining data for patterns, and refining algorithms, and about 5% on other tasks.
Describe the way data is structured in analytics
- The input takes the form of concepts, instances, and attributes. We call the thing that is to be learned a concept
description.
- The result of the learning process is a description of the concept that is, ideally, intelligible in that it can be
understood, discussed, and disputed, and operational in that it can be applied to actual examples.
- The data given to ML take the form of a set of instances, and each instance is an individual, independent example of the concept to be learned.
- Each instance is characterized by the values of attributes that measure different aspects of the instance. There are
many different types of attribute, although typical machine learning schemes deal only with numeric and nominal, or
categorical, ones.
Concept
Example (aka Instances)
- The input to a machine learning scheme is a set of instances.
- These instances are the things that are to be classified, or associated, or clustered.
- In the standard scenario, each instance is an individual, independent example of the concept to be learned.
- Instances are characterized by the values of a set of predetermined attributes
- The instances are the rows of the tables that represent each data point, record.
Relations
- The idea of specifying only positive examples and adopting a standing assumption that the rest are negative is called
the closed world assumption.
- You can take a relationship between different nodes of a tree and recast it into a set of independent instances
- Relational data is more complex than a flat file.
- The process of flattening is technically called denormalization. In database terms, you take two relations and join them together to make one (see the sketch at the end of this section).
- It is always possible to do this with any (finite) set of (finite) relations, but denormalization may yield sets
of rows that need to be aggregated to form independent instances.
- Problems in denormalization:
- Creating a superrelation by joining all the data sets together can incur enormous computational and storage costs.
- It can generate spurious regularities in the data, and it is essential to check the data for such artifacts before
applying a learning scheme
- Computer scientists usually use recursion to deal with situations in which the number of possible examples is
infinite.
- This is addressed in a subdiscipline of machine learning called inductive logic programming
- Potentially infinite concepts can be dealt with by learning rules that are recursive.
- The real drawbacks of such techniques, however, are that they do not cope well with noisy data, and they tend to
be so slow as to be unusable on anything but small artificial datasets.
- Some important real-world problems are most naturally expressed in a multi-instance format, where each example is
actually a separate set of instances. This “multi-instance” setting covers some important real-world applications.
- The goal of multi-instance learning is still to produce a concept description,
- But the task is more difficult because the learning algorithm has to contend with incomplete information about
each training example.
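As a concrete illustration of the flattening (denormalization) mentioned in this section, here is a hedged sketch that joins two small relations into one flat table and then aggregates it into independent instances. The tables and column names are invented for illustration.

```python
import pandas as pd

# Two relations, as they might sit in a normalized relational database.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "city": ["York", "Leeds"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [20.0, 35.0, 15.0],
})

# Denormalization: join the two relations together to make one flat file.
flat = orders.merge(customers, on="customer_id", how="left")
print(flat)

# The joined rows may still need aggregating to form independent instances,
# e.g. one instance per customer.
instances = flat.groupby("customer_id").agg(
    total_amount=("amount", "sum"),
    n_orders=("order_id", "count"),
    city=("city", "first"),
).reset_index()
print(instances)
```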
Attribute
- Each instance that provides the input to machine learning is characterized by its values on a fixed, predefined set of
features or attributes
- Attributes are the columns of the table.
- Numeric attributes, sometimes called continuous attributes, measure numbers—either real or integer valued.
- Nominal attributes take on values in a prespecified, finite set of possibilities and are sometimes called
categorical.
- The values themselves serve just as labels or names; for example, in the weather data the attribute outlook has values sunny, overcast, and rainy.
- There is no ordering or distance between the values; it does not make sense to add the values together, multiply them, or even compare their sizes.
- Nominal attributes are sometimes called categorical, enumerated, or discrete. Enumerated is the standard term used
in computer science to denote a categorical data type;
- A special case of the nominal scale is the dichotomy, which has only two members—often designated as true and
false, or yes and no in the weather data. Such attributes are sometimes called Boolean.
- Ordinal attributes are ones that make it possible to rank order the categories. However, although there is a
notion of ordering, there is no notion of distance.
- For example: hot > mild > cool, or strongly dislike, dislike, like, strongly like.
- Ordinal attributes are often coded as numeric data, or perhaps continuous data, but without the implication of
mathematical continuity.
- Ratio quantities are ones for which the measurement scheme inherently defines a zero point.
- For example, when measuring the distance from one object to another, the distance between the object and itself
forms a natural zero.
- Ratio quantities are treated as real numbers: any mathematical operations are allowed.
- Notice that the distinction between nominal and ordinal attributes is not always straightforward and obvious
- Metadata is data about data. Machine learning systems can use a wide variety of other information about attributes; metadata is used to supply it.
Statistics
Intro
- A summary statistic is a single number summarizing a large amount of data.
- the primary results of the study after 1 year could be described by two summary statistics: the proportion of people
who had a stroke in the treatment and control groups.
- Proportion who had a stroke in the treatment (stent) group: 45/224 = 0.20 = 20%.
- Proportion who had a stroke in the control group: 28/227 = 0.12 = 12%.
- These summary statistics are important for two reasons:
- First, this is contrary to what doctors expected, which was that stents would reduce the rate of strokes.
- Second, it leads to a statistical question: do the data show a “real” difference between the groups?
- The larger the difference we observe (for a particular sample size), the less believable it is that the difference is due to chance. So what we are really asking is the following: is the difference so large that we should reject the notion that it was due to chance?
- Important! Do not generalize the results of this study to all patients and all stents.
Data basics
- Data matrix is a convenient and common way to organize data, especially if collecting data in a spreadsheet.
- Each row of a data matrix corresponds to a unique case (observational unit)
- Each column corresponds to a variable.
- In the example loan data matrix (image omitted from these notes), each row represents an instance of a loan.
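A minimal sketch of a data matrix in Python, assuming a made-up loan data set in the spirit of the example above: each row is a case (a loan) and each column is a variable.

```python
import pandas as pd

# Hypothetical loan data matrix: one row per case, one column per variable.
loans = pd.DataFrame({
    "loan_amount": [7500, 25000, 14000],            # numerical
    "interest_rate": [7.34, 9.43, 6.08],            # numerical
    "term_months": [36, 60, 36],                    # numerical (discrete)
    "homeownership": ["rent", "mortgage", "own"],   # categorical
})

print(loans.shape)    # (number of cases, number of variables)
print(loans.iloc[0])  # the first case (observational unit)
```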
Types of variables
- A numerical variable can take a wide range of numerical values.
- It is sensible to add, subtract, or take averages with those values.
- Discrete variables can only take numerical values with jumps (e.g., non-negative whole numbers such as counts).
- Variables that can take any value in a range, including decimals, are called continuous.
- A categorical variable places each case into a category; the possible values are called the variable’s levels.
- When no ordering between these levels is meaningful, the variable is called nominal.
- When there is a natural ordering (e.g., level of education: high school, college, etc.), it is ordinal.
Relationships between variables
- Examining summary statistics could provide insights for each of the three questions about counties. Additionally,
graphs can be used to visually explore data.
- The scatterplot suggests a relationship between the two variables: counties with a higher rate of multi-units tend to
have lower homeownership rates.
- When two variables show some connection with one another, they are called associated variables (dependent).
- Negative association example: counties with more units in multi-unit structures are associated with lower homeownership.
- Positive association example: counties with higher median household income tend to have higher rates of population growth.
- A pair of variables are either related in some way (associated) or not (independent). No pair of variables is both
associated and independent.
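As a sketch of checking whether two numerical variables are associated, in the spirit of the county example above, here is a hedged snippet with invented numbers; a scatterplot plus a correlation coefficient gives a quick first look.

```python
import numpy as np

# Invented county-level data: % of units in multi-unit structures vs. homeownership rate.
multi_unit_rate = np.array([5.0, 12.0, 20.0, 33.0, 41.0])
homeownership = np.array([81.0, 74.0, 68.0, 55.0, 48.0])

# A negative coefficient is consistent with the negative association described above
# (more multi-unit housing goes with lower homeownership).
r = np.corrcoef(multi_unit_rate, homeownership)[0, 1]
print(round(r, 2))

# For a visual check, a scatterplot could be drawn, e.g. with matplotlib:
# import matplotlib.pyplot as plt
# plt.scatter(multi_unit_rate, homeownership); plt.show()
```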
Explanatory and response variables
- When we suspect one variable might causally affect another, we label the first variable the explanatory
variable and the second the response variable.
- explanatory variable — might affect —> response variable
- For many pairs of variables, there is no hypothesized relationship, and these labels would not be applied to either
variable in such cases.
Observational studies and Experiments
- Researchers perform an observational study when they collect data in a way that does not directly interfere with
how the data arise.
- researchers may collect information via surveys, review medical or company records, or follow a cohort of many
similar individuals to form hypotheses about why certain diseases might develop.
- When researchers want to investigate the possibility of a causal connection, they conduct an experiment.
- Usually, there will be both an explanatory and a response variable. When individuals/test units are randomly
assigned to a group, the experiment is called a randomised experiment.
- Association != Causation: association does not imply causation, and causation can only be inferred from a randomized experiment.
Examining numerical data
- Scatterplot : A scatterplot provides a case-by-case view of data for two numerical variables.
- Scatterplots are helpful in quickly spotting associations relating variables, whether those associations come in
the form of simple trends or whether those relationships are more complex.
- Dot plots and the mean: Sometimes two variables are one too many: only one variable may be of interest.
- In these cases, a dot plot provides the most basic of displays. A dot plot is a one-variable scatterplot; an
example using the interest rate of 50 loans is shown
- The mean, often called the average, is a common way to measure the center of a distribution of data.
- The mean is useful because it allows us to rescale or standardize a metric into something more easily
interpretable and comparable
- Histogram and shape:
- Dot plots show the exact value for each observation. This is useful for small data sets, but they can become hard
to read with larger samples. Rather than showing the value of each observation, we prefer to think of the value as
belonging to a bin.
- These binned counts are plotted as bars into what is called a histogram, which resembles a more heavily binned
version of the stacked dot plot
- Histograms provide a view of the data density. Higher bars represent where the data are relatively more
common.
- Histograms are especially convenient for understanding the shape of the data distribution.
- Long tail to identify skew: when data trail off in one direction, the distribution has a long tail. If a distribution has a long left tail, it is left skewed. If a distribution has a long right tail, it is right skewed. When the tails are balanced, the distribution is symmetric.
- Histograms can be used to identify modes. A mode is represented by a prominent peak in the distribution.
- Distributions are called unimodal, bimodal, and multimodal, respectively.
- Deviation: We call the distance of an observation from its mean its deviation.
- For example, the deviations of the 1st, 2nd, 3rd, and 50th observations in the interest rate variable would each be that observation minus the mean.
- Variance: if we square these deviations and then take an average, the result is equal to the sample variance, denoted by s².
- In short, the variance is the average squared distance from the mean.
- We divide by n − 1, rather than dividing by n, when computing a sample’s variance; there’s some mathematical
nuance here, but the end result is that doing this makes this statistic slightly more reliable and useful
- Standard deviation: defined as the square root of the variance.
- The standard deviation is useful when considering how far the data are distributed from the mean.
- Squaring the deviations does two things:
- First, it makes large deviations relatively much larger.
- Second, it gets rid of any negative signs.
- Box plot: A box plot summarizes a data set using five statistics while also plotting unusual observations.
- The first step in building a box plot is drawing a dark line denoting the median, which splits the data in
half.
- Median:
If the data are ordered from smallest to largest, the median is the observation right in the middle. If there are an
even number of observations, there will be two values in the middle, and the median is taken as their average.
- Interquartile range: The IQR is the length of the box in a box plot. It is computed as IQR = Q3 − Q1 where Q1
and Q3 are the 25th and 75th percentiles.
- The two boundaries of the box are called the first quartile (the 25th percentile, i.e. 25% of the data fall
below this value) and the third quartile (the 75th percentile), and these are often labeled Q1 and Q3,
respectively.
- The whiskers attempt to capture the data outside of the box, but their reach is never allowed to be more than 1.5 × IQR beyond the box: they extend at most to Q3 + 1.5 × IQR and Q1 - 1.5 × IQR, and they capture everything within this reach.
- Any observation lying beyond the whiskers is labeled with a dot and called an outlier.
- An outlier is an observation that appears extreme relative to the rest of the data. Examining data for outliers
serves many useful purposes, including
- Identifying strong skew in the distribution.
- Identifying possible data collection or data entry errors.
- Providing insight into interesting properties of the data.
- Robust statistics:
- The median and IQR are called robust statistics because extreme observations have little effect on their
values: moving the most extreme value generally has little influence on these statistics.
- On the other hand, the mean and standard deviation are more heavily influenced by changes in extreme
observations, which can be important in some situations
- Transforming data (special topic): When data are very strongly skewed, we sometimes transform them so they are easier
to model.
- A transformation is a rescaling of the data using a function.
- For instance, a plot of the logarithm (base 10) of county populations results in the new histogram in Figure
2.13(b). This data is symmetric, and any potential outliers appear much less extreme than in the original data
set.
- By reining in the outliers and extreme skew, transformations like this often make it easier to build statistical models against the data.
- Transformations other than the logarithm can be useful, too. For instance, the square root (√original observation) and the inverse (1 / original observation) are commonly used by data scientists.
- Common goals in transforming data are to see the data structure differently, reduce skew, assist in modeling, or straighten a nonlinear relationship in a scatterplot.
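A minimal Python sketch of the summary statistics described in this section: the mean, sample variance (dividing by n - 1), standard deviation, median, IQR, the 1.5 × IQR outlier rule, and a log transform. The interest-rate values are invented stand-ins for the loan example.

```python
import numpy as np

# Invented interest rates (%) with one extreme value to create right skew.
rates = np.array([5.3, 6.1, 7.3, 7.9, 8.4, 9.2, 10.1, 12.6, 26.3])

mean = rates.mean()
deviations = rates - mean                              # distance of each observation from the mean
variance = (deviations ** 2).sum() / (len(rates) - 1)  # sample variance: divide by n - 1
std = np.sqrt(variance)                                # standard deviation = sqrt(variance)

median = np.median(rates)
q1, q3 = np.percentile(rates, [25, 75])
iqr = q3 - q1                                          # length of the box in a box plot
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # reach of the whiskers
outliers = rates[(rates < lower) | (rates > upper)]    # points plotted as dots

print(f"mean={mean:.2f}, s^2={variance:.2f}, s={std:.2f}")
print(f"median={median}, IQR={iqr:.2f}, outliers={outliers}")

# A log transform reins in the strong right skew caused by the extreme value.
log_rates = np.log10(rates)
print(log_rates.round(2))
```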
Data Mining and Ethics
- Useful data, typically data about people, usually have ethical implications. Therefore data miners should be aware of these and act responsibly.
- The discriminative application of data mining, like who does and doesn’t get the loan, is common but when it
discriminates based on race, sexual orientation etc, then it’s unethical and/or illegal.
- However, using racial or sexual information might still be required for medical diagnosis. So, it depends on the
application.
- There is also the chance of building models that indirectly rely on discriminating information. For example, excluding
racial information but using postcode is problematic where certain ethnicities group in certain areas.
- Reidentification techniques have provided shocking insights into the difficulty of anonymizing data. It turns out,
e.g., that over 85% of Americans can be identified from publicly available records using just three pieces of
information: five-digit ZIP code, birthdate and gender or if city info is available instead of zipcode, over 50% of
Americans can still be identified.
- Stories abound of companies releasing allegedly anonymous data in good faith, only to find that many individuals
are easily identifiable.
- For example from Netflix rating data, it’s possible to identify all the movies someone has watched if a few movies
they rated and their approximate dates are known,
- The moral is that if you really do remove all possible identification information from a database, you will
probably be left with nothing useful.
- Before people make a decision to provide personal information they need to know how it will be used and what it will
be used for.
- This information needs to be provided straightforwardly, in plain language people can understand, not in legalistic small print.
- The potential use of data mining techniques may stretch far beyond what was conceived when the data was originally collected. That is why the conditions under which the data was collected, and the purposes for which it may be used, have to be made clear.
- The ownership of data doesn’t bestow the right to use it in ways other than those purported when it was originally
recorded, however, in reality, the situation is complex.
- The discriminative type of applications of insurance companies, for example, ie higher insurance premiums for younger
males, are based on stereotypes and statistical correlations.
- As well as community standards, purely statistical correlations must also be subject to logical and scientific standards. For example, if statistics show red car owners as posing greater credit risks, this begs further investigation and explanation.
- Machine learning is a technology that we need to take seriously.
Smartphone data tracking is more than creepy
With smart cities, your every step will be recorded
TODO:
* 1.8.1 Activity -> Diez 1.1-1.8 do it.
* 1.8.3 Activity -> Diez 2.1-2.1.8 do it
* 1.8.4 Activity
* 1.9.1 Discussion
* Week 1: Activity Forum: https://onlinestudy.york.ac.uk/courses/577/discussion_topics/19736?module_item_id=48187
WEEK 2
Main Topics
- Regression and classification — the two forms of prediction
- Using linear regression to predict numerical values
- Using decision trees to predict classifications
- Evaluating predictions
Sub titles:
Key concepts in machine learning
FIELDED APPLICATIONS
- Being fielded applications, the illustrations that follow tend to stress the use of learning in performance
situations, in which the emphasis is on the ability to perform well on new examples.
- WEB MINING
- Search engine companies examine the hyperlinks in web pages to come up with a measure of “prestige” for each
web page and website. Dictionaries define prestige as “high standing achieved through success or influence.”
- A metric called PageRank, introduced by the founders of Google and used in various guises by other search engine developers too, attempts to measure the standing of a web page. The more pages that link to your website, the higher its prestige (a toy sketch of this idea appears at the end of this section).
- Another way in which search engines tackle the problem of how to rank web pages is to use machine learning based on a training set of example queries—documents that contain the terms in the query and human judgments about how relevant the documents are to that query. A learning algorithm then analyzes this training data and comes up with a way to predict the relevance judgment for any document and query.
- Search engines mine the content of the Web. They also mine the content of your queries—the terms you search
for—to select advertisements that you might be interested in
- And then there are social networks and other personal data. We live in the age of self-revelation: people share
their innermost thoughts in blogs and tweets, their photographs, their music and movie tastes, their opinions
of books, software, gadgets, and hotels, their social life. They may believe they are doing this anonymously,
or pseudonymously, but often they are incorrect
- DECISIONS INVOLVING JUDGMENT
- When you apply for a loan, you have to fill out a questionnaire asking for relevant financial and personal
information.
- The decision is typically made in two stages: first, statistical methods are used to determine clear “accept” and “reject” cases. The remaining borderline cases are more difficult and call for human judgment.
- A machine learning procedure was used to produce a small set of classification rules that made correct predictions
on two-thirds of the borderline cases in an independently chosen test set.
- Not only did these rules improve the success rate of the loan decisions, but the company also found them
attractive because they could be used to explain to applicants the reasons behind the decision.
- Although the project was an exploratory one that took only a small development effort, the loan company was
apparently so pleased with the result that the rules were put into use immediately.
- SCREENING IMAGES
- Radar satellites provide an opportunity for monitoring coastal waters day and night, regardless of weather
conditions.
- A hazard detection system has been developed to screen images for subsequent manual processing.
- Machine learning allows the system to be trained on examples of spills and nonspills supplied by the user and lets
the user control the tradeoff between undetected spills and false alarms.
- DIAGNOSIS
- Diagnosis is one of the principal application areas of expert systems. Although the hand-crafted rules used in
expert systems often perform well, machine learning can be useful in situations in which producing rules manually
is too labor intensive.
- The goal was not to determine whether or not a fault existed, but to diagnose the kind of fault, given that one
was there.
- Thus there was no need to include fault-free cases in the training set. The measured attributes were rather
low level and had to be augmented by intermediate concepts,
- i.e., functions of basic attributes, which were defined in consultation with the expert and embodied some
causal domain knowledge
- It is interesting to note, however, that the system was put into use not because of its good performance but
because the domain expert approved of the rules that had been learned
- MARKETING AND SALES
- Market basket analysis is the use of association techniques to find groups of items that tend to occur together in
transactions, e.g., supermarket checkout data.
- For many retailers this is the only source of sales information that is available for data mining.
- For example, automated analysis of checkout data may uncover the fact that customers who buy beer also buy
chips, a discovery that could be significant from the supermarket operator’s point of view (although rather an
obvious one that probably does not need a data mining exercise to discover)
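The notes above only gloss PageRank, so as a hedged illustration of the “prestige from incoming links” idea, here is a toy power-iteration sketch. The four-page link graph and the 0.85 damping factor are invented assumptions for illustration, not Google’s actual implementation.

```python
import numpy as np

# Toy link graph: adjacency[i, j] = 1 means page i links to page j.
adjacency = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

n = adjacency.shape[0]
out_degree = adjacency.sum(axis=1)
transition = adjacency / out_degree[:, None]   # row-stochastic transition matrix

damping = 0.85                       # a common choice, assumed here
rank = np.full(n, 1.0 / n)           # start with equal "prestige" everywhere
for _ in range(50):                  # power iteration until (approximate) convergence
    rank = (1 - damping) / n + damping * transition.T @ rank

# Pages that receive links from many (and from highly ranked) pages end up with higher scores.
print(rank.round(3))
```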
MACHINE LEARNING AND STATISTICS
Machine Learning 101
- Machine learning is the field of study that develops the algorithms that the computers follow in order to identify and
extract patterns from data.
- ML algorithms and techniques are applied primarily during the modeling stage of CRISP-DM. ML involves a two-step
process.
- First, an ML algorithm is applied to a data set to identify useful patterns in the data. These patterns can be
represented in a number of different ways.
- Simply put, ML algorithms create models from data, and each algorithm is designed to create models using a
particular representation (neural network or decision tree or other).
- Second, once a model has been created, it is used for analysis.
- In some cases, the structure of the model is what is important.
- A model structure can reveal what the important attributes are in a domain.
Supervised versus Unsupervised Learning
- The goal of supervised learning is to learn a function that maps from the values of the attributes describing an
instance to the value of another attribute, known as the target attribute, of that instance.
- For example, when supervised learning is used to train a spam filter, the algorithm attempts to learn a function
that maps from the attributes describing an email to a value (spam/not spam) for the target attribute; the
function the algorithm learns is the spam-filter model returned by the algorithm.
- Supervised learning works by searching through lots of different functions to find the function that best maps between
the inputs and output.
- However, depending on the complexity, there are so many combinations of inputs and possible mappings to outputs that an algorithm cannot try all possible functions.
- As a consequence, each ML algorithm is designed to look at or prefer certain types of functions during its search.
These preferences are known as the algorithm’s learning bias.
- The real challenge in using ML is to find the algorithm whose learning bias is the best match for a
particular data set.
- this task involves experiments with a number of different algorithms to find out which one works best on that data
set.
- Supervised learning is “supervised” because each of the instances in the data set lists both the input values and the
output (target) value for each instance.
- So the learning algorithm can guide its search for the best function by checking how each function it tries
matches with the data set, and at the same time the data set acts as a supervisor for the learning process by
providing feedback
- for supervised learning to take place, each instance in the data set must be labeled with the value of the target
attribute.
- In unsupervised learning, there is no target attribute.
- As a consequence, unsupervised-learning algorithms can be used without investing the time and effort in labeling
the instances of the data set with a target attribute.
- The most common type of unsupervised learning is cluster analysis, where the algorithm looks for clusters of instances
that are more similar to each other than they are to other instances in the data.
- These clustering algorithms often begin by guessing a set of clusters and then iteratively updating the clusters
- A challenge for clustering is figuring out how to measure similarity.
- If all the attributes in a data set are numeric and have similar ranges, then it probably makes sense just to
calculate the Euclidean distance (better known as the straight-line distance) between the instances (or rows).
- In some data sets, different numeric attributes have different ranges, with the result that a variation in row
values in one attribute may not be as significant as a variation of a similar magnitude in another attribute.
- In these cases, the attributes should be normalized so that they all have the same range (see the sketch at the end of this section).
- Another complicating factor in calculating similarity is that things can be deemed similar in many different ways.
- Some attributes are sometimes more important than other attributes, so it might make sense to weight some
attributes in the distance calculations, or it may be that the data set includes nonnumeric data.
- The choice of which attributes to include and exclude from a data set is a key task in data science, but for the
purposes of this discussion we will work with the data set as is.
- In unsupervised learning, a clustering algorithm will look for groups of rows that are more similar to each other than they are to the other rows in the data.
- Each of these groups of similar rows defines a cluster of similar instances.
- The simple idea of looking for clusters of similar rows is very powerful and has applications across many areas of
life. Another application of clustering rows is making product recommendations to customers.
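A minimal sketch of the similarity calculation discussed above: range-normalizing numeric attributes so that they share the same 0-1 scale, then computing the Euclidean (straight-line) distance between instances. The attribute values are invented.

```python
import numpy as np

# Three instances described by two numeric attributes with very different ranges,
# e.g. annual salary and age (invented values).
instances = np.array([
    [52000.0, 34.0],
    [48000.0, 58.0],
    [90000.0, 41.0],
])

# Without normalization the salary attribute dominates the distance calculation.
raw_dist = np.linalg.norm(instances[0] - instances[1])

# Range (min-max) normalization puts every attribute on a 0-1 scale.
mins, maxs = instances.min(axis=0), instances.max(axis=0)
normalized = (instances - mins) / (maxs - mins)

norm_dist = np.linalg.norm(normalized[0] - normalized[1])
print(raw_dist, norm_dist)
```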
Standard Data Science Tasks
- Most data science projects can be classified as belonging to one of four general classes of task:
- Clustering (or segmentation)
- Anomaly (or outlier) detection
- Association-rule mining
- Prediction (including the subproblems of classification and regression)
- Understanding which task a project is targeting can help with many project decisions
- For example, training a prediction model requires that each of the instances in the data set include the value of
the target attribute.
- Understanding the task also informs which ML algorithm(s) to use.
- There are a large number of ML algorithms, and each algorithm is designed for a particular data-mining task.
- For example, ML algorithms that generate decision-tree models are designed primarily for prediction tasks
- There is a many-to-one relationship between ML algorithms and a task, so knowing the task doesn’t tell you exactly
which algorithm to use, but it does define a set of algorithms that are designed for the task.
- Because the data science task affects both the data set design and the selection of ML algorithms, the decision
regarding which task the project will target has to be made early on in the project life cycle, ideally during the
business-understanding phase of the CRISP-DM life cycle
Clustering (Who Are Our Customers?)
- Designing a targeted marketing campaign requires an understanding of the target customer.
- A good approach is to try to identify a number of customer personas or customer profiles, each of which relates to a significant segment of the customer base, and then to design targeted marketing campaigns for each persona.
- Human intuition about customers can often miss important nonobvious segments or not provide the level of granularity that is required for nuanced marketing.
- The standard data science approach to the persona type of analysis is to frame the problem as a clustering task.
- Clustering involves sorting the instances in a data set into subgroups containing similar instances.
- Usually clustering requires an analyst to first decide on the number of subgroups she would like identified in
the data. This decision may be based on domain knowledge or informed by project goals.
- A clustering algorithm is then run on the data with the desired number of subgroups input as one of the algorithm’s parameters.
- The algorithm then creates that number of subgroups by grouping instances based on the similarity of their
attribute values.
- Once the algorithm has created the clusters, a human domain expert reviews the clusters to interpret whether they
are meaningful.
- one of the biggest challenges with clustering is to decide which attributes to include and which to exclude so as to
get the best results.
- Making this decision on attribute selection will involve iterations of experiments and human analysis of the
results of each iteration
- The best-known ML algorithm for clustering is the k-means algorithm.
- The k in the name signals that the algorithm looks for k clusters in the data.
- The k-means algorithm assumes that all the attributes describing the customers in the data set are numeric. If the
data set contains nonnumeric attributes, then these attributes need to be mapped to numeric values in order to use
k-means, or the algorithm will need to be amended to handle these nonnumeric values.
- The goal of the k-means algorithm is to find the position of each cluster’s center in the point cloud.
- The k-means algorithm begins by selecting k instances to act as initial cluster centers.
- Current best practice is to use an algorithm called “k-means++” to select the initial cluster centers.
- The rationale behind k-means++ is that it is a good idea to spread out the initial cluster centers as much as
possible. So in k-means++ the first cluster center is set by randomly selecting one of the instances in the data
set.
- The second and subsequent cluster centers are set by selecting an instance from the data set with the
probability that an instance selected is proportional to the squared distance from the closest existing
cluster center.
- Once all k cluster centers have been initialized, the algorithm works by iterating through a two-step process:
- first, assigning each instance to the nearest cluster center,
- second, updating the cluster center to be in the middle of the instances assigned to it.
- The instances are then reassigned, again to the closest updated cluster center.
- This process of instance assignment and center updating continues until no instances are assigned to a new
cluster center during an iteration.
- The k-means algorithm is nondeterministic, meaning that different starting positions for the cluster centers will
likely produce different clusters.
- the algorithm is typically run several times, and the results of these different runs are then compared to see which
clusters appear most sensible given the data scientist’s domain knowledge and understanding.
- Each cluster center defines a different customer persona, with the persona description generated from the attribute
values of the associated cluster center
- The k-means algorithm is not required to return equal-size clusters, and, in fact, it is likely to return
different-size clusters.
- One of the advantages of clustering as an analytics approach is that it can be applied to most types of data
- Because of its versatility, clustering is often used as a data-exploration tool during the data-understanding stage of many data science projects.
- Clustering is also useful across a wide range of domains.
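As a worked illustration of the clustering task above, here is a minimal Python sketch using scikit-learn's KMeans with k-means++ initialisation; the tiny customer table and its column names are invented for illustration and are not part of the notes.

```python
# A minimal clustering sketch, assuming a customer table with numeric
# attributes only (column names here are hypothetical).
import pandas as pd
from sklearn.cluster import KMeans

customers = pd.DataFrame({
    "age": [23, 25, 41, 44, 60, 62],
    "annual_spend": [300, 350, 1200, 1100, 700, 650],
})

# k-means++ initialisation spreads out the initial cluster centres.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # cluster index per customer
print(kmeans.cluster_centers_)  # one centre (candidate persona) per cluster
```

Running the algorithm with different random seeds and comparing the resulting clusters mirrors the practice of rerunning k-means several times before interpreting the personas.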
Anomaly Detection or Outlier Analysis (Is This Fraud?)
- Anomaly detection or outlier analysis involves searching for and identifying instances that do not conform to the typical data in a data set.
- These nonconforming cases are often referred to as anomalies or outliers.
- Anomaly detection is often used in analyzing financial transactions in order to identify potential fraudulent activities and to trigger investigations.
- The first approach that most companies typically use for anomaly detection is to manually define a number of rules
based on domain expertise that help with identifying anomalous events.
- This rule set is often defined in SQL or in another language and is run against the data in the business databases
or data warehouse
- SQL now includes a MATCH_RECOGNIZE function to facilitate pattern matching in data.
- The MATCH_RECOGNIZE function in SQL enables database programmers to write scripts that identify sequences of credit-card transactions that fit a defined suspicious pattern and either block the card automatically or trigger a warning to the credit-card company.
- The main drawback with a rule-based approach to anomaly detection is that defining rules in this way means that anomalous events can be identified only after they have occurred and have come to the company's attention.
- In some ways, anomaly detection is the opposite of clustering:
- the goal of clustering is to identify groups of similar instances, whereas the goal of anomaly detection is to find instances that are dissimilar to the rest of the data in the data set.
- Clustering can also be used to automatically identify anomalies. There are two approaches to using clustering for
anomaly detection.
- The first is that the normal data will be clustered together, and the anomalous records will be in separate
clusters.
- The second approach is to measure the distance between each instance and the center of the cluster. The farther
away the instance is from the center of the cluster, the more likely it is to be anomalous and thus to need
investigation.
- Another approach to anomaly detection is to train a prediction model, such as a decision tree, to classify instances as anomalous or not.
- However, training such a model normally requires a training data set that contains both anomalous records and
normal records
- Ideally, the data set should be balanced; in a binary-outcome case, balance would imply a 50:50 split in the data.
In general, acquiring this type of training data for anomaly detection is not feasible: by definition, anomalies
are rare events, occurring maybe in 1 to 2 percent or less of the data.
- This data constraint precludes the use of normal, off-the-shelf prediction models.
- There are ML algorithms known as one-class classifiers that are designed to deal with the type of imbalanced data
that are typical of anomaly-detection data sets.
- The one-class support-vector machine (SVM) algorithm is a well-known one-class classifier.
- In general terms, the one-class SVM algorithm examines the data as one unit (i.e., a single class) and
identifies the core characteristics and expected behavior of the instances.
- The algorithm will then indicate how similar or dissimilar each instance is from the core characteristics and
expected behavior.
- The more dissimilar an instance is, the more likely that it should be investigated.
- The fact that anomalies are rare means that they can be easy to miss and difficult to identify.
- As a result, data scientists often combine a number of different models to detect anomalies.
- The idea is that different models will capture different types of anomalies.
- The different models are integrated together into a decision-management solution that enables the predictions from
each of the models to feed into a decision of the final predicted outcome
- For example, if three or all four of the models have identified the transaction as possible fraud, then the transaction would be flagged for a data scientist to investigate.
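A hedged sketch of the one-class SVM idea described above, using scikit-learn's OneClassSVM; the synthetic "transaction" features and the nu value are illustrative assumptions rather than anything prescribed in the notes.

```python
# A sketch of one-class anomaly detection; the transaction features
# below are made up for illustration.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=50, scale=5, size=(500, 2))   # typical transactions
suspect = np.array([[50.0, 90.0], [5.0, 50.0]])       # unusual transactions

# nu roughly bounds the fraction of instances treated as outliers.
ocsvm = OneClassSVM(nu=0.02, kernel="rbf", gamma="scale").fit(normal)

# The lower (more negative) the score, the more dissimilar the instance
# is from the learned "core" behaviour of the data.
print(ocsvm.decision_function(suspect))
print(ocsvm.predict(suspect))   # -1 flags an outlier, +1 an inlier
```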
Association-Rule Mining (Do You Want Fries with That?)
- Association-rule mining is an unsupervised-data-analysis technique that looks to find groups of items that frequently co-occur.
- The classic case of association mining is market-basket analysis, wherein retail companies try to identify sets of
items that are purchased together, such as hot dogs, ketchup, and beer
- Unlike clustering and anomaly detection, which focus on identifying similarities or differences between instances (or rows) in a data set, association-rule mining focuses on looking at relationships between attributes (or columns) in a data set.
- Using association-rule mining, a business can start to answer questions about its customers’ behaviors by looking for
patterns that may exist in the data.
- Questions that market-basket analysis can be used to answer include: Did a marketing campaign work? Have this
customer’s buying patterns changed?
- The Apriori algorithm is the main algorithm used to produce the association rules. It has a two-step process:
- Find all combinations of items in a set of transactions that occur with a specified minimum frequency. These
combinations are called frequent itemsets.
- Generate rules that express the probable co-occurrence of items within frequent itemsets. The Apriori algorithm
calculates the probability of an item being present in a frequent itemset given that another item or items are
present.
- The Apriori algorithm generates association rules that express probabilistic relationships between items in frequent itemsets. An association rule is of the form "IF antecedent, THEN consequent."
- IF {hot-dogs, ketchup}, THEN {beer}.
- Two main statistical measures are linked with association rules: support and confidence.
- The support percentage of an association rule—or the ratio of transactions that include both the antecedent
and consequent to the total number of transactions—indicates how frequently the items in the rule occur together.
- The confidence percentage of an association rule—or the ratio of the number of transactions that include both
the antecedent and consequent to the number of transactions that includes the antecedent—is the conditional
probability that the consequent will occur given the occurrence of the antecedent.
- for example, a confidence of 75 percent for the association rule relating hot dogs and ketchup with beer would
indicate that in 75 percent of cases where customers purchased both hot dogs and ketchup, they also purchased
beer. The support score of a rule simply records the percentage of baskets in the data set where the rule
holds. For example, a support of 5 percent indicates that 5 percent of all the baskets in the data set contain
all three items in the rule “hot dogs, ketchup, and beer.”
- In order to control the complexity of the analysis of these rules, it is usual to prune the generated rule set to
include only rules that have both a high support and a high confidence
- Once the rule set has been pruned, the data scientist can then analyze the remaining rules to understand what
products are associated with each other and apply this new information in the organization.
- Including this demographic information in the association analysis enables the analysis to be focused on particular
demographics, which can further help marketing and targeted advertising.
- An example of an association rule augmented with demographic information might be
- IF gender(male) and age(< 35) and {hot-dogs, ketchup}, THEN {beer}. {Support = 2%, Confidence = 90%.}
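To make the support and confidence definitions concrete, here is a small plain-Python sketch that evaluates one candidate rule by hand; the baskets are invented, and in practice an Apriori implementation would generate the frequent itemsets first.

```python
# Computing support and confidence for the candidate rule
# IF {hot-dogs, ketchup} THEN {beer}; the baskets are invented.
baskets = [
    {"hot-dogs", "ketchup", "beer"},
    {"hot-dogs", "ketchup", "beer", "buns"},
    {"hot-dogs", "ketchup"},
    {"milk", "bread"},
    {"beer", "crisps"},
]

antecedent = {"hot-dogs", "ketchup"}
consequent = {"beer"}

n_total = len(baskets)
n_antecedent = sum(antecedent <= b for b in baskets)          # baskets containing the antecedent
n_both = sum((antecedent | consequent) <= b for b in baskets)  # baskets containing the whole rule

support = n_both / n_total          # how often the whole rule occurs
confidence = n_both / n_antecedent  # P(consequent | antecedent)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```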
Classification (Churn or No Churn, That Is the Question)
- The term propensity modeling is used to describe this task because the goal is to model an individual’s propensity to
do something
- In fact, it is estimated that it generally costs five to six times more to attract a new customer than it does to
retain an established one (Verbeke et al. 2011). As a result, many cell phone service companies are very keen to
retain their current customers.
- The term customer churn is used to describe the process of customers leaving one service and joining another.
- So the problem of predicting which customers are likely to leave in the near future is known as churn
prediction. This is a prediction task
- When a prediction model returns a label or category for an input, it is known as a classification model.
- Training a classification model requires historic data, where each instance is labeled to indicate whether the target
event has happened for that instance.
- For example, customer-churn classification requires a data set in which each customer (one row per customer) is assigned a label indicating whether he or she has churned.
- The data set will include an attribute, known as the target attribute, that lists this label for each customer.
- Once the churn event has been defined from a business perspective, it is then necessary to implement this definition in code in order to assign a target label to each customer in the data set.
- Another complicating factor in constructing the training data set for a churn-prediction model is that time lags need
to be taken into account.
- The goal of churn prediction is to model the propensity (or likelihood) that a customer will churn at some point
in the future.
- As a consequence, this type of model has a temporal dimension that needs to be considered during the creation
of the data set.
- The set of attributes in a propensity-model data set are drawn from two separate time periods: the observation
period and the outcome period.
- The observation period is when the values of the input attributes are calculated.
- The outcome period is when the target attribute is calculated.
- How far into the future the model predicts is determined by the length of the outcome period, and the prediction that the churn model returns is actually that a customer will churn within this outcome period.
- Defining the outcome period affects what data should be used as input to the model.
- If the model is designed to predict that a customer will churn within two months from the day the model is run
on that customer’s record, then when the model is being trained, the input attributes that describe the
historic customers who have already churned should be calculated using only the data that were available about
those customers two months prior to their leaving the service.
- Nearly all customer-propensity models will use attributes describing the customer’s demographic information as input:
age, gender, occupation, and so on.
- In scenarios relating to an ongoing service, they are also likely to include attributes describing the customer’s
position in the customer life cycle: coming on board, standing still midcycle, approaching end of a contract.
There are also likely to be attributes that are specific to the industry.
- Once a labeled data set has been created, the major stage in creating a classification model is to use an ML algorithm to build the classification model.
- During modeling, it is good practice to experiment with a number of different ML algorithms to find out which
algorithm works best on the data set.
- Once the final model has been selected, the likely accuracy of the predictions of this model on new instances is
estimated by testing it on a subset of the data set that was not used during the model-training phase.
- If a model is deemed accurate enough and suitable for the business need, the model is then deployed and applied to
new data either in a batch process or in real time.
- A really important part of deploying the model is ensuring that the appropriate business processes and resources are put in place so that the model is used effectively.
- There is no point in creating a customer-churn model unless there is a process whereby the model's predictions result in triggering customer interventions so that the business retains customers.
- In addition to predicting the classification label, prediction models can also give a measure of how confident the model is in the prediction. This measure is called the prediction probability and will have a value between 0 and 1 (see the sketch at the end of this section).
- The higher the value, the more likely the prediction is correct
- The prediction-probability value can be used to prioritize which customers to focus on.
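A minimal sketch of the churn-classification workflow in Python with scikit-learn, referenced from the prediction-probability bullet above; the tiny ABT, its column names, and the choice of a random forest are illustrative assumptions rather than anything prescribed in the notes.

```python
# A hedged churn-classification sketch; the ABT columns are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

abt = pd.DataFrame({
    "tenure":        [2, 30, 5, 48, 1, 24, 3, 60],
    "monthly_spend": [80, 40, 75, 35, 90, 45, 70, 30],
    "churned":       [1, 0, 1, 0, 1, 0, 1, 0],   # target attribute
})
X, y = abt[["tenure", "monthly_spend"]], abt["churned"]

# Hold back part of the data so accuracy is estimated on unseen instances.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(model.score(X_test, y_test))        # estimated accuracy on unseen data
print(model.predict_proba(X_test)[:, 1])  # prediction probability of churn
```

The prediction probabilities in the last line are what a business would use to prioritize which customers to target with retention interventions.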
Regression (How Much Will It Cost?)
Linear regression
- Linear regression captures the relationship between two (or more) variables and allows prediction.
- Linear regression produces a function into which you can put one or more properties of an entity or phenomenon (the "input", "explanatory", or "predictor" variables) and get out the correct value of another (the "output" or "response" variable).
- For example, given a set of observations about the height and weight of many people, we could learn a linear
regression model that could estimate someone’s weight based on their height.
- The first step in a regression analysis is to hypothesize the structure of the relationship between the input
attributes and the target.
- Then a parameterized mathematical model of the hypothesized relationship is defined. This parameterized model is
called a regression function.
- linear-regression functions are relatively easy to interpret
- When a linear relationship is assumed, the regression analysis is called linear regression
- The simplest application of linear regression is modeling the relationship between two attributes: an input attribute
X and a target attribute Y.
- Setting the parameters of a regression function is equivalent to searching for the line that best fits the data.
- The strategy for setting these parameters begins by guessing parameters values and then iteratively updating the
parameters so as to reduce the overall error of the function on the data set.
- The overall error is calculated in three steps:
- The function is applied to the data set, and for each instance in the data set it estimates the value of the
target attribute.
- The error of the function for each instance is calculated by subtracting the estimated value of the target
attribute from the actual value of the target attribute.
- The error of the function for each instance is squared, and then these squared values are summed. (so that the
error in the instances where the function overestimates the target doesn’t cancel out with the error when it
underestimates.)
- This measure of error is known as the sum of squared errors (SSE), and the strategy of fitting a linear function
by searching for the parameters that minimize the SSE is known as least squares.
- Formula: SSE = Σ_{i=1..n} (target_i − prediction_i)², where the data set contains n instances, target_i is the value of the target attribute for instance i in the data set, and prediction_i is the estimate of the target returned by the function for the same instance.
- Formulas: the simple linear-regression function has the form Y = ω0 + ω1 ∗ X, where ω0 is the intercept parameter and ω1 is the slope parameter.
- To create a linear-regression prediction model that estimates the likelihood of an individual’s developing diabetes
with respect to his BMI, we replace X with the BMI attribute, Y with the diabetes attribute, and apply the
least-squares algorithm to find the best-fit line for the diabetes data set.
- the dashed lines show the error (or residual) for each instance for this line. Using the least-squares approach,
the best-fit line is the line that minimizes the sum of the squared residuals. The equation for this line is
- Diabetes = −7.38431 + 0.55593 ∗ BMI .
- The slope parameter value ω1 = 0.55593 indicates that for each increase of one unit in BMI, the model
increases the estimated likelihood of a person developing diabetes by a little more than half a percent
- For example, when BMI equals 20, the model returns a prediction of 3.73 percent.
- a linear-regression model fitted using the least-squares method is actually calculating a weighted average over
the instances. In fact, the intercept parameter value ω0 = −7.38431 ensures that the best-fit line goes through
the point defined by the average BMI value and average diabetes value for the data set.
- The weighting of the instances is based on the distance of the instance from the line: the farther an instance is away from the line, the larger the residual for that instance, and the algorithm will weight that instance by the residual squared.
- Linear-regression models can be extended to take multiple inputs.
- For example BMI data regression function becomes : Diabetes = ω0 + ω1 BMI + ω2 Exercise + ω3 Weight.
- In statistics, a regression function that maps from multiple inputs to a single output in this way is known as a
multiple linear regression function.
- The structure of a multi-input regression function is the basis for a range of ML algorithms, including neural
networks.
- Residuals are the leftover variation in the data after accounting for the model fit: Data = Fit + Residual
- One goal in picking the right linear model is for these residuals to be as small as possible.
- Residuals of points below the line are negative; residuals of points above the line are positive.
- A residual is usually discussed in terms of its absolute value.
- RESIDUAL: the difference between the observed and the expected value.
- The residual of the ith observation (x_i, y_i) is the difference between the observed response (y_i) and the response we would predict based on the model fit (ŷ_i):
- e_i = y_i − ŷ_i; we typically identify ŷ_i by plugging x_i into the model.
- [Diez -> page 320 check]
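A small least-squares sketch in Python/numpy in the spirit of the BMI/diabetes example; the data points are invented, so the fitted parameters will not match the ω0 = −7.38431 and ω1 = 0.55593 quoted above.

```python
# Least-squares fit of a simple linear regression; the data is invented.
import numpy as np

bmi = np.array([19.0, 22.0, 25.0, 28.0, 31.0, 35.0])
diabetes = np.array([3.5, 4.8, 6.5, 8.0, 9.7, 12.1])   # likelihood (%)

w1, w0 = np.polyfit(bmi, diabetes, deg=1)   # slope and intercept
predictions = w0 + w1 * bmi
residuals = diabetes - predictions           # observed minus expected
sse = np.sum(residuals ** 2)                 # sum of squared errors (SSE)

print(f"Diabetes = {w0:.3f} + {w1:.3f} * BMI, SSE = {sse:.4f}")
```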
Correlations
Correlations & Regression
- Correlation and regression are similar concepts insofar as both are techniques that focus on the relationship across
columns in the data set.
- Correlation is focused on exploring whether a relationship exists between two attributes
- Regression is focused on modeling an assumed relationship between attributes with the purpose of being able to
estimate the value of one target attribute given the values of one or more input attributes.
- In the specific cases of Pearson correlation and linear regression, a Pearson correlation measures the degree to which two attributes have a linear relationship, and linear regression trained using least squares is a process to find the best-fit line that predicts the value of one attribute given the value of another.
Knowledge presentation
- The simplest, most rudimentary way of representing the output from machine learning is to make it just the same as the
input—a table.
- Another simple style of representation is a “linear model,” whose output is just the sum of the attribute values,
except that weights are applied to each attribute before adding them together.
- Linear models can also be applied to binary classification problems. In this case, the line produced by the model
separates the two classes: it defines where the decision changes from one class value to the other. Such a line is
often referred to as the decision boundary.
- 1 attribute -> point
- 2 attributes -> straight line
- 3 attribute -> plane
- more than 3 attributes -> hyperplane
Decision trees
- Classification technique known as “decision tree learning”.
- Decision tree learning produces a “machine” into which you can input one or more attributes of an example and get out
a classification.
- Differences from linear regression:
- the output variable is categorical rather than numeric
- input variables may be categorical as well as numeric
- Decision trees are very useful for doing simple classification that works from categorical variables to produce
categorical predictions.
- They can also be quite effective with numerical input variables, provided that the variable range falls meaningfully
into continuous regions.
- The models they produced are easy to understand, particularly when the number of input variables is small.
- Film 4 is an exception — the tree predicts that it should receive an 18 certificate, but the film classification board
has in fact given it only a 15. Depending on the context, this error might or might not be a significant problem.
- A "divide-and-conquer" approach to the problem of learning from a set of independent instances leads naturally to a style of representation called a decision tree.
- Nodes in a decision tree involve testing a particular attribute. Usually, the test compares an attribute value with a constant. Leaf nodes give a classification that applies to all instances that reach the leaf, or a set of classifications, or a probability distribution over all possible classifications.
- If the attribute that is tested at a node is a nominal one, the number of children is usually the number of possible
values of the attribute.
- If the attribute is numeric, the test at a node usually determines whether its value is greater or less than a
predetermined constant, giving a two-way split.
- When a numeric quantity is predicted, decision trees with averaged numeric values at the leaves are called **regression trees**.
- It is possible to combine regression equations with regression trees. Fig. 3.4C is a tree whose leaves contain linear
expressions—i.e., regression equations— rather than single predicted values. This is called a model tree.
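A minimal decision-tree sketch with scikit-learn, loosely echoing the film-certificate example; the tiny data set and its attributes are made up, and the nominal input is one-hot encoded because this particular implementation expects numeric inputs.

```python
# A decision-tree sketch with one categorical and one numeric input;
# the film-certificate data set is invented for illustration.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

films = pd.DataFrame({
    "violence": ["none", "mild", "strong", "strong", "mild", "none"],
    "bad_language_count": [0, 3, 20, 8, 1, 0],
    "certificate": ["U", "12", "18", "15", "12", "U"],
})

# Nominal inputs are one-hot encoded so the tree can test them.
X = pd.get_dummies(films[["violence", "bad_language_count"]])
y = films["certificate"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # readable tree
```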
Evaluating learned models
- The error rate is just the proportion of errors made over a whole set of instances, and it measures the overall
performance of the classifier.
- Error rate on the training set is not likely to be a good indicator of future performance. This is a surprising fact, and a very important one.
- Why? Because the classifier has been learned from the very same training data, any estimate of performance based
on that data will be optimistic, and may be hopelessly optimistic
- The error rate on the training data is called the resubstitution error, because it is calculated by resubstituting the
training instances into a classifier that was constructed from them
- Although it is not a reliable predictor of the true error rate on new data, it is nevertheless often useful to
know.
- To predict the performance of a classifier on new data, we need to assess its error rate on a dataset that played no
part in the formation of the classifier. This independent dataset is called the test set. We assume that both the
training data and the test data are representative samples of the underlying problem.
- It is important that the test data is not used in any way to create the classifier.
- Some learning schemes involve two stages, one to come up with a basic structure and the second to optimize parameters involved in that structure, and separate sets of data may be needed in the two stages. Or you might try out several learning schemes on the training data and then evaluate them—on a fresh dataset, of course—to see which one works best.
- In such situations people often talk about three datasets: the training data, the validation data, and the test
data.
- The training data is used by one or more learning schemes to come up with classifiers.
- The validation data is used to optimize parameters of those classifiers, or to select a particular one.
- The test data is used to calculate the error rate of the final, optimized, method.
- Each of the three sets must be chosen independently: the validation set must be different from the training set to
obtain good performance in the optimization or selection stage, and the test set must be different from both to
obtain a reliable estimate of the true error rate.
- It may be that once the error rate has been determined, the test data is bundled back into the training data to produce a new classifier for actual use. There is nothing wrong with this: it is just a way of maximizing the amount of data used to generate the classifier that will actually be employed in practice.
- Generally, the larger the training sample the better the classifier, although the returns begin to diminish once a certain volume of training data is exceeded. And the larger the test sample, the more accurate the error estimate.
- This limits the amount of data that can be used for training, validation, and testing, and the problem becomes how to
make the most of a limited dataset. From this dataset, a certain amount is held over for testing—this is called the
holdout procedure—and the remainder used for training (and, if necessary, part of that is set aside for
validation).
- There’s a dilemma here: to find a good classifier, we want to use as much of the data as possible for training; to
obtain a good error estimate, we want to use as much of it as possible for testing.
- In statistics, a succession of independent events that either succeed or fail is called a Bernoulli process. The classic example is coin tossing. Each toss is an independent event. Let's say we always predict heads; but rather than "heads" or "tails," each toss is considered a "success" or a "failure."
- Consider a Bernoulli process—a biased coin—whose true (but unknown) success rate is p. Suppose that out of N trials, S are successes: thus the observed success rate is f = S/N.
- The success rate is measured as Success/TotalExperimentCount, f=S/N. So, both 750/1000 and 75/100 produce 75% success
rate. However, as the experiment count goes up, so does the confidence interval.
- For samples >= 100, the normal distribution can be assumed. For the normal distribution, the chance that X lies
more than 1.65 standard deviations from the mean is 10%
- The z value can be found from a standard normal lookup table. For 90% confidence, P(X ≥ z) and P(X ≤ −z) are 5% each.
- For f = 75%, the confidence limits can be calculated from the formula below: [0.732, 0.767] for N = 1000 and [0.691, 0.801] for N = 100.
- p = ( f + z²/2N ± z·√( f/N − f²/N + z²/4N² ) ) / ( 1 + z²/N )
- The ± in this expression gives two values for p that represent the upper and lower confidence boundaries.
Cross Validation
- The holdout method reserves a certain amount for testing and uses the remainder for training (and sets part of that aside for validation, if required). In practical terms, it is common to hold one-third of the data out for testing and use the remaining two-thirds for training.
- A more general way to mitigate any bias caused by the particular sample chosen for holdout is to repeat the whole
process, training and testing, several times with different random samples.
- In each iteration, a certain proportion of the data is randomly selected for training, possibly with stratification,
and the remainder used for testing. The error rates on the different iterations are averaged to yield an overall error
rate. This is the repeated holdout method of error rate estimation.
- Another alternative technique is called cross-validation where you decide on a fixed number of folds/partitions of the
data.
- For k folds, we have k iterations where one fold is kept for testing and the remaining k−1 folds are used for training.
- This is repeated k times, until each fold is used for testing. If stratification is used as well, it’s called
stratified k-fold cross-validation.
- On each run, the error rate is calculated on the test set and finally, the average error rate is calculated.
- Stratified 10-fold cross-validation is the most commonly used variation. The number 10 is chosen based on
extensive testing and also some theoretical evidence showing this to give the best estimate of error.
- A single 10-fold cross-validation might not be enough to get a reliable error estimate if the data is limited.
Different 10-fold cross-validation experiments with the same learning scheme and dataset often produce different
results, because of the effect of random variation in choosing the folds themselves.
- Due to stratification, results across different complete k-fold runs might change even for the same k value.
Therefore, it’s also common practice to repeat the k-fold runs multiple times.
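A short sketch of repeated stratified 10-fold cross-validation with scikit-learn; the synthetic data set and the decision-tree learner are stand-ins for whatever scheme is being evaluated.

```python
# Repeated stratified 10-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 10 folds, repeated 3 times with different random fold assignments.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(f"mean accuracy = {scores.mean():.3f}, error rate = {1 - scores.mean():.3f}")
```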
Evaluating Numeric Prediction
- The basic principles—using an independent test set rather than the training set for performance evaluation, the holdout method, cross-validation—apply equally well to numeric prediction. But the basic quality measure offered by the error rate is no longer appropriate: errors are not simply present or absent, they come in different sizes.
- Mean-squared error (MSE) is the principal and most commonly used measure; sometimes the square root is taken to give it the same dimensions as the predicted value itself.
- Mean absolute error (MAE) is an alternative: just average the magnitude of the individual errors without
taking account of their sign.
- Mean-squared error tends to exaggerate the effect of outliers—instances whose prediction error is larger than the
others—
- But absolute error does not have this effect: all sizes of error are treated evenly according to their
magnitude.
- Sometimes it is the relative rather than absolute error values that are of importance.
- For example, if a 10% error is equally important whether it is an error of 50 in a prediction of 500 or an error
of 0.2 in a prediction of 2, then averages of absolute error will be meaningless: relative errors are appropriate.
This effect would be taken into account by using the relative errors in the mean-squared error calculation or the
mean absolute error calculation.
- Relative squared error (RSE) is made relative to what it would have been if a simple predictor had been used.
- Thus relative squared error takes the total squared error and normalizes it by dividing by the total squared error
of the default predictor.
- Relative absolute error (RAE) is just the total absolute error, with the same kind of normalization.
- The correlation coefficient measures the statistical correlation between the a's and the p's (a is the actual value, p is the predicted value).
- The correlation coefficient ranges from 1 through 0 to −1; negative values should not occur for reasonable prediction methods.
- The squared error measures and root squared error measures weigh large discrepancies much more heavily than small
ones, whereas the absolute error measures do not.
- Taking the square root (root mean-squared error) just reduces the figure to have the same dimensionality as the
quantity being predicted.
- The relative error figures try to compensate for the basic predictability or unpredictability of the output variable: if it tends to lie fairly close to its average value, then you expect prediction to be good and the relative figure compensates for this.
- If you have outliers in the data and you want to ignore them, MAE is a better option, but if you want to account for them in your loss function, go for MSE/RMSE.
- Evaluation Metrics for Regression models- MAE Vs MSE Vs RMSE vs RMSLE
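A quick sketch contrasting MAE, MSE, and RMSE on the same predictions; the numbers are invented, with one deliberate outlier to show how the squared measures exaggerate it.

```python
# Comparing MAE, MSE and RMSE; the values are invented, with one outlier.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = np.array([10.0, 12.0, 15.0, 20.0, 22.0])
predicted = np.array([11.0, 12.5, 14.0, 19.0, 40.0])   # last one is an outlier

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)

# MSE/RMSE are inflated far more by the single outlier than MAE is.
print(f"MAE={mae:.2f}, MSE={mse:.2f}, RMSE={rmse:.2f}")
```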
Confusion matrices and accuracy scores
- The overall success rate is the number of correct classifications divided by the total number of classifications; for a two-class problem this is (TP + TN) / (TP + TN + FP + FN).
- Error Rate = 1 − Success Rate
- In a multiclass prediction, the result on a test set is often displayed as a two-dimensional confusion matrix
with a row and column for each class. Each matrix element shows the number of test examples for which the actual class
is the row and the predicted class is the column.
- To make this calculation fairer, we'd need to subtract the correct predictions that could have been made by a random predictor. This is called the Kappa statistic.
- Kappa is calculated as (CorrectPredictionsByTheModel − CorrectPredictionsByRandomSelection) / (TotalSamples − CorrectPredictionsByRandomSelection).
- The max value for Kappa is 100%.
- Kappa statistic is used to measure the agreement between predicted and observed categorizations of a dataset,
while correcting for agreement that occurs by chance.
- Like the plain success rate, Kappa does not take costs into account.
- Cost-sensitive learning, In the two-class situation, there is a simple and general way to make any learning
scheme cost sensitive.
- The idea is to generate training data with a different proportion of yes and no instances. Suppose you
artificially increase the number of no instances by a factor of 10 and use the resulting dataset for training. If
the learning scheme is striving to minimize the number of errors, it will come up with a decision structure that
is biased toward avoiding errors on the no instances, because such errors are effectively penalized 10-fold. If
data with the original proportion of no instances is used for testing, fewer errors will be made on these than on
yes instances—i.e., there will be fewer FPs than FNs— because FPs have been weighted 10 times more heavily than
FNs. Varying the proportion of instances in the training set is a general technique for building cost-sensitive
classifiers.
- Lift factor: Imagine you run a direct-mailing business and plan to send 1,000,000 mailshots to households, with a response rate of 0.1% (1,000 respondents). Suppose a data mining tool, based on known information about the households, identifies a subset of 100,000 households for which the response rate is 0.4% (400 respondents). You can calculate the cost effect and choose between sending 1,000,000 mails with 1,000 responses or 100,000 mails with 400 responses. In marketing terminology, the increase in response rate, a factor of four in this case, is known as the lift factor yielded by the learning tool. If you knew the costs, you could determine the payoff implied by a particular lift factor.
Receiver Operating Characteristic Curve (ROC)
- Precision p is the number of correctly classified positive examples divided by the total number of examples that
are classified as positive.
- How likely a member of the predicted-positive set is to be a true positive (TP).
- Precision does not include the number of true negatives and is not affected by class imbalance.
- Recall r is the number of correctly classified positive examples divided by the total number of actual positive
examples in the test set.
- How representative TP is of all Actual Positives.
- Although in theory precision and recall are not related, in practice high precision is achieved almost always at the
expense of recall and vice versa.
- Therefore, F-score, harmonic mean of precision and recall, is used
- Both p and r have to be high for the F-score to be high.
- The precision and recall break-even point is where p and r are equal or closest.
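A small sketch computing the confusion matrix, precision, recall, and F-score with scikit-learn; the label vectors are invented for illustration (1 is the positive class).

```python
# Precision, recall and F-score from a confusion matrix; labels are invented.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                       # 2 TP, 1 FP, 2 FN, 5 TN
print(precision_score(y_true, y_pred))      # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))         # TP / (TP + FN) = 2/4
print(f1_score(y_true, y_pred))             # harmonic mean of precision and recall
```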
TODO:
- Activity : 2.4.1 Activity [Diez Book - page 310, for calculations]
WEEK 3
Main Topics
After completing this week you should be able to:
Sub titles:
Evaluating the suitability of given data for use in analysis
- Week 1 [What are data and what is a data set ?] title.
- Lines beginning with a % sign are comments.
- @relation gives the name of the dataset (weather)
- @attributes defines attributes (outlook, temperature, humidity, windy, play?).
- Nominal attributes are followed by the set of values they can take on, enclosed in curly braces.
- Values can include spaces; if so, they must be placed within quotation marks.
- Numeric values are followed by the keyword numeric.
- the class attribute is not distinguished in any way in the data file.
- @data line that signals the start of the instances in the dataset.
- Instances are written one per line, with values for each attribute in turn, separated by commas.
- If a value is missing it is represented by a single question mark (there are no missing values in this dataset).
- ARFF has 5 different attribute types:
- numeric
- nominal
- string attributes
- String attributes have values that are textual.
- String attributes can have values that are very long—even a whole document.
- date attributes
- Weka uses the ISO-8601 combined date and time format yyyy-MM-dd’T’HH:mm:ss
- relation-valued attributes
- they allow multi-instance problems to be represented in ARFF format
@attribute bag relational
  @attribute outlook {sunny, overcast, rainy}
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy {true, false}
@end bag
- Note, however, that in multi-instance learning the order in which the instances is given is generally considered
unimportant. An algorithm might learn that cricket can be played if none of the days is rainy and at least one is
sunny, but not that it can only be played in a certain sequence of weather events.
Sparse Data
- Sometimes most attributes have a value of 0 for most of the instances; e.g., market-basket data showing the purchased items will have a lot of 0s for all the other items in the stock.
- Another example is text mining where attributes are the words that occur in the documents.
- Sparse data files have the same @relation and @attribute tags, followed by an @data line, but the data section is
different and contains specifications in braces such as those shown previously.
- Instead of ‘0, X, 0, 0, 0, 0, Y, 0, 0, 0, “class A”’, ‘{1 X, 6 Y, 10 “class A”}’ saves space.
- Note that the omitted values have a value of 0—they are not “missing” values! If a value is unknown, it must be
explicitly represented with a question mark.
Attribute Types
- How an attribute is processed during cleanup also depends on how the model is going to treat that value. For example,
if a numerical value is going to be used as numerical or ordinal.
- If a learning scheme treats numeric attributes as though they are measured on ratio scales, the question of
normalization arises.
- Attributes are often normalized to lie in a fixed range—usually from zero to one—by dividing all values by the
maximum value encountered or
- By subtracting the minimum value and dividing by the range, max-min.
- Another normalization technique is to subtract the mean from each value and divide the result by the standard
deviation.
- This process is called standardizing a statistical variable and results in a set of values whose mean is zero
and the standard deviation is one.
- For distance-based algorithms, nominal attributes can be converted to synthetic binary attributes, ie for
sunny/cloudy/rainy, we use one binary attribute for each.
- Sometimes there is a genuine mapping between nominal attributes and numeric scales, e.g., zip codes or the leading digits of a phone number map to geographic location.
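A tiny numpy sketch of the two normalisation schemes just described (min-max scaling and standardisation); the values are arbitrary.

```python
# Min-max normalisation and standardisation with plain numpy.
import numpy as np

values = np.array([10.0, 12.0, 15.0, 20.0, 43.0])

# Min-max normalisation: subtract the minimum, divide by the range (max - min).
min_max = (values - values.min()) / (values.max() - values.min())

# Standardisation: subtract the mean, divide by the standard deviation.
standardised = (values - values.mean()) / values.std()

print(min_max)                                   # all values now in [0, 1]
print(standardised.mean(), standardised.std())   # approximately 0 and 1
```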
Missing Values
- Understand Missing data and 5 ways to handle!
- Missing values are usually indicated with a number that normally cannot appear for that attribute, ie -1 or 0.
- Different missing values can be distinguished by different values, ie -1: non-recorded, -2: irrelevant etc.
- For nominal attributes, missing values may be indicated by blanks or dashes. Sometimes different kinds of missing values are distinguished (e.g., unknown vs unrecorded vs irrelevant values) and perhaps represented by different negative integers (−1, −2, etc.).
- When a value is missing, the question that should be asked is: is it missing just as a random event, or was a decision taken not to carry out a particular measurement as a result of a problem with the specimen?
- Imputation means putting the most reasonable value in place of the missing data, for example filling in a missing salary with 5,000 euros (a small sketch follows below).
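The sketch referred to above: simple imputation of a missing numeric attribute in pandas, using either a domain-chosen constant (the 5,000-euro example from the note) or the column mean. The salary figures are invented.

```python
# Two simple imputation options for a numeric attribute with missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [4200, 5100, np.nan, 4800, np.nan]})

# Option 1: fill with a domain-chosen constant (e.g., 5,000 euros).
constant_filled = df["salary"].fillna(5000)

# Option 2: fill with the mean of the observed values.
mean_filled = df["salary"].fillna(df["salary"].mean())

print(constant_filled.tolist())
print(mean_filled.tolist())
```

Whichever option is used, the choice should be justified and documented, since a careless fill can distort the statistics of the attribute.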
Inaccurate values
- It is important to check data mining files carefully for rogue attributes and attribute values. The data may not have
been explicitly collected for that purpose.
- Typographic errors, ie spelling variations of a nominal attribute, extra spaces, upper/lower cases mismatches etc
would lead to possible extra values.
- To spot such issues, frequency tables and histograms can be useful.
- Typographical or measurement errors in numeric values generally cause outliers that can be detected by graphing one
variable at a time.
- Some incorrect values can be easy to spot due to huge variations from the rest but other times, this might require
good domain knowledge.
- Duplicate data is another potential source of the problem, ie duplication of records leads to most ML algos giving
them more weight.
Unbalanced Data
- In practical applications of classification schemes, it is very often the case that one class is far more prevalent than the others. For example, when predicting the weather in Ireland, it is pretty safe to predict that tomorrow will be rainy rather than sunny. Given a dataset in which these two values form the class attribute, with information relevant to the forecast in the other attributes, excellent accuracy is obtained by predicting rainy regardless of the values of the attributes.
- Always predicting the majority outcome rarely says anything interesting about the data. The problem is that raw accuracy, measured by the proportion of correct predictions, is not necessarily the best criterion of success. In practice, different costs may be associated with the two types of error.
Getting to know your data
- Simple tools that show histograms of the distribution of values of nominal attributes, and graphs of the values of
numeric attributes (perhaps sorted or simply graphed against instance number), are very helpful.
- These graphical visualizations of the data make it easy to identify outliers, which may well represent errors in the data file—or arcane conventions for coding unusual situations, such as a missing year as 9999 or a missing weight as −1 kg, that no one has thought to tell you about.
- Domain experts need to be consulted to explain anomalies, missing values, the significance of integers that represent
categories rather than numeric quantities, and so on. Pairwise plots of one attribute against another, or each
attribute against the class value, can be extremely revealing.
- In a large database, you should sample a few instances and examine them carefully.
Cleaning Data
- 5 Principles
- You can’t fix problems until you can see them
- Some within range values can still be problematic if they are repeated erroneously or put in by someone to
fix ‘missing’ value, ie using the mean or median for this purpose.
- So, looking for suspicious repeating values, especially with decimal places.
- Don't fix problems by making things worse
- Refrain from making ‘fixes’ that might distort the statistics.
- Justify your fixes
- Sometimes you shouldn’t fix
- Even if they won’t like to hear it, be honest to the audience and don’t attempt to fix things that you
actually can’t.
- Acknowledge residual uncertainty
- Data should be cleansed when acquired for many reasons:
- Not everyone spots the data anomalies. Decision-makers may make costly mistakes on information based on incorrect
data from applications that fail to correct for the faulty data.
- If errors are not corrected early on in the process, the cleansing will have to be done for every project that
uses that data.
- Data errors may point to a business process that isn't working as designed. For example, we discovered clients who abused the couponing system and earned money while purchasing groceries.
- Data errors may point to defective equipment, such as broken transmission lines and defective sensors.
- Data errors can point to bugs in software or in the integration of software that may be critical to the company.
Improving Decision Tree
- Decision trees induced from training data can often be simplified, without loss of accuracy, by discarding
misclassified instances from the training set, relearning, and then repeating until there are no misclassified
instances.
- Experiments show that this hardly affects the classification accuracy but significantly simplifies the DT.
- This method also helps to verify the subtree pruning and also gives a chance to review the misclassified records, as they might be erroneous in the first place anyway.
- Tackling the noise
- If attribute noise is present in the data the model will be applied to, instead of trying to remove it, the training data should contain a similar level of noise; a mismatch between the noise levels of the training and test data should be avoided. This is more realistic, as real-life attributes are likely to contain noise too.
- If there is class noise, however, then this should either be removed or data with noise-free classes should be
collected. But with class noise (rather than attribute noise), it is best to train on noise free instances if
possible.
Robust regression
Detecting Anomalies
- Auto-detection of outliers/anomalies is prone to errors, especially if a human expert isn’t consulted
- In statistical regression, visualizations help.
- most classification problems cannot be so easily visualized: the notion of “model type” is more subtle than a
regression line.
- One solution that has been tried is to use several different learning schemes— such as a decision tree, and a
nearest-neighbor learner, and a linear discriminant function—to filter the data.
- Training all three schemes on the filtered data and letting them vote can yield even better results.
- However, there is a danger to voting techniques: some learning algorithms are better suited to certain types
of data than others, and the most appropriate scheme may simply get out-voted!
- Method of combining the output from different classifiers, called stacking in Ensemble learning.
- One possible danger with filtering approaches is that they might conceivably just be sacrificing instances of a
particular class (or group of classes) to improve accuracy on the remaining classes
- Automatic filtering is a poor substitute for getting the data right in the first place. And if this is too
time-consuming and expensive to be practical, human inspection could be limited to those instances that are identified
by the filter as suspect.
One Class Learning
- If only one class is available during training but the prediction data presents records from unknown classes, then a one-class learner separates instances into the target (known) class and an unknown class.
Outlier Detection
- One-class classification is often called outlier (or novelty) detection because the learning algorithm is being used
to differentiate between data that appears normal and abnormal with respect to the distribution of the training data.
- A generic statistical approach to one-class classification is to identify outliers as instances that lie beyond a
distance d from a given percentage p of the training data.
- Alternatively, a probability density can be estimated for the target class by fitting a statistical distribution,
such as a Gaussian, to the training data; any test instances with a low probability value can be marked as
outliers.
- If an appropriate distribution for the data at hand cannot be identified, one can adopt a non-parametric approach
such as kernel density estimation.
- An advantage of the density estimation approach is that the threshold can be adjusted at prediction time to
obtain a suitable rate of outliers.
- Multi-class classifiers can be tailored to the one-class situation by fitting a boundary around the target data.
- If the boundary is chosen too conservatively, data in the target class will erroneously be rejected. If it is chosen too liberally, the model will accept too much data that lies outside the target class.
Generating Artificial Data
- The most straightforward approach is to generate uniformly distributed data and learn a classifier that can
discriminate this from the target.
- However, different decision boundaries will be obtained for different amounts of artificial data.
- To overcome this, rather than classifiers, class probability estimators such as bagged decision trees can be used.
Then, the threshold level of the probability estimator can be adjusted as needed.
- There is one significant problem. As the number of attributes increases, it quickly becomes infeasible to generate
enough artificial data to obtain adequate coverage of the instance space, and the probability that a particular
artificial instance occurs inside or close to the target class diminishes to a point that makes any kind of
discrimination impossible.
- The solution is to generate artificial data that is as close as possible to the target class. In this case,
because it is no longer uniformly distributed, the distribution of this artificial data—call this the “reference”
distribution—must be taken into account when computing the membership scores for the resulting one-class model.
- In other words, the class probability estimates of the two-class classifier must be combined with the
reference distribution to obtain membership scores for the target class.
Attribute Selection
Scheme Independent Selection
- When selecting a good attribute subset, there are two fundamentally different approaches.
- One is to make an independent assessment based on general characteristics of the data;
- This is called the filter method, because the attribute set is filtered to produce the most promising subset
before learning commences.
- the other is to evaluate the subset using the machine learning algorithm that will ultimately be employed for
learning
- This is the wrapper method, because the learning algorithm is wrapped into the selection procedure.
- One simple scheme-independent method of attribute selection is to use just enough attributes to divide up the instance space in a way that separates all the training instances.
- Machine learning algorithms can be used for attribute selection. For instance, you might first apply a decision tree
algorithm to the full dataset, and then select only those attributes that are actually used in the tree. While this
selection would have no effect at all if the second stage merely built another tree, it will have an effect on a
different learning algorithm.
- Often the decision tree performs just as well when only the two or three top attributes are used for its construction—and it is much easier to understand.
- Another possibility is to use an algorithm that builds a linear model—e.g., a linear support vector machine—and ranks
the attributes based on the size of the coefficients. A more sophisticated variant applies the learning algorithm
repeatedly. It builds a model, ranks the attributes based on the coefficients, removes the lowest-ranked one, and
repeats the process until all attributes have been removed.
- This method of recursive feature elimination has been found to yield better results on certain datasets (e.g., when identifying important genes for cancer classification) than simply ranking attributes based on a single model (a code sketch is given at the end of this section).
- Attributes can be selected using instance-based learning methods too. You could sample instances randomly from the training set and check neighboring records of the same and different classes—"near hits" and "near misses." If a near hit has a different value for a certain attribute, that attribute appears to be irrelevant and its weight should be decreased. On the other hand, if a near miss has a different value, the attribute appears to be relevant and its weight should be increased.
- Another way of eliminating redundant attributes as well as irrelevant ones is to select a subset of attributes that
individually correlate well with the class but have little intercorrelation.
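The sketch referred to above: recursive feature elimination with a linear SVM in scikit-learn, mirroring the rank-by-coefficient, drop-the-weakest, repeat idea; the synthetic data set is a stand-in for a real one.

```python
# Recursive feature elimination (RFE) with a linear SVM on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# At each step the attribute with the smallest coefficient is removed.
selector = RFE(estimator=LinearSVC(dual=False, max_iter=5000),
               n_features_to_select=5, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the retained attributes
print(selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```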
Searching the Attribute Space
- Most methods for attribute selection involve searching the space of attributes for the subset that is most likely to predict the class best. Fig. 8.1 illustrates the attribute space for the—by now all-too-familiar—weather dataset. The number of possible attribute subsets increases exponentially with the number of attributes, making exhaustive search impractical on all but the simplest problems.
- Starting with no attributes and adding them one at a time is called forward selection. Starting with the full set and deleting attributes one at a time is backward elimination.
- In forward selection, each attribute that is not already in the current subset is tentatively added to it, and the resulting set of attributes is evaluated—using, e.g., cross-validation as described in the following section. This evaluation produces a numeric measure of the expected performance of the subset. The effect of adding each attribute in turn is quantified by this measure, the best one is chosen, and the procedure continues. However, if no attribute produces an improvement when added to the current subset, the search ends (a sketch of this greedy procedure follows below).
- Forward selection and backward elimination can be combined into a bidirectional search; again one can either begin with all the attributes or with none of them.
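The greedy forward-selection sketch referred to above, written directly in Python around cross-validated accuracy; the iris data set and the decision-tree learner are arbitrary stand-ins for the data and scheme being evaluated.

```python
# Greedy forward selection: start with no attributes, tentatively add each
# remaining one, keep the addition that most improves cross-validated
# accuracy, and stop when no addition helps.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected, best_score = [], 0.0

while remaining:
    scores = {
        f: cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, selected + [f]], y, cv=10).mean()
        for f in remaining
    }
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:      # no attribute improves the subset
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = scores[f_best]

print(selected, best_score)   # chosen attribute indices and their CV accuracy
```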
Scheme-Specific Selection
- The performance of an attribute subset with scheme-specific selection is measured in terms of the learning scheme's classification performance using just those attributes.
- The entire attribute selection process is rather computation intensive. If each evaluation involves a 10-fold cross-validation, the learning procedure must be executed 10 times.
- One way to accelerate the search process is to stop evaluating a subset of attributes as soon as it becomes apparent that it is unlikely to lead to higher accuracy than another candidate subset. This is a job for a paired statistical significance test, performed between the classifier based on this subset and all the other candidate classifiers based on other subsets. The performance difference between two classifiers on a particular test instance can be taken to be −1, 0, or +1 depending on whether the first classifier is worse than, the same as, or better than the second on that instance. A paired t-test can be applied to these figures over the entire test set, effectively treating the results for each instance as an independent estimate of the difference in performance.
- A simple method for accelerating scheme-specific search is to preselect a given number of attributes by ranking them first using a criterion like the information gain and discarding the rest before applying scheme-specific selection.
- Whatever way you do it, scheme-specific attribute selection by no means yields a uniform improvement in performance. Because of the complexity of the process, which is greatly increased by the feedback effect of including a target machine learning algorithm in the attribute selection loop, it is quite hard to predict the conditions under which it will turn out to be worthwhile. As in many machine learning situations, trial and error using your own particular source of data is the final arbiter.
Data Preparation and integration
Creating Analytics Base Table (ABT)
- The final step in data integration involves creating the data that are used as input to the ML algorithms. This data set is known as the analytics base table.
- The most important step in creating the analytics base table is the selection of the attributes that will be included
in the analysis.
- The selection is based on domain knowledge and on an analysis of the relationships between attributes.
- Attributes that are found to have a high correlation with other attributes are likely to be redundant, and so one of
the correlated attributes should be excluded.
- Removing redundant features can result in simpler models which are easier to understand, and also reduces the
likelihood of an ML algorithm returning a model that is fitted to spurious patterns in the data.
- The set of attributes selected for inclusion define what is known as the analytics record.
- An analytics record typically includes both raw and derived attributes
- After the analytics record has been designed, a set of records needs to be extracted and aggregated to create a data set for analysis. When these records have been created and stored—for example, in a database—this data set is commonly referred to as the analytics base table. The analytics base table is the data set that is used as input to the ML algorithms.
- selection of attributes that will be included in the analysis - based on domain knowledge
- removing redundant features –> simple models, easy to understand
- design analytics records - set of records to be extracted and aggregated to create a dataset
- ABT is used as input to the ml algorithm
WEEK 4
Main Topics
- Apply multiple regression and support vector regression
- Apply Naïve Bayes and support vector classification
- Choose between techniques for a given analysis in a principled way
- Identify, address and document residual threats to validity for a given analysis
After completing this week, you will have made significant steps towards achieving the following module learning
outcomes:
- MO3 — Critically evaluate and apply data mining techniques/tools to build a classifier or regression model, and
predict values for new examples
Sub titles:
Alternative techniques for regression
Numeric Prediction: Linear Regression
- When the outcome, or class, is numeric, and all the attributes are numeric, linear regression is a natural technique
to consider.
x = w0 + w1*a1 + w2*a2 + … + wk*ak
- where x is the class; a1, a2,…, ak are the attribute values; and w0, w1,…, wk are weights.
- The predicted value for the first instance’s class can be written as w0*a0(1) + w1*a1(1) + w2*a2(1) + … + wk*ak(1) = Σ(j=0..k) wj*aj(1), where the superscript (1) denotes the first training instance and a0 is an extra attribute whose value is always 1.
- Then the sum of the squares of the differences over all n training instances is Σ(i=1..n) ( x(i) − Σ(j=0..k) wj*aj(i) )^2, where x(i) is the actual class value of the i-th instance and the inner sum is its predicted value. This sum of squares is what we have to minimize by choosing the coefficients appropriately.
- Linear regression is an excellent, simple method for numeric prediction, and it has been widely used in statistical
applications for decades.
- Of course, basic linear models suffer from the disadvantage of, well, linearity. If the data exhibits a nonlinear
dependency, the best-fitting straight line will be found, where “best” is interpreted as the least mean-squared
difference. This line may not fit very well.
- However, linear models serve well as building blocks or starting points for more complex learning methods.
- Multiple regression
</br>
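- A minimal sketch of multiple regression by least squares (Python/NumPy; the attribute and class values are made up), showing how the weights are chosen to minimise the sum of squared differences between actual and predicted class values:
import numpy as np
A = np.array([[2.0, 3.0],                 # attribute values a1, a2 for each training instance
              [1.0, 5.0],
              [4.0, 2.0],
              [3.0, 6.0]])
x = np.array([13.0, 14.0, 14.0, 21.0])    # numeric class values
A1 = np.column_stack([np.ones(len(A)), A])    # prepend a0 = 1 so w0 acts as the intercept
w, *_ = np.linalg.lstsq(A1, x, rcond=None)    # least-squares estimates of w0, w1, w2
pred = A1 @ w                                 # predicted class values
print(w, ((x - pred) ** 2).sum())             # the weights and the minimised sum of squares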
Linear Classification: Logistic Regression
-
Logistic regression
-
Linear regression can easily be used for classification in domains with numeric attributes. Indeed, we can use any
regression technique for classification.
-
The trick is to perform a regression for each class, setting the output equal to one for training instances that
belong to the class and zero for those that do not. The result is a linear expression for the class. Then, given a
test example of unknown class, calculate the value of each linear expression and choose the one that is largest. When
used with linear regression, this scheme is sometimes called multiresponse linear regression.
- Multiresponse linear regression has 2 drawbacks:
- the membership values it produces are not proper probabilities because they can fall outside the range 0-1.
- least-squares regression assumes that the errors are not only statistically independent, but are also normally
distributed with the same standard deviation, an assumption that is blatantly violated when the method is applied
to classification problems because the observations only ever take on the values 0 and 1.
-
A related statistical technique called logistic regression does not suffer from these problems. Instead of
approximating the 0 and 1 values directly, thereby risking illegitimate probability values when the target is
overshot, logistic regression builds a linear model based on a transformed target variable.
- The transformation function used is the logit transformation, which maps a probability P in the range (0, 1) to log( P / (1 − P) ).
- The resulting model is Pr[1 | a1, a2, …, ak] = 1 / (1 + exp(−w0 − w1*a1 − … − wk*ak)).
- In logistic regression, instead of the squared error, the log-likelihood of the model is maximized. This is given by Σ(i=1..n) [ (1 − x(i)) * log(1 − Pr[1 | a1(i), …, ak(i)]) + x(i) * log(Pr[1 | a1(i), …, ak(i)]) ], where the x(i) are the observed 0/1 class values.
Linear Classification using the Perceptron
- Logistic regression attempts to produce accurate probability estimates by maximizing the probability of the training
data.
-
If the data can be separated perfectly into two groups using a hyperplane, it is said to be linearly separable. It turns out that if the data is linearly separable, there is a very simple algorithm for finding a separating hyperplane, called the perceptron learning rule.
</br>
- The resulting hyperplane is called a perceptron, and it’s the grandfather of neural networks
</br>
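- A minimal sketch of the perceptron learning rule in Python (the data points are made up and chosen to be linearly separable): whenever an instance is misclassified, its attribute vector is added to or subtracted from the weight vector, and the passes repeat until no mistakes remain.
import numpy as np
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])                  # classes coded as +1 / -1
Xb = np.column_stack([np.ones(len(X)), X])    # extra attribute fixed at 1 acts as the bias
w = np.zeros(Xb.shape[1])                     # start with the zero weight vector
converged = False
while not converged:                          # guaranteed to stop for separable data
    converged = True
    for xi, yi in zip(Xb, y):
        if yi * (w @ xi) <= 0:                # misclassified instance
            w += yi * xi                      # add (or subtract) the instance vector
            converged = False
print(w)                                      # weights of a separating hyperplane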
Linear Classification using Winnow
- The perceptron algorithm is not the only method that is guaranteed to find a separating hyperplane for a linearly
separable problem.
- For datasets with binary attributes there is an alternative known as Winnow. The structure of the two algorithms is
very similar. Like the perceptron, Winnow only updates the weight vector when a misclassified instance is
encountered—it is mistake driven.
- Winnow is very effective in homing in on the relevant features in a dataset— therefore it is called an
attribute-efficient learner. That means that it may be a good candidate algorithm if a dataset has many (binary)
features and most of them are irrelevant.
Extending Linear Models
- Support vector machines use linear models to implement nonlinear class boundaries.
- How can this be possible? The trick is easy: transform the input using a nonlinear mapping. In other words, transform
the instance space into a new space. With a nonlinear mapping, a straight line in the new space does not look straight
in the original instance space. A linear model constructed in the new space can represent a nonlinear decision
boundary in the original space.
x = w1*a1^3 + w2*a1^2*a2 + w3*a1*a2^2 + w4*a2^3
- Here, x is the outcome, a1 and a2 are the two attribute values, and there are four weights wi to be learned.
- Each data set can be modelled with a sufficiently complicated polynomial, but there are two major problems with this approach:
- As the number of attributes and the polynomial degree increase, the number of coefficients and the required computational power increase drastically, quickly going beyond practical limits
- There also is a risk of overfitting
The Maximum Margin Hyperplane
- Support vector machines address both problems. They are based on an algorithm that finds a special kind of linear
model: the maximum margin hyperplane.
- A hyperplane is just another term for a linear model.
-
To visualize a maximum margin hyperplane, imagine a two-class data set whose classes are linearly separable; i.e.,
there is a hyperplane in instance space that classifies all training instances correctly. The maximum margin
hyperplane is the one that gives the greatest separation between the classes—it comes no closer to either than it has
to.
- The instances that are closest to the maximum margin hyperplane—the ones with minimum distance to it—are called
support vectors.
- There is always at least one support vector for each class, and often there are more
- The important thing is that the set of support vectors uniquely defines the maximum margin hyperplane for the learning
problem. Given the support vectors for the two classes, we can easily construct the maximum margin hyperplane. All
other training instances are irrelevant—they can be deleted without changing the position and orientation of the
hyperplane.
x = b + Σ(i ∈ support vectors) αi * yi * (a(i) · a), where yi is the class value (+1 or −1) of training instance a(i), a is the instance to classify, “·” denotes the dot product, and b and the αi are parameters determined by the optimization.
Non Linear Class Boundries
- With support vectors, overfitting is reduced.
- The reason is that overfitting is inevitably associated with instability: with an algorithm that overfits, changing one or two instance vectors will make sweeping changes to large sections of the decision boundary.
- But the maximum margin hyperplane is relatively stable: it only moves if training instances are added or deleted
that are support vectors—and this is true even in the high-dimensional space spanned by the nonlinear
transformation.
- Overfitting is caused by too much flexibility in the decision boundary. The support vectors are global
representatives of the whole set of training points, and there are often relatively few of them, which gives
little flexibility. Thus overfitting is less likely to occur.
- Computational complexity is still a problem:
- According to the preceding equation, every time an instance is classified its dot product with all support vectors must be calculated. In the high-dimensional space produced by the nonlinear mapping this is rather expensive.
Obtaining the dot product involves one multiplication and one addition for each attribute, and the number of
attributes in the new space can be huge. This problem occurs not only during classification but also during
training, because the optimization algorithms have to calculate the same dot products very frequently.
- Fortunately, it turns out that it is possible to calculate the dot product before the nonlinear mapping is performed,
on the original attribute set. A high-dimensional version of the preceding equation is simply
- x = b + Σ(i ∈ support vectors) αi * yi * (a(i) · a)^n
- The function (x.y)^n, which computes the dot product of two vectors x and y and raises the result to the power n, is
called a polynomial kernel.
- A good way of choosing the value of n is to start with 1 (a linear model) and increment it until the estimated
error ceases to improve. Usually, quite small values suffice .
- Other kernel functions can be used instead to implement different nonlinear mappings.
- Two that are often suggested are the radial basis function (RBF) kernel and the sigmoid kernel.
- Which one produces the best results depends on the application, although the differences are rarely large in
practice
- It is interesting to note that a support vector machine with the RBF kernel corresponds to a type of neural
network called an RBF network (which we describe later), and one with the sigmoid kernel implements another type
of neural network, a multilayer perceptron with one hidden layer (also described later).
- Support vector machine with the RBF kernel corresponds to a type of neural network called an RBF network
- One with the sigmoid kernel implements another type of neural network, a multilayer perceptron with one hidden
layer
- SVM normally assumes linear separability of the training data either in the instance space or in a transformed
space.
- Where this isn’t the case, SVM can still be used by tuning the ‘C’ parameter, but this is a manual, trial-and-error process.
- we should mention that compared with other methods such as decision tree learners, even the fastest training
algorithms for support vector machines are slow when applied in the nonlinear setting.
- On the other hand, they often produce very accurate classifiers because subtle and complex decision boundaries can
be obtained.
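- As a hedged illustration (using scikit-learn, which is an assumption rather than a tool named in the notes), the sketch below tries a polynomial kernel with increasing degree n, and an RBF kernel, on a small synthetic nonlinear dataset:
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # nonlinear two-class data
for degree in (1, 2, 3, 4):                  # start with n = 1 (linear) and increment
    clf = SVC(kernel="poly", degree=degree, C=1.0)
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print("poly degree", degree, "estimated accuracy", round(acc, 3))
rbf = SVC(kernel="rbf", C=1.0)               # radial basis function kernel
print("rbf estimated accuracy", round(cross_val_score(rbf, X, y, cv=5).mean(), 3))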
Support Vector Regression
-
The concept of a maximum margin hyperplane only applies to classification. However, support vector machine algorithms
have been developed for numeric prediction that share many of the properties encountered in the classification case:
they produce a model that can usually be expressed in terms of a few support vectors and can be applied to nonlinear
problems using kernel functions.
- As with linear regression, the basic idea of SVR is to find a function that approximates the training points with
minimised errors.
- The crucial difference is that
- All deviations up to a user-specified parameter ε are simply discarded. ε defines a tube around the regression
function in which errors are ignored.
- Also, when minimizing the error, the risk of overfitting is reduced by simultaneously trying to maximize the
flatness of the function.
- Another difference is that what is minimized is normally the predictions’ absolute error instead of the
squared error used in linear regression.
- The value of ε controls how closely the function will fit the training data.
- Too large a value will produce a meaningless predictor—in the extreme case, when 2ε exceeds the range of class
values in the training data, the regression line is horizontal and the algorithm just predicts the mean class
value.
- For small values of ε there may be no tube that encloses all the data. In that case some training points will have
nonzero error, and there will be a tradeoff between the prediction error and the tube’s flatness
</br>
- In the degenerate case ε = 0 the algorithm simply performs least-absolute-error regression under the coefficient size
constraint, and all training instances become support vectors.
- Conversely, if ε is large enough that the tube can enclose all the data, the error becomes zero, there is no
tradeoff to make, and the algorithm outputs the flattest tube that encloses the data irrespective of the value of
C.
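- A minimal sketch of support vector regression (scikit-learn; the data and parameter values are illustrative): increasing ε widens the tube in which errors are ignored, which typically reduces the number of support vectors.
import numpy as np
from sklearn.svm import SVR
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 60).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)    # noisy nonlinear target
for eps in (0.01, 0.1, 1.0):
    model = SVR(kernel="rbf", epsilon=eps, C=1.0).fit(X, y)
    print("epsilon", eps, "support vectors:", len(model.support_))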
Alternative Techniques for classification
- There are three steps in deciding which analysis technique to use:
- Determine the techniques that could work, qualitatively:
- Select the general type, regression vs classification
- Determine the right specific features;
- Descriptive vs regression: are descriptive statistics like means appropriate, or do you need to be able to input the values of an instance and predict another field?
- Binary vs multiway classification : “yes/no”, “true/false”
- Do you need feature interactions, i.e., does the value of one attribute affect the meaning of another? If yes, this is likely to require employing neural networks.
- Narrow the search to the candidates that seem most likely to work well, quantitatively:
- Empirically test those candidates against each other to find a winner:
- Compare accuracy and error rates
- Think about hypothesis testing (out of scope)
Naive Bayes for Document classification
- An important domain for machine learning is document classification, in which each instance represents a document and
the instance’s class is the document’s topic.
- Naïve Bayes is a popular technique for this application because it is very fast and quite accurate.
- However, the standard model does not take into account the number of occurrences of each word, which is potentially useful information when determining the category of a document.
- Instead, a document can be viewed as a bag of words—a set that contains all the words in the document, with multiple occurrences of a word appearing multiple times (technically, a set includes each of its members just once, whereas a bag can have repeated elements).
- Word frequencies can be accommodated by applying a modified form of Naïve Bayes called multinomial Naïve Bayes.
- The probability of a document E given its class H—in other words, the formula for computing the probability P(E|H) in Bayes’ rule—is given below, where ni is the number of times word i occurs in the document and N = n1 + n2 + … + nk is the total number of words in the document.
- Pi is estimated by computing the relative frequency of word i in the text of all training documents pertaining to category H.
P(E|H) = N! × Π(i=1..k) ( Pi^ni / ni! )
- In the multinomial Naïve Bayes formulation a document’s class is determined not just by the words that occur in it
but also by the number of times they occur.
- In general it performs better than the ordinary Naïve Bayes model for document classification, particularly for
large dictionary sizes.
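- A minimal sketch of multinomial Naïve Bayes for document classification (scikit-learn; the documents and labels are toy examples), where bag-of-words counts mean that word frequencies influence the predicted class:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
docs = ["the striker scored two goals",
        "the keeper saved the penalty",
        "shares fell as markets dropped",
        "the bank raised interest rates"]
labels = ["sport", "sport", "finance", "finance"]
counts = CountVectorizer().fit(docs)                 # bag-of-words counts per document
clf = MultinomialNB().fit(counts.transform(docs), labels)
print(clf.predict(counts.transform(["interest in the markets dropped"])))   # expected: ['finance']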
Choosing a learning technique for a given analysis
The Model Selection and Tuning Problem
- A key activity in computer science is taking a problem and defining it very precisely. This is necessary if you plan
to automate solving the problem, but that’s not the only reason to do it. Sometimes defining a problem precisely helps
you understand it.
Model Selection and Management Systems (MSMS)
-
Article
- Model selection management systems (MSMS) : To make the iterative process of model selection easier and faster, we
envision a unifying abstract framework that acts as a basis for a new class of analytics systems.
- Model Selection: Broadly defined, model selection is the process of building a precise prediction function to make
“satisfactorily” accurate predictions about a data-generating process using data generated by the same process
- Feature Engineering (FE) is the process of converting raw data into a precise feature vector that provides the
domain of the prediction function (a learned ML model).
- FE includes a variety of options (a sequence of computational operations), e.g., counting words or selecting a
feature subset. Some options, such as subset selection and feature ranking, are well studied
- FE is considered a domain-specific “black art” , mostly because it is influenced by many technical and logistical
factors, e.g., data and application properties, accuracy, time, interpretability, and company policies.
- Algorithm Selection (AS) is the process of picking an ML model, i.e., an inductive bias, that fixes the hypothesis
space of prediction functions explored for a given application.
- For example, logistic regression and decision trees are popular ML techniques for classification applications.
- Parameter Tuning (PT) is the process of choosing the values of (hyper-)parameters that many ML models and
algorithms have.
- For example, logistic regression is typically used with a parameter known as the regularizer. Such parameters are
important because they control accuracy-performance tradeoffs, but tuning them is challenging partly because the
optimization problems involved are usually non-convex .
Model Selection Triple
- A large body of work in ML focuses on various theoretical aspects of model selection. But from a practical
perspective, we found that analysts typically use an iterative exploratory process.
- While the process varies across analysts, we observed that the core object of their exploration is identical – an
object we call the model selection triple (MST).
- It has three components (MST):
- an FE “option” (loosely defined, a sequence of computation operations) that fixes the feature set that represents
the data
- an AS option that fixes the ML algorithm,
- a PT option that fixes the parameter choices conditioned on the AS option.
- Model selection is iterative and exploratory because the space of MSTs is usually infinite, and it is generally
impossible for analysts to know a priori which MST will yield satisfactory accuracy and/or insights.
Three-Phase Iteration.
- Divide an iteration into 3 phases.
- Steering: the analyst decides on an MST and specifies it in an ML-related language or GUI such as R, Scala, SAS, or
Weka.
- Execution: the system executes the MST to build and evaluate the ML model, typically on top of a data management
platform, e.g., an RDBMS or Spark.
- Consumption: the analyst assesses the results to decide upon the MST for the next iteration, or stops the process
- Iteration in an MSMS:
- Steering: an MSMS should offer a framework of declarative operations that enable analysts to easily group
logically related MSTs.
- For example, the analyst can just “declare” the set of tree heights and feature subsets (projections).
- Execution: an MSMS should include optimization techniques to reduce the runtime per iteration by exploiting the
set-oriented nature of specifying MSTs.
- For example, the system could share computations across different parameters or share intermediate
materialized data for different feature sets.
- Consumption: an MSMS should offer provenance management so that the system can help the analyst manage results and
help with optimization. For example, the analyst can inspect the results using standard queries to help steer the
next iteration, while the system can track intermediate data and models for reuse.
</br>
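- As a hedged sketch of declaring and executing a group of model selection triples (scikit-learn; the dataset, feature-selection options and parameter grid are illustrative assumptions), a pipeline plus a grid search covers steering (declaring the MSTs), execution (fitting and evaluating them all) and consumption (inspecting the best result):
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([("fe", SelectKBest(f_classif)),                 # FE option: feature subset selection
                 ("clf", LogisticRegression(max_iter=5000))])    # AS option: the learning algorithm
grid = {"fe__k": [5, 10, 20],        # FE choices
        "clf__C": [0.1, 1.0, 10.0]}  # PT choices (regularization strength)
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)   # execution over all declared MSTs
print(search.best_params_, round(search.best_score_, 3))   # consumption: steer the next iteration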
Technique comparisons
Comparing Data mining schemes
-
We often need to compare two different learning schemes on the same problem to see which is the better one to use. It
seems simple: estimate the error using cross-validation (or any other suitable estimation procedure), perhaps repeated
several times, and choose the scheme whose estimate is smaller.
-
If there were unlimited data, we could use a large amount for training and evaluate performance on a large independent
test set, obtaining confidence bounds just as before. However, if the difference turns out to be significant we must
ensure that this is not just because of the particular dataset we happened to base the experiment on. What we want to
determine is whether one scheme is better or worse than another on average, across all possible training and test
datasets that can be drawn from the domain. Because the amount of training data naturally affects performance, all
datasets should be the same size: indeed, the experiment might be repeated with different sizes to obtain a learning
curve.
-
For each learning scheme we can draw several datasets of the same size, obtain an accuracy estimate for each dataset
using cross-validation, and compute the mean of the estimates. Each cross-validation experiment yields a different,
independent error estimate. What we are interested in is the mean accuracy across all possible datasets of the same
size, and whether this mean is greater for one scheme or the other.
-
From this point of view, we are trying to determine whether the mean of a set of samples—cross-validation estimates
for the various datasets that we sampled from the domain—is significantly greater than, or significantly less than,
the mean of another. This is a job for a statistical device known as the t-test, or Student’s t-test. Because the same
cross-validation experiment can be used for both learning schemes to obtain a matched pair of results for each
dataset, a more sensitive version of the t-test known as a paired t-test can be used.
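- A minimal sketch of such a comparison (scikit-learn and SciPy; the dataset and the two schemes are illustrative): both schemes are evaluated on the same cross-validation folds and a paired t-test is applied to the matched accuracy estimates. (Fold estimates are not fully independent, so in practice a corrected variant of the test is often preferred.)
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
folds = KFold(n_splits=10, shuffle=True, random_state=0)    # the same folds for both schemes
acc_a = cross_val_score(GaussianNB(), X, y, cv=folds)
acc_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)             # paired t-test on matched estimates
print(acc_a.mean(), acc_b.mean(), t_stat, p_value)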
TODO:
- 4.4.1 Activity : Practical — applying alternative forms of regression
- 4.6.1 Activity : Practical — applying alternative forms of classification
- 4.7.2 Activity : Important
WEEK 5
Main Topics
After completing this week you should be able to:
- Describe the relational database model
- Explain the problems caused by data that is too large for RAM
- Explain how Big Data causes problems for analytics
After completing this week, you will have made significant steps towards achieving the following module learning
outcomes:
- MO2 — Manipulate a data set to extract statistics and features
Sub titles:
Why databases?
- The answer is that the term ‘database’ can be used both informally (perhaps suggesting a collection of data on which we might base a decision) and more formally to mean a specific approach and software structure for storing and manipulating data.
Intro
-
Databases can store traditional numeric and alphanumeric values, and also multimedia data such as pictures, video or audio, biometric data such as fingerprints or retina scans, geographic information such as coordinates, etc.
- A database can be defined as a collection of related data items within a specific business process or problem setting.
- A Database Management System (DBMS) is the software package used to define, create, use and maintain a database.
- Popular DBMS vendors include Oracle, Microsoft and IBM; MySQL is a well-known open-source DBMS.
- The combination of a database with a DBMS is called a database system.
- Before databases, data was stored in files and this has the danger of data duplication, redundancy and inconsistency
as different applications kept records in different files.
- The other option is for all applications to access the same files to store and retrieve data, but this raises the problem of concurrency management, which isn’t easy either.
- With databases, the DBMS is in charge of delivering the data and managing data access. The DBMS also manages the metadata related to the stored data.
- The DBMS also provides database languages for both data definition and data querying/access; the best-known example is SQL.
-
The file-based approach results in strong application-data dependence, whereas the database approach allows applications to be independent of the data and data definitions.
- DBMS also help to achieve a looser application-data coupling as opposed to the stronger coupling with the file-based
approach.
Elements of Database System
Database User
- Information architect: designs the conceptual data model by working closely with the business.
- Database designer: converts the conceptual data model into the logical and internal data models.
- Database administrator (DBA): responsible for the implementation and monitoring of the database.
Database language
- The Data Definition Language (DDL) is used by the DBA to express the database’s external, logical and internal data models.
- The Data Manipulation Language (DML) is used to retrieve, insert, update and delete data.
- SQL offers both DDL and DML statements for relational databases.
Advantages of Database Systems and Database Management
- Databases offer several advantages, such as:
- data independence; managing structured, semi-structured and unstructured data;
- database modeling;
- managing data redundancy;
- specifying integrity rules;
- concurrency control;
- backup and recovery facilities;
- data security and performance utilities.
Data independence
- Means that changing data definition has minimal to no impact on the applications using the data.
- Physical data independence implies that neither the applications, views, nor logical data model need to be changed when changes are made to the data storage specifications in the internal data model.
- Logical data independence implies that software applications are minimally affected by changes in the conceptual or
logical data model. Consider the example of adding new data items, characteristics, or relationships.
Database Modeling
- A data model is an explicit representation of the data items together with their characteristics and relationships.
Also include integrity rules and functions.
- Popular examples of data models are the hierarchical model, the CODASYL model, the (E)ER model, the relational model,
and the object-oriented model.
Managing Structured, Semi-Structured, and Unstructured Data
- It is important to note that only structured data, not all kinds of data, can be described according to a formal logical data model.
- With structured data, individual characteristics of data items can be identified and formally specified, such as the
number, name, address, and email of a student, or the number and name of a course.
- The advantage is the ability to express integrity rules and in this way enforce the correctness of the data
- With unstructured data, i.e., text documents, there are no finer-grained components in a file or series of characters that
can be interpreted in a meaningful way by a DBMS or application.
- The data that isn’t completely structured or unstructured is called semi-structured. XML and NoSQL databases deal with
semi-structured data.
- The semi-structured data still has a certain structure, but the structure may be very irregular or highly
volatile.
Managing Data Redundancy
- The DBMS is responsible for the management of redundancy by providing synchronization facilities to safeguard data
consistency.
- DBMS might deliberately create data redundancy for faster retrieval where the data is distributed across different
locations.
Specifying Integrity Rules
- These rules can be used to enforce the correctness of the data.
- Syntactical rules specify how the data should be represented and stored. Examples are:
customerID should be represented as an integer
- Semantic rules focus on the semantic correctness or meaning of the data. Examples are:
customerID should be unique; account balance should be bigger than 0;
- In the database approach, integrity rules are specified as part of the conceptual/logical data model and are stored centrally in the catalog.
- This substantially improves the efficiency and maintainability of the applications since the integrity rules are
now directly enforced by the DBMS whenever anything is updated
Concurrency Control (ACID)
- A DBMS has built-in facilities to support concurrent or parallel execution of database programs, which allows for good
performance.
- A key concept is a database transaction that is a sequence of read/write operations, considered to be an atomic unit
in the sense that either all operations are executed or none at all.
- Typically, these read/write operations can be executed at the same time by the DBMS.
- DBMS must support the ACID (Atomicity, Consistency, Isolation, Durability) properties.
- Atomicity, or the all-or-nothing property, requires that a transaction should either be executed in its
entirety or not at all.
- Consistency assures that a transaction brings the database from one consistent state to another.
- Isolation ensures that the effect of concurrent transactions should be the same as if they had been executed
in isolation.
- Durability ensures that the database changes made by a transaction declared successful can be made permanent
under all circumstances.
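- A minimal sketch of the all-or-nothing (atomicity) property using Python’s built-in sqlite3 module (the account table and values are illustrative): when the second update fails, the whole transaction is rolled back and neither update is kept.
import sqlite3
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()
try:
    conn.execute("UPDATE account SET balance = balance + 200 WHERE id = 2")   # succeeds
    conn.execute("UPDATE account SET balance = balance - 200 WHERE id = 1")   # violates the CHECK rule
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()                      # the first update is undone as well: all or nothing
print(conn.execute("SELECT id, balance FROM account").fetchall())   # [(1, 100), (2, 50)]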
Backup and Recovery Facilities
- A key advantage of using databases is the availability of backup and recovery facilities.
- These facilities can be used to deal with the effect of loss of data due to hardware or network errors, or bugs in
system or application software
Data Security
- Depending on the business application considered, some users have read access, while others have write access to the
data (role-based functionality)
- This can also be further refined to certain parts of the data
- Three key performance indicators (KPIs) of a DBMS are: response time; throughput rate; and space utilization
- The response time denotes the time elapsed between issuing a database request (e.g., a query or update
instruction) and the successful termination thereof
- The throughput rate represents the transactions a DBMS can process per unit of time.
- Space utilization refers to the space utilized by the DBMS to store both the raw data and the metadata. A
high-performing DBMS is characterized by quick response times, high throughput rates, and low space utilization
Data and quality management
Data Management
Data Quality
Data Governance
- To manage and safeguard DQ, a data governance culture should be put in place assigning clear roles and
responsibilities.
- The aim of data governance is to set-up a company-wide controlled and supported approach toward DQ accompanied by DQ
management processes.
- The core idea is to manage data as an asset rather than a liability, and adopt a proactive attitude toward data
quality
-
A TDQM cycle consists of four steps — define, measure, analyze, and improve — which are performed iteratively.
</br>
Roles in Data Management
- The information architect (also called information analyst) designs the conceptual data model, preferably in
dialogue with the business users.
- He/she bridges the gap between the business processes and the IT environment and closely collaborates with the
database designer, who may assist in choosing the type of conceptual data model (e.g., EER or UML) and the
database modeling tool.
- The database designer translates the conceptual data model into a logical and internal data model.
- He/she also assists the application developers in defining the views of the external data model. To facilitate
future maintenance of the database applications, the database designer should define company-wide uniform naming
conventions when creating the various data models.
- Data owner, who has the authority to ultimately decide on the access to, and usage of, the data.
- The data owner could be the original producer of the data, one of its consumers, or a third party.
- Data stewards are the DQ experts in charge of ensuring the quality of both the actual business data and the
corresponding metadata.
- They assess DQ by performing extensive and regular data quality checks.
- A first type of action to be taken is the application of corrective measures.
- However, data stewards are not in charge of correcting data themselves, as this is typically the
responsibility of the data owner.
- The second type of action to be taken upon the results of the DQ assessment involves a deeper investigation
into the root causes of the DQ issues detected.
- The database administrator (DBA) is responsible for the implementation and monitoring of the database.
- Example activities include: installing and upgrading the DBMS software; backup and recovery management;
performance tuning and monitoring; memory management; replication management; security and authorization.
- Closely collaborates with network and system managers
- Data scientist is a relatively new job profile within the context of data management.
- He/she analyzes data using state-of-the-art analytical techniques to provide new insights into, for example,
customer behavior.
- A data scientist has a multidisciplinary profile combining ICT skills (e.g., programming) with quantitative
modeling (e.g., statistics), business understanding, communication, and creativity
Modelling data and the Relational Database model
Phases of designing Database
- Designing a database is a multi-step process
</br>
The Entity Relationship Model
Entity Type
- An entity type represents a business concept with an unambiguous meaning to a particular set of users. Examples of
entity types are: supplier, student, product, or employee
- An entity is one particular occurrence or instance of an entity type.
- An entity type defines a collection of entities that have similar characteristics.
Attribute Type
-
An attribute type represents a property of an entity type. As an example, name and address are attribute types of
the entity type supplier.
</br>
- A domain specifies the set of values that may be assigned to an attribute for each individual entity. A domain for gender can be specified as having only two values, male and female; a date domain can define a date in terms of day, month and year.
- domains are not displayed in an ER model.
- A key attribute type is an attribute type whose values are distinct for each individual entity.
- In other words, a key attribute type can be used to uniquely identify each entity
- Examples are: supplier number, which is unique for each supplier. A key attribute is one which allows us to identify a particular instance of an entity. An example of a key attribute might be a student number, a primary key, as every student has a unique number, so we can use that attribute to identify each individual.
- We can also have a key which is a combination of attributes – a composite key. One example of this might be a
combination of name and address to identify an individual – some names are duplicated, and more than one person
may live at a particular address, but the combination is (usually!) sufficient to uniquely identify a person.
-
A simple or atomic attribute type cannot be further divided into parts. Examples are supplier number or supplier
status.
- A composite attribute type is an attribute type that can be decomposed into other meaningful attribute types.
- Think about an address attribute type, which can be further decomposed into attribute types for street, number,
ZIP code, city, and country. Another example is name, which can be split into first name and last name
</br>
-
A single-valued attribute type has only one value for a particular entity. An example is product number or product
name.
- A multi-valued attribute type is an attribute type that can have multiple values. As an example, email address can
be a multi-valued attribute type as a supplier can have multiple email addresses
- Multi-valued attribute types are represented using a double ellipse in the ER model
- A derived attribute type is an attribute type that can be derived from another attribute type. As an example, age is a derived attribute type since it can be derived from birth date.
- Derived attribute types are depicted using a dashed ellipse.
Basic concepts
- Each row in a table (also called a tuple: an ordered set of fields) corresponds to an individual record.
- Each record is distinct – duplicate records (rows in the table) are not allowed.
- There must be some key attribute that allows us to distinguish one record from another. The order in which the
records appear in the table is immaterial, as the relation represents a set.
- Each column in the table corresponds to one of the attributes of the entities represented by the table.
- The column representing an attribute contains values from the domain of that attribute.
Relationship Types
</br>
</br>
</br>
- Strong/Weak Entity Type
- A strong entity type is an entity type that has a key attribute type.
- In the image below, the PURCHASE ORDER entity type depends on SUPPLIER, but PURCHASE ORDER has its own key attribute type, which is why it is not weak.
- In contrast, a weak entity type is an entity type that does not have a key attribute type of its own.
- Weak entity types are represented in the ER model using a double-lined rectangle
- In the image below, the rhombus representing the relationship type through which the weak entity type borrows a key attribute type is also double-lined. The borrowed attribute type(s) is/are underlined using a dashed line.
- Since a weak entity type needs to borrow an attribute type from another entity type, its existence will always
be dependent on the latter
- In the image below, ROOM depends on HOTEL, but ROOM does not have a key attribute type of its own, which is why it is weak.
- Entity Y is existence dependent on entity X if each instance of Y must have a corresponding instance of X. If
this is the case, Y must have total participation in its relationship with X.
- If Y does not have its own candidate key, Y is called a weak entity, and X is a strong entity.
- Weak entities may have a partial key, a discriminator that distinguishes instances of the weak entity that are related
to an instance of the strong entity.
</br>
- Ternary Relationship Types
- Higher-order relationship types with more than two entity types, known as ternary relationship types, can
occasionally occur, and special attention is needed to properly understand their meaning.
- Limitations for ER
- the ER model presents a temporary snapshot of the data requirements of a business process. This implies that
temporal constraints, which are constraints spanning a particular time interval, cannot be modeled.
- the ER model cannot guarantee consistency across multiple relationship types. Some examples of business rules that
cannot be enforced in the ER model are: an employee should work in the department that he/she manages, employees
should work on projects
- Example for ER
</br>
Relational Database Keys
- Relations never have duplicate tuples, so you can always tell tuples apart.
- This implies that there is always a key (which may be a composite of all attributes, in the extreme case).
- The key may be of different types:
- Superkey: A set of attributes that uniquely identifies tuples
- Candidate key: A superkey such that no proper subset of itself is also a superkey (i.e. it has no unnecessary
attributes)
- Primary key: A candidate key chosen for unique identification of tuples
- Candidate keys other than the primary key are called the alternative keys.
- The connection between the relations are established by using the concept of a foreign key.
- A foreign key is an attribute, or combination of attributes, of a relation that is not the primary key of that
relation, but is the primary key of some other relation.
- For Example
- In the STUDENT table, StudentId is the primary key, uniquely identifying a particular student.
- In the ENROLL table, studentId is a foreign key: it links a student to the class they are taking.
- But studentId is not by itself the primary key of the ENROLL table.
- Representing Relations Database schemas
- The schema for a relation gives the name of the relation, followed by the names of the attributes in
parentheses:
- Student(stuId, lastName, firstName, gender, dob, auth)
- A database schema can have any number of relation schemas. We underline the primary key in each relation schema and indicate the foreign keys.
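- A minimal sketch of these ideas using Python’s built-in sqlite3 module (the exact column set is illustrative): Student has a primary key, Enroll has a composite primary key plus a foreign key to Student, and referential integrity rejects an enrolment for a non-existent student.
import sqlite3
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")          # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE Student (stuId TEXT PRIMARY KEY, lastName TEXT, firstName TEXT)")
conn.execute("""CREATE TABLE Enroll (
                    stuId TEXT NOT NULL REFERENCES Student(stuId),
                    classCode TEXT NOT NULL,
                    PRIMARY KEY (stuId, classCode))""")   # composite primary key
conn.execute("INSERT INTO Student VALUES ('S1001', 'Smith', 'Ann')")
conn.execute("INSERT INTO Enroll VALUES ('S1001', 'BDA01')")
try:
    conn.execute("INSERT INTO Enroll VALUES ('S9999', 'BDA01')")   # no such student
except sqlite3.IntegrityError as e:
    print("rejected:", e)                         # FOREIGN KEY constraint failed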
Database Normalization
- The process of analysing a database structure to ensure there is no redundant data is called normalisation.
- If our relational model is not normalised, we may see different types of anomaly. In the data you saw earlier in the
module, there were multiple records for each student, one for each class they had taken.
- Insertion anomaly: if we want to insert a new record for a student, we need to make sure that the data about the
student is consistent with the other records for that student.
- Deletion anomaly: If we wanted to delete records about a class, and there was a student who had taken only that
class, then the information about that student is also deleted.
- Update anomaly: if we want to update a student, we have to find every record about the student and make sure they
are updated consistently.
- There are three common forms of database normalisation: 1st, 2nd, and 3rd normal form, usually abbreviated to 1NF,
2NF, and 3NF.
- Each normal form builds on the previous form, and the relational model becomes less prone to anomalies.
- First Normal Form – The information is stored in a relational table with each column containing atomic values.
- Second Normal Form – The table is in first normal form and all the columns depend on the table’s primary key.
- Third Normal Form – the table is in second normal form and none of its columns is transitively dependent on the primary key
Questions
- Explain the difference between data redundancy and data integrity
- Data redundancy is unnecessary repetition of data. Data integrity is data as it should be / accurate, reflect
reality
- Illustrate the concept of referential integrity with an example
- Referential integrity will not allow the deletion of stock_number if it is being referred to by other entries or
tables.
- In terms of a database, explain the term transaction
- A transaction is a sequence of database operations (reads and writes) that is treated as a single, atomic unit of work.
- Explain where record locking might be needed while in a transaction process
- Two transactions attempting an update at the same time on the same data. The first attempt locks the record so
that the second attempt cannot interfere with the first transaction.
WEEK 6
Main Topics
After completing this week you should be able to:
- Explain the models of databases for Big Data
- Explain how Big Data causes problems for analytics
- Explain how SQL databases can help represent a wide variety of data
- Explain how distributed computation approaches such as Hadoop can allow computation on huge volumes of data
After completing this week, you will have made significant steps towards achieving the following module learning
outcomes:
- MO4 Analyse and communicate issues with scaling up to large data sets, and use appropriate techniques to scale up the
computation
Sub titles:
Constructing and querying a simple database
CREATE TABLE `schooldata`.`STUDENT` (
`studentId` INT NOT NULL,
`firstName` VARCHAR(45) NULL,
`lastName` VARCHAR(45) NULL,
`Gender` VARCHAR(1) NULL,
`Dob` VARCHAR(20) NULL,
`auth` VARCHAR(45) NULL,
PRIMARY KEY (`studentId`));
INSERT INTO `schooldata`.`STUDENT`
VALUES ('00001', 'SAM', 'Williams', 'm', '1979/5/9', 'Herts');
Next generation of databases
-
In the relational data model (the ‘relational model’), the database is structured as a set of tables. Within each table, each row corresponds to a record about some entity, and the columns in the table correspond to attributes of those
entities. Each entity is defined by the set of attributes that it has.
- ‘NoSQL’ refers to database technologies that are NOT built on the relational model.
- Scalability
-
RDBMS => With a traditional RDBMS we can try to scale up the system by adding more memory, a faster CPU and so
on, referred to as ‘vertical scaling’, but there are limits to how far this can go
-
NoSQL => typically we aim to scale ‘horizontally’ — that is, we start distributing the work over several servers
in a cluster, and scale up by adding servers.
-
The data structures used in NoSQL databases are also typically more able to cope with data that doesn’t fit neatly
into tables of records, such as documents, graphs or key-value pairs, making it easier to collect and process large
volumes of unstructured data.
- A Database Management System (DBMS) is the software that manages all the data stored in secondary storage.
</br>
Big Data and the problems for analytics
5V of Big Data
- 5V properties: Volume, Velocity and Variety constitute the native/original Big Data properties, while Value and Veracity are acquired as a result of the data’s initial classification and processing in the context of a specific process or model. (Volume, Velocity and Variety are the original Big Data 3Vs; Veracity and Value were added later.)
- Volume The amount of data matters. With big data, you’ll have to process high volumes of low-density,
unstructured data. This can be data of unknown value, such as Twitter data feeds, clickstreams on a web page or a
mobile app, or sensor-enabled equipment. For some organizations, this might be tens of terabytes of data. For
others, it may be hundreds of petabytes.
- Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the highest velocity of
data streams directly into memory versus being written to disk. Some internet-enabled smart products operate in
real time or near real time and will require real-time evaluation and action.
- Variety refers to the many types of data that are available. Traditional data types were structured and fit
neatly in a relational database. With the rise of big data, data comes in new unstructured data types.
Unstructured and semistructured data types, such as text, audio, and video, require additional preprocessing to
derive meaning and support metadata.
- Value Data has intrinsic value. But it’s of no use until that value is discovered.
- Veracity Equally important: How truthful is your data—and how much can you rely on it?
Hadoop and distributed computation
- Hadoop is an open-source software framework
- Hadoop provides massive storage for any kind of data
- Enormous processing power and the ability to handle a vast number of concurrent tasks or jobs.
- Hadoop is not a database system
- But supports: the distributed storage, distributed processing, and job control
Hadoop components
- Hadoop consists of three core components
- a distributed file system so that data is spread across a number of servers
- a parallel programming framework, so that computation is spread across a number of servers
- a resource management system to coordinate the distribution of data and computation.
Hadoop Distributed File System (HDFS)
- Hadoop is based on an open-source implementation of a clustered file system called HDFS.
- The design of HDFS supports scalable, reliable, and cost-efficient distributed computing
- HDFS stores data files without prescribing the format, so any data in any format could be stored.
- To give reliability and tolerance to hardware failure, multiple copies of the data are replicated across the
cluster. The hardware itself can be standard server hardware, making implementation cost effective.
Hadoop MapReduce
- ‘MapReduce’ is the programming model used within Hadoop to distribute processing of very large amounts of data.
- The processing of data is in two phases.
- The ‘Map’ phase takes the data to be processed and splits it into small parts that can be processed quickly.
- The results from processing these small parts are then brought together in the ‘Reduce’ phase – reducing the
partial results down to a single combined result
</br>
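- A minimal sketch of the MapReduce idea in plain Python (no Hadoop involved; the documents are made up). In a real cluster the map outputs would be produced on many nodes in parallel and the framework would shuffle them to the reducers.
from collections import defaultdict
documents = ["big data needs big storage", "data beats opinion"]
# Map phase: each input split is turned into (key, value) pairs independently.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))
# Shuffle: group the values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)
# Reduce phase: combine the partial results for each key into a single result.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)    # e.g. {'big': 2, 'data': 2, 'needs': 1, ...}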
Hadoop YARN (Yet Another Resource Negotiator)
- Handles resource management and job scheduling
- YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and
scheduling tasks to be executed on different cluster nodes.
When to use Hadoop (and when not to!)
Different types of databases
-
Key-Value databases: Key-value databases, or key-value stores, are an extreme form of database. The data to be
stored could be anything, for example, a text file or image – it is treated as a ‘blob’ of data. The data is accessed
through a key associated with it. To store a piece of data, the database forms an association between the key and the
blob of data you wish to save. To retrieve data, you provide the key and the database returns the associated blob of
data.
-
Document databases: Document databases, or document stores, share the basic access and retrieval semantics of
key-value stores, that is, they also use a unique key associated with a particular piece of data. The main
difference is that instead of storing arbitrary blobs of data, document databases store data in structured formats
called documents. XML, JSON (JavaScript Object Notation) and BSON (which is a binary encoding of JSON objects) are
some common standard encodings.
-
Graph databases: Graph databases take a different approach to establishing relationships between data. Where the
relational model uses tables with foreign keys to represent relationships between objects of different types, graph
databases establish connections using the concepts of nodes, edges, and properties. Graph databases represent a piece
of data as an individual node, which may have a number of properties. Between the nodes, edges are established to
represent different types of relationships. In this way, graph databases allow representation of complex, hierarchical
or network structures that may be difficult to model in a relational system.
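- A minimal sketch of the node/edge/property view using the networkx library in Python (the pages and relationship names are illustrative); a graph database such as Neo4j offers similar concepts with persistence and a query language.
import networkx as nx
G = nx.DiGraph()                                  # directed graph: nodes, edges, properties
G.add_node("home", title="Home page")
G.add_node("about", title="About us")
G.add_node("products", title="Products")
G.add_edge("home", "about", rel="links_to")
G.add_edge("home", "products", rel="links_to")
G.add_edge("about", "products", rel="links_to")
print(G.nodes["about"]["title"])                  # read a node property
print(list(G.successors("home")))                 # pages linked from 'home'
print(nx.shortest_path(G, "about", "products"))   # a simple path query over relationships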
Big Data Architectures – NoSQL Use Cases for Key Value Databases
-
NoSQL data stores usually have a concept of “eventual consistency,” which means that all the redundancy and high
availability is there but synchronicity may not be achieved immediately but eventually. A little data might be lost on
rare occasions. Deal with it. If you have an application where a little data can never be lost, then don’t use a NoSQL
database and be prepared to invest in appropriate recovery and high availability solutions.
- Key value databases are EXTREMELY simple databases.
- There is a key and there is the rest of the data (the values) and that’s it.
- You can find the value data based on searching the key field. There are no alternate keys and no foreign keys and
no broad text searching capabilities against the values. If you want those things then you have to create them and
it involves redundancy.
- It is FAST. They are a lot faster than relational databases.
- And they can scale: they can grow in size by hundreds and thousands of times without significant redesign. The
growth in price is linear with the growth in size. It is very hard to scale relational database solutions
quickly and the price curve is usually not linear but geometrical.
- Relational databases struggle with handling magnitude changes in volume quickly (number of records stored) and with
handling high volumes of transactions simultaneously (thousands or millions of state changes per second).
- Key value NoSQL databases can handle magnitude scaling of number of records and extremely high volumes of state
changes per second with millions of simultaneous users through distributed processing and distributed storage.
- Key value NoSQL databases also have built in redundancy which can handle the loss of storage nodes without losing
the whole application
- Example key value databases include Riak, Redis, Memcached, Berkeley DB, Hamster DB, Amazon Dynamo DB.
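- A minimal sketch of the key-value access pattern using the redis-py client (this assumes a Redis server is running locally on the default port; the key name and blob are illustrative): the whole interface is essentially set and get on a key.
import redis
r = redis.Redis(host="localhost", port=6379)
r.set("user:1001:avatar", b"...binary blob of image data...")   # store a blob under a key
blob = r.get("user:1001:avatar")                                # retrieve it by key only
print(len(blob) if blob else "not found")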
How Can Graph Analytics Uncover Valuable Insights About Data?
- Use Cases:
- In computer science, graphs are used to represent networks of communication, data organization, computational
devices, the flow of computation, etc
- For instance, the link structure of a website can be represented by a directed graph, in which the vertices
represent web pages and directed edges represent links from one page to another.
- Graph-theoretic methods have proven useful in linguistics, since natural language often lends itself well to
discrete structure
- Traditionally, syntax and compositional semantics follow tree-based structures, whose expressive power lies in
the principle of compositionality, modeled in a hierarchical graph.
- Graph theory is used in sociology, for example, to measure actors’ prestige or to explore rumor spreading, notably through the use of social network analysis software.
- In the world of intelligence, numerous government agencies are interested in identifying threats through the
detection of non-obvious patterns of relationships and group communications buried in social media, email, texting
and call detail records.
- In life sciences, organizations can use graph analytics to conduct research in healthcare fraud for healthcare
payers.
- In addition to the healthcare fraud detection program, other potential graph analytics use cases include
healthcare treatment efficacy and outcome analysis, analyzing drugs and side effects, and the analysis of
proteins and gene pathways.
- In the area of personalized healthcare, a startup called Lumiata wants to scale personalized medicine by
leveraging machine learning and graphic analytics to help doctors to focus on more urgent care needs and empower
nurses to carry more of the diagnostic chores
- Graph Analytics can be used to address relationship-based problems in manufacturing, energy, gas exploration,
travel, biology, conservation, computer chip design, chemistry, physics, higher education research, government,
security, defense and many other fields.
- Advantages:
- Graph analytics provide another arrow in our quiver – another tool that we can use against these vast amounts of
social media and sensor-based data to uncover new insights about the relationships between our customers, products,
and operations.
- Graph analytics allows us to get new, more actionable, more relevant answers to many of our traditional questions (Who
are our most important customers? What are our most important products?), as well as answer completely new questions (
Who are our most influential customers? Where are our largest networking or operational security risks?)
Graph Databases for Beginners: ACID vs. BASE Explained
- The two most common consistency models are known by the acronyms ACID and BASE
-
The key ACID guarantee is that it provides a safe environment in which to operate on your data.
-
In the NoSQL database world, ACID transactions are less fashionable as some databases have loosened the requirements
for immediate consistency, data freshness and accuracy in order to gain other benefits, like scale and resilience.
- Difference BASE and ACID
- A BASE data store values availability (since that’s important for scale), but it doesn’t offer guaranteed
consistency of replicated data at write time. Overall, the BASE consistency model provides a less strict assurance
than ACID: data will be consistent in the future, either at read time (e.g., Riak) or it will always be
consistent, but only for certain processed past snapshots (e.g., Datomic).
ACID vs BASE consistency models
- One very important feature of a database is that the data it contains should be correct.
- ACID Consistency Model,
- Atomic : All operations that make up a transaction succeed or every operation is rolled back to the initial
consistent state of the database.
- Consistent : On the completion of a transaction, all users see the same updated data.
- Isolated :Transactions do not contend with one another to modify data. Contentious access to data is
moderated by the database so that transactions appear to run sequentially.
- Durable : The results of applying a transaction are permanent, even in the presence of failures.
- BASE Consistency Model
- Basic Availability :The database appears to work most of the time.
- Soft-state: When the database is updated, it may be inconsistent for some period of time.
- Eventual consistency: Stores exhibit consistency at some later point (e.g., lazily at read time).
- These BASE properties are more relaxed than we might want from a banking system, for example, but when we are
interested more in what we can get from the data, and in the availability of the database, then BASE might be an
acceptable approach.
- The BASE consistency model is primarily used by NoSQL databases such as key-value and document stores.
WEEK 7
Main Topics
Sub titles: