1492040347
9781492040347
1492040347
4.25
502
unknown
Nov 04, 2019
None
0
2
Apr 28, 2025
Mar 07, 2025
not set
not set
Mar 07, 2025
Paperback
0753559013
9780753559017
0753559013
3.85
4,992
Sep 2021
Sep 16, 2021
it was amazing
This is such a visionary book.
Its uniqueness stems from the blend of speculative fiction and expert analysis.
I'd recommend it to anyone who is interested in AI and lacks the technical expertise to understand its intricacies, but wants to envision near-future scenarios.
The book lays out great ideas with realistic scenarios.
The fact that it is set in 2041 makes the stories credible: the future is not that far off, so the reader can empathize with the situations.
However, some stories, like "Quantum Genocide", are too complex to be that short, and the reader may lose the plot.
I found two stories particularly interesting: "Job Savior" and "Golden Elephant", both with huge ethical implications.
In "Job Savior," AI has automated entire industries, leaving millions of workers displaced and searching for purpose. The protagonist, a former factory worker, finds herself immersed in an AI-driven gig economy that feels as alienating as it is efficient. The algorithms grade every action, offering opportunities based on performance metrics rather than human qualities.
Yet, beyond the dystopian veneer lies a vision of adaptation: AI-powered retraining programs allow individuals to pivot their careers, and Universal Basic Income offers a safety net. The story dares to imagine how humans might coexist with machines in a post-work world, emphasizing resilience and the potential for redefining what "work" means. The speculative yet grounded narrative leaves readers pondering: could AI liberate humanity from monotonous labor, or will it leave us adrift in a sea of automation?
In the end workers are paid to simulate work...
The system analyzed the work in real time, awarding credit points based on speed and quality feedback. At the front of the hotel ballroom, there was a leaderboard displaying the names of the workers with the most points.
It was clear to Jennifer and Michael that some of the workers in the trial were quicker learners. They observed Matt impatiently turning in all directions and chatting with fellow workers. Some looked like gamblers sitting before slot machines, hands in constant motion, expressions intoxicated. Noting his place on the leaderboard, one worker performed what seemed like a victory dance. Others looked glum.
"This is a video game, only the name of the game is 'work.'" Jennifer couldn't hide some disgust with the whole affair. The workers' reactions to the tech reminded her of her sad and defeated father.
Michael sat silently.
"That's it!" His reaction startled Jennifer from her reverie. He reached out and took Jennifer's hand. "Maybe this doesn't just seem like a video game. Maybe it actually is one. You need to help me find out."
"Find out what?"
"The architectural designs in this video... are they really being implemented somewhere?"
"You mean..." Jennifer suddenly
"Golden Elephant" projects a future where AI revolutionizes healthcare, bringing cutting-edge diagnostics and treatments to remote, underserved areas. The "golden elephant," an AI avatar that provides both medical insights and emotional comfort, embodies the fusion of precision and compassion. Through the eyes of a young woman and her family in a rural community, the story explores the leap of faith required to trust machines with something as intimate as health.
What sets this vision apart is its optimism. It imagines a world where AI doesn’t just close gaps in medical care but also augments human empathy, ensuring that technology remains a tool for connection, not cold detachment. The story wrestles with ethical concerns, such as the biases AI might inherit and the limits of its clinical perspective, but it ultimately points to a future where technology is an equalizer, breaking down barriers to access.
In fact, the AI was sheltering Nayana from Sahej because he was from the poorest caste, and it would therefore have been a tragedy for Nayana's future prospects.
"Sahej, why? Why can't we get close to each other?" Nayana chose her words carefully....more
Now it was Sahej's turn to look surprised. "Nayana, do you really not know?"
"Know what?"
"My last name."
"The schools and virtual classroom keep your surname protected just like you're the offspring of a big star or some famous family."
"On the contrary, it's because they don't want it to trigger any discomfort."
"What kind of discomfort?"
"In the past, it was described as a feeling of being polluted."
"You're talking about your caste? But that whole system was outlawed years ago."
Sahej gave a bitter laugh. "Just because it's no longer permitted by law and doesn't appear in the news doesn't mean it's gone."
"But how would the Al know about it?"
"The Al doesn't know. The Al doesn't need to know the definition of the castes. All it needs is its users' history. No matter how we hide or if we change our surnames, our data is a shadow. And no one can escape their shadow."
Nayana thought about what her mother had said, that Al only learns what humans teach it. She rolled the thought about in her head, then looked at Sahej. "So you're saying that Al identifies the invisible discrimination in our society and quantifies it."
Sahej's expression became serious, but he exhaled a soft laugh. "I almost forgot. There's also the color of my
1
Dec 16, 2024
Jan 03, 2025
Dec 16, 2024
Paperback
0321125215
9780321125217
0321125215
4.15
5,688
Aug 20, 2003
Aug 20, 2003
liked it
As a Senior data analyst, I approached Eric Evans' "Domain-Driven Design" hoping to gain practical insights into better aligning software design with real-world business needs. While the book is undeniably influential in software development circles, I found it too technical and theoretical for my needs.
The core ideas—like agreeing on terminology (ubiquitous language) and defining clear boundaries between domains (bounded contexts)—are interesting and relevant, but they’re buried under layers of complex jargon, UML diagrams, and discussions rooted in object-oriented programming. As someone who works closely with data, I was hoping for more actionable strategies or examples that could be applied directly to data modeling and analysis. Instead, the focus on software architecture felt removed from the day-to-day challenges of interpreting and communicating data insights.
The "domain" is the specific area of knowledge or activity your software addresses (e.g., banking, healthcare).
Evans emphasizes the creation of a ubiquitous language—a shared vocabulary used by both developers and domain experts to avoid miscommunication. This language is embedded directly into the code and the design process.
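To make the idea concrete, here is a tiny sketch of my own (the insurance domain and every name in it are invented by me, not Evans' example) of what embedding the ubiquitous language in code can look like:
from dataclasses import dataclass
from datetime import date

# Hypothetical insurance domain: class and method names mirror the exact
# terms the domain experts use (the "ubiquitous language").
@dataclass(frozen=True)
class Policy:
    policy_number: str
    expiry_date: date

    def is_lapsed(self, today: date) -> bool:
        return today > self.expiry_date

    def renew(self, new_expiry: date) -> "Policy":
        return Policy(self.policy_number, new_expiry)

policy = Policy("P-001", date(2024, 12, 31))
print(policy.is_lapsed(date(2025, 3, 1)))  # True, so the policy needs renewal
A domain expert can read is_lapsed and renew and confirm they match how the business actually talks about policies.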
For developers or architects tackling large, enterprise-level systems, this book might be invaluable. But for analysts like me, it leans too heavily into abstract theory and programming frameworks, making it difficult to extract clear, practical takeaways. I’d recommend it only to those deeply embedded in the technical side of software design. For those seeking a more accessible guide to aligning technical solutions with business domains, this might not be the right fit.
In fact, there are newer resources inspired by Evans' work that may be more approachable. While it's not an easy read, the book's long-term impact on software architecture makes it a valuable, albeit imperfect, resource.
1
Dec 05, 2024
Dec 15, 2024
Dec 05, 2024
Hardcover
1292164778
9781292164779
1292164778
3.31
111
Jan 01, 2006
Mar 27, 2017
really liked it
This textbook provides a comprehensive introduction to statistics, focusing on concepts and applications rather than complex formulas. It emphasizes the practical side of statistics while maintaining a balance with theory, making it suitable for students across various disciplines.
It is definitely, so to speak, a school book, which for me wasn't the best fit.
The main topics:
Descriptive Statistics: Organizing and summarizing data using measures like mean, median, standard deviation, and graphical tools such as histograms and boxplots.
Probability: Basics of probability, probability distributions (like the normal and binomial), and how these underpin statistical inference.
Inferential Statistics: Estimation, confidence intervals, and hypothesis testing to draw conclusions about populations from samples.
Regression Analysis: Simple and multiple regression techniques for modeling relationships between variables.
However, more advanced topics are not explored.
Statistics is About Learning from Data: Statistics provides tools to understand patterns, summarize data, and draw meaningful conclusions. It's not just about numbers but interpreting what the data reveals about real-world situations.
Variability is Key: Understanding and managing variability in data is central to statistics. Measures like standard deviation and variance help quantify how much data points differ, enabling better insights.
Descriptive Statistics Summarize Data: Techniques like mean, median, mode, and graphical tools (e.g., histograms and boxplots) provide a snapshot of the data, making complex datasets easier to interpret.
Probability Links Data to Inference: Probability is the foundation for making predictions and drawing conclusions about populations from sample data. It bridges descriptive and inferential statistics.
Statistical Inference is Powerful: Through confidence intervals and hypothesis testing, statisticians can make predictions, estimate population parameters, and assess relationships between variables with a degree of certainty.
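As a quick illustration of the descriptive and inferential sides next to each other, here is a minimal Python sketch of my own (the exam scores are made up, not taken from the textbook):
import numpy as np
from scipy import stats

# Made-up sample of exam scores, only to illustrate the concepts above.
scores = np.array([72, 85, 90, 66, 78, 95, 88, 73, 81, 79])

# Descriptive statistics: summarize the sample.
print("mean:", scores.mean(), "median:", np.median(scores), "sd:", scores.std(ddof=1))

# Inferential statistics: a 95% t-based confidence interval for the population mean.
ci = stats.t.interval(0.95, len(scores) - 1, loc=scores.mean(), scale=stats.sem(scores))
print("95% CI for the mean:", ci)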
1
Nov 22, 2024
Dec 05, 2024
Nov 20, 2024
Pocket Book
1735431168
9781735431161
B08JG2C29F
4.34
35
unknown
Sep 17, 2020
it was amazing
This will be my reference point whenever I need to refresh my memory on the intricate topic of hypothesis testing.
The book introduces readers to different types of hypothesis tests, such as t-tests, ANOVA, and chi-square tests, with an emphasis on when and why to use each. Frost also covers one-tailed vs. two-tailed tests, Type I and Type II errors, and power analysis, making the book a comprehensive guide to the basics of testing.
Frost takes time to discuss common mistakes in hypothesis testing, such as data dredging and misinterpreting p-values. He emphasizes the importance of context and encourages readers to look beyond just the numbers to understand the data's story.
For instance, this explanation of a Type I error (a.k.a. a false positive) is on point and accessible to everyone:
A fire alarm provides a good analogy for the types of hypothesis testing errors. Ideally, the alarm rings when there is a fire and does not ring in the absence of a fire. However, if the alarm rings when there is no fire, it is a false positive, or a Type I error in statistical terms. Conversely, if the fire alarm fails to ring when there is a fire, it is a false negative, or a Type II error.
He succeeds in explaining very convoluted and difficult topics, such as why we do not accept the null hypothesis, in a remarkably clear way:
You learned that we do not accept the null hypothesis. Instead, we fail to reject it. The convoluted wording encapsulates the fact that insufficient evidence for an effect in our sample isn't proof that the effect does not exist in the population. The effect might exist, but our sample didn't detect it, just like all those species scientists presumed were extinct because they didn't see them.
Finally, I admonished you not to use the common practice of seeing whether confidence intervals of the mean for two groups overlap to determine whether the means are different. That process can cause you to overlook significant results and miss out on important information about the likely range of the mean difference. Instead, assess the confidence interval of the difference between means.
Frost highlights common pitfalls, like misinterpreting p-values or ignoring statistical power, that can lead to poor decision-making. This focus on interpretation is invaluable and helps readers avoid drawing incorrect conclusions from their data.
Another example is the notoriously difficult explanation of what a p-value is.
Of course you cannot oversimplify such complex concepts, but you can try to make them more accessible, as Frost does:
P-values indicate the strength of the sample evidence against the null hypothesis. If it is less than the significance level, your results are statistically significant.
This is because it is the probability of observing a sample statistic that is at least as extreme as your sample statistic when you assume the Null Hypothesis is correct.
P-values are the probability that you would obtain the effect observed in your sample, or larger, if the null hypothesis is correct. In simpler terms, p-values tell you how strongly your sample data contradict the null. Lower p-values represent stronger evidence against the null.
If the p-value is less than or equal to the significance level, you reject the null hypothesis and your results are statistically significant. The data support the alternative hypothesis that the effect exists in the population. When the p-value is greater than the significance level, your sample data don't provide enough evidence to conclude that the effect exists.
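To see that decision rule in action, here is a minimal sketch of my own (made-up data and SciPy's one-sample t-test, not an example from the book):
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Made-up sample; the null hypothesis is that the population mean equals 100.
sample = rng.normal(loc=103, scale=10, size=30)

t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
alpha = 0.05  # significance level
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value <= alpha:
    print("Reject the null hypothesis: the sample supports a real difference.")
else:
    print("Fail to reject the null hypothesis: not enough evidence of a difference.")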
Finally, he thoroughly explains the Central Limit Theorem and why it is so important: the distribution of sample means (or sample sums) from a population will tend to follow a normal distribution, regardless of the population's original distribution, as long as the sample size is sufficiently large. This holds even if the data itself doesn't follow a normal distribution, which is what makes the CLT so powerful.
In hypothesis testing, the CLT justifies the use of normal distribution-based methods (like z-tests or t-tests) to assess sample means or proportions when making data-driven decisions. For instance, it allows researchers to test if a sample mean significantly differs from a hypothesized population mean even if they don’t know the underlying distribution of the data.
In fact, even if the underlying data is skewed, multimodal, or has any other shape, the distribution of the sample means will approach a normal (bell-shaped) distribution as the sample size increases.
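The CLT is easy to verify empirically; this little simulation is mine (a deliberately skewed exponential population), not the book's:
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # strongly right-skewed

# Distribution of the means of many samples of size 50.
sample_means = np.array([rng.choice(population, size=50).mean() for _ in range(2000)])

print("population mean:", round(population.mean(), 3))
print("mean of sample means:", round(sample_means.mean(), 3))
print("sd of sample means:", round(sample_means.std(ddof=1), 3),
      "vs population sd / sqrt(50):", round(population.std() / np.sqrt(50), 3))
# A histogram of sample_means already looks close to a bell curve.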
1
Oct 16, 2024
Oct 28, 2024
Oct 16, 2024
Kindle Edition
9798991193542
B0DCXMMX7C
4.45
53
unknown
Aug 19, 2024
it was amazing
The beauty of this book lies in its simplicity and clarity. Frost starts with the basics of simple linear regression and gradually moves on to multiple regression models, always emphasizing the why behind each concept. He's less concerned with turning you into a statistician and more focused on helping you understand how to use regression analysis to draw meaningful conclusions from data.
The book is full of practical examples that are easy to relate to, covering real-world applications in fields like business, economics, and social sciences. Frost walks you through interpreting key outputs like coefficients, R-squared values, and p-values, breaking them down into terms that even beginners can grasp. If you’ve ever struggled to make sense of what these numbers actually mean in the context of your data, this guide will be a game-changer.
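As an illustration of those outputs, here is a minimal sketch of my own (synthetic advertising-versus-sales data, fitted with SciPy rather than whatever software the book uses):
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Synthetic data: advertising spend (x) vs. sales (y).
x = rng.uniform(0, 100, size=40)
y = 3.0 + 0.5 * x + rng.normal(scale=5, size=40)

result = stats.linregress(x, y)
print("intercept:", round(result.intercept, 2))    # expected sales at zero spend
print("slope:", round(result.slope, 2))            # extra sales per unit of spend
print("R-squared:", round(result.rvalue ** 2, 3))  # share of variance explained
print("p-value:", result.pvalue)                   # evidence that the slope is non-zero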
Beginner-Friendly Approach: Frost’s writing is clear, engaging, and always focused on building an intuitive understanding of the concepts.
Focus on Interpretation: Instead of getting bogged down in formulas, the book emphasizes how to interpret regression results and what they mean for your data.
Practical Applications: Real-world examples make it easier to see how regression analysis can be used to answer practical questions in various fields.
1
Oct 04, 2024
Oct 16, 2024
Oct 04, 2024
Kindle Edition
1735431109
9781735431109
1735431109
3.85
41
unknown
Aug 13, 2020
it was amazing
Such a well-written (true) introduction to statistics.
The book goes straight to the point, sometimes even too much so, as it tends to be a bit dry.
First off, this book isn’t about bombarding you with formulas and complicated jargon. Frost is more interested in explaining why statistical methods work. He makes sure you actually understand the logic behind the numbers rather than just memorizing steps. For example, when discussing descriptive statistics, he doesn’t just give you definitions of mean, median, and standard deviation—he explains how these measures give you different insights into your data set and when you should use each one. It’s really about building an intuitive feel for how to analyze data.
The book encompasses the following themes:
Data Visualization: starting with histograms and moving on to more advanced charts
Summary Statistics: central tendency, variability, percentiles, correlation
Statistical Distributions: The book also gives a nice overview of key distributions, like normal distribution, and why they matter when analyzing data. Frost explains how concepts like the central limit theorem are at the heart of many statistical methods, but he doesn’t bog you down with unnecessary complexity.
Probability: Frost takes the time to explain probability in a way that’s easy to grasp. Instead of throwing out formulas right away, he walks through practical examples, like rolling dice or drawing cards, to show how probability works in real-life scenarios. This helps ground your understanding before moving into more abstract concepts.
Confidence Intervals: Another topic that Frost handles well is confidence intervals. Instead of throwing a bunch of equations at you, he starts by explaining what a confidence interval really means in terms of how certain you can be about a range of values. He explains it in an approachable way, which is perfect for beginners who might not have a background in mathematics.
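In the same spirit, here is a small simulation of my own (not from the book) that ties the probability and confidence-interval chapters together:
import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=10_000)  # simulate 10,000 fair-die rolls

p_hat = (rolls == 6).mean()                      # empirical probability of rolling a six
se = np.sqrt(p_hat * (1 - p_hat) / len(rolls))   # standard error of a proportion
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)      # approximate 95% confidence interval

print("estimated P(six):", p_hat, "theoretical:", round(1 / 6, 4))
print("approximate 95% CI:", tuple(round(v, 4) for v in ci))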
1
Sep 27, 2024
Oct 04, 2024
Sep 27, 2024
Paperback
1492056316
9781492056317
B09WZJMMJP
4.62
1,694
Jan 25, 2015
Mar 31, 2022
it was amazing
Such a great textbook to look under the hood of Python's engine.
As often happens with more advanced books, the last chapters are a bit too complex for me, but overall the structure of the book is very good.
The book delves into Python’s underlying mechanics and advanced constructs, guiding developers to write cleaner, more Pythonic code by leveraging the language’s built-in capabilities. Ramalho does a deep dive into Python’s core features, including data structures, functions, objects, metaprogramming, and concurrency, while emphasizing the importance of understanding Pythonic idioms and best practices.
I also appreciated the approach of putting up front some technical details that are often overlooked, such as:
Special methods in Python, often referred to as "magic methods" or "dunder methods" (short for "double underscore"), are methods that have double underscores at the beginning and end of their names, like __init__, __str__, and __add__. These methods are used to enable certain behaviors and operations on objects that are instances of a class. They allow you to define how your objects should respond to built-in functions and operators.
or
A hash is a fixed-size integer that uniquely identifies a particular value or object. Hashes are used in many areas of computer science and programming, such as in data structures like hash tables (which are used to implement dictionaries and sets in Python). The purpose of hashing is to quickly compare and retrieve data in collections that require fast lookups.
- Python provides a built-in hash() function that returns the hash value of an object.
- The hash() function works on immutable data types like integers, floats, strings, and tuples.
- For mutable types like lists or dictionaries, hash() is not applicable because their contents can change, making them unsuitable for use as keys in hash-based collections.
- The hash value of an immutable object (like a string or a tuple) remains constant throughout the program's execution, provided that the object itself doesn't change.
- Mutable objects cannot be hashed because their contents can change, leading to inconsistencies in hash values.
- Hashes are crucial for the performance of dictionaries and sets in Python. When you insert an item into a dictionary or set, Python uses the hash value to determine where to store the data. This allows for very fast lookups.
Dictionaries themselves are not hashable. This is because dictionaries in Python are mutable objects, meaning their contents can change after they are created.
Keys must be hashable: This is a strict requirement because dictionaries use a hash table internally to store key-value pairs. The hash value of the key determines where the pair is stored.
Keys must be immutable: This immutability ensures that the hash value of a key remains consistent throughout its lifetime. If the key could change, it would disrupt the dictionary's internal hash table, leading to unpredictable behavior when trying to retrieve or store values.
The key in a dictionary cannot change directly because keys in a dictionary must be immutable. This means that once you assign a key to a dictionary, it must remain unchanged.
However, if you want to "change" a key, you would need to remove the old key-value pair and add a new key with the same value (or a modified value).
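A small sketch of my own tying the two notes together (not an example from the book): a class that implements __repr__, __eq__ and __hash__, so its instances can be printed, compared, and used as dictionary keys.
class Point:
    """Value object: special methods make it printable, comparable and hashable."""
    def __init__(self, x, y):
        self._x, self._y = x, y

    def __repr__(self):
        return f"Point({self._x}, {self._y})"

    def __eq__(self, other):
        return isinstance(other, Point) and (self._x, self._y) == (other._x, other._y)

    def __hash__(self):
        return hash((self._x, self._y))  # hash of an immutable tuple of the fields

visits = {Point(0, 0): 3}     # only possible because Point is hashable
visits[Point(0, 0)] += 1      # equal points hash alike, so the same key is found
print(visits)                 # {Point(0, 0): 4}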
The Python Data Model:
The book starts by explaining Python’s data model, which is the foundation for everything in Python. It explores how to define and customize objects, and how Python’s magic methods (like __repr__, __str__, __len__, etc.) can be used to make objects behave consistently with Python’s expectations.
You’ll learn how Python leverages special methods for operator overloading, object representation, and protocol implementation.
Data Structures:
Ramalho provides a thorough review of Python’s built-in data structures such as lists, tuples, sets, dictionaries, and more. He explains how to efficiently use these structures, as well as advanced concepts like comprehensions, slicing, and sorting.
He also introduces custom containers, how to create immutable types, and how to use collections to implement advanced data structures like namedtuple and deque.
Functions as Objects:
The book explores first-class functions, demonstrating how functions in Python are objects and can be used as arguments, returned from other functions, or stored in data structures.
Concepts such as closures, lambda functions, decorators, and higher-order functions are explained in detail, showing their importance in building flexible, reusable code.
Object-Oriented Idioms:
Fluent Python emphasizes writing idiomatic object-oriented code and explores Python’s approach to object-oriented programming (OOP).
Topics covered include inheritance, polymorphism, interfaces, protocols, mixins, and abstract base classes (ABCs). The book discusses how to design flexible class hierarchies and how Python’s OOP system differs from other languages.
Interfaces, Protocols and ABCs
Duck typing is a dynamic typing concept based on the idea of "if it looks like a duck and quacks like a duck, it must be a duck." This means that the type or class of an object is determined by its behavior (i.e., the methods it implements) rather than its explicit type. In other words, if an object implements the necessary methods or properties required by the context, it's considered valid, regardless of its actual class or type.
No explicit type checks: Python doesn't require you to declare the type of an object, and duck typing allows you to use any object that has the required methods.
Runtime flexibility: Duck typing checks happen at runtime, so the system checks the object's methods or attributes during execution.
Error prone at runtime: Since there are no compile-time checks, errors related to missing methods are only discovered when the program is executed.
class Duck:
    def quack(self):
        print("Quack!")

class Dog:
    def quack(self):
        print("Bark that sounds like a quack!")

def make_it_quack(animal):
    animal.quack()  # Works as long as 'animal' has a 'quack' method

duck = Duck()
dog = Dog()
make_it_quack(duck)  # Output: Quack!
make_it_quack(dog)   # Output: Bark that sounds like a quack!
Goose typing is an extension of duck typing that adds more structure: instead of simply trying an object and seeing whether it responds to the right methods, you check it at runtime (typically with isinstance or issubclass) against an abstract base class rather than against a concrete class. The term is less common than duck typing or static typing and can be thought of as a middle ground between the two.
In Python, goose typing is supported by abstract base classes (ABCs), while the related idea of static duck typing relies on protocols; in both cases an object is treated as a valid type if it conforms to the interface, regardless of its concrete type.
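A minimal goose-typing sketch of my own (not from the book), using an ABC from collections.abc: the isinstance check is against the interface, not a concrete class.
from collections.abc import Sized

class Box:
    def __init__(self, items):
        self._items = list(items)
    def __len__(self):
        return len(self._items)

box = Box([1, 2, 3])
# Sized only requires __len__; isinstance recognizes Box as a virtual
# subclass through the ABC's __subclasshook__, with no inheritance needed.
print(isinstance(box, Sized))  # True
print(len(box))                # 3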
Metaprogramming:
Metaprogramming is one of the most advanced topics in Python, and the book provides a detailed exploration of how to write code that modifies or generates other code at runtime.
Descriptors, properties, class decorators, and metaclasses are discussed in depth, offering insight into how Python’s internals work and how to leverage them to write dynamic, reusable, and adaptable code.
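The gentlest of these tools is probably property; a tiny sketch of my own (not the book's example):
class Celsius:
    """A managed attribute: the property funnels every assignment through a check."""
    def __init__(self, degrees):
        self.degrees = degrees  # goes through the setter below

    @property
    def degrees(self):
        return self._degrees

    @degrees.setter
    def degrees(self, value):
        if value < -273.15:
            raise ValueError("temperature below absolute zero")
        self._degrees = value

c = Celsius(21.5)
c.degrees = 25      # validated by the setter
print(c.degrees)    # 25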
Concurrency:
Ramalho also addresses concurrency and introduces several approaches for handling parallelism and concurrency in Python, including threading, multiprocessing, asynchronous programming with asyncio, and the concurrent.futures module.
The book provides examples of how to work with I/O-bound and CPU-bound tasks efficiently, using appropriate concurrency models.
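For the I/O-bound case, a minimal sketch of my own with concurrent.futures (the URLs are placeholders I chose for illustration):
from concurrent.futures import ThreadPoolExecutor
import urllib.request

URLS = ["https://example.com", "https://example.org"]  # placeholder URLs

def fetch(url):
    # Network I/O: threads work well here because the GIL is released while waiting.
    with urllib.request.urlopen(url) as resp:
        return url, len(resp.read())

with ThreadPoolExecutor(max_workers=4) as pool:
    for url, size in pool.map(fetch, URLS):
        print(url, size, "bytes")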
Generators and Coroutines:
Generators and coroutines are powerful tools in Python for managing state and producing data on demand (lazy evaluation). Fluent Python covers the use of generators for iterating over sequences and using coroutines for writing asynchronous code in an intuitive way.
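A two-line generator already shows the lazy evaluation this chapter builds on (my own sketch, not the book's):
def countdown(n):
    """Generator: each value is produced only when the consumer asks for it."""
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
print(next(gen))   # 3 -- nothing past this value has been computed yet
print(list(gen))   # [2, 1]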
Decorators and Context Managers:
Fluent Python covers decorators and context managers extensively, explaining how they work and how they can be used to implement cleaner, more readable code.
The book covers the @property decorator, function decorators, and class decorators, as well as how to use with statements and implement custom context managers.
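A short sketch of my own combining both ideas, a timing decorator and a timing context manager (not examples from the book):
import time
import functools
from contextlib import contextmanager

def timed(func):
    """Function decorator: wraps a callable and reports how long each call takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            print(f"{func.__name__}: {time.perf_counter() - start:.4f}s")
    return wrapper

@contextmanager
def timer(label):
    """Custom context manager: times the body of a with statement."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.4f}s")

@timed
def slow_sum(n):
    return sum(range(n))

slow_sum(10**6)
with timer("list build"):
    _ = [i * i for i in range(10**6)]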
Design Patterns:
The book touches on common design patterns in Python and how they can be implemented in a Pythonic way. This includes patterns like Strategy, Observer, and Command, as well as more Python-specific approaches like duck typing and protocols.
1
Aug 21, 2024
Sep 27, 2024
Aug 21, 2024
Kindle Edition
1634629663
9781634629669
1634629663
2.79
34
unknown
Oct 01, 2021
it was ok
There is a lot of theoretical material, even too much of it about basic concepts.
The core idea is intriguing, but the authors never explain in detail how to achieve it.
The definitions already say a lot about each architecture and would actually save you the time of reading the book:
**Data Warehouse**
A data warehouse is a centralized repository that stores structured data from various sources, typically optimized for query performance and reporting. It is designed to support business intelligence (BI) and analytics, enabling users to generate reports and insights. Data in a warehouse is usually organized in a schema-based format (e.g., star schema, snowflake schema) and is subject to strict quality control and transformation processes (ETL: Extract, Transform, Load) before being loaded into the warehouse.
**Data Lake**
A data lake is a large-scale storage repository that can hold vast amounts of raw, unstructured, semi-structured, and structured data in its native format. Unlike a data warehouse, a data lake does not require data to be processed or transformed before being stored. It is designed to accommodate a wide variety of data types, such as log files, videos, images, and sensor data, making it suitable for big data analytics, machine learning, and data exploration. Data lakes are highly scalable and can be deployed on cloud platforms or on-premises.
**Data Lakehouse**
A data lakehouse is a modern data architecture that combines the scalable storage and flexibility of a data lake with the performance and management features of a data warehouse. It allows organizations to store all types of data (structured, semi-structured, and unstructured) in a single repository while enabling high-performance analytics and query processing. The data lakehouse supports ACID transactions, schema enforcement, and data governance, providing a unified platform for diverse workloads, including BI, data science, and real-time analytics.
It is a hybrid architecture that leverages the scalability and flexibility of data lakes with the reliability and performance of data warehouses. It allows organizations to store all types of data (structured, semi-structured, and unstructured) in a single repository while also supporting high-performance queries and analytics.
The book discusses the (very well known) limitations of traditional data warehouses, such as their inability to handle unstructured data efficiently and their cost-prohibitive scalability. It also covers the challenges associated with data lakes, like data governance, data quality, and the complexity of managing diverse data formats. The data lakehouse addresses these issues by integrating the best features of both architectures.
A common criticism is that the book tends to be repetitive, rehashing similar ideas and concepts multiple times without providing new insights.
While it covers the basics of the data lakehouse, it may not delve deeply enough into technical details or advanced concepts for readers who are already familiar with data architecture.
In order to move to a Lakehouse architecture:
Assessment and Planning
Evaluate Current Infrastructure: Assess the existing data warehouse architecture, including storage, ETL processes, and BI tools. Identify limitations and areas for improvement.
Define Use Cases: Determine the use cases that the data lakehouse will support, such as real-time analytics, machine learning, and unstructured data analysis.
Identify Stakeholders: Engage business stakeholders, data engineers, data scientists, and IT teams to gather requirements and expectations.
Design the Lakehouse Architecture
Storage Layer: Plan for a scalable storage solution (e.g., object storage like AWS S3, Azure Blob Storage) that can handle diverse data types (structured, semi-structured, and unstructured).
Management Layer: Implement data governance, security, and metadata management practices that ensure data quality and compliance.
Processing Layer: Incorporate processing engines capable of supporting SQL queries, machine learning, and streaming data processing (e.g., Apache Spark, Flink).
Consumption Layer: Ensure compatibility with existing BI tools and provide user-friendly access for data analysts and data scientists.
1
Aug 02, 2024
Aug 15, 2024
Aug 02, 2024
Paperback
1098142381
9781098142384
1098142381
4.10
21
unknown
Jan 16, 2024
really liked it
A solid handbook on the emerging field of analytics engineering, which bridges the gap between data engineering and data analytics.
That is why it specifically emphasizes the use of SQL, the lingua franca of DBs, and DBT (data build tool) to create scalable, maintainable, and meaningful data models that can power business intelligence (BI) and analytics workflows.
I would have hoped for a more advanced textbook, as most of the initial chapters cover basic analyst tools and concepts such as data modeling and SQL.
Since Analytics Engineer is a more technical and advanced role than Data Analyst, it would have made sense to skip the basics and go straight into the action.
Although I understand the need to attract as wide an audience as possible, as a Senior Analytics Engineer myself I found the first part of the book redundant and the last one insightful.
That is to say, for a Data Analyst it might be the opposite, making this book not super relevant.
DBT is an open-source tool that enables analytics engineers to transform data in their warehouse by writing modular SQL queries, testing data quality, and documenting data transformations. dbt automates the process of building and maintaining data models, making it easier to manage complex data pipelines.
I found the sections about DBT macros and the use of Jinja in SQL particularly interesting.
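This is not DBT itself, but a small Python sketch of my own showing the underlying idea, Jinja templating over SQL, rendered with the jinja2 package; the table and column names are invented:
from jinja2 import Template

# A macro-like template: loop over payment methods to generate repetitive SQL,
# in the same spirit as a DBT model that uses Jinja to avoid copy-pasting columns.
sql = Template("""
select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end)
        as {{ method }}_amount{{ "," if not loop.last }}
    {% endfor %}
from payments
group by order_id
""").render(payment_methods=["credit_card", "bank_transfer", "gift_card"])

print(sql)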
One final note: the book focuses almost entirely on the DBT Cloud distribution, which is a paid service.
I would have liked a deeper discussion of the open-source distribution, DBT Core, so as to understand how DBT works under the hood.
In fact, the book mainly shows the UI without going into the technical details of the CLI.
In short, it is a bit light on the DBT Core part, which is the open-source distribution.
1
Jul 26, 2024
Aug 21, 2024
Jul 26, 2024
Paperback
1617297208
9781617297205
1617297208
4.34
29
unknown
Mar 22, 2022
it was amazing
A very good introduction to Spark and its components.
It does not take anything for granted: the author explains how APIs work and what Python and SparkSQL are; for instance, he explains how JOINs work regardless of whether you use PySpark or SQL.
So if you already know these languages it might contain redundant, but still valuable, information.
NOTES
Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It was developed to provide fast and general-purpose data processing capabilities. Spark extends the MapReduce model to support more types of computations, including interactive queries and stream processing, making it a powerful engine for large-scale data analytics.
Key Features of Spark:
Speed: Spark's in-memory processing capabilities allow it to be up to 100 times faster than Hadoop MapReduce for certain applications.
Ease of Use: It provides simple APIs in Java, Scala, Python, and R, which makes it accessible for a wide range of users.
Versatility: Spark supports various workloads, including batch processing, interactive querying, real-time analytics, machine learning, and graph processing.
Advanced Analytics: It has built-in modules for SQL, streaming, machine learning (MLlib), and graph processing (GraphX).
PySpark is the Python API for Apache Spark, which allows Python developers to write Spark applications using Python. PySpark integrates the simplicity and flexibility of Python with the powerful distributed computing capabilities of Spark.
Key Features of PySpark:
Python-Friendly: It enables Python developers to leverage Spark’s power using familiar Python syntax.
DataFrames: Provides a high-level DataFrame API, which is similar to pandas DataFrames, but distributed.
Integration with Python Ecosystem: Allows seamless integration with Python libraries such as NumPy, pandas, and scikit-learn.
Machine Learning: Through MLlib, PySpark supports a wide range of machine learning algorithms.
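To make these PySpark points concrete, here is a minimal sketch (not from the book) that starts a SparkSession, builds a small DataFrame, applies a lazy transformation, and hands the result to pandas; the column names and data are invented for illustration.

```python
# Minimal PySpark DataFrame sketch (illustrative names and data).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pyspark-intro").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Transformations are lazy; show() is the action that triggers execution.
adults = df.filter(F.col("age") >= 30).withColumn("age_next_year", F.col("age") + 1)
adults.show()

# Interop with the Python ecosystem: collect a small result into pandas.
print(adults.toPandas().head())

spark.stop()
```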
SparkSQL is a module for structured data processing in Apache Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
Key Features of SparkSQL:
DataFrames: A DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a data frame in R/Pandas.
SQL Queries: SparkSQL allows users to execute SQL queries on Spark data. It supports SQL and Hive Query Language (HQL) out of the box.
Unified Data Access: It provides a unified interface for working with structured data from various sources, including Hive tables, Parquet files, JSON files, and JDBC databases.
Optimizations: Uses the Catalyst optimizer for query optimization, ensuring efficient execution of queries.
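As a rough illustration of SparkSQL's SQL-on-DataFrames idea, the hypothetical snippet below registers a DataFrame as a temporary view and queries it with plain SQL; the table and column names are made up.

```python
# SparkSQL sketch: temporary view + SQL query (illustrative data).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-intro").getOrCreate()

orders = spark.createDataFrame(
    [(1, "books", 12.50), (2, "games", 59.99), (3, "books", 7.20)],
    ["order_id", "category", "amount"],
)
orders.createOrReplaceTempView("orders")

# The same logic could be written with the DataFrame API; either way the
# query plan goes through the Catalyst optimizer.
totals = spark.sql("""
    SELECT category, SUM(amount) AS total_amount
    FROM orders
    GROUP BY category
    ORDER BY total_amount DESC
""")
totals.show()

spark.stop()
```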
Key Components and Concepts
Spark Core
RDD (Resilient Distributed Dataset): The fundamental data structure of Spark, an immutable distributed collection of objects that can be processed in parallel.
Transformations and Actions: Transformations create new RDDs from existing ones (e.g., map, filter), while actions trigger computations and return results (e.g., collect, count).
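A small, self-contained sketch of the lazy transformation/action distinction, with invented data:

```python
# RDD sketch: map/filter are lazy transformations; collect/count are actions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-intro").getOrCreate()
sc = spark.sparkContext  # the SparkContext behind the session

numbers = sc.parallelize(range(1, 11))                 # build an RDD
squares = numbers.map(lambda x: x * x)                 # transformation: new RDD
even_squares = squares.filter(lambda x: x % 2 == 0)    # another transformation

print(even_squares.collect())  # action: [4, 16, 36, 64, 100]
print(even_squares.count())    # action: 5

spark.stop()
```

In practice the DataFrame API is usually preferred over raw RDDs, since it benefits from the Catalyst optimizer.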
PySpark
RDDs and DataFrames: Similar to Spark Core but accessed using Python syntax.
SparkContext: The entry point to any Spark functionality, responsible for coordinating Spark applications.
SparkSession: An entry point to interact with DataFrames and the Spark SQL API.
SparkSQL
DataFrame API: Provides a high-level abstraction for structured data.
SparkSession: Central to SparkSQL, used to create DataFrames, execute SQL queries, and manage Spark configurations.
SQL Queries: Enables running SQL queries using the sql method on a SparkSession.
Catalog: Metadata repository that stores information about the structure of the data.
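And a brief, assumption-laden look at the Catalog: after registering a temporary view, it can be inspected for table and column metadata (the view name and data below are illustrative).

```python
# Catalog sketch: list tables and columns after registering a temp view.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalog-intro").getOrCreate()

spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"]) \
     .createOrReplaceTempView("sample")

print(spark.catalog.listTables())            # includes the 'sample' temp view
print(spark.catalog.listColumns("sample"))   # column metadata: id, label

spark.sql("SELECT COUNT(*) AS n FROM sample").show()

spark.stop()
```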
Notes are private!
1
Jul 16, 2024
Jul 25, 2024
Jul 16, 2024
Paperback
1492092398
9781492092391
1492092398
3.78
327
Apr 12, 2022
Apr 12, 2022
it was ok
It really bothers me when I feel like the author is wasting my time - this is one of the times.
Don't get me wrong: the concepts laid out are interesting, but as pointed out by many reviewers, they could well have been summarized in half the pages.
"The concept is very simple, and it's presented in the initial section of the book; what you get later is a lot of repetition w/o practical advice or (what's even worse) any useful examples - and that's probably the biggest drawback of the book: it's by far too dry and theoretical."
There are far too many definitions that create confusion, and the book remains too theoretical.
This is what ChatGPT has to say about data mesh, and unfortunately for the author, it covers most of what you need to know about this movement.
Data mesh is a decentralized approach to data architecture that aims to overcome the limitations of traditional centralized data systems, particularly in large and complex organizations. It was introduced by Zhamak Dehghani in 2019. The core idea behind data mesh is to treat data as a product and to manage data ownership and responsibilities in a decentralized way, much like how microservices are managed in software development. Here are the key principles and components of data mesh:
Domain-Oriented Data Ownership: Data is owned by the teams that know the data best, typically the ones that generate it.
Each domain team is responsible for the data it produces, ensuring high quality and relevance.
Data as a Product: Data is treated as a product with its own lifecycle, including development, maintenance, and deprecation.
Domain teams are responsible for delivering their data in a way that is easily discoverable, understandable, and usable by others.
Self-Serve Data Infrastructure: A self-service infrastructure platform is provided to domain teams to enable them to manage their data independently.
This platform typically includes tools for data storage, processing, governance, and access control.
Federated Computational Governance: Governance is implemented in a federated manner, balancing global standards with local autonomy.
This involves establishing policies and standards that are enforced across all domains while allowing domains the flexibility to manage their own data.
Components of Data Mesh
Domain Data Products: These are datasets produced by different domain teams, designed to be used by other teams.
Each data product comes with a clear contract, including schema, SLAs, quality metrics, and documentation.
Data Platform: A central platform provides common infrastructure services like data storage, processing, and security.
The platform abstracts away the complexities of underlying technologies, allowing domain teams to focus on their data products.
Governance and Standards: Policies and standards are established to ensure data quality, security, and compliance.
Governance is implemented in a federated manner, with responsibilities distributed across domain teams.
Interoperability and Communication: Mechanisms are put in place to ensure that data products from different domains can be easily integrated and used together.
This may involve standardizing on formats, interfaces, and communication protocols.
Benefits of Data Mesh
Scalability: By decentralizing data ownership and management, organizations can scale their data practices more effectively.
Each domain team can work independently, avoiding bottlenecks associated with centralized data teams.
Agility:
Domain teams can develop and iterate on their data products more quickly, responding to changing business needs.
This leads to faster innovation and time-to-market for data-driven initiatives.
Quality and Relevance:
Data ownership by domain teams ensures that the people most familiar with the data are responsible for its quality and relevance.
This leads to higher quality data that is more aligned with business needs.
Collaboration and Reuse:
Data mesh promotes a culture of data sharing and reuse, making it easier for teams to discover and use data from other domains.
This reduces duplication of effort and leads to more efficient use of data resources.
Challenges and Considerations
Cultural Change:
Implementing data mesh requires a significant cultural shift, as teams need to take on new responsibilities for data ownership and product management.
Organizations need to invest in training and change management to support this transition.
Complexity:
Managing a decentralized data architecture can introduce new complexities, particularly around governance and interoperability.
It requires careful planning and robust tooling to ensure that data remains discoverable, usable, and compliant.
Technology and Tooling:
Building a self-serve data platform requires significant investment in technology and infrastructure.
Organizations need to ensure they have the right tools and platforms to support the needs of their domain teams.
Data mesh represents a significant shift in how organizations manage and utilize their data. By decentralizing data ownership and treating data as a product, organizations can become more agile, scalable, and effective in their use of data. However, successful implementation requires careful planning, investment in infrastructure, and a commitment to cultural change.
Notes are private!
1
Jun 10, 2024
Jun 23, 2024
Jun 10, 2024
Paperback
1942788290
9781942788294
1942788290
4.26
48,280
Jan 10, 2013
Feb 01, 2018
it was amazing
This gem of a book took me by surprise and deservedly sits in my "best" section.
It is the perfect blend between fiction and textbook: some parts make you laugh, others make you think and reflect on important work-related topics.
The story centers around Bill Palmer, an IT manager at Parts Unlimited, a struggling automotive parts company. The company's new initiative, code-named Phoenix Project, is critical for its survival, but it's over budget, behind schedule, and plagued by numerous issues.
If you work in the Tech industry, and in particular in the Digital (ex IT) teams, you can very much relate to what happens at Parts Unlimited.
Bill Palmer is unexpectedly promoted to VP of IT Operations.
The CEO sugarcoats the pill, but in truth Bill is on course for a suicide mission: the relationship between IT and the rest of the company is dysfunctional to say the least.
From day zero, Bill finds himself in the middle of political meetings where, more often than not, everyone blames IT for everything that does not work.
In short, he faces a messy situation.
Then a savior appears: Erik, who teaches Bill how to deal with complexity.
The Three Ways, in short:
First Way: Emphasizes the performance of the entire system, rather than individual departments.
Second Way: Focuses on creating feedback loops to enable continuous improvement.
Third Way: Encourages a culture of continual experimentation and learning.
DevOps Principles: The book illustrates the core principles of DevOps, including continuous delivery, automation, and the integration of development and operations teams.
Workflow Optimization: Emphasizes the importance of streamlining workflows, eliminating bottlenecks, and improving efficiency.
Cultural Change: Highlights the necessity of cultural transformation within an organization to adopt DevOps practices effectively.
Systems Thinking: Encourages a holistic view of the IT environment and its impact on the business as a whole.
But before meeting Erik, of course, the first thing Bill tries to do is understand what's going on, what the root causes of so many incidents are, and how to improve the relationships with the Developers and Security.
Most importantly, Bill wants to get a grip on the situation: who is doing what, and who is authorizing changes.
However, Bill, Patty and Wes soon realise that there are too many pending changes, so much so that it is difficult even to list them all.
Thinking for a moment, I add, "For that matter, do the same thing for every person assigned to Phoenix. I'm guessing we're overloaded, so I want to know by how much. I want to proactively tell people whose projects have been bumped, so they're not surprised when we don't deliver what we promised."
Both Wes and Patty look surprised. Wes speaks up first, "But...but we'd have to talk with almost everyone! Patty may have fun grilling people on what changes they're making, but we can't go around wasting the time of our best people. They've got real work to do!"
"Yes, I know they have real work to do," I say adamantly. "I merely want a one-line description about what all that work is and how long they think it will take!"
Realizing how this might come across, I add, "Make sure you tell people that we're doing this so we can get more resources. I don't want anyone thinking that we're outsourcing or firing anyone, okay?"
Patty nods. "We should have done this a long time ago. We bump up the priorities of things all the time, but we never really know what just got bumped down. That is, until someone screams at us, demanding to know why we haven't delivered something."
She types on her laptop. "You just want a list of organizational commitments for our key resources, with a one-liner on what they're working on and how long it will take. We'll start with all Phoenix and audit remediation resources first, but will eventually cover the entire IT Operations organization. Do I have it right?"
And then, when everything seemed lost, the strange figure of Erik entered the scene.
He is brought into the story as a mysterious and knowledgeable figure who understands the deep-rooted problems within IT operations and the organization as a whole. His primary role is to mentor Bill Palmer, providing him with the insights and guidance needed to tackle the complex issues facing Parts Unlimited.
Erik emphasizes the importance of optimizing the entire system rather than focusing on individual departments. This involves understanding how various parts of the organization interact and ensuring that improvements in one area do not create problems in another.
I look at Erik suspiciously. He supposedly couldn't get anyone's name right, and yet he apparently remembers the name of some security guard from years past. And no one ever mentioned anything about a Dr. Reid.
After climbing five flights of stairs, we're standing on a catwalk that overlooks the entire plant floor, looking like it goes on for at least two city blocks in every direction.
"Look down there," he says. "You can see loading docks on each side of the building. Raw materials are brought in on this side, and the finished goods leave out the other. Orders come off that printer down there. If you stand here long enough, you can actually see all the WIP, that's 'work in process' or 'inventory' for plant newbies, make its way toward the other side of the plant floor, where it's shipped to customers as finished goods."
"For decades at this plant," he continues, "there were piles of inventory everywhere. In many places, it was piled as high as you could stack them using those big forklifts over there. On some days, you couldn't even see the other side of the building. In hindsight, we now know that WIP is one of the root causes for chronic due-date problems, quality issues, and expediters having to rejuggle priorities every day. It's amazing that this business didn't go under as a result."
He gestures broadly with both arms outstretched, "In the 1980s, this plant was the beneficiary of three incredible scientifically-grounded management movements. You've probably heard of them: the Theory of Constraints, Lean production or the Toyota Production System, and Total Quality Management. Although each movement started in different places, they all agree on one thing: WIP is the silent killer. Therefore, one of the most critical mechanisms in the management of any plant is job and materials release. Without it, you can't control WIP."
He points at a desk near the loading docks closest to us.
Moreover Erik teaches Bill how to avoid bottlenecks and necessity of creating robust feedback loops within the organization. These feedback loops help identify issues quickly, allow for continuous improvement, and ensure that knowledge is shared across teams.
In Bill's team they have the Brent problem: a guy who can do it all, but who doesn't even know how he does it, and therefore everybody calls for his help.
That is, Brent is a constraint.
He shakes his head, recalling the memory, "He sat down at the keyboard, and it's like he went into this trance. Ten minutes later, the problem is fixed. Everyone is happy and relieved that the system came back up. But then someone asked, 'How did you do it?' And I swear to God, Brent just looked back at him blankly and said, 'I have no idea. I just did it."
Wes thumps the table and says, "And that is the problem with Brent. How the hell do you document that? 'Close your eyes and go into a trance?"
Patty laughs, apparently recalling the story. She says, "I'm not suggesting Brent is doing this deliberately, but I wonder whether Brent views all his knowledge as a sort of power. Maybe some part of him is reluctant to give that up. It does put him in this position where he's virtually impossible to replace."
"Maybe. Maybe not," I say. "I'll tell you what I do know, though. Every time that we let Brent fix something that none of us can replicate, Brent gets a little smarter, and the entire system gets dumber. We've got to put an end to that.
But if everybody needs Brent, his workload becomes unmanageable, and therefore his tasks are always late.
An ever-growing pile of changes trapped inside of IT Operations, with us running out of space to post the change cards.
Work piling up in front of the heat treat oven, because of Mark sitting at the job release desk releasing work.
Work piling up in front of Brent, because of...
Because of what?
Okay, if Brent is our heat treat oven, then who is our Mark? Who authorized all this work to be put in the system?
Well, we did. Or rather, the CAB did. Crap. Does that mean we did this to ourselves?
But changes need to get done, right? That's why they're changes. Besides, how do you say no to the onslaught of incoming work?
Looking at the cards piling up, can we afford not to?
But when was the question ever asked whether we should accept the work? And on what basis did we ever make that decision?
Again, I don't know the answer. But, worse, I have a feeling that Erik may not be a raving madman. Maybe he's right. Maybe there is some sort of link between plant floor management and IT Operations. Maybe plant floor management and IT Operations actually have similar challenges and problems.
I stand up and walk to the change board. I start thinking aloud, "Patty is alarmed that more than half our changes aren't completing as scheduled, to the extent that she's wondering whether this whole change process is worth the time we're investing in it.
"Furthermore," I continue, "she points out that a significant portion of the changes can't complete because Brent is somehow in the way, which is partially because we've directed Brent to reject all non-Phoenix work. We think that reversing this policy is the wrong thing to do."
I take a mental leap, following my intuition. "And I'd bet a million dollars that this is the exact wrong thing to do. It's because of this process that, for the first time, we're even aware of how much scheduled work isn't getting done! Getting rid of the process would just kill our situational awareness."
Feeling like I'm getting on a roll, I say adamantly, "Patty, we need a better understanding of what work is going to be heading Brent's way.
And then there is the ultimate monster: Unplanned work.
Unplanned work often interrupts scheduled tasks and projects, leading to delays and inefficiencies.
When team members are constantly pulled away to deal with unplanned issues, it consumes valuable time and resources. This can prevent the completion of strategic work and contribute to employee burnout.
Frequent unplanned work can indicate underlying issues in processes or systems. It often leads to a reactive mode of operation where teams are firefighting instead of proactively improving systems and preventing problems.
I turn back to Patty and say slowly, "Let me guess. Brent didn't get any of his non-Phoenix change work completed either, right?"
"Of course not! You were there, right?" she says, looking at me like I had grown eight heads. "Brent was working around-the-clock on the recovery efforts, building all the new tooling to keep all the systems and data up. Everything else was put on the back-burner."
All the firefighting displaced all the planned work, both projects and changes.
Ah... Now I see it.
What can displace planned work? Unplanned work. Of course.
I laugh uproariously, which earns me a look of genuine concern from Patty, who even takes a step back from me.
That's why Erik called it the most destructive type of work. It's not really work at all, like the others. The others are what you planned on doing, allegedly because you needed to do it.
Unplanned work is what prevents you from doing it. Like matter and antimatter, in the presence of unplanned work, all planned work ignites with incandescent fury, incinerating everything around it. Like Phoenix.
So much of what I've been trying to do during my short tenure as VP of IT Operations is to prevent unplanned work from happening: coordinating changes better so they don't fail, ensuring the orderly handling of incidents and outages to prevent interrupting key resources, doing whatever it takes so that Brent won't be escalated to...
I've been doing it mostly by instinct. I knew it was what had to be done, because people were working on the wrong things. I tried to take all necessary steps to keep people from doing wrong work, or rather, unplanned work.
I say, cackling and pumping my arms as if I had just scored a game-winning, sixty-yard field goal, "Yes! I see it now! It really is unplanned work! The fourth category of work is unplanned work!"
Notes are private!
1
May 09, 2024
May 22, 2024
May 09, 2024
Paperback
1484251776
9781484251775
B07Z1PHHQ9
3.56
9
unknown
Oct 10, 2019
really liked it
A very good introduction to ML, DL and anomaly detection, but with the original sin of poor pagination and even poorer graphic design.
All in all it does its job in explaining how to deal with anomaly detection, but I'd have liked a few more unsupervised examples, which are the toughest situations to deal with.
The Deep Learning section is very well written: it starts from the basics, from the artificial neuron up to the state of the art like CNNs and GPT.
NOTES
Data-based Anomaly Detection
Statistical Methods: These methods rely on statistical measures such as mean, standard deviation, or probability distributions to identify anomalies. Examples include z-score, interquartile range (IQR), and Gaussian distribution modeling.
Machine Learning Algorithms: Various machine learning algorithms learn patterns from the data and detect anomalies based on deviations from learned patterns. Techniques like decision trees, support vector machines (SVM), isolation forests, and autoencoders fall into this category.
Context-based Anomaly Detection
Domain Knowledge: Context-based approaches leverage domain-specific knowledge to identify anomalies. For example, in network security, unusual network traffic patterns may be detected based on knowledge of typical network behavior.
Expert Systems: Expert systems use rule-based or knowledge-based systems to detect anomalies based on predefined rules or heuristics derived from domain expertise.
Pattern-based Anomaly Detection
Pattern Recognition: Pattern-based approaches focus on identifying deviations from expected patterns within the data. Techniques such as time series analysis, sequence mining, and clustering fall into this category.
Deep Learning: Deep learning techniques, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and generative adversarial networks (GANs), can be used for pattern-based anomaly detection by learning complex patterns and detecting deviations from learned representations.
Outlier detection focuses on identifying data points that deviate significantly from the majority of the dataset. These data points are often called outliers. Outliers can be indicative of errors, anomalies, or rare events in the data. Techniques such as statistical methods (e.g., z-score, IQR), machine learning algorithms (e.g., isolation forests, one-class SVM), and clustering can be used for outlier detection.
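As a hedged illustration of the statistical route mentioned above, the snippet below applies the z-score and IQR rules to synthetic data; the 3-sigma and 1.5×IQR thresholds are conventional choices, not prescriptions from the book.

```python
# Statistical outlier detection sketch: z-score and IQR rules on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=10.0, scale=0.5, size=200), 25.0)  # one planted outlier

# z-score rule: flag points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
print("z-score flagged:", values[np.abs(z_scores) > 3.0])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
# It may also flag a few extreme-but-legitimate tail values.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
print("IQR flagged:", values[mask])
```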
Novelty detection, also known as one-class classification, involves identifying instances that significantly differ from normal data, without having access to examples of anomalies during training. The goal is to detect novel or unseen patterns in the data. It's particularly useful when anomalies are rare and difficult to obtain labeled data for. Techniques such as support vector machines (SVM) and autoencoders are commonly used for novelty detection.
Event detection aims to identify significant occurrences or events in a dataset, often in real-time or near real-time. These events may represent changes, anomalies, or patterns of interest in the data stream. Event detection is crucial in various domains such as sensor networks, finance, and cybersecurity. Techniques such as time series analysis, signal processing, and machine learning algorithms can be applied for event detection.
Noise removal involves the process of filtering or eliminating unwanted or irrelevant data points from a dataset. Noise can obscure meaningful patterns and distort the analysis results. Techniques such as smoothing filters, wavelet denoising, and outlier detection can be used for noise removal, depending on the nature of the noise and the characteristics of the data.
Traditional ML Algorithms
Isolation Forest
It is an unsupervised machine learning algorithm used for anomaly detection. It works by isolating anomalies in the data by splitting them from the rest of the data using binary trees.
Random Partitioning: Isolation Forest randomly selects a feature and then randomly selects a value within the range of that feature. It then partitions the data based on this randomly selected feature and value.
Recursive Partitioning: This process of random partitioning is repeated recursively until all data points are isolated or a predefined maximum depth is reached.
Anomaly Score Calculation: Anomalies are expected to be isolated with fewer partitions compared to normal data points. Therefore, anomalies are assigned lower anomaly scores. These scores are based on the average path length required to isolate the data points during the partitioning process. The shorter the path, the more likely it is to be an anomaly.
Thresholding: An anomaly threshold is defined, and data points with anomaly scores below this threshold are considered anomalies.
Let's consider a simple example of anomaly detection in a dataset containing information about server response times. The dataset includes features such as CPU usage, memory usage, and network traffic. We want to identify anomalous server responses that indicate potential system failures or cyber attacks.
Random Partitioning: In the first iteration, the algorithm randomly selects a feature, let's say CPU usage, and then randomly selects a value within the range of CPU usage, for example, 80%.
Based on this random selection, it partitions the data into two groups: data points with CPU usage <= 80% and data points with CPU usage > 80%.
Recursive Partitioning: This process is repeated recursively, with random feature and value selections, until each data point is isolated or the maximum depth is reached. Each partitioning step creates a binary tree structure.
Anomaly Score Calculation: Anomalies are expected to require fewer partitions to isolate. Therefore, data points that are isolated early in the process (i.e., with shorter average path lengths) are assigned lower anomaly scores.
Thresholding: An anomaly threshold is defined based on domain knowledge or validation data. Data points with anomaly scores below this threshold are flagged as anomalies.
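A minimal sketch of this Isolation Forest workflow using scikit-learn, with synthetic "server metrics" standing in for CPU, memory, and network features; the contamination rate and data are invented.

```python
# Isolation Forest sketch on synthetic server metrics (CPU, memory, network).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=[50, 60, 30], scale=[10, 8, 5], size=(500, 3))
anomalies = rng.normal(loc=[95, 95, 90], scale=[2, 2, 2], size=(5, 3))
X = np.vstack([normal, anomalies])

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = model.fit_predict(X)    # -1 = anomaly, 1 = normal
scores = model.score_samples(X)  # lower score = shorter isolation path = more anomalous

print("flagged indices:", np.where(labels == -1)[0])
```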
One-Class Support Vector Machine
One-Class Support Vector Machine (SVM) is a type of support vector machine algorithm that is used for anomaly detection, particularly when dealing with unlabeled data. It is trained on only the normal data instances and aims to create a decision boundary that encapsulates the normal data points, thereby distinguishing them from potential anomalies.
Training Phase: One-Class SVM is trained using only the normal instances (i.e., data points without anomalies).
The algorithm aims to find a hyperplane (decision boundary) that best separates the normal data points from the origin in the feature space.
Unlike traditional SVM, which aims to maximize the margin between different classes, One-Class SVM aims to enclose as many normal data points as possible within a margin around the decision boundary.
Model Representation: The decision boundary created by One-Class SVM is represented by a hyperplane defined by a set of support vectors and a distance parameter called the "nu" parameter.
The hyperplane divides the feature space into two regions: the region encapsulating the normal data points (inliers) and the region outside the boundary, which may contain anomalies (outliers).
Prediction Phase: During the prediction phase, new data points are evaluated based on their proximity to the decision boundary.
Data points falling within the boundary (inside the margin) are classified as normal (inliers).
Data points falling outside the boundary (outside the margin) are classified as potential anomalies (outliers).
Hyperparameter Tuning: One-Class SVM typically has a hyperparameter called "nu" that controls the trade-off between maximizing the margin and allowing for violations (i.e., data points classified as outliers). Tuning this hyperparameter is crucial for achieving optimal performance.
Scalability: One-Class SVM is computationally efficient, particularly when dealing with high-dimensional data or large datasets. However, it may become less effective in extremely high-dimensional spaces.
Robustness to Outliers: It is inherently robust to outliers in the training data since it learns from only one class. However, it may still misclassify some anomalies that lie close to the decision boundary.
Class Imbalance: It assumes that the normal class is the majority class and anomalies are rare. If anomalies are not significantly different from normal instances, or if they form a significant portion of the data, One-Class SVM may not perform well.
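Here is a hedged One-Class SVM sketch in scikit-learn: it is trained on normal points only and then scores new points as inliers or outliers; the nu and gamma values, and the data, are illustrative.

```python
# One-Class SVM sketch: fit on normal data only, then classify new points.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X_train = rng.normal(loc=0.0, scale=1.0, size=(300, 2))            # "normal" data only
X_test = np.array([[0.1, -0.2], [4.5, 4.0], [-0.3, 0.4], [6.0, -5.0]])

scaler = StandardScaler().fit(X_train)
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
clf.fit(scaler.transform(X_train))

pred = clf.predict(scaler.transform(X_test))   # +1 = inlier, -1 = outlier
print(pred)   # expect roughly [ 1, -1,  1, -1]
```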
Deep Learning
An artificial neuron, also known as a perceptron, is a fundamental building block of artificial neural networks. It mimics the behavior of biological neurons in the human brain.
The input, a vector from x₁ to xₙ, is multiplied element-wise by a weight vector w₁ to wₙ and then summed together. The sum is then offset by a bias term b, and the result passes through an activation function, which is some mathematical function that delivers an output signal based on the magnitude and sign of the input. An example is a simple step function that outputs 1 if the combined input passes a threshold, or 0 otherwise. These now form the outputs, y₁ to yₘ. This y-vector can now serve as the input to another neuron.
Input Layer: An artificial neuron typically receives input signals from other neurons or directly from the input features of the data. Each input signal x is associated with a weight w that represents the strength of the connection between the input and the neuron.
Weighted Sum: The neuron computes the weighted sum of the input signals and their corresponding weights, plus the bias: z = w₁x₁ + w₂x₂ + … + wₙxₙ + b. The bias term allows the neuron to adjust the decision boundary independently of the input data.
Activation Function: The weighted sum z is then passed through an activation function, f(z). It introduces non-linearity into the neuron, enabling it to model complex relationships and learn non-linear patterns in the data. Common activation functions include sigmoid, tanh, ReLU, Leaky ReLU, ELU, etc.
Output: The output y of the neuron is the result of applying the activation function to the weighted sum: y = f(z).
The output of the neuron represents its activation level or firing rate, which is then passed as input to other neurons in the subsequent layers of the neural network.
Bias Term: The bias term b is a constant value added to the weighted sum before applying the activation function. It allows the neuron to control the decision boundary independently of the input data. The bias term effectively shifts the activation function horizontally, influencing the threshold at which the neuron fires.
Activation Function: introduces non-linearity into the neuron's output. This non-linearity enables the neural network to learn complex relationships and patterns in the data that may not be captured by a simple linear model. The choice of activation function depends on the specific requirements of the task and the characteristics of the data.
Output Layer: In a neural network, neurons are organized into layers. The output layer typically consists of one or more neurons that produce the final output of the network. The activation function used in the output layer depends on the nature of the task. For example, sigmoid or softmax functions are commonly used for binary or multi-class classification tasks, while linear functions may be used for regression tasks.
Activation Functions are a way to map the input signals into some form of output signal to be interpreted by the subsequent neurons.
They are designed to add non-linearity to the data.
If we do not use one, then the output of the affine transformation is simply the final output of the neuron.
- Sigmoid: The sigmoid activation function squashes the input values between 0 and 1. It has an S-shaped curve and is commonly used in binary classification tasks. However, it suffers from the vanishing gradient problem and is not recommended for the hidden layers of deep neural networks. It is appropriate at the very end of a DNN to map the last layer's raw output into a probability score.
- Hyperbolic Tangent (Tanh): Tanh activation function squashes the input values between -1 and 1. Similar to the sigmoid function, it has an S-shaped curve but centered at 0. Tanh is often used in hidden layers of neural networks.
- Rectified Linear Unit (ReLU): outputs the input directly if it is positive, otherwise, it outputs zero. It is computationally efficient and helps in mitigating the vanishing gradient problem. ReLU is widely used in deep learning models due to its simplicity and effectiveness.
- Leaky ReLU: is similar to ReLU but allows a small, non-zero gradient when the input is negative. This helps prevent dying ReLU neurons, which can occur when a large gradient update causes the neuron to never activate again.
- Exponential Linear Unit (ELU): is similar to ReLU for positive input values but smoothly approaches zero for negative input values. It helps in preventing dead neurons and can capture information from negative inputs.
- Softmax: is typically used in the output layer of a neural network for multi-class classification tasks. It converts the raw output scores (logits) into probabilities, ensuring that the sum of the probabilities for all classes is equal to 1. Softmax is useful for determining the probability distribution over multiple classes.
A layer in a neural network is a collection of neurons that each compute some output value using the entire input. The output of a layer is comprised of all the output values computed by the neurons within that layer. A neural network is a sequence of layers of neurons where the output of one layer is the input to the next.
The first layer of the neural network is the input layer, and it takes in the training data as the input. The last layer of the network is the output layer, and it outputs values that are used as predictions for whatever task the network is being trained to perform. All layers in between are called hidden layers.
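To tie the neuron, activation, and layer ideas together, here is a from-scratch forward pass in NumPy (my own sketch, not the book's code); the layer sizes and random weights are arbitrary.

```python
# Forward pass sketch: weighted sum + bias, non-linear activation, chained layers.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, W, b, activation):
    """One layer: y = f(Wx + b)."""
    return activation(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                       # input features x1..x3

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer: 1 neuron

h = dense(x, W1, b1, relu)        # hidden activations
y = dense(h, W2, b2, sigmoid)     # final output squashed into (0, 1)
print("hidden:", np.round(h, 3), "output:", np.round(y, 3))
```

Training (adjusting W and b via backpropagation) is a separate step; this only shows how a signal flows through the layers.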
All in all A very good introduction to ML, DL and anomaly detection but with the original sin of a poor pagination and an even poorer graphics design.
All in all it does its job in explaining how to deal with anomaly detection, but I'd have liked a little bit more of unsupervised examples, which are the toughest situations to deal with.
The Deep Learning section is very well written, they start from the basics, from the artificial neuron up to state of art like CNNs and GPT.
NOTES
Data-based Anomaly Detection
Statistical Methods: These methods rely on statistical measures such as mean, standard deviation, or probability distributions to identify anomalies. Examples include z-score, interquartile range (IQR), and Gaussian distribution modeling.
Machine Learning Algorithms: Various machine learning algorithms learn patterns from the data and detect anomalies based on deviations from learned patterns. Techniques like decision trees, support vector machines (SVM), isolation forests, and autoencoders fall into this category.
Context-based Anomaly Detection
Domain Knowledge: Context-based approaches leverage domain-specific knowledge to identify anomalies. For example, in network security, unusual network traffic patterns may be detected based on knowledge of typical network behavior.
Expert Systems: Expert systems use rule-based or knowledge-based systems to detect anomalies based on predefined rules or heuristics derived from domain expertise.
Pattern-based Anomaly Detection
Pattern Recognition: Pattern-based approaches focus on identifying deviations from expected patterns within the data. Techniques such as time series analysis, sequence mining, and clustering fall into this category.
Deep Learning: Deep learning techniques, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and generative adversarial networks (GANs), can be used for pattern-based anomaly detection by learning complex patterns and detecting deviations from learned representations.
Outlier detection focuses on identifying data points that deviate significantly from the majority of the dataset. These data points are often called outliers. Outliers can be indicative of errors, anomalies, or rare events in the data. Techniques such as statistical methods (e.g., z-score, IQR), machine learning algorithms (e.g., isolation forests, one-class SVM), and clustering can be used for outlier detection.
Novelty detection, also known as one-class classification, involves identifying instances that significantly differ from normal data, without having access to examples of anomalies during training. The goal is to detect novel or unseen patterns in the data. It's particularly useful when anomalies are rare and difficult to obtain labeled data for. Techniques such as support vector machines (SVM) and autoencoders are commonly used for novelty detection.
Event detection aims to identify significant occurrences or events in a dataset, often in real-time or near real-time. These events may represent changes, anomalies, or patterns of interest in the data stream. Event detection is crucial in various domains such as sensor networks, finance, and cybersecurity. Techniques such as time series analysis, signal processing, and machine learning algorithms can be applied for event detection.
Noise removal involves the process of filtering or eliminating unwanted or irrelevant data points from a dataset. Noise can obscure meaningful patterns and distort the analysis results. Techniques such as smoothing filters, wavelet denoising, and outlier detection can be used for noise removal, depending on the nature of the noise and the characteristics of the data.
Traditional ML Algorithms
Isolation Forest
It is an unsupervised machine learning algorithm used for anomaly detection. It works by isolating anomalies in the data by splitting them from the rest of the data using binary trees.
Random Partitioning: Isolation Forest randomly selects a feature and then randomly selects a value within the range of that feature. It then partitions the data based on this randomly selected feature and value.
Recursive Partitioning: This process of random partitioning is repeated recursively until all data points are isolated or a predefined maximum depth is reached.
Anomaly Score Calculation: Anomalies are expected to be isolated with fewer partitions compared to normal data points. Therefore, anomalies are assigned lower anomaly scores. These scores are based on the average path length required to isolate the data points during the partitioning process. The shorter the path, the more likely it is to be an anomaly.
Thresholding: An anomaly threshold is defined, and data points with anomaly scores below this threshold are considered anomalies.
Let's consider a simple example of anomaly detection in a dataset containing information about server response times. The dataset includes features such as CPU usage, memory usage, and network traffic. We want to identify anomalous server responses that indicate potential system failures or cyber attacks.
Random Partitioning: In the first iteration, the algorithm randomly selects a feature, let's say CPU usage, and then randomly selects a value within the range of CPU usage, for example, 80%.
Based on this random selection, it partitions the data into two groups: data points with CPU usage <= 80% and data points with CPU usage > 80%.
Recursive Partitioning: This process is repeated recursively, with random feature and value selections, until each data point is isolated or the maximum depth is reached. Each partitioning step creates a binary tree structure.
Anomaly Score Calculation: Anomalies are expected to require fewer partitions to isolate. Therefore, data points that are isolated early in the process (i.e., with shorter average path lengths) are assigned lower anomaly scores.
Thresholding: An anomaly threshold is defined based on domain knowledge or validation data. Data points with anomaly scores below this threshold are flagged as anomalies.
One-Class Support Vector Machine
One-Class Support Vector Machine (SVM) is a type of support vector machine algorithm that is used for anomaly detection, particularly when dealing with unlabeled data. It is trained on only the normal data instances and aims to create a decision boundary that encapsulates the normal data points, thereby distinguishing them from potential anomalies.
Training Phase: One-Class SVM is trained using only the normal instances (i.e., data points without anomalies).
The algorithm aims to find a hyperplane (decision boundary) that best separates the normal data points from the origin in the feature space.
Unlike traditional SVM, which aims to maximize the margin between different classes, One-Class SVM aims to enclose as many normal data points as possible within a margin around the decision boundary.
Model Representation: The decision boundary created by One-Class SVM is represented by a hyperplane defined by a set of support vectors and a distance parameter called the "nu" parameter.
The hyperplane divides the feature space into two regions: the region encapsulating the normal data points (inliers) and the region outside the boundary, which may contain anomalies (outliers).
Prediction Phase: During the prediction phase, new data points are evaluated based on their proximity to the decision boundary.
Data points falling within the boundary (inside the margin) are classified as normal (inliers).
Data points falling outside the boundary (outside the margin) are classified as potential anomalies (outliers).
Hyperparameter Tuning: One-Class SVM typically has a hyperparameter called "nu" that controls the trade-off between maximizing the margin and allowing for violations (i.e., data points classified as outliers). Tuning this hyperparameter is crucial for achieving optimal performance.
Scalability: One-Class SVM handles high-dimensional data well, but training can become expensive on very large datasets, and it may become less effective in extremely high-dimensional spaces.
Robustness to Outliers: One-Class SVM is inherently robust to outliers in the training data since it learns from only one class. However, it may still misclassify some anomalies that lie close to the decision boundary.
Class Imbalance: One-Class SVM assumes that the training data consists almost entirely of normal instances and that anomalies are rare. If anomalies are not significantly different from normal instances, or if they form a significant portion of the data, One-Class SVM may not perform well.
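A minimal sketch with Scikit-Learn's OneClassSVM, assuming synthetic data and an illustrative nu value; training uses only normal instances, and prediction labels +1 for inliers and -1 for outliers:

```python
# Illustrative sketch: One-Class SVM trained on normal data only.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))       # normal data only
X_test = np.vstack([rng.normal(size=(20, 2)),                 # normal-looking points
                    rng.uniform(low=4, high=6, size=(5, 2))]) # obvious outliers

scaler = StandardScaler().fit(X_train)
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")     # nu bounds the outlier fraction
ocsvm.fit(scaler.transform(X_train))

pred = ocsvm.predict(scaler.transform(X_test))  # +1 = inlier, -1 = outlier
print(pred)
```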
Deep Learning
An artificial neuron, also known as a perceptron, is a fundamental building block of artificial neural networks. It mimics the behavior of biological neurons in the human brain.
The input, a vector from x1 to xn, is multiplied element-wise by a weight vector w1 to wn and then summed together. The sum is then offset by a bias term b, and the result passes through an activation function, which is some mathematical function that delivers an output signal based on the magnitude and sign of the input. An example is a simple step function that outputs 1 if the combined input passes a threshold, or 0 otherwise. These now form the outputs, y1 to ym. This y-vector can now serve as the input to another neuron.
Input Layer: An artificial neuron typically receives input signals from other neurons or directly from the input features of the data. Each input signal x is associated with a weight w that represents the strength of the connection between the input and the neuron.
Weighted Sum: The neuron computes the weighted sum of the input signals and their corresponding weights. The bias term allows the neuron to adjust the decision boundary independently of the input data.
Activation Function: The weighted sum z is then passed through an activation function, f(z). It introduces non-linearity into the neuron, enabling it to model complex relationships and learn non-linear patterns in the data. Common activation functions include sigmoid, tanh, ReLU, Leaky ReLU, ELU, etc.
Output: The output y of the neuron is the result of applying the activation function to the weighted sum: y = f(z).
The output of the neuron represents its activation level or firing rate, which is then passed as input to other neurons in the subsequent layers of the neural network.
Bias Term: The bias term b is a constant value added to the weighted sum before applying the activation function. It allows the neuron to control the decision boundary independently of the input data. The bias term effectively shifts the activation function horizontally, influencing the threshold at which the neuron fires.
Activation Function: introduces non-linearity into the neuron's output. This non-linearity enables the neural network to learn complex relationships and patterns in the data that may not be captured by a simple linear model. The choice of activation function depends on the specific requirements of the task and the characteristics of the data.
Output Layer: In a neural network, neurons are organized into layers. The output layer typically consists of one or more neurons that produce the final output of the network. The activation function used in the output layer depends on the nature of the task. For example, sigmoid or softmax functions are commonly used for binary or multi-class classification tasks, while linear functions may be used for regression tasks.
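A minimal NumPy sketch of the single neuron described above (inputs, weights, bias, and a step activation); the specific numbers are made up for illustration:

```python
# Illustrative single artificial neuron: weighted sum + bias -> activation.
import numpy as np

def step(z):
    """Output 1 if the combined input passes the threshold 0, else 0."""
    return np.where(z >= 0, 1, 0)

x = np.array([0.5, -1.2, 3.0])   # inputs x1..xn
w = np.array([0.8, 0.1, -0.4])   # weights w1..wn
b = 0.2                          # bias term

z = np.dot(w, x) + b             # weighted sum plus bias
y = step(z)                      # activation -> output
print(z, y)
```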
Activation Functions are a way to map the input signals into some form of output signal to be interpreted by the subsequent neurons.
They are designed to add non-linearity to the data.
If we do not use one, the output of the neuron is just an affine transformation of its inputs, and a stack of such layers collapses into a single linear model.
- Sigmoid: The sigmoid activation function squashes the input values between 0 and 1. It has an S-shaped curve and is commonly used in binary classification tasks. However, it suffers from the vanishing gradient problem and is not recommended for the hidden layers of deep neural networks. It is appropriate at the very end of a DNN, to map the last layer's raw output into a probability score.
- Hyperbolic Tangent (Tanh): Tanh activation function squashes the input values between -1 and 1. Similar to the sigmoid function, it has an S-shaped curve but centered at 0. Tanh is often used in hidden layers of neural networks.
- Rectified Linear Unit (ReLU): outputs the input directly if it is positive, otherwise, it outputs zero. It is computationally efficient and helps in mitigating the vanishing gradient problem. ReLU is widely used in deep learning models due to its simplicity and effectiveness.
- Leaky ReLU: is similar to ReLU but allows a small, non-zero gradient when the input is negative. This helps prevent dying ReLU neurons, which can occur when a large gradient update causes the neuron to never activate again.
- Exponential Linear Unit (ELU): is similar to ReLU for positive input values but smoothly saturates to a small negative value (-alpha) for large negative inputs. It helps in preventing dead neurons and can capture information from negative inputs.
- Softmax: is typically used in the output layer of a neural network for multi-class classification tasks. It converts the raw output scores (logits) into probabilities, ensuring that the sum of the probabilities for all classes is equal to 1. Softmax is useful for determining the probability distribution over multiple classes.
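For reference, a hedged NumPy sketch of the activation functions listed above; the formulas are standard, and the alpha defaults are illustrative:

```python
# Common activation functions in plain NumPy.
import numpy as np

def sigmoid(z):                   # squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                      # squashes to (-1, 1), centered at 0
    return np.tanh(z)

def relu(z):                      # 0 for negatives, identity for positives
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):    # small non-zero slope for negatives
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):            # smooth saturation towards -alpha for negatives
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def softmax(z):                   # probabilities that sum to 1 across classes
    e = np.exp(z - np.max(z))     # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(softmax(z).sum())           # -> 1.0
```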
A layer in a neural network is a collection of neurons that each compute some output value using the entire input. The output of a layer is comprised of all the output values computed by the neurons within that layer. A neural network is a sequence of layers of neurons where the output of one layer is the input to the next.
The first layer of the neural network is the input layer, and it takes in the training data as the input. The last layer of the network is the output layer, and it outputs values that are used as predictions for whatever task the network is being trained to perform. All layers in between are called hidden layers.
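As a sketch of the layer idea (not code from the book), a layer can be written as a matrix multiply plus a bias followed by an activation, and a network is just a chain of such layers; the weights below are random placeholders:

```python
# Illustrative two-layer forward pass: input -> hidden -> output.
import numpy as np

rng = np.random.default_rng(1)

def dense(x, W, b, activation):
    # Every neuron in the layer sees the whole input vector x.
    return activation(W @ x + b)

relu = lambda z: np.maximum(0.0, z)
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

x = rng.normal(size=3)                            # input layer (3 features)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # hidden layer, 4 neurons
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)     # output layer, 2 neurons

hidden = dense(x, W1, b1, relu)
output = dense(hidden, W2, b2, softmax)           # class probabilities
print(output)
```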
Notes are private!
1
Mar 27, 2024
Apr 19, 2024
Mar 27, 2024
Kindle Edition
1617298891
9781617298899
1617298891
3.75
16
unknown
Dec 21, 2021
it was amazing
This textbook has a very clear and effective structure: each part focuses on the virtues and capabilities that are needed at different career stages.
That is why I will for sure re-read the later parts of this book as my career progresses.
Right now the section that interests me the most is leading projects.
It has the right blend between theory and practical examples.
Technologies
Framing the Problem to Maximize Business Impact: framing the problem well can be more important than identifying data sources. Strive to produce a more significant impact by not only providing hindsight, insight, and foresight for business decisions but also by driving customer actions with predictive intelligence capabilities.
Discovery Patterns in Data
- Understanding Data Characteristics: unit of decisioning, sample size, data sparsity, outliers, sample imbalance, data types
- Innovating in feature engineering and summarization
- Clarifying the Modeling Strategy: Momentum-based, Foundational, Reflexive or Hybrid modeling strategies
- Provide clarity in modeling strategies that are coherent with fundamental mechanisms of the domain.
Execution
- In specifying projects, avoid being task-oriented, and ask the question behind the question to take personal responsibility to produce the best business results in projects.
- In prioritizing, planning, and managing projects, assess projects in terms of reach, impact, confidence, and effort (RICE) and alignment to data strategies; address common project failure modes with a simple and concise project plan; and leverage waterfall or scrum techniques, depending on your project, to manage your projects.
- In balancing hard trade-offs between speed and quality, safety and accountability, and documentation and progress, improve the long-term productivity of the team.
- Project Motivation and Definition, Solution Architecture, Execution Timeline, Risk to Anticipate
Expert Knowledge
In recognizing business context, check for project alignment with the vision and mission of your organization: Deep Domain Understanding
- Clarify the business context of opportunities or risks
- Account for domain data sources nuances, biases, inaccuracies, incompleteness
- Navigate organizational structures in an industry
Notes are private!
1
Mar 08, 2024
Mar 26, 2024
Mar 06, 2024
Paperback
1953295738
9781953295736
1953295738
3.20
20
unknown
Dec 14, 2021
it was ok
More than a book on Artificial Intelligence, this is an examination on how to formulate problems and search for solutions.
I strongly agree with the main thesis, that is: first define the problem and then think about (potentially AI-driven?) solutions.
His message is perfectly summarised in these few lines:
Being AI-first means doing AI last. Doing AI means doing it last or not doing it at all. The reason is rather simple: Solution-focused strategies are more complex than problem-focused strategies; and solution-focused thinking ignores the most important part of business, which is the problems they solve and the customers they create.
Keep in mind that solution-centric thinking results from the following:
Focusing on what our solutions ought to be rather than what they are.
Focusing on the impact of future solutions rather than the future impact of today's solutions.
Conflating our goals with the goals of others.
Focusing too much on abstract problems with some arbitrary solution, or focusing too much on someone else's problem and ignoring your own problems. The former is solution solving and the latter often means we are working problem solving backward and finding problems to solve in the context of someone else's solution.
Do not define your solution. The search for analytical exactitude in verbal definition will not lead to economic progress. Ignore recycling glib, textbook definitions of artificial intelligence mainly because consumers don't care about textbook definitions. Customers care about themselves. If you want to make your life better, make their lives better. Help them accomplish their goals in a better, faster, safer, or cheaper way. They're generally interested in a value proposition that contains problem-specific information, not in a definition of intelligence. Your journey starts with more comprehension of problems, not the names or definitions of solutions. Besides, creating definitions for our solutions means we are creating external goals for them, which is nonsense.
He then highlights the differences between doing research in the AI field and working with AI: in the latter case we must be pragmatic and problem-oriented.
Working in a company means that your manager does not care whether the solution is called AI or BI, but whether you solve the problem in the first place.
Remember that insiders seek epistemological discoveries, not economic ones. The more epistemological a pursuit is, the less likely it is to become something that could be turned into a business. Entrepreneur, venture capitalist, and author Paul Graham discusses the value of problems at length and explains that good business ideas are unlikely to come from scholars who write and defend dissertations.
The reason is that the subset of ideas that count as "research" is so narrow that it's unlikely to satisfy academic constraints and also satisfy the orthogonal constraints of business. The incentives for success in the academic world are not consistent with what it takes to start and grow a business. Ultimately, business pursuits are much more complicated than academic ones. Managers ought to acknowledge that solving intelligence is not likely your goal, and in many ways it's oppressive to problem solving. AGI may be possible, but it is not desirable as a business goal.
However, he soon starts to repeat those concepts over and over again, making most of the text redundant.
He then takes a (boring) philosophical tangent on what a problem is and its different types, which weighs down the narrative flow.
Just to give you an intuition, here is an extract on this topic:
There is something magical about writing down a problem. It's almost as though by writing about what is wrong, we start to discover new ways of making it right. Writing things down will also remind oneself and our teams of the problem and the goal. Once a problem is written down, don't forget to come back to the problem statement. It is a guide. Problem solving often starts with great intentions and alignment, but when it counts most-when the work is actually being done-we often don't hold on to the problem we set out to solve, and that's the most important part of problem solving: what the problem is and why we are solving it to begin with.
Furthermore, do not needlessly seek out complexity by making larger solutions to solve needlessly bigger problems. Complexity bias is the logical fallacy where we find it easier to seek out complex solutions rather than a simple one. Without a problem statement, solutions tend to become more complex and expand to fill in the available time we've allocated for problem solving. Parkinson's law, named after Cyril Northcote Parkinson, states that "work expands so as to fill the time available for its completion." This is a sort of solution sprawl, similar to the urban sprawl that expands to fill in geographic spaces immaterial to how well the urban landscape serves its citizenry.
At last he gives good advice on general problem solving, particularly regarding divide et impera (divide and conquer), which works because small problems are often simpler problems:
Always start small and take small steps to ensure that performance is what you want. Don't try to boil the ocean with the whole of a problem. With smaller steps almost everything can be reduced to something more manageable. Working in smaller sizes and smaller steps goes for your team as well. Rather than having your whole team work on something for six months, think about what one person can do in six weeks. The Basecamp team uses six weeks, which I think is a good size. If you are an Agile team, you may have batches of two weeks. That is fine, too. The point is that constraining batch size will force everyone to find the best bad solution, rather than working into the abyss of perfection.
Of course, simple problems are different. Simple problems can often be solved by applying a single solution to the whole of the problem. In practice you may not know the best solution a priori. One strategy to find the best solution for a simple problem may be to simply guess. Guessing, however, will have a high error rate in the face of increasing complexity.
Notes are private!
1
Feb 25, 2024
Mar 02, 2024
Feb 25, 2024
Hardcover
1633695689
9781633695689
B075GXJPFS
3.87
3,684
unknown
Apr 17, 2018
liked it
A good starting non-technical book, if you have no idea of what AI and machine learning are.
I"ve found ti a bit repetitive and verbose, but at least i A good starting non-technical book, if you have no idea of what AI and machine learning are.
I"ve found ti a bit repetitive and verbose, but at least it doesn't take anything for granted.
It starts with the classic law of supply and demand: the lower the price of a good the higher its demand, ceteris paribus.
Since prediction machines are becoming cheaper they are going to be used much more extensively in many different sectors.
These are augmented by the fact that the data is now everywhere, at our disposal which is the fuel of machine learning. From statistical perspective, data has diminishing returns: each additional unit of data improves your prediction less than the prior data. In terms of economics, the relationships is ambiguous: adding more data to allergic existing stock of data may be greater than adding it to a small stock. Thus organisations need to understand the relationship between adding more data in enhancing prediction, accuracy and increasing value creation.
P.S. To be honest, the summaries at the end of each chapter are so well written that sometimes I've read them directly. ...more
I"ve found ti a bit repetitive and verbose, but at least i A good starting non-technical book, if you have no idea of what AI and machine learning are.
I"ve found ti a bit repetitive and verbose, but at least it doesn't take anything for granted.
It starts with the classic law of supply and demand: the lower the price of a good the higher its demand, ceteris paribus.
Since prediction machines are becoming cheaper they are going to be used much more extensively in many different sectors.
These are augmented by the fact that data is now everywhere, at our disposal, and data is the fuel of machine learning. From a statistical perspective, data has diminishing returns: each additional unit of data improves your prediction less than the prior data. In terms of economics, the relationship is ambiguous: the value of adding more data to a large existing stock of data may be greater than adding it to a small stock. Thus organisations need to understand the relationship between adding more data, enhancing prediction accuracy, and increasing value creation.
Machine learning science had different goals from statistics. Whereas statistics emphasized being correct on average, machine learning did not require that. Instead, the goal was operational effectiveness. Predictions could have biases so long as they were better (something that was possible with powerful computers). This gave scientists a freedom to experiment and drove rapid improvements that take advantage of the rich data and fast computers that appeared over the last decade.
Traditional statistical methods require the articulation of hypotheses or at least of human intuition for model specification. Machine learning has less need to specify in advance what goes into the model and can accommodate the equivalent of much more complex models with many more interactions between variables.
Recent advances in machine learning are often referred to as advances in artificial intelligence because: (1) systems predicated on this technique learn and improve over time; (2) these systems produce significantly more-accurate predictions than other approaches under certain conditions, and some experts argue that prediction is central to intelligence; and (3) the enhanced prediction accuracy of these systems enable them to perform tasks, such as translation and navigation, that were previously considered the exclusive domain of human intelligence. We remain agnostic on the link between prediction and intelligence. None of our conclusions rely on taking a position on whether advances in prediction represent advances in intelligence. We focus on the consequences of a drop in the cost of prediction, not a drop in the cost of intelligence.
P.S. To be honest, the summaries at the end of each chapter are so well written that sometimes I've read them directly.
Notes are private!
1
Feb 17, 2024
Feb 25, 2024
Feb 17, 2024
Kindle Edition
1098125975
9781098125974
1098125975
4.55
2,679
Apr 09, 2017
Nov 08, 2022
it was amazing
This textbook is like the Swiss Army knife of machine learning books—it's packed with tools and techniques to help you tackle a wide range of real-world problems.
It takes you on a journey through the exciting landscape of machine learning, equipped with powerful libraries like Scikit-Learn, Keras, and TensorFlow.
It explains each library in depth: I was more interested in the first one, as Keras and TensorFlow are too advanced for my interests and knowledge.
It is actually funny to read the Natural Language Processing (NLP) and LLMs section, written prior to ChatGPT.
NOTES:
Supervised Learning: the algorithm is trained on a labeled dataset, meaning the input data is paired with the correct output. The model learns to map the input to the output, making predictions or classifications when new data is introduced. Common algorithms include linear regression, logistic regression, decision trees, support vector machines, and neural networks.
Unsupervised Learning: deals with unlabeled data, where the algorithm explores the data's structure or patterns without any explicit supervision. Clustering and association are two primary tasks in this type. Clustering algorithms, like K-means or hierarchical clustering, group similar data points together. Association algorithms, like the Apriori algorithm, find relationships or associations among data points.
Reinforcement Learning: involves an agent learning to make decisions by interacting with an environment. It is often used in robotics: the agent learns by receiving feedback in the form of rewards or penalties as it navigates through a problem space. The goal is to learn the optimal actions that maximize the cumulative reward. Algorithms like Q-learning and Deep Q Networks (DQN) are used in reinforcement learning scenarios.
Additionally, there are subfields and specialized forms within these categories, such as semi-supervised learning, where algorithms learn from a combination of labeled and unlabeled data, and transfer learning, which involves leveraging knowledge from one domain to another. These types and their variations offer diverse approaches to solving different types of problems in machine learning.
Gradient descent is a fundamental optimization algorithm widely used in machine learning for minimizing the error of a model by adjusting its parameters. It's especially crucial in training models like neural networks, linear regression, and other algorithms where the goal is to find the optimal parameters that minimize a cost or loss function.
- Objective: In machine learning, the objective is to minimize a cost or loss function that measures the difference between predicted values and actual values.
- Optimization Process: Gradient descent is an iterative optimization algorithm. It works by adjusting the model parameters iteratively to minimize the given cost function.
- Gradient Calculation: At each iteration, the algorithm calculates the gradient of the cost function with respect to the model parameters. The gradient essentially points in the direction of the steepest increase of the function.
- Parameter Update: The algorithm updates the parameters in the direction opposite to the gradient (i.e., descending along the gradient). This step size is determined by the learning rate, which controls how big a step the algorithm takes in the direction of the gradient.
- Convergence: This process continues iteratively, gradually reducing the error or loss. The algorithm terminates when it reaches a point where further iterations don't significantly decrease the loss or when it reaches a predefined number of iterations.
There are variations of gradient descent, such as:
Batch Gradient Descent: Calculates the gradient over the entire dataset.
Stochastic Gradient Descent (SGD): Computes the gradient using a single random example from the dataset at each iteration, which can be faster but noisier. The randomness helps escape local optima.
Mini-batch Gradient Descent: Computes the gradient using a small subset of the dataset, balancing between the efficiency of SGD and the stability of batch gradient descent.
Gradient descent plays a vital role in training machine learning models by iteratively adjusting parameters to find the optimal values that minimize the error or loss function, leading to better model predictions and performance.
It is commonly used in conjunction with various machine learning algorithms, including regression models. It serves as an optimization technique to train these models by minimizing a cost or loss function associated with the model's predictions.
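A minimal batch gradient descent sketch for linear regression with an MSE loss; the data, learning rate, and iteration count are illustrative assumptions:

```python
# Illustrative batch gradient descent for linear regression (MSE loss).
import numpy as np

rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=100)

X_b = np.c_[np.ones(len(X)), X]      # add a bias column of ones
theta = rng.normal(size=2)           # parameters to learn
eta, n_iters = 0.1, 1000             # learning rate, number of iterations

for _ in range(n_iters):
    gradients = 2 / len(X_b) * X_b.T @ (X_b @ theta - y)  # d(MSE)/d(theta)
    theta -= eta * gradients         # step in the direction opposite to the gradient

print(theta)                         # should end up close to [4, 3]
```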
Support Vector Machines SVM
It can perform linear or nonlinear classification, regression and even outlier detection.
Well suited for classification of complex small to medium sized datasets.
They tend to work effectively and efficiently when there are many features compared to the observations, but SVM is not as scalable to larger data sets and it’s hard to tune its hyperparameters.
SVM is a family of model classes that operate in high-dimensional space to find an optimal hyperplane that separates the classes with the maximum margin between them. Support vectors are the points closest to the decision boundary that would change it if they were removed.
It tries to fit the widest possible space between the classes, staying as far as possible from the closest training instances: large margin classification.
Adding more training instances far away from the boundary does not affect SVM, which is fully determined/supported by the instances located at the edge of the street, called support vectors.
N.B. SVMs are sensitive to the feature scales.
Soft margin classification is generally preferred to the hard version, because it is tolerant to outliers and it is a compromise between perfectly separating the two classes and having the widest possible street.
Unlike logistic regression, SVM classifiers do not output probabilities by default.
Nonlinear SVM classification adds polynomial features; thanks to the kernel trick we get the same result as if we had added many high-degree polynomial features without actually adding them, so there is no combinatorial explosion in the number of features.
SVM Regression reverses the objective: it tries to fit as many instances as possible on the street while limiting margin violations, that is, training instances that fall outside the street.
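A hedged Scikit-Learn sketch covering both uses described above, with feature scaling in a pipeline because SVMs are sensitive to feature scales (dataset and hyperparameters are illustrative):

```python
# Illustrative SVM classification and regression with scaling pipelines.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, SVR

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# Nonlinear classification via the kernel trick (no explicit polynomial features)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print(clf.predict([[0.5, 0.2]]))

# SVM regression: fit as many instances as possible inside an epsilon-wide street
reg = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
reg.fit(X, X[:, 0] + X[:, 1])        # toy regression target for illustration
```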
Decision Trees
They have been used for the longest time, even before they were turned into algorithms.
It searches for the pair (feature, threshold) that produces the purest subsets (weighted by their size), and it does so recursively; however, it does not check whether or not the split will lead to the lowest possible impurity several levels down.
Hence it does not guarantee a globally optimal solution.
Prediction is cheap, since each node only requires checking the value of one feature; training is heavier, because the algorithm compares all features on all samples at each node.
Node purity is measured by Gini impurity or entropy: a node's impurity is generally lower than its parent's.
Decision trees make very few assumptions about the training data, as opposed to linear models, which assume that the data is linear. If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely, indeed, most likely overfitting it.
Such a model is often called a non-parametric model, it has parameters, but their number is not determined prior to training.
To avoid overfitting, we need to regularize hyperparameters, to reduce the decision tree freedom during training: pruning (deleting unnecessary nodes), set a max number of leaves.
We can have decision tree regression, which, instead of predicting a class in each node, predicts a value.
They are simple to understand and interpret, easy to use, versatile and powerful.
They don’t care if the training data is scaled or centered: no need to scale features.
However, they use orthogonal decision boundaries, which makes them sensitive to training set rotation, and they are very sensitive to small variations in the training data, so the model may not generalize well. Random forests can reduce this instability by averaging predictions over many trees.
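A small illustrative sketch of a regularized decision tree in Scikit-Learn; max_depth and max_leaf_nodes are arbitrary example values:

```python
# Illustrative regularized decision tree (no feature scaling needed for trees).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth and max_leaf_nodes restrict the tree's freedom and limit overfitting
tree = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8, random_state=42)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```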
Random Forests
It is an ensemble of Decision Trees, generally trained via bagging or sometimes pasting, typically with the max_samples set to the size of the training set.
Instead of using a BaggingClassifier, you can use the RandomForestClassifier, which is optimized for decision trees and exposes their hyperparameters.
Instead of searching for the best feature when splitting a node, it searches for the best feature among a random subset of features, which results in a greater tree diversity.
It makes it easy to measure feature importance by looking at how much that feature reduces impurity on average.
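A minimal RandomForestClassifier sketch showing the feature_importances_ attribute mentioned above (dataset and n_estimators are illustrative):

```python
# Illustrative random forest with per-feature importance scores.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X, y)

# Mean impurity reduction per feature, normalized to sum to 1
for name, score in zip(load_iris().feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```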
Boosting
Adaptive Boosting
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. For example, when training an AdaBoost classifier, the algorithm first trains a base classifier such as a decision tree and uses it to make predictions on the training set. The algorithm then increases the relative weight of misclassified training instances, trains a second classifier using the updated weights, again makes predictions on the training set, updates the instance weights, and so on. Once all predictors are trained, the ensemble makes predictions like bagging, except that the predictors have weights depending on their overall accuracy on the weighted training set.
Gradient Boosting
It works by sequentially adding predictors to an ensemble, each one correcting its predecessor.
Instead of tweaking the instance weights at every iteration like AdaBoost does, it tries to fit the new predictor to the residual errors made by the previous predictor.
[XGBoost Python Library is an optimised implementation]
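A hedged sketch of both boosting flavors in Scikit-Learn (hyperparameters are illustrative; XGBoost itself is not shown here):

```python
# Illustrative AdaBoost (re-weights hard instances) and gradient boosting
# (each new tree fits the previous ensemble's residual errors).
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingRegressor

Xc, yc = make_classification(n_samples=500, random_state=42)
ada = AdaBoostClassifier(n_estimators=100, random_state=42)  # default base estimator is a decision stump
ada.fit(Xc, yc)

Xr, yr = make_regression(n_samples=500, noise=10, random_state=42)
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                max_depth=2, random_state=42)
gbr.fit(Xr, yr)
print(ada.score(Xc, yc), gbr.score(Xr, yr))
```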
Stacking
Stacked generalization involves training multiple diverse models and combining their predictions using a meta-model (or blender).
Unlike bagging, where models are trained independently and their predictions simply aggregated, stacking trains in two stages: the base models first, then the meta-model on their predictions.
The idea is to let the base models specialize in different aspects of the data, and the meta-model learns how to weigh their contributions effectively.
Stacking can involve multiple layers of models, with each layer's output serving as input to the next layer.
It requires a hold-out set (validation set) for the final model to prevent overfitting on the training data.
Stacking is a more complex ensemble method compared to boosting and bagging.
[Not supported natively by Scikit-Learn at the time of that edition; newer versions provide StackingClassifier and StackingRegressor]
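For reference, a sketch using the StackingClassifier available in newer Scikit-Learn versions; the base models, meta-model, and cv value are illustrative choices:

```python
# Illustrative stacked generalization: base models + a logistic regression blender.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=42)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("svc", SVC(probability=True, random_state=42)),
]
# The meta-model (blender) is trained on out-of-fold predictions of the base models
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X, y)
print(stack.score(X, y))
```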
Unsupervised Learning
Dimensionality Reduction
Reducing dimensionality does cause some information loss and makes pipelines more complex and thus harder to maintain, but it speeds up training.
The main result is that it is much easier to rely on Data Viz once we have fewer dimensions.
[the operation can be reversed, we can reconstruct a data set relatively similar to the original]
Intuitively dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the data set without losing too much information.
The Curse of Dimensionality
As the number of features or dimensions in a dataset increases, certain phenomena occur that can lead to difficulties in model training, performance, and generalization.
- Increased Sparsity: In high-dimensional spaces, data points become more sparse. As the number of dimensions increases, the available data tends to be spread out thinly across the feature space. This sparsity can lead to difficulties in estimating reliable statistical quantities and relationships.
- Increased Computational Complexity: The computational requirements grow exponentially with the number of dimensions. Algorithms that work efficiently in low-dimensional spaces may become computationally expensive or impractical in high-dimensional settings. This can affect the training and inference times of machine learning models.
- Overfitting: In high-dimensional spaces, models have more freedom to fit the training data closely. This can lead to overfitting, where a model performs well on the training data but fails to generalize to new, unseen data. Regularization techniques become crucial to mitigate overfitting in high-dimensional settings.
- Decreased Intuition and Visualization: It becomes increasingly difficult for humans to visualize and understand high-dimensional spaces. While we can easily visualize and interpret data in two or three dimensions, the ability to comprehend relationships among variables diminishes as the number of dimensions increases.
- Increased Data Requirements: As the dimensionality increases, the amount of data needed to maintain the same level of statistical significance also increases. This implies that more data is required to obtain reliable estimates and make accurate predictions in high-dimensional spaces.
- Distance Measures and Density Estimation: The concept of distance becomes less meaningful in high-dimensional spaces, and traditional distance metrics may lose their discriminative power. Similarly, density estimation becomes challenging as the data becomes more spread out.
Projection
In most real-world problems, training instances are not spread out uniformly across all dimensions: many features are almost constant whereas others are highly correlated.
As a result, all training instances lie within a much lower dimensional subspace of the high-dimensional space.
If we project every instance perpendicularly onto this subspace, we get a new dataset with fewer dimensions.
Manifold Learning focuses on capturing and representing the intrinsic structure or geometry of high-dimensional data in lower-dimensional spaces, often referred to as manifolds.
The assumption is that the task will be simpler if expressed in the lower dimensional space of the manifold, which is not always true: the decision boundary may not always be simpler with lower dimensions.
PCA Principal Component Analysis
It identifies the hyperplane that lies closest to the data and then projects the data onto it while retaining as much of the original variance as possible.
PCA achieves this by identifying the principal components of the data, which are linear combinations of the original features, the axis that accounts for the largest amount of variance in the training set.
[It's essential to note that PCA assumes that the principal components capture the most important features of the data, and it works well when the variance in the data is aligned with the directions of maximum variance. However, PCA is a linear technique and may not perform optimally when the underlying structure of the data is nonlinear. In such cases, non-linear dimensionality reduction techniques like t-Distributed Stochastic Neighbor Embedding (t-SNE) or Uniform Manifold Approximation and Projection (UMAP) might be more appropriate.]
It identifies the principal components via a standard matrix factorization technique, Singular Value Decomposition.
Before applying PCA, it's common to standardize the data by centering it (subtracting the mean) and scaling it (dividing by the standard deviation). This ensures that each feature contributes equally to the analysis.
PCA involves the computation of the covariance matrix of the standardized data. The covariance matrix represents the relationships between different features, indicating how they vary together.
It is useful to compute the explained variance ratio of each principal component which indicates the proportion of the dataset’s variance that lies along each PC.
A common rule of thumb is to reduce down to the number of dimensions that accounts for 95% of the variance.
After dimensionality reduction the training set takes up much less space.
- Dimensionality Reduction: The primary use of PCA is to reduce the number of features in a dataset while retaining most of the information. This is beneficial for visualization, computational efficiency, and avoiding the curse of dimensionality.
- Data Compression: PCA can be used for data compression by representing the data in a lower-dimensional space, reducing storage requirements.
- Noise Reduction: By focusing on the principal components with the highest variance, PCA can help filter out noise in the data.
- Visualization: PCA is often employed for visualizing high-dimensional data in two or three dimensions, making it easier to interpret and understand.
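A minimal PCA sketch with standardization, a 95%-variance target, and the explained variance ratio (dataset and threshold are illustrative):

```python
# Illustrative PCA: standardize, then keep enough components for 95% variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# A float n_components asks for the smallest number of components whose
# explained variance ratios sum to at least that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_[:5])
```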
Kernel PCA, Unsupervised Algorithm
The basic idea behind Kernel PCA is to use a kernel function to implicitly map the original data into a higher-dimensional space where linear relationships may become more apparent. The kernel trick avoids the explicit computation of the high-dimensional feature space but relies on the computation of pairwise similarities (kernels) between data points.
Commonly used kernel functions include the radial basis function (RBF) or Gaussian kernel, polynomial kernel, and sigmoid kernel. The choice of the kernel function depends on the characteristics of the data and the desired transformation.
After applying the kernel trick, the eigenvalue decomposition is performed in the feature space induced by the kernel. This results in eigenvalues and eigenvectors, which are analogous to those obtained in traditional PCA.
The final step involves projecting the original data onto the principal components in the higher-dimensional feature space. The projection allows for non-linear dimensionality reduction.
Kernel PCA is particularly useful in scenarios where the relationships in the data are not well captured by linear techniques. It has applications in various fields, including computer vision, pattern recognition, and bioinformatics, where the underlying structure of the data might be highly non-linear.
However, it's important to note that Kernel PCA can be computationally expensive, especially when dealing with large datasets, as it involves the computation of pairwise kernel values. The choice of the kernel and its parameters can also impact the performance of Kernel PCA, and tuning these parameters may be necessary for optimal results.
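A short KernelPCA sketch with an RBF kernel on a nonlinear toy dataset; the kernel choice and gamma are assumptions that usually need tuning:

```python
# Illustrative Kernel PCA with an RBF kernel on a nonlinear manifold.
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X, _ = make_swiss_roll(n_samples=1000, random_state=42)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = kpca.fit_transform(X)
print(X_reduced.shape)   # (1000, 2)
```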
Clustering: K-Means
It is the task of identifying similar instances and assigning them to clusters or groups of similar instances.
It is an example where we can use Data Science not to predict but to group the existing data.
Use cases:
- Customer segmentation: You can cluster your customers based on their purchases and their activity on your website. This is useful to understand who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment.
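A hedged K-Means sketch in the spirit of the customer-segmentation use case; the synthetic features and the choice of k are assumptions, not from the book:

```python
# Illustrative K-Means customer segmentation on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Columns: yearly spend, site visits per month (made-up segments)
customers = np.vstack([
    rng.normal([200, 5], [50, 2], size=(100, 2)),     # occasional buyers
    rng.normal([1500, 25], [300, 5], size=(100, 2)),  # loyal customers
])

X = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
segments = kmeans.fit_predict(X)
print(np.bincount(segments))      # cluster sizes
```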
It takes you on a journey through the exciting landscape of machine learning, equipped with powerful libraries like Scikit-Learn, Keras, and TensorFlow.
It explains in depth each library: I was more interested in the first one, as Keras and TS are too advanced for my interests and knowledge.
It is actually funny to read the Natural Language Processing NLP, LLMs section, prior to ChatGPT.
NOTES:
Supervised Learning: the algorithm is trained on a labeled dataset, meaning the input data is paired with the correct output. The model learns to map the input to the output, making predictions or classifications when new data is introduced. Common algorithms include linear regression, logistic regression, decision trees, support vector machines, and neural networks.
Unsupervised Learning: deals with unlabeled data, where the algorithm explores the data's structure or patterns without any explicit supervision. Clustering and association are two primary tasks in this type. Clustering algorithms, like K-means or hierarchical clustering, group similar data points together. Association algorithms, like a priori algorithm, find relationships or associations among data points.
Reinforcement Learning: involves an agent learning to make decisions by interacting with an environment. Usually is used on robots: It learns by receiving feedback in the form of rewards or penalties as it navigates through a problem space. The goal is to learn the optimal actions that maximize the cumulative reward. Algorithms like Q-learning and Deep Q Networks (DQN) are used in reinforcement learning scenarios.
Additionally, there are subfields and specialized forms within these categories, such as semi-supervised learning, where algorithms learn from a combination of labeled and unlabeled data, and transfer learning, which involves leveraging knowledge from one domain to another. These types and their variations offer diverse approaches to solving different types of problems in machine learning.
Gradient descent is a fundamental optimization algorithm widely used in machine learning for minimizing the error of a model by adjusting its parameters. It's especially crucial in training models like neural networks, linear regression, and other algorithms where the goal is to find the optimal parameters that minimize a cost or loss function.
- Objective: In machine learning, the objective is to minimize a cost or loss function that measures the difference between predicted values and actual values.
- Optimization Process: Gradient descent is an iterative optimization algorithm. It works by adjusting the model parameters iteratively to minimize the given cost function.
- Gradient Calculation: At each iteration, the algorithm calculates the gradient of the cost function with respect to the model parameters. The gradient essentially points in the direction of the steepest increase of the function.
- Parameter Update: The algorithm updates the parameters in the direction opposite to the gradient (i.e., descending along the gradient). This step size is determined by the learning rate, which controls how big a step the algorithm takes in the direction of the gradient.
- Convergence: This process continues iteratively, gradually reducing the error or loss. The algorithm terminates when it reaches a point where further iterations don't significantly decrease the loss or when it reaches a predefined number of iterations.
There are variations of gradient descent, such as:
Batch Gradient Descent: Calculates the gradient over the entire dataset.
Stochastic Gradient Descent (SGD): Computes the gradient using a single random example from the dataset at each iteration, which can be faster but more noisy. Randomness is good to escape local optima.
Mini-batch Gradient Descent: Computes the gradient using a small subset of the dataset, balancing between the efficiency of SGD and the stability of batch gradient descent.
Gradient descent plays a vital role in training machine learning models by iteratively adjusting parameters to find the optimal values that minimize the error or loss function, leading to better model predictions and performance.
It is commonly used in conjunction with various machine learning algorithms, including regression models. It serves as an optimization technique to train these models by minimizing a cost or loss function associated with the model's predictions.
Support Vector Machines SVM
It can perform linear or nonlinear classification, regression and even outlier detection.
Well suited for classification of complex small to medium sized datasets.
They tend to work effectively and efficiently when there are many features compared to the observations, but SVM is not as scalable to larger data sets and it’s hard to tune its hyperparameters.
SVM is a family of model classes that operate in high dimensional space to find an optimal hyperplane when they attempt to separate the classes with a maximum margin between them. Support vectors are the points closest to the decision boundary that would change it if were removed.
It tries to fit the widest possible space between the classes, staying as far as possible from the closest training instances: large margin classification.
Adding more training instances far away from the boundary does not affect SVM, which is fully determined/supported by the instances located at the edge of the street, called support vectors.
N.B. SVMs are sensitive to the feature scales.
Soft margin classification is generally preferred to the hard version, because it is tolerant to outliers and it’s a compromises between perfectly separating the two classes, and having the widest possible Street.
Unlike Logistic regressions, SVM classifiers of not output probabilities.
Nonlinear SVM classification adds polynomial features and thanks to the kernel trick we get the same result as if we add many high-degree polynomial features, without actually adding them so there is no combinatorial explosion of the number of features.
SVM Regression reverses the objective: it tries to fit as many instances as possible on the street while limiting margin violations, that is training instances outside the support vectors region.
Decision Trees
They have been used for the longest time, even before they were turned into algorithms.
It searches for the the pair (feature, threshold) that produces the purest subsets (weighted by their size) and it does it recursively, however it does not check whether or not the split will lead to the lowest possible impurity several levels down.
Hence it does not guarantee a global maximum solution.
The computational complexity does not explode since each node only requires checking the value of one feature: the training algorithm compares all features on all samples at each node.
Nodes purity is measured by Gini coefficient or entropy: a node’s impurity is generally lower that its parents�.
Decision trees make very few assumptions about the training data, as opposed to linear models, which assume that the data is linear. If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely, indeed, most likely overfitting it.
Such a model is often called a non-parametric model, it has parameters, but their number is not determined prior to training.
To avoid overfitting, we need to regularize hyperparameters, to reduce the decision tree freedom during training: pruning (deleting unnecessary nodes), set a max number of leaves.
We can have decision tree regressions, which, instead of predicting a class in each node, it predicts a value.
They are simple to understand and interpret, easy to use, versatile and powerful.
They don’t care if the training data is called or centered: no need to scale features.
However, they apply orthogonal decision boundaries which makes them sensitive to training set rotation, that is the model will not generalize well because they are very sensitive to small variations in the training data. Random forests can lead to disease stability by averaging predictions over many trees.
Random Forests
It is an ensemble of Decision Trees, generally trained via bagging or sometimes pasting, typically with the max_samples set to the size of the training set.
Instead of using the BaggingClassifier the RandomForestClassifier is optimized for Decision Trees, it has all its hyperparameters.
Instead of searching for the best feature when splitting a node, it searches for the best feature among a random subset of features, which results in a greater tree diversity.
It makes it easy to measure feature importance by looking at how much that feature reduces impurity on average.
Boosting
Adaptive Boosting
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. For example, when training AdaBoost classifier the algorithm first trains of base classifier such as a decision trees, and uses it to make predictions on the training set. The algorithm then increases the relative weight of misclassified training instances, then train the second classifier using the updated weights and again makes predictions on the training said updates the instance weights and so on. Once all predictors are trained the ensemble makes predictions like bagging expect that the predictors have wights depending on their overall accuracy on the weighted training set.
Gradient Boosting
It works by sequentially adding predictors to an ensemble, each one correcting its predecessor.
Instead of tweaking the instance weights at every iteration like AdaBoost does, it tries to fit the new predictor to the residual errors made by the previous predictor.
[XGBoost Python Library is an optimised implementation]
Stacking
Stacked generalization involves training multiple diverse models and combining their predictions using a meta-model (or blender).
Instead of parallel training like in bagging, stacking involves training models in a sequential manner.
The idea is to let the base models specialize in different aspects of the data, and the meta-model learns how to weigh their contributions effectively.
Stacking can involve multiple layers of models, with each layer's output serving as input to the next layer.
It requires a hold-out set (validation set) for the final model to prevent overfitting on the training data.
Stacking is a more complex ensemble method compared to boosting and bagging.
[Not supported by Scikit-learn]
Unsupervised Learning
Dimensionality Reduction
Reducing dimensionality does cause information loss and makes pipelines more complex thus harder to maintain, while speeding up training.
The main result is that it is much easier to rely on Data Viz once we have fewer dimensions.
[the operation can be reversed, we can reconstruct a data set relatively similar to the original]
Intuitively dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the data set without losing too much information.
The Curse of Dimensionality
As the number of features or dimensions in a dataset increases, certain phenomena occur that can lead to difficulties in model training, performance, and generalization.
- Increased Sparsity: In high-dimensional spaces, data points become more sparse. As the number of dimensions increases, the available data tends to be spread out thinly across the feature space. This sparsity can lead to difficulties in estimating reliable statistical quantities and relationships.
- Increased Computational Complexity: The computational requirements grow exponentially with the number of dimensions. Algorithms that work efficiently in low-dimensional spaces may become computationally expensive or impractical in high-dimensional settings. This can affect the training and inference times of machine learning models.
- Overfitting: In high-dimensional spaces, models have more freedom to fit the training data closely. This can lead to overfitting, where a model performs well on the training data but fails to generalize to new, unseen data. Regularization techniques become crucial to mitigate overfitting in high-dimensional settings.
- Decreased Intuition and Visualization: It becomes increasingly difficult for humans to visualize and understand high-dimensional spaces. While we can easily visualize and interpret data in two or three dimensions, the ability to comprehend relationships among variables diminishes as the number of dimensions increases.
- Increased Data Requirements: As the dimensionality increases, the amount of data needed to maintain the same level of statistical significance also increases. This implies that more data is required to obtain reliable estimates and make accurate predictions in high-dimensional spaces.
- Distance Measures and Density Estimation: The concept of distance becomes less meaningful in high-dimensional spaces, and traditional distance metrics may lose their discriminative power. Similarly, density estimation becomes challenging as the data becomes more spread out.
Projection
In most real-world problems, training instances are not spread out uniformly across all dimensions: many features are almost constant whereas others are highly correlated.
As a result, all training instances lie within a much lower dimensional subspace of the high-dimensional space.
If we project every instance perpendicularly onto this subspace we get a new Dimension-1 dataset.
Manifold Learning focuses on capturing and representing the intrinsic structure or geometry of high-dimensional data in lower-dimensional spaces, often referred to as manifolds.
The assumption is that the task will be simpler if expressed in the lower dimensional space of the manifold, which is not always true: the decision boundary may not always be simpler with lower dimensions.
PCA Principal Component Analysis
It identifies the hyperplane that lies closest to the data and then it projects the data onto to it while retaining as much of the original variance as possible.
PCA achieves this by identifying the principal components of the data, which are linear combinations of the original features, the axis that accounts for the largest amount of variance in the training set.
[It's essential to note that PCA assumes that the principal components capture the most important features of the data, and it works well when the variance in the data is aligned with the directions of maximum variance. However, PCA is a linear technique and may not perform optimally when the underlying structure of the data is nonlinear. In such cases, non-linear dimensionality reduction techniques like t-Distributed Stochastic Neighbor Embedding (t-SNE) or Uniform Manifold Approximation and Projection (UMAP) might be more appropriate.]
It identifies the principal components via a standard matrix factorization technique, Singular Value Decomposition.
Before applying PCA, it's common to standardize the data by centering it (subtracting the mean) and scaling it (dividing by the standard deviation). This ensures that each feature contributes equally to the analysis.
PCA involves the computation of the covariance matrix of the standardized data. The covariance matrix represents the relationships between different features, indicating how they vary together.
It is useful to compute the explained variance ratio of each principal component which indicates the proportion of the dataset’s variance that lies along each PC.
The number of dimensions to reduce down to, should account for 95% of the variance.
After dimensionality reduction the training set takes up much less space.
- Dimensionality Reduction: The primary use of PCA is to reduce the number of features in a dataset while retaining most of the information. This is beneficial for visualization, computational efficiency, and avoiding the curse of dimensionality.
- Data Compression: PCA can be used for data compression by representing the data in a lower-dimensional space, reducing storage requirements.
- Noise Reduction: By focusing on the principal components with the highest variance, PCA can help filter out noise in the data.
- Visualization: PCA is often employed for visualizing high-dimensional data in two or three dimensions, making it easier to interpret and understand.
Kernel PCA, Unsupervised Algorithm
The basic idea behind Kernel PCA is to use a kernel function to implicitly map the original data into a higher-dimensional space where linear relationships may become more apparent. The kernel trick avoids the explicit computation of the high-dimensional feature space but relies on the computation of pairwise similarities (kernels) between data points.
Commonly used kernel functions include the radial basis function (RBF) or Gaussian kernel, polynomial kernel, and sigmoid kernel. The choice of the kernel function depends on the characteristics of the data and the desired transformation.
After applying the kernel trick, the eigenvalue decomposition is performed in the feature space induced by the kernel. This results in eigenvalues and eigenvectors, which are analogous to those obtained in traditional PCA.
The final step involves projecting the original data onto the principal components in the higher-dimensional feature space. The projection allows for non-linear dimensionality reduction.
Kernel PCA is particularly useful in scenarios where the relationships in the data are not well captured by linear techniques. It has applications in various fields, including computer vision, pattern recognition, and bioinformatics, where the underlying structure of the data might be highly non-linear.
However, it's important to note that Kernel PCA can be computationally expensive, especially when dealing with large datasets, as it involves the computation of pairwise kernel values. The choice of the kernel and its parameters can also impact the performance of Kernel PCA, and tuning these parameters may be necessary for optimal results.
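A small sketch of Kernel PCA on a classic non-linear toy dataset (the two-moons generator from scikit-learn); the gamma value is illustrative and would normally be tuned.

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

# Two interleaving half-moons: structure a linear projection cannot unfold.
X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

# RBF (Gaussian) kernel; gamma controls the kernel width (illustrative value).
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)   # non-linear projection onto 2 components
```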
Clustering: K-Means
It is the task of identifying similar instances and assigning them to clusters, i.e. groups of similar instances.
It is an example of using Data Science not to predict but to organize the existing data into groups.
Use cases:
- Customer segmentation: You can cluster your customers based on their purchases and their activity on your website. This is useful to understand who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment.
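To make the customer-segmentation idea concrete, here is a minimal sketch with made-up features (annual spend, monthly visits); the data and the choice of two clusters are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customers described by [annual spend, visits per month].
rng = np.random.default_rng(0)
customers = np.vstack([
    rng.normal([200, 2], [40, 0.5], size=(50, 2)),    # occasional buyers
    rng.normal([1500, 12], [300, 2], size=(50, 2)),   # frequent big spenders
])

X = StandardScaler().fit_transform(customers)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(X)         # one cluster label per customer
print(kmeans.cluster_centers_)           # centroids in standardized space
```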
Notes are private!
1
Jan 07, 2024
Feb 10, 2024
Jan 07, 2024
Paperback
1804617334
9781804617335
B0BRCW95ZQ
2.90
30
unknown
Feb 28, 2023
it was ok
I have mixed feelings about this textbook.
There is some value for someone who doesn't know a lot about the world of Artificial Intelligence, but then the book is still aimed at AI product managers, who should have some knowledge of the basics.
Some parts are so repetitive that they become recursive: the explanations of AI and ML appear too many times throughout the text, along with unnecessary summaries here and there.
I struggled to find something original, but maybe I'm not the right audience.
There are some good bits for an intro to AI, but not many in-depth use cases, just general descriptions of successful products, which is not practical.
The usual stuff about Machine Learning and Deep Learning, supervised and unsupervised learning, a general introduction to the basic algorithms, but nothing you can't find on Wikipedia.
To me, the most interesting parts are those around the different product strategies: dominant, disruptive, or differentiated.
Notes are private!
1
Mar 02, 2024
Mar 08, 2024
Nov 09, 2023
Kindle Edition
1492060941
9781492060949
1492060941
3.55
42
May 2020
Jun 30, 2020
really liked it
This textbook lives up to its author's expectations
The main takeaway is that value is created by making decisions, not by data or predictions. But these are necessary inputs for making AI- and data-driven decisions. To create value and make better decisions in a systematic and scalable way, we need to improve our analytical skills.
The book is very well laid out and it has lots of use cases.
It touches on well-known but important industry concepts, like the three Vs (volume, velocity, variety) that have projected the world into the big data era, the role of uncertainty, and the difference between correlation and causation, which are useful for someone approaching analytics for the first time but a bit redundant for someone already working in this sector.
For me the main takeaway was the clear distinction between the different phases of a business request in the Big Data Era: descriptive, predictive, prescriptive.
It is something that we (in analytics) should always keep in mind, because it is easy to get carried away from the main mission.
Descriptive Analytics
What it is: involves analyzing historical data to understand what has happened in the past. It focuses on summarizing and presenting data in a meaningful way, relating it to the current state of the business objective.
Example: Generating reports, dashboards, or visualizations that show key performance indicators (KPIs) over a specific time period.
Predictive Analytics
What it is: involves using statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data. It helps organizations anticipate what might happen in the future.
Example: Building a predictive model to forecast sales for the next quarter based on past sales data and other relevant factors.
Prescriptive Analytics
What it is: goes a step further by recommending actions to optimize or improve future outcomes. It considers the predicted outcomes and suggests the best course of action, to help choose the right levers.
Example: Recommending marketing strategies to maximize the predicted sales, taking into account various influencing factors.
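A toy sketch of the three stages in sequence, with invented monthly sales figures and invented price/demand effects, just to show how describing, predicting, and prescribing build on each other.

```python
import numpy as np

# Descriptive: summarize historical monthly sales (made-up numbers).
sales = np.array([100, 110, 120, 135, 150, 160], dtype=float)
print("mean:", sales.mean(), "last month:", sales[-1])

# Predictive: fit a simple linear trend and forecast next month.
months = np.arange(len(sales))
slope, intercept = np.polyfit(months, sales, deg=1)
forecast = slope * len(sales) + intercept
print("forecast:", round(forecast, 1))

# Prescriptive: pick the pricing lever with the highest expected revenue,
# using invented price/demand multipliers applied to the forecast volume.
price_factor  = {"keep price": 1.00, "discount 10%": 0.90, "premium 10%": 1.10}
demand_factor = {"keep price": 1.00, "discount 10%": 1.15, "premium 10%": 0.92}
revenue = {a: price_factor[a] * demand_factor[a] * forecast for a in price_factor}
best = max(revenue, key=revenue.get)
print("best action:", best, round(revenue[best], 1))
```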
We need to start with the business question in mind, decompose it, and move backward until we have some actions that relate to the business objective we want to achieve.
That is why we always need relevant and measurable KPIs.
We also need to identify what is actionable: the problem of choosing levers is one of causality; we want to make decisions that impact our business objectives, so there must be a causal relation from levers to consequences, and we need to construct hypotheses and test them.
I often find myself wondering what the end goal of a request is, which forces me to go back to the stakeholder and lay down strict and precise requirements as well as the desired outcome.
Business objectives are usually already defined, but we must learn to ask the right business questions to achieve these objectives.
Always start with the business objective and move backward: for any decision you're planning or have already made, think about the business objective you want to achieve. You can then move backward to figure out the set of possible levers and how these create consequences that affect the business.
A sequence of why questions can help define the right business objective to achieve: this bottom-up approach generally helps with identifying business objectives and enlarging the set of actions we can take. Other times you can use a top-down approach instead, similar to the decomposition of conversion rates.
The author encourages following the KISS (Keep It Simple, Stupid) principle, a revisitation of Occam's razor: avoid unnecessary complexity and complications, and choose the simplest solution that achieves the desired result.
Every difficult problem can be dissected into simpler and smaller problems.
By starting from those, we can make educated guesses and rough approximations, estimating probabilities and expected values.
Another important point is how to work with uncertainty.
He introduces Fermi problems, where the analyst can appreciate the power of intuition based on very few coordinates that help navigate uncertainty and compute expected utilities.
The prescriptive stage is all about optimisation, which in general is hard. That is why we always want to start by solving the problem with no uncertainty: solving the simpler problem will provide valuable intuition as to what the relevant underlying uncertainty is.
He gives some very basic notions about probability and how to cope with uncertainty:
Probabilities represent the likelihood of different outcomes occurring. In predictive analytics, estimating probabilities involves using statistical models and data analysis to quantify the chance of various future events.
Expected values, also known as mean or average values, are calculated by multiplying each possible outcome by its probability and summing up these values. It provides a measure of the central tendency of a probability distribution.
Expected utility is a concept in decision theory that combines the probabilities of different outcomes with the associated utility or value of each outcome. It helps in making decisions under uncertainty by considering both likelihood and desirability.
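A short worked sketch of these three ideas on an invented decision: the probabilities, payoffs, and the log utility (a stand-in for risk aversion) are all made up for illustration.

```python
import math

# Two candidate actions with uncertain monetary outcomes (invented numbers).
lotteries = {
    "launch campaign": [(0.6, 50_000), (0.4, -10_000)],   # (probability, payoff)
    "do nothing":      [(1.0, 0)],
}

def expected_value(lottery):
    # Sum of payoff times probability.
    return sum(p * x for p, x in lottery)

def expected_utility(lottery, utility):
    # Same weighting, but applied to the utility of each outcome.
    return sum(p * utility(x) for p, x in lottery)

# A risk-averse utility over final wealth, assuming 100k starting wealth.
u = lambda x: math.log(100_000 + x)

for action, lottery in lotteries.items():
    print(action, expected_value(lottery), round(expected_utility(lottery, u), 4))
```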
Notes are private!
1
Nov 06, 2023
Nov 10, 2023
Nov 06, 2023
Paperback