1492040347
| 9781492040347
| 1492040347
| 4.26
| 501
| unknown
| Nov 04, 2019
|
None
|
Notes are private!
|
2
|
Apr 28, 2025
Mar 07, 2025
|
not set
not set
|
Mar 07, 2025
|
Paperback
| ||||||||||||||||
0753559013
| 9780753559017
| 0753559013
| 3.85
| 4,989
| Sep 2021
| Sep 16, 2021
|
it was amazing
|
This is such a visionary book. Its uniqueness stems from the blend of speculative fiction and expert analysis. I'd recommend it to anyone interested in AI who lacks the technical expertise to understand its intricacies but wants to envision near-future scenarios. The book lays out great ideas through realistic scenarios. The fact that it is set in 2041 makes the stories credible: the future is not so far off, so the reader can empathize with the situations. However, some stories, like "Quantum Genocide", are too complex for such a short format, and the reader may lose the plot.

I found two stories particularly interesting, "Job Savior" and "Golden Elephant", both with huge ethical implications.

In "Job Savior", AI has automated entire industries, leaving millions of workers displaced and searching for purpose. The protagonist, a former factory worker, finds herself immersed in an AI-driven gig economy that feels as alienating as it is efficient. The algorithms grade every action, offering opportunities based on performance metrics rather than human qualities. Yet beyond the dystopian veneer lies a vision of adaptation: AI-powered retraining programs allow individuals to pivot their careers, and Universal Basic Income offers a safety net. The story dares to imagine how humans might coexist with machines in a post-work world, emphasizing resilience and the potential for redefining what "work" means. The speculative yet grounded narrative leaves readers pondering: could AI liberate humanity from monotonous labor, or will it leave us adrift in a sea of automation? In the end, workers are paid to simulate work... "The system analyzed the work in real time, awarding credit points based on speed and quality feedback. At the front of the hotel ballroom, there was a leaderboard displaying the names of the workers with the most points."

"Golden Elephant" projects a future where AI revolutionizes healthcare, bringing cutting-edge diagnostics and treatments to remote, underserved areas. The "golden elephant", an AI avatar that provides both medical insights and emotional comfort, embodies the fusion of precision and compassion. Through the eyes of a young woman and her family in a rural community, the story explores the leap of faith required to trust machines with something as intimate as health. What sets this vision apart is its optimism. It imagines a world where AI doesn't just close gaps in medical care but also augments human empathy, ensuring that technology remains a tool for connection, not cold detachment. The story wrestles with ethical concerns, such as the biases AI might inherit and the limits of its clinical perspective, but it ultimately points to a future where technology is an equalizer, breaking down barriers to access. In fact, the AI was sheltering Nayana from Sahej because he was from the poorest caste, and a relationship would have been a tragedy for Nayana's future prospects. "Sahej, why? Why can't we get close to each other?" Nayana chose her words carefully. |
Notes are private!
|
1
|
Dec 16, 2024
|
Jan 03, 2025
|
Dec 16, 2024
|
Paperback
| |||||||||||||||
0321125215
| 9780321125217
| 0321125215
| 4.15
| 5,687
| Aug 20, 2003
| Aug 20, 2003
|
liked it
|
As a senior data analyst, I approached Eric Evans' "Domain-Driven Design" hoping to gain practical insights into better aligning software design with real-world business needs. While the book is undeniably influential in software development circles, I found it too technical and theoretical for my needs. The core ideas, like agreeing on terminology (ubiquitous language) and defining clear boundaries between domains (bounded contexts), are interesting and relevant, but they're buried under layers of complex jargon, UML diagrams, and discussions rooted in object-oriented programming. As someone who works closely with data, I was hoping for more actionable strategies or examples that could be applied directly to data modeling and analysis. Instead, the focus on software architecture felt removed from the day-to-day challenges of interpreting and communicating data insights.

The "domain" is the specific area of knowledge or activity your software addresses (e.g., banking, healthcare). Evans emphasizes the creation of a ubiquitous language, a shared vocabulary used by both developers and domain experts to avoid miscommunication. This language is embedded directly into the code and the design process.

For developers or architects tackling large, enterprise-level systems, this book might be invaluable. But for analysts like me, it leans too heavily into abstract theory and programming frameworks, making it difficult to extract clear, practical takeaways. I'd recommend it only to those deeply embedded in the technical side of software design. For those seeking a more accessible guide to aligning technical solutions with business domains, this might not be the right fit; in fact, there are newer resources inspired by Evans' work that may be more approachable. While it's not an easy read, the book's long-term impact on software architecture makes it a valuable, albeit imperfect, resource. |
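To make the ubiquitous-language idea a bit more concrete, here is a minimal Python sketch of my own (not one of Evans' examples; the LoanApplication and underwriter names are invented): the vocabulary that domain experts use appears verbatim as class and method names inside a single bounded context.

```python
from dataclasses import dataclass, field
from decimal import Decimal


# Bounded context: "Lending". Every name below is a term the domain
# experts themselves use, so code and conversation stay aligned.
@dataclass
class LoanApplication:
    applicant: str
    amount: Decimal
    status: str = "submitted"          # submitted -> approved / rejected
    notes: list[str] = field(default_factory=list)

    def approve(self, underwriter: str) -> None:
        """An underwriter approves the application (a domain operation)."""
        if self.status != "submitted":
            raise ValueError("only a submitted application can be approved")
        self.status = "approved"
        self.notes.append(f"approved by {underwriter}")


application = LoanApplication(applicant="Ada", amount=Decimal("15000"))
application.approve(underwriter="Grace")
print(application.status)              # -> approved
```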
Notes are private!
|
1
|
Dec 05, 2024
|
Dec 15, 2024
|
Dec 05, 2024
|
Hardcover
| |||||||||||||||
1292164778
| 9781292164779
| 1292164778
| 3.29
| 110
| Jan 01, 2006
| Mar 27, 2017
|
really liked it
|
This textbook provides a comprehensive introduction to statistics, focusing on concepts and applications rather than complex formulas. It emphasizes the practical side of statistics while maintaining a balance with theory, making it suitable for students across various disciplines. It is, so to speak, a school book, which for me wasn't the best fit.

The main topics:
- Descriptive statistics: organizing and summarizing data using measures like mean, median, standard deviation, and graphical tools such as histograms and boxplots.
- Probability: basics of probability, probability distributions (like the normal and binomial), and how these underpin statistical inference.
- Inferential statistics: estimation, confidence intervals, and hypothesis testing to draw conclusions about populations from samples.
- Regression analysis: simple and multiple regression techniques for modeling relationships between variables.
However, more advanced topics are not explored.

Key takeaways:
- Statistics is about learning from data: it provides tools to understand patterns, summarize data, and draw meaningful conclusions. It's not just about numbers but about interpreting what the data reveals about real-world situations.
- Variability is key: understanding and managing variability in data is central to statistics. Measures like standard deviation and variance help quantify how much data points differ, enabling better insights.
- Descriptive statistics summarize data: techniques like mean, median, mode, and graphical tools (e.g., histograms and boxplots) provide a snapshot of the data, making complex datasets easier to interpret.
- Probability links data to inference: probability is the foundation for making predictions and drawing conclusions about populations from sample data. It bridges descriptive and inferential statistics.
- Statistical inference is powerful: through confidence intervals and hypothesis testing, statisticians can make predictions, estimate population parameters, and assess relationships between variables with a degree of certainty. |
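As a small companion to the descriptive-statistics and confidence-interval chapters mentioned above, here is a sketch of my own (toy data, not from the textbook) that computes the usual summaries and a 95% confidence interval for a mean with pandas and SciPy.

```python
import pandas as pd
from scipy import stats

# Toy sample: exam scores (hypothetical data, purely for illustration)
scores = pd.Series([72, 85, 90, 66, 78, 88, 95, 70, 81, 77])

# Descriptive statistics: center and spread
print(scores.describe())            # count, mean, std, quartiles, min/max
print("median:", scores.median())

# 95% confidence interval for the population mean (t distribution,
# since the population standard deviation is unknown)
mean = scores.mean()
sem = stats.sem(scores)             # standard error of the mean
low, high = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: [{low:.1f}, {high:.1f}]")
```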
Notes are private!
|
1
|
Nov 22, 2024
|
Dec 05, 2024
|
Nov 20, 2024
|
Pocket Book
| |||||||||||||||
1735431168
| 9781735431161
| B08JG2C29F
| 4.34
| 35
| unknown
| Sep 17, 2020
|
it was amazing
|
This will be my reference point whenever I need to refresh my memory on the intricate topic of hypothesis testing. The book introduces readers to different types of hypothesis tests, such as t-tests, ANOVA, and chi-square tests, with an emphasis on when and why to use each. Frost also covers one-tailed vs. two-tailed tests, Type I and Type II errors, and power analysis, making the book a comprehensive guide to the basics of testing.

Frost takes time to discuss common mistakes in hypothesis testing, such as data dredging and misinterpreting p-values. He emphasizes the importance of context and encourages readers to look beyond just the numbers to understand the data's story. For instance, this explanation of the Type I error (a.k.a. false positive) is so on point and accessible to everyone: "A fire alarm provides a good analogy for the types of hypothesis testing errors. Ideally, the alarm rings when there is a fire and does not ring in the absence of a fire. However, if the alarm rings when there is no fire, it is a false positive, or a Type I error in statistical terms. Conversely, if the fire alarm fails to ring when there is a fire, it is a false negative, or a Type II error."

He succeeds in explaining very convoluted and difficult topics, such as why we do not accept the null hypothesis, in a very clear manner: "You learned that we do not accept the null hypothesis. Instead, we fail to reject it. The convoluted wording encapsulates the fact that insufficient evidence for an effect in our sample isn't proof that the effect does not exist in the population. The effect might exist, but our sample didn't detect it, just like all those species scientists presumed were extinct because they didn't see them."

Frost highlights common pitfalls, like misinterpreting p-values or ignoring statistical power, that can lead to poor decision-making. This focus on interpretation is invaluable and helps readers avoid drawing incorrect conclusions from their data. Another example is the difficult explanation of what a p-value is. Of course you cannot oversimplify complex concepts, but you can try to make them more accessible, as Frost does: "P-values indicate the strength of the sample evidence against the null hypothesis. If it is less than the significance level, your results are statistically significant."

Finally, he explains the Central Limit Theorem thoroughly and why it is so important: the distribution of sample means (or sample sums) from a population will tend to follow a normal distribution, regardless of the population's original distribution, as long as the sample size is sufficiently large. This is true even if the data itself doesn't follow a normal distribution, which is a powerful feature of the CLT. In hypothesis testing, the CLT justifies the use of normal distribution-based methods (like z-tests or t-tests) to assess sample means or proportions when making data-driven decisions. For instance, it allows researchers to test whether a sample mean significantly differs from a hypothesized population mean even if they don't know the underlying distribution of the data. In fact, even if the underlying data is skewed, multimodal, or has any other shape, the distribution of the sample means will approach a normal (bell-shaped) distribution as the sample size increases. |
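Since the review walks through p-values and the CLT, here is a small NumPy/SciPy sketch of my own (synthetic data, not an example from the book) showing a one-sample t-test and the CLT at work on a skewed distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# One-sample t-test: is the mean of this (skewed) sample different from 5?
sample = rng.exponential(scale=5.0, size=40)
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")   # p < significance level -> reject H0

# Central Limit Theorem: means of many skewed samples look roughly normal
sample_means = rng.exponential(scale=5.0, size=(10_000, 40)).mean(axis=1)
print("mean of sample means:", round(sample_means.mean(), 2))           # ~5
print("skewness of sample means:", round(stats.skew(sample_means), 2))  # ~0
```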
Notes are private!
|
1
|
Oct 16, 2024
|
Oct 28, 2024
|
Oct 16, 2024
|
Kindle Edition
| |||||||||||||||
9798991193542
| B0DCXMMX7C
| 4.45
| 53
| unknown
| Aug 19, 2024
|
it was amazing
|
The beauty of this book lies in its simplicity and clarity. Frost starts with the basics of simple linear regression and gradually moves into multiple regression models, always emphasizing the why behind each concept. He's less concerned with turning you into a statistician and more focused on helping you understand how to use regression analysis to draw meaningful conclusions from data.

The book is full of practical examples that are easy to relate to, covering real-world applications in fields like business, economics, and social sciences. Frost walks you through interpreting key outputs like coefficients, R-squared values, and p-values, breaking them down into terms that even beginners can grasp. If you've ever struggled to make sense of what these numbers actually mean in the context of your data, this guide will be a game-changer.

- Beginner-friendly approach: Frost's writing is clear, engaging, and always focused on building an intuitive understanding of the concepts.
- Focus on interpretation: instead of getting bogged down in formulas, the book emphasizes how to interpret regression results and what they mean for your data.
- Practical applications: real-world examples make it easier to see how regression analysis can be used to answer practical questions in various fields. |
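To ground the talk of coefficients, R-squared and p-values, here is a minimal statsmodels sketch of my own (synthetic data, not an example from the book) showing where those numbers come from.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x plus noise
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.8 * x + rng.normal(scale=1.5, size=200)

X = sm.add_constant(x)              # adds the intercept column
model = sm.OLS(y, X).fit()

print(model.params)                 # intercept ~2.0, slope ~0.8
print(model.rsquared)               # share of variance explained
print(model.pvalues)                # p-value per coefficient
print(model.summary())              # full regression table
```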
Notes are private!
|
1
|
Oct 04, 2024
|
Oct 16, 2024
|
Oct 04, 2024
|
Kindle Edition
| ||||||||||||||||
1735431109
| 9781735431109
| 1735431109
| 3.85
| 41
| unknown
| Aug 13, 2020
|
it was amazing
|
Such a well written (true) introduction to statistics. The book gets straight to the point, sometimes even too much so, as it tends to be a bit dry.

First off, this book isn't about bombarding you with formulas and complicated jargon. Frost is more interested in explaining why statistical methods work. He makes sure you actually understand the logic behind the numbers rather than just memorizing steps. For example, when discussing descriptive statistics, he doesn't just give you definitions of mean, median, and standard deviation; he explains how these measures give you different insights into your data set and when you should use each one. It's really about building an intuitive feel for how to analyze data.

The book covers the following themes:
- Data visualization: starting with histograms and moving on to more advanced charts.
- Summary statistics: central tendency, variability, percentiles, correlation.
- Statistical distributions: the book also gives a nice overview of key distributions, like the normal distribution, and why they matter when analyzing data. Frost explains how concepts like the central limit theorem are at the heart of many statistical methods, but he doesn't bog you down with unnecessary complexity.
- Probability: Frost takes the time to explain probability in a way that's easy to grasp. Instead of throwing out formulas right away, he walks through practical examples, like rolling dice or drawing cards, to show how probability works in real-life scenarios. This helps ground your understanding before moving into more abstract concepts.
- Confidence intervals: another topic that Frost handles well. Instead of throwing a bunch of equations at you, he starts by explaining what a confidence interval really means in terms of how certain you can be about a range of values. He explains it in an approachable way, which is perfect for beginners who might not have a background in mathematics. |
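In the spirit of the dice examples the review mentions, here is a tiny simulation of my own (not from the book) comparing an exact probability with an empirical estimate.

```python
import numpy as np

rng = np.random.default_rng(7)

# Probability that the sum of two dice equals 7
# Exact: 6 favourable outcomes out of 36
exact = 6 / 36

# Empirical: simulate a large number of rolls
rolls = rng.integers(1, 7, size=(100_000, 2)).sum(axis=1)
empirical = np.mean(rolls == 7)

print(f"exact = {exact:.4f}, simulated = {empirical:.4f}")  # both ~0.1667
```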
Notes are private!
|
1
|
Sep 27, 2024
|
Oct 04, 2024
|
Sep 27, 2024
|
Paperback
| |||||||||||||||
1492056316
| 9781492056317
| B09WZJMMJP
| 4.62
| 1,693
| Jan 25, 2015
| Mar 31, 2022
|
it was amazing
|
Such a great textbook for looking under the hood of Python's engine. As often happens with more advanced books, the last chapters are a bit too complex for me, but overall the structure of the book is very good. The book delves into Python's underlying mechanics and advanced constructs, guiding developers to write cleaner, more Pythonic code by leveraging the language's built-in capabilities. Ramalho does a deep dive into Python's core features, including data structures, functions, objects, metaprogramming, and concurrency, while emphasizing the importance of understanding Pythonic idioms and best practices.

I've also appreciated the approach of putting first some technical details that are often overlooked, such as: "Special methods in Python, often referred to as "magic methods" or "dunder methods" (short for "double underscore"), are methods that have double underscores at the beginning and end of their names, like __init__, __str__, and __add__. These methods are used to enable certain behaviors and operations on objects that are instances of a class. They allow you to define how your objects should respond to built-in functions and operators." or "A hash is a fixed-size integer that uniquely identifies a particular value or object. Hashes are used in many areas of computer science and programming, such as in data structures like hash tables (which are used to implement dictionaries and sets in Python). The purpose of hashing is to quickly compare and retrieve data in collections that require fast lookups."

The main topics:
- The Python Data Model: the book starts by explaining Python's data model, which is the foundation for everything in Python. It explores how to define and customize objects, and how Python's magic methods (like __repr__, __str__, __len__, etc.) can be used to make objects behave consistently with Python's expectations. You'll learn how Python leverages special methods for operator overloading, object representation, and protocol implementation.
- Data Structures: Ramalho provides a thorough review of Python's built-in data structures such as lists, tuples, sets, dictionaries, and more. He explains how to use these structures efficiently, as well as advanced concepts like comprehensions, slicing, and sorting. He also introduces custom containers, how to create immutable types, and how to use collections to implement advanced data structures like namedtuple and deque.
- Functions as Objects: the book explores first-class functions, demonstrating how functions in Python are objects and can be used as arguments, returned from other functions, or stored in data structures. Concepts such as closures, lambda functions, decorators, and higher-order functions are explained in detail, showing their importance in building flexible, reusable code.
- Object-Oriented Idioms: Fluent Python emphasizes writing idiomatic object-oriented code and explores Python's approach to object-oriented programming (OOP). Topics covered include inheritance, polymorphism, interfaces, protocols, mixins, and abstract base classes (ABCs). The book discusses how to design flexible class hierarchies and how Python's OOP system differs from other languages.
- Interfaces, Protocols and ABCs.
- Metaprogramming: one of the most advanced topics in Python; the book provides a detailed exploration of how to write code that modifies or generates other code at runtime. Descriptors, properties, class decorators, and metaclasses are discussed in depth, offering insight into how Python's internals work and how to leverage them to write dynamic, reusable, and adaptable code.
- Concurrency: Ramalho also addresses concurrency and introduces several approaches for handling parallelism and concurrency in Python, including threading, multiprocessing, asynchronous programming with asyncio, and the concurrent.futures module. The book provides examples of how to work with I/O-bound and CPU-bound tasks efficiently, using appropriate concurrency models.
- Generators and Coroutines: generators and coroutines are powerful tools in Python for managing state and producing data on demand (lazy evaluation). Fluent Python covers the use of generators for iterating over sequences and of coroutines for writing asynchronous code in an intuitive way.
- Decorators and Context Managers: Fluent Python covers decorators and context managers extensively, explaining how they work and how they can be used to implement cleaner, more readable code. The book covers the @property decorator, function decorators, and class decorators, as well as how to use with statements and implement custom context managers.
- Design Patterns: the book touches on common design patterns in Python and how they can be implemented in a Pythonic way. This includes patterns like Strategy, Observer, and Command, as well as more Python-specific approaches like duck typing and protocols. |
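A quick illustration of the data-model idea described above, a minimal sketch of my own in the spirit of the book's examples rather than a copy of them: a small vector class whose dunder methods make it work with repr(), +, ==, abs() and hashing.

```python
class Vector:
    """Toy 2D vector showing how dunder methods hook into Python's data model."""

    def __init__(self, x: float, y: float):
        self.x, self.y = x, y

    def __repr__(self) -> str:                 # used by repr() and the REPL
        return f"Vector({self.x!r}, {self.y!r})"

    def __add__(self, other: "Vector") -> "Vector":   # enables v1 + v2
        return Vector(self.x + other.x, self.y + other.y)

    def __eq__(self, other: object) -> bool:   # enables ==
        return isinstance(other, Vector) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self) -> int:                 # makes Vector usable in sets/dicts
        return hash((self.x, self.y))

    def __abs__(self) -> float:                # enables abs(v)
        return (self.x ** 2 + self.y ** 2) ** 0.5


v = Vector(3, 4) + Vector(1, 2)
print(v, abs(v), {v} == {Vector(4, 6)})        # Vector(4, 6) 7.21... True
```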
Notes are private!
|
1
|
Aug 21, 2024
|
Sep 27, 2024
|
Aug 21, 2024
|
Kindle Edition
| |||||||||||||||
1634629663
| 9781634629669
| 1634629663
| 2.79
| 34
| unknown
| Oct 01, 2021
|
it was ok
|
There is a lot of theoretical stuff, even too much about basic concepts. The core idea is intriguing, but the authors never explain in detail how to achieve it. The definitions alone already say a lot about each architecture and would actually save you the time of reading the book:

**Data Warehouse** A data warehouse is a centralized repository that stores structured data from various sources, typically optimized for query performance and reporting. It is designed to support business intelligence (BI) and analytics, enabling users to generate reports and insights. Data in a warehouse is usually organized in a schema-based format (e.g., star schema, snowflake schema) and is subject to strict quality control and transformation processes (ETL: Extract, Transform, Load) before being loaded into the warehouse.

**Data Lake** A data lake is a large-scale storage repository that can hold vast amounts of raw, unstructured, semi-structured, and structured data in its native format. Unlike a data warehouse, a data lake does not require data to be processed or transformed before being stored. It is designed to accommodate a wide variety of data types, such as log files, videos, images, and sensor data, making it suitable for big data analytics, machine learning, and data exploration. Data lakes are highly scalable and can be deployed on cloud platforms or on-premises.

**Data Lakehouse** A data lakehouse is a modern, hybrid data architecture that combines the scalable storage and flexibility of a data lake with the performance and management features of a data warehouse. It allows organizations to store all types of data (structured, semi-structured, and unstructured) in a single repository while enabling high-performance analytics and query processing. The data lakehouse supports ACID transactions, schema enforcement, and data governance, providing a unified platform for diverse workloads, including BI, data science, and real-time analytics.

The book discusses the (very well known) limitations of traditional data warehouses, such as their inability to handle unstructured data efficiently and their cost-prohibitive scalability. It also covers the challenges associated with data lakes, like data governance, data quality, and the complexity of managing diverse data formats. The data lakehouse addresses these issues by integrating the best features of both architectures.

A common criticism is that the book tends to be repetitive, rehashing similar ideas and concepts multiple times without providing new insights. While it covers the basics of the data lakehouse, it may not delve deeply enough into technical details or advanced concepts for readers who are already familiar with data architecture.

In order to move to a lakehouse architecture:
- Assessment and planning. Evaluate the current infrastructure: assess the existing data warehouse architecture, including storage, ETL processes, and BI tools, and identify limitations and areas for improvement. Define use cases: determine the use cases that the data lakehouse will support, such as real-time analytics, machine learning, and unstructured data analysis. Identify stakeholders: engage business stakeholders, data engineers, data scientists, and IT teams to gather requirements and expectations.
- Design the lakehouse architecture. Storage layer: plan for a scalable storage solution (e.g., object storage like AWS S3, Azure Blob Storage) that can handle diverse data types (structured, semi-structured, and unstructured). Management layer: implement data governance, security, and metadata management practices that ensure data quality and compliance. Processing layer: incorporate processing engines capable of supporting SQL queries, machine learning, and streaming data processing (e.g., Apache Spark, Flink). Consumption layer: ensure compatibility with existing BI tools and provide user-friendly access for data analysts and data scientists. |
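As a concrete, hedged illustration of the lakehouse idea (my own sketch, not from the book): with Spark plus the open-source delta-spark package you can write plain files to cheap storage and still get table semantics such as ACID writes, schema enforcement, and SQL access. The path and table name below are made up, and the session configuration assumes delta-spark is installed locally.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write raw events as a Delta table: files on object-store-style storage,
# but with transactions and schema enforcement on top.
events = spark.createDataFrame(
    [("2024-01-01", "click", 3), ("2024-01-01", "view", 7)],
    ["day", "event_type", "cnt"],
)
events.write.format("delta").mode("overwrite").save("/tmp/lakehouse/events")

# Query it like a warehouse table with plain SQL.
spark.sql("CREATE TABLE IF NOT EXISTS events USING DELTA LOCATION '/tmp/lakehouse/events'")
spark.sql("SELECT event_type, SUM(cnt) AS total FROM events GROUP BY event_type").show()
```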
Notes are private!
|
1
|
Aug 02, 2024
|
Aug 15, 2024
|
Aug 02, 2024
|
Paperback
| |||||||||||||||
1098142381
| 9781098142384
| 1098142381
| 4.10
| 21
| unknown
| Jan 16, 2024
|
really liked it
|
A solid handbook on the emerging field of analytics engineering, which bridges the gap between data engineering and data analytics. That is why it specifically emphasizes the use of SQL, the lingua franca of databases, and dbt (data build tool) to create scalable, maintainable, and meaningful data models that can power business intelligence (BI) and analytics workflows.

I would have hoped for a more advanced textbook, since most of the initial chapters cover basic analyst tools and concepts such as data modeling and SQL. Since analytics engineer is a more technical and advanced role than data analyst, it would have made sense to skip the basics and go straight into the action. Although I understand the need to attract as wide an audience as possible, as a senior analytics engineer myself I found the first part of the book redundant and the last one insightful. For a data analyst it might well be the opposite, which makes the book not always super relevant.

dbt is an open-source tool that enables analytics engineers to transform data in their warehouse by writing modular SQL queries, testing data quality, and documenting data transformations. dbt automates the process of building and maintaining data models, making it easier to manage complex data pipelines. I found the sections about dbt macros and the use of Jinja SQL particularly interesting.

One final note: the book focuses almost entirely on dbt Cloud, which is a paid service. I'd have liked a deeper discussion of the open-source distribution, dbt Core, to understand how dbt works under the hood; the book mainly shows the UI without going into the technical details of the CLI. In short, it is a bit light on the dbt Core side. |
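To give a feel for what the dbt macros and Jinja SQL mentioned above do, here is a tiny Python sketch of my own using the jinja2 package directly; the model and column names are invented, and real dbt adds ref(), sources, and a full compilation pipeline on top of this basic idea.

```python
from jinja2 import Template

# A dbt-style model: a Jinja macro expands into repetitive SQL.
model_sql = Template(
    """
{%- macro sum_by(column) -%}
    SUM({{ column }}) AS total_{{ column }}
{%- endmacro -%}
SELECT
    order_date,
    {{ sum_by('revenue') }},
    {{ sum_by('discount') }}
FROM {{ source_table }}
GROUP BY order_date
"""
)

# Rendering is roughly what `dbt compile` does before sending SQL to the warehouse.
print(model_sql.render(source_table="analytics.stg_orders"))
```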
Notes are private!
|
1
|
Jul 26, 2024
|
Aug 21, 2024
|
Jul 26, 2024
|
Paperback
| |||||||||||||||
1617297208
| 9781617297205
| 1617297208
| 4.34
| 29
| unknown
| Mar 22, 2022
|
it was amazing
|
A very good introduction to Spark and its components. It does not take anything for granted: the author explains how the APIs work and what Python and SparkSQL are; for instance, he explains how JOINs work regardless of whether you use PySpark or SQL. So if you already know these languages the book might contain redundant, but still valuable, information.

NOTES

Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It was developed to provide fast and general-purpose data processing capabilities. Spark extends the MapReduce model to support more types of computations, including interactive queries and stream processing, making it a powerful engine for large-scale data analytics.

Key features of Spark:
- Speed: Spark's in-memory processing capabilities allow it to be up to 100 times faster than Hadoop MapReduce for certain applications.
- Ease of use: it provides simple APIs in Java, Scala, Python, and R, which makes it accessible to a wide range of users.
- Versatility: Spark supports various workloads, including batch processing, interactive querying, real-time analytics, machine learning, and graph processing.
- Advanced analytics: it has built-in modules for SQL, streaming, machine learning (MLlib), and graph processing (GraphX).

PySpark is the Python API for Apache Spark, which allows Python developers to write Spark applications using Python. PySpark integrates the simplicity and flexibility of Python with the powerful distributed computing capabilities of Spark.

Key features of PySpark:
- Python-friendly: it enables Python developers to leverage Spark's power using familiar Python syntax.
- DataFrames: provides a high-level DataFrame API, similar to pandas DataFrames, but distributed.
- Integration with the Python ecosystem: allows seamless integration with Python libraries such as NumPy, pandas, and scikit-learn.
- Machine learning: through MLlib, PySpark supports a wide range of machine learning algorithms.

SparkSQL is a module for structured data processing in Apache Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

Key features of SparkSQL:
- DataFrames: a DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a data frame in R/pandas.
- SQL queries: SparkSQL allows users to execute SQL queries on Spark data. It supports SQL and Hive Query Language (HQL) out of the box.
- Unified data access: it provides a unified interface for working with structured data from various sources, including Hive tables, Parquet files, JSON files, and JDBC databases.
- Optimizations: uses the Catalyst optimizer for query optimization, ensuring efficient execution of queries.

Key components and concepts:
- Spark Core: RDD (Resilient Distributed Dataset), the fundamental data structure of Spark, an immutable distributed collection of objects that can be processed in parallel. Transformations and actions: transformations create new RDDs from existing ones (e.g., map, filter), while actions trigger computations and return results (e.g., collect, count).
- PySpark: RDDs and DataFrames, similar to Spark Core but accessed using Python syntax. SparkContext: the entry point to any Spark functionality, responsible for coordinating Spark applications. SparkSession: an entry point for interacting with DataFrames and the Spark SQL API.
- SparkSQL: DataFrame API, a high-level abstraction for structured data. SparkSession: central to SparkSQL, used to create DataFrames, execute SQL queries, and manage Spark configurations. SQL queries: run via the sql method on a SparkSession. Catalog: a metadata repository that stores information about the structure of the data. |
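A minimal PySpark sketch of my own (not taken from the book) showing the same join expressed through both the DataFrame API and SparkSQL, which is the kind of parallel the author draws.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "alice", 30.0), (2, "bob", 12.5), (3, "alice", 7.0)],
    ["order_id", "customer", "amount"],
)
customers = spark.createDataFrame(
    [("alice", "IT"), ("bob", "FR")],
    ["customer", "country"],
)

# DataFrame API: join + aggregation
(orders.join(customers, on="customer", how="inner")
       .groupBy("country")
       .sum("amount")
       .show())

# The same query in SparkSQL
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")
spark.sql("""
    SELECT c.country, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer = c.customer
    GROUP BY c.country
""").show()
```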
Notes are private!
|
1
|
Jul 16, 2024
|
Jul 25, 2024
|
Jul 16, 2024
|
Paperback
| |||||||||||||||
1492092398
| 9781492092391
| 1492092398
| 3.78
| 327
| Apr 12, 2022
| Apr 12, 2022
|
it was ok
|
It really bothers me when I feel like the author is wasting my time, and this is one of those times. Don't get me wrong, the concepts laid out are interesting, but as pointed out by many reviewers they could well have been summarized in half the pages: "The concept is very simple, and it's presented in the initial section of the book; what you get later is a lot of repetition w/o practical advice or (what's even worse) any useful examples - and that's probably the biggest drawback of the book: it's by far too dry and theoretical." There are far too many definitions that create confusion, and the book remains too theoretical. This is what ChatGPT has to say about data mesh, and unfortunately for the author, it covers most of what you need to know about this movement.

Data mesh is a decentralized approach to data architecture that aims to overcome the limitations of traditional centralized data systems, particularly in large and complex organizations. It was introduced by Zhamak Dehghani in 2019. The core idea behind data mesh is to treat data as a product and to manage data ownership and responsibilities in a decentralized way, much like how microservices are managed in software development.

Key principles:
- Domain-oriented data ownership: data is owned by the teams that know the data best, typically the ones that generate it. Each domain team is responsible for the data it produces, ensuring high quality and relevance.
- Data as a product: data is treated as a product with its own lifecycle, including development, maintenance, and deprecation. Domain teams are responsible for delivering their data in a way that is easily discoverable, understandable, and usable by others.
- Self-serve data infrastructure: a self-service infrastructure platform is provided to domain teams to enable them to manage their data independently. This platform typically includes tools for data storage, processing, governance, and access control.
- Federated computational governance: governance is implemented in a federated manner, balancing global standards with local autonomy. This involves establishing policies and standards that are enforced across all domains while allowing domains the flexibility to manage their own data.

Components of data mesh:
- Domain data products: datasets produced by different domain teams, designed to be used by other teams. Each data product comes with a clear contract, including schema, SLAs, quality metrics, and documentation.
- Data platform: a central platform provides common infrastructure services like data storage, processing, and security. The platform abstracts away the complexities of the underlying technologies, allowing domain teams to focus on their data products.
- Governance and standards: policies and standards are established to ensure data quality, security, and compliance. Governance is implemented in a federated manner, with responsibilities distributed across domain teams.
- Interoperability and communication: mechanisms are put in place to ensure that data products from different domains can be easily integrated and used together. This may involve standardizing formats, interfaces, and communication protocols.

Benefits of data mesh:
- Scalability: by decentralizing data ownership and management, organizations can scale their data practices more effectively. Each domain team can work independently, avoiding the bottlenecks associated with centralized data teams.
- Agility: domain teams can develop and iterate on their data products more quickly, responding to changing business needs. This leads to faster innovation and time-to-market for data-driven initiatives.
- Quality and relevance: data ownership by domain teams ensures that the people most familiar with the data are responsible for its quality and relevance. This leads to higher quality data that is more aligned with business needs.
- Collaboration and reuse: data mesh promotes a culture of data sharing and reuse, making it easier for teams to discover and use data from other domains. This reduces duplication of effort and leads to more efficient use of data resources.

Challenges and considerations:
- Cultural change: implementing data mesh requires a significant cultural shift, as teams need to take on new responsibilities for data ownership and product management. Organizations need to invest in training and change management to support this transition.
- Complexity: managing a decentralized data architecture can introduce new complexities, particularly around governance and interoperability. It requires careful planning and robust tooling to ensure that data remains discoverable, usable, and compliant.
- Technology and tooling: building a self-serve data platform requires significant investment in technology and infrastructure. Organizations need to ensure they have the right tools and platforms to support the needs of their domain teams.

Data mesh represents a significant shift in how organizations manage and utilize their data. By decentralizing data ownership and treating data as a product, organizations can become more agile, scalable, and effective in their use of data. However, successful implementation requires careful planning, investment in infrastructure, and a commitment to cultural change. |
Notes are private!
|
1
|
Jun 10, 2024
|
Jun 23, 2024
|
Jun 10, 2024
|
Paperback
| |||||||||||||||
1942788290
| 9781942788294
| 1942788290
| 4.26
| 48,265
| Jan 10, 2013
| Feb 01, 2018
|
it was amazing
|
This gem of a book took me by surprise and deservedly sits in my "best" section. It is the perfect blend between fiction and textbook: some parts make you laugh, others make you think and reflect on important work-related topics.

The story centers around Bill Palmer, an IT manager at Parts Unlimited, a struggling automotive parts company. The company's new initiative, code-named Phoenix Project, is critical for its survival, but it's over budget, behind schedule, and plagued by numerous issues. If you work in the tech industry, and in particular in the Digital (ex IT) teams, you can very much relate to what happens at Parts Unlimited.

Bill Palmer is unexpectedly promoted to VP of IT Operations. The CEO sugarcoats the pill, but in truth Bill is on course for a suicide mission: the relationship between IT and the rest of the company is dysfunctional, to say the least. From day zero Bill finds himself in the middle of political meetings where, more often than not, everyone blames IT for everything that does not work. In short, he faces a messy situation. Then a savior appears, Erik, who teaches Bill how to deal with complexity. In short:
- First Way: emphasizes the performance of the entire system, rather than individual departments.
- Second Way: focuses on creating feedback loops to enable continuous improvement.
- Third Way: encourages a culture of continual experimentation and learning.

The book also illustrates:
- DevOps principles: the core principles of DevOps, including continuous delivery, automation, and the integration of development and operations teams.
- Workflow optimization: the importance of streamlining workflows, eliminating bottlenecks, and improving efficiency.
- Cultural change: the necessity of cultural transformation within an organization to adopt DevOps practices effectively.
- Systems thinking: a holistic view of the IT environment and its impact on the business as a whole.

But before meeting Erik, of course, the first thing Bill tries to do is understand what's going on and what the root causes of so many incidents are, and to improve the relationship with the developers and Security. Most importantly, Bill wants to get a grip on the situation: who is doing what, and who is authorizing any changes. However Bill, Patty and Wes soon realise that there are too many pending changes, so much so that it is difficult even to list them all: "Thinking for a moment, I add, "For that matter, do the same thing for every person assigned to Phoenix. I'm guessing we're overloaded, so I want to know by how much. I want to proactively tell people whose projects have been bumped, so they're not surprised when we don't deliver what we promised.""

And then, when everything seemed lost, the strange figure of Erik enters the scene. He is brought into the story as a mysterious and knowledgeable figure who understands the deep-rooted problems within IT operations and the organization as a whole. His primary role is to mentor Bill Palmer, providing him with the insights and guidance needed to tackle the complex issues facing Parts Unlimited. Erik emphasizes the importance of optimizing the entire system rather than focusing on individual departments. This involves understanding how various parts of the organization interact and ensuring that improvements in one area do not create problems in another. "I look at Erik suspiciously. He supposedly couldn't get anyone's name right, and yet he apparently remembers the name of some security guard from years past. And no one ever mentioned anything about a Dr. Reid."

Moreover, Erik teaches Bill how to avoid bottlenecks and the necessity of creating robust feedback loops within the organization. These feedback loops help identify issues quickly, allow for continuous improvement, and ensure that knowledge is shared across teams. Bill's team has the Brent problem: a guy who can do it all, but who doesn't even know how he does it, and therefore everybody calls for his help. That is, Brent is a constraint. "He shakes his head, recalling the memory, "He sat down at the keyboard, and it's like he went into this trance. Ten minutes later, the problem is fixed. Everyone is happy and relieved that the system came back up. But then someone asked, 'How did you do it?' And I swear to God, Brent just looked back at him blankly and said, 'I have no idea. I just did it.'""

But if everybody needs Brent, his workload is unmanageable, and therefore his tasks are always late: "An ever-growing pile of changes trapped inside of IT Operations, with us running out of space to post the change cards."

And then there is the ultimate monster: unplanned work. Unplanned work often interrupts scheduled tasks and projects, leading to delays and inefficiencies. When team members are constantly pulled away to deal with unplanned issues, it consumes valuable time and resources. This can prevent the completion of strategic work and contribute to employee burnout. Frequent unplanned work can indicate underlying issues in processes or systems. It often leads to a reactive mode of operation where teams are firefighting instead of proactively improving systems and preventing problems. "I turn back to Patty and say slowly, "Let me guess. Brent didn't get any of his non-Phoenix change work completed either, right?"" |
Notes are private!
|
1
|
May 09, 2024
|
May 22, 2024
|
May 09, 2024
|
Paperback
| |||||||||||||||
1484251776
| 9781484251775
| B07Z1PHHQ9
| 3.56
| 9
| unknown
| Oct 10, 2019
|
really liked it
|
A very good introduction to ML, DL and anomaly detection but with the original sin of a poor pagination and an even poorer graphics design. All in all A very good introduction to ML, DL and anomaly detection but with the original sin of a poor pagination and an even poorer graphics design. All in all it does its job in explaining how to deal with anomaly detection, but I'd have liked a little bit more of unsupervised examples, which are the toughest situations to deal with. The Deep Learning section is very well written, they start from the basics, from the artificial neuron up to state of art like CNNs and GPT. NOTES Data-based Anomaly Detection Statistical Methods: These methods rely on statistical measures such as mean, standard deviation, or probability distributions to identify anomalies. Examples include z-score, interquartile range (IQR), and Gaussian distribution modeling. Machine Learning Algorithms: Various machine learning algorithms learn patterns from the data and detect anomalies based on deviations from learned patterns. Techniques like decision trees, support vector machines (SVM), isolation forests, and autoencoders fall into this category. Context-based Anomaly Detection Domain Knowledge: Context-based approaches leverage domain-specific knowledge to identify anomalies. For example, in network security, unusual network traffic patterns may be detected based on knowledge of typical network behavior. Expert Systems: Expert systems use rule-based or knowledge-based systems to detect anomalies based on predefined rules or heuristics derived from domain expertise. Pattern-based Anomaly Detection Pattern Recognition: Pattern-based approaches focus on identifying deviations from expected patterns within the data. Techniques such as time series analysis, sequence mining, and clustering fall into this category. Deep Learning: Deep learning techniques, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and generative adversarial networks (GANs), can be used for pattern-based anomaly detection by learning complex patterns and detecting deviations from learned representations. Outlier detection focuses on identifying data points that deviate significantly from the majority of the dataset. These data points are often called outliers. Outliers can be indicative of errors, anomalies, or rare events in the data. Techniques such as statistical methods (e.g., z-score, IQR), machine learning algorithms (e.g., isolation forests, one-class SVM), and clustering can be used for outlier detection. Novelty detection, also known as one-class classification, involves identifying instances that significantly differ from normal data, without having access to examples of anomalies during training. The goal is to detect novel or unseen patterns in the data. It's particularly useful when anomalies are rare and difficult to obtain labeled data for. Techniques such as support vector machines (SVM) and autoencoders are commonly used for novelty detection. Event detection aims to identify significant occurrences or events in a dataset, often in real-time or near real-time. These events may represent changes, anomalies, or patterns of interest in the data stream. Event detection is crucial in various domains such as sensor networks, finance, and cybersecurity. Techniques such as time series analysis, signal processing, and machine learning algorithms can be applied for event detection. 
Noise removal involves the process of filtering or eliminating unwanted or irrelevant data points from a dataset. Noise can obscure meaningful patterns and distort the analysis results. Techniques such as smoothing filters, wavelet denoising, and outlier detection can be used for noise removal, depending on the nature of the noise and the characteristics of the data. Traditional ML Algorithms Isolation Forest It is an unsupervised machine learning algorithm used for anomaly detection. It works by isolating anomalies in the data by splitting them from the rest of the data using binary trees. Random Partitioning: Isolation Forest randomly selects a feature and then randomly selects a value within the range of that feature. It then partitions the data based on this randomly selected feature and value. Recursive Partitioning: This process of random partitioning is repeated recursively until all data points are isolated or a predefined maximum depth is reached. Anomaly Score Calculation: Anomalies are expected to be isolated with fewer partitions compared to normal data points. Therefore, anomalies are assigned lower anomaly scores. These scores are based on the average path length required to isolate the data points during the partitioning process. The shorter the path, the more likely it is to be an anomaly. Thresholding: An anomaly threshold is defined, and data points with anomaly scores below this threshold are considered anomalies. Let's consider a simple example of anomaly detection in a dataset containing information about server response times. The dataset includes features such as CPU usage, memory usage, and network traffic. We want to identify anomalous server responses that indicate potential system failures or cyber attacks. Random Partitioning: In the first iteration, the algorithm randomly selects a feature, let's say CPU usage, and then randomly selects a value within the range of CPU usage, for example, 80%. Based on this random selection, it partitions the data into two groups: data points with CPU usage <= 80% and data points with CPU usage > 80%. Recursive Partitioning: This process is repeated recursively, with random feature and value selections, until each data point is isolated or the maximum depth is reached. Each partitioning step creates a binary tree structure. Anomaly Score Calculation: Anomalies are expected to require fewer partitions to isolate. Therefore, data points that are isolated early in the process (i.e., with shorter average path lengths) are assigned lower anomaly scores. Thresholding: An anomaly threshold is defined based on domain knowledge or validation data. Data points with anomaly scores below this threshold are flagged as anomalies. One-Class Support Vector Machine One-Class Support Vector Machine (SVM) is a type of support vector machine algorithm that is used for anomaly detection, particularly when dealing with unlabeled data. It is trained on only the normal data instances and aims to create a decision boundary that encapsulates the normal data points, thereby distinguishing them from potential anomalies. Training Phase: One-Class SVM is trained using only the normal instances (i.e., data points without anomalies). The algorithm aims to find a hyperplane (decision boundary) that best separates the normal data points from the origin in the feature space. 
Unlike traditional SVM, which aims to maximize the margin between different classes, One-Class SVM aims to enclose as many normal data points as possible within a margin around the decision boundary. Model Representation: The decision boundary created by One-Class SVM is represented by a hyperplane defined by a set of support vectors and a distance parameter called the "nu" parameter. The hyperplane divides the feature space into two regions: the region encapsulating the normal data points (inliers) and the region outside the boundary, which may contain anomalies (outliers). Prediction Phase: During the prediction phase, new data points are evaluated based on their proximity to the decision boundary. Data points falling within the boundary (inside the margin) are classified as normal (inliers). Data points falling outside the boundary (outside the margin) are classified as potential anomalies (outliers). Hyperparameter Tuning: One-Class SVM typically has a hyperparameter called "nu" that controls the trade-off between maximizing the margin and allowing for violations (i.e., data points classified as outliers). Tuning this hyperparameter is crucial for achieving optimal performance. Scalability: is computationally efficient, particularly when dealing with high-dimensional data or large datasets. However, it may become less effective in extremely high-dimensional spaces. Robustness to Outliers: is inherently robust to outliers in the training data since it learns from only one class. However, it may still misclassify some anomalies that lie close to the decision boundary. Class Imbalance: assumes that the normal class is the minority class, and anomalies are rare. If anomalies are not significantly different from normal instances or if they form a significant portion of the data, One-Class SVM may not perform well. Deep Learning An artificial neuron, also known as a perceptron, is a fundamental building block of artificial neural networks. It mimics the behavior of biological neurons in the human brain. The input, a vector from x� to x is multiplied element-wise by a weight vector w� to w and then summed together. The sum is then offset by a bias term b, and the result passes through an activation function, which is some mathematical function that delivers an output signal based on the magnitude and sign of the input. An example is a simple step function that outputs 1 if the combined input passes a threshold, or 0 otherwise. These now form the outputs, y, to ym. This y-vector can now serve as the input to another neuron. Input Layer: An artificial neuron typically receives input signals from other neurons or directly from the input features of the data. Each input signal x is associated with a weight w that represents the strength of the connection between the input and the neuron. Weighted Sum: The neuron computes the weighted sum of the input signals and their corresponding weights. The bias term allows the neuron to adjust the decision boundary independently of the input data. Activation Function: The weighted sum z is then passed through an activation function, f(z). It introduces non-linearity into the neuron, enabling it to model complex relationships and learn non-linear patterns in the data. Common activation functions include sigmoid, tanh, ReLU, Leaky ReLU, ELU, etc. 
Output: The output y of the neuron is the result of applying the activation function to the weighted sum y=f(z) The output of the neuron represents its activation level or firing rate, which is then passed as input to other neurons in the subsequent layers of the neural network. Bias Term: The bias term b is a constant value added to the weighted sum before applying the activation function. It allows the neuron to control the decision boundary independently of the input data. The bias term effectively shifts the activation function horizontally, influencing the threshold at which the neuron fires. Activation Function: introduces non-linearity into the neuron's output. This non-linearity enables the neural network to learn complex relationships and patterns in the data that may not be captured by a simple linear model. The choice of activation function depends on the specific requirements of the task and the characteristics of the data. Output Layer: In a neural network, neurons are organized into layers. The output layer typically consists of one or more neurons that produce the final output of the network. The activation function used in the output layer depends on the nature of the task. For example, sigmoid or softmax functions are commonly used for binary or multi-class classification tasks, while linear functions may be used for regression tasks. Activation Functions are a way to map the input signals into some form of output signal to be interpreted by the subsequent neurons. They are designed to add non-linearity to the data. If we do not use it, then the output of the affine transformations is just the final output of the neuron. - Sigmoid: The sigmoid activation function squashes the input values between 0 and 1. It has an S-shaped curve and is commonly used in binary classification tasks. However, it suffers from the vanishing gradient problem and is not recommended for deep neural networks. It is appropriate for being used at the very end of a DNN to map the last layer’s raw output into a probability score. - Hyperbolic Tangent (Tanh): Tanh activation function squashes the input values between -1 and 1. Similar to the sigmoid function, it has an S-shaped curve but centered at 0. Tanh is often used in hidden layers of neural networks. - Rectified Linear Unit (ReLU): outputs the input directly if it is positive, otherwise, it outputs zero. It is computationally efficient and helps in mitigating the vanishing gradient problem. ReLU is widely used in deep learning models due to its simplicity and effectiveness. - Leaky ReLU: is similar to ReLU but allows a small, non-zero gradient when the input is negative. This helps prevent dying ReLU neurons, which can occur when a large gradient update causes the neuron to never activate again. - Exponential Linear Unit (ELU): is similar to ReLU for positive input values but smoothly approaches zero for negative input values. It helps in preventing dead neurons and can capture information from negative inputs. - Softmax: is typically used in the output layer of a neural network for multi-class classification tasks. It converts the raw output scores (logits) into probabilities, ensuring that the sum of the probabilities for all classes is equal to 1. Softmax is useful for determining the probability distribution over multiple classes. A layer in a neural network is a collection of neurons that each compute some output value using the entire input. 
A layer in a neural network is a collection of neurons that each compute some output value from the entire input; the output of a layer comprises all the values computed by the neurons within it. A neural network is a sequence of layers in which the output of one layer is the input to the next. The first layer is the input layer and takes the training data as its input. The last layer is the output layer and produces the values used as predictions for whatever task the network is being trained to perform. All layers in between are called hidden layers. ...more |
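As a rough follow-up to the layer notes above, a sketch that stacks the single neuron from the previous snippet into a two-layer network (the shapes and random weights are arbitrary illustrations):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_layer(x, W, b, activation):
    """A layer: every neuron sees the whole input and contributes one output value."""
    return activation(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input layer: 4 features

W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # hidden layer: 8 neurons with ReLU
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)   # output layer: 1 neuron with sigmoid

h = dense_layer(x, W1, b1, relu)                # hidden-layer output
y = dense_layer(h, W2, b2, sigmoid)             # prediction (probability-like score)
print(y)
```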
Notes are private!
|
1
|
Mar 27, 2024
|
Apr 19, 2024
|
Mar 27, 2024
|
Kindle Edition
| |||||||||||||||
1617298891
| 9781617298899
| 1617298891
| 3.75
| 16
| unknown
| Dec 21, 2021
|
it was amazing
|
This textbook has a very clear and effective structure: each part focuses on the virtues and capabilities needed at different career stages. That is why I will certainly reread the later parts of this book as my career progresses. Right now the section that interests me most is the one on leading projects. It has the right blend of theory and practical examples. ...more |
Notes are private!
|
1
|
Mar 08, 2024
|
Mar 26, 2024
|
Mar 06, 2024
|
Paperback
| |||||||||||||||
1953295738
| 9781953295736
| 1953295738
| 3.20
| 20
| unknown
| Dec 14, 2021
|
it was ok
|
More than a book on Artificial Intelligence, this is an examination of how to formulate problems and search for solutions. I strongly agree with the main thesis, that is: first define the problem, then think about (potentially AI-driven?) solutions. His message is perfectly summarised in these few lines

Being AI-first means doing AI last. Doing AI means doing it last or not doing it at all. The reason is rather simple: Solution-focused strategies are more complex than problem-focused strategies; and solution-focused thinking ignores the most important part of business, which is the problems they solve and the customers they create.

He then highlights the differences between doing research in the AI field and working with AI: in the latter case we must be pragmatic and problem-oriented. Working in a company means that your manager does not care whether the solution is called AI or BI, but whether you solve the problem in the first place.

Remember that insiders seek epistemological discoveries, not economic ones. The more epistemological a pursuit is, the less likely it is to become something that could be turned into a business. Entrepreneur, venture capitalist, and author Paul Graham discusses the value of problems at length and explains that good business ideas are unlikely to come from scholars who write and defend dissertations.

However, he soon starts to repeat those concepts over and over again, making most of the text redundant. He then takes a (boring) philosophical tangent on what a problem is and its different types, which weighs down the narrative flow. Just to give you an intuition, here is an extract on this topic

There is something magical about writing down a problem. It's almost as though by writing about what is wrong, we start to discover new ways of making it right. Writing things down will also remind ourselves and our teams of the problem and the goal. Once a problem is written down, don't forget to come back to the problem statement. It is a guide. Problem solving often starts with great intentions and alignment, but when it counts most, when the work is actually being done, we often don't hold on to the problem we set out to solve, and that's the most important part of problem solving: what the problem is and why we are solving it to begin with.

At last he gives good advice on general problem solving, particularly regarding divide and conquer, which stems from the fact that small problems are often simpler problems

Always start small and take small steps to ensure that performance is what you want. Don't try to boil the ocean with the whole of a problem. With smaller steps almost everything can be reduced to something more manageable. Working in smaller sizes and smaller steps goes for your team as well. Rather than having your whole team work on something for six months, think about what one person can do in six weeks. The Basecamp team uses six weeks, which I think is a good size. If you are an Agile team, you may have batches of two weeks. That is fine, too. The point is that constraining batch size will force everyone to find the best bad solution, rather than working into the abyss of perfection....more |
Notes are private!
|
1
|
Feb 25, 2024
|
Mar 02, 2024
|
Feb 25, 2024
|
Hardcover
| |||||||||||||||
1633695689
| 9781633695689
| B075GXJPFS
| 3.87
| 3,682
| unknown
| Apr 17, 2018
|
liked it
|
A good starting non-technical book if you have no idea what AI and machine learning are. I've found it a bit repetitive and verbose, but at least it doesn't take anything for granted. It starts with the classic law of supply and demand: the lower the price of a good, the higher its demand, ceteris paribus. Since prediction machines are becoming cheaper, they are going to be used much more extensively in many different sectors. This is amplified by the fact that data, the fuel of machine learning, is now everywhere at our disposal.

From a statistical perspective, data has diminishing returns: each additional unit of data improves your prediction less than the prior data. In economic terms, the relationship is ambiguous: the value of adding more data to a large existing stock of data may be greater than the value of adding it to a small stock. Thus organisations need to understand the relationship between adding more data, enhancing prediction accuracy, and increasing value creation.

Machine learning science had different goals from statistics. Whereas statistics emphasized being correct on average, machine learning did not require that. Instead, the goal was operational effectiveness: predictions could have biases so long as they were better (something that became possible with powerful computers). This gave scientists the freedom to experiment and drove the rapid improvements that take advantage of the rich data and fast computers that appeared over the last decade.

P.S. To be honest, the summaries at the end of each chapter are so well written that sometimes I've read them directly. ...more |
Notes are private!
|
1
|
Feb 17, 2024
|
Feb 25, 2024
|
Feb 17, 2024
|
Kindle Edition
| |||||||||||||||
1098125975
| 9781098125974
| 1098125975
| 4.55
| 2,679
| Apr 09, 2017
| Nov 08, 2022
|
it was amazing
|
This textbook is like the Swiss Army knife of machine learning books: it's packed with tools and techniques to help you tackle a wide range of real-world problems. It takes you on a journey through the exciting landscape of machine learning, equipped with powerful libraries like Scikit-Learn, Keras, and TensorFlow. It explains each library in depth; I was more interested in the first one, as Keras and TF are too advanced for my interests and knowledge. It is actually funny to read the Natural Language Processing (NLP) and LLMs section, written prior to ChatGPT.

NOTES:
Supervised Learning: the algorithm is trained on a labeled dataset, meaning each input is paired with the correct output. The model learns to map the input to the output, making predictions or classifications when new data is introduced. Common algorithms include linear regression, logistic regression, decision trees, support vector machines, and neural networks.
Unsupervised Learning: deals with unlabeled data, where the algorithm explores the data's structure or patterns without any explicit supervision. Clustering and association are the two primary tasks. Clustering algorithms, like K-means or hierarchical clustering, group similar data points together. Association algorithms, like the Apriori algorithm, find relationships or associations among data points.
Reinforcement Learning: involves an agent learning to make decisions by interacting with an environment, and is often used in robotics. The agent learns by receiving feedback in the form of rewards or penalties as it navigates a problem space, with the goal of learning the actions that maximize the cumulative reward. Algorithms like Q-learning and Deep Q Networks (DQN) are used in reinforcement learning scenarios.
There are also subfields and specialized forms within these categories, such as semi-supervised learning, where algorithms learn from a combination of labeled and unlabeled data, and transfer learning, which leverages knowledge from one domain in another.

Gradient descent is a fundamental optimization algorithm widely used in machine learning for minimizing the error of a model by adjusting its parameters. It is especially crucial in training models like neural networks, linear regression, and other algorithms where the goal is to find the parameters that minimize a cost or loss function (a minimal sketch follows after these bullet points).
- Objective: minimize a cost or loss function that measures the difference between predicted values and actual values.
- Optimization Process: gradient descent is an iterative algorithm; it adjusts the model parameters step by step to minimize the given cost function.
- Gradient Calculation: at each iteration, the algorithm calculates the gradient of the cost function with respect to the model parameters. The gradient points in the direction of the steepest increase of the function.
- Parameter Update: the algorithm updates the parameters in the direction opposite to the gradient (i.e., descending along the gradient). The step size is determined by the learning rate, which controls how big a step the algorithm takes.
- Convergence: this process continues iteratively, gradually reducing the error or loss.
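A minimal NumPy sketch of the update loop described above, assuming a simple linear-regression setup with a mean-squared-error loss (the data, learning rate and iteration count are illustrative):

```python
import numpy as np

# Illustrative data: y ≈ 4 + 3x plus noise.
rng = np.random.default_rng(1)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(scale=0.5, size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]          # add a bias column of 1s

eta = 0.1                                  # learning rate: size of each step
n_iterations = 1000
theta = rng.normal(size=(2, 1))            # random initial parameters

for _ in range(n_iterations):
    # Gradient of the MSE cost with respect to theta (batch gradient descent).
    gradients = 2 / len(X_b) * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients               # step in the direction opposite to the gradient

print(theta.ravel())                       # should end up close to [4, 3]
```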
The algorithm terminates when further iterations no longer significantly decrease the loss or when it reaches a predefined number of iterations. There are variations of gradient descent:
Batch Gradient Descent: calculates the gradient over the entire dataset.
Stochastic Gradient Descent (SGD): computes the gradient using a single random example at each iteration, which is faster but noisier; the randomness helps escape local optima.
Mini-batch Gradient Descent: computes the gradient using a small subset of the dataset, balancing the efficiency of SGD and the stability of batch gradient descent.
Gradient descent plays a vital role in training machine learning models by iteratively adjusting parameters to minimize the error or loss function, leading to better predictions and performance. It is commonly used in conjunction with many machine learning algorithms, including regression models, as the optimization technique that minimizes the cost function associated with the model's predictions.

Support Vector Machines (SVM)
SVMs can perform linear or nonlinear classification, regression and even outlier detection, and are well suited to classifying complex small-to-medium-sized datasets. They tend to work effectively and efficiently when there are many features compared to the number of observations, but they do not scale well to larger datasets and their hyperparameters are hard to tune. SVM is a family of model classes that operate in high-dimensional space to find an optimal hyperplane separating the classes with the maximum margin between them. Support vectors are the points closest to the decision boundary, the ones that would change it if they were removed. The model tries to fit the widest possible "street" between the classes, staying as far as possible from the closest training instances: large margin classification. Adding more training instances far away from the boundary does not affect the SVM, which is fully determined (supported) by the instances located at the edge of the street, the support vectors. N.B. SVMs are sensitive to feature scales. Soft margin classification is generally preferred to the hard version because it tolerates outliers: it is a compromise between perfectly separating the two classes and having the widest possible street. Unlike logistic regression, SVM classifiers do not output probabilities. Nonlinear SVM classification adds polynomial features; thanks to the kernel trick we get the same result as if we had added many high-degree polynomial features without actually adding them, so there is no combinatorial explosion in the number of features. SVM regression reverses the objective: it tries to fit as many instances as possible on the street while limiting margin violations, that is, training instances that fall outside the street. (A minimal usage sketch follows below.)

Decision Trees
They have been used for the longest time, even before they were turned into algorithms. The training algorithm searches for the pair (feature, threshold) that produces the purest subsets (weighted by their size) and does so recursively; however, it does not check whether a split will lead to the lowest possible impurity several levels down, so it does not guarantee a globally optimal solution. Prediction complexity stays low since traversing the tree only requires checking the value of one feature per node; training is heavier because the algorithm compares all features on all samples at each node.
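Tying back to the SVM notes above, a minimal scikit-learn sketch with feature scaling (the dataset and the hyperparameter values are illustrative assumptions):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative nonlinear dataset.
X, y = make_moons(n_samples=500, noise=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaling matters: SVMs are sensitive to feature scales.
# C controls the soft-margin trade-off; the RBF kernel handles the nonlinearity.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```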
Node purity is measured by Gini impurity or entropy; a node's impurity is generally lower than its parent's. Decision trees make very few assumptions about the training data, as opposed to linear models, which assume that the data is linear. If left unconstrained, the tree structure adapts itself to the training data, fitting it very closely and most likely overfitting it. Such a model is often called non-parametric: it has parameters, but their number is not determined prior to training. To avoid overfitting we need to regularize, using hyperparameters that reduce the tree's freedom during training: pruning (deleting unnecessary nodes), setting a maximum depth or number of leaves. There are also decision tree regressors, which predict a value in each leaf instead of a class. Decision trees are simple to understand and interpret, easy to use, versatile and powerful, and they don't care whether the training data is scaled or centered: no need to scale features. However, they apply orthogonal decision boundaries, which makes them sensitive to training set rotation, and more generally they are very sensitive to small variations in the training data, so they may not generalize well. Random forests can reduce this instability by averaging predictions over many trees.

Random Forests
A random forest is an ensemble of decision trees, generally trained via bagging (or sometimes pasting), typically with max_samples set to the size of the training set. Instead of using a BaggingClassifier you can use the RandomForestClassifier, which is optimized for decision trees and exposes all of their hyperparameters. Instead of searching for the best feature when splitting a node, it searches for the best feature among a random subset of features, which results in greater tree diversity. It also makes it easy to measure feature importance by looking at how much each feature reduces impurity on average (see the sketch after these notes).

Boosting
Adaptive Boosting: one way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted, so new predictors focus more and more on the hard cases. When training an AdaBoost classifier, the algorithm first trains a base classifier, such as a decision tree, and uses it to make predictions on the training set. It then increases the relative weight of the misclassified training instances, trains a second classifier using the updated weights, again makes predictions on the training set, updates the instance weights, and so on. Once all predictors are trained, the ensemble makes predictions like bagging, except that the predictors have weights depending on their overall accuracy on the weighted training set.
Gradient Boosting: works by sequentially adding predictors to an ensemble, each one correcting its predecessor. Instead of tweaking the instance weights at every iteration like AdaBoost does, it fits the new predictor to the residual errors made by the previous predictor. [The XGBoost Python library is an optimised implementation.]

Stacking
Stacked generalization involves training multiple diverse models and combining their predictions using a meta-model (or blender). Instead of parallel training as in bagging, stacking trains models sequentially: the base models specialize in different aspects of the data, and the meta-model learns how to weigh their contributions effectively. Stacking can involve multiple layers of models, with each layer's output serving as input to the next layer.
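A minimal scikit-learn sketch of the random-forest notes above, including the impurity-based feature importances (the dataset and hyperparameters are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 200 trees; each split considers only a random subset of features, increasing tree diversity.
forest = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=42)
forest.fit(X, y)

# Impurity-based importances: how much each feature reduces impurity on average across trees.
for name, score in zip(load_iris().feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```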
It requires a hold-out (validation) set for training the final blender, to prevent overfitting on the training data. Stacking is a more complex ensemble method than boosting and bagging. [Not directly supported by Scikit-Learn at the time of writing.]

Unsupervised Learning
Dimensionality Reduction
Reducing dimensionality does cause some information loss and makes pipelines more complex and thus harder to maintain, but it speeds up training. The main benefit is that it becomes much easier to rely on data visualization once we have fewer dimensions. [The operation can be reversed: we can reconstruct a dataset relatively similar to the original.] Intuitively, a dimensionality reduction algorithm performs well if it eliminates many dimensions from the dataset without losing too much information.

The Curse of Dimensionality
As the number of features or dimensions in a dataset increases, certain phenomena occur that can lead to difficulties in model training, performance, and generalization.
- Increased Sparsity: in high-dimensional spaces, data points become more sparse. As the number of dimensions increases, the available data is spread out thinly across the feature space, making it hard to estimate reliable statistical quantities and relationships.
- Increased Computational Complexity: computational requirements grow rapidly with the number of dimensions. Algorithms that work efficiently in low-dimensional spaces may become computationally expensive or impractical in high-dimensional settings, affecting both training and inference times.
- Overfitting: in high-dimensional spaces, models have more freedom to fit the training data closely. This can lead to overfitting, where a model performs well on the training data but fails to generalize to new, unseen data. Regularization techniques become crucial in high-dimensional settings.
- Decreased Intuition and Visualization: it becomes increasingly difficult for humans to visualize and understand high-dimensional spaces. We can easily interpret data in two or three dimensions, but the ability to comprehend relationships among variables diminishes as the number of dimensions grows.
- Increased Data Requirements: as dimensionality increases, the amount of data needed to maintain the same level of statistical significance also increases; more data is required to obtain reliable estimates and make accurate predictions.
- Distance Measures and Density Estimation: the concept of distance becomes less meaningful in high-dimensional spaces, and traditional distance metrics may lose their discriminative power (see the small numerical illustration below). Similarly, density estimation becomes challenging as the data spreads out.

Projection
In most real-world problems, training instances are not spread out uniformly across all dimensions: many features are almost constant, whereas others are highly correlated. As a result, all training instances lie within a much lower-dimensional subspace of the high-dimensional space; if we project every instance perpendicularly onto this subspace, we get a new, lower-dimensional dataset.
Manifold Learning focuses on capturing and representing the intrinsic structure or geometry of high-dimensional data in lower-dimensional spaces, often referred to as manifolds.
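A tiny numerical illustration of the distance-concentration point made in the curse-of-dimensionality notes above (random uniform data; the dimensions and sample size are arbitrary choices):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# As dimensionality grows, pairwise distances between random points concentrate:
# the relative gap between the closest and farthest pair shrinks.
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))        # 500 random points in the d-dimensional unit hypercube
    dists = pdist(X)                # all unique pairwise Euclidean distances
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:5d}  mean distance={dists.mean():7.2f}  relative spread={spread:.3f}")
```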
The assumption is that the task will be simpler if expressed in the lower-dimensional space of the manifold, which is not always true: the decision boundary may not always be simpler in fewer dimensions.

PCA (Principal Component Analysis)
It identifies the hyperplane that lies closest to the data and then projects the data onto it while retaining as much of the original variance as possible. PCA achieves this by identifying the principal components of the data: linear combinations of the original features, the axes that account for the largest amount of variance in the training set. [It's essential to note that PCA assumes the principal components capture the most important features of the data, and it works well when the important structure of the data is aligned with the directions of maximum variance. PCA is a linear technique, however, and may not perform optimally when the underlying structure of the data is nonlinear; in such cases, non-linear dimensionality reduction techniques like t-Distributed Stochastic Neighbor Embedding (t-SNE) or Uniform Manifold Approximation and Projection (UMAP) may be more appropriate.] It finds the principal components via a standard matrix factorization technique, Singular Value Decomposition. Before applying PCA, it is common to standardize the data by centering it (subtracting the mean) and scaling it (dividing by the standard deviation), so that each feature contributes equally to the analysis. PCA involves computing the covariance matrix of the standardized data, which represents the relationships between the features and how they vary together. It is useful to compute the explained variance ratio of each principal component, which indicates the proportion of the dataset's variance that lies along that component; a common choice is to keep enough dimensions to account for 95% of the variance (see the sketch below). After dimensionality reduction the training set takes up much less space.
- Dimensionality Reduction: the primary use of PCA is to reduce the number of features in a dataset while retaining most of the information. This helps with visualization, computational efficiency, and avoiding the curse of dimensionality.
- Data Compression: PCA can represent the data in a lower-dimensional space, reducing storage requirements.
- Noise Reduction: by focusing on the principal components with the highest variance, PCA can help filter out noise in the data.
- Visualization: PCA is often used to visualize high-dimensional data in two or three dimensions, making it easier to interpret and understand.

Kernel PCA (unsupervised)
The basic idea behind Kernel PCA is to use a kernel function to implicitly map the original data into a higher-dimensional space where linear relationships may become more apparent. The kernel trick avoids the explicit computation of the high-dimensional feature space and relies instead on pairwise similarities (kernels) between data points. Commonly used kernels include the radial basis function (RBF, or Gaussian) kernel, the polynomial kernel, and the sigmoid kernel; the choice depends on the characteristics of the data and the desired transformation. After applying the kernel trick, the eigenvalue decomposition is performed in the feature space induced by the kernel, yielding eigenvalues and eigenvectors analogous to those obtained in traditional PCA.
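A minimal scikit-learn PCA sketch matching the notes above, keeping enough components to explain 95% of the variance (the dataset is an illustrative stand-in):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)           # 64-dimensional inputs
X_std = StandardScaler().fit_transform(X)     # center and scale each feature

# Passing a float asks PCA to keep enough components for that explained-variance ratio.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                        # far fewer than 64 dimensions
print(pca.explained_variance_ratio_[:5])      # variance explained by the first components
```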
The final step involves projecting the original data onto the principal components in the higher-dimensional feature space; this projection allows for non-linear dimensionality reduction. Kernel PCA is particularly useful when the relationships in the data are not well captured by linear techniques, with applications in fields such as computer vision, pattern recognition, and bioinformatics, where the underlying structure of the data can be highly non-linear. However, Kernel PCA can be computationally expensive, especially on large datasets, since it involves computing pairwise kernel values. The choice of kernel and its parameters also impacts performance, and tuning them may be necessary for optimal results.

Clustering: K-Means
Clustering is the task of identifying similar instances and assigning them to clusters, or groups of similar instances. It is an example of using data science not to predict but to segment the existing data (a minimal sketch follows these notes). Use cases:
- Customer segmentation: you can cluster your customers based on their purchases and their activity on your website. This is useful to understand who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment. ...more |
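A minimal K-Means sketch for the customer-segmentation idea above (the synthetic "customers" and the choice of k are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative "customers": [yearly spend, site visits per month].
rng = np.random.default_rng(7)
customers = np.vstack([
    rng.normal([200, 2], [50, 1], size=(100, 2)),     # occasional buyers
    rng.normal([1500, 20], [300, 5], size=(100, 2)),  # frequent, high-spend buyers
])

X = StandardScaler().fit_transform(customers)

# k chosen by hand here; in practice inspect inertia or silhouette scores for several k.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=7)
labels = kmeans.fit_predict(X)

print(np.bincount(labels))          # customers per segment
print(kmeans.cluster_centers_)      # segment centroids (in standardized feature space)
```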
Notes are private!
|
1
|
Jan 07, 2024
|
Feb 10, 2024
|
Jan 07, 2024
|
Paperback
| |||||||||||||||
1804617334
| 9781804617335
| B0BRCW95ZQ
| 2.90
| 30
| unknown
| Feb 28, 2023
|
it was ok
|
I have mixed feelings about this textbook. There is some value for someone who doesn't know a lot about the world of Artificial Intelligence, but then the book is aimed at AI product managers, who should already have some knowledge of the basics. Some parts are so repetitive that they become recursive: explanations of AI and ML appear too many times throughout the text, along with unnecessary summaries here and there. I struggled to find something original, but maybe I'm not the right audience. There are some good bits for an intro to AI, but not many in-depth use cases, just general descriptions of successful products, which is not practical. The usual material on machine learning and deep learning, supervised and unsupervised learning, and a general introduction to the basic algorithms, but nothing you can't find on Wikipedia. To me the most interesting parts are around the different product strategies: dominant, disruptive or differentiated. ...more |
Notes are private!
|
1
|
Mar 02, 2024
|
Mar 08, 2024
|
Nov 09, 2023
|
Kindle Edition
| |||||||||||||||
1492060941
| 9781492060949
| 1492060941
| 3.55
| 42
| May 2020
| Jun 30, 2020
|
really liked it
|
This textbook lives up to its author's expectations. The main takeaway is that value is created by making decisions, not by data or predictions, although these are necessary inputs for AI-and-data-driven decisions. To create value and make better decisions in a systematic and scalable way, we need to improve our analytical skills. The book is very well laid out and has lots of use cases. It touches on well-known but important industry concepts, like the three Vs (volume, velocity, variety) that have projected the world into the big-data era, the role of uncertainty, and the difference between correlation and causation, which matter for someone who first approaches analytics but are a bit redundant for someone already working in this sector.

For me the main takeaway was the clear distinction between the different phases of a business request in the big-data era: descriptive, predictive, prescriptive. It is something that we (analytics) should always keep in mind, because it is easy to get carried away from the main mission.

Descriptive Analytics: we need to start with the business question in mind, decompose it, and move backward until we have actions that relate to the business objective we want to achieve. That is why we always need relevant and measurable KPIs. We also need to identify what is actionable: the problem of choosing levers is one of causality. We want to make decisions that impact our business objectives, so there must be a causal relation from levers to consequences; we need to construct hypotheses and test them. I often find myself wondering what the end goal of a request is, which forces me to go back to stakeholders and lay down strict and precise requirements as well as the desired outcome. Business objectives are usually already defined, but we must learn to ask the right business questions to achieve them. The author encourages us to follow the KISS (Keep It Simple, Stupid) principle, a revisitation of Occam's razor: avoid unnecessary complexity and complications, and choose the simplest solution that achieves the desired result. Every difficult problem can be dissected into simpler and smaller problems; by starting from those we can make educated guesses and rough approximations, estimating probabilities and expected values.

Another important point is how to work with uncertainty. He introduces Fermi problems, where the analyst can appreciate the power of intuition based on very few coordinates, which helps navigate uncertainty and compute expected utilities. The prescriptive stage is all about optimisation, which in general is hard. That is why we always want to start by solving the problem with no uncertainty: solving the simpler problem provides valuable insight and intuition into what the relevant underlying uncertainty is. He also gives some very basic notions about probability and how to cope with uncertainty: probabilities represent the likelihood of different outcomes occurring, and in predictive analytics, estimating probabilities involves using statistical models and data analysis to quantify the chance of various future events (a small worked example follows). ...more |
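A small worked sketch of the expected-value idea mentioned above, with made-up probabilities and payoffs (purely illustrative, not taken from the book):

```python
# Decide between two illustrative actions by comparing expected values.
# The probabilities and payoffs below are invented for the example.
launch_campaign = [(0.6, 120_000), (0.4, -30_000)]   # (probability, payoff in €)
do_nothing = [(1.0, 0)]

def expected_value(outcomes):
    """Sum of each payoff weighted by its probability."""
    return sum(p * payoff for p, payoff in outcomes)

for name, outcomes in [("launch campaign", launch_campaign), ("do nothing", do_nothing)]:
    print(f"{name}: expected value = {expected_value(outcomes):,.0f}")
# 0.6 * 120000 + 0.4 * (-30000) = 60000 -> launching has the higher expected value here.
```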
Notes are private!
|
1
|
Nov 06, 2023
|
Nov 10, 2023
|
Nov 06, 2023
|
Paperback
|