Troubleshooting & Debugging

Introduction to Troubleshooting & Debugging

Troubleshooting is the process of identifying, analyzing and solving problems in any system. Debugging is the specific process of identifying, analyzing and removing bugs in code. Debuggers are tools that follow code line by line, inspect variable changes, and interrupt programs when conditions are met.

Why Master These Skills?

Save Time - Quick problem identification and resolution
Code Quality - Better understanding leads to better code
Professional Growth - Debug skills separate good from great developers
System Reliability - Prevent and resolve production issues

Core Principle: Information → Hypothesis → Test → Fix

Troubleshooting vs Debugging

Troubleshooting: The process of identifying, analyzing and solving problems in any system
Debugging: The process of identifying, analyzing and removing bugs in actual code
Debuggers: Tools that follow code line by line, inspect variable changes, and interrupt programs when conditions are met

The Systematic Troubleshooting Process

Step-by-Step Approach

Getting Information - Create reproduction case (clear description of how/when problem appears)
Finding Root Cause - Essential for long-term remediation
Performing Remediation - Apply short-term and long-term fixes
Test & Verify - Ensure fix works in test environment first, then production

Questions to Ask Users

What were you trying to do?
What steps did you follow?
What was the expected result?
What was the actual result?

Before Starting to Debug

Can you reproduce the error consistently?
What changed since it last worked?
Do you have access to logs and error messages?
Is this affecting one user or many?

💻 Coding Project Steps

Understand the problem statement
Research existing solutions and approaches
Plan the implementation strategy
Write the code

Types of Programming Errors

1. Syntax Errors

Definition: Errors in code structure that prevent parsing/compilation

# Common syntax errors
if x == 5  # Missing colon
    print("Hello")

print("Hello"  # Missing closing parenthesis

# Mixed indentation
if True:
    print("Tab")
        print("Spaces")  # Inconsistent indentation

Prevention Checklist:

Use consistent indentation (spaces or tabs, not both)
Check for missing colons after if, for, while, def
Verify matching brackets: (), [], {}
Ensure proper string quote matching
Avoid using reserved keywords as variable names
Check for = vs == in conditions

2. Runtime Errors (Exceptions)

Definition: Errors that occur during program execution

# Common runtime errors
numbers = [1, 2, 3]
print(numbers[5])        # IndexError
user_data = {"name": "John"}
print(user_data["age"])  # KeyError
result = 10 / 0          # ZeroDivisionError
undefined_var            # NameError
"string".append("x")     # AttributeError

Common Runtime Errors:

Error Type	Cause	Solution
`IndexError`	Accessing an out-of-range list index	Check bounds before access
`KeyError`	Accessing non-existent dictionary key	Use `.get()` or check key exists
`NameError`	Using undefined variable	Define variable before use
`TypeError`	Wrong data type for operation	Validate types before operations
`AttributeError`	Calling non-existent method	Check object has method/attribute
`ModuleNotFoundError`	Import failed	Install module: `pip install module_name`
`FileNotFoundError`	File doesn't exist	Check path, permissions, file existence
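
The "Solution" column above can be sketched in code. The helper names below (`safe_get_item`, `get_age`, `safe_divide`) are hypothetical and exist only to illustrate the pattern; returning `None` on failure is one possible design choice, not the only reasonable one.

```python
def safe_get_item(items, index, default=None):
    """Return items[index] if the index is in bounds, else default."""
    if -len(items) <= index < len(items):
        return items[index]
    return default


def get_age(user_data, default=None):
    """Look up an optional key with dict.get() to avoid KeyError."""
    return user_data.get("age", default)


def safe_divide(numerator, denominator):
    """Return None rather than raising ZeroDivisionError."""
    if denominator == 0:
        return None
    return numerator / denominator


numbers = [1, 2, 3]
print(safe_get_item(numbers, 5))   # None
print(get_age({"name": "John"}))   # None
print(safe_divide(10, 0))          # None
```

Whether to return a default or let the exception propagate depends on the caller: a silent default can hide a real bug, so it fits best where a missing value is an expected case.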

3. Semantic Errors (Logic Errors)

Definition: Code runs without crashing but produces incorrect results

# Example: Calculating average incorrectly
def calculate_average(numbers):
    total = sum(numbers)
    return total / len(numbers) + 1  # Bug: adding 1

# Should be:
def calculate_average(numbers):
    total = sum(numbers)
    return total / len(numbers)
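
Because a logic error raises nothing on its own, one way to catch it is to compare the function against answers worked out by hand. The sketch below assumes a hypothetical helper, `passes_known_answer_check`, and a deliberately broken `buggy_average` that mirrors the bug shown above.

```python
def calculate_average(numbers):
    """Return the arithmetic mean of a non-empty sequence."""
    return sum(numbers) / len(numbers)


def buggy_average(numbers):
    """Same calculation with the stray + 1 from the example."""
    return sum(numbers) / len(numbers) + 1


def passes_known_answer_check(func):
    """Compare func against averages computed by hand."""
    cases = [([2, 4, 6], 4.0), ([10], 10.0), ([1, 2], 1.5)]
    return all(func(values) == expected for values, expected in cases)


print(passes_known_answer_check(calculate_average))  # True
print(passes_known_answer_check(buggy_average))      # False
```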

Python Debugging Techniques

1. Print Statement Debugging

# Track execution flow and variable values
print(f"Variable value: {variable}")
print("Reached this point in code")
print(f"Function input: {input_param}")
print(f"Loop iteration {i}: {current_item}")

# Advanced print debugging
def debug_function(data):
    print(f"DEBUG: Input data = {data}")
    result = process_data(data)
    print(f"DEBUG: Processed result = {result}")
    return result

When to use: Quick debugging, understanding code flow, simple issues

2. Assert Statements

assert condition, "Custom error message"

# Examples:
assert filename != "", "You must specify a filename!"
assert len(data) > 0, "Data list cannot be empty"
assert isinstance(user_id, int), "User ID must be an integer"

# Note: running with python -O strips assert statements,
# so do not rely on them for checks that must always run
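
Since `python -O` removes assert statements, checks on user-supplied input are usually written as explicit exceptions, with asserts kept for internal invariants. The following is a minimal sketch of that split; `validate_filename` and `mean_of` are hypothetical names chosen for illustration.

```python
def validate_filename(filename):
    """Validate user-supplied input with explicit exceptions.

    Unlike assert, these checks still run under `python -O`.
    """
    if not isinstance(filename, str):
        raise TypeError("filename must be a string")
    if filename == "":
        raise ValueError("You must specify a filename!")
    return filename


def mean_of(values):
    """Internal helper: callers are expected to pass non-empty data."""
    assert len(values) > 0, "Data list cannot be empty"
    return sum(values) / len(values)


try:
    validate_filename("")
except ValueError as error:
    rejection_message = str(error)
    print(f"Rejected input: {rejection_message}")
```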

3. Exception Handling with Try-Except

try:
    risky_operation()
except SpecificError as e:
    print(f"Specific error occurred: {e}")
    # Handle gracefully
except Exception as e:
    print(f"Unexpected error: {e}")
    # Log the error
    raise  # Re-raise if needed

# Multiple exceptions
try:
    file_operation()
except FileNotFoundError:
    print("File not found")
except PermissionError:
    print("Permission denied")
except Exception as e:
    print(f"Other error: {e}")
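
Python's try statement also accepts `else` and `finally` clauses, which can keep the success path and the cleanup separate from the error handling. The sketch below is one way to use them; `count_lines` is a hypothetical helper, and a temporary file keeps the example self-contained.

```python
import os
import tempfile


def count_lines(path):
    """Return the number of lines in a file, or None if it cannot be opened."""
    handle = None
    try:
        handle = open(path, encoding="utf-8")
    except FileNotFoundError:
        print(f"File not found: {path}")
        return None
    except PermissionError:
        print(f"Permission denied: {path}")
        return None
    else:
        # Runs only when open() succeeded.
        return sum(1 for _ in handle)
    finally:
        # Runs on every path, including the early returns above.
        if handle is not None:
            handle.close()


# Write a two-line temporary file to exercise both paths.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\n")

line_count = count_lines(tmp.name)                 # 2
missing_count = count_lines(tmp.name + ".missing")  # None, after a message
os.remove(tmp.name)
print(line_count, missing_count)
```

In everyday code a `with open(...)` block is the shorter way to guarantee the file is closed; the explicit `finally` here is only meant to show when each clause runs.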

4. Python Logging Module

import logging

# Configure logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    filename='debug.log'
)

# Use different log levels
logging.debug("Detailed debugging information")
logging.info("General information")
logging.warning("Warning message")
logging.error("Error occurred")
logging.critical("Critical error")

# In functions
def my_function(data):
    logging.debug(f"Processing data: {data}")
    result = process_data(data)
    logging.info(f"Function completed, result: {result}")
    return result

Log Levels: DEBUG → INFO → WARNING → ERROR → CRITICAL
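
The ordering matters because a logger drops every message below its configured level. A small sketch of that filtering, using an in-memory stream so the output can be inspected; the logger name `level_demo` is an arbitrary choice for this example.

```python
import io
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("level_demo")
logger.setLevel(logging.WARNING)   # threshold: WARNING and above
logger.propagate = False           # keep the demo out of the root logger

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
logger.addHandler(handler)

logger.debug("Detailed debugging information")  # dropped
logger.info("General information")              # dropped
logger.warning("Warning message")               # kept
logger.error("Error occurred")                  # kept

captured = stream.getvalue()
print(captured)
```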

5. PDB (Python Debugger)

import pdb

def problematic_function():
    x = 10
    pdb.set_trace()  # Execution pauses here
    y = x * 2
    return y

# Alternative: Use breakpoint() in Python 3.7+
def another_function():
    x = 10
    breakpoint()  # Same as pdb.set_trace()
    y = x * 2
    return y

PDB Commands:

Command	Action
`n` (next)	Execute next line
`s` (step)	Step into function calls
`c` (continue)	Continue execution
`l` (list)	Show current code
`p variable_name`	Print variable value
`pp variable_name`	Pretty-print variable
`w` (where)	Show stack trace
`u` (up)	Move up stack frame
`d` (down)	Move down stack frame
`q` (quit)	Quit debugger

6. Advanced Debugging Techniques

Scale Down Input

# Instead of processing 10,000 items, test with 10
test_data = large_dataset[:10]
result = process_data(test_data)

Check Summaries

# Instead of printing entire dataset
print(f"Data summary: {len(data)} items, first 5: {data[:5]}")
print(f"Data types: {[type(item) for item in data[:3]]}")

Write Self-Checks

def process_user_data(users):
    # Sanity checks
    assert isinstance(users, list), "Users must be a list"
    assert all(isinstance(u, dict) for u in users), "Each user must be a dict"

    # Process data
    processed = []
    for user in users:
        # Consistency checks
        assert 'id' in user, f"User missing ID: {user}"
        processed.append(process_single_user(user))

    return processed

System-Level Debugging

Memory Debugging

Common Memory Issues

Memory Leaks - Chunks of memory no longer needed but not released
Invalid Memory Access - Process tries to access unassigned memory
Buffer Overflows - Writing past allocated memory boundaries

Tools for Memory Debugging

# Valgrind (Linux) - detect invalid operations
valgrind --tool=memcheck python script.py

# Python memory profiling
pip install memory-profiler
python -m memory_profiler script.py

Python Memory Monitoring

import psutil
import os

# Current process memory usage
process = psutil.Process(os.getpid())
memory_info = process.memory_info()
print(f"Memory usage: {memory_info.rss / 1024 / 1024:.2f} MB")

# System memory
memory = psutil.virtual_memory()
print(f"Total: {memory.total / 1024 / 1024:.2f} MB")
print(f"Available: {memory.available / 1024 / 1024:.2f} MB")

System Call Tracing

strace (Linux; use dtruss on macOS)

# Trace system calls
strace python script.py

# Save output to file
strace -o debug.trace python script.py

# Trace specific system calls (files are usually opened via openat on current Linux systems)
strace -e trace=openat,read,write python script.py

Process Monitor (Windows)

Use Process Monitor to trace file/registry operations
Filter by process name to focus on your application

Performance Analysis & Optimization

Performance Issues Types

CPU-Bound vs I/O-Bound

Type	Characteristics	Solutions
CPU-Bound	Heavy calculations, uses single core	Multiprocessing, parallel processing
I/O-Bound	Waiting for files/network/database	Threading, asyncio

Performance Optimization Principles

Start with clear code that works correctly first
Profile before optimizing to identify real bottlenecks
Only optimize when necessary - don't over-optimize

Time Measurement

import time
from functools import wraps

# Simple timing (perf_counter is a monotonic, high-resolution clock meant for intervals)
start_time = time.perf_counter()
# Code to measure
end_time = time.perf_counter()
print(f"Execution time: {end_time - start_time:.2f} seconds")

# Timing decorator
def time_it(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.2f} seconds")
        return result
    return wrapper

@time_it
def slow_function():
    time.sleep(1)
    return "Done"
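
A single measurement of a very short snippet tends to be noisy. The standard library's `timeit` module runs the code many times and is one common way to get a steadier number. In this sketch `build_squares` is a hypothetical function under test, and the repeat counts are arbitrary.

```python
import timeit


def build_squares(n=1000):
    """Small function to benchmark."""
    return [i * i for i in range(n)]


# Run the function 200 times per measurement, take 5 measurements,
# and keep the fastest one to reduce noise from other processes.
timings = timeit.repeat(build_squares, number=200, repeat=5)
best = min(timings)
print(f"Best of 5: {best / 200 * 1e6:.1f} microseconds per call")
```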

Built-in Profiling

import cProfile
import pstats

# Profile a function
cProfile.run('my_function()', 'profile_output')

# Analyze results
stats = pstats.Stats('profile_output')
stats.sort_stats('cumulative').print_stats(10)

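# Profile line by line with line_profiler
# pip install line_profiler
# Run with: kernprof -l -v script.py
# (kernprof supplies the @profile decorator; plain python raises NameError)
@profile
def function_to_profile():
    # Function code here
    pass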

Data Structure Guidelines

Lists: When accessing by position or iterating through all elements
Dictionaries: When looking up elements using keys
Sets: When checking membership or eliminating duplicates
Loops: Keep iterations to a minimum and break out as soon as the target is found
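
A minimal sketch of these guidelines, assuming hypothetical helpers `find_first_negative` and `unique_in_order`: a dictionary for keyed lookup, a set for membership checks, and an early exit from a loop.

```python
def find_first_negative(values):
    """Return the first negative value, stopping as soon as one is found."""
    for value in values:
        if value < 0:
            return value  # early exit, same effect as break
    return None


def unique_in_order(items):
    """Remove duplicates while keeping first-seen order.

    The set gives average O(1) membership checks; the list keeps order.
    """
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


ages_by_name = {"alice": 31, "bob": 27}            # dict: lookup by key
print(ages_by_name.get("bob"))                     # 27
print(find_first_negative([3, 8, -2, 5, -9]))      # -2
print(unique_in_order(["a", "b", "a", "c", "b"]))  # ['a', 'b', 'c']
```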

Common Performance Problems

Inefficient Algorithms - Use Big O analysis
Expensive Operations in Loops - Move calculations outside loops
Memory Leaks - Monitor memory usage over time
Poor Resource Management - Close files, connections properly
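
As an illustration of moving an expensive calculation out of a loop, the sketch below normalizes a vector two ways. Both function names are hypothetical; the only difference is where the loop-invariant norm is computed.

```python
import math


def normalize_slow(values):
    """Recomputes the same norm for every element: O(n^2) work."""
    return [v / math.sqrt(sum(x * x for x in values)) for v in values]


def normalize_fast(values):
    """Computes the loop-invariant norm once: O(n) work."""
    norm = math.sqrt(sum(x * x for x in values))
    return [v / norm for v in values]


data = [3.0, 4.0]
print(normalize_fast(data))  # [0.6, 0.8]
```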

Concurrency & Parallelism

When to Use What

Scenario	Recommended Approach
I/O operations (files, network)	Threading or asyncio
CPU-intensive tasks	Multiprocessing
Many small tasks	asyncio
Mixed workload	Combination

Threading Example

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_url(url):
    # A timeout keeps one slow server from blocking a worker indefinitely
    response = requests.get(url, timeout=10)
    return response.status_code

urls = ['http://example1.com', 'http://example2.com']

# I/O-bound tasks - use threading
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch_url, urls))

Multiprocessing Example

from concurrent.futures import ProcessPoolExecutor
import math

def cpu_intensive_task(n):
    return sum(math.sqrt(i) for i in range(n))

tasks = [1000000, 2000000, 3000000]

# CPU-bound tasks - use multiprocessing
with ProcessPoolExecutor() as executor:
    results = list(executor.map(cpu_intensive_task, tasks))

Asyncio Example

import asyncio
import aiohttp

async def fetch_async(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ['http://example1.com', 'http://example2.com']

    async with aiohttp.ClientSession() as session:
        tasks = [fetch_async(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

    return results

# Run async function
results = asyncio.run(main())

System Monitoring & Tools

Linux/Mac Tools

# Process monitoring
top                    # Basic process info
htop                   # Enhanced process viewer
ps aux                 # All processes

# Memory usage (Linux only)
free -h               # Memory information
cat /proc/meminfo     # Detailed memory stats

# Disk usage
df -h                 # Filesystem disk space
du -sh *              # Directory sizes
iostat                # I/O statistics

# Network
netstat -tuln         # Network connections
ss -tuln              # Modern netstat alternative
ping google.com       # Test connectivity
traceroute google.com # Network path

# System calls
strace command        # Trace system calls (Linux; use dtruss on macOS)
lsof -p PID          # Open files by process

Windows Tools

# Task Manager - Basic monitoring
taskmgr

# Resource Monitor - Detailed resource usage
resmon

# Process Monitor - File/registry operations
procmon

# Performance Monitor - Custom counters
perfmon

# Command line tools
tasklist              # List processes
netstat -an           # Network connections

Python System Monitoring

import psutil
import platform

# System info
print(f"System: {platform.system()} {platform.release()}")
print(f"CPU cores: {psutil.cpu_count()}")

# CPU usage
print(f"CPU usage: {psutil.cpu_percent(interval=1)}%")

# Memory usage
memory = psutil.virtual_memory()
print(f"Memory: {memory.percent}% used")
print(f"Available: {memory.available / 1024 / 1024:.0f} MB")

# Disk usage
disk = psutil.disk_usage('/')
print(f"Disk usage: {disk.percent}%")

# Network I/O
net_io = psutil.net_io_counters()
print(f"Bytes sent: {net_io.bytes_sent}")
print(f"Bytes received: {net_io.bytes_recv}")

# Process information
# cpu_percent can be None when access to a process is denied
for proc in psutil.process_iter(['pid', 'name', 'cpu_percent']):
    cpu = proc.info['cpu_percent']
    if cpu is not None and cpu > 5.0:
        print(f"High CPU: {proc.info}")

Network Troubleshooting

Basic Network Debugging

# Test connectivity
ping google.com
ping -c 4 google.com        # Send 4 packets

# Test specific ports
telnet google.com 80
nc -zv google.com 80        # Check whether the port is open without sending data

# DNS resolution
nslookup google.com
dig google.com

# Network path
traceroute google.com
mtr google.com              # Continuous traceroute

Python Network Debugging

import socket
import requests

def test_connection(host, port, timeout=5):
    """Test if a host:port is reachable."""
    try:
        # The with block closes the socket once the check is done
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as e:
        print(f"Connection to {host}:{port} failed: {e}")
        return False

def test_http(url, timeout=5):
    """Test HTTP connectivity."""
    try:
        response = requests.get(url, timeout=timeout)
        print(f"HTTP {response.status_code}: {url}")
        return response
    except requests.exceptions.RequestException as e:
        print(f"HTTP request failed: {e}")
        return None

# Usage
test_connection('google.com', 80)
test_http('https://google.com')

Common Network Issues

DNS Problems - Can't resolve hostnames
- Check /etc/resolv.conf (Linux) or DNS settings
- Try different DNS servers (8.8.8.8, 1.1.1.1)
Firewall Blocking - Ports not accessible
- Check local firewall rules
- Verify remote firewall allows connections
Network Congestion - Slow response times
- Check bandwidth usage
- Test during different times
Server Down - Service unavailable
- Check if service is running
- Verify server health
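
To tell a DNS problem apart from the other causes above, it can help to check name resolution on its own before testing ports. A minimal sketch using the standard library's `socket.getaddrinfo`; the helper name `resolve_host` is hypothetical.

```python
import socket


def resolve_host(hostname):
    """Return the sorted unique IP addresses for hostname, or [] on failure."""
    try:
        results = socket.getaddrinfo(hostname, None)
    except socket.gaierror as error:
        print(f"DNS lookup failed for {hostname}: {error}")
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in results})


print(resolve_host("localhost"))
```

If this returns an empty list while connecting to a known IP address still works, the problem is more likely in DNS than in the firewall or the server.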

Getting Help Effectively

Before Asking for Help

Try These First:

Read error messages carefully
Search the error message online
Check official documentation
Review similar working code
Use rubber duck debugging
Take a break and return with fresh perspective

How to Ask Good Questions

❌ Bad Help Requests

Subject: "HELP!!!"
Message: "My code doesn't work. Can someone fix it?"

✅ Good Help Requests

Subject: "Python: Getting KeyError when accessing dictionary in user_manager.py"

I'm trying to retrieve user data from a dictionary but getting a KeyError.

**Expected behavior:** Get user info by ID from user_data dictionary
**Actual behavior:** KeyError: 'user_123' exception

**Minimal code example:**
```python
user_data = {'user_456': {'name': 'John'}}
user_id = 'user_123'
info = user_data[user_id]  # Error occurs here
```

Full error message:
File "user_manager.py", line 15, in get_user_info
KeyError: 'user_123'

What I've tried:

Verified user_id variable contains expected value
Checked that dictionary has data
Tried using .get() method but need to handle missing users properly

Environment: Python 3.9, Windows 10, PyCharm IDE
Additional context: This happens when processing user requests from API

Help Request Checklist

Clear, specific subject line
State your goal clearly - what should happen?
Explain what actually happens - error messages, wrong output
Include complete error message - full traceback
Share minimal, complete code that reproduces issue
Format code properly - use code blocks
List what you've already tried
Include environment details - Python version, OS, IDE
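
The environment details in the last item can be collected with the standard library rather than typed from memory. A small sketch, assuming a hypothetical helper named `environment_summary`; the set of fields is a suggestion, not a required format.

```python
import platform
import sys


def environment_summary():
    """Collect the version details a help request usually needs."""
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": f"{platform.system()} {platform.release()}",
        "executable": sys.executable,
    }


for key, value in environment_summary().items():
    print(f"{key}: {value}")
```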

Crash Analysis & Recovery

Finding Root Cause of Crashes

Check all available logs - application, system, web server
Figure out what changed recently - code, config, environment
Trace system/library calls the program makes
Create smallest reproduction case possible

Crash Analysis Tools

Python Traceback - Shows execution path when error occurred
Core Dumps - Store crash state for analysis (Linux/Unix)
Crash Logs - System-generated crash reports
Log Analysis - Look for patterns and correlations

Python Crash Debugging

import traceback
import sys

def handle_crash():
    """Capture and log crash information."""
    exc_type, exc_value, exc_traceback = sys.exc_info()

    # Print detailed traceback
    traceback.print_exception(exc_type, exc_value, exc_traceback)

    # Save to file
    with open('crash.log', 'w') as f:
        traceback.print_exception(exc_type, exc_value, exc_traceback, file=f)

# Use in your code
try:
    risky_operation()
except Exception:
    handle_crash()
    raise
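
An alternative sketch uses `logging.exception`, which records a message together with the active traceback through whatever handlers are configured. The logger name `crash_demo`, the in-memory stream, and the `parse_port` function are assumptions made for this example only.

```python
import io
import logging

# Hypothetical logger name and in-memory stream, used only for this sketch.
log_stream = io.StringIO()
crash_logger = logging.getLogger("crash_demo")
crash_logger.setLevel(logging.ERROR)
crash_logger.propagate = False
crash_logger.addHandler(logging.StreamHandler(log_stream))


def parse_port(text):
    """Hypothetical operation that fails on bad input."""
    return int(text)


try:
    parse_port("eighty")
except ValueError:
    # logging.exception logs at ERROR level and appends the traceback.
    crash_logger.exception("Could not parse port")

crash_report = log_stream.getvalue()
print(crash_report)
```

In a real application the handler would typically write to a file or a log aggregator instead of a `StringIO` buffer.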

Incident Documentation & Postmortems

What to Document During Incidents

Timeline - When did it start, when was it detected, when was it fixed
Root cause of the issue
How you diagnosed the problem
What you did to fix it (short-term and long-term)
Prevention measures for future

Postmortem Template

Executive Summary

Brief description of the incident
Impact on users/systems
Duration of the incident

Timeline of Events

When incident started
When it was detected
Key actions taken
When it was resolved

Root Cause Analysis

What caused the issue
Why existing safeguards didn't prevent it
Contributing factors

Impact Assessment

Who was affected
What services were impacted
Business impact (if applicable)

Response Analysis

What went well in the response
What could have been done better
Response time metrics

Resolution and Recovery

How the issue was resolved
Steps taken to restore service
Verification of fix

Lessons Learned

What we learned from this incident
Process improvements needed
Technical improvements needed

Action Items

Specific tasks to prevent recurrence
Owners and deadlines
Monitoring improvements

Code Review for Bug Prevention

Code Review Checklist

Functionality:

Code does what it's supposed to do
Edge cases are handled appropriately
Error handling is comprehensive
Input validation where needed

Code Quality:

Code is readable and well-documented
Functions have single responsibility
Variable names are descriptive
No obvious code duplication

Performance:

No obvious performance bottlenecks
Efficient algorithms used
Memory usage is reasonable
Database queries are optimized

Security:

No hardcoded credentials
Input sanitization implemented
Authentication/authorization proper
No SQL injection vulnerabilities

Testing:

Adequate test coverage
Tests cover edge cases
Tests are meaningful

Giving Good Review Feedback

✅ Good Review Comments

"Consider extracting this validation logic into a separate function
for reusability. Something like `validate_user_input(user_data)`
would improve readability and testability."

"This could potentially cause a memory leak if the connection isn't
closed. Consider using a context manager or try/finally block."

"Great error handling! The specific exception types make debugging easier."

❌ Poor Review Comments

"This is wrong."
"Bad code."
"Why did you do it this way?"
"This won't work."

Emergency Response Guide

When Systems Are Down

Don't Panic - Stay calm and methodical

Immediate Response

Assess the scope of the problem
Check if users are affected
Look at recent changes (deployments, config changes)
Check system resources (CPU, memory, disk)
Review recent logs for errors

Communication

Notify relevant stakeholders
Update status pages if applicable
Set up incident response channel

Investigation

Gather logs and error messages
Create timeline of when issues started
Check dependencies (databases, external APIs)
Monitor system metrics

Resolution

Implement temporary fix if possible
Test fix in staging if time permits
Apply fix to production
Monitor for stability

Follow-up

Verify complete resolution
Write incident report
Schedule postmortem meeting
Implement prevention measures

Best Practices Summary

Prevention

Write Tests - Catch bugs before they reach production
Code Reviews - Get second opinions on code changes
Logging - Implement comprehensive logging throughout your application
Monitoring - Set up alerts for system health metrics
Documentation - Keep runbooks and troubleshooting guides updated

Debugging Strategy

Start Simple - Use basic debugging before advanced tools
Isolate Problems - Narrow down to smallest reproducible case
Test Safely - Always test fixes in non-production environment first
Document Findings - Keep track of what you learn for future reference
Think Systematically - Follow structured approach, don't jump around

Code Quality

Fail Fast - Use assertions and validation to catch problems early
Handle Errors Gracefully - Don't let exceptions crash your program
Monitor Performance - Track key metrics to spot problems early
Keep It Simple - Complex code is harder to debug

Essential Tools Quick Reference

Python Debugging

# Debug prints
print(f"Debug: variable = {variable}")

# Assertions
assert condition, "Error message"

# Exception handling
try:
    risky_code()
except SpecificException as e:
    logging.error(f"Error: {e}")

# Logging
import logging
logging.basicConfig(level=logging.DEBUG)
logging.debug("Debug message")

# Interactive debugging
import pdb; pdb.set_trace()
breakpoint()  # Python 3.7+

# Timing
import time
start = time.time()
# code here
print(f"Time: {time.time() - start:.2f}s")

# Profiling
import cProfile
cProfile.run('function_call()')

# System monitoring
import psutil
print(f"CPU: {psutil.cpu_percent()}%")
print(f"Memory: {psutil.virtual_memory().percent}%")

System Commands

# Process monitoring
top                     # Process viewer
htop                    # Enhanced top
ps aux | grep python    # Find Python processes

# System tracing
strace python script.py # Trace system calls (Linux)
dtruss python script.py # Trace system calls (Mac)

# Network debugging
ping google.com         # Test connectivity
telnet host port       # Test specific port
netstat -tuln          # Show network connections
ss -tuln               # Modern netstat

# Resource usage
free -h                # Memory usage
df -h                  # Disk usage
iostat                 # I/O statistics

# Logs
tail -f /var/log/app.log    # Follow log file
journalctl -f              # Follow systemd logs

Quick Decision Matrix

When to Use Each Debugging Method

Problem Type	First Try	If That Fails	Advanced Option
Syntax Error	Read error message	Check indentation/brackets	Use IDE syntax checker
Logic Error	Add print statements	Use assertions	Interactive debugger
Performance	Time measurement	Profile the code	System monitoring
Memory Issues	Check for leaks	Memory profiler	Valgrind/system tools
Network Issues	ping/telnet	Check logs	Packet capture
Crashes	Check logs	Reproduce minimally	Core dump analysis
Intermittent	Add logging	Monitor over time	Statistical analysis

Remember: The Golden Rules

Information First - Always gather complete information before acting
Reproduce Consistently - Find reliable way to trigger the problem
Start Simple - Use basic tools before advanced ones
Test Safely - Never debug directly in production
Document Everything - Future you (and your team) will thank you
Prevention > Cure - Good practices prevent most problems
Stay Calm - Panicked debugging leads to more problems

The best debugger is a clear mind and systematic approach! 🧠✨
