Pandas Interview Questions

This document provides a curated list of Pandas interview questions commonly asked in technical interviews for Data Science, Data Analysis, Machine Learning, and Python Developer roles. It covers fundamental concepts to advanced data manipulation techniques, including rigorous "brutally difficult" questions for senior roles.

The list is updated frequently; it currently reflects the most exhaustive set of question types being asked.


Premium Interview Questions

Explain loc vs iloc - Key Difference - Google, Amazon, Meta Interview Question

Difficulty: 🟒 Easy | Tags: Indexing, Selection, Core | Asked by: Google, Amazon, Meta, Apple, Netflix

View Answer

loc: Label-based indexing (uses row/column names). iloc: Integer position-based (uses 0, 1, 2...). Key difference: loc is inclusive, iloc is exclusive on end of slice.

Real-World: Netflix uses loc for time-series data (datetime index), iloc for positional slicing in ETL pipelines.

| Method | Indexing Type | Slicing Behavior | Use Case |
|--------|---------------|------------------|----------|
| loc | Label-based | Inclusive end | Named rows/cols, datetime index, boolean filtering |
| iloc | Integer position | Exclusive end | First N rows, every Nth row, CSV-like access |
| at | Label (scalar) | Single value | Fastest for single cell (10× faster) |
| iat | Position (scalar) | Single value | Fastest for position-based access |

Production Code

import pandas as pd
import numpy as np

# Sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [50000, 60000, 70000, 55000, 65000],
    'city': ['NY', 'LA', 'SF', 'NY', 'LA']
}, index=['emp_001', 'emp_002', 'emp_003', 'emp_004', 'emp_005'])

# ===== loc (Label-based) =====
# Single cell
print(df.loc['emp_001', 'name'])  # 'Alice'

# Multiple rows/cols
print(df.loc['emp_001':'emp_003', 'name':'salary'])
# IMPORTANT: 'emp_003' IS INCLUDED (inclusive slicing)

# Boolean filtering (VERY COMMON in production)
high_earners = df.loc[df['salary'] > 60000]
print(high_earners)

# Multiple conditions
ny_high_earners = df.loc[(df['city'] == 'NY') & (df['salary'] > 50000)]

# ===== iloc (Integer position) =====
# Single cell
print(df.iloc[0, 0])  # 'Alice' (first row, first col)

# Multiple rows/cols
print(df.iloc[0:3, 0:2])  # First 3 rows, first 2 cols
# IMPORTANT: Row 3 is EXCLUDED (0, 1, 2 only)

# Every Nth row (common in sampling)
every_2nd_row = df.iloc[::2]  # Rows 0, 2, 4

# Last N rows
last_3_rows = df.iloc[-3:]

# ===== at/iat (Scalar access - FASTEST) =====
# 10× faster for single cell than loc/iloc
name = df.at['emp_001', 'name']  # Use at for labels
age = df.iat[0, 1]                # Use iat for positions

# ===== Common Mistakes =====
# ❌ WRONG: Using loc with integer on non-integer index
try:
    df.loc[0]  # KeyError! Index is ['emp_001', 'emp_002', ...]
except KeyError:
    print("Error: loc expects labels, not positions")

# βœ… CORRECT
df.iloc[0]  # Always works

# ❌ WRONG: Assuming iloc is inclusive like loc
print(df.iloc[0:2])  # Only rows 0, 1 (NOT 0, 1, 2)

# βœ… CORRECT: Use loc for inclusive slicing
print(df.loc['emp_001':'emp_002'])  # Includes both endpoints

When to Use Which

| Scenario | Use | Example |
|----------|-----|---------|
| Named index (datetime, strings) | loc | df.loc['2024-01-01':'2024-12-31'] |
| First N rows | iloc | df.iloc[:100] |
| Boolean filtering | loc | df.loc[df['age'] > 30] |
| CSV-like access (row 5, col 3) | iloc | df.iloc[5, 3] |
| Single cell (performance critical) | at/iat | df.at['row', 'col'] |

Real-World Performance

| Company | Use Case | Method | Why |
|---------|----------|--------|-----|
| Netflix | Time-series (datetime index) | loc | df.loc['2024-01-01':'2024-01-31'] |
| Airbnb | First 1000 listings (pagination) | iloc | df.iloc[:1000] |
| Uber | High-frequency trading (1M+ ops/sec) | at/iat | 10× faster for scalar access |

Interviewer's Insight

  • Knows loc is inclusive, iloc is exclusive (classic gotcha!)
  • Uses at/iat for scalar access (10× faster than loc/iloc)
  • Mentions boolean filtering with loc (most common production use)
  • Avoids loc with integers on non-integer index (KeyError)
  • Real-world: Netflix uses loc for datetime slicing in recommendation ETL

How Do You Handle Missing Values in Pandas? - Google, Amazon, Netflix Interview Question

Difficulty: 🟑 Medium | Tags: Missing Data, Data Cleaning, fillna | Asked by: Google, Amazon, Meta, Netflix, Apple

View Answer

Detecting Missing Values:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': ['x', None, 'y', 'z']
})

# Detection
df.isnull()              # Boolean mask
df.isnull().sum()        # Count per column
df.isnull().sum().sum()  # Total count
df.isna().any()          # Any missing per column

Handling Strategies:

# 1. Drop rows with any missing
df.dropna()

# 2. Drop rows where specific columns are missing
df.dropna(subset=['A'])

# 3. Drop only if all values missing
df.dropna(how='all')

# 4. Fill with constant
df.fillna(0)
df.fillna({'A': 0, 'B': 'unknown'})

# 5. Fill with statistics
df['A'].fillna(df['A'].mean())
df['A'].fillna(df['A'].median())
df['A'].fillna(df['A'].mode()[0])

# 6. Forward/backward fill (time series)
df.ffill()  # Forward fill
df.bfill()  # Backward fill

# 7. Interpolation
df['A'].interpolate(method='linear')

Strategy Selection:

| Scenario | Strategy | When to Use |
|----------|----------|-------------|
| < 5% missing | Drop rows (dropna()) | Minimal data loss |
| Numeric, random | Mean/median (fillna) | MCAR (missing completely at random) |
| Time series | Forward/backward fill (ffill/bfill) | Sequential data |
| Categorical | Mode or 'Unknown' | Preserves categories |
| > 50% missing | Drop column | Too much missing data |
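
A minimal sketch that applies the thresholds above per column; the 5%/50% cutoffs are the illustrative values from the table, not fixed rules.

import pandas as pd

def fill_by_strategy(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a simple per-column strategy based on missing fraction and dtype."""
    df = df.copy()
    for col in list(df.columns):
        frac_missing = df[col].isna().mean()
        if frac_missing == 0:
            continue
        if frac_missing > 0.5:                             # too sparse to be useful
            df = df.drop(columns=col)
        elif pd.api.types.is_numeric_dtype(df[col]):       # numeric: robust central value
            df[col] = df[col].fillna(df[col].median())
        else:                                              # categorical / object
            df[col] = df[col].fillna('Unknown')
    return df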

Real-World:

  • Airbnb: Uses median for price (robust to outliers), mode for amenities
  • Netflix: Forward fill for user viewing history (time-series)
  • Uber: Drops rows with missing trip_id (< 1% of data)

Interviewer's Insight

  • Chooses strategy by data type (mean for numeric, mode for categorical)
  • Uses ffill for time-series (preserves temporal patterns)
  • Knows dropna vs fillna tradeoffs (< 5% missing → drop, > 5% → fill)
  • Checks missingness pattern (MCAR vs MAR vs MNAR)
  • Real-world: Airbnb fills missing amenities with 'Not Available' (preserves info)

Explain GroupBy in Pandas - Split-Apply-Combine - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: GroupBy, Aggregation, Split-Apply-Combine | Asked by: Google, Amazon, Meta, Netflix, Apple

View Answer

GroupBy Concept:

  1. Split: Divide data into groups
  2. Apply: Apply function to each group
  3. Combine: Combine results

import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'A'],
    'product': ['x', 'y', 'x', 'y', 'x'],
    'sales': [100, 150, 200, 250, 120],
    'quantity': [10, 15, 20, 25, 12]
})

# Basic groupby
df.groupby('category')['sales'].sum()

# Multiple columns
df.groupby(['category', 'product'])['sales'].mean()

# Multiple aggregations
df.groupby('category').agg({
    'sales': ['sum', 'mean', 'count'],
    'quantity': 'sum'
})

# Named aggregations (cleaner output)
df.groupby('category').agg(
    total_sales=('sales', 'sum'),
    avg_sales=('sales', 'mean'),
    order_count=('sales', 'count')
)

# Custom functions
df.groupby('category')['sales'].apply(lambda x: x.max() - x.min())

Transform vs Apply:

# Transform: returns same shape (broadcast back)
df['sales_normalized'] = df.groupby('category')['sales'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Apply: returns any shape
df.groupby('category').apply(lambda g: g.nlargest(2, 'sales'))

Interviewer's Insight

  • Explains split-apply-combine pattern clearly
  • Uses named aggregations (agg(total_sales=('sales', 'sum')) - cleaner output)
  • Knows transform vs apply (transform keeps shape, apply reduces)
  • Real-world: Netflix uses groupby for user engagement metrics (daily/weekly aggregations)

Difference Between merge(), join(), and concat() - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Merging, Joining, Concatenation | Asked by: Google, Amazon, Meta

View Answer

Comparison:

| Method | Use Case | Key Difference |
|--------|----------|----------------|
| merge() | SQL-like joins on columns | Column-based |
| join() | Join on index | Index-based |
| concat() | Stack DataFrames | No key matching |

Examples:

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'val2': [4, 5, 6]})

# MERGE - column-based joining
pd.merge(df1, df2, on='key', how='inner')  # Only A, B
pd.merge(df1, df2, on='key', how='left')   # Keep all from df1
pd.merge(df1, df2, on='key', how='outer')  # Keep all
pd.merge(df1, df2, left_on='key', right_on='key')  # Different column names

# JOIN - index-based (set index first)
df1.set_index('key').join(df2.set_index('key'))

# CONCAT - stacking
pd.concat([df1, df2])              # Vertical (default axis=0)
pd.concat([df1, df2], axis=1)      # Horizontal
pd.concat([df1, df2], ignore_index=True)  # Reset index

Join Types:

Inner: Only matching keys
Left:  All from left + matching from right
Right: All from right + matching from left
Outer: All from both (union)

Indicator for debugging:

result = pd.merge(df1, df2, on='key', how='outer', indicator=True)
# _merge column shows: 'left_only', 'right_only', 'both'

Interviewer's Insight

  • Knows when to use each: merge (columns), join (index), concat (stacking)
  • Explains join types (inner/left/right/outer) with use cases
  • Uses indicator=True for debugging (shows merge source)
  • Real-world: Uber merges trip data + driver data on driver_id (billions/day)

How to Apply Functions to DataFrames? apply(), map(), applymap() - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Data Transformation, Functions | Asked by: Google, Amazon, Microsoft

View Answer

Method Comparison:

| Method | Scope | Use Case |
|--------|-------|----------|
| apply() | Row/Column | Complex transformations |
| map() | Series only | Element-wise mapping |
| applymap() | Element-wise | Simple element ops (deprecated, use map) |

Examples:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': ['x', 'y', 'z']
})

# apply() - works on columns or rows
df[['A', 'B']].apply(np.sum)           # Sum each column
df[['A', 'B']].apply(np.sum, axis=1)   # Sum each row
df[['A', 'B']].apply(lambda x: x.max() - x.min())  # Custom function

# map() - Series only, element-wise
df['C'].map({'x': 'X', 'y': 'Y', 'z': 'Z'})  # Dictionary mapping
df['A'].map(lambda x: x ** 2)                 # Lambda function

# For DataFrames, use apply with axis
df[['A', 'B']].apply(lambda x: x ** 2)  # Each column

# In Pandas 2.1+, use map() instead of applymap()
df[['A', 'B']].map(lambda x: x * 2)

Performance Tip:

# SLOW - apply with lambda
df['A'].apply(lambda x: x ** 2)

# FAST - vectorized operation
df['A'] ** 2

# Use apply only when vectorization isn't possible

Interviewer's Insight

  • Prefers vectorized operations (100× faster than apply)
  • Knows applymap deprecated in Pandas 2.1+ (use map instead)
  • Uses map for Series, apply for DataFrame
  • Real-world: Airbnb vectorizes price calculations (10M+ listings, <1s)

How to Optimize Memory Usage in Pandas? - Google, Amazon, Netflix Interview Question

Difficulty: πŸ”΄ Hard | Tags: Memory Optimization, Performance | Asked by: Google, Amazon, Netflix

View Answer

Check Memory Usage:

import pandas as pd

df = pd.read_csv('large_file.csv')

# Memory per column
df.memory_usage(deep=True)

# Total memory in MB
df.memory_usage(deep=True).sum() / 1024**2

Optimization Techniques:

# 1. Use appropriate dtypes on read
df = pd.read_csv('file.csv', dtype={
    'id': 'int32',           # Instead of int64
    'value': 'float32',      # Instead of float64
    'category': 'category'   # Instead of object
})

# 2. Convert existing columns
df['category'] = df['category'].astype('category')

# 3. Downcast numeric types
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')

# 4. Use sparse dtypes for mostly-empty columns
sparse_col = pd.arrays.SparseArray([0, 0, 0, 1, 0, 0, 0, 0])

Memory Reduction Example:

def reduce_memory(df):
    for col in df.columns:
        col_type = df[col].dtype

        if col_type == 'object':
            if df[col].nunique() / len(df) < 0.5:  # Low cardinality
                df[col] = df[col].astype('category')

        elif col_type == 'int64':
            df[col] = pd.to_numeric(df[col], downcast='integer')

        elif col_type == 'float64':
            df[col] = pd.to_numeric(df[col], downcast='float')

    return df

Savings:

| From | To | Reduction |
|------|----|-----------|
| int64 | int32 | 50% |
| float64 | float32 | 50% |
| object (strings) | category | 90%+ |
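
A short usage sketch for the reduce_memory helper above, measuring the effect with memory_usage(deep=True); the column names are made up for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': np.arange(1_000_000),                                    # int64 by default
    'score': np.random.rand(1_000_000),                            # float64 by default
    'status': np.random.choice(['active', 'inactive'], 1_000_000)  # low-cardinality strings
})

before_mb = df.memory_usage(deep=True).sum() / 1024**2
df = reduce_memory(df)
after_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"{before_mb:.1f} MB -> {after_mb:.1f} MB")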

Interviewer's Insight

  • Uses category dtype for low-cardinality strings (90% memory reduction)
  • Knows int64→int32, float64→float32 (50% reduction each)
  • Checks memory_usage(deep=True) before/after optimization
  • Real-world: Netflix reduced 50GB DataFrame to 5GB using category dtype

How to Handle Large Datasets That Don't Fit in Memory? - Google, Amazon, Netflix Interview Question

Difficulty: πŸ”΄ Hard | Tags: Large Data, Chunking, Performance | Asked by: Google, Amazon, Netflix, Meta

View Answer

Chunked Processing:

import pandas as pd

# Process file in chunks
chunk_size = 100_000
chunks = []

for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    # Process each chunk
    processed = chunk.groupby('category')['value'].sum()
    chunks.append(processed)

# Combine results
result = pd.concat(chunks).groupby(level=0).sum()

Use Efficient File Formats:

# Parquet - columnar, compressed
df.to_parquet('data.parquet', compression='snappy')
df = pd.read_parquet('data.parquet', columns=['col1', 'col2'])  # Read subset

# Feather - fast read/write
df.to_feather('data.feather')
df = pd.read_feather('data.feather')

Dask for Out-of-Core Computing:

import dask.dataframe as dd

# Lazy loading - doesn't load into memory
ddf = dd.read_csv('huge_file.csv')

# Same Pandas API
result = ddf.groupby('category')['value'].sum().compute()

PyArrow Backend (Pandas 2.0+):

# Use PyArrow for better memory efficiency
df = pd.read_csv('file.csv', dtype_backend='pyarrow')

# Or convert existing
df = df.convert_dtypes(dtype_backend='pyarrow')

Comparison:

| Approach | Use Case |
|----------|----------|
| Chunking | Simple aggregations |
| Parquet | Columnar queries |
| Dask | Complex operations |
| Polars | Speed-critical |
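
Polars appears in the table but has no snippet above; a minimal lazy-scan sketch, assuming a recent Polars release (where the method is group_by) and the same hypothetical huge_file.csv.

import polars as pl

# Lazy scan: nothing is read until .collect()
result = (
    pl.scan_csv('huge_file.csv')
    .group_by('category')
    .agg(pl.col('value').sum())
    .collect()
)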

Interviewer's Insight

  • Uses chunksize for simple aggregations (100K-1M rows/chunk)
  • Knows Parquet > CSV (10× smaller, 100× faster column access)
  • Mentions Dask for out-of-core (scales beyond RAM)
  • Real-world: Uber processes 100TB+ trip data daily using Parquet + Dask

Explain Pivot Tables in Pandas - Amazon, Google Interview Question

Difficulty: 🟑 Medium | Tags: Pivot Tables, Reshaping | Asked by: Amazon, Google, Microsoft, Netflix

View Answer

pivot_table() - Flexible Reshaping:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': ['2023-01', '2023-01', '2023-02', '2023-02'],
    'product': ['A', 'B', 'A', 'B'],
    'region': ['East', 'East', 'West', 'West'],
    'sales': [100, 150, 200, 250],
    'quantity': [10, 15, 20, 25]
})

# Basic pivot table
pt = pd.pivot_table(
    df,
    values='sales',
    index='date',
    columns='product',
    aggfunc='sum'
)

# Multiple aggregations
pt = pd.pivot_table(
    df,
    values=['sales', 'quantity'],
    index='date',
    columns='product',
    aggfunc={'sales': 'sum', 'quantity': 'mean'}
)

# With totals
pt = pd.pivot_table(
    df,
    values='sales',
    index='date',
    columns='product',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)

pivot() vs pivot_table():

| | pivot() | pivot_table() |
|---|---------|---------------|
| Duplicates | Error | Handles with aggfunc |
| Aggregation | No | Yes |
| Fill value | No | Yes |

# pivot() - simple reshape (no duplicates allowed)
df.pivot(index='date', columns='product', values='sales')

# pivot_table() - handles duplicates
df.pivot_table(index='date', columns='product', values='sales', aggfunc='mean')

Interviewer's Insight

  • Uses pivot_table() for aggregation (handles duplicates, pivot() errors)
  • Knows margins=True for row/column totals
  • Can reverse with melt() (unpivot)
  • Real-world: Airbnb uses pivot_table for daily revenue reports by region

How to Work with DateTime Data in Pandas? - Google, Amazon, Netflix Interview Question

Difficulty: 🟑 Medium | Tags: DateTime, Time Series | Asked by: Google, Amazon, Netflix, Meta

View Answer

Creating DateTime:

import pandas as pd

# Parse strings to datetime
df['date'] = pd.to_datetime(df['date_string'])
df['date'] = pd.to_datetime(df['date_string'], format='%Y-%m-%d')

# Create date range
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
dates = pd.date_range(start='2023-01-01', periods=12, freq='ME')

DateTime Accessors (.dt):

df['date'] = pd.to_datetime(df['date'])

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek  # 0=Monday
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['date'].dt.dayofweek >= 5

# Formatting
df['month_name'] = df['date'].dt.month_name()
df['formatted'] = df['date'].dt.strftime('%Y-%m')

Time Zone Handling:

# Localize (no timezone -> timezone)
df['date'] = df['date'].dt.tz_localize('UTC')

# Convert between timezones
df['date_est'] = df['date'].dt.tz_convert('US/Eastern')

Resampling:

# Resample time series
df.set_index('date').resample('M')['sales'].sum()  # Monthly
df.set_index('date').resample('Q')['sales'].mean()  # Quarterly
df.set_index('date').resample('W')['sales'].agg(['sum', 'mean'])

Interviewer's Insight

  • Uses .dt accessor for datetime components (year, month, dayofweek)
  • Handles timezones (tz_localize, tz_convert)
  • Knows resample() for aggregation (D, W, M, Q, Y frequencies)
  • Real-world: Netflix uses resample('D') for daily user engagement metrics

What is SettingWithCopyWarning and How to Avoid It? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Common Errors, Best Practices | Asked by: Google, Amazon, Meta, Microsoft

View Answer

The Problem:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# This triggers warning
filtered = df[df['A'] > 1]
filtered['B'] = 100  # Warning: setting on copy

Why It Happens:

Pandas can't always tell if you're working with a view or a copy. Modifying a view might not update the original (or might unexpectedly).

Solutions:

# Solution 1: Explicit copy
filtered = df[df['A'] > 1].copy()
filtered['B'] = 100  # Safe

# Solution 2: Use .loc for assignment
df.loc[df['A'] > 1, 'B'] = 100  # Modifies original

# Solution 3: Chain in one line
df = df[df['A'] > 1].assign(B=100)

# Pandas 2.0+ with Copy-on-Write
pd.options.mode.copy_on_write = True
filtered = df[df['A'] > 1]
filtered['B'] = 100  # Creates copy automatically

Best Practices:

| Want to... | Use |
|------------|-----|
| Modify original | df.loc[condition, 'col'] = value |
| Create new DataFrame | df[condition].copy() |
| Chain operations | .assign() method |

Interviewer's Insight

  • Explains view vs copy concept (chained indexing creates ambiguity)
  • Uses .loc for in-place modification (safe)
  • Uses .copy() when creating new DataFrame
  • Knows Copy-on-Write in Pandas 2.0+ (eliminates warning)
  • Real-world: Google enforces .copy() in code reviews to avoid bugs

How to Use Rolling Windows for Time Series? - Google, Amazon, Netflix Interview Question

Difficulty: 🟑 Medium | Tags: Rolling Windows, Time Series, Finance | Asked by: Google, Amazon, Netflix, Apple

View Answer

Rolling Window Calculations:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=100),
    'price': np.random.randn(100).cumsum() + 100
})
df.set_index('date', inplace=True)

# Moving averages
df['MA_7'] = df['price'].rolling(window=7).mean()
df['MA_30'] = df['price'].rolling(window=30).mean()

# Other statistics
df['rolling_std'] = df['price'].rolling(7).std()
df['rolling_max'] = df['price'].rolling(7).max()
df['rolling_sum'] = df['price'].rolling(7).sum()

# Minimum periods (handle NaN at start)
df['MA_7_min3'] = df['price'].rolling(window=7, min_periods=3).mean()

# Centered window
df['MA_centered'] = df['price'].rolling(window=7, center=True).mean()

Custom Rolling Functions:

# Custom function with apply
df['rolling_range'] = df['price'].rolling(7).apply(
    lambda x: x.max() - x.min()
)

# Faster with raw=True (NumPy array)
df['rolling_custom'] = df['price'].rolling(7).apply(
    lambda x: np.percentile(x, 75), raw=True
)

Exponential Weighted Average:

# EMA - more weight to recent values
df['EMA_7'] = df['price'].ewm(span=7).mean()
df['EMA_decay'] = df['price'].ewm(alpha=0.1).mean()

Interviewer's Insight

  • Uses min_periods for edge cases (avoids NaN at start)
  • Knows ewm() for exponential weighting (more weight to recent)
  • Uses raw=True for performance (NumPy array, 3× faster)
  • Real-world: Trading firms use rolling(20).mean() for moving averages

How to Efficiently Use Query and Eval? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Filtering, Query, Performance | Asked by: Google, Amazon, Meta

View Answer

query() - String-Based Filtering:

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'department': ['Sales', 'IT', 'Sales'],
    'salary': [50000, 60000, 55000]
})

# Standard filtering (verbose)
df[(df['age'] > 25) & (df['department'] == 'Sales')]

# query() - cleaner syntax
df.query('age > 25 and department == "Sales"')

# Using variables with @
min_age = 25
dept = 'Sales'
df.query('age > @min_age and department == @dept')

# Column names with spaces
df.query('`Column Name` > 10')

eval() - Efficient Expression Evaluation:

# Create new column without intermediate copies
df.eval('bonus = salary * 0.1')

# Multiple expressions
df.eval('''
    bonus = salary * 0.1
    total_comp = salary + bonus
    age_group = age // 10 * 10
''', inplace=True)

# Conditional expressions
df.eval('is_senior = age >= 30')

Performance:

# eval uses numexpr for large DataFrames
# Faster for: large datasets, complex expressions
# Similar for: small datasets, simple operations

# Check if numexpr is available
import pandas as pd
print(pd.get_option('compute.use_numexpr'))

Interviewer's Insight

  • Uses query() for readable filters (SQL-like syntax)
  • Uses @ for variable interpolation (dynamic filters)
  • Knows eval() faster for large data (uses numexpr, avoids copies)
  • Real-world: Airbnb uses query() for dynamic filters in analytics dashboards

How to Work with String Data in Pandas? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: String Operations, Text Processing | Asked by: Google, Amazon, Meta, Netflix

View Answer

String Accessor (.str):

import pandas as pd

df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Bob Johnson'],
    'email': ['john@gmail.com', 'jane.doe@yahoo.com', 'bob@company.com']
})

# Case transformations
df['name'].str.lower()
df['name'].str.upper()
df['name'].str.title()

# String matching
df['name'].str.contains('John')
df['name'].str.startswith('J')
df['email'].str.endswith('.com')

# Extraction
df['first_name'] = df['name'].str.split(' ').str[0]
df['domain'] = df['email'].str.extract(r'@(\w+\.\w+)')

# Replacement
df['name'].str.replace('John', 'Jonathan')
df['email'].str.replace(r'@\w+', '@company', regex=True)

# Length and padding
df['name'].str.len()
df['name'].str.pad(width=20, side='right', fillchar='.')

Split and Expand:

# Split into multiple columns
df[['first', 'last']] = df['name'].str.split(' ', expand=True)

# Split into list (no expand)
df['name_parts'] = df['name'].str.split(' ')

Regular Expressions:

# Extract groups
df['phone'] = pd.Series(['123-456-7890', '987-654-3210'])
df[['area', 'exchange', 'number']] = df['phone'].str.extract(
    r'(\d{3})-(\d{3})-(\d{4})'
)

# Find all matches
df['numbers'] = df['phone'].str.findall(r'\d+')

Interviewer's Insight

  • Uses .str accessor for vectorized string operations (10-100x faster than loops)
  • Knows expand=True for split() to create multiple columns
  • Can write regex patterns for extraction (.str.extract(), .str.replace())
  • Real-world: Airbnb cleans 500M+ listing descriptions daily with .str methods

Difference Between transform() and apply() in GroupBy - Google, Amazon Interview Question

Difficulty: πŸ”΄ Hard | Tags: GroupBy, Data Transformation | Asked by: Google, Amazon, Meta

View Answer

Key Difference:

| Method | Output Shape | Broadcast |
|--------|--------------|-----------|
| transform() | Same as input | Yes |
| apply() | Any shape | No |

transform() - Broadcasts Back:

import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 30, 40]
})

# Normalize within groups
df['normalized'] = df.groupby('group')['value'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Add group statistics as new column
df['group_mean'] = df.groupby('group')['value'].transform('mean')
df['group_sum'] = df.groupby('group')['value'].transform('sum')
df['pct_of_group'] = df['value'] / df.groupby('group')['value'].transform('sum')

# Result: same number of rows as original

apply() - Flexible Output:

# Return aggregated result
df.groupby('group')['value'].apply(lambda x: x.sum())

# Return different shape per group
df.groupby('group').apply(lambda g: g.nlargest(1, 'value'))

# Return multiple values per group
df.groupby('group')['value'].apply(lambda x: pd.Series({
    'mean': x.mean(),
    'range': x.max() - x.min()
}))

When to Use Each:

| Use Case | Method |
|----------|--------|
| Add group stat as column | transform() |
| Normalize within groups | transform() |
| Filter/rank within groups | transform() |
| Custom aggregation | apply() |
| Return subset of rows | apply() |

Interviewer's Insight

  • Explains broadcast behavior: transform() returns same shape, apply() flexible
  • Uses transform() for normalization (z-scores within groups)
  • Knows performance: transform() optimized for aggregations, apply() slower
  • Real-world: Netflix normalizes viewing time by country with transform()

How to Create Bins and Categories with cut() and qcut()? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Discretization, Binning | Asked by: Google, Amazon, Meta

View Answer

cut() - Fixed-Width Bins:

import pandas as pd
import numpy as np

ages = pd.Series([5, 17, 25, 35, 45, 55, 65, 75, 85])

# Explicit bin edges
bins = [0, 18, 35, 50, 65, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Middle Aged', 'Senior']

pd.cut(ages, bins=bins, labels=labels)

# Equal-width bins
pd.cut(ages, bins=5)  # 5 equal-width bins

# Include lowest value
pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

# Return bin boundaries
pd.cut(ages, bins=5, retbins=True)

qcut() - Quantile-Based Bins:

# Equal-sized bins (same number of items)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

pd.qcut(values, q=4)  # Quartiles
pd.qcut(values, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
pd.qcut(values, q=[0, 0.25, 0.5, 0.75, 1])  # Custom quantiles

# Handle duplicates
pd.qcut(values, q=4, duplicates='drop')

Comparison:

| Method | Bin Size | Use Case |
|--------|----------|----------|
| cut() | Fixed width | Age groups, price ranges |
| qcut() | Equal count | Percentiles, balanced groups |

Example:

# cut: bins by value range
pd.cut([1, 5, 10, 50, 100], bins=3)
# (0.9, 34], (0.9, 34], (0.9, 34], (34, 67], (67, 100]

# qcut: bins by count
pd.qcut([1, 5, 10, 50, 100], q=3)
# (0.99, 7.5], (0.99, 7.5], (7.5, 30], (30, 100], (30, 100]

Interviewer's Insight

What they're testing: Data discretization.

Strong answer signals:

  • Knows when to use each method
  • Uses meaningful labels
  • Handles edge cases (duplicates)
  • Explains equal-width vs equal-count
  • Real-world: Stripe uses qcut() for balanced A/B test cohorts (equal-sized groups)

How to Handle Categorical Data Efficiently? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Categorical Data, Memory | Asked by: Google, Amazon, Meta, Netflix

View Answer

Creating Categorical:

import pandas as pd

# Convert to categorical
df = pd.DataFrame({'status': ['active', 'inactive', 'active', 'pending'] * 10000})

# Memory before
print(df.memory_usage(deep=True).sum())  # ~360KB

# Convert
df['status'] = df['status'].astype('category')

# Memory after
print(df.memory_usage(deep=True).sum())  # ~40KB (90% reduction!)

Ordered Categories:

from pandas.api.types import CategoricalDtype

# Define ordered category
size_type = CategoricalDtype(
    categories=['small', 'medium', 'large', 'xlarge'],
    ordered=True
)

df['size'] = df['size'].astype(size_type)

# Now comparisons work
df[df['size'] > 'medium']  # Returns 'large' and 'xlarge'
df['size'].min(), df['size'].max()

get_dummies() for One-Hot Encoding:

# One-hot encoding
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})

pd.get_dummies(df, columns=['color'])
# Result: color_blue, color_green, color_red columns

# Drop first to avoid multicollinearity
pd.get_dummies(df, columns=['color'], drop_first=True)

# Prefix
pd.get_dummies(df, columns=['color'], prefix='c')

Category Operations:

# Add/remove categories
df['status'].cat.add_categories(['new_status'])
df['status'].cat.remove_unused_categories()

# Rename categories
df['status'].cat.rename_categories({'active': 'ACTIVE'})

Interviewer's Insight

What they're testing: Efficient data handling.

Strong answer signals:

  • Uses category for memory savings
  • Knows ordered categories for comparisons
  • Uses get_dummies for ML
  • Understands when to use category type

How to Use MultiIndex (Hierarchical Indexing)? - Google, Amazon Interview Question

Difficulty: πŸ”΄ Hard | Tags: MultiIndex, Hierarchical Data | Asked by: Google, Amazon, Meta

View Answer

Creating MultiIndex:

import pandas as pd

# From tuples
index = pd.MultiIndex.from_tuples([
    ('A', 2021), ('A', 2022), ('B', 2021), ('B', 2022)
], names=['category', 'year'])

df = pd.DataFrame({'sales': [100, 150, 200, 250]}, index=index)

# From product (all combinations)
index = pd.MultiIndex.from_product(
    [['A', 'B'], [2021, 2022, 2023]],
    names=['category', 'year']
)

# From GroupBy
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'year': [2021, 2022, 2021, 2022],
    'sales': [100, 150, 200, 250]
})
df_multi = df.set_index(['category', 'year'])

Selecting with MultiIndex:

# Select outer level (on the MultiIndexed frame)
df_multi.loc['A']  # All years for category A

# Select specific combination
df_multi.loc[('A', 2021)]

# Cross-section
df_multi.xs(2021, level='year')  # All categories for 2021

# Slice
df_multi.loc['A':'B']  # Range of outer index

Flatten MultiIndex:

# Reset to regular columns
df.reset_index()

# Flatten column MultiIndex
df.columns = ['_'.join(col).strip() for col in df.columns.values]

Stack/Unstack:

# Wide to long (stack)
df.stack()

# Long to wide (unstack)
df.unstack()
df.unstack(level='year')

Interviewer's Insight

What they're testing: Complex data structures.

Strong answer signals:

  • Creates MultiIndex efficiently
  • Uses xs for cross-sections
  • Knows stack/unstack
  • Can flatten when needed

How to Profile and Optimize Pandas Performance? - Google, Amazon, Netflix Interview Question

Difficulty: πŸ”΄ Hard | Tags: Performance, Profiling, Optimization | Asked by: Google, Amazon, Netflix

View Answer

Profiling Tools:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100000, 10))

# Time operations (IPython/Jupyter magic)
%timeit df.sum()

# Line-by-line profiling (requires line_profiler: %load_ext line_profiler)
%lprun -f function_name function_name(df)

# Memory profiling (requires memory_profiler: %load_ext memory_profiler)
%memit df.copy()

Common Optimizations:

# 1. Avoid loops - use vectorization
# SLOW
for i in range(len(df)):
    df.loc[i, 'new_col'] = df.loc[i, 'A'] * 2

# FAST
df['new_col'] = df['A'] * 2

# 2. Use NumPy for complex operations
# SLOW
df['result'] = df.apply(lambda row: complex_function(row), axis=1)

# FAST
df['result'] = np.where(df['A'] > 0, df['A'] * 2, df['A'] * 3)

# 3. Use categorical for strings
df['category'] = df['category'].astype('category')

# 4. Read only needed columns
df = pd.read_csv('file.csv', usecols=['col1', 'col2'])

# 5. Use query() for complex filters
df.query('A > 0 and B < 100')  # Faster than boolean indexing

Advanced Techniques:

# Numba JIT compilation
from numba import jit

@jit(nopython=True)
def fast_calculation(arr):
    result = np.empty(len(arr))
    for i in range(len(arr)):
        result[i] = arr[i] ** 2 + arr[i] * 2
    return result

df['result'] = fast_calculation(df['A'].values)

# Swifter for automatic parallelization
import swifter
df['result'] = df['A'].swifter.apply(complex_function)

Interviewer's Insight

What they're testing: Production optimization skills.

Strong answer signals:

  • Profiles before optimizing
  • Prefers vectorization over loops
  • Uses NumPy for speed
  • Knows Numba/Swifter for edge cases

How to Use melt() for Unpivoting Data? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Reshaping, Unpivoting | Asked by: Google, Amazon, Meta

View Answer

melt() - Wide to Long Format:

import pandas as pd

# Wide format
df_wide = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    '2021_sales': [100, 200],
    '2022_sales': [150, 250],
    '2023_sales': [180, 300]
})

# Convert to long format
df_long = pd.melt(
    df_wide,
    id_vars=['name'],
    value_vars=['2021_sales', '2022_sales', '2023_sales'],
    var_name='year',
    value_name='sales'
)

# Result:
#    name        year  sales
# 0  Alice  2021_sales    100
# 1    Bob  2021_sales    200
# ...

Clean up melted data:

df_long['year'] = df_long['year'].str.replace('_sales', '').astype(int)

Interviewer's Insight

What they're testing: Data reshaping skills.

Strong answer signals:

  • Knows melt reverses pivot
  • Uses id_vars for fixed columns
  • Cleans variable names after melting

How to Use stack() and unstack()? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Reshaping, MultiIndex | Asked by: Google, Amazon, Meta

View Answer

stack() - Pivot columns to rows:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
}, index=['row1', 'row2'])

# Stack columns into rows
stacked = df.stack()
# Result: MultiIndex Series
# row1  A    1
#       B    3
# row2  A    2
#       B    4

unstack() - Pivot rows to columns:

# Reverse operation
unstacked = stacked.unstack()

# Unstack specific level
df_multi = pd.DataFrame(
    {'value': [1, 2, 3, 4]},
    index=pd.MultiIndex.from_arrays([['cat1', 'cat1', 'cat2', 'cat2'], [1, 2, 1, 2]])
)
df_multi.unstack(level=0)  # Unstack first level
df_multi.unstack(level=1)  # Unstack second level

Interviewer's Insight

What they're testing: MultiIndex manipulation.

Strong answer signals:

  • Understands stack/unstack relationship
  • Can specify which level to unstack
  • Uses for reshaping complex data

How to Cross-Tabulate with crosstab()? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Cross Tabulation, Analysis | Asked by: Google, Amazon, Meta

View Answer

crosstab() - Frequency Tables:

import pandas as pd

df = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'department': ['Sales', 'Sales', 'IT', 'IT', 'Sales', 'HR'],
    'salary': [50000, 55000, 60000, 65000, 52000, 48000]
})

# Simple frequency table
pd.crosstab(df['gender'], df['department'])

# With margins (totals)
pd.crosstab(df['gender'], df['department'], margins=True)

# With aggregation
pd.crosstab(
    df['gender'], 
    df['department'], 
    values=df['salary'],
    aggfunc='mean'
)

# Normalize to percentages
pd.crosstab(df['gender'], df['department'], normalize='all')  # All cells
pd.crosstab(df['gender'], df['department'], normalize='index')  # By row
pd.crosstab(df['gender'], df['department'], normalize='columns')  # By column

Interviewer's Insight

What they're testing: Contingency table creation.

Strong answer signals:

  • Uses normalize for percentages
  • Knows margins for totals
  • Can aggregate with values/aggfunc

How to Use explode() for List Columns? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: List Operations, Data Preprocessing | Asked by: Google, Amazon, Meta

View Answer

explode() - Unnest Lists:

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'skills': [['Python', 'SQL', 'ML'], ['Java', 'Scala']]
})

# Explode list column
df_exploded = df.explode('skills')

# Result:
#    name  skills
# 0  Alice  Python
# 0  Alice     SQL
# 0  Alice      ML
# 1    Bob    Java
# 1    Bob   Scala

# Reset index after explode
df_exploded = df.explode('skills').reset_index(drop=True)

# Explode multiple columns (same length)
df = pd.DataFrame({
    'id': [1, 2],
    'values': [[1, 2], [3, 4]],
    'labels': [['a', 'b'], ['c', 'd']]
})
df.explode(['values', 'labels'])

Interviewer's Insight

What they're testing: Nested data handling.

Strong answer signals:

  • Resets index after explode
  • Handles multiple list columns
  • Knows the reverse: groupby + list aggregation (sketched below)
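
A minimal sketch of that reverse operation, re-nesting the exploded frame from the example above back into list columns.

# Collapse the exploded rows back into one list per name
df_nested = (
    df_exploded
    .groupby('name', sort=False)['skills']
    .agg(list)
    .reset_index()
)
# df_nested['skills'] is again a column of Python lists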

How to Handle JSON with Nested Structures? - Amazon, Google Interview Question

Difficulty: 🟑 Medium | Tags: JSON Processing, Nested Data | Asked by: Amazon, Google, Meta

View Answer

json_normalize() - Flatten Nested JSON:

import pandas as pd
from pandas import json_normalize

data = [
    {
        'id': 1,
        'name': 'Alice',
        'address': {
            'city': 'NYC',
            'zip': '10001'
        },
        'orders': [{'id': 101, 'amount': 50}, {'id': 102, 'amount': 75}]
    }
]

# Flatten top level
df = json_normalize(data)

# Flatten with nested paths
df = json_normalize(
    data,
    record_path='orders',  # Explode this array
    meta=['id', 'name', ['address', 'city']],  # Include these fields
    meta_prefix='user_'
)

# Read JSON file
df = pd.read_json('data.json')
df = pd.read_json('data.json', orient='records')

Interviewer's Insight

What they're testing: Real-world data handling.

Strong answer signals:

  • Uses json_normalize for nested
  • Knows record_path and meta
  • Handles different JSON orientations

How to Use Method Chaining for Clean Code? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Method Chaining, Clean Code | Asked by: Google, Amazon, Meta

View Answer

Method Chaining Pattern:

import pandas as pd

# Instead of multiple intermediate variables
df = (
    pd.read_csv('data.csv')
    .query('status == "active"')
    .assign(
        year=lambda x: pd.to_datetime(x['date']).dt.year,
        total=lambda x: x['price'] * x['quantity']
    )
    .groupby(['year', 'region'])
    .agg(revenue=('total', 'sum'))
    .reset_index()
    .sort_values('revenue', ascending=False)
)

pipe() for Custom Functions:

import numpy as np

def add_features(df):
    return df.assign(
        log_value=np.log1p(df['value']),
        is_high=df['value'] > df['value'].median()
    )

def filter_outliers(df, column, n_std=3):
    mean, std = df[column].mean(), df[column].std()
    return df[(df[column] - mean).abs() <= n_std * std]

result = (
    df
    .pipe(add_features)
    .pipe(filter_outliers, 'value', n_std=2)
)

Interviewer's Insight

What they're testing: Clean, maintainable code.

Strong answer signals:

  • Uses parentheses for multi-line chains
  • Uses pipe() for custom functions
  • Avoids intermediate variables

How to Read/Write Parquet Files? - Google, Amazon, Netflix Interview Question

Difficulty: 🟒 Easy | Tags: File I/O, Big Data, Parquet | Asked by: Google, Amazon, Netflix, Meta

View Answer

Parquet - Columnar Format:

import pandas as pd

# Write to Parquet
df.to_parquet('data.parquet')
df.to_parquet('data.parquet', compression='snappy')  # Default
df.to_parquet('data.parquet', compression='gzip')    # Smaller

# Read Parquet
df = pd.read_parquet('data.parquet')

# Read only specific columns (fast!)
df = pd.read_parquet('data.parquet', columns=['col1', 'col2'])

# With filters (predicate pushdown)
df = pd.read_parquet(
    'data.parquet',
    filters=[('year', '==', 2023)]
)

Parquet vs CSV:

| Feature | Parquet | CSV |
|---------|---------|-----|
| Size | ~10x smaller | Larger |
| Read speed | Faster | Slower |
| Column selection | Fast | Must read all |
| Data types | Preserved | Lost |
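
A quick way to check the size claim yourself; a minimal sketch assuming pyarrow (or fastparquet) is installed, with made-up column names, and the exact ratio depends on the data and compression.

import os
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': np.arange(1_000_000),
    'value': np.random.randn(1_000_000),
    'category': np.random.choice(['a', 'b', 'c'], size=1_000_000),
})

df.to_csv('demo.csv', index=False)
df.to_parquet('demo.parquet', compression='snappy')

print(f"CSV:     {os.path.getsize('demo.csv') / 1e6:.1f} MB")
print(f"Parquet: {os.path.getsize('demo.parquet') / 1e6:.1f} MB")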

Interviewer's Insight

What they're testing: Efficient data storage.

Strong answer signals:

  • Uses Parquet for large datasets
  • Reads only needed columns
  • Knows compression options

How to Use assign() for Creating New Columns? - Google, Amazon Interview Question

Difficulty: 🟒 Easy | Tags: Column Creation, Method Chaining | Asked by: Google, Amazon, Meta

View Answer

assign() - Create Columns in Chain:

import pandas as pd

df = pd.DataFrame({
    'price': [100, 200, 300],
    'quantity': [2, 3, 1]
})

# Create multiple columns
df = df.assign(
    total=lambda x: x['price'] * x['quantity'],
    discounted=lambda x: x['total'] * 0.9,
    tax=lambda x: x['total'] * 0.1
)

# Reference previous assignments
df = df.assign(
    subtotal=lambda x: x['price'] * x['quantity'],
    tax=lambda x: x['subtotal'] * 0.1,
    total=lambda x: x['subtotal'] + x['tax']
)

assign() vs direct assignment:

# Direct assignment (modifies in place)
df['new_col'] = df['price'] * 2

# assign() (returns new DataFrame, original unchanged)
df_new = df.assign(new_col=df['price'] * 2)

Interviewer's Insight

What they're testing: Functional programming style.

Strong answer signals:

  • Uses lambda for dependent columns
  • Prefers assign for method chaining
  • Knows it returns new DataFrame

How to Calculate Percentage Change and Cumulative Stats? - Google, Amazon, Netflix Interview Question

Difficulty: 🟒 Easy | Tags: Time Series, Finance | Asked by: Google, Amazon, Netflix, Apple

View Answer

Percentage Change:

import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=5),
    'price': [100, 105, 102, 110, 108]
})

# Daily percentage change
df['pct_change'] = df['price'].pct_change()

# Percentage change over N periods
df['pct_change_3'] = df['price'].pct_change(periods=3)

# Fill first NaN
df['pct_change'] = df['price'].pct_change().fillna(0)

Cumulative Statistics:

# Cumulative sum
df['cumsum'] = df['price'].cumsum()

# Cumulative product (for returns)
df['cumulative_return'] = (1 + df['pct_change']).cumprod() - 1

# Cumulative max/min
df['cummax'] = df['price'].cummax()
df['cummin'] = df['price'].cummin()

# Drawdown
df['drawdown'] = df['price'] / df['price'].cummax() - 1

Interviewer's Insight

What they're testing: Financial calculations.

Strong answer signals:

  • Uses pct_change for returns
  • Knows cumprod for cumulative returns
  • Can calculate drawdowns

How to Shift and Lag Data? - Google, Amazon, Netflix Interview Question

Difficulty: 🟒 Easy | Tags: Time Series, Lag Features | Asked by: Google, Amazon, Netflix

View Answer

shift() - Create Lag/Lead Features:

import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=5),
    'value': [10, 20, 30, 40, 50]
})

# Previous value (lag)
df['prev_value'] = df['value'].shift(1)
df['prev_2'] = df['value'].shift(2)

# Next value (lead)
df['next_value'] = df['value'].shift(-1)

# Calculate difference from previous
df['diff'] = df['value'] - df['value'].shift(1)
# Same as: df['value'].diff()

# Shift with fill
df['prev_filled'] = df['value'].shift(1, fill_value=0)

Use Cases:

| Operation | Formula |
|-----------|---------|
| Lag-1 | df['col'].shift(1) |
| Difference | df['col'].diff() |
| % Change | df['col'].pct_change() |
| Rolling difference | df['col'] - df['col'].shift(n) |

Interviewer's Insight

What they're testing: Feature engineering for time series.

Strong answer signals:

  • Creates lag features for ML
  • Knows shift vs diff vs pct_change
  • Handles NaN from shifting

How to Sample Data from DataFrame? - Google, Amazon Interview Question

Difficulty: 🟒 Easy | Tags: Sampling, Data Exploration | Asked by: Google, Amazon, Meta, Netflix

View Answer

sample() - Random Sampling:

import pandas as pd

df = pd.DataFrame({'A': range(1000)})

# Sample n rows
sample = df.sample(n=100)

# Sample fraction
sample = df.sample(frac=0.1)  # 10%

# With replacement (bootstrap)
bootstrap = df.sample(n=1000, replace=True)

# Reproducible sampling
sample = df.sample(n=100, random_state=42)

# Weighted sampling
df['weight'] = [0.1] * 500 + [0.9] * 500
sample = df.sample(n=100, weights='weight')

Stratified Sampling:

# Sample within groups (assumes a 'category' column in the frame)
df.groupby('category').sample(n=10)
df.groupby('category').sample(frac=0.1)

# Stratified with sklearn
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, stratify=df['category'], test_size=0.2)

Interviewer's Insight

What they're testing: Data sampling techniques.

Strong answer signals:

  • Uses random_state for reproducibility
  • Knows weighted sampling
  • Uses stratified for imbalanced data

How to Detect and Handle Outliers? - Google, Amazon, Meta Interview Question

Difficulty: 🟑 Medium | Tags: Outlier Detection, Data Cleaning | Asked by: Google, Amazon, Meta, Netflix

View Answer

Detection Methods:

import pandas as pd
import numpy as np

df = pd.DataFrame({'value': [1, 2, 3, 100, 4, 5, -50, 6]})

# Z-score method
z_scores = (df['value'] - df['value'].mean()) / df['value'].std()
outliers = df[abs(z_scores) > 3]

# IQR method
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['value'] < lower) | (df['value'] > upper)]

# Percentile method
outliers = df[(df['value'] < df['value'].quantile(0.01)) | 
              (df['value'] > df['value'].quantile(0.99))]

Handling Outliers:

# Remove
df_clean = df[(df['value'] >= lower) & (df['value'] <= upper)]

# Cap (winsorize)
df['value_capped'] = df['value'].clip(lower=lower, upper=upper)

# Replace with NaN
df.loc[abs(z_scores) > 3, 'value'] = np.nan

Interviewer's Insight

What they're testing: Data quality handling.

Strong answer signals:

  • Knows IQR and Z-score methods
  • Chooses method based on distribution
  • Considers domain knowledge

How to Normalize and Standardize Data? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Feature Engineering, ML Preprocessing | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Normalization (Min-Max Scaling):

import pandas as pd

df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

# Scale to [0, 1]
df['normalized'] = (df['value'] - df['value'].min()) / \
                   (df['value'].max() - df['value'].min())

# Using sklearn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['normalized'] = scaler.fit_transform(df[['value']])

Standardization (Z-score):

# Scale to mean=0, std=1
df['standardized'] = (df['value'] - df['value'].mean()) / df['value'].std()

# Using sklearn
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['standardized'] = scaler.fit_transform(df[['value']])

When to Use:

| Method | Use When |
|--------|----------|
| Min-Max | Bounded range needed, no outliers |
| Z-score | Normal distribution, has outliers |
| Robust | Many outliers (uses median/IQR) |
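
Robust scaling is listed in the table but not shown above; a minimal sketch using the median and IQR directly, with sklearn's RobustScaler as the usual production equivalent.

# Manual robust scaling: (x - median) / IQR
q1, q3 = df['value'].quantile([0.25, 0.75])
df['robust_scaled'] = (df['value'] - df['value'].median()) / (q3 - q1)

# Using sklearn
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df['robust_scaled'] = scaler.fit_transform(df[['value']])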

Interviewer's Insight

What they're testing: ML preprocessing.

Strong answer signals:

  • Knows difference between methods
  • Uses sklearn for production
  • Considers outliers in choice

How to Use where() and mask() Methods? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Conditional Operations, Data Transformation | Asked by: Google, Amazon, Meta

View Answer

where() - Keep values where True:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# Keep values >= 3, replace others with NaN
df['A'].where(df['A'] >= 3)
# Result: [NaN, NaN, 3, 4, 5]

# Replace with specific value
df['A'].where(df['A'] >= 3, other=0)
# Result: [0, 0, 3, 4, 5]

mask() - Replace values where True:

# Opposite of where
# Replace values >= 3 with NaN
df['A'].mask(df['A'] >= 3)
# Result: [1, 2, NaN, NaN, NaN]

# Replace with specific value
df['A'].mask(df['A'] >= 3, other=999)
# Result: [1, 2, 999, 999, 999]

vs np.where:

# np.where for if-else
df['result'] = np.where(df['A'] >= 3, 'high', 'low')

# Multiple conditions: np.select
conditions = [df['A'] < 2, df['A'] < 4]
choices = ['low', 'medium']
df['category'] = np.select(conditions, choices, default='high')

Interviewer's Insight

What they're testing: Conditional data manipulation.

Strong answer signals:

  • Knows where keeps, mask replaces
  • Uses np.where for if-else
  • Uses np.select for multiple conditions

How to Rank Values in Pandas? - Google, Amazon Interview Question

Difficulty: 🟒 Easy | Tags: Ranking, Analysis | Asked by: Google, Amazon, Meta, Netflix

View Answer

rank() Method:

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'score': [85, 90, 85, 95]
})

# Default: average rank for ties
df['rank'] = df['score'].rank()
# [1.5, 3, 1.5, 4] - tied scores get average

# Rank methods for ties
df['rank_min'] = df['score'].rank(method='min')    # [1, 3, 1, 4]
df['rank_max'] = df['score'].rank(method='max')    # [2, 3, 2, 4]
df['rank_first'] = df['score'].rank(method='first') # [1, 3, 2, 4]
df['rank_dense'] = df['score'].rank(method='dense') # [1, 2, 1, 3]

# Descending rank
df['rank_desc'] = df['score'].rank(ascending=False)

# Rank within groups (assumes a 'category' column in the frame)
df['rank_in_group'] = df.groupby('category')['score'].rank()

# Percentile rank
df['percentile'] = df['score'].rank(pct=True)

Interviewer's Insight

What they're testing: Data ranking skills.

Strong answer signals:

  • Knows different tie-breaking methods
  • Uses dense for no gaps
  • Can rank within groups

How to Find First/Last N Rows Per Group? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: GroupBy, Selection | Asked by: Google, Amazon, Meta, Netflix

View Answer

nlargest/nsmallest per Group:

import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [10, 30, 20, 40, 60, 50]
})

# Top 2 per group using nlargest
df.groupby('category').apply(lambda x: x.nlargest(2, 'value'))

# Top N using head after sort
df.sort_values('value', ascending=False).groupby('category').head(2)

# First/last row per group
df.groupby('category').first()
df.groupby('category').last()
df.groupby('category').nth(0)  # First row
df.groupby('category').nth(-1)  # Last row

Using rank:

# More efficient for large data
df['rank'] = df.groupby('category')['value'].rank(method='first', ascending=False)
top_2 = df[df['rank'] <= 2]

Interviewer's Insight

What they're testing: Efficient group operations.

Strong answer signals:

  • Uses nlargest/nsmallest for simplicity
  • Uses rank for large data
  • Knows first(), last(), nth()

How to Use nsmallest() and nlargest()? - Google, Amazon Interview Question

Difficulty: 🟒 Easy | Tags: Selection, Performance | Asked by: Google, Amazon, Meta

View Answer

nlargest/nsmallest - Efficient Selection:

import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'sales': [100, 500, 200, 800, 300],
    'profit': [10, 50, 30, 70, 20]
})

# Top 3 by sales
df.nlargest(3, 'sales')

# Bottom 3 by profit
df.nsmallest(3, 'profit')

# By multiple columns (tiebreaker)
df.nlargest(3, ['sales', 'profit'])

# Keep='first' (default), 'last', 'all'
df.nlargest(3, 'sales', keep='all')  # Include all ties

Performance:

# nlargest is O(n log k) - faster than full sort
df.nlargest(10, 'value')  # Faster

# Full sort is O(n log n)
df.sort_values('value', ascending=False).head(10)  # Slower

Interviewer's Insight

What they're testing: Efficient data selection.

Strong answer signals:

  • Knows nlargest is faster than sort+head
  • Uses keep='all' for ties
  • Applies to Series and DataFrame

How to Calculate Weighted Average? - Google, Amazon, Netflix Interview Question

Difficulty: 🟑 Medium | Tags: Aggregation, Finance | Asked by: Google, Amazon, Netflix, Apple

View Answer

Weighted Average:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'price': [10, 20, 30],
    'quantity': [100, 50, 25]
})

# Weighted average price
weighted_avg = np.average(df['price'], weights=df['quantity'])
# or
weighted_avg = (df['price'] * df['quantity']).sum() / df['quantity'].sum()

# Weighted average per group (assumes a grouping column such as 'category')
def weighted_avg_func(group, value_col, weight_col):
    return np.average(group[value_col], weights=group[weight_col])

df.groupby('category').apply(
    weighted_avg_func, 'price', 'quantity'
)

Named Aggregation with Weighted Average:

def weighted_mean(df, value_col, weight_col):
    return (df[value_col] * df[weight_col]).sum() / df[weight_col].sum()

result = df.groupby('category').apply(
    lambda x: pd.Series({
        'weighted_price': weighted_mean(x, 'price', 'quantity'),
        'total_quantity': x['quantity'].sum()
    })
)

Interviewer's Insight

What they're testing: Custom aggregations.

Strong answer signals:

  • Uses np.average with weights
  • Can apply to groups
  • Handles edge cases such as zero total weight (see the sketch below)
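
A minimal sketch of that edge-case handling; safe_weighted_mean is a hypothetical helper, not a pandas API.

import numpy as np
import pandas as pd

def safe_weighted_mean(values: pd.Series, weights: pd.Series) -> float:
    """Weighted mean that returns NaN instead of dividing by zero."""
    total = weights.sum()
    if total == 0:
        return np.nan
    return (values * weights).sum() / total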

How to Perform Window Functions Like SQL? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Window Functions, Analytics | Asked by: Google, Amazon, Meta, Netflix

View Answer

SQL-like Window Functions:

import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': pd.date_range('2023-01-01', periods=6),
    'value': [10, 20, 30, 40, 50, 60]
})

# ROW_NUMBER
df['row_num'] = df.groupby('category').cumcount() + 1

# RANK
df['rank'] = df.groupby('category')['value'].rank(method='min')

# DENSE_RANK
df['dense_rank'] = df.groupby('category')['value'].rank(method='dense')

# LEAD / LAG
df['prev_value'] = df.groupby('category')['value'].shift(1)
df['next_value'] = df.groupby('category')['value'].shift(-1)

# Running SUM / AVG
df['running_sum'] = df.groupby('category')['value'].cumsum()
df['running_avg'] = df.groupby('category')['value'].expanding().mean().values

# Percent of total
df['pct_of_cat'] = df['value'] / df.groupby('category')['value'].transform('sum')

Interviewer's Insight

What they're testing: Analytics skills.

Strong answer signals:

  • Maps SQL functions to Pandas
  • Uses transform for same-shape output
  • Combines groupby with cumulative ops

How to Compare Two DataFrames? - Amazon, Google Interview Question

Difficulty: 🟑 Medium | Tags: Data Comparison, Validation | Asked by: Amazon, Google, Microsoft

View Answer

compare() Method (Pandas 1.1+):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': ['a', 'x', 'c']})

# Show differences
df1.compare(df2)
# Shows 'self' and 'other' for differences

# Keep all rows
df1.compare(df2, keep_equal=True)

# Keep all columns
df1.compare(df2, keep_shape=True)

Testing Equality:

# Check if equal
df1.equals(df2)  # Returns True/False

# Element-wise comparison
df1 == df2  # Boolean DataFrame
(df1 == df2).all().all()  # All equal?

# For testing
from pandas.testing import assert_frame_equal
assert_frame_equal(df1, df2, check_dtype=False)

Find Differences:

# Rows of df1 that differ from df2 at the same positions (not a true set difference)
df1[~df1.isin(df2).all(axis=1)]

# Using merge
merged = df1.merge(df2, indicator=True, how='outer')
only_in_df1 = merged[merged['_merge'] == 'left_only']
only_in_df2 = merged[merged['_merge'] == 'right_only']

Interviewer's Insight

What they're testing: Data validation skills.

Strong answer signals:

  • Uses compare() for visual diff
  • Uses assert_frame_equal for tests
  • Can find specific differences

How to Handle SettingWithCopyWarning Properly? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Best Practices, Common Errors | Asked by: Google, Amazon, Meta, Microsoft

View Answer

Copy-on-Write in Pandas 2.0+:

import pandas as pd

# Enable CoW globally
pd.options.mode.copy_on_write = True

# Now this is safe
df2 = df[df['A'] > 0]
df2['B'] = 100  # Creates copy automatically

# Original df unchanged

Best Practices Summary:

# 1. Modifying original - use .loc
df.loc[df['A'] > 0, 'B'] = 100

# 2. Creating new DataFrame - use .copy()
df_new = df[df['A'] > 0].copy()
df_new['B'] = 100

# 3. Method chaining - use .assign()
df_new = df.query('A > 0').assign(B=100)

# 4. Never chain indexing
# BAD
df[df['A'] > 0]['B'] = 100
# GOOD
df.loc[df['A'] > 0, 'B'] = 100

Interviewer's Insight

What they're testing: Production-safe code.

Strong answer signals:

  • Knows CoW in Pandas 2.0+
  • Uses .loc consistently
  • Avoids chained indexing

How to Use Sparse Data Structures? - Google, Amazon Interview Question

Difficulty: πŸ”΄ Hard | Tags: Sparse Data, Memory Optimization | Asked by: Google, Amazon, Netflix

View Answer

Sparse Arrays - For Mostly-Null Data:

import pandas as pd
import numpy as np

# Create sparse array
arr = pd.arrays.SparseArray([0, 0, 1, 0, 0, 0, 2, 0, 0, 0])

# Memory savings
print(arr.nbytes)  # Much smaller than the dense equivalent

# Sparse Series
s = pd.Series(pd.arrays.SparseArray([0] * 1000000 + [1]))
print(s.memory_usage())  # ~16 bytes vs ~8MB

# Convert existing
df_sparse = df.astype(pd.SparseDtype('float', fill_value=0))

Use Cases:

| Scenario | Memory Savings |
|----------|----------------|
| 90% zeros | ~10x |
| 99% zeros | ~100x |
| One-hot encoded | Massive |
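
For the one-hot case, get_dummies can emit sparse columns directly; a minimal sketch with a hypothetical high-cardinality 'city' column (the savings show up when each dummy column is mostly False).

import numpy as np
import pandas as pd

df_cat = pd.DataFrame({
    'city': np.random.choice([f'city_{i}' for i in range(100)], size=100_000)
})

dense = pd.get_dummies(df_cat, columns=['city'])
sparse = pd.get_dummies(df_cat, columns=['city'], sparse=True)

print(dense.memory_usage(deep=True).sum())
print(sparse.memory_usage(deep=True).sum())  # much smaller when most entries are False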

Operations:

# Most operations work normally
s.sum()
s.mean()

# Convert back to dense
s_dense = s.sparse.to_dense()

Interviewer's Insight

What they're testing: Advanced memory optimization.

Strong answer signals:

  • Uses for mostly-zero data
  • Knows fill_value parameter
  • Understands trade-offs

How to Implement Custom Aggregation Functions? - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Aggregation, Custom Functions | Asked by: Google, Amazon, Meta

View Answer

Custom Aggregation:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 30, 40]
})

# Lambda function
df.groupby('category')['value'].agg(lambda x: x.max() - x.min())

# Named function
def range_func(x):
    return x.max() - x.min()

df.groupby('category')['value'].agg(range_func)

# Multiple custom aggregations
df.groupby('category')['value'].agg([
    ('range', lambda x: x.max() - x.min()),
    ('cv', lambda x: x.std() / x.mean()),  # Coefficient of variation
    ('iqr', lambda x: x.quantile(0.75) - x.quantile(0.25))
])

# Named aggregation with custom
df.groupby('category').agg(
    value_range=('value', lambda x: x.max() - x.min()),
    value_cv=('value', lambda x: x.std() / x.mean())
)

Interviewer's Insight

What they're testing: Advanced aggregation skills.

Strong answer signals:

  • Uses named aggregations
  • Defines reusable functions
  • Combines built-in and custom

How to Perform Asof Joins (Nearest Key Join)? - Google, Amazon, Netflix Interview Question

Difficulty: πŸ”΄ Hard | Tags: Joining, Time Series | Asked by: Google, Amazon, Netflix, Apple

View Answer

merge_asof() - Join on Nearest Key:

import pandas as pd

# Trade data
trades = pd.DataFrame({
    'time': pd.to_datetime(['2023-01-01 10:00:01', '2023-01-01 10:00:03', 
                            '2023-01-01 10:00:05']),
    'ticker': ['AAPL', 'AAPL', 'AAPL'],
    'quantity': [100, 200, 150]
})

# Quote data
quotes = pd.DataFrame({
    'time': pd.to_datetime(['2023-01-01 10:00:00', '2023-01-01 10:00:02', 
                            '2023-01-01 10:00:04']),
    'ticker': ['AAPL', 'AAPL', 'AAPL'],
    'bid': [149.0, 149.5, 150.0]
})

# Join trade with most recent quote
result = pd.merge_asof(
    trades.sort_values('time'),
    quotes.sort_values('time'),
    on='time',
    by='ticker',
    direction='backward'  # Use most recent quote
)

Direction Options:

Direction Meaning
backward Previous/equal key
forward Next/equal key
nearest Closest key
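
direction='nearest' pairs naturally with the tolerance parameter, which bounds how far the match may be. A small sketch on the trades/quotes frames above; the 2-second window is illustrative:

# Nearest quote within 2 seconds; trades with no quote in range get NaN
nearest = pd.merge_asof(
    trades.sort_values('time'),
    quotes.sort_values('time'),
    on='time',
    by='ticker',
    direction='nearest',
    tolerance=pd.Timedelta('2s')
)
print(nearest)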

Interviewer's Insight

What they're testing: Time series joining.

Strong answer signals:

  • Sorts data before asof join
  • Uses by= for grouping
  • Knows direction parameter

How to Calculate Cohort Metrics? - Meta, Netflix, Amazon Interview Question

Difficulty: πŸ”΄ Hard | Tags: Cohort Analysis, Time Series | Asked by: Meta, Netflix, Amazon, Google

View Answer

Cohort Retention Analysis:

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'order_date': pd.to_datetime([
        '2023-01-05', '2023-02-10', '2023-03-15',
        '2023-01-20', '2023-03-25', '2023-02-01'
    ])
})

# Get first purchase date (cohort)
df['cohort'] = df.groupby('user_id')['order_date'].transform('min')
df['cohort_month'] = df['cohort'].dt.to_period('M')
df['order_month'] = df['order_date'].dt.to_period('M')

# Calculate months since cohort (Period subtraction returns a month offset; .n is the count)
df['months_since_cohort'] = (
    (df['order_month'] - df['cohort_month']).apply(lambda d: d.n)
)

# Create cohort pivot
cohort_data = df.groupby(['cohort_month', 'months_since_cohort']).agg(
    users=('user_id', 'nunique')
).reset_index()

cohort_pivot = cohort_data.pivot_table(
    index='cohort_month',
    columns='months_since_cohort',
    values='users'
)

# Calculate retention rates
cohort_sizes = cohort_pivot[0]
retention = cohort_pivot.divide(cohort_sizes, axis=0)

Interviewer's Insight

What they're testing: Business analytics skills.

Strong answer signals:

  • Calculates cohort from first action
  • Uses period for month grouping
  • Creates retention matrix

How to Implement A/B Test Analysis in Pandas? - Meta, Google, Netflix Interview Question

Difficulty: πŸ”΄ Hard | Tags: A/B Testing, Statistical Analysis | Asked by: Meta, Google, Netflix, Amazon

View Answer

A/B Test Analysis:

import pandas as pd
import numpy as np
from scipy import stats

# Simulated experiment: conversion probability depends on the assigned variant
np.random.seed(42)
variant = np.random.choice(['control', 'treatment'], 10000)
df = pd.DataFrame({
    'user_id': range(10000),
    'variant': variant,
    'converted': np.random.binomial(1, np.where(variant == 'treatment', 0.12, 0.10))
})

# Summary statistics per variant
summary = df.groupby('variant').agg(
    users=('user_id', 'count'),
    conversions=('converted', 'sum'),
    conversion_rate=('converted', 'mean')
)

# Statistical test
control = df[df['variant'] == 'control']['converted']
treatment = df[df['variant'] == 'treatment']['converted']

# Chi-squared test for proportions
contingency = pd.crosstab(df['variant'], df['converted'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

# Or use proportion z-test
from statsmodels.stats.proportion import proportions_ztest
count = [treatment.sum(), control.sum()]
nobs = [len(treatment), len(control)]
z_stat, p_value = proportions_ztest(count, nobs)

# Confidence interval for lift
lift = (treatment.mean() - control.mean()) / control.mean()
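
The lift alone is not enough for a decision. A minimal sketch of a 95% confidence interval for the difference in conversion rates using the normal approximation, assuming the control/treatment Series defined above:

# 95% CI for the absolute difference in conversion rates
p_c, p_t = control.mean(), treatment.mean()
n_c, n_t = len(control), len(treatment)

diff = p_t - p_c
se = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"diff = {diff:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
# If the interval excludes 0, the uplift is significant at roughly the 5% level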

Interviewer's Insight

What they're testing: Statistical analysis skills.

Strong answer signals:

  • Calculates conversion rates
  • Uses appropriate statistical test
  • Interprets p-value correctly

How to Use Copy-on-Write (CoW) in Pandas 2.0+? - Google, Meta Interview Question

Difficulty: πŸ”΄ Hard | Tags: Internals, Performance | Asked by: Google, Meta, Microsoft

View Answer

Copy-on-Write Explained:

import pandas as pd

# Enable globally
pd.options.mode.copy_on_write = True

# Or use context manager
with pd.option_context('mode.copy_on_write', True):
    df2 = df[['A', 'B']]
    df2['A'] = 100  # Creates copy only when modified

Benefits:

Aspect Without CoW With CoW
Memory Copies on slice Shares until modified
Safety Ambiguous Always safe
Speed Unnecessary copies Lazy copies

Behavior:

pd.options.mode.copy_on_write = True

# Views share data until modification
df = pd.DataFrame({'A': [1, 2, 3]})
df2 = df[['A']]  # No copy yet

df2['A'] = 100  # Copy created here
print(df)  # Original unchanged!

# No more SettingWithCopyWarning
df[df['A'] > 1]['A'] = 99  # No effect on df; under CoW pandas warns about chained assignment instead of silently mutating

Interviewer's Insight

What they're testing: Modern Pandas knowledge.

Strong answer signals:

  • Knows CoW mechanism
  • Understands lazy copying
  • Knows it's default in Pandas 3.0

How to Use PyArrow Backend for Better Performance? - Google, Amazon Interview Question

Difficulty: πŸ”΄ Hard | Tags: Performance, Arrow | Asked by: Google, Amazon, Databricks

View Answer

PyArrow Backend (Pandas 2.0+):

import pandas as pd

# Read with PyArrow backend
df = pd.read_csv('data.csv', dtype_backend='pyarrow')
df = pd.read_parquet('data.parquet', dtype_backend='pyarrow')

# Convert existing DataFrame
df_arrow = df.convert_dtypes(dtype_backend='pyarrow')

# Check types
df_arrow.dtypes
# int64[pyarrow], string[pyarrow], etc.

Benefits:

Feature NumPy Backend PyArrow Backend
String memory High Low
Null handling float64 trick Native
Interop Limited Arrow ecosystem

String Performance:

# PyArrow strings are much more efficient
df['text_col']  # With PyArrow: less memory, faster ops

# Native null support
df_arrow['int_col']  # Can have Int64 with nulls
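
A minimal sketch to check the string-memory difference yourself (requires pyarrow installed; the column contents and sizes are illustrative):

import pandas as pd
import numpy as np

words = np.random.choice(['alpha', 'beta', 'gamma', 'delta'], size=1_000_000)

s_object = pd.Series(words, dtype=object)     # classic object-dtype strings
s_arrow = s_object.astype('string[pyarrow]')  # Arrow-backed strings

print(s_object.memory_usage(deep=True) / 1024**2, "MB")
print(s_arrow.memory_usage(deep=True) / 1024**2, "MB")  # typically several times smaller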

Interviewer's Insight

What they're testing: Cutting-edge Pandas.

Strong answer signals:

  • Knows Arrow for string efficiency
  • Uses for interop with Spark/Arrow
  • Understands when to use

Quick Reference: 127+ Interview Questions

Sno Question Title Practice Links Companies Asking Difficulty Topics
1 What is Pandas and why is it used? Pandas Docs Google, Amazon, Meta, Netflix Easy Basics, Introduction
2 Difference between Series and DataFrame GeeksforGeeks Google, Amazon, Meta, Microsoft Easy Data Structures
3 How to create a DataFrame from dictionary? Pandas Docs Amazon, Google, Flipkart Easy DataFrame Creation
4 Difference between loc and iloc Stack Overflow Google, Amazon, Meta, Apple, Netflix Easy Indexing, Selection
5 How to read CSV, Excel, JSON files? Pandas Docs Most Tech Companies Easy Data I/O
6 How to handle missing values (NaN)? Real Python Google, Amazon, Meta, Netflix, Apple Medium Missing Data, fillna, dropna
7 Difference between dropna() and fillna() Pandas Docs Amazon, Google, Microsoft Easy Missing Data
8 Explain GroupBy in Pandas Real Python Google, Amazon, Meta, Netflix, Apple Medium GroupBy, Aggregation
9 How to merge two DataFrames? Pandas Docs Google, Amazon, Meta, Microsoft Medium Merging, Joining
10 Difference between merge(), join(), concat() Stack Overflow Google, Amazon, Meta Medium Merging, Joining, Concatenation
11 How to apply a function to DataFrame? Pandas Docs Google, Amazon, Meta, Netflix Medium apply, applymap, map
12 Difference between apply(), map(), applymap() GeeksforGeeks Google, Amazon, Microsoft Medium Data Transformation
13 How to rename columns in DataFrame? Pandas Docs Most Tech Companies Easy Column Operations
14 How to sort DataFrame by column values? Pandas Docs Most Tech Companies Easy Sorting
15 How to filter rows based on conditions? Pandas Docs Google, Amazon, Meta, Netflix Easy Filtering, Boolean Indexing
16 How to remove duplicate rows? Pandas Docs Amazon, Google, Microsoft Easy Data Cleaning, Deduplication
17 How to change data types of columns? Pandas Docs Most Tech Companies Easy Data Types
18 What is the difference between copy() and view? Stack Overflow Google, Amazon, Meta Medium Memory Management
19 Explain pivot tables in Pandas Pandas Docs Amazon, Google, Microsoft, Netflix Medium Pivot Tables, Reshaping
20 Difference between pivot() and pivot_table() Stack Overflow Google, Amazon, Meta Medium Reshaping
21 How to handle datetime data in Pandas? Pandas Docs Google, Amazon, Netflix, Meta Medium DateTime, Time Series
22 How to create a date range? Pandas Docs Amazon, Netflix, Google Easy DateTime
23 What is MultiIndex (Hierarchical Indexing)? Pandas Docs Google, Amazon, Meta Hard MultiIndex, Hierarchical Data
24 How to reset and set index? Pandas Docs Most Tech Companies Easy Indexing
25 How to perform rolling window calculations? Pandas Docs Google, Amazon, Netflix, Apple Medium Rolling Windows, Time Series
26 How to calculate moving averages? GeeksforGeeks Google, Amazon, Netflix, Apple Medium Rolling Windows, Finance
27 How to perform resampling on time series? Pandas Docs Google, Amazon, Netflix Medium Resampling, Time Series
28 Difference between transform() and apply() Stack Overflow Google, Amazon, Meta Hard GroupBy, Data Transformation
29 How to create bins with cut() and qcut()? Pandas Docs Google, Amazon, Meta Medium Discretization, Binning
30 How to handle categorical data? Pandas Docs Google, Amazon, Meta, Netflix Medium Categorical Data, Memory
31 How to one-hot encode categorical data? Pandas Docs Google, Amazon, Meta, Microsoft Easy Feature Engineering, ML
32 How to read data from SQL database? Pandas Docs Amazon, Google, Microsoft Medium Database I/O
33 How to export DataFrame to various formats? Pandas Docs Most Tech Companies Easy Data Export
34 How to handle large datasets efficiently? Towards Data Science Google, Amazon, Netflix, Meta Hard Performance, Memory Optimization
35 What is Categorical dtype and when to use it? Pandas Docs Google, Amazon, Meta Medium Data Types, Memory Optimization
36 How to optimize memory usage in Pandas? Medium Google, Amazon, Netflix Hard Memory Optimization
37 Difference between inplace=True and returning copy Stack Overflow Most Tech Companies Easy DataFrame Modification
38 How to use query() method for filtering? Pandas Docs Google, Amazon, Meta Easy Filtering, Query
39 How to work with string data (str accessor)? Pandas Docs Google, Amazon, Meta, Netflix Medium String Operations
40 How to use str accessor methods? Pandas Docs Amazon, Google, Microsoft Medium String Operations
41 How to split and expand string columns? GeeksforGeeks Amazon, Google, Meta Medium String Operations, Data Cleaning
42 How to use melt() for unpivoting data? Pandas Docs Google, Amazon, Meta Medium Reshaping, Unpivoting
43 How to use stack() and unstack()? Pandas Docs Google, Amazon, Meta Medium Reshaping, MultiIndex
44 How to cross-tabulate with crosstab()? Pandas Docs Google, Amazon, Meta Medium Cross Tabulation, Analysis
45 How to calculate correlations? Pandas Docs Google, Amazon, Meta, Netflix Easy Statistical Analysis
46 How to calculate descriptive statistics? Pandas Docs Most Tech Companies Easy Statistical Analysis
47 How to use agg() for multiple aggregations? Pandas Docs Google, Amazon, Meta, Netflix Medium Aggregation
48 How to use named aggregations? Pandas Docs Google, Amazon, Meta Medium GroupBy, Named Aggregation
49 How to handle timezone-aware datetime? Pandas Docs Google, Amazon, Netflix Medium DateTime, Timezones
50 How to interpolate missing values? Pandas Docs Google, Amazon, Netflix Medium Missing Data, Interpolation
51 How to forward fill and backward fill? Pandas Docs Amazon, Netflix, Google Easy Missing Data, Time Series
52 How to use where() and mask() methods? Pandas Docs Google, Amazon, Meta Medium Conditional Operations
53 How to clip values in DataFrame? Pandas Docs Amazon, Google, Meta Easy Data Transformation
54 How to rank values in Pandas? Pandas Docs Google, Amazon, Meta, Netflix Easy Ranking
55 How to calculate percentage change? Pandas Docs Google, Amazon, Netflix, Apple Easy Time Series, Finance
56 How to shift and lag data? Pandas Docs Google, Amazon, Netflix Easy Time Series, Lag Features
57 How to calculate cumulative statistics? Pandas Docs Google, Amazon, Meta, Netflix Easy Cumulative Operations
58 How to use explode() for list columns? Pandas Docs Google, Amazon, Meta Medium List Operations, Data Preprocessing
59 How to sample data from DataFrame? Pandas Docs Google, Amazon, Meta, Netflix Easy Sampling
60 How to detect and handle outliers? Towards Data Science Google, Amazon, Meta, Netflix Medium Outlier Detection, Data Cleaning
61 How to normalize/standardize data? GeeksforGeeks Google, Amazon, Meta, Microsoft Medium Feature Engineering, ML
62 How to use eval() for efficient operations? Pandas Docs Google, Amazon, Meta Hard Performance Optimization
63 How to perform element-wise operations? Pandas Docs Most Tech Companies Easy Arithmetic Operations
64 Why vectorized operations are faster than loops? Real Python Google, Amazon, Meta Medium Performance, Vectorization
65 How to profile Pandas code performance? Pandas Docs Google, Amazon, Netflix Hard Performance Profiling
66 How to use pipe() for method chaining? Pandas Docs Google, Amazon, Meta Medium Method Chaining
67 How to handle SettingWithCopyWarning? Pandas Docs Google, Amazon, Meta, Microsoft Medium Common Errors, Debugging
68 How to compare two DataFrames? Pandas Docs Amazon, Google, Microsoft Medium Data Comparison, Validation
69 How to combine DataFrames with different schemas? Stack Overflow Google, Amazon, Meta Medium Merging, Schema Alignment
70 How to create conditional columns? GeeksforGeeks Most Tech Companies Easy Data Transformation
71 How to use np.where() with Pandas? Real Python Google, Amazon, Meta, Netflix Easy Conditional Operations
72 How to use np.select() for multiple conditions? Stack Overflow Google, Amazon, Meta Medium Conditional Operations
73 How to count value frequencies? Pandas Docs Most Tech Companies Easy Data Exploration
74 How to find unique values and nunique()? Pandas Docs Most Tech Companies Easy Data Exploration
75 How to check for null values? Pandas Docs Most Tech Companies Easy Missing Data
76 How to use any() and all() methods? Pandas Docs Google, Amazon, Meta Easy Boolean Operations
77 How to select specific columns? Pandas Docs Most Tech Companies Easy Column Selection
78 How to drop columns or rows? Pandas Docs Most Tech Companies Easy Data Cleaning
79 How to use assign() for creating new columns? Pandas Docs Google, Amazon, Meta Easy Column Creation
80 How to use idxmax() and idxmin()? Pandas Docs Google, Amazon, Meta, Netflix Easy Indexing
81 Why is iterating over rows slow? Stack Overflow Google, Amazon, Meta Medium Performance
82 How to use iterrows() and itertuples()? Pandas Docs Amazon, Google, Microsoft Easy Iteration
83 How to vectorize custom functions? Real Python Google, Amazon, Meta Hard Performance Optimization
84 How to use Pandas with NumPy? Pandas Docs Google, Amazon, Meta, Netflix Easy NumPy Integration
85 How to flatten hierarchical index? Stack Overflow Google, Amazon, Meta Medium MultiIndex
86 How to group by multiple columns? Pandas Docs Most Tech Companies Easy GroupBy
87 How to filter groups after GroupBy? Pandas Docs Google, Amazon, Meta Medium GroupBy, Filtering
88 How to get first/last n rows per group? Stack Overflow Google, Amazon, Meta, Netflix Medium GroupBy
89 How to handle JSON with nested structures? Pandas Docs Amazon, Google, Meta Medium JSON Processing
90 How to read/write Parquet files? Pandas Docs Google, Amazon, Netflix, Meta Easy File I/O, Big Data
91 Difference between Parquet, CSV, and Feather Towards Data Science Google, Amazon, Netflix Medium File Formats, Performance
92 How to use chunksize for large files? Pandas Docs Google, Amazon, Netflix, Meta Medium Large Data Processing
93 How to use nsmallest() and nlargest()? Pandas Docs Google, Amazon, Meta Easy Selection
94 How to calculate weighted average? Stack Overflow Google, Amazon, Netflix, Apple Medium Aggregation, Finance
95 How to perform window functions like SQL? Pandas Docs Google, Amazon, Meta, Netflix Medium Window Functions
96 How to join on nearest key (asof join)? Pandas Docs Google, Amazon, Netflix, Apple Hard Joining, Time Series
97 How to use combine_first() for data merging? Pandas Docs Amazon, Google, Microsoft Medium Merging
98 How to create period indices? Pandas Docs Google, Amazon, Netflix Medium Time Series
99 How to use Timedelta for time differences? Pandas Docs Google, Amazon, Netflix Easy DateTime
100 How to set display options globally? Pandas Docs Most Tech Companies Easy Display Options
101 What is method chaining and when to use it? Tom Augspurger Blog Google, Amazon, Meta Medium Method Chaining, Clean Code
102 How to calculate month-over-month change? StrataScratch Google, Amazon, Meta, Netflix Medium Time Series, Analytics
103 How to find customers with highest orders? DataLemur Amazon, Google, Meta, Netflix Medium GroupBy, Aggregation
104 [HARD] How to calculate retention metrics efficiently? StrataScratch Meta, Netflix, Amazon, Google Hard Cohort Analysis, Time Series
105 [HARD] How to implement A/B test analysis? Towards Data Science Meta, Google, Netflix, Amazon Hard Statistical Analysis, Testing
106 [HARD] How to optimize memory with category types? Pandas Docs Google, Amazon, Netflix Hard Memory Optimization
107 [HARD] How to implement cohort analysis? Towards Data Science Meta, Netflix, Amazon, Google Hard Cohort Analysis
108 [HARD] How to calculate funnel drop-off rates? StrataScratch Meta, Google, Amazon, Netflix Hard Funnel Analysis, Analytics
109 [HARD] How to implement custom testing using assert_frame_equal? Pandas Docs Google, Amazon, Microsoft Hard Testing, Quality
110 [HARD] How to handle sparse data structures? Pandas Docs Google, Amazon, Netflix Hard Sparse Data, Memory
111 [HARD] How to use Numba/JIT with Pandas? Pandas Docs Google, Amazon, Hedge Funds Hard Performance
112 [HARD] How to implement custom accessors? Pandas Docs Google, Amazon, Meta Hard Extending Pandas
113 [HARD] How to use Swifter for parallel processing? Swifter Docs Google, Amazon, Uber Hard Parallelism
114 [HARD] Explain Pandas Block Manager structure Pandas Wiki Google, Amazon, Meta Hard Internals
115 [HARD] How Copy-on-Write (CoW) works in Pandas 2.0+? Pandas Docs Google, Meta, Microsoft Hard Internals, Performance
116 [HARD] How to use PyArrow backend for performance? Pandas Docs Google, Amazon, Databricks Hard Performance, Arrow
117 [HARD] How to implement custom index types? Pandas Docs Google, Amazon Hard Extending Pandas
118 [HARD] How to optimize MultiIndex slicing performance? Pandas Docs Google, Amazon, Hedge Funds Hard Optimization
119 [HARD] groupby().transform() internal mechanics vs apply() Pandas Docs Google, Amazon, Meta Hard Deep Dive
120 [HARD] How to implement rolling window with raw=True? Pandas Docs Google, Amazon, Hedge Funds Hard Optimization
121 [HARD] How to extend Pandas with custom plotting backends? Pandas Docs Google, Amazon Hard Extending Pandas
122 [HARD] How to handle time series offset aliases? Pandas Docs Google, Amazon, Hedge Funds Hard Time Series
123 [HARD] How to use Dask DataFrames for out-of-core computing? Dask Docs Google, Amazon, Netflix Hard Big Data
124 [HARD] How to optimize chained assignment performance? Pandas Docs Google, Amazon, Meta Hard Optimization
125 [HARD] Nullable integers/floats implementation? Pandas Docs Google, Amazon, Microsoft Hard Internals
126 [HARD] How to use Cython with Pandas? Pandas Docs Google, Amazon, HFT Firms Hard Performance
127 [HARD] Comparison of Parquet vs Feather vs ORC? Apache Arrow Google, Amazon, Netflix Hard Systems

Code Examples

1. Memory Optimization

Difficulty: 🟒 Easy | Tags: Code Example | Asked by: Code Pattern

View Code Example
import pandas as pd
import numpy as np

# Typical large dataframe creation
df = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C'], size=1000000),
    'value': np.random.randn(1000000)
})

# Memory usage before optimization
print(df.memory_usage(deep=True).sum() / 1024**2, "MB")

# Optimize by converting object to category
df['category'] = df['category'].astype('category')

# Memory usage after optimization
print(df.memory_usage(deep=True).sum() / 1024**2, "MB")

2. Method Chaining for Clean Code

Difficulty: 🟒 Easy | Tags: Code Example | Asked by: Code Pattern

View Code Example
# Instead of multiple intermediate variables
df = (
    pd.read_csv('data.csv')
    .query('status == "active"')
    .assign(
        year=lambda x: pd.to_datetime(x['date']).dt.year,
        total_cost=lambda x: x['price'] * x['quantity']
    )
    .groupby(['year', 'region'])
    .agg(total_revenue=('total_cost', 'sum'))
    .reset_index()
    .sort_values('total_revenue', ascending=False)
)

3. Parallel Processing with Swifter

Difficulty: 🟒 Easy | Tags: Code Example | Asked by: Code Pattern

View Code Example
import pandas as pd
import swifter

df = pd.DataFrame({'text': ['some text'] * 100000})

def heavy_processing(text):
    # Simulate heavy work
    return text.upper()[::-1]

# Automatic parallelization
df['processed'] = df['text'].swifter.apply(heavy_processing)

Explain MultiIndex in Pandas - Google, Meta Interview Question

Difficulty: 🟑 Medium | Tags: MultiIndex, Hierarchical Data, Advanced Indexing | Asked by: Google, Meta, Amazon

View Answer

MultiIndex creates hierarchical row/column labels, enabling multi-dimensional data in 2D DataFrames. Essential for time series, grouped data, and pivot operations.

Complete Examples:

import pandas as pd
import numpy as np

# 1. Creating MultiIndex
# From tuples
index = pd.MultiIndex.from_tuples([
    ('USA', 'NY'), ('USA', 'CA'), ('UK', 'London'), ('UK', 'Manchester')
], names=['Country', 'City'])

df = pd.DataFrame({
    'Population': [8_000_000, 4_000_000, 9_000_000, 500_000],
    'GDP': [1500, 3000, 500, 80]
}, index=index)

print(df)

# 2. From product
idx = pd.MultiIndex.from_product(
    [['2023', '2024'], ['Q1', 'Q2', 'Q3', 'Q4']],
    names=['Year', 'Quarter']
)

sales_df = pd.DataFrame({
    'Revenue': np.random.randint(100, 500, 8),
    'Profit': np.random.randint(10, 100, 8)
}, index=idx)

# 3. Indexing with MultiIndex
# Access by level
print(df.loc['USA'])  # All USA cities
print(df.loc[('USA', 'NY')])  # Specific city

# Cross-section (xs)
print(df.xs('USA', level='Country'))
print(df.xs('NY', level='City'))

# 4. Slicing MultiIndex (label slicing requires a lexsorted index)
df_sorted = df.sort_index()

# Slice the first level
print(df_sorted.loc['UK':'USA'])

# Slice on full (Country, City) keys
print(df_sorted.loc[('UK', 'London'):('USA', 'CA')])

# 5. Stack and Unstack
# Create sample data
data = pd.DataFrame({
    'Year': [2023, 2023, 2024, 2024],
    'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'Revenue': [100, 150, 120, 180]
})

# Set multiindex
data_idx = data.set_index(['Year', 'Quarter'])

# Unstack: move index level to columns
unstacked = data_idx.unstack()
print(unstacked)

# Stack: move column level to index
stacked = unstacked.stack()
print(stacked)

# 6. GroupBy with MultiIndex
df_sales = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South'] * 2,
    'Product': ['A', 'B', 'A', 'B'] * 2,
    'Quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q2'],
    'Revenue': [100, 150, 120, 180, 110, 160, 130, 190]
})

# Create MultiIndex from groupby (select the numeric column to aggregate)
grouped = df_sales.groupby(['Region', 'Product'])['Revenue'].sum()
print(grouped)

# 7. Sorting MultiIndex
# Sort by all levels
df_sorted = df.sort_index()

# Sort by specific level
df_sorted = df.sort_index(level='City')

# Sort descending
df_sorted = df.sort_index(ascending=False)

# 8. Swapping Levels
# Swap index levels
df_swapped = df.swaplevel('Country', 'City')
print(df_swapped)

# 9. Resetting Index
# Remove MultiIndex
df_reset = df.reset_index()
print(df_reset)

# Keep one level
df_reset_partial = df.reset_index(level='City')
print(df_reset_partial)

# 10. Advanced Operations
# Aggregate across index levels (df.sum(level=...) was removed in pandas 2.0; use groupby on the level)
print(df.groupby(level='Country').sum())

# Mean per level
print(df.groupby(level='City').mean())

# Apply function per level
result = df.groupby(level='Country').agg({
    'Population': 'sum',
    'GDP': 'mean'
})
print(result)

Creating MultiIndex:

Method Use Case Example
from_tuples Explicit labels pd.MultiIndex.from_tuples([('A', 1)])
from_product Cartesian product pd.MultiIndex.from_product([['A', 'B'], [1, 2]])
from_arrays Separate arrays pd.MultiIndex.from_arrays([['A', 'B'], [1, 2]])
set_index From columns df.set_index(['col1', 'col2'])

Common Operations:

Operation Description Example
loc Access by labels df.loc[('USA', 'NY')]
xs Cross-section df.xs('USA', level=0)
unstack Level to columns df.unstack()
stack Columns to level df.stack()
swaplevel Swap index levels df.swaplevel(0, 1)
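
One tool the table leaves out is pd.IndexSlice, the idiomatic way to slice inner levels without xs(). A short sketch on the Country/City frame above (the index must be sorted first):

idx = pd.IndexSlice
df_sorted = df.sort_index()   # IndexSlice-based slicing needs a lexsorted MultiIndex

# All rows where the inner level (City) is 'NY', any Country
print(df_sorted.loc[idx[:, 'NY'], :])

# Outer-level range plus a list of inner-level labels
print(df_sorted.loc[idx['UK':'USA', ['London', 'NY']], :])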

Interviewer's Insight

What they're testing: Advanced indexing and hierarchical data structures.

Strong answer signals:

  • Knows from_tuples, from_product, set_index methods
  • Can access specific levels with loc and xs()
  • Understands stack/unstack transformations
  • Mentions GroupBy creates MultiIndex naturally
  • Uses sort_index(level=...) for sorting
  • Applies to time series and pivot tables

Explain pd.merge() Parameters - Amazon, Microsoft Interview Question

Difficulty: 🟑 Medium | Tags: Merge, Join, Data Combination | Asked by: Amazon, Microsoft, Google, Meta

View Answer

pd.merge() combines DataFrames using SQL-style joins. Parameters: how (join type), on (keys), left_on/right_on, suffixes, indicator.

Complete Examples:

import pandas as pd

# Sample data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['NYC', 'LA', 'Chicago', 'Boston']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 5, 1],
    'amount': [100, 200, 150, 300, 50]
})

# 1. Inner Join (default)
inner = pd.merge(customers, orders, on='customer_id', how='inner')
print("Inner join:")
print(inner)
# Only matching records (customer_id 1, 2)

# 2. Left Join
left = pd.merge(customers, orders, on='customer_id', how='left')
print("\nLeft join:")
print(left)
# All customers, NaN for unmatched orders

# 3. Right Join
right = pd.merge(customers, orders, on='customer_id', how='right')
print("\nRight join:")
print(right)
# All orders, NaN for unmatched customers

# 4. Outer Join
outer = pd.merge(customers, orders, on='customer_id', how='outer')
print("\nOuter join:")
print(outer)
# All records, NaN where no match

# 5. Different Column Names
products = pd.DataFrame({
    'product_id': [1, 2, 3],
    'name': ['Widget', 'Gadget', 'Tool']
})

order_items = pd.DataFrame({
    'order_id': [101, 102],
    'prod_id': [1, 2],
    'quantity': [5, 3]
})

merged = pd.merge(
    products, order_items,
    left_on='product_id',
    right_on='prod_id',
    how='inner'
)
print("\nDifferent column names:")
print(merged)

# 6. Multiple Keys
df1 = pd.DataFrame({
    'year': [2023, 2023, 2024],
    'quarter': ['Q1', 'Q2', 'Q1'],
    'revenue': [100, 150, 120]
})

df2 = pd.DataFrame({
    'year': [2023, 2023, 2024],
    'quarter': ['Q1', 'Q2', 'Q2'],
    'profit': [20, 30, 25]
})

multi_key = pd.merge(df1, df2, on=['year', 'quarter'], how='outer')
print("\nMultiple keys:")
print(multi_key)

# 7. Suffixes for Duplicate Columns
left_df = pd.DataFrame({
    'key': ['A', 'B'],
    'value': [1, 2],
    'extra': [10, 20]
})

right_df = pd.DataFrame({
    'key': ['A', 'C'],
    'value': [3, 4],
    'extra': [30, 40]
})

with_suffixes = pd.merge(
    left_df, right_df,
    on='key',
    how='outer',
    suffixes=('_left', '_right')
)
print("\nWith suffixes:")
print(with_suffixes)

# 8. Indicator Column
with_indicator = pd.merge(
    customers, orders,
    on='customer_id',
    how='outer',
    indicator=True
)
print("\nWith indicator:")
print(with_indicator)
# Shows source of each row

# Custom indicator name
custom_indicator = pd.merge(
    customers, orders,
    on='customer_id',
    how='outer',
    indicator='source'
)

# 9. Index-based Merge
left_idx = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
right_idx = pd.DataFrame({'B': [3, 4]}, index=['a', 'c'])

idx_merge = pd.merge(
    left_idx, right_idx,
    left_index=True,
    right_index=True,
    how='outer'
)
print("\nIndex merge:")
print(idx_merge)

# 10. Validate Merge
# Ensure one-to-one merge
try:
    pd.merge(
        customers, orders,
        on='customer_id',
        validate='one_to_one'  # Will fail (one customer, many orders)
    )
except Exception as e:
    print(f"Validation error: {e}")

# Valid one-to-many
valid = pd.merge(
    customers, orders,
    on='customer_id',
    validate='one_to_many'  # OK
)

Join Types:

how SQL Equivalent Description Use Case
inner INNER JOIN Only matches Common records
left LEFT JOIN All left + matches Keep all left
right RIGHT JOIN All right + matches Keep all right
outer FULL OUTER JOIN All records Keep everything
cross CROSS JOIN Cartesian product All combinations
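
The cross join is the only type above not shown in the code. A minimal sketch (available since pandas 1.2); the two small frames are illustrative:

colors = pd.DataFrame({'color': ['red', 'blue']})
sizes = pd.DataFrame({'size': ['S', 'M', 'L']})

# Every color paired with every size: 2 x 3 = 6 rows
combos = pd.merge(colors, sizes, how='cross')
print(combos)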

Key Parameters:

Parameter Description Example
on Common column(s) on='customer_id'
left_on/right_on Different names left_on='id', right_on='cust_id'
how Join type how='left'
suffixes Duplicate columns suffixes=('_x', '_y')
indicator Show merge source indicator=True
validate Check cardinality validate='one_to_one'

Interviewer's Insight

What they're testing: SQL join knowledge and data combination strategies.

Strong answer signals:

  • Knows all join types: inner, left, right, outer
  • Uses left_on/right_on for different column names
  • Applies suffixes to avoid duplicate columns
  • Mentions indicator=True for tracking merge source
  • Knows validate parameter for cardinality checks
  • Can merge on index with left_index/right_index

Explain apply() vs map() vs applymap() - Google, Netflix Interview Question

Difficulty: 🟑 Medium | Tags: apply, map, Performance | Asked by: Google, Netflix, Meta, Amazon

View Answer

map() for Series element-wise. apply() for Series/DataFrame row/column-wise. applymap() (deprecated) for DataFrame element-wise. Use vectorized operations when possible.

Complete Examples:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40],
    'C': ['a', 'b', 'c', 'd']
})

# 1. Series.map() - Element-wise
# Map with dict
df['C_mapped'] = df['C'].map({'a': 'apple', 'b': 'banana', 'c': 'cherry'})
print(df)

# Map with function
df['A_squared'] = df['A'].map(lambda x: x ** 2)

# 2. Series.apply() - Element-wise (like map)
df['A_cubed'] = df['A'].apply(lambda x: x ** 3)

# More complex function
def categorize(value):
    if value < 2:
        return 'low'
    elif value < 4:
        return 'medium'
    else:
        return 'high'

df['category'] = df['A'].apply(categorize)

# 3. DataFrame.apply() - Row/Column-wise
# Apply to columns (axis=0, default)
col_sums = df[['A', 'B']].apply(sum)
print("Column sums:", col_sums)

# Apply to rows (axis=1)
row_sums = df[['A', 'B']].apply(sum, axis=1)
print("Row sums:", row_sums)

# Return Series from apply
def compute_stats(row):
    return pd.Series({
        'sum': row['A'] + row['B'],
        'product': row['A'] * row['B']
    })

stats = df.apply(compute_stats, axis=1)
print("\nStats:")
print(stats)

# 4. DataFrame.applymap() - Element-wise (deprecated)
# Use df.map() in pandas 2.1+
numeric_df = df[['A', 'B']]

# Old way (deprecated):
# result = numeric_df.applymap(lambda x: x * 2)

# New way:
result = numeric_df.map(lambda x: x * 2)
print("\nElement-wise operation:")
print(result)

# 5. Performance Comparison
import time

df_large = pd.DataFrame({
    'A': np.random.randint(0, 100, 100000),
    'B': np.random.randint(0, 100, 100000)
})

# Vectorized (fastest)
start = time.time()
result_vec = df_large['A'] * 2 + df_large['B']
vec_time = time.time() - start
print(f"Vectorized: {vec_time:.4f}s")

# apply (slower)
start = time.time()
result_apply = df_large.apply(lambda row: row['A'] * 2 + row['B'], axis=1)
apply_time = time.time() - start
print(f"apply(): {apply_time:.4f}s ({apply_time/vec_time:.0f}x slower)")

# 6. When to Use Each
# map: Series, simple transformations, dict mapping
status_map = {1: 'active', 2: 'inactive', 3: 'pending'}
df['status'] = df['A'].map(status_map)

# apply: Series, more complex functions, multiple outputs
def complex_transform(x):
    return x ** 2 if x % 2 == 0 else x ** 3

df['complex'] = df['A'].apply(complex_transform)

# DataFrame.apply: row/column aggregations, multi-column operations
def calculate_score(row):
    base = row['A'] * row['B']
    bonus = 10 if row['A'] > 2 else 0
    return base + bonus

df['score'] = df.apply(calculate_score, axis=1)

# 7. Vectorized Alternative (Best Performance)
# Instead of apply for simple operations:
# Slow:
# df['sum'] = df.apply(lambda row: row['A'] + row['B'], axis=1)

# Fast:
df['sum'] = df['A'] + df['B']

# Instead of apply with conditions:
# Slow:
# df['flag'] = df['A'].apply(lambda x: 'high' if x > 2 else 'low')

# Fast:
df['flag'] = np.where(df['A'] > 2, 'high', 'low')

# 8. Result Type Parameter
df_nums = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Return DataFrame
result_df = df_nums.apply(lambda x: x * 2, result_type='expand')
print(result_df)

# Broadcast a per-row result back to the original shape
# (the returned list length must match the number of columns)
result_broadcast = df_nums.apply(lambda x: [0, 0], axis=1, result_type='broadcast')
print(result_broadcast)

Method Comparison:

Method Scope Axis Use Case Speed
Series.map() Element N/A Dict mapping, simple transform Medium
Series.apply() Element N/A Complex functions Medium
DataFrame.apply() Row/Column 0/1 Multi-column, aggregations Slow
DataFrame.map() Element N/A Element-wise transform Medium
Vectorized All N/A Math operations Fast

Performance Tips:

Slow Fast Speedup
df.apply(lambda r: r['A']+r['B'], axis=1) df['A'] + df['B'] 100x+
df['A'].apply(lambda x: x*2) df['A'] * 2 50x+
df['A'].apply(lambda x: 'Y' if x>5 else 'N') np.where(df['A']>5, 'Y', 'N') 20x+
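
For more than two branches, np.select is the vectorized counterpart of an if/elif chain inside apply. A short sketch reproducing the categorize() logic above (the new column name is illustrative):

conditions = [df['A'] < 2, df['A'] < 4]
choices = ['low', 'medium']

# Vectorized equivalent of the categorize() function used with apply above
df['category_vec'] = np.select(conditions, choices, default='high')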

Interviewer's Insight

What they're testing: Performance optimization and function application methods.

Strong answer signals:

  • Distinguishes map() for Series vs apply() for both
  • Knows applymap() is deprecated (use df.map())
  • Understands axis=0 (columns) vs axis=1 (rows)
  • Prefers vectorized operations (100x+ faster)
  • Uses np.where() instead of apply for conditionals
  • Only uses apply when vectorization impossible

Explain pd.cut() vs pd.qcut() - Amazon, Google Interview Question

Difficulty: 🟑 Medium | Tags: Binning, Discretization, Data Preparation | Asked by: Amazon, Google, Meta

View Answer

pd.cut() bins data into equal-width intervals. pd.qcut() bins into equal-frequency quantiles. Use cut for fixed ranges, qcut for balanced distribution.

Complete Examples:

import pandas as pd
import numpy as np

# Sample data
ages = np.array([18, 22, 25, 27, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80])
scores = np.array([45, 67, 89, 72, 81, 55, 93, 60, 77, 85, 92, 68, 74, 82, 90])

# 1. pd.cut() - Equal-width Bins
# Create 3 bins of equal width
age_bins = pd.cut(ages, bins=3)
print("cut() with 3 bins:")
print(age_bins)
print(age_bins.value_counts())

# 2. pd.cut() with Custom Labels
age_categories = pd.cut(
    ages,
    bins=[0, 30, 50, 100],
    labels=['Young', 'Middle-aged', 'Senior']
)
print("\nCustom labels:")
print(age_categories)

# 3. pd.cut() with right=False
# Default: intervals are right-closed (a, b]
# With right=False: intervals are left-closed [a, b)
bins_left = pd.cut(ages, bins=3, right=False)
print("\nLeft-closed intervals:")
print(bins_left)

# 4. pd.qcut() - Equal-frequency Quantiles
# Create 4 quantiles (quartiles)
score_quartiles = pd.qcut(scores, q=4)
print("\nqcut() quartiles:")
print(score_quartiles)
print(score_quartiles.value_counts())

# 5. pd.qcut() with Custom Labels
score_categories = pd.qcut(
    scores,
    q=4,
    labels=['Low', 'Medium', 'High', 'Very High']
)
print("\nQuantile labels:")
print(score_categories)

# 6. Specific Quantiles
# Deciles (10 groups)
deciles = pd.qcut(scores, q=10)
print("\nDeciles:")
print(deciles.value_counts())

# Custom quantiles
custom_quantiles = pd.qcut(
    scores,
    q=[0, 0.25, 0.5, 0.75, 1.0],
    labels=['Q1', 'Q2', 'Q3', 'Q4']
)

# 7. Include Lowest
# Include lowest value in first bin
with_lowest = pd.cut(
    ages,
    bins=[18, 30, 50, 80],
    include_lowest=True
)
print("\nInclude lowest:")
print(with_lowest)

# 8. Duplicate Edges Handling
# When data has duplicates, qcut may fail
data_with_dups = np.array([1, 1, 1, 1, 2, 3, 4, 5])

# This would fail:
# pd.qcut(data_with_dups, q=4)

# Use duplicates='drop'
safe_qcut = pd.qcut(data_with_dups, q=4, duplicates='drop')
print("\nWith duplicates:")
print(safe_qcut)

# 9. Return Bins
# Get bin edges
values, bin_edges = pd.cut(ages, bins=3, retbins=True)
print("\nBin edges:")
print(bin_edges)

# 10. Practical Example: Customer Segmentation
df = pd.DataFrame({
    'customer_id': range(1, 16),
    'age': ages,
    'purchase_amount': scores * 10,
    'frequency': np.random.randint(1, 20, 15)
})

# Age groups (equal width)
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 30, 50, 100],
    labels=['Young', 'Middle', 'Senior']
)

# Spending tiers (equal frequency)
df['spending_tier'] = pd.qcut(
    df['purchase_amount'],
    q=3,
    labels=['Low', 'Medium', 'High']
)

# Frequency quintiles
df['freq_quintile'] = pd.qcut(
    df['frequency'],
    q=5,
    labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'],
    duplicates='drop'
)

print("\nCustomer Segmentation:")
print(df)

# Analysis by segment
print("\nAverage purchase by age group:")
print(df.groupby('age_group')['purchase_amount'].mean())

print("\nCustomers per spending tier:")
print(df['spending_tier'].value_counts())

cut() vs qcut():

Aspect pd.cut() pd.qcut()
Method Equal-width intervals Equal-frequency quantiles
Bins Same size ranges Same number of items
Use Case Fixed ranges (e.g., age groups) Balanced distribution
Example 0-20, 20-40, 40-60 Each bin has ~same count
Distribution Uneven counts Even counts

Key Parameters:

Parameter cut() qcut() Description
bins/q βœ… βœ… Number of bins/quantiles
labels βœ… βœ… Custom category names
right βœ… βœ… Right-closed intervals (default True)
include_lowest βœ… βœ… Include smallest value
duplicates ❌ βœ… Handle duplicate edges
retbins βœ… βœ… Return bin edges
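
A related pattern worth mentioning: retbins=True lets you learn bin edges on one dataset and reuse them on another (e.g. train vs. test) so both are bucketed identically. A minimal sketch with illustrative arrays:

train_scores = np.array([45, 67, 89, 72, 81, 55, 93])
test_scores = np.array([50, 70, 95])

# Learn quantile edges on the training data...
_, edges = pd.qcut(train_scores, q=3, retbins=True)

# ...and reuse the same edges on new data so the bins line up
test_binned = pd.cut(test_scores, bins=edges, include_lowest=True,
                     labels=['Low', 'Medium', 'High'])
print(test_binned)  # values outside the training range (e.g. 95) become NaN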

Interviewer's Insight

What they're testing: Data discretization and binning strategies.

Strong answer signals:

  • Knows cut() creates equal-width intervals
  • Knows qcut() creates equal-frequency quantiles
  • Uses cut for fixed ranges (age groups, scores)
  • Uses qcut for balanced distributions (percentiles)
  • Applies custom labels parameter
  • Handles duplicates with duplicates='drop' in qcut

Explain Categorical Data Type - Meta, Netflix Interview Question

Difficulty: 🟑 Medium | Tags: Categorical, Memory Optimization, Performance | Asked by: Meta, Netflix, Google, Amazon

View Answer

Categorical dtype stores repeated string values as integers with mapping, reducing memory usage by 90%+ for low-cardinality columns. Essential for large datasets.

Complete Examples:

import pandas as pd
import numpy as np

# 1. Creating Categorical
# From list
colors = pd.Categorical(['red', 'blue', 'red', 'green', 'blue', 'red'])
print("Categorical:", colors)
print("Categories:", colors.categories)
print("Codes:", colors.codes)

# 2. DataFrame Column
df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A'] * 10000,
    'size': ['S', 'M', 'L', 'M', 'S', 'L'] * 10000
})

# Memory before
memory_before = df.memory_usage(deep=True).sum() / 1024**2
print(f"\nMemory before: {memory_before:.2f} MB")

# Convert to categorical
df['product'] = df['product'].astype('category')
df['size'] = df['size'].astype('category')

# Memory after
memory_after = df.memory_usage(deep=True).sum() / 1024**2
print(f"Memory after: {memory_after:.2f} MB")
print(f"Reduction: {(1 - memory_after/memory_before)*100:.1f}%")

# 3. Ordered Categories
sizes = pd.Categorical(
    ['S', 'M', 'L', 'M', 'S', 'XL'],
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
)

print("\nOrdered categorical:")
print(sizes)
print("S < M:", sizes[0] < sizes[1])

# 4. Adding/Removing Categories
cat = pd.Categorical(['a', 'b', 'c'])

# Add category
cat = cat.add_categories(['d'])
print("\nAdded category:", cat.categories)

# Remove category
cat = cat.remove_categories(['d'])
print("Removed category:", cat.categories)

# Set categories
cat = cat.set_categories(['a', 'b', 'c', 'd', 'e'])
print("Set categories:", cat.categories)

# 5. Reordering Categories
df_survey = pd.DataFrame({
    'satisfaction': ['Good', 'Bad', 'Excellent', 'Good', 'Bad', 'Fair']
})

df_survey['satisfaction'] = pd.Categorical(
    df_survey['satisfaction'],
    categories=['Bad', 'Fair', 'Good', 'Excellent'],
    ordered=True
)

# Now can sort meaningfully
df_sorted = df_survey.sort_values('satisfaction')
print("\nSorted by category order:")
print(df_sorted)

# 6. Groupby with Categories
df_sales = pd.DataFrame({
    'region': pd.Categorical(['North', 'South', 'North', 'East']),
    'sales': [100, 200, 150, 180]
})

# Add unused category
df_sales['region'] = df_sales['region'].cat.add_categories(['West'])

# Groupby includes empty categories
grouped = df_sales.groupby('region', observed=False).sum()
print("\nGroupby with empty category:")
print(grouped)

# Exclude empty categories
grouped_observed = df_sales.groupby('region', observed=True).sum()
print("\nGroupby observed only:")
print(grouped_observed)

# 7. Renaming Categories
df_status = pd.DataFrame({
    'status': pd.Categorical(['A', 'B', 'A', 'C'])
})

# Rename
df_status['status'] = df_status['status'].cat.rename_categories({
    'A': 'Active',
    'B': 'Blocked',
    'C': 'Closed'
})
print("\nRenamed categories:")
print(df_status)

# 8. Performance Comparison
# String operations
df_str = pd.DataFrame({
    'category': ['A', 'B', 'C'] * 100000
})

df_cat = df_str.copy()
df_cat['category'] = df_cat['category'].astype('category')

# Comparison speed
import time

start = time.time()
_ = df_str['category'] == 'A'
str_time = time.time() - start

start = time.time()
_ = df_cat['category'] == 'A'
cat_time = time.time() - start

print(f"\nString comparison: {str_time:.4f}s")
print(f"Categorical comparison: {cat_time:.4f}s")
print(f"Speedup: {str_time/cat_time:.1f}x")

# 9. When to Use Categorical
# Good: low cardinality (few unique values)
df_good = pd.DataFrame({
    'country': ['USA', 'UK', 'Canada'] * 10000  # 3 unique
})
df_good['country'] = df_good['country'].astype('category')

# Bad: high cardinality (many unique values)
df_bad = pd.DataFrame({
    'user_id': [f'user_{i}' for i in range(10000)]  # All unique
})
# Don't convert to categorical - no benefit!

# 10. Practical Example: Survey Data
df_survey = pd.DataFrame({
    'age_group': pd.Categorical(
        ['18-25', '26-35', '18-25', '36-45', '26-35'],
        categories=['18-25', '26-35', '36-45', '46-55', '55+'],
        ordered=True
    ),
    'satisfaction': pd.Categorical(
        ['Good', 'Bad', 'Excellent', 'Good', 'Fair'],
        categories=['Bad', 'Fair', 'Good', 'Excellent'],
        ordered=True
    ),
    'product': pd.Categorical(['A', 'B', 'A', 'C', 'B'])
})

# Efficient aggregation
result = df_survey.groupby(['age_group', 'product'], observed=False).size()
print("\nSurvey analysis:")
print(result)

Benefits:

Benefit Description Example Savings
Memory Stores as int codes 90%+ reduction
Speed Faster comparisons 2-10x faster
Ordering Meaningful sorts Survey ratings
Groupby Include empty categories Complete reports

When to Use:

Use Categorical Don't Use Categorical
Low cardinality (<50% unique) High cardinality (90%+ unique)
Repeated values Mostly unique values
Surveys, ratings User IDs, timestamps
Country, department Email addresses
Fixed set of values Free-form text

Key Operations:

Operation Method Example
Create astype('category') df['col'].astype('category')
Add category .cat.add_categories() .cat.add_categories(['new'])
Remove .cat.remove_categories() .cat.remove_categories(['old'])
Rename .cat.rename_categories() .cat.rename_categories({'A': 'Active'})
Reorder .cat.reorder_categories() .cat.reorder_categories(['S', 'M', 'L'])
Set ordered .cat.as_ordered() .cat.as_ordered()
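
When the same column lives in several DataFrames (or in train/test splits), defining a shared CategoricalDtype keeps the categories and integer codes consistent across them. A minimal sketch with illustrative frames:

from pandas.api.types import CategoricalDtype

size_type = CategoricalDtype(categories=['S', 'M', 'L', 'XL'], ordered=True)

df_train = pd.DataFrame({'size': ['S', 'M', 'L']}).astype({'size': size_type})
df_test = pd.DataFrame({'size': ['M', 'XL']}).astype({'size': size_type})

# Identical categories => identical integer codes in both frames; unseen values would become NaN
print(df_train['size'].cat.codes.tolist())  # [0, 1, 2]
print(df_test['size'].cat.codes.tolist())   # [1, 3]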

Interviewer's Insight

What they're testing: Memory optimization and categorical data handling.

Strong answer signals:

  • Knows categorical stores strings as integer codes with mapping
  • Mentions 90%+ memory reduction for low-cardinality columns
  • Uses ordered=True for meaningful sorting (e.g., ratings)
  • Applies when <50% unique values with repetition
  • Avoids for high cardinality data (unique IDs)
  • Uses .cat accessor for category management
  • Sets observed=False to include empty categories in groupby

Explain Window Functions (Rolling, Expanding, EWMA) - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Window Functions, Time Series, Rolling Statistics | Asked by: Google, Amazon, Meta, Netflix

View Answer

Window functions compute statistics over sliding, expanding, or exponentially weighted windows. Essential for time series analysis, moving averages, and trend detection.

Complete Examples:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample time series data
dates = pd.date_range('2024-01-01', periods=100, freq='D')
df = pd.DataFrame({
    'date': dates,
    'sales': np.random.randint(80, 120, 100) + np.sin(np.arange(100) * 0.1) * 20,
    'temperature': np.random.normal(70, 10, 100)
})
df.set_index('date', inplace=True)

# 1. Rolling Window (Moving Average)
# 7-day moving average
df['sales_ma7'] = df['sales'].rolling(window=7).mean()

# 30-day moving average
df['sales_ma30'] = df['sales'].rolling(window=30).mean()

print("Rolling means:")
print(df[['sales', 'sales_ma7', 'sales_ma30']].head(35))

# 2. Rolling with min_periods
# Calculate even with fewer than 7 values
df['sales_ma7_min'] = df['sales'].rolling(window=7, min_periods=1).mean()

# 3. Rolling Aggregations
df['sales_roll_std'] = df['sales'].rolling(window=7).std()
df['sales_roll_min'] = df['sales'].rolling(window=7).min()
df['sales_roll_max'] = df['sales'].rolling(window=7).max()
df['sales_roll_sum'] = df['sales'].rolling(window=7).sum()

# Multiple aggregations
rolling_stats = df['sales'].rolling(window=7).agg(['mean', 'std', 'min', 'max'])
print("\nRolling statistics:")
print(rolling_stats.head(10))

# 4. Expanding Window
# Cumulative average (all data up to current point)
df['sales_cumavg'] = df['sales'].expanding().mean()

# Cumulative sum
df['sales_cumsum'] = df['sales'].expanding().sum()

# Cumulative max
df['sales_cummax'] = df['sales'].expanding().max()

print("\nExpanding statistics:")
print(df[['sales', 'sales_cumavg', 'sales_cumsum', 'sales_cummax']].head(10))

# 5. Exponentially Weighted Moving Average (EWMA)
# More weight to recent values
df['sales_ewma'] = df['sales'].ewm(span=7).mean()

# Different smoothing parameters
df['sales_ewm_fast'] = df['sales'].ewm(span=3).mean()  # Fast response
df['sales_ewm_slow'] = df['sales'].ewm(span=14).mean()  # Slow response

# Alpha parameter (0-1)
df['sales_ewm_alpha'] = df['sales'].ewm(alpha=0.3).mean()

# 6. Center Parameter
# Center the window (for smoothing, not forecasting)
df['sales_centered'] = df['sales'].rolling(window=7, center=True).mean()

print("\nCentered vs non-centered:")
print(df[['sales', 'sales_ma7', 'sales_centered']].iloc[3:8])

# 7. Rolling with Custom Function
def range_ratio(x):
    """Range divided by mean."""
    return (x.max() - x.min()) / x.mean() if x.mean() != 0 else 0

df['sales_volatility'] = df['sales'].rolling(window=7).apply(range_ratio)

# 8. Time-based Window
# 7-day window (calendar days, not data points)
df['sales_7d'] = df['sales'].rolling('7D').mean()

# 14-day window
df['sales_14d'] = df['sales'].rolling('14D').mean()

# 9. Shift and Diff
# Previous value
df['sales_prev'] = df['sales'].shift(1)

# Change from previous day
df['sales_change'] = df['sales'].diff()

# Percentage change
df['sales_pct_change'] = df['sales'].pct_change()

# Change from 7 days ago
df['sales_change_7d'] = df['sales'].diff(7)

print("\nChanges:")
print(df[['sales', 'sales_prev', 'sales_change', 'sales_pct_change']].head(10))

# 10. Practical Example: Anomaly Detection
# Bollinger Bands
window = 20
df['bb_mid'] = df['sales'].rolling(window=window).mean()
df['bb_std'] = df['sales'].rolling(window=window).std()
df['bb_upper'] = df['bb_mid'] + (df['bb_std'] * 2)
df['bb_lower'] = df['bb_mid'] - (df['bb_std'] * 2)

# Detect anomalies
df['anomaly'] = ((df['sales'] > df['bb_upper']) | 
                 (df['sales'] < df['bb_lower']))

print("\nAnomalies detected:")
print(df[df['anomaly']][['sales', 'bb_upper', 'bb_lower']])

# 11. Multi-column Rolling
# Rolling correlation
df['corr_sales_temp'] = df['sales'].rolling(window=30).corr(df['temperature'])

# 12. GroupBy + Rolling
# Add categories
df['category'] = np.random.choice(['A', 'B'], len(df))

# Rolling average per category
df['sales_ma_by_cat'] = df.groupby('category')['sales'].rolling(window=7).mean().reset_index(0, drop=True)

Window Types:

Type Method Use Case Example
Rolling .rolling(window) Moving average 7-day MA
Expanding .expanding() Cumulative stats Running total
EWMA .ewm(span/alpha) Weighted average Trend smoothing

Common Rolling Operations:

Operation Method Description
Mean .rolling(n).mean() Moving average
Sum .rolling(n).sum() Moving total
Std .rolling(n).std() Moving volatility
Min/Max .rolling(n).min/max() Moving range
Corr .rolling(n).corr() Moving correlation
Apply .rolling(n).apply(func) Custom function

Key Parameters:

Parameter Description Example
window Window size window=7
min_periods Min values needed min_periods=1
center Center the window center=True
span EWMA span span=7
alpha EWMA smoothing alpha=0.3
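
One performance lever not shown above: passing raw=True to rolling().apply() hands the function a plain NumPy array instead of a Series, which is usually noticeably faster for numeric work. A small sketch reusing range_ratio from above (the new column name is illustrative):

# Same custom function, but each window arrives as a NumPy array instead of a Series
df['sales_volatility_fast'] = df['sales'].rolling(window=7).apply(range_ratio, raw=True)

# If numba is installed, built-in rolling aggregations can also use it:
# df['sales'].rolling(window=7).mean(engine='numba')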

Interviewer's Insight

  • "rolling(): moving average over fixed window"
  • "expanding(): cumulative statistics"
  • "ewm(): exponentially weighted (recent values weighted more)"
  • "min_periods: handle NaN at start"
  • "center=True: for smoothing (not forecasting)"
  • "Time-based: rolling('7D') for calendar days"
  • "Shift: df['prev'] = df['col'].shift(1)"
  • "Diff: df['change'] = df['col'].diff()"
  • "Use for: moving averages, anomaly detection, trend analysis"

Explain DataFrame Indexing (.loc vs .iloc vs .at vs .iat) - Microsoft, Amazon Interview Question

Difficulty: 🟒 Easy | Tags: Indexing, DataFrame Access, Performance | Asked by: Microsoft, Amazon, Google, Meta

View Answer

.loc accesses by label, .iloc by integer position, .at/.iat for single values (faster). Use loc for label-based, iloc for position-based indexing.

Complete Examples:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['a', 'b', 'c', 'd', 'e']
}, index=['row1', 'row2', 'row3', 'row4', 'row5'])

# 1. .loc - Label-based
# Single value
print(df.loc['row1', 'A'])  # 1

# Row
print(df.loc['row2'])  # Series

# Multiple rows
print(df.loc[['row1', 'row3']])

# Slice (inclusive)
print(df.loc['row1':'row3'])  # Includes row3!

# Boolean indexing
print(df.loc[df['A'] > 2])

# Multiple columns
print(df.loc['row1', ['A', 'B']])

# 2. .iloc - Position-based
# Single value
print(df.iloc[0, 0])  # 1

# Row by position
print(df.iloc[1])  # Second row

# Multiple rows
print(df.iloc[[0, 2, 4]])

# Slice (exclusive end)
print(df.iloc[0:3])  # Rows 0, 1, 2 (not 3)

# Negative indexing
print(df.iloc[-1])  # Last row

# 2D slicing
print(df.iloc[1:3, 0:2])  # Rows 1-2, columns 0-1

# 3. .at - Fast single value (label)
value = df.at['row1', 'A']
print(value)  # 1

# Set value
df.at['row1', 'A'] = 100

# 4. .iat - Fast single value (position)
value = df.iat[0, 0]
print(value)  # 100

# Set value
df.iat[0, 0] = 1

# 5. Performance Comparison
import time

large_df = pd.DataFrame(np.random.rand(10000, 100))

# .loc timing
start = time.time()
for _ in range(1000):
    _ = large_df.loc[500, 50]
loc_time = time.time() - start

# .at timing (faster)
start = time.time()
for _ in range(1000):
    _ = large_df.at[500, 50]
at_time = time.time() - start

print(f".loc: {loc_time:.4f}s")
print(f".at: {at_time:.4f}s ({loc_time/at_time:.1f}x slower)")

# 6. Setting Values
# Set single value
df.loc['row1', 'A'] = 999

# Set multiple values
df.loc['row1', ['A', 'B']] = [1, 2]

# Set column
df.loc[:, 'A'] = 0

# Set with condition
df.loc[df['A'] > 2, 'B'] = 100

# 7. Adding Rows/Columns
# Add row
df.loc['row6'] = [6, 60, 'f']

# Add column
df.loc[:, 'D'] = [1, 2, 3, 4, 5, 6]

# 8. Common Pitfalls
# Chain indexing (bad!)
# df[df['A'] > 2]['B'] = 999  # SettingWithCopyWarning

# Correct way
df.loc[df['A'] > 2, 'B'] = 999

# 9. MultiIndex
df_multi = pd.DataFrame({
    'value': [1, 2, 3, 4]
}, index=pd.MultiIndex.from_tuples([
    ('A', 1), ('A', 2), ('B', 1), ('B', 2)
]))

# Access with tuple
print(df_multi.loc[('A', 1)])

# Slice first level
print(df_multi.loc['A'])

# 10. Boolean + Label
# Combine conditions
result = df.loc[(df['A'] > 1) & (df['B'] < 50), ['A', 'C']]
print(result)

Comparison:

Method Indexing Type Speed Use Case
.loc Label-based Medium General access by label
.iloc Position-based Medium Access by position
.at Label-based Fast Single value by label
.iat Position-based Fast Single value by position

Key Differences:

Aspect .loc .iloc
Input Labels Integers
Slice Inclusive end Exclusive end
Example df.loc['row1':'row3'] df.iloc[0:3]
Boolean βœ… Yes ❌ No

Interviewer's Insight

What they're testing: DataFrame indexing fundamentals and performance.

Strong answer signals:

  • Knows loc uses labels, iloc uses integer positions
  • Mentions at/iat are 10-100x faster for single values
  • Understands loc slicing is inclusive, iloc is exclusive
  • Avoids chained indexing to prevent SettingWithCopyWarning
  • Uses loc for boolean indexing and filtering

Explain pd.concat() vs pd.merge() - Amazon, Google Interview Question

Difficulty: 🟑 Medium | Tags: concat, merge, Data Combination | Asked by: Amazon, Google, Meta, Microsoft

View Answer

pd.concat() stacks DataFrames vertically/horizontally. pd.merge() combines based on keys (SQL joins). Use concat for stacking, merge for joining on columns.

Complete Examples:

import pandas as pd

# Sample data
df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

df2 = pd.DataFrame({
    'A': [7, 8, 9],
    'B': [10, 11, 12]
})

df3 = pd.DataFrame({
    'C': [13, 14, 15],
    'D': [16, 17, 18]
})

# 1. concat() - Vertical (default axis=0)
vertical = pd.concat([df1, df2])
print("Vertical concat:")
print(vertical)
# Stacks rows, keeps all columns

# Reset index
vertical_reset = pd.concat([df1, df2], ignore_index=True)
print("\nWith reset index:")
print(vertical_reset)

# 2. concat() - Horizontal (axis=1)
horizontal = pd.concat([df1, df3], axis=1)
print("\nHorizontal concat:")
print(horizontal)
# Side-by-side, matches by index

# 3. concat() with Keys
keyed = pd.concat([df1, df2], keys=['first', 'second'])
print("\nWith keys:")
print(keyed)
# Creates MultiIndex

# 4. concat() with Missing Columns
df4 = pd.DataFrame({
    'A': [1, 2],
    'C': [3, 4]  # Different column
})

mixed = pd.concat([df1, df4])
print("\nMissing columns (filled with NaN):")
print(mixed)

# Inner join (only common columns)
inner = pd.concat([df1, df4], join='inner')
print("\nInner join:")
print(inner)

# 5. merge() - Inner Join
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 2, 4],
    'amount': [100, 200, 300]
})

inner_merge = pd.merge(customers, orders, on='customer_id')
print("\nInner merge:")
print(inner_merge)
# Only matching customer_ids (1, 2)

# 6. merge() - Outer Join
outer_merge = pd.merge(customers, orders, on='customer_id', how='outer')
print("\nOuter merge:")
print(outer_merge)
# All records, NaN for missing

# 7. When to Use Each
# concat: Same structure, stack them
jan = pd.DataFrame({'sales': [100, 200]})
feb = pd.DataFrame({'sales': [150, 250]})
mar = pd.DataFrame({'sales': [120, 220]})

quarterly = pd.concat([jan, feb, mar], ignore_index=True)
print("\nQuarterly sales (concat):")
print(quarterly)

# merge: Different tables, join on key
products = pd.DataFrame({
    'product_id': [1, 2],
    'name': ['Widget', 'Gadget']
})

sales = pd.DataFrame({
    'product_id': [1, 2, 1],
    'quantity': [5, 3, 2]
})

sales_detail = pd.merge(sales, products, on='product_id')
print("\nSales detail (merge):")
print(sales_detail)

# 8. concat Multiple DataFrames
dfs = [df1, df2, df1, df2]
result = pd.concat(dfs, ignore_index=True)
print("\nMultiple concat:")
print(result)

# 9. Performance: concat vs append
# append (deprecated) - slow in loop
# result = pd.DataFrame()
# for df in dfs:
#     result = result.append(df)  # Slow!

# concat - fast
result = pd.concat(dfs)  # Much faster

# 10. Practical Example
# Monthly data files
data_2023_q1 = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=90, freq='D'),
    'revenue': range(90)
})

data_2023_q2 = pd.DataFrame({
    'date': pd.date_range('2023-04-01', periods=91, freq='D'),
    'revenue': range(100, 191)
})

# Combine time series
full_year = pd.concat([data_2023_q1, data_2023_q2], ignore_index=True)

# Add metadata
metadata = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=181, freq='D'),
    'holiday': [False] * 181
})

# Merge with metadata
complete = pd.merge(full_year, metadata, on='date', how='left')
print("\nComplete data:")
print(complete.head())

concat() vs merge():

Aspect concat() merge()
Purpose Stack DataFrames Join on keys
Direction Vertical/Horizontal Based on keys
Use Case Append rows, add columns SQL-style joins
Key Required No Yes
Example Monthly data β†’ yearly Customers + Orders

concat() Parameters:

Parameter Description Example
axis 0=vertical, 1=horizontal axis=0
ignore_index Reset index ignore_index=True
keys Create MultiIndex keys=['a', 'b']
join 'outer' (default) or 'inner' join='inner'

Interview Insights:

Interviewer's Insight

  • "concat: stack DataFrames (same structure)"
  • "merge: join on keys (different tables)"
  • "concat axis=0: vertical, axis=1: horizontal"
  • "concat ignore_index: reset index"
  • "merge for SQL-style joins"
  • "concat join='inner': only common columns"
  • "Use concat for: time series, appending rows"
  • "Use merge for: combining related tables"
  • "concat list of DFs: faster than repeated append"

Explain GroupBy Aggregation Functions - Google, Meta Interview Question

Difficulty: 🟑 Medium | Tags: GroupBy, Aggregation, Statistics | Asked by: Google, Meta, Netflix, Amazon

View Answer

GroupBy splits data into groups, applies functions, combines results. Supports multiple aggregations, custom functions, and named aggregations with agg().

Complete Examples:

import pandas as pd
import numpy as np

# Sample sales data
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East'] * 5,
    'product': ['A', 'B', 'A', 'B', 'A', 'B'] * 5,
    'sales': np.random.randint(100, 500, 30),
    'quantity': np.random.randint(1, 20, 30),
    'cost': np.random.randint(50, 200, 30)
})

# 1. Basic GroupBy
# Single aggregation
region_sales = df.groupby('region')['sales'].sum()
print("Sales by region:")
print(region_sales)

# Multiple columns
by_region_product = df.groupby(['region', 'product'])['sales'].sum()
print("\nSales by region and product:")
print(by_region_product)

# 2. Multiple Aggregations
# Same function on all columns
summary = df.groupby('region').sum()
print("\nSum by region:")
print(summary)

# Multiple functions
stats = df.groupby('region')['sales'].agg(['sum', 'mean', 'std', 'count'])
print("\nMultiple stats:")
print(stats)

# 3. Different Functions per Column
aggregations = df.groupby('region').agg({
    'sales': 'sum',
    'quantity': 'mean',
    'cost': ['min', 'max']
})
print("\nDifferent functions per column:")
print(aggregations)

# 4. Named Aggregations (Clean Column Names)
result = df.groupby('region').agg(
    total_sales=('sales', 'sum'),
    avg_quantity=('quantity', 'mean'),
    max_cost=('cost', 'max'),
    sales_count=('sales', 'count')
)
print("\nNamed aggregations:")
print(result)

# 5. Custom Aggregation Functions
def range_func(x):
    return x.max() - x.min()

def coefficient_of_variation(x):
    return x.std() / x.mean() if x.mean() != 0 else 0

custom = df.groupby('region')['sales'].agg([
    ('total', 'sum'),
    ('average', 'mean'),
    ('range', range_func),
    ('cv', coefficient_of_variation)
])
print("\nCustom functions:")
print(custom)

# 6. Lambda Functions
result = df.groupby('region').agg({
    'sales': [
        ('total', 'sum'),
        ('weighted_avg', lambda x: (x * df.loc[x.index, 'quantity']).sum() / df.loc[x.index, 'quantity'].sum())
    ]
})
print("\nLambda aggregation:")
print(result)

# 7. Transform (Keep Original Shape)
# Add group mean to each row
df['region_avg_sales'] = df.groupby('region')['sales'].transform('mean')

# Normalize within group
df['sales_normalized'] = df.groupby('region')['sales'].transform(
    lambda x: (x - x.mean()) / x.std()
)

print("\nWith transform:")
print(df[['region', 'sales', 'region_avg_sales', 'sales_normalized']].head())

# 8. Filter Groups
# Keep only groups with more than 5 records
filtered = df.groupby('region').filter(lambda x: len(x) > 5)
print("\nFiltered groups:")
print(filtered.groupby('region').size())

# Keep groups with high average sales
high_sales = df.groupby('region').filter(lambda x: x['sales'].mean() > 250)

# 9. Apply (Most Flexible)
def top_products(group):
    """Get top 2 products by sales."""
    return group.nlargest(2, 'sales')

top_per_region = df.groupby('region').apply(top_products)
print("\nTop products per region:")
print(top_per_region)

# 10. GroupBy with MultiIndex
multi_group = df.groupby(['region', 'product']).agg({
    'sales': ['sum', 'mean'],
    'quantity': 'sum'
})
print("\nMultiIndex groupby:")
print(multi_group)

# Flatten MultiIndex columns
multi_group.columns = ['_'.join(col).strip() for col in multi_group.columns.values]
print("\nFlattened columns:")
print(multi_group)

# 11. Cumulative Operations
# Cumulative sum within group
df['cumulative_sales'] = df.groupby('region')['sales'].cumsum()

# Ranking within group
df['sales_rank'] = df.groupby('region')['sales'].rank(ascending=False)

print("\nCumulative operations:")
print(df[['region', 'sales', 'cumulative_sales', 'sales_rank']].head(10))

# 12. Practical Example: Sales Report
report = df.groupby('region').agg(
    total_revenue=('sales', 'sum'),
    avg_sale=('sales', 'mean'),
    total_units=('quantity', 'sum'),
    num_transactions=('sales', 'count'),
    max_sale=('sales', 'max'),
    min_sale=('sales', 'min')
)

# Add calculated columns
report['revenue_per_unit'] = report['total_revenue'] / report['total_units']
report['avg_units_per_transaction'] = report['total_units'] / report['num_transactions']

print("\nSales report:")
print(report)

Common Aggregation Functions:

Function Description Example
sum Total groupby().sum()
mean Average groupby().mean()
count Count non-null groupby().count()
size Count all (including null) groupby().size()
std Standard deviation groupby().std()
min/max Min/Max groupby().max()
first/last First/Last value groupby().first()
nunique Unique count groupby().nunique()
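
The table above distinguishes count() (non-null values only) from size() (all rows, nulls included); a minimal sketch of the difference on a small hypothetical frame:

import pandas as pd
import numpy as np

tmp = pd.DataFrame({'grp': ['A', 'A', 'B', 'B'],
                    'val': [1.0, np.nan, 3.0, 4.0]})

print(tmp.groupby('grp')['val'].count())  # A: 1, B: 2  (NaN excluded)
print(tmp.groupby('grp').size())          # A: 2, B: 2  (every row counted)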

GroupBy Methods:

Method Purpose Returns
agg() Apply aggregations Aggregated data
transform() Keep original shape Same shape as input
filter() Filter groups Subset of data
apply() Custom function Custom result

Interview Insights:

Interviewer's Insight

  • "groupby(): split-apply-combine pattern"
  • "agg(): apply multiple functions"
  • "Named agg: total_sales=('sales', 'sum')"
  • "transform(): keep original shape"
  • "filter(): keep/remove entire groups"
  • "apply(): most flexible, returns custom structure"
  • "Multiple grouping: groupby(['col1', 'col2'])"
  • "Custom functions: pass callable to agg()"
  • "Flatten MultiIndex: '_'.join(col)"

Explain Pandas Query Method - Amazon, Netflix Interview Question

Difficulty: 🟒 Easy | Tags: Query, Filtering, Performance | Asked by: Amazon, Netflix, Google, Meta

View Answer

query() filters DataFrames using string expressions (SQL-like). Cleaner syntax than boolean indexing, supports variables with @, and can be faster for large DataFrames.

Complete Examples:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 40, 28],
    'salary': [50000, 60000, 70000, 80000, 55000],
    'department': ['Sales', 'IT', 'Sales', 'IT', 'HR'],
    'experience': [2, 5, 8, 12, 3]
})

# 1. Basic Query
# Simple condition
result = df.query('age > 30')
print("Age > 30:")
print(result)

# Multiple conditions (AND)
result = df.query('age > 25 and salary > 55000')
print("\nAge > 25 AND salary > 55000:")
print(result)

# OR condition
result = df.query('department == "IT" or salary > 70000')
print("\nIT OR high salary:")
print(result)

# 2. Comparison with Boolean Indexing
# Boolean indexing (verbose)
result_bool = df[(df['age'] > 25) & (df['salary'] > 55000)]

# Query (cleaner)
result_query = df.query('age > 25 and salary > 55000')

# Both produce same result
print("\nResults equal:", result_bool.equals(result_query))

# 3. Using Variables with @
min_age = 30
min_salary = 60000

result = df.query('age > @min_age and salary > @min_salary')
print("\nWith variables:")
print(result)

# List membership
departments = ['IT', 'HR']
result = df.query('department in @departments')
print("\nDepartment in list:")
print(result)

# 4. String Operations
# String contains
result = df.query('name.str.contains("a", case=False)', engine='python')
print("\nName contains 'a':")
print(result)

# String startswith
result = df.query('name.str.startswith("D")', engine='python')
print("\nName starts with 'D':")
print(result)

# 5. Between
# Age between 25 and 35
result = df.query('25 <= age <= 35')
print("\nAge between 25 and 35:")
print(result)

# 6. Not Equal
result = df.query('department != "Sales"')
print("\nNot Sales:")
print(result)

# 7. Complex Expressions
# Calculate in query
result = df.query('salary / experience > 10000')
print("\nSalary per year experience > 10k:")
print(result)

# 8. Index Queries
df_indexed = df.set_index('name')

# Query index
result = df_indexed.query('index == "Alice"')
print("\nQuery index:")
print(result)

# 9. Performance Comparison
large_df = pd.DataFrame({
    'A': np.random.randint(0, 100, 100000),
    'B': np.random.randint(0, 100, 100000),
    'C': np.random.randint(0, 100, 100000)
})

import time

# Boolean indexing
start = time.time()
_ = large_df[(large_df['A'] > 50) & (large_df['B'] < 30)]
bool_time = time.time() - start

# Query
start = time.time()
_ = large_df.query('A > 50 and B < 30')
query_time = time.time() - start

print(f"\nBoolean: {bool_time:.4f}s")
print(f"Query: {query_time:.4f}s")

# 10. Inplace Modification
df_copy = df.copy()

# Filter and reassign
df_filtered = df_copy.query('age > 30')

# query() also supports inplace=True, which filters the DataFrame in place
df_copy.query('age > 30', inplace=True)

# 11. Backticks for Column Names with Spaces
df_spaces = pd.DataFrame({
    'First Name': ['Alice', 'Bob'],
    'Last Name': ['Smith', 'Jones'],
    'Age': [25, 30]
})

result = df_spaces.query('`First Name` == "Alice"')
print("\nWith spaces in column names:")
print(result)

# 12. Practical Example: Data Analysis
sales_df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'A', 'B', 'C'] * 10,
    'region': ['North', 'South'] * 30,
    'sales': np.random.randint(100, 1000, 60),
    'units': np.random.randint(1, 50, 60),
    'quarter': ['Q1', 'Q2', 'Q3', 'Q4'] * 15
})

# Complex business query
target_sales = 500
target_region = 'North'

high_performers = sales_df.query(
    'sales > @target_sales and region == @target_region and units > 20'
)

print("\nHigh performers:")
print(high_performers)

# Chained queries
result = (sales_df
          .query('region == "North"')
          .query('sales > 500')
          .query('quarter in ["Q1", "Q2"]'))

print("\nChained queries:")
print(result)

query() vs Boolean Indexing:

Aspect query() Boolean Indexing
Syntax df.query('age > 30') df[df['age'] > 30]
Readability Cleaner, SQL-like More verbose
Performance Faster (large data) Slower
Variables Use @ Direct
Complex More readable Harder to read

Query Operators:

Operator Description Example
==, != Equality 'age == 30'
>, <, >=, <= Comparison 'salary > 50000'
and, or, not Logical 'age > 30 and salary < 60000'
in Membership 'dept in @departments'
@ Variable reference 'age > @min_age'

Interview Insights:

Interviewer's Insight

  • "query(): SQL-like string expressions"
  • "Cleaner than boolean indexing"
  • "Use @ for variables: query('age > @min_age')"
  • "and/or instead of &/|"
  • "Faster for large DataFrames (numexpr)"
  • "Backticks for spaces: `column name`"
  • "Can query index: query('index == \"value\"')"
  • "engine='python' for string methods"
  • "More readable for complex conditions"

Explain Pandas Memory Usage Optimization - Netflix, Google Interview Question

Difficulty: πŸ”΄ Hard | Tags: Memory, Performance, Optimization | Asked by: Netflix, Google, Meta, Amazon

View Answer

Memory optimization reduces DataFrame memory footprint through: downcasting numeric types, categorical dtype for strings, sparse arrays, chunking, and efficient file formats (parquet).

Complete Examples:

import pandas as pd
import numpy as np

# 1. Check Memory Usage
df = pd.DataFrame({
    'int_col': np.random.randint(0, 100, 1000000),
    'float_col': np.random.rand(1000000),
    'str_col': np.random.choice(['A', 'B', 'C'], 1000000),
    'date_col': pd.date_range('2020-01-01', periods=1000000, freq='min')
})

# Memory before
memory_before = df.memory_usage(deep=True).sum() / 1024**2
print(f"Memory before: {memory_before:.2f} MB")
print("\nMemory by column:")
print(df.memory_usage(deep=True))

# 2. Downcast Numeric Types
# int64 β†’ int8/int16/int32
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')

# float64 β†’ float32
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')

memory_after_numeric = df.memory_usage(deep=True).sum() / 1024**2
print(f"\nAfter numeric downcast: {memory_after_numeric:.2f} MB")
print(f"Saved: {memory_before - memory_after_numeric:.2f} MB")

# 3. Categorical for Low Cardinality Strings
df['str_col'] = df['str_col'].astype('category')

memory_after_cat = df.memory_usage(deep=True).sum() / 1024**2
print(f"After categorical: {memory_after_cat:.2f} MB")
print(f"Total saved: {(1 - memory_after_cat/memory_before)*100:.1f}%")

# 4. Automatic Optimization Function
def optimize_dataframe(df):
    """Automatically optimize DataFrame dtypes."""
    optimized = df.copy()

    for col in optimized.columns:
        col_type = optimized[col].dtype

        # Numeric columns
        if col_type != 'object' and col_type.name != 'category':
            c_min = optimized[col].min()
            c_max = optimized[col].max()

            # Integer
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    optimized[col] = optimized[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    optimized[col] = optimized[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    optimized[col] = optimized[col].astype(np.int32)

            # Float
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    optimized[col] = optimized[col].astype(np.float32)

        # String to categorical
        elif col_type == 'object':
            num_unique = optimized[col].nunique()
            num_total = len(optimized[col])
            if num_unique / num_total < 0.5:  # Less than 50% unique
                optimized[col] = optimized[col].astype('category')

    return optimized

# Test optimization
df_test = pd.DataFrame({
    'id': range(100000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 100000),
    'value': np.random.rand(100000) * 100
})

print(f"\nBefore optimization: {df_test.memory_usage(deep=True).sum()/1024**2:.2f} MB")
df_optimized = optimize_dataframe(df_test)
print(f"After optimization: {df_optimized.memory_usage(deep=True).sum()/1024**2:.2f} MB")

# 5. Sparse Arrays
# For data with many zeros/NaN
df_sparse = pd.DataFrame({
    'A': pd.arrays.SparseArray([0, 0, 1, 0, 0, 0, 2, 0, 0, 0] * 100000)
})

# Regular array
df_dense = pd.DataFrame({
    'A': [0, 0, 1, 0, 0, 0, 2, 0, 0, 0] * 100000
})

print(f"\nDense: {df_dense.memory_usage(deep=True).sum()/1024**2:.2f} MB")
print(f"Sparse: {df_sparse.memory_usage(deep=True).sum()/1024**2:.2f} MB")

# 6. Chunking for Large Files
chunk_size = 10000

# Process in chunks
def process_large_file(filename, chunksize=10000):
    """Process large CSV in chunks."""
    results = []
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        # Process chunk
        processed = chunk[chunk['value'] > 50]
        results.append(processed)
    return pd.concat(results, ignore_index=True)

# 7. Efficient File Formats
# Save test data
df_test.to_csv('test.csv', index=False)
df_test.to_parquet('test.parquet')

import os
csv_size = os.path.getsize('test.csv') / 1024**2
parquet_size = os.path.getsize('test.parquet') / 1024**2

print(f"\nCSV size: {csv_size:.2f} MB")
print(f"Parquet size: {parquet_size:.2f} MB")
print(f"Parquet is {csv_size/parquet_size:.1f}x smaller")

# 8. Column Selection
# Read only needed columns
# df_subset = pd.read_csv('large_file.csv', usecols=['col1', 'col2'])

# 9. Date Parsing
# Don't parse dates if not needed
# df = pd.read_csv('file.csv', parse_dates=False)

# 10. Practical Example: Large Dataset
print("\n=== Optimization Summary ===")

# Original
df_original = pd.DataFrame({
    'user_id': range(1000000),
    'country': np.random.choice(['USA', 'UK', 'Canada', 'Germany'], 1000000),
    'age': np.random.randint(18, 80, 1000000),
    'score': np.random.rand(1000000) * 100,
    'status': np.random.choice(['active', 'inactive'], 1000000)
})

mem_original = df_original.memory_usage(deep=True).sum() / 1024**2
print(f"Original: {mem_original:.2f} MB")

# Optimized
df_opt = df_original.copy()
df_opt['user_id'] = pd.to_numeric(df_opt['user_id'], downcast='integer')
df_opt['age'] = pd.to_numeric(df_opt['age'], downcast='integer')
df_opt['score'] = pd.to_numeric(df_opt['score'], downcast='float')
df_opt['country'] = df_opt['country'].astype('category')
df_opt['status'] = df_opt['status'].astype('category')

mem_opt = df_opt.memory_usage(deep=True).sum() / 1024**2
print(f"Optimized: {mem_opt:.2f} MB")
print(f"Reduction: {(1 - mem_opt/mem_original)*100:.1f}%")

# Cleanup
os.remove('test.csv')
os.remove('test.parquet')

Optimization Techniques:

Technique Memory Savings Use Case
Downcast int64β†’int8 87.5% Small integers
Downcast float64β†’float32 50% Precision not critical
String→Categorical 90%+ Low cardinality
Sparse arrays 90%+ Many zeros/NaN
Parquet format 50-90% File storage
Chunking No limit Process huge files

Data Type Sizes:

Type Size Range
int8 1 byte -128 to 127
int16 2 bytes -32,768 to 32,767
int32 4 bytes -2B to 2B
int64 8 bytes -9E18 to 9E18
float32 4 bytes ~7 decimals
float64 8 bytes ~15 decimals
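
Often the cheapest win is to apply these dtypes at load time instead of converting after the fact. A hedged sketch combining usecols and dtype in read_csv (the filename and column names are hypothetical):

import pandas as pd

df_small = pd.read_csv(
    'large_file.csv',                             # hypothetical file
    usecols=['user_id', 'country', 'score'],      # load only needed columns
    dtype={'user_id': 'int32',
           'country': 'category',                 # categorical at read time
           'score': 'float32'}
)
print(df_small.dtypes)
print(f"{df_small.memory_usage(deep=True).sum() / 1024**2:.2f} MB")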

Interview Insights:

Interviewer's Insight

  • "memory_usage(deep=True): check actual usage"
  • "Downcast: int64β†’int8/16/32, float64β†’float32"
  • "Categorical: 90%+ savings for low-cardinality strings"
  • "Sparse arrays: for data with many zeros"
  • "Chunking: process files larger than RAM"
  • "Parquet: smaller, faster than CSV"
  • "read_csv usecols: load only needed columns"
  • "Automatic optimization: check ranges, convert types"
  • "Rule: <50% unique β†’ categorical"

Explain Pandas Eval Method - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: eval, Performance, Expression Evaluation | Asked by: Google, Amazon, Meta

View Answer

eval() evaluates string expressions efficiently using numexpr backend. Much faster than standard operations for large DataFrames, especially with complex arithmetic.

Complete Examples:

import pandas as pd
import numpy as np
import time

# Create large DataFrame
n = 1000000
df = pd.DataFrame({
    'A': np.random.rand(n),
    'B': np.random.rand(n),
    'C': np.random.rand(n),
    'D': np.random.rand(n)
})

# 1. Basic eval()
# Standard operation
result_standard = df['A'] + df['B'] * df['C'] - df['D']

# eval operation
result_eval = df.eval('A + B * C - D')

# Results are equal
print("Results equal:", np.allclose(result_standard, result_eval))

# 2. Performance Comparison
# Standard
start = time.time()
_ = df['A'] + df['B'] * df['C'] - df['D']
standard_time = time.time() - start

# eval
start = time.time()
_ = df.eval('A + B * C - D')
eval_time = time.time() - start

print(f"\nStandard: {standard_time:.4f}s")
print(f"eval: {eval_time:.4f}s")
print(f"Speedup: {standard_time/eval_time:.1f}x")

# 3. Column Assignment
# Create new column
df.eval('E = A + B', inplace=True)
print("\nNew column E:")
print(df[['A', 'B', 'E']].head())

# Multiple assignments
df.eval('''
    F = A * 2
    G = B + C
''', inplace=True)

# 4. Using Variables with @
multiplier = 2.5
threshold = 0.5

result = df.eval('A * @multiplier > @threshold')
print("\nWith variables:")
print(result.head())

# 5. Comparison Operations
# Boolean mask
mask = df.eval('A > 0.5 and B < 0.3')
filtered = df[mask]
print("\nFiltered rows:", len(filtered))

# Multiple conditions
complex_mask = df.eval('(A > 0.5 and B < 0.3) or (C > 0.8 and D < 0.2)')

# 6. Local vs Global Variables
local_var = 10

# Access local variable
df.eval('H = A * @local_var', inplace=True)

# 7. DataFrame.eval() vs pd.eval()
# DataFrame.eval
df_result = df.eval('A + B')

# pd.eval with multiple DataFrames
df2 = pd.DataFrame({
    'X': np.random.rand(n),
    'Y': np.random.rand(n)
})

# DataFrame.eval() only sees that frame's columns (plus @variables);
# top-level pd.eval() can reference several DataFrames by name:
combined = pd.eval('df.A + df2.X')

# 8. Assignment Operators
# Add to existing column
df.eval('A = A + 1', inplace=True)

# Conditional logic (if/else) is not supported inside eval(); use where() instead
df['I'] = df['A'].where(df['A'] > 0.5, df['B'])

# 9. Complex Expressions
# Multi-line expressions
df.eval('''
    temp1 = A + B
    temp2 = C * D
    result = temp1 / temp2
''', inplace=True)

# 10. When eval() is Faster
# Good: Complex arithmetic
start = time.time()
_ = df.eval('A + B * C / D - (A * B) + (C / D)')
eval_fast = time.time() - start

start = time.time()
_ = df['A'] + df['B'] * df['C'] / df['D'] - (df['A'] * df['B']) + (df['C'] / df['D'])
standard_slow = time.time() - start

print(f"\nComplex arithmetic:")
print(f"eval: {eval_fast:.4f}s")
print(f"Standard: {standard_slow:.4f}s")
print(f"Speedup: {standard_slow/eval_fast:.1f}x")

# Bad: Simple operations, small data
small_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Not faster for small data
_ = small_df.eval('A + B')
_ = small_df['A'] + small_df['B']  # Similar speed

# 11. Supported Operations
print("\n=== Supported Operations ===")
df_ops = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Arithmetic
print("Arithmetic:", df_ops.eval('A + B * 2').tolist())

# Comparison
print("Comparison:", df_ops.eval('A > 1').tolist())

# Logical
print("Logical:", df_ops.eval('A > 1 and B < 6').tolist())

# 12. Limitations
# Can't use:
# - String methods: df.eval('A.str.upper()')  # Error
# - Complex functions: df.eval('A.apply(func)')  # Error
# - Method chaining

# Use standard operations instead for these

# 13. Practical Example
# Financial calculations
portfolio = pd.DataFrame({
    'shares': np.random.randint(10, 1000, 10000),
    'price': np.random.rand(10000) * 100,
    'cost_basis': np.random.rand(10000) * 80,
    'dividend': np.random.rand(10000) * 5
})

# Calculate portfolio metrics
portfolio.eval('''
    market_value = shares * price
    cost_value = shares * cost_basis
    unrealized_gain = market_value - cost_value
    gain_pct = (unrealized_gain / cost_value) * 100
    annual_dividend = shares * dividend
    yield_pct = (annual_dividend / market_value) * 100
''', inplace=True)

print("\nPortfolio metrics:")
print(portfolio[['market_value', 'unrealized_gain', 'gain_pct', 'yield_pct']].head())

eval() vs Standard:

Aspect eval() Standard
Syntax String expression Python operations
Speed Faster (large data) Slower
Memory More efficient More allocations
Readability SQL-like, cleaner More verbose
Best for Complex arithmetic Simple operations

When to Use eval():

Use eval() Don't Use eval()
Large DataFrames (>100k rows) Small DataFrames
Complex arithmetic Simple operations
Multiple operations Single operation
Chained calculations String methods
Memory constrained Custom functions

Supported Operations:

Operation Example
Arithmetic +, -, *, /, **, %
Comparison >, <, >=, <=, ==, !=
Logical and, or, not, &, |, ~
Variables @variable_name
Assignment new_col = expression

Interview Insights:

Interviewer's Insight

  • "eval(): string expressions using numexpr"
  • "Faster for large DataFrames (100k+ rows)"
  • "Best for complex arithmetic: 2-4x speedup"
  • "More memory efficient"
  • "Use @ for variables: eval('A * @multiplier')"
  • "inplace=True: modify DataFrame"
  • "Limitations: no string methods, custom functions"
  • "Not faster for small data or simple ops"
  • "Cleaner syntax for complex calculations"

Explain Pandas Time Series Resampling - Meta, Netflix Interview Question

Difficulty: 🟑 Medium | Tags: Time Series, Resampling, Aggregation | Asked by: Meta, Netflix, Google, Amazon

View Answer

resample() converts a time series to a different frequency (downsampling or upsampling). Essential for time-series aggregation, business reporting, and data alignment.

Complete Examples:

import pandas as pd
import numpy as np

# Create time series data
dates = pd.date_range('2024-01-01', periods=365, freq='D')
df = pd.DataFrame({
    'date': dates,
    'sales': np.random.randint(100, 500, 365) + np.sin(np.arange(365) * 0.1) * 50,
    'customers': np.random.randint(10, 100, 365),
    'temperature': 70 + np.sin(np.arange(365) * 0.017) * 20 + np.random.randn(365) * 5
}).set_index('date')

# 1. Downsampling (Higher β†’ Lower Frequency)
# Daily β†’ Weekly
weekly = df.resample('W').sum()
print("Weekly totals:")
print(weekly.head())

# Daily β†’ Monthly
monthly = df.resample('M').sum()
print("\nMonthly totals:")
print(monthly.head())

# Daily β†’ Quarterly
quarterly = df.resample('Q').sum()
print("\nQuarterly totals:")
print(quarterly)

# 2. Different Aggregations
# Multiple aggregations
monthly_stats = df.resample('M').agg({
    'sales': ['sum', 'mean', 'max'],
    'customers': 'sum',
    'temperature': 'mean'
})
print("\nMonthly statistics:")
print(monthly_stats)

# 3. Upsampling (Lower β†’ Higher Frequency)
# Create sparse data
weekly_data = pd.DataFrame({
    'value': [100, 200, 150, 180]
}, index=pd.date_range('2024-01-01', periods=4, freq='W'))

# Fill forward
daily_ffill = weekly_data.resample('D').ffill()
print("\nUpsampled (forward fill):")
print(daily_ffill.head(10))

# Interpolate
daily_interp = weekly_data.resample('D').interpolate()
print("\nUpsampled (interpolate):")
print(daily_interp.head(10))

# 4. Common Frequency Codes
freq_examples = {
    'D': 'Daily',
    'W': 'Weekly',
    'M': 'Month end',
    'MS': 'Month start',
    'Q': 'Quarter end',
    'QS': 'Quarter start',
    'Y': 'Year end',
    'H': 'Hourly',
    'min': 'Minute (alias: T)',
    'S': 'Second'
}

# 5. Business Day Resampling
# Business days only (Mon-Fri)
business_weekly = df.resample('W-MON').sum()  # Week ending Monday
print("\nWeekly (business days):")
print(business_weekly.head())

# Business month
business_monthly = df.resample('BM').sum()  # Business month end

# 6. Offset Resampling
# 2-week periods
biweekly = df.resample('2W').sum()
print("\nBi-weekly totals:")
print(biweekly.head())

# 10-day periods
ten_day = df.resample('10D').sum()

# 7. Rolling with Resample
# Monthly average with rolling
monthly_rolling = df.resample('M').mean().rolling(window=3).mean()
print("\n3-month moving average:")
print(monthly_rolling.head(6))

# 8. Groupby with Resample
# Add category
df['region'] = np.random.choice(['North', 'South'], len(df))

# Resample per group
regional_monthly = df.groupby('region').resample('M').sum()
print("\nRegional monthly sales:")
print(regional_monthly.head())
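
# A sketch of an equivalent pattern (not used elsewhere in this example):
# pd.Grouper groups by a time frequency on the DatetimeIndex inside a single groupby call
grouper_monthly = df.groupby(['region', pd.Grouper(freq='M')])['sales'].sum()
print("\nWith pd.Grouper:")
print(grouper_monthly.head())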

# 9. Custom Aggregation
def custom_agg(x):
    """Custom aggregation function."""
    return pd.Series({
        'total': x.sum(),
        'average': x.mean(),
        'peak': x.max(),
        'days': len(x)
    })

custom_monthly = df['sales'].resample('M').apply(custom_agg)
print("\nCustom aggregation:")
print(custom_monthly.head())

# 10. Label and Closed Parameters
# Default: label='right', closed='right'
# Period labeled by right edge, closed on right

# Label by left edge
left_label = df.resample('W', label='left').sum()

# Closed on left
left_closed = df.resample('W', closed='left').sum()

# 11. Origin and Offset
# Start resampling from specific point
custom_origin = df.resample('10D', origin='2024-01-05').sum()

# 12. Practical Example: Business Reporting
sales_data = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=1000, freq='H'),
    'revenue': np.random.randint(100, 1000, 1000),
    'orders': np.random.randint(1, 20, 1000)
}).set_index('timestamp')

# Daily summary
daily_report = sales_data.resample('D').agg({
    'revenue': ['sum', 'mean', 'max'],
    'orders': ['sum', 'count']
})

# Add calculated metrics
daily_report['avg_order_value'] = (
    daily_report[('revenue', 'sum')] / daily_report[('orders', 'sum')]
)

print("\nDaily business report:")
print(daily_report.head())

# Weekly summary with week-over-week growth
weekly_report = sales_data.resample('W').sum()
weekly_report['revenue_growth'] = weekly_report['revenue'].pct_change() * 100
weekly_report['orders_growth'] = weekly_report['orders'].pct_change() * 100

print("\nWeekly growth report:")
print(weekly_report.head())

# 13. Combining with Transform
# Add group statistics back to original data
df['monthly_avg'] = df.groupby(df.index.to_period('M'))['sales'].transform('mean')

print("\nWith monthly average:")
print(df[['sales', 'monthly_avg']].head(35))

Frequency Codes:

Code Meaning Example
D Day Daily data
W Week Weekly summary
M Month end Monthly totals
MS Month start Start of month
Q Quarter end Quarterly reports
Y Year end Annual data
H Hour Hourly aggregation
T/min Minute Minute-level data
S Second Second-level data
B Business day Weekdays only

Resampling Methods:

Method Description Use Case
sum() Total Revenue, quantities
mean() Average Prices, temperatures
first() First value Opening price
last() Last value Closing price
max()/min() Peak/Low Daily highs/lows
count() Count Transactions
ffill() Forward fill Upsampling
interpolate() Interpolation Smooth upsampling
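
The table mentions first()/last() for opening and closing prices; a minimal sketch on a synthetic hourly price series (illustrative data only):

import pandas as pd
import numpy as np

prices = pd.Series(
    100 + np.random.randn(48).cumsum(),
    index=pd.date_range('2024-01-01', periods=48, freq='H'),
    name='price'
)

daily_open = prices.resample('D').first()
daily_close = prices.resample('D').last()
daily_ohlc = prices.resample('D').ohlc()   # open/high/low/close in one call
print(daily_ohlc)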

Upsampling vs Downsampling:

Type Direction Aggregation Example
Downsampling High β†’ Low freq Required Daily β†’ Monthly
Upsampling Low β†’ High freq Fill/interpolate Monthly β†’ Daily

Interview Insights:

Interviewer's Insight

  • "resample(): convert time series frequency"
  • "Downsampling: higher β†’ lower (daily β†’ monthly)"
  • "Upsampling: lower β†’ higher (monthly β†’ daily)"
  • "Frequency codes: D, W, M, Q, Y, H"
  • "Aggregation: sum, mean, max for downsampling"
  • "Fill methods: ffill, bfill, interpolate for upsampling"
  • "label: 'left' or 'right' edge labeling"
  • "Business days: 'B', 'BM' for business calendars"
  • "Combine with groupby for category-wise resampling"

Explain Pandas Pivot Tables - Microsoft, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Pivot Tables, Aggregation, Reshaping | Asked by: Microsoft, Amazon, Google, Meta

View Answer

pivot_table() summarizes data with row/column groupings and aggregations (like Excel). pivot() reshapes without aggregation. Use pivot_table for reports, pivot for simple reshaping.

Complete Examples:

import pandas as pd
import numpy as np

# Sample sales data
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100, freq='D').repeat(3),
    'region': np.tile(['North', 'South', 'East'], 100),
    'product': np.tile(['A', 'B', 'C'], 100),
    'sales': np.random.randint(100, 1000, 300),
    'quantity': np.random.randint(1, 50, 300),
    'profit': np.random.randint(10, 200, 300)
})

# 1. Basic pivot_table
# Sales by region and product
pivot = pd.pivot_table(
    df,
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum'
)
print("Sales by region and product:")
print(pivot)

# 2. Multiple Aggregations
# Multiple functions
multi_agg = pd.pivot_table(
    df,
    values='sales',
    index='region',
    columns='product',
    aggfunc=['sum', 'mean', 'count']
)
print("\nMultiple aggregations:")
print(multi_agg)

# 3. Multiple Values
# Different columns
multi_values = pd.pivot_table(
    df,
    values=['sales', 'profit'],
    index='region',
    columns='product',
    aggfunc='sum'
)
print("\nMultiple values:")
print(multi_values)

# 4. Margins (Totals)
# Add row and column totals
with_totals = pd.pivot_table(
    df,
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)
print("\nWith totals:")
print(with_totals)

# 5. Fill Missing Values
# Replace NaN
filled = pd.pivot_table(
    df,
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    fill_value=0
)
print("\nFilled with zeros:")
print(filled)

# 6. Multiple Index Levels
# Hierarchical index
df['month'] = df['date'].dt.to_period('M')

multi_index = pd.pivot_table(
    df,
    values='sales',
    index=['month', 'region'],
    columns='product',
    aggfunc='sum'
)
print("\nMulti-index:")
print(multi_index.head(10))

# 7. Custom Aggregation
def range_func(x):
    return x.max() - x.min()

custom_agg = pd.pivot_table(
    df,
    values='sales',
    index='region',
    columns='product',
    aggfunc=range_func
)
print("\nCustom aggregation (range):")
print(custom_agg)

# 8. pivot() vs pivot_table()
# Simple data (unique index/column combinations)
simple_df = pd.DataFrame({
    'row': ['A', 'A', 'B', 'B'],
    'col': ['X', 'Y', 'X', 'Y'],
    'value': [1, 2, 3, 4]
})

# pivot (no aggregation, must be unique)
pivoted = simple_df.pivot(index='row', columns='col', values='value')
print("\npivot() result:")
print(pivoted)

# pivot_table (with aggregation, handles duplicates)
dup_df = pd.DataFrame({
    'row': ['A', 'A', 'A', 'B'],
    'col': ['X', 'X', 'Y', 'X'],
    'value': [1, 2, 3, 4]
})

# pivot would fail (duplicates)
# pivoted = dup_df.pivot(index='row', columns='col', values='value')  # Error!

# pivot_table handles it
pivoted_agg = dup_df.pivot_table(
    index='row',
    columns='col',
    values='value',
    aggfunc='sum'
)
print("\npivot_table() with duplicates:")
print(pivoted_agg)

# 9. Observed Parameter
# Add categorical column
df['category'] = pd.Categorical(
    np.random.choice(['Cat1', 'Cat2', 'Cat3'], len(df)),
    categories=['Cat1', 'Cat2', 'Cat3', 'Cat4']  # Cat4 unused
)

# Include unused categories
with_unused = pd.pivot_table(
    df,
    values='sales',
    index='category',
    aggfunc='sum',
    observed=False
)
print("\nWith unused categories:")
print(with_unused)

# Exclude unused
observed_only = pd.pivot_table(
    df,
    values='sales',
    index='category',
    aggfunc='sum',
    observed=True
)
print("\nObserved only:")
print(observed_only)

# 10. Practical Example: Business Report
sales_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=365, freq='D'),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 365),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Food'], 365),
    'revenue': np.random.randint(1000, 10000, 365),
    'units_sold': np.random.randint(10, 200, 365)
})

# Add month and quarter
sales_data['month'] = sales_data['date'].dt.to_period('M')
sales_data['quarter'] = sales_data['date'].dt.to_period('Q')

# Quarterly report
quarterly_report = pd.pivot_table(
    sales_data,
    values=['revenue', 'units_sold'],
    index='quarter',
    columns='region',
    aggfunc={
        'revenue': 'sum',
        'units_sold': 'sum'
    },
    margins=True,
    fill_value=0
)

print("\nQuarterly business report:")
print(quarterly_report)

# Monthly category performance
category_report = pd.pivot_table(
    sales_data,
    values='revenue',
    index='month',
    columns='product_category',
    aggfunc=['sum', 'mean', 'count'],
    margins=True
)

print("\nMonthly category report:")
print(category_report.head(6))

pivot() vs pivot_table():

Aspect pivot() pivot_table()
Aggregation No Yes
Duplicates Must be unique Handles duplicates
Use Case Simple reshape Summarization
Example Long to wide Excel pivot table
Parameters index, columns, values + aggfunc, margins

Common Parameters:

Parameter Description Example
values Column(s) to aggregate values='sales'
index Row grouping index='region'
columns Column grouping columns='product'
aggfunc Aggregation function aggfunc='sum'
fill_value Replace NaN fill_value=0
margins Add totals margins=True

Interview Insights:

Interviewer's Insight

  • "pivot_table(): Excel-like summarization"
  • "pivot(): reshape without aggregation"
  • "index: row grouping, columns: column grouping"
  • "aggfunc: 'sum', 'mean', 'count', or custom"
  • "margins=True: add row/column totals"
  • "fill_value=0: replace NaN"
  • "pivot_table handles duplicates, pivot doesn't"
  • "Multiple values: list in values parameter"
  • "Use for: reports, summaries, cross-tabulation"

Explain pd.melt() - Google, Meta Interview Question

Difficulty: 🟑 Medium | Tags: melt, Reshaping, Wide to Long | Asked by: Google, Meta, Amazon

View Answer

melt() transforms wide format to long format (unpivots). Opposite of pivot. Essential for tidy data, plotting, and analysis requiring long format.

Complete Examples:

import pandas as pd

# Wide format data
df_wide = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie'],
    'math': [85, 92, 78],
    'science': [88, 95, 82],
    'english': [90, 87, 85]
})

print("Wide format:")
print(df_wide)

# 1. Basic melt
df_long = pd.melt(df_wide, id_vars=['student'])
print("\nLong format (basic):")
print(df_long)

# 2. Custom Column Names
df_long = pd.melt(
    df_wide,
    id_vars=['student'],
    var_name='subject',
    value_name='score'
)
print("\nWith custom names:")
print(df_long)

# 3. Select Specific Columns to Melt
df_long = pd.melt(
    df_wide,
    id_vars=['student'],
    value_vars=['math', 'science'],  # Only these columns
    var_name='subject',
    value_name='score'
)
print("\nSelected columns:")
print(df_long)

# 4. Multiple ID Variables
df_multi = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie'],
    'class': ['A', 'A', 'B'],
    'math': [85, 92, 78],
    'science': [88, 95, 82]
})

df_long = pd.melt(
    df_multi,
    id_vars=['student', 'class'],
    var_name='subject',
    value_name='score'
)
print("\nMultiple ID vars:")
print(df_long)

# 5. melt() + pivot() (Round Trip)
# Wide β†’ Long
df_long = pd.melt(df_wide, id_vars=['student'])

# Long β†’ Wide
df_back = df_long.pivot(
    index='student',
    columns='variable',
    values='value'
).reset_index()

print("\nRound trip (original):")
print(df_wide)
print("\nRound trip (restored):")
print(df_back)

# 6. Real Example: Time Series
sales_wide = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Tool'],
    '2023-Q1': [100, 150, 80],
    '2023-Q2': [120, 160, 90],
    '2023-Q3': [110, 170, 85],
    '2023-Q4': [130, 180, 95]
})

sales_long = pd.melt(
    sales_wide,
    id_vars=['product'],
    var_name='quarter',
    value_name='sales'
)

print("\nTime series (long format):")
print(sales_long)

# Convert quarter to datetime
sales_long['quarter'] = pd.PeriodIndex(sales_long['quarter'], freq='Q').to_timestamp()

# 7. Handling Missing Values
df_missing = pd.DataFrame({
    'id': [1, 2, 3],
    'A': [10, None, 30],
    'B': [40, 50, None]
})

# Default: keeps NaN
df_long_nan = pd.melt(df_missing, id_vars=['id'])
print("\nWith NaN:")
print(df_long_nan)

# Drop NaN
df_long_clean = pd.melt(df_missing, id_vars=['id']).dropna()
print("\nNaN dropped:")
print(df_long_clean)

# 8. Practical Example: Survey Data
survey = pd.DataFrame({
    'respondent_id': [1, 2, 3],
    'age': [25, 30, 35],
    'q1_satisfaction': [4, 5, 3],
    'q2_likelihood': [5, 4, 4],
    'q3_quality': [4, 5, 5]
})

# Melt only question columns
survey_long = pd.melt(
    survey,
    id_vars=['respondent_id', 'age'],
    value_vars=['q1_satisfaction', 'q2_likelihood', 'q3_quality'],
    var_name='question',
    value_name='response'
)

# Clean question names
survey_long['question'] = survey_long['question'].str.replace('q\\d+_', '', regex=True)

print("\nSurvey data (long):")
print(survey_long)

# 9. Performance with Large Data
# melt is efficient for large datasets
import numpy as np

large_df = pd.DataFrame({
    'id': range(10000),
    **{f'col_{i}': np.random.rand(10000) for i in range(50)}
})

import time
start = time.time()
large_melted = pd.melt(large_df, id_vars=['id'])
melt_time = time.time() - start

print(f"\nMelted 10k rows x 50 cols in {melt_time:.4f}s")
print(f"Result shape: {large_melted.shape}")

# 10. Use Cases Comparison
print("\n=== When to Use ===")
print("Wide format:")
print("- Human-readable reports")
print("- Pivot tables")
print("- Matrix operations")

print("\nLong format (melt):")
print("- Analysis in Pandas/SQL")
print("- Plotting with seaborn/plotly")
print("- GroupBy operations")
print("- Statistical modeling")

Wide vs Long Format:

Format Structure Use Case Example
Wide Many columns Reports, humans Math: 85, Science: 88
Long Few columns, many rows Analysis, computers Subject: Math, Score: 85
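
One practical reason long format matters: plotting libraries such as seaborn expect one observation per row. A hedged sketch reusing the sales_long frame built above (assumes seaborn and matplotlib are installed):

import seaborn as sns
import matplotlib.pyplot as plt

# One line per product over time -- straightforward only because the data is long
sns.lineplot(data=sales_long, x='quarter', y='sales', hue='product')
plt.title('Quarterly sales by product')
plt.show()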

melt() Parameters:

Parameter Description Example
id_vars Columns to keep id_vars=['student']
value_vars Columns to melt value_vars=['math', 'science']
var_name Name for variable column var_name='subject'
value_name Name for value column value_name='score'

Interview Insights:

Interviewer's Insight

  • "melt(): wide β†’ long format (unpivot)"
  • "id_vars: columns to keep fixed"
  • "value_vars: columns to melt (optional, default all)"
  • "var_name: name for melted column names"
  • "value_name: name for values"
  • "Opposite of pivot()/pivot_table()"
  • "Use for: plotting, analysis, tidy data"
  • "Long format better for: GroupBy, seaborn, SQL"
  • "Round trip: melt() + pivot() = original"

Explain pd.crosstab() - Amazon, Microsoft Interview Question

Difficulty: 🟒 Easy | Tags: crosstab, Frequency Tables, Statistics | Asked by: Amazon, Microsoft, Google, Meta

View Answer

crosstab() computes cross-tabulation of two or more factors. Shows frequency distributions. Similar to pivot_table but specialized for counts and proportions.

Complete Examples:

import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
    'gender': np.random.choice(['M', 'F'], 200),
    'education': np.random.choice(['HS', 'Bachelor', 'Master', 'PhD'], 200),
    'employed': np.random.choice(['Yes', 'No'], 200),
    'age_group': np.random.choice(['18-25', '26-35', '36-50', '50+'], 200)
})

# 1. Basic Crosstab (Counts)
ct = pd.crosstab(df['gender'], df['education'])
print("Basic crosstab:")
print(ct)

# 2. With Margins (Totals)
ct_margins = pd.crosstab(
    df['gender'],
    df['education'],
    margins=True,
    margins_name='Total'
)
print("\nWith totals:")
print(ct_margins)

# 3. Normalize (Percentages)
# Normalize all (percentages of total)
ct_norm_all = pd.crosstab(
    df['gender'],
    df['education'],
    normalize='all'
)
print("\nNormalized (all):")
print(ct_norm_all)

# Normalize by rows
ct_norm_index = pd.crosstab(
    df['gender'],
    df['education'],
    normalize='index'
)
print("\nNormalized (by row):")
print(ct_norm_index)

# Normalize by columns
ct_norm_columns = pd.crosstab(
    df['gender'],
    df['education'],
    normalize='columns'
)
print("\nNormalized (by column):")
print(ct_norm_columns)

# 4. Multiple Row/Column Variables
ct_multi = pd.crosstab(
    [df['gender'], df['age_group']],
    [df['education'], df['employed']]
)
print("\nMulti-level crosstab:")
print(ct_multi)

# 5. With Values (Like pivot_table)
df['income'] = np.random.randint(30000, 120000, len(df))

ct_values = pd.crosstab(
    df['gender'],
    df['education'],
    values=df['income'],
    aggfunc='mean'
)
print("\nWith values (mean income):")
print(ct_values)

# 6. Multiple Aggregations
ct_multi_agg = pd.crosstab(
    df['gender'],
    df['education'],
    values=df['income'],
    aggfunc=['mean', 'median', 'count']
)
print("\nMultiple aggregations:")
print(ct_multi_agg)

# 7. Dropna Parameter
df_with_nan = df.copy()
df_with_nan.loc[::10, 'education'] = None

# Include NaN
ct_with_nan = pd.crosstab(
    df_with_nan['gender'],
    df_with_nan['education'],
    dropna=False
)
print("\nWith NaN:")
print(ct_with_nan)

# Exclude NaN
ct_no_nan = pd.crosstab(
    df_with_nan['gender'],
    df_with_nan['education'],
    dropna=True
)
print("\nExcluding NaN:")
print(ct_no_nan)

# 8. Practical Example: A/B Test Analysis
ab_test = pd.DataFrame({
    'variant': np.random.choice(['A', 'B'], 1000),
    'converted': np.random.choice(['Yes', 'No'], 1000, p=[0.15, 0.85]),
    'device': np.random.choice(['Mobile', 'Desktop'], 1000)
})

# Conversion rates
conversion_ct = pd.crosstab(
    ab_test['variant'],
    ab_test['converted'],
    normalize='index'
)
print("\nA/B test conversion rates:")
print(conversion_ct)

# By device
device_ct = pd.crosstab(
    [ab_test['variant'], ab_test['device']],
    ab_test['converted'],
    normalize='index',
    margins=True
)
print("\nBy device:")
print(device_ct)

# 9. Chi-Square Test
# Statistical independence test
from scipy.stats import chi2_contingency

ct_test = pd.crosstab(df['gender'], df['employed'])
chi2, p_value, dof, expected = chi2_contingency(ct_test)

print(f"\nChi-square test:")
print(f"Chi2: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Independent: {p_value > 0.05}")

# 10. Comparison with pivot_table
# crosstab
ct = pd.crosstab(df['gender'], df['education'])

# Equivalent pivot_table
pt = df.pivot_table(
    index='gender',
    columns='education',
    aggfunc='size',
    fill_value=0
)

print("\ncrosstab == pivot_table:", ct.equals(pt))

crosstab() vs pivot_table():

Aspect crosstab() pivot_table()
Purpose Frequency counts General aggregation
Default aggfunc Count Mean
Normalize βœ… Built-in ❌ Manual
Input Arrays or Series DataFrame columns
Use Case Categorical analysis Numeric summarization
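
As the comparison notes, crosstab() also accepts plain arrays or lists rather than DataFrame columns; rownames/colnames label the axes. A minimal sketch:

import pandas as pd

device = ['Mobile', 'Desktop', 'Mobile', 'Mobile', 'Desktop']
churned = ['Yes', 'No', 'No', 'Yes', 'No']

ct_lists = pd.crosstab(device, churned, rownames=['device'], colnames=['churned'])
print(ct_lists)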

normalize Parameter:

Value Meaning Use Case
'all' % of total Overall distribution
'index' % of row total Row-wise comparison
'columns' % of column total Column-wise comparison

Interview Insights:

Interviewer's Insight

  • "crosstab(): frequency table of categorical variables"
  • "Default: counts, can aggregate with values/aggfunc"
  • "normalize: 'all', 'index', 'columns' for percentages"
  • "margins=True: add row/column totals"
  • "Useful for: categorical analysis, A/B testing, chi-square"
  • "vs pivot_table: specialized for counts"
  • "Multiple variables: pass lists to index/columns"
  • "Chi-square test: test independence of variables"

Explain pd.get_dummies() for One-Hot Encoding - Google, Netflix Interview Question

Difficulty: 🟒 Easy | Tags: One-Hot Encoding, Feature Engineering, ML Preprocessing | Asked by: Google, Netflix, Meta, Amazon

View Answer

get_dummies() converts categorical variables to binary indicator columns (one-hot encoding). Essential for machine learning with categorical features.

Complete Examples:

import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'M', 'S'],
    'price': [10, 20, 15, 12, 18]
})

print("Original data:")
print(df)

# 1. Basic One-Hot Encoding
dummies = pd.get_dummies(df['color'])
print("\nOne-hot encoded:")
print(dummies)

# 2. Add Prefix
dummies = pd.get_dummies(df['color'], prefix='color')
print("\nWith prefix:")
print(dummies)

# 3. Encode Entire DataFrame
df_encoded = pd.get_dummies(df, columns=['color', 'size'])
print("\nFull DataFrame encoded:")
print(df_encoded)

# 4. Drop First (Avoid Multicollinearity)
# Drop one category to avoid dummy variable trap
dummies_drop = pd.get_dummies(
    df,
    columns=['color', 'size'],
    drop_first=True
)
print("\nDrop first category:")
print(dummies_drop)

# 5. Custom Prefix for Multiple Columns
dummies_custom = pd.get_dummies(
    df,
    columns=['color', 'size'],
    prefix=['col', 'sz']
)
print("\nCustom prefixes:")
print(dummies_custom)

# 6. Handle Unknown Categories
# Training data
train = pd.DataFrame({
    'category': ['A', 'B', 'C', 'A', 'B']
})

train_encoded = pd.get_dummies(train, prefix='cat')
print("\nTraining encoded:")
print(train_encoded)

# Test data with new category 'D'
test = pd.DataFrame({
    'category': ['A', 'B', 'D']
})

test_encoded = pd.get_dummies(test, prefix='cat')
print("\nTest encoded (has 'D'):")
print(test_encoded)

# Align columns
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print("\nTest aligned with training:")
print(test_aligned)

# 7. Dummy NA
# Include NaN as category
df_na = pd.DataFrame({
    'color': ['red', 'blue', None, 'red', 'blue']
})

dummies_na = pd.get_dummies(df_na, dummy_na=True)
print("\nWith NaN as category:")
print(dummies_na)

# 8. Sparse Matrix (Memory Efficient)
# For high cardinality
df_large = pd.DataFrame({
    'category': np.random.choice(list('ABCDEFGHIJ'), 10000)
})

# Dense (default)
dense = pd.get_dummies(df_large)
print(f"\nDense memory: {dense.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Sparse
sparse = pd.get_dummies(df_large, sparse=True)
print(f"Sparse memory: {sparse.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# 9. Practical Example: ML Preprocessing
# Customer data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'country': ['USA', 'UK', 'Canada', 'USA', 'UK'],
    'membership': ['Gold', 'Silver', 'Bronze', 'Gold', 'Silver'],
    'age': [25, 35, 45, 30, 40],
    'purchases': [10, 5, 8, 12, 6]
})

# Encode categorical, keep numeric
customers_encoded = pd.get_dummies(
    customers,
    columns=['country', 'membership'],
    drop_first=True,  # Avoid multicollinearity
    prefix=['country', 'tier']
)

print("\nML-ready data:")
print(customers_encoded)

# 10. Comparison with Categorical
# get_dummies: expands to multiple columns
encoded = pd.get_dummies(df['color'])
print("\nget_dummies (wide):")
print(encoded)

# Categorical: single column with integer codes
categorical = df['color'].astype('category')
print("\nCategorical (compact):")
print(categorical)
print("Codes:", categorical.cat.codes.tolist())

get_dummies() Parameters:

Parameter Description Example
data DataFrame or Series pd.get_dummies(df)
columns Columns to encode columns=['color']
prefix Column name prefix prefix='cat'
drop_first Avoid multicollinearity drop_first=True
dummy_na Include NaN as category dummy_na=True
sparse Use sparse matrix sparse=True

When to drop_first:

Use Case drop_first Reason
Linear models βœ… True Avoid multicollinearity
Tree models ❌ False Can handle all dummies
Neural networks ❌ False Usually fine
Interpretation ❌ False Easier to understand

get_dummies() vs Categorical:

Aspect get_dummies() Categorical
Format Multiple columns Single column
Storage More space Less space
ML Ready to use Need encoding
Interpretability Clear Needs mapping
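
The insights below also contrast one-hot encoding with label/ordinal encoding (a single integer column, as sklearn's LabelEncoder produces). A minimal sketch using pd.factorize for the same idea; whether ordinal codes are appropriate depends on the model:

import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'red', 'blue'])

# Ordinal / label encoding: one integer column
codes, uniques = pd.factorize(colors)
print(codes)     # e.g. [0 1 2 0 1]
print(uniques)   # Index(['red', 'blue', 'green'], dtype='object')

# One-hot encoding: one 0/1 column per category
print(pd.get_dummies(colors, prefix='color'))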

Interview Insights:

Interviewer's Insight

  • "get_dummies(): one-hot encoding (categorical β†’ binary)"
  • "Creates column per category with 0/1"
  • "prefix: add meaningful column names"
  • "drop_first=True: avoid dummy variable trap (linear models)"
  • "dummy_na=True: treat NaN as separate category"
  • "sparse=True: save memory for high cardinality"
  • "Align train/test: use reindex() for new categories"
  • "vs LabelEncoder: ordinal encoding (single column)"
  • "Use for: ML models that need numeric input"

Explain SettingWithCopyWarning - Meta, Google Interview Question

Difficulty: 🟑 Medium | Tags: Warning, DataFrame Copies, Best Practices | Asked by: Meta, Google, Amazon, Netflix

View Answer

SettingWithCopyWarning occurs when modifying a DataFrame slice that might be a view or copy. Use .loc for assignment and .copy() when needed to avoid ambiguity.

Complete Examples:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['a', 'b', 'c', 'd', 'e']
})

# 1. The Problem: Chained Indexing
# BAD: This triggers warning
# df[df['A'] > 2]['B'] = 999  # SettingWithCopyWarning!

# GOOD: Use .loc
df.loc[df['A'] > 2, 'B'] = 999
print("Correct assignment:")
print(df)

# Reset for more examples
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['a', 'b', 'c', 'd', 'e']
})

# 2. View vs Copy
# Sometimes slice is a view, sometimes a copy
# Pandas can't always tell which

# This might be a view
subset1 = df[df['A'] > 2]

# This might be a copy
subset2 = df[['A', 'B']]

# Modifying triggers warning (is it view or copy?)
# subset1['B'] = 999  # Warning!

# 3. Explicit Copy
# Make explicit copy to avoid warning
subset_copy = df[df['A'] > 2].copy()
subset_copy['B'] = 999  # No warning
print("\nExplicit copy:")
print(subset_copy)
print("\nOriginal unchanged:")
print(df)

# 4. Using .loc (Recommended)
# Always use .loc for assignment
df_correct = df.copy()

# Single condition
df_correct.loc[df_correct['A'] > 2, 'B'] = 999

# Multiple conditions
df_correct.loc[(df_correct['A'] > 2) & (df_correct['B'] < 50), 'C'] = 'updated'

print("\nUsing .loc:")
print(df_correct)

# 5. Function Modifications
# BAD: Modifying without copy
def process_bad(data):
    # If data is a slice, this warns
    data['new_col'] = data['A'] * 2
    return data

# GOOD: Explicit copy
def process_good(data):
    data = data.copy()
    data['new_col'] = data['A'] * 2
    return data

subset = df[df['A'] > 2]
# result_bad = process_bad(subset)  # Warning
result_good = process_good(subset)  # No warning
print("\nProcessed with copy:")
print(result_good)

# 6. When to Copy
print("\n=== When to Use .copy() ===")

# Slice you'll modify
working_df = df[df['A'] > 2].copy()
working_df['B'] = 100  # Safe

# Function that modifies
def transform(data):
    data = data.copy()  # Safe
    data['transformed'] = data['A'] ** 2
    return data

# 7. Detecting Views vs Copies
# Check if shares memory
df_original = pd.DataFrame({'A': [1, 2, 3]})

# View (shares memory)
view = df_original['A']
print(f"\nView shares memory: {np.shares_memory(df_original['A'].values, view.values)}")

# Copy (doesn't share)
copy = df_original['A'].copy()
print(f"Copy shares memory: {np.shares_memory(df_original['A'].values, copy.values)}")

# 8. Common Patterns
print("\n=== Best Practices ===")

# Pattern 1: Filtering then modifying
# BAD:
# filtered = df[df['A'] > 2]
# filtered['B'] = 100  # Warning

# GOOD:
filtered = df[df['A'] > 2].copy()
filtered['B'] = 100  # No warning

# Pattern 2: Direct modification
# BEST (if you want to modify original):
df.loc[df['A'] > 2, 'B'] = 100

# Pattern 3: Function returning modified data
def clean_data(df):
    """Always copy at start of function."""
    df = df.copy()
    df['cleaned'] = df['A'] * 2
    return df

# 9. Disabling Warning (Not Recommended)
# Only for debugging
# pd.options.mode.chained_assignment = None  # Disable
# pd.options.mode.chained_assignment = 'warn'  # Default
# pd.options.mode.chained_assignment = 'raise'  # Raise error

# 10. Real-World Example
sales_df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'sales': [100, 200, 150, 300, 250],
    'region': ['North', 'South', 'North', 'South', 'North']
})

# Task: Add 10% bonus to North region

# BAD:
# north = sales_df[sales_df['region'] == 'North']
# north['sales'] = north['sales'] * 1.1  # Warning!

# GOOD Option 1: Use .loc on original
sales_df.loc[sales_df['region'] == 'North', 'sales'] *= 1.1
print("\nSales with bonus:")
print(sales_df)

# GOOD Option 2: Explicit copy if you need subset
north_copy = sales_df[sales_df['region'] == 'North'].copy()
north_copy['sales'] *= 1.1
print("\nNorth region (copy):")
print(north_copy)

Common Causes:

Pattern Problem Solution
df[cond][col] = val Chained indexing df.loc[cond, col] = val
subset['col'] = val Unclear if view/copy subset = df[...].copy()
func(df[...]) Function modifies slice Copy in function

Best Practices:

Scenario Recommended Approach
Modify original df.loc[condition, column] = value
Modify subset subset = df[...].copy()
Function Copy at start: df = df.copy()
Read-only No copy needed
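
One option from section 9 above that is only shown commented out: escalating the warning to a hard error during development or tests. A minimal sketch, assuming a pandas version that still ships the chained_assignment option (it goes away once copy-on-write becomes the default):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Escalate the warning to an error so chained assignments fail fast
pd.set_option('mode.chained_assignment', 'raise')

try:
    df[df['A'] > 1]['B'] = 0  # chained indexing on a likely copy
except pd.errors.SettingWithCopyError:
    print("Caught chained assignment - use df.loc[df['A'] > 1, 'B'] = 0 instead")

pd.set_option('mode.chained_assignment', 'warn')  # restore the default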

Interview Insights:

Interviewer's Insight

  • "SettingWithCopyWarning: ambiguous view vs copy"
  • "Chained indexing bad: df[cond][col] = val"
  • "Use .loc: df.loc[cond, col] = val"
  • "Explicit copy when needed: .copy()"
  • "View: shares memory, copy: independent"
  • "Always copy at function start if modifying"
  • "Warning helps avoid silent bugs"
  • "Never disable warning without understanding"
  • "Rule: if modifying slice, use .loc or .copy()"

Explain Pandas String Methods (.str accessor) - Amazon, Microsoft Interview Question

Difficulty: 🟒 Easy | Tags: String Methods, Text Processing, Data Cleaning | Asked by: Amazon, Microsoft, Google, Meta

View Answer

The .str accessor provides vectorized string operations on a Series. Essential for text cleaning, pattern matching, and feature extraction; usually faster and always cleaner than explicit Python loops.

Complete Examples:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['  John Doe  ', 'jane smith', 'BOB WILSON', 'Alice Johnson'],
    'email': ['john.doe@example.com', 'jane.smith@example.com', 'bob.wilson@test.org', 'alice.j@example.com'],
    'phone': ['123-456-7890', '(555) 123-4567', '555.123.4567', '1234567890'],
    'address': ['123 Main St, NYC, NY', '456 Oak Ave, LA, CA', '789 Elm Rd, Austin, TX', '321 Pine Dr, Miami, FL']
})

# 1. Basic String Operations
# Lower/upper case
df['name_lower'] = df['name'].str.lower()
df['name_upper'] = df['name'].str.upper()

# Title case
df['name_title'] = df['name'].str.title()

# Strip whitespace
df['name_clean'] = df['name'].str.strip()

print("Case conversions:")
print(df[['name', 'name_lower', 'name_title', 'name_clean']].head())

# 2. Contains / Startswith / Endswith
# Check if contains pattern
df['has_john'] = df['name'].str.lower().str.contains('john')

# Starts with
df['starts_with_j'] = df['name'].str.lower().str.startswith('j')

# Ends with
df['example_domain'] = df['email'].str.endswith('@example.com')

print("\nPattern matching:")
print(df[['name', 'has_john', 'starts_with_j']])

# 3. Replace
# Simple replace
df['email_masked'] = df['email'].str.replace('@', '[at]')

# Regex replace
df['phone_clean'] = df['phone'].str.replace(r'[\(\)\-\.\s]', '', regex=True)

print("\nReplace:")
print(df[['email', 'email_masked', 'phone', 'phone_clean']])

# 4. Extract with Regex
# Extract domain from email
df['domain'] = df['email'].str.extract(r'@(\w+\.\w+)')

# Extract area code
df['area_code'] = df['phone_clean'].str.extract(r'^(\d{3})')

# Multiple groups
df[['first_name', 'last_name']] = df['name_clean'].str.extract(r'(\w+)\s+(\w+)')

print("\nExtract:")
print(df[['email', 'domain', 'first_name', 'last_name']])

# 5. Split
# Split on space
name_parts = df['name_clean'].str.split()
print("\nSplit (list):")
print(name_parts)

# Expand to columns
df[['first', 'last']] = df['name_clean'].str.split(n=1, expand=True)

# Split address
df[['street', 'city', 'state']] = df['address'].str.split(',', expand=True)

print("\nSplit (expanded):")
print(df[['first', 'last', 'city']])

# 6. Slice
# First 3 characters
df['name_abbr'] = df['name_clean'].str[:3]

# Last 4 of phone
df['phone_last4'] = df['phone_clean'].str[-4:]

print("\nSlice:")
print(df[['name_clean', 'name_abbr', 'phone_last4']])

# 7. Length and Find
# String length
df['name_length'] = df['name_clean'].str.len()

# Find position
df['at_position'] = df['email'].str.find('@')

print("\nLength and find:")
print(df[['name_clean', 'name_length', 'email', 'at_position']])

# 8. Count Occurrences
# Count character
df['num_spaces'] = df['address'].str.count(' ')

# Count pattern
df['num_digits'] = df['phone'].str.count(r'\d')

print("\nCount:")
print(df[['address', 'num_spaces']])

# 9. Padding
# Pad with zeros
order_ids = pd.Series([1, 42, 123])
padded = order_ids.astype(str).str.zfill(5)
print("\nZero-padded:")
print(padded)

# Pad left/right
padded_left = order_ids.astype(str).str.pad(width=5, side='left', fillchar='0')
padded_right = order_ids.astype(str).str.pad(width=5, side='right', fillchar='*')

# 10. Practical Example: Data Cleaning
messy_data = pd.DataFrame({
    'customer_name': ['  john DOE  ', 'Jane_Smith', 'bob-wilson'],
    'email': ['John.Doe@Example.COM', 'jane.smith@test', 'bob.wilson@mail.org'],
    'phone': ['(555)123-4567', '555 123 4567', '5551234567'],
    'zip_code': ['12345', '678', '90210']
})

# Clean everything
cleaned = messy_data.copy()

# Names: strip, title case, replace special chars
cleaned['customer_name'] = (cleaned['customer_name']
                            .str.strip()
                            .str.replace(r'[_-]', ' ', regex=True)
                            .str.title())

# Email: lowercase, validate format
cleaned['email'] = cleaned['email'].str.lower()
cleaned['email_valid'] = cleaned['email'].str.match(r'^[\w\.-]+@[\w\.-]+\.\w+$')

# Phone: extract digits only, format
phone_digits = cleaned['phone'].str.replace(r'\D', '', regex=True)
cleaned['phone_formatted'] = phone_digits.str.replace(
    r'^(\d{3})(\d{3})(\d{4})$',
    r'(\1) \2-\3',
    regex=True
)

# Zip: pad with zeros
cleaned['zip_code'] = cleaned['zip_code'].str.zfill(5)

print("\nCleaned data:")
print(cleaned)

# 11. Performance Comparison
import time

large_series = pd.Series(['test_string'] * 100000)

# Vectorized (fast)
start = time.time()
_ = large_series.str.upper()
vectorized_time = time.time() - start

# Loop (slow)
start = time.time()
_ = [s.upper() for s in large_series]
loop_time = time.time() - start

print(f"\nVectorized: {vectorized_time:.4f}s")
print(f"Loop: {loop_time:.4f}s ({loop_time/vectorized_time:.0f}x slower)")

Common String Methods:

Method Description Example
lower/upper/title Case conversion .str.lower()
strip/lstrip/rstrip Remove whitespace .str.strip()
contains Check if contains .str.contains('word')
replace Replace substring .str.replace('old', 'new')
extract Regex extract .str.extract(r'(\d+)')
split Split string .str.split(',')
[:n] Slice .str[:5]
len Length .str.len()
find Find position .str.find('@')
count Count occurrences .str.count('a')

Regex Operations:

Method Purpose Example
contains Match pattern .str.contains(r'\d{3}')
extract Extract groups .str.extract(r'(\w+)@(\w+)')
replace Replace pattern .str.replace(r'\s+', ' ')
match Full match .str.match(r'^\d{5}$')
findall Find all matches .str.findall(r'\d+')
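
findall() appears in the table above but not in the examples; here is a minimal sketch (with made-up sample text) showing that it returns a list of all matches per row, which pairs naturally with explode():

import pandas as pd

notes = pd.Series([
    'Order 123 shipped, invoice 456 pending',
    'No numbers here',
    'Ticket 789'
])

# findall(): every regex match per row, returned as a list
all_numbers = notes.str.findall(r'\d+')
print(all_numbers)
# row 0 -> ['123', '456'], row 1 -> [], row 2 -> ['789']

# One match per row: explode the lists
print(all_numbers.explode())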

Interview Insights:

Interviewer's Insight

  • ".str accessor: vectorized string operations"
  • "Much faster than Python loops (10-100x)"
  • "Common: lower(), strip(), replace(), contains()"
  • "extract(): use regex with capturing groups"
  • "split(expand=True): split into columns"
  • "replace(regex=True): regex patterns"
  • "Use for: data cleaning, feature extraction"
  • "Returns Series of same length (NaN for missing)"
  • "Performance: always prefer .str over loops"

Explain pd.explode() - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: explode, List Columns, Data Transformation | Asked by: Google, Amazon, Meta

View Answer

explode() transforms a list-like column into separate rows. Each list element becomes its own row, with the values in the other columns duplicated. Essential for nested data structures.

Complete Examples:

import pandas as pd
import numpy as np

# 1. Basic explode
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'hobbies': [['reading', 'swimming'], ['gaming'], ['cooking', 'painting', 'music']]
})

print("Original (lists in column):")
print(df)

exploded = df.explode('hobbies')
print("\nExploded:")
print(exploded)

# 2. Reset Index
exploded_reset = df.explode('hobbies').reset_index(drop=True)
print("\nWith reset index:")
print(exploded_reset)

# 3. Multiple Columns with Lists
df_multi = pd.DataFrame({
    'id': [1, 2],
    'tags': [['a', 'b'], ['c', 'd', 'e']],
    'scores': [[10, 20], [30, 40, 50]]
})

# Explode both columns (must be same length per row)
exploded_multi = df_multi.explode(['tags', 'scores'])
print("\nExplode multiple columns:")
print(exploded_multi)

# 4. With Missing Values
df_missing = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'items': [['apple', 'banana'], None, []]
})

exploded_na = df_missing.explode('items')
print("\nWith missing/empty:")
print(exploded_na)

# 5. Practical Example: User Tags
users = pd.DataFrame({
    'user_id': [101, 102, 103],
    'username': ['alice', 'bob', 'charlie'],
    'interests': [
        ['python', 'data science', 'ML'],
        ['javascript', 'web dev'],
        ['design', 'UX', 'CSS', 'HTML']
    ]
})

# Explode to analyze tag frequency
user_interests = users.explode('interests')

tag_counts = user_interests['interests'].value_counts()
print("\nTag frequency:")
print(tag_counts)

# 6. E-commerce Orders
orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer': ['Alice', 'Bob', 'Charlie'],
    'products': [
        ['Laptop', 'Mouse'],
        ['Keyboard'],
        ['Monitor', 'Cable', 'Stand']
    ],
    'quantities': [
        [1, 2],
        [1],
        [1, 3, 1]
    ],
    'prices': [
        [1000, 25],
        [75],
        [300, 10, 50]
    ]
})

# Explode order items
order_items = orders.explode(['products', 'quantities', 'prices'])
order_items = order_items.reset_index(drop=True)

# Calculate line totals
order_items['line_total'] = (
    order_items['quantities'].astype(int) * 
    order_items['prices'].astype(int)
)

print("\nOrder items:")
print(order_items)

# Total per order
order_totals = order_items.groupby('order_id')['line_total'].sum()
print("\nOrder totals:")
print(order_totals)

# 7. JSON/Nested Data
# Simulating API response
api_data = pd.DataFrame({
    'post_id': [1, 2],
    'author': ['Alice', 'Bob'],
    'comments': [
        [
            {'user': 'Bob', 'text': 'Great!'},
            {'user': 'Charlie', 'text': 'Awesome!'}
        ],
        [
            {'user': 'Alice', 'text': 'Thanks!'}
        ]
    ]
})

# Explode comments
comments_df = api_data.explode('comments').reset_index(drop=True)

# Extract comment fields
comments_df['commenter'] = comments_df['comments'].apply(lambda x: x['user'] if x else None)
comments_df['comment_text'] = comments_df['comments'].apply(lambda x: x['text'] if x else None)

comments_df = comments_df.drop('comments', axis=1)

print("\nExtracted comments:")
print(comments_df)

# 8. Time Series Events
events = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=3, freq='D'),
    'events': [
        ['login', 'purchase', 'logout'],
        ['login'],
        ['login', 'view', 'view', 'logout']
    ]
})

event_log = events.explode('events').reset_index(drop=True)

# Event frequency by day
event_freq = event_log.groupby(['date', 'events']).size().unstack(fill_value=0)
print("\nEvent frequency:")
print(event_freq)

# 9. Performance Note
# explode is efficient, but watch for very large lists
import time

# Create test data
large_df = pd.DataFrame({
    'id': range(1000),
    'values': [[1, 2, 3, 4, 5] for _ in range(1000)]
})

start = time.time()
_ = large_df.explode('values')
explode_time = time.time() - start

print(f"\nExploded 1000 rows Γ— 5 values in {explode_time:.4f}s")

# 10. Reverse: groupby + list
# Explode β†’ Group back to lists
exploded_data = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B'],
    'value': [1, 2, 3, 4, 5]
})

grouped = exploded_data.groupby('category')['value'].apply(list).reset_index()
print("\nGrouped back to lists:")
print(grouped)

explode() Characteristics:

Aspect Description
Input Column with lists/arrays/sets
Output One row per list element
Index Duplicates original index
Other cols Repeated for each element
Empty list Creates one row with NaN
None/NaN Creates one row with NaN

Common Patterns:

Pattern Use Case
User tags Analyze interests, categories
Order items E-commerce line items
JSON arrays Flatten nested API data
Multi-value fields Convert to normalized form
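
A related pattern worth noting: multi-value fields often arrive as delimited strings rather than real lists. A minimal sketch (with hypothetical tag data) is to split first, then explode:

import pandas as pd

df = pd.DataFrame({
    'post_id': [1, 2],
    'tags': ['python,pandas', 'sql']  # comma-separated strings, not lists
})

# Split the string column into lists, then explode into one row per tag
tag_rows = (df
            .assign(tags=df['tags'].str.split(','))
            .explode('tags')
            .reset_index(drop=True))
print(tag_rows)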

Interview Insights:

Interviewer's Insight

  • "explode(): list column β†’ separate rows"
  • "Each list element becomes new row"
  • "Other columns duplicated"
  • "Index preserved (duplicated)"
  • "reset_index(drop=True) for clean index"
  • "Can explode multiple columns (same lengths)"
  • "Empty list β†’ NaN row"
  • "Reverse: groupby().apply(list)"
  • "Use for: nested data, JSON arrays, multi-value fields"

Explain Pandas Pipe Method - Meta, Google Interview Question

Difficulty: 🟑 Medium | Tags: Pipe, Method Chaining, Clean Code | Asked by: Meta, Google, Netflix

View Answer

pipe() enables method chaining with custom functions: it passes a DataFrame through a pipeline of functions for clean, readable data transformations.

Complete Examples:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'salary': [50000, 60000, 70000, 80000],
    'department': ['Sales', 'IT', 'Sales', 'IT']
})

# 1. Without pipe (nested functions - hard to read)
def add_bonus(df, rate=0.1):
    df = df.copy()
    df['bonus'] = df['salary'] * rate
    return df

def calculate_total(df):
    df = df.copy()
    df['total'] = df['salary'] + df['bonus']
    return df

# Nested (ugly)
result = calculate_total(add_bonus(df, rate=0.15))

# 2. With pipe (clean, readable)
result = (df
          .pipe(add_bonus, rate=0.15)
          .pipe(calculate_total))

print("With pipe:")
print(result)

# 3. Multiple Transformations
def filter_by_age(df, min_age):
    return df[df['age'] >= min_age]

def add_age_group(df):
    df = df.copy()
    df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], 
                              labels=['Young', 'Middle', 'Senior'])
    return df

def calculate_metrics(df):
    df = df.copy()
    df['salary_per_age'] = df['salary'] / df['age']
    return df

# Clean pipeline
result = (df
          .pipe(filter_by_age, min_age=28)
          .pipe(add_age_group)
          .pipe(calculate_metrics))

print("\nPipeline result:")
print(result)

# 4. Mixing Built-in and Custom Functions
result = (df
          .pipe(lambda d: d[d['age'] > 25])  # Lambda
          .assign(bonus=lambda d: d['salary'] * 0.1)  # Built-in
          .pipe(lambda d: d.sort_values('bonus', ascending=False))
          .reset_index(drop=True))

print("\nMixed pipeline:")
print(result)

# 5. Pipe with Arguments and Kwargs
def apply_discount(df, column, discount_rate, condition_col, condition_val):
    df = df.copy()
    mask = df[condition_col] == condition_val
    df.loc[mask, column] = df.loc[mask, column] * (1 - discount_rate)
    return df

result = (df
          .pipe(apply_discount, 
                column='salary',
                discount_rate=0.05,
                condition_col='department',
                condition_val='Sales'))

print("\nWith arguments:")
print(result)

# 6. Error Handling in Pipeline
def validate_data(df):
    """Validate DataFrame."""
    assert not df['salary'].isna().any(), "Salary has NaN"
    assert (df['age'] > 0).all(), "Invalid age"
    return df

def clean_data(df):
    """Clean DataFrame."""
    df = df.copy()
    df = df.dropna()
    df = df[df['age'] > 0]
    return df

# Safe pipeline
result = (df
          .pipe(clean_data)
          .pipe(validate_data)
          .pipe(add_bonus, rate=0.1))

# 7. Practical Example: Data Processing Pipeline
sales_data = pd.DataFrame({
    'order_id': range(1, 101),
    'customer_id': np.random.randint(1, 21, 100),
    'product': np.random.choice(['A', 'B', 'C'], 100),
    'quantity': np.random.randint(1, 10, 100),
    'price': np.random.randint(10, 100, 100),
    'date': pd.date_range('2024-01-01', periods=100, freq='D')
})

def add_revenue(df):
    df = df.copy()
    df['revenue'] = df['quantity'] * df['price']
    return df

def add_date_features(df):
    df = df.copy()
    df['month'] = df['date'].dt.month
    df['day_of_week'] = df['date'].dt.day_name()
    return df

def filter_high_value(df, min_revenue=500):
    return df[df['revenue'] >= min_revenue]

def add_customer_segment(df):
    df = df.copy()
    customer_totals = df.groupby('customer_id')['revenue'].sum()
    segments = pd.qcut(customer_totals, q=3, labels=['Low', 'Medium', 'High'])
    df['segment'] = df['customer_id'].map(segments)
    return df

# Complete pipeline
processed = (sales_data
             .pipe(add_revenue)
             .pipe(add_date_features)
             .pipe(filter_high_value, min_revenue=300)
             .pipe(add_customer_segment)
             .pipe(lambda d: d.sort_values(['segment', 'revenue'], 
                                           ascending=[True, False])))

print("\nProcessed sales data:")
print(processed.head(10))

# 8. Reusable Pipelines
def sales_pipeline(df, min_revenue=500):
    """Reusable sales processing pipeline."""
    return (df
            .pipe(add_revenue)
            .pipe(add_date_features)
            .pipe(filter_high_value, min_revenue=min_revenue)
            .pipe(add_customer_segment))

# Apply to different datasets
result1 = sales_pipeline(sales_data, min_revenue=300)
result2 = sales_pipeline(sales_data, min_revenue=600)

print(f"\nWith min_revenue=300: {len(result1)} rows")
print(f"With min_revenue=600: {len(result2)} rows")

# 9. pipe() vs Method Chaining
# Method chaining (only built-in methods)
result_chain = (df
                .query('age > 25')
                .assign(bonus=lambda d: d['salary'] * 0.1)
                .sort_values('bonus'))

# pipe (custom functions)
result_pipe = (df
               .pipe(lambda d: d[d['age'] > 25])
               .pipe(add_bonus, rate=0.1)
               .pipe(lambda d: d.sort_values('bonus')))

print("\nBoth approaches work:")
print(f"Chain: {len(result_chain)} rows")
print(f"Pipe: {len(result_pipe)} rows")

# 10. Performance Note
# pipe() has minimal overhead
import time

large_df = pd.DataFrame({
    'A': range(100000),
    'B': range(100000)
})

# With pipe
start = time.time()
_ = (large_df
     .pipe(lambda d: d[d['A'] > 50000])
     .pipe(lambda d: d.assign(C=d['A'] + d['B'])))
pipe_time = time.time() - start

# Without pipe
start = time.time()
temp = large_df[large_df['A'] > 50000]
_ = temp.assign(C=temp['A'] + temp['B'])
direct_time = time.time() - start

print(f"\nPipe: {pipe_time:.4f}s")
print(f"Direct: {direct_time:.4f}s")
print("(Virtually same performance)")

pipe() Benefits:

Benefit Description
Readability Linear, top-to-bottom flow
Reusability Functions can be reused
Testability Each function testable
Maintainability Easy to add/remove steps
Debugging Comment out steps easily

pipe() vs Method Chaining:

Aspect pipe() Method Chaining
Functions Custom functions Built-in methods
Flexibility Very flexible Limited to methods
Arguments Pass parameters Use kwargs
Use Case Complex pipelines Simple transformations
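
One detail not covered above: if the DataFrame is not the first argument of your function, pipe() also accepts a (callable, keyword) tuple that names the parameter which should receive the data. A minimal sketch with a hypothetical helper:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

def scale(factor, data):
    """Hypothetical helper where the DataFrame is the second argument."""
    return data * factor

# Tell pipe() to pass the DataFrame as the 'data' keyword
result = df.pipe((scale, 'data'), factor=10)
print(result)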

Interview Insights:

Interviewer's Insight

  • "pipe(): pass DataFrame through functions"
  • "Enables method chaining with custom functions"
  • "Clean, readable data pipelines"
  • "Pass arguments: .pipe(func, arg1, arg2)"
  • "Better than nested functions: f(g(h(df)))"
  • "Combine with built-in methods: assign, query"
  • "Each function: take df, return df"
  • "Benefits: readability, reusability, testability"
  • "Minimal performance overhead"

Explain pd.json_normalize() - Amazon, Google Interview Question

Difficulty: 🟑 Medium | Tags: JSON, Nested Data, Flattening | Asked by: Amazon, Google, Meta, Netflix

View Answer

json_normalize() flattens nested JSON/dict structures into DataFrames. Essential for API responses, nested documents, and semi-structured data.

Complete Examples:

import pandas as pd
import json

# 1. Simple Nested JSON
data = [
    {'name': 'Alice', 'age': 25, 'address': {'city': 'NYC', 'zip': '10001'}},
    {'name': 'Bob', 'age': 30, 'address': {'city': 'LA', 'zip': '90001'}}
]

df = pd.json_normalize(data)
print("Flattened JSON:")
print(df)

# 2. Deep Nesting
data = [
    {
        'id': 1,
        'user': {
            'name': 'Alice',
            'contact': {
                'email': 'alice@example.com',
                'phone': '123-456-7890'
            }
        }
    },
    {
        'id': 2,
        'user': {
            'name': 'Bob',
            'contact': {
                'email': 'bob@example.com',
                'phone': '098-765-4321'
            }
        }
    }
]

df = pd.json_normalize(data)
print("\nDeep nesting:")
print(df)

# 3. Custom Separator
df = pd.json_normalize(data, sep='_')
print("\nCustom separator:")
print(df)

# 4. Arrays in JSON
data = [
    {'name': 'Alice', 'scores': [90, 85, 88]},
    {'name': 'Bob', 'scores': [78, 82, 80]}
]

# Note: Arrays stay as lists
df = pd.json_normalize(data)
print("\nWith arrays:")
print(df)

# Explode to separate rows
df_exploded = df.explode('scores').reset_index(drop=True)
print("\nExploded scores:")
print(df_exploded)

# 5. Nested Arrays of Objects
data = [
    {
        'order_id': 1,
        'customer': 'Alice',
        'items': [
            {'product': 'Laptop', 'price': 1000},
            {'product': 'Mouse', 'price': 25}
        ]
    },
    {
        'order_id': 2,
        'customer': 'Bob',
        'items': [
            {'product': 'Keyboard', 'price': 75}
        ]
    }
]

df = pd.json_normalize(
    data,
    record_path='items',
    meta=['order_id', 'customer']
)
print("\nNested arrays of objects:")
print(df)

# 6. Multiple Levels of Nesting
data = [
    {
        'id': 1,
        'name': 'Alice',
        'orders': [
            {
                'order_id': 101,
                'items': [
                    {'product': 'A', 'qty': 2},
                    {'product': 'B', 'qty': 1}
                ]
            }
        ]
    }
]

# Flatten orders
df_orders = pd.json_normalize(
    data,
    record_path='orders',
    meta=['id', 'name']
)
print("\nFlattened orders:")
print(df_orders)

# Further flatten items
# Need to work with the items column
items_data = []
for _, row in df_orders.iterrows():
    for item in row['items']:
        items_data.append({
            'id': row['id'],
            'name': row['name'],
            'order_id': row['order_id'],
            'product': item['product'],
            'qty': item['qty']
        })

df_items = pd.DataFrame(items_data)
print("\nFully flattened:")
print(df_items)

# 7. Real Example: GitHub API Response
github_response = [
    {
        'id': 1,
        'name': 'repo1',
        'owner': {
            'login': 'alice',
            'id': 123,
            'url': 'https://github.com/alice'
        },
        'stats': {
            'stars': 100,
            'forks': 20,
            'watchers': 50
        }
    },
    {
        'id': 2,
        'name': 'repo2',
        'owner': {
            'login': 'bob',
            'id': 456,
            'url': 'https://github.com/bob'
        },
        'stats': {
            'stars': 500,
            'forks': 100,
            'watchers': 250
        }
    }
]

df = pd.json_normalize(github_response, sep='_')
print("\nGitHub repos:")
print(df)

# 8. Handling Missing Keys
data = [
    {'name': 'Alice', 'age': 25, 'city': 'NYC'},
    {'name': 'Bob', 'age': 30},  # Missing 'city'
    {'name': 'Charlie', 'city': 'LA'}  # Missing 'age'
]

df = pd.json_normalize(data)
print("\nMissing keys (NaN):")
print(df)

# 9. Real Example: Twitter/X API
tweets = [
    {
        'id': 1,
        'text': 'Hello World',
        'user': {'name': 'Alice', 'followers': 1000},
        'entities': {
            'hashtags': ['hello', 'world'],
            'mentions': ['@bob']
        },
        'metrics': {'likes': 50, 'retweets': 10}
    },
    {
        'id': 2,
        'text': 'Python is great',
        'user': {'name': 'Bob', 'followers': 5000},
        'entities': {
            'hashtags': ['python'],
            'mentions': []
        },
        'metrics': {'likes': 200, 'retweets': 50}
    }
]

df = pd.json_normalize(tweets, sep='_')
print("\nTweets data:")
print(df[['id', 'text', 'user_name', 'user_followers', 'metrics_likes']])

# 10. Performance with Large JSON
# json_normalize is efficient
import time

large_data = [
    {
        'id': i,
        'data': {
            'value': i * 2,
            'meta': {
                'timestamp': '2024-01-01',
                'source': 'api'
            }
        }
    }
    for i in range(10000)
]

start = time.time()
df = pd.json_normalize(large_data, sep='_')
normalize_time = time.time() - start

print(f"\nNormalized 10k records in {normalize_time:.4f}s")
print(f"Result shape: {df.shape}")

json_normalize() Parameters:

Parameter Description Example
data JSON data (list of dicts) [{'a': 1}]
record_path Path to nested records record_path='items'
meta Columns to include from parent meta=['id', 'name']
sep Separator for flattened keys sep='_' (default '.')
max_level Max nesting depth max_level=1
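
The max_level parameter from the table is not demonstrated above; a minimal sketch of how it limits flattening depth (deeper dicts stay as objects in a single column):

import pandas as pd

data = [{'id': 1, 'user': {'name': 'Alice', 'contact': {'email': 'alice@example.com'}}}]

# Fully flattened (default): id, user.name, user.contact.email
print(pd.json_normalize(data).columns.tolist())

# Stop after one level: the nested 'contact' dict stays in a single column
print(pd.json_normalize(data, max_level=1).columns.tolist())
# ['id', 'user.name', 'user.contact']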

Common Patterns:

Pattern Solution
Simple nesting json_normalize(data)
Nested arrays json_normalize(data, record_path='items', meta=['id'])
Deep nesting Flatten in multiple steps
Arrays of values Use with .explode()

Interview Insights:

Interviewer's Insight

  • "json_normalize(): flatten nested JSON/dicts"
  • "Converts nested structures to flat DataFrame"
  • "record_path: path to nested array"
  • "meta: parent columns to include"
  • "sep: separator for flattened names (default '.')"
  • "Use with explode() for nested arrays"
  • "Essential for: API responses, NoSQL data"
  • "Handles missing keys (creates NaN)"
  • "Use for: Twitter API, GitHub API, MongoDB docs"

Explain pd.to_datetime() Options - Microsoft, Amazon Interview Question

Difficulty: 🟒 Easy | Tags: DateTime, Parsing, Time Series | Asked by: Microsoft, Amazon, Google, Meta

View Answer

to_datetime() parses strings/numbers to datetime objects. Supports multiple formats, error handling, timezone conversions, and performance optimizations with format parameter.

Complete Examples:

import pandas as pd
import numpy as np

# 1. Basic Parsing
# Automatic format detection
dates = pd.Series(['2024-01-15', '2024-02-20', '2024-03-25'])
parsed = pd.to_datetime(dates)
print("Basic parsing:")
print(parsed)
print(f"Type: {parsed.dtype}")

# 2. Different Formats
# Common formats
formats = {
    'ISO': ['2024-01-15', '2024-02-20'],
    'US': ['01/15/2024', '02/20/2024'],
    'EU': ['15/01/2024', '20/02/2024'],
    'Text': ['Jan 15, 2024', 'Feb 20, 2024']
}

for name, dates in formats.items():
    parsed = pd.to_datetime(dates)
    print(f"\n{name} format:")
    print(parsed)

# 3. Specify Format (Much Faster!)
dates = pd.Series(['20240115', '20240220'] * 10000)

import time

# Without format (slow, tries multiple formats)
start = time.time()
_ = pd.to_datetime(dates)
auto_time = time.time() - start

# With format (fast, direct parsing)
start = time.time()
_ = pd.to_datetime(dates, format='%Y%m%d')
format_time = time.time() - start

print(f"\nAuto detection: {auto_time:.4f}s")
print(f"With format: {format_time:.4f}s ({auto_time/format_time:.0f}x faster!)")

# 4. Common Format Codes
print("\n=== Format Codes ===")
examples = {
    '%Y-%m-%d': '2024-01-15',
    '%d/%m/%Y': '15/01/2024',
    '%m/%d/%Y': '01/15/2024',
    '%Y%m%d': '20240115',
    '%d-%b-%Y': '15-Jan-2024',
    '%Y-%m-%d %H:%M:%S': '2024-01-15 14:30:00',
    '%d/%m/%Y %I:%M %p': '15/01/2024 02:30 PM'
}

for fmt, example in examples.items():
    parsed = pd.to_datetime(example, format=fmt)
    print(f"{fmt:25} β†’ {example:25} β†’ {parsed}")

# 5. Error Handling
mixed_dates = pd.Series(['2024-01-15', '2024-02-20', 'not a date', '2024-03-25'])

# errors='raise' (default) - raises exception
# try: pd.to_datetime(mixed_dates); except: pass

# errors='ignore' - returns the original input if it can't be parsed
# (note: errors='ignore' is deprecated in recent pandas releases; prefer 'coerce')
ignore_result = pd.to_datetime(mixed_dates, errors='ignore')
print("\nerrors='ignore':")
print(ignore_result)
print(f"Types: {[type(x).__name__ for x in ignore_result]}")

# errors='coerce' - NaT for invalid
coerce_result = pd.to_datetime(mixed_dates, errors='coerce')
print("\nerrors='coerce':")
print(coerce_result)

# 6. From Components
df = pd.DataFrame({
    'year': [2024, 2024, 2024],
    'month': [1, 2, 3],
    'day': [15, 20, 25]
})

# Combine columns
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
print("\nFrom components:")
print(df)

# 7. Unix Timestamps
timestamps = pd.Series([1704067200, 1706745600, 1709337600])

# Convert (unit='s' for seconds)
dates = pd.to_datetime(timestamps, unit='s')
print("\nFrom Unix timestamp:")
print(dates)

# Milliseconds
timestamps_ms = timestamps * 1000
dates_ms = pd.to_datetime(timestamps_ms, unit='ms')

# 8. Timezone Handling
# Parse with timezone
dates = pd.Series(['2024-01-15 10:00:00', '2024-02-20 15:30:00'])

# UTC
utc = pd.to_datetime(dates, utc=True)
print("\nUTC:")
print(utc)

# Convert to timezone
eastern = utc.dt.tz_convert('America/New_York')
print("\nEastern Time:")
print(eastern)

# Localize (add timezone to naive)
naive = pd.to_datetime(dates)
localized = naive.dt.tz_localize('America/Los_Angeles')
print("\nLocalized to Pacific:")
print(localized)

# 9. Day First / Year First
ambiguous_dates = pd.Series(['01-02-2024', '03-04-2024'])

# Default: month first (US)
us_format = pd.to_datetime(ambiguous_dates)
print("\nMonth first (US):")
print(us_format)

# Day first (EU)
eu_format = pd.to_datetime(ambiguous_dates, dayfirst=True)
print("\nDay first (EU):")
print(eu_format)

# 10. Practical Example: Log File Parsing
log_data = pd.DataFrame({
    'timestamp': [
        '2024-01-15 10:30:45.123',
        '2024-01-15 10:31:12.456',
        '2024-01-15 10:32:03.789'
    ],
    'event': ['login', 'click', 'logout']
})

# Parse with milliseconds
log_data['timestamp'] = pd.to_datetime(
    log_data['timestamp'],
    format='%Y-%m-%d %H:%M:%S.%f'
)

# Extract time components
log_data['hour'] = log_data['timestamp'].dt.hour
log_data['minute'] = log_data['timestamp'].dt.minute
log_data['second'] = log_data['timestamp'].dt.second

# Calculate duration
log_data['duration'] = log_data['timestamp'].diff()

print("\nLog data:")
print(log_data)

Common Format Codes:

Code Meaning Example
%Y 4-digit year 2024
%y 2-digit year 24
%m Month (01-12) 01
%d Day (01-31) 15
%H Hour 24h (00-23) 14
%I Hour 12h (01-12) 02
%M Minute (00-59) 30
%S Second (00-59) 45
%f Microsecond 123456
%p AM/PM PM
%b Month abbr Jan
%B Month full January

Key Parameters:

Parameter Description Example
format Explicit format format='%Y-%m-%d'
errors Error handling errors='coerce'
dayfirst Day before month dayfirst=True
yearfirst Year first yearfirst=True
utc Parse as UTC utc=True
unit Timestamp unit unit='s'

Error Handling:

Option Behavior Use Case
raise Raise exception Ensure all valid
ignore Return original Keep as string
coerce Return NaT Fill invalid with NaT
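
A small sketch building on errors='coerce': after coercion, the rows that failed to parse can be reported by checking for NaT, which is usually more useful than silently keeping missing values:

import pandas as pd

raw = pd.Series(['2024-01-15', '2024-02-30', 'not a date', '2024-03-25'])

# Coerce: unparseable values (and impossible dates like Feb 30) become NaT
parsed = pd.to_datetime(raw, errors='coerce', format='%Y-%m-%d')

# Report the offending inputs
bad_rows = raw[parsed.isna()]
print(f"{len(bad_rows)} unparseable value(s):")
print(bad_rows)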

Interview Insights:

Interviewer's Insight

  • "to_datetime(): parse strings/numbers to datetime"
  • "format parameter: specify format for speed (10-100x faster)"
  • "Format codes: %Y, %m, %d, %H, %M, %S"
  • "errors='coerce': NaT for invalid dates"
  • "dayfirst=True: European format (day/month/year)"
  • "unit='s': parse Unix timestamps"
  • "utc=True: parse as UTC"
  • "From dict: pass DataFrame with year/month/day columns"
  • "Always specify format for large datasets"

Explain pd.read_csv() Optimization - Google, Netflix Interview Question

Difficulty: 🟑 Medium | Tags: File I/O, Performance, Large Files | Asked by: Google, Netflix, Amazon, Meta

View Answer

read_csv() optimization reduces memory and time. Key techniques: dtype specification, usecols, nrows, chunksize, efficient parsers, and using alternative formats.

Complete Examples:

import pandas as pd
import numpy as np
import time

# Create sample CSV
sample_df = pd.DataFrame({
    'id': range(100000),
    'category': np.random.choice(['A', 'B', 'C'], 100000),
    'value': np.random.rand(100000),
    'text': ['some text'] * 100000,
    'date': pd.date_range('2020-01-01', periods=100000, freq='h')
})
sample_df.to_csv('sample.csv', index=False)

# 1. Basic (Slow)
start = time.time()
df = pd.read_csv('sample.csv')
basic_time = time.time() - start
basic_memory = df.memory_usage(deep=True).sum() / 1024**2

print(f"Basic: {basic_time:.4f}s, {basic_memory:.2f} MB")
print(df.dtypes)

# 2. Specify dtypes (Faster, Less Memory)
start = time.time()
df_typed = pd.read_csv(
    'sample.csv',
    dtype={
        'id': 'int32',  # vs int64
        'category': 'category',  # vs object
        'value': 'float32'  # vs float64
    },
    parse_dates=['date']
)
typed_time = time.time() - start
typed_memory = df_typed.memory_usage(deep=True).sum() / 1024**2

print(f"\nWith dtypes: {typed_time:.4f}s, {typed_memory:.2f} MB")
print(f"Memory saved: {(1 - typed_memory/basic_memory)*100:.1f}%")
print(df_typed.dtypes)

# 3. Load Only Needed Columns
start = time.time()
df_cols = pd.read_csv(
    'sample.csv',
    usecols=['id', 'category', 'value']
)
cols_time = time.time() - start
cols_memory = df_cols.memory_usage(deep=True).sum() / 1024**2

print(f"\nSelected columns: {cols_time:.4f}s, {cols_memory:.2f} MB")

# 4. Load First N Rows (Testing)
df_sample = pd.read_csv('sample.csv', nrows=1000)
print(f"\nFirst 1000 rows: {df_sample.shape}")

# 5. Chunking (For Very Large Files)
def process_chunks(filename, chunksize=10000):
    """Process file in chunks."""
    total = 0
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        # Process each chunk
        total += chunk['value'].sum()
    return total

start = time.time()
result = process_chunks('sample.csv', chunksize=10000)
chunk_time = time.time() - start
print(f"\nChunked processing: {chunk_time:.4f}s, result: {result:.2f}")

# 6. C Engine vs Python Engine
# C engine (default, faster)
start = time.time()
_ = pd.read_csv('sample.csv', engine='c')
c_time = time.time() - start

# Python engine (more features, slower)
start = time.time()
_ = pd.read_csv('sample.csv', engine='python')
python_time = time.time() - start

print(f"\nC engine: {c_time:.4f}s")
print(f"Python engine: {python_time:.4f}s")

# 7. Low Memory Mode
# low_memory=True (the default) parses the file in internal chunks, which can
# produce mixed-dtype warnings on inconsistent columns; pass low_memory=False
# (or better, an explicit dtype=) to resolve types consistently
df_low_mem = pd.read_csv('sample.csv', low_memory=False)

# 8. Date Parsing Optimization
# Bad: parse after loading
start = time.time()
df1 = pd.read_csv('sample.csv')
df1['date'] = pd.to_datetime(df1['date'])
parse_after_time = time.time() - start

# Good: parse during loading
start = time.time()
df2 = pd.read_csv('sample.csv', parse_dates=['date'])
parse_during_time = time.time() - start

print(f"\nParse after: {parse_after_time:.4f}s")
print(f"Parse during: {parse_during_time:.4f}s")

# 9. Compression
# Save compressed
sample_df.to_csv('sample.csv.gz', index=False, compression='gzip')

# Read compressed (automatic detection)
start = time.time()
df_compressed = pd.read_csv('sample.csv.gz')
compressed_time = time.time() - start

import os
csv_size = os.path.getsize('sample.csv') / 1024**2
gz_size = os.path.getsize('sample.csv.gz') / 1024**2

print(f"\nCSV size: {csv_size:.2f} MB")
print(f"Gzip size: {gz_size:.2f} MB ({(1-gz_size/csv_size)*100:.1f}% smaller)")
print(f"Load time: {compressed_time:.4f}s")

# 10. Alternative: Parquet (Much Better!)
# Save as parquet
sample_df.to_parquet('sample.parquet', index=False)

# Read parquet
start = time.time()
df_parquet = pd.read_parquet('sample.parquet')
parquet_time = time.time() - start
parquet_size = os.path.getsize('sample.parquet') / 1024**2

print(f"\nParquet size: {parquet_size:.2f} MB")
print(f"Parquet load: {parquet_time:.4f}s")
print(f"Parquet vs CSV: {basic_time/parquet_time:.1f}x faster!")

# 11. Skip Rows
# Skip first N rows
df_skip = pd.read_csv('sample.csv', skiprows=1000)

# Skip by condition (Python engine)
df_skip_func = pd.read_csv(
    'sample.csv',
    skiprows=lambda x: x % 2 == 0,  # Every other row
    engine='python'
)

# 12. Practical Recommendations
print("\n=== Best Practices ===")
print("1. Specify dtypes (especially category for low-cardinality)")
print("2. Use usecols to load only needed columns")
print("3. Use chunksize for files larger than RAM")
print("4. parse_dates during loading, not after")
print("5. Consider parquet for repeated reads (10x faster)")
print("6. Use compression for storage (gzip, snappy)")
print("7. Test with nrows first to design pipeline")
print("8. Use low_memory=False to avoid dtype warnings")

# Cleanup
import os
for file in ['sample.csv', 'sample.csv.gz', 'sample.parquet']:
    if os.path.exists(file):
        os.remove(file)

Optimization Techniques:

Technique Speed Improvement Memory Savings Use Case
dtype 10-20% 50-90% Specify types
usecols 30-50% 50-80% Select columns
chunksize N/A 90%+ Larger than RAM
parse_dates 20-30% - Date columns
engine='c' 2-5x - Default, fastest
parquet 5-10x 50-80% Best format

Key Parameters:

Parameter Purpose Example
dtype Specify types dtype={'col': 'int32'}
usecols Select columns usecols=['a', 'b']
nrows Limit rows nrows=1000
chunksize Read in chunks chunksize=10000
parse_dates Parse dates parse_dates=['date']
low_memory Avoid warnings low_memory=False
engine Parser engine='c' (fast)
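
A common workflow that combines nrows and dtype from the table: sample the file first, derive a dtype map from the sample, then re-read the full file with those types. A minimal sketch, assuming a local file named data.csv:

import pandas as pd

# 1) Peek at a small sample to learn the schema cheaply
sample = pd.read_csv('data.csv', nrows=1000)

# 2) Build a dtype map: low-cardinality object columns -> category
dtype_map = {
    col: 'category'
    for col in sample.select_dtypes(include='object').columns
    if sample[col].nunique() / len(sample) < 0.5
}

# 3) Re-read the full file with the cheaper dtypes
df = pd.read_csv('data.csv', dtype=dtype_map)
print(df.dtypes)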

Interview Insights:

Interviewer's Insight

  • "Specify dtypes: 50-90% memory savings"
  • "usecols: load only needed columns"
  • "chunksize: process files larger than RAM"
  • "parse_dates: faster than parsing after load"
  • "category dtype: 90%+ savings for low cardinality"
  • "int32 vs int64: 50% memory savings"
  • "float32 vs float64: 50% memory savings"
  • "Parquet: 5-10x faster than CSV"
  • "nrows: test with sample first"
  • "engine='c': default, fastest"

Explain Pandas Style for Formatting - Meta, Amazon Interview Question

Difficulty: 🟒 Easy | Tags: Styling, Formatting, Visualization | Asked by: Meta, Amazon, Microsoft

View Answer

style API formats DataFrame output with colors, highlights, and conditional formatting. Creates publication-ready tables, heatmaps, and visual reports.

Complete Examples:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D', 'E'],
    'Sales': [10000, 15000, 8000, 12000, 18000],
    'Profit': [2000, 3500, 1200, 2800, 4200],
    'Growth': [0.15, -0.05, 0.22, 0.08, 0.18],
    'Target': [12000, 14000, 10000, 13000, 16000]
})

# 1. Basic Number Formatting
styled = df.style.format({
    'Sales': '${:,.0f}',
    'Profit': '${:,.0f}',
    'Growth': '{:.1%}'
})
print("Formatted (display in notebook):")
# styled  # Display in Jupyter

# 2. Highlight Maximum
styled = df.style.highlight_max(
    subset=['Sales', 'Profit'],
    color='lightgreen'
)

# 3. Highlight Minimum
styled = df.style.highlight_min(
    subset=['Sales', 'Profit'],
    color='lightcoral'
)

# 4. Background Gradient
styled = df.style.background_gradient(
    subset=['Sales', 'Profit'],
    cmap='YlOrRd'
)

# 5. Color Negative Values
def color_negative_red(val):
    color = 'red' if val < 0 else 'black'
    return f'color: {color}'

# Note: Styler.applymap was renamed to Styler.map in newer pandas versions
styled = df.style.applymap(
    color_negative_red,
    subset=['Growth']
)

# 6. Conditional Formatting (Row-wise)
def highlight_missed_target(row):
    """Highlight rows that missed target."""
    if row['Sales'] < row['Target']:
        return ['background-color: #ffcccb'] * len(row)
    else:
        return [''] * len(row)

styled = df.style.apply(highlight_missed_target, axis=1)

# 7. Bar Charts in Cells
styled = df.style.bar(
    subset=['Sales'],
    color='lightblue'
)

# 8. Combining Multiple Styles
styled = (df.style
          .format({
              'Sales': '${:,.0f}',
              'Profit': '${:,.0f}',
              'Growth': '{:.1%}',
              'Target': '${:,.0f}'
          })
          .background_gradient(subset=['Sales'], cmap='Blues')
          .bar(subset=['Profit'], color='lightgreen')
          .highlight_max(subset=['Growth'], color='yellow'))

# 9. Custom Function (Element-wise)
def color_profit(val):
    """Color profit levels."""
    if val > 3000:
        return 'background-color: #90EE90'
    elif val > 2000:
        return 'background-color: #FFFFCC'
    else:
        return 'background-color: #FFB6C1'

styled = df.style.applymap(color_profit, subset=['Profit'])

# 10. Practical Example: Sales Dashboard
dashboard_df = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West'],
    'Q1_Sales': [120000, 95000, 105000, 110000],
    'Q2_Sales': [135000, 88000, 115000, 125000],
    'YoY_Growth': [0.12, -0.08, 0.15, 0.10],
    'Target_Met': ['Yes', 'No', 'Yes', 'Yes']
})

def highlight_yes_green(val):
    if val == 'Yes':
        return 'background-color: lightgreen'
    elif val == 'No':
        return 'background-color: lightcoral'
    return ''

styled = (dashboard_df.style
          .format({
              'Q1_Sales': '${:,.0f}',
              'Q2_Sales': '${:,.0f}',
              'YoY_Growth': '{:.1%}'
          })
          .background_gradient(subset=['Q1_Sales', 'Q2_Sales'], cmap='YlGn')
          .applymap(highlight_yes_green, subset=['Target_Met'])
          .set_caption('Quarterly Sales Dashboard'))

# 11. Export to HTML/Excel
# To HTML
html = styled.to_html()
# with open('report.html', 'w') as f:
#     f.write(html)

# To Excel (with xlsxwriter)
# styled.to_excel('report.xlsx', engine='xlsxwriter')

# 12. Hiding Index/Columns
styled = (df.style
          .hide(axis='index')  # .hide_index() in older pandas versions
          .format({'Sales': '${:,.0f}'}))

# 13. Setting Properties
styled = (df.style
          .set_properties(**{
              'background-color': 'lightblue',
              'color': 'black',
              'border-color': 'white'
          }, subset=['Product']))

# 14. Tooltips (Hover Text)
# Note: More advanced, requires custom CSS

# 15. Practical Report
financial_df = pd.DataFrame({
    'Metric': ['Revenue', 'COGS', 'Gross Profit', 'Operating Exp', 'Net Income'],
    '2023': [1000000, 600000, 400000, 250000, 150000],
    '2024': [1200000, 680000, 520000, 280000, 240000],
    'Change': [0.20, 0.13, 0.30, 0.12, 0.60]
})

def color_change(val):
    if val > 0.15:
        return 'color: green; font-weight: bold'
    elif val < 0:
        return 'color: red; font-weight: bold'
    return ''

styled = (financial_df.style
          .format({
              '2023': '${:,.0f}',
              '2024': '${:,.0f}',
              'Change': '{:+.1%}'
          })
          .applymap(color_change, subset=['Change'])
          .set_caption('Financial Report 2023-2024')
          .set_table_styles([
              {'selector': 'caption',
               'props': [('font-size', '16px'), ('font-weight', 'bold')]}
          ]))

print("\nStyle API commonly used in Jupyter notebooks for visual reports")

Common Style Methods:

Method Purpose Example
format Number formatting .format({'col': '{:.2f}'})
highlight_max Highlight maximum .highlight_max(color='green')
highlight_min Highlight minimum .highlight_min(color='red')
background_gradient Color scale .background_gradient(cmap='Blues')
bar In-cell bar chart .bar(color='blue')
applymap Element-wise function .applymap(func, subset=['col'])
apply Row/column function .apply(func, axis=1)

Format Strings:

Format Output Use Case
{:.2f} 12.35 2 decimals
{:,.0f} 1,234 Thousands separator
${:,.2f} $1,234.56 Currency
{:.1%} 12.3% Percentage
{:+.1f} +12.3 Show sign

Color Maps:

cmap Description
'YlOrRd' Yellow-Orange-Red (heat)
'Blues' Light to dark blue
'Greens' Light to dark green
'RdYlGn' Red-Yellow-Green (diverging)
'viridis' Perceptually uniform
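
The HTML export mentioned in section 11 is only shown commented out; a minimal end-to-end sketch (writing to a hypothetical report.html) looks like this:

import pandas as pd

df = pd.DataFrame({'Product': ['A', 'B'], 'Sales': [10000, 15000]})

styled = (df.style
          .format({'Sales': '${:,.0f}'})
          .highlight_max(subset=['Sales'], color='lightgreen')
          .set_caption('Mini Sales Report'))

# Render the styled table to a standalone HTML file
with open('report.html', 'w') as f:
    f.write(styled.to_html())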

Interview Insights:

Interviewer's Insight

  • "style: format DataFrame for display"
  • "format(): number formatting (${:,.0f}, {:.1%})"
  • "highlight_max/min(): highlight extremes"
  • "background_gradient(): color scale"
  • "bar(): in-cell bar charts"
  • "applymap(): element-wise styling"
  • "apply(): row/column-wise styling"
  • "Method chaining for multiple styles"
  • "Export: .to_html(), .to_excel()"
  • "Use in: Jupyter reports, dashboards"

Explain Pandas Merge Indicators - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Merge, Join, Data Quality | Asked by: Google, Amazon, Meta

View Answer

Merge indicator tracks source of each row (left_only, right_only, both). Essential for data validation, join analysis, and identifying mismatches.

Complete Examples:

import pandas as pd

# Sample data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'city': ['NYC', 'LA', 'Chicago', 'Boston', 'Seattle']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'customer_id': [1, 2, 2, 6, 7, 1],
    'amount': [100, 200, 150, 300, 250, 175]
})

# 1. Basic Merge with Indicator
merged = pd.merge(
    customers,
    orders,
    on='customer_id',
    how='outer',
    indicator=True
)

print("Merge with indicator:")
print(merged)
print("\nIndicator values:")
print(merged['_merge'].value_counts())

# 2. Custom Indicator Name
merged = pd.merge(
    customers,
    orders,
    on='customer_id',
    how='outer',
    indicator='source'
)

print("\nCustom indicator name:")
print(merged[['customer_id', 'name', 'order_id', 'source']])

# 3. Analyze Merge Results
# Customers without orders
no_orders = merged[merged['source'] == 'left_only']
print("\nCustomers without orders:")
print(no_orders[['customer_id', 'name']])

# Orders without customer match
orphan_orders = merged[merged['source'] == 'right_only']
print("\nOrphan orders (no matching customer):")
print(orphan_orders[['order_id', 'customer_id', 'amount']])

# Successfully matched
matched = merged[merged['source'] == 'both']
print("\nMatched records:")
print(matched[['customer_id', 'name', 'order_id', 'amount']])

# 4. Data Quality Check
def analyze_merge(left, right, on, how='outer'):
    """Analyze merge quality."""
    merged = pd.merge(left, right, on=on, how=how, indicator=True)

    stats = merged['_merge'].value_counts()
    total = len(merged)

    print("=== Merge Analysis ===")
    print(f"Total rows: {total}")
    print(f"Left only: {stats.get('left_only', 0)} ({stats.get('left_only', 0)/total*100:.1f}%)")
    print(f"Right only: {stats.get('right_only', 0)} ({stats.get('right_only', 0)/total*100:.1f}%)")
    print(f"Both: {stats.get('both', 0)} ({stats.get('both', 0)/total*100:.1f}%)")

    return merged

result = analyze_merge(customers, orders, on='customer_id')

# 5. Validation After Merge
merged = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)

# Assert no orphan orders
orphans = merged[merged['_merge'] == 'right_only']
if len(orphans) > 0:
    print(f"\n⚠️  WARNING: {len(orphans)} orders without customer!")
    print(orphans[['order_id', 'customer_id']])

# Assert all customers have orders
no_orders = merged[merged['_merge'] == 'left_only']
if len(no_orders) > 0:
    print(f"\nINFO: {len(no_orders)} customers without orders")

# 6. Practical Example: Product Catalog Sync
catalog = pd.DataFrame({
    'sku': ['A001', 'A002', 'A003', 'A004'],
    'product': ['Widget', 'Gadget', 'Tool', 'Device'],
    'price': [10, 20, 15, 25]
})

inventory = pd.DataFrame({
    'sku': ['A001', 'A002', 'A005', 'A006'],
    'warehouse': ['NYC', 'LA', 'NYC', 'LA'],
    'quantity': [100, 50, 75, 30]
})

sync = pd.merge(catalog, inventory, on='sku', how='outer', indicator='status')

print("\n=== Catalog Sync Report ===")
print(f"Products in catalog only: {(sync['status']=='left_only').sum()}")
print(f"Products in inventory only: {(sync['status']=='right_only').sum()}")
print(f"Products in both: {(sync['status']=='both').sum()}")

# Products to add to inventory
to_add = sync[sync['status'] == 'left_only']
print("\nAdd to inventory:")
print(to_add[['sku', 'product']])

# Products to remove from inventory
to_remove = sync[sync['status'] == 'right_only']
print("\nRemove from inventory:")
print(to_remove[['sku', 'warehouse']])

# 7. Multiple Key Merge
sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East'],
    'product': ['A', 'B', 'A', 'C'],
    'sales': [100, 200, 150, 180]
})

targets = pd.DataFrame({
    'region': ['North', 'South', 'West'],
    'product': ['A', 'B', 'A'],
    'target': [120, 180, 100]
})

comparison = pd.merge(
    sales,
    targets,
    on=['region', 'product'],
    how='outer',
    indicator=True
)

print("\nSales vs Targets:")
print(comparison)

# Missing targets
no_target = comparison[comparison['_merge'] == 'left_only']
print("\nSales without targets:")
print(no_target[['region', 'product', 'sales']])

# 8. Time Series Alignment
actual = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5, freq='D'),
    'actual': [100, 110, 105, 115, 120]
})

forecast = pd.DataFrame({
    'date': pd.date_range('2024-01-02', periods=5, freq='D'),
    'forecast': [108, 106, 116, 122, 125]
})

comparison = pd.merge(
    actual,
    forecast,
    on='date',
    how='outer',
    indicator=True
).sort_values('date')

print("\nActual vs Forecast:")
print(comparison)

# Calculate accuracy only where both exist
both_exist = comparison[comparison['_merge'] == 'both'].copy()
both_exist['error'] = abs(both_exist['actual'] - both_exist['forecast'])
print("\nForecast accuracy:")
print(both_exist[['date', 'actual', 'forecast', 'error']])

Indicator Values:

Value Meaning Description
'left_only' In left only No match in right
'right_only' In right only No match in left
'both' In both Successfully matched

Common Use Cases:

Use Case Check For
Data quality Orphan records
ETL validation Missing matches
Catalog sync Products to add/remove
Customer orders Customers without orders
Join analysis Match rate
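
A related data-quality tool worth knowing alongside indicator (not shown above): the validate argument of pd.merge(), which raises immediately if the join cardinality is not what you expect. A minimal sketch:

import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'order_id': [101, 102, 103], 'customer_id': [1, 2, 2]})

# Expect one customer -> many orders; duplicate customer_ids on the left
# would raise pandas.errors.MergeError instead of silently multiplying rows
merged = pd.merge(customers, orders, on='customer_id', how='inner',
                  validate='one_to_many', indicator=True)
print(merged)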

Interview Insights:

Interviewer's Insight

  • "indicator=True: add _merge column"
  • "Values: 'left_only', 'right_only', 'both'"
  • "Use for: data quality checks, join validation"
  • "Filter by indicator: find unmatched records"
  • "Custom name: indicator='source'"
  • "Essential for: ETL pipelines, data validation"
  • "Check match rate: value_counts() on _merge"
  • "Find orphans: merged[merged['_merge']=='right_only']"
  • "Always use with how='outer' for full picture"

Explain DataFrame.where() and mask() - Meta, Microsoft Interview Question

Difficulty: 🟑 Medium | Tags: Conditional Operations, Filtering, Data Replacement | Asked by: Meta, Microsoft, Google

View Answer

where() and mask() replace values based on conditions: where() keeps values where the condition is True, while mask() replaces them where the condition is True. An alternative to boolean indexing for conditional replacement.

Complete Examples:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

# 1. where() - Keep values where condition is True
# Replace False values with NaN
result = df.where(df > 20)
print("where (keep > 20):")
print(result)

# 2. where() with replacement value
result = df.where(df > 20, 0)
print("\nwhere with 0:")
print(result)

# 3. mask() - Replace values where condition is True
# Opposite of where()
result = df.mask(df > 20)
print("\nmask (replace > 20 with NaN):")
print(result)

# 4. mask() with replacement value
result = df.mask(df > 20, 0)
print("\nmask with 0:")
print(result)

# 5. Practical Example: Outlier Capping
data = pd.DataFrame({
    'values': [1, 5, 10, 100, 15, 200, 8, 300]
})

# Cap at 50
capped = data['values'].where(data['values'] <= 50, 50)
print("\nCapped at 50:")
print(capped)

# Or using mask
capped2 = data['values'].mask(data['values'] > 50, 50)
print("\nSame with mask:")
print(capped2)

# 6. Element-wise Operations
df = pd.DataFrame({
    'temperature': [-5, 0, 10, 25, 35, 42],
    'humidity': [20, 40, 60, 80, 90, 95]
})

# Replace extreme temperatures
df['temp_adjusted'] = df['temperature'].where(
    (df['temperature'] >= 0) & (df['temperature'] <= 40)
)
print("\nTemperature filtering:")
print(df)

# 7. Using Functions
# where/mask can take callable
result = df['temperature'].where(lambda x: x > 0)
print("\nUsing lambda:")
print(result)

# 8. Multiple Conditions
sales = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'quantity': [10, 50, 100, 200, 5],
    'price': [10, 20, 15, 25, 30]
})

# Flag high-value sales
sales['category'] = 'Normal'
sales['category'] = sales['category'].where(
    sales['quantity'] * sales['price'] < 1000,
    'High Value'
)
print("\nSales categories:")
print(sales)

# 9. vs Boolean Indexing
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
})

# Boolean indexing (modifies subset)
df_bool = df.copy()
df_bool.loc[df_bool['A'] > 3, 'B'] = 0
print("\nBoolean indexing:")
print(df_bool)

# where (returns new, preserves shape)
df_where = df.copy()
df_where['B'] = df_where['B'].where(df_where['A'] <= 3, 0)
print("\nWith where:")
print(df_where)

# 10. Practical: Data Cleaning
data = pd.DataFrame({
    'age': [-5, 25, 150, 30, 200, 45],
    'salary': [50000, -10000, 80000, 90000, 1000000, 75000]
})

# Valid ranges
data['age_clean'] = data['age'].where(
    (data['age'] >= 0) & (data['age'] <= 120)
)
data['salary_clean'] = data['salary'].where(
    (data['salary'] >= 0) & (data['salary'] <= 500000)
)
print("\nData cleaning:")
print(data)

# 11. Winsorization
# Cap extreme values
data = pd.Series([1, 2, 3, 4, 5, 100, 200, 300])

lower = data.quantile(0.05)
upper = data.quantile(0.95)

winsorized = (data
              .where(data >= lower, lower)
              .where(data <= upper, upper))
print("\nWinsorized:")
print(winsorized)

# 12. Time Series Example
dates = pd.date_range('2024-01-01', periods=10, freq='D')
ts = pd.DataFrame({
    'date': dates,
    'value': [10, 50, 100, 200, 150, 180, 90, 70, 60, 55]
})

# Flag anomalies
mean = ts['value'].mean()
std = ts['value'].std()

ts['anomaly'] = 'Normal'
ts['anomaly'] = ts['anomaly'].where(
    abs(ts['value'] - mean) <= 2 * std,
    'Anomaly'
)
print("\nAnomaly detection:")
print(ts)

# 13. Performance Note
# where/mask are efficient (vectorized)
large_df = pd.DataFrame(np.random.randn(100000, 3), columns=['A', 'B', 'C'])

import time

# where()
start = time.time()
_ = large_df.where(large_df > 0, 0)
where_time = time.time() - start

# Boolean indexing (slower for full replacement)
start = time.time()
temp = large_df.copy()
temp[temp < 0] = 0
bool_time = time.time() - start

print(f"\nwhere(): {where_time:.4f}s")
print(f"Boolean: {bool_time:.4f}s")

where() vs mask():

Method Behavior Rough analogue
where(cond) Keep where True, replace where False df[cond] (but shape preserved)
mask(cond) Replace where True, keep where False df[~cond] (but shape preserved)

Use Cases:

Use Case Method Example
Outlier capping where s.where(s <= 100, 100)
Filter & replace where df.where(df > 0, 0)
Anomaly removal mask df.mask(is_anomaly)
Winsorization where (twice) Cap both tails
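
To make the identity noted in the insights below concrete (mask(cond) is the same as where(~cond)), a minimal sketch verifying both calls produce the same result:

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
cond = s > 3

a = s.mask(cond, 0)    # replace where cond is True
b = s.where(~cond, 0)  # keep where ~cond is True (same outcome)
print(a.equals(b))     # True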

Interview Insights:

Interviewer's Insight

  • "where(): keep True, replace False with other"
  • "mask(): replace True with other, keep False"
  • "Default other: NaN"
  • "where(cond, value): replace False with value"
  • "Preserves DataFrame shape (vs boolean indexing)"
  • "Vectorized: efficient on large data"
  • "Use for: outlier capping, data cleaning, winsorization"
  • "Can take callable: where(lambda x: x > 0)"
  • "mask() = where(~cond)"

Explain nlargest() and nsmallest() - Amazon, Netflix Interview Question

Difficulty: 🟒 Easy | Tags: Selection, Sorting, Top-N | Asked by: Amazon, Netflix, Google, Meta

View Answer

nlargest() and nsmallest() select top/bottom N rows by column(s). Faster than sorting for small N. Useful for top products, customers, values.

Complete Examples:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E', 'F'],
    'sales': [1000, 5000, 3000, 8000, 2000, 6000],
    'profit': [200, 1000, 600, 1500, 400, 1200],
    'region': ['North', 'South', 'North', 'East', 'West', 'South']
})

# 1. Top 3 by Sales
top_sales = df.nlargest(3, 'sales')
print("Top 3 by sales:")
print(top_sales)

# 2. Bottom 3 by Sales
bottom_sales = df.nsmallest(3, 'sales')
print("\nBottom 3 by sales:")
print(bottom_sales)

# 3. Multiple Columns (Tiebreaker)
# First by sales, then by profit
top_multi = df.nlargest(3, ['sales', 'profit'])
print("\nTop 3 by sales, then profit:")
print(top_multi)

# 4. vs sort_values + head
import time

# Large dataset
large_df = pd.DataFrame({
    'value': np.random.randn(1000000)
})

# nlargest (fast)
start = time.time()
_ = large_df.nlargest(10, 'value')
nlargest_time = time.time() - start

# sort + head (slow)
start = time.time()
_ = large_df.sort_values('value', ascending=False).head(10)
sort_time = time.time() - start

print(f"\nnlargest: {nlargest_time:.4f}s")
print(f"sort + head: {sort_time:.4f}s ({sort_time/nlargest_time:.1f}x slower)")

# 5. Series nlargest/nsmallest
sales_series = pd.Series([100, 500, 300, 800, 200], 
                         index=['A', 'B', 'C', 'D', 'E'])

print("\nTop 3 products:")
print(sales_series.nlargest(3))

print("\nBottom 2 products:")
print(sales_series.nsmallest(2))

# 6. Practical: Top Customers
customers = pd.DataFrame({
    'customer_id': range(1, 101),
    'total_spent': np.random.randint(100, 10000, 100),
    'orders': np.random.randint(1, 50, 100),
    'last_order_days': np.random.randint(0, 365, 100)
})

# Top 10 spenders
top_customers = customers.nlargest(10, 'total_spent')
print("\nTop 10 customers by spending:")
print(top_customers[['customer_id', 'total_spent', 'orders']])

# Most frequent buyers
frequent_buyers = customers.nlargest(10, 'orders')
print("\nTop 10 frequent buyers:")
print(frequent_buyers[['customer_id', 'orders', 'total_spent']])

# 7. GroupBy + nlargest
# Top 2 products per region
top_per_region = df.groupby('region').apply(
    lambda x: x.nlargest(2, 'sales')
).reset_index(drop=True)
print("\nTop 2 products per region:")
print(top_per_region)

# 8. Practical: Stock Analysis
stocks = pd.DataFrame({
    'ticker': ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'TSLA', 'META'],
    'price': [175, 140, 380, 145, 250, 350],
    'volume': [50000000, 25000000, 30000000, 45000000, 80000000, 20000000],
    'change_pct': [2.5, -1.2, 1.8, 0.5, 5.2, -2.1]
})

# Top gainers
print("\nTop 3 gainers:")
print(stocks.nlargest(3, 'change_pct')[['ticker', 'change_pct']])

# Top losers
print("\nTop 3 losers:")
print(stocks.nsmallest(3, 'change_pct')[['ticker', 'change_pct']])

# Most traded
print("\nMost traded:")
print(stocks.nlargest(3, 'volume')[['ticker', 'volume']])

# 9. Keep Parameter
# Handle duplicates
df_dup = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'score': [10, 20, 20, 15, 10]
})

print("\nOriginal with duplicates:")
print(df_dup)

# keep='first' (default)
print("\nTop 2 (keep='first'):")
print(df_dup.nlargest(2, 'score', keep='first'))

# keep='last'
print("\nTop 2 (keep='last'):")
print(df_dup.nlargest(2, 'score', keep='last'))

# keep='all' (may return more than n)
print("\nTop 2 (keep='all'):")
print(df_dup.nlargest(2, 'score', keep='all'))

# 10. Real-World: Sales Leaderboard
sales_data = pd.DataFrame({
    'salesperson': [f'Person_{i}' for i in range(1, 51)],
    'q1_sales': np.random.randint(10000, 50000, 50),
    'q2_sales': np.random.randint(10000, 50000, 50),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 50)
})

sales_data['total_sales'] = sales_data['q1_sales'] + sales_data['q2_sales']

# Overall top 10
leaderboard = sales_data.nlargest(10, 'total_sales')
print("\nTop 10 salespeople:")
print(leaderboard[['salesperson', 'total_sales', 'region']])

# Top 3 per region
regional_leaders = sales_data.groupby('region').apply(
    lambda x: x.nlargest(3, 'total_sales')
).reset_index(drop=True)
print("\nTop 3 per region:")
print(regional_leaders[['region', 'salesperson', 'total_sales']])

Performance Comparison:

| Dataset Size | Top N | nlargest() | sort_values() + head() |
|---|---|---|---|
| 10K rows | 10 | 0.001s | 0.005s (5x slower) |
| 100K rows | 10 | 0.005s | 0.050s (10x slower) |
| 1M rows | 10 | 0.05s | 0.5s (10x slower) |
| 1M rows | 100 | 0.08s | 0.5s (6x slower) |

When to Use:

| Scenario | Best Method |
|---|---|
| Top/bottom N (N << size) | nlargest/nsmallest |
| Need all sorted | sort_values |
| Multiple columns | Both work, nlargest faster |
| Very small N (<100) | nlargest (much faster) |
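
For top-N per group (as in the groupby examples above), newer pandas releases warn when groupby().apply() operates on the grouping columns. A minimal sketch of two warning-free alternatives, reusing the same region/sales layout:

import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'product': ['A', 'B', 'C', 'D', 'E', 'F'],
    'sales': [1000, 5000, 3000, 8000, 2000, 6000]
})

# Option 1: sort once, then keep the first 2 rows of each group
top2 = (df.sort_values('sales', ascending=False)
          .groupby('region')
          .head(2))

# Option 2: nlargest on the grouped Series, then select the original rows
idx = df.groupby('region')['sales'].nlargest(2).index.get_level_values(1)
top2_alt = df.loc[idx]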

Interview Insights:

Interviewer's Insight

  • "nlargest(n, 'col'): top N rows by column"
  • "nsmallest(n, 'col'): bottom N rows"
  • "Faster than sort_values + head for small N"
  • "Multiple columns: nlargest(n, ['col1', 'col2'])"
  • "keep parameter: 'first', 'last', 'all' for ties"
  • "Works on Series and DataFrame"
  • "Use for: top products, customers, leaderboards"
  • "O(n log k) vs O(n log n) for sorting"
  • "GroupBy + nlargest: top N per group"

Explain Pandas rank() Method - Microsoft, Google Interview Question

Difficulty: 🟑 Medium | Tags: Ranking, Statistics, Order | Asked by: Microsoft, Google, Amazon

View Answer

rank() assigns ranks to values in Series/DataFrame. Supports multiple methods for handling ties (average, min, max, first, dense). Essential for leaderboards, percentiles, and relative comparisons.

Complete Examples:

import pandas as pd
import numpy as np

# Sample data with ties
s = pd.Series([100, 200, 200, 300, 400, 400, 400])

# 1. Default Ranking (average for ties)
print("Original values:")
print(s.values)
print("\nDefault rank (average):")
print(s.rank())

# 2. Different Ranking Methods
methods = ['average', 'min', 'max', 'first', 'dense']

df = pd.DataFrame({'value': s})
for method in methods:
    df[f'rank_{method}'] = s.rank(method=method)

print("\nAll ranking methods:")
print(df)

# 3. Ascending vs Descending
scores = pd.Series([85, 92, 78, 92, 95])

print("\nScores:")
print(scores.values)

# ascending=True (default): 1 = smallest
print("\nRank ascending (1 = lowest):")
print(scores.rank())

# ascending=False: 1 = largest
print("\nRank descending (1 = highest):")
print(scores.rank(ascending=False))

# 4. Percentile Ranks
# pct=True: ranks from 0 to 1
print("\nPercentile ranks:")
print(scores.rank(pct=True))

# 5. Practical: Student Grades
students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'math': [85, 92, 78, 92, 95],
    'english': [88, 85, 90, 82, 95]
})

# Rank by math (1 = best)
students['math_rank'] = students['math'].rank(ascending=False, method='min')
students['english_rank'] = students['english'].rank(ascending=False, method='min')

# Overall rank (sum of ranks, lower is better)
students['overall_rank'] = (students['math_rank'] + students['english_rank']).rank()

print("\nStudent rankings:")
print(students)

# 6. GroupBy Ranking
sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'salesperson': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'sales': [1000, 1500, 800, 1200, 900, 1100]
})

# Rank within each region
sales['rank_in_region'] = sales.groupby('region')['sales'].rank(ascending=False)

print("\nRanking within regions:")
print(sales)

# 7. Handling NaN
s_nan = pd.Series([100, np.nan, 200, 300, np.nan])

# na_option='keep' (default): NaN stays NaN
print("\nWith NaN (keep):")
print(s_nan.rank())

# na_option='top': NaN gets smallest rank
print("\nWith NaN (top):")
print(s_nan.rank(na_option='top'))

# na_option='bottom': NaN gets largest rank
print("\nWith NaN (bottom):")
print(s_nan.rank(na_option='bottom'))

# 8. Practical: Stock Performance
stocks = pd.DataFrame({
    'ticker': ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'TSLA'],
    'return_1m': [5.2, -1.5, 3.8, 2.1, 8.5],
    'return_6m': [12.5, 8.2, 15.1, 10.3, 25.2],
    'volatility': [0.25, 0.30, 0.22, 0.28, 0.45]
})

# Rank returns (higher is better)
stocks['return_1m_rank'] = stocks['return_1m'].rank(ascending=False)
stocks['return_6m_rank'] = stocks['return_6m'].rank(ascending=False)

# Rank volatility (lower is better)
stocks['vol_rank'] = stocks['volatility'].rank(ascending=True)

# Combined score (lower is better)
stocks['combined_rank'] = (
    stocks['return_1m_rank'] + 
    stocks['return_6m_rank'] + 
    stocks['vol_rank']
).rank()

print("\nStock rankings:")
print(stocks)

# 9. Quantile-based Ranking
values = pd.Series(np.random.randn(100))

# Percentile ranks
values_df = pd.DataFrame({
    'value': values,
    'percentile': values.rank(pct=True),
    'quartile': pd.qcut(values.rank(pct=True), q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
})

print("\nQuantile-based ranking:")
print(values_df.head(10))

# 10. Leaderboard Example
leaderboard = pd.DataFrame({
    'player': [f'Player_{i}' for i in range(1, 11)],
    'score': [1000, 1500, 1200, 1500, 1800, 900, 1500, 1200, 1000, 2000]
})

# Dense ranking (no gaps)
leaderboard['rank_dense'] = leaderboard['score'].rank(method='dense', ascending=False)

# Min ranking (standard)
leaderboard['rank_standard'] = leaderboard['score'].rank(method='min', ascending=False)

leaderboard = leaderboard.sort_values('score', ascending=False)

print("\nLeaderboard:")
print(leaderboard)

# 11. Performance Comparison
large_series = pd.Series(np.random.randn(1000000))

import time

start = time.time()
_ = large_series.rank()
rank_time = time.time() - start

start = time.time()
_ = large_series.sort_values().reset_index(drop=True)
sort_time = time.time() - start

print(f"\nrank(): {rank_time:.4f}s")
print(f"sort: {sort_time:.4f}s")

Ranking Methods:

| Method | Description | Example for [100, 200, 200, 300] |
|---|---|---|
| average | Average of ranks | [1.0, 2.5, 2.5, 4.0] |
| min | Minimum rank | [1, 2, 2, 4] |
| max | Maximum rank | [1, 3, 3, 4] |
| first | Order they appear | [1, 2, 3, 4] |
| dense | No gaps | [1, 2, 2, 3] |
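
For readers mapping this to SQL window functions, the tie-handling methods line up as follows (a small sketch):

import pandas as pd

scores = pd.Series([100, 200, 200, 300])

row_number = scores.rank(method='first', ascending=False)  # ~ ROW_NUMBER()
sql_rank   = scores.rank(method='min',   ascending=False)  # ~ RANK()
dense_rank = scores.rank(method='dense', ascending=False)  # ~ DENSE_RANK()
pct_rank   = scores.rank(pct=True)  # rank / row count, in (0, 1]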

Key Parameters:

| Parameter | Options | Use Case |
|---|---|---|
| method | average, min, max, first, dense | Tie handling |
| ascending | True, False | Rank direction |
| pct | True, False | Percentile ranks (0-1) |
| na_option | keep, top, bottom | NaN handling |

Interview Insights:

Interviewer's Insight

  • "rank(): assign ranks to values"
  • "ascending=False: 1 = highest (for scores)"
  • "method='min': standard ranking (gaps for ties)"
  • "method='dense': no gaps (1, 2, 2, 3)"
  • "method='first': order of appearance"
  • "pct=True: percentile ranks (0.0 to 1.0)"
  • "Use with groupby(): rank within groups"
  • "na_option: handle NaN ('keep', 'top', 'bottom')"
  • "Use for: leaderboards, percentiles, relative comparisons"

Explain pd.cut() vs pd.qcut() - Amazon, Google Interview Question

Difficulty: 🟑 Medium | Tags: Binning, Discretization, Feature Engineering | Asked by: Amazon, Google, Meta, Netflix

View Answer

cut() bins data into intervals of equal width (or user-supplied edges); qcut() bins data into quantiles so each bin holds roughly the same number of observations. Use cut() for fixed ranges and qcut() for a balanced distribution.

Complete Examples:

import pandas as pd
import numpy as np

ages = pd.Series([5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80])

# 1. pd.cut() - Equal-Width Bins
# Bins have same width, different counts
age_bins = pd.cut(ages, bins=4)
print("cut() - Equal width:")
print(age_bins)
print("\nValue counts:")
print(age_bins.value_counts().sort_index())

# 2. pd.qcut() - Equal-Frequency Bins
# Bins have different widths, same counts
age_quantiles = pd.qcut(ages, q=4)
print("\nqcut() - Equal frequency:")
print(age_quantiles)
print("\nValue counts:")
print(age_quantiles.value_counts().sort_index())

# 3. Custom Bin Edges (cut)
custom_bins = pd.cut(ages, bins=[0, 18, 35, 60, 100], 
                     labels=['Child', 'Young Adult', 'Adult', 'Senior'])
print("\nCustom bins:")
print(pd.DataFrame({'age': ages, 'category': custom_bins}))

# 4. Percentile-based (qcut)
# Quartiles
quartiles = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print("\nQuartiles:")
print(pd.DataFrame({'age': ages, 'quartile': quartiles}))

# 5. Practical: Income Brackets
incomes = pd.Series(np.random.randint(20000, 200000, 100))

# Equal-width bins
income_cut = pd.cut(incomes, bins=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
print("\nIncome bins (equal width):")
print(income_cut.value_counts())

# Equal-frequency bins (balanced)
income_qcut = pd.qcut(incomes, q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
print("\nIncome bins (equal frequency):")
print(income_qcut.value_counts())

# 6. Include/Exclude Boundaries
scores = pd.Series([0, 25, 50, 75, 100])

# Right edge included (default): intervals are (0, 50] and (50, 100],
# so the value 0 falls outside and becomes NaN unless include_lowest=True
bins_right = pd.cut(scores, bins=[0, 50, 100], right=True)
print("\nRight edge included:")
print(bins_right)

# Left edge included: intervals are [0, 50) and [50, 100),
# so the value 100 becomes NaN instead
bins_left = pd.cut(scores, bins=[0, 50, 100], right=False)
print("\nLeft edge included:")
print(bins_left)

# 7. Retbins - Get Bin Edges
data = pd.Series(np.random.randn(100))

# Get bins used
binned, bin_edges = pd.cut(data, bins=5, retbins=True)
print("\nBin edges:")
print(bin_edges)

# 8. Practical: Credit Score Categories
credit_scores = pd.Series(np.random.randint(300, 850, 1000))

# Standard credit score ranges
credit_bins = pd.cut(
    credit_scores,
    bins=[300, 580, 670, 740, 800, 850],
    labels=['Poor', 'Fair', 'Good', 'Very Good', 'Excellent']
)

print("\nCredit score distribution:")
print(credit_bins.value_counts().sort_index())

# 9. Handling Duplicates in qcut
# qcut can fail with many duplicates
data_dup = pd.Series([1, 1, 1, 2, 2, 3, 4, 5, 5, 5])

try:
    pd.qcut(data_dup, q=4)
except ValueError as e:
    print(f"\nqcut error: {e}")

# Solution: duplicates='drop'
binned_dup = pd.qcut(data_dup, q=4, duplicates='drop')
print("\nWith duplicates='drop':")
print(binned_dup.value_counts())

# 10. Machine Learning Feature Engineering
df = pd.DataFrame({
    'customer_id': range(1, 101),
    'age': np.random.randint(18, 80, 100),
    'income': np.random.randint(20000, 150000, 100),
    'spend': np.random.randint(100, 5000, 100)
})

# Create features
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['Young', 'Middle', 'Senior'])
df['income_quartile'] = pd.qcut(df['income'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
df['spend_level'] = pd.qcut(df['spend'], q=3, labels=['Low', 'Medium', 'High'])

print("\nFeature engineering:")
print(df.head(10))

# Distribution
print("\nFeature distributions:")
print(df[['age_group', 'income_quartile', 'spend_level']].apply(lambda x: x.value_counts()))

cut() vs qcut():

| Aspect | cut() | qcut() |
|---|---|---|
| Bins | Equal width | Equal frequency |
| Use Case | Fixed ranges | Balanced groups |
| Distribution | Uneven | Even |
| Example | Age groups | Quartiles |
| Custom Edges | Yes | No (only q) |

Common Patterns:

| Task | Solution |
|---|---|
| Age groups | cut(ages, [0, 18, 65, 100]) |
| Quartiles | qcut(data, q=4) |
| Credit scores | cut(scores, [300, 580, 670, 740, 800, 850]) |
| Income percentiles | qcut(income, q=10) (deciles) |
| Equal groups for ML | qcut(feature, q=5) |
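
One pattern worth knowing for feature engineering: learn bin edges on training data with retbins=True, then reuse those edges on new data with cut() so the binning stays consistent. A minimal sketch (variable names are illustrative):

import pandas as pd
import numpy as np

train = pd.Series(np.random.randint(20000, 200000, 1000))
new_data = pd.Series(np.random.randint(20000, 200000, 200))

# Learn quantile edges on the training data only
train_bins, edges = pd.qcut(train, q=4, retbins=True,
                            labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Apply the same edges to new data; values outside the range become NaN
new_bins = pd.cut(new_data, bins=edges,
                  labels=['Q1', 'Q2', 'Q3', 'Q4'], include_lowest=True)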

Interview Insights:

Interviewer's Insight

  • "cut(): equal-width bins, fixed ranges"
  • "qcut(): equal-frequency bins, quantiles"
  • "cut() bins can have different counts"
  • "qcut() bins have same counts (balanced)"
  • "cut() for: age groups, custom ranges"
  • "qcut() for: quartiles, percentiles, balanced features"
  • "labels: custom category names"
  • "retbins=True: return bin edges"
  • "qcut with duplicates: use duplicates='drop'"
  • "Use for: feature engineering, binning, discretization"

Explain Pandas cumsum(), cumprod(), cummax(), cummin() - Meta, Microsoft Interview Question

Difficulty: 🟒 Easy | Tags: Cumulative Operations, Time Series, Running Totals | Asked by: Meta, Microsoft, Amazon, Google

View Answer

Cumulative functions compute running totals, products, maxima, and minima. They are essential for time series analysis and for financial calculations such as compound returns, running peaks, and drawdowns.

Complete Examples:

import pandas as pd
import numpy as np

data = pd.Series([1, 2, 3, 4, 5])

# 1. cumsum() - Cumulative Sum
print("Original:")
print(data.values)
print("\ncumsum():")
print(data.cumsum().values)
# [1, 3, 6, 10, 15]

# 2. cumprod() - Cumulative Product
print("\ncumprod():")
print(data.cumprod().values)
# [1, 2, 6, 24, 120]

# 3. cummax() - Cumulative Maximum
data_random = pd.Series([3, 1, 4, 1, 5, 2, 6])
print("\nOriginal:")
print(data_random.values)
print("\ncummax():")
print(data_random.cummax().values)
# [3, 3, 4, 4, 5, 5, 6]

# 4. cummin() - Cumulative Minimum
print("\ncummin():")
print(data_random.cummin().values)
# [3, 1, 1, 1, 1, 1, 1]

# 5. Practical: Running Total Sales
sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=7, freq='D'),
    'daily_sales': [100, 150, 120, 180, 200, 160, 190]
})

sales['cumulative_sales'] = sales['daily_sales'].cumsum()

print("\nRunning sales total:")
print(sales)

# 6. Financial: Cumulative Returns
returns = pd.Series([0.05, -0.02, 0.03, 0.01, -0.01, 0.04])

# Cumulative product for compound returns
cumulative_return = (1 + returns).cumprod() - 1

print("\nDaily returns:")
print(returns.values)
print("\nCumulative returns:")
print(cumulative_return.values)

# 7. Peak Detection (cummax)
stock_prices = pd.Series([100, 105, 103, 108, 110, 107, 115, 112])

df = pd.DataFrame({
    'price': stock_prices,
    'peak': stock_prices.cummax()
})
df['drawdown'] = (df['price'] - df['peak']) / df['peak']

print("\nDrawdown analysis:")
print(df)

# 8. GroupBy Cumulative
orders = pd.DataFrame({
    'customer': ['A', 'A', 'B', 'A', 'B', 'B'],
    'amount': [100, 200, 150, 300, 100, 200]
})

orders['cumulative_per_customer'] = orders.groupby('customer')['amount'].cumsum()

print("\nCumulative per customer:")
print(orders)

# 9. Time Series: Running Metrics
ts = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'value': np.random.randint(10, 50, 10)
})

ts['running_total'] = ts['value'].cumsum()
ts['running_avg'] = ts['value'].expanding().mean()
ts['running_max'] = ts['value'].cummax()
ts['running_min'] = ts['value'].cummin()

print("\nTime series running metrics:")
print(ts)

# 10. Practical: Inventory Tracking
inventory = pd.DataFrame({
    'transaction': ['Sale', 'Purchase', 'Sale', 'Sale', 'Purchase', 'Sale'],
    'quantity': [-10, 50, -15, -20, 40, -12]
})

inventory['stock_level'] = 100 + inventory['quantity'].cumsum()

print("\nInventory levels:")
print(inventory)

# 11. Year-to-Date Calculations
monthly = pd.DataFrame({
    'month': pd.date_range('2024-01-01', periods=12, freq='MS'),
    'revenue': np.random.randint(50000, 100000, 12)
})

monthly['ytd_revenue'] = monthly['revenue'].cumsum()

print("\nYear-to-date revenue:")
print(monthly)

# 12. Performance Comparison
large_series = pd.Series(np.random.randn(1000000))

import time

# cumsum (fast, O(n))
start = time.time()
_ = large_series.cumsum()
cumsum_time = time.time() - start

# Manual cumsum (slow)
start = time.time()
result = []
total = 0
for val in large_series:
    total += val
    result.append(total)
manual_time = time.time() - start

print(f"\ncumsum(): {cumsum_time:.4f}s")
print(f"Manual: {manual_time:.4f}s ({manual_time/cumsum_time:.1f}x slower)")

Cumulative Functions:

| Function | Description | Example for [1, 2, 3, 4] |
|---|---|---|
| cumsum() | Running sum | [1, 3, 6, 10] |
| cumprod() | Running product | [1, 2, 6, 24] |
| cummax() | Running maximum | [1, 2, 3, 4] |
| cummin() | Running minimum | [1, 1, 1, 1] |

Common Use Cases:

| Use Case | Function | Example |
|---|---|---|
| Running totals | cumsum() | Sales YTD |
| Compound returns | cumprod() | Investment growth |
| Peak tracking | cummax() | All-time high |
| Drawdown | price - cummax() | Portfolio decline |
| Inventory | cumsum() | Stock levels |
| Streak tracking | Custom with cumsum (sketch below) | Winning streaks |
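
The streak-tracking idiom referenced in the table above: a change in value starts a new group, cumsum() numbers the groups, and cumcount() measures the running streak length. A minimal sketch:

import pandas as pd

wins = pd.Series([1, 1, 0, 1, 1, 1, 0, 1])  # 1 = win, 0 = loss

# New streak whenever the value changes; cumsum() gives each streak an id
streak_id = (wins != wins.shift()).cumsum()

# Running length of the current streak (wins and losses alike)
streak_len = wins.groupby(streak_id).cumcount() + 1

# Keep only the winning streaks
win_streak = streak_len.where(wins == 1, 0)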

Interview Insights:

Interviewer's Insight

  • "cumsum(): running sum/total"
  • "cumprod(): running product (compound growth)"
  • "cummax(): running maximum (peak)"
  • "cummin(): running minimum (trough)"
  • "Use with groupby(): cumulative per group"
  • "Financial: cumprod() for compound returns"
  • "Drawdown: price - cummax()"
  • "O(n) complexity: efficient"
  • "Common in: time series, finance, inventory"

Explain pd.concat() axis Parameter - Google, Amazon Interview Question

Difficulty: 🟑 Medium | Tags: Concatenation, Joining, Data Combination | Asked by: Google, Amazon, Meta, Microsoft

View Answer

The axis parameter of concat() controls the direction of concatenation: axis=0 (the default) stacks DataFrames vertically, adding rows, while axis=1 stacks them horizontally, adding columns. Essential for combining datasets.

Complete Examples:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

df2 = pd.DataFrame({
    'A': [7, 8, 9],
    'B': [10, 11, 12]
})

df3 = pd.DataFrame({
    'C': [13, 14, 15],
    'D': [16, 17, 18]
})

# 1. axis=0 (Vertical Stacking - Rows)
result_v = pd.concat([df1, df2], axis=0)
print("Vertical concatenation (axis=0):")
print(result_v)

# Reset index
result_v_reset = pd.concat([df1, df2], axis=0, ignore_index=True)
print("\nWith ignore_index=True:")
print(result_v_reset)

# 2. axis=1 (Horizontal Stacking - Columns)
result_h = pd.concat([df1, df3], axis=1)
print("\nHorizontal concatenation (axis=1):")
print(result_h)

# 3. Mismatched Indexes
df_a = pd.DataFrame({'X': [1, 2, 3]}, index=['a', 'b', 'c'])
df_b = pd.DataFrame({'Y': [4, 5, 6]}, index=['b', 'c', 'd'])

# axis=1: outer join (default)
result = pd.concat([df_a, df_b], axis=1)
print("\nMismatched indexes (axis=1):")
print(result)

# axis=1 with inner join
result_inner = pd.concat([df_a, df_b], axis=1, join='inner')
print("\nWith join='inner':")
print(result_inner)

# 4. Multiple DataFrames
df_list = [
    pd.DataFrame({'value': [1, 2]}),
    pd.DataFrame({'value': [3, 4]}),
    pd.DataFrame({'value': [5, 6]})
]

# Vertical
combined_v = pd.concat(df_list, axis=0, ignore_index=True)
print("\nMultiple DataFrames (vertical):")
print(combined_v)

# 5. Practical: Combining Regional Data
north = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'sales': [100, 200, 150]
})

south = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'sales': [120, 180, 160]
})

# Vertical (add rows)
all_regions_v = pd.concat([north, south], axis=0, keys=['North', 'South'])
print("\nRegional data (vertical):")
print(all_regions_v)

# Horizontal (add columns)
all_regions_h = pd.concat(
    [north.set_index('product')['sales'], 
     south.set_index('product')['sales']], 
    axis=1, 
    keys=['North', 'South']
)
print("\nRegional data (horizontal):")
print(all_regions_h)

# 6. Keys Parameter
q1 = pd.DataFrame({'revenue': [100, 200]})
q2 = pd.DataFrame({'revenue': [150, 250]})

quarterly = pd.concat([q1, q2], axis=0, keys=['Q1', 'Q2'])
print("\nWith keys (MultiIndex):")
print(quarterly)

# 7. Names Parameter
quarterly_named = pd.concat(
    [q1, q2], 
    axis=0, 
    keys=['Q1', 'Q2'],
    names=['Quarter', 'ID']
)
print("\nWith names:")
print(quarterly_named)

# 8. Practical: Time Series Concatenation
jan = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5, freq='D'),
    'value': [10, 12, 11, 13, 14]
})

feb = pd.DataFrame({
    'date': pd.date_range('2024-02-01', periods=5, freq='D'),
    'value': [15, 16, 14, 17, 18]
})

full_data = pd.concat([jan, feb], axis=0, ignore_index=True)
print("\nTime series concatenation:")
print(full_data)

# 9. Verify No Data Loss
def safe_concat(dfs, axis):
    """Concat with verification."""
    result = pd.concat(dfs, axis=axis)

    if axis == 0:
        expected_rows = sum(len(df) for df in dfs)
        assert len(result) == expected_rows, "Row count mismatch!"
    else:
        expected_cols = sum(len(df.columns) for df in dfs)
        assert len(result.columns) == expected_cols, "Column count mismatch!"

    return result

result = safe_concat([df1, df2], axis=0)
print(f"\nSafe concat: {result.shape}")

# 10. Performance Comparison
dfs = [pd.DataFrame({'A': range(1000)}) for _ in range(100)]

import time

# concat (fast)
start = time.time()
_ = pd.concat(dfs, axis=0)
concat_time = time.time() - start

# append (deprecated, slow)
# result = dfs[0]
# for df in dfs[1:]:
#     result = result.append(df)

print(f"\nconcat time: {concat_time:.4f}s")
print("(append is deprecated, use concat)")

axis Parameter:

| axis | Direction | Stacking | Use Case |
|---|---|---|---|
| 0 | Vertical | Add rows | Combine datasets |
| 1 | Horizontal | Add columns | Add features |
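
When concatenating with keys (as in the quarterly example above), the resulting MultiIndex lets you pull a single source back out. A small sketch:

import pandas as pd

q1 = pd.DataFrame({'revenue': [100, 200]})
q2 = pd.DataFrame({'revenue': [150, 250]})

quarterly = pd.concat([q1, q2], keys=['Q1', 'Q2'], names=['quarter', 'row'])

# Select one source by its key
q1_back = quarterly.loc['Q1']
# Or with xs(), naming the level explicitly
q1_back_xs = quarterly.xs('Q1', level='quarter')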

Common Parameters:

| Parameter | Purpose | Example |
|---|---|---|
| axis | Direction (0=rows, 1=cols) | axis=0 |
| ignore_index | Reset index | ignore_index=True |
| keys | Add MultiIndex | keys=['A', 'B'] |
| join | 'outer' or 'inner' | join='inner' |
| names | Name MultiIndex levels | names=['level1'] |
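
A related performance note: growing a DataFrame by concatenating inside a loop copies the accumulated data on every iteration. Collect the pieces in a list and call concat() once, as in this minimal sketch:

import pandas as pd
import numpy as np

chunks = [pd.DataFrame({'value': np.random.randn(1000)}) for _ in range(100)]

# Anti-pattern (quadratic copying):
# result = pd.DataFrame()
# for chunk in chunks:
#     result = pd.concat([result, chunk])

# Preferred: one concat over the whole list
result = pd.concat(chunks, axis=0, ignore_index=True)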

Interview Insights:

Interviewer's Insight

  • "axis=0: vertical stacking (add rows)"
  • "axis=1: horizontal stacking (add columns)"
  • "ignore_index=True: reset index"
  • "join='outer': keep all (default)"
  • "join='inner': keep common only"
  • "keys: add MultiIndex for source tracking"
  • "Use for: combining datasets, regional data"
  • "axis=0 + ignore_index: clean row numbers"
  • "axis=1: must have compatible indexes"
  • "Prefer concat over append (deprecated)"

Explain Pandas Sparse Data Structures - Netflix, Google Interview Question

Difficulty: 🟑 Medium | Tags: Memory Optimization, Sparse Arrays, Performance | Asked by: Netflix, Google, Amazon

View Answer

Sparse arrays store only non-zero/non-default values, saving memory for mostly-empty data. Essential for one-hot encoding, matrices with many zeros, and large categorical features.

Complete Examples:

import pandas as pd
import numpy as np

# 1. Create Sparse Array
# Most values are 0 (or another fill_value)
dense = pd.Series([0, 0, 1, 0, 0, 2, 0, 0, 0, 3])

# Convert to sparse
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))

print("Dense memory:")
print(f"{dense.memory_usage(deep=True):,} bytes")
print("\nSparse memory:")
print(f"{sparse.memory_usage(deep=True):,} bytes")
print(f"\nMemory savings: {(1 - sparse.memory_usage(deep=True)/dense.memory_usage(deep=True))*100:.1f}%")

# 2. Sparse DataFrame
# One-hot encoded data (mostly zeros)
df = pd.DataFrame({
    'A': pd.Categorical(['cat', 'dog', 'cat', 'bird'] * 1000)
})

# Dense one-hot encoding
dense_dummies = pd.get_dummies(df['A'])

# Sparse one-hot encoding
sparse_dummies = pd.get_dummies(df['A'], sparse=True)

print("\nOne-hot encoding:")
print(f"Dense: {dense_dummies.memory_usage(deep=True).sum():,} bytes")
print(f"Sparse: {sparse_dummies.memory_usage(deep=True).sum():,} bytes")
print(f"Savings: {(1 - sparse_dummies.memory_usage(deep=True).sum()/dense_dummies.memory_usage(deep=True).sum())*100:.1f}%")

# 3. SparseArray Properties
sparse_arr = pd.arrays.SparseArray([0, 0, 1, 2, 0, 0, 0, 3, 0])

print("\nSparse array:")
print(f"Values: {sparse_arr}")
print(f"Density: {sparse_arr.density:.2%}")
print(f"Fill value: {sparse_arr.fill_value}")
print(f"Non-zero count: {sparse_arr.npoints}")

# 4. Operations on Sparse Data
s1 = pd.Series([0, 0, 1, 0, 2]).astype(pd.SparseDtype("int64", 0))
s2 = pd.Series([0, 1, 0, 0, 3]).astype(pd.SparseDtype("int64", 0))

# Arithmetic (stays sparse)
result = s1 + s2
print("\nArithmetic on sparse:")
print(result)
print(f"Result is sparse: {isinstance(result.dtype, pd.SparseDtype)}")

# 5. Practical: User-Item Matrix
# Typical recommendation system matrix (very sparse)
users = 10000
items = 1000

# Simulate ratings (only 1% have ratings)
n_ratings = int(users * items * 0.01)
user_ids = np.random.randint(0, users, n_ratings)
item_ids = np.random.randint(0, items, n_ratings)
ratings = np.random.randint(1, 6, n_ratings)

# Dense matrix (BAD for sparse data)
# Would need: users * items * 8 bytes = 80 MB

# Sparse (long/COO-style) representation (GOOD): store only the observed ratings
sparse_df = pd.DataFrame({
    'user_id': user_ids,
    'item_id': item_ids,
    'rating': ratings
})

print(f"\nSparse ratings:")
print(f"Memory: {sparse_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"vs full matrix: ~{users * items * 8 / 1024**2:.2f} MB")
print(f"Savings: ~{(1 - sparse_df.memory_usage(deep=True).sum()/(users*items*8))*100:.1f}%")

# 6. Convert Between Dense and Sparse
dense = pd.Series([0, 1, 0, 2, 0])

# To sparse
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))
print("\nConverted to sparse:")
print(sparse)

# Back to dense
dense_again = sparse.astype("int64")
print("\nBack to dense:")
print(dense_again)

# 7. Sparse with Different Fill Values
# If most values are -1
data = pd.Series([-1, -1, 5, -1, -1, 10, -1, -1])
sparse_neg = data.astype(pd.SparseDtype("int64", fill_value=-1))

print("\nSparse with fill_value=-1:")
print(sparse_neg)
print(f"Density: {sparse_neg.density:.2%}")

# 8. Practical: Feature Engineering
# High-cardinality categorical β†’ sparse one-hot
df = pd.DataFrame({
    'user_id': range(1000),
    'country': np.random.choice(['US', 'UK', 'CA', 'AU', 'DE'] * 200, 1000)
})

# Sparse one-hot encoding
country_sparse = pd.get_dummies(df['country'], prefix='country', sparse=True)

print("\nSparse feature engineering:")
print(f"Shape: {country_sparse.shape}")
print(f"Memory: {country_sparse.memory_usage(deep=True).sum() / 1024:.2f} KB")

# 9. When to Use Sparse
def analyze_sparsity(series):
    """Check if sparse makes sense."""
    zero_count = (series == 0).sum()
    total = len(series)
    sparsity = zero_count / total

    print(f"Sparsity: {sparsity:.1%}")
    if sparsity > 0.9:
        print("✓ Good candidate for sparse")
    elif sparsity > 0.5:
        print("⚠ Maybe use sparse")
    else:
        print("✗ Not sparse enough")

    return sparsity

data1 = pd.Series(np.random.randint(0, 2, 1000))  # 50% zeros
data2 = pd.Series(np.random.choice([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], 1000))  # 90% zeros

print("\nData 1:")
analyze_sparsity(data1)

print("\nData 2:")
analyze_sparsity(data2)

# 10. Limitations
print("\n=== Sparse Limitations ===")
print("- Arithmetic can densify results")
print("- Some operations require conversion to dense")
print("- Not all pandas functions support sparse")
print("- Best for: ≥90% sparsity")
print("- Use cases: one-hot encoding, user-item matrices, NLP features")

When to Use Sparse:

| Sparsity | Recommendation | Example |
|---|---|---|
| >95% | Definitely use sparse | User-item matrix |
| 90-95% | Likely beneficial | One-hot encoding |
| 50-90% | Maybe | Depends on operations |
| <50% | Don't use sparse | Regular data |

Memory Comparison:

| Data | Dense | Sparse | Savings |
|---|---|---|---|
| 99% zeros | 1 GB | ~10 MB | ~99% |
| 95% zeros | 1 GB | ~50 MB | ~95% |
| 90% zeros | 1 GB | ~100 MB | ~90% |
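
For handing sparse pandas data to ML libraries, the .sparse accessor reports density and can export a scipy COO matrix (assuming scipy is installed). A minimal sketch:

import pandas as pd
import numpy as np

df = pd.DataFrame({'country': np.random.choice(['US', 'UK', 'CA', 'DE'], 10000)})
onehot = pd.get_dummies(df['country'], sparse=True)

print(onehot.sparse.density)       # fraction of stored (non-fill) values
coo = onehot.sparse.to_coo()       # scipy.sparse COO matrix for ML libraries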

Interview Insights:

Interviewer's Insight

  • "Sparse: store only non-zero/non-default values"
  • "Memory savings: proportional to sparsity"
  • "pd.get_dummies(sparse=True): sparse one-hot"
  • "SparseDtype(dtype, fill_value): specify type"
  • "Use when: >90% same value (often zeros)"
  • "density: fraction of non-sparse values"
  • "npoints: count of non-sparse values"
  • "Use cases: one-hot encoding, user-item matrix, NLP"
  • "Limitation: some ops require densification"
  • "Trade-off: memory vs computation speed"

Questions asked in Google interview

  • How would you optimize a Pandas operation running slowly on large dataset?
  • Explain the difference between merge() and join()
  • Write code to calculate rolling averages with different window sizes
  • How would you handle a DataFrame with 100 million rows?
  • Explain memory optimization techniques
  • Write code to perform complex GroupBy with multiple aggregations
  • Explain the internal data structure of DataFrame
  • How would you implement feature engineering pipelines?
  • Write code to calculate year-over-year growth
  • Explain vectorized operations and their importance
  • How to handle SettingWithCopyWarning?
  • Write code to perform window functions similar to SQL

Questions asked in Amazon interview

  • Write code to merge multiple DataFrames with different schemas
  • How would you calculate year-over-year growth?
  • Explain how to handle time series data with irregular intervals
  • Write code to identify and remove duplicate records
  • How would you implement a moving average crossover strategy?
  • Explain the difference between transform() and apply()
  • Write code to pivot data for sales analysis
  • How would you handle categorical variables with high cardinality?
  • Explain how to optimize for memory efficiency
  • Write code to perform cohort analysis

Questions asked in Meta interview

  • Write code to analyze user engagement data
  • How would you calculate conversion funnels?
  • Explain how to handle large-scale data processing
  • Write code to resample time series data
  • How would you implement A/B testing analysis?
  • Explain method chaining and its benefits
  • Write code to calculate retention metrics
  • How would you handle hierarchical data structures?
  • Explain vectorization benefits over loops
  • Write code to analyze network data

Questions asked in Microsoft interview

  • Explain the SettingWithCopyWarning and how to avoid it
  • Write code to perform window functions similar to SQL
  • How would you handle timezone conversions?
  • Explain the difference between views and copies
  • Write code to implement custom aggregation functions
  • How would you optimize Pandas for production?
  • Explain multi-level indexing use cases
  • Write code to compare two DataFrames
  • How would you handle missing data in time series?
  • Explain eval() and query() methods

Questions asked in Netflix interview

  • Write code to analyze viewing patterns and user behavior
  • How would you calculate streaming quality metrics?
  • Explain how to handle messy data from multiple sources
  • Write code to implement collaborative filtering preprocessing
  • How would you analyze content performance across regions?
  • Explain time series decomposition
  • Write code to calculate customer lifetime value
  • How would you handle data for recommendation systems?
  • Explain rolling window calculations for real-time analytics
  • Write code to analyze A/B test results

Questions asked in Apple interview

  • Write code to perform data validation on imported data
  • How would you implement data quality checks?
  • Explain how to handle multi-format data imports
  • Write code to analyze product performance metrics
  • How would you implement data anonymization?
  • Explain best practices for production Pandas code
  • Write code to create automated data reports
  • How would you handle data versioning?
  • Explain memory management for large DataFrames
  • Write code to implement time-based partitioning

Questions asked in Flipkart interview

  • Write code to analyze e-commerce transaction data
  • How would you calculate GMV metrics?
  • Explain handling high-cardinality categorical data
  • Write code to analyze customer purchase patterns
  • How would you implement product recommendation preprocessing?
  • Explain data aggregation for dashboard analytics

Questions asked in LinkedIn interview

  • Write code to analyze professional network connections
  • How would you calculate engagement metrics for posts?
  • Explain how to handle user activity data
  • Write code to implement skill-based matching
  • How would you analyze job posting performance?
  • Explain data preprocessing for NLP tasks

Additional Resources