Pandas Interview Questions
This document provides a curated list of Pandas interview questions commonly asked in technical interviews for Data Science, Data Analysis, Machine Learning, and Python Developer roles. It covers everything from fundamental concepts to advanced data manipulation techniques, including rigorous "brutally difficult" questions for senior roles.
This list is updated frequently; it currently represents the most exhaustive collection of the question types being asked.
Premium Interview Questions
Explain loc vs iloc - Key Difference - Google, Amazon, Meta Interview Question
Difficulty: 🟢 Easy | Tags: Indexing, Selection, Core | Asked by: Google, Amazon, Meta, Apple, Netflix
View Answer
loc: Label-based indexing (uses row/column names). iloc: Integer position-based (uses 0, 1, 2...). Key difference: loc is inclusive, iloc is exclusive on end of slice.
Real-World: Netflix uses loc for time-series data (datetime index), iloc for positional slicing in ETL pipelines.
| Method | Indexing Type | Slicing Behavior | Use Case |
|---|---|---|---|
| loc | Label-based | Inclusive end | Named rows/cols, datetime index, boolean filtering |
| iloc | Integer position | Exclusive end | First N rows, every Nth row, CSV-like access |
| at | Label (scalar) | Single value | Fastest for single cell (10× faster) |
| iat | Position (scalar) | Single value | Fastest for position-based access |
Production Code
import pandas as pd
import numpy as np
# Sample DataFrame
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'age': [25, 30, 35, 28, 32],
'salary': [50000, 60000, 70000, 55000, 65000],
'city': ['NY', 'LA', 'SF', 'NY', 'LA']
}, index=['emp_001', 'emp_002', 'emp_003', 'emp_004', 'emp_005'])
# ===== loc (Label-based) =====
# Single cell
print(df.loc['emp_001', 'name']) # 'Alice'
# Multiple rows/cols
print(df.loc['emp_001':'emp_003', 'name':'salary'])
# IMPORTANT: 'emp_003' IS INCLUDED (inclusive slicing)
# Boolean filtering (VERY COMMON in production)
high_earners = df.loc[df['salary'] > 60000]
print(high_earners)
# Multiple conditions
ny_high_earners = df.loc[(df['city'] == 'NY') & (df['salary'] > 50000)]
# ===== iloc (Integer position) =====
# Single cell
print(df.iloc[0, 0]) # 'Alice' (first row, first col)
# Multiple rows/cols
print(df.iloc[0:3, 0:2]) # First 3 rows, first 2 cols
# IMPORTANT: Row 3 is EXCLUDED (0, 1, 2 only)
# Every Nth row (common in sampling)
every_2nd_row = df.iloc[::2] # Rows 0, 2, 4
# Last N rows
last_3_rows = df.iloc[-3:]
# ===== at/iat (Scalar access - FASTEST) =====
# 10× faster for single cell than loc/iloc
name = df.at['emp_001', 'name'] # Use at for labels
age = df.iat[0, 1] # Use iat for positions
# ===== Common Mistakes =====
# ❌ WRONG: Using loc with integer on non-integer index
try:
    df.loc[0]  # KeyError! Index is ['emp_001', 'emp_002', ...]
except KeyError:
    print("Error: loc expects labels, not positions")
# ✅ CORRECT
df.iloc[0]  # Always works
# ❌ WRONG: Assuming iloc is inclusive like loc
print(df.iloc[0:2])  # Only rows 0, 1 (NOT 0, 1, 2)
# ✅ CORRECT: Use loc for inclusive slicing
print(df.loc['emp_001':'emp_002'])  # Includes both endpoints
When to Use Which
| Scenario | Use | Example |
|---|---|---|
| Named index (datetime, strings) | loc | df.loc['2024-01-01':'2024-12-31'] |
| First N rows | iloc | df.iloc[:100] |
| Boolean filtering | loc | df.loc[df['age'] > 30] |
| CSV-like access (row 5, col 3) | iloc | df.iloc[5, 3] |
| Single cell (performance critical) | at/iat | df.at['row', 'col'] |
Real-World Performance
| Company | Use Case | Method | Why |
|---|---|---|---|
| Netflix | Time-series (datetime index) | loc | df.loc['2024-01-01':'2024-01-31'] |
| Airbnb | First 1000 listings (pagination) | iloc | df.iloc[:1000] |
| Uber | High-frequency trading (1M+ ops/sec) | at/iat | 10× faster for scalar access |
Interviewer's Insight
- Knows loc is inclusive, iloc is exclusive (classic gotcha!)
- Uses at/iat for scalar access (10× faster than loc/iloc)
- Mentions boolean filtering with loc (most common production use)
- Avoids loc with integers on non-integer index (KeyError)
- Real-world: Netflix uses loc for datetime slicing in recommendation ETL
How Do You Handle Missing Values in Pandas? - Google, Amazon, Netflix Interview Question
Difficulty: 🟡 Medium | Tags: Missing Data, Data Cleaning, fillna | Asked by: Google, Amazon, Meta, Netflix, Apple
View Answer
Detecting Missing Values:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, np.nan, 3, np.nan],
'B': ['x', None, 'y', 'z']
})
# Detection
df.isnull() # Boolean mask
df.isnull().sum() # Count per column
df.isnull().sum().sum() # Total count
df.isna().any() # Any missing per column
Handling Strategies:
# 1. Drop rows with any missing
df.dropna()
# 2. Drop rows where specific columns are missing
df.dropna(subset=['A'])
# 3. Drop only if all values missing
df.dropna(how='all')
# 4. Fill with constant
df.fillna(0)
df.fillna({'A': 0, 'B': 'unknown'})
# 5. Fill with statistics
df['A'].fillna(df['A'].mean())
df['A'].fillna(df['A'].median())
df['A'].fillna(df['A'].mode()[0])
# 6. Forward/backward fill (time series)
df.ffill() # Forward fill
df.bfill() # Backward fill
# 7. Interpolation
df['A'].interpolate(method='linear')
Strategy Selection:
| Scenario | Strategy | When to Use |
|---|---|---|
| < 5% missing | Drop rows (dropna()) | Minimal data loss |
| Numeric, random | Mean/median (fillna) | MCAR (missing completely at random) |
| Time series | Forward/backward fill (ffill/bfill) | Sequential data |
| Categorical | Mode or 'Unknown' | Preserves categories |
| > 50% missing | Drop column | Too much missing data |
Real-World:
- Airbnb: Uses median for price (robust to outliers), mode for amenities
- Netflix: Forward fill for user viewing history (time-series)
- Uber: Drops rows with missing trip_id (< 1% of data)
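A minimal sketch of a threshold-based strategy selector, mirroring the 5%/50% cutoffs in the table above (the helper name and thresholds are illustrative, not a standard API):
import pandas as pd
import numpy as np
def fill_by_threshold(df, drop_row_pct=0.05, drop_col_pct=0.5):
    """Drop mostly-missing columns, drop nearly-complete rows, fill the rest."""
    missing_pct = df.isnull().mean()  # Fraction missing per column
    df = df.drop(columns=missing_pct[missing_pct > drop_col_pct].index)
    for col in df.columns:
        pct = df[col].isnull().mean()
        if pct == 0:
            continue
        elif pct < drop_row_pct:
            df = df.dropna(subset=[col])  # < 5% missing -> drop those rows
        elif pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())  # Numeric -> median
        else:
            df[col] = df[col].fillna('Unknown')  # Non-numeric -> 'Unknown'
    return df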
Interviewer's Insight
- Chooses strategy by data type (mean for numeric, mode for categorical)
- Uses ffill for time-series (preserves temporal patterns)
- Knows dropna vs fillna tradeoffs (< 5% missing → drop, > 5% → fill)
- Checks missingness pattern (MCAR vs MAR vs MNAR)
- Real-world: Airbnb fills missing amenities with 'Not Available' (preserves info)
Explain GroupBy in Pandas - Split-Apply-Combine - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: GroupBy, Aggregation, Split-Apply-Combine | Asked by: Google, Amazon, Meta, Netflix, Apple
View Answer
GroupBy Concept:
- Split: Divide data into groups
- Apply: Apply function to each group
- Combine: Combine results
import pandas as pd
df = pd.DataFrame({
'category': ['A', 'A', 'B', 'B', 'A'],
'product': ['x', 'y', 'x', 'y', 'x'],
'sales': [100, 150, 200, 250, 120],
'quantity': [10, 15, 20, 25, 12]
})
# Basic groupby
df.groupby('category')['sales'].sum()
# Multiple columns
df.groupby(['category', 'product'])['sales'].mean()
# Multiple aggregations
df.groupby('category').agg({
'sales': ['sum', 'mean', 'count'],
'quantity': 'sum'
})
# Named aggregations (cleaner output)
df.groupby('category').agg(
total_sales=('sales', 'sum'),
avg_sales=('sales', 'mean'),
order_count=('sales', 'count')
)
# Custom functions
df.groupby('category')['sales'].apply(lambda x: x.max() - x.min())
Transform vs Apply:
# Transform: returns same shape (broadcast back)
df['sales_normalized'] = df.groupby('category')['sales'].transform(
lambda x: (x - x.mean()) / x.std()
)
# Apply: returns any shape
df.groupby('category').apply(lambda g: g.nlargest(2, 'sales'))
Interviewer's Insight
- Explains split-apply-combine pattern clearly
- Uses named aggregations (agg(total_sales=('sales', 'sum')) - cleaner output)
- Knows transform vs apply (transform keeps shape, apply reduces)
- Real-world: Netflix uses groupby for user engagement metrics (daily/weekly aggregations)
Difference Between merge(), join(), and concat() - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Merging, Joining, Concatenation | Asked by: Google, Amazon, Meta
View Answer
Comparison:
| Method | Use Case | Key Difference |
|---|---|---|
| merge() | SQL-like joins on columns | Column-based |
| join() | Join on index | Index-based |
| concat() | Stack DataFrames | No key matching |
Examples:
import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'val2': [4, 5, 6]})
# MERGE - column-based joining
pd.merge(df1, df2, on='key', how='inner') # Only A, B
pd.merge(df1, df2, on='key', how='left') # Keep all from df1
pd.merge(df1, df2, on='key', how='outer') # Keep all
pd.merge(df1, df2, left_on='key', right_on='key') # Different column names
# JOIN - index-based (set index first)
df1.set_index('key').join(df2.set_index('key'))
# CONCAT - stacking
pd.concat([df1, df2]) # Vertical (default axis=0)
pd.concat([df1, df2], axis=1) # Horizontal
pd.concat([df1, df2], ignore_index=True) # Reset index
Join Types:
- Inner: Only matching keys
- Left: All from left + matching from right
- Right: All from right + matching from left
- Outer: All from both (union)
Indicator for debugging:
result = pd.merge(df1, df2, on='key', how='outer', indicator=True)
# _merge column shows: 'left_only', 'right_only', 'both'
Interviewer's Insight
- Knows when to use each: merge (columns), join (index), concat (stacking)
- Explains join types (inner/left/right/outer) with use cases
- Uses indicator=True for debugging (shows merge source)
- Real-world: Uber merges trip data + driver data on driver_id (billions/day)
How to Apply Functions to DataFrames? apply(), map(), applymap() - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Data Transformation, Functions | Asked by: Google, Amazon, Microsoft
View Answer
Method Comparison:
| Method | Scope | Use Case |
|---|---|---|
| apply() | Row/Column | Complex transformations |
| map() | Series only | Element-wise mapping |
| applymap() | Element-wise | Simple element ops (deprecated, use map) |
Examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': ['x', 'y', 'z']
})
# apply() - works on columns or rows
df[['A', 'B']].apply(np.sum) # Sum each column
df[['A', 'B']].apply(np.sum, axis=1) # Sum each row
df[['A', 'B']].apply(lambda x: x.max() - x.min()) # Custom function
# map() - Series only, element-wise
df['C'].map({'x': 'X', 'y': 'Y', 'z': 'Z'}) # Dictionary mapping
df['A'].map(lambda x: x ** 2) # Lambda function
# For DataFrames, use apply with axis
df[['A', 'B']].apply(lambda x: x ** 2) # Each column
# In Pandas 2.1+, use map() instead of applymap()
df[['A', 'B']].map(lambda x: x * 2)
Performance Tip:
# SLOW - apply with lambda
df['A'].apply(lambda x: x ** 2)
# FAST - vectorized operation
df['A'] ** 2
# Use apply only when vectorization isn't possible
Interviewer's Insight
- Prefers vectorized operations (100× faster than apply)
- Knows applymap deprecated in Pandas 2.1+ (use map instead)
- Uses map for Series, apply for DataFrame
- Real-world: Airbnb vectorizes price calculations (10M+ listings, <1s)
How to Optimize Memory Usage in Pandas? - Google, Amazon, Netflix Interview Question
Difficulty: 🔴 Hard | Tags: Memory Optimization, Performance | Asked by: Google, Amazon, Netflix
View Answer
Check Memory Usage:
import pandas as pd
df = pd.read_csv('large_file.csv')
# Memory per column
df.memory_usage(deep=True)
# Total memory in MB
df.memory_usage(deep=True).sum() / 1024**2
Optimization Techniques:
# 1. Use appropriate dtypes on read
df = pd.read_csv('file.csv', dtype={
'id': 'int32', # Instead of int64
'value': 'float32', # Instead of float64
'category': 'category' # Instead of object
})
# 2. Convert existing columns
df['category'] = df['category'].astype('category')
# 3. Downcast numeric types
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')
# 4. Use sparse dtypes for mostly-empty columns
sparse_col = pd.arrays.SparseArray([0, 0, 0, 1, 0, 0, 0, 0])
Memory Reduction Example:
def reduce_memory(df):
    for col in df.columns:
        col_type = df[col].dtype
        if col_type == 'object':
            if df[col].nunique() / len(df) < 0.5:  # Low cardinality
                df[col] = df[col].astype('category')
        elif col_type == 'int64':
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif col_type == 'float64':
            df[col] = pd.to_numeric(df[col], downcast='float')
    return df
Savings:
| From | To | Reduction |
|---|---|---|
| int64 | int32 | 50% |
| float64 | float32 | 50% |
| object (strings) | category | 90%+ |
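A quick way to verify these savings with the reduce_memory() helper above (the sample DataFrame is illustrative; actual reductions depend on your data):
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': np.arange(1_000_000, dtype='int64'),
    'value': np.random.rand(1_000_000),
    'status': np.random.choice(['active', 'inactive', 'pending'], 1_000_000)
})
before = df.memory_usage(deep=True).sum() / 1024**2
df = reduce_memory(df)
after = df.memory_usage(deep=True).sum() / 1024**2
print(f"{before:.1f} MB -> {after:.1f} MB")  # 'status' becomes category, numerics are downcast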
Interviewer's Insight
- Uses category dtype for low-cardinality strings (90% memory reduction)
- Knows int64→int32, float64→float32 (50% reduction each)
- Checks memory_usage(deep=True) before/after optimization
- Real-world: Netflix reduced 50GB DataFrame to 5GB using category dtype
How to Handle Large Datasets That Don't Fit in Memory? - Google, Amazon, Netflix Interview Question
Difficulty: 🔴 Hard | Tags: Large Data, Chunking, Performance | Asked by: Google, Amazon, Netflix, Meta
View Answer
Chunked Processing:
import pandas as pd
# Process file in chunks
chunk_size = 100_000
chunks = []
for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    # Process each chunk
    processed = chunk.groupby('category')['value'].sum()
    chunks.append(processed)
# Combine results
result = pd.concat(chunks).groupby(level=0).sum()
Use Efficient File Formats:
# Parquet - columnar, compressed
df.to_parquet('data.parquet', compression='snappy')
df = pd.read_parquet('data.parquet', columns=['col1', 'col2']) # Read subset
# Feather - fast read/write
df.to_feather('data.feather')
df = pd.read_feather('data.feather')
Dask for Out-of-Core Computing:
import dask.dataframe as dd
# Lazy loading - doesn't load into memory
ddf = dd.read_csv('huge_file.csv')
# Same Pandas API
result = ddf.groupby('category')['value'].sum().compute()
PyArrow Backend (Pandas 2.0+):
# Use PyArrow for better memory efficiency
df = pd.read_csv('file.csv', dtype_backend='pyarrow')
# Or convert existing
df = df.convert_dtypes(dtype_backend='pyarrow')
Comparison:
| Approach | Use Case |
|---|---|
| Chunking | Simple aggregations |
| Parquet | Columnar queries |
| Dask | Complex operations |
| Polars | Speed-critical |
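For the Polars row above, a minimal lazy-scan sketch (the file name is illustrative; recent Polars versions use group_by, older ones groupby):
import polars as pl
# Lazy scan: nothing is read until .collect(), and only needed columns are loaded
result = (
    pl.scan_csv('huge_file.csv')
    .group_by('category')
    .agg(pl.col('value').sum())
    .collect()
)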
Interviewer's Insight
- Uses chunksize for simple aggregations (100K-1M rows/chunk)
- Knows Parquet > CSV (10× smaller, 100× faster column access)
- Mentions Dask for out-of-core (scales beyond RAM)
- Real-world: Uber processes 100TB+ trip data daily using Parquet + Dask
Explain Pivot Tables in Pandas - Amazon, Google Interview Question
Difficulty: 🟡 Medium | Tags: Pivot Tables, Reshaping | Asked by: Amazon, Google, Microsoft, Netflix
View Answer
pivot_table() - Flexible Reshaping:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'date': ['2023-01', '2023-01', '2023-02', '2023-02'],
'product': ['A', 'B', 'A', 'B'],
'region': ['East', 'East', 'West', 'West'],
'sales': [100, 150, 200, 250],
'quantity': [10, 15, 20, 25]
})
# Basic pivot table
pt = pd.pivot_table(
df,
values='sales',
index='date',
columns='product',
aggfunc='sum'
)
# Multiple aggregations
pt = pd.pivot_table(
df,
values=['sales', 'quantity'],
index='date',
columns='product',
aggfunc={'sales': 'sum', 'quantity': 'mean'}
)
# With totals
pt = pd.pivot_table(
df,
values='sales',
index='date',
columns='product',
aggfunc='sum',
margins=True,
margins_name='Total'
)
pivot() vs pivot_table():
| Feature | pivot() | pivot_table() |
|---|---|---|
| Duplicates | Error | Handles with aggfunc |
| Aggregation | No | Yes |
| Fill value | No | Yes |
# pivot() - simple reshape (no duplicates allowed)
df.pivot(index='date', columns='product', values='sales')
# pivot_table() - handles duplicates
df.pivot_table(index='date', columns='product', values='sales', aggfunc='mean')
Interviewer's Insight
- Uses pivot_table() for aggregation (handles duplicates, pivot() errors)
- Knows margins=True for row/column totals
- Can reverse with melt() (unpivot)
- Real-world: Airbnb uses pivot_table for daily revenue reports by region
How to Work with DateTime Data in Pandas? - Google, Amazon, Netflix Interview Question
Difficulty: 🟡 Medium | Tags: DateTime, Time Series | Asked by: Google, Amazon, Netflix, Meta
View Answer
Creating DateTime:
import pandas as pd
# Parse strings to datetime
df['date'] = pd.to_datetime(df['date_string'])
df['date'] = pd.to_datetime(df['date_string'], format='%Y-%m-%d')
# Create date range
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
dates = pd.date_range(start='2023-01-01', periods=12, freq='ME')
DateTime Accessors (.dt):
df['date'] = pd.to_datetime(df['date'])
# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek # 0=Monday
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['date'].dt.dayofweek >= 5
# Formatting
df['month_name'] = df['date'].dt.month_name()
df['formatted'] = df['date'].dt.strftime('%Y-%m')
Time Zone Handling:
# Localize (no timezone -> timezone)
df['date'] = df['date'].dt.tz_localize('UTC')
# Convert between timezones
df['date_est'] = df['date'].dt.tz_convert('US/Eastern')
Resampling:
# Resample time series
df.set_index('date').resample('M')['sales'].sum() # Monthly ('ME' in pandas 2.2+)
df.set_index('date').resample('Q')['sales'].mean() # Quarterly ('QE' in pandas 2.2+)
df.set_index('date').resample('W')['sales'].agg(['sum', 'mean'])
Interviewer's Insight
- Uses .dt accessor for datetime components (year, month, dayofweek)
- Handles timezones (tz_localize, tz_convert)
- Knows resample() for aggregation (D, W, M, Q, Y frequencies)
- Real-world: Netflix uses resample('D') for daily user engagement metrics
What is SettingWithCopyWarning and How to Avoid It? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Common Errors, Best Practices | Asked by: Google, Amazon, Meta, Microsoft
View Answer
The Problem:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# This triggers warning
filtered = df[df['A'] > 1]
filtered['B'] = 100 # Warning: setting on copy
Why It Happens:
Pandas can't always tell whether you're working with a view or a copy. Assigning through chained indexing may silently fail to update the original DataFrame, or update it when you didn't intend to.
Solutions:
# Solution 1: Explicit copy
filtered = df[df['A'] > 1].copy()
filtered['B'] = 100 # Safe
# Solution 2: Use .loc for assignment
df.loc[df['A'] > 1, 'B'] = 100 # Modifies original
# Solution 3: Chain in one line
df = df[df['A'] > 1].assign(B=100)
# Pandas 2.0+ with Copy-on-Write
pd.options.mode.copy_on_write = True
filtered = df[df['A'] > 1]
filtered['B'] = 100 # Creates copy automatically
Best Practices:
| Want to... | Use |
|---|---|
| Modify original | df.loc[condition, 'col'] = value |
| Create new DataFrame | df[condition].copy() |
| Chain operations | .assign() method |
Interviewer's Insight
- Explains view vs copy concept (chained indexing creates ambiguity)
- Uses .loc for in-place modification (safe)
- Uses .copy() when creating new DataFrame
- Knows Copy-on-Write in Pandas 2.0+ (eliminates warning)
- Real-world: Google enforces .copy() in code reviews to avoid bugs
How to Use Rolling Windows for Time Series? - Google, Amazon, Netflix Interview Question
Difficulty: 🟡 Medium | Tags: Rolling Windows, Time Series, Finance | Asked by: Google, Amazon, Netflix, Apple
View Answer
Rolling Window Calculations:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'date': pd.date_range('2023-01-01', periods=100),
'price': np.random.randn(100).cumsum() + 100
})
df.set_index('date', inplace=True)
# Moving averages
df['MA_7'] = df['price'].rolling(window=7).mean()
df['MA_30'] = df['price'].rolling(window=30).mean()
# Other statistics
df['rolling_std'] = df['price'].rolling(7).std()
df['rolling_max'] = df['price'].rolling(7).max()
df['rolling_sum'] = df['price'].rolling(7).sum()
# Minimum periods (handle NaN at start)
df['MA_7_min3'] = df['price'].rolling(window=7, min_periods=3).mean()
# Centered window
df['MA_centered'] = df['price'].rolling(window=7, center=True).mean()
Custom Rolling Functions:
# Custom function with apply
df['rolling_range'] = df['price'].rolling(7).apply(
lambda x: x.max() - x.min()
)
# Faster with raw=True (NumPy array)
df['rolling_custom'] = df['price'].rolling(7).apply(
lambda x: np.percentile(x, 75), raw=True
)
Exponential Weighted Average:
# EMA - more weight to recent values
df['EMA_7'] = df['price'].ewm(span=7).mean()
df['EMA_decay'] = df['price'].ewm(alpha=0.1).mean()
Interviewer's Insight
- Uses min_periods for edge cases (avoids NaN at start)
- Knows ewm() for exponential weighting (more weight to recent)
- Uses raw=True for performance (NumPy array, 3× faster)
- Real-world: Trading firms use rolling(20).mean() for moving averages
How to Efficiently Use Query and Eval? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Filtering, Query, Performance | Asked by: Google, Amazon, Meta
View Answer
query() - String-Based Filtering:
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'department': ['Sales', 'IT', 'Sales'],
'salary': [50000, 60000, 55000]
})
# Standard filtering (verbose)
df[(df['age'] > 25) & (df['department'] == 'Sales')]
# query() - cleaner syntax
df.query('age > 25 and department == "Sales"')
# Using variables with @
min_age = 25
dept = 'Sales'
df.query('age > @min_age and department == @dept')
# Column names with spaces
df.query('`Column Name` > 10')
eval() - Efficient Expression Evaluation:
# Create new column without intermediate copies
df.eval('bonus = salary * 0.1')
# Multiple expressions
df.eval('''
bonus = salary * 0.1
total_comp = salary + bonus
age_group = age // 10 * 10
''', inplace=True)
# Conditional expressions
df.eval('is_senior = age >= 30')
Performance:
# eval uses numexpr for large DataFrames
# Faster for: large datasets, complex expressions
# Similar for: small datasets, simple operations
# Check if numexpr is available
import pandas as pd
print(pd.get_option('compute.use_numexpr'))
Interviewer's Insight
- Uses query() for readable filters (SQL-like syntax)
- Uses @ for variable interpolation (dynamic filters)
- Knows eval() faster for large data (uses numexpr, avoids copies)
- Real-world: Airbnb uses query() for dynamic filters in analytics dashboards
How to Work with String Data in Pandas? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: String Operations, Text Processing | Asked by: Google, Amazon, Meta, Netflix
View Answer
String Accessor (.str):
import pandas as pd
df = pd.DataFrame({
'name': ['John Smith', 'Jane Doe', 'Bob Johnson'],
'email': ['[email protected]', '[email protected]', '[email protected]']
})
# Case transformations
df['name'].str.lower()
df['name'].str.upper()
df['name'].str.title()
# String matching
df['name'].str.contains('John')
df['name'].str.startswith('J')
df['email'].str.endswith('.com')
# Extraction
df['first_name'] = df['name'].str.split(' ').str[0]
df['domain'] = df['email'].str.extract(r'@(\w+\.\w+)')
# Replacement
df['name'].str.replace('John', 'Jonathan')
df['email'].str.replace(r'@\w+', '@company', regex=True)
# Length and padding
df['name'].str.len()
df['name'].str.pad(width=20, side='right', fillchar='.')
Split and Expand:
# Split into multiple columns
df[['first', 'last']] = df['name'].str.split(' ', expand=True)
# Split into list (no expand)
df['name_parts'] = df['name'].str.split(' ')
Regular Expressions:
# Extract groups
df['phone'] = pd.Series(['123-456-7890', '987-654-3210'])
df[['area', 'exchange', 'number']] = df['phone'].str.extract(
r'(\d{3})-(\d{3})-(\d{4})'
)
# Find all matches
df['numbers'] = df['phone'].str.findall(r'\d+')
Interviewer's Insight
- Uses .str accessor for vectorized string operations (10-100x faster than loops)
- Knows expand=True for split() to create multiple columns
- Can write regex patterns for extraction (.str.extract(), .str.replace())
- Real-world: Airbnb cleans 500M+ listing descriptions daily with .str methods
Difference Between transform() and apply() in GroupBy - Google, Amazon Interview Question
Difficulty: 🔴 Hard | Tags: GroupBy, Data Transformation | Asked by: Google, Amazon, Meta
View Answer
Key Difference:
| Method | Output Shape | Broadcast |
|---|---|---|
| transform() | Same as input | Yes |
| apply() | Any shape | No |
transform() - Broadcasts Back:
import pandas as pd
df = pd.DataFrame({
'group': ['A', 'A', 'B', 'B'],
'value': [10, 20, 30, 40]
})
# Normalize within groups
df['normalized'] = df.groupby('group')['value'].transform(
lambda x: (x - x.mean()) / x.std()
)
# Add group statistics as new column
df['group_mean'] = df.groupby('group')['value'].transform('mean')
df['group_sum'] = df.groupby('group')['value'].transform('sum')
df['pct_of_group'] = df['value'] / df.groupby('group')['value'].transform('sum')
# Result: same number of rows as original
apply() - Flexible Output:
# Return aggregated result
df.groupby('group')['value'].apply(lambda x: x.sum())
# Return different shape per group
df.groupby('group').apply(lambda g: g.nlargest(1, 'value'))
# Return multiple values per group
df.groupby('group')['value'].apply(lambda x: pd.Series({
'mean': x.mean(),
'range': x.max() - x.min()
}))
When to Use Each:
| Use Case | Method |
|---|---|
| Add group stat as column | transform() |
| Normalize within groups | transform() |
| Filter/rank within groups | transform() |
| Custom aggregation | apply() |
| Return subset of rows | apply() |
Interviewer's Insight
- Explains broadcast behavior: transform() returns same shape, apply() flexible
- Uses transform() for normalization (z-scores within groups)
- Knows performance: transform() optimized for aggregations, apply() slower
- Real-world: Netflix normalizes viewing time by country with transform()
How to Create Bins and Categories with cut() and qcut()? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Discretization, Binning | Asked by: Google, Amazon, Meta
View Answer
cut() - Fixed-Width Bins:
import pandas as pd
import numpy as np
ages = pd.Series([5, 17, 25, 35, 45, 55, 65, 75, 85])
# Explicit bin edges
bins = [0, 18, 35, 50, 65, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Middle Aged', 'Senior']
pd.cut(ages, bins=bins, labels=labels)
# Equal-width bins
pd.cut(ages, bins=5) # 5 equal-width bins
# Include lowest value
pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
# Return bin boundaries
pd.cut(ages, bins=5, retbins=True)
qcut() - Quantile-Based Bins:
# Equal-sized bins (same number of items)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
pd.qcut(values, q=4) # Quartiles
pd.qcut(values, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
pd.qcut(values, q=[0, 0.25, 0.5, 0.75, 1]) # Custom quantiles
# Handle duplicates
pd.qcut(values, q=4, duplicates='drop')
Comparison:
| Method | Bin Size | Use Case |
|---|---|---|
| cut() | Fixed width | Age groups, price ranges |
| qcut() | Equal count | Percentiles, balanced groups |
Example:
# cut: bins by value range
pd.cut([1, 5, 10, 50, 100], bins=3)
# (0.9, 34], (0.9, 34], (0.9, 34], (34, 67], (67, 100]
# qcut: bins by count
pd.qcut([1, 5, 10, 50, 100], q=3)
# (0.99, 7.5], (0.99, 7.5], (7.5, 30], (30, 100], (30, 100]
Interviewer's Insight
What they're testing: Data discretization.
Strong answer signals:
- Knows when to use each method
- Uses meaningful labels
- Handles edge cases (duplicates)
- Explains equal-width vs equal-count
- Real-world: Stripe uses qcut() for balanced A/B test cohorts (equal-sized groups)
How to Handle Categorical Data Efficiently? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Categorical Data, Memory | Asked by: Google, Amazon, Meta, Netflix
View Answer
Creating Categorical:
import pandas as pd
# Convert to categorical
df = pd.DataFrame({'status': ['active', 'inactive', 'active', 'pending'] * 10000})
# Memory before
print(df.memory_usage(deep=True).sum()) # ~360KB
# Convert
df['status'] = df['status'].astype('category')
# Memory after
print(df.memory_usage(deep=True).sum()) # ~40KB (90% reduction!)
Ordered Categories:
from pandas.api.types import CategoricalDtype
# Define ordered category
size_type = CategoricalDtype(
categories=['small', 'medium', 'large', 'xlarge'],
ordered=True
)
df = pd.DataFrame({'size': ['small', 'large', 'medium', 'xlarge']})
df['size'] = df['size'].astype(size_type)
# Now comparisons work
df[df['size'] > 'medium'] # Returns 'large' and 'xlarge'
df['size'].min(), df['size'].max()
get_dummies() for One-Hot Encoding:
# One-hot encoding
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})
pd.get_dummies(df, columns=['color'])
# Result: color_blue, color_green, color_red columns
# Drop first to avoid multicollinearity
pd.get_dummies(df, columns=['color'], drop_first=True)
# Prefix
pd.get_dummies(df, columns=['color'], prefix='c')
Category Operations:
# Add/remove categories
df['status'].cat.add_categories(['new_status'])
df['status'].cat.remove_unused_categories()
# Rename categories
df['status'].cat.rename_categories({'active': 'ACTIVE'})
Interviewer's Insight
What they're testing: Efficient data handling.
Strong answer signals:
- Uses category for memory savings
- Knows ordered categories for comparisons
- Uses get_dummies for ML
- Understands when to use category type
How to Use MultiIndex (Hierarchical Indexing)? - Google, Amazon Interview Question
Difficulty: 🔴 Hard | Tags: MultiIndex, Hierarchical Data | Asked by: Google, Amazon, Meta
View Answer
Creating MultiIndex:
import pandas as pd
# From tuples
index = pd.MultiIndex.from_tuples([
('A', 2021), ('A', 2022), ('B', 2021), ('B', 2022)
], names=['category', 'year'])
df = pd.DataFrame({'sales': [100, 150, 200, 250]}, index=index)
# From product (all combinations)
index = pd.MultiIndex.from_product(
[['A', 'B'], [2021, 2022, 2023]],
names=['category', 'year']
)
# From GroupBy
df = pd.DataFrame({
'category': ['A', 'A', 'B', 'B'],
'year': [2021, 2022, 2021, 2022],
'sales': [100, 150, 200, 250]
})
df_multi = df.set_index(['category', 'year'])
Selecting with MultiIndex:
# Select outer level
df_multi.loc['A'] # All years for category A
# Select specific combination
df_multi.loc[('A', 2021)]
# Cross-section
df_multi.xs(2021, level='year') # All categories for 2021
# Slice
df_multi.loc['A':'B'] # Range of outer index
Flatten MultiIndex:
# Reset to regular columns
df_multi.reset_index()
# Flatten column MultiIndex
df.columns = ['_'.join(col).strip() for col in df.columns.values]
Stack/Unstack:
# Wide to long (stack)
df.stack()
# Long to wide (unstack)
df.unstack()
df.unstack(level='year')
Interviewer's Insight
What they're testing: Complex data structures.
Strong answer signals:
- Creates MultiIndex efficiently
- Uses xs for cross-sections
- Knows stack/unstack
- Can flatten when needed
How to Profile and Optimize Pandas Performance? - Google, Amazon, Netflix Interview Question
Difficulty: 🔴 Hard | Tags: Performance, Profiling, Optimization | Asked by: Google, Amazon, Netflix
View Answer
Profiling Tools:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100000, 10))
# Time operations (IPython/Jupyter magic)
%timeit df.sum()
# Line-by-line profiling (requires line_profiler: %load_ext line_profiler)
%lprun -f function_name function_name(df)
# Memory profiling (requires memory_profiler: %load_ext memory_profiler)
%memit df.copy()
Common Optimizations:
# 1. Avoid loops - use vectorization
# SLOW
for i in range(len(df)):
    df.loc[i, 'new_col'] = df.loc[i, 'A'] * 2
# FAST
df['new_col'] = df['A'] * 2
# 2. Use NumPy for complex operations
# SLOW
df['result'] = df.apply(lambda row: complex_function(row), axis=1)
# FAST
df['result'] = np.where(df['A'] > 0, df['A'] * 2, df['A'] * 3)
# 3. Use categorical for strings
df['category'] = df['category'].astype('category')
# 4. Read only needed columns
df = pd.read_csv('file.csv', usecols=['col1', 'col2'])
# 5. Use query() for complex filters
df.query('A > 0 and B < 100') # Faster than boolean indexing
Advanced Techniques:
# Numba JIT compilation
from numba import jit
@jit(nopython=True)
def fast_calculation(arr):
    result = np.empty(len(arr))
    for i in range(len(arr)):
        result[i] = arr[i] ** 2 + arr[i] * 2
    return result
df['result'] = fast_calculation(df['A'].values)
# Swifter for automatic parallelization
import swifter
df['result'] = df['A'].swifter.apply(complex_function)
Interviewer's Insight
What they're testing: Production optimization skills.
Strong answer signals:
- Profiles before optimizing
- Prefers vectorization over loops
- Uses NumPy for speed
- Knows Numba/Swifter for edge cases
How to Use melt() for Unpivoting Data? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Reshaping, Unpivoting | Asked by: Google, Amazon, Meta
View Answer
melt() - Wide to Long Format:
import pandas as pd
# Wide format
df_wide = pd.DataFrame({
'name': ['Alice', 'Bob'],
'2021_sales': [100, 200],
'2022_sales': [150, 250],
'2023_sales': [180, 300]
})
# Convert to long format
df_long = pd.melt(
df_wide,
id_vars=['name'],
value_vars=['2021_sales', '2022_sales', '2023_sales'],
var_name='year',
value_name='sales'
)
# Result:
# name year sales
# 0 Alice 2021_sales 100
# 1 Bob 2021_sales 200
# ...
Clean up melted data:
df_long['year'] = df_long['year'].str.replace('_sales', '').astype(int)
Interviewer's Insight
What they're testing: Data reshaping skills.
Strong answer signals:
- Knows melt reverses pivot
- Uses id_vars for fixed columns
- Cleans variable names after melting
How to Use stack() and unstack()? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Reshaping, MultiIndex | Asked by: Google, Amazon, Meta
View Answer
stack() - Pivot columns to rows:
import pandas as pd
df = pd.DataFrame({
'A': [1, 2],
'B': [3, 4]
}, index=['row1', 'row2'])
# Stack columns into rows
stacked = df.stack()
# Result: MultiIndex Series
# row1 A 1
# B 3
# row2 A 2
# B 4
unstack() - Pivot rows to columns:
# Reverse operation
unstacked = stacked.unstack()
# Unstack specific level
df_multi = pd.DataFrame(
    {'value': [10, 20, 30, 40]},
    index=pd.MultiIndex.from_tuples([('cat1', 1), ('cat1', 2), ('cat2', 1), ('cat2', 2)])
)
df_multi.unstack(level=0) # Unstack first level
df_multi.unstack(level=1) # Unstack second level
Interviewer's Insight
What they're testing: MultiIndex manipulation.
Strong answer signals:
- Understands stack/unstack relationship
- Can specify which level to unstack
- Uses for reshaping complex data
How to Cross-Tabulate with crosstab()? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Cross Tabulation, Analysis | Asked by: Google, Amazon, Meta
View Answer
crosstab() - Frequency Tables:
import pandas as pd
df = pd.DataFrame({
'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
'department': ['Sales', 'Sales', 'IT', 'IT', 'Sales', 'HR'],
'salary': [50000, 55000, 60000, 65000, 52000, 48000]
})
# Simple frequency table
pd.crosstab(df['gender'], df['department'])
# With margins (totals)
pd.crosstab(df['gender'], df['department'], margins=True)
# With aggregation
pd.crosstab(
df['gender'],
df['department'],
values=df['salary'],
aggfunc='mean'
)
# Normalize to percentages
pd.crosstab(df['gender'], df['department'], normalize='all') # All cells
pd.crosstab(df['gender'], df['department'], normalize='index') # By row
pd.crosstab(df['gender'], df['department'], normalize='columns') # By column
Interviewer's Insight
What they're testing: Contingency table creation.
Strong answer signals:
- Uses normalize for percentages
- Knows margins for totals
- Can aggregate with values/aggfunc
How to Use explode() for List Columns? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: List Operations, Data Preprocessing | Asked by: Google, Amazon, Meta
View Answer
explode() - Unnest Lists:
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob'],
'skills': [['Python', 'SQL', 'ML'], ['Java', 'Scala']]
})
# Explode list column
df_exploded = df.explode('skills')
# Result:
# name skills
# 0 Alice Python
# 0 Alice SQL
# 0 Alice ML
# 1 Bob Java
# 1 Bob Scala
# Reset index after explode
df_exploded = df.explode('skills').reset_index(drop=True)
# Explode multiple columns (same length)
df = pd.DataFrame({
'id': [1, 2],
'values': [[1, 2], [3, 4]],
'labels': [['a', 'b'], ['c', 'd']]
})
df.explode(['values', 'labels'])
Interviewer's Insight
What they're testing: Nested data handling.
Strong answer signals:
- Resets index after explode
- Handles multiple list columns
- Knows reverse: groupby + list aggregation
How to Handle JSON with Nested Structures? - Amazon, Google Interview Question
Difficulty: 🟡 Medium | Tags: JSON Processing, Nested Data | Asked by: Amazon, Google, Meta
View Answer
json_normalize() - Flatten Nested JSON:
import pandas as pd
from pandas import json_normalize
data = [
{
'id': 1,
'name': 'Alice',
'address': {
'city': 'NYC',
'zip': '10001'
},
'orders': [{'id': 101, 'amount': 50}, {'id': 102, 'amount': 75}]
}
]
# Flatten top level
df = json_normalize(data)
# Flatten with nested paths
df = json_normalize(
data,
record_path='orders', # Explode this array
meta=['id', 'name', ['address', 'city']], # Include these fields
meta_prefix='user_'
)
# Read JSON file
df = pd.read_json('data.json')
df = pd.read_json('data.json', orient='records')
Interviewer's Insight
What they're testing: Real-world data handling.
Strong answer signals:
- Uses json_normalize for nested
- Knows record_path and meta
- Handles different JSON orientations
How to Use Method Chaining for Clean Code? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Method Chaining, Clean Code | Asked by: Google, Amazon, Meta
View Answer
Method Chaining Pattern:
import pandas as pd
# Instead of multiple intermediate variables
df = (
pd.read_csv('data.csv')
.query('status == "active"')
.assign(
year=lambda x: pd.to_datetime(x['date']).dt.year,
total=lambda x: x['price'] * x['quantity']
)
.groupby(['year', 'region'])
.agg(revenue=('total', 'sum'))
.reset_index()
.sort_values('revenue', ascending=False)
)
pipe() for Custom Functions:
import numpy as np
def add_features(df):
    return df.assign(
        log_value=np.log1p(df['value']),
        is_high=df['value'] > df['value'].median()
    )
def filter_outliers(df, column, n_std=3):
    mean, std = df[column].mean(), df[column].std()
    return df[(df[column] - mean).abs() <= n_std * std]
result = (
df
.pipe(add_features)
.pipe(filter_outliers, 'value', n_std=2)
)
Interviewer's Insight
What they're testing: Clean, maintainable code.
Strong answer signals:
- Uses parentheses for multi-line chains
- Uses pipe() for custom functions
- Avoids intermediate variables
How to Read/Write Parquet Files? - Google, Amazon, Netflix Interview Question
Difficulty: 🟢 Easy | Tags: File I/O, Big Data, Parquet | Asked by: Google, Amazon, Netflix, Meta
View Answer
Parquet - Columnar Format:
import pandas as pd
# Write to Parquet
df.to_parquet('data.parquet')
df.to_parquet('data.parquet', compression='snappy') # Default
df.to_parquet('data.parquet', compression='gzip') # Smaller
# Read Parquet
df = pd.read_parquet('data.parquet')
# Read only specific columns (fast!)
df = pd.read_parquet('data.parquet', columns=['col1', 'col2'])
# With filters (predicate pushdown)
df = pd.read_parquet(
'data.parquet',
filters=[('year', '==', 2023)]
)
Parquet vs CSV:
| Feature | Parquet | CSV |
|---|---|---|
| Size | ~10x smaller | Larger |
| Read speed | Faster | Slower |
| Column selection | Fast | Must read all |
| Data types | Preserved | Lost |
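A rough way to see the size difference yourself (requires pyarrow or fastparquet; the sample data is illustrative and actual ratios depend on the data and compression):
import os
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': np.arange(1_000_000),
    'category': np.random.choice(['a', 'b', 'c'], 1_000_000),
    'value': np.random.rand(1_000_000)
})
df.to_csv('data.csv', index=False)
df.to_parquet('data.parquet')  # snappy compression by default
print(os.path.getsize('data.csv') / 1e6, 'MB (CSV)')
print(os.path.getsize('data.parquet') / 1e6, 'MB (Parquet)')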
Interviewer's Insight
What they're testing: Efficient data storage.
Strong answer signals:
- Uses Parquet for large datasets
- Reads only needed columns
- Knows compression options
How to Use assign() for Creating New Columns? - Google, Amazon Interview Question
Difficulty: 🟢 Easy | Tags: Column Creation, Method Chaining | Asked by: Google, Amazon, Meta
View Answer
assign() - Create Columns in Chain:
import pandas as pd
df = pd.DataFrame({
'price': [100, 200, 300],
'quantity': [2, 3, 1]
})
# Create multiple columns
df = df.assign(
total=lambda x: x['price'] * x['quantity'],
discounted=lambda x: x['total'] * 0.9,
tax=lambda x: x['total'] * 0.1
)
# Reference previous assignments
df = df.assign(
subtotal=lambda x: x['price'] * x['quantity'],
tax=lambda x: x['subtotal'] * 0.1,
total=lambda x: x['subtotal'] + x['tax']
)
assign() vs direct assignment:
# Direct assignment (modifies in place)
df['new_col'] = df['price'] * 2
# assign() (returns new DataFrame, original unchanged)
df_new = df.assign(new_col=df['price'] * 2)
Interviewer's Insight
What they're testing: Functional programming style.
Strong answer signals:
- Uses lambda for dependent columns
- Prefers assign for method chaining
- Knows it returns new DataFrame
How to Calculate Percentage Change and Cumulative Stats? - Google, Amazon, Netflix Interview Question
Difficulty: 🟢 Easy | Tags: Time Series, Finance | Asked by: Google, Amazon, Netflix, Apple
View Answer
Percentage Change:
import pandas as pd
df = pd.DataFrame({
'date': pd.date_range('2023-01-01', periods=5),
'price': [100, 105, 102, 110, 108]
})
# Daily percentage change
df['pct_change'] = df['price'].pct_change()
# Percentage change over N periods
df['pct_change_3'] = df['price'].pct_change(periods=3)
# Fill first NaN
df['pct_change'] = df['price'].pct_change().fillna(0)
Cumulative Statistics:
# Cumulative sum
df['cumsum'] = df['price'].cumsum()
# Cumulative product (for returns)
df['cumulative_return'] = (1 + df['pct_change']).cumprod() - 1
# Cumulative max/min
df['cummax'] = df['price'].cummax()
df['cummin'] = df['price'].cummin()
# Drawdown
df['drawdown'] = df['price'] / df['price'].cummax() - 1
Interviewer's Insight
What they're testing: Financial calculations.
Strong answer signals:
- Uses pct_change for returns
- Knows cumprod for cumulative returns
- Can calculate drawdowns
How to Shift and Lag Data? - Google, Amazon, Netflix Interview Question
Difficulty: 🟢 Easy | Tags: Time Series, Lag Features | Asked by: Google, Amazon, Netflix
View Answer
shift() - Create Lag/Lead Features:
import pandas as pd
df = pd.DataFrame({
'date': pd.date_range('2023-01-01', periods=5),
'value': [10, 20, 30, 40, 50]
})
# Previous value (lag)
df['prev_value'] = df['value'].shift(1)
df['prev_2'] = df['value'].shift(2)
# Next value (lead)
df['next_value'] = df['value'].shift(-1)
# Calculate difference from previous
df['diff'] = df['value'] - df['value'].shift(1)
# Same as: df['value'].diff()
# Shift with fill
df['prev_filled'] = df['value'].shift(1, fill_value=0)
Use Cases:
| Operation | Formula |
|---|---|
| Lag-1 | df['col'].shift(1) |
| Difference | df['col'].diff() |
| % Change | df['col'].pct_change() |
| Rolling difference | df['col'] - df['col'].shift(n) |
Interviewer's Insight
What they're testing: Feature engineering for time series.
Strong answer signals:
- Creates lag features for ML
- Knows shift vs diff vs pct_change
- Handles NaN from shifting
How to Sample Data from DataFrame? - Google, Amazon Interview Question
Difficulty: 🟢 Easy | Tags: Sampling, Data Exploration | Asked by: Google, Amazon, Meta, Netflix
View Answer
sample() - Random Sampling:
import pandas as pd
df = pd.DataFrame({'A': range(1000)})
# Sample n rows
sample = df.sample(n=100)
# Sample fraction
sample = df.sample(frac=0.1) # 10%
# With replacement (bootstrap)
bootstrap = df.sample(n=1000, replace=True)
# Reproducible sampling
sample = df.sample(n=100, random_state=42)
# Weighted sampling
df['weight'] = [0.1] * 500 + [0.9] * 500
sample = df.sample(n=100, weights='weight')
Stratified Sampling:
# Sample within groups
df.groupby('category').sample(n=10)
df.groupby('category').sample(frac=0.1)
# Stratified with sklearn
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, stratify=df['category'], test_size=0.2)
Interviewer's Insight
What they're testing: Data sampling techniques.
Strong answer signals:
- Uses random_state for reproducibility
- Knows weighted sampling
- Uses stratified for imbalanced data
How to Detect and Handle Outliers? - Google, Amazon, Meta Interview Question
Difficulty: 🟡 Medium | Tags: Outlier Detection, Data Cleaning | Asked by: Google, Amazon, Meta, Netflix
View Answer
Detection Methods:
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': [1, 2, 3, 100, 4, 5, -50, 6]})
# Z-score method
z_scores = (df['value'] - df['value'].mean()) / df['value'].std()
outliers = df[abs(z_scores) > 3]
# IQR method
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['value'] < lower) | (df['value'] > upper)]
# Percentile method
outliers = df[(df['value'] < df['value'].quantile(0.01)) |
(df['value'] > df['value'].quantile(0.99))]
Handling Outliers:
# Remove
df_clean = df[(df['value'] >= lower) & (df['value'] <= upper)]
# Cap (winsorize)
df['value_capped'] = df['value'].clip(lower=lower, upper=upper)
# Replace with NaN
df.loc[abs(z_scores) > 3, 'value'] = np.nan
Interviewer's Insight
What they're testing: Data quality handling.
Strong answer signals:
- Knows IQR and Z-score methods
- Chooses method based on distribution
- Considers domain knowledge
How to Normalize and Standardize Data? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Feature Engineering, ML Preprocessing | Asked by: Google, Amazon, Meta, Microsoft
View Answer
Normalization (Min-Max Scaling):
import pandas as pd
df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})
# Scale to [0, 1]
df['normalized'] = (df['value'] - df['value'].min()) / \
(df['value'].max() - df['value'].min())
# Using sklearn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['normalized'] = scaler.fit_transform(df[['value']])
Standardization (Z-score):
# Scale to mean=0, std=1
df['standardized'] = (df['value'] - df['value'].mean()) / df['value'].std()
# Using sklearn
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['standardized'] = scaler.fit_transform(df[['value']])
When to Use:
| Method | Use When |
|---|---|
| Min-Max | Bounded range needed, no outliers |
| Z-score | Normal distribution, has outliers |
| Robust | Many outliers (uses median/IQR) |
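For the Robust row above, a minimal sketch using median/IQR; sklearn's RobustScaler applies the same centering and scaling:
# Manual robust scaling: center by median, scale by IQR
q1, q3 = df['value'].quantile(0.25), df['value'].quantile(0.75)
df['robust_scaled'] = (df['value'] - df['value'].median()) / (q3 - q1)
# Using sklearn
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df['robust_scaled'] = scaler.fit_transform(df[['value']])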
Interviewer's Insight
What they're testing: ML preprocessing.
Strong answer signals:
- Knows difference between methods
- Uses sklearn for production
- Considers outliers in choice
How to Use where() and mask() Methods? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Conditional Operations, Data Transformation | Asked by: Google, Amazon, Meta
View Answer
where() - Keep values where True:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
# Keep values >= 3, replace others with NaN
df['A'].where(df['A'] >= 3)
# Result: [NaN, NaN, 3, 4, 5]
# Replace with specific value
df['A'].where(df['A'] >= 3, other=0)
# Result: [0, 0, 3, 4, 5]
mask() - Replace values where True:
# Opposite of where
# Replace values >= 3 with NaN
df['A'].mask(df['A'] >= 3)
# Result: [1, 2, NaN, NaN, NaN]
# Replace with specific value
df['A'].mask(df['A'] >= 3, other=999)
# Result: [1, 2, 999, 999, 999]
vs np.where:
# np.where for if-else
df['result'] = np.where(df['A'] >= 3, 'high', 'low')
# Multiple conditions: np.select
conditions = [df['A'] < 2, df['A'] < 4]
choices = ['low', 'medium']
df['category'] = np.select(conditions, choices, default='high')
Interviewer's Insight
What they're testing: Conditional data manipulation.
Strong answer signals:
- Knows where keeps, mask replaces
- Uses np.where for if-else
- Uses np.select for multiple conditions
How to Rank Values in Pandas? - Google, Amazon Interview Question
Difficulty: 🟢 Easy | Tags: Ranking, Analysis | Asked by: Google, Amazon, Meta, Netflix
View Answer
rank() Method:
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'score': [85, 90, 85, 95]
})
# Default: average rank for ties
df['rank'] = df['score'].rank()
# [1.5, 3, 1.5, 4] - tied scores get average
# Rank methods for ties
df['rank_min'] = df['score'].rank(method='min') # [1, 3, 1, 4]
df['rank_max'] = df['score'].rank(method='max') # [2, 3, 2, 4]
df['rank_first'] = df['score'].rank(method='first') # [1, 3, 2, 4]
df['rank_dense'] = df['score'].rank(method='dense') # [1, 2, 1, 3]
# Descending rank
df['rank_desc'] = df['score'].rank(ascending=False)
# Rank within groups (assumes a grouping column such as 'category')
df['rank_in_group'] = df.groupby('category')['score'].rank()
# Percentile rank
df['percentile'] = df['score'].rank(pct=True)
Interviewer's Insight
What they're testing: Data ranking skills.
Strong answer signals:
- Knows different tie-breaking methods
- Uses dense for no gaps
- Can rank within groups
How to Find First/Last N Rows Per Group? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: GroupBy, Selection | Asked by: Google, Amazon, Meta, Netflix
View Answer
nlargest/nsmallest per Group:
import pandas as pd
df = pd.DataFrame({
'category': ['A', 'A', 'A', 'B', 'B', 'B'],
'value': [10, 30, 20, 40, 60, 50]
})
# Top 2 per group using nlargest
df.groupby('category').apply(lambda x: x.nlargest(2, 'value'))
# Top N using head after sort
df.sort_values('value', ascending=False).groupby('category').head(2)
# First/last row per group
df.groupby('category').first()
df.groupby('category').last()
df.groupby('category').nth(0) # First row
df.groupby('category').nth(-1) # Last row
Using rank:
# More efficient for large data
df['rank'] = df.groupby('category')['value'].rank(method='first', ascending=False)
top_2 = df[df['rank'] <= 2]
Interviewer's Insight
What they're testing: Efficient group operations.
Strong answer signals:
- Uses nlargest/nsmallest for simplicity
- Uses rank for large data
- Knows first(), last(), nth()
How to Use nsmallest() and nlargest()? - Google, Amazon Interview Question
Difficulty: 🟢 Easy | Tags: Selection, Performance | Asked by: Google, Amazon, Meta
View Answer
nlargest/nsmallest - Efficient Selection:
import pandas as pd
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D', 'E'],
'sales': [100, 500, 200, 800, 300],
'profit': [10, 50, 30, 70, 20]
})
# Top 3 by sales
df.nlargest(3, 'sales')
# Bottom 3 by profit
df.nsmallest(3, 'profit')
# By multiple columns (tiebreaker)
df.nlargest(3, ['sales', 'profit'])
# Keep='first' (default), 'last', 'all'
df.nlargest(3, 'sales', keep='all') # Include all ties
Performance:
# nlargest is O(n log k) - faster than full sort
df.nlargest(10, 'value') # Faster
# Full sort is O(n log n)
df.sort_values('value', ascending=False).head(10) # Slower
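A rough way to confirm this in a notebook (timings are machine-dependent; the series below is illustrative):
import pandas as pd
import numpy as np
s = pd.Series(np.random.rand(5_000_000))
%timeit s.nlargest(10)                            # Partial selection: O(n log k)
%timeit s.sort_values(ascending=False).head(10)   # Full sort: O(n log n)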
Interviewer's Insight
What they're testing: Efficient data selection.
Strong answer signals:
- Knows nlargest is faster than sort+head
- Uses keep='all' for ties
- Applies to Series and DataFrame
How to Calculate Weighted Average? - Google, Amazon, Netflix Interview Question
Difficulty: 🟡 Medium | Tags: Aggregation, Finance | Asked by: Google, Amazon, Netflix, Apple
View Answer
Weighted Average:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'category': ['X', 'X', 'Y'],
    'price': [10, 20, 30],
    'quantity': [100, 50, 25]
})
# Weighted average price
weighted_avg = np.average(df['price'], weights=df['quantity'])
# or
weighted_avg = (df['price'] * df['quantity']).sum() / df['quantity'].sum()
# Weighted average per group
def weighted_avg_func(group, value_col, weight_col):
    return np.average(group[value_col], weights=group[weight_col])
df.groupby('category').apply(
    weighted_avg_func, 'price', 'quantity'
)
Named Aggregation with Weighted Average:
def weighted_mean(df, value_col, weight_col):
    return (df[value_col] * df[weight_col]).sum() / df[weight_col].sum()
result = df.groupby('category').apply(
    lambda x: pd.Series({
        'weighted_price': weighted_mean(x, 'price', 'quantity'),
        'total_quantity': x['quantity'].sum()
    })
)
Interviewer's Insight
What they're testing: Custom aggregations.
Strong answer signals:
- Uses np.average with weights
- Can apply to groups
- Handles edge cases (zero weights)
How to Perform Window Functions Like SQL? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Window Functions, Analytics | Asked by: Google, Amazon, Meta, Netflix
View Answer
SQL-like Window Functions:
import pandas as pd
df = pd.DataFrame({
'category': ['A', 'A', 'A', 'B', 'B', 'B'],
'date': pd.date_range('2023-01-01', periods=6),
'value': [10, 20, 30, 40, 50, 60]
})
# ROW_NUMBER
df['row_num'] = df.groupby('category').cumcount() + 1
# RANK
df['rank'] = df.groupby('category')['value'].rank(method='min')
# DENSE_RANK
df['dense_rank'] = df.groupby('category')['value'].rank(method='dense')
# LEAD / LAG
df['prev_value'] = df.groupby('category')['value'].shift(1)
df['next_value'] = df.groupby('category')['value'].shift(-1)
# Running SUM / AVG
df['running_sum'] = df.groupby('category')['value'].cumsum()
df['running_avg'] = df.groupby('category')['value'].expanding().mean().values
# Percent of total
df['pct_of_cat'] = df['value'] / df.groupby('category')['value'].transform('sum')
Interviewer's Insight
What they're testing: Analytics skills.
Strong answer signals:
- Maps SQL functions to Pandas
- Uses transform for same-shape output
- Combines groupby with cumulative ops
How to Compare Two DataFrames? - Amazon, Google Interview Question
Difficulty: 🟡 Medium | Tags: Data Comparison, Validation | Asked by: Amazon, Google, Microsoft
View Answer
compare() Method (Pandas 1.1+):
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': ['a', 'x', 'c']})
# Show differences
df1.compare(df2)
# Shows 'self' and 'other' for differences
# Keep all rows
df1.compare(df2, keep_equal=True)
# Keep all columns
df1.compare(df2, keep_shape=True)
Testing Equality:
# Check if equal
df1.equals(df2) # Returns True/False
# Element-wise comparison
df1 == df2 # Boolean DataFrame
(df1 == df2).all().all() # All equal?
# For testing
from pandas.testing import assert_frame_equal
assert_frame_equal(df1, df2, check_dtype=False)
Find Differences:
# Rows in df1 not in df2
df1[~df1.isin(df2).all(axis=1)]
# Using merge
merged = df1.merge(df2, indicator=True, how='outer')
only_in_df1 = merged[merged['_merge'] == 'left_only']
only_in_df2 = merged[merged['_merge'] == 'right_only']
Interviewer's Insight
What they're testing: Data validation skills.
Strong answer signals:
- Uses compare() for visual diff
- Uses assert_frame_equal for tests
- Can find specific differences
How to Handle SettingWithCopyWarning Properly? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Best Practices, Common Errors | Asked by: Google, Amazon, Meta, Microsoft
View Answer
Copy-on-Write in Pandas 2.0+:
import pandas as pd
# Enable CoW globally
pd.options.mode.copy_on_write = True
# Now this is safe
df2 = df[df['A'] > 0]
df2['B'] = 100 # Creates copy automatically
# Original df unchanged
Best Practices Summary:
# 1. Modifying original - use .loc
df.loc[df['A'] > 0, 'B'] = 100
# 2. Creating new DataFrame - use .copy()
df_new = df[df['A'] > 0].copy()
df_new['B'] = 100
# 3. Method chaining - use .assign()
df_new = df.query('A > 0').assign(B=100)
# 4. Never chain indexing
# BAD
df[df['A'] > 0]['B'] = 100
# GOOD
df.loc[df['A'] > 0, 'B'] = 100
Interviewer's Insight
What they're testing: Production-safe code.
Strong answer signals:
- Knows CoW in Pandas 2.0+
- Uses .loc consistently
- Avoids chained indexing
How to Use Sparse Data Structures? - Google, Amazon Interview Question
Difficulty: 🔴 Hard | Tags: Sparse Data, Memory Optimization | Asked by: Google, Amazon, Netflix
View Answer
Sparse Arrays - For Mostly-Null Data:
import pandas as pd
import numpy as np
# Create sparse array
arr = pd.arrays.SparseArray([0, 0, 1, 0, 0, 0, 2, 0, 0, 0])
# Memory savings (SparseArray exposes nbytes rather than memory_usage)
print(arr.nbytes) # Much smaller than the dense equivalent
# Sparse Series
s = pd.Series(pd.arrays.SparseArray([0] * 1000000 + [1]))
print(s.memory_usage()) # ~16 bytes vs ~8MB
# Convert existing
df_sparse = df.astype(pd.SparseDtype('float', fill_value=0))
Use Cases:
| Scenario | Memory Savings |
|---|---|
| 90% zeros | ~10x |
| 99% zeros | ~100x |
| One-hot encoded | Massive |
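For the one-hot row above, get_dummies can emit sparse-backed columns directly (a small sketch; the sample column is illustrative):
df = pd.DataFrame({'color': ['red', 'blue', 'green'] * 100_000})
dummies_dense = pd.get_dummies(df['color'])
dummies_sparse = pd.get_dummies(df['color'], sparse=True)
print(dummies_dense.memory_usage(deep=True).sum())
print(dummies_sparse.memory_usage(deep=True).sum())  # Much smaller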
Operations:
# Most operations work normally
s.sum()
s.mean()
# Convert back to dense
s_dense = s.sparse.to_dense()
Interviewer's Insight
What they're testing: Advanced memory optimization.
Strong answer signals:
- Uses for mostly-zero data
- Knows fill_value parameter
- Understands trade-offs
How to Implement Custom Aggregation Functions? - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Aggregation, Custom Functions | Asked by: Google, Amazon, Meta
View Answer
Custom Aggregation:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'category': ['A', 'A', 'B', 'B'],
'value': [10, 20, 30, 40]
})
# Lambda function
df.groupby('category')['value'].agg(lambda x: x.max() - x.min())
# Named function
def range_func(x):
    return x.max() - x.min()
df.groupby('category')['value'].agg(range_func)
# Multiple custom aggregations
df.groupby('category')['value'].agg([
('range', lambda x: x.max() - x.min()),
('cv', lambda x: x.std() / x.mean()), # Coefficient of variation
('iqr', lambda x: x.quantile(0.75) - x.quantile(0.25))
])
# Named aggregation with custom
df.groupby('category').agg(
value_range=('value', lambda x: x.max() - x.min()),
value_cv=('value', lambda x: x.std() / x.mean())
)
Interviewer's Insight
What they're testing: Advanced aggregation skills.
Strong answer signals:
- Uses named aggregations
- Defines reusable functions
- Combines built-in and custom
How to Perform Asof Joins (Nearest Key Join)? - Google, Amazon, Netflix Interview Question
Difficulty: 🔴 Hard | Tags: Joining, Time Series | Asked by: Google, Amazon, Netflix, Apple
View Answer
merge_asof() - Join on Nearest Key:
import pandas as pd
# Trade data
trades = pd.DataFrame({
'time': pd.to_datetime(['2023-01-01 10:00:01', '2023-01-01 10:00:03',
'2023-01-01 10:00:05']),
'ticker': ['AAPL', 'AAPL', 'AAPL'],
'quantity': [100, 200, 150]
})
# Quote data
quotes = pd.DataFrame({
'time': pd.to_datetime(['2023-01-01 10:00:00', '2023-01-01 10:00:02',
'2023-01-01 10:00:04']),
'ticker': ['AAPL', 'AAPL', 'AAPL'],
'bid': [149.0, 149.5, 150.0]
})
# Join trade with most recent quote
result = pd.merge_asof(
trades.sort_values('time'),
quotes.sort_values('time'),
on='time',
by='ticker',
direction='backward' # Use most recent quote
)
Direction Options:
| Direction | Meaning |
|---|---|
| backward | Previous/equal key |
| forward | Next/equal key |
| nearest | Closest key |
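merge_asof() also takes a tolerance parameter to reject matches that are too far apart; a small sketch building on the trades/quotes frames above (the 2-second window is an arbitrary choice):
result_tol = pd.merge_asof(
    trades.sort_values('time'),
    quotes.sort_values('time'),
    on='time',
    by='ticker',
    direction='backward',
    tolerance=pd.Timedelta('2s')  # quotes older than 2 seconds are not matched; bid becomes NaN
)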
Interviewer's Insight
What they're testing: Time series joining.
Strong answer signals:
- Sorts data before asof join
- Uses by= for grouping
- Knows direction parameter
How to Calculate Cohort Metrics? - Meta, Netflix, Amazon Interview Question
Difficulty: 🔴 Hard | Tags: Cohort Analysis, Time Series | Asked by: Meta, Netflix, Amazon, Google
View Answer
Cohort Retention Analysis:
import pandas as pd
df = pd.DataFrame({
'user_id': [1, 1, 1, 2, 2, 3],
'order_date': pd.to_datetime([
'2023-01-05', '2023-02-10', '2023-03-15',
'2023-01-20', '2023-03-25', '2023-02-01'
])
})
# Get first purchase date (cohort)
df['cohort'] = df.groupby('user_id')['order_date'].transform('min')
df['cohort_month'] = df['cohort'].dt.to_period('M')
df['order_month'] = df['order_date'].dt.to_period('M')
# Calculate months since cohort
df['months_since_cohort'] = (
    # Period subtraction yields a month offset; .n is the number of months
    (df['order_month'] - df['cohort_month']).apply(lambda offset: offset.n)
)
# Create cohort pivot
cohort_data = df.groupby(['cohort_month', 'months_since_cohort']).agg(
users=('user_id', 'nunique')
).reset_index()
cohort_pivot = cohort_data.pivot_table(
index='cohort_month',
columns='months_since_cohort',
values='users'
)
# Calculate retention rates
cohort_sizes = cohort_pivot[0]
retention = cohort_pivot.divide(cohort_sizes, axis=0)
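Each row of the matrix is an acquisition cohort and each column the number of months since the first order; a small sketch for reading it as percentages (display choices here are arbitrary):
retention_pct = (retention * 100).round(1)
print(retention_pct)   # month 0 is 100.0 for every cohort
print(cohort_sizes)    # absolute users per cohort, for context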
Interviewer's Insight
What they're testing: Business analytics skills.
Strong answer signals:
- Calculates cohort from first action
- Uses period for month grouping
- Creates retention matrix
How to Implement A/B Test Analysis in Pandas? - Meta, Google, Netflix Interview Question
Difficulty: 🔴 Hard | Tags: A/B Testing, Statistical Analysis | Asked by: Meta, Google, Netflix, Amazon
View Answer
A/B Test Analysis:
import pandas as pd
import numpy as np
from scipy import stats
# Simulate variants first so the conversion rate actually depends on the variant
variant = np.random.choice(['control', 'treatment'], 10000)
df = pd.DataFrame({
    'user_id': range(10000),
    'variant': variant,
    'converted': np.random.binomial(1, np.where(variant == 'treatment', 0.12, 0.10))
})
# Summary statistics per variant
summary = df.groupby('variant').agg(
users=('user_id', 'count'),
conversions=('converted', 'sum'),
conversion_rate=('converted', 'mean')
)
# Statistical test
control = df[df['variant'] == 'control']['converted']
treatment = df[df['variant'] == 'treatment']['converted']
# Chi-squared test for proportions
contingency = pd.crosstab(df['variant'], df['converted'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
# Or use proportion z-test
from statsmodels.stats.proportion import proportions_ztest
count = [treatment.sum(), control.sum()]
nobs = [len(treatment), len(control)]
z_stat, p_value = proportions_ztest(count, nobs)
# Relative lift (point estimate)
lift = (treatment.mean() - control.mean()) / control.mean()
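A confidence interval for the absolute difference in conversion rates can be added with a normal approximation (a minimal sketch; the 1.96 multiplier assumes a two-sided 95% interval):
p_c, p_t = control.mean(), treatment.mean()
se = np.sqrt(p_c * (1 - p_c) / len(control) + p_t * (1 - p_t) / len(treatment))
diff = p_t - p_c
print(f"Absolute lift: {diff:.4f}, 95% CI: [{diff - 1.96*se:.4f}, {diff + 1.96*se:.4f}]")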
Interviewer's Insight
What they're testing: Statistical analysis skills.
Strong answer signals:
- Calculates conversion rates
- Uses appropriate statistical test
- Interprets p-value correctly
How to Use Copy-on-Write (CoW) in Pandas 2.0+? - Google, Meta Interview Question
Difficulty: 🔴 Hard | Tags: Internals, Performance | Asked by: Google, Meta, Microsoft
View Answer
Copy-on-Write Explained:
import pandas as pd
# Enable globally
pd.options.mode.copy_on_write = True
# Or use context manager
with pd.option_context('mode.copy_on_write', True):
df2 = df[['A', 'B']]
df2['A'] = 100 # Creates copy only when modified
Benefits:
| Aspect | Without CoW | With CoW |
|---|---|---|
| Memory | Copies on slice | Shares until modified |
| Safety | Ambiguous | Always safe |
| Speed | Unnecessary copies | Lazy copies |
Behavior:
pd.options.mode.copy_on_write = True
# Views share data until modification
df = pd.DataFrame({'A': [1, 2, 3]})
df2 = df[['A']] # No copy yet
df2['A'] = 100 # Copy created here
print(df) # Original unchanged!
# No more SettingWithCopyWarning
df[df['A'] > 1]['A'] = 99 # No effect on df: chained assignment never modifies the original under CoW
Interviewer's Insight
What they're testing: Modern Pandas knowledge.
Strong answer signals:
- Knows CoW mechanism
- Understands lazy copying
- Knows it's default in Pandas 3.0
How to Use PyArrow Backend for Better Performance? - Google, Amazon Interview Question
Difficulty: 🔴 Hard | Tags: Performance, Arrow | Asked by: Google, Amazon, Databricks
View Answer
PyArrow Backend (Pandas 2.0+):
import pandas as pd
# Read with PyArrow backend
df = pd.read_csv('data.csv', dtype_backend='pyarrow')
df = pd.read_parquet('data.parquet', dtype_backend='pyarrow')
# Convert existing DataFrame
df_arrow = df.convert_dtypes(dtype_backend='pyarrow')
# Check types
df_arrow.dtypes
# int64[pyarrow], string[pyarrow], etc.
Benefits:
| Feature | NumPy Backend | PyArrow Backend |
|---|---|---|
| String memory | High | Low |
| Null handling | NaN (forces float64 upcast) | Native |
| Interop | Limited | Arrow ecosystem |
String Performance:
# PyArrow strings are much more efficient
df['text_col'] # With PyArrow: less memory, faster ops
# Native null support
df_arrow['int_col'] # Can have Int64 with nulls
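A quick way to see the string savings is to compare the same column under both backends (a rough sketch; assumes pyarrow is installed and exact numbers vary):
s_obj = pd.Series(['category_a', 'category_b'] * 500_000)   # object dtype
s_arrow = s_obj.astype('string[pyarrow]')
print(s_obj.memory_usage(deep=True) / 1024**2, "MB (object)")
print(s_arrow.memory_usage(deep=True) / 1024**2, "MB (pyarrow)")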
Interviewer's Insight
What they're testing: Cutting-edge Pandas.
Strong answer signals:
- Knows Arrow for string efficiency
- Uses for interop with Spark/Arrow
- Understands when to use
Quick Reference: 127+ Interview Questions
| Sno | Question Title | Practice Links | Companies Asking | Difficulty | Topics |
|---|---|---|---|---|---|
| 1 | What is Pandas and why is it used? | Pandas Docs | Google, Amazon, Meta, Netflix | Easy | Basics, Introduction |
| 2 | Difference between Series and DataFrame | GeeksforGeeks | Google, Amazon, Meta, Microsoft | Easy | Data Structures |
| 3 | How to create a DataFrame from dictionary? | Pandas Docs | Amazon, Google, Flipkart | Easy | DataFrame Creation |
| 4 | Difference between loc and iloc | Stack Overflow | Google, Amazon, Meta, Apple, Netflix | Easy | Indexing, Selection |
| 5 | How to read CSV, Excel, JSON files? | Pandas Docs | Most Tech Companies | Easy | Data I/O |
| 6 | How to handle missing values (NaN)? | Real Python | Google, Amazon, Meta, Netflix, Apple | Medium | Missing Data, fillna, dropna |
| 7 | Difference between dropna() and fillna() | Pandas Docs | Amazon, Google, Microsoft | Easy | Missing Data |
| 8 | Explain GroupBy in Pandas | Real Python | Google, Amazon, Meta, Netflix, Apple | Medium | GroupBy, Aggregation |
| 9 | How to merge two DataFrames? | Pandas Docs | Google, Amazon, Meta, Microsoft | Medium | Merging, Joining |
| 10 | Difference between merge(), join(), concat() | Stack Overflow | Google, Amazon, Meta | Medium | Merging, Joining, Concatenation |
| 11 | How to apply a function to DataFrame? | Pandas Docs | Google, Amazon, Meta, Netflix | Medium | apply, applymap, map |
| 12 | Difference between apply(), map(), applymap() | GeeksforGeeks | Google, Amazon, Microsoft | Medium | Data Transformation |
| 13 | How to rename columns in DataFrame? | Pandas Docs | Most Tech Companies | Easy | Column Operations |
| 14 | How to sort DataFrame by column values? | Pandas Docs | Most Tech Companies | Easy | Sorting |
| 15 | How to filter rows based on conditions? | Pandas Docs | Google, Amazon, Meta, Netflix | Easy | Filtering, Boolean Indexing |
| 16 | How to remove duplicate rows? | Pandas Docs | Amazon, Google, Microsoft | Easy | Data Cleaning, Deduplication |
| 17 | How to change data types of columns? | Pandas Docs | Most Tech Companies | Easy | Data Types |
| 18 | What is the difference between copy() and view? | Stack Overflow | Google, Amazon, Meta | Medium | Memory Management |
| 19 | Explain pivot tables in Pandas | Pandas Docs | Amazon, Google, Microsoft, Netflix | Medium | Pivot Tables, Reshaping |
| 20 | Difference between pivot() and pivot_table() | Stack Overflow | Google, Amazon, Meta | Medium | Reshaping |
| 21 | How to handle datetime data in Pandas? | Pandas Docs | Google, Amazon, Netflix, Meta | Medium | DateTime, Time Series |
| 22 | How to create a date range? | Pandas Docs | Amazon, Netflix, Google | Easy | DateTime |
| 23 | What is MultiIndex (Hierarchical Indexing)? | Pandas Docs | Google, Amazon, Meta | Hard | MultiIndex, Hierarchical Data |
| 24 | How to reset and set index? | Pandas Docs | Most Tech Companies | Easy | Indexing |
| 25 | How to perform rolling window calculations? | Pandas Docs | Google, Amazon, Netflix, Apple | Medium | Rolling Windows, Time Series |
| 26 | How to calculate moving averages? | GeeksforGeeks | Google, Amazon, Netflix, Apple | Medium | Rolling Windows, Finance |
| 27 | How to perform resampling on time series? | Pandas Docs | Google, Amazon, Netflix | Medium | Resampling, Time Series |
| 28 | Difference between transform() and apply() | Stack Overflow | Google, Amazon, Meta | Hard | GroupBy, Data Transformation |
| 29 | How to create bins with cut() and qcut()? | Pandas Docs | Google, Amazon, Meta | Medium | Discretization, Binning |
| 30 | How to handle categorical data? | Pandas Docs | Google, Amazon, Meta, Netflix | Medium | Categorical Data, Memory |
| 31 | How to one-hot encode categorical data? | Pandas Docs | Google, Amazon, Meta, Microsoft | Easy | Feature Engineering, ML |
| 32 | How to read data from SQL database? | Pandas Docs | Amazon, Google, Microsoft | Medium | Database I/O |
| 33 | How to export DataFrame to various formats? | Pandas Docs | Most Tech Companies | Easy | Data Export |
| 34 | How to handle large datasets efficiently? | Towards Data Science | Google, Amazon, Netflix, Meta | Hard | Performance, Memory Optimization |
| 35 | What is Categorical dtype and when to use it? | Pandas Docs | Google, Amazon, Meta | Medium | Data Types, Memory Optimization |
| 36 | How to optimize memory usage in Pandas? | Medium | Google, Amazon, Netflix | Hard | Memory Optimization |
| 37 | Difference between inplace=True and returning copy | Stack Overflow | Most Tech Companies | Easy | DataFrame Modification |
| 38 | How to use query() method for filtering? | Pandas Docs | Google, Amazon, Meta | Easy | Filtering, Query |
| 39 | How to work with string data (str accessor)? | Pandas Docs | Google, Amazon, Meta, Netflix | Medium | String Operations |
| 40 | How to use str accessor methods? | Pandas Docs | Amazon, Google, Microsoft | Medium | String Operations |
| 41 | How to split and expand string columns? | GeeksforGeeks | Amazon, Google, Meta | Medium | String Operations, Data Cleaning |
| 42 | How to use melt() for unpivoting data? | Pandas Docs | Google, Amazon, Meta | Medium | Reshaping, Unpivoting |
| 43 | How to use stack() and unstack()? | Pandas Docs | Google, Amazon, Meta | Medium | Reshaping, MultiIndex |
| 44 | How to cross-tabulate with crosstab()? | Pandas Docs | Google, Amazon, Meta | Medium | Cross Tabulation, Analysis |
| 45 | How to calculate correlations? | Pandas Docs | Google, Amazon, Meta, Netflix | Easy | Statistical Analysis |
| 46 | How to calculate descriptive statistics? | Pandas Docs | Most Tech Companies | Easy | Statistical Analysis |
| 47 | How to use agg() for multiple aggregations? | Pandas Docs | Google, Amazon, Meta, Netflix | Medium | Aggregation |
| 48 | How to use named aggregations? | Pandas Docs | Google, Amazon, Meta | Medium | GroupBy, Named Aggregation |
| 49 | How to handle timezone-aware datetime? | Pandas Docs | Google, Amazon, Netflix | Medium | DateTime, Timezones |
| 50 | How to interpolate missing values? | Pandas Docs | Google, Amazon, Netflix | Medium | Missing Data, Interpolation |
| 51 | How to forward fill and backward fill? | Pandas Docs | Amazon, Netflix, Google | Easy | Missing Data, Time Series |
| 52 | How to use where() and mask() methods? | Pandas Docs | Google, Amazon, Meta | Medium | Conditional Operations |
| 53 | How to clip values in DataFrame? | Pandas Docs | Amazon, Google, Meta | Easy | Data Transformation |
| 54 | How to rank values in Pandas? | Pandas Docs | Google, Amazon, Meta, Netflix | Easy | Ranking |
| 55 | How to calculate percentage change? | Pandas Docs | Google, Amazon, Netflix, Apple | Easy | Time Series, Finance |
| 56 | How to shift and lag data? | Pandas Docs | Google, Amazon, Netflix | Easy | Time Series, Lag Features |
| 57 | How to calculate cumulative statistics? | Pandas Docs | Google, Amazon, Meta, Netflix | Easy | Cumulative Operations |
| 58 | How to use explode() for list columns? | Pandas Docs | Google, Amazon, Meta | Medium | List Operations, Data Preprocessing |
| 59 | How to sample data from DataFrame? | Pandas Docs | Google, Amazon, Meta, Netflix | Easy | Sampling |
| 60 | How to detect and handle outliers? | Towards Data Science | Google, Amazon, Meta, Netflix | Medium | Outlier Detection, Data Cleaning |
| 61 | How to normalize/standardize data? | GeeksforGeeks | Google, Amazon, Meta, Microsoft | Medium | Feature Engineering, ML |
| 62 | How to use eval() for efficient operations? | Pandas Docs | Google, Amazon, Meta | Hard | Performance Optimization |
| 63 | How to perform element-wise operations? | Pandas Docs | Most Tech Companies | Easy | Arithmetic Operations |
| 64 | Why vectorized operations are faster than loops? | Real Python | Google, Amazon, Meta | Medium | Performance, Vectorization |
| 65 | How to profile Pandas code performance? | Pandas Docs | Google, Amazon, Netflix | Hard | Performance Profiling |
| 66 | How to use pipe() for method chaining? | Pandas Docs | Google, Amazon, Meta | Medium | Method Chaining |
| 67 | How to handle SettingWithCopyWarning? | Pandas Docs | Google, Amazon, Meta, Microsoft | Medium | Common Errors, Debugging |
| 68 | How to compare two DataFrames? | Pandas Docs | Amazon, Google, Microsoft | Medium | Data Comparison, Validation |
| 69 | How to combine DataFrames with different schemas? | Stack Overflow | Google, Amazon, Meta | Medium | Merging, Schema Alignment |
| 70 | How to create conditional columns? | GeeksforGeeks | Most Tech Companies | Easy | Data Transformation |
| 71 | How to use np.where() with Pandas? | Real Python | Google, Amazon, Meta, Netflix | Easy | Conditional Operations |
| 72 | How to use np.select() for multiple conditions? | Stack Overflow | Google, Amazon, Meta | Medium | Conditional Operations |
| 73 | How to count value frequencies? | Pandas Docs | Most Tech Companies | Easy | Data Exploration |
| 74 | How to find unique values and nunique()? | Pandas Docs | Most Tech Companies | Easy | Data Exploration |
| 75 | How to check for null values? | Pandas Docs | Most Tech Companies | Easy | Missing Data |
| 76 | How to use any() and all() methods? | Pandas Docs | Google, Amazon, Meta | Easy | Boolean Operations |
| 77 | How to select specific columns? | Pandas Docs | Most Tech Companies | Easy | Column Selection |
| 78 | How to drop columns or rows? | Pandas Docs | Most Tech Companies | Easy | Data Cleaning |
| 79 | How to use assign() for creating new columns? | Pandas Docs | Google, Amazon, Meta | Easy | Column Creation |
| 80 | How to use idxmax() and idxmin()? | Pandas Docs | Google, Amazon, Meta, Netflix | Easy | Indexing |
| 81 | Why is iterating over rows slow? | Stack Overflow | Google, Amazon, Meta | Medium | Performance |
| 82 | How to use iterrows() and itertuples()? | Pandas Docs | Amazon, Google, Microsoft | Easy | Iteration |
| 83 | How to vectorize custom functions? | Real Python | Google, Amazon, Meta | Hard | Performance Optimization |
| 84 | How to use Pandas with NumPy? | Pandas Docs | Google, Amazon, Meta, Netflix | Easy | NumPy Integration |
| 85 | How to flatten hierarchical index? | Stack Overflow | Google, Amazon, Meta | Medium | MultiIndex |
| 86 | How to group by multiple columns? | Pandas Docs | Most Tech Companies | Easy | GroupBy |
| 87 | How to filter groups after GroupBy? | Pandas Docs | Google, Amazon, Meta | Medium | GroupBy, Filtering |
| 88 | How to get first/last n rows per group? | Stack Overflow | Google, Amazon, Meta, Netflix | Medium | GroupBy |
| 89 | How to handle JSON with nested structures? | Pandas Docs | Amazon, Google, Meta | Medium | JSON Processing |
| 90 | How to read/write Parquet files? | Pandas Docs | Google, Amazon, Netflix, Meta | Easy | File I/O, Big Data |
| 91 | Difference between Parquet, CSV, and Feather | Towards Data Science | Google, Amazon, Netflix | Medium | File Formats, Performance |
| 92 | How to use chunksize for large files? | Pandas Docs | Google, Amazon, Netflix, Meta | Medium | Large Data Processing |
| 93 | How to use nsmallest() and nlargest()? | Pandas Docs | Google, Amazon, Meta | Easy | Selection |
| 94 | How to calculate weighted average? | Stack Overflow | Google, Amazon, Netflix, Apple | Medium | Aggregation, Finance |
| 95 | How to perform window functions like SQL? | Pandas Docs | Google, Amazon, Meta, Netflix | Medium | Window Functions |
| 96 | How to join on nearest key (asof join)? | Pandas Docs | Google, Amazon, Netflix, Apple | Hard | Joining, Time Series |
| 97 | How to use combine_first() for data merging? | Pandas Docs | Amazon, Google, Microsoft | Medium | Merging |
| 98 | How to create period indices? | Pandas Docs | Google, Amazon, Netflix | Medium | Time Series |
| 99 | How to use Timedelta for time differences? | Pandas Docs | Google, Amazon, Netflix | Easy | DateTime |
| 100 | How to set display options globally? | Pandas Docs | Most Tech Companies | Easy | Display Options |
| 101 | What is method chaining and when to use it? | Tom Augspurger Blog | Google, Amazon, Meta | Medium | Method Chaining, Clean Code |
| 102 | How to calculate month-over-month change? | StrataScratch | Google, Amazon, Meta, Netflix | Medium | Time Series, Analytics |
| 103 | How to find customers with highest orders? | DataLemur | Amazon, Google, Meta, Netflix | Medium | GroupBy, Aggregation |
| 104 | [HARD] How to calculate retention metrics efficiently? | StrataScratch | Meta, Netflix, Amazon, Google | Hard | Cohort Analysis, Time Series |
| 105 | [HARD] How to implement A/B test analysis? | Towards Data Science | Meta, Google, Netflix, Amazon | Hard | Statistical Analysis, Testing |
| 106 | [HARD] How to optimize memory with category types? | Pandas Docs | Google, Amazon, Netflix | Hard | Memory Optimization |
| 107 | [HARD] How to implement cohort analysis? | Towards Data Science | Meta, Netflix, Amazon, Google | Hard | Cohort Analysis |
| 108 | [HARD] How to calculate funnel drop-off rates? | StrataScratch | Meta, Google, Amazon, Netflix | Hard | Funnel Analysis, Analytics |
| 109 | [HARD] How to implement custom testing using assert_frame_equal? | Pandas Docs | Google, Amazon, Microsoft | Hard | Testing, Quality |
| 110 | [HARD] How to handle sparse data structures? | Pandas Docs | Google, Amazon, Netflix | Hard | Sparse Data, Memory |
| 111 | [HARD] How to use Numba/JIT with Pandas? | Pandas Docs | Google, Amazon, Hedge Funds | Hard | Performance |
| 112 | [HARD] How to implement custom accessors? | Pandas Docs | Google, Amazon, Meta | Hard | Extending Pandas |
| 113 | [HARD] How to use Swifter for parallel processing? | Swifter Docs | Google, Amazon, Uber | Hard | Parallelism |
| 114 | [HARD] Explain Pandas Block Manager structure | Pandas Wiki | Google, Amazon, Meta | Hard | Internals |
| 115 | [HARD] How Copy-on-Write (CoW) works in Pandas 2.0+? | Pandas Docs | Google, Meta, Microsoft | Hard | Internals, Performance |
| 116 | [HARD] How to use PyArrow backend for performance? | Pandas Docs | Google, Amazon, Databricks | Hard | Performance, Arrow |
| 117 | [HARD] How to implement custom index types? | Pandas Docs | Google, Amazon | Hard | Extending Pandas |
| 118 | [HARD] How to optimize MultiIndex slicing performance? | Pandas Docs | Google, Amazon, Hedge Funds | Hard | Optimization |
| 119 | [HARD] groupby().transform() internal mechanics vs apply() | Pandas Docs | Google, Amazon, Meta | Hard | Deep Dive |
| 120 | [HARD] How to implement rolling window with raw=True? | Pandas Docs | Google, Amazon, Hedge Funds | Hard | Optimization |
| 121 | [HARD] How to extend Pandas with custom plotting backends? | Pandas Docs | Google, Amazon | Hard | Extending Pandas |
| 122 | [HARD] How to handle time series offset aliases? | Pandas Docs | Google, Amazon, Hedge Funds | Hard | Time Series |
| 123 | [HARD] How to use Dask DataFrames for out-of-core computing? | Dask Docs | Google, Amazon, Netflix | Hard | Big Data |
| 124 | [HARD] How to optimize chained assignment performance? | Pandas Docs | Google, Amazon, Meta | Hard | Optimization |
| 125 | [HARD] Nullable integers/floats implementation? | Pandas Docs | Google, Amazon, Microsoft | Hard | Internals |
| 126 | [HARD] How to use Cython with Pandas? | Pandas Docs | Google, Amazon, HFT Firms | Hard | Performance |
| 127 | [HARD] Comparison of Parquet vs Feather vs ORC? | Apache Arrow | Google, Amazon, Netflix | Hard | Systems |
Code Examples
1. Memory Optimization
Difficulty: 🟢 Easy | Tags: Code Example | Asked by: Code Pattern
View Code Example
import pandas as pd
import numpy as np
# Typical large dataframe creation
df = pd.DataFrame({
'category': np.random.choice(['A', 'B', 'C'], size=1000000),
'value': np.random.randn(1000000)
})
# Memory usage before optimization
print(df.memory_usage(deep=True).sum() / 1024**2, "MB")
# Optimize by converting object to category
df['category'] = df['category'].astype('category')
# Memory usage after optimization
print(df.memory_usage(deep=True).sum() / 1024**2, "MB")
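Numeric columns can usually be shrunk further by downcasting; a small sketch (savings depend on the value range):
# float64 -> float32 (use downcast='integer' for integer columns)
df['value'] = pd.to_numeric(df['value'], downcast='float')
print(df.memory_usage(deep=True).sum() / 1024**2, "MB")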
2. Method Chaining for Clean Code
Difficulty: 🟢 Easy | Tags: Code Example | Asked by: Code Pattern
View Code Example
# Instead of multiple intermediate variables
df = (
pd.read_csv('data.csv')
.query('status == "active"')
.assign(
year=lambda x: pd.to_datetime(x['date']).dt.year,
total_cost=lambda x: x['price'] * x['quantity']
)
.groupby(['year', 'region'])
.agg(total_revenue=('total_cost', 'sum'))
.reset_index()
.sort_values('total_revenue', ascending=False)
)
3. Parallel Processing with Swifter
Difficulty: 🟢 Easy | Tags: Code Example | Asked by: Code Pattern
View Code Example
import pandas as pd
import swifter
df = pd.DataFrame({'text': ['some text'] * 100000})
def heavy_processing(text):
# Simulate heavy work
return text.upper()[::-1]
# Automatic parallelization
df['processed'] = df['text'].swifter.apply(heavy_processing)
Explain MultiIndex in Pandas - Google, Meta Interview Question
Difficulty: 🟡 Medium | Tags: MultiIndex, Hierarchical Data, Advanced Indexing | Asked by: Google, Meta, Amazon
View Answer
MultiIndex creates hierarchical row/column labels, enabling multi-dimensional data in 2D DataFrames. Essential for time series, grouped data, and pivot operations.
Complete Examples:
import pandas as pd
import numpy as np
# 1. Creating MultiIndex
# From tuples
index = pd.MultiIndex.from_tuples([
('USA', 'NY'), ('USA', 'CA'), ('UK', 'London'), ('UK', 'Manchester')
], names=['Country', 'City'])
df = pd.DataFrame({
'Population': [8_000_000, 4_000_000, 9_000_000, 500_000],
'GDP': [1500, 3000, 500, 80]
}, index=index)
print(df)
# 2. From product
idx = pd.MultiIndex.from_product(
[['2023', '2024'], ['Q1', 'Q2', 'Q3', 'Q4']],
names=['Year', 'Quarter']
)
sales_df = pd.DataFrame({
'Revenue': np.random.randint(100, 500, 8),
'Profit': np.random.randint(10, 100, 8)
}, index=idx)
# 3. Indexing with MultiIndex
# Access by level
print(df.loc['USA']) # All USA cities
print(df.loc[('USA', 'NY')]) # Specific city
# Cross-section (xs)
print(df.xs('USA', level='Country'))
print(df.xs('NY', level='City'))
# 4. Slicing MultiIndex (label slicing requires a lexsorted index)
df = df.sort_index()  # UK rows now come before USA rows
# Slice first level
print(df.loc['UK':'USA'])
# Slice with (Country, City) tuples
print(df.loc[('UK', 'London'):('USA', 'CA')])
# 5. Stack and Unstack
# Create sample data
data = pd.DataFrame({
'Year': [2023, 2023, 2024, 2024],
'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
'Revenue': [100, 150, 120, 180]
})
# Set multiindex
data_idx = data.set_index(['Year', 'Quarter'])
# Unstack: move index level to columns
unstacked = data_idx.unstack()
print(unstacked)
# Stack: move column level to index
stacked = unstacked.stack()
print(stacked)
# 6. GroupBy with MultiIndex
df_sales = pd.DataFrame({
'Region': ['North', 'North', 'South', 'South'] * 2,
'Product': ['A', 'B', 'A', 'B'] * 2,
'Quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q2'],
'Revenue': [100, 150, 120, 180, 110, 160, 130, 190]
})
# Create MultiIndex from groupby
grouped = df_sales.groupby(['Region', 'Product']).sum()
print(grouped)
# 7. Sorting MultiIndex
# Sort by all levels
df_sorted = df.sort_index()
# Sort by specific level
df_sorted = df.sort_index(level='City')
# Sort descending
df_sorted = df.sort_index(ascending=False)
# 8. Swapping Levels
# Swap index levels
df_swapped = df.swaplevel('Country', 'City')
print(df_swapped)
# 9. Resetting Index
# Remove MultiIndex
df_reset = df.reset_index()
print(df_reset)
# Keep one level
df_reset_partial = df.reset_index(level='City')
print(df_reset_partial)
# 10. Advanced Operations
# Sum across a level (df.sum(level=...) was removed in pandas 2.0; use groupby(level=...))
print(df.groupby(level='Country').sum())
# Mean across a level
print(df.groupby(level='City').mean())
# Apply function per level
result = df.groupby(level='Country').agg({
'Population': 'sum',
'GDP': 'mean'
})
print(result)
Creating MultiIndex:
| Method | Use Case | Example |
|---|---|---|
| from_tuples | Explicit labels | pd.MultiIndex.from_tuples([('A', 1)]) |
| from_product | Cartesian product | pd.MultiIndex.from_product([['A', 'B'], [1, 2]]) |
| from_arrays | Separate arrays | pd.MultiIndex.from_arrays([['A', 'B'], [1, 2]]) |
| set_index | From columns | df.set_index(['col1', 'col2']) |
Common Operations:
| Operation | Description | Example |
|---|---|---|
| loc | Access by labels | df.loc[('USA', 'NY')] |
| xs | Cross-section | df.xs('USA', level=0) |
| unstack | Level to columns | df.unstack() |
| stack | Columns to level | df.stack() |
| swaplevel | Swap index levels | df.swaplevel(0, 1) |
Interviewer's Insight
What they're testing: Advanced indexing and hierarchical data structures.
Strong answer signals:
- Knows from_tuples, from_product, set_index methods
- Can access specific levels with loc and xs()
- Understands stack/unstack transformations
- Mentions GroupBy creates MultiIndex naturally
- Uses sort_index(level=...) for sorting
- Applies to time series and pivot tables
Explain pd.merge() Parameters - Amazon, Microsoft Interview Question
Difficulty: 🟡 Medium | Tags: Merge, Join, Data Combination | Asked by: Amazon, Microsoft, Google, Meta
View Answer
pd.merge() combines DataFrames using SQL-style joins. Parameters: how (join type), on (keys), left_on/right_on, suffixes, indicator.
Complete Examples:
import pandas as pd
# Sample data
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'city': ['NYC', 'LA', 'Chicago', 'Boston']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103, 104, 105],
'customer_id': [1, 2, 2, 5, 1],
'amount': [100, 200, 150, 300, 50]
})
# 1. Inner Join (default)
inner = pd.merge(customers, orders, on='customer_id', how='inner')
print("Inner join:")
print(inner)
# Only matching records (customer_id 1, 2)
# 2. Left Join
left = pd.merge(customers, orders, on='customer_id', how='left')
print("\nLeft join:")
print(left)
# All customers, NaN for unmatched orders
# 3. Right Join
right = pd.merge(customers, orders, on='customer_id', how='right')
print("\nRight join:")
print(right)
# All orders, NaN for unmatched customers
# 4. Outer Join
outer = pd.merge(customers, orders, on='customer_id', how='outer')
print("\nOuter join:")
print(outer)
# All records, NaN where no match
# 5. Different Column Names
products = pd.DataFrame({
'product_id': [1, 2, 3],
'name': ['Widget', 'Gadget', 'Tool']
})
order_items = pd.DataFrame({
'order_id': [101, 102],
'prod_id': [1, 2],
'quantity': [5, 3]
})
merged = pd.merge(
products, order_items,
left_on='product_id',
right_on='prod_id',
how='inner'
)
print("\nDifferent column names:")
print(merged)
# 6. Multiple Keys
df1 = pd.DataFrame({
'year': [2023, 2023, 2024],
'quarter': ['Q1', 'Q2', 'Q1'],
'revenue': [100, 150, 120]
})
df2 = pd.DataFrame({
'year': [2023, 2023, 2024],
'quarter': ['Q1', 'Q2', 'Q2'],
'profit': [20, 30, 25]
})
multi_key = pd.merge(df1, df2, on=['year', 'quarter'], how='outer')
print("\nMultiple keys:")
print(multi_key)
# 7. Suffixes for Duplicate Columns
left_df = pd.DataFrame({
'key': ['A', 'B'],
'value': [1, 2],
'extra': [10, 20]
})
right_df = pd.DataFrame({
'key': ['A', 'C'],
'value': [3, 4],
'extra': [30, 40]
})
with_suffixes = pd.merge(
left_df, right_df,
on='key',
how='outer',
suffixes=('_left', '_right')
)
print("\nWith suffixes:")
print(with_suffixes)
# 8. Indicator Column
with_indicator = pd.merge(
customers, orders,
on='customer_id',
how='outer',
indicator=True
)
print("\nWith indicator:")
print(with_indicator)
# Shows source of each row
# Custom indicator name
custom_indicator = pd.merge(
customers, orders,
on='customer_id',
how='outer',
indicator='source'
)
# 9. Index-based Merge
left_idx = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
right_idx = pd.DataFrame({'B': [3, 4]}, index=['a', 'c'])
idx_merge = pd.merge(
left_idx, right_idx,
left_index=True,
right_index=True,
how='outer'
)
print("\nIndex merge:")
print(idx_merge)
# 10. Validate Merge
# Ensure one-to-one merge
try:
pd.merge(
customers, orders,
on='customer_id',
validate='one_to_one' # Will fail (one customer, many orders)
)
except Exception as e:
print(f"Validation error: {e}")
# Valid one-to-many
valid = pd.merge(
customers, orders,
on='customer_id',
validate='one_to_many' # OK
)
Join Types:
| how | SQL Equivalent | Description | Use Case |
|---|---|---|---|
| inner | INNER JOIN | Only matches | Common records |
| left | LEFT JOIN | All left + matches | Keep all left |
| right | RIGHT JOIN | All right + matches | Keep all right |
| outer | FULL OUTER JOIN | All records | Keep everything |
| cross | CROSS JOIN | Cartesian product | All combinations |
Key Parameters:
| Parameter | Description | Example |
|---|---|---|
| on | Common column(s) | on='customer_id' |
| left_on/right_on | Different names | left_on='id', right_on='cust_id' |
| how | Join type | how='left' |
| suffixes | Duplicate columns | suffixes=('_x', '_y') |
| indicator | Show merge source | indicator=True |
| validate | Check cardinality | validate='one_to_one' |
Interviewer's Insight
What they're testing: SQL join knowledge and data combination strategies.
Strong answer signals:
- Knows all join types: inner, left, right, outer
- Uses left_on/right_on for different column names
- Applies suffixes to avoid duplicate columns
- Mentions indicator=True for tracking merge source
- Knows validate parameter for cardinality checks
- Can merge on index with left_index/right_index
Explain apply() vs map() vs applymap() - Google, Netflix Interview Question
Difficulty: 🟡 Medium | Tags: apply, map, Performance | Asked by: Google, Netflix, Meta, Amazon
View Answer
map() for Series element-wise. apply() for Series/DataFrame row/column-wise. applymap() (deprecated) for DataFrame element-wise. Use vectorized operations when possible.
Complete Examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [10, 20, 30, 40],
'C': ['a', 'b', 'c', 'd']
})
# 1. Series.map() - Element-wise
# Map with dict
df['C_mapped'] = df['C'].map({'a': 'apple', 'b': 'banana', 'c': 'cherry'})
print(df)
# Map with function
df['A_squared'] = df['A'].map(lambda x: x ** 2)
# 2. Series.apply() - Element-wise (like map)
df['A_cubed'] = df['A'].apply(lambda x: x ** 3)
# More complex function
def categorize(value):
if value < 2:
return 'low'
elif value < 4:
return 'medium'
else:
return 'high'
df['category'] = df['A'].apply(categorize)
# 3. DataFrame.apply() - Row/Column-wise
# Apply to columns (axis=0, default)
col_sums = df[['A', 'B']].apply(sum)
print("Column sums:", col_sums)
# Apply to rows (axis=1)
row_sums = df[['A', 'B']].apply(sum, axis=1)
print("Row sums:", row_sums)
# Return Series from apply
def compute_stats(row):
return pd.Series({
'sum': row['A'] + row['B'],
'product': row['A'] * row['B']
})
stats = df.apply(compute_stats, axis=1)
print("\nStats:")
print(stats)
# 4. DataFrame.applymap() - Element-wise (deprecated)
# Use df.map() in pandas 2.1+
numeric_df = df[['A', 'B']]
# Old way (deprecated):
# result = numeric_df.applymap(lambda x: x * 2)
# New way:
result = numeric_df.map(lambda x: x * 2)
print("\nElement-wise operation:")
print(result)
# 5. Performance Comparison
import time
df_large = pd.DataFrame({
'A': np.random.randint(0, 100, 100000),
'B': np.random.randint(0, 100, 100000)
})
# Vectorized (fastest)
start = time.time()
result_vec = df_large['A'] * 2 + df_large['B']
vec_time = time.time() - start
print(f"Vectorized: {vec_time:.4f}s")
# apply (slower)
start = time.time()
result_apply = df_large.apply(lambda row: row['A'] * 2 + row['B'], axis=1)
apply_time = time.time() - start
print(f"apply(): {apply_time:.4f}s ({apply_time/vec_time:.0f}x slower)")
# 6. When to Use Each
# map: Series, simple transformations, dict mapping
status_map = {1: 'active', 2: 'inactive', 3: 'pending'}
df['status'] = df['A'].map(status_map)
# apply: Series, more complex functions, multiple outputs
def complex_transform(x):
return x ** 2 if x % 2 == 0 else x ** 3
df['complex'] = df['A'].apply(complex_transform)
# DataFrame.apply: row/column aggregations, multi-column operations
def calculate_score(row):
base = row['A'] * row['B']
bonus = 10 if row['A'] > 2 else 0
return base + bonus
df['score'] = df.apply(calculate_score, axis=1)
# 7. Vectorized Alternative (Best Performance)
# Instead of apply for simple operations:
# Slow:
# df['sum'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
# Fast:
df['sum'] = df['A'] + df['B']
# Instead of apply with conditions:
# Slow:
# df['flag'] = df['A'].apply(lambda x: 'high' if x > 2 else 'low')
# Fast:
df['flag'] = np.where(df['A'] > 2, 'high', 'low')
# 8. Result Type Parameter
df_nums = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
# Return DataFrame
result_df = df_nums.apply(lambda x: x * 2, result_type='expand')
print(result_df)
# Return list (broadcast) - the list length must match the number of columns, so apply across rows
result_broadcast = df_nums.apply(lambda x: [1, 2], axis=1, result_type='broadcast')
print(result_broadcast)
Method Comparison:
| Method | Scope | Axis | Use Case | Speed |
|---|---|---|---|---|
| Series.map() | Element | N/A | Dict mapping, simple transform | Medium |
| Series.apply() | Element | N/A | Complex functions | Medium |
| DataFrame.apply() | Row/Column | 0/1 | Multi-column, aggregations | Slow |
| DataFrame.map() | Element | N/A | Element-wise transform | Medium |
| Vectorized | All | N/A | Math operations | Fast |
Performance Tips:
| Slow | Fast | Speedup |
|---|---|---|
df.apply(lambda r: r['A']+r['B'], axis=1) | df['A'] + df['B'] | 100x+ |
df['A'].apply(lambda x: x*2) | df['A'] * 2 | 50x+ |
df['A'].apply(lambda x: 'Y' if x>5 else 'N') | np.where(df['A']>5, 'Y', 'N') | 20x+ |
Interviewer's Insight
What they're testing: Performance optimization and function application methods.
Strong answer signals:
- Distinguishes map() for Series vs apply() for both
- Knows applymap() is deprecated (use df.map())
- Understands axis=0 (columns) vs axis=1 (rows)
- Prefers vectorized operations (100x+ faster)
- Uses np.where() instead of apply for conditionals
- Only uses apply when vectorization impossible
Explain pd.cut() vs pd.qcut() - Amazon, Google Interview Question
Difficulty: 🟡 Medium | Tags: Binning, Discretization, Data Preparation | Asked by: Amazon, Google, Meta
View Answer
pd.cut() bins data into equal-width intervals. pd.qcut() bins into equal-frequency quantiles. Use cut for fixed ranges, qcut for balanced distribution.
Complete Examples:
import pandas as pd
import numpy as np
# Sample data
ages = np.array([18, 22, 25, 27, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80])
scores = np.array([45, 67, 89, 72, 81, 55, 93, 60, 77, 85, 92, 68, 74, 82, 90])
# 1. pd.cut() - Equal-width Bins
# Create 3 bins of equal width
age_bins = pd.cut(ages, bins=3)
print("cut() with 3 bins:")
print(age_bins)
print(age_bins.value_counts())
# 2. pd.cut() with Custom Labels
age_categories = pd.cut(
ages,
bins=[0, 30, 50, 100],
labels=['Young', 'Middle-aged', 'Senior']
)
print("\nCustom labels:")
print(age_categories)
# 3. pd.cut() with right=False
# Default: intervals are right-closed (a, b]
# With right=False: intervals are left-closed [a, b)
bins_left = pd.cut(ages, bins=3, right=False)
print("\nLeft-closed intervals:")
print(bins_left)
# 4. pd.qcut() - Equal-frequency Quantiles
# Create 4 quantiles (quartiles)
score_quartiles = pd.qcut(scores, q=4)
print("\nqcut() quartiles:")
print(score_quartiles)
print(score_quartiles.value_counts())
# 5. pd.qcut() with Custom Labels
score_categories = pd.qcut(
scores,
q=4,
labels=['Low', 'Medium', 'High', 'Very High']
)
print("\nQuantile labels:")
print(score_categories)
# 6. Specific Quantiles
# Deciles (10 groups)
deciles = pd.qcut(scores, q=10)
print("\nDeciles:")
print(deciles.value_counts())
# Custom quantiles
custom_quantiles = pd.qcut(
scores,
q=[0, 0.25, 0.5, 0.75, 1.0],
labels=['Q1', 'Q2', 'Q3', 'Q4']
)
# 7. Include Lowest
# Include lowest value in first bin
with_lowest = pd.cut(
ages,
bins=[18, 30, 50, 80],
include_lowest=True
)
print("\nInclude lowest:")
print(with_lowest)
# 8. Duplicate Edges Handling
# When data has duplicates, qcut may fail
data_with_dups = np.array([1, 1, 1, 1, 2, 3, 4, 5])
# This would fail:
# pd.qcut(data_with_dups, q=4)
# Use duplicates='drop'
safe_qcut = pd.qcut(data_with_dups, q=4, duplicates='drop')
print("\nWith duplicates:")
print(safe_qcut)
# 9. Return Bins
# Get bin edges
values, bin_edges = pd.cut(ages, bins=3, retbins=True)
print("\nBin edges:")
print(bin_edges)
# 10. Practical Example: Customer Segmentation
df = pd.DataFrame({
'customer_id': range(1, 16),
'age': ages,
'purchase_amount': scores * 10,
'frequency': np.random.randint(1, 20, 15)
})
# Age groups (equal width)
df['age_group'] = pd.cut(
df['age'],
bins=[0, 30, 50, 100],
labels=['Young', 'Middle', 'Senior']
)
# Spending tiers (equal frequency)
df['spending_tier'] = pd.qcut(
df['purchase_amount'],
q=3,
labels=['Low', 'Medium', 'High']
)
# Frequency quintiles
df['freq_quintile'] = pd.qcut(
df['frequency'],
q=5,
labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'],
duplicates='drop'
)
print("\nCustomer Segmentation:")
print(df)
# Analysis by segment
print("\nAverage purchase by age group:")
print(df.groupby('age_group')['purchase_amount'].mean())
print("\nCustomers per spending tier:")
print(df['spending_tier'].value_counts())
cut() vs qcut():
| Aspect | pd.cut() | pd.qcut() |
|---|---|---|
| Method | Equal-width intervals | Equal-frequency quantiles |
| Bins | Same size ranges | Same number of items |
| Use Case | Fixed ranges (e.g., age groups) | Balanced distribution |
| Example | 0-20, 20-40, 40-60 | Each bin has ~same count |
| Distribution | Uneven counts | Even counts |
Key Parameters:
| Parameter | cut() | qcut() | Description |
|---|---|---|---|
| bins / q | ✅ | ✅ | Number of bins / quantiles |
| labels | ✅ | ✅ | Custom category names |
| right | ✅ | ❌ | Right-closed intervals (default True) |
| include_lowest | ✅ | ❌ | Include smallest value |
| duplicates | ✅ | ✅ | Handle duplicate edges |
| retbins | ✅ | ✅ | Return bin edges |
Interviewer's Insight
What they're testing: Data discretization and binning strategies.
Strong answer signals:
- Knows cut() creates equal-width intervals
- Knows qcut() creates equal-frequency quantiles
- Uses cut for fixed ranges (age groups, scores)
- Uses qcut for balanced distributions (percentiles)
- Applies custom labels parameter
- Handles duplicates with duplicates='drop' in qcut
Explain Categorical Data Type - Meta, Netflix Interview Question
Difficulty: 🟡 Medium | Tags: Categorical, Memory Optimization, Performance | Asked by: Meta, Netflix, Google, Amazon
View Answer
Categorical dtype stores repeated string values as integers with mapping, reducing memory usage by 90%+ for low-cardinality columns. Essential for large datasets.
Complete Examples:
import pandas as pd
import numpy as np
# 1. Creating Categorical
# From list
colors = pd.Categorical(['red', 'blue', 'red', 'green', 'blue', 'red'])
print("Categorical:", colors)
print("Categories:", colors.categories)
print("Codes:", colors.codes)
# 2. DataFrame Column
df = pd.DataFrame({
'product': ['A', 'B', 'A', 'C', 'B', 'A'] * 10000,
'size': ['S', 'M', 'L', 'M', 'S', 'L'] * 10000
})
# Memory before
memory_before = df.memory_usage(deep=True).sum() / 1024**2
print(f"\nMemory before: {memory_before:.2f} MB")
# Convert to categorical
df['product'] = df['product'].astype('category')
df['size'] = df['size'].astype('category')
# Memory after
memory_after = df.memory_usage(deep=True).sum() / 1024**2
print(f"Memory after: {memory_after:.2f} MB")
print(f"Reduction: {(1 - memory_after/memory_before)*100:.1f}%")
# 3. Ordered Categories
sizes = pd.Categorical(
['S', 'M', 'L', 'M', 'S', 'XL'],
categories=['S', 'M', 'L', 'XL'],
ordered=True
)
print("\nOrdered categorical:")
print(sizes)
print("S < M:", sizes[0] < sizes[1])
# 4. Adding/Removing Categories
cat = pd.Categorical(['a', 'b', 'c'])
# Add category
cat = cat.add_categories(['d'])
print("\nAdded category:", cat.categories)
# Remove category
cat = cat.remove_categories(['d'])
print("Removed category:", cat.categories)
# Set categories
cat = cat.set_categories(['a', 'b', 'c', 'd', 'e'])
print("Set categories:", cat.categories)
# 5. Reordering Categories
df_survey = pd.DataFrame({
'satisfaction': ['Good', 'Bad', 'Excellent', 'Good', 'Bad', 'Fair']
})
df_survey['satisfaction'] = pd.Categorical(
df_survey['satisfaction'],
categories=['Bad', 'Fair', 'Good', 'Excellent'],
ordered=True
)
# Now can sort meaningfully
df_sorted = df_survey.sort_values('satisfaction')
print("\nSorted by category order:")
print(df_sorted)
# 6. Groupby with Categories
df_sales = pd.DataFrame({
'region': pd.Categorical(['North', 'South', 'North', 'East']),
'sales': [100, 200, 150, 180]
})
# Add unused category
df_sales['region'] = df_sales['region'].cat.add_categories(['West'])
# Groupby includes empty categories
grouped = df_sales.groupby('region', observed=False).sum()
print("\nGroupby with empty category:")
print(grouped)
# Exclude empty categories
grouped_observed = df_sales.groupby('region', observed=True).sum()
print("\nGroupby observed only:")
print(grouped_observed)
# 7. Renaming Categories
df_status = pd.DataFrame({
'status': pd.Categorical(['A', 'B', 'A', 'C'])
})
# Rename
df_status['status'] = df_status['status'].cat.rename_categories({
'A': 'Active',
'B': 'Blocked',
'C': 'Closed'
})
print("\nRenamed categories:")
print(df_status)
# 8. Performance Comparison
# String operations
df_str = pd.DataFrame({
'category': ['A', 'B', 'C'] * 100000
})
df_cat = df_str.copy()
df_cat['category'] = df_cat['category'].astype('category')
# Comparison speed
import time
start = time.time()
_ = df_str['category'] == 'A'
str_time = time.time() - start
start = time.time()
_ = df_cat['category'] == 'A'
cat_time = time.time() - start
print(f"\nString comparison: {str_time:.4f}s")
print(f"Categorical comparison: {cat_time:.4f}s")
print(f"Speedup: {str_time/cat_time:.1f}x")
# 9. When to Use Categorical
# Good: low cardinality (few unique values)
df_good = pd.DataFrame({
'country': ['USA', 'UK', 'Canada'] * 10000 # 3 unique
})
df_good['country'] = df_good['country'].astype('category')
# Bad: high cardinality (many unique values)
df_bad = pd.DataFrame({
'user_id': [f'user_{i}' for i in range(10000)] # All unique
})
# Don't convert to categorical - no benefit!
# 10. Practical Example: Survey Data
df_survey = pd.DataFrame({
'age_group': pd.Categorical(
['18-25', '26-35', '18-25', '36-45', '26-35'],
categories=['18-25', '26-35', '36-45', '46-55', '55+'],
ordered=True
),
'satisfaction': pd.Categorical(
['Good', 'Bad', 'Excellent', 'Good', 'Fair'],
categories=['Bad', 'Fair', 'Good', 'Excellent'],
ordered=True
),
'product': pd.Categorical(['A', 'B', 'A', 'C', 'B'])
})
# Efficient aggregation
result = df_survey.groupby(['age_group', 'product'], observed=False).size()
print("\nSurvey analysis:")
print(result)
Benefits:
| Benefit | Description | Example Savings |
|---|---|---|
| Memory | Stores as int codes | 90%+ reduction |
| Speed | Faster comparisons | 2-10x faster |
| Ordering | Meaningful sorts | Survey ratings |
| Groupby | Include empty categories | Complete reports |
When to Use:
| Use Categorical | Don't Use Categorical |
|---|---|
| Low cardinality (<50% unique) | High cardinality (90%+ unique) |
| Repeated values | Mostly unique values |
| Surveys, ratings | User IDs, timestamps |
| Country, department | Email addresses |
| Fixed set of values | Free-form text |
Key Operations:
| Operation | Method | Example |
|---|---|---|
| Create | astype('category') | df['col'].astype('category') |
| Add category | .cat.add_categories() | .cat.add_categories(['new']) |
| Remove | .cat.remove_categories() | .cat.remove_categories(['old']) |
| Rename | .cat.rename_categories() | .cat.rename_categories({'A': 'Active'}) |
| Reorder | .cat.reorder_categories() | .cat.reorder_categories(['S', 'M', 'L']) |
| Set ordered | .cat.as_ordered() | .cat.as_ordered() |
Interviewer's Insight
What they're testing: Memory optimization and categorical data handling.
Strong answer signals:
- Knows categorical stores strings as integer codes with mapping
- Mentions 90%+ memory reduction for low-cardinality columns
- Uses ordered=True for meaningful sorting (e.g., ratings)
- Applies when <50% unique values with repetition
- Avoids for high cardinality data (unique IDs)
- Uses .cat accessor for category management
- Sets observed=False to include empty categories in groupby
Explain Window Functions (Rolling, Expanding, EWMA) - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Window Functions, Time Series, Rolling Statistics | Asked by: Google, Amazon, Meta, Netflix
View Answer
Window functions compute statistics over sliding, expanding, or exponentially weighted windows. Essential for time series analysis, moving averages, and trend detection.
Complete Examples:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Sample time series data
dates = pd.date_range('2024-01-01', periods=100, freq='D')
df = pd.DataFrame({
'date': dates,
'sales': np.random.randint(80, 120, 100) + np.sin(np.arange(100) * 0.1) * 20,
'temperature': np.random.normal(70, 10, 100)
})
df.set_index('date', inplace=True)
# 1. Rolling Window (Moving Average)
# 7-day moving average
df['sales_ma7'] = df['sales'].rolling(window=7).mean()
# 30-day moving average
df['sales_ma30'] = df['sales'].rolling(window=30).mean()
print("Rolling means:")
print(df[['sales', 'sales_ma7', 'sales_ma30']].head(35))
# 2. Rolling with min_periods
# Calculate even with fewer than 7 values
df['sales_ma7_min'] = df['sales'].rolling(window=7, min_periods=1).mean()
# 3. Rolling Aggregations
df['sales_roll_std'] = df['sales'].rolling(window=7).std()
df['sales_roll_min'] = df['sales'].rolling(window=7).min()
df['sales_roll_max'] = df['sales'].rolling(window=7).max()
df['sales_roll_sum'] = df['sales'].rolling(window=7).sum()
# Multiple aggregations
rolling_stats = df['sales'].rolling(window=7).agg(['mean', 'std', 'min', 'max'])
print("\nRolling statistics:")
print(rolling_stats.head(10))
# 4. Expanding Window
# Cumulative average (all data up to current point)
df['sales_cumavg'] = df['sales'].expanding().mean()
# Cumulative sum
df['sales_cumsum'] = df['sales'].expanding().sum()
# Cumulative max
df['sales_cummax'] = df['sales'].expanding().max()
print("\nExpanding statistics:")
print(df[['sales', 'sales_cumavg', 'sales_cumsum', 'sales_cummax']].head(10))
# 5. Exponentially Weighted Moving Average (EWMA)
# More weight to recent values
df['sales_ewma'] = df['sales'].ewm(span=7).mean()
# Different smoothing parameters
df['sales_ewm_fast'] = df['sales'].ewm(span=3).mean() # Fast response
df['sales_ewm_slow'] = df['sales'].ewm(span=14).mean() # Slow response
# Alpha parameter (0-1)
df['sales_ewm_alpha'] = df['sales'].ewm(alpha=0.3).mean()
# 6. Center Parameter
# Center the window (for smoothing, not forecasting)
df['sales_centered'] = df['sales'].rolling(window=7, center=True).mean()
print("\nCentered vs non-centered:")
print(df[['sales', 'sales_ma7', 'sales_centered']].iloc[3:8])
# 7. Rolling with Custom Function
def range_ratio(x):
"""Range divided by mean."""
return (x.max() - x.min()) / x.mean() if x.mean() != 0 else 0
df['sales_volatility'] = df['sales'].rolling(window=7).apply(range_ratio)
# 8. Time-based Window
# 7-day window (calendar days, not data points)
df['sales_7d'] = df['sales'].rolling('7D').mean()
# 14-day window
df['sales_14d'] = df['sales'].rolling('14D').mean()
# 9. Shift and Diff
# Previous value
df['sales_prev'] = df['sales'].shift(1)
# Change from previous day
df['sales_change'] = df['sales'].diff()
# Percentage change
df['sales_pct_change'] = df['sales'].pct_change()
# Change from 7 days ago
df['sales_change_7d'] = df['sales'].diff(7)
print("\nChanges:")
print(df[['sales', 'sales_prev', 'sales_change', 'sales_pct_change']].head(10))
# 10. Practical Example: Anomaly Detection
# Bollinger Bands
window = 20
df['bb_mid'] = df['sales'].rolling(window=window).mean()
df['bb_std'] = df['sales'].rolling(window=window).std()
df['bb_upper'] = df['bb_mid'] + (df['bb_std'] * 2)
df['bb_lower'] = df['bb_mid'] - (df['bb_std'] * 2)
# Detect anomalies
df['anomaly'] = ((df['sales'] > df['bb_upper']) |
(df['sales'] < df['bb_lower']))
print("\nAnomalies detected:")
print(df[df['anomaly']][['sales', 'bb_upper', 'bb_lower']])
# 11. Multi-column Rolling
# Rolling correlation
df['corr_sales_temp'] = df['sales'].rolling(window=30).corr(df['temperature'])
# 12. GroupBy + Rolling
# Add categories
df['category'] = np.random.choice(['A', 'B'], len(df))
# Rolling average per category
df['sales_ma_by_cat'] = df.groupby('category')['sales'].rolling(window=7).mean().reset_index(0, drop=True)
Window Types:
| Type | Method | Use Case | Example |
|---|---|---|---|
| Rolling | .rolling(window) | Moving average | 7-day MA |
| Expanding | .expanding() | Cumulative stats | Running total |
| EWMA | .ewm(span/alpha) | Weighted average | Trend smoothing |
Common Rolling Operations:
| Operation | Method | Description |
|---|---|---|
| Mean | .rolling(n).mean() | Moving average |
| Sum | .rolling(n).sum() | Moving total |
| Std | .rolling(n).std() | Moving volatility |
| Min/Max | .rolling(n).min/max() | Moving range |
| Corr | .rolling(n).corr() | Moving correlation |
| Apply | .rolling(n).apply(func) | Custom function |
Key Parameters:
| Parameter | Description | Example |
|---|---|---|
| window | Window size | window=7 |
| min_periods | Min values needed | min_periods=1 |
| center | Center the window | center=True |
| span | EWMA span | span=7 |
| alpha | EWMA smoothing | alpha=0.3 |
Interviewer's Insight
- "rolling(): moving average over fixed window"
- "expanding(): cumulative statistics"
- "ewm(): exponentially weighted (recent values weighted more)"
- "min_periods: handle NaN at start"
- "center=True: for smoothing (not forecasting)"
- "Time-based: rolling('7D') for calendar days"
- "Shift: df['prev'] = df['col'].shift(1)"
- "Diff: df['change'] = df['col'].diff()"
- "Use for: moving averages, anomaly detection, trend analysis"
Explain DataFrame Indexing (.loc vs .iloc vs .at vs .iat) - Microsoft, Amazon Interview Question
Difficulty: 🟢 Easy | Tags: Indexing, DataFrame Access, Performance | Asked by: Microsoft, Amazon, Google, Meta
View Answer
.loc accesses by label, .iloc by integer position, .at/.iat for single values (faster). Use loc for label-based, iloc for position-based indexing.
Complete Examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50],
'C': ['a', 'b', 'c', 'd', 'e']
}, index=['row1', 'row2', 'row3', 'row4', 'row5'])
# 1. .loc - Label-based
# Single value
print(df.loc['row1', 'A']) # 1
# Row
print(df.loc['row2']) # Series
# Multiple rows
print(df.loc[['row1', 'row3']])
# Slice (inclusive)
print(df.loc['row1':'row3']) # Includes row3!
# Boolean indexing
print(df.loc[df['A'] > 2])
# Multiple columns
print(df.loc['row1', ['A', 'B']])
# 2. .iloc - Position-based
# Single value
print(df.iloc[0, 0]) # 1
# Row by position
print(df.iloc[1]) # Second row
# Multiple rows
print(df.iloc[[0, 2, 4]])
# Slice (exclusive end)
print(df.iloc[0:3]) # Rows 0, 1, 2 (not 3)
# Negative indexing
print(df.iloc[-1]) # Last row
# 2D slicing
print(df.iloc[1:3, 0:2]) # Rows 1-2, columns 0-1
# 3. .at - Fast single value (label)
value = df.at['row1', 'A']
print(value) # 1
# Set value
df.at['row1', 'A'] = 100
# 4. .iat - Fast single value (position)
value = df.iat[0, 0]
print(value) # 100
# Set value
df.iat[0, 0] = 1
# 5. Performance Comparison
import time
large_df = pd.DataFrame(np.random.rand(10000, 100))
# .loc timing
start = time.time()
for _ in range(1000):
_ = large_df.loc[500, 50]
loc_time = time.time() - start
# .at timing (faster)
start = time.time()
for _ in range(1000):
_ = large_df.at[500, 50]
at_time = time.time() - start
print(f".loc: {loc_time:.4f}s")
print(f".at: {at_time:.4f}s ({loc_time/at_time:.1f}x slower)")
# 6. Setting Values
# Set single value
df.loc['row1', 'A'] = 999
# Set multiple values
df.loc['row1', ['A', 'B']] = [1, 2]
# Set column
df.loc[:, 'A'] = 0
# Set with condition
df.loc[df['A'] > 2, 'B'] = 100
# 7. Adding Rows/Columns
# Add row
df.loc['row6'] = [6, 60, 'f']
# Add column
df.loc[:, 'D'] = [1, 2, 3, 4, 5, 6]
# 8. Common Pitfalls
# Chain indexing (bad!)
# df[df['A'] > 2]['B'] = 999 # SettingWithCopyWarning
# Correct way
df.loc[df['A'] > 2, 'B'] = 999
# 9. MultiIndex
df_multi = pd.DataFrame({
'value': [1, 2, 3, 4]
}, index=pd.MultiIndex.from_tuples([
('A', 1), ('A', 2), ('B', 1), ('B', 2)
]))
# Access with tuple
print(df_multi.loc[('A', 1)])
# Slice first level
print(df_multi.loc['A'])
# 10. Boolean + Label
# Combine conditions
result = df.loc[(df['A'] > 1) & (df['B'] < 50), ['A', 'C']]
print(result)
Comparison:
| Method | Indexing Type | Speed | Use Case |
|---|---|---|---|
| .loc | Label-based | Medium | General access by label |
| .iloc | Position-based | Medium | Access by position |
| .at | Label-based | Fast | Single value by label |
| .iat | Position-based | Fast | Single value by position |
Key Differences:
| Aspect | .loc | .iloc |
|---|---|---|
| Input | Labels | Integers |
| Slice | Inclusive end | Exclusive end |
| Example | df.loc['row1':'row3'] | df.iloc[0:3] |
| Boolean | ✅ Yes | ❌ No |
Interviewer's Insight
What they're testing: DataFrame indexing fundamentals and performance.
Strong answer signals:
- Knows loc uses labels, iloc uses integer positions
- Understands loc slicing is inclusive, iloc is exclusive (like Python slicing)
- Mentions at/iat are 10-100x faster for single scalar access
- Avoids chained indexing (df[col][row]); writes df.loc[row, col] instead
- Uses loc for boolean indexing and filtering
Explain pd.concat() vs pd.merge() - Amazon, Google Interview Question
Difficulty: 🟡 Medium | Tags: concat, merge, Data Combination | Asked by: Amazon, Google, Meta, Microsoft
View Answer
pd.concat() stacks DataFrames vertically/horizontally. pd.merge() combines based on keys (SQL joins). Use concat for stacking, merge for joining on columns.
Complete Examples:
import pandas as pd
# Sample data
df1 = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
df2 = pd.DataFrame({
'A': [7, 8, 9],
'B': [10, 11, 12]
})
df3 = pd.DataFrame({
'C': [13, 14, 15],
'D': [16, 17, 18]
})
# 1. concat() - Vertical (default axis=0)
vertical = pd.concat([df1, df2])
print("Vertical concat:")
print(vertical)
# Stacks rows, keeps all columns
# Reset index
vertical_reset = pd.concat([df1, df2], ignore_index=True)
print("\nWith reset index:")
print(vertical_reset)
# 2. concat() - Horizontal (axis=1)
horizontal = pd.concat([df1, df3], axis=1)
print("\nHorizontal concat:")
print(horizontal)
# Side-by-side, matches by index
# 3. concat() with Keys
keyed = pd.concat([df1, df2], keys=['first', 'second'])
print("\nWith keys:")
print(keyed)
# Creates MultiIndex
# 4. concat() with Missing Columns
df4 = pd.DataFrame({
'A': [1, 2],
'C': [3, 4] # Different column
})
mixed = pd.concat([df1, df4])
print("\nMissing columns (filled with NaN):")
print(mixed)
# Inner join (only common columns)
inner = pd.concat([df1, df4], join='inner')
print("\nInner join:")
print(inner)
# 5. merge() - Inner Join
customers = pd.DataFrame({
'customer_id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103],
'customer_id': [1, 2, 4],
'amount': [100, 200, 300]
})
inner_merge = pd.merge(customers, orders, on='customer_id')
print("\nInner merge:")
print(inner_merge)
# Only matching customer_ids (1, 2)
# 6. merge() - Outer Join
outer_merge = pd.merge(customers, orders, on='customer_id', how='outer')
print("\nOuter merge:")
print(outer_merge)
# All records, NaN for missing
# 7. When to Use Each
# concat: Same structure, stack them
jan = pd.DataFrame({'sales': [100, 200]})
feb = pd.DataFrame({'sales': [150, 250]})
mar = pd.DataFrame({'sales': [120, 220]})
quarterly = pd.concat([jan, feb, mar], ignore_index=True)
print("\nQuarterly sales (concat):")
print(quarterly)
# merge: Different tables, join on key
products = pd.DataFrame({
'product_id': [1, 2],
'name': ['Widget', 'Gadget']
})
sales = pd.DataFrame({
'product_id': [1, 2, 1],
'quantity': [5, 3, 2]
})
sales_detail = pd.merge(sales, products, on='product_id')
print("\nSales detail (merge):")
print(sales_detail)
# 8. concat Multiple DataFrames
dfs = [df1, df2, df1, df2]
result = pd.concat(dfs, ignore_index=True)
print("\nMultiple concat:")
print(result)
# 9. Performance: concat vs append
# append (deprecated) - slow in loop
# result = pd.DataFrame()
# for df in dfs:
# result = result.append(df) # Slow!
# concat - fast
result = pd.concat(dfs) # Much faster
# 10. Practical Example
# Monthly data files
data_2023_q1 = pd.DataFrame({
'date': pd.date_range('2023-01-01', periods=90, freq='D'),
'revenue': range(90)
})
data_2023_q2 = pd.DataFrame({
'date': pd.date_range('2023-04-01', periods=91, freq='D'),
'revenue': range(100, 191)
})
# Combine time series
full_year = pd.concat([data_2023_q1, data_2023_q2], ignore_index=True)
# Add metadata
metadata = pd.DataFrame({
'date': pd.date_range('2023-01-01', periods=181, freq='D'),
'holiday': [False] * 181
})
# Merge with metadata
complete = pd.merge(full_year, metadata, on='date', how='left')
print("\nComplete data:")
print(complete.head())
concat() vs merge():
| Aspect | concat() | merge() |
|---|---|---|
| Purpose | Stack DataFrames | Join on keys |
| Direction | Vertical/Horizontal | Based on keys |
| Use Case | Append rows, add columns | SQL-style joins |
| Key Required | No | Yes |
| Example | Monthly data → yearly | Customers + Orders |
concat() Parameters:
| Parameter | Description | Example |
|---|---|---|
| axis | 0=vertical, 1=horizontal | axis=0 |
| ignore_index | Reset index | ignore_index=True |
| keys | Create MultiIndex | keys=['a', 'b'] |
| join | 'outer' (default) or 'inner' | join='inner' |
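One pitfall worth calling out: concat keeps duplicate index labels by default, which can silently break later .loc lookups. A minimal sketch (toy frames assumed) of ignore_index and the verify_integrity guard:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [3, 4]}, index=[0, 1])
# Default: duplicate labels 0, 1, 0, 1 survive the stack
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())  # [0, 1, 0, 1]
# ignore_index=True rebuilds a clean RangeIndex
print(pd.concat([df1, df2], ignore_index=True).index.tolist())  # [0, 1, 2, 3]
# verify_integrity=True raises ValueError instead of silently keeping duplicates
# pd.concat([df1, df2], verify_integrity=True)  # ValueError: indexes have overlapping values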
Interview Insights:
Interviewer's Insight
- "concat: stack DataFrames (same structure)"
- "merge: join on keys (different tables)"
- "concat axis=0: vertical, axis=1: horizontal"
- "concat ignore_index: reset index"
- "merge for SQL-style joins"
- "concat join='inner': only common columns"
- "Use concat for: time series, appending rows"
- "Use merge for: combining related tables"
- "concat list of DFs: faster than repeated append"
Explain GroupBy Aggregation Functions - Google, Meta Interview Question
Difficulty: 🟡 Medium | Tags: GroupBy, Aggregation, Statistics | Asked by: Google, Meta, Netflix, Amazon
View Answer
GroupBy splits data into groups, applies functions, combines results. Supports multiple aggregations, custom functions, and named aggregations with agg().
Complete Examples:
import pandas as pd
import numpy as np
# Sample sales data
df = pd.DataFrame({
'region': ['North', 'North', 'South', 'South', 'East', 'East'] * 5,
'product': ['A', 'B', 'A', 'B', 'A', 'B'] * 5,
'sales': np.random.randint(100, 500, 30),
'quantity': np.random.randint(1, 20, 30),
'cost': np.random.randint(50, 200, 30)
})
# 1. Basic GroupBy
# Single aggregation
region_sales = df.groupby('region')['sales'].sum()
print("Sales by region:")
print(region_sales)
# Multiple columns
by_region_product = df.groupby(['region', 'product'])['sales'].sum()
print("\nSales by region and product:")
print(by_region_product)
# 2. Multiple Aggregations
# Same function on all columns
summary = df.groupby('region').sum()
print("\nSum by region:")
print(summary)
# Multiple functions
stats = df.groupby('region')['sales'].agg(['sum', 'mean', 'std', 'count'])
print("\nMultiple stats:")
print(stats)
# 3. Different Functions per Column
aggregations = df.groupby('region').agg({
'sales': 'sum',
'quantity': 'mean',
'cost': ['min', 'max']
})
print("\nDifferent functions per column:")
print(aggregations)
# 4. Named Aggregations (Clean Column Names)
result = df.groupby('region').agg(
total_sales=('sales', 'sum'),
avg_quantity=('quantity', 'mean'),
max_cost=('cost', 'max'),
sales_count=('sales', 'count')
)
print("\nNamed aggregations:")
print(result)
# 5. Custom Aggregation Functions
def range_func(x):
return x.max() - x.min()
def coefficient_of_variation(x):
return x.std() / x.mean() if x.mean() != 0 else 0
custom = df.groupby('region')['sales'].agg([
('total', 'sum'),
('average', 'mean'),
('range', range_func),
('cv', coefficient_of_variation)
])
print("\nCustom functions:")
print(custom)
# 6. Lambda Functions
result = df.groupby('region').agg({
'sales': [
('total', 'sum'),
('weighted_avg', lambda x: (x * df.loc[x.index, 'quantity']).sum() / df.loc[x.index, 'quantity'].sum())
]
})
print("\nLambda aggregation:")
print(result)
# 7. Transform (Keep Original Shape)
# Add group mean to each row
df['region_avg_sales'] = df.groupby('region')['sales'].transform('mean')
# Normalize within group
df['sales_normalized'] = df.groupby('region')['sales'].transform(
lambda x: (x - x.mean()) / x.std()
)
print("\nWith transform:")
print(df[['region', 'sales', 'region_avg_sales', 'sales_normalized']].head())
# 8. Filter Groups
# Keep only groups with more than 5 records
filtered = df.groupby('region').filter(lambda x: len(x) > 5)
print("\nFiltered groups:")
print(filtered.groupby('region').size())
# Keep groups with high average sales
high_sales = df.groupby('region').filter(lambda x: x['sales'].mean() > 250)
# 9. Apply (Most Flexible)
def top_products(group):
"""Get top 2 products by sales."""
return group.nlargest(2, 'sales')
top_per_region = df.groupby('region').apply(top_products)
print("\nTop products per region:")
print(top_per_region)
# 10. GroupBy with MultiIndex
multi_group = df.groupby(['region', 'product']).agg({
'sales': ['sum', 'mean'],
'quantity': 'sum'
})
print("\nMultiIndex groupby:")
print(multi_group)
# Flatten MultiIndex columns
multi_group.columns = ['_'.join(col).strip() for col in multi_group.columns.values]
print("\nFlattened columns:")
print(multi_group)
# 11. Cumulative Operations
# Cumulative sum within group
df['cumulative_sales'] = df.groupby('region')['sales'].cumsum()
# Ranking within group
df['sales_rank'] = df.groupby('region')['sales'].rank(ascending=False)
print("\nCumulative operations:")
print(df[['region', 'sales', 'cumulative_sales', 'sales_rank']].head(10))
# 12. Practical Example: Sales Report
report = df.groupby('region').agg(
total_revenue=('sales', 'sum'),
avg_sale=('sales', 'mean'),
total_units=('quantity', 'sum'),
num_transactions=('sales', 'count'),
max_sale=('sales', 'max'),
min_sale=('sales', 'min')
)
# Add calculated columns
report['revenue_per_unit'] = report['total_revenue'] / report['total_units']
report['avg_units_per_transaction'] = report['total_units'] / report['num_transactions']
print("\nSales report:")
print(report)
Common Aggregation Functions:
| Function | Description | Example |
|---|---|---|
| sum | Total | groupby().sum() |
| mean | Average | groupby().mean() |
| count | Count non-null | groupby().count() |
| size | Count all (including null) | groupby().size() |
| std | Standard deviation | groupby().std() |
| min/max | Min/Max | groupby().max() |
| first/last | First/Last value | groupby().first() |
| nunique | Unique count | groupby().nunique() |
GroupBy Methods:
| Method | Purpose | Returns |
|---|---|---|
| agg() | Apply aggregations | Aggregated data |
| transform() | Keep original shape | Same shape as input |
| filter() | Filter groups | Subset of data |
| apply() | Custom function | Custom result |
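Two groupby keyword arguments that often come up alongside agg() are as_index=False (flat, SQL-style output) and sort=False (skip sorting group keys). A minimal sketch on toy data:
import pandas as pd
df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South'],
    'sales': [100, 200, 150, 250]
})
# as_index=False keeps the group key as a regular column instead of the index
flat = df.groupby('region', as_index=False)['sales'].sum()
print(flat)
# sort=False preserves first-seen group order and can save time on large data
print(df.groupby('region', sort=False)['sales'].sum())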
Interview Insights:
Interviewer's Insight
- "groupby(): split-apply-combine pattern"
- "agg(): apply multiple functions"
- "Named agg:
total_sales=('sales', 'sum')" - "transform(): keep original shape"
- "filter(): keep/remove entire groups"
- "apply(): most flexible, returns custom structure"
- "Multiple grouping:
groupby(['col1', 'col2'])" - "Custom functions: pass callable to agg()"
- "Flatten MultiIndex:
'_'.join(col)"
Explain Pandas Query Method - Amazon, Netflix Interview Question
Difficulty: 🟢 Easy | Tags: Query, Filtering, Performance | Asked by: Amazon, Netflix, Google, Meta
View Answer
query() filters DataFrames using string expressions (SQL-like). Cleaner syntax than boolean indexing, supports variables with @, and can be faster for large DataFrames.
Complete Examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'age': [25, 30, 35, 40, 28],
'salary': [50000, 60000, 70000, 80000, 55000],
'department': ['Sales', 'IT', 'Sales', 'IT', 'HR'],
'experience': [2, 5, 8, 12, 3]
})
# 1. Basic Query
# Simple condition
result = df.query('age > 30')
print("Age > 30:")
print(result)
# Multiple conditions (AND)
result = df.query('age > 25 and salary > 55000')
print("\nAge > 25 AND salary > 55000:")
print(result)
# OR condition
result = df.query('department == "IT" or salary > 70000')
print("\nIT OR high salary:")
print(result)
# 2. Comparison with Boolean Indexing
# Boolean indexing (verbose)
result_bool = df[(df['age'] > 25) & (df['salary'] > 55000)]
# Query (cleaner)
result_query = df.query('age > 25 and salary > 55000')
# Both produce same result
print("\nResults equal:", result_bool.equals(result_query))
# 3. Using Variables with @
min_age = 30
min_salary = 60000
result = df.query('age > @min_age and salary > @min_salary')
print("\nWith variables:")
print(result)
# List membership
departments = ['IT', 'HR']
result = df.query('department in @departments')
print("\nDepartment in list:")
print(result)
# 4. String Operations
# String contains
result = df.query('name.str.contains("a", case=False)', engine='python')
print("\nName contains 'a':")
print(result)
# String startswith
result = df.query('name.str.startswith("D")', engine='python')
print("\nName starts with 'D':")
print(result)
# 5. Between
# Age between 25 and 35
result = df.query('25 <= age <= 35')
print("\nAge between 25 and 35:")
print(result)
# 6. Not Equal
result = df.query('department != "Sales"')
print("\nNot Sales:")
print(result)
# 7. Complex Expressions
# Calculate in query
result = df.query('salary / experience > 10000')
print("\nSalary per year experience > 10k:")
print(result)
# 8. Index Queries
df_indexed = df.set_index('name')
# Query index
result = df_indexed.query('index == "Alice"')
print("\nQuery index:")
print(result)
# 9. Performance Comparison
large_df = pd.DataFrame({
'A': np.random.randint(0, 100, 100000),
'B': np.random.randint(0, 100, 100000),
'C': np.random.randint(0, 100, 100000)
})
import time
# Boolean indexing
start = time.time()
_ = large_df[(large_df['A'] > 50) & (large_df['B'] < 30)]
bool_time = time.time() - start
# Query
start = time.time()
_ = large_df.query('A > 50 and B < 30')
query_time = time.time() - start
print(f"\nBoolean: {bool_time:.4f}s")
print(f"Query: {query_time:.4f}s")
# 10. Inplace Modification
df_copy = df.copy()
# Filter and reassign
df_filtered = df_copy.query('age > 30')
# query() also supports inplace=True to filter the DataFrame in place
df_copy.query('age > 30', inplace=True)
# 11. Backticks for Column Names with Spaces
df_spaces = pd.DataFrame({
'First Name': ['Alice', 'Bob'],
'Last Name': ['Smith', 'Jones'],
'Age': [25, 30]
})
result = df_spaces.query('`First Name` == "Alice"')
print("\nWith spaces in column names:")
print(result)
# 12. Practical Example: Data Analysis
sales_df = pd.DataFrame({
'product': ['A', 'B', 'C', 'A', 'B', 'C'] * 10,
'region': ['North', 'South'] * 30,
'sales': np.random.randint(100, 1000, 60),
'units': np.random.randint(1, 50, 60),
'quarter': ['Q1', 'Q2', 'Q3', 'Q4'] * 15
})
# Complex business query
target_sales = 500
target_region = 'North'
high_performers = sales_df.query(
'sales > @target_sales and region == @target_region and units > 20'
)
print("\nHigh performers:")
print(high_performers)
# Chained queries
result = (sales_df
.query('region == "North"')
.query('sales > 500')
.query('quarter in ["Q1", "Q2"]'))
print("\nChained queries:")
print(result)
query() vs Boolean Indexing:
| Aspect | query() | Boolean Indexing |
|---|---|---|
| Syntax | df.query('age > 30') | df[df['age'] > 30] |
| Readability | Cleaner, SQL-like | More verbose |
| Performance | Faster (large data) | Slower |
| Variables | Use @ | Direct |
| Complex | More readable | Harder to read |
Query Operators:
| Operator | Description | Example |
|---|---|---|
| ==, != | Equality | 'age == 30' |
| >, <, >=, <= | Comparison | 'salary > 50000' |
| and, or, not | Logical | 'age > 30 and salary < 60000' |
| in | Membership | 'dept in @departments' |
| @ | Variable reference | 'age > @min_age' |
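query() also understands not in, which pairs naturally with the @ variable syntax. A minimal sketch (sample frame assumed) showing the equivalence with ~isin():
import pandas as pd
df = pd.DataFrame({
    'department': ['Sales', 'IT', 'HR', 'Sales'],
    'salary': [50000, 70000, 55000, 60000]
})
exclude = ['HR', 'Sales']
via_query = df.query('department not in @exclude')   # query form
via_mask = df[~df['department'].isin(exclude)]       # boolean-indexing form
print(via_query.equals(via_mask))  # True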
Interview Insights:
Interviewer's Insight
- "query(): SQL-like string expressions"
- "Cleaner than boolean indexing"
- "Use @ for variables:
query('age > @min_age')" - "and/or instead of &/|"
- "Faster for large DataFrames (numexpr)"
- "Backticks for spaces:
`column name`" - "Can query index:
query('index == \"value\"')" - "engine='python' for string methods"
- "More readable for complex conditions"
Explain Pandas Memory Usage Optimization - Netflix, Google Interview Question
Difficulty: 🔴 Hard | Tags: Memory, Performance, Optimization | Asked by: Netflix, Google, Meta, Amazon
View Answer
Memory optimization reduces DataFrame memory footprint through: downcasting numeric types, categorical dtype for strings, sparse arrays, chunking, and efficient file formats (parquet).
Complete Examples:
import pandas as pd
import numpy as np
# 1. Check Memory Usage
df = pd.DataFrame({
'int_col': np.random.randint(0, 100, 1000000),
'float_col': np.random.rand(1000000),
'str_col': np.random.choice(['A', 'B', 'C'], 1000000),
'date_col': pd.date_range('2020-01-01', periods=1000000, freq='min')
})
# Memory before
memory_before = df.memory_usage(deep=True).sum() / 1024**2
print(f"Memory before: {memory_before:.2f} MB")
print("\nMemory by column:")
print(df.memory_usage(deep=True))
# 2. Downcast Numeric Types
# int64 → int8/int16/int32
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
# float64 → float32
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')
memory_after_numeric = df.memory_usage(deep=True).sum() / 1024**2
print(f"\nAfter numeric downcast: {memory_after_numeric:.2f} MB")
print(f"Saved: {memory_before - memory_after_numeric:.2f} MB")
# 3. Categorical for Low Cardinality Strings
df['str_col'] = df['str_col'].astype('category')
memory_after_cat = df.memory_usage(deep=True).sum() / 1024**2
print(f"After categorical: {memory_after_cat:.2f} MB")
print(f"Total saved: {(1 - memory_after_cat/memory_before)*100:.1f}%")
# 4. Automatic Optimization Function
def optimize_dataframe(df):
"""Automatically optimize DataFrame dtypes."""
optimized = df.copy()
for col in optimized.columns:
col_type = optimized[col].dtype
# Numeric columns
if col_type != 'object' and col_type.name != 'category':
c_min = optimized[col].min()
c_max = optimized[col].max()
# Integer
if str(col_type)[:3] == 'int':
if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
optimized[col] = optimized[col].astype(np.int8)
elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
optimized[col] = optimized[col].astype(np.int16)
elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
optimized[col] = optimized[col].astype(np.int32)
# Float
else:
if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
optimized[col] = optimized[col].astype(np.float32)
# String to categorical
elif col_type == 'object':
num_unique = optimized[col].nunique()
num_total = len(optimized[col])
if num_unique / num_total < 0.5: # Less than 50% unique
optimized[col] = optimized[col].astype('category')
return optimized
# Test optimization
df_test = pd.DataFrame({
'id': range(100000),
'category': np.random.choice(['A', 'B', 'C', 'D'], 100000),
'value': np.random.rand(100000) * 100
})
print(f"\nBefore optimization: {df_test.memory_usage(deep=True).sum()/1024**2:.2f} MB")
df_optimized = optimize_dataframe(df_test)
print(f"After optimization: {df_optimized.memory_usage(deep=True).sum()/1024**2:.2f} MB")
# 5. Sparse Arrays
# For data with many zeros/NaN
df_sparse = pd.DataFrame({
'A': pd.arrays.SparseArray([0, 0, 1, 0, 0, 0, 2, 0, 0, 0] * 100000)
})
# Regular array
df_dense = pd.DataFrame({
'A': [0, 0, 1, 0, 0, 0, 2, 0, 0, 0] * 100000
})
print(f"\nDense: {df_dense.memory_usage(deep=True).sum()/1024**2:.2f} MB")
print(f"Sparse: {df_sparse.memory_usage(deep=True).sum()/1024**2:.2f} MB")
# 6. Chunking for Large Files
chunk_size = 10000
# Process in chunks
def process_large_file(filename, chunksize=10000):
"""Process large CSV in chunks."""
results = []
for chunk in pd.read_csv(filename, chunksize=chunksize):
# Process chunk
processed = chunk[chunk['value'] > 50]
results.append(processed)
return pd.concat(results, ignore_index=True)
# 7. Efficient File Formats
# Save test data
df_test.to_csv('test.csv', index=False)
df_test.to_parquet('test.parquet')
import os
csv_size = os.path.getsize('test.csv') / 1024**2
parquet_size = os.path.getsize('test.parquet') / 1024**2
print(f"\nCSV size: {csv_size:.2f} MB")
print(f"Parquet size: {parquet_size:.2f} MB")
print(f"Parquet is {csv_size/parquet_size:.1f}x smaller")
# 8. Column Selection
# Read only needed columns
# df_subset = pd.read_csv('large_file.csv', usecols=['col1', 'col2'])
# 9. Date Parsing
# Don't parse dates if not needed
# df = pd.read_csv('file.csv', parse_dates=False)
# 10. Practical Example: Large Dataset
print("\n=== Optimization Summary ===")
# Original
df_original = pd.DataFrame({
'user_id': range(1000000),
'country': np.random.choice(['USA', 'UK', 'Canada', 'Germany'], 1000000),
'age': np.random.randint(18, 80, 1000000),
'score': np.random.rand(1000000) * 100,
'status': np.random.choice(['active', 'inactive'], 1000000)
})
mem_original = df_original.memory_usage(deep=True).sum() / 1024**2
print(f"Original: {mem_original:.2f} MB")
# Optimized
df_opt = df_original.copy()
df_opt['user_id'] = pd.to_numeric(df_opt['user_id'], downcast='integer')
df_opt['age'] = pd.to_numeric(df_opt['age'], downcast='integer')
df_opt['score'] = pd.to_numeric(df_opt['score'], downcast='float')
df_opt['country'] = df_opt['country'].astype('category')
df_opt['status'] = df_opt['status'].astype('category')
mem_opt = df_opt.memory_usage(deep=True).sum() / 1024**2
print(f"Optimized: {mem_opt:.2f} MB")
print(f"Reduction: {(1 - mem_opt/mem_original)*100:.1f}%")
# Cleanup
os.remove('test.csv')
os.remove('test.parquet')
Optimization Techniques:
| Technique | Memory Savings | Use Case |
|---|---|---|
| Downcast int64→int8 | 87.5% | Small integers |
| Downcast float64→float32 | 50% | Precision not critical |
| String→Categorical | 90%+ | Low cardinality |
| Sparse arrays | 90%+ | Many zeros/NaN |
| Parquet format | 50-90% | File storage |
| Chunking | No limit | Process huge files |
Data Type Sizes:
| Type | Size | Range |
|---|---|---|
| int8 | 1 byte | -128 to 127 |
| int16 | 2 bytes | -32,768 to 32,767 |
| int32 | 4 bytes | -2B to 2B |
| int64 | 8 bytes | -9E18 to 9E18 |
| float32 | 4 bytes | ~7 decimals |
| float64 | 8 bytes | ~15 decimals |
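Dtypes can also be declared at load time so the unoptimized int64/object columns are never materialized in the first place. A minimal sketch (it writes and removes a small throwaway CSV purely for illustration):
import os
import numpy as np
import pandas as pd
# Throwaway sample file so the sketch is self-contained
pd.DataFrame({
    'user_id': np.arange(1000),
    'country': np.random.choice(['USA', 'UK', 'DE'], 1000),
    'score': np.random.rand(1000),
}).to_csv('users_sample.csv', index=False)
# dtype + usecols: smaller columns from the start, and only the columns you need
df = pd.read_csv(
    'users_sample.csv',
    usecols=['user_id', 'country', 'score'],
    dtype={'user_id': 'int32', 'country': 'category', 'score': 'float32'},
)
print(df.dtypes)
print(f"{df.memory_usage(deep=True).sum() / 1024:.1f} KB")
os.remove('users_sample.csv')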
Interview Insights:
Interviewer's Insight
- "memory_usage(deep=True): check actual usage"
- "Downcast: int64βint8/16/32, float64βfloat32"
- "Categorical: 90%+ savings for low-cardinality strings"
- "Sparse arrays: for data with many zeros"
- "Chunking: process files larger than RAM"
- "Parquet: smaller, faster than CSV"
- "read_csv usecols: load only needed columns"
- "Automatic optimization: check ranges, convert types"
- "Rule: <50% unique β categorical"
Explain Pandas Eval Method - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: eval, Performance, Expression Evaluation | Asked by: Google, Amazon, Meta
View Answer
eval() evaluates string expressions efficiently using numexpr backend. Much faster than standard operations for large DataFrames, especially with complex arithmetic.
Complete Examples:
import pandas as pd
import numpy as np
import time
# Create large DataFrame
n = 1000000
df = pd.DataFrame({
'A': np.random.rand(n),
'B': np.random.rand(n),
'C': np.random.rand(n),
'D': np.random.rand(n)
})
# 1. Basic eval()
# Standard operation
result_standard = df['A'] + df['B'] * df['C'] - df['D']
# eval operation
result_eval = df.eval('A + B * C - D')
# Results are equal
print("Results equal:", np.allclose(result_standard, result_eval))
# 2. Performance Comparison
# Standard
start = time.time()
_ = df['A'] + df['B'] * df['C'] - df['D']
standard_time = time.time() - start
# eval
start = time.time()
_ = df.eval('A + B * C - D')
eval_time = time.time() - start
print(f"\nStandard: {standard_time:.4f}s")
print(f"eval: {eval_time:.4f}s")
print(f"Speedup: {standard_time/eval_time:.1f}x")
# 3. Column Assignment
# Create new column
df.eval('E = A + B', inplace=True)
print("\nNew column E:")
print(df[['A', 'B', 'E']].head())
# Multiple assignments
df.eval('''
F = A * 2
G = B + C
''', inplace=True)
# 4. Using Variables with @
multiplier = 2.5
threshold = 0.5
result = df.eval('A * @multiplier > @threshold')
print("\nWith variables:")
print(result.head())
# 5. Comparison Operations
# Boolean mask
mask = df.eval('A > 0.5 and B < 0.3')
filtered = df[mask]
print("\nFiltered rows:", len(filtered))
# Multiple conditions
complex_mask = df.eval('(A > 0.5 and B < 0.3) or (C > 0.8 and D < 0.2)')
# 6. Local vs Global Variables
local_var = 10
# Access local variable
df.eval('H = A * @local_var', inplace=True)
# 7. DataFrame.eval() vs pd.eval()
# DataFrame.eval
df_result = df.eval('A + B')
# pd.eval with multiple DataFrames
df2 = pd.DataFrame({
'X': np.random.rand(n),
'Y': np.random.rand(n)
})
# df.eval() can only reference its own columns (plus @variables)
# pd.eval() can combine multiple DataFrames from the calling namespace:
combined = pd.eval('df.A + df2.X')
# 8. Assignment Operators
# Add to existing column
df.eval('A = A + 1', inplace=True)
# Conditional assignment: eval() has no if/else, use np.where instead
df['I'] = np.where(df['A'] > 0.5, df['A'], df['B'])
# 9. Complex Expressions
# Multi-line expressions
df.eval('''
temp1 = A + B
temp2 = C * D
result = temp1 / temp2
''', inplace=True)
# 10. When eval() is Faster
# Good: Complex arithmetic
start = time.time()
_ = df.eval('A + B * C / D - (A * B) + (C / D)')
eval_fast = time.time() - start
start = time.time()
_ = df['A'] + df['B'] * df['C'] / df['D'] - (df['A'] * df['B']) + (df['C'] / df['D'])
standard_slow = time.time() - start
print(f"\nComplex arithmetic:")
print(f"eval: {eval_fast:.4f}s")
print(f"Standard: {standard_slow:.4f}s")
print(f"Speedup: {standard_slow/eval_fast:.1f}x")
# Bad: Simple operations, small data
small_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Not faster for small data
_ = small_df.eval('A + B')
_ = small_df['A'] + small_df['B'] # Similar speed
# 11. Supported Operations
print("\n=== Supported Operations ===")
df_ops = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
# Arithmetic
print("Arithmetic:", df_ops.eval('A + B * 2').tolist())
# Comparison
print("Comparison:", df_ops.eval('A > 1').tolist())
# Logical
print("Logical:", df_ops.eval('A > 1 and B < 6').tolist())
# 12. Limitations
# Can't use:
# - String methods: df.eval('A.str.upper()') # Error
# - Complex functions: df.eval('A.apply(func)') # Error
# - Method chaining
# Use standard operations instead for these
# 13. Practical Example
# Financial calculations
portfolio = pd.DataFrame({
'shares': np.random.randint(10, 1000, 10000),
'price': np.random.rand(10000) * 100,
'cost_basis': np.random.rand(10000) * 80,
'dividend': np.random.rand(10000) * 5
})
# Calculate portfolio metrics
portfolio.eval('''
market_value = shares * price
cost_value = shares * cost_basis
unrealized_gain = market_value - cost_value
gain_pct = (unrealized_gain / cost_value) * 100
annual_dividend = shares * dividend
yield_pct = (annual_dividend / market_value) * 100
''', inplace=True)
print("\nPortfolio metrics:")
print(portfolio[['market_value', 'unrealized_gain', 'gain_pct', 'yield_pct']].head())
eval() vs Standard:
| Aspect | eval() | Standard |
|---|---|---|
| Syntax | String expression | Python operations |
| Speed | Faster (large data) | Slower |
| Memory | More efficient | More allocations |
| Readability | SQL-like, cleaner | More verbose |
| Best for | Complex arithmetic | Simple operations |
When to Use eval():
| Use eval() | Don't Use eval() |
|---|---|
| Large DataFrames (>100k rows) | Small DataFrames |
| Complex arithmetic | Simple operations |
| Multiple operations | Single operation |
| Chained calculations | String methods |
| Memory constrained | Custom functions |
Supported Operations:
| Operation | Example |
|---|---|
| Arithmetic | +, -, *, /, **, % |
| Comparison | >, <, >=, <=, ==, != |
| Logical | and, or, not, &, |, ~ |
| Variables | @variable_name |
| Assignment | new_col = expression |
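Worth knowing the connection: query() is essentially eval() on a boolean expression followed by row selection. A minimal sketch (random data assumed):
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': np.random.rand(1000), 'B': np.random.rand(1000)})
via_query = df.query('A > 0.5 and B < 0.3')      # query directly
via_eval = df[df.eval('A > 0.5 and B < 0.3')]    # eval builds the mask, then index
print(via_query.equals(via_eval))  # True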
Interview Insights:
Interviewer's Insight
- "eval(): string expressions using numexpr"
- "Faster for large DataFrames (100k+ rows)"
- "Best for complex arithmetic: 2-4x speedup"
- "More memory efficient"
- "Use @ for variables:
eval('A * @multiplier')" - "inplace=True: modify DataFrame"
- "Limitations: no string methods, custom functions"
- "Not faster for small data or simple ops"
- "Cleaner syntax for complex calculations"
Explain Pandas Time Series Resampling - Meta, Netflix Interview Question
Difficulty: 🟡 Medium | Tags: Time Series, Resampling, Aggregation | Asked by: Meta, Netflix, Google, Amazon
View Answer
Resample() converts time series to different frequencies (upsampling/downsampling). Essential for time series aggregation, business reporting, and data alignment.
Complete Examples:
import pandas as pd
import numpy as np
# Create time series data
dates = pd.date_range('2024-01-01', periods=365, freq='D')
df = pd.DataFrame({
'date': dates,
'sales': np.random.randint(100, 500, 365) + np.sin(np.arange(365) * 0.1) * 50,
'customers': np.random.randint(10, 100, 365),
'temperature': 70 + np.sin(np.arange(365) * 0.017) * 20 + np.random.randn(365) * 5
}).set_index('date')
# 1. Downsampling (Higher → Lower Frequency)
# Daily → Weekly
weekly = df.resample('W').sum()
print("Weekly totals:")
print(weekly.head())
# Daily → Monthly
monthly = df.resample('M').sum()
print("\nMonthly totals:")
print(monthly.head())
# Daily → Quarterly
quarterly = df.resample('Q').sum()
print("\nQuarterly totals:")
print(quarterly)
# 2. Different Aggregations
# Multiple aggregations
monthly_stats = df.resample('M').agg({
'sales': ['sum', 'mean', 'max'],
'customers': 'sum',
'temperature': 'mean'
})
print("\nMonthly statistics:")
print(monthly_stats)
# 3. Upsampling (Lower → Higher Frequency)
# Create sparse data
weekly_data = pd.DataFrame({
'value': [100, 200, 150, 180]
}, index=pd.date_range('2024-01-01', periods=4, freq='W'))
# Fill forward
daily_ffill = weekly_data.resample('D').ffill()
print("\nUpsampled (forward fill):")
print(daily_ffill.head(10))
# Interpolate
daily_interp = weekly_data.resample('D').interpolate()
print("\nUpsampled (interpolate):")
print(daily_interp.head(10))
# 4. Common Frequency Codes
freq_examples = {
'D': 'Daily',
'W': 'Weekly',
'M': 'Month end',
'MS': 'Month start',
'Q': 'Quarter end',
'QS': 'Quarter start',
'Y': 'Year end',
'H': 'Hourly',
'min': 'Minute (alias: T)',
'S': 'Second'
}
# 5. Business Day Resampling
# Business days only (Mon-Fri)
business_weekly = df.resample('W-MON').sum() # Week ending Monday
print("\nWeekly (business days):")
print(business_weekly.head())
# Business month
business_monthly = df.resample('BM').sum() # Business month end
# 6. Offset Resampling
# 2-week periods
biweekly = df.resample('2W').sum()
print("\nBi-weekly totals:")
print(biweekly.head())
# 10-day periods
ten_day = df.resample('10D').sum()
# 7. Rolling with Resample
# Monthly average with rolling
monthly_rolling = df.resample('M').mean().rolling(window=3).mean()
print("\n3-month moving average:")
print(monthly_rolling.head(6))
# 8. Groupby with Resample
# Add category
df['region'] = np.random.choice(['North', 'South'], len(df))
# Resample per group
regional_monthly = df.groupby('region').resample('M').sum()
print("\nRegional monthly sales:")
print(regional_monthly.head())
# 9. Custom Aggregation
def custom_agg(x):
"""Custom aggregation function."""
return pd.Series({
'total': x.sum(),
'average': x.mean(),
'peak': x.max(),
'days': len(x)
})
custom_monthly = df['sales'].resample('M').apply(custom_agg)
print("\nCustom aggregation:")
print(custom_monthly.head())
# 10. Label and Closed Parameters
# Default: label='right', closed='right'
# Period labeled by right edge, closed on right
# Label by left edge
left_label = df.resample('W', label='left').sum()
# Closed on left
left_closed = df.resample('W', closed='left').sum()
# 11. Origin and Offset
# Start resampling from specific point
custom_origin = df.resample('10D', origin='2024-01-05').sum()
# 12. Practical Example: Business Reporting
sales_data = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=1000, freq='H'),
'revenue': np.random.randint(100, 1000, 1000),
'orders': np.random.randint(1, 20, 1000)
}).set_index('timestamp')
# Daily summary
daily_report = sales_data.resample('D').agg({
'revenue': ['sum', 'mean', 'max'],
'orders': ['sum', 'count']
})
# Add calculated metrics
daily_report['avg_order_value'] = (
daily_report[('revenue', 'sum')] / daily_report[('orders', 'sum')]
)
print("\nDaily business report:")
print(daily_report.head())
# Weekly summary with week-over-week growth
weekly_report = sales_data.resample('W').sum()
weekly_report['revenue_growth'] = weekly_report['revenue'].pct_change() * 100
weekly_report['orders_growth'] = weekly_report['orders'].pct_change() * 100
print("\nWeekly growth report:")
print(weekly_report.head())
# 13. Combining with Transform
# Add group statistics back to original data
df['monthly_avg'] = df.groupby(df.index.to_period('M'))['sales'].transform('mean')
print("\nWith monthly average:")
print(df[['sales', 'monthly_avg']].head(35))
Frequency Codes:
| Code | Meaning | Example |
|---|---|---|
| D | Day | Daily data |
| W | Week | Weekly summary |
| M | Month end | Monthly totals |
| MS | Month start | Start of month |
| Q | Quarter end | Quarterly reports |
| Y | Year end | Annual data |
| H | Hour | Hourly aggregation |
| T/min | Minute | Minute-level data |
| S | Second | Second-level data |
| B | Business day | Weekdays only |
Resampling Methods:
| Method | Description | Use Case |
|---|---|---|
| sum() | Total | Revenue, quantities |
| mean() | Average | Prices, temperatures |
| first() | First value | Opening price |
| last() | Last value | Closing price |
| max()/min() | Peak/Low | Daily highs/lows |
| count() | Count | Transactions |
| ffill() | Forward fill | Upsampling |
| interpolate() | Interpolation | Smooth upsampling |
Upsampling vs Downsampling:
| Type | Direction | Aggregation | Example |
|---|---|---|---|
| Downsampling | High → Low freq | Required | Daily → Monthly |
| Upsampling | Low → High freq | Fill/interpolate | Monthly → Daily |
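For price-like series, resample also offers ohlc(), which returns open/high/low/close per bucket. A minimal sketch with a synthetic hourly price series:
import numpy as np
import pandas as pd
prices = pd.Series(
    100 + np.random.randn(24 * 7).cumsum(),
    index=pd.date_range('2024-01-01', periods=24 * 7, freq='H'),
    name='price'
)
# One row per day with the opening, highest, lowest, and closing price
daily_ohlc = prices.resample('D').ohlc()
print(daily_ohlc.head())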
Interview Insights:
Interviewer's Insight
- "resample(): convert time series frequency"
- "Downsampling: higher β lower (daily β monthly)"
- "Upsampling: lower β higher (monthly β daily)"
- "Frequency codes: D, W, M, Q, Y, H"
- "Aggregation: sum, mean, max for downsampling"
- "Fill methods: ffill, bfill, interpolate for upsampling"
- "label: 'left' or 'right' edge labeling"
- "Business days: 'B', 'BM' for business calendars"
- "Combine with groupby for category-wise resampling"
Explain Pandas Pivot Tables - Microsoft, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Pivot Tables, Aggregation, Reshaping | Asked by: Microsoft, Amazon, Google, Meta
View Answer
pivot_table() summarizes data with row/column groupings and aggregations (like Excel). pivot() reshapes without aggregation. Use pivot_table for reports, pivot for simple reshaping.
Complete Examples:
import pandas as pd
import numpy as np
# Sample sales data
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=100, freq='D').repeat(3),
'region': np.tile(['North', 'South', 'East'], 100),
'product': np.tile(['A', 'B', 'C'], 100),
'sales': np.random.randint(100, 1000, 300),
'quantity': np.random.randint(1, 50, 300),
'profit': np.random.randint(10, 200, 300)
})
# 1. Basic pivot_table
# Sales by region and product
pivot = pd.pivot_table(
df,
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
print("Sales by region and product:")
print(pivot)
# 2. Multiple Aggregations
# Multiple functions
multi_agg = pd.pivot_table(
df,
values='sales',
index='region',
columns='product',
aggfunc=['sum', 'mean', 'count']
)
print("\nMultiple aggregations:")
print(multi_agg)
# 3. Multiple Values
# Different columns
multi_values = pd.pivot_table(
df,
values=['sales', 'profit'],
index='region',
columns='product',
aggfunc='sum'
)
print("\nMultiple values:")
print(multi_values)
# 4. Margins (Totals)
# Add row and column totals
with_totals = pd.pivot_table(
df,
values='sales',
index='region',
columns='product',
aggfunc='sum',
margins=True,
margins_name='Total'
)
print("\nWith totals:")
print(with_totals)
# 5. Fill Missing Values
# Replace NaN
filled = pd.pivot_table(
df,
values='sales',
index='region',
columns='product',
aggfunc='sum',
fill_value=0
)
print("\nFilled with zeros:")
print(filled)
# 6. Multiple Index Levels
# Hierarchical index
df['month'] = df['date'].dt.to_period('M')
multi_index = pd.pivot_table(
df,
values='sales',
index=['month', 'region'],
columns='product',
aggfunc='sum'
)
print("\nMulti-index:")
print(multi_index.head(10))
# 7. Custom Aggregation
def range_func(x):
return x.max() - x.min()
custom_agg = pd.pivot_table(
df,
values='sales',
index='region',
columns='product',
aggfunc=range_func
)
print("\nCustom aggregation (range):")
print(custom_agg)
# 8. pivot() vs pivot_table()
# Simple data (unique index/column combinations)
simple_df = pd.DataFrame({
'row': ['A', 'A', 'B', 'B'],
'col': ['X', 'Y', 'X', 'Y'],
'value': [1, 2, 3, 4]
})
# pivot (no aggregation, must be unique)
pivoted = simple_df.pivot(index='row', columns='col', values='value')
print("\npivot() result:")
print(pivoted)
# pivot_table (with aggregation, handles duplicates)
dup_df = pd.DataFrame({
'row': ['A', 'A', 'A', 'B'],
'col': ['X', 'X', 'Y', 'X'],
'value': [1, 2, 3, 4]
})
# pivot would fail (duplicates)
# pivoted = dup_df.pivot(index='row', columns='col', values='value') # Error!
# pivot_table handles it
pivoted_agg = dup_df.pivot_table(
index='row',
columns='col',
values='value',
aggfunc='sum'
)
print("\npivot_table() with duplicates:")
print(pivoted_agg)
# 9. Observed Parameter
# Add categorical column
df['category'] = pd.Categorical(
np.random.choice(['Cat1', 'Cat2', 'Cat3'], len(df)),
categories=['Cat1', 'Cat2', 'Cat3', 'Cat4'] # Cat4 unused
)
# Include unused categories
with_unused = pd.pivot_table(
df,
values='sales',
index='category',
aggfunc='sum',
observed=False
)
print("\nWith unused categories:")
print(with_unused)
# Exclude unused
observed_only = pd.pivot_table(
df,
values='sales',
index='category',
aggfunc='sum',
observed=True
)
print("\nObserved only:")
print(observed_only)
# 10. Practical Example: Business Report
sales_data = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=365, freq='D'),
'region': np.random.choice(['North', 'South', 'East', 'West'], 365),
'product_category': np.random.choice(['Electronics', 'Clothing', 'Food'], 365),
'revenue': np.random.randint(1000, 10000, 365),
'units_sold': np.random.randint(10, 200, 365)
})
# Add month and quarter
sales_data['month'] = sales_data['date'].dt.to_period('M')
sales_data['quarter'] = sales_data['date'].dt.to_period('Q')
# Quarterly report
quarterly_report = pd.pivot_table(
sales_data,
values=['revenue', 'units_sold'],
index='quarter',
columns='region',
aggfunc={
'revenue': 'sum',
'units_sold': 'sum'
},
margins=True,
fill_value=0
)
print("\nQuarterly business report:")
print(quarterly_report)
# Monthly category performance
category_report = pd.pivot_table(
sales_data,
values='revenue',
index='month',
columns='product_category',
aggfunc=['sum', 'mean', 'count'],
margins=True
)
print("\nMonthly category report:")
print(category_report.head(6))
pivot() vs pivot_table():
| Aspect | pivot() | pivot_table() |
|---|---|---|
| Aggregation | No | Yes |
| Duplicates | Must be unique | Handles duplicates |
| Use Case | Simple reshape | Summarization |
| Example | Wide to long | Excel pivot table |
| Parameters | index, columns, values | + aggfunc, margins |
Common Parameters:
| Parameter | Description | Example |
|---|---|---|
| values | Column(s) to aggregate | values='sales' |
| index | Row grouping | index='region' |
| columns | Column grouping | columns='product' |
| aggfunc | Aggregation function | aggfunc='sum' |
| fill_value | Replace NaN | fill_value=0 |
| margins | Add totals | margins=True |
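With multiple values or aggregation functions, pivot_table returns MultiIndex columns. A minimal sketch (random sample data assumed) of flattening them before exporting to CSV/Excel or a BI tool:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'region': np.random.choice(['North', 'South'], 100),
    'product': np.random.choice(['A', 'B'], 100),
    'sales': np.random.randint(100, 1000, 100),
    'profit': np.random.randint(10, 200, 100),
})
pivot = pd.pivot_table(df, values=['sales', 'profit'],
                       index='region', columns='product', aggfunc='sum')
# Columns look like ('sales', 'A'); join the levels into plain strings
pivot.columns = ['_'.join(col) for col in pivot.columns]
print(pivot.reset_index().head())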
Interview Insights:
Interviewer's Insight
- "pivot_table(): Excel-like summarization"
- "pivot(): reshape without aggregation"
- "index: row grouping, columns: column grouping"
- "aggfunc: 'sum', 'mean', 'count', or custom"
- "margins=True: add row/column totals"
- "fill_value=0: replace NaN"
- "pivot_table handles duplicates, pivot doesn't"
- "Multiple values: list in values parameter"
- "Use for: reports, summaries, cross-tabulation"
Explain pd.melt() - Google, Meta Interview Question
Difficulty: 🟡 Medium | Tags: melt, Reshaping, Wide to Long | Asked by: Google, Meta, Amazon
View Answer
melt() transforms wide format to long format (unpivots). Opposite of pivot. Essential for tidy data, plotting, and analysis requiring long format.
Complete Examples:
import pandas as pd
# Wide format data
df_wide = pd.DataFrame({
'student': ['Alice', 'Bob', 'Charlie'],
'math': [85, 92, 78],
'science': [88, 95, 82],
'english': [90, 87, 85]
})
print("Wide format:")
print(df_wide)
# 1. Basic melt
df_long = pd.melt(df_wide, id_vars=['student'])
print("\nLong format (basic):")
print(df_long)
# 2. Custom Column Names
df_long = pd.melt(
df_wide,
id_vars=['student'],
var_name='subject',
value_name='score'
)
print("\nWith custom names:")
print(df_long)
# 3. Select Specific Columns to Melt
df_long = pd.melt(
df_wide,
id_vars=['student'],
value_vars=['math', 'science'], # Only these columns
var_name='subject',
value_name='score'
)
print("\nSelected columns:")
print(df_long)
# 4. Multiple ID Variables
df_multi = pd.DataFrame({
'student': ['Alice', 'Bob', 'Charlie'],
'class': ['A', 'A', 'B'],
'math': [85, 92, 78],
'science': [88, 95, 82]
})
df_long = pd.melt(
df_multi,
id_vars=['student', 'class'],
var_name='subject',
value_name='score'
)
print("\nMultiple ID vars:")
print(df_long)
# 5. melt() + pivot() (Round Trip)
# Wide → Long
df_long = pd.melt(df_wide, id_vars=['student'])
# Long → Wide
df_back = df_long.pivot(
index='student',
columns='variable',
values='value'
).reset_index()
print("\nRound trip (original):")
print(df_wide)
print("\nRound trip (restored):")
print(df_back)
# 6. Real Example: Time Series
sales_wide = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Tool'],
'2023-Q1': [100, 150, 80],
'2023-Q2': [120, 160, 90],
'2023-Q3': [110, 170, 85],
'2023-Q4': [130, 180, 95]
})
sales_long = pd.melt(
sales_wide,
id_vars=['product'],
var_name='quarter',
value_name='sales'
)
print("\nTime series (long format):")
print(sales_long)
# Convert quarter to datetime
sales_long['quarter'] = pd.PeriodIndex(sales_long['quarter'], freq='Q').to_timestamp()
# 7. Handling Missing Values
df_missing = pd.DataFrame({
'id': [1, 2, 3],
'A': [10, None, 30],
'B': [40, 50, None]
})
# Default: keeps NaN
df_long_nan = pd.melt(df_missing, id_vars=['id'])
print("\nWith NaN:")
print(df_long_nan)
# Drop NaN
df_long_clean = pd.melt(df_missing, id_vars=['id']).dropna()
print("\nNaN dropped:")
print(df_long_clean)
# 8. Practical Example: Survey Data
survey = pd.DataFrame({
'respondent_id': [1, 2, 3],
'age': [25, 30, 35],
'q1_satisfaction': [4, 5, 3],
'q2_likelihood': [5, 4, 4],
'q3_quality': [4, 5, 5]
})
# Melt only question columns
survey_long = pd.melt(
survey,
id_vars=['respondent_id', 'age'],
value_vars=['q1_satisfaction', 'q2_likelihood', 'q3_quality'],
var_name='question',
value_name='response'
)
# Clean question names
survey_long['question'] = survey_long['question'].str.replace('q\\d+_', '', regex=True)
print("\nSurvey data (long):")
print(survey_long)
# 9. Performance with Large Data
# melt is efficient for large datasets
import numpy as np
large_df = pd.DataFrame({
'id': range(10000),
**{f'col_{i}': np.random.rand(10000) for i in range(50)}
})
import time
start = time.time()
large_melted = pd.melt(large_df, id_vars=['id'])
melt_time = time.time() - start
print(f"\nMelted 10k rows x 50 cols in {melt_time:.4f}s")
print(f"Result shape: {large_melted.shape}")
# 10. Use Cases Comparison
print("\n=== When to Use ===")
print("Wide format:")
print("- Human-readable reports")
print("- Pivot tables")
print("- Matrix operations")
print("\nLong format (melt):")
print("- Analysis in Pandas/SQL")
print("- Plotting with seaborn/plotly")
print("- GroupBy operations")
print("- Statistical modeling")
Wide vs Long Format:
| Format | Structure | Use Case | Example |
|---|---|---|---|
| Wide | Many columns | Reports, humans | Math: 85, Science: 88 |
| Long | Few columns, many rows | Analysis, computers | Subject: Math, Score: 85 |
melt() Parameters:
| Parameter | Description | Example |
|---|---|---|
| id_vars | Columns to keep | id_vars=['student'] |
| value_vars | Columns to melt | value_vars=['math', 'science'] |
| var_name | Name for variable column | var_name='subject' |
| value_name | Name for value column | value_name='score' |
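When the wide columns share a stub name, pd.wide_to_long() is an alternative to melt(). A minimal sketch (small sample frame assumed):
import pandas as pd
df = pd.DataFrame({
    'student': ['Alice', 'Bob'],
    'score_math': [85, 92],
    'score_science': [88, 95],
})
# stubnames='score' matches score_math / score_science; suffix captures the subject
long_df = pd.wide_to_long(
    df, stubnames='score', i='student', j='subject',
    sep='_', suffix=r'\w+'
).reset_index()
print(long_df)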
Interview Insights:
Interviewer's Insight
- "melt(): wide β long format (unpivot)"
- "id_vars: columns to keep fixed"
- "value_vars: columns to melt (optional, default all)"
- "var_name: name for melted column names"
- "value_name: name for values"
- "Opposite of pivot()/pivot_table()"
- "Use for: plotting, analysis, tidy data"
- "Long format better for: GroupBy, seaborn, SQL"
- "Round trip: melt() + pivot() = original"
Explain pd.crosstab() - Amazon, Microsoft Interview Question
Difficulty: 🟢 Easy | Tags: crosstab, Frequency Tables, Statistics | Asked by: Amazon, Microsoft, Google, Meta
View Answer
crosstab() computes cross-tabulation of two or more factors. Shows frequency distributions. Similar to pivot_table but specialized for counts and proportions.
Complete Examples:
import pandas as pd
import numpy as np
# Sample data
df = pd.DataFrame({
'gender': np.random.choice(['M', 'F'], 200),
'education': np.random.choice(['HS', 'Bachelor', 'Master', 'PhD'], 200),
'employed': np.random.choice(['Yes', 'No'], 200),
'age_group': np.random.choice(['18-25', '26-35', '36-50', '50+'], 200)
})
# 1. Basic Crosstab (Counts)
ct = pd.crosstab(df['gender'], df['education'])
print("Basic crosstab:")
print(ct)
# 2. With Margins (Totals)
ct_margins = pd.crosstab(
df['gender'],
df['education'],
margins=True,
margins_name='Total'
)
print("\nWith totals:")
print(ct_margins)
# 3. Normalize (Percentages)
# Normalize all (percentages of total)
ct_norm_all = pd.crosstab(
df['gender'],
df['education'],
normalize='all'
)
print("\nNormalized (all):")
print(ct_norm_all)
# Normalize by rows
ct_norm_index = pd.crosstab(
df['gender'],
df['education'],
normalize='index'
)
print("\nNormalized (by row):")
print(ct_norm_index)
# Normalize by columns
ct_norm_columns = pd.crosstab(
df['gender'],
df['education'],
normalize='columns'
)
print("\nNormalized (by column):")
print(ct_norm_columns)
# 4. Multiple Row/Column Variables
ct_multi = pd.crosstab(
[df['gender'], df['age_group']],
[df['education'], df['employed']]
)
print("\nMulti-level crosstab:")
print(ct_multi)
# 5. With Values (Like pivot_table)
df['income'] = np.random.randint(30000, 120000, len(df))
ct_values = pd.crosstab(
df['gender'],
df['education'],
values=df['income'],
aggfunc='mean'
)
print("\nWith values (mean income):")
print(ct_values)
# 6. Multiple Aggregations
ct_multi_agg = pd.crosstab(
df['gender'],
df['education'],
values=df['income'],
aggfunc=['mean', 'median', 'count']
)
print("\nMultiple aggregations:")
print(ct_multi_agg)
# 7. Dropna Parameter
df_with_nan = df.copy()
df_with_nan.loc[::10, 'education'] = None
# Include NaN
ct_with_nan = pd.crosstab(
df_with_nan['gender'],
df_with_nan['education'],
dropna=False
)
print("\nWith NaN:")
print(ct_with_nan)
# Exclude NaN
ct_no_nan = pd.crosstab(
df_with_nan['gender'],
df_with_nan['education'],
dropna=True
)
print("\nExcluding NaN:")
print(ct_no_nan)
# 8. Practical Example: A/B Test Analysis
ab_test = pd.DataFrame({
'variant': np.random.choice(['A', 'B'], 1000),
'converted': np.random.choice(['Yes', 'No'], 1000, p=[0.15, 0.85]),
'device': np.random.choice(['Mobile', 'Desktop'], 1000)
})
# Conversion rates
conversion_ct = pd.crosstab(
ab_test['variant'],
ab_test['converted'],
normalize='index'
)
print("\nA/B test conversion rates:")
print(conversion_ct)
# By device
device_ct = pd.crosstab(
[ab_test['variant'], ab_test['device']],
ab_test['converted'],
normalize='index',
margins=True
)
print("\nBy device:")
print(device_ct)
# 9. Chi-Square Test
# Statistical independence test
from scipy.stats import chi2_contingency
ct_test = pd.crosstab(df['gender'], df['employed'])
chi2, p_value, dof, expected = chi2_contingency(ct_test)
print(f"\nChi-square test:")
print(f"Chi2: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Independent: {p_value > 0.05}")
# 10. Comparison with pivot_table
# crosstab
ct = pd.crosstab(df['gender'], df['education'])
# Equivalent pivot_table
pt = df.pivot_table(
index='gender',
columns='education',
aggfunc='size',
fill_value=0
)
print("\ncrosstab == pivot_table:", ct.equals(pt))
crosstab() vs pivot_table():
| Aspect | crosstab() | pivot_table() |
|---|---|---|
| Purpose | Frequency counts | General aggregation |
| Default aggfunc | Count | Mean |
| Normalize | ✅ Built-in | ❌ Manual |
| Input | Arrays or Series | DataFrame columns |
| Use Case | Categorical analysis | Numeric summarization |
normalize Parameter:
| Value | Meaning | Use Case |
|---|---|---|
| 'all' | % of total | Overall distribution |
| 'index' | % of row total | Row-wise comparison |
| 'columns' | % of column total | Column-wise comparison |
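Unlike pivot_table, crosstab also accepts plain arrays, and the normalize results can be reproduced by hand. A minimal sketch (random arrays assumed):
import numpy as np
import pandas as pd
gender = np.random.choice(['M', 'F'], 100)
employed = np.random.choice(['Yes', 'No'], 100)
# rownames/colnames label the axes when the inputs are bare arrays
ct = pd.crosstab(gender, employed, rownames=['gender'], colnames=['employed'])
print(ct)
# Row percentages = counts divided by row totals (same as normalize='index')
print(ct.div(ct.sum(axis=1), axis=0))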
Interview Insights:
Interviewer's Insight
- "crosstab(): frequency table of categorical variables"
- "Default: counts, can aggregate with values/aggfunc"
- "normalize: 'all', 'index', 'columns' for percentages"
- "margins=True: add row/column totals"
- "Useful for: categorical analysis, A/B testing, chi-square"
- "vs pivot_table: specialized for counts"
- "Multiple variables: pass lists to index/columns"
- "Chi-square test: test independence of variables"
Explain pd.get_dummies() for One-Hot Encoding - Google, Netflix Interview Question
Difficulty: 🟢 Easy | Tags: One-Hot Encoding, Feature Engineering, ML Preprocessing | Asked by: Google, Netflix, Meta, Amazon
View Answer
get_dummies() converts categorical variables to binary indicator columns (one-hot encoding). Essential for machine learning with categorical features.
Complete Examples:
import pandas as pd
import numpy as np
# Sample data
df = pd.DataFrame({
'color': ['red', 'blue', 'green', 'red', 'blue'],
'size': ['S', 'M', 'L', 'M', 'S'],
'price': [10, 20, 15, 12, 18]
})
print("Original data:")
print(df)
# 1. Basic One-Hot Encoding
dummies = pd.get_dummies(df['color'])
print("\nOne-hot encoded:")
print(dummies)
# 2. Add Prefix
dummies = pd.get_dummies(df['color'], prefix='color')
print("\nWith prefix:")
print(dummies)
# 3. Encode Entire DataFrame
df_encoded = pd.get_dummies(df, columns=['color', 'size'])
print("\nFull DataFrame encoded:")
print(df_encoded)
# 4. Drop First (Avoid Multicollinearity)
# Drop one category to avoid dummy variable trap
dummies_drop = pd.get_dummies(
df,
columns=['color', 'size'],
drop_first=True
)
print("\nDrop first category:")
print(dummies_drop)
# 5. Custom Prefix for Multiple Columns
dummies_custom = pd.get_dummies(
df,
columns=['color', 'size'],
prefix=['col', 'sz']
)
print("\nCustom prefixes:")
print(dummies_custom)
# 6. Handle Unknown Categories
# Training data
train = pd.DataFrame({
'category': ['A', 'B', 'C', 'A', 'B']
})
train_encoded = pd.get_dummies(train, prefix='cat')
print("\nTraining encoded:")
print(train_encoded)
# Test data with new category 'D'
test = pd.DataFrame({
'category': ['A', 'B', 'D']
})
test_encoded = pd.get_dummies(test, prefix='cat')
print("\nTest encoded (has 'D'):")
print(test_encoded)
# Align columns
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print("\nTest aligned with training:")
print(test_aligned)
# 7. Dummy NA
# Include NaN as category
df_na = pd.DataFrame({
'color': ['red', 'blue', None, 'red', 'blue']
})
dummies_na = pd.get_dummies(df_na, dummy_na=True)
print("\nWith NaN as category:")
print(dummies_na)
# 8. Sparse Matrix (Memory Efficient)
# For high cardinality
df_large = pd.DataFrame({
'category': np.random.choice(list('ABCDEFGHIJ'), 10000)
})
# Dense (default)
dense = pd.get_dummies(df_large)
print(f"\nDense memory: {dense.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# Sparse
sparse = pd.get_dummies(df_large, sparse=True)
print(f"Sparse memory: {sparse.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# 9. Practical Example: ML Preprocessing
# Customer data
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'country': ['USA', 'UK', 'Canada', 'USA', 'UK'],
'membership': ['Gold', 'Silver', 'Bronze', 'Gold', 'Silver'],
'age': [25, 35, 45, 30, 40],
'purchases': [10, 5, 8, 12, 6]
})
# Encode categorical, keep numeric
customers_encoded = pd.get_dummies(
customers,
columns=['country', 'membership'],
drop_first=True, # Avoid multicollinearity
prefix=['country', 'tier']
)
print("\nML-ready data:")
print(customers_encoded)
# 10. Comparison with Categorical
# get_dummies: expands to multiple columns
encoded = pd.get_dummies(df['color'])
print("\nget_dummies (wide):")
print(encoded)
# Categorical: single column with integer codes
categorical = df['color'].astype('category')
print("\nCategorical (compact):")
print(categorical)
print("Codes:", categorical.cat.codes.tolist())
get_dummies() Parameters:
| Parameter | Description | Example |
|---|---|---|
| data | DataFrame or Series | pd.get_dummies(df) |
| columns | Columns to encode | columns=['color'] |
| prefix | Column name prefix | prefix='cat' |
| drop_first | Avoid multicollinearity | drop_first=True |
| dummy_na | Include NaN as category | dummy_na=True |
| sparse | Use sparse matrix | sparse=True |
When to drop_first:
| Use Case | drop_first | Reason |
|---|---|---|
| Linear models | ✅ True | Avoid multicollinearity |
| Tree models | ❌ False | Can handle all dummies |
| Neural networks | ❌ False | Usually fine |
| Interpretation | ❌ False | Easier to understand |
get_dummies() vs Categorical:
| Aspect | get_dummies() | Categorical |
|---|---|---|
| Format | Multiple columns | Single column |
| Storage | More space | Less space |
| ML | Ready to use | Need encoding |
| Interpretability | Clear | Needs mapping |
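Newer pandas versions (roughly 1.5+) can also invert one-hot encoding with pd.from_dummies(). A minimal sketch (small sample frame assumed):
import pandas as pd
colors = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})
dummies = pd.get_dummies(colors, prefix='color')
# sep must match the separator used in the dummy column names ('color_red', ...)
recovered = pd.from_dummies(dummies, sep='_')
print(recovered)
# Note: if drop_first=True was used, pass default_category so all-zero rows map back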
Interview Insights:
Interviewer's Insight
- "get_dummies(): one-hot encoding (categorical β binary)"
- "Creates column per category with 0/1"
- "prefix: add meaningful column names"
- "drop_first=True: avoid dummy variable trap (linear models)"
- "dummy_na=True: treat NaN as separate category"
- "sparse=True: save memory for high cardinality"
- "Align train/test: use reindex() for new categories"
- "vs LabelEncoder: ordinal encoding (single column)"
- "Use for: ML models that need numeric input"
Explain SettingWithCopyWarning - Meta, Google Interview Question
Difficulty: 🟡 Medium | Tags: Warning, DataFrame Copies, Best Practices | Asked by: Meta, Google, Amazon, Netflix
View Answer
SettingWithCopyWarning occurs when modifying a DataFrame slice that might be a view or copy. Use .loc for assignment and .copy() when needed to avoid ambiguity.
Complete Examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50],
'C': ['a', 'b', 'c', 'd', 'e']
})
# 1. The Problem: Chained Indexing
# BAD: This triggers warning
# df[df['A'] > 2]['B'] = 999 # SettingWithCopyWarning!
# GOOD: Use .loc
df.loc[df['A'] > 2, 'B'] = 999
print("Correct assignment:")
print(df)
# Reset for more examples
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50],
'C': ['a', 'b', 'c', 'd', 'e']
})
# 2. View vs Copy
# Sometimes slice is a view, sometimes a copy
# Pandas can't always tell which
# This might be a view
subset1 = df[df['A'] > 2]
# This might be a copy
subset2 = df[['A', 'B']]
# Modifying triggers warning (is it view or copy?)
# subset1['B'] = 999 # Warning!
# 3. Explicit Copy
# Make explicit copy to avoid warning
subset_copy = df[df['A'] > 2].copy()
subset_copy['B'] = 999 # No warning
print("\nExplicit copy:")
print(subset_copy)
print("\nOriginal unchanged:")
print(df)
# 4. Using .loc (Recommended)
# Always use .loc for assignment
df_correct = df.copy()
# Single condition
df_correct.loc[df_correct['A'] > 2, 'B'] = 999
# Multiple conditions
df_correct.loc[(df_correct['A'] > 2) & (df_correct['B'] < 50), 'C'] = 'updated'
print("\nUsing .loc:")
print(df_correct)
# 5. Function Modifications
# BAD: Modifying without copy
def process_bad(data):
# If data is a slice, this warns
data['new_col'] = data['A'] * 2
return data
# GOOD: Explicit copy
def process_good(data):
data = data.copy()
data['new_col'] = data['A'] * 2
return data
subset = df[df['A'] > 2]
# result_bad = process_bad(subset) # Warning
result_good = process_good(subset) # No warning
print("\nProcessed with copy:")
print(result_good)
# 6. When to Copy
print("\n=== When to Use .copy() ===")
# Slice you'll modify
working_df = df[df['A'] > 2].copy()
working_df['B'] = 100 # Safe
# Function that modifies
def transform(data):
data = data.copy() # Safe
data['transformed'] = data['A'] ** 2
return data
# 7. Detecting Views vs Copies
# Check if shares memory
df_original = pd.DataFrame({'A': [1, 2, 3]})
# View (shares memory)
view = df_original['A']
print(f"\nView shares memory: {np.shares_memory(df_original['A'].values, view.values)}")
# Copy (doesn't share)
copy = df_original['A'].copy()
print(f"Copy shares memory: {np.shares_memory(df_original['A'].values, copy.values)}")
# 8. Common Patterns
print("\n=== Best Practices ===")
# Pattern 1: Filtering then modifying
# BAD:
# filtered = df[df['A'] > 2]
# filtered['B'] = 100 # Warning
# GOOD:
filtered = df[df['A'] > 2].copy()
filtered['B'] = 100 # No warning
# Pattern 2: Direct modification
# BEST (if you want to modify original):
df.loc[df['A'] > 2, 'B'] = 100
# Pattern 3: Function returning modified data
def clean_data(df):
"""Always copy at start of function."""
df = df.copy()
df['cleaned'] = df['A'] * 2
return df
# 9. Disabling Warning (Not Recommended)
# Only for debugging
# pd.options.mode.chained_assignment = None # Disable
# pd.options.mode.chained_assignment = 'warn' # Default
# pd.options.mode.chained_assignment = 'raise' # Raise error
# 10. Real-World Example
sales_df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D', 'E'],
'sales': [100, 200, 150, 300, 250],
'region': ['North', 'South', 'North', 'South', 'North']
})
# Task: Add 10% bonus to North region
# BAD:
# north = sales_df[sales_df['region'] == 'North']
# north['sales'] = north['sales'] * 1.1 # Warning!
# GOOD Option 1: Use .loc on original
sales_df.loc[sales_df['region'] == 'North', 'sales'] *= 1.1
print("\nSales with bonus:")
print(sales_df)
# GOOD Option 2: Explicit copy if you need subset
north_copy = sales_df[sales_df['region'] == 'North'].copy()
north_copy['sales'] *= 1.1
print("\nNorth region (copy):")
print(north_copy)
Common Causes:
| Pattern | Problem | Solution |
|---|---|---|
df[cond][col] = val | Chained indexing | df.loc[cond, col] = val |
subset['col'] = val | Unclear if view/copy | subset = df[...].copy() |
func(df[...]) | Function modifies slice | Copy in function |
Best Practices:
| Scenario | Recommended Approach |
|---|---|
| Modify original | df.loc[condition, column] = value |
| Modify subset | subset = df[...].copy() |
| Function | Copy at start: df = df.copy() |
| Read-only | No copy needed |
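pandas 2.x also ships an opt-in Copy-on-Write mode (planned as the default in pandas 3.0) that removes the view-vs-copy ambiguity behind this warning: every selection behaves like a copy, and writes never propagate back by accident. A minimal sketch:
import pandas as pd
pd.options.mode.copy_on_write = True  # opt in (pandas >= 2.0)
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
subset = df[df['A'] > 1]
subset['B'] = 999   # under CoW this only changes subset, with no warning
print(df)           # original is untouched
print(subset)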
Interview Insights:
Interviewer's Insight
- "SettingWithCopyWarning: ambiguous view vs copy"
- "Chained indexing bad:
df[cond][col] = val" - "Use .loc:
df.loc[cond, col] = val" - "Explicit copy when needed:
.copy()" - "View: shares memory, copy: independent"
- "Always copy at function start if modifying"
- "Warning helps avoid silent bugs"
- "Never disable warning without understanding"
- "Rule: if modifying slice, use .loc or .copy()"
Explain Pandas String Methods (.str accessor) - Amazon, Microsoft Interview Question
Difficulty: π’ Easy | Tags: String Methods, Text Processing, Data Cleaning | Asked by: Amazon, Microsoft, Google, Meta
View Answer
.str accessor provides vectorized string operations on Series. Essential for text cleaning, pattern matching, and feature extraction. Much faster than Python loops.
Complete Examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'name': [' John Doe ', 'jane smith', 'BOB WILSON', 'Alice Johnson'],
'email': ['john.doe@example.com', 'jane.smith@test.org', 'bob.wilson@example.com', 'alice.johnson@company.net'],
'phone': ['123-456-7890', '(555) 123-4567', '555.123.4567', '1234567890'],
'address': ['123 Main St, NYC, NY', '456 Oak Ave, LA, CA', '789 Elm Rd, Austin, TX', '321 Pine Dr, Miami, FL']
})
# 1. Basic String Operations
# Lower/upper case
df['name_lower'] = df['name'].str.lower()
df['name_upper'] = df['name'].str.upper()
# Title case
df['name_title'] = df['name'].str.title()
# Strip whitespace
df['name_clean'] = df['name'].str.strip()
print("Case conversions:")
print(df[['name', 'name_lower', 'name_title', 'name_clean']].head())
# 2. Contains / Startswith / Endswith
# Check if contains pattern
df['has_john'] = df['name'].str.lower().str.contains('john')
# Starts with
df['starts_with_j'] = df['name'].str.lower().str.startswith('j')
# Ends with
df['example_domain'] = df['email'].str.endswith('@example.com')
print("\nPattern matching:")
print(df[['name', 'has_john', 'starts_with_j']])
# 3. Replace
# Simple replace
df['email_masked'] = df['email'].str.replace('@', '[at]')
# Regex replace
df['phone_clean'] = df['phone'].str.replace(r'[\(\)\-\.\s]', '', regex=True)
print("\nReplace:")
print(df[['email', 'email_masked', 'phone', 'phone_clean']])
# 4. Extract with Regex
# Extract domain from email
df['domain'] = df['email'].str.extract(r'@(\w+\.\w+)')
# Extract area code
df['area_code'] = df['phone_clean'].str.extract(r'^(\d{3})')
# Multiple groups
df[['first_name', 'last_name']] = df['name_clean'].str.extract(r'(\w+)\s+(\w+)')
print("\nExtract:")
print(df[['email', 'domain', 'first_name', 'last_name']])
# 5. Split
# Split on space
name_parts = df['name_clean'].str.split()
print("\nSplit (list):")
print(name_parts)
# Expand to columns
df[['first', 'last']] = df['name_clean'].str.split(n=1, expand=True)
# Split address
df[['street', 'city', 'state']] = df['address'].str.split(',', expand=True)
print("\nSplit (expanded):")
print(df[['first', 'last', 'city']])
# 6. Slice
# First 3 characters
df['name_abbr'] = df['name_clean'].str[:3]
# Last 4 of phone
df['phone_last4'] = df['phone_clean'].str[-4:]
print("\nSlice:")
print(df[['name_clean', 'name_abbr', 'phone_last4']])
# 7. Length and Find
# String length
df['name_length'] = df['name_clean'].str.len()
# Find position
df['at_position'] = df['email'].str.find('@')
print("\nLength and find:")
print(df[['name_clean', 'name_length', 'email', 'at_position']])
# 8. Count Occurrences
# Count character
df['num_spaces'] = df['address'].str.count(' ')
# Count pattern
df['num_digits'] = df['phone'].str.count(r'\d')
print("\nCount:")
print(df[['address', 'num_spaces']])
# 9. Padding
# Pad with zeros
order_ids = pd.Series([1, 42, 123])
padded = order_ids.astype(str).str.zfill(5)
print("\nZero-padded:")
print(padded)
# Pad left/right
padded_left = order_ids.astype(str).str.pad(width=5, side='left', fillchar='0')
padded_right = order_ids.astype(str).str.pad(width=5, side='right', fillchar='*')
# 10. Practical Example: Data Cleaning
messy_data = pd.DataFrame({
'customer_name': [' john DOE ', 'Jane_Smith', 'bob-wilson'],
'email': ['[email protected]', 'jane.smith@test', '[email protected]'],
'phone': ['(555)123-4567', '555 123 4567', '5551234567'],
'zip_code': ['12345', '678', '90210']
})
# Clean everything
cleaned = messy_data.copy()
# Names: strip, title case, replace special chars
cleaned['customer_name'] = (cleaned['customer_name']
.str.strip()
.str.replace(r'[_-]', ' ', regex=True)
.str.title())
# Email: lowercase, validate format
cleaned['email'] = cleaned['email'].str.lower()
cleaned['email_valid'] = cleaned['email'].str.match(r'^[\w\.-]+@[\w\.-]+\.\w+$')
# Phone: extract digits only, format
phone_digits = cleaned['phone'].str.replace(r'\D', '', regex=True)
cleaned['phone_formatted'] = phone_digits.str.replace(
r'^(\d{3})(\d{3})(\d{4})$',
r'(\1) \2-\3',
regex=True
)
# Zip: pad with zeros
cleaned['zip_code'] = cleaned['zip_code'].str.zfill(5)
print("\nCleaned data:")
print(cleaned)
# 11. Performance Comparison
import time
large_series = pd.Series(['test_string'] * 100000)
# Vectorized (fast)
start = time.time()
_ = large_series.str.upper()
vectorized_time = time.time() - start
# Loop (slow)
start = time.time()
_ = [s.upper() for s in large_series]
loop_time = time.time() - start
print(f"\nVectorized: {vectorized_time:.4f}s")
print(f"Loop: {loop_time:.4f}s ({loop_time/vectorized_time:.0f}x slower)")
Common String Methods:
| Method | Description | Example |
|---|---|---|
| lower/upper/title | Case conversion | .str.lower() |
| strip/lstrip/rstrip | Remove whitespace | .str.strip() |
| contains | Check if contains | .str.contains('word') |
| replace | Replace substring | .str.replace('old', 'new') |
| extract | Regex extract | .str.extract(r'(\d+)') |
| split | Split string | .str.split(',') |
| [:n] | Slice | .str[:5] |
| len | Length | .str.len() |
| find | Find position | .str.find('@') |
| count | Count occurrences | .str.count('a') |
Regex Operations:
| Method | Purpose | Example |
|---|---|---|
| contains | Match pattern | .str.contains(r'\d{3}') |
| extract | Extract groups | .str.extract(r'(\w+)@(\w+)') |
| replace | Replace pattern | .str.replace(r'\s+', ' ') |
| match | Full match | .str.match(r'^\d{5}$') |
| findall | Find all matches | .str.findall(r'\d+') |
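findall() from the table above is not shown in the examples; a small sketch of how it returns every regex match per row, which pairs naturally with explode():
import pandas as pd

s = pd.Series(['order 12 and 45', 'no numbers here', 'id 7'])
matches = s.str.findall(r'\d+')  # list of all matches per element
print(matches)            # ['12', '45'], [], ['7']
print(matches.explode())  # one match per row; empty lists become NaN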
Interview Insights:
Interviewer's Insight
- ".str accessor: vectorized string operations"
- "Much faster than Python loops (10-100x)"
- "Common: lower(), strip(), replace(), contains()"
- "extract(): use regex with capturing groups"
- "split(expand=True): split into columns"
- "replace(regex=True): regex patterns"
- "Use for: data cleaning, feature extraction"
- "Returns Series of same length (NaN for missing)"
- "Performance: always prefer .str over loops"
Explain pd.explode() - Google, Amazon Interview Question
Difficulty: π‘ Medium | Tags: explode, List Columns, Data Transformation | Asked by: Google, Amazon, Meta
View Answer
explode() transforms a list-like column into separate rows. Each list element becomes a new row, with the other columns' values repeated. Essential for nested data structures.
Complete Examples:
import pandas as pd
import numpy as np
# 1. Basic explode
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'hobbies': [['reading', 'swimming'], ['gaming'], ['cooking', 'painting', 'music']]
})
print("Original (lists in column):")
print(df)
exploded = df.explode('hobbies')
print("\nExploded:")
print(exploded)
# 2. Reset Index
exploded_reset = df.explode('hobbies').reset_index(drop=True)
print("\nWith reset index:")
print(exploded_reset)
# 3. Multiple Columns with Lists
df_multi = pd.DataFrame({
'id': [1, 2],
'tags': [['a', 'b'], ['c', 'd', 'e']],
'scores': [[10, 20], [30, 40, 50]]
})
# Explode both columns (must be same length per row)
exploded_multi = df_multi.explode(['tags', 'scores'])
print("\nExplode multiple columns:")
print(exploded_multi)
# 4. With Missing Values
df_missing = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'items': [['apple', 'banana'], None, []]
})
exploded_na = df_missing.explode('items')
print("\nWith missing/empty:")
print(exploded_na)
# 5. Practical Example: User Tags
users = pd.DataFrame({
'user_id': [101, 102, 103],
'username': ['alice', 'bob', 'charlie'],
'interests': [
['python', 'data science', 'ML'],
['javascript', 'web dev'],
['design', 'UX', 'CSS', 'HTML']
]
})
# Explode to analyze tag frequency
user_interests = users.explode('interests')
tag_counts = user_interests['interests'].value_counts()
print("\nTag frequency:")
print(tag_counts)
# 6. E-commerce Orders
orders = pd.DataFrame({
'order_id': [1, 2, 3],
'customer': ['Alice', 'Bob', 'Charlie'],
'products': [
['Laptop', 'Mouse'],
['Keyboard'],
['Monitor', 'Cable', 'Stand']
],
'quantities': [
[1, 2],
[1],
[1, 3, 1]
],
'prices': [
[1000, 25],
[75],
[300, 10, 50]
]
})
# Explode order items
order_items = orders.explode(['products', 'quantities', 'prices'])
order_items = order_items.reset_index(drop=True)
# Calculate line totals
order_items['line_total'] = (
order_items['quantities'].astype(int) *
order_items['prices'].astype(int)
)
print("\nOrder items:")
print(order_items)
# Total per order
order_totals = order_items.groupby('order_id')['line_total'].sum()
print("\nOrder totals:")
print(order_totals)
# 7. JSON/Nested Data
# Simulating API response
api_data = pd.DataFrame({
'post_id': [1, 2],
'author': ['Alice', 'Bob'],
'comments': [
[
{'user': 'Bob', 'text': 'Great!'},
{'user': 'Charlie', 'text': 'Awesome!'}
],
[
{'user': 'Alice', 'text': 'Thanks!'}
]
]
})
# Explode comments
comments_df = api_data.explode('comments').reset_index(drop=True)
# Extract comment fields
comments_df['commenter'] = comments_df['comments'].apply(lambda x: x['user'] if x else None)
comments_df['comment_text'] = comments_df['comments'].apply(lambda x: x['text'] if x else None)
comments_df = comments_df.drop('comments', axis=1)
print("\nExtracted comments:")
print(comments_df)
# 8. Time Series Events
events = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=3, freq='D'),
'events': [
['login', 'purchase', 'logout'],
['login'],
['login', 'view', 'view', 'logout']
]
})
event_log = events.explode('events').reset_index(drop=True)
# Event frequency by day
event_freq = event_log.groupby(['date', 'events']).size().unstack(fill_value=0)
print("\nEvent frequency:")
print(event_freq)
# 9. Performance Note
# explode is efficient, but watch for very large lists
import time
# Create test data
large_df = pd.DataFrame({
'id': range(1000),
'values': [[1, 2, 3, 4, 5] for _ in range(1000)]
})
start = time.time()
_ = large_df.explode('values')
explode_time = time.time() - start
print(f"\nExploded 1000 rows Γ 5 values in {explode_time:.4f}s")
# 10. Reverse: groupby + list
# Explode β Group back to lists
exploded_data = pd.DataFrame({
'category': ['A', 'A', 'B', 'B', 'B'],
'value': [1, 2, 3, 4, 5]
})
grouped = exploded_data.groupby('category')['value'].apply(list).reset_index()
print("\nGrouped back to lists:")
print(grouped)
explode() Characteristics:
| Aspect | Description |
|---|---|
| Input | Column with lists/arrays/sets |
| Output | One row per list element |
| Index | Duplicates original index |
| Other cols | Repeated for each element |
| Empty list | Creates one row with NaN |
| None/NaN | Creates one row with NaN |
Common Patterns:
| Pattern | Use Case |
|---|---|
| User tags | Analyze interests, categories |
| Order items | E-commerce line items |
| JSON arrays | Flatten nested API data |
| Multi-value fields | Convert to normalized form |
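Recent pandas (1.1+) also accepts ignore_index=True in explode(), which replaces the separate reset_index(drop=True) step used above. A minimal sketch:
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'hobbies': [['reading', 'swimming'], ['gaming']]
})
print(df.explode('hobbies', ignore_index=True))  # clean 0..n-1 index in one call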
Interview Insights:
Interviewer's Insight
- "explode(): list column β separate rows"
- "Each list element becomes new row"
- "Other columns duplicated"
- "Index preserved (duplicated)"
- "reset_index(drop=True) for clean index"
- "Can explode multiple columns (same lengths)"
- "Empty list β NaN row"
- "Reverse: groupby().apply(list)"
- "Use for: nested data, JSON arrays, multi-value fields"
Explain Pandas Pipe Method - Meta, Google Interview Question
Difficulty: π‘ Medium | Tags: Pipe, Method Chaining, Clean Code | Asked by: Meta, Google, Netflix
View Answer
pipe() enables method chaining with custom functions: pass a DataFrame through a pipeline of functions for clean, readable data transformations.
Complete Examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'age': [25, 30, 35, 40],
'salary': [50000, 60000, 70000, 80000],
'department': ['Sales', 'IT', 'Sales', 'IT']
})
# 1. Without pipe (nested functions - hard to read)
def add_bonus(df, rate=0.1):
df = df.copy()
df['bonus'] = df['salary'] * rate
return df
def calculate_total(df):
df = df.copy()
df['total'] = df['salary'] + df['bonus']
return df
# Nested (ugly)
result = calculate_total(add_bonus(df, rate=0.15))
# 2. With pipe (clean, readable)
result = (df
.pipe(add_bonus, rate=0.15)
.pipe(calculate_total))
print("With pipe:")
print(result)
# 3. Multiple Transformations
def filter_by_age(df, min_age):
return df[df['age'] >= min_age]
def add_age_group(df):
df = df.copy()
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100],
labels=['Young', 'Middle', 'Senior'])
return df
def calculate_metrics(df):
df = df.copy()
df['salary_per_age'] = df['salary'] / df['age']
return df
# Clean pipeline
result = (df
.pipe(filter_by_age, min_age=28)
.pipe(add_age_group)
.pipe(calculate_metrics))
print("\nPipeline result:")
print(result)
# 4. Mixing Built-in and Custom Functions
result = (df
.pipe(lambda d: d[d['age'] > 25]) # Lambda
.assign(bonus=lambda d: d['salary'] * 0.1) # Built-in
.pipe(lambda d: d.sort_values('bonus', ascending=False))
.reset_index(drop=True))
print("\nMixed pipeline:")
print(result)
# 5. Pipe with Arguments and Kwargs
def apply_discount(df, column, discount_rate, condition_col, condition_val):
df = df.copy()
mask = df[condition_col] == condition_val
df.loc[mask, column] = df.loc[mask, column] * (1 - discount_rate)
return df
result = (df
.pipe(apply_discount,
column='salary',
discount_rate=0.05,
condition_col='department',
condition_val='Sales'))
print("\nWith arguments:")
print(result)
# 6. Error Handling in Pipeline
def validate_data(df):
"""Validate DataFrame."""
assert not df['salary'].isna().any(), "Salary has NaN"
assert (df['age'] > 0).all(), "Invalid age"
return df
def clean_data(df):
"""Clean DataFrame."""
df = df.copy()
df = df.dropna()
df = df[df['age'] > 0]
return df
# Safe pipeline
result = (df
.pipe(clean_data)
.pipe(validate_data)
.pipe(add_bonus, rate=0.1))
# 7. Practical Example: Data Processing Pipeline
sales_data = pd.DataFrame({
'order_id': range(1, 101),
'customer_id': np.random.randint(1, 21, 100),
'product': np.random.choice(['A', 'B', 'C'], 100),
'quantity': np.random.randint(1, 10, 100),
'price': np.random.randint(10, 100, 100),
'date': pd.date_range('2024-01-01', periods=100, freq='D')
})
def add_revenue(df):
df = df.copy()
df['revenue'] = df['quantity'] * df['price']
return df
def add_date_features(df):
df = df.copy()
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
return df
def filter_high_value(df, min_revenue=500):
return df[df['revenue'] >= min_revenue]
def add_customer_segment(df):
df = df.copy()
customer_totals = df.groupby('customer_id')['revenue'].sum()
segments = pd.qcut(customer_totals, q=3, labels=['Low', 'Medium', 'High'])
df['segment'] = df['customer_id'].map(segments)
return df
# Complete pipeline
processed = (sales_data
.pipe(add_revenue)
.pipe(add_date_features)
.pipe(filter_high_value, min_revenue=300)
.pipe(add_customer_segment)
.pipe(lambda d: d.sort_values(['segment', 'revenue'],
ascending=[True, False])))
print("\nProcessed sales data:")
print(processed.head(10))
# 8. Reusable Pipelines
def sales_pipeline(df, min_revenue=500):
"""Reusable sales processing pipeline."""
return (df
.pipe(add_revenue)
.pipe(add_date_features)
.pipe(filter_high_value, min_revenue=min_revenue)
.pipe(add_customer_segment))
# Apply to different datasets
result1 = sales_pipeline(sales_data, min_revenue=300)
result2 = sales_pipeline(sales_data, min_revenue=600)
print(f"\nWith min_revenue=300: {len(result1)} rows")
print(f"With min_revenue=600: {len(result2)} rows")
# 9. pipe() vs Method Chaining
# Method chaining (only built-in methods)
result_chain = (df
.query('age > 25')
.assign(bonus=lambda d: d['salary'] * 0.1)
.sort_values('bonus'))
# pipe (custom functions)
result_pipe = (df
.pipe(lambda d: d[d['age'] > 25])
.pipe(add_bonus, rate=0.1)
.pipe(lambda d: d.sort_values('bonus')))
print("\nBoth approaches work:")
print(f"Chain: {len(result_chain)} rows")
print(f"Pipe: {len(result_pipe)} rows")
# 10. Performance Note
# pipe() has minimal overhead
import time
large_df = pd.DataFrame({
'A': range(100000),
'B': range(100000)
})
# With pipe
start = time.time()
_ = (large_df
.pipe(lambda d: d[d['A'] > 50000])
.pipe(lambda d: d.assign(C=d['A'] + d['B'])))
pipe_time = time.time() - start
# Without pipe
start = time.time()
temp = large_df[large_df['A'] > 50000]
_ = temp.assign(C=temp['A'] + temp['B'])
direct_time = time.time() - start
print(f"\nPipe: {pipe_time:.4f}s")
print(f"Direct: {direct_time:.4f}s")
print("(Virtually same performance)")
pipe() Benefits:
| Benefit | Description |
|---|---|
| Readability | Linear, top-to-bottom flow |
| Reusability | Functions can be reused |
| Testability | Each function testable |
| Maintainability | Easy to add/remove steps |
| Debugging | Comment out steps easily |
pipe() vs Method Chaining:
| Aspect | pipe() | Method Chaining |
|---|---|---|
| Functions | Custom functions | Built-in methods |
| Flexibility | Very flexible | Limited to methods |
| Arguments | Pass parameters | Use kwargs |
| Use Case | Complex pipelines | Simple transformations |
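pipe() also accepts a (function, keyword) tuple for functions where the DataFrame is not the first argument; the keyword names the parameter that should receive the DataFrame. A small sketch (the tag function is illustrative):
import pandas as pd

def tag(label, data):  # DataFrame is the *second* parameter here
    data = data.copy()
    data['label'] = label
    return data

df = pd.DataFrame({'A': [1, 2, 3]})
result = df.pipe((tag, 'data'), 'experiment_1')  # df passed as data=..., 'experiment_1' as label
print(result)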
Interview Insights:
Interviewer's Insight
- "pipe(): pass DataFrame through functions"
- "Enables method chaining with custom functions"
- "Clean, readable data pipelines"
- "Pass arguments:
.pipe(func, arg1, arg2)" - "Better than nested functions:
f(g(h(df)))" - "Combine with built-in methods: assign, query"
- "Each function: take df, return df"
- "Benefits: readability, reusability, testability"
- "Minimal performance overhead"
Explain pd.json_normalize() - Amazon, Google Interview Question
Difficulty: π‘ Medium | Tags: JSON, Nested Data, Flattening | Asked by: Amazon, Google, Meta, Netflix
View Answer
json_normalize() flattens nested JSON/dict structures into DataFrames. Essential for API responses, nested documents, and semi-structured data.
Complete Examples:
import pandas as pd
import json
# 1. Simple Nested JSON
data = [
{'name': 'Alice', 'age': 25, 'address': {'city': 'NYC', 'zip': '10001'}},
{'name': 'Bob', 'age': 30, 'address': {'city': 'LA', 'zip': '90001'}}
]
df = pd.json_normalize(data)
print("Flattened JSON:")
print(df)
# 2. Deep Nesting
data = [
{
'id': 1,
'user': {
'name': 'Alice',
'contact': {
'email': 'alice@example.com',
'phone': '123-456-7890'
}
}
},
{
'id': 2,
'user': {
'name': 'Bob',
'contact': {
'email': 'bob@example.com',
'phone': '098-765-4321'
}
}
}
]
df = pd.json_normalize(data)
print("\nDeep nesting:")
print(df)
# 3. Custom Separator
df = pd.json_normalize(data, sep='_')
print("\nCustom separator:")
print(df)
# 4. Arrays in JSON
data = [
{'name': 'Alice', 'scores': [90, 85, 88]},
{'name': 'Bob', 'scores': [78, 82, 80]}
]
# Note: Arrays stay as lists
df = pd.json_normalize(data)
print("\nWith arrays:")
print(df)
# Explode to separate rows
df_exploded = df.explode('scores').reset_index(drop=True)
print("\nExploded scores:")
print(df_exploded)
# 5. Nested Arrays of Objects
data = [
{
'order_id': 1,
'customer': 'Alice',
'items': [
{'product': 'Laptop', 'price': 1000},
{'product': 'Mouse', 'price': 25}
]
},
{
'order_id': 2,
'customer': 'Bob',
'items': [
{'product': 'Keyboard', 'price': 75}
]
}
]
df = pd.json_normalize(
data,
record_path='items',
meta=['order_id', 'customer']
)
print("\nNested arrays of objects:")
print(df)
# 6. Multiple Levels of Nesting
data = [
{
'id': 1,
'name': 'Alice',
'orders': [
{
'order_id': 101,
'items': [
{'product': 'A', 'qty': 2},
{'product': 'B', 'qty': 1}
]
}
]
}
]
# Flatten orders
df_orders = pd.json_normalize(
data,
record_path='orders',
meta=['id', 'name']
)
print("\nFlattened orders:")
print(df_orders)
# Further flatten items
# Need to work with the items column
items_data = []
for _, row in df_orders.iterrows():
for item in row['items']:
items_data.append({
'id': row['id'],
'name': row['name'],
'order_id': row['order_id'],
'product': item['product'],
'qty': item['qty']
})
df_items = pd.DataFrame(items_data)
print("\nFully flattened:")
print(df_items)
# 7. Real Example: GitHub API Response
github_response = [
{
'id': 1,
'name': 'repo1',
'owner': {
'login': 'alice',
'id': 123,
'url': 'https://github.com/alice'
},
'stats': {
'stars': 100,
'forks': 20,
'watchers': 50
}
},
{
'id': 2,
'name': 'repo2',
'owner': {
'login': 'bob',
'id': 456,
'url': 'https://github.com/bob'
},
'stats': {
'stars': 500,
'forks': 100,
'watchers': 250
}
}
]
df = pd.json_normalize(github_response, sep='_')
print("\nGitHub repos:")
print(df)
# 8. Handling Missing Keys
data = [
{'name': 'Alice', 'age': 25, 'city': 'NYC'},
{'name': 'Bob', 'age': 30}, # Missing 'city'
{'name': 'Charlie', 'city': 'LA'} # Missing 'age'
]
df = pd.json_normalize(data)
print("\nMissing keys (NaN):")
print(df)
# 9. Real Example: Twitter/X API
tweets = [
{
'id': 1,
'text': 'Hello World',
'user': {'name': 'Alice', 'followers': 1000},
'entities': {
'hashtags': ['hello', 'world'],
'mentions': ['@bob']
},
'metrics': {'likes': 50, 'retweets': 10}
},
{
'id': 2,
'text': 'Python is great',
'user': {'name': 'Bob', 'followers': 5000},
'entities': {
'hashtags': ['python'],
'mentions': []
},
'metrics': {'likes': 200, 'retweets': 50}
}
]
df = pd.json_normalize(tweets, sep='_')
print("\nTweets data:")
print(df[['id', 'text', 'user_name', 'user_followers', 'metrics_likes']])
# 10. Performance with Large JSON
# json_normalize is efficient
import time
large_data = [
{
'id': i,
'data': {
'value': i * 2,
'meta': {
'timestamp': '2024-01-01',
'source': 'api'
}
}
}
for i in range(10000)
]
start = time.time()
df = pd.json_normalize(large_data, sep='_')
normalize_time = time.time() - start
print(f"\nNormalized 10k records in {normalize_time:.4f}s")
print(f"Result shape: {df.shape}")
json_normalize() Parameters:
| Parameter | Description | Example |
|---|---|---|
| data | JSON data (list of dicts) | [{'a': 1}] |
| record_path | Path to nested records | record_path='items' |
| meta | Columns to include from parent | meta=['id', 'name'] |
| sep | Separator for flattened keys | sep='_' (default '.') |
| max_level | Max nesting depth | max_level=1 |
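A small sketch of max_level from the table above: it limits how many dict levels are flattened, leaving deeper dicts as object columns:
import pandas as pd

data = [{'id': 1, 'user': {'name': 'Alice', 'contact': {'email': 'alice@example.com'}}}]
print(pd.json_normalize(data, max_level=1))
# 'user.name' is flattened; 'user.contact' remains a dict column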
Common Patterns:
| Pattern | Solution |
|---|---|
| Simple nesting | json_normalize(data) |
| Nested arrays | json_normalize(data, record_path='items', meta=['id']) |
| Deep nesting | Flatten in multiple steps |
| Arrays of values | Use with .explode() |
Interview Insights:
Interviewer's Insight
- "json_normalize(): flatten nested JSON/dicts"
- "Converts nested structures to flat DataFrame"
- "record_path: path to nested array"
- "meta: parent columns to include"
- "sep: separator for flattened names (default '.')"
- "Use with explode() for nested arrays"
- "Essential for: API responses, NoSQL data"
- "Handles missing keys (creates NaN)"
- "Use for: Twitter API, GitHub API, MongoDB docs"
Explain pd.to_datetime() Options - Microsoft, Amazon Interview Question
Difficulty: π’ Easy | Tags: DateTime, Parsing, Time Series | Asked by: Microsoft, Amazon, Google, Meta
View Answer
to_datetime() parses strings and numbers into datetime objects. It supports many formats, error handling, and timezone conversion, and is much faster when an explicit format is given.
Complete Examples:
import pandas as pd
import numpy as np
# 1. Basic Parsing
# Automatic format detection
dates = pd.Series(['2024-01-15', '2024-02-20', '2024-03-25'])
parsed = pd.to_datetime(dates)
print("Basic parsing:")
print(parsed)
print(f"Type: {parsed.dtype}")
# 2. Different Formats
# Common formats
formats = {
'ISO': ['2024-01-15', '2024-02-20'],
'US': ['01/15/2024', '02/20/2024'],
'EU': ['15/01/2024', '20/02/2024'],
'Text': ['Jan 15, 2024', 'Feb 20, 2024']
}
for name, dates in formats.items():
parsed = pd.to_datetime(dates)
print(f"\n{name} format:")
print(parsed)
# 3. Specify Format (Much Faster!)
dates = pd.Series(['20240115', '20240220'] * 10000)
import time
# Without format (slow, tries multiple formats)
start = time.time()
_ = pd.to_datetime(dates)
auto_time = time.time() - start
# With format (fast, direct parsing)
start = time.time()
_ = pd.to_datetime(dates, format='%Y%m%d')
format_time = time.time() - start
print(f"\nAuto detection: {auto_time:.4f}s")
print(f"With format: {format_time:.4f}s ({auto_time/format_time:.0f}x faster!)")
# 4. Common Format Codes
print("\n=== Format Codes ===")
examples = {
'%Y-%m-%d': '2024-01-15',
'%d/%m/%Y': '15/01/2024',
'%m/%d/%Y': '01/15/2024',
'%Y%m%d': '20240115',
'%d-%b-%Y': '15-Jan-2024',
'%Y-%m-%d %H:%M:%S': '2024-01-15 14:30:00',
'%d/%m/%Y %I:%M %p': '15/01/2024 02:30 PM'
}
for fmt, example in examples.items():
parsed = pd.to_datetime(example, format=fmt)
print(f"{fmt:25} β {example:25} β {parsed}")
# 5. Error Handling
mixed_dates = pd.Series(['2024-01-15', '2024-02-20', 'not a date', '2024-03-25'])
# errors='raise' (default) - raises exception
# try: pd.to_datetime(mixed_dates); except: pass
# errors='ignore' - returns original if can't parse
ignore_result = pd.to_datetime(mixed_dates, errors='ignore')
print("\nerrors='ignore':")
print(ignore_result)
print(f"Types: {[type(x).__name__ for x in ignore_result]}")
# errors='coerce' - NaT for invalid
coerce_result = pd.to_datetime(mixed_dates, errors='coerce')
print("\nerrors='coerce':")
print(coerce_result)
# 6. From Components
df = pd.DataFrame({
'year': [2024, 2024, 2024],
'month': [1, 2, 3],
'day': [15, 20, 25]
})
# Combine columns
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
print("\nFrom components:")
print(df)
# 7. Unix Timestamps
timestamps = pd.Series([1704067200, 1706745600, 1709337600])
# Convert (unit='s' for seconds)
dates = pd.to_datetime(timestamps, unit='s')
print("\nFrom Unix timestamp:")
print(dates)
# Milliseconds
timestamps_ms = timestamps * 1000
dates_ms = pd.to_datetime(timestamps_ms, unit='ms')
# 8. Timezone Handling
# Parse with timezone
dates = pd.Series(['2024-01-15 10:00:00', '2024-02-20 15:30:00'])
# UTC
utc = pd.to_datetime(dates, utc=True)
print("\nUTC:")
print(utc)
# Convert to timezone
eastern = utc.dt.tz_convert('America/New_York')
print("\nEastern Time:")
print(eastern)
# Localize (add timezone to naive)
naive = pd.to_datetime(dates)
localized = naive.dt.tz_localize('America/Los_Angeles')
print("\nLocalized to Pacific:")
print(localized)
# 9. Day First / Year First
ambiguous_dates = pd.Series(['01-02-2024', '03-04-2024'])
# Default: month first (US)
us_format = pd.to_datetime(ambiguous_dates)
print("\nMonth first (US):")
print(us_format)
# Day first (EU)
eu_format = pd.to_datetime(ambiguous_dates, dayfirst=True)
print("\nDay first (EU):")
print(eu_format)
# 10. Practical Example: Log File Parsing
log_data = pd.DataFrame({
'timestamp': [
'2024-01-15 10:30:45.123',
'2024-01-15 10:31:12.456',
'2024-01-15 10:32:03.789'
],
'event': ['login', 'click', 'logout']
})
# Parse with milliseconds
log_data['timestamp'] = pd.to_datetime(
log_data['timestamp'],
format='%Y-%m-%d %H:%M:%S.%f'
)
# Extract time components
log_data['hour'] = log_data['timestamp'].dt.hour
log_data['minute'] = log_data['timestamp'].dt.minute
log_data['second'] = log_data['timestamp'].dt.second
# Calculate duration
log_data['duration'] = log_data['timestamp'].diff()
print("\nLog data:")
print(log_data)
Common Format Codes:
| Code | Meaning | Example |
|---|---|---|
| %Y | 4-digit year | 2024 |
| %y | 2-digit year | 24 |
| %m | Month (01-12) | 01 |
| %d | Day (01-31) | 15 |
| %H | Hour 24h (00-23) | 14 |
| %I | Hour 12h (01-12) | 02 |
| %M | Minute (00-59) | 30 |
| %S | Second (00-59) | 45 |
| %f | Microsecond | 123456 |
| %p | AM/PM | PM |
| %b | Month abbr | Jan |
| %B | Month full | January |
Key Parameters:
| Parameter | Description | Example |
|---|---|---|
| format | Explicit format | format='%Y-%m-%d' |
| errors | Error handling | errors='coerce' |
| dayfirst | Day before month | dayfirst=True |
| yearfirst | Year first | yearfirst=True |
| utc | Parse as UTC | utc=True |
| unit | Timestamp unit | unit='s' |
Error Handling:
| Option | Behavior | Use Case |
|---|---|---|
| raise | Raise exception | Ensure all valid |
| ignore | Return original | Keep as string |
| coerce | Return NaT | Fill invalid with NaT |
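A common follow-up pattern, sketched below: parse with errors='coerce', then use isna() on the result to locate exactly which input strings failed to parse:
import pandas as pd

raw = pd.Series(['2024-01-15', 'not a date', '2024-03-25', '2024-13-01'])
parsed = pd.to_datetime(raw, errors='coerce')
bad_rows = raw[parsed.isna()]  # original strings that could not be parsed
print(bad_rows)  # 'not a date' and '2024-13-01' (invalid month)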
Interview Insights:
Interviewer's Insight
- "to_datetime(): parse strings/numbers to datetime"
- "format parameter: specify format for speed (10-100x faster)"
- "Format codes: %Y, %m, %d, %H, %M, %S"
- "errors='coerce': NaT for invalid dates"
- "dayfirst=True: European format (day/month/year)"
- "unit='s': parse Unix timestamps"
- "utc=True: parse as UTC"
- "From dict: pass DataFrame with year/month/day columns"
- "Always specify format for large datasets"
Explain pd.read_csv() Optimization - Google, Netflix Interview Question
Difficulty: π‘ Medium | Tags: File I/O, Performance, Large Files | Asked by: Google, Netflix, Amazon, Meta
View Answer
Optimizing read_csv() cuts both load time and memory. Key techniques: dtype specification, usecols, nrows, chunksize, the C engine, and alternative file formats such as Parquet.
Complete Examples:
import pandas as pd
import numpy as np
import time
# Create sample CSV
sample_df = pd.DataFrame({
'id': range(100000),
'category': np.random.choice(['A', 'B', 'C'], 100000),
'value': np.random.rand(100000),
'text': ['some text'] * 100000,
'date': pd.date_range('2020-01-01', periods=100000, freq='h')
})
sample_df.to_csv('sample.csv', index=False)
# 1. Basic (Slow)
start = time.time()
df = pd.read_csv('sample.csv')
basic_time = time.time() - start
basic_memory = df.memory_usage(deep=True).sum() / 1024**2
print(f"Basic: {basic_time:.4f}s, {basic_memory:.2f} MB")
print(df.dtypes)
# 2. Specify dtypes (Faster, Less Memory)
start = time.time()
df_typed = pd.read_csv(
'sample.csv',
dtype={
'id': 'int32', # vs int64
'category': 'category', # vs object
'value': 'float32' # vs float64
},
parse_dates=['date']
)
typed_time = time.time() - start
typed_memory = df_typed.memory_usage(deep=True).sum() / 1024**2
print(f"\nWith dtypes: {typed_time:.4f}s, {typed_memory:.2f} MB")
print(f"Memory saved: {(1 - typed_memory/basic_memory)*100:.1f}%")
print(df_typed.dtypes)
# 3. Load Only Needed Columns
start = time.time()
df_cols = pd.read_csv(
'sample.csv',
usecols=['id', 'category', 'value']
)
cols_time = time.time() - start
cols_memory = df_cols.memory_usage(deep=True).sum() / 1024**2
print(f"\nSelected columns: {cols_time:.4f}s, {cols_memory:.2f} MB")
# 4. Load First N Rows (Testing)
df_sample = pd.read_csv('sample.csv', nrows=1000)
print(f"\nFirst 1000 rows: {df_sample.shape}")
# 5. Chunking (For Very Large Files)
def process_chunks(filename, chunksize=10000):
"""Process file in chunks."""
total = 0
for chunk in pd.read_csv(filename, chunksize=chunksize):
# Process each chunk
total += chunk['value'].sum()
return total
start = time.time()
result = process_chunks('sample.csv', chunksize=10000)
chunk_time = time.time() - start
print(f"\nChunked processing: {chunk_time:.4f}s, result: {result:.2f}")
# 6. C Engine vs Python Engine
# C engine (default, faster)
start = time.time()
_ = pd.read_csv('sample.csv', engine='c')
c_time = time.time() - start
# Python engine (more features, slower)
start = time.time()
_ = pd.read_csv('sample.csv', engine='python')
python_time = time.time() - start
print(f"\nC engine: {c_time:.4f}s")
print(f"Python engine: {python_time:.4f}s")
# 7. Low Memory Mode
# low_memory=False reads the file in one pass for consistent dtype inference
# (the default low_memory=True can emit mixed-dtype warnings on messy files)
df_low_mem = pd.read_csv('sample.csv', low_memory=False)
# 8. Date Parsing Optimization
# Bad: parse after loading
start = time.time()
df1 = pd.read_csv('sample.csv')
df1['date'] = pd.to_datetime(df1['date'])
parse_after_time = time.time() - start
# Good: parse during loading
start = time.time()
df2 = pd.read_csv('sample.csv', parse_dates=['date'])
parse_during_time = time.time() - start
print(f"\nParse after: {parse_after_time:.4f}s")
print(f"Parse during: {parse_during_time:.4f}s")
# 9. Compression
# Save compressed
sample_df.to_csv('sample.csv.gz', index=False, compression='gzip')
# Read compressed (automatic detection)
start = time.time()
df_compressed = pd.read_csv('sample.csv.gz')
compressed_time = time.time() - start
import os
csv_size = os.path.getsize('sample.csv') / 1024**2
gz_size = os.path.getsize('sample.csv.gz') / 1024**2
print(f"\nCSV size: {csv_size:.2f} MB")
print(f"Gzip size: {gz_size:.2f} MB ({(1-gz_size/csv_size)*100:.1f}% smaller)")
print(f"Load time: {compressed_time:.4f}s")
# 10. Alternative: Parquet (Much Better!)
# Save as parquet
sample_df.to_parquet('sample.parquet', index=False)
# Read parquet
start = time.time()
df_parquet = pd.read_parquet('sample.parquet')
parquet_time = time.time() - start
parquet_size = os.path.getsize('sample.parquet') / 1024**2
print(f"\nParquet size: {parquet_size:.2f} MB")
print(f"Parquet load: {parquet_time:.4f}s")
print(f"Parquet vs CSV: {basic_time/parquet_time:.1f}x faster!")
# 11. Skip Rows
# Skip first N rows
df_skip = pd.read_csv('sample.csv', skiprows=1000)
# Skip by condition (callable skiprows; keep the header row 0)
df_skip_func = pd.read_csv(
'sample.csv',
skiprows=lambda x: x > 0 and x % 2 == 0, # every other data row
engine='python'
)
# 12. Practical Recommendations
print("\n=== Best Practices ===")
print("1. Specify dtypes (especially category for low-cardinality)")
print("2. Use usecols to load only needed columns")
print("3. Use chunksize for files larger than RAM")
print("4. parse_dates during loading, not after")
print("5. Consider parquet for repeated reads (10x faster)")
print("6. Use compression for storage (gzip, snappy)")
print("7. Test with nrows first to design pipeline")
print("8. Use low_memory=False to avoid dtype warnings")
# Cleanup
import os
for file in ['sample.csv', 'sample.csv.gz', 'sample.parquet']:
if os.path.exists(file):
os.remove(file)
Optimization Techniques:
| Technique | Speed Improvement | Memory Savings | Use Case |
|---|---|---|---|
| dtype | 10-20% | 50-90% | Specify types |
| usecols | 30-50% | 50-80% | Select columns |
| chunksize | N/A | 90%+ | Larger than RAM |
| parse_dates | 20-30% | - | Date columns |
| engine='c' | 2-5x | - | Default, fastest |
| parquet | 5-10x | 50-80% | Best format |
Key Parameters:
| Parameter | Purpose | Example |
|---|---|---|
| dtype | Specify types | dtype={'col': 'int32'} |
| usecols | Select columns | usecols=['a', 'b'] |
| nrows | Limit rows | nrows=1000 |
| chunksize | Read in chunks | chunksize=10000 |
| parse_dates | Parse dates | parse_dates=['date'] |
| low_memory | Avoid warnings | low_memory=False |
| engine | Parser | engine='c' (fast) |
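A practical pattern worth knowing (the file name 'data.csv' is illustrative): profile a small sample with nrows first, derive a dtype map from it, then read the full file with those dtypes and only the needed columns:
import pandas as pd

# 1. Peek at a sample to decide dtypes
sample = pd.read_csv('data.csv', nrows=1000)
dtypes = {
    col: 'category'
    for col in sample.select_dtypes('object').columns
    if sample[col].nunique() < 50  # low-cardinality strings benefit most
}

# 2. Full read with explicit dtypes and only the columns you need
df = pd.read_csv('data.csv', dtype=dtypes, usecols=list(sample.columns))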
Interview Insights:
Interviewer's Insight
- "Specify dtypes: 50-90% memory savings"
- "usecols: load only needed columns"
- "chunksize: process files larger than RAM"
- "parse_dates: faster than parsing after load"
- "category dtype: 90%+ savings for low cardinality"
- "int32 vs int64: 50% memory savings"
- "float32 vs float64: 50% memory savings"
- "Parquet: 5-10x faster than CSV"
- "nrows: test with sample first"
- "engine='c': default, fastest"
Explain Pandas Style for Formatting - Meta, Amazon Interview Question
Difficulty: π’ Easy | Tags: Styling, Formatting, Visualization | Asked by: Meta, Amazon, Microsoft
View Answer
style API formats DataFrame output with colors, highlights, and conditional formatting. Creates publication-ready tables, heatmaps, and visual reports.
Complete Examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Product': ['A', 'B', 'C', 'D', 'E'],
'Sales': [10000, 15000, 8000, 12000, 18000],
'Profit': [2000, 3500, 1200, 2800, 4200],
'Growth': [0.15, -0.05, 0.22, 0.08, 0.18],
'Target': [12000, 14000, 10000, 13000, 16000]
})
# 1. Basic Number Formatting
styled = df.style.format({
'Sales': '${:,.0f}',
'Profit': '${:,.0f}',
'Growth': '{:.1%}'
})
print("Formatted (display in notebook):")
# styled # Display in Jupyter
# 2. Highlight Maximum
styled = df.style.highlight_max(
subset=['Sales', 'Profit'],
color='lightgreen'
)
# 3. Highlight Minimum
styled = df.style.highlight_min(
subset=['Sales', 'Profit'],
color='lightcoral'
)
# 4. Background Gradient
styled = df.style.background_gradient(
subset=['Sales', 'Profit'],
cmap='YlOrRd'
)
# 5. Color Negative Values
def color_negative_red(val):
color = 'red' if val < 0 else 'black'
return f'color: {color}'
styled = df.style.applymap(
color_negative_red,
subset=['Growth']
)
# 6. Conditional Formatting (Row-wise)
def highlight_missed_target(row):
"""Highlight rows that missed target."""
if row['Sales'] < row['Target']:
return ['background-color: #ffcccb'] * len(row)
else:
return [''] * len(row)
styled = df.style.apply(highlight_missed_target, axis=1)
# 7. Bar Charts in Cells
styled = df.style.bar(
subset=['Sales'],
color='lightblue'
)
# 8. Combining Multiple Styles
styled = (df.style
.format({
'Sales': '${:,.0f}',
'Profit': '${:,.0f}',
'Growth': '{:.1%}',
'Target': '${:,.0f}'
})
.background_gradient(subset=['Sales'], cmap='Blues')
.bar(subset=['Profit'], color='lightgreen')
.highlight_max(subset=['Growth'], color='yellow'))
# 9. Custom Function (Element-wise)
def color_profit(val):
"""Color profit levels."""
if val > 3000:
return 'background-color: #90EE90'
elif val > 2000:
return 'background-color: #FFFFCC'
else:
return 'background-color: #FFB6C1'
styled = df.style.applymap(color_profit, subset=['Profit'])
# 10. Practical Example: Sales Dashboard
dashboard_df = pd.DataFrame({
'Region': ['North', 'South', 'East', 'West'],
'Q1_Sales': [120000, 95000, 105000, 110000],
'Q2_Sales': [135000, 88000, 115000, 125000],
'YoY_Growth': [0.12, -0.08, 0.15, 0.10],
'Target_Met': ['Yes', 'No', 'Yes', 'Yes']
})
def highlight_yes_green(val):
if val == 'Yes':
return 'background-color: lightgreen'
elif val == 'No':
return 'background-color: lightcoral'
return ''
styled = (dashboard_df.style
.format({
'Q1_Sales': '${:,.0f}',
'Q2_Sales': '${:,.0f}',
'YoY_Growth': '{:.1%}'
})
.background_gradient(subset=['Q1_Sales', 'Q2_Sales'], cmap='YlGn')
.applymap(highlight_yes_green, subset=['Target_Met'])
.set_caption('Quarterly Sales Dashboard'))
# 11. Export to HTML/Excel
# To HTML
html = styled.to_html()
# with open('report.html', 'w') as f:
# f.write(html)
# To Excel (with xlsxwriter)
# styled.to_excel('report.xlsx', engine='xlsxwriter')
# 12. Hiding Index/Columns
styled = (df.style
.hide(axis='index')  # .hide_index() in pandas < 1.4 (removed in 2.0)
.format({'Sales': '${:,.0f}'}))
# 13. Setting Properties
styled = (df.style
.set_properties(**{
'background-color': 'lightblue',
'color': 'black',
'border-color': 'white'
}, subset=['Product']))
# 14. Tooltips (Hover Text)
# Note: More advanced, requires custom CSS
# 15. Practical Report
financial_df = pd.DataFrame({
'Metric': ['Revenue', 'COGS', 'Gross Profit', 'Operating Exp', 'Net Income'],
'2023': [1000000, 600000, 400000, 250000, 150000],
'2024': [1200000, 680000, 520000, 280000, 240000],
'Change': [0.20, 0.13, 0.30, 0.12, 0.60]
})
def color_change(val):
if val > 0.15:
return 'color: green; font-weight: bold'
elif val < 0:
return 'color: red; font-weight: bold'
return ''
styled = (financial_df.style
.format({
'2023': '${:,.0f}',
'2024': '${:,.0f}',
'Change': '{:+.1%}'
})
.applymap(color_change, subset=['Change'])
.set_caption('Financial Report 2023-2024')
.set_table_styles([
{'selector': 'caption',
'props': [('font-size', '16px'), ('font-weight', 'bold')]}
]))
print("\nStyle API commonly used in Jupyter notebooks for visual reports")
Common Style Methods:
| Method | Purpose | Example |
|---|---|---|
| format | Number formatting | .format({'col': '{:.2f}'}) |
| highlight_max | Highlight maximum | .highlight_max(color='green') |
| highlight_min | Highlight minimum | .highlight_min(color='red') |
| background_gradient | Color scale | .background_gradient(cmap='Blues') |
| bar | In-cell bar chart | .bar(color='blue') |
| applymap | Element-wise function | .applymap(func, subset=['col']) |
| apply | Row/column function | .apply(func, axis=1) |
Format Strings:
| Format | Output | Use Case |
|---|---|---|
| {:.2f} | 12.35 | 2 decimals |
| {:,.0f} | 1,234 | Thousands separator |
| ${:,.2f} | $1,234.56 | Currency |
| {:.1%} | 12.3% | Percentage |
| {:+.1f} | +12.3 | Show sign |
Color Maps:
| cmap | Description |
|---|---|
| 'YlOrRd' | Yellow-Orange-Red (heat) |
| 'Blues' | Light to dark blue |
| 'Greens' | Light to dark green |
| 'RdYlGn' | Red-Yellow-Green (diverging) |
| 'viridis' | Perceptually uniform |
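In pandas 1.3+, Styler.format also accepts precision and na_rep, which covers quick formatting without per-column format strings. A minimal sketch:
import pandas as pd
import numpy as np

df = pd.DataFrame({'value': [1.23456, np.nan, 3.14159]})
styled = df.style.format(precision=2, na_rep='-')  # 2 decimals, dash for NaN
html = styled.to_html()  # render for a report or email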
Interview Insights:
Interviewer's Insight
- "style: format DataFrame for display"
- "format(): number formatting (${:,.0f}, {:.1%})"
- "highlight_max/min(): highlight extremes"
- "background_gradient(): color scale"
- "bar(): in-cell bar charts"
- "applymap(): element-wise styling"
- "apply(): row/column-wise styling"
- "Method chaining for multiple styles"
- "Export: .to_html(), .to_excel()"
- "Use in: Jupyter reports, dashboards"
Explain Pandas Merge Indicators - Google, Amazon Interview Question
Difficulty: π‘ Medium | Tags: Merge, Join, Data Quality | Asked by: Google, Amazon, Meta
View Answer
Merge indicator tracks source of each row (left_only, right_only, both). Essential for data validation, join analysis, and identifying mismatches.
Complete Examples:
import pandas as pd
# Sample data
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'city': ['NYC', 'LA', 'Chicago', 'Boston', 'Seattle']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103, 104, 105, 106],
'customer_id': [1, 2, 2, 6, 7, 1],
'amount': [100, 200, 150, 300, 250, 175]
})
# 1. Basic Merge with Indicator
merged = pd.merge(
customers,
orders,
on='customer_id',
how='outer',
indicator=True
)
print("Merge with indicator:")
print(merged)
print("\nIndicator values:")
print(merged['_merge'].value_counts())
# 2. Custom Indicator Name
merged = pd.merge(
customers,
orders,
on='customer_id',
how='outer',
indicator='source'
)
print("\nCustom indicator name:")
print(merged[['customer_id', 'name', 'order_id', 'source']])
# 3. Analyze Merge Results
# Customers without orders
no_orders = merged[merged['source'] == 'left_only']
print("\nCustomers without orders:")
print(no_orders[['customer_id', 'name']])
# Orders without customer match
orphan_orders = merged[merged['source'] == 'right_only']
print("\nOrphan orders (no matching customer):")
print(orphan_orders[['order_id', 'customer_id', 'amount']])
# Successfully matched
matched = merged[merged['source'] == 'both']
print("\nMatched records:")
print(matched[['customer_id', 'name', 'order_id', 'amount']])
# 4. Data Quality Check
def analyze_merge(left, right, on, how='outer'):
"""Analyze merge quality."""
merged = pd.merge(left, right, on=on, how=how, indicator=True)
stats = merged['_merge'].value_counts()
total = len(merged)
print("=== Merge Analysis ===")
print(f"Total rows: {total}")
print(f"Left only: {stats.get('left_only', 0)} ({stats.get('left_only', 0)/total*100:.1f}%)")
print(f"Right only: {stats.get('right_only', 0)} ({stats.get('right_only', 0)/total*100:.1f}%)")
print(f"Both: {stats.get('both', 0)} ({stats.get('both', 0)/total*100:.1f}%)")
return merged
result = analyze_merge(customers, orders, on='customer_id')
# 5. Validation After Merge
merged = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
# Assert no orphan orders
orphans = merged[merged['_merge'] == 'right_only']
if len(orphans) > 0:
print(f"\nβ οΈ WARNING: {len(orphans)} orders without customer!")
print(orphans[['order_id', 'customer_id']])
# Assert all customers have orders
no_orders = merged[merged['_merge'] == 'left_only']
if len(no_orders) > 0:
print(f"\nINFO: {len(no_orders)} customers without orders")
# 6. Practical Example: Product Catalog Sync
catalog = pd.DataFrame({
'sku': ['A001', 'A002', 'A003', 'A004'],
'product': ['Widget', 'Gadget', 'Tool', 'Device'],
'price': [10, 20, 15, 25]
})
inventory = pd.DataFrame({
'sku': ['A001', 'A002', 'A005', 'A006'],
'warehouse': ['NYC', 'LA', 'NYC', 'LA'],
'quantity': [100, 50, 75, 30]
})
sync = pd.merge(catalog, inventory, on='sku', how='outer', indicator='status')
print("\n=== Catalog Sync Report ===")
print(f"Products in catalog only: {(sync['status']=='left_only').sum()}")
print(f"Products in inventory only: {(sync['status']=='right_only').sum()}")
print(f"Products in both: {(sync['status']=='both').sum()}")
# Products to add to inventory
to_add = sync[sync['status'] == 'left_only']
print("\nAdd to inventory:")
print(to_add[['sku', 'product']])
# Products to remove from inventory
to_remove = sync[sync['status'] == 'right_only']
print("\nRemove from inventory:")
print(to_remove[['sku', 'warehouse']])
# 7. Multiple Key Merge
sales = pd.DataFrame({
'region': ['North', 'South', 'North', 'East'],
'product': ['A', 'B', 'A', 'C'],
'sales': [100, 200, 150, 180]
})
targets = pd.DataFrame({
'region': ['North', 'South', 'West'],
'product': ['A', 'B', 'A'],
'target': [120, 180, 100]
})
comparison = pd.merge(
sales,
targets,
on=['region', 'product'],
how='outer',
indicator=True
)
print("\nSales vs Targets:")
print(comparison)
# Missing targets
no_target = comparison[comparison['_merge'] == 'left_only']
print("\nSales without targets:")
print(no_target[['region', 'product', 'sales']])
# 8. Time Series Alignment
actual = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=5, freq='D'),
'actual': [100, 110, 105, 115, 120]
})
forecast = pd.DataFrame({
'date': pd.date_range('2024-01-02', periods=5, freq='D'),
'forecast': [108, 106, 116, 122, 125]
})
comparison = pd.merge(
actual,
forecast,
on='date',
how='outer',
indicator=True
).sort_values('date')
print("\nActual vs Forecast:")
print(comparison)
# Calculate accuracy only where both exist
both_exist = comparison[comparison['_merge'] == 'both'].copy()  # copy to avoid SettingWithCopyWarning
both_exist['error'] = abs(both_exist['actual'] - both_exist['forecast'])
print("\nForecast accuracy:")
print(both_exist[['date', 'actual', 'forecast', 'error']])
Indicator Values:
| Value | Meaning | Description |
|---|---|---|
| 'left_only' | In left only | No match in right |
| 'right_only' | In right only | No match in left |
| 'both' | In both | Successfully matched |
Common Use Cases:
| Use Case | Check For |
|---|---|
| Data quality | Orphan records |
| ETL validation | Missing matches |
| Catalog sync | Products to add/remove |
| Customer orders | Customers without orders |
| Join analysis | Match rate |
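indicator pairs well with merge's validate parameter, which raises a MergeError immediately if the join cardinality is not what you expect. A small sketch:
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'order_id': [101, 102, 103], 'customer_id': [1, 2, 2]})

# Each customer may have many orders, but customer_id must be unique on the left
merged = pd.merge(customers, orders, on='customer_id',
                  how='outer', indicator=True, validate='one_to_many')
print(merged['_merge'].value_counts())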
Interview Insights:
Interviewer's Insight
- "indicator=True: add _merge column"
- "Values: 'left_only', 'right_only', 'both'"
- "Use for: data quality checks, join validation"
- "Filter by indicator: find unmatched records"
- "Custom name:
indicator='source'" - "Essential for: ETL pipelines, data validation"
- "Check match rate:
value_counts()on _merge" - "Find orphans:
merged[merged['_merge']=='right_only']" - "Always use with how='outer' for full picture"
Explain DataFrame.where() and mask() - Meta, Microsoft Interview Question
Difficulty: π‘ Medium | Tags: Conditional Operations, Filtering, Data Replacement | Asked by: Meta, Microsoft, Google
View Answer
where() and mask() replace values based on conditions. where() keeps True values, mask() replaces True values. Alternative to boolean indexing for conditional replacement.
Complete Examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50],
'C': [100, 200, 300, 400, 500]
})
# 1. where() - Keep values where condition is True
# Replace False values with NaN
result = df.where(df > 20)
print("where (keep > 20):")
print(result)
# 2. where() with replacement value
result = df.where(df > 20, 0)
print("\nwhere with 0:")
print(result)
# 3. mask() - Replace values where condition is True
# Opposite of where()
result = df.mask(df > 20)
print("\nmask (replace > 20 with NaN):")
print(result)
# 4. mask() with replacement value
result = df.mask(df > 20, 0)
print("\nmask with 0:")
print(result)
# 5. Practical Example: Outlier Capping
data = pd.DataFrame({
'values': [1, 5, 10, 100, 15, 200, 8, 300]
})
# Cap at 50
capped = data['values'].where(data['values'] <= 50, 50)
print("\nCapped at 50:")
print(capped)
# Or using mask
capped2 = data['values'].mask(data['values'] > 50, 50)
print("\nSame with mask:")
print(capped2)
# 6. Element-wise Operations
df = pd.DataFrame({
'temperature': [-5, 0, 10, 25, 35, 42],
'humidity': [20, 40, 60, 80, 90, 95]
})
# Replace extreme temperatures
df['temp_adjusted'] = df['temperature'].where(
(df['temperature'] >= 0) & (df['temperature'] <= 40)
)
print("\nTemperature filtering:")
print(df)
# 7. Using Functions
# where/mask can take callable
result = df['temperature'].where(lambda x: x > 0)
print("\nUsing lambda:")
print(result)
# 8. Multiple Conditions
sales = pd.DataFrame({
'product': ['A', 'B', 'C', 'D', 'E'],
'quantity': [10, 50, 100, 200, 5],
'price': [10, 20, 15, 25, 30]
})
# Flag high-value sales
sales['category'] = 'Normal'
sales['category'] = sales['category'].where(
sales['quantity'] * sales['price'] < 1000,
'High Value'
)
print("\nSales categories:")
print(sales)
# 9. vs Boolean Indexing
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50]
})
# Boolean indexing (modifies subset)
df_bool = df.copy()
df_bool.loc[df_bool['A'] > 3, 'B'] = 0
print("\nBoolean indexing:")
print(df_bool)
# where (returns new, preserves shape)
df_where = df.copy()
df_where['B'] = df_where['B'].where(df_where['A'] <= 3, 0)
print("\nWith where:")
print(df_where)
# 10. Practical: Data Cleaning
data = pd.DataFrame({
'age': [-5, 25, 150, 30, 200, 45],
'salary': [50000, -10000, 80000, 90000, 1000000, 75000]
})
# Valid ranges
data['age_clean'] = data['age'].where(
(data['age'] >= 0) & (data['age'] <= 120)
)
data['salary_clean'] = data['salary'].where(
(data['salary'] >= 0) & (data['salary'] <= 500000)
)
print("\nData cleaning:")
print(data)
# 11. Winsorization
# Cap extreme values
data = pd.Series([1, 2, 3, 4, 5, 100, 200, 300])
lower = data.quantile(0.05)
upper = data.quantile(0.95)
winsorized = (data
.where(data >= lower, lower)
.where(data <= upper, upper))
print("\nWinsorized:")
print(winsorized)
# 12. Time Series Example
dates = pd.date_range('2024-01-01', periods=10, freq='D')
ts = pd.DataFrame({
'date': dates,
'value': [10, 50, 100, 200, 150, 180, 90, 70, 60, 55]
})
# Flag anomalies
mean = ts['value'].mean()
std = ts['value'].std()
ts['anomaly'] = 'Normal'
ts['anomaly'] = ts['anomaly'].where(
abs(ts['value'] - mean) <= 2 * std,
'Anomaly'
)
print("\nAnomaly detection:")
print(ts)
# 13. Performance Note
# where/mask are efficient (vectorized)
large_df = pd.DataFrame(np.random.randn(100000, 3), columns=['A', 'B', 'C'])
import time
# where()
start = time.time()
_ = large_df.where(large_df > 0, 0)
where_time = time.time() - start
# Boolean indexing (slower for full replacement)
start = time.time()
temp = large_df.copy()
temp[temp < 0] = 0
bool_time = time.time() - start
print(f"\nwhere(): {where_time:.4f}s")
print(f"Boolean: {bool_time:.4f}s")
where() vs mask():
| Method | Behavior | Equivalent |
|---|---|---|
| where(cond) | Keep True, replace False | df[cond] |
| mask(cond) | Replace True, keep False | df[~cond] |
Use Cases:
| Use Case | Method | Example |
|---|---|---|
| Outlier capping | where | s.where(s <= 100, 100) |
| Filter & replace | where | df.where(df > 0, 0) |
| Anomaly removal | mask | df.mask(is_anomaly) |
| Winsorization | where (twice) | Cap both tails |
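For plain capping, clip() is a simpler alternative to chaining where() twice; a small sketch equivalent to the winsorization example above:
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5, 100, 200, 300])
lower, upper = data.quantile(0.05), data.quantile(0.95)
print(data.clip(lower=lower, upper=upper))  # same result as the two where() calls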
Interview Insights:
Interviewer's Insight
- "where(): keep True, replace False with other"
- "mask(): replace True with other, keep False"
- "Default other: NaN"
- "where(cond, value): replace False with value"
- "Preserves DataFrame shape (vs boolean indexing)"
- "Vectorized: efficient on large data"
- "Use for: outlier capping, data cleaning, winsorization"
- "Can take callable:
where(lambda x: x > 0)" - "mask() = where(~cond)"
Explain nlargest() and nsmallest() - Amazon, Netflix Interview Question
Difficulty: π’ Easy | Tags: Selection, Sorting, Top-N | Asked by: Amazon, Netflix, Google, Meta
View Answer
nlargest() and nsmallest() select top/bottom N rows by column(s). Faster than sorting for small N. Useful for top products, customers, values.
Complete Examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D', 'E', 'F'],
'sales': [1000, 5000, 3000, 8000, 2000, 6000],
'profit': [200, 1000, 600, 1500, 400, 1200],
'region': ['North', 'South', 'North', 'East', 'West', 'South']
})
# 1. Top 3 by Sales
top_sales = df.nlargest(3, 'sales')
print("Top 3 by sales:")
print(top_sales)
# 2. Bottom 3 by Sales
bottom_sales = df.nsmallest(3, 'sales')
print("\nBottom 3 by sales:")
print(bottom_sales)
# 3. Multiple Columns (Tiebreaker)
# First by sales, then by profit
top_multi = df.nlargest(3, ['sales', 'profit'])
print("\nTop 3 by sales, then profit:")
print(top_multi)
# 4. vs sort_values + head
import time
# Large dataset
large_df = pd.DataFrame({
'value': np.random.randn(1000000)
})
# nlargest (fast)
start = time.time()
_ = large_df.nlargest(10, 'value')
nlargest_time = time.time() - start
# sort + head (slow)
start = time.time()
_ = large_df.sort_values('value', ascending=False).head(10)
sort_time = time.time() - start
print(f"\nnlargest: {nlargest_time:.4f}s")
print(f"sort + head: {sort_time:.4f}s ({sort_time/nlargest_time:.1f}x slower)")
# 5. Series nlargest/nsmallest
sales_series = pd.Series([100, 500, 300, 800, 200],
index=['A', 'B', 'C', 'D', 'E'])
print("\nTop 3 products:")
print(sales_series.nlargest(3))
print("\nBottom 2 products:")
print(sales_series.nsmallest(2))
# 6. Practical: Top Customers
customers = pd.DataFrame({
'customer_id': range(1, 101),
'total_spent': np.random.randint(100, 10000, 100),
'orders': np.random.randint(1, 50, 100),
'last_order_days': np.random.randint(0, 365, 100)
})
# Top 10 spenders
top_customers = customers.nlargest(10, 'total_spent')
print("\nTop 10 customers by spending:")
print(top_customers[['customer_id', 'total_spent', 'orders']])
# Most frequent buyers
frequent_buyers = customers.nlargest(10, 'orders')
print("\nTop 10 frequent buyers:")
print(frequent_buyers[['customer_id', 'orders', 'total_spent']])
# 7. GroupBy + nlargest
# Top 2 products per region
top_per_region = df.groupby('region').apply(
lambda x: x.nlargest(2, 'sales')
).reset_index(drop=True)
print("\nTop 2 products per region:")
print(top_per_region)
# 8. Practical: Stock Analysis
stocks = pd.DataFrame({
'ticker': ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'TSLA', 'META'],
'price': [175, 140, 380, 145, 250, 350],
'volume': [50000000, 25000000, 30000000, 45000000, 80000000, 20000000],
'change_pct': [2.5, -1.2, 1.8, 0.5, 5.2, -2.1]
})
# Top gainers
print("\nTop 3 gainers:")
print(stocks.nlargest(3, 'change_pct')[['ticker', 'change_pct']])
# Top losers
print("\nTop 3 losers:")
print(stocks.nsmallest(3, 'change_pct')[['ticker', 'change_pct']])
# Most traded
print("\nMost traded:")
print(stocks.nlargest(3, 'volume')[['ticker', 'volume']])
# 9. Keep Parameter
# Handle duplicates
df_dup = pd.DataFrame({
'name': ['A', 'B', 'C', 'D', 'E'],
'score': [10, 20, 20, 15, 10]
})
print("\nOriginal with duplicates:")
print(df_dup)
# keep='first' (default)
print("\nTop 2 (keep='first'):")
print(df_dup.nlargest(2, 'score', keep='first'))
# keep='last'
print("\nTop 2 (keep='last'):")
print(df_dup.nlargest(2, 'score', keep='last'))
# keep='all' (may return more than n)
print("\nTop 2 (keep='all'):")
print(df_dup.nlargest(2, 'score', keep='all'))
# 10. Real-World: Sales Leaderboard
sales_data = pd.DataFrame({
'salesperson': [f'Person_{i}' for i in range(1, 51)],
'q1_sales': np.random.randint(10000, 50000, 50),
'q2_sales': np.random.randint(10000, 50000, 50),
'region': np.random.choice(['North', 'South', 'East', 'West'], 50)
})
sales_data['total_sales'] = sales_data['q1_sales'] + sales_data['q2_sales']
# Overall top 10
leaderboard = sales_data.nlargest(10, 'total_sales')
print("\nTop 10 salespeople:")
print(leaderboard[['salesperson', 'total_sales', 'region']])
# Top 3 per region
regional_leaders = sales_data.groupby('region').apply(
lambda x: x.nlargest(3, 'total_sales')
).reset_index(drop=True)
print("\nTop 3 per region:")
print(regional_leaders[['region', 'salesperson', 'total_sales']])
Performance Comparison:
| Dataset Size | Top N | nlargest() | sort_values() + head() |
|---|---|---|---|
| 10K rows | 10 | 0.001s | 0.005s (5x slower) |
| 100K rows | 10 | 0.005s | 0.050s (10x slower) |
| 1M rows | 10 | 0.05s | 0.5s (10x slower) |
| 1M rows | 100 | 0.08s | 0.5s (6x slower) |
When to Use:
| Scenario | Best Method |
|---|---|
| Top/bottom N (N << size) | nlargest/nsmallest |
| Need all sorted | sort_values |
| Multiple columns | Both work, nlargest faster |
| Very small N (<100) | nlargest (much faster) |
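Besides groupby().apply with a lambda, SeriesGroupBy exposes nlargest directly; it returns a Series with a (group, original index) MultiIndex. A small sketch:
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'South'],
    'sales': [100, 300, 250, 400, 150]
})
top2 = df.groupby('region')['sales'].nlargest(2)
print(top2)  # MultiIndex (region, original row index) -> sales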
Interview Insights:
Interviewer's Insight
- "nlargest(n, 'col'): top N rows by column"
- "nsmallest(n, 'col'): bottom N rows"
- "Faster than sort_values + head for small N"
- "Multiple columns:
nlargest(n, ['col1', 'col2'])" - "keep parameter: 'first', 'last', 'all' for ties"
- "Works on Series and DataFrame"
- "Use for: top products, customers, leaderboards"
- "O(n log k) vs O(n log n) for sorting"
- "GroupBy + nlargest: top N per group"
Explain Pandas rank() Method - Microsoft, Google Interview Question
Difficulty: 🟡 Medium | Tags: Ranking, Statistics, Order | Asked by: Microsoft, Google, Amazon
View Answer
rank() assigns ranks to values in Series/DataFrame. Supports multiple methods for handling ties (average, min, max, first, dense). Essential for leaderboards, percentiles, and relative comparisons.
Complete Examples:
import pandas as pd
import numpy as np
# Sample data with ties
s = pd.Series([100, 200, 200, 300, 400, 400, 400])
# 1. Default Ranking (average for ties)
print("Original values:")
print(s.values)
print("\nDefault rank (average):")
print(s.rank())
# 2. Different Ranking Methods
methods = ['average', 'min', 'max', 'first', 'dense']
df = pd.DataFrame({'value': s})
for method in methods:
    df[f'rank_{method}'] = s.rank(method=method)
print("\nAll ranking methods:")
print(df)
# 3. Ascending vs Descending
scores = pd.Series([85, 92, 78, 92, 95])
print("\nScores:")
print(scores.values)
# ascending=True (default): 1 = smallest
print("\nRank ascending (1 = lowest):")
print(scores.rank())
# ascending=False: 1 = largest
print("\nRank descending (1 = highest):")
print(scores.rank(ascending=False))
# 4. Percentile Ranks
# pct=True: ranks from 0 to 1
print("\nPercentile ranks:")
print(scores.rank(pct=True))
# 5. Practical: Student Grades
students = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'math': [85, 92, 78, 92, 95],
'english': [88, 85, 90, 82, 95]
})
# Rank by math (1 = best)
students['math_rank'] = students['math'].rank(ascending=False, method='min')
students['english_rank'] = students['english'].rank(ascending=False, method='min')
# Overall rank (sum of ranks, lower is better)
students['overall_rank'] = (students['math_rank'] + students['english_rank']).rank()
print("\nStudent rankings:")
print(students)
# 6. GroupBy Ranking
sales = pd.DataFrame({
'region': ['North', 'North', 'South', 'South', 'East', 'East'],
'salesperson': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'sales': [1000, 1500, 800, 1200, 900, 1100]
})
# Rank within each region
sales['rank_in_region'] = sales.groupby('region')['sales'].rank(ascending=False)
print("\nRanking within regions:")
print(sales)
# 7. Handling NaN
s_nan = pd.Series([100, np.nan, 200, 300, np.nan])
# na_option='keep' (default): NaN stays NaN
print("\nWith NaN (keep):")
print(s_nan.rank())
# na_option='top': NaN gets smallest rank
print("\nWith NaN (top):")
print(s_nan.rank(na_option='top'))
# na_option='bottom': NaN gets largest rank
print("\nWith NaN (bottom):")
print(s_nan.rank(na_option='bottom'))
# 8. Practical: Stock Performance
stocks = pd.DataFrame({
'ticker': ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'TSLA'],
'return_1m': [5.2, -1.5, 3.8, 2.1, 8.5],
'return_6m': [12.5, 8.2, 15.1, 10.3, 25.2],
'volatility': [0.25, 0.30, 0.22, 0.28, 0.45]
})
# Rank returns (higher is better)
stocks['return_1m_rank'] = stocks['return_1m'].rank(ascending=False)
stocks['return_6m_rank'] = stocks['return_6m'].rank(ascending=False)
# Rank volatility (lower is better)
stocks['vol_rank'] = stocks['volatility'].rank(ascending=True)
# Combined score (lower is better)
stocks['combined_rank'] = (
stocks['return_1m_rank'] +
stocks['return_6m_rank'] +
stocks['vol_rank']
).rank()
print("\nStock rankings:")
print(stocks)
# 9. Quantile-based Ranking
values = pd.Series(np.random.randn(100))
# Percentile ranks
values_df = pd.DataFrame({
'value': values,
'percentile': values.rank(pct=True),
'quartile': pd.qcut(values.rank(pct=True), q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
})
print("\nQuantile-based ranking:")
print(values_df.head(10))
# 10. Leaderboard Example
leaderboard = pd.DataFrame({
'player': [f'Player_{i}' for i in range(1, 11)],
'score': [1000, 1500, 1200, 1500, 1800, 900, 1500, 1200, 1000, 2000]
})
# Dense ranking (no gaps)
leaderboard['rank_dense'] = leaderboard['score'].rank(method='dense', ascending=False)
# Min ranking (standard)
leaderboard['rank_standard'] = leaderboard['score'].rank(method='min', ascending=False)
leaderboard = leaderboard.sort_values('score', ascending=False)
print("\nLeaderboard:")
print(leaderboard)
# 11. Performance Comparison
large_series = pd.Series(np.random.randn(1000000))
import time
start = time.time()
_ = large_series.rank()
rank_time = time.time() - start
start = time.time()
_ = large_series.sort_values().reset_index(drop=True)
sort_time = time.time() - start
print(f"\nrank(): {rank_time:.4f}s")
print(f"sort: {sort_time:.4f}s")
Ranking Methods:
| Method | Description | Example [100, 200, 200, 300] |
|---|---|---|
| average | Average of ranks | [1.0, 2.5, 2.5, 4.0] |
| min | Minimum rank | [1, 2, 2, 4] |
| max | Maximum rank | [1, 3, 3, 4] |
| first | Order they appear | [1, 2, 3, 4] |
| dense | No gaps | [1, 2, 2, 3] |
Key Parameters:
| Parameter | Options | Use Case |
|---|---|---|
| method | average, min, max, first, dense | Tie handling |
| ascending | True, False | Rank direction |
| pct | True, False | Percentile ranks (0-1) |
| na_option | keep, top, bottom | NaN handling |
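For candidates coming from SQL, the usual (approximate) mapping is ROW_NUMBER() ≈ rank(method='first'), RANK() ≈ rank(method='min'), DENSE_RANK() ≈ rank(method='dense'). A minimal sketch with a hypothetical sales table:
import pandas as pd
sql_like = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South'],
    'sales': [100, 200, 200, 150, 150]
})
grp = sql_like.groupby('region')['sales']
sql_like['row_number'] = grp.rank(method='first', ascending=False)  # ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC)
sql_like['rank'] = grp.rank(method='min', ascending=False)          # RANK()
sql_like['dense_rank'] = grp.rank(method='dense', ascending=False)  # DENSE_RANK()
print(sql_like)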
Interview Insights:
Interviewer's Insight
- "rank(): assign ranks to values"
- "ascending=False: 1 = highest (for scores)"
- "method='min': standard ranking (gaps for ties)"
- "method='dense': no gaps (1, 2, 2, 3)"
- "method='first': order of appearance"
- "pct=True: percentile ranks (0.0 to 1.0)"
- "Use with groupby(): rank within groups"
- "na_option: handle NaN ('keep', 'top', 'bottom')"
- "Use for: leaderboards, percentiles, relative comparisons"
Explain pd.cut() vs pd.qcut() - Amazon, Google Interview Question
Difficulty: 🟡 Medium | Tags: Binning, Discretization, Feature Engineering | Asked by: Amazon, Google, Meta, Netflix
View Answer
cut() bins data into equal-width intervals. qcut() bins into equal-sized quantiles. cut() for fixed ranges, qcut() for balanced distribution.
Complete Examples:
import pandas as pd
import numpy as np
ages = pd.Series([5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80])
# 1. pd.cut() - Equal-Width Bins
# Bins have same width, different counts
age_bins = pd.cut(ages, bins=4)
print("cut() - Equal width:")
print(age_bins)
print("\nValue counts:")
print(age_bins.value_counts().sort_index())
# 2. pd.qcut() - Equal-Frequency Bins
# Bins have different widths, same counts
age_quantiles = pd.qcut(ages, q=4)
print("\nqcut() - Equal frequency:")
print(age_quantiles)
print("\nValue counts:")
print(age_quantiles.value_counts().sort_index())
# 3. Custom Bin Edges (cut)
custom_bins = pd.cut(ages, bins=[0, 18, 35, 60, 100],
labels=['Child', 'Young Adult', 'Adult', 'Senior'])
print("\nCustom bins:")
print(pd.DataFrame({'age': ages, 'category': custom_bins}))
# 4. Percentile-based (qcut)
# Quartiles
quartiles = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print("\nQuartiles:")
print(pd.DataFrame({'age': ages, 'quartile': quartiles}))
# 5. Practical: Income Brackets
incomes = pd.Series(np.random.randint(20000, 200000, 100))
# Equal-width bins
income_cut = pd.cut(incomes, bins=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
print("\nIncome bins (equal width):")
print(income_cut.value_counts())
# Equal-frequency bins (balanced)
income_qcut = pd.qcut(incomes, q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
print("\nIncome bins (equal frequency):")
print(income_qcut.value_counts())
# 6. Include/Exclude Boundaries
scores = pd.Series([0, 25, 50, 75, 100])
# Right edge included (default): bins are (0, 50] and (50, 100]
bins_right = pd.cut(scores, bins=[0, 50, 100], right=True)
print("\nRight edge included:")
print(bins_right)  # note: 0 falls outside (0, 50] and becomes NaN unless include_lowest=True
# Left edge included: bins are [0, 50) and [50, 100)
bins_left = pd.cut(scores, bins=[0, 50, 100], right=False)
print("\nLeft edge included:")
print(bins_left)  # note: 100 falls outside [50, 100) and becomes NaN
# 7. Retbins - Get Bin Edges
data = pd.Series(np.random.randn(100))
# Get bins used
binned, bin_edges = pd.cut(data, bins=5, retbins=True)
print("\nBin edges:")
print(bin_edges)
# 8. Practical: Credit Score Categories
credit_scores = pd.Series(np.random.randint(300, 850, 1000))
# Standard credit score ranges
credit_bins = pd.cut(
    credit_scores,
    bins=[300, 580, 670, 740, 800, 850],
    labels=['Poor', 'Fair', 'Good', 'Very Good', 'Excellent'],
    include_lowest=True  # keep scores of exactly 300 in 'Poor' instead of NaN
)
print("\nCredit score distribution:")
print(credit_bins.value_counts().sort_index())
# 9. Handling Duplicates in qcut
# qcut can fail when so many values are identical that quantile edges collide
data_dup = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 4, 5])
try:
    pd.qcut(data_dup, q=4)
except ValueError as e:
    print(f"\nqcut error: {e}")
# Solution: duplicates='drop'
binned_dup = pd.qcut(data_dup, q=4, duplicates='drop')
print("\nWith duplicates='drop':")
print(binned_dup.value_counts())
# 10. Machine Learning Feature Engineering
df = pd.DataFrame({
'customer_id': range(1, 101),
'age': np.random.randint(18, 80, 100),
'income': np.random.randint(20000, 150000, 100),
'spend': np.random.randint(100, 5000, 100)
})
# Create features
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['Young', 'Middle', 'Senior'])
df['income_quartile'] = pd.qcut(df['income'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
df['spend_level'] = pd.qcut(df['spend'], q=3, labels=['Low', 'Medium', 'High'])
print("\nFeature engineering:")
print(df.head(10))
# Distribution
print("\nFeature distributions:")
print(df[['age_group', 'income_quartile', 'spend_level']].apply(lambda x: x.value_counts()))
cut() vs qcut():
| Aspect | cut() | qcut() |
|---|---|---|
| Bins | Equal width | Equal frequency |
| Use Case | Fixed ranges | Balanced groups |
| Distribution | Uneven | Even |
| Example | Age groups | Quartiles |
| Custom Edges | Yes | No (only q) |
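The "Uneven vs Even" distribution row is easiest to see on skewed data; a small sketch (the exponential sample below is just an assumed example):
import pandas as pd
import numpy as np
skewed = pd.Series(np.random.exponential(scale=1.0, size=1000))
print(pd.cut(skewed, bins=4).value_counts().sort_index())   # equal-width bins: most rows pile into the first bin
print(pd.qcut(skewed, q=4).value_counts().sort_index())     # equal-frequency bins: ~250 rows per bin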
Common Patterns:
| Task | Solution |
|---|---|
| Age groups | cut(ages, [0, 18, 65, 100]) |
| Quartiles | qcut(data, q=4) |
| Credit scores | cut(scores, [300, 580, 670, 740, 800, 850]) |
| Income percentiles | qcut(income, q=10) (deciles) |
| Equal groups for ML | qcut(feature, q=5) |
Interview Insights:
Interviewer's Insight
- "cut(): equal-width bins, fixed ranges"
- "qcut(): equal-frequency bins, quantiles"
- "cut() bins can have different counts"
- "qcut() bins have same counts (balanced)"
- "cut() for: age groups, custom ranges"
- "qcut() for: quartiles, percentiles, balanced features"
- "labels: custom category names"
- "retbins=True: return bin edges"
- "qcut with duplicates: use duplicates='drop'"
- "Use for: feature engineering, binning, discretization"
Explain Pandas cumsum(), cumprod(), cummax(), cummin() - Meta, Microsoft Interview Question
Difficulty: 🟢 Easy | Tags: Cumulative Operations, Time Series, Running Totals | Asked by: Meta, Microsoft, Amazon, Google
View Answer
Cumulative functions compute running totals, products, max, min. Essential for time series analysis, running totals, cumulative statistics, and financial calculations.
Complete Examples:
import pandas as pd
import numpy as np
data = pd.Series([1, 2, 3, 4, 5])
# 1. cumsum() - Cumulative Sum
print("Original:")
print(data.values)
print("\ncumsum():")
print(data.cumsum().values)
# [1, 3, 6, 10, 15]
# 2. cumprod() - Cumulative Product
print("\ncumprod():")
print(data.cumprod().values)
# [1, 2, 6, 24, 120]
# 3. cummax() - Cumulative Maximum
data_random = pd.Series([3, 1, 4, 1, 5, 2, 6])
print("\nOriginal:")
print(data_random.values)
print("\ncummax():")
print(data_random.cummax().values)
# [3, 3, 4, 4, 5, 5, 6]
# 4. cummin() - Cumulative Minimum
print("\ncummin():")
print(data_random.cummin().values)
# [3, 1, 1, 1, 1, 1, 1]
# 5. Practical: Running Total Sales
sales = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=7, freq='D'),
'daily_sales': [100, 150, 120, 180, 200, 160, 190]
})
sales['cumulative_sales'] = sales['daily_sales'].cumsum()
print("\nRunning sales total:")
print(sales)
# 6. Financial: Cumulative Returns
returns = pd.Series([0.05, -0.02, 0.03, 0.01, -0.01, 0.04])
# Cumulative product for compound returns
cumulative_return = (1 + returns).cumprod() - 1
print("\nDaily returns:")
print(returns.values)
print("\nCumulative returns:")
print(cumulative_return.values)
# 7. Peak Detection (cummax)
stock_prices = pd.Series([100, 105, 103, 108, 110, 107, 115, 112])
df = pd.DataFrame({
'price': stock_prices,
'peak': stock_prices.cummax()
})
df['drawdown'] = (df['price'] - df['peak']) / df['peak']
print("\nDrawdown analysis:")
print(df)
# 8. GroupBy Cumulative
orders = pd.DataFrame({
'customer': ['A', 'A', 'B', 'A', 'B', 'B'],
'amount': [100, 200, 150, 300, 100, 200]
})
orders['cumulative_per_customer'] = orders.groupby('customer')['amount'].cumsum()
print("\nCumulative per customer:")
print(orders)
# 9. Time Series: Running Metrics
ts = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=10, freq='D'),
'value': np.random.randint(10, 50, 10)
})
ts['running_total'] = ts['value'].cumsum()
ts['running_avg'] = ts['value'].expanding().mean()
ts['running_max'] = ts['value'].cummax()
ts['running_min'] = ts['value'].cummin()
print("\nTime series running metrics:")
print(ts)
# 10. Practical: Inventory Tracking
inventory = pd.DataFrame({
'transaction': ['Sale', 'Purchase', 'Sale', 'Sale', 'Purchase', 'Sale'],
'quantity': [-10, 50, -15, -20, 40, -12]
})
inventory['stock_level'] = 100 + inventory['quantity'].cumsum()
print("\nInventory levels:")
print(inventory)
# 11. Year-to-Date Calculations
monthly = pd.DataFrame({
'month': pd.date_range('2024-01-01', periods=12, freq='MS'),
'revenue': np.random.randint(50000, 100000, 12)
})
monthly['ytd_revenue'] = monthly['revenue'].cumsum()
print("\nYear-to-date revenue:")
print(monthly)
# 12. Performance Comparison
large_series = pd.Series(np.random.randn(1000000))
import time
# cumsum (fast, O(n))
start = time.time()
_ = large_series.cumsum()
cumsum_time = time.time() - start
# Manual cumsum (slow)
start = time.time()
result = []
total = 0
for val in large_series:
    total += val
    result.append(total)
manual_time = time.time() - start
print(f"\ncumsum(): {cumsum_time:.4f}s")
print(f"Manual: {manual_time:.4f}s ({manual_time/cumsum_time:.1f}x slower)")
Cumulative Functions:
| Function | Description | Example [1,2,3,4] |
|---|---|---|
| cumsum() | Running sum | [1, 3, 6, 10] |
| cumprod() | Running product | [1, 2, 6, 24] |
| cummax() | Running maximum | [1, 2, 3, 4] |
| cummin() | Running minimum | [1, 1, 1, 1] |
Common Use Cases:
| Use Case | Function | Example |
|---|---|---|
| Running totals | cumsum() | Sales YTD |
| Compound returns | cumprod() | Investment growth |
| Peak tracking | cummax() | All-time high |
| Drawdown | price - cummax() | Portfolio decline |
| Inventory | cumsum() | Stock levels |
| Streak tracking | Custom with cumsum (see sketch below) | Winning streaks |
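The streak-tracking row is not covered in the examples above; a minimal sketch of the common shift-compare-cumsum grouping trick (the win/loss series is hypothetical):
import pandas as pd
results = pd.Series(['W', 'W', 'L', 'W', 'W', 'W', 'L'])   # win/loss sequence
streak_id = (results != results.shift()).cumsum()          # new id every time the value changes
streak_len = results.groupby(streak_id).cumcount() + 1     # running length within the current streak
print(pd.DataFrame({'result': results, 'streak_id': streak_id, 'streak_len': streak_len}))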
Interview Insights:
Interviewer's Insight
- "cumsum(): running sum/total"
- "cumprod(): running product (compound growth)"
- "cummax(): running maximum (peak)"
- "cummin(): running minimum (trough)"
- "Use with groupby(): cumulative per group"
- "Financial: cumprod() for compound returns"
- "Drawdown: price - cummax()"
- "O(n) complexity: efficient"
- "Common in: time series, finance, inventory"
Explain pd.concat() axis Parameter - Google, Amazon Interview Question
Difficulty: 🟡 Medium | Tags: Concatenation, Joining, Data Combination | Asked by: Google, Amazon, Meta, Microsoft
View Answer
concat() axis parameter controls concatenation direction. axis=0 (default) stacks vertically (rows). axis=1 stacks horizontally (columns). Essential for combining datasets.
Complete Examples:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
df2 = pd.DataFrame({
'A': [7, 8, 9],
'B': [10, 11, 12]
})
df3 = pd.DataFrame({
'C': [13, 14, 15],
'D': [16, 17, 18]
})
# 1. axis=0 (Vertical Stacking - Rows)
result_v = pd.concat([df1, df2], axis=0)
print("Vertical concatenation (axis=0):")
print(result_v)
# Reset index
result_v_reset = pd.concat([df1, df2], axis=0, ignore_index=True)
print("\nWith ignore_index=True:")
print(result_v_reset)
# 2. axis=1 (Horizontal Stacking - Columns)
result_h = pd.concat([df1, df3], axis=1)
print("\nHorizontal concatenation (axis=1):")
print(result_h)
# 3. Mismatched Indexes
df_a = pd.DataFrame({'X': [1, 2, 3]}, index=['a', 'b', 'c'])
df_b = pd.DataFrame({'Y': [4, 5, 6]}, index=['b', 'c', 'd'])
# axis=1: outer join (default)
result = pd.concat([df_a, df_b], axis=1)
print("\nMismatched indexes (axis=1):")
print(result)
# axis=1 with inner join
result_inner = pd.concat([df_a, df_b], axis=1, join='inner')
print("\nWith join='inner':")
print(result_inner)
# 4. Multiple DataFrames
df_list = [
pd.DataFrame({'value': [1, 2]}),
pd.DataFrame({'value': [3, 4]}),
pd.DataFrame({'value': [5, 6]})
]
# Vertical
combined_v = pd.concat(df_list, axis=0, ignore_index=True)
print("\nMultiple DataFrames (vertical):")
print(combined_v)
# 5. Practical: Combining Regional Data
north = pd.DataFrame({
'product': ['A', 'B', 'C'],
'sales': [100, 200, 150]
})
south = pd.DataFrame({
'product': ['A', 'B', 'C'],
'sales': [120, 180, 160]
})
# Vertical (add rows)
all_regions_v = pd.concat([north, south], axis=0, keys=['North', 'South'])
print("\nRegional data (vertical):")
print(all_regions_v)
# Horizontal (add columns)
all_regions_h = pd.concat(
[north.set_index('product')['sales'],
south.set_index('product')['sales']],
axis=1,
keys=['North', 'South']
)
print("\nRegional data (horizontal):")
print(all_regions_h)
# 6. Keys Parameter
q1 = pd.DataFrame({'revenue': [100, 200]})
q2 = pd.DataFrame({'revenue': [150, 250]})
quarterly = pd.concat([q1, q2], axis=0, keys=['Q1', 'Q2'])
print("\nWith keys (MultiIndex):")
print(quarterly)
# 7. Names Parameter
quarterly_named = pd.concat(
[q1, q2],
axis=0,
keys=['Q1', 'Q2'],
names=['Quarter', 'ID']
)
print("\nWith names:")
print(quarterly_named)
# 8. Practical: Time Series Concatenation
jan = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=5, freq='D'),
'value': [10, 12, 11, 13, 14]
})
feb = pd.DataFrame({
'date': pd.date_range('2024-02-01', periods=5, freq='D'),
'value': [15, 16, 14, 17, 18]
})
full_data = pd.concat([jan, feb], axis=0, ignore_index=True)
print("\nTime series concatenation:")
print(full_data)
# 9. Verify No Data Loss
def safe_concat(dfs, axis):
    """Concat with verification."""
    result = pd.concat(dfs, axis=axis)
    if axis == 0:
        expected_rows = sum(len(df) for df in dfs)
        assert len(result) == expected_rows, "Row count mismatch!"
    else:
        expected_cols = sum(len(df.columns) for df in dfs)
        assert len(result.columns) == expected_cols, "Column count mismatch!"
    return result
result = safe_concat([df1, df2], axis=0)
print(f"\nSafe concat: {result.shape}")
# 10. Performance Comparison
dfs = [pd.DataFrame({'A': range(1000)}) for _ in range(100)]
import time
# concat (fast)
start = time.time()
_ = pd.concat(dfs, axis=0)
concat_time = time.time() - start
# append (deprecated in pandas 1.4, removed in 2.0 - and much slower)
# result = dfs[0]
# for df in dfs[1:]:
#     result = result.append(df)
print(f"\nconcat time: {concat_time:.4f}s")
print("(DataFrame.append was removed in pandas 2.0 - always use concat)")
axis Parameter:
| axis | Direction | Stacking | Use Case |
|---|---|---|---|
| 0 | Vertical | Add rows | Combine datasets |
| 1 | Horizontal | Add columns | Add features |
Common Parameters:
| Parameter | Purpose | Example |
|---|---|---|
| axis | Direction (0=rows, 1=cols) | axis=0 |
| ignore_index | Reset index | ignore_index=True |
| keys | Add MultiIndex | keys=['A', 'B'] |
| join | 'outer' or 'inner' | join='inner' |
| names | Name MultiIndex levels | names=['level1'] |
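One parameter worth adding to this list: verify_integrity=True makes concat raise when the combined index contains duplicate labels, which catches accidental double-loads. A minimal sketch:
import pandas as pd
a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'x': [3, 4]}, index=[1, 2])   # index label 1 overlaps with a
try:
    pd.concat([a, b], verify_integrity=True)
except ValueError as e:
    print(f"Duplicate index caught: {e}")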
Interview Insights:
Interviewer's Insight
- "axis=0: vertical stacking (add rows)"
- "axis=1: horizontal stacking (add columns)"
- "ignore_index=True: reset index"
- "join='outer': keep all (default)"
- "join='inner': keep common only"
- "keys: add MultiIndex for source tracking"
- "Use for: combining datasets, regional data"
- "axis=0 + ignore_index: clean row numbers"
- "axis=1: must have compatible indexes"
- "Prefer concat over append (deprecated)"
Explain Pandas Sparse Data Structures - Netflix, Google Interview Question
Difficulty: 🟡 Medium | Tags: Memory Optimization, Sparse Arrays, Performance | Asked by: Netflix, Google, Amazon
View Answer
Sparse arrays store only non-zero/non-default values, saving memory for mostly-empty data. Essential for one-hot encoding, matrices with many zeros, and large categorical features.
Complete Examples:
import pandas as pd
import numpy as np
# 1. Create Sparse Array
# Most values are 0 (or another fill_value)
dense = pd.Series([0, 0, 1, 0, 0, 2, 0, 0, 0, 3])
# Convert to sparse
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))
print("Dense memory:")
print(f"{dense.memory_usage(deep=True):,} bytes")
print("\nSparse memory:")
print(f"{sparse.memory_usage(deep=True):,} bytes")
print(f"\nMemory savings: {(1 - sparse.memory_usage(deep=True)/dense.memory_usage(deep=True))*100:.1f}%")
# 2. Sparse DataFrame
# One-hot encoded data (mostly zeros)
df = pd.DataFrame({
'A': pd.Categorical(['cat', 'dog', 'cat', 'bird'] * 1000)
})
# Dense one-hot encoding
dense_dummies = pd.get_dummies(df['A'])
# Sparse one-hot encoding
sparse_dummies = pd.get_dummies(df['A'], sparse=True)
print("\nOne-hot encoding:")
print(f"Dense: {dense_dummies.memory_usage(deep=True).sum():,} bytes")
print(f"Sparse: {sparse_dummies.memory_usage(deep=True).sum():,} bytes")
print(f"Savings: {(1 - sparse_dummies.memory_usage(deep=True).sum()/dense_dummies.memory_usage(deep=True).sum())*100:.1f}%")
# 3. SparseArray Properties
sparse_arr = pd.arrays.SparseArray([0, 0, 1, 2, 0, 0, 0, 3, 0])
print("\nSparse array:")
print(f"Values: {sparse_arr}")
print(f"Density: {sparse_arr.density:.2%}")
print(f"Fill value: {sparse_arr.fill_value}")
print(f"Non-zero count: {sparse_arr.npoints}")
# 4. Operations on Sparse Data
s1 = pd.Series([0, 0, 1, 0, 2]).astype(pd.SparseDtype("int64", 0))
s2 = pd.Series([0, 1, 0, 0, 3]).astype(pd.SparseDtype("int64", 0))
# Arithmetic (stays sparse)
result = s1 + s2
print("\nArithmetic on sparse:")
print(result)
print(f"Result is sparse: {isinstance(result.dtype, pd.SparseDtype)}")
# 5. Practical: User-Item Matrix
# Typical recommendation system matrix (very sparse)
users = 10000
items = 1000
# Simulate ratings (only 1% have ratings)
n_ratings = int(users * items * 0.01)
user_ids = np.random.randint(0, users, n_ratings)
item_ids = np.random.randint(0, items, n_ratings)
ratings = np.random.randint(1, 6, n_ratings)
# Dense matrix (BAD for sparse data)
# Would need: users * items * 8 bytes = 80 MB
# Sparse representation (GOOD)
sparse_df = pd.DataFrame({
'user_id': user_ids,
'item_id': item_ids,
'rating': ratings
})
print(f"\nSparse ratings:")
print(f"Memory: {sparse_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"vs full matrix: ~{users * items * 8 / 1024**2:.2f} MB")
print(f"Savings: ~{(1 - sparse_df.memory_usage(deep=True).sum()/(users*items*8))*100:.1f}%")
# 6. Convert Between Dense and Sparse
dense = pd.Series([0, 1, 0, 2, 0])
# To sparse
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))
print("\nConverted to sparse:")
print(sparse)
# Back to dense
dense_again = sparse.astype("int64")
print("\nBack to dense:")
print(dense_again)
# 7. Sparse with Different Fill Values
# If most values are -1
data = pd.Series([-1, -1, 5, -1, -1, 10, -1, -1])
sparse_neg = data.astype(pd.SparseDtype("int64", fill_value=-1))
print("\nSparse with fill_value=-1:")
print(sparse_neg)
print(f"Density: {sparse_neg.density:.2%}")
# 8. Practical: Feature Engineering
# High-cardinality categorical β sparse one-hot
df = pd.DataFrame({
'user_id': range(1000),
'country': np.random.choice(['US', 'UK', 'CA', 'AU', 'DE'] * 200, 1000)
})
# Sparse one-hot encoding
country_sparse = pd.get_dummies(df['country'], prefix='country', sparse=True)
print("\nSparse feature engineering:")
print(f"Shape: {country_sparse.shape}")
print(f"Memory: {country_sparse.memory_usage(deep=True).sum() / 1024:.2f} KB")
# 9. When to Use Sparse
def analyze_sparsity(series):
    """Check if sparse makes sense."""
    zero_count = (series == 0).sum()
    total = len(series)
    sparsity = zero_count / total
    print(f"Sparsity: {sparsity:.1%}")
    if sparsity > 0.9:
        print("✅ Good candidate for sparse")
    elif sparsity > 0.5:
        print("⚠️ Maybe use sparse")
    else:
        print("❌ Not sparse enough")
    return sparsity
data1 = pd.Series(np.random.randint(0, 2, 1000)) # 50% zeros
data2 = pd.Series(np.random.choice([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], 1000)) # 90% zeros
print("\nData 1:")
analyze_sparsity(data1)
print("\nData 2:")
analyze_sparsity(data2)
# 10. Limitations
print("\n=== Sparse Limitations ===")
print("- Arithmetic can densify results")
print("- Some operations require conversion to dense")
print("- Not all pandas functions support sparse")
print("- Best for: β₯90% sparsity")
print("- Use cases: one-hot encoding, user-item matrices, NLP features")
When to Use Sparse:
| Sparsity | Recommendation | Example |
|---|---|---|
| >95% | Definitely use sparse | User-item matrix |
| 90-95% | Likely beneficial | One-hot encoding |
| 50-90% | Maybe | Depends on operations |
| <50% | Don't use sparse | Regular data |
Memory Comparison:
| Data | Dense | Sparse | Savings |
|---|---|---|---|
| 99% zeros | 1 GB | ~10 MB | ~99% |
| 95% zeros | 1 GB | ~50 MB | ~95% |
| 90% zeros | 1 GB | ~100 MB | ~90% |
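If the data already lives in a scipy.sparse matrix (a typical user-item matrix), pandas can wrap it directly through the DataFrame.sparse accessor; a minimal sketch (requires scipy, and the matrix size/density are assumptions):
import pandas as pd
from scipy import sparse
mat = sparse.random(10_000, 1_000, density=0.01, format='csr', random_state=0)  # ~1% non-zero
df_sparse = pd.DataFrame.sparse.from_spmatrix(mat)
print(f"Density: {df_sparse.sparse.density:.2%}")
print(f"Memory: {df_sparse.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
dense_back = df_sparse.sparse.to_dense()   # convert back only when a dense-only operation is needed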
Interview Insights:
Interviewer's Insight
- "Sparse: store only non-zero/non-default values"
- "Memory savings: proportional to sparsity"
- "pd.get_dummies(sparse=True): sparse one-hot"
- "SparseDtype(dtype, fill_value): specify type"
- "Use when: >90% same value (often zeros)"
- "density: fraction of non-sparse values"
- "npoints: count of non-sparse values"
- "Use cases: one-hot encoding, user-item matrix, NLP"
- "Limitation: some ops require densification"
- "Trade-off: memory vs computation speed"
Questions asked in Google interview
- How would you optimize a Pandas operation running slowly on large dataset?
- Explain the difference between merge() and join()
- Write code to calculate rolling averages with different window sizes
- How would you handle a DataFrame with 100 million rows?
- Explain memory optimization techniques
- Write code to perform complex GroupBy with multiple aggregations
- Explain the internal data structure of DataFrame
- How would you implement feature engineering pipelines?
- Write code to calculate year-over-year growth
- Explain vectorized operations and their importance
- How to handle SettingWithCopyWarning?
- Write code to perform window functions similar to SQL
Questions asked in Amazon interview
- Write code to merge multiple DataFrames with different schemas
- How would you calculate year-over-year growth?
- Explain how to handle time series data with irregular intervals
- Write code to identify and remove duplicate records
- How would you implement a moving average crossover strategy?
- Explain the difference between transform() and apply()
- Write code to pivot data for sales analysis
- How would you handle categorical variables with high cardinality?
- Explain how to optimize for memory efficiency
- Write code to perform cohort analysis
Questions asked in Meta interview
- Write code to analyze user engagement data
- How would you calculate conversion funnels?
- Explain how to handle large-scale data processing
- Write code to resample time series data
- How would you implement A/B testing analysis?
- Explain method chaining and its benefits
- Write code to calculate retention metrics
- How would you handle hierarchical data structures?
- Explain vectorization benefits over loops
- Write code to analyze network data
Questions asked in Microsoft interview
- Explain the SettingWithCopyWarning and how to avoid it
- Write code to perform window functions similar to SQL
- How would you handle timezone conversions?
- Explain the difference between views and copies
- Write code to implement custom aggregation functions
- How would you optimize Pandas for production?
- Explain multi-level indexing use cases
- Write code to compare two DataFrames
- How would you handle missing data in time series?
- Explain eval() and query() methods
Questions asked in Netflix interview
- Write code to analyze viewing patterns and user behavior
- How would you calculate streaming quality metrics?
- Explain how to handle messy data from multiple sources
- Write code to implement collaborative filtering preprocessing
- How would you analyze content performance across regions?
- Explain time series decomposition
- Write code to calculate customer lifetime value
- How would you handle data for recommendation systems?
- Explain rolling window calculations for real-time analytics
- Write code to analyze A/B test results
Questions asked in Apple interview
- Write code to perform data validation on imported data
- How would you implement data quality checks?
- Explain how to handle multi-format data imports
- Write code to analyze product performance metrics
- How would you implement data anonymization?
- Explain best practices for production Pandas code
- Write code to create automated data reports
- How would you handle data versioning?
- Explain memory management for large DataFrames
- Write code to implement time-based partitioning
Questions asked in Flipkart interview
- Write code to analyze e-commerce transaction data
- How would you calculate GMV metrics?
- Explain handling high-cardinality categorical data
- Write code to analyze customer purchase patterns
- How would you implement product recommendation preprocessing?
- Explain data aggregation for dashboard analytics
Questions asked in LinkedIn interview
- Write code to analyze professional network connections
- How would you calculate engagement metrics for posts?
- Explain how to handle user activity data
- Write code to implement skill-based matching
- How would you analyze job posting performance?
- Explain data preprocessing for NLP tasks