October 7th Data Analysis

Tl;dr

Investigating the gruesome stats of the October 7th massacre, using Pandas, to provide some additional information regarding this tragedy. Feel free to browse to the bottom to understand my research.

Surprise Surprise

4 Months away from the terrible October 7th massacre, I attempted to try and deal with it using data analysis tools. Simply put, while browsing through my favorite news site through the weekend, I saw that an updated Excel spreadsheet containing the victims names and details was shared on that site. Is being updated constantly and the download changes (sadly, when there are new additions):

Opened up my laptop to see what’s going on, the information there was exactly what I was chasing for the past few months. It’s important to stress out that everything related to mass terror casualties is scarce, specifically because everything happened at the same time and not enough people documented the cause of death, not to mention, making it accessible to people like me to investigate.

There were some personal initiatives, but they were led by the few, not on the governmental scale.

I quickly downloaded that Excel spreadsheet and then it hit me. That’s the real deal. Full stats of the October 7th victims; civilian and military casualties, mixed together. Let’s use Pandas for this one.

Detective Hat on

So what is Pandas? Pandas is a python library used to analyze and present massive amounts of data, to perform manipulations on, to split and use mathematicaly functions easily as code, sorting, joining and what not. While I don’t use Pandas daily (Pyspark ftw), I decided to give it a go.

Ok, let’s get dirty - how can we analyze this file and what types of data points do we have in order to pivot on?

  • First name
  • Last name
  • Military rank (empty for civilians, some Prof.’s and some Dr’s.)
  • Place of residence (towns and cities for Israelis, countries for foreign nationals)
  • Date of death
  • Cause of death
  • Age (youngest was 10 months old baby, oldest was 89 years old)

Gender was not mentioned. The cause of death was one of the painful ones. It contained values ranging from combats, friendly fire, “murdered in their homes” and what not. Unlike English, Hebrew has unique female and male abbreviations for every verb. Murdered in English refers to he\she\it\they, however in Hebrew it’s totally different. Every subject has it’s own verb, making it theoretically possible to deduct the gender of the victim.

Statistically Speaking or Research Questions

Based on these data points, what can we get out of it?

  • Max, Min, Median - age, for military and civilian casualties
  • Death toll based on the Date of death - very well known
  • Total number of civilian casualties vs military ones
  • Splitting based on gender
  • Ranks statistics
  • Battalions casualties stats - which battaltion suffered the most casualties?
  • Specific causes of death by area
  • Potential total number of family members that were murdered together (last name, cause and place of residence)
  • Foreign nationals that were murdered and their originating countries
  • Sorting the mass casualties by geo locations (based on the cause and place of residence)

Let’s get to it.

Jupyter Notebook Code and Results

Imports that we need:

1
2
3
4
5
6
7
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import numpy as np
from datetime import datetime
import re
import seaborn as sns

Data file that we’ll be working on:

1
2
3
file_path = 'ynetlist152.xlsx'
data = pd.read_excel(file_path)
data.head()

Basic allocations:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
# Convert 'תאריך פטירה' to datetime format and handle different formats
data['תאריך פטירה'] = pd.to_datetime(data['תאריך פטירה'], errors='coerce', dayfirst=True)

# Basic Descriptive Statistics for Age
age_stats = data['גיל'].describe()

# Rank Distribution
rank_distribution = data['דרגה'].value_counts(dropna=False)

# Time-Series Analysis Preparation: Count deaths by date
deaths_by_date = data['תאריך פטירה'].value_counts().sort_index()

age_stats, rank_distribution, deaths_by_date.head()

Wordcloud plotting, based on the cause of death. Please be advised that Wordcloud does not support RTL languages, therefore we need to use the oldest trick in the book - reverse the characters, and then the words. In addition, a specific font is required as well to present it:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
info_text = ' '.join(data['מידע על המוות'].dropna())

def reverse_hebrew_phrase(text):
    # Reverse characters in each word
    reversed_words = [''.join(reversed(word)) for word in text.split()]
    # Reverse the order of words to maintain correct phrase order
    correct_order_phrase = ' '.join(reversed(reversed_words))
    return correct_order_phrase

# Applying the function to the Hebrew text
corrected_info_text = reverse_hebrew_phrase(info_text)

# Generate a word cloud with the corrected text
wordcloud = WordCloud(width = 800, height = 800, 
                      background_color ='white', 
                      font_path='DejaVuSans-Bold.ttf',
                      regexp=r"[\w']+",
                      min_font_size = 10).generate(corrected_info_text)

# Plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 

plt.show()

This results in this wordcloud plot.

Males vs Females breakdown:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
# Define keywords for identifying gender and context of death in the descriptions
female_keywords = ['בביתה', 'נרצחה', 'נהרגה']
male_keywords = ['בביתו', 'נרצח', 'נהרג']

# Function to classify gender based on keywords
def classify_gender(row):
    if isinstance(row['מידע על המוות'], str):
        for keyword in female_keywords:
            if keyword in row['מידע על המוות']:
                return 'נקבה'  # Female
        for keyword in male_keywords:
            if keyword in row['מידע על המוות']:
                return 'זכר'  # Male
    return 'unknown'  # Adjusted to handle cases not covered by keywords

# Apply the function to classify gender based on "Info on Death"
data['Gender Classification'] = data.apply(classify_gender, axis=1)

# Summary of gender classification
gender_classification_summary = data['Gender Classification'].value_counts()

gender_classification_summary

Military vs Civilian casualties:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
# Correcting gender classification based on detailed descriptions
def refined_classify_gender(description):
    if pd.notna(description):  # This checks if the description is not NaN
        if 'נרצחה' in description or 'נהרגה' in description:
            return 'Female'
        elif 'נרצח' in description or 'נהרג' in description:
            return 'Male'
    return 'Unknown'

data['Refined Gender'] = data['מידע על המוות'].apply(refined_classify_gender)

# Adjusting military vs. civilian classification based on ranks and action-related terms
def adjust_military_civilian_classification(row):
    if pd.notna(row['דרגה']) and row['דרגה'] != 'אזרח':
        return 'Military'
    elif pd.notna(row['מידע על המוות']) and ('נהרג' in row['מידע על המוות'] or 'נהרגה' in row['מידע על המוות']):
        return 'Military'
    return 'Civilian'

data['Military vs. Civilian'] = data.apply(adjust_military_civilian_classification, axis=1)

The following code samples generates graphs with the average ages for both civilians and military personnel. In addition, it also shows the top 10 military personnel ranks that were killed in action. As can be seen, the vast majority were Staff Sargents.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42

# Convert 'גיל' column to numeric, setting errors='coerce' to convert non-numeric strings to NaN
data['גיל'] = pd.to_numeric(data['גיל'], errors='coerce')

# Recalculate the mean ages with the cleaned 'גיל' column
mean_military_age = data[data['Military vs. Civilian'] == 'Military']['גיל'].mean()
mean_civilian_age = data[data['Military vs. Civilian'] == 'Civilian']['גיל'].mean()

# Proceed with plotting
plt.figure(figsize=(6, 4))
plt.bar('Military', mean_military_age, color='blue')
plt.title('Average Age of Military Victims')
plt.ylabel('Average Age')
plt.ylim(0, max(mean_military_age, mean_civilian_age) + 10)
plt.show()

plt.figure(figsize=(6, 4))
plt.bar('Civilian', mean_civilian_age, color='red')
plt.title('Average Age of Civilian Victims')
plt.ylabel('Average Age')
plt.ylim(0, max(mean_military_age, mean_civilian_age) + 10)
plt.show()


# Assuming military_rank_death_counts is recalculated if necessary
top_10_filtered_role_summary = military_rank_death_counts.head(10)

plt.figure(figsize=(10, 6))
top_10_filtered_role_summary.plot(kind='bar', color='green')
plt.title('Death Counts by Military Rank')
plt.xlabel('Military Rank')
plt.ylabel('Number of Deaths')

# Reverse Hebrew labels for display
labels = [label[::-1] for label in top_10_filtered_role_summary.index]
plt.xticks(ticks=range(len(labels)), labels=labels, rotation=45)
plt.show()

# Reverse Hebrew labels for display
labels = [label[::-1] for label in military_rank_death_counts.index]
plt.xticks(ticks=range(len(labels)), labels=labels, rotation=45)
plt.show()

Calculating the number of civilians vs military personnel:

1
2
3
4
5
6
7
8
# Visualizing the distribution of military vs. civilian victims
plt.figure(figsize=(10, 6))
data['Military vs. Civilian'].value_counts().plot(kind='bar', color=['blue', 'red'])
plt.title('Military vs. Civilian Victims')
plt.xlabel('Category')
plt.ylabel('Number of Victims')
plt.xticks(rotation=0)
plt.show()

Checking for foreign national victims:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
place_of_residency_counts = data['מקום מגורים'].value_counts()

# Expanded list of foreign countries in Hebrew for identifying foreign victims
foreign_countries_hebrew_expanded = ['תאילנד', 'קמבודיה', 'ארצות הברית', 'רוסיה', 'אוקראינה', 'פולין', 'צרפת', 'גרמניה', 'איטליה', 'בריטניה']

# Function to check if the place of residence indicates a foreign victim
def is_foreign_country(place_of_residence):
    return any(country in place_of_residence for country in foreign_countries_hebrew_expanded) if isinstance(place_of_residence, str) else False

# Apply function to identify foreign victims
data['Is Foreigner Expanded'] = data['מקום מגורים'].apply(is_foreign_country)

# Summarize foreign victims based on expanded criteria
foreign_victims_expanded_count = data['Is Foreigner Expanded'].value_counts()
foreign_victims_expanded_distribution = data[data['Is Foreigner Expanded']]['מידע על המוות'].value_counts()


foreign_victims_expanded_count, foreign_victims_expanded_distribution

Based on that, we can see that most of them were hired professionals from Thailand (40), while the rest:

1
2
3
4
5
6
7
8
 True       43
 Name: count, dtype: int64,
 מידע על המוות
 נרצח בכיסופים               6
 סטודנט, נרצח בכרמיה         1
 נרצחה בניר עוז              1
 נרצח בניר עוז               1
 מאבטח, נהרג במסיבה ברעים    1

The main battalion that suffered the most casualties. As can be seen, it’s Golani:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
# Adjust the function to handle non-string (including NaN) values in "Info on Death"
def extract_military_info_corrected(row):
   info = row['מידע על המוות']
   if pd.isna(info):
       return pd.Series(["Unknown", "Unknown", "Unknown"], index=['Role', 'Base', 'Context'])
   
   role_match = re.search(r"^(.*),", info)
   role = role_match.group(1) if role_match else "Unknown"

   base_match = re.search(r"בבסיס (\w+)", info)
   base = base_match.group(1) if base_match else "Unknown"

   # Determining context of death
   context = 'Unknown'
   if any(keyword in info for keyword in female_keywords + male_keywords):
       context = 'Killed in Action'

   return pd.Series([role, base, context], index=['Role', 'Base', 'Context'])

# Reapply the corrected function
data[['Role', 'Base', 'Context']] = data.apply(extract_military_info_corrected, axis=1)

# Recalculate summaries
role_summary = data['Role'].value_counts()
context_summary = data['Context'].value_counts()
base_summary = data['Base'].value_counts()

role_summary.head(), context_summary, base_summary.head()

# Filter out "Unknown" from role_summary and plot
filtered_role_summary = role_summary[role_summary.index != 'Unknown']
top_10_filtered_role_summary = filtered_role_summary.head(10)

plt.figure(figsize=(10, 6))
top_10_filtered_role_summary.plot(kind='bar', color='skyblue')
plt.title('Top 10 Military Roles Distribution (Excluding Unknown)')
plt.xlabel('Role')
plt.ylabel('Count')
labels = [label[::-1] for label in top_10_filtered_role_summary.index]
plt.xticks(ticks=range(len(labels)), labels=labels, rotation=45)
plt.show()
Licensed under CC BY-NC-SA 4.0
comments powered by Disqus
Built with Hugo
Theme Stack designed by Jimmy