I’m trying to create a basic sentence counter that accurately counts sentences in a text, but I keep running into issues with punctuation, abbreviations, and line breaks. Sometimes it overcounts, other times it misses sentences completely. I’d really appreciate clear guidance or example code on how to reliably detect and count sentences for longer documents, ideally in a way that works well in a browser or a simple script.
You are hitting all the classic sentence-counter problems: punctuation, abbreviations, line breaks. Regex-only solutions break down fast.
Here is a simple but solid approach in Python. You can adapt it to other languages.
- Normalize whitespace
- Replace newlines and tabs with spaces
- Collapse multiple spaces
Example:
text = re.sub(r'\s+', ' ', text).strip()
- Protect common abbreviations
Make a small list. Expand it later if you need.
abbrevs = [
    'Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Sr.', 'Jr.',
    'Inc.', 'Ltd.', 'Co.', 'vs.', 'etc.',
    'e.g.', 'i.e.', 'U.S.', 'U.K.'
]
Replace the final dot with a placeholder before splitting.
placeholder = '§§DOT§§'
for ab in abbrevs:
    safe = ab.replace('.', placeholder)
    text = text.replace(ab, safe)
- Split on sentence boundaries
Use a regex that splits where a period, question mark, or exclamation mark is followed by whitespace.
parts = re.split(r'(?<=[.!?])\s+', text)
- Clean and restore abbreviations
sentences = []
for p in parts:
    if not p.strip():
        continue
    s = p.replace(placeholder, '.').strip()
    sentences.append(s)
count = len(sentences)
This avoids:
- “Dr. Smith went home. He slept.” counting as 3
- Newline-based overcounting
- Trailing punctuation like “Hi!!!” causing extra splits
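For instance, a quick check that repeated punctuation doesn't multiply the count, using the same split regex as above:

```python
import re

# "Hi!!!" has no whitespace between the !s, so the lookbehind split
# (?<=[.!?])\s+ only fires once, after the run of exclamation marks
parts = re.split(r'(?<=[.!?])\s+', 'Hi!!! Bye.')
print(parts)  # ['Hi!!!', 'Bye.']
```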
- Common edge cases
You will still miss some hard ones, for example:
- Numbers: “Version 3.5 is out. Install it.”
- Initials: “J. K. Rowling wrote it.”
- Ellipses: “So… you agree.”
For numbers, protect the periods between digits:
text = re.sub(r'(?<=\d)\.(?=\d)', placeholder, text)
For initials, a quick hack is:
text = re.sub(r'\b([A-Z])\.', r'\1' + placeholder, text)
Restore the placeholder afterwards, the same way as before.
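Ellipses can get the same placeholder treatment. A sketch (the `ELLIPSIS` marker name is my own, same trick as `§§DOT§§`); the tradeoff is that a sentence that genuinely ends in an ellipsis will merge with the next one:

```python
import re

ELLIPSIS = '§§ELLIPSIS§§'  # hypothetical marker, same trick as §§DOT§§

def protect_ellipses(text):
    # Handle both the single … character and runs of three-plus dots,
    # so "So… you agree." is not split mid-sentence.
    text = text.replace('…', ELLIPSIS)
    return re.sub(r'\.{3,}', ELLIPSIS, text)

def restore_ellipses(text):
    # Normalizes every protected run back to three dots
    return text.replace(ELLIPSIS, '...')
```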
- Quick full example
import re

def count_sentences(text):
    placeholder = '§§DOT§§'
    abbrevs = [
        'Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Sr.', 'Jr.',
        'Inc.', 'Ltd.', 'Co.', 'vs.', 'etc.',
        'e.g.', 'i.e.', 'U.S.', 'U.K.'
    ]
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'(?<=\d)\.(?=\d)', placeholder, text)
    text = re.sub(r'\b([A-Z])\.', r'\1' + placeholder, text)
    for ab in abbrevs:
        safe = ab.replace('.', placeholder)
        text = text.replace(ab, safe)
    parts = re.split(r'(?<=[.!?])\s+', text)
    sentences = []
    for p in parts:
        s = p.replace(placeholder, '.').strip()
        if s:
            sentences.append(s)
    return len(sentences), sentences
This keeps things readable and debuggable. You can print the sentences list and inspect where it fails on your data.
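To see it behave on the tricky cases listed earlier, you can run a few quick checks (the function is repeated here, slightly condensed, so the snippet runs standalone):

```python
import re

def count_sentences(text):
    placeholder = '§§DOT§§'
    abbrevs = ['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Sr.', 'Jr.',
               'Inc.', 'Ltd.', 'Co.', 'vs.', 'etc.',
               'e.g.', 'i.e.', 'U.S.', 'U.K.']
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'(?<=\d)\.(?=\d)', placeholder, text)      # decimals
    text = re.sub(r'\b([A-Z])\.', r'\1' + placeholder, text)  # initials
    for ab in abbrevs:
        text = text.replace(ab, ab.replace('.', placeholder))
    parts = re.split(r'(?<=[.!?])\s+', text)
    sentences = [p.replace(placeholder, '.').strip() for p in parts if p.strip()]
    return len(sentences), sentences

# Abbreviation no longer causes a false split
assert count_sentences("Dr. Smith went home. He slept.")[0] == 2
# Decimal number stays inside one sentence
assert count_sentences("Version 3.5 is out. Install it.")[0] == 2
# Initials are kept together
assert count_sentences("J. K. Rowling wrote it.")[0] == 1
```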
You’re already deep into the hard part of “simple” sentence counting.
I like @shizuka’s placeholder trick, but I’d actually simplify in a different direction and lean on existing NLP tools first, then fall back to regex only if you really need to.
1. If you can use a library, do it
In Python, nltk or spacy already handle a lot of your pain cases:
import spacy

nlp = spacy.load('en_core_web_sm')

def count_sentences(text):
    doc = nlp(text)
    return len(list(doc.sents)), [s.text.strip() for s in doc.sents]
This will usually get punctuation, abbreviations, and line breaks way more reliably than custom regex. You can add custom rules later if you notice consistent errors.
If you’re stuck with regex only, accept that you’re just approximating. Don’t chase perfection, it will eat your weekend.
2. A “lighter” regex approach
Instead of manually protecting every abbrev like Mr. or Dr., you can try a heuristic:
split only when the punctuation is followed by:
- a space or newline
- then a capital letter, digit, or quote
Something like:
import re
SENT_BOUNDARY = re.compile(r'''
    (?<=[.!?])      # end with ., !, or ?
    ["')\]]?        # optional closing quote/parens
    \s+             # some whitespace
    (?=["'A-Z0-9])  # next char looks like sentence start
''', re.VERBOSE)
def naive_sentence_split(text):
    text = re.sub(r'\s+', ' ', text).strip()
    parts = SENT_BOUNDARY.split(text)
    return [p.strip() for p in parts if p.strip()]

def count_sentences(text):
    sents = naive_sentence_split(text)
    return len(sents), sents
This will sometimes miscount things like:
- “in the U.S. Army”
- “version 3.5 is out”
but it avoids needing a giant abbreviation list and is easier to tweak for your own data.
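A quick demo of where the heuristic wins (a standalone, cleaned-up copy of the capitalization-aware boundary regex):

```python
import re

SENT_BOUNDARY = re.compile(r'''
    (?<=[.!?])      # sentence-final ., !, or ?
    ["')\]]?        # optional closing quote or bracket
    \s+             # whitespace between sentences
    (?=["'A-Z0-9])  # next char looks like a sentence start
''', re.VERBOSE)

def naive_sentence_split(text):
    text = re.sub(r'\s+', ' ', text).strip()
    return [p.strip() for p in SENT_BOUNDARY.split(text) if p.strip()]

# The decimal in "3.5" has no whitespace after the dot, so it never
# matches; "it works" starts lowercase, so no split there either.
sents = naive_sentence_split('Version 3.5 is out. Try it now. it works.')
print(sents)  # ['Version 3.5 is out.', 'Try it now. it works.']
```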
3. Decide what “sentence” really means for your use-case
You didn’t say what the text looks like. That matters a lot. Some examples:
- User comments or chat logs
  - You might treat every line as a sentence if punctuation is sloppy.
- Formal articles
  - Your abbreviation list and number handling matter more.
- AI-generated content
  - Usually has cleaner punctuation and capitalization, so simple rules work better.
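If you land in the chat-log bucket, that rule is tiny. A sketch (the function name is my own), counting every non-empty line as one "sentence":

```python
def count_chat_sentences(text):
    # Sloppy punctuation? Treat each non-empty line as one "sentence".
    return sum(1 for line in text.splitlines() if line.strip())

print(count_chat_sentences("hey\nwhats up\n\nok see you"))  # 3
```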
Honestly, half the battle is defining what counts as a sentence for you, then coding to that rule instead of trying to solve general English.
4. Test on your own corpus
Whatever you build, don’t trust it until you run it on like 50–100 real samples and print what it thinks are sentences:
count, sents = count_sentences(your_text)
for i, s in enumerate(sents, 1):
print(i, repr(s))
You’ll spot patterns like “oh, it always splits wrong on initials” or “line breaks are killing it” and then you add just one rule for that pattern instead of bolting on 20 generic fixes.
5. If your text is AI‑generated or heavily edited
If the source text comes from AI and looks stiff or oddly repetitive, a side trick is to clean it up before counting; better punctuation means better sentence detection.
TL;DR:
- If possible, use spacy/nltk and tune.
- If not, use a capitalization-aware regex instead of huge abbreviation lists.
- Define “sentence” for your needs and test against real examples instead of aiming for perfect linguistics.