Counting all the Kanji in Japanese Wikipedia

Background

As I am actively learning Japanese, I keep running into Kanji, which are an integral part of the Japanese writing system. Kanji are logographic characters adapted from the Chinese script and used in Japanese writing. A logograph, or lexigraph, is a written character that represents a semantic component of a language, such as a word. Unlike phonemic writing systems such as alphabets and syllabaries, whose individual symbols represent sounds directly and lack any inherent meaning, each Kanji character carries some inherent meaning. Another example of a logographic writing system is Egyptian hieroglyphs. In addition to Kanji, Japanese has two phonetic syllabaries, Hiragana and Katakana.

There are nearly 3,000 Kanji used in Japanese names and in common communication. According to Wikipedia, there is no definitive count of Kanji characters, just as there is none of Chinese characters generally. The Dai Kan-Wa Jiten, a Japanese dictionary compiled by Tetsuji Morohashi that is considered comprehensive in Japan, contains about 50,000 characters.

The goal of this project was to gather some data on the most commonly used Kanji as I go about learning Japanese. I thought that learning the most commonly used Kanji first would be an efficient way of learning the writing system. I decided to analyze the Japanese Wikipedia dataset because its data is readily available, and I thought it would give me a good approximation of the most used Kanji.

Project Details

I used the jawiki dump from 2024.08.01, available at dumps.wikimedia.org, and downloaded the “Recombine all pages, current versions only.” file for this project. The data is stored as XML and compressed with bz2 down to 4.6 GB.


 bzgrep -oP '[\x{4E00}-\x{9FAF}\x{3400}-\x{4DBF}]' jawiki-20240801-pages-meta-current.xml.bz2

From Rikai Kanji Tables:

CJK unified ideographs - Common and uncommon kanji (4e00 - 9faf)
CJK unified ideographs Extension A - Rare kanji (3400 - 4dbf)
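
As a quick sanity check of the two ranges above, the same character class can be written as a Python regular expression. This is only a sketch: KANJI_RE, is_kanji, and extract_kanji are names made up for illustration and are not part of the pipeline used for the results.

```python
import re

# Same two ranges as listed above: CJK unified ideographs (4E00-9FAF)
# and CJK unified ideographs Extension A (3400-4DBF).
KANJI_RE = re.compile(r'[\u4E00-\u9FAF\u3400-\u4DBF]')


def is_kanji(ch):
    """Return True if the single character ch falls inside either range."""
    return len(ch) == 1 and KANJI_RE.fullmatch(ch) is not None


def extract_kanji(text):
    """Return every character of text that falls inside either range, in order."""
    return KANJI_RE.findall(text)
```

Hiragana, Katakana, and Latin letters fall outside both ranges, so extract_kanji keeps only the Kanji from mixed text.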

Using bzgrep, I managed to stream all the Kanji characters without decompressing the whole file to disk. I then fed the output to a Python script called counter.py, which counts the occurrences of each Kanji and prints each Kanji next to its count. The Python script is as follows.

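#!/usr/bin/python3
"""Count how often each Kanji appears; expects one character per input line."""

import fileinput
from collections import Counter

total = Counter()

# bzgrep -o prints every match on its own line, so each line is one occurrence.
for line in fileinput.input():
  kanji = line.rstrip()
  if kanji:  # ignore empty lines
    total[kanji] += 1

# Print "kanji,count" rows sorted by count, least frequent first.
for k, v in sorted(total.items(), key=lambda item: item[1]):
  print(f'{k},{v}')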

      
bzgrep -oP '[\x{4E00}-\x{9FAF}\x{3400}-\x{4DBF}]' jawiki-20240801-pages-meta-current.xml.bz2 | python3 counter.py > python_result.txt
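
For readers without bzgrep, roughly the same pipeline can be sketched in Python alone with the standard bz2 module, which also reads the archive as a stream instead of decompressing it to disk first. This is an alternative sketch, not the method used for the results here, and count_kanji is a hypothetical helper name.

```python
import bz2
import re
from collections import Counter

# Same ranges as the bzgrep pattern above.
KANJI_RE = re.compile(r'[\u4E00-\u9FAF\u3400-\u4DBF]')


def count_kanji(path):
    """Stream a bz2-compressed UTF-8 text file and count every Kanji in it."""
    counts = Counter()
    # Mode 'rt' decompresses on the fly and yields decoded text line by line,
    # so the whole dump is never held in memory at once.
    with bz2.open(path, 'rt', encoding='utf-8') as dump:
        for line in dump:
            counts.update(KANJI_RE.findall(line))
    return counts
```

Calling count_kanji('jawiki-20240801-pages-meta-current.xml.bz2').most_common(100) would then return the 100 most frequent Kanji, most frequent first.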

Results

Overall, 24,916 unique Kanji were found. A big contributor to that number is the Wikipedia article CJK統合漢字拡張A, which covers the less commonly used Kanji/Chinese characters in Unicode. In total, the entire Japanese Wikipedia as of 2024.08.01 contains 1,834,033,943 Kanji characters, or roughly 1.8 billion.

The table below shows a summary of Kanji frequency:

Used X times or fewer	Number of unique Kanji
1	3,335
10	14,062
100	18,053
1,000	20,599
10,000	22,256

That means the number of unique Kanji used more than 10,000 times in Japanese Wikipedia is 2,660 (24,916 minus 22,256), which is in line with the figure of nearly 3,000 Kanji used in names and in common communication.
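
The summary above can be recomputed from the kanji,count rows that counter.py writes to python_result.txt. The sketch below assumes that file format; load_counts, frequency_summary, and used_more_than are hypothetical helper names.

```python
import csv


def load_counts(path):
    """Read 'kanji,count' rows (the counter.py output format) into a dict."""
    with open(path, encoding='utf-8', newline='') as f:
        return {kanji: int(count) for kanji, count in csv.reader(f)}


def frequency_summary(counts, thresholds=(1, 10, 100, 1_000, 10_000)):
    """For each threshold X, count the unique Kanji used X times or fewer."""
    return {x: sum(1 for n in counts.values() if n <= x) for x in thresholds}


def used_more_than(counts, x):
    """Count the unique Kanji used strictly more than x times."""
    return sum(1 for n in counts.values() if n > x)
```

For any threshold x, used_more_than(counts, x) equals len(counts) minus frequency_summary(counts)[x], which is the subtraction used for the 10,000 threshold.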

The table below shows the 100 most popular Kanji. Every Kanji in a column is more popular than any Kanji in the columns to its left, and within a column each Kanji is more popular than those shown above it, so the most popular Kanji sits in the bottom-right corner.

見	家	下	的	文	県	内	話	記	学
理	送	小	同	立	道	子	場	書	人
対	通	表	選	高	編	時	新	事	会
連	全	版	目	業	後	画	生	中	大
京	主	前	語	分	手	第	上	一	者
号	頼	所	長	発	山	田	行	出	用
依	公	明	戦	動	代	成	市	作	本
町	開	校	削	関	東	地	合	国	月
機	野	回	物	間	社	除	部	名	日
世	使	川	定	集	自	和	方	利	年
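
The ordering of this table (most popular Kanji in the bottom-right corner, each column filled from the bottom up, columns running from right to left) can be sketched as follows. layout_grid and read_ranked are hypothetical helpers, and the input is assumed to be a list of Kanji sorted most popular first.

```python
def layout_grid(ranked, rows=10, cols=10):
    """Place ranked[0] bottom-right, fill each column upward, then move left."""
    if len(ranked) != rows * cols:
        raise ValueError('expected exactly rows * cols items')
    grid = [[None] * cols for _ in range(rows)]
    for i, kanji in enumerate(ranked):
        col = cols - 1 - i // rows   # rightmost column holds the top `rows` items
        row = rows - 1 - i % rows    # within a column, the bottom is most popular
        grid[row][col] = kanji
    return grid


def read_ranked(grid):
    """Inverse of layout_grid: recover the most-popular-first order."""
    rows, cols = len(grid), len(grid[0])
    return [grid[rows - 1 - i % rows][cols - 1 - i // rows]
            for i in range(rows * cols)]
```

Read this way, the table gives 年, 日, 月, 本, 用 as the five most popular Kanji.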

Not surprisingly, the three most popular Kanji (年, 日, 月) represent year, day, and month respectively. I assume most Wikipedia articles carry a date stamp or refer to a point in time. Another assumption is that Wikipedia articles in general refrain from using pronouns and adjectives, as these are usually used for expressing human emotions rather than in informative articles. For example, the Kanji for "I" and "nice" (私, 384th; 良, 372nd) are not common, and neither appears in the top 100 above.

When combined with other Kanji, a Kanji can take on a different meaning and serve as a building block for more complicated words. A good example of this is 電車, which means train and is formed by combining the Kanji for electricity (電) and car (車). Some Kanji have a fairly abstract meaning on their own, but combined with other Kanji they become building blocks for many other words or concepts. The 5th most popular Kanji, 用 (utilize), illustrates this well: according to Jisho.org, it appears in a wide range of words and concepts.

Will I change my learning routine and prioritize Kanji by their frequency? Probably not. While some Kanji are more common than others, learning them strictly in order of frequency might not be the most efficient approach.
