Dataset statistics
| Number of variables | 6 |
|---|---|
| Number of observations | 139023 |
| Missing cells | 0 |
| Missing cells (%) | 0.0% |
| Duplicate rows | 0 |
| Duplicate rows (%) | 0.0% |
| Total size in memory | 3.6 MiB |
| Average record size in memory | 27.0 B |
Variable types
| Categorical | 5 |
|---|---|
| Numeric | 1 |
Warnings
| Warning | Type |
|---|---|
| Ab has a high cardinality: 138472 distinct values | High cardinality |
| Ch has a high cardinality: 118291 distinct values | High cardinality |
| Lang_Ch is highly correlated with Lang_En | High correlation |
| Lang_En is highly correlated with Lang_Ch | High correlation |
| Ab is uniformly distributed | Uniform |
Reproduction
| Analysis started | 2021-05-08 07:18:30.970771 |
|---|---|
| Analysis finished | 2021-05-08 07:18:45.182861 |
| Duration | 14.21 seconds |
| Software version | pandas-profiling v2.12.0 |
| Download configuration | config.yaml |
Lang_En
Categorical
| Distinct | 16 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 136.6 KiB |
Length
| Max length | 10 |
|---|---|
| Median length | 6 |
| Mean length | 5.783891874 |
| Min length | 4 |
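The length figures above are plain summaries of per-row string lengths. The sketch below shows one way to compute them with the standard library; `length_summary` and the five-label input are illustrative assumptions, not part of the report. The reported mean also agrees with the total character count divided by the number of observations (804094 / 139023 ≈ 5.7839).

```python
import statistics

def length_summary(values):
    """Return max, median, mean and min string length for a list of strings."""
    lengths = [len(v) for v in values]
    return {
        "max": max(lengths),
        "median": statistics.median(lengths),
        "mean": statistics.fmean(lengths),
        "min": min(lengths),
    }

# Illustrative input only: five category labels, one row each (not the real column).
sample = ["Rukai", "Bunun", "Atayal", "Puyuma", "Amis"]
summary = length_summary(sample)
print(summary)

# Consistency check on the report's own figures: total characters / observations.
reported_mean = 804094 / 139023
print(round(reported_mean, 4))
```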
Characters and Unicode
| Total characters | 804094 |
|---|---|
| Distinct characters | 27 |
| Distinct categories | 2 |
| Distinct scripts | 1 |
| Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
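As a rough illustration of the character properties mentioned above, the sketch below counts characters by Unicode general category with Python's standard `unicodedata` module. The helper name `category_counts` is hypothetical. The standard library exposes the general category but not the script or block of a code point, so those two properties would need an additional data source.

```python
import unicodedata
from collections import Counter

def category_counts(text):
    """Count the characters of `text` by Unicode general category code.

    'Lu' is Uppercase Letter and 'Ll' is Lowercase Letter, the two
    categories reported for this variable.
    """
    return Counter(unicodedata.category(ch) for ch in text)

# One value taken from the Sample table of this variable.
counts = category_counts("Sakizaya")
print(counts)
```

Every value in this column starts with a single capital letter, which is consistent with the report showing exactly 139023 uppercase letters, one per observation.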
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Sample
| 1st row | Sakizaya |
|---|---|
| 2nd row | Sakizaya |
| 3rd row | Sakizaya |
| 4th row | Sakizaya |
| 5th row | Sakizaya |
| Value | Count | Frequency (%) |
| Rukai | 15036 | 10.8% |
| Bunun | 13382 | 9.6% |
| Atayal | 11289 | 8.1% |
| Puyuma | 10359 | 7.5% |
| Amis | 9978 | 7.2% |
| Kavalan | 9444 | 6.8% |
| Thao | 8777 | 6.3% |
| Seediq | 8025 | 5.8% |
| Paiwan | 8009 | 5.8% |
| Yami | 7867 | 5.7% |
| Other values (6) | 36857 | 26.5% |
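A table of this shape (the ten most common values plus an aggregated remainder) can be rebuilt from a raw column roughly as follows. This is a minimal sketch under stated assumptions: `frequency_table` is a hypothetical helper and the input list is made up for the demonstration, not taken from the dataset.

```python
from collections import Counter

def frequency_table(values, top=10):
    """Return (label, count, percent) rows for the `top` most common values,
    followed by an aggregated 'Other values (k)' row when more values remain."""
    counts = Counter(values)
    total = sum(counts.values())
    common = counts.most_common(top)
    rows = [(value, n, 100.0 * n / total) for value, n in common]
    remaining = len(counts) - len(common)
    if remaining > 0:
        other = total - sum(n for _, n in common)
        rows.append((f"Other values ({remaining})", other, 100.0 * other / total))
    return rows

# Tiny made-up input, not the real column.
demo = ["Rukai"] * 3 + ["Bunun"] * 2 + ["Amis"]
rows = frequency_table(demo, top=2)
for label, n, pct in rows:
    print(f"| {label} | {n} | {pct:.1f}% |")
```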
Histogram of lengths of the category
| Value | Count | Frequency (%) |
| rukai | 15036 | 10.8% |
| bunun | 13382 | 9.6% |
| atayal | 11289 | 8.1% |
| puyuma | 10359 | 7.5% |
| amis | 9978 | 7.2% |
| kavalan | 9444 | 6.8% |
| thao | 8777 | 6.3% |
| seediq | 8025 | 5.8% |
| paiwan | 8009 | 5.8% |
| yami | 7867 | 5.7% |
| Other values (6) | 36857 | 26.5% |
Most occurring characters
| Value | Count | Frequency (%) |
| a | 189565 | |
| u | 85836 | 10.7% |
| i | 68931 | 8.6% |
| n | 59909 | 7.5% |
| y | 34769 | 4.3% |
| k | 34272 | 4.3% |
| m | 28204 | 3.5% |
| S | 26728 | 3.3% |
| s | 22017 | 2.7% |
| A | 21267 | 2.6% |
| Other values (17) | 232596 |
Most occurring categories
| Value | Count | Frequency (%) |
| Lowercase Letter | 665071 | |
| Uppercase Letter | 139023 | 17.3% |
Most frequent character per category
| Value | Count | Frequency (%) |
| a | 189565 | |
| u | 85836 | |
| i | 68931 | 10.4% |
| n | 59909 | 9.0% |
| y | 34769 | 5.2% |
| k | 34272 | 5.2% |
| m | 28204 | 4.2% |
| s | 22017 | 3.3% |
| l | 20733 | 3.1% |
| o | 19503 | 2.9% |
| Other values (9) | 101332 |
| Value | Count | Frequency (%) |
| S | 26728 | |
| A | 21267 | |
| T | 19085 | |
| P | 18368 | |
| K | 17290 | |
| R | 15036 | |
| B | 13382 | |
| Y | 7867 | 5.7% |
Most occurring scripts
| Value | Count | Frequency (%) |
| Latin | 804094 |
Most frequent character per script
| Value | Count | Frequency (%) |
| a | 189565 | |
| u | 85836 | 10.7% |
| i | 68931 | 8.6% |
| n | 59909 | 7.5% |
| y | 34769 | 4.3% |
| k | 34272 | 4.3% |
| m | 28204 | 3.5% |
| S | 26728 | 3.3% |
| s | 22017 | 2.7% |
| A | 21267 | 2.6% |
| Other values (17) | 232596 |
Most occurring blocks
| Value | Count | Frequency (%) |
| ASCII | 804094 |
Most frequent character per block
| Value | Count | Frequency (%) |
| a | 189565 | |
| u | 85836 | 10.7% |
| i | 68931 | 8.6% |
| n | 59909 | 7.5% |
| y | 34769 | 4.3% |
| k | 34272 | 4.3% |
| m | 28204 | 3.5% |
| S | 26728 | 3.3% |
| s | 22017 | 2.7% |
| A | 21267 | 2.6% |
| Other values (17) | 232596 |
Lang_Ch
Categorical
| Distinct | 43 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 137.5 KiB |
Length
| Max length | 8 |
|---|---|
| Median length | 5 |
| Mean length | 4.223373111 |
| Min length | 1 |
Characters and Unicode
| Total characters | 587146 |
|---|---|
| Distinct characters | 80 |
| Distinct categories | 2 |
| Distinct scripts | 2 |
| Distinct blocks | 2 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Sample
| 1st row | 撒奇萊雅 |
|---|---|
| 2nd row | 撒奇萊雅 |
| 3rd row | 撒奇萊雅 |
| 4th row | 撒奇萊雅 |
| 5th row | 撒奇萊雅 |
| Value | Count | Frequency (%) |
| 魯凱_霧台 | 11015 | 7.9% |
| 布農_郡群 | 10446 | 7.5% |
| 噶瑪蘭 | 9444 | 6.8% |
| 邵 | 8777 | 6.3% |
| 泰雅_賽考利克 | 8350 | 6.0% |
| 達悟 | 7867 | 5.7% |
| 卡那卡那富 | 7846 | 5.6% |
| 卑南_南王 | 7700 | 5.5% |
| 賽夏 | 6895 | 5.0% |
| 賽德克_德固達雅 | 6599 | 4.7% |
| Other values (33) | 54084 | 38.9% |
Histogram of lengths of the category
| Value | Count | Frequency (%) |
| 魯凱_霧台 | 11015 | 7.9% |
| 布農_郡群 | 10446 | 7.5% |
| 噶瑪蘭 | 9444 | 6.8% |
| 邵 | 8777 | 6.3% |
| 泰雅_賽考利克 | 8350 | 6.0% |
| 達悟 | 7867 | 5.7% |
| 卡那卡那富 | 7846 | 5.6% |
| 卑南_南王 | 7700 | 5.5% |
| 賽夏 | 6895 | 5.0% |
| 賽德克_德固達雅 | 6599 | 4.7% |
| Other values (33) | 54084 | 38.9% |
Most occurring characters
| Value | Count | Frequency (%) |
| _ | 76078 | 13.0% |
| 魯 | 25782 | 4.4% |
| 雅 | 24114 | 4.1% |
| 賽 | 23270 | 4.0% |
| 南 | 19707 | 3.4% |
| 卡 | 16480 | 2.8% |
| 克 | 16375 | 2.8% |
| 那 | 15692 | 2.7% |
| 阿 | 15560 | 2.7% |
| 德 | 15349 | 2.6% |
| Other values (70) | 338739 |
Most occurring categories
| Value | Count | Frequency (%) |
| Other Letter | 511068 | |
| Connector Punctuation | 76078 | 13.0% |
Most frequent character per category
| Value | Count | Frequency (%) |
| 魯 | 25782 | 5.0% |
| 雅 | 24114 | 4.7% |
| 賽 | 23270 | 4.6% |
| 南 | 19707 | 3.9% |
| 卡 | 16480 | 3.2% |
| 克 | 16375 | 3.2% |
| 那 | 15692 | 3.1% |
| 阿 | 15560 | 3.0% |
| 德 | 15349 | 3.0% |
| 達 | 15167 | 3.0% |
| Other values (69) | 323572 |
| Value | Count | Frequency (%) |
| _ | 76078 |
Most occurring scripts
| Value | Count | Frequency (%) |
| Han | 511068 | |
| Common | 76078 | 13.0% |
Most frequent character per script
| Value | Count | Frequency (%) |
| 魯 | 25782 | 5.0% |
| 雅 | 24114 | 4.7% |
| 賽 | 23270 | 4.6% |
| 南 | 19707 | 3.9% |
| 卡 | 16480 | 3.2% |
| 克 | 16375 | 3.2% |
| 那 | 15692 | 3.1% |
| 阿 | 15560 | 3.0% |
| 德 | 15349 | 3.0% |
| 達 | 15167 | 3.0% |
| Other values (69) | 323572 |
| Value | Count | Frequency (%) |
| _ | 76078 |
Most occurring blocks
| Value | Count | Frequency (%) |
| CJK | 511068 | |
| ASCII | 76078 | 13.0% |
Most frequent character per block
| Value | Count | Frequency (%) |
| 魯 | 25782 | 5.0% |
| 雅 | 24114 | 4.7% |
| 賽 | 23270 | 4.6% |
| 南 | 19707 | 3.9% |
| 卡 | 16480 | 3.2% |
| 克 | 16375 | 3.2% |
| 那 | 15692 | 3.1% |
| 阿 | 15560 | 3.0% |
| 德 | 15349 | 3.0% |
| 達 | 15167 | 3.0% |
| Other values (69) | 323572 |
| Value | Count | Frequency (%) |
| _ | 76078 |
Ab
Categorical
| Distinct | 138472 |
|---|---|
| Distinct (%) | 99.6% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 1.1 MiB |
Length
| Max length | 486 |
|---|---|
| Median length | 37 |
| Mean length | 39.69009444 |
| Min length | 1 |
Characters and Unicode
| Total characters | 5517836 |
|---|---|
| Distinct characters | 154 |
| Distinct categories | 18 |
| Distinct scripts | 4 |
| Distinct blocks | 10 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique
| Unique | 137968 |
|---|---|
| Unique (%) | 99.2% |
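The "Distinct" and "Unique" figures measure different things. The sketch below assumes the usual reading: "Distinct" is the number of different values, and "Unique" is the number of values that occur exactly once. That reading fits the figures reported for this variable (138472 distinct, 137968 unique, so 504 sentences appear more than once). The helper name and the small input list are illustrative only.

```python
from collections import Counter

def distinct_and_unique(values):
    """Return (distinct, unique): the number of different values, and the
    number of values that occur exactly once."""
    counts = Counter(values)
    distinct = len(counts)
    unique = sum(1 for n in counts.values() if n == 1)
    return distinct, unique

# Made-up sample: "na." is repeated, the other two sentences occur once.
demo = ["na.", "na.", "anema azua?", "sgagay ta la!"]
distinct, unique = distinct_and_unique(demo)
print(distinct, unique)
```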
Sample
| 1st row | malalikid ku niyazu' i waluay a bulad. |
|---|---|
| 2nd row | kaudadan a demiad milalupela' kita. |
| 3rd row | i buyubuyu'an ku aadupen a mauzip. |
| 4th row | u aam ku sakalanam tu sananal. |
| 5th row | aamen nu miaamay ku tubah ni Bunga! |
| Value | Count | Frequency (%) |
| na. | 4 | < 0.1% |
| anema azua? | 4 | < 0.1% |
| sinsi, nana ku walri . | 4 | < 0.1% |
| Satokien ako to romi’ami’ad . | 3 | < 0.1% |
| su sinsi timadju? | 3 | < 0.1% |
| nana ku matra. | 3 | < 0.1% |
| sgagay ta la! | 3 | < 0.1% |
| imu, muruma’ ku lra . | 3 | < 0.1% |
| tatelraw nu ’arevu? | 3 | < 0.1% |
| nu mukuwa ku i takesiyan zi nu muruma’ ku mu . | 3 | < 0.1% |
| Other values (138462) | 138990 |
Histogram of lengths of the category
| Value | Count | Frequency (%) |
| a | 24357 | 2.7% |
| ku | 17678 | 1.9% |
| na | 17666 | 1.9% |
| ka | 17234 | 1.9% |
| tu | 15594 | 1.7% |
| i | 10176 | 1.1% |
| o | 8694 | 0.9% |
| 7670 | 0.8% | |
| su | 7317 | 0.8% |
| ta | 6945 | 0.8% |
| Other values (140997) | 782473 |
Most occurring characters
| Value | Count | Frequency (%) |
| a | 1008449 | 18.3% |
| (space) | 800736 | 14.5% |
| i | 418441 | 7.6% |
| n | 384737 | 7.0% |
| u | 357291 | 6.5% |
| k | 239393 | 4.3% |
| m | 208199 | 3.8% |
| s | 177487 | 3.2% |
| t | 177300 | 3.2% |
| l | 157855 | 2.9% |
| Other values (144) | 1587948 | 28.8% |
Most occurring categories
| Value | Count | Frequency (%) |
| Lowercase Letter | 4362309 | |
| Space Separator | 800740 | 14.5% |
| Other Punctuation | 258804 | 4.7% |
| Uppercase Letter | 64443 | 1.2% |
| Final Punctuation | 21452 | 0.4% |
| Dash Punctuation | 7056 | 0.1% |
| Initial Punctuation | 567 | < 0.1% |
| Open Punctuation | 524 | < 0.1% |
| Close Punctuation | 522 | < 0.1% |
| Modifier Symbol | 508 | < 0.1% |
| Other values (8) | 911 | < 0.1% |
Most frequent character per category
| Value | Count | Frequency (%) |
| a | 1008449 | |
| i | 418441 | 9.6% |
| n | 384737 | 8.8% |
| u | 357291 | 8.2% |
| k | 239393 | 5.5% |
| m | 208199 | 4.8% |
| s | 177487 | 4.1% |
| t | 177300 | 4.1% |
| l | 157855 | 3.6% |
| e | 143962 | 3.3% |
| Other values (30) | 1089195 |
| Value | Count | Frequency (%) |
| M | 10030 | |
| S | 9783 | |
| R | 6910 | |
| T | 4639 | 7.2% |
| P | 4543 | 7.0% |
| I | 3602 | 5.6% |
| A | 3561 | 5.5% |
| K | 3304 | 5.1% |
| N | 2524 | 3.9% |
| O | 2038 | 3.2% |
| Other values (18) | 13509 |
| Value | Count | Frequency (%) |
| 答 | 13 | |
| 問 | 10 | |
| 等 | 5 | 8.2% |
| 人 | 5 | 8.2% |
| 何 | 4 | 6.6% |
| 汝 | 4 | 6.6% |
| 芬 | 4 | 6.6% |
| 林 | 1 | 1.6% |
| 太 | 1 | 1.6% |
| 約 | 1 | 1.6% |
| Other values (13) | 13 |
| Value | Count | Frequency (%) |
| . | 114962 | |
| ' | 66659 | |
| , | 35630 | 13.8% |
| ? | 21850 | 8.4% |
| ! | 11300 | 4.4% |
| : | 6242 | 2.4% |
| ; | 1051 | 0.4% |
| / | 597 | 0.2% |
| " | 266 | 0.1% |
| ! | 69 | < 0.1% |
| Other values (10) | 178 | 0.1% |
| Value | Count | Frequency (%) |
| 1 | 87 | |
| 0 | 69 | |
| 8 | 46 | |
| 2 | 45 | |
| 9 | 42 | |
| 5 | 39 | |
| 3 | 33 | 7.6% |
| 7 | 28 | 6.4% |
| 4 | 28 | 6.4% |
| 6 | 20 | 4.6% |
| Value | Count | Frequency (%) |
| ( | 485 | |
| 「 | 28 | 5.3% |
| ( | 6 | 1.1% |
| [ | 5 | 1.0% |
| Value | Count | Frequency (%) |
| ) | 484 | |
| 」 | 27 | 5.2% |
| ) | 6 | 1.1% |
| ] | 5 | 1.0% |
| Value | Count | Frequency (%) |
| 800736 | ||
| 3 | < 0.1% | |
| 1 | < 0.1% |
| Value | Count | Frequency (%) |
| ^ | 497 | |
| ˄ | 10 | 2.0% |
| ´ | 1 | 0.2% |
| Value | Count | Frequency (%) |
| │ | 1 | |
| ↘ | 1 | |
| ─ | 1 |
| Value | Count | Frequency (%) |
| 32 | ||
| 22 | ||
| 5 | 8.5% |
| Value | Count | Frequency (%) |
| ́ | 43 | |
| ̄ | 7 | 13.7% |
| ̅ | 1 | 2.0% |
| Value | Count | Frequency (%) |
| “ | 518 | |
| ‘ | 49 | 8.6% |
| Value | Count | Frequency (%) |
| ’ | 20910 | |
| ” | 542 | 2.5% |
| Value | Count | Frequency (%) |
| ʼ | 2 | |
| ˆ | 2 |
| Value | Count | Frequency (%) |
| = | 27 | |
| ~ | 6 | 18.2% |
| Value | Count | Frequency (%) |
| - | 7056 |
| Value | Count | Frequency (%) |
| _ | 263 |
Most occurring scripts
| Value | Count | Frequency (%) |
| Latin | 4426752 | |
| Common | 1090972 | 19.8% |
| Han | 61 | < 0.1% |
| Inherited | 51 | < 0.1% |
Most frequent character per script
| Value | Count | Frequency (%) |
| a | 1008449 | |
| i | 418441 | 9.5% |
| n | 384737 | 8.7% |
| u | 357291 | 8.1% |
| k | 239393 | 5.4% |
| m | 208199 | 4.7% |
| s | 177487 | 4.0% |
| t | 177300 | 4.0% |
| l | 157855 | 3.6% |
| e | 143962 | 3.3% |
| Other values (58) | 1153638 |
| Value | Count | Frequency (%) |
| 800736 | ||
| . | 114962 | 10.5% |
| ' | 66659 | 6.1% |
| , | 35630 | 3.3% |
| ? | 21850 | 2.0% |
| ’ | 20910 | 1.9% |
| ! | 11300 | 1.0% |
| - | 7056 | 0.6% |
| : | 6242 | 0.6% |
| ; | 1051 | 0.1% |
| Other values (50) | 4576 | 0.4% |
| Value | Count | Frequency (%) |
| 答 | 13 | |
| 問 | 10 | |
| 等 | 5 | 8.2% |
| 人 | 5 | 8.2% |
| 何 | 4 | 6.6% |
| 汝 | 4 | 6.6% |
| 芬 | 4 | 6.6% |
| 林 | 1 | 1.6% |
| 太 | 1 | 1.6% |
| 約 | 1 | 1.6% |
| Other values (13) | 13 |
| Value | Count | Frequency (%) |
| ́ | 43 | |
| ̄ | 7 | 13.7% |
| ̅ | 1 | 2.0% |
Most occurring blocks
| Value | Count | Frequency (%) |
| ASCII | 5463197 | |
| IPA Ext | 30735 | 0.6% |
| Punctuation | 22049 | 0.4% |
| None | 1332 | < 0.1% |
| Latin Ext Additional | 394 | < 0.1% |
| CJK | 61 | < 0.1% |
| Diacriticals | 51 | < 0.1% |
| Modifier Letters | 14 | < 0.1% |
| Box Drawing | 2 | < 0.1% |
| Arrows | 1 | < 0.1% |
Most frequent character per block
| Value | Count | Frequency (%) |
| a | 1008449 | |
| 800736 | ||
| i | 418441 | 7.7% |
| n | 384737 | 7.0% |
| u | 357291 | 6.5% |
| k | 239393 | 4.4% |
| m | 208199 | 3.8% |
| s | 177487 | 3.2% |
| t | 177300 | 3.2% |
| l | 157855 | 2.9% |
| Other values (76) | 1533309 |
| Value | Count | Frequency (%) |
| ’ | 20910 | |
| ” | 542 | 2.5% |
| “ | 518 | 2.3% |
| ‘ | 49 | 0.2% |
| … | 28 | 0.1% |
| ′ | 2 | < 0.1% |
| Value | Count | Frequency (%) |
| é | 720 | |
| á | 103 | 7.7% |
| ē | 79 | 5.9% |
| ! | 69 | 5.2% |
| í | 67 | 5.0% |
| ú | 67 | 5.0% |
| 、 | 45 | 3.4% |
| ? | 43 | 3.2% |
| 「 | 28 | 2.1% |
| 」 | 27 | 2.0% |
| Other values (17) | 84 | 6.3% |
| Value | Count | Frequency (%) |
| │ | 1 | |
| ─ | 1 |
| Value | Count | Frequency (%) |
| ˄ | 10 | |
| ʼ | 2 | 14.3% |
| ˆ | 2 | 14.3% |
| Value | Count | Frequency (%) |
| ʉ | 29406 | |
| ɨ | 1329 | 4.3% |
| Value | Count | Frequency (%) |
| 答 | 13 | |
| 問 | 10 | |
| 等 | 5 | 8.2% |
| 人 | 5 | 8.2% |
| 何 | 4 | 6.6% |
| 汝 | 4 | 6.6% |
| 芬 | 4 | 6.6% |
| 林 | 1 | 1.6% |
| 太 | 1 | 1.6% |
| 約 | 1 | 1.6% |
| Other values (13) | 13 |
| Value | Count | Frequency (%) |
| ́ | 43 | |
| ̄ | 7 | 13.7% |
| ̅ | 1 | 2.0% |
| Value | Count | Frequency (%) |
| ↘ | 1 |
| Value | Count | Frequency (%) |
| ṟ | 394 |
Ch
Categorical
| Distinct | 118291 |
|---|---|
| Distinct (%) | 85.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 1.1 MiB |
Length
| Max length | 128 |
|---|---|
| Median length | 11 |
| Mean length | 12.14047316 |
| Min length | 1 |
Characters and Unicode
| Total characters | 1687805 |
|---|---|
| Distinct characters | 4437 |
| Distinct categories | 18 |
| Distinct scripts | 5 |
| Distinct blocks | 15 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique
| Unique | 112797 |
|---|---|
| Unique (%) | 81.1% |
Sample
| 1st row | 八月份是部落的豐年祭。 |
|---|---|
| 2nd row | 下雨天我們一起去撿天使的眼淚。 |
| 3rd row | 動物生存在山裡。 |
| 4th row | 早餐吃的是稀飯。 |
| 5th row | 乞丐去向Bunga乞討地瓜! |
| Value | Count | Frequency (%) |
| 那個人很勤勞嗎? | 74 | 0.1% |
| 下雨了!你帶著雨傘嗎? | 72 | 0.1% |
| 今天熱嗎? | 71 | 0.1% |
| 你們天天來這裡吃晚餐嗎? | 71 | 0.1% |
| 你有幾個兄弟姊妹? | 71 | 0.1% |
| 那間房子很大嗎? | 69 | < 0.1% |
| 他們天天看電視嗎? | 69 | < 0.1% |
| 在下雨嗎? | 68 | < 0.1% |
| 那張椅子很重嗎? | 66 | < 0.1% |
| 她的衣服是紅色的嗎? | 66 | < 0.1% |
| Other values (118281) | 138326 |
Histogram of lengths of the category
| Value | Count | Frequency (%) |
| 。 | 88 | 0.1% |
| 和 | 81 | 0.1% |
| 那個人很勤勞嗎? | 74 | 0.1% |
| subali | 73 | 0.1% |
| 元。 | 73 | 0.1% |
| 下雨了!你帶著雨傘嗎? | 72 | 0.1% |
| 你們天天來這裡吃晚餐嗎? | 71 | 0.1% |
| 你有幾個兄弟姊妹? | 71 | 0.1% |
| 今天熱嗎? | 71 | 0.1% |
| 他們天天看電視嗎? | 69 | < 0.1% |
| Other values (118731) | 140906 |
Most occurring characters
| Value | Count | Frequency (%) |
| 。 | 103122 | 6.1% |
| 的 | 63472 | 3.8% |
| 我 | 48642 | 2.9% |
| , | 36223 | 2.1% |
| 你 | 26784 | 1.6% |
| 是 | 22800 | 1.4% |
| 他 | 20704 | 1.2% |
| 要 | 20680 | 1.2% |
| 們 | 20651 | 1.2% |
| 了 | 20331 | 1.2% |
| Other values (4427) | 1304396 |
Most occurring categories
| Value | Count | Frequency (%) |
| Other Letter | 1410000 | |
| Other Punctuation | 182776 | 10.8% |
| Lowercase Letter | 60147 | 3.6% |
| Uppercase Letter | 10715 | 0.6% |
| Open Punctuation | 9507 | 0.6% |
| Close Punctuation | 9410 | 0.6% |
| Space Separator | 2976 | 0.2% |
| Decimal Number | 1661 | 0.1% |
| Final Punctuation | 390 | < 0.1% |
| Dash Punctuation | 60 | < 0.1% |
| Other values (8) | 163 | < 0.1% |
Most frequent character per category
| Value | Count | Frequency (%) |
| 的 | 63472 | 4.5% |
| 我 | 48642 | 3.4% |
| 你 | 26784 | 1.9% |
| 是 | 22800 | 1.6% |
| 他 | 20704 | 1.5% |
| 要 | 20680 | 1.5% |
| 們 | 20651 | 1.5% |
| 了 | 20331 | 1.4% |
| 在 | 19094 | 1.4% |
| 不 | 18573 | 1.3% |
| Other values (4276) | 1128269 |
| Value | Count | Frequency (%) |
| P | 1512 | |
| T | 1285 | |
| A | 1248 | |
| S | 1001 | |
| K | 763 | 7.1% |
| B | 743 | 6.9% |
| M | 634 | 5.9% |
| U | 605 | 5.6% |
| Y | 489 | 4.6% |
| L | 357 | 3.3% |
| Other values (22) | 2078 |
| Value | Count | Frequency (%) |
| a | 14231 | |
| u | 6096 | |
| i | 5769 | 9.6% |
| n | 5378 | 8.9% |
| y | 2711 | 4.5% |
| s | 2400 | 4.0% |
| g | 2376 | 4.0% |
| l | 2369 | 3.9% |
| k | 2164 | 3.6% |
| w | 2012 | 3.3% |
| Other values (20) | 14641 |
| Value | Count | Frequency (%) |
| 。 | 103122 | |
| , | 36223 | 19.8% |
| ? | 17466 | 9.6% |
| ! | 10484 | 5.7% |
| ? | 4392 | 2.4% |
| ' | 1916 | 1.0% |
| 、 | 1845 | 1.0% |
| / | 1612 | 0.9% |
| . | 1530 | 0.8% |
| ; | 1052 | 0.6% |
| Other values (18) | 3134 | 1.7% |
| Value | Count | Frequency (%) |
| ( | 5745 | |
| ( | 2821 | |
| 「 | 835 | 8.8% |
| [ | 69 | 0.7% |
| 「 | 17 | 0.2% |
| 〔 | 11 | 0.1% |
| 【 | 3 | < 0.1% |
| 『 | 2 | < 0.1% |
| 《 | 2 | < 0.1% |
| 〈 | 1 | < 0.1% |
| Value | Count | Frequency (%) |
| ) | 5682 | |
| ) | 2793 | |
| 」 | 829 | 8.8% |
| ] | 69 | 0.7% |
| 」 | 17 | 0.2% |
| 〕 | 11 | 0.1% |
| 】 | 3 | < 0.1% |
| 』 | 2 | < 0.1% |
| 》 | 2 | < 0.1% |
| 〉 | 1 | < 0.1% |
| Value | Count | Frequency (%) |
| 0 | 471 | |
| 1 | 301 | |
| 2 | 200 | |
| 5 | 169 | 10.2% |
| 9 | 105 | 6.3% |
| 4 | 95 | 5.7% |
| 3 | 93 | 5.6% |
| 8 | 89 | 5.4% |
| 7 | 74 | 4.5% |
| 6 | 64 | 3.9% |
| Value | Count | Frequency (%) |
| 18 | ||
| 5 | 17.2% | |
| 2 | 6.9% | |
| 2 | 6.9% | |
| | 1 | 3.4% |
| 1 | 3.4% |
| Value | Count | Frequency (%) |
| = | 19 | |
| = | 9 | |
| ~ | 2 | 6.2% |
| ⎯ | 1 | 3.1% |
| ⋯ | 1 | 3.1% |
| Value | Count | Frequency (%) |
| 2806 | ||
| 122 | 4.1% | |
| 48 | 1.6% |
| Value | Count | Frequency (%) |
| - | 53 | |
| - | 4 | 6.7% |
| — | 3 | 5.0% |
| Value | Count | Frequency (%) |
| ─ | 9 | |
| ★ | 7 | |
| ○ | 4 |
| Value | Count | Frequency (%) |
| ’ | 232 | |
| ” | 158 |
| Value | Count | Frequency (%) |
| ^ | 2 | |
| ´ | 1 |
| Value | Count | Frequency (%) |
| “ | 34 | |
| ‘ | 1 | 2.9% |
| Value | Count | Frequency (%) |
| ˋ | 35 |
| Value | Count | Frequency (%) |
| | 1 |
| Value | Count | Frequency (%) |
| | 8 |
Most occurring scripts
| Value | Count | Frequency (%) |
| Han | 1409938 | |
| Common | 206935 | 12.3% |
| Latin | 70862 | 4.2% |
| Bopomofo | 62 | < 0.1% |
| Unknown | 8 | < 0.1% |
Most frequent character per script
| Value | Count | Frequency (%) |
| 的 | 63472 | 4.5% |
| 我 | 48642 | 3.4% |
| 你 | 26784 | 1.9% |
| 是 | 22800 | 1.6% |
| 他 | 20704 | 1.5% |
| 要 | 20680 | 1.5% |
| 們 | 20651 | 1.5% |
| 了 | 20331 | 1.4% |
| 在 | 19094 | 1.4% |
| 不 | 18573 | 1.3% |
| Other values (4272) | 1128207 |
| Value | Count | Frequency (%) |
| 。 | 103122 | |
| , | 36223 | 17.5% |
| ? | 17466 | 8.4% |
| ! | 10484 | 5.1% |
| ( | 5745 | 2.8% |
| ) | 5682 | 2.7% |
| ? | 4392 | 2.1% |
| ( | 2821 | 1.4% |
| 2806 | 1.4% | |
| ) | 2793 | 1.3% |
| Other values (78) | 15401 | 7.4% |
| Value | Count | Frequency (%) |
| a | 14231 | |
| u | 6096 | 8.6% |
| i | 5769 | 8.1% |
| n | 5378 | 7.6% |
| y | 2711 | 3.8% |
| s | 2400 | 3.4% |
| g | 2376 | 3.4% |
| l | 2369 | 3.3% |
| k | 2164 | 3.1% |
| w | 2012 | 2.8% |
| Other values (52) | 25356 |
| Value | Count | Frequency (%) |
| ㄧ | 58 | |
| ㄚ | 2 | 3.2% |
| ㄇ | 1 | 1.6% |
| ㄗ | 1 | 1.6% |
| Value | Count | Frequency (%) |
| | 8 |
Most occurring blocks
| Value | Count | Frequency (%) |
| CJK | 1409259 | |
| None | 178616 | 10.6% |
| ASCII | 98392 | 5.8% |
| CJK Compat Ideographs | 679 | < 0.1% |
| Punctuation | 533 | < 0.1% |
| Small Forms | 125 | < 0.1% |
| IPA Ext | 74 | < 0.1% |
| Bopomofo | 62 | < 0.1% |
| Modifier Letters | 35 | < 0.1% |
| Box Drawing | 9 | < 0.1% |
| Other values (5) | 21 | < 0.1% |
Most frequent character per block
| Value | Count | Frequency (%) |
| 的 | 63472 | 4.5% |
| 我 | 48642 | 3.5% |
| 你 | 26784 | 1.9% |
| 是 | 22800 | 1.6% |
| 他 | 20704 | 1.5% |
| 要 | 20680 | 1.5% |
| 們 | 20651 | 1.5% |
| 了 | 20331 | 1.4% |
| 在 | 19094 | 1.4% |
| 不 | 18573 | 1.3% |
| Other values (4202) | 1127528 |
| Value | Count | Frequency (%) |
| 。 | 103122 | |
| , | 36223 | 20.3% |
| ? | 17466 | 9.8% |
| ! | 10484 | 5.9% |
| ( | 2821 | 1.6% |
| ) | 2793 | 1.6% |
| 、 | 1845 | 1.0% |
| ; | 1052 | 0.6% |
| 「 | 835 | 0.5% |
| : | 834 | 0.5% |
| Other values (31) | 1141 | 0.6% |
| Value | Count | Frequency (%) |
| a | 14231 | 14.5% |
| u | 6096 | 6.2% |
| i | 5769 | 5.9% |
| ( | 5745 | 5.8% |
| ) | 5682 | 5.8% |
| n | 5378 | 5.5% |
| ? | 4392 | 4.5% |
| 2806 | 2.9% | |
| y | 2711 | 2.8% |
| s | 2400 | 2.4% |
| Other values (78) | 43182 |
| Value | Count | Frequency (%) |
| ˋ | 35 |
| Value | Count | Frequency (%) |
| ㄧ | 58 | |
| ㄚ | 2 | 3.2% |
| ㄇ | 1 | 1.6% |
| ㄗ | 1 | 1.6% |
| Value | Count | Frequency (%) |
| ’ | 232 | |
| ” | 158 | |
| … | 94 | |
| “ | 34 | 6.4% |
| ‧ | 11 | 2.1% |
| — | 3 | 0.6% |
| ‘ | 1 | 0.2% |
| Value | Count | Frequency (%) |
| ★ | 7 |
| Value | Count | Frequency (%) |
| ﹗ | 99 | |
| ﹕ | 13 | 10.4% |
| ﹖ | 7 | 5.6% |
| ﹐ | 4 | 3.2% |
| ﹝ | 1 | 0.8% |
| ﹞ | 1 | 0.8% |
| Value | Count | Frequency (%) |
| ─ | 9 |
| Value | Count | Frequency (%) |
| ʉ | 62 | |
| ɨ | 12 | 16.2% |
| Value | Count | Frequency (%) |
| | 8 |
| Value | Count | Frequency (%) |
| 裡 | 73 | 10.8% |
| 不 | 67 | 9.9% |
| 來 | 50 | 7.4% |
| 了 | 44 | 6.5% |
| 老 | 42 | 6.2% |
| 都 | 40 | 5.9% |
| 年 | 32 | 4.7% |
| 落 | 30 | 4.4% |
| 讀 | 22 | 3.2% |
| 說 | 19 | 2.8% |
| Other values (60) | 260 |
| Value | Count | Frequency (%) |
| ⎯ | 1 |
| Value | Count | Frequency (%) |
| ○ | 4 |
| Value | Count | Frequency (%) |
| ⋯ | 1 |
From
Categorical
| Distinct | 5 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 136.1 KiB |
Length
| Max length | 4 |
|---|---|
| Median length | 2 |
| Mean length | 2.273048345 |
| Min length | 2 |
Characters and Unicode
| Total characters | 316006 |
|---|---|
| Distinct characters | 14 |
| Distinct categories | 1 |
| Distinct scripts | 1 |
| Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Sample
| 1st row | 詞典 |
|---|---|
| 2nd row | 詞典 |
| 3rd row | 詞典 |
| 4th row | 詞典 |
| 5th row | 詞典 |
| Value | Count | Frequency (%) |
| 詞典 | 103864 | 74.7% |
| 生活會話 | 12892 | 9.3% |
| 句型 | 10452 | 7.5% |
| 九階教材 | 6088 | 4.4% |
| 文法 | 5727 | 4.1% |
Histogram of lengths of the category
| Value | Count | Frequency (%) |
| 詞典 | 103864 | 74.7% |
| 生活會話 | 12892 | 9.3% |
| 句型 | 10452 | 7.5% |
| 九階教材 | 6088 | 4.4% |
| 文法 | 5727 | 4.1% |
Most occurring characters
| Value | Count | Frequency (%) |
| 詞 | 103864 | |
| 典 | 103864 | |
| 生 | 12892 | 4.1% |
| 活 | 12892 | 4.1% |
| 會 | 12892 | 4.1% |
| 話 | 12892 | 4.1% |
| 句 | 10452 | 3.3% |
| 型 | 10452 | 3.3% |
| 九 | 6088 | 1.9% |
| 階 | 6088 | 1.9% |
| Other values (4) | 23630 | 7.5% |
Most occurring categories
| Value | Count | Frequency (%) |
| Other Letter | 316006 |
Most frequent character per category
| Value | Count | Frequency (%) |
| 詞 | 103864 | |
| 典 | 103864 | |
| 生 | 12892 | 4.1% |
| 活 | 12892 | 4.1% |
| 會 | 12892 | 4.1% |
| 話 | 12892 | 4.1% |
| 句 | 10452 | 3.3% |
| 型 | 10452 | 3.3% |
| 九 | 6088 | 1.9% |
| 階 | 6088 | 1.9% |
| Other values (4) | 23630 | 7.5% |
Most occurring scripts
| Value | Count | Frequency (%) |
| Han | 316006 |
Most frequent character per script
| Value | Count | Frequency (%) |
| 詞 | 103864 | |
| 典 | 103864 | |
| 生 | 12892 | 4.1% |
| 活 | 12892 | 4.1% |
| 會 | 12892 | 4.1% |
| 話 | 12892 | 4.1% |
| 句 | 10452 | 3.3% |
| 型 | 10452 | 3.3% |
| 九 | 6088 | 1.9% |
| 階 | 6088 | 1.9% |
| Other values (4) | 23630 | 7.5% |
Most occurring blocks
| Value | Count | Frequency (%) |
| CJK | 316006 |
Most frequent character per block
| Value | Count | Frequency (%) |
| 詞 | 103864 | |
| 典 | 103864 | |
| 生 | 12892 | 4.1% |
| 活 | 12892 | 4.1% |
| 會 | 12892 | 4.1% |
| 話 | 12892 | 4.1% |
| 句 | 10452 | 3.3% |
| 型 | 10452 | 3.3% |
| 九 | 6088 | 1.9% |
| 階 | 6088 | 1.9% |
| Other values (4) | 23630 | 7.5% |
word_counts
Real number (ℝ≥0)
| Distinct | 51 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 6.58742798 |
| Minimum | 1 |
| Maximum | 89 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 1.1 MiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 3 |
| Q1 | 5 |
| median | 6 |
| Q3 | 8 |
| 95-th percentile | 12 |
| Maximum | 89 |
| Range | 88 |
| Interquartile range (IQR) | 3 |
Descriptive statistics
| Standard deviation | 3.09127209 |
|---|---|
| Coefficient of variation (CV) | 0.4692684458 |
| Kurtosis | 13.47000465 |
| Mean | 6.58742798 |
| Median Absolute Deviation (MAD) | 2 |
| Skewness | 2.008589441 |
| Sum | 915804 |
| Variance | 9.555963133 |
| Monotonicity | Not monotonic |
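Several of the figures above are tied together by simple identities: the coefficient of variation is the standard deviation divided by the mean, the variance is the squared standard deviation, the sum is the mean times the number of observations, and the range and IQR are differences of quantiles. The sketch below only re-derives those values from the numbers printed in this report; the variable names are illustrative.

```python
# Figures copied from the tables in this report.
n = 139023             # number of observations
mean = 6.58742798
std = 3.09127209
q1, q3 = 5, 8
minimum, maximum = 1, 89

cv = std / mean                   # coefficient of variation
variance = std ** 2
total = mean * n                  # sum of all word counts
value_range = maximum - minimum
iqr = q3 - q1                     # interquartile range

print(f"CV={cv:.6f} variance={variance:.6f} sum={total:.0f} "
      f"range={value_range} IQR={iqr}")
```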
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 5 | 23536 | 16.9% |
| 6 | 22566 | 16.2% |
| 7 | 18690 | 13.4% |
| 4 | 17842 | 12.8% |
| 8 | 13567 | 9.8% |
| 3 | 10447 | 7.5% |
| 9 | 9231 | 6.6% |
| 10 | 5867 | 4.2% |
| 11 | 3906 | 2.8% |
| 2 | 3409 | 2.5% |
| Other values (41) | 9962 | 7.2% |
Minimum 5 values
| Value | Count | Frequency (%) |
| 1 | 1164 | 0.8% |
| 2 | 3409 | 2.5% |
| 3 | 10447 | 7.5% |
| 4 | 17842 | 12.8% |
| 5 | 23536 | 16.9% |
Maximum 5 values
| Value | Count | Frequency (%) |
| 89 | 1 | < 0.1% |
| 63 | 1 | < 0.1% |
| 57 | 2 | < 0.1% |
| 52 | 1 | < 0.1% |
| 49 | 1 | < 0.1% |
Pearson's r
Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
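The calculation described above can be sketched in a few lines of plain Python. This is an illustration of the textbook formula, not the report generator's own code; `pearson_r` and the sample data are assumptions made for the example.

```python
import math

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their
    standard deviations (the 1/n factors cancel, so plain sums suffice)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(x, [2 * v + 1 for v in x])      # perfectly linear, increasing
r_down = pearson_r(x, [-3 * v + 10 for v in x])  # perfectly linear, decreasing
print(r_up, r_down)
```

The two slopes differ, yet both relationships score at the extremes of the range, which is the scale invariance the paragraph refers to.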
Spearman's ρ
Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation and +1 indicating total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
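A minimal sketch of that recipe follows: rank both variables (ties receive the mean of their positions), then apply Pearson's formula to the ranks. The function names and the cubic example are illustrative assumptions, not the report generator's implementation.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the rank variables."""
    return pearson_r(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but not linear
rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(rho, r)
```

On this input the rank-based coefficient reaches its maximum while Pearson's r stays below it, which is the difference the paragraph describes.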
Kendall's τ
Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation and +1 indicating total positive correlation. To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
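The pair-counting definition above corresponds to the variant usually called tau-a. The sketch below implements exactly that definition. Library implementations often use a tie-adjusted variant (tau-b) instead, so results can differ when the data contain ties. The function name and inputs are illustrative.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau as (concordant - discordant) / all pairs.

    A pair is concordant when both variables order it the same way and
    discordant when they order it in opposite ways. Tied pairs count as neither.
    """
    concordant = discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)

x = [1, 2, 3, 4]
y = [1, 3, 2, 4]  # one swapped pair out of six
tau = kendall_tau_a(x, y)
print(tau)
```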
Phik (φk)
Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution.
Cramér's V (φc)
Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been proved to be biased, even for large samples. We use a bias-corrected measure proposed by Bergsma in 2013.
A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
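Returning to the bias-corrected Cramér's V described in the correlation notes, the sketch below follows the correction commonly attributed to Bergsma (2013): the squared φ statistic and the table dimensions are each shrunk by a term that depends on the sample size. It is an illustration under that assumption, not the report generator's own code, and `cramers_v_corrected` is a hypothetical name. The demo pairs two label columns that determine each other, as a language name and its Chinese label do in this dataset.

```python
import math
from collections import Counter

def cramers_v_corrected(x, y):
    """Bias-corrected Cramér's V for two equally long sequences of labels."""
    n = len(x)
    joint = Counter(zip(x, y))
    row_tot = Counter(x)
    col_tot = Counter(y)
    r, k = len(row_tot), len(col_tot)
    if n < 2 or r < 2 or k < 2:
        return 0.0  # association is undefined; 0 is returned by convention here
    # Pearson chi-squared statistic over every cell, including empty ones.
    chi2 = 0.0
    for a, ra in row_tot.items():
        for b, cb in col_tot.items():
            expected = ra * cb / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected
    phi2 = chi2 / n
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    return math.sqrt(phi2_corr / min(k_corr - 1, r_corr - 1))

# Labels that determine each other (values taken from the first and last rows).
lang_en = ["Sakizaya"] * 10 + ["Bunun"] * 10
lang_ch = ["撒奇萊雅"] * 10 + ["布農_郡群"] * 10
v_perfect = cramers_v_corrected(lang_en, lang_ch)

# Made-up labels that are independent of each other.
v_indep = cramers_v_corrected(["a", "a", "b", "b"] * 5, ["u", "v", "u", "v"] * 5)
print(v_perfect, v_indep)
```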
First rows
| | Lang_En | Lang_Ch | Ab | Ch | From | word_counts |
|---|---|---|---|---|---|---|
| 0 | Sakizaya | 撒奇萊雅 | malalikid ku niyazu' i waluay a bulad. | 八月份是部落的豐年祭。 | 詞典 | 7 |
| 1 | Sakizaya | 撒奇萊雅 | kaudadan a demiad milalupela' kita. | 下雨天我們一起去撿天使的眼淚。 | 詞典 | 5 |
| 2 | Sakizaya | 撒奇萊雅 | i buyubuyu'an ku aadupen a mauzip. | 動物生存在山裡。 | 詞典 | 6 |
| 3 | Sakizaya | 撒奇萊雅 | u aam ku sakalanam tu sananal. | 早餐吃的是稀飯。 | 詞典 | 6 |
| 4 | Sakizaya | 撒奇萊雅 | aamen nu miaamay ku tubah ni Bunga! | 乞丐去向Bunga乞討地瓜! | 詞典 | 7 |
| 5 | Sakizaya | 撒奇萊雅 | miaam ku miaamay tu hemay. | 乞丐常常來討飯。 | 詞典 | 5 |
| 6 | Sakizaya | 撒奇萊雅 | katuud ku miaamay i Taypak. | 臺北市有很多乞丐。 | 詞典 | 5 |
| 7 | Sakizaya | 撒奇萊雅 | misaaam kaku tu sakalanam nu niyam. | 我要煮我們早餐要吃的稀飯。 | 詞典 | 6 |
| 8 | Sakizaya | 撒奇萊雅 | sapisaaam kina dangah. | 這是煮稀飯的大鍋。 | 詞典 | 3 |
| 9 | Sakizaya | 撒奇萊雅 | kau baduwac nu pabuy ku pacamul tu sasaaamen. | 用豬的排骨來熬稀飯。 | 詞典 | 8 |
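The report does not state how `word_counts` was derived. One assumption that is consistent with the rows above is that it counts the whitespace-separated tokens of `Ab`. The sketch below checks that assumption against a few of the rows shown; `count_words` is a hypothetical helper.

```python
def count_words(sentence):
    """Number of whitespace-separated tokens in `sentence`."""
    return len(sentence.split())

# (Ab, word_counts) pairs copied from the first rows above.
rows = [
    ("malalikid ku niyazu' i waluay a bulad.", 7),
    ("kaudadan a demiad milalupela' kita.", 5),
    ("sapisaaam kina dangah.", 3),
    ("kau baduwac nu pabuy ku pacamul tu sasaaamen.", 8),
]

# Assumption check: every copied row should match its reported count.
matches = [count_words(text) == reported for text, reported in rows]
print(all(matches))
```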
Last rows
| | Lang_En | Lang_Ch | Ab | Ch | From | word_counts |
|---|---|---|---|---|---|---|
| 139013 | Bunun | 布農_郡群 | Inaak kaviaz hai, kuzamian tantungu. | 我的朋友到我們的地方作客。 | 詞典 | 5 |
| 139014 | Bunun | 布農_郡群 | Izamian tu sinsusuaz hai, matalbuh amin. | 我們的農作物都很肥碩。 | 詞典 | 6 |
| 139015 | Bunun | 布農_郡群 | pinitsanavan. | 在我們這裡吃晚餐吧。 | 詞典 | 1 |
| 139016 | Bunun | 布農_郡群 | Mali hai, mazaum aupa ukaan is-aang. | 氣球軟軟的,因為沒有氣。 | 詞典 | 6 |
| 139017 | Bunun | 布農_郡群 | Ukaan saikin mas zikaang pishasibang. | 我沒有時間玩。 | 詞典 | 5 |
| 139018 | Bunun | 布農_郡群 | Asa tu kapimaupa mas sinpatupa tu zikaang. | 要遵守約定的時間。 | 詞典 | 7 |
| 139019 | Bunun | 布農_郡群 | Isia makazavan tu hanian, uvaaz hai, supahan mas zungzung. | 寒冷的天氣裡,小孩子鼻涕很多。 | 詞典 | 9 |
| 139020 | Bunun | 布農_郡群 | Zungzung hai, maduhtaz. | 鼻涕是黏的。 | 詞典 | 3 |
| 139021 | Bunun | 布農_郡群 | Maza hazam hai, pandu sia lukis tu zuszus. | 鳥兒停棲在樹梢。 | 詞典 | 8 |
| 139022 | Bunun | 布農_郡群 | Mazima saikin maun mas kinal-ing tu lili tu zuszus. | 我喜歡吃炒過貓的嫩芽。 | 詞典 | 9 |