Dataset statistics
Number of variables | 6 |
---|---|
Number of observations | 139023 |
Missing cells | 0 |
Missing cells (%) | 0.0% |
Duplicate rows | 0 |
Duplicate rows (%) | 0.0% |
Total size in memory | 3.6 MiB |
Average record size in memory | 27.0 B |
Variable types
Categorical | 5 |
---|---|
Numeric | 1 |
Warnings
Ab has a high cardinality: 138472 distinct values | High cardinality |
Ch has a high cardinality: 118291 distinct values | High cardinality |
Lang_Ch is highly correlated with Lang_En | High correlation |
Lang_En is highly correlated with Lang_Ch | High correlation |
Ab is uniformly distributed | Uniform |
Reproduction
Analysis started | 2021-05-08 07:18:30.970771 |
---|---|
Analysis finished | 2021-05-08 07:18:45.182861 |
Duration | 14.21 seconds |
Software version | pandas-profiling v2.12.0 |
Lang_En
Categorical
Distinct | 16 |
---|---|
Distinct (%) | < 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 136.6 KiB |
Value | Count | Frequency (%) |
Rukai | 15036 | 10.8% |
Bunun | 13382 | 9.6% |
Atayal | 11289 | 8.1% |
Puyuma | 10359 | 7.5% |
Amis | 9978 | 7.2% |
Kavalan | 9444 | 6.8% |
Thao | 8777 | 6.3% |
Seediq | 8025 | 5.8% |
Paiwan | 8009 | 5.8% |
Yami | 7867 | 5.7% |
Other values (6) | 36857 | 26.5% |
Value | Count | Frequency (%) |
rukai | 15036 | 10.8% |
bunun | 13382 | 9.6% |
atayal | 11289 | 8.1% |
puyuma | 10359 | 7.5% |
amis | 9978 | 7.2% |
kavalan | 9444 | 6.8% |
thao | 8777 | 6.3% |
seediq | 8025 | 5.8% |
paiwan | 8009 | 5.8% |
yami | 7867 | 5.7% |
Other values (6) | 36857 | 26.5% |
Most occurring characters
Value | Count | Frequency (%) |
a | 189565 | |
u | 85836 | 10.7% |
i | 68931 | 8.6% |
n | 59909 | 7.5% |
y | 34769 | 4.3% |
k | 34272 | 4.3% |
m | 28204 | 3.5% |
S | 26728 | 3.3% |
s | 22017 | 2.7% |
A | 21267 | 2.6% |
Other values (17) | 232596 |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 665071 | |
Uppercase Letter | 139023 | 17.3% |
Most frequent character per category
Value | Count | Frequency (%) |
a | 189565 | |
u | 85836 | |
i | 68931 | 10.4% |
n | 59909 | 9.0% |
y | 34769 | 5.2% |
k | 34272 | 5.2% |
m | 28204 | 4.2% |
s | 22017 | 3.3% |
l | 20733 | 3.1% |
o | 19503 | 2.9% |
Other values (9) | 101332 |
Value | Count | Frequency (%) |
S | 26728 | |
A | 21267 | |
T | 19085 | |
P | 18368 | |
K | 17290 | |
R | 15036 | |
B | 13382 | |
Y | 7867 | 5.7% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 804094 |
Most frequent character per script
Value | Count | Frequency (%) |
a | 189565 | |
u | 85836 | 10.7% |
i | 68931 | 8.6% |
n | 59909 | 7.5% |
y | 34769 | 4.3% |
k | 34272 | 4.3% |
m | 28204 | 3.5% |
S | 26728 | 3.3% |
s | 22017 | 2.7% |
A | 21267 | 2.6% |
Other values (17) | 232596 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 804094 |
Most frequent character per block
Value | Count | Frequency (%) |
a | 189565 | |
u | 85836 | 10.7% |
i | 68931 | 8.6% |
n | 59909 | 7.5% |
y | 34769 | 4.3% |
k | 34272 | 4.3% |
m | 28204 | 3.5% |
S | 26728 | 3.3% |
s | 22017 | 2.7% |
A | 21267 | 2.6% |
Other values (17) | 232596 |
Lang_Ch
Categorical
Distinct | 43 |
---|---|
Distinct (%) | < 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 137.5 KiB |
Value | Count | Frequency (%) |
魯凱_霧台 | 11015 | 7.9% |
布農_郡群 | 10446 | 7.5% |
噶瑪蘭 | 9444 | 6.8% |
邵 | 8777 | 6.3% |
泰雅_賽考利克 | 8350 | 6.0% |
達悟 | 7867 | 5.7% |
卡那卡那富 | 7846 | 5.6% |
卑南_南王 | 7700 | 5.5% |
賽夏 | 6895 | 5.0% |
賽德克_德固達雅 | 6599 | 4.7% |
Other values (33) | 54084 |
Most occurring characters
Value | Count | Frequency (%) |
_ | 76078 | 13.0% |
魯 | 25782 | 4.4% |
雅 | 24114 | 4.1% |
賽 | 23270 | 4.0% |
南 | 19707 | 3.4% |
卡 | 16480 | 2.8% |
克 | 16375 | 2.8% |
那 | 15692 | 2.7% |
阿 | 15560 | 2.7% |
德 | 15349 | 2.6% |
Other values (70) | 338739 |
Most occurring categories
Value | Count | Frequency (%) |
Other Letter | 511068 | |
Connector Punctuation | 76078 | 13.0% |
Most frequent character per category
Value | Count | Frequency (%) |
魯 | 25782 | 5.0% |
雅 | 24114 | 4.7% |
賽 | 23270 | 4.6% |
南 | 19707 | 3.9% |
卡 | 16480 | 3.2% |
克 | 16375 | 3.2% |
那 | 15692 | 3.1% |
阿 | 15560 | 3.0% |
德 | 15349 | 3.0% |
達 | 15167 | 3.0% |
Other values (69) | 323572 |
Value | Count | Frequency (%) |
_ | 76078 |
Most occurring scripts
Value | Count | Frequency (%) |
Han | 511068 | |
Common | 76078 | 13.0% |
Most frequent character per script
Value | Count | Frequency (%) |
魯 | 25782 | 5.0% |
雅 | 24114 | 4.7% |
賽 | 23270 | 4.6% |
南 | 19707 | 3.9% |
卡 | 16480 | 3.2% |
克 | 16375 | 3.2% |
那 | 15692 | 3.1% |
阿 | 15560 | 3.0% |
德 | 15349 | 3.0% |
達 | 15167 | 3.0% |
Other values (69) | 323572 |
Value | Count | Frequency (%) |
_ | 76078 |
Most occurring blocks
Value | Count | Frequency (%) |
CJK | 511068 | |
ASCII | 76078 | 13.0% |
Most frequent character per block
Value | Count | Frequency (%) |
魯 | 25782 | 5.0% |
雅 | 24114 | 4.7% |
賽 | 23270 | 4.6% |
南 | 19707 | 3.9% |
卡 | 16480 | 3.2% |
克 | 16375 | 3.2% |
那 | 15692 | 3.1% |
阿 | 15560 | 3.0% |
德 | 15349 | 3.0% |
達 | 15167 | 3.0% |
Other values (69) | 323572 |
Value | Count | Frequency (%) |
_ | 76078 |
Ab
Categorical
Distinct | 138472 |
---|---|
Distinct (%) | 99.6% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 1.1 MiB |
na. | 4 |
---|---|
anema azua? | 4 |
sinsi, nana ku walri . | 4 |
Satokien ako to romi’ami’ad . | 3 |
su sinsi timadju? | 3 |
Other values (138467) |
Length
Max length | 486 |
---|---|
Median length | 37 |
Mean length | 39.69009444 |
Min length | 1 |
Characters and Unicode
Total characters | 5517836 |
---|---|
Distinct characters | 154 |
Distinct categories | 18 |
Distinct scripts | 4 |
Distinct blocks | 10 |
Unique
Unique | 137968 |
---|---|
Unique (%) | 99.2% |
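The report lists both a Distinct count (138472) and a Unique count (137968) for this column. A minimal sketch of one reading that fits those numbers, assuming "distinct" means the number of different values and "unique" means the number of values that occur exactly once. The function name `distinct_and_unique` and the toy data are illustrative, not taken from the report.

```python
from collections import Counter

def distinct_and_unique(values):
    """Return (distinct, unique) for a sequence of hashable values.

    Assumption: distinct = number of different values,
                unique   = number of values that appear exactly once.
    """
    counts = Counter(values)
    distinct = len(counts)
    unique = sum(1 for c in counts.values() if c == 1)
    return distinct, unique

# Toy column: "b" and "c" repeat, so only "a" is unique.
toy = ["a", "b", "b", "c", "c", "c"]
print(distinct_and_unique(toy))
```

Under that assumption, 138472 - 137968 = 504 values in this column occur more than once, and together they cover 139023 - 137968 = 1055 rows.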
Sample
1st row | malalikid ku niyazu' i waluay a bulad. |
---|---|
2nd row | kaudadan a demiad milalupela' kita. |
3rd row | i buyubuyu'an ku aadupen a mauzip. |
4th row | u aam ku sakalanam tu sananal. |
5th row | aamen nu miaamay ku tubah ni Bunga! |
Value | Count | Frequency (%) |
na. | 4 | < 0.1% |
anema azua? | 4 | < 0.1% |
sinsi, nana ku walri . | 4 | < 0.1% |
Satokien ako to romi’ami’ad . | 3 | < 0.1% |
su sinsi timadju? | 3 | < 0.1% |
nana ku matra. | 3 | < 0.1% |
sgagay ta la! | 3 | < 0.1% |
imu, muruma’ ku lra . | 3 | < 0.1% |
tatelraw nu ’arevu? | 3 | < 0.1% |
nu mukuwa ku i takesiyan zi nu muruma’ ku mu . | 3 | < 0.1% |
Other values (138462) | 138990 |
Value | Count | Frequency (%) |
a | 24357 | 2.7% |
ku | 17678 | 1.9% |
na | 17666 | 1.9% |
ka | 17234 | 1.9% |
tu | 15594 | 1.7% |
i | 10176 | 1.1% |
o | 8694 | 0.9% |
7670 | 0.8% | |
su | 7317 | 0.8% |
ta | 6945 | 0.8% |
Other values (140997) | 782473 |
Most occurring characters
Value | Count | Frequency (%) |
a | 1008449 | 18.3% |
(space) | 800736 | 14.5% |
i | 418441 | 7.6% |
n | 384737 | 7.0% |
u | 357291 | 6.5% |
k | 239393 | 4.3% |
m | 208199 | 3.8% |
s | 177487 | 3.2% |
t | 177300 | 3.2% |
l | 157855 | 2.9% |
Other values (144) | 1587948 |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 4362309 | |
Space Separator | 800740 | 14.5% |
Other Punctuation | 258804 | 4.7% |
Uppercase Letter | 64443 | 1.2% |
Final Punctuation | 21452 | 0.4% |
Dash Punctuation | 7056 | 0.1% |
Initial Punctuation | 567 | < 0.1% |
Open Punctuation | 524 | < 0.1% |
Close Punctuation | 522 | < 0.1% |
Modifier Symbol | 508 | < 0.1% |
Other values (8) | 911 | < 0.1% |
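The category names in this table correspond to Unicode general categories (for example, Lowercase Letter is `Ll`, Space Separator is `Zs`, Other Punctuation is `Po`). A minimal sketch of how such counts could be reproduced with Python's standard `unicodedata` module; this is an assumption about the approach, not the report tool's actual implementation, and `category_counts` is a hypothetical helper.

```python
import unicodedata
from collections import Counter

def category_counts(texts):
    """Count characters per Unicode general category (two-letter codes)."""
    counts = Counter()
    for text in texts:
        for ch in text:
            counts[unicodedata.category(ch)] += 1
    return counts

# First sample row of the Ab column, as shown in the report.
sample = ["malalikid ku niyazu' i waluay a bulad."]
for code, n in category_counts(sample).most_common():
    print(code, n)
```

Note that the ASCII apostrophe (') falls under Other Punctuation, while the typographic apostrophe (’) falls under Final Punctuation, which is why the two appear in different rows of the per-category tables.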
Most frequent character per category
Value | Count | Frequency (%) |
a | 1008449 | |
i | 418441 | 9.6% |
n | 384737 | 8.8% |
u | 357291 | 8.2% |
k | 239393 | 5.5% |
m | 208199 | 4.8% |
s | 177487 | 4.1% |
t | 177300 | 4.1% |
l | 157855 | 3.6% |
e | 143962 | 3.3% |
Other values (30) | 1089195 |
Value | Count | Frequency (%) |
M | 10030 | |
S | 9783 | |
R | 6910 | |
T | 4639 | 7.2% |
P | 4543 | 7.0% |
I | 3602 | 5.6% |
A | 3561 | 5.5% |
K | 3304 | 5.1% |
N | 2524 | 3.9% |
O | 2038 | 3.2% |
Other values (18) | 13509 |
Value | Count | Frequency (%) |
答 | 13 | |
問 | 10 | |
等 | 5 | 8.2% |
人 | 5 | 8.2% |
何 | 4 | 6.6% |
汝 | 4 | 6.6% |
芬 | 4 | 6.6% |
林 | 1 | 1.6% |
太 | 1 | 1.6% |
約 | 1 | 1.6% |
Other values (13) | 13 |
Value | Count | Frequency (%) |
. | 114962 | |
' | 66659 | |
, | 35630 | 13.8% |
? | 21850 | 8.4% |
! | 11300 | 4.4% |
: | 6242 | 2.4% |
; | 1051 | 0.4% |
/ | 597 | 0.2% |
" | 266 | 0.1% |
! | 69 | < 0.1% |
Other values (10) | 178 | 0.1% |
Value | Count | Frequency (%) |
1 | 87 | |
0 | 69 | |
8 | 46 | |
2 | 45 | |
9 | 42 | |
5 | 39 | |
3 | 33 | 7.6% |
7 | 28 | 6.4% |
4 | 28 | 6.4% |
6 | 20 | 4.6% |
Value | Count | Frequency (%) |
( | 485 | |
「 | 28 | 5.3% |
( | 6 | 1.1% |
[ | 5 | 1.0% |
Value | Count | Frequency (%) |
) | 484 | |
」 | 27 | 5.2% |
) | 6 | 1.1% |
] | 5 | 1.0% |
Value | Count | Frequency (%) |
800736 | ||
3 | < 0.1% | |
1 | < 0.1% |
Value | Count | Frequency (%) |
^ | 497 | |
˄ | 10 | 2.0% |
´ | 1 | 0.2% |
Value | Count | Frequency (%) |
│ | 1 | |
↘ | 1 | |
─ | 1 |
Value | Count | Frequency (%) |
32 | ||
22 | ||
5 | 8.5% |
Value | Count | Frequency (%) |
́ | 43 | |
̄ | 7 | 13.7% |
̅ | 1 | 2.0% |
Value | Count | Frequency (%) |
“ | 518 | |
‘ | 49 | 8.6% |
Value | Count | Frequency (%) |
’ | 20910 | |
” | 542 | 2.5% |
Value | Count | Frequency (%) |
ʼ | 2 | |
ˆ | 2 |
Value | Count | Frequency (%) |
= | 27 | |
~ | 6 | 18.2% |
Value | Count | Frequency (%) |
- | 7056 |
Value | Count | Frequency (%) |
_ | 263 |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 4426752 | |
Common | 1090972 | 19.8% |
Han | 61 | < 0.1% |
Inherited | 51 | < 0.1% |
Most frequent character per script
Value | Count | Frequency (%) |
a | 1008449 | |
i | 418441 | 9.5% |
n | 384737 | 8.7% |
u | 357291 | 8.1% |
k | 239393 | 5.4% |
m | 208199 | 4.7% |
s | 177487 | 4.0% |
t | 177300 | 4.0% |
l | 157855 | 3.6% |
e | 143962 | 3.3% |
Other values (58) | 1153638 |
Value | Count | Frequency (%) |
800736 | ||
. | 114962 | 10.5% |
' | 66659 | 6.1% |
, | 35630 | 3.3% |
? | 21850 | 2.0% |
’ | 20910 | 1.9% |
! | 11300 | 1.0% |
- | 7056 | 0.6% |
: | 6242 | 0.6% |
; | 1051 | 0.1% |
Other values (50) | 4576 | 0.4% |
Value | Count | Frequency (%) |
答 | 13 | |
問 | 10 | |
等 | 5 | 8.2% |
人 | 5 | 8.2% |
何 | 4 | 6.6% |
汝 | 4 | 6.6% |
芬 | 4 | 6.6% |
林 | 1 | 1.6% |
太 | 1 | 1.6% |
約 | 1 | 1.6% |
Other values (13) | 13 |
Value | Count | Frequency (%) |
́ | 43 | |
̄ | 7 | 13.7% |
̅ | 1 | 2.0% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 5463197 | |
IPA Ext | 30735 | 0.6% |
Punctuation | 22049 | 0.4% |
None | 1332 | < 0.1% |
Latin Ext Additional | 394 | < 0.1% |
CJK | 61 | < 0.1% |
Diacriticals | 51 | < 0.1% |
Modifier Letters | 14 | < 0.1% |
Box Drawing | 2 | < 0.1% |
Arrows | 1 | < 0.1% |
Most frequent character per block
Value | Count | Frequency (%) |
a | 1008449 | |
800736 | ||
i | 418441 | 7.7% |
n | 384737 | 7.0% |
u | 357291 | 6.5% |
k | 239393 | 4.4% |
m | 208199 | 3.8% |
s | 177487 | 3.2% |
t | 177300 | 3.2% |
l | 157855 | 2.9% |
Other values (76) | 1533309 |
Value | Count | Frequency (%) |
’ | 20910 | |
” | 542 | 2.5% |
“ | 518 | 2.3% |
‘ | 49 | 0.2% |
… | 28 | 0.1% |
′ | 2 | < 0.1% |
Value | Count | Frequency (%) |
é | 720 | |
á | 103 | 7.7% |
ē | 79 | 5.9% |
! | 69 | 5.2% |
í | 67 | 5.0% |
ú | 67 | 5.0% |
、 | 45 | 3.4% |
? | 43 | 3.2% |
「 | 28 | 2.1% |
」 | 27 | 2.0% |
Other values (17) | 84 | 6.3% |
Value | Count | Frequency (%) |
│ | 1 | |
─ | 1 |
Value | Count | Frequency (%) |
˄ | 10 | |
ʼ | 2 | 14.3% |
ˆ | 2 | 14.3% |
Value | Count | Frequency (%) |
ʉ | 29406 | |
ɨ | 1329 | 4.3% |
Value | Count | Frequency (%) |
答 | 13 | |
問 | 10 | |
等 | 5 | 8.2% |
人 | 5 | 8.2% |
何 | 4 | 6.6% |
汝 | 4 | 6.6% |
芬 | 4 | 6.6% |
林 | 1 | 1.6% |
太 | 1 | 1.6% |
約 | 1 | 1.6% |
Other values (13) | 13 |
Value | Count | Frequency (%) |
́ | 43 | |
̄ | 7 | 13.7% |
̅ | 1 | 2.0% |
Value | Count | Frequency (%) |
↘ | 1 |
Value | Count | Frequency (%) |
ṟ | 394 |
Ch
Categorical
Distinct | 118291 |
---|---|
Distinct (%) | 85.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 1.1 MiB |
那個人很勤勞嗎? | 74 |
---|---|
下雨了!你帶著雨傘嗎? | 72 |
今天熱嗎? | 71 |
你們天天來這裡吃晚餐嗎? | 71 |
你有幾個兄弟姊妹? | 71 |
Other values (118286) |
Length
Max length | 128 |
---|---|
Median length | 11 |
Mean length | 12.14047316 |
Min length | 1 |
Characters and Unicode
Total characters | 1687805 |
---|---|
Distinct characters | 4437 |
Distinct categories | 18 |
Distinct scripts | 5 |
Distinct blocks | 15 |
Unique
Unique | 112797 |
---|---|
Unique (%) | 81.1% |
Sample
1st row | 八月份是部落的豐年祭。 |
---|---|
2nd row | 下雨天我們一起去撿天使的眼淚。 |
3rd row | 動物生存在山裡。 |
4th row | 早餐吃的是稀飯。 |
5th row | 乞丐去向Bunga乞討地瓜! |
Value | Count | Frequency (%) |
那個人很勤勞嗎? | 74 | 0.1% |
下雨了!你帶著雨傘嗎? | 72 | 0.1% |
今天熱嗎? | 71 | 0.1% |
你們天天來這裡吃晚餐嗎? | 71 | 0.1% |
你有幾個兄弟姊妹? | 71 | 0.1% |
那間房子很大嗎? | 69 | < 0.1% |
他們天天看電視嗎? | 69 | < 0.1% |
在下雨嗎? | 68 | < 0.1% |
那張椅子很重嗎? | 66 | < 0.1% |
她的衣服是紅色的嗎? | 66 | < 0.1% |
Other values (118281) | 138326 |
Value | Count | Frequency (%) |
。 | 88 | 0.1% |
和 | 81 | 0.1% |
那個人很勤勞嗎? | 74 | 0.1% |
subali | 73 | 0.1% |
元。 | 73 | 0.1% |
下雨了!你帶著雨傘嗎? | 72 | 0.1% |
你們天天來這裡吃晚餐嗎? | 71 | 0.1% |
你有幾個兄弟姊妹? | 71 | 0.1% |
今天熱嗎? | 71 | 0.1% |
他們天天看電視嗎? | 69 | < 0.1% |
Other values (118731) | 140906 |
Most occurring characters
Value | Count | Frequency (%) |
。 | 103122 | 6.1% |
的 | 63472 | 3.8% |
我 | 48642 | 2.9% |
, | 36223 | 2.1% |
你 | 26784 | 1.6% |
是 | 22800 | 1.4% |
他 | 20704 | 1.2% |
要 | 20680 | 1.2% |
們 | 20651 | 1.2% |
了 | 20331 | 1.2% |
Other values (4427) | 1304396 |
Most occurring categories
Value | Count | Frequency (%) |
Other Letter | 1410000 | |
Other Punctuation | 182776 | 10.8% |
Lowercase Letter | 60147 | 3.6% |
Uppercase Letter | 10715 | 0.6% |
Open Punctuation | 9507 | 0.6% |
Close Punctuation | 9410 | 0.6% |
Space Separator | 2976 | 0.2% |
Decimal Number | 1661 | 0.1% |
Final Punctuation | 390 | < 0.1% |
Dash Punctuation | 60 | < 0.1% |
Other values (8) | 163 | < 0.1% |
Most frequent character per category
Value | Count | Frequency (%) |
的 | 63472 | 4.5% |
我 | 48642 | 3.4% |
你 | 26784 | 1.9% |
是 | 22800 | 1.6% |
他 | 20704 | 1.5% |
要 | 20680 | 1.5% |
們 | 20651 | 1.5% |
了 | 20331 | 1.4% |
在 | 19094 | 1.4% |
不 | 18573 | 1.3% |
Other values (4276) | 1128269 |
Value | Count | Frequency (%) |
P | 1512 | |
T | 1285 | |
A | 1248 | |
S | 1001 | |
K | 763 | 7.1% |
B | 743 | 6.9% |
M | 634 | 5.9% |
U | 605 | 5.6% |
Y | 489 | 4.6% |
L | 357 | 3.3% |
Other values (22) | 2078 |
Value | Count | Frequency (%) |
a | 14231 | |
u | 6096 | |
i | 5769 | 9.6% |
n | 5378 | 8.9% |
y | 2711 | 4.5% |
s | 2400 | 4.0% |
g | 2376 | 4.0% |
l | 2369 | 3.9% |
k | 2164 | 3.6% |
w | 2012 | 3.3% |
Other values (20) | 14641 |
Value | Count | Frequency (%) |
。 | 103122 | |
, | 36223 | 19.8% |
? | 17466 | 9.6% |
! | 10484 | 5.7% |
? | 4392 | 2.4% |
' | 1916 | 1.0% |
、 | 1845 | 1.0% |
/ | 1612 | 0.9% |
. | 1530 | 0.8% |
; | 1052 | 0.6% |
Other values (18) | 3134 | 1.7% |
Value | Count | Frequency (%) |
( | 5745 | |
( | 2821 | |
「 | 835 | 8.8% |
[ | 69 | 0.7% |
「 | 17 | 0.2% |
〔 | 11 | 0.1% |
【 | 3 | < 0.1% |
『 | 2 | < 0.1% |
《 | 2 | < 0.1% |
〈 | 1 | < 0.1% |
Value | Count | Frequency (%) |
) | 5682 | |
) | 2793 | |
」 | 829 | 8.8% |
] | 69 | 0.7% |
」 | 17 | 0.2% |
〕 | 11 | 0.1% |
】 | 3 | < 0.1% |
』 | 2 | < 0.1% |
》 | 2 | < 0.1% |
〉 | 1 | < 0.1% |
Value | Count | Frequency (%) |
0 | 471 | |
1 | 301 | |
2 | 200 | |
5 | 169 | 10.2% |
9 | 105 | 6.3% |
4 | 95 | 5.7% |
3 | 93 | 5.6% |
8 | 89 | 5.4% |
7 | 74 | 4.5% |
6 | 64 | 3.9% |
Value | Count | Frequency (%) |
18 | ||
5 | 17.2% | |
2 | 6.9% | |
2 | 6.9% | |
| 1 | 3.4% |
1 | 3.4% |
Value | Count | Frequency (%) |
= | 19 | |
= | 9 | |
~ | 2 | 6.2% |
⎯ | 1 | 3.1% |
⋯ | 1 | 3.1% |
Value | Count | Frequency (%) |
2806 | ||
122 | 4.1% | |
48 | 1.6% |
Value | Count | Frequency (%) |
- | 53 | |
- | 4 | 6.7% |
— | 3 | 5.0% |
Value | Count | Frequency (%) |
─ | 9 | |
★ | 7 | |
○ | 4 |
Value | Count | Frequency (%) |
’ | 232 | |
” | 158 |
Value | Count | Frequency (%) |
^ | 2 | |
´ | 1 |
Value | Count | Frequency (%) |
“ | 34 | |
‘ | 1 | 2.9% |
Value | Count | Frequency (%) |
ˋ | 35 |
Value | Count | Frequency (%) |
| 1 |
Value | Count | Frequency (%) |
| 8 |
Most occurring scripts
Value | Count | Frequency (%) |
Han | 1409938 | |
Common | 206935 | 12.3% |
Latin | 70862 | 4.2% |
Bopomofo | 62 | < 0.1% |
Unknown | 8 | < 0.1% |
Most frequent character per script
Value | Count | Frequency (%) |
的 | 63472 | 4.5% |
我 | 48642 | 3.4% |
你 | 26784 | 1.9% |
是 | 22800 | 1.6% |
他 | 20704 | 1.5% |
要 | 20680 | 1.5% |
們 | 20651 | 1.5% |
了 | 20331 | 1.4% |
在 | 19094 | 1.4% |
不 | 18573 | 1.3% |
Other values (4272) | 1128207 |
Value | Count | Frequency (%) |
。 | 103122 | |
, | 36223 | 17.5% |
? | 17466 | 8.4% |
! | 10484 | 5.1% |
( | 5745 | 2.8% |
) | 5682 | 2.7% |
? | 4392 | 2.1% |
( | 2821 | 1.4% |
2806 | 1.4% | |
) | 2793 | 1.3% |
Other values (78) | 15401 | 7.4% |
Value | Count | Frequency (%) |
a | 14231 | |
u | 6096 | 8.6% |
i | 5769 | 8.1% |
n | 5378 | 7.6% |
y | 2711 | 3.8% |
s | 2400 | 3.4% |
g | 2376 | 3.4% |
l | 2369 | 3.3% |
k | 2164 | 3.1% |
w | 2012 | 2.8% |
Other values (52) | 25356 |
Value | Count | Frequency (%) |
ㄧ | 58 | |
ㄚ | 2 | 3.2% |
ㄇ | 1 | 1.6% |
ㄗ | 1 | 1.6% |
Value | Count | Frequency (%) |
| 8 |
Most occurring blocks
Value | Count | Frequency (%) |
CJK | 1409259 | |
None | 178616 | 10.6% |
ASCII | 98392 | 5.8% |
CJK Compat Ideographs | 679 | < 0.1% |
Punctuation | 533 | < 0.1% |
Small Forms | 125 | < 0.1% |
IPA Ext | 74 | < 0.1% |
Bopomofo | 62 | < 0.1% |
Modifier Letters | 35 | < 0.1% |
Box Drawing | 9 | < 0.1% |
Other values (5) | 21 | < 0.1% |
Most frequent character per block
Value | Count | Frequency (%) |
的 | 63472 | 4.5% |
我 | 48642 | 3.5% |
你 | 26784 | 1.9% |
是 | 22800 | 1.6% |
他 | 20704 | 1.5% |
要 | 20680 | 1.5% |
們 | 20651 | 1.5% |
了 | 20331 | 1.4% |
在 | 19094 | 1.4% |
不 | 18573 | 1.3% |
Other values (4202) | 1127528 |
Value | Count | Frequency (%) |
。 | 103122 | |
, | 36223 | 20.3% |
? | 17466 | 9.8% |
! | 10484 | 5.9% |
( | 2821 | 1.6% |
) | 2793 | 1.6% |
、 | 1845 | 1.0% |
; | 1052 | 0.6% |
「 | 835 | 0.5% |
: | 834 | 0.5% |
Other values (31) | 1141 | 0.6% |
Value | Count | Frequency (%) |
a | 14231 | 14.5% |
u | 6096 | 6.2% |
i | 5769 | 5.9% |
( | 5745 | 5.8% |
) | 5682 | 5.8% |
n | 5378 | 5.5% |
? | 4392 | 4.5% |
2806 | 2.9% | |
y | 2711 | 2.8% |
s | 2400 | 2.4% |
Other values (78) | 43182 |
Value | Count | Frequency (%) |
ˋ | 35 |
Value | Count | Frequency (%) |
ㄧ | 58 | |
ㄚ | 2 | 3.2% |
ㄇ | 1 | 1.6% |
ㄗ | 1 | 1.6% |
Value | Count | Frequency (%) |
’ | 232 | |
” | 158 | |
… | 94 | |
“ | 34 | 6.4% |
‧ | 11 | 2.1% |
— | 3 | 0.6% |
‘ | 1 | 0.2% |
Value | Count | Frequency (%) |
★ | 7 |
Value | Count | Frequency (%) |
﹗ | 99 | |
﹕ | 13 | 10.4% |
﹖ | 7 | 5.6% |
﹐ | 4 | 3.2% |
﹝ | 1 | 0.8% |
﹞ | 1 | 0.8% |
Value | Count | Frequency (%) |
─ | 9 |
Value | Count | Frequency (%) |
ʉ | 62 | |
ɨ | 12 | 16.2% |
Value | Count | Frequency (%) |
| 8 |
Value | Count | Frequency (%) |
裡 | 73 | 10.8% |
不 | 67 | 9.9% |
來 | 50 | 7.4% |
了 | 44 | 6.5% |
老 | 42 | 6.2% |
都 | 40 | 5.9% |
年 | 32 | 4.7% |
落 | 30 | 4.4% |
讀 | 22 | 3.2% |
說 | 19 | 2.8% |
Other values (60) | 260 |
Value | Count | Frequency (%) |
⎯ | 1 |
Value | Count | Frequency (%) |
○ | 4 |
Value | Count | Frequency (%) |
⋯ | 1 |
From
Categorical
Distinct | 5 |
---|---|
Distinct (%) | < 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 136.1 KiB |
Value | Count | Frequency (%) |
詞典 | 103864 | 74.7% |
生活會話 | 12892 | 9.3% |
句型 | 10452 | 7.5% |
九階教材 | 6088 | 4.4% |
文法 | 5727 | 4.1% |
Most occurring characters
Value | Count | Frequency (%) |
詞 | 103864 | |
典 | 103864 | |
生 | 12892 | 4.1% |
活 | 12892 | 4.1% |
會 | 12892 | 4.1% |
話 | 12892 | 4.1% |
句 | 10452 | 3.3% |
型 | 10452 | 3.3% |
九 | 6088 | 1.9% |
階 | 6088 | 1.9% |
Other values (4) | 23630 | 7.5% |
Most occurring categories
Value | Count | Frequency (%) |
Other Letter | 316006 |
Most frequent character per category
Value | Count | Frequency (%) |
詞 | 103864 | |
典 | 103864 | |
生 | 12892 | 4.1% |
活 | 12892 | 4.1% |
會 | 12892 | 4.1% |
話 | 12892 | 4.1% |
句 | 10452 | 3.3% |
型 | 10452 | 3.3% |
九 | 6088 | 1.9% |
階 | 6088 | 1.9% |
Other values (4) | 23630 | 7.5% |
Most occurring scripts
Value | Count | Frequency (%) |
Han | 316006 |
Most frequent character per script
Value | Count | Frequency (%) |
詞 | 103864 | |
典 | 103864 | |
生 | 12892 | 4.1% |
活 | 12892 | 4.1% |
會 | 12892 | 4.1% |
話 | 12892 | 4.1% |
句 | 10452 | 3.3% |
型 | 10452 | 3.3% |
九 | 6088 | 1.9% |
階 | 6088 | 1.9% |
Other values (4) | 23630 | 7.5% |
Most occurring blocks
Value | Count | Frequency (%) |
CJK | 316006 |
Most frequent character per block
Value | Count | Frequency (%) |
詞 | 103864 | |
典 | 103864 | |
生 | 12892 | 4.1% |
活 | 12892 | 4.1% |
會 | 12892 | 4.1% |
話 | 12892 | 4.1% |
句 | 10452 | 3.3% |
型 | 10452 | 3.3% |
九 | 6088 | 1.9% |
階 | 6088 | 1.9% |
Other values (4) | 23630 | 7.5% |
word_counts
Real number (ℝ≥0)
Distinct | 51 |
---|---|
Distinct (%) | < 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Infinite | 0 |
Infinite (%) | 0.0% |
Mean | 6.58742798 |
Minimum | 1 |
---|---|
Maximum | 89 |
Zeros | 0 |
Zeros (%) | 0.0% |
Negative | 0 |
Negative (%) | 0.0% |
Memory size | 1.1 MiB |
Quantile statistics
Minimum | 1 |
---|---|
5-th percentile | 3 |
Q1 | 5 |
median | 6 |
Q3 | 8 |
95-th percentile | 12 |
Maximum | 89 |
Range | 88 |
Interquartile range (IQR) | 3 |
Descriptive statistics
Standard deviation | 3.09127209 |
---|---|
Coefficient of variation (CV) | 0.4692684458 |
Kurtosis | 13.47000465 |
Mean | 6.58742798 |
Median Absolute Deviation (MAD) | 2 |
Skewness | 2.008589441 |
Sum | 915804 |
Variance | 9.555963133 |
Monotonicity | Not monotonic |
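Several of the statistics above are tied together by standard definitions: mean = sum / observations, variance = standard deviation squared, CV = standard deviation / mean, range = maximum - minimum, and IQR = Q3 - Q1. A small sketch that rechecks these relationships using only the figures printed in this report (the dictionary names are illustrative):

```python
import math

# Values copied from the word_counts tables in this report.
reported = {
    "n": 139023, "sum": 915804, "mean": 6.58742798,
    "std": 3.09127209, "variance": 9.555963133, "cv": 0.4692684458,
    "min": 1, "max": 89, "q1": 5, "q3": 8, "range": 88, "iqr": 3,
}

# Standard definitions, recomputed from the reported primitives.
derived = {
    "mean": reported["sum"] / reported["n"],
    "variance": reported["std"] ** 2,
    "cv": reported["std"] / reported["mean"],
    "range": reported["max"] - reported["min"],
    "iqr": reported["q3"] - reported["q1"],
}

for name, value in derived.items():
    ok = math.isclose(value, reported[name], rel_tol=1e-6)
    print(f"{name:>8}: derived={value:.6f} reported={reported[name]} match={ok}")
```

The tolerance only absorbs the rounding of the printed figures.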
Value | Count | Frequency (%) |
5 | 23536 | 16.9% |
6 | 22566 | 16.2% |
7 | 18690 | 13.4% |
4 | 17842 | 12.8% |
8 | 13567 | 9.8% |
3 | 10447 | 7.5% |
9 | 9231 | 6.6% |
10 | 5867 | 4.2% |
11 | 3906 | 2.8% |
2 | 3409 | 2.5% |
Other values (41) | 9962 | 7.2% |
Value | Count | Frequency (%) |
1 | 1164 | 0.8% |
2 | 3409 | 2.5% |
3 | 10447 | 7.5% |
4 | 17842 | 12.8% |
5 | 23536 | 16.9% |
Value | Count | Frequency (%) |
89 | 1 | < 0.1% |
63 | 1 | < 0.1% |
57 | 2 | < 0.1% |
52 | 1 | < 0.1% |
49 | 1 | < 0.1% |
Pearson's r
Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
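A minimal sketch of the calculation just described, in plain Python. The function name `pearson_r` and the sample data are illustrative assumptions; they are not part of the report or of the tool that produced it.

```python
import math

def pearson_r(x, y):
    """Pearson's r: covariance of x and y over the product of their standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and of the standard deviations cancel,
    # so plain sums of deviations are enough.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson_r(x, [2 * v + 1 for v in x]))    # perfectly linear, increasing
print(pearson_r(x, [-3 * v + 10 for v in x]))  # perfectly linear, decreasing
```

Rescaling or shifting either input (for example, replacing every x by 10x + 3) leaves the result unchanged, which is the location and scale invariance mentioned above.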
Spearman's ρ
Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at capturing nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
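The same idea as a short sketch: rank both variables, then apply the Pearson formula to the ranks. The helper names (`average_ranks`, `pearson_r`, `spearman_rho`) and the choice of average ranks for ties are assumptions made for illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Covariance over the product of standard deviations (1/n factors cancel)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the rank variables."""
    return pearson_r(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]          # monotonic but not linear
print(pearson_r(x, y))           # below 1: the relationship is not linear
print(spearman_rho(x, y))        # the ranks agree exactly
```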
Kendall's τ
Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation. To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
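A direct sketch of the pair-counting definition given above (often called tau-a). It is quadratic in the number of observations and ignores tie corrections, so tie-adjusted variants such as tau-b can give different values when the data contain ties. The function name `kendall_tau` is illustrative.

```python
from itertools import combinations

def kendall_tau(x, y):
    """(concordant - discordant) / total pairs, as described in the text."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = discordant = pairs = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        pairs += 1
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # both variables move in the same direction
        elif direction < 0:
            discordant += 1      # the variables move in opposite directions
        # direction == 0 means a tie in x or y; it counts toward neither group
    return (concordant - discordant) / pairs

# One swapped pair out of six: 5 concordant, 1 discordant.
print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))
```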
Phik (φk)
Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. Extensive documentation is available for Phik.
Cramér's V (φc)
Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
First rows
 | Lang_En | Lang_Ch | Ab | Ch | From | word_counts |
---|---|---|---|---|---|---|
0 | Sakizaya | 撒奇萊雅 | malalikid ku niyazu' i waluay a bulad. | 八月份是部落的豐年祭。 | 詞典 | 7 |
1 | Sakizaya | 撒奇萊雅 | kaudadan a demiad milalupela' kita. | 下雨天我們一起去撿天使的眼淚。 | 詞典 | 5 |
2 | Sakizaya | 撒奇萊雅 | i buyubuyu'an ku aadupen a mauzip. | 動物生存在山裡。 | 詞典 | 6 |
3 | Sakizaya | 撒奇萊雅 | u aam ku sakalanam tu sananal. | 早餐吃的是稀飯。 | 詞典 | 6 |
4 | Sakizaya | 撒奇萊雅 | aamen nu miaamay ku tubah ni Bunga! | 乞丐去向Bunga乞討地瓜! | 詞典 | 7 |
5 | Sakizaya | 撒奇萊雅 | miaam ku miaamay tu hemay. | 乞丐常常來討飯。 | 詞典 | 5 |
6 | Sakizaya | 撒奇萊雅 | katuud ku miaamay i Taypak. | 臺北市有很多乞丐。 | 詞典 | 5 |
7 | Sakizaya | 撒奇萊雅 | misaaam kaku tu sakalanam nu niyam. | 我要煮我們早餐要吃的稀飯。 | 詞典 | 6 |
8 | Sakizaya | 撒奇萊雅 | sapisaaam kina dangah. | 這是煮稀飯的大鍋。 | 詞典 | 3 |
9 | Sakizaya | 撒奇萊雅 | kau baduwac nu pabuy ku pacamul tu sasaaamen. | 用豬的排骨來熬稀飯。 | 詞典 | 8 |
Last rows
 | Lang_En | Lang_Ch | Ab | Ch | From | word_counts |
---|---|---|---|---|---|---|
139013 | Bunun | 布農_郡群 | Inaak kaviaz hai, kuzamian tantungu. | 我的朋友到我們的地方作客。 | 詞典 | 5 |
139014 | Bunun | 布農_郡群 | Izamian tu sinsusuaz hai, matalbuh amin. | 我們的農作物都很肥碩。 | 詞典 | 6 |
139015 | Bunun | 布農_郡群 | pinitsanavan. | 在我們這裡吃晚餐吧。 | 詞典 | 1 |
139016 | Bunun | 布農_郡群 | Mali hai, mazaum aupa ukaan is-aang. | 氣球軟軟的,因為沒有氣。 | 詞典 | 6 |
139017 | Bunun | 布農_郡群 | Ukaan saikin mas zikaang pishasibang. | 我沒有時間玩。 | 詞典 | 5 |
139018 | Bunun | 布農_郡群 | Asa tu kapimaupa mas sinpatupa tu zikaang. | 要遵守約定的時間。 | 詞典 | 7 |
139019 | Bunun | 布農_郡群 | Isia makazavan tu hanian, uvaaz hai, supahan mas zungzung. | 寒冷的天氣裡,小孩子鼻涕很多。 | 詞典 | 9 |
139020 | Bunun | 布農_郡群 | Zungzung hai, maduhtaz. | 鼻涕是黏的。 | 詞典 | 3 |
139021 | Bunun | 布農_郡群 | Maza hazam hai, pandu sia lukis tu zuszus. | 鳥兒停棲在樹梢。 | 詞典 | 8 |
139022 | Bunun | 布農_郡群 | Mazima saikin maun mas kinal-ing tu lili tu zuszus. | 我喜歡吃炒過貓的嫩芽。 | 詞典 | 9 |
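In every row shown in the first and last rows tables, word_counts equals the number of whitespace-separated tokens in the Ab column. The sketch below checks that reading against a few of those rows. It is an inference from the displayed rows only: how the dataset authors actually computed word_counts is not stated in this report, and `word_count` is a hypothetical helper.

```python
def word_count(sentence):
    """Number of whitespace-separated tokens (punctuation stays attached to words)."""
    return len(sentence.split())

# (Ab text, reported word_counts) pairs copied from the tables above.
rows = [
    ("malalikid ku niyazu' i waluay a bulad.", 7),
    ("kaudadan a demiad milalupela' kita.", 5),
    ("sapisaaam kina dangah.", 3),
    ("pinitsanavan.", 1),
    ("Isia makazavan tu hanian, uvaaz hai, supahan mas zungzung.", 9),
]

for text, expected in rows:
    print(word_count(text), expected, text)
```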