Formosan Dataset

Dataset statistics

Number of variables	6
Number of observations	139023
Missing cells	0
Missing cells (%)	0.0%
Duplicate rows	0
Duplicate rows (%)	0.0%
Total size in memory	3.6 MiB
Average record size in memory	27.0 B

Variable types

Categorical	5
Numeric	1

Warnings

`Ab` has a high cardinality: 138472 distinct values	High cardinality
`Ch` has a high cardinality: 118291 distinct values	High cardinality
`Lang_Ch` is highly correlated with `Lang_En`	High correlation
`Lang_En` is highly correlated with `Lang_Ch`	High correlation
`Ab` is uniformly distributed	Uniform

Reproduction

Analysis started	2021-05-08 07:18:30.970771
Analysis finished	2021-05-08 07:18:45.182861
Duration	14.21 seconds
Software version	pandas-profiling v2.12.0
Download configuration	config.yaml

Lang_En
Categorical

HIGH CORRELATION

Distinct	16
Distinct (%)	< 0.1%
Missing	0
Missing (%)	0.0%
Memory size	136.6 KiB

Rukai	15036
Bunun	13382
Atayal	11289
Puyuma	10359
Amis	9978
Other values (11)	78979

Length

Max length	10
Median length	6
Mean length	5.783891874
Min length	4

Characters and Unicode

Total characters	804094
Distinct characters	27
Distinct categories	2
Distinct scripts	1
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
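
As a rough illustration of the character-property analysis described above, the sketch below counts Unicode general categories with Python's standard `unicodedata` module. The helper name `category_counts` and the three sample values are chosen here for illustration only; this is not necessarily how the profiling tool computes its tables. The standard library exposes general categories but not script or block properties, so those are left out.

```python
import unicodedata
from collections import Counter


def category_counts(values):
    """Count Unicode general categories over all characters in `values`."""
    counts = Counter()
    for value in values:
        for ch in value:
            # unicodedata.category returns a two-letter code,
            # e.g. "Lu" (Uppercase Letter) or "Ll" (Lowercase Letter).
            counts[unicodedata.category(ch)] += 1
    return counts


# Three language names that appear in the Lang_En column (illustrative sample).
sample = ["Sakizaya", "Rukai", "Bunun"]
counts = category_counts(sample)
print(counts)
```

Each value in `Lang_En` starts with exactly one capital letter, which is consistent with the report showing as many Uppercase Letter characters (139023) as observations.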

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	Sakizaya
2nd row	Sakizaya
3rd row	Sakizaya
4th row	Sakizaya
5th row	Sakizaya

Value	Count	Frequency (%)
Rukai	15036	10.8%
Bunun	13382	9.6%
Atayal	11289	8.1%
Puyuma	10359	7.5%
Amis	9978	7.2%
Kavalan	9444	6.8%
Thao	8777	6.3%
Seediq	8025	5.8%
Paiwan	8009	5.8%
Yami	7867	5.7%
Other values (6)	36857	26.5%

Histogram of lengths of the category

Value	Count	Frequency (%)
rukai	15036	10.8%
bunun	13382	9.6%
atayal	11289	8.1%
puyuma	10359	7.5%
amis	9978	7.2%
kavalan	9444	6.8%
thao	8777	6.3%
seediq	8025	5.8%
paiwan	8009	5.8%
yami	7867	5.7%
Other values (6)	36857	26.5%

Most occurring characters

Value	Count	Frequency (%)
a	189565	23.6%
u	85836	10.7%
i	68931	8.6%
n	59909	7.5%
y	34769	4.3%
k	34272	4.3%
m	28204	3.5%
S	26728	3.3%
s	22017	2.7%
A	21267	2.6%
Other values (17)	232596	28.9%

Most occurring categories

Value	Count	Frequency (%)
Lowercase Letter	665071	82.7%
Uppercase Letter	139023	17.3%

Most frequent character per category

Value	Count	Frequency (%)
a	189565	28.5%
u	85836	12.9%
i	68931	10.4%
n	59909	9.0%
y	34769	5.2%
k	34272	5.2%
m	28204	4.2%
s	22017	3.3%
l	20733	3.1%
o	19503	2.9%
Other values (9)	101332	15.2%

Value	Count	Frequency (%)
S	26728	19.2%
A	21267	15.3%
T	19085	13.7%
P	18368	13.2%
K	17290	12.4%
R	15036	10.8%
B	13382	9.6%
Y	7867	5.7%

Most occurring scripts

Value	Count	Frequency (%)
Latin	804094	100.0%

Most frequent character per script

Value	Count	Frequency (%)
a	189565	23.6%
u	85836	10.7%
i	68931	8.6%
n	59909	7.5%
y	34769	4.3%
k	34272	4.3%
m	28204	3.5%
S	26728	3.3%
s	22017	2.7%
A	21267	2.6%
Other values (17)	232596	28.9%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	804094	100.0%

Most frequent character per block

Value	Count	Frequency (%)
a	189565	23.6%
u	85836	10.7%
i	68931	8.6%
n	59909	7.5%
y	34769	4.3%
k	34272	4.3%
m	28204	3.5%
S	26728	3.3%
s	22017	2.7%
A	21267	2.6%
Other values (17)	232596	28.9%

Lang_Ch
Categorical

HIGH CORRELATION

Distinct	43
Distinct (%)	< 0.1%
Missing	0
Missing (%)	0.0%
Memory size	137.5 KiB

魯凱_霧台	11015
布農_郡群	10446
噶瑪蘭	9444
邵	8777
泰雅_賽考利克	8350
Other values (38)	90991

Length

Max length	8
Median length	5
Mean length	4.223373111
Min length	1

Characters and Unicode

Total characters	587146
Distinct characters	80
Distinct categories	2
Distinct scripts	2
Distinct blocks	2

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	撒奇萊雅
2nd row	撒奇萊雅
3rd row	撒奇萊雅
4th row	撒奇萊雅
5th row	撒奇萊雅

Value	Count	Frequency (%)
魯凱_霧台	11015	7.9%
布農_郡群	10446	7.5%
噶瑪蘭	9444	6.8%
邵	8777	6.3%
泰雅_賽考利克	8350	6.0%
達悟	7867	5.7%
卡那卡那富	7846	5.6%
卑南_南王	7700	5.5%
賽夏	6895	5.0%
賽德克_德固達雅	6599	4.7%
Other values (33)	54084	38.9%

Histogram of lengths of the category

Value	Count	Frequency (%)
魯凱_霧台	11015	7.9%
布農_郡群	10446	7.5%
噶瑪蘭	9444	6.8%
邵	8777	6.3%
泰雅_賽考利克	8350	6.0%
達悟	7867	5.7%
卡那卡那富	7846	5.6%
卑南_南王	7700	5.5%
賽夏	6895	5.0%
賽德克_德固達雅	6599	4.7%
Other values (33)	54084	38.9%

Most occurring characters

Value	Count	Frequency (%)
_	76078	13.0%
魯	25782	4.4%
雅	24114	4.1%
賽	23270	4.0%
南	19707	3.4%
卡	16480	2.8%
克	16375	2.8%
那	15692	2.7%
阿	15560	2.7%
德	15349	2.6%
Other values (70)	338739	57.7%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	511068	87.0%
Connector Punctuation	76078	13.0%

Most frequent character per category

Value	Count	Frequency (%)
魯	25782	5.0%
雅	24114	4.7%
賽	23270	4.6%
南	19707	3.9%
卡	16480	3.2%
克	16375	3.2%
那	15692	3.1%
阿	15560	3.0%
德	15349	3.0%
達	15167	3.0%
Other values (69)	323572	63.3%

Value	Count	Frequency (%)
_	76078	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Han	511068	87.0%
Common	76078	13.0%

Most frequent character per script

Value	Count	Frequency (%)
魯	25782	5.0%
雅	24114	4.7%
賽	23270	4.6%
南	19707	3.9%
卡	16480	3.2%
克	16375	3.2%
那	15692	3.1%
阿	15560	3.0%
德	15349	3.0%
達	15167	3.0%
Other values (69)	323572	63.3%

Value	Count	Frequency (%)
_	76078	100.0%

Most occurring blocks

Value	Count	Frequency (%)
CJK	511068	87.0%
ASCII	76078	13.0%

Most frequent character per block

Value	Count	Frequency (%)
魯	25782	5.0%
雅	24114	4.7%
賽	23270	4.6%
南	19707	3.9%
卡	16480	3.2%
克	16375	3.2%
那	15692	3.1%
阿	15560	3.0%
德	15349	3.0%
達	15167	3.0%
Other values (69)	323572	63.3%

Value	Count	Frequency (%)
_	76078	100.0%

Ab
Categorical

HIGH CARDINALITY
UNIFORM

Distinct	138472
Distinct (%)	99.6%
Missing	0
Missing (%)	0.0%
Memory size	1.1 MiB

na.	4
anema azua?	4
sinsi, nana ku walri .	4
Satokien ako to romi’ami’ad .	3
su sinsi timadju?	3
Other values (138467)	139005

Length

Max length	486
Median length	37
Mean length	39.69009444
Min length	1

Characters and Unicode

Total characters	5517836
Distinct characters	154
Distinct categories	18
Distinct scripts	4
Distinct blocks	10

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	137968
Unique (%)	99.2%

Sample

1st row	malalikid ku niyazu' i waluay a bulad.
2nd row	kaudadan a demiad milalupela' kita.
3rd row	i buyubuyu'an ku aadupen a mauzip.
4th row	u aam ku sakalanam tu sananal.
5th row	aamen nu miaamay ku tubah ni Bunga!

Value	Count	Frequency (%)
na.	4	< 0.1%
anema azua?	4	< 0.1%
sinsi, nana ku walri .	4	< 0.1%
Satokien ako to romi’ami’ad .	3	< 0.1%
su sinsi timadju?	3	< 0.1%
nana ku matra.	3	< 0.1%
sgagay ta la!	3	< 0.1%
imu, muruma’ ku lra .	3	< 0.1%
tatelraw nu ’arevu?	3	< 0.1%
nu mukuwa ku i takesiyan zi nu muruma’ ku mu .	3	< 0.1%
Other values (138462)	138990	> 99.9%

Histogram of lengths of the category

Value	Count	Frequency (%)
a	24357	2.7%
ku	17678	1.9%
na	17666	1.9%
ka	17234	1.9%
tu	15594	1.7%
i	10176	1.1%
o	8694	0.9%
	7670	0.8%
su	7317	0.8%
ta	6945	0.8%
Other values (140997)	782473	85.4%

Most occurring characters

Value	Count	Frequency (%)
a	1008449	18.3%
	800736	14.5%
i	418441	7.6%
n	384737	7.0%
u	357291	6.5%
k	239393	4.3%
m	208199	3.8%
s	177487	3.2%
t	177300	3.2%
l	157855	2.9%
Other values (144)	1587948	28.8%

Most occurring categories

Value	Count	Frequency (%)
Lowercase Letter	4362309	79.1%
Space Separator	800740	14.5%
Other Punctuation	258804	4.7%
Uppercase Letter	64443	1.2%
Final Punctuation	21452	0.4%
Dash Punctuation	7056	0.1%
Initial Punctuation	567	< 0.1%
Open Punctuation	524	< 0.1%
Close Punctuation	522	< 0.1%
Modifier Symbol	508	< 0.1%
Other values (8)	911	< 0.1%

Most frequent character per category

Value	Count	Frequency (%)
a	1008449	23.1%
i	418441	9.6%
n	384737	8.8%
u	357291	8.2%
k	239393	5.5%
m	208199	4.8%
s	177487	4.1%
t	177300	4.1%
l	157855	3.6%
e	143962	3.3%
Other values (30)	1089195	25.0%

Value	Count	Frequency (%)
M	10030	15.6%
S	9783	15.2%
R	6910	10.7%
T	4639	7.2%
P	4543	7.0%
I	3602	5.6%
A	3561	5.5%
K	3304	5.1%
N	2524	3.9%
O	2038	3.2%
Other values (18)	13509	21.0%

Value	Count	Frequency (%)
答	13	21.3%
問	10	16.4%
等	5	8.2%
人	5	8.2%
何	4	6.6%
汝	4	6.6%
芬	4	6.6%
林	1	1.6%
太	1	1.6%
約	1	1.6%
Other values (13)	13	21.3%

Value	Count	Frequency (%)
.	114962	44.4%
'	66659	25.8%
,	35630	13.8%
?	21850	8.4%
!	11300	4.4%
:	6242	2.4%
;	1051	0.4%
/	597	0.2%
"	266	0.1%
！	69	< 0.1%
Other values (10)	178	0.1%

Value	Count	Frequency (%)
1	87	19.9%
0	69	15.8%
8	46	10.5%
2	45	10.3%
9	42	9.6%
5	39	8.9%
3	33	7.6%
7	28	6.4%
4	28	6.4%
6	20	4.6%

Value	Count	Frequency (%)
(	485	92.6%
「	28	5.3%
（	6	1.1%
[	5	1.0%

Value	Count	Frequency (%)
)	484	92.7%
」	27	5.2%
）	6	1.1%
]	5	1.0%

Value	Count	Frequency (%)
	800736	> 99.9%
	3	< 0.1%
	1	< 0.1%

Value	Count	Frequency (%)
^	497	97.8%
˄	10	2.0%
´	1	0.2%

Value	Count	Frequency (%)
│	1	33.3%
↘	1	33.3%
─	1	33.3%

Value	Count	Frequency (%)
	32	54.2%
	22	37.3%
	5	8.5%

Value	Count	Frequency (%)
́	43	84.3%
̄	7	13.7%
̅	1	2.0%

Value	Count	Frequency (%)
“	518	91.4%
‘	49	8.6%

Value	Count	Frequency (%)
’	20910	97.5%
”	542	2.5%

Value	Count	Frequency (%)
ʼ	2	50.0%
ˆ	2	50.0%

Value	Count	Frequency (%)
=	27	81.8%
~	6	18.2%

Value	Count	Frequency (%)
-	7056	100.0%

Value	Count	Frequency (%)
_	263	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Latin	4426752	80.2%
Common	1090972	19.8%
Han	61	< 0.1%
Inherited	51	< 0.1%

Most frequent character per script

Value	Count	Frequency (%)
a	1008449	22.8%
i	418441	9.5%
n	384737	8.7%
u	357291	8.1%
k	239393	5.4%
m	208199	4.7%
s	177487	4.0%
t	177300	4.0%
l	157855	3.6%
e	143962	3.3%
Other values (58)	1153638	26.1%

Value	Count	Frequency (%)
	800736	73.4%
.	114962	10.5%
'	66659	6.1%
,	35630	3.3%
?	21850	2.0%
’	20910	1.9%
!	11300	1.0%
-	7056	0.6%
:	6242	0.6%
;	1051	0.1%
Other values (50)	4576	0.4%

Value	Count	Frequency (%)
答	13	21.3%
問	10	16.4%
等	5	8.2%
人	5	8.2%
何	4	6.6%
汝	4	6.6%
芬	4	6.6%
林	1	1.6%
太	1	1.6%
約	1	1.6%
Other values (13)	13	21.3%

Value	Count	Frequency (%)
́	43	84.3%
̄	7	13.7%
̅	1	2.0%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	5463197	99.0%
IPA Ext	30735	0.6%
Punctuation	22049	0.4%
None	1332	< 0.1%
Latin Ext Additional	394	< 0.1%
CJK	61	< 0.1%
Diacriticals	51	< 0.1%
Modifier Letters	14	< 0.1%
Box Drawing	2	< 0.1%
Arrows	1	< 0.1%

Most frequent character per block

Value	Count	Frequency (%)
a	1008449	18.5%
	800736	14.7%
i	418441	7.7%
n	384737	7.0%
u	357291	6.5%
k	239393	4.4%
m	208199	3.8%
s	177487	3.2%
t	177300	3.2%
l	157855	2.9%
Other values (76)	1533309	28.1%

Value	Count	Frequency (%)
’	20910	94.8%
”	542	2.5%
“	518	2.3%
‘	49	0.2%
…	28	0.1%
′	2	< 0.1%

Value	Count	Frequency (%)
é	720	54.1%
á	103	7.7%
ē	79	5.9%
！	69	5.2%
í	67	5.0%
ú	67	5.0%
、	45	3.4%
？	43	3.2%
「	28	2.1%
」	27	2.0%
Other values (17)	84	6.3%

Value	Count	Frequency (%)
│	1	50.0%
─	1	50.0%

Value	Count	Frequency (%)
˄	10	71.4%
ʼ	2	14.3%
ˆ	2	14.3%

Value	Count	Frequency (%)
ʉ	29406	95.7%
ɨ	1329	4.3%

Value	Count	Frequency (%)
答	13	21.3%
問	10	16.4%
等	5	8.2%
人	5	8.2%
何	4	6.6%
汝	4	6.6%
芬	4	6.6%
林	1	1.6%
太	1	1.6%
約	1	1.6%
Other values (13)	13	21.3%

Value	Count	Frequency (%)
́	43	84.3%
̄	7	13.7%
̅	1	2.0%

Value	Count	Frequency (%)
↘	1	100.0%

Value	Count	Frequency (%)
ṟ	394	100.0%

Ch
Categorical

HIGH CARDINALITY

Distinct	118291
Distinct (%)	85.1%
Missing	0
Missing (%)	0.0%
Memory size	1.1 MiB

那個人很勤勞嗎？	74
下雨了！你帶著雨傘嗎？	72
今天熱嗎？	71
你們天天來這裡吃晚餐嗎？	71
你有幾個兄弟姊妹？	71
Other values (118286)	138664

Length

Max length	128
Median length	11
Mean length	12.14047316
Min length	1

Characters and Unicode

Total characters	1687805
Distinct characters	4437
Distinct categories	18
Distinct scripts	5
Distinct blocks	15

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	112797
Unique (%)	81.1%

Sample

1st row	八月份是部落的豐年祭。
2nd row	下雨天我們一起去撿天使的眼淚。
3rd row	動物生存在山裡。
4th row	早餐吃的是稀飯。
5th row	乞丐去向Bunga乞討地瓜！

Value	Count	Frequency (%)
那個人很勤勞嗎？	74	0.1%
下雨了！你帶著雨傘嗎？	72	0.1%
今天熱嗎？	71	0.1%
你們天天來這裡吃晚餐嗎？	71	0.1%
你有幾個兄弟姊妹？	71	0.1%
那間房子很大嗎？	69	< 0.1%
他們天天看電視嗎？	69	< 0.1%
在下雨嗎？	68	< 0.1%
那張椅子很重嗎？	66	< 0.1%
她的衣服是紅色的嗎？	66	< 0.1%
Other values (118281)	138326	99.5%

Histogram of lengths of the category

Value	Count	Frequency (%)
。	88	0.1%
和	81	0.1%
那個人很勤勞嗎？	74	0.1%
subali	73	0.1%
元。	73	0.1%
下雨了！你帶著雨傘嗎？	72	0.1%
你們天天來這裡吃晚餐嗎？	71	0.1%
你有幾個兄弟姊妹？	71	0.1%
今天熱嗎？	71	0.1%
他們天天看電視嗎？	69	< 0.1%
Other values (118731)	140906	99.5%

Most occurring characters

Value	Count	Frequency (%)
。	103122	6.1%
的	63472	3.8%
我	48642	2.9%
，	36223	2.1%
你	26784	1.6%
是	22800	1.4%
他	20704	1.2%
要	20680	1.2%
們	20651	1.2%
了	20331	1.2%
Other values (4427)	1304396	77.3%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	1410000	83.5%
Other Punctuation	182776	10.8%
Lowercase Letter	60147	3.6%
Uppercase Letter	10715	0.6%
Open Punctuation	9507	0.6%
Close Punctuation	9410	0.6%
Space Separator	2976	0.2%
Decimal Number	1661	0.1%
Final Punctuation	390	< 0.1%
Dash Punctuation	60	< 0.1%
Other values (8)	163	< 0.1%

Most frequent character per category

Value	Count	Frequency (%)
的	63472	4.5%
我	48642	3.4%
你	26784	1.9%
是	22800	1.6%
他	20704	1.5%
要	20680	1.5%
們	20651	1.5%
了	20331	1.4%
在	19094	1.4%
不	18573	1.3%
Other values (4276)	1128269	80.0%

Value	Count	Frequency (%)
P	1512	14.1%
T	1285	12.0%
A	1248	11.6%
S	1001	9.3%
K	763	7.1%
B	743	6.9%
M	634	5.9%
U	605	5.6%
Y	489	4.6%
L	357	3.3%
Other values (22)	2078	19.4%

Value	Count	Frequency (%)
a	14231	23.7%
u	6096	10.1%
i	5769	9.6%
n	5378	8.9%
y	2711	4.5%
s	2400	4.0%
g	2376	4.0%
l	2369	3.9%
k	2164	3.6%
w	2012	3.3%
Other values (20)	14641	24.3%

Value	Count	Frequency (%)
。	103122	56.4%
，	36223	19.8%
？	17466	9.6%
！	10484	5.7%
?	4392	2.4%
'	1916	1.0%
、	1845	1.0%
/	1612	0.9%
.	1530	0.8%
；	1052	0.6%
Other values (18)	3134	1.7%

Value	Count	Frequency (%)
(	5745	60.4%
（	2821	29.7%
「	835	8.8%
[	69	0.7%
｢	17	0.2%
〔	11	0.1%
【	3	< 0.1%
『	2	< 0.1%
《	2	< 0.1%
〈	1	< 0.1%

Value	Count	Frequency (%)
)	5682	60.4%
）	2793	29.7%
」	829	8.8%
]	69	0.7%
｣	17	0.2%
〕	11	0.1%
】	3	< 0.1%
』	2	< 0.1%
》	2	< 0.1%
〉	1	< 0.1%

Value	Count	Frequency (%)
0	471	28.4%
1	301	18.1%
2	200	12.0%
5	169	10.2%
9	105	6.3%
4	95	5.7%
3	93	5.6%
8	89	5.4%
7	74	4.5%
6	64	3.9%

Value	Count	Frequency (%)
	18	62.1%
	5	17.2%
	2	6.9%
	2	6.9%
	1	3.4%
	1	3.4%

Value	Count	Frequency (%)
=	19	59.4%
＝	9	28.1%
~	2	6.2%
⎯	1	3.1%
⋯	1	3.1%

Value	Count	Frequency (%)
	2806	94.3%
	122	4.1%
	48	1.6%

Value	Count	Frequency (%)
-	53	88.3%
－	4	6.7%
—	3	5.0%

Value	Count	Frequency (%)
─	9	45.0%
★	7	35.0%
○	4	20.0%

Value	Count	Frequency (%)
’	232	59.5%
”	158	40.5%

Value	Count	Frequency (%)
^	2	66.7%
´	1	33.3%

Value	Count	Frequency (%)
“	34	97.1%
‘	1	2.9%

Value	Count	Frequency (%)
ˋ	35	100.0%

Value	Count	Frequency (%)
	1	100.0%

Value	Count	Frequency (%)
	8	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Han	1409938	83.5%
Common	206935	12.3%
Latin	70862	4.2%
Bopomofo	62	< 0.1%
Unknown	8	< 0.1%

Most frequent character per script

Value	Count	Frequency (%)
的	63472	4.5%
我	48642	3.4%
你	26784	1.9%
是	22800	1.6%
他	20704	1.5%
要	20680	1.5%
們	20651	1.5%
了	20331	1.4%
在	19094	1.4%
不	18573	1.3%
Other values (4272)	1128207	80.0%

Value	Count	Frequency (%)
。	103122	49.8%
，	36223	17.5%
？	17466	8.4%
！	10484	5.1%
(	5745	2.8%
)	5682	2.7%
?	4392	2.1%
（	2821	1.4%
	2806	1.4%
）	2793	1.3%
Other values (78)	15401	7.4%

Value	Count	Frequency (%)
a	14231	20.1%
u	6096	8.6%
i	5769	8.1%
n	5378	7.6%
y	2711	3.8%
s	2400	3.4%
g	2376	3.4%
l	2369	3.3%
k	2164	3.1%
w	2012	2.8%
Other values (52)	25356	35.8%

Value	Count	Frequency (%)
ㄧ	58	93.5%
ㄚ	2	3.2%
ㄇ	1	1.6%
ㄗ	1	1.6%

Value	Count	Frequency (%)
	8	100.0%

Most occurring blocks

Value	Count	Frequency (%)
CJK	1409259	83.5%
None	178616	10.6%
ASCII	98392	5.8%
CJK Compat Ideographs	679	< 0.1%
Punctuation	533	< 0.1%
Small Forms	125	< 0.1%
IPA Ext	74	< 0.1%
Bopomofo	62	< 0.1%
Modifier Letters	35	< 0.1%
Box Drawing	9	< 0.1%
Other values (5)	21	< 0.1%

Most frequent character per block

Value	Count	Frequency (%)
的	63472	4.5%
我	48642	3.5%
你	26784	1.9%
是	22800	1.6%
他	20704	1.5%
要	20680	1.5%
們	20651	1.5%
了	20331	1.4%
在	19094	1.4%
不	18573	1.3%
Other values (4202)	1127528	80.0%

Value	Count	Frequency (%)
。	103122	57.7%
，	36223	20.3%
？	17466	9.8%
！	10484	5.9%
（	2821	1.6%
）	2793	1.6%
、	1845	1.0%
；	1052	0.6%
「	835	0.5%
：	834	0.5%
Other values (31)	1141	0.6%

Value	Count	Frequency (%)
a	14231	14.5%
u	6096	6.2%
i	5769	5.9%
(	5745	5.8%
)	5682	5.8%
n	5378	5.5%
?	4392	4.5%
	2806	2.9%
y	2711	2.8%
s	2400	2.4%
Other values (78)	43182	43.9%

Value	Count	Frequency (%)
ˋ	35	100.0%

Value	Count	Frequency (%)
ㄧ	58	93.5%
ㄚ	2	3.2%
ㄇ	1	1.6%
ㄗ	1	1.6%

Value	Count	Frequency (%)
’	232	43.5%
”	158	29.6%
…	94	17.6%
“	34	6.4%
‧	11	2.1%
—	3	0.6%
‘	1	0.2%

Value	Count	Frequency (%)
★	7	100.0%

Value	Count	Frequency (%)
﹗	99	79.2%
﹕	13	10.4%
﹖	7	5.6%
﹐	4	3.2%
﹝	1	0.8%
﹞	1	0.8%

Value	Count	Frequency (%)
─	9	100.0%

Value	Count	Frequency (%)
ʉ	62	83.8%
ɨ	12	16.2%

Value	Count	Frequency (%)
	8	100.0%

Value	Count	Frequency (%)
裡	73	10.8%
不	67	9.9%
來	50	7.4%
了	44	6.5%
老	42	6.2%
都	40	5.9%
年	32	4.7%
落	30	4.4%
讀	22	3.2%
說	19	2.8%
Other values (60)	260	38.3%

Value	Count	Frequency (%)
⎯	1	100.0%

Value	Count	Frequency (%)
○	4	100.0%

Value	Count	Frequency (%)
⋯	1	100.0%

From
Categorical

Distinct	5
Distinct (%)	< 0.1%
Missing	0
Missing (%)	0.0%
Memory size	136.1 KiB

詞典	103864
生活會話	12892
句型	10452
九階教材	6088
文法	5727

Length

Max length	4
Median length	2
Mean length	2.273048345
Min length	2

Characters and Unicode

Total characters	316006
Distinct characters	14
Distinct categories	1
Distinct scripts	1
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	詞典
2nd row	詞典
3rd row	詞典
4th row	詞典
5th row	詞典

Value	Count	Frequency (%)
詞典	103864	74.7%
生活會話	12892	9.3%
句型	10452	7.5%
九階教材	6088	4.4%
文法	5727	4.1%

Histogram of lengths of the category

Value	Count	Frequency (%)
詞典	103864	74.7%
生活會話	12892	9.3%
句型	10452	7.5%
九階教材	6088	4.4%
文法	5727	4.1%

Most occurring characters

Value	Count	Frequency (%)
詞	103864	32.9%
典	103864	32.9%
生	12892	4.1%
活	12892	4.1%
會	12892	4.1%
話	12892	4.1%
句	10452	3.3%
型	10452	3.3%
九	6088	1.9%
階	6088	1.9%
Other values (4)	23630	7.5%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	316006	100.0%

Most frequent character per category

Value	Count	Frequency (%)
詞	103864	32.9%
典	103864	32.9%
生	12892	4.1%
活	12892	4.1%
會	12892	4.1%
話	12892	4.1%
句	10452	3.3%
型	10452	3.3%
九	6088	1.9%
階	6088	1.9%
Other values (4)	23630	7.5%

Most occurring scripts

Value	Count	Frequency (%)
Han	316006	100.0%

Most frequent character per script

Value	Count	Frequency (%)
詞	103864	32.9%
典	103864	32.9%
生	12892	4.1%
活	12892	4.1%
會	12892	4.1%
話	12892	4.1%
句	10452	3.3%
型	10452	3.3%
九	6088	1.9%
階	6088	1.9%
Other values (4)	23630	7.5%

Most occurring blocks

Value	Count	Frequency (%)
CJK	316006	100.0%

Most frequent character per block

Value	Count	Frequency (%)
詞	103864	32.9%
典	103864	32.9%
生	12892	4.1%
活	12892	4.1%
會	12892	4.1%
話	12892	4.1%
句	10452	3.3%
型	10452	3.3%
九	6088	1.9%
階	6088	1.9%
Other values (4)	23630	7.5%

word_counts
Real number (ℝ_≥0)

Distinct	51
Distinct (%)	< 0.1%
Missing	0
Missing (%)	0.0%
Infinite	0
Infinite (%)	0.0%
Mean	6.58742798

Minimum	1
Maximum	89
Zeros	0
Zeros (%)	0.0%
Negative	0
Negative (%)	0.0%
Memory size	1.1 MiB

Quantile statistics

Minimum	1
5-th percentile	3
Q1	5
median	6
Q3	8
95-th percentile	12
Maximum	89
Range	88
Interquartile range (IQR)	3

Descriptive statistics

Standard deviation	3.09127209
Coefficient of variation (CV)	0.4692684458
Kurtosis	13.47000465
Mean	6.58742798
Median Absolute Deviation (MAD)	2
Skewness	2.008589441
Sum	915804
Variance	9.555963133
Monotonicity	Not monotonic
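
The reported statistics for word_counts are tied together by simple identities. The sketch below recomputes the mean, variance, coefficient of variation, range and interquartile range from the other figures in this report. The variable names are chosen here for illustration, and the inputs are the rounded values printed above, so agreement is only expected to several decimal places.

```python
# Figures copied from the report (rounded as printed there).
n_obs = 139023          # Number of observations
total = 915804          # Sum of word_counts
std = 3.09127209        # Standard deviation
q1, q3 = 5, 8           # Q1 and Q3
minimum, maximum = 1, 89

mean = total / n_obs            # reported as 6.58742798
variance = std ** 2             # reported as 9.555963133
cv = std / mean                 # reported as 0.4692684458
iqr = q3 - q1                   # reported as 3
value_range = maximum - minimum # reported as 88

print(f"mean={mean:.6f} variance={variance:.6f} cv={cv:.6f}")
print(f"iqr={iqr} range={value_range}")
```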

Histogram with fixed size bins (bins=50)

Value	Count	Frequency (%)
5	23536	16.9%
6	22566	16.2%
7	18690	13.4%
4	17842	12.8%
8	13567	9.8%
3	10447	7.5%
9	9231	6.6%
10	5867	4.2%
11	3906	2.8%
2	3409	2.5%
Other values (41)	9962	7.2%

Minimum 5 values

Value	Count	Frequency (%)
1	1164	0.8%
2	3409	2.5%
3	10447	7.5%
4	17842	12.8%
5	23536	16.9%

Maximum 5 values

Value	Count	Frequency (%)
89	1	< 0.1%
63	1	< 0.1%
57	2	< 0.1%
52	1	< 0.1%
49	1	< 0.1%

Interactions: word_counts plotted against word_counts (figure omitted)

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
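
The calculation just described can be sketched in a few lines of plain Python. The function name `pearson_r` and the toy data are illustrative assumptions, not values from the dataset; the 1/n factors of the covariance and the standard deviations cancel, so plain sums are used.

```python
import math


def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of (co)deviations; the common 1/n factor cancels in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical, not taken from the dataset).
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r = pearson_r(xs, ys)
print(r)

# Shifting and rescaling one variable leaves r unchanged.
r_rescaled = pearson_r([10 * x + 3 for x in xs], ys)
print(r_rescaled)
```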

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
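
A minimal sketch of that recipe: convert both variables to ranks, then apply the Pearson formula to the ranks. The helper names and toy data are illustrative assumptions. Ties receive the average of the ranks they span, which is one common convention.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Pearson correlation of the rank variables."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# A monotonic but nonlinear relationship (hypothetical toy data).
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]
rho = spearman_rho(xs, ys)
r = pearson_r(xs, ys)
print(rho, r)
```

On this toy data the rank correlation is perfect while Pearson's r falls short of 1, which is the behaviour the paragraph above describes.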

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the discordant pairs divided by the total number of pairs.
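
The pair-counting definition above can be sketched directly. This version (often called tau-a) follows the formula exactly as stated and makes no adjustment for ties; statistical libraries commonly default to a tie-adjusted variant, so results may differ on data with repeated values. The function name and toy data are illustrative assumptions.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = 0
    discordant = 0
    pairs = list(combinations(range(len(xs)), 2))
    for i, j in pairs:
        # The product is positive when both variables order the pair the same way.
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)


# Toy data (hypothetical): 10 pairs, 8 concordant and 2 discordant.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
tau = kendall_tau_a(xs, ys)
print(tau)
```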

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. There is extensive documentation available here.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
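
A minimal sketch of a bias-corrected Cramér's V, assuming the usual form of the correction (shrinking φ² and the effective table dimensions by terms of order 1/(n-1)). The function name and toy labels are illustrative assumptions, and the profiling tool's own implementation may differ in detail.

```python
import math
from collections import Counter


def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equally long sequences of labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic of the contingency table.
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    phi2 = chi2 / n
    # Bias correction: shrink phi^2 and the effective table dimensions.
    phi2_corr = max(0.0, phi2 - (r - 1) * (k - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    if denom <= 0:
        return 0.0
    return math.sqrt(phi2_corr / denom)


# Toy labels (hypothetical counts): each English name maps to one Chinese label.
lang_en = ["Bunun", "Rukai"] * 50
lang_ch = ["布農_郡群", "魯凱_霧台"] * 50
v_dependent = cramers_v_corrected(lang_en, lang_ch)

# Independent labels: every combination is equally frequent.
a = ["x", "x", "y", "y"] * 25
b = ["p", "q", "p", "q"] * 25
v_independent = cramers_v_corrected(a, b)
print(v_dependent, v_independent)
```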

Missing values

Count: a simple visualization of nullity by column.

Matrix: the nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.

First rows

	Lang_En	Lang_Ch	Ab	Ch	From	word_counts
0	Sakizaya	撒奇萊雅	malalikid ku niyazu' i waluay a bulad.	八月份是部落的豐年祭。	詞典	7
1	Sakizaya	撒奇萊雅	kaudadan a demiad milalupela' kita.	下雨天我們一起去撿天使的眼淚。	詞典	5
2	Sakizaya	撒奇萊雅	i buyubuyu'an ku aadupen a mauzip.	動物生存在山裡。	詞典	6
3	Sakizaya	撒奇萊雅	u aam ku sakalanam tu sananal.	早餐吃的是稀飯。	詞典	6
4	Sakizaya	撒奇萊雅	aamen nu miaamay ku tubah ni Bunga!	乞丐去向Bunga乞討地瓜！	詞典	7
5	Sakizaya	撒奇萊雅	miaam ku miaamay tu hemay.	乞丐常常來討飯。	詞典	5
6	Sakizaya	撒奇萊雅	katuud ku miaamay i Taypak.	臺北市有很多乞丐。	詞典	5
7	Sakizaya	撒奇萊雅	misaaam kaku tu sakalanam nu niyam.	我要煮我們早餐要吃的稀飯。	詞典	6
8	Sakizaya	撒奇萊雅	sapisaaam kina dangah.	這是煮稀飯的大鍋。	詞典	3
9	Sakizaya	撒奇萊雅	kau baduwac nu pabuy ku pacamul tu sasaaamen.	用豬的排骨來熬稀飯。	詞典	8

Last rows

	Lang_En	Lang_Ch	Ab	Ch	From	word_counts
139013	Bunun	布農_郡群	Inaak kaviaz hai, kuzamian tantungu.	我的朋友到我們的地方作客。	詞典	5
139014	Bunun	布農_郡群	Izamian tu sinsusuaz hai, matalbuh amin.	我們的農作物都很肥碩。	詞典	6
139015	Bunun	布農_郡群	pinitsanavan.	在我們這裡吃晚餐吧。	詞典	1
139016	Bunun	布農_郡群	Mali hai, mazaum aupa ukaan is-aang.	氣球軟軟的，因為沒有氣。	詞典	6
139017	Bunun	布農_郡群	Ukaan saikin mas zikaang pishasibang.	我沒有時間玩。	詞典	5
139018	Bunun	布農_郡群	Asa tu kapimaupa mas sinpatupa tu zikaang.	要遵守約定的時間。	詞典	7
139019	Bunun	布農_郡群	Isia makazavan tu hanian, uvaaz hai, supahan mas zungzung.	寒冷的天氣裡，小孩子鼻涕很多。	詞典	9
139020	Bunun	布農_郡群	Zungzung hai, maduhtaz.	鼻涕是黏的。	詞典	3
139021	Bunun	布農_郡群	Maza hazam hai, pandu sia lukis tu zuszus.	鳥兒停棲在樹梢。	詞典	8
139022	Bunun	布農_郡群	Mazima saikin maun mas kinal-ing tu lili tu zuszus.	我喜歡吃炒過貓的嫩芽。	詞典	9
