<!doctype html>
<html>
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<script src="https://cdn.tailwindcss.com"></script>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/katex.min.css" integrity="sha384-vKruj+a13U8yHIkAyGgK1J3ArTLzrFGBbBc0tDp4ad/EyewESeXE/Iv67Aj8gKZ0" crossorigin="anonymous">
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/katex.min.js" integrity="sha384-PwRUT/YqbnEjkZO0zZxNqcxACrXe+j766U2amXcgMg5457rve2Y7I6ZJSm2A0mS4" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/@alpinejs/collapse@3.x.x/dist/cdn.min.js"></script>
<script defer src="https://cdn.jsdelivr.net/npm/alpinejs@3.x.x/dist/cdn.min.js"></script>
</head>
<body>
<div class="relative mx-auto h-full max-w-2xl text-md">
<table class="table-auto">
<tbody>
<tr>
<td></td>
<td>
<h1 class="text-4xl pt-4 font-bold"><span class="underline">Vincent's</span> Arxiv FrontPage</h1>
<br>
<p>Generated on 2024-09-17.</p><br/>
<p class="text-sm text-gray-500 pt-2">This frontpage is made by scraping arXiv and running a sentence model that detects whether an abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via GitHub Actions. </p>
<br>
</td>
</tr><tr>
<td></td>
<td>
<h2 class="text-2xl tracking-tight pt-4 font-bold">New Datasets</h2>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Point2Graph: An End-to-end Point Cloud-based 3D Open-Vocabulary Scene Graph for Robot Navigation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Current open-vocabulary scene graph generation algorithms highly rely on both 3D scene point cloud data and posed RGB-D images and thus have limited applications in scenarios where RGB-D images or camera poses are not readily available. <span class='px-1 mx-1 bg-yellow-200'>To solve this problem, we propose Point2Graph, a novel end-to-end point cloud-based 3D open-vocabulary scene graph generation framework in which the requirement of posed RGB-D image series is eliminated. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.753</span></span> This hierarchical framework contains room and object detection/segmentation and open-vocabulary classification. For the room layer, we leverage the advantage of merging the geometry-based border detection algorithm with the learning-based region detection to segment rooms and create a "Snap-Lookup" framework for open-vocabulary room classification. In addition, we create an end-to-end pipeline for the object layer to detect and classify 3D objects based solely on 3D point cloud data. Our evaluation results show that our framework can outperform the current state-of-the-art (SOTA) open-vocabulary object and room segmentation and classification algorithm on widely used real-scene datasets.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.10350v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation?
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Co-speech gestures are fundamental for communication. The advent of recent deep learning techniques has facilitated the creation of lifelike, synchronous co-speech gestures for Embodied Conversational Agents. <span class='px-1 mx-1 bg-yellow-200'>"In-the-wild" datasets, aggregating video content from platforms like YouTube via human pose detection technologies, provide a feasible solution by offering 2D skeletal sequences aligned with speech. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.76</span></span> Concurrent developments in lifting models enable the conversion of these 2D sequences into 3D gesture databases. However, it is important to note that the 3D poses estimated from the 2D extracted poses are, in essence, approximations of the ground-truth, which remains in the 2D domain. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions - a topic that, to our knowledge, remains largely unexplored. Our study examines the effect of using either 2D or 3D joint coordinates as training data on the performance of speech-to-gesture deep generative models. We employ a lifting model for converting generated 2D pose sequences into 3D and assess how gestures created directly in 3D stack up against those initially generated in 2D and then converted to 3D. We perform an objective evaluation using widely used metrics in the gesture generation field as well as a user study to qualitatively evaluate the different approaches.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.10357v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
A Large-Scale Privacy Assessment of Android Third-Party SDKs
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Third-party Software Development Kits (SDKs) are widely adopted in Android app development, to effortlessly accelerate development pipelines and enhance app functionality. However, this convenience raises substantial concerns about unauthorized access to users' privacy-sensitive information, which could be further abused for illegitimate purposes like user tracking or monetization. Our study offers a targeted analysis of user privacy protection among Android third-party SDKs, filling a critical gap in the Android software supply chain. It focuses on two aspects of their privacy practices, including data exfiltration and behavior-policy compliance (or privacy compliance), utilizing techniques of taint analysis and large language models. <span class='px-1 mx-1 bg-yellow-200'>It covers 158 widely-used SDKs from two key SDK release platforms, the official one and a large alternative one. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.833</span></span> From them, we identified 338 instances of privacy data exfiltration. On the privacy compliance, our study reveals that more than 30% of the examined SDKs fail to provide a privacy policy to disclose their data handling practices. Among those that provide privacy policies, 37% of them over-collect user data, and 88% falsely claim access to sensitive data. We revisit the latest versions of the SDKs after 12 months. Our analysis demonstrates a persistent lack of improvement in these concerning trends. Based on our findings, we propose three actionable recommendations to mitigate the privacy leakage risks and enhance privacy protection for Android users. Our research not only serves as an urgent call for industry attention but also provides crucial insights for future regulatory interventions.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.10411v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Charting EDA: Characterizing Interactive Visualization Use in Computational Notebooks with a Mixed-Methods Formalism
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Interactive visualizations are powerful tools for Exploratory Data Analysis (EDA), but how do they affect the observations analysts make about their data? <span class='px-1 mx-1 bg-yellow-200'>We conducted a qualitative experiment with 13 professional data scientists analyzing two datasets with Jupyter notebooks, collecting a rich dataset of interaction traces and think-aloud utterances. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.825</span></span> By qualitatively coding participant utterances, we introduce a formalism that describes EDA as a sequence of analysis states, where each state is comprised of either a representation an analyst constructs (e.g., the output of a data frame, an interactive visualization, etc.) or an observation the analyst makes (e.g., about missing data, the relationship between variables, etc.). By applying our formalism to our dataset, we identify that interactive visualizations, on average, lead to earlier and more complex insights about relationships between dataset attributes compared to static visualizations. Moreover, by calculating metrics such as revisit count and representational diversity, we uncover that some representations serve more as "planning aids" during EDA rather than tools strictly for hypothesis-answering. We show how these measures help identify other patterns of analysis behavior, such as the "80-20 rule", where a small subset of representations drove the majority of observations. Based on these findings, we offer design guidelines for interactive exploratory analysis tooling and reflect on future directions for studying the role that visualizations play in EDA.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.10450v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Do Pre-trained Vision-Language Models Encode Object States?
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>For a vision-language model (VLM) to understand the physical world, such as cause and effect, a first step is to capture the temporal dynamics of the visual world, for example how the physical states of objects evolve over time (e.g. a whole apple into a sliced apple). Our paper aims to investigate if VLMs pre-trained on web-scale data learn to encode object states, which can be extracted with zero-shot text prompts. We curate an object state recognition dataset ChangeIt-Frames, and evaluate nine open-source VLMs, including models trained with contrastive and generative objectives. We observe that while these state-of-the-art vision-language models can reliably perform object recognition, they consistently fail to accurately distinguish the objects' physical states. Through extensive experiments, we identify three areas for improvements for VLMs to better encode object states, namely the quality of object localization, the architecture to bind concepts to objects, and the objective to learn discriminative visual and language encoders on object states. <span class='px-1 mx-1 bg-yellow-200'>Data and code are released. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.735</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.10488v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Pennsieve - A Collaborative Platform for Translational Neuroscience and Beyond
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>The exponential growth of neuroscientific data necessitates platforms that facilitate data management and multidisciplinary collaboration. <span class='px-1 mx-1 bg-yellow-200'>In this paper, we introduce Pennsieve - an open-source, cloud-based scientific data management platform built to meet these needs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.727</span></span><span class='px-1 mx-1 bg-yellow-200'>Pennsieve supports complex multimodal datasets and provides tools for data visualization and analyses. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.768</span></span> It takes a comprehensive approach to data integration, enabling researchers to define custom metadata schemas and utilize advanced tools to filter and query their data. Pennsieve's modular architecture allows external applications to extend its capabilities, and collaborative workspaces with peer-reviewed data publishing mechanisms promote high-quality datasets optimized for downstream analysis, both in the cloud and on-premises. Pennsieve forms the core for major neuroscience research programs including the NIH SPARC Initiative, NIH HEAL Initiative's PRECISION Human Pain Network, and NIH HEAL RE-JOIN Initiative. It serves more than 80 research groups worldwide, along with several large-scale, inter-institutional projects at clinical sites through the University of Pennsylvania. <span class='px-1 mx-1 bg-yellow-200'>Underpinning the SPARC.Science, Epilepsy.Science, and Pennsieve Discover portals, Pennsieve stores over 125 TB of scientific data, with 35 TB of data publicly available across more than 350 high-impact datasets. 
<span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.899</span></span> It adheres to the findable, accessible, interoperable, and reusable (FAIR) principles of data sharing and is recognized as one of the NIH-approved Data Repositories. By facilitating scientific data management, discovery, and analysis, Pennsieve fosters a robust and collaborative research ecosystem for neuroscience and beyond.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.10509v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Enhancing Video Transmission with Machine Learning based Routing in Software-Defined Networks
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Our study uses the centralized, flexible, dynamic, and programmable structure of Software-Defined networks (SDN) to overcome the problems. Although SDN effectively addresses the challenges present in traditional networks, it still requires further enhancements to achieve a more optimized network architecture. The Floodlight controller utilized in this study employs metrics such as hop count, which provides limited information for routing. In scenarios such as video transmission, this situation is insufficient and the need for optimization arises. For this purpose, an artificial intelligence (AI) based routing algorithm is proposed between the server and the client in the scenario based on NSFNET topology. The topology designed with the Floodlight controller in the Mininet simulation environment includes a client, a server, and 14 switches. A realistic network environment is provided by adding different receivers and creating TCP traffic between these receivers using the iperf3 tool. In three scenarios, video streaming is performed using the FFmpeg tool, and 49 path metrics such as RTT, throughput, and loss are recorded. In these scenarios, PSNR and SSIM calculations are made to observe the differences between the transmitted and the original video in congested and uncongested environments. <span class='px-1 mx-1 bg-yellow-200'>Due to the lack of a dataset suitable for the proposed network environment in the literature, a new dataset consisting of 876 records is created using continuously transmitted video traffic. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.881</span></span> Low and high traffic levels are created within the dataset, and different machine learning techniques such as KNN, Random Forest, SVM, AdaBoost, Logistic Regression and XGBoost are applied using the features that affect the traffic levels.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.10512v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>This paper explores the intersection of technological innovation and access to justice by developing a benchmark for predicting case outcomes in the UK Employment Tribunal (UKET). To address the challenge of extensive manual annotation, the study employs a large language model (LLM) for automatic annotation, resulting in the creation of the CLC-UKET dataset. <span class='px-1 mx-1 bg-yellow-200'>The dataset consists of approximately 19,000 UKET cases and their metadata. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.961</span></span> Comprehensive legal annotations cover facts, claims, precedent references, statutory references, case outcomes, reasons and jurisdiction codes. Facilitated by the CLC-UKET data, we examine a multi-class case outcome prediction task in the UKET. Human predictions are collected to establish a performance reference for model comparison. Empirical results from baseline models indicate that finetuned transformer models outperform zero-shot and few-shot LLMs on the UKET prediction task. The performance of zero-shot LLMs can be enhanced by integrating task-related information into few-shot examples. We hope that the CLC-UKET dataset, along with human annotations and empirical findings, can serve as a valuable benchmark for employment-related dispute resolution.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.08098v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Enhancing Canine Musculoskeletal Diagnoses: Leveraging Synthetic Image Data for Pre-Training AI-Models on Visual Documentations
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>The examination of the musculoskeletal system in dogs is a challenging task in veterinary practice. In this work, a novel method has been developed that enables efficient documentation of a dog's condition through a visual representation. However, since the visual documentation is new, there is no existing training data. The objective of this work is therefore to mitigate the impact of data scarcity in order to develop an AI-based diagnostic support system. To this end, the potential of synthetic data that mimics realistic visual documentations of diseases for pre-training AI models is investigated. We propose a method for generating synthetic image data that mimics realistic visual documentations. <span class='px-1 mx-1 bg-yellow-200'>Initially, a basic dataset containing three distinct classes is generated, followed by the creation of a more sophisticated dataset containing 36 different classes. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.715</span></span> Both datasets are used for the pre-training of an AI model. <span class='px-1 mx-1 bg-yellow-200'>Subsequently, an evaluation dataset is created, consisting of 250 manually created visual documentations for five different diseases. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.757</span></span><span class='px-1 mx-1 bg-yellow-200'>This dataset, along with a subset containing 25 examples. 
<span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.964</span></span> The obtained results on the evaluation dataset containing 25 examples demonstrate a significant enhancement of approximately 10% in diagnosis accuracy when utilizing generated synthetic images that mimic real-world visual documentations. However, these results do not hold true for the larger evaluation dataset containing 250 examples, indicating that the advantages of using synthetic data for pre-training an AI model emerge primarily when dealing with few examples of visual documentations for a given disease. Overall, this work provides valuable insights into mitigating the limitations imposed by limited training data through the strategic use of generated synthetic data, presenting an approach applicable beyond the canine musculoskeletal assessment domain.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.08181v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
AudioBERT: Audio Knowledge Augmented Language Model
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Recent studies have identified that language models, pretrained on text-only datasets, often lack elementary visual knowledge, e.g., colors of everyday objects. Motivated by this observation, we ask whether a similar shortcoming exists in terms of the auditory knowledge. To answer this question, we construct a new dataset called AuditoryBench, which consists of two novel tasks for evaluating auditory knowledge. Based on our analysis using the benchmark, we find that language models also suffer from a severe lack of auditory knowledge. To address this limitation, we propose AudioBERT, a novel method to augment the auditory knowledge of BERT through a retrieval-based approach. First, we detect auditory knowledge spans in prompts to query our retrieval model efficiently. Then, we inject audio knowledge into BERT and switch on low-rank adaptation for effective adaptation when audio knowledge is required. Our experiments demonstrate that AudioBERT is quite effective, achieving superior performance on the AuditoryBench. <span class='px-1 mx-1 bg-yellow-200'>The dataset and code are available at https://github.com/HJ-Ok/AudioBERT. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.766</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.08199v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Touch2Touch: Cross-Modal Tactile Generation for Object Manipulation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Today's touch sensors come in many shapes and sizes. This has made it challenging to develop general-purpose touch processing methods since models are generally tied to one specific sensor design. We address this problem by performing cross-modal prediction between touch sensors: given the tactile signal from one sensor, we use a generative model to estimate how the same physical contact would be perceived by another sensor. This allows us to apply sensor-specific methods to the generated signal. We implement this idea by training a diffusion model to translate between the popular GelSlim and Soft Bubble sensors. As a downstream task, we perform in-hand object pose estimation using GelSlim sensors while using an algorithm that operates only on Soft Bubble signals. <span class='px-1 mx-1 bg-yellow-200'>The dataset, the code, and additional details can be found at https://www.mmintlab.com/research/touch2touch/. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.86</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.08269v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-12</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>We present DreamHOI, a novel method for zero-shot synthesis of human-object interactions (HOIs), enabling a 3D human model to realistically interact with any given object based on a textual description. <span class='px-1 mx-1 bg-yellow-200'>This task is complicated by the varying categories and geometries of real-world objects and the scarcity of datasets encompassing diverse HOIs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.715</span></span> To circumvent the need for extensive data, we leverage text-to-image diffusion models trained on billions of image-caption pairs. We optimize the articulation of a skinned human mesh using Score Distillation Sampling (SDS) gradients obtained from these models, which predict image-space edits. However, directly backpropagating image-space gradients into complex articulation parameters is ineffective due to the local nature of such gradients. To overcome this, we introduce a dual implicit-explicit representation of a skinned mesh, combining (implicit) neural radiance fields (NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization, we transition between implicit and explicit forms, grounding the NeRF generation while refining the mesh articulation. We validate our approach through extensive experiments, demonstrating its effectiveness in generating realistic HOIs.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.08278v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-11</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p><span class='px-1 mx-1 bg-yellow-200'>In this study, we introduce ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.785</span></span> ManaTTS, released under the open CC-0 license, comprises approximately 86 hours of audio with a sampling rate of 44.1 kHz. Alongside ManaTTS, we also generated the VirgoolInformal dataset to evaluate Persian speech recognition models used for forced alignment, extending over 5 hours of audio. The datasets are supported by a fully transparent, MIT-licensed pipeline, a testament to innovation in the field. It includes unique tools for sentence tokenization, bounded audio segmentation, and a novel forced alignment method. This alignment technique is specifically designed for low-resource languages, addressing a crucial need in the field. With this dataset, we trained a Tacotron2-based TTS model, achieving a Mean Opinion Score (MOS) of 3.76, which is remarkably close to the MOS of 3.86 for the utterances generated by the same vocoder and natural spectrogram, and the MOS of 4.01 for the natural waveform, demonstrating the exceptional quality and effectiveness of the corpus.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.07259v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-11</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Visual Compositional Data Analytics for Spatial Transcriptomics
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p><span class='px-1 mx-1 bg-yellow-200'>For the Bio+Med-Vis Challenge 2024, we propose a visual analytics system as a redesign for the scatter pie chart visualization of cell type proportions of spatial transcriptomics data. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.729</span></span> Our design uses three linked views: a view of the histological image of the tissue, a stacked bar chart showing cell type proportions of the spots, and a scatter plot showing a dimensionality reduction of the multivariate proportions. Furthermore, we apply a compositional data analysis framework, the Aitchison geometry, to the proportions for dimensionality reduction and k-means clustering. Leveraging brushing and linking, the system allows one to explore and uncover patterns in the cell type mixtures and relate them to their spatial locations on the cellular tissue. This redesign shifts the pattern recognition workload from the human visual system to computational methods commonly used in visual analytics. <span class='px-1 mx-1 bg-yellow-200'>We provide the code and setup instructions of our visual analytics system on GitHub (https://github.com/UniStuttgart-VISUS/va-for-spatial-transcriptomics). <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.734</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.07306v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-11</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Benchmarking 2D Egocentric Hand Pose Datasets
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Hand pose estimation from egocentric video has broad implications across various domains, including human-computer interaction, assistive technologies, activity recognition, and robotics, making it a topic of significant research interest. The efficacy of modern machine learning models depends on the quality of data used for their training. <span class='px-1 mx-1 bg-yellow-200'>Thus, this work is devoted to the analysis of state-of-the-art egocentric datasets suitable for 2D hand pose estimation. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.736</span></span> <span class='px-1 mx-1 bg-yellow-200'>We propose a novel protocol for dataset evaluation, which encompasses not only the analysis of stated dataset characteristics and assessment of data quality, but also the identification of dataset shortcomings through the evaluation of state-of-the-art hand pose estimation models. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.788</span></span> Our study reveals that despite the availability of numerous egocentric databases intended for 2D hand pose estimation, the majority are tailored for specific use cases. <span class='px-1 mx-1 bg-yellow-200'>There is no ideal benchmark dataset yet; however, H2O and GANerated Hands datasets emerge as the most promising real and synthetic datasets, respectively. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.808</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.07337v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-11</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
CLNX: Bridging Code and Natural Language for C/C++ Vulnerability-Contributing Commits Identification
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Large Language Models (LLMs) have shown great promise in vulnerability identification. As C/C++ accounts for half of the Open-Source Software (OSS) vulnerabilities over the past decade and updates in OSS mainly occur through commits, enhancing LLMs' ability to identify C/C++ Vulnerability-Contributing Commits (VCCs) is essential. However, current studies primarily focus on further pre-training LLMs on massive code datasets, which is resource-intensive and poses efficiency challenges. In this paper, we enhance the ability of BERT-based LLMs to identify C/C++ VCCs in a lightweight manner. We propose CodeLinguaNexus (CLNX) as a bridge facilitating communication between C/C++ programs and LLMs. Based on commits, CLNX efficiently converts the source code into a more natural representation while preserving key details. Specifically, CLNX first applies structure-level naturalization to decompose complex programs, followed by token-level naturalization to interpret complex symbols. <span class='px-1 mx-1 bg-yellow-200'>We evaluate CLNX on public datasets of 25,872 C/C++ functions with their commits. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.748</span></span> The results show that CLNX significantly enhances the performance of LLMs on identifying C/C++ VCCs. Moreover, CLNX-equipped CodeBERT achieves a new state of the art and identifies 38 OSS vulnerabilities in the real world.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.07407v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-11</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences. Leveraging foundation models as priors, our approach overcomes the limitations of traditional methods and boosts the performance to ensure the high-fidelity generation required by the display devices. The proposed system consists of two main steps: depth-based video splatting for warping and extracting an occlusion mask, and stereo video inpainting. We utilize pre-trained stable video diffusion as the backbone and introduce a fine-tuning protocol for the stereo video inpainting task. To handle input video with varying lengths and resolutions, we explore auto-regressive strategies and tiled processing. <span class='px-1 mx-1 bg-yellow-200'>Finally, a sophisticated data processing pipeline has been developed to reconstruct a large-scale and high-quality dataset to support our training. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.792</span></span> Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices like Apple Vision Pro and 3D displays. In summary, this work contributes to the field by presenting an effective method for generating high-quality stereoscopic videos from monocular input, potentially transforming how we experience digital media.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.07447v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-11</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture fine-grained visual cues in a video needed for realistic background music generation, we introduce a new temporal video encoder architecture, allowing us to efficiently process videos consisting of many densely sampled frames. <span class='px-1 mx-1 bg-yellow-200'>We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior datasets used for video music generation. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.735</span></span> Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.07450v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-11</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Despite tremendous progress in image-to-3D generation, existing methods still struggle to produce multi-view consistent images with high-resolution textures in detail, especially in the paradigm of 2D diffusion that lacks 3D awareness. In this work, we present High-resolution Image-to-3D model (Hi3D), a new video diffusion based paradigm that redefines a single image to multi-view images as 3D-aware sequential image generation (i.e., orbital video generation). This methodology delves into the underlying temporal consistency knowledge in video diffusion models that generalizes well to geometry consistency across multiple views in 3D generation. Technically, Hi3D first empowers the pre-trained video diffusion model with 3D-aware prior (camera pose condition), yielding multi-view images with low-resolution texture details. A 3D-aware video-to-video refiner is learnt to further scale up the multi-view images with high-resolution texture details. Such high-resolution multi-view images are further augmented with novel views through 3D Gaussian Splatting, which are finally leveraged to obtain high-fidelity meshes via 3D reconstruction. Extensive experiments on both novel view synthesis and single view reconstruction demonstrate that our Hi3D manages to produce superior multi-view consistency images with highly-detailed textures. <span class='px-1 mx-1 bg-yellow-200'>Source code and data are available at https://github.com/yanghb22-fdu/Hi3D-Official. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.733</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.07452v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. <span class='px-1 mx-1 bg-yellow-200'>We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.806</span></span> We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon -- subject-verb agreement across a variety of sentence structures -- in several languages. Finding a solution to this task requires a system that detects complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps -- detect syntactic objects and their properties in individual sentences, and find patterns across an input sequence of sentences -- we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.06567v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription, and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct model. <span class='px-1 mx-1 bg-yellow-200'>To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.85</span></span> Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.06666v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-10</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Benchmarking Sub-Genre Classification For Mainstage Dance Music
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Music classification, with a wide range of applications, is one of the most prominent tasks in music information retrieval. To address the absence of comprehensive datasets and high-performing methods in the classification of mainstage dance music, this work introduces a novel benchmark comprising a new dataset and a baseline. <span class='px-1 mx-1 bg-yellow-200'>Our dataset extends the number of sub-genres to cover most recent mainstage live sets by top DJs worldwide in music festivals. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.85</span></span> A continuous soft labeling approach is employed to account for tracks that span multiple sub-genres, preserving the inherent sophistication. For the baseline, we developed deep learning models that outperform current state-of-the-art multimodal language models, which struggle to identify house music sub-genres, emphasizing the need for specialized models trained on fine-grained datasets. Our benchmark is applicable to scenarios such as music recommendation, DJ set curation, and interactive multimedia, for which we also provide video demos. Our code is available at https://anonymous.4open.science/r/Mainstage-EDM-Benchmark/.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.06690v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
AnomalyCD: A benchmark for Earth anomaly change detection with high-resolution and time-series observations
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Various Earth anomalies have destroyed the stable, balanced state, resulting in fatalities and serious destruction of property. With the advantages of large-scale and precise observation, high-resolution remote sensing images have been widely used for anomaly monitoring and localization. Powered by the deep representation, the existing methods have achieved remarkable advances, primarily in classification and change detection techniques. However, labeled samples are difficult to acquire due to the low probability of anomaly occurrence, and the trained models are limited to fixed anomaly categories, which hinders the application for anomalies with few samples or unknown anomalies. In this paper, to tackle this problem, we propose the anomaly change detection (AnomalyCD) technique, which accepts time-series observations and learns to identify anomalous changes by learning from the historical normal change pattern. Compared to the existing techniques, AnomalyCD processes an unfixed number of time steps and can localize the various anomalies in a unified manner, without human supervision. <span class='px-1 mx-1 bg-yellow-200'>To benchmark AnomalyCD, we constructed a high-resolution dataset with time-series images dedicated to various Earth anomalies (the AnomalyCDD dataset). <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.878</span></span> AnomalyCDD contains high-resolution (from 0.15 to 2.39 m/pixel), time-series (from 3 to 7 time steps), and large-scale images (1927.93 km² in total) collected globally. Furthermore, we developed a zero-shot baseline model (AnomalyCDM), which implements the AnomalyCD technique by extracting a general representation from the segment anything model (SAM) and conducting temporal comparison to distinguish the anomalous changes from normal changes. AnomalyCDM is designed as a two-stage workflow to enhance the efficiency, and has the ability to process the unseen images directly, without retraining for each scene.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.05679v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
LayeredFlow: A Real-World Benchmark for Non-Lambertian Multi-Layer Optical Flow
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Achieving 3D understanding of non-Lambertian objects is an important task with many useful applications, but most existing algorithms struggle to deal with such objects. One major obstacle towards progress in this field is the lack of holistic non-Lambertian benchmarks -- most benchmarks have low scene and object diversity, and none provide multi-layer 3D annotations for objects occluded by transparent surfaces. In this paper, we introduce LayeredFlow, a real-world benchmark containing multi-layer ground truth annotation for optical flow of non-Lambertian objects. Compared to previous benchmarks, our benchmark exhibits greater scene and object diversity, with 150k high-quality optical flow and stereo pairs taken over 185 indoor and outdoor scenes and 360 unique objects. Using LayeredFlow as evaluation data, we propose a new task called multi-layer optical flow. <span class='px-1 mx-1 bg-yellow-200'>To provide training data for this task, we introduce a large-scale densely-annotated synthetic dataset containing 60k images within 30 scenes tailored for non-Lambertian objects. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.91</span></span> Training on our synthetic dataset enables models to predict multi-layer optical flow, while fine-tuning existing optical flow methods on the dataset notably boosts their performance on non-Lambertian objects without compromising the performance on diffuse objects. Data is available at https://layeredflow.cs.princeton.edu.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.05688v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Extracting the U.S. building types from OpenStreetMap data
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Building type information is crucial for population estimation, traffic planning, urban planning, and emergency response applications. Although essential, such data is often not readily available. <span class='px-1 mx-1 bg-yellow-200'>To alleviate this problem, this work creates a comprehensive dataset by providing residential/non-residential building classification covering the entire United States. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.783</span></span> We propose and utilize an unsupervised machine learning method to classify building types based on building footprints and available OpenStreetMap information. The classification result is validated using authoritative ground truth data for select counties in the U.S. The validation shows a high precision for non-residential building classification and a high recall for residential buildings. We identified various approaches to improving the quality of the classification, such as removing sheds and garages from the dataset. Furthermore, analyzing the misclassifications revealed that they are mainly due to missing and scarce metadata in OSM. <span class='px-1 mx-1 bg-yellow-200'>A major result of this work is the resulting dataset, which classifies 67,705,475 buildings. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.89</span></span> We hope that this data is of value to the scientific community, including urban and transportation planners.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.05692v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Towards Democratizing Multilingual Large Language Models For Medicine Through A Two-Stage Instruction Fine-tuning Approach
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Open-source, multilingual medical large language models (LLMs) have the potential to serve linguistically diverse populations across different regions. Adapting generic LLMs for healthcare often requires continual pretraining, but this approach is computationally expensive and sometimes impractical. Instruction fine-tuning on a specific task may not always guarantee optimal performance due to the lack of broader domain knowledge that the model needs to understand and reason effectively in diverse scenarios. <span class='px-1 mx-1 bg-yellow-200'>To address these challenges, we introduce two multilingual instruction fine-tuning datasets, MMed-IFT and MMed-IFT-MC, containing over 200k high-quality medical samples in six languages. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.775</span></span> We propose a two-stage training paradigm: the first stage injects general medical knowledge using MMed-IFT, while the second stage fine-tunes task-specific multiple-choice questions with MMed-IFT-MC. Our method achieves competitive results on both English and multilingual benchmarks, striking a balance between computational efficiency and performance. We plan to make our dataset and model weights public at https://github.com/SpassMed/Med-Llama3 in the future.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.05732v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Are Heterophily-Specific GNNs and Homophily Metrics Really Effective? Evaluation Pitfalls and New Benchmarks
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Over the past decade, Graph Neural Networks (GNNs) have achieved great success on machine learning tasks with relational data. However, recent studies have found that heterophily can cause significant performance degradation of GNNs, especially on node-level tasks. Numerous heterophilic benchmark datasets have been put forward to validate the efficacy of heterophily-specific GNNs and various homophily metrics have been designed to help people recognize these malignant datasets. Nevertheless, there still exist multiple pitfalls that severely hinder the proper evaluation of new models and metrics. In this paper, we point out the three most serious pitfalls: 1) a lack of hyperparameter tuning; 2) insufficient model evaluation on the real challenging heterophilic datasets; 3) missing quantitative evaluation benchmark for homophily metrics on synthetic graphs. <span class='px-1 mx-1 bg-yellow-200'>To overcome these challenges, we first train and fine-tune baseline models on the 27 most widely used benchmark datasets, categorize them into three distinct groups: malignant, benign and ambiguous heterophilic datasets, and identify the real challenging subsets of tasks. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.71</span></span> To the best of our knowledge, we are the first to propose such a taxonomy. Then, we re-evaluate 10 heterophily-specific state-of-the-art (SOTA) GNNs with fine-tuned hyperparameters on different groups of heterophilic datasets. Based on the model performance, we reassess their effectiveness in addressing the heterophily challenge. Finally, we evaluate 11 popular homophily metrics on synthetic graphs with three different generation approaches. To compare the metrics strictly, we propose the first quantitative evaluation method based on Fréchet distance.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.05755v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Benchmarking Chinese Knowledge Rectification in Large Language Models
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>While Large Language Models (LLMs) exhibit remarkable generative capabilities, they are not without flaws, particularly in the form of hallucinations. This issue is even more pronounced when LLMs are applied to specific languages and domains. For example, LLMs may generate nonsensical information when handling Chinese ancient poetry, proverbs, or idioms, owing to the lack of specific knowledge. To this end, this paper introduces a benchmark for rectifying Chinese knowledge in LLMs via knowledge editing.<span class='px-1 mx-1 bg-yellow-200'>Specifically, we introduce a new Chinese dataset, CKnowEdit, by collecting seven types of knowledge from various sources, including classical texts, idioms, and content from Baidu Tieba Ruozhiba, thereby accounting for the unique polyphony, antithesis, and logical constructs inherent in the Chinese language. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.823</span></span>Through the analysis of this dataset, we uncover the challenges faced by current LLMs in mastering Chinese. Furthermore, our evaluation of state-of-the-art knowledge editing techniques on this dataset unveils the substantial scope for advancement in the rectification of Chinese knowledge.<span class='px-1 mx-1 bg-yellow-200'>Code and dataset are available at https://github.com/zjunlp/EasyEdit. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.842</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.05806v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Evaluating Multiview Object Consistency in Humans and Image Models
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>We introduce a benchmark to directly evaluate the alignment between human observers and vision models on a 3D shape inference task. We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape: given a set of images, participants identify which contain the same/different objects, despite considerable viewpoint variation. We draw from a diverse range of images that include common objects (e.g., chairs) as well as abstract shapes (i.e., procedurally generated 'nonsense' objects).<span class='px-1 mx-1 bg-yellow-200'>After constructing over 2000 unique image sets, we administer these tasks to human participants, collecting 35K trials of behavioral data from over 500 participants. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.725</span></span>This includes explicit choice behaviors as well as intermediate measures, such as reaction time and gaze data. We then evaluate the performance of common vision models (e.g., DINOv2, MAE, CLIP). We find that humans outperform all models by a wide margin. Using a multi-scale evaluation approach, we identify underlying similarities and differences between models and humans: while human-model performance is correlated, humans allocate more time/processing on challenging trials. All images, data, and code can be accessed via our project page.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.05862v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Promptable Closed-loop Traffic Simulation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Simulation stands as a cornerstone for safe and efficient autonomous driving development. At its core, a simulation system ought to produce realistic, reactive, and controllable traffic patterns. In this paper, we propose ProSim, a multimodal promptable closed-loop traffic simulation framework. ProSim allows the user to give a complex set of numerical, categorical or textual prompts to instruct each agent's behavior and intention. ProSim then rolls out a traffic scenario in a closed-loop manner, modeling each agent's interaction with other traffic participants. Our experiments show that ProSim achieves high prompt controllability given different user prompts, while reaching competitive performance on the Waymo Sim Agents Challenge when no prompt is given.<span class='px-1 mx-1 bg-yellow-200'>To support research on promptable traffic simulation, we create ProSim-Instruct-520k, a multimodal prompt-scenario paired driving dataset with over 10M text prompts for over 520k real-world driving scenarios. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.744</span></span><span class='px-1 mx-1 bg-yellow-200'>We will release code of ProSim as well as data and labeling tools of ProSim-Instruct-520k at https://ariostgx.github.io/ProSim. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.768</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.05863v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-05</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
TCDiff: Triple Condition Diffusion Model with 3D Constraints for Stylizing Synthetic Faces
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>A robust face recognition model must be trained using datasets that include a large number of subjects and numerous samples per subject under varying conditions (such as pose, expression, age, noise, and occlusion). Due to ethical and privacy concerns, large-scale real face datasets have been discontinued, such as MS1MV3, and synthetic face generators have been proposed, utilizing GANs and Diffusion Models, such as SYNFace, SFace, DigiFace-1M, IDiff-Face, DCFace, and GANDiffFace, aiming to supply this demand. Some of these methods can produce high-fidelity realistic faces, but with low intra-class variance, while others generate high-variance faces with low identity consistency. In this paper, we propose a Triple Condition Diffusion Model (TCDiff) to improve face style transfer from real to synthetic faces through 2D and 3D facial constraints, enhancing face identity consistency while keeping the necessary high intra-class variance.<span class='px-1 mx-1 bg-yellow-200'>Face recognition experiments using 1k, 2k, and 5k classes of our new dataset for training outperform state-of-the-art synthetic datasets in real face benchmarks such as LFW, CFP-FP, AgeDB, and BUPT. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.776</span></span>Our source code is available at: https://github.com/BOVIFOCR/tcdiff.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.03600v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-05</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Reimagining Data Visualization to Address Sustainability Goals
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Information visualization holds significant potential to support sustainability goals such as environmental stewardship and climate resilience by transforming complex data into accessible visual formats that enhance public understanding of complex climate change data and drive actionable insights. While the field has predominantly focused on the analytical orientation of visualization, challenging traditional visualization techniques and goals through critical visualization research expands existing assumptions and conventions in the field. In this paper, I explore how reimagining overlooked aspects of data visualization, such as engagement, emotional resonance, communication, and community empowerment, can contribute to achieving sustainability objectives. I argue that by focusing on inclusive data visualization that promotes clarity, understandability, and public participation, we can make complex data more relatable and actionable, fostering broader connections and mobilizing collective action on critical issues like climate change.<span class='px-1 mx-1 bg-yellow-200'>Moreover, I discuss the role of emotional receptivity in environmental data communication, stressing the need for visualizations that respect diverse cultural perspectives and emotional responses to achieve impactful outcomes. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.708</span></span>Drawing on insights from a decade of research in public participation and community engagement, I aim to highlight how data visualization can democratize data access and increase public involvement in order to contribute to a more sustainable and resilient future.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.03611v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-05</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Practical Forecasting of Cryptocoins Timeseries using Correlation Patterns
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Cryptocoins (e.g., Bitcoin, Ether, Litecoin) are tradable digital assets. Ownership of cryptocoins is registered on distributed ledgers (i.e., blockchains). Secure encryption techniques guarantee the security of the transactions (transfers of coins among owners) registered into the ledger. Cryptocoins are exchanged for specific trading prices. The extreme volatility of such trading prices across all different sets of crypto-assets remains undisputed. However, the relations between the trading prices across different cryptocoins remain largely unexplored. Major coin exchanges indicate trend correlation to advise on selling or buying. However, price correlations remain largely unexplored. We shed some light on the trend correlations across a large variety of cryptocoins, by investigating their coin/price correlation trends over the past two years. We study the causality between the trends, and exploit the derived correlations to understand the accuracy of state-of-the-art forecasting techniques for time series modeling (e.g., GBMs, LSTM and GRU) of correlated cryptocoins. Our evaluation shows (i) strong correlation patterns between the most traded coins (e.g., Bitcoin and Ether) and other types of cryptocurrencies, and (ii) state-of-the-art time series forecasting algorithms can be used to forecast cryptocoin price trends.<span class='px-1 mx-1 bg-yellow-200'>We released datasets and code to reproduce our analysis to the research community. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.847</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.03674v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-05</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Classification and Prediction of Heart Diseases using Machine Learning Algorithms
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Heart disease is a serious worldwide health issue because it claims the lives of many people who might have been treated if the disease had been identified earlier. The leading cause of death in the world is cardiovascular disease, usually referred to as heart disease. Creating reliable, effective, and precise predictions for these diseases is one of the biggest issues facing the medical world today. Although there are tools for predicting heart diseases, they are either expensive or challenging to apply for determining a patient's risk. The aim of this research was to identify the best classifier for predicting and detecting heart disease. This experiment examined a range of machine learning approaches, including Logistic Regression, K-Nearest Neighbor, Support Vector Machine, and Artificial Neural Networks, to determine which machine learning algorithm was most effective at predicting heart diseases.<span class='px-1 mx-1 bg-yellow-200'>One of the most often utilized data sets for this purpose, the UCI heart disease repository, provided the data set for this study. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.795</span></span>The K-Nearest Neighbor technique was shown to be the most effective machine learning algorithm for determining whether a patient has heart disease. It will be beneficial to conduct further studies on the application of additional machine learning algorithms for heart disease prediction.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.03697v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-05</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
ArtiFade: Learning to Generate High-quality Subject from Blemished Images
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Subject-driven text-to-image generation has witnessed remarkable advancements in its ability to learn and capture characteristics of a subject using only a limited number of images. However, existing methods commonly rely on high-quality images for training and may struggle to generate reasonable images when the input images are blemished by artifacts. This is primarily attributed to the inadequate capability of current techniques in distinguishing subject-related features from disruptive artifacts.<span class='px-1 mx-1 bg-yellow-200'>In this paper, we introduce ArtiFade to tackle this issue and successfully generate high-quality artifact-free images from blemished datasets. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.84</span></span>Specifically, ArtiFade exploits fine-tuning of a pre-trained text-to-image model, aiming to remove artifacts. The elimination of artifacts is achieved by utilizing a specialized dataset that encompasses both unblemished images and their corresponding blemished counterparts during fine-tuning. ArtiFade also ensures the preservation of the original generative capabilities inherent within the diffusion model, thereby enhancing the overall performance of subject-driven methods in generating high-quality and artifact-free images. We further devise evaluation benchmarks tailored for this task. Through extensive qualitative and quantitative experiments, we demonstrate the generalizability of ArtiFade in effective artifact removal under both in-distribution and out-of-distribution scenarios.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.03745v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-05</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>The increasing availability of real-world conversation data offers exciting opportunities for researchers to study user-chatbot interactions. However, the sheer volume of this data makes manually examining individual conversations impractical.<span class='px-1 mx-1 bg-yellow-200'>To overcome this challenge, we introduce WildVis, an interactive tool that enables fast, versatile, and large-scale conversation analysis. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.725</span></span>WildVis provides search and visualization capabilities in the text and embedding spaces based on a list of criteria. To manage million-scale datasets, we implemented optimizations including search index construction, embedding precomputation and compression, and caching to ensure responsive user interactions within seconds. We demonstrate WildVis's utility through three case studies: facilitating chatbot misuse research, visualizing and comparing topic distributions across datasets, and characterizing user-specific conversation patterns. WildVis is open-source and designed to be extendable, supporting additional datasets and customized search and visualization functionalities.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.03753v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-05</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Foundation Model or Finetune? Evaluation of few-shot semantic segmentation for river pollution
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Foundation models (FMs) are a popular topic of research in AI. Their ability to generalize to new tasks and datasets without retraining or needing an abundance of data makes them an appealing candidate for applications on specialist datasets. In this work, we compare the performance of FMs to finetuned pre-trained supervised models in the task of semantic segmentation on an entirely new dataset. We see that finetuned models consistently outperform the FMs tested, even in cases where data is scarce.<span class='px-1 mx-1 bg-yellow-200'>We release the code and dataset for this work on GitHub. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.837</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.03754v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-04</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
The Impact of Balancing Real and Synthetic Data on Accuracy and Fairness in Face Recognition
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>In recent years, the advancements in deep face recognition have fueled an increasing demand for large and diverse datasets. Nevertheless, the authentic data acquired to create those datasets is typically sourced from the web, which, in many cases, can lead to significant privacy issues due to the lack of explicit user consent. Furthermore, obtaining a demographically balanced, large dataset is even more difficult because of the natural imbalance in the distribution of images from different demographic groups. In this paper, we investigate the impact of demographically balanced authentic and synthetic data, both individually and in combination, on the accuracy and fairness of face recognition models. Initially, several generative methods were used to balance the demographic representations of the corresponding synthetic datasets. Then a state-of-the-art face encoder was trained and evaluated using (combinations of) synthetic and authentic images. Our findings emphasized two main points: (i) the increased effectiveness of training data generated by diffusion-based models in enhancing accuracy, whether used alone or combined with subsets of authentic data, and (ii) the minimal impact of incorporating balanced data from pre-trained generative methods on fairness (in nearly all tested scenarios using combined datasets, fairness scores remained either unchanged or worsened, even when compared to unbalanced authentic datasets).<span class='px-1 mx-1 bg-yellow-200'>Source code and data are available at https://cutt.ly/AeQy1K5G for reproducibility. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.74</span></span></p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.02867v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-04</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Can LVLMs Obtain a Driver's License? A Benchmark Towards Reliable AGI for Autonomous Driving
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Large Vision-Language Models (LVLMs) have recently garnered significant attention, with many efforts aimed at harnessing their general knowledge to enhance the interpretability and robustness of autonomous driving models. However, LVLMs typically rely on large, general-purpose datasets and lack the specialized expertise required for professional and safe driving. Existing vision-language driving datasets focus primarily on scene understanding and decision-making, without providing explicit guidance on traffic rules and driving skills, which are critical aspects directly related to driving safety.<span class='px-1 mx-1 bg-yellow-200'>To bridge this gap, we propose IDKB, a large-scale dataset containing over one million data items collected from various countries, including driving handbooks, theory test data, and simulated road test data. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.916</span></span>Much like the process of obtaining a driver's license, IDKB encompasses nearly all the explicit knowledge needed for driving from theory to practice. In particular, we conducted comprehensive tests on 15 LVLMs using IDKB to assess their reliability in the context of autonomous driving and provided extensive analysis. We also fine-tuned popular models, achieving notable performance improvements, which further validate the significance of our dataset. The project page can be found at: https://4dvlab.github.io/project_page/idkb.html</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.02914v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-04</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins (early version)
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Effective collaboration of dual-arm robots and their tool use capabilities are increasingly important areas in the advancement of robotics. These skills play a significant role in expanding robots' ability to operate in diverse real-world environments. However, progress is impeded by the scarcity of specialized training data.<span class='px-1 mx-1 bg-yellow-200'>This paper introduces RoboTwin, a novel benchmark dataset combining real-world teleoperated data with synthetic data from digital twins, designed for dual-arm robotic scenarios. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.717</span></span>Using the COBOT Magic platform, we have collected diverse data on tool usage and human-robot interaction. We present an innovative approach to creating digital twins using AI-generated content, transforming 2D images into detailed 3D models. Furthermore, we utilize large language models to generate expert-level training data and task-specific pose sequences oriented toward functionality. Our key contributions are: 1) the RoboTwin benchmark dataset, 2) an efficient real-to-simulation pipeline, and 3) the use of language models for automatic expert-level data generation. These advancements are designed to address the shortage of robotic training data, potentially accelerating the development of more capable and versatile robotic systems for a wide range of real-world applications. The project page is available at https://robotwin-benchmark.github.io/early-version/</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.02920v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td></td>
<td>
<h2 class="text-2xl tracking-tight pt-4 font-bold">Data Quality</h2>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Detecting Sexism in German Online Newspaper Comments with Open-Source Text Embeddings (Team GDA, GermEval2024 Shared Task 1: GerMS-Detect, Subtasks 1 and 2, Closed Track)
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Sexism in online media comments is a pervasive challenge that often manifests subtly, complicating moderation efforts as interpretations of what constitutes sexism can vary among individuals. We study monolingual and multilingual open-source text embeddings to reliably detect sexism and misogyny in German-language online comments from an Austrian newspaper.<span class='px-1 mx-1 bg-yellow-200'>We observed that classifiers trained on text embeddings closely mimic the individual judgements of human annotators. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.628</span></span>Our method showed robust performance in the GermEval 2024 GerMS-Detect Subtask 1 challenge, achieving an average macro F1 score of 0.597 (4th place, as reported on Codabench). It also accurately predicted the distribution of human annotations in GerMS-Detect Subtask 2, with an average Jensen-Shannon distance of 0.301 (2nd place). The computational efficiency of our approach suggests potential for scalable applications across various languages and linguistic contexts.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.10341v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Efficiently Crowdsourcing Visual Importance with Punch-Hole Annotation
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>We introduce a novel crowdsourcing method for identifying important areas in graphical images through punch-hole labeling. Traditional methods, such as gaze trackers and mouse-based annotations, which generate continuous data, can be impractical in crowdsourcing scenarios. They require many participants, and the outcome data can be noisy. In contrast, our method first segments the graphical image with a grid and drops a portion of the patches (punch holes).<span class='px-1 mx-1 bg-yellow-200'>Then, we iteratively ask the labeler to validate each annotation with holes, narrowing the annotation down to only the most important area. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.724</span></span><span class='px-1 mx-1 bg-yellow-200'>This approach aims to reduce annotation noise in crowdsourcing by standardizing the annotations while enhancing labeling efficiency and reliability. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.771</span></span>Preliminary findings from fundamental charts demonstrate that punch-hole labeling can effectively pinpoint critical regions. This also highlights its potential for broader application in visualization research, particularly in studying large-scale users' graphical perception. Our future work aims to enhance the algorithm to achieve faster labeling speed and prove its utility through large-scale experiments.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.10459v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-16</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
DILA: Dictionary Label Attention for Mechanistic Interpretability in High-dimensional Multi-label Medical Coding Prediction
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Predicting high-dimensional or extreme multilabels, such as in medical coding, requires both accuracy and interpretability.<span class='px-1 mx-1 bg-yellow-200'>Existing works often rely on local interpretability methods, failing to provide comprehensive explanations of the overall mechanism behind each label prediction within a multilabel set. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.636</span></span>We propose a mechanistic interpretability module called DIctionary Label Attention (DILA) that disentangles uninterpretable dense embeddings into a sparse embedding space, where each nonzero element (a dictionary feature) represents a globally learned medical concept. Through human evaluations, we show that our sparse embeddings are more human understandable than their dense counterparts by at least 50 percent. Our automated dictionary feature identification pipeline, leveraging large language models (LLMs), uncovers thousands of learned medical concepts by examining and summarizing the highest activating tokens for each dictionary feature. We represent the relationships between dictionary features and medical codes through a sparse interpretable matrix, enhancing the mechanistic and global understanding of the model's predictions while maintaining competitive performance and scalability without extensive human annotation.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.10504v1' target="_blank">
link
</a>
</p>
</div>
</div>
</td>
</tr><tr>
<td class="inline-block">
<p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-09-09</p>
</td>
<td>
<div x-data="{open: false}">
<span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
Robust Loss Functions for Object Grasping under Limited Ground Truth
</span>
<div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
<div class="text-center pt-2"></div>
<p class="pt-2">
<p>Object grasping is a crucial technology enabling robots to perceive and interact with the environment sufficiently. However, in practical applications, researchers are faced with missing or noisy ground truth while training the convolutional neural network, which decreases the accuracy of the model. Therefore, different loss functions are proposed to deal with these problems to improve the accuracy of the neural network. For missing ground truth, a new predicted category probability method is defined for unlabeled samples, which works effectively in conjunction with the pseudo-labeling method.<span class='px-1 mx-1 bg-yellow-200'>Furthermore, for noisy ground truth, a symmetric loss function is introduced to resist the corruption of label noise. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.645</span></span>The proposed loss functions are powerful, robust, and easy to use. Experimental results based on the typical grasping neural network show that our method can improve performance by 2 to 13 percent.</p>
</p>
<p class="pb-2 pt-2 text-center">
<a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2409.05742v1' target="_blank">
link
</a>
</p>