![Cover image](https://deep-paper.org/en/paper/2406.18725/images/cover.png)
Lost in Transliteration — How Arabizi Bypasses LLM Safety Filters
Large Language Models (LLMs) like GPT-4 and Claude 3 are designed to be helpful, but they are also designed to be safe. If you ask these models to write a guide on how to create malware or build a bomb, they are trained to refuse. This safety training, often achieved through Reinforcement Learning from Human Feedback (RLHF), acts as a firewall around the model’s vast knowledge. However, security researchers are constantly searching for cracks in this firewall. While most safety training focuses heavily on English, a new vulnerability has emerged in the linguistic “blind spots” of these models. ...
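The "Arabizi" named in the title is an informal convention for writing Arabic with Latin letters and digits (for example "3" standing in for the letter ع and "7" for ح). The teaser does not spell the scheme out, so the snippet below is only a hypothetical, toy character mapping to illustrate what such a transliteration looks like; it is my own sketch, not the paper's method or data.

```python
# Toy illustration of Arabizi: Arabic written in Latin letters and digits.
# This mapping is deliberately tiny and naive; it is not from the paper.
ARABIZI_TO_ARABIC = {
    "3": "ع",  # ayn
    "7": "ح",  # haa
    "5": "خ",  # khaa
    "2": "ء",  # hamza
    "a": "ا", "b": "ب", "i": "ي", "l": "ل",
    "m": "م", "r": "ر", "s": "س",
}

def arabizi_to_arabic(text: str) -> str:
    """Naively map each Arabizi character to an Arabic letter, keeping unknown characters."""
    return "".join(ARABIZI_TO_ARABIC.get(ch, ch) for ch in text.lower())

print(arabizi_to_arabic("salam"))   # -> "سالام", a letter-by-letter rendering of "سلام" (hello)
print(arabizi_to_arabic("3arabi"))  # -> "عارابي", a rough rendering of "عربي" (Arabic)
```

Because real Arabizi drops short vowels and varies from writer to writer, a safety filter trained mostly on standard English or Arabic script can miss the same request once it is re-spelled this way, which is the "blind spot" the article goes on to examine.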