Episode 5 on diffusion, audio, video, 3D and multimodal models

Hello again, hope you had a great summer! We continue our journey in gen AI land with the rest of the AI models, from pics to vids, 3D objects, voice clones and multimodal models. Enjoy!

Blueberry Thoughts
Ep. 005 – A Diffusion Confusion – Painting Pictures with AI

Transcription

Welcome back to another episode of Blueberry Thoughts with your host, Ivor J. Burks. Today, we continue our deep dive into the captivating world of generative artificial intelligence, picking up the thread we left in our previous episode on the rise of Large Language Models. We’ve seen how AI has revolutionized language processing; now let’s uncover how it’s painting pictures and weaving narratives in multimedia.
In today’s episode, we journey through the digital landscapes of AI-powered image, sound and 3D object generators, video creation tools, and we’ll even have a short look at the intriguing concept of multimodal models. Along the way, we’ll address the controversies and legal hurdles surrounding these technologies.
So, buckle up as we navigate the bleeding edge of AI’s artistic frontier and explore the ethical maze that accompanies it. This is an episode you don’t want to miss!

When we examine AI’s intersection with artistry, it’s impossible to overlook the potent tool that is the diffusion model. Picture this model as an enchanting art wizard, conjuring stunning visuals through an iterative process of adding ‘noise’ to an image and then removing it. This yields a plethora of high-quality images, each one a variation, yet a distinct piece of art in its own right.
To illustrate this, imagine depositing a droplet of ink into a clear water-filled bowl. The ink disperses, forming entrancing patterns as it integrates with the water. The diffusion model operates similarly, sprinkling ‘noise’ akin to the diffusing ink onto an image.
The model’s spellbinding prowess lies not in just adding noise, but also in its ability to gradually erase it. In doing so, it restores the original image and simultaneously generates a spectrum of unique alternatives. It’s akin to watching the ever-evolving patterns in our ink-infused water.
These models employ a fusion of score-based generative modeling, latent space, Gaussian noise, and a reverse diffusion process to achieve these results. While these terms may sound arcane, their collective purpose is simple – to create these remarkable visual marvels.
The scope of diffusion models extends beyond crafting new images. They can also modify existing ones through techniques known as inpainting and outpainting. These allow for seamless content replacement, enabling users to restore vintage photos or expand images to form larger scenes without advanced Photoshop skills. Speaking of which, even Adobe’s legendary Photoshop software has recently been AI-enhanced, with the beta version boasting several generative AI features for image expansion and modification.
Diffusion models are also making waves in the domain of video generation. They can create brief video clips based on text prompts. For instance, input ‘people at a barbecue party drinking beer,’ and the model breathes life into the scene. But do these AI videos look like reality, or more like a nightmare? We’ll delve into that as this episode progresses.

When discussing diffusion models, we encounter terms such as ‘score-based generative modeling,’ ‘latent space,’ ‘Gaussian noise,’ and ‘reverse diffusion process.’ Let’s delve into these concepts.
‘Score-based generative modeling’ pertains to the model’s ability to estimate a ‘score’: for any noisy image, the direction in which it should be nudged to look more like the images the model has learned from. Picture a seasoned chef, whose culinary expertise tells them which adjustment will bring a dish closer to something delicious. Similarly, the model draws upon its experience of diverse images to steer a noisy picture, step by step, toward a plausible new image.
The term ‘latent space’ is akin to an abstract, mathematical playground where the model ‘paints’ new images reminiscent of those it’s acquainted with. Imagine our experienced chef in a well-stocked kitchen, utilizing familiar ingredients to whip up new, yet reminiscent culinary creations.
‘Gaussian noise’ introduces an element of unpredictability, akin to a surprise ingredient in a cooking contest. This randomness, added to the training images, challenges the model to recover the underlying picture from beneath the noise, which is how it learns.
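For readers following along in the transcript, the noising step can be sketched in a few lines of Python. This is a minimal illustration, assuming the linear noise schedule commonly used in the diffusion literature; the function names `alpha_bar` and `add_noise`, the schedule values, and the four-pixel "image" are illustrative assumptions, not taken from any of the tools discussed in this episode.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after t noising steps.

    Assumes a linear schedule: the per-step noise level (beta) grows
    evenly from beta_start to beta_end over num_steps steps.
    """
    remaining = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        remaining *= 1.0 - beta
    return remaining


def add_noise(pixels, t, rng):
    """Blend each pixel with Gaussian noise, as if t noising steps had run."""
    a = alpha_bar(t)
    return [
        math.sqrt(a) * p + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for p in pixels
    ]


# Toy "image" of four pixel values; a real image has millions of them.
rng = random.Random(0)
pixels = [0.0, 0.25, 0.5, 1.0]
for t in (0, 100, 500, 1000):
    noisy = add_noise(pixels, t, rng)
    print(t, round(alpha_bar(t), 4), [round(v, 2) for v in noisy])
```

At step 0 the pixels come back unchanged; by the last step almost none of the original signal remains, much like the ink once it has fully dispersed in the water.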
Lastly, we encounter the ‘reverse diffusion process.’ Visualize the restoration of a faded, vintage photograph to its former crystal-clear glory. That’s what this process accomplishes. It starts with a noisy image and gradually polishes it into a high-resolution image.
This process doesn’t happen in a single leap, however. A fundamental aspect of many diffusion models is the repeated re-introduction of a small amount of noise during the reverse diffusion process. The rationale behind this lies in avoiding a formless result, which would occur if the noise were eliminated in one fell swoop. By reinfusing a little noise after each cleaning step, an element of randomness is maintained, allowing the model to create a unique, yet cohesive image instead of an ambiguous blob. This intricate dance between the addition and removal of noise gives diffusion models their distinctive prowess in generating a broad spectrum of realistic images.
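The dance between removing and re-adding noise can be sketched on a toy problem. The example below is a simplified illustration under stated assumptions: the "images" are single numbers drawn from a bell curve with mean `MU` and spread `SIGMA`, which lets us write the score exactly instead of training a neural network to estimate it. The update rule follows the ancestral sampling step described in the diffusion literature; all names and schedule values are hypothetical choices for this sketch.

```python
import math
import random
import statistics

MU, SIGMA = 3.0, 0.5  # assumed toy "dataset": numbers drawn from N(MU, SIGMA^2)


def make_schedule(num_steps=200, beta_start=1e-4, beta_end=0.1):
    """Linear noise schedule and its cumulative signal fractions."""
    betas = [
        beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        for i in range(num_steps)
    ]
    alpha_bars, remaining = [], 1.0
    for beta in betas:
        remaining *= 1.0 - beta
        alpha_bars.append(remaining)
    return betas, alpha_bars


def exact_score(x, a_bar):
    """Score of the noised toy data: which way to nudge x to make it more likely.

    A real diffusion model learns this with a neural network; here the
    data is Gaussian, so the answer is known in closed form.
    """
    mean_t = math.sqrt(a_bar) * MU
    var_t = a_bar * SIGMA ** 2 + (1.0 - a_bar)
    return -(x - mean_t) / var_t


def sample(rng, betas, alpha_bars):
    """Start from pure noise and walk backwards through the schedule."""
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(len(betas))):
        beta = betas[t]
        # Denoise a little: nudge x along the score, then undo the shrinkage.
        x = (x + beta * exact_score(x, alpha_bars[t])) / math.sqrt(1.0 - beta)
        # Re-introduce fresh noise at every step except the very last one.
        if t > 0:
            x += math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
betas, alpha_bars = make_schedule()
samples = [sample(rng, betas, alpha_bars) for _ in range(1000)]
print("mean:", round(statistics.mean(samples), 2))
print("spread:", round(statistics.pstdev(samples), 2))
```

Every run starts from a different patch of noise and ends at a different value, yet the collection of results should cluster around the toy data's mean and spread. That is the one-number analogue of generating many distinct but plausible images.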

Let’s turn our attention to the main characters of this episode: MidJourney, DALL-E 2, Stable Diffusion with its different versions, and SDXL! Each of these models possesses a unique quality.
Consider MidJourney, known for its user-friendly interface, an excellent choice for those embarking on their AI journey due to its easy-to-use Discord integration. However, as it is closed-source, it restricts room for customization and refinement. It also tends to exercise considerable content censorship, potentially stifling creative expression. DALL-E 2, a release from OpenAI, who brought us ChatGPT, shares many of these traits. But, in terms of image quality, it doesn’t quite meet the bar set by MidJourney. Interestingly, Microsoft’s Bing AI and Microsoft Designer leverage a modified version of DALL-E 2 for image generation.
At the other end, we find Stable Diffusion version 1.5. This open-source model is the dependable workhorse of AI art, propelling several popular tools. While it may show signs of aging, it remains a sturdy platform offering features like inpainting, outpainting, and third-party plugin compatibility. However, it needs a high-end GPU for smooth operation, although GPU rental services for training and running AI models are becoming more and more affordable. There are also newer versions of Stable Diffusion, which do not necessarily output better images; they are simply models using different text encoders and datasets.
SDXL, another offering from Stability AI, is the dynamic newcomer. It boasts dual text encoders for enhanced prompt comprehension and a two-stage generation process ensuring image coherence at high resolutions. It offers more advanced features, but has a steeper learning curve.
A flurry of innovative startups is joining this arena, utilizing diffusion models’ power to create images. These include Leonardo.ai, Catbird.ai, and BlueWillow.ai, each presenting unique tools based on diffusion models. Another long-standing player, Craiyon, also known as ‘DALL-E Mini,’ continues to contribute to the ever-evolving landscape. Many offerings in the space come in the form of mobile or web apps that let you upload several pictures of yourself and generate digital avatar versions of your portrait in different artistic styles. The burgeoning diversity and creativity in this sector are truly exhilarating.

The field of AI isn’t limited to just language and image generation; it’s now making waves in the realm of audio. Though the progress in audio generation has trailed its text and image counterparts, recent advancements are rapidly closing the gap.
Borrowing cues from large language models and text-to-image generation, AI-powered audio-generative systems of unparalleled quality have emerged. Capable of crafting high-quality audio from text or melodic prompts, these systems are redefining the soundscape. Techniques like tokenization, quantization, and vectorization are used to learn discrete representations of audio features and apply transformers for audio continuation.
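To make the idea of discrete audio tokens concrete, here is a minimal sketch of vector quantization in Python. It assumes a tiny, hand-written codebook and made-up two-number audio "frames"; real neural audio codecs learn their codebooks from data and work with much larger vectors, so the values and the names `quantize` and `dequantize` are purely illustrative.

```python
def quantize(frames, codebook):
    """Replace each audio feature vector with the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens, codebook):
    """Turn token IDs back into (approximate) feature vectors."""
    return [codebook[token] for token in tokens]


# Hypothetical codebook with four entries; each entry stands for one "sound shape".
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical features for four short slices of audio.
frames = [[0.1, -0.2], [0.9, 0.1], [0.2, 0.8], [1.1, 0.9]]

tokens = quantize(frames, codebook)
print(tokens)                        # [0, 1, 2, 3]
print(dequantize(tokens, codebook))  # the nearest codebook entries
```

Once audio is a sequence of token IDs like this, a transformer can continue it in much the same way a language model continues a sentence.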
Audio signals, with their unique complexity that encompasses pitch, loudness, phonetic properties of speech, prosody, and emotional intonation, present a unique challenge. Recent models equipped with Audio Embedding, Audio Quantization, and Neural Audio Codecs are stepping up to meet these challenges, advancing text-to-speech and text-to-music models, and thereby pushing the boundaries of audio generation.
Consider MusicLM, a music generator capable of turning user-provided melodic prompts into harmonious outputs, using hierarchical tokenization and generation schemes. It can even generate music based on inputs like humming or whistling.
The field of voice cloning has also witnessed impressive advancements. Text-to-speech models like VALL-E and NaturalSpeech 2 employ discrete acoustic tokens and continuous vector representations to mimic human speech. Take, for instance, Apple’s upcoming voice-cloning feature, which can learn to speak in your voice in just 15 minutes, or Microsoft’s VALL-E that can synthesize the speaker’s unique timbre and tone from only three seconds of audio.
These leaps in AI audio technology are not just transforming music creation but also making a significant impact on social media and platforms like YouTube. Fake voice stand-offs featuring imitated voices of well-known figures highlight the pervasiveness of this technology.
The potential of these advanced generative audio models promises a revolution in content generation and user experience across music, podcasts, and audiobooks. We anticipate the pace to only quicken in this domain, unlocking a plethora of thrilling possibilities.
To bring this point home, you’re currently witnessing these advancements in action. This podcast is produced using a voice cloning service by ElevenLabs, harnessing AI to generate a convincing voice clone of the human behind Blueberry Thoughts.

AI-powered 3D object generators are transforming the 3D object generation field, enhancing efficiency, accuracy, and accessibility. Serving game developers, graphic designers, and tech enthusiasts, these tools enable users to visualize their ideas in three dimensions.
Among the top performers, Masterpiece Studio stands out. It employs advanced Natural Language Processing (NLP) technology to translate descriptive text into fully-fledged 3D models and animations. With just a few text lines, users can create impressive 3D models, all facilitated by an intuitive user interface. Meshcapade is another notable tool, offering a unified platform for creating high-quality 3D models from text inputs. Compatible with various game engines and graphic applications, it’s the go-to for avatar creation. Luma AI leads in 3D picture production, creating photorealistic 3D models from text and rendering live video streams into 3D environments, unlocking potential for immersive experiences. And then there’s NeROIC, remarkable for its prowess in generating 3D models from images and videos, and creating fully interactive 3D environments. These are the leaders revolutionizing 3D object creation and visualization.
In the realm of video generation, we have Gen-2 from Google-backed startup Runway. Despite being more of an interesting novelty than a practical tool in current video workflows due to low frame rates and grainy clips, it’s a creative powerhouse with its ability to understand a variety of styles from anime to claymation.
Recently, open-source alternatives have emerged, such as Potat 1 AI and Deforum, based on Stable Diffusion or similar technologies. Then we have the “deep-fake-as-a-service” video generators that render realistic voice-over videos of digital people for tutorials or online courses. Their realism is attributed to the limited scope of the video being generated and to fine-tuning on a single subject. Here, we can point to Synthesia, Lumen5, and ex-human. These advancements promise an exciting future for digital content creation.

As we continue to push AI boundaries, the rise of multimodal models is introducing an exciting paradigm. These models are designed to work with diverse data forms—text, images, audio, video, and beyond. Yet, we must remember, many multimodal models are still research-stage, although the potential applications appear limitless.
Take, for instance, Google’s robotic model, PaLM-E. This model is an extension of their language model, PaLM, fortified with raw sensor data streams. While it primarily remains a language model, the extra sensory inputs add a new layer of versatility. Google DeepMind recently launched Robotic Transformer 2 (RT-2), a “novel vision-language-action (VLA) model” that learns from both web and robotics data, translating this knowledge into generalized instructions for robotic control.
Meta is also venturing into this domain, having open-sourced a research project known as ImageBind. This project aims to fuse text, audio, visual, movement, thermal, and depth data, suggesting the future potential for generating multisensory content.
Finally, we can’t ignore GPT-4, currently setting the bar for large language models, which technically includes multimodal capabilities. It’s known to drive Bing AI’s image recognition function, effectively bridging the divide between text and visual data.

Despite the remarkable achievements of these models, they’ve certainly stirred controversy. Whenever revolutionary technology surfaces, it tends to ruffle feathers and stir the pot. In the context of diffusion models, this agitation has taken the form of copyright infringement concerns.
In 2023, diffusion models garnered attention, not always for celebratory reasons. Viral images, such as the Midjourney-created snapshot of Pope Francis in a chic white Balenciaga puffer coat or the fictional arrest of Donald Trump, generated quite a stir. Another image of a hoax attack on the Pentagon spread quickly on Twitter and sent jitters through the stock market. Meanwhile, there was significant dissent within the artist community. On January 13, 2023, artists Sarah Andersen, Kelly McKernan, and Karla Ortiz filed a copyright infringement lawsuit against Stability AI, Midjourney, and DeviantArt, alleging infringement on artists’ rights by training AI tools on billions of web-scraped images without original artists’ consent.
These aren’t minor allegations. The suit is spearheaded by attorney Matthew Butterick in partnership with the Joseph Saveri Law Firm, a team that is also confronting heavyweights like Microsoft, GitHub, and OpenAI in court. As these lawsuits unfold, it’s clear not all share the enthusiasm for AI-driven media advancements.
Universal Music Group, which controls a large portion of the global music market, expressed concern as deepfake music generators became increasingly available, leading to the creation of viral tracks such as Ghostwriter’s “Heart on My Sleeve”. This AI-crafted collaboration featuring voices of Drake and The Weeknd garnered millions of views before being removed.
AI misuse concerns aren’t new, mirroring issues previously discussed surrounding large language models. Just as models like GPT-4 can generate compelling, creative text potentially used for misinformation, diffusion models can fabricate realistic images that could be misused to manipulate perception and sow confusion.
The abilities of these AI technologies have transcended innocent pranks to unsettling realities, with the rise of non-consensual pornographic imagery (often called deepfake porn) created using generative AI models. This has led to legislative actions in multiple countries.
The real-world implications of AI misuse are profound, with experts expressing grave concerns. The widespread distribution of AI-generated images, video, and audio can lead to serious consequences like personal defamation, political manipulation, and societal trust erosion. With the US elections looming, apprehensions about the potential exploitation of these technologies to influence public opinion or incite discord are intensifying.
Furthermore, there’s the copyright issue. Navigating copyright laws in a world where AI can fabricate images nearly indistinguishable from human-created art is a challenge. And what if AI produces an image that mirrors an existing copyrighted work, even if generated without direct reference to that work?
These are substantial ethical and legal challenges that society must tackle as diffusion models and similar technologies progress. They represent a double-edged sword: their remarkable capabilities could unlock new realms of creativity and innovation, but if unchecked, they could be misused to harm individuals and society.
And we must consider the societal implications tied to increased AI access. Just as smartphones and social media have forever altered our lives and culture, so too will widespread AI use. It’s crucial to engage in open, informed discussions about these issues and seek balanced, effective solutions.

As we wrap up our exploration of diffusion models, voice clones, and AI-generated videos, we’ve witnessed how AI keeps morphing and advancing, steadily eroding the line between human creativity and machine creativity. This episode builds on our prior discourse in episode 4, where we delved into the rise of large language models, their vast potential, and the ethical dilemmas they entail.
In our relentless quest to understand the ever-evolving AI landscape, we wonder, “What’s the next big thing in AI?” “How will it redefine creativity and innovation?” And, crucially, “How can we steer this transformation to benefit everyone?”
In the quest for answers, we find ourselves circling back to familiar entities, albeit in new contexts. McKinsey & Company, a consulting giant we mentioned in episode 3, emerges once again in our narrative. If you recall, we explored Ted Chiang’s intriguing New Yorker piece, “Will AI Become the New McKinsey?”, where he likened AI’s transformative power to the significant influence of management-consulting firms like McKinsey, renowned for reshaping corporate strategies.
Next time on Blueberry Thoughts, we’ll extend this discussion with McKinsey’s perspective as we delve into their June 2023 report titled “The economic potential of generative AI: The next productivity frontier”. This document presents an in-depth exploration of the swift integration of AI into our daily lives, the extensive utility of generative AI, and the urgent need for a better grasp of this technology as we navigate its implications.
There’s much to mull over and look forward to. Join us on the next episode of Blueberry Thoughts as we traverse the exciting, constantly changing terrain of AI.
Until then, this is Ivor J. Burks. Stay curious.

Sources:

From Noise to Art: Understanding Diffusion Models in Machine Learning (renaissancerachel.com)

Midjourney vs DALL-E vs Stable Diffusion: Best AI Art Player (analyticsinsight.net)

20 Midjourney Alternatives to Try in 2023 – MarkTechPost

FACT FOCUS: Fake image of Pentagon explosion briefly sends jitters through stock market | AP News

Recent developments in Generative AI for Audio (assemblyai.com)

Meta open-sources multisensory AI model that combines six types of data – The Verge

